Handbook of Signal Processing Systems
Shuvra S. Bhattacharyya • Ed F. Deprettere • Rainer Leupers • Jarmo Takala, Editors
Handbook of Signal Processing Systems Foreword by S.Y. Kung
Editors

Shuvra S. Bhattacharyya, ECE Department and UMIACS, University of Maryland, 2311 A. V. Williams Building, College Park, Maryland 20742, USA, [email protected]

Ed F. Deprettere, Leiden Institute of Advanced Computer Science, Leiden Embedded Research Center, Leiden University, Niels Bohrweg 1, 2333 CA Leiden, Netherlands, [email protected]

Rainer Leupers, RWTH Aachen University, Templergraben 55, 52056 Aachen, Germany, [email protected]

Jarmo Takala, Department of Computer Systems, Tampere University of Technology, Korkeakoulunkatu 1, 33720 Tampere, Finland, [email protected]
ISBN 978-1-4419-6344-4
e-ISBN 978-1-4419-6345-1
DOI 10.1007/978-1-4419-6345-1
Springer New York Dordrecht Heidelberg London
Library of Congress Control Number: 2010932128
© Springer Science+Business Media, LLC 2010
All rights reserved. This work may not be translated or copied in whole or in part without the written permission of the publisher (Springer Science+Business Media, LLC, 233 Spring Street, New York, NY 10013, USA), except for brief excerpts in connection with reviews or scholarly analysis. Use in connection with any form of information storage and retrieval, electronic adaptation, computer software, or by similar or dissimilar methodology now known or hereafter developed is forbidden.
The use in this publication of trade names, trademarks, service marks, and similar terms, even if they are not identified as such, is not to be taken as an expression of opinion as to whether or not they are subject to proprietary rights.
Printed on acid-free paper
Springer is part of Springer Science+Business Media (www.springer.com)
To Milu
Shuvra Bhattacharyya

To Deirdre
Ed Deprettere

To Bettina
Rainer Leupers

To Auli
Jarmo Takala
Foreword
It gives me immense pleasure to introduce this timely handbook to the research and development communities in the field of signal processing systems (SPS). It is the first of its kind and represents state-of-the-art coverage of research in this field. The driving force behind information technologies (IT) hinges critically upon major advances in both component integration and system integration. The major breakthrough for the former is undoubtedly the invention of the integrated circuit (IC) in the 1950s by Jack S. Kilby, the 2000 Nobel Laureate in Physics. In an integrated circuit, all components are made of the same semiconductor material. Beginning with the pocket calculator in 1964, many increasingly complex applications have followed. In fact, processing gates and memory storage on a chip have since grown at an exponential rate, following Moore's Law. (Moore himself admitted that Moore's Law has turned out to be more accurate, longer lasting, and deeper in impact than he ever imagined.) With greater device integration, various signal processing systems have been realized for many killer IT applications. Further breakthroughs in computer science and Internet technologies have also catalyzed large-scale system integration. All of this has led to today's IT revolution, which has had a profound impact on our lifestyle and on the overall prospects of humanity. (It is hard to imagine life today without mobile phones or the Internet!)

The success of SPS requires a well-concerted, integrated approach from multiple disciplines, such as device, design, and application. It is important to recognize that system integration means much more than simply squeezing components onto a chip; more specifically, there is a symbiotic relationship between applications and technologies. Emerging applications, e.g., 3D TV, will impose new system requirements on performance and power consumption, thus inspiring new intellectual challenges. An SPS architecture must consider overall system performance, flexibility and scalability, power/thermal management, hardware-software partitioning, and algorithm development. With greater integration, system designs become more complex, and there exists a huge gap between what can be theoretically designed and what can be practically implemented. It is critical to consider, for instance, how to deploy in concert an ever-increasing number of transistors with acceptable power consumption, and how to make hardware effective for applications and yet friendly
to the users (easy to program). In short, major advances in SPS must arise from close collaboration among application, hardware/architecture, algorithm, CAD, and system design.

The handbook offers a comprehensive and up-to-date treatment of the driving forces behind SPS, current architectures, and new design trends. It also helps seed the SPS field with innovations in applications, architectures, programming and simulation tools, and design methods. More specifically, it provides a solid foundation for several emerging technical areas, for instance: scalable, reusable, and reliable system architectures; energy-efficient high-performance architectures; IP deployment and integration; on-chip interconnects; memory hierarchies; and future cloud computing. Advances in these areas will have a major impact on future SPS technologies.

It is only fitting for Springer to produce this timely handbook. Springer has long played a major role in academic publication on SPS, much of it in close cooperation with the IEEE Signal Processing, Circuits and Systems, and Computer Societies. For nearly 20 years, I have been the Editor-in-Chief of Springer's Journal of Signal Processing Systems, considered by many to be a major forum for SPS researchers. Nevertheless, the idea has been around for years that a single-volume reference book would very effectively complement the journal in serving this technical community. Then, during the 2008 IEEE Workshop on Signal Processing Systems in Washington, D.C., Jennifer Evans from Springer and the editorial team led by Prof. Shuvra Bhattacharyya met to brainstorm the implementation of this idea. The result is this handsome volume, containing contributions from renowned pioneers and active researchers.

I congratulate the authors and editors for putting together such an outstanding handbook. It is truly a timely contribution to the exciting field of SPS, for this new decade and many decades to come.

Princeton
S.Y. Kung
Preface
This handbook provides a reference on key topics in the design, analysis, implementation, and optimization of hardware and software for signal processing systems. Plans for the handbook originated through valuable discussions with S.Y. Kung (Princeton University) and Jennifer Evans (Springer). Through these discussions, we became aware of the need for an integrated reference focusing on the efficient implementation of signal processing applications. We were then fortunate to be able to enlist the participation of a diverse group of leading experts to contribute chapters on a broad range of complementary topics. We hope that this handbook will serve as a useful reference for engineering practitioners, graduate students, and researchers working in the broad area of signal processing systems. It is also envisioned that selected chapters from the book can be used as core readings for seminar- or project-oriented graduate courses in signal processing systems. Given the wide range of topics covered in the book, instructors have significant flexibility to orient such a course towards the particular themes or levels of abstraction that they wish to emphasize.

The handbook is organized in four parts. Part I presents representative applications that drive and apply state-of-the-art methods for the design and implementation of signal processing systems; Part II discusses architectures for implementing these applications; Part III focuses on compilers and simulation tools; and Part IV describes models of computation and their associated design tools and methodologies.

We are very grateful to all of the authors for their valuable contributions, and for the time and effort they have devoted to preparing the chapters. We would also like to thank Jennifer Evans and S.Y. Kung for their encouragement of this project, which was instrumental in getting it off the ground, and Jennifer Maurer for her support and patience throughout the entire development process of the handbook.

Shuvra Bhattacharyya, Ed Deprettere, Rainer Leupers, and Jarmo Takala
Contents
Part I Applications
Managing Editor: Shuvra Bhattacharyya

Signal Processing for Control . . . . . . . . . . . . 3
William S. Levine
1 Introduction . . . . . . . . . . . . 3
2 Brief introduction to control . . . . . . . . . . . . 4
   2.1 Stability . . . . . . . . . . . . 8
   2.2 Sensitivity to disturbance . . . . . . . . . . . . 9
   2.3 Sensitivity conservation . . . . . . . . . . . . 9
3 Signal processing in control . . . . . . . . . . . . 10
   3.1 Simple cases . . . . . . . . . . . . 12
   3.2 Demanding cases . . . . . . . . . . . . 15
   3.3 Exemplary case . . . . . . . . . . . . 19
4 Conclusions . . . . . . . . . . . . 20
5 Further reading . . . . . . . . . . . . 20
References . . . . . . . . . . . . 21

Digital Signal Processing in Home Entertainment . . . . . . . . . . . . 23
Konstantinos Konstantinides
1 Introduction . . . . . . . . . . . . 23
   1.1 The Personal Computer and the Digital Home . . . . . . . . . . . . 24
2 The Digital Photo Pipeline . . . . . . . . . . . . 25
   2.1 Image Capture and Compression . . . . . . . . . . . . 25
   2.2 Focus and exposure control . . . . . . . . . . . . 26
   2.3 Pre-processing of CCD data . . . . . . . . . . . . 26
   2.4 White Balance . . . . . . . . . . . . 27
   2.5 Demosaicing . . . . . . . . . . . . 27
   2.6 Color transformations . . . . . . . . . . . . 27
   2.7 Post Processing . . . . . . . . . . . . 27
   2.8 Preview display . . . . . . . . . . . . 27
   2.9 Compression and storage . . . . . . . . . . . . 27
   2.10 Photo Post-Processing . . . . . . . . . . . . 28
   2.11 Storage and Search . . . . . . . . . . . . 28
   2.12 Digital Display . . . . . . . . . . . . 29
3 The Digital Music Pipeline . . . . . . . . . . . . 29
   3.1 Compression . . . . . . . . . . . . 29
   3.2 Broadcast Radio . . . . . . . . . . . . 30
   3.3 Audio Processing . . . . . . . . . . . . 31
4 The Digital Video Pipeline . . . . . . . . . . . . 31
   4.1 Digital Video Broadcasting . . . . . . . . . . . . 33
   4.2 New Portable Digital Media . . . . . . . . . . . . 33
   4.3 Digital Video Recording . . . . . . . . . . . . 33
   4.4 Digital Video Processing . . . . . . . . . . . . 35
   4.5 Digital Video Playback and Display . . . . . . . . . . . . 35
5 Sharing Digital Media in the Home . . . . . . . . . . . . 35
   5.1 DLNA . . . . . . . . . . . . 36
   5.2 Digital Media Server . . . . . . . . . . . . 36
   5.3 Digital Media Player . . . . . . . . . . . . 36
6 Copy Protection and DRM . . . . . . . . . . . . 37
   6.1 Microsoft Windows DRM . . . . . . . . . . . . 38
7 Summary and Conclusions . . . . . . . . . . . . 40
8 To Probe Further . . . . . . . . . . . . 41
References . . . . . . . . . . . . 41
MPEG Reconfigurable Video Coding . . . . . . . . . . . . 43
Marco Mattavelli, Jörn W. Janneck and Mickaël Raulet
1 Introduction . . . . . . . . . . . . 44
2 Requirements and rationale of the MPEG RVC framework . . . . . . . . . . . . 44
   2.1 Limits of previous monolithic specifications . . . . . . . . . . . . 46
   2.2 Reconfigurable Video Coding specification requirements . . . . . . . . . . . . 47
3 Description of the standard or normative components of the framework . . . . . . . . . . . . 48
   3.1 The toolbox library . . . . . . . . . . . . 48
   3.2 The CAL Actor Language . . . . . . . . . . . . 49
   3.3 FU Network language for the codec configurations . . . . . . . . . . . . 53
   3.4 Bitstream syntax specification language BSDL . . . . . . . . . . . . 55
   3.5 Instantiation of the ADM . . . . . . . . . . . . 57
   3.6 Case study of new and existing codec configurations . . . . . . . . . . . . 57
4 The procedures and tools supporting decoder implementations . . . . . . . . . . . . 61
   4.1 OpenDF supporting framework . . . . . . . . . . . . 61
   4.2 CAL2HDL synthesis . . . . . . . . . . . . 62
   4.3 CAL2C synthesis . . . . . . . . . . . . 64
5 Conclusion . . . . . . . . . . . . 65
References . . . . . . . . . . . . 66
Signal Processing for High-Speed Links . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 69 Naresh Shanbhag, Andrew Singer, and Hyeon-min Bae 1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 69 2 System Models . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 72 2.1 The Transmitter . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 72 2.2 The Channel . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 74 2.3 Receiver Models . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 78 2.4 Noise Models . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 83 3 Signal Processing Methods . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 85 3.1 Modulation Formats . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 85 3.2 Adaptive Equalization . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 85 3.3 Equalization of Nonlinear Channels . . . . . . . . . . . . . . . . . . 87 3.4 Maximum Likelihood Sequence Estimation (MLSE) . . . . 88 4 Case Study of an OC-192 EDC-based Receiver . . . . . . . . . . . . . . . . 91 4.1 MLSE Receiver Architecture . . . . . . . . . . . . . . . . . . . . . . . . 91 4.2 MLSE Equalizer Algorithm and VLSI Architecture . . . . . 92 4.3 Measured Results . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 94 5 Advanced Techniques . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 96 5.1 FEC for Low-power Back-Plane Links . . . . . . . . . . . . . . . . 96 5.2 Coherent Detection in Optical Links . . . . . . . . . . . . . . . . . . 98 6 Concluding Remarks . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 99 7 Acknowledgements . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 100 References . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 100 Video Compression . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 103 Yu-Han Chen and Liang-Gee Chen 1 Evolution of Video Coding Standards . . . . . . . . . . . . . . . . . . . . . . . . 103 2 Basic Components of Video Coding Systems . . . . . . . . . . . . . . . . . . 105 2.1 Color Processing . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 105 2.2 Prediction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 106 2.3 Transform and Quantization . . . . . . . . . . . . . . . . . . . . . . . . . 110 2.4 Entropy Coding . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 111 3 Emergent Video Applications and Corresponding Coding Systems 113 3.1 HDTV Applications and H.264/AVC . . . . . . . . . . . . . . . . . 113 3.2 Streaming and Surveillance Applications and Scalable Video Coding . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 115 3.3 3D Video Applications and Multiview Video Coding . . . . 117 4 Conclusion and Future Directions . . . . . . . . . . . . . . . . . . . . . . . . . . . . 119 References . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 120 Low-power Wireless Sensor Network Platforms . . . . . . . . . . . . . . . . . . . . . . 123 Jukka Suhonen, Mikko Kohvakka, Ville Kaseva, Timo D. Hämäläinen, and Marko Hännikäinen 1 Characteristics of Low-power WSNs . . . . . . . . . . . . . . . . . . . 
. . . . . . 123 1.1 Quality of Service Requirements . . . . . . . . . . . . . . . . . . . . . 125 1.2 Services for Signal Processing Application . . . . . . . . . . . . 126
2 Key Standards and Industry Specifications . . . . . . . . . . . . 127
   2.1 IEEE 1451 . . . . . . . . . . . . 127
   2.2 WSN Communication Standards . . . . . . . . . . . . 127
3 Software Components . . . . . . . . . . . . 129
   3.1 Sensor Operating Systems . . . . . . . . . . . . 130
   3.2 Middlewares . . . . . . . . . . . . 130
4 Hardware Platforms and Components . . . . . . . . . . . . 131
   4.1 Communication subsystem . . . . . . . . . . . . 131
   4.2 Computing subsystem . . . . . . . . . . . . 132
   4.3 Sensing subsystem . . . . . . . . . . . . 133
   4.4 Power subsystem . . . . . . . . . . . . 134
   4.5 Existing Platforms . . . . . . . . . . . . 135
5 Medium Access Control Features and Services . . . . . . . . . . . . 136
   5.1 MAC Technologies . . . . . . . . . . . . 136
   5.2 Unsynchronized Low Duty-Cycle MAC Protocols . . . . . . . . . . . . 137
   5.3 Synchronized Low Duty-Cycle MAC Protocols . . . . . . . . . . . . 138
   5.4 Performance . . . . . . . . . . . . 140
6 Routing and Transport . . . . . . . . . . . . 143
   6.1 Services . . . . . . . . . . . . 143
   6.2 Routing Paradigms and Technologies . . . . . . . . . . . . 144
   6.3 Hybrid Transport and Routing Protocols . . . . . . . . . . . . 147
7 Embedded WSN Services . . . . . . . . . . . . 147
   7.1 Localization . . . . . . . . . . . . 148
   7.2 Synchronization . . . . . . . . . . . . 149
8 Experiments . . . . . . . . . . . . 151
   8.1 Low-Energy TUTWSN . . . . . . . . . . . . 152
   8.2 Low-Latency TUTWSN . . . . . . . . . . . . 154
9 Summary . . . . . . . . . . . . 156
References . . . . . . . . . . . . 157

Signal Processing for Cryptography and Security Applications . . . . . . . . . . . . 161
Miroslav Knežević, Lejla Batina, Elke De Mulder, Junfeng Fan, Benedikt Gierlichs, Yong Ki Lee, Roel Maes and Ingrid Verbauwhede
1 Introduction . . . . . . . . . . . . 161
2 Efficient Implementation . . . . . . . . . . . . 162
   2.1 Secret-Key Algorithms and Implementations . . . . . . . . . . . . 162
   2.2 Public-Key Algorithms and Implementations . . . . . . . . . . . . 164
   2.3 Architecture . . . . . . . . . . . . 168
3 Secure Implementations: Side Channel Attacks . . . . . . . . . . . . 170
   3.1 DSP Techniques Used in Side Channel Analysis . . . . . . . . . . . . 171
4 Working with Fuzzy Secrets . . . . . . . . . . . . 172
   4.1 Fuzzy Secrets: Properties and Applications in Cryptography . . . . . . . . . . . . 172
   4.2 Generating Cryptographic Keys from Fuzzy Secrets . . . . . . . . . . . . 173
5 Conclusion and Future Work . . . . . . . . . . . . 174
References . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 174 High-Energy Physics . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 179 Anthony Gregerson, Michael J. Schulte, and Katherine Compton 1 Introduction to High-Energy Physics . . . . . . . . . . . . . . . . . . . . . . . . . 179 2 The Compact Muon Solenoid Experiment . . . . . . . . . . . . . . . . . . . . . 182 2.1 Physics Research at the CMS Experiment . . . . . . . . . . . . . 183 2.2 Sensors and Triggering Systems . . . . . . . . . . . . . . . . . . . . . 184 2.3 The Super Large Hadron Collider Upgrade . . . . . . . . . . . . 188 3 Characteristics of High-Energy Physics Applications . . . . . . . . . . . 189 3.1 Technical Characteristics . . . . . . . . . . . . . . . . . . . . . . . . . . . 189 3.2 Characteristics of the Development Process . . . . . . . . . . . . 192 3.3 Evolving Specifications . . . . . . . . . . . . . . . . . . . . . . . . . . . . 194 3.4 Functional Verification and Testing . . . . . . . . . . . . . . . . . . . 194 3.5 Unit Modification . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 195 4 Techniques for Managing Design Challenges . . . . . . . . . . . . . . . . . . 196 4.1 Managing Hardware Cost and Complexity . . . . . . . . . . . . . 196 4.2 Managing Variable Latency and Variable Resource Use . . 198 4.3 Scalable and Modular Designs . . . . . . . . . . . . . . . . . . . . . . . 201 4.4 Dataflow-Based Tools . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 202 4.5 Language-Independent Functional Testing . . . . . . . . . . . . . 203 4.6 Efficient Design Partitioning . . . . . . . . . . . . . . . . . . . . . . . . 204 5 Example Problems from the CMS Level-1 Trigger . . . . . . . . . . . . . . 204 5.1 Particle Identification . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 205 5.2 Jet Reconstruction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 206 6 Conclusion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 208 References . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 209 Medical Image Processing . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 213 Raj Shekhar, Vivek Walimbe, and William Plishker 1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 213 2 Medical Imaging Modalities . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 214 2.1 Radiography . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 214 2.2 Computed Tomography . . . . . . . . . . . . . . . . . . . . . . . . . . . . 214 2.3 Nuclear Medicine . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 215 2.4 Magnetic Resonance Imaging . . . . . . . . . . . . . . . . . . . . . . . 215 2.5 Ultrasound . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 216 3 Medical Applications . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 217 3.1 Imaging Pipeline . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 217 4 Image Reconstruction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 219 4.1 CT Reconstruction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 220 4.2 PET/SPECT Reconstruction . . . . . . . . . . . . . . . . . . . . . . . . 
223 4.3 Reconstruction in MRI . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 224 4.4 Ultrasound Reconstruction . . . . . . . . . . . . . . . . . . . . . . . . . . 224 5 Image Preprocessing . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 224 6 Image Segmentation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 227
6.1 Classification of Segmentation Techniques . . . . . . . . . . . . 227 6.2 Threshold-based . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 228 6.3 Deformable Model-based . . . . . . . . . . . . . . . . . . . . . . . . . . . 230 7 Image Registration . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 232 7.1 Geometry-based . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 233 7.2 Intensity-based . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 233 7.3 Transformation Models . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 234 8 Image Visualization . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 235 8.1 Surface Rendering . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 236 8.2 Volume Rendering . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 236 9 Medical Image Processing Platforms . . . . . . . . . . . . . . . . . . . . . . . . . 237 9.1 Computing Requirements . . . . . . . . . . . . . . . . . . . . . . . . . . . 237 9.2 Platforms For Accelerated Implementation . . . . . . . . . . . . 238 10 Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 239 11 Further Reading . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 240 References . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 240 Signal Processing for Audio HCI . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 243 Dmitry N. Zotkin and Ramani Duraiswami 1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 243 2 Microphone Arrays . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 244 2.1 Source Localization . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 244 2.2 Beamforming . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 246 2.3 Practical Issues . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 246 3 Audio Reproduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 247 3.1 Head-related transfer function . . . . . . . . . . . . . . . . . . . . . . . 248 3.2 Reverberation Simulation . . . . . . . . . . . . . . . . . . . . . . . . . . . 248 3.3 User Motion Tracking . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 250 3.4 Signal Processing Pipeline . . . . . . . . . . . . . . . . . . . . . . . . . . 250 4 Sample Application: Fast HRTF Measurement System . . . . . . . . . . 252 5 Sample Application: Auditory Scene Capture and Reproduction . . 256 6 Sample Application: Acoustic Field Visualization . . . . . . . . . . . . . . 260 References . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 263 Distributed Smart Cameras and Distributed Computer Vision . . . . . . . . . . 267 Marilyn Wolf and Jason Schlessman 1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 267 2 Approaches to Computer Vision . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 268 3 Early Work in Distributed Smart Cameras . . . . . . . . . . . . . . . . . . . . . 270 4 Approaches to Distributed Computer Vision . . . . . . . . . . . . . . . . . . . 270 4.1 Challenges . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 
271 4.2 Platform Architectures . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 273 4.3 Mid-Level Fusion for Gesture Recognition . . . . . . . . . . . . 275 4.4 Target-Level Fusion for Tracking . . . . . . . . . . . . . . . . . . . . 275 4.5 Calibration . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 276 4.6 Sparse Camera Networks . . . . . . . . . . . . . . . . . . . . . . . . . . . 277
5 Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 278 References . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 279 Part II Architectures Managing Editor: Jarmo Takala Arithmetic . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 283 Oscar Gustafsson and Lars Wanhammar 1 Number representation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 283 1.1 Binary representation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 284 1.2 Two’s complement representation . . . . . . . . . . . . . . . . . . . . 284 1.3 Redundant representations . . . . . . . . . . . . . . . . . . . . . . . . . . 284 1.4 Shifting and increasing the wordlength . . . . . . . . . . . . . . . . 286 1.5 Negation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 287 1.6 Finite-wordlength effects . . . . . . . . . . . . . . . . . . . . . . . . . . . 287 2 Addition . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 290 2.1 Ripple-carry addition . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 291 2.2 Carry-lookahead addition . . . . . . . . . . . . . . . . . . . . . . . . . . . 291 2.3 Carry-select and conditional sum addition . . . . . . . . . . . . . 295 2.4 Multi-operand addition . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 296 3 Multiplication . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 298 3.1 Partial product generation . . . . . . . . . . . . . . . . . . . . . . . . . . . 298 3.2 Summation structures . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 301 3.3 Vector merging adder . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 305 3.4 Multiply-accumulate . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 305 3.5 Multiplication by a constant . . . . . . . . . . . . . . . . . . . . . . . . . 305 3.6 Distributed arithmetic . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 308 4 Division . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 312 4.1 Restoring and nonrestoring division . . . . . . . . . . . . . . . . . . 313 4.2 SRT division . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 314 4.3 Speeding up division . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 314 4.4 Square root extraction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 315 5 Floating-point representation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 315 5.1 Normalized representations . . . . . . . . . . . . . . . . . . . . . . . . . 316 5.2 IEEE 754 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 316 5.3 Addition and subtraction . . . . . . . . . . . . . . . . . . . . . . . . . . . 317 5.4 Multiplication . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 318 5.5 Quantization error . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 318 6 Computation of elementary function . . . . . . . . . . . . . . . . . . . . . . . . . 319 6.1 CORDIC . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 319 6.2 Polynomial and piecewise polynomial approximations . . 321 6.3 Table-based methods . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 323 7 Further reading . . . . 
. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 324 References . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 325
Application-Specific Accelerators for Communications . . . . . . . . . . . . 329
Yang Sun, Kiarash Amiri, Michael Brogioli, and Joseph R. Cavallaro
1 Introduction . . . . . . . . . . . . 330
   1.1 Coarse Grain Versus Fine Grain Accelerator Architectures . . . . . . . . . . . . 331
   1.2 Hardware/Software Workload Partition Criteria . . . . . . . . . . . . 333
2 Hardware Accelerators for Communications . . . . . . . . . . . . 335
   2.1 MIMO Channel Equalization Accelerator . . . . . . . . . . . . 336
   2.2 MIMO Detection Accelerators . . . . . . . . . . . . 338
   2.3 Channel Decoding Accelerators . . . . . . . . . . . . 344
3 Summary . . . . . . . . . . . . 358
References . . . . . . . . . . . . 360

FPGA-based DSP . . . . . . . . . . . . 363
John McAllister
1 Introduction . . . . . . . . . . . . 363
2 FPGA Technology . . . . . . . . . . . . 365
   2.1 FPGA Fundamentals . . . . . . . . . . . . 365
   2.2 The Evolution of FPGA . . . . . . . . . . . . 366
3 State-of-the-art FPGA Technology for DSP Applications . . . . . . . . . . . . 368
   3.1 FPGA Programmable Logic Technology . . . . . . . . . . . . 369
   3.2 FPGA DSP-datapath for DSP . . . . . . . . . . . . 371
   3.3 FPGA Memory Hierarchy in DSP System Implementation . . . . . . . . . . . . 374
   3.4 Discussion . . . . . . . . . . . . 376
4 Core-Based Implementation of FPGA DSP Systems . . . . . . . . . . . . 377
   4.1 Strategy Overview . . . . . . . . . . . . 377
   4.2 Core-based Implementation of DSP Applications from Simulink . . . . . . . . . . . . 378
   4.3 Other Core-based Development Approaches . . . . . . . . . . . . 379
5 Programmable Processor-Based FPGA DSP System Design . . . . . . . . . . . . 380
   5.1 Application Specific Microprocessor Technology for FPGA DSP . . . . . . . . . . . . 381
   5.2 Reconfigurable Processor Design for FPGA . . . . . . . . . . . . 384
   5.3 Discussion . . . . . . . . . . . . 386
6 System Level Design for FPGA DSP . . . . . . . . . . . . 387
   6.1 Daedalus . . . . . . . . . . . . 387
   6.2 Compaan/LAURA . . . . . . . . . . . . 388
   6.3 Koski . . . . . . . . . . . . 388
   6.4 Requirements for FPGA DSP System Design . . . . . . . . . . . . 389
7 Summary . . . . . . . . . . . . 390
References . . . . . . . . . . . . 390

General-Purpose DSP Processors . . . . . . . . . . . . 393
Jarmo Takala
1 Introduction . . . . . . . . . . . . 393
2 Arithmetic Type . . . . . . . . . . . . 395
3 Data Path . . . . . . . . . . . . 395
3.1 Fixed-Point Data Paths . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 395 3.2 Floating-Point Data Paths . . . . . . . . . . . . . . . . . . . . . . . . . . . 402 3.3 Special Function Units . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 403 4 Memory Architecture . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 404 5 Address Generation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 409 6 Program Control . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 411 7 Conclusions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 411 8 Further Reading . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 412 References . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 412 Application Specific Instruction Set DSP Processors . . . . . . . . . . . . . . . . . . 415 Dake Liu 1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 415 1.1 ASIP Definition . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 415 1.2 Difference Between ASIP and General CPU . . . . . . . . . . . 416 2 DSP ASIP Design Flow . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 417 3 Architecture and Instruction Set Design . . . . . . . . . . . . . . . . . . . . . . . 420 3.1 Code Profiling . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 420 3.2 Using an Available Architecture . . . . . . . . . . . . . . . . . . . . . 422 3.3 Generating a Custom Architecture . . . . . . . . . . . . . . . . . . . 425 4 Instruction Set Design . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 427 4.1 Instruction Set Specification . . . . . . . . . . . . . . . . . . . . . . . . . 428 4.2 Instructions for Functional Acceleration . . . . . . . . . . . . . . . 429 4.3 Instruction Coding . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 430 4.4 Instruction and Architecture Optimization . . . . . . . . . . . . . 431 4.5 Design Automation of an Instruction Set Architecture . . . 433 5 ASIP Instruction Set for Radio Baseband Processors . . . . . . . . . . . . 434 6 ASIP Instruction Set for Video and Image Compression . . . . . . . . . 437 7 Programming Toolchain and Benchmarking . . . . . . . . . . . . . . . . . . . 439 8 ASIP Firmware Design and Benchmarking . . . . . . . . . . . . . . . . . . . . 440 8.1 Firmware Design . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 441 8.2 Benchmarking and Instruction Usage Profiling . . . . . . . . . 442 9 Microarchitecture Design for ASIP . . . . . . . . . . . . . . . . . . . . . . . . . . 444 10 Further Reading . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 446 References . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 446 Coarse-Grained Reconfigurable Array Architectures . . . . . . . . . . . . . . . . . 449 Bjorn De Sutter, Praveen Raghavan, Andy Lambrechts 1 Application Domain of Coarse-Grained Reconfigurable Arrays . . . 449 2 CGRA Basics . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 451 3 CGRA Design Space . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 453 3.1 Tight versus Loose Coupling . . . . . . . . . . . . . . . . . . . . . . . . 
454 3.2 CGRA Control . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 456 3.3 Interconnects and Register Files . . . . . . . . . . . . . . . . . . . . . 461 3.4 Computational Resources . . . . . . . . . . . . . . . . . . . . . . . . . . . 463 3.5 Memory Hierarchies . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 465
   3.6 Compiler Support . . . . . . . . . . . . 466
4 Case Study: ADRES . . . . . . . . . . . . 468
   4.1 Mapping Loops onto ADRES CGRAs . . . . . . . . . . . . 469
   4.2 ADRES Design Space Exploration . . . . . . . . . . . . 475
5 Conclusions . . . . . . . . . . . . 480
References . . . . . . . . . . . . 481
Multi-core Systems on Chip . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 485 Luigi Carro and Mateus Beck Rutzig 1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 485 2 Analytical Model . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 491 2.1 Performance Comparison . . . . . . . . . . . . . . . . . . . . . . . . . . . 492 2.2 Energy Comparison . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 496 2.3 DSP Applications on a MPSoC Organization . . . . . . . . . . 497 3 MPSoC Classification . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 500 3.1 Architecture View Point . . . . . . . . . . . . . . . . . . . . . . . . . . . . 501 3.2 Organization View Point . . . . . . . . . . . . . . . . . . . . . . . . . . . . 503 4 Successful MPSoC Designs . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 504 4.1 OMAP Platform . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 504 4.2 Samsung S3C6410/S5PC100 . . . . . . . . . . . . . . . . . . . . . . . . 506 4.3 Other MPSoCs . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 508 5 Open Problems . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 510 5.1 Interconnection Mechanism . . . . . . . . . . . . . . . . . . . . . . . . . 510 5.2 MPSoC Programming Models . . . . . . . . . . . . . . . . . . . . . . . 511 6 Future Research Challenges . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 512 References . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 513 DSP Systems using Three-Dimensional (3D) Integration Technology . . . . . 515 Tong Zhang, Yangyang Pan, and Yiran Li 1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 515 2 Overview of 3D Integration Technology . . . . . . . . . . . . . . . . . . . . . . 517 3 3D Logic-Memory Integration Paradigm for DSP Systems: Rationale and Opportunities . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 518 4 3D DRAM Design . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 521 5 Case Study I: 3D Integrated VLIW Digital Signal Processor . . . . . 523 5.1 3D DRAM Stacking in VLIW Digital Signal Processors . 523 5.2 3D DRAM L2 Cache . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 524 5.3 Performance Evaluation . . . . . . . . . . . . . . . . . . . . . . . . . . . . 526 6 Case Study II: 3D Integrated H.264 Video Encoder . . . . . . . . . . . . . 527 6.1 H.264 Video Encoding Basics . . . . . . . . . . . . . . . . . . . . . . . 528 6.2 3D DRAM Stacking in H.264 Video Encoder . . . . . . . . . . 530 6.3 Performance Evaluation . . . . . . . . . . . . . . . . . . . . . . . . . . . . 535 7 Conclusion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 538 References . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 538
Mixed Signal Techniques . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 543 Olli Vainio 1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 543 2 General Properties of Data Converters . . . . . . . . . . . . . . . . . . . . . . . . 544 2.1 Sample and Hold Circuits . . . . . . . . . . . . . . . . . . . . . . . . . . . 545 2.2 Quantization . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 547 2.3 Performance Specifications . . . . . . . . . . . . . . . . . . . . . . . . . 548 3 Analog to Digital Converters . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 549 3.1 Nyquist-Rate A/D Converters . . . . . . . . . . . . . . . . . . . . . . . 550 3.2 Oversampled A/D Converters . . . . . . . . . . . . . . . . . . . . . . . 553 4 Digital to Analog Converters . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 557 4.1 Nyquist-rate D/A Converters . . . . . . . . . . . . . . . . . . . . . . . . 558 4.2 Oversampled D/A Converters . . . . . . . . . . . . . . . . . . . . . . . 560 5 Switched-Capacitor Circuits . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 561 5.1 Amplifiers . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 562 5.2 Switched Capacitors . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 562 5.3 Integrators . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 563 5.4 First-Order Filter . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 564 5.5 Second-Order Filter . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 565 5.6 Differential SC Circuits . . . . . . . . . . . . . . . . . . . . . . . . . . . . 566 5.7 Performance Limitations . . . . . . . . . . . . . . . . . . . . . . . . . . . 567 6 Frequency Synthesis . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 567 6.1 Phase-Locked Loops . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 567 6.2 Fractional-N Frequency Synthesis . . . . . . . . . . . . . . . . . . . . 568 6.3 Delay-Locked Loops . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 569 6.4 Direct Digital Synthesis . . . . . . . . . . . . . . . . . . . . . . . . . . . . 570 References . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 571 Part III Programming and Simulation Tools Managing Editor: Rainer Leupers C Compilers and Code Optimization for DSPs . . . . . . . . . . . . . . . . . . . . . . . 575 Björn Franke 1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 575 2 C Compilers for DSPs . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 576 2.1 Industrial Context . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 576 2.2 Compiler and Language Extensions . . . . . . . . . . . . . . . . . . 577 2.3 Compiler Benchmarks . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 578 3 Code Optimizations . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 579 3.1 Address Code Optimizations . . . . . . . . . . . . . . . . . . . . . . . . 580 3.2 Control Flow Optimizations . . . . . . . . . . . . . . . . . . . . . . . . . 582 3.3 Loop Optimizations . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 585 3.4 Memory Optimizations . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 
592 3.5 Optimizations for Code Size . . . . . . . . . . . . . . . . . . . . . . . . 594 References . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 598
Compiling for VLIW DSPs . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 603 Christoph W. Kessler 1 VLIW DSP architecture concepts and resource modeling . . . . . . . . 603 2 Case study: TI ’C62x DSP processor family . . . . . . . . . . . . . . . . . . . 611 3 VLIW DSP code generation overview . . . . . . . . . . . . . . . . . . . . . . . . 615 4 Instruction selection and resource allocation . . . . . . . . . . . . . . . . . . . 617 5 Cluster assignment for clustered VLIW architectures . . . . . . . . . . . 619 6 Register allocation and generalized spilling . . . . . . . . . . . . . . . . . . . . 620 7 Instruction scheduling . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 622 7.1 Local instruction scheduling . . . . . . . . . . . . . . . . . . . . . . . . 623 7.2 Modulo scheduling for loops . . . . . . . . . . . . . . . . . . . . . . . . 624 7.3 Global instruction scheduling . . . . . . . . . . . . . . . . . . . . . . . 627 7.4 Generated instruction schedulers . . . . . . . . . . . . . . . . . . . . . 629 8 Integrated code generation for VLIW and clustered VLIW . . . . . . . 630 9 Concluding remarks . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 633 References . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 634 Software Compilation Techniques for MPSoCs . . . . . . . . . . . . . . . . . . . . . . . 639 Rainer Leupers, Weihua Sheng, Jeronimo Castrillon 1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 640 1.1 MPSoCs and MPSoC Compilers . . . . . . . . . . . . . . . . . . . . . 640 1.2 Challenges of Building MPSoC Compilers . . . . . . . . . . . . 641 2 Foundation Elements of MPSoC Compilers . . . . . . . . . . . . . . . . . . . 642 2.1 Programming Models . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 643 2.2 Granularity, Parallelism and Intermediate Representation 649 2.3 Platform Description for MPSoC compilers . . . . . . . . . . . . 656 2.4 Mapping and Scheduling . . . . . . . . . . . . . . . . . . . . . . . . . . . 659 2.5 Code Generation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 664 3 Case Studies . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 666 References . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 675 DSP Instruction Set Simulation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 679 Florian Brandner and Nigel Horspool and Andreas Krall 1 Introduction/Overview . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 679 2 Interpretive Simulation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 681 2.1 The Classic Interpreter . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 682 2.2 Threaded Code . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 683 2.3 Interpreter Optimizations . . . . . . . . . . . . . . . . . . . . . . . . . . . 685 3 Compiled Simulation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 686 3.1 Basic Principles . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 686 3.2 Simulation of Pipelined Processors . . . . . . . . . . . . . . . . . . . 688 3.3 Static Binary Translation . . . . . . . . . . . . . . . . . . . . . . . . . . . 689 4 Dynamically Compiled Simulation . . . . . . . . . . . . 
. . . . . . . . . . . . . . . 690 4.1 Basic Simulation Principles . . . . . . . . . . . . . . . . . . . . . . . . . 690 4.2 Optimizations . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 690 5 Hardware Supported Simulation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 691
5.1 Interfacing Simulators with Hardware . . . . . . . . . . . . . . . . 691 5.2 Simulation in Hardware . . . . . . . . . . . . . . . . . . . . . . . . . . . . 692 6 Generation of Instruction Set Simulators . . . . . . . . . . . . . . . . . . . . . . 694 6.1 Processor Description Languages . . . . . . . . . . . . . . . . . . . . 694 6.2 Retargetable Simulation . . . . . . . . . . . . . . . . . . . . . . . . . . . . 696 7 Related Work . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 697 References . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 702 Optimization of Number Representations . . . . . . . . . . . . . . . . . . . . . . . . . . . 707 Wonyong Sung 1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 707 2 Fixed-point Data Type and Arithmetic Rules . . . . . . . . . . . . . . . . . . 708 2.1 Fixed-Point Data Type . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 708 2.2 Fixed-point Arithmetic Rules . . . . . . . . . . . . . . . . . . . . . . . . 710 2.3 Fixed-point Conversion Examples . . . . . . . . . . . . . . . . . . . . 711 3 Range Estimation for Integer Word-length Determination . . . . . . . . 713 3.1 L1-norm based Range Estimation . . . . . . . . . . . . . . . . . . . . 714 3.2 Simulation based Range Estimation . . . . . . . . . . . . . . . . . . 714 3.3 C++ Class based Range Estimation Utility . . . . . . . . . . . . . 715 4 Floating-point to Integer C Code Conversion . . . . . . . . . . . . . . . . . . 717 4.1 Fixed-Point Arithmetic Rules in C Programs . . . . . . . . . . . 717 4.2 Expression Conversion Using Shift Operations . . . . . . . . . 719 4.3 Integer Code Generation . . . . . . . . . . . . . . . . . . . . . . . . . . . . 720 4.4 Implementation Examples . . . . . . . . . . . . . . . . . . . . . . . . . . 721 5 Word-length Optimization . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 726 5.1 Finite Word-length Effects . . . . . . . . . . . . . . . . . . . . . . . . . . 726 5.2 Fixed-point Simulation using C++ gFix Library . . . . . . . 728 5.3 Word-length Optimization Method . . . . . . . . . . . . . . . . . . . 729 5.4 Optimization Example . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 735 6 Summary and Related Works . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 736 References . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 737 Intermediate Representations for Simulation and Implementation . . . . . . 739 Jerker Bengtsson 1 Background . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 739 2 Untimed Representations . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 742 2.1 System Property Intervals . . . . . . . . . . . . . . . . . . . . . . . . . . . 742 2.2 Functions Driven by State Machines . . . . . . . . . . . . . . . . . . 746 3 Timed Representations . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 752 3.1 Job Configuration Networks . . . . . . . . . . . . . . . . . . . . . . . . . 752 3.2 IPC Graphs . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 753 3.3 Timed Configuration Graphs . . . . . . . . . . . . . . . . . . . . . . . . 758 3.4 Set of Models . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 758 3.5 Construction of Timed Configuration Graphs . . . . . . . . 
. . 762 4 Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 765 References . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 766
Embedded C for Digital Signal Processing . . . 769
Bryan E. Olivier
1 Introduction . . . 769
2 Typical DSP Architecture . . . 770
3 Fixed Point Types . . . 771
3.1 Fixed Point Sizes . . . 772
3.2 Fixed Point Constants . . . 773
3.3 Fixed Point Conversions And Arithmetical Operations . . . 774
3.4 Fixed Point Support Functions . . . 777
4 Memory Spaces . . . 778
5 Applications . . . 780
5.1 FIR Filter . . . 780
5.2 Sine Function in Fixed Point . . . 781
6 Named Registers . . . 782
7 Hardware I/O Addressing . . . 783
8 History and Future of Embedded C . . . 785
9 Conclusions . . . 786
References . . . 787

Part IV Design Methods
Managing Editor: Ed Deprettere

Signal Flow Graphs and Data Flow Graphs . . . 791
Keshab K. Parhi and Yanni Chen
1 Introduction . . . 791
2 Signal Flow Graphs . . . 792
2.1 Notation . . . 792
2.2 Transfer Function Derivation of SFG . . . 793
3 Data Flow Graphs . . . 796
3.1 Notation . . . 796
3.2 Synchronous Data Flow Graph . . . 797
3.3 Construct an Equivalent Single-rate DFG from the Multi-rate DFG . . . 797
3.4 Equivalent Data Flow Graphs . . . 799
4 Applications to Hardware Design . . . 802
4.1 Unfolding . . . 803
4.2 Folding . . . 808
5 Applications to Software Design . . . 811
5.1 Intra-iteration and Inter-iteration Precedence Constraints . . . 811
5.2 Definition of Critical Path and Iteration Bound . . . 811
5.3 Scheduling . . . 812
6 Conclusions . . . 816
References . . . 816
Systolic Arrays . . . 817
Yu Hen Hu and Sun-Yuan Kung
1 Introduction . . . 817
2 Systolic Array Computing Algorithms . . . 819
2.1 Convolution Systolic Array . . . 819
2.2 Linear System Solver Systolic Array . . . 820
2.3 Sorting Systolic Arrays . . . 822
3 Formal Systolic Array Design Methodology . . . 823
3.1 Loop Representation, Regular Iterative Algorithm (RIA), and Index Space . . . 823
3.2 Localized and Single Assignment Algorithm Formulation . . . 824
3.3 Data Dependence and Dependence Graph . . . 826
3.4 Mapping an Algorithm to a Systolic Array . . . 827
3.5 Linear Schedule and Assignment . . . 829
4 Wavefront Array Processors . . . 831
4.1 Synchronous versus Asynchronous Global On-chip Communication . . . 831
4.2 Wavefront Array Processor Architecture . . . 832
4.3 Mapping Algorithms to Wavefront Arrays . . . 832
4.4 Example: Wavefront Processing for Matrix Multiplication . . . 833
4.5 Comparison of Wavefront Arrays against Systolic Arrays . . . 835
5 Hardware Implementations of Systolic Array . . . 835
5.1 Warp and iWARP . . . 835
5.2 SAXPY Matrix-1 . . . 837
5.3 Transputer . . . 839
5.4 TMS 32040 . . . 840
6 Recent Developments and Real World Applications . . . 841
6.1 Block Motion Estimation . . . 841
6.2 Wireless Communication . . . 844
7 Conclusions . . . 847
References . . . 848

Decidable Signal Processing Dataflow Graphs: Synchronous and Cyclo-Static Dataflow Graphs . . . 851
Soonhoi Ha and Hyunok Oh
1 Introduction . . . 851
2 SDF (Synchronous Dataflow) . . . 852
2.1 Static Analysis . . . 853
2.2 Software Synthesis from SDF graph . . . 855
2.3 Static Scheduling Techniques . . . 858
3 Cyclo-Static Data Flow (CSDF) . . . 860
3.1 Static Analysis . . . 861
3.2 Static Scheduling and Buffer Size Reduction . . . 862
3.3 Hierarchical Composition . . . 863
4 Other Decidable Dataflow Models . . . 864
4.1 FRDF (Fractional Rate Data Flow) . . . 864
4.2 SPDF (Synchronous Piggybacked Data Flow) . . . 868
4.3 SSDF (Scalable SDF) . . . 871
References . . . 873

Mapping Decidable Signal Processing Graphs into FPGA Implementations . . . 875
Roger Woods
1 Introduction . . . 875
1.1 Chapter breakdown . . . 876
2 FPGA hardware platforms . . . 877
2.1 FPGA logic blocks . . . 878
2.2 FPGA DSP functionality . . . 879
2.3 FPGA memory organization . . . 880
2.4 FPGA design strategy . . . 881
3 Circuit architecture derivation . . . 882
3.1 Basic mapping of DSP functionality to FPGA . . . 883
3.2 Retiming . . . 884
3.3 Cut-set theorem . . . 885
3.4 Application to recursive structures . . . 888
4 Circuit architecture optimization . . . 890
4.1 Folding . . . 890
4.2 Unfolding . . . 892
4.3 Adaptive LMS filter example . . . 892
4.4 Towards FPGA-based IP core generation . . . 894
5 Conclusions . . . 895
5.1 Incorporation of FPGA cores into data flow based systems . . . 896
5.2 Application of techniques for low power FPGA . . . 896
References . . . 897

Dynamic and Multidimensional Dataflow Graphs . . . 899
Shuvra S. Bhattacharyya, Ed F. Deprettere, and Joachim Keinert
1 Multidimensional synchronous dataflow graphs (MDSDF) . . . 899
1.1 Basics . . . 900
1.2 Arbitrary sampling . . . 901
1.3 A complete example . . . 904
1.4 Initial tokens on arcs . . . 905
2 Windowed synchronous/cyclo-static dataflow . . . 906
2.1 Windowed synchronous/cyclo-static data flow graphs . . . 907
2.2 Balance equations . . . 909
3 Motivation for dynamic DSP-oriented dataflow models . . . 912
4 Boolean dataflow . . . 914
5 The Stream-based Function Model . . . 916
5.1 The concern for an actor model . . . 916
5.2 The stream-based function actor model . . . 917
5.3 The formal model . . . 918
5.4 Communication and scheduling . . . 919
5.5 Composition of SBF actors . . . 920
6 CAL . . . 921
7 Parameterized dataflow . . . 923
8 Enable-invoke dataflow . . . 926
9 Summary . . . 928
References . . . 929

Polyhedral Process Networks . . . 931
Sven Verdoolaege
1 Introduction . . . 931
2 Overview . . . 932
3 Polyhedral Concepts . . . 934
3.1 Polyhedral Sets and Relations . . . 934
3.2 Lexicographic Order . . . 937
3.3 Polyhedral Models . . . 937
3.4 Piecewise Quasi-Polynomials . . . 939
3.5 Polyhedral Process Networks . . . 939
4 Polyhedral Analysis Tools . . . 940
4.1 Parametric Integer Programming . . . 941
4.2 Emptiness Check . . . 942
4.3 Parametric Counting . . . 942
4.4 Computing Parametric Upper Bounds . . . 943
4.5 Polyhedral Scanning . . . 944
5 Dataflow Analysis . . . 946
5.1 Standard Dataflow Analysis . . . 946
5.2 Reuse . . . 949
6 Channel Types . . . 950
6.1 FIFOs and Reordering Channels . . . 950
6.2 Multiplicity . . . 951
6.3 Internal Channels . . . 953
7 Scheduling . . . 954
7.1 Two Processes . . . 955
7.2 More than Two Processes . . . 956
7.3 Blocking Writes . . . 957
7.4 Linear Transformations . . . 959
8 Buffer Size Computation . . . 961
8.1 FIFOs . . . 961
8.2 Reordering Channels . . . 962
8.3 Accuracy of the Buffer Sizes . . . 964
9 Summary . . . 964
References . . . 964
Kahn Process Networks and a Reactive Extension . . . 967
Marc Geilen and Twan Basten
1 Introduction . . . 968
2 Denotational Semantics . . . 971
3 Operational Semantics . . . 973
3.1 Labeled transition systems . . . 974
3.2 Operational Semantics . . . 977
4 The Kahn Principle . . . 979
5 Analyzability Results . . . 981
6 Implementing Kahn Process Networks . . . 982
6.1 Implementing Atomic Processes . . . 982
6.2 Correctness Criteria . . . 982
6.3 Run-time Scheduling and Buffer Management . . . 984
7 Extensions of KPN . . . 987
8 Reactive Process Networks . . . 990
8.1 Introduction . . . 990
8.2 Design Considerations of RPN . . . 994
8.3 Operational Semantics of RPN . . . 996
8.4 Implementation Issues . . . 998
8.5 Analyzable Models Embedded in RPN . . . 1000
9 Bibliography . . . 1000
References . . . 1003

Methods and Tools for Mapping Process Networks onto Multi-Processor Systems-On-Chip . . . 1007
Iuliana Bacivarov, Wolfgang Haid, Kai Huang, and Lothar Thiele
1 Introduction . . . 1007
2 KPN Design Flows for Multiprocessor Systems . . . 1009
3 Methods . . . 1011
3.1 System Specification . . . 1012
3.2 System Synthesis . . . 1013
3.3 Performance Analysis . . . 1014
3.4 Design Space Exploration . . . 1017
4 Specification, Synthesis, Analysis, and Optimization in DOL . . . 1019
4.1 Distributed Operation Layer . . . 1019
4.2 System Specification . . . 1020
4.3 System Synthesis . . . 1022
4.4 Performance Analysis . . . 1026
4.5 Design Space Exploration . . . 1030
4.6 Results of the DOL Framework . . . 1033
5 Concluding Remarks . . . 1036
References . . . 1037
Integrated Modeling using Finite State Machines and Dataflow Graphs . . . 1041
Joachim Falk, Joachim Keinert, Christian Haubelt, Jürgen Teich, and Christian Zebelein
1 Introduction . . . 1041
2 Modeling Approaches . . . 1042
2.1 *charts . . . 1042
2.2 The California Actor Language . . . 1046
2.3 Extended Codesign Finite State Machines . . . 1049
2.4 SysteMoC . . . 1053
3 Design Methodologies . . . 1057
3.1 Ptolemy II . . . 1057
3.2 The OpenDF Design Flow . . . 1058
3.3 SystemCoDesigner . . . 1059
4 Integrated Finite State Machines for Multidimensional Data Flow . . . 1064
4.1 Array-OL . . . 1065
4.2 Windowed Data Flow . . . 1069
4.3 Exploiting Knowledge about the Model of Computation . . . 1073
References . . . 1074

Index . . . 1077
List of Contributors
Kiarash Amiri, Rice University, 6100 Main Street, Houston, TX 77005, USA, e-mail: [email protected]
Iuliana Bacivarov, ETH Zurich, Computer Engineering and Networks Laboratory, CH-8092 Zurich, Switzerland, e-mail: [email protected]
Hyeon-min Bae, KAIST, Daejeon, South Korea, e-mail: [email protected]
Twan Basten, Eindhoven University of Technology, and Embedded Systems Institute, Den Dolech 2, 5612AZ, Eindhoven, The Netherlands, e-mail: [email protected]
Lejla Batina, Katholieke Universiteit Leuven, ESAT/COSIC and IBBT, Kasteelpark Arenberg 10, B-3001 Leuven-Heverlee, Belgium, e-mail: [email protected]
Jerker Bengtsson, Centre for Research on Embedded Systems, Halmstad University, Box 823, 301 18 Halmstad, Sweden, e-mail: [email protected]
Shuvra S. Bhattacharyya, University of Maryland, Dept. of Electrical and Computer Engineering, and Institute for Advanced Computer Studies, College Park MD 20742, e-mail: [email protected]
Florian Brandner, Institut für Computersprachen, Technische Universität Wien, Argentinierstraße 8, A-1040 Wien, Austria, e-mail: [email protected]
Michael Brogioli, Freescale Semiconductor Inc., 7700 W. Parmer Lane, MD: PL63, Austin, TX 78729, USA, e-mail: [email protected]
Luigi Carro, Federal University of Rio Grande do Sul, Bento Gonçalves 9500, Porto Alegre, RS, Brazil, e-mail: [email protected]
Jeronimo Castrillon, RWTH Aachen University, Software for Systems on Silicon, Templergraben 55, 52056 Aachen, Germany, e-mail: [email protected]
Joseph R. Cavallaro, Rice University, 6100 Main Street, Houston, TX 77005, USA, e-mail: [email protected]
Liang-Gee Chen, Graduate Institute of Electronics Engineering and Department of Electrical Engineering, National Taiwan University, No. 1 Sec. 4 Roosevelt Rd., Taipei 10617, Taiwan, R.O.C., e-mail: [email protected]
Yanni Chen, Marvell Semiconductor Incorporation, 5488 Marvell Lane, Santa Clara, CA 95054, USA, e-mail: [email protected]
Yu-Han Chen, Graduate Institute of Electronics Engineering and Department of Electrical Engineering, National Taiwan University, No. 1 Sec. 4 Roosevelt Rd., Taipei 10617, Taiwan, R.O.C., e-mail: [email protected]
Katherine Compton, University of Wisconsin, Madison, WI, e-mail: [email protected]
Elke De Mulder, Katholieke Universiteit Leuven, ESAT/COSIC and IBBT, Kasteelpark Arenberg 10, B-3001 Leuven-Heverlee, Belgium, e-mail: [email protected]
Ed F. Deprettere, Leiden University, Leiden Institute of Advanced Computer Science, Niels Bohrweg 1, 2333 CA Leiden, The Netherlands, e-mail: [email protected]
Bjorn De Sutter, Computer Systems Lab, Ghent University, Sint-Pietersnieuwstraat 41, 9000 Gent, Belgium, and Department of Electronics and Informatics, Vrije Universiteit Brussel, Pleinlaan 2, 1050 Brussel, Belgium, e-mail: [email protected]
Ramani Duraiswami, Department of Computer Science and Institute for Advanced Computer Studies (UMIACS), University of Maryland, College Park MD 20742 USA, e-mail: [email protected]
Joachim Falk, University of Erlangen-Nuremberg, Hardware-Software-Co-Design, Am Weichselgarten 3, 91058 Erlangen, Germany, e-mail: [email protected]
Junfeng Fan, Katholieke Universiteit Leuven, ESAT/COSIC and IBBT, Kasteelpark Arenberg 10, B-3001 Leuven-Heverlee, Belgium, e-mail: [email protected]
Björn Franke, Institute for Computing Systems Architecture, School of Informatics, University of Edinburgh, 10 Crichton Street, Edinburgh, EH8 9AB, United Kingdom, e-mail: [email protected]
Marc Geilen, Eindhoven University of Technology, Den Dolech 2, 5612AZ, Eindhoven, The Netherlands, e-mail: [email protected]
Benedikt Gierlichs, Katholieke Universiteit Leuven, ESAT/COSIC and IBBT, Kasteelpark Arenberg 10, B-3001 Leuven-Heverlee, Belgium, e-mail: [email protected]
Anthony Gregerson, University of Wisconsin, Madison, WI, e-mail: [email protected]
Oscar Gustafsson, Division of Electronics Systems, Department of Electrical Engineering, Linköping University, SE-581 83 Linköping, Sweden, e-mail: [email protected]
Soonhoi Ha, Seoul National University, 599 Gwanak-ro, Gwanak-gu, Seoul 151-744, Korea, e-mail: [email protected]
Wolfgang Haid, ETH Zurich, Computer Engineering and Networks Laboratory, CH-8092 Zurich, Switzerland, e-mail: [email protected]
Timo D. Hämäläinen, Department of Computer Systems, Tampere University of Technology, P.O. Box 553, FI-33101 Tampere, Finland, e-mail: [email protected]
Marko Hännikäinen, Department of Computer Systems, Tampere University of Technology, P.O. Box 553, FI-33101 Tampere, Finland, e-mail: [email protected]
Christian Haubelt, University of Erlangen-Nuremberg, Hardware-Software-Co-Design, Am Weichselgarten 3, 91058 Erlangen, Germany, e-mail: [email protected]
Nigel Horspool, Department of Computer Science, University of Victoria, Victoria, BC V8W 3P6, Canada, e-mail: [email protected]
Yu Hen Hu, University of Wisconsin - Madison, Dept. Electrical and Computer Engineering, 1415 Engineering Drive, Madison, WI 53706, e-mail: [email protected]
Kai Huang, ETH Zurich, Computer Engineering and Networks Laboratory, CH-8092 Zurich, Switzerland, e-mail: [email protected]
Jörn W. Janneck, Department of Automatic Control, Lund University, SE-221 00 Lund, Sweden, e-mail: [email protected]
Ville Kaseva, Department of Computer Systems, Tampere University of Technology, P.O. Box 553, FI-33101 Tampere, Finland, e-mail: [email protected]
Joachim Keinert, Fraunhofer Institute for Integrated Circuits IIS, Am Wolfsmantel 33, 91058 Erlangen, Germany, e-mail: [email protected]
Christoph W. Kessler, Dept. of Computer and Information Science, Linköping University, 58183 Linköping, Sweden, e-mail: [email protected]
Miroslav Knezevic, Katholieke Universiteit Leuven, ESAT/COSIC and IBBT, Kasteelpark Arenberg 10, B-3001 Leuven-Heverlee, Belgium, e-mail: [email protected]
Mikko Kohvakka, Department of Computer Systems, Tampere University of Technology, P.O. Box 553, FI-33101 Tampere, Finland, e-mail: [email protected]
Konstantinos Konstantinides, e-mail: [email protected]
Andreas Krall, Institut für Computersprachen, Technische Universität Wien, Argentinierstraße 8, A-1040 Wien, Austria, e-mail: [email protected]
S. Y. Kung, Princeton University, e-mail: [email protected]
Andy Lambrechts, IMEC, Kapeldreef 75, 3001 Heverlee, Belgium, e-mail: [email protected]
Yong Ki Lee, University of California, Los Angeles, Electrical Engineering, 420 Westwood Plaza, Los Angeles, CA 90095-1594, USA, e-mail: [email protected]
Rainer Leupers, RWTH Aachen University, Software for Systems on Silicon, Templergraben 55, 52056 Aachen, Germany, e-mail: [email protected]
William S. Levine, Department of ECE, University of Maryland, College Park, MD 20742, USA, e-mail: [email protected]
Yiran Li, Rensselaer Polytechnic Institute, 110 8th street, Troy, NY 12180, USA, e-mail: [email protected]
Dake Liu, Department of Electrical Engineering, Linköping University, 58183, Linköping, Sweden, e-mail: [email protected]
Roel Maes, Katholieke Universiteit Leuven, ESAT/COSIC and IBBT, Kasteelpark Arenberg 10, B-3001 Leuven-Heverlee, Belgium, e-mail: [email protected]
Marco Mattavelli, GR-SCI-MM, EPFL, CH-1015 Lausanne, Switzerland, e-mail: [email protected]
John McAllister, Institute of Electronics, Communications and Information Technology (ECIT), Queen's University Belfast, Northern Ireland Science Park, Queen's Road, Queen's Island, Belfast BT 9DT, UK, e-mail: [email protected]
Hyunok Oh, Hanyang University, Haengdangdong, Seongdonggu, Seoul 133-791, Korea, e-mail: [email protected]
Bryan E. Olivier, ACE Associated Compiler Experts bv., De Ruyterkade 113, 1011 AB Amsterdam, The Netherlands, e-mail: [email protected]
Yangyang Pan, Rensselaer Polytechnic Institute, 110 8th street, Troy, NY 12180, USA, e-mail: [email protected]
Keshab K. Parhi, University of Minnesota, Dept. of Elect. & Comp. Eng., Minneapolis, MN 55455, USA, e-mail: [email protected]
William Plishker, Department of Diagnostic Radiology and Nuclear Medicine, University of Maryland, 22 S. Greene St., Baltimore, MD 21201, USA, and Department of Electrical and Computer Engineering, University of Maryland, 1423 AV Williams, College Park, MD 20742, USA, e-mail: [email protected]
Praveen Raghavan, IMEC, Kapeldreef 75, 3001 Heverlee, Belgium, e-mail: [email protected]
Mickaël Raulet, IETR/INSA Rennes, F-35708, Rennes, France, e-mail: [email protected]
Mateus Rutzig, Federal University of Rio Grande do Sul, Bento Gonçalves 9500, Porto Alegre, RS, Brazil, e-mail: [email protected]
Jason Schlessman, Department of Electrical Engineering, Princeton University, Engineering Quad, Princeton, NJ 08544, USA, e-mail: [email protected]
Michael Schulte, University of Wisconsin, Madison, WI, e-mail: [email protected]
Naresh Shanbhag, University of Illinois at Urbana-Champaign, Urbana, e-mail: [email protected]
Raj Shekhar, Department of Diagnostic Radiology and Nuclear Medicine, University of Maryland, 22 S. Greene St., Baltimore, MD 21201, USA, e-mail: [email protected]
Weihua Sheng, RWTH Aachen University, Software for Systems on Silicon, Templergraben 55, 52056 Aachen, Germany, e-mail: [email protected]
Andrew Singer, University of Illinois at Urbana-Champaign, Urbana, e-mail: [email protected]
Jukka Suhonen, Department of Computer Systems, Tampere University of Technology, P.O. Box 553, FI-33101 Tampere, Finland, e-mail: [email protected]
Yang Sun, Rice University, 6100 Main Street, Houston, TX 77005, USA, e-mail: [email protected]
Wonyong Sung, School of Electrical Engineering, Seoul National University, 599 Gwanangno, Gwanak-gu, Seoul, 151-744, Republic Of Korea, e-mail: [email protected]
Jarmo Takala, Department of Computer Systems, Tampere University of Technology, P.O. Box 553, FI-33101 Tampere, Finland, e-mail: [email protected]
Jürgen Teich, University of Erlangen-Nuremberg, Hardware-Software-Co-Design, Am Weichselgarten 3, 91058 Erlangen, Germany, e-mail: [email protected]
Lothar Thiele, ETH Zurich, Computer Engineering and Networks Laboratory, CH-8092 Zurich, Switzerland, e-mail: [email protected]
Olli Vainio, Department of Computer Systems, Tampere University of Technology, P.O. Box 553, FI-33101 Tampere, Finland, e-mail: [email protected]
Ingrid Verbauwhede, Katholieke Universiteit Leuven, ESAT/COSIC and IBBT, Kasteelpark Arenberg 10, B-3001 Leuven-Heverlee, Belgium, e-mail: [email protected]
Sven Verdoolaege, Katholieke Universiteit Leuven, Department of Computer Science, Celestijnenlaan 200A, B-3001 Leuven, Belgium, e-mail: [email protected]
Vivek Walimbe, GE Healthcare, 3000 N.Grandview Blvd., W-622, Waukesha, WI 53188, USA, e-mail: [email protected]
Lars Wanhammar, Division of Electronics Systems, Department of Electrical Engineering, Linköping University, SE-581 83 Linköping, Sweden, e-mail: [email protected]
Marilyn Wolf, School of Electrical and Computer Engineering, Georgia Institute of Technology, 777 Atlantic Drive NW, Atlanta, GA 30332-0250, USA, e-mail: [email protected]
Roger Woods, ECIT, Queens University Belfast, Queen's Island, Queen's Road, Belfast, BT3 9DT, UK, e-mail: [email protected]
Christian Zebelein, University of Erlangen-Nuremberg, Hardware-Software-Co-Design, Am Weichselgarten 3, 91058 Erlangen, Germany, e-mail: [email protected]
Tong Zhang, Rensselaer Polytechnic Institute, 110 8th street, Troy, NY 12180, USA, e-mail: [email protected]
Dmitry N. Zotkin, Institute for Advanced Computer Studies (UMIACS), University of Maryland, College Park MD 20742 USA, e-mail: [email protected]
Part I
Applications
Managing Editor: Shuvra Bhattacharyya
Signal Processing for Control
William S. Levine
Abstract Signal processing and control are closely related. In fact, many controllers can be viewed as a special kind of signal processor that converts an exogenous input signal and a feedback signal into a control signal. Because the controller exists inside a feedback loop, it is subject to constraints and limitations that do not apply to other signal processors. A well-known example is that a stable controller in series with a stable plant can, because of the feedback, result in an unstable closed-loop system. Further constraints arise because the control signal drives a physical actuator that has limited range. The complexity of the signal processing in a control system is often quite low, as is illustrated by the Proportional + Integral + Derivative (PID) controller. Model predictive control is described as an exemplar of controllers with very demanding signal processing. ABS brakes are used to illustrate the possibilities for improved controller capability created by digital signal processing. Finally, suggestions for further reading are included.
1 Introduction

There is a close relationship between signal processing and control. For example, a large amount of classical feedback control theory was developed by people working for Bell Laboratories who were trying to solve problems with the amplifiers used for signal processing in the telephone system. This emphasizes that feedback, which is a central concept in control, is also a very important technique in signal processing. Conversely, the Kalman filter, which is now a critical component in many signal processing systems, was discovered by people working on control systems. In fact, the state space approach to the analysis and design of linear systems grew out of
research and development related to control problems that arose in the American space program.
2 Brief introduction to control

The starting point for control is a system with inputs and outputs as illustrated schematically in Fig. 1.

Fig. 1 A generic system to be controlled, called the plant. d is a vector of exogenous inputs; u is a vector of control inputs; y is a vector of measured outputs, available for feedback; and z is a vector of outputs that are not available for feedback.

It is helpful to divide the inputs into those that are available for control and those that come from some source over which we have no control. They are conventionally referred to as control inputs and exogenous inputs. Similarly, it is convenient to divide the outputs into those that we care about, but are not available for feedback, and those that are measured and available for feedback. A good example of this is a helicopter. Rotorcraft have four control inputs: collective pitch, variable pitch, tail rotor pitch, and thrust. The wind is an important exogenous input. Wind gusts will push the aircraft around, and an important role of the controller is to mitigate this. Modern rotorcraft include a stability and control augmentation system (SCAS). This is a controller that, as its name suggests, makes it easier for the pilot to fly the aircraft. The main interest is in the position and velocity of the aircraft. In three dimensions, there are three such positions and three velocities. Without the SCAS, the pilot would control these six signals indirectly by using the four control inputs to directly control the orientation of the aircraft (commonly described as the roll, pitch, and yaw angles) and the rate of change of this orientation (the three angular velocities). Thus, there is a six-component vector of angular orientations and angular velocities that are used by the SCAS. These are the feedback signals. There is also a six-component vector of positions and velocities that are of interest but that are not used by the SCAS. While there are ways to measure these signals in a
rotorcraft, in many situations there are signals of interest that cannot be measured in real time. The SCAS-controlled system is again a system to be controlled. There are two different control systems that one might employ for the combined aircraft and SCAS. One is a human pilot. The other is an autopilot. In both cases, one can identify inputs for control as well as exogenous inputs. After all, the wind still pushes the aircraft around. In both cases, there are some outputs that are available for feedback and others that are important but not fed back. It should be clear from the example that once one has a system to be controlled there is some choice over what output signals are to be used by the controller directly. That is, one can, to some extent, choose which signals are to be fed back. One possible choice is to feed back no signals. The resulting controller is referred to as an open-loop controller. Open-loop control has received relatively little attention in the literature. However, it is an important option in cases where any measurements of the output signals would be impossible or very inaccurate. A good example of this, until very recently, was the control of washing machines. Because it was impossible to measure how clean the dishes or clothing inside the machine were, the controller simply washed for a fixed period of time. Now some such machines measure the turbidity of the waste water as an indirect measurement of how clean the wash is and use this as feedback to determine when to stop washing. There is also a possibility of feed-forward control. The idea here is to measure some of the exogenous inputs and use this information to control the system. For example, one could measure some aspects, say the pH, of the input to a sewage treatment system. This input is normally uncontrollable, i.e., exogenous. Knowing the input pH obviously facilitates controlling the output pH. Note that a feed-forward controller is different from an open-loop controller in that the former is based on a measurement and the latter is not. The fact remains that most control systems utilize feedback. This is because feedback controls have several significant advantages over open-loop controls. First, the designer and user of the controlled system does not need to know nearly as much about the system as he or she would need for an open-loop control. Second, the controlled system is much less sensitive to variability of the plant and other disturbances. This is well illustrated by what is probably the oldest, most common, and simplest feedback control system in the world: the float valve level control in the toilet tank. Imagine controlling the level of water in the tank by means of an open-loop control. The user, in order to refill the tank after a flush, would have to open the valve to allow water to flow into the tank, wait just the right amount of time for the tank to fill, and then shut off the flow of water. Looking into the tank would be cheating because that would introduce feedback; the user would, in effect, measure the water level. Many errors, such as turning off the flow too late, opening the input valve too much, a large stone in the tank, or too small a prior flush, would cause the tank to overflow. Automation would not solve this. Contrast the feedback solution. Once the tank has flushed, the water level is too low. This lowers the float and opens the input valve. Water flows into the tank,
thereby raising the water level and the float, until the float gets high enough to close the input valve. None of the open-loop disasters can occur. The closed-loop system corrects for variations in the size and shape of the tank and in the initial water level. It will even maintain the desired water level in the presence of a modest leak in the tank. Furthermore, the closed-loop system does not require any action by the user. It responds automatically to the water level. However, feedback, if improperly applied, can introduce problems. It is well known that the feedback interconnection of a perfectly stable plant with a perfectly stable controller can produce an unstable closed-loop system. It is not so well known, but true, that feedback can increase the sensitivity of the closed-loop system to disturbances. In order to explain these potential problems it is first necessary to introduce some fundamentals of feedback controller design. The first step in designing a feedback control system is to obtain as much information as possible about the plant, i.e., the system to be controlled. There are three very different situations that can result from this study of the plant. First, in many cases very little is known about the plant. This is not uncommon, especially in the process industry, where the plant is often nonlinear, time-varying, very complicated, and where models based on first principles are imprecise and inaccurate. Good examples of this are the plants that convert crude oil into a variety of products ranging from gasoline to plastics, those that make paper, those that process metals from smelting to rolling, those that manufacture semiconductors, sewage treatment plants, and electric power plants. Second, in some applications a good nonparametric model of the plant is known. This was the situation in the telephone company in the 1920's and 30's. Good frequency responses were known for the linear amplifiers used in transmission lines. Today, one would say that accurate Bode plots were known. These plots are a precise and detailed mathematical model of the plant and are sufficient for a good feedback controller to be designed. They are nonparametric because the description consists of experimental data alone. Third, it is often possible to obtain a detailed and precise mathematical model of the plant. For example, good mathematical descriptions of many space satellites can be derived by means of Newtonian mechanics. Simple measurements then complete the mathematical description. This is the easiest situation to describe and analyze so it is discussed in much more detail here. Nonetheless, the constraints and limitations described below apply to all feedback control systems. In the general case, where the plant is nonlinear, this description takes the form of a state-space ordinary differential equation:

\dot{x}(t) = f(x(t), u(t), d(t))    (1)
y(t) = g(x(t), u(t), d(t))    (2)
z(t) = h(x(t), u(t), d(t))    (3)
In this description, x(t) is the n-dimensional state vector, u(t) is the m-dimensional input vector (the control input), d(t) is the exogenous input, y(t) is the p-dimensional
measured output (available for feedback), and z(t) is the vector of outputs that are not available for feedback. The dimensions of the d(t) and z(t) vectors will not be needed in this article. In the linear time-invariant (LTI) case, this becomes

\dot{x}(t) = Ax(t) + Bu(t) + Ed(t)    (4)
y(t) = Cx(t) + Du(t) + Fd(t)    (5)
z(t) = Hx(t) + Gu(t) + Rd(t)    (6)
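To make Equations (4)-(6) concrete, the sketch below simulates a two-state LTI plant with forward-Euler integration. Python and numpy are assumptions of this illustration (the chapter prescribes no tooling); the matrices are a state-space realization of the transfer function Gp(s) = 3/(s^2 + 4s + 3) that reappears in the stability example of Section 2.1, and the step size and horizon are arbitrary choices.

import numpy as np

# Hypothetical realization of Gp(s) = 3/(s^2 + 4s + 3)
A = np.array([[0.0, 1.0],
              [-3.0, -4.0]])   # poles at s = -1 and s = -3 (stable)
B = np.array([[0.0], [3.0]])
E = np.array([[0.0], [1.0]])   # illustrative disturbance input matrix
C = np.array([[1.0, 0.0]])

dt, T = 1e-3, 5.0
x = np.zeros((2, 1))
u = np.array([[1.0]])          # unit-step control input
d = np.array([[0.0]])          # no disturbance in this run
for _ in range(int(T / dt)):
    x = x + dt * (A @ x + B @ u + E @ d)   # forward-Euler step of Eq. (4)
y = C @ x                                  # Eq. (5) with D = F = 0
print(y[0, 0])                 # approaches the DC gain Gp(0) = 1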
Another description is often used in the LTI case because of its many useful features. It is equivalent to the description above and is obtained by simply taking Laplace transforms of Equations (4), (5), and (6) and then doing some algebraic manipulation to eliminate X(s) from the equations:

\begin{bmatrix} Y(s) \\ Z(s) \end{bmatrix} = \begin{bmatrix} G_{11}(s) & G_{12}(s) \\ G_{21}(s) & G_{22}(s) \end{bmatrix} \begin{bmatrix} U(s) \\ D(s) \end{bmatrix}    (7)

The capital letters denote Laplace transforms of the corresponding lower case letters and s denotes the complex variable argument of the Laplace transform. Note that setting s = jω makes this description identical to the nonparametric description of case 2 above when the plant is asymptotically stable, and extends the frequency domain characterization (Bode plot) of the plant to the unstable case as well. The issues of closed-loop stability and sensitivity can be easily demonstrated with a simplified version of Equation (7) above. All signals will be of interest, but it will not be necessary to display this explicitly, so ignore Z(s) and let all other signals be scalars. For simplicity, let G12(s) = 1. Let G11(s) = Gp(s) and let Gc(s) denote the dynamical part of the controller. Finally, suppose that the objective of the controller is to make Y(s) ≈ R(s) as closely as possible, where R(s) is some exogenous input to the controller. It is generally impossible to make Y(s) = R(s) ∀s. The resulting closed-loop system will be, in block diagram form, as shown in Fig. 2.
Fig. 2 Simple feedback connection (block diagram with signals R(s), E(s), U(s), D(s), and Y(s), and blocks Gc(s) and Gp(s)).
Simple algebra then leads to

Y(s) = \frac{G_p(s)G_c(s)}{1 + G_p(s)G_c(s)} R(s) + \frac{1}{1 + G_p(s)G_c(s)} D(s)    (8)

U(s) = \frac{G_c(s)}{1 + G_p(s)G_c(s)} R(s) - \frac{G_c(s)}{1 + G_p(s)G_c(s)} D(s)    (9)

It is conventional in control theory to define the two transfer functions appearing in Equation (8) as follows. The transfer function in Equation (9) is also defined below.

T(s) = \frac{G_p(s)G_c(s)}{1 + G_p(s)G_c(s)}    (10)

S(s) = \frac{1}{1 + G_p(s)G_c(s)}    (11)

V(s) = \frac{G_c(s)}{1 + G_p(s)G_c(s)}    (12)
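The closed-loop algebra above is easy to check mechanically. A minimal sketch using sympy (an assumed tool, not one the chapter uses) forms T(s), S(s), and V(s) for an illustrative plant and controller (the same ones used in the stability example below) and confirms the identity T(s) + S(s) = 1 stated later as Equation (16).

import sympy as sp

s, k = sp.symbols('s k')
Gp = 3 / (s**2 + 4*s + 3)      # illustrative plant (also used in Section 2.1)
Gc = k / (s + 2)               # illustrative controller
L = Gp * Gc                    # open-loop transfer function
T = sp.simplify(L / (1 + L))   # Eq. (10)
S = sp.simplify(1 / (1 + L))   # Eq. (11)
V = sp.simplify(Gc / (1 + L))  # Eq. (12)
print(sp.simplify(T + S))      # prints 1, anticipating Eq. (16)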
Now that a complete description of a simple feedback control situation is available, it is straightforward to demonstrate some of the unique constraints on Gc(s), the signal processor that is used to make the closed-loop system track R(s).
2.1 Stability

A simple example demonstrating that the feedback interconnection of asymptotically stable and thoroughly well-behaved systems can result in an unstable closed-loop system is shown below. This example also illustrates that the usual culprit is too high an open-loop gain. Let

G_p(s) = \frac{3}{s^2 + 4s + 3}    (13)

G_c(s) = \frac{k}{s + 2}    (14)

The closed-loop system T(s) is, after some elementary calculations,

T(s) = \frac{3k}{(s + 2)(s^2 + 4s + 3) + 3k}    (15)
The denominator of T (s) is also the denominator of S(s) and V (s) and it has roots with positive real parts for all k > 20. Thus, the closed-loop system is unstable for all k > 20.
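This instability threshold can be checked numerically. The sketch below (Python with numpy, assumed here purely for illustration) computes the closed-loop poles, i.e., the roots of the denominator of Equation (15), (s + 2)(s^2 + 4s + 3) + 3k = s^3 + 6s^2 + 11s + (6 + 3k), for gains around k = 20.

import numpy as np

# Closed-loop poles: roots of s^3 + 6s^2 + 11s + (6 + 3k)
for k in (19.0, 20.0, 21.0):
    poles = np.roots([1.0, 6.0, 11.0, 6.0 + 3.0 * k])
    stable = all(p.real < 0 for p in poles)
    print(k, np.round(poles, 3), "stable" if stable else "unstable")
# At k = 20 a pole pair sits on the imaginary axis at s = +/- j*sqrt(11);
# for k > 20 it crosses into the right half plane.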
2.2 Sensitivity to disturbance

Notice that

T(s) + S(s) = 1, ∀s    (16)

Since S(s) is the response to a disturbance input (for example, wind gusts acting on an aircraft) and T(s) is the response to the input that Y(s) should match as closely as possible (for example, the pilot's commands), this appears to be an ideal situation. An excellent control design would make T(s) ≈ 1, ∀s, and this would have the effect of reducing the response to the disturbance to nearly zero. There is, unfortunately, a problem hidden in Equation (16). For any real system, Gp(s) goes to zero as |s| → ∞. This, together with the fact that too large a gain for Gc(s) will generally cause instability, implies that T(s) → 0 as |s| → ∞, and this implies that S(s) → 1 as |s| → ∞. Hence, there will be values of s at which the output of the closed-loop system will consist entirely of the disturbance signal. Remember that letting s = jω in any of the transfer functions above gives the frequency response of the system. Thus, the closed-loop system is a perfect reproducer of high-frequency disturbance signals.
2.3 Sensitivity conservation

Theorem (Bode Sensitivity Integral): Suppose that Gp(s)Gc(s) is rational, has no right half plane poles, and has at least 2 more poles than zeros. Suppose also that the corresponding closed-loop system T(s) is stable. Then the sensitivity S(s) satisfies

\int_0^\infty \log|S(j\omega)| \, d\omega = 0    (17)

This is a simplified version of a theorem that can be found in [12]. The log in Equation (17) is the natural logarithm. The implications of this theorem and Equation (16) are: if you make the closed-loop system follow the input signal R(s) closely over some frequency range, as is the goal of most control systems, then inevitably there will be a frequency range where the closed-loop system actually amplifies disturbance signals. One might hope that the frequency range in which the sensitivity must be greater than one in order to make up for the region of low sensitivity could be infinite. This could make the magnitude of the increased sensitivity infinitesimally small. A further result [12] proves that this is not the case. A frequency range where the sensitivity is low implies a frequency range where it is high. Thus, a feedback control system that closely tracks exogenous signals R(jω) over some range of frequencies (typically 0 ≤ ω < ω_b) will inevitably amplify any disturbance signals, D(jω), over some other frequency range. This discussion has been based on the assumption that both the plant and the controller are continuous-time systems. This is because most real plants are continuous-time systems.
Thus, however one implements the controller, and it is certainly true that most controllers are now digitally implemented, the controller must act as a continuous-time controller. It is therefore subject to the limitations described above.
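For the example of Section 2.1 the conservation result can be observed numerically. The sketch below (Python with numpy, an assumption of this illustration) evaluates the Bode integral (17) on a truncated frequency grid for an arbitrarily chosen stabilizing gain; the grid limits are also arbitrary, and the decay of |S(jω)| toward 1 makes the truncated tail negligible.

import numpy as np

k = 10.0                                      # illustrative stable gain, k < 20
w = np.linspace(1e-4, 1e3, 2_000_000)         # frequency grid in rad/s
s = 1j * w
L = (3 / (s**2 + 4*s + 3)) * (k / (s + 2))    # Gp(jw) * Gc(jw)
logS = np.log(np.abs(1.0 / (1.0 + L)))        # log |S(jw)|
integral = np.sum(0.5 * (logS[:-1] + logS[1:]) * np.diff(w))  # trapezoid rule
print(integral)   # close to 0: the region with |S| < 1 is paid for by a
                  # region with |S| > 1 (disturbance amplification)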
3 Signal processing in control

An understanding of the role of signal processing in control begins with a refinement of the abstract system model in Fig. 1 to that shown in Fig. 3.

Fig. 3 The plant with sensors and actuators indicated (u enters the plant through actuators; y is produced by sensors; d and z are as in Fig. 1).
Although control theorists usually include the sensors and actuators as part of the plant, it is important to understand their presence and their role. In almost all modern control systems, excepting the float valve, the sensors convert the physical signal of interest into an electronic version and the actuator takes an electronic signal as its input and converts it into a physical signal. Both sensing and actuation involve specialized signal processing. A common issue for sensors is that there is very little energy in the signal, so amplification and other buffering are needed. Most actuation requires much greater power levels than are used in signal processing, so some form of power amplification is needed. Generally, the power used in signal processing for control is a small fraction of the power consumed by the closed-loop system. For this reason, control designers rarely concern themselves with the power needed by the controller. Most modern control systems are implemented digitally. The part of a digital control system between the sensor output and the actuator input is precisely a digital signal processor (DSP). The fact that it is a control system impacts the DSP in at least three ways. First, it is essential that a control signal be produced on time every time. This is a hard real-time constraint. Second, the fact that the control signal eventually is input into an actuator also imposes some constraints. Actuators saturate. That is, there are limits to their output values that physically cannot be exceeded. For example, the rudder on a ship and the control surfaces on a wing can only turn so many degrees. A pump has a maximum amount it can move. A motor
has a limit on the torque it can produce. Control systems must either be designed to avoid saturating the actuator or to avoid the performance degradation that can result from saturation. A mathematical model of saturation is given in Fig. 4. Another nonlinearity that is difficult, if not impossible, to avoid is the so-called dead zone, also illustrated in Fig. 4. This arises, for example, in the control of movement because very small motor torques are unable to overcome stiction. Although there are many other common nonlinearities that appear in control systems, we will only mention one more, the quantization nonlinearity that appears whenever a continuous plant is digitally controlled. Thirdly, the fact that the DSP is actually inside a feedback loop imposes the constraints discussed at the end of the previous section: too much gain causes closed-loop instability and too precise tracking can cause amplified sensitivity to disturbances.

Fig. 4 A saturation nonlinearity is shown at the left and a dead zone nonlinearity at the right.
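The two memoryless nonlinearities of Fig. 4 take only a few lines to model. The sketch below is a hypothetical Python/numpy rendering; the saturation limit and dead-zone width are illustrative parameters, not values from the chapter.

import numpy as np

def saturation(u, u_max=1.0):
    # Saturation (Fig. 4, left): output clipped to [-u_max, u_max].
    return np.clip(u, -u_max, u_max)

def dead_zone(u, width=0.1):
    # Dead zone (Fig. 4, right): inputs with |u| < width produce zero output;
    # larger inputs are shifted toward zero by width.
    return np.sign(u) * np.maximum(np.abs(u) - width, 0.0)

u = np.linspace(-2.0, 2.0, 9)
print(saturation(u))
print(dead_zone(u))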
The main impact of the real-time constraint on the DSP is that it must be designed to guarantee that it produces a control signal at every sampling instant. The main impact of the nonlinearities is from the saturation. A control signal that is too large will be truncated. This can actually cause the closed-loop system to become unstable. More commonly, it will cause the performance of the closed-loop system to degrade. The most famous example of this is known as integrator windup, as discussed in the following subsection. The dead zone limits the accuracy of the control. The impact of the sensitivity and stability constraints is straightforward. The design of DSPs for control must account for these effects as well as the obvious direct effect of the controller on the plant. There is one more general issue associated with the digital implementation of controllers. It is well known that the Nyquist rate sets an upper bound on the interval between samples when a continuous-time signal is discretized in time. For purposes of control it is also possible to sample too fast. The intuitive explanation for this is that if the samples are too close together the plant output changes very little from sample
to sample. A feedback controller bases its actions on these changes. If the changes are small enough, noise dominates the signal and the controller performs poorly.
3.1 Simple cases

A generic feedback control situation is illustrated in Fig. 5. Fundamentally, the controller in the figure is a signal processor, taking the signals dc and y as inputs and producing the signal u as output. The objective of the controller is to make y(t) = dc(t) ∀t ≥ 0. This is actually impossible because all physical systems have some form of inertia. Thus, control systems are only required to come close to this objective. There are a number of precise definitions of the term "close." In the simplest cases, the requirements/specifications are that the closed-loop system be asymptotically stable and that |y(t) − dc(t)| < ε ∀t ≥ T ≥ 0, where T is some specified "settling time" and ε is some allowable error. In more demanding applications there will be specified limits on the duration and size of the transient response [14]. An interesting additional specification is often applied when the plant is approximately linear and time-invariant. In this case, an easy application of the final value theorem for Laplace transforms proves that the error in the steady-state response of the closed-loop system to a step input (i.e., a signal that is 0 up to t = 0, at which time it jumps to a constant value) goes to zero as t → ∞ provided the open-loop system (controller in series with the plant with no feedback) has a pole at the origin. Note that this leads to the requirement/specification that the open-loop system have a pole at the origin. This is achievable whereas, because of inaccuracies in sensors and actuators, zero steady-state error is not.
Fig. 5 The plant and controller. Notice that the exogenous input of Fig. 3 has been split into two parts, with dc denoting the exogenous input to the controller and dp that to the plant.
In many cases the signal processing required for control is straightforward and undemanding. For example, the most common controller in the world, used in applications from the toilet tank to aircraft, is the Proportional + Integral + Derivative (PID) Controller. A simple academic version of the PID Controller is illustrated in
Fig. 6. Notice that this simple PID Controller is a signal processing system with two inputs, the signal to be tracked dc and the actual measured output y, and one output, the control signal u. This PID controller has three parameters that must be chosen so as to achieve satisfactory performance. The simplest version has kd and ki equal to zero. This is a proportional controller. The float valve in the toilet tank is an example. For most plants, although not the toilet tank, choosing kp too large will result in the closed-loop system being unstable. That is, y(t) will either oscillate or grow until saturation occurs somewhere in the system. It is also usually true that, until instability occurs, increasing kp will decrease the steady-state error. This captures the basic trade-off in control design. Higher control gains tend to improve performance but also tend to make the closed-loop system more nearly unstable.
Fig. 6 A simple academic PID Controller (block diagram: the error dc − y is fed in parallel through the proportional gain kp, the derivative gain kd with differentiator d/dt, and the integral gain ki with integrator 1/s; the three branches sum to form u).
Changing ki to some nonzero value usually improves the steady-state performance of the closed-loop system. To see this, first break the feedback loop by removing the connection carrying y in Fig. 5 and set kd = 0. Then, taking Laplace transforms gives

U(s) = \left(k_p + \frac{k_i}{s}\right) D_c(s) \qquad (18)

Assuming that the plant is linear and denoting its transfer function by Gp(s) gives the open-loop transfer function

\frac{Y(s)}{D_c(s)} = \left(\frac{k_p s + k_i}{s}\right) G_p(s) \qquad (19)
This open-loop system will have at least one pole at the origin unless Gp(s) has a zero there. As outlined in the previous section, this implies in theory that the steady-state error in response to a step input should be zero. In reality, the dead zone nonlinearity and both sensor and actuator nonlinearity will result in nonzero steady-state errors.
It is relatively simple to adjust the gains kp and ki so as to achieve good performance of the closed-loop system. There are a number of companies that supply ready-made PID controllers. Once these controllers are connected to the plant, a technician or engineer "tunes" the gains to get good performance. There are a number of tuning rules in the literature, the most famous of which is due to Ziegler and Nichols. It should be noted that the Ziegler-Nichols tuning rules are now known to be too aggressive; better tuning rules are available [5]. Choosing a good value of kd is relatively difficult, although the tuning rules do include values for it. Industry surveys have shown that a significant percentage of PID controllers have incorrect values for kd. When it is properly chosen, the D term improves the transient response.

There are three common improvements to the simple PID controller of Fig. 6 (a code sketch incorporating all three appears at the end of this subsection). First, it is very unwise to include a differentiator in a physical system. Differentiation amplifies high frequency noise, which is always present. Thus, the k_d s term in the PID controller should be replaced by \frac{k_d s}{(k_d/k_n)s + 1}. This simply adds a low-pass filter in series with the differentiator and another parameter, kn, to mitigate the problems with high frequency noise. Second, the most common input signal dc(t) is a step. Differentiating such a signal would introduce a signal into the system that is an impulse (infinite at t = 0 and 0 everywhere else). Instead, the signal input to the D-term is chosen to be y(t), not dc(t) − y(t). Notice that this change does away with the impulse and only changes the controller output at t = t0 whenever dc(t) is a step at t0. Lastly, when the PID controller saturates the actuator (u(t) becomes too large for the actuator it feeds), a phenomenon known as integrator windup often occurs. This causes the transient response of the closed-loop system to be much larger and to last much longer than would be predicted by a linear analysis. There are several ways to mitigate the effects of integrator windup [5].

Nowadays the simple PID controller is almost always implemented digitally. The discretization in time and quantization of the signals has very little effect on the performance of the closed-loop system in the vast majority of cases. This is because the plant is usually very slow in comparison to the speeds of electronic signal processors. The place where digital implementation has its largest impact is on the possibility for improving the controller. The most exciting of these improvements is to make the PID controller either adaptive or self-tuning. Both automate the tuning of the controller. The difference is that the self-tuning controller operates in two distinct modes, tuning and controlling. Often, an operator selects the mode, tuning when there is reason to believe the controller is not tuned properly and controlling when it is correctly tuned. The adaptive controller always operates in its adaptive mode, simultaneously trying to achieve good control and tuning the parameters. There is a large literature on both self-tuning and adaptive control; many different techniques have been proposed. There are also about a dozen turnkey self-tuning PID controllers on the market [5].

There are many plants, even LTI single-input single-output (SISO) ones, for which the PID Controller is ineffective or inappropriate. One example is an LTI plant with a pair of lightly damped complex conjugate poles. Nonetheless, for many
such plants the controller can be easily implemented in a DSP. There are many advantages to doing this. Most are obvious but one requires a small amount of background. Most real control systems require a “safety net." That is, provisions must be made to deal with failures, faults, and other problems that may arise. For example, the float valve in the toilet tank will eventually fail to close fully. This is an inevitable consequence of valve wear and/or small amounts of dirt in the water. An overflow tube in the tank guarantees that this is a soft failure. If the water level gets too high, the excess water spills into the overflow tube. From there it goes into the toilet and eventually into the sewage line. The valve failure does not result in a flood in the house. These safety nets are extremely important in the real world, as should be obvious from this simple example. It is easy to implement them within the controller DSP. This is now commonly done. In fact, because many control applications are easily implemented in a DSP, and because such processors are so cheap, it is now true that there is often a good deal of unused capacity in the DSPs used for control. An important area of controls research is to find ways to use this extra capacity to improve the controller performance. One success story is described in the following section.
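To make the PID discussion concrete, here is a minimal discrete-time sketch in Python. The class name, the Euler discretization, and the simple conditional-integration anti-windup rule are illustrative assumptions, not taken from the text; the sketch does, however, incorporate the three improvements described above: a filtered derivative (parameter kn > 0), derivative action on the measurement only, and anti-windup.

# Minimal discrete-time PID sketch with filtered derivative,
# derivative-on-measurement, and simple anti-windup. Illustrative only.
class PID:
    def __init__(self, kp, ki, kd, kn, dt, u_max):
        self.kp, self.ki, self.kd, self.kn = kp, ki, kd, kn
        self.dt = dt                # sampling interval
        self.u_max = u_max          # actuator saturation limit
        self.integral = 0.0         # integrator state
        self.d_state = 0.0          # filtered derivative state
        self.y_prev = None          # previous measurement

    def update(self, dc, y):
        e = dc - y                  # tracking error
        if self.y_prev is None:
            self.y_prev = y
        # Derivative acts on the measurement, so a step in dc
        # does not inject an impulse into the loop.
        dy = (y - self.y_prev) / self.dt
        self.y_prev = y
        # One-pole low-pass filter on the derivative (time constant kd/kn).
        tau = self.kd / self.kn
        alpha = self.dt / (tau + self.dt)
        self.d_state += alpha * (-dy - self.d_state)
        u_unsat = (self.kp * e + self.ki * self.integral
                   + self.kd * self.d_state)
        u = max(-self.u_max, min(self.u_max, u_unsat))  # saturation
        # Anti-windup: integrate only when not saturated, or when the
        # error would drive the actuator back out of saturation.
        if u == u_unsat or e * u < 0:
            self.integral += e * self.dt
        return u

In use, one would pick a sampling interval well above the plant bandwidth and call update(dc, y) once per sample, applying the returned u to the actuator.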
3.2 Demanding cases

The signal processing necessary for control can be very demanding. One such control technique that is extensively used is commonly known as Model Predictive Control (MPC) [11], although it has a number of other names, such as Receding Horizon Control and Dynamic Matrix Control, and numerous variants. The application of MPC is limited to systems with "slow" dynamics because the computations involved require too much time to be accomplished in the sampling interval for "fast" systems. The simplest version of MPC occurs when there is no exogenous input and the objective of the controller is to drive the system state to zero. This is the case described below.

The starting point for MPC is a known plant, which may have multiple inputs and multiple outputs (MIMO), and a precise mathematical measure of the performance of the closed-loop system. It is most convenient to use a state-space description of the plant

\dot{x}(t) = f(x(t), u(t)) \qquad (20)

y(t) = g(x(t), u(t)) \qquad (21)
Note that we have assumed that the system is time-invariant and that the noise can be adequately handled indirectly by designing a robust controller (a controller that is relatively insensitive to disturbances) rather than by means of a more elaborate
controller whose design accounts explicitly for the noise. It is assumed that the state x(t) is an n-vector, the control u(t) is an m-vector, and the output y(t) is a p-vector. The performance measure is generically

J(u_{[0,\infty)}) = \int_0^\infty l(x(t), u(t))\, dt \qquad (22)
The notation u_{[0,\infty)} denotes the entire signal {u(t) : 0 ≤ t < ∞}. The control objective is to design and build a feedback controller that will minimize the performance measure. It is possible to solve this problem analytically in a few special cases. When an analytic solution is not possible one can resort to the approximate solution known as MPC. First, discretize both the plant and the performance measure with respect to time. Second, further approximate the performance measure by a finite sum. Lastly, temporarily simplify the problem by abandoning the search for a feedback control and ask only for an open-loop control. The open-loop control will depend on the initial state, so assume that x(0) = x_0 and this is known. The approximate problem is then to find the sequence [u(0) u(1) ··· u(N−1)] that solves the constrained nonlinear programming problem given below

\min_{[u(0)\, u(1)\, \cdots\, u(N)]} \sum_{i=0}^{N} l(x(i), u(i)) + l_N(x(N+1)) \qquad (23)

subject to

x(i+1) = f_d(x(i), u(i)), \quad i = 0, 1, \cdots, N \qquad (24)

and

x(0) = x_0 \qquad (25)
Notice that the solution to this problem is a discrete-time control signal,

[u^o(0)\ u^o(1)\ \cdots\ u^o(N-1)] \qquad (26)

(where the superscript "o" denotes optimal) that depends only on knowledge of x_0.
This will be converted into a closed-loop (and an MPC) controller in two steps. The first step is to create an idealized MPC controller that is easy to understand but impossible to build. The second step is to replace this infeasible controller by a practical, implementable MPC controller. The idealized MPC controller assumes that y(i) = x(i), ∀i. Then, at i = 0, x(0) = x_0 is known. Solve the nonlinear programming problem instantaneously. Apply the control u^o(0) on the time interval 0 ≤ t < Δ, where Δ is the discretization interval. Next, at time t = Δ, equivalently i = 1, obtain the new value of x, i.e., x(1) = x_1. Again, instantaneously solve the nonlinear programming problem, exactly as before except using x_1 as the initial condition. Again, apply only the first step of the newly computed optimal control (denote it by u^o(1)) on the interval Δ ≤ t < 2Δ. The idea is to compute, at each time instant, the open-loop optimal control for the full time horizon of N + 1 time steps but only implement that control for the first step. Continue to repeat this forever.
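The receding-horizon recipe just described can be summarized in a few lines of code. The following sketch is illustrative only: the scalar plant model, horizon length, cost weights, and the use of a general-purpose solver are all assumptions made for the example, not part of the text.

# Receding-horizon (MPC) sketch: at each sample, solve the finite-horizon
# open-loop problem from the current state and apply only the first control.
# Plant model, horizon, and weights are invented for illustration.
import numpy as np
from scipy.optimize import minimize

N = 10                       # horizon length
u_max = 1.0                  # actuator saturation

def f_d(x, u):               # assumed discretized plant model
    return 0.9 * x + 0.5 * u

def cost(u_seq, x0):         # finite-sum approximation, cf. Eq. (23)
    x, J = x0, 0.0
    for u in u_seq:
        J += x**2 + 0.1 * u**2          # stage cost l(x, u)
        x = f_d(x, u)
    return J + 10.0 * x**2              # terminal term l_N

def mpc_step(x_now):
    res = minimize(cost, np.zeros(N), args=(x_now,),
                   bounds=[(-u_max, u_max)] * N, method='SLSQP')
    return res.x[0]          # implement only the first control

# Closed-loop simulation with full, noiseless state feedback.
x = 5.0
for i in range(30):
    u = mpc_step(x)          # in practice an estimate replaces x
    x = f_d(x, u)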
Of course, the full state is not usually available for feedback (i.e., y(i) ≠ x(i)) and it is impossible to solve a nonlinear programming problem in zero time. The solution to both of these problems is to use an estimate of the state. Let an optimal (in some sense) estimate of x(i+1) given all the data up to time i be denoted by x̂(i+1|i). For example, assuming noiseless and full state feedback, y(k) = x(k) ∀k, and that the dynamics of the system are given by

x(i+1) = f_d(x(i), u(i)), \quad i = 0, 1, \cdots, N \qquad (27)

then

\hat{x}(k+1|k) = f_d(x(k), u(k)) \qquad (28)
The implementable version of MPC simply replaces the initial condition x_0 in the nonlinear programming problem at time t = iΔ by x̂(i+1|i) and solves for u^o(i+1). This means that the computation of the next control value can start at time t = iΔ and can take up to the time (i+1)Δ. It can take a long time to solve a complicated nonlinear programming problem. Because of this, the application of MPC to real problems has been limited to relatively slow systems, i.e., systems for which Δ is large enough to ensure that the computations will be completed before the next value of the control signal is needed. As a result, there is a great deal of research at present on ways to speed up the computations involved in MPC.

The time required to complete the control computations becomes even longer if it is necessary to account explicitly for noise. Consider the following relatively simple version of MPC in which the plant is linear and time-invariant except for the saturation of the actuators. Furthermore, the plant has a white Gaussian noise input and the output signal contains additive white Gaussian noise as well. The plant is then modeled by

x(i+1) = A x(i) + B u(i) + D\xi(i) \qquad (29)

y(i) = C x(i) + \theta(i) \qquad (30)
The two noise signals are zero-mean white Gaussian noises with covariance matrices E(\xi(i)\xi^T(i)) = \Xi ∀i and E(\theta(i)\theta^T(i)) = I ∀i, where E(·) denotes expectation. The two noise signals are independent of each other. The performance measure is

J(u_{[0,\infty)}) = E\left( \frac{1}{2} \sum_{i=0}^{\infty} \left( y^T(i) Q y(i) + u^T(i) R u(i) \right) \right) \qquad (31)
In the equation above, Q and R are symmetric real matrices of appropriate dimensions. To avoid technical difficulties, R is taken to be positive definite (i.e., u^T R u > 0 for all u ≠ 0) and Q positive semidefinite (i.e., y^T Q y ≥ 0 for all y). In the absence of saturation, i.e., if the linear model is accurate for all inputs, then the solution to the stochastic control problem of minimizing Equation (31) subject to Equations (29 &
30) is the Linear Quadratic Gaussian (LQG) regulator [15]. It separates into two independent components. One component is the optimal feedback control u^o(i) = F^o x(i), where the superscript "o" denotes optimal. This control uses the actual state vector, which is, of course, unavailable. The other component of the optimal control is a Kalman filter which produces the optimal estimate of x(i) given the available data at time i. Denoting the output of the filter by x̂(i|i), the control signal becomes

u^o(i) = F^o \hat{x}(i|i) \qquad (32)
This is known as the certainty equivalent control because the optimal state estimate is used in place of the actual state. For this special case, all of the parameters can be computed in advance. The only computations needed in real time are the one-step-ahead Kalman filter/predictor and the matrix multiplication in Equation (32).

If actuator saturation is important, as it is in many applications, then the optimal control is no longer linear and it is necessary to use MPC. As before, approximate the performance measure by replacing ∞ by some finite N. An exact feedback solution to this new problem would be extremely complicated. Because of the finite time horizon, it is no longer true that the optimal control is time-invariant. Because of the saturation, it is no longer true that the closed-loop system is linear, nor is it true that the state estimation and control are separate and independent problems. Even though this separation does not hold, it is common to design the MPC controller by using a Kalman filter to estimate the state vector. Denote this estimate by x̂(i|i) and the one-step-ahead prediction by x̂(i|i−1). The finite time LQ problem with dynamics given by Equations (29 & 30) and performance measure
J(u_{(0,N]}) = E\left( \frac{1}{2} \sum_{i=0}^{N} \left( y^T(i) Q y(i) + u^T(i) R u(i) \right) + y^T(N+1) Q y(N+1) \right) \qquad (33)

is then solved open loop with initial condition x(0) = x̂(i|i−1) and with the constraint

u(i) = \begin{cases} u_{max}, & \text{if } u(i) \ge u_{max} \\ u(i), & \text{if } |u(i)| < u_{max} \\ -u_{max}, & \text{if } u(i) \le -u_{max} \end{cases} \qquad (34)
This is a convex programming problem which can be quickly and reliably solved. As is usual in MPC, only the first term of the computed control sequence is actually implemented. The substitution of the one-step prediction for the filtered estimate is done so as to provide time for the computations. There is much more to MPC than has been discussed here. An introduction to the very large literature on the subject is given in Section 5. One issue that is too important to omit is that of stability. The rationale behind MPC is a heuristic notion that the performance of such a controller is likely to be very good. While good performance implicitly requires stability, it certainly does not guarantee it. Thus,
it is comforting to know that stability is theoretically guaranteed under reasonably weak assumptions for a broad range of MPC controllers [16].
3.3 Exemplary case

ABS brakes are a particularly vivid example of the challenges and benefits associated with the use of DSPs in control. The basic problem is simply stated: control the automobile's brakes so as to minimize the distance it takes to stop. The theoretical solution is also simple. Because the coefficient of sliding friction is smaller than the coefficient of rolling friction, the vehicle will stop in a shorter distance if the braking force is the largest it can be without locking the wheels. All this is well known. The difficulty is the following. The smallest braking force that locks the wheels depends on how slippery the road surface is. For example, a very small braking force will lock the wheels if they are rolling on glare ice. A much larger force can be applied without locking the wheels when they are rolling on dry pavement. Thus, the key practical problem for ABS brakes is how to determine how much force can be applied without locking the wheels. In fact, there is as yet no known way to measure this directly. It must be estimated. The control, based on this estimate, must perform at least as well as an unassisted human driver under every possible circumstance.

The key practical idea of ABS brakes is to frequently change the amount of braking force to probe for the best possible value. Probing works roughly as follows (a toy sketch of this probe-and-adjust loop appears at the end of this subsection). Apply a braking force and determine whether the wheels have locked. If they have, reduce the braking force to a level low enough that the wheels start to rotate again. If they have not, increase the braking force. This is repeated at a reasonably high frequency. Probing cannot stop and must be repeated frequently because the slipperiness of the road surface can change rapidly. It is important to keep the braking force close to its optimal value all the time. Balancing the dual objectives of quickly detecting changes in the slipperiness of the surface and keeping the braking force close to its optimal value is one of the keys to successful ABS brakes. Note that this tradeoff is typical of adaptive control problems, where the controller has to serve the dual objectives of providing good control and providing the information needed to improve the controller.

The details of the algorithm for combining probing and control are complicated [13]. The point here is that the complicated algorithm is a life-saving application of digital signal processing for control. Two other points are important. In order for ABS brakes to be possible, it was first necessary to develop a hydraulic braking system that could be electronically modulated. That is, the improved controller depends on a new actuator. Second, there is very little control theory available to assist in the design of the controller upon which ABS braking depends.

There are three reasons why control theory is of little help in designing ABS brakes. First, the interactions between tires and the road surface are very complex. The dynamic distortion of the tire under loading plays a fundamental role in traction
(see [21] for a dramatic example). While there has been research on appropriate mathematical models for the tire/road interaction, realistic models are too complex for control design and simpler models that might be useful for controller design are not realistic enough. Second, the control theory for systems that have large changes in dynamics as a result of small changes in the control—exactly what happens when a tire loses traction—is only now being developed. Such systems are one form of hybrid system. The control of hybrid systems is a very active current area of research in control theory. Third, a crucial component of ABS brakes is the electronically controllable hydraulic brake cylinder. This device is obviously needed in order to implement the controller. Control theory has very little to say about the physical devices needed to implement a controller.
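For completeness, here is the toy probe-and-adjust sketch promised above. The lock detector and the adjustment gains are hypothetical placeholders invented for illustration; the real algorithms [13] are far more elaborate.

# Toy probe-and-adjust step for ABS braking. wheel_locked() and the
# gain values are hypothetical placeholders, not a real ABS algorithm.
def abs_probe_step(brake_force, wheel_locked, f_min=0.05, up=1.05, down=0.7):
    if wheel_locked(brake_force):
        # Back off far enough that the wheel starts rolling again.
        return max(f_min, down * brake_force)
    # Otherwise probe upward toward the (unknown) locking threshold.
    return up * brake_force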
4 Conclusions

Nowadays most control systems are implemented digitally. This is motivated primarily by cost and convenience. It is abetted by continuing advances in sensing and actuation. More and more physical signals can be converted to electrical signals with high accuracy and low noise. More and more physical actuators convert electrical inputs into physical forces, torques, etc. Sampling and digitization of the sensor signals is also convenient, very accurate if necessary, and inexpensive. Hence, the trend is very strongly towards digital controllers.

The combination of cheap sensing and digital controllers has created exciting opportunities for improved control and automation. Adding capability to a digitally implemented controller is now often cheap and easy. Nonlinearity, switching, computation, and logical decision making can all be used to improve the performance of the controller. The question of what capabilities to add is wide open.
5 Further reading

There are many good undergraduate textbooks dealing with the basics of control theory. Two very popular ones are by Dorf and Bishop [1] and by Franklin, Powell, and Emami-Naeini [2]. A standard undergraduate textbook for the digital aspects of control is [3]. A very modern and useful upper-level undergraduate or beginning graduate textbook is by Goodwin, Graebe, and Salgado [4]. An excellent source for information about PID control is [5]. Graduate programs in control include an introductory course in linear system theory that is very similar to the ones for signal processing. The book by Kailath [7], although old, is very complete and thorough. Rugh [8] is more control oriented and newer. The definitive graduate textbook on nonlinear control is [6].
For more advanced and specialized topics in control, The Control Handbook [9] is an excellent source. It was specifically designed for the purpose of providing a starting point for further study. It also contains good introductory articles about undergraduate topics and a variety of applications. There is a very large literature on MPC. Besides the previously cited [11], there are books by Maciejowski [17] and Kwon and Han [18], as well as several others. There are also many survey papers. The one by Rawlings [19] is easily available and recommended. Another very useful and often cited survey is by Qin and Badgwell [20]. This is a particularly good source for information about companies that provide MPC controllers and other important practical issues. Lastly, the Handbook of Networked and Embedded Control Systems [10] provides introductions to most of the issues of importance in both networked and digitally controlled systems.
References

1. Dorf, R.C., Bishop, R.H.: Modern Control Systems (11th edition), Pearson Prentice-Hall, Upper Saddle River (2008)
2. Franklin, G.F., Powell, J.D., Emami-Naeini, A.: Feedback Control of Dynamic Systems (5th edition), Prentice-Hall, Upper Saddle River (2005)
3. Franklin, G.F., Powell, J.D., Workman, M.: Digital Control of Dynamic Systems (3rd edition), Addison-Wesley, Menlo Park (1998)
4. Goodwin, G.C., Graebe, S.F., Salgado, M.E.: Control System Design, Prentice-Hall, Upper Saddle River (2001)
5. Astrom, K.J., Hagglund, T.: PID Controllers: Theory, Design, and Tuning (2nd edition), International Society for Measurement and Control, Seattle (1995)
6. Khalil, H.K.: Nonlinear Systems (3rd edition), Prentice-Hall, Upper Saddle River (2001)
7. Kailath, T.: Linear Systems, Prentice-Hall, Englewood Cliffs (1980)
8. Rugh, W.J.: Linear System Theory (2nd edition), Prentice-Hall, Upper Saddle River (1996)
9. Levine, W.S. (Editor): The Control Handbook, CRC Press, Boca Raton (1995)
10. Hristu-Varsakelis, D., Levine, W.S.: Handbook of Networked and Embedded Control Systems, Birkhauser, Boston (2005)
11. Camacho, E.F., Bordons, C.: Model Predictive Control (2nd edition), Springer, London (2004)
12. Looze, D.P., Freudenberg, J.S.: Tradeoffs and limitations in feedback systems. The Control Handbook, pp 537-550, CRC Press, Boca Raton (1995)
13. Bosch: Driving Safety Systems (2nd edition), SAE International, Warrenton (1999)
14. Yang, J.S., Levine, W.S.: Specification of control systems. The Control Handbook, pp 158-169, CRC Press, Boca Raton (1995)
15. Lublin, L., Athans, M.: Linear quadratic regulator control. The Control Handbook, pp 635-650, CRC Press, Boca Raton (1995)
16. Scokaert, P.O.M., Mayne, D.Q., Rawlings, J.B.: Suboptimal model predictive control (feasibility implies stability). IEEE Transactions on Automatic Control, 44(3), pp 648-654 (1999)
17. Maciejowski, J.M.: Predictive Control with Constraints. Prentice Hall, Englewood Cliffs (2002)
18. Kwon, W.H., Han, S.: Receding Horizon Control: Model Predictive Control for State Models. Springer, London (2005)
19. Rawlings, J.B.: Tutorial overview of model predictive control. IEEE Control Systems Magazine, 20(3), pp 38-52 (2000)
20. Qin, S.J., Badgwell, T.A.: A survey of model predictive control technology. Control Engineering Practice, 11, pp 733-764 (2003)
21. Hallum, C.: The magic of the drag tire. SAE Paper 942484, Presented at SAE MSEC (1994)
Digital Signal Processing in Home Entertainment Konstantinos Konstantinides
Abstract In the last decade or so, audio and video media switched from analog to digital, and so did consumer electronics. In this chapter we explore how digital signal processing has affected the creation, distribution, and consumption of digital media in the home. Using "photos", "music", and "video" as the three core media of home entertainment, we explore how advances in digital signal processing, such as audio and video compression schemes, have affected the various steps in the digital photo, music, and video pipelines. The emphasis in this chapter is more on demonstrating how applications in the digital home drive and apply state-of-the-art methods for the design and implementation of signal processing systems, rather than on describing in detail any of the underlying algorithms or architectures, which we expect to be covered in more detail in other chapters. We also explore how media can be shared in the home, and provide a short review of the principles of the DLNA stack. We conclude with a discussion on digital rights management (DRM) and a short overview of Microsoft Windows DRM.
1 Introduction

Only a few years ago, the most complex electronic devices in the home were probably the radio and the television, all built using traditional analog electronic components such as transistors, resistors, and capacitors. However, in the late 1990s we witnessed a major change in home entertainment. Media began to go digital and so did consumer electronics, like cameras, music players, and TVs. Today, a typical home may have more than a half-dozen electronic devices, built using both traditional electronic components and sophisticated microprocessors or digital signal processors. In this chapter we will try to showcase how digital signal processing (DSP) has affected the creation, distribution, and consumption of digital media in
[email protected]
S.S. Bhattacharyya et al. (eds.), Handbook of Signal Processing Systems, DOI 10.1007/978-1-4419-6345-1_2, © Springer Science+Business Media, LLC 2010
23
24
Konstantinos Konstantinides
the home. We will use “photos”, “music”, and “video” as the three core media of home entertainment. The emphasis in this chapter is more on demonstrating how applications in the digital home drive and apply state-of-the art methods for the design and implementation of signal processing systems, rather than describing in detail any of the underlying algorithms or architectures. We expect these details to be exciting topics of future chapters in this book. For example, while we discuss the need for efficient image and video compression in the home, there is no plan to explain the details of either the JPEG or MPEG compression standards in this chapter.
1.1 The Personal Computer and the Digital Home

There are many contributing factors to the transition from analog to digital home entertainment, including advances in semiconductors, DSP theory, and communications. For example, submicron technologies allowed more powerful and cheaper processors. New developments in coding and communications led the way to new standards for the efficient coding and transmission of photos, videos, and audio. Finally, an ever-growing world-wide Web offered new alternatives for the distribution and sharing of digital content.

One may debate which device in the home is the center of home entertainment — the personal computer (PC), the TV, and the set-top box are all competing for the title — however, nobody can dispute the effects of the PC on the world of digital entertainment. Figure 1 shows the functional evolution of the home PC. In the early days, a PC was used mostly for such tasks as word processing, home finances, Web access, and playing simple PC games. Today, PCs are also the main interfaces to digital cameras and portable music players, and are being used for passive entertainment, such as listening to Internet radio and watching Internet videos. However, the integration of media server capabilities now allows a PC to also become an entertainment hub. Windows Media Player 11 (WMP11), TVersity, and other popular media server applications now allow users to stream pictures, music, and videos from any PC to a new generation of devices called digital media receivers (DMRs). DMRs integrate wired and wireless networking and powerful processors to play HD content in real time. Examples of DMRs include connected TVs and gaming devices, such as the Xbox 360 or the PlayStation 3. Many DMRs can also operate as Extenders for Windows Media Center, which allows users to access remotely any PC with Windows Media Center. This service allows users to interact with their PCs at both the four-foot level (personal computing using a keyboard and a mouse) and the ten-foot level (personal entertainment using a remote control).

This PC evolution has completely changed the way consumer-electronics (CE) manufacturers design new devices for the home. Instead of designing isolated islands of entertainment, they now design devices that can be interconnected and share media over the home network or the Internet. In the next few sections we will examine in more detail the processing pipelines for photos, music, and videos, the Digital Living Network Alliance (DLNA) interoperability standard for content shar-
Fig. 1 The evolution of the home PC.
ing in the home, and digital rights management issues, and how all these together influence the design and implementation of future signal processing systems.
2 The Digital Photo Pipeline

If you ask people what non-living item they would try to rescue from a burning home, the most common answer will be "our photos". Photos touch people like very few other items in a home. They provide a visual archive of a family's past and present, memories most of us would like to preserve for future generations as well. Figure 2 shows the photo pipelines for both an analog (film-based) and a digital camera. In the past, taking photos was a relatively simple process. As shown in Figure 2a, you took photos using a traditional film camera, you gave the film for developing, and you received a set of either prints or slides. In most cases, you would store your best pictures in a photo album and the rest in a shoe box. Unless you were a professional photographer, you had very little influence on how your printed pictures would look. Digital photography changed all that. Figure 2b shows a typical digital photo processing pipeline, and as we are going to see, digital signal processing is now part of every stage.
2.1 Image Capture and Compression

Using a digital camera, digital signal processing begins with image capture. A new generation of integrated camera processors and DSP algorithms makes sure that cap-
Fig. 2 Photo processing pipeline: a) film camera (capture, develop, print); b) digital camera (capture, post-processing, store/organize, print, digital display).
tured raw sensor data are translated to color-corrected image data that can be stored either as raw pixels or as compressed images. Inside the camera, the following image processing algorithms are usually involved:
2.2 Focus and exposure control

Based on the content of the scene, camera controls determine the aperture size, the shutter speed, the automatic gain control, and the focal position of the lens.
2.3 Pre-processing of CCD data

In this step, the camera processor removes noise and other artifacts to produce an accurate representation of the captured scene.
2.4 White Balance

The goal of this step is to remove unwanted color casts from the image, so that colors perceived as "white" by the human visual system are rendered as white in the photo as well.
2.5 Demosaicing

Most digital cameras have a single sensor which is overlaid with a color filter array (CFA). For example, a CFA with a Bayer pattern alternates red and green filters for odd rows with green and blue filters for even rows. During demosaicing the camera uses interpolation algorithms to reconstruct a full-color image from the incomplete color samples.
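As a toy illustration of the interpolation step (assuming an RGGB Bayer layout and plain bilinear interpolation; production cameras use far more sophisticated, edge-aware methods), the two missing color samples at each pixel can be filled in by normalized convolution:

# Bilinear demosaicing sketch for an RGGB Bayer mosaic (illustrative).
import numpy as np
from scipy.signal import convolve2d

def demosaic_bilinear(raw):
    h, w = raw.shape
    r = np.zeros((h, w)); r[0::2, 0::2] = 1    # R sample positions
    b = np.zeros((h, w)); b[1::2, 1::2] = 1    # B sample positions
    g = 1 - r - b                               # G everywhere else
    k = np.array([[0.25, 0.5, 0.25],
                  [0.5,  1.0, 0.5 ],
                  [0.25, 0.5, 0.25]])           # bilinear weights
    out = np.empty((h, w, 3))
    for c, m in enumerate((r, g, b)):
        num = convolve2d(raw * m, k, mode='same')
        den = convolve2d(m, k, mode='same')
        out[:, :, c] = num / den                # normalized interpolation
    return out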
2.6 Color transformations

Color transformations transform color data from the sensor's color space to one that better matches the colors perceived by the human visual system, such as CIE XYZ.
2.7 Post Processing

Demosaicing and color transformations may introduce color artifacts that are usually removed during this phase. In addition, a camera may also perform edge enhancement and other post-processing algorithms that improve picture quality.
2.8 Preview display

The camera scales the captured image and renders it in a manner suitable to be previewed on the camera's preview display.
2.9 Compression and storage

The image is finally compressed and stored in the camera's storage medium, such as flash memory. Compression is defined as the process of reducing the size of the digital representation of a signal and is an inherent operation in every digital camera.
Consider a typical 8-megapixel image. Assuming 24 bits per pixel, uncompressed, such an image will take about 24 Mbytes. In contrast, a compressed version of the same picture can take less than 1.2 Mbytes of space, with no loss in perceptual quality. This difference means that if your camera has 500 MB of storage you can either store about 20 uncompressed pictures or about 400 compressed ones. In 1991, JPEG was introduced as an international standard for the coding of digital images, and it remains among the most popular image-compression schemes.
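The arithmetic behind these figures is easy to check:

# Back-of-the-envelope storage arithmetic for the example above.
pixels = 8_000_000                 # 8-megapixel sensor
raw_bytes = pixels * 3             # 24 bits/pixel = 3 bytes/pixel
print(raw_bytes / 1e6)             # ~24 MB uncompressed
jpeg_bytes = 1.2e6                 # ~1.2 MB compressed (about 20:1)
card = 500e6                       # 500 MB of storage
print(card // raw_bytes, card // jpeg_bytes)   # ~20 vs ~400 pictures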
2.10 Photo Post-Processing

The ability to have access to the raw bits of a picture allowed users for the first time to manipulate and post-process their pictures. Even the simplest of photo processing software allows users to perform a number of image processing operations, such as:

• Crop, rotate, and resize images
• Automatically fix brightness and contrast
• Remove red-eye
• Adjust color and light
• Do special effects, or perform other sophisticated image processing algorithms
While early versions of image processing software depended heavily on user input (for example, a user had to select the eye area before doing red-eye removal), newer tools may use automatic face recognition algorithms and require minimal user interaction.
2.11 Storage and Search

Digital pictures can be stored in a variety of media, including CD-ROMs, DVDs, PC hard drives, or network-attached storage (NAS) devices, both at home and on remote backup servers. These digital libraries not only provide a better alternative to shoe boxes, but also allow users to use new algorithms for the easy storage, organization, and retrieval of their photos. For example, new face recognition algorithms allow users to retrieve all photos that include "John" with a click of a button, instead of poring through all the family albums. Similarly, photos can be automatically organized according to a variety of metadata, such as date, time, or even (GPS-based) location.
2.12 Digital Display

Low-cost digital printers now allow for at-home print-on-demand services, but photo display is no longer constrained by the covers of a photo album or the borders of the traditional photo frame. PCs, digital frames, and digital media receivers, like the Apple TV or the Xbox 360, allow users to display their photos on a variety of devices, including wall-mounted LCD monitors, digital frames, or even HDTVs. Display algorithms allow users to create custom slide shows with sophisticated transition effects that include rotations, zooming, blending, and panning. For example, many display devices now support "Ken Burns effects", a combination of display algorithms that combine slow zooming with panning, fading, and smooth transitions, similar to the effects Ken Burns uses in his documentaries. All in all, the modern digital photo pipeline integrates a variety of sophisticated digital image processing algorithms, from the time you take a picture all the way to the time you display it on your HDTV.
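Such a pan-and-zoom effect reduces, per displayed frame, to computing a slowly moving crop window that is then scaled to the screen. A minimal sketch, where all parameter values and the linear interpolation schedule are illustrative choices:

# Minimal pan-and-zoom ("Ken Burns" style) crop-window schedule.
def ken_burns_windows(img_w, img_h, n_frames,
                      zoom_start=1.0, zoom_end=1.3,
                      pan=(0.0, 0.0), pan_end=(0.1, 0.05)):
    for i in range(n_frames):
        t = i / max(1, n_frames - 1)            # progress in [0, 1]
        zoom = zoom_start + t * (zoom_end - zoom_start)
        w, h = img_w / zoom, img_h / zoom       # crop size shrinks as we zoom
        x = (pan[0] + t * (pan_end[0] - pan[0])) * img_w
        y = (pan[1] + t * (pan_end[1] - pan[1])) * img_h
        yield x, y, w, h                        # crop rect to scale to display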
3 The Digital Music Pipeline

Music is the oldest form of home entertainment, but in the last few years, thanks to digital signal processing, we have seen dramatic changes in how we acquire and consume music. In 1980, Philips and Sony introduced the first commercial audio CD and opened a new world of digital audio recordings and entertainment. Audio CDs offered clear sound, random access to audio tracks, and more than 60 minutes of continuous audio playback, so they quickly dominated the music market. But that was only the beginning. By the mid-1990s, audio compression and the popular MP3 format allowed for a new generation of players and music distribution services. Figure 3 shows a simplified view of the digital music processing pipeline from a master recording to a player. As in the digital photo pipeline, digital signal processing is part of every stage.
3.1 Compression

In most cases, music post-production processing begins with compression. Hardcore audiophiles will usually select a lossless audio compression scheme, which allows for the exact reconstruction of the original data. Examples of lossless audio codecs include WMA Lossless, Meridian Lossless (used in the DVD-Audio format), and FLAC (Free Lossless Audio Codec), to name just a few. However, in most cases, audio is compressed using a lossy audio codec, such as MPEG Audio Layer III, also known as MP3, or Advanced Audio Coding (AAC). Compared to the 1.4 Mbits/s bit rate of a stereo CD, compression allows for CD-quality music at about
Fig. 3 Music processing pipeline.
128 kbits/s per audio channel. Music codecs achieve these levels of compression by taking into consideration both the characteristics of the audio signal and the human auditory perception system.
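The CD figure quoted above follows directly from the Red Book sampling parameters; a quick check:

# Where the 1.4 Mbit/s CD bit rate comes from.
fs, bits, channels = 44_100, 16, 2     # CD audio parameters
cd_rate = fs * bits * channels         # = 1,411,200 bits/s ~ 1.4 Mbit/s
mp3_rate = 128_000                     # a typical MP3 bit rate
print(cd_rate, cd_rate / mp3_rate)     # roughly an 11:1 reduction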
3.2 Broadcast Radio

While AM/FM radios are still part of almost every household, they are complemented by other sources of broadcast music to the home, including satellite radio, HD Radio, and streamed Internet radio. Digital Audio Broadcasting (DAB), a European standard developed under the Eureka program, is being used in several countries in Europe, Canada, and Asia, and can use both MPEG Audio Layer II and AAC audio codecs. In North America, Sirius Satellite Radio and XM Satellite Radio started offering subscription-based radio services in 2002.
In the United States, in 2002 the Federal Communications Commission (FCC) selected "HD Radio" from iBiquity Digital Corporation as the method for digital terrestrial broadcasting for off-the-air AM or FM programming. Unlike DAB, HD Radio does not require any additional bandwidth or new channels. It uses In-Band On-Channel (IBOC) technology, which allows the simultaneous transmission of analog and digital broadcasts using the same bandwidth as existing AM/FM stations. HD Radio is a trademark of iBiquity Digital Corporation, which develops all HD Radio transmitters and licenses HD Radio receiver technology. HD Radio is officially known as the NRSC-5 specification. Unlike DAB, NRSC-5 does not specify the audio codec. Today HD Radio uses a proprietary audio codec developed by iBiquity.

In 1995, RealNetworks introduced the RealAudio player and we entered the world of Internet-based radio. Today, Internet audio streaming is an integral part of any media player and allows users to access thousands of radio stations around the world.
3.3 Audio Processing

Access to digital music not only allowed a new generation of music services, like the iTunes Store from Apple Inc., but also created new opportunities for tools for the storage, processing, organization, and distribution of music inside the home. Digital media servers allow us to store and organize our whole music collections like never before. Metadata in the audio files allows us to organize and view music files by genre, artist, album, and multiple other categories. In addition, new audio tools allow us to edit, mix, and create new audio tracks with the simplicity of "cut" and "paste."

But what if a track has no metadata? What if you just listened to a song and you would like to know more about it? New audio processing algorithms can use "digital signatures" in each song to automatically recognize a song while it is being played (see, for example, the Shazam service). Similar tools can automatically extract tempo and other music characteristics from a song or a selection of songs and automatically create a playlist of similar songs. For example, the "Genius" tool in Apple's iTunes can automatically generate a playlist from a user's library based on a single song selection.
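As a toy illustration of the idea (commercial services use far more robust, landmark-based fingerprints than this sketch), one can summarize a track by the dominant spectral peak of each short frame:

# Toy audio "signature": hash the dominant FFT peak of each frame.
# Real fingerprinting systems are far more robust than this sketch.
import numpy as np

def toy_fingerprint(samples, frame=4096):
    peaks = []
    for i in range(0, len(samples) - frame, frame):
        spectrum = np.abs(np.fft.rfft(samples[i:i + frame]))
        peaks.append(int(np.argmax(spectrum[1:])) + 1)  # skip the DC bin
    return hash(tuple(peaks))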
4 The Digital Video Pipeline

We have already discussed the importance of efficient coding in both the digital photo and music pipelines. Therefore, it will not come as a surprise that compression plays a key role in digital video entertainment as well. As early as 1988, the International Standards Organization (ISO) established the Moving Picture Experts Group (MPEG) with the mission to develop standards for the coded representation
of moving pictures and associated audio. Since then, MPEG has delivered a variety of video coding standards, including MPEG-1, MPEG-2, MPEG-4 (part 2), and more recently MPEG-4, part 10, or H.264. These standards opened the way to a variety of new digital media, including the VCD, the DVD, and more recently HDTV, HD-DVD, and Blu-ray. However, compression is only a small part of the video processing pipeline in home entertainment. As shown in Figure 4, the home can receive digital video content from a variety of sources, including digital video broadcasting, portable digital storage media, and portable video capture devices.
Fig. 4 Digital video processing pipeline: content arrives from Internet services, off-the-air, satellite, cable, and IPTV broadcasts, and digital media (DVD, Blu-ray), and passes through capture, post-processing, store/organize, and playback & display stages.
4.1 Digital Video Broadcasting

As of June 2009, in the United States, all television stations are required to broadcast only in digital format. In the next few years, similar transitions from analog to digital TV broadcasting will continue in countries all over the world. In addition to these changes, new Internet Protocol Television (IPTV) providers continue to challenge and take market share from the traditional cable and satellite providers. Digital video broadcasting led to new developments in coding and communications and allowed a more efficient use of the communication spectrum. In addition, it influenced a new generation of set-top boxes and HDTVs with advanced video processing, personal video recording capabilities, and Internet connectivity.
4.2 New Portable Digital Media

It is not a surprise that the DVD player is considered by many the most successful product in the history of consumer electronics. By providing excellent video quality, random access to video recordings, and multi-channel audio, it allowed consumers to experience movies in the home like never before. DVD movies combine MPEG-2 video with Dolby Digital audio compression; however, a DVD is more than a collection of MPEG-coded files. Digital video and audio algorithms are used in the whole DVD-creation pipeline: inside digital video cameras, in the editing room for the creation of special effects and post processing, and finally in the DVD authoring house, where bit-rate control algorithms apply rate-distortion principles to preserve the best video quality while using the minimum number of bits. As we transition from standard definition to high definition, these challenges become even greater and the development of new digital signal processing algorithms and architectures is more important than ever. For example, the Blu-ray specification not only allows for new coding schemes for high-definition video, but also provides advanced interactivity through Java-based programming and "BD Live."
4.3 Digital Video Recording

In 1999, ReplayTV and TiVo launched the first digital video recorders, or DVRs, and changed forever the way people watch TV. DVRs offered far more than the traditional "time shifting" feature of a VCR. Hard disk drive-based recordings allowed numerous other features, including random access to recordings, instant replay, and commercial skip. More recently DVRs became complete entertainment centers with access to the Internet for remote access and Internet-based video streaming. In addition, USB ports allow users to connect personal media for instant access to their personal photos, music, and videos. Like modern Blu-ray players, DVRs inte-
grate advanced digital media processors that allow for enhanced video processing and control. Figure 5 shows the simplified block diagram of a typical DVR. At the core of the DVR is a digital media processor that can perform multiple operations, including: • Interfaces with all the DVR peripherals, such as the IR controller and the hard disk drive • For digital video, it extracts a Program Stream from the input Transport Stream and either stores it in the HDD or decodes it and passes it to the output • For analog video, it passes it through to the output or compresses it and stores it in the HDD • Performs all DVR controls, communicates with the service provider to update the TV Guide, maintains the recording schedules, and performs all playback trick modes (such as, pause, fast forward, reverse play, and skip) • Performs multiple digital video and image-related operations, such as scaling and deinterlacing
Fig. 5 Block diagram of a typical DVR: a digital media processor interfaces with the IR receiver, front panel control, TV tuners, digital and analog video inputs, USB, Ethernet, audio codec, security block, SDRAM, and HDD.
In addition to the media processor, a DVR includes at least one TV tuner, infrared (IR) and/or RF interfaces for the remote control, a front panel control, security-related controls, such as Smart Card or CableCARD interfaces, and audio and video output interfaces. DVRs may also include additional interfaces, including USB and Ethernet ports, so they can play media from external devices or add external peripherals, such as an external hard disk drive.
4.4 Digital Video Processing

Comparing Figure 2 and Figure 4, we see that the digital video processing pipeline for personal camcorders is not that much different from the digital camera pipeline. After video capture using a CCD, video processing algorithms take the raw sensor data, color correct them, and translate them in real time to a compressed digital video bit stream. Alternatively, one can use any PC-based video capture card to capture and translate any analog video signal to digital. Regardless of the input, PC-based video editing tools allow us to apply a variety of post-processing imaging algorithms to the videos and create custom videos in a variety of formats. User videos can be stored either on DVDs or in NAS-based media libraries. In addition, many users upload their videos to social networking sites, where videos are automatically transcoded to formats suitable for video streaming over the Internet, such as Adobe's Flash video or H.264. Modern video servers allow for the seamless integration of video with Internet content, allowing users to search within video clips using either metadata or data automatically extracted from the associated audio and video streams.
4.5 Digital Video Playback and Display

Every video eventually needs to be displayed on a screen, and modern high-definition monitors and TVs have never been more sophisticated. Starting from their input ports, they need to interface with a variety of analog and digital ports, from the traditional RF input to DVI or HDMI. Sophisticated deinterlacing, motion compensation, and scaling algorithms are able to take any input, at any resolution, and convert it to the TV's native resolution. For example, more recently, 100/120 Hz HDTVs need to apply advanced motion estimation and interpolation techniques to interpolate traditional 50/60 Hz video sequences to their higher frame rate. LED-based LCD displays also use advanced algorithms to adjust the LED backlight according to the input video for the best possible dynamic contrast.

In summary, digital signal processing is involved in every aspect of the capture, post-processing, and playback of photos, music, and videos, but that's only the beginning. In the next section we will examine how our digital media can be shared anywhere in the home, using existing coding and communication technologies.
5 Sharing Digital Media in the Home

In the previous sections we discussed how digital signal processing is associated with almost every aspect of digital entertainment in the home. In most homes we see multiple islands of entertainment. For example, the living room may be one such island, with a TV, a set-top box, a DVD player, and a gaming platform. The
office may be another island, with a PC and a NAS. In other rooms we may find a laptop or a second PC or even additional TVs. In addition, each family may have a variety of mobile or smart phones, a camera, and a camcorder. Each of these islands is able to create, store, or play back content. Ideally, consumers would like to share media from any of these devices with any other device. For example, say you download a movie from the Internet on a PC but you really want to view it on the big TV in the living room; wouldn't it be nice to be able to stream that movie from the PC in the office to your TV via your home network? You definitely can, and the role of the Digital Living Network Alliance (DLNA) is to make this as easy as possible.
5.1 DLNA

DLNA is a cross-industry association that was established in 2003 and now includes more than 250 companies from the consumer electronics, computer, and mobile device industries. The vision of DLNA is to create a wired or wireless interoperable network where digital content, such as photos, music, and videos, can be shared by all home-entertainment devices. Figure 6 shows the DLNA interoperability stack for audio-visual devices. As Figure 6 shows, DLNA uses established standards, such as IEEE 802.11 for wireless networking, HTTP for media transport, Universal Plug and Play (UPnP) for device discovery and control, and standard media formats, such as JPEG and MPEG, for digital media. DLNA defines more than 10 device classes, but for the purposes of this discussion we will only cover the two key classes: the digital media server and the digital media player.
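The device discovery layer in this stack is ordinary SSDP multicast, part of UPnP. The short sketch below uses the standard SSDP multicast address and headers to search for UPnP media servers on the local network; the timeout and search target are illustrative choices.

# Minimal SSDP (UPnP discovery) search for media servers on the LAN.
import socket

MSEARCH = "\r\n".join([
    "M-SEARCH * HTTP/1.1",
    "HOST: 239.255.255.250:1900",
    'MAN: "ssdp:discover"',
    "MX: 2",
    "ST: urn:schemas-upnp-org:device:MediaServer:1",
    "", ""])

sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
sock.settimeout(3.0)
sock.sendto(MSEARCH.encode(), ("239.255.255.250", 1900))
try:
    while True:
        data, addr = sock.recvfrom(2048)
        print(addr, data.split(b"\r\n")[0])   # each responding server
except socket.timeout:
    pass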
5.2 Digital Media Server

A digital media server is any device that can acquire, record, store, and stream digital media to digital media players. Examples include PCs, advanced set-top boxes, music or video servers, and NAS devices.
5.3 Digital Media Player

Digital media players (also known as digital media receivers) find content on digital media servers and provide playback capabilities. Figure 7 shows a typical interaction between a DLNA client and a DLNA server. The devices use HTTP for media transport and may also include a local user interface for setup and control. Examples of digital media players include connected TVs and game consoles. Today, digital media players combine wired and wireless connectivity and a powerful digi-
Fig. 6 DLNA Audio-Video Interoperability stack:

• Media Formats: JPEG, LPCM, MPEG-2, MP3, MPEG-4, AAC LC, AVC/H.264 + optional formats
• Device Discovery, Control and Media Management: UPnP AV 1.0, UPnP Device Architecture 1.0
• Media Transport: HTTP (mandatory), RTP (optional)
• Network Stack: IPv4 Protocol Suite
• Network Connectivity: Wired: 802.3i, 802.3u; Wireless: 802.11a/b/g, Bluetooth
tal signal processor. However, despite the efforts of the standardization committees to establish standards for the coding of pictures, audio, and video, this did not stop the proliferation of other coding formats. For example, just for video, the alphabet soup of audio/video codecs includes DivX, Xvid, RealVideo, Windows Media Video (WMV), Adobe Flash video, and Matroska video, to name just a few. This proliferation of coding formats is a major challenge for consumer electronics manufacturers, who strive to provide their customers a seamless experience. While it is rather easy to install a new codec on a PC, this may not be the case for a set-top box, a DMR, or a mobile client. While manufacturers work on the next generation of multi-format players, it is now not uncommon to see PC-based media transcoders. These transcoders interact with media servers and allow for on-the-fly transcoding of any media format to the most commonly used formats, like MPEG-2 or WMV.
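Such a transcoder can be as thin as a wrapper around an off-the-shelf encoder. A hedged sketch using the ffmpeg command-line tool, assuming ffmpeg is installed; the codec and bit-rate options, and the file names, are illustrative:

# Transcode sketch: re-encode an arbitrary input file to MPEG-2 video
# with MP2 audio, a widely supported combination. Options illustrative.
import subprocess

def transcode_to_mpeg2(src, dst):
    subprocess.run(
        ["ffmpeg", "-i", src,
         "-c:v", "mpeg2video", "-b:v", "4M",
         "-c:a", "mp2", "-b:a", "192k",
         dst],
        check=True)

transcode_to_mpeg2("movie.mkv", "movie.mpg")  # hypothetical file names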
6 Copy Protection and DRM

The digital distribution of media content opened new opportunities to consumers but also created a digital rights management nightmare for content creators and copyright holders. The Serial Copy Management System (SCMS) was one of the first schemes that attempted to control the number of second-generation digital copies
Fig. 7 Typical DLNA client-server interaction: the client stack (local user interface; MPEG/JPEG media formats; HTTP; UPnP-AV and UPnP device layers; IP/WiFi networking) exchanges UPnP messages with the server and receives content via HTTP.
one could make from a master copy. For example, while one can make a copy from an original CD, a special SCMS flag in the copy may indicate that one can't make a copy of that copy. SCMS does not stop you from making unlimited copies of the original, but you can't make a copy of a copy. SCMS was first introduced in digital audio tapes, but it is also available in audio CDs and other digital media. CSS (Content Scramble System) is another copy protection scheme; it is used by almost every commercially available DVD-Video disc and controls the digital copying and playback of DVDs. DRM, or digital rights management, refers to technologies that control how digital media can be consumed. While FairPlay from Apple Inc. and Windows DRM from Microsoft are the most commonly used DRM schemes, the development of alternative copy protection and DRM schemes is always an active area of research and development in digital signal processing.
6.1 Microsoft Windows DRM

Microsoft provides two popular DRM kits: one for portable devices and one for networked devices. Figure 8 shows that under Windows Media DRM, portable devices can acquire licenses either directly from a license server through an Internet connection, or indirectly, from a local media server (for example, a user's computer). Figure 9 shows license acquisition scenarios when two devices support Windows Media DRM for Network Devices. In this scenario, in addition to the basic DLNA stack, the digital media receiver integrates the Windows Media DRM layer
Fig. 8 License acquisition for portable devices: licenses are acquired either directly from a license server over the Internet or indirectly through a local media server.
(WMDRM) and has also the capability to play Windows Media Video (WMV) content. Using Windows Media DRM for network devices, the two devices exchange keys and data as follows:

1. The first time a device is used it must be registered and authorized by the server through UPnP. A special certificate identifies the device and contains information used to ensure secure communication.
2. During the initial registration, the server performs a proximity detection to verify that the device is close enough to be considered inside the home.
3. Periodically, the server repeats the proximity detection to revalidate the device.
4. The device requests content for playback from the server.
5. If the server determines that the device is validated and has the right to play the content, it sends a response containing a new, encrypted session key, a rights policy statement specifying the security restrictions that the device must enforce, and finally the content. The content is encrypted by the session key. Each time content is requested, a new session is established.
6. The network device must parse the rights policy and determine whether it can adhere to the required rights. If it can, it may render the content.

As digital content is shared among multiple devices, most made by different manufacturers, an open DRM standard or DRM-free content seems to be the only way to achieve interoperability.
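The per-session keying in step 5 above can be illustrated with a toy exchange. This is emphatically not the actual WMDRM protocol; standard-library primitives and invented names merely stand in for the real certificate and key hierarchy.

# Toy illustration of step 5: issue a fresh session key per content
# request and bind the rights policy to it with a MAC. NOT the real
# WMDRM protocol; all names here are illustrative.
import secrets, hmac, hashlib

device_secret = secrets.token_bytes(32)      # shared after registration

def issue_session(policy: bytes):
    session_key = secrets.token_bytes(16)    # new key for each request
    tag = hmac.new(device_secret, policy + session_key,
                   hashlib.sha256).digest()
    return session_key, policy, tag          # content encrypted under key

def device_accepts(session_key, policy, tag):
    expect = hmac.new(device_secret, policy + session_key,
                      hashlib.sha256).digest()
    return hmac.compare_digest(tag, expect)  # enforce policy only if True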
Fig. 9 License acquisition and data transfer for networked devices. [Figure: the DLNA Server performs direct license acquisition over the Internet and delivers protected content, via indirect license acquisition, to a DLNA Client with Windows DRM, whose stack adds a WMDRM layer and WMV playback on top of the local user interface, MPEG/JPEG handling, HTTP, UPnP-AV/UPnP device, UPnP messages, and networking (IP/WiFi) layers.]
7 Summary and Conclusions

The home entertainment ecosystem is transitioning from isolated analog devices to a network of interconnected digital devices. Driven by new developments in semiconductors, coding, and digital communications, these devices can share photos, music, and videos over a home network, enhance the user experience, and continue to drive the development of new digital signal processing algorithms and architectures.
8 To Probe Further

A good overview of the digital camera processing pipeline is given in [5]. An overview of image and video compression standards, covering both theory and architectures and including the JPEG [12], MPEG-1 [13], MPEG-2 [11], AAC [10], and MPEG-4 [7] standards, can be found in [9]. The H.264 standard, also known as MPEG-4 Part 10, is defined in [2]. Chapter 5 and Chapter 3 of this Handbook provide additional details on video coding. The DAB specification can be found in [8]. In [3], Maxson provides a very thorough description of the NRSC-5 specification [1] for HD Radio transmission. For more information about DLNA, DLNA's web site, www.dlna.org, provides excellent information and tutorials. A nice overview of DLNA can also be found in a white paper on UPnP and DLNA [4]. Most of the material on Windows DRM is drawn from an excellent description of Microsoft's DRM [6]. More details about Microsoft's Extenders for Windows Media Center can be found on Microsoft's Windows site [14].
References

1. NRSC-5-B in-band/on-channel digital radio broadcasting standard, CEA and NAB. Technical report, NRSC, April 2008.
2. H.264 series H: Audiovisual and multimedia systems, infrastructure of audiovisual services — coding of moving video, advanced video coding for generic audiovisual services. Technical report, ITU, 2007.
3. D. P. Maxson. The IBOC Handbook: Understanding HD Radio Technology. Focal Press, 2007.
4. Networked digital media standards, a UPnP/DLNA overview. Technical report, Allegro Software Development Corporation, October 2006. White paper.
5. R. Ramanath et al. Color image processing pipeline. IEEE Signal Processing Magazine, 22(1):34–43, January 2005.
6. J. Cohen. A general overview of Windows Media DRM 10 device technologies. Technical report, Microsoft Corporation, September 2004.
7. 14496-2, information technology — coding of audio-visual objects — part 2: Visual. Technical report, ISO/IEC, 2001.
8. ETS 300 401, radio broadcasting systems; digital audio broadcasting (DAB) to mobile, portable, and fixed receivers. Technical report, ETSI, 1997.
9. V. Bhaskaran and K. Konstantinides. Image and Video Compression Standards: Algorithms and Architectures. Kluwer Academic Publishers, second edition, 1997.
10. JTC1/SC29/WG11 N1430 MPEG-2 DIS 13818-7 (MPEG-2 advanced audio coding, AAC). Technical report, ISO/IEC, November 1996.
11. JTC1 CD 13818, generic coding of moving pictures and associated audio. Technical report, ISO/IEC, 1994.
12. JTC1 CD 10918, digital compression and coding of continuous-tone still images — part 1: Requirements and guidelines. Technical report, ISO/IEC, 1993.
13. JTC1 CD 11172, coding of moving pictures and associated audio for digital storage media up to 1.5 Mbit/s. Technical report, ISO/IEC, 1992.
14. Microsoft. Extenders for Windows Media Center. http://www.microsoft.com/windows/products/winfamily/mediacenter/features/extender.mspx, cited 4/6/2009.
MPEG Reconfigurable Video Coding Marco Mattavelli, Jörn W. Janneck and Mickaël Raulet
Abstract The current monolithic and lengthy scheme behind the standardization and design of new video coding standards has become inadequate for the dynamic and changing needs of the video coding community. Such a scheme and its specification formalism do not enable designers to exploit the clear commonalities between the different codecs, either at the level of the specification or at the level of the implementation. This problem is one of the main reasons for the typically long interval between the time a new idea is validated and the time it is implemented in consumer products as part of a worldwide standard. The analysis of this problem gave rise to a new standard initiative within the ISO/IEC MPEG committee, called Reconfigurable Video Coding (RVC). The main idea is to develop a video coding standard that overcomes many shortcomings of the current standardization and specification process by updating and progressively extending a modular library of components. As the name implies, flexibility and reconfigurability are attractive new features of the RVC standard. The RVC framework is based on the use of a new actor/dataflow-oriented language called CAL for the specification of the standard library and the instantiation of the RVC decoder model. CAL dataflow models expose the intrinsic concurrency of the algorithms by employing the notions of actor programming and dataflow. This chapter gives an overview of the concepts and technologies that make up the standard RVC framework and of the non-normative tools supporting the RVC model, from the instantiation and simulation of the CAL model to software and/or hardware code synthesis.

Marco Mattavelli, Microelectronic Systems Lab, EPFL, CH-1015 Lausanne, Switzerland, e-mail:
[email protected] Jörn W. Janneck University of California at Berkeley, Berkeley, CA 94720, U.S.A. e-mail:
[email protected] Mickaël Raulet IETR/INSA Rennes, F-35043, Rennes, France e-mail:
[email protected]
1 Introduction

A large number of successful MPEG (Moving Picture Experts Group) video coding standards have been developed since the MPEG-1 standard effort began in 1988. The standardization efforts in the field, besides having as their first objective the interoperability of compression systems, have also aimed at providing appropriate forms of specification for wide and easy deployment. Video standards are becoming increasingly complex and take ever longer to produce, which makes it difficult for standards bodies to deliver timely specifications that address the needs of the market at any given point in time. The structure of past standards has been that of a monolithic specification together with a fixed set of profiles that subset the functionality and capabilities of the complete standard. Similar comments apply to the reference code, which in more recent standards has itself become normative. Video devices typically support a single profile of a specific standard, or a small set of profiles. They therefore have only very limited adaptivity to the video content or to environmental factors (bandwidth availability, quality requirements). Within the ISO/IEC MPEG committee, the Reconfigurable Video Coding (RVC) standard [12] [4] is intended to address the two following issues: make standards faster to produce, and permit video devices based on those standards to exhibit more flexibility with respect to the coding technology used for the video content. The key idea is to standardize a library of video coding components, instead of an entire video decoder. The standard can then evolve flexibly by incrementally extending that library, and video devices can configure themselves to support a variety of coding algorithms by composing encoders and decoders from that library of predefined coding modules. This chapter gives an overview of the concepts and technologies building the standard RVC framework, and it can complement and be complemented by Chapter 14, Chapter 2, and Chapter 5.
2 Requirements and rationale of the MPEG RVC framework

Started in 2004, the MPEG Reconfigurable Video Coding (RVC) framework [4] is a new ISO standard currently in the final stage of standardization (Fig. 1), aiming at providing video codec specifications at the level of library components instead of monolithic algorithms. RVC addresses this problem by defining two standards: a language with which a video decoder can be described (ISO/IEC 23001-4, or MPEG-B pt. 4 [10]), and a library of video coding tools employed in MPEG standards (ISO/IEC 23002-4, or MPEG-C pt. 4 [11]). The new concept is to be able to specify a decoder of an existing standard, or a completely new configuration that may better satisfy application-specific constraints, by selecting standard components from a library of standard coding algorithms. The possibility of dynamic configuration and reconfiguration of codecs also requires new methodologies and new tools for describing the bitstream syntaxes and the parsers of such new codecs.
The essential concepts of the RVC framework (Fig. 2) are the following:
• RVC-CAL [6], a dataflow language describing the behavior of the Functional Units (FUs). The language defines the behavior of dataflow components called actors. An actor is a modular component that encapsulates its own state, such that an actor can neither read nor modify the state of any other actor. The only interaction between actors is via messages (known in CAL as tokens), which flow from an output of one actor to an input of another. The behavior of an actor is defined in terms of a set of atomic actions. The execution inside an actor is purely sequential: at any point in time, only one action can be active inside an actor. An action can consume (read) tokens, modify the internal state of the actor, produce tokens, and interact with the underlying platform on which the actor is running.
• FNL (Functional unit Network Language), a language describing the video codec configurations. FNL is an XML dialect that lists the FUs composing the codec, the parameterization of these FUs, and the connections between the FUs. FNL allows hierarchical constructions: an FU can be defined as a composition of other FUs and described by another FU Network Description (FND).
• BSDL (Bitstream Syntax Description Language), a language describing the structure of the input bitstream. BSDL is an XML dialect that lists the sequence of the syntax elements, with possible conditioning on the presence of the elements according to the values of previously decoded elements. BSDL is further explained in Section 3.4.
• A library of video coding tools, also called Functional Units (FUs), covering all MPEG standards (the "MPEG Toolbox"). This library is specified and provided using RVC-CAL (a subset of the original CAL language) as the specification language for each FU.
• An "Abstract Decoder Model" (ADM), constituting a codec configuration (described using FNL) that instantiates FUs of the MPEG Toolbox. Figure 2 depicts the process of instantiating an ADM in RVC.
• Tools simulating and validating the behavior of the ADM (the Open DataFlow environment [1]).
• Tools automatically generating software and hardware descriptions of the ADM.

Fig. 1 CAL and RVC standard timeline. [Figure: a timeline from 2002 to 2010 marking CAL inception and CLR 1.0 (2003), the Ptolemy interpreter, the first hardware code generator and the MPEG-4 SP decoder in hardware and in software, OpenDF on SourceForge, the first software code generator and Eclipse plugin, Graphiti and Orcc on SourceForge, the RVC-CAL specification of the MPEG-4 AVC decoder, and ISO/IEC FDIS 23001-4 and 23002-4 (MPEG RVC).]

Fig. 2 RVC standard. [Figure: the Decoder Description (MPEG-B), comprising an FU Network Description (FNL) and a Bitstream Syntax Description (RVC-BSDL), drives model instantiation (selection of FUs and parameter assignment) from the MPEG Tool Library of RVC-CAL FUs (MPEG-C), producing the Abstract Decoder Model (FNL + RVC-CAL); the decoder implementation, built on an MPEG Tool Library implementation, turns encoded video data into decoded video data.]
2.1 Limits of previous monolithic specifications

MPEG has produced several video coding standards, such as MPEG-1, MPEG-2, MPEG-4 Video, AVC (Advanced Video Coding), and recently SVC (Scalable Video Coding). While at the beginning MPEG-1 and MPEG-2 were specified only by textual descriptions, with the increasing complexity of the algorithms, starting with the MPEG-4 set of standards, C or C++ specifications, also called reference software, have become the formal specification of the standards. However, the past monolithic specification of such standards (usually in the form of C/C++ programs) lacks flexibility: it does not allow combining coding algorithms from different standards to achieve specific design or performance trade-offs and thus satisfy, case by case, the requirements of specific applications. Indeed, not all coding tools defined in a profile@level of a specific standard are required in all application scenarios. For a given application, codecs are therefore either not exploited at their full potential or require unnecessarily complex implementations. Nevertheless, a decoder conformant to a standard has to support all of them, which may result in inefficient implementations. Moreover, such descriptions, composed of non-optimized, non-modular software packages, have started to show many limits. Considering that they are in practice the starting point of any implementation, system designers must rewrite these software packages not only to try to optimize performance, but also to transform these descriptions into forms adapted to current system design methodologies. Such monolithic specifications hide the inherent parallelism and the dataflow structure of the video coding algorithms, features that must be exploited for efficient implementations. Meanwhile, the evolution of video coding technologies leads to solutions that are increasingly complex to design and that present significant overlap between successive versions of the standards.

Why does C fail? The control over low-level details, which is considered a merit of the C language, typically tends to over-specify programs. Not only are the algorithms themselves specified, but also how inherently parallel computations are sequenced, how and when inputs and outputs are passed between the algorithms, and, at a higher level, how computations are mapped to threads, processors, and application-specific hardware. In general, it is not possible to recover the original knowledge about the intrinsic properties of the algorithms by analyzing the software program, and the opportunities for restructuring transformations on imperative sequential code are very limited compared to the parallelization potential available on multi-core platforms [3]. These are the main reasons for which C has been replaced by CAL in RVC.
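As a small illustration of this over-specification (the example is ours, not from the RVC documents), consider the following C loop: its iterations are mutually independent, yet the language semantics impose a total sequential order that a compiler or synthesis tool must then try to undo.

    #include <stddef.h>

    /* Each iteration is independent of the others, so all of them could
     * execute in parallel; the C for-loop nevertheless prescribes the
     * order i = 0, 1, 2, ...  A dataflow actor would state only the true
     * data dependencies and leave the schedule to the implementation. */
    void scale_add(float *y, const float *x, float a, size_t n) {
        for (size_t i = 0; i < n; i++)
            y[i] = a * x[i] + y[i];
    }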
2.2 Reconfigurable Video Coding specification requirements

Scalable parallelism. In parallel programming, the number of things that are happening at the same time can scale in two ways: it can increase with the size of the problem or with the size of the program. Scaling a regular algorithm over larger amounts of data is a relatively well-understood problem, while building programs such that their parts execute concurrently without much interference is one of the key problems in scaling the von Neumann model. The explicit concurrency of the actor model provides a straightforward parallel composition mechanism that tends to lead to more parallelism as applications grow in size, and scheduling techniques permit scaling concurrent descriptions onto platforms with varying degrees of parallelism.

Modularity and reuse. The ability to create new abstractions by building reusable entities is a key element of every programming language. For instance, object-oriented programming has made huge contributions to the construction of von Neumann programs, and the strong encapsulation of actors along with their hierarchical composability offers an analog for parallel programs.

Concurrency. In contrast to procedural programming languages, where control flow is made explicit, the actor model emphasizes the explicit specification of concurrency. Rallying around the pivotal and unifying von Neumann abstraction has resulted in a long and very successful collaboration between processor architects, compiler writers, and programmers. Yet, for many highly concurrent programs, portability has remained an elusive goal, often due to their sensitivity to timing. The untimedness and asynchrony of stream-based programming offer a solution to this problem. The portability of stream-based programs is underlined by the fact that programs of considerable complexity and size can be compiled to competitive hardware [14] as well as software [26], which suggests that stream-based programming might even be a solution to the old problem of flexibly co-synthesizing different mixes of hardware/software implementations from a single source.
Encapsulation. The success of a stream programming model will in part depend on its ability to configure dynamically and to virtualize, i.e. to map to collections of computing resources too small for the entire program at once. Moving parts of a program on and off a resource requires encapsulation, i.e. a clear distinction between those pieces that belong to the parts to be moved and those that do not. The transactional execution of actors generates points of quiescence, the moments between transactions, when the actor is in a defined and known state that can be safely transferred across computing resources.
3 Description of the standard or normative components of the framework

The fundamental element of the RVC framework, in the normative part, is the Decoder Description (Fig. 2), which includes two types of data:
The Bitstream Syntax Description (BSD), which describes the structure of the bitstream. The BSD is written in RVC-BSDL. It is used to generate the appropriate parser to decode the corresponding input encoded data [9] [25].
The FU Network Description (FND), which describes the connections between the coding tools (i.e., the FUs). It also contains the values of the parameters used for the instantiation of the different FUs composing the decoder [5] [14] [26]. The FND is written in the so-called FU Network Language (FNL).
The syntax parser (built from the BSD), together with the network of FUs (built from the FND), forms a CAL model called the Abstract Decoder Model (ADM), which is the normative behavioral model of the decoder.
3.1 The toolbox library

An interesting feature of the RVC standard, which distinguishes it from traditional rigidly specified video coding standards, is that a description of the decoder can be associated with the encoded data in various ways, according to each application scenario. Figure 3 illustrates this conceptual view of RVC [20]. All three types of decoders are within the RVC framework and are constructed using the MPEG-B standardized languages; hence, they all conform to the MPEG-B standard. A Type-1 decoder is constructed using only the FUs within the MPEG Video Tool Library (VTL). Hence, this type of decoder conforms to both the MPEG-B and MPEG-C standards. A Type-2 decoder is constructed using FUs from the MPEG VTL as well as one or more proprietary libraries (VTL 1-n). This type of decoder conforms to the MPEG-B standard only. Finally, a Type-3 decoder is constructed using one or more proprietary VTLs (VTL 1-n), without using the MPEG VTL. This type of decoder also conforms to the MPEG-B standard only. An RVC decoder (i.e., one conformant to MPEG-B) is composed of coding tools described in VTLs according to the decoder description. The MPEG VTL is described by MPEG-C. Traditional programming paradigms (monolithic code) are not appropriate for supporting such a modular framework. A new dataflow-based programming model has thus been specified and introduced by MPEG RVC as the specification formalism.
Fig. 3 The conceptual view of RVC. [Figure: a decoder description and coded data feed an MPEG-B decoder, which draws FUs from the MPEG VTL (MPEG-C) and/or proprietary Video Tool Libraries {1..N} to form a Type-1, Type-2, or Type-3 decoder producing the decoded video.]
The MPEG VTL is normatively specified using RVC-CAL. An appropriate level of granularity for the components of the standard library is important to enable effective reconfiguration of codecs and efficient reuse of components across codec implementations. If the library is composed of modules that are too coarse, they will be too large to be usable in different and interesting codec configurations; conversely, if the granularity level of the library components is too fine, the number of modules in the library will be too large for an efficient and practical reconfiguration process at the codec implementation side, and may obscure the desired high-level description and modeling features of the RVC codec specifications. Most of the effort behind the standardization of the MPEG VTL was devoted to studying the best granularity trade-off for the VTL components. It must be noted, however, that the choice of the best trade-off in terms of high-level description and module reusability does not really affect the potential parallelism of the algorithm that can be exploited in multi-core and FPGA implementations.
3.2 The CAL Actor Language

CAL [6] is a domain-specific language that provides useful abstractions for dataflow programming with actors. For more information on dataflow methods, the reader may refer to Part IV of this handbook, which contains several chapters that go into detail on various kinds of dataflow techniques for the design and implementation of signal processing systems. CAL has been used in a wide variety of applications and has been compiled to hardware and software implementations; work on mixed HW/SW implementations is under way. The next section provides a brief introduction to some key elements of the language.

3.2.1 Basic Constructs

The basic structure of a CAL actor is shown in the Add actor (Fig. 4), which has two input ports A and B, and one output port Out, all of type T. T may be int or uint (for integers and unsigned integers, respectively), bool for booleans, or float for floating-point numbers. Moreover, CAL designers may assign a number of bits to a specific integer type, depending on the numeric range of the variable. The actor contains one action that consumes one token on each input port and produces one token on the output port. An action may fire if the availability of tokens on the input ports matches the port patterns, which in this example corresponds to one token on each of the ports A and B.
actor Add () T A, T B ==> T Out :
  action [a], [b] ==> [sum]
  do
    sum := a + b;
  end
end
Fig. 4 Basic structure of a CAL actor.
An actor may have any number of actions. The untyped Select actor (Fig. 5) reads and forwards a token from either port A or port B, depending on the evaluation of guard conditions. Note that each of the actions has an empty body.

actor Select () S, A, B ==> Output :
  action S: [sel], A: [v] ==> [v]
  guard sel end

  action S: [sel], B: [v] ==> [v]
  guard not sel end
end
Fig. 5 Guard structure in a CAL actor.
3.2.2 Priorities and State Machines

An action may be labeled, and it is possible to constrain the legal firing sequences by expressions over these labels. In the PingPongMerge actor, shown in Figure 6, a finite state machine schedule is used to force the action sequence to alternate between the two actions A and B. The schedule statement introduces two states, s1 and s2.
actor PingPongMerge () Input1, Input2 ==> Output :
  A: action Input1: [x] ==> [x] end
  B: action Input2: [x] ==> [x] end

  schedule fsm s1 :
    s1 (A) --> s2;
    s2 (B) --> s1;
  end
end
Fig. 6 FSM structure in a CAL actor.
The Route actor, shown in Figure 7, forwards the token on the input port A to one of the three output ports. Upon instantiation it takes two parameters, the functions P and Q, which are used as predicates in the guard conditions. The selection of which action to fire is, in this example, determined not only by the availability of tokens and the guard conditions, but also by the priority statement.

actor Route (P, Q) A ==> X, Y, Z :
  toX: action [v] ==> X: [v] guard P(v) end
  toY: action [v] ==> Y: [v] guard Q(v) end
  toZ: action [v] ==> Z: [v] end

  priority
    toX > toY > toZ;
  end
end
Fig. 7 Priority structure in a CAL actor.
3.2.3 CAL subset language for RVC

For an in-depth description of the language, the reader is referred to the language report [6]; the specific subset standardized by ISO is defined in Annex C of [10]. This subset deals only with fully typed actors and imposes some restrictions on the CAL language constructs of [6], so as to allow efficient hardware and software code generation without changing the expressivity of the algorithms. For instance, Figures 5, 6, and 7 are not RVC-CAL compliant and must be changed as shown in Figures 8, 9, and 10, where T1, T2, T3, and T are types; only typed parameters, not functions such as P and Q, can be passed to actors.
actor Select () T1 S, T2 A, T3 B ==> T3 Output :
  action S: [sel], A: [v] ==> [v]
  guard sel end

  action S: [sel], B: [v] ==> [v]
  guard not sel end
end
Fig. 8 Guard structure in an RVC-CAL actor.
actor PingPongMerge () T Input1, T Input2 ==> T Output :
  A: action Input1: [x] ==> [x] end
  B: action Input2: [x] ==> [x] end

  schedule fsm s1 :
    s1 (A) --> s2;
    s2 (B) --> s1;
  end
end
Fig. 9 FSM structure in an RVC-CAL actor.
A large selection of example actors is available at the OpenDF repository [15]; among them can also be found the MPEG-4 decoder discussed below. Many other actors written in RVC-CAL will soon be available at the standard MPEG RVC tool repository once the conformance testing process is completed.
actor Route () T A ==> T X, T Y, T Z :
  function P (T v_in) --> T :
    // body of the function P
    P(v_in)
  end

  function Q (T v_in) --> T :
    // body of the function Q
    Q(v_in)
  end

  toX: action [v] ==> X: [v] guard P(v) end
  toY: action [v] ==> Y: [v] guard Q(v) end
  toZ: action [v] ==> Z: [v] end

  priority
    toX > toY > toZ;
  end
end
Fig. 10 Priority structure in an RVC-CAL actor.
3.3 FU Network language for the codec configurations

A set of CAL actors is instantiated and connected to form a CAL application, i.e., a CAL network. Figure 11 shows a simple CAL network Sum, which consists of the previously defined RVC-CAL Add actor and the delay actor Z shown in Figure 12.

Fig. 11 A simple CAL network. [Figure: the network Sum has an input In and an output Out; In feeds input A of the Add actor, the output of the delay actor Z feeds input B, and the output of Add is both fed back into Z and exposed as the network output Out.]
The source that defines the network Sum is shown in Figure 13. Note that the network itself has input and output ports, and that the instantiated entities may be either actors or other networks, which allows for hierarchical designs. Networks have traditionally been described in a textual language, which can be automatically converted to FNL and vice versa; FNL is the XML dialect standardized by ISO in Annex B of [10].

actor Z (v) T In ==> T Out :
  A: action ==> [v] end
  B: action [x] ==> [x] end

  schedule fsm s0 :
    s0 (A) --> s1;
    s1 (B) --> s1;
  end
end

Fig. 12 RVC-CAL Delay actor.

network Sum () In ==> Out :
entities
  add = Add();
  z = Z(v = 0);
structure
  In --> add.A;
  z.Out --> add.B;
  add.Out --> z.In;
  add.Out --> Out;
end

Fig. 13 Textual representation of the Sum network.

XML (Extensible Markup Language) is a flexible way to create common information formats. XML is a formal recommendation from the World Wide Web Consortium (W3C). XML is not a programming language; rather, it is a set of rules that allow data to be represented in a structured manner. Since the rules are standard, XML documents can be automatically generated and processed. Its use can be gauged from its name itself:
• Markup: a collection of tags
• XML tags: identify the content of data
• Extensible: user-defined tags
The XML representation of the Sum network is shown in Figure 14. A graphical editing framework, the Graphiti editor [8], is available to create, edit, save, and display a network. Both the XML and the textual formats for the network description are supported by this editor.
<?xml version="1.0" encoding="UTF-8"?>
<XDF name="Sum">
  <Port kind="Input" name="In"/>
  <Port kind="Output" name="Out"/>
  <Instance id="add">
    <Class name="Add"/>
  </Instance>
  <Instance id="z">
    <Class name="Z"/>
    <Parameter name="v">
      <Expr kind="Literal" literal-kind="Integer" value="0"/>
    </Parameter>
  </Instance>
  <Connection dst="add" dst-port="A" src="" src-port="In"/>
  <Connection dst="add" dst-port="B" src="z" src-port="Out"/>
  <Connection dst="z" dst-port="In" src="add" src-port="Out"/>
  <Connection dst="" dst-port="Out" src="add" src-port="Out"/>
</XDF>

Fig. 14 XML representation of the Sum network.
3.4 Bitstream syntax specification language BSDL

MPEG-B Part 5 is an ISO/IEC international standard that specifies BSDL [9] (Bitstream Syntax Description Language), an XML dialect describing generic bitstream syntaxes. In the field of video coding, the BSDL description of MPEG-4 AVC [30] bitstreams represents all the possible structures of a bitstream that conforms to MPEG-4 AVC. A Bitstream Syntax Description (BSD) is one unique instance of the BSDL description. It represents a single MPEG-4 AVC encoded bitstream: it is no longer a BSDL schema but an XML file showing the data of the bitstream. Figure 15 shows a BSD together with its corresponding BSDL schema. An encoded video bitstream is described as a sequence of binary syntax elements of different lengths: some elements contain a single bit, while others contain many bits. The Bitstream Schema (in BSDL) indicates the length of these binary elements in a human- and machine-readable format (hexadecimal values, integers, strings, etc.). For example, hexadecimal values are used for start codes, as shown in Figure 15. The XML formalism allows organizing the description of the bitstream in a hierarchical structure. The Bitstream Schema (in BSDL) can be specified at different levels of granularity, and it can be fully customized to the application requirements [29]. BSDL was originally conceived and designed to enable the adaptation of scalable multimedia content in a format-independent manner [15]. In the RVC framework, BSDL is used to fully describe video bitstreams; thus, BSDL schemas must specify all the elements of the syntax, i.e., at a low level of granularity. Before the use of BSDL in RVC, existing BSDL descriptions described scalable content at a high level of granularity. Figure 15 is an example BSDL description for video in the MPEG-4 AVC format. In the RVC framework, BSDL has been chosen because:
<NALUnit>
  <startCode>00000001</startCode>
  <forbidden0bit>0</forbidden0bit>
  <nalReference>3</nalReference>
  <nalUnitType>20</nalUnitType>
  <payload>5 100</payload>
</NALUnit>
<NALUnit>
  <startCode>00000001</startCode>
  ...
</NALUnit>

<xsd:element name="NALUnit" bs2:ifNext="00000001">
  <xsd:complexType>
    <xsd:sequence>
      <xsd:element name="startCode" type="avc:hex4" fixed="00000001"/>
      <xsd:element name="nalUnit" type="avc:NALUnitType"/>
      <xsd:element ref="payload"/>
    </xsd:sequence>
  </xsd:complexType>
</xsd:element>

<!-- Type of NALUnitType -->
<xsd:complexType name="NALUnitType">
  <xsd:sequence>
    <xsd:element name="forbidden_zero_bit" type="bs1:b1" fixed="0"/>
    <xsd:element name="nal_ref_idc" type="bs1:b2"/>
    <xsd:element name="nal_unit_type" type="bs1:b5"/>
  </xsd:sequence>
</xsd:complexType>

<xsd:element name="payload" type="bs1:byteRange"/>

Fig. 15 A Bitstream Syntax Description (BSD) fragment of an MPEG-4 AVC bitstream and its corresponding BS schema fragment coded in RVC-BSDL.
• it is stable and already defined by an international standard; • the XML-based syntax interacts well with the XML-based representation of the configuration of RVC decoders; • the parser may be easily generated from the BSDL schema by using standard tools (e.g. XSLT); • the XML-based syntax integrates well with the XML infrastructure of the existing tools.
3.5 Instantiation of the ADM

In the RVC framework, the decoding platform acquires the Decoder Description, which fully specifies the architecture of the decoder and the structure of the incoming bitstream. To instantiate the corresponding decoder implementation, the platform uses a library of building blocks specified by MPEG-C. Conceptually, such a library is a user-defined proprietary implementation of the MPEG RVC standard library, providing the same I/O behavior. Such a library can be expressly developed to explicitly expose an additional level of concurrency and parallelism appropriate for implementing a new decoder configuration on user-specific multi-core target platforms. The dataflow form of the standard RVC specification, with the associated model of computation, guarantees that any reconfiguration of the user-defined proprietary library, developed at whatever lower level of granularity, yields an implementation that is consistent with the (abstract) RVC decoder model originally specified using the standard library. Figures 2 and 3 show how a decoding solution is built not only from the standard specification of the codecs in RVC-CAL using the normative VTL, which already provides an explicit, concurrent, and parallel model, but also from non-normative "multi-core-friendly" proprietary Video Tool Libraries that, where necessary, increase the level of explicit concurrency and parallelism for specific target platforms. Thus, the standard RVC specification, which is already an explicit model for concurrent systems, can be further improved or specialized by proprietary libraries used in the instantiation phase of an RVC codec implementation.
3.6 Case study of new and existing codec configurations

3.6.1 Commonalities

All existing MPEG codecs are based on the same structure, the hybrid decoding structure, which includes a parser that extracts values for texture reconstruction and motion compensation. MPEG-4 SP and MPEG-4 AVC are therefore both hybrid decoders. Figure 16 shows the main functional blocks composing a hybrid decoder structure.
Fig. 16 Hybrid decoder structure. [Figure: the encoded bitstream enters a parser (BSDL); the residual path (RVC-CAL) is summed with the motion compensation path (RVC-CAL), which is fed by picture buffering (RVC-CAL), to produce the decoded video.]
As said earlier, an RVC decoder is described as a block diagram in FNL [10], an XML dialect that describes the structural network of interconnected actors from the standard MPEG Toolbox. The only two case studies performed so far by MPEG RVC experts [26] [14] are the RVC-CAL specifications of the MPEG-4 Simple Profile decoder and the MPEG-4 AVC decoder [7].

3.6.2 MPEG-4 Simple Profile (SP) Decoder

Figure 17 shows the network representation of the macroblock-based MPEG-4 Simple Profile decoder description. The parser is a hierarchical network of actors (each of them described in a separate FNL file). All other blocks are atomic actors programmed in RVC-CAL. Figure 17 presents the structure of the MPEG-4 Simple Profile ADM as described within RVC. Essentially, it is composed of four main parts: the parser, a luminance component (Y) processing path, and two chrominance component (U, V) processing paths. Each path is composed of a texture decoding engine as well as a motion compensation engine (both are hierarchical RVC-CAL Functional Units).

Fig. 17 MPEG-4 Simple Profile decoder description. [Figure: the input bitstream enters the PARSER, which feeds per-component TEXTURE DECODING and MOTION COMPENSATION engines for Y, U, and V; a MERGE block assembles the decoded data.]
The MPEG-4 Simple Profile abstract decoder model, which is essentially a dataflow program (Figure 17, Table 3), is composed of 27 atomic FUs (actors, in dataflow programming terms) and 9 sub-networks (actor/network compositions); atomic actors can be instantiated several times, and there are 42 actor instantiations in this dataflow program. Figure 22 shows a top-level view of the decoder. The main functional blocks include the bitstream parser, the reconstruction block, the 2D inverse cosine transform, the frame buffer, and the motion compensation module. These functional units are themselves hierarchical compositions of actor networks.

3.6.3 MPEG-4 AVC Decoder

MPEG-4 Advanced Video Coding (AVC), also known as H.264 [30], is a state-of-the-art video compression standard. Compared to previous coding standards, it is able to deliver higher video quality for a given compression ratio, and about 30% better compression than MPEG-4 SP for the same video quality. Because of its complexity, many applications, including Blu-ray, iPod video, HDTV broadcasts, and various computer applications, use variations of the MPEG-4 AVC codec (called profiles). A popular use of MPEG-4 AVC is the encoding of high-definition video content. Because of the high resolutions that must be processed, HD video is the application that requires the highest decoding performance. Common HD formats include the 720p (1280x720) and 1080p (1920x1080) resolutions, with frame rates between 24 and 60 frames per second. The decoder introduced in this section corresponds to the Constrained Baseline Profile (CBP). This profile is primarily targeted at lowest-cost applications and corresponds to the subset of features that are common to the Baseline, Main, and High Profiles. The description of this decoder exposes as much parallelism as possible and mimics the MPEG-4 SP description. The description is composed of different hierarchical levels. Fig. 18 shows a view of the highest hierarchical level of the MPEG-4 AVC decoder; note that, for readability, a single input in the figure represents a group of inputs carrying similar information into each actor. The main functional blocks include a parser, one luma decoder, and two chroma decoders. The parser analyzes the syntax of the bitstream with a given formal grammar. This grammar, written by hand, will later be supplied to the parser through a BSDL [25] description. As the execution of a parser strongly depends on the context of the bitstream, the parser incorporates a finite state machine so that it can sequentially extract the information from the bitstream. This information passes through an entropy decoder and is then encapsulated in several kinds of tokens (residual coefficients, motion vectors, etc.). These tokens are finally sent to the selected input port of the luma/chroma decoding actors. Because decoding a luma/chroma component does not require sharing information with the other luma/chroma components, each luma/chroma decoding process is encapsulated in a single actor.
This means that each decoding actor can run independently of the others, at the same time, in a separate thread. All of the decoding component actors have the same structure.

Fig. 18 Top view of the MPEG-4 Advanced Video Coding decoder description. [Figure: the Parser receives the input bitstream (BITS) and distributes residual coefficients (COEF), motion vectors (MV), prediction mode selections (PREDselect), and memory management control operations (MMCO) to a Y decoder, a Cb decoder, and a Cr decoder, which produce the decoded Y, Cb, and Cr video components.]
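To make the parser's first task concrete, the sketch below scans a byte stream for NAL-unit start codes and unpacks the one-byte NAL header whose fields appear in the BSDL fragment of Fig. 15. This is an illustrative C fragment of ours, not part of the normative RVC parser, and it ignores emulation-prevention bytes and real buffer management.

    #include <stddef.h>
    #include <stdio.h>

    static void parse_nal_headers(const unsigned char *buf, size_t n) {
        for (size_t i = 0; i + 3 < n; i++) {
            /* three-byte start code prefix 0x000001 */
            if (buf[i] == 0 && buf[i+1] == 0 && buf[i+2] == 1) {
                unsigned char hdr = buf[i+3];
                unsigned forbidden_zero_bit = (hdr >> 7) & 0x01;
                unsigned nal_ref_idc        = (hdr >> 5) & 0x03;
                unsigned nal_unit_type      =  hdr       & 0x1F;
                printf("NAL unit: ref_idc=%u type=%u (forbidden=%u)\n",
                       nal_ref_idc, nal_unit_type, forbidden_zero_bit);
                i += 3; /* skip past the start code */
            }
        }
    }

    int main(void) {
        /* a fabricated byte stream with one start code, for illustration */
        const unsigned char stream[] = { 0x00, 0x00, 0x01, 0x67, 0x42 };
        parse_nal_headers(stream, sizeof stream);
        return 0;
    }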
Luma/chroma decoding actors (Fig. 19) decode a picture and store the decoded picture for later use in the inter-prediction process. Each component owns the memory needed to store pictures, encapsulated in the Decoded Picture Buffer (DPB) actor. The DPB actor also contains the deblocking filter and acts as a buffering solution to regulate and reorganize the resulting video flow according to the Memory Management Control Operations (MMCO) input.
Fig. 19 Structure of decoding actors. [Figure: inside each decoding actor, a Prediction FU (fed by MV, PREDselect, and reference data RD) and an Inverse Transform FU (fed by COEF) produce, respectively, predicted macroblocks (MBpred) and residual data (RES); the Decoded Picture Buffer combines them, under MMCO control, into the output video and the reference data fed back to prediction.]
The Decoded Picture Buffer creates each frame by adding the prediction data, provided by the Prediction actor, to the residual data, provided by the Inverse Transform actor. The Prediction actor (Fig. 20) encompasses the inter/intra prediction modes and a multiplexer that sends the prediction results to the output port. The PREDselect input port serves to invoke the appropriate actors according to the prediction mode. The goal of this structure is to allow quasi-static operation of the global actor and, by adding or removing prediction modes, to easily switch between configurations of the decoder. For instance, adding the B inter-prediction mode to this structure switches the decoder to the Main Profile configuration.
Fig. 20 Structure of prediction actor. [Figure: the PREDselect, MV, and RD inputs are routed to Intra16x16 Prediction, Intra4x4 Prediction, and Inter Prediction FUs; a Mux Intra/Inter Prediction FU selects among PREDintra16x16, PREDintra4x4, and PREDinter to produce the MBpred output.]
4 The procedures and tools supporting decoder implementations

4.1 OpenDF supporting framework

CAL is supported by a portable interpreter infrastructure that can simulate a hierarchical network of actors. This interpreter was first used in the Moses project [21]. Moses features a graphical network editor and allows the user to monitor actor execution (actor state and token values). As the project is no longer maintained, it has been superseded by an Eclipse environment composed of two tools/plugins: the Open Dataflow environment for CAL editing (OpenDF [15] for short) and the Graphiti editor for graphically editing the networks. One interesting and very attractive implementation methodology for MPEG RVC decoder descriptions is the direct synthesis of the standard specification. OpenDF is also a compilation framework. It provides a source of relevant applications of realistic size and complexity, and it enables meaningful experiments and advances in dataflow programming. More details on the software and hardware code generators can be found in [13] [31]. Today there exists a backend for the generation of HDL (VHDL/Verilog) [14] [13]. A second backend targeting ARM11 and embedded C is under development [22] as part of the EU project ACTORS [2]. It is also possible to simulate CAL models in the Ptolemy II [24] environment. Work on action synthesis and actor synthesis [26] [31] led to the creation of a new compiler framework called the Open RVC CAL Compiler [27]. This framework is designed to support multiple language front-ends, each of which translates actors written in RVC-CAL and FNL networks into an Intermediate Representation (IR), and multiple language back-ends, each of which translates the Intermediate Representation into a supported target language. The IR provides a dataflow representation that can be easily transformed into low-level languages. Currently, the only available backend is a C language backend.
Fig. 21 OpenDF tools. [Figure: from the RVC Abstract Decoder Model (CAL, FNL, BSDL), non-normative tools provide scheduling analysis (SDF, CSDF, BDF, DDF), simulators (Ptolemy II, OpenDF, Moses), a software code generator (C, ARM), and a hardware code generator (VHDL/Verilog).]
4.2 CAL2HDL synthesis Some of the authors have performed an implementation study [13], in which the RVC MPEG-4 Simple Profile decoder specified in C AL according to the MPEG RVC formalism has been implemented on an FPGA using a C AL-to-RTL code generator called Cal2HDL. The objective of the design was to support 30 frames of 1080p in the YUV420 format per second, which amounts to a production of 93.3 Mbyte of video output per second. The given target clock rate of 120 MHz implies 1.29 cycles of processing per output sample on average. The results of the implementation study were encouraging in that the code generated from the MPEG RVC C AL specification did not only outperform the handwritten reference in VHDL, both in terms of throughput and silicon area, but also allowed for a significantly reduced development effort. Table 1 shows the comparison between C AL specification and the VHDL reference implemented over a Xilinx Virtex 2 pro FPGA running at 100MHz. It should be emphasized that this counter-intuitive result cannot be attributed to the sophistication of the synthesis tool. On the contrary the tool does not perform a number of potential optimizations, such as for instance optimizations involving more than one actor. Instead, the good results appear to be yield by the implementation and development process itself. The implementation approach was based generating a proprietary implementation of the standard MPEG RVC toolbox composed of FUs of lower level of granularity. Thus the implementation methodology was to substitute the FU of the standard abstract decoder model of the MPEG-4 SP with an equivalent implementation, in terms of behavior. Essentially standard toolbox FUs were substituted with networks of FU described as actors of lower granularity. The initial design cycle of the proprietary RVC library resulted in an implementation that was not only inferior to the VHDL reference, but one that also failed to meet the throughput and area constraints. Subsequent iterations explored sev-
Fig. 22 Top-level dataflow graph of the proprietary implementation of the RVC MPEG-4 decoder. [Figure: the bitstream flows through the serialize, parser, acdc, idct2d, ddr, and motion blocks to produce the video output.]
At least for the considered implementation study, the benefit of short design cycles seems to outweigh the inefficiencies that resulted from high-level synthesis and the reduced control over implementation details.

Table 1 Hardware synthesis results for a proprietary implementation of an MPEG-4 Simple Profile decoder. The numbers are compared with a reference hand-written design in VHDL.

                  Size (slices, BRAM)  Speed (kMB/s)  Code size (kSLOC)  Dev. time (MM)
  CAL             3872, 22             290            4                  3
  VHDL            4637, 26             180            15                 12
  Improv. factor  1.2                  1.6            3.75               4

  kMB/s = kilo macroblocks per second; kSLOC = kilo source lines of code
In particular, the asynchrony of the programming model and its realization in hardware allowed for convenient experiments with design ideas. Local changes, involving only one or a few actors, do not break the rest of the system in spite of a significantly modified temporal behavior. In contrast, any design methodology that relies on a precise specification of timing (such as RTL, where designers specify behavior cycle by cycle) would have resulted in changes that propagate through the design. Table 1 shows the quality of the results produced by the RTL synthesis engine for the MPEG-4 Simple Profile video decoder. Note that the code generated from the high-level dataflow RVC description and the proprietary implementation of the MPEG toolbox actually outperforms the hand-written VHDL design in terms of both throughput and silicon area for an FPGA implementation.
4.3 CAL2C synthesis

Another synthesis tool, called Cal2C [26] [31] and currently available at [27], validates another implementation methodology for the MPEG-4 Simple Profile dataflow program provided by the RVC standard (Figure 17). The SW code generator, presented in detail in [26], uses the process network model of computation [16] to implement the CAL dataflow model. The compiler creates a multi-thread program from the given dataflow model, where each actor is translated into a thread and the connectivity between actors is implemented via software FIFOs. Although the generation provides correct SW implementations, inherent context switches occur during execution, due to the concurrent execution of threads, which may lead to inefficient SW execution if the granularity of the actors is too fine. The major problems with multi-threaded programs are discussed in [18]. A more appropriate solution, which avoids thread management altogether, is presented in [19] [23]. Instead of suspending and resuming threads based on the blocking-read semantics of process networks [17], actors are managed by a user-level scheduler that selects the sequence of actor firings. Before executing an actor, the scheduler checks whether it can fire, depending on the availability of tokens on its inputs and the availability of room on its outputs. If the actor can fire, it is executed (these two steps correspond to the enabling function and the invoking function of [23]). If the actor cannot fire, the scheduler simply tests the next actor to fire (the actors being sorted according to an appropriate given strategy), and so on; a simplified sketch of such a scheduler loop is shown after Table 2. The code generator based on this concept [31] is available at [27]. Its scheduler has the two following characteristics: (1) actor firings are checked at run time (the dataflow model is not scheduled statically), and (2) the scheduler executes actors following a round-robin strategy (actors are sorted a priori). In the case of the standard RVC MPEG-4 SP dataflow model, the generated mono-thread implementation is about 4 times faster than the one obtained with [26]. Table 2 shows that the synthesized C software is faster than the simulated CAL dataflow program (80 frames/s instead of 0.15 frames/s), and twice as fast as real-time decoding for the QCIF format (25 frames/s). However, it remains slower than the hardware description automatically synthesized by Cal2HDL [13].

Table 2 MPEG-4 Simple Profile decoder speed and SLOC.

                  Speed (kMB/s)  Clock speed (GHz)  Code size (kSLOC)
  CAL simulator   0.015          2.5                3.4
  Cal2C           8              2.5                10.4
  Cal2HDL         290            0.12               4
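The scheduler loop described above can be sketched in C as follows; the actor record and the function names are illustrative stand-ins of ours and do not reproduce the actual Cal2C output.

    #include <stdbool.h>
    #include <stddef.h>

    /* Illustrative actor record: an actor may fire when enough tokens are
     * available on its input FIFOs and enough room is free on its outputs. */
    typedef struct Actor {
        bool (*enabled)(struct Actor *self);  /* enabling function */
        void (*fire)(struct Actor *self);     /* invoking function */
    } Actor;

    /* Round-robin user-level scheduler: actors are sorted a priori and
     * tested in turn; no actor is ever suspended, so there are no thread
     * context switches. */
    static void schedule(Actor *actors[], size_t n) {
        for (;;) {
            bool fired = false;
            for (size_t i = 0; i < n; i++) {
                if (actors[i]->enabled(actors[i])) {
                    actors[i]->fire(actors[i]);  /* one actor firing */
                    fired = true;
                }
            }
            if (!fired)
                break;  /* no actor can fire: deadlock or end of stream */
        }
    }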
As described above, the MPEG-4 Simple Profile dataflow program is composed of 61 actor instantiations in the flattened dataflow program. The flattened network becomes a C file that currently contains a round-robin scheduler for the actor scheduling and the FIFO connections between actors. Each actor becomes a C file containing all of its actions/processing together with its overall action scheduling/control. The resulting numbers of SLOC are shown in Table 3. All of the generated files are successfully compiled by gcc. For instance, the "ParserHeader" actor inside the "Parser" network is the most complex actor, with multiple actions. The translated C file (with actions and state variables) comprises 2062 SLOC for both actions and action scheduling; for comparison, the original CAL file contains 962 lines of code.

Table 3 Code size and number of files automatically generated for the MPEG-4 Simple Profile decoder.

                       CAL   C actors  C scheduler
  Number of files      27    61        1
  Code size (kSLOC)    2.9   19        2
A comparison of the CAL descriptions (Tab. 4) shows that the MPEG-4 AVC CAL decoder is twice as complex in RVC-CAL as the MPEG-4 Simple Profile CAL description. Some parts of the model have already been redesigned in order to improve pipelining and parallelism between actors. A simulation of the MPEG-4 AVC CAL model on an Intel Core 2 Duo at 2.5 GHz is more than 2.5 times slower than that of the RVC MPEG-4 Simple Profile description. Compared to the MPEG-4 Simple Profile CAL model, the MPEG-4 AVC decoder has been modeled to use more of CAL's capabilities (for instance, processing several tokens in one firing) while staying fully RVC conformant. Owing to this increased complexity, the MPEG-4 AVC CAL model is the most reliable way to test the conformance and the efficiency of the current RVC tools. The current SW code generation of MPEG-4 AVC is promising, since up to 53 fps can be achieved.

Table 4 Code size and number of files automatically generated for the MPEG-4 AVC decoder.

                       CAL   C actors  C scheduler
  Number of files      43    83        1
  Code size (kSLOC)    5.8   44        0.9
5 Conclusion

This chapter has described the essential components of the ISO/IEC MPEG Reconfigurable Video Coding framework, which is based on the dataflow concept. The RVC MPEG tool library, which covers in modular form all video algorithms from the different MPEG video coding standards, shows that dataflow programming is an appropriate way to build complex heterogeneous systems from high-level system specifications. The MPEG RVC framework is also supported by a simulator and by software and hardware code synthesis. The CAL dataflow models used by the MPEG RVC standard are also particularly efficient for specifying signal processing systems in a very compact form compared to classical imperative languages. Moreover, CAL model libraries can be developed in the form of libraries of proprietary implementations of standard RVC components that describe architectural features of the desired implementation platform, thus enabling the RVC implementer/designer to work at a level of abstraction comparable to that of the RVC video coding algorithms. Hardware and software code generators then provide the low-level implementation of the actors and of the associated networks of actors for the different target implementation platforms (multi-core processors or FPGAs).
References

1. Open DataFlow Sourceforge Project. http://opendf.sourceforge.net/
2. Actors FP7 project: http://www.actors-project.eu
3. Bhattacharyya, S.S., Brebner, G., Janneck, J.W., Eker, J., von Platen, C., Mattavelli, M., Raulet, M.: OpenDF: a dataflow toolset for reconfigurable hardware and multicore systems. SIGARCH Comput. Archit. News 36(5), 29–35 (2008). DOI 10.1145/1556444.1556449
4. Bhattacharyya, S.S., Eker, J., Janneck, J.W., Lucarz, C., Mattavelli, M., Raulet, M.: Overview of the MPEG reconfigurable video coding framework. Journal of Signal Processing Systems (2009). DOI 10.1007/s11265-009-0399-3
5. Ding, D., Yu, L., Lucarz, C., Mattavelli, M.: Video decoder reconfigurations and AVS extensions in the new MPEG reconfigurable video coding framework. In: Signal Processing Systems, 2008. SiPS 2008. IEEE Workshop on, pp. 164–169 (2008). DOI 10.1109/SIPS.2008.4671756
6. Eker, J., Janneck, J.W.: CAL Language Report: Specification of the CAL Actor Language. Tech. Rep. UCB/ERL M03/48, EECS Department, University of California, Berkeley (2003)
7. Gorin, J., Raulet, M., Cheng, Y.L., Lin, H.Y., Siret, N., Sugimoto, K., Lee, G.: An RVC dataflow description of the AVC Constrained Baseline Profile decoder. In: IEEE International Conference on Image Processing, Special Session on Reconfigurable Video Coding. Cairo, Egypt (2009)
8. Graphiti Editor sourceforge: URL http://graphiti-editor.sf.net
9. International Standard ISO/IEC FDIS 23001-5: MPEG systems technologies — Part 5: Bitstream Syntax Description Language (BSDL)
10. ISO/IEC FDIS 23001-4: MPEG systems technologies — Part 4: Codec Configuration Representation (2009)
11. ISO/IEC FDIS 23002-4: MPEG video technologies — Part 4: Video tool library (2009)
12. Jang, E.S., Ohm, J., Mattavelli, M.: Whitepaper on Reconfigurable Video Coding (RVC). In: ISO/IEC JTC1/SC29/WG11 document N9586. Antalya, Turkey (2008). URL http://www.chiariglione.org/mpeg/technologies/mpb-rvc/index.htm
13. Janneck, J., Miller, I., Parlour, D., Roquier, G., Wipliez, M., Raulet, M.: Synthesizing hardware from dataflow programs. Journal of Signal Processing Systems (2009). DOI 10.1007/s11265-009-0397-5. URL http://dx.doi.org/10.1007/s11265-009-0397-5
14. Janneck, J.W., Miller, I.D., Parlour, D.B., Roquier, G., Wipliez, M., Raulet, M.: Synthesizing hardware from dataflow programs: An MPEG-4 simple profile decoder case study. In: Signal Processing Systems, 2008. SiPS 2008. IEEE Workshop on, pp. 287–292 (2008). DOI 10.1109/SIPS.2008.4671777
15. Thomas-Kerr, J., Burnett, I., Ritz, C., Devillers, S., De Schrijver, D., Van de Walle, R.: Is that a fish in your ear? A universal metalanguage for multimedia. IEEE Multimedia 14(2), 72–77 (2007). DOI 10.1109/MMUL.2007.38
MPEG Reconfigurable Video Coding
67
16. Kahn, G.: The semantics of a simple language for parallel programming. In: J.L. Rosenfeld (ed.) Information processing, pp. 471–475. North Holland, Amsterdam, Stockholm, Sweden (1974) 17. Kahn, G., MacQueen, D.B.: Coroutines and networks of parallel processes. In: IFIP Congress, pp. 993–998 (1977) 18. Lee, E.A.: The problem with threads. IEEE Computer Society 39(5), 33–42 (2006). DOI http://doi.ieeecomputersociety.org/10.1109/MC.2006.180 19. Lee, E.A., Parks, T.M.: Dataflow Process Networks. Proceedings of the IEEE 83(5), 773–801 (1995) 20. Lucarz, C., Amer, I., Mattavelli, M.: Reconfigurable Video Coding: Concepts and Technologies. In: IEEE International Conference on Image Processing, Special Session on Reconfigurable Video Coding. Cairo, Egypt (2009) 21. Moses project: URL http://www.tik.ee.ethz.ch/moses/ 22. von Platen, C., Eker, J.: Efficient realization of a cal video decoder on a mobile terminal (position paper). In: Signal Processing Systems, 2008. SiPS 2008. IEEE Workshop on, pp. 176–181 (2008). DOI 10.1109/SIPS.2008.4671758 23. Plishker, W., Sane, N., Kiemb, M., Anand, K., Bhattacharyya, S.S.: Functional DIF for Rapid Prototyping. In: Proceedings of the 2008 The 19th IEEE/IFIP International Symposium on Rapid System Prototyping - Volume 00, pp. 17–23. IEEE Computer Society (2008) 24. Ptolemy II: URL http://ptolemy.eecs.berkely.edu 25. Raulet, M., Piat, J., Lucarz, C., Mattavelli, M.: Validation of bitstream syntax and synthesis of parsers in the MPEG Reconfigurable Video Coding framework. In: Signal Processing Systems, 2008. SiPS 2008. IEEE Workshop on, pp. 293–298 (2008). DOI 10.1109/SIPS. 2008.4671778 26. Roquier, G., Wipliez, M., Raulet, M., Janneck, J.W., Miller, I.D., Parlour, D.B.: Automatic software synthesis of dataflow program: An MPEG-4 simple profile decoder case study. In: Signal Processing Systems, 2008. SiPS 2008. IEEE Workshop on, pp. 281–286 (2008). DOI 10.1109/SIPS.2008.4671776 27. The Open RVC CAL Compiler project sourceforge: URL http://orcc.sf.net 28. The OpenDF dataflow project sourceforge: URL http://opendf.sf.net 29. Thomas-Kerr, J., Janneck, J., Mattavelli, M., Burnett, I., Ritz, C.: Reconfigurable media coding: Self-Describing multimedia bitstreams. In: Signal Processing Systems, 2007 IEEE Workshop on, pp. 319–324 (2007). DOI {10.1109/SIPS.2007.4387565} 30. Wiegand, T., Sullivan, G., Bjontegaard, G., Luthra, A.: Overview of the H.264/AVC video coding standard. Circuits and Systems for Video Technology, IEEE Transactions on 13(7), 560–576 (2003). DOI {10.1109/TCSVT.2003.815165} 31. Wipliez, M., Roquier, G., Nezan, J.: Software code generation for the RVC-CAL language. Journal of Signal Processing Systems (2009). DOI 10.1007/s11265-009-0390-z. URL http://dx.doi.org/10.1007/s11265-009-0390-z
Signal Processing for High-Speed Links
Naresh Shanbhag, Andrew Singer, and Hyeon-min Bae
Abstract The steady growth in demand for bandwidth has resulted in data rates in the 10s of Gb/s in back-plane and optical channels. Such high-speed links suffer from impairments such as dispersion, noise, and nonlinearities. Due to the difficulty of implementing multi-Gb/s transmitters and receivers in silicon, high-speed links were conventionally implemented primarily with analog circuits employing minimal signal processing. However, the relentless scaling of feature sizes exemplified by Moore's Law has enabled the application of sophisticated signal processing techniques to both back-plane and optical links employing mixed analog and digital architectures and circuits. As a result, over the last decade, signal processing has emerged as a key technology for the advancement of low-cost, high data-rate, back-plane and optical communication systems. In this chapter, we provide an overview of some of the driving factors that limit the performance of high-speed links, and highlight some of the potential opportunities for the signal processing and circuits community to make substantial contributions in the modeling, design, and implementation of these systems.
Naresh Shanbhag, University of Illinois at Urbana-Champaign, Urbana, e-mail: [email protected]
Andrew Singer, University of Illinois at Urbana-Champaign, Urbana, e-mail: [email protected]
Hyeon-min Bae, KAIST, Daejeon, South Korea, e-mail: [email protected]

1 Introduction
Demand for high-bandwidth data networks continues to grow, requiring the deployment of a hierarchy of communication networks spanning 1000s of kilometers (ultra-long-haul (ULH) optical), through 100s of kilometers (metro optical links), to 10s of kilometers (wireless cellular), through the last-mile access networks (DSL and
HFC, and wireless), through 10s of meters (Ethernet LAN), to less than a meter (back-plane and chip-to-chip I/O) [1]. Interestingly, the structure of data aggregation and distribution in this network hierarchy has resulted in the highest data rates at its two extremes, ULH and metro optical (longest) and back-plane and I/O (shortest), compared to any other type of communication link. Indeed, today, both back-plane and optical links support data rates in excess of 10 Gb/s. We refer to such links as high-speed links. Also interesting is the fact that high-speed links have, in the last decade, seen an extensive application of signal processing techniques and VLSI architectures in the design of high-speed transceivers, simultaneously for both link types. This chapter discusses this recent trend and the role of signal processing and integrated circuits in the design of high-speed links.

While the data rates of both back-plane and optical links exceed those of all other digital communications media, in some respects they are among the least sophisticated, using simple on-off keying (OOK) or non-return-to-zero (NRZ) signaling, and baseband comparators for symbol-by-symbol clock and data recovery (CDR). For example, until recently, optical links did not employ any form of equalization, and back-plane links to this day do not incorporate any physical-layer forward error correction (FEC). In this respect, these links are quite primitive in that they have not leveraged the tremendous advances in statistical signal processing and communication techniques that have enabled a number of important advances in both wired and wireless communications, such as those in digital voice-band modems, cellular technologies such as GSM and CDMA, broadband enabling technologies such as cable modems and DSL modems, and the OFDM technology in use in a host of wireless digital transmission standards. The reason that simple modulation, transmitter, and receiver structures were suitable for high-speed links is that, at lower data rates and short distances, both back-plane and optical links act as infinite-bandwidth channels. As a result, optical links have been the transmission media of choice in backbone networks and are rapidly making in-roads into customer premises and enterprise networks, while back-plane links dominate system-level interconnect such as that in storage area networks. However, as data rates in these links rise above a few Gb/s into the 10 Gb/s range, the links begin to exhibit intersymbol interference (ISI), or dispersion, in addition to noise.

Figure 1 shows block diagrams of a high-speed back-plane link and an optical link. Much of the processing in both links is similar, even though significant differences exist. Serial data d[k] in the back-plane transmitter in Fig. 1(a) is clocked using a transmit phase-locked loop (PLL) and sent to a pre-emphasis filter/driver to pre-equalize the channel ISI. The back-plane channel is typically a copper trace, a few tens of inches long, running over an FR-4 dielectric, with vias or stubs connecting different metal layers and connectors linking line cards to the board. At the receiver, the channel output is processed by a receive amplifier, which filters out-of-band noise and amplifies the signal for subsequent processing. The receiver recovers a symbol-rate clock from the data using one of several clock recovery techniques. A linear equalizer (LE) or decision feedback equalizer (DFE) may further reduce ISI. The equalized output is sliced to make a decision.
Fig. 1 High-speed link configurations for: (a) a back-plane serial link, and (b) an optical fiber link.
In an optical transmitter, an information sequence d[k] is used to modulate the intensity of a laser source. For long-haul links (typically greater than 200 km), the optical signal propagates along multiple spans of fiber with repeated optical amplification to overcome attenuation. Also shown in Fig. 1(b) are data eye diagrams, which illustrate the transmitted and received symbol patterns as they would appear on an oscilloscope at various stages along the optical fiber. Note how an open transmit eye, from which correct decisions can easily be made using a simple slicing operation, begins to close due to dispersion and noise as the fiber length increases. Eye diagrams for back-plane links similarly indicate significant eye closure at the receiver input. As a result, high-speed links have recently begun to employ sophisticated signal processing and communication techniques such as equalization and FEC. In back-plane links, FEC is only recently being deployed [2, 3], while equalization [4, 5] was introduced within the last decade. For optical links, electronic dispersion compensation (EDC) [6] based on signal processing techniques has emerged as a technology of great promise for OC-192 (10 Gb/s) data rates and beyond. In addition, the unrelenting progress of semiconductor technology exemplified by Moore's Law has provided an implementation platform that is cost-effective, low-power, and high-performance. The next several decades of high-speed links will likely be dominated by the signal processing, architecture, and circuit techniques that are used both for signal transmission, modulation, and coding, as well as for equalization and decoding at the receiver.

The design of signal processing-enhanced back-plane and optical communication links presents unique challenges spanning algorithmic issues in transmitter and receiver design, mixed-signal analog front-end design, and VLSI architectures for
implementing the digital signal processing back-end. Though these are tethered applications, reducing power is important due to the stringent power budgets imposed by the systems, e.g., boards, transponders, and line cards, in which these devices reside. For example, high-speed I/O links have a power budget of a few tens of mW/Gb/s and a target BER ≤ 10⁻¹⁵, making their design extremely challenging. Thus, a cost-effective solution, i.e., a solution that meets the system performance specifications within the power budget, requires joint optimization of the signal processing algorithms, VLSI architectures, and analog and digital integrated circuit design. This chapter provides an overview of some of the driving factors governing the development of current and next-generation high-speed links, with a particular emphasis on the tools, models, and methods by which the signal processing and circuits communities can play an integral role.

The remainder of this chapter is structured as follows. In Section 2, system models describing the functionality of link components are developed to enable simulation and modeling of end-to-end back-plane and fiber links. Section 3 describes the application of signal processing techniques for combating dispersion and noise in high-speed links. Section 4 provides a case study of an OC-192 EDC-based receiver, and the chapter concludes with some remarks on emerging areas for research in high-speed links in Section 5.
2 System Models
In this section, we briefly discuss some of the salient characteristics of high-speed links that are of practical signal processing interest. We pay particular attention to models that can be used for system design, performance simulation, and experimentation. These are the essential building blocks from which a signal processing-based development can make a substantive, practical impact on current and future back-plane and optical links. The discussion is centered around the link components, viz., the transmitter, the channel, and the receiver.
2.1 The Transmitter
Both the back-plane and optical links primarily employ binary signaling techniques such as NRZ modulation. Unlike an optical link, the back-plane link is carrierless, i.e., the output of a back-plane transmitter is the baseband signal s(t) = Σ_k d[k] g(t − kT), where d[k] is the information sequence and g(t) is the baseband pulse (Fig. 2(a)), typically NRZ, that captures the frequency response of the transmit driver (see Fig. 1(a)). Furthermore, unlike traditional communication links, which are average-power limited, the back-plane link is intrinsically peak-power limited. This is because back-plane transmit drivers are integrated in modern-day low-voltage (1 V) CMOS technologies. Both optical and back-plane transmitters also include a transmit PLL that generates the transmit clock; the jitter on this clock plays a strong role in determining the BER in the back-plane receiver.
Fig. 2 Transmitter functional model for: (a) a back-plane channel, and (b) an optical channel.
Figure 2(b) describes the modulation process for an optical link. The transmitter output signal is given by

s(t) = Σ_k d[k] g(t − kT) cos(2π f_c t),   (1)
where f_c is the carrier frequency of the optical source, typically around 193 THz for a 1550 nm wavelength.¹ Note that the transmit signal spectrum for an NRZ transmitter has finite bandwidth. At such high data rates, the optical signal is difficult to extinguish completely when transmitting a zero. Thus, one measure of the quality of the laser is the extinction ratio E_r (see Fig. 2),

E_r = 10 log₁₀(P₁/P₀),   (2)

where P₁ and P₀ are the powers in a transmitted one and zero, respectively. Extinction ratios of 8 dB are typical for directly modulated lasers employed in short-reach applications, while long-haul applications employ externally modulated lasers using Mach-Zehnder modulators [7] with extinction ratios of around 15 dB. Unlike back-plane links, where a higher signal-to-noise ratio (SNR) usually implies a higher-quality signal and a lower bit-error rate (BER), in optical communications a higher optical signal-to-noise ratio (OSNR) may not provide a lower BER if the extinction ratio is poor. The power in the "signal" component of the OSNR includes that in both the ones and the zeros, while the BER of an optical link is a function of the difference in these optical powers.

¹ The center frequency is 228 THz for a 1310 nm wavelength, which can be obtained using the relationship f_c = c/λ.
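To make the transmitter model concrete, the following minimal Python sketch builds an NRZ intensity waveform with a finite extinction ratio and evaluates Eq. (2). The symbol rate, power levels, and bit pattern are illustrative assumptions, not values taken from the text.

```python
import numpy as np

T = 100e-12              # symbol period for a 10 Gb/s link (illustrative)
samples_per_bit = 16
P1, P0 = 1.0e-3, 0.1e-3  # optical powers (W) in a one and a zero (assumed)

Er_dB = 10 * np.log10(P1 / P0)               # extinction ratio, Eq. (2)

d = np.random.randint(0, 2, 64)              # information bits d[k]
power = np.repeat(np.where(d == 1, P1, P0),  # rectangular NRZ pulse g(t)
                  samples_per_bit)
t = np.arange(power.size) * (T / samples_per_bit)
print(f"extinction ratio = {Er_dB:.1f} dB")  # 10 dB for these powers
```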
2.2 The Channel
Channel modeling is a critical first step in developing transmitter and receiver algorithms, architectures, and circuits. Back-plane channel models tend to be simpler than those for fiber due to the presence of non-linearities in the latter. This subsection describes the back-plane and fiber channel models.

2.2.1 Back-plane Channel Model
Fig. 3 Back-plane channel model: (a) a circuit model of the link, and (b) a lossy transmission line element.
A complete back-plane channel begins at the output node of the transmit driver, and ends at the input node of the receive amplifier. Starting from the transmitter,
a simplified channel model (see Fig. 3(a)) includes the transmit-side pad capacitance CT, bond-wire inductance LT, a transmission line model of the FR-4 trace, an impedance ZT representing discontinuities due to vias and unterminated stubs, and, at the receiver, a coupling capacitor CC (in some cases), a receive termination resistor RT (either off-chip or on-chip), bond-wire inductance LR, and the receive-side pad capacitance CR. The transmission line model for the FR-4 trace is described by the Telegrapher's equations, which can be derived by applying KCL and KVL to the RLGC circuit element in Fig. 3(b):
−∂V(x,t)/∂x = R I(x,t) + L ∂I(x,t)/∂t   (3)
−∂I(x,t)/∂x = G V(x,t) + C ∂V(x,t)/∂t   (4)
where V(x,t) and I(x,t) are the voltage and current signals, respectively, at time t from an initial time t = 0 and distance x from the source, and R, L, G, and C are the series resistance, series inductance, shunt conductance, and shunt capacitance per unit length. The elements R and G result in loss of transmitted signal energy. In addition, the back-plane channel suffers from frequency-dependent loss mechanisms at frequencies above a few GHz owing to skin effect and dielectric losses. Skin effect arises due to the crowding of high-frequency current towards the surface of the conductor, leading to the resistance R becoming frequency-dependent, i.e., R = R(f) above a certain frequency. The skin depth is given by

δ = √(ρ/(π f μ)),   (5)

where μ and ρ are the permeability and resistivity, respectively, of the conductor. For f ≥ f_s, where f_s is the frequency at which δ = r_w (r_w is the radius of the FR-4 trace), skin effect kicks in and the resistance R(f) ≈ R(0)(0.5(r_w/δ) + 0.26). Otherwise, for f < f_s, R(f) = R(0). Dielectric loss occurs due to the absorption of electromagnetic energy in the dielectric surrounding the conductor. Dielectric loss is specified in terms of the loss tangent tan δ, which is the ratio of the resistive to the reactive component of the electromagnetic energy in the dielectric material, and is proportional to f. These loss mechanisms result in the back-plane channel being a band-limited low-pass channel, whose frequency response is given by

H(f) = exp(−γ(f) l),   (6)

where γ(f) is the frequency-dependent propagation coefficient that captures the skin-effect and dielectric-loss effects, and l is the trace length. Typical values of the parameters for an FR-4 trace are tan δ = 0.035 and ε_r = 4.3. Figure 4(a) shows that a typical 20″ FR-4 channel exhibits a loss of 20 dB at around 4 GHz, which is close to the −3 dB bandwidth of an NRZ pulse. This indicates that the back-plane channel suffers from severe ISI ranging over approximately 15-to-20 symbol periods.
Fig. 4 Time and frequency domain responses for a 20″ FR-4 back-plane channel: (a) data channel, and (b) cross-talk channel.
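The loss model of Eqs. (3)-(6) can be exercised numerically. In the sketch below, γ(f) is computed as √((R(f) + j2πfL)(G(f) + j2πfC)), the standard propagation coefficient implied by the Telegrapher's equations, with R(f) following the skin-effect rule above and G(f) capturing the dielectric loss. The RLGC values and trace geometry are illustrative guesses rather than measured FR-4 data, so the resulting curve only roughly mimics Fig. 4(a).

```python
import numpy as np

mu, rho, rw = 4e-7 * np.pi, 1.68e-8, 0.15e-3   # copper, assumed trace radius
R0, L, C = 2.0, 350e-9, 140e-12                # per-metre R, L, C (guesses)
tan_d, l = 0.035, 0.5                          # loss tangent; ~20" trace in m

f = np.linspace(1e6, 15e9, 2000)
delta = np.sqrt(rho / (np.pi * f * mu))        # skin depth, Eq. (5)
R = np.where(delta < rw, R0 * (0.5 * rw / delta + 0.26), R0)
G = 2 * np.pi * f * C * tan_d                  # dielectric loss
gamma = np.sqrt((R + 2j * np.pi * f * L) * (G + 2j * np.pi * f * C))
H_dB = 20 * np.log10(np.abs(np.exp(-gamma * l)))   # Eq. (6)
print(f"loss at 4 GHz ~ {H_dB[np.argmin(np.abs(f - 4e9))]:.1f} dB")
```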
In addition, impedance discontinuities due to vias, stubs, and connectors cause reflections, which result in notches in the frequency response and further aggravate the interference problem. The back-plane channel also experiences far-end cross-talk from neighboring traces, as shown in Fig. 4(b).

2.2.2 Optical Fiber Channel Model
Optical fiber comes in two types: single-mode fiber (SMF) and multi-mode fiber (MMF). Single-mode fiber permits only a single optical mode (see Fig. 5(a)) to propagate, thereby enabling greater propagation distances before pulse spreading occurs. Thus, SMF is employed in long-reach applications and some short-reach ones [8]. Longer-reach applications use 1550 nm wavelengths, as this represents the wavelength of minimum loss due to absorption. Propagation of the optical signal over SMF is governed by the nonlinear Schrödinger equation (NLSE) [8]. Figure 5(a) shows that the propagating wave is supported by two orthogonal axes of polarization, or polarization states, within the fiber.
Signal Processing for High-Speed Links
77
Y-polarization
Y-polarization (t)
Fiber
z
z
(1-
2 1/2
)
X-polarization
X-polarization
(a) 1 Precursor Symmetrical Postcurso r
0.9 0.
δ(t)
cladding
0.8 0.
Mode 1
Mode 2 core
Amplitude
0.7 0. 0.6 0. 0.5 0. 0.4 0. 0.3 0. 0.2 0.
cladding
0
t
0.1 0. 0 0
Z
0.5 0.
1
1.5 1.
2
2.5 2.
3
3.5 3.
4
4.5 4.
5
Time (UI)
(b) Fig. 5 Optical fiber: (a) signal propagation single-mode fiber, and (b) in multi-mode fiber. The outputs shown are three sample impulse responses developed by the IEEE 802.3aq working group to model modal dispersion.
A fraction γ² of the light energy couples into one of the polarization states, while the remainder (1 − γ²) couples into the other polarization state. These two states experience different group velocities, resulting in pulse arrival times at the receiver that are separated by Δτ(t). This differential group delay (DGD) experienced by the two states gives rise to polarization mode dispersion (PMD), causing pulse spreading at the output of the photodetector in the receiver. A fiber of length L with attenuation α and PMD results in the following time-varying impulse response:

h(t, τ) = e^(−ln(10) α L / 20) [ γ δ(t − τ₀) + √(1 − γ²) δ(t − τ₀ − Δτ(t)) ],   (7)

where τ₀ is the baseline delay of the fiber and Δτ(t) is the instantaneous DGD of the fiber due to first-order PMD. The mean DGD of the fiber, expressed in ps/√km, gives a measure of the expected delay between the two polarization axes. A typical value of the mean DGD is 1 ps/√km [9]. A good model of the variability of PMD over time is captured by [10]:

f(τ) = (32 τ² / (π² τ_m³)) exp(−4τ² / (π τ_m²)),

where f(τ) is the marginal distribution and τ_m is the mean DGD. We see that the instantaneous DGD of a fiber will exceed three times its mean DGD with probability 4.2 × 10⁻⁵. Chromatic dispersion (CD) occurs due to the dependence of the index of refraction on frequency. This gives rise to a difference in the velocities with which the frequencies within the transmitted signal spectrum propagate, resulting in pulse spreading in time. The resulting equivalent baseband linear time-invariant model for the channel is given by
H_CD(f) = exp(−j π λ² D L f² / c),   (8)
where D is the dispersion parameter at wavelength λ, L is the fiber length, and f is the equivalent baseband frequency. Typical fiber parameter values are listed in Table 1.

Table 1 Typical parameter values for single-mode fiber used in long-haul communications.

  parameter                   symbol                           value
  wavelength                  λ                                1550 nm
  propagation velocity        c                                2.997925 × 10⁸ m/s
  fiber attenuation           α                                0.2 × 10⁻³ dB/m
  dispersion                  D                                17 ps/nm-km
  second-order dispersion     β₂ = −Dc(10⁻⁶)/(2π f_c²)         20 × 10⁻²⁷ s²/m
  dispersion slope            β₃                               1.1 × 10⁻⁴⁰ s³/m
  center frequency            f_c                              193.1 × 10¹² Hz
  nonlinear refractive index  n₂                               2.6 × 10⁻²⁰ m² W⁻¹
  core area                   A                                90 × 10⁻¹² m²
  nonlinearity                γ = 2π n₂/(λ A)                  5 × 10⁻³ m⁻¹ W⁻¹
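As an illustration, Eq. (8) can be applied to a sampled baseband field envelope in the frequency domain via the FFT. The sketch below uses the Table 1 values for λ, c, and D; the pulse shape, fiber length, and sample period are arbitrary choices made for the example.

```python
import numpy as np

c, lam = 2.997925e8, 1550e-9
D = 17e-6                 # 17 ps/nm-km expressed in s/m^2
L, Ts = 100e3, 10e-12     # 100 km span, 10 ps sample period (illustrative)

n = 4096
t = (np.arange(n) - n // 2) * Ts
x = np.exp(-(t / 50e-12) ** 2)               # a 50 ps Gaussian test pulse

f = np.fft.fftfreq(n, d=Ts)                  # equivalent baseband frequency
H_cd = np.exp(-1j * np.pi * lam**2 * D * L * f**2 / c)   # Eq. (8)
y = np.fft.ifft(np.fft.fft(x) * H_cd)        # dispersed output envelope
print(f"peak reduced from {abs(x).max():.2f} to {abs(y).max():.2f}")
```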
A simple model for signal propagation in MMF captures the coupling of the modulated laser source onto a set of N_m propagating modes, each of which propagates a slightly different distance through the fiber (see Fig. 5(b)). If the detector area is sufficiently large, then the output of a square-law detector yields a sum of the powers of the modal arrivals, since the modes are spatially orthogonal. The following finite-length impulse response model for MMF propagation can be obtained when modal noise and mode coupling are ignored:

h(t) = Σ_{k=0}^{N_m−1} e^(−ln(10) α L / 20) h_k δ(t − τ₀ − τ_k),   (9)

where Σ_{k=0}^{N_m−1} |h_k|² = 1, N_m is the number of propagating modes, an amount equal to |h_k|² of the incident power is coupled into the kth mode, τ_k is the differential modal delay with respect to the overall baseline delay τ₀ induced by the wavelength of the source, α is the attenuation, and L is the fiber length. In the presence of modal coupling, the terms h_k become time-varying and can be modelled as Rayleigh fading under suitable assumptions on mode orthogonality [11].
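The MMF model of Eq. (9) is straightforward to construct numerically. In the sketch below, the mode powers and delays are arbitrary illustrative choices (not the IEEE 802.3aq responses of Fig. 5(b)), and the delta functions are softened to narrow pulses so the response can be evaluated on a time grid.

```python
import numpy as np

alpha, L = 0.2e-3, 300.0        # attenuation (dB/m), fiber length (m)
tau0 = 1.5e-6                   # baseline delay (s), assumed
tau = np.array([0.0, 30e-12, 75e-12])      # differential modal delays tau_k
h = np.array([0.7, 0.5, 0.2])
h = h / np.sqrt(np.sum(h**2))              # enforce sum |h_k|^2 = 1

def impulse_response(t_grid, width=5e-12):
    """Eq. (9) with delta functions softened to narrow Gaussians."""
    gain = np.exp(-np.log(10) * alpha * L / 20)
    out = np.zeros_like(t_grid)
    for hk, tk in zip(h, tau):
        out += gain * hk * np.exp(-((t_grid - tau0 - tk) / width) ** 2)
    return out

t = tau0 + np.linspace(-50e-12, 150e-12, 400)
print(f"peak of h(t): {impulse_response(t).max():.3f}")
```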
2.3 Receiver Models
The receiver processing in both back-plane and optical links begins with a filter that removes out-of-band noise. In back-plane systems, the receive filter is an electrical component, which also serves as an amplifier and an equalizer. In an optical link, the optical signal is optically filtered and then passed on to a device for
optical-to-electrical (O/E) conversion, which generates an electrical current signal. We refer to this device as the O/E detector in order to distinguish it from statistical detectors, such as slicers, that appear later in the receive chain. A transimpedance amplifier (TIA) then converts the detector current into a voltage signal for subsequent processing. Both links have synchronization sub-systems referred to as a clock-recovery unit (CRU), which recovers a baud-rate clock from the received signal. The output clock of the CRU is employed to sample the output of the receive filter (or the TIA in the case of an optical link), using a decision-making latch or slicer. The combination of the CRU and slicer is referred to as clock and data recovery (CDR). A CDR, including the front-end filter (the TIA in the case of optical links), implements the bulk of the processing in optical and electrical receivers. However, a CDR employs little or no signal processing and thus fails to achieve the requisite BER in the presence of ISI. As a result, modern-day receivers for both link types include additional components, such as equalizers (both digital and analog) and ADCs, in addition to a CDR. In the optical context, such signal processing-enhanced links are called EDC-based links.

2.3.1 O/E Detectors
Fig. 6 A block diagram model for a photodetector receiver is shown as an ideal square-law detector followed by a low-pass filter.
Typical optical detectors include the avalanche photodiode (APD) and the PIN diode, each of which produces an electrical current proportional to the incident optical power. Intensity modulation can be modeled as E(t) = m(t) cos(2π f_c t), with f_c around 200 THz, as in Fig. 6. A photodetector, in the absence of noise and dispersion, can be viewed as a bandlimited square-law device, approximately producing an output current
i(t) = R ∫_{t−τ}^{t} |E(t′)|² dt′ = R ∫_{t−τ}^{t} (|m(t′)|²/2) dt′,

where R is the responsivity of the detector (typically between 0.7 A/W and 0.85 A/W for APD and PIN receivers) and the integration window τ reflects the finite integration bandwidth of the detector; 1/τ ≪ f_c and is typically on the order of 10 GHz for a 10 Gb/s link, for example. The double-frequency term induced by the quadratic detector is therefore filtered out, leaving only the baseband demodulated intensity signal m²(t), in the absence of dispersion. Since m(t) > 0 for intensity modulation, the data can be completely recovered from m²(t) by a simple CDR, in the absence of noise and dispersion.

2.3.2 Analog Front-End (AFE)
The AFE refers to the linear signal processing blocks in the receive path. The back-plane AFE consists of a variable gain amplifier (VGA) for signal amplification. Usually, the receive amplifier also provides some high-frequency peaking in order to compensate for the channel roll-off, and limits out-of-band noise. In EDC-based optical links, the AFE consists of a cascade of the TIA and the VGA. The TIA is a linear amplifier that takes an input current i_in(t) and generates an output voltage v_o(t), where
v_o(t) = −i_in(t) R_f A_v/(1 + A_v),   (10)

where R_f A_v/(1 + A_v) is the transimpedance (in ohms) and A_v is the open-loop gain of the TIA amplifier kernel. Typical values of R_f reside in the range 800 Ω to 2 kΩ. The 3 dB bandwidth of the TIA, f_c,tia, is typically designed to be approximately 70% of the symbol rate in order to achieve a balance between ISI and SNR. The performance of the TIA is also specified in terms of its sensitivity. A VGA is typically used to boost the receive signal level if subsequent processing will include more than a simple CDR. Linearity of the VGA (for back-plane), and of the detector-TIA-VGA combination (for an optical link), is important if an ADC is used, in order to preserve the information. The VGA can be modeled as a transconductance stage with an output load resistor R_L and an output capacitance C_L. The low-frequency gain of such a stage is given by
A_v = −g_m R_L,   (11)

where g_m is the transconductance of the transistor, and its 3 dB bandwidth is given by

f_c,vga = 1/(2π R_L C_L).   (12)
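The AFE expressions above reduce to simple arithmetic. The following worked sketch evaluates the TIA transimpedance of Eq. (10) and the VGA gain and bandwidth of Eqs. (11)-(12); the component values are illustrative, chosen near the ranges quoted in the text.

```python
import numpy as np

Rf, Av = 1.0e3, 20.0                      # feedback resistor, open-loop gain
Z_tia = Rf * Av / (1 + Av)                # transimpedance (ohms), Eq. (10)

gm, RL, CL = 20e-3, 200.0, 100e-15        # assumed VGA stage parameters
Av_vga = -gm * RL                         # low-frequency gain, Eq. (11)
fc_vga = 1 / (2 * np.pi * RL * CL)        # 3 dB bandwidth, Eq. (12)

print(f"TIA transimpedance ~ {Z_tia:.0f} ohm")
print(f"VGA gain = {Av_vga:.0f}, bandwidth ~ {fc_vga / 1e9:.1f} GHz")
```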
2.3.3 Analog-to-Digital Converter (ADC)
An ADC is typically modeled as an ideal sampler followed by a memoryless quantizer. Such models do not account for the many non-idealities that exist in practice, especially in the multi-GHz regime. A practical ADC consists of a band-limited front-end, a sampler, and a quantizer, and may exhibit hysteresis and other pattern-dependent behavior. Non-idealities of the ADC are often aggregated in terms of the effective number of bits (ENOB) of the ADC, which attempts to quantify the resulting quantization noise as if it were additive and independent of the input. At 10 GS/s, ADCs with 3-to-6 bits of resolution have been produced for both optical and back-plane channels using SiGe bipolar processes [12, 13] (optical only), and more recently in CMOS for back-plane [14], optical [15], and both [16], illustrating the strong emergence of signal processing-enhanced solutions for both link types.

2.3.4 Clock-Recovery Unit
The design of a high-performance CRU is critical in high-speed links. Unlike in other communication links such as DSL or wireless, the CRU in back-plane and optical links can dominate the complexity of the receiver. This is especially true for a CDR-based receiver. The CRU generates symbol timing information and the receiver clock from the received analog signal in the presence of dispersion, noise, and input jitter. A PLL forms the basis of the operation of a CRU. A PLL can be viewed as an extremely narrowband filter with its passband centered at a frequency equal to the symbol rate.
Fig. 7 Phase-locked loop (PLL): (a) linear, and (b) binary or bang-bang PLL.
A traditional PLL is referred to as a linear PLL because the PLL functionality in the phase domain can be easily described using linear feedback system theory. The linear PLL in Fig. 7(a) employs a linear phase-detector (PD) to compute the instantaneous phase error Δφ(t) between the input phase φ_in(t) and the clock phase φ_vco(t). The phase error is integrated using a low-pass filter (LPF) whose output is used to control the input voltage, and hence the frequency, of a voltage-controlled oscillator (VCO). At high speeds, the linear PD is unable to generate the narrow pulses that represent small phase errors. Thus, most high-speed links employ a binary or bang-bang (BB) PLL (see Fig. 7(b)). The operation of a BB-PLL is best understood by considering a 1st-order loop, which is obtained by eliminating the slow path in Fig. 7(b). In a 1st-order BB-PLL, a quantized, typically 3-level, phase error Δφ_q(t) is generated using a BB phase-detector (BB-PD). The BB-PD ensures that the VCO frequency switches between two extremes f_nom ± f_bb, also called the lock range, where f_nom is the nominal data frequency, which is made equal to the symbol rate, and f_bb is the frequency step. Thus, unlike in a linear PLL, the VCO in a BB-PLL operates at two discrete frequencies separated by 2 f_bb. A 1st-order BB-PLL locks as long as the average data frequency lies in the lock range. Otherwise, the VCO starts to hunt for the right frequency and in the process generates hunting jitter, whose amplitude is proportional to the lock range. A separate frequency-acquisition loop (not shown) is usually implemented in order to acquire the symbol frequency. Even then, high-frequency jitter wander, as dictated by the jitter tolerance requirements, can push the BB-PLL out of lock, resulting in hunting jitter. This presents an intrinsic limitation of a 1st-order BB-PLL, in that its lock range and its output jitter amplitude are tied together via a single parameter f_bb. The 2nd-order BB-PLL is able to achieve the conflicting goals of having a large lock range and minimizing the output jitter amplitude by processing the quantized phase error through two paths: (1) a direct/proportional path for fast updates, and (2) an integral path for slow updates. The proportional path is able to track high-frequency jitter such as that due to data, while the slow path manages to track the jitter wander, thereby resulting in small output jitter.

CRUs in optical links have much more stringent requirements than those in back-plane channels. This is because, in long-haul applications, the CRU needs to satisfy SONET (Synchronous Optical NETwork) specifications on jitter transfer, jitter tolerance, and jitter generation. Jitter manifests itself as a random spread in the zero-crossings of a signal. The input jitter J_in is a function of the input SNR (SNR_in), for which an approximate relationship is given by

J_in = k T_tr/√(SNR_in),   (13)

where T_tr is the transition period and k is a dimensionless constant. A typical CRU uses a wideband PLL for tracking the jitter in the input signal so that the correct sampling phase is used for the data. However, the resulting recovered clock at the output of the PLL will contain jitter. A second, narrowband PLL is used to clean up the jitter in the clock at the output of the first PLL.
This requires the use of a first-in-first-out (FIFO) buffer at the output of the sampler to align the data and the output clock.
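A behavioral sketch of the 1st-order bang-bang loop described above is given below. All loop parameters are illustrative assumptions, the model tracks phase in unit intervals (UI) only, and the point of the example is the hunting behavior: after lock, the phase error oscillates with an amplitude set by the frequency step f_bb.

```python
import numpy as np

f_data = 10.000e9          # average data frequency
f_nom = 10.000e9           # nominal VCO frequency (inside the lock range)
f_bb = 5e6                 # frequency step; lock range f_nom +/- f_bb
T = 1 / f_data

phi_in, phi_vco = 0.0, 0.3   # phases in unit intervals (UI)
trace = []
for _ in range(2000):
    err = np.sign(phi_in - phi_vco)          # bang-bang phase detector
    f_vco = f_nom + f_bb * err               # two discrete VCO frequencies
    phi_vco += (f_vco - f_data) * T          # phase drift per symbol (UI)
    trace.append(phi_in - phi_vco)

# The residual oscillation is the hunting jitter tied to the lock range.
print(f"steady-state error swing ~ {np.ptp(trace[-100:]):.2e} UI")
```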
2.4 Noise Models
High-speed links suffer from various noise sources. In addition to ISI, these noise sources fundamentally limit the achievable BER.

2.4.1 Back-plane Noise Models
Noise in back-plane links is dominated by [17]: (1) jitter in the recovered clock, and (2) thermal noise from the AFE and the input termination (see Fig. 3(a)). Jitter refers to the random variation in the zero-crossings of a nominally periodic signal. Clock jitter in the receiver is caused by a number of sources, such as jitter in the transmit PLL, noise in the channel and power supply, and jitter generated by the receive PLL in the CRU. Jitter analysis [18] is an important aspect of back-plane link design. It can be shown that the sampled received signal in the presence of jitter and ISI is given by

y[n] = Σ_k d[k] h_g[n−k] − Σ_k v[k] q[k] h_t₀[n−k] − s[n] Σ_k v[k] h_t₀[n−k],   (14)

where h_g[k] is the sampled pulse response (the composite of the transmit shaping g(t) and the channel impulse response h(t)), h_t₀[k] = h(t₀ + kT) is the sampled channel impulse response with a sampling offset of t₀, v[k] = d[k] − d[k−1] is the data transition, q[k] is the jitter in the transmit clock, and s[n] is the jitter in the recovered clock. The first term in (14) represents ISI, while the second and third terms represent the impact of jitter. Note that jitter appears as an additional noise source in the received signal, and that it appears only when there is a data transition (non-zero v[k]).

Thermal noise in back-plane links is generated by the termination resistors (R_T in Fig. 3(a)), the sampler, and the receive amplifier. The voltage noise PSD (in V²/Hz) is given by

N_o = 4kT(R_T + R_sw + γ/g_m),   (15)

where R_T, R_sw, g_m, and γ are the termination resistance, sampler resistance, input-stage transconductance, and a technology-dependent parameter, respectively. Assuming typical values of R_T = 50 Ω, R_sw = 200 Ω, g_m = 1 mA/V, and γ = 1, we get N_o = 2 × 10⁻¹⁷ V²/Hz. In addition, the slicer imposes a minimum input voltage requirement, referred to as the slicer resolution, needed to resolve the input signal. The slicer resolution is a function of the static offsets that occur in differential circuit structures due to transistor mismatch. Typical values of uncorrected offsets are in the range of 10 mV.
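The numerical claim above is easy to verify. A worked check of Eq. (15) with the typical values quoted in the text:

```python
# Thermal-noise PSD of Eq. (15) with the typical values from the text.
k_B, T = 1.38e-23, 300.0        # Boltzmann constant, temperature (K, assumed)
RT, Rsw, gm, gamma = 50.0, 200.0, 1e-3, 1.0

No = 4 * k_B * T * (RT + Rsw + gamma / gm)   # V^2/Hz
print(f"No = {No:.2e} V^2/Hz")               # ~2e-17 V^2/Hz, matching the text
```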
Fig. 8 Block diagram of an optical link model showing primary sources of noise (FA: fiber amplifier; PD: photodetector).
2.4.2 Optical Fiber Noise Models
In amplified links, the dominant source of noise as observed at the receiver is amplified spontaneous emission noise n_o(t), or ASE noise (see Fig. 8). In unamplified links, the dominant source of noise is the electrical receiver noise n_e(t), which encompasses all sources of noise in the receiver, including those of the detector, the TIA, and subsequent electronics. The OSNR is computed as the ratio of the signal power in the transmitted optical signal to the ASE noise power in a reference bandwidth at a reference frequency adjacent to the transmitted signal. The typical noise reference bandwidth used is 0.1 nm, which is approximately 12.5 GHz for 1550 nm transmission.² As such, to simulate a given OSNR level, the noise variance used in a T_s-sampled discrete-time simulation should be

σ²_ase = E{|E(t)|²} 10^(−OSNR/10) (W_samp / W_ref),

where W_samp = 1/T_s is the sampling rate of the numerical simulation, W_ref = 12.5 GHz for a 0.1 nm noise resolution bandwidth, and OSNR is the target OSNR. Since the signal propagates along two orthogonal axes of polarization, to adequately model the ASE noise a separate noise simulation can be run for each polarization axis with the same target OSNR and the results added after detection. For directly detected optical signals, additive Gaussian noise in the optical domain becomes signal-dependent and non-Gaussian in the electrical domain. The conditional probability density function for the detector output r(t) given the value of the noise-free channel output y(t) can be shown to be a signal-dependent non-central chi-square distribution with 2M degrees of freedom [19]. For modeling or numerical simulation, each of the component models given in this section can be combined to create an end-to-end system model. The system model can be employed to design transmitter and receiver algorithms, VLSI architectures, and integrated circuits for high-speed links, in order to meet system specifications.
² This is computed by noting that f = c/λ and therefore |Δf| = (c/λ²)|Δλ| = 12.478 GHz, for λ = 1550 nm and |Δλ| = 0.1 nm.
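A minimal sketch of setting the per-sample ASE noise variance for a T_s-sampled simulation, per the expression above. The signal power, sample period, and target OSNR are illustrative, and one independent complex-Gaussian noise field is generated per polarization axis, as suggested in the text.

```python
import numpy as np

Ts = 10e-12                 # simulation sample period (assumed)
W_samp = 1 / Ts             # 100 GHz simulation bandwidth
W_ref = 12.5e9              # 0.1 nm reference bandwidth at 1550 nm
OSNR_dB = 20.0              # target OSNR (illustrative)
E_sig_pow = 1.0e-3          # E{|E(t)|^2} of the simulated field (assumed)

var_ase = E_sig_pow * 10 ** (-OSNR_dB / 10) * (W_samp / W_ref)

# One complex-Gaussian noise field per polarization axis, added before
# square-law detection; the detected results are then combined.
n = 4096
noise_x = np.sqrt(var_ase / 2) * (np.random.randn(n) + 1j * np.random.randn(n))
noise_y = np.sqrt(var_ase / 2) * (np.random.randn(n) + 1j * np.random.randn(n))
print(f"per-sample ASE variance = {var_ase:.3e}")
```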
3 Signal Processing Methods
In the last several years, a number of efforts have emerged to bring signal processing into back-plane and optical communication links. In this section, we explore some of the basic models and structures that have been proposed and brought into practice that rely on digital signal processing methods to improve the performance of high-speed links.
3.1 Modulation Formats
NRZ is the dominant modulation format for both back-plane and optical links. However, other, more spectrally-efficient techniques have also been explored. In back-plane links, modulation formats such as 4-PAM [20] and, more recently, duobinary (DB) [21, 22] have been explored. Duobinary has also been explored for optical links [23]. Duobinary modulation refers to the use of a controlled amount of ISI in the transmitted signal or in the composite channel response. In back-plane channels, duobinary is implemented by modulating the amplitude, while in optical links it is implemented using phase modulation. The transmitted power spectrum of DB-modulated signals is roughly half the width of that of NRZ modulation, resulting in reduced ISI induced by the channel.
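To make the duobinary idea concrete, the following sketch shows a standard modulo-2 precoder followed by a (1 + D) partial-response filter and the corresponding three-level decoding rule. This electrical (1 + D) form is one common realization of controlled ISI (the optical phase-modulated variant differs); the bit pattern and the assumed initial state are arbitrary.

```python
import numpy as np

d = np.random.randint(0, 2, 32)              # information bits (arbitrary)

# Precoder: b[n] = d[n] XOR b[n-1], with b[-1] = 0 assumed.
b = np.zeros(d.size, dtype=int)
for n in range(d.size):
    b[n] = d[n] ^ (b[n - 1] if n > 0 else 0)

x = (2 * b - 1).astype(float)                # antipodal line levels {-1, +1}
y = x + np.concatenate(([-1.0], x[:-1]))     # (1 + D) filter: levels {-2, 0, +2}

d_hat = (np.abs(y) < 1).astype(int)          # decode: middle level -> 1
assert np.array_equal(d, d_hat)              # precoding avoids error propagation
```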
3.2 Adaptive Equalization
In practice, the channels in high-speed links are unknown. Therefore, adaptive equalization techniques are needed that can learn the channel parameters via a process of training. Many of the adaptive equalization techniques employed in high-speed links tend to be simplified versions of those in DSL and wireless. A key difference is that the high data rates in high-speed links force the designer to explore a rich mix of analog and digital equalization approaches, instead of the predominantly digital approach taken in other communication links. The baseband output r(t) of a high-speed link can be written in terms of the transmitted bit sequence d[n] as

r(t) = Σ_{k=−∞}^{∞} d[k] h_g(t − kT) + w(t),
where h_g(t) = g(t) ∗ h(t) is the convolution of the transmit pulse g(t) with the channel impulse response h(t), assuming that the latter is time-invariant. It is well known [24] that the optimal receive filter for such a linear modulation over a linear channel is a filter matched to the transmit pulse h_g(t), and that a set of sufficient statistics for the optimal detection of the sequence d[n] (for any criterion of optimality) can be obtained by sampling the output of this matched filter at the symbol-synchronous intervals t = nT. An attractive means of implementing such a matched filter when the channel response is not known precisely in advance is the use of an adaptive fractionally-spaced equalizer (FSE), whose output can be written

y[n] = y(nT) = Σ_{k=0}^{N−1} c_k r(nT − kΔ),   (16)
where the delay-line spacing Δ of the N taps in the equalizer is typically a fraction of the symbol period, i.e., Δ = T/M_f, where M_f = 2 is typical. The coefficients of the tapped delay line, or feed-forward equalizer (FFE), can be adapted using a variety of methods, such as the least-mean-square (LMS) algorithm, followed by subsequent symbol-spaced sampling. In principle, y[n] can exactly reproduce the T-spaced outputs of the optimal continuous-time matched filter h_g(nT − t) when the highest frequency in r(t) is less than 1/(2Δ), i.e., Δ-spaced sampling satisfies the sampling theorem. FSEs have the added benefit of having performance that is relatively insensitive to the sampling phase of the receiver, which is an important feature in high-speed links, where the recovered clock is often fraught with jitter. FSEs with T/2 spacing have been implemented with reasonable success for both back-plane [25] and MMF channels [26]. These tend to cover a span of 3-to-7 T-spaced symbols. A block diagram of a Δ-spaced FFE is shown in Fig. 9(a). FFE structures are typically implemented with analog tapped delay-line circuits in CMOS, where the cost (and therefore the power consumption) of a transceiver needs to be kept to a minimum. Challenges include the realization of accurate, low-loss delay lines, such that the spacings between adjacent taps are equal, and equal to a fraction of the bit period, and such that the signal does not decay too substantially as it is passed from the input to the end of the delay line. For symbol decisions d̂[n] = 1 for y[n] > η and d̂[n] = 0 for y[n] ≤ η (with slicer threshold η) and error signal e[n] = d̂[n] − y[n], the LMS coefficient update, for step-size μ,

c_k[n] = c_k[n−1] + μ r(nT − kΔ) e[n],  k = 0, 1, …, N−1,

need not be overly complex, since the resolution with which the multiplications in (16) can be carried out is limited.

FFEs are not particularly well-suited to channels with deep spectral nulls, such as back-plane channels with discontinuities. For such channels, the DFE shown in Fig. 9(b) is a natural extension of the FFE. A DFE typically consists of an FFE with an additional linear (or non-linear) filter used to process past symbol decisions d̂[n] in order to cancel post-cursor ISI. Assuming past symbol decisions are correct, the FFE portion of a DFE converts as much of the pre-cursor ISI as possible, within its degrees of freedom, into post-cursor ISI, the effects of which are then subtracted off by the feedback filter. The output of the DFE with a linear feedback filter is given by
Fig. 9 Linear equalization techniques: (a) block diagram of a fractionally-spaced feed-forward linear equalizer (FFE), and (b) block diagram of a fractionally-spaced decision feedback equalizer (DFE).
y[n] = Σ_{k=0}^{N₁−1} c_k[n] r(nT − kΔ) − Σ_{j=1}^{N₂} b_j[n] d̂[n−j],   (17)

whose LMS coefficient updates are given by

c_k[n] = c_k[n−1] + μ r(nT − kΔ) e[n],  k = 0, 1, …, N₁−1,   (18)
b_j[n] = b_j[n−1] + μ d̂[n−j] e[n],  j = 1, 2, …, N₂.   (19)
3.3 Equalization of Nonlinear Channels
Unlike back-plane and MMF channels, the SMF channel is non-linear. This is due to the presence of a direct (square-law) O/E detector in the receiver front-end. As a result, there is no discrete-time equivalent baseband channel, and adaptive equalization techniques, such as the FFE or DFE discussed previously, do not provide much benefit for SMF applications. Some success for small amounts of CD has been
achieved using sub-optimal techniques such as DFEs with state-dependent thresholds [27] or with nonlinear feedback taps, i.e., where the feedback filter contains not only prior detected bits but also their pairwise products [26]. In order to develop optimum detection techniques, the SMF channel is described as a nonlinear continuous-time channel with finite memory, whose output at the detector can be written

r(t) = S(t; d[n], d[n−1], …, d[n−L_c]) + n(t),   (20)

where n(t) comprises optical noise in an amplified link or electrical noise in an unamplified span of fiber, and S(t; d[n], d[n−1], …, d[n−L_c]) describes the noise-free, state-dependent component of the signal, which is a function of the L_c adjacent transmitted symbols. Sufficient statistics for optimal sequence detection [28] for such continuous-time channels with memory can be obtained from a baud-sampled (with perfect symbol timing) bank of matched filters. For data rates at or above 10 Gb/s, processing such matched filters at the line rate, together with an MLSE engine, poses significant implementation challenges.
3.4 Maximum Likelihood Sequence Estimation (MLSE)
The optimum receiver in terms of minimizing the BER for a nonlinear channel with memory, such as the optical fiber channel, is given by the maximum a-posteriori probability (MAP) detector and can be computed using the BCJR algorithm [3], which is a two-pass algorithm requiring a forward and a backward recursion over the entire data sequence. The MLSE algorithm provides the optimal detector in terms of minimizing the sequence error rate (SER) rather than the BER. The MLSE algorithm provides nearly the same performance as MAP detection for uncoded transmissions, or when symbol detection is carried out separately from FEC decoding, using only a single forward pass through the data. The Viterbi algorithm [30] can be used to recursively solve for the SER-optimal transmitted sequence consistent with the observations, given a channel model at the receiver that can describe the probability density governing the observations r[n]. Specifically, we can define the set of possible states at time index k as the 2^{L_c} vectors s[k] = [d[k−1], …, d[k−L_c]], where d[n] ∈ {0, 1}. In addition, we define the set of valid state transitions from time index k to time index k+1 as the 2^{L_c+1} possible vectors of the form b[k] = [d[k], d[k−1], …, d[k−L_c]]. Using this notation, the noise-free outputs of the channel model are x[n] = f(d[n], s[n]) and the channel observations are r[n] = x[n] + w[n], where the noise-free outputs, the channel observations, and the noise samples w[n] can be either scalars (in the case of baud-sampled statistics) or vectors (in the case of over-sampled channel outputs or state-dependent matched-filter outputs). Given a sequence of channel observations r[k] = {r[k], r[k−1], …, r[k−L_r+1]} of length L_r, the channel, when viewed as a 2^{L_c}-state machine, can exhibit 2^{L_c+L_r}
possible state sequences s̃[k] = {s[k], s[k−1], …, s[k−L_r+1]} (owing to the shift structure of adjacent states). Each valid state sequence corresponds to a path in the trellis representation of the channel (see Fig. 10). The MLSE algorithm seeks the particular state sequence, or path, ŝ[k] for which the probability of the channel observation sequence r[k] given s̃[k] is maximum, i.e.,

ŝ[k] = argmax_{s̃[k]} P(r[k] | s̃[k]) = argmax_{s̃[k]} P(r[k] | s̃[k]) P(s̃[k]) = argmax_{s̃[k]} P(r[k], s̃[k]),   (21)

assuming that each state sequence is equally likely. Assuming further that the noise samples are conditionally independent, we can write

P(r[k], s̃[k]) = Π_{k=0}^{L_r} P(s[k+1] | s[k]) Π_{k=0}^{L_r} P(r[k] | s[k+1], s[k]),

and define the branch metric

λ(b[k]) = −ln P(s[k+1] | s[k]) − ln P(r[k] | b[k] = {s[k+1], s[k]})

to quantify the likelihood that a given state transition occurred. Then, we obtain the log probability of any state sequence, or path, as a sum of branch metrics along the state transitions in that path,

L(s̃[k]) = −ln P(r[k], s̃[k]) = Σ_{k=0}^{L_r} λ(b[k]),

which is also referred to as the path metric. The label Ŝ(s[k]) is used to designate the shortest path (the path with the smallest path metric) through the trellis ending in state s[k], which is known as a survivor path, since all longer paths also ending in state s[k] are discarded. The path metric of the survivor path ending in state s[k] is labeled M(s[k]), i.e., M(s[k]) = L(Ŝ(s[k])). This is best visualized using a trellis, as shown in Fig. 10, where all permissible/valid state transitions b[k] have equal probability. The Viterbi algorithm is now a dynamic programming approach to finding the path through the trellis with the smallest accumulated branch metric.

MLSE implementation in high-speed links brings up a number of practical considerations. First, the link needs to be ADC-based. This constraint is no longer a major issue for data rates in the 10 Gb/s range today, though power considerations may preclude the use of ADCs in many back-plane applications in the short term. Second, since the data is continuously transmitted, it is necessary, for latency reasons, to output bit decisions as the data is processed. This is called the sliding-window version of the Viterbi algorithm, which requires a finite look-back window of length D. It is desirable to make the look-back window as long as possible, in order to reduce the probability that more than one of the survivor paths (see Fig. 10) will exist at time k − D when processing at time k. However, the storage requirements and complexity of the algorithm per output symbol increase linearly in D.
Fig. 10 Maximum likelihood sequence estimation (MLSE): A trellis for a 4-state (L_c = 2) MLSE algorithm with L_r = 5 is shown. The trellis diagram illustrates valid state sequences with solid lines as paths through the trellis. Above each line connecting two states is the corresponding symbol that would have been transmitted for that transition.
We have found that setting D equal to a small multiple of the channel memory L_c works well in practice, and setting D > 3L_c does not improve performance significantly. For D in this range, when more than one survivor path exists at k − D, selecting the one with the smallest path metric at k appears to work well in practice. As illustrated in Fig. 10, the conventional Viterbi decoder would employ the current received sample r[n] to compute the branch metrics, and update the four path metrics that correspond to the likelihood of the best path (survivor path) ending at each of the four states in one symbol period. The survivor paths are stored in memory. The path metric update step is followed by the trace-back step, where one of the survivor paths is traced back to a predetermined depth D and a decision is made. The survivor path metric update is performed using an add-compare-select (ACS) operation, which is recursive and therefore also difficult to implement at optical line rates.

A probabilistic model for the observations r[k] given the state transitions b[k], i.e., given the transmitted bits d[k], …, d[k − L_c], is required for branch metric computations. In practical applications, this model must be determined adaptively from the channel observations r[n] without the aid of any training, i.e., without any knowledge of the true bit sequence d[n]. One such model for a baud-sampled receiver uses an adaptive system identification algorithm based on a Volterra series expansion of the channel impairments [12, 31, 32]:

r[n] = c₀ + Σ_{k=0}^{L_c} a_k d[n−k] + Σ_{k=0}^{L_c} Σ_{ℓ=0, ℓ≠k}^{L_c} b_{k,ℓ} d[n−k] d[n−ℓ] + ···,   (22)

where r[n] = r(nT) are the baud-sampled receiver outputs, d[n] ∈ {0, 1} are the assumed transmitted channel symbols, and the parameters c₀, a_k, and b_{k,ℓ} can be adaptively
estimated using the LMS algorithm:

c₀[n] = c₀[n−1] + μ e[n] − ε (c₀[n−1] − c₀[0]),   (23)
a_k[n] = a_k[n−1] + μ d̂[n−k] e[n] − ε (a_k[n−1] − a_k[0]),  0 ≤ k ≤ L_c,   (24)
b_{k,ℓ}[n] = b_{k,ℓ}[n−1] + μ d̂[n−k] d̂[n−ℓ] e[n] − ε (b_{k,ℓ}[n−1] − b_{k,ℓ}[0]),  0 ≤ k, ℓ ≤ L_c,   (25)

where e[n] = r_ADC[n] − r[n] for ADC outputs r_ADC[n], d̂[n] are past decided bits from the MLSE algorithm, μ is the LMS step-size, and ε is a step-size parameter for the regularization terms c₀[0], a_k[0], and b_{k,ℓ}[0], which are used to incorporate prior knowledge and to mitigate the accumulation of finite-precision effects. In a practical implementation, LMS updates can be performed at a fraction of the line rate to reduce power, which also enables the use of MLSE outputs, or perhaps FEC-corrected outputs, when making the LMS updates. Given the model coefficients, (22) can be used to precompute the noise-free outputs for each state transition in the trellis. The branch metrics can then be computed using either a parametric or non-parametric model for the noise statistics within each state. We have found that a suitably chosen parametric model for the noise statistics [12], e.g., state-dependent Gaussian, works well in practice, requires relatively few parameters to be estimated, and enables the use of Euclidean branch metrics, which simplifies the architecture.
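To make the detection step concrete, the following Python sketch implements a sliding-window Viterbi detector for a 4-state (L_c = 2) channel with Euclidean branch metrics, i.e., a state-independent Gaussian noise assumption. For brevity, the pairwise terms b_{k,ℓ} of Eq. (22) are set to zero (a linear special case) and the model coefficients are assumed known rather than LMS-adapted; the coefficients, noise level, and look-back depth D are illustrative.

```python
import numpy as np

Lc, D = 2, 6                                  # channel memory, look-back depth
a = np.array([0.9, 0.5, 0.2])                 # a_k of Eq. (22); b_{k,l} = 0 here

def model_output(bits):
    """Noise-free output for [d[n], d[n-1], d[n-2]] under the linear model."""
    return float(np.dot(a, bits))

rng = np.random.default_rng(1)
d = rng.integers(0, 2, 400)
dp = np.concatenate(([0, 0], d))              # zero-padded bit history
r = np.array([model_output([dp[n + 2], dp[n + 1], dp[n]]) for n in range(d.size)])
r += 0.05 * rng.standard_normal(d.size)       # state-independent Gaussian noise

n_states = 2 ** Lc                            # state s[k] = (d[k-1], d[k-2])
M = np.zeros(n_states)                        # survivor path metrics
paths = [[] for _ in range(n_states)]         # survivor bit histories
decisions = []
for n in range(d.size):
    newM = np.full(n_states, np.inf)
    newP = [None] * n_states
    for s in range(n_states):
        d1, d2 = (s >> 1) & 1, s & 1
        for bit in (0, 1):                    # transition b[k] = (bit, d1, d2)
            m = M[s] + (r[n] - model_output([bit, d1, d2])) ** 2
            ns = (bit << 1) | d1              # shift structure of adjacent states
            if m < newM[ns]:
                newM[ns], newP[ns] = m, paths[s] + [bit]
    M, paths = newM, newP
    if len(paths[int(np.argmin(M))]) > D:     # sliding window: emit oldest bit
        decisions.append(paths[int(np.argmin(M))][0])
        paths = [p[1:] for p in paths]

decisions.extend(paths[int(np.argmin(M))])    # flush the tail at the end
print("bit errors:", int(np.sum(np.array(decisions) != d)))
```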
4 Case Study of an OC-192 EDC-based Receiver
Designing EDC-based receivers at OC-192 rates, i.e., 9.953 Gb/s to 12.5 Gb/s, poses numerous signal processing and mixed-signal circuit design challenges. In this section, we illustrate some of the issues that arise in implementing EDC using mixed-signal ICs by describing the design of a maximum likelihood sequence estimation (MLSE) receiver [12] with adaptive non-linear channel estimation. Key issues include: the design of a baud-sampled ADC, clock recovery in the presence of dispersion, non-linear channel estimation in the absence of a training sequence, and the design of a high-speed Viterbi equalizer. This is the first reported design of a complete MLSE receiver meeting the specifications of OC-192/STM-64 long-haul (LH), ultra-long-haul (ULH), and metro fiber links at rates up to 12.5 Gb/s.
4.1 MLSE Receiver Architecture
The MLSE receiver was designed as a two-chip solution in two different process technologies that were chosen to match the functionality to be implemented. The analog signal conditioning, sampling, and clock recovery functions were implemented in an analog front-end (AFE) IC in a 0.18 μm, 3.3 V, 75 GHz SiGe BiCMOS process.
Fig. 11 Block diagram of the MLSE-based receiver.
The digital MLSE equalizer and the non-linear channel estimator were implemented in a digital equalizer (DE) IC in a 0.13 μm, 1.2 V CMOS process. The architecture of the MLSE receiver is shown in Fig. 11. The received signal is at a line rate between 9.953 Gb/s for uncoded links and up to 12.5 Gb/s for FEC-based links. The received signal is amplified by a VGA and then sampled by a 4-bit flash ADC. A dispersion-tolerant CRU recovers a line-rate clock for the ADC. The 4-bit line-rate ADC samples are demultiplexed by a factor of 1:8 to generate a 32-bit interface to the DE IC, which implements a 4-state MLSE algorithm, i.e., it assumes a channel memory of 3 symbol periods. The digital equalizer includes a blind, adaptive channel estimator of the form in (22) and a data decoding unit on-chip, requiring no training.
4.2 MLSE Equalizer Algorithm and VLSI Architecture

Designing an MLSE equalizer at OC-192 line rates (9.953 Gb/s to 12.5 Gb/s) is extremely challenging because of the high data rates involved. Conventional high-speed Viterbi architectures often employ parallel processing or higher-radix processing [33]. Parallel/block processing architectures tend to suffer from edge effects, while higher-radix architectures have been shown to achieve speed-ups of up to a factor of 2, which is not sufficient for this application. Instead, the architecture in [12, 34] (see Fig. 12) incorporates a number of techniques that enable the MLSE computations to complete within the bounds imposed by the high OC-192 line rates. First, the channel trellis is traversed in a time-reversed manner (see Fig. 12(a)). Doing so eliminates the trace-back step and the survivor memory associated with conventional approaches, thereby providing a considerable power saving while enhancing throughput.
Fig. 12 High-speed MLSE architecture: (a) time-reversed trellis, in which a cold-start at r[k] leads to 4 possible reverse-survivor paths ending at states s[k − D], (b) VLSI architecture, and (c) path-selector.
Second, the ACS bottleneck associated with conventional approaches is eliminated by restricting the survivor path memory to a reasonably small number (6) of symbol periods. This number is obtained by noting that the channel memory spans roughly 3 symbol periods, and from the structure of the channel trellis. With a fixed survivor depth D and time-reversed traversal of the trellis, it becomes possible to implement the ACS computations in a fully feed-forward architecture. This feed-forward ACS unit is referred to as the path-finder block in Fig. 12. A feed-forward architecture [35] can be easily pipelined in order to meet an arbitrary speed requirement. Third, in order to avoid edge effects, past decisions are employed to select among the four possible survivor paths. Though this choice results in a recursive stage, referred to as the path-selector (see Fig. 11), consisting of a series of multiplexers, it is not difficult to meet the speed requirements. This is because the path-selector is a small and localized part of the architecture and hence easily amenable to circuit optimizations. The three innovations described above provide a favorable trade-off between BER and ease of implementation. Though they do not reduce the frequency at which the architecture needs to be clocked, which remains at up to 12.5 GHz, they do result in an architecture that can be easily transformed into one that operates at a lower clock frequency, as described next. The fourth innovation is to make multiple (Nd) decisions along the chosen survivor path in each clock cycle, resulting in a clock frequency that needs to be only 1/Nd times the line rate. Fifth, the architecture is unrolled [35] in time by a factor of Nu, resulting in a further reduction of the clock frequency by a factor of Nu. Thus, the final clock frequency is given by

$$f_{clk} = \frac{R}{N_u N_d} \qquad (26)$$

where R is the line rate. Algorithmically, unfolding [35] does not ease the throughput problem, as it expands the number of computations in the critical path by the same factor as it lengthens the clock period. This fact does not present a problem in this case because the critical path is localized to the path-selector, which, as mentioned earlier, is a simple cascade of multiplexers and hence can be custom designed to meet the speed requirements. A side benefit of making multiple decisions and employing unfolding is that the equalizer doubles as a 1:NdNu demultiplexer as well. Most receivers need a 1:16 demux functionality, which is usually implemented using a separate demux chip. By choosing Nd = 2 and Nu = 8, the DE chip demultiplexes and decodes the data simultaneously.
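As a quick sanity check of (26), the short sketch below plugs in the Nd and Nu values quoted above (the variable names are ours):

```python
# Sanity check of Eq. (26) for the parameter values quoted in the text.
R = 12.5e9                 # line rate (up to 12.5 Gb/s for FEC-based links)
Nd, Nu = 2, 8              # decisions per cycle and unrolling factor

f_clk = R / (Nu * Nd)      # Eq. (26)
print(f"f_clk = {f_clk / 1e6:.2f} MHz, demux ratio 1:{Nd * Nu}")
# -> f_clk = 781.25 MHz, demux ratio 1:16
```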
4.3 Measured Results

The MLSE receiver was tested with up to 200 km of SMF. The MLSE receiver achieves a BER of 10^-4 at an OSNR of 14.2 dB with 2200 ps/nm of dispersion, as shown in Fig. 13(a). A received power of −14 dBm is used throughout the testing.
Fig. 13 MLSE receiver test results: (a) chromatic dispersion test, and (b) SONET jitter tolerance.
The MLSE receiver reduces the OSNR penalty of a standard CDR by more than 2 dB at a CD of 1600 ps/nm and a BER of 10^-4. The penalty for a standard CDR increases rapidly beyond 1600 ps/nm of CD. Figure 13(a) also shows that the MLSE receiver does not have a penalty in the back-to-back (i.e., 0 km) configuration. The receiver has been shown to provide an error-free (BER < 10^-15) post-FEC output, with a pre-FEC BER of 10^-3 at 10.71 Gb/s with 2000 ps/nm of dispersion. It can compensate for 60 ps of instantaneous differential group delay (DGD) with a 2 dB OSNR penalty (BER = 10^-6). The channel estimator in the MLSE engine adapts at a rate of 30 MHz and is thus easily able to track variations in DGD, which are typically slower than 1 kHz. The MLSE receiver satisfies the SONET jitter tolerance specifications with 2200 ps/nm of dispersion (see Fig. 13(b)). An HP OMNIBER OTN is used for this measurement. The measured output clock jitter is < 0.5 ps rms. The output BER is kept at 10^-3 and the CD is varied up to 2200 ps/nm with a 2^31 − 1 PRBS. The fixed BER is achieved by adjusting the OSNR while maintaining the received optical power at −14 dBm. The PLL output clock jitter is less than 0.64 ps rms across test conditions. The PLL maintains lock without cycle slips with up to 1300 consecutive identical digits (CID) at a BER of 10^-2 over a 125 km SMF-28 optical fiber link. The MLSE receiver illustrates the benefits of jointly addressing signal processing algorithm design, VLSI architectures, mixed-signal integrated circuit implementation, as well as electromagnetic signal integrity issues in packaging and circuit board design.
5 Advanced Techniques

In this section, we describe a few of the directions in which new signal processing advances are being made in the high-speed links of the future. In particular, we focus on the application of FEC to reduce power in back-plane links, and the use of coherent detection in optical links.
5.1 FEC for Low-power Back-Plane Links

As described earlier, state-of-the-art back-plane links today rely exclusively upon an equalization-based transceiver to compensate for ISI and noise in order to achieve BER < 10^-15 [5, 20]. This is highly unusual, because most other communication links today, such as DSL, wireless, optical, disk drives, and others, employ some form of FEC in order to achieve the requisite BER. In such links, the equalizer achieves a BER in the range of 10^-3 to 10^-4 at its output, and the FEC further reduces the BER by orders of magnitude, to 10^-12 to 10^-15. Therefore, in the absence of FEC, back-plane links are designed to be ISI-dominated with high SNR (> 30 dB), and hence consume more power than necessary. Though digital techniques such as
FEC tend to consume more power than analog ones, the scaling of feature sizes favors digital designs more than analog. Thus, it is expected that in nanoscale process technologies, a digital-heavy FEC-based I/O link [2], and perhaps even an ADC-based link, might be an attractive approach to reduce power. In an (n, k, dmin) code, a block of k data/information bits/symbols (a dataword) is mapped to a block of n code bits/symbols (a codeword), where n > k. Such a code is said to have a code rate of r = k/n. FEC results in the line rate being greater than the data rate by a factor of 1/r. Thus, a coded link suffers from increased ISI and hence exhibits an ISI penalty. The ability of FEC to correct errors results in a coding gain, where the coding gain is the reduction in channel SNR needed to achieve the same BER. For back-plane links, we expect that the coding gain for specific types of codes will offset the ISI penalty with an acceptable latency, and thereby result in a reduced-power link. In order to capture the trade-offs inherent in FEC, consider the specific example of a (327, 265, 15) BCH code (r = 0.81, dmin = 15, t = 7) employed over a link comprising a 20" FR4 channel supporting NRZ-shaped 10 Gb/s data and a receiver with an LE and a DFE.
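The rate bookkeeping for this example can be checked with a few lines (a sketch; the names are ours):

```python
# Rate / ISI-penalty bookkeeping for the (327, 265, 15) BCH example.
n, k = 327, 265
r = k / n                        # code rate, ~0.81
data_rate = 10e9                 # 10 Gb/s payload over the FR4 channel
line_rate = data_rate / r        # coded line rate; the 1/r expansion is
                                 # the source of the ISI penalty
print(f"r = {r:.3f}, coded line rate = {line_rate / 1e9:.2f} Gb/s")
# -> r = 0.810, coded line rate = 12.34 Gb/s
```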
Fig. 14 BCH codes in back-plane links: (a) performance of a (327, 265, 15) BCH code over a 20" FR-4 channel at 10 Gb/s, and (b) a low-power BCH decoder architecture.
Figure 14 shows the performance of the BCH code. We observe that the BER reduces by six orders of magnitude and ten orders of magnitude with an LE and a DFE, respectively. Furthermore, the crossover point between the coded and uncoded links, due to the ISI penalty, occurs between 25 dB and 30 dB for an LE, and around 24 dB for a DFE. In addition, we find that at a BER = 10^-10, FEC provides a coding gain of 7 dB to 8 dB for both the LE and the DFE, and that the coding gain increases as the BER reduces. It is expected that at the desired BER = 10^-15, a coding gain of at least 10 dB is feasible. The coding gain, or SNR slack, can be budgeted across the transmit driver and the receiver circuitry in order to save power. The use of FEC in multi-Gb/s links raises interesting implementation questions, especially those related to power consumption. For example, as the pre-FEC error rate is quite small, e.g., 10^-4, a power-efficient decoder can be designed by incorporating a low-complexity error detector (see Fig. 14(b)) which operates constantly but gates the subsequent processing in the decoder. In addition to FEC, the back-plane channel can benefit from other signal processing techniques, such as the use of an MLSE detector, as was demonstrated for an optical link. A channel-shortening equalizer will need to precede the MLSE detector, as the back-plane channel is known to have a large memory. The availability of high-speed ADCs opens up the application of a number of low-power techniques, including algorithmic noise-tolerance (ANT) [36], where statistical signal processing techniques are employed to detect and correct errors in signal processing architectures.
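The gated-decoder idea can be sketched as a simple control flow. Here `syndromes_are_zero` and `bch_correct` are hypothetical stand-ins for the always-on syndrome unit and the gated BMU/ELU pipeline of Fig. 14(b); they are not an actual BCH implementation.

```python
def decode_block(received_bits, syndromes_are_zero, bch_correct):
    """Gate the heavy BCH machinery on the cheap error check.

    syndromes_are_zero: hypothetical low-complexity error detector
        (the always-on syndrome unit of Fig. 14(b)).
    bch_correct: hypothetical full BMU/ELU error-correction pipeline,
        clock-gated off in the common no-error case.
    """
    if syndromes_are_zero(received_bits):
        # With a pre-FEC BER of ~1e-4, almost every block takes this
        # path, so the decoder does no work beyond the syndrome check.
        return received_bits
    # Rare path: wake the Berlekamp-Massey and error-locator units.
    return bch_correct(received_bits)
```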
5.2 Coherent Detection in Optical Links

In Section 2.1, we described a basic model for both directly modulated and externally modulated optical sources for intensity modulations, such as NRZ or RZ signalling. We also described a more spectrally efficient method, based on a form of optical duobinary signalling, in which the phase of the carrier is additionally modulated by either 0 or π radians, depending on the state of the precoder. Attempts to achieve even greater spectral efficiency in optical transmitters invariably require innovations both in the optical devices involved and in the transmit and receive signal processing. Recently, a number of additional phase-shift keyed modulation formats have been demonstrated, including the differential QPSK format shown in Fig. 15 [37]. The transmitter requires a digital precoder, and an optical encoder with dual Mach-Zehnder modulators. Receiver decoding is performed optically, using an optical delay-and-add structure to avoid the need for generating a coherent local oscillator. However, there are benefits to using a coherent receiver when additional signal processing is used for EDC. While the received signal after direct detection is proportional to the received optical power, if a coherent receiver were used, the received electrical signal would be proportional to the electromagnetic field in the fiber. As a result, CD and PMD would appear as linear distortions, enabling linear equalization techniques to substantially outperform their counterparts when used
after direct (square-law) detection. Four photodetectors are presently required to recover the in-phase and quadrature (I and Q) components of the electromagnetic field from both polarizations of the received optical signal. If the length of fiber to be compensated were known in advance, techniques for coherent modulation combined with linear pre-equalization [38] can be used. One advantage of electronic pre-equalization is that direct detection can still be used at the receiver, in exchange for substantial signal processing followed by digital-to-analog conversion at the modulator.
Fig. 15 Block diagram for a differential QPSK transmitter and receiver.
In addition, techniques that exploit polarization diversity [39] and modal diversity [40, 41] are being explored, whereby PMD and modal dispersion are viewed as opportunities for enhancing data rates rather than as impediments. This is analogous to what is being done in wireless systems today. Thus, as was the case for back-plane links, there exists a wide range of unexplored signal processing techniques in the context of optical links.
6 Concluding Remarks

Signal processing techniques have today found their way into the design of high-speed links, such as back-plane and optical links. Thanks to the tremendous advances in semiconductor technology, equalization in back-plane links and EDC-based optical products are making their way into commercial systems. We hope that this chapter has provided an overview of the general concepts, and that it will enable further exploration into this fascinating and challenging topic.
7 Acknowledgements

The authors thank their colleagues at the University of Illinois at Urbana-Champaign and at Finisar Corporation for their contributions.
References

1. L. D. Paulson, "The ins and outs of new local I/O trends," IEEE Computer Magazine, July 2003.
2. R. Narasimha and N. R. Shanbhag, "Forward error-correction for high-speed I/O," Proceedings of the 42nd Annual Asilomar Conference on Signals, Systems, and Computers, October 2008.
3. N. Blitvic, M. Lee, and V. Stojanović, "Channel coding for high-speed links: A systematic look at code performance and system simulation," IEEE Transactions on Advanced Packaging, pp. 268–279, May 2009.
4. M.-J. Lee, W. J. Dally, and P. Chiang, "Low-power area-efficient high-speed I/O circuit techniques," IEEE Journal of Solid-State Circuits, pp. 1591–1599, November 2000.
5. J. E. Jaussi, G. Balamurugan, D. R. Johnson, B. Casper, A. Martin, J. Kennedy, N. Shanbhag, and R. Mooney, "8-Gb/s source-synchronous I/O link with adaptive receiver equalization, offset cancellation, and clock de-skew," IEEE Journal of Solid-State Circuits, pp. 80–88, January 2005.
6. A. Singer, N. R. Shanbhag, and H.-M. Bae, "Electronic dispersion compensation," IEEE Signal Processing Magazine, pp. 110–130, November 2008.
7. G. E. Keiser, Optical Fiber Communications. McGraw-Hill International Editions: Electrical Engineering Series, 2000.
8. G. P. Agrawal, Fiber-Optic Communication Systems. Wiley-Interscience, 2002.
9. J. Peters, A. Dori, and F. Kapron, "Bellcore's fiber measurement audit of existing cable plant for use with high bandwidth systems," Proc. of the NFOEC'97, September 1997.
10. M. Karlsson, "Probability density functions of the differential group delay in optical fiber communication systems," IEEE Journal of Lightwave Technology, vol. 19, pp. 324–332, 2001.
11. G. Papen and R. Blahut, Lightwave Communication Systems. Draft, 2005.
12. H. M. Bae, J. Ashbrook, J. Park, N. Shanbhag, A. C. Singer, and S. Chopra, "An MLSE receiver for electronic-dispersion compensation of OC-192 links," IEEE Journal of Solid-State Circuits, vol. 41, pp. 2541–2554, November 2006.
13. A. Farbert, S. Langenbach, N. Stojanovic, C. Dorschky, T. Kupfer, C. Schulien, J.-P. Elbers, H. Wernz, H. Griesser, and C. Glingener, "Performance of a 10.7 Gb/s receiver with digital equaliser using maximum likelihood sequence estimation," ECOC 2004 Proceedings, paper PD-Th4.1.5, 2004.
14. M. Harwood et al., "A 12.5 Gb/s SerDes in 65nm CMOS using a baud-rate ADC with digital receiver equalization and clock recovery," IEEE International Solid-State Circuits Conference, February 2007.
15. O. Agazzi et al., "A 90nm CMOS DSP MLSD transceiver with integrated AFE for electronic dispersion compensation of multi-mode optical fibers at 10 Gb/s," IEEE International Solid-State Circuits Conference, 2008.
16. J. Cao et al., "A 500mW digitally calibrated AFE in 65nm CMOS for 10Gb/s serial links over backplane and multimode fiber," IEEE International Solid-State Circuits Conference, February 2009.
17. V. Stojanović, A. Amirkhany, and M. A. Horowitz, "Optimal linear precoding with theoretical and practical data rates in high-speed serial link back-plane communications," IEEE Int. Conf. Communications, 2004.
18. G. Balamurugan and N. R. Shanbhag, "Modeling and mitigation of jitter in multi-Gbps source-synchronous I/O links," Proceedings of the 21st International Conference on Computer Design, October 2003.
19. P. Humblet and M. Azizoglu, "On the bit error rate of lightwave systems with optical amplifiers," Journal of Lightwave Technology, vol. 9, pp. 1576–1582, November 1991.
20. J. Stonick, G.-Y. Wei, J. L. Sonntag, and D. K. Weinlader, "An adaptive PAM-4 5-Gb/s backplane transceiver in 0.25 μm CMOS," IEEE Journal of Solid-State Circuits, pp. 436–443, March 2003.
21. K. Yamaguchi et al., "12Gb/s duobinary signaling with x2 oversampled edge equalization," IEEE International Solid-State Circuits Conference, February 2005.
22. J. Lee, M.-S. Chen, and H.-D. Wang, "Design and comparison of three 20-Gb/s backplane transceivers for duobinary, PAM4, and NRZ data," IEEE Journal of Solid-State Circuits, pp. 2120–2133, September 2008.
23. T. Franck, P. B. Hansen, T. N. Nielsen, and L. Eskildsen, "Duobinary transmitter with low intersymbol interference," IEEE Photonics Technology Letters, pp. 597–599, April 1998.
24. S. Qureshi, "Adaptive equalization," Proceedings of the IEEE, vol. 73, pp. 1349–1387, September 1985.
25. M. BiChan and A. C. Carusone, "A 6.5 Gb/s backplane transmitter with 6-tap FIR equalizer and variable tap spacing," IEEE Custom Integrated Circuits Conference, 2008.
26. Q. Yu and A. Shanbhag, "Electronic data processing for error and dispersion compensation," Journal of Lightwave Technology, vol. 24, pp. 4514–4525, December 2006.
27. J. Winters and R. Gitlin, "Electrical signal processing techniques in long-haul fiber optic systems," IEEE Transactions on Communications, pp. 1439–1453, September 1990.
28. S. Benedetto, E. Bigliere, and V. Castellani, Digital Transmission Theory. Englewood Cliffs, NJ: Prentice Hall, 1987.
29. L. R. Bahl et al., "Optimal decoding of linear codes for minimizing symbol error rate," IEEE Trans. on Information Theory, vol. 20, pp. 284–287, March 1974.
30. G. Forney, "Maximum-likelihood sequence estimation of digital sequences in the presence of intersymbol interference," IEEE Transactions on Information Theory, vol. 18, pp. 363–378, May 1972.
31. W. Sauer-Greff, M. Lorang, H. Haunstein, and R. Urbansky, "Modified Volterra series and state model approach for nonlinear data channels," Proc. IEEE Signal Processing '99, pp. 19–23, 1999.
32. O. Agazzi, M. Hueda, H. Carrer, and D. Crivelli, "Maximum likelihood sequence estimation in dispersive optical channels," Journal of Lightwave Technology, vol. 23, pp. 749–763, February 2005.
33. P. J. Black and T. H. Meng, "A 140-Mb/s, 32-state, radix-4 Viterbi decoder," IEEE Journal of Solid-State Circuits, vol. 27, pp. 1877–1885, December 1992.
34. R. Hegde, A. Singer, and J. Janovetz, "Method and apparatus for delayed recursion decoder," US Patent no. 7206363, filed June 2003, issued April 2007.
35. K. K. Parhi, VLSI Digital Signal Processing Systems: Design and Implementation. Wiley, 1999.
36. R. Hegde and N. R. Shanbhag, "Soft digital signal processing," IEEE Transactions on VLSI Systems, pp. 813–823, December 2001.
37. R. Griffin, "Integrated DQPSK transmitters," Optical Fiber Communication Conference, 2005. Technical Digest, vol. OFC/NFOEC Volume 3, p. OWE3, March 2005.
38. M. E. Said, J. Stich, and M. Elmasry, "An electrically pre-equalized 10Gb/s duobinary transmission system," Journal of Lightwave Technology, no. 1, pp. 388–400, 2005.
39. C. Laperle, B. Villeneuve, Z. Zhang, D. McGhan, H. Sun, and M. O'Sullivan, "Wavelength division multiplexing (WDM) and polarization mode dispersion (PMD) performance of a coherent 40Gbit/s dual-polarization quadrature phase shift keying (DP-QPSK) transceiver," Post Deadline Papers OFC/NFOEC 2007, no. PDP16, 2007.
40. H. Stuart, "Dispersive multiplexing in multimode optical fiber," Science, vol. 289, pp. 281–283, 2000.
41. A. Tarighat, R. Hsu, A. Shah, A. Sayed, and B. Jalali, "Fundamentals and challenges of optical Multi-Input Multiple-Output multimode fiber links," IEEE Communications Magazine, pp. 1–8, May 2007.
Video Compression

Yu-Han Chen and Liang-Gee Chen
Abstract In this chapter, we show the demands of video compression and introduce video coding systems with state-of-the-art signal processing techniques. In the first section, we show the evolution of video coding standards. The coding standards were developed to overcome the problems of limited storage capacity and limited communication bandwidth for video applications. In the second section, the basic components of a video coding system are introduced. The redundant information in a video sequence is explored and removed to achieve data compression. In the third section, we introduce several emerging video applications (including High Definition TeleVision (HDTV), streaming, surveillance, and multiview videos) and the corresponding video coding systems. People will not stop pursuing more vivid video services. Video coding systems with better coding performance and visual quality will be continuously developed in the future.
1 Evolution of Video Coding Standards

Digital video compression techniques play an important role in enabling video applications in our daily life. The evolution of video coding standards is shown in Fig. 1. One demand of video compression is storage. Without compression, a raw video in the YCbCr 4:2:0 color format [1], CIF (352×288) spatial resolution, and 30 Frames Per Second (FPS) temporal resolution generates about 36.5 Mbps of data rate. On a 700 MB CD-ROM, only 153 seconds of the CIF video can be stored. In order to store one hour of video on a CD-ROM, a compression ratio of about 25 is required to achieve
1.5 Mbps of data rate. In the early 1990s, the Moving Picture Experts Group (MPEG) of the International Organization for Standardization/International Electrotechnical Commission (ISO/IEC) started to develop multimedia coding standards for the storage of audio and video. MPEG-1 [2] is the first successful standard, applied in the VCD industry. MPEG-2 [3] (a.k.a. H.262) is another success story, in the DVD industry. Without compression, a raw video in the YCbCr 4:2:0 color format, D1 (720×480) spatial resolution, and 30 FPS temporal resolution generates 124.4 Mbps of data rate. On a 4.7 GB DVD-ROM, only 310 seconds of the D1 video can be stored. MPEG-2 was developed to provide about 2 hours of D1 video on a DVD-ROM at around 5 Mbps of data rate.
Fig. 1 Evolution of video coding standards.
The Video Coding Experts Group (VCEG) of the International Telecommunication Union (ITU) also developed a series of coding standards, like H.261 [4] and H.263 [5], targeting real-time video applications over communication networks, e.g., video conferencing and video phones. H.261 was designed for data rates which are multiples of 64 kbps; it is also referred to as p×64 kbps (p is in the range of 1–30). These data rates are suitable for ISDN (Integrated Services Digital Network) lines. On the other hand, H.263 was designed for data rates less than 64 kbps. Its target applications are multimedia terminals providing low-bitrate visual telephone services over the PSTN (Public Switched Telephone Network). Later, ISO/IEC developed MPEG-4 [6] to extend video applications to mobile devices with wireless channels. Real-time video transmission through a complicated communication network is more challenging than storage because of the lower data rates and unpredictable network conditions. To provide a video phone service in the YCbCr 4:2:0 color format, QCIF (176×144) spatial resolution, and 10 FPS temporal resolution over a 64 kbps channel, a compression ratio of about 50 is required. The video coder needs to maintain acceptable video quality under very limited channel bandwidth. In addition, rate control, i.e., adaptively adjusting the data rate according to channel conditions, and error concealment, i.e., recovering corrupted videos caused by packet loss, are also crucial in these environments. In 2001, experts from ITU-T VCEG and ISO/IEC MPEG formed the Joint Video Team (JVT) to develop a video coding standard called H.264/AVC (Advanced Video
Coding) [7]. The first version of H.264/AVC was finalized in 2003. The standard is famous for its outstanding coding performance: compared with MPEG-4, H.263, and MPEG-2, H.264/AVC can achieve 39%, 49%, and 64% bit-rate savings under the same visual quality, respectively [8]. This advantage makes H.264/AVC suitable for a wide range of applications. In recent years, new profiles were developed to target emerging applications like HDTV, streaming, surveillance, and multiview videos. These applications and the corresponding video coding systems are introduced in Sec. 3.
2 Basic Components of Video Coding Systems

In this section, we introduce the basic components of a video coding system as shown in Fig. 2. In the evolution of video coding standards, every new standard tries to improve its coding efficiency by further exploring all kinds of video redundancy. In general, the redundancy in a video sequence can be categorized into perceptual, temporal, spatial, and statistical redundancy. Perceptual redundancy is the detailed information that humans cannot perceive well; removal of perceptually redundant information will not lead to severe perceptual quality degradation. Temporal and spatial redundancy comes from pixels that are similar to the neighboring pixels temporally or spatially. We can predict the current pixel from the neighboring pixels and encode the difference between them. The difference is close to zero with high probability and can be encoded with fewer bits by entropy coding. Statistical redundancy arises when all symbols are encoded with the same number of bits even though some occur with much higher probability than others. Short codewords (with fewer bits) for frequently occurring symbols and long codewords for infrequently occurring ones reduce the total data size of the bitstream.
Fig. 2 Basic components of a video coding system.
2.1 Color Processing

The first functional block is color processing. Color processing includes color transform and chroma (chrominance) subsampling [1]. Image data from sensors are usually in RGB (Red, Green, and Blue) color format. Each pixel in an image with
the RGB format is first transformed to another color space with one luma (luminance) component and two chroma components. The most frequently used color space is YCbCr, defined as

$$Y = 0.299R + 0.587G + 0.114B$$
$$Cb = -0.169R - 0.331G + 0.500B \qquad (1)$$
$$Cr = 0.500R - 0.419G - 0.081B$$

Y is the luma component. Cb and Cr are the chroma components, correlated to the blue color difference (B − Y) and the red color difference (R − Y), respectively. All the luma components form a grayscale image. The human visual system is less sensitive to the chroma parts of an image. If we sub-sample the chroma parts by 2 horizontally and vertically (e.g., average the neighboring 2 × 2 chroma pixels), there is almost no perceptual quality degradation. With the removal of this kind of perceptual redundancy by chroma subsampling, the amount of information is reduced by a factor of 2 (from 1(Y) + 1(Cb) + 1(Cr) = 3 to 1(Y) + 1/4(Cb) + 1/4(Cr) = 1.5). Chroma subsampling is widely adopted in video coding standards. However, in emerging HDTV applications, there are efforts to preserve all the color information to provide more vivid videos.
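A minimal sketch of Eq. (1) together with 4:2:0 subsampling by 2 × 2 averaging (assuming an even-sized image; the function name is ours):

```python
import numpy as np

def rgb_to_ycbcr420(rgb):
    """rgb: (H, W, 3) float array with even H and W."""
    R, G, B = rgb[..., 0], rgb[..., 1], rgb[..., 2]
    Y = 0.299 * R + 0.587 * G + 0.114 * B        # Eq. (1)
    Cb = -0.169 * R - 0.331 * G + 0.500 * B
    Cr = 0.500 * R - 0.419 * G - 0.081 * B
    # Average each 2x2 block of chroma samples (4:2:0 subsampling).
    sub = lambda c: c.reshape(c.shape[0] // 2, 2, c.shape[1] // 2, 2).mean(axis=(1, 3))
    return Y, sub(Cb), sub(Cr)
```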
2.2 Prediction

The second functional block is prediction. Prediction is usually the most computation-intensive part of current video coding standards. This unit tries to find the similarity inside a video sequence. Prediction techniques can be categorized into temporal prediction and spatial prediction. Temporal prediction explores the temporal similarity between consecutive frames. Without scene change, the objects in a video are almost the same; consecutive images differ only a little due to object movement. Therefore, we can predict the current frame well from the previous frame. Temporal prediction is also called inter prediction or inter-frame prediction. Spatial prediction explores the spatial similarity between neighboring pixels. For a region with a smooth texture, neighboring pixels are very similar. Therefore, each pixel can be well predicted from the surrounding ones. Spatial prediction is also called intra prediction or intra-frame prediction. Simple examples of prediction are shown in Fig. 3. With prediction, the corresponding predictor is subtracted from each pixel in the current frame. For the temporal prediction in Fig. 3(b), the predictor for each pixel in the current frame is the pixel at the same location in the previous frame, as shown in Eq. 2:

$$Predictor_{x,y,t} = Pixel_{x,y,t-1} \qquad (2)$$
Pixel_{x,y,t} and Predictor_{x,y,t} are respectively the values of the original and predicted pixels at the coordinate (x, y) of the frame at time t. The coordinate (0, 0) is set at the upper-left corner of an image. For the spatial prediction in Fig. 3(c), the predictor is the average of the upper and left pixels, as shown in Eq. 3:
$$Predictor_{x,y,t} = \frac{Pixel_{x-1,y,t} + Pixel_{x,y-1,t}}{2} \qquad (3)$$
As we can see, most of the pixel difference values (also called residues) after prediction are close to zero. Spatial prediction is better for videos with less texture, like the Foreman sequence. On the other hand, temporal prediction is better for videos with less motion, like the Weather sequence. If prediction is accurate, the residues are almost all zero. In this condition, the entropy of the image is low, and thus entropy coding can achieve good coding performance.
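A minimal sketch of the two predictors of Eqs. (2) and (3) on grayscale frames; border pixels, which lack an upper or left neighbor, are left unpredicted here for simplicity:

```python
import numpy as np

def temporal_residues(cur, prev):
    # Eq. (2): predictor is the co-located pixel of the previous frame.
    return cur.astype(int) - prev.astype(int)

def spatial_residues(cur):
    # Eq. (3): predictor is the average of the upper and left pixels.
    c = cur.astype(int)
    pred = np.zeros_like(c)
    pred[1:, 1:] = (c[:-1, 1:] + c[1:, :-1]) // 2
    return c - pred
```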
Fig. 3 Illustration of prediction. After prediction, the whiter/blacker pixels mean more positive/negative residues. The gray pixels are residues close to zero. (a) Video sequences "Foreman" (above) and "Weather" (below); (b) Temporal prediction; (c) Spatial prediction.
2.2.1 Temporal Prediction

The above temporal prediction scheme is simple but not accurate, especially for videos with moving objects. The most frequently used technique for temporal prediction is Motion Estimation (ME), illustrated in Fig. 4. In most video coding standards, an image is first partitioned into MacroBlocks (MBs), usually of size 16 × 16 pixels. Each MB in the current frame searches for a best matching block in the reference frame. Then, the corresponding pixel values of the best matching block are subtracted from the pixel values of the current MB to generate the residual block. This process is called motion compensation. The Motion Vector (MV) information also needs to be coded. On the decoder side, the MV information is required to get the prediction data in the reference frame. Then,
the prediction data are added to the residual data to generate the reconstructed image. There are many kinds of motion estimation algorithms; please see [9] for the details.
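A minimal full-search block-matching sketch using the sum of absolute differences (SAD) as the matching criterion; the block size, search range, and names are illustrative:

```python
import numpy as np

def motion_estimate(cur, ref, y, x, N=16, search=8):
    """Full-search ME for the MB at (y, x): returns best MV and its SAD."""
    block = cur[y:y+N, x:x+N].astype(int)
    best_sad, best_mv = None, (0, 0)
    for dy in range(-search, search + 1):
        for dx in range(-search, search + 1):
            ry, rx = y + dy, x + dx
            if ry < 0 or rx < 0 or ry + N > ref.shape[0] or rx + N > ref.shape[1]:
                continue                     # candidate outside the frame
            sad = np.abs(block - ref[ry:ry+N, rx:rx+N].astype(int)).sum()
            if best_sad is None or sad < best_sad:
                best_sad, best_mv = sad, (dy, dx)
    return best_mv, best_sad
```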
Fig. 4 Illustration of motion estimation.
2.2.2 Spatial Prediction

Here we introduce a spatial prediction technique called intra prediction, which is adopted in H.264/AVC. Intra prediction explores the spatial similarity in a region with a regular texture. It utilizes the boundary pixels of the neighboring pre-coded blocks to predict the current block. The Intra 4 × 4 modes of H.264/AVC are illustrated in Fig. 5. Each MB is partitioned into 16 4 × 4 blocks, and each 4 × 4 block is predicted from the 13 left and top boundary pixels. There are 9 Intra 4 × 4 modes with different predictive directions from the neighboring pixels. The mode whose prediction data are most similar to the current block is chosen. There is another scheme called Intra 16 × 16, as shown in Fig. 6. In general, Intra 4 × 4/16 × 16 modes are suitable for regions with more/less detailed texture. Please refer to [10] for a detailed introduction to intra prediction in H.264/AVC.

2.2.3 Coding Structure

In a video coding system, a video sequence is partitioned into Groups-Of-Pictures (GOPs). Each GOP contains three kinds of frames—I-frames (Intra Coded Frames), P-frames (Predictive Coded Frames), and B-frames (Bidirectionally Predictive Coded Frames). For I-frames, only spatial prediction is supported. For P-frames and B-frames, both spatial and temporal prediction are supported. P-frames can only use the preceding frames for temporal prediction (a.k.a. forward prediction). B-frames can use both the preceding and the following (backward prediction) frames for temporal prediction. In general, B-frames can achieve the best coding performance because the most data are utilized for prediction. However, B-frames require more memory to store the reference frames and more computation to find the best prediction
Fig. 5 Illustration of 9 kinds of Intra 4 × 4 modes for luma blocks in H.264/AVC (a) 13 boundary pixels and the predictive directions for different modes; (b) Examples of the prediction data with real images. 0 (vertical)
Fig. 6 Illustration of 4 Intra 16 × 16 modes in H.264/AVC (a) The predictive directions; (b) Examples of the prediction data with real images.
mode. In a GOP, the first frame is always an I-frame. We can use two parameters, N and M, to specify the coding structure of a GOP. N is the GOP size; it represents the distance between two I-frames. M represents the distance between two anchor frames (I-frames or P-frames); this also means there are (M−1) B-frames between two anchor frames. In Fig. 7, a GOP with parameters N=9 and M=3 is shown. When a bitstream error occurs, the error propagates through the temporal prediction paths in a GOP. An I-frame does not require other reference frames to reconstruct the image. Therefore, error propagation is stopped after I-frames.
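The frame-type pattern of a GOP follows directly from N and M, as the small sketch below shows for the N=9, M=3 example:

```python
def gop_pattern(N=9, M=3):
    """First frame is I; every M-th frame is an anchor (P), with
    (M-1) B-frames in between; the pattern repeats every N frames."""
    return ['I' if i == 0 else ('P' if i % M == 0 else 'B') for i in range(N)]

print(''.join(gop_pattern()))   # -> IBBPBBPBB (the next GOP starts with I)
```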
Fig. 7 Illustration of a Group-Of-Pictures (GOP).
2.3 Transform and Quantization

In an image, neighboring pixels are usually correlated. To further remove the spatial redundancy, a transformation is performed after prediction to convert a block of residual pixels into a more compact representation. The Discrete Cosine Transform (DCT) [11] is adopted in most video coding standards. The N × N 2-D DCT is defined as Eq. 4:

$$F(u,v) = \frac{2}{N}\, C(u)\, C(v) \sum_{x=0}^{N-1} \sum_{y=0}^{N-1} f(x,y)\, \cos\frac{(2x+1)u\pi}{2N}\, \cos\frac{(2y+1)v\pi}{2N} \qquad (4)$$

$$C(u), C(v) = \begin{cases} \frac{1}{\sqrt{2}}, & \text{if } u, v = 0 \\ 1, & \text{otherwise} \end{cases}$$
F(u, v) is the DCT value at the location (u, v) and f(x, y) is the pixel value at the location (x, y). The first and second cosine terms are the vertical and horizontal cosine basis functions. With the DCT, energy is gathered into the low-frequency parts, as shown in Fig. 8. The entropy of a block is reduced, which leads to better coding efficiency in entropy coding.
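A direct (and deliberately slow, O(N^4)) sketch of Eq. (4); practical coders use fast separable transforms instead:

```python
import numpy as np

def dct2(f):
    """2-D DCT of a square N x N block per Eq. (4)."""
    N = f.shape[0]
    C = lambda u: 1 / np.sqrt(2) if u == 0 else 1.0
    idx = np.arange(N)
    F = np.empty((N, N))
    for u in range(N):
        for v in range(N):
            cu = np.cos((2 * idx + 1) * u * np.pi / (2 * N))  # vertical basis
            cv = np.cos((2 * idx + 1) * v * np.pi / (2 * N))  # horizontal basis
            F[u, v] = (2 / N) * C(u) * C(v) * (np.outer(cu, cv) * f).sum()
    return F
```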
Fig. 8 Illustration of the Discrete Cosine Transform: (a) the original Lena image; (b) after the transform, energy is compacted into the low-frequency components (the upper-left side). White and black pixels have high energy; gray pixels have low energy.
Quantization [12] is performed after transformation. It discards some of the information in a block and leads to lossy coding. A simple example of uniform quantization is shown in Fig. 9. There are several decision boundaries, like x1 and x2. All input values between two decision boundaries are output as the same value. For example, the output value is y1 if the input value lies inside the interval x1 < X < x2. In this example, y1 is set as the average of x1 and x2. The quantization step is the distance between two decision boundaries. In a video coding system, the quantization step is adjusted to control the data rate. A larger quantization step leads to a lower data rate but poorer visual quality, and vice versa. In addition, the human visual system is less sensitive to high-frequency components, so high-frequency components are usually quantized more heavily with little visual quality degradation. This is another example of removing perceptual redundancy for data compression.
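A minimal sketch of such a uniform quantizer, with reconstruction at the midpoint of each interval:

```python
def quantize(x, step):
    """Uniform quantization: returns the coded index and the
    reconstructed value (the midpoint of the decision interval)."""
    index = round(x / step)          # what the encoder actually codes
    return index, index * step       # decoder reconstruction

print(quantize(7.3, 2.0))   # -> (4, 8.0): 7.3 lies in [7, 9), so output 8.0
```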
Fig. 9 Illustration of the input-to-output mapping of uniform quantization.
2.4 Entropy Coding

The last step of video coding is entropy coding, which translates symbols into the bitstream. Symbols can be the transformed and quantized residual values, the MVs, or the prediction modes. From information theory, we know that the average length of the code is bounded below by the entropy of the information source. This is why prediction, transformation, and quantization techniques are adopted in video coding systems to reduce the entropy of the encoded data. The entropy H is defined as

$$H = -\sum_i P(A_i) \log_2 P(A_i) \qquad (5)$$
A_i is the i-th symbol and P(A_i) is its probability. In addition, the optimal code length of a symbol is −log2 P(A_i) bits. Therefore, we have to build a probability model for each kind of symbol. Shorter codewords are assigned to frequently appearing symbols; as a result, the final bitstream is shorter.
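A small sketch of Eq. (5) for the five-symbol source used in the following examples; the probabilities are reconstructed from the numbers quoted in the text (e.g., the 1.737-bit optimal length of B implies P(B) = 0.30):

```python
import math

probs = {'A': 0.10, 'B': 0.30, 'C': 0.25, 'D': 0.12, 'E': 0.23}
H = -sum(p * math.log2(p) for p in probs.values())          # Eq. (5)
print(f"H = {H:.3f} bits/symbol")
print(f"optimal length of B = {-math.log2(probs['B']):.3f} bits")   # 1.737
```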
Huffman coding is a famous entropy coding scheme. An example of Huffman coding is shown in Fig. 10. At the start, the symbols are sorted by their probability values from low to high. At each step, we choose the two symbols with the lowest probabilities. In this example, symbol A and symbol D are chosen first and combined into a new symbol {A, D} with probability 0.22. Symbol A, with the lower probability, is put on the left side and assigned a bit 0; symbol D, with the higher probability, is put on the right side and assigned a bit 1. Then, symbol {A, D} and symbol E are chosen to form a symbol {{A, D}, E} with probability 0.45. Symbol {A, D} is assigned a bit 0 and symbol E is assigned a bit 1. With this process, we can finally build a codeword tree as shown in Fig. 10(b). The final codeword of each symbol is obtained by traversing the tree from the root to the leaf. For example, the codeword of symbol D is 001. To encode BABCE, the bitstream is 11000111001. For Huffman coding, the codeword length is always an integer number of bits. Therefore, Huffman coding cannot achieve the best coding performance in many conditions. For example, the optimal codeword length for symbol B is 1.737 bits according to Eq. 5, not the 2 bits assigned by Huffman coding as shown in Fig. 10(a).
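A compact heap-based sketch of the construction just described; with the probabilities assumed above, it reproduces the codewords of Fig. 10 (A=000, D=001, E=01, C=10, B=11):

```python
import heapq
from itertools import count

def huffman(probs):
    """Return a prefix-free codebook {symbol: bitstring}."""
    tie = count()                                    # tie-breaker for heapq
    heap = [(p, next(tie), {s: ''}) for s, p in probs.items()]
    heapq.heapify(heap)
    while len(heap) > 1:
        p0, _, lo = heapq.heappop(heap)              # lower-probability subtree -> bit 0
        p1, _, hi = heapq.heappop(heap)              # higher-probability subtree -> bit 1
        merged = {s: '0' + c for s, c in lo.items()}
        merged.update({s: '1' + c for s, c in hi.items()})
        heapq.heappush(heap, (p0 + p1, next(tie), merged))
    return heap[0][2]

probs = {'A': 0.10, 'B': 0.30, 'C': 0.25, 'D': 0.12, 'E': 0.23}
code = huffman(probs)
print(''.join(code[s] for s in 'BABCE'))   # -> 11000111001
```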
Fig. 10 Illustration of Huffman coding (a) List of the symbols, probability values, and the final codeword; (b) The process of Huffman code generation.
Arithmetic coding is another famous entropy coding scheme. Unlike Huffman coding, arithmetic coding does not assign a codeword to each symbol but represents the whole data as a single fractional number between 0 and 1. Arithmetic coding divides the solution space into intervals according to the probability values of the symbols, as shown in Fig. 11(a). The process of arithmetic coding for BABCE is shown in Fig. 11(b). At the start, the current interval is [0.00, 1.00). The first symbol is B, so the sub-interval [0.10, 0.40) representing B becomes the next current interval. Then, the current interval becomes [0.10, 0.13) for symbol A, and so on. At last, the interval becomes [0.1083325, 0.10885). The lower and upper bounds of the final interval are 0.00011011101... and 0.00011011110... in binary representation. The fractional number finally chosen to encode BABCE is 0.0001101111 (0.10839844). The encoded bitstream with arithmetic coding is 0001101111.
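The interval narrowing can be sketched in a few lines; the per-symbol intervals below are chosen to be consistent with the numbers quoted in the text:

```python
# Cumulative intervals for A, B, C, D, E (probabilities 0.10, 0.30,
# 0.25, 0.12, 0.23, as assumed in the entropy example above).
intervals = {'A': (0.00, 0.10), 'B': (0.10, 0.40), 'C': (0.40, 0.65),
             'D': (0.65, 0.77), 'E': (0.77, 1.00)}

lo, hi = 0.0, 1.0
for s in 'BABCE':
    s_lo, s_hi = intervals[s]
    lo, hi = lo + (hi - lo) * s_lo, lo + (hi - lo) * s_hi
    print(s, (lo, hi))
# Final interval -> [0.1083325, 0.10885); any number inside encodes BABCE.
```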
Fig. 11 Illustration of arithmetic coding (a) List of the symbols, probability values, and the partitioned intervals; (b) The process of arithmetic coding for BABCE.
3 Emerging Video Applications and Corresponding Coding Systems

3.1 HDTV Applications and H.264/AVC

With the progress in video sensors, storage devices, communication systems, and display devices, the style of multimedia applications changes. The prevalence of full HD (High Definition) camcorders and digital full HD TVs shows the demand for video services with high resolution and high quality. In order to provide better visual quality, the video coding standards for HDTV applications should be optimized for coding performance. Video coding standards with better coding performance can provide better visual quality under the same transmitted data rate. In 2003, H.264/AVC was standardized by the JVT to combine the state-of-the-art coding tools for optimization of coding performance. Then, H.264/AVC High Profile [13] [14] was developed in 2005 for the incoming HDTV applications. H.264/AVC High Profile has been adopted into HD-DVD and Blu-ray Disc. Compared to MPEG-2, which is adopted in DVD, H.264/AVC High Profile can provide comparable subjective visual quality at only one-third of the data rate [13]. Here, we show some advanced coding tools of H.264/AVC. For inter (temporal) prediction, Multiple Reference Frames (MRF) [15] are supported. As shown in Fig. 12, more than one prior reconstructed frame can be used for ME. Each MB can choose its own best matching block in different reference frames. In addition, B-frames are also adopted; that is to say, not only forward prediction but also backward prediction can utilize more than one reference frame. The above tools can improve inter prediction accuracy when some parts of the current image are unseen in the first preceding image (e.g., uncovered background). Variable Block Sizes is another useful technique. It is helpful for MBs which include multiple objects moving in different directions. Each MB is first split into 4 kinds of partitions—
16 × 16, 16 × 8, 8 × 16, and 8 × 8. If the 8 × 8 partition is selected, each block is further split into 4 kinds of sub-partitions—8 × 8, 8 × 4, 4 × 8, and 4 × 4. With this scheme, each partition can have its own MV for the best matching block. Smaller partitions can attain lower prediction error but need a higher bitrate to transmit the MV information. Therefore, there is a trade-off between rate and distortion in mode decision (a small sketch of this follows the paragraph). Please refer to [16] [17] for the issue of rate-distortion optimization. The last coding tool for inter prediction is Fractional Motion Estimation (FME). Sometimes, the moving distance of an object is not on the integer-pixel grid of a frame. Therefore, H.264/AVC provides FME up to quarter-pixel resolution. The half-pixel and quarter-pixel prediction data are interpolated from the neighboring integer pixels in the reference frame. Please refer to [10] for the details.
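A minimal sketch of Lagrangian mode decision, picking the partition with the smallest cost J = D + λR; the candidate modes, cost numbers, and λ are purely illustrative:

```python
def best_mode(candidates, lam):
    """candidates: iterable of (mode_name, distortion, rate_bits)."""
    return min(candidates, key=lambda m: m[1] + lam * m[2])

# Smaller partitions reduce distortion but spend more bits on MVs.
modes = [('16x16', 1800, 12), ('16x8', 1500, 22), ('8x8', 1100, 48)]
print(best_mode(modes, lam=20.0))   # -> ('16x8', 1500, 22)
```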
Fig. 12 Illustration of Multiple Reference Frame (MRF) scheme.
For intra (spatial) prediction, there are two modes defined in the H.264/AVC Baseline Profile, as introduced in Sec. 2.2. The Intra 4 × 4 mode is suitable for regions with detailed texture; on the contrary, the Intra 16 × 16 mode is suitable for regions with smooth texture. In H.264/AVC High Profile, another intra prediction mode, Intra 8 × 8, is added for textures of medium size. Each MB is partitioned into 8 × 8 blocks. There are also 9 modes provided, and the predictive directions are the same as those of the Intra 4 × 4 modes. For entropy coding, H.264/AVC adopts Context-based Adaptive Variable Length Coding (CAVLC) and Context-based Adaptive Binary Arithmetic Coding (CABAC) [18]. Compared to the previous standards, CAVLC and CABAC adaptively adjust probability models according to the context information. Accurate probability models lead to better coding performance for entropy coding. In addition, CABAC can save 10% of the bitrate compared to CAVLC, because arithmetic coding can achieve better coding performance than variable length coding. There are still some other coding tools in H.264/AVC High Profile, like Adaptive Transform Block-size and Perceptual Quantization Scaling Matrices. Please refer to [13] [14] for the details.
3.2 Streaming and Surveillance Applications and Scalable Video Coding

There are various multimedia platforms with different communication channels and display sizes. The variety of multimedia systems makes scalability important for video standards in order to support various demands. Scalability of video coding means that one encoded bitstream can be partially or fully decoded to generate several kinds of videos; decoding more data contributes higher spatial/temporal resolution or better visual quality. For a streaming service, the clients may be a mobile phone, a PC, or an HD TV. Without scalability, we need to encode a video into different bitstreams for different requirements, which is not efficient for storage. Another key application for scalable video coding is surveillance. Due to the limited storage size, less important data should be removed from the bitstream. The importance of the information in a surveillance video decays with time. Under scalable video coding, we can truncate the bitstream to reduce the resolution or visual quality of the older videos, rather than simply dumping them as in conventional video coding systems. Scalability of video coding also means more truncation points are provided for the users. For these requirements, the H.264/AVC Scalable Extension (a.k.a. Scalable Video Coding, SVC) [19] was established and finalized in 2007. The new video coding framework is depicted in Fig. 13. A unified scalable bitstream is encoded only once, but it can be adapted to support various multimedia applications with different specifications, from mobile phones and personal computers to HDTV.
Fig. 13 The concept and framework of Scalable Video Coding.
In general, three kinds of video scalability should be included—temporal scalability, spatial scalability, and quality scalability. Temporal scalability provides different frame rates (temporal resolution). In SVC, temporal scalability is achieved by Hierarchical B-frames (HB), as shown in Fig. 14. The decoding order of the frames in a GOP is from A, B1, B2, to B3. More decoded frames lead to a higher frame rate. In fact, HB was first proposed in H.264/AVC High Profile as an optional coding tool to improve coding efficiency, because it can explore more
temporal redundancy in the whole GOP. This scheme induces large encoding and decoding delay and is thus more suitable for storage applications.
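The decomposition of Fig. 14(b) can be expressed as a temporal-layer assignment: dropping the highest remaining layer halves the frame rate each time. A small sketch for a GOP of 8:

```python
def temporal_layer(i, gop=8):
    """Temporal layer of frame i inside the GOP (0 = anchor frame A)."""
    if i == 0:
        return 0
    level, step = 1, gop // 2
    while i % step:
        level += 1
        step //= 2
    return level

print([temporal_layer(i) for i in range(8)])  # -> [0, 3, 2, 3, 1, 3, 2, 3]
# Decoding layers 0-1 keeps every 4th frame; adding layer 2 doubles the
# frame rate, and adding layer 3 restores the full rate.
```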
Fig. 14 Illustration of the Hierarchical B-frame technique. A is the anchor frame (I-frame or P-frame). The superscript number of a B-frame represents the encoding order. (a) Three-level Hierarchical B-frame (HB) scheme whose GOP size is 8 frames. The B1 frame utilizes the anchor frames as its reference frames. The B2 frames utilize the anchor and B1 frames as the reference frames. At last, the reference frames of the B3 frames are the anchor, B1, and B2 frames. (b) The decomposition of HB for temporal scalability.
SVC provides spatial scalability (different frame resolutions) with a multi-layer coding structure, as shown in Fig. 15. If the pictures of the different layers are independently coded, there is redundancy between the spatial layers. Therefore, three techniques of inter-layer prediction are included in SVC. Inter-layer motion prediction utilizes the MB partition and MVs of the base layer to predict the information in the enhancement layer, as shown in Fig. 16. Inter-layer residual prediction upsamples the co-located residues of the base layer as the predictors for the residues of the current MB, as depicted in Fig. 17. In this figure, the absolute values of the residues are shown; pixels with black color are close to zero. Inter-layer intra prediction upsamples the co-located intra-coded signals as the prediction data for the enhancement layer. Quality (SNR) scalability provides various decoded bit-rate truncation points so that users can choose different visual quality. It can be realized by three main strategies in SVC—Coarse Grain Scalability (CGS), Medium Grain Scalability (MGS), and Fine Grain Scalability (FGS). CGS adds a layer with a smaller quantization parameter, like an additional spatial layer with better quality and the same frame resolution; however, CGS can only provide a few pre-defined quality points. FGS focuses on providing arbitrary quality levels according to the users' channel capacity. Each FGS enhancement layer represents a refinement of the residual signals. It is designed through multiple scans of the entire frame so that the quality improvement can be distributed over the whole frame. MGS was recently proposed in the H.264/AVC
Fig. 15 The video coding flow of multi-layer coding scheme for SVC.
Fig. 16 Inter-layer motion prediction (a) MB partition scaling; (b) motion vector scaling.
Fig. 17 Illustration of inter-layer residual prediction.
Scalable Extension to provide quality scalability between CGS and FGS. Please refer to [20] for the details.
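A minimal sketch of inter-layer residual prediction in the spirit of Fig. 17, using nearest-neighbor 2x upsampling for simplicity (SVC specifies a particular interpolation filter):

```python
import numpy as np

def inter_layer_residual(enh_residues, base_residues):
    """Predict enhancement-layer residues from upsampled base-layer
    residues; what remains is the final residue to be coded."""
    up = base_residues.repeat(2, axis=0).repeat(2, axis=1)  # 2x upsample
    return enh_residues - up
```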
3.3 3D Video Applications and Multiview Video Coding

Over the past 80 years, the TV has evolved from monochrome to color, and then to HDTV. People still want a more vivid perceptual experience
from the TV. Stereo vision is natural for human beings: stereo parallax provides us with depth perception for scene understanding. In recent years, three-dimensional TV (3DTV) has become popular, because more and more advanced 3D capturing devices (e.g., camera arrays) [21] [22] and 3D displays [23] have been introduced. Multiview video, as an extension of 2D video technology, is one of the potential 3D applications. Multiple 2D views are displayed at the same time, as shown in Fig. 18. Viewers can perceive different stereo videos at different viewpoints to reconstruct the 3D scene. In order to display multiple views at the same time, the required transmitted data are much more than those of a traditional single-view video sequence. For example, an 8-view video requires 8 times the bandwidth of a single-view video. Therefore, an efficient data compression system for multiview video is required. A Multiview Video Coding (MVC) system is shown in Fig. 19.
Fig. 18 Multiview video data set is composed of two to several view channels.
Fig. 19 Overview of an MVC system. Redrawn from [24]
Video Compression
119
The MPEG-2 Multiview Profile (MVP) [25] is the first standard developed to describe the bitstream conformance of stereoscopic video. Figure 20 illustrates its two-layer coding structure. In this coding structure, the "temporal scalability" supported by the MPEG-2 Main Profile is utilized: one view is encoded as the base layer, and the other is encoded as the enhancement layer. The frames in the enhancement layer are additionally predicted via Disparity Estimation (DE). The process of DE is similar to that of ME. The only difference is that DE searches for the best matching block in the neighboring views at the same time slot, whereas ME searches for the best matching block in the preceding or following frames of the same view. The concept of MPEG-2 MVP is simple, and the only additional functional block is a processing unit for syntax elements in entropy coding. However, MPEG-2 MVP cannot provide good coding efficiency, because the prediction tools defined in MPEG-2 are poor compared with state-of-the-art coding standards such as H.264/AVC [26].
Fig. 20 Two-layer coding structure of the MPEG-2 Multiview Profile.
In recent years, the MPEG 3D Audio/Video (3DAV) group has worked toward the standardization of MVC [27], which adopts H.264/AVC as the base layer due to its outstanding coding performance. The hybrid ME and DE inter prediction technique is also adopted. To further improve the coding performance, MVC extends the original hierarchical-B coding structure to inter-view prediction for multiview video sequences, as shown in Fig. 21. Up to 50% of the bitrate can be saved, depending on the characteristics of the multiview video sequences [24].
4 Conclusion and Future Directions

With the improvement of video sensor and display technologies, more and more video applications are emerging. Due to the limited capacity of storage devices and communication channels, video compression plays a key role in driving these applications. H.264/AVC is the most important coding standard of recent years due to its outstanding coding performance. Some other coding standards, like Scalable
Fig. 21 Example of the MVC hierarchical-B coding structure with the extension of inter-view prediction. The horizontal direction represents different time slots and the vertical direction represents different views. Redrawn from [24].
Video Coding (SVC) and Multiview Video Coding (MVC), are developed based on H.264/AVC, adding specific coding tools targeting their own applications. The demand for video services with better visual quality will keep growing. Therefore, video coding systems for super-high resolution, high frame rate, high dynamic range, and large color gamut will also be developed in the near future.
References
1. Douglas A. Kerr. Chrominance Subsampling in Digital Images. Available: http://doug.kerr.home.att.net/pumpkin/Subsampling.pdf, November 2005.
2. Coding of Moving Pictures and Associated Audio for Digital Storage Media at up to about 1.5 Mbit/s–Part 2: Video. ISO/IEC 11172-2 (MPEG-1 Video), ISO/IEC JTC 1, March 1993.
3. Generic Coding of Moving Pictures and Associated Audio Information–Part 2: Video. ITU-T Rec. H.262 and ISO/IEC 13818-2 (MPEG-2 Video), ITU-T and ISO/IEC JTC 1, May 1996.
4. Video Codec for Audiovisual Services at p × 64 Kbit/s. ITU-T Rec. H.261, ITU-T, November 1990.
5. Video Coding for Low Bit Rate Communication. ITU-T Rec. H.263, ITU-T, November 1995.
6. Coding of Audio-Visual Objects–Part 2: Visual. ISO/IEC 14496-2 (MPEG-4 Visual), ISO/IEC JTC 1, April 1999.
7. Advanced Video Coding for Generic Audiovisual Services. ITU-T Rec. H.264 and ISO/IEC 14496-10 (MPEG-4 AVC), ITU-T and ISO/IEC JTC 1, May 2003.
8. A. Joch, F. Kossentini, H. Schwarz, T. Wiegand, and G. J. Sullivan. Performance comparison of video coding standards using Lagrangian coder control. In Proc. IEEE International Conference on Image Processing (ICIP), pages 501–504, 2002.
9. Yu-Wen Huang, Ching-Yeh Chen, Chen-Han Tsai, Chun-Fu Shen, and Liang-Gee Chen. Survey on block matching motion estimation algorithms and architectures with new results. Journal of VLSI Signal Processing, 42(3):297–320, March 2006.
10. Thomas Wiegand, Gary J. Sullivan, Gisle Bjøntegaard, and Ajay Luthra. Overview of the H.264/AVC video coding standard. IEEE Transactions on Circuits and Systems for Video Technology, 13(7):560–576, July 2003.
11. K. Ramamohan Rao and P. Yip. Discrete Cosine Transform: Algorithms, Advantages, Applications. Academic Press, August 1990.
12. Robert M. Gray and David L. Neuhoff. Quantization. IEEE Transactions on Information Theory, 44(6):2325–2383, October 1998.
13. G. Sullivan, P. Topiwala, and A. Luthra. The H.264 advanced video coding standard: Overview and introduction to the fidelity range extensions. In Proc. SPIE Conference on Applications of Digital Image Processing XXVII, August 2004.
14. D. Marpe et al. H.264/MPEG4-AVC fidelity range extensions: Tools, profiles, performance, and application areas. In Proc. IEEE International Conference on Image Processing (ICIP), volume 1, pages 593–596, September 2005.
15. T. Wiegand and B. Girod. Multi-Frame Motion-Compensated Prediction for Video Transmission. Kluwer Academic Publishers, September 2001.
16. G. J. Sullivan and T. Wiegand. Rate-distortion optimization for video compression. IEEE Signal Processing Magazine, 15(6):74–90, November 1998.
17. T. Wiegand, H. Schwarz, A. Joch, F. Kossentini, and G. J. Sullivan. Rate-constrained coder control and comparison of video coding standards. IEEE Transactions on Circuits and Systems for Video Technology, 13(7):688–703, July 2003.
18. D. Marpe, H. Schwarz, and T. Wiegand. Context-based adaptive binary arithmetic coding in the H.264/AVC video compression standard. IEEE Transactions on Circuits and Systems for Video Technology, 13(7):620–644, July 2003.
19. J. Reichel, H. Schwarz, and M. Wien. Working Draft 4 of ISO/IEC 14496-10:2005/AMD3 Scalable Video Coding. ISO/IEC JTC1/SC29/WG11 and ITU-T SG16 Q.6, Doc. N7555, January 2005.
20. Heiko Schwarz, Detlev Marpe, and Thomas Wiegand. Overview of the scalable video coding extension of the H.264/AVC standard. IEEE Transactions on Circuits and Systems for Video Technology, 17:1103–1120, September 2007.
21. B. S. Wilburn, M. Smulski, H.-H. K. Lee, and M. A. Horowitz. Light field video camera. In Proceedings of Media Processors, SPIE Electronic Imaging, volume 4674, pages 29–36, 2002.
22. C. Zhang and T. Chen. A self-reconfigurable camera array. In Eurographics Symposium on Rendering, 2004.
23. Nick Holliman. 3D display systems. In Handbook of Optoelectronics, chapter 3. Taylor and Francis, 2006.
24. Philipp Merkle, Aljoscha Smolic, Karsten Müller, and Thomas Wiegand. Efficient prediction structures for multiview video coding. IEEE Transactions on Circuits and Systems for Video Technology, 17(11):1461–1473, November 2007.
25. ISO/IEC JTC 1/SC 29/WG11 N1088. Proposed draft amendment No. 3 to 13818-2 (multiview profile). MPEG-2, 1995.
26. S.-Y. Chien, S.-H. Yu, L.-F. Ding, Y.-N. Huang, and L.-G. Chen. Efficient stereo video coding system for immersive teleconference with two-stage hybrid disparity estimation algorithm. In Proc. of IEEE International Conference on Image Processing, 2003.
27. ISO/IEC JTC1/SC29/WG11 N6501. Requirements on multi-view video coding. 2004.
Low-power Wireless Sensor Network Platforms Jukka Suhonen, Mikko Kohvakka, Ville Kaseva, Timo D. Hämäläinen, and Marko Hännikäinen
Abstract A wireless sensor network (WSN) is a technology comprising up to thousands of autonomous, self-organizing nodes that combine environmental sensing, data processing, and wireless multihop ad-hoc networking. These features enable monitoring, object tracking, and control functionality. Potential applications include environmental and condition monitoring, home automation, security and alarm systems, industrial monitoring and control, military reconnaissance and targeting, and interactive games. This chapter describes the low-power WSN as a platform for signal processing by presenting the WSN services that can be used as building blocks for applications. It explains the implications of resource constraints and the expected performance in terms of throughput, reliability, and latency.
1 Characteristics of Low-power WSNs A WSN consists of nodes that are deployed in the vicinity of an inspected phenomenon [2], as shown in Fig. 1. In addition, a network may contain one or more sink nodes that request other nodes to perform measurements and collect the measured values for further use. Instead of sending raw data to the sink, a sensor node may collaborate with its neighbors or with nodes along the routing path to provide application results [43]. The sink node typically acts as a gateway to other networks and user interfaces [22]. The backbone infrastructure connected to a sink may contain components for data storage, visualization, and network control. Compared to traditional computer networks, WSNs have several unique characteristics, listed in the following. These characteristics do not apply to all WSNs, but many useful classes of WSNs share most or all of them.
• Network size and density: WSNs may consist of tens of thousands of nodes. The density of nodes can be high, depending on the application requirements for sensing coverage and robustness via redundancy.
Fig. 1 A typical WSN scenario.
• Communication paradigm: In WSNs, node identifiers are typically not important. Instead, WSNs are data-centric, which means that messages are not sent to individual nodes but to geographical locations or regions based on the data content.
• Application specific: A WSN is deployed to perform a specific task, e.g. environmental monitoring, target tracking, or intruder alerting. As a result, node platforms and communication protocols are designed for optimal performance in a certain application-dependent scenario. The application-specific behavior enables data aggregation, in-network processing, and decision making.
• Network lifetime: WSNs are typically deployed to observe physical phenomena that range in duration from fractions of a second to a few months or even several years. As replacing batteries is not feasible due to the large network size and deployment in possibly hazardous environments, nodes must optimize their energy usage to maximize network lifetime.
• Low cost: To allow cost-effective deployment of a large number of nodes, the cost of an individual sensor node should be minimized. Also, as recovering sensors after deployment may not be feasible in some application scenarios, sensors should be cheap enough to be considered disposable.
• Resource constraints: A typical WSN node combines low cost with small physical size and is battery powered. Thus, computation, communication, memory, and energy resources are very limited.
• Dynamic nature: Wireless communications are inherently unreliable due to environmental interference. The unreliability is especially evident in WSNs because of harsh operating conditions, e.g. outdoor environmental changes, node mobility, and nodes dying due to depleted energy sources. As a result, link breaks cause network dynamics even when nodes are stationary.
Table 1 Comparison of typical requirements in wireless computer networks and low-energy WSNs.

Requirement            Computer network       Low-energy WSN
Resource constraints   Low                    Very high (1-2 MIPS, 32-128 kB)
Adaptivity             Static                 Dynamic environment
Scalability            Moderate (10 nodes)    High (10000 nodes)
Latency                High (250 ms-1 s)      High-Low (1 s-1 hour)
Throughput             Very high (MB/s)       Low-Moderate (bit/s-kbit/s)
• Deployment: To avoid tedious network planning of a large number of nodes, WSNs are often randomly deployed. This necessitates network self-configuration and autonomous operation.
1.1 Quality of Service Requirements Quality of Service (QoS) is commonly expressed and managed by throughput, latency, jitter, and reliability. These QoS parameters also apply to WSNs, but their importance differs from that in legacy networks. The requirements of low-energy WSNs compared to traditional wireless computer networks, e.g. IEEE 802.11 Wireless LAN (WLAN), are summarized in Table 1. Sensing applications can tolerate high latency and low throughput, but reliability is particularly significant. In traditional computer networks, data is routed via highly reliable wired links, while only the end links may be wireless, utilizing e.g. cellular connections or WLAN. In WSNs, packets are forwarded over multiple wireless hops. On each wireless link, packet error rates (PER) of 10%-30% are common, which significantly decreases the end-to-end reliability. In addition to the traditional QoS metrics, other metrics can be identified for WSNs, as presented in Fig. 2. While the reliability metric denotes the probability of transferring a single measurement through the network, availability expresses the probability of receiving a new measurement from a node within a certain waiting period [49]. Data accuracy describes the consistency of measurements (results are the same in similar conditions) and the granularity of sensor values, sensing location, and time information. Security ensures that unauthorized parties do not gain access to or tamper with the sensed data. Mobility is important in tracking WSNs, as a node may be attached to moving objects. Due to the significance of network lifetime, energy efficiency is also considered a QoS parameter. Usually, the other parameters have a trade-off with energy efficiency, making it impossible to optimize all parameters at the same time. As an example, the QoS requirements for two cases are presented in Fig. 2(a) and Fig. 2(b). An environmental monitoring application sends sensor values periodically to a sink. As the sensed environment changes slowly, the measurement interval and thus throughput can be low, while missing a single sample is not critical. However,
Fig. 2 Quality of Service (QoS) parameters in WSNs. a Typical environmental monitoring network emphasizing energy efficiency. b A control network emphasizing low latency and reliability.
availability is still important, as too many samples must not be missed consecutively. In a control network, short messages are relayed infrequently between switches, lamps, and other accessories. Thus, the required bandwidth is low, but timely and reliable delivery of commands is required. Due to the diversity of applications and their contradictory requirements, a single solution is not suitable for every WSN application. Thus, the protocols and node platforms need to be tailored to meet the application requirements.
1.2 Services for Signal Processing Application A WSN offers the following services for signal processing applications.
• Environmental sensing: Each WSN node contains one or more physical sensors. Instead of accessing sensors directly, e.g. via the Inter-Integrated Circuit (I2C) bus, sensing services operate via standardized interfaces.
• Data processing and storage: Before the sensor values can be forwarded to a user, the values are preprocessed and stored locally. The limitations of computing and storage can be overcome by distributed computing services.
• Data transfer: Data transfer services allow collecting data for further use. The realized QoS is largely affected by the choice of networking protocols.
• Localization: A sensor value is naturally associated with a certain area. However, random deployment and mobility prevent deriving location information from source node identifiers, which necessitates either online (distributed) or offline (centralized) localization services.
• Time synchronization: Several WSN applications, such as event alerts and tracking, require exact timestamping of sensor events to compare the order of events. As low-cost sensor nodes do not have an accurate time source, an agreement on global time is achieved with a time synchronization service.
A WSN provides at least the sensing, data processing, and data transfer services. The localization and time synchronization services are usually not addressed in proposed WSN standards and thus need customized solutions.
2 Key Standards and Industry Specifications In the sensor industry, a vast number of sensors exist to measure physical parameters, such as temperature, pressure, humidity, illumination, gas, flow rate, strain, and acidity. Standardized sensor interfaces, data formats, and communication protocols are required to enable effective integration, access, fusion, and use of sensor-derived data. The goal is to allow sensors from different manufacturers to work together without human intervention and customization.
2.1 IEEE 1451 The IEEE 1451 standard family defines a set of open, network-independent communication interfaces for connecting transducers (sensors and actuators) to microprocessors, instrumentation systems, and networks. In IEEE 1451, a single sensor, an actuator, or a module comprising several transducers and any data conversion or signal conditioning (e.g. signal amplification or filtering) is referred to as a Transducer Interface Module (TIM). A Transducer Electronic Data Sheet (TEDS) describes a TIM and defines a set of commands to control and read data from the TIM. TEDS virtually eliminates error-prone, manual entering of data and system configuration, and allows transducers to be installed, upgraded, replaced, or moved on a plug-and-play basis. IEEE 1451.2 through IEEE 1451.6 provide interfaces for several standardized communication protocols. IEEE 1451.2 defines wired point-to-point communication. IEEE 1451.3 defines a distributed multi-drop system, where a large number of TIMs may be connected along a wired multi-drop bus. IEEE 1451.4 specifies mixed-mode communication protocols, which carry analog sensor values together with digital TEDS data. IEEE 1451.6 defines communication over a high-speed Controller Area Network (CAN) bus. IEEE 1451.5 defines wireless sensor interfaces and is thus the most closely related to WSNs. Supported communication technologies are IEEE 802.11a/b/g, IEEE 802.15.1, and IEEE 802.15.4.
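To make the role of a TEDS concrete, the sketch below shows the kind of metadata it carries, written as a C structure. This is purely illustrative: the field names, types, and sizes are assumptions and do not reflect the actual IEEE 1451 binary TEDS layout.

```c
/* Hypothetical sketch of the kind of metadata a Transducer Electronic
 * Data Sheet (TEDS) carries. Field names and sizes are illustrative,
 * not the IEEE 1451 binary encoding. */
#include <stdint.h>

struct teds_basic {
    uint32_t manufacturer_id;       /* identifies the vendor            */
    uint32_t model_number;          /* identifies the transducer model  */
    char     serial[16];            /* unit serial number               */
    uint8_t  transducer_type;       /* e.g. 0 = sensor, 1 = actuator    */
    float    range_min, range_max;  /* physical range of the channel    */
    float    sample_period_s;       /* minimum sampling interval        */
};
```

Because this self-description travels with the module, a host can configure itself from the TEDS instead of relying on manually entered parameters, which is what enables the plug-and-play installation described above.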
2.2 WSN Communication Standards The operating frequency bands and nominal data rates of WSN communication standards are listed in Table 2.
Table 2 The properties of WSN communication standards.

Standard        Frequency            Data rate
IEEE 802.15.4   868 MHz              20 kbps
                915 MHz              40 kbps
                2.4 GHz              250 kbps
ZigBee          (as IEEE 802.15.4)   (as IEEE 802.15.4)
WirelessHART    2.4 GHz              250 kbps
ISA100.11a      2.4 GHz              250 kbps
Z-Wave          865 MHz              40 kbps
                915 MHz              40 kbps
MiWi            2.4 GHz              250 kbps
ANT             2.4 GHz              1 Mbps
Each standard defines some combination of the PHYsical (PHY), Medium Access Control (MAC), Network (NWK), and Transport (TRP) layers. Application Support (APS) defines application profiles detailing the services, message formats, and methods required to access applications, thereby allowing interoperability between devices from different manufacturers. For security, Access Control Lists (ACLs) allow only certain nodes to participate in the network, while data encryption prevents unauthorized use of data. The listed standards use the 128-bit Advanced Encryption Standard (AES) for data encryption.
An IEEE 802.15.4 network supports three types of network devices: a Personal Area Network (PAN) coordinator, coordinators, and devices. The PAN coordinator initiates the network and often operates as a gateway to other networks. Coordinators collaborate with each other for data routing and network self-organization. Devices do not have data routing capability and can communicate only with coordinators.
The ZigBee standard defines network and application layers on top of IEEE 802.15.4. The network layer supports star, peer-to-peer, and cluster-tree topologies. A ZigBee network always has one device, referred to as the ZigBee coordinator, that controls the network. The coordinator is the central node in the star topology, the root of the tree in the tree topology, and can be located anywhere in the peer-to-peer topology. ZigBee defines a wide range of application profiles targeted at home and building automation, remote controls, and health care.
WirelessHART and ISA100.11a [20] are targeted at process industry applications, where process measurement and control have stringent requirements for end-to-end communication delay, reliability, and security. The standards have a similar operating principle, and their convergence is planned in ISA100.12. Both build on top of the IEEE 802.15.4 physical layer and utilize a TDMA MAC that employs network-wide time synchronization, channel hopping, and channel blacklisting. A centralized network manager is responsible for route updates and communication scheduling for the entire network.
Fig. 3 Software architecture of a WSN node.
However, as the centralized control of TDMA schedules limits the network size and the tolerance of network dynamics, the usability of these standards in WSNs is limited to static networks.
Z-Wave is targeted at the control of building automation and entertainment electronics. The maximum number of nodes in a network is 232. Supported network topologies are star and mesh. Z-Wave has been developed by over 120 companies, including Zensys, Intel, and Cisco.
MiWi, developed by Microchip, is a simpler version of ZigBee operating above an IEEE 802.15.4-compliant transceiver. MiWi is suitable for smaller networks having at most 1024 nodes. Supported network topologies are star and mesh. Compared to IEEE 802.15.4, MiWi operates only in the non-beacon-enabled (unsynchronized) mode, which means that coordinators must listen to the channel continuously for incoming packets, and their energy consumption is therefore high.
ANT, developed by Dynastream Innovations, is based on a star topology, but more complex topologies can be achieved by using several channels: each node can simultaneously be a master and a slave on different channels. Master nodes always receive, while slaves transmit when new data is available. A practical limit for network size is a few thousand nodes. The disadvantages of ANT are the high power consumption of master nodes and low scalability due to random access transmissions. These limit the usability of ANT in large multi-hop WSN applications.
3 Software Components The software components in WSNs include sensor operating systems and middleware, as shown in Fig. 3. Their purpose is to ease application development by providing network access and to support heterogeneous platforms by abstracting hardware access.
3.1 Sensor Operating Systems An operating system aims at easing the use of system resources. Its main functions are the concurrent execution and communication of multiple applications, access to and management of Input/Output (I/O) devices, permanent data storage, and the control and protection of system access between multiple users [47]. WSN operating systems are required to have an extremely small memory footprint while still providing the basic OS services. Furthermore, a WSN OS should support energy management to allow power saving and real-time task scheduling [27]. TinyOS [17], the most widely known OS for WSNs, uses the event-driven approach. An event-driven OS reacts to hardware interrupts that indicate e.g. reception of data from the transceiver, a timer event, or completed sensing. TinyOS was originally implemented for the Berkeley motes but has later been ported to many other platforms. Software is divided into components encapsulated in frames. Each component has a separate command handler for upper-layer requests and an event handler for lower-layer events. The processing is done in atomic tasks. Contiki is another event-based OS. It supports Internet Protocol (IP) routing and dynamic loading of application images, as an application program module can be linked to the OS kernel at runtime. Contiki implements support for preemptive multi-threading through a library on top of the event-handler kernel. Both Contiki and TinyOS are released as open source.
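The run-to-completion task model of such event-driven operating systems can be sketched as a simple task queue in C. This is not TinyOS or Contiki code; the queue depth and the mcu_sleep() hook are illustrative assumptions, and a real implementation must mask interrupts around queue updates.

```c
#include <stdint.h>

#define MAX_TASKS 8                  /* assumed queue depth */

typedef void (*task_fn)(void);

static task_fn task_queue[MAX_TASKS];
static uint8_t head, tail, count;

/* Post a task, typically from an interrupt handler; returns 0 if full.
 * (A real OS would disable interrupts around this update.) */
int post_task(task_fn t) {
    if (count == MAX_TASKS) return 0;
    task_queue[tail] = t;
    tail = (uint8_t)((tail + 1) % MAX_TASKS);
    count++;
    return 1;
}

/* Main loop: run each posted task to completion, then sleep until the
 * next interrupt (mcu_sleep() stands for a hypothetical power-down hook). */
void scheduler_loop(void) {
    for (;;) {
        while (count > 0) {
            task_fn t = task_queue[head];
            head = (uint8_t)((head + 1) % MAX_TASKS);
            count--;
            t();                     /* tasks are atomic: no preemption */
        }
        /* mcu_sleep(); */
    }
}
```

Because tasks never preempt each other, no per-task stacks or locks are needed, which is what keeps the memory footprint of this model so small.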
3.2 Middlewares Middleware is an application layer that consists of frameworks and interfaces that ease application development, runtime configuration, and management in WSNs [43]. Several types of middlewares have been proposed for WSNs: middleware interfaces, database middlewares, virtual machines, mobile agents, and application-driven middlewares [27]. The middleware interfaces aim to standardize hardware access, such as the transducer interface specification of IEEE 1451. Database middlewares overcome the memory constraints of individual sensor nodes by accessing the WSN as a distributed database. For example, TinyDB [31] is a query processing system implemented on top of TinyOS. It supports basic Structured Query Language (SQL) type query operations (e.g. a query of the form SELECT AVG(temp) FROM sensors SAMPLE PERIOD 30s, to use an illustrative syntax) and data aggregation for improving network energy efficiency. Virtual machines allow task propagation as byte code, enabling execution on heterogeneous sensor hardware. A mobile agent is an object that, in addition to the code, carries its state and data. A mobile agent makes its migration and processing decisions autonomously. Typically, a mobile agent operates on top of a virtual machine to obtain platform independence and small object size. The application-driven middlewares support task allocation, networking, and distributed computing. The component library of TinyOS includes application-driven middleware functionality, as it provides methods for network communication
Fig. 4 Sensor node hardware architecture.
and distributed services, and abstracts data acquisition, allowing the programmer to concentrate on implementing the sensing applications.
4 Hardware Platforms and Components Sensor node platforms implement the physical layer (hardware) of the protocol stack. The hardware activity, measured as the fraction of time the hardware is in an active state (processing data or receiving/transmitting a packet), may be below 1% in low data-rate monitoring applications. Thus, it is very important to minimize the power consumption in the idle and sleep modes. A general hardware architecture of a sensor node platform is presented in Fig. 4. The architecture can be divided into four subsystems:
• Communication subsystem enabling wireless communication,
• Computing subsystem allowing data processing and the management of node functionality,
• Sensing subsystem connecting the wireless sensor node to the outside world, and
• Power subsystem providing the system supply voltage.
4.1 Communication subsystem The communication subsystem consists of a wireless transceiver and an antenna. A wireless transceiver can be based on acoustic, optical, or Radio Frequency (RF) waves. Acoustic communication is typically used for underwater communications
or measuring distances based on time-of-flight measurements [3]. Its disadvantages are long and variable propagation delay, high path loss, noise, and very low data rate. Optical communication [54] has low energy consumption, especially in reception mode, and it can utilize a very small antenna. However, aligning a transmitter to a receiver is difficult or even impossible in large-scale WSN applications. RF communication combines the benefits of high data rate, long range, and nearly omnidirectional radiation, making it the most suitable communication technology for WSNs. Its disadvantages are larger antenna size and higher energy consumption compared to the optical technology. In general, an RF transceiver (radio) has four operation modes: transmit, receive, idle, and sleep. The radio is active in the transmit and receive modes, in which the power consumption is also the highest. In idle mode, most of the circuitry is shut down, but the transition to an active mode is fast. The lowest power consumption is achieved in sleep mode, when all circuitry is switched off. Most short-range radios used in WSNs operate in the 433 MHz, 868 MHz, 915 MHz, and 2.4 GHz license-free Industrial, Scientific, and Medical (ISM) frequency bands. The 2.4 GHz band is the widest, providing more channels, while obstacles have the least effect on the lower frequency bands. Depending on the frequency band and antenna type, the operating range with 1 mW transmission power is from a few meters to hundreds of meters [25]. The characteristics of the most promising commercial low-power radios are summarized in Table 3 [25]. Microchip, Nordic Semiconductor, and Texas Instruments utilize on-chip buffers for the adaptation of a high-speed radio to a low-speed MCU. Current consumptions are specified at the lowest band and 0 dBm transmission power. The table indicates that data rate and frequency band have only a small effect on current consumption. The last two columns present the energy consumption per bit with a 3.0 V supply voltage, indicating that the radios operating in the 2.4 GHz frequency band are the most energy-efficient, which is mostly due to their high data rates.
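The per-bit energies in the last two columns of Table 3 follow directly from the current, supply voltage, and data rate. For example, for the CC2420 reception figures:

\[
E_{\mathrm{bit}} = \frac{I \cdot V_{dd}}{R}
= \frac{18.8\,\text{mA} \times 3.0\,\text{V}}{250\,\text{kbps}}
\approx 226\,\text{nJ/bit}.
\]

This is why a fast 2.4 GHz radio can be more energy-efficient per bit than a slower sub-GHz radio even when its active current is higher.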
4.2 Computing subsystem The central component of a platform is the processor unit, which forms the computing subsystem. The processor unit is typically implemented by an MCU, which integrates a processor core with program and data memories, timers, configurable I/O ports, an Analog-to-Digital Converter (ADC), and other peripherals. Flash memory is typically used as program memory, while data memory consists of Static Random Access Memory (SRAM) and Electrically Erasable Programmable Read-Only Memory (EEPROM). WSN nodes typically utilize 1-10 Million Instructions Per Second (MIPS) processing speed. Memory resources typically consist of 1-10 kB of data memory and 16-128 kB of program memory. The characteristics of potential MCUs from different manufacturers are compared in Table 4.
Table 3 Radio features, current consumptions, and energy efficiencies.

Radio          Data rate  Band     Buffer  Sleep  RX    TX    RX      TX
               (kbps)     (MHz)    (B)     (µA)   (mA)  (mA)  (nJ/b)  (nJ/b)
MC MRF24J40    250        2400     128     2      18    22    216     264
NS nRF2401A    1000       2400     32      0.9    19.0  13.0  57      39
NS nRF24L01    2000       2400     32      0.9    12.3  11.3  18      17
NS nRF905      50         433-915  32      2.5    14.0  12.5  840     750
RFM TR1001     115.2      868      no      0.7    3.8   12    99      313
RFM TR3100     576        433      no      0.7    7.0   10    36      52
SE XE1201A     64         433      no      0.2    6.0   11.0  281     516
SE XE1203F     152.3      433-915  no      0.2    14.0  33.0  276     650
TI CC2420      250        2400     128     1      18.8  17.4  226     209
TI CC2500      500        2400     64      0.4    17.0  21.2  102     127
TI CC1000      76.8       433-915  no      0.2    9.3   10.4  363     406
TI CC1100      500        433-915  64      0.4    16.5  15.5  99      93

Manufacturers: Microchip (MC), Nordic Semiconductor (NS), RF Monolithics (RFM), Semtech (SE), Texas Instruments (TI)
The energy-efficiencies of MCUs can be compared according to their current consumption at a one-MIPS processing speed. The comparison indicates that the Semtech XE8802 and Texas Instruments MSP430F1611 MCUs are the most energy-efficient [25].
4.3 Sensing subsystem A large variety of low-power sensors suitable for WSNs exists [1]. Important requirements for sensors are low power consumption and short sensing time, which together determine the energy consumed by a single measurement. In addition, adequate accuracy is required within the entire temperature range. The features of some example sensors are presented in Table 5. Most of the sensors fulfill the requirements well.
Table 4 The comparison of the features of low power MCUs.

MCU                        FLASH (kB)  SRAM (kB)  EEPROM (B)  Sleep (µA)  1 MIPS (mA)
Atmel AT89C51RE2 (8051)    128         8          0           75          7.4
Atmel ATmega103L (AVR)     128         4          4096        1           1.38
Atmel AT91FR40162S (ARM)   2048        256        0           400         0.96
Cypress CY8C29666          32          2          0           5           10
Freescale M68HC08          61          2          0           22          3.75
Microchip PIC18LF8722      128         3.9        1024        2.32        1.0
Microchip PIC24FJ128       128         8          0           21          1.6
Semtech XE8802 (CoolRisc)  22          1          0           1.9         0.3
TI MSP430F1611             48          10         0           1.3         0.33
A WSN node can also operate as a decision unit, which takes sensor readings from the WSN as input and generates action commands as output. These action commands are then transformed into actions by actuators. Besides an electric switch and a servo drive, an actuator can be a mobile robot. In order to improve the reliability of actions, the robot can be a WSN node and act based on its own sensor readings and the data of the other WSN nodes in the network [1].
4.4 Power subsystem The power subsystem stores supply energy and converts it to an appropriate supply voltage level. The subsystem consists of an energy storage, a voltage regulator, and optionally an energy scavenging unit. The energy storage can be a non-rechargeable (primary) battery, a rechargeable (secondary) battery, or a supercapacitor [38]. Primary batteries are cheap and have the highest energy density; they are the most common power source for WSNs. Secondary batteries have lower energy density and are more expensive; they can be recharged, but only 500-1000 times. Compared to secondary batteries, supercapacitors have lower energy density and are more expensive. However, their lifetime is on the order of a million charging/discharging cycles. Supercapacitors are well suited for use with an energy generator, since energy is typically generated in peaks during short periods of time, and the amount of stored energy can be relatively low. The most potential sources for energy scavenging are photovoltaics and vibrations [45]. Solar cells are a mature technology and can provide up to 15 mW/cm³ of power in outdoor conditions. In indoor conditions, the achieved power reduces to 10 µW/cm³. A promising method for converting vibration into source power is piezoelectric conversion. Commonly occurring vibrations can provide up to 200 µW/cm³. Other possible energy sources are temperature gradients (40 µW/cm³ at a 5 °C temperature differential) and air flow (380 µW/cm³ at 5 m/s).
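As a rough illustration of the resulting lifetimes (the battery capacity and load figures below are assumed for the example, not taken from the chapter), a node averaging 100 µW drawn from a 2500 mAh, 1.5 V primary cell lasts

\[
t_{\mathrm{life}} = \frac{E_{\mathrm{batt}}}{\bar{P}}
= \frac{2.5\,\text{Ah} \times 1.5\,\text{V} \times 3600\,\text{s/h}}{100\,\mu\text{W}}
\approx 1.35 \times 10^{8}\,\text{s} \approx 4.3\ \text{years},
\]

which shows why average power, dominated by sleep-mode consumption and duty cycling, is the key design parameter for multi-year deployments.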
Table 5 Features of typical sensors.

Physical quantity  Example sensor    Accuracy  Active current  Sensing time  Energy cons.
Acceleration       VTI SCA3000       1%        120 µA          10 ms         3.6 µJ
Air pressure       VTI SCP1000       150 Pa    25 µA           110 ms        8.3 µJ
Humidity           Sensirion SHT15   2%        300 µA          210 ms        190 µJ
Illumination       Avago APDS-9002   50%       2.0 mA          1.0 ms        6.0 µJ
Infra-red          Fuji MS-320       -         35 µA           cont.         -
Magnetic field     Hitachi HM55B     5%        9.0 mA          30 ms         810 µJ
Position           Fastrax iTRAX03   1.0 m     32 mA           4.0 s         380 mJ
Temperature        Dallas DS620U     0.5 °C    800 µA          200 ms        480 µJ
Table 6 Comparison of existing low-power sensor node platforms.

Platform        MCU                Sensors     Radio            RF band (MHz)    Sleep (µA)  Size (mm²)
Mica            ATmega103L         no          TR1000           915              200-300     1856
Mica2           ATmega128L         no          CC1000           433, 915         17          1856
Mica2dot        ATmega128L         no          CC1000           433, 915         17          492
MicaZ           ATmega128L         no          CC2420           2400             30          1856
BTnode ver3     ATmega128L         no          ZV4002 / CC1000  2400 / 433, 915  3000        1890
Medusa MK-2     ATmega128L + ARM7  T,L,A       TR1000           915              27          4500
EYES node       MSP430             no          TR1001           868              5.1         2600
ScatterWeb ESB  MSP430             L,AC,A,IR   TR1001           868              8           3000
TinyNode        MSP430             no          XE1205           433-915          5.1         1200
Tmote sky       MSP430             H,T,L       CC2420           2400             5.1         2621
ProSpeckz       CY8C29666          no          CC2420           2400             330         704
TUTWSN LR       PIC18LF8722        T,L,H,A,IR  nRF905           433              31          8100
TUTWSN LE       PIC18LF8722        T,L,H,A,IR  nRF24L01         2400             24          2800

Sensors: Acceleration (A), Acoustic (AC), Humidity (H), Light (L), Passive Infra-Red (IR), Temperature (T)
4.5 Existing Platforms WSN platforms have improved significantly during the last decade along with the advances in low-power processing and communication technology. Still, due to the strict energy constraints and the visions of complex networking and data fusion, it is not possible to fulfill all the requirements with the current level of technology. Thus, platform research can be divided into two branches: high-performance platforms and low-power platforms [16]. The high-performance platforms have been developed for researching complex data processing and fusion in sensor nodes. The design target has been the reduction of transmitted data by efficient data processing. These platforms utilize high-performance processors having at least tens of MIPS of processing performance and hundreds of kilobytes of program and data memory. For long-lived battery operation, their energy consumption is not adequate. However, these high-power platforms can be used as a part of a WSN for data processing and data routing. Examples of high-performance platforms are PicoNode [42], AMPS [33], and Stargate [7]. Low-power platforms aim to maximize the lifetime and minimize the physical size of nodes. These goals are achieved by minimizing hardware complexity and energy consumption. Such platforms are capable of performing the low data-rate communication and data processing required for networking and simple applications. The most essential sensor node platforms are listed in Table 6 [25]. Besides the Commercial Off-The-Shelf (COTS) platforms presented in the table, a lot of research work has been conducted on developing System-on-a-Chip (SoC) platforms targeting even smaller size and higher energy-efficiency. For example,
a WiseNET SoC sensor node [9], developed at the Swiss Center for Electronics and Microtechnology, integrates a low-power radio with a CoolRISC MCU core, low-leakage memories, two Analog-to-Digital Converters (ADCs), and power management blocks. The reception-mode current consumption is only 2 mA, which is nearly an order of magnitude less than in typical low-power radios. Yet, the data rate is only 25 kbps. The transmission-mode current consumption at 10 dBm output power is 24 mA. The sleep-mode current consumption of the radio block is 3.5 µA. At best, low-power platforms can perform various sensing tasks while extending the network lifetime to years. However, this necessitates an energy-efficient MAC protocol, which maximizes the time a node spends in the sleep mode.
5 Medium Access Control Features and Services The MAC sublayer is the lowest part of the data link layer and operates on top of the physical layer. A MAC protocol manages radio transmissions and receptions on a shared wireless medium and provides connections for the overlying routing protocol. Hence, it has a major effect on network performance and energy consumption.
5.1 MAC Technologies MAC protocols can be categorized into contention and contention-free protocols. In contention protocols, nodes compete for a shared channel while trying to avoid frame collisions. As the power consumption of low-power radios in the reception mode is high, the energy-efficiency of the conventional MAC approaches is not adequate for low-energy WSNs as such. Further energy saving is achieved by duty cycling: time is divided into a short active period and a long sleep period, which are repeated consecutively. These low duty-cycle protocols can be divided into two categories, unsynchronized and synchronized protocols, according to the synchronization of data exchanges. ALOHA [44] is the simplest contention protocol: nodes transmit data without coordination. Slotted ALOHA reduces collisions by dividing time into slots and transmitting data on slot boundaries only. Carrier Sense Multiple Access (CSMA) further reduces collisions and improves achievable throughput by checking channel activity prior to transmissions and avoiding transmission when the channel is busy. Carrier Sense Multiple Access / Collision Avoidance (CSMA/CA) [6] is a modification of CSMA that reduces congestion on a channel by deferring a transmission for a random interval (contention window). The contention window is increased if the channel is sensed to be busy (backoff), allowing the MAC to adjust to the network conditions.
Fig. 5 Operation of unsynchronized low duty-cycle protocols.
Still, collisions may occur due to the hidden node problem: nodes separated by two hops may not detect each other, and their transmissions may collide at a receiver located between them. Hidden node collisions can be significantly reduced by performing a Request-To-Send (RTS) / Clear-To-Send (CTS) handshake prior to a data transmission. Therefore, the handshake is defined as an option in many CSMA/CA-based protocols. While contention-based protocols work well under low traffic loads, their performance and reliability degrade drastically under higher loads because of collisions and retransmissions. In contention-free protocols, nodes get unique time slots or frequency channels for transmissions; ideally, collisions are eliminated. Time Division Multiple Access (TDMA) divides time into numerous slots, in each of which only one node is allowed to transmit. Other alternatives are Frequency Division Multiple Access (FDMA) and Code Division Multiple Access (CDMA), which provide contention-free operation by separate frequency channels and spreading codes, respectively. Contention-free protocols achieve high performance and reliability regardless of the traffic load. Yet, the bandwidth must be reserved in advance, which increases control traffic overhead.
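The defer-and-backoff behavior described above can be sketched in C as follows; channel_is_busy(), wait_symbols(), and transmit_frame() are hypothetical radio primitives, and the window sizes are assumed values rather than parameters of any particular standard.

```c
#include <stdint.h>
#include <stdlib.h>

/* Hypothetical radio primitives. */
extern int  channel_is_busy(void);
extern void wait_symbols(uint32_t n);
extern int  transmit_frame(const void *frame, uint16_t len);

#define CW_MIN      8   /* assumed initial contention window (symbols) */
#define CW_MAX    256   /* assumed upper bound after repeated backoffs */
#define MAX_TRIES   5

/* CSMA/CA send: defer a random interval, double the contention window
 * when the channel is busy, give up after MAX_TRIES attempts. */
int csma_send(const void *frame, uint16_t len) {
    uint32_t cw = CW_MIN;
    for (int attempt = 0; attempt < MAX_TRIES; attempt++) {
        wait_symbols((uint32_t)rand() % cw);  /* random deferral */
        if (!channel_is_busy())
            return transmit_frame(frame, len);
        if (cw < CW_MAX)
            cw *= 2;                          /* backoff: widen the window */
    }
    return -1;                                /* channel congested */
}
```

Doubling the window spreads contending senders over a longer interval, which is how the MAC adapts to load without any explicit coordination.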
5.2 Unsynchronized Low Duty-Cycle MAC Protocols Unsynchronized low duty-cycle MAC protocols [39] are based on a Low Power Listening (LPL) mechanism, where nodes poll the channel asynchronously to test for possible traffic, as presented in Fig. 5. Transmissions are preceded by a preamble that is longer than the channel-polling interval; hence, the preamble acts as a wake-up signal. If a busy channel is detected, nodes begin to listen to the channel until a data packet is received or a time-out occurs. The drawback of the basic LPL mechanism is that the transmission and reception of the long preamble increase energy consumption significantly. Therefore, several variations have been proposed to reduce the preamble energy. For example, in X-MAC [4], a sender transmits multiple short preambles carrying the address of the intended receiver. Each preamble is followed by a short reception period. Upon receiving
a preamble, the destination node sends an acknowledgment (ACK) between the preambles. Other nodes can enter sleep mode early to reduce overhearing. After receiving the ACK, the source node begins the transmission of the data frame. The preamble can be eliminated completely by utilizing an additional transceiver referred to as a wake-up radio [13]. The wake-up radio mechanism is based on the assumption that the listen mode of the wake-up radio is ultra low power, so it can be active constantly, while the normal data radio stays in sleep mode as long as packet transmission or reception is not required. Wake-up radio protocols are successful in avoiding overhearing and idle listening on the data radio. Their major problems are the energy consumption and cost of the wake-up radio itself. In addition, the difference in the transmission ranges of the data and wake-up radios may pose significant problems. Unsynchronized protocols are relatively simple and robust, and require a small amount of memory compared to synchronized protocols. A general drawback is rather high overhearing, since each node must receive at least the beginning of each frame transmitted within radio range. Thus, they are best suited to relatively simple WSNs with very low data rates. Unsynchronized protocols tolerate network dynamics, but their energy-efficiency is limited by the channel sampling mechanism [57].
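The receiver side of the LPL mechanism can be sketched as below, assuming hypothetical radio and timer primitives. The 500 ms polling interval is an illustrative choice; as noted above, a sender's preamble must be longer than this interval for the wake-up to be detected.

```c
/* Low Power Listening (LPL) receiver sketch: wake periodically, sample
 * the channel briefly, and stay awake only if a preamble is detected. */
#include <stdint.h>

extern void radio_on(void), radio_off(void);
extern int  detect_preamble(void);            /* short energy sample  */
extern int  receive_frame(uint32_t timeout_ms);
extern void sleep_ms(uint32_t ms);

#define POLL_INTERVAL_MS 500  /* assumed; sender preamble must exceed this */
#define RX_TIMEOUT_MS     50  /* assumed time-out while waiting for data   */

void lpl_listen_loop(void) {
    for (;;) {
        radio_on();
        if (detect_preamble())
            receive_frame(RX_TIMEOUT_MS);  /* receive data or time out */
        radio_off();
        sleep_ms(POLL_INTERVAL_MS);        /* duty-cycled sleep */
    }
}
```

The trade-off discussed in Sect. 5.4 is visible here: lengthening POLL_INTERVAL_MS reduces the receiver's idle listening but forces senders to transmit proportionally longer preambles.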
5.3 Synchronized Low Duty-Cycle MAC Protocols Synchronized low duty-cycle MAC protocols utilize scheduling to ensure that nodes agree on the data exchange times. Due to the synchronized operation, nodes know the exact moments of the active periods in advance, which eliminates the need for long preambles. As global synchronization is very difficult in large networks, synchronization is often realized by receiving a beacon frame from one or more neighbor nodes, as shown in Fig. 6. The beacon frame includes synchronization and status information, such as the duration of the active period and the time between active periods. After a beacon, nodes exchange frames during the active period. The active period is followed by a sleep period to save energy. Together, the active period and the sleep period are referred to as an access cycle, which is repeated periodically. To establish the synchronized operation, neighboring nodes are typically discovered by a network scan, i.e. a long-term reception on the frequency channels to catch beacons from neighbors whose schedules and channels are unknown. This is clearly energy-hungry; however, the synchronized operation after the network scan is very energy-efficient [57]. While the channel access in unsynchronized protocols is usually contention-based, synchronized MAC protocols use contention-based, contention-free, or hybrid channel access mechanisms. Sensor-MAC (S-MAC) [56] utilizes purely contention-based channel access, using CSMA/CA with the RTS/CTS mechanism.
Fig. 6 Operation of synchronized low duty-cycle protocols.
The protocol utilizes a fixed active period length and an adjustable, network-specific wake-up period. Neighboring nodes may coordinate their active periods to occur simultaneously, forming virtual clusters. At the beginning of an active period, nodes wake up and exchange synchronization (SYNC) frames to synchronize their operation. The fixed access cycle length causes idle listening, which decreases energy efficiency. T-MAC [51] is a variation of S-MAC that improves the energy-efficiency by adjusting the active period according to traffic. It utilizes a short listening window after the CTS phase and after each frame exchange; if no activity occurs during the listening window, the node returns to sleep mode. The IEEE 802.15.4 [19] standard defines a MAC layer that can use both contention-based and contention-free channel access. It operates in beacon-enabled and non-beacon modes. In the non-beacon mode, the protocol is based on simple CSMA/CA. Energy-efficient synchronized low duty-cycle operation is provided by the beacon-enabled mode, where all communications are performed within a superframe structure. The superframe is divided into three parts: the beacon, the Contention Access Period (CAP), and the Contention Free Period (CFP). The CAP is a mandatory part of the superframe, during which the channel is accessed using a slotted CSMA/CA scheme. The CFP is an optional feature of the IEEE 802.15.4 MAC, in which channel access is performed in dedicated time slots. The CFP can be utilized only for direct communication with a PAN coordinator; thus, its applicability and benefits are very limited in multi-hop networks. A cluster-tree type IEEE 802.15.4 network can provide comparatively good energy-efficiency in static and sparse networks. The hidden node problem reduces performance in dense networks, since no handshaking prior to transmissions is used. TUTWSN MAC [25, 27] is another example of a protocol that uses both contention-based and contention-free channel access. The superframe structure is similar to that of IEEE 802.15.4. However, instead of using carrier sensing, the CAP and CFP are divided into fixed time slots. To allow implementation on the simplest radios without carrier sensing capabilities, TUTWSN MAC uses slotted ALOHA in the CAP. Each time slot is further divided into two subslots, the first for the data frame and the second for the acknowledgment. The use of contention-free slots is preferred, as it eliminates collisions and thus increases reliability. The CAP is used only for joining a cluster and for requesting slot reservations in the CFP.
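The slotted superframe just described can be illustrated with a small timing sketch. The slot and cycle durations below are assumed values for the example, not actual TUTWSN or IEEE 802.15.4 parameters (only the two CAP ALOHA slots follow the text).

```c
/* Sketch of a beacon / CAP / CFP superframe timeline: a beacon, a short
 * contention access period of ALOHA slots, then reserved slots. */
#include <stdint.h>
#include <stdio.h>

#define BEACON_MS        2      /* assumed beacon duration            */
#define CAP_SLOTS        2      /* slotted ALOHA slots, per the text  */
#define CFP_SLOTS        8      /* assumed number of reserved slots   */
#define SLOT_MS         10      /* data subslot + ACK subslot         */
#define ACCESS_CYCLE_MS 2000    /* active period + sleep period       */

/* Start offset (ms) of a slot within the access cycle. */
static uint32_t slot_start_ms(int is_cfp, int index) {
    uint32_t t = BEACON_MS;
    if (is_cfp) t += (uint32_t)CAP_SLOTS * SLOT_MS;   /* CFP follows CAP */
    return t + (uint32_t)index * SLOT_MS;
}

int main(void) {
    uint32_t active = BEACON_MS + (CAP_SLOTS + CFP_SLOTS) * SLOT_MS;
    printf("CFP slot 3 starts at %u ms\n", slot_start_ms(1, 3)); /* 52 ms */
    printf("duty cycle = %.1f %%\n", 100.0 * active / ACCESS_CYCLE_MS);
    return 0;
}
```

Because every node can compute these offsets from the beacon alone, a reserved-slot transmission needs no carrier sensing, which is what allows the protocol to run on very simple radios.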
Fig. 7 Network topology in WSN MAC performance evaluation.
In synchronized low duty-cycle protocols, the major advantage is that a sender knows the receiver's wake-up time in advance and can thus transmit efficiently. In dynamic networks, however, synchronized links are short-lived and new neighbors need to be searched for frequently, which increases energy consumption rapidly. In contention-based protocols, a major disadvantage is the energy cost of receiving an entire active period [39]. Contention-free protocols have better energy-efficiency in stationary networks, but their performance degrades rapidly as network dynamics increase.
5.4 Performance This section analyzes the performance of the low-energy MAC protocols. The results are based on the models presented in [25] and [57]. The models make the following assumptions:
• Each sensor node measures one sensor sample and forwards it to a next-hop node during one data generation interval,
• Each data frame is followed by an acknowledgment, for fair comparison,
• There are no transmission errors or collisions,
• There is no contention, and carrier sense attempts produce an idle result,
• The power consumption of idle listening equals the reception-mode power, and
• The active time of the MCU equals the active time of the radio.
Therefore, the performance models focus on the power consumption of the channel access mechanisms, while the effects of data processing, contention, and control frame exchanges are eliminated. For contention-based protocols, the results are therefore slightly optimistic. Energy consumptions are analyzed for a router node (A) and a leaf node (B), presented in Fig. 7. Both nodes have eight neighbors ($n$). The data generation interval ($T_{DATA}$) is equal for each node, and it varies from 1 s to 1000 s. Arrows in the figure indicate data routing directions. The traffic load accumulates in routers, since they transmit their own data and the multi-hop routed data from $n_{DL}$ nodes. For example, the router node C routes data from four nodes ($n_{DL}$ = A, B, D, E).
Average power consumption ($P$) of a protocol is calculated from the normalized transmission ($t_{TX}$) and reception ($t_{RX}$) activities and their power consumptions as

$$P = t_{TX} P_{TX} + t_{RX} P_{RX} + (1 - t_{TX} - t_{RX}) P_S. \tag{1}$$
The normalized activity is determined by dividing the duration of an activity by its repetition interval, resulting in a percentage value. Data exchanges are normalized by $T_{DATA}$, during which all nodes in the network generate exactly one data frame. Similarly, the transmission and reception activity for maintaining synchronization is normalized by $T_{SYNC}$. The performance analysis assumes the commonly used TI CC2420 transceiver and the Atmel ATmega128L MCU. As the transceiver and MCU constitute the majority of the power consumption, other power consumption sources are ignored in the analysis. The other analysis parameters are as follows. For fair comparison, all protocols use an 8 B control frame (beacon/ACK/RTS/CTS) length and a 32 B data frame length. In the IEEE 802.15.4 and T-MAC protocols, a 2 ms average contention window is used, which conforms to the default settings of IEEE 802.15.4 when there are no collisions. T-MAC uses a 90 s synchronization interval, while TUTWSN utilizes 2 ALOHA slots per CAP. For realistic results, a 20 ppm maximum clock drift is assumed. The clock drift reflects the inaccuracy of the timing crystals, which must be compensated for by extra listening of the channel, as a neighbor node might begin its transmission earlier or later than expected. The power consumption of the analyzed protocols is presented in Fig. 8. In this analysis, the access cycle length (synchronized protocols) and the listening interval (LPL-based X-MAC) are adjusted for the lowest possible energy consumption. In the synchronized protocols, the optimal access cycle length ranges from 2 s to 2000 s as the data generation interval ranges from 1 s to 1000 s. A longer access cycle saves energy, as beacons need to be sent less frequently, while a shorter access cycle gives more capacity. The active period length is just long enough to allow one data transmission. In the unsynchronized X-MAC, the channel listening interval ranges from 0.1 s to 3.3 s. Long listening intervals make LPL-based protocols energy inefficient, because the transmission time of the preamble signal is proportional to the channel listening interval.
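Eq. (1) is straightforward to evaluate numerically. The sketch below uses illustrative figures, a 32 B frame at 250 kbps (about 1 ms on air) sent and received once per 10 s interval, with CC2420-class power levels; the numbers are assumptions for the example, not values from the referenced models.

```c
#include <stdio.h>

/* Average power per Eq. (1): P = tTX*PTX + tRX*PRX + (1-tTX-tRX)*PS. */
static double avg_power_mw(double t_tx, double t_rx,
                           double p_tx_mw, double p_rx_mw, double p_s_mw) {
    return t_tx * p_tx_mw + t_rx * p_rx_mw
         + (1.0 - t_tx - t_rx) * p_s_mw;
}

int main(void) {
    double t_tx = 0.001 / 10.0;          /* normalized TX activity */
    double t_rx = 0.001 / 10.0;          /* normalized RX activity */
    double p = avg_power_mw(t_tx, t_rx,
                            52.2,        /* TX power, mW (17.4 mA x 3 V) */
                            56.4,        /* RX power, mW (18.8 mA x 3 V) */
                            0.003);      /* sleep power, mW (1 uA x 3 V) */
    printf("average power: %.4f mW\n", p);   /* ~0.0139 mW */
    return 0;
}
```

Even this crude calculation shows the regime the analysis operates in: with duty cycles of 0.01%, sleep current and synchronization overheads, not the data exchange itself, dominate the average power.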
Fig. 8 Optimal power consumption of MAC protocols: leaf-node and router-node power (mW) as a function of the data generation interval (s) for X-MAC, T-MAC, IEEE 802.15.4, TUTWSN, and an ideal MAC.
Fig. 9 Power consumption of MAC protocols when the maximum per-hop delay is 1 s: leaf-node and router-node power (mW) as a function of the data generation interval (s) for the same set of protocols.
power consumption among synchronized MACs as its active period (CAP) length is fixed, which causes a lot of idle listening. T-MAC and TUTWSN are more energyefficient because they have mechanisms to minimize idle listening. T-MAC adjusts the active period length based on the traffic, whereas TUTWSN prefers contention free channel access and adjusts the amount of reserved slots dynamically according to the traffic. The LPL-based X-MAC protocol is the least energy-efficient when network traffic is high, but has the lowest power consumption when data generation interval is low as a node uses energy only to transmit data and does not have to maintain synchronization. T-MAC is the most energy-efficient synchronized protocol when data generation interval is long, because its energy-efficient synchronization mech-
In IEEE 802.15.4 and TUTWSN, beacon frames are transmitted every access cycle, which means that at long data generation intervals the power consumed by beacon reception dominates. The results clearly indicate that there is no single fits-all low-energy WSN MAC. The optimal MAC depends on the required delays and data generation intervals. For example, a synchronized MAC can be selected over an LPL MAC even in very low-traffic networks if the delay is not critical; this is the case e.g. in environmental networks in which samples need to be collected only once per hour. Another consideration is the role of the nodes. If the backbone network consists of router nodes that are mains powered, IEEE 802.15.4 would be a good choice, as its leaf nodes have very low energy consumption.
6 Routing and Transport As WSNs are designed to operate in large geographic areas, forwarding data directly to the target node is not feasible, as the required transmission energy increases proportionally to the square of the distance. Therefore, data is routed over several hops. The routing layer operates on top of the MAC layer. As several alternative routes to a destination node may exist, the routing decision has a significant effect on load balancing, end-to-end reliability, and latency. Furthermore, the route construction and maintenance methods used in a routing protocol determine its energy-efficiency and mobility support. Due to resource constraints, WSN routing protocols often also incorporate transport layer functionality.
6.1 Services The basic service of a routing protocol is multihop forwarding of a packet from a source to a destination. However, a routing protocol may also provide:
• QoS support, allowing route selection based on different QoS metrics, such as end-to-end latency, reliability, or energy usage,
• Multicast and broadcast support, allowing efficient packet delivery to several nodes at once,
• Mobility support, enabling source, intermediate, and/or destination node mobility,
• End-to-end reliability, ensuring that a packet is not lost and performing retransmissions if necessary,
• Congestion control, to avoid packet drops due to traffic congestion, and
• Fragmentation, to enable transmission of large contents.
Overall, the supported services are largely limited by the used routing paradigm and technology. End-to-end reliability, congestion control, and fragmentation logically belong to a transport layer protocol. However, due to resource constraints
Fig. 10 WSN routing paradigms.
and the need for cross-layer optimizations, many WSN routing protocols support for these services.
6.2 Routing Paradigms and Technologies WSN routing protocols can be classified based on their operation as node-centric, data-centric, location-based, multipath, or cost-based [21, 34], as shown in Fig. 10. The classes are not exclusive: a routing protocol may be both data-centric and query-based while having features of location-based protocols. 6.2.1 Node-centric Routing The node-centric approach is the traditional approach used in computer networks, in which nodes are addressed with globally unique identifiers. While the paradigm allows compatibility with existing protocols, the requirement of unique addressing is challenging in WSNs. Due to the large network size and the error-prone nature of sensor nodes, decentralized address maintenance is preferred. However, as a network may partition or network segments may join, ensuring a consistent addressing scheme involves a lot of messaging and is energy-consuming. Node-centric protocols typically rely on routing tables containing an entry for each route, identified by the destination address and the next-hop node toward the target. The routing table may be constructed proactively by discovering routes to all potential targets, but this increases memory requirements and is not practical in large networks. Instead, the node-centric protocols designed for ad-hoc wireless networks, such as Dynamic Source Routing (DSR) [22] and Ad-hoc On-demand Distance Vector routing (AODV), use an on-demand (reactive) approach in which routes are constructed only when needed. The drawback compared to the proactive approach is the route construction delay when sending the first packets.
Fig. 11 Interest based routing. a sink advertises its interests to the network. b nodes matching the interest transmit data to the sink.
6.2.2 Data-centric Routing As WSNs are inherently data oriented, data-centric routing is a more natural paradigm than the node-centric approach. In data-centric routing, data is routed based on its content rather than on sender or receiver identifiers. As data-centric routing is content aware, data aggregation can be performed naturally. Data-centric routing may take interest-based, negotiation-based, or query-based approaches. In the interest-based approach, a sink node requests data from the network by sending a request describing the data it wants to every node in the network [21]. A node forwards the interest and directs its routing tree toward the sink node, as shown in Fig. 11. Then, the nodes that fulfill the requirements defined in the interest start transmitting data to the sink. Although the route construction is proactive, interest-based routing is scalable, as the number of sinks (data consumers) is low compared to the number of nodes (data sources). Negotiation-based protocols exchange negotiation messages before the actual data transmission takes place [26]. This saves energy, as a node can determine during the negotiation that the actual data is not needed. For negotiation protocols to be useful, the negotiation overhead and data descriptor sizes must be smaller than the actual data. Query-based routing protocols request specific information from the network. A query might be expressed with a high-level language such as SQL; for example, a query might request the "average temperature around area x,y during the last hour". The query can be routed via a random walk or directed at a certain region [29]. After the query has been resolved, the result is transmitted back to the source. 6.2.3 Location-based Routing Location-based routing uses geographic location information to make routing decisions. The approach is natural to WSNs, as sensor measurements usually relate to a specific location. A basic principle in geographic routing is to select a next-hop neighbor that is closer to the target node than the forwarding node. However, a problem with such greedy forwarding is that routing may fail due to a hole in the network. Proposed solutions to the problem include switching to a different mode when a hole is detected, as in Greedy Perimeter Stateless Routing (GPSR) [23], where the
the packet is routed around a hole according to the right-hand rule. Another method is to express a packet's route with a mathematical formula, as proposed in Trajectory Based Forwarding (TBF) [35]. Nodes forward the packet to the neighbor that is closest to the defined route (trajectory). With the help of global knowledge of the network, a route that avoids holes can be selected.

The most significant benefit of location-based routing is its scalability. Routing tables or global knowledge of the network topology are typically not required, which reduces both data memory requirements and routing overhead. Also, geographic routing usually tolerates source and intermediate node mobility. However, determining the position of each node can be problematic. The use of positioning chips such as GPS increases price and energy consumption, while manual configuration is not suitable for large-scale networks.

6.2.4 Multipath Routing

In multipath routing, a packet traverses from a source node to a target node via several paths. The main goal is to increase reliability, as a packet can be received via an alternative path even if routing on some path fails. However, multipath routing trades reliability for energy, as it increases network load and energy usage due to the extra transmissions.

Flooding a packet to every node in the network is the simplest case of multipath routing. In flooding, each node forwards a new flood packet to all of its neighbors. To suppress duplicates, already received flood packets are not forwarded. Flooding is commonly used during the setup phase of several WSN routing protocols, but it is not used for routing as such because packets can easily congest the network and thus decrease reliability.

Controlled multipath routing algorithms limit the number of alternative routes. For example, in gradient broadcast [55] data is forwarded along an interleaved mesh. Each packet is assigned a budget that is initialized by the source node. The budget consists of the minimum cost to send a packet to the sink plus an additional credit. When a node receives the packet, it compares the remaining budget against the cost required to forward the packet to the sink. If the cost is smaller than or equal to the budget, the node forwards the packet. As the credit increases the budget, it allows forwarding the packet along paths other than the minimum-cost path. Thus, the credit determines the amount of redundancy for the packet and trades energy for reliability. If the credit is zero, the packet must be forwarded along the minimum-cost path.

6.2.5 Cost-based Routing

In cost-based routing, each node is assigned a cost value that is relative to the distance between the node and a sink. The cost may be calculated from any metric, e.g., the number of hops or the energy required to forward a packet to the
sink. The benefit of cost-based routing is that knowledge of forwarding path states is not required: a node forwards its data by sending it to any neighbor that has a lower cost. The drawback is that the routes must be created proactively. Also, although data to the sink is forwarded efficiently, another routing mechanism, such as flooding, must be used for data traveling in the other direction. However, the trade-off can be acceptable since most of the traffic is usually toward the sink.
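The gradient-broadcast forwarding rule described above reduces to a single comparison per receiving node. The sketch below illustrates it with made-up cost values; it is not an implementation of [55].

# Sketch of the cost-based forwarding decision with a gradient-broadcast
# style budget. Cost values are illustrative.
def forwards(remaining_budget, my_cost_to_sink):
    # A node relays the packet only if its own cost toward the sink
    # still fits within the packet's remaining budget.
    return my_cost_to_sink <= remaining_budget

min_cost_at_source, credit = 5, 2
budget = min_cost_at_source + credit  # assigned by the source node

# With credit = 0 only minimum-cost next hops qualify; a positive credit
# also admits slightly more expensive, redundant paths.
for cost in (5, 6, 7, 8):
    print(cost, forwards(budget, cost))  # cost 8 exceeds the budget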
6.3 Hybrid Transport and Routing Protocols

Pump-Slowly, Fetch-Quickly (PSFQ) [52] combines the functionality of the transport and routing layers to achieve a low communication cost. Data is transmitted at a relatively slow pace by delaying forwarding with two configured time values, Tmin and Tmax. In broadcast networks, the Tmin parameter allows a node to receive a frame multiple times. A node then evaluates the necessity of forwarding the frame based on how many times it was received. If a sequence number gap is detected in a received frame, PSFQ uses a negative acknowledgment to request all missed frames. The frame is requested in less than Tmin, which reduces latency in error situations.

SPEED [50] is a routing protocol that combines non-deterministic location-based forwarding with an inbuilt congestion control mechanism and soft latency guarantees. The protocol does not guarantee a strict limit on latency, but defines an end-to-end delay that is proportional to the distance between the source and destination nodes; thus, it maintains a certain delivery speed. In SPEED, the next hop is selected randomly among the neighbors with a probability that is proportional to the link speed. Only nodes that advance toward the target and meet the delivery time can be selected. The link speed is calculated by dividing the distance between nodes (obtained from the geographic location information) by the measured link delay. The next-hop selection is combined with feedback received from neighbors. If a node cannot forward a packet due to congestion or a hole in the network, it sends a backpressure beacon, which reduces the forwarding probability to that node.
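A minimal sketch of SPEED-style next-hop selection follows; the neighbor table and the required delivery speed are invented for illustration.

import random

# Each neighbor entry: (geographic advance toward the target in meters,
# measured link delay in seconds). Values are illustrative.
neighbors = {"n1": (20.0, 0.05), "n2": (15.0, 0.02), "n3": (-5.0, 0.01)}

def pick_next_hop(neighbors, required_speed):
    # Link speed = advance toward the target / measured link delay.
    ok = {n: adv / delay for n, (adv, delay) in neighbors.items()
          if adv > 0 and adv / delay >= required_speed}
    if not ok:
        return None  # SPEED would emit a backpressure beacon here
    nodes, speeds = zip(*ok.items())
    # Random choice with probability proportional to link speed.
    return random.choices(nodes, weights=speeds)[0]

print(pick_next_hop(neighbors, required_speed=100.0))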
7 Embedded WSN Services

This section describes localization and synchronization in WSNs. On one hand, these can be seen as services that enable building various WSN applications, such as surveillance or tracking systems. On the other hand, signal processing is used in various localization and synchronization algorithms that might benefit from an implementation on an embedded processing chip.
7.1 Localization

Ubiquitous localization has been widely studied in recent years. In general, the solutions focus on finding effective location estimation algorithms and measurements that correlate with location. Localization can be categorized into range-based, proximity-based, and scene analysis approaches [14]. The underlying technologies vary from pure RF-based to UltraSound (US), InfraRed (IR), and multimodal solutions.

Range-based approaches rely on estimating distances between localized nodes and anchor nodes, which know their locations a priori. This process is called ranging. Received Signal Strength (RSS) is a common RF-based ranging technique. Distances estimated using RSS can have large errors due to multipath signals and shadowing caused by obstructions [36]. The inherent unreliability has to be addressed in the localization algorithms used. In [24], RSS is replaced with beacon transmissions at multiple varying power levels. Several location estimation techniques can be used in range-based localization, including trilateration, weighted center of gravity calculation, and Kalman filtering. Many mathematical optimization methods, such as the steepest descent method, sum-of-errors minimization, and the Minimum Mean Square Error (MMSE) method, have been used to solve range-based location estimation problems.

Proximity-based approaches exploiting RF signals [18] [5] estimate locations from connectivity information. Such solutions are also commonly referred to as range-free in the literature. In the strongest base station method [18], the localized node's location is estimated to be the same as the location of the anchor node it is connected to. In [5], the unknown location is estimated using connectivity information to several anchor nodes. Only a very coarse-grained location can be estimated using the strongest base station method; the solutions presented in [5] improve the granularity slightly. Nevertheless, in order to reach small granularities, the connectivity-based schemes require a very dense grid of anchor nodes. Their strength is a fairly simple implementation and modest HW requirements.

Scene analysis consists of an off-line learning phase and an online localization phase. The off-line phase includes recording RSS values corresponding to different anchor nodes as a function of the user's location. The recorded RSS values and the known locations of the anchor nodes are used either to construct an RF-fingerprint database [30] or a probabilistic radio map [8] [58]. In the online phase, the localized node measures RSS values to different anchor nodes. With RF-fingerprinting, the location of the user is determined by finding the recorded reference fingerprint values that are closest to the measured one (in signal space). The unknown location is then estimated to be the one paired with the closest reference fingerprint, or the (weighted) centroid of the k nearest reference fingerprints. Location estimation using a probabilistic radio map includes finding the point(s) in the map that maximize the location probability. The applicability and scalability of scene analysis approaches are greatly reduced by the time-consuming collection and maintenance of the RF sample database. Searching through the sample database or radio map is computationally intensive.
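As a concrete illustration of range-based localization, the sketch below converts RSS to distance with a log-distance path-loss model (the reference power P0 and the path-loss exponent are assumed calibration values, not from this chapter) and then solves the linearized lateration problem by least squares.

import numpy as np

def rss_to_distance(rss_dbm, p0_dbm=-40.0, n=2.5):
    # Log-distance path-loss model: rss = p0 - 10*n*log10(d).
    return 10 ** ((p0_dbm - rss_dbm) / (10 * n))

def localize(anchors, distances):
    # Linearize the circle equations by subtracting the last anchor's,
    # then solve the resulting linear system by least squares.
    (xn, yn), dn = anchors[-1], distances[-1]
    A, b = [], []
    for (xi, yi), di in zip(anchors[:-1], distances[:-1]):
        A.append([2 * (xn - xi), 2 * (yn - yi)])
        b.append(di**2 - dn**2 - xi**2 + xn**2 - yi**2 + yn**2)
    pos, *_ = np.linalg.lstsq(np.array(A), np.array(b), rcond=None)
    return pos

anchors = [(0.0, 0.0), (10.0, 0.0), (0.0, 10.0)]
true_pos = np.array([3.0, 4.0])
dists = [float(np.linalg.norm(true_pos - a)) for a in anchors]
print(localize(anchors, dists))  # -> approximately [3. 4.]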
The Joint Clustering (JC) technique [58] uses location clustering to reduce the computational cost of searching the radio map, improving the scalability of the search algorithm to some extent. MoteTrack [30] achieves a similar effect by disseminating the RF-fingerprint database to a WSN and decentralizing the localization procedure. In general, RF signal strength based localization has fundamental limits due to the unreliability of the measurements [8]. There is strong evidence that, at best, accuracy on the scale of meters can be achieved regardless of the method used [8].

US-based approaches [40] [41] use time-of-flight ranging and can achieve high accuracies. However, anchor nodes need to be positioned and oriented carefully due to the directionality of US and the requirement for Line-of-Sight (LoS) exposure. A dense network of anchor nodes is needed due to the LoS requirement, the short range of US, and the fact that typically ranging measurements to at least four anchor nodes are needed. The addition of US transmitters and receivers increases HW costs and reduces energy-efficiency compared to purely RF-based solutions. Some schemes [41] require multiple US transmitters/receivers per HW platform, further increasing the HW costs.

IR-based solutions [53] are based on inferring proximity. They can localize nodes inside the range of LoS IR transmissions. IR-based schemes suffer errors in the presence of obstructions. Also, differing light and ambient IR levels, caused for example by fluorescent lighting or direct sunlight, produce difficulties [40]. The anchor network costs are high because a dense matrix of IR sensors is needed in order to avoid dead spots.

In the presence of a myriad of location sensing techniques, data fusion has become an attractive location estimation method. It can combine measurements from multiple sensors while managing measurement uncertainty. In [10], Fox et al. survey Bayesian filtering techniques capable of multisensor fusion. Probabilistic fusion methods require relatively large amounts of computation. Thus, in the presence of resource-constrained nodes, a centralized implementation running on a more powerful base station is often the only feasible choice. For example, in the Localization Stack [15] the fusion layer is implemented in Java.
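Returning to the RF-fingerprinting variant of scene analysis discussed above, the online phase is essentially a nearest-neighbor search in signal space; a minimal sketch with an invented three-entry database:

import numpy as np

# Reference fingerprints: RSS (dBm) to three anchors, and the position
# where each fingerprint was recorded. Data is illustrative.
db_rss = np.array([[-60.0, -72.0, -80.0],
                   [-65.0, -70.0, -75.0],
                   [-75.0, -60.0, -70.0]])
db_pos = np.array([[0.0, 0.0], [5.0, 0.0], [5.0, 5.0]])

def knn_locate(measured_rss, k=2):
    # Distance in signal space to every reference fingerprint.
    d = np.linalg.norm(db_rss - measured_rss, axis=1)
    nearest = np.argsort(d)[:k]
    # Weighted centroid of the k nearest reference positions.
    w = 1.0 / (d[nearest] + 1e-9)
    return (db_pos[nearest] * w[:, None]).sum(axis=0) / w.sum()

print(knn_locate(np.array([-63.0, -71.0, -77.0])))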
7.2 Synchronization

The most straightforward way to achieve accurate synchronization would be to equip every node with a Global Positioning System (GPS) receiver, a Universal Time Coordinated (UTC) signal receiver, or an accurate atomic clock. In reality, however, this is infeasible for WSN nodes due to the increased size, cost, and energy consumption. Thus, synchronization has to be achieved with a dedicated synchronization protocol.

Although effective in the Internet, the widely used Network Time Protocol is too resource-consuming for WSNs. Furthermore, it requires external configuration, making ad hoc operation impossible. The IEEE 1588 standard for a precision clock
synchronization protocol for networked measurement and control systems has similar shortcomings. In the following, synchronization protocols designed for WSNs and targeting network-wide synchronization over multiple hops are overviewed.

The protocol in [46] is based on the assumption that the clock value of a node is of the linear form t_i = a_i t + b_i, where t_i is the local clock value of node i, a_i is the drift, b_i is the offset, and t represents real time. The goal of the protocol is not to achieve network-wide global time; instead, each node performs pairwise synchronization with each of its neighbors, maintaining a list of clock parameters (a and b) for every neighbor.

The Timing-sync Protocol for Sensor Networks (TPSN) [11] uses two frames to synchronize a pair of nodes. The global time synchronization algorithm of TPSN builds a spanning tree in which every node knows its level and its parent. Remote clock estimation is performed between adjacent levels. The tree construction is initiated by the reference node, which is assigned level 0. The child nodes flood the level discovery frames until leaf nodes are reached. The tree is static for the lifetime of the reference node, except for the joining of new nodes using a level discovery frame.

Similarly to TPSN, the Lightweight Tree-based Synchronization (LTS) protocols [12] synchronize a pair of nodes with two frames. Global synchronization is achieved by creating a minimum-height spanning tree. For time reference robustness, multiple reference nodes are considered. Fault tolerance against dynamic channel variations, changes in topology, changes in size, and node mobility is achieved by re-creating the spanning tree on every re-synchronization.

The Flooding Time Synchronization Protocol (FTSP) [32] uses link layer timestamping and linear regression to estimate neighbor node clock parameters. The protocol floods all the synchronization frames through the network, which gives good tolerance against failures. Furthermore, an algorithm for electing a new reference node when the current one fails is presented. A node starts acting as a reference node after a constant timeout during which it has not received any synchronization frames. This may result in many new reference nodes, of which only the one with the minimum ID remains after the synchronization flooding resumes.

Delay Measurement Time Synchronization (DMTS) [37] uses one message to synchronize a sender and all the receivers in its neighborhood. The multi-hop DMTS algorithm uses a leader selection algorithm to select a time reference for the whole network. The time reference is at tree level 0. The time is periodically flooded through the network by broadcasting it from level to level; synchronization frames are accepted only from lower level nodes. This is continued until leaf nodes are reached.

The Time-Diffusion synchronization Protocol (TDP) [48] consists of active and inactive cycles. At the start of every active cycle, a subset of nodes is selected as masters, who can relay synchronization data. The timing messages sent by the masters create individual tree structures for every master. Furthermore, the protocol includes a method for detecting outliers. TDP does not rely on external time servers, making it fully self-contained. Furthermore, the creation of new synchronization trees in every cycle increases fault tolerance.

Li et al. [28] propose a diffusion method where nodes achieve global synchronization by spreading local synchronization information to the entire system. The method does not rely on a single time reference, which improves the robustness of the protocol. However, any malfunctioning node affects the time accuracy of the whole network. The authors present a solution for this by replacing a fraction of the normal nodes with tamper-proof nodes.
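The linear clock model t_i = a_i t + b_i above is typically fitted by linear regression, as in FTSP-style neighbor clock estimation. A minimal sketch with synthetic timestamps (the drift and offset values are invented):

import numpy as np

# Pairs of (reference time, neighbor's local timestamp); synthetic data
# following t_i = a_i * t + b_i with a_i = 1.00002 and b_i = 5.4.
ref_times = np.array([0.0, 30.0, 60.0, 90.0, 120.0])
local_times = 1.00002 * ref_times + 5.4

# Least-squares fit of the drift and offset.
drift, offset = np.polyfit(ref_times, local_times, deg=1)

def to_local(t_global):
    # Translate a global time into the neighbor's clock.
    return drift * t_global + offset

print(drift, offset)  # -> approximately 1.00002 and 5.4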
8 Experiments

In this section, TUTWSN is used as a case study of WSN performance. TUTWSN uses cross-layer design between network protocols and hardware platforms to achieve the requirements of target WSN applications [27]. Two variants of the protocol are presented: low-energy and low-latency TUTWSN. The design requirements for these variants are presented in Table 7.

Table 7 Design requirements of TUTWSN low-energy and low-latency protocols.

TUTWSN variant   Network size   Measurement interval   End-to-end latency   Router power   Node power   Lifetime
Low-energy       Thousands      30 s–15 min            <10 min              Battery        Battery      2 years
Low-latency      Hundreds       0.5 s–10 s             <6 s                 Mains          Battery      4 years

The low-energy TUTWSN is targeted at sensing applications requiring moderate throughput, long network lifetime, and forwarding latencies on the order of a second per hop. The low-latency TUTWSN is targeted at localization and target tracking applications requiring very low end-to-end delays and light throughput. It uses a heterogeneous approach by allowing ultra-low-power mobile nodes, while the energy consumption of the router nodes forming a backbone network is higher. Both protocol variants enable multihop networking with one or more sinks and share a common hardware platform. The program memory usage of the low-energy and low-latency TUTWSN protocols was 100 kB and 60 kB, respectively. The data memory usage was less than 4 kB in both cases. Thus, the protocols demonstrate feasibility on very resource-constrained WSN hardware.

The platform used in the experiments is built from COTS components. A TUTWSN node is controlled by an 8-bit Microchip PIC18LF8722 MCU and a Nordic Semiconductor nRF24L01 transceiver. The transceiver is used at a 1 Mbit/s data rate, which ensures short transmission and reception times, allowing the node to spend most of the time in low-power states. The node is powered by two AA batteries.
8.1 Low-Energy TUTWSN

Low-energy TUTWSN uses a clustered topology in which each cluster operates on its own frequency, referred to as a cluster channel, which increases scalability and avoids collisions between clusters. The experimented TUTWSN MAC configuration utilizes a 2 s access cycle, 4 ALOHA slots, and 8 reserved slots. The frame size is 32 B due to hardware limitations. As data is sent only in the reserved slots, the total throughput at the MAC layer is 1 kbit/s.

The routing uses the cost-based approach and supports several sinks by maintaining a separate cost for each sink [27]. A node initially searches for its neighbors with a network scan. When a new neighbor is found, the node sends a cost request to it. A node selects the next hop and sets its own cost based on the received replies. Additionally, nodes periodically recalculate the cost and broadcast an advertisement to their neighbors. This way, nodes can react to changes in network conditions as costs change.

8.1.1 Scalability

In a low duty cycle MAC protocol (such as IEEE 802.15.4 or TUTWSN MAC), the maximum number of nodes (η) in an interference area is determined by the access cycle length (T_AC), the superframe length, the average number of member nodes in each cluster (n_S), and the number of utilized non-interfering frequency channels (n_CH) as

η = T_AC n_CH (1 + n_S) / (t_SF + t_guard),    (2)

where t_SF is the length of a superframe and t_guard is a short guard time between consecutive superframes. η is maximized by minimizing the superframe and guard time lengths and by maximizing T_AC, n_CH, and n_S. The equation clearly shows that a high data-rate radio operating in a wide frequency band provides the highest scalability.

In the experimented 2.4 GHz TUTWSN implementation, T_AC = 4 s, t_SF = 280 ms, and each superframe is followed by a 220 ms guard period to allow time for data processing. In TUTWSN sensor node platforms, the interference range in indoor conditions is around 100 m, equaling an area of 31400 m². Eight member nodes can be connected to each cluster head; the limit is due to data memory, as various statistics are kept for each member. The transceiver provides 82 channels. Assuming 41 non-overlapping channels, 2880 nodes can be located per interference area. If only one channel is used (n_CH = 1), η is reduced to 72 nodes per interference area.

The network depth is limited by the routing protocol. As the routing cost is expressed as an 8-bit integer value and the per-hop cost ranges between 1...8, the maximum network depth is 32.
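A quick numeric check of Eq. (2) with the configuration above; the single-channel case reproduces the 72 nodes quoted in the text, and 41 channels yield about 2950 (the quoted 2880 corresponds to 40 usable channels).

# Numeric check of the scalability bound (Eq. 2) for the experimented
# 2.4 GHz TUTWSN configuration; values are taken from the text above.
T_AC = 4.0       # access cycle length (s)
t_SF = 0.280     # superframe length (s)
t_guard = 0.220  # guard time between superframes (s)
n_S = 8          # member nodes per cluster head

def eta(n_CH):
    return T_AC * n_CH * (1 + n_S) / (t_SF + t_guard)

print(eta(1))    # -> 72 nodes per interference area (single channel)
print(eta(41))   # -> about 2950 nodes with 41 non-overlapping channels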
Table 8 Average power consumption and estimated lifetime with a 2000 mAh battery.

Access cycle length   Power consumption (µW)        Lifetime (years)
                      Subnode      Headnode         Subnode      Headnode
2 s                   153          740              4.5          0.9
4 s                   120          533              5.7          1.3
8 s                   103          327              6.6          2.1
8.1.2 Power Consumption

The power consumption of the platform was tested in a multihop network consisting of 2 sinks, 13 headnodes, and 12 subnodes (27 nodes in total). Each node measured its temperature and transmitted a packet every 10 s to the nearest sink. The maximum hop count was 3. Average headnode and subnode power consumptions with 2 s, 4 s, and 8 s access cycle lengths are presented in Table 8. A short access cycle consumes more power as beacons must be sent and received more frequently. Depending on the access cycle length used, the lifetime with a conservative estimate of 2000 mAh capacity from the two AA batteries is between 4.5 and 6.6 years for a subnode and between 0.9 and 2.1 years for a headnode. The power consumption of a headnode is significantly higher as it must also listen to the channel during the contention access period (CAP) and forward traffic.

8.1.3 Availability and End-to-end Reliability

The practical performance of a low-power network was measured in an indoor deployment consisting of 120 nodes and 8 sinks. The sinks requested nodes to send measurement data at 60 s intervals and diagnostics data describing the battery voltage, buffer usage, and link reliabilities at 120 s intervals. As a result, a node generated a data packet on average every 40 s. Each node transmitted its data to the nearest sink. The average routing path length was 4 and the maximum hop count to the sink was 6.

The reliability of the network was estimated with an availability metric. If a node does not have problems, the reception interval at a sink equals the data generation interval; retransmissions or packet drops increase the interval required to reach a certain availability. The average availability of nodes, and the availability of the most and least reliable nodes, are presented in Fig. 12. On average, 95% of the traffic is received within a 140 s interval. The availability increases only slightly after 99% as the reception interval is increased, denoting that 1% of the traffic is lost. The unreliability is caused by the limited data memory, as each node can only hold 20 packets. Therefore, packet drops due to buffer overflow occur, especially when a node has lost a link but is still receiving traffic from other nodes.
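The lifetime column of Table 8 follows directly from the measured average power; a minimal cross-check, assuming the two AA cells deliver 2000 mAh at a nominal 3 V (the voltage is an assumption, not stated in the text):

HOURS_PER_YEAR = 24 * 365

def lifetime_years(avg_power_uW, capacity_mAh=2000.0, voltage_V=3.0):
    energy_mWh = capacity_mAh * voltage_V      # usable battery energy
    return energy_mWh / (avg_power_uW / 1000.0) / HOURS_PER_YEAR

for label, p_uW in [("subnode, 2 s", 153), ("headnode, 2 s", 740),
                    ("subnode, 8 s", 103), ("headnode, 8 s", 327)]:
    print(label, round(lifetime_years(p_uW), 1), "years")
# -> about 4.5, 0.9, 6.6, and 2.1 years, matching Table 8.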
Fig. 12 Node availability in the experimented network: samples received (%) as a function of the packet reception interval (s), for the best node, the average, and the worst node.
Fig. 13 Low-latency TUTWSN topology. The router nodes forward data via multiple hops to one or multiple sinks. Mobile nodes broadcast their data to the router nodes.
8.2 Low-Latency TUTWSN

The low-latency network consists of router nodes and mobile nodes, as shown in Fig. 13. The router nodes are responsible for forwarding data via a multihop network to one or multiple sinks. The low-latency TUTWSN operates on a single network-wide communication channel.

The channel access methods of routers and mobile nodes differ. Routers sense the channel continuously, which consumes power but allows fast data forwarding. An ALOHA protocol with randomized backoffs is used for channel access. A carrier sense based protocol would improve bandwidth usage efficiency compared to ALOHA, but the functionality is not included in most low-cost and low-power radio transceivers. The channel access of a mobile node is designed to minimize energy-consuming idle listening. A mobile node broadcasts beacon frames periodically but with slightly randomized intervals to avoid collisions. Then, the node briefly listens to the channel. The beacon can be piggybacked with application-generated data. A router node that receives the beacon waits a random time to avoid collisions and then sends an acknowledgment to the mobile node. If a beacon frame containing data is not acknowledged, the mobile node temporarily shortens its beacon generation interval, thus reducing the delay between retransmission attempts. A beacon frame is forwarded to a sink only if it contains application data.

For routing, a cost-based protocol is used. The routing and MAC are cross-layer designed for low delays and efficient implementation. The routing layer uses network beacons to communicate cost information to neighbors and to acquire in-depth information from the neighborhood. Using this information, the routing protocol
calculates routes and fills the routing table. When forwarding data, the MAC layer can look up next-hop information directly from the routing table without the need to cycle packets through the routing layer. This reduces queuing, processing, and stack handover delays.

8.2.1 Power Consumption and Network Lifetime

As the routers are active all the time, their power consumption depends neither on network behavior nor on operating mode. The average measured power consumption of a router node is 78 mW. With a conservative estimate of 2000 mAh capacity from two AA batteries, the lifetime of a router is estimated to be 8 days. Thus, to achieve a feasible network lifetime, the router nodes should be equipped with sufficiently large batteries or be mains powered.

The average mobile node power consumption was measured using 0.25 s, 1 s, 4 s, 16 s, and 32 s access cycle lengths. The results are presented in Fig. 14(a). The lifetimes of a mobile node with the same access cycles and a 2000 mAh battery capacity are shown in Fig. 14(b). The lifetime ranges from 3.1 months to 2.1 years. However, it is unlikely that a node needs to transmit, e.g., every 0.25 s for its whole lifetime. Thus, longer lifetimes and low-latency operation can still be achieved by using short access cycles only when needed.

Fig. 14 Estimated mobile node power consumption and lifetime with 2000 mAh batteries: (a) power (mW) as a function of the access cycle (s); (b) lifetime (months) as a function of the access cycle (s).

8.2.2 Delay and Throughput

The performance of the low-latency network was tested using a network where the nodes communicated via a fixed multi-hop route to a sink. This enables accurate analysis of the network behavior over a varying number of hops.
Table 9 Fixed multi-hop scenario delay results. Packets were relayed over 1–8 hops.

Percentile   Maximum delay (s)   Average delay per hop (s)
99%          2.1                 0.4
99.5%        2.5                 0.4
99.9%        3.6                 0.5
100%         4.4                 0.7

Fig. 15 Fixed multi-hop scenario throughput results: throughput (kbps) as a function of hop count (1–8 hops). The data always originated from the end of the hop chain.
The delay results are presented in Table 9. The scenario consisted of an eight-hop chain of nodes. 99% of the packets reached the sink within 2.1 s, and all packets were received within 4.4 s. Fig. 15 presents the throughput as a function of hop count. In the experiments, the data always originated from the end of the hop chain. The throughput ranges from 3.5 kbps to 500 bps.
9 Summary

WSN is a recently emerged technology that does not yet have de facto platforms, standards, or technologies. The limitations of current manufacturing techniques cause a trade-off between the size, price, performance, and lifetime of a sensor node. Thus, due to the contradictory requirements of different WSN applications, there is no general-purpose WSN technology. Instead, platform components and communication protocols must be specifically selected to meet the application demands.

An ideal WSN platform would combine the hundreds of MIPS of processing power and several MB of memory of the high-performance platforms with the 1 mA average and 1 µA sleep mode currents of the low-energy platforms. However, it is unlikely that this can be achieved in the near future by improving manufacturing technologies alone. A combination of low-energy protocols and hardware acceleration is required, which opens up applications for signal processing systems. In low-power platforms, the transceiver consumes most of the energy. Thus, it is possible to lower energy consumption by trading communication time for processing time through data preprocessing, data fusion, and aggregation. Accelerating computation-intensive tasks, such as encryption, would leave processing power for other tasks.
In addition, accelerated compression would allow more complex sensing, as images, sound, and video could be transferred energy-efficiently. ASIC implementation of communication protocols and applications could improve performance and decrease power consumption. However, as a trade-off, the network would lose its reprogrammability and configurability. Overall, the processing power and capacity of a single sensor node are small. However, the large number of sensor nodes enables massively parallel distributed data processing and storage. Thus, the feasibility and useful features of WSNs lie in the co-operation of the sensor nodes.
References 1. Akyildiz, I.F., Kasimoglu, I.H.: Wireless sensor and actor networks: Research challenges. Elsevier Ad Hoc Networks 2(4), 351–367 (2004) 2. Akyildiz, I.F., Su, W., Sankarasubramaniam, Y., Cayirci, E.: Wireless sensor networks: a survey. Elsevier Computer Networks 38(4), 393–422 (2002) 3. Baunach, M., Kolla, R., Mühlberger, C.: Beyond theory: Development of a real world localization application as low power wsn. In: Proc. 32nd IEEE Conference on Local Computer Networks (LCN’07), pp. 872–884. Dublin, Ireland (2007) 4. Buettner, M., Yee, G., Anderson, E., Han, R.: X-MAC: A short preamble MAC protocol for duty-cycled wireless sensor networks. In: Proc. 4th ACM Conf. Embedded Networked Sensor Systems (SenSys’06), pp. 307–320. Boulder, Colorado, USA (2006) 5. Bulusu, N., Heidemann, J., Estrin, D.: GPS-less low-cost outdoor localization for very small devices. Personal Communications, IEEE [see also IEEE Wireless Communications] 7(5), 28–34 (2000) 6. Colvin, A.: CSMA with collision avoidance. Computer Communications 6(5), 227–235 (1983) 7. Crossbow Technology, Inc.: Stargate X-Scale processor platform. Available: http://www.xbow.com/Products/Product_pdf_files/Wireless_pdf/ 6020-0049-01_B_STARGATE.pdf (2004) 8. Elnahrawy, E., Li, X., Martin, R.P.: The limits of localization using signal strength: a comparative study. In: Sensor and Ad Hoc Communications and Networks, 2004. IEEE SECON 2004. 2004 First Annual IEEE Communications Society Conference on, pp. 406–414 (2004) 9. Enz, C.C., El-Hoiydi, A., Decotignie, J.D., Peiris, V.: WiseNET: An ultralow-power wireless sensor network solution. Computer 37(8), 62–70 (2004) 10. Fox, V., Hightower, J., Liao, L., Schulz, D., Borriello, G.: Bayesian filtering for location estimation. Pervasive Computing, IEEE 2(3), 24–33 (July-Sept. 2003). DOI 10.1109/MPRV. 2003.1228524 11. Ganeriwal, S., Kumar, R., Srivastava, M.B.: Timing-sync protocol for sensor networks. In: SenSys ’03: Proceedings of the 1st international conference on Embedded networked sensor systems, pp. 138–149. ACM, New York, NY, USA (2003). DOI http://doi.acm.org/10.1145/ 958491.958508 12. Greunen, J.V., Rabaey, J.: Lightweight time synchronization for sensor networks. In: WSNA ’03: Proceedings of the 2nd ACM international conference on Wireless sensor networks and applications, pp. 11–19. ACM, New York, NY, USA (2003). DOI http://doi.acm.org/10.1145/ 941350.941353 13. Guo, C., Zhong, L., Rabaey, J.: Low power distributed MAC for ad hoc sensor radio networks. In: Global Telecommunications Conf. (GLOBECOM’01), vol. 5, pp. 2944–2948. San Antonio, TX, USA (2001)
14. Hightower, J., Borriello, G.: Location systems for ubiquitous computing. Computer 34(8), 57–66 (2001) 15. Hightower, J., Brumitt, B., Borriello, G.: The location stack: a layered model for location in ubiquitous computing. Mobile Computing Systems and Applications, 2002. Proceedings Fourth IEEE Workshop on pp. 22–28 (2002). DOI 10.1109/MCSA.2002.1017482 16. Hill, J., Horton, M., Kling, R., Krishnamurthy, L.: Wireless sensor networks: The platforms enabling wireless sensor networks. Communications of the ACM 6(47), 41–46 (2004) 17. Hill, J., Szewczyk, R., Woo, A., Hollar, S., Culler, D., Pister, K.: System architecture directions for networked sensors. In: Proc. 9th ACM Int’l Conf. on Architectural Support for Programming Languages and Operating Systems (ASPLOS’00), pp. 94–103. Cambridge, MA, USA (2000) 18. Hodes, T.D., Katz, R.H., Servan-Schreiber, E., Rowe, L.: Composable ad-hoc mobile services for universal interaction. In: MobiCom ’97: Proceedings of the 3rd annual ACM/IEEE international conference on Mobile computing and networking, pp. 1–12. ACM, New York, NY, USA (1997). DOI http://doi.acm.org/10.1145/262116.262121 19. IEEE 802.15.4: IEEE Standard for Information Technology—Telecommunications and Information Exchange Between Systems—Local and Metropolitan Area Networks—Specific Requirements—Part 15.4: Wireless Medium Access Control (MAC) and Physical Layer (PHY) Specifications for Low-Rate Wireless Personal Area Networks (LR-WPAN) (2006) 20. ISA: ISA100.11a release 1. Available: http://www.isa.org/source/ISA100. 11a_Release1_Status.ppt (2007) 21. Al Karaki, J.N., Kamal, A.E.: Routing techniques in wireless sensor networks: A survey. IEEE Wireless Communications 11(6), 6–28 (2004) 22. Karl, H., Willig, A.: Protocols and Architectures for Wireless Sensor Networks. John Wiley & Sons Ltd (2005) 23. Karp, B., Kung, H.T.: GPSR: Greedy perimeter stateless routing for wireless networks. In: Proc. 6th annual Int’l Conf. on mobile computing and networking (MobiCom’00), pp. 243– 254. Boston, MA, USA (2000) 24. Kaseva, V.A., Kohvakka, M., Kuorilehto, M., Hännikäinen, M., Hämäläinen, T.D.: A wireless sensor network for RF-based indoor localization. EURASIP Journal on Advances in Signal Processing (2008). DOI 10.1155/2008/731835 25. Kohvakka, M.: Medium access control and hardware prototype designs for low-energy wireless sensor networks. Ph.D. thesis, Tampere University of Technology, Tampere, Finland (2009) 26. Kulik, J., Heinzelman, W., Balakrishnan, H.: Negotiation-based protocols for disseminating information in wireless sensor networks. Kluwer Wireless Networks 8(2), 169–185 (2002) 27. Kuorilehto, M., Kohvakka, M., Suhonen, J., Hämäläinen, P., Hännikäinen, M., Hämäläinen, T.D.: Ultra-Low Energy Wireless Sensor Networks in Practice - Theory, Realization and Deployment. John Wiley & Sons Ltd (2007) 28. Li, M.Q., Rus, M.D.: Global clock synchronization in sensor networks. IEEE Trans. Comput. 55(2), 214–226 (2006). DOI http://dx.doi.org/10.1109/TC.2006.25 29. Liu, J., Zhao, F., Petrovic, D.: Information-directed routing in ad hoc sensor networks. IEEE Journal on Selected Areas in Communications 23(4), 851–861 (2005) 30. Lorincz, K., Welsh, M.: MoteTrack: A robust, decentralized approach to RF-based location tracking. In: In Proceedings of the International Workshop on Location- and ContextAwareness (LoCA 2005) at Pervasive 2005. Oberpfaffenhofen, Germany (2005) 31. Madden, S., Franklin, M.J., Hellerstein, J.M., Hong, W.: The design of an acquisitional query processor for sensor networks. 
In: Proc. ACM Int’l Conf. on Management of Data (SIGMOD’03), pp. 491–502. San Diego, CA, USA (2003) 32. Maróti, M., Kusy, B., Simon, G., Ákos Lédeczi: The flooding time synchronization protocol. In: SenSys ’04: Proceedings of the 2nd international conference on Embedded networked sensor systems, pp. 39–49. ACM, New York, NY, USA (2004). DOI http://doi.acm.org/10. 1145/1031495.1031501
33. Min, R., Bhardwaj, M., Cho, S.H., Ickes, N., Shih, E., Sinha, A., Wang, A., Chandrakasan, A.: Energy-centric enabling technologies for wireless sensor networks. IEEE Wireless Communications 9(4), 28–39 (2002) 34. Niculescu, D.: Communication paradigms for sensor networks. IEEE Communications Magazine 43(3), 116–122 (2005) 35. Niculescu, D., Nath, B.: Trajectory based forwarding and its applications. In: Proc. 9th annual Int’l Conf. on Mobile computing and networking (MobiCom’03), pp. 260–272. San Diego, CA, USA (2003) 36. Patwari, N., Ash, J.N., Kyperountas, S., Hero III, A.O., Moses, R.L., Correal, N.S.: Locating the nodes: cooperative localization in wireless sensor networks. Signal Processing Magazine, IEEE 22(4), 54–69 (2005) 37. Ping, S.: Delay measurement time synchronization for wireless sensor networks. Tech. Rep. IRB-TR-03-013, Intel Research Berkeley Lab (2003) 38. Pitcher, G.: If the cap fits... pp. 25–26 (2006) 39. Polastre, J., Hill, J., Culler, D.: Versatile low power media access for wireless sensor networks. In: Proc. 2nd International Conf. on Embedded Networked Sensor Systems (Sensys’04), pp. 95–107. Baltimore, MD, USA (2004) 40. Priyantha, N.B., Chakraborty, A., Balakrishnan, H.: The cricket location-support system. In: MobiCom ’00: Proceedings of the 6th annual international conference on Mobile computing and networking, pp. 32–43. ACM Press, New York, NY, USA (2000) 41. Priyantha, N.B., Miu, A.K.L., Balakrishnan, H., Teller, S.: The cricket compass for context-aware mobile applications. In: MobiCom ’01: Proceedings of the 7th annual international conference on Mobile computing and networking, pp. 1–14. ACM Press, New York, NY, USA (2001) 42. Reason, J.M., Rabaey, J.M.: A study of energy consumption and reliability in a multi-hop sensor network. ACM SIGMOBILE Mobile Computing and Communications Review 8(1), 84–97 (2004) 43. Römer, K., Kasten, O., Mattern, F.: Middleware challenges for wireless sensor networks. ACM SIGMOBILE Mobile Computing and Communications Review 6(4), 59–61 (2002) 44. Roberts, L.: ALOHA packet system with and without slots and capture. ACM SIGCOMM Computer Communication Review 5(2), 28–42 (1975) 45. Roundy, S., Wright, P.K., Rabaey, J.: A study of low level vibrations as a power source for wireless sensor nodes. Computer Communications 26(11), 1131–1144 (2003) 46. Sichitiu, M., Veerarittiphan, C.: Simple, accurate time synchronization for wireless sensor networks. In: WCNC ’03: Proceedings of the IEEE conference on Wireless Communications and Networking, vol. 2, pp. 1266–1273 (2003) 47. Stallings, W.: Operating Systems Internals and Design Principles, 5 edn. Prentice-Hall (2005) 48. Su, W., Akyildiz, I.F.: Time-diffusion synchronization protocol for wireless sensor networks. IEEE/ACM Trans. Netw. 13(2), 384–397 (2005). DOI http://dx.doi.org/10.1109/TNET.2004.842228 49. Suhonen, J., Hämäläinen, T.D., Hännikäinen, M.: Availability and end-to-end reliability in low duty cycle multihop wireless sensor networks. Sensors 9(3), 2088–2116 (2009) 50. Tian He, Stankovic, J.A., Lu, C., Abdelzaher, T.: SPEED: A stateless protocol for real-time communication in sensor networks. In: Proc. 23rd Int’l Conf. on Distributed Computing Systems, pp. 46–55. Providence, RI, USA (2003) 51. van Dam, T., Langendoen, K.: An adaptive energy-efficient MAC protocol for wireless sensor networks. In: Proc. 1st Int’l Conf. on Embedded Networked Sensor Systems (Sensys’03), pp. 171–180. Los Angeles, CA, USA (2003) 52.
Wan, C.Y., Campbell, A.T., Krishnamurthy, L.: Pump-slowly, fetch-quickly (PSFQ): A reliable transport protocol for sensor networks. IEEE Journal on Selected Areas in Communications 23(4), 862–872 (2005) 53. Want, R., Hopper, A., Falcao, V., Gibbons, J.: The active badge location system. ACM Transactions on Information Systems 10(1), 91–102 (1992) 54. Wolf, M., Kress, D.: Short-range wireless infrared transmission: the link budget compared to RF. IEEE Wireless Communications Magazine 10(2), 8–14 (2003)
55. Ye, F., Zhong, G., Lu, S., Zhang, L.: GRAdient broadcast: a robust data delivery protocol for large scale sensor networks. Kluwer Wireless Networks 11(3), 285–298 (2005) 56. Ye, W., Heidemann, J., Estrin, D.: An energy-efficient MAC protocol for wireless sensor networks. In: Proc. 21st Annual Joint Conf. of the IEEE Computer and Communications Societies (INFOCOM’02), vol. 3, pp. 1567–1576. New York, NY, USA (2002) 57. Yoon, S.: Power management in wireless sensor networks. North Carolina State University, PhD Thesis (2007) 58. Youssef, M.A., Agrawala, A., Shankar, A.U.: WLAN location determination via clustering and probability distributions. In: Pervasive Computing and Communications, 2003. (PerCom 2003). Proceedings of the First IEEE International Conference on, pp. 143–150 (2003)
Signal Processing for Cryptography and Security Applications

Miroslav Knežević1, Lejla Batina1, Elke De Mulder1, Junfeng Fan1, Benedikt Gierlichs1, Yong Ki Lee2, Roel Maes1 and Ingrid Verbauwhede1,2

1 Katholieke Universiteit Leuven, ESAT/COSIC and IBBT, Kasteelpark Arenberg 10, B-3001 Leuven-Heverlee, Belgium
2 University of California, Los Angeles, Electrical Engineering, 420 Westwood Plaza, Los Angeles, CA 90095-1594, USA
Abstract Embedded devices need both an efficient and a secure implementation of cryptographic primitives. In this chapter we show how common signal processing techniques are used to achieve both objectives. Regarding efficiency, we first give an example of accelerating hash function primitives using the retiming transformation, a well-known technique for improving signal processing applications. Second, we outline the use of some special features of DSP processors and techniques developed earlier for efficient implementations of public-key algorithms. Regarding secure implementations, we outline the concept of side channel attacks and show how a multitude of techniques for preprocessing the data are used in such scenarios. Finally, we discuss fuzzy secrets and point out the use of DSP techniques for an important task in cryptography: key derivation.
1 Introduction

The implementation of cryptographic and security applications has adopted techniques, more than one might expect, from the architectures, design methods, and tools of signal processing systems. When implementing cryptographic applications on embedded devices, in hardware, software, or a combination of both, there are two optimization goals. The first is efficiency: the implementations should be efficient in area, time, power consumption, etc. This is very similar to the implementation of signal processing applications. The other requirement is a secure implementation: implementations should not leak information. Signal pro-
cessing techniques are used for both. As research progresses, signal processing and cryptography find more and more overlap. In this chapter, an overview is given of the use of signal processing techniques both for efficient implementations and for the analysis of secure implementations. The techniques used for efficient implementations in particular are discussed in more detail in Chapter 28 and Chapter 12.

Section 2 gives an overview of efficient implementations for both secret-key and public-key algorithms. It describes techniques borrowed from fast filter implementations to optimize the implementation of hash functions, and it also describes implementations of public-key algorithms on DSP processors. Section 3 focuses on secure implementations and the signal processing techniques used in their analysis. Section 4 discusses the usage of signal processing techniques for fuzzy secrets. Section 5 concludes and describes some new trends.
2 Efficient Implementation

2.1 Secret-Key Algorithms and Implementations

2.1.1 Cryptographic Hash Functions and Implementations

A cryptographic hash function is a deterministic algorithm that takes an input string M of arbitrary length and returns a short fixed-length string, the so-called message digest Hash(M). It is required to satisfy the following cryptographic properties:

1. Preimage resistance: It must be infeasible to find any preimage, M, for a given hash output, H, such that H = Hash(M).
2. Second preimage resistance: It must be infeasible to find another preimage, M1, for a given input, M0, such that Hash(M0) = Hash(M1).
3. Collision resistance: It must be infeasible to find two different inputs, M0 and M1, with the same hash output, such that Hash(M0) = Hash(M1).

To be useful in applications such as the Digital Signature Algorithm and message authentication, performance becomes an important factor, since inputs can easily be large. To obtain architectures with better performance, a design methodology was developed in [36] based on theoretically proven signal processing techniques [44]. Here we illustrate this methodology with the example of MD6, which was one of the SHA-3 candidates [54]. Chapter 28 provides a more detailed analysis of the common DSP techniques used in this example.

First, we derive a DFG (Data Flow Graph) of MD6, as shown in Fig. 1. This DFG represents one iteration of the algorithm, where the square nodes are registers and each register is paired with an algorithmic delay (D) so that the register values can be updated on each cycle. The circle nodes are functional nodes, where ∧ denotes the
bitwise AND operator, ⊕ the bitwise XOR operator, ≪ b a left shift by b bits (shifting in zeros), and ≫ b a right shift by b bits (shifting in zeros). Si is constant within a round, but changes from round to round. Each round is composed of 13 iterations, and the number of rounds depends on the algorithm parameters.
Fig. 1 MD6 DFG — Critical Path
The critical path delay can be measured as the maximum delay between any two consecutive algorithmic delays (D's), which for MD6 consists of 6 XOR delays, as shown in Fig. 1 (indicated by an arrow with a dotted line). Before applying techniques to minimize the critical path delay, we first need to know how much can be achieved. This can be done by analyzing the iteration bound of the DFG, which defines a theoretical lower bound on the achievable critical path delay:
T∞ = max_{l∈L} ( t_l / w_l ) = ( 4 × Delay(⊕) + Delay(∧) ) / 18    (1)
where t_l is the loop computation time, w_l is the number of algorithmic delays in the l-th loop, and L is the set of all possible loops. The iteration bound of MD6 is defined by the loop indicated with a thick dotted line in Fig. 1.
Fig. 2 MD6 DFG — Retiming Transformation
In order to achieve a critical path delay equal to the iteration bound, an unfolding transformation with a factor equal to the un-canceled denominator, 18 in this case, would be required. This introduces overhead by duplicating the functional nodes (18-fold duplication). Therefore, we choose not to perform the unfolding transformation and apply the retiming transformation only. The critical path delay achievable with only the retiming transformation can be expressed as follows:
T = ⌈( 4 × Delay(⊕) + Delay(∧) ) / 18⌉ = Prop(⊕)    (2)
Here, ⌈·⌉ denotes the maximum part when 4 × Delay(⊕) + Delay(∧) is evenly distributed into 18 parts. Since the iteration bound analysis assumes that a functional node cannot be split into multiple parts, ⊕ cannot be split further. By retiming transformations, we relocate the algorithmic delays (D's) in order to reduce the critical path delay. Some of the algorithmic delays between registers (including those in the abbreviated area) can be moved to proper positions, as shown in Fig. 2, where the new critical path delay is Prop(⊕). The details of retiming transformations and implementation issues can be found in [36] [44]. This shows that techniques originally developed for improving the throughput of digital filters with feedback can also be applied to hash function implementations.
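The iteration bound of Eq. (1) can be computed mechanically by enumerating the loops of a DFG and maximizing t_l / w_l. The sketch below does this for a small invented graph, not for the full MD6 DFG.

from fractions import Fraction

# Illustrative DFG: node -> list of (successor, computation time of this
# node, algorithmic delays on the edge). Not the MD6 graph.
dfg = {
    "A": [("B", 2, 0)],
    "B": [("C", 3, 0), ("A", 3, 1)],  # loop A-B with 1 delay
    "C": [("A", 4, 2)],               # loop A-B-C with 2 delays
}

def iteration_bound(dfg):
    best = Fraction(0)
    def dfs(start, node, t, w, seen):
        nonlocal best
        for succ, nt, d in dfg[node]:
            if succ == start and w + d > 0:
                best = max(best, Fraction(t + nt, w + d))  # t_l / w_l
            elif succ not in seen:
                dfs(start, succ, t + nt, w + d, seen | {succ})
    for n in dfg:
        dfs(n, n, 0, 0, {n})
    return best

print(iteration_bound(dfg))  # max(5/1, 9/2) = 5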
2.2 Public-Key Algorithms and Implementations

2.2.1 Efficient Modular Multiplication

Public-key cryptography (PKC), a concept introduced by Diffie and Hellman [12] in the mid 70's, has gained popularity together with the rapid evolution of today's digital communication systems. The best-known public-key cryptosystems are based on factoring, i.e., RSA [49], on the discrete logarithm problem in a large prime field (Diffie-Hellman, ElGamal, Schnorr, DSA) [39], or on an elliptic/hyperelliptic curve (ECC/HECC) [32] [42] [33]. Based on the hardness of the underlying mathematical problem, PKC usually deals with large numbers, ranging from a few hundred to a few thousand bits in size. Consequently, the efficient implementation of PKC primitives has always been a challenge.

An efficient implementation of the mentioned cryptosystems highly depends on the efficient implementation of modular arithmetic. Namely, modular multiplication forms the basis of modular exponentiation, which is the core operation of the RSA cryptosystem. It is also present in many other cryptographic algorithms, including those based on ECC and HECC. In particular, if one uses projective coordinates for ECC/HECC, modular multiplication remains the most time-consuming operation. Hence, an efficient implementation of PKC relies on efficient modular multiplication.

Two algorithms for modular multiplication, namely the Barrett [3] and Montgomery [43] algorithms, are widely used today. Both algorithms avoid multiple-precision division, an operation that is considered expensive, especially in hardware. Implemented on a first-generation DSP produced by Texas Instruments (TMS 32010), Barrett's algorithm is one of the famous examples in the area of signal processing for cryptography. The properties of a DSP, such as a fast multiply-and-accumulate (MAC) operation and a fast microprocessor on a single chip, seemed to be an ideal combination for the proposed algorithm. Furthermore, additional speed-ups
were obtained by taking advantage of a feature of the TMS320 DSP which allowed auto-increment and auto-decrement of data pointers during MAC operations. This ensured data fetching for free. The original Barrett reduction algorithm is given in Alg. 1.

Algorithm 1 Barrett modular reduction.
Input: A = (A_{2n-1} ... A_0)_2, M = (M_{n-1} ... M_0)_2 where M_{n-1} ≠ 0, µ = ⌊2^{2n}/M⌋.
Output: R = A mod M.
  Q̂ ⇐ ⌊ ⌊A / 2^{n-1}⌋ · µ / 2^{n+1} ⌋
  R ⇐ (A mod 2^{n+1}) − (Q̂ · M mod 2^{n+1})
  if R < 0 then R ⇐ R + 2^{n+1}
  while R ≥ M do
    R ⇐ R − M
  end while
  return R.
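A minimal executable sketch of Barrett reduction follows. For clarity it keeps full-width intermediate values instead of the truncated mod-2^{n+1} arithmetic of Alg. 1; since Q̂ never overestimates ⌊A/M⌋, the correction loop runs at most twice. The modulus below is an arbitrary example value.

def barrett_reduce(a, m, n, mu):
    # q_hat = floor(floor(a / 2^(n-1)) * mu / 2^(n+1)) underestimates
    # floor(a / m) by at most 2 for a < 2^(2n).
    q_hat = ((a >> (n - 1)) * mu) >> (n + 1)
    r = a - q_hat * m
    while r >= m:  # at most two subtractions
        r -= m
    return r

m = 0xB7F3                 # example modulus, n = 16 bits
n = m.bit_length()
mu = (1 << (2 * n)) // m   # precomputed once per modulus
a = 0x12345678             # any a < 2^(2n)
assert barrett_reduce(a, m, n, mu) == a % m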
Montgomery's algorithm is the most commonly used reduction algorithm nowadays. It is widely used to speed up the modular multiplications and squarings required during the exponentiation process. The Montgomery reduction algorithm computes T = A · r^{-1} mod M, given A < M² and r such that gcd(r, M) = 1. Even though the algorithm works with any r relatively prime to M, r is very often chosen to be a power of 2. In that case, the Montgomery algorithm performs integer divisions by a power of 2, which is an intrinsically fast operation on both general-purpose and specialized processors. The algorithm is given in Alg. 2.

Algorithm 2 Montgomery modular reduction.
Input: A = (A_{2n-1} ... A_0)_2, M = (M_{n-1} ... M_0)_2, r = 2^n where gcd(M, 2) = 1, M′ = −M^{-1} mod r.
Output: T = A r^{-1} mod M.
  S ⇐ (A mod r) · M′ mod r
  T ⇐ (A + S · M) / r
  if T ≥ M then
    T ⇐ T − M
  end if
  return T.
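A corresponding sketch of Montgomery reduction with r = 2^n: the division by r becomes a right shift and the reductions mod r become bit masks, which is exactly what makes the method processor-friendly. The modulus is again an arbitrary example, and pow with a negative exponent needs Python 3.8+.

def montgomery_reduce(a, m, n, m_prime):
    r_mask = (1 << n) - 1
    s = ((a & r_mask) * m_prime) & r_mask  # S = (A mod r) * M' mod r
    t = (a + s * m) >> n                   # A + S*M is divisible by r
    return t - m if t >= m else t          # single conditional subtraction

m = 0xB7F3                                  # odd example modulus, n = 16
n = m.bit_length()
m_prime = (-pow(m, -1, 1 << n)) % (1 << n)  # M' = -M^(-1) mod r
a = 0x12345678                              # requires A < M^2
assert montgomery_reduce(a, m, n, m_prime) == (a * pow(2, -n, m)) % m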
A number of publications report implementations of Montgomery's algorithm on DSPs. One of the first reported results was the work of Michael Wiener, who developed a general software implementation of RSA on the Motorola DSP56000 that achieved 10.2 Kbps for 512-bit modular exponentiation with the Chinese Remainder Theorem (CRT) [6]. Dusse and Kaliski published an RSA implementation on the same chip achieving 11.6 Kbps for 512-bit modular exponentiation with the CRT [14]. They adapted Montgomery's algorithm by reducing the number of shift operations. Since on many processors the "high part" and the "low part" of the accumulators are separately addressable, the shift operations have to be performed with move instructions. This was also true for the DSP56000,
and the idea of reducing the number of shifts turned out to be a very good approach. Itoh et al. proposed a fast implementation method for Montgomery multiplication on the TMS320C6201 DSP [29]. This DSP operates eight functional units in parallel, achieving a performance of 11.7 ms for 1024-bit RSA signing (87.5 Kbps). Recently, Gastaldo et al. published an enhanced Montgomery multiplication algorithm, a variant of the finely integrated product scanning (FIPS) algorithm, that is targeted at digital signal processors [19]. More about various approaches to implementing Montgomery's algorithm on a DSP can be found in [9] [51].

2.2.2 Implementations of Public-Key Algorithms

Several implementations reported in the literature describe efficient PK cryptosystems on DSP architectures, mainly relying on the algorithm of Montgomery. For RSA cryptosystems, modular multiplication is practically the only required operation. As it is commonly implemented as a sequence of repeated squarings and multiplications, some authors explored possible speed-ups that can be achieved with a dedicated squaring. Krishnamurthy et al. proposed a new method for the Montgomery squaring reduction especially suited for DSP processors. The new method restructures the loop in the squaring reduction and, when implemented on the TI TMS320C6201 DSP platform, results in speed-ups of 10–15% compared to the work in [29]. The results are obtained by removing the redundancy of previous proposals and by exploiting pipelining as much as the platform permits.

Guajardo et al. [24] described a strategy for an efficient implementation of ECC over GF(p) on the 16-bit TI MSP430x33x family. This family of devices is commonly used for ultra-low-power applications such as electronic meters for gas, water, and electricity, and also for sensor systems that collect analogue signals. The signals are converted to digital form before being transmitted to a host. These properties make the platform very suitable for embedded security applications. The authors use a Generalized-Mersenne prime to facilitate the underlying field arithmetic. The operations, i.e., multiplication, squaring, and modular reduction, are adapted to exploit the specifics of the architecture. For example, multiplications by small constants were traded for additions, as the cost ratio of multiplication to addition is around 10 on this processor. Also, squaring was found to be much faster when implemented as a special routine because the communication with data memory was reduced (requiring only one input in this case). The use of a special prime, as suggested by standards, was additionally tailored to the 16-bit architecture. However, all these tricks were not enough to meet the requirements of the 160-bit security level (the required minimum for ECC key lengths). Therefore, the authors propose 128-bit EC implementations as sufficient for the targeted applications. We note here that implementing squaring differently from general multiplication was found to be sensitive to side-channel attacks (see Sect. 3). Therefore, the idea of dedicated squaring has been abandoned for embedded applications.
2.2.3 Other Ideas for Arithmetic Borrowed from Signal Processing

Cryptographic applications as well as signal processing require arithmetic-intensive operations; hence there are several ideas that have been applied in both. Examples include Residue Number System (RNS) arithmetic [52], signed-digit arithmetic, computation in the frequency domain, etc. Nowadays, ever faster arithmetic is demanded, e.g., for RSA cryptosystems, due to the constant improvements in factoring and the resulting requirements for ever longer key sizes. To achieve this, the Residue Number System is an alternative to the common radix representation. RNS arithmetic is a very old idea which relies on the Chinese Remainder Theorem (CRT), and it provides a good means for very long integer arithmetic. Let x_a denote an RNS representation of x; then:
(3)
where x[a_i] = x mod a_i. The set a = {a_1, a_2, ..., a_n} is called a base (of size n). It is required that gcd(a_i, a_j) = 1 for i ≠ j. The CRT implies that an integer x satisfying 0 ≤ x < ∏_{i=1}^{n} a_i is uniquely represented by x_a. A well-known advantage of RNS is that to add, subtract, and multiply such numbers we only need to compute the addition, subtraction, and multiplication of their components, which are of much smaller size than the original modulus. Carry-free arithmetic also makes parallelization possible. The final result is obtained by means of the CRT. The disadvantage of an RNS representation is the overhead introduced by the input and output conversions from binary to RNS and vice versa. It is also difficult to perform division. To overcome this disadvantage, a combination with Montgomery multiplication was proposed [46].

Exponent recoding techniques for modular exponentiation (as used for RSA) replace the binary representation of an exponent with a representation that has fewer non-zero terms (see Gollmann et al. [23]). Many techniques for exponent recoding have been proposed in the literature; here we mention the signed-digit representation. Consider an integer representation of the form k = ∑_{i=0}^{l} s_i 2^i, where s_i ∈ {−1, 0, 1}. This is called the (binary) signed-digit (SD) representation (see Menezes et al. [39]). The representation is redundant; for example, the integer 3 can be represented as (011)_2 or (1 0 -1)_2. An SD representation is said to be sparse if it has no adjacent non-zero digits. A sparse SD representation is also called a non-adjacent form (NAF). Every integer k has a unique NAF, which has the minimum weight of any signed-digit representation of k. For RSA, the NAF techniques are not beneficial because they would require performing division, which should be avoided due to its complexity. However, for ECC the NAF is considered advantageous because the Hamming weight of the scalar is minimal. In this case, "−1" means point subtraction instead of addition, i.e., an addition of the inverse point. More precisely, the scalar is decomposed as a NAF, and the scalar multiplication is done with a series of additions/subtractions of elliptic curve points.
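The NAF of a scalar can be computed with the standard right-to-left recoding; a minimal sketch:

def naf(k):
    # Right-to-left NAF recoding; digits are returned least significant
    # first and take values in {-1, 0, 1} with no two adjacent non-zeros.
    digits = []
    while k > 0:
        if k & 1:
            d = 2 - (k % 4)  # +1 if k = 1 (mod 4), -1 if k = 3 (mod 4)
            k -= d
        else:
            d = 0
        digits.append(d)
        k >>= 1
    return digits

print(naf(3))  # -> [-1, 0, 1], i.e. 3 = -1 + 4, the sparse form of (011)_2
k = 0b101101
assert sum(d << i for i, d in enumerate(naf(k))) == k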
Other redundancy-based techniques include on-line arithmetic as proposed by Ercegovac [15]; in particular, on-line arithmetic for DSP applications can be found in [16]. An ECC implementation in the frequency domain is given by Baktir et al. [2]. The work is based on Discrete Fourier Transform (DFT) modular multiplication, and an ECC processor architecture using the multiplier is presented. In this way, multiplication in GF(q^m) is achieved with only a linear number of base field GF(q) multiplications, in addition to a quadratic number of simpler base field operations such as additions/subtractions and bitwise rotations. The idea of implementing large integer multiplication in the discrete Fourier domain originates from Kalach and David [31].
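To convey the flavor of frequency-domain multiplication, the following sketch multiplies integers by convolving their digit vectors with a floating-point FFT. It is a simplified illustration of the general idea only, not the finite-field DFT construction of [2] nor the hardware design of [31]; rounding limits the usable operand size.

import numpy as np

def fft_mul(x, y, base=10):
    xd = [int(d) for d in str(x)[::-1]]            # digits, least significant first
    yd = [int(d) for d in str(y)[::-1]]
    n = 1 << (len(xd) + len(yd) - 1).bit_length()  # FFT length, power of two
    X = np.fft.rfft(xd, n)                         # two forward transforms,
    Y = np.fft.rfft(yd, n)                         # a pointwise product, and
    conv = np.rint(np.fft.irfft(X * Y, n))         # one inverse transform
    result, carry = 0, 0
    for i, c in enumerate(conv.astype(np.int64)):  # carry propagation
        carry += int(c)
        result += (carry % base) * base**i
        carry //= base
    return result

assert fft_mul(123456789, 987654321) == 123456789 * 987654321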
2.3 Architecture

The architectural design of cryptographic systems requires special consideration of performance, cost and security. Many cryptographic systems, especially public-key cryptosystems, involve complex operations on large integers or long polynomials. Thus, cryptographic co-processors, just like accelerators for multimedia, communication or image processing, are introduced to offload the heavy computations from the main CPU. As a result, a Hardware/Software (HW/SW) co-design approach is usually used. Both the co-processor and the interface need to be carefully designed such that the co-processor can efficiently offload the work from the CPU while security and flexibility are maintained. We illustrate the idea with an Elliptic Curve Processor.

2.3.1 Datapath

Given a cryptographic algorithm, a design space can be explored. Taking ECC over GF(p) as an example, each scalar point multiplication involves hundreds of modular multiplications. As a result, speeding up modular multiplication is the key to improving overall performance. The multiplier can be implemented in a bit-parallel, digit-parallel or digit-serial way, and by adjusting the digit size the design space can be explored. On some FPGA platforms, we can also use the dedicated multiplier (the 18×18 multiplier on the Virtex-II Pro) or MAC (the 25-bit MAC on the Virtex-4) as a building block of the modular multiplier [40] [26]. A multi-core architecture can exploit the parallelism between modular multiplications. The basic idea is similar to that of a classical superscalar processor, i.e., by exploiting instruction-level parallelism one can perform multiple modular multiplications simultaneously [50] [17]. In practice, the bottleneck of a multi-core co-processor normally comes from memory access, and traditional approaches such as hierarchical memory are helpful.
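The digit-size knob mentioned above can be mimicked in software. In the hypothetical sketch below, the parameter w plays the role of the digit size, and the '% m' stands in for whatever reduction circuit (Montgomery, Barrett, or a special-prime reducer) a real datapath would instantiate:

def mod_mul_digit_serial(x, y, m, w):
    # interleaved modular multiplication, processing y in w-bit digits,
    # most significant digit first; one reduction per digit
    digits = []
    while y:
        digits.append(y & ((1 << w) - 1))
        y >>= w
    acc = 0
    for d in reversed(digits):
        acc = ((acc << w) + x * d) % m
    return acc

assert mod_mul_digit_serial(1234567, 7654321, 99991, 4) == \
       (1234567 * 7654321) % 99991

In hardware, increasing w shortens the loop (fewer clock cycles) at the price of a wider multiplier, which is exactly the area-speed trade-off swept during design-space exploration.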
2.3.2 Interface

When integrating the co-processor into a system, the interface between the co-processor and the CPU is of vital importance. Much research has been carried out on this topic, focusing mainly on two aspects: performance and flexibility. The HW/SW partitioning has a big impact on the overall performance of the system. Taking ECC as an example, one can immediately see the following three choices, as shown in Fig. 3.
Fig. 3 Hardware/Software co-design for ECC.
When choosing PL as the partitioning point, the co-processor performs the modular arithmetic while the CPU generates the instruction sequences for point addition/doubling and all functions above. The upside of this choice is that it is highly flexible; the downside is performance, as the transfer of data and instructions between the CPU and the co-processor becomes the bottleneck. If we choose PH as the partitioning point, then the CPU only needs to send the secret k and the base point P at the very beginning of a scalar multiplication, which substantially reduces the communication overhead. However, the co-processor loses flexibility.

2.3.3 Programmability and Security

Being able to respond to new standards or attacks is very important for cryptographic systems. For example, when a new side-channel analysis appears, corresponding countermeasures need to be integrated into the cryptographic system. Thus, the co-processor should have a certain degree of programmability. One method to achieve both low communication overhead and high flexibility is to introduce a control hierarchy: instead of using a fixed state machine, one can use a programmable micro-controller as the control cell of the co-processor.
2.3.4 Low Power Architecture

When using cryptography in constrained devices, e.g. passive RFID tags and wireless sensor nodes, power consumption becomes a big challenge. Several low power architectures [37] [27] for ECC have been proposed. Traditional techniques such as reducing the hardware size, lowering the clock frequency and using clock-gating are helpful. For ECC implementations, registers consume most of the dynamic power. Thus, reducing the size of the register file or using shorter registers can significantly reduce power consumption.
3 Secure Implementations: Side Channel Attacks

Cryptographic algorithms, i.e. mathematical objects describing the input-output behavior of a black box, have to go through a thorough process of cryptanalysis before they are widely accepted as secure and possibly standardized. Implementations of cryptographic algorithms essentially instantiate these black boxes. An implementation of an algorithm does, however, not automatically inherit the algorithm's security. The fact that the abstract black box is implemented in software, hardware, or a combination of both gives rise to new security risks. Indeed, measurable physical properties of an implementation become an additional, unintended source of information, giving away knowledge about the secrets involved in cryptographic implementations. Those sources are called side channels. While the execution time of an implementation can be measured even from a distance, e.g. over a network, the physical accessibility of embedded devices in particular gives rise to even more side channels, such as power consumption, electromagnetic radiation, acoustics, heat dissipation, light emission, etc. Cryptanalytic methods exploiting such information leakage to extract secret data are called side channel attacks. Side channel attacks are a serious concern, as they allow an attacker to extract secret information from unprotected implementations of black-box secure algorithms with moderate effort. With the ever increasing availability and ubiquity of embedded devices, side channel attacks are a realistic threat nowadays.

One type of side channel attack, the differential side channel attack, proceeds in several steps to extract the secret information. In a first stage the attacker collects measurements of the side channel, e.g. the power consumption, as a function of time. Each measurement is the physical representation of an execution of the cryptographic algorithm with a different message but a fixed key. In a second stage, the attacker preprocesses the data, chooses a hypothetical power leakage model and calculates the hypothetical leakage under this model for each message and for every possible key guess. An example of such a model is the Hamming weight of the processed data at a certain time instant. In this case the adversary assumes that the amplitude of the side channel measurement is related to the number of binary ones in the data. The validity of this model in certain cases is demonstrated in Fig. 4. In the last phase of the attack, the attacker uses a statistical test to quantify the similarity between the hypothetical leakage and the real side channel leakage for every key guess. The key that reveals the strongest relation is the adversary's best guess.
Fig. 4 Measurements of supply current over time. Data-dependent operations in clock cycles three and four leak information about the operand(s).
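The three stages just described fit in a few lines of code. The sketch below mounts a correlation-based differential attack on simulated measurements using the Hamming-weight model; everything in it is a toy stand-in (the "S-box" is a random permutation, and the noise level, trace count and point of interest are arbitrary choices of ours):

import numpy as np

HW = np.array([bin(v).count("1") for v in range(256)])   # Hamming weights

def cpa_attack(traces, plaintexts, sbox, poi):
    # traces: (N, T) array of measurements; poi: index of the leaking sample
    samples = traces[:, poi]
    best_key, best_corr = None, -1.0
    for key_guess in range(256):
        hyp = HW[sbox[plaintexts ^ key_guess]]           # hypothetical leakage
        corr = abs(np.corrcoef(hyp, samples)[0, 1])      # statistical test
        if corr > best_corr:
            best_key, best_corr = key_guess, corr
    return best_key, best_corr

# simulated usage: noisy Hamming-weight leakage of sbox[p ^ k]
rng = np.random.default_rng(0)
sbox = rng.permutation(256)
secret = 0x3C
pts = rng.integers(0, 256, size=2000)
leak = HW[sbox[pts ^ secret]] + rng.normal(0.0, 1.0, size=2000)
print(cpa_attack(leak.reshape(-1, 1), pts, sbox, poi=0))  # recovers 0x3C = 60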
3.1 DSP Techniques Used in Side Channel Analysis

The side channel analysis research domain has matured over the last decade; hence, a multitude of techniques for preprocessing the data and for quantifying the relation between the real and the hypothetical leakage have been proposed. The measurements of the side channel are susceptible to external and internal noise contamination. Therefore, a first group of preprocessing techniques aims at reducing the amplitude noise. The simplest preprocessing technique is averaging the measurements, as first pointed out by Kocher et al. in [34]. Further noise reduction can be achieved by filtering the data; results have been published for the power side channel [41] as well as for the electromagnetic side channel [1]. Another, more recently proposed technique is calculating the fourth-order cumulant [35]. In some cases, the useful information is modulated onto a carrier; for this type of data extraction, demodulation techniques are appropriate [1]. Due to clock jitter, the absence of a suitable trigger, or simply due to countermeasures, measurements can suffer from temporal misalignment. Several publications deal with this problem: Homma et al. explain the phase-based waveform matching procedure [28], Gebotys and White introduce the phase replacement technique in [21], and Pelletier and Charvet use wavelets to get rid of misalignment [45]. In the final phase of a side channel attack, the attacker uses a statistical tool to find the correct key by analyzing the relation between the hypothetical leakage and the measured leakage. The difference-of-means test was proposed by Kocher et al. in their seminal paper [34]. Later, more advanced techniques were put forward: the
T-test [11], the Pearson correlation coefficient [7], Spearman rank correlation [4], mutual information [22], and maximum likelihood [10]. While all the previous techniques are applied in the time domain, differential frequency analysis exploits the frequency information [20]. Advanced signal processing techniques strengthen the adversary's capabilities and raise the need for better countermeasures against side channel attacks.
4 Working with Fuzzy Secrets

4.1 Fuzzy Secrets: Properties and Applications in Cryptography

Secrets play an important role in cryptography, since the alleged security of many cryptographic schemes is based solely on the assumed confidentiality of certain data, usually called a key. In order for a secret key to be secure, it needs to be hard for an adversary to guess its value. Keys are typically represented as digital n-bit strings, and when they are sampled uniformly at random from {0, 1}^n, an adversary can only guess their value with probability 2^−n, which becomes negligibly small when n is chosen large enough. In addition, a cryptographic algorithm should only succeed when the correct key is applied, and fail when even a single bit is changed, in order to prevent an adversary from guessing keys close to the correct one, which is easier than guessing the full key. This means that secret keys cannot be subject to any kind of noise or distortion.

In a number of security applications, secret data is present, but not in the form of a uniformly distributed and noise-free binary key string. This is often the case when the secret stems from a physical phenomenon which is mostly analogue in nature, not uniformly distributed, and subject to random noise. Such data is called a fuzzy secret. Some examples of cryptographic settings involving fuzzy secrets are listed below:

• Biometric readings such as fingerprints, iris scans, face structure and voice patterns are believed to be unique to each human being and can hence serve as a unique, and possibly secret, identifier. To obtain digital data from these readings, a feature extraction process is necessary. However, the extracted features are most likely not uniformly distributed, and some features are measured differently at consecutive readings, i.e. they are subject to measurement noise. Extracted biometric features can hence not be used directly as a cryptographic key.
• Physically unclonable functions [48] [53], or PUFs, are cryptographic hardware primitives that fulfill a function similar to biometrics, i.e. they measure unique physical properties of their embedding device. Particularly interesting constructions are PUFs embedded into integrated circuits, since they make it possible to generate device-unique and highly protected chip identifiers [25] [18]. The uniqueness of every chip stems from uncontrollable nano-scale manufacturing variability in the production process of the circuits. Like biometrics, PUF responses suffer from
a non-uniform distribution and from unreliability due to measurement inaccuracy and random thermal noise in the electrical measurement circuits.

More settings involving fuzzy secrets exist, e.g. in authentication with long passphrases and in quantum key agreement. The non-uniformity and the noise of a fuzzy secret can be quantified. Information entropy is commonly used to describe the uncertainty about a random variable. A special case of entropy called min-entropy describes the worst-case uncertainty, i.e. it evaluates the best chance an adversary has of guessing a secret value chosen from a known distribution. The min-entropy of a random variable X is denoted H∞(X) and is defined as H∞(X) = −log₂ max_x Pr[X = x]. A uniformly distributed n-bit variable has maximal min-entropy equal to n, while a deterministic value has min-entropy zero. To describe the noise, it is assumed that the measured fuzzy secret X is a bit vector of length n. Two distinct observations of the fuzzy secret possibly differ in a number of bit locations due to noise. The amplitude of the noise affecting X can be quantified by the maximum number of differing bits between two measurements, which we denote δ(X).
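For concreteness, the min-entropy of a source is easily estimated from observed samples; the biased 4-bit source below is invented for illustration:

from collections import Counter
from math import log2

def min_entropy(samples):
    # H_inf(X) = -log2( max_x Pr[X = x] ), estimated from sample counts
    counts = Counter(samples)
    p_max = max(counts.values()) / len(samples)
    return -log2(p_max)

samples = [0] * 40 + list(range(1, 16)) * 4   # Pr[X = 0] = 0.4
print(min_entropy(samples))                   # -log2(0.4) = 1.32 bits, not 4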
4.2 Generating Cryptographic Keys from Fuzzy Secrets

In most cases, a physical variable cannot be used directly. It must first be measured as accurately as possible and afterwards quantized into a discrete value which can be used by a digital algorithm. The quantization process translates physical noise into bit errors of a bit vector. Also, due to rounding, part of the information in the measured value is lost. The quantization method should be chosen carefully to optimize these parameters given the implementation constraints. After measurement and quantization, one is left with a possibly non-uniform and unreliable bit string X ∈ {0, 1}^n. By performing multiple measurements, the aforementioned parameters H∞(X) and δ(X) can be estimated.

The next step in the transformation is dealing with the noise, which can introduce up to δ(X) bit errors. Error correction techniques for this purpose are readily available from channel coding theory. A process known as the code-offset technique [30] [13] [38] is an elegant way of using error-correcting block codes to get rid of bit errors in a measurement of X. The technique works in two phases:

1. In the generation phase, X is measured and a codeword C is randomly selected from an [n, k, t] block code, with n the code length, k the code dimension and t the error-correcting capability. The binary offset between X and C is calculated as W = X ⊕ C, and W is made publicly available, e.g. it is published in an online database.
2. In the reproduction phase, X′ is measured, which can differ from X in up to δ(X) bit positions. Using the publicly available W, one can calculate C′ = X′ ⊕ W, which differs from C in at most δ(X) bits. If the code is chosen appropriately such that t ≥ δ(X), the decoding algorithm will succeed in computing
C = Decode(C′). In that case the same X as in the generation phase can be reconstructed as X = C ⊕ W.

This technique uses public communication to establish a noise-free secret from a noisy one. The data that is passed over this channel, W, is called helper data. Clearly, publishing W decreases the uncertainty an adversary may have about the value of X. In fact, for the above technique it can be shown that the min-entropy of X is reduced from H∞(X) to H∞(X) − n + k, i.e. the helper data induces a min-entropy loss.

To conclude, the now noise-free secret with limited min-entropy needs to be transformed into a uniformly distributed key K. For this, randomness extractors are used. Randomness extractors succeed in extracting uniform randomness from non-uniformly distributed variables [8] [5]. The output domain of a randomness extractor is necessarily smaller than its input domain; hence they are compression functions. In fact, the min-entropy of a random variable X is a measure of the maximum number of uniform bits an ideal randomness extractor can extract from X. After error correction with the code-offset technique, the fuzzy secret X can hence contribute at most H∞(X) − n + k uniformly random bits. Generic randomness extractors can be constructed relatively easily using so-called universal hash functions: if H is a universal hash family, the process that selects a random function h ← H and calculates K = h(X) is a randomness extractor. The seed that selects h can again be published as helper data to allow reconstruction of K at later times, this time without additional loss in min-entropy.
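The two phases of the code-offset technique can be demonstrated with the simplest possible block code, a repetition code; the parameters below are toy values of ours, and a practical design would use a stronger [n, k, t] code such as a BCH code:

import secrets

def rep_encode(bits, t):              # [k*(2t+1), k, t] repetition code
    return [b for b in bits for _ in range(2 * t + 1)]

def rep_decode(bits, t):              # majority vote corrects t errors per block
    r = 2 * t + 1
    return [int(sum(bits[i:i + r]) > t) for i in range(0, len(bits), r)]

def generate(x, k, t):
    c = rep_encode([secrets.randbits(1) for _ in range(k)], t)  # random codeword
    return [xi ^ ci for xi, ci in zip(x, c)]                    # helper W = X xor C

def reproduce(x_noisy, w, t):
    c_noisy = [xi ^ wi for xi, wi in zip(x_noisy, w)]   # C' = X' xor W
    c = rep_encode(rep_decode(c_noisy, t), t)           # C = Decode(C')
    return [ci ^ wi for ci, wi in zip(c, w)]            # X = C xor W

k, t = 4, 1                           # n = 12, corrects 1 error per 3-bit block
x = [secrets.randbits(1) for _ in range(12)]
w = generate(x, k, t)
x_noisy = list(x); x_noisy[5] ^= 1    # one bit flips in the second measurement
assert reproduce(x_noisy, w, t) == x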
5 Conclusion and Future Work

The purpose of this chapter has been to illustrate the use of signal processing techniques in the design and implementation of cryptographic and security applications. This overview is only a beginning and is by no means complete; new directions are still being explored. One example is signal processing in the encrypted domain, which is the topic of the SPEED project [47].
References

1. D. Agrawal, B. Archambeault, J. R. Rao, and P. Rohatgi. The EM Side-Channel(s). In B. S. Kaliski Jr., Ç. K. Koç, and C. Paar, editors, Cryptographic Hardware and Embedded Systems - CHES 2002, 4th International Workshop, Redwood Shores, CA, USA, August 13-15, 2002, Revised Papers, volume 2523 of Lecture Notes in Computer Science, pages 29–45. Springer, 2002.
2. S. Baktir, S. Kumar, C. Paar, and B. Sunar. A State-of-the-art Elliptic Curve Cryptographic Processor Operating in the Frequency Domain. In Mobile Networks and Applications (MONET) Journal, Special Issue on Next Generation Hardware Architectures for Secure Mobile Computing, pages 259–270, 2007.
3. P. Barrett. Implementing the Rivest Shamir and Adleman Public Key Encryption Algorithm on a Standard Digital Signal Processor. In Proc. CRYPTO’86, pages 311–323, 1986.
4. L. Batina, B. Gierlichs, and K. Lemke-Rust. Comparative Evaluation of Rank Correlation Based DPA on an AES Prototype Chip. In Proceedings of the 11th International Conference on Information Security - ISC 2008, pages 341–354, Berlin, Heidelberg, 2008. Springer.
5. C. H. Bennett, G. Brassard, and J.-M. Robert. Privacy Amplification by Public Discussion. SIAM J. Comput., 17(2):210–229, 1988.
6. E. F. Brickell. A Survey of Hardware Implementations of RSA. In Advances in Cryptology - CRYPTO ’89 (LNCS 435), pages 368–370, 1990.
7. E. Brier, C. Clavier, and F. Olivier. Correlation Power Analysis with a Leakage Model. In M. Joye and J.-J. Quisquater, editors, Cryptographic Hardware and Embedded Systems - CHES 2004: 6th International Workshop, Cambridge, MA, USA, August 11-13, 2004, Proceedings, volume 3156 of Lecture Notes in Computer Science, pages 16–29. Springer, 2004.
8. J. L. Carter and M. N. Wegman. Universal Classes of Hash Functions. In STOC ’77: Proceedings of the 9th ACM Symposium on Theory of Computing, pages 106–112. ACM, 1977.
9. Ç. K. Koç, T. Acar, and B. S. Kaliski Jr. Analyzing and Comparing Montgomery Multiplication Algorithms. IEEE Micro, pages 26–33, June 1996.
10. S. Chari, C. S. Jutla, J. R. Rao, and P. Rohatgi. Towards Sound Approaches to Counteract Power-Analysis Attacks. In M. Wiener, editor, Advances in Cryptology: Proceedings of CRYPTO’99, number 1666 in Lecture Notes in Computer Science, pages 398–412, Santa Barbara, CA, USA, August 15-19, 1999. Springer-Verlag.
11. J.-S. Coron, P. C. Kocher, and D. Naccache. Statistics and Secret Leakage. In Y. Frankel, editor, Financial Cryptography, 4th International Conference, FC 2000, Anguilla, British West Indies, February 20-24, 2000, Proceedings, volume 1962 of Lecture Notes in Computer Science, pages 157–173. Springer, 2000.
12. W. Diffie and M. E. Hellman. New Directions in Cryptography. IEEE Transactions on Information Theory, 22:644–654, 1976.
13. Y. Dodis, R. Ostrovsky, L. Reyzin, and A. Smith. Fuzzy Extractors: How to Generate Strong Keys from Biometrics and Other Noisy Data. SIAM Journal on Computing, 38(1):97–139, 2008.
14. S. R. Dusse and B. S. Kaliski Jr. A Cryptographic Library for the Motorola DSP56000. In Advances in Cryptology - EUROCRYPT ’90 (LNCS 473), pages 230–244, 1991.
15. M. D. Ercegovac. On-line Arithmetic: An Overview. In SPIE Real-Time Signal Processing VII, pages 86–93, 1984.
16. M. D. Ercegovac and T. Lang. On-line Arithmetic for DSP Applications. In 32nd IEEE Midwest Symposium on Circuits and Systems, 1989.
17. J. Fan, K. Sakiyama, and I. Verbauwhede. Elliptic Curve Cryptography on Embedded Multicore Systems. Design Automation for Embedded Systems, 12, 2008.
18. B. Gassend, D. Clarke, M. van Dijk, and S. Devadas. Silicon Physical Random Functions. In CCS ’02: Proceedings of the 9th ACM Conference on Computer and Communications Security, pages 148–160, New York, NY, USA, 2002. ACM.
19. P. Gastaldo, G. Parodi, and R. Zunino. Enhanced Montgomery Multiplication on DSP Architectures for Embedded Public-Key Cryptosystems. EURASIP J. Embedded Syst., 2008:1–9, 2008.
20. C. H. Gebotys, S. Ho, and C. C. Tiu. EM Analysis of Rijndael and ECC on a Wireless Java-Based PDA. In J. R. Rao and B.
Sunar, editors, Cryptographic Hardware and Embedded Systems - CHES 2005, 7th International Workshop, Edinburgh, UK, August 29 - September 1, 2005, Proceedings, volume 3659 of Lecture Notes in Computer Science, pages 250–264. Springer, 2005.
21. C. H. Gebotys and B. A. White. EM Analysis of a Wireless Java-Based PDA. ACM Transactions on Embedded Computing Systems, 7(4):1–28, 2008.
22. B. Gierlichs, L. Batina, P. Tuyls, and B. Preneel. Mutual Information Analysis - A Generic Side-Channel Distinguisher. In Elisabeth Oswald and Pankaj Rohatgi, editors, Cryptographic Hardware and Embedded Systems - CHES 2008, volume 5154 of Lecture Notes in Computer Science, pages 426–442, Washington, DC, USA, 2008. Springer-Verlag.
23. D. Gollmann, Y. Han, and C. Mitchell. Redundant Integer Representation and Fast Exponentiation. Designs, Codes and Cryptography, 7:135–151, 1998.
24. J. Guajardo, R. Blumel, U. Krieger, and C. Paar. Efficient Implementation of Elliptic Curve Cryptosystems on the TI MSP 430x33x Family of Microcontrollers. In Public Key Cryptography, pages 365–382, 2001.
25. J. Guajardo, S. Kumar, G.-J. Schrijen, and P. Tuyls. FPGA Intrinsic PUFs and Their Use for IP Protection. In CHES ’07: Proceedings of the 9th International Workshop on Cryptographic Hardware and Embedded Systems, pages 63–80. Springer-Verlag, 2007.
26. T. Güneysu and C. Paar. Ultra High Performance ECC over NIST Primes on Commercial FPGAs. In CHES, pages 62–78, 2008.
27. D. Hein, J. Wolkerstorfer, and N. Felber. ECC Is Ready for RFID — A Proof in Silicon. In Selected Areas in Cryptography: 15th International Workshop, SAC 2008, pages 401–413, Berlin, Heidelberg, 2009. Springer-Verlag.
28. N. Homma, S. Nagashima, Y. Imai, T. Aoki, and A. Satoh. High-Resolution Side-Channel Attack Using Phase-Based Waveform Matching. In L. Goubin and M. Matsui, editors, Cryptographic Hardware and Embedded Systems - CHES 2006, 8th International Workshop, Yokohama, Japan, October 10-13, 2006, Proceedings, volume 4249 of Lecture Notes in Computer Science, pages 187–200. Springer, 2006.
29. K. Itoh, M. Takenaka, N. Torii, S. Temma, and Y. Kurihara. Fast Implementation of Public-Key Cryptography on a DSP TMS320C6201. In Proceedings of the 1st International Workshop on Cryptographic Hardware and Embedded Systems (CHES ’99), pages 61–72, August 1999.
30. A. Juels and M. Wattenberg. A Fuzzy Commitment Scheme. In CCS ’99: Proceedings of the 6th ACM Conference on Computer and Communications Security, pages 28–36. ACM Press, 1999.
31. K. Kalach and J. P. David. Hardware Implementation of Large Number Multiplication by FFT with Modular Arithmetic. In 3rd International IEEE-NEWCAS Conference, pages 267–270, 2005.
32. N. Koblitz. Elliptic Curve Cryptosystems. Math. Comp., 48:203–209, 1987.
33. N. Koblitz. A Family of Jacobians Suitable for Discrete Log Cryptosystems. In S. Goldwasser, editor, Advances in Cryptology: Proceedings of CRYPTO’88, number 403 in Lecture Notes in Computer Science, pages 94–99. Springer-Verlag, 1988.
34. P. Kocher, J. Jaffe, and B. Jun. Differential Power Analysis. In M. Wiener, editor, Advances in Cryptology: Proceedings of CRYPTO’99, number 1666 in Lecture Notes in Computer Science, pages 388–397. Springer-Verlag, 1999.
35. T.-H. Le, J. Clédière, C. Servière, and J.-L. Lacoume. Noise Reduction in Side Channel Attack Using Fourth-Order Cumulant. IEEE Transactions on Information Forensics and Security, 2(4):710–720, 2007.
36. Y. K. Lee, H. Chan, and I. Verbauwhede. Design Methodology for Throughput Optimum Architectures of Hash Algorithms of the MD4-class. Journal of Signal Processing Systems, 53(1-2):89–102, 2008.
37. Y. K. Lee, K. Sakiyama, L. Batina, and I. Verbauwhede. Elliptic-Curve-Based Security Processor for RFID. IEEE Trans. Comput., 57(11):1514–1527, 2008.
38. J.-P. Linnartz and P. Tuyls. New Shielding Functions to Enhance Privacy and Prevent Misuse of Biometric Templates. In AVBPA, pages 393–402, 2003.
39. A. Menezes, P. van Oorschot, and S. Vanstone. Handbook of Applied Cryptography. CRC Press, 1997.
40. N. Mentens, K. Sakiyama, B. Preneel, and I. Verbauwhede. Efficient Pipelining for Modular Multiplication Architectures in Prime Fields.
In GLSVLSI ’07: Proceedings of the 17th ACM Great Lakes Symposium on VLSI, pages 534–539, New York, NY, USA, 2007. ACM.
41. T. S. Messerges, E. A. Dabbish, and R. H. Sloan. Examining Smart-Card Security under the Threat of Power Analysis Attacks. IEEE Transactions on Computers, 51(5):541–552, May 2002.
42. V. Miller. Uses of Elliptic Curves in Cryptography. In H. C. Williams, editor, Advances in Cryptology: Proceedings of CRYPTO’85, number 218 in Lecture Notes in Computer Science, pages 417–426. Springer-Verlag, 1985.
43. P. Montgomery. Modular Multiplication without Trial Division. Mathematics of Computation, 44(170):519–521, 1985.
44. K. K. Parhi. VLSI Digital Signal Processing Systems: Design and Implementation. Wiley, 1999.
45. H. Pelletier and X. Charvet. Improving the DPA Attack Using Wavelet Transform. NIST Physical Security Testing Workshop, 2005.
46. K. C. Posch and R. Posch. Modulo Reduction in Residue Number Systems. IEEE Transactions on Parallel and Distributed Systems, 6(5):449–454, May 1995.
47. SPEED Project. http://www.speedproject.eu/.
48. P. S. Ravikanth. Physical One-Way Functions. PhD thesis, Massachusetts Institute of Technology, 2001. Supervisor: Stephen A. Benton.
49. R. L. Rivest, A. Shamir, and L. Adleman. A Method for Obtaining Digital Signatures and Public-Key Cryptosystems. Communications of the ACM, 21(2):120–126, 1978.
50. K. Sakiyama, L. Batina, B. Preneel, and I. Verbauwhede. Superscalar Coprocessor for High-Speed Curve-Based Cryptography. In CHES, pages 415–429, 2006.
51. J. Großschädl, K. C. Posch, and S. Tillich. Architectural Enhancements to Support Digital Signal Processing and Public-Key Cryptography. In Proceedings of the 2nd Workshop on Intelligent Solutions in Embedded Systems (WISES ’04), pages 129–143, June 2004.
52. M. A. Soderstrand, W. K. Jenkins, G. A. Jullien, and F. J. Taylor. Residue Number System: Modern Applications in Digital Signal Processing. IEEE Press, 1986.
53. P. Tuyls, G. J. Schrijen, B. Škorić, J. van Geloven, N. Verhaegh, and R. Wolters. Read-Proof Hardware from Protective Coatings. In Louis Goubin and Mitsuru Matsui, editors, CHES ’06: Proceedings of the 8th International Workshop on Cryptographic Hardware and Embedded Systems, volume 4249 of LNCS, pages 369–383. Springer, 2006.
54. The SHA-3 Zoo. http://ehash.iaik.tugraz.at/wiki/The_SHA-3_Zoo.
High-Energy Physics

Anthony Gregerson, Michael J. Schulte, and Katherine Compton
Abstract High-energy physics (HEP) applications represent a cutting-edge field for signal processing systems. HEP applications require sophisticated hardware-based systems to process the massive amounts of data that they generate. Scientists use these systems to identify and isolate the fundamental particles produced during collisions in particle accelerators. This chapter examines the fundamental characteristics of HEP applications and the technical and developmental challenges that shape the design of signal processing systems for HEP. These challenges include huge data rates, low latencies, evolving specifications, and long design times. We cover techniques for HEP system design, including scalable designs, testing and verification, dataflow-based modeling, and design partitioning. Throughout, we provide concrete examples from the design of the Level-1 Trigger System for the Compact Muon Solenoid (CMS) Experiment at the Large Hadron Collider (LHC). We also discuss some of the new physics algorithms to be included in the proposed Super LHC upgrade.
1 Introduction to High-Energy Physics

For decades, the field of high-energy particle physics has served as one of the drivers pushing forward the state-of-the-art in many engineering disciplines. The study of particles and the search for the principles governing their behavior and interaction has existed for nearly 3000 years; as a result, most of the low-hanging fruit of particle physics has already been picked. As technology progressed, particle
physicists moved beyond the search for commonly-occurring particles and towards the search for rare and elusive particles that might only be created from energy sources much greater than those that naturally occur on Earth. The search to uncover previously undiscovered particles and understand particle interactions at high energy levels led to the evolution of modern particle physics, commonly known as high-energy physics (HEP). Although the search for these particles may appear esoteric, the discovery of certain high-energy particles could provide key insights into theories at the frontier of our current understanding of matter, energy, forces, and the nature of the Universe.

A number of HEP experiments make use of natural sources of high-energy particles, such as neutron stars and supernovae. However, most experiments are built around particle accelerators — colossal machines designed to accelerate particles to near-light speed. When these particles collide with high enough energy, a 'collision event' occurs. Collision events produce short-lived, unstable particles that rapidly decay into common, stable particles [55]. This process is depicted in Fig. 1.

Fig. 1 A computer rendering of a beam collision in the LHC's CMS experiment. This simulation represents a pattern of post-collision particles that might be observed if supersymmetrical particles are produced. Image Copyright CMS Experiment at CERN.

Regardless of the source of the high-energy particles, HEP experiments that study high-energy collision events present substantial signal processing challenges. These experiments use large sets of sophisticated sensors that measure properties of the particles produced in the collisions, such as the quantity and composition of energy, electromagnetic characteristics, trajectory, and propensity for interaction and absorption with other materials. The unstable particles created in high-energy interactions are incredibly short-lived; for example, the elementary particle known as the Z-boson has an average life of less than 10⁻²⁴ seconds [5]. This particle would not have been observed if not for a particle accelerator. Even though it was theorized to exist in the late 1960s, it took 15 years before scientists were able to discover a Z-boson at CERN's Super Proton Synchrotron accelerator. The short life and rarity of high-energy particles means that HEP experiments must collect data very rapidly if they are to capture important events. Thus, the experiments' sensors may produce huge quantities of raw data that must then be processed — often in real time — to identify and characterize the particles produced in the collisions. In general, the major signal processing topics for HEP applications can be divided into four areas:

1. Control and Synchronization — Particle colliders must accelerate two beams of atomic or sub-atomic-scale objects to relativistic speeds and successfully collide the objects in these two beams. This requires highly sophisticated control over the superconducting magnets responsible for guiding and accelerating the beams [53]. Achieving high collision rates can require GHz-scale sampling rates and a dynamic range of 100 dB [20]. Control systems are also necessary to monitor and calibrate the experiments. Furthermore, particle collisions are extremely short-lived events; great care must be taken to ensure proper timing and clock synchronization of all systems to ensure that these events are properly sampled and processed [52].
2. Detector Sampling and Signal Conditioning — HEP systems feature huge quantities and varieties of sensors, such as crystals, photodiodes, cathode strips, drift tubes, resistive plates, scintillators, Cerenkov counters, charge-coupled devices,
and other materials. A single experimental system may contain many of these sensor types. Each group of sensors must be sampled and reconstructed using techniques suited to the particular technology deployed. To properly isolate the data of individual particles, the signals coming from the sensors must also be carefully filtered to remove background energy noise and phase-adjusted to compensate for non-linear amplification effects [12].
3. High-Speed Communications — The large number of sensors required in HEP systems and the massive amount of data these sensors produce may require thousands of high-bandwidth, low-latency data links. They may also need complex communication systems to transport the experimental data to external computing resources for analysis. These systems may bridge electronic and optical links, cross clock domains, handle compression and error detection, and act as translators between protocols. The communication links may also be radiation-hardened to compensate for higher levels of background radiation from the accelerator [42].
4. Particle Triggering — The quantity of data recorded by most HEP systems exceeds the capabilities of general-purpose computing systems. Dedicated signal processing hardware is required for low-resolution processing of the raw data. This generally amounts to a pattern-recognition problem; the sensor data must be analyzed to identify and isolate the particles produced in collisions. The hardware must then decide, for each set of data, whether or not it will be recorded for further analysis. This process is called 'triggering'.

Although the topics of control, signal conditioning, and high-speed communication present substantial challenges, their basic characteristics are not significantly different in HEP applications than in other applications. For a thorough discussion of signal processing systems for high-speed communication links and control, see
Chapter 1 and Chapter 4. The main focus of this chapter is instead on triggering, the low-resolution processing that analyzes the raw sensor data and identifies and sorts particles produced by the collisions, as this is a unique challenge of HEP applications. The particle recognition problem is compounded by the nature of HEP systems, whose large scale, high cost, and specialized function make each system unique. There are only a handful of HEP systems operational at a given time, and each is generally designed for different types of experiments. This means that new systems must typically be designed from scratch and customized to meet the special needs of the new experiments.

Beyond the technical challenges inherent in measuring and identifying short-lived sub-atomic particles, the creation of systems for HEP applications also offers significant challenges to the design process. The giant size of the systems, combined with the diversity of knowledge needed to create systems for advanced theoretical physics, necessitates large, distributed development teams. Ensuring efficient collaboration and producing reliable, maintainable, cost-effective, and high-quality designs is a major challenge that requires the use of new tools and techniques. Throughout this chapter, we complement our discussion of high-level HEP principles with concrete examples from the CMS experiment for the LHC at the CERN research center. The CMS experiment represents a real-world application at the forefront of HEP technology, and it illustrates some of the greatest challenges of digital signal processing systems for high-energy physics.

In this chapter, we characterize the key features and most significant challenges of designing DSP systems for HEP. This includes a description of modern HEP systems, the important problems these systems are attempting to solve, and the role of DSP in achieving these goals. In Section 2, we introduce the CMS experiment and describe some of the important features of its data acquisition and triggering systems. In Section 3, we outline the key technical and developmental characteristics common to HEP applications. In Section 4, we discuss the technical challenges of designing HEP systems. In Section 5, we cover the tools, techniques, and technology needed to efficiently manage the development process. Finally, in Section 6, we provide tangible examples of signal processing algorithms from the ongoing project to upgrade the CMS experiment.
2 The Compact Muon Solenoid Experiment

High-energy physics experiments are performed throughout the world, from Stanford's SLAC National Accelerator Laboratory to the Super-K observatory in Japan. In the future, HEP applications may extend beyond the Earth to space-based observatories, such as the proposed Joint Dark Energy Mission [37]. The largest and most prominent HEP application at the time of this writing is CERN's Large Hadron Collider (LHC), near Geneva, Switzerland. The LHC is the highest-energy particle accelerator ever created and among the largest and most massive man-made machines ever built. The ring-shaped main accelerator track is over 27 km in circumference. It
is capable of accelerating protons to over 99.999999% of the speed of light and generating collision energies exceeding 10 Tera-electron-Volts (TeV). The LHC hosts a number of major experiments, each of which consists of collision chambers with thousands of sensors and associated hardware for processing the sensor output data.
2.1 Physics Research at the CMS Experiment

The Compact Muon Solenoid detector is one of the main experiments at the LHC. CMS is considered a general-purpose experiment because it is used to search for a variety of different high-energy phenomena and may be adapted for new experimental searches in the future. Earlier accelerators, such as the Tevatron, have shared some of these goals. However, the higher-energy collisions produced by the LHC allow its experiments to investigate physics phenomena over a much larger energy range [38]. CMS is one of two complementary experiments, the other being ATLAS. Because CMS and ATLAS share the same experimental goals, many of the technical characteristics of CMS are also found in ATLAS.

The Higgs Boson

The Standard Model of particle physics is the closest scientists have come to a 'holy grail' of science: the creation of a Grand Unified Theory that describes all of the particles and forces in the Universe using a single set of consistent rules. The Standard Model has been in development since the 1960s, and is currently an accepted theory that is taught to students of physics. The Standard Model accounts for all fundamental forces except the gravitational force. Moreover, 23 of the 24 fundamental particles predicted by the Standard Model have been observed by scientists. The only particle that has been predicted, but not yet observed, is the Higgs Boson. The Higgs Boson is a very significant particle in the Standard Model, as it is the sole particle responsible for giving mass to matter, through a phenomenon known as the Higgs Field. Previous searches for the Higgs at lower energy levels have not been successful, but scientists believe that the TeV-scale energy at the LHC is high enough that they should be able to find the Higgs if it exists [28]. Evidence of the Higgs would solidify the Standard Model. Perhaps more interestingly, if scientists fail to find any evidence of the Higgs, it would suggest a significant flaw in the long-accepted Standard Model and stimulate scientists to explore alternate theories.

Supersymmetry

Supersymmetry (SuSy) represents one of the most actively-studied potential extensions to the Standard Model [9]. SuSy theorizes that every known particle also has a 'superpartner' — that is, a corresponding particle with complementary properties, but a much larger mass. The existence of superpartners could provide solutions to several major unsolved problems in physics. For example, superpartners could be
the components of dark matter, and explain why the Universe appears to behave as if it has significantly more matter in it than scientists have observed. Another benefit of discovering superpartners is that SuSy would synthesize the strong nuclear force and electro-weak force into a single unified force. SuSy is also a component of several other modern theories, such as String Theory. There is a possibility that the high-energy collisions at the CMS experiment will generate evidence of the existence of superpartners.

Extra Dimensions

Some theorists have argued that some of the apparent unanswered problems in physics, such as dark matter [44], can be resolved through the existence of additional dimensions beyond the four space-time dimensions that we perceive. The existence of extra dimensions is also an essential facet of String Theory. A joint experiment between the LHC's CMS and TOTEM systems could potentially detect the effects of gravity from a 5th dimension [39].

Quark Compositeness

On several occasions throughout the history of particle physics, physicists have believed they had found the most basic building blocks of the Universe, only to discover that those blocks were in fact made up of even smaller and more basic pieces. Scientists found that atoms were made up of protons, neutrons, and electrons; later they discovered protons were made up of quarks, which are currently considered to be elementary particles. Some physicists have hypothesized that quarks may themselves be composite structures made up of even smaller particles called 'preons' [26]. Part of the CMS experiment includes measurements that could be used to find evidence of preons.
2.2 Sensors and Triggering Systems

To enable the search for new phenomena, the cylindrical CMS particle detector consists of several layers of highly complex sensors. In total, the layers are nearly seven meters thick, and each layer takes unique measurements designed to identify different classes of particles. The innermost section of the cylinder is a silicon-based particle tracker that is capable of measuring the deflection of particles subjected to a strong magnetic field. It is used to calculate the charge-to-mass ratio of the particles produced in collisions and reconstruct their trajectories. Beyond the tracker are the Electromagnetic and Hadron Calorimeter layers. These sensors absorb electromagnetic particles and hadrons — two types of fundamental particles — and measure their energy. The calorimeters are surrounded by many layers of muon absorption chambers, designed to capture muons — a class of weakly interacting particles —
that pass through the calorimeters [18]. An image of the CMS detector as it was being assembled is shown in Fig. 2.
Fig. 2 A cross section of the CMS sensors and readout electronics during its assembly. A construction worker standing at the bottom of the chamber demonstrates the scale of the CMS experiment. Image Copyright CMS Experiment at CERN.
Handling the data produced by these sensors presents significant challenges. The tracker monitors particle trajectories by detecting particles passing over micrometer-scale semiconductor strips. If the tracker's silicon detectors are viewed as analogous to a camera sensor, the CMS tracker has a resolution of 66 megapixels, with each pixel representing a detector with its own data channel. Unlike a typical digital camera, which can take at most a few shots per second, the CMS sensors take a new snapshot of particle collision data at a rate of up to 40 million times per second. The tracker, however, provides only part of the sensor data that must be processed; the two calorimeters each provide an additional 4096 data channels, with each channel producing an 8-bit energy measurement. In total, the accelerator is capable of producing collisions at a peak instantaneous event rate of about one billion collision events per second. Each event produces approximately 1 Megabyte of data from the tracker and calorimeters, resulting in a peak data rate of 1 Petabyte per second (PB/s). To put this data rate in perspective, it would take over 10 million modern hard disk drives working in parallel to achieve a high enough write speed to store all of this data. Even if all of the data could be written to archival media, there would still be barriers preventing analysis by general-purpose computers: even if a computing cluster with 1-Gigabit-per-second (Gb/s) Ethernet links were available, it would require tens of thousands of links to transport the data to the computers.
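A quick back-of-the-envelope check of these figures; the 100 MB/s sustained write speed is our assumption for a circa-2010 hard disk:

events_per_s = 1e9               # peak collision event rate quoted above
bytes_per_event = 1e6            # roughly 1 MB of tracker + calorimeter data
raw_rate = events_per_s * bytes_per_event
print(raw_rate / 1e15)           # 1.0, i.e. 1 PB/s

hdd_write = 100e6                # assumed sustained write speed per drive
print(raw_rate / hdd_write)      # 1e7: ten million drives writing in parallel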
Clearly the ultra-high data rates produced by the CMS detector make the prospect of archiving the data for offline processing infeasible using today’s technology. In order to make the problem tractable, the data rate from the experiment’s sensors must be significantly reduced before the sensor data is stored. At first, it might seem appealing to simply slow the rate of collisions. After all, millions of collisions each second may seem unnecessarily high. However, there are two key concepts about HEP applications that make such high collision frequencies necessary. First, if the particles and phenomena searched for in the CMS experiment actually exist, they are extremely short-lived and difficult to observe. It may require a tremendous number of collisions to first produce and then successfully observe the phenomena. Second, even with some of the most advanced controls and technology in the world, the task of colliding two protons head-on at nearly the speed of light is incredibly difficult. The vast majority of collisions occurring in the experiment are low-energy glancing collisions that are unlikely to produce interesting data. Indeed, one of the common goals of HEP system designers is to maximize the accelerator’s ‘luminosity’, a measure of the efficiency with which collision events occur each time the accelerator causes its particle beams to cross. Beams are not continuously distributed streams of particles, but are in fact made up of a series of spaced particle clusters. The rate at which these clusters impact the clusters in the opposing beam is referred to as the ‘bunch crossing rate.’ Together, the luminosity and bunch crossing rate determine the collision event rate, and consequently the processing throughput needed to analyze the events that are produced. The ideal solution to this problem is to create a signal processing system that can specifically identify and isolate the data produced by scientifically interesting collisions and discard data from low-energy collisions. The system then relays only the potentially interesting data to archival storage. In HEP systems, this process is called ‘triggering’. CMS accomplishes its triggering function using a complex signal processing and computational system dubbed the Triggering and Data Acquisition Systems, or TriDAS [22]. TriDAS is a two-level, hybrid system of signal processing hardware and physics-analysis software. The frontend Level-1 Trigger is custom, dedicated hardware that reduces the peak sensor data rate from 1 PB/s to 75 Gigabytes per second (GB/s). The backend High-Level Trigger is a software application running on a computing cluster that further reduces the data rate from 75 GB/s to 100 Megabytes per second (MB/s). At 100 MB/s the data rate is low enough that it can be transferred to archival storage and analyzed offline with sophisticated algorithms on high-performance workstations. A diagram of the CMS TriDAS is shown in Fig. 3. To decide which data to keep versus discard, triggering systems must identify discrete particles and particle groups created by the collisions. A perfect system would record data from only those collisions that exhibit the physical phenomena that are being studied in the experiment. However, due to practical constraints on how quickly the data can be analyzed and how much data can be economically stored, it is not feasible to create perfect triggering systems for experiments with high collision rates. 
In real systems, designers manage hardware cost by placing a threshold on the number of particles or particle groups that will be retained from any one event.
Fig. 3 The CMS Trigger and Data Acquisition System. The Level-1 Trigger controls which events are sent from the CMS detector to CERN’s farm of computers and which events are instead discarded. Image Copyright CMS Experiment at CERN.
The system sorts particles and groups by their energy levels. Data for the top candidates (up to the threshold) is retained; the remaining data is discarded. The CMS Level-1 Trigger is actually made up of multiple specialized hardware systems designed to identify specific particle types, such as photons, electrons, taus, muons, and high-energy particle groups called 'jets'. Once particles are identified and classified, each type is separately sorted and the highest-energy members of each class are sent to the High-Level Trigger software for further processing (a small sketch of this selection step follows below).

The triggers for the LHC system have some interesting features that differentiate them from those of previous accelerators. The LHC's luminosity and bunch crossing rate are both very high. At peak rate, there is a bunch crossing every 25 ns. Each bunch crossing generates about 20-30 collision events. This translates to a collision event rate approaching 1 billion events per second. In comparison, the Tevatron's D0 experiment performed bunch crossings every 3 µs in 1989 [1], and the Large Electron Positron-II's L3 experiment produced bunch crossings every 22 µs in 1998 [17]. Not only does the LHC's high collision rate produce much higher data bandwidth, the length of processing time between groups of data is on the order of one thousand times less than that of previous large particle accelerators.

The triggering systems for each of these accelerators do have some features in common. They each use multi-level triggering systems where the first level(s) are composed of high-performance dedicated DSP hardware and the later level(s) are composed of software running on general-purpose hardware. As time has passed and hardware has become faster, new technologies have been integrated into the
hardware levels of more recent system designs. For decades, early triggers relied primarily on fixed-function Application-Specific Integrated Circuits (ASICs), discrete memories, and occasionally custom DSP processors [54] [36] [32] [4]. Around the mid-1990s, architects began incorporating programmable logic into their triggers. Today, ASICs still play an important role in the most performance-critical functions, but more flexible Programmable Logic Devices (PLDs), such as Field-Programmable Gate Arrays (FPGAs), have become pervasive in trigger logic [13] [25] [29] [2] [51].
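In software terms, the sort-and-threshold selection described earlier reduces to a few lines; the particle classes, energies and threshold below are invented for illustration and are not CMS parameters:

def level1_select(candidates, threshold=4):
    # per-class energy sort with a fixed candidate threshold
    by_class = {}
    for cls, energy in candidates:
        by_class.setdefault(cls, []).append(energy)
    kept = {}
    for cls, energies in by_class.items():
        energies.sort(reverse=True)
        kept[cls] = energies[:threshold]   # top candidates kept, rest discarded
    return kept

hits = [("electron", 52.1), ("jet", 210.0), ("jet", 95.5), ("muon", 31.2),
        ("jet", 64.0), ("jet", 44.7), ("jet", 12.3), ("electron", 18.9)]
print(level1_select(hits))   # only the four highest-energy jets survive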
2.3 The Super Large Hadron Collider Upgrade

To understand many of the motivations behind design decisions in HEP systems, it is important to evaluate them in the context of the projects' massive size and cost. While it is difficult to get an exact number, publicly released figures put the cost of designing and building the Large Hadron Collider at more than $5 billion as of 2009. The time from project approval to project completion spanned over 10 years, despite the reuse of facilities from the LHC's predecessor, the Large Electron-Positron Collider. These high costs and long design times provide a firm motivation to design the LHC to be as flexible as possible for future needs, to ensure that the useful life of the experiment can be extended even after the initial experiments have run their course. The cost and time of constructing a new accelerator, coupled with difficult economic conditions that have contributed to funding cuts at many scientific institutions, have provided strong motivation to adopt flexible systems that can be upgraded and kept in use as long as possible. For example, Fermilab's Tevatron, the highest-energy particle accelerator prior to the LHC, stayed in operation for over 20 years and had several major upgrades to extend its lifetime.

To this end, the Level-1 Trigger hardware of the LHC's CMS system is designed to strike a balance between the high performance of fully dedicated custom hardware and the flexibility needed to ensure that the systems can adapt to changes in experimental parameters and research goals. Triggering decision thresholds are fully programmable. Particle identification algorithms are currently implemented in RAM-based look-up tables to allow adjustment or even full replacement of the algorithms. For most of the history of trigger design, RAM-based look-up tables were the primary means of incorporating reprogrammability into triggering algorithms; however, the potential for using them was limited by their high relative costs. Modern systems, like the CMS Level-1 Trigger, increase flexibility further by replacing many fixed-function ASICs with reprogrammable FPGAs [22]. The adoption of FPGA-based platforms is a broad trend in the development of HEP systems and one of the key characteristics of the design of modern systems [41] [33] [34]. The specific impacts of the use of FPGAs in HEP systems are discussed several times throughout this chapter. For a more general discussion of FPGA-based DSP platforms, see Chapter 14.
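The RAM-based look-up-table style of implementation can be illustrated as follows. The isolation cut encoded here (HCAL deposit below 20% of the ECAL deposit) is a made-up rule standing in for a real, tunable identification algorithm; reprogramming the algorithm then amounts to rewriting the RAM contents, with no change to the surrounding logic:

def build_isolation_lut(bits=4):
    # precompute pass/fail for every quantized (ECAL, HCAL) energy pair
    size = 1 << bits
    return [[int(h < 0.2 * e) for h in range(size)] for e in range(size)]

lut = build_isolation_lut()
ecal, hcal = 12, 2           # quantized energy deposits of one candidate
passed = lut[ecal][hcal]     # a single memory read replaces the arithmetic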
Although the initial CMS Level-1 Trigger used some programmable hardware, at the time it was designed, performance considerations limited many components to fixed-function electronics [48]. The flexibility provided by the limited amount of programmable hardware should allow for some modifications to the system over the initial LHC's planned lifetime. However, a significant proposed LHC upgrade, known as the Super Large Hadron Collider (SLHC), requires a redesign of the triggering systems for the experiments. This redesign will replace most of the remaining ASICs with reconfigurable hardware, increasing system flexibility. The purpose of the SLHC upgrade is to boost the LHC's luminosity by an order of magnitude, with a corresponding increase in collision rate. This could allow the accelerator to continue to make new discoveries years after the original experiments have run their course [30]. Naturally, increasing the collision event rate places significant strain on the triggering system, since this system is responsible for performing the first-pass analysis of the event data. The existing Level-1 Trigger hardware and algorithms are insufficient to handle the SLHC's increased luminosity, and must thus be upgraded [47] [45]. The design problems of this upgrade provide examples of the challenges of modern system design for HEP applications.
3 Characteristics of High-Energy Physics Applications Although every HEP system is different, many of them share certain high-level technical and design-oriented characteristics. Understanding the unique features of HEP applications is key to designing signal processing systems for these applications.
3.1 Technical Characteristics Many of the technical challenges of HEP systems derive from their extreme scale. This scale gives rise to several important technical characteristics of HEP applications, discussed here using concrete examples from the CMS Trigger introduced in Section 2.2. Number of Devices The thousands of sensor data channels in HEP experiments require very large quantities of signal processing devices. The CMS Level-1 Trigger, for example, requires thousands of FPGAs and ASICs for its particle identification, isolation, and ranking functions, which in turn represent a great expense for construction, installation, and fault testing. Furthermore, the large device count makes it difficult to ensure reliability due to the numerous potential points of failure.
Fig. 4 Faster sampling rates and larger event sizes have led to increasing data rates in HEP experiments. The LHC experiments ALICE, ATLAS, CMS, and LHCb produce exceptionally high rates (Note that both axes use a log scale and data rate increases radially from the origin).
Data Bandwidth Colliding billions of particles per second and recording the resulting data requires tremendous bandwidth to transport sensor data to the processing systems. It also requires large amounts of temporary storage to hold collision data while waiting for triggering decisions. As new systems are created with higher luminosity and increasingly sensitive detectors, data rates produced by HEP applications have rapidly increased [50], as shown in Fig. 4. The Tevatron’s CDF-II experiment, for example, produced a raw data rate of up to 400 GB/s [21]. The CMS detector, on the other hand, can produce a peak data rate approaching 1 PB/s. This data is held in temporary buffers until the trigger system determines what portion to send to archival storage, which is limited to a rate of 100 Mb/s. The huge amount of raw data is compressed, but still results in an input data rate of hundreds of GB/s to the CMS Level-1 Trigger. The SLHC upgrade will further increase that data rate by both adding new sensors and increasing the luminosity (event rate) [47]. Interconnection Systems that make global decisions based on thousands of sensors must necessarily feature a large amount of interconnection. Because the triggering algorithm needs to sort and rank particles, it must obtain global information from across all of the Trigger subsystems. Each of these subsystems must connect to the global sorting system, requiring high fan-in. This global processing, combined with a need for internal data sharing within subsystems, creates a tremendous amount of interconnection within the triggering system. This interconnection adds to the cost and complexity of the system and can make testing and modification more difficult.
End-to-End Latency The extremely high amounts of data produced by HEP sensors can quickly overwhelm the storage capabilities of the temporary data buffers. The LHC, for example, performs a new group of particle collisions every 25 ns at its peak rate, producing up to 1 Petabyte of data per second. In a general HEP system, the detectors that produce this data can be grouped into two categories: fast-readout and slow-readout. Fast-readout detectors can be fully read for each bunch crossing. Data from the fast-readout detectors is used in the first-level triggering decisions. Slow-readout detectors can only transfer a portion of their total data for each bunch crossing. Data from these detectors is stored in local buffers until a triggering decision is made. The first-level triggering decision is used to decide which portion of data from the slow detectors should be acquired and which should be discarded. Thus, triggering systems must make their decision as to which data to discard within a very short window of time, both to prevent the size of the data buffers from becoming prohibitively costly and to select the data to read from the slow detectors before it is overwritten by data from the next bunch crossing. This means that these systems must be able to perform all of their processing in a matter of microseconds to milliseconds. The CMS Level-1 Trigger, for example, must provide triggering decisions within 3.2 µs of the collision event. Network transport delays occupy a significant portion of this time, leaving a computation window of less than 1 µs. Thus, the Level-1 Trigger must accept new data every 25 ns and must identify, isolate, and rank the particles contained in that data within 1 µs of receiving it. These tight latency requirements, combined with the high bandwidth requirements, are the primary technical constraints that require the Level-1 Trigger processing to be done in hardware rather than as software on commodity processors. Because the processing requirements of triggering systems are motivated by detector read-out constraints, the low-latency and high-bandwidth challenges are interrelated, and both are directly affected by the rate of collision events being produced in the experiments. The collision rate is by no means the same across particle accelerators; it is driven by the type of accelerator and the goals of the experiments. Experiments that produce collisions at very low rates — those studying neutrino interactions, for example — typically have relaxed requirements for their triggering systems [3]. However, the largest, most complex, and most expensive accelerators that have been built are proton-antiproton, electron-positron, or heavy-ion colliders, such as the LHC, the Tevatron, the Large Electron-Positron Collider, and the Relativistic Heavy Ion Collider. These accelerators have historically required triggering systems with low-latency and high-bandwidth requirements that pushed the limits of the technology available at the time of their design [41] [19] [2]. Looking toward the future, the technical requirements of the LHC upgrades and post-LHC accelerators are likely to pose even more significant challenges. Historical trends show that particle accelerator collision energies have been increasing at an exponential rate [56], as seen in Fig. 5. Higher energy collisions may correlate to even shorter-lived particles and higher rates of interesting collision events.
These higher collision rates will likely be necessary, as even with extremely high collision rates, some high-energy phenomena still occur very rarely. For example, despite producing one billion collisions per second at the LHC, it is still predicted that a Higgs boson, if it does exist, will only be produced at a rate of 0.1 Hz [46]. Higher collision rates will put more pressure on the detector’s local buffers, making it unlikely that advances in memory density will eliminate the need for low-latency triggering systems. Overcoming the technical signal processing challenges may be one of the key aspects of making new accelerators feasible.
Fig. 5 The energy of the largest particle accelerators has shown a trend of exponential increase with time.
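Returning to the latency discussion above, a back-of-envelope sketch shows how the trigger deadline translates directly into on-detector buffer depth. The 25 ns bunch-crossing interval and the 3.2 µs Level-1 deadline are taken from the text; the per-channel payload width is an illustrative assumption, not a CMS specification.

```python
# Back-of-envelope sizing of the on-detector pipeline buffers.
# Bunch-crossing interval and Level-1 latency are taken from the text;
# the per-channel payload width is an illustrative assumption.

BUNCH_CROSSING_INTERVAL_S = 25e-9   # new collisions every 25 ns
LEVEL1_LATENCY_S = 3.2e-6           # decision deadline: 3.2 microseconds

# Every crossing still "in flight" while the trigger deliberates
# must be buffered on the detector.
pipeline_depth = round(LEVEL1_LATENCY_S / BUNCH_CROSSING_INTERVAL_S)
print(f"pipeline depth: {pipeline_depth} bunch crossings")   # 128

BITS_PER_CROSSING = 32              # assumed payload per channel per crossing
bits_per_channel = pipeline_depth * BITS_PER_CROSSING
print(f"buffer per channel: {bits_per_channel} bits")        # 4096
# Modest per channel, but multiplied across millions of channels this
# buffering dominates front-end cost, which is why the latency budget
# is so strict.
```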
3.2 Characteristics of the Development Process It may be tempting to focus exclusively on the technical specifications of HEP applications as the primary source of complexity in HEP system design. However, it is important not to overlook the challenges in the development process itself, which may also be a significant factor in the development time, cost, quality, and maintainability of a system. For example, during the initial design of the CMS experiment, the development and testing of firmware for programmable logic devices proved to be a major bottleneck. These development challenges stem from several distinguishing features of the HEP system design process.
Wide-Scale Collaboration and Disparate Design Styles Most HEP systems are too large to be created by a single institution. For example, the design of the CMS Trigger system was the product of the CMS Collaboration, made up of over 3000 scientists and engineers from nearly 200 universities spread across 38 countries. Although this collaboration supplies the great breadth and diversity of knowledge needed for HEP engineering projects, managing large groups across multiple languages and time zones is a formidable task. The existence of numerous, separate design teams may also give rise to multiple design approaches, some of which may not be compatible. One of the consequences of the diversity of the collaborative development teams and the large scale of the system design is that the CMS Trigger subsystems have been designed by many groups from different organizations and different countries. Because the firmware designers come from a variety of backgrounds, it is natural that they produce a variety of disparate design styles using their favored hardware design languages and methodologies. A famous example of the problems that can arise from such differences is the loss of the Mars Climate Orbiter probe in 1999 [40]. One group of designers assumed that all measurements would be made in Imperial units whereas another group assumed they would be made in metric units. Because of these differing assumptions, the probe's thrusters fired incorrectly, and it was destroyed in the Martian atmosphere. Frequent Adjustments and Alterations Despite significant simulation and theoretical modeling, the experimental nature of HEP and the unprecedented energy of the LHC mean that it is inevitable that some aspects of the system and the algorithms it implements will need to be adjusted after the system comes online and starts collecting real collision data. It is not cost-effective to replace hundreds of devices each time an algorithm is changed, so designers must anticipate which portions of the system may require adjustment after deployment and design those sections with reconfigurable logic to provide the needed flexibility. Periodic Upgrades The economics of particle accelerators make it likely that it may be decades before the LHC is replaced. During this time, the LHC will be subject to a periodic series of large technology upgrades to allow for new experiments and the expansion of existing experiments. Other particle accelerators have followed a similar pattern of large-scale upgrades to extend their useful lifetimes. The Tevatron, for example, originally finished construction in the early 1980's, but continued receiving major upgrades until the turn of the century, allowing it to continue operation through 2010. The proposed SLHC upgrade is a 10-year project, and will consist of a series of several incremental upgrades to the existing LHC systems.
Long Design Times Each new particle accelerator is designed to be larger and higher-energy than its predecessors to facilitate new experiments that were not possible with previous accelerators. The design and construction of a new particle accelerator is therefore a monumental project. The Large Electron-Positron Collider took 13 years from the beginning of its design in 1976 to the start of experiments in 1989 [24]. The LHC project was approved in 1995, and it was not until March 2010 that the LHC’s physics program officially began.
3.3 Evolving Specifications The LHC had a 10-year initial development cycle, and it is predicted that its eventual successors may require design times measured in decades. Because of the high costs of constructing new accelerators, scientists try to prolong the lifetimes of existing accelerators through long-term upgrade projects. These upgrades may keep the development process active for many years after the initial construction. Long design windows and frequent upgrades mean that designers must grapple with continuously evolving sets of specifications. Major design specifications, including the hardware platform and interface, may undergo several changes during the initial design period, and often these changes may be dictated by system-wide decisions beyond the subsystem designers’ control. Hardware engineers must take care to create scalable, adaptable designs to deal with potential changes. However, this can create conflicting design goals. On one hand, designers should try to exploit specific hardware features as much as possible to obtain the best performance, meet the strict requirements of the HEP project, and reduce the cost of their systems. On the other hand, any designs that are tied too closely to specific hardware may require significant and time-consuming redesigns if design specifications or the hardware platforms change. Because of the extremely long development cycles, engineers may even wish to plan ahead and create designs that surpass today’s technology with the expectation that by the time the system is ready to be built, technology will have advanced sufficiently to implement the designer’s vision.
3.4 Functional Verification and Testing As the scale and complexity of hardware designs increase, the difficulty of verifying correctness and diagnosing bugs tends to increase at a super-linear rate. In addition to the large scale of HEP systems, there are a number of other facets of the development process that make the testing process formidable. In the initial design stage, algorithms implemented in hardware must be verified to be equivalent to the original physics algorithms that were developed and tested in
high-level software simulators. Providing equivalent testing is often a complex process. Although the software and hardware are essentially implementing the same algorithm, the form of the computation differs. Software is typically easiest to write in a sequential fashion whereas high-performance hardware systems are created to be explicitly parallel. The hardware and software simulators may produce similar end results, but the difference between parallel and sequential simulation can cause mismatches at intermediate points of the computation. This means that it is difficult to share and reuse the same modular testing infrastructure between the hardware and software. In the construction stage, the actual hardware must be tested to ensure that it operates identically to the pre-verified simulated hardware design. If there are bugs or performance glitches, they must be detected and properly diagnosed so that they can be repaired. Because triggering systems require global decisions across thousands of sensors, there is a significant amount of interconnection between subsystems. Large amounts of interconnect can make it difficult to isolate bugs to individual subsystems. In the design stage, a hardware engineer may need to simulate large subsections of the overall design and track bugs across numerous subsystems to find their source. The bug tracking process is slowed by the fact that individual subsystems are likely designed by different teams, using different design styles and potentially different design languages. These subsystems may use different hierarchical structures and require different internal testing interfaces. Despite rigorous testing, it is inevitable that any complex system will exhibit bugs and performance issues even after the first assembly is complete. Particularly in a complex system such as the CMS Trigger, which requires high-speed communications and tight latency deadlines, it is difficult to ensure proper timing and synchronization between subsystems.
3.5 Unit Modification In addition to bug fixes, regular modifications to the experimental physics algorithms can be expected. As physicists begin collecting real data, they will be able to better tune their algorithms to the characteristics of the experiment. These modifications may require frequent changes to the Trigger firmware. To accommodate the need for algorithmic changes, the original CMS L1 Trigger design used re-programmable FPGAs wherever possible, provided they could meet the strict performance requirements. The CMS L1 Trigger upgrade extends this approach even further by replacing almost all fixed-function ASIC hardware with modern high-performance FPGAs. However, while FPGAs may allow for rapid, low-cost modifications to the physics algorithms, there is no guarantee that the original hardware timing will be maintained after the changes. Changes in timing could affect other parts of the system. For example, FPGA firmware modifications could increase device latency. If the latency change is severe enough, the system may fail to meet its timing requirements. Other parts of the system may then require
modification to once again meet timing constraints. Localized changes thus have a potential system-wide impact. Moreover, changes to the algorithms implemented on an FPGA may also affect the resource utilization of each device. If, in initial designs, hardware designers try to maximize the resource utilization of each FPGA to minimize the total number of required devices (and consequently the cost and complexity of the system), they may find that there are not enough free resources remaining to implement needed algorithmic changes. On the other hand, if designers leave the FPGAs under-utilized to reserve space for future modifications, they may end up needing an excessive number of devices and needlessly increasing the cost and complexity of the system.
4 Techniques for Managing Design Challenges The challenges of system design for HEP applications can be managed by adopting strategies and techniques geared towards the unique properties of such applications. In this section, we outline several techniques that may be applied to improve the development process.
4.1 Managing Hardware Cost and Complexity Cost and complexity concerns factor into all of the major design challenges mentioned above. However, one aspect that is particularly worth focusing on is the choice of technology platform, as this choice has far-reaching effects on the system cost. The technical features of HEP, particularly the very high data rates and low latencies required for the triggering systems, make it infeasible to implement these systems purely in software running on commodity computers. This is somewhat unfortunate; the high degree of flexibility provided by software implementations of algorithms, in conjunction with the fact that physics algorithms are originally developed and simulated in software, would make a software-based system very attractive in terms of designer productivity. On the other end of the spectrum is the use of custom, fixed-function triggering circuits. Such systems have been prominent from the advent of particle accelerators in the 1930s up to the present, with ASICs as their modern incarnation. ASICs provide the highest possible performance, which is often a requirement for the high data rates of particle accelerators. However, designing custom ASICs is very expensive, and circuits designed for HEP applications are necessarily low volume; there are only about 20 active large-scale particle accelerators worldwide at the time of this writing. ASICs are also generally inflexible, and making large changes to ASICs is likely to require replacing the chips after a lengthy and expensive redesign and re-fabrication process. Furthermore, as custom parts, there is potential for significant delays if quality-control problems arise during fabrication.
Occupying a space somewhere between ASICs and general-purpose computers are programmable logic devices, particularly FPGAs. These devices give some of the performance advantage of ASICs while allowing for greater flexibility through hardware-based programmability. In theory, a re-programmable device can have its function modified quickly and frequently by revising and reloading its firmware on-site. This avoids the costly and time-consuming process of designing and fabricating a completely new device. FPGAs also have the advantage of being commodity parts, meaning that the cost of fabrication and device testing is amortized across all customers for the particular device, and not borne solely by the particular HEP system. One of the key trends in HEP systems over recent years stems from the rapid progression of semiconductor technology. FPGAs, although still slower than ASICs, are now fast enough and have enough communication bandwidth to handle most of the HEP-related functions that previously required ASICs. This has led to a trend of converting the high-performance hardware frontends of HEP triggering systems from rigid ASIC-based systems to more flexible FPGA-based systems. Whereas the original CMS Level-1 Trigger was a mixture of ASICs and FPGAs, the Trigger re-design for the SLHC upgrade will eliminate most remaining ASICs in favor of FPGAs. One of the Trigger subsystems to be converted is the Electron ID system (described in Section 5.1), shown in Fig. 6. This system contains over 700 ASICs and thousands of RAM chips that will be converted to FPGAs.
Fig. 6 One of 126 ASIC-based boards used in the CMS Trigger’s Electron ID subsystem. These boards will be replaced with FPGA-based boards in the SLHC upgrade.
Though FPGAs provide more flexibility than ASICs, they are not the solution to every problem. Some algorithms are simply better suited for software than hardware — for example, those that require recursion, traversal of complex data structures, or scheduling of operations with non-deterministic execution times or resource requirements. Therefore, most triggering systems feature both a hardware-based
frontend and a software-based backend. These hybrid systems provide appropriate levels of control and flexibility while still meeting performance constraints. In the past, software-based trigger levels were commonly implemented in DSP processors; however, the trend in modern systems is toward general-purpose processors [29] [25] [16] [41] [23] [43]. This is primarily motivated by the higher degree of expertise needed to create software for DSP processors and the growing complexity of trigger software [8] [7].
4.2 Managing Variable Latency and Variable Resource Use As described in the previous sections, FPGA technology has been widely adopted in the CMS Trigger to allow flexible system modifications and upgrades. However, modifying the algorithms implemented on FPGAs has the potential to change the circuits' latency and resource use. In some cases, this could mean that algorithmic changes either cannot fit on the existing chips or cannot meet timing constraints without lengthy and expensive modifications to the rest of the system. Thus, it is important that FPGA-based designs follow a modular approach, in which functional modules are self-contained and ideally can be modified without impacting other modules. Each of the techniques for controlling the latency and resource use of a design incurs some costs. Therefore, when choosing techniques to use, it is important to identify which portions of the subsystem's functionality are actually likely to change. Some functions, such as the generation and calculation of input primitives, are essential enough to the Trigger's functionality that they can be assumed to be fixed and may be implemented using fast and efficient direct implementations. The original Level-1 Trigger uses high-performance ASICs to perform functions such as generating energy sums from the calorimeter-based sensors and sorting the energy values of the isolated particles. Other functions, such as jet reconstruction (described further in Section 5.2), may require frequent changes to adapt to the performance of the accelerator. This section presents three approaches to minimizing the impact of algorithmic changes on latency and resource use: resource reservation, worst-case analysis, and fixed-resource and fixed-latency structures. A designer may use one or more of these approaches to accommodate future changes to FPGA-based modules without creating system-wide disruptions. Resource Reservation is a fairly passive approach to handling variable latency and resource usage. This technique simply sets aside a portion of the device resources and targets a lower latency limit than the actual limit during the initial design process. The purpose is to provide room for new features or algorithmic changes within the reserved resource and latency bounds. If end-to-end latency is a concern, modules can be designed to be capable of running at a higher-than-expected operating frequency and even to contain 'empty stages' — pipeline stages that do not perform any computations and are reserved for later use. Resource reservation has
several advantages. First, it guards against routing congestion: as the percentage of committed resources on an FPGA increases, so does the likelihood that the addition of new logic will negatively affect the latency of the existing logic. More logic means more routing congestion within the FPGA. Signals that used to be routed efficiently may now use a longer path to reach their destination, increasing the path delay. Typically, FPGA designers may aim for around 80% utilization. In HEP systems, the target utilization may be much lower due to the expected algorithmic changes and demanding performance specifications. For example, the CMS Level-1 Trigger upgrade plans currently mandate no more than 50% device utilization. As a final point, resource reservation does provide some additional degree of flexibility to the design. Whereas the unused resources are originally intended to accommodate necessary updates to the triggering algorithms, it is possible that some of these resources could instead be used to add new features that were not even envisioned in the original design, or to fix bugs that were not detected until system assembly. The resource reservation approach, however, also clearly has negative aspects. Since the Level-1 Trigger is a massively parallel system, leaving large portions of device resources unused results in an increase in the total number of devices required to implement the system. In addition to the increased device cost, this also means more interconnection between devices and greater complexity in the Trigger's architecture — aggravating the complexity problem that plagued the original Trigger design. Furthermore, there is no easy way to determine exactly what percentage of resources should be reserved. Reserving too much adds to the cost and complexity of the system, and reserving too little may cause problems when trying to modify the algorithms. Without reliable predictions of future resource needs, resource reservation may be inefficiently skewed towards being either too conservative or too aggressive. Worst-Case Analysis aims to determine the upper bounds on the resource utilization and latency that a given algorithm may exhibit, for use with resource reservation. By reserving only enough resources to deal with the worst-case permutation of the algorithm, the designer can avoid setting aside too many resources and needlessly increasing the cost and complexity of the system. To understand the technique, consider an algorithm with a fixed number of inputs and outputs. Any such algorithm can be expressed as a table of result values indexed by the input values. In other words, the particular input values are used to look up the answer in the table. This type of computation structure is thus called a look-up table. In many cases, a table such as this is optimized using Boolean algebra to create hardware far more efficient than implementing the complete table. Two different algorithms will have two different tables, each of which may be optimized differently, potentially to a different degree. To explore how algorithmic changes could affect the optimized implementation, one could use the following technique. First, randomly generate a large collection of look-up tables with the needed input and output data sizes. Next, use Boolean reduction and optimization techniques to generate an FPGA-based implementation. Then, synthesize this implementation and record its resource use and latency.
After a large number of random tables have been processed, the data can be formed into a histogram, and statistical analysis can determine the expected
percentage of all possible permutations that are covered by a given resource and latency allocation. This technique provides more information to the designer to help in choosing the amount of resources to reserve; however, it still relies on a certain amount of intuition and probability. The results from randomly generated permutations may contain high-resource or high-latency outliers that represent permutations that would never actually be used in the design. The system designer may wish to restrict the resource allocation to cover less than 100% of the permutations in order to reduce cost, or because the number of inputs makes full coverage infeasible. On the other hand, the future modifications themselves may be outliers that are not captured by this approach. An alternate method is to place further restrictions on the generated permutations, such that they do not represent an arbitrary algorithm that would never be used for that system. For example, if the value of a specific input will always be within a certain range, then any permutations for inputs that fall outside this range can be discarded. Doing so, however, requires high levels of insight into the functions and needs of the physics algorithms. Fixed-Resource and Fixed-Latency Structures are an alternative approach that guarantees the ability to accommodate arbitrary changes to the physics algorithms without unreliable guesses about the amount of resources to reserve and without introducing potential problems due to FPGA routing. The basic concept is simple: build logic structures that, given a specific number of inputs and outputs, can implement any possible algorithm in that structure. As in the worst-case analysis, this approach may use a look-up table to store the output results for all possible input combinations; the inputs index into that table to look up the output. In this case, however, the table is not optimized, and is instead placed as-is into a memory structure. The size and performance of such a table is invariant to its contents, and changes to the algorithms can be made simply by loading new output data into the table. In fact, if the tables are stored in a writable memory, it may be feasible to alter algorithms dynamically as needed, avoiding the costly system down-time introduced by re-synthesis of logic and re-verification. This technique is already used for certain algorithms in the original CMS Level-1 Trigger design, such as the algorithm that performs electron and photon identification. In that case, the system uses discrete RAM modules to store the algorithm's look-up tables. It is important to note that fixed structures such as look-up tables are not appropriate in every situation. The number of entries in such tables is 2^k, where k is the number of input bits. Each entry is as wide as the number of output bits. If a specific physics processing algorithm has six 48-bit inputs and outputs a 1-bit value, the amount of memory needed for that table would exceed the estimated number of atoms in the observable Universe by several orders of magnitude. Even for massive scientific instrumentation, this algorithm would not be feasible to implement as a complete look-up table. Table compression or pre-hashing of inputs can reduce the cost of such tables, provided that this does not have a detrimental impact on the accuracy of the algorithm. However, in most cases, directly implementing the optimized logic for an algorithm is significantly less expensive than using a look-up table structure.
Such structures should be reserved for algorithms that may be subject to frequent, extensive, or difficult-to-anticipate changes, or in cases when latency invariance is critical.
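The feasibility boundary for complete look-up tables can be checked with simple arithmetic. The sketch below reproduces the 2^k sizing argument for the six-48-bit-input example above and contrasts it with a small, reloadable table; the small table's 16-bit width and its contents are hypothetical.

```python
# Sizing check for complete look-up-table implementations: a table for
# k input bits has 2**k entries, each as wide as the output.

ATOMS_IN_OBSERVABLE_UNIVERSE = 10**80   # common order-of-magnitude estimate

# Infeasible case from the text: six 48-bit inputs, 1-bit output.
k = 6 * 48
entries = 2**k                          # 2**288, roughly 5e86 entries
print(entries > ATOMS_IN_OBSERVABLE_UNIVERSE)   # True, by several orders

# Feasible case: a small table held in writable memory. The algorithm
# *is* the table contents, so it can be replaced at run time without
# touching the hardware. The 16-bit width and criterion are hypothetical.
k_small = 16
table = [0] * (2**k_small)              # 64K entries, fixed size and latency

def lut_eval(x: int) -> int:
    """Fixed-latency, fixed-resource evaluation: one memory read."""
    return table[x]

# 'Modifying the algorithm' is just reloading the table:
for x in range(2**k_small):
    table[x] = 1 if (x & 0xFF) > 0x80 else 0    # hypothetical new criterion
```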
4.3 Scalable and Modular Designs A consequence of the lengthy design periods and large number of groups collaborating on HEP projects is that the specific hardware platforms used to implement designs may change several times over the course of the development process, or may not be concretely specified until later stages of the process. These changes may result from budget-related issues or simply the advance of technology over the course of the design period. When designing ASIC-based modules, this may mean a change between different fabrication process technologies, such as 90 nm versus 65 nm or silicon-based versus gallium arsenide-based. For FPGA-based modules, it may mean changing between different FPGA vendors, chip families, chip sizes, or speed grades. By using a few basic techniques and tools, a designer can create scalable designs that can be quickly and efficiently adapted to device changes. First, designs should aim for internal modularity, dividing the basic pieces of functionality into isolated components that could potentially be moved to separate chips without loss of correctness. This type of modularity is particularly important when changes in device technology result in a different ratio between the bandwidth and logic resources available on the chip. Due to the very high data rates of HEP applications, the partitioning of the sensor grid across hardware devices is primarily based on the off-chip bandwidth available to each device. If a particular device offers a high ratio of logic to bandwidth, the designer can pack many different processing modules onto the same device. If the design parameters are revised to use a different device that features a lower ratio of logic to bandwidth, the number of instances of the processing modules may need to be reduced, or some of the modules may even need to be moved to other chips. Maintaining modularity in the design affords these options. Second, modules should be created in a parameterized fashion. Hardware design languages (HDLs), such as Verilog and VHDL, support parameterized hardware descriptions. These allow the designer to use variable parameters to specify the bit-width of signals and support parameter-based auto-generation to easily vary the number of times a module is replicated on the same chip. Using parameterized designs and auto-generation of design units is valuable because it allows much more rapid adaptation to different-sized chips. Changing the number of module instances by manually instantiating the new quantity needed for a system with thousands of sensors is not only very time-consuming and tedious, but also much more error-prone. Fully parameterized designs also afford the ability to explore the design space and rapidly examine the hardware efficiency using a variety of different chips, each of which may have a different resource limit. This can help the designer make intelligent decisions about the potential performance of the system on future devices as well as provide concrete data to help guide the device selection process.
When attempting to convince a project manager to adopt a particular device, the argument becomes much more potent if one can demonstrate that a change to that device would result in a halving of the total number of chips needed to implement the design.
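A minimal sketch of this kind of design-space exploration appears below: given per-channel bandwidth and logic requirements, it computes how many channels fit on each candidate device and how many chips the full sensor grid would need. All device names and resource figures are hypothetical placeholders, not vendor data.

```python
# Sketch of bandwidth-driven device-count exploration. Channels per chip
# are limited by whichever runs out first: off-chip bandwidth or logic.
# All device names and figures below are hypothetical placeholders.
import math

SENSOR_CHANNELS = 4096        # size of one sensor grid
GBPS_PER_CHANNEL = 0.5        # assumed input rate per channel
LUTS_PER_CHANNEL = 1200       # assumed logic per processing module

candidate_devices = {
    # name: (off-chip bandwidth in Gb/s, logic resources in LUTs)
    "device_A": (200.0, 300_000),
    "device_B": (400.0, 550_000),
}

for name, (bw_gbps, luts) in candidate_devices.items():
    by_bandwidth = int(bw_gbps // GBPS_PER_CHANNEL)
    by_logic = int(luts // LUTS_PER_CHANNEL)
    per_chip = min(by_bandwidth, by_logic)
    chips = math.ceil(SENSOR_CHANNELS / per_chip)
    limiter = "bandwidth" if by_bandwidth < by_logic else "logic"
    print(f"{name}: {per_chip} channels/chip ({limiter}-limited), "
          f"{chips} chips total")
```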
4.4 Dataflow-Based Tools Although creating modular and parameterized designs adds some automation and efficiency to hardware updates due to technology changes, there are some challenges that these techniques cannot solve. For example, in systems with a large amount of interconnection and very high data rates between devices, it can be difficult to efficiently balance computation and communication, and to choose the optimal number of devices to prevent bottlenecks or under-utilization of data links. This type of problem has been well-studied in the field of graph theory, but most HDL implementations are not well-suited to rigorous mathematical analysis. On the other hand, dataflow-style design strongly correlates to graph structures, allowing the use of graph theory. However, this style is not necessarily well-suited to direct translation to high-performance, low-latency hardware. Researchers are working to create automatic tools that bridge the gap between dataflow languages and HDLs. One such example is the Open Dataflow (OpenDF) project, which features open-source tools for conversion between the CAL dataflow language (described in further detail in Chapter 3) and VHDL/Verilog [14]. Dataflow-style descriptions can also serve as an intermediary to help bridge the divide between software and hardware development for HEP systems. Dataflow descriptions are particularly compatible with functional programming languages. These languages may be an attractive option to physicists, as they are a convenient means of clearly expressing mathematical algorithms. If physics algorithms are developed using functional languages, it may be possible to quickly and automatically convert the algorithms into dataflow descriptions, calculate the optimal number of each of the hardware modules needed to implement the algorithm, and convert the dataflow-based designs into efficient HDL implementations that can be synthesized and implemented in hardware. The idea of a fully automated process to translate physics algorithms into high-quality hardware designs is not yet feasible. Some algorithms are inherently sequential, or are not easily translated to functional languages and dataflow descriptions. Current translation tools do not yet achieve the level of quality and efficiency required to meet the strict constraints imposed by HEP systems. However, as these tools mature, this design flow may become more and more attractive, if not for the final hardware design then at least for intermediate design exploration. Reducing the need for expert hardware designers could go a long way towards eliminating some of the firmware-design bottlenecks in the process of developing new hardware-based physics algorithms.
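To illustrate why dataflow descriptions lend themselves to this kind of analysis, the sketch below solves the balance equations for a small synchronous-dataflow chain, yielding the relative firing rates (and hence replication ratios) of each module. The actors and token rates are illustrative, not drawn from an actual trigger design.

```python
# Balance-equation sketch for a synchronous-dataflow chain: solving
# q[i] * produce = q[i+1] * consume on every edge gives the relative
# firing rates of the modules. Actor rates below are illustrative.
from functools import reduce
from math import gcd

# (produce, consume) token rates per edge of a three-actor chain,
# e.g. sensor -> filter -> sort.
edges = [(4, 2), (3, 1)]

# Accumulate q[i] as a fraction nums[i] / dens[i] relative to q[0] = 1.
nums, dens = [1], [1]
for produce, consume in edges:
    nums.append(nums[-1] * produce)
    dens.append(dens[-1] * consume)

def lcm(a, b):
    return a * b // gcd(a, b)

# Scale to the smallest integer repetition vector.
scale = reduce(lcm, dens)
q = [n * (scale // d) for n, d in zip(nums, dens)]
common = reduce(gcd, q)
q = [x // common for x in q]
print(q)   # [1, 2, 6]: relative firing rates, hence replication ratios
```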
4.5 Language-Independent Functional Testing HEP applications have many architecturally challenging characteristics. As an engineer, it is tempting to focus most of the effort on the design of the hardware and to devote less effort to testing those designs. However, for the large-scale, complex systems used in HEP, the time it takes to test a design and verify that it properly implements the original physics algorithms can be at least as long as the design phase itself. An effective testing methodology is a crucial step in guaranteeing the quality of the system design. The most significant challenges in functional verification of HEP system hardware derive from the fact that separate frameworks are developed to test the original software-based algorithmic description and the hardware-based implementation of those algorithms. If the physicists designing software and the engineers designing the hardware each develop their tests independently, there is a risk that differences in the test suites may result in differences in the tested hardware. For example, the tests may not have equivalent coverage, the corner cases in software and those for hardware may differ, or one implementation may make assumptions that do not hold for the other. A better approach is to develop a single testing framework that can be used to test the software and hardware implementations side-by-side, using identically formatted test sets. Although this may seem like a simple concept, it is complicated by the fact that the software and hardware implementations use different languages (C/C++ versus Verilog/VHDL) and different simulators (sequential-step versus parallel/event-driven simulators). One approach to overcoming these issues is to use a language-independent testing framework. Hardware and software systems may take their input and provide output in differing formats; this difference can be accommodated by adding a translation layer into the testing flow. Inputs and outputs can be specified in text files using a standard, unified format. Scripts can translate this standard input format into the format appropriate for the particular hardware or software unit under test, automatically invoke the appropriate simulator to generate the output, translate that output into a standard format, and write it to a text file. Another script can compare these text files and identify differences that indicate that the hardware and software implementations are not equivalent. Research tools such as the DICE framework [15] use this approach. HEP systems are often mission-critical components of expensive experiments. At times it may be desirable to use more rigorous formal verification methods. However, formal verification for electronic circuits is infeasible for very large systems due to high computational complexity. For example, the commonly used Model Checking technique is NP-Complete [57]. Formal methods also typically require that all portions of the design are described in a common synthesis language and format. This presents a problem for HEP systems that, due to their collaborative design process, are often a hybrid of VHDL, Verilog, and proprietary IP cores that use a mixture of structural, RTL, and behavioral descriptions. An approach to overcoming this problem is to convert the multiple synthesis languages into a dataflow description. Control data-flow graphs can be used to represent high-level synthesis
languages [6]. By converting the hardware design and system specifications into dataflow graphs using a language-independent format such as the Dataflow Interchange Format [35], it might be possible to perform formal verification by proving the equivalence of the two graphs. Of course, if any of these techniques are used to transform tests or implementations from one format to another, it is important to ensure that these transformations are lossless; that is, none of the diagnostic information from the original tests is lost in the conversion between formats.
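A minimal sketch of the translation-layer testing flow described in this section appears below. The unified file format, translation functions, and simulator commands are placeholders standing in for project-specific tooling; this is not the interface of DICE or any other real framework.

```python
# Sketch of a language-independent test flow: unified text-format
# stimuli are translated to each implementation's native format, the
# appropriate simulator is invoked, and normalized outputs are diffed.
# File names, formats, and commands are placeholders for real tooling.
import subprocess

def run_unit(unified_input, translate_in, simulator_cmd, translate_out):
    """Run one implementation (software model or HDL simulation)."""
    with open("stimulus.txt", "w") as f:
        f.write(translate_in(unified_input))    # unified -> native format
    subprocess.run(simulator_cmd, check=True)   # e.g., a C model or HDL sim
    with open("response.txt") as f:
        return translate_out(f.read())          # native -> unified format

def compare(golden: str, candidate: str) -> bool:
    """Line-by-line equivalence check on the unified output format."""
    ok = True
    pairs = zip(golden.splitlines(), candidate.splitlines())
    for line_no, (g, c) in enumerate(pairs):
        if g != c:
            print(f"mismatch at line {line_no}: "
                  f"software={g!r} hardware={c!r}")
            ok = False
    return ok
```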
4.6 Efficient Design Partitioning For signal processing applications that may need hundreds or thousands of devices for implementation, it is critical to efficiently partition the application across those devices. One of the key characteristics of HEP applications is their huge data rates, which generally exceed the bandwidth of any one device in the system. Thus, even if the computation could fit on a single device, partitioning across multiple devices may be needed simply to meet bandwidth requirements. During the partitioning of processing modules across many devices, it is important to consider how the partitioning affects the required interconnection and bandwidth between devices. A common approach is to minimize the cut-size — that is, the number of data connections or cross-sectional bandwidth — between partitions. This helps to mitigate bottlenecks in the inter-device communication systems. Techniques for automated design partitioning on FPGA devices have been studied by several research groups. Some tools, such as SPARCS [49], use graph-based partitioning methods that may be complementary to some of the dataflow-based design techniques discussed in Section 4.4. These methods are effective at bandwidth-driven partitioning. Since HEP triggering systems often have tight end-to-end latency constraints, performance-driven partitioning algorithms [27] may be preferable in some situations.
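As a small illustration of the cut-size objective, the following sketch counts the inter-device links implied by a module-to-device assignment and compares two candidate partitionings; the module names and link counts are invented for the example.

```python
# Cut-size of a module-to-device assignment: the number of data links
# that cross device boundaries. Modules and link counts are invented.

# Netlist as (source_module, sink_module, links) tuples.
netlist = [
    ("ecal_sum", "electron_id", 8),
    ("hcal_sum", "electron_id", 8),
    ("electron_id", "global_sort", 4),
    ("jet_finder", "global_sort", 4),
]

def cut_size(assignment, netlist):
    """Total inter-device links for a given partitioning."""
    return sum(links for src, dst, links in netlist
               if assignment[src] != assignment[dst])

# Two candidate partitionings of the same five modules onto two devices.
a = {"ecal_sum": 0, "hcal_sum": 0, "electron_id": 1,
     "jet_finder": 1, "global_sort": 1}
b = {"ecal_sum": 0, "hcal_sum": 0, "electron_id": 0,
     "jet_finder": 1, "global_sort": 1}
print(cut_size(a, netlist), cut_size(b, netlist))   # 16 vs 4
```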
5 Example Problems from the CMS Level-1 Trigger Thus far, we have primarily focused on HEP challenges in a general sense, supported by selected examples from specific HEP systems. To give the reader a better sense of the precise type of algorithms and designs that may occur in HEP applications, this section presents a set of more detailed challenges specifically faced in the SLHC upgrade of the CMS Level-1 Trigger at CERN’s Large Hadron Collider. The CMS is a general-purpose HEP experiment that searches for a variety of different physics phenomena. As such, its systems share similarities with other general-purpose HEP systems, such as the LHC’s ATLAS experiment and the Tevatron’s D0 experiment [31]. All three experiments use energy readings from one or
more calorimeters and muon detectors as a source of raw data. Each experiment uses a triggering system to attempt to isolate and identify fundamental particles created in collisions, and avoid storing data for uninteresting events. Whereas the exact specifications of the CMS Trigger are unique, its functions are similar to those that are likely to be encountered in other triggering systems. In this section, we discuss algorithms for two such functions: particle identification and jet reconstruction.
5.1 Particle Identification A particle accelerator's sensors record various measurements from the particles created in high-energy collisions. A primary function of triggering systems is to process sensor data and partially reconstruct key information about these particles. Particles can be identified by observing their interactions with different sensor types. Some particles, like those with electric charge, will deflect in a magnetic field. Some will only be absorbed by specific materials. Others will pass through most materials without interacting with them. The interactions of various particles with the sensors in the CMS detector are shown in Fig. 7.
Fig. 7 Post-collision particles can be identified as belonging to one of five categories based on their interaction with the detector’s different sensors. Image Copyright CMS Experiment at CERN.
To better understand the processing of this sensor data, it may be useful to view it as similar to an image processing problem. Collectively, the sensor data forms the image. Each sensor location can be viewed as a ‘pixel’, and data from different sensor types can be viewed as separate channels in each pixel. In the case of CMS’s Electromagnetic Calorimeter (ECAL) and Hadron Calorimeter (HCAL), the incoming sensor data forms a 4-Kilopixel image, since both the ECAL and HCAL have a
4096-sensor grid. Identifying specific particle types is now similar to the problem of identifying objects in a picture. Two of the particle types that need to be identified are electrons and photons. As Fig. 7 shows, electrons and photons are absorbed by the ECAL. These particles are small; their energy is deposited only in the equivalent of two adjacent pixels [10]. To identify these particles, the CMS Level-1 Calorimeter Trigger applies Equation (1) to every group of two pixels. To return to the image processing analogy, this process is akin to the problem of finding every instance of two adjacent pixels in a 4-Kilopixel image wherein green makes up more than x% of the total color saturation, where x is a programmable threshold value. In the CMS system, a new image is delivered from the sensors every 25 ns; the particle identification circuitry must therefore be fast enough to process up to 40 million of these images each second.

ECAL / (ECAL + HCAL) > Threshold    (1)

Although Equation (1) is seemingly simple, it is only an approximation; the real system requires complex special-case handling to deal with non-linear effects in the sensors at high energy. For this reason, in previous accelerators this identification algorithm has been implemented using large and expensive RAM-based look-up tables. As discussed in Section 4.2, table-based algorithm implementation provides a great deal of flexibility to change an algorithm (by loading a new set of values into the table) without needing to change the underlying hardware. It also provides an effective way to implement Equation (1) with the additional special-case handling. Research from the SLHC upgrade project has investigated a set of alternative FPGA implementation methods for this algorithm, including hardware reprogramming and the use of multi-level hashing to fit the look-up tables on-chip [33]. By using a parallel implementation across many FPGAs and a pipelined architecture, these designs are capable of performing the electron/photon identification across the entire sensor image in approximately 30 ns.
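A sequential software sketch of this identification step is given below, applying the Equation (1) test to every pair of horizontally adjacent pixels in a 64 x 64 view of the 4096-sensor grid. The threshold, grid shape, and energy values are illustrative assumptions; the real trigger evaluates all pairs in parallel in hardware and includes the special-case handling discussed above.

```python
# Sequential sketch of the two-pixel electron/photon test. The grid is
# viewed as 64 x 64 (one 4096-sensor array); only horizontal pairs are
# checked for simplicity. Threshold and energies are invented values.

THRESHOLD = 0.9          # programmable decision threshold (illustrative)
ROWS, COLS = 64, 64      # one way to arrange the 4096-sensor grid

ecal = [[0.0] * COLS for _ in range(ROWS)]   # per-pixel ECAL energies
hcal = [[0.0] * COLS for _ in range(ROWS)]   # per-pixel HCAL energies
ecal[10][20], ecal[10][21] = 5.0, 3.0        # a hypothetical deposit

candidates = []
for r in range(ROWS):
    for c in range(COLS - 1):
        e = ecal[r][c] + ecal[r][c + 1]      # two-pixel ECAL sum
        h = hcal[r][c] + hcal[r][c + 1]      # matching HCAL sum
        # Equation (1): ECAL / (ECAL + HCAL) > Threshold
        if e > 0 and e / (e + h) > THRESHOLD:
            candidates.append((r, c, e))
print(candidates)
# Pairs overlapping the deposit all fire, with (10, 20, 8.0) carrying
# the full energy; the real trigger resolves such overlaps and applies
# the special-case handling described above.
```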
5.2 Jet Reconstruction In some cases, physicists are interested in large groups of particles rather than individual ones. Certain physical phenomena may generate highly-energetic bunches of particles known as ‘jets’. Unlike individual particles, whose data is isolated to a few adjacent sensors — or a few pixels in the image processing analogy — the data from jets may be spread across tens or hundreds of sensors. From a physics standpoint, in CMS it is worthwhile to consider a region of at least 8 x 8 sensors (pixels) when identifying a jet [11]. In the process of jet detection, any group of pixels representing a sufficiently high total energy may be considered a jet, provided that there is a bright spot (energy deposit) at the center of the region. Several stages of processing must be applied to each jet. First, the energy from every pixel in the jet must be summed to calculate the total jet energy. Next, a weighted energy position must be
calculated within the jet. For example, if there are more bright pixels to the right of the jet's center than to the left, the jet's center of energy is somewhere right-of-center in the region. Finally, all of the jets must be sorted based on descending total energy. However, this process is more difficult than simply sorting integers, because multiple jets may occur at overlapping locations. The Jet Overlap Problem The primary problem with sorting jets is that the physics algorithms for the CMS system require that no pixel appear in more than one jet. Duplicating pixels would be akin to double-counting particles for multiple jets, and would introduce errors into the data. However, since a jet region is formed around every bright group of pixels, jet regions will overlap when there are closely bunched groups of bright pixels. To ensure that jets do not contain duplicated pixels when they are sent to the High-Level Trigger, the Level-1 Trigger must delete jets until no two jets overlap. The goal is to keep the most interesting, i.e. highest-energy, jets and delete the less interesting jets. After the overlapping jets are deleted, the highest-energy jets are saved and the lower-energy jets are discarded. The Level-1 Trigger must first sort all of the jets to identify the highest-energy jet. Then, it must delete any jets that overlap with the highest-energy jet. This process is then repeated for the next-highest energy jet of those that remain, and continues until none of the remaining jets overlap. The problem with this algorithm is that it must be performed iteratively. This problem is illustrated in Fig. 8. In this figure, Jet C has a higher energy than Jet D; however, we cannot determine if Jet D should be deleted until we know whether Jet C is deleted. Similarly, we cannot determine if Jet C is deleted until we know if Jet B should be deleted, and we cannot determine if Jet B will be deleted until we know if Jet A has a higher energy than Jet B. If Jet A has higher energy than B, then B and D will be deleted. Conversely, if it has lower energy, then A and C will be deleted. As this example shows, the jets cannot be processed in parallel, and must be processed sequentially, starting at the highest-energy jet. This sequential chain of decisions is a problem because of the large size of the sensor image. If every pixel in the image is bright, there could be up to 4096 jets in the image. Such a situation could require thousands of sequential steps to sort and isolate the jets. This is unacceptable, since the entire Level-1 Trigger has less than one microsecond to perform all of its processing, including jet reconstruction. The worst-case sorting situation occurs when the sensor image is full of jets in close proximity. However, simulations suggest that during real experiments the actual occurrences of jets should typically be sparse. When there are a small number of sparsely distributed jets, the sorting can be accomplished in a small number of steps. One proposed solution to the jet sorting problem is to perform a hierarchical sort, in which jets are first sorted and isolated over small regions, then sorted globally. Although this method reduces worst-case complexity by limiting the maximum number of iterations due to the smaller area, it is not guaranteed to produce a fully correct solution — the algorithm may end up deleting some jets that should not have been deleted. Another option is to simply assume that jets will always be sparsely
distributed and set a hard limit on the number of iterations of jet trimming. This approach avoids accidental jet deletions, but potentially produces jets with incorrect energy data if many adjacent jets occur.
Fig. 8 Four overlapping jets and their associated energy values. The highest-energy jet must be identified before any jets can be deleted.
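The iterative trimming procedure can be captured in a few lines, as in the sketch below: jets are modeled as 8 x 8 regions, the highest-energy survivor is kept, and every jet overlapping it is deleted, mirroring the A-through-D chain of Fig. 8. Positions and energies are invented for illustration.

```python
# Greedy, inherently sequential jet trimming: keep the highest-energy
# surviving jet, delete everything that overlaps it, repeat. Jets are
# modeled as 8 x 8 regions identified by their top-left corner; the
# positions and energies echo the A-through-D chain of Fig. 8 but are
# otherwise invented.

SIZE = 8   # jet region spans at least 8 x 8 sensors

def overlaps(j1, j2):
    """True if the two 8 x 8 regions share at least one pixel."""
    (r1, c1), (r2, c2) = j1["pos"], j2["pos"]
    return abs(r1 - r2) < SIZE and abs(c1 - c2) < SIZE

def trim_jets(jets):
    """Each decision depends on the previous one, so no parallelism."""
    remaining = sorted(jets, key=lambda j: j["energy"], reverse=True)
    kept = []
    while remaining:
        best = remaining.pop(0)                    # highest-energy survivor
        kept.append(best)
        remaining = [j for j in remaining if not overlaps(best, j)]
    return kept

jets = [
    {"name": "A", "pos": (0, 0),  "energy": 90},
    {"name": "B", "pos": (0, 6),  "energy": 80},   # overlaps A and C
    {"name": "C", "pos": (0, 12), "energy": 70},   # overlaps B and D
    {"name": "D", "pos": (0, 18), "energy": 60},   # overlaps C
]
print([j["name"] for j in trim_jets(jets)])        # ['A', 'C']
```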
6 Conclusion High-energy physics (HEP) applications present a unique range of challenges to DSP system designers, particularly in the form of triggering systems that process the colossal amounts of data produced by large sensor arrays and determine which data to keep and which data to discard. From a technical standpoint, triggering systems require massive bandwidth, low latency, and large-scale duplication and interconnection. In the Level-1 Trigger system of the CMS experiment at the Large Hadron Collider, the world's most advanced particle accelerator, this amounts to a bandwidth of hundreds of gigabytes per second, a processing latency of less than one microsecond, and thousands of interconnected FPGAs and ASICs. From a design methodology standpoint, these applications require the adoption of special techniques and tools to manage cost and complexity, given the particularly large scope and long timeline of HEP development projects. Some of these include (1) techniques for allowing rapid, inexpensive modification and upgrade of algorithms, (2) strategies for dealing with long design windows and evolving platforms, (3) technologies to narrow the gaps between physicists and engineers, and between software and hardware infrastructure, (4) practices to allow for efficient, homogeneous testing of new algorithms and their hardware implementations, and (5) tools to ease the process of performing collaborative design work with multi-national, interdisciplinary design teams.
DSP system design for HEP applications is an ongoing research area. Some of the current research on the SLHC upgrade project has focused on exploiting FPGA features to provide flexible, efficient, parallel triggering architectures and creating new algorithms and hardware designs to accommodate an accelerator luminosity increase of an order of magnitude. Other research focuses on improving the development process of HEP applications through dataflow descriptions and dataflow-based tools to provide a viable intermediary between hardware and software implementations. Furthermore, this work aims to create an infrastructure to promote homogeneous, language-independent functional testing and verification of new algorithms and implementations. For related reading, see Chapter 14 on FPGA-based signal processing architectures and Chapter 31 on the process of mapping dataflow graphs to FPGAs. In this chapter, we have outlined the unique characteristics of HEP applications, highlighted many of the technical and developmental challenges posed to the designers of signal processing systems, and described techniques and strategies that can help meet these challenges. The design of the Large Hadron Collider has already pushed today's engineering capabilities to their limit, and historical trends have shown that the amount of data collected by HEP experiments is increasing at an exponential rate. Advancements to the triggering systems that handle this data will be vital to the creation of the next generation of particle accelerators. The development of novel tools, techniques, and architectures for these signal processing systems will play a crucial role through the completion of the Super Large Hadron Collider and beyond.
Medical Image Processing

Raj Shekhar, Vivek Walimbe, and William Plishker

Raj Shekhar: University of Maryland, Baltimore, MD, e-mail: [email protected]
Vivek Walimbe: GE Healthcare, Waukesha, WI, e-mail: [email protected]
William Plishker: University of Maryland, College Park, MD and University of Maryland, Baltimore, MD, e-mail: [email protected]
1 Introduction

Of the many types of images around us, medical images have been a ubiquitous type since x-rays were first discovered in 1895. Recent years, especially after the introduction of computed tomography (CT) in 1972, have witnessed an explosion in the use of medical imaging and, consequently, in the volume of medical image data being produced. It is estimated that 40,000 terabytes of medical image data were generated in the United States alone in 2009 [8]. With the expanding use of medical imaging and the growing size and resolution of medical images, medical image processing has evolved into an important subspecialty of image processing, and the field continues to gain prominence and catch the fancy of the scientific community.

This chapter is designed to introduce the reader to the common medical imaging modalities (i.e., techniques) first. Most medical images result from the interaction of some form of energy with the human body, governed by the fundamental principles of physics. Specialized instruments control the release and recording of the energy before and after the said interaction. Often the recorded energy (i.e., the sensed signals) needs additional processing, or goes through the process of image reconstruction, to assume a format that is easy to comprehend and interpret. It is also possible to make further calculations on the reconstructed images to extract new or quantitative information. As the number of images per exam has grown, overwhelming the traditional two-dimensional (2D), slice-by-slice review process, newer algorithms for efficient and intuitive visualization of the entire exam (often a three-dimensional [3D]
image) have become equally important. Clearly, each imaging modality represents a sophisticated confluence of physics, electronics, data processing, and computing. We focus primarily on the data processing and computing aspects of medical imaging while keeping the discussion of physics and electronics to the essential minimum. The learning objectives of this chapter are to acquire an overview of the common algorithms used in the reconstruction, processing, and visualization of medical images, and to develop an appreciation for the associated computing considerations.
2 Medical Imaging Modalities

This section provides an overview of the common medical imaging modalities. The overview comprises a brief discussion of the underlying physics and instrumentation, the reconstruction problem, and the defining characteristics of each modality. For an in-depth treatment, the reader is referred to the many excellent sources cited in Section 11.
2.1 Radiography

Commonly referred to as x-ray, conventional radiography (or simply radiography) is the oldest of all medical imaging modalities. Radiography creates a projection image by collapsing 3D anatomic information into a 2D image, the shadow cast by the body in the x-ray beam. As x-rays enter the body, they are absorbed differentially by the various types of tissue along their paths. For example, hard tissues such as bone absorb x-rays more than soft tissues. The pixels in a radiograph thus represent the attenuated intensity of the x-rays exiting the body, and a radiograph provides information about the tissue composition of the body. The advantages of radiography are its relative simplicity of acquisition and instrumentation, as well as its speed, high spatial resolution, and ease of use. Radiation exposure, a lack of depth information, and a loss of contrast from overlapped tissues and scattered x-rays are its primary disadvantages. Radiography differs significantly from the other modalities in that the sensed signals form the image directly, making image reconstruction rather straightforward.
2.2 Computed Tomography

Tomography literally means slice imaging. Like radiography, CT is based on the differential attenuation of x-rays by the body. Unlike radiography, however, CT generates cross-sectional images with exquisite detail, without the problem of overlapped tissues. Stacking several parallel slices leads to a 3D image or an image volume, and CT is
credited with starting the volumetric imaging trend. In the scanning phase of CT, an x-ray source and an array of detectors, located diametrically opposite each other with the body in between, rotate to produce a number of projections. In the reconstruction phase, the projections (also referred to as the raw data) are processed to form a cross-sectional image representing a map of tissue attenuation coefficients. Early scanners that acquired data for one cross-section at a time were slow, but the development of helical scanning – continuous rotation of the CT gantry while the patient is linearly translated – significantly reduced the scanning time. Modern multislice CT scanners with tens to hundreds of parallel detector rows allow simultaneous data acquisition for as many cross-sections, leading to whole-body scanning in a few seconds [10]. The reconstruction, however, could still take tens of minutes. While radiation exposure is a concern, CT is the most commonly prescribed cross-sectional imaging modality these days because of its high spatial resolution and fast scanning.
2.3 Nuclear Medicine

The term nuclear medicine refers to a class of techniques that image the distribution of a radiotracer in the body. The radiotracer is injected into the bloodstream and then spreads nonuniformly among the body's cells according to the underlying metabolic or physiologic mechanisms. The radioactive isotopes in the radiotracer release gamma rays when they disintegrate. Nuclear medicine differs from the x-ray-based techniques in a critical way in that the source of radiation is internal to the body, earning it the moniker emission imaging. Nuclear medicine includes planar scintigraphy, a planar imaging technique analogous to conventional radiography, and two cross-sectional imaging techniques: single photon emission computed tomography (SPECT) and positron emission tomography (PET). Specialized wide-area gamma cameras and detector assemblies absorb and detect the emitted gamma photons in SPECT and PET to acquire CT-like projections. Nuclear medicine images are notably noisy and of low spatial resolution, but the modality remains highly sensitive to functional abnormalities, which appear as either abnormally high or abnormally low intensities in the reconstructed images. A unique advantage of nuclear medicine is a wide and growing selection of radiotracers capable of imaging different physiologic processes [42].
2.4 Magnetic Resonance Imaging

Based on the nuclear magnetic resonance phenomenon, magnetic resonance imaging (MRI) is the most recent of the major imaging modalities. Nuclear magnetic resonance is the generation of a radiofrequency (RF) signal, in the form of free induction decay (FID), from nuclear spin systems (nuclei with an odd number of protons and/or neutrons, and hence a net spin) by careful manipulation of magnetic fields and electromagnetic excitations. Due to the body's high water content, the nucleus of the hydrogen atom is the most prevalent nuclear spin
system present in the human body. MRI is thus targeted at imaging the proton density as well as various magnetic properties (T1 and T2 relaxation times, for example) of the hydrogen nuclei, which allow creating high-contrast and high-resolution 2D and 3D anatomic scans. Magnetic resonance spectroscopic techniques [29] can also reveal the chemical nature of body tissues, as well as concentrations of specific molecules, permitting functional imaging. A big challenge in the development of MRI was to spatially localize the evoked RF signals in 3D space. This was addressed by the concepts of gradient-based slice selection, frequency encoding, and phase encoding. Like CT, MRI requires reconstruction from the observed data. Unlike CT, MRI poses no known radiation risks, and it offers superior soft-tissue contrast. With new techniques such as diffusion tensor imaging [19], MRI continues to expand in its capabilities to image various anatomic and physiologic processes.
2.5 Ultrasound

Unlike the imaging modalities discussed above, ultrasound imaging is unique in that it is a reflection imaging technique. Ultrasound is inaudible sound above 20 kHz, with medical ultrasound ranging between 1 MHz and 70 MHz. Ultrasound is generated and received by piezoelectric transducers. When excited, these transducers send ultrasound waves out into a medium. Often the same transducer elements are used to receive the reflected waves, which are then converted to electrical signals. Various imaging modes exist, but primarily a focused ultrasound beam generated from a single element or an array of elements is sent and received to form a single line (called a scan line) of an image. Mechanical or electronic steering of the transmitted beam allows creating a traditional 2D image. An ultrasound image maps differences in acoustic impedance, which are observed at tissue interfaces. Taking advantage of the Doppler principle, ultrasound is also used to image blood flow in the body. Real-time 3D imaging constitutes a significant new development in ultrasound [39][27][33]. All in all, the strengths of ultrasound are its compactness, economy, portability, and real-time imaging capability. The disadvantages are a low signal-to-noise ratio (SNR), the presence of speckle noise (a texture pattern overriding an image), and a lack of correlation between image intensity and tissue type. Nevertheless, ultrasound remains an important tool in a range of diagnostic and image-guided interventional procedures.
3 Medical Applications

3.1 Imaging Pipeline

Any medical application involves a series of processing steps, from image generation to image presentation, that transform the desired anatomic and/or physiologic parameters into an image, a series of images, and related quantitative information that are then displayed for interpretation by the physician. Figure 1 shows the basic functional blocks of this "imaging pipeline." Virtually all medical imaging applications involve some combination of these basic components. The imaging device is always the first component of the imaging chain and is responsible for converting the anatomic/physiologic information into a digital signal called the raw data. The specific sensing element in the imaging device varies with the imaging modality, for example, piezoelectric transducers for ultrasound, cesium iodide (CsI) crystals for x-ray/CT, and gamma detectors for PET. The next step is image reconstruction. In most cases, the image data is corrupted by noise and other artifacts, commonly due to a combination of the physics of the imaging modality (e.g., speckle noise in ultrasound or noise in CT images due to scattered radiation) and noise introduced by the electronics of the imaging device. Basic preprocessing is performed to account for these artifacts, either on the raw image data before reconstruction or on the reconstructed image data. These processing steps generally involve filtering operations, ranging from basic median and averaging filtering to more sophisticated adaptive filtering techniques such as anisotropic diffusion filtering. Advanced image analysis techniques can then be applied to the filtered image data. Examples of these techniques include image registration, used for aligning images acquired from different modalities, different patients, or at different times; image segmentation, used to identify distinct anatomic features or specific pathology, such as tumors, in images; and quantitative analysis techniques, used to numerically characterize the size and shape of the different features in an image.
Fig. 1 Basic components of the imaging pipeline.
These advanced processing steps are useful to extract specific information from the images and/or prepare images for presentation in a specific format to maximize the value of the available image information. The final stage of the imaging pipeline is the visualization component, which is used to display the processed image data, together with the quantitative information, to the physician. Several special visualization algorithms are available, based on the dimensionality of the image data (2D/3D/four-dimensional [4D]) and the specific needs of the application (e.g., interactive display, real-time updates to images, etc.). Visualization is a vital step of the imaging pipeline, because the image information is useful and effective for the physician only if displayed in a manner that is appropriate for the specific application.

Figures 2 and 3 show examples of medical imaging pipelines for two clinical applications: (1) left ventricle functional analysis using 3D echocardiography, and (2) multimodality fusion of whole-body PET/CT. 3D echocardiography is volumetric ultrasound imaging of the heart. The images are used to study the health of the left ventricle of the heart by visually evaluating the motion and structure of the ventricular wall and mitral valve over the entire cardiac cycle. Quantitative information about the variation of the ventricular volume and regional wall motion and thickness is important for making an accurate diagnosis of any underlying heart ailments. The block diagram in Figure 2 shows the basic steps involved in enabling this application. Multimodality fusion of PET and CT images permits an accurate localization of active tumors, shown unequivocally by PET, with respect to the detailed anatomic map provided by CT. The precise identification and localization of active tumors with respect to CT is important for making a definitive diagnosis and is playing an increasing role in image-guided therapies for the delivery of highly conformal doses to the tumor, advancement of an interventional device to the tumor, and monitoring of treatment response in general.
Fig. 2 Application 1: Functional analysis of the left ventricle using 3D echocardiography [44].
The diagram in Figure 3 shows the basic steps involved in enabling the fusion and display of PET and CT images. The subsequent sections describe in more detail the various algorithms that form the basic elements of the imaging pipeline described in this section.
4 Image Reconstruction

Except for the projection imaging techniques of conventional radiography and planar scintigraphy, the tomographic imaging techniques introduced earlier acquire raw data in a format that does not correspond directly and intuitively to the physical world. In CT, PET, and SPECT, the acquired data are in the form of projections taken from many angles around the body. In MRI, the raw data belong to the frequency domain. In ultrasound, the sensed signals are high-frequency echoes from many different scan lines. In all of these techniques, the raw data need further processing to convert them to either a 2D image representing an anatomic section or a 3D image representing a volumetric section of the anatomy. Image reconstruction is the process of converting the raw signals to an array of numbers that correspond to some tissue characteristic in physical space.
Fig. 3 Application 2: Multimodality fusion of whole-body PET/CT [35].
4.1 CT Reconstruction

The most common CT image reconstruction algorithm is the filtered (or convolution) backprojection algorithm. The algorithm has its origins in the Radon transform, which relates a 2D object function f(x, y) to its projections J(p, θ):

R{f(x, y)} = J(p, θ) = ∫ f(p · cos(θ) − q · sin(θ), p · sin(θ) + q · cos(θ)) dq

where the p–q coordinate system is the x–y coordinate system rotated by the angle θ (see Figure 4). In practice, f(x, y) represents the cross-section of the body, whereas J(p, θ), with θ ranging over 180 or 360 degrees, is the complete set of projections collected by rotation of the CT gantry around the patient. In CT, J(p, θ) is the observed data, and f(x, y) must be recovered from it. Put another way, CT image reconstruction requires the inverse Radon transform. The central slice theorem is an important concept in understanding the inverse Radon transform. The theorem states that the Fourier transform of a given projection
Fig. 4 The process of generating a projection from an object in 2D.
J(p, θ) is identical to the values of F(u, v), the 2D Fourier transform of f(x, y), along the line through the origin at angle θ. This further means that if the Fourier transforms of the individual projections are arranged circularly, as shown in Figure 5, the result is F(u, v) in polar coordinates. According to the inverse Radon transform, the inverse Fourier transform of the circularly arranged projections (in the frequency domain) equals f(x, y). Using the inverse Radon transform to reconstruct f(x, y) requires a polar-to-Cartesian conversion of the circularly organized projections for a convenient inverse Fourier transform. This reconstruction method, referred to as the direct Fourier method, has been attempted but is not popular because of the errors and computational cost associated with the required interpolation. This method further requires that all projections be available before reconstruction can begin, which also discourages its use. If we formulate the inverse Radon transform in polar coordinates, obviating interpolation, and view the reconstruction process in the spatial domain, the result is the popular filtered backprojection algorithm. The Jacobian associated with the coordinate transformation expresses itself as a convolution (in the spatial domain) of each projection with a ramp function. The reconstructed image then is the summation of the backprojections of each of the filtered projections. This algorithm is computationally less demanding and, unlike the direct Fourier method, reconstruction can begin as soon as the first projection is acquired, making it the method of choice in most current CT scanners.
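The structure of the filtered backprojection algorithm is compact enough to sketch directly. The following is a minimal NumPy sketch for parallel-beam geometry, assuming an ideal ramp filter and nearest-neighbor interpolation during backprojection; the function and parameter names are illustrative and not taken from any scanner's software.

    import numpy as np

    def filtered_backprojection(sinogram, thetas):
        """sinogram: (num_angles, num_detectors) parallel-beam projections;
        thetas: projection angles in radians."""
        n_ang, n_det = sinogram.shape
        # Filter each projection with a ramp filter in the frequency domain
        ramp = np.abs(np.fft.fftfreq(n_det))
        filtered = np.real(np.fft.ifft(np.fft.fft(sinogram, axis=1) * ramp, axis=1))
        # Backproject: smear each filtered projection across the image grid
        mid = n_det // 2
        x, y = np.meshgrid(np.arange(n_det) - mid, np.arange(n_det) - mid)
        recon = np.zeros((n_det, n_det))
        for proj, theta in zip(filtered, thetas):
            p = x * np.cos(theta) + y * np.sin(theta)  # detector coordinate per pixel
            idx = np.clip(np.round(p).astype(int) + mid, 0, n_det - 1)
            recon += proj[idx]
        return recon * np.pi / (2 * n_ang)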
Fig. 5 Central slice theorem: Circularly arranged projections after Fourier transform depict the object in the Fourier domain.
Slight variations of the reconstruction algorithm presented here are also available to accommodate the fan-beam (as opposed to parallel-beam) projections most CT scanners generate. Modern CT scanners are multislice scanners that use a cone-shaped x-ray beam and tens to hundreds of parallel detector rows to acquire projections. To properly account for measurements made with a diverging beam, cone-beam reconstruction algorithms, or 3D versions of the algorithms above, are used [7].

Another CT image reconstruction technique of note, from a historical perspective, is iterative algebraic reconstruction. Although used in the first CT scanner, this technique is less frequently used now. It treats the 2D image to be reconstructed as a 2D matrix of elements with initially unknown values (see Figure 6). The rays in the observed projections represent weighted summations of the matrix element values. Starting with an initial guess of the matrix element values (for example, all 1's), the reconstruction process iterates through:

1. forward projection, i.e., forming projections based on the latest matrix element values,
2. computing the difference between the projections from step 1 and the actual observed projections, and
3. backprojection of the difference to update the matrix element values.

When the matrix element values converge to steady-state values, usually after hundreds of iterations, the algorithm terminates. The iterative nature of the algorithm makes it computationally inefficient and less favorable for practical implementation.
Fig. 6 Schematic showing the process of algebraic reconstruction. The elements of the reconstruction matrix are determined iteratively through a series of forward and backward projections.
However, unlike the previously described algorithms, this algorithm can reconstruct an image from a limited number of projections. This feature is helpful in emission imaging and in (nonmedical) earth-resource imaging, either of which may gather projections spanning less than the 180 or 360 degrees needed for the filtered backprojection algorithm.
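The update cycle above is, concretely, the classical Kaczmarz/ART iteration. The sketch below assumes a precomputed dense system matrix A of ray weights and an illustrative relaxation factor; a practical implementation would exploit the sparsity of A.

    import numpy as np

    def art_reconstruct(A, q, n_iter=100, relax=0.1):
        """A: (num_rays, num_pixels) ray-weight matrix; q: observed ray sums."""
        f = np.ones(A.shape[1])          # initial guess: all 1's
        row_norms = (A * A).sum(axis=1)  # squared weight norm of each ray
        for _ in range(n_iter):
            for i in range(A.shape[0]):
                if row_norms[i] == 0.0:
                    continue
                # Difference between observed and forward-projected ray values
                resid = q[i] - A[i] @ f
                # Backproject the difference onto the pixels the ray touches
                f += relax * (resid / row_norms[i]) * A[i]
        return f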
4.2 PET/SPECT Reconstruction

PET and SPECT systems also collect CT-like projections and, therefore, filtered backprojection reconstruction is applicable to these imaging modalities, provided the projections are properly corrected for attenuation as the gamma rays travel through the tissues. In CT, the source of the radiation is outside the body and its characteristics are known. The only unknown is the attenuation characteristics of the body, and the goal of CT is to generate a map of attenuation coefficients. The goal in nuclear medicine, however, is to image a distribution of unknown emission sources located within the body when the attenuation coefficients themselves are also unknown. For this reason, in PET always and in SPECT often, a CT-like transmission scan with a known external source is created. This scan, a map of attenuation coefficients, serves to correct for tissue attenuation. While this so-called attenuation correction is straightforward in PET because of coincidence detection (two photons travel in opposite directions from the point of their origin), the step remains difficult in SPECT because the distance a photon travels before hitting a detector is unknown. Several practical solutions exist to overcome this problem. All in all, filtered backprojection following attenuation correction remains a common reconstruction approach in PET and SPECT.

The photon emission intensities in PET and SPECT are very small, and the resulting projections, which represent detected photon counts, are generally noisy. The photon detection process in nuclear medicine is often modeled as a Poisson process, and therefore statistically based iterative reconstruction techniques specifically for PET and SPECT also exist. The most common of these methods is the maximum likelihood expectation maximization (ML-EM) technique [36]. If Q is the ensemble of detector measurements and R represents the reconstructed image, the ML approach maximizes the likelihood p(Q|R). This likelihood function is further expanded in terms of individual detector readings and pixel intensities while incorporating the Poisson model. The practical implementation maximizes the log-likelihood iteratively using the EM algorithm. The ML-EM algorithm usually better accounts for the noise in the projection data and yields better-quality reconstructed SPECT and PET images. The price paid is the high computational cost of ML-EM. Two common variants of the algorithm, one using ordered subsets [13] and another taking a coarse-to-fine multigrid approach [28], provide acceleration. Dedicated hardware architectures hold great promise for ML-EM and similar iterative reconstruction algorithms.
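To make the iteration concrete, the following is a minimal sketch of the standard multiplicative ML-EM update, assuming a dense system matrix A that maps image voxels to expected detector counts; in practice A is sparse and the update is heavily optimized.

    import numpy as np

    def ml_em(A, q, n_iter=50):
        """A: (num_detectors, num_voxels) system matrix; q: measured counts Q."""
        r = np.ones(A.shape[1])  # current image estimate R
        sens = A.sum(axis=0)     # sensitivity: total weight seen by each voxel
        for _ in range(n_iter):
            expected = A @ r                              # forward projection
            ratio = q / np.maximum(expected, 1e-12)       # measured over expected
            r *= (A.T @ ratio) / np.maximum(sens, 1e-12)  # multiplicative EM update
        return r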
4.3 Reconstruction in MRI

Fourier reconstruction is the primary reconstruction method in MRI. The signals measured in MRI are the FIDs. The design of the pulse sequences (excitation and readout pulses) controls what tissue characteristics are measured by the FIDs. The combination of slice selection, frequency encoding, and phase encoding helps localize the FID to individual pixels (in 2D imaging) and voxels (in 3D acquisition). Irrespective of the pulse sequence used, the data gathered in MRI represent samples of the object in the frequency domain. A 2D or 3D inverse Fourier transform on the acquired data reconstructs the image. Fast Fourier transform architectures are therefore important in MRI reconstruction. Some fast acquisition pulse sequences collect data nonuniformly, for example, along a spiral trajectory in the frequency domain. In these situations, the raw data need to be resampled over a regular grid before applying the inverse Fourier transform. Interpolation thus is a preprocessing step in MRI reconstruction for some acquisition protocols.
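For a fully sampled Cartesian acquisition, the reconstruction reduces to a centered inverse FFT, as in this minimal NumPy sketch (taking the magnitude for gray-scale display; the function name is illustrative):

    import numpy as np

    def mri_reconstruct(kspace):
        """kspace: 2D array of complex frequency-domain samples."""
        # Shift the k-space origin to the corner, invert the 2D FFT, re-center
        image = np.fft.fftshift(np.fft.ifft2(np.fft.ifftshift(kspace)))
        return np.abs(image)  # magnitude image for display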
4.4 Ultrasound Reconstruction

Ultrasound imaging relies on sending a narrow acoustic pulse out into the medium in a particular direction and receiving its echoes. The strength of the echoes, together with their times of arrival, forms a scan line. Several signal processing steps are involved in forming a scan line, which, in order, are filtering, envelope detection, attenuation correction, and log compression. Filtering removes high-frequency noise, whereas envelope detection removes the carrier frequency (the frequency of the ultrasound), usually through a quadrature filter or Hilbert transformation. The absorption of sound by the medium attenuates the echoes greatly; attenuation correction, or time-gain compensation, gives users control over amplifying weak echoes from deeper depths. Finally, log compression compresses the large dynamic range of the amplitude of the received echoes to facilitate gray-level display of the image. Multiple scan lines, often arranged in the shape of a fan (hence the name sector scan) as a result of mechanical rocking of the transducer or electronic beam steering when using an array transducer, form a complete 2D image. 3D imaging often involves stacking multiple 2D images. A polar-to-Cartesian conversion, called scan conversion, is needed for display and storage purposes. Efficient architectures for all stages of ultrasound image formation have been proposed [38].
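The scan-line chain can be sketched end to end in a few lines. The sketch below assumes SciPy's Hilbert transform for envelope detection, folds attenuation correction into a simple linear-in-time gain curve, and uses illustrative gain and dynamic-range values; the input is assumed to be an already band-pass filtered RF trace.

    import numpy as np
    from scipy.signal import hilbert

    def form_scan_line(rf, fs, gain_db_per_us=0.5, dyn_range_db=60.0):
        """rf: filtered RF echo samples; fs: sampling rate in Hz."""
        env = np.abs(hilbert(rf))                      # envelope detection
        t_us = np.arange(rf.size) / fs * 1e6           # echo time of arrival (us)
        env *= 10.0 ** (gain_db_per_us * t_us / 20.0)  # time-gain compensation
        env /= env.max() + 1e-12                       # normalize before compression
        db = 20.0 * np.log10(env + 1e-12)              # log compression
        return np.clip((db + dyn_range_db) / dyn_range_db, 0.0, 1.0)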
5 Image Preprocessing

Noise is inseparable from the physical processes and electronics involved in medical image generation. Ultrasound images are characteristically noisy and laden with speckles. Quantum noise and radiation scattering manifest as intensity variations even
for a uniform medium in CT. PET and SPECT images, likewise, are affected by the random statistical nature of radioactive decay and of photon absorption by the body and the external detectors. Statistical fluctuations of the FID sensed by the receiver coils, and mechanical vibrations of the coils themselves, similarly introduce noise in MRI. The purpose of image preprocessing is to suppress noise and improve image contrast such that the reviewing physician can more easily identify features of interest. Image preprocessing is also often necessary as a data preparation step prior to the more advanced image processing tasks of segmentation and registration described later.

Most classical image filtering approaches are applicable to medical images. These include both spatial and frequency domain filters. In the spatial domain, the preprocessing can be pixel- or voxel-based, or neighborhood-based. An everyday example of pixel/voxel-based preprocessing in medical imaging is the practice of applying window/level. CT and MRI produce 12-bit images, whereas most displays remain 8-bit for gray-scale viewing. The necessary 12-to-8-bit conversion maps a narrow range of 12-bit intensities (called the "window"; the "level" is the center of the window), coinciding with those of the features of interest, to a 0-to-255 range. Another common technique belonging to the same category is histogram equalization. Through intensity redistribution, the goal here is to convert the nonuniform histogram of an image into a histogram that is as uniform as possible. Histogram equalization may help improve the visibility of a target structure in some applications. In others, it may have the opposite effect, and histogram modification, the reshaping of a histogram to mimic a desired distribution, may be more effective [6].

Neighborhood-based preprocessing includes such low-pass spatial domain filters as the Gaussian and rank-order median filters. The well-known edge detection and image sharpening techniques for enhancing higher frequencies are also common in medical image processing. Often, adaptive versions of neighborhood-based filters yield better results. Examples include adaptive arithmetic mean filters as well as those for feature enhancement [6]. The adaptive anisotropic diffusion filter [23] has been shown to be highly effective in achieving simultaneous noise suppression and edge enhancement. The basic premise of anisotropic diffusion filtering is to suppress noise and preserve edges by encouraging intra-region smoothing while prohibiting inter-region smoothing. This is achieved by the following iterative process. For an image I(v, t), where v is a coordinate vector in 2D/3D space and t is a given point in time (i.e., the iteration number), anisotropic diffusion filtering is given by

∂I(v, t)/∂t = div(c(v, t) ∇I(v, t))

where ∇I(v, t) is the gradient image and c(v, t) is a monotonically decreasing function of the gradient magnitude (c = f(|∇I(v, t)|)). The c-term encourages less diffusion (smoothing) in the vicinity of high gradients (strong edges) and more smoothing in regions with low gradients (weak or no edges). Figure 7 shows the application of Gaussian and anisotropic diffusion filters to a relatively low signal-to-noise ratio CT image. Note that, unlike the Gaussian filter, the anisotropic diffusion filter removes noise without blurring the edges.
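A minimal 2D sketch of this iteration, in the style of Perona and Malik, follows. The exponential conduction function and the parameter values are illustrative, and the wrap-around boundary handling of np.roll is a simplification.

    import numpy as np

    def anisotropic_diffusion(img, n_iter=20, kappa=30.0, lam=0.2):
        """Iteratively smooth within regions while preserving strong edges."""
        u = img.astype(float).copy()
        for _ in range(n_iter):
            # Differences to the four nearest neighbors
            dn = np.roll(u, -1, axis=0) - u
            ds = np.roll(u, 1, axis=0) - u
            de = np.roll(u, -1, axis=1) - u
            dw = np.roll(u, 1, axis=1) - u
            # Conduction coefficient c: decreases monotonically with gradient
            c = lambda d: np.exp(-((d / kappa) ** 2))
            # Discrete divergence of c times the gradient, scaled by step lam
            u += lam * (c(dn) * dn + c(ds) * ds + c(de) * de + c(dw) * dw)
        return u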
Frequency domain filters, including wavelet filters, may also be employed for similar effects, although these tend to be computationally more demanding because of the need for forward and inverse frequency transformations. An example
of speckle noise suppression with wavelet filtering of an ultrasound image is shown in Figure 8. When multiple images exist, spatial compounding or image averaging remains an effective technique for suppression of speckles in ultrasound images. For spatial compounding to be successful, the scan line patterns in images to be averaged must be different. This is often achieved by acquiring successive images (sector scans) by slightly perturbing the location of the apex of the sector. For 3D images, image preprocessing can be performed either slice-by-slice or on the 3D image itself with the use of 3D filter kernels. Iterative and frequency domain preprocessing techniques are generally complex for a 2D image. Their repeated application or extension to three dimensions can be extraordinarily time-consuming. Novel architectures for image preprocessing are an area of active research and development. Of special note is the field programmable gate array (FPGA)-based generalized architecture by Dandekar et al. [5] with an efficient voxel access scheme customized for neighborhood operations-based image preprocessing such as 3D median filtering and 3D anisotropic diffusion filtering.
Fig. 7 The anisotropic diffusion filter removes noise more effectively than an equivalent Gaussian filter.
Fig. 8 Example of wavelet-based filtering of an ultrasound image. [45]
6 Image Segmentation

Image segmentation is a class of computer vision algorithms that involves identifying and subdividing an image into its components, or constituent parts, called segments. Image segmentation can be thought of as assigning a label to every pixel or voxel in an image, such that pixels/voxels with the same label share a certain specific characteristic or computed property, such as color, intensity, or texture. Pixels/voxels with identical labels are grouped into segments, which typically correspond to the objects and boundaries (lines, curves, etc.) within images. Such identification of the image segments is often the first important step in performing more complex image analysis operations. Separating the image segments also aids effective visualization of complex image data by simplifying the image information into sets of spatially, structurally, and/or functionally correlated primitives.

In medical applications, image segmentation typically refers to the identification of known anatomic structures in images. Structures of interest include organs or parts of organs, vascular structures, skeletal components, or abnormalities such as tumors and cysts. Segmentation is a key enabler of detection/diagnosis (by locating tumors and other pathological conditions, measuring volumes of tissues and cavities, etc.), image-guided interventional procedures (by creating accurate 2D/3D anatomical maps specific to individual patients), and other applications such as staging and monitoring of treatment response.
6.1 Classification of Segmentation Techniques

A wide range of segmentation methods continues to be developed for addressing specific problems in medical applications. In contrast to generic segmentation methods, methods used for medical image segmentation are often application-specific; as such, they can make use of prior knowledge about the particular objects of interest and other expected or possible structures in the image. A commonly used approach to classifying segmentation methods is based on the image primitive being identified by a given method. Basic image primitives in this sense are pixels/voxels, regions/volumes, and edges/surfaces. Accordingly, segmentation techniques can be classified as

1. pixel/voxel identification methods – These techniques segment an image by labeling individual image pixels or voxels based on characteristics such as intensity, texture, etc. Groups of pixels/voxels with identical labels then form the different segments of the image.

2. region/volume identification methods – These techniques are based on identifying groups of pixels with identical characteristics. They are predominantly based on specific characteristics of regions, such as intensity and texture, and employ region-growing or splitting-and-merging methods to link
together regions with similar values of a given set of parameters to create the image segments of interest.

3. edge/surface identification methods – These techniques typically involve identifying boundaries in images. Edge/surface identification methods are typically based on detecting changes in the value of a given set of parameters between regions, e.g., a change in intensity between two regions. The segments of interest can be either the edges/surfaces themselves or the regions contained within or separated by these boundaries.

Region/volume-based methods and edge/surface-based methods can be used to identify the same image primitives. Edge/surface identification helps identify the surfaces/edges in an image and thus separates the volumes/regions contained within. Thus, these two approaches differ mainly in terms of methodology rather than the targeted image primitive. The choice of approach for a particular segmentation problem depends largely on the characteristics of the images involved and the ease and accuracy with which either set of primitives is identifiable.

Different image processing techniques can be adopted to segment by identifying the basic image primitives. These techniques are grouped into categories based on whether they utilize simple image characteristics (such as intensity and image gradient) and/or more sophisticated mathematical or statistical models. Of these, the two most common groups are the following:

1. Threshold-based techniques, which classify image segments (globally or locally) based on predefined criteria (e.g., intensity), where the information is represented by the intensity histogram.

2. Deformable model-based techniques, which involve (1) detecting and fitting segments in the image to an a priori known template, and (2) modifying the template to more accurately match the individual image. The templates can be parametric shapes, e.g., straight lines, circles, or ellipses, or special predefined shape templates, such as an atlas of the brain or a model of the left ventricle of the heart.

The next two sections give a brief overview of these two groups of techniques, which can be used to segment an image based on identification of any of the three basic image primitives noted above.
6.2 Threshold-based

Threshold-based algorithms rest on the assumption that the objects of interest in an image can be classified based on the values of specific characteristics, such as image intensity or gradient magnitude. These algorithms are ideal for pixel- and region-based segmentation because they essentially search for individual pixels or groups of pixels in an image based on some threshold criteria. An example of threshold-based segmentation is shown in Figure 9. For example, a basic algorithm to segment the background (B) out of an image f(i, j) generates an output image g(i, j) thus:
    for each pixel (i, j):
        if f(i, j) ≤ threshold value then
            g(i, j) = f(i, j)
        else
            g(i, j) = B

An image can be divided into two or more regions or segments by defining the number (single or multiple thresholds) and/or the value (point or band thresholding) of the thresholds. Thresholds can be selected manually, according to a priori knowledge, or automatically, through image information. Automatic threshold detection is generally based on the image intensity histogram or probability distribution function. These algorithms can perform segmentation in a single pass through all pixels or can be iterative. For iterative algorithms, the threshold values can be fixed for all iterations or made adaptive through analysis of the image histogram at every iteration. Threshold-based algorithms can also be useful in identifying structures depicted by edge/surface points. An example is the Canny edge detector, which uses a threshold on the gradient magnitude to find potential edge pixels.

Threshold-based techniques inherently do not account for spatial information, so their effectiveness is limited to images in which large chunks of pixels/voxels have similar characteristics. Also, these algorithms are very sensitive to image noise and partial volume effects, which, in medical imaging, refer to the scenario wherein insufficient image resolution causes anatomical/physiological information from different tissue types to mix within a single pixel/voxel. Due to a combination of these factors, the regions or edges detected by these algorithms can often be sets of discrete pixels and may therefore be incomplete or discontinuous. In such cases, it is necessary to apply post-processing, such as morphological operations, to connect the breaks or eliminate the holes. Morphological operations refer to the group of image processing operations that process images based on shapes. These operations determine the value of each pixel in the output image based on a comparison of the corresponding pixel in the input image with its neighbors. By appropriately selecting the size and shape of the neighborhood and the comparison criteria, it is possible to grow, shrink, merge, or split chunks of pixels in an image to connect or eliminate the holes and breaks.
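Returning to the basic thresholding scheme above, the per-pixel loop vectorizes naturally; a minimal NumPy sketch (the threshold and background label are illustrative) might read:

    import numpy as np

    def threshold_segment(f, threshold, background=0):
        """Keep pixels at or below the threshold; label all others background."""
        return np.where(f <= threshold, f, background)

A band threshold is the same one-liner with a compound condition, e.g., np.where((lo <= f) & (f <= hi), f, background).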
Fig. 9 Example of threshold-based segmentation. The top row shows short-axis views of the left ventricle of the heart at two different cardiac phases. The bottom row shows segmentation of the left ventricular wall based on a single threshold value.
Incorporating such morphological operations is also useful in automatic, histogram-based threshold detection in cases where it is difficult to identify significant peaks and valleys in the image histogram from which to define the threshold criteria.
6.3 Deformable Model-based

Segmentation algorithms using deformable models, also known as active contour models, form a class of connectivity-preserving techniques that provides a very reliable and robust approach to edge/surface-based image segmentation. Segmentation starts with placing, either manually or through automatic initialization, a template of the expected shape, modeled as a 2D contour or 3D surface, in the vicinity of the (perceived) true borders of the object of interest. A second step, called the energy minimization step, then refines the shape template, under image-derived and shape-integrity-preserving constraints, to make it snap to the true borders. The initial template can be defined parametrically (e.g., a line, circle, ellipse, sphere, etc.) or can be a more complex model that represents the object to be segmented (e.g., the left ventricle of the heart). The shape template is generally sampled as discrete points (vertices), each of which is moved during the iterative energy minimization step under the influence of the internal and external forces acting at that point. The internal forces are model/template-derived forces that maintain the integrity (smoothness, regularity) and overall shape of the template and are usually defined through geometric properties of the template, such as length, area, and curvature. The external forces are typically derived from the image information (such as the intensity distribution and the strength of edges) and drive the template toward edges or image gradients (Figure 10).
Fig. 10 Internal and external forces acting at a node of a deformable shape template.
Algorithms have also been developed that use external force equations inspired by other engineering disciplines, for example, heat diffusion from thermodynamics and viscous fluid flow from fluid dynamics. An individual iteration of template deformation starts with the calculation of the template-derived internal forces and the external force active at each vertex. The resultant sum of forces is applied to the corresponding vertex, and the locations of all vertices are then refreshed to obtain the next template shape and location. Deformable model-based segmentation is especially suitable for noisy images such as ultrasound, because these algorithms are robust to intensity dropouts and noise. However, initial placement of the template is a challenge with this approach, and this process remains mostly manual or semi-automatic at best. A priori knowledge about the image to be segmented can be used to automate the initial placement of the template. See Figure 11 and Figure 12.
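One deformation iteration of the kind just described can be sketched for a 2D contour as follows. The force weights are illustrative; the internal force is reduced to simple neighbor-midpoint smoothing, and the external force is taken as ascent on a precomputed gradient-magnitude image.

    import numpy as np

    def deform_step(vertices, grad_mag, alpha=0.5, beta=0.5, step=1.0):
        """vertices: (N, 2) closed-contour points in (row, col) order."""
        # Internal force: pull each vertex toward the midpoint of its neighbors
        internal = (np.roll(vertices, 1, axis=0)
                    + np.roll(vertices, -1, axis=0)) / 2.0 - vertices
        # External force: climb toward high gradient magnitude (strong edges)
        gr, gc = np.gradient(grad_mag)
        ij = np.clip(np.round(vertices).astype(int), 0,
                     np.array(grad_mag.shape) - 1)
        external = np.stack([gr[ij[:, 0], ij[:, 1]],
                             gc[ij[:, 0], ij[:, 1]]], axis=1)
        # Resultant force moves every vertex; iterate until the contour settles
        return vertices + step * (alpha * internal + beta * external)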
Fig. 11 Automatic deformable model-based segmentation scheme. (A) Left ventricle 3D mesh template initialized within the image. (B) Final result of segmentation obtained by internal and external force based refinement of the initialized template. (C) Automatically segmented volumetric shapes at different phases of the cardiac cycle, obtained by deformable model-based segmentation of each frame [46].

Fig. 12 Typical results of automatic segmentation of the left ventricular wall in a 3D echocardiography image. Standard cross-sectional views superimposed with endo- and epicardial borders (inner and outer wall boundaries, respectively) are shown. Top and bottom rows show two separate phases of the cardiac cycle in a typical case [43].
7 Image Registration

Image registration is the process of combining two images such that the features in one image are aligned with the features in the other. The first phase of most registration techniques is correcting for global image misalignment, called rigid registration. Nonrigid registration often follows, such that nonlinear local deformation from breathing or from changes over time is corrected. Such correction improves overlays, as shown in Figure 13. The image registration problem setup varies along several axes: dimensionality (2D versus 3D), extrinsic versus intrinsic markers, invasive fiducials (e.g., screw markers) versus noninvasive ones (e.g., molds, frames, dental adapters, skin markers), modality pairings (e.g., PET and CT), and subject pairings (inter-subject versus intra-subject). Despite the variation in inputs and problem setup, the result of a registration problem is a deformation field. This field contains the information necessary to deform all of the voxels in the floating image (the image that is manipulated) such that it aligns with the reference image (the image that stays unchanged). Image registration can be subject-specific, with a solution that is tailored to a specific part of the anatomy, but we will focus on general techniques here. Although some problem domains require registration of 2D images, the majority of registration tasks involve 3D images. The following description assumes 3D image registration.
Fig. 13 Image registration algorithms align the features of one image with the features in another, enabling physicians to properly overlay images for diagnosis or treatment.
7.1 Geometry-based

If landmarks can be identified, registration can be done by aligning the landmarks in space and then interpolating between and around those landmarks to produce a complete deformation field. Landmarks can be points, lines, surfaces, or regions, such as those that may be

• manually marked by experts,
• derived from fiducials attached to the subject, or
• discovered through segmentation as discussed in Section 6.

The quality of registration with geometry-based approaches depends primarily on finding the same landmarks in both images and being able to correlate those landmarks correctly. Furthermore, landmarks should be dispersed strategically around or in the areas of interest so that an interpolated deformation field will provide a meaningful correction to the regions of interest. For example, deriving fiducial locations from externally placed markers is an effective way of attaining surface alignment, but it does not in general provide accurate alignment of internal structures. Internal fiducials are able to align internal structures, but such markers are invasive and have limited clinical application.
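For point landmarks, the alignment step itself has a classical closed-form solution. The sketch below computes the least-squares rigid transform between matched 3D landmark sets using the SVD-based Kabsch method; interpolating a full deformation field around the landmarks is a separate step, not shown.

    import numpy as np

    def align_landmarks(src, dst):
        """Least-squares rigid transform mapping (N, 3) src points onto dst."""
        sc, dc = src.mean(axis=0), dst.mean(axis=0)
        H = (src - sc).T @ (dst - dc)           # cross-covariance of centered sets
        U, _, Vt = np.linalg.svd(H)
        d = np.sign(np.linalg.det(Vt.T @ U.T))  # guard against reflections
        R = Vt.T @ np.diag([1.0, 1.0, d]) @ U.T
        t = dc - R @ sc
        return R, t                             # dst is approximately src @ R.T + t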
7.2 Intensity-based

Intensity-based image registration algorithms rely on correlations between voxel intensities and are known to be robust but computationally intensive. Since landmarks are not explicitly provided, intensity-based approaches can utilize any number of approaches to model and apply deformation fields. Construction of the deformation field can start from just a few parameters in the case of rigid registration, or from a set of "control points" that capture the non-uniformity of nonrigid registration. The final transformation contains the information necessary to deform all of the voxels in the floating image. Once a transformation is constructed, it is applied to the floating image. The transformed image can then be compared to the reference image using a variety of similarity metrics, such as the mean squared difference (MSD) or mutual information. In iterative approaches, the similarity value is fed back to guide the search toward successively better solutions. Problem parameters may also change at run-time to improve speed and accuracy.

7.2.1 Similarity

The foundation of intensity-based image registration is the calculation of how similar two images are. The similarity calculation guides the transformation of the floating image and can be part of the convergence criteria. In an iterative intensity-based algorithm, the similarity calculation tends to dominate the execution time even for
simple similarity metrics, because of the time it takes to traverse and transform the images. A common complication in determining whether a transformed floating image is similar to a reference image is the problem of overlap. Since images and regions of interest have boundaries beyond which we often have no data, similarity measurements must account for solutions that lead to non-overlapping voxels. Similarity metrics are often normalized with respect to overlap by dividing the result by the number of overlapping voxels. This gives a per-voxel account of similarity, which mitigates some of the effects of changing overlap. One of the simpler similarity metrics is the sum of squared differences, in which intensities between images are assumed to be directly correlated; this can work well for single-modality registration. But since intensity values mean different things in different images, registration engines often apply more flexible (but more complex) similarity metrics such as cross-correlation and mutual information. Mutual information has been shown to be especially effective at producing robust similarity measurements for both multimodality and single-modality registration problems [24].
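The two metrics named here can be sketched compactly. The histogram-based mutual information below is a common discrete approximation (the bin count is illustrative); both sketches assume the images have already been resampled onto a common grid, so the overlap is the full array.

    import numpy as np

    def mean_squared_difference(ref, flt):
        """Per-voxel MSD; lower means more similar."""
        return np.mean((ref.astype(float) - flt.astype(float)) ** 2)

    def mutual_information(ref, flt, bins=64):
        """Histogram-based mutual information; higher means more similar."""
        joint, _, _ = np.histogram2d(ref.ravel(), flt.ravel(), bins=bins)
        pxy = joint / joint.sum()            # joint intensity distribution
        px = pxy.sum(axis=1, keepdims=True)  # marginal of the reference
        py = pxy.sum(axis=0, keepdims=True)  # marginal of the floating image
        nz = pxy > 0                         # avoid log(0)
        return float(np.sum(pxy[nz] * np.log(pxy[nz] / (px @ py)[nz])))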
7.3 Transformation Models
The model used for the transformation of the floating image to the reference image defines the problem space of registration. In medical imaging, these models fall into two general classes: rigid and nonrigid registration. In rigid registration, the objects are assumed to be misaligned in a uniform fashion. Rigid registration corrects uniform misalignment, typically through translation and rotation, which can be represented as a transformation matrix that is uniformly applied to each voxel in the image. This model works well for registration problems that focus on anatomical features that do not change with time or position, like bone. But for soft tissues that may change with respiration, position, or time, registration must be more flexible. Nonrigid registration accounts for misalignment due to features that are not of fixed size or shape. Such deformations may occur over months, as with the growth of a tumor, or in seconds, as in an operating room in which respiration is deforming the liver. Many organs in the body have full freedom to move in any direction and to assume many possible shapes, which makes the number of possible deformation fields limitless. To make the registration problem tractable, transformation models often derive deformation fields from a finite number of parameters. This may take the form of a free-form deformation grid, in which a subset of points in the image space, called "control points," each influence just a local region [30]. The deformation at any given point can be determined from the control points in its immediate neighborhood. Subvolume-based nonrigid registration [5] divides the problem into a hierarchical tree of smaller and smaller rigid registration subproblems. The final image is interpolated from the final values of the local subvolume registration solutions.
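For the rigid case, the voxel-wise application of the transformation matrix amounts to resampling the floating image under a rotation and translation. A hedged sketch using SciPy's ndimage resampler follows; the function name and conventions are ours, and interpolation order, boundary handling, and coordinate conventions would need to match the actual registration engine.

    import numpy as np
    from scipy.ndimage import affine_transform

    def apply_rigid(floating, R, t):
        # Resample the floating image under a rigid transform (rotation R,
        # translation t, in voxel units).  affine_transform maps each output
        # voxel back into the input image, so the inverse transform is used.
        R_inv = R.T                    # inverse of a rotation is its transpose
        offset = -R_inv @ np.asarray(t)
        return affine_transform(floating, R_inv, offset=offset, order=1)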
Fig. 14 Image visualization pipeline for surface or volume rendering of medical images.
7.3.1 Optimization Algorithms
Arriving at an optimal similarity for a registration problem can require many steps in iterative implementations. Since the cost of each step is significant, image registration optimization engines are tailored to converge quickly based on the similarity metric and transformation model used. Slope-based optimization engines, like Newton's method or gradient descent, use slopes calculated analytically or via finite differences to guide the search towards a good transformation. These greatly reduce the number of steps needed to converge, but often suffer in non-linear and noisy problem spaces. In such cases, non-slope-based methods like Simplex [21] and Powell's method [25] provide good convergence in the face of noise. They require more steps to converge in ideal problem spaces, but prove to be more robust in real-world cases.
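A derivative-free optimization loop of the kind described can be sketched with SciPy's generic minimizer; the cost function, the parameterization (three rotations and three translations), and the tolerances below are illustrative assumptions, not the engines discussed in the text.

    import numpy as np
    from scipy.optimize import minimize

    def register_rigid(ref, flo, cost):
        # cost(ref, flo, p) maps a 6-vector of rigid parameters to a
        # dissimilarity value (e.g., negative mutual information).
        x0 = np.zeros(6)                          # start from the identity
        res = minimize(lambda p: cost(ref, flo, p), x0,
                       method='Nelder-Mead',      # or method='Powell'
                       options={'xatol': 1e-3, 'maxiter': 500})
        return res.x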
8 Image Visualization
Before medical images went digital, they were printed on films, which the physicians viewed and interpreted. In the early days of medical imaging, even a volumetric exam comprised relatively few image slices, making this mode of viewing acceptable. This paradigm of slice viewing continued, although in a filmless fashion, even after the advent of digital images. However, the growth in the number of slices has made slice viewing challenging because of limited screen space and the human inability to review such a vast amount of information simultaneously. New visualization techniques are addressing the problems of the older paradigm and enabling newer viewing options. Chief among these is 3D visualization of the entire image set and, more recently, 4D visualization, where physicians may see not only a rendering of the 3D anatomy contained in the images but also changes in the anatomy over time. There are two primary options for image visualization: surface rendering and volume rendering. Such visualization is enabled by a processing pipeline like the one shown in Figure 14.
8.1 Surface Rendering
If we had our image volume represented as surfaces, we would have an efficient representation for 3D visualization, as the surfaces in the volume would map well to polygons rendered by graphics processing units (GPUs). Constructing a surface from a 3D set of voxels can be a complex task, and, as discussed before, sophisticated segmentation techniques can be used to create surfaces. In interactive viewing applications, the use of threshold-based isosurfacing techniques such as marching cubes [20] is common before surface rendering. An isosurface is a surface constructed from a set of voxels with the same intensity, and techniques like marching cubes convert a discrete set of voxels to a continuous surface. Multiple thresholds may be used to construct several isosurfaces. To match existing GPU paradigms, surfaces tend to be modeled as polygons, which then go through the traditional graphics processing pipeline of texture/color selection, shading, and rasterizing. An example of surface rendering is shown in Figure 15.
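For illustration, isosurface extraction of the kind performed by marching cubes is available in common toolkits; the snippet below uses scikit-image on a synthetic volume, with the threshold value chosen arbitrarily for the example.

    import numpy as np
    from skimage import measure

    # Synthetic volume standing in for a scan: a sphere of high intensity.
    z, y, x = np.mgrid[-32:32, -32:32, -32:32]
    volume = (1000.0 * (np.sqrt(x**2 + y**2 + z**2) < 20)).astype(np.float32)

    # Extract the isosurface at a chosen intensity threshold.
    verts, faces, normals, values = measure.marching_cubes(volume, level=300.0)
    # verts/faces form a triangle mesh ready for a GPU rendering pipeline
    # (texturing, shading, rasterization).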
8.2 Volume Rendering
To visualize an object of interest in 3D space, it is not necessary to identify the voxels of interest and make the associated assumptions of surface rendering. An alternative to surface rendering, volume rendering is a more brute-force, albeit computationally demanding, approach to 3D visualization. In volume rendering, all voxels are assigned to one or more tissue types based on their intensity. Fuzzy memberships are allowed, such that a particular voxel may belong to two classes (e.g., 60% muscle and 40% fat). Such classification techniques are less error-prone and eliminate the need for the single-intensity isosurfacing techniques of surface rendering. Associated with each tissue type is also a color and an opacity (or translucency). These can be controlled to hide or highlight different intensity ranges or tissue types. In fact, most visualization systems offer predefined color and opacity functions (called presets) to achieve this effect. After color and opacity assignments and voxel-wise shading calculation, volume rendering creates a composite projection image from the 3D set of voxels. Ray tracing techniques are common; the primary computational challenge is to resample the 3D image along each ray at a fine sampling interval and then to sum the resampled, colored, and shaded intensities weighted by the opacity values. Common optimizations such as early ray termination (stopping when cumulative opacity reaches 1) have been implemented. Another significant acceleration technique is shear-warp factorization [18], which allows compositing along the data axis as opposed to the camera axis. An example of volume rendering is shown in Figure 15.
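The compositing step can be sketched as follows for rays aligned with the first data axis (the shear-warp-friendly case); the transfer-function tables and termination threshold are illustrative assumptions.

    import numpy as np

    def composite_along_axis(volume, color, opacity):
        # volume: uint8 array (depth, h, w); color/opacity: 256-entry
        # transfer-function lookup tables (per-intensity color and alpha).
        depth, h, w = volume.shape
        out = np.zeros((h, w))
        alpha = np.zeros((h, w))
        for z in range(depth):              # front-to-back along the data axis
            a = opacity[volume[z]]
            c = color[volume[z]]
            out += (1.0 - alpha) * a * c    # "over" compositing
            alpha += (1.0 - alpha) * a
            if np.all(alpha >= 0.99):       # early termination once all rays
                break                       # are effectively opaque
        return out

A production renderer would terminate rays individually and add shading, but the accumulation pattern is the same.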
Fig. 15 Example of surface rendering (left) and volume rendering (right).
9 Medical Image Processing Platforms
Medical image processing is potentially a time-demanding task. Time constraints arise in a variety of clinical scenarios. The time from acquisition to when physicians must have the information contained in those images may be a matter of minutes in the case of shock trauma injury, in which diagnosis and treatment plans must be made immediately, or less than a second in the case of image-guided surgery, in which decisions about a patient must be made in real time. Even when these time demands are relaxed, as when planning a procedure days after the image is acquired, the practical issue of case volume implies that any image processing must still complete in a reasonable amount of time. Many automatic, robust, and effective medical image processing algorithms are iterative in nature. Examples discussed here include iterative reconstruction from noisy projections, deformable model segmentation, and intensity-based image registration. These tend to be computationally intensive, typically taking minutes to hours on high-end processors. Such complexity has inhibited the adoption of these technologies in the clinical workflow. While specific requirements vary from application to application, the computation time must be on the order of seconds to be viable in most image-guided intervention scenarios.
9.1 Computing Requirements
Computational and memory requirements are driven by the size of the data that must be processed and stored. These can vary widely by the particular medical image processing scenario, so we will attempt to quantify the gamut here. Voxel intensity (sometimes referred to as gray-scale value) is typically stored as 8-bit or
16-bit fixed-point data, with signedness and endianness varying across equipment. Scanning equipment precision often does not produce byte-aligned values, but because of computing considerations, all images are stored in this format. The number of voxels in a volume depends on the modality and the equipment used, but these images can vary in size from 256 × 256 8-bit voxels (e.g., a moderate viewing size of a low-resolution 2D ultrasound) to 512 × 512 × 512 16-bit voxels (e.g., a full-body scan with a high-resolution CT). Therefore, the memory requirements for image processing platforms may be light, requiring 64 KB, or they may be significant, requiring 268 MB or more. The archiving and general storage of these images are done with Picture Archiving and Communication Systems (PACS). While many stored images are compressed, image processing often requires random access to the volume, so most images to be processed are stored in a raw, uncompressed form. In medical imaging, these values are trending upward rapidly, with higher-resolution scanners and increasing interest in spatiotemporal 4D data, in which each frame consists of a 3D volume. A growing majority of hospitals store and transmit medical images digitally. Hospital networks are designed with the protocols and bandwidth necessary to move such large images efficiently. Medical image processing solutions are able to utilize this protocol and infrastructure to integrate with existing imaging systems. Interoperability across products and vendors is achieved through the Digital Imaging and Communications in Medicine (DICOM) standard [22]. DICOM describes how image data is organized and how it may be compressed, as well as a header format with numerous fields, including patient identifiers, scanning equipment, image dimensions, and voxel sizes. The DICOM standard also describes communication protocols for querying and retrieving information. Open source releases of the DICOM standard have made it possible for many software imaging products to be both consumers of input images and producers of processed images.
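For illustration, the header fields and pixel data of a DICOM file can be inspected with the open-source pydicom package; the file name below is hypothetical.

    import pydicom

    # Read one slice and inspect header fields relevant to memory sizing.
    ds = pydicom.dcmread("slice_0001.dcm")        # hypothetical file name
    print(ds.Rows, ds.Columns, ds.BitsAllocated)  # e.g., 512 512 16
    print(ds.PixelSpacing)                        # in-plane voxel size (mm)
    img = ds.pixel_array                          # decoded numpy array
    print(img.nbytes)                             # raw memory footprint, bytes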
9.2 Platforms For Accelerated Implementation
As with other computationally intensive application domains, medical imaging applications are seeking future performance boosts from multicore platforms. There is a diversity of multicore platforms available [2], and most computationally intensive medical image processing algorithms can be readily accelerated by exploiting different styles of parallelism on these different platforms. Traditionally, medical imaging has exploited high-level parallelism through standards like the Message Passing Interface (MPI). MPI is a popular standard for explicitly parallelizing code on multiprocessor systems. MPI processes have local memory, which maps readily onto distributed-memory platforms such as clusters. MPI is well suited to exploiting higher levels of parallelism in the application domain. For example, MPI can be employed for medical image registration, in which the gradient computation can be distributed equally across nodes in a small cluster [14]. This distribution is possible by virtue of the fact that the finite difference calculation for each control point
is independent and requires only the neighboring voxels and control points to calculate. General-purpose descriptions like MPI have been shown to be effective for acceleration, but domain-specific approaches are also directions of new research. A dataflow-based description of an application unambiguously describes its behavior with enough structure for analysis and optimization on the parallel platforms described in the previous section [32]. Using such a description in conjunction with a model-based multiprocessor interface such as the Signal Passing Interface (SPI) could expose more parallelism in a domain-specific fashion, permitting more analysis and optimization of issues such as communication minimization, buffer reduction, and load balancing [31]. The increasing programmability of GPUs has opened them up to many other applications, including image reconstruction, image registration, and visualization. As with MPI, the key to efficient implementation on a GPU is finding and exploiting application parallelism. GPUs readily accelerate finer-grained parallelism than MPI. For example, a single similarity calculation can be vectorized to take advantage of the arrays of processing elements in a single GPU [34]. With hardware description languages (HDLs), the final implementation is not destined for a processor, so designers lay out their application structurally, exposing interfaces and cycle-by-cycle control. An FPGA can be used to accelerate the voxel processing of the nonrigid transformation and the similarity calculation [4][5]. The independence of tasks allows many memory accesses, operations, and I/O to be performed in the same clock cycle.
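The cluster-based distribution of the gradient computation described above can be sketched with mpi4py; n_control_points and finite_difference() are placeholders standing in for the registration engine's own data and kernel.

    from mpi4py import MPI
    import numpy as np

    comm = MPI.COMM_WORLD
    rank, size = comm.Get_rank(), comm.Get_size()

    n_control_points = 1024                 # placeholder problem size

    def finite_difference(cp):
        # Placeholder for the per-control-point similarity gradient, which
        # needs only the voxels and control points near control point cp.
        return 0.0

    # Each rank takes a contiguous share of the control points.
    my_points = np.array_split(np.arange(n_control_points), size)[rank]
    local_grad = np.zeros(n_control_points)
    for cp in my_points:
        local_grad[cp] = finite_difference(cp)

    # Sum the partial gradients so every rank holds the full gradient vector.
    grad = np.zeros_like(local_grad)
    comm.Allreduce(local_grad, grad, op=MPI.SUM)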
10 Summary
Medical image processing is a unique instance of digital signal processing. Like computer vision applications, it requires a mix of traditional DSP-style kernels for certain types of processing and more complex, sometimes control-oriented processing for higher-level constructs. Unlike traditional image processing, medical images are often 3D and larger, increasing the processing requirements. Medical image processing applications have unique requirements for speed and quality, which creates a challenging problem space for arriving at clinically viable implementations. The medical image processing pipeline starts with capturing raw sensor readings, which are reconstructed into 2D or 3D images, which may then be filtered, segmented, registered, and visualized. Each stage of this pipeline requires high-performance implementations that are often embedded components in a piece of imaging equipment or dedicated viewing workstations. The computational demands of these applications continue to grow with the resolution of the scanners and the volume of the case load. New solutions are needed to enable new algorithms and new approaches to improve the speed and quality of the results presented to physicians. Furthermore, we expect the platforms that deliver this will in turn enable more study into diagnostic and interventional uses
of image processing by the medical imaging community. Improved software and implementations should usher in a new generation of medical imaging techniques.
11 Further Reading
This chapter provided a comprehensive overview of medical image processing and introduced the reader to commonly occurring tasks that users of medical images perform. Such a presentation naturally did not permit an in-depth treatment of the individual topics. We refer the reader to various sources to acquire a greater understanding of the various topics covered in this chapter. The book by Prince and Links [26] is recommended for learning more about medical imaging modalities. This book goes into the details of underlying physics, instrumentation, and image reconstruction, and gives equal emphasis to each aspect and each modality. For a rigorous mathematical treatment of image reconstruction, the classic book by Kak and Slaney [16] still stands out. A comprehensive three-volume medical imaging handbook [3] [40] [17] is recommended to learn more about virtually all the topics presented in this chapter. The second volume, which covers image processing and analysis, and the third volume, which focuses on visualization, are particularly recommended for the image segmentation, image registration, and visualization topics. Books dedicated to each of the individual topics are also available. A detailed theoretical description of segmentation concepts can be found in the two handbooks on medical image analysis [1] and [41]. A good resource for reading and understanding more details about image thresholding is [9], whereas more information about deformable model segmentation can be found in [37]. We recommend [12] for image registration and [15] for visualization. Unfortunately, no good resources to our knowledge exist that examine the computing aspects and computing challenges of medical image processing in a unified fashion. Much of the material remains scattered, and the best way to acquaint oneself with these topics is through research papers. Indeed, this handbook is an attempt to organize and systematically present such material.
References
1. Bankman IN. Handbook of Medical Image Processing and Analysis (Part II), Elsevier Science and Technology Books, Ed 2, 2008.
2. Blake G, Dreslinski RG, and Mudge T, A survey of multicore processors, IEEE Signal Processing Magazine, vol. 26, no. 6, pp. 26–37, Nov. 2009.
3. Beutel J, Kundel HL, Van Metter R. Handbook of medical imaging. Volume 1: Physics and psychophysics. SPIE Press, Bellingham, WA, 2000.
4. Castro-Pareja CR and Shekhar R, Hardware acceleration of mutual information-based 3D image registration, Journal of Imaging Science and Technology, vol. 49(2), pp. 105-113, 2005.
5. Dandekar O, Castro-Pareja C, Shekhar R, FPGA-based real-time 3D image preprocessing for image-guided medical interventions, Journal of Real-Time Image Processing, vol. 1(4), pp. 285-301, 2007.
6. Dhawan AP. Medical Image Analysis. John Wiley & Sons, Hoboken, NJ, 2003.
7. Feldkamp LA, Davis LC, and Kress JW, "Practical cone-beam algorithm," J. Opt. Soc. Am. A, vol. 1, pp. 612-619, 1984.
8. Frost and Sullivan Briefing, The Potential for Radiology Informatics: The Next Big Wave since Anesthetics Discovery in Medicine, August 5, 2009.
9. Gonzalez RC and Woods RE. Digital Image Processing, Prentice Hall, Ed. 3, 2008.
10. Goldman LW. Principles of CT: multislice CT. J Nucl Med Technol, vol. 36(2), pp. 57-68, 2008.
11. Guttmann C, Benson R, Warfield S, Wei X, Anderson M, Hall C, Abu-Hasaballah K, Mugler J, Wolfson L. White matter abnormalities in mobility-impaired older persons. Neurology, 54(6):1277-83, 2000. PMID: 10746598.
12. Hajnal JV, Hawkes DJ, Hill DLG. Medical image registration. CRC Press, Boca Raton, FL, 2001.
13. Hudson HM, Larkin RS. Accelerated image reconstruction using ordered subsets of projection data. IEEE Trans Med Imaging, 13(4):601-9, 1994.
14. Ino F, Ooyama K, and Hagihara K, A data distributed parallel algorithm for nonrigid image registration, Parallel Computing, vol. 31(1), pp. 19–43, 2005.
15. Kaufman A. Volume visualization. IEEE Computer Society Press, Los Alamitos, CA, 1990.
16. Kak AC, Slaney M. Principles of computerized tomographic imaging. IEEE Press, Piscataway, NJ, 1988.
17. Kim Y, Horii SC. Handbook of medical imaging. Volume 3: Display and PACS. SPIE Press, Bellingham, WA, 2000.
18. Lacroute P, Levoy M. Fast volume rendering using a shear-warp factorization of the viewing transformation. Proceedings of the 21st annual conference on Computer graphics and interactive techniques, pp. 451-458, 1994.
19. Le Bihan D, Mangin JF, Poupon C, Clark CA, Pappata S, Molko N, Chabriat H. Diffusion tensor imaging: concepts and applications. J Magn Reson Imaging, 13(4):534-46, 2001.
20. Lorensen WE, Cline HE. Marching cubes: A high resolution 3D surface construction algorithm. Proceedings of the 14th annual conference on Computer graphics and interactive techniques, pp. 163-169, 1987.
21. Nelder JA and Mead R, A simplex method for function minimization, The Computer Journal, vol. 7, pp. 308-313, 1964.
22. NEMA, Digital Imaging and Communications in Medicine, Part 1-15. NEMA Standards Publication PS3.X, 2000.
23. Perona P, Malik J. Scale-space and edge detection using anisotropic diffusion. IEEE Transactions on Pattern Analysis and Machine Intelligence, 12(7):629-639, 1990.
24. Pluim JPW, Maintz JBA, Viergever MA, Mutual-information-based registration of medical images: a survey, IEEE Transactions on Medical Imaging, vol. 22(8), pp. 986-1004, August 2003.
25. Powell MJD, An efficient method for finding the minimum of a function of several variables without calculating derivatives. The Computer Journal, vol. 7, pp. 896-910, 1964.
26. Prince JL, Links JM. Medical imaging signals and systems. Pearson Prentice Hall, Upper Saddle River, NJ, 2006.
27. Ramm OT, Smith SW, and Pavy HGJ. High-speed ultrasound volumetric imaging system–II: Parallel processing and image display. IEEE Trans Ultrason Ferroelectr Freq Control, 38:109-115, 1991.
28. Ranganath MV, Dhawan AP, Mullani N. A multigrid expectation maximization reconstruction algorithm for positron emission tomography. IEEE Trans Med Imaging, vol. 7(4), pp. 273-8, 1988.
29. Ross B, Kreis R, Ernst T. Clinical tools for the 90s: magnetic resonance spectroscopy and metabolite imaging. Eur J Radiol, 14(2):128-40, 1992.
30. Rueckert D, Sonoda LI, et al. "Nonrigid registration using free-form deformations: application to breast MR images," IEEE Transactions on Medical Imaging, vol. 18(8), pp. 712-721, 1999.
31. Saha S, Puthenpurayil S, Schlessman J, Bhattacharyya SS, and Wolf W. The signal passing interface and its application to embedded implementation of smart camera applications. Proceedings of the IEEE, 96(10):1576-1587, October 2008.
32. Sen M, Hemaraj Y, Plishker W, Shekhar R, and Bhattacharyya SS. Model-based mapping of reconfigurable image registration on FPGA platforms. Journal of Real-Time Image Processing, 2008.
33. Salgo IS. Three-dimensional echocardiographic technology. Cardiol Clin, 25:231-239, 2007.
34. Shams R and Barnes N, Speeding up mutual information computation using NVIDIA CUDA hardware, in Proc. Digital Image Computing: Techniques and Applications (DICTA), Adelaide, Australia, pp. 555–560, 2007.
35. Shekhar R, Walimbe V, Raja S, Zagrodsky V, Kanvinde M, Wu G, Bybel B, "Automated Three-Dimensional Elastic Registration of Whole-Body PET and CT from Separate or Combined Scanners," Journal of Nuclear Medicine, 46(9):1488-96, 2005.
36. Shepp LA, Vardi Y. Maximum likelihood reconstruction for emission tomography. IEEE Trans Med Imaging, 1(2):113-22, 1982.
37. Singh A, Terzopoulos D, Goldgof DB. Deformable Models in Medical Image Analysis, 1998.
38. Sikdar S, Managuli R, Gong L, Shamdasani V, Mitake T, Hayashi T, Kim Y. A single mediaprocessor-based programmable ultrasound system. IEEE Trans Inf Technol Biomed, 7(1):64-70, 2003.
39. Smith SW, Pavy HGJ, and von Ramm OT. High-speed ultrasound volumetric imaging system–I: Transducer design and beam steering. IEEE Trans Ultrason Ferroelectr Freq Control, 38:100-108, 1991.
40. Sonka M, Fitzpatrick JM. Handbook of medical imaging. Volume 2: Medical image processing and analysis. SPIE Press, Bellingham, WA, 2000.
41. Suri JS, Setarehdan SK, Singh S. Advanced Algorithmic Approaches to Medical Image Segmentation: State Of The Art Applications in Cardiology, Neurology, Mammography and Pathology, Springer, Ed 1, 2002.
42. Wester HJ. Nuclear imaging probes: from bench to bedside. Clin Cancer Res, vol. 13(12):3470-81, 2007.
43. Walimbe V, Zagrodsky V, Shekhar R, Fully automatic segmentation of left ventricular myocardium in real-time three-dimensional echocardiography, Proceedings of SPIE (Medical Imaging 2006, San Diego, California, USA), 2006.
44. Walimbe V, Garcia M, Lalude O, Thomas J, Shekhar R, Quantitative real-time three-dimensional stress echocardiography: A preliminary investigation of feasibility and effectiveness. Journal of the American Society of Echocardiography, 20(1):13-22, 2007.
45. Zagrodsky V, Phelan M, Shekhar R, Automated detection of a blood pool in ultrasound images of abdominal trauma, Ultrasound in Medicine and Biology, vol. 33(11), pp. 1720-1726, 2007.
46. Zagrodsky V, Walimbe V, Castro-Pareja CR, Qin J, Song JM, Shekhar R, Registration-assisted segmentation of real-time 3D echocardiographic data using deformable models, IEEE Transactions on Medical Imaging, 24(9):1089-99, 2005.
Signal Processing for Audio HCI
Dmitry N. Zotkin and Ramani Duraiswami
Abstract This chapter reviews recent advances in computer audio processing from the viewpoint of improving the human-computer interface. Microphone arrays are described as basic tools for untethered audio acquisition, and principles for the synthesis of realistic virtual audio are outlined. The influence of room acoustics on audio acquisition and production is also considered. The chapter finishes with a review of several relevant signal processing systems, including a fast head-related transfer function (HRTF) measurement system and a complete system for capture, visualization, and reproduction of auditory scenes.
1 Introduction
Audio signal processing plays a major role in enabling natural human-computer interaction (HCI). In many daily activities, our natural sound processing abilities are taken for granted. Yet some of them are quite powerful, such as acoustic source localization, selective attention to one acoustic stream out of many, and event detection. It can be envisioned that adding a capability for spatially selective sound processing to a human-computer interface would be beneficial since the system would be able to focus closely on the operator's speech amidst the crowded and noisy background. It would also make possible interaction over longer distances enabling
capabilities such as allowing an operator to move freely in the room instead of being tethered to the computer by a microphone. A basic tool for spatially selective auditory processing is a microphone array. We introduce the array concept, explain the principles of operation, and provide comments on the practical implementation and use of microphone arrays for HCI. Another application of signal processing for HCI is the synthesis of virtual auditory scenes. The goal of the process is to create an acoustic impression for users as if they were present in the real audio scene. To do that, it is enough to re-create the acoustic signals that would be heard by the user in the real scene. We will review the design principles for a virtual auditory scene synthesis system from the signal processing perspective, describe problems that have to be solved in order for the rendering to be convincing, and outline the solutions available. Numerous applications of the technology being described are possible in HCI and other areas. These include audio user interfaces (including special interfaces for low-vision users), remote collaboration applications, auditory scene analysis, remote education and training, entertainment, surveillance, and many others. We conclude the chapter with the description of several practical applications and outline the signal processing mechanisms involved. These include a system for head-related transfer function measurement and a system for capture, reproduction, and visualization of arbitrary acoustic scenes for a human listener.
2 Microphone Arrays In the most generic terms, a microphone array is a collection of several spatially separated microphones [1]. (Closely related to this area is the general domain of array processing for applications in wireless communication, radar, and sonar. However, because of the difference in the wave speed, and because of the predominance of multipath and reverberation in auditory scenes, the signal processing techniques used here are somewhat different.) Because of the spatial separation between the sensors, an audio signal from a given source will reach different microphones at different times. The signal magnitude can also be different at different microphones, either due to natural signal decay with distance or due to interactions between the signal and the mechanical structures of the microphone array. These timing and strength differences in the signals recorded at different microphones are detected and used for further signal processing and for making inferences about the source, the source position, or the environment characteristics.
2.1 Source Localization
In source localization, one is presented with a collection of signals recorded at several microphones, and the task is to estimate the position in space
that best explains the observations. Three broad classes of localization algorithms exist: i) time difference of arrival (TDOA) based; ii) narrowband spectral estimation based; and iii) steered response power based. Some approaches combine elements from more than one class. In TDOA-based estimation, a time delay estimation algorithm [2] is used first to determine the time delays between channels. Various weighting schemes exist for the generalized cross-correlation algorithm [3]. One well-known scheme is the phase transform (PHAT), where magnitude information is discarded and only phase information is used for TDOA determination. This particular weighting was shown to be robust to reverberant noise and is often used in practical applications. Once TDOAs are known, they are used to infer the sound source location by finding the position in 2-D or 3-D space that best explains the observed TDOAs (e.g., by least squares or by intersection of the surfaces formed by each TDOA) [4]. Some TDOA grouping (e.g., to remove outliers, values that are inconsistent with the rest) can also be performed. TDOA-based localization is fast and reasonably accurate in low noise; however, the major problem is the loss of information incurred when retaining only the position of the highest cross-correlation peak. Adaptations of the technique to the specific characteristics of speech sources have also been made (e.g., [5]). The second class of algorithms operates by computing the narrowband spatial correlation matrix between channels [6]. Such an approach is well suited when it is in fact known that the signal of interest is narrowband (as in sonar and radar applications). The dominant acoustic source manifests itself as a dominant eigenvalue of the spatial correlation matrix. Further, multiple sources can be detected and resolved. If the signal is broadband, the analysis can be performed in many frequency bands and the results merged together via a consistency checker, as in the MUSIC algorithm [7]. The spatial correlation method works well for narrowband signals even in the presence of significant noise. The disadvantage of the method is its high computational cost; because of that, it is not practical for implementation with broadband (e.g., speech) signals. The third group of methods, called steered response power (SRP) methods, employs repetitive beamforming in many directions and searches for the peak in signal power as a function of direction [8]. The concept of beamforming is defined in the next subsection; briefly, it allows the system to focus "acoustic attention" at a specific location, amplifying the acoustic signal that originates there or nearby. In essence, the focus of the system is moved through many positions in space, and a peak in acoustic power at a certain location means that an active source must be located there. In an original implementation of the algorithm, an "energy map," the distribution of acoustic energy in space, was computed at a regular grid of points covering the space of interest, and acoustic sources were found as the peaks in the energy map. Later it was realized that the peaks in the map have substantial width, and therefore faster search mechanisms such as hierarchical refinement [9], gradient ascent [10], or stochastic region contraction [11] can be used. Also, the power computation can be done on the fly as necessary during the search. The method can also be optimized to operate on pre-computed cross-correlation
vectors between channel pairs. The SRP method is more computationally expensive than TDOA but is also more robust.
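For reference, the PHAT-weighted generalized cross-correlation used for TDOA estimation can be sketched in a few lines; this is a generic textbook formulation, not code from the systems cited.

    import numpy as np

    def gcc_phat(x1, x2, fs, max_tau=None):
        # TDOA between two channels via PHAT-weighted generalized
        # cross-correlation: discard magnitude, keep phase, pick the peak.
        n = len(x1) + len(x2)
        cross = np.fft.rfft(x1, n=n) * np.conj(np.fft.rfft(x2, n=n))
        cross /= np.abs(cross) + 1e-12            # PHAT weighting
        cc = np.fft.irfft(cross, n=n)
        max_shift = n // 2 if max_tau is None else int(fs * max_tau)
        cc = np.concatenate((cc[-max_shift:], cc[:max_shift + 1]))
        shift = np.argmax(np.abs(cc)) - max_shift
        return shift / fs                         # TDOA in seconds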
2.2 Beamforming
The concept of beamforming refers to spatially selective sound capture. In beamforming, the location of interest is assumed to be known, and the goal is to amplify the sound originating from that location in comparison with interfering sounds produced elsewhere. Often one performs sound localization first, followed by beamforming in the direction returned by the localizer, though as described before, the localization itself may be done using a repeated beamforming procedure. As described before, the acoustic signal from a given location in space arrives at different microphones at different time instants and possibly with different magnitudes. The simplest "delay-and-sum" beamformer computes the set of time delays associated with the desired location, delays the signals in each channel to undo those delays, and sums up the results. As a consequence, the copies of the signal originating from the desired location sum up in phase, while those from other locations add incoherently. A generic beamformer multiplies the input signals by complex weights and sums up the results. The two main classes of beamforming algorithms are data-independent and statistically optimum beamformers. In the first class, the weights are determined in advance and are derived from the specified beampattern; in the second class, the weights are adapted on the fly, attempting to actively suppress interference arriving from possibly unknown directions. A vast body of literature exists covering various beamforming techniques, and it is impossible to cover all aspects in a short space. An introduction to the earlier literature may be found in the review article [12], while a more speech-oriented introduction may be found in [1]. We will describe a particular beamforming technique for a spherical microphone array in a later section describing a spatial audio capture and reproduction system.
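A minimal far-field delay-and-sum beamformer might look as follows; the sign convention for the steering delays and the frequency-domain fractional-delay trick are implementation choices, and real systems add windowing and channel weighting on top of this.

    import numpy as np

    def delay_and_sum(signals, mic_pos, look_dir, fs, c=343.0):
        # signals: (n_mics, n_samples); mic_pos: (n_mics, 3) in meters;
        # look_dir: unit vector pointing from the array toward the source.
        n_mics, n_samples = signals.shape
        # Mics closer to the source hear the wavefront earlier.
        delays = -(mic_pos @ look_dir) / c
        delays -= delays.min()                  # make all delays non-negative
        freqs = np.fft.rfftfreq(n_samples, d=1.0 / fs)
        out = np.zeros(len(freqs), dtype=complex)
        for m in range(n_mics):
            # Undo each channel's delay with a linear phase shift, then sum.
            out += np.fft.rfft(signals[m]) * np.exp(2j * np.pi * freqs * delays[m])
        return np.fft.irfft(out, n=n_samples) / n_mics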
2.3 Practical Issues The array design is sensitive to a large number of issues, and tradeoffs between noise robustness, beam width, sidelobe magnitudes, interference suppression ability, and other parameters are necessary. First and foremost, the array can be constructed in various physical shapes. The most well-studied configuration is probably the linear equispaced array. Obviously such a localizer is able to provide only one angular output (the bearing), and any realized beampattern must be symmetrical with respect to the array axis. Non-equispaced arrays can be used to obtain satisfactory performance over a wider frequency range. A planar array is able to provide two angular directions (azimuth and elevation). Volumetric arrays are also often used
(for example, microphones can be arranged to lie on the surface of a virtual or a real sphere). Because a spherical arrangement is symmetrical, the array characteristics are independent of the steering direction. Mounting microphones on a solid sphere allows one to avoid certain numerical problems associated with open-air spherical configurations and to widen the useful frequency band of the array [13]. Finally, a relatively new research direction is to use numerical simulations of wave propagation to perform localization and beamforming using microphones mounted on arbitrarily shaped supports. The physical array structure also imposes limits on the performance of the array. The high-frequency performance limit is set by spatial aliasing; the distance between microphones in the array must not be greater than half the wavelength at any operating frequency. For example, with a sound speed of about 343 m/s, a 5 cm microphone spacing places the aliasing limit near 3.4 kHz. In practice, though, it is observed that the desired beampattern degrades rather gracefully above the spatial aliasing frequency. The low-frequency limit, conversely, is related to the size of the whole array relative to the wavelength: if the distance between microphones is small compared to the wavelength, the phase differences between the recorded signals are very small and can be lost in quantization or equipment noise. The localization and beamforming accuracy also depend on the array shape. As a general rule, a larger spacing between microphones generates larger signal differences, which gives more reliable position information. It is also generally hard to obtain source range with an array confined to a small area of space for the same reason (signals arriving from sources at various ranges differ only marginally). A large array, or two small arrays located far from each other, can give much more reliable three-dimensional localization information and do better range discrimination in beamforming.
3 Audio Reproduction
The goal of audio reproduction is to create a virtual auditory scene: a realistic three-dimensional audio rendering such that users perceive the acoustic sources as being external to them and located at the correct places in exocentric space [14]. For example, in a computer game environment it might be desirable to synthesize a helicopter sound overhead. One obvious solution is to place an array of loudspeakers around the person and just play the sounds from wherever they are supposed to be, panning as appropriate to permit sound placement in between nodes, e.g., as in [15]. Such a setup is often used in permanent virtual audio installations in special environments. However, it is cumbersome and non-portable; also, the loudspeaker-produced sound would be heard by everybody in the room, which is not desired in regular (e.g., home or office) environments. Therefore, the following discussion assumes that the synthesis is done over headphones. The three building blocks of a virtual auditory reproduction system [16] are head-related transfer function based filtering, reverberation simulation, and user motion tracking.
3.1 Head-related transfer function
An acoustic wave propagating through space interacts with objects present in the space. In particular, the head and body of a human listener obstruct wave propagation and cause the sound wave reaching the ear to be modified. As a result, the sound as received by the listener is not the same sound as produced by the source. Furthermore, due to the geometry of human bodies and in particular to the complex shape of the outer ear, the changes in the sound spectrum depend greatly on the direction from which the sound came relative to the listener, and to a lesser extent on the source range. Those spectral shape cues create the perception of the sound direction and are formally characterized by the head-related transfer function (HRTF): the ratio between the sound spectrum at the eardrum and the spectrum of the sound source [17]. In other words, these are the filters to be applied to transform the original source waveform into the one present at the eardrum. It follows that if an acoustic wave with such filtering imposed is delivered to the listener's ears, an impression of the sound originating from the corresponding direction is created. However, individual differences in body geometry (and, to a smaller extent, its composition) make the HRTF substantially different among individuals. Figure 1 shows how large the variations in pinna shape are between individuals. For accurate scene reproduction the HRTF can be measured for each participant, which is a time-consuming process. During the measurement, the filtering effect of the HRTF is measured directly by placing a known sound source at a known location and recording the sound with a microphone placed in the subject's ear. The HRTF obtained in this way is then used for scene rendering [18]. Various methods of selecting the HRTF that "best fits" an individual in some sense have been tried by researchers, matching either physical parameters (e.g., from an ear picture [19]) or perceptual experience (e.g., by asking which audio piece sounds closest to being overhead [20]). Successful attempts at computing the HRTF using numerical methods on a head/ear mesh have also been reported, though they usually require very significant computation time [21] [22]. Recent research attempts to improve the efficiency of such simulations [23] [24].
3.2 Reverberation Simulation
In addition to the sound propagating directly from the source, we commonly (i.e., except in anechoic rooms) hear multiple copies of it arising from interactions with the boundaries of the environment and with various objects. From those reflections, we are able to infer the room size, the wall properties, and perhaps our position in the room. The joint effect of the reflections is termed reverberation. Individual reflections are not heard separately; rather, they are perceptually joined into one auditory stream. The perception of reverberation is complicated; very roughly speaking, the energy decay rate in the reverberation tail implies the room size, and the relative magnitude at various frequencies is influenced by the wall materials.
Fig. 1 Sample ear pictures (photos are from CIPIC HRTF database).
Humans use the ratio of direct to reverberant sound energy to judge the distance to the sound source; hence, reverberation is a cue that is very important to the perception of sounds as being external to the listener. Also, the reflections that constitute the reverberation are directional (i.e., they arrive at the listener from a specific direction just like the direct sound) and should be presented as directional in an auditory display, at least for the first few reflections. A method for fast reverberation simulation was recently proposed [25]. The sound source in a room creates a reverberant acoustic field, which is equivalent to the field formed in a boundaryless room by an infinite regular lattice of sound sources of diminishing strength constructed by reflecting the original source in the room walls [26]. The reverberant field is usually computed directly by summing up the fields produced by all sources in the lattice, which is very costly. A fast method replaces the source lattice with a single source, which is located in the center of the room and has a very complex radiation pattern so that everywhere in the room the field created
by this source is equal to the field created by the source lattice [27]. This process is slow; however, it is done only once, and subsequent computation of the reverberant field involves evaluating the field created by that single source only and is therefore very fast. This method can be used in real-time room acoustic simulations to allow for greater reproduction accuracy compared to conventional methods.
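The reflection lattice itself is easy to generate for a shoebox room; the sketch below enumerates only the six first-order image sources (higher orders mirror recursively) and uses a single frequency-independent reflection coefficient as a simplifying assumption.

    import numpy as np

    def first_order_images(src, room, refl_coeff):
        # First-order image sources for a shoebox room with one corner at
        # the origin.  src: (3,) source position; room: (3,) dimensions (m).
        images = []
        for axis in range(3):
            for wall in (0.0, room[axis]):
                img = src.copy()
                img[axis] = 2.0 * wall - img[axis]   # mirror across the wall
                images.append((img, refl_coeff))
        return images

    # Each image contributes a delayed, attenuated copy at listener position L:
    #   delay = norm(img - L) / c,  gain = refl_coeff / norm(img - L)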
3.3 User Motion Tracking
A key feature of any acoustic scene that is external to the listener is that it is stationary with respect to the environment. That is, when the listener rotates, the auditory objects move in the listener-bound coordinate frame so as to "undo" the listener's rotation. To allow for simulation of those natural changes in the sound that occur with the listener's motion, active tracking of the user's head and on-the-fly adjustment of the auditory stream to compensate for the head motion/rotation is necessary. If such compensation is not done, the listener subconsciously assumes that the sound's origin is in his/her head (as with ordinary headphone listening), and externalization becomes very hard. An inertial (gyroscopic), electromagnetic, or optical tracker providing at least the head orientation is needed for a virtual audio rendering system. In addition, the stream adjustment must be made with sufficiently low latency so as not to create dissonance between the expected and perceived auditory streams [28]. Note that none of these three problems exists if the auditory scene is rendered using a loudspeaker array. Indeed, HRTF filtering for the correct direction and with the correct HRTF is done by the mere act of listening; room reverberation is added naturally, given that the array is placed in some environment; and motion tracking is also unnecessary because in this case the sources are in fact external to the user and naturally shift when the user rotates. In essence, the loudspeaker array setup represents the "original" acoustic scene, with sources physically surrounding the listener, much better than the headphone setup. However, as mentioned before, such a setup is expensive and non-portable and can be used only in large research/training facilities; also, it may be hard to reproduce scenes very different from the physical room in which the setup is installed. Internal sound reflections within the array can also present problems and distort the sound if not carefully addressed at the design stage.
3.4 Signal Processing Pipeline
We describe here the typical signal processing pipeline involved in real-time virtual auditory scene rendering. The goal is to render several acoustic sources at certain positions in a simulated environment with given parameters (room dimensions, wall materials, etc.). Assume that for each source the clean (reverberation-free) signal is given and that the HRTF of the listener is known. For each source, the following
Fig. 2 A sequence of signal processing steps for rendering one audio frame in headphone-based real-time virtual auditory scene rendering system.
sequence of steps (illustrated in Figure 2) is performed, and the results are summed together to produce the auditory stream.
• Using the source location and room dimensions, compute the locations of reflections (up to a certain order) using simple geometric rules.
• Obtain current listener position/orientation data from the headtracker.
• Compute the locations of the acoustic source and of its reflections in the listener-bound coordinate system.
• Compose impulse responses (IRs) for the left and right ears by combining the head-related impulse responses (HRIRs) for the source and its reflections with appropriate delays.
• Convolve the source signal with the left/right-ear IRs using frequency-domain convolution (a sketch of this step appears at the end of this subsection).
• Render the resulting audio frame to the listener and repeat.
In practice, due to the limited computational resources available, only the IR containing the reflections of up to order 4 or 5 (the "IR head") can be computed in real time. The remaining reverberant tail (the "IR tail") is largely independent of the specific source/receiver position in the room and reflects room properties only, such as room volume and wall materials. Therefore, the IR tail is computed in advance for a specific set of room parameters. Then, during the rendering (where the same
room parameters must be used), the IR head is computed in real time, and the precomputed IR tail is appended to it unchanged. Also, artifacts such as clicks can arise at frame boundaries because the IR filter changes significantly between frames due to source/listener motion. Usual techniques such as frame overlapping with smooth fade-in/fade-out windows can be used to eliminate such windowing artifacts.
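The per-frame convolution step might be sketched as follows; the overlap-add bookkeeping and the cross-fading mentioned above are omitted, and the function is an illustration rather than the system's actual renderer.

    import numpy as np

    def render_frame(frame, hrir_l, hrir_r, fft_len):
        # Frequency-domain convolution of one source frame with the current
        # left/right impulse responses.  fft_len should be at least
        # len(frame) + len(hrir) - 1 for a linear (not circular) convolution.
        F = np.fft.rfft(frame, n=fft_len)
        left = np.fft.irfft(F * np.fft.rfft(hrir_l, n=fft_len), n=fft_len)
        right = np.fft.irfft(F * np.fft.rfft(hrir_r, n=fft_len), n=fft_len)
        return left, right

In the real-time loop, the IR head is rebuilt from the tracked listener pose each frame, the precomputed IR tail is appended, and successive output frames are cross-faded to suppress switching artifacts.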
4 Sample Application: Fast HRTF Measurement System
The HRTF measurement problem was briefly mentioned before. The head-related transfer functions for the left ear H_l(k; s) and for the right ear H_r(k; s) are defined as
$$H_l(k; s) = \frac{\psi_l(k; s)}{\psi_c(k)}, \qquad H_r(k; s) = \frac{\psi_r(k; s)}{\psi_c(k)},$$
where ψ_l(k; s), ψ_r(k; s), and ψ_c(k) are, respectively, the signal spectra at the left eardrum, at the right eardrum, and at the center of the head as if the listener were absent. Here, s designates the direction to the source, and we use the wavenumber k in place of the more commonly used frequency f or radial frequency ω for notational convenience later in the chapter; note that k = 2πf/c = ω/c, where c is the sound speed. The traditional method of measuring the HRTF (e.g., [29], [30]) is to position a sound source (e.g., a loudspeaker) in the direction s and to play a test signal x(t) through it. At the same time, a microphone inserted into the left ear records the signal y_l(t), which is nothing but x(t) filtered with the left-ear head-related impulse response (HRIR), the inverse Fourier transform of H_l(k; s). H_l(k; s) is then computed from the known x(t) and the measured y_l(t). The measurements are usually done on a regular grid of directions to sample the true (continuous) HRTF at a sufficient number of locations to permit more or less continuous rendering of the auditory space. One of the problems associated with this traditional method is that the whole acquisition procedure is quite slow because the test sounds have to be produced sequentially from many positions. This necessitates the use of equipment to actually move the loudspeaker(s) between positions; such equipment may be a semi-circular hoop rotated by a stepper motor, a robotic arm, or another implement. As a result, completion of a measurement at an acceptable sampling density (about a thousand positions) requires several hours. An alternative fast method was developed recently by University of Maryland researchers [31]. The method is based on the Helmholtz principle of reciprocity [32]. Assume that in a linear time-invariant scene the positions of the sound source and of the receiver are swapped. By reciprocity, the recording made would be the same as for the unswapped case. The reciprocal HRTF measurement method is thus to place the sound source in the listener's ear and many microphones around the listener in the directions where the HRTF is to be measured. The transfer function obtained would be the same as for the traditional (unswapped) case, and the great advantage of the reciprocal setup is that the recording can be done in parallel for all microphones
Fig. 3 a) A test signal (upsweep chirp) for the HRTF measurement system. b) The power spectrum of the test signal.
with a multi-channel recording setup. The time necessary for the measurement is thus reduced from hours to seconds. The test signal x(t) is usually selected to be wideband (to provide enough energy in the frequency range of interest) and short (to be able to window out equipment and wall reflections). A burst of white noise, a single click, or a frequency sweep (chirp) can be selected. In our case, the test signal is a 96-sample (2.45 ms) upsweep chirp; the signal plot and its spectrum are shown in Figure 3. The sampling rate for the acquisition system is 39.0625 kHz. The test signal is fed to a subminiature Knowles ED-9689 speaker, which is wrapped in a silicone plug (to provide acoustic insulation and to hold the speaker in place) and is inserted into the subject's ear. 128 Knowles FG-3629 microphones are mounted at the nodes of a sphere-like structure constructed from a ZomeTool set (Figure 4). These microphones are connected to four 32-channel custom-made preamplifier boxes. The output of each box consists of 32 amplified analog microphone signals and is connected to a 32-channel National Instruments PCI-6071E data acquisition card via a standard NI cable. Channel multiplexing is done in the NI card hardware, introducing a sampling skew of less than one sample; we do not compensate for that in processing. In total, four PCI-6071E cards are installed in the acquisition computer (a typical desktop computer) to handle the 128 microphones. The NI cards are also connected to each other with a short synchronization cable (a standard feature of the NI cards). One of the NI cards is configured in software to be the common clock source and is also used to output the test signal to the speaker. Also, a hardware trigger is used to ensure that
Fig. 4 A prototype fast HRTF measurement system based on the acoustic reciprocity principle. The diameter of the structure is about 1.4 meters.
128-channel acquisition and test signal playback start at the same time instant and operate off the common clock source. In order to measure the HRTF, a calibration signal is first obtained to compensate for individual differences in the channels of the system. The speaker is placed on a stick at the center of the recording mesh, and the test signal x(t) is played back through it and is recorded at all microphones as y_i^c(t); here i is the microphone index and c denotes "calibration". The recorded signal is windowed so as to exclude the test signal reflections coming from the room walls, ceiling, and floor. In our setup, the arrival time of the closest reflection is about 4 ms, which is consistent with the geometry of the setup and with its position in the room. The recorded signal is therefore cut at 144 samples (3.7 ms) from the start of the chirp in the recording, which is detected using thresholding. Note that the test signal does not overlap with any reflections; if it did, it would be necessary to further decrease the length of the test signal to be able to window out the reflections.
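For illustration, a test chirp with the stated length and sampling rate can be generated with SciPy; the exact sweep bounds and the taper below are our assumptions, since the chapter specifies only the duration and the usable frequency range.

    import numpy as np
    from scipy.signal import chirp

    fs = 39062.5                     # acquisition sampling rate (Hz)
    n = 96                           # test signal length: 96 samples = 2.45 ms
    t = np.arange(n) / fs
    # Sweep bounds are illustrative, chosen to match the usable band of
    # roughly 700 Hz to 14 kHz mentioned in the text.
    x = chirp(t, f0=700.0, t1=t[-1], f1=14000.0, method='linear')
    x *= np.hanning(n)               # taper to limit spectral leakage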
Then, the speaker is inserted in the left ear of the subject, the subject is placed at the center of the recording mesh, and the test signal x(t) is played back through the speaker and is recorded at the ith microphone, located at s_i, as y_i(t). The playback and recording are repeated 48 times, and the resulting signals are time-averaged to improve the SNR. The same thresholding and windowing procedure is used on y_i(t). The left-ear HRTF H_l(k; s_i) is then computed as Y_i(k)/Y_i^c(k), where the capital letters signify the Fourier transforms of the corresponding signals. Note that the original signal x(t) is not used in the calculations but is implicitly present in y_i^c(t). The same process is then repeated with the right ear of the subject to obtain the right-ear HRTF H_r(k; s_i).
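The deconvolution just described reduces to a spectral division; a hedged sketch follows (with our own function and argument names, a small regularizer to avoid division by zero, and onset alignment assumed done upstream by the thresholding step).

    import numpy as np

    def estimate_hrtf(recordings, calibration, n_keep=144):
        # recordings: (n_repeats, n_samples) aligned recordings y_i(t);
        # calibration: (n_samples,) calibration recording y_i^c(t).
        y = recordings.mean(axis=0)        # time-average to improve SNR
        y = y[:n_keep]                     # 144-sample (3.7 ms) window
        yc = calibration[:n_keep]
        Y = np.fft.rfft(y)
        Yc = np.fft.rfft(yc)
        return Y / (Yc + 1e-12)            # H_l(k; s_i) = Y_i(k) / Y_i^c(k)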
Fig. 5 KEMAR HRTF measurement for (azimuth, elevation) of (−55, 45). a) Left ear HRTF. b) Right ear HRTF. c) Left ear HRIR. d) Right ear HRIR.
The computed HRTF is then windowed in the frequency domain with a trapezoidal window in order to dampen it to zero at frequencies where the measurements are unreliable. In particular, the subminiature speaker is not very efficient at emitting the lowest frequencies (below about 700 Hz), and, as seen from Figure 3(b), there is no energy in the test signal above approximately 14 kHz, so the HRTF obtained is very noisy in these very low / very high frequency ranges and is gradually tapered to zero there (Figure 5(a), (b)). The HRTF is then subjected to an inverse Fourier transform to generate the HRIR. A sample HRIR obtained by the measurements is shown in Figure 5(c), (d). The measurement direction is in the left-front-above octant, so the left-ear
HRIR is larger in magnitude and has an earlier onset. It can also be seen that the HRIR onset is somewhat smeared, unlike what would be expected in reality when a pulse test signal is played. This does not seem to produce significant differences in perception; to correct for it, the HRIR can be converted to minimum phase with appropriate time delays as obtained from thresholding y_i(t) or from acoustic models such as HAT [33].
5 Sample Application: Auditory Scene Capture and Reproduction
The second application described deals with "auditory experience sharing and preservation" [34]. The goal here is to perform a multi-channel audio recording in such a way that a spatially/temporally removed listener can perceive the acoustic scene to be fully the same as if he/she were actually present at the place where the recording was made. Full reproduction here means that the sound sources are perceived to be in the correct directions and at the correct distances, environmental information is kept, and the acoustic scene stays external with respect to the listener (i.e., remains stationary with respect to the room when the listener moves/rotates). In order to do that, such a recording should capture the spatial characteristics of the acoustic field. Indeed, a single-microphone recording is quite able to faithfully reproduce the acoustic signal at a given point, but it is impossible to infer the spatial structure of the acoustic scene (i.e., where different parts of the audio scene are located with respect to the recording point) from it. A microphone array that is able to sample such spatial structure is therefore necessary for such a recording. In terms of the signal processing pipeline, this application is somewhat similar to the virtual auditory scene (VAS) rendering described earlier. However, a very important distinction is that VAS rendering synthesizes the scene "from scratch" by placing sound sources, for which clean signals are available, at known places in a known environment and proceeding from there by mixing the source signals with the HRTF and reverberation, whereas in the scene capture and rendering application nothing is known about the scene structure. A tempting approach is to localize all sound sources, isolate them somehow, and then render them using VAS; however, localization attempts with multiple active sources (such as at a concert) are futile; moreover, such an approach would remove all but the most prominent components from the scene, which contradicts the goal of keeping the environmental information intact. Therefore, the scene capture and reproduction system attempts to blindly re-create the scene without ever trying to analyze it. More specifically, it operates by decomposing the acoustic scene into pieces arriving from various directions and then rendering these pieces to the listener so that they appear to come from their respective directions. The grid of decomposition directions is fixed in advance and is data-independent. The placement of sources would be reproduced reasonably well with such a model; indeed, assume that a source is located in one of the directions in the grid. Then, the part of the acoustic scene that arrives from that direction would have high
and contain mostly the signal of that source; therefore, during the rendering process the source would appear to be there. Conversely, the waves arriving from other directions would be weak.
Fig. 6 A prototype 64-microphone array with embedded video camera. The array connects to the PC via a USB 2.0 cable streaming digital data in real time.
Spherical microphone arrays constitute an ideal platform for such a system due to their inherent omnidirectionality. A 64-microphone array is shown in Figure 6 (the array also has an embedded video camera that is used for the acoustic field visualization application described later). Assume that the array has $L_q$ microphones located at positions $s_q$ on the surface of a sound-hard sphere of radius $a$, and that the time-domain signal measured at each microphone is $x(t; s_q)$, with its Fourier transform (the acoustic potential) denoted $\psi(k; s_q)$. The goal is to decompose the acoustic scene into waves $y(t; s_j)$ arriving from various directions $s_j$ (the total number of directions is $L_j$) and then render them for the listener from directions $\tilde{s}_j$ (which are the directions $s_j$ converted to the listener-bound coordinate system) using his/her HRTFs $H_l(k; s)$ and $H_r(k; s)$. It is easier to work in the frequency domain. Let us denote the Fourier transform of
$y(t; s_j)$ by $y(k; s_j)$. Further, denote by $\boldsymbol{\psi}$ the $L_q \times 1$ vector of $\psi(k; s_q)$, and denote by $\mathbf{y}$ the $L_j \times 1$ vector of $y(k; s_j)$. We want to build a simple linear mechanism for decomposing the scene into the directional components so that it can be expressed by the equation $\mathbf{y} = W\boldsymbol{\psi}$, where $W$ is the $L_j \times L_q$ matrix of decomposition weights constructed in some way. One way to proceed is to note that the operation that allows for selection of the part of the audio scene that arrives from a specified direction is nothing but a beamforming operation [35]. A beamforming operation for direction $s_j$ in the frequency domain can be written as
$$y(k; s_j) = -\frac{i(ka)^2}{4\pi} \sum_{n=0}^{p-1} \sum_{q=1}^{L_q} (2n+1)\, i^{-n}\, h_n'(ka)\, \psi(k; s_q)\, P_n(s_j \cdot s_q),$$
where $h_n'(ka)$ is the derivative of the spherical Hankel function of order $n$ and $P_n(\cdot)$ is the Legendre polynomial of order $n$. Here $p$ is the truncation number and should be kept approximately equal to $ka$; increasing it improves the selectivity but also greatly decreases robustness, so a balance must be kept [13]. This decomposition method can be expressed in the desired matrix-vector multiplication framework with

$$w(k; s_j, s_q) = -\frac{i(ka)^2}{4\pi} \sum_{n=0}^{p-1} (2n+1)\, i^{-n}\, h_n'(ka)\, P_n(s_j \cdot s_q),$$
forming the weight matrix $W_B$ ("B" stands for "beamformer"). Another method of decomposition [36] is to note that a relationship inverse to the desired one can be established as $\boldsymbol{\psi} = F\mathbf{y}$, where $F$ is an $L_q \times L_j$ matrix with elements $f(k; s_q, s_j)$ equal to the potential caused at the microphone at $s_q$ by the wave arriving from the direction $s_j$. The equation for $f(k; s_q, s_j)$ appears in many papers (see e.g. [37]) and is

$$f(k; s_q, s_j) = \frac{i}{(ka)^2} \sum_{n=0}^{p-1} \frac{i^n (2n+1)\, P_n(s_j \cdot s_q)}{h_n'(ka)}.$$
Then, the linear system $\boldsymbol{\psi} = F\mathbf{y}$ is solved for $\mathbf{y}$, exactly if $L_q = L_j$ or in the least-squares sense otherwise. For this method, the weight matrix is $W_P = F^{-1}$ (a generalized inverse in the case of non-square $F$; "P" stands for "plane-wave decomposition" [36]). Much more can be said about the behavior of the two decomposition methods, but space limitations preclude us from doing so; interested readers can refer to the literature covering this in more detail [13, 36, 37, 38, 39]. The decomposed acoustic scene is stored for later rendering and/or transmitted over a network to the remote listening location. When rendering the scene for a specific user, his/her HRTFs are assumed to be known. During the rendering, the directions $s_j$ with respect to the original microphone array position are converted to directions $\tilde{s}_j$ with respect to the listener-bound coordinate frame using the current listener head position/orientation, and the audio streams $y_l(t)$, $y_r(t)$ for the left/right ears are obtained as [40]

$$Y_l(k) = \sum_{j=1}^{L_j} H_l(k; \tilde{s}_j)\, y(k; s_j), \qquad y_l(t) = \mathrm{IFT}[Y_l(k)](t),$$

$$Y_r(k) = \sum_{j=1}^{L_j} H_r(k; \tilde{s}_j)\, y(k; s_j), \qquad y_r(t) = \mathrm{IFT}[Y_r(k)](t),$$
where IFT is the inverse Fourier transform. Essentially, the scene components arriving from different directions are filtered with their corresponding HRTFs and summed up.

The above description is a brief outline of auditory scene capture and reproduction principles. Many practical details are omitted for brevity. In particular, the ideal beampattern is a delta function, but the beam width is always finite and sidelobes do exist. This inevitably creates spatial spreading of acoustic sources as well as source energy leakage into other directions. It is not clear how accurate the beamformer must be for faithful, lifelike reproduction of auditory scenes. A perceptual study would answer these questions best; however, the state of the art in microphone arrays is such that accurate capture and re-creation of sound at all frequencies spanning the human hearing range is impossible. Indeed, from the aliasing criteria it can be computed that, for a sphere the size of a human head, the number of microphones necessary to avoid aliasing at a frequency of 20 kHz is about 1600 [39]. A smaller array would need fewer microphones but would not be able to handle low frequencies reliably. Perhaps an advanced array consisting of several concentric open spheres, an anthropomorphic localization system using scattering of the sound off a complex-shaped antenna, or a technological solution such as very small, wirelessly or daisy-chain connected microphone boards can be employed to allow for spatial auditory capture up to 20 kHz. An increase in the number of microphones would also allow for a narrower beam pattern. For reference, the head-sized 64-microphone array shown in Figure 6 can be constructed relatively easily and has a useful frequency range from about 700 Hz to about 4 kHz.

Possible applications of the system are numerous; one can imagine, for example, such a system being placed in a performance space and transmitting the spatial audio stream in real time to interested listeners, enabling them to be there acoustically without actually being there physically. Similarly, a person might want to capture some auditory experience — e.g., a celebration — to store for later or to share with family and friends. Another application from a different area could involve a robot being sent to environments that are dangerous to or unreachable by humans, relaying back the spatial auditory experience to the operator as if he/she were there. The auditory sensor can also be augmented with a video stream to open further possibilities, and research in spatial audio capture and synthesis is ongoing.
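To make the decomposition step concrete, the sketch below computes the beamforming weight matrix $W_B$ for one frequency using SciPy's spherical Bessel routines and indicates how the directional components feed the binaural rendering sums. Function names, array layouts, and the default truncation are our own illustrative choices, and the normalization follows the beamforming expression given above; this is a sketch, not the authors' implementation.

```python
import numpy as np
from scipy.special import spherical_jn, spherical_yn, eval_legendre

def sph_hankel1_deriv(n, x):
    # h_n'(x), derivative of the spherical Hankel function of the first
    # kind, assembled from the spherical Bessel function derivatives.
    return (spherical_jn(n, x, derivative=True)
            + 1j * spherical_yn(n, x, derivative=True))

def beamformer_weights(k, a, mic_dirs, look_dirs, p=None):
    """W_B: rows index look directions s_j, columns microphones s_q.

    mic_dirs: (Lq, 3) unit vectors; look_dirs: (Lj, 3) unit vectors;
    p: truncation number, kept near ka as discussed in the text.
    """
    ka = k * a
    if p is None:
        p = max(1, int(np.ceil(ka)))
    cosines = look_dirs @ mic_dirs.T            # s_j . s_q for all pairs
    W = np.zeros(cosines.shape, dtype=complex)
    for n in range(p):
        W += ((2 * n + 1) * (1j ** (-n)) * sph_hankel1_deriv(n, ka)
              * eval_legendre(n, cosines))
    return -1j * ka ** 2 / (4 * np.pi) * W

# Per frequency bin k, with psi the vector of microphone potentials and
# Hl_rot / Hr_rot the HRTFs sampled at the rotated directions s~_j:
#   y  = beamformer_weights(k, a, mic_dirs, look_dirs) @ psi
#   Yl = Hl_rot @ y          # HRTF filter-and-sum, left ear
#   Yr = Hr_rot @ y          # HRTF filter-and-sum, right ear
```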
6 Sample Application: Acoustic Field Visualization

A different application that uses the same spherical microphone array hardware, but has a quite different signal processing chain, is the real-time visualization of the acoustic field energy. The microphone array immersed in the acoustic field has the ability to analyze the directional properties of the field. In this application, the goal is to generate an image that can be used by a human operator to visualize the spatial distribution of the acoustic energy in the space. Furthermore, our microphone array setup has an embedded video camera (Figure 6) so that simultaneous acoustic analysis and visual imaging of the space can be done, allowing the operator to instantly identify the acoustically active areas in the environment by overlaying the traditional visual image of the scene with the image showing the acoustic energy distribution in space [41].
Fig. 7 Sample acoustic energy map. Pixel intensity is proportional to the energy in the signal beamformed to the direction indicated on axes.
To create the acoustic energy map, the beamforming operation is performed on a relatively dense grid of directions. The energy map is just a regular, two-dimensional image with the azimuth corresponding to the X axis and the elevation corresponding to the Y axis. In this way, the array's full omnidirectional field of view is unwrapped and mapped to a planar image. The energy in the signal obtained when beamforming to a certain direction is plotted as the pixel intensity for that direction. A sample map is shown in Figure 7. It can be seen in the image that an active acoustic source exists in the scene and is located approximately in the direction (−100, 40) (azimuth, elevation). In the same way, multiple acoustic sources can be visualized, or a reflection coming from a particular direction in space can be seen. Furthermore, if the original acoustic signal is short (like a pulse), the evolution of the reverberant sound field can be followed by recording it and then playing it back in slow motion. In such a playback, the arrivals of the original sound and of its
reflections from various surfaces and objects in the room are literally visible, painted over the places where the sound/reflections are coming from [42]. Such information is usually obtained indirectly in architectural acoustics for the purposes of adjusting the environment for a better attendee experience, and a real-time visualization tool could prove very useful in this area. Furthermore, it has been shown that the acoustic energy map is created according to the central projection equations — the same equations that regular (video) imaging cameras conform to [43] — so the device has been tagged an "audio camera", and the vast body of work derived for video imaging (e.g., multi-camera calibration or epipolar constraints) can be applied to audio or multimodal (auditory plus traditional visual) imaging.

The signal processing chain in our implementation of the visualization application is based on data processing on a GPU (graphics processing unit). In the last few years, GPU processing power has been increasing extremely fast and is now two orders of magnitude larger than CPU ("regular" central processor) power. This is partially due to the fact that a GPU consists of a large number (128-240) of individual processors running the same instructions in parallel on multiple data streams (SIMD parallel architecture, single instruction multiple data). Originally, such an architecture was developed for performing multiple graphics-related data operations, such as texture mapping and vertex shading, in parallel; however, several years ago NVIDIA (a major manufacturer of graphics cards) released a toolkit for general-purpose GPU programming called CUDA [44], allowing developers to write arbitrary GPU-executable code in the C language [45]. Still, GPU development remains much more constrained in comparison to general C programming. In particular, the SIMD architecture requires all processors to execute identical code; otherwise, the code execution is serialized, and no benefits are obtained from having multiple processors. Different memory types are also available to the developer: each processor has a (fastest but small-sized) local memory; each group of 8 processors shares a common block of (fast, moderately-sized) shared memory; and there is also a (slow but large) global memory residing on the graphics card. Finally, transfers between GPU memory and main PC memory can be done but are extremely slow and should be used only for loading data and unloading computation results. This architecture is not unlike the main CPU architecture with multiple levels of caching; however, with CPU programming the cache is not software-controllable, whereas with GPU programming the developer must explicitly choose which type of memory to use for particular variables and arrange for data transfers. These and many other architectural details must be kept in mind while developing GPU code. However, the speed-ups achieved are very large for tasks that can be easily parallelized. An excellent overview of general-purpose computing on the GPU is given in [46].

In our particular application example, the GPU (an NVIDIA GeForce 8800 GTX with 128 processors operating in parallel) is used to perform all the signal processing involved. Parallelization of the beamforming operations is trivial, as each operation is independent except that they all use the same input data.
In addition, beamforming can be formulated as a matrix-vector multiplication in the frequency domain, and modules to execute the fast Fourier transform and fast matrix-vector multiplication are readily available in the CUDA toolkit.

Fig. 8 A sequence of signal processing steps involved in the GPU implementation of an acoustic scene visualization algorithm.

Below is the sequence of signal processing steps (illustrated in Figure 8) involved in our implementation of the real-time visualization algorithm. It is important to note that this is a signal processing chain specific to our implementation of the algorithm; it is not an inherent property of the application itself.
• Acquire the frame of the time-domain audio signal for all 64 channels.
• Transfer the audio data to the GPU memory for processing.
• In parallel, perform 64 Fourier transforms — one per audio channel.
• In parallel, perform 8192 beamforming operations for all directions comprising the whole energy map (of size 128 by 64 pixels).
• Store the outputs in the image buffer.
• Map the accumulated image to the screen via a texture-mapping procedure.

Note that no inverse Fourier transform is done prior to computation of the energy in the beamformed signal; it is simply not necessary, as the energy is the same. Also, the computed energy map is never transferred back to main PC memory but is rather used directly from the GPU memory for display via texture mapping. All the static data necessary for the computations (the grid of directions and the beamformer weight matrix) is pre-computed on the CPU at the start of the application and loaded into GPU memory for use by the GPU code.
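A CPU-side numpy stand-in for this chain, with array shapes matching the description above (64 channels, an 8192-direction grid, a 128 × 64 map), makes the data flow explicit. On the GPU, the same two compute steps would map onto the FFT and matrix-multiplication modules mentioned above; the function name and argument layout here are illustrative assumptions.

```python
import numpy as np

def energy_map_frame(frames, W, bin_idx, img_shape=(64, 128)):
    """One energy-map frame: FFT per channel, beamform, then |.|^2.

    frames: (64, N) time-domain samples, one row per microphone.
    W: (8192, 64) precomputed beamformer weights at the chosen bin.
    bin_idx: FFT bin at which the map is evaluated.
    """
    spectra = np.fft.rfft(frames, axis=1)   # 64 FFTs, one per channel
    psi = spectra[:, bin_idx]               # (64,) microphone potentials
    y = W @ psi                             # 8192 beamforming ops as one mat-vec
    energy = np.abs(y) ** 2                 # no inverse FFT needed for energy
    return energy.reshape(img_shape)        # elevation-by-azimuth image
```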
The performance benefits from GPU use are very substantial. In the CPU-only implementation, the computation of the energy map for a 30-ms-long signal window takes about 490 milliseconds for a single frequency only; as such, real-time operation is not possible. In the GPU-based system, it is possible to compute in real time the energy maps for three frequencies; thus, computations are done about 50 times faster. In one version of the system, the powers in these three frequency bands are shown via the three primary color channels of the display, allowing one to see sound characteristics together with sound location. Commercial applications of the developed solution are possible in many areas, such as noise analysis and abatement, hearing protection, architectural acoustics and surveys of audio environments, visual displays of auditory events, and assistive devices for hearing-impaired individuals.
References

1. M. Brandstein and D. Ward (2001). "Microphone arrays: signal processing techniques and applications", Springer, New York, NY.
2. J. Chen, J. Benesty, and Y. Huang (2006). "Time delay estimation in room acoustic environments: An overview", EURASIP Journal on Applied Signal Processing, vol. 2006, no. 1.
3. M. S. Brandstein and H. F. Silverman (1997). "A robust method for speech signal time-delay estimation in reverberant rooms", Proc. IEEE ICASSP 1997, Munich, Germany, pp. 375-378.
4. A. G. Piersol (1981). "Time delay estimation using phase data", IEEE Transactions on Acoustics, Speech, and Signal Processing, vol. 29, no. 3, pp. 471-477.
5. B. Yegnanarayana, S. R. M. Prasanna, R. Duraiswami, and D. N. Zotkin (2005). "Processing of reverberant speech for time-delay estimation", IEEE Transactions on Speech and Audio Processing, vol. 13, no. 6, pp. 1110-1118.
6. J. Dmochowski, J. Benesty, and S. Affes (2007). "Direction of arrival estimation using the parameterized spatial correlation matrix", IEEE Transactions on Audio, Speech, and Language Processing, vol. 15, no. 4, pp. 1327-1339.
7. H. Wang and M. Kaveh (1985). "Coherent signal-subspace processing for the detection and estimation of angles of arrival of multiple wide-band sources", IEEE Transactions on Acoustics, Speech and Signal Processing, vol. 33, no. 4, pp. 823-831.
8. D. B. Ward and R. C. Williamson (2002). "Particle filter beamforming for acoustic source localization in a reverberant environment", Proc. IEEE ICASSP 2002, Orlando, FL, vol. 2, pp. 1777-1780.
9. D. N. Zotkin and R. Duraiswami (2004). "Accelerated speech source localization via a hierarchical search of steered response power", IEEE Transactions on Speech and Audio Processing, vol. 12, no. 5, pp. 499-508.
10. M. Wax and T. Kailath (1983). "Optimum localization of multiple sources by passive arrays", IEEE Transactions on Acoustics, Speech and Signal Processing, vol. 31, no. 5, pp. 1210-1217.
11. M. F. Berger and H. F. Silverman (1991). "Microphone array optimization by stochastic region contraction", IEEE Transactions on Signal Processing, vol. 39, no. 11, pp. 2377-2386.
12. B. D. van Veen and K. B. Buckley (1988). "Beamforming: A versatile approach to spatial filtering", IEEE ASSP Magazine, vol. 5, no. 2, pp. 4-24.
13. B. Rafaely (2005). "Analysis and design of spherical microphone arrays", IEEE Transactions on Speech and Audio Processing, vol. 13, no. 1, pp. 135-143.
14. C. Kyriakakis, P. Tsakalides, and T. Holman (1999). "Surrounded by sound: Immersive audio acquisition and rendering methods", IEEE Signal Processing Magazine, vol. 16, no. 1, pp. 55-66.
15. V. Pulkki (2002). "Compensating displacement of amplitude-panned virtual sources", Proc. 22nd AES Conference, Espoo, Finland, pp. 186-195.
16. D. N. Zotkin, R. Duraiswami, and L. S. Davis (2004). "Rendering localized spatial audio in a virtual auditory space", IEEE Transactions on Multimedia, vol. 6, no. 4, pp. 553-564.
17. W. M. Hartmann (1999). "How we localize sound", Physics Today, November 1999, pp. 24-29.
18. E. M. Wenzel, M. Arruda, D. J. Kistler, and F. L. Wightman (1993). "Localization using nonindividualized head-related transfer functions", Journal of the Acoustical Society of America, vol. 94, no. 1, pp. 111-123.
19. C. Jin, P. Leong, J. Leung, A. Corderoy, and S. Carlile (2000). "Enabling individualized virtual auditory space using morphological measurements", Proceedings of the First IEEE Pacific-Rim Conference on Multimedia (2000 International Symposium on Multimedia Information Processing), pp. 235-238.
20. P. Runkle, A. Yendiki, and G. Wakefield (2000). "Active sensory tuning for immersive spatialized audio", Proc. ICAD 2000, Atlanta, GA.
21. T. Xiao and Q.-H. Liu (2003). "Finite difference computation of head-related transfer function for human hearing", Journal of the Acoustical Society of America, vol. 113, no. 5, pp. 2434-2441.
22. M. Otani and S. Ise (2006). "Fast calculation system specialized for head-related transfer function based on boundary element method", Journal of the Acoustical Society of America, vol. 119, no. 5, pp. 2589-2598.
23. N. A. Gumerov and R. Duraiswami (2009). "A broadband fast multipole accelerated boundary element method for the 3D Helmholtz equation", Journal of the Acoustical Society of America, vol. 125, no. 1, pp. 191-205.
24. N. A. Gumerov, A. O'Donovan, R. Duraiswami, and D. N. Zotkin (2010). "Computation of the head-related transfer function via the fast multipole accelerated boundary element method and its spherical harmonic representation", Journal of the Acoustical Society of America, vol. 127, no. 1, pp. 370-386.
25. R. Duraiswami, D. N. Zotkin, and N. A. Gumerov (2007). "Fast evaluation of the room transfer function using multipole expansion", IEEE Transactions on Audio, Speech, and Language Processing, vol. 15, no. 2, pp. 565-576.
26. J. B. Allen and D. A. Berkeley (1979). "Image method for efficiently simulating small-room acoustics", Journal of the Acoustical Society of America, vol. 65, no. 4, pp. 943-950.
27. N. A. Gumerov and R. Duraiswami (2004). "Fast multipole methods for the Helmholtz equation in three dimensions", Elsevier Science, Amsterdam, The Netherlands.
28. N. F. Dixon and L. Spitz (1980). "The detection of auditory visual desynchrony", Perception, vol. 9, no. 6, pp. 719-721.
29. V. R. Algazi, R. O. Duda, D. P. Thompson, and C. Avendano (2001). "The CIPIC HRTF database", Proc. IEEE WASPAA 2001, New Paltz, NY, pp. 99-102.
30. E. Grassi, J. Tulsi, and S. A. Shamma (2003). "Measurement of head-related transfer functions based on the empirical transfer function estimate", Proc. ICAD 2003, Boston, MA.
31. D. N. Zotkin, R. Duraiswami, E. Grassi, and N. A. Gumerov (2006). "Fast head-related transfer function measurement via reciprocity", Journal of the Acoustical Society of America, vol. 120, no. 4, pp. 2202-2215.
32. P. M. Morse and K. U. Ingard (1968). "Theoretical Acoustics", Princeton University Press, New Jersey.
33. V. R. Algazi, R. O. Duda, and D. M. Thompson (2002). "The use of head-and-torso models for improved spatial sound synthesis", Proc. 113th AES Convention, Los Angeles, CA, preprint #5712.
34. A. E. O'Donovan, D. N. Zotkin, and R. Duraiswami (2008). "Spherical microphone array based immersive audio scene rendering", Proc. ICAD 2008, Paris, France.
35. J. Meyer and G. Elko (2002). "A highly scalable spherical microphone array based on an orthonormal decomposition of the soundfield", Proc. IEEE ICASSP 2002, Orlando, FL, vol. 2, pp. 1781-1784.
36. D. N. Zotkin, R. Duraiswami, and N. A. Gumerov (2010). "Plane-wave decomposition of acoustical scenes via spherical and cylindrical microphone arrays", IEEE Transactions on Audio, Speech, and Language Processing, vol. 18, no. 1, pp. 2-16.
37. H. Teutsch (2007). "Modal array signal processing: principles and applications of acoustic wavefield decomposition", Springer-Verlag, Berlin, Germany.
38. M. Park and B. Rafaely (2005). "Sound-field analysis by plane-wave decomposition using spherical microphone array", Journal of the Acoustical Society of America, vol. 118, no. 5, pp. 3094-3103.
39. Z. Li and R. Duraiswami (2007). "Flexible and optimal design of spherical microphone arrays for beamforming", IEEE Transactions on Audio, Speech, and Language Processing, vol. 15, no. 2, pp. 702-714.
40. R. Duraiswami, D. N. Zotkin, Z. Li, E. Grassi, N. A. Gumerov, and L. S. Davis (2005). "High order spatial audio capture and its binaural head-tracked playback over headphones with HRTF cues", Proc. 119th AES Convention, New York, NY, preprint #6540.
41. A. E. O'Donovan, R. Duraiswami, and N. A. Gumerov (2007). "Real time capture of audio images and their use with video", Proc. IEEE WASPAA 2007, New Paltz, NY, pp. 10-13.
42. A. E. O'Donovan, R. Duraiswami, and D. N. Zotkin (2008). "Imaging concert hall acoustics using visual and audio cameras", Proc. IEEE ICASSP 2008, Las Vegas, NV, April 2008, pp. 5284-5287.
43. A. E. O'Donovan, R. Duraiswami, and J. Neumann (2007). "Microphone arrays as generalized cameras for integrated audio-visual processing", Proc. IEEE CVPR 2007, Minneapolis, MN.
44. NVIDIA (2009). NVIDIA CUDA Programming Guide 2.3.
45. http://www.gpgpu.org/ – General-Purpose Computation on GPU.
46. J. D. Owens et al. (2008). "GPU computing", Proceedings of the IEEE, vol. 96, no. 5, pp. 879-899.
Distributed Smart Cameras and Distributed Computer Vision Marilyn Wolf and Jason Schlessman
Abstract Distributed smart cameras are multiple-camera systems that perform computer vision tasks using distributed algorithms. Distributed algorithms scale better to large networks of cameras than do centralized algorithms. However, new approaches to many computer vision tasks are required in order to create efficient distributed algorithms. This chapter motivates the need for distributed computer vision, surveys background material in traditional computer vision, and describes several distributed computer vision algorithms for calibration, tracking, and gesture recognition.
Marilyn Wolf
School of Electrical and Computer Engineering, Georgia Institute of Technology, e-mail: [email protected]

Jason Schlessman
Department of Electrical Engineering, Princeton University, e-mail: [email protected]

1 Introduction

Distributed smart cameras have emerged as an important category of distributed sensor and signal processing systems. Distributed sensors in other media have been important for quite some time, but recent advances have made the deployment of large camera systems feasible. The unique properties of imaging add new classes of problems that are not apparent in unidimensional and low-rate sensors. Physically distributed cameras have been used in computer vision for quite some time to handle two problems: occlusion and pixels-on-target. Cameras at different locations expose and occlude different parts of the scene; their imagery can be combined to create a more complete model of the scene. Pixels-on-target refers to the resolution with which a part of the scene is captured, which in most applications is primarily limited by sensor resolution and not by optics. Wide-angle lenses cover a large area but at low resolution for any part of the scene. Imagery from multiple cameras can be
combined to provide both extended coverage and adequate resolution. Distributed smart cameras combine physically distributed cameras and distributed algorithms. Early approaches to distributed-camera-based computer vision used server-based, centralized algorithms. While such algorithms are often easier to conceive and implement, they do not scale well; properly designed distributed algorithms scale to handle much larger camera networks. VLSI technology has aided both the image-gathering and computational abilities of distributed smart camera systems. Moore's Law has progressed to the point where very powerful multiprocessors can be put on a single chip at very low cost [18]. The same technology has also provided cheap and powerful image sensors, particularly in the case of CMOS image sensors [19].

Distributed smart cameras have been used for a variety of applications, including tracking, gesture recognition, and target identification. Networks of several hundred cameras have been tested; over time, we should expect to see much larger networks both tested and deployed. Surveillance is one application that comes to mind. While surveillance and security are a large application—analysts estimate that 25 million security cameras are installed in the United States—that industry moves at a relatively slow pace. Health care, traffic analysis, and entertainment are other important applications of distributed smart cameras. After starting in the mid-1990s, research on distributed smart cameras has progressed rapidly over the past decade. A recent special issue of the Proceedings of the IEEE [15] presented a variety of recent results in the field, and the International Conference on Distributed Smart Cameras is devoted to the topic.

We start with a review of some techniques from computer vision that were not specifically developed for distributed systems but have been used as components in distributed systems. Section 3 reviews early research in distributed smart cameras. Section 4 introduces several distributed algorithms for computer vision.
2 Approaches to Computer Vision

Several algorithms that are used in traditional computer vision problems such as tracking also play important roles as components in distributed computer vision systems. In this section we briefly review some of those algorithms. Tracking refers to the target or object of interest as foreground and non-interesting objects as background (even though this usage is at variance with the terminology of theater). Many tracking algorithms assume that the background is relatively static and use a separate step, known as background elimination or background subtraction, to eliminate backgrounds. The simplest background subtraction algorithm simply compares each pixel in the current frame to the corresponding pixel in a reference frame that does not include any targets. If the current-frame pixel is equal to the reference-frame pixel, then it is marked as background. This algorithm is computationally cheap but not very accurate. Even very small amounts of movement in the background can cause erroneous classification; the classic example of extraneous motion is the blowing of tree leaves in the wind. The mixture-of-Gaussians approach [7] provides much better results. Mixture-of-Gaussian systems also use multiple models so more than
one candidate model can be kept alive at a time. The algorithm proceeds in three steps: compare the Gaussians of each model to find matches; update the Gaussian mean and variance; update the weights. Given a pixel value as a vector of luminance and chrominance information $X \in (Y, C_b, C_r)$, with the normalized per-channel difference $\Delta_x = |X_x - \mu_x| / \sigma_x$, we can compare a current-frame pixel against a model by thresholding

$$\left(\frac{\Delta_Y}{a_Y}\right)^2 + \left(\frac{\Delta_{C_b}}{a_{C_b}}\right)^2 + \left(\frac{\Delta_{C_r}}{a_{C_r}}\right)^2 \qquad (1)$$
Matching weights are updated as follows, where $n$ is the number of frames over which this Gaussian distribution has been active:

$$N = \begin{cases} n, & n < N_{max} \\ N_{max}, & n \ge N_{max} \end{cases} \qquad (2)$$

$$\bar{X}_n = \bar{X}_{n-1} + \frac{X_n - \bar{X}_{n-1}}{N} \qquad (3)$$

$$\sigma_n^2 = \sigma_{n-1}^2 + \frac{(X_n - \bar{X}_{n-1})(X_n - \bar{X}_n) - \sigma_{n-1}^2}{N} \qquad (4)$$

A weight $w_m$ evaluates the system's confidence in that model and pixel position:

$$M = \begin{cases} m, & m < M_{max} \\ M_{max}, & m \ge M_{max} \end{cases} \qquad (5)$$

$$w_m = w_{m-1} + \frac{\omega - w_{m-1}}{M} \qquad (6)$$

where $\omega$ is 1 for the model that matched the current pixel and 0 otherwise.
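A compact per-pixel sketch of the matching and update rules (1)-(6) is given below. For brevity it reuses a single capped frame counter for both N and M, and the data structures, thresholds, and scale factors are illustrative assumptions rather than the reference implementation.

```python
import numpy as np

def mog_update_pixel(x, models, thresh=9.0, n_max=100,
                     a=np.array([1.0, 1.0, 1.0])):
    """One mixture-of-Gaussians step for a single (Y, Cb, Cr) pixel.

    models: list of dicts {'mean': (3,), 'var': (3,), 'n': int, 'w': float}.
    Returns True if the pixel matched a model (i.e., looks like background).
    """
    matched = None
    for m in models:
        delta = np.abs(x - m['mean']) / np.sqrt(m['var'])
        if np.sum((delta / a) ** 2) < thresh:       # matching test (1)
            matched = m
            break
    if matched is not None:
        matched['n'] += 1
        N = min(matched['n'], n_max)                # window cap as in (2)
        prev_mean = matched['mean'].copy()
        matched['mean'] = prev_mean + (x - prev_mean) / N          # (3)
        matched['var'] += ((x - prev_mean) * (x - matched['mean'])
                           - matched['var']) / N                   # (4)
    for m in models:                                # weight updates (5)-(6)
        M = max(min(m['n'], n_max), 1)
        omega = 1.0 if m is matched else 0.0
        m['w'] += (omega - m['w']) / M
    return matched is not None
```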
An appearance model is used to compare the appearance of targets in different frames; appearance models are useful both within a single-camera system, to ensure tracking continuity, and between cameras, to compare views. Appearance models may make use of shape and/or color. A bounding box is a simple shape model, but more complex curve-based models may also be used. A color or luminance histogram is often used to represent detail within the image. One common method for comparing two histograms is the Bhattacharyya distance. More than one appearance model may be necessary, since many interesting targets change in both shape and color as a function of the observer's position. The two major approaches to single-camera tracking are Kalman filtering [2] and particle filters. Particle filters [4] use Monte Carlo methods to estimate a probability distribution; weighted particles represent samples of the hidden states, weighted by Bayesian estimates of probability masses. Coates [5] describes a distributed algorithm for computing particle filters. In that architecture, each sensor node ran one particle filter, with random number generators on each node that were synchronized. That work described two approaches to the problem of distributing observations over the network, one parametric and one based on adaptive encoding. Sheng et al. [16] developed two distributed particle filter algorithms: one that exchanges information between cliques (nearby sensors
whose signals tend to be correlated) and another in which cliques compute partial estimates based on local information and forward their estimates to a fusion center. Oh et al. [13] proposed Markov chain Monte Carlo (MCMC) algorithms for multitarget tracking (MCMC-MTT). This algorithm used a stochastic framework to select moves that manipulate the assignment of target observations to tracks. They showed that relatively simple and efficient bounds on the stochastic search resulted in accurate track assignments. Liu et al. [1] developed a distributed multi-target tracking system in which they separated position from target modeling information. They optimized their system architecture to distribute target position frequently and target models less frequently.
3 Early Work in Distributed Smart Cameras

The DARPA-sponsored Video Surveillance and Monitoring (VSAM) program was one of the first efforts to develop distributed computer vision systems. A system developed at Carnegie Mellon University [6] provided cooperative tracking in which tracking targets were handed off from camera to camera. Each sensor processing unit (SPU) classified targets into categories such as human or vehicle. At the MIT Media Lab, Mallett and Bove [11] developed a distributed camera network that could hand off tracking targets in real time. Their camera network consisted of small cameras mounted on tracks in the ceiling of a room. The cameras would move to improve their view of subjects based on information from other cameras as well as their own analysis. Lin et al. [10] developed a distributed system for gesture recognition that fuses data after some image processing using a peer-to-peer protocol. That system is described in more detail in Section 4.3. The distributed tracker of Bramberger et al. [3] passed off a tracking task from camera to camera as the target moved through the scene. Each camera ran its own tracker; handoffs were controlled by a peer-to-peer protocol.
4 Approaches to Distributed Computer Vision

This section studies distributed smart camera algorithms and architectures in detail. We start with a survey of challenges unique to distributed vision systems. We then briefly introduce the platform architectures. We will then consider example systems for several problems: gesture recognition, tracking, calibration, and sparse camera networks.
4.1 Challenges

A distributed smart camera is a data fusion system—samples from cameras are captured, features are extracted and combined, and results are classified. There is more than one way to perform these steps, providing a rich design space. We can identify several axes on which the design space of distributed smart cameras can be analyzed:

• How abstract is the data being fused: pixel, small-scale feature, shape, etc.? What methods are used to fuse data?
• What groups/cliques of sensors combine their data? For example, groups may be formed by network connectivity, location, or signal characteristics. How does the group structure evolve as the scene changes?
• How sparse in time and space is the data?

We can also identify some axes based on the underlying technology:
• Number of cameras.
• Heterogeneity of cameras.
• Fixed and/or moving cameras.
• Fields-of-view (overlapping, non-overlapping).
• Server-assisted (directories, etc.) vs. peer-to-peer.
• Data fusion level: pixels (server-based) vs. mid-level vs. high-level.
The simplest form of data fusion is to directly combine pixels. Image stitching algorithms [12] can be used to merge several images into a larger image. Such algorithms are more often used for photography—that is, directly-viewed images—than for computer vision. Because of the large amount of data involved, some amount of preprocessing is often done before merging data. Features can be extracted using a variety of methods. The scale-invariant feature transform (SIFT) is a widely used feature extraction algorithm: a set of feature vectors is extracted from the image as minima/maxima of a difference-of-Gaussians function applied to smoothed and resampled images, and the features are then clustered to identify corresponding features in the images to be compared. These features can be combined in a variety of ways using distributed algorithms, either tree-based or graph-based.

Group structure is an issue that cuts across image processing and distributed algorithms. The natural structure of the problem suggests certain groupings of data that are reflected in the formulas that describe the image processing being performed. Distributed algorithms and networks also encourage grouping. In many cases, the groupings suggested by those two disciplines are complementary, but in some cases their recommendations conflict. Group structure in networks is often based on network distance, as illustrated in Figure 1. Network distance in a wireless network is determined in part by the transmission radius of a node, which determines what other nodes can be reached directly; it is also determined by the path length between nodes in a network. In contrast, group structure for image processing algorithms is determined by field-of-view, as illustrated in Figure 2. When cameras have overlapping fields of view, a target
Fig. 1 Groups in wireless networks.
Fig. 2 Groups in overlapping field-of-view image processing.
is in general visible to several cameras. These cameras are not always physically adjacent—in fact, to battle occlusion, cameras at widely different angles provide the best coverage. The cameras also may not be close in the network sense. However, these cameras do need to talk to each other. In the case of non-overlapping fields-of-view, as shown in Figure 3, communication between nodes is determined by the possible or likely paths of targets, which may or may not correspond to the network topology. Features from multiple sensors need to be combined at some point; where they are combined and how they are combined are both design factors. We will use
Fig. 3 Groups in non-overlapping cameras.
two example systems to illustrate different approaches to feature analysis, fusion, and classification. Chapter 6 describes sensor networks in more detail.
4.2 Platform Architectures

The details of the computing platform can vary widely, but the fundamental assumptions made by most distributed computer vision algorithms concern the nodes and the network of the distributed system. Processing nodes are assumed to have reasonably powerful processors that can perform significant amounts of work in real time. RAM is generally not unlimited, but enough is available for the required algorithms. Mass storage (disk or flash) may or may not be available. Most systems assume that power is available to nodes for continuous operation, although it is possible to build a battery-operated system that will operate for a short interval. Networking may be provided using either wired or wireless networks, and a useful amount of bandwidth is assumed to be available as well.

In reality, the fundamental assumptions made by many distributed computer vision algorithms are not entirely pragmatic. Consider a primary application area of computer vision: security and surveillance. As expected, these fields call for discreet and in many cases remote deployment. Clearly, attaching a PC-equivalent processing node to each camera reduces the potential for covertness, while the necessary remoteness of deployment necessitates the consideration of power reduction techniques. As a result, the aforementioned assumptions of significantly powerful processors with large amounts of RAM and power must be reconsidered.

In order to provide the level of computational ability needed by computer vision algorithms, while also incorporating the additional considerations mentioned above, a number of non-PC computing platforms can be considered. The simplest possibility in terms of architectural design is the use of an embedded general-purpose
processor, such as those from ARM. In many cases, moving a vision algorithm onto an embedded processor involves code tweaking for hardware interaction, and otherwise tends to be a relatively painless process. Unfortunately, these processors suffer from performance issues, as they ultimately are still generic CPUs with lower processing ability. For all but the least complex algorithms, this platform is insufficient.

Another approach is to use a DSP, such as those developed by Texas Instruments. These offer a middle ground between an embedded processor and a PC node in that they still consume significantly less power than PC nodes while providing optimized instructions that can be exploited by computer vision algorithms. A simple example found on the majority of DSPs is the inclusion of an optimized multiply-accumulate instruction. This instruction is applicable to a number of preprocessing and filtering steps in many vision algorithms and is typically also available in commodity CPUs, the difference being that the DSP consumes much less power while providing this boost in performance. DSP toolchains are typically similar to those used for embedded processors, with the additional requirement that the system designer incorporate appropriate code annotations to take advantage of the DSP instructions.

For some algorithms and system requirements, still more computational potential is necessary while maintaining low power consumption and a small footprint. In these instances, a custom or reconfigurable platform is considered. Typically, an FPGA will be utilized either as a design and test platform for a final custom build or as a target reconfigurable fabric, respectively. Much like DSPs, these provide optimized processing ability, in this case with specialized processing units within the FPGA fabric. For example, high-speed integer multipliers and adders are found in many FPGAs currently on the market. Furthermore, many FPGAs incorporate a simple embedded processor for algorithm flow control. The complexity in using this type of platform lies in the design overhead. An FPGA requires significantly more design consideration than the previously mentioned platforms. This is due to the inherent need for hardware and software task analysis. Also, the identified hardware tasks typically cannot be directly mapped to FPGA logic, and instead require algorithmic redesign as well as hardware timing scheme design and analysis.

It should be mentioned that all of these platforms raise the need for bandwidth analysis during architectural design and mapping, contrary to the assumptions mentioned previously. This is due to the platforms having a smaller footprint and a resulting lower amount of available storage on the board, not to mention on the chip itself. As vision algorithm complexity increases through the use of larger pixel representation schemes, greater temporal frame storage, and the accuracy requirements that abound in real-time critical vision systems, the amount of data that needs to be transferred on and off chip, as well as stored on-chip intermediately, increases dramatically. Schlessman [22] addresses each of these issues with the MEAD methodology for embedded architectural design, ultimately providing vision designers with a clearer concept of which platform architecture is best suited for the algorithm under consideration.
4.3 Mid-Level Fusion for Gesture Recognition

Lin et al. [10] developed a peer-to-peer algorithm for gesture recognition. The distributed system is based on a single-camera gesture recognition system [20]. That algorithm extracts shapes, builds a graph model of the relationships between the shapes, and then matches that graph against a library of graphs representing poses.
Fig. 4 Combining features for gesture recognition.
The basic challenge in extending the single-camera system to multiple cameras is that a subject may stand such that no single camera has a complete view. In general, data must be combined from multiple cameras before a graph can be built. A brute-force approach would be to share frames, but this entails a substantial bandwidth penalty. Lin's system instead combines the features created by pixel-level region identification: the regions from different cameras are easily fused, and they can be transmitted at much lower cost. As shown in Figure 4, when the target appears in the view of multiple cameras, each camera extracts the foreground object and extracts pixel boundaries for the foreground regions. A protocol determines which camera will fuse these boundaries and complete the recognition process. A group is formed by the cameras that can view the target. One member of the group, the lead node, has a token that denotes that it is responsible for fusion. The token moves to a different node when the target's position, as determined by the centerline of its bounding box, moves out of the field-of-view of the lead node. The token is passed to a camera whose field-of-view contains the target centerline.
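A minimal sketch of this token-passing rule is shown below; the one-dimensional field-of-view intervals and the dictionary layout are simplifying assumptions made for illustration, not details of the Lin et al. protocol.

```python
def choose_fusion_node(centerline_x, current_lead, fov_of):
    """Token-passing rule for selecting the fusion (lead) camera.

    centerline_x: bounding-box centerline of the target in shared
    coordinates; fov_of: dict mapping camera id -> (lo, hi) interval.
    """
    lo, hi = fov_of[current_lead]
    if lo <= centerline_x <= hi:
        return current_lead          # target still visible: keep the token
    for cam, (lo, hi) in fov_of.items():
        if lo <= centerline_x <= hi:
            return cam               # pass the token to a camera that sees it
    return current_lead              # fallback: no camera sees the target
```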
4.4 Target-Level Fusion for Tracking

Velipasalar et al. [21] developed a peer-to-peer tracking system that fuses data at a higher level. At system calibration time, the cameras determine all other cameras that share overlapping fields-of-view. This information is used to build groups for sharing information about targets. Each camera runs its own tracker. As illustrated in Figure 5, a camera maintains, for each target in its field-of-view, the position and appearance model of the target. The cameras then trade information about targets. The cameras that have a given target in their fields-of-view form a group. Cameras communicate every n frames, where n is chosen at design time. A more adaptive
Fig. 5 Combining tracks during distributed tracking.
approach would be to adapt the communication rate based upon the rate at which targets move. The group is built on top of a logical ring structure, which is commonly used for communication in multiprocessors. The cameras in the group exchange information on the position of the target. The exchange allows cameras to estimate position more accurately. It also allows cameras whose view of the target is occluded to know where the target is during the occlusion interval. The protocol also negotiates an identity for each target. When a target enters the system, the cameras must agree upon a common label for it. They compare appearance models and positions to determine that they do, in fact, share a target; they then select a common label for that target.
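The label negotiation can be sketched as follows, using a histogram-based Bhattacharyya distance (mentioned earlier) as the appearance comparison; the data structures and threshold are hypothetical, and a real system would also weigh position agreement and handle unmatched targets.

```python
import numpy as np

def bhattacharyya_distance(h1, h2):
    # Distance between two normalized color histograms.
    bc = np.sum(np.sqrt(h1 * h2))          # Bhattacharyya coefficient
    return -np.log(max(bc, 1e-12))

def negotiate_labels(local_targets, remote_targets, max_dist=0.3):
    """Adopt a peer's label for targets that appear to be shared.

    Each target is a dict with a normalized histogram 'hist' and a
    'label'; structure and threshold are illustrative only.
    """
    for t in local_targets:
        if not remote_targets:
            break
        best = min(remote_targets,
                   key=lambda r: bhattacharyya_distance(t['hist'], r['hist']))
        if bhattacharyya_distance(t['hist'], best['hist']) < max_dist:
            t['label'] = best['label']     # agree on the common label
    return local_targets
```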
4.5 Calibration

Camera networks need to be calibrated both spatially and temporally. Without proper calibration, we cannot properly compare data or analysis results between cameras. Spatial calibration is usually divided into extrinsic—the camera relative to world coordinates—and intrinsic—the focal length, etc., of the camera. Temporal calibration provides consistent time tags for frames between cameras. Radke et al. [14] developed a distributed algorithm for the metric calibration of camera networks. External calibration of a camera determines the position of the camera in world coordinates using a rotation matrix and a translation vector. Intrinsic parameters include focal length, location of the principal point, and skew. Calibrating cameras through image processing is usually more feasible than directly measuring camera position. Devarajan et al. model the camera network using a communication graph and a vision graph. The communication graph is based on network
connectivity—it has an edge between nodes that directly communicate; this graph can be constructed using standard ad-hoc network techniques. The vision graph is based on signal characteristics—it has an edge between two nodes that have overlapping fields-of-view; this graph needs to be constructed during the calibration process. Each camera is described by a $3 \times 4$ matrix $P_i$ that gives the rotation matrix and optical center of the camera. The intrinsic parameter matrix for camera $i$ is known as $K_i$. A set of points $X = \{X_1, \ldots, X_n\}$ is used as reference points for calibration; these points can be selected using a variety of methods, such as SIFT. Two cameras have overlapping fields-of-view if they can both see a certain number of the $X$s. The reconstruction is ambiguous, and we need to estimate a matrix $H$ that turns the reconstruction into a metric form (preserving angles, etc.). A bundle adjustment step uses nonlinear minimization to improve the calibration.

Synchronization, also known as temporal calibration, is necessary to provide an initial time base for video analysis. Cameras may also need to be resynchronized during operation. Even professional cameras may not provide enough temporal stability for long-running experiments, and low-cost cameras often display wide variations in frame rate. Lamport and Melliar-Smith [9] developed an algorithm for synchronizing clocks in distributed systems; such approaches are independent of the distributed camera application. Velipasalar and Wolf [17] developed a video-based algorithm for synchronization. The algorithm works by matching target positions between cameras, so it assumes that the cameras have previously been spatially calibrated. A set of frames is selected for comparison by each camera. For each frame, an anchor point is selected. The targets are assumed to be on a plane, so the anchor point is determined by dropping a line from the top of the bounding box to the ground plane. Given these reference points, a search algorithm compares the positions of the targets to minimize the distance $D_{i,j}^{1,2}$ between frame sequences 1 and 2 at offsets $i$ and $j$.
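As a rough sketch of this search (with our own function name and a brute-force offset scan), the offset minimizing the mean anchor-point distance can be found as follows; a real implementation would also handle missing detections and sub-frame interpolation.

```python
import numpy as np

def best_frame_offset(anchors1, anchors2, max_offset):
    """Search the temporal offset aligning two spatially calibrated cameras.

    anchors1, anchors2: (T, 2) ground-plane anchor points of the tracked
    target, one per frame. Returns the offset minimizing the mean squared
    anchor distance, in the spirit of the D metric described above.
    """
    T = min(len(anchors1), len(anchors2))
    best_off, best_d = 0, np.inf
    for off in range(-max_offset, max_offset + 1):
        if off >= 0:
            a, b = anchors1[:T - off], anchors2[off:T]
        else:
            a, b = anchors1[-off:T], anchors2[:T + off]
        if len(a) == 0:
            continue
        d = np.mean(np.sum((a - b) ** 2, axis=1))
        if d < best_d:
            best_off, best_d = off, d
    return best_off
```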
4.6 Sparse Camera Networks

In many environments, we cannot set up cameras whose fields-of-view cover the entire area of interest. In sparse camera networks, we must find alternative methods to correlate observations between cameras. The most common approach is to use Bayesian metrics to group observations into sets that correspond to paths. Search algorithms find the most likely assignments of observations to paths. A graph models the set of observation positions as nodes and the possible paths between those positions as edges. Oh et al. [13] developed a Markov chain Monte Carlo algorithm for multi-target tracking (MCMC-MTT). We are given a set of target positions $x_i$ and a set of observations $y_i$. We partition the available observations into tracks $\tau = \{y_1, \ldots, y_t\}$; the set of all tracks for a given scene is $\omega$. Given a new set of observations $y_{t+1}$, we want to assign them to tracks, and possibly reassign existing observations to different tracks, generating an updated set of tracks $\omega'$ such that we maximize the posterior of $\omega'$. Oh et al. showed that the posterior of $\omega$ is:
$$P(\omega \mid Y) = \frac{1}{Z} \prod_{1 \le t \le T} p_z^{z_t} (1-p_z)^{c_t}\, p_d^{d_t} (1-p_d)^{u_t}\, \lambda_b^{a_t} \lambda_f^{f_t} \times \prod_{\tau \in \omega \setminus \{\tau_0\}} \prod_{1 \le i \le |\tau|-1} N\!\left(\tau(t_{i+1}) \mid \bar{x}_{t_{i+1}}(\tau), B_{t_{i+1}}(\tau)\right) \qquad (7)$$
In this formula, $Z$ is a normalizing constant and $N(\cdot \mid \bar{x}, B)$ is the Gaussian density function with mean $\bar{x}$ and covariance matrix $B$. In the approach of Oh et al., MCMC multi-target tracking makes use of several types of moves to generate new candidate tracks:

• Birth/death: A birth move creates a new track while death destroys an existing one.
• Split/merge: A split breaks one track into two pieces while a merge combines two tracks into one.
• Extension/reduction: Extension lengthens an existing track by adding new observations at the end of the track while reduction shortens a track.
• Track update: An update adds an observation to a track.
• Track switch: A track switch exchanges observations between two tracks.

A move is accepted or rejected with probability

$$A(\omega, \omega') = \min\left(1, \frac{\pi(\omega')\, q(\omega', \omega)}{\pi(\omega)\, q(\omega, \omega')}\right) \qquad (8)$$
where $\pi(\omega)$ is the stationary distribution of the states and $q(\omega, \omega')$ is the proposal distribution. Kim et al. [8] used a modified graph model to encapsulate information about priors that guides the search process. Entry and exit nodes are often connected by cycles of edges that model entry and exit. Such cycles can induce long paths to be generated that consist of the target shuttling between two nodes. A supernode models a set of nodes and associated edges; the supernode represents the prior knowledge that such cycles are unlikely.
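The acceptance rule (8) amounts to a standard Metropolis-Hastings step over track partitions. The following sketch assumes user-supplied log-posterior and proposal routines (implementing the move types listed above) and works in log space for numerical stability; all names are our own illustration.

```python
import numpy as np

def mcmc_track_step(omega, log_posterior, propose, rng):
    """One Metropolis-Hastings move over track partitions.

    omega: current partition of observations into tracks;
    log_posterior: log P(omega | Y), e.g., from (7);
    propose: returns (omega_new, log_q_fwd, log_q_rev), drawing one of
    the birth/death, split/merge, extension/reduction, update, or
    switch moves.
    """
    omega_new, log_q_fwd, log_q_rev = propose(omega, rng)
    log_a = (log_posterior(omega_new) - log_posterior(omega)
             + log_q_rev - log_q_fwd)        # log of the ratio in (8)
    if np.log(rng.random()) < min(0.0, log_a):
        return omega_new                     # move accepted
    return omega                             # move rejected
```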
5 Summary

Distributed algorithms offer a new approach to computer vision. Nodes in a distributed system communicate explicitly through mechanisms that have non-zero cost in delay and energy. This constraint encourages us to consider algorithms that refine the representation of a scene in stages, with predictable ways of combining data within a stage. Distributed algorithms are, in general, more difficult to develop and prove correct, but they also provide advantages such as robustness. Distributed signal processing may require new statistical representations of signals that can be efficiently shared over the network while preserving the useful properties of the underlying signal. We also need a better understanding of the accuracy of the
approximations underlying distributed signal processing algorithms—how do termination conditions affect the accuracy of the estimate, for example.
References

1. J. Liu, M. Chu, J. Liu, J. Reich, and F. Zhao. Distributed state representation for tracking problems in sensor networks. In Information Processing in Sensor Networks, 2004 (IPSN 2004), Third International Symposium on, pages 234–242. IEEE, 2004.
2. V. Boykov and D. Huttenlocher. Adaptive Bayesian recognition in tracking rigid objects. In Proceedings, IEEE Conference on Computer Vision and Pattern Recognition, pages 697–704. IEEE, 2000.
3. Michael Bramberger, Markus Quaritsch, Thomas Winkler, Bernhard Rinner, and Helmut Schwabach. Integrating multi-camera tracking into a dynamic task allocation system for smart cameras. In Proceedings of the IEEE Conference on Advanced Video and Signal Based Surveillance (AVSS 2005), pages 474–479. IEEE, 2005.
4. James V. Candy. Bootstrap particle filtering. IEEE Signal Processing Magazine, 73:73–85, July 2007.
5. M. Coates. Distributed particle filters for sensor networks. In Proceedings of the International Symposium on Information Processing in Sensor Networks, pages 99–107, 2004.
6. R. T. Collins, A. J. Lipton, H. Fujiyoshi, and T. Kanade. Algorithms for cooperative multisensor surveillance. Proceedings of the IEEE, 89(10):1456–1477, October 2001.
7. T. Horprasert, D. Harwood, and L. S. Davis. A robust background subtraction and shadow detection. In Proceedings, 2000 Asian Conference on Computer Vision.
8. Honggab Kim, Justin Romberg, and Wayne Wolf. Multi-camera tracking on a graph using Markov chain Monte Carlo. In Proceedings, 2009 ACM/IEEE International Conference on Distributed Smart Cameras. ACM, 2009.
9. Leslie Lamport and Michael Melliar-Smith. Synchronizing clocks in the presence of faults. Journal of the ACM, 32(1):52–78, January 1985.
10. Chang Hong Lin, Tiehan Lv, Wayne Wolf, and I. Burak Ozer. A peer-to-peer architecture for distributed real-time gesture recognition. In Proceedings, International Conference on Multimedia and Exhibition, pages 27–30. IEEE, 2004.
11. J. Mallett and V. M. Bove Jr. Eye society. In Proceedings, IEEE ICME 2003. IEEE, 2003.
12. L. McMillan and G. Bishop. Plenoptic modeling: an image-based rendering system. Proceedings, ACM SIGGRAPH, pages 39–46, 1995.
13. S. Oh, S. Russell, and S. Sastry. Markov chain Monte Carlo data association for general multiple-target tracking problems. In Proc. 43rd IEEE Conf. Decision and Control, December 2004.
14. R. Radke, D. Devarajan, and Z. Cheng. Calibrating distributed camera networks. Proceedings of the IEEE, 96(10):1625–1639, October 2008.
15. Bernhard Rinner and Wayne Wolf. An introduction to distributed smart cameras. Proceedings of the IEEE, 96(10):1565–1575, October 2008.
16. X. Sheng, Y.-H. Hu, and Parameswaran Ramanathan. Distributed particle filter with GMM approximation for multiple targets localization and tracking in wireless sensor network. In Information Processing in Sensor Networks, 2005 (IPSN 2005), Fourth International Symposium on, pages 181–188. IEEE, 2005.
17. Senem Velipasalar and Wayne H. Wolf. Frame-level temporal calibration of video sequences from unsynchronized cameras. Machine Vision and Applications Journal, DOI 10.1007/s00138-008-0122-6, January 2008.
18. Wayne Wolf. High Performance Embedded Computing. Morgan Kaufman, 2006.
19. Wayne Wolf. Modern VLSI Design: IP-Based Design. PTR Prentice Hall, fourth edition, 2009.
20. Wayne Wolf, Burak Ozer, and Tiehan Lv. Smart cameras as embedded systems. IEEE Computer, 35(9):48–53, September 2002.
21. Senem Velipasalar, J. Schlessman, Cheng-Yao Chen, Wayne Wolf, and Jaswinder Pal Singh. A scalable clustered camera system for multiple object tracking. EURASIP Journal on Image and Video Processing, vol. 2008, Article ID 542808, 22 pages, 2008.
22. J. Schlessman. Methodology and Architectures for Embedded Computer Vision. PhD thesis, Department of Electrical Engineering, Princeton University, 2010.
Part II
Architectures
Managing Editor: Jarmo Takala
Arithmetic Oscar Gustafsson and Lars Wanhammar
Abstract In this chapter, fundamentals of arithmetic operations and number representations used in DSP systems are discussed. Different relevant number systems are outlined, with a focus on fixed-point representations. Structures for accelerating the carry propagation of addition are discussed, as well as multi-operand addition. For multiplication, different schemes for generating and accumulating partial products are presented. In addition, optimization for constant-coefficient multiplication is discussed. Division and square-rooting are also briefly outlined. Furthermore, floating-point arithmetic and the IEEE 754 floating-point arithmetic standard are presented. Finally, some methods for computing elementary functions, e.g., trigonometric functions, are presented.
1 Number representation

The way we choose to represent our numbers has a profound impact on the corresponding computational units. Here we consider number representations based on a positional weight (radix) of two. It is worth noting that, as the main application here is DSP, we will where required consider numbers that are fractional rather than integer. This mainly affects the numbering of the indices. In this chapter we primarily deal with traditional number representations. In Chapter 25, the selection of number representations for specific problems is discussed in detail.
Oscar Gustafsson
Department of Electrical Engineering, Linköping University, SE-581 83 Linköping, Sweden
e-mail: [email protected]

Lars Wanhammar
Department of Electrical Engineering, Linköping University, SE-581 83 Linköping, Sweden
e-mail: [email protected]
1.1 Binary representation

An unsigned binary number, X, with W_f fractional bits can be written

X = \sum_{i=1}^{W_f} x_i 2^{-i}   (1)
where x_i ∈ {0, 1}. Denoting the weight of the least significant position by Q, in this case Q = 2^{-W_f}, one can see that the range of X is 0 ≤ X ≤ 1 − Q. Q is sometimes referred to as the unit of least significant position, ulp. As an example, the number 0.25_{10} is written using W_f = 3 as .010_2, and Q = 2^{-3} = 0.125_{10}.
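A minimal sketch of (1) in Python (the helper names are illustrative, not part of the formal development) encodes a value with W_f fractional bits and recovers it again:

```python
def to_unsigned_bits(x, wf):
    """Encode 0 <= x <= 1 - 2**-wf as the fractional bits x_1 ... x_Wf of (1)."""
    v = round(x * 2 ** wf)                       # integer in units of Q = 2**-wf
    return [(v >> (wf - i)) & 1 for i in range(1, wf + 1)]

def from_unsigned_bits(bits):
    """Evaluate (1): X = sum of x_i * 2**-i."""
    return sum(b * 2.0 ** -(i + 1) for i, b in enumerate(bits))

assert to_unsigned_bits(0.25, 3) == [0, 1, 0]    # 0.25 = .010 in binary
assert from_unsigned_bits([0, 1, 0]) == 0.25
```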
1.2 Two's complement representation

To represent negative numbers, several different number representations have been proposed. The most common one is the two's complement (2C) representation. Here, a binary number, X, with W_f fractional bits is written

X = -x_0 + \sum_{i=1}^{W_f} x_i 2^{-i}.   (2)
This gives the numerical range −1 ≤ X ≤ 1 − Q. It is worth noting that the range is not symmetric; this will cause problems when implementing certain arithmetic operations, as discussed later. The sign bit, x_0, is one if X < 0 and zero otherwise. The number −0.25_{10} is represented as 1.11_{2C} with W_f = 2, while 0.25_{10} is 0.01_{2C}. A beneficial property of two's complement arithmetic is that an arbitrarily long sequence of numbers can be added in arbitrary order, as long as the final result is known to be in the range of the representation: any overflows/underflows in the intermediate computations will cancel. This is related to the fact that computations in two's complement representation are performed modulo 2^{W_S}, where W_S is the weight of the sign bit. For the representation in (2), W_S = 1, so all computations are performed modulo 2.
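A small numerical sketch of the modulo property (Python, with values held as ordinary floats for readability): an intermediate sum may overflow the range [−1, 1), yet the final result is correct as long as it is representable.

```python
def wrap_2c(x):
    """Map x into the two's complement range [-1, 1), i.e., reduce modulo 2."""
    return (x + 1) % 2 - 1

terms = [0.75, 0.75, -0.875]        # true sum 0.625; 0.75 + 0.75 overflows
acc = 0.0
for t in terms:
    acc = wrap_2c(acc + t)          # every partial sum wraps modulo 2
print(acc)                          # 0.625: the intermediate overflow cancels
```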
1.3 Redundant representations

A redundant representation is one in which a number may have more than one encoding. As we will see later, the freedom to choose among encodings provides a number of advantages, most importantly the ability to perform addition in constant time.
a_i b_i d_i | s_i c_i
 0   0   0  |  0   0
 0   0   1  |  1   0
 0   1   0  |  1   0
 0   1   1  |  0   1
 1   0   0  |  1   0
 1   0   1  |  0   1
 1   1   0  |  0   1
 1   1   1  |  1   1

Fig. 1 Full adder: (a) truth table and (b) symbol.
1.3.1 Signed-digit representation

In a signed-digit (SD) number representation, the digits may have either positive or negative sign. For a radix-2 representation we have x_i ∈ {−1, 0, 1}. Using W_f fractional bits as in (1), the range of X is −1 + Q ≤ X ≤ 1 − Q. A number may now have more than one representation: consider 0.25_{10}, which can be written as .01_{SD} or .1\bar{1}_{SD}, where \bar{1} denotes −1. In some applications, such as digital filters [30], FFTs [4], and DCTs [28], as well as general DSP algorithms [43], it is of interest to find a signed-digit representation with a minimum number of nonzero positions, in order to simplify the corresponding multiplication (see Section 3.5). This is referred to as a minimum signed-digit (MSD) representation. In general, it is non-trivial to determine whether a given representation is minimum. A specific minimum signed-digit representation is obtained if the constraint x_i x_{i+1} = 0, ∀i, is imposed. The resulting representation is called the canonic signed-digit (CSD) representation and is, apart from being minimum, also unique (as the name indicates).

1.3.2 Carry-save representation

The carry-save representation stems from the ripple-carry adder, which will be further discussed in Section 2.1. Instead of propagating the carries in the addition, these bits are stored, and the data is represented using two vectors. This also leaves an additional input of the full adder cells unused, so it is possible to add three vectors. A full adder cell is illustrated in Fig. 1(b) and its truth table is given in Fig. 1(a). From this we can see that

s_i = parity(a_i, b_i, d_i) = a_i ⊕ b_i ⊕ d_i   (3)

where ⊕ is the exclusive-OR operation, and

c_i = majority(a_i, b_i, d_i) = a_i b_i + a_i d_i + b_i d_i.   (4)
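Equations (3) and (4) translate directly into code. The sketch below (with hypothetical helper names) applies one layer of full adders to three bit-vectors, producing the sum and carry vectors of a carry-save result without any carry propagation:

```python
def full_adder(a, b, d):
    """Full adder per (3)-(4): sum is the parity, carry the majority."""
    s = a ^ b ^ d
    c = (a & b) | (a & d) | (b & d)
    return s, c

def carry_save_add(x, y, z):
    """Add three equal-length bit-vectors (LSB first) into a sum vector
    and a carry vector; the carries weigh one position higher."""
    pairs = [full_adder(a, b, d) for a, b, d in zip(x, y, z)]
    s = [p[0] for p in pairs]
    c = [0] + [p[1] for p in pairs]     # shift carries up one weight
    return s, c

s, c = carry_save_add([1], [1], [1])    # 1 + 1 + 1
assert s == [1] and c == [0, 1]         # value 1 + 2 = 3
```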
a_i b_i d_i | s_i c_i | Result
 0   0   0  |  0   0  |  0
 0   0   1  |  1   0  | −1
 0   1   0  |  1   1  |  1
 0   1   1  |  0   0  |  0
 1   0   0  |  1   1  |  1
 1   0   1  |  0   0  |  0
 1   1   0  |  0   1  |  2
 1   1   1  |  1   1  |  1

Fig. 2 Redundant binary adder with two positive and one negative inputs: (a) truth table (Result = a_i + b_i − d_i = 2c_i − s_i) and (b) symbol.
Assuming two's complement representation of the vectors, a number is represented as

X = S + C = -s_0 + \sum_{i=1}^{W_f} s_i 2^{-i} - c_0 + \sum_{i=1}^{W_f} c_i 2^{-i}.   (5)
It is possible to represent numbers in the range −2 ≤ X ≤ 2 − 2Q. However, it is common to let X span the same numerical range as the two's complement representation, since it usually will be converted into a non-redundant representation. The carry-save representation can also be seen as a radix-2 representation with digits x_i ∈ {0, 1, 2, 3}, where x_i = 2c_i + s_i. It is possible to generalize the concept of carry-save representation to binary redundant representations by replacing the relation x_i = 2c_i + s_i with either x_i = 2c_i − s_i, where x_i ∈ {−1, 0, 1, 2}, or x_i = −2c_i + s_i, where x_i ∈ {−2, −1, 0, 1}. It is then possible to use cells corresponding to full adders, but with designated signs of the inputs, such that bits with negative weights are connected to − inputs. This is illustrated in Fig. 2(b), while the computational rules for an adder with two positive and one negative input are shown in Fig. 2(a). Note that both types of representations need to be used, as otherwise there would be an imbalance: each adder produces one positive and one negative vector, but takes as inputs either one positive and two negative or two positive and one negative vectors.
1.4 Shifting and increasing the wordlength

When adding signed numbers it is, for most number representations, important that the vectors have the same lengths. For two's complement representation we must extend the sign by copying the sign-bit at the MSB side of the vector. On the LSB side it is enough to extend with zeros.¹

¹ It is worth noticing that for one's complement the sign-bits are inserted at the LSB side.
The operation of shifting, i.e., multiplying or dividing by a power of two, is in fact the same as increasing the wordlength. Hence, shifting a two's complement number two positions to the right (dividing by 4) requires copying the sign-bit into the two introduced positions. For the carry-save representation, extending the wordlength on the MSB side requires that the sign-bits of the carry and sum vectors are corrected to avoid what is referred to as carry overflow. The origin of this effect is that once the carry and sum vectors are added there is an overflow in the addition. As the resulting carry bit from the MSB position is neglected, the result still remains valid; in fact, this is a required property for the two's complement representation to work. However, if the vectors are shifted before the final addition, the effect of the overflow will occur in an earlier position.
1.5 Negation

Negation is a useful operation in itself, but also as part of a subtraction, Z = X − Y, which can be seen as an addition of the negated value, Z = X + (−Y). To negate a two's complement number, one inverts all the bits and adds a one to the LSB position. As will be seen later on, this does not necessarily have to be performed explicitly, but can instead be integrated with an addition.
1.6 Finite-wordlength effects

In recursive algorithms it is necessary to use nonlinear operations to maintain a finite wordlength. This introduces small re-quantization errors, so-called granularity errors. In addition, very large overflow errors will occur if the finite number range is exceeded. These errors will not only cause distortion, but may also be the cause of parasitic oscillations in recursive algorithms [5, 13, 45]. The effect of these errors depends on many factors, for example, the type of quantization, the algorithm, the type of arithmetic, the representation of negative numbers, and the properties of the input signal [31]. The analysis of the influence of roundoff errors in floating-point arithmetic is very complicated [32, 44, 56], because quantization of products as well as addition and subtraction in floating-point arithmetic cause errors that depend on the signal values. Quantization errors in fixed-point arithmetic are, with few exceptions, independent of the signal values and can therefore be analyzed independently. Lastly, the high dynamic range provided by floating-point arithmetic is not really needed in good filter algorithms, since filter structures with low coefficient sensitivity also utilize the available number range efficiently. Fixed-point arithmetic is predominantly used in application-specific ICs, since the required hardware is much simpler, faster, and consumes less power compared with floating-point arithmetic. We will therefore focus on fixed-point arithmetic.
Fig. 3 Saturation arithmetic (quantized value X_Q versus input X).
1.6.1 Overflow characteristics

A two's complement representation of negative numbers is usually employed in digital hardware. The overflow characteristic of the two's complement representation is shown in Fig. 4. As discussed earlier, the largest and smallest numbers in two's complement representation are 1 − Q and −1, respectively. A two's complement number, x, slightly larger than 1 − Q will be interpreted as x − 2, while a number slightly smaller than −1 will be interpreted as x + 2. Hence, very large overflow errors are incurred. A common scheme to reduce the size of overflow errors and mitigate their harmful effect is to limit numbers outside the normal range to either the largest or smallest representable number. This scheme is referred to as saturation arithmetic and is shown in Fig. 3. Many standard signal processors provide addition and subtraction instructions with inherent saturation. Another saturation scheme, which may be simpler to implement in hardware, is to invert all bits when overflow occurs. Using fixed-point arithmetic, the signal levels need to be adjusted at the input of multiplications with non-integer coefficients. Note that, as discussed earlier, a sum of several numbers using two's complement representation may have intermediate overflows as long as the final sum is within the valid range.
Fig. 4 Overflow characteristic of two's complement arithmetic.
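The two overflow characteristics can be contrasted in a few lines (a sketch; `wrap_2c` and `saturate_2c` are illustrative names, not standard functions):

```python
def wrap_2c(x):
    """Natural two's complement wraparound (Fig. 4): reduce modulo 2."""
    return (x + 1) % 2 - 1

def saturate_2c(x, q):
    """Saturation arithmetic (Fig. 3): clamp to [-1, 1 - q], q = ulp."""
    return min(max(x, -1.0), 1.0 - q)

x = 1.25                      # outside the representable range
print(wrap_2c(x))             # -0.75: a large overflow error
print(saturate_2c(x, 2**-7))  # 0.9921875: the closest representable value
```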
Fig. 5 Error density distributions p(e) for fixed-point arithmetic: rounding (left), truncation (center), and magnitude truncation (right).
1.6.2 Truncation

Quantizing a binary number, X, with infinite wordlength to a number, X_Q, with finite wordlength yields an error

e = X_Q − X.   (6)

Truncation of the binary number is performed by removing the bits with index i > W_f. The resulting error density distribution is shown in the center of Fig. 5. The variance is \sigma_e^2 = Q^2/12 and the mean value is −Q/2, where Q refers to the weight of the last bit position.
1.6.3 Rounding

Rounding is, in practice, performed by adding 2^{-(W_f+1)} to the non-quantized number before truncation. Hence, the quantized number is the nearest approximation to the original number. However, if the wordlength of X is W_f + 1, the quantized number should, in principle, be rounded upwards if the last bit is 1 and downwards if it is 0, in order to make the mean error zero. This special case is often neglected in practice. The resulting error density distribution is shown to the left in Fig. 5. The variance is \sigma_e^2 = Q^2/12 and the mean value is zero.

1.6.4 Magnitude truncation

Magnitude truncation quantizes the number so that

|X_Q| ≤ |X|.   (7)

Hence, e ≤ 0 if X is positive and e ≥ 0 if X is negative. This operation can be performed by adding 2^{-(W_f+1)} before truncation if X is negative and 0 otherwise, that is, in two's complement representation, adding the sign bit to the last position. The resulting error density distribution is shown to the right in Fig. 5. The error analysis of magnitude truncation is very complicated, since the error and the sign of the signal are correlated [31].
Magnitude truncation is needed to suppress parasitic oscillations in wave digital filters [13].

1.6.5 Quantization of products

The effect of a quantization operation, except for magnitude truncation, can be modeled with a white additive noise source that is independent of the signal and with the error density functions shown in Fig. 5. This model can be used if the signal varies from sample to sample over several quantization levels in an irregular way. However, the error density function is a discrete function if both the signal and the coefficient have finite wordlengths. The difference is significant only if a few bits are discarded by the quantization. The mean value and variance of the errors are

m = { (Q_c/2) Q,        rounding
    { ((Q_c − 1)/2) Q,  truncation   (8)

and

\sigma_e^2 = k_e (1 − Q_c^2) Q^2/12   (9)

where

k_e = { 1,          rounding or truncation
      { 4 − 6/\pi,  magnitude truncation   (10)

and Q and Q_c refer to the signal and coefficient quantization steps, respectively. For long coefficient wordlengths the average value is close to zero for rounding and −Q/2 for truncation. Correction of the average value and variance is only necessary for short coefficient wordlengths, for example, for the scaling coefficients.
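The three quantization modes of Sections 1.6.2–1.6.4 can be sketched as follows (Python, floats for brevity; `quantize` is an illustrative name):

```python
from math import floor

def quantize(x, wf, mode):
    """Quantize x to wf fractional bits, with step Q = 2**-wf."""
    q = 2.0 ** -wf
    if mode == "truncate":     # discard bits below Q: mean error -Q/2
        return floor(x / q) * q
    if mode == "round":        # nearest level: zero-mean error
        return floor(x / q + 0.5) * q
    if mode == "magnitude":    # toward zero, so that |XQ| <= |X|
        n = floor(x / q) if x >= 0 else -floor(-x / q)
        return n * q

x = -0.3
print(quantize(x, 2, "truncate"))   # -0.5
print(quantize(x, 2, "round"))      # -0.25
print(quantize(x, 2, "magnitude"))  # -0.25 (|XQ| <= |X|)
```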
2 Addition

The operation of adding two or more numbers is in many ways the most fundamental arithmetic operation, since most other operations in one way or another are based on addition. The methods discussed here concern either two's complement or unsigned representation; the major problem is then to efficiently speed up the carry-propagation in the adder. There are other schemes than those presented here; for more details we refer to, e.g., [57]. It is also possible to perform addition in constant time using redundant number systems, such as the previously discussed signed-digit or carry-save representations. An alternative is to use residue number systems (RNS), which split the carry-chain into several shorter ones [38].
Fig. 6 Ripple-carry adder.
2.1 Ripple-carry addition

Probably the most straightforward way to perform addition of two numbers is to perform bit-by-bit addition using a full adder (FA) cell and propagate the carry bit to the next stage. This is called ripple-carry addition and is illustrated in Fig. 6. This type of adder can add both unsigned and two's complement numbers. However, for two's complement numbers the result must have the same number of bits as the inputs, while for unsigned numbers the carry out acts as a possible additional bit to increase the wordlength. The operation of adding the two two's complement numbers −53/256 and 94/256 is outlined in Fig. 7. The major drawback of the ripple-carry adder is that the worst-case delay is proportional to the wordlength. Also, the ripple-carry adder will typically produce many glitches, due to the full adder cells having to wait for the correct carry. This situation is improved if the delay for the carry bit is smaller than that of the sum bit [23]. However, due to the simple design, the energy per computation is still reasonably small [57].
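A behavioral model of the ripple-carry adder is a direct loop over the full adder cells (a sketch; bit lists are LSB first, unlike the MSB-first indexing used in the text):

```python
def ripple_carry_add(x, y, cin=0):
    """Ripple-carry addition of two equal-length bit lists (Fig. 6)."""
    s, c = [], cin
    for a, b in zip(x, y):
        s.append(a ^ b ^ c)                 # sum bit of the FA cell
        c = (a & b) | (a & c) | (b & c)     # carry into the next cell
    return s, c                             # sum bits and carry out

# 203 + 94 = 297 = 256 + 41: the same bit patterns as -53/256 + 94/256
x = [(203 >> i) & 1 for i in range(8)]      # 203 is -53 in 8-bit 2C
y = [(94 >> i) & 1 for i in range(8)]
s, cout = ripple_carry_add(x, y)
assert sum(b << i for i, b in enumerate(s)) == 41 and cout == 1
```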
Fig. 7 Example of addition in two's complement arithmetic using a ripple-carry adder: −53/256 + 94/256 = 41/256.

2.2 Carry-lookahead addition

To speed up the addition, several different methods have been proposed, see, for instance, [57]. Methods typically referred to as carry-lookahead methods are based on the following observation: the carry output of a full adder sometimes depends on the carry input, and sometimes it is determined without needing the carry input. This is illustrated in Table 1.
Table 1 Cases for carry-propagation in a full adder cell.

a_i b_i | c_{i−1} | Case
 0   0  |  0      | No carry-propagation (kill)
 0   1  |  c_i    | Carry-propagation (propagate)
 1   0  |  c_i    | Carry-propagation (propagate)
 1   1  |  1      | Carry-generation (generate)
Based on this, we can define the propagate signal, p_i, and the generate signal, g_i, as

p_i = a_i ⊕ b_i   and   g_i = a_i b_i.   (11)

Now, the carry output can be expressed as

c_{i−1} = g_i + p_i c_i.   (12)
For the next stage the expression becomes

c_{i−2} = g_{i−1} + p_{i−1} c_{i−1} = g_{i−1} + p_{i−1}(g_i + p_i c_i) = g_{i−1} + p_{i−1} g_i + p_{i−1} p_i c_i.   (13)

For the (N+1)th stage we have

c_{i−(N+1)} = g_{i−N} + p_{i−N} g_{i−(N−1)} + p_{i−N} p_{i−(N−1)} g_{i−(N−2)} + \cdots + p_{i−N} p_{i−(N−1)} \cdots p_i c_i.   (14)

The terms containing g_k and possibly p_k factors are called the group generate, as they together act as a merged generate signal for bits i to i − N. The term p_{i−N} p_{i−(N−1)} \cdots p_i is similarly called the group propagate. Both the group generate and group propagate signals are independent of any carry signal. Hence, (14) shows that it is possible to have the carry propagate N stages with a maximum delay of one AND-gate and one OR-gate, as illustrated in Fig. 8. However, the complexity and delay of the precomputation network grow with N, and, hence, a careful design is required to not make the precomputation the new critical path.

Fig. 8 Illustration of N-stage carry-lookahead carry propagation.

The carry-lookahead approach can be generalized using dot-operators. Adders using dot-operators are often referred to as parallel prefix adders. The dot-operator operates on a pair of generate and propagate signals and is defined as

(g_k, p_k) = (g_i, p_i) \bullet (g_j, p_j) = (g_i + p_i g_j, p_i p_j).   (15)

The group generate from position k to position l, k < l, can be denoted by G_{k:l}, and similarly the group propagate by P_{k:l}. These are then defined as
(G_{k:l}, P_{k:l}) = (g_k, p_k) \bullet (g_{k+1}, p_{k+1}) \bullet \cdots \bullet (g_l, p_l).   (16)

The dot-operator is associative but not commutative. Furthermore, the dot-operation is idempotent, i.e.,

(g_k, p_k) = (g_k, p_k) \bullet (g_k, p_k).   (17)

For the group generate and propagate signals this leads to

(G_{k:n}, P_{k:n}) = (G_{k:l}, P_{k:l}) \bullet (G_{m:n}, P_{m:n}),   k ≤ l, m ≤ n, m ≤ l + 1.   (18)
This is illustrated in Fig. 9. Hence, we can form the group generate and group propagate signals by combining smaller, possibly overlapping, subgroup generate and propagate signals.

Fig. 9 Illustration of the idempotency property for group generate and propagate signals.

The carry signal in position k can be written as

c_k = G_{(k+1):l} + P_{(k+1):l} c_l.   (19)

Similarly, the sum signal in position k for an adder using W_f fractional bits is, according to (3), expressed as

s_k = a_k ⊕ b_k ⊕ d_k = p_k ⊕ (G_{(k+1):W_f} + P_{(k+1):W_f} c_{in}) = p_k ⊕ c_k.   (20)

From this, one can see that it is of interest to compute all group generate and group propagate signals originating from the LSB position, i.e., G_{k:W_f} and P_{k:W_f} for 1 ≤ k ≤ W_f. A straightforward way of obtaining these is the sequential structure shown in Fig. 10. However, the delay will again be linear in the wordlength, as for the ripple-carry adder. Indeed, the adder in Fig. 10 is a ripple-carry adder where the full adder cells explicitly compute p_i and g_i. Based on the properties of the dot-operator, we can instead interconnect the dot-operators such that the depth is reduced, by computing different group generate and propagate signals in parallel. This is illustrated for an eight-bit adder in Fig. 11. This particular structure of interconnecting the dot-operators is referred to as a Ladner-Fischer parallel prefix adder [27].
Fig. 10 Sequential computation of group generate and propagate signals.
Fig. 11 Parallel computation of the group generate and propagate signals.
Often one uses a simplified structure to represent the parallel prefix computation, as shown in Fig. 12a. Comparing Figs. 11 and 12a, it is clear that the dots correspond to dot-operators. In fact, the parallel prefix graph works for any associative operation. Over the years, a multitude of different schemes for parallel prefix addition have been proposed, trading the depth, the number of dot-operations, and the fan-out of
the dot-operator cells. In Figs. 12b–d, three schemes for 16-bit parallel prefix computation are illustrated, namely Ladner-Fischer [27], Kogge-Stone [25], and Brent-Kung [3], respectively. Unified views of all possible parallel prefix schemes have been proposed in [20, 24].

Fig. 12 Different parallel prefix schemes: (a) an 8-bit Ladner-Fischer adder [27] as shown in Fig. 11, and, for 16-bit adders, (b) Ladner-Fischer [27], (c) Kogge-Stone [25], and (d) Brent-Kung [3].
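The prefix computation itself is easy to express in software. The sketch below implements a Kogge-Stone-style prefix over (g, p) pairs (bit lists LSB first; illustrative function name), producing all carries in a logarithmic number of combining levels:

```python
def prefix_adder(x, y, cin=0):
    """Parallel prefix (Kogge-Stone-style) addition sketch; LSB-first lists."""
    n = len(x)
    g = [a & b for a, b in zip(x, y)]           # generate, per (11)
    p = [a ^ b for a, b in zip(x, y)]           # propagate, per (11)
    G, P = g[:], p[:]
    d = 1
    while d < n:                                # about log2(n) levels
        for i in range(n - 1, d - 1, -1):       # (15): combine with span d
            G[i] = G[i] | (P[i] & G[i - d])
            P[i] = P[i] & P[i - d]
        d *= 2
    carries = [cin] + [G[i] | (P[i] & cin) for i in range(n)]
    s = [p[i] ^ carries[i] for i in range(n)]   # (20): s = p xor carry-in
    return s, carries[n]

s, cout = prefix_adder([1, 1], [1, 0])          # 3 + 1
assert s == [0, 0] and cout == 1                # = 4
```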
2.3 Carry-select and conditional sum addition

The fundamental idea of carry-select addition is to split the adder into two or more stages. For all stages except the one with the least significant bits, two adders are used: the incoming carry bit is assumed to be zero for one of the adders and one for the other. Then, a multiplexer is used to select the correct result and carry for the next stage once the incoming carry is known. A two-stage carry-select adder is shown in Fig. 13. The lengths of the stages should be designed such that the delay of a stage is equivalent to the delay of the first stage plus the number of multiplexers that the carry signal passes through. Hence, the actual stage lengths are determined by the relative adder and multiplexer delays, as well as the fan-out of the multiplexer control signals.

For each smaller adder in the carry-select adder, it is possible to apply the same idea of splitting each smaller adder into even smaller adders.
Fig. 13 Two-stage carry-select adder.
If this splitting is applied until only one-bit adders remain, we obtain a conditional sum adder. There is, naturally, a wide range of intermediate adder structures based on the ideas of carry-select and conditional sum adders.
2.4 Multi-operand addition

When several operands are to be added, it is beneficial to avoid several carry-propagations. Especially when there are delay constraints, it is inefficient to use several high-speed adders. Instead, it is common to use a redundant intermediate representation and a fast final carry-propagation adder (CPA). The basic concept is illustrated in Fig. 14.

Fig. 14 Principle of a multi-operand adder.

For performing multi-operand addition, counters, compressors, or a combination of the two can be used. A counter is a logic gate that takes a number of inputs, adds them together, and produces a binary representation of the
output. The simplest counter is the full adder cell shown in Fig. 1. In terms of counters, it is a 3:2 counter, i.e., it has three inputs and produces a two-bit output word. This can be generalized to n:k counters, having n inputs of the same weight and producing a k-bit output corresponding to the number of ones among the inputs. Clearly, n and k must satisfy n ≤ 2^k − 1, or, equivalently, k ≥ \lceil \log_2(n + 1) \rceil. A compressor, on the other hand, does not produce a valid binary count of the number of input bits. It does reduce the number of partial products, but at the same time has several incoming and outgoing carries. The output carries should be generated without any dependence on the input carries. The most frequently used compressor is the 4:2 compressor shown in Fig. 15, which there is realized using full adders. Clearly, there is no major advantage in using 4:2 compressors implemented as in Fig. 15 compared to using 3:2 counters (full adders). However, other realizations are possible. These should satisfy
(21)
with c_{out} independent of c_{in}. There exist realizations with lower logic depth than the full adder based one, and, hence, the total delay of the multi-operand addition may be reduced using 4:2 compressors. It is important to note that an n:k counter or compressor reduces the number of bits in the computation by exactly n − k. Hence, it is easy to estimate the required number of counters and compressors for any addition, if the original number of bits to be added and the number of output bits are known. It should also be noted that, depending on the actual structure, it is typically impossible to use only one type of compressor or adder. Specifically, half adders (or 2:2 counters) may sometimes be needed, despite not reducing the number of bits, to move bits to the correct weights for further additions.
Fig. 15 4:2 compressor composed of full adders (3:2 counters).
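A software model of multi-operand reduction keeps one list of bits per weight and applies 3:2 counters until at most two bits remain per column; what is left is then summed by the final CPA. A minimal greedy sketch (illustrative name; no half adders or level optimization):

```python
def reduce_to_two_rows(columns):
    """Reduce columns[w] (bits of weight 2**w) using 3:2 counters only."""
    while any(len(col) > 2 for col in columns):
        nxt = [[] for _ in range(len(columns) + 1)]
        for w, col in enumerate(columns):
            while len(col) >= 3:                 # one full adder: 3 bits -> 2
                a, b, d = col.pop(), col.pop(), col.pop()
                nxt[w].append(a ^ b ^ d)                          # sum, weight w
                nxt[w + 1].append((a & b) | (a & d) | (b & d))    # carry, w + 1
            nxt[w].extend(col)                   # pass leftover bits through
        columns = nxt
    return columns

cols = reduce_to_two_rows([[1, 1, 1, 1], [1, 1, 1]])  # value 4 + 3*2 = 10
total = sum(b << w for w, col in enumerate(cols) for b in col)
assert total == 10 and all(len(c) <= 2 for c in cols)
```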
Fig. 16 Partial product array for unsigned multiplication.
3 Multiplication

The process of multiplication can be divided into three steps: partial product generation, which determines a number of bits to be added; summation of the generated partial products; and, since many summation structures produce redundant results, a final carry-propagation addition, usually called vector merging addition (VMA).
3.1 Partial product generation

For unsigned binary representation, the partial product generation can be readily realized using AND-gates computing the bit-wise multiplications, as

Z = XY = \left( \sum_{i=1}^{W_{fX}} x_i 2^{-i} \right) \left( \sum_{j=1}^{W_{fY}} y_j 2^{-j} \right) = \sum_{i=1}^{W_{fX}} \sum_{j=1}^{W_{fY}} x_i y_j 2^{-i-j}.   (22)
This leads to a partial product array as shown in Fig. 16. For two's complement data the result is very similar, except that the sign-bits cause some of the partial products to have negative sign. This can be seen from

Z = XY = \left( -x_0 + \sum_{i=1}^{W_{fX}} x_i 2^{-i} \right) \left( -y_0 + \sum_{j=1}^{W_{fY}} y_j 2^{-j} \right)
       = x_0 y_0 − x_0 \sum_{j=1}^{W_{fY}} y_j 2^{-j} − y_0 \sum_{i=1}^{W_{fX}} x_i 2^{-i} + \sum_{i=1}^{W_{fX}} \sum_{j=1}^{W_{fY}} x_i y_j 2^{-i-j}.   (23)

The corresponding partial product array is shown in Fig. 17.
Fig. 17 Partial product array for two's complement multiplication.
Fig. 18 Partial product array without sign-extension.
3.1.1 Avoiding sign-extension

As previously stated, the wordlengths of two two's complement numbers should be equal when performing an addition or subtraction. Hence, the straightforward way of dealing with the varying wordlengths in two's complement multiplication is to sign-extend the partial results so that all rows obtain the same wordlength. One way to avoid this excessive sign-extension is to perform the summation from top to bottom and sign-extend each partial result only to match the next row to be added. This is further elaborated in Sections 3.2.1 and 3.2.2. However, if we want to be able to add the partial products in an arbitrary order using a multi-operand adder, as discussed in Section 2.4, the following technique, proposed initially by Baugh and Wooley [1], can be used. Note that for a negative partial product we have −p = \bar{p} − 1. Hence, we can replace all negative partial products with inverted versions. Then, we need to subtract a constant value from the result; as there will be several such constants, one from each negative partial product, we can sum them up and form a single compensation vector to be added. When this is applied we obtain the partial product array shown in Fig. 18.

3.1.2 Reducing the number of rows

As discussed in Section 1.3.1, it is possible to reduce the number of nonzero positions by using a signed-digit representation. It would be possible to use, e.g., a CSD representation to obtain a minimum number of nonzeros. However, the drawback is that the conversion from two's complement to CSD requires carry-propagation.
Fig. 19 Generation of partial products for radix-4 modified Booth encoding.
Furthermore, the worst case is that half of the positions are nonzero, and, hence, one would still need to design the multiplier to deal with this case. Instead, it is possible to derive a signed-digit representation that is not necessarily minimum, but has at most half of the positions nonzero. This is referred to as modified Booth encoding [33] and is often described as a radix-4 signed-digit representation where the recoded digits r_i ∈ {−2, −1, 0, 1, 2}. An alternative interpretation is a radix-2 signed-digit representation where d_i d_{i−1} = 0 for i ∈ {W_f, W_f−2, W_f−4, ...}. The logic rules for performing the modified Booth encoding are based on the idea of finding strings of ones and replacing them according to 011...11 = 100...0\bar{1}; the rules are given in Table 2. From this, one can see that there is at most one nonzero digit in each pair of digits (d_{2k} d_{2k+1}). Now, to perform the multiplication, we must be able to negate the operand and to multiply it by 0, 1, or 2. This can conceptually be performed as in Fig. 19. As discussed earlier, the negation is typically performed by inverting the bits and adding a one in the column corresponding to the LSB position. The partial product array for a multiplier using the modified Booth encoding is shown in Fig. 20.
Table 2 Rules for the radix-4 modified Booth encoding.

x_{2k} x_{2k+1} x_{2k+2} | r_k | d_{2k} d_{2k+1} | Description
  0      0        0      |  0  |  0 0            | String of zeros
  0      0        1      |  1  |  0 1            | End of ones
  0      1        0      |  1  |  0 1            | Single one
  0      1        1      |  2  |  1 0            | End of ones
  1      0        0      | −2  |  \bar{1} 0      | Start of ones
  1      0        1      | −1  |  0 \bar{1}      | Start and end of ones (\bar{1}0 + 01)
  1      1        0      | −1  |  0 \bar{1}      | Start of ones
  1      1        1      |  0  |  0 0            | String of ones
It is possible to use the modified Booth encoding with higher radices than four. However, that requires the computation of non-trivial multiples, such as 3 for radix-8, and 3, 5, and 7 for radix-16. The number of rows is reduced roughly by a factor of k for the radix-2^k modified Booth encoding.

3.1.3 Reducing the number of columns

It is common that the result of a multiplication is quantized to be represented with fewer bits than the full-precision result. To reduce the complexity of the multiplication in these cases, it has been proposed to perform the quantization already at the partial product stage [29]. This is commonly referred to as fixed-width multiplication, referring to the fact that (most of) the partial product rows have the same width. Simply truncating the partial products will result in a rather large error; several methods have therefore been proposed to compensate for the introduced error [41].
3.2 Summation structures

The problem of summing up the partial products can be solved in three general ways: sequential accumulation, where a subset of the partial products is accumulated in each cycle; array accumulation, which gives a regular structure; and tree accumulation, which gives the smallest logic depth but, in general, an irregular structure.

3.2.1 Sequential accumulation
In so-called add-and-shift multipliers, the partial bit-products are generated sequentially and successively accumulated as generated. This type of multiplier is therefore slow, but the required chip area is small. The accumulation can be done using any of the bit-parallel adders discussed above, or using digit-serial or bit-serial accumulators. A major advantage of bit-serial over bit-parallel arithmetic is that it significantly reduces the chip area. This is done in two ways. First, it eliminates wide buses and simplifies the wire routing. Second, by using small processing elements,
Fig. 20 Partial product array for a radix-4 modified Booth encoded multiplier.
Fig. 21 Serial/parallel multiplier based on carry-save adders.
the chip itself will be smaller and will require shorter wiring. A small chip can support higher clock frequencies and is therefore faster.

Two's complement and binary offset representations are suitable for DSP algorithms implemented with bit-serial arithmetic, since the bit-serial operations can be done without knowing the sign of the numbers involved. Figure 21 shows a five-bit serial/parallel multiplier where the bit-products are generated row-wise. In a serial/parallel multiplier, the multiplicand X arrives bit-serially, while the multiplier a is applied in a bit-parallel format. Many different schemes for bit-serial multipliers have been proposed. They differ mainly in the order in which bit-products are generated and accumulated, and in the way subtraction is handled.

Addition of the first set of partial bit-products starts with the products corresponding to the LSB of X. Thus, in the first time slot, at bit x_{W_f}, we simply add a · x_{W_f} to the initially cleared accumulator. Next, the D flip-flops are clocked, the sum-bits from the FAs are shifted one bit to the right, each carry-bit is saved and added to the full adder in the same stage, the sign-bit is copied, and one bit of the product is produced at the output of the accumulator. These operations correspond to multiplying the accumulator contents by 2^{-1}. In the following clock cycle, the next bit of X is used to form the next set of bit-products, which are added to the value in the accumulator, and the value in the accumulator is again divided by 2. This process continues for W_f clock cycles, until the sign bit x_0 is reached, whereupon a subtraction must be done instead of an addition.

During the first W_f clock cycles, the least significant part of the product is computed and the most significant part is stored in the D flip-flops. In the next W_f clock cycles, zeros are therefore applied to the input, so that the most significant part of the product is shifted out of the multiplier. Note that the accumulation of the bit-products is performed using a redundant representation, which is converted into a non-redundant representation in the last stage of the multiplier. A digit-serial multiplier, which accumulates several bits in each stage, can be obtained either via unfolding of a bit-serial multiplier or via folding of a bit-parallel multiplier.
Fig. 22 Baugh-Wooley’s multiplier.
3.2.2 Array accumulation

Array multipliers use an array of almost identical cells for generation and accumulation of the bit-products. Figure 22 shows a realization of the Baugh-Wooley multiplier [1], with a multiplication time proportional to 2W_f.

3.2.3 Tree accumulation

The array structure provides regularity, but at the same time the delay grows linearly with the wordlength. Considering Figs. 18 and 20, both provide a number of partial products that should be accumulated. As mentioned earlier, it is common to accumulate the partial products until there are at most two partial products of each weight, and then use a fast carry-propagation adder for the final step. In Section 2.4 the problem of adding a number of bits was considered. Here, we will focus on structures using full and half adders (or 3:2 and 2:2 counters), although other structures using different types of counters and compressors have been proposed.

The first approach is to introduce as many full adders as possible, to reduce as many partial products as possible, and then add as many half adders as possible to minimize the number of levels and shorten the wordlength of the vector merging adder. This approach is roughly the Wallace tree proposed in [52]. The main drawback of this approach is an excessive use of half adders.
Fig. 23 Operation on bits in a dot diagram with (a) full adder and (b) half adder.
Dadda [7] instead proposed that full and half adders should only be used if required to obtain a number of partial products equal to a value in the Dadda series. The value at position n in the Dadda series is the maximum number of partial products that can be reduced using n levels of full adders; the series starts {3, 4, 6, 9, 13, 19, ...}. The benefit of this is that the number of half adders is significantly reduced while still obtaining a minimum number of levels. However, the length of the vector merging adder increases. A compromise between these two approaches is the Reduced Area heuristic [2], where, similarly to the Wallace tree, as many full adders as possible are introduced in each level. Half adders, on the other hand, are only introduced if required to reach a number of partial products corresponding to a value in the Dadda series, or if the least significant weight with more than one partial product is represented with exactly two partial products. In this way a minimum number of stages is obtained, while at the same time both the length of the vector merging adder and the number of half adders are kept small.

To illustrate the operation of the reduction tree approaches, we use dot diagrams, where each dot corresponds to a bit (partial product) to be added. Bits with the same weight are in the same column, and bits in adjacent columns have one position higher or lower weight, with higher weights to the left. The bits are manipulated by either full or half adders, whose operation is illustrated in Fig. 23. The reduction schemes are exemplified for an unsigned 6×6-bit multiplication in Fig. 24, and the complexity results are summarized in Table 3. It should be noted that the positioning of the results in the next level is chosen for ease of illustration. From a functional point of view this step is arbitrary, but it is possible to optimize the timing by carefully utilizing the different delays of the sum and carry outputs of the adder cells [37]. Furthermore, it is possible to reduce the power consumption by optimizing the interconnect ordering [39].

The reduction trees in Fig. 24 do not provide any regularity. This means that the routing is complicated and may become the limiting factor in an implementation. Reduction structures that provide a more regular routing, but still a small number of stages, include the Overturned stairs reduction tree [35] and the HPM tree [12].
Table 3 Complexity of the three reduction trees in Fig. 24.

Tree structure   | Full adders | Half adders | VMA length
Wallace [52]     | 16          | 13          | 8
Dadda [7]        | 15          | 5           | 10
Reduced area [2] | 18          | 5           | 7
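The Dadda series follows the recurrence d_{j+1} = ⌊3 d_j / 2⌋ with d_1 = 3, since one level of full adders reduces at most 3 rows to 2. A few lines reproduce the series quoted above (an illustrative sketch):

```python
def dadda_series(levels):
    """Maximum number of rows reducible with the given number of FA levels."""
    d = [3]
    for _ in range(levels - 1):
        d.append(d[-1] * 3 // 2)        # d_{j+1} = floor(3 * d_j / 2)
    return d

assert dadda_series(6) == [3, 4, 6, 9, 13, 19]
```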
Fig. 24 Reduction trees for a 6×6-bit multiplier: (a) Wallace [52], (b) Dadda [7], and (c) Reduced area [2].
3.3 Vector merging adder

The role of the vector merging adder is to add the outputs of the reduction tree. In general, any carry-propagation adder can be used, e.g., those presented in Section 2. However, the different input signals to the adder will typically be available at different delays from the multiplier inputs. It is therefore possible to derive carry-propagation adders that utilize these different signal arrival times to optimize the adder delay [47].
3.4 Multiply-accumulate

In many DSP algorithms, computations of the form Z = XY + A are common. These can be effectively implemented by simply adding another row, corresponding to A, to the partial product array. In many cases this will not increase the number of levels required. For sequential operation, the modification of the first stage of the serial/parallel multiplier shown in Fig. 25 makes it possible to add an input Z to the product at the same level of significance as X.
3.5 Multiplication by a constant

When the multiplier coefficient is known, it is possible to reduce the complexity of the corresponding circuit. First of all, no circuitry is required to generate the partial products. Second, there will in general be fewer partial products to add. This is easily realized by considering the partial product array in Fig. 16: for the coefficient bits that are zero, all the corresponding partial product bits will also be zero, and,
Fig. 25 Serial/parallel multiplier-accumulator.
hence, these need not be added. To obtain more zero positions, a minimum signed-digit representation such as CSD is useful.

It is also possible to utilize redundancy in the computations to further reduce the complexity. How this is done in detail depends on which type of addition is assumed to be the basic operation. As addition and subtraction have the same complexity, we will refer to both as addition. In the following we assume carry-propagation addition, i.e., two inputs and one output, realized in any of the ways discussed in Section 2. Furthermore, for ease of exposition, we assume that standard sign-extension is used. For carry-save addition we refer to [17, 18].

Consider a signed-digit representation of a multiplier coefficient X as in (1), with x_i ∈ {−1, 0, 1}. Each nonzero position will produce a partial result row, and these partial result rows can be added in an arbitrary order. Now, if the same pattern of nonzero positions, called a subexpression, occurs in more than one position of the representation, we only need to compute the corresponding partial result once and use it for all the instances where it is required. Figures 26a and 26b show examples of multiplication with the constant 231/256 = 1.00\bar{1}0100\bar{1} without and with utilizing redundancy, respectively. In this case we extract the subexpression 100\bar{1}, but we might just as well have chosen 10001 and subtracted one of the subexpressions, as shown in Fig. 26c. This can be performed in a systematic way, as described below.

Fig. 26 Constant multiplication with 231/256 = 1.00\bar{1}0100\bar{1} using (a) no sharing, (b) sharing of the subexpression 100\bar{1}, and (c) sharing of the subexpression 10001.

First, however, we note that if we multiply the same data with several constant coefficients, as in a transposed direct form FIR filter [55], the different coefficients can share subexpressions. The systematic approach is as follows [21, 42]:

1. Represent the coefficients in a given representation.
2. For each coefficient, find and count possible subexpressions. A subexpression is characterized by the difference in nonzero positions and by whether the nonzeros have the same or opposite signs.
3. If there are common subexpressions, select one and replace instances of it by introducing a new symbol in place of the subexpression.
The most common approach is to select the most frequent subexpression, thus applying a greedy optimization strategy, and replace all instances of it; it should be noted, though, that from a global optimality point of view this is not always best.
4. If any subexpressions were replaced, go to Step 2; otherwise the algorithm is done.

There are a number of sources of possible suboptimality in the procedure described above. The first is that the results are representation dependent, and, hence, it will in general reduce the complexity to try several different representations. It may seem to make sense to use an MSD representation, as it has few nonzero positions to start with. However, the more nonzeros, the more likely it is that common subexpressions exist. For the single-constant multiplication case this has been utilized in a systematic algorithm that searches all representations with the minimum number of nonzero digits plus k additional nonzero digits [19]. The second source is the selection of the subexpression to replace. It is common to select the most frequent one, applying a greedy strategy, and replace all its instances. However, it has been shown that from a global optimization point of view it is not always beneficial to replace all instances. Another issue is which subexpression to choose if more than one is equally frequent.

For single constant coefficients there is an optimal approach, based on searching all possible interconnections of a given number of adders, presented in [19]. It is shown that multiplication with all coefficients up to nineteen bits can be realized using at most five additions, compared to up to nine additions using a straightforward CSD realization without sharing. The optimal approach avoids the issue of representation dependence by only considering the decimal value at each addition, independently of the underlying representation. For the multiple constant multiplication case, several effective algorithms that avoid the problem of representation dependence have been proposed over the years [14, 51]. Theoretical lower bounds for related problems have been presented in [15].
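For reference, CSD digits can be generated by a simple scan from the LSB: a nonzero digit is emitted whenever the remaining value is odd, with its sign chosen so that two adjacent nonzeros never occur. A sketch for integer-valued coefficients (fractional coefficients are first scaled by 2^{W_f}; illustrative function name):

```python
def to_csd(n):
    """Canonic signed-digit digits of the integer n, LSB first,
    with digits in {-1, 0, 1} and no two adjacent nonzeros."""
    digits = []
    while n != 0:
        if n & 1:
            d = 2 - (n & 3)     # n mod 4 == 1 -> +1, n mod 4 == 3 -> -1
            digits.append(d)
            n -= d              # leaves n even, so the next digit is 0
        else:
            digits.append(0)
        n >>= 1
    return digits

# 231 = 256 - 32 + 8 - 1, i.e., the CSD pattern of 231/256 in Fig. 26
assert to_csd(231) == [-1, 0, 0, 1, 0, -1, 0, 0, 1]
```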
Table 4 Look-up table.

x_1 x_2 x_3 | F_k             | F_k (two's complement)
 0   0   0  | 0               | (0.0000000)_{2C}
 0   0   1  | a_3             | (1.1110101)_{2C}
 0   1   0  | a_2             | (0.1010101)_{2C}
 0   1   1  | a_2 + a_3       | (0.1001010)_{2C}
 1   0   0  | a_1             | (0.0100001)_{2C}
 1   0   1  | a_1 + a_3       | (0.0010110)_{2C}
 1   1   0  | a_1 + a_2       | (0.1110110)_{2C}
 1   1   1  | a_1 + a_2 + a_3 | (0.1101011)_{2C}
3.6 Distributed arithmetic

Distributed arithmetic is an efficient scheme for computing inner products between a fixed and a variable data vector,

Y = a^T X = \sum_{i=1}^{N} a_i X_i.   (24)

The basic principle is due to Croisier et al. [6]. The inner product can be rewritten using two's complement representation as

Y = \sum_{i=1}^{N} a_i \left( -x_{i0} + \sum_{k=1}^{W_f} x_{ik} 2^{-k} \right)   (25)

where x_{ik} is the kth bit of X_i. By interchanging the order of the two summations we get

Y = -\sum_{i=1}^{N} a_i x_{i0} + \sum_{k=1}^{W_f} \left( \sum_{i=1}^{N} a_i x_{ik} \right) 2^{-k}   (26)
  = -F_0(x_{10}, x_{20}, \ldots, x_{N0}) + \sum_{k=1}^{W_f} F_k(x_{1k}, x_{2k}, \ldots, x_{Nk}) 2^{-k}   (27)

where

F_k(x_{1k}, x_{2k}, \ldots, x_{Nk}) = \sum_{i=1}^{N} a_i x_{ik}.   (28)

F_k is a function of N binary variables, the ith variable being the kth bit of X_i. Since F_k can take on only 2^N values, it can be precomputed and stored in a look-up table. For example, consider the inner product Y = a_1 X_1 + a_2 X_2 + a_3 X_3, where a_1 = (0.0100001)_{2C}, a_2 = (0.1010101)_{2C}, and a_3 = (1.1110101)_{2C}. Table 4 shows the function F_k and the corresponding addresses. Figure 27 shows a realization of (27) by Horner's method²

² William Horner, 1819. In fact, this method was first derived by Isaac Newton in 1711.
Y = ((\cdots((0 + F_{W_f})2^{-1} + \cdots + F_2)2^{-1} + F_1)2^{-1} − F_0).   (29)
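The table lookup and the Horner recurrence of (29) are easy to emulate in software; the sketch below (exact rational arithmetic via Python's fractions module; hypothetical function name) reproduces (27) for arbitrary coefficients:

```python
from fractions import Fraction

def da_inner_product(a, X, wf):
    """Distributed arithmetic per (27)/(29): a are the fixed coefficients,
    X the inputs in [-1, 1), all with wf fractional bits."""
    N = len(a)
    def bit(x, k):          # two's complement bit k of x; k = 0 is the sign
        v = int(Fraction(x) * 2 ** wf) % 2 ** (wf + 1)
        return (v >> (wf - k)) & 1
    F = {addr: sum((a[i] for i in range(N) if addr & (1 << i)), Fraction(0))
         for addr in range(2 ** N)}                 # precomputed table (28)
    y = Fraction(0)
    for k in range(wf, 0, -1):                      # LSB first, divide by 2
        y = (y + F[sum(bit(X[i], k) << i for i in range(N))]) / 2
    return y - F[sum(bit(X[i], 0) << i for i in range(N))]   # sign-bit cycle

a = [Fraction(1, 2), Fraction(-3, 8)]
X = [Fraction(1, 4), Fraction(-1, 2)]
assert da_inner_product(a, X, 8) == sum(c * x for c, x in zip(a, X))
```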
The inputs X_1, X_2, ..., X_N are shifted bit-serially out of the shift registers, least-significant bit first. The bits x_{ik} are used to address the look-up table. Since the output is divided by 2, through the inherent shift, the circuit is called a shift-accumulator [55]. Computation of the inner product requires W_f + 1 clock cycles. In the last cycle, F_0 is subtracted from the accumulator register. Notice the resemblance to a shift-and-add implementation of a real multiplication.

Fig. 27 Block diagram for distributed arithmetic.

A more parallel form of distributed arithmetic can also be realized by allocating several tables. The tables, which are identical, may be addressed in parallel, and their outputs, appropriately shifted, are added.

3.6.1 Reducing the memory size

The memory requirement becomes very large for long inner products. There are mainly two ways to reduce it. One is to partition the memory into smaller pieces whose outputs are added before the shift-accumulator, as shown in Fig. 28. The amount of memory is in this case reduced from 2^N words to 2 · 2^{N/2} words. For example, for N = 10 we get 2^{10} = 1024 words, which is reduced to only 2 · 2^5 = 64 words at the expense of an additional adder.

Fig. 28 Reducing the memory by partitioning.

The memory size can be halved by using the ingenious scheme [6] based on the identity X = \frac{1}{2}[X − (−X)], which can be rewritten
X = -(x_0 - \bar{x}_0) 2^{-1} + \sum_{k=1}^{W_f} (x_k - \bar{x}_k) 2^{-k-1} - 2^{-(W_f+1)}.   (30)
Notice that (x_k − \bar{x}_k) can only take on the values −1 or +1. Inserting this expression into (24) yields
Table 5 Look-up table contents.

x_1 x_2 x_3 | F_k               | u_1 u_2 | A/S
 0   0   0  | −a_1 − a_2 − a_3  |  0   0  |  A
 0   0   1  | −a_1 − a_2 + a_3  |  0   1  |  A
 0   1   0  | −a_1 + a_2 − a_3  |  1   0  |  A
 0   1   1  | −a_1 + a_2 + a_3  |  1   1  |  A
 1   0   0  | +a_1 − a_2 − a_3  |  1   1  |  S
 1   0   1  | +a_1 − a_2 + a_3  |  1   0  |  S
 1   1   0  | +a_1 + a_2 − a_3  |  0   1  |  S
 1   1   1  | +a_1 + a_2 + a_3  |  0   0  |  S
Y = \sum_{k=1}^{W_f} F_k(x_{1k}, \ldots, x_{Nk}) 2^{-k-1} - F_0(x_{10}, \ldots, x_{N0}) 2^{-1} + F(0, \ldots, 0) 2^{-(W_f+1)}   (31)

where

F_k(x_{1k}, x_{2k}, \ldots, x_{Nk}) = \sum_{i=1}^{N} a_i (x_{ik} - \bar{x}_{ik}).   (32)
The function F_k is shown in Table 5 for N = 3. Notice that only half of the values are needed, since the other half can be obtained by changing the signs. To exploit this redundancy, we make the address modification u_1 = x_1 ⊕ x_2, u_2 = x_1 ⊕ x_3, and A/S = x_1 ⊕ x_{sign-bit}, where X_1 has been selected as the control variable [55]. The control signal x_{sign-bit} is zero at all times except when the sign bit arrives. Figure 29 shows the resulting realization with halved look-up table. The XOR gates used for halving the memory can be merged with the XOR gates that are needed for inverting F_k.
3.6.2 Complex multipliers

A complex multiplication requires three real multiplications and some additions, but only two distributed arithmetic units, which from area, speed, and power consumption points of view are comparable to the real multiplier. Let X = A + jB and K = c + jd, where K is the fixed coefficient and X is a variable. Now, the product of the two complex numbers can be written

KX = (cA − dB) + j(dA + cB)
   = \{ -c(a_0 - \bar{a}_0)2^{-1} + \sum_{k=1}^{W_f} c(a_k - \bar{a}_k)2^{-k-1} - c 2^{-(W_f+1)} \}
   - \{ -d(b_0 - \bar{b}_0)2^{-1} + \sum_{k=1}^{W_f} d(b_k - \bar{b}_k)2^{-k-1} - d 2^{-(W_f+1)} \}
   + j\{ -d(a_0 - \bar{a}_0)2^{-1} + \sum_{k=1}^{W_f} d(a_k - \bar{a}_k)2^{-k-1} - d 2^{-(W_f+1)} \}
   + j\{ -c(b_0 - \bar{b}_0)2^{-1} + \sum_{k=1}^{W_f} c(b_k - \bar{b}_k)2^{-k-1} - c 2^{-(W_f+1)} \}
   = F_1(\bar{a}_0, \bar{b}_0)2^{-1} + \sum_{k=1}^{W_f} F_1(a_k, b_k)2^{-k-1} + F_1(0, 0)2^{-(W_f+1)}
   + j\{ F_2(\bar{a}_0, \bar{b}_0)2^{-1} + \sum_{k=1}^{W_f} F_2(a_k, b_k)2^{-k-1} + F_2(0, 0)2^{-(W_f+1)} \}.
Hence, the real and imaginary parts of the product can be computed using just two distributed arithmetic units. The contents of the look-up table that stores F_1 and F_2 are shown in Table 6.
Fig. 29 Distributed arithmetic with reduced memory.
Table 6 ROM contents for the complex multiplier.

a_i b_i | F_1      | F_2
 0   0  | −(c − d) | −(c + d)
 0   1  | −(c + d) |  (c − d)
 1   0  |  (c + d) | −(c − d)
 1   1  |  (c − d) |  (c + d)
Fig. 30 Block diagram for a complex multiplier.
Obviously, only two coefficient values are needed, (c + d) and (c − d). If a_k ⊕ b_k = 1, the coefficient values are applied directly to the accumulators, and if a_k ⊕ b_k = 0, the coefficient values are interchanged and added or subtracted depending on the data bits a_k and b_k. The realization is shown in Fig. 30.
4 Division

Of the four basic arithmetic operations, division is the most complex to compute. Furthermore, the result of a division consists of two components, the quotient, Z, and the remainder, R, such that

X = ZD + R   (33)

where X is the dividend, D ≠ 0 is the divisor, and |R| < |D|. By definition, the sign of the remainder should be the same as that of the dividend. For the quotient to be a fractional number we must have |X/D| ≤ 1, which can always be ensured by shifting the dividend and/or divisor. For ease of exposition we initially consider unsigned data, and introduce signed data along the way. For further information on the methods presented here, and on others, we refer to [10].
4.1 Restoring and nonrestoring division

The simplest way to perform a division is to sequentially shift the dividend one position (multiply it by two) and check whether the shifted dividend has larger magnitude than the divisor. If so, the corresponding quotient bit is one and we subtract the divisor from the shifted dividend. Conceptually, the comparison can be made by first subtracting the divisor and then checking whether the result is positive (quotient bit one) or negative (quotient bit zero). If the result is negative, we need to add the divisor back, which gives the name restoring division. The computation in step i can be written as

r_i = 2r_{i−1} − z_i D   (34)

where r_0 = X. Thus, if 2r_{i−1} − D is positive we set z_i = 1; otherwise z_i = 0 and r_i = 2r_{i−1}. Here, r_i is the remainder after iteration i and, considering (33), we have R = r_i 2^{-i}. To compute a quotient with W_f fractional bits, W_f iterations of (34) are obviously required.

Instead of restoring the remainder by adding the divisor, we can assign a negative quotient digit. This gives the nonrestoring division selection rule for the quotient digits, z_i, in (34):

z_i = {  1,  r_{i−1} D ≥ 0, i.e., same signs
      { −1,  r_{i−1} D < 0, i.e., different signs.   (35)

Note that with this definition of the selection rule, the remainder will sometimes be positive and sometimes negative. Hence, division with a signed dividend and/or divisor is naturally covered by this approach. It also means that the final remainder does not always have the same sign as the dividend; in that case, we must compensate by adding or subtracting D to R and, correspondingly, subtracting or adding one LSB to Z.

The result of the nonrestoring division is represented with digits z_i ∈ {−1, 1}. This representation is sometimes called nega-binary and is in fact not a redundant representation. In most cases the result should be converted into two's complement representation. One can naturally convert it by forming one word from the positive digits and one from the negative digits and subtracting the latter from the former. However, then all digits must be computed before the conversion can be done. Instead, it is possible to use the on-the-fly conversion technique in [9] to convert the digits into bits as they are computed. Another consequence of the nega-binary representation is that if a zero remainder is obtained, it will not remain zero in the succeeding iterations. Hence, a zero remainder should be detected, and either the iterations stopped or the result corrected at the end.
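A bit-exact model of the nonrestoring recurrence, including the final remainder correction, can be written with exact rationals (a sketch under the stated |X| ≤ |D| assumption; illustrative function name):

```python
from fractions import Fraction

def nonrestoring_divide(X, D, wf):
    """Nonrestoring division per (34)-(35); returns quotient and remainder."""
    r, Z = Fraction(X), Fraction(0)
    for i in range(1, wf + 1):
        z = 1 if (r >= 0) == (D >= 0) else -1   # (35): same signs -> +1
        r = 2 * r - z * D                       # (34)
        Z += Fraction(z, 2 ** i)
    R = r / 2 ** wf                             # R = r_i * 2**-i
    if R != 0 and (R >= 0) != (X >= 0):         # remainder sign correction
        if (D >= 0) == (X >= 0):
            R, Z = R + D * Fraction(1, 2 ** wf), Z - Fraction(1, 2 ** wf)
        else:
            R, Z = R - D * Fraction(1, 2 ** wf), Z + Fraction(1, 2 ** wf)
    return Z, R

Z, R = nonrestoring_divide(Fraction(1, 4), Fraction(1, 2), 3)
assert Z == Fraction(1, 2) and R == 0           # X = Z*D + R
```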
4.2 SRT division

The SRT division scheme extends the nonrestoring scheme by allowing 0 as a quotient digit. Furthermore, by restricting the divisor to the range 1/2 ≤ |D| < 1, which can be obtained by shifting, the selection rule for the quotient digit in (34) can be defined as

z_i = {  1,  |2r_{i−1}| ≥ 1/2 and r_{i−1} D ≥ 0
      {  0,  |2r_{i−1}| < 1/2                      (36)
      { −1,  |2r_{i−1}| ≥ 1/2 and r_{i−1} D < 0

This has two main advantages: firstly, when z_i = 0 there is no need to add or subtract; secondly, comparing with 1/2 or −1/2 only requires two bits of 2r_{i−1}. There exist slightly improved selection rules that further reduce the number of additions/subtractions. However, the number of iterations is still W_f for a quotient with W_f fractional bits.
4.3 Speeding up division

While the number of additions/subtractions is reduced in the SRT scheme, exploiting this for speed would require an asynchronous circuit, which in many situations is not wanted. Instead, to reduce the number of cycles one can use division with a higher radix. Using radix b = 2^m reduces the number of iterations to W_f/m. The iteration is now

r_i = br_{i−1} − z_i D    (37)

where z_i ∈ {0, 1, ..., b − 1} for restoring division. For SRT division z_i ∈ {−a, −a+1, ..., −1, 0, 1, ..., a}, where (b−1)/2 ≤ a ≤ b−1. The selection rules can be defined in several different ways, similar to the radix-2 case discussed earlier. We can guarantee convergence by selecting the quotient digit such that |r_i| < |D|, which typically implies maximizing the magnitude of the quotient digit.

For SRT division it is possible to select the redundancy of the representation through a. Higher redundancy leads to a larger overlap between the regions where either of two different quotient digits can be selected. Having an overlap means that the breakpoints can be chosen such that only a few bits of r_i and D need to be compared. However, higher redundancy also means that more multiples of D need to be precomputed for the comparisons. Hence, there is a trade-off between the number of bits that need to be compared and the precomputations for the comparisons.

Even though higher-radix division reduces the number of iterations, each iteration still needs to be performed sequentially. In each step the current remainder must be known and the quotient digit selected before it is possible to start a new step. There are two different ways to overcome this limitation. First, it is possible to overlap the computation of the partial remainder in step i with the selection of the quotient digit in step i + 1. This is possible since not all bits of the remainder must
be known to select the next quotient digit. Second, the remainder can be computed in a redundant number system.
4.4 Square root extraction

Computing the square root is in some ways similar to division, as one can sequentially iterate on a remainder, initialized to the radicand, r_0 = X, together with the partially computed square root Z_i = \sum_{k=1}^{i} z_k 2^{−k}, in a similar way as for division. More precisely, the iteration for square root extraction is

r_i = 2r_{i−1} − z_i (2Z_{i−1} + z_i 2^{−i})
(38)
where Z_0 = 0. Schemes similar to restoring, nonrestoring, and SRT division can be defined. For the digit selection scheme similar to SRT division, the square root is restricted to 1/2 ≤ Z < 1, which corresponds to 1/4 ≤ X < 1. The selection rule is then

z_i = 1 if 1/2 ≤ r_{i−1} < 2
z_i = 0 if −1/2 ≤ r_{i−1} < 1/2    (39)
z_i = −1 if −2 ≤ r_{i−1} < −1/2
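The recurrence (38) with the selection rule (39) can be sketched analogously in Python, assuming a radicand in [1/4, 1); floats again stand in for fixed-point words.

```python
def sqrt_digits(x, wf):
    """Digit-recurrence square root sketch per (38)-(39), digits in {-1, 0, 1};
    assumes 1/4 <= x < 1 so that 1/2 <= Z < 1."""
    r, z_acc = x, 0.0                            # r_0 = X, Z_0 = 0
    for i in range(1, wf + 1):
        z = 1 if r >= 0.5 else (-1 if r < -0.5 else 0)
        r = 2 * r - z * (2 * z_acc + z * 2**-i)  # iteration (38)
        z_acc += z * 2**-i                       # Z_i = Z_{i-1} + z_i 2^-i
    return z_acc

assert abs(sqrt_digits(0.25, 24) - 0.5) < 2**-20
```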
5 Floating-point representation

Floating-point numbers consist of two parts, the mantissa (or significand), M, and the exponent (or characteristic), E, with a number X represented as

X = Mb^E
(40)
where b is the base of the exponent. For ease of exposition we assume b = 2. With floating-point numbers we obtain a larger dynamic range, but at the same time a lower precision, compared to a fixed-point number system using the same number of bits. Both the exponent and the mantissa are typically signed integer or fractional numbers. However, their representation is often not two's complement. For the mantissa it is common to use a sign-magnitude representation, i.e., a separate sign bit, S, with the remaining bits representing an unsigned mantissa magnitude. For the exponent it is common to use excess-k, i.e., add k to the exponent to obtain an unsigned number.
5.1 Normalized representations

A general floating-point representation is redundant since

M 2^E = (M/2) 2^{E+1}.    (41)
However, to use as much as possible of the dynamic range provided by the mantissa, we would like a representation without any leading zeros. This is called the normalized form. Another benefit of normalized representations is that comparisons are simpler: it suffices to compare the exponents, and only if the exponents are equal must the mantissas be compared. Also, since we know that there is no leading zero, the leading one of the representation can be made implicit, which effectively adds a bit to the representation.
5.2 IEEE 754

Before the emergence of the IEEE 754 floating-point standard, different computer systems typically had their own floating-point formats, making the transport of data between systems difficult. Some computer systems still have their own floating-point representation, but most have converged to the IEEE 754 standard. The most recent revision was released in August 2008 [22]. The IEEE 754-2008 standard defines three binary and two decimal formats. We will focus on the 32-bit binary format, called binary32, as the other two binary formats follow along the same lines. The binary32 format has a sign bit, eight exponent bits using excess-127 representation, and 23 bits for the mantissa plus a hidden leading one. The representation can be visualized as

s | e_7 e_6 e_5 e_4 e_3 e_2 e_1 e_0 | f_1 f_2 f_3 f_4 f_5 f_6 ... f_22 f_23
sign | E: 8-bit biased exponent | F: 23-bit unsigned fraction
The value of the floating-point number is given by

X = (−1)^s · 1.F · 2^{E−127}.
(42)
Note the hidden one due to the normalized number system, so M = 1.F. This means that the actual mantissa value will be in the range 1 ≤ M ≤ 2 − 2^{−23}. Out of the 256 possible values for the exponent, two have special meanings, used to deal with the zero value, ±∞, and undefined results (NaN). This is outlined in Table 7. The denormalized numbers are used to extend the dynamic range, as the hidden one otherwise limits the smallest positive number to 2^{1−127} = 2^{−126}. A denormalized number has a value of
Table 7 Special cases for the exponent in binary32.

          F = 0    F ≠ 0
E = 0     0        Denormalized
E = 255   ±∞       NaN

Table 8 The three binary formats in IEEE 754-2008.

Property             binary32   binary64   binary128
Total bits           32         64         128
Mantissa bits        23 + 1     52 + 1     112 + 1
Exponent bits        8          11         15
Bias                 127        1023       16383
Ext. mantissa bits   ≥ 32       ≥ 64       ≥ 128
Ext. exponent bits   11         15         20
X = (−1)^s · 0.F · 2^{−126}.
(43)
Using denormalized numbers it is possible to represent values as small as 2^{−23} · 2^{−126} = 2^{−149}. However, the implementation cost of denormalized numbers is high, and hence they are not always supported. There is also an extended format, used for intermediate results in certain complex functions. The extended binary32 format uses 11 bits for the exponent and at least 32 bits for the mantissa (now without a hidden bit). The two other formats are defined similarly to binary32 and are outlined in Table 8.
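The binary32 fields and the Table 7 special cases can be illustrated with a short Python sketch; struct is used merely to obtain the bit pattern of a host value, and the function name is an assumption of this sketch.

```python
import struct

def decode_binary32(x):
    """Decode an IEEE 754 binary32 value into its fields and reconstruct
    its value per (42)/(43) and Table 7; illustrative only."""
    bits = struct.unpack('>I', struct.pack('>f', x))[0]
    s = bits >> 31
    e = (bits >> 23) & 0xFF
    f = bits & 0x7FFFFF
    if e == 255:
        return (-1)**s * float('inf') if f == 0 else float('nan')
    if e == 0:
        return (-1)**s * (f / 2**23) * 2**-126        # denormalized, Eq. (43)
    return (-1)**s * (1 + f / 2**23) * 2**(e - 127)   # normalized, Eq. (42)

assert decode_binary32(1.5) == 1.5
assert decode_binary32(2**-149) == 2**-149            # smallest denormal
```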
5.3 Addition and subtraction

Adding and subtracting floating-point values requires that both operands have the same exponent. Hence, we have to shift the mantissa of the smaller operand as in (41) such that both exponents become equal. Then, assuming binary32 and E_X ≥ E_Y, it is possible to factor out the exponent term as

Z = (−1)^{s_Z} M_Z 2^{E_Z−127} = X ± Y = ((−1)^{s_X} M_X ± (−1)^{s_Y} M_Y 2^{−(E_X−E_Y)}) 2^{E_X−127}    (44)

where we can identify

(−1)^{s_Z} M̂_Z = (−1)^{s_X} M_X ± (−1)^{s_Y} M_Y 2^{−(E_X−E_Y)}    (45)

and

Ê_Z = E_X.    (46)
Depending on the operation required and the signs of the two operands, either an addition or a subtraction of the mantissas is performed. If the effective operation is an addition, we have 1 ≤ M̂_Z < 4, which means that we may need to right-shift once
to obtain the normalized mantissa, M_Z, and at the same time increase Ê_Z by one to obtain E_Z. If the effective operation is a subtraction, the result is 0 ≤ |M̂_Z| < 2. In this case we may have to left-shift, possibly by several positions, to obtain the normalized number, M_Z, and correspondingly decrease the exponent to obtain E_Z. It should be noted that adding or subtracting sign-magnitude numbers is more complex than adding or subtracting two's complement numbers, since decisions based on both the signs and the magnitudes of the operands are needed to determine the effective operation. Also, in the case of subtraction, one must either determine which operand has the larger magnitude and subtract the smaller from the larger, or negate the result if it is negative.
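A behavioral Python sketch of the alignment and normalization steps (44)–(46), operating on unpacked (sign, mantissa, exponent) triples with real-valued mantissas and unbiased exponents; rounding, which a real implementation must handle, is ignored here.

```python
def fp_add(s_x, m_x, e_x, s_y, m_y, e_y):
    """Sign-magnitude floating-point addition sketch per (44)-(46)."""
    if e_x < e_y:                                  # ensure E_X >= E_Y
        s_x, m_x, e_x, s_y, m_y, e_y = s_y, m_y, e_y, s_x, m_x, e_x
    aligned = m_y * 2.0**(e_y - e_x)               # align the smaller mantissa
    m_hat = (-1)**s_x * m_x + (-1)**s_y * aligned  # (45)
    s_z, m_z, e_z = (1 if m_hat < 0 else 0), abs(m_hat), e_x
    if m_z >= 2.0:                                 # effective addition: 1 <= M < 4
        m_z, e_z = m_z / 2.0, e_z + 1
    while 0.0 < m_z < 1.0:                         # effective subtraction: left-shifts
        m_z, e_z = m_z * 2.0, e_z - 1
    return s_z, m_z, e_z                           # a zero result stays unnormalized

assert fp_add(0, 1.5, 0, 0, 1.5, 0) == (0, 1.5, 1)  # 1.5 + 1.5 = 3.0
```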
5.4 Multiplication

The multiplication of two floating-point numbers (assumed to be in IEEE 754 binary32 format) is computed as

Z = (−1)^{s_Z} M_Z 2^{E_Z−127} = XY = (−1)^{s_X} M_X 2^{E_X−127} · (−1)^{s_Y} M_Y 2^{E_Y−127}    (47)

where we see that

s_Z = s_X ⊕ s_Y    (48)
M̂_Z = M_X M_Y    (49)
Ê_Z = E_X + E_Y − 127.    (50)
As we have 1 ≤ M_X, M_Y < 2 for normalized numbers, we get 1 ≤ M̂_Z < 4. Hence, it may be required to shift M̂_Z one position to the right to obtain the normalized value M_Z, which can be seen by comparing with (41). If so, one also needs to add 1 to Ê_Z to obtain E_Z. Thus, the multiplication of two floating-point numbers corresponds to one fixed-point multiplication, one fixed-point addition, and a simple normalizing step after the operations. For multiply-accumulate it is possible to use a fused architecture, with the benefit that the alignment of the operand to be added can be done concurrently with the multiplication. In this way the delay of the total MAC operation can be reduced compared to using separate multiplication and addition. Furthermore, rounding is only performed on the final output.
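The corresponding Python sketch for multiplication per (48)–(50), again on unpacked operands with unbiased exponents (so the −127 bias term drops out).

```python
def fp_mul(s_x, m_x, e_x, s_y, m_y, e_y):
    """Floating-point multiplication sketch per (47)-(50)."""
    s_z = s_x ^ s_y                 # (48): sign is the XOR of the operand signs
    m_z = m_x * m_y                 # (49): 1 <= m_z < 4 for normalized operands
    e_z = e_x + e_y                 # (50) without the bias term
    if m_z >= 2.0:                  # at most one normalizing right-shift
        m_z, e_z = m_z / 2.0, e_z + 1
    return s_z, m_z, e_z

assert fp_mul(0, 1.5, 1, 1, 1.5, 0) == (1, 1.125, 2)  # 3.0 * (-1.5) = -4.5
```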
5.5 Quantization error

The quantization error in the mantissa of a floating-point number is
Fig. 31 Error distributions for floating-point arithmetic.
X_Q = (1 + ε)X
(51)
where ε is the relative quantization error. Hence, the error is signal dependent and the analysis becomes very complicated [32, 44, 56]. Figure 31 shows the error distributions for floating-point arithmetic. Also, the quantization procedure needed to suppress parasitic oscillations in wave digital filters is more complicated for floating-point arithmetic.
6 Computation of elementary functions

The need to compute non-linear functions arises in many different algorithms. The straightforward method of approximating an elementary function is of course to store the function values in a look-up table. However, this typically leads to large tables, even though the resulting area from standard cell synthesis grows slower than the number of memory bits [16]. Instead, it is of interest to approximate elementary functions using a trade-off between arithmetic operations and look-up tables. In this section we briefly look at three different classes of algorithms. For a more thorough explanation of these and other methods we refer to [36].
6.1 CORDIC

The coordinate rotation digital computer (CORDIC) algorithm is a recursive algorithm for calculating elementary functions, such as the trigonometric and hyperbolic functions (and their inverses) as well as the magnitude and phase of complex vectors. It was introduced by Volder [49] and generalized by Walther [53]; a summary of the development of CORDIC can be found in [34, 50, 54]. It revolves around the idea of rotating the phase of a complex number by multiplying it by a succession of constant values. However, these multiplications can all be by powers of 2, so in binary arithmetic they can be done using just shifts and adds. Hence, CORDIC is in general a very attractive approach when a hardware multiplier is not available.

A rotation of a complex number X + jY by an angle θ can be written as

[X_r]   [cos θ  −sin θ] [X]
[Y_r] = [sin θ   cos θ] [Y].    (52)
Fig. 32 Similarity (rotation) in the CORDIC algorithm.
The idea of CORDIC is to decompose the rotation by θ into several steps such that each step is a simple operation. In the straightforward CORDIC algorithm we have

θ = \sum_{k=0}^{∞} d_k w_k,  d_k = ±1,  w_k = arctan(2^{−k}).    (53)
Considering rotation k we get

[X_{k+1}]   [cos(d_k w_k)  −sin(d_k w_k)] [X_k]             [1            −d_k 2^{−k}] [X_k]
[Y_{k+1}] = [sin(d_k w_k)   cos(d_k w_k)] [Y_k] = cos(w_k) [d_k 2^{−k}    1         ] [Y_k].    (54)

Now, neglecting the cos(w_k) term, we get a basic iteration consisting of a multiplication by 2^{−k} and an addition or subtraction. The sign of the rotation, d_k, is determined by comparing the required rotation angle with the currently rotated angle. This is typically done using a third variable, Z_k, where Z_0 = θ and Z_{k+1} = Z_k − d_k w_k. Then

d_k = 1 if Z_k ≥ 0
d_k = −1 if Z_k < 0.    (55)

The effect of neglecting cos(w_k) in (54) is that the rotation is in fact not a proper rotation but a similarity [36]. Furthermore, as illustrated in Fig. 32, the magnitude of the vector is increased. The gain of the rotations depends on the number of iterations and can be written as

G(k) = \prod_{i=0}^{k} \sqrt{1 + 2^{−2i}}.    (56)
For k → ∞, G ≈ 1.6468. Several schemes to compensate for the gain have been proposed; a survey can be found in [48]. The above application of the CORDIC algorithm is usually referred to as rotation mode and can be used to compute sin(θ) and cos(θ) or to perform rotations of complex vectors.
Table 9 Generalized CORDIC.

Circular (m = 1, w_k = arctan(2^{−k}), σ(k) = k):
  Rotation mode (d_k = sign(Z_k)):  X_n → G_k (X_0 cos Z_0 − Y_0 sin Z_0),  Y_n → G_k (Y_0 cos Z_0 + X_0 sin Z_0),  Z_n → 0
  Vectoring mode (d_k = −sign(Y_k)):  X_n → G_k \sqrt{X_0² + Y_0²},  Y_n → 0,  Z_n → Z_0 + arctan(Y_0/X_0)

Linear (m = 0, w_k = 2^{−k}, σ(k) = k):
  Rotation mode:  X_n → X_0,  Y_n → Y_0 + X_0 Z_0,  Z_n → 0
  Vectoring mode:  X_n → X_0,  Y_n → 0,  Z_n → Z_0 + Y_0/X_0

Hyperbolic (m = −1, w_k = tanh^{−1}(2^{−k}), σ(k) = k − h_k):
  Rotation mode:  X_n → Ĝ_k (X_1 cosh Z_1 + Y_1 sinh Z_1),  Y_n → Ĝ_k (Y_1 cosh Z_1 + X_1 sinh Z_1),  Z_n → 0
  Vectoring mode:  X_n → Ĝ_k \sqrt{X_1² − Y_1²},  Y_n → 0,  Z_n → Z_1 + tanh^{−1}(Y_1/X_1)
There is also a vectoring mode, where the rotation is performed such that the imaginary part, Y_k, becomes zero. The generalized CORDIC iterations can be written as

X_{k+1} = X_k − m d_k Y_k 2^{−σ(k)}
Y_{k+1} = Y_k + d_k X_k 2^{−σ(k)}    (57)
Z_{k+1} = Z_k − d_k w_{σ(k)}.
With appropriate choices of m, d_k, w_k, and σ(k), the CORDIC algorithm can perform a wide range of functions. These are summarized in Table 9, where three different types of CORDIC algorithms are introduced: circular for computing trigonometric expressions, linear for linear relationships, and hyperbolic for hyperbolic computations. The scaling factor Ĝ_k for hyperbolic computations is

Ĝ_k = \prod_{i=1}^{k} \sqrt{1 − 2^{−2(i−h_i)}}    (58)
and the factor h_k is defined as the largest integer such that 3^{h_k+1} + 2h_k − 1 ≤ 2n. In practice this means that certain iteration angles, those with k = (3^{i+1} − 1)/2, are used twice to obtain convergence in the hyperbolic case. The CORDIC computations can be performed iteratively as in (57), but the iterations can naturally also be unfolded. Radix-4 CORDIC algorithms, performing two iterations in each step, have also been proposed, as well as different approaches using redundant arithmetic to speed up each iteration.
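A Python sketch of circular rotation-mode CORDIC per (53)–(57); a hardware implementation would use fixed-point shift-and-add iterations and a precomputed gain constant, so the floats, math.prod, and function name here are conveniences of the sketch.

```python
import math

def cordic_rotate(x, y, theta, n=32):
    """Rotate (x, y) by theta using the iterations (57) with m = 1 and
    w_k = arctan(2^-k); the gain (56) is divided out at the end."""
    z = theta                                   # Z_0 = theta
    for k in range(n):
        d = 1 if z >= 0 else -1                 # selection rule (55)
        x, y = x - d * y * 2**-k, y + d * x * 2**-k
        z -= d * math.atan(2**-k)
    g = math.prod(math.sqrt(1 + 2**-(2 * i)) for i in range(n))
    return x / g, y / g

xr, yr = cordic_rotate(1.0, 0.0, math.pi / 3)   # computes cos and sin of pi/3
assert abs(xr - 0.5) < 1e-6 and abs(yr - math.sqrt(3) / 2) < 1e-6
```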
6.2 Polynomial and piecewise polynomial approximations

It is possible to derive a polynomial p(X) that approximates a function f(X) by performing a Taylor expansion around a given point a:
Fig. 33 Block diagram for Horner’s method used for polynomial evaluation.
p(X) = \sum_{i=0}^{∞} \frac{f^{(i)}(a)}{i!} (X − a)^i.    (59)
When the polynomial is restricted to a certain number of terms, it is often better to optimize the polynomial coefficients, as there is some accuracy to be gained. Determining the best coefficients is an approximation problem where there are typically more constraints (number of points for the approximation) than variables (polynomial order). This problem can be solved for a minimax solution using, e.g., the Remez exchange algorithm or linear programming. For a least-squares solution, the standard methods for overdetermined systems can be applied. If fixed-point coefficients are required, the problem becomes much harder.

Polynomial approximations can be efficiently and accurately evaluated using Horner's method: a polynomial

p(X) = b_0 + b_1 X + b_2 X² + ⋯ + b_{n−1} X^{n−1} + b_n X^n
(60)
is evaluated as

p(X) = ((⋯((b_n X + b_{n−1})X + b_{n−2})X + ⋯ + b_2)X + b_1)X + b_0.
(61)
Hence, there is no need to compute any powers of X explicitly, and a minimum number of arithmetic operations is used. Polynomial evaluation using Horner's method maps nicely to MAC operations. The resulting scheme is illustrated in Fig. 33. The drawback of Horner's method is that the computation is inherently sequential. An alternative is Estrin's method [36], which, by explicitly computing the terms X^{2^i}, rearranges the computation into a tree structure, increasing the parallelism and reducing the longest computational path. Estrin's method for polynomial evaluation can be written as

p(X) = (b_3 X + b_2)X² + (b_1 X + b_0)
(62)
for a third-order polynomial. For a seventh-order polynomial it becomes

p(X) = ((b_7 X + b_6)X² + (b_5 X + b_4))X⁴ + ((b_3 X + b_2)X² + (b_1 X + b_0)).
(63)
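Both evaluation orders are easy to express in code. The Python sketch below contrasts Horner's recurrence (61) with the Estrin tree (63) for a seventh-order polynomial; the coefficient values are arbitrary and only serve the consistency check.

```python
def horner(b, x):
    """Evaluate p(x) per (61); b = [b0, b1, ..., bn]. One MAC per
    coefficient, but strictly sequential."""
    acc = 0.0
    for coeff in reversed(b):
        acc = acc * x + coeff
    return acc

def estrin7(b, x):
    """Estrin's scheme (63) for a seventh-order polynomial; the four inner
    pairs, and the powers x^2 and x^4, can all be evaluated in parallel."""
    x2 = x * x
    x4 = x2 * x2
    return (((b[7] * x + b[6]) * x2 + (b[5] * x + b[4])) * x4
            + ((b[3] * x + b[2]) * x2 + (b[1] * x + b[0])))

b = [1.0, -0.5, 0.25, -0.125, 0.0625, -0.03125, 0.015625, -0.0078125]
assert abs(horner(b, 0.7) - estrin7(b, 0.7)) < 1e-12
```

Horner minimizes the operation count, while Estrin spends a few extra multiplications (the explicit squarings) to shorten the critical path.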
As can be seen, Estrin’s method also maps well to MAC-operations. The required polynomial order depends very much on the actual function that is approximated [36]. An approach to obtain a higher resolution despite using a
Fig. 34 Piecewise polynomial approximation using uniform segmentation based on the most significant bits.
An approach to obtain higher resolution while using a lower polynomial order is to use different polynomials for different ranges. This is referred to as piecewise polynomials. A j-segment nth-order piecewise polynomial with segment breakpoints x_k, k = 1, 2, ..., j + 1, can be written as

p(X) = \sum_{i=0}^{n} b_{i,j} (X − x_j)^i,  x_j ≤ X < x_{j+1}.    (64)
From an implementation point of view it is often practical to use 2^k uniform segments and let the k most significant bits determine the segmentation. However, it can be shown that, in general, the total complexity is reduced with non-uniform segments. An illustration of a piecewise polynomial approximation is shown in Fig. 34.
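A Python sketch of the design-time and run-time steps for uniform segmentation per (64); obtaining the segment coefficients by least-squares fitting with numpy.polyfit is one possible choice and an assumption of this sketch, as are the function names.

```python
import numpy as np

def make_piecewise(f, k, order, lo=0.0, hi=1.0):
    """Design time: fit one degree-`order` polynomial in (X - x_j) per
    segment, using 2^k uniform segments on [lo, hi]."""
    n_seg = 2**k
    edges = np.linspace(lo, hi, n_seg + 1)
    tables = [np.polyfit(np.linspace(0, edges[j + 1] - edges[j], 32),
                         f(np.linspace(edges[j], edges[j + 1], 32)), order)
              for j in range(n_seg)]
    return edges, tables

def eval_piecewise(x, edges, tables):
    """Run time: the segment index comes from the most significant bits of x."""
    j = min(int((x - edges[0]) / (edges[1] - edges[0])), len(tables) - 1)
    return np.polyval(tables[j], x - edges[j])

edges, tables = make_piecewise(np.sin, k=4, order=2)
assert abs(eval_piecewise(0.37, edges, tables) - np.sin(0.37)) < 1e-5
```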
6.3 Table-based methods

The bipartite table method is based on splitting the input word, X, into three subwords, X_0, X_1, and X_2. For ease of exposition we assume that their lengths are identical, W_s, with W_f = 3W_s; in general, a lower-complexity realization can be found by selecting non-uniform wordlengths. Hence, we have

X = X_0 + 2^{−W_s} X_1 + 2^{−2W_s} X_2.
(65)
Now, taking the first-order Taylor expansion of f at X_0 + 2^{−W_s} X_1, we get

f(X) ≈ f(X_0 + 2^{−W_s} X_1) + 2^{−2W_s} X_2 f′(X_0 + 2^{−W_s} X_1).
(66)
Again we take a Taylor expansion, this time a zeroth-order expansion of f′(X_0 + 2^{−W_s} X_1) at X_0:
Fig. 35 Bipartite table approximation structure.
f′(X_0 + 2^{−W_s} X_1) ≈ f′(X_0).
(67)
This gives the bipartite approximation

f(X) ≈ T_1(X_0, X_1) + T_2(X_0, X_2)
(68)
where

T_1(X_0, X_1) = f(X_0 + 2^{−W_s} X_1)
T_2(X_0, X_2) = 2^{−2W_s} X_2 f′(X_0).
(69)
The functions T_1 and T_2 are tabulated and the results added; the resulting structure is shown in Fig. 35. The bipartite approximation can be seen as a piecewise linear approximation where the same slope tables are reused over several intervals. Here, T_1 contains the offset values, and T_2 contains tabulated lines with slope f′(X_0). The accuracy of the bipartite approximation can be improved by instead performing the first Taylor expansion at X_0 + 2^{−W_s} X_1 + 2^{−2W_s−1} and the second at X_0 + 2^{−W_s−1} [46]. It is also possible to split the input word into more subwords, yielding a multipartite table approximation [8].
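A Python sketch of the bipartite construction (68)–(69) for a 3W_s-bit fixed-point input in [0, 1), split into three W_s-bit fields; the scaling conventions, table layout, and function names are assumptions of this sketch.

```python
import math

def build_bipartite(f, df, ws):
    """Design time: tabulate T1 (offsets) and T2 (slope terms) per (69);
    df is the derivative f'."""
    n = 2**ws
    t1, t2 = {}, {}
    for x0 in range(n):
        a0 = x0 / n                               # value of the leading field
        for x1 in range(n):
            t1[x0, x1] = f(a0 + x1 / n**2)        # T1 = f(X0 + 2^-Ws X1)
        for x2 in range(n):
            t2[x0, x2] = (x2 / n**3) * df(a0)     # T2 = 2^-2Ws X2 f'(X0)
    return t1, t2

def bipartite_eval(x, t1, t2, ws):
    """Run time: two table lookups and one addition, per (68)."""
    n = 2**ws
    i = int(x * n**3)                             # 3*ws-bit fixed-point input
    x0, x1, x2 = (i >> 2 * ws) & (n - 1), (i >> ws) & (n - 1), i & (n - 1)
    return t1[x0, x1] + t2[x0, x2]

t1, t2 = build_bipartite(math.exp, math.exp, ws=4)
x = 1667 / 4096                                   # an arbitrary 12-bit input
assert abs(bipartite_eval(x, t1, t2, ws=4) - math.exp(x)) < 1e-3
```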
7 Further reading

Several books have been published on related subjects. For general digital arithmetic we refer to [11, 26, 40]. For the specific case of approximating elementary functions, [36] provides a broad overview of various techniques.
References

1. C. R. Baugh and B. A. Wooley, "A two's complement parallel array multiplication algorithm," IEEE Trans. Comput., vol. 22, pp. 1045–1047, Dec. 1973.
2. K. Bickerstaff, M. Schulte, and E. E. Swartzlander, Jr., "Parallel reduced area multipliers," J. VLSI Signal Processing, vol. 9, no. 3, pp. 181–191, Nov. 1995.
3. R. P. Brent and H. T. Kung, "A regular layout for parallel adders," IEEE Trans. Comput., vol. 31, pp. 260–264, Mar. 1982.
4. S. C. Chan and P. M. Yiu, "An efficient multiplierless approximation of the fast Fourier transform using sum-of-powers-of-two (SOPOT) coefficients," IEEE Signal Processing Letters, vol. 9, no. 10, pp. 322–325, Oct. 2002.
5. T. A. C. M. Claasen, W. F. G. Mecklenbräuker, and J. B. H. Peek, "Effects of quantization and overflow in recursive digital filters," IEEE Trans. Acoust., Speech, Signal Processing, vol. 24, no. 6, Dec. 1976.
6. A. Croisier, D. J. Esteban, M. E. Levilion, and V. Rizo, Digital Filter for PCM Encoded Signals, U.S. Patent 3777130, Dec. 4, 1973.
7. L. Dadda, "Some schemes for parallel multipliers," Alta Frequenza, vol. 34, pp. 349–356, May 1965.
8. F. de Dinechin and A. Tisserand, "Multipartite table methods," IEEE Trans. Comput., vol. 54, no. 3, pp. 319–330, Mar. 2005.
9. M. D. Ercegovac and T. Lang, "On-the-fly conversion of redundant into conventional representation," IEEE Trans. Comput., vol. 36, pp. 895–897, July 1987.
10. M. D. Ercegovac and T. Lang, Division and Square Root: Digit-Recurrence Algorithms and Implementations, Kluwer Academic Publishers, 1994.
11. M. D. Ercegovac and T. Lang, Digital Arithmetic, Morgan Kaufmann Publishers, 2004.
12. H. Eriksson, P. Larsson-Edefors, M. Sheeran, M. Själander, D. Johansson, and M. Schölin, "Multiplier reduction tree with logarithmic logic depth and regular connectivity," in Proc. IEEE Int. Symp. Circuits Syst., 2006.
13. A. Fettweis and K. Meerkötter, "On parasitic oscillations in digital filters under looped conditions," IEEE Trans. Circuits Syst., vol. 24, no. 9, pp. 475–481, Sept. 1977.
14. O. Gustafsson, "A difference based adder graph heuristic for multiple constant multiplication problems," in Proc. IEEE Int. Symp. Circuits Syst., New Orleans, LA, May 27–30, 2007.
15. O. Gustafsson, "Lower bounds for constant multiplication problems," IEEE Trans. Circuits Syst. II, vol. 54, no. 11, pp. 974–978, Nov. 2007.
16. O. Gustafsson and K. Johansson, "An empirical study on standard cell synthesis of elementary function look-up tables," in Proc. Asilomar Conf. Signals Syst. Comput., Pacific Grove, CA, Oct. 26–29, 2008.
17. O. Gustafsson and L. Wanhammar, "Low-complexity constant multiplication using carry-save arithmetic for high-speed digital filters," in Proc. Int. Symp. Image, Signal Processing, Analysis, Istanbul, Turkey, Sept. 27–29, 2007.
18. O. Gustafsson, A. G. Dempster, and L. Wanhammar, "Multiplier blocks using carry-save adders," in Proc. IEEE Int. Symp. Circuits Syst., Vancouver, Canada, May 23–26, 2004.
19. O. Gustafsson, A. G. Dempster, K. Johansson, M. D. Macleod, and L. Wanhammar, "Simplified design of constant coefficient multipliers," Circuits, Systems and Signal Processing, vol. 25, no. 2, pp. 225–251, Apr. 2006.
20. D. Harris, "A taxonomy of parallel prefix networks," in Proc. Asilomar Conf. Signals Syst. Comput., Pacific Grove, CA, 2003.
21. R. I. Hartley, "Subexpression sharing in filters using canonic signed digit multipliers," IEEE Trans. Circuits Syst. II, vol. 43, no. 10, pp. 677–688, Oct. 1996.
22. IEEE Std 754-2008, IEEE Standard for Floating-Point Arithmetic, Aug. 2008.
23. K. Johansson, O. Gustafsson, and L. Wanhammar, "Power estimation for ripple-carry adders with correlated input data," in Proc. Int. Workshop Power Timing Modeling, Optimization, Simulation, Santorini, Greece, Sept. 15–17, 2004.
24. S. Knowles, "A family of adders," in Proc. Symp. Comput. Arithmetic, 1990, pp. 30–34.
25. P. M. Kogge and H. S. Stone, "A parallel algorithm for the efficient solution of a general class of recurrence equations," IEEE Trans. Comput., vol. 22, pp. 786–793, 1973.
26. I. Koren, Computer Arithmetic Algorithms, 2nd edition, A. K. Peters, Natick, MA, 2002.
27. R. E. Ladner and M. J. Fischer, "Parallel prefix computation," J. Ass. Comput. Mach., vol. 27, pp. 831–838, Oct. 1980.
28. J. Liang and T. D. Tran, "Fast multiplierless approximations of the DCT with the lifting scheme," IEEE Trans. Signal Processing, vol. 49, no. 12, pp. 3032–3044, Dec. 2001.
29. Y.-C. Lim, "Single-precision multiplier with reduced circuit complexity for signal processing applications," IEEE Trans. Comput., vol. 41, no. 10, pp. 1333–1336, Oct. 1992.
30. Y.-C. Lim, R. Yang, D. Li, and J. Song, "Signed power-of-two term allocation scheme for the design of digital filters," IEEE Trans. Circuits Syst. II, vol. 46, no. 5, pp. 577–584, May 1999.
31. B. Liu, "Effect of finite word length on the accuracy of digital filters – a review," IEEE Trans. Circuit Theory, vol. 18, pp. 670–677, Nov. 1971.
32. B. Liu and T. Kaneko, "Error analysis of digital filters realized with floating-point arithmetic," Proc. IEEE, vol. 57, pp. 1735–1747, Oct. 1969.
33. O. L. MacSorley, "High-speed arithmetic in binary computers," Proc. IRE, vol. 49, no. 1, pp. 67–91, Jan. 1961.
34. P. K. Meher, J. Valls, T.-B. Juang, K. Sridharan, and K. Maharatna, "50 years of CORDIC: Algorithms, architectures and applications," IEEE Trans. Circuits Syst. I, vol. 56, no. 9, pp. 1893–1907, Sept. 2009.
35. Z.-J. Mou and F. Jutand, "'Overturned-stairs' adder trees and multiplier design," IEEE Trans. Comput., vol. 41, no. 8, pp. 940–948, Aug. 1992.
36. J.-M. Muller, Elementary Functions: Algorithms and Implementation, 2nd edition, Birkhäuser, Boston, 2006.
37. V. G. Oklobdzija, D. Villeger, and S. S. Liu, "A method for speed optimized partial product reduction and generation of fast parallel multipliers using an algorithmic approach," IEEE Trans. Comput., vol. 45, no. 3, Mar. 1996.
38. A. Omondi and B. Premkumar, Residue Number Systems: Theory and Implementation, Imperial College Press, 2007.
39. S. T. Oskuii, P. G. Kjeldsberg, and O. Gustafsson, "A method for power optimized partial product reduction in parallel multipliers," in Proc. IEEE Norchip Conf., 2007.
40. B. Parhami, Computer Arithmetic: Algorithms and Hardware Designs, 2nd edition, Oxford University Press, New York, 2010.
41. N. Petra, D. De Caro, V. Garofalo, E. Napoli, and A. G. M. Strollo, "Truncated binary multipliers with variable correction and minimum mean square error," IEEE Trans. Circuits Syst. I, 2010.
42. M. Potkonjak, M. B. Srivastava, and A. P. Chandrakasan, "Multiple constant multiplications: efficient and versatile framework and algorithms for exploring common subexpression elimination," IEEE Trans. Computer-Aided Design, vol. 15, no. 2, pp. 151–165, Feb. 1996.
43. M. Püschel et al., "SPIRAL: Code generation for DSP transforms," Proc. IEEE, vol. 93, no. 2, pp. 232–275, Feb. 2005.
44. B. D. Rao, "Floating point arithmetic and digital filters," IEEE Trans. Signal Processing, vol. 40, no. 1, pp. 85–95, Jan. 1992.
45. H. Samueli and A. N. Willson, Jr., "Nonperiodic forced overflow oscillations in digital filters," IEEE Trans. Circuits Syst., vol. 30, no. 10, pp. 709–722, Oct. 1983.
46. M. J. Schulte and J. E. Stine, "Approximating elementary functions with symmetric bipartite tables," IEEE Trans. Comput., vol. 48, no. 8, pp. 842–847, Aug. 1999.
47. P. F. Stelling and V. G. Oklobdzija, "Design strategies for optimal hybrid final adders in a parallel multiplier," J. VLSI Signal Processing, vol. 14, no. 3, pp. 321–331, Dec. 1996.
48. D. Timmerman, H. Hahn, B. J. Hosticka, and B. Rix, "A new addition scheme and fast scaling factor compensation methods for CORDIC algorithms," Integration, the VLSI Journal, vol. 11, pp. 85–100, Nov. 1991.
49. J. E. Volder, "The CORDIC trigonometric computing technique," IRE Trans. Elec. Comput., vol. 8, pp. 330–334, 1959.
50. J. E. Volder, "The birth of CORDIC," J. VLSI Signal Processing, vol. 25, 2000.
51. Y. Voronenko and M. Püschel, "Multiplierless multiple constant multiplication," ACM Trans. Algorithms, vol. 3, no. 2, May 2007.
52. C. Wallace, "A suggestion for a fast multiplier," IEEE Trans. Electron. Comput., vol. 13, no. 1, pp. 14–17, Feb. 1964.
53. J. S. Walther, "A unified algorithm for elementary functions," in Spring Joint Computer Conference Proceedings, vol. 38, pp. 379–385, 1971.
54. J. S. Walther, "The story of unified CORDIC," J. VLSI Signal Processing, vol. 25, 2000.
55. L. Wanhammar, DSP Integrated Circuits, Academic Press, 1999.
56. B. Zeng and Y. Neuvo, "Analysis of floating point roundoff errors using dummy multiplier coefficient sensitivities," IEEE Trans. Circuits Syst., vol. 38, no. 6, pp. 590–601, June 1991.
57. R. Zimmermann, Binary Adder Architectures for Cell-Based VLSI and Their Synthesis, Ph.D. thesis, Swiss Federal Institute of Technology (ETH), Zurich, Hartung-Gorre Verlag, 1998.
Application-Specific Accelerators for Communications Yang Sun, Kiarash Amiri, Michael Brogioli, and Joseph R. Cavallaro
Abstract For computation-intensive digital signal processing algorithms, complexity is exceeding the processing capabilities of general-purpose digital signal processors (DSPs). In some of these applications, DSP hardware accelerators have been widely used to off-load a variety of algorithms from the main DSP host, including FFT, FIR/IIR filters, multiple-input multiple-output (MIMO) detectors, and error correction code (Viterbi, Turbo, LDPC) decoders. Given power and cost considerations, simply implementing these computationally complex parallel algorithms with high-speed general-purpose DSP processors is not very efficient. However, not all DSP algorithms are appropriate for off-loading to a hardware accelerator. First, these algorithms should have data-parallel computations and repeated operations that are amenable to hardware implementation. Second, these algorithms should have a deterministic dataflow graph that maps to parallel datapaths. The accelerators that we consider are mostly coarse grain, to better deal with streaming data transfer for achieving both high performance and low power. In this chapter, we focus on some of the basic and advanced digital signal processing algorithms for communications and cover major examples of DSP accelerators for communications.
Yang Sun Rice University, 6100 Main St., Houston, TX 77005, e-mail: [email protected] Kiarash Amiri Rice University, 6100 Main St., Houston, TX 77005, e-mail: [email protected] Michael Brogioli Freescale Semiconductor Inc., 7700 W. Parmer Lane, MD: PL63, Austin TX 78729, e-mail: [email protected] Joseph R. Cavallaro Rice University, 6100 Main St., Houston, TX 77005, e-mail: [email protected]
S.S. Bhattacharyya et al. (eds.), Handbook of Signal Processing Systems, DOI 10.1007/978-1-4419-6345-1_13, © Springer Science+Business Media, LLC 2010
1 Introduction

In third-generation (3G) wireless systems, signal processing algorithm complexity has begun to exceed the processing capabilities of general-purpose digital signal processors (DSPs). With the inclusion of multiple-input multiple-output (MIMO) technology in fourth-generation (4G) wireless systems, the DSP algorithm complexity has far exceeded the processing capabilities of DSPs. Given the area and power constraints of mobile handsets, one cannot simply implement computation-intensive DSP algorithms with gigahertz DSPs. Besides, it is also critical to reduce base station power consumption by utilizing optimized hardware accelerator designs. Fortunately, only a few DSP algorithms dominate the computational complexity of a wireless receiver. These algorithms, including Viterbi decoding, Turbo decoding, LDPC decoding, MIMO detection, and channel equalization/FFT, need to be off-loaded to hardware coprocessors or accelerators to yield high performance. These hardware accelerators are often integrated on the same die as the DSP processors. In addition, it is also possible to leverage field-programmable gate arrays (FPGAs) to provide reconfigurable massive computation capabilities.

DSP workloads are typically numerically intensive, with large amounts of both instruction and data level parallelism. In order to exploit this parallelism with a programmable processor, most DSP architectures utilize Very Long Instruction Word (VLIW) architectures. VLIW architectures typically include one or more register files on the processor die, versus the single monolithic register file often found in general purpose computing. Examples of such architectures are the Freescale StarCore processor, the Texas Instruments TMS320C6x series DSPs, and the SHARC DSPs from Analog Devices, to name a few [14, 28, 39]. In some cases, due to the idiosyncratic nature of many DSPs and the implementation of some of the more powerful instructions in the DSP core, an optimizing compiler cannot always target core functionality in an optimal manner. Examples include high performance fractional arithmetic instructions, which may perform highly SIMD functionality that the compiler cannot always deem safe at compile time.

While the aforementioned VLIW based DSP architectures provide increased parallelism and higher numerical throughput performance, this comes at a cost in ease of programmability. Typically such machines depend on advanced optimizing compilers that are capable of aggressively analyzing the instruction and data level parallelism in the target workloads and mapping it onto the parallel hardware. Due to the large number of parallel functional units and deep pipeline depths, modern DSPs are often difficult to hand program at the assembly level while achieving optimal results. As such, one technique used by the optimizing compiler is to vectorize much of the data level parallelism often found in DSP workloads. In doing this, the compiler can often fully exploit the single instruction multiple data (SIMD) functionality found in modern DSP instruction sets.

Despite such highly parallel programmable processor cores and advanced compiler technology, however, it is quite often the case that the amount of available instruction and data level parallelism in modern signal processing workloads far
exceeds the limited resources available in a VLIW based programmable processor core. For example, the implementation complexity for a 40 Kbps DS-CDMA system would be 41.8 Gflops for 60 users [50], not to mention a 100 Mbps+ 3GPP LTE system. This complexity largely exceeds the capability of today's DSP processors, which typically provide 1–5 Gflops of performance, such as the 1.5 Gflops TI C6711 DSP processor and the 1.8 Gflops ADI TigerSHARC processor. In other cases, the functionality required by the workload is not efficiently supported by the more general purpose instruction sets typically found in embedded systems. As such, acceleration at both the fine grain and coarse grain levels is often required: the former for instruction set architecture (ISA) like optimization and the latter for task like optimization.

Additionally, wireless system designers often desire the programmability offered by software running on a DSP core, versus a hardware based accelerator, to allow flexibility in various proprietary algorithms. An example is channel estimation in baseband processing, for which a given vendor may want to use their own algorithm to handle various users in varying system conditions, versus a pre-packaged solution. Typically these demands result in a heterogeneous system which may include one or more of the following: software programmable DSP cores for data processing, hardware based accelerator engines for data processing, and in some instances general purpose processors or micro-controller type solutions for control processing. The motivation for heterogeneous DSP system solutions including hardware acceleration stems from the tradeoff between software programmability and the performance gains of custom hardware acceleration in its various forms.

There are a number of heterogeneous accelerator based architectures currently available today, as well as various offerings and design solutions from the research community. There are a number of DSP architectures which include true hardware based accelerators which are not programmable by the end user. Examples include the Texas Instruments C55x and C64x series of DSPs, which include hardware based Viterbi or Turbo decoder accelerators for acceleration of wireless channel decoding [26, 27].
1.1 Coarse Grain Versus Fine Grain Accelerator Architectures

Coarse-grain accelerator based DSP systems entail a co-processor type design whereby larger amounts of work are run on a sometimes configurable co-processor device. Current technologies in this area support offloading of functionality such as FFT and various matrix-like computations to the accelerator, versus executing them in software on the programmable DSP core. As shown in Figure 1, coarse grained heterogeneous architectures typically include a loosely coupled computational grid attached to the host processor. These types of architectures are sometimes built using an FPGA, ASIC, or vendor programmable acceleration engine for portions of the system.
Fig. 1 Traditional coarse grained accelerator architecture. [8]
Tightly coupled loop nests or kernels are then offloaded from executing in software on the host processor to executing in hardware on the loosely coupled grid.

Fine-grain accelerator based architectures are the flip side of the coarse grained accelerator mindset. Typically, ISAs provide primitives that allow low cost, low complexity implementations while still maintaining high performance for a broad range of input applications. In certain cases, however, it is advantageous to offer instructions specialized to the computational needs of the application. Adding new instructions to the ISA, however, is a difficult decision to make. On the one hand they may provide significant performance increases for certain subsets of applications, but they must still be general enough to be useful across a much wider range of applications. Additionally, such instructions may become obsolete as software evolves and may complicate future hardware implementations of the ISA [53]. Vendors such as Tensilica, however, offer toolsets to produce configurable, extensible processor architectures typically targeted at the embedded community [25]. These types of products typically allow the user to configure a predefined subset of processor components to fit the specific demands of the input application. Figure 2 shows the layout of a typical fine grained reconfigurable architecture whereby a custom ALU is coupled with the host processor's pipeline.

In summary, both fine grained and coarse grained acceleration can be beneficial to the computational demands of DSP applications. Depending on the overall design constraints of the system, designers may choose a heterogeneous coarse grained acceleration system or a strictly software programmable DSP core system.
Fig. 2 Example fine grained reconfigurable architecture with customizable ALU for ISA extensions. [8]
1.2 Hardware/Software Workload Partition Criteria

In partitioning any workload across a heterogeneous system comprised of reconfigurable computational accelerators, programmable DSPs or programmable host processors, and a varied memory hierarchy, a number of criteria must be evaluated, in addition to application profile information, to determine whether a given task should execute in software on the host processor or in hardware on an FPGA or ASIC, as well as where in the overall system topology each task should be mapped. It is these sets of criteria that typically mandate the software partitioning and ultimately determine the topology and partitioning of the given system.

Spatial locality of data is one concern in partitioning a given task. In a typical software implementation running on a host processor, the ability to access data in a particular order efficiently is of great importance to performance. Issues such as latency to memory, data bus contention, data transfer times to a local compute element such as accelerator local memory, as well as the type and location of memory all need to be taken into consideration. In cases where data is misaligned, or not contiguous or uniformly strided in memory, additional overhead may be needed to arrange data before block DMA transfers can take place or data can efficiently be computed on. Where data is not aligned properly in memory, significant performance degradation can be seen due to decreased memory bandwidth when performing unaligned memory accesses on some architectures. When data is not uniformly strided, it may be difficult to burst transfer even single dimensional strips of memory via DMA engines. Consequently, with non-uniformly strided data it may be necessary to perform data transfers into local accelerator memory via programmed I/O on the part of the host DSP. Inefficiencies in such methods of data transfer can easily overshadow any computational benefits achieved by compute acceleration in the FPGA. The finer the granularity of the computation offloaded
for acceleration in terms of compute time, the more pronounced the side effects of data memory transfer to local accelerator memory often become.

Data level parallelism is another important criterion in determining the partitioning for a given application. Many applications targeted at VLIW-like architectures, especially signal processing applications, exhibit a large amount of both instruction and data level parallelism [24]. Many signal processing applications contain enough data level parallelism to exceed the available functional units of a given architecture. FPGA fabrics and highly parallel ASIC implementations can exploit these computational bottlenecks in the input application by providing not only large numbers of functional units but also large amounts of local block data RAM, supporting levels of instruction and data parallelism far beyond what a typical VLIW signal processing architecture can afford in terms of register file real estate. Furthermore, depending on the instruction set architecture of the host processor or DSP, performing sub-word or multiword operations may not be feasible on the host machine. Most modern DSP architectures have fairly robust instruction sets that support fine grained multiword SIMD acceleration to a certain extent. It is often challenging, however, to load data from memory into the register files of a programmable SIMD style processor efficiently enough to optimally utilize the SIMD ISA.

Computational complexity of the application often bounds the programmable DSP core, creating a compute bottleneck in the system. Algorithms that are implemented in FPGAs are often computationally intensive, exploiting greater amounts of instruction and data level parallelism than the host processor can afford given its functional unit limitations and pipeline depth. By mapping computationally intense bottlenecks in the application from a software implementation executing on the host processor to a hardware implementation in the FPGA, one can effectively alleviate bottlenecks on the host processor and permit extra cycles for additional computation or algorithms to execute in parallel.

Task level parallelism in a portion of the application can play a role in the ideal partitioning as well. Quite often, embedded applications contain multiple tasks that can execute concurrently but have a limited amount of instruction or data level parallelism within each unique task [51]. Applications in the networking space, and baseband processing at layers above the data plane, typically need to deal with processing packets and traversing packet headers, data descriptors, and multiple task queues. If a given task contains enough instruction and data level parallelism to exhaust the available host processor compute resources, it can be considered for partitioning to an accelerator. In many cases, it is possible to execute several of these tasks concurrently, either across multiple host processors or across both a host processor and an FPGA compute engine, depending on data access patterns and cross task data dependencies. There are a number of architectures which have accelerated tasks in the control plane, versus the data plane, in hardware. One example is the Freescale Semiconductor QorIQ platform, which provides hardware acceleration for frame managers, queue managers, and buffer managers.
In doing this, the architecture effectively frees the programmable processor cores from dealing with control plane management.
Fig. 3 Basic structure of a MIMO receiver.
2 Hardware Accelerators for Communications

Processors in 3G and 4G cellular systems typically require high speed, high throughput, and flexibility. In addition, computationally intensive algorithms are used to remove the often high levels of multiuser interference, especially in multiple transmit and receive antenna MIMO systems. Time varying wireless channel environments can also dramatically deteriorate the performance of the transmission, further requiring powerful channel equalization, detection, and decoding algorithms for different fading conditions at the mobile handset. In these types of environments, it is often the case that the amount of available parallel computation in a given application or kernel far exceeds the available functional units in the target processor. Even with modern VLIW style DSPs, the number of functional units available in a given clock cycle is limited and prevents full parallelization of the application for maximum performance. Further, the area and power constraints of mobile handsets make a software-only solution difficult to realize.

Figure 3 depicts a typical MIMO receiver model. Three major blocks, the MIMO channel estimator & equalizer, the MIMO detector, and the channel decoder, determine the computation requirements of a MIMO receiver. Thus, it is natural to offload these very computationally intensive tasks to hardware accelerators to support high data rate applications. Examples of such applications include 3GPP LTE with a 326 Mbps downlink peak data rate and IEEE 802.16e WiMax with a 144 Mbps downlink peak data rate. Further, future standards such as LTE-Advanced and WiMax-m are targeting over 1 Gbps speeds. Data throughput is an important metric to consider when implementing a wireless receiver. Table 1 summarizes the data throughput of different MIMO wireless technologies as of 2009. Given current DSP processing capabilities, it is necessary to develop application-specific hardware accelerators for the complex MIMO algorithms.
Table 1 Throughput performance of different MIMO systems.

            HSDPA+       LTE          LTE          WiMax Rel 1.5  WiMax Rel 1.5
            (2×2 MIMO)   (2×2 MIMO)   (4×4 MIMO)   (2×2 MIMO)     (4×4 MIMO)
Downlink    42 Mbps      173 Mbps     326 Mbps     144 Mbps       289 Mbps
Uplink      11.5 Mbps    58 Mbps      86 Mbps      69 Mbps        69 Mbps
2.1 MIMO Channel Equalization Accelerator

The total workload of a channel equalizer, performed as part of the baseband processing in the mobile receiver, can be decomposed into multiple tasks, as depicted in Figure 4. This block diagram shows the various software processing blocks, or kernels, that make up the channel equalizer firmware executing on the digital signal processor of the mobile receiver. The tasks are: channel estimation based on a known pilot sequence; covariance computation (first row or column) and circularization; FFT/IFFT (fast Fourier transform and inverse fast Fourier transform) post-processing for updating the equalization coefficients; finite impulse response (FIR) filtering applied to the received samples (received frame); and user detection (despreading/descrambling) for recovering the user information bits. The computed data is shared between the various tasks in a pipeline fashion, in that the output of the covariance computation is used as the input to the matrix circularization algorithm.

The computational complexity of the various components of the workload varies with the number of users in the system, with users entering and leaving the cell, and with channel conditions. Regardless of this variance in system conditions at runtime, the dominant portions of the workload are the channel estimation, fast Fourier transform, inverse fast Fourier transform, and FIR filtering, as well as the despreading and descrambling.
Fig. 4 Workload partition for a channel equalizer.
Fig. 5 Channel equalizer DSP/hardware accelerator partitioning.
As an example, using the workload partition criteria for partitioning functionality between a programmable DSP core and a system containing multiple hardware accelerators for a 3.5G HSDPA system, it has been shown that impressive performance results can be obtained. In studying the bottlenecks of such systems when implemented in software on a programmable DSP core, the key bottlenecks have been found to be the channel estimation, fast Fourier transform (FFT), inverse fast Fourier transform (IFFT), FIR filtering, and, to a lesser extent, despreading and descrambling, as illustrated in Figure 4 [9]. By migrating the 3.5G implementation from a solely software based implementation executing on a TMS320C64x based programmable DSP core to a heterogeneous system containing not only programmable DSP cores but also distinct hardware acceleration for the various bottlenecks, the authors achieve almost an 11.2x speedup in the system [9].

Figure 5 illustrates the system partitioning between the programmable DSP core and the hardware (e.g. FPGA or ASIC) accelerators that resulted in load balancing the aforementioned bottlenecks. The arrows in the diagram illustrate the data flow between the local programmable DSP core on-chip data caches and the local RAM arrays. In the case of channel estimation, the work is performed in parallel between the programmable DSP core and hardware acceleration. Various other portions of the workload are offloaded to hardware based accelerators while the programmable DSP core performs the lighter weight signal processing code and bookkeeping.

Despite the ability to achieve over 11x speedup in performance, it is important to note that the experimental setup used in these studies was purposely pessimistic. The various FFT, IFFT, etc. compute blocks in these studies were offloaded to discrete FPGA/ASIC accelerators. As such, data had to be transferred, for example, from local IFFT RAM cells to FIR filter RAM cells.
Fig. 6 MIMO transmitter and receiver.
This is pessimistic in terms of data communication time. In most cases, the number of gates required for a given accelerator implemented in FPGA/ASIC was low enough that multiple accelerators could be implemented within a single FPGA/ASIC, drastically reducing chip-to-chip communication time.
2.2 MIMO Detection Accelerators

MIMO systems (Figure 6) have been shown to greatly increase the reliability and data rate of point-to-point wireless communication [47]. Multiple-antenna systems can be used to improve the reliability and diversity of the receiver by providing it with multiple copies of the transmitted information. This diversity gain is obtained by employing different kinds of space-time block codes (STBC) [1, 45, 46]. In such cases, for a system with M transmit antennas and N receive antennas over a time span of T symbols, the system can be modeled as

Y = HX + N,
(1)
where H is the N × M channel matrix. Moreover, X is the M × T space-time code matrix, where its x_{ij} element is chosen from a complex-valued constellation Ω of order w = |Ω| and corresponds to the complex symbol transmitted from the i-th antenna at the j-th time. The Y matrix is the received N × T matrix, where y_{ij} is the perturbed element received at the i-th receive antenna at the j-th time. Finally, N is the additive white Gaussian noise matrix on the receive antennas at different time slots.

MIMO systems can also be used to further expand the transmit data rate using other space-time coding techniques, particularly layered space-time (LST) codes [17]. One of the most prominent examples of such space-time codes is Vertical Bell Laboratories Layered Space-Time (V-BLAST) [20], otherwise known as spatial multiplexing (SM). In the spatial multiplexing scheme, independent symbols are transmitted from different antennas at different time slots, hence supporting even higher data rates compared to space-time block codes of lower data rate [1, 45]. The spatial multiplexing MIMO system can be modeled similarly to Eq. (1) with T = 1, since there is no coding across the time domain:
y = Hx + n,
(2)
where H is the N × M channel matrix, x is the M-element column vector whose x_i element corresponds to the complex symbol transmitted from the i-th antenna, and y is the received N-element column vector where y_i is the perturbed element received at the i-th receive antenna. The additive white Gaussian noise vector on the receive antennas is denoted by n.

While spatial multiplexing can support very high data rates, the complexity of the maximum-likelihood detector in the receiver increases exponentially with the number of transmit antennas. Thus, unlike the case in Eq. (1), the maximum-likelihood detector for Eq. (2) requires a complex architecture and can be very costly. In order to address this challenge, a range of detectors and solutions have been studied and implemented. In this section, we discuss some of the main algorithmic and architectural features of such detectors for spatial multiplexing MIMO systems.

2.2.1 Maximum-Likelihood (ML) Detection

The maximum-likelihood (ML) or optimal detection of MIMO signals is known to be an NP-complete problem. The ML detector for Eq. (2) is found by minimizing the norm

‖y − Hx‖²    (3)

over all the possible choices of x ∈ Ω^M. This brute-force search can be a very complicated task and, as already discussed, incurs a complexity exponential in the number of antennas; in fact, for M transmit antennas and modulation order w = |Ω|, the number of possible x vectors is w^M. Thus, except for small problem dimensions, it would be infeasible to implement it within a reasonable area-time constraint [10, 19].
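For concreteness, the brute-force search of (3) takes only a few lines of Python; the constellation and channel values below are arbitrary examples, and the w^M-iteration loop is exactly the exponential cost discussed above.

```python
import itertools
import numpy as np

def ml_detect(y, H, constellation):
    """Brute-force ML detection per (3): exhaustively search all w^M vectors."""
    M = H.shape[1]
    best, best_norm = None, np.inf
    for cand in itertools.product(constellation, repeat=M):
        x = np.array(cand)
        norm = np.linalg.norm(y - H @ x)**2
        if norm < best_norm:
            best, best_norm = x, norm
    return best

qpsk = [1 + 1j, 1 - 1j, -1 + 1j, -1 - 1j]          # w = 4
H = np.array([[0.8 + 0.1j, 0.3 - 0.2j],
              [0.1 + 0.4j, 0.9 + 0.0j]])           # M = N = 2: 4^2 = 16 candidates
x = np.array([1 + 1j, -1 - 1j])
assert np.array_equal(ml_detect(H @ x, H, qpsk), x)  # exact recovery, noiseless
```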
2.2.2 Sphere Detection

Sphere detection can be used to achieve ML (or close-to-ML) performance with reduced complexity [15, 23]. In fact, while the norm minimization of Eq. (3) has exponential complexity, it has been shown that using the sphere detection method, the ML solution can be obtained with much lower complexity [15]. In order to avoid the significant overhead of the ML detection, the distance norm can be simplified [13] as follows:

D(s) = ‖y − Hs‖² = ‖Q^H y − Rs‖² = \sum_{i=M}^{1} | y′_i − \sum_{j=i}^{M} R_{i,j} s_j |²,    (4)
Fig. 7 Calculating the distances using a tree. Partial norms, PNs, of dark nodes are less than the threshold. White nodes are pruned out.
where H = QR represents the QR decomposition of the channel matrix, R is an upper triangular matrix, QQ^H = I, and y′ = Q^H y. The norm in Eq. (4) can be computed in M iterations starting with i = M. When i = M, i.e. in the first iteration, the initial partial norm is set to zero, T_{M+1}(s^{(M+1)}) = 0. Using the notation of [11], at each iteration the partial Euclidean distances (PEDs) at the next levels are given by

T_i(s^{(i)}) = T_{i+1}(s^{(i+1)}) + |e_i(s^{(i)})|²    (5)

with s^{(i)} = [s_i, s_{i+1}, ..., s_M]^T and i = M, M−1, ..., 1, where

|e_i(s^{(i)})|² = | y′_i − R_{i,i} s_i − \sum_{j=i+1}^{M} R_{i,j} s_j |².    (6)
One can envision this iterative algorithm as a tree traversal in which each level of the tree corresponds to one i value, i.e. one transmit antenna, and each node has w children based on the chosen modulation. At each iteration, the partial (Euclidean) distance PD_i = |ŷ_i − Σ_{j=i}^{M} R_{i,j} s_j|² corresponding to the i-th level is calculated and added to the partial norm of the respective parent node in the (i + 1)-th level, PN_i = PN_{i+1} + PD_i. In the first iteration, i.e. i = M, the initial partial norm is set to zero, PN_{M+1} = 0. Finishing the iterations gives the final value of the norm. As shown in Figure 7, each node in the tree has its own PN and w children, where w is the QAM modulation size. In order to reduce the search complexity, a threshold C can be set to discard the nodes with PN > C. Therefore, whenever a node k with PN_k > C is reached, any of its children will
have PN ≥ PN_k > C. Hence, not only the k-th node but also its children, and all nodes lying beneath those children in the tree, can be pruned out.

There are different approaches to searching the tree, mainly classified into the depth-first search (DFS) approach and the K-best approach, the latter being based on a breadth-first search (BFS) strategy. In DFS, the tree is traversed vertically [2, 11], while in BFS [22, 52] the nodes are visited horizontally, i.e. level by level. In the DFS approach, starting from the top level, one node is selected and the PNs of its children are calculated; among the newly computed PNs one of them, e.g. the one with the smallest PN, is chosen, and that node becomes the parent for the next iteration. The PNs of its children are calculated, and the same procedure continues until a leaf is reached. At this point, the global threshold is updated with the PN of the most recently visited leaf. The search then continues with another node at a higher level, and the search controller traverses the tree down to another leaf. If a node is reached with a PN larger than the radius, i.e. the global threshold, then that node, along with all nodes lying beneath it, is pruned out, and the search continues with another node.
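The following Python sketch illustrates this depth-first traversal with radius shrinking. It is an illustrative model only; the function name, recursive structure, and NumPy usage are our assumptions, not the cited VLSI designs:

```python
import numpy as np

def dfs_sphere_detect(R, y_hat, symbols):
    """Depth-first sphere detection over the tree of Fig. 7 (illustrative sketch).

    R       : upper-triangular M x M matrix from H = QR
    y_hat   : Q^H y
    symbols : the w constellation points
    """
    M = R.shape[0]
    best, best_norm = None, np.inf        # best_norm acts as the shrinking radius C

    def visit(i, s_partial, pn):
        nonlocal best, best_norm
        for sym in symbols:
            s = [sym] + s_partial                     # s^(i) = [s_i, ..., s_M]
            e = y_hat[i] - R[i, i:] @ np.asarray(s)   # error term of Eq. (6)
            t = pn + abs(e) ** 2                      # PED update, Eq. (5)
            if t >= best_norm:
                continue                              # prune node and its subtree
            if i == 0:
                best, best_norm = s, t                # leaf reached: tighten the radius
            else:
                visit(i - 1, s, t)

    visit(M - 1, [], 0.0)                             # levels i = M, ..., 1 map to M-1, ..., 0
    return np.asarray(best), best_norm
```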
Alternatively, the tree traversal can be performed in a breadth-first manner. At each level, only the best K nodes, i.e. the K nodes with the smallest T_i, are chosen for expansion. This type of detector is generally known as the K-best detector. Note that such a detector requires sorting a list of size K × w to find the best K candidates. For instance, for a 16-QAM system with K = 10 and real-valued decomposition (so that each node has 4 children), this requires sorting a list of size 10 × 4 = 40 at most levels of the tree.

2.2.3 Computational Complexity of Sphere Detection

In this section, we derive and compare the complexity of the techniques discussed above. The complexity of a sphere detection operation, in terms of the number of arithmetic operations, is given by

J_SD(M, w) = Σ_{i=M}^{1} J_i E{D_i},     (7)
where J_i is the number of operations per node in the i-th level and E{D_i} is the expected number of nodes visited at that level. In order to compute J_i, we refer to the VLSI implementation of [11] and note that, for each node, one needs to compute the R_{i,j} s_j multiplications, where, except for the diagonal element R_{i,i}, the multiplications are complex valued. The expansion procedure of Eq. (4) requires computing R_{i,j} s_j for j = i + 1, ..., M, which requires (M − i) complex multiplications, and also computing R_{i,i} s_i for all possible choices of s_j ∈ Ω. Even though there are w different s_j values, only (√w/2 − 1) different multiplications are required for QAM modulations. For instance, for a 16-QAM constellation {±3 ± 3j, ±1 ± 1j, ±3 ± 1j, ±1 ± 3j}, computing only (R_{i,j} × 3) is sufficient for all choices of modulation points. Finally, computing |·|² requires a squarer or a multiplier, depending on the architecture and hardware availability.
Fig. 8 Number of addition and multiplication operations for 16-QAM with different numbers of antennas, M.
In order to compute the number of adders for each norm expansion in (4), we note that (M − i) complex-valued adders are required for ŷ_i − Σ_{j=i+1}^{M} R_{i,j} s_j, and w more complex adders to add the newly computed R_{i,i} s_i values. Once the w different norms |ŷ_i − Σ_{j=i}^{M} R_{i,j} s_j|² are computed, they need to be added to the partial distance coming from the higher level, which requires w more additions. Finally, unless the search has reached the end of the tree, the norms need to be sorted, which, assuming a simple sorter, requires w(w + 1)/2 compare-select operations. We use c_a, c_cs, and c_m to represent the hardware-oriented costs of one adder, one compare-select unit, and one multiplication operation, respectively. Therefore, keeping in mind that each complex multiplier corresponds to four real-valued multipliers and two real-valued adders, and that every complex adder corresponds to two real-valued adders, J_i is calculated by

J_i(M, w) = J_mult(M, w) + J_add(M, w),
J_mult(M, w) = c_m ((√w/2 − 1) + 4(M − i) + 1),
J_add(M, w) = c_a (2(M − i) + 2w + w) + c_cs (w(w + 1)/2) · sign(i − 1),

where sign(i − 1) ensures that sorting is counted only when the search has not reached the end of the tree, and is defined as

sign(t) = 1 if t ≥ 1, and 0 otherwise.     (8)

Figure 8 shows the number of addition and multiplication operations needed for a 16-QAM system with different numbers of antennas.
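As a rough illustration, the per-node cost J_i and the total of Eq. (7) can be tabulated with a short script. The unit costs and function names here are hypothetical, and the expected node counts E{D_i} must be supplied, e.g. from simulation:

```python
import math

def sign(t):
    # Eq. (8): sorting is only counted above the leaf level
    return 1 if t >= 1 else 0

def node_ops(M, w, i, c_a=1.0, c_cs=1.0, c_m=1.0):
    """Per-node operation count J_i(M, w) at tree level i (unit costs assumed)."""
    j_mult = c_m * ((math.isqrt(w) / 2 - 1) + 4 * (M - i) + 1)
    j_add = (c_a * (2 * (M - i) + 2 * w + w)
             + c_cs * (w * (w + 1) / 2) * sign(i - 1))
    return j_mult + j_add

def sphere_ops(M, w, expected_nodes):
    """Total cost of Eq. (7); expected_nodes[i] plays the role of E{D_i}."""
    return sum(node_ops(M, w, i) * expected_nodes[i] for i in range(1, M + 1))
```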
Fig. 9 Sphere detector architecture with multiple PED function units.
2.2.4 Depth-First Sphere Detector Architecture

The depth-first sphere detection algorithm [11, 15, 19, 23] traverses the tree in a depth-first manner: the detector visits the children of each node before visiting its siblings. A constraint, referred to as the radius, is often set on the PED for each level of the tree. A generic depth-first sphere detector architecture is shown in Figure 9. The Pre-Processing Unit (PPU) computes the QR decomposition of the channel matrix as well as Q^H y. The Tree Traversal Unit (TTU) is the controlling unit, which decides in which direction and with which node to continue. The Computation Unit (CMPU) computes the partial distances, based on (4), for the w different s_j: each PD unit computes |ŷ_i − Σ_{j=i}^{M} R_{i,j} s_j|² for one of the w children of a node. Finally, the Node Ordering Unit (NOU) finds the minimum and saves the other legitimate candidates, i.e. those inside the radius, in memory. As an example of the algorithm complexity, FPGA synthesis results for a 50 Mbps 4 × 4 16-QAM depth-first sphere detector are summarized in Table 2 [2].

Table 2 FPGA resource utilization for the sphere detector
Device: Xilinx Virtex-4 xc4vfx100-12ff1517
Number of slices: 4065/42176 (9%)
Number of FFs: 3344/84352 (3%)
Number of look-up tables: 6457/84352 (7%)
Number of RAMB16: 3/376 (1%)
Number of DSP48s: 32/160 (20%)
Max. frequency: 125.7 MHz
Fig. 10 The K-best MIMO detector architecture. The intermediate register banks contain the sorting information as well as other values, e.g. the R matrix.
2.2.5 K-Best Detector Architecture

K-best is another popular algorithm for implementing close-to-ML MIMO detection [22, 52]. Its performance is suboptimal compared to ML and sphere detection; however, it has a fixed complexity and a relatively straightforward architecture. In this section, we briefly introduce the architecture of [22] for implementing the K-best MIMO detector. As illustrated in Figure 10, the processing elements (PEs) at each stage compute the Euclidean norms of (6), find the best K candidates, i.e. the K candidates with the smallest norms, and pass them as the surviving candidates to the next level. It should be pointed out that Eq. (2) can be decomposed into separate real and imaginary parts [22], which doubles the size of the matrices. While such a decomposition reduces the complex-valued operations at the nodes to real-valued operations, it doubles the number of levels of the tree. Therefore, as shown in Figure 10, there are 8 K-best detection levels for the 4-antenna system. By selecting a proper K value, real-value-decomposition MIMO detection causes no performance degradation compared to complex-value MIMO detection [32].
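For comparison with the depth-first sketch above, one breadth-first K-best pass can be modeled as follows. This is again an illustrative sketch; list-based sorting stands in for the hardware sorting networks of the published designs:

```python
import numpy as np

def k_best_detect(R, y_hat, symbols, K):
    """Breadth-first K-best detection (illustrative sketch)."""
    M = R.shape[0]
    candidates = [([], 0.0)]                 # (partial symbol vector, partial norm)
    for i in range(M - 1, -1, -1):           # levels i = M, ..., 1 map to M-1, ..., 0
        expanded = []
        for s_partial, pn in candidates:
            for sym in symbols:
                s = [sym] + s_partial
                e = y_hat[i] - R[i, i:] @ np.asarray(s)
                expanded.append((s, pn + abs(e) ** 2))
        # keep the K candidates with the smallest partial norms (the sorting step)
        candidates = sorted(expanded, key=lambda c: c[1])[:K]
    best, best_norm = candidates[0]
    return np.asarray(best), best_norm
```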
In summary, both depth-first and K-best detectors have a regular and parallel data flow that can be efficiently mapped to hardware. The large number of required multiplications makes the algorithm very difficult to realize on a DSP processor. As the main task of the MIMO detector is to search for the best candidate in a very short time period, it is more efficient to map it onto a parallel hardware searcher with multiple processing elements. Thus, to sustain high throughput MIMO detection, a MIMO hardware accelerator is necessary.

2.3 Channel Decoding Accelerators

Error correcting codes are widely used in digital transmission, especially in wireless communications, to combat the harsh wireless transmission medium. To achieve high throughput, researchers are investigating increasingly advanced error correction codes. The most commonly used error correcting codes in modern systems are convolutional codes, Turbo codes, and low-density parity-check (LDPC) codes. As a core technology in wireless communications, forward error correction (FEC) coding has migrated from the basic 2G convolutional/block codes to the more powerful 3G Turbo codes, with LDPC codes forecast for 4G systems.
As codes become more complicated, the implementation complexity, especially the decoder complexity, increases dramatically and largely exceeds the capability of a general purpose DSP processor. Even the most capable DSPs today need some type of acceleration coprocessor to offload the computation-intensive error correcting tasks. Moreover, it is much more efficient to implement these decoding algorithms on dedicated hardware, because typical error correction algorithms use special arithmetic and are therefore better suited to ASICs or FPGAs. Bitwise operations, linear feedback shift registers, and complex look-up tables can be realized very efficiently with ASICs/FPGAs. In this section, we present some important error correction algorithms and their efficient hardware architectures. We cover the major error correction codes used in current and next generation communication standards, such as 3GPP LTE, IEEE 802.11n Wireless LAN, and IEEE 802.16e WiMax.

2.3.1 Viterbi Decoder Accelerator Architecture

In telecommunications, convolutional codes are among the most popular error correction codes used to improve the performance of wireless links. For example, convolutional codes are used in the data channel of second generation (2G) mobile phone systems (e.g., GSM) and IEEE 802.11a/n wireless local area networks (WLAN). Due to their good performance and efficient hardware architectures, convolutional codes continue to be used by 3G/4G wireless systems for their control channels, for example in 3GPP LTE and IEEE 802.16e WiMax.

A convolutional code is a type of error-correcting code in which each m-bit information symbol is transformed into an n-bit symbol, where m/n is called the code rate. The encoder is basically a finite state machine, where the state is defined as the contents of the encoder memory. Figure 11 (a) and (b) show two example convolutional codes, with constraint length K = 3 and code rate R = 1/3, and constraint length K = 7 and code rate R = 1/2, respectively.

The Viterbi algorithm is an optimal algorithm for decoding convolutional codes [16, 49]. It enumerates all the possible codewords and selects the most likely sequence, which is found by traversing a trellis. The trellis diagram for the K = 3 convolutional code (cf. Figure 11(a)) is shown in Figure 12. In general, a Viterbi decoder contains four blocks: a branch metric calculation (BMC) unit, an add-compare-select (ACS) unit, a survivor memory unit (SMU), and a trace back (TB) unit, as shown in Figure 13. The decoder works as follows. The BMC calculates all the possible branch metrics from the channel inputs. The ACS unit recursively calculates the state metrics, and the survivors are stored in a survivor memory. The survivor paths contain the state transitions needed to reconstruct a sequence of states by tracing back. This reconstructed sequence is then the most likely sequence sent by the transmitter.

In order to reduce memory requirements and latency, Viterbi decoding can be sliced into blocks, which are often referred to as sliding windows. The sliding window principle is shown in Figure 14. The ACS recursion is carried out for the
Fig. 11 Convolutional encoder. (a) K = 3, R = 1/3. (b) K = 7, R = 1/2 encoder used for WLAN.
entire code block, while the trace back operation is performed on every sliding window. To improve the reliability of the trace back, the decisions for the last T steps (also referred to as the warmup window) are discarded. Thus, after a fixed delay of L + 2T, the decoder begins to produce decoded bits on every clock cycle. Like an FFT processor [4, 12, 33], a Viterbi decoder has a very regular data structure. Thus, it is very natural to implement a Viterbi decoder in hardware to support high speed applications, such as IEEE 802.11n wireless LAN with a 300 Mbps peak data rate.
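The end-to-end flow of BMC, ACS recursion, and trace back can be summarized in a compact software model. The sketch below assumes precomputed branch metrics and a generic trellis description; all names are illustrative, and the sliding-window slicing is omitted for brevity:

```python
def viterbi_decode(branch_metrics, K):
    """Trellis search over 2^(K-1) states (illustrative sketch, no sliding window).

    branch_metrics[t][(s, b)] -> (next_state, metric): the transition taken from
    state s on input bit b at trellis step t, with its branch metric.
    """
    n_states = 1 << (K - 1)
    INF = float("inf")
    sm = [0.0] + [INF] * (n_states - 1)          # start from the all-zero state
    survivors = []
    for bm in branch_metrics:                    # BMC output, one dict per step
        new_sm = [INF] * n_states
        decision = [None] * n_states
        for s in range(n_states):                # ACS: add, compare, select
            if sm[s] == INF:
                continue
            for b in (0, 1):
                ns, m = bm[(s, b)]
                if sm[s] + m < new_sm[ns]:
                    new_sm[ns] = sm[s] + m
                    decision[ns] = (s, b)        # survivor: predecessor, input bit
        survivors.append(decision)               # stored in the survivor memory
        sm = new_sm
    state = min(range(n_states), key=lambda s: sm[s])
    bits = []                                    # trace back from the best state
    for decision in reversed(survivors):
        state, b = decision[state]
        bits.append(b)
    return bits[::-1]
```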
Fig. 12 A 4-state trellis diagram for the encoder in Figure 11(a). The solid lines indicate transitions for a “0” input and the dashed lines for a “1” input. Each stage of the trellis consists of 2^{K−2} Viterbi butterflies; one such butterfly is highlighted at step t + 2.
Fig. 13 Viterbi decoder architecture with parallel ACS function units.
The most complex operation in the Viterbi algorithm is the ACS recursion. Figure 15(a) shows one ACS butterfly of the 4-state trellis described above (cf. Figure 12). Each butterfly contains two ACS units. For each ACS unit, two branches lead from two states on the left to a state on the right. A branch metric is computed for each branch of an ACS unit. Note that it is not necessary to calculate every branch metric for all 4 branches in an ACS butterfly, because some of them are identical, depending on the trellis structure. Based on the old state metrics (SMs) and branch metrics (BMs), the new state metrics are updated as:

SM'(0) = min( SM(0) + BM(0, 0), SM(1) + BM(1, 0) ),     (9)
SM'(2) = min( SM(0) + BM(0, 2), SM(1) + BM(1, 2) ).     (10)
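In software form, one such butterfly is just a pair of add-compare-select steps. The following fragment mirrors Eqs. (9)-(10); the dictionary-based branch metric interface is our assumption:

```python
def acs_butterfly(sm0, sm1, bm):
    """One radix-2 ACS butterfly, Eqs. (9)-(10).

    bm[(i, j)] is the branch metric from old state i to new state j.
    Returns the new state metrics SM'(0), SM'(2) and the survivor decisions.
    """
    cand0 = (sm0 + bm[(0, 0)], sm1 + bm[(1, 0)])
    cand2 = (sm0 + bm[(0, 2)], sm1 + bm[(1, 2)])
    new_sm0 = min(cand0)
    new_sm2 = min(cand2)
    return (new_sm0, new_sm2), (cand0.index(new_sm0), cand2.index(new_sm2))
```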
Given a convolutional code of constraint length K, 2^{K−2} ACS butterflies are required for each step of the decoding. These butterflies can be implemented serially or in parallel. To maximize the decoding throughput, a parallel implementation is often used. The basic parallel ACS architecture can process one bit of the message at each clock cycle. However, the processing speed can be increased by a factor of N by merging every N stages of the trellis into one high-radix stage with 2^N branches for every state. Figure 16 shows the radix-2, radix-4, and radix-8 trellis structures. Figure 17 shows a radix-8 ACS architecture, which can process three message bits over three trellis path bits. Generally, a radix-N ACS architecture can process log2(N) message bits over log2(N) trellis path bits. For high speed applications, high radix ACS architectures are very common in Viterbi decoders. The top level of a generic Viterbi decoder accelerator is shown in Figure 18. Although a pure software
Fig. 14 Sliding window decoding with warmup. Decoding length = L; warmup length = T.
Fig. 15 ACS butterfly architecture. (a) Basic butterfly structure. (b) ACS butterfly hardware implementation. (c) In-place butterfly structure for a 4-state trellis diagram.
approach is feasible on a modern DSP processor, it is much more cost effective to implement the Viterbi decoder as a hardware accelerator. The decoder can be memory mapped into the DSP external memory space so that DMA transfers can be utilized without the intervention of the host DSP. Data is passed in and out in a pipelined manner so that decoding can be performed simultaneously with I/O operations.
Fig. 16 Different radix trellis structures. (a) Radix-2 trellis. (b) Radix-4 trellis. (c) Radix-8 trellis.
Fig. 17 Radix-8 ACS architecture.
2.3.2 Turbo Decoder Accelerator Architecture

Turbo codes are a class of high-performance, capacity-approaching error-correcting codes [5]. As a breakthrough in coding theory, Turbo codes are widely used in many 3G/4G wireless standards such as CDMA2000, WCDMA/UMTS, 3GPP LTE, and IEEE 802.16e WiMax. However, the inherently large decoding latency and the complex iterative decoding algorithm mean that they are rarely implemented on a general purpose DSP. For example, Texas Instruments' multi-core DSP processor TI C6474 employs a Turbo decoder accelerator to support 2 Mbps CDMA Turbo codes for the base station [27]. The decoding throughput requirement for 3GPP LTE Turbo codes is more than 80 Mbps in the uplink and 320 Mbps in the downlink. Because the Turbo codes used in many standards are very similar, e.g. the encoding polynomials are the same for WCDMA/UMTS/LTE, the Turbo decoder is often accelerated by reconfigurable hardware.

A classic Turbo encoder structure is depicted in Figure 19. The basic encoder consists of two systematic convolutional encoders and an interleaver. The information sequence u is encoded into three streams: systematic, parity 1, and parity 2. Here the interleaver is used to permute the information sequence into a second, different sequence for encoder 2. The performance of a Turbo code depends critically on the interleaver structure [36]. The BCJR algorithm [3], also called the forward-backward algorithm or maximum a posteriori (MAP) algorithm, is the main component in the Turbo decoding process. The basic structure of Turbo decoding is functionally illustrated in Figure 20.
Fig. 18 A generic Viterbi decoder accelerator architecture. Data movement between the DSP processor and the accelerator is via DMA. Fully-parallel ACS function units are used to support high speed decoding.
Fig. 19 Turbo encoder structure. (a) Basic structure. (b) Structure of the Turbo encoder in 3GPP LTE.
The decoding is based on the MAP algorithm. During the decoding process, each MAP decoder receives the channel data and a priori information from the other constituent MAP decoder through interleaving (Π) or deinterleaving (Π^{−1}), and produces extrinsic information at its output. The MAP algorithm is an optimal symbol decoding algorithm that minimizes the probability of a symbol error. It computes the a posteriori probabilities (APPs) of the information bits as follows:

Λ(û_k) = max*_{u: u_k = 1} ( α_{k−1}(s_{k−1}) + γ_k(s_{k−1}, s_k) + β_k(s_k) )     (11)
       − max*_{u: u_k = 0} ( α_{k−1}(s_{k−1}) + γ_k(s_{k−1}, s_k) + β_k(s_k) ),     (12)
Fig. 20 Basic structure of an iterative Turbo decoder. (a) Iterative decoding based on two MAP decoders. (b) Forward/backward recursion on the trellis diagram.
Fig. 21 ACSA structure. (a) Flow of state metric update. (b) Circuit implementation of an ACSA unit.
where α_k and β_k denote the forward and backward state metrics, and are calculated as follows:

α_k(s_k) = max*_{s_{k−1}} { α_{k−1}(s_{k−1}) + γ_k(s_{k−1}, s_k) },     (13)
β_k(s_k) = max*_{s_{k+1}} { β_{k+1}(s_{k+1}) + γ_{k+1}(s_k, s_{k+1}) }.     (14)
The γ_k term above is the branch transition probability that depends on the trellis diagram, and is usually referred to as a branch metric. The max* operator employed in the above descriptions is the core arithmetic computation required by MAP decoding. It is defined as:

max*(a, b) = log(e^a + e^b) = max(a, b) + log(1 + e^{−|a−b|}).     (15)
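Eq. (15) is cheap to evaluate in software as well, which makes it easy to prototype MAP decoding before committing to hardware. A one-line sketch:

```python
import math

def max_star(a, b):
    """Jacobian logarithm of Eq. (15): max*(a, b) = log(e^a + e^b)."""
    return max(a, b) + math.log1p(math.exp(-abs(a - b)))
```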
A basic add-compare-select-add (ACSA) unit is shown in Figure 21. This circuit can process one step of the trellis per cycle and is often referred to as a radix-2 ACSA unit. To increase the processing speed, the trellis can be transformed by merging every two stages into one radix-4 stage, as shown in Figure 22; the throughput is thus doubled by applying this transform. For an N-state Turbo code, N such ACSA units are required in each step of the trellis processing. To maximize the decoding throughput, a parallel implementation is usually employed to compute all N state metrics simultaneously. In the original MAP algorithm, the entire set of forward metrics must be computed before the first soft log-likelihood ratio (LLR) output can be generated. This requires storage of K metrics for all N states, where K is the block length and N is the number of states in the trellis diagram. As with the Viterbi algorithm, a sliding window scheme is often applied to the MAP algorithm to reduce the decoding latency and the memory storage requirement. By selecting a proper sliding window length, e.g. 32 for a rate 1/3 code, there is nearly no bit error rate (BER) performance degradation. Figure 23(a) shows an example of the sliding window algorithm, where a dummy reverse metric calculation (RMC) is used to obtain the initial values for the metrics. The sliding window hardware architecture is shown
Fig. 22 (a) An example of a radix-4 trellis. (b) Radix-4 ACSA circuit implementation.
in Figure 23(b). The decoding operation is based on three recursion units: two used for the reverse (or backward) recursions (dummy RMC 1 and effective RMC 2), and one for the forward recursion (FMC). Each recursion unit contains parallel ACSA units. After a fixed latency, the decoder produces soft LLR outputs on every clock cycle. To further increase the throughput, a parallel sliding window scheme [6, 30, 34, 44, 48] is often applied, as shown in Figure 24.

Another key component of Turbo decoders is the interleaver. Generally, the interleaver is a device that takes an input bit sequence and produces an output sequence that is as uncorrelated as possible. Theoretically, a random interleaver would have the best performance, but it is difficult to implement in hardware. Thus, researchers have investigated pseudo-random interleavers such as the row-column permutation interleaver for 3G Rel-99 Turbo coding and the newer QPP interleaver [41] for 3G LTE Turbo coding. The main difference between these two types of pseudo-random interleavers is their capability to support parallel Turbo decoding. The drawback of the row-column permutation interleaver is that memory conflicts occur when multiple MAP decoders are employed for parallel decoding, and extra buffers are necessary to resolve them [37]. To solve this problem, the new 3G LTE standard has
Fig. 23 Sliding window MAP decoder. (a) An example of the sliding window MAP algorithm, where a dummy RMC is performed to obtain the initial metrics. (b) MAP decoder hardware architecture.
adopted a new interleaver structure called the QPP interleaver [41]. Given an information block length N, the x-th QPP interleaved output position is given by

Π(x) = (f_2 x² + f_1 x) mod N,  0 ≤ x, f_1, f_2 < N.     (16)
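Because Π(x) is a quadratic polynomial, successive outputs can be generated with two accumulators and no multipliers, which is what the circuit in Figure 25 exploits via the recurrences Π(x+1) = (Π(x) + Γ(x)) mod N and Γ(x+1) = (Γ(x) + 2f_2) mod N. The following sketch (function name and interface are illustrative) applies the same idea:

```python
def qpp_interleave(N, f1, f2):
    """QPP interleaver of Eq. (16), computed incrementally without multiplies."""
    pi, gamma = 0, (f1 + f2) % N       # Pi(0) = 0, Gamma(0) = f1 + f2
    out = []
    for _ in range(N):
        out.append(pi)
        pi = (pi + gamma) % N          # Pi(x+1) = (Pi(x) + Gamma(x)) mod N
        gamma = (gamma + 2 * f2) % N   # Gamma(x+1) = (Gamma(x) + 2*f2) mod N
    return out
```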
It has been shown in [41] that the QPP interleaver will not cause memory conflicts as long as the parallelism level is a factor of N. The simplest approach to implementing an interleaver is to store all the interleaving patterns in non-volatile memory such as ROM. However, this approach can become very expensive, because a large number of interleaving patterns must be stored to support decoding of multiple block sizes, as with 3GPP LTE Turbo codes. Fortunately, there usually exists an efficient hardware implementation of the interleaver. For example, Figure 25 shows a circuit implementation of the QPP interleaver in the 3GPP LTE standard [44].

A basic Turbo accelerator architecture is shown in Figure 26. The main difference between the Viterbi decoder and the Turbo decoder is that the Turbo decoder is based on iterative message passing algorithms. Thus, a Turbo accelerator may need more communication and control coordination with the DSP host processor. For example, the interleaving addresses can be generated by the DSP processor and passed to the Turbo accelerator. The DSP can monitor the decoding process to decide when to terminate the decoding if there are no further decoding gains. Alternatively, the Turbo accelerator can be configured to operate without DSP intervention. To support this feature, some special hardware, such as the interleavers, has to be configurable via DSP control registers. To decrease the required bus bandwidth,
Fig. 24 An example of parallel sliding window decoding, where a decode block is sliced into 4 sections. The sub-blocks are overlapped by one sliding window length W in order to obtain the initial values for the boundary states.
intermediate results should not be passed back to the DSP processor; only the successfully decoded bits need to be returned, e.g. via the DSP DMA controller. Further, to support multiple Turbo codes in different communication systems, a flexible MAP decoder is necessary. In fact, many standards employ similar Turbo code structures. For instance, CDMA, WCDMA, UMTS, and 3GPP LTE all use an eight-state binary Turbo code with polynomial (13, 15, 17). Although the IEEE 802.16e WiMax and DVB-RCS standards use a different eight-state double binary Turbo code, the trellis structures of these Turbo codes are very similar, as illustrated in Figure 27. Thus, it is possible to design multi-standard Turbo decoders based on flexible MAP decoder datapaths [31, 40, 44]. It has been shown in [44] that the area overhead to support multiple codes is only about 7%. In addition, when the throughput requirement is high, e.g. more than 20 Mbps, multiple MAP decoders can be activated to increase the throughput performance.
Fig. 25 A circuit implementation of the QPP interleaver Π(x) = (f_2 x² + f_1 x) mod K, based on the recurrences Π(x+1) = (Π(x) + Γ(x)) mod K and Γ(x+1) = (Γ(x) + 2f_2) mod K [44].
Fig. 26 Turbo decoder accelerator architecture. Multiple MAP decoders are used to support high throughput decoding of Turbo codes. Special function units such as interleavers are also implemented in hardware.
In summary, due to its iterative structure, a Turbo decoder needs more GFLOPS than are available in a general purpose DSP processor. For this reason, Texas Instruments' C64x DSP processors integrate a 2 Mbps 3G Turbo decoder accelerator on the same die [27]. Because of the parallel and recursive algorithms and the special logarithmic arithmetic, it is more cost effective to realize a Turbo decoder in hardware.

2.3.3 LDPC Decoder Accelerator Architecture

A low-density parity-check (LDPC) code [18] is another important error correcting code, and is among the most efficient coding schemes discovered as of
Fig. 27 Radix-4 trellis structures of (a) CDMA/WCDMA/UMTS/LTE Turbo codes and (b) WiMax/DVB-RCS Turbo codes.
Fig. 28 Implementation of LDPC decoders, where PEC denotes a processing element for a check node and PEV a processing element for a variable node. (a) Fully-parallel. (b) Semi-parallel.
2009. The remarkable error correction capabilities of LDPC codes have led to their recent adoption in many standards, such as IEEE 802.11n, IEEE 802.16e, and IEEE 802.3 10GBase-T. The huge computation and high throughput requirements make it very difficult to implement a high throughput LDPC decoder on a general purpose DSP. For example, a 5.4 Mbps LDPC decoder was implemented on a TMS320C64xx DSP running at 600 MHz [29]. This throughput is not sufficient to support the high data rates defined in new wireless standards. Thus, it is important to develop area and power efficient hardware LDPC decoding accelerators.

A binary LDPC code is a linear block code specified by a very sparse binary M × N parity check matrix: H · x^T = 0, where x is a codeword. H can be viewed as a bipartite graph in which each column and each row of H represent a variable node and a check node, respectively. The decoding algorithm is based on the iterative message passing algorithm (also called the belief propagation algorithm), which exchanges messages between the variable nodes and check nodes on the graph. The hardware implementation of LDPC decoders can be serial, semi-parallel, or fully-parallel, as shown in Figure 28. A fully-parallel implementation has the maximum number of processing elements to achieve very high throughput. A semi-parallel implementation, on the other hand, has a smaller number of processing elements that are re-used, e.g. z processing elements are employed in Figure 28(b); memories are then usually required to store temporary results. In many practical systems, semi-parallel implementations are used to achieve 100 Mbps to 1 Gbps throughput with reasonable complexity [7, 21, 35, 42, 43, 54].

In LDPC decoding, the main complexity comes from the check node processing. Each check node m receives a set of variable node messages from the variable nodes in N_m. Based on these data, the check node messages are computed as

Λ_{mn} = ⊞_{j ∈ N_m \ n} λ_{mj} = ( ⊞_{j ∈ N_m} λ_{mj} ) ⊟ λ_{mn},
Fig. 29 Recursive architecture to compute check node messages [42].
where Λ_{mn} and λ_{mn} denote the check node message and the variable node message, respectively. The special arithmetic operators ⊞ and ⊟ are defined as follows:

a ⊞ b ≜ f(a, b) = log( (1 + e^a e^b) / (e^a + e^b) )
       = sign(a) sign(b) min(|a|, |b|) + log(1 + e^{−|a+b|}) − log(1 + e^{−|a−b|}),

a ⊟ b ≜ g(a, b) = log( (1 − e^a e^b) / (e^a − e^b) )
       = sign(a) sign(b) min(|a|, |b|) + log(1 − e^{−|a+b|}) − log(1 − e^{−|a−b|}).
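These log-domain identities are easy to check numerically. The sketch below implements a ⊞ b, and also serves as the building block for the decoding-iteration sketch given later in this section:

```python
import math

def box_plus(a, b):
    """a (box-plus) b = f(a, b): the check node operator in the log domain."""
    return (math.copysign(1.0, a) * math.copysign(1.0, b) * min(abs(a), abs(b))
            + math.log1p(math.exp(-abs(a + b)))
            - math.log1p(math.exp(-abs(a - b))))
```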
Figure 29 shows a hardware implementation from [42] that computes the check node messages Λ_{mn} for one check row m. Because multiple check rows can be processed simultaneously in the LDPC decoding algorithm, multiple such check node units can be used to increase the decoding speed. As the number of ALU units in a general purpose DSP processor is limited, it is difficult to achieve more than 10 Mbps throughput in a DSP implementation.

For a random LDPC code, the main complexity comes not only from the complex check node processing, but also from the interconnection network between check nodes and variable nodes. To simplify the routing of the interconnection network, many practical standards employ structured LDPC codes, or quasi-cyclic LDPC (QC-LDPC) codes. The parity check matrix of a QC-LDPC code is shown in Figure 30. Table 3 summarizes the design parameters of the QC-LDPC codes for the IEEE 802.11n WLAN and IEEE 802.16e WiMax wireless standards. As can be seen, many design parameters are in the same range for these two applications; thus it is possible to design reconfigurable hardware to support multiple standards [42].
Table 3 Design parameters for H in standardized LDPC codes

Standard         z      j     k   Check node degree   Variable node degree   Max. throughput
WLAN 802.11n     27-81  4-12  24  7-22                2-12                   600 Mbps
WiMax 802.16e    24-96  4-12  24  6-20                2-6                    144 Mbps
As an example, a multi-standard semi-parallel LDPC decoder accelerator architecture is shown in Figure 31 [42]. In order to support data rates of several hundred Mbps, multiple PEs are used to process multiple check rows simultaneously. As with Turbo decoding, LDPC decoding is based on an iterative decoding algorithm. The iterative decoding flow is as follows: at each iteration, 1 × z APP messages, denoted L_n, are fetched from the L-memory and passed through a permuter (e.g. a barrel shifter) to be routed to the z PEs (z is the parallelism level). The soft input information λ_{mn} is formed by subtracting the old extrinsic message Λ_{mn} from the APP message L_n. The PEs then generate new extrinsic messages Λ_{mn} and APP messages L'_n, and store them back to memory. The operation mode of the LDPC accelerator needs to be configured at the beginning of the decoding; after that, it works without DSP intervention. Once decoding has finished, the decoded bits are passed back to the DSP processor. Figure 32 shows the ASIC implementation of this decoder (VLSI layout view) and its power consumption for different block sizes. As the block size increases, the number of active PEs increases, and thus more power is consumed.
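The iterative message flow described above (subtract the old extrinsic value, apply the check node update, add the new extrinsic value back) can be sketched in a few lines on top of the box_plus helper defined earlier. The row-major data layout and names below are our assumptions, not the memory organization of [42]:

```python
def ldpc_iteration(L, Lam, rows):
    """One decoding iteration of the message flow in Fig. 31 (illustrative sketch).

    L[n]      : APP message for variable node n
    Lam[m][n] : extrinsic (check) message for edge (m, n)
    rows[m]   : the set N_m of variable nodes checked by row m
    """
    for m, Nm in enumerate(rows):
        lam = {n: L[n] - Lam[m][n] for n in Nm}   # soft inputs lambda_mn
        for n in Nm:
            msg = None
            for j in Nm:
                if j != n:                        # combine all edges except n
                    msg = lam[j] if msg is None else box_plus(msg, lam[j])
            Lam[m][n] = msg                       # new extrinsic message
            L[n] = lam[n] + msg                   # updated APP message L'_n
    return L, Lam
```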
3 Summary

Digital signal processing complexity in high-speed wireless communications is driving a need for high performance heterogeneous DSP systems with real-time processing. Many wireless algorithms, such as channel decoding and MIMO detection, demonstrate significant data parallelism. For this class of data-parallel algorithms, application specific DSP accelerators are necessary to meet real-time requirements
Fig. 30 Structured LDPC parity check matrix with j block rows and k block columns. Each submatrix is a z × z shifted identity matrix.
Fig. 31 Semi-parallel LDPC decoder accelerator architecture. Multiple PEs (z of them) are used to increase decoding speed. Variable messages are stored in the L-memory and check messages in the Λ-memory. An interconnection network and an inverse interconnection network are used to route the data.
while minimizing power consumption. Spatial locality of data, data level parallelism, computational complexity, and task level parallelism are four major criteria for identifying which DSP algorithms should be off-loaded to an accelerator. The additional cost incurred by data movement between the DSP and the hardware accelerator must also be considered.

There are a number of DSP architectures which include true hardware based accelerators. Examples include the Texas Instruments C64x series of DSPs,
Fig. 32 An example of an LDPC decoder hardware accelerator [42]. (a) VLSI layout view (3.5 mm² area, 90 nm technology). (b) Power consumption for different block sizes.
which include a 2 Mbps Turbo decoding accelerator [27], and Freescale Semiconductor's six-core broadband wireless access DSP MSC8156, which includes a programmable 200 Mbps Turbo decoding accelerator (6 iterations), a 115 Mbps Viterbi decoding accelerator (K = 9), an FFT/IFFT accelerator for sizes of 128, 256, 512, 1024 or 2048 points at up to 350 million samples/s, and a DFT/IDFT accelerator for sizes up to 1536 points at up to 175 million samples/s [38].

Relying on a single DSP processor for all signal processing tasks would be a clean solution. As a practical matter, however, multiple DSP processors are necessary to implement a next generation wireless handset or base station. This means greater system cost, more board space, and more power consumption. Integrating hardware communication accelerators, such as MIMO detectors and channel decoders, into the DSP processor silicon can create an efficient System on Chip. This offers many advantages: the dedicated accelerators relieve the DSP processor of the parallel, computation-intensive signal processing burden, freeing DSP processing capacity for other system control functions that benefit more from programmability.
References

1. Alamouti, S.M.: A simple transmit diversity technique for wireless communications. IEEE Journal on Selected Areas in Communications 16(8), 1451–1458 (1998)
2. Amiri, K., Cavallaro, J.R.: FPGA implementation of dynamic threshold sphere detection for MIMO systems. In: IEEE Asilomar Conference on Signals, Systems and Computers, pp. 94–98 (2006)
3. Bahl, L., Cocke, J., Jelinek, F., Raviv, J.: Optimal decoding of linear codes for minimizing symbol error rate. IEEE Transactions on Information Theory IT-20, 284–287 (1974)
4. Bass, B.: A low-power, high-performance, 1024-point FFT processor. In: IEEE International Solid-State Circuits Conference (ISSCC) (1999)
5. Berrou, C., Glavieux, A., Thitimajshima, P.: Near Shannon limit error-correcting coding and decoding: Turbo-codes. In: IEEE International Conference on Communications, pp. 1064–1070 (1993)
6. Bougard, B., Giulietti, A., Derudder, V., Weijers, J.W., Dupont, S., Hollevoet, L., Catthoor, F., Van der Perre, L., De Man, H., Lauwereins, R.: A scalable 8.7-nJ/bit 75.6-Mb/s parallel concatenated convolutional (turbo-) codec. In: IEEE International Solid-State Circuits Conference (ISSCC) (2003)
7. Brack, T., Alles, M., Lehnigk-Emden, T., Kienle, F., Wehn, N., L'Insalata, N., Rossi, F., Rovini, M., Fanucci, L.: Low complexity LDPC code decoders for next generation standards. In: Design, Automation, and Test in Europe (DATE), pp. 1–6 (2007)
8. Brogioli, M.: Reconfigurable heterogeneous DSP/FPGA based embedded architectures for numerically intensive embedded computing workloads. Ph.D. thesis, Rice University, Houston, Texas, USA (2007)
9. Brogioli, M., Radosavljevic, P., Cavallaro, J.: A general hardware/software codesign methodology for embedded signal processing and multimedia workloads. In: IEEE 40th Asilomar Conference on Signals, Systems, and Computers, pp. 1486–1490 (2006)
10. Burg, A.: VLSI circuits for MIMO communication systems. Ph.D. thesis, Swiss Federal Institute of Technology, Zurich, Switzerland (2006)
11. Burg, A., Borgmann, M., Wenk, M., Zellweger, M., Fichtner, W., Bolcskei, H.: VLSI implementation of MIMO detection using the sphere decoding algorithm. IEEE Journal of Solid-State Circuits 40(7), 1566–1577 (2005)
12. Cooley, J.W., Tukey, J.W.: An algorithm for the machine calculation of complex Fourier series. Mathematics of Computation 19, 297–301 (1965)
13. Damen, M.O., Gamal, H.E., Caire, G.: On maximum likelihood detection and the search for the closest lattice point. IEEE Transactions on Information Theory 49(10), 2389–2402 (2003)
14. Analog Devices: The SHARC Processor Family. http://www.analog.com/en/embeddedprocessing-dsp/sharc/processors/index.html (2009)
15. Fincke, U., Pohst, M.: Improved methods for calculating vectors of short length in a lattice, including a complexity analysis. Mathematics of Computation 44(170), 463–471 (1985)
16. Forney, G.D.: The Viterbi algorithm. Proceedings of the IEEE 61(3), 268–278 (1973)
17. Foschini, G.: Layered space-time architecture for wireless communication in a fading environment when using multiple antennas. Bell Labs Technical Journal 2, 41–59 (1996)
18. Gallager, R.: Low-density parity-check codes. IEEE Transactions on Information Theory 8, 21–28 (1962)
19. Garrett, D., Davis, L., ten Brink, S., Hochwald, B., Knagge, G.: Silicon complexity for maximum likelihood MIMO detection using spherical decoding. IEEE Journal of Solid-State Circuits 39(9), 1544–1552 (2004)
20. Golden, G., Foschini, G.J., Valenzuela, R.A., Wolniansky, P.W.: Detection algorithms and initial laboratory results using V-BLAST space-time communication architecture. Electronics Letters 35, 14–15 (1999)
21. Gunnam, K., Choi, G.S., Yeary, M.B., Atiquzzaman, M.: VLSI architectures for layered decoding for irregular LDPC codes of WiMax. In: IEEE International Conference on Communications, pp. 4542–4547 (2007)
22. Guo, Z., Nilsson, P.: Algorithm and implementation of the K-best sphere decoding for MIMO detection. IEEE Journal on Selected Areas in Communications 24(3), 491–503 (2006)
23. Hassibi, B., Vikalo, H.: On the sphere-decoding algorithm I. Expected complexity. IEEE Transactions on Signal Processing 53(8), 2806–2818 (2005)
24. Hunter, H.C., Moreno, J.H.: A new look at exploiting data parallelism in embedded systems. In: Proceedings of the International Conference on Compilers, Architectures and Synthesis for Embedded Systems, pp. 159–169 (2003)
25. Tensilica Inc.: http://www.tensilica.com (2009)
26. Texas Instruments: TMS320C55x DSP CPU Programmer's Reference Supplement. http://focus.ti.com/lit/ug/spru652g/spru652g.pdf (2005)
27. Texas Instruments: TMS320C6474 high performance multicore processor datasheet. http://focus.ti.com/docs/prod/folders/print/tms320c6474.html (2008)
28. Texas Instruments: TMS320C6000 CPU and Instruction Set Reference Guide. http://dspvillage.ti.com (2001)
29. Lechner, G., Sayir, J., Rupp, M.: Efficient DSP implementation of an LDPC decoder. In: IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP), vol. 4, pp. 665–668 (2004)
30. Lee, S.J., Shanbhag, N.R., Singer, A.C.: Area-efficient high-throughput MAP decoder architectures. IEEE Transactions on VLSI Systems 13, 921–933 (2005)
31. Martina, M., Nicola, M., Masera, G.: A flexible UMTS-WiMax turbo decoder architecture. IEEE Transactions on Circuits and Systems II 55, 369–373 (2008)
32. Myllylä, M., Silvola, P., Juntti, M., Cavallaro, J.R.: Comparison of two novel list sphere detector algorithms for MIMO-OFDM systems. In: IEEE International Symposium on Personal, Indoor and Mobile Radio Communications (2006)
33. Parhi, K.K.: VLSI Digital Signal Processing Systems: Design and Implementation. Wiley (1999)
34. Prescher, G., Gemmeke, T., Noll, T.G.: A parametrizable low-power high-throughput turbo-decoder. In: IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP), vol. 5, pp. 25–28 (2005)
35. Rovini, M., Gentile, G., Rossi, F., Fanucci, L.: A scalable decoder architecture for IEEE 802.11n LDPC codes. In: IEEE Global Telecommunications Conference, pp. 3270–3274 (2007)
36. Sadjadpour, H., Sloane, N., Salehi, M., Nebe, G.: Interleaver design for turbo codes. IEEE Journal on Selected Areas in Communications 19, 831–837 (2001)
37. Salmela, P., Gu, R., Bhattacharyya, S., Takala, J.: Efficient parallel memory organization for turbo decoders. In: Proc. European Signal Processing Conf., pp. 831–835 (2007)
38. Freescale Semiconductor: MSC8156 six core broadband wireless access DSP. www.freescale.com/starcore (2009)
39. Freescale Semiconductor: Freescale StarCore Architecture. www.freescale.com/starcore (2009)
40. Shin, M.C., Park, I.C.: A programmable turbo decoder for multiple 3G wireless standards. In: IEEE Solid-State Circuits Conference, vol. 1, pp. 154–484 (2003)
41. Sun, J., Takeshita, O.: Interleavers for turbo codes using permutation polynomials over integer rings. IEEE Transactions on Information Theory 51(1) (2005)
42. Sun, Y., Cavallaro, J.R.: A low-power 1-Gbps reconfigurable LDPC decoder design for multiple 4G wireless standards. In: IEEE International SOC Conference (SoCC), pp. 367–370 (2008)
43. Sun, Y., Karkooti, M., Cavallaro, J.R.: VLSI decoder architecture for high throughput, variable block-size and multi-rate LDPC codes. In: IEEE International Symposium on Circuits and Systems (ISCAS), pp. 2104–2107 (2007)
44. Sun, Y., Zhu, Y., Goel, M., Cavallaro, J.R.: Configurable and scalable high throughput turbo decoder architecture for multiple 4G wireless standards. In: IEEE International Conference on Application-Specific Systems, Architectures and Processors (ASAP), pp. 209–214 (2008)
45. Tarokh, V., Jafarkhani, H., Calderbank, A.R.: Space-time block codes from orthogonal designs. IEEE Transactions on Information Theory 45(5), 1456–1467 (1999)
46. Tarokh, V., Jafarkhani, H., Calderbank, A.R.: Space-time block coding for wireless communications: Performance results. IEEE Journal on Selected Areas in Communications 17(3), 451–460 (1999)
47. Telatar, I.E.: Capacity of multi-antenna Gaussian channels. European Transactions on Telecommunications 10, 585–595 (1999)
48. Thul, M.J., Gilbert, F., Vogt, T., Kreiselmaier, G., Wehn, N.: A scalable system architecture for high-throughput turbo-decoders. Journal of VLSI Signal Processing, pp. 63–77 (2005)
49. Viterbi, A.: Error bounds for convolutional coding and an asymptotically optimum decoding algorithm. IEEE Transactions on Information Theory IT-13, 260–269 (1967)
50. Wijting, C., Ojanperä, T., Juntti, M., Kansanen, K., Prasad, R.: Groupwise serial multiuser detectors for multirate DS-CDMA. In: IEEE Vehicular Technology Conference, vol. 1, pp. 836–840 (1999)
51. Willmann, P., Kim, H., Rixner, S., Pai, V.S.: An efficient programmable 10 Gigabit Ethernet network interface card. In: ACM International Symposium on High-Performance Computer Architecture, pp. 85–86 (2006)
52. Wong, K., Tsui, C., Cheng, R.S., Mow, W.: A VLSI architecture of a K-best lattice decoding algorithm for MIMO channels. In: IEEE International Symposium on Circuits and Systems, vol. 3, pp. 273–276 (2002)
53. Ye, Z.A., Moshovos, A., Hauck, S., Banerjee, P.: CHIMAERA: A high-performance architecture with a tightly-coupled reconfigurable functional unit. In: Proceedings of the 27th Annual International Symposium on Computer Architecture, pp. 225–235 (2000)
54. Zhong, H., Zhang, T.: Block-LDPC: a practical LDPC coding system design approach. IEEE Transactions on Circuits and Systems I: Fundamental Theory and Applications 52(4), 766–775 (2005)
FPGA-based DSP

John McAllister
Abstract The Field Programmable Gate Array (FPGA) is an excellent host embedded device for DSP systems when real-time performance beyond that which multiprocessor platforms can achieve is required, but volumes are too small to justify the costs of developing a custom chip. This niche role is due to the ability of the FPGA to host custom computing architectures tailored to the application. Modern FPGAs play host to a range of complex processing resources which can only be effectively exploited by heterogeneous processing architectures composed of microprocessors with custom co-processors, parallel software processors, and dedicated hardware units. The complexity of these architectures, coupled with the need for frequent regeneration of the implementation with each new application, makes FPGA system design a highly complex and unique design problem. The key to success in this process is the ability of the designer to best exploit the FPGA resources in a custom architecture, and the ability of design tools to quickly and efficiently generate these architectures. This chapter describes the state-of-the-art in FPGA device resources, computing architectures, and design tools which support the DSP system design process.
John McAllister
Institute of Electronics, Communications and Information Technology (ECIT), Queen's University Belfast, UK. e-mail: [email protected]

1 Introduction

In the mid-1980s, the pioneers of FPGA technology realised that whilst transistors were at that time scarce, and as a result precious to circuit designers, Moore's Law would eventually make them so cheap that they could be used as a low cost collective in macro-blocks [34]. If these macro-blocks permitted use as 'programmable' logic, i.e. were functionally flexible enough to perform any simple logic function,
and could be arbitrarily interconnected, they could be used in networks to implement any digital logic functionality. This vision became incarnate in the Xilinx XC2064, the world's first FPGA device, in 1985. In the intervening period, the FPGA has changed markedly and dramatically: from the simple 'glue-logic' style devices of the original incarnations to, in the latest generations, domain-specific heterogeneous processing platforms comprising microprocessors, multi-gigabit serial transceivers, networked communications endpoints and vast levels of on-chip computation resource.

The key motivating factors for choosing FPGA as a target platform for DSP applications are flexibility, real-time performance and cost. Whilst software processors allow functional flexibility, the application's real-time performance requirements, or physical constraints placed on the embedded realisation, for example in terms of size or power consumption, may be beyond what these can achieve. In such a situation, unless volumes are sufficiently high, the Non-Recurring Engineering (NRE) costs associated with creating a customised Application Specific Integrated Circuit (ASIC) may make that route commercially unviable. For such high performance, low volume DSP systems, the ability of FPGA to host custom computing architectures tailored to the real-time requirements of the application and the physical requirements of the implementation, at relatively low cost, is a key advantage.

The historical perspective of FPGA is as a blank-canvas 'sea' of programmable logic which can host 'dedicated hardware' architectures, i.e. networks of interconnected functional units, each of which can implement only the specific DSP function for which it was created. Indeed, design tools which promote this kind of approach are prominent [1, 5, 11, 15, 16]. In isolation, however, this viewpoint is a naive assumption about the capabilities of state-of-the-art FPGAs, which boast a huge number and variety of resources with which to create custom DSP computing architectures. At the top end these offer around 50 times the programmable Multiply Accumulate (MAC) processing resource of DSP processors¹; they can play host to custom memory structures ranging from large capacity, low bandwidth off-chip storage to small, high bandwidth on-chip Static Random Access Memory (SRAM); and they have a myriad of other physical resources, as outlined earlier, which, when combined with the ability to easily exploit task and data parallelism in the application, offer the potential to outperform fixed architecture, software programmable embedded devices. However, this potential is only realised if the resources are properly harnessed in a manner which is easily accessible to application designers. Proper use of resources, in the context of modern FPGA, means the creation of heterogeneous networks of microprocessors with optional accelerator co-processors [4, 12, 13, 40] and parallel software processing architectures (i.e. multiple datapaths operating under the control of a single control unit) [2, 46], along with the classic dedicated hardware components. The complexity of the architectures to be created, coupled with the requirement to create a new architecture for every new application, makes the FPGA system design problem unique. The keys to success in this process are understanding the
¹ Based on a comparison of the Virtex-6 XC6VSX475T, with an assumed clock rate of 500 MHz, and the Texas Instruments TMS320C6474 DSP.
Fig. 1 Typical FPGA device architecture.
requirements of the algorithm to be implemented, knowing how to efficiently use the various resources of the target device, and having the ability to quickly generate and tune an efficient custom computing architecture. This chapter explores the capabilities of current FPGAs for implementing DSP applications, the types of computing architectures which harness these capabilities best, and some examples of how these systems may be built quickly and efficiently. In Sections 2 and 3, the evolution of FPGA technology and the major processing resources available on state-of-the-art FPGAs pertinent to DSP system implementation are described. Sections 4 and 5 describe how the various FPGA resources may be exploited efficiently using different design styles to implement DSP applications, whilst Section 6 describes a number of design tools and how they enable high productivity design of DSP applications on FPGA.
2 FPGA Technology

2.1 FPGA Fundamentals

The programmability of FPGA is based on two key principles: the use of programmable functional blocks, and programmable interconnect which allows multiple blocks to be connected to form more complex logical functions. The general structure of a classical FPGA device, highlighting the key architectural components, is given in Fig. 1.

The fundamental building block of the FPGA is the Look-Up Table (LUT). A LUT is a Read-Only Memory (ROM) which may be programmed to emulate logical functions by storing the relevant output in the memory location corresponding to the inputs which produce that output. Loading the data into each LUT on the
chip is known as configuring the FPGA. From the earliest incarnations until very recently, these LUTs had four inputs and a single output, and as such could emulate the behaviour of any logic function of the same dimensions or smaller. By connecting LUTs together using the programmable interconnect, networks of LUTs can implement higher dimensionality logic functions. The resulting networks are then connected to the outside world via the programmable pins. These two facilities underpin the programmable nature of FPGA, and successive generations of FPGA devices have contained larger and larger numbers of LUTs to implement larger and larger functions; the latest generations contain up to 70,000 such LUTs. This increased resource has enabled unconstrained exploitation of parallelism in applications by enabling greater numbers of processing units to be hosted on the device (within its physical resource limits), without encountering the communications bottlenecks common in software processing architectures. The designer has access to the functionality of each individual LUT, and as such any architecture implemented on FPGA can be manipulated at the bit level, as opposed to the word-level manipulation predominant in software processing technologies. This, coupled with the ability to add as much parallel processing resource as the physical limits of the target FPGA device allow, means that the programmable fabric of FPGA can play host to high performance, parallel computing architectures for DSP.
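The configure-by-truth-table idea is easy to model in software. The sketch below builds a 4-input LUT as a 16-entry table and 'configures' it with an arbitrary Boolean function; the helper names are illustrative only:

```python
def make_lut4(func):
    """Model a 4-input LUT: a 16-entry truth table indexed by the input bits."""
    table = [func((i >> 3) & 1, (i >> 2) & 1, (i >> 1) & 1, i & 1)
             for i in range(16)]
    def lut(a, b, c, d):
        return table[(a << 3) | (b << 2) | (c << 1) | d]
    return lut

# "Configure" the LUT to emulate a 3-input majority gate (input d unused)
majority = make_lut4(lambda a, b, c, d: int(a + b + c >= 2))
assert majority(1, 1, 0, 0) == 1 and majority(1, 0, 0, 1) == 0
```

However, this resource was augmented in recent generations of FPGA devices, such as the Xilinx Virtex series (and later), with advanced resources to further boost the performance of DSP applications. Section 2.2 outlines the key architectural evolutions in FPGAs which have enabled their use as host devices for high performance DSP systems, and describes the nature of the devices available from the leading FPGA vendors, Xilinx Inc. and Altera Corp.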
2.2 The Evolution of FPGA

Since their invention in the mid-1980s, FPGAs have charted an unconventional evolutionary path, which may be roughly divided into three distinct phases, as described in Sections 2.2.1-2.2.3.

2.2.1 Age 1: Custom Glue Logic

In their original forms, the number of LUT resources on which an FPGA designer could call was severely limited. During this era, FPGAs found a niche as host devices for glue logic, by which multiple separate chips could be connected when their signalling interfaces were inconsistent. Given the relatively simple circuitry that each device could host, gate-level schematic capture of the functionality of the device was sufficiently productive to be viable.
2.2.2 Age 2: Mid-density Logic

With the advance of Moore's law, more and more LUTs could be squeezed onto a single FPGA device. This resulted in new types of FPGA, such as the Xilinx Virtex® and Altera Stratix® series, which could host bit-parallel fundamental DSP operators, such as filters and transforms, on a single chip. Coupled with the easily accessible parallelism available by occupying different parts of the chip, these components could achieve very high performance compared with conventional software processors, and even with specialised DSP processors of the same era. In addition to the increased density of logic that these devices could host, a number of architectural features were introduced specifically to boost the performance of mathematical operations. In particular, the addition of dedicated carry logic alongside the LUTs in the FPGA programmable logic meant that very high speed, bit-parallel adders could be created; further, the introduction of dedicated multiplier components meant that very high speed Multiply-Accumulate (MAC) functionality (a fundamental operation in, for example, FIR/IIR filters, FFTs and DCTs) could be achieved, in turn boosting the performance of these types of operations when implemented on FPGA. The increased complexity of the circuitry implemented on these devices, and the increase in design abstraction level from the gate to the arithmetic component and Register Transfer Level (RTL), meant that Hardware Description Languages (HDLs) such as VHDL and Verilog, specifically developed to support design at these levels, became the standard development languages for FPGA. These were supported by development routes similar to those of ASIC technologies, which convert RTL descriptions to gate-level netlists, including synthesis tools such as Synplify Pro from Synplicity Inc., and device-specific implementation tools such as Xilinx ISE and Altera Quartus which convert the gate-level netlists to programming files for the devices.

2.2.3 Age 3: Heterogeneous System-on-Programmable-Chip

With the emergence of further generations of FPGA, a marked divergence in system architecture and design style became obvious. On these devices, it became possible to implement multiprocessor architectures composed of units such as the Xilinx MicroBlaze™ [12] softcore processor (i.e. a processor implemented using the FPGA's programmable fabric) or hardcore processors embedded within the device silicon, distinct from the programmable fabric, such as the PowerPC processor on the Xilinx Virtex®-II Pro [13]. Coupled with the introduction and expansion of other silicon embedded units, such as Block RAMs (BRAMs), gigabit serial transceivers, MAC datapaths, and network endpoints for rapid integration into computer networks via standards such as Ethernet or PCI Express, FPGA architectures have diversified to include a wide range of heterogeneous components which may be used in different ways to implement DSP functions.
Table 1 Current Xilinx FPGA offerings

Family       Member   Application
Virtex®-5    LX       High performance logic
             LXT      High performance logic, serial connectivity
             SXT      DSP applications, serial connectivity
             FXT      Embedded processing, serial connectivity
             TXT      Ultra-high bandwidth communications
Spartan®-3   3A       Mainstream applications
             3AN      Non-volatile applications
             3A DSP   DSP applications
Furthermore, today's devices are not based on a single structure: each generation of devices is subdivided into families, with each family exhibiting different features depending on the intended application domain. Consider the current range of Xilinx FPGA offerings, outlined in Table 1. As this shows, there are generally two grades of device: the higher performance Virtex® devices for more substantial applications, and the lower-cost Spartan® devices for smaller applications. These two classes are then subdivided into families depending on the desired target application; hence Virtex® has specialised offerings in the embedded, DSP, and high bandwidth processing areas. Each of these families also offers various sizes of device, characterised by the relative mix of processing features they exhibit. For instance, the Virtex® FXT device contains up to two PowerPC processors, twenty-four multi-Gbit serial transceivers, four PCI Express endpoints and eight Ethernet MAC controllers, to enable highly efficient, rapid construction of networked embedded custom computing architectures. In the DSP application domain, the Virtex® and Spartan® series offer the SXT and 3A DSP devices respectively. Instead of offering multiple families of the same grade of FPGA, Altera divide their FPGA range into three grades and offer various sizes of each device. The range of capabilities extends from the Stratix® device at the high end, through the Arria® mid-range device, to the Cyclone® low-end devices. Despite their differing origins, however, the features exhibited by these devices to support DSP applications are, conceptually, remarkably similar. These are described in the context of the high-end DSP device offerings from Xilinx and Altera, the Virtex®-5 and Stratix®-IV respectively, in Section 3.
3 State-of-the-art FPGA Technology for DSP Applications

There are three types of resource on current FPGA most relevant to DSP system implementation: programmable logic, DSP-datapath, and memory. Sections 3.1-3.3 describe the characteristics of these features, and how they may be efficiently exploited for DSP system implementation.
Fig. 2 Virtex®-5 FPGA Programmable Fabric Architecture [17]: (a) Virtex-5 CLB structure; (b) simplified Virtex-5 slice (upper half)
3.1 FPGA Programmable Logic Technology

3.1.1 Virtex®-5 Configurable Logic Block

Virtex® series FPGA are primarily constructed of Configurable Logic Blocks (CLBs). The structure of a CLB is outlined in Fig. 2(a). As this shows, CLBs are composed of two slices, each of which contains four Logic Function Generator (LFG) units, as shown in Fig. 2(b) (only the upper half of the slice is shown; the given structure is replicated to create the full slice architecture). In Virtex®, Virtex®-II and Virtex®-II Pro technologies the LFG units had four inputs and one output, but this increased to five inputs and one output in Virtex®-4, and further to six inputs and two outputs in Virtex®-5 FPGA [17]. The basic slice architecture of Virtex®-5 contains four sets of: a single 6-input LFG; a storage element (SE) for registering the data emanating from the LFG; and fast carry and switching logic which, coupled with the cout and cin ports on adjacent sets and slices, can be used to cascade sets to implement high performance bit-parallel adders. Whilst the architecture of these slices seems rather simple, there is in fact considerable underlying complexity. The LFGs themselves may operate in one of three modes: as a LUT, which can implement any logic function of up to six inputs and two outputs; as a 64 bit distributed RAM (DisRAM); or as a 32 element shift register (SRL). In DisRAM mode a single LFG may implement a 64 bit DisRAM (Fig. 3(a)), and numerous LFGs may be combined to implement larger or higher bandwidth RAMs if so desired, as outlined in Fig. 3(b), which shows a 1 bit wide, 64 element dual-port DisRAM. Larger bandwidth DisRAMs may be created by using numerous single-bit RAMs in parallel. In SRL mode, each LFG may implement an addressable shift register of up to thirty-two elements in length, and multiple LFGs are cascadable for implementation of longer shift registers, as outlined in Fig. 4.
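The SRL mode is easy to picture in software terms. The sketch below is a behavioural C model of a single 32 element, 1 bit SRL; the type and function names are illustrative rather than any vendor API.

```c
#include <stdint.h>

/* Behavioural model of an LFG in SRL mode: a 32-element, 1-bit
 * addressable shift register. shift() pushes one new bit in per clock;
 * read() taps any of the 32 stored bits without disturbing the shift. */
typedef struct { uint32_t bits; } srl32;

static void srl32_shift(srl32 *s, unsigned din)
{
    s->bits = (s->bits << 1) | (din & 1u);
}

static unsigned srl32_read(const srl32 *s, unsigned addr) /* addr 0..31 */
{
    return (s->bits >> addr) & 1u;
}
```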
Fig. 3 Virtex®-5 FPGA LFG RAM Configuration Examples: (a) Virtex®-5 64 element DisRAM; (b) Virtex®-5 64 element dual-port DisRAM
3.1.2 Stratix®-IV Adaptive Logic Module

Whilst the fundamental programmable component of the Xilinx Virtex® range of FPGA is the CLB, in the case of Altera Stratix® FPGA it is the Logic Array Block (LAB) [6]. Each of these is composed of ten Adaptive Logic Modules (ALMs). These modules, a simplified version of which is shown in Fig. 5, are based around 6-input LUT components, with a total of eight inputs to the two LUTs. Each ALM can implement up to two combinational functions.
Fig. 4 Virtex®-5 FPGA Slice SRL

Fig. 5 Stratix®-IV Adaptive Logic Module (ALM)
An ALM also includes two programmable registers, two dedicated full adders, a carry chain, a shared arithmetic chain, and a register chain. Each ALM can operate in five modes, each of which uses the variety of ALM resources differently.

1. Normal Mode: suitable for general logic applications and combinational functions; eight data are input to the combinational logic, which implements two functions or a single function of up to six inputs.
2. Extended LUT: suited to implementing HDL 'if-else' conditional functions; a specific set of seven-input functions may be implemented, where the seven inputs are composed of a 2-to-1 multiplexer fed by two arbitrary five-input functions.
3. Arithmetic: the ALM is used to implement adders, counters, accumulators, wide parity functions and comparators; two sets of two 4-input LUTs are used, along with two dedicated full adders, which allow the LUTs to perform pre-adder logic.
4. Shared Arithmetic: each ALM implements a 3-input addition, where each of two LUTs implements either the sum or the carry of three inputs; this mode is designed for multi-operand addition since it reduces the number of summation stages by increasing the width of each stage (a bitwise sketch of this idea follows the list).
5. LUT-Register: provides a third register capability in the ALM.
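The shared arithmetic mode is, in essence, a 3:2 compressor: three operands are reduced to a sum word and a carry word, each of which is a simple bitwise function computable by one LUT per bit position. A conceptual C sketch of this reduction, not a description of the actual netlist:

```c
#include <stdint.h>

/* 3:2 compression: at every bit position, one LUT computes the sum bit
 * (XOR of the three inputs) and another the carry bit (majority). The
 * invariant a + b + c == sum + carry (modulo word length) means one
 * final adder stage completes a multi-operand addition. */
static void compress_3to2(uint32_t a, uint32_t b, uint32_t c,
                          uint32_t *sum, uint32_t *carry)
{
    *sum   = a ^ b ^ c;                           /* per-bit sum */
    *carry = ((a & b) | (a & c) | (b & c)) << 1;  /* per-bit carry, shifted */
}
```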
3.2 FPGA DSP-datapath for DSP

The FPGA programmable fabric offers easy access to simple, high performance arithmetic, storage and logic components. However, implementing more complex arithmetic components (for example multipliers) in the programmable logic is much more resource expensive, consuming many LUTs and other resources. To address this, with the advent of more recent FPGA devices such as Xilinx's Virtex®-II, hardcore versions of these components were integrated on the FPGA silicon, and could be embedded almost seamlessly in circuit architectures alongside programmable-logic-based components.
Fig. 6 Simplified Virtex®-5 DSP48E Slice
This marked a notable step forward in the use of FPGA for DSP, since these previously resource expensive operators were now available with higher performance and lower cost than before. In the latest generations of FPGA, this has gone a step farther with the integration of specialised DSP-datapath units. The nature of this facility is outlined in Sections 3.2.1 and 3.2.2.

3.2.1 Virtex®-5 DSP48 Slices

Virtex®-5 FPGA host multiple hardcore DSP-datapath units known as DSP48E slices. The architecture of one such slice is shown in Fig. 6 (note that configurable pipeline registers and carry logic for cascading multiple DSP48E slices have been omitted for simplicity), whilst Table 2 gives the number of DSP48E slices in each member of the Virtex®-5 family [18]. The basic architecture and operation of this component is clearly that of a MAC. This fundamental DSP operation is popularly employed in the datapath of DSP processor architectures, and their performance is frequently benchmarked upon it [20]. However, a DSP48E is rather more capable than simple MAC operations, supporting:
• 25 × 18 bit pipelineable multiplication
• 48 bit addition/subtraction
• triple and limited quad addition/subtraction
• dedicated bitwise logic
• barrel shifting
• wide bus multiplexing
• dedicated overflow/underflow support
• pattern detection
• wide counters
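In its basic configuration, the datapath of Fig. 6 computes a multiply-add. A simplified behavioural C sketch, ignoring the OPMODE/ALUMODE selection, pipelining and cascade paths (operand widths follow the 25 × 18 multiplier and 48 bit adder listed above):

```c
#include <stdint.h>

/* Basic DSP48E multiply-add behaviour: a 25 x 18 bit signed multiply
 * (43 bit product) feeding a 48 bit add with the C operand. The real
 * slice selects among many more configurations via OPMODE/ALUMODE;
 * in hardware the result is kept to 48 bits. */
static int64_t dsp48e_madd(int32_t a25, int32_t b18, int64_t c48)
{
    int64_t p = (int64_t)a25 * (int64_t)b18;  /* 25 x 18 -> 43 bit product */
    return p + c48;
}
```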
Table 2 Virtex®-5 FPGA DSP48E Constitution

Device   5VSX35T  5VSX50T  5VSX95T
DSP48E   192      288      640
Fig. 7 Stratix®-IV DSP Module Architecture: (a) DSP module structure; (b) half-module structure
The DSP48E essentially operates as a programmable Arithmetic Logic Unit (ALU), with the OPMODE input controlling the overall operational mode of the slice, and the ALUMODE input controlling the operation of the add/bitwise logic component. The 48 bit addition is divisible into sub-operations (two 24 bit additions, or four 12 bit additions) which operate in parallel.

3.2.2 Stratix®-IV DSP Module

The equivalent of the DSP48E in Stratix®-IV is the DSP module. These are composed of two identical, independent half-modules, each of which has four 18 × 18 multipliers, as shown in Fig. 7. Table 3 outlines the number of DSP blocks, and the equivalent number of independent 9, 12, 18 and 36 bit multiplications which may be realised using the modules, for each member of the Stratix®-IV device family. Each DSP block can operate in one of six basic modes, based around various configurations of the MAC function fundamental to DSP operations, the classes being [6]:
Table 3 Stratix®-IV FPGA DSP Block Constitution

Device    DSP Blocks  9×9 Mult.  12×12 Mult.  18×18 Mult.  18×18 Complex  36×36 Mult.
EP4SE110   64          512        384          256          128            128
EP4SE230  161         1288        966          644          322            322
EP4SE290  104          832        624          416          208            208
EP4SE360  130         1040        780          520          260            260
EP4SE530  128         1024        768          512          256            256
EP4SE680  170         1360       1020          680          340            340
1. Independent Multiplier: parallel multiplication configurations, varying from eight 9 bit multiplications to two 36 bit multiplications in a single DSP module.
2. 2-Multiplier Adder: four sets of two 18 bit MAC operations per block; useful for processing complex-valued DSP operations such as the FFT or complex FIR/IIR filters (see the sketch after this list).
3. 4-Multiplier Adder: two sets of four 18 bit MAC operations per block; useful when implementing one- and two-dimensional filtering applications.
4. Multiply Accumulate: the second stage adder is used as a 44 bit accumulator or subtractor.
5. Shift: two shifts of up to 36 bits per block, including arithmetic and logical shifts in both left and right directions, or a 32 bit rotator or barrel shifter.
6. High Precision Multiply-Add: two 18 × 36 bit multiply-adds per block, for use when high precision filtering is required.

In addition, the Stratix®-IV DSP block has dedicated rounding and saturation logic after the second stage adder, and the pipelining level of the device is configurable.
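The fit between the 2-multiplier adder mode and complex arithmetic is clear from the arithmetic itself: each component of a complex product is a sum or difference of two real products, so one multiplier pair produces the real part and another the imaginary part. A plain C sketch of this decomposition:

```c
#include <stdint.h>

/* Complex multiply (a + jb)(c + jd): the real part a*c - b*d and the
 * imaginary part a*d + b*c each map onto one multiplier-pair-plus-adder
 * of the DSP block's 2-multiplier adder mode. */
static void cmul(int32_t a, int32_t b, int32_t c, int32_t d,
                 int64_t *re, int64_t *im)
{
    *re = (int64_t)a * c - (int64_t)b * d;
    *im = (int64_t)a * d + (int64_t)b * c;
}
```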
3.3 FPGA Memory Hierarchy in DSP System Implementation

Real-time implementation of DSP applications requires careful architecture design, such that the datapath components which perform the mathematical calculations offer high enough computational capacity, and such that these have sufficient input and output data bandwidth to keep them fed with data and continuously operating. This situation becomes particularly significant when implementing image processing applications on FPGA, where storage of entire frames of image data on-chip is sometimes not feasible [7], whilst storage in off-chip memory incurs significant data bandwidth and synchronisation problems. To help alleviate this issue, recent generations of FPGA device exhibit a distinct hierarchy of memory storage, ranging from small, high bandwidth DisRAM-like units for data storage close to the datapath, to low bandwidth, bulk off-chip storage mechanisms. The general structures of these hierarchies in Virtex® and Stratix® FPGA are outlined in Sections 3.3.1 and 3.3.2.

3.3.1 Virtex® Memory Hierarchy
In addition to the DisRAM configuration option of the Virtex® LFG outlined in Section 3.1.1, the latest generation of Xilinx FPGA have a three-tier memory hierarchy similar to that found in a conventional microprocessor, except that in FPGA the absolute storage size and bandwidth at each level of the hierarchy is controlled by the designer, within physical capacity limits. On the chip, the DisRAM resource offered by the programmable logic is augmented by a series of dedicated, hardwired BRAM components [19].
Fig. 8 Virtex® BRAM-based FIFO Architecture [19]
Table 4 Stratix®-IV Memory Hierarchy

Level   bits/block  No. blocks  Suggested use
MLAB    640         6750        Shift registers, FIFO buffers, filter delay lines
M9K     9,216       1144        General purpose
M144K   147,456     48          Processor main memory, image buffers
In Virtex® FPGA these take the form of a number of 36 Kb RAMs, each of which can be configured to implement a dual-port 36 Kb BRAM, or two dual-port 18 Kb BRAMs. Configuration of the memory may take either form without resorting to use of the programmable routing resource of the FPGA device; further, two BRAMs may be cascaded to implement a 64 Kb BRAM, again without accessing the FPGA routing resource. In addition, since the streaming nature of signal processing applications means that First-In-First-Out (FIFO) queues offer a simple, highly effective data buffering scheme, Virtex® BRAMs may be configured for use in this mode, with extra logic integrated for the purpose, including counters, comparators and status flag generation logic. Each BRAM can implement one FIFO. The architecture of such a FIFO, which does not require programmable fabric resource, is shown in Fig. 8.
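Functionally, the Fig. 8 arrangement is a circular buffer: the BRAM stores the data, two counters act as read and write pointers, and the flag logic compares them. A behavioural C sketch, with depth and data width chosen arbitrarily for illustration:

```c
#include <stdint.h>

#define FIFO_DEPTH 512

/* Circular-buffer model of a BRAM-based FIFO: 'mem' plays the BRAM,
 * waddr/raddr the write and read pointers, and 'count' drives the
 * full/empty status flags. */
typedef struct {
    uint32_t mem[FIFO_DEPTH];
    unsigned waddr, raddr, count;
} fifo;

static int fifo_write(fifo *f, uint32_t din)
{
    if (f->count == FIFO_DEPTH) return -1;      /* full flag asserted */
    f->mem[f->waddr] = din;
    f->waddr = (f->waddr + 1) % FIFO_DEPTH;
    f->count++;
    return 0;
}

static int fifo_read(fifo *f, uint32_t *dout)
{
    if (f->count == 0) return -1;               /* empty flag asserted */
    *dout = f->mem[f->raddr];
    f->raddr = (f->raddr + 1) % FIFO_DEPTH;
    f->count--;
    return 0;
}
```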
3.3.2 Stratix® Memory Hierarchy

The Stratix® FPGA family also has a well defined memory hierarchy, known as TriMatrix memory [6]. The hierarchy moves from small, high bandwidth MLAB blocks, placed close to the datapath, through larger M9K blocks, to M144K blocks, all of which are on-chip. The structure of each level, and the intended uses, are outlined in Table 4.
3.4 Discussion

In this section we have surveyed the three main architectural aspects of FPGA devices at a designer's disposal for implementation of DSP systems. Strong similarities are evident across these devices, in the form of distinct programmable logic, DSP-datapath and memory hierarchy resources in both Xilinx and Altera FPGAs. At first glance, FPGA appear to offer a 'blank canvas' for implementation of DSP applications; to make best use of the resource, however, this view is misleading. The various parts of the device are exceptionally efficient at implementing specific architectures, but their efficiency decreases once the natural bounds of the resources are breached. For instance, consider the use of DSP48E slices in Virtex®-5 FPGA. These offer a highly efficient way to implement MAC-based DSP operations of up to 18 × 25 bit multiplication and 48 bit addition in a single slice, but once these wordsize bounds are exceeded, numerous DSP48E slices are required to make up the deficit [18], severely restricting the efficiency with which these resources may be exploited for the same functionality. Similarly, whilst the Virtex®-5 FPGA programmable fabric is highly efficient at implementing single-port DisRAM components (64 bits/LUT), this efficiency is much reduced when multi-port RAMs are required. A similar scenario exists for the dual-port BRAM components: unless the required memory bandwidth is no more than dual-port, the efficiency of implementation is much reduced. To best exploit the fabric of the particular FPGA being targeted, these constraints must be explicitly taken into account during system architecture design [31]. Despite these constraints, the differing resources offer the potential to implement the same function in different ways on the same device. Consider, for instance, the implementation options for a hypothetical filter component. In the first, the programmable fabric could be used; it is suited to implementing this type of architecture due to the ease of implementation of adders and constant coefficient multipliers [44], and the ability to utilise its rich register resource for datapath pipelining [29] to increase the throughput of the architecture. Alternatively, given the ease of implementation of MAC operations in either vendor's DSP datapaths, it is easy to see how these may be exploited for implementation of such architectures. The choice of implementation strategy is subject to the real-time requirements of the application and the relative constraints of the target FPGA device. For instance, the choice between programmable logic and DSP datapath for the hypothetical filter in any particular system depends on which other components are present in the system and how they use the resources: if a 32,000 point FFT is present and uses all of the DSP48E resources, then a programmable fabric-based implementation is chosen; otherwise it may be more suitable to implement the filter using DSP48E slices. There is no general 'best-fit' solution to implementing specific operations on an FPGA. However, the differing resources hint at differing possible processing styles for parts of the system targeted towards different device resources. In Sections 4 and 5 we outline different techniques used
to implement DSP systems using the various device resources, whilst in Section 6 we review design tool technology which automates these design approaches.
4 Core-Based Implementation of FPGA DSP Systems

4.1 Strategy Overview

The programmable logic in FPGA offers the most mature facility for logic implementation, having gradually evolved from the LUT-based implementation philosophy on which FPGA was founded. As a result, strategies for creating dedicated processing architectures (known here as cores) to implement DSP functions are the most mature of any FPGA-based DSP implementation strategy. Given the abundance of register resource and the ability to easily implement simple arithmetic components in the programmable logic, it follows naturally that fine-grained pipelined core architectures are a particularly good match with this host resource [44]. The availability of hardcore arithmetic components, such as multipliers, in later generations of FPGA only serves to promote this style by offering high performance, low cost implementations of previously resource expensive mathematical operators. The resulting abundance of bit-parallel pipelined MAC functionality enables high performance cores to be created for fundamental DSP operators such as FIR/IIR filters, FFTs or DCTs. This style of core design has become further ingrained with the advent of DSP48E slices in Virtex® and DSP modules in Stratix® FPGA, which offer another, even higher performance mechanism for building the same types of core architecture. Fig. 9 shows how an array of DSP48E slices may be used to implement a multichannel FIR filter in Virtex®-5. The longstanding suitability of FPGA for hosting such high performance architectures has promoted much interest in its use as a host for systolic array type core architectures, since these depend on fine-grained functional units with deep pipelines, a natural fit with FPGA's capability [44]. The variety of operations implemented in this fashion is huge, but mostly focuses on standard DSP operations such as filters and transforms, and extends to matrix computations, for example [42, 47, 48]. Furthermore, this style of architecture design enables powerful automatic core synthesis tools to be created. These convert Signal Flow Graph (SFG) algorithms automatically to systolic cores, using graph transformation techniques such as folding and unfolding to balance throughput with resource [29, 45]. Given FPGA's suitability as a host for high performance core architectures, the device vendors themselves provide tools to automatically generate cores for varying real-time and resource requirements; tools such as Xilinx's Core Generator and Altera's MegaCore IP Library provide easy access to these cores, and promote a design reuse strategy for FPGA-based DSP system design. Further, this graph-based design style for rapid generation of custom core architectures is actively adopted in a number of commercial design tools, as outlined in Section 4.2.
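To make the MAC-cascade structure of Fig. 9 concrete, the following C sketch models one sample period of a transposed-form FIR filter, in which each tap (one multiply feeding one add into a running partial sum) corresponds naturally to one DSP48E of the cascade. Tap count and word lengths are illustrative.

```c
#include <stdint.h>

#define TAPS 4

/* One sample of a transposed-form FIR: y[n] = sum of h[i]*x[n-i].
 * z[] holds the TAPS-1 partial sums carried between samples (the
 * cascade registers) and must be zero-initialised before the first
 * call. Each multiply-add below is one DSP48E of Fig. 9. */
static int64_t fir_transposed(int32_t x, const int32_t h[TAPS],
                              int64_t z[TAPS - 1])
{
    int64_t y = z[0] + (int64_t)h[0] * x;
    for (int i = 0; i < TAPS - 2; i++)
        z[i] = z[i + 1] + (int64_t)h[i + 1] * x;
    z[TAPS - 2] = (int64_t)h[TAPS - 1] * x;   /* last tap: no carry-in */
    return y;
}
```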
Fig. 9 DSP48E-based Multichannel FIR [18]
4.2 Core-based Implementation of DSP Applications from Simulink

Since simple cores are readily available in libraries such as Core Generator and the MegaCore IP Library, there is considerable potential for graph-based design tools which automatically compose them into more complex custom cores. For instance, Synplicity Inc.'s Synplify DSP allows a user to automatically convert a DSP function, described as a MATLAB Simulink model, to a custom core. This tool can automatically pipeline and retime the graph to produce pipelined and multi-channel versions of a corresponding core. Xilinx and Altera use their System Generator and DSP Builder design tools respectively to enable a similar process, easing and promoting the uptake of their Core Generator and MegaCore IP Library cores. In both cases there is a tight coupling between the core behaviour, specified as a Simulink graph, and the resulting core architecture: a node on the Simulink graph translates to a component in the resulting circuit, implemented using a core exported from the vendor repository, and cores are connected in a topology identical to that of the Simulink graph. A generalised version of the process used by both these tools is shown in Fig. 10 [5, 16]. Behavioural models of the library cores are used to construct a Simulink .mdl model of the behaviour. After verification of the functionality, implementation-specific models of the cores, addressing physical characteristics such as wordlengths and core timing, are integrated into the Simulink model, which must then be manually manipulated to produce the same functionality taking account of these physical factors. This process allows verification that the functionality of the core matches that of the original algorithm. Once this is verified, the graph is exported
Fig. 10 System Generator/DSP Builder Development Flow
to an RTL HDL description and passed to the vendor-specific programming tools (Xilinx ISE and Altera Quartus-II). If a development platform exists, it may be integrated into the Simulink environment. A new Simulink graph, which only creates the source stimulus for the system implemented on the FPGA, is then automatically generated. When the FPGA on the platform is programmed using the automatically generated programming file, Hardware-in-the-Loop (HIL) testing ensues: the host graph generates the stimulus data and communicates it to the FPGA, which processes it and passes the result back to the host, allowing the functional equivalence of the FPGA implementation with the original algorithm to be verified. This kind of tightly coupled approach allows rapid development of custom core architectures from Simulink graphs, and HIL testing of the core, but neither System Generator nor DSP Builder exhibits advanced synthesis features such as auto-pipelining, retiming, multi-channelisation or graph folding apparent in other tools [45].
4.3 Other Core-based Development Approaches

In addition to these device vendor-specific approaches, a number of other approaches exist for rapid implementation of DSP cores and core networks. LabVIEW FPGA enables an application designer targeting National Instruments FPGA hardware to rapidly implement their LabVIEW algorithms as cores in this fashion. Approaches such as Handel-C, AccelChip and Synfora, on the other hand, start with a more conventional software programming language with the aim of generating custom cores. In Handel-C [11], the parallel semantics of Communicating Sequential Processes [10] are combined with a subset of ANSI-C to allow application designers to
describe the functionality of DSP operations with explicit parallelism, and translate the program to a gate-level netlist for implementation on FPGA. In AccelChip [1], a MATLAB description of the behaviour is automatically converted to a core, with the opportunity for the designer to optimise the implementation by applying loop transformations to the MATLAB source. Similar techniques are employed by Synfora Inc.'s PICO FPGA toolset [15], which allows the user to specify the functionality in C and generate a selection of cores from which they may choose the most suitable option. These types of approach enable rapid generation of custom DSP cores, and by connecting numerous cores together, entire systems may be implemented. Alternatively, the cores may be used in a different fashion: as accelerators for bottleneck regions of a program executing on a software processor. The ability of FPGA to host such a programmable processor-centric system design style is examined in Section 5.
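The style of source such C-to-gates tools accept is worth illustrating: typically a loop nest with static bounds and simple array accesses, from which a pipelined or unrolled datapath can be inferred. The function below is a hypothetical input sketch, not tied to the input rules of any one of the tools named above.

```c
#include <stdint.h>

/* A dot product with a fixed trip count: the loop can be fully
 * unrolled into 64 parallel multipliers, or pipelined as one MAC,
 * depending on the throughput/resource trade-off requested. */
void dotprod64(const int16_t a[64], const int16_t b[64], int32_t *y)
{
    int32_t acc = 0;
    for (int i = 0; i < 64; i++)
        acc += (int32_t)a[i] * b[i];
    *y = acc;
}
```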
5 Programmable Processor-Based FPGA DSP System Design

Devices such as the Virtex®-II Pro, which includes embedded PowerPC processor components, and the ability to use softcore processor architectures such as MicroBlaze™, present the possibility of creating Multiprocessor Programmable SoC (MPSoC) architectures on any FPGA [25]. Further, the unique ability of FPGA to allow such general purpose processors to be extended with custom datapath co-processors offers an excellent platform for developing application-specific MPSoCs, composed of multiple processors which maintain their software programmability yet offer higher performance for domain specific applications through co-processor based extension [35, 36]. Indeed, FPGA microprocessors have inbuilt features to allow such easy extension. Furthermore, since FPGA can easily host massively parallel processing architectures and have abundant on-chip programmable MAC datapath resources, they can play host to parallel programmable architectures, such as vector or SIMD processing styles [8], with the added ability to tailor the size and nature of the processor in an application-specific manner [2, 21, 32, 46]. All these options offer software programmability as a design approach for implementing DSP systems, rather than the dedicated hardware design styles described in Section 4. This section explores the various options for creating or extending software processing architectures for FPGA-based DSP; beginning by highlighting how the main commercial FPGA processors can be easily extended with custom datapath co-processors, a number of significant examples of this approach and of massively parallel processor design are described.
Fig. 11 Virtex®-5 FXT PowerPC 440 Device [14]
5.1 Application Specific Microprocessor Technology for FPGA DSP

5.1.1 Hardcore FPGA-based Processors for DSP Systems

The Xilinx Virtex®-II Pro marked a watershed in FPGA technology by hosting PowerPC 405 processors directly on the device silicon, augmenting the programmable fabric with on-chip, software-based system control. These processors persist on the latest generations of Virtex®-5 FXT devices, which host PowerPC 440 processors [14]. Fig. 11 describes a generalised architecture of this processor. As Fig. 11 shows, the core PPC440 module is connected to the rest of the FPGA by a variety of communications interfaces. A configurable crossbar switch provides DMA-based access to a memory controller for the main processor memory, and a number of Processor Local Bus (PLB) interfaces are included for access to this bus. Of most interest, however, is the presence of the Auxiliary Processor Unit (APU) controller, accessed through the FPGA Fabric Coprocessor Module (FCM) interface. This controller and interface are specifically included to allow tighter integration of a co-processor module, implemented in the FPGA programmable fabric or DSP-datapath, with the processor pipeline than is possible using the standard PLB interface.
Fig. 12 MicroBlaze™ Reconfigurable Processing Architecture: (a) MicroBlaze softcore processor architecture [12]; (b) FSL-based MicroBlaze coprocessing [2]
5.1.2 MicroBlaze™-Centric Reconfigurable Processor Design

The MicroBlaze™ softcore processor is a configurable Reduced Instruction Set Computer (RISC) for Xilinx FPGAs. The architecture of this processor is outlined in Fig. 12(a). As this shows, MicroBlaze™ is significantly more flexible than the PowerPC 440 described in Section 5.1.1. It is configurable in a number of ways, with the configurable parts of the architecture shown in grey in Fig. 12(a): the Memory Management Unit (MMU), the instruction and data cache units, and a number of application specific datapath units. Of particular significance is the configurable use of Fast Simplex Links (FSLs). A MicroBlaze™ can host up to 16 FSLs, which are dedicated, point-to-point, unidirectional data streaming interfaces, offering a low-latency interface directly into the processor pipeline for external co-processors. A simple schematic of how such a co-processor interfaces to the MicroBlaze™ using FSLs is given in Fig. 12(b).
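From software, the FSLs are driven by dedicated instructions, exposed in C through macros in the Xilinx embedded toolchain (the fsl.h header). The fragment below sketches a blocking round-trip through a co-processor on FSL channel 0; the channel number, transfer pattern and function name are illustrative assumptions.

```c
#include <fsl.h>   /* Xilinx EDK FSL access macros */

/* Stream n words into an FSL-attached co-processor and read n results
 * back. putfsl()/getfsl() are blocking: the processor stalls if the
 * channel is full (write) or empty (read). */
void fsl_round_trip(const int *src, int *dst, int n)
{
    for (int i = 0; i < n; i++)
        putfsl(src[i], 0);    /* push operand into FSL channel 0 */
    for (int i = 0; i < n; i++)
        getfsl(dst[i], 0);    /* pull result from FSL channel 0 */
}
```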
Fig. 13 Nios Processor and Subsystem Architecture [4]

Table 5 Nios C2H Synthesis Rules

Command Type         Example            Mapping                       Note
Arithmetic/Logical   result += a + b;   Hardware block                Pipelined
Iteration            do(), for()        State machine + control
Selection            if-else            Multiplexer + control
Local Variables                         Accelerator internal memory   Register/MLAB
Global Variables                        Avalon MM Master port
Memory Accesses      *ptr = 8;          Avalon MM Master port         Pointers, arrays, structs, unions
5.1.3 Nios-Centric Reconfigurable Processor Design

The Altera Nios II softcore processor exists as part of an extended subsystem architecture, shown in Fig. 13. As this shows, custom hardware accelerators may be integrated into the Nios memory subsystem using the Altera Avalon Switch Fabric [3]. Altera also offer a simple mechanism for quickly developing new co-processors which fit into this subsystem architecture. Altera's Nios II C2H Compiler [4] allows a user to describe the functionality of the system in Nios II compatible C code, and offload critical parts of the code to a co-processor, which is then automatically created and inserted into the subsystem of Fig. 13. A simple set of rules, outlined in Table 5, is used to convert C constructs to a hardware architecture. This one-to-one mapping style makes the method of constructing the co-processor architecture explicit to the designer, allowing them to manipulate the C program to optimise it. The C2H Compiler also provides rudimentary architectural optimisations, in particular automatically pipelining and scheduling loops, and automatically removing
Avalon master ports when several connect to the same memory block, merging the references and scheduling the accesses internally.
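A hypothetical function of the shape these rules handle well is sketched below: the loop becomes a state machine, the accumulation a pipelined hardware block, the if-else a multiplexer, and the pointer dereferences Avalon MM master ports.

```c
/* Sum of absolute differences over two buffers: a typical C2H offload
 * candidate. Function and variable names are illustrative. */
int sum_abs_diff(const unsigned char *a, const unsigned char *b, int n)
{
    int acc = 0;                    /* -> accelerator-internal register */
    for (int i = 0; i < n; i++) {   /* -> state machine + control */
        int d = a[i] - b[i];        /* -> Avalon MM master port reads */
        acc += (d < 0) ? -d : d;    /* -> multiplexer + pipelined adder */
    }
    return acc;
}
```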
5.2 Reconfigurable Processor Design for FPGA

5.2.1 Molen

The Molen 'polymorphic' processor [40] also focuses on accelerating critical tasks via the use of custom cores, but rather than attempting to couple the accelerator core into the pipeline of the General Purpose Processor (GPP), it considers the accelerator core to be a peer component in a larger computing machine composed of the core and the GPP. An overview of the Molen machine architecture is given in Fig. 14. Operating under a single thread of control, the core to be integrated (HW in Fig. 14) is wrapped in a Control and Communications Unit (CCU) wrapper, and a microcoded program is generated which fits a restricted instruction subset template defined for the Molen machine [40]. The Arbiter and Memory Mux units arbitrate instruction and data fetches to/from the main memory between the GPP and the HW. All code is executed on the GPP, except operations tagged for implementation using the core, with the resulting communication between GPP and HW occurring via the exchange registers in Fig. 14. In this arrangement, the design problem extends beyond using an extensible instruction set architecture for the GPP, to include generation of custom instructions and wrapper logic for any arbitrary core such that it fits into the Molen architecture [26]. The Molen architecture reports impressive speed-ups for DSP and image processing applications, with MPEG-2 application speedups by a factor of up to three reported for a Virtex®-II Pro FPGA using a PowerPC 405 GPP [26, 40].
Fig. 14 Molen Machine Architecture
Fig. 15 VIPERS Vector Processor Architecture [46]: (a) VIPERS softcore vector processor; (b) VIPERS vector lane

Table 6 VIPERS Processor Configuration Options

Parameter     Description                                 Range
NLane         Number of vector lanes                      4-128
MVL           Maximum vector length                       16-512
VPUW          Processor data width                        8/16/32
MemWidth      Memory interface width                      32-128
MemMinWidth   Minimum accessible data width in memory     8/16/32
5.2.2 VIPERS

The application-specific datapath route to accelerating DSP applications considered thus far is only one feasible option. The vast amount of simple on-chip programmable DSP-datapath resource (see Section 3.2) has driven attempts to accelerate critical functions by exploiting highly parallel software processing architectures built from these units. VIPERS is one such approach [46]. VIPERS is a softcore vector processor, the architecture of which is outlined in Fig. 15(a), with the architecture of each vector lane of the processor shown in Fig. 15(b). Each vector lane contains an ALU (supporting arithmetic and logical operations, maximum/minimum, merge, absolute value, absolute difference and comparison operations), as well as a barrel shifter and a DSP module-based multiplier. The architecture of VIPERS is configurable in a number of aspects, which are outlined in Table 6.
The ability to vary the number of independent vector lanes and the maximum vector length (via the MVL parameter) provides scalable data-level parallelism. Furthermore, the presence of optional MAC and shift chains in the architecture (the former of which exploits the DSP module resource) allows acceleration of critical functions in a manner well matched to the physical chip resource.
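The kind of loop that benefits is one whose iterations are independent, so that the hardware can process NLane elements per cycle and up to MVL elements per vector instruction. A plain C statement of such a kernel (a hypothetical example; vector code would strip-mine the loop across the lanes):

```c
#include <stdint.h>

/* Scale-and-accumulate over independent elements: each iteration maps
 * onto one vector lane's multiplier (DSP module) and ALU, with MVL
 * elements handled per vector instruction. */
void vec_scale_add(int16_t *y, const int16_t *x, int16_t k, int n)
{
    for (int i = 0; i < n; i++)
        y[i] = (int16_t)(y[i] + k * x[i]);
}
```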
5.3 Discussion

The huge array of implementation options available to the DSP designer targeting FPGA provides the ability to implement a system by exploiting a variety of processing architectures and styles unparalleled elsewhere in the embedded computing domain. With options spanning the two extremes of dedicated hardware implementation and MPSoC implementation, the ideal is perhaps to find the right balance between the two, such that the high performance of dedicated hardware and the flexibility of software may be combined. To make best use of the FPGA resource, a balance must be struck between programmable logic, hardcore or softcore processors and DSP-datapath units, based on the specific requirements of the application and the constraints of the target device. Modern FPGA are a diverse mix of processing resources and, as Section 4 and this one have described, harnessing these resources requires complex architectures consisting of multiple heterogeneous software processing architectures and cores. Accordingly, the process of DSP system design for FPGA is a complex hardware/software co-synthesis, integration and design space exploration problem. This alone makes FPGA system design complex enough, but the situation is exacerbated since this problem is encountered every time a new DSP application is targeted to FPGA, given that the entire motivation for choosing FPGA is their ability to host custom computing architectures tailored to the application. As such, architecture synthesis for DSP applications on FPGA is a complex design problem encountered very frequently, a combination which places unique demands on Electronic System Level (ESL) design tools for embedded systems. To maintain a productive design process in this environment, it is vital that a designer can quickly evaluate a range of options for implementing their architecture and choose the one which best matches their needs, applying some fine-tuning afterwards if required. This should all occur, preferably, without the designer having to resort to low abstraction (i.e. RTL level) manipulation of the architecture. To enable this kind of process, a new generation of system design tools allowing rapid generation of FPGA architectures is beginning to emerge, helping to explore the trade-offs between resource, power, and real-time performance. A number of these are described in Section 6.
6 System Level Design for FPGA DSP

As outlined in Section 2.2, FPGA architecture development has historically been sustained by RTL-level, HDL-based coding and implementation techniques similar to those employed for ASIC development. However, as the complexity of ASICs increases along with the cost of fabricating a chip, the design and use of ASIC is increasingly confined to mass-market technologies such as mobile phones and PCs, all of which employ popular communications standards such as Bluetooth® and WiFi, and codecs such as MPEG. This has driven the development of standards-based programmable SoC platforms which are highly efficient at implementing a set of standards in a particular area, for example 802.11 wireless communications protocols or a variety of MPEG codecs, and are programmable to realise specific standards from the target set [24, 43]. Consequently, the majority of SoC design tools are intended for a design process which may take a substantial period of time to build a highly efficient architecture once. This is distinct from FPGA-based design, where architecture re-engineering is commonplace; a divergence in the nature of design tools supporting ASIC and FPGA design is a natural consequence. Multiprocessor architectures, on the other hand, do not require any architectural manipulation, merely reprogramming. As such, the requirement to frequently and rapidly generate custom computing architectures encountered in FPGA is unique. To maintain design productivity in this process, tools which enable it must quickly and automatically generate an architecture from an abstract representation of the algorithm functionality, and should be able to automatically explore the design space, trading off performance with power and resource, before automatically generating the implementation. A number of tools which enable this process for FPGA-based systems are beginning to emerge, and are described in Sections 6.1-6.3.
6.1 Daedalus

Daedalus is a system level design and rapid implementation toolset for DSP applications which has been used to generate FPGA-based MPSoC implementations of various image processing algorithms [27, 39]. A conglomerate of two previously existing tools, Sesame [30] and ESPAM [28], Daedalus enables a full synthesis flow from sequential application specification to implementation in a variety of MPSoC topologies, including point-to-point, bus-based or crossbar interconnect. The flow of data and operations in Daedalus is illustrated in Fig. 16. The top processes are mostly aimed at generating an allocation of processing resources, with the lower processes aimed at synthesis of the allocation. The functional specification of the system to be implemented, written as a Static Affine Nested Loop Program (SANLP) in a standard sequential language such as C++ or Matlab, is converted to a parallel Kahn Process Network (KPN) form [22] by the KPNGen tool [41].
Fig. 16 Daedalus MPSoC Design Flow
Daedalus accepts this specification and performs an exhaustive search of the design space to produce an MPSoC allocation which can support the real-time requirements of the system. The platform specification and the mapping of the KPN tasks to distinct processors in the platform are then generated and passed to ESPAM for synthesis of the solution.
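The SANLP input format is restrictive but easy to illustrate: loop bounds and array index expressions must be static and affine, which is what allows the tool to derive a KPN with one process per function call. The following is a hypothetical sketch of such a program; the kernels are placeholders, not part of any tool's API.

```c
/* Hypothetical SANLP: static bounds and affine indices (i-1, j+1, ...).
 * A tool such as KPNGen could turn each function call into a KPN
 * process, with FIFO channels carrying the array dependences. */
#define N 64

static int source(int i, int j)  { return i + j; }            /* placeholder */
static int filter(int n, int w, int e, int s)                 /* placeholder */
{ return (n + w + e + s) / 4; }

static int pix[N][N], out[N][N];

void sanlp_example(void)
{
    for (int i = 0; i < N; i++)
        for (int j = 0; j < N; j++)
            pix[i][j] = source(i, j);                /* producer process */

    for (int i = 1; i < N - 1; i++)
        for (int j = 1; j < N - 1; j++)
            out[i][j] = filter(pix[i-1][j], pix[i][j-1],
                               pix[i][j+1], pix[i+1][j]);  /* consumer process */
}
```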
6.2 Compaan/LAURA

The Compaan toolchain is composed of the Compaan parallelising compiler and the LAURA architectural synthesis tool. It has predominantly targeted core network implementations, on FPGA, of SANLP programs written in Matlab [38]. Compaan acts as a parallelising compiler for the sequential specification, once again generating a KPN application specification. The LAURA tool then assumes a fixed one-to-one correspondence between the KPN structure and the target architecture [38]. Design space exploration of the implementation is effected by manipulating the KPN topology, using transformations of the loops (e.g. skewing, unrolling) in the source SANLP [9, 37].
6.3 Koski

Koski is a Unified Modelling Language (UML)-based design environment for automatic programming, synthesis and exploration of MPSoC implementations of wireless sensor network applications [23].
Fig. 17 Koski MPSoC Design Toolset
The structure of the Koski design environment is outlined in Fig. 17. The final architecture is composed of multiple processors and cores connected to a HIBI network [33]. Again, application specification uses the KPN modelling domain. This specification, along with specifications of the architecture and of the implementation constraints, is processed by an architecture exploration toolset to generate the platform allocation. During static mapping, the allocation of processors and the mapping of tasks to processors are generated, with this allocation fine-tuned during a dynamic mapping phase. Upon finalisation of the mapping, the RTL and software synthesis technology of Koski is used to generate a prototype of the final implementation.
6.4 Requirements for FPGA DSP System Design

The key to productive DSP system design on FPGA is the ability to quickly generate a wide range of implementations from which the designer can choose a suitable candidate. Given the range of techniques which may be used to implement DSP systems, as outlined in this chapter, this kind of capability remains an emerging possibility rather than an established reality. Of the three toolsets outlined in this section, the first signs of automated design space exploration are beginning to appear, although in both cases where it is present (Daedalus and Koski) exhaustive exploration techniques, which do not scale well with application size, are used. Similar capabilities are evident in Synfora's PICO FPGA toolset for rapid core creation. Whilst the first steps have been taken, much remains to be done to establish mature programming, rapid implementation and design space exploration tools for FPGA. As the use of FPGA becomes more widespread to fill the widening gap
created by the commercial justification required for the NRE costs of fabricating custom chips, and by the power consumption and performance issues of multicore technology, FPGA must, and will, establish a breed of system design approaches all of its own to meet its own particular demands.
7 Summary

FPGAs offer enormous potential for implementation of custom DSP computing architectures for ad-hoc, low volume DSP systems. The suitability of their programmable logic to host bit-parallel core architectures, and the advent of DSP-datapath units along with hardcore and softcore processors, has given designers the ability to harness a vast array of processing technologies with which to implement their systems. The options range from dedicated hardware architectures, through GPP architectures with application specific co-processors, to massively parallel software programmable architectures, and everything in between. Whilst this presents an enormous opportunity, it places significant stress on the FPGA system design process. Since the entire motivation for using FPGA is its ability to host custom computing architectures tailored to the application, the complex heterogeneous system synthesis process required to generate an implementation of a DSP application is encountered frequently. As such, FPGA DSP system design presents an architecture synthesis challenge not encountered elsewhere in the embedded computing world. A new generation of design tools is starting to tackle and automate solutions to this system design problem, but many advances are yet to be made before it is solved.
References

1. Banerjee, P., Haldar, M., Nayak, A., Kim, V., Saxena, V., Parkes, S., Bagchi, D., Pal, S., Tripathi, N., Zaretsky, D., Anderson, R., Uribe, J.: Overview of a compiler for synthesizing MATLAB programs onto FPGAs. IEEE Trans. VLSI Syst. 12(3), 312–324 (2004)
2. Cho, J., Chang, H., Sung, W.: An FPGA-based SIMD processor with a vector memory unit. In: Proc. IEEE International Symposium on Circuits and Systems, pp. 525–528 (2006)
3. Altera Corp.: Avalon Interface Specifications (2008). URL www.altera.com
4. Altera Corp.: Nios II C2H Compiler User Guide (2008). URL www.altera.com
5. Altera Corp.: DSP Design Flow User Guide (2009). URL www.altera.com
6. Altera Corp.: Stratix IV Device Handbook (2009). URL www.altera.com
7. Fischaber, S., Woods, R., McAllister, J.: SoC memory hierarchy derivation from dataflow graphs. Journal of Signal Processing Systems (2009)
8. Gajski, D., Vahid, F., Narayan, S., Gong, J.: Specification and Design of Embedded Systems. Prentice Hall (1994)
9. Harriss, T., Walke, R., Kienhuis, B., Deprettere, E.F.: Compilation from Matlab to process networks realised in FPGA. Design Automation for Embedded Systems 7(4), 385–403 (2002)
10. Hoare, C.: Communicating Sequential Processes. Prentice Hall (1985)
11. Agility Design Solutions Inc.: Handel-C Language Reference Manual (2007). URL www.agilityds.com
12. Xilinx Inc.: MicroBlaze Processor Reference Guide (2008). URL www.xilinx.com
13. Xilinx Inc.: Embedded Processor Block in Virtex-5 FPGAs (2009). URL www.xilinx.com
14. Xilinx Inc.: Embedded Processor Block in Virtex-5 FPGAs Reference Guide (2009). URL www.xilinx.com
15. Synfora Inc.: PICO FPGA Datasheet (2009). URL www.synfora.com
16. Xilinx Inc.: System Generator for DSP User Guide (2009). URL www.xilinx.com
17. Xilinx Inc.: Virtex-5 FPGA User Guide (2009). URL www.xilinx.com
18. Xilinx Inc.: Virtex-5 FPGA XtremeDSP Design Considerations (2009). URL www.xilinx.com
19. Xilinx Inc.: Virtex-6 FPGA Memory Resources (2009). URL www.xilinx.com
20. Texas Instruments: TMS320C64x/C64x+ DSP CPU and Instruction Set Reference Guide (2008). URL http://www.ti.com
21. Jones, A., Hoare, R., Kusic, D., Fazekas, J., Foster, J.: An FPGA-based VLIW processor with custom hardware execution. In: Proceedings 13th International Symposium on Field-Programmable Gate Arrays, pp. 107–117 (2005)
22. Kahn, G.: The semantics of a simple language for parallel programming. In: Proc. IFIP Congress, pp. 471–475 (1974)
23. Kangas, T., Kukkala, P., Orsila, H., Salminen, E., Hännikäinen, M., Hämäläinen, T., Riihimäki, J., Kuusilinna, K.: UML-based multiprocessor SoC design framework. ACM Transactions on Embedded Computing Systems (TECS) 5, 281–320 (2006)
24. Keutzer, K., Malik, S., Newton, R., Rabaey, J., Sangiovanni-Vincentelli, A.: System level design: Orthogonalization of concerns and platform-based design. IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems 19(12), 1523–1543 (2000). URL http://www.gigascale.org/pubs/98.html
25. Kulmala, A., Salminen, E., Hämäläinen, T.: Instruction memory architecture evaluation on multiprocessor FPGA MPEG-4 encoder. In: Proc. Design and Diagnostics of Electronic Circuits and Systems, pp. 1–6 (2007)
26. Moscu Panainte, E., Bertels, K., Vassiliadis, S.: The MOLEN compiler for reconfigurable processors. ACM Transactions on Embedded Computing Systems 6(1) (2007)
27. Nikolov, H.: System level design methodology for streaming multi-processor embedded systems. Ph.D. thesis, Leiden University, the Netherlands (2009)
28. Nikolov, H., Stefanov, S., Deprettere, E.: Multi-processor system design with ESPAM. In: Proc. 4th Int'l Conf. on Hardware/Software Codesign and System Synthesis, pp. 211–216 (2006)
29. Parhi, K.: VLSI Digital Signal Processing Systems: Design and Implementation. Wiley (1999)
30. Pimentel, A., Erbas, C., Polstra, S.: A systematic approach to exploring embedded system architectures at multiple abstraction levels. IEEE Transactions on Computers 55(2), 1–14 (2006)
31. Qiang, L., Constantinides, G., Masselos, K., Cheung, P.: Combining data reuse with data-level parallelization for FPGA-targeted hardware compilation: A geometric programming framework. IEEE Trans. CAD 28, 305–315 (2009)
32. Ravindran, K., Satish, N.R., Jin, Y., Keutzer, K.: An FPGA-based soft multiprocessor system for IPv4 packet forwarding. In: International Conference on Field Programmable Logic and Applications, pp. 487–492 (2005)
33. Salminen, E., Kangas, T., Hämäläinen, T., Riihimäki, J., Lahtinen, V., Kuusilinna, K.: HIBI communication network for system-on-chip. Journal of VLSI Signal Processing Systems 43(2-3), 185–205 (2006)
34. Santo, B.: 25 microchips that shook the world. IEEE Spectrum (2009)
35. Sheldon, D., Kumar, R., Lysecky, R., Vahid, F., Tullsen, D.: Application-specific customization of parameterized FPGA soft-core processors. In: Proc. IEEE/ACM International Conference on Computer-Aided Design, pp. 261–268 (2006)
392
John McAllister
36. Sheldon, D., Kumar, R., Vahid, F., Tullsen, D., Lysecky, R.: Conjoining soft-core FPGA processors. In: Proc. IEEE/ACM International Conference on Computer-Aided Design, pp. 694– 701 (2006) 37. Stefanov, T., Kienhuis, B., Deprettere, E.: Algorithmic transformation techniques for efficient exploration of alternative application instances. In: Proc. 10th Int. Symp. on Hardware/Software Codesign, pp. 7–12 (2002) 38. Stefanov, T., Zissulescu, C., Turjan, A., Kienhuis, B., Deprettere, E.: System design using Kahn process networks: The Compaan/Laura approach. In: Proc. Design Automation and Test in Europe (DATE) Conference, p. 10340 (2004) 39. Thompson, M., Stefanov, T., Nikolov, H., Pimentel, A., Erbas, C., Polstra, E., Deprettere, E.: A framework for rapid system-level exploration, synthesis and programming of multimedia MP-SoCs. In: Proc. ACM/IEEE/IFIP Int. Conference on Hardware-Software Codesign and System Synthesis, pp. 9–14 (2007) 40. Vassiliadis, S., Wong, S., Gaydadjiev, G., Bertels, K., Kuzmanov, G., Moscu Panainte, E.: The MOLEN polymorphic processor. IEEE Transactions on Computers 53(11), 1363–1375 (2004) 41. Verdoolaege, S., Nikolov, H., Stefanov, T.: PN: a tool for improved derivation of process networks. Eurasip Journal on Embedded Systems 2007 (2007) 42. Walke, R., Smith, R., Lightbody, G.: 20-GFLOPS QR processor on a Xilinx Virtex-E Fpga. In: Proc. SPIE Advanced Signal Processing Algorithms, Architectures and Implementations X, vol. 4116, pp. 300–310 (2000) 43. Wolf, W.: High Performance Embedded Computing - Architectures, Applications, and Methodologies. Morgan Kaufmann (2007) 44. Woods, R., McAllister, J., Yi, Y., Lightbody, G.: FPGA-based Implementation of Signal Processing Systems. Wiley (2008) 45. Yi, Y., Woods, R.: Hierarchical synthesis of complex DSP functions using IRIS. IEEE Trans. Computer Aided Design of Integrated Circuits and Systems 25(5), 806–820 (2006) 46. Yu, J., Eagleston, C., Chou, C., Perreault, M., Lemieux, G.: Vector processing as a soft processor accelerator. ACM Trans. Reconfigurable Technology and Systems 2(2), 12:1–12:34 (2000) 47. Zhuo, L., Prasanna, V.: Scalable and modular algorithms for floating-point matrix multiplication on reconfigurable computing systems. IEEE Trans. Parallel and Distributed Systems 18(4), 433–448 (2007) 48. Zhuo, L., Prasanna, V.: High-performance designs for linear algebra operations on reconfigurable hardware. IEEE Trans. Computers 57(8), 1057–1071 (2008)
General-Purpose DSP Processors Jarmo Takala
Abstract Recently the border between DSP processors and general-purpose processors has been diminishing, as general-purpose processors have obtained DSP features to support various multimedia applications. This chapter provides a view of general-purpose DSP processors by considering the characteristics of DSP algorithms and identifying the features of a processor architecture that are important for efficient DSP algorithm implementations. Fixed-point and floating-point data paths are discussed. Memory architectures are considered from the parallel-access point of view, and address computations are briefly discussed.
1 Introduction Recently the border between DSP processors and general-purpose processors has been diminishing, as general-purpose processors have obtained DSP features to support various multimedia applications. On the other hand, DSP processors, which used to be programmed in manual assembly, have nowadays incorporated features from general-purpose computers to support software development in high-level languages. A DSP processor can be defined in various ways, and the simplest interpretation states that a DSP processor is any microprocessor that processes signals represented in digital form. As all programmable processors could be classified as DSP processors according to this definition, the definition needs refinement. A more focused view of DSP processors can be obtained by considering the characteristics of digital signal processing. As DSP is the application of mathematical algorithms to digitally represented signals and, on the other hand, DSP is applied in real-time systems where the real-time constraint is determined by the repetition period of the
algorithm, we can assume that DSP systems and, in particular, real-time DSP systems mainly consist of repetitive application of data-driven behaviors defined by mathematical algorithms under strict timing constraints [20]. This implies that DSP processors are designed for repetitive, numerically intensive tasks. On the other hand, DSP applications often define two main requirements: timing and error. The timing requirement dictates that a sequence of operations defined by the algorithm at hand must be performed in a given time. In addition, the error of the results must be less than specified, i.e., the accuracy of computations must fulfill the requirements. Therefore, we can expect DSP processors to contain features that improve the accuracy and performance of computations in DSP algorithms. In addition, as the processed signals often represent real-world physical signals, there is a need to interface the processor to various peripheral devices, e.g., A/D and D/A converters, to receive and send digital signals. In particular, in real-time systems, there is a need to transfer data in and out under constant data rate requirements. Modern DSP processors often contain various peripheral devices for easy interfacing.
Many traditional DSP algorithms contain computation of a sum of products, FIR filtering being the traditional example; the filtered signal y_n is obtained with the aid of an N-tap FIR filter as

y_n = \sum_{i=0}^{N-1} c_i x_{n-i}     (1)
where x_n is the input sample at time instant n and c_i is the i-th filter coefficient. As multiplication is used so often, a fast multiplier has been an integral part of DSP processor architecture, although the first commercially available DSP processor, the Intel 2920 [14, 30], did not have a hardwired multiplier unit at all. The sum-of-products computation implies accumulation, thus multiply-accumulate (MAC) is advantageous for DSP applications. In particular, single-cycle multiply-accumulate instructions are common in DSP processors. Another important property is high memory bandwidth for feeding the arithmetic units with operands from memory. For this purpose, multiple-access memory architectures are used, which allow parallel instruction fetch and operand accesses. In addition, specialized addressing modes and dedicated address generation units can improve memory-related performance. DSP processors often contain specialized execution control mechanisms; in particular, efficient looping capabilities reduce the overhead due to repetitive execution. Specialized features to improve numerical accuracy are also present, as accuracy of results is often one of the main criteria set by DSP applications. Finally, as DSP systems often process data representing real-world signals, input/output interfaces and different peripherals are needed.
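As a concrete illustration, the following C sketch implements the tap sum of (1); the function name and the 64-bit accumulator width are illustrative choices rather than any vendor's API. On a DSP processor, the loop body would map to a single-cycle MAC instruction.

```c
#include <stdint.h>

/* N-tap FIR tap sum of (1): one multiply-accumulate per tap.
 * x points to the delay line ordered so that x[i] holds x_{n-i}.
 * Products are formed at full precision (16x16 -> 32 bits) and summed
 * into a wider accumulator, mirroring the oversized accumulators with
 * guard bits found in DSP data paths. */
int64_t fir_tap_sum(const int16_t *c, const int16_t *x, int n_taps)
{
    int64_t acc = 0;                      /* wide accumulator */
    for (int i = 0; i < n_taps; i++)
        acc += (int32_t)c[i] * x[i];      /* MAC: full-precision product */
    return acc;
}
```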
2 Arithmetic Type DSP processors are divided into fixed-point and floating-point processors based on the type of arithmetic units in the processor. In the early days of DSP processors, fixed-point processors with 16-bit word width were used, e.g., Texas Instruments TMS32010 [23] and NEC PD7720 [25]. While 16-bit data was sufficient for speech applications, it was fairly soon realized that some applications require higher accuracy, and 24-bit processors were introduced, e.g., Motorola DSP56000 [17]. In addition, floating-point DSP processors were introduced, e.g., Hitachi HD61810 [11], AT&T DSP32 [13], and NEC PD77230 [24]. Floating-point processors contain more complex logic and, therefore, consume more power and are more expensive. However, floating-point processors are easier to program, as the dynamic range of the floating-point representation is larger and there is no need to scale and optimize the signal levels during intermediate computations. Furthermore, high-level languages have floating-point data types, while integers are the only fixed-point data types supported, although signal processing calls for fractional data types in fixed-point arithmetic.
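The fractional fixed-point arithmetic referred to above can be emulated in C; the sketch below assumes the common Q15 format (1 sign bit, 15 fraction bits), and the type and function names are chosen here for illustration only.

```c
#include <stdint.h>

typedef int16_t q15_t;   /* Q15: 1 sign bit, 15 fraction bits, range [-1, 1) */

/* Fractional multiply: form the full-precision 32-bit product and keep
 * the most significant bits, in contrast to an integer multiply, which
 * keeps the least significant bits of the product. */
static inline q15_t q15_mul(q15_t a, q15_t b)
{
    int32_t p = (int32_t)a * b;   /* 16x16 -> 32-bit product (Q30) */
    return (q15_t)(p >> 15);      /* renormalize to Q15; the corner case
                                     (-1)*(-1) still requires saturation */
}
```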
3 Data Path The actual signal processing operations in a processor are carried out in various functional units, such as arithmetic logic units (ALU) or multipliers, and the collection of these units is called the data path or data ALU. In order to store intermediate results, the data path also contains accumulators, registers, or register files. The data path in DSP processors can be expected to be tailored for the computations inherent in typical DSP algorithms. DSP processors may also be enhanced with special function units to improve performance for a group of applications. In the following sections, the principal features of fixed-point data paths are studied and the differences in floating-point data paths are discussed.
3.1 Fixed-Point Data Paths A typical fixed-point data path includes a multiplier, an ALU, shifters, registers, and accumulators. The data path is used mainly for signal processing operations, while computations related to memory access, which call for integer arithmetic, are often performed in separate ALUs called address generation units or address arithmetic units.
3.1.1 Multiplier and ALU As traditional DSP processors are used for computing numerically intensive tasks, a fast multiplier is an essential unit in a DSP processor. Although multipliers are included in general-purpose microprocessors, there is a major difference in the multipliers of fixed-point DSP processors. In general-purpose fixed-point computations, the integer data type is used and multipliers operate such that integer operands result in an integer product (b-bit operands produce a b-bit product, i.e., the least significant bits of the product are saved). In DSP processors, however, the fractional data type is exploited, which implies that the least significant bits of the product are not sufficient; thus the product is obtained at full precision: multiplication of b-bit operands results in a 2b-bit product (law of conservation of bits). This indicates that the integer multipliers present in standard microprocessors and microcontrollers are not well suited to signal processing. In some DSP processors, multipliers may produce narrower results to obtain speed-up or smaller silicon area. The multiplier may also be pipelined, implying latency, i.e., the product may not be available for the next instruction. Multiplication is also involved in one of the most characteristic operations in DSP, multiply-accumulate, and DSP performance is often even characterized in MAC/s. This measure is often practical as DSP algorithms are, in general, data-independent and, therefore, deterministic in behavior. A DSP processor may contain an additional adder to be used in the MAC operation, and these resources form a MAC unit. A processor may also contain parallel units to further boost performance in DSP applications. In a similar fashion as general-purpose processors, DSP processors contain an arithmetic-logic unit, which performs the basic operations: addition, subtraction, increment, negation, AND, OR, NOT, etc. Often the addition in a MAC operation is carried out in the ALU. The ALU in a DSP processor operates in a similar fashion as in general-purpose processors, but the arithmetic operations are carried out with extended word width operands. This is due to the fact that consecutive MAC operations tend to increase the word width of the result. In order to avoid overflow in accumulation, additional guard bits are used, i.e., additional bits are used when performing the accumulation and storing the accumulation results. In general, log2(N) additional bits are required for carrying out N additions without overflow. The more guard bits, the more headroom against overflow. In early DSP processors, the multiplier was a separate unit, which stored its result in a specific "product" register, as seen in Fig. 1. Such an arrangement implies that an additional instruction is needed to perform the accumulation, and there is a latency of one instruction in the MAC operation. A similar behavior can be seen in the Freescale DSP56300 illustrated in Fig. 2. In order to increase performance in DSP applications, a processor can contain several multipliers. E.g., the TI TMS320C55x family of processors contains two MAC units [28]; each unit can perform a multiplication with 17-bit operands and a 40-bit addition. TI TMS320C64x processors contain parallel functional units as shown in Fig. 3.
Fig. 1 Principal data path of TI TMS320C25
Two of the eight functional units can perform several types of multiplications: 32×32 multiplication, 16×16 multiplication, 16×32 multiplication, quad 8×8 multiplications, dual 16×16 multiplications, and Galois field multiplication. These two units also support dual 16×16 and quad 8×8 MAC operations. Freescale MSC8156 contains six SC3850 DSP cores, and the data path of each core contains four ALUs, which can perform dual 16×16 MAC operations. The ALUs also support complex-valued multiplication of operands with 16-bit real and imaginary parts [10]. 3.1.2 Registers Early DSP processors were load-store architectures where operands in memory have to be moved to specific operand registers before the operand can be transferred to a functional unit. In a similar fashion, results from functional units were stored to specific registers or accumulators. In early DSP processors, the number of general-purpose registers was small, indicating a need to store intermediate results to memory. This clearly represents a performance bottleneck. Higher performance can be expected when operands are fetched directly from memory with the aid of efficient addressing mechanisms. In addition, modern processors contain general-purpose registers, and the number of registers is higher, reducing the need for storing local data to memory. As discussed in the previous section, the width of accumulators and registers is often greater than the full precision result of multiplication, as the additional bits (guard bits) in the most significant part provide headroom against overflow when accumulating several products. Finally, several registers can be used as a combined extended precision register. E.g., in the Freescale DSP56300 in Fig. 2, the 24-bit registers X0 and X1 can be used together as a 48-bit register.
Fig. 2 Principal data path of Freescale DSP56300
Arithmetic operations with extended precision can be performed by exploiting signed and unsigned versions of the native add and multiply instructions, and the final 96-bit result can be stored to accumulators A and B. In general-purpose DSP processors, the trend has been towards a larger number of general-purpose registers; e.g., TMS320C64x processors contain two register files, each of which contains 32 registers, as illustrated in Fig. 3. These 32-bit registers can be used to store data ranging from packed 8-bit to 64-bit fixed-point data. Word widths greater than 32 bits are stored by using two registers as a pair. In Freescale SC3850, 16 registers form a register file, which supports word widths up to 40 bits. Special packed data formats are supported such that two 20-bit operands can be packed into a single 40-bit register. This format is well suited for representing complex numbers. 3.1.3 Shifters Shifters can be found in all general-purpose processors, where the actual shifter unit is part of the ALU. In general, two types of shift operations can be found: arithmetic and logical shift. The difference is in the right shift: in a logical shift, the vacant bit positions are filled with zeros, while sign-extension is used in an arithmetic shift. In both cases, the vacant bits are filled with zeros in a left shift. Yet another form of shift is the circular shift (or rotation), where bits shifted out are moved into the vacant bit positions.
Fig. 3 Principal data path of TI TMS320C64x
Circular shifts are used in communications applications and some address computations. The arithmetic shift is a useful operation in fixed-point processors, as an arithmetic shift b bits to the left corresponds to multiplication by a factor of 2^b, and a shift to the right is division by 2^b. In fixed-point DSP processors, shifters can be found in various locations, as shifting is needed for different purposes. Shifting can be used to provide headroom against overflow in multiply-accumulation, i.e., if the product is shifted b bits to the right, 2^b accumulations can be performed without overflow. Such an organization was used in early DSP processors, e.g., in the TI TMS320C25 depicted in Fig. 1, but the disadvantage is loss of precision, as the least significant bits are lost in shifting. The shifter after the multiplier can still be useful when using fractional-mode multiplication, i.e., the product is shifted one bit to the left in order to remove the extra sign bit due to multiplication of fractional operands, e.g., in TI TMS320C54x [27]. Another use for shifters is scaling of operands from memory. When full-precision products of fractional operands are used, the position of the binary point in the extended precision result in the accumulator is not the same as in the operands in memory. E.g., the b-bit operands in memory use a representation with b−1 fractional bits, while full-precision products in accumulators use a representation with 2b−2 fractional bits. Therefore, if we would like to add an operand from memory to the full-precision result in the accumulator, the memory operand has to be multiplied by 2^{b−1}, i.e., shifted (b−1) bits to the left. Such a shift can be carried out with the top leftmost shifter connected to the data bus in Fig. 1. In a similar fashion, constants from instructions need to be shifted to the left; for this purpose, the immediate field from the program bus in DSP56300 in Fig. 2 has a connection to the bit field unit & barrel shifter. Finally, shifters are also exploited when storing extended precision results from wide accumulators or registers to native word width memory. This implies that a part of the bits needs to be selected for transfer to memory.
Fig. 4 Barrel shifter
For this purpose, the contents of the accumulator or register can be shifted such that the bits to be stored are in the LSB or MSB part and then stored to memory. E.g., 16-bit operands with 15 fractional bits will produce a 32-bit full-precision result with 30 fractional bits in the accumulator; thus the result should be shifted 15 bits to the right such that the 16 least significant bits can be stored to memory in the same representation as the original operands. Alternatively, if the 32-bit accumulator can be used as two 16-bit registers (upper and lower half), the product can be shifted one bit to the left and the upper half stored to memory. The shifters in DSP processors need to support shifts of a variable number of bits in a single cycle, unlike in low-cost microprocessors where typically only shifts and rotations of one bit to the left or right are supported. Therefore, DSP processors use barrel shifters (also known as plus-minus-2^i (PM2I) networks), which contain multiple layers of interconnections between register elements [16]. Each bit register in a 2^n-bit barrel shifter is connected to the registers at distances ±2^0, ±2^1, ..., ±2^{n−1}, indicating that a shift of an arbitrary number of bits will take at most (n−1) steps. This is illustrated in Fig. 4, where the connections in an 8-bit barrel shifter can be seen, i.e., each bit register has connections to five neighbors. The three different connection layers are shown on the right, and it can be seen that any arbitrary shift will take at most two steps. Barrel shifters are implemented as a cascade of parallel multiplexers. When the structure of the barrel shifter is drawn by placing the register elements on a ring topology as in Fig. 4, the structure resembles a barrel, hence the name. 3.1.4 Overflow Management In fixed-point DSP processors, special care needs to be taken against overflow. The simplest method to detect overflow is the overflow bit in the status register, which is set when the ALU detects an overflow as a result of an arithmetic operation. Such a mechanism is available in most processors, but it is not an efficient method for DSP applications as it requires a software procedure to handle the error. One method for overflow management, specific to DSP processors, is guard bits, i.e., oversized accumulators and registers with extended precision arithmetic units.
Fig. 5 Saturation arithmetic with 4-bit fractional representation (sign bit and three fraction bits); thin dashed line: full precision signal, thick dashed line: quantized overflowed signal, solid line: quantized signal with saturation, xmin (xmax): minimum (maximum) representable number
The main purpose is to allow accumulation of products without overflow; the final extended precision results are quantized when stored to memory. Another feature specific to overflow management is shifters. E.g., to avoid extra overhead when quantizing extended precision results for memory, shifters between the accumulators or registers and the buses are used. Another mechanism is to scale intermediate results to arrange headroom against overflow. An example of such scaling is the shifter after the multiplier in Fig. 1. This approach reduces precision, as some of the least significant bits of the products are lost during the right shift. Yet another mechanism is a dynamic scaling mode, which is useful in block processing applications, e.g., discrete trigonometric transforms, where a block floating-point representation can be exploited. The principal idea is to emulate floating-point representation but to use a common exponent for a block of samples. The common exponent is defined by the largest magnitude in the block. In block processing, the word width increase of intermediate results is detected at each iteration, and headroom against overflow is arranged with the aid of suitable down-scaling of operands during the next iteration. E.g., in DSP56300 [9], such scaling is supported with a scaling bit in the status register. This bit is set if a result moved from an accumulator to memory requires scaling. This is a sticky bit, thus it can be used to detect the need for scaling in a block of data. In the next iteration, the operands from memory can then be scaled towards zero. The signal scaling requires careful planning from the programmer to maximize the intermediate signal levels in a proper manner. An automated method for avoiding overflow is to use saturation arithmetic or a limiter, as discussed in Chapter 12. In such a case, the arithmetic unit limits the result to the largest positive or negative representable number in case of overflow, as illustrated in Fig. 5, which shows the behavior of a limiter with a 4-bit fractional representation. Overflow is also possible when storing results from oversized accumulators to memory.
Thus, a limiter should be available not only in ALUs but also between accumulators and memory buses, as seen in Fig. 2. Saturation arithmetic is specified in various telecommunication standards, and limiters are useful in such applications.
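A limiter is easy to emulate in software; the following C sketch illustrates what the hardware performs in a single cycle when a wide accumulator value is stored to a 16-bit memory location (the function name is illustrative, not a vendor intrinsic).

```c
#include <stdint.h>

/* Saturating store: quantize a wide accumulator value to 16 bits,
 * clamping to the largest/smallest representable numbers instead of
 * wrapping around on overflow (cf. Fig. 5). */
static int16_t saturate16(int64_t acc)
{
    if (acc > INT16_MAX) return INT16_MAX;   /*  0x7FFF */
    if (acc < INT16_MIN) return INT16_MIN;   /* -0x8000 */
    return (int16_t)acc;
}
```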
3.2 Floating-Point Data Paths The first floating-point DSP processors used proprietary number representations. In 1985, the ANSI/IEEE standard 754-1985 [5] was introduced, and soon floating-point arithmetic compatible with the IEEE standard appeared, especially in general-purpose microprocessors. In modern floating-point DSP processors, the arithmetic units exploit the IEEE standard. In a floating-point arithmetic unit, the results are scaled automatically to maximize the precision of results, as the number representation requires normalization of the mantissa. This normalization results in the main advantage of floating-point representation: large dynamic range. E.g., the IEEE single-precision format defines a 24-bit mantissa and an 8-bit exponent, where the minimum and maximum values for the exponent are −126 and 127, respectively. Therefore, the dynamic range of the IEEE single-precision number system is

D_{sp} = ((2 − 2^{−23}) × 2^{127}) / (1 × 2^{−126}) = 2.89 × 10^{76};  20 log_{10}(2.89 × 10^{76}) dB ≈ 1529 dB.

A similar dynamic range with a signed fixed-point number system requires a word width of 255 bits:

D_{fxp} = 1 / 2^{−254} = 2.89 × 10^{76}.

The huge dynamic range implies that overflow is not a major concern as in fixed-point data paths. However, overflow is still possible, and arithmetic units indicate such an exceptional result by setting the overflow bit in the status register; an error recovery routine provided by the user or operating system is then executed. The normalization of the mantissa in a floating-point arithmetic unit is carried out with the aid of a shifter. Such a shifter is an integral part of the floating-point arithmetic unit and, therefore, it is not visible to the programmer, unlike the shifters in fixed-point DSP processors. Therefore, fewer shifters are seen in floating-point DSP processors. In addition, arithmetic units in floating-point DSP processors often also support fixed-point arithmetic. The data paths of the first floating-point DSPs resembled the fixed-point data paths, but soon floating-point processors were designed to be more flexible, e.g., general-purpose register files were used instead of operand registers and accumulators. An example of such a data path is Analog Devices ADSP-2106x [4], illustrated in Fig. 6. However, the data paths in modern VLIW DSP processors can be the same independent of the arithmetic type, e.g., TI TMS320C67x floating-point processors contain a data path similar to the TMS320C64x fixed-point processors illustrated in Fig. 3.
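The dynamic range figures above can be checked with a few lines of C; FLT_MAX and FLT_MIN are the largest and smallest positive normalized single-precision values used in the derivation.

```c
#include <stdio.h>
#include <math.h>
#include <float.h>

int main(void)
{
    /* FLT_MAX = (2 - 2^-23) * 2^127, FLT_MIN = 2^-126 */
    double range = (double)FLT_MAX / FLT_MIN;   /* about 2.89e76 */
    printf("dynamic range = %.2e (%.0f dB)\n",
           range, 20.0 * log10(range));         /* about 1529 dB */
    return 0;
}
```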
Fig. 6 Principal block diagram of ADSP-2106x processors
In floating-point processors, the multiplier does not produce the product at full precision (as seen in Fig. 6), unlike in fixed-point DSP processors. Still, quite often multipliers in floating-point DSP processors provide arithmetic results at extended precision, unlike in general-purpose microprocessors. While the 32-bit single-precision IEEE format is popular, there are processors supporting even higher precisions; e.g., TI TMS320C67x processors have hardware support for single-precision and double-precision instructions.
3.3 Special Function Units In general, significant savings can be obtained if a system is tailored for the application at hand. This implies that efficiency can be improved by adding application-specific functions to the data path of a general-purpose DSP processor. An early example of such a customization is the AT&T DSP1610, which included an additional bit manipulation unit. Another example is the compare-select-and-store unit integrated in the data path of TI TMS320C54x, depicted in Fig. 7. This unit speeds up Viterbi decoder implementations. The data path has an additional unit for determining the exponent value of the operand in the accumulator, which speeds up floating-point emulation. The previous processors were targeted at telecommunications applications, and they may be called domain-specific processors as they contain customized units to support a specific application domain, i.e., not only a single application but a group of applications having similar features. In a similar fashion, DSP processors may be targeted at video applications; the Analog Devices ADSP-BF522 processor contains four 8-bit video ALUs, which support, e.g., an average operation and a subtract/absolute value/accumulate operation. In modern DSP processors, the additional special units can also be rather complex.
Fig. 7 Principal block diagram of TI TMS320C54x processors
E.g., in the ADSP-TS201S, a 128-bit communications logic unit is used for trellis decoding, and it executes complex correlations for CDMA communications applications [3]. TI TMS320C6455 [29] also contains special hardware for speeding up Viterbi decoding and turbo decoding, but the units are not integrated in the data path; there are two co-processors for speeding up the tasks. The main difference in supporting such special functionalities is that special function units integrated in the data path are reflected by the instruction set, i.e., there are instructions to control the usage of the special function units, while co-processors are mainly visible as memory- or I/O-mapped peripheral devices.
4 Memory Architecture In order to obtain high performance in DSP, it is not enough to have high-performance arithmetic units in the data path. It is essential to be able to feed operands from memory to the data path; the memory bandwidth should match the performance of the data path. Quite often algorithms are applied to a large number of data structures such that there are not enough registers for all the operands, thus operands have to be fetched from memory.
Fig. 8 von Neumann architecture. In: instruction access, On: operand access
In general, low-cost microprocessors use the von Neumann architecture, where instructions and data are both placed in the same storage structure, implying that instruction fetches and data accesses are interleaved over common resources, as illustrated in Fig. 8. This implies that several cycles are needed to complete, e.g., the MAC operation needed for implementing one FIR tap in (1), which requires three memory accesses: fetching the instruction, the coefficient, and the sample. Higher performance can be expected if memory accesses can be performed simultaneously. In the first DSP processors, memory bandwidth was improved by exploiting the Harvard architecture, where instructions and data are stored in different, independent memories, as shown in Fig. 9(a). This implies that while the operands for the current instruction are accessed, the next instruction can already be fetched. This approach doubles the memory bandwidth when one-operand instructions are used, as clearly seen by comparing the timing diagrams in Fig. 8 and Fig. 9(a).
Fig. 9 Memory organizations with timing diagrams: (a) pure Harvard architecture and (b) Harvard architecture with repeat buffer. In: instruction fetch, On: operand access
Fig. 10 Principal block diagram of DSP56300
The speedup for two-operand instructions is not doubled, thus one of the first modifications was to use a repeat buffer where an instruction can be stored to avoid a fetch from memory, so the program bus is free for data access. Such an approach is applicable mainly in loops. In the first iteration, the instruction (I1 in Fig. 9(b)) is fetched from program memory, thus the program bus is reserved and the operand accesses are performed in the next cycle (O10 and O11). In the following iterations, the instruction is fetched from the repeat buffer, thus the program bus is free for operand access, and two operand accesses over two buses can be performed in parallel (O20 and O21, etc.). The memory bandwidth can be increased even further by adding memory modules and buses. In DSP56300, three memory modules are used, as illustrated in Fig. 10. If the data is distributed between the two data memory modules, two-operand instructions can be executed in one cycle. Even higher bandwidth can be obtained when more memory modules and buses are added. An example of such a DSP processor is the Hitachi DSPi, which contained six parallel memory modules. However, in order to exploit the parallelism, data has to be carefully arranged between the memory modules to avoid access conflicts. The complexity of this data distribution task is illustrated with an example in the following. Let us assume that an algorithm operates over a 4×4 matrix stored in four memory modules and that it accesses elements in rows and columns. Here a question arises: how to distribute the data over the four memory modules such that four elements from memory can be accessed in parallel? The data could be stored to memories column by column, i.e., each column is stored to a single memory module, as illustrated in Fig. 11(a), where the elements of the matrix are numbered row by row.
Fig. 11 Parallel access schemes: (a) low-order interleaving, (b) row rotation, and (c) linear transform
Thus, the first row of the matrix contains elements (0, 1, 2, 3). This data distribution method is called low-order interleaving, where the memory module address m and the row address r within the module are obtained with the aid of the address a as follows:

m = a mod Q     (2)
r = ⌊a/Q⌋     (3)

where Q denotes the number of memory modules, mod is the modulus operation, and ⌊·⌋ denotes the floor operation. When the number of memory modules is a power of two, the implementation is simple, as illustrated in Fig. 11(a). We can see that the elements in a row can be accessed in parallel, as the elements are stored in separate memory modules M0, ..., M3. However, accessing elements in a column introduces an access conflict, as all the elements are located in the same memory module. The column access needs to be serialized, slowing down the operation. The row rotation scheme [7] can be used to allow conflict-free parallel access to rows and columns. The module and row addresses are now defined as

m = (a + ⌊a/Q⌋) mod Q     (4)
r = ⌊a/Q⌋     (5)

The implementation, in case the number of memory modules is a power of two, requires an adder, as illustrated in Fig. 11(b). The storage of the matrix elements is shown, and it can be seen that the elements in each row and column are stored in different memory modules.
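The mappings (2)–(5) are cheap when Q is a power of two, as the modulus and floor operations reduce to bit operations; the following C sketch (with illustrative names) computes the module and row addresses for both schemes.

```c
#include <stdint.h>

#define Q 4u   /* number of memory modules, a power of two */

/* Low-order interleaving, (2)-(3). */
static void low_order(uint32_t a, uint32_t *m, uint32_t *r)
{
    *m = a % Q;            /* a mod Q: the two least significant bits */
    *r = a / Q;            /* floor(a/Q): a right shift by log2(Q) */
}

/* Row rotation, (4)-(5): skew each row by its row number so that both
 * the rows and the columns of a QxQ matrix fall in distinct modules. */
static void row_rotation(uint32_t a, uint32_t *m, uint32_t *r)
{
    *m = (a + a / Q) % Q;  /* needs one adder in hardware, cf. Fig. 11(b) */
    *r = a / Q;
}
```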
Fig. 12 Parallel access schemes: (a) high-order interleaving and memory maps with (b) high-order interleaving, and (c) low-order interleaving
Another scheme is to apply linear transformations [12]. Here the computations of the module and row addresses are performed at bit level; the module and row addresses are bit arrays m and r obtained from the index address array a as follows:

m = Ta     (6)
r = Ka     (7)

where T and K are binary matrices called the module and row transformation matrices, respectively. These matrices can be designed in various ways to support different access patterns. For the 4×4 example, we can use the following matrices:

r = Ka,  m = Ta,  with  K = [1 0 0 0; 0 1 0 0],  T = [1 0 1 0; 0 1 0 1],  a = (a_3, a_2, a_1, a_0)^T     (8)

i.e., r_1 = a_3, r_0 = a_2, m_1 = a_3 ⊕ a_1, and m_0 = a_2 ⊕ a_0. The corresponding matrix distribution and the implementation of the module and row address computation can be seen in Fig. 11(c). The computations in linear transformation schemes are based on modulo-2 arithmetic; thus the arithmetic is realized as bit-wise exclusive-OR operations, and modulo operations and the carry delay of adders are avoided. Linear transformation schemes are hence called XOR schemes.
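For the matrices in (8), the module address computation reduces to two exclusive-OR gates; a C sketch for the 4×4 example (the function name is illustrative):

```c
#include <stdint.h>

/* XOR scheme of (8) for Q = 4 modules and a 4-bit address a3..a0:
 * module bits m1 = a3 ^ a1, m0 = a2 ^ a0; row bits r1 = a3, r0 = a2.
 * No carries are involved, so the mapping costs a single gate delay. */
static void xor_scheme(uint32_t a, uint32_t *m, uint32_t *r)
{
    uint32_t a0 = a & 1u, a1 = (a >> 1) & 1u;
    uint32_t a2 = (a >> 2) & 1u, a3 = (a >> 3) & 1u;
    *m = ((a3 ^ a1) << 1) | (a2 ^ a0);   /* m = Ta */
    *r = (a3 << 1) | a2;                 /* r = Ka */
}
```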
While the previous schemes have been applied in some DSP processors, a more popular scheme for parallel memory modules is high-order interleaving, where the module address is defined by some of the most significant bits of the address, as illustrated in Fig. 12(a). The memory map in such a scheme looks as in Fig. 12(b), i.e., one memory module is allocated to a block of addresses and the next module reserves the next block of addresses. In such a case, a single data array is not distributed over the memory modules; instead, several arrays to be accessed in parallel are placed in different memory modules. E.g., in FIR computations, the coefficients c_i in (1) and the delay line containing the latest samples x_i are stored in different memory blocks, thus one coefficient and one sample from the delay line can be accessed in parallel. The memory map according to low-order interleaving is shown in Fig. 12(c). In that case, the idea is to access several elements from the same array in parallel rather than accessing elements from different arrays in parallel, as is often the case in the modified Harvard architecture. This is also supported by the fact that in various processors based on the Harvard architecture, the parallel memory modules reside in different memory spaces; e.g., in DSP56300 in Fig. 10, the parallel memory modules are in the X and Y memory spaces, thus the module is determined during programming. Harvard architecture is not the only method to increase the memory bandwidth. Similar performance can be obtained by exploiting multi-port memories. E.g., a dual-port memory with two buses provides the same memory bandwidth as two single-port memories over two buses. The dual-port memory is, however, more expensive in terms of area, speed, and power. On the other hand, a multi-port memory is more flexible, as the same memory space is seen through all ports, thus there is no need to distribute the data between different modules. In LSI DSP16410, three-port memories are used such that one port is dedicated to the instruction/coefficient memory space, the second port to the data memory space, and the third port is left for DMA transfers [22]. Multiple-access memories can also be used to increase the memory bandwidth, i.e., the memory can be accessed multiple times in an instruction cycle. E.g., TMS320C54x processors contain dual-access memory [27], which corresponds to a Harvard architecture with two data memories. Multiple-access memory is also flexible, as there is only one memory space, thus careful distribution of data is not needed as in the Harvard architecture. However, multiple-access memory may restrict the maximum clock frequency of the processor.
5 Address Generation Intensive access to memory in DSP applications implies that address computations are performed frequently. As the data path is utilized for signal processing arithmetic, DSP processors often contain ALUs dedicated to memory address computations. Many DSP applications operate over data in arrays, e.g., filters exploit delay lines, as illustrated by the FIR filter example in (1), and discrete transforms imply block processing. In early DSP processors, delay lines were implemented as a linear FIFO buffer in memory, which means that the samples need to be copied from one memory location to the next at each iteration. This implies a need for extra memory bandwidth. Similar behavior without the need for moving data is obtained by exploiting a circular buffer or ring buffer, i.e., a memory array managed with the aid of a pointer. Each time a new sample enters the buffer, the sample is stored to the memory location pointed to by the pointer, and then the pointer is incremented.
Fig. 13 Principal block diagram of the address generation unit of DSP56300
Once the buffer is filled and the pointer exceeds the reserved memory area, the pointer is reinitialized to point to the beginning of the buffer. Pointers can be maintained in specific address registers and updated with the aid of address ALUs. The pointer increment behaves as modulo arithmetic, thus circular buffers are maintained with the aid of modulo addressing. The previous features are used with the aid of addressing modes, and the most powerful addressing mode in DSP processors is typically register indirect addressing with pre- or post-updates. The powerful addressing modes imply that a single DSP instruction specifies several operations, thus loop kernels in DSP processors are often short compared to the corresponding kernels in general-purpose processors. The address generation unit of DSP56300, illustrated in Fig. 13, is an example of a versatile and flexible address unit. Registers Rn, Nn, and Mn operate as triplets such that Rn contains the address that is used to compute the effective operand address. This register can be pre- or post-updated depending on the selected addressing mode. Register Nn contains an offset or index, which is added to the address in Rn. Register Mn is a modifier, which defines the type of arithmetic used in the address computation: linear, modulo, or reverse-carry. Linear addressing is used for general-purpose addressing, modulo addressing is used with circular buffers, and reverse-carry addressing is useful for realizing the bit-reversed permutation found in radix-2 FFT. As there are eight register triplets, the programmer can maintain eight different types of buffers simultaneously. The EP register is the stack extension pointer, and it is not used for operand addressing. Dedicated address generation units are used in conventional DSP processors, while multi-issue processors, e.g., VLIW processors, contain more parallel resources, and their arithmetic units can be used for computing both addresses and signals.
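Modulo addressing can be emulated in C as below; an address generation unit performs the wrap as part of the pointer post-update, without the explicit modulo operation (the buffer length and names here are illustrative).

```c
#include <stdint.h>

#define BUF_LEN 64u               /* circular (ring) buffer length */

static int16_t delay_line[BUF_LEN];
static uint32_t head;             /* write pointer into the buffer */

/* Insert a new sample: store at the pointer, then post-increment the
 * pointer with modulo arithmetic so that it wraps to the buffer start,
 * avoiding the data moves of a linear FIFO delay line. */
void push_sample(int16_t sample)
{
    delay_line[head] = sample;
    head = (head + 1u) % BUF_LEN; /* one post-update in an address ALU */
}
```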
6 Program Control DSP algorithms are quite often data-independent and, therefore, the computations are highly data-oriented, in contrast to control-oriented algorithms containing data-dependent computations. In addition, the algorithms often define tight loops, thus a DSP processor should have efficient looping capabilities. If the loop kernel contains only a few instructions, it is clear that the loop control overhead, i.e., the instructions needed to decrement the loop counter, test against the end condition, and branch, can be significant. DSP processors often have a zero-overhead looping capability (also known as hardware looping), which means that the processor has hardware support for the decrement-test-branch sequence, thus the control overhead of software looping can be avoided. Hardware looping, however, may introduce side effects or constraints, e.g., interrupts are disabled during the loop execution, branches or certain addressing modes are not allowed in loop kernels, or the number of instructions in the loop kernel is limited. In pipelined processors, branching is an expensive operation, as the destination address of the branch is often obtained in late pipeline stages, implying that several instructions have already been fetched into the pipeline but, due to the branch, should not be executed. Therefore, the pipeline needs to be flushed. This means that a standard branch consumes several cycles and is called a multi-cycle branch. One alternative method is to execute the instructions already fetched into the pipeline. This means that the instructions immediately following the branch instruction will be executed regardless of the branch. Such a method is called a delayed branch, and the number of instructions executed after the branch instruction is called the delay slot. In decision-intensive code, branching is needed often, and it would be extremely expensive in architectures with deep pipelines. One approach to avoid multi-cycle branching in such cases is conditional execution. This means that a condition code is integrated into the instruction, and the instruction is executed only when the condition is true. The instruction passes through the pipeline normally, and only the execution stage is conditional.
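As an illustration of what conditional execution buys, consider the add-compare-select kernel of a Viterbi decoder; on an architecture with predicated instructions, a compiler can turn the sketch below into straight-line code with no branch (a hedged example, not vendor-specific code).

```c
#include <stdint.h>

/* Add-compare-select: the comparison sets a condition code and the
 * select becomes two conditionally executed moves, so a deep pipeline
 * is never flushed by a data-dependent branch. */
static int32_t add_compare_select(int32_t path0, int32_t branch0,
                                  int32_t path1, int32_t branch1)
{
    int32_t m0 = path0 + branch0;
    int32_t m1 = path1 + branch1;
    return (m0 > m1) ? m0 : m1;   /* compare + conditional moves */
}
```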
7 Conclusions DSP processor architectures have evolved over time. Early DSP processors were programmed manually in assembly language, and high performance was obtained with specialized function units operating in parallel. This implied non-orthogonal, complex instruction set computer (CISC) type instruction sets that are poor targets for compilers. Although reduced instruction set computer (RISC) and superscalar architectures have gained popularity in general-purpose computing, they have not been the mainstream in DSP processors. This is partly due to the fact that DSP applications often set hard real-time requirements, indicating that the execution time of the software implementation should be deterministic.
Typically, the features used in superscalar processors introduce dynamic run-time behavior resulting in non-deterministic execution times. Various methods to exploit parallelism have been used to improve the performance. Conventional DSP processor architectures have been enhanced with multiple MAC units, e.g., TMS320C55x and ADSP-BF5xx, or with additional co-processors. Multiple function units are exploited with SIMD extensions; e.g., in the SIMD mode of ADSP-21xxx, a second data path executes the same operations as the instruction defines for the first data path, but on a different register file. Multi-issue processors in the form of VLIW machines have gained popularity in DSP applications, e.g., the TI C6000 and Freescale StarCore families. Multiple DSP processors are also integrated in a single chip; e.g., Freescale MSC8156 contains six SC3850 VLIW cores and additional co-processors for accelerating turbo and Viterbi decoding and FFT and CRC computations.
8 Further Reading There are several chapters in this handbook considering issues related to general-purpose digital signal processors. Chapter 12 provides an introduction to arithmetic, customized DSP processors are discussed in Chapter 16, and a 3D integrated VLIW digital signal processor is investigated in Chapter 19. Compilers for traditional DSP processors are considered in Chapter 21 and for VLIW DSPs in Chapter 22. DSP extensions to the C language, which allow more efficient exploitation of specific features of digital signal processors, are introduced in Chapter 27. There are many textbooks discussing DSP processors, and interested readers are advised to consider the following: [1, 15, 18–20]. In addition, Berkeley Design Technologies [6] carries out independent benchmarking of DSP processors, and their web page contains useful information on various issues related to DSP implementations. Finally, DSP processor vendors provide application notes, which are a rich source of practical information, e.g., Texas Instruments [26], Freescale [8], Analog Devices [2], and LSI [21].
References
1. Ackenhusen, J.G.: Real-Time Signal Processing: Design and Implementation of Signal Processing Systems. Prentice-Hall, Inc., Upper Saddle River, NJ (1999)
2. Analog Devices, Inc.: URL http://www.analog.com/
3. Analog Devices, Inc.: TigerSHARC Embedded Processor: ADSP-TS201S (2006)
4. Analog Devices, Inc.: SHARC Processor: ADSP-21060/ADSP-21060L/ADSP-21062/ADSP-21062L/ADSP-21060C/ADSP-21060LC (2008)
5. ANSI/IEEE Std 754-1985: IEEE standard for binary floating-point arithmetic. Standard, The Institute of Electrical and Electronics Engineers, Inc., New York, NY, U.S.A. (1985)
6. Berkeley Design Technologies, Inc.: URL www.bdti.com
7. Budnik, P., Kuck, D.: The organization and use of parallel memories. IEEE Trans. Comput. 20(12), 1566–1569 (1971)
8. Freescale Semiconductor, Inc.: URL http://www.freescale.com/
9. Freescale Semiconductor, Inc.: DSP56300 Family Manual: 24-Bit Digital Signal Processors (2005)
10. Freescale Semiconductor, Inc.: Differences Between the MSC8144 and the MSC815x DSPs (2009)
11. Hagiwara, Y., Kita, Y., Miyamoto, T., Toba, Y., Hara, H., Akazawa, T.: A single chip digital signal processor and its application to real-time speech analysis. IEEE Trans. Acoustics Speech Signal Process. 31(1), 339–346 (1983)
12. Harper III, D.: Increased memory performance during vector accesses through the use of linear address transformations. IEEE Trans. Comput. 41(2), 227–230 (1992)
13. Hayes, W., Kershaw, R., Bays, L., Boddie, J., Fields, E., Freyman, R., Garen, C., Hartung, J., Klinikowski, J., Miller, C., Mondal, K., Moscovitz, H., Rotblum, Y., Stocker, W., Tow, J., Tran, L.: A 32-bit VLSI digital signal processor. IEEE J. Solid-State Circuits 20(5), 998–1004 (1985)
14. Hoff, M.E., Townsend, M.: An analog input/output microprocessor for signal processing. In: IEEE Int. Solid-State Circuits Conf., Digest Tech. Papers, p. 220 (1979)
15. Hu, Y.H.: Programmable Digital Signal Processors: Architecture, Programming, and Applications. Marcel Dekker, New York, NY (2002)
16. Hwang, K., Briggs, F.A.: Computer Architecture and Parallel Processing. McGraw-Hill Book Co., Singapore (1985)
17. Kloker, K.: Motorola DSP56000 digital signal processor. IEEE Micro 6(6), 29–48 (1986)
18. Kuo, S.M., Gan, W.S.: Digital Signal Processors: Architectures, Implementations, and Applications. Pearson Education, Inc., Upper Saddle River, NJ (2005)
19. Kuo, S.M., Lee, B.H.: Real-Time Digital Signal Processing: Implementations, Applications, and Experiments with the TMS320C55x. Wiley, New York, NY (2001)
20. Lapsley, P.D., Bier, J., Shoham, A., Lee, E.A.: DSP Processor Fundamentals: Architectures and Features. Berkeley Design Technology, Inc., Fremont, CA (1996)
21. LSI Corp.: URL http://www.lsi.com/
22. LSI Corp.: DSP16410 Digital Signal Processor (2007)
23. Magar, S., Caudel, E., Leigh, A.: A microcomputer with digital signal processing capability. In: IEEE Int. Solid-State Circuits Conf., Digest Tech. Papers, vol. XXV, pp. 32–33 (1982)
24. Nishitani, T., Kuroda, I., Kawakami, Y., Tanaka, H., Nukiyama, T.: Advanced single-chip signal processor. In: Proc. IEEE Int. Conf. Acoustics, Speech, and Signal Process., vol. 11, pp. 409–412. Tokyo, Japan (1986)
25. Nishitani, T., Maruta, R., Kawakami, Y., Goto, H.: A single-chip digital signal processor for telecommunication applications. IEEE J. Solid-State Circuits 16(4), 372–376 (1981)
26. Texas Instruments, Inc.: URL http://www.ti.com/
27. Texas Instruments, Inc.: TMS320C54x DSP Functional Overview (2006)
28. Texas Instruments, Inc.: C55x v3.x CPU: Reference Guide (2009)
29. Texas Instruments, Inc.: TMS320C6455 Fixed-Point Digital Signal Processor (2009)
30. Townsend, M., Hoff Jr., M.E., Holm, R.E.: An NMOS microprocessor for analog signal processing. IEEE Trans. Comput. 29(2), 97–102 (1980)
Application Specific Instruction Set DSP Processors Dake Liu
Abstract In this chapter, Application Specific Instruction Set Processors (ASIP) are introduced and discussed for readers who want general information about ASIP technology. The introduction covers the ASIP design flow, source code profiling, architecture exploration, assembly instruction set design, design of the assembly language programming toolchain, firmware design, benchmarking, and microarchitecture design. Two examples are introduced: design for instruction set level acceleration of radio baseband processing, and design for instruction set level acceleration of image and video signal processing.
1 Introduction 1.1 ASIP Definition An ASIP is an Application Specific Instruction-Set Processor designed for an application domain. An ASIP instruction set is specifically designed to accelerate computationally heavy and frequently used functions. The ASIP architecture is designed to implement the assembly instruction set with minimum hardware cost. An ASIP DSP is an application-specific digital signal processor for applications heavy in iterative data manipulation, transformation, and matrix computing. It is important to understand when the use of an ASIP is appropriate. If the design parameters of an ASIP are well chosen, the ASIP will contribute benefits to a group of applications. An ASIP is suitable when: (1) the assembly instruction set will not be used as an instruction set for general-purpose computing and control; (2) the system performance or power consumption requirements cannot be achieved
by general-purpose processors; (3) the volume is high or the design cost is not a sensitive parameter; and (4) it is possible to use the ASIP for multiple products.
1.2 Difference Between ASIP and General CPU Designers of general-purpose processors think of both maximum performance and maximum flexibility. The instruction set must be general enough to support general applications. The compiler should offer compilation for all programs and adapt to all programmers' coding behaviors. ASIP designers have to think about applications and cost first. Usually the primary challenges for ASIP designers are silicon cost and power consumption. Based on the carefully specified function coverage, the goal of an ASIP design is to reach the highest performance over silicon cost, the highest performance over power consumption, and the highest performance over design cost. The requirement on flexibility should be sufficient instead of ultimate, and the performance is application specific instead of the highest possible. To minimize silicon cost, an ASIP design usually aims at a custom performance requirement instead of the ultimate possible performance. Programs running on an ASIP might be relatively short and simple, with ultra-high coding efficiency; requirements on tool quality, such as the quality of the compiler, can be application specific. For example, for radio baseband, the requirement for a compiler may not really be mandatory. The main difference between a general-purpose processor and an ASIP DSP is the application domain. A general-purpose processor is not designed for a specific application class, so it should be optimized based on the performance of the "application mix". The application mix has been formalized and specified using general benchmarks such as SPEC (Standard Performance Evaluation Corporation). The application domain of an ASIP is usually limited to a class of specific applications, for example video/audio decoding or digital radio baseband. The design focus of an ASIP is on specific performance and specific flexibility with low cost for solving problems in a specific domain. A general-purpose microprocessor aims for the maximum average performance instead of specific performance. Except for memories, there are two major families of integrated digital semiconductor components: microprocessors and ASICs. Microprocessors take the role of flexible computing of all programs running in desktops or laptops. An ASIC supplies application-specific functionality, and its functional coverage and performance are fixed at hardware design time, in exchange for higher performance, lower silicon area, and lower power consumption. From the 1980s to the 90s, ASICs played dominant roles in communications and consumer electronics products; for example, the radio baseband in mobile phones was based on ASIC. After the year 2000, more standards in radio communications and audio/video processing emerged and needed to be integrated in a single product, thus necessitating the integration of more ASIC circuit modules in one product. For example, GSM
Fig. 1 Knowledge required in ASIP design: the ASIP architecture, the ASIP FW (firmware) design tools, and the ASIP FW design itself
For example, GSM in dual frequency bands, WCDMA, HSPA, WLAN (including IEEE 802.11a/g/b/n), LTE, and Bluetooth might all be needed in one handset. At the same time, the mobile phone should also support different voice codec standards, audio players, image codecs, and various video codecs. The cost of a design with many ASIC modules therefore surges, and ASIPs are gradually replacing ASICs in many cost sensitive multi-mode products. Here multi-mode means either the coverage of multiple applications or a single application supporting several standards.
2 DSP ASIP Design Flow

ASIP DSP design can be divided into three parts, as illustrated in Fig. 1. The first part is the ASIP architecture design, including architecture exploration and the design of the assembly instruction set and the instruction set architecture. The second part is the design of the ASIP FW (firmware) programming tools, including the compiler, assembler, linker, and instruction set simulator. The third part is the ASIP firmware design, including the benchmarking code and the application FW code. These three parts can be identified in the ASIP design flow depicted in Fig. 2, which provides guidance from requirement specification down to microarchitecture specification and design.

Following the ASIP hardware design flow in Fig. 2, the starting point of an ASIP design is to specify the requirements, including application coverage, performance, and costs. The inputs of this step are the product specification and the collection of underlying standards. By analyzing markets, competitors, technologies, and standards, the step produces the product requirements: function coverage, performance, and costs.

After the functional coverage requirement of the ASIP has been determined, application source code is collected and profiled in order to understand the applications further. The inputs of source code profiling are the collected source code and the application scenarios. The outputs are the results of the code analysis, including the estimated complexity, computing cost, and memory cost. During source code profiling, an early partitioning of hardware and software can be proposed.
Fig. 2 DSP ASIP hardware design flow: functional and requirement specification; application source code profiling (90%-10% locality, critical path identification, function coverage check); ASIP architecture proposal; an iterative design process covering the assembly instruction set and its instruction set simulator, the HW requirement specification, and the search for a reference instruction set; benchmarking, cost trade-off, and architecture design; finally, microarchitecture specification and RTL coding
Hardware/software partitioning for an ASIP is the trade-off of allocating functions to hardware or to software tasks. Implementing a task in software means writing a subroutine for it in assembly language or in C; implementing a task in hardware means implementing it as one or several accelerated instructions.

ASIP instruction set design can be guided by the "10%-90% code locality" rule, which has been used as a rule of thumb in ASIP design at companies such as Coresonic [2]. The locality rule says that 10% of the instructions account for 90% of the execution time, while the remaining 90% of the instructions account for only 10%. In other words, the essence of ASIP design is to find the instruction set architecture that best accelerates the 10% most frequently used instructions, and to specify the remaining 90% of the instructions so as to fulfill the functional coverage (see the measurement sketch below). Based on the exposed 90-10% code locality, hardware requirements can be specified and a specific ASIP architecture proposed accordingly.

The ASIP architecture proposal can be based on a selected available architecture with modifications. If flexibility is an essential requirement, selecting an available architecture such as VLIW might be preferred; a typical case is to select a VLIW or master-slave SIMD architecture for multi-mode image and video signal processing. An ASIP architecture could also be a dataflow architecture generated from the dataflow analysis performed during source code profiling. If the requirements on performance and computing latency are exceptionally tough, the usual available architectures may not offer sufficient performance, and a dataflow architecture may be a good choice; for example, a dataflow architecture suits a radio baseband processor in an advanced radio base station.

Following the proposed architecture, an assembly instruction set is proposed and its instruction set simulator is implemented for benchmarking, and a cost trade-off analysis of the instruction set and the architecture is performed.
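As a concrete illustration of measuring the 90-10% locality, the following sketch counts opcode frequencies in a dynamic instruction trace and reports how few distinct opcodes cover 90% of the executed instructions. It is a minimal C sketch under an assumed trace format (one opcode mnemonic per line on standard input, as an instruction set simulator might emit); the format and the limits are illustrative, not taken from any particular tool.

/* Sketch: measure 90-10 instruction locality from an ISS trace.
 * Assumes a hypothetical trace file with one opcode mnemonic per line. */
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

#define MAX_OPS 256                 /* distinct opcodes we can track */

typedef struct { char name[32]; long count; } OpStat;

static int cmp_desc(const void *a, const void *b) {
    long d = ((const OpStat *)b)->count - ((const OpStat *)a)->count;
    return (d > 0) - (d < 0);
}

int main(void) {
    OpStat ops[MAX_OPS];
    int n_ops = 0;
    long total = 0;
    char line[32];

    while (fgets(line, sizeof line, stdin)) {   /* read trace from stdin */
        line[strcspn(line, "\n")] = '\0';
        int i;
        for (i = 0; i < n_ops; i++)
            if (strcmp(ops[i].name, line) == 0) break;
        if (i == n_ops && n_ops < MAX_OPS) {    /* opcode seen for the first time */
            strncpy(ops[n_ops].name, line, sizeof ops[0].name - 1);
            ops[n_ops].name[sizeof ops[0].name - 1] = '\0';
            ops[n_ops].count = 0;
            n_ops++;
        }
        if (i < MAX_OPS) { ops[i].count++; total++; }
    }
    qsort(ops, n_ops, sizeof ops[0], cmp_desc); /* hottest opcodes first */

    long running = 0;
    for (int i = 0; i < n_ops; i++) {           /* how few opcodes cover 90%? */
        running += ops[i].count;
        if (running * 10 >= total * 9) {
            printf("%d of %d opcodes cover 90%% of %ld executed instructions\n",
                   i + 1, n_ops, total);
            break;
        }
    }
    return 0;
}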
The inputs of the assembly instruction set design are the architecture proposal, the profiling results of the source code, and the 90-10% code locality analysis. If a good reference instruction set can be found in a previous product or from a third party, the ASIP design will be faster. The output of the assembly instruction set design is the assembly instruction set manual. Instruction set design consists of the functional design of the instructions, the allocation of functions to hardware, and the coding of the instruction set; the coding covers the design of the assembly syntax and of the binary machine codes. Finally, binary code is executed on the instruction set simulator for benchmarking, which exposes the performance of the instruction set and the usage of each instruction for further optimization.

Before an assembly instruction set is released, it should be evaluated via instruction set benchmarking. In processor design, a benchmark is a program designed to measure the performance of a processor in a particular application; benchmarking is the execution of such a program so that processor users can evaluate the processor's performance and cost. Benchmarking methods and benchmark scores for DSP processors can be found at BDTI [1]; more general benchmarking knowledge is available at EEMBC [3]. To support benchmarking, programming tools including the compiler, assembler, and instruction set simulator must be developed. Toolchain design and programming techniques for benchmarking are discussed later in this chapter.

After benchmarking and the instruction set release, the architecture can be designed. The inputs for architecture design are the assembly instruction set manual, the early architecture proposal, and the requirements on performance and costs. The output of the architecture level design is the architecture specification document: a specification of the top level hardware modules (the datapath including the register file (RF), ALU, and MAC; the control path; the bus system; and the memory subsystem) and of the interconnections between them. During architecture design, functions are allocated to modules and scheduled between modules, and the bandwidth and latency of the buses are analyzed.

The inputs of microarchitecture design are the architecture specification, the assembly instruction set manual, and knowledge of the selected silicon technology. The output of the processor microarchitecture design is the bit accurate and cycle accurate description of the hardware modules (such as ALU, RF, MAC, and control path) and of the inter-module connections, which serves as the input for the RTL coding of the processor. During architecture design, functional modules are specified but their implementations are left open: at the architecture level, functional modules are usually treated as black boxes, and in microarchitecture design the missing details inside each module are specified. Microarchitecture design in general specifies the implementation of the functional modules in detail, including the inter-module functions.
Microarchitecture design includes function partitioning, allocation, connection, scheduling, and integration. The optimization of timing-critical paths and the minimization of power consumption are also specified during microarchitecture design. Processor microarchitecture design, in particular, implements the function of each assembly instruction in the hardware modules involved. It includes:

• the partitioning of each instruction into micro-operations,
• the allocation of each micro-operation to hardware modules,
• the scheduling of each micro-operation into pipeline stages,
• performance optimization, and
• cost minimization.
The ASIP hardware design flow starts from the requirement specification and finishes after the microarchitecture design. The design of an ASIP is mostly based on experience, and the essential goal of the design flow is to minimize the cost of design iterations. Designing an ASIP is a risky and expensive business, but very rewarding when the design is implemented successfully. After the microarchitecture of an ASIP has been specified, the VLSI implementation is the same as in any other silicon backend design and can be found in any backend VLSI book; RTL coding and silicon implementation are outside the scope of this chapter.
3 Architecture and Instruction Set Design

In this section, the ASIP design methodology is refined, and architecture and assembly instruction set design methods are introduced.
3.1 Code Profiling

The development of a large application may involve the experience of hundreds of people over many years, and application engineers take years to learn and accumulate system design experience. ASIP designers, in contrast, are hardware engineers who specialize in circuit specification and implementation; they cannot realistically absorb all the design details of an application in a short time. Source code profiling fills the gap between system design and hardware design. The purpose of profiling is not to understand the system design, but rather to understand the cost of the design: the dynamic behavior of the code execution, the static behavior of the code structure, the hardware cost, and the runtime cost. Source code profiling is thus a technique to isolate the ASIP design from the design of the applications.
Fig. 3 Static and dynamic profiling: source code passes through a compiler front-end (lexical analysis, parsing, CFG extraction); the CFG feeds static profiling, while annotation and stimuli drive dynamic profiling; both expose the critical path recorded in the ASIP design document
Code profiling can be dynamic or static. Static profiling analyzes the code instead of running it, so it is independent of the data (the stimuli). Static code profiling exposes the code structure and the critical (maximum, worst case) execution time. A static code profiler is based on a compiler frontend, the code analyzer, whose output is a control flow graph (CFG) exposing the code structure. All runtime costs, including computing and data access, can be assigned to the branches of the graph, and the total cost can be estimated by accumulating the cost along each path. The critical path with the maximum execution time should be identified via source code analysis for the ASIP design. However, a critical path may seldom or never be executed, so dynamic profiling is also needed.

Dynamic profiling runs the code on selected data stimuli, so it is data dependent. If the stimuli are chosen correctly, dynamic profiling exposes the statistics of the execution behavior, and it is usually used to identify the 90-10% code locality. By placing probes in interesting paths, statistics on critical path occurrences can be obtained, and the most timing-critical functions are thereby identified as inputs for designing accelerated architectures and instructions. Dynamic profiling, however, takes much time, so careful selection of stimuli is an essential ASIP design skill. More detailed information on profiling techniques can be found in Chapter 6 of [6]. An illustrative view of code profiling for ASIP design is given in Fig. 3.
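A minimal sketch of the probe idea in dynamic profiling follows: counters are inserted at interesting paths of the application C code and the program is run on representative stimuli. All function names and the probe layout are illustrative.

/* Sketch: manual dynamic-profiling probes in application C code.
 * PROBE(id) increments a counter each time a path is executed; the
 * counts, gathered over representative stimuli, expose the 90-10
 * locality. All function names below are illustrative. */
#include <stdio.h>

#define N_PROBES 8
static unsigned long probe_count[N_PROBES];
#define PROBE(id) (probe_count[(id)]++)

static int filter_sample(int x)   { PROBE(0); return x >> 1; }  /* hot path  */
static int handle_overflow(int x) { PROBE(1); (void)x; return 0x7fff; } /* rare */

int main(void) {
    for (int i = 0; i < 100000; i++) {          /* stimuli-driven run */
        int y = filter_sample(i & 0xffff);
        if (y > 0x7000) y = handle_overflow(y); /* seldom taken */
        (void)y;
    }
    for (int id = 0; id < N_PROBES; id++)       /* report hit counts */
        if (probe_count[id])
            printf("probe %d: %lu hits\n", id, probe_count[id]);
    return 0;
}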
Fig. 4 Available architecture selection: six architectures (single MAC, dual MAC (DMAC), SIMD, VLIW, superscalar, and multi-core) positioned along four axes — computing performance, complexity handling performance, power efficiency, and silicon efficiency
3.2 Using an Available Architecture

Requirements on performance and function coverage were specified during code profiling. If the required performance can be met by an available architecture, selecting one is a good way to reduce the design complexity by using available competence. In principle there are so many architectures to choose from that deciding on a suitable one for a class of applications is not easy; this is why architecture selection is mostly based on experience, and why the drawbacks of the selected architecture usually show up right after it has been chosen. Architecture selection can be based on the architectural templates given in Fig. 4, which considers four dimensions: performance, complexity handling, power efficiency, and silicon efficiency. Here the performance measures include arithmetic computing, address computing, load/store, and control. Control covers data quality control in software running on fixed point datapath hardware, hardware resource control, and program flow control. Control complexity may not be exposed during source code profiling and should then be predicted from design experience.

Complexity can be further divided into data complexity and control complexity. Data complexity stands for the irregularity of data, data storage, and data access. In general computing, the impact of data irregularity can be hidden by using high-precision data types; in embedded computing, a variety of data types is used to minimize silicon cost and power consumption. Data complexity also includes the complexity of memory addressing and data access. The design of data storage and data access is based on knowledge of the memory subsystem (HW) and the data structures (SW). The complexities of data storage and data access in DSP applications differ from those of general computing.
In general computing, the complexity comes from the requirements of flexible addressing and a large address space, and it is hidden by hardware caches. In embedded DSP, the complexity of data storage and data access is induced by the parallel data access required by advanced algorithms; handling this data complexity successfully can give a competitive edge. Data access complexity cannot be exposed during source code profiling, because the complexity of physical addressing is hidden behind the language of the source code.

Control complexity is exposed when handling program flow control and hardware resource management. At the subroutine level, control complexity is mixed with data complexity, for example when managing data dependencies. At the higher level of an application program, the complexity lies in algorithm selection, data precision control, working mode selection (data rates, levels, profiles, etc.), and asynchronous event handling (such as interrupt and exception handling). At the top or system level, the complexity lies in hardware resource management through threading control and in the control of interprocessor communication. Control can be managed efficiently if instructions for conditional execution, conditional branches, I/O control, and bit manipulation are carefully designed.

The power efficiency of a system is the average performance over the average power consumption. Low power ASIP design is the process of minimizing redundant operations (to reduce the dynamic power) and minimizing the circuit size (to reduce the static power). As transistor sizes shrink, power efficiency and silicon efficiency become more closely related, because static power is no longer negligible and can even become dominant. In a modern high-end ASIP design the memory consumes most of the power, so designing for power efficiency is almost the same as designing an efficient memory architecture: optimizing the memory partition, minimizing the memory size, and minimizing memory transactions.

The silicon efficiency of a system is defined as the average performance over the ASIP silicon area. As discussed above, low power design is in general equivalent to designing for silicon efficiency. The on-chip memory of a silicon efficient solution is relatively small; if it is too small, however, the design may induce extra on-chip/off-chip data swapping and thereby extra power consumption.

Fig. 4 supports architecture selection among six popular DSP processor architectures. The single MAC architecture is a typical low cost DSP architecture with a single multiplication-and-accumulation unit in the datapath. Only one iterative computation can be executed at a time, so the performance may not suffice for advanced applications, but the complexity handling capability can be relatively high if the instruction set is advanced. This architecture has the lowest silicon cost, and its power consumption is limited.

The dual MAC architecture is a high performance and efficient architecture with two multiplication-and-accumulation units in the datapath. Two iterative computations, or two steps of one iterative computation, can be executed at a time, so with the same complexity handling capability the iterative computing performance of a dual MAC machine can be doubled.
Compared to the single MAC machine, the silicon cost and power consumption may be only about 20% higher, so the silicon efficiency and power efficiency of a dual MAC machine are much better.

A SIMD (Single Instruction Multiple Data) machine performs multiple operations while executing one instruction, so its computing performance is high. However, a SIMD machine can exploit data level parallelism only when the computation is regular; it exposes its weaknesses when handling irregular operations for complex control functions. Power and silicon efficiency can be very high if a SIMD machine handles only regular vector signal processing. In most systems, SIMD is a slave processor accelerating signal processing on regular data structures, with a master handling the complex control functions. Because control and parallel data processing are allocated to two processors, the extra computing latency induced by inter-processor communication must be taken into account.

When the requirements on both computing performance and complexity handling are high at the same time, a VLIW (Very Long Instruction Word) machine may be the solution. A VLIW machine executes multiple instructions in one clock cycle, but it has no hardware data dependency checker, so advanced compiler technology must be available together with the VLIW technology. A VLIW machine is very useful for applications requiring both performance and complexity handling, and is therefore popular for general purpose DSP with high throughput. When designing an ASIP, however, the application is known, and a design based on a master with slave SIMD units and accelerators may be more efficient.

When the applications are not fully known and sufficient compiler competence is not available, a VLIW architecture cannot be used, and a superscalar architecture may be the choice. A superscalar processor checks data dependencies dynamically in hardware, at the price of the extra hardware needed for the dependency check. The hardware cost of a superscalar is therefore relatively higher, and its power and silicon efficiency are relatively lower than those of a VLIW, but a superscalar offers higher performance when handling complex control. Superscalar ASIPs can be found in search and recognition based applications.

When control complexity cannot be separated from vector computing, a VLIW or superscalar processor should be used. If control complexity can be separated from vector computing, and a master processor is available for the top-level DSP application program, a SIMD architecture is preferred to enhance the performance of vector/iterative processing. SIMD, VLIW, and superscalar machines all have a single control path issuing instructions, so they are ILP (Instruction Level Parallel) machines.

Finally, a multi-core solution is needed when the computation cannot be handled by one processor. A multi-core solution introduces the extra control complexity of inter-task dependence, control, and communication between processors, as well as partition-induced overheads. Multi-core ASIPs are beyond the scope of this chapter.
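Returning to the dual MAC case: the benefit is easiest to see in a kernel written so that two independent multiply-accumulate chains are exposed. The following C sketch of an FIR inner loop is illustrative only; whether a given dual MAC datapath or compiler actually pairs the two MACs in each cycle is target dependent.

/* Sketch: FIR inner loop written for a dual-MAC datapath. Splitting
 * the sum into two independent accumulators exposes two MAC
 * operations per iteration, which two MAC units can run in parallel. */
#include <stdint.h>

int32_t fir_dual_mac(const int16_t *x, const int16_t *h, int n_taps)
{
    int32_t acc0 = 0, acc1 = 0;
    int i;
    for (i = 0; i + 1 < n_taps; i += 2) {
        acc0 += (int32_t)x[i]     * h[i];       /* MAC unit 0 */
        acc1 += (int32_t)x[i + 1] * h[i + 1];   /* MAC unit 1 */
    }
    if (i < n_taps)                             /* odd tap count */
        acc0 += (int32_t)x[i] * h[i];
    return acc0 + acc1;                         /* combine partial sums */
}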
Fig. 5 Possible custom architectures, ordered by autonomy level and complexity: function solver, FSM driven datapath, programmable slave machine, and distributed multiprocessors
The selected architecture serves as a template for the ASIP; architectural level modifications and instruction level adaptations follow, and the final architecture of the ASIP may be none of the architectures listed in Fig. 4.
3.3 Generating a Custom Architecture

A normal processor architecture is a control flow (von Neumann) architecture, which can handle only a limited number of operations per clock cycle. A dataflow architecture may be introduced as an ASIP architecture to handle many operations, even hundreds, per clock cycle. In this case, optimized algorithm code can be fully or partly mapped to hardware to form a dataflow machine. Task processing in a classical dataflow machine is triggered by a new event, which could be a system input "probe" or the data output "tag" of a previous execution; the execution control of a classical dataflow machine is not a program counter based FSM. Dataflow architectures have, however, been successfully merged into normal control flow architectures as slave machines of a main processor. In Fig. 5, a slave machine can be a function solver, an FSM driven datapath, or a programmable slave processor.

Besides dataflow machines, there are other kinds of custom architectures. A function solver can be part of the datapath of an ASIP: a simple hardware module directly controlled by the control path of the processor, typically carrying out a single function such as 1/x, square root, or a functional module for complex data processing. It is usually driven by an I/O instruction or an accelerated instruction, and its autonomy level is low.

An FSM driven datapath usually offers autonomous execution of a single algorithm, triggered by arriving data or by the master machine. This architecture is typically used for functions requiring ultra high performance, or for functions relatively independent of the master machine.

A programmable slave machine is controlled by a local programmable control path, and its autonomy level is rather high.
Fig. 6 Implementation of a task flow architecture: (a) the behavior model, a chain of functions (band pass filter, function 1, transform, filter, functions 2 and 3); (b) the HW implementation, in which configuration registers, an FSM, and memories step the data through the input buffer, band pass filter, function 1, transform, filter, function 2, function 3, and output buffer
However, task execution in a programmable slave machine is controlled by the master processor. A programmable slave machine is usually an ASIP; SIMD is a typical example. Finally, a distributed system can be built from distributed ASIP processors: no processor is assigned as master and none is principally a slave; each machine runs its own task, and communication between tasks is based on message passing.

An FSM driven datapath in Fig. 5 can be a task flow machine. The task flow architecture is a typical custom architecture, also called function mapping hardware or algorithm mapping architecture: it is the direct implementation of a DSP control flow graph. Fig. 6 depicts the hardware implementation of a control flow graph. The left part, Fig. 6 (a), is the behavioral task control flow graph of the DSP application. If the control flow graph is relatively simple and will not change during the lifetime of the product, it can be implemented in hardware as in Fig. 6 (b), where the datapath is a direct mapping of the control flow graph of Fig. 6 (a). The FSM controls the execution of the DSP task in each hardware block, step by step, and the memories serve as computing data buffers passing data between the hardware function blocks.

The control flow graph architecture (a kind of dataflow architecture) is preferred when the complexity is not high and the data rate is very high. In such an architecture hardware multiplexing may not be necessary, configuration of the hardware is easy, and programmability is not mandatory. The dataflow architecture is mostly used for function acceleration at the algorithmic level. Here configurability refers to the ability to change the system functionality from an external device.
Fig. 7 Design space of an ASIP assembly instruction set: the highest functional coverage, the highest performance, handling of control complexity, low data access latency, low memory cost, low silicon cost, low design cost, low power consumption, high I/O bandwidth, and a compiler friendly instruction set all pull on the design
Programmability stands for the ability of the hardware to run programs. The difference between configurability and programmability is the way the hardware is controlled: a configuration is relatively stable and does not change every clock cycle, whereas a running program changes the hardware function in each clock cycle.
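The distinction can be made concrete with a behavioral C model of an FSM driven task flow datapath: the FSM steps a buffer through fixed function blocks, and only configuration values, not a program, select the behavior. The stages and stand-in functions below are illustrative, not taken from any particular design.

/* Sketch: behavioral model of an FSM-driven task-flow datapath.
 * The FSM steps a buffer through fixed function blocks; the blocks
 * are hardwired and only their parameters are configurable. */
#include <stddef.h>

enum state { S_INPUT, S_FILTER, S_TRANSFORM, S_OUTPUT, S_DONE };

typedef struct { int gain; } config_t;      /* configuration registers */

static void band_pass(int *buf, size_t n, int gain) {
    for (size_t i = 0; i < n; i++)          /* stand-in for a real filter */
        buf[i] = (buf[i] * gain) >> 4;
}
static void transform(int *buf, size_t n) {
    for (size_t i = 0; i + 1 < n; i += 2) { /* stand-in butterfly stage */
        int a = buf[i], b = buf[i + 1];
        buf[i] = a + b;
        buf[i + 1] = a - b;
    }
}

void run_task_flow(int *buf, size_t n, const config_t *cfg)
{
    enum state s = S_INPUT;
    while (s != S_DONE) {                   /* the FSM steps the task */
        switch (s) {
        case S_INPUT:     /* buffer already loaded */     s = S_FILTER;    break;
        case S_FILTER:    band_pass(buf, n, cfg->gain);   s = S_TRANSFORM; break;
        case S_TRANSFORM: transform(buf, n);              s = S_OUTPUT;    break;
        case S_OUTPUT:    /* hand buffer to next block */ s = S_DONE;      break;
        default:                                          s = S_DONE;      break;
        }
    }
}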
4 Instruction Set Design

Unlike that of a general purpose processor, an ASIP instruction set consists of two parts: the first part provides function coverage and the second part provides function acceleration. To simplify the ASIP design, the first part of the instruction set should be relatively simple. The second part offers acceleration both for computationally intensive functions and for the most frequently executed functions.

One would always like an assembly instruction set to be optimized for all possible applications, but no instruction set can reach that goal: an optimization promotes some features and restrains others. Fig. 7 exposes the different requirements on an ASIP instruction set; obviously, not all of them can be met at the same time. When designing a processor, we specify the main goal and the cost constraints. Under limited power consumption and silicon cost, one can focus on and optimize only some features or parameters, while others are restrained. The trade-off among the four most important dimensions is illustrated in Fig. 8.

Here "compiler friendly" means that the distance between the IR (Intermediate Representation) and the assembly language is relatively small; in other words, the compiler (usually a C compiler) can translate high level source code to assembly code that uses the assembly instruction set efficiently. A "compiler friendly" instruction set is usually characterized by generality and orthogonality (for example, any instruction can use any register in the same way). Unfortunately, an ASIP instruction set may not be general enough when high performance and low power are required at the same time.
Fig. 8 Trade-off between performance and flexibility: an ASIP sits high on performance over silicon and acceleration, while a general processor sits high on compiler friendliness and flexibility
4.1 Instruction Set Specification

A DSP instruction set is usually divided into two groups, a RISC subset and a CISC subset, as illustrated in Fig. 9. All DSP processors, both ASIP DSPs and general-purpose DSPs, need the RISC subset for handling general arithmetic and control functions; the function coverage is provided by the RISC instruction subset, while the CISC subset instructions are used for function acceleration. If a processor is required to supply a rather wide set of complex control functions, the RISC subset may resemble the instruction set of a general RISC processor. The RISC instruction subset is usually divided into three groups: instructions for basic load and store, instructions for basic arithmetic and logic operations, and instructions for program flow control.
Fig. 9 Classification of a typical ASIP DSP assembly instruction set. The RISC subset comprises data access instructions (load or store memory, moves between registers and load immediate data, I/O access), arithmetic instructions (general arithmetic, logic, shift/rotate; multiplications; long arithmetic operations; bit and bit-field manipulations), and control instructions (branches and calls, other program flow control). The CISC subset comprises MAC and convolution, the repeat instruction, function solvers, and codes reserved for acceleration extensions
The design of a RISC instruction subset can be found in most computer architecture books.
4.2 Instructions for Functional Acceleration

The decision to introduce CISC instructions is based on the 10-90% code locality analysis performed during source code profiling. CISC subset instructions fall into two categories: normal CISC instructions and instructions for accelerated extensions. The normal CISC instructions are specified in the assembly instruction set manual during the design of the ASIP core and are completely decoded by the control path of the core.

An ASIP might be designed as an SIP (Silicon Intellectual Property) for multiple products, and different products may have different requirements on function acceleration. Instruction level function extension should therefore remain possible even after the release of the ASIP RTL code and the programming toolchain. A method for adding CISC instructions without touching the released ASIP design is discussed in detail in [6].

One group of CISC instructions is specified by merging a chain of available RISC arithmetic and load/store instructions into one CISC instruction. In this way acceleration is achieved without adding extra arithmetic hardware: the execution time is reduced to the time of running one instruction, while the pipeline depth of the accelerated instruction increases. If the hardware used by the RISC instructions cannot be reused, extra hardware must be added to support the accelerated instructions; this second group of CISC instructions thus needs extra hardware. Adding the first group of CISC instructions only increases the complexity of the ASIP control path; adding the second group adds complexity both to the control path and to the datapath.

A typical CISC instruction example is the MAC (multiplication and accumulation) instruction. It merges the multiplication instruction and the accumulation instruction into one instruction with doubled pipeline depth, usually executing in two clock cycles. The behavior description of the instruction is:

Buffer <= (Operand A) x (Operand B)   // in the first clock cycle
ACR    <= ACR + Buffer                // in the second clock cycle
This is the most used arithmetic computation in DSP applications. There are opportunities to improve DSP performance dramatically by fusing several instructions into a "convolution instruction": an iterative loop including memory address computing, data and coefficient loading, multiplication, accumulation, and loop control. By adding the convolution instruction, both the data access overhead and the loop control overhead are eliminated. A convolution instruction CONV N M1(AM) M2(A) is therefore introduced. The behavior description of the convolution instruction is:

01 // CONV instruction: iteration from N to 1
02 OPB <= TM (A);                          // load filter coefficient
03 OPA <= DM (AM);                         // load sample
04 if AM == BUFFER_TOP then AM <= BOTTOM   // check FIFO bound
05 else INC (AM);                          // update sample pointer
06 INC (A);                                // update coefficient pointer
07 BFR <= OPA * OPB;                       // save computed product in buffer
08 ACR <= ACR + BFR;                       // accumulate product
09 DEC (LOOP_COUNTER);                     // decrement loop counter
10 if LOOP_COUNTER != 0 then jump to 01    // one more iteration
11 else Y <= Truncate (round (ACR));       // final result
12 end.
The loop counter LOOP_COUNTER, the data memory address pointer AM, and the coefficient memory address pointer A must be configured by running prolog instructions. Since in each iteration the two memory accesses, the multiplication, and the accumulation together consume one clock cycle, an N-tap convolution consumes only N + 2 clock cycles. Without the special CISC instruction CONV, the iterative computation of an N-tap FIR could consume up to 12N clock cycles: 6N cycles for arithmetic computing, addressing, and data access; 3N cycles for checking the FIFO bound; and 3N cycles for checking the loop counter (a taken jump consumes two extra clock cycles).
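For contrast, the same N-tap computation written without the fused instruction makes the eliminated overheads explicit: every tap carries its own addressing updates, FIFO bound check, and loop control, which is where the up-to-12N cycle estimate comes from. A minimal C sketch follows; the buffer size and interfaces are illustrative.

/* Sketch: N-tap FIR over a circular sample buffer, written out so the
 * per-tap overheads (addressing, FIFO-bound check, loop control) that
 * a fused CONV instruction removes are explicit. */
#include <stdint.h>

#define BUF_SIZE 64   /* circular sample FIFO size (illustrative) */

int32_t fir_circular(const int16_t *dm,  /* sample memory (FIFO)      */
                     const int16_t *tm,  /* coefficient memory        */
                     int am,             /* current sample pointer    */
                     int n_taps)
{
    int32_t acr = 0;                      /* accumulator register      */
    int a = 0;                            /* coefficient pointer       */
    int loop_counter = n_taps;

    while (loop_counter != 0) {           /* loop-control overhead     */
        int32_t opa = dm[am];             /* load sample               */
        int32_t opb = tm[a];              /* load coefficient          */
        if (am == BUF_SIZE - 1) am = 0;   /* FIFO-bound check overhead */
        else am++;
        a++;                              /* coefficient addressing    */
        acr += opa * opb;                 /* multiply-accumulate       */
        loop_counter--;
    }
    return acr;                           /* truncation/rounding omitted */
}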
4.3 Instruction Coding

Coding includes assembly coding and binary machine coding. Assembly coding names the assembly instructions and makes the assembly language human-readable; hardware can only execute binary machine code, so the assembly code must be translated into binary machine code for execution. The purpose of assembly coding is to make the assembly code easy to use, easy to remember, and unambiguous. Assembly coding usually follows the suggestions of IEEE Std 694-1985.

As the lowest level, hardware dependent machine language, assembly code describes functions in terms of micro-operations. Assembly code also specifies the use of hardware such as the arithmetic computing units, the physical registers, the physical data memory addresses, and the physical target addresses of jump instructions. The assembly language exposes the micro-operations of the datapath, the addressing path, and the memory subsystem, as well as some micro-operations of the control path, such as testing jump conditions and target addressing. It is not necessary to expose all micro-operations: some, such as bus transactions and the details of instruction decoding, are not exposed in the assembly manual because assembly programmers do not use them directly.

Fig. 10 classifies explicit and implicit micro-operations. All micro-operations that the assembly users (programmers) need to know are exposed in the assembly language manual. However, to use the binary machine code efficiently, some micro-operations that the assembly manual specifies are not coded into the assembly and binary machine code; examples are the flag operations of ALU computing and PC <= PC + 1 for fetching the next instruction. When writing the instruction set simulator and the RTL code, all hidden micro-operations must be exposed and implemented.
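As an illustration of binary machine coding, the following sketch packs and unpacks a hypothetical 16-bit instruction word. The field layout (5-bit opcode, two 3-bit register fields, 5-bit immediate) is invented for the example and is not taken from any real ASIP.

/* Sketch: packing/unpacking a hypothetical 16-bit instruction word:
 *   [15:11] opcode   [10:8] rd   [7:5] ra   [4:0] imm5
 * The layout is illustrative only. */
#include <stdint.h>

static inline uint16_t encode(unsigned op, unsigned rd,
                              unsigned ra, unsigned imm5)
{
    return (uint16_t)(((op   & 0x1Fu) << 11) |
                      ((rd   & 0x07u) <<  8) |
                      ((ra   & 0x07u) <<  5) |
                      ( imm5 & 0x1Fu));
}

static inline unsigned dec_op (uint16_t w) { return (w >> 11) & 0x1Fu; }
static inline unsigned dec_rd (uint16_t w) { return (w >>  8) & 0x07u; }
static inline unsigned dec_ra (uint16_t w) { return (w >>  5) & 0x07u; }
static inline unsigned dec_imm(uint16_t w) { return  w        & 0x1Fu; }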
Fig. 10 Assembly and binary coding convention. Of all micro-operations related to an assembly instruction, some are explicit in the assembly manual and coded into assembly and binary machine code (operations, operands, results, instruction configuration, target addressing, data memory addressing); some are specified in the manual but need not be coded (for example flag operations and PC <= PC + 1); and some are implicit and not specified at all (bus usage, instruction decoding)
4.4 Instruction and Architecture Optimization

An ASIP instruction set shall be optimized for a group of applications with cost in mind; this optimization process is the software-hardware co-design of the instruction set architecture. The traditional embedded system design flow is given in Fig. 11. To simplify a design, the system design inputs are partitioned and assigned to the HW design team and the SW design team during the functional design, and the two teams then design their SW and HW independently, with little interaction. In the SW design team, the functions mapped to SW are implemented in program code, verified in simulation, and finally translated to machine language during the SW implementation.
Fig. 11 Traditional embedded system design flow: from the system design inputs and an early HW/SW partition, the SW track (SW specification, programming, SW simulation, SW implementation) and the HW track (HW specification, HW modeling, HW simulation, HW implementation) proceed separated by "the wall" until integration and tests
Fig. 12 Hardware/software co-design flow: the same steps as in Fig. 11, but with interaction between the SW track and the HW track at every design step, from the early HW/SW partition through integration and tests
From the HW design perspective, the functions assigned to the HW team are allocated to HW modules, which can be either processors or functional circuits. The behavior of a programmable HW module is described by an assembly language simulator; the behavior of a non-programmable HW module is described in a hardware description language. A function implemented in programmable hardware becomes an assembly language instruction; a function implemented in non-programmable hardware becomes a hardware module or part of one. Finally, the implemented SW and HW are integrated, and the binary code of the assembly programs is executed on the implemented HW.

The design flow in Fig. 11 is principally correct, because it follows the golden rule of design: divide and conquer. However, when designing a high quality embedded system, we actually do not know how to make a correct or optimized early partition; the early partition may not be good enough without iterative optimization throughout the design. To optimize a design we need to move functions interactively between the SW and HW design teams during the embedded system design. Under this challenge, "HW/SW co-design" appeared in the early 1990s; the HW/SW co-design flow is depicted in Fig. 12. The idea of the new design flow is to optimize the partition of HW and SW functions cooperatively at each step of the embedded system design.

HW/SW co-design thus trades off function partitioning and implementation between SW and HW throughout the embedded system design, so that the result is eventually optimized toward the chosen goals. Re-partitioning itself is not difficult; the difficulty is the fast, quantitative modeling needed for functional verification and for performance and cost estimation of the new design after each re-partition. Fast processor prototyping is therefore a challenging research topic.

Implementing an algorithm level function in ASIP HW means designing accelerated instructions for that specific function.
Fig. 13 Hardware/software co-design for an ASIP: starting from the ASIP requirement specification and an early manual partition based on source code profiling and product requirements, each function is implemented either as an instruction (leading to the instruction set specification, the processor architecture specification, microarchitecture design, design for HW acceleration, and processor HW implementation) or as a subroutine (leading to the assembly instruction set simulator and application SW implementation); benchmarking of the instruction set can move a subroutine into an instruction or change an instruction into a subroutine, before the final ASIP integration, function verification, and performance validation
Implementing a function in SW means designing a software subroutine for it. After source code profiling, an instruction set architecture can be proposed; the further HW/SW co-design of an ASIP, shown in Fig. 13, modifies this proposal by trading off the HW/SW partition through all the design steps of the ASIP. An ASIP assembly instruction set should be efficient enough to deliver performance for the most frequently occurring functions, and general enough to cover all functions in the specified application domains. When the performance is insufficient, the most used (kernel) subroutines or algorithms should be hardware-accelerated, while subroutines or algorithms that are seldom used should be implemented as SW subroutines. Performance and coverage can be achieved by carefully trading off the HW/SW partitioning. Evaluating functional coverage, however, is not easy: so far there is no better way than running real applications, which is costly and takes a long time. The HW/SW co-design flow of Fig. 13 is part of the design flow of Fig. 2.
4.5 Design Automation of an Instruction Set Architecture

Researchers are working toward multi-objective evaluation tools for comparing and selecting among different solutions. Many researchers also work on identifying the functions to accelerate and on proposing instructions for functional acceleration, and many work on ASIP synthesis. There are two main approaches in ASIP synthesis research: one synthesizes an ASIP based on a fixed template, the other synthesizes an ASIP without a template, based only on an architecture description language (ADL). There may be a bright future in which an ASIP can be synthesized from a selected architecture; in such a case, the architecture must be selected based on quantitative requirements including performance, flexibility, silicon cost, power consumption, and reliability.
Fig. 14 A digital baseband (DBB) in a radio system: between the RF and mixed analog baseband (ABB) subsystems on one side and the application processor on the other, the DBB comprises digital frontend filters, a symbol processor, forward error correction, and a micro controller (MCU), connected through the peripheral, bus, and memory subsystem
However, there are engineering, marketing, and project dependent problems that are not yet modeled by EDA tools, such as:

• How to model market requirements
• How to model project requirements
• How to model human competence and availability
• How to balance SW design costs and HW design costs
• How to differentiate from competitors
• How to convince customers
• How to distribute weighting factors among requirements
• How to define the product lifetime
• How to trade off between costs and flexibility
At present, the design of an ASIP DSP is based on experience, and there are other, non-technical decision factors. For example, the lifetime of an ASIP may not be "the longer the better", since a long product lifetime may block sales of new products.
5 ASIP Instruction Set for Radio Baseband Processors

The baseband signal in a radio system is the signal that carries the information to be transmitted before it is modulated onto the radio frequency carrier wave; it can be analog or digital. The subsystem processing the digital baseband signal is called the digital baseband, or DBB; the subsystem between the DBB and the RF (Radio Frequency) subsystem is called the mixed analog baseband, or ABB. In this section we discuss only the DBB. A DBB in a radio system is shown in Fig. 14; the DBB module includes both the transmitter and the receiver.

In the transmitter, the DBB codes binary streams from the application payload into symbols on the radio channel. The coding includes efficient coding (the digital modulation, which packs as many bits as possible into a unit of bandwidth) and redundant coding (which carries information for error correction on the receiver side).
The digital modulation is thus a way to carry multiple bits per symbol.

The receiver recovers the radio signal delivered by the analog-to-digital converter (ADC) in the ABB into payload bits. The received waveform carries the transmitted signal together with distortions and interference. Interference comes both from external sources, such as glitches and white noise, and from the received signal itself: the received signal is a combination of the directly transmitted wave and waves reflected by different objects, arriving at the receiver at different times and interfering with each other. Signals arriving at the antenna at different times induce time domain distortion; signals carried on different frequencies induce frequency domain distortion. Distortions and interference must be cancelled before signal detection, which requires a channel estimator and equalizer pair to remove the noise and distortion.

Relative movement of a radio receiver and its transmitter changes their relative positions, and an estimated channel becomes invalid as soon as the relative position changes. If the mobility is high (e.g., a moving vehicle), the lifetime of a channel model is on the millisecond level, requiring recalculation of the channel within a millisecond. To guard against glitches and white noise, forward error correction (FEC) algorithms must be executed. Computing features of advanced FEC algorithms include a high computing load, computation at limited data precision, advanced addressing for parallel data access, and opportunities for data and task level parallelization.

All these computing tasks, namely signal filtering, synchronization, channel estimation, channel equalization, signal detection, forward error correction, signal coding, and symbol shaping, require heavy computation. Moreover, radio baseband signal processing must be conducted in real time: a mobile channel must be estimated and updated within milliseconds, and all other signal processing of a receiver and transmitter must be completed within each symbol period. For example, the symbol period of IEEE 802.11a/g is 4 microseconds, so several billion arithmetic operations per second are required [7].

Some tasks have heavy computing performance requirements and low flexibility requirements, such as simple filtering, packet detection, and sampling rate adaptation. These tasks can be allocated to the DFE (digital frontend) module; the functions of these filters do not change every clock cycle, so configurability is sufficient for the complexity handling of the DFE. In the same way, error correction computing at acceptably low data precision can be allocated to FEC (forward error correction) modules without requiring programmability. Apart from the DFE, the FEC, and the hardware control functions (allocated to the micro controller), the remaining tasks are allocated to the symbol processor, where sufficient flexibility is required for signal processing on complex-valued data. The main tasks allocated to the symbol processor are synchronization, channel estimation, time and frequency domain channel equalization, and transformation. The cost of IEEE 802.11a/g symbol processing in the symbol processor is listed in Table 1; the required computing performance is calculated for single precision integer data, and the associated performance requirements for data access are not included in the table.
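As a small illustration of one symbol processor task, the sketch below models zero-forcing frequency-domain channel equalization: after the FFT, each subcarrier is divided by its channel estimate. The 64 subcarriers match the IEEE 802.11a/g OFDM symbol; the floating point complex arithmetic is a reference model only, since an ASIP datapath would use fixed point complex data (often a 1/x function solver plus a complex multiplication instead of a division).

/* Sketch: zero-forcing frequency-domain equalization of one OFDM
 * symbol. Each received subcarrier is divided by the channel
 * estimate for that subcarrier. */
#include <complex.h>

#define N_SUB 64   /* subcarriers per IEEE 802.11a/g OFDM symbol */

void equalize(float complex y[N_SUB],        /* FFT output (in/out) */
              const float complex h[N_SUB])  /* channel estimate    */
{
    for (int k = 0; k < N_SUB; k++)
        y[k] = y[k] / h[k];                  /* per-subcarrier division */
}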
Table 1 Profiling and function allocation for an IEEE 802.11a/g receiver

Task requiring high computing performance | Algorithms | Arithmetic operations per task | Appearance | Million arithmetic operations per second
Packet synchronization | Cross correlation | ~3000 | ~100 µs | 30
Sampling rate offset estimation | FFT / phase decision | 2500 | 100 µs | 25
F-domain channel estimation | FFT, interpolation, 1/x | 6640 | 24 µs | 277
Payload rotation | Vector product | ~384 | 4 µs | 96
IFFT or FFT, 64 points | FFT / IFFT | 1984 | 4 µs | 496
F-domain channel equalization | Vector product | ~384 | 4 µs | 96
Total cost | | | | 1010
By offloading the majority of the computing to the DFE and the FEC, the symbol processor handles only the small part of the computing that requires programmability. Even so, this programmable part requires more than 1 Giga arithmetic operations per second, not yet counting the data access cost. By analyzing the tasks listed in Table 1, the most used ASIP task level instructions for radio symbol processing are obtained; they are listed in Table 2, where V1[i], V2[i], and V3[i] are complex data vectors. If a complex-valued datapath is used, one step of signal processing on complex data corresponds to roughly 4-6 integer operations, so the performance is enhanced about 4-6 times. By further accelerating ASIP instructions at the vector level, loop overheads can be eliminated by a hardware loop controller. All functions listed in Table 1 can be processed at 200 MHz in one datapath with a complex data type [7]. The digital baseband of IEEE 802.11a/g is relatively simple; the computing cost for LTE and WiMAX is at least 5-10 times higher, and at least five more complex-data datapath modules are needed in an ASIP for LTE and WiMAX baseband signal processing.
Table 2 Five most used task level (CISC) instructions accelerated at the vector processing level

Instruction | Functional specification
Conjugate complex convolution (auto correlation) | for i = 1 to N do { Complex REG = Complex REG + V1[i] * conjugate of V1 (or V2)[i] }
Complex convolution | for i = 1 to N do { Complex REG = Complex REG + V1[i] * V2[i] }
Product of conjugate complex vectors | for i = 1 to N do { V3[i] = V1[i] * conjugate of V2[i] }
Product of complex vectors | for i = 1 to N do { V3[i] = V1[i] * V2[i] }
Product of complex data vector and complex data scalar | for i = 1 to N do { V2[i] = Scalar * V2[i] }
Fig. 15 Image and video compression: R-, G-, and B-frames pass through RGB to YUV color conversion into Y-, U-, and V-frames, then frequency-domain frame compression and lossless compression (the shaded image-compression path); video adds inter-frame compression and frequency-domain residue compression to the intra-frame path
6 ASIP Instruction Set for Video and Image Compression

An image or video CODEC carries out signal processing that eliminates redundancies in image and video data; it is a kind of data compression application [8]. Compressed images and video can be stored and transferred at low cost. The most frequently used image and video compression methods are lossy, to obtain a high compression ratio. Generally, lossy compression techniques for image and video are based on 2D (two dimensional) signal processing, as illustrated in Fig. 15, which shows a simplified video compression flow; the shaded part of Fig. 15 is a simplified image compression flow.

The first step in image and video compression is the color transform. The three original color planes R, G, B (R = red, G = green, B = blue) are translated to the Y, U, V (Y = luminance, U and V = chrominance) color planes. The human eye has high sensitivity to luminance ("brightness") and low sensitivity to chrominance ("color"), so most viewers will not notice a loss of resolution if the U and V planes are down-sampled to 1/4 of their original size. A frame with a size factor of 3 (three RGB color planes) is thus down-sampled to YUV planes with a size factor of 1 + 1/4 + 1/4 = 1.5, yielding a compression ratio of 2.

Frequency-domain compression is executed after the RGB to YUV transform. The Discrete Cosine Transform (DCT) translates the image from the time domain to the frequency domain. Because human sensitivity to spatially high-frequency details is low, the information in the higher frequency parts can be quantized with lower resolution without visible loss; after quantization, the resolution of the higher frequency parts is only 1 or 2 bits instead of 8 bits. The image or video signal in the frequency domain is thereby further compressed.

After the frequency-domain compression, long runs of 1's and 0's appear because of the quantization. A lossless compression method such as Huffman coding is finally used to compress the most frequent patterns, such as chains of zeroes. Huffman coding assigns shorter bit strings to more frequent symbols and longer bit strings to less probable symbols, and ensures that each code can be uniquely decoded.
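A reference model of the color transform step is shown below, using one common set of RGB-to-YUV coefficients (BT.601-style, full range; several variants exist, and the chrominance down-sampling is omitted for brevity):

/* Sketch: RGB-to-YUV color transform for one pixel (BT.601-style
 * full-range coefficients; one common definition among several). */
typedef struct { float y, u, v; } Yuv;

Yuv rgb_to_yuv(float r, float g, float b)
{
    Yuv p;
    p.y =  0.299f * r + 0.587f * g + 0.114f * b;   /* luminance   */
    p.u = -0.147f * r - 0.289f * g + 0.436f * b;   /* chrominance */
    p.v =  0.615f * r - 0.515f * g - 0.100f * b;   /* chrominance */
    return p;
}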
Video compression is an extension of image compression that further exploits two types of redundancy: spatial and temporal. Compression of the first video frame, the so-called reference frame, is similar to the compression of a still image. The reference frame is then used for compressing the following frames, the inter-frames. The difference between a reference frame and a neighboring frame is usually small; if only the identified inter-frame difference is compressed and transferred, the neighboring frame can be recovered by applying the differences to the reference frame. Since the inter-frame differences are very small compared to the original frame, this temporal coding is very effective. If one part of a video frame is the same as, or similar to, another part of the same frame, only one of the parts needs to be stored or transferred, together with the difference between the two parts; this coding method is spatial coding.

In a video decoder, interpolation and de-blocking algorithms are introduced to improve the perceived visual quality. To refine the motion estimation, samples at non-integer positions are generated by interpolation; non-integer position interpolation yields a better matching region than integer motion estimation, and interpolation algorithms are needed on both the encoder and the decoder side. A de-blocking filter improves the decoded visual quality by smoothing the sharp edges between coding units.

The classical JPEG (Joint Picture Expert Group) compression for still images can reach a 20:1 compression ratio. Including the memory access cost, JPEG encoding consumes roughly 200 operations and decoding roughly 150 operations per RGB pixel (covering all three color components). If multiple high resolution pictures (e.g., 3648 × 2736) must be compressed in a camera, accelerated instructions are mandatory. Because the processing of Huffman coding and decoding can vary greatly between images, the JPEG computing cost of a complete image cannot be estimated accurately.

The classical MPEG2 (Moving Picture Expert Group) video compression standard reaches a 50:1 compression ratio on average; the advanced video codec H.264/AVC increases this ratio to more than 100:1. The computing cost of a video encoder depends strongly on the complexity of the video stream and on the motion estimation algorithm. Including the cost of memory accesses, an H.264 encoder may consume as much as 4000 operations per pixel, and the decoder about 500 operations per pixel, on average. As an example, encoding a QCIF-size (176 × 144) video stream at 30 frames per second requires about 3 Giga operations per second, and the corresponding decoder about 400 Mega operations per second. A general DSP processor may not offer both the performance and the low power consumption required for such applications, so an ASIP is needed, especially for handheld video encoding.

Obviously, function acceleration is needed both for encoding and decoding of images and video frames.
Table 3 Accelerated algorithms for image and video signal processing

Algorithm | Kernel operation
SAD | result = Σ |Ai − Bi|
Interpolation | result = round((a1x1 ± a2x2 ± a3x3 ± a4x4 ± a5x5 ± a6x6) / 32)
De-blocking filter | result = round((a1x1 + a2x2 + a3x3 + a4x4 + a5x5 + a6x6) / 8)
8 × 8 Discrete Cosine Transform | integer butterfly computing and data access
Color transform | result = a1x1 ± a2x2 ± a3x3
The heaviest computation to accelerate is motion estimation. Many motion estimation algorithms have been proposed, most of them based on SAD (sum of absolute differences). Some accelerated algorithms are listed in Table 3. If the data access and computing of the algorithms listed in Table 3 are implemented with accelerated instructions, more than 80% of the image/video signal processing can be accelerated.
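The SAD kernel of Table 3, written as a C reference model for one 16 × 16 block; the block size and the strided memory layout are illustrative:

/* Sketch: sum of absolute differences (SAD) between a current block
 * and a candidate reference block, the motion-estimation kernel of
 * Table 3. Block size and strides are illustrative. */
#include <stdint.h>
#include <stdlib.h>

uint32_t sad16x16(const uint8_t *cur, int cur_stride,
                  const uint8_t *ref, int ref_stride)
{
    uint32_t sad = 0;
    for (int y = 0; y < 16; y++) {
        for (int x = 0; x < 16; x++)
            sad += (uint32_t)abs((int)cur[x] - (int)ref[x]);
        cur += cur_stride;               /* next row of each block */
        ref += ref_stride;
    }
    return sad;
}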
7 Programming Toolchain and Benchmarking

As soon as an assembly instruction set is proposed, its toolchain should be provided for benchmarking the designed instruction set. An assembly programming toolchain includes the C compiler, the assembler, the linker, the ISS (Instruction Set Simulator), and the debugger. A simplified firmware design flow is given in Fig. 16 (a), and the toolchain supporting the programming is given in Fig. 16 (b). Fig. 16 (a) shows the seven main design steps for translating behavioral C code into qualified executable binary code, and Fig. 16 (b) shows the six tools of the programmer's toolchain used for the code translation and simulation. Each tool in Fig. 16 (b) is marked with a number, and the design steps in Fig. 16 (a) are annotated with the numbers of the tools they use.
Fig. 16 Firmware design flow using the programmer's toolchain. (a) The method for translating source code into qualified assembly code: adapt the behavior code to the hardware (1); compile the C code to assembly (2); build the assembly library and perform assembly coding (3); link the object codes and allocate them to program memory (4); generate the final binary code and run it on the simulator (5)(6); verify the code, benchmark, and release the design (5)(6). (b) The toolchain: (1) C-HW-adapter, (2) C-compiler, (3) assembler, (4) object linker, (5) ASM (instruction set) simulator, (6) debugger.
A C-code adapter might be a tool or a step of the programming methodology. To get quality assembly code from the C-compiler, the source C-code should be adapted to the hardware, which includes:
• Changing or modifying algorithms to suit the hardware.
• Adapting data types to the hardware.
• Adding annotations to guide instruction selection during compilation.
After adapting the source C-code to the hardware, the C-code is compiled to assembly code by the compiler. A compiler consists of a source code analyzer, the frontend, and a target code synthesizer, the backend. Most ASIP compilers are based on gcc (the GNU Compiler Collection). Compiler design for ASIP DSPs is covered in other chapters of this handbook.
The assembler further translates assembly code into binary code. The translation process has three major steps. The first step is to parse the code and set up symbol tables. The second step is to set up the relationship between symbolic names and addresses. The third step is to translate each assembly statement (opcode, registers, specifiers, and labels) into binary code. The input of an assembler is the source assembly code. The output of an assembler is relocatable binary code called an object file. The output file from an assembler is the input file to a linker; it contains machine (binary) instructions, data, and book-keeping information (a symbol table for the linker).
The assembly instruction set simulator executes assembly instructions (in binary machine code format) and behaves the same way as the processor. It is therefore called the behavioral model of the processor. A simulator can be a processor core simulator or a simulator of the whole ASIP DSP chip. The simulator of the core covers the functions of the basic assembly instructions and should be "bit and cycle accurate" with respect to the hardware of the core. The simulator of the ASIP chip covers the functions of both the core and the peripherals and should be "bit, cycle, and pin accurate", meaning that the result from the simulator matches the result from the whole processor hardware.
Once the firmware is developed, the programmer must debug it and make sure it works as expected. Firmware debugging can be based on a software simulator or on a hardware development board. In either case, the programmer needs a way to load the data, run the software, and observe the outputs.
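Returning to the C-code adaptation step listed above, the following hypothetical sketch shows what adapting data types and adding an instruction-selection annotation can look like; the int16 typedef and the #pragma use_mac annotation are illustrative assumptions, since real ASIP toolchains define their own types and annotation syntax.

    /* Hypothetical adaptation of source C-code to a 16-bit fixed-point ASIP. */
    typedef short int16;            /* adapt data types: native 16-bit words */
    typedef int   int32;            /* 32-bit accumulator width              */

    int16 fir_sample(const int16 coef[], const int16 tap[], int n)
    {
        int32 acc = 0;
        /* annotation guiding the compiler to select the MAC instruction */
        #pragma use_mac
        for (int i = 0; i < n; i++)
            acc += (int32)coef[i] * tap[i];
        return (int16)(acc >> 15);  /* scale back to Q15 fixed-point format */
    }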
8 ASIP Firmware Design and Benchmarking

Firmware is, by definition, the fixed software in embedded products, for example an audio decoder (MP3 player) in a mobile phone, where it is visible to end users. On the other hand, firmware can also exist implicitly in a system, not exposed to the user, e.g. the firmware for radio baseband signal processing. Firmware can be application firmware or system firmware. The latter is used to manage tasks and system hardware; a typical system firmware is the real-time operating system (RTOS).
Qualified DSP firmware design is based on a deep understanding of the DSP processor and its assembly instruction set.
8.1 Firmware Design

The firmware design flow was briefly introduced in the left part of Fig. 16; it is further refined in Fig. 17. A complete firmware design process consists of behavior modeling, bit-accurate modeling, memory-accurate modeling, timing budgeting and real-time modeling, and finally assembly coding.
Firmware design has three starting points, according to the degrees of freedom in the design. If the designers have all degrees of freedom, including the selection of algorithms, the design starts from design entry 1, the step of algorithm design and selection. For example, radio baseband modem standards specify only the transmitter and do not regulate the implementation of receivers. When implementing a radio baseband receiver, we therefore have the freedom to select either a frequency-domain or a time-domain channel estimator, according to the cost-performance trade-off, the computing latency, and the hardware architecture. Here the cost mostly means the computing cost, the data access cost, and the memory cost; some architectures, for example, are particularly suitable for transform-based, frequency-domain algorithms. The high-level behavior of a DSP system is modeled with high-precision data types in a behavior modeling language such as C.
However, if there is no freedom to select algorithms, the design starts from design entry 2. A typical example is video decoding for H.264 or audio decoding for MP3: the algorithms and C code are available from the standard. This C code is based on floating-point arithmetic with excessively high data precision, whereas embedded applications use fixed-point hardware with limited precision. A design following entry 2 therefore starts from bit-accurate modeling. In this case, the freedom to decide the data precision is available: data and intermediate results can be scaled and truncated during processing. During the bit-accurate modeling phase, data quality control algorithms, such as data masking to emulate the finite hardware precision, signal level measurements, and gain control algorithms, are added to the original source code.
The freedom of data precision control may not be available in some designs. When the freedoms of algorithm selection and data precision are both unavailable, the firmware design enters at entry 3 in Fig. 17. In this case, a bit-accurate model is available when the firmware design starts; for example, bit-accurate source code may come from a standards committee. A typical case is the implementation of a voice decoder, which starts from available bit-accurate C code. Voice is usually compressed before transfer or storage to minimize the transfer bandwidth or storage cost, and a voice decoder decompresses the compressed voice to recover the voice waveform at the user side.
The memory cost is roughly exposed during source code profiling, but this early estimate might not be correct. The memory cost can be exposed further once the data types are decided after bit-accurate modeling.
Fig. 17 Detailed firmware design flow (simplified). The flow proceeds from understanding the application and product, through hardware-independent behavior modeling and algorithm selection (design entry 1), high-level language modeling, design for finite data precision and bit-accurate modeling (design entry 2), coding finite-length firmware, memory cost analysis and memory-accurate modeling (design entry 3, where a bit-accurate model is already available), coding firmware with memory costs, run-time budgeting and the timing budget, cycle-accurate and re-allocatable assembly coding, to generating and verifying the machine code.
A memory-accurate model can therefore be achieved. Caches are not much used in embedded DSP computing, because a cache is expensive and data access in embedded computing is relatively explicit and predictable. A memory-accurate model is an essential step in exposing the data access cycle cost and, finally, in minimizing it. The first step of memory-accurate modeling is to re-model the data accesses as in the real hardware, based on the constraints of the on-chip memory sizes. After this step, the extra memory transaction cost is exposed. In the next step, moving data between memories and other storage devices is modeled using program code or DMA transactions.
The firmware has run-time constraints when processing real-time data, especially streaming data. The run-time cost can be estimated by profiling the source code. Extra run-time cost goes to executing the added subroutines for handling finite data precision, extra memory transactions, and task management. If the total run time is less than the arrival time of new data, the system implementation is feasible; otherwise, enhancing the hardware performance or selecting lower-cost algorithms is required.
Finally, the assembly code is generated and its functionality is verified. The assembly code can be released as firmware when its final run time meets the specification and the precision of the results is acceptable.
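As a concrete illustration of the data masking used during bit-accurate modeling, the following sketch emulates 16-bit hardware precision inside a floating-point-free C reference model; the function names and the Q15 format are assumptions for the example, not mandated by any standard.

    #include <stdint.h>

    /* Emulate a 16-bit (Q15) datapath inside the C reference model:
     * saturate to the representable range, then truncate to 16 bits.
     * Inserting such masking after every arithmetic operation makes the
     * C model bit-accurate with the intended fixed-point hardware. */
    static int16_t mask_q15(int32_t v)
    {
        if (v >  32767) return  32767;   /* saturate on overflow  */
        if (v < -32768) return -32768;   /* saturate on underflow */
        return (int16_t)v;               /* truncate to 16 bits   */
    }

    /* Example use: a bit-accurate multiply-accumulate step. */
    static int16_t mac_q15(int16_t acc, int16_t a, int16_t b)
    {
        int32_t prod = ((int32_t)a * b) >> 15;   /* Q15 x Q15 -> Q15 */
        return mask_q15((int32_t)acc + prod);
    }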
8.2 Benchmarking and Instruction Usage Profiling

Benchmarking is, by definition, a measurement of performance. In this context, benchmarking is used to measure the performance of the instruction set of an ASIP DSP. In general, benchmarking measures the run time (the number of clock cycles) of a defined task.
Other measurements can also be important, for example memory usage and power consumption. Memory usage can be further divided into the program memory and the data memory required by a benchmark. DSP benchmarking is usually conducted by running the assembly benchmark code on the instruction set simulator of the ASIP.
DSP benchmarking can be further divided into the benchmarking of DSP algorithm kernels and the benchmarking of DSP applications. By benchmarking DSP kernels, the roughly 10% of the code that takes 90% of the run time, the essential performance and cost can be estimated. When benchmarking kernel algorithms, kernel assembly code such as filters (FIR, IIR, and the LMS (Least Mean Square) adaptive filter) and transforms (FFT and DCT) is executed on the instruction set simulator of the processor. Benchmarking an application runs the application code on the assembly instruction set simulator of the processor. The best way to evaluate a processor is to run applications, because all overheads are then taken into account. However, the coding cost for an entire application can be very high, so application benchmarking should be conducted only on the cost-intensive part of the application.
To benchmark a DSP instruction set, two kinds of cycle cost measurements are frequently used. One is the cycle cost per algorithm per data sample, for example "30 clock cycles are used to process a data sample by a 16-tap FIR filter". The other is the data throughput of an application firmware per megahertz, for example "in one million clock cycles (1 MHz), up to 500 voice samples can be processed in a 2048-tap acoustic echo canceller". If the voice sampling rate is 8 kHz (it usually is), a computing cost of 16 MHz is required in this case (8000 samples per second divided by 500 samples per MHz).
The benchmarks of basic DSP algorithms are usually written in assembly language. However, if the firmware design time (time to market) is very short, benchmarks written in a high-level language will be necessary; in this case, the benchmarking checks the combined quality of the instruction set and the compiler. BDTI (Berkeley Design Technology, Inc. [1]) always supplies benchmarks based on hand-written assembly code. EEMBC (the EDN Embedded Microprocessor Benchmark Consortium [3]) allows two scoring methods: Out-of-the-box benchmarking and Full-Fury benchmarking. Out-of-the-box benchmarking (requiring no extra effort) is based on the assembly code directly generated by the compiler. Full-Fury (also called optimized) benchmarking is based on assembly code generated and then fine-tuned by experienced programmers.
It is not easy to make an ideally fair comparison by benchmarking low-level algorithm kernels on target processors. Each processor has dedicated features and is optimized for some algorithms but not for others. A processor with a poor benchmarking record for one application might have a very good record for another; a typical case is that a radio baseband processor will never be used as a video decoder processor. For a fair comparison, processors from different categories should not be compared.
DSP kernel algorithms typically make up about 10% of an application's code but take about 90% of its run time. Benchmarking on kernels is relevant because DSP kernel algorithms take the majority of the execution time in most DSP applications. Well-accepted DSP kernels are listed in Table 4. In the table, the typical cycle cost is measured on a simple and typical DSP processor with a single MAC unit and two separate memory blocks.
Table 4 Examples of kernel DSP algorithms running on a single-MAC DSP

Algorithm kernel     Description or specification                                  Typical cycle cost
Block transfer       Move N data words from one memory to another                  ~3N + 3
256p FFT             256-point FFT including computing and data access             ~11000
Single FIR           An N-tap FIR filter running one sample                        ~N + 12
Frame FIR            An N-tap FIR filter running K samples                         ~K(N + 6) + 8
Complex data FIR     An N-tap complex-data FIR filter running one sample           ~8N + 15
LMS adaptive FIR     An N-tap least-mean-square adaptive filter                    ~3N + 10
16/16-bit division   A positive 16-bit value divided by a positive 16-bit value    ~50
Vector add           C[i] = A[i] + B[i], for i from 0 to N−1                       ~3 + 3N
Vector window        C[i] = A[i] * B[i], for i from 0 to N−1                       ~3 + 3N
Vector max           R = max A[i], for i from 0 to N−1                             ~2 + 2N
It exposes the average performance of commercial single-MAC DSP processors. If the benchmarking results of an ASIP you have designed fall far behind the scores in the table, you may need to understand why and try to improve your design.
The cycle cost and code cost of a kernel benchmarking subroutine consist of three parts: the prolog, the kernel, and the epilog. The prolog is the code that prepares and starts the run of a kernel algorithm; it includes loading and configuring the parameters of the algorithm. The kernel is the code of the algorithm itself, including loading data, computing, and storing intermediate results. The epilog is the code that terminates the run of the algorithm, including storing the final result and restoring the changed parameters. The cycle costs of the prolog, the kernel, and the epilog should all be taken into account during benchmarking.
Instruction usage profiling is another measurement of an instruction set. Instead of measuring performance, it measures how often each instruction appears in the application code and how often it is executed per million cycles. The usage profile tells which instructions are used the most and which are used the least in an application. If several instructions are not used by a class of applications and the processor is designed only for this class, these instructions can be removed to simplify the hardware design and reduce the silicon cost. However, the saving in silicon cost and design cost from removing a single instruction is often almost negligible; the cost reduction becomes significant only when many instructions are removed. On the other hand, since the most used instructions can be identified, optimizing them can significantly improve the performance or reduce the power consumption.
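To make the prolog/kernel/epilog split described above concrete, the following C model of the "Single FIR" benchmark of Table 4 marks the three parts; the structure, not the exact cycle count, is the point, and on a single-MAC DSP the kernel loop would account for roughly the ~N term of the ~N + 12 cycles quoted in the table.

    #include <stdint.h>

    /* C model of a single-sample N-tap FIR benchmark, annotated with the
     * prolog / kernel / epilog split used when counting cycles on the ISS. */
    int16_t fir_benchmark(int16_t x_new, const int16_t coef[], int16_t tap[], int n)
    {
        /* Prolog: configure parameters and insert the new sample. */
        int32_t acc = 0;
        tap[0] = x_new;

        /* Kernel: one MAC per tap; on a single-MAC DSP this loop takes
         * about one cycle per tap, hence the ~N term in the cycle cost. */
        for (int i = n - 1; i >= 0; i--) {
            acc += (int32_t)coef[i] * tap[i];
            if (i > 0)
                tap[i] = tap[i - 1];   /* shift the delay line */
        }

        /* Epilog: store and scale the final result. */
        return (int16_t)(acc >> 15);
    }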
9 Microarchitecture Design for ASIP

The microarchitecture design of an ASIP specifies the design of the hardware modules in the ASIP. In particular, it is the hardware implementation of the ASIP assembly instruction set on each function module of the core and on the peripheral modules of the
ASIP. The inputs of the microarchitecture design are the ASIP architecture specification, the assembly instruction set manual, and the requirements on performance, power, and silicon cost. The output is the microarchitecture specification for RTL coding, including the module specifications of the core and the interconnections between modules.
The microarchitecture design of an ASIP can be divided into three steps. (1) Partition each assembly instruction into micro-operations and allocate each micro-operation to the corresponding hardware module. A micro-operation is the lowest-level hardware operation, for example operand sign-bit extension, operand inversion, addition, etc. If there is no module for some micro-operation, a module must be added or functions must be added to an existing module. (2) For each hardware module, collect all micro-operations allocated to the module, allocate them to hardware components, and specify the hardware multiplexing for RTL coding. (3) Fine-tune and implement the inter-module specifications of the ASIP architecture, and finalize the top-level connections and pipelines of the core.
Similar to architecture design, there are two ways to design a microarchitecture. One way is to use a reference microarchitecture: an available design of a hardware module from publications, other design teams, or available IP that is close to the requirements of the module to be designed and can execute most of the micro-operations allocated to the module. The other method is to design a custom microarchitecture, i.e., to generate a custom architecture dedicated to the task flow of an application. This method is used when there is no reference microarchitecture or when the reference microarchitecture is not good enough. The method maps one or several CFGs (Control Flow Graphs, the behavioral flow charts of an application) to hardware. A typical case is the design of accelerator modules for special algorithms, such as lossless compression algorithms, forward error correction algorithms, or packet processing protocols.
Figure 18 shows an example of a simplified datapath cluster for complex-data computing in a radio baseband signal processor. N in this figure is the word length of the real or imaginary data. The cluster can execute each of the following functions in one clock cycle: an FFT butterfly, iterative complex-data multiplication and accumulation, complex-data multiplication, and complex-data addition/subtraction. One radix-2 decimation-in-time (DIT) butterfly can be executed per clock cycle for FFT. The butterfly input data are OPA = AR + jAI and OPB = BR + jBI, and the input coefficient is OPC = CR + jCI. The outputs of the butterfly are Re(B + A*C) + jIm(B + A*C) and Re(B − A*C) + jIm(B − A*C). When executing the complex-data MAC (multiply-accumulate) or a convolution, the inputs are OPA and OPC, and the output is Re(MAC) + jIm(MAC). When executing a complex multiplication, the inputs are also OPA and OPC, and the output is Re(MUL) + jIm(MUL). Finally, complex-data addition and subtraction can be executed with inputs OPA and OPC, giving the results OPA + OPC or OPA − OPC.
Fig. 18 A radix-2 datapath for complex-data computing. Four N×N multipliers compute the partial products RMR = AR·CR, IMI = AI·CI, RMI = AR·CI, and IMR = AI·CR from OPA = AR + jAI and OPC = CR + jCI; add/subtract stages combine them into Re(MUL) and Im(MUL), and further add/subtract stages involving OPB = BR + jBI and the accumulator registers ACRR and ACIR produce Re(MAC), Im(MAC), Re(B ± A×C), and Im(B ± A×C).
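The following C sketch shows the arithmetic that one pass through this cluster performs in a single cycle; the struct and function names are illustrative only, and the fixed-point scaling of the products is omitted for clarity.

    /* One cycle of the complex-data cluster of Fig. 18: given OPA, OPB and
     * coefficient OPC, compute the radix-2 DIT butterfly outputs B + A*C
     * and B - A*C. The four products map onto the four NxN multipliers. */
    typedef struct { int re, im; } cplx;

    static void radix2_butterfly(cplx a, cplx b, cplx c, cplx *out0, cplx *out1)
    {
        /* Four parallel multipliers: RMR, IMI, RMI, IMR in the figure. */
        int rmr = a.re * c.re, imi = a.im * c.im;
        int rmi = a.re * c.im, imr = a.im * c.re;

        /* Complex product A*C from the first add/subtract stage. */
        int mul_re = rmr - imi;
        int mul_im = rmi + imr;

        /* Second add/subtract stage combines with OPB for both outputs. */
        out0->re = b.re + mul_re;  out0->im = b.im + mul_im;  /* B + A*C */
        out1->re = b.re - mul_re;  out1->im = b.im - mul_im;  /* B - A*C */
    }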
10 Further Reading

The ASIP DSP design flow was briefly introduced in this chapter: architecture exploration and design, assembly instruction set design, design of the toolchain, firmware design, benchmarking, and microarchitecture design. Two examples, a radio baseband ASIP DSP and a video ASIP DSP, were briefly introduced. Readers are encouraged to gain more ASIP design knowledge from [6].
ASIP design is, so far, experience-based. When the requirements on performance, power consumption, and silicon cost cannot all be met by COTS (commercial off-the-shelf) processors, an ASIP design is essential. An ASIP enables high-performance computing and dramatically reduces the cost of volume products. This chapter is provided for readers who want to acquire fundamental knowledge of ASIPs; for professional ASIP designers, more detailed information is provided in [6], [4], and [5].
References

1. Berkeley Design Technology, Inc. URL http://www.bdti.com
2. Coresonic, Inc. URL http://www.coresonic.com
3. Embedded Microprocessor Benchmarking Consortium. URL http://www.eembc.org
4. Hohenauer, M., Scharwaechter, H., Karuri, K., Wahlen, O., Kogel, T., Leupers, R., Ascheid, G., Meyr, H., Braun, G.: Compiler-in-loop architecture exploration for efficient application specific embedded processor design. In: Design & Elektronik. WEKA Verlag (2004)
5. Karlström, P., Liu, D.: NoGAP: A micro architecture construction framework. In: K. Bertels, N. Dimopoulos, C. Silvano, S. Wong (eds.) Embedded Computer Systems: Architectures, Modeling, and Simulation, LNCS, vol. 5657, pp. 171–180. Springer (2009)
6. Liu, D.: Embedded DSP Processor Design: Application Specific Instruction Set Processors. Elsevier (2008)
7. Liu, D., Tell, E.: Low-power baseband processors for communications. In: C. Piguet (ed.) Low-Power Electronics Design, chap. 23. CRC Press (2004)
8. Salomon, D.: Data Compression: The Complete Reference, 3 edn. Springer (2006)
Coarse-Grained Reconfigurable Array Architectures Bjorn De Sutter, Praveen Raghavan, Andy Lambrechts
Abstract Coarse-Grained Reconfigurable Array (CGRA) architectures accelerate the same inner loops that benefit from the high ILP support in VLIW architectures. By executing non-loop code on other cores, however, CGRAs can focus on such loops to execute them more efficiently. This chapter discusses the basic principles of CGRAs, and the wide range of design options available to a CGRA designer, covering a large number of existing CGRA designs. The impact of different options on flexibility, performance, and power-efficiency is discussed, as well as the need for compiler support. The ADRES CGRA design template is studied in more detail as a use case to illustrate the need for design space exploration, for compiler support and for the manual fine-tuning of source code.
1 Application Domain of Coarse-Grained Reconfigurable Arrays

Many embedded applications require high throughput, meaning that a large number of computations needs to be performed every second. At the same time, the power consumption of battery-operated devices needs to be minimized to increase their autonomy. In general, the performance obtained on a programmable processor for a certain application can be defined as the reciprocal of the application execution time.

Bjorn De Sutter, Ghent University, Sint-Pietersnieuwstraat 41, 9000 Gent, Belgium and Vrije Universiteit Brussel, Pleinlaan 2, 1050 Brussel, Belgium, e-mail: [email protected]
Praveen Raghavan, IMEC, Kapeldreef 75, 3001 Heverlee, Belgium, e-mail: [email protected]
Andy Lambrechts, IMEC, Kapeldreef 75, 3001 Heverlee, Belgium, e-mail: [email protected]
Considering that most programs consist of a number of consecutive phases P = [1, p] with different characteristics, performance can be defined in terms of the operating frequencies f_p, the instructions executed per cycle IPC_p, and the instruction counts IC_p of each phase, and in terms of the time overhead t_{p→p+1} involved in switching between the phases, as follows:

\[
\frac{1}{\text{performance}} = \text{execution time} = \sum_{p \in P} \left( \frac{IC_p}{IPC_p \cdot f_p} + t_{p \rightarrow p+1} \right). \tag{1}
\]
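For intuition, consider a hypothetical program with two phases (the numbers are invented for illustration): phase 1 executes IC_1 = 10^9 instructions at IPC_1 = 4 on f_1 = 500 MHz, phase 2 executes IC_2 = 10^8 instructions at IPC_2 = 1 on f_2 = 1 GHz, and the switch costs t_{1→2} = 1 ms. Then

\[
\text{execution time} = \frac{10^9}{4 \cdot 5 \times 10^8} + 0.001 + \frac{10^8}{1 \cdot 10^9} = 0.5 + 0.001 + 0.1 = 0.601\ \text{s},
\]

so the performance is roughly 1.66 executions per second, dominated by the high-ILP first phase.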
The operating frequencies f_p cannot be increased infinitely for power-efficiency reasons. Alternatively, a designer can increase the performance by designing or selecting a system that can execute code at higher IPCs. In a power-efficient architecture, a high IPC is reached for the most important phases l ∈ L ⊂ P, while limiting their instruction count IC_l and reaching a sufficiently high, but still power-efficient, frequency f_l. Furthermore, the time overhead t_{p→p+1}, as well as the corresponding energy overhead of switching between the execution modes of consecutive phases, should be minimized if such switching happens frequently. Note that such switching only happens on hardware that supports multiple execution modes in support of phases with different characteristics.
Coarse-Grained Reconfigurable Array (CGRA) accelerators aim for these goals for the inner loops found in many digital signal processing (DSP) domains, including multimedia and Software-Defined Radio (SDR) applications. Such applications have traditionally employed Very Long Instruction Word (VLIW) architectures such as the TriMedia 3270 [74] and the TI C64 [59], Application-Specific Integrated Circuits (ASICs), and Application-Specific Instruction Processors (ASIPs). To a large degree, the reasons for running these applications on VLIW processors also apply to CGRAs. First of all, a large fraction of the computation time is spent in manifest nested loops that perform computations on arrays of data and that can, possibly through compiler transformations, provide a lot of Instruction-Level Parallelism (ILP). Secondly, most of those inner loops are relatively simple. When the loops include conditional statements, these can be implemented by means of predication [37] instead of complex control flow. Furthermore, none or very few loops contain multiple exits or continuation points in the form of, e.g., break or continue statements as in the C language. Moreover, after inlining, the loops are free of function calls. Finally, the loops are not regular or homogeneous enough to benefit from vector computing, as on the EVP [3] or on Ardbeg [62]. When there is enough regularity and Data-Level Parallelism (DLP) in the loops of an application, vector computing can typically exploit it more efficiently than what can be achieved by converting the DLP into ILP and exploiting that on a CGRA. So in short, CGRAs (with limited DLP support) are ideally suited for applications whose time-consuming parts have manifest behavior, large amounts of ILP, and limited amounts of DLP.
In the remainder of this chapter, Section 2 presents the fundamental properties of CGRAs. Section 3 gives an overview of the design options for CGRAs. This overview helps designers in evaluating whether or not CGRAs are suited for their applications and design requirements, and if so, which CGRA designs are most
Fig. 1 An example clustered VLIW architecture with two RFs and eight ISs. Solid directed edges denote physical connections. Black and white small boxes denote input and output ports, respectively. There is a one-to-one mapping between input and output ports and physical connections.
suited. After the overview, Section 4 presents a case study on the ADRES CGRA architecture. This study serves two purposes. First, it illustrates the extent to which source code needs to be tuned to map well onto CGRA architectures. As we will show, this is an important aspect of using CGRAs, even when good compiler support is available and when a very flexible CGRA is targeted, i.e., one that puts very few restrictions on the loop bodies that it can accelerate. Second, our use case illustrates how Design Space Exploration (DSE) is necessary to instantiate optimized designs from parameterizable and customizable architecture templates such as the ADRES architecture template. Some conclusions are drawn in Section 5.
2 CGRA Basics

CGRAs focus on the efficient execution of the type of loops discussed in the previous section. By neglecting non-loop code and outer-loop code, which are assumed to be executed on other cores, CGRAs can take the VLIW principles for exploiting ILP in loops a step further to consume less energy and deliver higher performance, without compromising on available compiler support. Figures 1 and 2 illustrate this.
Higher performance for high-ILP loops is obtained through two main features that separate CGRA architectures from VLIW architectures. First, CGRA architectures typically provide more Issue Slots (ISs) than typical VLIWs do. In the CGRA literature, other commonly used terms for CGRA ISs are Arithmetic-Logic Units (ALUs), Functional Units (FUs), and Processing Elements (PEs). Conceptually, these terms all denote the same: logic on which an instruction can be executed, typically one per cycle. For example, a typical ADRES [6–8, 16, 38, 40–42] CGRA consists of 16 ISs, whereas the TI C64 features 8 slots, and the NXP TriMedia features only 5 slots. The higher number of ISs directly allows higher IPCs, and hence higher performance, to be reached, as indicated by Equation (1). To support these higher IPCs, the bandwidth to memory is increased by having more load/store ISs than on a typical VLIW, and by special memory hierarchies as found on ASIPs, ASICs, and other DSPs. These include FIFOs, stream buffers, scratch-pad memories, etc.
Secondly, CGRA architectures typically provide a number of direct connections between the ISs that allow data to "flow" from one IS to another without needing to pass through a Register File (RF). As a result, fewer register copy operations
Fig. 2 Part (a) shows an example CGRA with 16 ISs and 4 RFs, in which dotted edges denote conceptual connections that are implemented by physical connections and muxes as in part (b). Part (b) shows the connectivity of register files and issue slots: muxes select among connections from issue slot outputs and register file outputs, and feed register file inputs and issue slot inputs.
need to be executed in the ISs, which reduces the IC term in Equation (1) and frees ISs for more useful computations.
Higher energy efficiency is obtained through several features. Because of the direct connections between ISs, less data needs to be transferred into and out of RFs. This saves considerable energy. Also, because the ISs are arranged in a 2D matrix, small RFs with few ports can be distributed in between the ISs, as depicted in Figure 2. This contrasts with the many-ported RFs in (clustered) VLIW architectures, which basically feature a one-dimensional design, as depicted in Figure 1. The distributed CGRA RFs consume considerably less energy. Finally, by not supporting control flow, the instruction memory organization can be simplified. In statically reconfigurable CGRAs, this memory is nothing more than a set of configuration bits that remain fixed for the whole execution of a loop. Clearly this is very energy-efficient. Other CGRAs, called dynamically reconfigurable CGRAs, feature a form of distributed level-0 loop buffers [35] or other small controllers that fetch new configurations every cycle from simple configuration buffers. To support loops that include control flow and conditional operations, the compiler then replaces that control flow by data flow by means of predication [37] or other mechanisms. In this way
CGRAs differ from VLIW processors, which typically feature a power-hungry combination of an instruction cache, instruction decompression and decoding pipeline stages, and a non-trivial update mechanism of the program counter.
There are two main drawbacks to CGRA architectures. Firstly, because they can only execute loops, they need to be coupled to other cores on which all other parts of the program are executed. In some designs, this coupling introduces run-time and design-time overhead. Secondly, as clearly visible in the example CGRA of Figure 2, the interconnect structure of a CGRA is vastly more complex than that of a VLIW. On a VLIW, scheduling an instruction on some IS automatically implies the reservation of connections between the RF and the IS and of the corresponding ports. On CGRAs, this is not the case. Because there is no one-to-one mapping between connections and input/output ports of ISs and RFs, connections need to be reserved explicitly by the compiler or programmer together with ISs, and the data flow needs to be routed explicitly over the available connections. This can be done, for example, by programming switches and multiplexors (a.k.a. muxes) explicitly, like the ones depicted in Figure 2(b). Consequently, more complex compiler technology than that of VLIW compilers is needed to automate the mapping of code onto a CGRA. Moreover, writing assembly code for CGRAs ranges from very difficult to virtually impossible, depending on the type of reconfigurability and on the form of processor control.
Having explained these fundamental concepts that differentiate CGRAs from VLIWs, we can now also differentiate them from Field-Programmable Gate Arrays (FPGAs), from which the name CGRA actually comes. Whereas FPGAs feature bit-wise logic in the form of Look-Up Tables (LUTs) and switches, CGRAs feature more energy-efficient and area-conscious word-wide ISs, RFs, and interconnections. Hence the name coarse-grained array architecture. As there are far fewer ISs on a CGRA than there are LUTs on an FPGA, the number of bits required to configure the CGRA ISs, muxes, and RF ports is typically orders of magnitude smaller than on FPGAs. If this number becomes small enough, dynamic reconfiguration can be possible every cycle. So in short, CGRAs can be seen as statically or dynamically reconfigurable coarse-grained FPGAs, or as 2D, highly clustered, loop-only VLIWs with direct interconnections between ISs that need to be programmed explicitly.
3 CGRA Design Space

The large design space of CGRA architectures features many design options. These include the way in which the CGRA is coupled to a main processor, the types of interconnections and computation resources used, the reconfigurability of the array, the way in which the execution of the array is controlled, support for different forms of parallelism, etc. This section discusses the most important design options and their influence on important aspects such as performance, power efficiency, compiler friendliness, and flexibility. In this context, higher flexibility means placing fewer restrictions on the loop bodies that can be mapped onto a CGRA.
Fig. 3 A TinyRISC main processor loosely coupled to a MorphoSys CGRA array (components: memory controller, TinyRISC core processor with caches, 8x8 MorphoSys CGRA, double frame buffer, DMA controller, and configuration memory). Note that the main data memory (cache) is not shared and that no IS hardware or registers are shared between the main processor and the CGRA. Thus, both can run concurrent threads.
Our overview of design options is not exhaustive. Its scope is limited to the most important features of CGRA architectures that feature a 2D array of ISs. However, the distinction between 1D VLIWs and 2D CGRAs is anything but well-defined. The reason is that this distinction is not simply a layout issue, but one that also concerns the topology of the interconnects. Interestingly, this topology is precisely one of the CGRA design options with a large design freedom.
3.1 Tight versus Loose Coupling

Some CGRA designs are coupled loosely to main processors. For example, Figure 3 depicts how the MorphoSys CGRA [29] is connected as an external accelerator to a TinyRISC Central Processing Unit (CPU). The CPU is responsible for executing non-loop code, for initiating DMA data transfers to and from the CGRA and the buffers, and for initiating the operation of the CGRA itself by means of special instructions added to the TinyRISC ISA. This type of design offers the advantage that the CGRA and the main CPU can be designed independently, and that both can execute code concurrently, thus delivering higher parallelism and higher performance. For example, using the double frame buffers [29] depicted in Figure 3, the MorphoSys CGRA can operate on data in one buffer while the main CPU initiates the necessary DMA transfers to the other buffer for the next loop or the next set of loop iterations. One drawback is that any data that needs to be passed from non-loop code to loop code must be moved by means of DMA transfers. This can result in a large overhead, e.g., when frequent switching between non-loop code and loops with few iterations occurs and when the loops consume scalar values computed by non-loop code.
By contrast, an ADRES CGRA is coupled tightly to its main CPU. A simplified ADRES is depicted in Figure 4. Its main CPU is a VLIW consisting of the shared RF and the top row of CGRA ISs. In the main CPU mode, this VLIW executes
Fig. 4 A simplified picture of an ADRES architecture. In the main processor mode, the top row of ISs operates like a VLIW on the data in the shared RF and in the data memories, fetching instructions from an instruction cache. When the CGRA mode is initiated with a special instruction in the main VLIW ISA, the whole array starts operating on data in the distributed RFs, in the shared RF and in the data memories. The memory port in IS 0 is also shared between the two operating modes. Because of the resource sharing, only one mode can be active at any point in time.
instructions that are fetched from a VLIW instruction cache and that operate on data in the shared RF. The idle parts of the CGRA are then disabled by clock-gating to save energy. By executing a start_CGRA instruction, the processor switches to CGRA mode, in which the whole array, including the shared RF and the top row of ISs, executes a loop for which it gets its configuration bits from a configuration memory. This memory is omitted from the figure for the sake of simplicity.
The drawback of this tight coupling is that, because the CGRA and the main processor mode share resources, they cannot execute code concurrently. However, the tight coupling also has advantages. Scalar values that have been computed in non-loop code can be passed from the main CPU to the CGRA without any overhead, because those values are already present in the shared RF or in the shared memory banks. Furthermore, using shared memories and an execution model of exclusive execution in either main CPU or CGRA mode significantly eases the automated co-generation of main CPU code and CGRA code in a compiler, and it avoids the run-time overhead of transferring data. Finally, on the ADRES CGRA, switching between the two modes takes only two cycles, so the run-time overhead is minimal.
Silicon Hive CGRAs [9, 10] do not feature a clear separation between the CGRA accelerator and the main processor. Instead, there is just a single processor that can be programmed at different levels of ILP, i.e., at different instruction word widths. This allows for a very simple programming model, with all the programming and performance advantages of the tight coupling of ADRES. Compared to ADRES, however, the lack of two distinct modes makes it more difficult to implement coarse-grained clock-gating or power-gating, i.e., gating of whole sets of ISs combined instead of separate gating of individual ISs.
Somewhere in between loose and tight coupling is the PACT XPP design [45], in which the array consists of simpler ISs that can operate like a true CGRA, as well
456
Bjorn De Sutter, Praveen Raghavan, Andy Lambrechts
as of more complex ISs that are in fact full-featured small RISC processors that can run independent threads in parallel with the CGRA. As a general rule, looser coupling potentially enables more Thread-Level Parallelism (TLP) and it allows for a larger design freedom. Tighter coupling can minimize the per-thread run-time overhead as well as the compile-time overhead. This is in fact no different from other multi-core or accelerator-based platforms.
3.2 CGRA Control

There exist many different mechanisms to control how code gets executed on CGRAs, i.e., to control which operation is issued on which IS at which time and how data values are transferred from producing operations to consuming ones. Two important aspects of CGRAs that drive different methods of control are reconfigurability and scheduling. Both can be static or dynamic.

3.2.1 Reconfigurability

Some CGRAs, like ADRES, Silicon Hive, and MorphoSys, are fully dynamically reconfigurable: exactly one full reconfiguration takes place for every execution cycle. Of course, no reconfiguration takes place in cycles in which the whole array is stalled. Such stalls can happen, e.g., because memory accesses take longer than expected in the schedule as a result of a cache miss or a memory bank access conflict. This cycle-by-cycle reconfiguration is similar to the fetching of one VLIW instruction per cycle, but on these CGRAs the fetching is simpler, as it only iterates through a loop body consisting of straight-line CGRA configurations without control flow.
Other CGRAs, like the KressArray [25–27], are fully statically reconfigurable, meaning that the CGRA is configured before a loop is entered and no reconfiguration takes place during the loop at all. Still other architectures feature hybrid reconfigurability. The RaPiD [15, 19] architecture features partial dynamic reconfigurability, in which part of the bits are statically reconfigurable and another part is dynamically reconfigurable and controlled by a small sequencer. Yet another example is the PACT architecture, in which the CGRA itself can initiate events that invoke (partial) reconfiguration. This reconfiguration consumes a significant amount of time, however, so it is advisable to avoid it if possible and to use the CGRA as a statically reconfigurable one.
In statically reconfigured CGRAs, each resource performs a single task for the whole duration of the loop. In that case, the mapping of software onto hardware becomes purely spatial, as illustrated in Figure 5(a). In other words, the mapping problem becomes one of placement and routing, in which instructions and data dependencies between instructions have to be mapped onto a 2D array of resources. For these CGRAs, compiler techniques similar to hardware synthesis techniques can be used, such as those used in FPGA placement and routing [4].
Fig. 5 Part (a) shows a spatial mapping of a sequence of four instructions on a statically reconfigurable 2x2 CGRA. Edges denote dependencies, with the edge from instruction 3 to instruction 0 denoting that instruction 0 from iteration i depends on instruction 3 from iteration i − 1. So only one out of four ISs is utilized per cycle. Part (b) shows a temporal mapping of the same code on a dynamically reconfigurable CGRA with only one IS. The utilization is higher here, at 100%.
By contrast, dynamic reconfigurability enables the programmer to use hardware resources for multiple different tasks during the execution of a loop, or even during the execution of a single loop iteration. In that case, the software mapping problem becomes a spatial and temporal mapping problem, in which the operations and data transfers not only need to be placed and routed on and over the hardware resources, but also need to be scheduled. A contrived example of a temporal mapping is depicted in Figure 5(b). Most compiler techniques [16, 18, 21, 42, 44, 46, 47] for these architectures also originate from the FPGA placement and routing world. For CGRAs, the array of resources is not treated as a 2D spatial array, but as a 3D spatial-temporal array, in which the third dimension models time in the form of execution cycles. Scheduling in this dimension is often based on techniques that combine VLIW scheduling techniques, such as modulo scheduling [50, 56], with FPGA synthesis-based techniques [4]. Still other compiler techniques exist that are based on constraint solving [56] or on integer linear programming [1, 64].
The most important advantage of static reconfigurability is the lack of reconfiguration overhead, in particular in terms of power consumption. For that reason, large arrays can be used that are still power-efficient. The disadvantage is that, even in large arrays, the amount of resources constrains which loops can be mapped. Dynamically reconfigurable CGRAs can overcome this problem by spreading the computations of a loop iteration over multiple configurations. Thus a small dynamically reconfigurable array can execute larger loops. The loop size is then not limited by the array size, but by the array size times the depth of the reconfiguration memories. For reasons of power efficiency, this depth is also limited, typically to tens or hundreds of configurations, which suffices for most if not all inner loops.
A potential disadvantage of dynamically reconfigurable CGRAs is the power consumption of the configuration memories, even for small arrays, and of the configuration fetching mechanism. This disadvantage can be tackled in different ways. ADRES and MorphoSys tackle it by not allowing control flow in the loop bodies, thus enabling the use of very simple, power-efficient configuration fetching techniques similar to level-0 loop buffering [35]. Whenever control flow is found in loop bodies, such as for conditional statements, this control flow first needs to be converted into data flow, for example by means of predication and hyperblock
formation [37]. While these techniques can introduce some initial overhead in the code, this overhead typically is more than compensated by the fact that a more efficient CGRA design can be used. The MorphoSys design takes this reduction of the configuration fetching logic even further by limiting the supported code to Single Instruction Multiple Data (SIMD) code. In the two supported SIMD modes, all ISs in a row or all ISs in a column perform identical operations, so only one IS configuration needs to be fetched per row or column. As already mentioned, the RaPiD architecture limits the number of configuration bits to be fetched by making only a small part of the configuration dynamically reconfigurable. Kim et al. provide yet another solution, in which the configuration bits of one column in one cycle are reused for the next column in the next cycle [31]. Furthermore, they also propose to reduce the power consumption in the configuration memories by compressing the configurations [30]. Still, dynamically reconfigurable designs exist that put no restrictions on the code to be executed and that even allow control flow in the inner loops. The Silicon Hive design is one such design. Unfortunately, no numbers on the power consumption overhead of this design choice are publicly available.
A general rule is that limited reconfigurability puts more constraints on the types and sizes of loops that can be mapped. Which design provides the highest performance or the highest energy efficiency depends, amongst others, on the variation in loop complexity and loop size present in the applications to be mapped onto the CGRA. With large statically reconfigurable CGRAs, it is only possible to achieve high utilization for all loops in an application if all those loops have similar complexity and size, or if they can be made so with loop transformations, and if the iterations are not dependent on each other through long-latency dependency cycles (as was the case in Figure 5). Dynamically reconfigurable CGRAs, by contrast, can also achieve high average utilization over loops of varying sizes and complexities, and with inter-iteration dependencies. That way, dynamically reconfigurable CGRAs can achieve higher energy efficiency in the data path, at the expense of higher energy consumption in the control path. Which design option is best thus also depends on the process technology used, and in particular on the ability to perform clock or power gating and on the ratio between active and passive power (a.k.a. leakage).

3.2.2 Scheduling and Issuing

Both with dynamic and with static reconfigurability, the execution of operations and of data transfers needs to be controlled. This can be done statically in a compiler, similar to the way in which operations from static code schedules are scheduled and issued on VLIW processors [20], or dynamically, similar to the way in which out-of-order processors issue instructions when their operands become available [55]. Many combinations of static and dynamic reconfiguration and of static and dynamic scheduling exist.
A first class consists of dynamically scheduled, dynamically reconfigurable CGRAs like the TRIPS architecture [24, 52]. For this architecture, the compiler
determines on which IS each operation is to be executed and over which connections data is to be transferred from one IS to another. So the compiler performs placement and routing. All scheduling (including the reconfiguration) is dynamic, however, as in regular out-of-order superscalar processors [55]. TRIPS mainly targets general-purpose applications, in which unpredictable control flow makes the generation of high-quality static schedules difficult if not impossible. Such applications most often provide relatively limited ILP, for which large arrays of computational resources are not efficient. So instead a small, dynamically reconfigurable array is used, for which the run-time cost of dynamic reconfiguration and scheduling is acceptable.
A second class of dynamically reconfigurable architectures avoids the overhead of dynamic scheduling by supporting VLIW-like static scheduling [20]. Instead of doing the scheduling in hardware, where the scheduling logic burns power, the scheduling for the ADRES, MorphoSys, and Silicon Hive architectures is done by a compiler. Compilers can do this efficiently for loops with regular, predictable behavior and high ILP, as found in many DSP applications. As for VLIW architectures, software pipelining [50, 56] is very important for exposing the ILP in software kernels, so most compiler techniques [16, 18, 21, 42, 44, 46, 47] for statically scheduled CGRAs implement some form of software pipelining.
A final class of CGRAs comprises the statically reconfigurable, dynamically scheduled architectures, such as KressArray or PACT (neglecting the time-consuming partial reconfigurability of the PACT). The compiler performs placement and routing, and the code execution progress is guided by tokens or event signals that are passed along with the data. Thus the control is dynamic, and it is distributed over the token or event path, similar to the way in which transport-triggered architectures [14] operate. These statically reconfigurable CGRAs do not require software pipelining techniques because there is no temporal mapping; instead, the spatial mapping and the control implemented in the tokens or event signals implement a hardware pipeline.
We can conclude by noting that, as in other architecture paradigms such as VLIW processing or superscalar out-of-order execution, dynamically scheduled CGRAs can deliver higher performance than statically scheduled ones for control-intensive code with unpredictable behavior. On dynamically scheduled CGRAs, the code path that gets executed in an iteration determines the execution time of that iteration, whereas on statically scheduled CGRAs, the combination of all possible execution paths (including the slowest path, which might be executed infrequently) determines the execution time. Thus, dynamically scheduled CGRAs can provide higher performance for some applications. However, the power efficiency will then typically also be poorer, because more power is consumed in the control path. Again, the application domain determines which design option is most appropriate.

3.2.3 Thread-level and Data-level Parallelism

Another important aspect of control is the possibility to support different forms of parallelism. Obviously, loosely-coupled CGRAs can operate in parallel with the
main CPU, but one can also try to use the CGRA resources to implement SIMD or to run multiple threads concurrently within the CGRA.
When dynamic scheduling is implemented via distributed event-based control, as in KressArray or PACT, implementing TLP is relatively simple and cheap. For small enough loops whose combined resource use fits on the CGRA, it suffices to map independent thread controllers on different parts of the distributed control. For architectures with centralized control, the only option to run threads in parallel is to provide additional controllers or to extend the central controller, for example to support parallel execution modes. While such extensions will increase the power consumption of the controller, the newly supported modes might suit certain code fragments better, thus saving data path energy and configuration fetch energy.
The TRIPS controller supports four operation modes [52]. In the first mode, all ISs cooperate to execute one thread. In the second mode, the four rows execute four independent threads. In the third mode, fine-grained multi-threading [55] is supported by time-multiplexing all ISs over multiple threads. Finally, in the fourth mode, each row executes the same operation on each of its ISs, thus implementing SIMD in a similar, fetch-power-efficient manner as the two modes of the MorphoSys design. Thus, for each loop or combination of loops in an application, the TRIPS compiler can exploit the most suitable form of parallelism.
The Raw architecture [58] is a hybrid between a many-core architecture and a CGRA architecture, in the sense that it does not feature a 2D array of ISs, but rather a 2D array of tiles that each consist of a simple RISC processor. The tiles are connected to each other via a mesh interconnect, and transporting data over this interconnect to neighboring tiles does not consume more time than retrieving data from the RF in the tile. Moreover, the control of the tiles is such that they can operate independently or synchronized in a lock-step mode. Thus, multiple tiles can cooperate to form a dynamically reconfigurable CGRA. A programmer can hence partition the 2D array of tiles into several, potentially differently sized, CGRAs that each run an independent thread. This provides very high flexibility to balance the available ILP inside threads with the TLP of the combined threads.
Other architectures, like the current ADRES and Silicon Hive, do not support (hardware) multi-threading within one CGRA core at all. The first solution to running multiple threads with these designs is to incorporate multiple CGRA accelerator cores in a System-on-Chip (SoC). The advantage is then that each accelerator can be customized for a certain class of loop kernels. Also, ADRES and Silicon Hive are architecture templates, which enables CGRA designers to customize their CGRA cores with the appropriate amount of DLP for each class of loop kernels, in the form of SIMD or subword parallelism. Alternatively, TLP can be converted into ILP and DLP by combining, at compile time, kernels of multiple threads, scheduling them together as one kernel, and selecting the appropriate combination of scheduled kernels at run time [53].
3.3 Interconnects and Register Files

3.3.1 Connections

A wide range of connections can connect the ISs of a CGRA with each other and with the RFs, other memories, and IO ports. Buses, point-to-point connections, and crossbars are all used, in various combinations and in different topologies. For example, some designs, like MorphoSys and the most common ADRES and Silicon Hive designs, feature a densely connected mesh network of point-to-point interconnects in combination with sparser buses that connect ISs further apart. Thus the number of long, power-hungry connections is limited. Multiple studies of point-to-point mesh-like interconnects as in Figure 10 have been published in the past [8, 29, 33, 39]. Other designs, like RaPiD, feature a dense network of segmented buses. Typically, the use of crossbars is limited to very small instances because large ones are too power-hungry. Fortunately, large crossbars are most often not needed, because many application kernels can be implemented as systolic algorithms, which map well onto mesh-like interconnects as found in systolic arrays [48].
Unlike crossbars and even buses, mesh-like networks of point-to-point connections scale better to large arrays without introducing too much delay or power consumption. For statically reconfigurable CGRAs, this is beneficial. Buses and other long interconnects connect whole rows or columns to complement short-distance mesh-like interconnects. The negative effects that such long interconnects can have on power consumption or on the obtainable clock frequency can be avoided by segmentation or by pipelining. In the latter case, pipelining latches are added along the connections or in between muxes and ISs. Our experience, as presented in Section 4.2.2, is that this pipelining does not necessarily lead to lower IPCs in CGRAs. This is different from out-of-order or VLIW architectures, where deeper pipelining increases the branch misprediction latency [55]. Instead, at least some CGRA compilers succeed in exploiting the pipelining latches as temporary storage, rather than being hampered by them. This is the case in compiler techniques like [16, 42] that are based on FPGA synthesis methods in which RFs and pipelining latches are treated as interconnection resources that span multiple cycles, instead of as explicit storage resources. This treatment naturally fits the 3D array modeling of resources along two spatial dimensions and one temporal dimension. Consequently, those compiler techniques can exploit pipelining latches naturally, similarly to using storage space in distributed RFs.

3.3.2 Register Files

Compilers for CGRA architectures place operations in ISs, thus also scheduling them, and route the data flow over the connections between the ISs. Those connections may be direct connections, latched connections, or even connections that go through RFs. Therefore most CGRA compilers treat RFs not as temporary storage, but as interconnects that can span multiple cycles. Thus the RFs can be treated
A direct consequence of this compiler approach is that the design space freedom of interconnects extends to the placement of RFs in between ISs. During the DSE for a specific CGRA instance within a CGRA design template such as the ADRES or Silicon Hive templates, both the real connections and the RFs have to be explored, and that has to be done together. Just like the number of real interconnect wires and their topology, the size of RFs, their location and their number of ports then contribute to the interconnectivity of the ISs. We refer to [8, 39] for DSEs that study both RFs and interconnects.

Besides their size and their number of ports, another important aspect of RFs is that they can be rotating [51]. The power and delay overhead of rotation is very small in distributed RFs, simply because these RFs are small themselves. Still, rotation can provide important functionality. Consider a dynamically reconfigurable CGRA on which a loop is executed that iterates over x configurations, i.e., each iteration takes x cycles. That means that for a write port of an RF, every x cycles the same address bits are fetched from the configuration memory to configure the address set at that port. In other words, every x cycles a new value is written into the register specified by that same address. This implies that values can stay in the same register for at most x cycles; after that, they are overwritten by a new value from the next iteration. In many loops, however, some values have a lifetime that spans more than x cycles, because it spans multiple loop iterations. To avoid having to insert additional data transfers into the loop schedules, rotating registers can be used. At the end of every iteration of the loop, all values in rotating registers rotate into another register, which makes sure that old values are copied to locations where they are not overwritten by newer values.
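To make this rotation mechanism concrete, the following minimal C sketch models the addressing arithmetic of a rotating RF. It is an illustration under assumptions only: the register count, the rotation direction and all names are hypothetical and do not describe any particular design.

#include <stdio.h>

#define NREGS 4   /* assumed number of physical registers in the rotating RF */

/* With rotation, the physical register addressed by a fixed logical
   address shifts by one position every iteration, so the value written
   in iteration i survives until it is overwritten in iteration i+NREGS. */
static int physical_reg(int logical, int iteration) {
    return (logical + iteration) % NREGS;
}

int main(void) {
    for (int it = 0; it < 6; it++)
        printf("iteration %d: logical reg 0 -> physical reg %d\n",
               it, physical_reg(0, it));
    return 0;
}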
3.3.3 Predicates, Events and Tokens

To complete this overview of CGRA interconnects, we want to point out that it can be very useful to have interconnects of different widths. The data path width can be as small as 8 bits or as wide as 64 or 128 bits. The latter widths are typically used to pass SIMD data. However, as not all data is SIMD data, not all paths need to have the full width. Moreover, most CGRA designs and the code mapped onto them feature signals that are only one or a few bits wide, such as predicates, events or tokens. Using the full-width datapath for these narrow signals wastes resources. Hence it is often useful to add a second, narrow datapath for control signals like tokens or events and for predicates. How dense that narrow datapath has to be depends on the type of loops one wants to run on the CGRA. For example, multimedia coding and decoding typically includes more conditional code than SDR baseband processing. Hence the design of, e.g., different ADRES architectures for multimedia and for SDR resulted in different predicate data paths being used, as illustrated in Section 4.2.1.

At this point, it should be noted that the use of predicates is fundamentally not that different from the use of events or tokens. In KressArray or PACT, events and tokens are used, among other things, to determine at run time which data is selected to be used later in the loop. For example, for a C expression like x + ((a>b) ? y + z : y - z), one IS will first compute the addition y+z, one IS will compute the subtraction y-z, and one IS will compute the greater-than condition a>b. The result of the latter computation generates an event that is fed to a multiplexor to select which of the two other computed values, y+z or y-z, is transferred to yet another IS on which the addition to x will be performed. Unlike the muxes in Figure 2(b), which are controlled by bits fetched from the configuration memory, those event-controlled multiplexors are controlled by the data path.

In the ADRES architecture, the predicates guard the operations in ISs, and they serve as enable signals for RF write ports. Furthermore, they are also used to control special select operations that pass one of two input operands to the output port of an IS. Fundamentally, an event-controlled multiplexor performs exactly the same function as the select operation. So the difference between events or tokens and predicates is really only that the former term and implementation are used in dynamically scheduled designs, while the latter term is used in statically scheduled designs.
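As a concrete illustration of this decomposition, the following C function spells out the per-IS operations for the expression discussed above; the variable names and the assignment of operations to ISs are purely illustrative.

/* decomposition of x + ((a>b) ? y+z : y-z) into per-IS operations */
int mapped_expression(int x, int y, int z, int a, int b) {
    int add  = y + z;             /* computed on one IS                    */
    int sub  = y - z;             /* computed on a second IS               */
    int cond = (a > b);           /* computed on a third IS; the result
                                     acts as the event/token or predicate  */
    int sel  = cond ? add : sub;  /* event-controlled mux, or a select
                                     operation on an ADRES IS              */
    return x + sel;               /* final addition on yet another IS      */
}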
3.4 Computational Resources

Issue slots are the computational resources of CGRAs. Over the last decade, numerous designs of such issue slots have been proposed, under different names that include PEs, FUs, ALUs, and flexible computation components. Figure 6 depicts some of them. For all of the possible designs, it is important to know the context in which these ISs have to operate, such as the interconnects connecting them, the control type of the CGRA, etc.

Figure 6(a) depicts the IS of a MorphoSys CGRA. All 64 ISs in this homogeneous CGRA are identical and include their own local RF. This is no surprise, as the two MorphoSys SIMD modes (see Section 3.2.1) require that all ISs of a row or of a column execute the same instruction, which clearly implies homogeneous ISs.

In contrast, almost all features of an ADRES IS, as depicted in Figure 6(b), can be chosen at design time, and can differ between the ISs of a CGRA, which then becomes heterogeneous: the number of ports, whether or not there are latches between the multiplexors and the combinatorial logic that implements the operations, the set of operations supported by each IS, how the local register files are connected to ISs and possibly shared between ISs, etc. As long as the design instantiates the ADRES template, the ADRES tool flow will be able to synthesize the architecture and to generate code for it. A similar design philosophy is followed by the Silicon Hive tools. Of course this requires more generic compiler techniques than those that generate code for the predetermined homogeneous ISs of, e.g., the MorphoSys CGRA. Given the state of the art in compiler technology for this type of architecture, the advantages of this freedom are (1) the possibility to design different instances optimized for certain application domains, (2) the knowledge that the features of those designs will be exploited, and (3) the ability to compile loops that feature other forms of ILP than DLP. DLP can still be supported, of course, by simply incorporating SIMD-capable (a.k.a. subwordparallel) ISs of, e.g., 4x16 bits wide.
Fig. 6 Four different structures of ISs proposed in the literature. Part (a) displays a fixed MorphoSys IS, including its local RF. Part (b) displays the fully customizable ADRES IS, which can connect to shared or non-shared local RFs. Part (c) depicts the IS structure proposed by Galanis et al. [23], and (d) depicts a row of four ISs that share a multiplier [28].
The drawback of this design freedom is that, at least today, the compiler techniques are so generic that they miss optimization opportunities: they do not exploit regularity in the designed architectures, neither to speed up compilation nor to produce better schedules.

Figure 6(c) depicts the IS proposed by Galanis et al. [23]. Again, all ISs are identical. In contrast to the MorphoSys design, however, these ISs consist of several ALUs and multipliers with direct connections between them and their local RFs. These direct connections within each IS can take care of a lot of data transfers, thus freeing time on the shared bus-based interconnect that connects all ISs. In this way, the local interconnect within each IS compensates for the lack of a scaling global interconnect. One advantage of this approach is that the compiler can be tuned specifically for this combination of local and global connections and for the fact that it does not need to support heterogeneous ISs. Whether or not this type of design is more power-efficient than CGRAs with more design freedom and potentially more heterogeneity is unclear at this point in time. At least, we know of no studies from which, e.g., utilization numbers can be derived that would allow us to compare the two approaches.

With respect to utilization, it is clear that the designs of Figures 6(a) and 6(b) will only be utilized well if a lot of multiplications need to be performed.
Otherwise, the area-consuming multipliers remain unused. To work around this problem, the sharing of large resources such as multipliers between ISs has been proposed in the RSPA CGRA design [28]. Figure 6(d) depicts one row of ISs that do not contain multipliers internally, but that are connected to a shared multiplier through switches and a shared bus. The advantage of this design, compared to an ADRES design in which each row features 3 pure ALU ISs and 1 ALU+MULT IS, is that it allows the compiler to schedule multiplications on all ISs (albeit only one per cycle), whereas this scheduling freedom would be limited to one IS per row in the ADRES design. To allow this scheduling freedom, however, a significant amount of resources in the form of switches and a special-purpose bus needs to be added to the row. While we lack experimental data to back up this claim, we firmly believe that a similar increase in scheduling freedom can be obtained in the aforementioned 3+1 ADRES design by simply extending an existing ADRES interconnect with a similar amount of additional resources. In the ADRES design, that extension would then also benefit operations other than multiplications.

The optimal number of ISs for a CGRA depends on the application domain, on the reconfigurability, as well as on the IS functionality and on the DLP available in the form of subwordparallelism. As illustrated in Section 4.2.2, a typical ADRES would consist of 4x4 ISs [7, 38]. TRIPS also features 4x4 ISs. MorphoSys provides 8x8 ISs, but that is because its DLP is implemented as SIMD over multiple ISs, rather than as subwordparallelism within ISs. In our experience, scaling dynamically reconfigurable CGRA architectures such as ADRES to very large arrays (8x8 or larger) does not make sense, even with scalable interconnects like mesh or mesh-plus interconnects. Even in loops with high ILP, utilization drops significantly on such large arrays [43]. It is not yet clear what causes this lower utilization, and there might be several reasons. These include a lack of memory bandwidth, the possibility that the compiler techniques [16, 42] simply do not scale to such large arrays, and the fact that the relative connectivity in such large arrays is lower. Simply stated, when a mesh interconnects all ISs to their neighbors, each IS not on the side of the array is connected to 4 other ISs out of 16 in a 4x4 array, i.e., to 25% of all ISs, but to only 4 out of 64 ISs in an 8x8 array, i.e., to 6.25% of all ISs.
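This connectivity argument is easy to quantify. The small C program below (a sketch; the function name is ours) computes the fraction of all ISs that an interior IS reaches directly in an n x n mesh.

#include <stdio.h>

/* fraction of all ISs that an interior IS reaches directly
   through its 4 neighbors in an n x n mesh */
static double relative_connectivity(int n) {
    return 4.0 / (double)(n * n);
}

int main(void) {
    printf("4x4: %.2f%%\n", 100.0 * relative_connectivity(4));  /* 25.00% */
    printf("8x8: %.2f%%\n", 100.0 * relative_connectivity(8));  /*  6.25% */
    return 0;
}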
To conclude this section, we want to mention that, just like in any other type of processor, it makes sense to pipeline complex combinatorial logic, e.g., as found in multipliers. There are no fundamental obstacles to doing this, and it can lead to significant increases in utilization and clock frequency.

3.5 Memory Hierarchies

CGRAs have a large number of ISs that need to be fed with data from the memory. Therefore the data memory sub-system is a crucial part of any CGRA design. Many reconfigurable architectures feature multiple independent memory banks or blocks to achieve high data bandwidth. Exploiting those banks automatically in a compiler is a problem that has not yet been fully solved.
The Raw architecture features an independent memory block in each tile, for which Barua developed a method called modulo unrolling to disambiguate and assign data to the different banks [2]. However, this technique can only handle array references with affine index expressions in the loop induction variables.

MorphoSys has a 256-bit wide frame buffer between the main memory and the reconfigurable array to feed data to the ISs operating in SIMD mode [29]. The efficient use of such a wide memory depends by and large on manual data placement and operation scheduling. Similar techniques for wide loads and stores have also been proposed for regular VLIW architectures to reduce power [49]. However, this requires the programmer or compiler to perform the data layout in memory in order to exploit the large bandwidth between the level-1 memory and the datapath.

Both Silicon Hive and PACT feature distributed memory blocks without a crossbar. A Silicon Hive programmer has to specify the allocation of data to the memories for the compiler to bind the appropriate load/store operations to the corresponding memories. Silicon Hive also supports the possibility of interfacing the memory or system bus using FIFO interfaces. This is efficient for streaming processing, but it is difficult to use when data needs to be buffered, e.g., in case of data reuse.

The ADRES architecture template provides a parameterizable Data Memory Queue (DMQ) interface to each of the different single-ported, interleaved level-1 scratch-pad memory banks [7]. The DMQ interface is responsible for resolving bank access conflicts, i.e., cases in which multiple load/store ISs want to access the same bank at the same time. Connecting all load/store ISs to all banks through a conflict resolution mechanism allows maximal freedom for data access patterns, as well as maximal freedom in the data layout in memory. The potential disadvantage of such conflict resolution is that it increases the latency of load operations. In software-pipelined code, however, increasing the individual latency of instructions most often does not degrade schedule quality, because the compiler can hide those latencies in the software pipeline. In the VLIW mode of an ADRES main processor, which accesses the same memories from code that is not software-pipelined, the conflict resolution is disabled to obtain shorter access latencies.
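To illustrate the bank-disambiguation idea behind modulo unrolling mentioned at the start of this section: with data interleaved word-wise over the banks, unrolling a loop by the number of banks makes the bank of every array reference statically known. The sketch below assumes four word-interleaved banks and a trip count that is a multiple of four; it illustrates the general idea only, not Barua's actual algorithm.

/* a[] is assumed word-interleaved over 4 banks: a[i] lives in bank i % 4 */
int sum(const int *a, int n) {   /* n assumed to be a multiple of 4 */
    int s = 0;
    for (int i = 0; i < n; i += 4) {
        s += a[i];       /* always accesses bank 0 */
        s += a[i + 1];   /* always accesses bank 1 */
        s += a[i + 2];   /* always accesses bank 2 */
        s += a[i + 3];   /* always accesses bank 3 */
    }
    return s;
}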
3.6 Compiler Support

Apart from the specific algorithms used to compile code, the major distinctions between the different existing compiler techniques relate to whether or not they support static scheduling, whether or not they support dynamic reconfiguration, whether or not they rely on special programming languages, and whether or not they are limited to specific hardware properties. Because most compiler research has been done to generate static schedules for CGRAs, we focus on those techniques in this section. As already indicated in Sections 3.2.1 and 3.2.2, many algorithms are based on FPGA placement and routing techniques [4], in combination with VLIW code generation techniques like modulo scheduling [50, 56] and hyperblock formation [37].
Whether or not compiler techniques rely on specific hardware properties is not always obvious in the literature, as not enough details are available in the descriptions of the techniques, and few techniques have been tried on a wide range of CGRA architectures. For that reason, it is very difficult to compare the efficiency (compilation time) and effectiveness (quality of the generated code) of the different techniques.

The most widely applicable static scheduling techniques use different forms of Modulo Resource Routing Graphs (MRRGs). Resource routing graphs (RRGs) are time-space graphs in which all resources (the space dimension) are modeled with vertices. There is one such vertex per resource per cycle (the time dimension) in the schedule being generated. Directed edges model the connections over which data values can flow from resource to resource. The scheduling, placement and routing problem then becomes a problem of mapping the Data Dependence Graph (DDG) of some loop body onto the RRG. Scheduling refers to finding the right cycle to perform an operation in, placement refers to finding the right IS in that cycle, and routing refers to finding connections to transfer data from producing operations to consuming operations. In the case of a modulo scheduler, the modulo constraint is enforced by modeling all resource usage in the modulo time domain. This is done by modeling the appropriate modulo reservation tables [50] on top of the RRG, hence the name MRRG.

The granularity of the MRRG vertices depends on the precise compiler algorithm. One modulo graph embedding algorithm [46] models whole ISs or whole RFs with single vertices, whereas the simulated-annealing technique in the DRESC compiler [16, 42] that targets ADRES instances models individual ports to ISs and RFs as separate vertices. Typically, fewer vertices that model larger components lead to faster compilation, because the graph mapping problem operates on a smaller graph, but also to lower code quality, because some combinations of resource usage cannot be modeled precisely. Some techniques, such as DRESC, are built on the central idea of finding the best routes to steer the placement and scheduling, thus exploring many possible routings, while others [21, 44, 46, 47] use heuristics to place and schedule the code, using routability only as a constraint during scheduling. The latter are typically much more efficient, but less effective.

MRRG-based compiler techniques are easily retargetable to a wide range of architectures, such as those of the ADRES template, and they can support many programming languages. Different architectures can simply be modeled with different MRRGs. To support different programming languages like C and Fortran, the techniques only require a compiler front-end that is able to generate DDGs for the loop bodies. Obviously, the appropriate loop transformations need to be applied before generating the DDG, in order to generate one that maps well onto the MRRG of the architecture. Such loop transformations are discussed in detail in Section 4.1.
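One possible (hypothetical) way to index MRRG vertices is sketched below in C: one vertex per resource per cycle, with time folded modulo II, so that an operation placed at cycle c occupies the same vertex as an operation placed at cycle c + II, which is exactly the modulo constraint. The structure and the indexing function are our illustration, not the data structures of any particular compiler.

/* an MRRG vertex: a resource at a cycle in the modulo time domain */
typedef struct {
    int resource;   /* IS, RF port, pipeline latch, ... (space dimension) */
    int slot;       /* cycle modulo II                   (time dimension) */
} MRRGVertex;

/* vertex index occupied by an operation on a given resource at a given
   schedule cycle, for a schedule with initiation interval II */
static int mrrg_index(int resource, int cycle, int II) {
    return resource * II + (cycle % II);
}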
Many other CGRA compiler techniques have been proposed, most of which are restricted to specific architectures. Statically reconfigurable architectures like RaPiD and PACT have been targeted by compiler algorithms [11, 18, 63] based on placement and routing techniques that also map DDGs onto RRGs. These techniques support subsets of the C programming language (no pointers, no structs, ...) and require the use of special C functions to program the IO in the loop bodies to be mapped onto the CGRA. The latter requirement follows from the specific IO support in the architectures and its modeling in the RRGs.

For the MorphoSys architecture, with its emphasis on SIMD across ISs, compiler techniques have been developed for the SA-C language [60]. In this language the supported types of available parallelism are specified by means of loop language constructs. These constructs are translated into control code for the CGRA, which is mapped onto the ISs together with the DDGs of the loop bodies.

CGRA code generation techniques based on integer linear programming have been proposed for the RSPA architecture, both for spatial [1] and for temporal mapping [64]. Basically, the ILP formulation consists of all the requirements or constraints that must be met by a valid schedule. This formulation is built from a DDG and a hardware description, and can hence be used to compile many source languages. It is unclear, however, to what extent the ILP formulation and its solution rely on specific architecture features, and hence to what extent it would be possible to retarget the ILP formulation to different CGRA designs. A similar situation holds for the constraint-based compilation method developed for the Silicon Hive architecture template [56], of which no detailed information is public.

Code generation techniques for CGRAs based on instruction-selection pattern matching and list-scheduling techniques have also been proposed [22, 23]. It is unclear to what extent these techniques rely on a specific architecture, because we know of no attempts to use them on different CGRAs, but they seem to rely heavily on the existence of a single shared bus that connects the ISs, as depicted in Figure 6(c). Similarly, the static-reconfiguration code generation technique by Lee et al. relies on CGRA rows consisting of identical ISs [34]. Because of this assumption, a two-step code generation approach can be used in which individual placements within rows are neglected in the first step and only taken care of in the second step. The first step instead focuses on optimizing the memory traffic.

Finally, compilation techniques have been developed that are really specialized for the TRIPS array layout and for its out-of-order execution [13].

One rule of thumb covers all the mentioned techniques: more generic techniques, i.e., techniques that are more flexible in targeting different architectures or different instances of an architecture template, are less efficient and often less effective in exploiting special architecture features. In other words, techniques that rely on specific hardware features, such as interconnect regularities or specific forms of IS clustering, are less flexible, but will generally be able to target those hardware features more efficiently, and often also more effectively. Vice versa, architectures with such features usually need specialized compiler techniques. This is similar to the situation for more traditional DSP or VLIW architectures.
4 Case Study: ADRES

This section presents a case study on one specific CGRA design template.
The purpose of this study is to illustrate that it is non-trivial to compile and optimize code for CGRA targets, and that, within a design template, there is a need for hardware design space exploration. Together, these points illustrate how both hardware and software designers targeting CGRAs need a deep understanding of the interaction between the architecture features and the compiler techniques used.

ADRES [6–8, 16, 38, 40–42] is an architecture design template from which dynamically reconfigurable, statically scheduled CGRAs can be instantiated. In each instance, an ADRES CGRA is coupled tightly to a VLIW processor. This processor shares data and predicate RFs with the CGRA, as well as memory ports to a multi-banked scratch-pad memory, as described in Section 3.1. The compiler-supported ISA of the design template provides instructions that are typically found in a load/store VLIW or RISC architecture, including arithmetic operations, logic operations, load/store operations, and predicate computing instructions. Additional domain-specific instructions, such as SIMD operations, are supported in the programming tools by means of intrinsics [57]. Rotating and non-rotating, shared and private local RFs can be added to the CGRA as described in the previous sections, and connected through an interconnect consisting of muxes, buses and point-to-point connections that are specified completely by the designer. Thus, the ADRES architecture template is very flexible: it offers a high degree of design freedom, and it can be used to accelerate a wide range of loops.
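As an illustration of what such intrinsics look like to a programmer, the sketch below emulates a hypothetical 2-way SIMD addition on packed 16-bit values in portable C. The type and function names are invented for this illustration and are not the actual ADRES tool-chain API; a real intrinsic would map directly onto a subwordparallel IS operation instead of being computed lane by lane.

/* two 16-bit lanes packed into one 32-bit word */
typedef unsigned int int2x16;

/* hypothetical 2-way SIMD add, written out in portable C */
static int2x16 simd_add2x16(int2x16 a, int2x16 b) {
    unsigned lo = ((a & 0xFFFFu) + (b & 0xFFFFu)) & 0xFFFFu;
    unsigned hi = ((a >> 16) + (b >> 16)) & 0xFFFFu;
    return (hi << 16) | lo;
}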
4.1 Mapping Loops onto ADRES CGRAs

The first part of this case study concerns the mapping of loops onto ADRES CGRAs, which are among the most flexible CGRAs and support a wide range of loops. This study illustrates that many loop transformations need to be applied carefully before mapping code onto ADRES CGRAs. We discuss the most important compiler transformations and, lacking a full-fledged loop-optimizing compiler, the manual loop transformations that need to be applied to source code in order to obtain high performance and high efficiency. For other, less flexible CGRAs, the need for such transformations will be even higher, because there will be more constraints on the loops to be mapped in the first place. Hence many of the discussed issues apply not only to ADRES CGRAs, but also to other CGRA architectures. We will conclude from this study that programming CGRAs with the existing compiler technology is not compatible with high programmer productivity.

4.1.1 Modulo Scheduling Algorithms for CGRAs

To exploit ILP in inner loops on VLIW architectures, compilers typically apply software pipelining by means of modulo scheduling [50, 56]. This is no different for ADRES CGRAs. In this section, we will not discuss the inner workings of modulo scheduling algorithms. What we do discuss are the consequences of using that technique for programming ADRES CGRAs.
After a loop has been modulo-scheduled, it consists of three phases: the prologue, the kernel and the epilogue. During the prologue, stages of the software-pipelined loop gradually become active. Then the loop executes the kernel in a steady-state mode in which all software pipeline stages are active, and afterwards the stages are gradually disabled during the epilogue. In the steady-state mode, a new iteration is started every II cycles, where II stands for Initiation Interval. Fundamentally, every software pipeline stage is II cycles long. The total cycle count of a loop with iter iterations that is scheduled over ps software pipeline stages is then given by

cycles_prologue + II · (iter − (ps − 1)) + cycles_epilogue.    (2)
In this formula, we neglect processor stalls caused by, e.g., memory access conflicts or cache misses. For loops with a high number of iterations, the term II · iter dominates this cycle count, and that is why modulo scheduling algorithms try to minimize II, thus increasing the IPC terms in Equation (1).

The minimal II that modulo scheduling algorithms can reach is bounded by minII = max(RecMII, ResMII). The first term, called the resource-minimal II (ResMII), is determined by the resources required by a loop and by the resources provided by the architecture. For example, if a loop body contains 9 multiplications, and there are only two ISs that can execute multiplications, then at least ⌈9/2⌉ = 5 cycles will be needed per iteration. The second term, called the recurrence-minimal II (RecMII), depends on recurrent data dependencies in a loop and on instruction latencies. Fundamentally, if an iteration of a loop depends on the previous iteration through a dependency chain with accumulated latency RecMII, it is impossible to start that iteration before at least RecMII cycles of the previous iteration have been executed.

The next section uses this knowledge to apply transformations that optimize performance according to Equation (1). To do so successfully, it is important to know that ADRES CGRAs support only one thread, and that the processor has to switch from the non-CGRA operating mode to CGRA mode and back for each inner loop. So besides minimizing the cycle count of Equation (2) to obtain higher IPCs in Equation (1), it is also important to consider the terms t_{p→p+1} in Equation (1).

4.1.2 Loop Transformations

Loop Unrolling

Loop unrolling and the induction variable optimizations that it enables can be used to minimize the number of iterations of a loop. When a loop body is unrolled x times, iter decreases by a factor x, and ResMII typically grows by a factor slightly less than x because of the induction variable optimizations and because of the ceiling operation in the computation of ResMII. By contrast, RecMII typically remains unchanged or increases only slightly, as a result of the induction variable optimizations that are enabled by loop unrolling.
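The following self-contained C sketch puts Equation (2) and the minII bound together, reusing the 9-multiplications-on-2-ISs example from above; the prologue, epilogue, recurrence and iteration numbers are made up for illustration.

#include <stdio.h>

static int ceil_div(int a, int b) { return (a + b - 1) / b; }
static int max(int a, int b)      { return a > b ? a : b; }

int main(void) {
    int ResMII = ceil_div(9, 2);       /* 9 multiplications on 2 multiplier ISs */
    int RecMII = 2;                    /* assumed recurrence bound              */
    int II     = max(RecMII, ResMII);  /* minII = max(RecMII, ResMII) = 5       */

    int iter = 100, ps = 3;            /* hypothetical loop parameters          */
    int cycles_prologue = 10, cycles_epilogue = 10;
    int total = cycles_prologue + II * (iter - (ps - 1)) + cycles_epilogue;

    printf("II = %d, total cycles = %d\n", II, total);  /* II = 5, 510 cycles */
    return 0;
}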
In resource-bound loops, ResMII > RecMII. Unrolling will then typically have little impact on the dominating term II · iter in Equation (2). However, the prologue and the epilogue will typically become longer because of loop unrolling. Moreover, an unrolled loop will consume more space in the instruction memory, which might also have a negative impact on the total execution time of the whole application. So in general, unrolling resource-bound loops is unlikely to be very effective.

In recurrence-bound loops, RecMII · iter > ResMII · iter. The right-hand side of this inequality will not increase by unrolling, while the left-hand side will be divided by the unrolling factor x. As this improvement typically compensates for the longer prologue and epilogue, we can conclude that unrolling can be an effective optimization technique for recurrence-bound loops, provided that the recurrences can be optimized with induction variable optimizations. This is no different for CGRAs than it is for VLIWs. However, for CGRAs with their larger number of ISs, it is more important, because more loops are recurrence-bound.

Loop Fusion, Loop Interchange, Loop Combination and Data Context Switching

Fusing adjacent loops with the same number of iterations into one loop can also be useful, because fusing multiple recurrence-bound loops can result in one resource-bound loop, which will have a lower overall execution time (a small code sketch follows below). Furthermore, less switching between operating modes takes place with fused loops, and hence the terms t_{p→p+1} are minimized. In addition, fewer prologues and epilogues need to be executed, which might also improve performance. This improvement will usually be limited, however, because the fused prologues and epilogues will rarely be much shorter than the sum of the original ones. Moreover, loop fusion results in a loop that is bigger than any of the original loops, so it can only be applied if the configuration memory is big enough to fit the fused loop. If that is the case, fewer loop configurations need to be stored and possibly reloaded into the memory.

Interchanging an inner and outer loop serves largely the same purpose as loop fusion. As loop interchange does not necessarily result in larger prologues and epilogues, it can be even more useful, as can the combining of nested loops into a single loop. Data-context switching [5] is a very similar technique that serves the same purpose. That technique has been used by Lee et al. for statically reconfigurable CGRAs [34], and in fact most of the loop transformations mentioned in this section can be used to target such CGRAs, as well as any other type of CGRA.
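The sketch below, with hypothetical arrays and bounds, shows the fusion idea announced above: each original loop is bound by its own recurrence, while the fused loop carries the same recurrences but twice the operations per iteration, pushing it toward the resource bound.

/* two recurrence-bound loops: each iteration depends on the previous one */
void separate(int *a, int *b, const int *x, const int *y, int n) {
    for (int i = 1; i < n; i++) a[i] = a[i-1] + x[i];
    for (int i = 1; i < n; i++) b[i] = b[i-1] + y[i];
}

/* fused: RecMII is unchanged because the two chains are independent, but
   each iteration now contains twice as many operations, so ResMII grows
   and the scheduler can fill more ISs per cycle */
void fused(int *a, int *b, const int *x, const int *y, int n) {
    for (int i = 1; i < n; i++) {
        a[i] = a[i-1] + x[i];
        b[i] = b[i-1] + y[i];
    }
}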
Live-in Variables

In our experience, there is only one caveat with the above transformations. The reason to be careful when applying them is that they can increase the number of live-in variables. A live-in variable is a variable that gets assigned a value before the loop, which is consequently used in the loop. Live-in variables can be manifest in the original source code, but they can also result from compiler optimizations that are enabled by the above loop transformations, such as induction variable optimizations and loop-invariant code motion. When the number of live-in variables increases, more data needs to be passed from the non-loop code to the loop code, which might have a negative effect on t_{p→p+1}. The existence and the scale of this effect will usually depend on the hardware mechanism that couples the CGRA accelerator to the main core. Possible such mechanisms are discussed in Section 3.1. In tightly-coupled designs like those of ADRES or Silicon Hive, passing a limited number of values from the main CPU mode to the CGRA mode does not involve any overhead: the values are already present in the shared RF. However, if their number grows too big, there will not be enough room in the shared RF, and data will have to be passed through memory, which is much less efficient. We have experienced this several times with loops in multimedia and SDR applications that were mapped onto our ADRES designs. So, even for tightly-coupled CGRA designs, the above loop transformations and the optimizations they enable need to be applied with great care.

Predication

Modulo scheduling techniques for CGRAs [16, 18, 21, 42, 46, 47] only schedule loops that are free of control flow transfers. Hence any loop body that contains conditional statements first needs to be if-converted into hyperblocks by means of predication [37]. For this reason, many CGRAs, including ADRES CGRAs, support predication.

Hyperblock formation can result in very inefficient code if a loop body contains code paths that are executed rarely. All those paths contribute to ResMII and potentially to RecMII. Hence even paths that get executed very infrequently can slow down a whole modulo-scheduled loop. Such loops can be detected with profiling, and if the data dependencies allow it, it can be useful to split them into multiple loops. For example, a first loop can contain only the code of the frequently executed paths, with a lower II than the original loop. If it turns out during the execution of this loop that in some iteration the infrequently executed code needs to be executed, the first loop is exited, and for the remaining iterations a second loop is entered that includes both the frequently and the infrequently executed code paths. Alternatively, for some loops it is beneficial to have a so-called inspector loop with a very small II that performs only the checks for all iterations. If none of the checks is positive, a second, so-called executor loop is executed that includes all the computations except the checks and the infrequently executed paths. If some checks were positive, the original loop is executed instead; a sketch follows below.

One caveat with this loop splitting is that it causes code size expansion in the CGRA instruction memories. For power consumption reasons, these memories are kept as small as possible. This means that the local improvements obtained with the loop splitting need to be balanced against the total code size of all loops that have to share these memories.
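The inspector/executor splitting described above can be sketched as follows in C; needs_slow_path, fast_path and full_path are hypothetical stand-ins for the checks, the frequently executed path and the complete, if-converted loop body.

/* hypothetical helpers standing in for the actual loop code */
extern int needs_slow_path(int v);
extern int fast_path(int v);
extern int full_path(int v);

void split_loop(const int *x, int *y, int n) {
    /* inspector: performs only the checks, so it has a very small II */
    int rare = 0;
    for (int i = 0; i < n; i++)
        rare |= needs_slow_path(x[i]);

    if (!rare) {
        /* executor: checks and infrequent paths stripped out, low II */
        for (int i = 0; i < n; i++)
            y[i] = fast_path(x[i]);
    } else {
        /* fall back to the original loop with all paths included */
        for (int i = 0; i < n; i++)
            y[i] = full_path(x[i]);
    }
}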
Fig. 7 On the left a traditional modulo-scheduled loop, on the right a kernel-only one. Each numbered box denotes one of four software pipeline stages, and each row denotes the concurrent execution of different stages of different iterations. Grayed boxes denote stages that actually get executed. On the left, the dark grayed boxes get executed on the CGRA accelerator, in which exactly the same code is executed every II cycles. The light grayed boxes are pipeline stages that get executed outside of the loop, in separate code that runs on the main processor. On the right, kernel-only code is shown. Again, the dark grayed boxes are executed on the CGRA accelerator. So are the white boxes, but these get deactivated during the prologue and epilogue by means of predication.
Kernel-Only Loops

Predication can also be used to generate so-called kernel-only loop code. This is loop code that does not have separate prologue and epilogue code fragments. Instead the prologue and epilogue are included in the kernel itself, where predication is now used to guard whole software pipeline stages and to ensure that only the appropriate software pipeline stages are activated at each point in time. A traditional loop with a separate prologue and epilogue is compared to a kernel-only loop in Figure 7.

Three observations need to be made here. The first observation is that kernel-only code is usually faster, because the pipeline stages of the prologue and epilogue now get executed on the CGRA accelerator, which typically can do so at much higher IPCs than the main core. This is a major difference between (ADRES) CGRAs and VLIWs. On the latter, kernel-only loops are much less useful, because all code runs on the same number of ISs anyway.

Secondly, while kernel-only code will be faster on CGRAs, more time is spent in the CGRA mode, as can be seen in Figure 7. During the epilogue and prologue, the whole CGRA is active and thus consuming energy, but many ISs are not performing useful computations because they execute operations from inactive pipeline stages. Thus, kernel-only code is not necessarily optimal in terms of energy consumption.

The third observation is that for loops where predication is used heavily to create hyperblocks, the use of predicates to support kernel-only code might over-stress the predication support of the CGRA. In domains such as SDR, where the loops typically have no or very few conditional statements, this poses no problems. For applications that feature more complex loops, such as many multimedia applications, this might create a bottleneck even when predicate speculation [54] is used. This is where the ADRES template proves to be very useful: it allowed us to instantiate specialized CGRAs with varying predicate data paths, as can be seen in Table 2.
(a) original 15-tap FIR filter

const short c[15] = {-32, ..., 1216};
for (i = 0; i < nr; i++) {
  for (value = 0, j = 0; j < 15; j++)
    value += x[i+j]*c[j];
  r[i] = value;
}

(b) filter after loop unrolling, with hard-coded constants

const short c00 = -32, ..., c14 = 1216;
for (i = 0; i < nr; i++)
  r[i] = x[i+0]*c00 + x[i+1]*c01 + ... + x[i+14]*c14;

(c) after redundant memory accesses are eliminated

int i, value, d0, ..., d14;
const short c00 = -32, ..., c14 = 1216;
for (i = 0; i < nr+15; i++) {
  d0 = d1; d1 = d2; ...; d13 = d14; d14 = x[i];
  value = c00 * d0 + c01 * d1 + ... + c14 * d14;
  if (i >= 14) r[i - 14] = value;
}
Fig. 8 Three C versions of a FIR filter.
4.1.3 Data Flow Manipulations

The need for fine-tuning source code is well known in the embedded world. In practice, each compiler handles some loop forms better than others. So when one is using a specific compiler for some specific VLIW architecture, it can be very beneficial to bring loops into the appropriate shape or form. This is no different when one is programming CGRAs, including ADRES CGRAs.

Apart from the above transformations that relate to the modulo scheduling of loops, there are important transformations that can increase the "data flow" character of a loop, and thus contribute to its efficiency. The three C implementations of a Finite Impulse Response (FIR) filter in Figure 8 provide an excellent example. Figure 8(a) depicts a FIR implementation that is efficient for architectures with few registers. For architectures with more registers, the implementation depicted in Figure 8(b) will usually be more efficient, as many memory accesses have been eliminated. Finally, the equivalent code in Figure 8(c) contains only one load per outer loop iteration. To remove the redundant memory accesses, a lot of temporary variables had to be inserted, together with a lot of copy operations that implement a delay line.
Table 1 Number of execution cycles and memory accesses (obtained through simulation) for the FIR-filter versions compiled for the multimedia CGRA, and for the TI C64+ DSP.

             cycle count            memory accesses
program      CGRA      TI C64+      CGRA      TI C64+
FIR (a)      11828     1054         6221      1618
FIR (b)      1247      1638         3203      2799
FIR (c)      664       10062        422       416
On regular VLIW architectures, this version would result in high register pressure and many explicit copy operations. Table 1 presents the compilation results for a 16-issue CGRA and for an 8-issue clustered TI C64+ VLIW. From the results, it is clear that the TI compiler could not handle the latter code version: its software pipelining fails completely due to the high register pressure. When comparing the minimal cycle times obtained for the TI C64+ with those obtained for the CGRA, please note that the TI compiler applied SIMDization as much as it could, which is fairly orthogonal to scheduling and register allocation, but which the experimental CGRA compiler used for this experiment did not yet perform. By contrast, the CGRA compiler could optimize the code of Figure 8(c) by routing the data of the copy operations over direct connections between the CGRA ISs. As a result, the CGRA implementation becomes both fast and power-efficient at the same time.

This is a clear illustration of the fact that, lacking fully automated compiler optimizations, heavy performance tuning of the source code can be necessary. The fact that writing efficient source code requires a deep understanding of the compiler internals and of the underlying architecture, and the fact that it frequently involves experimentation with various loop shapes, severely limits programming productivity. This has to be considered a severe drawback of CGRA architectures. Moreover, as the FIR filter shows, the optimal source code for a CGRA target can be radically different from that for, e.g., a VLIW target. Consequently, the cost of porting code from other targets to CGRAs or vice versa, or of maintaining code versions for different targets (such as the main processor and the CGRA accelerator), can be high. This puts an additional limitation on programmer productivity.
4.2 ADRES Design Space Exploration

In this part of our case study, we discuss the importance of, and the opportunities for, DSE within the ADRES template. First, we discuss some concrete ADRES instances that have been used for extensive experimentation, including the fabrication of working silicon samples. These examples demonstrate that very power-efficient CGRAs can be designed for specific application domains.
Table 2 Main differences between two studied ADRES CGRAs. Power, clock and area include the CGRA and its configuration memory, the VLIW processor for non-loop code, including its 32K L1 I-cache, and the 32K 4-bank L1 data memory. These numbers are gate-level estimates.

                          multimedia CGRA                   SDR CGRA
# issue slots (FUs)       4x4                               4x4
# load/store units        4                                 4
ld/st/mul latency         6/6/2 cycles                      7/7/3 cycles
# local data RFs          12 (8 single-ported) of size 8    12 (8 single-ported) of size 4
data width                32                                64
config. word width        896 bits                          736 bits
ISA extensions            2-way SIMD, clipping, min/max     4-way SIMD, saturating arithm.
interconnect              Nearest Neighbor (NN)             NN + next-hop + 8 predicate buses + 8 data buses
pipelining                (per-IS pipeline diagrams not reproduced here)
power, clock, and area    91 mW at 300 MHz for 4 mm²        310 mW at 400 MHz for 5.79 mm²
Afterwards, we show some examples of DSE results with respect to some of the specific design options that were discussed in Section 3.

4.2.1 Example ADRES Instances

During the development of the ADRES tool chain and design, two main ADRES instances have been worked out. One was designed for multimedia applications [38] and one for SDR baseband processing [6, 7]. Their main differences are presented in Table 2. Both architectures have a 64-entry data RF (half rotating, half non-rotating) that is shared with a unified three-issue VLIW processor that executes the non-loop code. Thus this shared RF has six read ports and three write ports. Both CGRAs feature 16 FUs, of which four can access the memory (which consists of four single-ported banks) through a queue mechanism that can resolve bank conflicts. Most operations have a latency of one cycle, with the exception of loads, stores, and multiplications.

One important difference between the two CGRAs relates to their pipeline schemes, as indicated for a single IS (local RF and FU) in Table 2. As the local RFs are only buffered at their input, pipelining registers need to be inserted in the paths to and from the FUs in order to reach the frequency targets indicated in the table. These pipeline latches hence directly contribute to the maximization of the factor f_p in Equation (1). Because the instruction sets and the target frequencies differ between the two application domains, the SDR CGRA has one more pipeline register than the multimedia CGRA, and the registers are located at different places in the two designs.
Table 3 Results for the benchmark loops: the target-version-independent number of operations (#ops) and ResMII, and, for each target version, the RecMII and the actually achieved II and IPC (counting SIMD operations as only one operation).

                                                         pipelined           non-pipelined
Benchmark       CGRA        Loop           #ops  ResMII  RecMII  II   IPC    RecMII  II   IPC
AVC decoder     multimedia  MBFilter1        70    5       2      6   11.7     1      6   11.7
                            MBFilter2        89    6       7      9    9.9     6      8   11.1
                            MBFilter3        40    3       3      4   10.0     2      3   13.3
                            MBFilter4       105    7       2      9   11.7     1      9   11.7
                            MotionComp      109    7       3     10   10.9     2     10   10.9
                            FindFrameEnd     27    4       7      7    3.9     6      6    4.5
                            IDCT1            60    4       2      5   12.0     1      5   12.0
                            MBFilter5        87    6       3      7   12.4     2      7   12.4
                            Memset           10    2       2      2    5.0     1      2    5.0
                            IDCT2            38    3       2      3   12.7     1      3   12.7
                            Average                                   10.0                10.5
Wavelet         multimedia  Forward1         67    5       5      6   11.2     5      5   13.4
                            Forward2         77    5       5      6   12.8     5      6   12.8
                            Reverse1         73    5       2      6   12.2     1      6   12.2
                            Reverse2         37    3       2      3   12.3     1      3   12.3
                            Average                                   12.1                12.7
MPEG-4 encoder  multimedia  MotionEst1       75    5       2      6   12.5     1      6   12.5
                            MotionEst2       72    5       3      6   12.0     2      6   12.0
                            TextureCod1      73    5       7      7   10.4     6      6   12.2
                            CalcMBSAD        60    4       2      5   12.0     1      5   12.0
                            TextureCod2       9    1       2      2    4.5     1      2    4.5
                            TextureCod3      91    6       2      7   13.0     1      7   13.0
                            TextureCod4      91    6       2      7   13.0     1      7   13.0
                            TextureCod5      82    6       2      6   13.7     1      6   13.7
                            TextureCod6      91    6       2      7   13.0     1      7   13.0
                            MotionEst3       52    4       3      4   13.0     2      5   10.4
                            Average                                   11.7                11.6
Tiff2BW         multimedia  main loop        35    3       2      3   11.7     1      3   11.7
SHA-2           multimedia  main loop       111    7       8      9   12.3     8      9   12.3
MIMO            SDR         Channel2        166   11       3     14   11.9     1     14   10.4
                            Channel1         83    6       3      8   10.4     1      8   10.7
                            SNR              75    5       4      6   12.5     2      6   12.5
                            Average                                   11.6                11.2
WLAN            SDR         DemapQAM64       55    4       3      6    9.2     1      6    9.2
                            64-point FFT    123    8       4     10   12.3     2     12   10.3
                            Radix8 FFT      122    8       3     10   12.2     1     12   10.2
                            Compensate       54    4       4      5   10.8     2      5   10.8
                            DataShuffle     153   14       3     14   10.9     1     16    9.6
                            Average                                   11.1                10.0
Traditionally, in VLIWs or in out-of-order superscalar processors, deeper pipelining results in higher frequencies, but also in lower IPCs because of larger branch misprediction penalties. Following Equation (1), this can result in lower performance. In CGRAs, however, this is not necessarily the case, as explained in Section 3.3.1. To illustrate this, Table 3 includes the IPCs obtained when generating code for both CGRAs with and without the pipelining latches.

The benchmarks mapped onto the multimedia ADRES CGRA are an H.264/AVC video decoder, a wavelet-based video decoder, an MPEG-4 video coder, a black-and-white TIFF image filter, and a SHA-2 encryption algorithm. For each application, at most the 10 hottest inner loops are included in the table. For the SDR ADRES CGRA, we selected two baseband modem benchmarks: a WLAN MIMO channel estimation and an implementation of the remainder of a WLAN SISO receiver.
Fig. 9 Average power consumption distribution of the ADRES SDR CGRA in CGRA mode.
All applications are implemented in standard ANSI C, using all language features such as pointers, structures, and the different loop constructs (while, for, do-while), but not using dynamic memory management functions like malloc or free.

The general conclusions to be drawn from the mapping results in Table 3 are as follows. (1) Very high IPCs are obtained at low power consumption levels of 91 and 310 mW and at relatively high frequencies of 300 and 400 MHz, given the standard-cell 90 nm design. (2) Pipelining seems to hurt performance only where the initiation interval is bound by RecMII, which changes with pipelining. (3) In some cases pipelining even improves the IPC.

Synthesizable VHDL is generated for both processors by a VHDL generator that starts from the same XML architecture specification used to retarget the ANSI C compiler to different CGRA instances. A TSMC 90 nm standard-cell GP CMOS technology (i.e., the general-purpose technology version that is optimized for performance and active power, not for leakage power) was used to obtain the gate-level post-layout estimates for frequency, power and area in Table 2. More detailed results of these experiments are available in the literature, both for this SDR ADRES instance [6, 7] and for the multimedia instance [38]. The SDR ADRES instance has also been produced in silicon, in samples of a full SDR SoC chip [17]. The two ADRES cores on this SoC proved to be fully functional at 400 MHz, and the power consumption estimates have been validated.

One of the most interesting results is depicted in Figure 9, which displays the average distribution of the power consumption over the ADRES SDR CGRA when the CGRA mode is active in the above SDR applications. Compared to VLIW processor designs, a much larger fraction of the power is consumed in the interconnects and in the FUs, while the configuration memory (which corresponds to an L1 VLIW instruction cache), the RFs and the data memory consume relatively little energy. This is particularly the case for the local RFs. This clearly illustrates that by focusing on regular loops and their specific properties, CGRAs can achieve higher performance and higher power-efficiency than VLIWs. On the CGRA, most of the power is spent in the FUs and in the interconnects, i.e., on the actual computations and on the transfers of values from computation to computation.
Fig. 10 Basic interconnects that can be combined. All bidirectional edges between two ISs denote that all outputs of one IS are connected to the inputs of the other IS and vice versa. Buses that connect all connected IS outputs to all connected IS inputs are shown as edges without arrows.
The latter two aspects are really the fundamental parts of the computation to be performed, unlike the fetching of data or the fetching of code, which are merely side-effects of the fact that processors consist of control paths, data paths and memories.

4.2.2 Design Space Exploration Example

Many DSEs have been performed within the ADRES template [8, 12, 33, 38, 43]. We present one experimental result [33] here, not to present absolute numbers, but to demonstrate the large impact that some design choices can have on performance and on energy consumption. In this experiment, a number of different interconnects have been explored for four microbenchmarks (each consisting of several inner loops): a MIMO SDR channel estimation, a Viterbi decoder, an Advanced Video Codec (AVC) motion estimation, and an AVC half-pixel interpolation filter. All of them have been compiled with the DRESC compiler for different architectures whose interconnects are combinations of the four basic interconnects of Figure 10, in which distributed RFs have been omitted for the sake of clarity.

Figure 11 depicts the relative performance and (estimated) energy consumption for different combinations of these basic interconnects. The names of the different architectures indicate which basic interconnects are included in their interconnect. For example, the architecture b_nn_ex includes the buses, the nearest-neighbor interconnects and the extra connections to the shared RF.
Fig. 11 DSE results for four microbenchmarks on 4x4 CGRAs with fixed ISs and fixed RFs, but with varying interconnects: (a) MIMO, (b) AVC interpolation, (c) Viterbi, (d) AVC motion estimation.
The lines connecting architectures in the charts of Figure 11 connect the architectures on the Pareto fronts: these are the architectures that have an optimal combination of cycle count and energy consumption. Depending on the trade-off made between performance and energy consumption, a designer will select one architecture on that Pareto front.

The lesson to learn from these Pareto fronts is that relatively small architectural changes, in this case involving only the interconnect but not the ISs or the distributed RFs, can span a wide range of architectures in terms of performance and energy-efficiency. When designing a new CGRA or choosing an existing one, it is hence absolutely necessary to perform a good DSE that covers the ISA, the ISs, the interconnect and the RFs. Because of the large design space, this is far from trivial.
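Extracting such a Pareto front from DSE results is itself straightforward; the following C sketch keeps exactly the architectures that no other architecture beats in both cycle count and energy. Besides b_nn_ex, which is named above, the architecture names and all numbers are made up for illustration.

#include <stdio.h>

typedef struct { const char *name; double cycles, energy; } Arch;

/* b dominates a if b is at least as good in both metrics
   and strictly better in at least one */
static int dominates(Arch b, Arch a) {
    return b.cycles <= a.cycles && b.energy <= a.energy &&
           (b.cycles < a.cycles || b.energy < a.energy);
}

int main(void) {
    Arch archs[] = {              /* hypothetical DSE results */
        { "b",       1.00, 1.00 },
        { "b_nn",    0.80, 0.80 },
        { "b_nn_ex", 0.70, 0.95 },
        { "nn_ex",   0.75, 0.85 },
    };
    int n = (int)(sizeof archs / sizeof archs[0]);
    for (int i = 0; i < n; i++) {
        int on_front = 1;
        for (int j = 0; j < n; j++)
            if (j != i && dominates(archs[j], archs[i]))
                on_front = 0;
        if (on_front)
            printf("%s lies on the Pareto front\n", archs[i].name);
    }
    return 0;
}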
5 Conclusions

This chapter presented a discussion of the CGRA processor design space as an accelerator for the inner loops of DSP-like applications such as software-defined radios and multimedia processing. A range of options for many design features and design parameters has been related to power consumption, performance, and flexibility.
In a use case, the need for design space exploration, for advanced compiler support, and for manual high-level code tuning has been demonstrated. The above discussions and demonstration support the following main conclusions. Firstly, CGRAs can provide an excellent alternative to VLIWs, offering better performance and better energy efficiency. Secondly, design space exploration is needed to achieve those goals. Finally, existing compiler support needs to be improved, and until that happens, programmers need a deep understanding of the targeted CGRA architectures and their compilers in order to manually tune their source code. This can significantly limit programmer productivity.
References

1. Ahn, M., Yoon, J.W., Paek, Y., Kim, Y., Kiemb, M., Choi, K.: A spatial mapping algorithm for heterogeneous coarse-grained reconfigurable architectures. In: DATE ’06: Proceedings of the conference on Design, automation and test in Europe, pp. 363–368. European Design and Automation Association, Leuven, Belgium (2006)
2. Barua, R.: Maps: a compiler-managed memory system for software-exposed architectures. Ph.D. thesis, Massachusetts Institute of Technology (2000)
3. van Berkel, K., Heinle, F., Meuwissen, P., Moerman, K., Weiss, M.: Vector processing as an enabler for software-defined radio in handheld devices. EURASIP Journal on Applied Signal Processing 2005(16), 2613–2625 (2005). DOI 10.1155/ASP.2005.2613
4. Betz, V., Rose, J., Marquardt, A.: Architecture and CAD for Deep-Submicron FPGAs. Kluwer Academic Publishers (1999)
5. Bondalapati, K.: Parallelizing DSP nested loops on reconfigurable architectures using data context switching. In: DAC ’01: Proceedings of the 38th annual Design Automation Conference, pp. 273–276. ACM, New York, NY, USA (2001). DOI 10.1145/378239.378483
6. Bougard, B., De Sutter, B., Rabou, S., Novo, D., Allam, O., Dupont, S., Van der Perre, L.: A coarse-grained array based baseband processor for 100Mbps+ software defined radio. In: DATE ’08: Proceedings of the conference on Design, automation and test in Europe, pp. 716–721. ACM, New York, NY, USA (2008). DOI 10.1145/1403375.1403549
7. Bougard, B., De Sutter, B., Verkest, D., Van der Perre, L., Lauwereins, R.: A coarse-grained array accelerator for software-defined radio baseband processing. IEEE Micro 28(4), 41–50 (2008). DOI 10.1109/MM.2008.49
8. Bouwens, F., Berekovic, M., Gaydadjiev, G., De Sutter, B.: Architecture enhancements for the ADRES coarse-grained reconfigurable array. In: Proc. of HiPEAC Conf. (2008)
9. Burns, G., Gruijters, P.: Flexibility tradeoffs in SoC design for low-cost SDR. Proceedings of SDR Forum Technical Conference (2003)
10. Burns, G., Gruijters, P., Huiskens, J., van Wel, A.: Reconfigurable accelerators enabling efficient SDR for low cost consumer devices. Proceedings of SDR Forum Technical Conference (2003)
11. Cardoso, J.M.P., Weinhardt, M.: XPP-VC: A C compiler with temporal partitioning for the PACT-XPP architecture. In: FPL ’02: Proceedings of the Reconfigurable Computing Is Going Mainstream, 12th International Conference on Field-Programmable Logic and Applications, pp. 864–874. Springer-Verlag, London, UK (2002)
12. Cervero, T., Kanstein, A., López, S., De Sutter, B., Sarmiento, R., Mignolet, J.Y.: Architectural exploration of the H.264/AVC decoder onto a coarse-grain reconfigurable architecture. In: Proc. of the International Conference on Design of Circuits and Integrated Systems (2008)
13. Coons, K.E., Chen, X., Burger, D., McKinley, K.S., Kushwaha, S.K.: A spatial path scheduling algorithm for EDGE architectures. SIGPLAN Not. 41(11), 129–140 (2006). DOI http://doi.acm.org/10.1145/1168918.1168875 14. Corporaal, H.: Microprocessor Architectures from VLIW to TTA. John Wiley (1998) 15. Cronquist, D., Franklin, P., Fisher, C., Figueroa, M., Ebeling, C.: Architecture design of reconfigurable pipelined datapaths. In: Proceedings of the Twentieth Anniversary Conference on Advanced Research in VLSI (1999) 16. De Sutter, B., Coene, P., Vander Aa, T., Mei, B.: Placement-and-routing-based register allocation for coarse-grained reconfigurable arrays. In: LCTES ’08: Proceedings of the 2008 ACM SIGPLAN-SIGBED conference on Languages, compilers, and tools for embedded systems, pp. 151–160. ACM, New York, NY, USA (2008). DOI http://doi.acm.org/10.1145/1375657. 1375678 17. Derudder, V., Bougard, B., Couvreur, A., Dewilde, A., Dupont, S., Folens, L., Hollevoet, L., Naessens, F., Novo, D., Raghavan, P., Schuster, T., Stinkens, K., Weijers, J.W., Van der Perre, L.: A 200Mbps+ 2.14nJ/b digital baseband multi processor system-on-chip for SDRs. In: Proc of VLSI Symposum (2009) 18. Ebeling, C.: Compiling for coarse-grained adaptable architectures. Tech. Rep. UW-CSE-0206-01, University of Washington (2002) 19. Ebeling, C.: The general RaPiD architecture description. Tech. Rep. UW-CSE-02-06-02, University of Washington (2002) 20. Fisher, J., Faraboschi, P., Young, C.: Embedded Computing, A VLIW Approach to Architecture, Compilers and Tools. Morgan Kaufmann (2005) 21. Friedman, S., Carroll, A., Van Essen, B., Ylvisaker, B., Ebeling, C., Hauck, S.: SPR: An architecture-adaptive CGRA mapping tool. In: FPGA ’09: Proceeding of the ACM/SIGDA international symposium on Field programmable gate arrays, pp. 191–200. ACM, New York, NY, USA (2009). DOI http://doi.acm.org/10.1145/1508128.1508158 22. Galanis, M.D., Milidonis, A., Theodoridis, G., Soudris, D., Goutis, C.E.: A method for partitioning applications in hybrid reconfigurable architectures. Design Automation for Embedded Systems 10(1), 27–47 (2006) 23. Galanis, M.D., Theodoridis, G., Tragoudas, S., Goutis, C.E.: A reconfigurable coarse-grain data-path for accelerating computational intensive kernels. Journal of Circuits, Systems and Computers (JCSC) pp. 877–893 (2005) 24. Gebhart, M., Maher, B.A., Coons, K.E., Diamond, J., Gratz, P., Marino, M., Ranganathan, N., Robatmili, B., Smith, A., Burrill, J., Keckler, S.W., Burger, D., McKinley, K.S.: An evaluation of the TRIPS computer system. In: ASPLOS ’09: Proceeding of the 14th international conference on Architectural support for programming languages and operating systems, pp. 1–12. ACM, New York, NY, USA (2009). DOI http://doi.acm.org/10.1145/1508244.1508246 25. Hartenstein, R., Herz, M., Hoffmann, T., Nageldinger, U.: Mapping applications onto reconfigurable KressArrays. In: Proceedings of the 9th International Workshop on Field Programmable Logic and Applications (1999) 26. Hartenstein, R., Herz, M., Hoffmann, T., Nageldinger, U.: Generation of design suggestions for coarse-grain reconfigurable architectures. In: Proceedings of the 10th International Workshop on Field Programmable Logic and Applications (2000) 27. Hartenstein, R., Hoffmann, T., Nageldinger, U.: Design-space exploration of low power coarse grained reconfigurable datapath array architectures. In: Proceedings of the International Workshop - Power and Timing Modeling, Optimization and Simulation (2000) 28. 
Kim, Y., Kiemb, M., Park, C., Jung, J., Choi, K.: Resource sharing and pipelining in coarsegrained reconfigurable architecture for domain-specific optimization. In: DATE ’05: Proceedings of the conference on Design, Automation and Test in Europe, pp. 12–17. IEEE Computer Society, Washington, DC, USA (2005). DOI http://dx.doi.org/10.1109/DATE.2005.260 29. Kim, Y., Mahapatra, R.: A new array fabric for coarse-grained reconfigurable architecture. In: Proceedings of the IEEE EuroMicro Conference on Digital System Design, pp. 584–591 (2008)
Coarse-Grained Reconfigurable Array Architectures
483
30. Kim, Y., Mahapatra, R.: Dynamic context compression for low-power coarse-grained reconfigurable architecture. IEEE Transactions on Very Large Scale Integration (VLSI) Systems 18(1), 15–28 (2010) 31. Kim, Y., Mahapatra, R., Park, I., Choi, K.: Low power reconfiguration technique for coarsegrained reconfigurable architecture. IEEE Transactions on Very Large Scale Integration (VLSI) Systems 17(5), 593–603 (2009) 32. Lam, M.S.: Software pipelining: an effective scheduling technique for VLIW machines. In: Proc. PLDI, pp. 318–327 (1988) 33. Lambrechts, A., Raghavan, P., Jayapala, M., Catthoor, F., Verkest, D.: Energy-aware interconnect optimization for a coarse grained reconfigurable processor. VLSI Design, International Conference on pp. 201–207 (2008) 34. Lee, J.e., Choi, K., Dutt, N.D.: An algorithm for mapping loops onto coarse-grained reconfigurable architectures. In: LCTES ’03: Proceedings of the 2003 ACM SIGPLAN conference on Language, compiler, and tool for embedded systems, pp. 183–188. ACM, New York, NY, USA (2003). DOI http://doi.acm.org/10.1145/780732.780758 35. Lee, L.H., Moyer, B., Arends, J.: Instruction fetch energy reduction using loop caches for embedded applications with small tight loops. In: ISLPED ’99: Proceedings of the 1999 international symposium on Low power electronics and design, pp. 267–269. ACM, New York, NY, USA (1999). DOI http://doi.acm.org/10.1145/313817.313944 36. Lee, M.H., Singh, H., Lu, G., Bagherzadeh, N., Kurdahi, F.J., Filho, E.M.C., Alves, V.C.: Design and implementation of the MorphoSys reconfigurable computing processor. J. VLSI Signal Process. Syst. 24(2/3), 147–164 (2000). DOI http://dx.doi.org/10.1023/A:1008189221436 37. Mahlke, S.A., Lin, D.C., Chen, W.Y., Hank, R.E., Bringmann, R.A.: Effective compiler support for predicated execution using the hyperblock. In: MICRO 25: Proceedings of the 25th annual international symposium on Microarchitecture, pp. 45–54. IEEE Computer Society Press, Los Alamitos, CA, USA (1992). DOI http://doi.acm.org/10.1145/144953.144998 38. Mei, B., De Sutter, B., Vander Aa, T., Wouters, M., Kanstein, A., Dupont, S.: Implementation of a coarse-grained reconfigurable media processor for AVC decoder. Journal of Signal Processing Systems 51(3), 225–243 (2008) 39. Mei, B., Lambrechts, A., Verkest, D., Mignolet, J.Y., Lauwereins, R.: Architecture exploration for a reconfigurable architecture template. IEEE Design and Test of Computers 22(2), 90–101 (2005) 40. Mei, B., Vernalde, S., Verkest, D., De Man, H., Lauwereins, R.: ADRES: An architecture with tightly coupled VLIW processor and coarse-grained reconfigurable matrix. In: Proc. of Field-Programmable Logic and Applications, pp. 61–70 (2003) 41. Mei, B., Vernalde, S., Verkest, D., De Man, H., Lauwereins, R.: Exploiting loop-level parallelism for coarse-grained reconfigurable architecture using modulo scheduling. IEE Proceedings: Computer and Digital Techniques 150(5) (2003) 42. Mei, B., Vernalde, S., Verkest, D., Lauwereins, R.: Design methodology for a tightly coupled VLIW/reconfigurable matrix architecture: A case study. In: Proc. of Design, Automation and Test in Europe (DATE), pp. 1224–1229 (2004) 43. Novo, D., Schuster, T., Bougard, B., Lambrechts, A., Van der Perre, L., Catthoor, F.: Energyperformance exploration of a CGA-based SDR processor. Journal of Signal Processing Systems (2008) 44. Oh, T., Egger, B., Park, H., Mahlke, S.: Recurrence cycle aware modulo scheduling for coarse-grained reconfigurable architectures. 
In: LCTES ’09: Proceedings of the 2009 ACM SIGPLAN/SIGBED conference on Languages, compilers, and tools for embedded systems, pp. 21–30. ACM, New York, NY, USA (2009). DOI http://doi.acm.org/10.1145/1542452. 1542456 45. PACT XPP Technologies: XPP-III Processor Overview White Paper (2006) 46. Park, H., Fan, K., Kudlur, M., Mahlke, S.: Modulo graph embedding: Mapping applications onto coarse-grained reconfigurable architectures. In: CASES ’06: Proceedings of the 2006 international conference on Compilers, architecture and synthesis for embedded systems, pp. 136–146. ACM, New York, NY, USA (2006). DOI http://doi.acm.org/10.1145/1176760. 1176778
484
Bjorn De Sutter, Praveen Raghavan, Andy Lambrechts
47. Park, H., Fan, K., Mahlke, S.A., Oh, T., Kim, H., Kim, H.S.: Edge-centric modulo scheduling for coarse-grained reconfigurable architectures. In: PACT ’08: Proceedings of the 17th international conference on Parallel architectures and compilation techniques, pp. 166–176. ACM, New York, NY, USA (2008). DOI http://doi.acm.org/10.1145/1454115.1454140 48. Petkov, N.: Systolic Parallel Processing. North Holland Publishing (1992) 49. Raghavan, P., Lambrechts, A., Jayapala, M., Catthoor, F., Verkest, D., Corporaal, H.: Very wide register: An asymmetric register file organization for low power embedded processors. In: DATE ’07: Proceedings of the conference on Design, Automation and Test in Europe (2007) 50. Rau, B.R.: Iterative modulo scheduling. Tech. rep., Hewlett-Packard Lab: HPL-94-115 (1995) 51. Rau, B.R., Lee, M., Tirumalai, P.P., Schlansker, M.S.: Register allocation for software pipelined loops. In: PLDI ’92: Proceedings of the ACM SIGPLAN 1992 conference on Programming language design and implementation, pp. 283–299 (1992) 52. Sankaralingam, K., Nagarajan, R., Liu, H., Kim, C., Huh, J., Burger, D., Keckler, S.W., Moore, C.R.: Exploiting ILP, TLP, and DLP with the polymorphous TRIPS architecture. SIGARCH Comput. Archit. News 31(2), 422–433 (2003). DOI http://doi.acm.org/10.1145/ 871656.859667 53. Scarpazza, D.P., Raghavan, P., Novo, D., Catthoor, F., Verkest, D.: Software simultaneous multi-threading, a technique to exploit task-level parallelism to improve instruction- and datalevel parallelism. In: Proceedings of the 16th International Workshop on Integrated Circuit and System Design. Power and Timing Modeling, Optimization and Simulation (PATMOS), pp. 107–116 (2006) 54. Schlansker, M., Mahlke, S., Johnson, R.: Control CPR: a branch height reduction optimization for EPIC architectures. SIGPLAN Not. 34(5), 155–168 (1999). DOI http://doi.acm.org/10. 1145/301631.301659 55. Shen, J., Lipasti, M.: Modern Processor Design: Fundamentals of Superscalar Processors. McGraw-Hill (2005) 56. Silicon Hive: HiveCC Databrief (2006) 57. Sudarsanam, A.: Code optimization libraries for retargetable compilation for embedded digital signal processors. Ph.D. thesis, Princeton University (1998) 58. Taylor, M., Kim, J., Miller, J., Wentzla, D., Ghodrat, F., Greenwald, B., Ho, H., Lee, M., Johnson, P., Lee, W., Ma, A., Saraf, A., Seneski, M., Shnidman, N., Frank, V., Amarasinghe, S., Agarwal, A.: The Raw microprocessor: A computational fabric for software circuits and general purpose programs. IEEE Micro 22(2), 25–35 (2002) 59. Texas Instruments: TMS320C64x Technical Overview (2001) 60. Venkataramani, G., Najjar, W., Kurdahi, F., Bagherzadeh, N., Bohm, W., Hammes, J.: Automatic compilation to a coarse-grained reconfigurable system-on-chip. ACM Trans. Embed. Comput. Syst. 2(4), 560–589 (2003). DOI http://doi.acm.org/10.1145/950162.950167 61. van de Waerdt, J.W., Vassiliadis, S., Das, S., Mirolo, S., Yen, C., Zhong, B., Basto, C., van Itegem, J.P., Amirtharaj, D., Kalra, K., Rodriguez, P., van Antwerpen, H.: The TM3270 media-processor. In: MICRO 38: Proceedings of the 38th annual IEEE/ACM International Symposium on Microarchitecture, pp. 331–342. IEEE Computer Society, Washington, DC, USA (2005). DOI http://dx.doi.org/10.1109/MICRO.2005.35 62. Woh, M., Lin, Y., Seo, S., Mahlke, S., Mudge, T., Chakrabarti, C., Bruce, R., Kershaw, D., Reid, A., Wilder, M., Flautner, K.: From SODA to scotch: The evolution of a wireless baseband processor. 
In: MICRO ’08: Proceedings of the 2008 41st IEEE/ACM International Symposium on Microarchitecture, pp. 152–163. IEEE Computer Society, Washington, DC, USA (2008). DOI http://dx.doi.org/10.1109/MICRO.2008.4771787 63. Programming XPP-III Processors White Paper (2006) 64. Yoon, J., Ahn, M., Paek, Y., Kim, Y., Choi, K.: Temporal mapping for loop pipelining on a MIMD-style coarse-grained reconfigurable architecture. In: Proc. International SoC Design Conference (2006)
Multi-core Systems on Chip
Luigi Carro and Mateus Beck Rutzig
Luigi Carro, Universidade Federal do Rio Grande do Sul, Brazil, e-mail: [email protected]
Mateus Beck Rutzig, Universidade Federal do Rio Grande do Sul, Brazil, e-mail: [email protected]
Abstract This chapter discusses multicore architectures for DSP applications. We briefly explain the main challenges involved in future processor designs, justifying the need for thread-level parallelism exploitation, since instruction-level parallelism is becoming increasingly difficult and costly to exploit under a limited power budget. Based on an analytical model, we discuss the performance and energy tradeoffs of multiprocessor architectures compared to high-end single-processor designs. The analytical model is then applied to a traditional DSP application, illustrating the need for both instruction- and thread-level parallelism exploitation in this application domain. Some successful MPSoC designs are presented and discussed, highlighting the different trends in the embedded and general-purpose processor markets. Finally, we provide a thorough analysis of open hardware and software problems, such as interconnection mechanisms and programming models.
1 Introduction

Industry competition in today's wide electronics market makes the design of a device increasingly complex. New marketing strategies have been focusing on increasing product functionality to attract consumers' interest. However, the convergence of different functions in a single device creates new design challenges, making the device implementation even more difficult. To worsen this scenario, designers must handle well-known design constraints such as energy consumption and process costs, all mixed into the difficult task of increasing the processing capability,
since users nowadays desire the equivalent of a portable supercomputer. Therefore, the fine tuning of these requirements is fundamental to produce the best possible product, which consequently will reach a wider market.

The convergence of functionalities into a single product enlarges the application software layer over the hardware platform, increasing the range of heterogeneous code that the processing element must handle. A clear example today of such convergence with mixed behavior is the iPhone. Aiming to balance manufacturing costs and to avoid overloading the original hardware platform with extra processing demands, way beyond the original hardware design scope, there is a natural lifecycle that guides how a functionality is executed on the platform. Here, the functionality lifecycle is divided into three phases: introduction, growth, and maturity. Commonly, during the introduction phase, which reflects the time when a novel functionality is launched in the market, the logical behavior of the product is described using well-known high-level software languages like C++, Java, and .NET, because consumer acceptance is still in doubt. This execution strategy avoids costs, since the target processing element, usually a general-purpose processor (GPP), is the very same as (or very close to) that of the previous product. After market consolidation, the growth and maturity phases start. At this time, the functionality is used in a wide range of products, and its execution tends to move closer to the hardware to achieve better energy efficiency and speedup, and to avoid overloading the GPP processing capabilities.

Generally, two methods are used to bring the required functionality of a product closer to the underlying hardware, aiming to exploit the benefits of a joint development. The first technique evaluates small application parts that potentially have a huge impact on the whole application. For example, this used to be the scenario of embedded systems some time ago, or of some very specialized applications in general-purpose processors. After the profiling and evaluation phase, the chosen code parts are moved to specialized hardwired instructions that extend the processor instruction set architecture (ISA) and assist a delimited range of applications. MMX, SSE, and 3DNow! are successful ISA extensions created to support certain application domains, in those cases multimedia processing. A second technique uses a more radical approach to close the gap between the hardware and the required functionality: the entire logical behavior is implemented in hardware, building an application-specific integrated circuit (ASIC). ASIC development can be considered a better design solution than ISA extensions, since it provides better energy efficiency and performance.

For several reasons the current scenario of the electronics market is changing: over 4 billion embedded devices were shipped in 2006 [21], overtaking the desktop computer market. Besides the already explained drawbacks of traditional product manufacturing, such designs need to worry about battery life, size and, for critical applications, reliability. Aiming to meet these hard requirements, system-on-a-chip (SoC) designs are largely used in the embedded domain. The main proposal of a SoC design is to integrate in a single die the processing element (in most cases a GPP), memory, communication, peripherals, and external interfaces.
This eases the deployment of a product by freeing designers from the hard system validation task.
Fig. 1 Different MPSoC approaches: MPSoC #1 combines processing elements with a single ISA (ISA A), MPSoC #2 combines differently sized processing elements with the same ISA, and MPSoC #3 mixes processing elements with three different ISAs (ISA A, ISA B, ISA C).
The mobileGT SoC platform [6] exemplifies a successful employment of a SoC in the embedded domain: BMW, Ford, General Motors, Hyundai, and Mercedes-Benz use this SoC in their cars to manage information and entertainment applications. However, SoC designs do not sustain the required performance for the growing market of heterogeneous software behaviors that run on top of the available GPP. Moreover, instruction-level parallelism (ILP) exploitation is no longer an efficient technique to improve the performance of those processors, due to the limited ILP available in application code [23]. Despite the great advantages of ISA extensions, like the MMX strategy, this approach incurs a long design and validation time, which goes against the need for a fast time-to-market. ASIC employment usually depends on a complex and costly design process that can also affect the time-to-market required by the company. In addition, an ASIC attacks only a specific application domain, and can fail to deliver the required performance for software behaviors outside this domain. In short, both ASICs and ISA extensions deliver high performance only within a limited application domain. This critical scenario dictates the need for a change in the hardware platform development paradigm.

A new organization that appears as a possible solution for the current design requirements is the multiprocessor system-on-chip (MPSoC). Many advantages can be obtained by combining different processing elements on a single die. The computation time can clearly benefit, since several different programs can execute at the same time on the available processing elements. In addition, the flexibility to combine different processing elements appears as a solution to the heterogeneous software execution problem, since designers can select the set of processing elements that best fits their design requirements. Figure 1 illustrates three examples of MPSoCs with different sets of processing elements. MPSoC #1 is composed only of general-purpose processors (GPPs) with the same processing capability. In contrast, MPSoC #2 aggregates GPPs with distinct computation capabilities. As another example, MPSoC #3 is assembled for another computation purpose, since it contains three different processing paradigms. More details about each MPSoC approach are given in the next sections. MPSoCs provide several advantages, and three stand out among all: performance, energy consumption, and validation time. MPSoC design introduces a new parallel execution paradigm aiming to overcome the performance barrier created by the instruction-level parallelism limitation. In contrast to the ILP exploitation
Fig. 2 Scheduling methodologies for a given active process set: Scheduling 1 schedules individual threads (TLP), Scheduling 2 schedules whole processes (PLP), and Scheduling 3 mixes both levels.
paradigm, which can be supported either in hardware or in software, current coarser-grained parallelism detection strategies rely only on the software team. The hardware team is only responsible for encapsulating the processing elements and building the communication infrastructure to assemble a single die with a suitable number of processing elements. The software team splits and distributes the parallelized code among the MPSoC processing elements. Although both teams might work together, the reality is that there is no strong methodology today that can support software development for multiprocessing chips for any kind of application.

We define parallel code as composed of threads or processes. A thread is contained inside a process and differs from it by its lightweight context information. Moreover, a process has a private address space, while a set of threads shares the same address space, which makes communication among threads easier (illustrated in the code sketch below). Figure 2 illustrates an active process set composed of three multithreaded processes. Scheduling 1 reflects traditional thread-level parallelism (TLP), exploiting the parallel execution of parts of the active processes. Scheduling 2 supports only single-threaded parallelism exploitation: in process-level parallelism (PLP), only whole processes are scheduled, regardless of their internal thread count. Finally, there is mixed parallelism exploitation, illustrated by Scheduling 3, working at both the thread and process scheduling levels. The last approach is widely used in commercial MPSoCs due to its greater scheduling flexibility. The scheduling policies depend on the MPSoC processing element set (Figure 1); therefore, the above scheduling approaches can be associated with the previous MPSoC processing element sets. Supposing that all GPPs of MPSoC #1 and MPSoC #2 have the same ISA, the Scheduling 1 methodology only fits these hardware models, due to the binary compatibility of the thread code. In contrast, all three MPSoC examples illustrated in Figure 1 support Scheduling 2 and Scheduling 3.
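To make the thread/process distinction concrete, the following minimal C sketch (ours, not from the chapter) contrasts a POSIX thread, which shares the creating process's address space, with a forked child process, which gets a private copy:

#include <pthread.h>
#include <stdio.h>
#include <stdlib.h>
#include <sys/wait.h>
#include <unistd.h>

int shared = 0;  /* lives in the single address space of the process */

void *thread_body(void *arg) {
    shared = 42;               /* visible to the creating thread */
    return NULL;
}

int main(void) {
    pthread_t t;
    pthread_create(&t, NULL, thread_body, NULL);
    pthread_join(t, NULL);
    printf("after thread: %d\n", shared);   /* prints 42: same address space */

    if (fork() == 0) {         /* child process: private copy of 'shared' */
        shared = 7;            /* invisible to the parent */
        exit(0);
    }
    wait(NULL);
    printf("after process: %d\n", shared);  /* still 42: separate address spaces */
    return 0;
}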
Commonly, parallel execution in multiprocessing platforms is used not only to achieve better speedup; energy consumption is a big reason for its usage as well. Several factors, both at the hardware and at the software level, contribute to the lower energy consumption of MPSoC architectures. Regarding the architecture design, there is no dedicated hardware to split and distribute the application code over the processing elements, since the software team does this work at application development time. Someone is still responsible for software distribution, however, and this is the main problem. Traditional approaches to save energy on single-processor platforms are also widely used in multiprocessing designs. Dynamic voltage and frequency scaling (DVFS) can be applied to the entire chip, changing the global MPSoC voltage and frequency at runtime [27]. The management policy is based on some system tradeoff such as processing element load, battery charge, or chip temperature. Commonly, DVFS is controlled by software, which incurs a considerable time overhead to change the processing element status; consequently, DVFS can affect performance and even cause drawbacks in real-time systems. DVFS can also be applied independently to each processing element. During profiling, the software team can insert frequency scaling instructions into the code to avoid hardware monitoring for DVFS; the computational load of each processor is then individually monitored, making chip power management more accurate [2, 19]. More recently, aiming to decrease the delay of software-based power management, the Core i7, a multiprocessing platform from Intel, employs a dedicated built-in microcontroller to handle DVFS [9]. The built-in chip power manager frequently inspects temperature and power consumption, aiming to turn off independent cores when they are not being used. The DVFS microcontroller can be seen as an ASIC that implements only the power management routine, which makes clear the convergence of dedicated processing elements onto a single die.

Consumers are always waiting for products equipped with exciting features. Time-to-market thus appears as an important consumer electronics constraint that should be carefully handled. Research shows that 70% of the design time is spent on platform validation [1], making it an attractive target for time-to-market optimization. In this respect, MPSoC usage softens the hard task of shrinking time-to-market. Commonly, an MPSoC is built by combining already validated processing elements that are aggregated into a single die like pieces of a puzzle. Since each puzzle piece reflects a validated processing element, the remaining design challenge is to assemble the different pieces. In practice, the designer only has to select a communication mechanism to connect the entire system, easing the design process through the use of standard communication mechanisms.

Up to now, only the hardware characteristics of MPSoCs have been mentioned. However, software partitioning is a key feature in an MPSoC system. Regarding software behavior, a computationally powerful MPSoC becomes useless if the application partitioning is poorly performed, or if the application does not provide a minimum of thread- or process-level parallelism to be exploited by the computational resources. Amdahl's law argues that an application's speedup is limited by its sequential part. Therefore, if an application needs 1 hour to execute, of which 5 minutes are sequential (almost 9% of the entire application code), the maximum speedup provided by a multiprocessing system is 12 times, no matter how many processing elements are available.
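This 12x bound follows directly from Amdahl's law; for the example above, with sequential fraction $s = 5/60$:

$$ S(N) = \frac{1}{s + (1-s)/N}, \qquad \lim_{N \to \infty} S(N) = \frac{1}{s} = \frac{60}{5} = 12, $$

where $N$ is the number of processing elements and $s$ is the fraction of the execution time that is sequential.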
Fig. 3 Throughput requirements (billions of instructions per second) of traditional DSP applications: video conferencing, graphics processing, speech processing, virtual reality, and video recognition; adapted from [15].
Therefore, the designer's ability to perform software partitioning and the limited amount of parallel code are the main challenges in MPSoC design development.

Today, we are routinely in touch with functionalities like video conferencing, graphics, and speech processing. These functionalities are widespread in digital cameras, MP3 and DVD players, cell phones, and generic communication devices. To support their execution, several DSP algorithms run inside these devices; the Fast Fourier Transform (FFT), Finite Impulse Response (FIR) filters, the Discrete Cosine Transform (DCT), and the IEEE 802.11 communication standard are examples of such algorithms. The convergence of different functionalities into a single device keeps growing, and it is difficult for a single hardware platform to reach the performance requirements of these applications. Several DSP functionalities and their throughput requirements are shown in Figure 3. Let us suppose a mobile device that provides all the functionalities shown in Figure 3, and also that they run on a 1 GHz 8-issue superscalar processor. The processor throughput needed to support this execution is almost 7 billion instructions per second. This execution scenario is hard for a general-purpose processor, since the maximum performance of the baseline processor reaches 8 billion instructions per second (with perfect ILP exploitation and perfect branch prediction). For this reason, DSP processors are always present in real-life products, aiming to achieve, with their specialized DSP instructions, the necessary performance for DSP applications. However, even DSP processors struggle to reach the high performance required when many applications are deployed. Hence, multiprocessing systems appear, also in the DSP domain, as a solution to supply the high performance requirements. The major advantage of multiprocessing elements in the DSP domain is the inherent thread-level parallelism, not present in most traditional applications. Figure 4 illustrates the speedup provided by a multiprocessing system for DSP applications. The speedup for the Maximum Value application increases almost linearly with the number of processors.
Fig. 4 Speedup of traditional DSP applications (8-tap FIR, 8x8 DCT, 64-pt FFT, Maximum Value) on an MPSoC system for 1 to 25 DSP processors [26].
The 8-tap FIR and 64-pt FFT also show encouraging speedups as the number of processing elements increases. On the other hand, the 8x8 DCT provides the smallest speedup among the application workload. Clearly, this application is a good example of Amdahl's law, demonstrating that multiprocessing systems can fail to accelerate applications that have a meaningful sequential part.

The rest of this chapter is divided into five major sections. First, a more detailed analytical model is provided to show the actual design space of multiprocessing devices in the DSP field. Next, an MPSoC hardware taxonomy is presented, aiming to explain the different architectures and their applicability domains; MPSoC systems are classified from two points of view: architecture and organization. A dedicated section then demonstrates some successful MPSoC designs. Another section discusses open problems in MPSoC design, covering software development and communication infrastructure. Finally, future challenges, like how to overcome Amdahl's law limitations, are covered.
2 Analytical Model

In this section we try to define the design space for MPSoC architectures. First, we model a multiprocessing architecture made of several homogeneous 5-stage pipelined cores, and compare it to a model of a superscalar architecture in terms of performance and energy. A homogeneous MPSoC chip will have to be built by replicating simple pipelined processors because of power dissipation limitations: since out-of-order superscalar processors take too much power and area, the number of such processors that can fit on a single chip is going to be small.
2.1 Performance Comparison

Let us start with the basic equation relating time to instructions,

$$ ExecutionTime = Instructions \cdot CPI \cdot CycleTime \qquad (1) $$

where CPI is the mean number of cycles taken to execute an instruction in a certain benchmark. The execution time on a superscalar processor, Tss, can be written as

$$ T_{ss} = Instructions \cdot \Big( \alpha \cdot \frac{CPI_{ss}}{issue} + \beta \cdot CPI_{ss} \Big) \cdot CycleTime_{ss} \qquad (2) $$

where CPIss is the mean number of cycles taken to execute instructions, for a certain benchmark, on the superscalar processor. The variable issue is the amount of parallelism that can be obtained in the superscalar in the best-case situation, without data or control dependencies in the code; theoretically, that part of the code can be accelerated by the amount of parallelism available in the superscalar processor. CPIss can be obtained from benchmarks, and is usually smaller than 1, because the superscalar can accelerate code with lots of control dependencies thanks to branch prediction, and with false data dependencies thanks to register renaming. A typical value for a superscalar CPIss would be 0.62 [8], showing that more than one instruction can be executed in any cycle. The coefficients α and β refer to the percentages of instructions that have ILP or not (α + β = 1); typical values are α = 0.62 and β = 0.38 for the MiBench benchmark [8]. This division of instructions into code that can be parallelized and code that can only be accelerated by speculation is important, because it states one of the major differences between superscalar and traditional pipelined processors. Clearly, this model includes no information about cache misses, nor does it take disk or I/O performance into account. Nevertheless, although simple, this model gives some interesting clues on the potential of MPSoC architectures for DSP applications.

Having stated the equation for superscalar performance, one must now look at the potential of a multiprocessing architecture. The advantage of using an MPSoC is based not only on the available ILP, but mostly on the potential thread-level parallelism (TLP). Some works [20] can even translate code with enough ILP into TLP, so that more than one core can execute the code, exploiting ILP also at the single-issue core level when embedded in a multiprocessing device. For a single pipelined core, the performance equation can be written as (Tsc denotes the execution time on a single pipelined core)

$$ T_{sc} = Instructions \cdot (\alpha \cdot CPI_{sc} + \beta \cdot CPI_{sc}) \cdot CycleTime_{sc}. \qquad (3) $$
Since in an MPSoC environment processors must be simple, so that a large number of them can be integrated, the CPIsc of the single core is bigger than 1, and can reach 3 or more for a processor without branch prediction. Since there is no parallelism available at the instruction level that can be exploited in a simple pipelined core with
only a single execution flow, the separation into α and β instruction types does not make much sense (but we keep the notation for comparison purposes). On the other hand, using the techniques proposed in [20], a certain number of instructions can be split among several processors, and hence one can write

$$ T_{mc} = Instructions \cdot \Big( \delta \cdot \frac{CPI_{sc}}{P} + \theta \cdot CPI_{sc} \Big) \cdot CycleTime_{mc} \qquad (4) $$

where δ is the fraction of the code that has TLP and can be transformed into multithreaded code, while θ is the part of the code that must be executed sequentially (no TLP can be obtained), with δ + θ = 1. P is the number of homogeneous processors available in the chip. Hence, Eq. (4) reflects the fact that in an MPSoC environment one can benefit from thread-level parallelism, but increasing the number of processors only accelerates the part of the code that can be parallelized at the thread level. Another important aspect is that simpler processors can run at much higher frequencies than superscalar processors, since their organizations take less area and power; however, the total power budget will probably be a limiting factor. For the sake of the model, we assume that

$$ CycleTime_{mc} = K \cdot CycleTime_{ss}. \qquad (5) $$

Based on the above reasoning, one can compare the performance of the superscalar with that of the MPSoC device:

$$ \frac{T_{ss}}{T_{mc}} = \frac{Instructions \cdot \big( \alpha \cdot \frac{CPI_{ss}}{issue} + \beta \cdot CPI_{ss} \big) \cdot CycleTime_{ss}}{Instructions \cdot \big( \delta \cdot \frac{CPI_{sc}}{P} + \theta \cdot CPI_{sc} \big) \cdot CycleTime_{mc}} \qquad (6) $$

and by simplifying Eq. (6) we obtain

$$ \frac{T_{ss}}{T_{mc}} = \frac{\big( \frac{\alpha}{issue} + \beta \big) \cdot CPI_{ss}}{\big( \frac{\delta}{P} + \theta \big) \cdot CPI_{sc}} \cdot \frac{1}{K}. \qquad (7) $$
It is clear from Eq. (7) that although simple pipelined cores have a faster cycle (by a factor of K), that factor alone is not enough to define performance. Regarding the second term of Eq. (7), the fact that the superscalar can execute many instructions in parallel can give it better performance. In the extreme case, let us imagine that issue = P = ∞, meaning that we have infinite resources, either in the form of arithmetic operators or in the form of processors. This reduces Eq. (7) to

$$ \frac{T_{ss}}{T_{mc}} = \frac{\beta \cdot CPI_{ss}}{\theta \cdot CPI_{sc}} \cdot \frac{1}{K}. \qquad (8) $$

This relation clearly shows that, as long as one has code which carries control or data dependencies and cannot be parallelized (at the instruction or thread level), a superscalar machine will always be faster than a multiprocessing machine, regardless of the amount of available resources.
Fig. 5 MPSoC and superscalar performance under the same power budget for different ILP and TLP percentages (α or δ, from 0.01 to 0.99); α = δ is assumed. Execution time is plotted on a logarithmic scale for a 4-issue superscalar and 8-, 18-, 48-, and 128-core MPSoCs.
An interesting question is when the performance of the superscalar equals the performance of the MPSoC. Setting Tss = Tmc, one gets

$$ \Big( \frac{\alpha}{issue} + \beta \Big) \cdot CPI_{ss} = \frac{1}{K} \cdot \Big( \delta \cdot \frac{CPI_{sc}}{P} + \theta \cdot CPI_{sc} \Big). \qquad (9) $$

This equation shows that one must have enough processors and enough code that can be parallelized at the thread level (either automatically or manually) to overcome the superscalar advantage. As will be seen shortly, that is exactly the case for many stream and multimedia applications.
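For clarity, Eq. (9) can be solved for P (a rearrangement added here; it is not in the original text), giving the break-even core count

$$ P_{be} = \frac{\delta \cdot CPI_{sc}}{K \cdot \big( \frac{\alpha}{issue} + \beta \big) \cdot CPI_{ss} \; - \; \theta \cdot CPI_{sc}}, $$

which is finite and positive only when the denominator is positive, i.e., when the sequential TLP fraction θ is small enough; otherwise no number of cores catches up with the superscalar.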
Given the theoretical model, one can test it with some numbers based on real data. Let us consider a 4-issue MIPS R10000 superscalar processor, and an MPSoC design composed of replications of the pipelined MIPS R3000 processor. A comparison between both architectures is done using the equations of the proposed analytical model. Figure 5 shows, on a logarithmic scale, the performance of the superscalar processor as the parameters α and β change, assuming the 4-issue core with a CPI of 0.6. In addition, Figure 5 also shows the performance of the MPSoC, varying the δ and θ parameters and the number of pipelined processors from 8 to 128. The goal of this comparison is to demonstrate which technique better explores parallelism at different levels, considering six values for both ILP and TLP. For instance, δ = 0.01 means that an application shows only 1% thread-level parallelism (TLP) within its code; likewise, α = 0.01 means that 1% of instruction-level parallelism (ILP) is provided in the application code, that is, only 1% of the code can be executed in parallel. The MIPS R3000 core is a simple pipelined core with a CPI of 1.8. Its hardware takes only 75,000 transistors [16], almost 29 times fewer than the 2.2 million transistors of the MIPS R10000 [25]. Following the same strategy found in current processor designs, for a fair performance comparison we considered the same power budget for the superscalar and the MPSoC approaches; to normalize the power budget of both approaches, we tuned the MPSoC clock frequency to match the power of the superscalar processor.

For applications with minimum ILP (α = 0.01) or TLP (δ = 0.01), the superscalar processor and an 8-core MPSoC present almost the same performance. However, when applications present a greater parallelism percentage (α > 0.25 and δ > 0.25), the 8-core MPSoC exploits the TLP and surpasses the ILP exploitation of the 4-issue superscalar processor. Due to the assumed power budget, when more cores are inserted into an MPSoC design the overall clock frequency tends to decrease. The performance on applications with low thread-level parallelism (small δ) therefore worsens as the number of cores increases: for the applications with δ = 0.01 in Figure 5, a lot of performance is wasted as the number of cores grows. Nevertheless, as the application thread-level parallelism increases (δ > 0.01), the negative impact on performance is softened, since the additional cores are put to better use.

Aiming at a fair area comparison between the superscalar and MPSoC approaches, we also devised an 18-core MPSoC that has the same area as the 4-issue superscalar processor: since the MIPS R10000 is 29 times bigger than the MIPS R3000, and considering that the interconnection mechanism takes almost 37% of the chip area (the number presented in [5], equivalent to 11 MIPS R3000 cores), 18 cores fit in the superscalar's area. The performance of both approaches shows the powerful capabilities of the superscalar processor, since it can accelerate code with different levels of available parallelism thanks to speculation and register renaming. When the same area is used in both designs, the MPSoC approach only surpasses the superscalar performance when the TLP level is greater than 75%.

Summarizing the comparison under the same power budget: the superscalar machine shows better performance when applications provide poor thread-level parallelism. For MPSoC designs there is an additional tradeoff, since as more cores are included in the chip, the MPSoC performance tends to worsen: more cores mean more power, and hence the cores must run at a lower frequency to obey the power budget. Even when almost the whole application supports TLP exploitation (δ = 0.99), the 128-core MPSoC takes a longer execution time than the other MPSoC designs.
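For illustration only, the following C sketch (not the authors' code) evaluates Eqs. (2), (4), and (7). The CPI and issue values are the ones quoted in the text, while the power-budget frequency scaling (cycle time growing linearly with the core count, normalized by an assumed K0 = 0.3 at 8 cores) is our own simplifying assumption:

#include <stdio.h>

/* CPI and issue values quoted in the text; K0 and the linear cycle-time
 * scaling are illustrative assumptions, not numbers from the chapter. */
#define CPI_SS  0.62          /* superscalar CPI [8]                 */
#define CPI_SC  1.80          /* MIPS R3000-like pipelined-core CPI  */
#define ISSUE   4.0           /* 4-issue superscalar                 */
#define K0      0.3           /* assumed cycle-time ratio at 8 cores */

/* Eq. (2): superscalar execution time per instruction (CycleTime_ss = 1). */
static double t_ss(double alpha) {
    double beta = 1.0 - alpha;
    return alpha * CPI_SS / ISSUE + beta * CPI_SS;
}

/* Eqs. (4)+(5): MPSoC execution time per instruction, with the cycle time
 * scaled by p/8 to keep the chip inside the same power budget. */
static double t_mc(double delta, int p) {
    double theta = 1.0 - delta;
    double k = K0 * (double)p / 8.0;
    return (delta * CPI_SC / p + theta * CPI_SC) * k;
}

int main(void) {
    const double par[]   = {0.01, 0.1, 0.25, 0.5, 0.75, 0.99};
    const int    cores[] = {8, 18, 48, 128};
    for (int i = 0; i < 6; i++) {
        printf("alpha=delta=%.2f  T_ss=%.3f", par[i], t_ss(par[i]));
        for (int j = 0; j < 4; j++)
            printf("  T_mc(%3d)=%.3f", cores[j], t_mc(par[i], cores[j]));
        putchar('\n');        /* Eq. (7) is the ratio of the printed values */
    }
    return 0;
}

Running it reproduces the qualitative trend of Figure 5: at small δ, adding cores only hurts, while at large δ the 8-core design remains the best of the MPSoCs under this power budget.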
2.2 Energy Comparison

If in terms of performance the superscalar is hard to beat, in terms of energy there is a whole new combination of factors. As power in CMOS circuits is proportional to the switching capacitance, to the frequency, and to the power supply squared, the potential of using an MPSoC is higher. In this simple model, we assume that power is dissipated only in the datapath. This is clearly overly optimistic for the power dissipated by a superscalar, but it gives an idea of the lower bound of energy dissipation in superscalar processors. The power consumption of a superscalar processor, Pss, can be written as

$$ P_{ss} \approx issue \cdot C \cdot \frac{1}{CPI_{ss}} \cdot V_{ss}^{2} \qquad (10) $$

where C is the switching capacitance of a single-issue processor, and Vss is the voltage the processor operates at. The term 1/CPIss is included because, to sustain a CPI smaller than 1, a lot of power is spent on speculation; this term tries to take that into account by increasing the power as more speculation is used to lower the CPI. The energy of the superscalar is simply

$$ E_{ss} = P_{ss} \cdot T_{ss}. \qquad (11) $$

For the multicore case,

$$ P_{mc} \approx P \cdot C \cdot \frac{1}{CPI_{sc}} \cdot V_{mc}^{2} \qquad (12) $$

where P is the number of homogeneous processor cores available in the chip, and again the correction factor 1/CPIsc has been included. In this case, the energy consumption is

$$ E_{mc} = P_{mc} \cdot T_{mc}. \qquad (13) $$

By comparing both we can write

$$ \frac{E_{ss}}{E_{mc}} = \frac{issue \cdot \frac{1}{CPI_{ss}} \cdot V_{ss}^{2} \cdot \big( \frac{\alpha}{issue} + \beta \big) \cdot CPI_{ss} \cdot \frac{1}{K}}{P \cdot \frac{1}{CPI_{sc}} \cdot V_{mc}^{2} \cdot \big( \frac{\delta}{P} + \theta \big) \cdot CPI_{sc}} \qquad (14) $$

and by simplifying (14) we obtain

$$ \frac{E_{ss}}{E_{mc}} = \frac{V_{ss}^{2} \cdot (\alpha + issue \cdot \beta) \cdot \frac{1}{K}}{V_{mc}^{2} \cdot (\delta + P \cdot \theta)}. \qquad (15) $$
Eq. (15) tells us that the superscalar dissipates useless power when extra time is spent on code that cannot benefit from ILP, while the MPSoC dissipates useless power whenever there is no code that can be duly parallelized.
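A one-line encoding of Eq. (15) in C, in the same illustrative spirit as the performance sketch above (the voltages v_ss, v_mc and the ratio k are left as free parameters, since the text does not fix them):

/* Eq. (15): energy ratio E_ss / E_mc, with alpha+beta = 1 and delta+theta = 1.
 * v_ss and v_mc are the supply voltages, k the cycle-time ratio of Eq. (5). */
double energy_ratio(double alpha, double delta, int p,
                    double v_ss, double v_mc, double k) {
    const double issue = 4.0;             /* 4-issue superscalar assumed */
    double beta = 1.0 - alpha, theta = 1.0 - delta;
    return (v_ss * v_ss * (alpha + issue * beta) / k)
         / (v_mc * v_mc * (delta + (double)p * theta));
}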
Fig. 6 MPSoC and superscalar energy consumption for a 4-issue superscalar and 8- and 18-core MPSoCs, for parallelism percentages (α or δ) from 0.01 to 0.99; α = δ is assumed.
Figure 6 shows the energy model results under the same power budget used in the performance model. For these experiments only the 8- and 18-core MPSoCs are considered, since the 48- and 128-core MPSoCs follow the same trend. As can be noticed in Figure 6, the superscalar organization and the 8-core MPSoC show the same energy behavior at low parallelism levels (α = δ = 0.01 and α = δ = 0.1). However, when greater parallelism is available in the application (α > 0.25 and δ > 0.25), the 8-core MPSoC performs better than the superscalar machine, which means less energy consumption thanks to the smaller power of the pipelined cores. On the other hand, to obey the design power budget, the 18-core MPSoC must run at a lower chip frequency than the 8-core MPSoC. Hence, MPSoCs with a high core density present worse performance on applications with low/medium TLP (Figure 5). As a consequence, the energy consumption of those MPSoCs tends to be greater than that of MPSoCs with fewer cores, since no power management is taken into account in this model, for either the superscalar or the MPSoC designs. As can be observed in Figure 6, the 8-core MPSoC achieves better energy consumption than the 18-core one for low/medium TLP values (δ < 0.75). However, when applications present greater thread-level parallelism, the 18-core MPSoC energy consumption converges to the same values as the 8-core design, thanks to the higher processor usage. Since the 18-core MPSoC has the same area as the 4-issue superscalar, employing the MPSoC technique is feasible only for applications with high thread-level parallelism if energy consumption is a hard design constraint.
2.3 DSP Applications on an MPSoC Organization

We now evaluate superscalar and MPSoC performance on an actual DSP application. An 18-tap FIR filter is used to measure the performance of both
Fig. 7 C-like FIR filter.
approaches on a traditional DSP application. The C-like description of the FIR filter employed in this experiment is illustrated in Figure 7. Superscalar machines explore the FIR filter's instruction-level parallelism transparently, working on the original binary code. Unlike the superscalar approach, exploiting the potential of the MPSoC architecture requires manual source code annotations in order to split the application code among many processing elements. Accordingly, some code highlights are shown in Figure 7 to simulate annotations, indicating the number of cores necessary to exploit the ideal thread-level parallelism of each part of the FIR filter code. For instance, the first annotation considers a loop controlled by the IMP_SIZE value, which depends on the number of FIR taps; in this case, 54 loop iterations are executed, since the experiment uses an 18-tap FIR filter. The OpenMP [4] programming model provides specific code directives to easily split loop iterations among processing elements (a sketch of this usage follows below). Using OpenMP directives, the ideal exploitation of this loop is achieved on a 54-core MPSoC, each core being responsible for executing a single loop iteration.
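As an illustration (this is not the authors' Figure 7 code; the array names and the use of a reduction clause are our own), a FIR inner-product loop can be split among cores with a single OpenMP directive:

#include <omp.h>

#define IMP_SIZE 54   /* 54 iterations for the 18-tap filter discussed above */

/* Each thread accumulates a private partial sum; OpenMP combines them
 * through the reduction clause, so no manual synchronization is needed. */
double fir_output(const double coeff[IMP_SIZE], const double delay[IMP_SIZE]) {
    double acc = 0.0;
    #pragma omp parallel for reduction(+:acc)
    for (int i = 0; i < IMP_SIZE; i++)
        acc += coeff[i] * delay[i];
    return acc;
}

With 54 cores, each thread executes one iteration; with 18 cores, the default static schedule assigns 3 consecutive iterations per thread, which is exactly the grouping described next.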
Fig. 8 Speedup provided in the 18-tap FIR filter execution for the superscalar, MPSoC, and mixed approaches (SS: 2.2; 6-MPIOC: 6.0; 18-MPIOC: 17.3; 54-MPIOC: 44.8; 6-MPSS: 13.3; 18-MPSS: 37.1; 54-MPSS: 74.9).
However, when the number of processing elements is lower than the number of loop iterations, OpenMP combines iterations into groups to distribute the work among the available resources. Hence, when the 54 loop iterations execute on an 18-core MPSoC, OpenMP creates 18 groups, each composed of 3 iterations. Figure 7 demonstrates that almost the entire FIR code can be parallelized, since the code is made up of several loops. In general, traditional DSP applications (e.g., FFT and DCT) have a loop-based code structure suitable for OpenMP loop parallelization. However, some loops cannot be parallelized due to data dependencies among iterations. The last loop of the FIR filter description shown in Figure 7 demonstrates this behavior, since shifting the array values creates dependencies among all loop iterations.

Aiming to illustrate the impact of TLP and ILP exploitation on the performance of DSP applications, we evaluated the 18-tap FIR execution on three kinds of architectures: a 4-issue superscalar (SS); 6-, 18-, and 54-core MPSoCs based on pipelined cores with no ILP exploitation capabilities (MPIOC); and finally, in order to have a glimpse of the future, imagined 6-, 18-, and 54-core MPSoCs based on a 4-issue superscalar processor, able to exploit both ILP and TLP (MPSS). We extracted the speedup with a tool [17] that builds all data dependence graphs of the application. Then, given the characteristics of the evaluated architectures, the execution time of each graph is measured in order to obtain its speedup over the baseline processor. It is important to point out that instruction and thread communication overhead has not been taken into account in this experiment.
Regarding the MPSoC composed of pipelined cores, the 6-core machine provides almost a linear speedup, decreasing by 5.96 times the single pipelined core execution time. This behavior is maintained when more pipelined cores are inserted. However, when 18-tap FIR filter is explored for the maximum TLP (54-MPIOC), a speed up of only 44.8 times is achieved, showing that even applications which are potentially suitable for TLP exploration could present non-linear speedups. This can be explained by the sequential code present inside of loop iteration. Amdahl´ns Law shows that it is not sufficient to build architectures with a large number of processors, since most parallel applications contain a certain amount of sequential code [24]. Hence, there is a need to balance the number of processors with a suitable ILP exploration approach to achieve greater performance. The MPSS approach combines TLP with ILP exploration of 4-issue superscalar aiming to show that simple TLP extraction is not enough to achieve linear speedups even for DSP applications with high TLP. Figure 8 illustrates the speedup of the MPSS approach. As it can be noticed, 6-MPSS accelerates the 18-tap FIR filter more than twice 6MPIOC, since higher ILP is explored for the superscalar processor. In this case, when the first loop of Figure 7 is split among the MPSoC, each core executes 9 loop iterations providing a large room for ILP exploration. However, when the number of cores increases, the sequential code decreases, making less room for the ILP optimization. Nevertheless, the 18-tap FIR filter execution in 54-MPSS is 66% faster than 54-MPIOC execution showing the need of a mixed parallelism exploration. Summarizing, most DSP applications benefit of thread level parallelism exploration thanks to their loop-based behavior. However, even applications with high TLP could still obtain some performance improvement by also exploiting ILP. Hence, in an MPSoC design ILP techniques also should be investigated to conclude what is the best fit concerning the design requirements. Finally, this subsection also shows that replications of simple processing elements leaves a significant optimization possibility unexplored, indicating that heterogeneous MPSoC could be a possible solution to balance the performance of the system.
3 MPSoC Classification

Due to the large number of application domains to which the MPSoC design principle can be applied, some method is needed to assess the advantages and disadvantages of each design. Commonly, MPSoC designs are classified only by their architecture, as homogeneous or heterogeneous. In this section the classification is done from two points of view: architecture and organization.
3.1 Architecture View Point

When processors are classified from the architecture viewpoint, the instruction set architecture (ISA) is the parameter used to distinguish them. The first step of a processor design is the definition of its basic operation capabilities, aiming to create an ISA that exposes those operations to the high-level programmer. An ISA is typically classified as a reduced instruction set computer (RISC) or a complex instruction set computer (CISC). The strategy of RISC architectures is to make the processor datapath control as simple as possible. Their instruction sets are non-specialized, covering only the basic operations needed for simple computations. Besides the reduced instruction set, genuine RISC architectures provide additional features such as instructions completing on every clock cycle, special instructions to perform memory accesses, register-based operations, and a small clock cycle thanks to pipelined micro-operations. On the other hand, the CISC architecture strategy aims at supporting all the necessary operations in hardware, building specialized instructions to better support high-level software abstractions. In fact, a single CISC instruction reflects the behavior of several RISC instructions, with varying numbers of execution cycles and memory accesses. Hence, the datapath control becomes more complex and harder to manage than in RISC architectures. Both strategies are widely employed in processor architectures, always guided by the requirements and constraints of the design. DSP processors fit the CISC paradigm better, since they include many specialized instructions to provide efficient signal processing execution. Multiply-accumulate, rounding arithmetic, specialized instructions for modulo addressing in circular buffers, and a bit-reversed addressing mode for the FFT are some examples of specialized CISC instructions focused on accelerating specific, critical parts of the DSP application domain.

From the architecture point of view, there are several ways to combine processing elements inside a die. Figure 1 presents three examples of MPSoCs with different processing element arrangements. As can be noticed in this figure, MPSoC #1 is composed of four processing elements with the very same ISA, and this arrangement is classified as a homogeneous MPSoC architecture. The same classification is given to MPSoC #2, which differs from the first only in the diversity of the computing capabilities of its processing elements; details about this organization style are given in the next section. Homogeneous MPSoC architectures have become widespread in the general-purpose processing domain. As already explained, the current difficulty of increasing application speed using ILP techniques, combined with the large density provided by high transistor-level integration, has encouraged the coupling of several GPPs on a single die. Since, in most cases, the baseline processor is already validated, the inter-processor communication design is the only task to be performed in the development of a homogeneous MPSoC architecture (although this has turned out not to be such an easy task). On the software side, programming models are used to fork the code among the homogeneous cores, making the hard task of coding parallel software easier. OpenMP [4] and MPI [14] are the most-used programming models, providing a large number of directives that facilitate the hard task of distributing code among processing elements.
More details about programming models are provided in the last section.

For almost five years now, Intel and AMD have been using this approach to speed up their high-end processors. In 2006, Intel shipped its MPSoC based on the homogeneous architecture strategy: the Intel Core Duo is composed of two processing elements that communicate through an on-chip cache memory. In this project, Intel thought beyond the benefits of MPSoC employment and created an approach to increase the process yield. A new processor market line, called Intel Core Solo, was created to increase the process yield by selling even Core Duo dies with manufacturing defects: the Intel Core Solo has the very same two-core die as the Core Duo, but with only one defect-free core. Recently, embedded general-purpose RISC processors have been following the trend of high-end processors, coupling many processing elements with the same architecture on a single die. Earlier, due to the hard constraints of these designs and the few parallel applications that would benefit from several GPPs, homogeneous MPSoCs were not suitable for this domain. However, the embedded software scenario is becoming similar to that of a personal computer, due to the large number of simultaneous applications running on the GPP. The ARM Cortex-A9 MPSoC processor is the pioneer in employing the homogeneous multiprocessing approach in the embedded domain, coupling four Cortex-A9 cores on a single die. Each processing element uses powerful techniques for ILP exploitation, such as superscalar execution and SIMD instruction set extensions, which closes the gap between embedded processor designs and high-end processors.

Heterogeneous MPSoC architectures address different market requirements than homogeneous ones. Generally, the main goal of the homogeneous approach is to provide efficiency on a large range of application behaviors, being widespread in GPP designs. On the other hand, the heterogeneous approach aims to supply efficient execution on a defined range of application behaviors, including particular ISA extensions to handle specific software domains. MPSoC #3 (Figure 1) exemplifies a heterogeneous architecture composed of three different ISAs. This architecture style is widely used in the mobile embedded domain, due to the necessity for high performance with energy efficiency. The heterogeneous MPSoC assembly depends a lot on the performance and power requirements of the application. Suppose a current cell phone scenario where several applications run on the MPSoC. Commonly, designers explore the design space by trying to find critical parts of code that could be executed efficiently in hardware; these parts are then moved to hardware, achieving better performance and energy efficiency. In an MPSoC used by a current cell phone there are many heterogeneous processing elements providing efficient execution of specific tasks, like video decompression and audio processing. The Samsung S3C6410 illustrates well the embedded domain's trend to use MPSoCs. This heterogeneous architecture handles the applications most widely used on embedded devices, like multimedia and digital signal processing. Samsung's MPSoC is based on an ARM 1176 processor, which includes several ISA extensions, such as DSP and SIMD processing, aiming to increase the performance of those applications.
Fig. 9 Examples of homogeneous and heterogeneous organizations.
Its architecture is composed of many other dedicated processing elements, called hard-wired multimedia accelerators. Video codecs (MPEG4, H.264 and VC1), an image codec (JPEG), and 2D and 3D accelerators have become hard-wired because of the performance and energy constraints of embedded systems. This drastic design change illustrates the growth and maturity phases of the functionality lifecycle discussed at the beginning of this chapter. In this phase, the consumer electronics market has already absorbed these functionalities, and their hard-wired execution is mandatory for energy and performance efficiency.
3.2 Organization View Point
In contrast with the previous section, this part of the chapter explores the heterogeneous versus homogeneous arrangement using the MPSoC organization as the classification parameter. This means that all MPSoC designs discussed here have the same ISA, and they are classified considering the use of similar/dissimilar processor organizations. The homogeneous MPSoC organization, as already explained, has become widespread in general-purpose processing. Intel and AMD replicate their cores on a single die to explore process/thread parallelism, passing the responsibility of process/thread scheduling to the operating system or software team. Also, the ARM Cortex design has introduced MPSoCs into the embedded domain with a homogeneous MPSoC organization. When a simple replication of the very same processing element occurs, the design is classified as homogeneous both in architecture and in organization. Its processing elements have the same processing capability, area, and energy consumption, which limits the design space exploration, since no special task scheduling can be employed. Process/thread scheduling in an MPSoC design is an important feature to be explored, in order to trade off requirements against design constraints, aiming to find the right balance between performance and power consumption. Considering the two approaches illustrated in Figure 9, MPSoC #1 is composed of replications of the same processing element, and is classified as homogeneous both in architecture and organization. On the other hand, MPSoC #2, from the organization point of view, presents different processing elements despite having the same ISA. The latter MPSoC fashion is classified as a heterogeneous MPSoC
from the organization point of view, and as a homogeneous MPSoC from the architecture point of view. MPSoC #1 supports only naive task scheduling approaches, since all processing elements have the same characteristics. In this organization fashion, a task is always allocated to the same processing element organization, meaning that there is no scheduling choice through which certain design requirements could be met. In contrast, MPSoC #2 provides flexibility in task scheduling: an efficient algorithm can be developed to support design trade-offs regarding the requirements and constraints of the device. Suppose a scenario with three threads (thread#1, thread#2, and thread#3) that require the following performance levels: low, very low, and very high, respectively. The MPSoC #1 scheduling approach simply ignores the computation requirements, executing the three threads on its uniformly powerful processing elements and causing a waste of energy resources. An efficient scheduling algorithm can be employed in MPSoC #2 to avoid this drawback, exploiting its heterogeneous organization. Suppose a dynamic task scheduling algorithm that saves information about thread execution: this approach can match each thread's computation needs to the different processing capabilities available in MPSoC #2, allowing energy-efficient scheduling. For the given example, thread#1 is executed on the middle-size processing element, thread#2 on the smallest one, and thread#3 on the largest processing element.
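To make this scheduling idea concrete, the following C sketch implements the greedy policy described above: each thread is mapped to the weakest core that still satisfies its performance demand, so that powerful (and power-hungry) cores are reserved for the threads that really need them. The thread and core descriptors, the discrete demand scale, and the policy itself are illustrative assumptions for this example, not a description of any real MPSoC scheduler; core occupancy is also ignored for brevity.

#include <stdio.h>

/* Illustrative performance levels, ordered from weakest to strongest. */
typedef enum { VERY_LOW, LOW, HIGH, VERY_HIGH } level_t;

typedef struct { const char *name; level_t demand;     } thread_t;
typedef struct { const char *name; level_t capability; } core_t;

/* Greedy heterogeneity-aware policy: pick the weakest core whose
 * capability still meets the thread's demand. */
static const core_t *pick_core(const core_t *cores, int n, level_t demand)
{
    const core_t *best = NULL;
    for (int i = 0; i < n; i++)
        if (cores[i].capability >= demand &&
            (best == NULL || cores[i].capability < best->capability))
            best = &cores[i];
    return best; /* NULL if no core is powerful enough */
}

int main(void)
{
    /* MPSoC #2 style: same ISA, different processing capabilities. */
    const core_t cores[] = {
        { "smallest core", VERY_LOW  },
        { "middle core",   LOW       },
        { "largest core",  VERY_HIGH },
    };
    const thread_t threads[] = {
        { "thread#1", LOW }, { "thread#2", VERY_LOW }, { "thread#3", VERY_HIGH },
    };
    for (int i = 0; i < 3; i++) {
        const core_t *c = pick_core(cores, 3, threads[i].demand);
        printf("%s -> %s\n", threads[i].name, c ? c->name : "no suitable core");
    }
    return 0;
}

Running the sketch reproduces the mapping given in the text: thread#1 goes to the middle core, thread#2 to the smallest one, and thread#3 to the largest one.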
4 Successful MPSoC Designs
Many successful MPSoC solutions are on the market, both for high-end personal computers and for highly-constrained mobile embedded devices. Commonly, as already explained, high-end personal computers rely on robust GPPs, with architecture and organization shaped in a homogeneous way, due to the large variation of software behaviors that run on these platforms. On the other hand, MPSoCs for mobile embedded devices supply many ISAs, through application-specific instruction set processors, to provide efficient execution within a restricted application domain. This section restricts the MPSoC exploration to the embedded domain, due to its wide usage of DSP processors. Some successful embedded MPSoC platforms are discussed, demonstrating the trends for mobile multiprocessors.
4.1 OMAP Platform
In 2002, Texas Instruments launched the Innovator Development Kit (IDK), targeting high performance and low power consumption for multimedia applications. The IDK provides easy design development, with open software, based on a customized hardware platform called the open multimedia applications processor (OMAP). Since its launch, OMAP has been a successful platform, used by embedded-market leaders like Nokia, with its N90 cell phone series, Samsung, with the OMNIA HD, and Sony Ericsson, with the IDOU.
Fig. 10 OMAP 4440 block diagram.
Currently, due to the large diversity found in the embedded consumer market, Texas Instruments has divided the OMAP family into two different lines, covering different aspects. The high-end OMAP line supports the current sophisticated smart phones and powerful cell phone models, providing pre-integrated connectivity solutions for the latest technologies (3G, 4G, WLAN, Bluetooth and GPS) and audio and video applications (WUXGA), including high definition television. The low-end OMAP platforms cover down-market products, providing older connectivity technologies (GSM/GPRS/EDGE) and low definition displays (QVGA). Recently, Texas Instruments released its latest high-end product: the OMAP4440 covers connectivity besides high-quality video, image, and audio support. This mobile platform addresses the increasing convergence of multimedia applications in a single embedded device. As can be seen in Figure 10, the platform incorporates the dual-core ARM Cortex A9 MPCore, providing higher mobile general-purpose computing performance. A novel power management hardware technique available in the ARM Cortex A9 MPCore balances the power consumption with the performance requirements, activating only the cores that are needed for a particular execution. Also, due to the high performance requirements of today's smart phones, up to eight threads can be fired concurrently in the MPCore, since each MPCore can be composed of four single-core Cortex A9 processors. The single-core ARM Cortex A9 implements speculative-issue out-of-order superscalar execution, a SIMD instruction set, and DSP extensions, bringing almost the processing power of a personal computer to an embedded mobile platform. Excluding the ARM Cortex MPCore, the remaining processing elements are dedicated to multimedia execution. The POWERVR™ SGX540 multi-threaded graphics accelerator provides state-of-the-art support for 2D and 3D graphics, including the OpenGL and EGL APIs.
Several image enhancements, like digital anti-aliasing, digital zoom, and auto-focus, are provided by the Image Signal Processor (ISP), which focuses on increasing the quality of the digital camera. The processing capabilities of the ISP enable picture resolutions of up to 20 megapixels, with an interval of one second between shots. The high memory bandwidth needed by the ISP is supplied by a DMA controller that achieves a memory-to-memory rate of 200 megapixels/second at 200 MHz. Completing the robust multimedia accelerator scenario, the OMAP4440 allows true 1080p multi-standard high definition recording and playback at 30 frames per second through its IVA3 hardware accelerator. The IVA3 DSP processor supports a broad range of multimedia codec standards, including MPEG4 ASP, MPEG-2 MP, H.264 HP, ON2 VP7 and VC-1 AP. The large number of OMAP4440 processing elements would make its power dissipation unfeasible for mobile embedded devices. Texas Instruments has handled this drawback by integrating a power management technology into the OMAP4440 design. SmartReflex 2 supports, at both the hardware and software levels, the following power management techniques: Dynamic Voltage & Frequency Scaling (DVFS), Adaptive Voltage Scaling (AVS), Dynamic Power Switching (DPS), Static Leakage Management (SLM) and Adaptive Body Bias (ABB). DVFS supports multiple voltage values, aiming to balance the supply voltage and frequency against the performance needed at a given execution time. AVS is coupled to DVFS to provide a finer balance of power consumption, considering hardware conditions like temperature. The coupling of both techniques delivers the maximum performance of the OMAP4440 processing elements with the minimum possible voltage at every computation. Regarding leakage power, the 45nm OMAP4440 MPSoC die already provides a significant reduction in consumption; the DPS and SLM techniques allow very low stand-by power modes, avoiding leakage power. Finally, ABB works at the transistor level, enabling dynamic changes of transistor threshold voltages to reduce leakage power. All techniques are supported by the integrated TWL6030 power management chip illustrated in Figure 10. In order to support energy-efficient audio processing, Texas Instruments has included the TWL6040 chip, which acts as an audio back-end, providing up to 140 hours of music playback in airplane mode.
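As a rough illustration of how DVFS trades frequency and voltage against the performance actually needed, the C sketch below picks the slowest operating point that still covers a measured core utilization, leaving some headroom. The operating-point table, the headroom factor, and the utilization figure are made-up example values, not OMAP4440 or SmartReflex data.

#include <stdio.h>

/* Hypothetical voltage/frequency operating points (OPPs),
 * ordered from slowest/lowest-voltage to fastest. */
typedef struct { int freq_mhz; int mvolt; } opp_t;
static const opp_t opps[] = {
    { 300, 930 }, { 600, 1100 }, { 800, 1260 }, { 1008, 1350 },
};
enum { NUM_OPPS = sizeof(opps) / sizeof(opps[0]) };

/* DVFS step: given the current frequency and the fraction of time the
 * core was busy, choose the slowest OPP that still leaves 10% headroom;
 * running slower at a lower voltage saves both dynamic and static power. */
static const opp_t *pick_opp(int cur_freq_mhz, double utilization)
{
    double needed = cur_freq_mhz * utilization * 1.10;
    for (int i = 0; i < NUM_OPPS; i++)
        if (opps[i].freq_mhz >= needed)
            return &opps[i];
    return &opps[NUM_OPPS - 1]; /* saturate at the fastest OPP */
}

int main(void)
{
    const opp_t *o = pick_opp(800, 0.35); /* core was 35% busy at 800 MHz */
    printf("next OPP: %d MHz @ %d mV\n", o->freq_mhz, o->mvolt);
    return 0;
}

Techniques like AVS then refine the chosen voltage at run time using hardware feedback (e.g., temperature), which is beyond what a table-driven sketch like this can capture.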
4.2 Samsung S3C6410/S5PC100
Like OMAP, Samsung MPSoC designs are focused on multimedia-based development. The projects are very similar due to the increasing market demand for powerful multimedia platforms, which stimulates designers to take the same decisions to achieve efficient multimedia execution. Commonly, the integration of specific accelerators is used, since this reduces the design time by avoiding validation and testing effort. In 2008, Samsung launched the most powerful member of its mobile MPSoC family. At first, the S3C6410 was a multimedia MPSoC like the OMAP4440. However, after its deployment in the Apple iPhone 3G, it became one of the most popular MPSoCs, shipping 3 million units during the first month after launch.
Fig. 11 Samsung S3C6410 and S5PC100 MPSoC block diagrams.
Recently, Apple developed the iPhone 3GS, which assures better performance with lower power consumption. These benefits are supplied by the replacement of the S3C6410 architecture with the high-end S5PC100 version. In this subsection, we explore both architectures, highlighting the platform improvements from the S3C6410 to the S5PC100. Following the multimedia-based MPSoC trend, the Samsung platforms are composed of several application-specific accelerators, building heterogeneous MPSoC architectures. The S3C6410 and S5PC100 have a central general-purpose processing element, in both cases ARM-based, surrounded by several multimedia accelerators tightly targeted to DSP processing. As can be noticed in Figure 11, both platform skeletons follow the same execution strategy, changing only the processing capability of their IP cores. This is an interesting design modeling approach: while traditional architecture designs suffer through long design/validate/test phases to build a new architecture, MPSoC designs shorten these design times by simply replacing their low-end cores with high-end ones, creating a new MPSoC architecture. This can be verified in Figure 11: small platform changes are made from the S3C6410 to the S5PC100, aiming to increase the MPSoC performance. More specifically, the 9-stage pipelined ARM 1176JZF-S core with SIMD extensions is replaced by a 13-stage superscalar pipelined ARM Cortex A8, providing greater computation capability for general-purpose applications. Besides its double-sized L1 cache compared with the ARM1176JZF-S, the ARM Cortex A8 also includes a 256KB L2 cache, avoiding external memory accesses due to L1 cache misses. The ARM NEON technology is included in the ARM Cortex A8 to provide flexible and powerful acceleration for intensive multimedia applications. Its SIMD-based execution accelerates multimedia and signal processing algorithms, such as video encode/decode, 2D/3D graphics, speech processing, and image processing, to at least 2x the performance of the previous SIMD technology. Regarding multimedia accelerators, both MPSoCs are able to provide suitable performance for any high-end mobile device. However, the S5PC100 includes the latest multimedia codec support using powerful accelerators. A 720p multi-format codec (MFC) video engine is included, aiming to provide high quality capture and playback at 30 frames per second.
Also, previous video codecs such as MPEG4, VC1, and H.264 are available on this platform. Besides the S3C6410 TV-out interface supporting PAL and NTSC, the new MPSoC version includes the high definition multimedia interface (HDMI) standard, allowing high definition video formats. Summarizing, all the technologies included in the S5PC100 platform accelerate multimedia applications employing DSP approaches.
4.3 Other MPSoCs
Other MPSoCs have been released on the market with different goals from the architectures discussed before. Sony, IBM and Toshiba worked together to design the Cell Broadband Engine Architecture [3]. The Cell architecture combines a powerful central processor with eight SIMD-based processing elements. Aiming to accelerate a large range of application behaviors, the IBM PowerPC architecture is used as the GPP. This processor also has the responsibility of managing the processing elements surrounding it. These processing elements, called synergistic processing elements (SPE), are built to support streaming applications with SIMD execution. Each SPE has a local memory and, without hardware management, cannot directly access the system memory. These facts make software development for the Cell processor even more difficult, since the software team must be aware of this local memory and manage it at the software level to better exploit SPE execution. Despite its high processing capability, the Cell processor does not yet have large market acceptance, because of the intrinsic difficulty of coding software that uses the SPEs. Sony lost part of the gaming entertainment market after the Cell processor was deployed in the Playstation console family, since game developers did not have enough knowledge to efficiently explore the complex Cell architecture. The homogeneous MPSoC organization is also explored in the market, mainly for personal computers with general-purpose processors: because of the huge number of different applications these processors have to face, it is more difficult to define specialized hardware accelerators. In 2005, Sun Microsystems announced its first homogeneous MPSoC, composed of up to 8 processing elements executing the full RISC SPARC V9 instruction set. The UltraSparc T1, also called Niagara [11], is the first multithreaded homogeneous MPSoC; each processing element is able to execute four threads concurrently, so Niagara can handle up to 32 threads at the same time. Recently, with the deployment of the UltraSparc T2 [10], this number has grown to 64 concurrent threads. The Niagara MPSoC family targets massive data computation with distributed tasks, like the market for web servers, database servers and network file systems. Intel has announced its first homogeneous MPSoC prototype with 80 cores, which is capable of executing 1 trillion floating-point operations per second while consuming 62 Watts [22]. The company expects to launch this chip on the market within the next 5 years. Hence, the x86 instruction set architecture era could be broken, since its processing elements are based on the very
long instruction word (VLIW) approach, leaving to the compiler the responsibility for parallelism exploration. The 80-core MPSoC uses a mesh network to communicate among its processing elements. However, even with the mesh, communication turns out to be difficult due to the great number of processing elements, so this ambitious project uses 20 Mbytes of stacked on-chip SRAM memory to improve the processing elements' communication bandwidth. The graphics processing unit (GPU) is another MPSoC approach, aiming at graphics-based software acceleration. This approach has also been emerging as a promising architecture to improve general-purpose software. Intel Larrabee [18] attacks both application domains thanks to its CPU- and GPU-like architecture. In this project, Intel has relied on the assumption that energy efficiency can be achieved by the replication of simple cores. Larrabee uses several P54C-based cores to handle general-purpose applications. The P54C was shipped in 1994 in 0.6 μm CMOS technology, reached up to 100 MHz, and does not include out-of-order superscalar execution. However, some modifications have been made to the P54C architecture, like support for SIMD execution, to provide more powerful graphics-based software execution. The Larrabee SIMD execution is similar to, but more powerful than, the SSE technology available in modern x86 processors. Each P54C is coupled to a 512-bit vector pipeline unit (VPU), capable of executing 16 single-precision floating-point operations in one processor cycle. Also, Larrabee employs fixed-function graphics hardware that performs texture sampling tasks like anisotropic filtering and texture decompression. The NVIDIA Tesla [12] is another example of an MPSoC based on the concept of a general-purpose graphics processing unit. Its massively-parallel computing architecture provides support for the Compute Unified Device Architecture (CUDA) technology. CUDA, NVIDIA's computing engine, supports the software developer by easing the application workload distribution through software extensions. Also, CUDA provides access to the native instruction set and memory of the processing elements, turning the NVIDIA Tesla into a CPU-like architecture. The Tesla architecture incorporates up to four multithreaded cores communicating through a GDDR3 bus, which provides a huge data bandwidth. Summarizing the MPSoCs discussed above, Table 1 compares their main characteristics, showing their differences depending on the target market domain. Heterogeneous architectures, like the OMAP, Samsung, and Cell, incorporate several specialized processing elements to attack specific applications for highly-constrained mobile or portable devices. These architectures have multimedia-based processing elements, following the trend of embedded systems. Homogeneous architectures still use only homogeneous organizations, coupling several processing elements with the same ISA and the same processing capabilities. Heterogeneous organizations have not been used in homogeneous architectures, since variable processing capability is supported dynamically by power management techniques like DVFS. Unlike heterogeneous architectures, homogeneous ones aim at the general-purpose processing market, handling a wide range of application behaviors by replicating the GPP.
Table 1 Summarized MPSoCs.

MPSoC                       | Architecture  | Organization  | Cores                                                                      | Multithreaded Cores     | Interconnection
TI OMAP4440                 | Heterogeneous | Heterogeneous | 2 ARM Cortex A9, 1 PowerVR graphics accelerator, 1 Image Signal Processor | No                      | Integrated Bus
Samsung S3C6410/S5PC100     | Heterogeneous | Heterogeneous | 1 ARM1176JZF-S, 5 multimedia accelerators                                  | No                      | Integrated Bus
Cell                        | Heterogeneous | Heterogeneous | 1 PowerPC, 8 SPE                                                           | No                      | Integrated Bus
Sun Niagara                 | Homogeneous   | Homogeneous   | 8 SPARC V9 ISA                                                             | Yes (4 threads)         | Crossbar
Intel 80-Cores              | Homogeneous   | Homogeneous   | 80 VLIW                                                                    | No                      | Mesh
Intel Larrabee              | Homogeneous   | Homogeneous   | n P54C x86 cores, SIMD execution                                           | No                      | Integrated Bus
NVIDIA Tesla (GeForce 8800) | Homogeneous   | Homogeneous   | 128 Stream Processors                                                      | Yes (up to 768 threads) | Network
5 Open Problems
Due to the short history of MPSoCs, several design points are still open. In this section, we discuss two open problems of MPSoC design: the interconnection mechanism and MPSoC programming models.
5.1 Interconnection Mechanism
The interconnection mechanism has a very important role in an MPSoC design, since it is responsible for supporting the exchange of information between all MPSoC components, typically between processing elements, or between processing elements and some storage component. The development of the interconnection mechanism of an MPSoC should take into account the following aspects: parallelism, scalability, testability, fault tolerance, reusability, energy consumption, and communication bandwidth [13]. There are several interconnection approaches that provide different qualitative levels regarding the cited aspects. As can be noticed in Table 1, there is no agreement on the interconnection mechanism, since each design has specific constraints and requirements that guide the choice of the communication infrastructure. Buses are the most used mechanism in current MPSoC designs. Current buses can achieve high speeds, and buses have additional advantages like low cost, easy testability, and high communication bandwidth, encouraging their use in MPSoC designs. The weak points of this approach are poor scalability, no fault tolerance, and no parallelism exploitation. However, modifications of the original concept can soften these disadvantages, though they may also affect some of the good characteristics. The segmented bus is a bus derivative aiming to increase
performance, communication parallelism, and energy savings [7]. This technique divides the original bus into several parts, enabling concurrent communication inside each part. However, bus segmentation impacts scalability and makes the communication management between isolated parts harder. Despite their disadvantages, and for obvious historical reasons, buses are still widely used in MPSoC designs. Intel and AMD still use integrated buses as the communication infrastructure of their high-end MPSoCs, due to the easy implementation and high bandwidth they provide. Crossbars are widely used in network hardware like switches and hubs, and some MPSoC designers have employed this mechanism to connect processing elements [10]. This interconnection approach provides huge performance, enabling communication between any pair of processing elements in the smallest possible time. However, high area cost, high energy consumption, and poor scalability discourage its employment. The AMD Opteron family and Sun Niagara use crossbars to support high communication bandwidth within their general-purpose processors. The network-on-chip (NoC) has been emerging as a solution to couple several processing elements [22]. This approach provides high communication parallelism, since several connecting paths are available for each node. In addition, as technology scales, wire delays increase (because of the increased resistance derived from the smaller cross-section of the wire), and hence the shorter wires used in NoCs can soften this scaling drawback. Also, the explicitly modular shape of a NoC positively affects processing element scalability, and can be exploited by a power management technique that simply turns off idle components of the network. NoC disadvantages include the excessive area overhead and the high latency of the routers. The Intel 80-core prototype employs a mesh-style network-on-chip to support the communication of its 80 processing elements.
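The scalability argument for meshes can be illustrated with dimension-order (XY) routing, a classic NoC routing scheme in which a packet first travels along the X dimension and then along Y, so each router decides using only local coordinate comparisons. The sketch below computes the resulting hop count on a mesh sized like the 80-core prototype; the 10x8 layout and the routing scheme itself are generic textbook assumptions, not details of Intel's actual router.

#include <stdio.h>
#include <stdlib.h>

#define MESH_W 10
#define MESH_H 8   /* 80 cores arranged as an assumed 10x8 mesh */

/* Dimension-order (XY) routing: move along X until the column matches,
 * then along Y; the hop count is the Manhattan distance, which grows
 * with the distance between nodes rather than with the core count. */
static int xy_hops(int src, int dst)
{
    int sx = src % MESH_W, sy = src / MESH_W;
    int dx = dst % MESH_W, dy = dst / MESH_W;
    return abs(dx - sx) + abs(dy - sy);
}

int main(void)
{
    printf("hops from core 0 to core 79: %d\n", xy_hops(0, 79));
    return 0;
}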
5.2 MPSoC Programming Models
For decades, many ILP exploration approaches have been proposed to improve processor performance. Most of those works employed dynamic ILP exploration at the hardware level, which became an efficient and adaptive process used, for example, in superscalar architectures. Traditional ILP exploration also frees software developers from the hard task of making explicit, in the source code, those parts that can be executed in parallel. MPSoC employment, however, relies on manual source code changes to fork the parallel parts among the processing elements. Currently, software developers must be aware of MPSoC characteristics, like the number of processing elements, in order to allocate the parallel code. Thus, sequential programming knowledge must migrate to the parallel programming paradigm. Due to these facts, parallel programming approaches have been gaining importance in the computing area, since easy and efficient code production is fundamental to exploit MPSoC processing capability.
Communication between processing elements is needed whenever information exchange among threads is performed. Commonly, this communication is based either on message passing or on shared memory techniques. Message passing leaves the complex task of execution management at the software description level: the software code should contain a detailed description of the parallelization process, since the application developer has complete control over its management. The meticulous message-passing management done in software makes it slower than shared memory; however, this approach enables robust communication, giving the software developer complete control over the job. The Message Passing Interface (MPI) [14] is a widely used standard protocol that employs the message-passing communication mechanism. MPI provides an application programming interface (API) that specifies a set of routines to manage inter-process communication. The advantage of MPI over other mechanisms is that both data and task parallelism can be explored, at the cost of more code changes to achieve parallel software. In addition, MPI relies on the network communication hardware, which guides the performance of the entire system. Shared memory communication uses a storage mechanism to hold the information necessary for thread communication. This approach provides simpler software development, since, thanks to a global addressing system, most of the communication drawbacks are transparent to the software team. The main drawback of shared memory employment is the bottleneck between processing elements and memory, since several threads may try to access the same storage element at the same time. Memory coherency can also be a bottleneck for shared memory employment, since some mechanism must be applied to guarantee data coherency. OpenMP [4] employs the shared memory communication mechanism to manage the parallelization process. This approach is based on a master-slave mechanism, where the master thread forks a specified number of slave threads to execute in parallel. Thus, each thread executes a parallelized section of the application on a processing element independently. OpenMP provides easier programming than MPI, with greater scalability, since fewer code changes are needed to increase the number of spawned threads. However, in most cases OpenMP code coverage is limited to highly parallel parts and loops. This approach relies strongly on the compiler, which must translate the OpenMP directives to recognize which sections should be parallelized.
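The contrast between the two models can be made concrete with a toy dot product. The first C sketch uses OpenMP: the master thread forks slaves, loop iterations are split among them, and partial sums are combined by the reduction clause over shared arrays (compile with an OpenMP-enabled compiler, e.g., with -fopenmp). Both sketches are minimal illustrations, not complete applications.

#include <stdio.h>
#include <omp.h>

#define N 1000000

int main(void)
{
    static double a[N], b[N];
    double sum = 0.0;
    for (int i = 0; i < N; i++) { a[i] = 1.0; b[i] = 2.0; }

    /* Master thread forks slave threads; iterations are distributed
     * among them, and partial sums are merged by the reduction. */
    #pragma omp parallel for reduction(+:sum)
    for (int i = 0; i < N; i++)
        sum += a[i] * b[i];

    printf("dot product = %f (%d threads available)\n",
           sum, omp_get_max_threads());
    return 0;
}

The MPI version of the same computation makes the data distribution and the communication explicit: each process owns a slice of the work and the partial sums are combined by an explicit reduction message (compile and launch with mpicc/mpirun; for brevity the sketch assumes the process count divides N).

#include <stdio.h>
#include <mpi.h>

#define N 1000000

int main(int argc, char **argv)
{
    int rank, size;
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    /* Each process computes the dot product of its own slice... */
    int chunk = N / size;
    double local = 0.0;
    for (int i = rank * chunk; i < (rank + 1) * chunk; i++)
        local += 1.0 * 2.0;   /* stands in for a[i] * b[i] on local data */

    /* ...and the partial sums are combined by explicit communication. */
    double sum = 0.0;
    MPI_Reduce(&local, &sum, 1, MPI_DOUBLE, MPI_SUM, 0, MPI_COMM_WORLD);
    if (rank == 0) printf("dot product = %f\n", sum);

    MPI_Finalize();
    return 0;
}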
6 Future Research Challenges
As discussed throughout this chapter, although MPSoCs can be considered a consolidated strategy for the development of high-performance and low-energy products, there are still many open problems that require extra research effort. Software partitioning is one of the most important open issues. That is, taking a program developed in the way programs have been developed for years, and making it work in an MPSoC environment by means of automatic partitioning, is very difficult already for the homogeneous case, not to mention the heterogeneous one.
Matching a program to an ISA and to a certain MPSoC organization is an open and challenging problem. Another research direction should contemplate hardware development. From the results shown in Section 2 it is clear that heterogeneous SoCs, capable of exploiting parallelism at the thread and also at the instruction level, are the most adequate to obtain real performance and energy gains. Unfortunately, in an era where fabrication costs demand huge volumes, the big question to be answered regards the right amount of heterogeneity to be embedded in the MPSoC fabric.
References
1. Anantaraman, A., Seth, K., Patil, K., Rotenberg, E., Mueller, F.: Virtual simple architecture (VISA): Exceeding the complexity limit in safe real-time systems. In: Proc. 30th Annual Int. Symposium on Computer Architecture, pp. 350–361. San Diego, CA (2003)
2. Bergamaschi, R., Han, G., Buyuktosunoglu, A., Patel, H., Nair, I., Dittmann, G., Janssen, G., Dhanwada, N., Hu, Z., Bose, P., Darringer, J.: Exploring power management in multi-core systems. In: Proc. Asia and South Pacific Design Automation Conf. Seoul, Korea (2008)
3. Chen, T., Raghavan, R., Dale, J.N., Iwata, E.: Cell broadband engine architecture and its first implementation: A performance view. IBM J. Res. Dev. 51(5), 559–572 (2007)
4. Dagum, L., Menon, R.: OpenMP: An industry-standard API for shared-memory programming. IEEE Comput. Sci. Eng. 5(1), 46–55 (1998)
5. Du, J.: Inside an 80-core chip: The on-chip communication and memory bandwidth solutions (2007). URL http://blogs.intel.com/research/2007/07/inside_the_terascale_many_core.php
6. Freescale, Inc. URL http://www.freescale.com/webapp/sps/site/homepage.jsp
7. Guo, J., Papanikolaou, A., Marchal, P., Catthoor, F.: Physical design implementation of segmented buses to reduce communication energy. In: Proc. Asia and South Pacific Design Automation Conference, pp. 42–47. Yokohama, Japan (2006)
8. Guthaus, M.R., Ringenberg, J.S., Ernst, D., Austin, T.M., Mudge, T., Brown, R.B.: MiBench: A free, commercially representative embedded benchmark suite. In: Proc. IEEE Int. Workshop on Workload Characterization, pp. 3–14. Austin, TX (2001)
9. Intel: Intel® Core™ i7 processor SDK webinar: Q and A transcript from the 8:00 a.m. PST session. URL http://software.intel.com/sites/webinar/corei7-sdk/intel-core-i7-8am.doc
10. Johnson, T., Nawathe, U.: An 8-core, 64-thread, 64-bit power efficient Sparc SoC (Niagara2). In: Proc. Int. Symposium on Physical Design, p. 2. Austin, TX (2007)
11. Kongetira, P., Aingaran, K., Olukotun, K.: Niagara: A 32-way multithreaded Sparc processor. IEEE Micro 25(2), 21–29 (2005)
12. Lindholm, E., Nickolls, J., Oberman, S., Montrym, J.: NVIDIA Tesla: A unified graphics and computing architecture. IEEE Micro 28(2), 39–55 (2008)
13. Marcon, C., Borin, A., Susin, A., Carro, L., Wagner, F.: Time and energy efficient mapping of embedded applications onto NoCs. In: Proc. Asia and South Pacific Design Automation Conference (2005)
14. Pacheco, P.S.: Parallel Programming with MPI. Morgan Kaufmann Publishers, Inc. (1996)
15. Reifel, M., Chen, D.: Parallel digital signal processing: An emerging market. Application Note (1994). URL http://focus.ti.com/lit/an/spra104/spra104.pdf
16. Rowen, C., Johnson, M., Ries, P.: The MIPS R3010 floating-point coprocessor. IEEE Micro 8(3), 53–62 (1988)
17. Rutzig, M.B., Beck, A.C., Carro, L.: Dynamically adapted low power ASIPs. In: J. Becker, R. Woods, P. Athanas, F. Morgan (eds.) Reconfigurable Computing: Architectures, Tools and Applications, Lecture Notes in Computer Science, vol. 5453, pp. 110–122. Springer-Verlag, Berlin, Germany (2009)
18. Seiler, L., Carmean, D., Sprangle, E., Forsyth, T., Abrash, M., Dubey, P., Junkins, S., Lake, A., Sugerman, J., Cavin, R., Espasa, R., Grochowski, E., Juan, T., Hanrahan, P.: Larrabee: A many-core x86 architecture for visual computing. In: ACM SIGGRAPH 2008 Papers, pp. 1–15. Los Angeles, CA (2008)
19. Shi, K., Howard, D.: Challenges in sleep transistor design and implementation in low-power designs. In: Proc. 43rd Annual Conf. on Design Automation, pp. 113–116 (2006)
20. Song, Y., Kalogeropulos, S., Tirumalai, P.: Design and implementation of a compiler framework for helper threading on multi-core processors. In: Proc. 14th International Conference on Parallel Architectures and Compilation Techniques, pp. 99–109 (2005)
21. Tsarchopoulos, P.: European research in embedded systems. In: S. Vassiliadis, S. Wong, T.D. Hämäläinen (eds.) Embedded Computer Systems: Architectures, Modeling, and Simulation, Lecture Notes in Computer Science, vol. 4017, pp. 2–4. Springer-Verlag, Berlin, Germany (2006)
22. Vangal, S., Howard, J., Ruhl, G., Dighe, S., Wilson, H., Tschanz, J., Finan, D., Iyer, P., Singh, A., Jacob, T., Jain, S., Venkataraman, S., Hoskote, Y., Borkar, N.: An 80-tile 1.28TFLOPS network-on-chip in 65nm CMOS. In: Digest of Technical Papers, IEEE Int. Solid-State Circuits Conf., pp. 98–589 (2007)
23. Wall, D.W.: Limits of instruction-level parallelism. SIGPLAN Not. 26(4), 176–188 (1991)
24. Woo, D.H., Lee, H.H.: Extending Amdahl's law for energy-efficient computing in the many-core era. Computer 41(12), 24–31 (2008)
25. Yeager, K.: The MIPS R10000 superscalar microprocessor. IEEE Micro, pp. 28–40 (1996)
26. Yu, Z.: High performance and energy efficient multi-core systems for DSP applications. Dissertation, Univ. California, Davis (2007). URL http://www.ece.ucdavis.edu/vcl/pubs/theses/2007-5/
27. Zhao, G., Kwan, H.K., Lei, C.U., Wong, N.: Processor frequency assignment in three-dimensional MPSoCs under thermal constraints by polynomial programming. In: Proc. IEEE Asia Pacific Conf. Circuits Syst., pp. 1668–1671 (2008)
DSP Systems using Three-Dimensional (3D) Integration Technology
Tong Zhang, Yangyang Pan, and Yiran Li
Abstract As three-dimensional (3D) integration technology is quickly maturing and steadily entering mainstream markets, it has begun to attract exploding interest from integrated circuit and system designers. This chapter discusses and demonstrates the exciting opportunities and potential for digital signal processing (DSP) circuit and system designers to exploit 3D integration technology. In particular, this chapter advocates a 3D logic-DRAM integration paradigm and addresses the use of 3D logic-memory integration in both programmable digital signal processors and application-specific digital signal processing circuits. To further demonstrate the potential, this chapter presents case studies on applying 3D logic-DRAM integration to clustered VLIW (very long instruction word) digital signal processors and application-specific video encoders. Since DSP systems using 3D integration technology are still in their research infancy, the aim of this chapter, by presenting some first discussions and results, is to motivate greater future efforts from the DSP system research community to explore this new and rewarding research area.
1 Introduction
Three-dimensional (3D) integration has recently attracted tremendous interest from global semiconductor industries and is expected to lead to an industry paradigm shift [44].
The family of 3D integration technology includes packaging-based 3D integration such as system-in-package (SiP) and package-on-package (PoP), die-to-die and die-to-wafer 3D integration, and wafer-level back-end-of-the-line (BEOL)-compatible 3D integration. In general, all 3D integration technologies offer many potential benefits, including multi-functionality, increased performance, reduced power, small form factor, reduced packaging, flexible heterogeneous integration, and reduced overall costs. As a result, 3D integration appears to be a viable enabling technology for future integrated circuits (ICs) and low-cost multi-functional heterogeneous integrated systems. As 3D integration technology is quickly maturing and entering mainstream markets, it clearly provides a new spectrum of opportunities and challenges for circuit and system designers, and warrants significant rethinking and innovations from circuit and system design perspectives. Existing work mainly came from two research communities, i.e., electronic design automation (EDA) and computer architecture. In the context of EDA, a significant amount of research has been done in the areas of physical design (e.g., see [1, 16, 25]) and thermal analysis and management (e.g., see [13, 23, 54]). As pointed out in [5], architectural innovations will play an at least equally important role as EDA innovations in order to fully exploit the potential of 3D integration. On the architectural front, existing work has predominantly focused on high-performance general-purpose multi-core microprocessors (e.g., see [7, 19, 35, 41–43]), as 3D integration clearly provides a promising solution to address the looming memory wall problem [64] in computing systems. Due to the demands of an ever-increasing number of signal processing applications that require high throughput at low power, digital signal processing (DSP) integrated circuits, including both programmable digital signal processors and application-specific DSP circuits, have become, and will continue to be, one of the main drivers for the semiconductor industry [56]. Nevertheless, little attention has been given to exploiting the potential of 3D integration to improve various programmable and/or application-specific DSP systems. This chapter represents our first attempt to fill this missing link and, more importantly, to motivate greater future efforts from the DSP system research community to explore this new and rewarding research area. In this chapter, after briefly reviewing the basics of 3D integration technology, we first discuss the rationale and opportunities for DSP systems, both programmable digital signal processors and application-specific signal processing circuits, to exploit 3D integration technology. In particular, we advocate a 3D logic-DRAM integration paradigm for 3D DSP systems, i.e., one or multiple DRAM dies stacked with one logic die, and present an approach for 3D DRAM architecture design. Then we present two case studies on applying 3D logic-DRAM integration to clustered VLIW (very long instruction word) digital signal processors and application-specific video encoders in order to quantitatively demonstrate the promise of 3D DSP system design.
2 Overview of 3D Integration Technology
The objective of 3D integration is to enable electrical connectivity between multiple active device planes within the same package. Various 3D integration technologies are currently being pursued and can be divided into the following three categories:
1. 3D packaging technology: It is enabled by wire bonding, flip-chip bonding, and thinned die-to-die bonding [14]. As the most mature 3D integration technology, it is already being used in many commercial products, noticeably in cell phones [24, 53]. Its major limitation is a very low inter-die interconnect density (e.g., only a few hundred inter-die bonding wires) compared to the other emerging 3D integration technologies.
2. Transistor build-up 3D technology: It forms transistors inside the on-chip interconnect layer [3], on poly-silicon films [18], or on single-crystal silicon films [32], [33]. Although a drastically higher vertical interconnect density can be realized, it is not readily compatible with existing fabrication processes and is subject to severe process temperature constraints that tend to dramatically degrade the circuit's electrical performance.
3. Monolithic, wafer-level, back-end-of-the-line (BEOL) compatible 3D technology: It is enabled by wafer alignment, bonding, thinning, and inter-wafer interconnections [45]. Realized by through-silicon vias (TSVs), inter-die interconnects can have very high density. The TSVs can be formed before/during bonding (via-first) or after bonding (via-last) [44].
Among these three categories, wafer-level BEOL-compatible 3D integration appears to be the most promising for high-volume production and is quickly becoming mature thanks to tremendous recent research efforts from both academia and industry [8, 12, 29, 45, 47, 48, 51, 52]. For example, SEMATECH [55], the biggest semiconductor technology research and development consortium, launched its 3D program in early 2005. It has already generated a comprehensive cost model and a draft of a 3D roadmap for the International Technology Roadmap for Semiconductors (ITRS) [31], and begun tool and process benchmarking. From the IC design perspective, vertical interconnects enabled by 3D integration technology, particularly wafer-level BEOL-compatible 3D integration, provide several distinct advantages in a straightforward manner, including
• Integration of disparate technologies: Fabrication processing technologies specific to functions such as DRAM and RF circuits are incompatible with those of high-performance logic CMOS devices. Converging these into a single 2D chip inevitably results in performance compromises (e.g., the density loss of embedded DRAM [46]). Obviously, 3D integration technology provides a natural vehicle to enable independent optimal fabrication and integration of these functions.
• Massive bandwidth: 3D integration can provide a massive vertical inter-die interconnect bandwidth. This can enable a very high degree of operational parallelism for improving the system performance and/or reducing power consumption.
Meanwhile, this makes it possible to carry out a significant amount of cross-die co-design and co-optimization.
• Wire-length reduction: Expanding from the 2D to the 3D domain, we may possibly replace a lateral wire of tens or hundreds of microns with a 2∼3 μm tall vertical via. Since interconnects play an important role in determining overall system speed and power consumption [15, 26], such wire-length reduction can significantly improve IC system performance.
3 3D Logic-Memory Integration Paradigm for DSP Systems: Rationale and Opportunities
This section advocates a 3D logic-memory integration paradigm, in particular the integration of a single logic die and multiple memory dies, for 3D integrated DSP systems. The motivation is two-fold:
1. Its great practical feasibility: In spite of the appealing advantages of 3D integration described above, 3D integration tends to have two apparent potential drawbacks: thermal effects and yield loss. Dies stacked in the middle are subject to higher heat resistance and hence higher temperature, which can greatly degrade circuit performance and reliability, particularly when such dies have very high activity and dissipate a large amount of heat. Although prior work from the EDA research community has developed techniques such as thermal via placement [17, 23] to mitigate this thermal issue to a certain degree, 3D stacked dies are fundamentally subject to the thermal problem, particularly dies with high activity and hence high power consumption. Since 3D integration, especially wafer-level BEOL-compatible 3D integration, cannot use known good dies (KGD), it is inherently subject to possible yield loss, which may result in an overall cost increase. Moreover, the 3D integration process itself may introduce some defects, leading to more yield loss. Compared with 3D logic-logic integration, 3D logic-memory integration could be much less vulnerable to these two issues and hence has much greater practical feasibility, at least in the foreseeable future. Compared with logic dies, digital memories tend to consume much less energy and hence dissipate much less heat. Meanwhile, because of the homogeneous functionality and regular architecture of digital memories, simple yet effective testing and fault tolerance techniques can be used to greatly improve yield.
2. Its great potential to improve performance: Given its memory-intensive nature, signal processing can greatly benefit from 3D memory stacking. Most signal processing functions can leverage the large memory storage capacity and massive logic-memory interconnect bandwidth to straightforwardly improve the overall signal processing system performance. Moreover, due to the large algorithm and architecture design flexibility inherent in most signal processing functions, such a 3D logic-memory integration paradigm directly enables a large space in which to pursue optimal signal processing algorithm/architecture design solutions and to explore a wide spectrum of design trade-offs. This will be further demonstrated by the two case studies presented later.
Under a 3D logic-memory integration framework, we further discuss the opportunities and research directions that could potentially lead to effective exploitation of 3D logic-memory integration in both programmable digital signal processors and application-specific signal processing ICs. Programmable digital signal processors are typically characterized by specialized hardware and instructions geared to efficiently executing common mathematical computations used in signal processing [38]. Being capable of exploiting the instruction-level parallelism (ILP) abundant in many signal processing algorithms, the VLIW (very long instruction word) architecture is widely used to build modern programmable digital signal processors, although VLIW architectures tend to require more program memory. Meanwhile, enabled by continuous technology scaling, programmable digital signal processors are quickly adopting multi-core architectures in order to better handle various high-end communication and multimedia applications. As a result, similar to the memory wall problem in general-purpose processors, programmable digital signal processors also experience an increasingly significant gap between on-chip processing speed and off-chip memory access speed/bandwidth. Hence, on-chip cache memory is also being used in programmable digital signal processors. However, because programmable digital signal processors are typically used in embedded systems with real-time constraints, and the direct use of caches may introduce uncertainty in program execution time, the design and use of on-chip cache memory in programmable digital signal processors can be very sophisticated and has been well studied (e.g., see [2, 20–22, 49, 63]). It is very intuitive that 3D logic-memory integration provides many opportunities to greatly improve programmable digital signal processor performance, which include but certainly are not limited to:
• The 3D stacked memory can serve the entire cache memory hierarchy with much reduced memory access latency. This can directly reduce the memory system design complexity and the energy consumption induced by data movement across the cache memory hierarchy.
• With the high-capacity memory storage, we could possibly design a cache memory system with much improved program execution time predictability. One possible solution is to use coarse-grained caching of the program code, e.g., caching an entire sub-routine instead of a fine-grained cache line as in current design practice, which can largely reduce run-time cache misses and hence improve execution time predictability by leveraging the large cache capacity enabled by 3D memory stacking.
• With sufficient 3D stacked memory storage capacity, signal processing data-flow diagram synthesis and scheduling may potentially be carried out in a much more efficient way. Modern data-flow diagram synthesis and scheduling techniques are typically memory constrained and inevitably involve certain design trade-offs. 3D logic-memory integration may provide a unique opportunity to develop much better signal processing data-flow diagram synthesis and scheduling solutions.
Fig. 1 Joint design methodology for 3D logic-memory integrated application-specific DSP ICs.
In sharp contrast to programmable processors, application-specific signal processing IC design typically involves very close interaction between specific signal processing algorithm design and VLSI architecture design/optimization, which enables a much greater design flexibility and trade-off spectrum. Under the 3D logic-memory integration framework, the high memory storage capacity with massive logic-memory interconnect bandwidth naturally leads to much more space for exploring signal processing algorithm and logic architecture design. Meanwhile, leveraging the massive logic-memory interconnect bandwidth, we can customize the memory architecture and storage organization geared to the specific signal processing algorithm, leading to further potential for improving the overall system performance. Fig. 1 illustrates the close coupling among signal processing algorithm, logic architecture, and 3D memory architecture design inherent in 3D logic-memory integrated application-specific DSP IC design. Although specific materializations of the above joint design will certainly vary for different DSP algorithms, we believe that the following intuitive general guidelines can help designers explore system design optimization opportunities:
• We should always fully analyze the memory access characteristics of the specific DSP algorithm, and try to customize the 3D memory architecture and logic-memory interconnect accordingly. This may help to greatly reduce memory access energy consumption and/or maximize the overall signal processing system speed and parallelism.
• It is highly desirable to explore the potential of trading memory storage capacity for improved signal processing performance and/or computational efficiency. Some signal processing algorithms, which may be considered infeasible due to their demand for large memory, could become the most efficient options under the 3D logic-memory integration framework.
• Besides straightforward data storage, the 3D stacked memory could also be used to realize functions like table look-up or content-addressable storage. For many signal processing algorithms, particularly those involving a significant amount of search operations, embedding table look-up or content-addressable storage functions in the 3D memory could yield noticeable overall system performance gains; a toy sketch of this trade-off follows this list.
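As a simple illustration of the last two guidelines, trading abundant 3D memory capacity for computation via table look-up, the C sketch below replaces a per-sample transcendental evaluation with a precomputed sine table. The table size and fixed-point phase format are arbitrary choices for this example (link with -lm); the point is that a large 3D-stacked memory makes much bigger, finer-grained tables affordable.

#include <stdio.h>
#include <math.h>

#define LUT_BITS 12
#define LUT_SIZE (1 << LUT_BITS)   /* 4096-entry sine table */

static double sine_lut[LUT_SIZE];

/* Fill the table once; with 3D-stacked memory, far larger tables
 * (hence finer phase resolution) become affordable. */
static void lut_init(void)
{
    for (int i = 0; i < LUT_SIZE; i++)
        sine_lut[i] = sin(2.0 * M_PI * i / LUT_SIZE);
}

/* One memory access replaces one transcendental evaluation per sample;
 * phase is a 32-bit fixed-point fraction of a full period. */
static double lut_sin(unsigned phase)
{
    return sine_lut[phase >> (32 - LUT_BITS)]; /* top bits index the table */
}

int main(void)
{
    lut_init();
    printf("sin(pi/2) ~ %f\n", lut_sin(0x40000000u)); /* quarter period */
    return 0;
}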
To further demonstrate the design of 3D logic-memory integrated DSP systems, the remainder of this chapter presents two case studies, in the context of a programmable digital signal processor and an application-specific DSP IC, respectively. As the mainstream high-capacity memory technology, DRAM appears to be the favorable option for 3D logic-memory integration. Hence, in the following, we first present a practical 3D DRAM design strategy that leverages 3D integration to improve DRAM design itself; then we present the two case studies, including a 3D integrated VLIW digital signal processor and a 3D integrated H.264 video encoder.
4 3D DRAM Design
In current design practice, high-capacity DRAMs have a hierarchical architecture that typically consists of banks, sub-banks, and sub-arrays. With its own address and data bus, each DRAM bank can be accessed independently from the other banks. Each bank is divided into sub-banks, and data are read/written from/to one sub-bank during each memory access to one bank. Each sub-bank is further divided into sub-arrays, and each sub-array contains an indivisible array of DRAM cells surrounded by supporting peripheral circuits such as word-line decoders/drivers, sense amplifiers (SAs), and output drivers. Reference [30] gives detailed discussions of modern DRAM circuit and architecture design. DRAM cells, peripheral circuits, and interconnect, including the H-trees outside/inside banks and sub-banks and the associated buffers, together determine the overall DRAM performance, including silicon area, access latency, and energy consumption. The pitches of word-lines and bit-lines within each DRAM sub-array are in the range of ∼100 nm and below, while the pitch of TSVs for 3D integration is at least a few μm or tens of μm in order to allow low-cost fabrication of TSVs with high manufacturability. To accommodate this significant pitch-size mismatch, this section presents a practical 3D DRAM architecture design using a coarse-grained inter-sub-array 3D partitioning. The key is that only the address and/or data I/O wires of each memory sub-array associate with TSVs, and the entire memory sub-array, including the cell array and peripheral circuits, remains exactly the same as in a 2D design. The top view of a 3D memory with inter-sub-array 3D partitioning looks exactly the same as its 2D counterpart, except that each leaf is a 3D sub-array set. Let n denote the number of memory dies being integrated. Each 3D sub-array set contains n individual 2D sub-arrays, and each 2D sub-array contains the memory cells and its own complete peripheral circuits such as decoders, drivers, and SAs. Within each 3D sub-array set, all the 2D sub-arrays share only address and data input/output through TSVs. The global address/data routing outside and inside each bank is simply distributed across all the n layers through TSVs, as illustrated in Fig. 2. Let Ndata and Nadd denote the data access and address input bandwidth of each 3D sub-array set, and recall that n denotes the number of memory layers. Without loss of generality, we assume that Ndata and Nadd are divisible by n. As illustrated in Fig. 2, the Nadd-bit address bus and the Ndata-bit data bus are uniformly distributed across all the n layers and are shared by all the n 2D sub-arrays within the same 3D sub-array set through a TSV bundle.
Fig. 2 Illustration of a 3D sub-array set with 4 layers.
Compared with direct commodity DRAM die stacking, in the design strategy presented above the DRAM global address/data routing is shared among the DRAM dies to reduce the routing area overhead in each die. Since global routing tends to play an important role in determining overall DRAM performance, this approach can achieve a considerable performance gain. For the purpose of evaluation, we modified CACTI 5, a recent version of the widely used memory modeling tool CACTI [9], to support the above inter-sub-array 3D partitioning strategy. Using a 1 Gb DRAM as an example, we estimate the area, latency, and energy of such a 3D DRAM at the 65 nm technology node. For comparison, we further consider a conventional 2D DRAM and 3D die packaging of separate DRAM dies without the use of TSVs (referred to as 3D die packaging). Table 1 shows the estimated results, which clearly demonstrate the noticeable performance gain of inter-sub-array 3D partitioning for 3D DRAM architecture design.

Table 1 Estimated results for an example 1 Gb DRAM using different design approaches at the 65 nm technology node, based on ITRS projections.

                    |         | 3D die packaging        | Proposed 3D DRAM
                    | 2D DRAM | 2-layer 4-layer 8-layer | 2-layer 4-layer 8-layer
Footprint (mm²)     |   92    |   69      44      33    |   41      19      10
Access Latency (ns) |   26    |   23      19      18    |   20      16      14
Access Energy (nJ)  |  2.20   |  1.75    1.50    1.40   |  1.50    1.40    1.35
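To make the partitioning arithmetic explicit, the short C sketch below computes the per-layer share of the address and data buses for one 3D sub-array set, as in Fig. 2. The bus widths used are example values chosen for illustration, not parameters of the evaluated 1 Gb DRAM.

#include <stdio.h>

/* Per-layer TSV allocation for one 3D sub-array set: the Nadd-bit
 * address bus and the Ndata-bit data bus are spread evenly over the
 * n layers (Nadd and Ndata assumed divisible by n, as in the text). */
static void tsv_shares(int n, int n_add, int n_data)
{
    printf("%d layers: %d address + %d data TSV signals per layer "
           "(%d bus signals per 3D sub-array set in total)\n",
           n, n_add / n, n_data / n, n_add + n_data);
}

int main(void)
{
    tsv_shares(4, 16, 64); /* e.g., 4 layers, 16-bit address, 64-bit data */
    return 0;
}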
5 Case Study I: 3D Integrated VLIW Digital Signal Processor

The VLIW architecture is conventionally characterized by the fact that each instruction specifies several independent operations. In contrast, RISC (reduced instruction set computer) instructions typically specify one operation, and CISC (complex instruction set computer) instructions typically specify several dependent operations. VLIW processors typically contain several identical processing clusters, where each cluster consists of a collection of execution units, such as integer ALUs, floating-point ALUs, load/store units, and branch units. All the clusters can execute the several operations of one instruction at the same time. Because VLIW processors can readily exploit instruction-level parallelism (ILP) and most signal processing applications have abundant ILP, most existing programmable digital signal processors use VLIW architectures. VLIW processors rely on a compiler to explicitly exploit parallelism, unlike RISC and CISC processors, which rely on hardware to discover parallelism on-the-fly. The compiler schedules the operations based on expected architectural latencies, including the instruction execution time and the memory access time. Because of the data intensive nature of most signal processing applications, VLIW digital signal processors demand careful design and optimization of the cache and memory hierarchy, which has been extensively studied (e.g., see [2, 20-22, 63]). Integrating a VLIW digital signal processor with high-capacity DRAM through 3D integration technology can therefore be expected to further improve memory hierarchy performance and hence overall system performance.
5.1 3D DRAM Stacking in VLIW Digital Signal Processors

In current design practice, a VLIW digital signal processor typically has a shared on-chip L2 cache and connects to an off-chip DRAM, where each L2 cache miss may incur a significant penalty due to the long latency of off-chip data access. We have two obvious options for exploring the design of VLIW digital signal processors with 3D DRAM stacking:
1. The most straightforward design option is to directly stack a VLIW digital signal processor die with 3D DRAM, as illustrated in Fig. 3(a). This can be considered as moving the off-chip DRAM into the processor chip package. Such a design option directly reduces the on-chip L2 cache miss penalty, thanks to the much lower processor-DRAM access latency enabled by 3D integration. Moreover, since 3D integration provides massive inter-die interconnect bandwidth through TSVs, the cache-memory communication bandwidth can be greatly increased, which further improves overall system performance.
2. Beyond this straightforward design option, we further study the potential of migrating the shared on-chip L2 cache into the 3D DRAM domain, as illustrated in Fig. 3(b).
Fig. 3 Illustration of (a) straightforward 3D DRAM stacking, and (b) migrating the on-chip L2 cache into the 3D DRAM domain.
This makes it possible to put more clusters on the VLIW digital signal processor die, which can directly increase the system parallelism and potentially improve the overall computing system performance without increasing the chip footprint. In this context, the 3D DRAM has a heterogeneous structure and covers two levels of the memory hierarchy. Because the L2 cache demands a very short access latency, the 3D DRAM L2 cache should be specifically customized in order to achieve an access latency comparable to that of its on-chip SRAM counterpart.
5.2 3D DRAM L2 Cache

The 3D VLIW architecture configuration shown in Fig. 3(b) migrates the on-chip L2 cache into the 3D DRAM domain. Since L2 cache access latency plays a critical role in determining the overall computing system performance, one may intuitively argue that, compared with an on-chip SRAM L2 cache, a 3D DRAM L2 cache suffers from much longer access latency and hence results in significant performance degradation. In this section, we show that this intuitive argument does not necessarily hold. In particular, as we increase the L2 cache capacity and the number of DRAM dies, the 3D DRAM L2 cache may achieve a latency comparable to, or even shorter than, that of an SRAM L2 cache.
Commodity DRAM is typically much slower than SRAM mainly because, as a commodity product, DRAM has always been optimized for density and cost rather than speed. DRAM speed can be greatly improved, at the expense of density and hence cost, by two approaches:
1. We can reduce the size of each individual DRAM sub-array to reduce the memory access latency at the penalty of storage density. With shorter word-lines and bit-lines, a smaller DRAM sub-array directly reduces the access latency because the load on the peripheral circuits is reduced.
2. We can adopt the multiple threshold voltage (multi-Vth) technique that is widely used in logic circuit design [27], i.e., we still use high-Vth transistors in the DRAM cells to maintain a sufficiently low DRAM cell leakage current, while using low-Vth transistors in the peripheral circuits and H-tree buffers to reduce the latency.
Fig. 4 Comparison of access latency of 2D SRAM, 2D single-Vth DRAM, and 2D multi-Vth DRAM for capacities of 512KB, 1MB, 2MB, and 4MB at the 65nm node.
Such a multi-Vth design is not typically used in commodity DRAM since it increases the leakage power consumption of the peripheral circuits and, more importantly, complicates the DRAM fabrication process and hence incurs a higher cost. Moreover, as we increase the L2 cache capacity, the global routing plays a bigger role in determining the overall L2 cache access latency. By using the 3D DRAM design strategy presented above, we can directly reduce the latency incurred by global routing, which further reduces the 3D DRAM L2 cache access latency relative to a 2D SRAM L2 cache.
To evaluate these arguments, Fig. 4 compares the access latency of 2D SRAM, single-Vth 2D DRAM, and multi-Vth 2D DRAM under different L2 cache capacities at the 65nm node. The results show that, as the L2 cache capacity increases, the multi-Vth DRAM L2 cache surpasses its SRAM counterpart in terms of access latency, even without using 3D integration to further reduce the global routing delay. This essentially agrees with the conclusions drawn from a recent work on embedded SOI DRAM design [4], which shows that embedded SOI DRAM can achieve a shorter access latency than its SRAM counterpart at a capacity as small as 256kB. Fig. 5 further shows that, when we use the 3D DRAM design strategy presented in Section 4 to implement a 2MB L2 cache, the access latency advantage of multi-Vth 3D DRAM over 2D SRAM improves further as we increase the number of DRAM dies. This is mainly because, as we stack more DRAM dies, the footprint, and hence the latency incurred by global routing, is accordingly reduced.
Fig. 5 Comparison of 2MB L2 cache access latency of single-Vth 3D DRAM and multi-Vth 3D DRAM against 2D SRAM (1.99 ns) as we increase the number of DRAM dies (1, 2, 4, and 8 layers).
5.3 Performance Evaluation

To evaluate the two 3D VLIW digital signal processor architecture design options illustrated in Fig. 3(a) and Fig. 3(b), we use the Trimaran simulator [10] to carry out simulations over a wide range of signal processing benchmarks. Trimaran is an integrated compilation and performance monitoring infrastructure that supports the HPL-PD (PlayDoh) architecture [34]. We further integrate the memory subsystems of M5 [6] to strengthen the memory system simulation capability of Trimaran. For the 3D VLIW processor and DRAM integration, we assume that the VLIW processor and the 3D DRAM are designed in 90-nm and 65-nm technologies, respectively. For the VLIW processor, each cluster contains 4 ALUs, a 2kB multi-ported register file, a private 4kB instruction cache and a 4kB data cache. All the clusters share one 256kB L2 cache. We estimate the cluster silicon area based on [36], which reports a silicon implementation of a VLIW digital signal processor at the 0.13-μm technology node with an architecture similar to the one considered in this work. In [36], one cluster, consisting of 6 ALUs and a 16kB single-ported register file, occupies about 3.2 mm2, and the entire VLIW digital signal processor contains 16 clusters and occupies 155 mm2 (note that besides the 16 clusters, it contains two MIPS CPUs and a few other peripheral circuits). Accordingly, we estimate that one cluster with 4 ALUs and a 2kB multi-ported register file occupies about 2.77 mm2 at the 90-nm technology node, the 256kB L2 cache occupies 5.69 mm2, and the entire VLIW processor die is 58.5 mm2. Therefore, when we move the 256kB L2 cache into the 3D DRAM domain, we can re-allocate its silicon area to two additional clusters. With six stacked DRAM dies, this 58.5 mm2 die size can accommodate a total storage capacity of 1GB with a 128-bit output width at the 65-nm node. Table 2 lists all the basic configuration parameters used in the simulations. We use the conventional system architecture with off-chip DRAM as the baseline configuration, with 2 clusters and a 256KB L2 cache.
Table 2 Configuration parameters used in Trimaran [10].

Frequency                                200 MHz
Number of DRAM dies                      6
Die size                                 58.5 mm2
Parameters of one cluster                ALU number: 4; register file: 2kB multi-ported;
                                         area: 2.77 mm2; technology: 90-nm
2D SRAM instruction cache (per cluster)  4kB, 2 way, 16 byte blocks; access latency: 1.26 ns
2D SRAM data cache (per cluster)         4kB, 2 way, 16 byte blocks; access latency: 1.26 ns
Shared 2D SRAM L2 cache                  256kB, 2 way, 32 byte blocks; access latency: 2.43 ns;
                                         area: 5.69 mm2; technology: 90-nm
Shared 3D DRAM L2 cache                  512kB, 2 way, 32 byte blocks; access latency: 4.12 ns
Shared main memory                       1GB, 4kB page size; access latency: 22.5 ns; technology: 65-nm
The latency of off-chip main memory access is set to 500 cycles. For the first 3D VLIW digital signal processor design option, shown in Fig. 3(a), the 1GB main memory is directly stacked with the processor die. For the second design option, shown in Fig. 3(b), the 256KB L2 cache is migrated into the 3D DRAM and two more clusters are placed on the processor die without increasing the die area. Using the Trimaran simulator and a variety of media benchmarks [39], we evaluate the effectiveness of these two 3D design options in terms of IPC (instructions per cycle). As shown in Fig. 6, the results clearly demonstrate the potential performance advantages of the 3D VLIW digital signal processor design options. The first design option, with direct 3D DRAM stacking, achieves a 10%∼80% IPC improvement over the baseline scenario. Compared with direct 3D DRAM stacking, the second design option improves IPC by up to a further 10% by migrating the L2 cache into the 3D domain and increasing the computational parallelism.
6 Case Study II: 3D Integrated H.264 Video Encoder

This section presents a case study on exploiting 3D memory stacking to improve the design efficiency of an application-specific video encoder. The key and most resource demanding operation in video encoding is motion estimation, which has been widely studied over the past two decades; many motion estimation algorithms have been developed, covering a wide spectrum of trade-offs between motion estimation accuracy and computational complexity. Most existing algorithms perform block-based motion estimation, where the basic operations are calculating and comparing the matching functions between the current image macroblock (MB) and all the candidate MBs inside a confined area in the reference image frame(s). Hence, video encoding demands the simultaneous storage of several or many image frames in digital memory.
Fig. 6 Normalized IPC improvement of the two 3D VLIW digital signal processor design options (option 1: 3D DRAM main memory with 2D SRAM L2 cache; option 2: 3D DRAM main memory with 3D DRAM L2 cache) over the baseline (2D DRAM main memory with 2D SRAM L2 cache), for the benchmarks cjpeg, djpeg, epic, unepic, gsmencode, mpeg2dec, rawcaudio, and rawdaudio.
In current design practice, due to limited on-chip SRAM storage resources, all the current and reference image frames are stored in an off-chip high-capacity memory, most likely a DRAM, and on-chip SRAMs are used as buffers to streamline video encoding. Nevertheless, the on-chip SRAM buffers still tend to occupy more than half of the silicon area of the entire encoder [28], and most of this area serves motion estimation. As video resolution and frame rate continue to increase and multi-frame-based motion estimation becomes widely used, memory storage and access will remain the most critical issue in video encoder IC design. 3D memory stacking is a natural candidate for addressing this issue and holds great promise for substantially improving overall video encoder performance in terms of silicon area and energy efficiency. In this section, we quantitatively evaluate this potential using H.264/AVC encoder design as a test vehicle.
6.1 H.264 Video Encoding Basics

As the latest video coding standard of the ITU and the ISO Moving Picture Experts Group (MPEG), H.264/AVC [62] has been widely adopted in real-life applications due to its excellent video compression efficiency. Fig. 7 illustrates the typical H.264/AVC encoder structure. Each frame is processed in units of MBs of size 16 × 16, each of which is encoded in either intra or inter mode. In the intra mode, the predicted MB is formed from pixels in the current frame that have already been encoded, decoded and reconstructed. In the inter mode, the predicted MB is formed by motion-compensated prediction from one or more reference frames.
Fig. 7 H.264/AVC encoder structure.
The prediction is subtracted from the current MB to produce a residual or difference MB, which is transformed using a DCT on every 4 × 4 block and quantized to generate a sequence of coefficients. These coefficients are then reordered and entropy encoded, together with the side information required to decode the MB (prediction mode, quantization step size, motion vector, etc.), to form the compressed bitstream. Meanwhile, the quantized MB coefficients are decoded and reconstructed for use in the encoding of subsequent MBs. Due to the nature of block transformation, blocking artifacts are inevitably introduced, and de-blocking filtering is used to reduce them. To achieve a higher compression efficiency than previous video coding standards, H.264/AVC adopts several more aggressive techniques, including intra prediction, 1/4-pixel motion estimation with multiple reference frames (MRF), variable block sizes (VBS), context-based adaptive variable length coding, and rate-distortion optimized mode decision. In exchange for the higher video compression efficiency, these techniques inevitably incur a higher implementation complexity. In an H.264 encoder, almost every component has to access memory, and motion estimation has the most stringent memory resource demand in terms of memory storage and memory access bandwidth [61]. As the key operation in motion estimation, the SAD (sum of absolute differences) computation is given by

$$\mathrm{SAD} = \sum_{i=1}^{N}\sum_{j=1}^{N} \left| C(i,j) - R_{(m,n)}(i,j) \right|, \qquad (1)$$
where C and R(m,n) denote the luminance intensities of the current and reference MBs of size N × N pixels, and (m,n) represents the motion vector relative to the current MB. Clearly, SAD computation involves two tasks: image data retrieval and SAD arithmetic computation. Image data retrieval involves fetching the current MBs and the candidate MBs. Given a motion vector and the location of the current MB, a candidate MB is formed within the corresponding search region in the reference frame. Different motion estimation algorithms have different rules for screening
the spectrum of motion vectors to find an optimal motion vector. Hence, implementations of different motion estimation algorithms differ mainly in the sequence and number of candidate MBs that must be fetched from the reference frame(s), and accordingly in the number of SAD computations.
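As a concrete reference point for the discussion above, a minimal C implementation of the SAD matching function of Eq. (1) might look as follows; the row-major 8-bit luminance frame layout and the function name are our own illustrative choices, not from the chapter.

```c
#include <stdlib.h>

/* Sum of absolute differences, Eq. (1): compares the current MB at
 * pixel position (cx, cy) with the candidate MB displaced by the
 * motion vector (mx, my).  Frames are row-major 8-bit luminance
 * arrays of the given width; the caller must keep both MBs inside
 * their frames. */
unsigned sad(const unsigned char *cur, const unsigned char *ref,
             int width, int N, int cx, int cy, int mx, int my)
{
    unsigned acc = 0;
    for (int i = 0; i < N; i++)
        for (int j = 0; j < N; j++) {
            int c = cur[(cy + i) * width + (cx + j)];
            int r = ref[(cy + my + i) * width + (cx + mx + j)];
            acc += (unsigned)abs(c - r);
        }
    return acc;
}
```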
6.2 3D DRAM Stacking in H.264 Video Encoder

Due to its memory intensive nature, a video encoder can very naturally benefit from 3D DRAM stacking. To exploit 3D memory stacking in video encoder design, the most straightforward option is to directly stack the existing video encoder and commodity DRAM die(s) without any modification or optimization. In this case, the system performance gain comes only from using TSVs to replace chip-to-chip links, which directly lowers the DRAM access energy consumption and latency. Much higher performance gains can be expected if we further optimize the video encoder and DRAM design to take full advantage of the 3D integration platform. This case study investigates the potential of customizing the 3D stacked DRAM architecture for the most resource demanding operation, motion estimation. The objective is to eliminate the on-chip SRAM buffer used in conventional designs and to maximize the 3D DRAM access efficiency.

6.2.1 Architecting DRAM for Image Storage

It is feasible, and certainly preferable, to store both the current and reference frames in the 3D stacked DRAM. This section discusses how to architect the 3D stacked DRAM so that the current MB and all the candidate MBs can be fetched very efficiently, at a small energy and latency cost. It is well known that, during DRAM data access, switching from one word-line to another incurs additional access energy consumption and latency overhead. Therefore, it is desirable to make as many consecutive reads or writes as possible within the same word-line before switching to another word-line. However, if an image is mapped linearly row-by-row (or column-by-column) to the physical memory address space, as illustrated in Fig. 8, a significant amount of word-line switching will occur, leading to higher energy and latency overhead.
Clearly, in order to reduce the overhead induced by DRAM word-line switching, we should use an MB-by-MB mapping instead of a linear row-by-row or column-by-column mapping, i.e., the luminance intensity data of one or more MBs should reside on one word-line.
Fig. 8 Illustration of a linear row-by-row translation from image space to physical memory address space.

Since a search region is spatially aligned with the current MB at its center, it spans several MBs in the reference frames. Let each MB be N × N, and let SW and SH denote the width and height of the search region. Then each search region spans SBW × SBH MBs in the reference frame, where

$$SB_W = 2\left\lceil \frac{S_W - N}{2N} \right\rceil + 1, \qquad SB_H = 2\left\lceil \frac{S_H - N}{2N} \right\rceil + 1. \qquad (2)$$
Notice that ⌈·⌉ denotes the ceiling operation, which returns the smallest integer greater than or equal to its operand. Recall that the objective here is to eliminate the on-chip SRAM buffer by directly forming and delivering the candidate MB from the 3D stacked DRAM. In this work, we deliver the candidate MB row-by-row to the SAD computation engine on the logic die, i.e., in each clock cycle the 3D stacked DRAM delivers one row (the luminance intensity data of N pixels within the candidate MB) to the logic die. Assuming that the luminance intensity data of each pixel use D bits, we need a total of N · D TSVs to deliver the data from the 3D stacked DRAM to the logic die for motion estimation.
To support this row-by-row candidate MB fetching, we can leverage the multi-bank architecture of modern DRAM, i.e., DRAM typically consists of multiple banks that can be accessed independently. Since each candidate MB can span at most 2 × 2 MBs within the search region, we store each image frame in 2 banks that alternately store all the MBs row-by-row. For example, assuming that each MB is 16 × 16 and the search region is 32 × 32, we have SBW = SBH = 3, as illustrated in Fig. 9.
Under the above multi-bank MB-oriented storage organization, the key design issue is, given the motion vector, how to most efficiently extract the candidate MB from the search region. As pointed out above, we fetch the candidate MB row-by-row, and each row of one candidate MB spans at most two MBs in the reference frame. Therefore, assuming that each bank can output one row located in one MB, we can use barrel shifters to combine these data to form a row of the candidate MB, as illustrated in Fig. 10. Recall that each MB is N × N and the luminance intensity of each pixel uses D bits. Hence, each bank has a data I/O bandwidth of N · D bits, and internally each bank selects N · D bits from one word-line.
Fig. 9 Illustration of frame memory organization when SBW = 3, where the numbers within the 9 MBs indicate the index of the corresponding 2 DRAM banks.

We note that, since each word-line
can contain several MBs and each MB contains a few thousand bits (e.g., with N = 16 and D = 8, each MB contains 2048 bits), each word-line within one bank should be further partitioned into several segments in order to maintain a reasonable access speed. Therefore, as in current DRAM design practice, each bank should contain a certain number of sub-arrays, where each sub-array contains an indivisible array of DRAM cells surrounded by supporting peripheral circuits such as word-line decoders/drivers, sense amplifiers (SAs), and output drivers. To further improve the flexibility of the DRAM architecture design, we should also maintain the bank→sub-bank→sub-array hierarchy of current DRAM design practice, i.e., each bank contains several sub-banks, and at any one time only one sub-bank carries out the write or read operation; each sub-bank contains a certain number of sub-arrays that collectively carry out the write or read operation.
Fig. 10 Illustration of combining the data from two banks to form a row of the candidate MB, where each row has N pixels and the luminance intensity of each pixel uses D bits.
In this work, we propose a simple bit-plane sub-bank structure to arrange the storage of MB data in the sub-arrays: since the luminance intensity of each pixel uses D bits, we partition each sub-bank into D sub-arrays, A1, A2, ..., AD, where each sub-array Ai stores the i-th bit of the luminance intensity of each pixel. Hence, the data I/O bandwidth of each sub-array (i.e., the column width of each sub-array) is N bits, and all the D sub-arrays in each sub-bank together handle the write or read of one row of one MB at a time. This bit-plane sub-bank structure has the following two major advantages:
1. It simplifies the writing and reading of one row of the reference MBs. Within each sub-array, we can group the N bits on the same row of one MB as one column. Therefore, all the sub-arrays can readily share the same row and column address to collectively write or read one row of each MB without requiring any extra multiplexers or barrel shifters inside each sub-bank. Moreover, we can vertically align the sub-arrays within the same sub-bank in the 3D DRAM stack, which further reduces the address and data routing overhead.
2. It readily supports a graceful run-time performance vs. energy trade-off by adjusting the precision of the luminance intensity data participating in motion estimation. If we want to reduce the energy consumption with minimal motion estimation performance degradation, we can directly disable the reads from one or more sub-arrays that store the lower bits of the luminance intensity data.

6.2.2 Motion Estimation Memory Access

With the above DRAM architecture design strategy, the motion estimation engine on the logic die can access the 3D DRAM to directly fetch the current MB and the candidate MBs through a simple interface. Assume that the video encoder should support multi-frame motion estimation with up to m reference frames. In order to seamlessly support multi-frame motion estimation while maintaining the same video encoding throughput, we store all m reference frames separately; each reference frame is stored in 2 banks. The motion estimation engine can access all m reference frames simultaneously, i.e., it contains m parallel SAD computation units, each of which carries out motion estimation based on one reference frame.
We denote the MB at the top-left corner of each frame with the 2D position index (0, 0). Assuming that each frame contains FW × FH MBs, the MB at the bottom-right corner of each frame has the 2D position index (FW − 1, FH − 1). Assuming that each word-line in one bank stores s MBs, we store all the MBs row-by-row. Hence, given the MB index (x, y), we first identify its bank index as x%2, where % is the modulo operator that returns the remainder of a division. We then derive the corresponding DRAM row address as

$$A_{row} = \left\lfloor \frac{y \cdot F_W + x}{s} \right\rfloor, \qquad (3)$$

where ⌊·⌋ denotes the floor operation, which returns the largest integer less than or equal to its operand. Since each address location in one bank corresponds to one
row of one MB and each MB has N rows, each MB spans N consecutive column addresses. The first column address is obtained as

$$A_{col} = ((y \cdot F_W + x)\,\%\,s) \cdot N, \qquad (4)$$

and the N consecutive column addresses are {Acol, Acol + 1, ..., Acol + N − 1}. To access one reference frame, the corresponding computation unit only needs to provide two sets of parameters: the 2D position index of the current MB, (xMB, yMB), and the motion vector, (xMV, yMV), which together determine the location of the candidate MB in the reference frame. Given these inputs, the DRAM delivers the corresponding candidate MB row-by-row within N cycles. Owing to the specific DRAM architecture described above, each 2-bank DRAM frame storage can derive its own internal row and column addresses and the corresponding configuration parameters for the barrel shifters, as described in the following.
As pointed out earlier, each candidate MB may span at most 4 adjacent MBs in the reference frame, as illustrated in Fig. 11. Given the 2D position index of the current MB (xMB, yMB) and the motion vector (xMV, yMV), we can directly derive the 2D position indexes of these up to 4 adjacent MBs as

$$\left(x_{MB} + \mathrm{sign}(x_{MV})\left\lceil \tfrac{|x_{MV}|}{N}\right\rceil,\; y_{MB} + \mathrm{sign}(y_{MV})\left\lceil \tfrac{|y_{MV}|}{N}\right\rceil\right), \qquad (5)$$
$$\left(x_{MB} + \mathrm{sign}(x_{MV})\left\lceil \tfrac{|x_{MV}|}{N}\right\rceil,\; y_{MB} + \mathrm{sign}(y_{MV})\left\lfloor \tfrac{|y_{MV}|}{N}\right\rfloor\right), \qquad (6)$$
$$\left(x_{MB} + \mathrm{sign}(x_{MV})\left\lfloor \tfrac{|x_{MV}|}{N}\right\rfloor,\; y_{MB} + \mathrm{sign}(y_{MV})\left\lceil \tfrac{|y_{MV}|}{N}\right\rceil\right), \qquad (7)$$
$$\left(x_{MB} + \mathrm{sign}(x_{MV})\left\lfloor \tfrac{|x_{MV}|}{N}\right\rfloor,\; y_{MB} + \mathrm{sign}(y_{MV})\left\lfloor \tfrac{|y_{MV}|}{N}\right\rfloor\right), \qquad (8)$$

where sign(·) represents the sign of the operand. Clearly, depending on the specific values of xMV and yMV, we may have 1, 2, or 4 distinct indexes, i.e., the candidate MB may span 1, 2, or 4 adjacent MBs in the reference frame. For the general case, let us consider the scenario where one candidate MB spans 4 adjacent MBs in the reference frame, as illustrated in Fig. 11. After obtaining the indexes of the 4 reference MBs, we can directly calculate the corresponding DRAM row addresses. In order to determine the range of the column addresses, we calculate

$$y_{offset} = \begin{cases} |y_{MV}|\,\%\,N, & y_{MV} \ge 0 \\ N - |y_{MV}|\,\%\,N, & y_{MV} < 0 \end{cases} \qquad (9)$$

The candidate MB spans the last N − yoffset rows of the two top MBs and the first yoffset rows of the two bottom MBs, as illustrated in Fig. 11. Hence, we can determine the column addresses accordingly. Finally, we should determine the configuration data for the two barrel shifters (shown in Fig. 10) that form one row of the candidate MB at a time.
Fig. 11 An example to illustrate the calculation of the memory access addresses to fetch one candidate MB.

For this purpose, we calculate
$$x_{offset} = \begin{cases} |x_{MV}|\,\%\,N, & x_{MV} \ge 0 \\ N - |x_{MV}|\,\%\,N, & x_{MV} < 0 \end{cases} \qquad (10)$$
For each row read from the reference MBs on the left in Fig. 11, the barrel shifter shifts the data to the left by xoffset pixels; for each row read from the reference MBs on the right in Fig. 11, the barrel shifter shifts the data to the right by N − xoffset pixels. By implementing the above simple calculations in the DRAM domain, the 3D stacked DRAM can readily deliver the candidate MB to the motion estimation engine on the logic die, while the motion estimation engine only needs to supply the current motion vector and the position of the current MB.
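The address arithmetic of Eqs. (2)-(4) and (9)-(10) is simple enough to express directly in software. The C sketch below is our own illustration (the function names and the example numbers in main() are hypothetical); it reproduces the SBW = 3 example from the text and derives the bank index, DRAM row address, first column address, and a barrel-shifter offset for a given MB index and motion vector component.

```c
#include <stdio.h>
#include <stdlib.h>

/* Address arithmetic for the MB-by-MB frame mapping.  FW is the frame
 * width in MBs, s is the number of MBs stored per word-line, and N is
 * the MB dimension. */

static int ceil_div(int a, int b) { return (a + b - 1) / b; }

/* Eq. (2): number of MBs spanned by a search region of S pixels. */
static int span_mbs(int S, int N) { return 2 * ceil_div(S - N, 2 * N) + 1; }

/* Eqs. (3)-(4): bank index, row address, and first column address. */
static void mb_address(int x, int y, int FW, int s, int N,
                       int *bank, int *arow, int *acol)
{
    int idx = y * FW + x;
    *bank = x % 2;          /* MBs alternate between the 2 banks */
    *arow = idx / s;        /* Eq. (3), floor */
    *acol = (idx % s) * N;  /* Eq. (4); MB spans N consecutive columns */
}

/* Eqs. (9)-(10): offset of the candidate MB inside the reference MBs. */
static int mv_offset(int mv, int N)
{
    return (mv >= 0) ? abs(mv) % N : N - abs(mv) % N;
}

int main(void)
{
    int bank, arow, acol;
    printf("SBW = %d\n", span_mbs(32, 16));    /* 3, as in the text */
    mb_address(5, 2, 120, 4, 16, &bank, &arow, &acol);
    printf("bank %d, row %d, first column %d\n", bank, arow, acol);
    printf("offset for mv = -5: %d\n", mv_offset(-5, 16));
    return 0;
}
```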
6.3 Performance Evaluation

In the H.264/AVC coding standard, the size of each MB is 16 × 16, and the search region is set to 32 × 32 for HDTV1080p (1920 × 1080). Assuming that the luminance intensity of each pixel uses 8 bits, each 2-bank frame DRAM has a data I/O bandwidth of 128 bits. We assume that the encoder must be able to support multi-frame motion estimation with up to five reference frames. Hence, we need six 2-bank frame 3D DRAM blocks to store the current frame and the five reference frames in the stacked 3D DRAM. This leads to an aggregate data I/O bandwidth of 128 × 6 = 768 bits, corresponding to 768 TSVs for the logic-DRAM data interconnect. We use the inter-sub-array 3D partitioning strategy presented in Section 4 to estimate the 3D DRAM performance. For the target HDTV1080p resolution, each image frame needs about
2MByte of storage. Table 3 shows the estimated 3D DRAM results for each 2MByte 2-bank frame storage DRAM block at the 65-nm node. Since each sub-bank always has eight sub-arrays, we explore the 3D DRAM design space by varying the size of each sub-array and the number of sub-banks. In this study, the number of bit-lines in each sub-array is fixed at 512.

Table 3 Estimated results for each 2MByte 2-bank frame storage DRAM block.

                                     1-layer stacked                 4-layer stacked
Number of sub-banks                  1     2     4     8     16      1     2     4     8     16
Access time (non-burst) (ns)         9.49  7.37  6.65  6.57  7.36    9.36  7.25  6.53  6.44  6.67
Burst access time (ns)               7.17  4.83  4.11  3.91  4.39    7.09  4.84  4.03  3.83  3.93
Energy per access (non-burst) (nJ)   0.71  1.00  1.29  1.61  1.99    0.93  1.22  1.28  1.59  1.00
Energy per burst access (nJ)         0.15  0.45  0.76  1.08  1.46    0.13  0.43  0.74  1.06  1.42
Footprint (mm2)                      1.10  2.08  3.12  4.79  7.49    0.22  0.34  0.47  0.69  1.13
Table 3 clearly shows a trade-off: as we increase the number of sub-banks by reducing the size of each sub-array, the access latency decreases directly, while the access energy consumption and the DRAM footprint increase. We considered both 1-layer and 4-layer 3D DRAM stacking, and the results clearly show the advantages of 4-layer 3D DRAM stacking. As pointed out above, the proposed 3D DRAM attempts to access data on the same word-line as much as possible. Conventionally, access to data on the same word-line is denoted as burst access. Table 3 shows the difference between burst access and non-burst access, and burst access is clearly much preferable.
The image storage architecture presented above can seamlessly support any arbitrary motion vector search pattern, and hence naturally supports various motion estimation algorithms. In this case study we considered the following popular algorithms: exhaustive full search (FS), three step search (TSS) [11], new three step search (NTSS) [40], four step search (FSS) [50, 60], and diamond search (DS) [58, 59]. We applied these algorithms to two widely used HDTV 1080p video sequences, Tractor and Rush hour [57], extracting and analyzing 15 frames from each sequence. Fig. 12 shows the peak signal-to-noise ratio (PSNR) vs. the average memory energy consumption for processing each image frame. Each curve contains five points, corresponding to the scenarios using 1, 2, 3, 4, and 5 reference frames, respectively. We note that, due to the very regular memory access pattern of full search, explicit memory accesses can be greatly reduced by data reuse in the motion estimation engine. In this study, we assume the use of the full search motion estimation design approach with excellent data reuse presented in [37]. As clearly shown in Fig. 12, full search tends to consume the least memory access energy when using the proposed 3D DRAM image storage architecture.
We further designed a 1-D motion estimation computation engine that supports variable block size motion estimation. Every cycle, it receives one row of the candidate MB and updates the SAD computation results accordingly. Using
Fig. 12 PSNR vs. average memory energy consumption per frame of different motion estimation algorithms (full search, three step search, new three step search, four step search, and diamond search) for (a) Rush hour and (b) Tractor (HDTV 1080p), where the five points on each curve correspond to the use of 1, 2, 3, 4, and 5 reference frames.
Synopsys tool sets and a 65-nm standard CMOS technology, we synthesized this computation engine and estimated its power consumption. Combining the DRAM and computation energy consumption results, we estimated the overall average motion estimation power consumption shown in Fig. 13.
Fig. 13 Average power consumption of the entire motion estimation operation vs. the number of reference frames (1 to 5).
7 Conclusion

Motivated by the recent significant progress of 3D integration technology towards high manufacturability, this chapter studied the exciting opportunities that 3D integration enables for the design of digital signal processing circuits and systems. Because of the memory intensive nature of most signal processing systems, 3D integration of a single signal processing logic die with multiple high density memory dies appears to be a very natural vehicle for exploiting 3D integration technology in DSP systems. The key question in such 3D integrated DSP system design is how to most effectively leverage the massive logic-memory interconnect bandwidth and the very short logic-memory communication latency to push the overall system performance and efficiency beyond what is achievable today. This chapter discussed the use of 3D logic-memory integration in both programmable digital signal processors and application-specific signal processing ICs. Focusing on 3D logic-DRAM integration, it discussed 3D DRAM architecture design and presented two case studies applying 3D logic-DRAM integration to programmable VLIW processors and application-specific video encoders.
References 1. Ababei, C., Feng, Y., Goplen, B., Mogal, H., Zhang, T., Bazargan, K., Sapatnekar, S.: Placement and routing in 3D integrated circuits. IEEE Design & Test of Computers 22, 520–531
(2005) 2. Abraham, S.G., Mahlke, S.A.: Automatic and efficient evaluation of memory hierarchies for embedded systems. In: 32nd Annual International Symposium on Microarchitecture(MICRO32), pp. 114–125 (1999) 3. Banerjee, K., Souri, S., Kapur, P., Saraswat, K.: 3-D ICs: A novel chip design for improving deep-submicrometer interconnect performance and systems-on-chip integration. Proceedings of the IEEE 89, 602–633 (2001) 4. Barth, J., Reohr, W., Parries, P., Fredeman, G., Golz, J., Schuster, S., Matick, R., Hunter, H., Tanner, C., Harig, J., Hoki, K., Khan, B., Griesemer, J., Havreluk, R., Yanagisawa, K., Kirihata, T., Iyer, S.: A 500 MHz random cycle, 1.5 ns latency, SOI embedded DRAM macro featuring a three-transistor micro sense amplifier. IEEE Journal of Solid-State Circuits 43, 86–95 (2008) 5. Bernstein, K., Andry, P., Cann, J., Emma, P., Greenberg, D., Haensch, W., Ignatowski, M., Koester, S., Magerlein, J., Puri, R., Young, A.: Interconnects in the third dimension: Design challenges for 3D ICs. In: Proc. of ACM/IEEE Design Automation Conference (DAC), pp. 562–567 (2007) 6. Binkert, N.L., Dreslinski, R.G., Hsu, L.R., Lim, K.T., Saidi, A.G., Reinhardt, S.K.: The M5 simulator: Modeling networked systems. IEEE Micro 26(4), 52–60 (2006) 7. Black, B., Annavaram, M., Brekelbaum, N., DeVale, J., Jiang, L., Loh, G.H.: Die stacking (3D) microarchitecture. In: Proc. of IEEE/ACM International Symposium on Microarchitecture (Micro), pp. 469–479 (2006) 8. Burns, J., Aull, B., Chen, C., Chen, C., Keast, C., Knecht, J., Suntharalingam, V., Warner, K., Wyatt, P., Yost, D.: A wafer-scale 3-D circuit integration technology. IEEE Trans. Electron Devices 53, 2507–2516 (2006) 9. CACTI: An integrated cache and memory access time, cycle time, area, leakage, and dynamic power model. http://www.hpl.hp.com/research/cacti/ 10. Chakrapani, L.N., Gyllenhaal, J., Hwu, W.W., Mahlke, S.A., Palem, K.V., Rabbah, R.M.: Trimaran: An infrastructure for research instruction-level parallelism. In: Languages and Compilers for High Performance Computing. Lecture Notes in Computer Science, pp. 32-41 (2005) 11. Chau, L.P., Jing, X.: Efficient three-step search algorithm for block motion estimation video coding. In: Proc. of IEEE International Conference on Acoustics, Speech, and Signal Processing, pp. 421–424 (2003) 12. Chen, K.N., Lee, S., Andry, P., Tsang, C., Topol, A., Lin, Y., Lu, J., Young, A., Ieong, M., Haensch, W.: Structure design and process control for Cu Bonded Interconnects in 3D integrated circuits. In: Technical Digest of IEEE International Electron Devices Meeting (IEDM), pp. 367–370 (2006) 13. Chen, M., Chen, E., Lai, J.Y., Wang, Y.P.: Thermal investigation for multiple chips 3D packages. In: Proc. of Electronics Packaging Technology Conference, pp. 559–564 (2008) 14. Claasen, T.: An industry perspective on current and future state of the art in system-on-chip (SoC) Technology. Proceedings of the IEEE 94, 1121–1137 (2006) 15. Cong, J.: An interconnect-centric design flow for nanometer technologies. Proceedings of the IEEE 89, 505–528 (2001) 16. Cong, J., Luo, G.: A multilevel analytical placement for 3D ICs. In: Proc. of Asia and South Pacific Design Automation Conference, pp. 361–366 (2009) 17. Cong, J., Zhang, Y.: Thermal via planning for 3-D ICs. In: Proc. of IEEE/ACM International Conference on Computer-Aided Design (ICCAD), pp. 745–752 (2005) 18. Crowley, M., Al-Shamma, A., Bosch, D., Farmwald, M., Fasoli, L.: 512 Mb PROM with 8 layers of antifuse/diode cells. In: IEEE Intl. Solid-State Circuit Conf. 
(ISSCC), p. 284 (2003) 19. Emma, P.G., Kursun, E.: Is 3D chip technology the next growth engine for performance improvement? IBM Journal of Research and Development 52(6), 541–552 (2008) 20. Fisher, J.A., Faraboschi, P., Young, C.: Embedded Computing: A VLIW Approach to Architecture, Compilers and Tools. Morgan Kaufmann (2004) 21. Gibert, E., Sanchez, J., Gonzales, A.: Effective instruction scheduling techniques for an interleaved cache clustered VLIW processor. In: Proceedings of the 35th Annual IEEE/ACM International Symposium on Microarchitecture, pp. 123–133 (2002)
22. Gibert, E., Sanchez, J., Gonzales, A.: Local scheduling techniques for memory coherence in a clustered VLIW processor with a distributed data cache. In: Proceedings of the International Symposium on Code Generation and Optimization, pp. 193–203 (2003) 23. Goplen, B., Sapatnekar, S.: Placement of thermal vias in 3-D ICs using various thermal objectives. IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems 25, 692–709 (2006) 24. Harrer, H., Katopis, G., Becker, W.: From chips to systems via packaging - A comparison of IBM’s mainframe servers. IEEE Circuits and Systems Magazine 6, 32–41 (2006) 25. Healy, M., Vittes, M., Ekpanyapong, M., Ballapuram, C.S., Lim, S.K., Lee, H.H.S., Loh, G.H.: Multiobjective microarchitectural floorplanning for 2-D and 3-D ICs. IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems 26, 38–52 (2007) 26. Ho, R., Mai, K.W., Horowitz, M.A.: The future of wires. Proceedings of the IEEE 89, 490–504 (2001) 27. Horowitz, M., Stark, D., Alon, E.: Digital circuit design trends. IEEE Journal of Solid-State Circuits 43, 757–761 (2008) 28. Huang, Y., Chen, T.C., Tsai, C.H., Chen, C.Y., Chen, T.W., Chen, C.S., Shen, C.F., Ma, S.Y., Wang, T.C., Hsieh, B.Y., Fang, H.C., Chen, L.G.: A 1.3TOPS H.264/AVC single-chip encoder for HDTV applications. In: International Solid-State Circuit Conference, pp. 128–129. San Francisco (2005) 29. Hwang, C.G.: New paradigms in the silicon industry. In: Proc. of Technical Digest of IEEE International Electron Devices Meeting (IEDM), pp. 19–26 (2006) 30. Itoh, K.: VLSI Memory Chip Design. Springer (2001) 31. The International Technology Roadmap for Semiconductors (ITRS). http://www.itrs.net/reports.html 32. Jung, S.M., Jang, J., Cho, W., Moon, J., Kwak, K.: The revolutionary and truly 3-dimensional 25F2 SRAM technology with the smallest S3 (stacked single-crystal Si) cell, 0.16 m2 , and SSTFT (stacked single-crystal thin film transistor) for ultra high density SRAM. In: Proc. of Symposium on VLSI Technology, pp. 228–229 (2004) 33. Jung, S.M., Jang, J., Kim, K.: Three dimensionally stacked NAND flash memory technology using stacked single crystal Si Layers in ILD and TANOS structure for beyond 30nm node. In: Technical Digest of IEEE International Electron Devices Meeting (IEDM), pp. 37–40 (2006) 34. Kathail, V., Schlansker, M.S., Rau, B.R.: HPL-PD architecture specification: Version 1.1. Tech. rep., Hewlett-Packard Company (2000) 35. Kgil, T., D’Souza, S., Saidi, A., Binkert, N., Dreslinski, R., Reinhardt, S., Flautner, K., Mudge, T.: PicoServer: Using 3D stacking technology to enable a compact energy efficient chip multiprocessor. In: Proc. of 12th International Conference on Architectural Support for Programming Languages and Operating Systems (ASPLOS) (2006) 36. Khailany, B.K., Williams, T., Lin, J., Long, E.P., Rygh, M., Tovey, D.W., Dally, J.W.: A programmable 512 GOPS stream processor for signal, image, and video processing. IEEE Journal of Solid-State Circuits 43(1), 202–213 (2008) 37. Kim, M., Hwang, I., Chae, S.I.: A fast VLSI architecture for full-search variable block size motion estimation in MPEG-4 AVC/H.264. In: Proc. Asia and South Pacific Design Automation Conference, pp. 631–634 (2005) 38. Lapsley, P., Bier, J., Shoham, A., Lee, E.A.: DSP Processor Fundamentals: Architectures and Features. IEEE Press (1997) 39. Lee, C., Potkonjak, M., Mangione-Smith, W.: MediaBench: A tool for evaluating and synthesizing multimedia and communications systems. In: Proc. 
of IEEE/ACM International Symposium on Microarchitecture, pp. 330–335 (1997) 40. Li, R., Zeng, B., Liou, M.L.: A new three-step search algorithm for block motion estimation. IEEE Trans. on Circuits and Systems for Video Technology 4(4), 438–442 (1994) 41. Liu, C.C., Ganusov, I., Burtscher, M., Tiwari, S.: Bridging the processor-memory performance gap with 3D IC technology. IEEE Design and Test of Computers 22, 556–564 (2005) 42. Loh, G.: 3D-stacked memory architecture for multi-core processors. In: Proceedings of the 35th ACM/IEEE Intl. Conf. on Computer Architecture (2008)
43. Loh, G., Xie, Y., Black, B.: Processor design in 3D die-stacking technologies. IEEE Micro 27, 31–48 (2007) 44. Lu, J.Q.: 3-D hyperintegration and packaging technologies for micro-nano systems. Proceedings of the IEEE 97, 18–30 (2009) 45. Lu, J.Q., Cale, T., Gutmann, R.: Wafer-level three-dimensional hyper-integration technology using dielectric adhesive wafer bonding. Materials for Information Technology: Devices, Interconnects and Packaging (Eds. E. Zschech, C. Whelan, T. Mikolajick) pp. 386–397 (Springer-Verlag (London) Ltd, August 2005) 46. Matick, R., Schuster, S.: Logic-based eDRAM: Origins and rationale for use. IBM J. Res. & Dev. 49, 145–165 (2005) 47. Moor, P.D., Ruythooren, W., Soussan, P., Swinnen, B., Baert, K., Hoof, C.V., Beyne, E.: Recent advances in 3D integration at IMEC. Enabling Technologies for 3-D Integration (edited by C.A. Bower, P.E. Garrou, P. Ramm, and K. Takahashi) (2006) 48. Morrow, P., Black, B., Kobrinsky, M., Muthukumar, S., Nelson, D., Park, C.M., Webb, C.: Design and fabrication of 3D microprocessors. Enabling Technologies for 3-D Integration (edited by C.A. Bower, P.E. Garrou, P. Ramm, and K. Takahashi) (2006) 49. Panda, P.R., Catthoor, F., Dutt, N.D., Danckaert, K., Brockmeyer, E., Kulkarni, C., Kjeldsberg, P.G.: Data and memory optimization techniques for embedded systems. ACM Transactions on Design Automation of Electronic Systems 6, 149–206 (2001) 50. Po, L.M., Ma, W.C.: A novel four-step search algorithm for fast block motion estimation. IEEE Trans. on Circuits and Systems for Video Technology 6, 313–317 (1996) 51. Pozder, S., Chatterjee, R., Jain, A., Huang, Z., Jones, R., Acosta, E.: Progress of 3D integration technologies and 3D interconnects. In: Proc. of IEEE International Interconnect Technology Conference, pp. 213–215 (2007) 52. Pozder, S., Jones, R., Adams, V., Li, H.F., Canonico, M., Zollner, S., Lee, S., Gutmann, R., Lu, J.Q.: Exploration of the scaling limits of 3D integration. Enabling Technologies for 3-D Integration (edited by C.A. Bower, P.E. Garrou, P. Ramm, and K. Takahashi) (2006) 53. Rickert, P., Krenik, W.: Cell phone integration: SiP, SoC, and PoP. IEEE Design & Test of Computers 23(3), 188–195 (2006) 54. Sapatnekar, S.: Addressing thermal and power delivery bottlenecks in 3D circuits. In: Proc. of Asia and South Pacific Design Automation Conference, pp. 423–428 (2009) 55. SEMATECH Consortium. http://www.sematech.org 56. Strauss, W.: The real DSP chip market. IEEE Signal Processing Magazine 20, 83 (2003) 57. Sveriges Television (SVT): http://www.svt.se 58. Tham, J.Y., Ranganath, S., Ranganath, M., Kassim, A.A.: A novel unrestricted center-biased diamond search algorithms for block motion estimation. IEEE Trans. on Circuits and Systems for Video Technology 8, 369–377 (1998) 59. Tham, J.Y., Ranganath, S., Ranganath, M., Kassim, A.A.: A new diamond search algorithm for fast block-matching motion estimation. IEEE Trans. on Image processing 9, 287–290 (2000) 60. Wang, K.T., Chen, O.C.: Motion estimation using an efficient four-step search method. In: Proc. of IEEE International Symposium on Circuits and Systems, pp. 217–220 (1998) 61. Wang, R., Li, J., Huang, C.: Motion compensation memory access optimization strategies for H.264/AVC decoder. In: Proc. of IEEE International Conference on Acoustics, Speech, and Signal Processing, pp. 97–100 (2005) 62. Wiegand, T., Sullivan, G.J., Bjontegaard, G., Luthra, A.: Overview of the H.264/AVC video coding standard. IEEE Trans. 
on Circuits and Systems for Video Technology 13(7), 560–576 (2003) 63. Wu, Z., Wolf, W.: Design study of shared memory in VLIW video signal processors. In: Proceedings of the 1998 International Conference on Parallel Architectures and Compilation Techniques, pp. 52–59 (1998) 64. Wulf, W.A., McKee, S.A.: Hitting the memory wall: Implications of the obvious. Computer Architecture News 23, 20–24 (1995)
Mixed Signal Techniques Olli Vainio
Abstract Mixed signal circuits include both analog and digital functions. Mixed signal techniques that are commonly encountered in signal processing systems are discussed in this chapter. First, the principles and general properties of sampling and analog to digital conversion are presented. The structure and operating principle of several widely used analog to digital converter architectures are then described, including both converters operating at the Nyquist rate and oversampled converters based on sigma-delta modulators. Next, different types of digital to analog converters are discussed. The basic features and building blocks of switched-capacitor circuits are then shown. Finally, mixed-signal techniques for frequency synthesis and clock synchronization are explained.
1 Introduction

Most physical quantities in the world are analog by nature, i.e., continuous in time and amplitude. Digital signal processing, on the other hand, is done with signal representations that are discrete in both time and amplitude. A signal processing system therefore often needs converters between the two domains, called A/D and D/A converters. For instance, a compact disc (CD) player includes a D/A converter for converting digitally encoded information to an audio signal (see Chapter 2). There are also intermediate forms, especially sampled analog signals in switched-capacitor filters, which are discrete in time and continuous in amplitude. Presently a large portion of integrated circuits are mixed signal circuits, meaning that both analog and digital functions are included in the circuit.

Olli Vainio
Tampere University of Technology, P. O. Box 553, FIN-33101 Tampere, Finland
e-mail: [email protected]

This chapter discusses a few selected topics of mixed signal techniques that are common in signal processing systems. The chapter is organized as follows. Section 2
presents the basic principles of sampling, sample and hold circuits, and the performance parameters used to characterize data converters. In the next section, various analog to digital converter architectures are shown, including both Nyquist-rate and oversampled converters; the principles of sigma-delta modulators are also explained. Section 4 describes digital to analog converters. Principles and basic building blocks of switched-capacitor circuits are discussed in Section 5. Finally, Section 6 presents techniques for frequency synthesis and clock synchronization.
2 General Properties of Data Converters

The block diagram of a mixed signal system, as commonly encountered in signal processing systems, is shown in Fig. 1. The continuous-time analog input signal x(t), where t is the continuous time, originates from a transducer that converts some form of observable energy into voltage or current for electronic processing. The discrete-time representation x[n] of the analog signal x(t) is typically obtained by periodic sampling according to the relation

$$x[n] = x(nT) \qquad (1)$$
where n takes integer values and T is the sampling period. Its reciprocal, fs = 1/T, is the sampling frequency. Sampling can be regarded as the multiplication of the signal x(t) by a sampling function s(t), which is a periodic impulse train

$$s(t) = \sum_{n=-\infty}^{\infty} \delta(t - nT) \qquad (2)$$

where δ(t) is the unit impulse function or Dirac delta function. In the sampling process, the unit impulses are weighted by x(t) such that the area of each impulse corresponds to the instantaneous amplitude value x(nT) [15].
The frequency spectrum of a periodic impulse train is itself a periodic impulse train. Since multiplication in the time domain corresponds to convolution in the frequency domain, the spectrum of the sampled signal includes the original signal band and its replicas, which repeat on the frequency axis. The replicas are called spectral images and they are centered at multiples of fs. It follows that if fs is too small in relation to the bandwidth of x(t), the images will overlap, and reconstruction of the original signal is no longer possible. Such a situation is called aliasing.
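Aliasing is easy to reproduce numerically. In the following C sketch (our own illustration; the frequencies are arbitrary choices), a 7 kHz cosine sampled at fs = 8 kHz, i.e., below its Nyquist rate of 14 kHz, produces exactly the same sample values as a 1 kHz cosine, since 7 kHz = fs − 1 kHz.

```c
#include <stdio.h>
#include <math.h>

/* Demonstration of aliasing: a 7 kHz cosine sampled at fs = 8 kHz
 * yields exactly the same samples as a 1 kHz cosine. */
int main(void)
{
    const double PI = 3.14159265358979323846;
    const double fs = 8000.0;
    for (int n = 0; n < 8; n++) {
        double t = n / fs;
        printf("n=%d: cos(2*pi*7000*t)=%+.4f  cos(2*pi*1000*t)=%+.4f\n",
               n, cos(2 * PI * 7000.0 * t), cos(2 * PI * 1000.0 * t));
    }
    return 0;
}
```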
Fig. 1 A mixed signal system for digital processing of an analog input signal: x(t) → antialias filter → sample and hold → A/D converter → x[n] → digital system → y[n] → D/A converter → reconstruction filter → y(t).
The Nyquist sampling theorem states that the signal x(t) is uniquely determined by its samples x[n] if

$$f_s \ge 2 f_{max} \qquad (3)$$

where fmax is the bandwidth of the signal. The frequency fmax is called the Nyquist frequency and the frequency 2fmax is the Nyquist rate. Usually anti-alias filtering is done before sampling in order to limit the signal bandwidth according to Eq. (3). In the discrete-time domain, the frequency is given as the angular frequency ω = 2π f / fs, where f is the analog signal frequency in Hertz.
Analog to digital conversion is in principle the process of converting the weighted impulse train to a sequence of digitally coded samples in the discrete-time domain. In practice, sampling using ideal impulses is not possible, and the sampling window has a finite width. The signal is therefore measured over a finite time interval and not instantaneously, giving rise to the aperture effect. The signal may be changing while it is being sampled, and for a specified amount of uncertainty, the slope or maximum frequency of the input signal is limited. Assuming a sinusoidal signal and a maximum allowed uncertainty of 1/2 LSB (least significant bit), the maximum frequency of a signal that can be converted to 1/2 LSB accuracy using a b-bit analog to digital converter (ADC) is

$$f_{max} = \frac{1}{2^{b+1}\pi\tau} \qquad (4)$$

where τ is the aperture time [16].
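As a numerical illustration of Eq. (4), the short C sketch below (our own) evaluates the limit for a hypothetical 12-bit converter with a 1 ns aperture time, giving roughly 39 kHz.

```c
#include <stdio.h>
#include <math.h>

/* Eq. (4): maximum full-scale sine frequency for 1/2 LSB aperture
 * uncertainty with a b-bit ADC and aperture time tau (in seconds). */
static double aperture_fmax(int b, double tau)
{
    const double PI = 3.14159265358979323846;
    return 1.0 / (pow(2.0, b + 1) * PI * tau);
}

int main(void)
{
    /* b = 12 bits, tau = 1 ns: roughly 39 kHz */
    printf("fmax = %.1f Hz\n", aperture_fmax(12, 1e-9));
    return 0;
}
```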
2.1 Sample and Hold Circuits

Commonly a sample and hold circuit (S/H, also known as a zero-order hold) is used in connection with an ADC. The S/H output is kept constant during the conversion, so the aperture effect can be avoided; fmax then depends on fs according to the Nyquist criterion in Eq. (3), and the ADC has the time T to complete the conversion. The basic sample and hold circuit is shown in Fig. 2(a). When the switch is closed, the capacitor voltage follows the input voltage Vi. When the switch is open, CH holds the voltage to which it was charged. Amplifiers may be used to buffer the input and output, as shown in Fig. 2(a). The input buffer should have a large input impedance and a small output impedance, and it should be stable for capacitive loads. The output buffer should have a large input impedance and a short settling time. An example of the behavior of Vi and Vo is depicted in Fig. 2(b), where the switch is controlled by the signal S/H [16].
The Fourier transform of the sampling pulse is given by

$$H(f) = e^{-j\pi f T}\, T\, \mathrm{sinc}(fT). \qquad (5)$$
The S/H can be considered as an ideal impulse sampler followed by a filter with a frequency response of sinc(f/fs), where sinc(x) = sin(πx)/(πx) [1]. The signal spectrum is modified because of the droop with frequency caused by the sinc response.
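The droop is easy to evaluate numerically; the small C sketch below (our own) computes the magnitude of the sinc response in dB and reproduces the −3.9 dB attenuation at fs/2 discussed in connection with Fig. 3.

```c
#include <stdio.h>
#include <math.h>

/* Normalized zero-order-hold magnitude response |sinc(f/fs)| in dB,
 * with sinc(x) = sin(pi*x)/(pi*x). */
static double zoh_droop_db(double f_over_fs)
{
    const double PI = 3.14159265358979323846;
    double x = PI * f_over_fs;
    double s = (x == 0.0) ? 1.0 : sin(x) / x;
    return 20.0 * log10(fabs(s));
}

int main(void)
{
    printf("droop at fs/2: %.2f dB\n", zoh_droop_db(0.5));  /* about -3.92 */
    return 0;
}
```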
Fig. 2 (a) A sample and hold circuit. (b) Illustration of S/H operation.
This is illustrated in Fig. 3. At the Nyquist frequency fs/2, the input signal is attenuated by the factor 0.64, or −3.9 dB. At the sampling frequency fs the output of the ideal S/H is zero. The arrows in Fig. 3 represent the fundamental signal and the images of a sine wave of frequency fs/6, corresponding to ω = π/3 in the discrete-time domain. The circuit is also called a track and hold circuit when the control signal has a duty cycle such that the switch is closed for more time than it is open. Many integrated circuit ADCs include an S/H circuit internally. Such ADCs are also called sampling A/D converters.
Fig. 3 Frequency response of the S/H circuit in relation to the signal band (dashed line).
Various performance parameters and nonideality measures are used to characterize S/H circuits [16] [8]:
1. Tracking speed in the sample mode. There are both small-signal and large-signal limitations, due to a finite bandwidth and a limited slew rate.
2. Sample to hold transition error, which occurs when the S/H goes from sample mode to hold mode. Opening of the switch is not instantaneous, and the delay may be different for each switch operation. Charge injection during the sample to hold transition results in a voltage error called the pedestal or step error. Because the charge injection depends on the input voltage, it may cause nonlinear distortion.
3. Aperture time uncertainty or aperture jitter, caused by the effective sampling time changing from one sampling instance to the next.
4. Input signal feedthrough during the hold mode, typically caused by parasitic capacitive coupling.
5. In the hold mode, the voltage changes slowly at a rate called the droop rate. The change is caused by charge leakage from the capacitor due to the input bias current of the output amplifier, switch leakage and internal capacitor leakage.
6. General characteristics of analog circuits, such as dynamic range, linearity, gain and offset errors, and power supply rejection ratio (PSRR).
2.2 Quantization

Assuming that a b-bit digital representation is used, the samples are assigned one of 2^b values by the ADC. Quantization causes quantization error to be superimposed on the signal. Quantization noise is commonly considered as zero-mean white noise that is uniformly distributed in the interval ±q/2, where q is the quantization step size. If the input amplitude is normalized in the range ±1, the quantization step is q = 2^(1−b). Denoting the quantization error as e, the quantization noise power can be calculated as

σ^2 = (1/q) ∫_{−q/2}^{q/2} e^2 de = q^2/12. (6)
For a sine wave input, the signal to quantization noise ratio in decibels for an ideal ADC is given by

SNR = 10 log( (1/2) / (q^2/12) ) = 10 log( (1/2) / ((2^(1−b))^2/12) ) (7)

yielding

SNR = 6.02b + 1.76 dB. (8)
The ADC output may be encoded, e.g., as unsigned binary numbers, two’s complement representation or thermometer-code. The most common binary forms used for
sample encoding in digital signal processing are two's complement fixed point and floating point formats (see Chapter 12) [7]. The thermometer-code (also known as unary-weighted code) representation of a b-bit binary word has 2^b − 1 bits. For instance, a 3-bit binary word 000, 001, ..., 111 is thermometer-coded as 0000000, 0000001, ..., 1111111. Thus all the bits in this code are weighted equally. Converters between binary numbers and thermometer-code can be designed as combinational logic in a straightforward way [20].
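In software the same mappings reduce to a shift and a bit count, which mirrors the combinational logic. A minimal C sketch (the function names are ours, for illustration only):

#include <stdint.h>

/* Thermometer-code a small binary value (value < 32): the result
   has 'value' ones in its least significant bits, e.g. 3 -> 111. */
uint32_t bin_to_thermo(uint32_t value)
{
    return (1u << value) - 1u;          /* 2^value - 1 ones set */
}

/* Decode by counting ones, since all bits are weighted equally. */
uint32_t thermo_to_bin(uint32_t t)
{
    uint32_t n = 0;
    while (t) {
        n += t & 1u;
        t >>= 1;
    }
    return n;
}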
2.3 Performance Specifications

Data converters are characterized by both static and dynamic performance parameters. The application determines which parameters are considered the most important.

2.3.1 Static Performance

The ADC produces one of the 2^b output codes for any analog input voltage. Thus the dynamic range (DR) of the converter is 2^b or, expressed in decibels, DR = 20 log(2^b) = 6.02b dB. Ideally, a line connecting the output codes as a function of the input voltage is a straight line. Nominally, the output makes a transition from zero to one LSB for the input voltage corresponding to 1/2^b of the full scale voltage reference VREF, and respectively reaches the maximum code at (2^b − 1)/2^b of VREF. There is offset error if the ADC transfer function is shifted along the input voltage axis. Some ADCs have an intentional built-in offset of 1/2 LSB in order to shift the quantization error range from 0 . . . 1 LSB to −1/2 . . . 1/2 LSB. The full-scale error is a measure of a change of the slope of the ADC transfer function, including both offset and gain errors. Each code step of the ADC should ideally have the same size in terms of the input voltage. Deviation in step size between the codes is called differential nonlinearity (DNL), defined as

DNL = (Vn+1 − Vn) / VLSB − 1 (9)

where VLSB = VREF/2^b, and Vn+1 and Vn are voltages corresponding to two successive code transitions. Integral nonlinearity (INL) is the deviation of the actual ADC transfer function from a straight line. There are two common methods to specify the position of the straight line: either using a best line fit for the transfer function or passing a straight line between the end points. The INL corresponding to an output code C can be calculated as

INL = (VC − V0) / VLSB − C (10)
where 0 < C < 2^b − 1, and VC and V0 are input voltages corresponding to codes C and zero, respectively.

2.3.2 Dynamic Performance

The dynamic performance is characterized by investigating the frequency-domain characteristics of the ADC, typically calculating a fast Fourier transform (FFT) of the output for a sinusoidal input. The spectrum then shows a peak for the fundamental frequency as well as components of harmonic distortion and noise caused by various sources. The signal-to-noise ratio (SNR) is the ratio of the root mean square (RMS) power of the signal to the noise power, where harmonic distortion is not included. SNR is expressed in decibels, given by (using RMS voltages)

SNR = 20 log( VSIGNAL / VNOISE ). (11)
Harmonic distortion is caused by nonlinearities in the converter. The combined effect of the harmonic components is measured as the total harmonic distortion (THD), given by

THD = 20 log( sqrt(V2^2 + V3^2 + · · · + Vm^2) / VSIGNAL ) (12)

where V2, V3, . . . , Vm are the harmonic components within the bandwidth of interest. A parameter that combines the effects of distortion and noise is the signal-to-noise and distortion ratio (SINAD), calculated as

SINAD = 20 log( VSIGNAL / sqrt(V2^2 + V3^2 + · · · + Vm^2 + VNOISE^2) ). (13)
The spurious-free dynamic range (SFDR) is the difference in magnitude between the fundamental signal and the highest spur, i.e., any unwanted signal in the spectrum. The spur may be a harmonic distortion component or an elevated noise floor at some point in the band. SFDR is usually measured in dBc (i.e., with respect to the carrier amplitude). It is calculated using RMS values as

SFDR(dBc) = input signal(dB) − unwanted tone(dB). (14)
SFDR is measured up to the Nyquist frequency or within a specified band. [1]
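To make the definitions concrete, the following C sketch evaluates Eqs. (11)-(13) from RMS voltages that would be obtained, e.g., from an FFT of the converter output; the function and variable names are illustrative only:

#include <math.h>

/* v_sig: RMS fundamental; v_harm[0..n-1]: RMS harmonics V2..Vm;
   v_noise: RMS noise with the harmonics excluded.               */
double snr_db(double v_sig, double v_noise)
{
    return 20.0 * log10(v_sig / v_noise);            /* Eq. (11) */
}

double thd_db(const double *v_harm, int n, double v_sig)
{
    double s = 0.0;
    for (int i = 0; i < n; i++)
        s += v_harm[i] * v_harm[i];
    return 20.0 * log10(sqrt(s) / v_sig);            /* Eq. (12) */
}

double sinad_db(const double *v_harm, int n, double v_sig, double v_noise)
{
    double s = v_noise * v_noise;
    for (int i = 0; i < n; i++)
        s += v_harm[i] * v_harm[i];
    return 20.0 * log10(v_sig / sqrt(s));            /* Eq. (13) */
}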
3 Analog to Digital Converters

The selection of an ADC type depends on the application requirements, such as sampling rate, wordlength, power dissipation and any limitations on the possible nonidealities mentioned in the previous section. One way to classify the architectures is to make a distinction between Nyquist-rate converters and oversampled converters. [8]
3.1 Nyquist-Rate A/D Converters

In Nyquist-rate ADCs the input signal bandwidth is close to that specified by the sampling theorem in Eq. (3). In practice, a somewhat higher sampling rate is often preferred to make it easier to construct anti-aliasing and reconstruction filters. In the following we mention some of the common Nyquist-rate ADC types.

3.1.1 Integrating Converters

Integrating ADCs are suitable for applications requiring high accuracy and low speed. The single-slope integrating ADC is based on using an analog integrator, a comparator and a digital counter. The integrator and comparator can be implemented using operational amplifiers (see Section 5.1). The integrator charges a capacitor with a constant slope while the counter is counting up. When the integrator voltage reaches the input signal voltage, the comparator output changes state, counting is stopped and the counter output gives the conversion result. The capacitor is then discharged and the counter is cleared for the next conversion cycle. The single-slope ADC suffers from drift with aging. A preferable approach is the dual-slope ADC, shown in Fig. 4. The conversion cycle operates in two phases.

1. In phase 1 the counter is running for 2^b clock cycles. The integrator output is charged with a slope that is proportional to Vin.
2. In phase 2 the counter is reset and the integrator input is switched to the reference voltage Vref. The integrator output voltage is reduced with a constant slope. When it reaches zero, counting is stopped and the conversion result is ready.
Fig. 4 A dual-slope ADC.
The dual-slope converter avoids the drift problem of the single-slope ADC because neither the integrator gain nor the counting rate varies between the two operation phases. The converter has high noise immunity, as the input voltage is practically averaged during phase 1. The signal is therefore effectively filtered using a sinc type filter response. By appropriately choosing the integration time, certain frequency components can be attenuated by aligning them with the transmission zeros of the sinc filter. This can be useful, for instance, to eliminate power line noise.

3.1.2 Successive-Approximation Converters

Successive-approximation converters use binary search to find the conversion result one bit at a time, starting from the most significant bit (MSB). It therefore takes b clock cycles to find the b-bit result. The block diagram of a successive-approximation ADC is shown in Fig. 5. It uses a successive-approximation register (SAR), including some control logic, to construct the result. During each iteration cycle, the SAR output word is given to a digital to analog converter (DAC), and its output is compared in an analog comparator with the input voltage from a S/H circuit. The S/H is needed to keep the comparator input stable during conversion. Successive-approximation converters are suitable for medium speed, moderate accuracy applications. Converters can also be designed with nonlinear transfer functions for, e.g., data compressing applications [10]. Then the quantization steps are dense for small amplitudes and progressively larger for larger amplitudes, and the SNR is better for typical signal statistics than that achievable by linear quantization. For example, digital communication systems use standardized A-law and μ-law algorithms for speech companding [4].
Fig. 5 A successive-approximation ADC.
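The binary search performed by the SAR logic is easy to express in software. The following hedged C model treats the comparator and the internal DAC as idealized (the DAC is assumed to produce VREF · code / 2^b):

/* Idealized SAR conversion: returns the b-bit code for vin. */
unsigned sar_convert(double vin, double vref, int b)
{
    unsigned code = 0;
    for (int k = b - 1; k >= 0; k--) {            /* MSB first     */
        code |= 1u << k;                          /* try bit k     */
        double vdac = vref * code / (double)(1u << b);
        if (vin < vdac)                           /* comparator    */
            code &= ~(1u << k);                   /* bit too large */
    }
    return code;                                  /* b cycles total */
}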
3.1.3 Flash Converters

Flash converters, also known as parallel converters, are used in very high speed applications. Flash ADCs are available for sampling rates exceeding 1 GHz. A disadvantage of the flash ADC architecture is that the size grows proportionally to 2^b. The resolution is typically limited to 8 bits and the power dissipation is quite high. A flash ADC architecture is shown in Fig. 6 for the example case of 3-bit output. The input voltage Vin is connected to a parallel array of comparators. Each comparator is also connected to a node of a resistor string which divides the reference voltage Vref into a stack of fixed voltage levels for comparison with Vin. The comparator array generates a thermometer code, such as 1111000. One of the XOR gates gives a high output corresponding to the detected voltage level, and the 2^b-to-b encoder produces the output word. By changing the resistor sizing it is possible to modify the transfer function of the flash ADC. It can be made nonlinear, e.g., logarithmic, or an intentional 0.5 LSB offset may be included.

3.1.4 Sub-Ranging Converters

Sub-ranging or multi-step converters are popular in high-speed, medium-accuracy applications. The block diagram of an 8-bit sub-ranging ADC is shown in Fig. 7. The conversion is done in two steps. First, the coarse ADC converts the 4 MSBs. This result is converted back to analog with a DAC and subtracted from the input voltage. The difference is converted to digital with a fine ADC that produces the 4 LSBs. The subconverters are typically flash ADCs, together needing far fewer comparators than a full flash ADC of the same overall resolution.
Fig. 6 A flash ADC.
The fine ADC may be preceded by a gain amplifier with a gain of a power of two.
Fig. 7 A sub-ranging ADC with two 4-bit subconverters.
3.2 Oversampled A/D Converters

Oversampled converters use a sampling rate that is much higher than the Nyquist rate. The oversampling ratio (OSR) is defined as the ratio of the actual sampling rate and the Nyquist rate. The OSR is typically between 8 and 512. Oversampled converters can utilize noise shaping, which means shaping the spectrum of the quantization noise so that most of the noise power is moved outside of the signal band. The sampling rate is subsequently lowered close to the Nyquist rate by decimation. This operation involves a so-called decimation filter, which is a lowpass filter that limits the signal bandwidth to avoid aliasing. The decimation filter also attenuates the out-of-band quantization noise.

3.2.1 Benefits of Oversampling

The quantization noise power, given by Eq. 6, is independent of the sampling frequency. Assuming that no noise shaping is done, the noise is evenly distributed in the frequency domain from zero to the Nyquist frequency. When oversampling is used, the portion of the quantization noise that is outside of the signal band can be filtered out, and the quantization noise power on the signal band is reduced to

σ^2_OSR = (q^2/12) (1/OSR). (15)

In the oversampled case the signal-to-noise ratio is therefore given by

SNR_OSR = 6.02b + 1.76 + 10 log(OSR) dB (16)
where the last term is the benefit achieved by oversampling. Thus, SNR is increased by 3 dB when the sampling rate is doubled, which corresponds to a 0.5 bit increase in resolution.

3.2.2 First-Order Sigma-Delta Modulator

Oversampled converters usually take advantage of noise shaping, which can be accomplished using sigma-delta (also called delta-sigma) modulators. Various modulator topologies exist. The block diagram of a first-order sigma-delta modulator for an ADC is shown in Fig. 8. The analog input voltage is added to the feedback signal and taken as input to the integrating filter. The integrator output is quantized by the quantizer Q. One-bit quantizers are most common but also multi-bit quantizers are used. The quantized output is the oversampled digital output from the modulator. It is also converted back to analog using a DAC and subtracted from the input in the addition node. In practice, a modulator is often implemented using switched-capacitor circuits (see Section 5). The one-bit quantizer is implemented with an analog comparator, and the one-bit DAC is simply a switch choosing one of two reference voltages. The sigma-delta modulator can be analyzed using a linearized model, where the quantizer is modeled as an additive noise source. Denoting the input by X(z), the noise by N(z) and the integrator transfer function by H(z), the output Y(z) can be expressed as

Y(z) = STF(z) X(z) + NTF(z) N(z) (17)

where the signal transfer function STF(z) and the noise transfer function NTF(z) are respectively given by

STF(z) = H(z) / (1 + H(z)) (18)

NTF(z) = 1 / (1 + H(z)). (19)
When the integrator has the transfer function

H(z) = z^-1 / (1 − z^-1) (20)

Eq. 17 becomes

Y(z) = z^-1 X(z) + (1 − z^-1) N(z). (21)

Fig. 8 Block diagram of a first-order sigma-delta modulator.
We can see that the input passes through the modulator delayed by a unit delay, and the noise is filtered by a differentiator 1 − z^-1. The differentiator has a zero at the zero frequency (DC, direct current) and its response rises towards high frequencies. The noise spectrum is therefore shaped by this transfer function. SNR again depends on the oversampling ratio and is given by

SNR_SD1 = 6.02b − 3.41 + 30 log(OSR) dB. (22)
Doubling the OSR improves SNR by 9 dB, corresponding to 1.5 bits.

3.2.3 Higher-Order Sigma-Delta Modulators

The block diagram of a second-order sigma-delta modulator is shown in Fig. 9. It includes two integrators, one of which is nondelaying (1/(1 − z^-1)) and the other one delaying (z^-1/(1 − z^-1)). The output can be expressed as

Y(z) = z^-1 X(z) + (1 − z^-1)^2 N(z). (23)
The noise is therefore filtered by a double differentiator. The SNR is given by

SNR_SD2 = 6.02b − 11.14 + 50 log(OSR) dB. (24)
Doubling the OSR improves SNR by 15 dB, corresponding to 2.5 bits. The topology shown in Fig. 9 can be generalized to create an Mth order sigma-delta modulator, for which the output can be expressed as

Y(z) = z^-1 X(z) + (1 − z^-1)^M N(z). (25)
Thus M times differentiation of the noise is achieved. An Mth order sigma-delta modulator improves the SNR by 6M + 3 dB for each doubling of the OSR, corresponding to M + 0.5 bits. However, such higher-order modulators are problematic from the stability point of view and alternative topologies have been developed, such as multiple feedback, feedforward and cascaded structures. A common approach is to construct a higher-order modulator as a cascade of first- or second-order modulators. Such topologies are called multistage noise shaping or MASH modulators. An advantage of cascaded modulators is their stability.
Fig. 9 Block diagram of a second-order sigma-delta modulator.
The common principle is to feed the quantization error of the first modulator to the input of another modulator. An example of cascaded modulators is shown in Fig. 10. This is a third-order modulator which consists of a second-order modulator followed by a first-order modulator. The output of the 2-1 modulator is expressed as

Y(z) = z^-2 X(z) + (1 − z^-1)^3 N2(z) (26)
where N2(z) is the quantization noise of the second modulator.
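The noise shaping of Eq. (21) is easy to reproduce in simulation. The following C sketch is a behavioral model of the first-order loop of Fig. 8 with a 1-bit quantizer and an ideal integrator; it is an illustration under these assumptions, not a circuit-level description:

/* First-order sigma-delta modulator with a 1-bit quantizer.
   Input samples x[i] in (-1, 1); output bits y[i] are +1/-1. */
void sd1_modulate(const double *x, int *y, int n)
{
    double integ = 0.0;                  /* integrator state     */
    double fb = 0.0;                     /* 1-bit DAC feedback   */
    for (int i = 0; i < n; i++) {
        integ += x[i] - fb;              /* input minus feedback */
        y[i] = (integ >= 0.0) ? 1 : -1;  /* quantizer Q          */
        fb = (double)y[i];               /* feed output back     */
    }
}

An FFT of the output bit stream for a low-frequency sine input shows the quantization noise rising towards high frequencies, as predicted by the 1 − z^-1 noise transfer function.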
Fig. 10 Block diagram of a third-order (2-1) cascaded modulator.
3.2.4 Sigma-Delta A/D Converters

A sigma-delta ADC consists of a sigma-delta modulator and a digital decimation filter, followed by sampling rate reduction. The achievable resolution depends on the order of the sigma-delta modulator and the oversampling ratio. The decimation filter typically consists of at least two stages, where the first stage is made of one or more averaging (sinc) filters. A multistage decimation filter example is shown in Fig. 11 for decimation by the factor 32. H1(z) and H2(z) are optimized decimation filters and H3(z) is a droop correction filter, serving the purpose of equalizing the passband response. [9] [1] The best number of sinc filters for an Mth order sigma-delta modulator is M + 1. Then only little aliased noise is passed, as the slope of attenuation of this lowpass filter at least equals the slope of the rising quantization noise [8]. An efficient way to implement the sinc filter is shown in Fig. 12. The filter has the transfer function

Hsinc(z) = 2^-P ( (1 − z^-K) / (1 − z^-1) )^(M+1) (27)
Fig. 11 Multistage decimation filter for decimation by 32. The downward arrow indicates decimation, followed by the decimation factor.
where K is the decimation ratio of the sinc filter. The scaling factor 2^-P has to satisfy the condition

2^-P ≤ (1/K)^(M+1). (28)

It should be noticed that although the structure of Fig. 12 may have occasional overflows in the internal nodes, the result is still correct if proper scaling and two's complement arithmetic are used.
Fig. 12 Implementation of the averaging filters for decimation by factor K.
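A behavioral C model of the structure in Fig. 12 clarifies how Eq. (27) is computed: M + 1 cascaded integrators run at the high rate, the signal is decimated by K, and M + 1 cascaded differentiators (combs) run at the low rate. Wordlengths are simplified here; unsigned arithmetic provides the modulo wraparound that, as noted above, makes intermediate overflows harmless, and the 2^-P scaling is left to the caller:

#include <stdint.h>
#define STAGES 3   /* M + 1 sinc stages for an M = 2 modulator */

/* Produces one output sample per K input samples. */
uint32_t sinc_decimate_step(const uint32_t *in, int K,
                            uint32_t integ[STAGES],
                            uint32_t comb[STAGES])
{
    uint32_t v = 0;
    for (int n = 0; n < K; n++) {        /* integrators at K*fs */
        v = in[n];
        for (int s = 0; s < STAGES; s++)
            v = integ[s] += v;
    }
    for (int s = 0; s < STAGES; s++) {   /* combs at fs */
        uint32_t d = v - comb[s];
        comb[s] = v;
        v = d;
    }
    return v;                            /* apply 2^-P scaling externally */
}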
4 Digital to Analog Converters

In principle, DAC operation could be achieved by generating a sequence of weighted impulses from the digital samples and lowpass filtering to remove the image frequencies. Such a filter is called a reconstruction filter. In practice, the DAC output is usually taken from an S/H circuit. The sinc type response of the S/H circuit causes a droop of about −3.9 dB at the Nyquist frequency, as illustrated in Fig. 3. The reconstruction filter may be designed to compensate for the droop, or compensation may be built into the digital signal processing algorithms. DAC architectures are based on either Nyquist-rate operation or oversampling. Important performance parameters for DACs (see Section 2) are: DNL, INL, SFDR, settling time, power dissipation, maximum clock frequency and output voltage swing. [8]
4.1 Nyquist-rate D/A Converters

Probably the simplest DAC principle is pulse width modulation (PWM), where the duty cycle of a binary voltage or current signal is altered according to the digital word. The output may be lowpass filtered with an analog filter. PWM is used widely, e.g., in electric motor control (see Chapter 1).

4.1.1 R-2R D/A Converters

The R-2R converters operate in either current mode or voltage mode. A current-mode R-2R DAC is shown in Fig. 13. Each switch is controlled by a bit of the word to be converted, b1 . . . bb. The resistor network creates a voltage division so that the branch currents have a binary-weighted relationship. The current through the 2R resistor at the kth level is

Ik = (1/2R) (2^(k−1)/2^b) (Vref+ − Vref−), k = 1, . . . , b. (29)

The branch currents are added by the operational amplifier. The output voltage is

Vout = Vref− ± R Σ_{k=1}^{b} (bk Ik) (30)
with a plus or minus sign depending on whether Vref+ < Vref− or Vref+ > Vref−, respectively. One drawback of this traditional current-mode R-2R DAC is that the output voltage swing is limited to one half of the supply voltage. Other R-2R ladder based structures are also used, each having its pros and cons. [1]

4.1.2 Current-Steering D/A Converters

Current-steering DACs are suitable for very high speed applications, such as direct radio frequency (RF) synthesis. For example, 12-bit DACs are available for sample rates exceeding 4 GHz. The high performance is due to the principle of current switching: there is no need to charge and discharge large internal capacitances. The basic principle of a current-steering DAC is illustrated in Fig. 14. There is an array of binary-weighted current sources and each of the currents is directed to either the output or ground depending on the corresponding bit of the binary word. The currents are summed and converted to voltage by the resistor and the amplifier. [8]

4.1.3 Thermometer-Code D/A Converters

Thermometer-coding is often used in DACs. A current-steering DAC for 3-bit words could be constructed as shown in Fig. 15, where all the current sources have equal strength.
Fig. 13 A current-mode R-2R DAC.
One of the advantages of thermometer-code is that there is less glitching than in binary-weighted schemes.

4.1.4 Segmented D/A Converters

A popular approach to designing current-steering DACs is to use segmented converters. Segmentation means that unary weighting is used for some of the MSB current sources and binary weighting for the rest of the bits. Segmentation can improve converter throughput and reduce spurs compared to a fully binary-weighted scheme, but adds to the power consumption and complexity of the DAC.
Fig. 14 A current-steering DAC.

Fig. 15 A DAC using thermometer-code for current steering.
The architecture of a 10-bit segmented current-steering DAC is shown in Fig. 16 [3]. The four MSBs are thermometer-coded and the six LSBs are binary-weighted. The switching principle in a segmented DAC is illustrated in Fig. 17 using a 6-bit converter as an example. Here thermometer-coding is used for the two MSBs and this 3-bit code controls the three leftmost switches to steer the current to either the output or to the ground. The four LSBs are binary-coded and the corresponding switches are attached to an R-2R ladder that implements the binary weighting. [8]
Fig. 16 Architecture of a 10-bit segmented current-steering DAC.
4.2 Oversampled D/A Converters

Oversampled D/A converters can be constructed using digital sigma-delta modulators. The block diagram of a sigma-delta modulator-based DAC is shown in Fig. 18. In many ways the system is a dual of a sigma-delta ADC. Modulator topologies such as those described in Section 3 can be used. The input is a digital word that is sampled at the Nyquist rate (or a slightly higher rate). The signal is oversampled by the OSR factor, denoted by the upward arrow in the figure.
Fig. 17 A 6-bit segmented current-steering DAC.
Oversampling is in principle carried out by inserting an appropriate number of zero-valued samples between each two original samples, and filtering the resulting sparse signal with a digital interpolation filter. The interpolated signal at the higher sampling rate occupies a bandwidth that is narrower by the factor OSR than the original at the lower sampling rate. The signal is processed by a sigma-delta modulator which includes a quantizer and performs noise shaping, i.e., the spectrum of the quantization noise is shaped according to the order of the modulator. The quantized output of the modulator is converted to analog with a 1-bit DAC. Such a DAC, generating only two distinct output levels, is inherently linear. Finally the output is filtered by an analog lowpass filter. Because of the oversampling, the specifications of the lowpass filter can be quite relaxed. However, the filter order should be at least one higher than the order of the sigma-delta modulator in order to match the rising spectrum of quantization noise. Sigma-delta DACs are commonly used in applications requiring very high resolution, such as digital audio, because of the high linearity and low cost of such converters.
Fig. 18 Block diagram of a sigma-delta modulator-based DAC.
5 Switched-Capacitor Circuits

Switched-capacitor (SC) circuits are widely used to implement analog signal processing in integrated circuits. Several manufacturers offer SC filter components that
can be configured by the user to implement a desired frequency response. The accuracy of SC circuits depends on capacitance ratios, which can be well controlled in the manufacturing process, unlike absolute resistance values. That is the main reason to prefer SC circuits over active-RC implementations. However, active-RC filters are often used as prototypes when designing SC filters. SC circuits use analog voltage samples on a continuous amplitude scale. The time domain operation is discrete, and Z-domain synthesis techniques can be used to synthesize SC filters. [5] [8] The building blocks of SC circuits are operational amplifiers, switches and capacitors. The switches in CMOS integrated circuits are constructed as a parallel connection of an NMOS and a PMOS transistor in an arrangement called the transmission gate, in order to avoid the threshold voltage loss of a single transistor switch. A clock signal is needed to control the switches. Normally a 2-phase clock is used with nonoverlapping clock phases, denoted by φ1 and φ2 in the following.
5.1 Amplifiers

The choice of the amplifier type is affected by several parameters, such as open-loop gain, bandwidth, common-mode rejection ratio (CMRR), PSRR, noise, linearity and output dynamic range. Common amplifier types used in SC circuits are the internally compensated two-stage Miller operational amplifier and the operational transconductance amplifier (OTA), which is compensated by the load capacitance. In SC circuits, the amplifier only needs to drive capacitive loads and the output impedance can be high, which is characteristic of the OTA. This allows designing the amplifier for higher speed and larger signal swing. If resistive loads need to be driven, an output buffer is included in the amplifier to achieve a low output impedance. [8] [11] In mixed-mode integrated circuits, the analog circuits need to operate together with digital circuits, and a low operating voltage is preferred because the supply voltage affects the power dissipation of the digital parts through a square relationship. The design of analog circuits for low-voltage operation is challenging; however, amplifiers can be implemented for sub-1 V operation with a gain that is large enough, e.g., for headphone applications [13].
5.2 Switched Capacitors

The basic idea of SC circuits is to simulate the functionality of a resistor with a switched capacitor [16]. Consider the circuit in Fig. 19(a). When the switch is in position 1, the capacitor is charged to voltage V1, storing a charge Q1 = CV1. If the switch is changed to position 2, the capacitor voltage changes to V2 and the amount of charge transferred from left to right is ΔQ = C(V1 − V2). Assuming that the switch operates with a period T, the average current will be
I = C(V1 − V2) / T. (31)

We may calculate the resistance R of an equivalent resistor, as in Fig. 19(b), that passes the same current by using the relation

I = (V1 − V2) / R (32)

yielding

R = T/C = 1/(fs C) (33)

where fs is the switching frequency.
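As a worked example of Eq. (33), a 1 pF capacitor switched at 1 MHz emulates a 1 MΩ resistor, a value that would consume considerable silicon area if implemented directly:

/* Equivalent resistance of a switched capacitor, Eq. (33).
   Example: sc_resistance(1e-12, 1e6) returns 1e6 ohms.     */
double sc_resistance(double C, double fs)
{
    return 1.0 / (fs * C);
}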
Fig. 19 (a) A switched capacitor. (b) Equivalent resistor.
5.3 Integrators

Integrators are used as the basic building blocks in SC filters. Fig. 20(a) shows an inverting and Fig. 20(b) a noninverting SC integrator circuit; these form the basis of many SC networks. Notice that although the switches are drawn as single NMOS transistors, a PMOS transistor is usually connected in parallel. The transfer function of the inverting integrator is given by

Hi(z) = Vo(z)/Vi(z) = −(C1/C2) · 1/(1 − z^-1) (34)

and the transfer function of the noninverting integrator is

Hn(z) = Vo(z)/Vi(z) = (C1/C2) · z^-1/(1 − z^-1). (35)
The unit delay z−1 in the numerator is due to the fact that it takes the time of both clock phases for Vi to have an effect on Vo . Both of these integrator circuits are insensitive to stray capacitances that may be present at the capacitor plates.
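In the time domain, Eqs. (34) and (35) correspond to simple recurrences, which the following behavioral C model makes explicit (ideal amplifier and switches assumed; state holds the previous output):

/* Inverting SC integrator, Eq. (34):
   vo[n] = vo[n-1] - (C1/C2) * vi[n]          */
double sc_int_inverting(double vi, double *state, double c1_over_c2)
{
    *state -= c1_over_c2 * vi;
    return *state;
}

/* Noninverting SC integrator, Eq. (35):
   vo[n] = vo[n-1] + (C1/C2) * vi[n-1]
   (the caller passes the previous input sample) */
double sc_int_noninverting(double vi_prev, double *state, double c1_over_c2)
{
    *state += c1_over_c2 * vi_prev;
    return *state;
}

The one-sample input delay of the noninverting version is exactly the z^-1 factor in the numerator of Eq. (35).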
Fig. 20 (a) Inverting SC integrator. (b) Noninverting SC integrator.
5.4 First-Order Filter

A first-order SC filter based on an active-RC prototype can be constructed as shown in Fig. 21. The transfer function is

H1(z) = Vo(z)/Vi(z) = −( (C1/C4)(1 − z^-1) + C2/C4 ) / ( 1 − z^-1 + C3/C4 ). (36)
This transfer function illustrates a key benefit of SC filters: the poles and zeros are determined by capacitance ratios and the clock frequency rather than by absolute component parameter values.
Fig. 21 First-order SC filter.
5.5 Second-Order Filter

A second-order filter is also known as the biquad. The structure of an SC biquad is shown in Fig. 22. The transfer function can be expressed as

H2(z) = Vo(z)/Vi(z) = −(a2 + a1 z^-1 + a0 z^-2) / (b2 + b1 z^-1 + z^-2) (37)

where the relationships between the transfer function coefficients and the capacitances are

K3 = a0 (38)
K2 = a2 − a0 (39)
K1 K5 = a0 + a1 + a2 (40)
K6 = b2 − 1 (41)
K4 K5 = b1 + b2 + 1. (42)
This structure is used for filters with a relatively low Q-value. Q in this case is the ratio of the pole's angular frequency to the 3 dB bandwidth [12]. Thus a large value of Q corresponds to a narrow bandwidth. In high-Q filters the structure of Fig. 22 tends to have a large spread of capacitance values, and alternative realizations have been developed that avoid this problem.
K4C 1
φ1
φ2
φ1
φ2
Vi
V1
φ1
K 1 C1
φ1
K5C 2
φ2 C2
φ2
φ2
-
+
φ1
φ2
φ1
φ1 K 2 C2
φ2
φ2 K3C 2
Fig. 22 Biquad SC filter.
φ1
φ2
φ2
C1 φ1
K 6C2
+
Vo
Higher-order SC filters can be built by cascading biquads. However, such cascade realizations are usually quite sensitive to component variations, especially in the filter passband. Less sensitive realizations can be obtained based on feedback structures developed for active-RC filters, such as the leap-frog or the follow-the-leader feedback structure. Alternatively, an SC circuit may simulate the operation of an LC ladder filter prototype. [19]
5.6 Differential SC Circuits

In many SC circuit applications, fully differential circuits are preferred over their single-ended counterparts. In a differential circuit the signal is carried by two wires and their voltage difference represents the signal value. Differential circuits are tolerant of power supply noise, clock feedthrough, switch charge injection errors and common-mode noise in general, because noise coupled equally to both wires does not affect the difference. A differential circuit also has a larger signal dynamic range. A drawback is larger circuit size, since the circuitry is doubled, except for the amplifiers. As an example, Fig. 23 shows a second-order sigma-delta modulator implemented as a differential SC circuit. In this circuit the amplifiers are differential class AB OTA amplifiers. [18]
Fig. 23 Implementation of a second-order SC sigma-delta modulator.
5.7 Performance Limitations

SC filters can achieve a dynamic range of 80-90 dB. The useful dynamic range is limited by noise of the operational amplifier, consisting of 1/f and thermal noise, as well as the kT/C noise in the MOS transistor switches. The accuracy of the frequency response depends on several factors. Variation of the capacitance ratios is of the order of 0.1% or higher, affecting the coefficients of the transfer function. If large capacitance ratios are needed, one of the capacitors is required to have a small size in order to keep the overall chip area reasonable, and its variations limit the accuracy. There are also parasitic capacitances between the source and drain diffusions of the transistors and the substrate in an integrated circuit. The settling time of the operational amplifier is the main factor limiting the usable clock rate in an SC filter. The settling time is affected by the finite bandwidth and gain of the amplifier. The amplifier should be fast enough for the charge transfer to be completed during a clock phase. A highly selective filter, i.e., one with high-Q poles and a high order, suffers the most response degradation due to residual settling errors. The error of the transient response is approximately of the form e^(−t/2τop), where t is time and τop is the time constant of the amplifier in the integrator configuration, approximately the inverse of the amplifier unity-gain angular frequency. If 10 time constants are allowed for settling, the filter passband edge frequency should be at least 25 times smaller than the unity-gain frequency of the amplifier. Because of anti-aliasing considerations the clock frequency usually needs to be at least eight times the desired filter edge frequency. [6]
6 Frequency Synthesis

The clock signal of an electronic system usually originates from a crystal oscillator. Crystal oscillators can generate the nominal frequency very accurately over a wide range of frequencies, from a few kilohertz up to several hundred megahertz. This is not always sufficient, as contemporary systems operate with clock rates in the gigahertz range. Mixed signal circuits can be used to multiply the clock frequency to a desired value.
6.1 Phase-Locked Loops

The block diagram of a phase-locked loop (PLL) is shown in Fig. 24. The input signal is a reference clock of frequency fr, originating from an outside source such as a crystal oscillator. A part of the PLL is a voltage controlled oscillator (VCO) that oscillates at the desired frequency when the PLL is in locked mode. A VCO can be constructed, for instance, as a ring connection of an odd number of inverters, where at least one of the inverters is current-starved so that its propagation delay can
be controlled by a control voltage [17]. The VCO output frequency is divided by an integer factor N and compared with the reference input in a phase-frequency detector (PFD) that generates a signal proportional to the phase and frequency difference between its inputs. This signal is lowpass filtered and used to control the VCO. In the locked mode, the output frequency is

fo = N fr. (43)
PLL frequency synthesizers are commonly used as local oscillators in radio receivers and transmitters. A limitation of using only integer values of N is that only frequencies separated by integer multiples of the reference frequency fr can be generated. This can be undesirable, e.g., if a radio standard requires narrow channel spacing.
Fig. 24 A phase-locked loop.
6.2 Fractional-N Frequency Synthesis

A fractional division ratio in the PLL of Fig. 24 can be achieved by periodically switching the division ratio between two or more integer values. Suppose that the switching period has length q, and the division ratio is N + 1 for the fraction p/q of the time and N for the (q − p)/q part of the time. The average output frequency is then given by

fo = fr (N + p/q). (44)

However, this approach has the problem of producing unwanted spurious tones at the synthesizer output. A more advantageous method is shown in Fig. 25. A digital sigma-delta modulator is used to generate the fractional part of the division ratio and the integer part I is given separately. The sigma-delta modulator is clocked by the reference frequency. When the input of the sigma-delta modulator is x(n) and the input wordlength is b, the synthesizer output frequency is

fo = N fr = (I + x(n)/2^b) fr. (45)
Therefore, the desired frequency can be approximated with an accuracy that depends on the wordlength b. Techniques have been developed that even remove this limitation, so that frequencies can be synthesized with perfect accuracy. [2]
Fig. 25 A fractional-N frequency synthesizer with a sigma-delta modulator.
6.3 Delay-Locked Loops

A delay-locked loop (DLL) is a circuit that can be used for clock synchronization in a multiple clock domain system. The block diagram of a DLL is shown in Fig. 26. The input clock fr goes into a voltage-controlled delay line (VCDL) and the local clock fo is taken from the output of the VCDL. Its phase is compared with the phase of the input in a phase detector. Depending on the direction of the phase difference, either an up or down signal is given to a charge pump that generates a control voltage for the VCDL. The phases of fr and fo will therefore be matched. A VCDL can be constructed, e.g., using current-starved inverters. A charge pump is built from two current sources, one for charging and the other for discharging a capacitor according to the state of the 'Up' and 'Down' signals. [17]
Fig. 26 A delay-locked loop.
6.4 Direct Digital Synthesis

Direct digital synthesis (DDS) is widely used for frequency synthesis in communication systems. The DDS principle is illustrated in Fig. 27. On each clock cycle the phase accumulator adds to its contents the frequency control word from the input. The phase accumulator register is used to address a phase to amplitude converter, which is a look-up table holding samples of the waveform to be generated, typically a sinusoid. The control word determines the step size of the accumulator increments and therefore also the step size with which the waveform is sampled. Typically the symmetry properties of the sine wave are exploited when constructing the table and only a quarter of the full wave is stored. Then the two MSBs are used to decode the quadrant of the sine wave and the remaining bits address a one-quadrant sine look-up table. A more efficient implementation can be achieved if, for instance, a second-degree polynomial approximation of the sine wave is used [14]. The DDS technique supports several modulation schemes. Frequency modulation is achieved by altering the input word. The frequency change is instantaneous and phase-continuous. Phase modulation can be accomplished by adding a programmable offset to the output of the phase accumulator. Depending on the wordlength, a very high resolution of the output frequency and phase tuning is possible using DDS. With a b-bit phase accumulator the frequency resolution is
Δf = fs / 2^b (46)
where fs is the clock frequency. A limit on the output frequency is set by the sampling theorem, i.e., fo ≤ fs /2.
Fig. 27 A direct digital synthesis system.
References

1. Baker, R.J.: CMOS Mixed-Signal Circuit Design. Wiley, Hoboken, NJ (2009)
2. Borkowski, M.: Digital Delta-Sigma Modulation: Variable Modulus and Tonal Behaviour in a Fixed-Point Digital Environment. Doctoral dissertation, University of Oulu, Acta Univ. Oul. C 306 (2008)
3. Deveugele, J., Steyaert, M.S.J.: A 10-bit 250-MS/s binary-weighted current-steering DAC. IEEE J. Solid-State Circuits 41, 320-329 (2006)
4. Freeman, R.L.: Telecommunication System Engineering, 3rd Ed. Wiley, New York (1996)
5. Geiger, R.L., Allen, P.E., Strader, N.R.: VLSI Design Techniques for Analog and Digital Circuits. McGraw-Hill, Singapore (1990)
6. Gray, P.R., Castello, R.: Performance limitations in switched-capacitor filters. In: Tsividis, Y., Antognetti, P.: Design of MOS VLSI Circuits for Telecommunications, pp. 314-333. Prentice-Hall, Englewood Cliffs, NJ (1985)
7. Ifeachor, E.C., Jervis, B.W.: Digital Signal Processing: A Practical Approach. Addison-Wesley, Wokingham, England (1993)
8. Johns, D.A., Martin, K.: Analog Integrated Circuit Design. Wiley, New York, NY (1997)
9. Lampinen, H.: Studies and Implementations of Low-Voltage High-Speed Mixed Analog-Digital CMOS Integrated Circuits. Doctoral dissertation, Tampere University of Technology, Publications 362 (2002)
10. Lampinen, H., Vainio, O.: A new dual-mode data compressing A/D converter. In: Proc. IEEE Int. Symp. on Circuits and Systems, pp. 429-432. Hong Kong (1997)
11. Lampinen, H., Vainio, O.: An optimization approach to designing OTAs for low-voltage sigma-delta modulators. IEEE Trans. Instrum. Meas. 50, 1665-1671 (2001)
12. Lee, T.H.: The Design of CMOS Radio-Frequency Integrated Circuits. Cambridge University Press, Cambridge, UK (1998)
13. Lee, K., Meng, Q., Sugimoto, T., Hamashita, K., Takasuka, K., Takeuchi, S., Moon, U.-K., Temes, G.C.: A 0.8 V, 2.6 mW, 88 dB dual-channel audio delta-sigma D/A converter with headphone driver. IEEE J. Solid-State Circuits 44, 916-927 (2009)
14. Lindeberg, J., Vankka, J., Sommarek, J., Halonen, K.: A 1.5 V direct digital synthesizer with tunable delta-sigma modulator in 0.13-μm CMOS. IEEE J. Solid-State Circuits 40, 1978-1982 (2005)
15. Oppenheim, A.V., Schafer, R.W.: Discrete-Time Signal Processing. Prentice Hall, Englewood Cliffs, NJ (1989)
16. Pallas-Areny, R., Webster, J.G.: Analog Signal Processing. Wiley, New York, NY (1999)
17. Rabaey, J.M., Chandrakasan, A., Nikolic, B.: Digital Integrated Circuits: A Design Perspective, 2nd Ed. Prentice Hall, Upper Saddle River, NJ (2003)
18. Ritoniemi, T., Karema, T., Tenhunen, H., Lindell, M.: Fully differential CMOS sigma-delta modulator for high performance analog-to-digital conversion with 5 V operating voltage. In: Proc. IEEE Int. Symp. on Circuits and Systems, pp. 2321-2326. Espoo, Finland (1988)
19. Sedra, A.S.: Switched-capacitor filter synthesis. In: Tsividis, Y., Antognetti, P.: Design of MOS VLSI Circuits for Telecommunications, pp. 272-313. Prentice-Hall, Englewood Cliffs, NJ (1985)
20. Vainio, O., Akopian, D., Astola, J.T.: Systematic design of encoder and decoder networks with applications to high-speed signal processing. In: Proc. IEEE Int. Symp. on Industrial Electronics, pp. 172-175. Athens, Greece (1995)
Part III
Programming and Simulation Tools Managing Editor: Rainer Leupers
C Compilers and Code Optimization for DSPs Björn Franke
Abstract Compilers play a central role in the software development tool chain for any processor and enable high-level programming. Hence, they increase programmer productivity and code portability while reducing time-to-market. The responsibilities of a C compiler go far beyond the translation of the source code into an executable binary and comprise additional code optimization for high performance and low memory footprint. However, traditional optimizations are typically oriented towards RISC architectures, which differ significantly from most digital signal processors. In this chapter we provide an overview of the challenges faced by compilers for DSPs and outline some of the code optimization techniques specifically developed to address the architectural idiosyncrasies of the most prevalent digital signal processors on the market.
1 Introduction

Compilers translate the source code written by the programmer into an executable program for the particular target machine. While compilers generally help the programmer work more productively through higher levels of abstraction, they have to meet many additional – and sometimes conflicting – requirements. For performance-critical DSP applications the highest efficiency of the generated code is expected and only a small overhead compared to manually written assembly code is tolerated. At the same time, high code density is paramount for memory-constrained and cost-sensitive DSP applications. Optimized code should be predictable, correct and amenable to debugging, and generally "well-behaved" irrespective of coding style and idiosyncrasies of the target architecture. In this chapter we explore some of
Björn Franke University of Edinburgh, School of Informatics, Informatics Forum, 10 Crichton Street, Edinburgh EH8 9AB, Scotland, United Kingdom, e-mail: [email protected]
the challenges faced by C compilers for DSPs and take a look at the optimizations that are applied to meet the strict requirements on performance and code size.
2 C Compilers for DSPs

C compilers for digital signal processors are faced with a number of challenges arising from idiosyncratic architectural features and the users' demand for near-optimal code quality. At the same time, C compilers for DSPs can be tuned for their particular digital signal processing application domain. Advantage can be taken of recurrent algorithmic features, and their mapping onto the underlying hardware can be improved over time. Before we look into these code optimizations in the next section we present a brief overview of the industrial context in which DSP compiler development takes place and of some DSP-specific compiler and C language extensions. Finally, we introduce a number of DSP benchmarks suitable for compiler and platform performance evaluation.
2.1 Industrial Context

Compilers are rarely developed from scratch; most frequently an existing compiler is ported to a new processor, or a compiler development toolkit is used to generate the processor-specific parts of a compiler from high-level machine descriptions. While this approach ensures rapid success in deploying a working compiler it may also have some shortcomings:

• Depending on the chosen compiler framework it may be difficult or costly to fit in a new, particular optimization critical to achieve a certain performance level for the key application of customer XYZ.
• A set of optimizations might improve the average performance on some DSP benchmark suite; however, individual benchmarks might show a performance degradation unacceptable to some customers.

A compiler might also correctly implement the ISO-C standard, but behave differently from the "unofficial standard" set by the GCC compiler, and customers complain about this apparent "bug". In addition, some users might want extended control over the compiler optimizations, either to fine-tune the optimizations to their application or to prevent the compiler from performing optimizations that might interfere with some manually applied code optimizations. In either case, compilers rarely meet the requirements of all the users and often this results in a pragmatic solution where a compiler is extended with additional switches controlled by the user, e.g. platform-specific #pragma's or compiler-known functions, or DSP-specific extensions to the C programming language are used.
2.2 Compiler and Language Extensions

In the performance-critical DSP environment both the compiler and the C programming language itself are frequently extended to narrow the gap between the high-level language and the target processor, thus enabling the compiler to generate "better" code. Programmers can use these extensions to convey extra information to the compiler. Thereby the compiler is directed to make explicit use of specific processor features such as specialized instructions or dual memory banks. Compiler-known functions extend the compiler with special functions and are available to the programmer as convenient shortcuts to custom DSP operations provided by the specific target processor. On the other hand, DSP-C and Embedded C are extensions to the ISO-C language standard [20] and provide the programmer with new, DSP-specific language keywords. This latter approach maintains a certain degree of portability and readability of the code while making use of DSP-specific hardware extensions to increase code efficiency.

2.2.1 Compiler-Known Functions

Compiler-known functions (CKFs), sometimes also called intrinsic functions, are function calls that are not mapped to regular function calls by the compiler, but are directly replaced by one or more assembly instructions. CKFs enhance the readability of the C code while at the same time they provide a convenient way of directing the compiler to explicitly emit a particular sequence of instructions. CKFs are useful in cases where the compiler is known not to generate the desired instruction, e.g. because its pattern is not tree-shaped and does not fit into the usual tree-covering based instruction selection algorithm employed by the compiler, or the programmer wants to exercise specific control over the code generation process for performance reasons. For instance, the compiler for the NXP TriMedia TM5250 processor defines a CKF ifir16 for computing the signed sum of products of signed 16-bit halfwords. This operation is useful in implementing complex arithmetic, but results in several assembly instructions when implemented in "plain" C rather than a CKF. The NXP compiler, however, has knowledge about this special function and maps what looks like an ordinary function call to ifir16 directly to a single assembly instruction of the same name. CKFs require a detailed understanding of the target processor by the programmer and code written using CKFs is machine-dependent. Portability can be reestablished, however, by providing a suitable set of simulation functions (written in "plain" ISO-C) for the CKFs that are less efficient, but functionally equivalent to their machine-specific counterparts.
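For example, a portable simulation function for ifir16 might look as follows. This is our sketch of such a fallback based on the description above (signed sum of products of the packed 16-bit halfwords), not NXP's actual header:

/* Portable ISO-C model of the ifir16 CKF: multiply the upper
   and lower signed 16-bit halves of a and b and sum the products. */
static inline int ifir16_sim(int a, int b)
{
    short a_hi = (short)(a >> 16), a_lo = (short)(a & 0xFFFF);
    short b_hi = (short)(b >> 16), b_lo = (short)(b & 0xFFFF);
    return (int)a_hi * (int)b_hi + (int)a_lo * (int)b_lo;
}

On the TriMedia target the call compiles to a single instruction; on any other platform the C body above is used instead, preserving functional behavior at lower performance.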
2.2.2 DSP-C and Embedded C

DSP-C [3] and Embedded C [12] are DSP-specific extensions to the ISO-C language that enable the programmer to gain control over hardware resources typically not available on general-purpose computing systems, but commonly found in DSPs, while at the same time writing portable code. In particular, support for fixed-point data types, multiple memories and circular buffers is provided. These extensions serve a dual purpose: First, they enhance the expressiveness of the C programming language and provide the programmer with constructs that target idioms (fixed-point data types, circular buffers) frequently encountered in DSP algorithms. Second, they shift the responsibility for the efficient use of non-standard hardware resources (dual memory banks) from the compiler to the programmer and, thus, help relieve the compiler of "hard" data partitioning and memory assignment problems. A more comprehensive discussion of the DSP-C standard can be found in [3].
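To give a flavor of these extensions, the fragment below uses the fixed-point types of Embedded C (ISO/IEC TR 18037, header stdfix.h). The dual-memory-bank placement is indicated only in comments, since the spelling of memory qualifiers is vendor-specific; the FIR kernel itself is our illustrative example:

#include <stdfix.h>             /* Embedded C fixed-point types */

/* FIR kernel in Embedded C style. With vendor memory qualifiers,
   coeff[] would be placed in one bank (e.g. X memory) and data[]
   in the other (e.g. Y memory) to enable dual memory accesses.  */
accum fir(const fract *coeff, const fract *data, int n)
{
    accum acc = 0;              /* wide accumulator for MAC results */
    for (int i = 0; i < n; i++)
        acc += (accum)coeff[i] * data[i];
    return acc;
}

A DSP compiler supporting these extensions can map the loop body directly onto the hardware multiply-accumulate unit instead of emulating fractional arithmetic with shifts and integer multiplies.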
2.3 Compiler Benchmarks

Measuring the success of a compiler in generating dense and efficient code, and the ability to compare the results across platforms or to hand-coded assembly implementations, largely depends on standardized benchmark applications. Early, but now largely outdated DSP benchmarks were developed in academia and aimed at supporting research in DSP code generation. For example, the DSPstone [47] and UTDSP [23] benchmark suites have been used in a large number of compiler research projects because they comprise compact, but compute-intensive DSP kernels written in C that operate on small data sets. The workload presented by these benchmarks, however, is not representative of modern DSP applications, and the lack of a clearly defined benchmarking methodology makes them less suitable for meaningful performance evaluations. More recently, the Embedded Microprocessor Benchmark Consortium (EEMBC) has developed a set of performance benchmarks for the hardware and software used in embedded systems. The EEMBC benchmarks include, but are not restricted to, DSP applications and cover different embedded application domains including automotive, consumer, digital entertainment, networking, office automation and telecom. The primary goal of EEMBC is to provide industry with standardized benchmarks, thus enabling processor vendors and their customers to compare processor performance for their chosen application domain. In addition, the EEMBC benchmarks provide a further benefit in that performance figures for both compiler-generated code ("out-of-the-box") and manually tuned algorithms ("optimized") are quoted. These figures give an interesting insight into the compilers' code generation abilities, as for some platforms the difference between the two performance figures can be quite significant. Unlike DSPstone and UTDSP the EEMBC benchmarks come closer to representing real-world applications; however, their greater "realism" comes at the cost of increased code size and complexity that sometimes makes it more difficult for the compiler engineer to review the generated code.
3 Code Optimizations

With efficiency being one of the most important design goals of digital signal processors it is critical for a compiler to generate the most efficient code for any DSP application. Hence, most DSP compilers routinely perform optimizations such as:

• rearranging code to combine blocks and minimize branching,
• eliminating unnecessary recalculations of values,
• combining equivalent constants,
• eliminating unused code and storage of unreferenced values,
• global register allocation,
• inlining or replacing function and library calls with program code,
• substituting operations to conserve resources,
• eliminating ambiguous pointer references when possible,
• and many more.
Some of the more idiosyncratic architectural features such as complex instruction patterns targeting frequently occurring DSP operations may still not be fully exploited by conventional compilers. Much of the early work on code optimization for embedded DSPs has therefore investigated improved code generation approaches for selecting machine-specific instructions [29], scheduling with register constraints [10], and register allocation [28]. For DSPs with heterogeneous data-paths traditional code generation approaches fail to produce efficient code, and compiler writers have developed more advanced, graph-based code generation techniques [27]. However, most modern DSP architectures have become more "compiler-friendly" and more conventional code generation techniques can be applied. In the following paragraphs we discuss a number of code optimization techniques specific to digital signal processors. We cover address code and control flow optimizations, loop and memory optimizations, and, eventually, optimizations for reduced code size. Conversion from a floating-point to a fixed-point implementation of an algorithm is usually not performed by a compiler. While the conversion can be automated [30], additional application knowledge about the expected dynamic range of the processed signals is required. A more comprehensive discussion of code optimization techniques for a wider range of embedded processors, including DSPs and multimedia processors, is presented in [24]. A general survey of compiler transformations for high-performance computing can be found in [4]. An excellent textbook introduction to the design and implementation of optimizing compilers is [32].
3.1 Address Code Optimizations

Typical code contains a large number of instructions for the computation of addresses of data stored in memory. These auxiliary computations are necessary, but compete for processing resources with the computations on actual user data. Hence, address code optimizations have been developed that aim at making address computations more efficient and, thus, increasing overall performance. In the following two paragraphs we discuss two of these optimizations that support the compiler in making better use of the address generation units common to many DSPs.

3.1.1 Pointer-to-Array Conversion

Many DSP architectures have specialized Address Generation Units (AGUs) (see Chapter 15), but early compilers were unable to generate efficient code for them, especially in programs containing explicit array references. Programmers, therefore, used pointer-based accesses and pointer arithmetic within their programs in order to give "hints" to the early compiler on how and when to use the post-increment/decrement addressing modes available in AGUs. For instance, consider the example in figure 1(a), a kernel loop of the DSPstone benchmark matrix2. Here the pointer accesses "encourage" the compiler to utilize the post-increment addressing modes of the AGU of a DSP. If, however, further analysis and optimization is needed before code generation, then such a formulation is problematic, as such techniques often rely on explicit array index representations and cannot cope with pointer references. In order to maintain semantic correctness compilers use conservative optimization strategies, i.e. many possible array access optimizations are not applied in the presence of pointers. Obviously, this limits the maximal performance of the produced code. It is highly desirable to overcome this drawback without adversely affecting AGU utilization. Although general array access analysis and pointer analysis are, without further restrictions, equally powerful and intractable, it is easier to find suitable restrictions of the array data dependence problem while keeping the resulting algorithm applicable to real-world programs. Furthermore, as array-based analysis is more mature than pointer-based analysis within available commercial compilers, programs containing arrays rather than pointers are more likely to be efficiently implemented. Pointer-to-array conversion [15] collects information from pointer-based code in order to regenerate the original accesses with explicit indexes that are suitable for further analyses. Example 1(b) shows the loop with explicit array indexes that is semantically equivalent to the previous loop in example 1(a). Not only is it easier to read and understand for a human reader, but it is amenable to existing array data-flow analyses. The data-flow information collected by these analyses can be used for e.g. redundant load/store elimination, software pipelining and loop parallelization. A further step towards regaining a high-level representation that can be analyzed by existing formal methods is the application of de-linearisation methods.
(a) Original loop:

    int *p_a = &A[0];
    int *p_b = &B[0];
    int *p_c = &C[0];
    for (k = 0; k < Z; k++) {
        p_a = &A[0];
        for (i = 0; i < X; i++) {
            p_b = &B[k*Y];
            *p_c = *p_a++ * *p_b++;
            for (f = 0; f < Y-2; f++) {
                *p_c += *p_a++ * *p_b++;
            }
            *p_c++ += *p_a++ * *p_b++;
        }
    }

(b) Loop after pointer-to-array conversion:

    for (k = 0; k < Z; k++) {
        for (i = 0; i < X; i++) {
            C[X*k+i] = A[Y*i] * B[Y*k];
            for (f = 0; f < Y-2; f++) {
                C[X*k+i] += A[Y*i+f+1] * B[Y*k+f+1];
            }
            C[X*k+i] += A[Y*i+Y-1] * B[Y*k+Y-1];
        }
    }

Fig. 1 Example: Pointer-to-Array Conversion.
De-linearisation is the transformation of one-dimensional arrays and their accesses into other shapes, in particular into multi-dimensional arrays. The code in figure 2 shows the example loop after application of clean-up conversion and de-linearisation. The arrays A, B and C are no longer linear arrays, but have been transformed into matrices. Such a representation enables more aggressive compiler optimizations such as data layout optimizations. Later phases in the compiler can easily linearize the arrays again for the automatic generation of efficient memory accesses.
    for (k = 0; k < Z; k++) {
        for (i = 0; i < X; i++) {
            C[k][i] = A[i][0] * B[k][0];
            for (f = 0; f < Y-2; f++) {
                C[k][i] += A[i][f+1] * B[k][f+1];
            }
            C[k][i] += A[i][Y-1] * B[k][Y-1];
        }
    }

Fig. 2 Example: Pointer-to-Array Conversion and Delinearization.
3.1.2 Offset Assignment

Address generation units (AGUs) perform address calculations in parallel to other ongoing computations of the DSP. Efficient utilization of these dedicated functional units depends on their effective use by the compiler. For example, an AGU might compute the address of the next data item to be accessed by adding a small constant modifier (e.g. +1 or −1) to the current data address, keeping the new address ready for the next memory operation. This, however, is only possible if the next data item is placed within this range. Placing variables in memory in such a way that, for a given sequence of memory accesses, the best use of parallel address computations can be made is one of the possible compiler optimizations targeting AGUs. The goal of offset assignment is then "to generate a data memory layout such that the number of address offsets of two consecutive memory accesses that are outside the auto-modify range is minimized" [45]. For example, consider the following sequence of accesses to the variables a, b, c, d, e, and f: S = {a, b, c, d, e, f, a, d, a, d, a, c, d, f, a, d}. Assuming an auto-modify range of ±1, i.e. address calculations referring to the data item stored at the address immediately before or after the current item are free, and a single index register, we compute the addressing costs for two different memory layout schemes in figure 3. Each time the index register needs to be written because of an access to an element not neighboring the current position, a cost of 1 is incurred. For the naïve scheme a b c d e f the total addressing costs are 9, whereas the improved memory layout b c d a f e reduces the total costs to 4. In general, the offset assignment problem has a large solution space and can be solved optimally only for small instances. Several heuristics have been developed, based on finding a maximum-weighted Hamiltonian path in the access graph using a greedy edge selection strategy [45].

Fig. 3 Example: Addressing costs of different memory layout schemes. The naïve layout a b c d e f incurs addressing costs of 9; the improved layout b c d a f e incurs addressing costs of 4.
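A minimal sketch (not from the chapter) of the cost model used above: given a layout pos[], which maps each variable to its assigned address, the function counts every transition in the access sequence that falls outside the assumed auto-modify range of ±1 with a single index register.

    /* pos[v] = address assigned to variable v; seq[] = access sequence of length n. */
    int addressing_cost(const int pos[], const int seq[], int n)
    {
        int cost = 0;
        for (int i = 1; i < n; i++) {
            int d = pos[seq[i]] - pos[seq[i - 1]];
            if (d < -1 || d > 1)   /* outside the auto-modify range: */
                cost++;           /* explicit reload of the index register */
        }
        return cost;
    }

Applied to the sequence S above, this function returns 9 for the layout a b c d e f and 4 for b c d a f e, matching figure 3.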
3.2 Control Flow Optimizations

Control flow optimizations, such as the elimination of branches to unconditional branches, branches to the next basic block, branches to the same target and many more, are part of every optimizing compiler. However, some particular architectural features of digital signal processors, such as zero overhead loops and predicated instructions, require specialized control flow optimizations.

3.2.1 Zero Overhead Loops

Zero overhead loops are an architectural feature found in many DSPs (see Chapter 15) that helps implement efficient code for loops with iteration counts known
at compile time. They do not incur the overhead of increment and branch instructions for each iteration; instead, a zero overhead loop is initialized once, and subsequently a fixed number of instructions is executed a specified number of times. Constraints regarding the number of instructions in the loop body, whether or not control flow instructions are permissible inside a zero overhead loop, the nesting of zero overhead loops, early exits or function calls complicate code generation for zero overhead loops, because a number of checks need to be performed [44][39]. A conventional loop implementation with an explicit termination test is shown in figure 4(a). For each of the 10 iterations a sequence of increment, comparison and branch instructions needs to be executed. Compare this to the code fragment in figure 4(b), where a zero overhead loop is used and no additional instructions are required after the initial loop setup. While zero overhead loops are conceptually simple to exploit within a compiler, it is the number and type of constraints that sometimes make it difficult to generate code for zero overhead loops for all applicable loops. For example, many compiler frameworks generate the loop control code before code for the loop body is generated. This is problematic if the architecture places constraints on the maximum size of the loop body of a zero overhead loop. Possible solutions include deferring the decision on zero overhead loop code generation, if the compiler framework is flexible enough to allow this, or relying on a conservative heuristic concerning the ratio of intermediate operations to machine instructions. The use of zero overhead loops not only increases performance, but reduces code size at the same time (see section 3.5).
(a) Loop with explicit test:

        mov.d  #1, d10
    LSTART:
        /* loop body */
        ...
    LTEST:
        mov.d  #1, d11
        add.d  d10, d11, d10
        mov.d  #10, d12
        slt.d  d10, d12, d13
        bnz.d  d13, LSTART
    LEND:
        ...

(b) Zero-overhead loop:

        do  #10, LEND
        /* loop body */
        ...
    LEND:
        ...

Fig. 4 Example: Code Generation for Zero-Overhead Loops.
In particular, zero overhead loops compare favorably in terms of code size to loop unrolling (see section 3.3.1), while often providing a similar or better performance gain.

3.2.2 If-Conversion

Predicated execution is a standard feature in many modern DSPs, and is supported by compilers through If-Conversion [13], i.e. a transformation which converts control dependences into data dependences. A program containing conditional branches is transformed into predicated code in such a way that instructions previously located in either of the two paths of a conditional branch are guarded by predicates that evaluate to either true (= instruction enabled) or false (= instruction disabled).
(a) Conditional branch:

          if (cond) Branch L1
          r2 = MEM[A]
          r1 = r2 + 1
          r0 = MEM[r1]
    L1:   r9 = r3 + r4

(b) After If-Conversion (p1 denotes the predicate for the true condition and p2 is its inverse, i.e. p1 = ¬p2; instructions guarded by (p2) execute only when p2 is true):

          p1, p2 = cond
    (p2)  r2 = MEM[A]
    (p2)  r1 = r2 + 1
    (p2)  r0 = MEM[r1]
          r9 = r3 + r4

Fig. 5 Example: If-Conversion.
A simple piece of code containing a conditional branch is shown in figure 5(a). The branch condition is evaluated and only one of the control-dependent paths is executed. In the code after If-Conversion shown in figure 5(b) the conditional branch has been eliminated and a predicate p2 is computed. Subsequently, instructions from both control flow paths are executed and the predicate controls which results are used. Note that the conditional instructions now depend on the predicate p2. The
code after If-Conversion does not contain any control dependences; these have been converted into data dependences. If-Conversion is a useful optimization as it avoids pipeline stalls due to mispredicted branches. However, If-Conversion does not guarantee performance improvements in all cases. For large conditionally executed code blocks, for example, conventional branches work better: in such cases many instructions need to be fetched and decoded by the processor that turn out to be disabled and, thus, waste valuable CPU time. In addition, deeply nested predicates quickly become impractical. For small, balanced branches, however, If-Conversion effectively increases processor pipeline utilization through a reduced number of stalls and higher flexibility in instruction scheduling.
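The same idea has a familiar source-level analogue (an illustration only; a predicated ISA performs this at instruction level, as in figure 5):

    if (a > b) max = a; else max = b;   /* conditional branch        */
    max = (a > b) ? a : b;              /* branch-free "select" form */

In both forms the control dependence on the comparison has become a data dependence on its result.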
3.3 Loop Optimizations

Most of the execution time of DSP applications is spent in loops. Hence, effective code optimization of loop structures promises significant overall performance improvements. Unfortunately, the optimization of loops is a much "harder" problem than optimizing straight-line code. This is because the individual iterations of a loop may not be independent of each other; data dependences might exist, i.e. data generated in one iteration of the loop is carried forward to another iteration where it is consumed. Clearly, any loop restructuring transformation must obey these dependences in order to preserve correctness. In the following paragraphs we present a number of loop transformations that are of particular importance to DSP kernels and applications. A more comprehensive list of loop optimizations can be found in, e.g., [46] or [4]. Within the compiler a suitable framework is used for representing and systematically manipulating loop structures. The two most popular frameworks for this purpose are unimodular transformations and the polyhedral model. The approach based on unimodular transformations represents the iterations of n-deep loop nests as integer points in an n-dimensional space, captured in matrix notation, upon which transformations described by unimodular matrices can be applied. These matrices can describe the loop reversal, loop interchange, and loop skewing transformations and combinations thereof. The polyhedral model goes beyond this and describes the statements contained in a loop nest as unions of sets of polytopes. The polytope model can represent imperfectly nested loops and affine transformations. Data dependences and boundaries of the polytopes are expressed as additional constraints. Both unimodular transformations and the polytope model are excellent formalisms for capturing loop transformations. Still, the key problem remains to identify a suitable sequence of loop transformations that eventually improves the overall performance of the loop under consideration [38].
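As a standard textbook illustration (not taken from this chapter), interchanging a doubly nested loop corresponds to applying the unimodular matrix

    U = | 0  1 |
        | 1  0 |

to each iteration vector: U (i, j)^T = (j, i)^T. Since det U = −1, the matrix is unimodular and thus invertible over the integers; loop reversal corresponds to negating a diagonal entry, and loop skewing to adding an off-diagonal entry to the identity matrix. A candidate transformation is legal if every dependence distance vector remains lexicographically positive after multiplication by U.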
3.3.1 Loop Unrolling

Loop unrolling is probably the single most important code transformation for any DSP application [41]. As the name suggests, this transformation unrolls a loop, i.e. the loop body is replicated one or more times and the step of the loop iteration variable is adjusted accordingly. A simple example where the loop body is duplicated and the loop step doubled is shown in figure 6.
(a) Original loop:

    for (i = 0; i < 16; i++) {
        /* i = 0, 1, 2, ... */
        c += a[i] * b[i];
    }

(b) Loop unrolled 2-fold:

    for (i = 0; i < 16; i += 2) {
        /* i = 0, 2, 4, ... */
        c += a[i] * b[i];
        /* i = 1, 3, 5, ... */
        c += a[i+1] * b[i+1];
    }
Fig. 6 Example: Loop Unrolling.
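Complementing figure 6: when the trip count is not known to be a multiple of the unroll factor, the compiler emits epilogue code for the leftover iterations (an issue revisited at the end of this subsection). A hedged sketch for 2-fold unrolling with an arbitrary trip count n:

    for (i = 0; i + 1 < n; i += 2) {
        c += a[i]   * b[i];
        c += a[i+1] * b[i+1];
    }
    for (; i < n; i++)        /* epilogue: at most one leftover iteration */
        c += a[i] * b[i];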
DSP applications spend most of their time in small, but compute-intensive loops. To improve overall performance, efficient hardware utilization is paramount, and one way of achieving this is to merge consecutive loop iterations. The main benefits from loop unrolling arise from a reduced number of control flow instructions (conditional branches) due to fewer iterations of the unrolled loop, and from better scheduling opportunities due to the larger loop body. For example, the number of iterations of the unrolled loop in figure 6(b) is half of that of the original loop in figure 6(a). As a consequence, the overhead associated with the comparison with the upper loop bound and the increment of the loop iteration variable is reduced by a factor of 2 (this reduction may be smaller if the original loop is implemented using zero-overhead loops, see Chapter 15). At the same time the size of the loop body is doubled. This provides the instruction scheduler with more instructions to hide the latencies of, e.g., memory accesses and to better exploit the available instruction level parallelism. For small loops complete unrolling can be beneficial, i.e. the entire surrounding loop is flattened. For larger loops – with either a large number of iterations or a large number of instructions contained in the loop body – this is not practical because of the resulting code bloat. Determining the optimal loop unroll factor is difficult [43][31], in particular if the target machine comprises an instruction cache. Code expansion resulting from loop unrolling can degrade the performance of the instruction cache. The added scheduling freedom can result in additional register pressure due to increased live ranges of variables, and the resulting memory spills and reloads may negate the benefits of unrolling. In addition, control flow instructions in the loop body complicate unrolling decisions, for example if the compiler cannot determine that a loop may take an early exit.
In cases where the loop iteration count is not divisible without remainder by the unroll factor, epilogue code containing the remaining few iterations needs to be generated (as in the sketch after figure 6). This further increases code size in addition to the already duplicated loop body.

3.3.2 Loop Tiling

Loop tiling [36] is a compiler optimization that helps improve data cache performance by partitioning the loop iteration space into smaller blocks (also called tiles). By adjusting the size of the working set to the size of the data cache, data reuse of array elements is maximized within each tile. The example in figure 7 illustrates the application of loop tiling to a simple matrix multiplication kernel. The original code in figure 7(a) contains a nested loop with the innermost loop operating on the data arrays a, b and c. The arrays a and b are accessed in row and column order, respectively. In particular, different iterations of the j-loop operate on the same elements of the array a, and the accesses to array b do not depend on the outer i-loop. This means that data elements from both a and b are reused for the computations of different elements c[i][j]. If the data arrays are large and the caches of the target machine small, the data cache does not have enough capacity to hold previously accessed data until it is reused, and thus more cache misses than necessary will occur. Loop tiling reduces the size of the "working set", thus increasing data locality and making sure that data will not get evicted from the cache before its reuse in a later loop iteration. The code fragment in figure 7(b) shows the matrix multiplication loop after loop tiling, where three new loop levels (ii, jj, and kk) have been introduced. Each of the new, inner loops operates on a small tile of size t of its "parent" loop (e.g. between i and i + t), whereas the outer loops now iterate over these tiles taking larger steps of size t. To achieve the best effect, loop tiling is frequently combined with array padding and loop interchange. Array padding adds a small number of "dummy" elements to an array. This can lead to better aligned data placement in memory and helps further reduce the number of cache misses. Loop interchange exchanges the order of two or more nested loops if it is safe to do so. This may further increase cache efficiency due to better data locality. Determining the optimal loop tiling factor to make the best use of the available caches is difficult, and both analytical [1] and empirical [11] approaches have been developed.
(a) Original loop:

    for (i = 0; i < N; i++) {
        for (j = 0; j < N; j++) {
            c[i][j] = 0.0;
            for (k = 0; k < N; k++) {
                c[i][j] += a[i][k] * b[k][j];
            }
        }
    }

(b) After loop tiling (tile size = t; the initialization of c is hoisted out of the tiled loop nest so that each element is zeroed only once):

    for (i = 0; i < N; i++)
        for (j = 0; j < N; j++)
            c[i][j] = 0.0;
    for (i = 0; i < N; i += t) {
        for (j = 0; j < N; j += t) {
            for (k = 0; k < N; k += t) {
                for (int ii = i; ii < i + t; ii++) {
                    for (int jj = j; jj < j + t; jj++) {
                        for (int kk = k; kk < k + t; kk++) {
                            c[ii][jj] += a[ii][kk] * b[kk][jj];
                        }
                    }
                }
            }
        }
    }
Fig. 7 Example: Loop Tiling.
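A minimal sketch of the array padding mentioned above (the values N and PAD are assumed for illustration and the best padding is cache-dependent): with a direct-mapped cache whose size divides the row size, the same columns of consecutive rows map to the same cache sets; a few dummy elements per row break this pathological alignment.

    #define N   64
    #define PAD  2               /* assumed padding value */
    float a[N][N + PAD];         /* only a[i][0..N-1] carry data;
                                    a[i][N..N+PAD-1] are dummies */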
3.3.3 Loop Reversal

Loop reversal is a simple loop transformation that "runs a loop backwards". In most cases loop reversal does not itself improve the generated code, but it acts as an enabler for better code generation, e.g. for zero-overhead loops, or for other optimizations such as loop fusion. Consider the example in figure 8. The original loop in 8(a) initializes the elements of array a with a constant value, starting with the last element and working towards the beginning. If the target platform supports zero-overhead loops (see Chapter 15) and the compiler is able to generate appropriate code (see section 3.2.1), it still might be that the particular shape of the presented loop defeats the pattern recognition for zero-overhead loop code generation. In this case the reversal of the loop, as shown in figure 8(b), can help create a canonical form more readily amenable to zero-overhead loop code generation. This is an example of a loop transformation that does not immediately improve code quality, but enables other parts of the compiler to produce better code. Loop reversal is only legal for loops that have no loop-carried dependences.
(a) Original loop:

    for (i = N-1; i >= 0; i--) {
        a[i] = 0;
    }

(b) Loop after loop reversal:

    for (i = 0; i < N; i++) {
        a[i] = 0;
    }

Fig. 8 Example: Loop Reversal.
3.3.4 Loop Vectorization

Most media processors and some DSPs, such as the Analog Devices SHARC and TigerSHARC processors, provide short vector (or SIMD) instructions, capable of executing one operation on multiple data items in parallel (see Chapter 15). This parallelism, however, defeats the standard code generation methodology based on tree pattern matching with dynamic programming, where intermediate representation data flow trees are covered with instruction patterns. Taking full advantage of SIMD instructions requires a different approach, where the compiler operates on full data flow graphs. Consider the example in figure 9. The original loop in figure 9(a) operates on 16-bit data stored in the arrays B and C. The loop has already been unrolled 2-fold (see paragraph 3.3.1) in order to prepare it for vectorization. It simply adds the corresponding elements of the two arrays and stores the computed sum in the elements of array A. The iterations of the loop are completely independent of each other, and pairs of successive iterations can be combined and executed simultaneously by a single vector add instruction, as shown in figure 9(b).
(a) Original loop (unrolled 2-fold):

    void f(short *A, short *B, short *C)
    {
        int i;
        for (i = 0; i < N; i += 2) {
            A[i]   = B[i]   + C[i];
            A[i+1] = B[i+1] + C[i+1];
        }
    }

(b) Loop after loop vectorization (VADD_2 denotes a two-way SIMD add applied to the element pairs starting at index i):

    void f(short *A, short *B, short *C)
    {
        int i;
        for (i = 0; i < N; i += 2) {
            VADD_2(&A[i], &B[i], &C[i]);
        }
    }
Fig. 9 Example: Loop Vectorization.
A compiler targeting SIMD operations must ensure that the way data is stored in memory matches the data layout constraints for the operands of the vector operations. Furthermore, the compiler must be capable of identifying suitable operations in a larger data flow graph that can be replaced with a single SIMD instruction. Due to their parallel nature, these operations might be partially or fully independent of each other. Code selection for short vector instructions can be accomplished in a standalone preprocessor [37] or fully integrated in the code generation stage of the compiler [35][19][25]. Another approach is taken in [14], where high-level algorithmic specifications of DSP algorithms are translated to C code and vectorization is already considered in this automated C code generation stage.
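A hedged sketch using the GCC/Clang generic vector extension makes the two-lane add of figure 9 explicit; the VADD_2 intrinsic in the figure is assumed to map to such a two-lane operation, and n is assumed even:

    typedef short v2hi __attribute__((vector_size(4)));   /* two 16-bit lanes */

    void f(short *A, short *B, short *C, int n)
    {
        for (int i = 0; i < n; i += 2) {
            v2hi vb, vc, va;
            __builtin_memcpy(&vb, &B[i], sizeof vb);   /* alignment-safe loads */
            __builtin_memcpy(&vc, &C[i], sizeof vc);
            va = vb + vc;                              /* one two-lane addition */
            __builtin_memcpy(&A[i], &va, sizeof va);
        }
    }

The memcpy-based loads and stores avoid undefined behavior when the arrays are not suitably aligned, which is exactly the kind of data layout constraint the compiler must otherwise prove or enforce.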
3.3.5 Loop Invariant Code Motion

Loop Invariant Code Motion [39] moves loop-invariant expressions out of the loop body and, hence, reduces the number of instructions executed per loop iteration. An example of this transformation is shown in figures 10 and 11. After the original loop in figure 10(a) has been lowered, the address computations for the array access in the loop body become explicit (see figure 10(b)). The first part of the address expression, however, depends only on the outer loop iterator i and is constant for any value of the inner loop iterator j. "Hoisting" this constant subexpression out of the j-loop results in the code shown in figure 11(a). Not only does this code avoid the repeated computation of a constant expression, it is also amenable to further optimization. Strength Reduction (see also section 3.5.1) can be applied to replace the costly multiplication operation with a cheaper addition, and Induction Variable Creation can eliminate the use of the loop iterator j in favor of an induction variable, incremented in each iteration, that can be mapped efficiently to the post-increment addressing modes (see Chapter 15) offered by many DSPs.
(a) Original code:

    for (i = 0; i < M; i++) {
        for (j = 0; j < M; j++) {
            a[i][j] = 5 * j;
        }
    }

(b) Code after lowering:

    for (i = 0; i < M; i++) {
        for (j = 0; j < M; j++) {
            *(a + M*i + j) = 5 * j;
        }
    }

Fig. 10 Example: Loop Invariant Code Motion, Part 1.
(a) After Loop Invariant Code Motion:

    for (i = 0; i < M; i++) {
        pa = a + M*i;
        for (j = 0; j < M; j++) {
            *(pa + j) = 5 * j;
        }
    }

(b) After Induction Variable Creation:

    ivar_pa = a;
    for (i = 0; i < M; i++) {
        ivar1_j = ivar_pa;
        ivar2_j = 0;
        for (j = 0; j < M; j++) {
            *(ivar1_j) = ivar2_j;
            ivar1_j++;
            ivar2_j += 5;
        }
        ivar_pa += M;
    }

Fig. 11 Example: Loop Invariant Code Motion, Part 2.
Before any loop-invariant computation can be moved out of a loop, the compiler needs to verify this property. This is easy for genuine constants. For general expressions
it must be shown that all reaching definitions of the arguments of an expression are outside the loop. After that, a new block of code containing the hoisted, loop-invariant computation is inserted as a pre-header to the loop.

3.3.6 Software Pipelining
Fig. 12 Software Pipelining: sequential execution of iterations 1–5 contrasted with software-pipelined execution, in which consecutive iterations overlap in time and form a prologue, a repeated body, and an epilogue.
Software Pipelining [2] is a low-level loop optimization and VLIW scheduling technique. It aims at exploiting parallelism at the instruction level by overlapping consecutive iterations of a loop so that a faster execution rate is realized (see figure 12). Instead of executing loop iterations in strictly sequential order, software pipelining partially overlaps loop iterations. After a number of initial iterations, which are summarized in a loop prologue, a stable and repetitive scheduling pattern for the loop body emerges. In the example, instructions from three loop iterations are executed simultaneously. Eventually, the epilogue code executes the remaining instructions that do not belong to this repeated pattern towards the end of the loop. Software pipelining is highly important for VLIW processors. More details can be found in Chapter 22.
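A source-level illustration of the idea (a sketch only, not the scheduler's actual output; n is assumed to be at least 1): in a multiply-accumulate loop, the loads for iteration i+1 are overlapped with the multiply-add of iteration i.

    float s = 0.0f;
    float ta = a[0], tb = b[0];             /* prologue: first loads     */
    for (i = 0; i < n - 1; i++) {
        float na = a[i + 1], nb = b[i + 1]; /* body: next loads overlap  */
        s += ta * tb;                       /* ... the current MAC       */
        ta = na; tb = nb;
    }
    s += ta * tb;                           /* epilogue: last MAC        */

On a machine where loads have a multi-cycle latency, this overlap is what allows one MAC to issue per cycle in the steady state.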
3.4 Memory Optimizations

Memory optimizations play an important role when memory latency or bandwidth forms a bottleneck. In this section we discuss one memory optimization of particular importance to DSPs, as it deals with dual memory banks and the partitioning and assignment of variables to these memories.

3.4.1 Dual Memory Bank Assignment

Due to the streaming nature of most digital signal processing applications, memory bandwidth is critical. To accommodate these bandwidth requirements, digital signal processors are typically equipped with dual memory banks (see Chapter 15) that enable simultaneous access to two operands if the data is partitioned appropriately.
    void lmsfir(float input[], float output[],
                float expected[], float coefficient[],
                float gain)
    {
        /* Variable declarations omitted. */
        sum = 0.0;
        for (i = 0; i < NTAPS; ++i) {
            sum += input[i] * coefficient[i];
        }
        output[0] = sum;
        error = (expected[0] - sum) * gain;
        for (i = 0; i < NTAPS-1; ++i) {
            coefficient[i] += input[i] * error;
        }
        coefficient[NTAPS-1] = coefficient[NTAPS-2]
                             + (input[NTAPS-1] * error);
    }

Fig. 13 Example: lmsfir function.
Efficient assignments of variables to memory banks can have a significant beneficial performance impact, but are difficult to determine. For instance, consider the example shown in figure 13, the lmsfir function from the UTDSP lmsfir_8_1 benchmark. The function has five parameters that can be allocated to two different banks. Local variables are stack allocated and outside the scope of explicit memory bank assignment, as the stack sits on a fixed memory bank on most target architectures. In figure 14 four of the possible legal assignments are shown. In the first case, in figure 14(a), all data is placed in the X memory bank. This is the default case for many compilers where no explicit memory bank assignment is specified. Clearly,
no advantage of dual memory banks can be realized, and this assignment results in an execution time of 100 cycles on, e.g., an Analog Devices TigerSHARC TS-101 platform. The best possible assignment is shown in figure 14(b), where input and gain are placed in X memory and output, expected, and coefficient in Y memory. Simultaneous accesses to the input and coefficient arrays have been enabled and, consequently, this assignment reduces the execution time to 96 cycles. Interestingly, an "equivalent" assignment scheme as shown in figure 14(c), which simply swaps the assignment between the two memory banks, does not perform as well. In fact, this "inverted" scheme derived from the best assignment results in an execution time of 104 cycles, a 3.8% slowdown over the baseline, due to additional register-to-register moves introduced by the compiler. The worst possible assignment scheme is shown in figure 14(d). Even there, input and coefficient are placed in different banks, enabling parallel loads, but this scheme takes 110 cycles to execute, a 9.1% slowdown over the baseline.
Fig. 14 Four memory bank assignments for the lmsfir function resulting in different execution times.
An early attempt to solve the dual memory bank assignment problem is described in [40]. The authors produced a low-level solution that performs a greedy minimum-cost partitioning of the variables, using the loop-nest depth of each interference as a priority heuristic. The problem was formulated as an interference graph where two nodes interfere if they represent a potentially parallel access in a basic block. This is a very intuitive representation of the problem and has been used in many other solutions since. Another approach [21] uses synchronous data flow specifications and the simple conflict graphs that accompany such programs. Three techniques were proposed: a
traditional coloring algorithm, an integer linear programming based algorithm, and a low-complexity greedy heuristic using variable size as a priority metric. Another integer linear programming solution, prior to [21], is described in [26]. This technique runs after the compiler back-end has generated object code, which lets it consider spill code and ignore accesses that do not reach memory. The assignment problem is modeled as an interference graph where two variables interfere if they are in the same basic block and there is no dependence between them. The interference is weighted according to the number of potentially parallel memory accesses. More recently, a more accurate integer linear programming model for DSP memory assignment has been presented [17]. The model described there is considerably more complicated than the one previously presented in [26], but provides larger improvements. Finally, a technique that operates at a higher level than the other methods is described in [42]. It performs memory assignment on the high-level IR, thus allowing the coloring method to be used with each of the back-ends within the compiler. The problem is modeled as an independence graph, and the weights between variables take account of both execution frequency and how close the two accesses are in the code. Source-level transformations targeting DSP-C as an output language [34] and the use of machine-learning techniques [33] for dual memory bank assignment have been proposed in recent years.
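At source level, bank assignment can be expressed in the spirit of DSP-C and the TR 18037 named address spaces [3], [12]. The sketch below is hedged: the qualifier spellings _X and _Y are assumed, implementation-defined identifiers (real compilers use their own names), and NTAPS = 8 is inferred from the lmsfir_8_1 benchmark name.

    #define NTAPS 8                 /* assumed from the benchmark name */
    float _X input[NTAPS];          /* force input into bank X */
    float _Y coefficient[NTAPS];    /* coefficient into bank Y, enabling the
                                       dual loads of figure 14(b) */

This is essentially the output form targeted by the source-level approach of [34].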
3.5 Optimizations for Code Size

Most digital signal processors comprise fast, but small, on-chip memories that store both data and program code. While it is possible to provide additional, external memory, this may not be desirable for reasons of cost and PCB integration. In such a situation the size of the available on-chip memories places hard constraints on the code size, which then becomes an optimization goal at least as important as performance. Incidentally, smaller code may also be faster code, especially if an instruction cache is present: a lower instruction count and a reduced number of instruction cache misses contribute to higher performance. However, there is no strict correlation between code size and performance. For example, some optimizations for performance, such as loop unrolling (see paragraph 3.3.1), increase code size, whereas more code-size aware optimizations, such as redundant code elimination, may lead to lower performance. Ultimately, it is the responsibility of the application design team to trade off the (real-time) performance requirements of their application against the memory constraints set by their chosen processor. In the following paragraphs we present a number of compiler-driven optimizations for code size that are applicable to most DSPs and do not require any additional hardware modules, e.g. for code decompression. A comprehensive survey of code-size reduction methods can be found in [5].
3.5.1 Generic Optimizations for Code Compaction

In this paragraph we discuss a number of code optimizations that are routinely applied by most compilers as part of their effort to eliminate code redundancies. Eliminating redundant or unreachable code reduces code size without much effort and should be considered as a first measure if code size is a concern.

Redundant Code Elimination aims at eliminating those computations in a program whose results are guaranteed, at a given point, to already be available as the result of an earlier computation. If such a computation can be identified, then the redundant code fragment can be eliminated and the result of the previous computation used instead. One common variant of redundant code elimination is Common Subexpression Elimination, where the identification and substitution of redundant code is restricted to subexpressions of a generally larger expression. While the behavior of the code does not change, it is important to note that the expected performance impact depends on a number of factors. On the one hand, fewer instructions are issued due to the elimination of redundant computations, which has a positive impact on performance. On the other hand, register pressure is increased, as a previously computed result needs to be stored for its later use, and the increased register use may lead to register spilling to memory. If this is the case, any performance and even code size benefit may get negated. In general, great care needs to be taken when applying redundant code elimination to make sure the desired code size reduction effect is achieved.

Unreachable Code Elimination detects and eliminates those computations that cannot be reached on any control flow path through the program. Unreachable code fragments can exist for a number of reasons. First, the user may have defined functions that are not used anywhere in the program. Second, the user may use conditional statements to enable/disable certain code fragments, e.g. for debugging purposes, rather than using the more appropriate C pre-processor macros for conditional compilation. Finally, unreachable code can result from the earlier application of other optimizations. Consider the example in figure 15(a). The code in the else block is not reachable because the conditional expression of the if-statement will always be true; however, at this stage the compiler cannot detect this. After constant propagation and constant folding (see figure 15(b)) the conditional expression only makes use of constant values rather than variables, and 4 > 2 can be replaced with 1 (= true). Subsequently, the now provably unreachable code in the else block of the if-statement can be eliminated, as shown in figure 15(c). Care needs to be taken when applying unreachable code elimination because it is not always possible to accurately work out the control flow relations in a program. For example, if a program makes use of both the C and assembly languages, certain C routines that appear to be unreachable may get called from the assembly code. The compiler needs to be conservative to ensure correctness and cannot make any assumptions about, e.g., functions with external linkage.
(a) Original code:

    x = 3;
    y = 2;
    if (x+1 > y) {
        r = x;
    } else {
        /* Unreachable */
        r = y;
    }

(b) Code after constant propagation and folding:

    x = 3;
    y = 2;
    if (4 > 2) {
        r = x;
    } else {
        /* Unreachable */
        r = y;
    }

(c) Code after unreachable code elimination:

    x = 3;
    y = 2;
    r = x;
    /* Unreachable code eliminated. */

Fig. 15 Example: Unreachable Code Elimination.
Dead Code Elimination targets computations whose results are never used. Note that a code fragment may be reachable, but still dead, if the results computed by this fragment are not used (or output) at any later point in the program. "Results" of a computation include any side effects, such as exceptions generated or signals raised by this piece of code, that affect other parts of the program under optimization. Only if it can be guaranteed that a fragment of code does not produce any direct or indirect results in this strict sense that are consumed elsewhere (e.g. it cannot raise a division-by-zero exception or similar) can it be eliminated.

Strength Reduction [7] replaces a costly sequence of instructions with an equivalent, but cheaper (= faster or shorter), instruction sequence. For example, an optimizing compiler may replace the operation 2 × x with either x + x or x << 1. Strength reduction is most widely known for its use inside loops, where repeated multiplications are replaced with cheaper additions. In addition to the obvious performance improvement from replacing a costly operation with a faster one, strength reduction often decreases the total number of instructions in a loop and, thus, contributes to code size reductions. If applied in combination with other optimizations such as algebraic reassociation, strength reduction further reduces the number of operations. On the other hand, strength reduction may increase the number of additions as the number of multiplies is decreased.

Certainly, optimizations for code size interfere with each other as much as optimizations for performance do. Hence, it is a difficult problem to select the most appropriate set of code transformations (and their order of application) to achieve a maximum code size reduction effect. An iterative approach to compiler tuning, where many different versions of the same code are tested and statistical analysis is employed to identify the compiler options that have the largest effect on code size, is presented in [18]. The use of genetic algorithms to find optimization sequences that generate small object codes is the subject of [8]. Code compaction in a binary rewriting tool is described in [9]. For architectures that provide an additional narrow instruction set, the generation of mixed-mode instruction width code is discussed in, e.g., [22].
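A textbook illustration of the in-loop case of strength reduction described above: the per-iteration multiplication i * 8 is replaced by a running sum.

    for (i = 0; i < n; i++)                  /* before: one multiply per iteration */
        a[i] = i * 8;

    for (i = 0, t = 0; i < n; i++, t += 8)   /* after: addition only */
        a[i] = t;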
3.5.2 Coalescing of Repeated Instruction Sequences

In addition to the generic optimizations discussed in the previous paragraph, which target the obvious sources of code size reduction such as dead or unreachable code, there exist further optimizations that aim specifically at generating more compact code, possibly at the cost of decreased performance. One of these techniques [6] tries to identify repeated code fragments, which are subsequently replaced by a single instance and control transfer instructions (jumps or function calls) to this single block of code. This optimization effectively "outlines" repeated fragments of code and replaces the original, redundant occurrences with function calls (jumps) to the newly created function (block). This code transformation is also known as procedural abstraction if a single, new function is created, or cross-jumping or tail merging if a region of code is replaced with a direct jump to a single instance of an identical code block. An example of procedural abstraction is illustrated in figure 16.
(a) Original code:

    /* Region 1 */
    ...
    x = a + b;
    y = c - d;
    z = x * y;
    ...

    /* Region 2 */
    r = n + m;
    s = o - p;
    t = r * s;
    ...

(b) Code after procedural abstraction:

    /* Region 1 */
    ...
    f(&x, &y, &z, a, b, c, d);
    ...

    /* Region 2 */
    f(&r, &s, &t, n, m, o, p);
    ...

(c) Newly created function (written with explicit pointer parameters to make the abstraction valid C):

    void f(int *x, int *y, int *z, int a, int b, int c, int d)
    {
        *x = a + b;
        *y = c - d;
        *z = *x * *y;
    }

Fig. 16 Example: Procedural Abstraction.
Coalescing of repeated instruction sequences, either through procedural abstraction or cross-jumping, involves a number of steps. Initially, repeated instruction sequences need to be identified using a suffix tree [16], from which a repeat table is constructed. Repeated instruction sequences identified during these stages might still contain unsafe operations that interfere with the code size reducing transformations. These control hazards, e.g. jumps into or out of the code fragment, must be avoided, for example by splitting these repeats appropriately; each of the smaller fragments can then be dealt with separately. Once the compiler has identified a set of repeats suitable for compaction, it will determine whether coalescing of repeated instruction sequences is profitable and, if so, apply either the procedural abstraction or the cross-jumping transformation. The former requires that the block of code to be "outlined" is single-entry, single-exit, i.e. all internal jumps must be within the body of the region. For cross-jumping, all of the control-flow successors must match for each repeated region.
If repeated code patterns occur frequently and are large enough, the overall benefit from procedural abstraction or cross-jumping outweighs their code size and performance overheads. On the other hand, for each code region replaced with a function call there is a function call overhead resulting in additional instructions, possibly outweighing the benefit, especially for very short code fragments. Also, the additional control flow instructions may introduce a significant performance overhead when compared with the original straight-line code. Within an optimizing compiler the coalescing of repeated instruction sequences is typically performed on a low-level intermediate representation (a high-level representation is used in figure 16 only to preserve the clarity and brevity of the example). When applied at a lower level, possibly after register allocation, the identification of repeated instruction sequences should not be restricted to exact, instruction-by-instruction matches, but should allow for some abstraction of branches and registers. Two regions may differ only in the use of labels or register names, and by treating such similar regions as matches the overall effectiveness of code compression can be increased.
References

1. Jaume Abella, Antonio Gonzalez, Josep Llosa, and Xavier Vera: Near-Optimal Loop Tiling by Means of Cache Miss Equations and Genetic Algorithms. In: Proceedings of the 2002 International Conference on Parallel Processing Workshops (ICPPW '02), Washington, DC, USA, 2002.
2. Vicki H. Allan, Reese B. Jones, Randall M. Lee, and Stephen J. Allan: Software pipelining. ACM Computing Surveys, Vol. 27, No. 3, pp. 367–432, 1995.
3. ACE Associated Compiler Experts bv: DSP-C – An extension to ISO/IEC 9899:1990. Netherlands, 1998.
4. David F. Bacon, Susan L. Graham, and Oliver J. Sharp: Compiler Transformations for High-Performance Computing. ACM Computing Surveys, Vol. 26, No. 4, December 1994.
5. Árpád Beszédes, Rudolf Ferenc, Tibor Gyimóthy, André Dolenc, and Konsta Karsisto: Survey of code-size reduction methods. ACM Computing Surveys, Vol. 35, No. 3, pp. 223–267, 2003.
6. Keith D. Cooper and Nathaniel McIntosh: Enhanced Code Compression for Embedded RISC Processors. In: Proceedings of the SIGPLAN Conference on Programming Language Design and Implementation, pp. 139–149, 1999.
7. Keith D. Cooper, L. Taylor Simpson, and Christopher A. Vick: Operator Strength Reduction. ACM Transactions on Programming Languages and Systems, Vol. 23, No. 5, pp. 603–625, Sept. 2001.
8. Keith D. Cooper, Philip J. Schielke, and Devika Subramanian: Optimizing for Reduced Code Space using Genetic Algorithms. In: Proceedings of the ACM SIGPLAN 1999 Workshop on Languages, Compilers, and Tools for Embedded Systems, pp. 1–9, Atlanta, Georgia, USA, 1999.
9. Saumya K. Debray, William Evans, Robert Muth, and Bjorn De Sutter: Compiler Techniques for Code Compaction. ACM Transactions on Programming Languages and Systems, Vol. 22, No. 2, pp. 378–415, March 2000.
10. Francis Depuydt, Gert Goossens, and Hugo De Man: Scheduling with register constraints for DSP architectures. Integration, the VLSI Journal, Vol. 18, No. 1, pp. 95–120, 1994.
11. Christophe Dubach, John Cavazos, Björn Franke, Grigori Fursin, Michael F.P. O'Boyle, and Olivier Temam: Fast compiler optimisation evaluation using code-feature based performance prediction. In: Proceedings of the 4th International Conference on Computing Frontiers (CF '07), Ischia, Italy, 2007.
12. ISO/IEC JTC1 SC22 WG14 N1169: Programming languages – C – Extensions to support embedded processors. Technical Report ISO/IEC TR 18037, 2006.
13. Jesse Zhixi Fang: Compiler Algorithms on If-Conversion, Speculative Predicates Assignment and Predicated Code Optimizations. In: Proceedings of the 9th International Workshop on Languages and Compilers for Parallel Computing (LCPC '96), pp. 135–153, London, UK, 1997.
14. Franz Franchetti and Markus Püschel: Short Vector Code Generation and Adaptation for DSP Algorithms. In: Proceedings of the International Conference on Acoustics, Speech, and Signal Processing (ICASSP), pp. 537–540, 2003.
15. Björn Franke and Michael O'Boyle: Array recovery and high-level transformations for DSP applications. ACM Transactions on Embedded Computing Systems, Vol. 2, No. 2, pp. 132–162, 2003.
16. C.W. Fraser, E.W. Myers, and A.L. Wendt: Analyzing and compressing assembly code. SIGPLAN Notices, Vol. 19, No. 6, pp. 117–121, June 1984.
17. G. Gréwal, T. Wilson, and A. Morton: An EGA Approach to the Compile-Time Assignment of Data to Multiple Memories in Digital-Signal Processors. SIGARCH Computer Architecture News, Vol. 31, No. 1, pp. 49–59, 2003.
18. Masayo Haneda, Peter M.W. Knijnenburg, and Harry A.G. Wijshoff: Code Size Reduction by Compiler Tuning. In: S. Vassiliadis et al. (Eds.): SAMOS 2006, LNCS 4017, pp. 186–195, Springer-Verlag, 2006.
19. Manuel Hohenauer, Felix Engel, Rainer Leupers, Gerd Ascheid, and Heinrich Meyr: A SIMD optimization framework for retargetable compilers. ACM Transactions on Architecture and Code Optimization, Vol. 6, No. 1, pp. 1–27, 2009.
20. ISO/IEC JTC1/SC22/WG14 working group: ISO/IEC 9899:1999 – Programming languages – C, 1999.
21. M.-Y. Ko and S.S. Bhattacharyya: Data Partitioning for DSP Software Synthesis. In: Proceedings of the International Workshop on Software and Compilers for Embedded Systems (SCOPES '03), pp. 344–358, Vienna, Austria, 2003.
22. Arvind Krishnaswamy and Rajiv Gupta: Mixed-width instruction sets. Communications of the ACM, Vol. 46, No. 8, pp. 47–52, 2003.
23. C.G. Lee: UTDSP Benchmark Suite. http://www.eecg.toronto.edu/~corinna/DSP/infrastructure/UTDSP.html. Cited 1 July 2009.
24. R. Leupers: Code Optimization Techniques for Embedded Processors: Methods, Algorithms, and Tools. Kluwer Academic Publishers, Norwell, MA, USA, 2000.
25. Rainer Leupers: Code selection for media processors with SIMD instructions. In: Proceedings of the Conference on Design, Automation and Test in Europe (DATE '00), pp. 4–8, Paris, France, 2000.
26. R. Leupers and D. Kotte: Variable Partitioning for Dual Memory Bank DSPs. In: Proceedings of the IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP '01), pp. 1121–1124, Salt Lake City, USA, 2001.
27. Rainer Leupers and Steven Bashford: Graph-based code selection techniques for embedded processors. ACM Transactions on Design Automation of Electronic Systems, Vol. 5, No. 4, pp. 794–814, 2000.
28. Rainer Leupers: Register allocation for common subexpressions in DSP data paths. In: Proceedings of the 2000 Asia and South Pacific Design Automation Conference (ASP-DAC '00), pp. 235–240, Yokohama, Japan, 2000.
29. Stan Liao, Srinivas Devadas, Kurt Keutzer, Steve Tjiang, and Albert Wang: Code optimization techniques for embedded DSP microprocessors. In: Proceedings of the 32nd Annual ACM/IEEE Design Automation Conference (DAC '95), pp. 599–604, San Francisco, California, USA, 1995.
30. Daniel Menard, Daniel Chillet, François Charot, and Olivier Sentieys: Automatic floating-point to fixed-point conversion for DSP code generation. In: Proceedings of the 2002 International Conference on Compilers, Architecture, and Synthesis for Embedded Systems (CASES '02), pp. 270–276, Grenoble, France, 2002.
31. Antoine Monsifrot, François Bodin, and Rene Quiniou: A Machine Learning Approach to Automatic Production of Compiler Heuristics. In: Proceedings of the 10th International Conference on Artificial Intelligence: Methodology, Systems, and Applications (AIMSA '02), pp. 41–50, London, UK, 2002.
32. Stephen S. Muchnick: Advanced compiler design and implementation. Morgan Kaufmann Publishers Inc., San Francisco, CA, USA, 1997.
33. Alastair Murray and Björn Franke: Using Genetic Programming for Source-Level Data Assignment to Dual Memory Banks. In: Proceedings of the 3rd Workshop on Statistical and Machine Learning Approaches to Architecture and Compilation (SMART '09), Paphos, Cyprus, 2009.
34. Alastair Murray and Björn Franke: Fast Source-Level Data Assignment to Dual Memory Banks. In: Proceedings of the Workshop on Software & Compilers for Embedded Systems (SCOPES 2008), Munich, Germany, March 2008.
35. Dorit Nuzman, Ira Rosen, and Ayal Zaks: Auto-vectorization of interleaved data for SIMD. ACM SIGPLAN Notices, Vol. 41, No. 6, pp. 132–143, 2006.
36. Preeti Ranjan Panda, Hiroshi Nakamura, Nikil D. Dutt, and Alexandru Nicolau: Augmenting Loop Tiling with Data Alignment for Improved Cache Performance. IEEE Transactions on Computers, Vol. 48, No. 2, pp. 142–149, February 1999.
37. Gilles Pokam, Stéphane Bihan, Julien Simonnet, and François Bodin: SWARP: A retargetable preprocessor for multimedia instructions. Concurrency and Computation: Practice & Experience, Vol. 16, No. 2–3, pp. 303–318, January 2004.
38. Louis-Noël Pouchet, Cédric Bastoul, Albert Cohen, and John Cavazos: Iterative Optimization in the Polyhedral Model: Part II, Multidimensional Time. In: Proceedings of the 2008 ACM SIGPLAN Conference on Programming Language Design and Implementation (PLDI '08), pp. 90–100, New York, NY, USA, 2008.
39. S. Pujare, C.G. Lee, and P. Chow: Machine-independent compiler optimizations for the UofT DSP architecture. In: Proceedings of the 6th International Conference on Signal Processing Applications and Technology (ICSPAT), pp. 860–865, 1995.
40. M.A.R. Saghir, P. Chow, and C.G. Lee: Exploiting Dual Data-Memory Banks in Digital Signal Processors. In: Proceedings of the 7th International Conference on Architectural Support for Programming Languages and Operating Systems (ASPLOS-VII), pp. 234–243, Cambridge, Massachusetts, USA, 1996.
41. S. Sair, D. Kaeli, and W. Meleis: A study of loop unrolling for VLIW-based DSP processors. In: Proceedings of the IEEE Workshop on Signal Processing Systems, 1998.
42. V. Sipková: Efficient Variable Allocation to Dual Memory Banks of DSPs. In: Proceedings of the 7th International Workshop on Software and Compilers for Embedded Systems (SCOPES '03), pp. 359–372, Vienna, Austria, 2003.
43. Mark Stephenson and Saman Amarasinghe: Predicting Unroll Factors Using Supervised Classification. In: Proceedings of the International Symposium on Code Generation and Optimization (CGO '05), pp. 123–134, Washington, DC, USA, 2005.
44. Gang-Ryung Uh, Yuhong Wang, David B. Whalley, Sanjay Jinturkar, Chris Burns, and Vincent Cao: Techniques for Effectively Exploiting a Zero Overhead Loop Buffer. In: Proceedings of the 9th International Conference on Compiler Construction (CC '00), pp. 157–172, London, UK, 2000.
45. Vienna University of Technology, Institute of Communications and Radio-Frequency Engineering: Address Code Optimization. www.address-code-optimization.org, retrieved August 2009.
46. Michael Joseph Wolfe: Optimizing Supercompilers for Supercomputers. MIT Press, Cambridge, MA, USA, 1990.
47. V. Zivojnovic, J. Martinez, C. Schläger, and H. Meyr: DSPstone: A DSP-Oriented Benchmarking Methodology. In: Proceedings of the International Conference on Signal Processing Applications and Technology (ICSPAT '94), Dallas, Oct. 1994.
Compiling for VLIW DSPs

Christoph W. Kessler
Department of Computer Science (IDA), Linköping University, S-58183 Linköping, Sweden
e-mail: [email protected]
Abstract This chapter describes fundamental compiler techniques for VLIW DSP processors. We begin with a review of VLIW DSP architecture concepts, as far as relevant for the compiler writer. As a case study, we consider the TI TMS320C62x™ clustered VLIW DSP processor family. We survey the main tasks of VLIW DSP code generation; discuss instruction selection, cluster assignment, instruction scheduling and register allocation in some greater detail; and present selected techniques for these, both heuristic and optimal ones. Some emphasis is put on phase ordering problems and on phase-coupled and integrated code generation techniques.
1 VLIW DSP architecture concepts and resource modeling

In order to satisfy high performance demands, modern processor architectures exploit various kinds of parallelism in programs: thread-level parallelism (i.e., running multiple program threads in parallel on multi-core and/or hardware-multithreaded processors), data-level parallelism (i.e., executing the same instruction or operation on several parts of a long data word or on a vector of multiple data words together), memory-level parallelism (i.e., overlapping memory access latency with other, independent computation on the processor), and instruction-level parallelism (i.e., overlapping the execution of several instructions in time, using different resources of the processor in parallel at a time). By pipelined execution of subsequent instructions, a certain amount of instruction-level parallelism (ILP) can already be exploited in ordinary sequential RISC processors that issue a single instruction at a time.
Fig. 1 A traditional VLIW processor with very long instruction words consisting of four issue slots, each one controlling one functional unit.
More ILP can often be leveraged by multiple-issue architectures, where execution of several independent instructions can be started in parallel, resulting in a higher throughput (instructions per clock cycle, IPC). The maximum number of instructions that can be issued simultaneously is called the issue width, denoted by ω. In this chapter, we focus on multiple-issue instruction-level parallel DSP architectures, i.e., ω > 1. ILP in programs can be given either explicitly or implicitly. With implicit ILP, dependences between instructions are implicitly given in the form of register and memory addresses read and written by instructions in a sequential instruction stream, and it is the task of a run-time (usually, hardware) scheduler to identify instructions that are independent and do not compete for the same resource, and thus can be issued in parallel to different available functional units of the processor. Superscalar processors use such a hardware scheduler to analyze data dependences and resource conflicts on-the-fly within a fixed-size window over the next instructions in the instruction stream. While convenient for the programmer, superscalar processors require high energy and silicon overhead for analyzing dependences and dispatching instructions. With explicit ILP, the assembler-level programmer or compiler is responsible for identifying independent instructions that should execute in parallel, and for grouping them together into issue packets (also known as instruction groups in the literature). The elementary instructions in an issue packet will be dispatched simultaneously to different functional units for parallel execution. In the following, we will consider explicit ILP architectures. The issue packets do not necessarily correspond one-to-one to the units of instruction fetch. The processor's instruction fetch unit usually reads properly aligned, fixed-sized blocks of bytes from program memory, which contain a fixed number of elementary instructions, and decodes them together. We refer to these blocks as fetch packets (also known as instruction bundles in the literature). For instance, a fetch packet for the Itanium IA-64 processor family contains three instructions, and fetch packets for the TI 'C62x contain eight instructions. In the traditional VLIW architectures (see Fig. 1), the issue packets coincide with the fetch packets; they have a fixed length of L bytes, are L-byte aligned in instruction (cache) memory, and are called Very Long Instruction Words (VLIWs). A VLIW contains ω > 1 predefined slots for elementary instructions.
Fig. 2 Several issue packets may be accommodated within a single fetch packet. Here, the framed fetch packet contains three issue packets: the first two contain just one elementary instruction each, while the third one contains two parallel instructions.
Each instruction slot may be dedicated to a certain kind of instruction or to controlling a specific functional unit of the processor. Not all instruction slots have to be used; unused slots are marked by NOP (no operation) instructions. While decoding is straightforward, code density can be low if there is not enough ILP to fill most of the slots; this wastes program memory space and instruction fetch bandwidth. Instead, most explicit ILP architectures nowadays allow packing and encoding instructions more flexibly in program memory. An instruction of a specific kind may be placed in several or all possible instruction slots of a fetch packet. Also, a fetch packet may accommodate several issue packets, as illustrated in Fig. 2; the boundaries between these may, for instance, be marked by special delimiter bits. The different issue packets in a fetch packet will be issued subsequently for execution. The processor hardware is responsible for extracting the issue packets from a stream of fetch packets. (Processors that decouple issue packets from fetch packets are commonly also referred to as Explicitly Parallel Instruction set Computing (EPIC) architectures.) In the DSP domain, the Texas Instruments TI TMS320C6x processor family uses such a flexible encoding scheme, which we will present in Section 2.

The existence of multiple individual RISC-like elementary instructions as separate slots within an issue packet to express parallel issue is a key feature of VLIW and EPIC architectures. In contrast, consider dual-MAC (multiply-accumulate) instructions that are provided in some DSP processors but encoded as a single instruction (albeit a very powerful one) in a linear instruction stream. Such instructions are, by themselves, not based on VLIW but should rather be considered as a special case of SIMD (single instruction multiple data) instructions. Indeed, SIMD instructions can occur as elementary instructions in VLIW instruction sets. Generally, a SIMD instruction applies the same arithmetic or logical operation to multiple operand data items in parallel. These operand items usually need to reside in adjacent registers or memory locations to be treated and addressed as single long data words. Hence, SIMD instructions have only one opcode, while issue packets in VLIW/EPIC architectures have one opcode per elementary instruction.

The appropriate issue width and the number of parallel functional units for a VLIW processor design depend, beyond architectural constraints, on the characteristics of the intended application domain. While the average ILP degree achievable in general-purpose programs is usually low, it can be significantly higher in the computational kernels of typical DSP applications. For instance, Gangwar et al. [35] report for DSPstone and Mediabench benchmark kernels an achievable ILP degree of 20 on average for a (clustered) VLIW architecture with 16 ALUs and 8 load-store units. Moreover, program transformations can be applied to increase exploitable ILP; we will discuss some of these later in this chapter.
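To make the delimiter-bit idea concrete, the following simplified sketch (encoding details differ on real hardware) splits an eight-word fetch packet into issue packets using a per-instruction "parallel" bit, in the style of the TI 'C6x p-bit: if the bit of word i is set, word i+1 is assumed to issue in parallel with it.

    #include <stdint.h>

    /* Returns the number of issue packets; start[k] is the index of the
       first instruction word of issue packet k within the fetch packet. */
    int split_fetch_packet(const uint32_t fp[8], int start[8])
    {
        int n = 0;
        start[n++] = 0;                 /* first issue packet starts at word 0 */
        for (int i = 0; i < 7; i++)
            if ((fp[i] & 1u) == 0)      /* p-bit clear: issue-packet boundary */
                start[n++] = i + 1;
        return n;
    }

For the fetch packet of Fig. 2, this routine would report three issue packets of sizes 1, 1, and 2.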
Fig. 3 Top: Example reservation tables for addition and multiplication on a pipelined processor with an ALU and a multiplier unit. Resources such as register file access ports and pipeline stages of the functional units span the horizontal axis of the reservation tables, while time flows downwards; time slot 0 represents the issue time. Bottom: Scheduling an add instruction two clock cycles after a mul instruction would lead to conflicting reservations of the result resource (write back to the register file). Here, the issue of the add would have to be delayed to, say, time t + 3. If exposed to the programmer/compiler, a nop instruction could be added before the add to fill the issue slot at time t + 2. Otherwise, the processor will handle the delay automatically by stalling the pipeline for one cycle.
Moreover, program transformations can be applied to increase the exploitable ILP; we will discuss some of these later in this chapter.

Resource modeling
We model instruction issue and resource usage explicitly. An instruction i issued at time t occupies an issue slot (e.g., a slot in a VLIW) at time t and possibly several resources (such as functional units or buses) at time t or later; NOP (no operation) instructions only occupy an issue slot but no further resources. For each instruction type, the required resource reservations relative to the issue time t are specified in a reservation table [22], a boolean matrix whose entry in row j and column u indicates whether the instruction uses resource u in clock cycle t + j. If an instruction is issued at time t, its reservations of resources are committed to a global resource usage map. Two instructions are in conflict with each other if their resource reservations overlap in the global resource usage map; this is also known as a structural hazard. See Fig. 3 for an example. In such a case, one of the two instructions has to be issued at a later time to avoid duplicate reservations of the same resource. Non-pipelined resources that have to be reserved for more than one clock cycle in sequence can thus lead to delayed issuing of subsequent instructions that use the same resource. The occupation time o(i, j) denotes the minimum distance in issue time between two (data-independent) instructions i and j that are to be issued subsequently on the same issue unit and that subscribe to a common resource. Hence, the occupation time depends only on the instruction types. For instance, in Fig. 3, o(add, add) = 1. In fact, for most processors, occupation times are generally 1.

Sets of time slots on one or several physical resources (such as pipeline stages in functional units or buses) can often be modeled together as a single virtual resource if it is clear from analysis of the instruction set that, once an instruction is assigned the earliest one of the resource slots in this set, no other instruction could possibly interfere with it in later slots or other resources of the same set. A processor is called fully pipelined if it can be modeled with virtual resources such that all occupation times are 1 and there are no exposed structural hazards; the reservation table of an instruction then degenerates to a vector over the virtual resources. On regular VLIW architectures, these virtual resources often correspond one-to-one to functional units.
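To make the reservation-table mechanics concrete, here is a minimal Python sketch of committing reservations to a global resource usage map and detecting structural hazards. It is illustrative only: the resource names, table shapes, and offsets are invented for the example (chosen so that the outcome matches the Fig. 3 scenario), not taken from any particular processor.

  # Reservation tables: for each instruction type, a list of (offset, resource)
  # pairs meaning "uses `resource` in cycle t + offset when issued at time t".
  RESERVATIONS = {
      "add": [(0, "ALU"), (1, "RESULT_BUS")],                # hypothetical add
      "mul": [(0, "MUL0"), (1, "MUL1"), (3, "RESULT_BUS")],  # hypothetical mul
  }

  def conflicts(usage_map, instr, t):
      # True if issuing `instr` at time t overlaps an existing reservation.
      return any((t + off, res) in usage_map for (off, res) in RESERVATIONS[instr])

  def issue(usage_map, instr, t):
      # Commit the reservations of `instr` issued at time t to the usage map.
      for (off, res) in RESERVATIONS[instr]:
          usage_map.add((t + off, res))

  usage = set()
  issue(usage, "mul", 0)
  t = 2
  while conflicts(usage, "add", t):  # structural hazard on RESULT_BUS at cycle 3
      t += 1                         # delay the issue, as in Fig. 3
  issue(usage, "add", t)
  print("add issued at", t)          # prints 3 with these example tables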
Latency and register write models
Consider two instructions i1 and i2 where i1, issued at time t1, produces (writes) a value so that it is available at the beginning of time slot t1 + w1 in some register r, which is to be consumed (read) by i2 at time t2 + r2. The time of writing the result relative to the issue time, w1, is called the write latency of i1, and r2 is called the read latency of i2. (For simplicity of presentation, we assume here that write latency and read latency are constants for each instruction. In general, they may depend on run-time conditions exposed by the hardware and vary in an interval between earliest and latest write resp. read latency; see also the remarks on the LE model further below. For a more detailed latency model, we refer to Rau et al. [79].) For the earliest possible issue time t2 of i2, we have to preserve the constraint t2 ≥ t1 + w1 − r2 to make sure that the operand value for i2 is available in the register. We refer to the minimum difference in issue times induced by a data dependence, i.e., ℓ(i1, i2) = w1 − r2, as the latency between i1 and i2. See Fig. 4 for an illustration. For memory data dependences between store and load instructions, latency is defined accordingly.

Latencies are normally positive, because operations usually read their operands early in their execution and write their results just before terminating. For the same reason, the occupation time usually does not exceed the latency. Only for uncommon combinations of an early-writing i1 with a late-reading i2, or in the case of write-after-read dependences, can negative latencies occur, which means that a successor instruction could actually be issued before its predecessor instruction in the data dependence graph and still preserve the data dependence.
Fig. 4 A read-after-write (flow) data dependence forces the instruction scheduler to respect the latency of 6 clock cycles between the producing and the consuming instruction, to make sure that the value written to register R1 is the one that is read.
However, this only applies to the EQ model, which we explain now.

There exist two different latency models with respect to the time at which the result register is written: EQ (for "equal") and LE (for "less than or equal"). Both models are used in VLIW DSP processors. The EQ model specifies that the result register of an instruction i1 issued at time t1 is written exactly at the end of time slot t1 + w1 − 1, neither earlier nor later. Hence, the destination register r only needs to be reserved from time t1 + w1 on. In the LE model, t1 + w1 is an upper bound on the write time, but the write may happen at any time between the issue time t1 and t1 + w1, depending on hardware-related issues. In the LE model, the destination register r must hence be reserved from the issue time on. The EQ model allows better utilization of the registers, but the possibility of having several in-flight result values to be written to the same destination register makes it more difficult to handle interrupts properly.

In some architectures, latency depends only on the instruction type of the source instruction. If the latency ℓ(i, j) is the same for all possible instructions j that may directly depend on i (e.g., that use the result value written by i), we set ℓ(i) = ℓ(i, j). Otherwise, on LE architectures, we can instead set ℓ(i) = max_j ℓ(i, j), i.e., the maximum latency to any possible direct successor instruction consuming the output value of i. The assumption that latency depends only on the source instruction is then a conservative simplification and may in some cases lead to somewhat longer register live ranges than necessary.

The difference ℓ(i1, i2) − o(i1, i2) is usually referred to as the delay of instruction i1. (Note that in some papers and texts, the meanings of the terms delay and latency are reversed.)
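As a small illustration of the latency constraint above, the earliest issue time of a consumer follows directly from the definitions. The sketch below is in Python with invented per-instruction latencies, chosen so that the example reproduces the 6-cycle mul-to-add latency of Fig. 4; these are not the values of any particular processor.

  # Hypothetical per-instruction write and read latencies (in clock cycles).
  WRITE_LATENCY = {"mul": 6, "add": 2}   # w: result available at t_issue + w
  READ_LATENCY  = {"mul": 0, "add": 0}   # r: operand read at t_issue + r

  def latency(producer, consumer):
      # l(i1, i2) = w1 - r2, the minimum issue-time distance induced by
      # a read-after-write dependence.
      return WRITE_LATENCY[producer] - READ_LATENCY[consumer]

  def earliest_issue(t1, producer, consumer):
      # t2 >= t1 + w1 - r2
      return t1 + latency(producer, consumer)

  print(earliest_issue(0, "mul", "add"))  # 6, as in the Fig. 4 scenario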
Fig. 5 Clustered VLIW processor with partitioned register set.
Clustered VLIW: Partitioned Register Sets
In VLIW architectures, many instructions may execute in parallel, each accessing several operand registers and/or producing a result value to be written to some register. If each instruction should be able to access each register in a homogeneous register set, the demands on the number of parallel read and write ports to the register set, i.e., on the access bandwidth to the register set, become extraordinarily high. Register files with many ports have very high silicon area and energy costs, and even their access latency grows. A solution is to constrain general accessibility and to partition the set of functional units, and likewise the register set, to form clusters. A cluster consists of a set of functional units and a local register set, see Fig. 5. Within a cluster, each functional unit can access each local register. However, the number of accesses to a remote register is strictly limited, usually to one per clock cycle and cluster. A task for the programmer (or compiler) is thus to plan in which cluster data should reside at run time and on which cluster each operation is to be performed, so as to minimize the loss of ILP due to the clustering constraints.

Control Hazards
The pipelined execution of instructions on modern processors, including VLIW processors, achieves maximum throughput only in the absence of data hazards, structural hazards, and control hazards. In VLIW processors, these hazards are exposed to the assembler-level programmer or compiler. Data hazards and structural hazards have been discussed above. Control hazards denote the fact that branch instructions may disrupt the linear fetch-decode-execute pipeline flow. Branch instructions are detected only in the decoding phase, and, in the case of conditional branches, the branch target may be known even later, during execution. If subsequent instructions have been fetched, decoded and partly executed on the "wrong" control flow path when the branch is detected or the branch target becomes known, the effect of these instructions must be rolled back and the pipeline must restart from the branch target. This implies a nonzero delay in execution that may differ depending on the type of branch instruction (unconditional branch, conditional branch taken as expected, or conditional branch not taken as expected).
There are basically two possibilities for how processors manage branch delays:
(1) Delayed branch: The branch instruction semantics is redefined to take effect on the program counter only after a certain number of delay time slots. It is a task for global instruction scheduling (see Section 7) to try to fill these branch delay slots with useful instructions that need to be executed anyway but do not influence the branch condition. If no other instruction can be moved to a branch delay slot, it has to be filled with a NOP instruction as a placeholder.
(2) Pipeline stall: The entire processor pipeline is frozen until the first instruction word has been loaded from the branch target. The delay is not explicit in the program code and may vary depending on the branch instruction type.

In particular, conditional branches have a detrimental effect on processor throughput. For this reason, hardware features and code generation techniques that reduce the need for (conditional) branches are important. The most prominent one is predicated execution: each instruction takes an additional operand, a boolean predicate, which may be a constant or a variable in a predicate register. If the predicate evaluates to true, the instruction executes as usual. If it evaluates to false, the effect of the instruction is rolled back such that it behaves like a NOP instruction (see the if-conversion sketch at the end of this section).

Hardware Loops
Many innermost loops in digital signal processing applications have a fixed number of iterations and a fixed-length loop body consisting of straight-line code. Some DSP processors therefore support a hardware loop construct. A special hardware loop setup instruction at the loop entry initializes an iteration count register and also specifies the number of subsequent instructions that form the loop body. The iteration count register is advanced automatically after every execution of the loop body; no separate add instruction is necessary for that purpose. A backward branch instruction from the end to the beginning of the loop body is no longer necessary either, as the processor automatically resets its program counter to the first loop instruction unless the iteration count has reached its final value. Hardware loops thus have no loop-control overhead per loop iteration and only a marginal constant loop setup cost. Also, they do not suffer from control hazards, as the processor hardware knows well ahead of time where and whether to execute the next backward branch.

Examples of VLIW DSP Processors
In the next section, we will consider the TI ’C62x DSP processor as a case study. Other VLIW/EPIC DSP processors include, e.g., the HP Lx / STMicroelectronics ST200, the Analog Devices TigerSHARC ADSP-TS20xS [4], and the NXP (formerly Philips Semiconductors) TriMedia [74].
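As announced above, here is a minimal, illustrative sketch of if-conversion, the transformation that predicated execution enables. The three-address tuple format and the predicate representation are invented for the example and are not tied to any particular ISA.

  # Each instruction is (opcode, dst, src1, src2, predicate); predicate None
  # means unconditional execution.
  def if_convert(cond_reg, then_block, else_block):
      # Replace a conditional branch over two blocks by predicated code.
      converted = []
      for (op, dst, s1, s2, _) in then_block:
          converted.append((op, dst, s1, s2, ("p", cond_reg, True)))
      for (op, dst, s1, s2, _) in else_block:
          converted.append((op, dst, s1, s2, ("p", cond_reg, False)))
      return converted   # straight-line code: no branch, no control hazard

  # if (B0) A1 = A2 + A3; else A1 = A2 - A3;
  then_blk = [("ADD", "A1", "A2", "A3", None)]
  else_blk = [("SUB", "A1", "A2", "A3", None)]
  for instr in if_convert("B0", then_blk, else_blk):
      print(instr)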
Fig. 6 The TI ’C6201 clustered VLIW DSP processor: two register files A (A0–A15) and B (B0–B15), functional units .L1, .S1, .M1, .D1 (cluster A) and .D2, .M2, .S2, .L2 (cluster B), cross paths 1X and 2X, connected to program cache/memory and data cache/memory.
2 Case study: TI ’C62x DSP processor family
As a case study, we consider a representative of Texas Instruments’ C62x / C64x / C67x family of clustered VLIW fixed-point digital signal processors (DSPs) with the VelociTI instruction set. The Texas Instruments TMS320C6201 [82] (shorthand: ’C6201) is a high-performance fixed-point digital signal processor (DSP) clocked at 200 MHz. It is a clustered VLIW architecture with issue width ω = 8. A block diagram is given in Fig. 6. The ’C6201 has 8 functional units, including two 16-bit multipliers for 32-bit results (the .M units) and six 32/40-bit arithmetic-logical units (ALUs), of which two (the .D units) are connected to the on-chip data cache memory.

The ’C62x CPUs are load-store architectures, i.e., all operands of arithmetic and logical operations must be constants or reside in registers, but not in memory. The data addressing (.D) units are used to load data from (data) memory to registers and to store register contents to (data) memory. The load and store instructions exist in variants for 32-bit, 16-bit and 8-bit data. The two .L units (logical units) mainly provide 32-bit and 40-bit arithmetic and compare operations and 32-bit logical operations like and, or, xor. The two .S units (shift units) mainly provide 32-bit arithmetic and logical operations, 32-bit bit-level operations, 32-bit and 40-bit shifts, and branching.

The ’C62x architecture is fully pipelined. The reservation table of each instruction is a 10×1 matrix, consisting of eight slots for the eight functional units and two slots for the two cross paths 1X and 2X (which will be described later) at issue time. In particular, each instruction execution occupies exactly one of the functional units at issue time. Separate slots for modeling instruction issue are thus not required; they coincide with the slots for the corresponding functional units.
(Exception: for load and store instructions, two more resources are used to model load destination register resp. store source register access to the two register files, as only one loaded or stored register can be accessed per register file and clock cycle. Furthermore, load instructions can cause additional implicit delays (pipeline stalls) through unbalanced access to the internal memory banks (see later). This effect could likewise be modeled with additional resources representing the different memory banks; however, this is only useful for predicting stalls where the alignment of the accessed memory addresses is statically known.)
A(1) B(1) C(0) D(1) E(1) F(0) G(1) H(0)    (fetch packet slots, chaining bits in parentheses)
Fig. 7 A fetch packet for the ’C62x can contain up to 8 issue packets, as marked by the chaining bits. In this example, there are 3 issue packets: instructions A, B, C issued together, followed by D, E, F issued together and finally G and H issued together.
The occupation time is 1 for all instructions. The ’C62x architecture offers the EQ latency model (there called ”multiple assignment”) for non-interruptible code, while the LE model (called ”single assignment”) should be used for interruptible code. Global enabling and disabling of interrupts is done by changing a flag in the processor’s control status register. The latency of load instructions is 5 clock cycles, of most arithmetic instructions 1, and of multiply instructions 2 clock cycles. Load and store instructions may optionally have an address auto-increment or -decrement side effect, which has latency 1. Some instructions are available on several units. For instance, additions can be done on the .L, .S and .D units.

Each of the two clusters A and B has sixteen 32-bit general-purpose registers, which are connected to four units (including one multiplier and one load-store unit). The units of cluster A are called .D1, .M1, .S1 and .L1, those of cluster B are called .D2, .M2, .S2 and .L2. All units are fully pipelined (with occupation time 1), i.e., in principle, a new instruction can be issued to each unit in each clock cycle. The ’C6201 has 128 KB of on-chip static RAM, 64 KB for data and 64 KB for instructions.

An instruction fetch packet for the ’C62x family is 256 bits wide and is partitioned into 8 instruction slots of 32 bits each. The least significant bit of a 32-bit instruction slot is used as a chaining bit to indicate the limits of issue packets: if the chaining bit of slot i is 0, the instruction in the following slot i + 1 belongs to the next issue packet (see Fig. 7). Technically, issue packets cannot span several fetch packets. (Even though ’C62x assembly language allows an issue packet to start in one fetch packet and continue into the next, the assembler will automatically create and insert a fresh fetch packet after the first one, move the pending issue packet there, and fill up the remainder of the first fetch packet with NOP instructions.) Hence, the maximum issue packet size of ω = 8 occurs when all chaining bits (except perhaps the last one) are set in a fetch packet. At the other extreme, up to eight issue packets can occur in a fetch packet (if all chaining bits are cleared). The next fetch packet is not fetched before all issue packets of the previous one have been dispatched. As each functional unit can execute simple integer operations like addition, the ’C6201 can thus run up to eight integer operations per cycle, which amounts to 1600 MIPS (million instructions per second).
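The chaining-bit encoding is easy to express in code. The following Python sketch is illustrative only: instructions are represented simply as (name, chaining_bit) pairs rather than as 32-bit words.

  def split_issue_packets(fetch_packet):
      # Split a 'C62x-style fetch packet (a list of up to 8
      # (instruction, chaining_bit) pairs) into issue packets.
      # A chaining bit of 0 in slot i ends the current issue packet.
      packets, current = [], []
      for instr, chain in fetch_packet:
          current.append(instr)
          if chain == 0:          # boundary: next slot starts a new issue packet
              packets.append(current)
              current = []
      if current:                 # trailing packet (last chaining bit may be set)
          packets.append(current)
      return packets

  # The example of Fig. 7: A(1) B(1) C(0) D(1) E(1) F(0) G(1) H(0)
  fp = [("A",1), ("B",1), ("C",0), ("D",1), ("E",1), ("F",0), ("G",1), ("H",0)]
  print(split_issue_packets(fp))   # [['A','B','C'], ['D','E','F'], ['G','H']]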
Fig. 8 The SIMD instruction ADD2 performs two 16-bit integer additions on the same functional unit in one clock cycle. The 32-bit registers shared by the operands and results are shown as rectangles.
The ADD2 instruction, which executes on the .S units, performs a pair of 16-bit additions in a single clock cycle on the same functional unit, provided that the 16-bit operands (and results) are each packed into a common 32-bit register, see Fig. 8. One of these two 16-bit additions accesses the lower 16 bits (bits 0..15) of the registers, the other the upper 16 bits (bits 16..31). No carry is propagated from the lower to the upper 16-bit addition, which differs from the behavior of the 32-bit ADD instruction and therefore requires the separate opcode ADD2. The SUB2 instruction, also available on the .S units, works similarly for two 16-bit subtractions. Other instructions like bitwise AND, bitwise OR etc. work for 16-bit operand pairs in the same way as for 32-bit operands and thus do not need a separate opcode.

Within each cluster, each functional unit can access any register. At most one instruction per cluster and clock cycle can take one operand from the other cluster’s register file, for which it needs to reserve the corresponding cross path (1X for accessing B registers from cluster A, and 2X for the other direction), which is also modeled as a resource for this purpose. Assembler mnemonics encode the used resources as a suffix to the instruction’s opcode: for instance, ADD.L1 is an ordinary addition on the .L1 unit using operands from cluster A only, while ADD.S2X denotes an addition on the .S2 unit that accesses one A register via the cross path 2X. In total, there are twelve different instructions for addition (not counting the ADD2 variant for 16-bit additions).
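The packed-arithmetic semantics of ADD2, two independent 16-bit additions in one 32-bit register with no carry across the halfword boundary, can be modeled as follows. This is a behavioral Python sketch of the semantics described above, not the hardware implementation.

  def add2(a, b):
      # Model of the 'C62x ADD2 semantics on 32-bit register values:
      # add the low and high 16-bit halves independently; no carry
      # propagates from bit 15 into bit 16 (unlike the 32-bit ADD).
      lo = ((a & 0xFFFF) + (b & 0xFFFF)) & 0xFFFF
      hi = (((a >> 16) & 0xFFFF) + ((b >> 16) & 0xFFFF)) & 0xFFFF
      return (hi << 16) | lo

  a = 0x0001_FFFF   # high half 0x0001, low half 0xFFFF
  b = 0x0002_0001   # high half 0x0002, low half 0x0001
  print(hex(add2(a, b)))            # 0x30000: low half wraps, no carry upward
  print(hex((a + b) & 0xFFFFFFFF))  # 0x40000: 32-bit ADD propagates the carry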
Table 1 (a) Schedule generated by an early version of the TI C compiler (12 cycles) [58]; (b) optimal schedule generated by OPTIMIST with dynamic programming (9 cycles) [51].

(a)  LDW.D1 *A4,B4
     LDW.D1 *A1,A8
     LDW.D1 *A3,A9
     LDW.D1 *A0,B0
     LDW.D1 *A2,B2
     LDW.D1 *A5,B5
     LDW.D1 *A7,A4
     LDW.D1 *A6,B6
     NOP 1
     MV.L2X A8,B1
     MV.L2X A9,B3
     MV.L2X A4,B7

(b)  LDW.D1 *A0,A8  || MV.L2X A1,B8
     LDW.D2 *B8,B1  || LDW.D1 *A2,A9  || MV.L2X A3,B10
     LDW.D2 *B10,B3 || LDW.D1 *A4,A10 || MV.L2X A5,B12
     LDW.D2 *B12,B5 || LDW.D1 *A6,A11 || MV.L2X A7,B14
     LDW.D2 *B14,B7 || MV.L2X A8,B0
     MV.L2X A9,B2
     MV.L2X A10,B4
     MV.L2X A11,B6
     NOP 1           ; (last delay slot of LDW to B7)
It becomes apparent that the problems of resource allocation (including cluster assignment) and instruction scheduling are not independent, but should preferably be handled together to improve code quality. An example (adapted from Leupers [58]) is shown in Table 1: a basic block consisting of eight independent load (LDW) instructions is to be scheduled. The address operands are initially available (i.e., live on entry) in registers A0,...,A7 in register file A; the results are expected to be written to registers B0,...,B7 (i.e., live on exit) in register file B. Load instructions execute on the cluster containing the address operand register. The result can be written to either register file. However, only one load or store instruction can access a register file per clock cycle to write its destination register resp. read its source register; otherwise, the processor stalls for one clock cycle to serialize the competing accesses. Copying registers between clusters (which occupies the corresponding cross path) can be done by the move instruction (MV), which is a shorthand for ADD with one zero operand and has latency 1.

As the processor has 2 load/store units and a load has latency 5, a lower bound for the makespan (the time until all results are available) is 8 clock cycles; it can be sharpened to 9 clock cycles if we consider that at least one of the addresses has to be moved early to register file B to enable parallel loading, which takes one more clock cycle. A naive solution (a) sequentializes the computation by executing all load instructions on cluster A only. A more advanced schedule (b) utilizes both load/store units in parallel by transporting four of the addresses to cluster B as soon as possible, so that the loads can run in parallel. Note also that no implicit pipeline stalls occur, as the parallel load instructions always target different destination register files in their write-back phase, 5 clock cycles after issue time. Indeed, (b) is an optimal schedule; it was computed by the dynamic programming algorithm in OPTIMIST [51]. Generally, there can exist several optimal schedules. For instance, another one for this example is reported by Leupers [58]; it was computed by a simulated-annealing based heuristic.

Branch instructions on the ’C62x, which execute on the .S units, are delayed branches with a latency of 6 clock cycles; thus, 5 delay slots are exposed. If two branches execute in the same issue packet (on .S1 and .S2 in parallel), control branches to the target for which the branch condition evaluates to true. This can be used to realize three-way branches. If both branch conditions evaluate to true, the behavior is undefined.

All ’C62x instructions can be predicated. The four most significant bits of the opcode form a condition field, where the first three bits specify the condition register tested, and the fourth bit specifies whether to test for equality or non-equality of that register with zero. Registers A1, A2, B0, B1 and B2 can serve as condition registers. The condition field code 0000 denotes unconditional execution.

Usually, branch targets will be at the beginning of an issue packet. However, branch targets can be any word address in instruction memory and thereby any instruction, which may also be in the middle of an issue packet. In that case, the instructions in that issue packet that appear in the program text before the branch target address will not take effect (they are treated as NOPs).
Most ’C62x processor types use interleaved memory banks for the internal (on-chip) data memory. In most cases, data memory is organized in four 16-bit wide memory banks, and byte addresses are mapped cyclically across these (see Fig. 9).
Fig. 9 Interleaved internal data memory with four memory banks, each 16 bit (2 bytes) wide.
Each bank is single-ported memory; thus, only one access is possible per clock cycle. If two load or store instructions try to access addresses in the same bank in the same clock cycle, the processor stalls for one cycle to serialize the accesses. To avoid such delays, it is useful to know statically the alignment of addresses to be accessed in parallel and to make sure that these end up in different memory banks. Note also that load-word (LDW) and store-word (STW) instructions, which access 32-bit data, access two neighboring banks simultaneously. Word addresses must be aligned on word boundaries, i.e., the two least significant address bits are zero. Halfword addresses must be aligned on halfword boundaries.

TI ’C62x and ’C64x processors are fixed-point DSP processors, where the ’C64x processors have architecture extensions that include, for instance, further support for SIMD processing (such as four-way 8-bit SIMD addition, four-way 16 × 16 bit multiply and eight-way 8 × 8 bit multiply), further instructions such as 32 × 32 bit multiply and complex multiply, compact (16-bit) instructions that can be mixed with 32-bit instructions [43], hardware support for software pipelining of loops, and more (2 × 32) registers. The TI ’C67x family also supports floating-point computations.

Beyond the ’C6x assembly language, TI provides three further programming models for ’C6x processors: (1) ANSI C, (2) C with calls to intrinsic functions that map one-to-one to ’C6x-specific instructions, such as _add2(), and (3) linear assembly code, which is RISC-like serial unscheduled code that uses ’C6x instructions but assumes no resource conflicts and only unit latencies. In general, the more processor-specific programming models allow more efficient code to be generated. For instance, for an IIR filter example, TI reports that the software pipeline (see Sect. 7.2) generated from plain C code has a kernel length of 5 clock cycles, from C with intrinsics only 4, while the linear assembly optimizer achieves 3 clock cycles and thus the best throughput [80].
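Returning to the interleaved memory banks of Fig. 9: the bank selection and the conflict test can be sketched in a few lines of Python. This sketch assumes the four 16-bit-bank layout described above and is illustrative only.

  def bank(byte_addr):
      # Bank index under the four-bank interleaving of Fig. 9:
      # bytes 0-1 lie in bank 0, bytes 2-3 in bank 1, ...,
      # and bytes 8-9 wrap around to bank 0 again.
      return (byte_addr >> 1) & 0x3

  def banks_of_access(byte_addr, size):
      # Set of banks touched by an access of `size` bytes (1, 2 or 4);
      # a 4-byte LDW/STW touches two neighboring banks.
      return {bank(byte_addr + i) for i in range(size)}

  def conflict(acc1, acc2):
      # True if two accesses issued in the same cycle share a bank
      # (causing a one-cycle stall to serialize them).
      return bool(banks_of_access(*acc1) & banks_of_access(*acc2))

  print(conflict((0, 4), (4, 4)))   # False: banks {0,1} vs {2,3}
  print(conflict((0, 4), (8, 4)))   # True: both loads touch banks {0,1}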
3 VLIW DSP code generation overview
In this section, we give a short overview of the main tasks in code generation that produce target-specific assembler code from a (mostly) target-independent intermediate representation of the program. We will consider these tasks and the main techniques used for them in more detail in the following sections.
From intermediate representation to target code
Most modern compilers provide not just one but several intermediate representations (IR) of the program module being translated. These representations differ in their level of abstraction and in their degree of language and target independence. High-level representations such as abstract syntax trees follow the syntactic structure of the program and represent, e.g., loops and array accesses explicitly, while in low-level representations these constructs are lowered to branches and pointer arithmetic, respectively; such low-level IRs include control flow graphs, three-address code, and quadruple sequences. A compiler supporting several different representations allows each program analysis, optimization, and transformation to be implemented on the level that is most appropriate for it. For instance, common subexpression elimination is best performed on a lower-level representation, because more common subexpressions can be found after array accesses and other constructs have been lowered.

Code generation usually starts from a low-level intermediate representation (LIR) of the program. This LIR may be, to some degree, target dependent. For instance, IR operations for which no equivalent instruction exists on the target (e.g., there is no division instruction on the ’C62x) are lowered to equivalent sequences of LIR operations or to calls to corresponding routines in the compiler’s run-time system. For simple target architectures, the main tasks in code generation include instruction selection, instruction scheduling and register allocation:

• Instruction selection maps the abstract operations of the IR to equivalent instructions of the target processor. If we associate fixed resources (functional units, buses etc.) with each instruction, this also includes the resource allocation problem. Details are given in Section 4.

• Instruction scheduling arranges instructions in time, usually in order to minimize the overall execution time, subject to data dependence, control dependence, and resource constraints. In particular, this includes the subproblems of instruction sequencing, i.e., determining a linear (usually, topological) order of instructions for scheduling, and code compaction, i.e., determining which independent instructions to execute in parallel and mapping these to slots in instruction issue packets and fetch packets. Local, loop-level and global instruction scheduling methods are discussed in Section 7.

• Register allocation selects which values should, at what time during execution, reside in some target processor register. If there are not enough registers available at some point, some values must be temporarily spilled to memory, which requires the generation and scheduling of additional spill code in the form of load and store instructions. The simpler problem of register assignment maps the run-time values that were allocated a register to concrete register numbers. Details are given in Section 6.

Advanced architectures such as clustered architectures may require additional tasks such as cluster assignment and data transfer generation (see Section 5), which may be considered separately or be combined with some of the above tasks.
Fig. 10 Examples of covering IR nodes (solid circles) and edges (arrows) with patterns (dashed ovals) corresponding to instructions.
For instance, cluster assignment for instructions could be considered part of instruction selection, and cluster assignment for data may be modeled as an extension of register allocation.

Another task that is typical for DSP processors is address code generation for address generation units (AGUs). AGUs provide auto-increment and auto-decrement functionality as a parallel side effect of ordinary instructions that use special address registers for accessing data in memory. The AGUs may provide fixed offset values or offset registers to be used for incrementing and decrementing. A compiler can thus assign address registers and select offsets in an attempt to minimize the amount of residual addressing code that would still have to be computed with ordinary processor instructions on the main functional unit. Further code generation problems frequently occurring with VLIW DSPs include exploiting available SIMD instructions, which can be regarded as a subproblem of instruction selection, and optimizing the memory data layout to avoid stalls caused by memory bank access conflicts.
4 Instruction selection and resource allocation
The instruction selection problem is often modeled as a pattern matching problem, where the processor-independent operations of the intermediate representation are to be covered by patterns that correspond to the semantics of target processor instructions. The description of the target processor instructions in terms of IR operations has to be provided by the compiler writer. Each IR operation has to be covered by exactly one pattern node. Some examples are shown in Fig. 10. As intermediate results corresponding to inner edges of a multi-node pattern are no longer accessible in registers if that instruction is selected, such a pattern is only applicable as a cover if no other operations (such as SUB in Fig. 10(b)) access such an intermediate result. In other words, all outgoing edges from IR nodes covered by non-root pattern nodes must be covered by pattern edges.

Each pattern is associated with a cost, which is typically its occupation time or its latency, as an a-priori estimate of the instruction’s impact on the overall execution time of the final code (the exact impact will only be known after the remaining tasks of code generation, in particular instruction scheduling, have been done). Other cost metrics, such as register space requirements, are also possible. The optimization goal is then to minimize the accumulated total cost of all covering patterns, subject to the condition that each IR operation is covered by exactly one pattern node.
The optimizing pattern matching problem can be solved in various ways. A common technique, tree parsing, is applicable if the patterns are tree-shaped and the data flow graph of the current basic block (for instruction selection, we usually consider one basic block of the input program at a time) is a tree. The patterns are modeled as tree rewrite rules in a tree grammar describing admissible derivations of coverings of the input tree. A heuristic solution can be determined by an LR parser that constructs a derivation of the input tree bottom-up and, whenever there are several applicable rules (patterns) to choose from, selects greedily [37]. An optimal solution (i.e., a minimum-cost covering of the tree with respect to the given costs) can be computed in polynomial time by a dynamic programming algorithm that keeps track of all possibly optimal coverings of each subtree [1][33]. Computing a minimum-cost covering for a directed acyclic graph (DAG) is assumed to be NP-complete, but by splitting the DAG into trees that are processed separately and forcing the shared nodes’ results to be stored in registers, dynamic-programming based bottom-up tree pattern matching techniques can still be used as heuristic methods. Ertl [30] gives an algorithm to check whether a given processor instruction set (i.e., tree grammar) belongs to a special class of architectures (containing, e.g., MIPS and SPARC processors) for which this constrained tree pattern matching always produces optimal coverings for DAGs.

Another way to compute a minimum-cost covering, albeit an expensive one, is to apply advanced optimization methods such as integer linear programming [86][60][10][28] or partitioned boolean quadratic programming [25]. This may be an appropriate choice if integer linear programming is also used for solving other subtasks such as register allocation or instruction scheduling, if the basic block and the number of patterns are not too large, and if a close integration between these tasks is desired to produce high-quality code. Furthermore, solving the problem by such general optimization techniques is by no means restricted to tree-shaped patterns or tree-shaped IR graphs. In particular, they work well with complex patterns such as forests (non-connected trees, as in Fig. 10(c)) and directed acyclic graph (DAG) patterns, as needed for auto-increment load and store instructions or memory read-modify-write instructions, and even on cyclic IR structures such as the static single assignment (SSA) representation. Because covering several nodes with a complex pattern corresponds to merging these nodes, special care has to be taken with forest and DAG patterns to avoid the creation of artificial dependence cycles by the matching, which could lead to non-schedulable code [24].

Instruction selection can be combined with resource allocation, i.e., the binding of instructions to executing resources. For some instructions, there may be no choice: for instance, on the ’C62x, a multiply instruction can only be executed on a .M unit. In case the same instruction can execute on different functional units, each with its own cost, latency and resource reservations, one can model these variants as different instructions that just share the pattern defining the semantics. Like instruction selection, resource allocation is often done before instruction scheduling, but a tighter integration with scheduling would be helpful because resource allocation clearly constrains the scheduler.
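To illustrate the dynamic programming approach for tree-shaped patterns, here is a minimal Python sketch. The IR operators, the pattern set, and the costs are invented for the example (the MADD pattern mirrors Fig. 10(a)); real tree-parser generators in the spirit of [1][33] work on tree grammars with nonterminals and chain rules, and use memoization, both of which are omitted here for brevity, as is error handling for uncoverable nodes.

  class Node:
      def __init__(self, op, children=()):
          self.op, self.children = op, children

  # Hypothetical patterns: (name, tree shape, cost). A shape is an operator
  # applied to sub-shapes; "reg" matches any already-covered subtree.
  PATTERNS = [
      ("ADD",  ("ADD", "reg", "reg"), 1),
      ("MUL",  ("MUL", "reg", "reg"), 2),
      ("MADD", ("ADD", ("MUL", "reg", "reg"), "reg"), 2),  # fused multiply-add
  ]

  def matches(shape, node):
      # Return the subtrees matched by "reg" leaves, or None on mismatch.
      if shape == "reg":
          return [node]           # subtree boundary: result lives in a register
      op, *subs = shape
      if node.op != op or len(subs) != len(node.children):
          return None
      leaves = []
      for s, c in zip(subs, node.children):
          m = matches(s, c)
          if m is None:
              return None
          leaves += m
      return leaves

  def best_cost(node):
      # Minimum total cost of covering the subtree rooted at `node`.
      if not node.children:       # leaf operand, already in a register
          return 0
      best = None
      for name, shape, cost in PATTERNS:
          leaves = matches(shape, node)
          if leaves is not None:
              total = cost + sum(best_cost(l) for l in leaves)
              if best is None or total < best:
                  best = total
      return best

  # a*b + c: the MADD cover (cost 2) beats MUL followed by ADD (cost 3).
  tree = Node("ADD", (Node("MUL", (Node("a"), Node("b"))), Node("c")))
  print(best_cost(tree))   # 2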
A natural extension of this approach is to also model cluster assignment for instructions as a resource allocation problem, and thus as an extended instruction selection problem. However, cluster allocation has traditionally been treated as a separate problem; we will discuss it in Section 5.

Further target-level optimizations can be modeled as part of instruction selection. For instance, for special cases of IR patterns there may exist alternative code sequences that are faster, shorter, or more appropriate for later phases of code generation. As an example, an ordinary integer multiplication by 2 can be implemented in at least three different ways (mutations): by a MUL instruction, possibly running on a separate multiplier unit, by an ADD instruction, or by a left shift (SHL) by one, each having different latency and resource requirements. The ability to consider such mutations during instruction scheduling increases flexibility and can thus improve code quality [73].

Another extension of instruction selection concerns the identification of several independent IR operations with the same operator and short operand data types that can be merged to utilize a SIMD instruction instead. Also, the selection of short instruction formats to reduce code size can be considered a subproblem of instruction selection. While beneficial for instruction fetch bandwidth, short instruction formats constrain the number of operands, the size of immediate fields, or the set of accessible registers, which can have negative effects on register allocation and instruction scheduling. For the 16-bit compact instructions of the ’C64x, Hahn et al. [43] explore the trade-off between performance and code size.
5 Cluster assignment for clustered VLIW architectures
Cluster assignment for clustered VLIW architectures can be done at the IR level or at the target level. It maps each IR operator or instruction, respectively, to one of the clusters. Also, variables and temporaries to be held in registers must be mapped to a register file. Indeed, a value can reside in several register files if appropriate data transfer instructions are added; this is also an issue of register allocation and instruction scheduling, and in most compilers it is solved after instruction cluster assignment, although there exist obvious interdependences.

There exist various techniques for cluster assignment for basic blocks. The goal is to minimize the number of transfer instructions. Usually, heuristic solutions are applied. Desoli [23] identifies sub-DAGs of the target-level dataflow graph that are mapped to a cluster as a whole. Gangwar et al. [35] first decompose the target-level dataflow graph into disjoint chains of nodes connected by dataflow edges. The nodes of a chain will always be mapped to the same cluster. Chains are grouped together by a greedy heuristic until there are as many chain groups left as clusters. Finally, chain groups are heuristically assigned to clusters so that the residual cross-chain-group dataflow edges coincide with direct inter-cluster communication links wherever possible.
For many-cluster architectures where no fully connected inter-cluster communication network is available, the algorithm tries to minimize the communication distance accordingly, such that communicating chain groups are preferably mapped to clusters that are close to each other.

Usually, cluster assignment precedes instruction scheduling in phase-decoupled compilers for clustered VLIW DSPs, because the resource allocation for instructions must be known for scheduling. On the other hand, cluster allocation could benefit from information about free communication slots in the schedule. The quality of the resulting code suffers from the separate handling of cluster assignment, instruction scheduling and register allocation. Leupers [58] proposed a simulated-annealing based approach where cluster allocation and instruction scheduling are applied alternately in an iterative optimization loop.
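A drastically simplified Python sketch of chain-based cluster assignment in the spirit of Gangwar et al. [35], for illustration only: the chain decomposition is assumed to be given, and the grouping heuristic is reduced to a single greedy pass over inter-chain edge weights, co-locating heavily communicating chains.

  def assign_chains(chains, edges, n_clusters):
      # chains: list of chain ids (each chain is mapped as a whole)
      # edges:  dict {(chain_i, chain_j): number of dataflow edges between them}
      # Returns a cluster index per chain, trying to co-locate heavily
      # communicating chains to minimize inter-cluster transfers.
      cluster_of = {}
      load = [0] * n_clusters                      # chains per cluster (balance)
      for (ci, cj), _w in sorted(edges.items(), key=lambda e: -e[1]):
          for c in (ci, cj):                       # heavy edges first
              if c not in cluster_of:
                  partner = cj if c == ci else ci
                  if partner in cluster_of:
                      cluster_of[c] = cluster_of[partner]    # co-locate
                  else:
                      cluster_of[c] = load.index(min(load))  # least-loaded
                  load[cluster_of[c]] += 1
      for c in chains:                             # isolated chains
          if c not in cluster_of:
              cluster_of[c] = load.index(min(load))
              load[cluster_of[c]] += 1
      return cluster_of

  # Four chains; chains 0-1 and 2-3 communicate heavily.
  print(assign_chains([0, 1, 2, 3],
                      {(0, 1): 5, (2, 3): 4, (1, 2): 1}, n_clusters=2))
  # {0: 0, 1: 0, 2: 1, 3: 1}: only the light edge (1, 2) crosses clusters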
6 Register allocation and generalized spilling
In the low-level IR, all program variables that could be held in a register and all temporary variables are modeled as symbolic registers, which the register allocator then maps to the hardware registers, of which only a limited number is available. A symbolic register s is live at a program point p if s is defined (written) on some control path from the procedure’s entry point to p, and there exists a program point q where s is used (read) such that s may not be (re-)written on the control path p...q. Hence, s is live at p if it may still be used in a control flow successor q of p. The live range of s is the set of program points where s is live. The number of symbolic registers live at a program point p is called the register pressure at p.

Live ranges could be defined on the low-level IR (if register allocation is to be done before instruction selection), but usually they are defined at the target instruction level, because instruction selection may introduce additional temporary variables to be kept in registers. If the schedule is given, the live ranges are fixed, which constrains the register allocator, and generated spill code has to be inserted into the existing schedule. If register allocation comes first, some pre-scheduling (sequencing) at the LIR or target level is required to bring the operations/instructions of each basic block into a linear order that defines the live range boundaries. Early register allocation constrains the subsequent scheduler, but generated spill code can then be scheduled and compacted together with the remaining instructions.

Two live ranges interfere if they may overlap. Interfering live ranges cannot share the same register. The live range interference graph is an undirected graph whose nodes represent the live ranges of a procedure and whose edges represent interferences. Register assignment then amounts to coloring the live range interference graph, i.e., assigning a color (a specific physical register) to each node such that interfering nodes have different colors. Moreover, the coloring must not use more colors than the number K of machine registers available. Determining whether a general graph is colorable with K colors is NP-complete for K ≥ 3. If a coloring cannot be found, the register allocator must restructure the program to make the interference graph colorable.
Chaitin [17] proposed a heuristic for coloring the interference graph with K colors, where K denotes the number of physical registers available. The algorithm works iteratively. In each step, it tries to find a node of degree < K, because then there must be some color available for that node, and removes that node from the interference graph. If the algorithm cannot find such a node, the program must be transformed to change the interference graph into a form that allows the algorithm to continue. Such transformations include coalescing, live range splitting, and spilling with rematerialization of live ranges. After the algorithm has removed all nodes from the interference graph, the nodes are colored in reverse order of removal. The optimistic coloring algorithm by Briggs [14] improves Chaitin’s algorithm by delaying spilling transformations. (A sketch of the simplify-and-select core of these coloring heuristics is given at the end of this section.)

Coalescing is a transformation applicable to copy instructions s2 ← s1 where the two live ranges s1, s2 do not overlap except at that copy instruction, which marks the end (last use) of s1 and the beginning (write) of s2. Coalescing merges s1 and s2 into a single live range by renaming all occurrences of s1 to s2, which forces the register allocator to store them in the same physical register; the copy instruction can then be removed. Long live ranges tend to interfere with many others and thus may make the interference graph harder to color. As coalescing yields longer live ranges, it should be applied with care. Conservative coalescing [15] merges live ranges only if the degree of the merged live range in the interference graph would still be smaller than K.

The reverse of coalescing is live range splitting, i.e., the insertion of register-to-register copy operations and the renaming of accesses to split a long live range into several shorter ones. Splitting can make an interference graph colorable without having to apply spilling; this is often more favorable, as register-to-register copying is faster and less energy-consuming than memory accesses. Live range splitting can be done judiciously, with a small number of sub-live-ranges, or aggressively, with one sub-live-range per access.

Spilling removes a live range as a symbolic register by storing its value in main memory (or another non-register location). For each writing access, a store instruction is inserted that saves the value to a memory location (e.g., the variable’s ”home” location or a temporary location on the stack), and for each reading access, a load instruction into a temporary register is inserted. This spill code leads to increased execution time, energy consumption and code size. In some cases, it can be more efficient to realize the rematerialization [15] of a spilled value not by an expensive load from memory but by recomputing it instead. The choice between several spill candidates can be made greedily by considering the spill cost of a live range s, which counts the number of required store and load (or other rematerialization) instructions (to model the code size penalty), often also weighted by predicted execution frequencies (to model the performance and energy penalty).

A live range does not necessarily have to be spilled everywhere in the program. For instance, even if a symbolic register has a long live range, it may not be accessed during major periods in its live range where register pressure is high, for instance during an inner loop. Such periods are good candidates for partial spilling.
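As referenced above, here is a minimal Python sketch of the simplify-and-select scheme underlying Chaitin-style coloring. It is illustrative only: the interference graph is a plain adjacency dict, and the spill-candidate choice is reduced to picking a maximum-degree node, where a real allocator would use spill costs.

  def color(interference, K):
      # Chaitin-style coloring: simplify, then select in reverse order.
      # Returns (coloring dict, list of spill candidates).
      graph = {n: set(adj) for n, adj in interference.items()}
      stack, spills = [], []
      while graph:
          # Simplify: remove some node of degree < K.
          node = next((n for n in graph if len(graph[n]) < K), None)
          if node is None:
              # Blocked: choose a spill candidate (here: max degree).
              node = max(graph, key=lambda n: len(graph[n]))
              spills.append(node)
          stack.append(node)
          for m in graph.pop(node):
              graph[m].discard(node)
      coloring = {}
      for node in reversed(stack):      # select phase (optimistic: a spill
          used = {coloring[m] for m in interference[node] if m in coloring}
          free = [c for c in range(K) if c not in used]
          coloring[node] = free[0] if free else None  # candidate may still
      return coloring, spills                         # get a color)

  # Four pairwise-interfering live ranges, K = 3 registers:
  g = {"a": {"b","c","d"}, "b": {"a","c","d"},
       "c": {"a","b","d"}, "d": {"a","b","c"}}
  print(color(g, 3))   # one live range ends up with color None: it is spilled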
Register allocation can be implemented as a two-step approach [5][24], where a global pre-spilling phase is run first to limit the register pressure at any program point to the available number of physical registers, which makes the subsequent coloring phase easier.

Coloring-based heuristics are used in many standard compilers. While just-in-time compilers and dynamic optimizations require fast register allocators such as linear-scan allocators [83][76], the VLIW DSP domain rather calls for static compilation with high code quality, which justifies the use of more advanced register allocation algorithms. The first register allocator based on integer linear programming was presented by Goodwin and Wilken [39]. Optimal spilling selects for spilling just those live ranges whose accumulated spill cost is minimal while making the remaining interference graph colorable. Optimal selection of spill candidates (pre-spilling) and optimal a-posteriori insertion of spill code for a given fixed instruction sequence and a given number of available registers are NP-complete even for basic blocks, and have been solved by dynamic programming or integer linear programming for various special cases of processor architecture and dependency structure [5][46][47][68]. In most compilers, heuristics are used that try to estimate the performance penalty of inserted load and store instructions [11]. Recently, several practical methods based on integer linear programming for general optimal pre-spilling and for optimal coalescing have been developed, e.g., by Appel and George [5].

A recent trend is towards register allocation on the SSA form: for SSA programs, the interference graph belongs to the class of chordal graphs, which can be K-colored in quadratic time [12][42][16]. The generation of optimal spill code and the minimization of copy instructions by coalescing remain NP-complete problems also for SSA programs. For the problem of optimal coalescing in spill-free SSA programs, a good heuristic method was proposed by Brisk et al. [16], and an optimal method based on integer linear programming was given by Grund and Hack [41]. For optimal pre-spilling in SSA programs, Ebner [24] models the problem as a constrained min-cut problem and applies a transformation that yields a polynomial-time near-optimal algorithm that does not rely on an integer linear programming solver.
7 Instruction scheduling
In this section, we review fundamental instruction scheduling methods for VLIW processors at the basic block level, loop level, and global level. We also discuss the automatic generation of the most time-consuming part of instruction schedulers from a formal description of the processor.
7.1 Local instruction scheduling
The control flow graph of a program at the IR level or the target level is a graph whose nodes are the IR operations or target instructions, respectively, and whose edges denote possible control flow transitions between nodes. A basic block is any maximum-length sequence of textually consecutive operations (at the IR level) or instructions (at the target level) in the program that can be entered by control flow only via the first and left only via the last operation or instruction, respectively. Hence, branch targets are always at the entry of a basic block, and a basic block contains no branches except maybe as its last operation or instruction. Control flow executes all operations of a basic block from its entry to its exit. Hence, the data dependences in a basic block form a directed acyclic graph, the data flow graph of the basic block. This data flow graph defines the partial execution order that constrains instruction scheduling: the instruction/operation at the target of a data dependence must not be issued before the latency of the instruction/operation at the source has elapsed. Leaf nodes of the data flow graph do not depend on any other node and therefore have no predecessor (within the basic block); root nodes have no successor node (within the basic block). A path from a leaf node towards a root node with maximum accumulated latency over its edges is called a critical path of the basic block; its length is a lower bound on the makespan of any schedule for the basic block.

Methods for instruction scheduling for basic blocks (i.e., local scheduling) are simpler than global scheduling methods, because control flow in basic blocks is straightforward and can be ignored. Only data dependences and resource conflicts need to be taken into account. Interestingly, most basic blocks in real-world programs are quite small and consist of only a few instructions. However, program transformations such as function inlining, loop unrolling or predication can yield considerably larger basic blocks.

Traditionally, heuristic methods have been used for local instruction scheduling, mostly because of their fast optimization times. A simple and well-known heuristic technique is list scheduling (a code sketch follows below). List scheduling is based on topological sorting of the operations or instructions in the basic block’s data flow graph, taking the precedence constraints imposed by data dependences into account. The algorithm schedules nodes iteratively and maintains a list of data-ready nodes, the ready list. Initially, it consists of the leaves of the data flow graph, i.e., those nodes that do not depend on any other node and could be scheduled immediately. The nodes in the ready list are assigned priorities that could, for instance, be the estimated maximum accumulated latency on any path from that node to a root of the data flow graph. In each step, list scheduling picks, in a greedy way, as many highest-priority nodes as possible from the ready list that fit into the next issue packet and for which the resource requirements can be satisfied. The resource reservations of these issued nodes are then committed to the global resource usage map, and the issued nodes are removed from the data flow graph. Further nodes become data-ready in later steps, once the latencies from all their predecessors have elapsed. The ready list is updated accordingly, and the process repeats until all nodes have been scheduled. The above description is for forward scheduling; backward scheduling starts with the roots of the data flow graph and works analogously in reversed topological order.
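A compact Python sketch of greedy forward list scheduling, illustrative only: it assumes unit occupation times, a single latency value per dependence edge, and a simple issue-width test in place of full reservation tables.

  def list_schedule(nodes, deps, latency, priority, issue_width):
      # Greedy forward list scheduling over a data flow DAG.
      # nodes:    iterable of instruction ids
      # deps:     list of (producer, consumer) dependence edges
      # latency:  dict {(producer, consumer): cycles}
      # priority: dict {node: priority}, e.g., longest path to a root
      # Returns a dict {node: issue cycle}.
      preds = {n: [] for n in nodes}
      for p, c in deps:
          preds[c].append(p)
      issue_time, t = {}, 0
      while len(issue_time) < len(preds):
          # Data-ready nodes: all predecessors issued and latencies elapsed.
          ready = [n for n in preds if n not in issue_time
                   and all(p in issue_time
                           and issue_time[p] + latency[(p, n)] <= t
                           for p in preds[n])]
          ready.sort(key=lambda n: -priority[n])
          for n in ready[:issue_width]:   # fill the issue packet greedily
              issue_time[n] = t
          t += 1                          # next issue cycle
      return issue_time

  # Tiny example: a and b independent, c consumes both (latency 2 resp. 1).
  deps = [("a", "c"), ("b", "c")]
  lat = {("a", "c"): 2, ("b", "c"): 1}
  prio = {"a": 3, "b": 2, "c": 1}
  print(list_schedule(["a", "b", "c"], deps, lat, prio, issue_width=2))
  # {'a': 0, 'b': 0, 'c': 2}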
Another technique is critical path scheduling. First, a critical path of the basic block is determined; the nodes on that path are removed from the data flow graph and scheduled in topological order, each in its own issue packet. Then a critical path of the residual data flow graph is determined, and so on; this process is repeated until all nodes of the data flow graph have been scheduled. If there is no appropriate free slot in an issue packet to accommodate a node to be scheduled, a new issue packet is inserted.

Time-optimal instruction scheduling for basic blocks is NP-complete for almost any nontrivial target architecture, including most VLIW architectures. For special combinations of simple target architectures and restricted data flow graph topologies, such as trees, polynomial-time optimal scheduling algorithms are known. In the last decade, more expensive optimal methods for local instruction scheduling have become more and more popular, driven by (1) the need to generate high-quality code for embedded applications, (2) the fact that modern computers offer the compiler many more CPU cycles that can be spent on advanced optimizations, and (3) advances in general optimization problem solver technology, especially for integer linear programming. For local instruction scheduling on general acyclic data flow graphs, optimal algorithms based on integer linear programming [61][85][50][10][29], branch-and-bound [20][44], constraint logic programming [9] and dynamic programming [54][52] have been developed. Also, more expensive heuristic optimization techniques such as genetic programming [91][65][29] have been used successfully. Even if the scope limitation to a single basic block is too restrictive in practice, local instruction scheduling techniques are nevertheless significant, because they are also used in several global scheduling algorithms for larger acyclic code regions, and even in certain cyclic scheduling algorithms, which we will discuss in Section 7.2.
7.2 Modulo scheduling for loops
Most DSP programs spend most of their execution time in some (inner) loops. Efficient loop scheduling techniques are therefore key to high code quality.

Loop unrolling is a simple transformation that can extend the scope of a local scheduler (and of other code optimizations) beyond iteration boundaries, such that independent instructions from different iterations can be scheduled in parallel. However, loop unrolling increases code size considerably, which is often undesirable in embedded applications.

Software pipelining is a technique that overlaps the execution of subsequent loop iterations such that independent instructions from different iterations can be scheduled in parallel on an instruction-level parallel architecture, without having to replicate the loop body code as in unrolling.
Like most scheduling problems with resource and dependence constraints, (rate-)optimal software pipelining is NP-complete. Software pipelining has been researched intensively, both as a high-level loop transformation (performed in the middle end of a compiler, or even as a source-to-source program transformation) and as a low-level optimization late in the code generation process (performed in the back end of a compiler), after instruction selection with resource allocation has been performed. The former approaches are independent of the particular instructions and functional units eventually selected for the operations in the loop, and thus have to rely on inaccurate cost estimates for execution time, energy, or register pressure when comparing alternatives, while the actual cost also depends on decisions made late in the code generation process. The latter approaches are bound to fixed instructions and functional units, and hence the flexibility of implementing the same abstract operation by a variety of different target machine instructions, with different resource requirements and latency behavior, is lost. In either case, optimization opportunities are missed because interdependent problems are solved separately in different compiler phases. Approaches to integrating software pipelining with other code generation tasks will be discussed in Section 8.
    (a) Original loop:              (c) After software pipelining:

        FOR i FROM 1 TO N DO            A(1);
          A(i);                         FOR i FROM 1 TO N-1 DO
          B(i);                           B(i);
          C(i);                           C(i) || A(i+1);
        END DO                          END DO
                                        B(N); C(N);

    (b) Dependence graph: loop-independent edges A -> B and A -> C;
        loop-carried edge B -> A with distance +1.

Fig. 11 Simple example: (a) Original loop, where A(i), B(i), C(i) denote operations in the loop body that may compete for common resources (in this example B(i) and C(i)) and that may involve both loop-independent data dependences, here A(i) → B(i) and A(i) → C(i), and loop-carried data dependences, here B(i) → A(i+1); see the dependence graph in (b). (c) After software pipelining.
Software pipelining, also called cyclic scheduling, transforms a loop such as that in Fig. 11(a) into an equivalent loop whose body contains operations from different iterations of the original loop, such as that in Fig. 11(c). This may result in faster code on an instruction-level parallel architecture since, e.g., C(i) and A(i+1) can be executed in parallel (||) because they are statically known to be independent of each other and do not compete for the same hardware resources. This parallel execution was not possible in the original version of the loop because the code generator usually treats the loop body (a basic block) as a unit for scheduling and resource allocation, and furthermore separates the code for C(i) and A(i+1) by a backward branch to the loop entry. The body of the transformed loop is called the kernel; the operations before the kernel that "fill" the pipeline (here A(1)) are called the prologue; and the operations after the kernel that "drain" the pipeline (here B(N) and C(N)) are called the epilogue of the software-pipelined loop. Software pipelining thus overlaps in the new kernel the execution of operations originating from different iterations of the original loop, as far as permitted by the given dependence and
resource constraints, in order to expose more opportunities for parallel execution on instruction-level parallel architectures, such as superscalar, VLIW or EPIC processors. Software pipelining can be combined with loop unrolling.

In their survey of software pipelining methods, Allan et al. [3] divide existing approaches into two general classes. Based on a lower bound determined by analyzing dependence distances, latencies, and resource requirements, the modulo scheduling methods, as introduced by Rau and Glaeser [78] and refined in several approaches [56][64], first guess the kernel size (in terms of clock cycles), called the initiation interval (II), and then fill the instructions of the original loop body into a modulo reservation table of size II, which produces the new kernel. If no such modulo schedule can be found for the assumed II, the kernel is enlarged by incrementing II, and the procedure is repeated. The kernel-detection methods, such as those by Aiken and Nicolau [2] (no resource constraints) and Vegdahl [84], continuously peel off iterations from the loop and schedule their operations until a pattern for a steady state emerges, from which the kernel is constructed.

Modulo scheduling starts with an initial initiation interval given by the lower bound MinII (minimum initiation interval), which is the maximum of the recurrence-based minimum initiation interval (RecMinII) and the resource-based minimum initiation interval (ResMinII). RecMinII is the maximum, over all dependence cycles in the dependence graph, of the accumulated sum of latencies along the cycle divided by the number of iterations spanned by the cycle; if there is no such cycle, RecMinII is 0. ResMinII is the maximum accumulated number of reserved slots on any resource in the loop body. (These bounds are summarized in formula form below.) Modulo scheduling attempts to find a valid modulo schedule by filling all instructions into the modulo reservation table for the current II value. Priority is usually given to dependence cycles in decreasing order of accumulated latency per accumulated distance. If the first attempt fails, most heuristic methods allow backtracking for a limited number of further attempts. An exhaustive search is usually not feasible because of the high problem complexity. Instead, if no attempt was successful, II is incremented and the procedure is repeated with a modulo reservation table enlarged by one slot. As there exists a trivial upper bound for II (namely, the accumulated sum of all latencies in the loop body), this iterative method will eventually find a modulo schedule.

The main goal of software pipelining is to maximize throughput by minimizing II, i.e., rate-optimal software pipelining. Minimizing the makespan (the elapsed time between the first instruction issue and the last instruction termination) of a single loop iteration in the modulo-scheduled loop is often a secondary optimization goal, because it directly determines the length of prologue and epilogue and thereby affects code size (unless special hardware support for rotating predicate registers makes it possible to represent prologue and epilogue code implicitly with the predicated kernel code).
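In symbols, using our own notation (the chapter states these bounds only in prose): let ℓ(e) denote the latency and d(e) the distance (in iterations) of a dependence edge e, and let u_r denote the number of slots one iteration of the loop body reserves on resource r. Then

\[
\mathit{MinII} = \max(\mathit{RecMinII},\, \mathit{ResMinII}), \qquad
\mathit{RecMinII} = \max_{\text{dep.\ cycles } C} \frac{\sum_{e \in C} \ell(e)}{\sum_{e \in C} d(e)}, \qquad
\mathit{ResMinII} = \max_{\text{resources } r} u_r ,
\]

with RecMinII = 0 if the dependence graph is acyclic. Note that RecMinII, and hence MinII, may be fractional; in practice the search then starts at the next integer II, unless the loop is unrolled first (see below).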
Register allocation for software-pipelined loops is another challenge. Software pipelining tends to increase register pressure. If a live range is longer than II cycles, it will interfere with itself (e.g., with its instance starting in the next iteration of the kernel), so a single register will not be sufficient; special care has to be taken to access the "right" one at any time. There are two kinds of techniques for such self-overlapping live ranges: hardware-based techniques, such as rotating register sets and register queues, and software techniques such as modulo variable expansion [56] and live range splitting [81]. With modulo variable expansion, the kernel is unrolled and symbolic registers are renamed until no live range self-overlaps any more: if ℓ denotes the maximum length of a self-overlapping live range, the required unroll factor is u = ⌈ℓ/II⌉, and the new initiation interval of the expanded kernel is II′ = u · II (for instance, with II = 2 and a longest self-overlapping live range of ℓ = 5 cycles, the kernel is unrolled u = 3 times and II′ = 6). The drawback of modulo variable expansion is increased code size and increased register need. An alternative approach is to avoid self-overlapping live ranges a priori by splitting long live ranges on dependence cycles into shorter ones, inserting copy instructions. Optimal methods for software pipelining based on integer linear programming have been proposed, e.g., by Badia et al. [7], Govindarajan et al. [40] and Yang et al. [89]. Combinations of modulo scheduling with other code generation tasks will be discussed in Section 8. Software pipelining is often combined with loop unrolling: especially if the lower bound MinII is a non-integer value, unrolling before software pipelining can improve throughput. Moreover, loop unrolling reduces loop overhead (at least on processors that do not have hardware support for zero-overhead loops). The downside is larger code size.
7.3 Global instruction scheduling

Basic blocks are the units of (procedure-)global control flow analysis. The basic block graph of a program is a directed graph whose nodes correspond to the basic blocks and whose edges show control flow transitions between basic blocks, such as branches or fall-through transitions to branch targets. Global instruction scheduling methods consider several basic blocks at a time and make it possible to move instructions between basic blocks. The (current) scope of a global scheduling method is referred to as a region. Regions used for global scheduling include traces, superblocks, hyperblocks and treegions. Local scheduling methods are extended to address entire regions. Because the scope is larger, global scheduling has more flexibility and may generate better code than local scheduling. Program transformations such as function inlining, loop unrolling or predication can be applied to further increase the size of basic blocks and regions. The idea of trace scheduling [31] is to make the most frequently executed control flow paths fast while accepting possible performance degradations along less frequently used paths. Execution frequencies are assigned to the outgoing edges at branch instructions based on static predictions or on profile data. A trace is a linear path (i.e., free of backward edges and thereby of loops) through the basic block graph, where, at each basic block Bi in the trace except for the last one, the successor of Bi in the trace is the target of the more frequently executed control flow edge leaving Bi.
Fig. 12 Traces (shaded) in a basic block graph, constructed and numbered T1, T2, ... in order of decreasing predicted execution frequency. A trace ends at a backwards branch or at a join point with another trace of higher execution frequency (which thus was constructed earlier). Trace T1 represents the more frequent control path in the inner loop starting at basic block B4.
Traces may have side entrances and side exits, i.e., control flow may enter or leave the trace at inner blocks. Forward edges within the trace are possible, and likewise backward edges to the first block of the trace. See Fig. 12 for an example. Trace scheduling repeatedly identifies a maximum-length trace in the basic block graph, removes its basic blocks from the graph and schedules the instructions of the trace with a local scheduling method as if it were a single basic block. When an instruction is moved across a control flow transition, either upwards or downwards, the correctness of the program must be re-established by inserting compensation code into the other predecessor or successor block of the instruction's original basic block, respectively. Two of the possible cases are shown in Fig. 13. The insertion of compensation code may lead to considerable code size expansion on the less frequently executed paths. After the trace has been scheduled, its basic blocks are removed from the basic block graph, the next trace is determined, and this process is repeated until all basic blocks of the program have been scheduled.
Fig. 13 Two of the main cases in trace scheduling where compensation code must be inserted. The trace T is being compacted. Case (a): Hoisting instruction i2 into the predecessor basic block B requires inserting a copy i2′ of i2 into the other predecessor block(s). Case (b): Moving instruction i1 forward across the branch instruction i2 requires inserting a copy i1′ of i1 into the other branch target basic block.
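The following small C example (our own, not taken from the figure) illustrates the idea behind case (a): when an operation is hoisted from a join block into the on-trace predecessor, a compensation copy must be placed on the off-trace path so that all paths still compute the value.

    /* Before: t = c*d sits in the join block and executes on every path. */
    int before(int x, int y, int c, int d, int cond) {
        int a, t;
        if (cond) a = x + y;        /* on-trace block T  */
        else      a = x - y;        /* off-trace block B */
        t = c * d;                  /* join block        */
        return a + t;
    }

    /* After hoisting t = c*d into the trace: a compensation copy is
       needed on the off-trace path, or t would be undefined there. */
    int after(int x, int y, int c, int d, int cond) {
        int a, t;
        if (cond) { a = x + y; t = c * d; }   /* T with hoisted operation  */
        else      { a = x - y; t = c * d; }   /* B with compensation copy  */
        return a + t;
    }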
Superblocks [48] are a restricted form of traces that do not contain any side entrances (except possibly for backward edges to the first block). This restriction simplifies the generation of compensation code in trace scheduling. A trace can be converted into a superblock by replicating its tail parts that are targets of branches from outside the trace. Tail duplication is a form of generating compensation code ahead of scheduling, and can likewise increase code size considerably. While traces and superblocks are linear chains of basic blocks, hyperblocks [66] are regions of the basic block graph with a common entry block, possibly several exit blocks, and acyclic internal control flow. Using predication, the different control flow paths in a hyperblock can be merged into a single superblock. A treegion [45], also known as an extended basic block [69], is an out-tree region in the basic block graph; there are no side entrances of control flow into a treegion, except to its root basic block. Recently, optimal methods for global instruction scheduling on instruction-level parallel processors have become popular. Winkel used integer linear programming for optimal global scheduling for Intel IA-64 (Itanium) processors [87] and showed that it can be used in a production compiler [88]. Malik et al. [67] proposed a constraint programming approach for optimal scheduling of superblocks.
7.4 Generated instruction schedulers

Whenever a forward scheduling algorithm such as list scheduling appends another data-ready instruction to an already computed partial schedule, it needs to fit the required resource reservations of the new instruction against the already committed resource reservations, and likewise obey pending latencies of predecessor instructions where necessary, in order to derive the earliest possible issue time for the new instruction relative to the issue time of the last instruction in the partial schedule. While the impact of dependence predecessors can be checked quickly, determining that issue time offset is more involved with respect to the resource reservations. The latter calculation could be done, for instance, by searching through the partial schedule's resource usage map, resource by resource. For advanced
scheduling methods that try many alternatives, faster methods for detecting resource conflicts and computing the issue time offset are desirable. Note that the new instruction's issue time relative to the currently last instruction of the partial schedule depends only on the most recent resource reservations, not on the entire partial schedule. Each possible content of this still relevant, recent part of the pipeline can be interpreted as a pipeline state, and appending an instruction results in a new state and an issue time offset for the new instruction, so that scheduling can be described by a finite state automaton. The initial state is an empty pipeline. The set of possible states and the set of possible transitions depend only on the processor, not on the input program. Hence, once all possible states have been determined and encoded, and all possible transitions with their effects on successor state and issue time have been precomputed once and for all, scheduling can be done very fast by looking up the issue time offset and the new state in the precomputed transition table for the current state and the inserted instruction. This automaton-based approach was introduced by Müller [70] and was improved and extended in several works [77][8][26]. The automaton can be generated automatically from a formal description of the processor's instruction set with the instructions' reservation tables. A drawback is that the number of possible states and transitions can be tremendous, but techniques exist to reduce the size of the scheduling automaton, such as standard finite state machine minimization, automata factoring, and the replacement of several physical resources with equivalent contention behavior by a single virtual resource.
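A minimal sketch of the table lookup at the core of such a generated scheduler is shown below. It is our own illustration: the automaton (three states, two instruction classes) and its table contents are invented rather than derived from real reservation tables; in a generated scheduler the table would be produced offline from the processor description.

    #include <stdio.h>

    enum { STATES = 3, CLASSES = 2 };  /* hypothetical automaton size */

    typedef struct { int next_state; int issue_offset; } Transition;

    /* Precomputed offline (values made up here): e.g., issuing a class-1
       instruction in state 1 costs one extra cycle due to a structural hazard. */
    static const Transition table[STATES][CLASSES] = {
        {{1, 0}, {2, 0}},
        {{1, 0}, {0, 1}},
        {{2, 1}, {0, 0}},
    };

    int main(void) {
        int cls[] = {0, 1, 1, 0};   /* instruction classes to schedule      */
        int state = 0, cycle = 0;   /* state 0 models the empty pipeline    */
        for (int i = 0; i < 4; i++) {
            Transition t = table[state][cls[i]];
            cycle += t.issue_offset;  /* O(1) lookup replaces searching the  */
            state  = t.next_state;    /* partial schedule's resource maps    */
            printf("instr %d (class %d) issued at cycle %d\n", i, cls[i], cycle);
        }
        return 0;
    }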
8 Integrated code generation for VLIW and clustered VLIW

In most compilers, the subproblems of code generation are treated separately in subsequent phases of the compiler back end. This is easier from a software engineering point of view, but often leads to suboptimal results because decisions made in earlier phases constrain the later phases. For instance, early instruction scheduling determines the live ranges for a subsequent register allocator; when the number of physical registers is not sufficient, spill code must be inserted a posteriori into the existing schedule, which may compromise the schedule's quality. Conversely, early register allocation introduces additional ("false") data dependences, which are an artifact of register reuse but constrain the subsequent instruction scheduling phase (see the small example below). Interdependences also exist between instruction scheduling and instruction selection: in order to formulate instruction selection as a separate minimum-cost covering problem, phase-decoupled code generation assigns a fixed, context-independent cost to each instruction, such as its expected or worst-case execution time, whereas the actual cost also depends on interference with the resource occupation and latency constraints of other instructions, which depends on the schedule. For instance, a potentially concurrent execution of two independent IR operations may be prohibited if instructions are selected that require the same resource.
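The following tiny C fragment (our own illustration) shows how reusing one physical register serializes otherwise independent computations:

    /* If the register allocator maps both temporaries t1 and t2 to the same
       physical register (modeled here by the single variable r1), the two
       independent add/multiply pairs become ordered by anti and output
       dependences on r1 and can no longer be scheduled in parallel. */
    void false_dep(int a, int b, int c, int d, int *u, int *v) {
        int r1;        /* one physical register, reused for t1 and t2        */
        r1 = a + b;    /* t1 */
        *u = r1 * 2;
        r1 = c + d;    /* t2 overwrites r1: must not execute before *u reads it */
        *v = r1 * 2;
    }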
Fig. 14 Phase-decoupled vs. fully integrated code generation for clustered VLIW processors. Only the four main tasks of code generation (cluster assignment, instruction selection, instruction scheduling, and register allocation) are shown. While often performed in just this order, many phase orderings are possible in phase-decoupled code generation, visualized by the paths along the edges of the four-dimensional hypercube from the processor-independent low-level IR to final target code. Fully integrated code generation solves all tasks simultaneously as a monolithic combined optimization problem, thus following the main diagonal.
Even the subdivision of instruction scheduling into separate phases for sequencing and compaction can have negative effects on schedule quality if instructions with non-block reservation tables occur [53]. Furthermore, on clustered VLIW processors, concurrent execution may be possible only if the operands reside in the right register sets at the right time, as discussed earlier. While instruction scheduling and register allocation need to know about the cluster assignment of instructions and data, cluster assignment could profit from knowing about free slots where transfer instructions could be placed, or free registers where transferred copies could be held. Any phase-decoupled approach may result in poor code quality because the later phases are constrained by decisions made in the earlier ones. Hence, integrating these subproblems and solving them as a single optimization problem, as visualized in Fig. 14, is highly desirable, but unfortunately this increases the overall complexity of code generation considerably. Despite recent improvements in general optimization problem solver technology, this ambitious approach is limited in scope to basic blocks and loops. Other methods take a more conservative approach based on a phase-decoupled code generator and heuristically make an early phase aware of possibly conflicting goals of later phases. For instance, register pressure aware scheduling methods, which trade less instruction-level parallelism for shorter live ranges where register pressure is predicted to be high, can lead to better register allocations with less spill code later.
Integrated code generation at basic block level

There exist several heuristic approaches that aim at a better integration of instruction scheduling and register allocation [13][34][38][55]. For the case of clustered VLIW processors, the heuristic algorithm proposed by Kailas et al. [49] integrates cluster assignment, register allocation, and instruction scheduling; heuristic methods that integrate instruction scheduling and cluster assignment were proposed by Özer et al. [75], Leupers [59] and by Nagpal and Srikant [72]. For the computationally intensive kernels of DSP application programs to be used in an embedded product throughout its lifetime, the manufacturer is often willing to spend a significant amount of time on optimizing the code during the final compilation. However, there are only a few approaches that have the potential, given sufficient time and space resources, to compute an optimal solution to an integrated problem formulation, mostly combining local scheduling and register allocation [9][50][57]. Some of these approaches are also able to partially integrate instruction selection, even though for rather restricted machine models. For instance, Wilson et al. [86] consider architectures with a single non-pipelined ALU, two non-pipelined parallel load/store/move units, and a homogeneous set of general-purpose registers. Araujo and Malik [6] consider integrated code generation for expression trees with a machine model where the capacity of each sort of memory resource (register classes or memory blocks) is either one or infinity, a class that includes, for instance, the TI C25. The integrated method adopted in the retargetable framework AVIV [44] for clustered VLIW architectures builds an extended data flow graph representation of the basic block that explicitly represents all alternatives for implementation; then, a branch-and-bound heuristic selects among all representations an alternative that is optimized for code size. Chang et al. [18] use integer linear programming for combined instruction scheduling and register allocation with spill code generation for non-pipelined, non-clustered multi-issue architectures. Kessler and Bednarski [52] propose a dynamic programming algorithm for fully integrated code generation for clustered and non-clustered VLIW architectures at the basic block level, which was implemented in the retargetable integrated code generator OPTIMIST. Bednarski and Kessler [10] and Eriksson et al. [29] solve the problem with integer linear programming; the latter work also gives a heuristic approach based on a genetic algorithm.

Loop-level integrated code generation

There exist several heuristic algorithms for modulo scheduling that attempt to reduce register pressure, such as Hypernode Resource Modulo Scheduling [64] and Swing Modulo Scheduling [63]. Codina et al. [21] give a heuristic method for modulo scheduling integrated with register allocation and spill code generation for
clustered VLIW processors. Zalamea et al. [90] consider the integration of register pressure aware modulo scheduling with register allocation, cluster assignment and spilling for clustered VLIW processors and present an iterative heuristic algorithm with backtracking. Stotzer and Leiss [81] propose a preprocessing transformation for modulo scheduling for the 'C6x clustered VLIW DSP architecture that attempts to reduce self-overlapping cyclic live ranges and thereby eliminate the need for modulo variable expansion or rotating register files. Eisenbeis and Sawaya [27] propose an integer linear programming method for modulo scheduling integrated with register allocation, which gives optimal results if the number of schedule slots is fixed. Nagarakatte and Govindarajan [71] provide an optimal method for integrating register allocation and spill code generation with modulo scheduling for non-clustered architectures. Eriksson and Kessler [28] give an integer linear programming method for optimal, fully integrated code generation for loops, combining modulo scheduling with instruction selection, cluster assignment, register allocation and spill code generation, for clustered VLIW architectures.
9 Concluding remarks

Compilers for VLIW DSP processors need to apply a considerable amount of advanced optimization to achieve code quality comparable to hand-written code. Current advances in general optimization problem solver technology are encouraging, and heuristic techniques developed for standard compilers are being complemented by more aggressive optimizations. For small and medium-sized program parts, even optimal solutions are within reach. Also, most problems in code generation are strongly interdependent and should be considered together in an integrated or at least phase-coupled way to avoid poor code quality due to phase ordering effects. We expect further improvements in optimized and integrated code generation techniques for VLIW DSPs in the near future.
Trademarks and Acknowledgements

Trademarks: C62x, C64x, C67x, VelociTI and TMS320C62x are trademarks of Texas Instruments. Itanium is a trademark of Intel. ST200 is a trademark of STMicroelectronics. TigerSHARC is a trademark of Analog Devices. TriMedia is a trademark of NXP.

Acknowledgements: The author thanks Mattias Eriksson and Dake Liu for discussions and for commenting on a draft of this chapter. The author also thanks Eric Stotzer from Texas Instruments for interesting discussions about code generation for the TI 'C6x DSP processor family. This work was funded by Vetenskapsrådet (project Integrated Software Pipelining) and SSF (project DSP platform for emerging telecommunication and multimedia).
References

1. Alfred V. Aho, Mahadevan Ganapathi, and Steven W.K. Tjiang. Code Generation Using Tree Matching and Dynamic Programming. ACM Transactions on Programming Languages and Systems, 11(4):491–516, October 1989. 2. Alexander Aiken and Alexandru Nicolau. Optimal loop parallelization. SIGPLAN Notices, 23(7):308–317, July 1988. 3. Vicki H. Allan, Reese B. Jones, Randall M. Lee, and Stephen J. Allan. Software pipelining. ACM Computing Surveys, 27(3), September 1995. 4. Analog Devices. TigerSHARC embedded processor ADSP-TS201S. Data sheet, www.analog.com/en/embedded-processing-dsp/tigersharc, 2006. 5. Andrew W. Appel and Lal George. Optimal Spilling for CISC Machines with Few Registers. In Proc. ACM conf. on Programming language design and implementation, pages 243–253. ACM Press, 2001. 6. Guido Araujo and Sharad Malik. Optimal code generation for embedded memory non-homogeneous register architectures. In Proc. 7th Int. Symposium on System Synthesis, pages 36–41, September 1995. 7. Rosa M. Badia, Fermin Sanchez, and Jordi Cortadella. OSP: Optimal Software Pipelining with Minimum Register Pressure. Technical Report UPC-DAC-1996-25, DAC Dept. d'arquitectura de Computadors, Univ. Polytecnica de Catalunya, Barcelona, Campus Nord. Modul D6, E08071 Barcelona, Spain, June 1996. 8. Vasanth Bala and Norman Rubin. Efficient instruction scheduling using finite state automata. In Proc. 28th int. symp. on microarchitecture (MICRO-28), pages 46–56. IEEE, 1995. 9. Steven Bashford and Rainer Leupers. Phase-coupled mapping of data flow graphs to irregular data paths. Design Automation for Embedded Systems (DAES), 4(2/3):1–50, 1999. 10. Andrzej Bednarski and Christoph Kessler. Optimal integrated VLIW code generation with integer linear programming. In Proc. Int. Euro-Par 2006 Conference. Springer LNCS, August 2006. 11. D. Bernstein, M.C. Golumbic, Y. Mansour, R.Y. Pinter, D.Q. Goldin, H. Krawczyk, and I. Nahshon. Spill code minimization techniques for optimizing compilers. In Proc. Int. Conf. on Progr. Lang. Design and Implem., pages 258–263, 1989. 12. F. Bouchez, A. Darte, C. Guillon, and F. Rastello. Register allocation: what does the NP-completeness proof of Chaitin et al. really prove? [...]. In Proc. 19th int. workshop on languages and compilers for parallel computing, New Orleans, November 2006. 13. Thomas S. Brasier, Philip H. Sweany, Steven J. Beaty, and Steve Carr. CRAIG: A practical framework for combining instruction scheduling and register assignment. In Proc. Int. Conf. on Parallel Architectures and Compilation Techniques (PACT'95), 1995. 14. Preston Briggs, Keith Cooper, Ken Kennedy, and Linda Torczon. Coloring heuristics for register allocation. In Proc. Int. Conf. on Progr. Lang. Design and Implem., pages 275–284, 1989. 15. Preston Briggs, Keith Cooper, and Linda Torczon. Rematerialization. In Proc. Int. Conf. on Progr. Lang. Design and Implem., pages 311–321, 1992. 16. Philip Brisk, Ajay K. Verma, and Paolo Ienne. Optimistic chordal coloring: a coalescing heuristic for SSA form programs. Des. Autom. Embed. Syst., 13:115–137, 2009. 17. G.J. Chaitin, M.A. Auslander, A.K. Chandra, J. Cocke, M.E. Hopkins, and P.W. Markstein. Register allocation via coloring. Computer Languages, 6:47–57, 1981. 18. Chia-Ming Chang, Chien-Ming Chen, and Chung-Ta King. Using integer linear programming for instruction scheduling and register allocation in multi-issue processors. Computers & Mathematics with Applications, 34(9):1–14, 1997. 19.
Chung-Kai Chen, Ling-Hua Tseng, Shih-Chang Chen, Young-Jia Lin, Yi-Ping You, Chia-Han Lu, and Jenq-Kuen Lee. Enabling compiler flow for embedded VLIW DSP processors with distributed register files. In Proc. LCTES’07, pages 146–148. ACM, 2007. 20. Hong-Chich Chou and Chung-Ping Chung. An Optimal Instruction Scheduler for Superscalar Processors. IEEE Trans. on Parallel and Distr. Syst., 6(3):303–313, 1995.
21. Josep M. Codina, Jesus Sánchez, and Antonio González. A unified modulo scheduling and register allocation technique for clustered processors. In Proc. PACT-2001, September 2001. 22. Edward S. Davidson, Leonard E. Shar, A. Thampy Thomas, and Janak H. Patel. Effective control for pipelined computers. In Proc. Spring COMPCON75 Digest of Papers, pages 181–184. IEEE Computer Society, February 1975. 23. Giuseppe Desoli. Instruction assignment for clustered VLIW DSP compilers: a new approach. Technical Report HPL-98-13, HP Laboratories Cambridge, February 1998. 24. Dietmar Ebner. SSA-based code generation techniques for embedded architectures. PhD thesis, Technische Universität Wien, Vienna, Austria, June 2009. 25. Erik Eckstein, Oliver König, and Bernhard Scholz. Code instruction selection based on SSA-graphs. In A. Krall, editor, Proc. SCOPES-2003, Springer LNCS 2826, pages 49–65, 2003. 26. Alexandre E. Eichenberger and Edward S. Davidson. A reduced multipipeline machine description that preserves scheduling constraints. In Proc. PLDI'96, pages 12–22, New York, NY, USA, 1996. ACM Press. 27. Christine Eisenbeis and Antoine Sawaya. Optimal loop parallelization under register constraints. In Proc. 6th Workshop on Compilers for Parallel Computers (CPC'96), pages 245–259, December 1996. 28. Mattias Eriksson and Christoph Kessler. Integrated modulo scheduling for clustered VLIW architectures. In Proc. HiPEAC-2009 High-Performance and Embedded Architecture and Compilers, Paphos, Cyprus, pages 65–79. Springer LNCS 5409, January 2009. 29. Mattias Eriksson, Oskar Skoog, and Christoph Kessler. Optimal vs. heuristic integrated code generation for clustered VLIW architectures. In Proc. 11th int. workshop on software and compilers for embedded systems (SCOPES'08). ACM, 2008. 30. M. Anton Ertl. Optimal Code Selection in DAGs. In Proc. Int. Symposium on Principles of Programming Languages (POPL'99). ACM, 1999. 31. Joseph A. Fisher. Trace scheduling: A technique for global microcode compaction. IEEE Trans. Comput., C–30(7):478–490, July 1981. 32. Joseph A. Fisher, Paolo Faraboschi, and Cliff Young. Embedded computing: a VLIW approach to architecture, compilers and tools. Elsevier / Morgan Kaufmann, 2005. 33. Christopher W. Fraser, David R. Hanson, and Todd A. Proebsting. Engineering a Simple, Efficient Code Generator Generator. Letters of Programming Languages and Systems, 1(3):213–226, September 1992. 34. Stefan M. Freudenberger and John C. Ruttenberg. Phase ordering of register allocation and instruction scheduling. In Code Generation: Concepts, Tools, Techniques [36], pages 146–170, 1992. 35. Anup Gangwar, M. Balakrishnan, and Anshul Kumar. Impact of intercluster communication mechanisms on ILP in clustered VLIW architectures. ACM Trans. Des. Autom. Electron. Syst., 12(1):1, 2007. 36. Robert Giegerich and Susan L. Graham, editors. Code Generation - Concepts, Tools, Techniques. Springer Workshops in Computing, 1992. 37. R.S. Glanville and S.L. Graham. A New Method for Compiler Code Generation. In Proc. Int. Symposium on Principles of Programming Languages, pages 231–240, January 1978. 38. James R. Goodman and Wei-Chung Hsu. Code scheduling and register allocation in large basic blocks. In Proc. ACM Int. Conf. on Supercomputing, pages 442–452. ACM press, July 1988. 39. David W. Goodwin and Kent D. Wilken. Optimal and near-optimal global register allocations using 0–1 integer programming. Softw. Pract. Exper., 26(8):929–965, 1996. 40. R. Govindarajan, Erik Altman, and Guang Gao.
A framework for resource-constrained rate-optimal software pipelining. IEEE Trans. Parallel and Distr. Syst., 7(11):1133–1149, November 1996. 41. Daniel Grund and Sebastian Hack. A fast cutting-plane algorithm for optimal coalescing. In Proc. 16th int. conf. on compiler construction, pages 111–125, March 2007. 42. Sebastian Hack and Gerhard Goos. Optimal register allocation for SSA-form programs in polynomial time. Information Processing Letters, 98:150–155, 2006.
43. Todd Hahn, Eric Stotzer, Dineel Sule, and Mike Asal. Compilation strategies for reducing code size on a VLIW processor with variable length instructions. In Proc. HiPEAC'08 conference, pages 147–160. Springer LNCS 4917, 2008. 44. Silvina Hanono and Srinivas Devadas. Instruction selection, resource allocation, and scheduling in the AVIV retargetable code generator. In Proc. Design Automation Conf. ACM, 1998. 45. W. A. Havanki. Treegion scheduling for VLIW processors. M.S. thesis, Dept. Electrical and Computer Engineering, North Carolina State Univ., Raleigh, NC, USA, 1997. 46. L.P. Horwitz, R. M. Karp, R. E. Miller, and S. Winograd. Index register allocation. Journal of the ACM, 13(1):43–61, January 1966. 47. Wei-Chung Hsu, Charles N. Fischer, and James R. Goodman. On the minimization of loads/stores in local register allocation. IEEE Trans. Softw. Eng., 15(10):1252–1262, October 1989. 48. Wen-Mei Hwu, Scott A. Mahlke, William Y. Chen, Pohua P. Chang, Nancy J. Warter, Roger A. Bringmann, Roland G. Ouellette, Richard E. Hank, Tokuzo Kiyohara, Grant E. Haab, John G. Holm, and Daniel M. Lavery. The superblock: an effective technique for VLIW and superscalar compilation. J. Supercomput., 7(1-2):229–248, 1993. 49. Krishnan Kailas, Kemal Ebcioglu, and Ashok Agrawala. CARS: A new code generation framework for clustered ILP processors. In Proc. 7th Int. Symp. on High-Performance Computer Architecture (HPCA'01), pages 133–143. IEEE Computer Society, June 2001. 50. Daniel Kästner. Retargetable Postpass Optimisations by Integer Linear Programming. PhD thesis, Universität des Saarlandes, Saarbrücken, Germany, 2000. 51. Christoph Kessler and Andrzej Bednarski. Optimal integrated code generation for clustered VLIW architectures. In Proc. ACM SIGPLAN Conf. on Languages, Compilers and Tools for Embedded Systems / Software and Compilers for Embedded Systems, LCTES-SCOPES'2002. ACM, June 2002. 52. Christoph Kessler and Andrzej Bednarski. Optimal integrated code generation for VLIW architectures. Concurrency and Computation: Practice and Experience, 18:1353–1390, 2006. 53. Christoph Kessler, Andrzej Bednarski, and Mattias Eriksson. Classification and generation of schedules for VLIW processors. Concurrency and Computation: Practice and Experience, 19:2369–2389, 2007. 54. Christoph W. Keßler. Scheduling Expression DAGs for Minimal Register Need. Computer Languages, 24(1):33–53, September 1998. 55. Tokuzo Kiyohara and John C. Gyllenhaal. Code scheduling for VLIW/superscalar processors with limited register files. In Proc. MICRO-25. IEEE CS Press, 1992. 56. Monica Lam. Software pipelining: An effective scheduling technique for VLIW machines. In Proc. CC'88, pages 318–328, July 1988. 57. Rainer Leupers. Retargetable Code Generation for Digital Signal Processors. Kluwer, 1997. 58. Rainer Leupers. Code Optimization Techniques for Embedded Processors. Kluwer, 2000. 59. Rainer Leupers. Instruction scheduling for clustered VLIW DSPs. In Proc. PACT'00 int. conference on parallel architectures and compilation. IEEE Computer Society, 2000. 60. Rainer Leupers and Steven Bashford. Graph-based code selection techniques for embedded processors. ACM TODAES, 5(4):794–814, October 2000. 61. Rainer Leupers and Peter Marwedel. Time-constrained code compaction for DSPs. IEEE Transactions on VLSI Systems, 5(1):112–122, 1997. 62. Dake Liu. Embedded DSP processor design. Morgan Kaufmann, 2008. 63. Josep Llosa, Antonio Gonzalez, Mateo Valero, and Eduard Ayguade. Swing Modulo Scheduling: A Lifetime-Sensitive Approach. In Proc.
PACT'96 conference, pages 80–86. IEEE, 1996. 64. Josep Llosa, Mateo Valero, Eduard Ayguade, and Antonio Gonzalez. Hypernode reduction modulo scheduling. In Proc. MICRO-28, 1995. 65. M. Lorenz and P. Marwedel. Phase coupled code generation for DSPs using a genetic algorithm. In Proc. conf. on design automation and test in Europe (DATE'04), pages 1270–1275, 2004. 66. Scott A. Mahlke, David C. Lin, William Y. Chen, Richard E. Hank, and Roger A. Bringmann. Effective compiler support for predicated execution using the hyperblock. In Proc. MICRO-25, pages 45–54, December 1992.
67. Abid M. Malik, Michael Chase, Tyrel Russell, and Peter van Beek. An application of constraint programming to superblock instruction scheduling. In Proc. 14th Int. Conf. on Principles and Practice of Constraint Programming, pages 97–111, September 2008. 68. Waleed M. Meleis and Edward D. Davidson. Dual-issue scheduling with spills for binary trees. In Proc. 10th ACM-SIAM Symposium on Discrete Algorithms, pages 678 – 686. Society for Industrial and Applied Mathematics, Philadelphia, PA, USA, 1999. 69. Steven S. Muchnick. Advanced Compiler Design and Implementation. Morgan Kaufmann, 1997. 70. Thomas Müller. Employing finite automata for resource scheduling. In Proc. 26th int. symp. on microarchitecture (MICRO-26), pages 12–20. IEEE, December 1993. 71. S. G. Nagarakatte and R. Govindarajan. Register allocation and optimal spill code scheduling in software pipelined loops using 0-1 integer linear programming formulation. In Proc. int. conf. on compiler construction (CC-2007), pages 126–140. Springer LNCS 4420, 2007. 72. Rahul Nagpal and Y. N. Srikant. Integrated temporal and spatial scheduling for extended operand clustered VLIW processors. In Proc. 1st conf. on Computing Frontiers, pages 457– 470. ACM Press, 2004. 73. Steven Novack and Alexandru Nicolau. Mutation scheduling: A unified approach to compiling for fine-grained parallelism. In Proc. Workshop on compilers and languages for parallel computers (LCPC’94), pages 16–30. Springer LNCS 892, 1994. 74. NXP. Trimedia TM-1000. Data sheet, www.nxp.com, 1998. 75. Emre Ozer, Sanjeev Banerjia, and Thomas M. Conte. Unified assign and schedule: a new approach to scheduling for clustered register file microarchitectures. In Proc. 31st annual ACM/IEEE Int. Symposium on Microarchitecture, pages 308–315. IEEE CS Press, 1998. 76. Massimiliano Poletto and Vivek Sarkar. Linear scan register allocation. ACM Transactions on Programming Languages and Systems, 21(5), September 1999. 77. Todd A. Proebsting and Christopher W. Fraser. Detecting pipeline structural hazards quickly. In Proc. 21st symp. on principles of programming languages (POPL’94), pages 280–286. ACM Press, 1994. 78. B. Rau and C. Glaeser. Some scheduling techniques and an easily schedulable horizontal architecture for high performance scientific computing. In Proc. 14th Annual Workshop on Microprogramming, pages 183–198, 1981. 79. B. Ramakrishna Rau, Vinod Kathail, and Shail Aditya. Machine-description driven compilers for EPIC and VLIW processors. Design Automation for Embedded Systems, 4:71–118, 1999. Appeared also as technical report HPL-98-40 of HP labs, Sep. 1998. 80. Richard Scales. Software development techniques for the TMS320C6201 DSP. Texas Instruments Application Report SPRA481, www.ti.com, December 1998. 81. Eric J. Stotzer and Ernst L. Leiss. Modulo scheduling without overlapped lifetimes. In Proc. LCTES-2009, pages 1–10. ACM, June 2009. 82. Texas Instruments, Inc. TMS320C62x DSP CPU and instruction set reference guide. Document SPRU731, focus.ti.com/lit/ug/spru731, 2006. 83. Omri Traub, Glenn Holloway, and Michael D. Smith. Quality and Speed in Linear-scan Register Allocation. In Proc. ACM SIGPLAN Conf. on Progr. Lang. Design and Implem., pages 142–151, 1998. 84. Steven R. Vegdahl. A Dynamic-Programming Technique for Compacting Loops. In Proc. MICRO-25, pages 180–188. IEEE CS Press, 1992. 85. Kent Wilken, Jack Liu, and Mark Heffernan. Optimal instruction scheduling using integer programming. In Proc. Int. Conf. on Progr. Lang. Design and Implem., pages 121–133, 2000. 86. 
Tom Wilson, Gary Grewal, Ben Halley, and Dilip Banerji. An integrated approach to retargetable code generation. In Proc. Int. Symposium on High-Level Synthesis, pages 70–75, May 1994. 87. Sebastian Winkel. Optimal global instruction scheduling for the Itanium processor architecture. PhD thesis, Universität des Saarlandes, Saarbrücken, Germany, September 2004. 88. Sebastian Winkel. Optimal versus heuristic global code scheduling. In Proc. MICRO-40, pages 43–55, 2007.
89. Hongbo Yang, Ramaswamy Govindarajan, Guang R. Gao, George Cai, and Ziang Hu. Exploiting schedule slacks for rate-optimal power-minimum software pipelining. In Proc. Workshop on Compilers and Operating Systems for Low Power (COLP-2002), September 2002. 90. Javier Zalamea, Josep Llosa, Eduard Ayguade, and Mateo Valero. Modulo scheduling with integrated register spilling for clustered VLIW architectures. In Proc. ACM/IEEE MICRO-34, pages 160–169, 2001. 91. Thomas Zeitlhofer and Bernhard Wess. Operation scheduling for parallel functional units using genetic algorithms. In Proc. IEEE Int. Conf. on Acoustics, Speech, and Signal Processing (ICASSP'99), pages 1997–2000. IEEE Computer Society, 1999.
Software Compilation Techniques for MPSoCs

Rainer Leupers, Weihua Sheng, Jeronimo Castrillon
Abstract The increasing demands on future embedded systems, such as high performance and energy efficiency, have led to the emergence of heterogeneous Multiprocessor System-on-Chip (MPSoC) architectures. To fully exploit the power of these architectures, new tools are needed that cope with the increasing complexity of the software and achieve high productivity. An MPSoC compiler is the tool-chain that tackles the problems of expressing parallelism in application modeling/programming, mapping/scheduling, and generating the software to be distributed over an MPSoC platform for efficient usage, for a given (pre-)verified MPSoC platform. This chapter discusses various aspects of MPSoC compilers for heterogeneous MPSoC architectures, using a comparison to well-established uni-processor C compiler technology. After a brief introduction to MPSoCs and MPSoC compilers, the important ingredients of the compilation process, such as programming models, granularity and partitioning, platform description, mapping/scheduling and code generation, are explained in detail. As the topic is relatively young, a number of case studies from academia and industry are selected to illustrate the concepts at the end of this chapter.
Rainer Leupers Institute for Software for Systems on Silicon, RWTH Aachen University, SSS-611910, Templergraben 55, 52056 Aachen, Germany e-mail: [email protected] Weihua Sheng Institute for Software for Systems on Silicon, RWTH Aachen University, SSS-611910, Templergraben 55, 52056 Aachen, Germany e-mail: [email protected] Jeronimo Castrillon Institute for Software for Systems on Silicon, RWTH Aachen University, SSS-611910, Templergraben 55, 52056 Aachen, Germany e-mail: [email protected]
1 Introduction

1.1 MPSoCs and MPSoC Compilers

Lately it has become clear that Multiprocessor System-on-Chip (MPSoC) architectures are the most promising way to keep on exploiting the high level of integration provided by semiconductor technology while, at the same time, matching the constraints imposed by the embedded system market in terms of performance and power consumption. Consider today's standard cell phone: it integrates a great number of functions such as a camera, personal digital assistant applications, voice/data communications and multi-band wireless standards. Moreover, as for many other consumer electronics products, many non-functional parameters are equally critical for success in the market, e.g. energy consumption and form factor. All these requirements have led to the emergence of heterogeneous MPSoC architectures, which usually consist of programmable processors, special hardware accelerators and efficient Networks-on-Chip (NoCs) and run a large amount of software, in order to catch up with the next wave of integration. Compared with high-performance computing systems in supercomputers and computer clusters, embedded computing systems are subject to a different set of constraints that need to be taken into consideration in the design process:

• Real-time constraints: Real-time performance is key for embedded devices, especially in signal processing domains such as wireless and multimedia. Meeting real-time constraints requires not only hardware capable of meeting the demands of high-performance computation but also predictable behavior of the running applications.
• Energy efficiency: Most handheld devices are battery-powered, and energy efficiency is therefore one of the most important factors in system design.
• Area efficiency: Using the limited chip area efficiently is critical, especially for consumer electronics where portability is a must.
• Application/domain specificity: Unlike general-purpose computing products, embedded products usually target specific market segments, which in turn call for tailoring the system design to specific applications.

Given these design criteria, heterogeneous MPSoC architectures are believed to outperform previous uni-processor or homogeneous solutions. For a detailed discussion of the architectures, the reader is referred to Chapter 18. MPSoC design methodologies, also referred to as ESL (Electronic System-Level) tools, are growing in importance to tackle the challenge of exploring the exploding design space brought about by this heterogeneity [25]. Many different tools are required for completing a successful MPSoC design (or a series of MPSoC product generations, which is now often seen in industry, e.g. the Texas Instruments (TI) OMAP family [71]). The MPSoC compiler is one important tool among those, and it will be discussed in detail in this chapter.
Firstly, what is an MPSoC compiler? Today's compilers target uni-processors, and the design and implementation of specialized compilers optimized for the various processor types (RISC, DSP, VLIW, etc.) is well understood and practiced. The trend towards MPSoCs, however, raises the level of compiler complexity considerably. The problems of expressing parallelism in application modeling/programming, and of mapping, scheduling and generating the software to be distributed over an MPSoC platform for efficient usage, remain largely unaddressed [21]. We define the MPSoC compiler as the tool-chain that tackles these problems for a given (pre-)verified MPSoC platform. It is worth mentioning that our definition of the MPSoC compiler is subtly different from the term software synthesis as it appears in the hardware-software co-design community [26]. In that context, software synthesis emphasizes that, starting from a single high-level system specification, the tools perform hardware/software partitioning and automatically synthesize the software part so as to meet the system performance requirements of the specification. This flow is also called an application-driven "top-down" design flow. The MPSoC compiler is, however, used mostly in platform-based design, where system and semiconductor suppliers evolve MPSoC designs over generations, targeting a specific application domain, and programmers develop software for these platforms. The function of an MPSoC compiler is very close to that of a uni-processor compiler, which translates a high-level programming language (e.g. C) into machine binary code. The difference is that an MPSoC compiler needs to perform additional (and much more complex) jobs beyond the uni-processor one, such as partitioning/mapping, because the underlying MPSoC machine is orders of magnitude more complex. Though software synthesis and MPSoC compilers share some similarities (both generate code), the major difference is that they exist in the context of different methodologies and thus focus on different objectives. The rest of the chapter is organized as follows. Section 1.2 briefly introduces the challenges of building MPSoC compilers, comparing an MPSoC compiler to a uni-processor compiler, followed by Section 2, where detailed discussions are carried out. Section 3 looks into how the challenges are tackled in practice by presenting a number of case studies of MPSoC compiler construction from academia and industry.
1.2 Challenges of Building MPSoC Compilers

Before the multi-processor or multi-core era, uni-processor-based hardware systems were very successful in creating a comfortable and convenient programming environment for software developers. The success is largely due to the fact that the sequential programming model is very close to the natural way humans think, and that it has been taught for decades in basic engineering courses. Also, the compilers of high-level programming languages like C for uni-processors are well studied and hide nearly all hardware details from the programmer as a holistic tool [33]. User-friendly graphical integrated development environments (IDEs)
like Eclipse [1] and mature debugging tools like gdb [2] also contribute to the good hardware/software ecosystem of the uni-processor age. The complexity of programming and compiling for MPSoC architectures has increased a great deal compared to the uni-processor case. The reasons are manifold, and the two most important ones are as follows. On the one hand, MPSoCs inherently require applications to be written in parallel programming models so as to utilize the hardware resources efficiently. Parallel programming (or thinking) has proven difficult for programmers, despite many years of effort in high-performance computing. On the other hand, the heterogeneity of MPSoC architectures requires the compilation process to be ad-hoc: the programming models for different Processing Elements (PEs) can differ, the granularity of the parallelism may vary, and the compiler tool-chains for the PEs and Networks-on-Chip (NoCs) may come from different vendors. All this makes MPSoC compilation an extremely sophisticated process, which is most likely no longer "the holistic compiler" for the end user. The software tool-chains are not yet fully prepared to handle MPSoC systems well, and productive multi-core debugging solutions are still lacking. An MPSoC compiler, as the key tool to unlock the power of MPSoCs, is known to be difficult to build. What, then, are the fundamental challenges of constructing an MPSoC compiler, compared to uni-processor compiler construction? A brief list is provided below, with an in-depth discussion following in Section 2.

1. Programming models: Evidently, the transition to parallel programming models affects the MPSoC compiler fundamentally.
2. Granularity of parallelism: While Instruction-Level Parallelism (ILP) is exploited by the uni-processor compiler, the MPSoC compiler focuses on coarser-grained levels of parallelism.
3. Platform description: The traditional uni-processor compiler requires architecture information, such as the instruction set and a latency table, in the back end to generate code. The MPSoC compiler needs a different kind of platform description to work with, and its use goes beyond just the back end.
4. Mapping/scheduling: An MPSoC compiler needs to map and schedule tasks (also called code blocks), whereas the uni-processor compiler performs mapping and scheduling at the instruction level.
5. Code generation: Generating the final binaries for heterogeneous PEs plus the NoC, rather than a single binary for a one-ISA architecture, is yet another leap in complexity for the MPSoC compiler.
2 Foundation Elements of MPSoC Compilers

In this section we delve into the details of the problems mentioned in the introduction of this chapter. We base our discussion on the general structure of a compiler, shown in Figure 1. We highlight the issues that make the tasks of an MPSoC compiler particularly challenging, taking the case of a uni-processor compiler as a reference.
Fig. 1 Coarse View of Uni-processor Compiler Phases
A compiler is typically divided into three phases: the front end, the middle end and the back end. The front end checks the lexical, syntactic and semantic correctness of the application. Its output is an abstract Intermediate Representation (IR) of the application, which is suitable for optimizations and for code generation in the following phases of the compiler. The middle end, sometimes conceptually included within the front end, performs different analyses on the IR. These analyses enable several target-independent optimizations that mainly aim at improving the performance of the code generated later. The back end is in charge of the actual code generation and is divided into phases as well. Typical back end phases include code selection, register allocation and instruction scheduling. These phases are machine dependent and therefore require a model of the target architecture. As in the uni-processor case, MPSoC compilers are also divided into phases in order to manage complexity. Throughout this section more details about these phases will be given, as they help to understand the differences between uni-processor and multi-processor compilers. The overall structure of the compiler (in Figure 1) will change somewhat, though. On the one hand, if the programming model makes some architecture features visible to the programmer, the front end will need information about the target platform. On the other hand, the back end should be aware of formal properties of the programming model that may ease code generation.
2.1 Programming Models

The main entry for any compiler is a representation of an application in a given programming model (see Figure 1). A programming model is a bridge that gives humans access to the resources of the underlying hardware platform. Designing such a model is a delicate art, in which hardware details are hidden for the sake of productivity, usually at the cost of performance. In general, the more details remain hidden, the harder the compiler's job of closing the performance gap. In this sense, a given programming model may reduce the work of the compiler but will never circumvent using one. Figure 2 shows an implementation of an FIR filter in different programming languages, representing different programming models.
Fig. 2 FIR implementation on different programming languages. (a) Matlab, (b) C, (c) DSP-C (adapted from [8]).
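As the figure itself is not reproduced here, the following fragment sketches what the DSP-C variant (c) might look like; it is our own reconstruction, not the figure's exact code. X and Y are DSP-C memory-bank qualifiers and fract/accum are DSP-C fixed-point types, so the fragment is not ISO C and requires a DSP-C-capable compiler:

    /* FIR kernel in DSP-C style: the qualifiers place the sample and
       coefficient arrays in separate memory banks, enabling dual fetches. */
    accum fir(X fract x[], Y fract h[], int n) {
        accum acc = 0;
        for (int i = 0; i < n; i++)
            acc += x[i] * h[i];   /* operands stream from banks X and Y */
        return acc;
    }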
The figure shows an example of the productivity-performance trade-off. At one extreme, the Matlab implementation (Figure 2a) features high simplicity and contains no information about the underlying platform. The C implementation (Figure 2b) provides more information, making types and the memory model visible to the programmer. At the other extreme, the DSP-C implementation (Figure 2c) has explicit memory bank allocation (through the memory qualifiers X and Y) and dedicated data types (accum, fract). Programming at this level requires more knowledge and careful thinking, but will probably lead to better performance. Without this explicit information, a traditional C compiler would need to perform complex memory disambiguation analysis in order to place the arrays in separate memory banks.

In [11], the authors classify programming models as either hardware-centric, application-centric or formalism-centric. Hardware-centric models strive for efficiency and usually require a very experienced programmer (e.g. Intel IXP-C [51]). Application-centric models strive for productivity, allowing fast application development cycles (e.g. Matlab [64], LabView [55]), and formalism-centric models strive for safety by virtue of being verifiable (e.g. Actors [29]). Practical programming models for embedded MPSoCs cannot pay the performance overhead brought by a pure application-centric approach and will seldom restrict programmability for the sake of verifiability. As a consequence, programming models used in industry are typically hardware-centric and provide some means to ease programmability, as will be discussed later in this section.

Orthogonal to the previous classification, programming models can be broadly divided into sequential and parallel ones. The latter are of particular interest for MPSoC programming and for this chapter's readers, though their users are outnumbered by the sequential programming community. As a matter of fact, nearly 90% of software developers nowadays use C and C++ [18], which are languages with underlying sequential semantics. Programmers have been educated for decades to program sequentially. They find it difficult to describe an application in a parallel manner, and when doing so, they introduce a myriad of (from their point of view) unexpected errors. Apart from that, there are millions of lines of sequential legacy code that will not be rewritten within a short period of time to make use of the new parallel architectures.
Compiling a sequential application, for example written in C, for a simple processor is a very mature field. Few people would program an application in assembly language for a single-issue embedded RISC processor such as the ARM7 or the MIPS32. In general, compiler technology has advanced a lot in the uni-processor arena. Several optimizations have been proposed for superscalar processors [38], for DSPs [49], for VLIW processors [20] and for exploiting Single Instruction Multiple Data (SIMD) engines [50]. Nonetheless, high performance routines for complex processor architectures with complex memory hierarchies are still hand-optimized and are usually provided by processor vendors as library functions. In the MPSoC era, the optimization space is too big to allow hand-crafted solutions across different cores. The MPSoC compiler has to help the programmer to optimize the application, probably taking into account the presence of optimized routines for some of the processing elements. In spite of the efforts invested in classical compiler technology, plain C programming is not likely to be able to leverage the processing power of future MPSoCs. When coding a parallel application in C, the parallelism is hidden due to the inherent sequential semantics of the language and its centralized flow of control. Retrieving this parallelism requires complex dataflow and dependence analyses which are usually NP-complete and sometimes even undecidable (see Section 2.2). For this reason MPSoC compilers also need to cope with parallel programming models, some of which will be introduced in the following.

2.1.1 Parallel Programming Models

There are manifold parallel programming models. In this section only a couple of them will be introduced in order to give the reader an idea of the challenges that these models pose to the compiler. Part IV of this book, Design Methods, provides an in-depth discussion of several models of computation (MoCs) relevant to programming models for signal processing MPSoCs. Most current parallel programming models are built on top of traditional sequential languages like C or Fortran by providing libraries or by adding language extensions. These models are usually classified by the underlying memory architecture that they support, either shared or distributed. Prominent examples used in the high performance computing (HPC) community are:
• POSIX Threads (pThreads) [68]: A library-based shared memory parallel programming model.
• Message Passing Interface (MPI) [65]: A library-based distributed memory parallel programming model.
• Open Multi-Processing (OpenMP) [59]: A shared memory parallel programming model based on language extensions (OpenMP also has distributed memory extensions).
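To give a flavor of these APIs, the following sketch parallelizes a simple vector addition, first explicitly with pThreads and then with a single OpenMP directive; the array size, thread count and function names are chosen arbitrarily for this illustration:

  #include <pthread.h>

  #define N 1024
  static float a[N], b[N], c[N];

  /* pThreads: the programmer partitions the work and manages threads. */
  static void *add_half(void *arg)
  {
      long t = (long)arg;   /* thread id: 0 or 1 */
      for (long i = t * (N / 2); i < (t + 1) * (N / 2); i++)
          c[i] = a[i] + b[i];
      return NULL;
  }

  void vector_add_pthreads(void)
  {
      pthread_t thr;
      pthread_create(&thr, NULL, add_half, (void *)0);
      add_half((void *)1);   /* main thread computes the other half */
      pthread_join(thr, NULL);
  }

  /* OpenMP: the same computation, parallelized by one language extension. */
  void vector_add_omp(void)
  {
      #pragma omp parallel for
      for (int i = 0; i < N; i++)
          c[i] = a[i] + b[i];
  }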
Fig. 3 Example of Concurrent MoCs. (a) KPN, (b) SDF, (c) DDF. Firing rules of actor a2 in (c): Ra2,1 = ([*,*],[*]), Ra2,2 = (⊥,[0]), Ra2,3 = ([0],⊥).
Apart from the well-known problems of programming with threads [48], the overhead of embedded implementations of these standards and the complex, hybrid (shared-distributed) nature of embedded architectures have prevented the adoption of such programming models. Embedded applications do not usually feature the regularity present in scientific applications, and even if they do, their granularity is too fine to afford the overhead introduced by complex software stacks. Even in the HPC community, and as a reaction to the problems of threaded programming, new programming models are being proposed. The two most prominent ones, inspired by database transactions, are Transactional Memory (TM) [9] and Thread Level Speculation (TLS) [40]. How successfully these models can be deployed in embedded MPSoCs is still unknown. Both models incur a high performance penalty if not supported by hardware, and the power overhead of such hardware may be a showstopper when adopting these techniques.

2.1.2 Embedded Parallel Programming Models

In the embedded domain, parallel programming models based on concurrent MoCs are gaining acceptance and appear to be one promising choice for describing signal processing applications. The theory of MoCs originated in theoretical computer science as a means to formally describe computing systems and was initially used to compute bounds on complexity (the "big O" notation). MoCs were thereafter used in the early 1980s to model VLSI circuits and only in the 1990s started to be utilized for modeling parallel applications. Dataflow programming models based on concurrent MoCs like Synchronous Dataflow (SDF) [46] and some of its extensions (like Boolean Dataflow (BDF) [47]) have been studied in depth in [67]. More general dataflow programming models based on the Dynamic Dataflow (DDF) and Kahn Process Networks (KPN) [3] MoCs have also been proposed [58][44] (see also Chapters 34 and 35).
• KPN Programming Model: In a KPN programming model, an application is represented as a graph G = (V, E) like the one in Figure 3a. In such a graph, a node n ∈ V is called a process and represents a certain data processing. The edges
represent unbounded FIFO channels through which processes communicate by sending data items or tokens. A process can only be in one of two states: ready or blocked. The blocked state can only be reached by reading from an empty input channel (blocking-read semantics). A KPN is said to be determinate: the history of tokens produced on the communication channels does not depend on the scheduling [3].
• DDF Programming Model: In this programming model, an application is also represented as a graph G = (V, E, R), with R a family of sets, one set for every node in V. Edges have the same semantics as in the KPN model. Nodes are called actors and do not feature the blocking-read semantics of KPN. Instead, every actor a ∈ V has a set of firing rules Ra ∈ R, Ra = {Ra,1, Ra,2, ...}. A firing rule for an actor a ∈ V with p inputs is a p-tuple Ra,i = (c1, ..., cp) of conditions. A condition describes a sequence of tokens that has to be available at the given input FIFO. Parks introduced a notation for such conditions in [62]. The condition [X1, X2, ..., Xn] requires n tokens with values X1, X2, ..., Xn to be available at the top of the input FIFO. The conditions [*], [*,*] and [*(1), ..., *(m)] require at least 1, 2 and m tokens, respectively, with arbitrary values to be available at the input. The symbol ⊥ represents any input sequence, including an empty FIFO. For an actor a to be in the ready state, at least one of its firing rules needs to be satisfied. An example of a DDF graph is shown in Figure 3c. In this example, the actor a2 has three different firing rules. This actor is ready if there are at least two tokens on input i1 and at least one token on input i2, or if the next token on input i2 or i1 has value 0. Notice that more than one firing rule can be activated simultaneously; in this case the dataflow graph is said to be non-determinate.
• SDF Programming Model: An SDF can be seen as a simplification of the DDF model (being more closely related to the so-called Computation Graphs [36]), in which an actor with p inputs has only one firing rule of the form Ra,1 = (n1, ..., np) with ni ∈ N. Additionally, the number of tokens produced on every output by one execution of an actor is also fixed. An SDF can be defined as a graph G = (V, E, W) where W = {w1, ..., w|E|} ⊂ N^3 associates three integer constants we = (pe, ce, de) with every channel e = (a1, a2) ∈ E. pe represents the number of tokens produced by every execution of actor a1, ce the number of tokens consumed by every execution of actor a2, and de the number of tokens (called delays) initially present on edge e. An example of an SDF is shown in Figure 3b with delays represented as dots on the edges. For the SDF in the example, W = {(3,1,0), (6,2,0), (2,3,0), (1,2,2)}.
Different dataflow models differ in their expressiveness, some being more general, some being more restrictive. By restricting the expressiveness, models gain stronger formal properties (e.g. determinism) which make them more amenable to analysis. For example, the questions of termination and boundedness, i.e. whether an application runs forever with bounded memory consumption, are decidable for SDFs but undecidable for BDFs, DDFs and KPNs [62]. Also, since the token consumption and production of an SDF actor are known beforehand, it is possible for a compiler to compute a plausible static schedule for an SDF. For a KPN, instead, due to control-dependent access to channels, it is impossible to compute a pure static schedule.
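To make the blocking-read semantics concrete, the following sketch shows what a KPN process may look like in C; the channel type and the read/write functions are hypothetical placeholders for whatever API a concrete framework provides:

  /* Hypothetical channel API: reads block on an empty FIFO (the KPN
   * blocking-read semantics); writes never block (unbounded FIFOs). */
  typedef struct fifo fifo_t;
  int  fifo_read(fifo_t *ch);            /* blocks until a token arrives */
  void fifo_write(fifo_t *ch, int tok);

  /* A process that adds pairs of tokens. Its control flow is purely
   * local, and the history of tokens it produces does not depend on
   * the schedule, as the determinacy property guarantees. */
  void process_add(fifo_t *in1, fifo_t *in2, fifo_t *out)
  {
      for (;;) {
          int x = fifo_read(in1);   /* may block */
          int y = fifo_read(in2);   /* may block */
          fifo_write(out, x + y);
      }
  }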
 1: int b[N];
 2: int A[4][N];
 3: int c[4];
 4:
 5: void foo()
 6: {
 7:   int i,j,s;
 8:
 9:   b[0] = f1();
10:   for (i = 0; i < N; i++)
11:     f2(&b[0], &b[i], &A[i%4][0]);
12:
13:   s = b[0]; b[0] = 0;
14:   for (i = 0; i < 4*16; i++) {
15:     j = i % 4;
16:     f3(s, &A[j][0]);
17:     c[j] = f4(&A[j][0]);
18:     if (j == 3)
19:       s += sum(&c[0]);
20:   }
21: }
Fig. 4 Sample program. (a) C implementation, (b) a “good” KPN representation, (c) a “bad” KPN representation.
Apart from explicitly exposing parallelism, dataflow programming models have become attractive mainly for two reasons. On the one hand, they are well suited for graphical programming, similar to the block diagrams used to describe signal processing algorithms. On the other hand, some of the underlying MoC properties enable tools to perform analysis and to compile/synthesize the specification into both software and hardware. Channels explicitly expose data dependencies among computing nodes (processes or actors), and these nodes display a distributed control flow which is easily mapped to different PEs.
To understand how dataflow models can potentially reduce the compilation effort, an example of an application written in a sequential form and in two parallel forms is shown in Figure 4. Let us assume that the KPN parallel specification in Figure 4b represents the desired output of a parallelizing compiler. In order to derive this KPN from the sequential specification in Figure 4a, complex analyses have to be performed. For example, the compiler needs to identify that there is no dependency on array A between lines 11 and 16 (i.e. between processes f2 and f3), which is a typical example of dataflow analysis (see Section 2.2.3). Only for a restricted subset of C programs, namely Static Affine Nested Loop Programs (SANLP), have transformations similar to that shown in Figure 4 been successfully implemented [74]. For more complex programs, starting from a parallel specification would greatly simplify the work of the compiler. However, we argue that even with a parallel specification at hand, an MPSoC compiler has to be able to look inside the nodes in order to attain higher performance. With applications becoming more and more complex, a compiler cannot completely rely on the programmer's knowledge when decomposing the application into blocks. A block diagram can hide lots of parallelism in the interior of the blocks; thus, computing nodes cannot always be considered as black boxes but rather as gray/white boxes [52]. As an example of this, consider the KPN shown in Figure 4c. Assume that this parallel specification was written by a programmer to represent the
same application logic as in Figure 4a. This KPN might seem appropriate to a programmer, because the communication is reduced (5 instead of 6 edges). However, notice that if functions f2 and f3 are time consuming, running them in parallel could be greatly advantageous. In this representation, the parallelism remains hidden inside block f2+f3.

Summary

Currently, MPSoC compilers need to support sequential programming models, both because of the great amount of sequential legacy code and because of a couple of generations of programmers that were taught to program sequentially. At the same time, dataflow models also need to be taken into account, and the compiler needs to be aware of their formal properties. An MPSoC compiler should profit from the techniques for Instruction Level Parallelism (ILP) present in each of the uni-processor compilers. Therefore, the granularity of the analysis needs to be re-thought in order to extract coarser parallelism, as will be discussed in the next section.
2.2 Granularity, Parallelism and Intermediate Representation

In a classical compiler, the front end translates application code into an abstract representation, the IR. Complex constructs of the original high-level programming languages are lowered into the IR while keeping machine independence. The IR serves as the basis for performing analyses (e.g. control and data flow), upon which many compiler optimizations can be performed. Although there is no de facto standard for IRs, most compiler IRs use graph data structures to represent the application. The fundamental analysis units used in traditional compilers are the so-called Basic Blocks (BB), where a BB is defined as a maximal sequence of consecutive statements in which flow of control enters at the beginning and leaves at the end without halt or possibility of branching except at the end [10]. A procedure or function is represented as a Control Flow Graph (CFG) whose nodes are BBs and whose edges represent the control transitions in the program. Data flow is analyzed inside a BB, and as a result a Data Flow Graph (DFG) is produced, where nodes represent statements (or instructions) and edges represent data dependencies (or precedence constraints). With intra-procedural analysis, data dependencies that span BB borders can be identified. As a result, both control and data flow information can be summarized in a Control Data Flow Graph (CDFG). A sample CDFG for the code in Figure 5a is shown in Figure 5c. BBs are identified with the literals v1,v2,...,v6. For this code it is easy to identify the data dependencies by simple inspection. Notice, however, that the self-cycles due to variable c in v3 and v4 will never be executed, i.e. the definition of c in line 7 will never reach line 6. Moreover, notice that the code in Figure 5b is equivalent to that in Figure 5a, even if this is not evident at first sight. Even for such a small program, a compiler needs to be equipped with powerful analyses to derive such an optimization.
(a)
 1: ...
 2: c = x + y;
 3: b = 2;
 4: for (i = 0; i < 10; i++) {
 5:   if (c > 0) {
 6:     a = b + c;
 7:     c = -1;
 8:   } else {
 9:     a = b - c;
10:     c = 1;
11:   }
12:   A[i] = a;
13: }
14: ...

(b)
 1: ...
 2: c = x + y;
 3: c = c > 0 ? c : -c;
 4: A[0] = 2 + c;
 5: for (i = 1; i < 10; i++)
 6:   A[i] = 3;
 7: ...
Fig. 5 Example of a CDFG. (a) sample C code, (b) optimized code, (c) CDFG for (a).
For simple processors, analysis at the BB granularity has worked well for the last 20-30 years. The instructions inside a BB are always executed one after another in an in-order processor, and for that reason BBs are very well suited for exploiting ILP. Already for more complex processors, like VLIWs, BBs fall short of leveraging the available ILP. Predicated execution, software pipelining and trace scheduling [20] are just some examples of optimizations that cross BB borders seeking more parallelism. This quest for parallelism is even more challenging in the case of MPSoC compilers. An MPSoC compiler must go beyond ILP, which will be exploited by the individual compilers on each of the processors. The question of granularity and its implication on parallelism for MPSoCs is still open. The ideal granularity for the analysis depends in part on the characteristics of the target platform. The interconnect defines the communication characteristics and thus dictates which granularities are reasonable and which are not. The following sections give more insight into the granularity issue, present different kinds of parallelism and briefly discuss the problem of dataflow analysis.

2.2.1 Granularity and Partitioning

Granularity is a major issue for MPSoC compilers and has a direct impact on the kind and degree of parallelism that can be achieved. We define partitioning as the process of analyzing an application and fragmenting it into blocks with a given granularity suitable for parallelism extraction. In this sense, the process of constructing CFGs out of an application as discussed before can be seen as a partitioning. The most intuitive granularities and their drawbacks for MPSoC compilers are:
• Statement: A statement is the smallest standalone entity of a programming language. An application can be broken down to the level of statements and the relations
Fig. 6 Granularity example. (a) Basic Block level, (b) Function level, (c) Task level
among them. This granularity provides the highest degree of freedom to the analysis but could prevent ILP from being exploited at the uni-processor level. Besides, the complexity could be prohibitively high for real-life applications.
• Basic Block: As already discussed, traditional compilers work at the level of BBs, for they are well suited for ILP. However, in practice it is easy to find examples where BBs are either too big or too small for MPSoC parallelism extraction. A BB composed of a sequence of function calls inside a loop would be seen as a single node by the MPSoC compiler, and potential parallelism would therefore be hidden. On the other extreme, a couple of small basic blocks divided by simple control constructs could be better handled by a uni-processor compiler with support for predicated execution.
• Function: A function is simply defined as a subroutine with its own stack frame. At this level, only function calls are analyzed and the rest of the code is considered irrelevant for the analysis. As with BBs, this granularity can be too coarse or too fine-grained, depending on the application and on the programmer. It is possible to force a coding style where parallelism is explicitly coded in the way the behavior is factorized into functions. However, an MPSoC compiler should not make any assumption about the coding style.
• Task/process: For parallel programming models, the granularity is determined by the tasks (processes or actors) defined in the specification itself. As mentioned in Section 2.1, a powerful MPSoC compiler might inspect the inside of the tasks in order to look for parallelism hidden in the specification.
As an example, different partitions at different granularity levels for the program introduced in Figure 4a are shown in Figure 6. The BBs of function foo are shown in Figure 6a. This partition is a good example of what was discussed above. The BB in line 9 is too lightweight in comparison to the other BBs, whereas the BB in lines 16-17 may be too coarse. The partition at the function level is shown in Figure 6b. This partition happens to match the KPN derivation introduced in Figure 4b. Whether this granularity is appropriate or not depends on the amount of data flowing between the functions and on the timing behavior of each of the functions, as will be discussed later in this chapter. Finally, Figure 6c shows a coarser task-level partitioning, where the first task processes mostly array b and the second task processes array A. Again, this task-level granularity might turn out to be inefficient if the time spent processing
Fig. 7 Kinds of Parallelism: a) DLP, b) TLP, c) PLP
array A is too large. If this is the case, exploiting the parallelism inside the second task would be required. As illustrated by the example, it is not clear what the ideal granularity for an MPSoC compiler to work on will be. Most likely, there will be no exact definition as in the case of BBs, and the final granularity will depend on the MPSoC, the programming model and the application domain. Some research groups have taken steps towards the identification of a suitable granularity for particular platforms. The approach is usually based on partitioning the application into code blocks by means of heuristics or clustering algorithms [15]. Irrespective of the granularity chosen, in the remainder of this chapter we refer to the outcome of the partitioning as code blocks. A code block may be a BB, a function, a task, a process, an actor or the result of intricate clustering algorithms.

2.2.2 Parallelism

While a traditional compiler tries to exploit ILP, the goal of an MPSoC compiler is to extract coarser parallelism. The most prominent types of coarse or macroscopic parallelism are illustrated in Figure 7 and are described in the following.
• Data Level Parallelism (DLP): In DLP the same computational task is carried out on several disjoint data sets, as illustrated in Figure 7a. This kind of parallelism can be seen as a generalization of SIMD. DLP is typically present in multimedia applications, where a decoding task performs the same operations on different portions of an image or video (e.g. macroblocks in H.264). Several programming models provide support for DLP. For instance, an OpenMP for directive tells the compiler or the runtime system that the iterations of the loop are independent from each other and can therefore be parallelized.
• Task (or functional) Level Parallelism (TLP): In TLP different tasks can compute in parallel on different data sets, as shown in Figure 7b. This kind of parallelism is inherent to programming models based on concurrent MoCs (see Section 2.1.1). Tasks may have dependencies on each other, but once a task has its data ready, it can execute in parallel with the tasks already running in the system.
• Pipeline Level Parallelism (PLP): In PLP a computation is broken into a sequence of tasks and these tasks are repeated for different data sets. In this sense,
PLP can be seen as a mixture of DLP and TLP. Following the principle of pipelining, a task with DLP can be segmented in order to achieve a higher throughput. At a given time, different tasks of the original functionality are executed concurrently on different data sets. This kind of parallelism is exploited, for example, in the programming models targeting network processors.
Exploiting these kinds of parallelism is a must for an MPSoC compiler, which therefore has to be equipped with powerful flow and dependence analysis capabilities.

2.2.3 Flow and Dependence Analysis

Flow analysis includes both control and data flow. The result of these analyses can be summarized in a CDFG, as discussed at the beginning of this section. Data flow analysis serves to gather information at different program points, e.g. about available variable definitions (reaching definitions) or about variables that will be used later in the control flow (liveness analysis). As an example, consider the CDFG in Figure 5c, on which a reaching definitions analysis is carried out. The analysis tells, for example, that the value of variable c in line 5 can come from three different definitions in lines 2, 7 and 10.
Data flow analysis deals mostly with scalar variables, as in the previous example, but falls short when analyzing the flow of data through explicit memory accesses in the program. In practice, memory accesses are very common, through the use of pointers, structures or arrays. Additionally, in the case of loops, data flow analysis only says whether a definition reaches a point but does not specify exactly in which iteration the definition is made. The analyses that answer these questions are known as array analysis, loop dependence analysis or simply dependence analysis. Given two statements S1 and S2, dependence analysis determines if S2 depends on S1, i.e. if S2 cannot execute before S1. If there is no dependency, S1 and S2 can execute in any order or in parallel. Dependencies are classified into control and data dependencies:
• Control Dependence: A statement S2 is control dependent on S1 (S1 δc S2) if whether or not S2 is executed depends on S1's execution. In the following example, S1 δc S2:

  S1: if (a > 0) goto L1;
  S2: a = b + c;
  S3: L1: ...
• Data Dependencies:
– Read After Write (RAW, also true/flow dependence): There is a RAW dependence between statements S1 and S2 (S1 δf S2) if S1 modifies a resource that S2 reads thereafter. In the following example, S1 δf S2:

  S1: a = b + c;
  S2: d = a + 1;
– Write After Write (WAW, also output dependence): There is a WAW dependence between statements S1 and S2 (S1 δo S2) if S2 modifies a resource that was previously modified by S1. In the following example, S1 δo S2:

  S1: a = b + c;
  S2: a = d + 1;

– Write After Read (WAR, also anti-dependence): There is a WAR dependence between statements S1 and S2 (S1 δa S2) if S2 modifies a resource that was previously read by S1. In the following example, S1 δa S2:

  S1: d = a + 1;
  S2: a = b + c;

(a)
  S1: for (i = N; i > X; i--) {
  S2:   for (j = M; j > Y; j--) {
  S3:     A[3*i + 2*j + 2][i + j + 1] = ...;
          ...
  S4:     ... = A[4*i + j + 1][2*i + j + 1] ...;
  S5: }}

(b)
  S1: while (i) {
  S2:   f1(&A[i]);
  S3:   ... = A[f2(i)];
        ...
  S4: }

(c)
  S1: scanf("%d", &i);
  S2: scanf("%d", &j);
  S3: A[i] = ...;
  S4: ... = A[j];
      ...

Fig. 8 Examples of dependence analysis. (a) NP-complete, (b) requires inter-procedural analysis, (c) undecidable
Obviously, two statements can exhibit different kinds of dependencies simultaneously. Computing these dependencies is one of the most complex tasks inside a compiler, both for uni-processor and for multi-processor systems. For a language like C, the problem of finding all dependencies statically is NP-complete and in some cases undecidable. The main reasons for this are the use of pointers [30] and of indexes into data structures that can only be resolved at runtime. Figure 8 shows three sample programs that illustrate the complexity of dependence analysis. In Figure 8a, in order to determine if there is a RAW dependence between S3 and S4 (S3 δf S4) across iterations, one has to solve a constrained integer linear system of equations, which is NP-complete. For the example, the system of equations is:

  3·x1 + 2·x2 + 2 = 4·y1 + y2 + 1
  x1 + x2 + 1 = 2·y1 + y2 + 1

subject to X < x1, y1 ≤ N and Y < x2, y2 ≤ M. Notice, for example, that there is a RAW dependence between iterations (1, 1) and (2, -2) on A[7][3]. In order to analyze the sample code in Figure 8b, a compiler has to perform inter-procedural analysis to identify whether f1 potentially modifies the contents of A[i] and to sort out the potential return values of f2(i). Depending on the characteristics of f1 and f2, this problem can be undecidable. Finally, the code in Figure 8c is an extreme case of the previous one, in which it is impossible to know the values of the indexes at compile time. The complexity of dependence analysis motivated the introduction of means for memory disambiguation at the programming language level, such as the restrict keyword in the C99 standard [76].
For an MPSoC compiler, the situation is no different. The same kind of analysis has to be performed at the granularity produced by the partitioning step.
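While the general problem is NP-complete, for fixed small bounds the dependence test of Figure 8a can at least be checked by brute force, which also conveys why a compiler cannot afford to do this symbolically for arbitrary bounds; the bounds below are chosen arbitrarily for this sketch:

  #include <stdio.h>

  /* Enumerates iteration pairs of the loop nest in Figure 8a and prints
   * the RAW dependences between S3 and S4, i.e. integer solutions of
   *   3*x1 + 2*x2 + 2 == 4*y1 + y2 + 1
   *   x1  + x2  + 1  == 2*y1 + y2 + 1
   * within the iteration bounds. */
  int main(void)
  {
      const int N = 4, M = 4, X = 0, Y = -4;   /* arbitrary small bounds */
      for (int x1 = X + 1; x1 <= N; x1++)
        for (int x2 = Y + 1; x2 <= M; x2++)
          for (int y1 = X + 1; y1 <= N; y1++)
            for (int y2 = Y + 1; y2 <= M; y2++)
              if (3*x1 + 2*x2 + 2 == 4*y1 + y2 + 1 &&
                  x1 + x2 + 1 == 2*y1 + y2 + 1)
                  printf("S3(%d,%d) -> S4(%d,%d) on A[%d][%d]\n",
                         x1, x2, y1, y2, 3*x1 + 2*x2 + 2, x1 + x2 + 1);
      return 0;
  }

Among other pairs, this reports the dependence between iterations (1, 1) and (2, -2) on A[7][3] mentioned above.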
Fig. 9 Dependence analysis on example in Figure 4a. (a) Summarized CDFG, (b) unrolled dependencies
Array analysis could still be handled by a vectorizing compiler for one of the processors in the platform. The MPSoC compiler has to perform the analysis at a coarser granularity level, at which function calls will be no exception. This is, for example, the case for the code in Figure 4a. In order to derive KPN representations like those presented in Figures 4b-c, the compiler needs to be aware of the side effects of all functions. For example, it has to make sure that function f2 does not modify the array A; otherwise there would be a dependency (an additional channel in the KPN) between processes f2 and f3 in Figure 4b. The dependence analysis should also provide additional information, for example, that the sum function is only executed every four iterations of the loop. This means that every four instances of f3 followed by f4 can be executed in parallel. This is illustrated in Figure 9. A summarized version of the CDFG with the dependence information is shown in Figure 9a. In this graph, data edges are annotated with the variable that generates the dependency and, in the case of loop-carried dependencies, with the distance of the dependence [54]. The distance of a dependence tells after how many iterations a defined value will actually be used. With the dependence information, it is possible to represent the precedence constraints along the execution of the whole program, as shown in Figure 9b. In the figure, n: f represents the n-th execution of function f. With this partitioning, it is possible to identify two different kinds of parallelism: T1 and T2 are a good example of TLP, whereas T3 displays DLP. This is a good example of how flow and dependence analysis help to determine a partitioning that exposes coarse-grained parallelism.
Due to the complexity of static analyses, research groups have recently started to take decisions based on Dynamic Data Flow Analysis (DDFA) [15][23]. Unlike static analyses, where dependencies are determined at compile time, DDFA uses traces obtained from profiling runs. This analysis is of course not sound and cannot be directly used to generate code. Instead, it is used to obtain a coarse measure of the data flowing among different portions of the application in order to derive plausible partitions and in this way identify DLP, TLP and/or PLP. Being a profile-
based technique, the quality of DDFA depends on a careful selection of the input stimuli. In interactive programming environments, DDFA can provide hints to the programmer about where to perform code modifications to expose more parallelism.

Summary

Traditional compilers work at the basic block granularity, which is well suited for ILP. MPSoC compilers in turn need to be equipped with powerful flow analysis techniques that allow partitioning the application at a suitable granularity. This granularity may not match any mainstream granularity and may depend on the target platform. The partitioning step must break the application into code blocks from which coarse-level parallelism such as DLP, TLP and PLP can be extracted.
2.3 Platform Description for MPSoC Compilers

After performing optimizations in the middle end, a compiler back end generates code for the target platform based on a model of it. Such a platform model is also required by an MPSoC compiler, but, in contrast to a uni-processor compiler flow, the architecture model or a subset of it may also be used by other phases of the compiler. For example, if the programming model exposes some hardware details to the user, the front end needs to be able to cope with that and probably perform consistency checks. Besides, some MPSoC optimizations in the middle end may need information about the target platform, as discussed in Section 2.2. Traditionally, an architecture model describes:
i. Available operations: In the form of an abstract description of the Instruction Set Architecture (ISA). This information is mainly used by the code selection phase.
ii. Available resources: A list of hardware resources such as registers and functional units (in the case of a superscalar or a VLIW). This information is used, for example, by the register allocator and the scheduler.
iii. Communication links: Describe how data can be moved among functional units and register files (e.g. cross paths in a clustered VLIW processor).
iv. Timing behavior: In the form of latency and reservation tables. For each available operation, an entry in the latency table tells the compiler how long it takes to generate a result, whereas an entry in the reservation table tells the compiler which resources are blocked and for how long. This information is mainly used to compute the schedule.
In the case of an MPSoC, a platform description has to provide similar information but at a different level. Instead of a representation of an ISA, the available operations describe which kinds of processors and hardware accelerators are in the platform. Instead of a list of functional units, the model provides a list of PEs and a description of the memory subsystem. The communication links no longer represent interconnections among functional units and register files, but possibly a complex Network-on-Chip (NoC) that interconnects the PEs among themselves and with the memory elements. Finally, the timing behavior has to be provided not for individual operations (instructions) but for code blocks.
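As a rough illustration, such a platform model might be captured by data structures like the following; all type and field names are invented for this sketch:

  /* Sketch of an MPSoC platform model. */
  typedef enum { PE_RISC, PE_DSP, PE_ACCEL } pe_type_t;

  typedef struct {
      pe_type_t type;           /* kind of processing element */
      int       local_mem_kb;   /* size of private memory     */
  } pe_t;

  typedef struct {
      int    src_pe, dst_pe;    /* endpoints of a physical link/NoC route */
      double cost_per_token;    /* estimated communication cost           */
  } link_t;

  typedef struct {
      int    code_block, pe;    /* a code block running on a given PE     */
      double avg_cycles;        /* filled in by performance estimation    */
  } timing_entry_t;

  typedef struct {
      pe_t           *pes;     int n_pes;
      link_t         *links;   int n_links;
      timing_entry_t *timing;  int n_timing;  /* replaces static latency tables */
  } platform_model_t;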
Usually, the physical description (items i-iii) is a graph representation provided in a given format (usually an XML file, see Section 3 for practical examples). The timing behavior cannot be represented as application-independent static tables, due to the presence of programmable processing elements that can, in principle, execute arbitrary pieces of code. Obtaining the timing behavior of the code blocks produced by the partitioning is a major research topic and a requisite for an MPSoC compiler. Several techniques, under the name of performance estimation, are applied in order to obtain such execution times: Worst/Best/Average Case Execution Time (W/B/ACET) [75]. These techniques can be roughly categorized as:
• Analytical: Analytical or static performance estimation tries to find theoretical bounds on the WCET without actually executing the code. Using compiler techniques, all possible control paths are analyzed and bounds are computed by using an abstract model of the architecture. This task is particularly difficult in the presence of caches and other non-deterministic architectural features. For such architectures, the WCET might be too pessimistic and thus induce bad decisions (e.g. wrong schedules). There are already some commercial tools available; aiT [7] is a good example.
• Simulation-based: In this case the execution times are measured on a simulator. Usually cycle-accurate virtual platforms are used for this purpose [19][69]. Virtual platforms allow full system simulation, including complex chip interconnects and memory subsystems. Simulation-based models suffer from the context-subset problem, i.e. the measurements depend on the selection of the inputs.
• Emulation-based: The simulation time of cycle-accurate models can be prohibitively high; typical simulation speeds range from 1 to 100 KIPS (kilo instructions per second). Therefore, some techniques emulate the timing behavior of the target platform on the host machine, without modeling every detail of the processor, by means of instrumentation. Source-level timing estimation has proven to be useful for simple architectures [37][32]; the accuracy for VLIW or DSP processors is, however, not satisfactory. The authors in [22] use so-called virtual back ends to perform timing estimation by emulating the effects of the compiler back end, thus improving the accuracy of source-level timing estimation considerably. With these techniques, simulation speeds of up to 1 GIPS are achievable.
Performance estimation plays an important role when compiling for MPSoCs. Granularity, discussed in Section 2.2, is directly related to the timing of the code blocks. If the execution time of a code block is too short compared to the other blocks, it can be merged into other blocks or simply ignored during the timing analysis. Additionally, the timing information is key for driving the scheduling and mapping decisions. As an example, consider a platform with two types of PEs, RISCs and DSPs, for the application in Figure 4a. Assume that the performance estimation determines an average time t̂_f^p for the execution of function f on processor type p.
Figure 10 shows possible scheduling traces for the sample application with different partitioning and mapping. The following timing variables are used in the figure to simplify the visualization:

  t1 = t̂_f3^risc + t̂_f4^risc + 3 · max(t̂_f3^risc, t̂_f4^risc)
  t2 = t̂_f2^risc
  t3 = t̂_f3^risc + t̂_f4^risc
  t4 = 2 · t̂_f3^risc + 2 · t̂_f4^risc
  t5 = 4 · t̂_f3^dsp + 4 · t̂_f4^dsp + t̂_sum^dsp
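For instance, plugging in hypothetical estimates (all numbers invented for illustration) shows how such variables let the compiler compare candidate mappings numerically:

  #include <stdio.h>

  int main(void)
  {
      /* hypothetical average execution times in cycles */
      double f2_r = 400, f3_r = 100, f4_r = 80;    /* on a RISC  */
      double f3_d = 40,  f4_d = 30,  sum_d = 20;   /* on the DSP */
      double max34 = (f3_r > f4_r) ? f3_r : f4_r;

      double t1 = f3_r + f4_r + 3 * max34;         /* pipelined f3/f4    */
      double t3 = f3_r + f4_r;                     /* one task T3        */
      double t4 = 2 * f3_r + 2 * f4_r;             /* merged task T3'    */
      double t5 = 4 * f3_d + 4 * f4_d + sum_d;     /* merged task on DSP */

      printf("t1=%.0f t2=%.0f t3=%.0f t4=%.0f t5=%.0f\n",
             t1, f2_r, t3, t4, t5);
      return 0;
  }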
The trace in Figure 10a corresponds to the KPN representation introduced in Figures 4b and 6b. This partition exhibits PLP, but the fact that some instances of f3 and f4 can be executed in parallel, as suggested by the dependence graph in Figure 9a, is not exploited. Besides, depending on the architecture, this partition may require the array A to be sent through the interconnect, which could be too costly. The schedule in Figure 10b uses the information gathered by the dependence analysis in Figure 9 and fully exploits the TLP and the DLP available in the application. This behavior can be obtained by unrolling the loop at the granularity shown in Figure 6b. In this example, task T1 dominates the execution of the complete application, and therefore processors RISC2-RISC5 feature a low utilization. To improve the utilization, two tasks T3 are merged into T3' as shown in Figure 10c. Finally, if the DSP is used to execute task T2, a similar execution time for the application can be obtained using only two processing elements, as illustrated in Figure 10d. Note that the last partition corresponds to the task-level granularity in Figure 6c.
The previous example is of course simplistic and serves only as an illustration of the importance of timing information. In this example, neither the interconnect nor access to shared resources such as peripherals or memory was modeled. For real applications it is much more difficult to characterize the timing behavior and the resource usage. It might be impractical to characterize the timing behavior of a code block with a simple average value. For some applications, a discrete set of values for different scenarios might be used [24][52]. For other applications, a more detailed statistical description in the form of a probability density function could be required [70]. Access to shared resources can be modeled in a way similar to reservation tables. Several code blocks might access the same resources in the MPSoC (e.g. analog-to-digital converters, serial communication links, a graphical display), and these structural hazards need to be taken into account by the MPSoC compiler.

Summary

Platform models for MPSoC compilers describe features similar to those of traditional compilers but at a higher level. Processing elements and NoCs take the place of functional units, register files and their interconnections. In an MPSoC compiler, the platform model is no longer restricted to use in the back end; a subset of it may be used by the front end and the middle end. Of the information needed
Fig. 10 Example of the impact of timing on the mapping and scheduling decisions for the application code in Figure 4a. (a) Partition with PLP, direct implementation of Figure 4b, (b) full parallelism exposed in Figure 9b, (c) improved processor utilization, (d) improved mapping
to describe the platform, the timing behavior is the most challenging. This timing information is needed to successfully perform mapping and scheduling, as will be described in the next section.
2.4 Mapping and Scheduling

Mapping and scheduling in a traditional compiler are done in the back end, given a description of the architecture. Mapping refers to the process of assigning operations to instructions and functional units (code selection) and variables to registers (register allocation). Scheduling refers to the process of organizing the instructions in a timed sequence. The schedule can be computed statically (for RISCs, DSPs and VLIWs) or dynamically at runtime (for superscalars), whereas the mapping of operations to instructions is always computed statically. The main purpose of mapping and scheduling in uni-processor compilers has always been to improve performance. Code size is also an important objective for embedded (especially VLIW) processors. Only recently did power consumption become an issue. However, the re-
duction in power consumption achievable with back end techniques does not have a big impact on the overall system power consumption.
In an MPSoC compiler, similar operations have to be performed. Mapping, in this context, refers to the process of assigning code blocks to PEs and logical communication links to physical ones. In contrast to the uni-processor case, mapping can also be dynamic: a code block could be mapped at runtime to different PEs, depending on the availability of resources. Scheduling for MPSoCs has a similar meaning as for uni-processors, but instead of scheduling instructions, the compiler has to schedule code blocks. The presence of different application classes, e.g. real-time applications, adds complexity to the optimizations in the compiler. In particular, there is much more room for improving power consumption in an MPSoC; after all, power consumption is one of the MPSoC drivers in the first place.
The result of scheduling and mapping is typically represented in the form of a Gantt chart, similar to the ones presented in Figure 10. The PEs are represented on the vertical axis and time on the horizontal axis. Code blocks are located in the plane according to the mapping and scheduling information. In Figure 10a, functions f1 and f2 are mapped to processor RISC1, the functions f3 and sum are mapped to processor RISC2, and function f4 to processor RISC3.
Given that code blocks have a higher time variability than instructions, scheduling can rarely be performed statically. Pure static scheduling requires full knowledge of the timing behavior and is only possible for very predictable architectures and regular computations, as in the case of systolic arrays [42]. If it is not possible to obtain a pure static schedule, some kind of synchronization is needed. Different scheduling approaches require different synchronization schemes with different associated performance overheads. In the example of the previous section, the timing of task T3 is not known precisely, as shown in Figure 10b. Therefore, the exact starting time of function sum cannot be determined, and a synchronization primitive has to be inserted to ensure correctness of the result. In this example, a simple barrier is enough to ensure that the execution of T3 on processors RISC3, RISC4 and RISC5 has finished before function sum executes.

2.4.1 Scheduling Approaches

Which scheduling approach to utilize depends on the characteristics of the application and the properties of the underlying MoC used to describe it. Apart from pure static schedules, one can distinguish among the following scheduling approaches:
• Self-timed Scheduling: Typical for applications modeled with dataflow MoCs. A self-timed schedule is close to a static one. Once a static schedule is computed, the code blocks are ordered on the corresponding PEs and synchronization primitives are inserted that ensure the presence of data for the computation. This kind of scheduling is used for SDF applications. For a more detailed discussion, the reader is referred to [67].
• Quasi-static Scheduling: Used in the case where control paths introduce a predictable time variation. In this approach, unbalanced control paths are balanced
and a self-timed schedule is computed. Quasi-static scheduling for dynamically parameterized SDF graphs is explored in [20] (see also Chapter 32).
• Dynamic Scheduling: Used when the timing behavior of the application is difficult to predict and/or when the number of applications is not known in advance (as in the case of general purpose computing). The scheduling overhead is usually higher, but so is the average utilization of the processors in the platform. There are many dynamic scheduling policies. Fair queue scheduling is common in general purpose operating systems (OSs), whereas different flavors of priority-based scheduling are typically used in embedded systems with real-time constraints, e.g. Rate Monotonic (RM) and Earliest Deadline First (EDF).
• Hybrid Scheduling: Term used to refer to scheduling approaches in which several static or self-timed schedules are computed for a given application at compile time and are switched dynamically at runtime depending on the scenario [24]. This approach is applied to streaming multimedia applications and allows adapting at runtime, making it possible to save energy [52].
Virtually every MPSoC platform provides support for implementing mapping and scheduling. The support can be provided in software or in hardware and might restrict the policies that can be implemented. This has to be taken into account by the compiler, which needs to generate/synthesize appropriate code (see Section 2.5). Software support can be provided by a full-fledged OS or by lightweight microkernels. Several commercial OS vendors provide solutions for MPSoC platforms, but fully efficient support for highly heterogeneous MPSoCs remains a major challenge. In order to reduce the overhead introduced by software stacks, hardware support for mapping and scheduling has been proposed both in the HPC [27][41] and in the embedded [61][14] communities.

2.4.2 Computing a Schedule

Independent of the scheduling approach and of how it is supported, the MPSoC compiler has to compute a schedule (or several of them). This problem is known to be NP-complete even for simple Directed Acyclic Graphs (DAGs). Uni-processor compilers therefore employ heuristics, most of them derived from the classical list scheduling algorithm [31]. Computing a schedule for multiprocessor platforms is by no means simpler. The requirements and characteristics of the schedule depend on the underlying MoC with which the application was modeled. In this chapter we distinguish between applications modeled with centralized and with distributed control flow.

Centralized control flow:

Uni-processor compilers deal with centralized control flow, i.e. instructions are placed in memory and a central entity, the program counter, dictates which instructions to execute next. The scheduler in a traditional uni-processor
compiler leaves the control decisions out of the analysis and focuses on scheduling instructions inside a BB. Since the control flow inside a BB is linear, there are no circular data dependencies, and the data dependence graph is therefore acyclic. The resulting DAG is typically scheduled with a variant of the list scheduling algorithm. Given a DAG G = (V, E), the list scheduling algorithm determines a starting time ts(n) and a resource r(n) for every node n ∈ V (see the sketch following this discussion). This is done by iterating the following steps until a valid schedule is obtained:
(i) Compute the ready set: the set of nodes for which all predecessors in G have already been scheduled.
(ii) Select a node: pick the node from the ready set with the highest priority. There are several heuristics to compute the priority of a node, for example by associating a higher priority with nodes on the critical path.
(iii) Allocate the node: the selected node n is mapped to a resource r(n) at time ts(n). There are also several heuristics for choosing the resource, for example selecting a resource that minimizes the finishing time of the node.
In order to achieve a higher level of parallelism, uni-processor compilers apply different techniques that go beyond BBs. Typical examples of these techniques include loop unrolling and software pipelining [45]. An extreme example of loop unrolling was introduced in the previous section, where the dependence graph in Figure 9a was completely unrolled in Figure 9b. Note that the graph in Figure 9b is acyclic and could be scheduled with the list scheduling algorithm. The results of list scheduling with 5 and 3 resources would look similar to the scheduling traces in Figures 10b and 10c, respectively.
In principle, the same scheduling approach can be used for MPSoCs. However, since every processor in the MPSoC has its own control flow, a mechanism has to be implemented to transfer control. In the example in Figure 10b, some processor holds the control code for the loop in line 14 of Figure 4 and activates the four parallel tasks T3. There are several ways of handling this distribution of control. Parallel programming models like pThreads offer source-level primitives to implement forks and joins. Some academic research platforms offer dedicated instructions to send so-called control tokens among processors [77].

Distributed control flow:

Parallel programming models based on concurrent MoCs like the ones discussed in Section 2.1.2 feature distributed control flow. For applications represented in this way, the issue of synchronization is greatly simplified and can be added to the logic of the channel implementation. Simple applications represented as acyclic task precedence graphs with predictable timing behavior can be scheduled with a list scheduling algorithm or with one of the many other available algorithms for DAGs. For a survey of DAG scheduling algorithms, the reader is referred to [43]. Applications where precedence constraints are not explicit in the programming model and where communication can be control dependent, e.g. KPNs, are usually scheduled dynamically.
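Referring back to steps (i)-(iii) above, a minimal list scheduler for identical resources may be sketched as follows; the priority function and tie-breaking are deliberately simplistic (node index stands in for a critical-path priority), and the graph representation is invented for this sketch:

  #define MAXN 16

  int n_nodes, n_res;          /* problem size                          */
  int dur[MAXN];               /* execution time of each node           */
  int pred[MAXN][MAXN];        /* pred[i][j] = 1 if j must precede i    */
  int start[MAXN], res[MAXN];  /* computed schedule                     */

  void list_schedule(void)
  {
      int done[MAXN] = {0}, finish[MAXN] = {0}, res_free[MAXN] = {0};
      for (int s = 0; s < n_nodes; s++) {       /* one node per iteration */
          /* (i)+(ii): find the highest-priority ready node */
          int best = -1, est = 0;
          for (int i = 0; i < n_nodes && best < 0; i++) {
              if (done[i]) continue;
              int ready = 1;
              est = 0;
              for (int j = 0; j < n_nodes; j++)
                  if (pred[i][j]) {
                      if (!done[j]) { ready = 0; break; }
                      if (finish[j] > est) est = finish[j];
                  }
              if (ready) best = i;
          }
          /* (iii): allocate on the resource minimizing the finishing time */
          int r = 0;
          for (int k = 1; k < n_res; k++)
              if (res_free[k] < res_free[r]) r = k;
          start[best] = (res_free[r] > est) ? res_free[r] : est;
          res[best] = r;
          finish[best] = start[best] + dur[best];
          res_free[r] = finish[best];
          done[best] = 1;
      }
  }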
Fig. 11 Example of SDF scheduling for the SDF in Figure 3b. (a) Derived DAG with r = [1 3 2]^T, (b) possible schedule on two PEs
Finally, for applications represented as SDF, a self-timed schedule can easily be computed.
• KPN scheduling: KPNs are usually scheduled dynamically. There are two major ways of scheduling a KPN: data driven and demand driven. In data driven scheduling, every process in the KPN with data available at its input is in the ready state. A dynamic scheduler then decides at runtime which process gets executed on which processor. A demand driven scheduler first schedules processes with no output channels. These processes execute until a read blocks on one of the input channels. The scheduler then triggers only the processes from which data has been requested (demanded). This procedure continues recursively. For further details the reader is referred to [62].
• SDF scheduling: As mentioned before, SDFs are usually scheduled using a self-timed schedule, which requires a static schedule to be computed in the first place. There are two major kinds of schedules: blocked and non-blocked schedules. In the former, a schedule for one cycle is computed and repeated without overlapping, whereas in the latter, the executions of different iterations of the graph are allowed to overlap. For computing a blocked schedule, a complete cycle of the SDF has to be determined. A complete cycle is a sequence of actor firings that brings the SDF back to its initial state. Finding a complete cycle requires that (i) enough initial tokens are provided on the edges and (ii) there is a non-trivial solution r for the system of equations Γ · r = 0, where the entry of Γ for channel j and actor i is p_ij − c_ij, with p_ij and c_ij the number of tokens that actor i produces to and consumes from channel j, respectively. In the literature, r is called the repetition vector and Γ the topology matrix. As an example, consider the SDF in Figure 3b. This SDF has the topology matrix

      (  3  -1   0 )
  Γ = (  6  -2   0 )
      (  0   2  -3 )
      ( -2   0   1 )

and a repetition vector is r = [1 3 2]^T.
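As a side note, for an example this small the repetition vector can even be found by exhaustive search over the balance equations; real tools solve Γ · r = 0 symbolically, so the following is only an illustration:

  #include <stdio.h>

  /* Topology matrix of the SDF in Figure 3b: rows are channels,
   * columns are actors a1, a2, a3. Searches for the smallest positive
   * integer solution of Gamma * r = 0. */
  int main(void)
  {
      int G[4][3] = { {  3, -1,  0 },
                      {  6, -2,  0 },
                      {  0,  2, -3 },
                      { -2,  0,  1 } };
      for (int r1 = 1; r1 <= 8; r1++)
        for (int r2 = 1; r2 <= 8; r2++)
          for (int r3 = 1; r3 <= 8; r3++) {
              int ok = 1;
              for (int e = 0; e < 4; e++)
                  if (G[e][0]*r1 + G[e][1]*r2 + G[e][2]*r3 != 0)
                      ok = 0;
              if (ok) {   /* prints r = [1 3 2]^T for this example */
                  printf("r = [%d %d %d]^T\n", r1, r2, r3);
                  return 0;
              }
          }
      return 1;
  }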
By unfolding the SDF according to its repetition vector and removing the feedback edges (those with delay tokens), one obtains the DAG shown in Figure 11a, with a possible schedule on two processors sketched in Figure 11b. Using this procedure, the problem of scheduling an SDF is turned into DAG scheduling and, once again, one of the many heuristics for DAGs can be used. See Chapter 30 for further details.
For general application models, and to obtain better results than with human-designed heuristics, several optimization methods are used. Integer Linear Programming is used in [4], and a combination of Integer Linear Programming and Constraint Programming (CP) is employed in [12]. Genetic Algorithms have also been used for this purpose; see Chapter 35.
Apart from scheduling and mapping code blocks and communication, a compiler also needs to map data. Data locality is already an issue for single-processor systems with complex memory architectures: caches and Scratch Pad Memories (SPMs). In MPSoCs, maximizing data locality and minimizing false sharing is an even bigger challenge [35].

Summary

Mapping and scheduling are among the major challenges of MPSoC compilers. Different application constraints lead to new optimization objectives. Besides, different programming models with their underlying MoCs allow different scheduling approaches. Most of these techniques work under the premise of accurate performance estimation (Section 2.3), which is by itself a hard problem. Due to the high heterogeneity of signal processing MPSoCs, the mapping of data represents a bigger challenge than in uni-processor systems.
2.5 Code Generation

The code generation phase of an MPSoC compiler is ad-hoc due to the heterogeneity of MPSoCs. To name a few examples: the PEs are heterogeneous and their programming models may differ, the communication networks (and thus the APIs) are heterogeneous, and the implementations of OS-service libraries can vary from one platform to another. After partitioning, mapping and scheduling, the code generation phase of an MPSoC compiler acts like a meta-compiler: distributing the code blocks to the PE compilers, generating the code for communication and schedulers, linking with the low-level libraries, and so on.
The distinguishing feature of heterogeneous MPSoCs is that different types of processors, which are usually programmable, are deployed. They can be developed in-house or acquired from third-party IP vendors. They are usually shipped with software tool-chains such as C compilers, linkers and assemblers. After the partitioning and mapping phases, the MPSoC compiler needs to generate the mapped-to-processor code, in the form of either a high-level programming language like C or PE-specific assembly, to be further compiled by the PE tool-chain.
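As a rough sketch of what such generated per-PE code may look like, consider the following; the communication API, channel identifiers and code-block names are all invented for this illustration (real platforms provide their own primitives):

  #include <stdint.h>

  /* hypothetical platform communication layer */
  enum { CH_S, CH_ROW, CH_OUT };
  void plat_send(int channel, const void *buf, uint32_t len);
  void plat_recv(int channel, void *buf, uint32_t len);

  /* code block mapped to this PE by the MPSoC compiler */
  extern void f3(int s, int *row);

  /* generated top-level loop for one PE */
  void pe1_main(void)
  {
      int s, row[64];
      for (;;) {
          plat_recv(CH_S,   &s,  sizeof s);     /* token from producer PE */
          plat_recv(CH_ROW, row, sizeof row);
          f3(s, row);
          plat_send(CH_OUT, row, sizeof row);   /* forward to consumer PE */
      }
  }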
The PE compiler can turn on its own optimization features to further improve the code quality. In this sense, an MPSoC compiler is more like a meta-compiler on top of many off-the-shelf compilers, coordinating the compilation process. One example of the many initial efforts made in this area is the Real Time Software Components (RTSC) project, initially driven by TI to help standardize embedded components [6]. An RTSC package is a named collection of files that form a unit of versioning, update and delivery from a producer to a consumer. Important information such as the codec library, building/linker parameters and the instruction-set architecture for a particular codec is specified and packaged. Therefore, different codecs from different vendors can be packaged seamlessly together, giving further benefits such as helping to automate building/linking.
The PEs communicate with each other using the NoC, which requires communication/synchronization primitives (e.g. semaphores, message passing) to be correctly put in place in the code blocks that the MPSoC compiler distributes to the PEs. Again, due to the heterogeneous nature of the underlying architecture, the same communication link may look very different in the implementation, e.g. when the sending/receiving points are on different PEs. Embedded applications often need to be implemented in a portable fashion for the sake of software re-use. Abstracting the communication functions to a higher level in the programming model is widely practiced, though it is still very ad-hoc and platform-specific. Recently the Multicore Association published the first draft of the Multicore Communications API (MCAPI), a message-passing API that captures the basic elements of communication and synchronization required for closely distributed embedded systems [3]. This may prove a good first step in this area.
As discussed in Section 2.4, the scheduling decision is a key factor in the MPSoC compiler, especially in embedded computing where real-time constraints have to be met. No matter which scheduling policy is determined for the final design, the functionality has to be implemented in hardware, in software, or in a hybrid manner. A common approach is to use an existing OS, often an RTOS, as a scheduler. There are many commercial solutions available, such as those from QNX and Wind River. Scheduler implementations in hardware are not uncommon for embedded devices, as software solutions may lead to larger overhead, which is not acceptable for real-time-constrained embedded systems. Industrial companies as well as academia have delivered some promising results in this area, though more success stories are still needed. A hybrid solution is a mixture, where some acceleration for the scheduler is implemented in hardware while flexibility is provided by software programmability, thereby customizing the trade-off between efficiency and flexibility. If scheduling is not supported by, e.g., an OS or a hardware scheduler, the code generation phase needs to generate or synthesize the scheduler itself, e.g. [17][44].

Summary

Code generation is a complicated process, in which many efforts are made to hide the compilation complexity via layered SW stacks and APIs. Heterogeneity will cause ad-hoc tool-chains to exist for a long time, further aggravating the situation.
3 Case Studies

As discussed in Section 2, the complexity of MPSoC compilers grows rapidly compared to that of uni-processor compilers. With MPSoC compilers still in their infancy, the compiler constructions of today's industrial platforms and academic prototypes are very much ad-hoc. This section surveys some typical case studies so that the reader can see how real-world implementations address the various MPSoC compiler challenges.

• Academic Research: In academia, many research groups have recently looked into the topic of MPSoC compilers. Because the topic is by nature very heterogeneous, it has attracted interest from different research communities, such as real-time computing, compiler optimization, parallelization and fast simulation. A considerable amount of effort has been put into areas such as MoCs, automatic parallelization, retargetable compilers and virtual simulation platforms. Compared to their counterparts in industry, academic researchers focus mostly on the upstream of the MPSoC compilation flow, e.g., using MoCs to model applications, automatic task-to-processor mapping, early system performance estimation and holistic construction of MPSoC compilers.

• Industrial Cases: Several large semiconductor houses already have a few mature product lines aimed at different segments of the market, owing to the application-specific nature of embedded devices. The stringent time-to-market window makes a platform-based MPSoC design methodology a necessity: a new generation of an MPSoC architecture is based on a previous successful model, with some evolutionary improvements. Compared to their counterparts in academia, MPSoC SW architects in industry focus more on SW tool re-use (considering the huge amount of certified code), on providing abstractions and conveniences to programmers for SW development, and on efficient code generation.

Two case studies are explained in detail here, followed by a number of other research projects described briefly. From academia, a detailed look at the MPSoC software compilation flow of the SHAPES project [60] is given. From industry, the compilation tool-chain of TI OMAP [71] is demonstrated.

SHAPES

SHAPES [60] is a European Union FP6 Integrated Project whose objective is to develop a prototype of a tiled, scalable HW and SW architecture for embedded applications featuring inherent parallelism. The major SHAPES building block, the RISC-DSP tile (RDT), is composed of an Atmel Magic VLIW floating-point DSP, an ARM9 RISC, on-chip memory, and a network interface for on- and off-chip communication. On the basis of RDTs and interconnect components, the architecture can easily be scaled to meet the computational requirements of the application at minimum cost.
The SHAPES software development environment is illustrated in Figure 12. The starting point is the Model-driven compiler/Functional simulator, which takes an application specification in process networks as input. High-level mapping exploration combines trace information from the Virtual Shapes Platform (VSP) with analytical performance results from the Analytical Estimator, based on multi-objective optimization over various conflicting criteria such as throughput, delay, predictability and efficiency. With the mapping information, the Hardware dependent Software (HdS) phase then generates the necessary dedicated communication and synchronization primitives, together with OS services.
Fig. 12 The SHAPES Software Development Environment
The central part of the SHAPES software environment is the Distributed Operation Layer (DOL) framework [66]. The DOL structure and its interactions with other tools and elements are shown in Figure 13. DOL provides MPSoC software developers with two main services: system-level performance analysis and process-to-processor mapping exploration.

Fig. 13 The DOL framework

• DOL Programming Model: DOL uses process networks as the programming model: the structure of the application is specified in XML, consisting of processes, software channels and connections, while the application functionality is specified in C/C++, with process communication performed through the DOL APIs, e.g. DOL_read() and DOL_write(). DOL uses a special iterator element to allow the user to instantiate several processes of the same type. For the process functionality in C/C++, a set of coding rules needs to be followed: each process must provide an init and a fire procedure. The init procedure allocates and initializes data and is called once during the initialization of the application; the fire procedure is called repeatedly afterwards. (A small illustrative sketch of such a process follows this list.)

• Architecture Description: DOL aims at mapping, so its architecture description abstracts away several details of the underlying architecture. The description, in XML format, contains three kinds of information: structural elements such as processors and memories, performance data such as bus throughput, and parameters such as memory sizes.

• Mapping Exploration: DOL mapping comprises two phases, performance evaluation and optimization. Performance evaluation combines data from analytical performance evaluation with data from detailed simulation. The designer fixes the optimization objectives a priori, and DOL uses evolutionary algorithms and the PISA interface [5] to generate the mapping in XML format. With the mapping descriptor, the HdS layer generates the hardware-dependent implementation code and makefiles. Thereafter the application can be compiled and linked against the communication libraries and OS services. The final binary can be run on the VSP or on the SHAPES hardware prototype.
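As an illustration of these coding rules, here is a minimal, hypothetical sketch of a DOL-style process in C. The init/fire pair and the DOL_read()/DOL_write() calls follow the description above, but the exact signatures, the port handles (PORT_IN, PORT_OUT) and the processing performed are illustrative assumptions, not the verbatim DOL API.

/* Hypothetical DOL-style process: init() sets up local state once,
 * fire() is invoked repeatedly by the runtime.                     */
extern void DOL_read(int port, void *buf, unsigned len);
extern void DOL_write(int port, const void *buf, unsigned len);

enum { PORT_IN = 0, PORT_OUT = 1 };   /* illustrative port handles */

static struct {
    float gain;                       /* process-local state        */
} state;

void scaler_init(void)                /* called once at start-up    */
{
    state.gain = 2.0f;
}

void scaler_fire(void)                /* called repeatedly          */
{
    float sample, result;
    DOL_read(PORT_IN, &sample, sizeof(sample));    /* blocking read  */
    result = state.gain * sample;
    DOL_write(PORT_OUT, &result, sizeof(result));  /* blocking write */
}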
TI OMAP

The TI OMAP (Open Multimedia Application Platform) family consists of multiple SoC products targeting portable and mobile multimedia applications. The recent OMAP35x series [71] combines general-purpose, multimedia and graphics processing in a single chip and delivers good performance for advanced user interfaces, web browsing, enhanced graphics and multimedia.
Fig. 14 (a) A detailed block diagram of the OMAP3530 processor (b) TI OMAP Software Stack
Figure 14 (a) shows the detailed block diagram of the OMAP3530 processor. The OMAP3530 has an ARM Cortex-A8 CPU, a TI C64x DSP and a 2D/3D graphics accelerator. Generally speaking, the application programming strategy [39] for OMAP is to allocate computation-intensive tasks (e.g. many video/audio signal-processing algorithms) to the DSP and to keep the ARM loaded with high-level OS tasks and I/O. Like many other embedded SoC solutions, TI OMAP products carry more and more hardware accelerators for special processing along the product line. For instance, the SGX graphics accelerator available on the OMAP3530 enables programmers to migrate OpenGL code. TI provides a rich set of tools for developing applications on the different PEs, with many abstractions that ease the programmer's job by hiding platform details. Nevertheless, how to partition and allocate the different tasks/applications onto the platform so as to efficiently leverage the underlying processing power remains an art, which relies heavily on the knowledge and experience of the programmer.

Figure 14 (b) illustrates the software development tools and layers that TI provides. TI promotes a philosophy of abstraction among the software layers, hiding just enough detail for developers in their different roles.

• OS Level (low-level developer): At the OS level, the choice of OSes on the ARM is diverse, including Linux and WinCE. The so-called DSP/BIOS developed by TI is a real-time multi-tasking kernel especially optimized for real-time DSP task scheduling. DSP/BIOS has a very small memory footprint, high speed optimized for DSP tasks, and low, deterministic latency. DSP/BIOS Link is a special layer developed by TI to handle inter-processor communication.

• Algorithm/Codec Level (algorithm developer): Algorithms/codecs are usually allocated to the DSP due to its computational power. TI has standardized the interfaces of the algorithms/codecs to the upper levels, such as XDAIS (eXpressDSP Algorithm Interoperability Standard [72]), as the abstractions. These standards provide the coding rules and guidelines for the algorithm developers. There are also helper tools which can check the compliance of given algorithms/codecs with the standards and provide the standard packaging for delivery.

• Codec Engine Server Level (system integrator): The Codec Engine Server is developed by the system integrator by combining the core codecs with the other infrastructure pieces (DSP/BIOS, DSP/BIOS Link, and so on), ultimately producing an executable that is callable from another core (usually the ARM) where the Codec Engine is running. It is critical that the system resource requirements (MIPS, memory, DMA and so on) are evaluated to ensure that the codec combination can co-exist and run as required.

• Custom Application Development Level (application developer): The application developer uses the Codec Engine APIs to leverage the codec functionality, on top of the layers introduced above. Third-party tools/frameworks that provide valuable add-ons like GUIs or streaming frameworks can be ported here. TI also provides DMAI, the DaVinci (and OMAP) Multimedia Application Interface, to ease the programmer's job by providing a collection of useful modules such as abstractions of common peripherals.

The abstractions among the layers are realized by standardized interfaces. Different teams can therefore work in different domains at the same time, boosting productivity. Moreover, this also opens the possibility for third parties to participate in the TI software stack with valuable commercial solutions, e.g. special codec packages or application-level GUI frameworks.

HOPES

HOPES is a parallel programming framework based on a novel programming model called Common Intermediate Code (CIC) [44]. The proposed work-flow of MPSoC software development is as follows. The software development process starts with specifying the application tasks in CIC, by either writing them manually or deriving them from an initial model-based specification. The CIC has separate sections describing the application's functionality (in terms of both functional parallelism and data parallelism) as well as the architecture information. The next step is mapping the tasks to PEs. Afterwards, the CIC translator translates the CIC program into executable C code for each processor core, which includes phases such as generic API translation and scheduler code generation. In the end, the target-dependent executables are generated and can be run on the simulator or on the hardware.

The core of HOPES is the CIC programming model, which is designed as the central IR enabling retargetable parallel compilation. Similar to most existing uni-processor retargetable tool-sets, the CIC separates the target-independent part from the target-dependent part as follows. The Task Code is a target-independent unit which will be mapped to a PE. The default inter-task communication model is a streaming-based, channel-like structure, which uses ports to send/receive data between tasks. OpenMP pragmas can be used to specify DLP or loop parallelism inside task code. Generic APIs, e.g. for channel access and file I/O, are used in the task code to keep it re-usable.
A special pragma to indicate the possibility of using a hardware accelerator for certain functions is also available. All these constructs are later translated or implemented by target-dependent translators, depending on the mapping. The Architecture Information has three subsections: the hardware section defines e.g. the processor list, memory map and hardware-accelerator information; the constraints section defines global constraints like power consumption as well as task-level constraints like period and deadline; and the structure section describes the application structure in terms of the interconnection of all the tasks and the task-to-PE mapping. The CIC translator marries the two sections of the CIC, taking care of implementing the target-independent constructs of the Task Code section with the target-dependent libraries/code of the Architecture Information section. All the interface code and the PE-executable C code are generated and further compiled by the PE C compilers. To support the retargetable compilation flow for a new target architecture, a new CIC translator needs to be written to realize the translation of the target-independent constructs in the CIC to the new target, which can be non-trivial.

Daedalus

The Daedalus framework [58] is a tool-flow developed at Leiden University for the automated design, programming and implementation of MPSoCs, starting at a high level of abstraction. The Daedalus design-flow consists of three key tools, the KPNgen tool [74], Sesame (Simulation of Embedded System Architectures for Multilevel Exploration) [63] and ESPAM (Embedded System-level Platform synthesis and Application Modeling) [57], which work together to offer designers a single environment for rapid system-level architectural exploration and automated programming and prototyping of multimedia MPSoC architectures. The ESPAM tool-set is the core synthesis tool in the Daedalus framework; it accepts the following inputs:

• Application Specification: In Daedalus, the application is specified as a KPN. For applications specified as parameterized static affine loop programs in C, the KPNgen tool can automatically derive KPN descriptions from the C programs. Otherwise, i.e. for applications not in this class, the application specification needs to be created by hand.

• Mapping Specification: The relation between the application and the platform, i.e. how all the processes in the KPN description of the application are mapped to the components in the MPSoC platform, is specified here.

• Platform Description: The platform description represents the topology of the MPSoC platform; the components are taken from a library of pre-defined and pre-verified IP components, including a variety of programmable and dedicated processors, memories and interconnects.

The Sesame tool is used for design space exploration to automatically generate the mapping and the platform description.
It allows fast performance evaluation of different application-to-architecture mappings, hardware/software partitionings and target platforms. The good candidates resulting from this exploration are given as input to the ESPAM tool. The ESPAM tool uses the inputs mentioned above, together with low-level component models (in RTL) from the IP library, to automatically generate synthesizable VHDL code that implements the hardware architecture. It also generates, from the XML specification of the application, the C code for those processes that are mapped onto programmable cores, together with the code for synchronizing the communication between the processors. Furthermore, well-developed commercial synthesis tools and the component compilers can be used to process the outputs for fast hardware/software prototyping.

TCT

The TCT model [77] provides a simple programming model based on the C language that allows designers to partition the application directly in the C code. The TCT compiler analyzes the control and data dependencies, automatically inserts the inter-thread communication and generates the concurrent execution model. The productivity of the iterative process of application parallelization is thereby greatly improved.

The TCT programming model is very simple. The construct THREAD(name) {...} is used to group statements into a TCT thread, which is executed as a (separate) concurrent process. With few exceptions, such as goto statements and dynamic memory allocation, any compound statement in C is allowed inside the thread scope. With minimal code modification, sequential C code is quickly turned into a parallel specification of the application (see the sketch below). The TCT execution model is a program-driven MIMD (Multiple Instruction Multiple Data) model where each thread is statically allocated to one processor. The execution model has multiple simultaneous flows of control, which enables modeling different parallelism schemes such as functional pipelining and task parallelism. Three thread synchronization instructions implement the decentralized thread interaction in the generated parallel code. The TCT compiler takes annotated C source code as input, performs the dependency analysis among the threads and generates the parallel code with inserted thread communication to guarantee correct concurrent execution. The TCT MPSoC, which is based on an existing AMBA SoC platform with a homogeneous distributed-memory multiprocessor array, is supported as a backend for code generation: a hardware prototype and a software tool called the TCT trace scheduler are available for the designer to simulate and obtain performance results for a given partitioning.
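The following is a small, hypothetical example of TCT-style partitioning. The THREAD(name) {...} construct is the one described above; the surrounding filter code, the function names and the frame size are invented for illustration.

/* Hypothetical TCT-style partitioning of a sequential filter loop
 * into two pipelined threads. fir_filter(), scale() and FRAME_LEN
 * are illustrative; only THREAD(name) {...} is the TCT construct. */
#define FRAME_LEN 256

extern short fir_filter(short sample);
extern short scale(short sample);

void process_frame(const short *in, short *out)
{
    short tmp[FRAME_LEN];

    THREAD(filter) {                 /* mapped to one processor     */
        int i;
        for (i = 0; i < FRAME_LEN; i++)
            tmp[i] = fir_filter(in[i]);
    }
    THREAD(emit) {                   /* pipelined with 'filter'     */
        int i;
        for (i = 0; i < FRAME_LEN; i++)
            out[i] = scale(tmp[i]);
    }
    /* The TCT compiler detects the dependency on tmp[] and inserts
     * the inter-thread communication automatically.                */
}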
MPA

The MPSoC Parallelization Assist (MPA) tool [27] was developed at IMEC to help designers map applications onto an embedded multicore platform efficiently and explore the solution space quickly. The starting point of the exploration flow is the sequential C source code of an application. The initial parallelization specification is assisted by profiling the application, either with an instruction-set simulator (ISS) or through source-code instrumentation and execution on a target platform. The key input to the tool is the Parallelization Specification (ParSpec), a file separate from the application source code that orchestrates the parallelization. The ParSpec is composed of one or more parallel sections that specify the computational partitioning of the application: outside a parallel section the code is executed sequentially, whereas inside a parallel section all code must be assigned to at least one thread. Both functional and data-level splits can be expressed in the ParSpec. Keeping the ParSpec separate from the application code allows the designer to try out different mapping strategies easily. The parallel C code is then generated by the MPA tool, with scalar dataflow analysis used to insert communication primitives into the generated code. Shared variables have to be explicitly specified by the designer, and the necessary synchronization has to be added using, e.g., LoopSync. The MPA tool targets two execution environments for the generated parallel code: hardware or a virtual platform when the platform is available, and a High-Level Simulator when it is not. An execution trace is generated, which the designer analyzes to decide whether the ParSpec needs changes to improve the mapping result.

CoMPSoC/C-HEAP

The CoMPSoC (Composable and Predictable Multi-Processor System on Chip) platform template, proposed in [28], aims at providing a scalable hardware and software MPSoC template that removes all interference between applications through resource reservations. The hardware design features admission control and budget enforcement in the building blocks of the shared architecture in order to achieve composability and predictability. Here we focus on the software platform of CoMPSoC; the interested reader is referred to [28] for details on the hardware template. The software stack of CoMPSoC is mainly composed of two parts: the application middleware and the host software. As in other works introduced in this chapter, in order to accomplish the synchronization and communication between tasks running simultaneously on the processors, an implementation of C-HEAP [56] is bundled as part of the CoMPSoC template. (C-HEAP itself is a complete API which covers not only the communication protocol, i.e. synchronization and data transport, but also the reconfiguration protocol, i.e. task management and configuration of the channels.) The communication API is implemented on shared memory, with synchronization done using pointers in a memory-mapped FIFO-administration structure [28].
The host software, i.e. all the administrative tasks for the processor tiles and the control registers of the communication infrastructure, is carried out on a central host tile and implemented in portable C libraries.

MAPS

The MAPS (MPSoC Application Programming Studio) project is part of RWTH Aachen University's Ultra high speed Mobile Information and Communication (UMIC) research cluster [73]. It is a research prototype tool-set for semi-automatic code parallelization and task-to-processor mapping, inspired by a typical problem setting of SW development for wireless multimedia terminals, where multiple applications and radio standards can be activated simultaneously and partially compete for the same resources. Applications can be specified either as sequential C code or in the form of pre-parallelized processes, where real-time properties such as latency and period, as well as preferred PE types, can optionally be annotated. MAPS uses advanced dataflow analysis to extract the available parallelism from the sequential code (a more detailed description of the code partitioning is presented in [15]) and to form a set of fine-grained task graphs based on a coarse model of the target architecture. The resulting partitioning/mapping can be exercised and refined with MVP (MAPS Virtual Platform [16]), a fast, high-level SystemC-based simulation environment which has been put into practice to evaluate different software settings in a multi-application scenario. After further refinement, a code generation phase translates the tasks into C code for compilation onto the respective PEs with their native compilers and OS primitives. The SW can then be executed, depending on availability, either on the real HW or on a cycle-approximate virtual platform incorporating instruction-set simulators. MAPS has also developed a dedicated task-dispatching ASIP (OSIP, the operating system ASIP [14]) in order to enable higher PE utilization via more fine-grained tasks and lower context-switching overhead. Early evaluation case studies showed great potential of the OSIP approach for lowering the task-switching overhead in a typical MPSoC environment, compared to an additional RISC performing the scheduling.
Summary

In this chapter we presented an overview of the challenges of building MPSoC compilers and described some of the techniques, both established and emerging, that are being used to leverage the computing power of current and forthcoming MPSoC platforms. The chapter concluded with selected academic and industrial examples that show how the concepts are applied to real systems. We have seen how new programming models are being proposed that change the requirements on the MPSoC compiler. We discussed that, independent of the programming model, an MPSoC compiler has to find a suitable granularity to expose parallelism beyond the instruction level (ILP), demanding advanced analysis of the data and control flows.
Computing a schedule and a mapping of an application is one of the most complex tasks of an MPSoC compiler and can only be accomplished successfully with accurate performance estimation or simulation. Most of these analyses are target-specific, hence the MPSoC itself needs to be abstracted and fed to the compiler. With this information, the compiler can tune the different optimizations to the target MPSoC and finally generate executable code. The whole flow shares similarities with that of a traditional uni-processor compiler, but is much more complex in the case of an MPSoC. We have presented some foundations and described approaches to deal with these problems. However, there is still a lot of research to be done to make the step from a high-level specification to executable code as transparent as it is in the uni-processor case.

Acknowledgements The authors would like to thank Jianjiang Ceng, Anastasia Stulova, Stefan Schürmans, and all ISS research members for their valuable comments.
References

1. Eclipse. http://www.eclipse.org/. Visited on Jan. 2010
2. GDB: The GNU Project Debugger. http://www.gnu.org/software/gdb/. Visited on Jan. 2010
3. MCAPI Multicore Communications API. http://www.multicoreassociation.org/workgroup/comapi.php. Visited on Nov. 2009
4. Multiprocessor Systems-on-Chips, chap. 12: ILP-based Resource-aware Compilation, pp. 337–354
5. PISA - A Platform and Programming Language Independent Interface for Search Algorithms. http://www.tik.ee.ethz.ch/sop/pisa/. Visited on Nov. 2009
6. Real Time Software Components. http://www.eclipse.org/dsdp/rtsc. Visited on Jan. 2010
7. AbsInt: aiT Worst-Case Execution Time Analyzers. http://www.absint.com/ait/. Visited on Nov. 2009
8. ACE: Embedded C for High Performance DSP Programming with the CoSy Compiler Development System. a-qual.com/topics/2005/EmbeddedCv2.pdf. Visited on Jan. 2010
9. Adl-Tabatabai, A.R., Kozyrakis, C., Saha, B.: Unlocking Concurrency. Queue 4(10), 24–33 (2007)
10. Aho, A.V., Sethi, R., Ullman, J.D.: Compilers: Principles, Techniques, and Tools. Addison-Wesley Longman Publishing Co., Inc., Boston, MA, USA (1986)
11. Asanovic, K., Bodik, R., Catanzaro, B.C., Gebis, J.J., Husbands, P., Keutzer, K., Patterson, D.A., Plishker, W.L., Shalf, J., Williams, S.W., Yelick, K.A.: The Landscape of Parallel Computing Research: A View from Berkeley. Tech. rep., EECS Department, University of California, Berkeley (2006)
12. Benini, L., Bertozzi, D., Guerri, A., Milano, M.: Allocation and Scheduling for MPSoCs via Decomposition and No-good Generation. Principles and Practice of Constraint Programming - CP 2005 (DEIS-LIA-05-001), 107–121 (2005)
13. Bhattacharya, B., Bhattacharyya, S.S.: Parameterized dataflow modeling for DSP systems. IEEE Transactions on Signal Processing 49(10), 2408–2421 (2001)
14. Castrillon, J., Zhang, D., Kempf, T., Vanthournout, B., Leupers, R., Ascheid, G.: Task Management in MPSoCs: An ASIP Approach. In: ICCAD 2009 (2009)
15. Ceng, J., Castrillon, J., Sheng, W., Scharwächter, H., Leupers, R., Ascheid, G., Meyr, H., Isshiki, T., Kunieda, H.: MAPS: an integrated framework for MPSoC application parallelization. In: DAC '08: Proceedings of the 45th annual conference on Design automation, pp. 754–759. ACM, New York, NY, USA (2008)
16. Ceng, J., Sheng, W., Castrillon, J., Stulova, A., Leupers, R., Ascheid, G., Meyr, H.: A high-level virtual platform for early MPSoC software development. In: CODES+ISSS '09: Proceedings of the 7th IEEE/ACM international conference on Hardware/software codesign and system synthesis, pp. 11–20. ACM, New York, NY, USA (2009)
17. Cesario, W., Jerraya, A.: Multiprocessor Systems-on-Chips, chap. 9: Component-Based Design for Multiprocessor Systems-on-Chip, pp. 357–394. Morgan Kaufmann (2005)
18. Collette, T.: Key Technologies for Many Core Architectures. In: 8th International Forum on Application-Specific Multi-Processor SoC (2008)
19. CoWare: CoWare Virtual Platforms. http://www.coware.com/products/virtualplatform.php. Visited on Apr. 2009
20. Fisher, J.A., Faraboschi, P., Young, C.: Embedded Computing: A VLIW Approach to Architecture, Compilers and Tools. Morgan Kaufmann (Elsevier) (2005)
21. Flake, P., Davidmann, S., Schirrmeister, F.: System-level exploration tools for MPSoC designs. In: Design Automation Conference, 2006, 43rd ACM/IEEE, pp. 286–287 (2006)
22. Gao, L., Huang, J., Ceng, J., Leupers, R., Ascheid, G., Meyr, H.: TotalProf: a fast and accurate retargetable source code profiler. In: CODES+ISSS '09: Proceedings of the 7th IEEE/ACM international conference on Hardware/software codesign and system synthesis, pp. 305–314. ACM, New York, NY, USA (2009)
23. Tournavitis, G., Wang, Z., Franke, B., O'Boyle, M.: Towards a Holistic Approach to Auto-Parallelization – Integrating Profile-Driven Parallelism Detection and Machine-Learning Based Mapping. In: PLDI '09: Proceedings of the Programming Language Design and Implementation Conference. Dublin, Ireland, June 15-20 (2009)
24. Gheorghita, S.V., Basten, T., Corporaal, H.: An Overview of Application Scenario Usage in Streaming-Oriented Embedded System Design
25. Martin, G.: ESL Requirements for Configurable Processor-based Embedded System Design. Design and Reuse
26. Gupta, R., Micheli, G.D.: Hardware-software Co-synthesis for Digital Systems. In: IEEE Design & Test of Computers, pp. 29–41 (1993)
27. Hankins, R.A., et al.: Multiple Instruction Stream Processor. SIGARCH Comp. Arch. News 34(2) (2006)
28. Hansson, A., Goossens, K., Bekooij, M., Huisken, J.: CoMPSoC: A template for composable and predictable multi-processor system on chips. ACM Trans. Des. Autom. Electron. Syst. 14(1), 1–24 (2009)
29. Hewitt, C., Bishop, P., Greif, I., Smith, B., Matson, T., Steiger, R.: Actor induction and meta-evaluation. In: POPL '73: Proceedings of the 1st annual ACM SIGACT-SIGPLAN symposium on Principles of programming languages, pp. 153–168. ACM, New York, NY, USA (1973)
30. Hind, M.: Pointer Analysis: Haven't we Solved this Problem Yet? In: PASTE '01, pp. 54–61. ACM Press (2001)
31. Hu, T.C.: Parallel Sequencing and Assembly Line Problems. Operations Research 9(6), 841–848 (1961). URL http://www.jstor.org/stable/167050
32. Hwang, Y., Abdi, S., Gajski, D.: Cycle-approximate Retargetable Performance Estimation at the Transaction Level. In: DATE '08: Proceedings of the conference on Design, automation and test in Europe, pp. 3–8. ACM, New York, NY, USA (2008)
33. Hwu, W.M., Ryoo, S., Ueng, S.Z., Kelm, J.H., Gelado, I., Stone, S.S., Kidd, R.E., Baghsorkhi, S.S., Mahesri, A.A., Tsao, S.C., Navarro, N., Lumetta, S.S., Frank, M.I., Patel, S.J.: Implicitly Parallel Programming Models for Thousand-core Microprocessors. In: DAC '07: Proceedings of the 44th annual conference on Design automation, pp. 754–759. ACM, New York, NY, USA (2007)
34. Kahn, G.: The Semantics of a Simple Language for Parallel Programming. In: J.L. Rosenfeld (ed.) Information Processing '74: Proceedings of the IFIP Congress, pp. 471–475. North-Holland, New York, NY (1974)
35. Kandemir, M., Dutt, N.: Multiprocessor Systems-on-Chips, chap. 9: Memory Systems and Compiler Support for MPSoC Architectures, pp. 251–281. Morgan Kaufmann (2005)
36. Karp, R.M., Miller, R.E.: Properties of a model for parallel computations: Determinacy, termination, queuing. SIAM Journal of Applied Math 14(6) (1966)
37. Karuri, K., Al Faruque, M.A., Kraemer, S., Leupers, R., Ascheid, G., Meyr, H.: Fine-grained Application Source Code Profiling for ASIP Design. In: DAC '05: Proceedings of the 42nd annual conference on Design automation, pp. 329–334. ACM, New York, NY, USA (2005)
38. Kennedy, K., Allen, J.R.: Optimizing compilers for modern architectures: a dependence-based approach. Morgan Kaufmann Publishers Inc., San Francisco, CA, USA (2002)
39. Kloss, N.: Application Programming Strategies for TI's OMAP Solutions. Embedded Edge (2003)
40. Krishnan, V., Torrellas, J.: A Chip-Multiprocessor Architecture with Speculative Multithreading. IEEE Trans. Comput. 48(9), 866–880 (1999)
41. Kumar, S., Hughes, C.J., Nguyen, A.: Carbon: Architectural Support for Fine-Grained Parallelism on Chip Multiprocessors. SIGARCH Comp. Arch. News 35(2) (2007)
42. Kung, H.T.: Why Systolic Architectures? Computer 15(1), 37–46 (1982)
43. Kwok, Y.K., Ahmad, I.: Static Scheduling Algorithms for Allocating Directed Task Graphs to Multiprocessors. ACM Comput. Surv. 31(4), 406–471 (1999)
44. Kwon, S., Kim, Y., Jeun, W.C., Ha, S., Paek, Y.: A retargetable parallel-programming framework for MPSoC. ACM Trans. Des. Autom. Electron. Syst. 13(3), 1–18 (2008)
45. Lam, M.: Software pipelining: an effective scheduling technique for VLIW machines. SIGPLAN Not. 23(7), 318–328 (1988)
46. Lee, E., Messerschmitt, D.: Synchronous data flow. Proceedings of the IEEE 75(9), 1235–1245 (1987)
47. Lee, E.A.: Consistency in Dataflow Graphs. IEEE Trans. Parallel Distrib. Syst. 2(2), 223–235 (1991)
48. Lee, E.A.: The Problem with Threads. Computer 39(5), 33–42 (2006). URL http://portal.acm.org/citation.cfm?id=1137232.1137289
49. Leupers, R.: Retargetable Code Generation for Digital Signal Processors. Kluwer Academic Publishers, Norwell, MA, USA (1997)
50. Leupers, R.: Code Selection for Media Processors with SIMD Instructions. In: DATE '00, pp. 4–8. ACM (2000)
51. Li, L., Huang, B., Dai, J., Harrison, L.: Automatic multithreading and multiprocessing of C programs for IXP. In: PPoPP '05: Proceedings of the tenth ACM SIGPLAN symposium on Principles and practice of parallel programming, pp. 132–141. ACM, New York, NY, USA (2005)
52. Ma, Z., Marchal, P., Scarpazza, D.P., Yang, P., Wong, C., Gómez, J.I., Himpe, S., Ykman-Couvreur, C., Catthoor, F.: Systematic Methodology for Real-Time Cost-Effective Mapping of Dynamic Concurrent Task-Based Systems on Heterogenous Platforms. Springer Publishing Company, Incorporated (2007)
53. Mignolet, J.Y., Baert, R., Ashby, T.J., Avasare, P., Jang, H.O., Son, J.C.: MPA: Parallelizing an Application onto a Multicore Platform Made Easy. IEEE Micro 29(3), 31–39 (2009)
54. Muchnick, S.S.: Advanced Compiler Design and Implementation. Morgan Kaufmann Publishers Inc., San Francisco, CA, USA (1997)
55. National Instruments: LabView. http://www.ni.com/labview/. Visited on Mar. 2009
56. Nieuwland, A., Kang, J., Gangwal, O.P., Sethuraman, R., Busá, N., Goossens, K., Llopis, R.P.: C-HEAP: A Heterogeneous Multi-processor Architecture Template and Scalable and Flexible Protocol for the Design of Embedded Signal Processing Systems. Design Automation for Embedded Systems (7), 233–270 (2002)
57. Nikolov, H., Stefanov, T., Deprettere, E.: Systematic and Automated Multiprocessor System Design, Programming, and Implementation. IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems 27(3), 542–555 (2008)
58. Nikolov, H., Thompson, M., Stefanov, T., Pimentel, A., Polstra, S., Bose, R., Zissulescu, C., Deprettere, E.: Daedalus: Toward Composable Multimedia MP-SoC Design. In: DAC '08: Proceedings of the 45th annual conference on Design automation, pp. 574–579. ACM, New York, NY, USA (2008)
59. The OpenMP specification for parallel programming: www.openmp.org. Visited on Nov. 2009
60. Paolucci, P.S., Jerraya, A.A., Leupers, R., Thiele, L., Vicini, P.: SHAPES: a tiled scalable software hardware architecture platform for embedded systems. In: CODES+ISSS '06: Proceedings of the 4th international conference on Hardware/software codesign and system synthesis, pp. 167–172. ACM, New York, NY, USA (2006)
61. Park, S., Hong, D.S., Chae, S.I.: A Hardware Operating System Kernel for Multi-Processor Systems. IEICE 5(9) (2008)
62. Parks, T.M.: Bounded scheduling of process networks. Ph.D. thesis, Berkeley, CA, USA (1995)
63. Pimentel, A.D., Erbas, C., Polstra, S.: A Systematic Approach to Exploring Embedded System Architectures at Multiple Abstraction Levels. IEEE Transactions on Computers 55(2), 99–112 (2006)
64. Sharma, G., Martin, J.: MATLAB®: A Language for Parallel Computing. International Journal of Parallel Programming 37(1) (2009)
65. Snir, M., Otto, S.: MPI - The Complete Reference: The MPI Core. MIT Press (1998)
66. Sporer, T., Franck, A., Bacivarov, I., Beckinger, M., Haid, W., Huang, K., Thiele, L., Paolucci, P., Bazzana, P., Vicini, P., Ceng, J., Kraemer, S., Leupers, R.: SHAPES - a Scalable Parallel HW/SW Architecture Applied to Wave Field Synthesis. In: Proc. 32nd Intl. Audio Engineering Society (AES) Conference, pp. 175–187. Audio Engineering Society, Hillerod, Denmark (2007)
67. Sriram, S., Bhattacharyya, S.S.: Embedded Multiprocessors: Scheduling and Synchronization. Marcel Dekker, Inc., New York, NY, USA (2000)
68. Standard for information technology - portable operating system interface (POSIX). Shell and utilities. IEEE Std 1003.1-2004, The Open Group Base Specifications Issue 6, section 2.9: IEEE and The Open Group
69. Synopsys: Synopsys Virtual Platforms. http://www.synopsys.com/Tools/SLD/VirtualPlatforms/Pages/default.aspx. Visited on May 2009
70. Kempf, T., Wallentowitz, S., Ascheid, G., Leupers, R., Meyr, H. (RWTH Aachen University): A Workbench for Analytical and Simulation based Design Space Exploration of Software Defined Radios. In: VLSI Design Conference 2009. New Delhi, India (2009)
71. TI: OMAP35x Product Bulletin. http://www.ti.com/lit/sprt457. Visited on Mar. 2009
72. TI: TI eXpressDSP Software and Development Tools. http://focus.ti.com/general/docs/gencontent.tsp?contentId=46891. Visited on Jan. 2010
73. UMIC: Ultra high speed Mobile Information and Communication. http://www.umic.rwth-aachen.de. Visited on Nov. 2009
74. Verdoolaege, S., Nikolov, H., Stefanov, T.: PN: A Tool for Improved Derivation of Process Networks. EURASIP J. Embedded Syst. 2007(1), 19–19 (2007)
75. Wilhelm, R., Engblom, J., Ermedahl, A., Holsti, N., Thesing, S., Whalley, D., Bernat, G., Ferdinand, C., Heckmann, R., Mitra, T., Mueller, F., Puaut, I., Puschner, P., Staschulat, J., Stenström, P.: The Worst-case Execution-time Problem - Overview of Methods and Survey of Tools. ACM Trans. Embed. Comput. Syst. 7(3), 1–53 (2008)
76. Working Group ISO/IEC JTC1/SC22/WG14: C99, Programming Language C, ISO/IEC 9899:1999
77. Zalfany Urfianto, M., Isshiki, T., Ullah Khan, A., Li, D., Kunieda, H.: Decomposition of Task-Level Concurrency on C Programs Applied to the Design of Multiprocessor SoC. IEICE Trans. Fundam. Electron. Commun. Comput. Sci. E91-A(7), 1748–1756 (2008)
DSP Instruction Set Simulation Florian Brandner and Nigel Horspool and Andreas Krall
Abstract An instruction set simulator is an important tool for system architects and for software developers. However, when implementing a simulator, there are many design choices that affect both the speed and the accuracy of the simulation; these choices are especially relevant to DSP simulation. This chapter explains the different strategies for implementing a simulator.
1 Introduction/Overview

The Instruction Set Architecture (ISA) of a computer comprises a complete definition of the instructions, the way those instructions are represented by bit patterns in the computer memory, and the semantics of each instruction when executed. An ISA is normally implemented by the physical hardware of the computer, using hardware resources such as the register file, the arithmetic and logic unit, the bus, and so on. However, it is equally possible for the ISA to be implemented entirely in software. In this situation, we are using the computer which runs the software (the host computer) to emulate the ISA of a different computer (the guest computer). The software application which implements the ISA is commonly known as an instruction set simulator [61, chapter 2]. It should be noted that some implementations of instruction set simulators make use of extra hardware resources to boost their performance, and some are implemented entirely in hardware (in which case, the correct name for the technology is instruction set emulation).
Florian Brandner
Institut für Computersprachen, Technische Universität Wien, e-mail: [email protected]

Nigel Horspool
Department of Computer Science, University of Victoria, e-mail: [email protected]

Andreas Krall
Institut für Computersprachen, Technische Universität Wien, e-mail: [email protected]
Instruction set simulators have several important uses. The three main ones, which will be briefly explained below, are:

• to implement a virtual machine,
• to emulate a computer architecture using a different architecture, and
• to perform low-level monitoring of software execution.

A virtual machine is used for the purpose of providing a managed code environment. For example, Java source code is compiled to Java bytecode, which consists of instructions for the Java Virtual Machine (JVM). The JVM has an ISA which is perhaps too complex to be implemented fully in hardware but is easily implementable in software. (The Java code is managed in the sense that the bytecode instructions are verified by the JVM to ensure that unsafe usage cannot occur.) Note that this usage of the term 'virtual machine' is distinct from the notion of virtual machine in the context of virtualization. The purpose of virtualization is to allow multiple operating systems to share a single hardware platform through the use of virtual hardware. The instruction sets of the virtual machines would, in the context of virtualization, be identical to the instruction set of the hardware platform, so there would be no place for instruction set simulation.

The second use is where one computer emulates another computer with a different ISA. It is particularly useful when software applications have been developed for one architecture and need to be executed on another architecture. This, for example, was the technique by which Apple eased the transition for its users when the architecture of the Macintosh computers was changed from the Motorola 68000 series to the PowerPC and then, again, to the Intel x86. Having one computer emulate another computer architecture is also useful when a new computer is under development and not yet available to the software developers.

The third use arises because a software implementation of an ISA can easily be augmented with additional software which makes various performance measurements. It can measure quantities such as the number of instructions executed, the numbers of conditional branches taken and not taken, the number of stalls in a simulated pipeline, the hit rate of a simulated cache, the number of simulated clock cycles per instruction, and so on. Such measurements can be very useful to a system designer who has to choose a system configuration, such as choosing between a unified L1 cache or a split instruction cache and data cache, and choosing the sizes of the caches and the number of cache levels, etc. These measurements can also be useful to a software developer who is fine-tuning code to make it execute as efficiently as possible. In such a scenario, it makes perfect sense to execute an x86 simulator on an x86 platform, and similarly for other platforms.

It is the second and third uses which are the most relevant to DSP instruction set simulation. Using a simulator which provides detailed performance measurements, a systems engineer can configure the system to optimize the trade-offs between factors such as cost, power consumption, and speed. Meanwhile, a software developer can use the simulator to begin testing and debugging code written for the DSP system before that system has been deployed.
There are two levels of implementation for an instruction set simulator: full-system level and CPU level [66].

• Full-system simulation is the most complete and most detailed level: the whole computer system, including the I/O bus and the peripherals, is simulated in software. Thus, for example, if the code being simulated generates a simulated interrupt, the simulator proceeds by executing the instructions of the interrupt service routine, and so on. Similarly, instructions which generate system calls cause the simulation to continue with a change in execution mode and subsequent execution of code that belongs to an operating system. This is the level of implementation needed by a system designer or by a software developer producing kernel-level code for an operating system.

• CPU-level emulation, in contrast, requires simulating only the CPU as it executes a single application program. All system calls, e.g. for performing an I/O operation, are mapped by the simulator into equivalent calls to services provided by the host operating system (a small sketch of such system-call forwarding is given after this list). There is no attempt to continue the execution of instructions into operating system code. This is the level of simulation appropriate for implementing a virtual machine, or for allowing a software developer to produce application programs for the emulated architecture.

There are two main approaches for implementing an ISA simulator in software. One approach is to use an interpreter, and the other is to use code translation. Both approaches have several variations. These choices and their variations, as well as the role of hardware in assisting the simulation, are explained in more detail in the remainder of this chapter. A final possibility, using special-purpose hardware to emulate the ISA, will not be covered here.
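A minimal sketch of such system-call forwarding in C, assuming a MIPS-like guest whose register file and memory are modeled by REG and MEM arrays as in the interpreter shown later in Figure 1; the register roles and the syscall number are illustrative.

#include <unistd.h>

/* Hypothetical CPU-level system-call forwarding. Register roles
 * ($v0 = REG[2], $a0..$a2 = REG[4..6]) follow the MIPS o32
 * convention; 4004 is the Linux/MIPS number for write. Guest
 * addresses are treated as byte offsets into the MEM array.       */
extern int REG[32];
extern unsigned MEM[];

void emulate_syscall(void)
{
    switch (REG[2]) {                        /* $v0: syscall number */
    case 4004:                               /* write(fd, buf, len) */
        REG[2] = write(REG[4],               /* $a0: host file desc.*/
                       (char *)MEM + REG[5], /* $a1: guest pointer  */
                       REG[6]);              /* $a2: length         */
        break;
    /* ... other system calls are forwarded to the host similarly ... */
    default:
        REG[2] = -1;                         /* unknown call: fail  */
    }
}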
2 Interpretive Simulation

The oldest and simplest software implementation of an ISA simulator is an instruction interpreter modeled after the fetch-decode-execute cycle of a von Neumann-style computer. The straightforward implementation of an ISA interpreter [36] is sometimes called a 'classic interpreter' and is explained below. However, there are some obvious inefficiencies which affect the speed of a classic interpreter. Improvements to the interpreter structure which address those inefficiencies give rise to some variations which are also described below.
2.1 The Classic Interpreter

The interpreter has an array MEM which models the RAM memory of the architecture being simulated, and it has a variable PC which models the program counter register of that architecture. If the computer has a set of numbered registers, an array REG can be used to model those registers. A small portion of possible C code for an interpreter of the MIPS instruction set is shown in Figure 1. It uses an array REG to mimic the register file, it has a variable which holds the current state of the condition code register, and its MEM array holds an exact replica of what would be in the main memory of the simulated computer.

// load instructions and data into MEM array
PC = start_address;
while (isRunning) {
  instr = MEM[PC/4];       // fetch next instruction
  PC += 4;                 // advance PC by 4 bytes
  opcode = decode(instr);  // extract opcode
  switch (opcode) {
    ...
    case ADDI:  // add-immediate instruction
      C  = instr & 0xFFFF;
      Rs = (instr >> 21) & 0x1F;
      Rt = (instr >> 16) & 0x1F;
      sum = IntegerAdd(REG[Rs], C);
      if (Rt != 0) REG[Rt] = sum;
      break;
    ...
    case SW:    // store word instruction
      C  = instr & 0xFFFF;
      Rx = (instr >> 21) & 0x1F;
      Ry = (instr >> 16) & 0x1F;
      MEM[(REG[Ry]+C)/4] = REG[Rx];
      break;
    ...
  }
}
Fig. 1 Structure of an ISA Interpreter for the MIPS Architecture.
Two of the simpler instructions have been picked for the example code; even so, there are some hidden complexities. For example, overflow in the add-immediate instruction must be detected and handled in the same way as on the simulated architecture. This tricky detail is assumed to be implemented by the IntegerAdd function. The interpreter code can exactly mimic the hardware resources of the computer being simulated. If caches, or other components of the system, need to be simulated, the interpreter code can be augmented in a straightforward manner.
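For instance, a data-cache model can be called from the interpreter's load/store cases to collect hit-rate statistics. The following direct-mapped model is a minimal sketch; the sizes, names and policy are illustrative assumptions.

/* A sketch of a direct-mapped data-cache model for the interpreter
 * of Figure 1. The SW case would call dcache_access(REG[Ry]+C)
 * before updating MEM.                                             */
#define DC_LINES     256   /* number of cache lines (assumption)    */
#define DC_LINE_BITS 5     /* log2 of the 32-byte line (assumption) */

static unsigned dc_tag[DC_LINES];
static int      dc_valid[DC_LINES];
unsigned long long dc_hits, dc_misses;   /* reported after the run  */

void dcache_access(unsigned addr)
{
    unsigned line = (addr >> DC_LINE_BITS) % DC_LINES;
    unsigned tag  = addr >> DC_LINE_BITS;

    if (dc_valid[line] && dc_tag[line] == tag) {
        dc_hits++;                        /* hit: statistics only   */
    } else {
        dc_misses++;                      /* miss: fill the line    */
        dc_valid[line] = 1;
        dc_tag[line]   = tag;
    }
}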
An important point is that additional code to monitor execution performance can easily be added, too. However, there is a significant performance penalty when a simple interpreter is used. If we compare the speed of the actual computer against the speed of an interpreter simulating that computer when executed on that same computer, there is typically a slowdown of a few orders of magnitude. The speed which can be achieved depends very much on the complexity of the ISA. Much effort has been put into making interpreters work faster, but the modifications made to the interpreter scheme can interfere with the correctness or the completeness of the emulation. These problems will be discussed below. It should also be observed that a significant mismatch between the characteristics of the guest system and the host system can incur a big performance penalty. For example, attempting to simulate a system with 40-bit words (which are quite common for fixed-point arithmetic DSPs) on a host computer which has 32-bit words can cause severe complications. Of course, performance penalties due to such mismatches are not limited to interpretive techniques; they are likely to be observed no matter what simulation strategy is employed.

Instead of directly interpreting the binary representation, interpretive simulators often store the program in an intermediate representation which is more efficient to handle. A well-known example is the SPIM simulator for the MIPS processor [37]. Decoding of the instructions is done only once, when the binary representation is converted into the intermediate representation. When a new processor is under development, the binary representation is usually specified as the last step; instruction set simulators for design space exploration therefore often use an intermediate representation of the simulated program. As long as the size and the encoding of the instructions are not yet determined, the instruction cache can only be approximated.
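A minimal sketch of such a pre-decoded intermediate representation, assuming the MIPS-like interpreter of Figure 1; the structure layout and the names are illustrative.

/* Pre-decoding into an intermediate representation: the bit-field
 * extraction of Figure 1 is done once at load time, so the main
 * loop only indexes DCODE[].                                      */
#define MEM_WORDS 65536               /* illustrative memory size  */
extern unsigned MEM[MEM_WORDS];       /* as in Figure 1            */
extern int decode(unsigned instr);    /* as in Figure 1            */

typedef struct {
    int opcode;    /* extracted opcode               */
    int rs, rt;    /* pre-decoded register operands  */
    int imm;       /* sign-extended immediate        */
} DecodedInstr;

static DecodedInstr DCODE[MEM_WORDS];

void predecode(void)
{
    int i;
    for (i = 0; i < MEM_WORDS; i++) {
        unsigned instr = MEM[i];
        DCODE[i].opcode = decode(instr);
        DCODE[i].rs  = (instr >> 21) & 0x1F;
        DCODE[i].rt  = (instr >> 16) & 0x1F;
        DCODE[i].imm = (short)(instr & 0xFFFF);  /* sign-extend */
    }
}

/* In the main loop, fetch-and-decode then reduces to:
 *   DecodedInstr *d = &DCODE[PC/4];
 *   switch (d->opcode) { ... }
 */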
2.2 Threaded Code

The outermost control structure of a classic interpreter is a switch statement inside a loop. Although it is the code in each case of the switch statement which performs the useful work, there is significant overhead associated with transferring control from the end of one case to the start of the case needed to simulate the next instruction. This overhead can be greatly reduced if we pre-translate the code to be interpreted into a different format. One such format is threaded code [5]. For example, if the program being interpreted is an assembled version of the MIPS code shown in Figure 2(a), then the interpreter can begin by filling a data array with the information shown in Figure 2(b). This TCODE array can then be used to implement direct transfers of control from the code for one MIPS instruction to the code for the next MIPS instruction, and so on. The interpreter code of Figure 1 simply needs to be modified to have the structure shown in Figure 3.
(a) MIPS Assembler Code:

@ MIPS Assembler
...
addi 1,2, 29
addi 1,1, 1
sw   1, 24(17)
addi 17,17, 4
sw   1, 24(17)
...

(b) Threaded Code Array:

void *TCODE[] = {
  ...
  &ADDI,
  &ADDI,
  &SW,
  &ADDI,
  &SW,
  ...
};
Fig. 2 Threaded Code Translation Example.
// load instructions and data into MEM array
PC = start_address;
TPC = &TCODE[PC/4];
goto *TPC++;        // jump to first instruction
...
ADDI:  // add-immediate instruction
  instr = MEM[PC/4];
  PC = PC + 4;
  C  = instr & 0xFFFF;
  Rs = (instr >> 21) & 0x1F;
  Rt = (instr >> 16) & 0x1F;
  sum = IntegerAdd(REG[Rs], C);
  if (Rt != 0) REG[Rt] = sum;
  goto *TPC++;      // transfer to next instruction

SW:    // store word instruction
  instr = MEM[PC/4];
  PC = PC + 4;
  C  = instr & 0xFFFF;
  Rx = (instr >> 21) & 0x1F;
  Ry = (instr >> 16) & 0x1F;
  MEM[(REG[Ry]+C)/4] = REG[Rx];
  goto *TPC++;      // transfer to next instruction
...
Fig. 3 Threaded Code Interpreter for the MIPS Architecture.
Threaded code is an excellent implementation strategy for a virtual machine, such as the JVM. In this situation, instructions can be identified in advance and translated to entries in the TCODE array. The instruction operands can also be interspersed between the label addresses, so that all information needed by the code which implements an instruction can be extracted from the TCODE array. The PC variable is then redundant and can be eliminated; it is replaced by the TPC variable. In effect, the virtual machine code has been translated into a less compact format which is much more amenable to execution by an interpreter. For a real machine, one usually cannot perform such a simple translation. In particular, it may be impossible to identify which locations in the guest computer's memory hold instructions and which hold data. Even if the guest computer's address space has been divided into a text region (which holds instructions) and a data region, it is still quite possible that the text region holds some data (especially read-only data). If the guest code is self-modifying, there is an even bigger problem, because the TCODE array may lose its correspondence with the instructions in the guest computer's memory.
2.3 Interpreter Optimizations

Particular combinations of instructions tend to recur in assembler code. An interpreter can take advantage of this by creating new instructions which combine frequently occurring n-tuples of instructions into single superinstructions, also known as superoperators [47]. If the n-tuples have a high static frequency of occurrence, the memory footprint of the program is reduced. If they have a high dynamic frequency, the interpretation speed is increased. Ideally, we can find n-tuples which have both high static and high dynamic frequencies. The improved interpretation speed is obtained because (a) there are fewer fetch-decode-execute cycles being performed, and (b) the combined actions of the opcodes involved in the n-tuple can be simplified. For example, if the two MIPS instructions

addi $3,$2,4
add  $3,$3,$3

are combined into a single instruction NewOp1 $3,$2,4, then the intermediate update of register 3 can be eliminated as unnecessary. If a threaded code interpreter is in use, there is effectively no limit to the number of superoperator combinations which can be supported. There is a trade-off between the memory footprint of the interpreter and the interpretation speed.
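In the threaded-code style of Figure 3, a handler for this fused pair might look as follows; the NEWOP1 label is illustrative, the operands are hard-wired for brevity, and tmp is an interpreter-local temporary.

NEWOP1:  // fused: addi $3,$2,4 ; add $3,$3,$3
  // One fetch-decode step replaces two, and the intermediate
  // write of register 3 is folded away. A real handler would
  // decode its operands from TCODE as in Figure 3.
  PC = PC + 8;                      // consumes both instructions
  tmp = IntegerAdd(REG[2], 4);      // the addi part
  REG[3] = IntegerAdd(tmp, tmp);    // the add part: r3 = r3 + r3
  goto *TPC++;                      // transfer to next instruction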
3 Compiled Simulation

Interpretive simulation is quite slow: unnecessary operations like instruction decoding and all the details of instruction fetch are repeated again and again. These operations can be optimized away with compiled simulation. The idea of compiled simulation is to take an application program (usually represented in assembly language source form or as a binary file) and generate a specialized simulator which executes just this one application. This technique is due to Mills et al. [38].
3.1 Basic Principles

All compiled simulators are based on the following principle; see Figure 4 for an overview. The binary or the assembly language source of the application to be simulated is read by the simulator generator. The generator then translates the application to a high-level language, such as C or C++. The generated source code is translated to machine code by an optimizing high-level language compiler of the host computer and linked against a generic simulator library. The resulting program is a simulator which emulates the behavior of the original application when executed. In addition to the optimizations available to the host's high-level language compiler, optimizations may be applicable during the source code generation by the simulator generator; e.g., the generator may monitor execution modes of the processor and thus omit certain simulation overhead.
Fig. 4 Simulator generation for compiled simulation.
Compiled simulators are distinguished by the way in which sequences of instructions are translated to high-level language snippets. Mills [38] translates the whole application, available in assembly language source, to a single function in the C programming language. Each assembly language instruction is represented by a C macro which emulates the behavior of that particular instruction. The macro uses the original address of the instruction as a label of a C switch statement and is followed by C statements which emulate the instruction. The macros

#define SUB(addr)     case (addr): AC -= *SP++;
#define BNE(addr, to) \
        case (addr): if (AC) { PC = (to); break; }

define a subtraction and a branch instruction of an accumulator architecture. The variable PC corresponds to the value of the program counter of the emulated architecture. The macros are used in a giant C switch statement which represents the whole instruction memory of the emulated architecture (see Figure 5). The program counter has to be set and evaluated in the switch statement only when a branch instruction is emulated. The macros read like the original program.

PC = 0x1000;
while (isRunning) {
  switch (PC) {
    SUB(0x1000)
    BNE(0x1002, 0x1000)
    ...
  }
}
Fig. 5 Mills represents instructions as C macros within a switch statement, where the labels correspond to the instruction addresses.
Mills’ method works very well for medium-sized programs. The whole program is translated to a single large C function. The advantage is that this function can be compiled to very efficient machine code. Local variables of the function will be mapped to registers of the host architecture and local optimizations are applied. This advantage is lost, however, when the program becomes too large. The size of the function may exceed limits imposed by the compiler, or non-linear algorithms employed by the compiler may lead to drastically increased translation times. In either case, the approach of generating a single C function becomes impractical. Large programs are thus partitioned into aligned blocks of instructions with a size that is a power of two [4], where each block is assigned a dedicated simulation function. Due to the use of multiple simulation functions, most of the processor's state has to be stored in global variables, and this interferes with optimizations that would otherwise be performed by the C compiler. Figure 6 shows the generated C code for a partitioned compiled simulator. For large programs it is beneficial to translate basic blocks. These are linear sequences of instructions with at most one branch at the end. Each basic block is translated to a C function. During simulation, each C function returns the address of the next function to be executed. The sequence of function calls is managed by a generic simulation loop. This approach works particularly well for functional simulators and architectures where the simulation of instructions does not cross the boundaries of the simulation functions.
PC = 0x1000;
while (isRunning) {
    switch (PC >> 10) {
        case 0: iblock0(); break;
        case 1: iblock1(); break;
        case 2: iblock2(); break;
        ...
    }
}

void iblock0() {
    while (isRunning) {
        switch (PC) {
            SUB(0x1000)
            BNE(0x1002, 0x1000)
            ...
            default: return;
        }
    }
}
Fig. 6 Large simulation functions are avoided by partitioning of the address space of the simulated program.
Compiled simulation is generally very similar to reverse compilation. However, reverse compilation becomes more difficult when the architecture supports delay slots for branch and memory access instructions. An example is the C6x series of DSP processors from Texas Instruments. Bermudo et al. [8] show that control flow graph reconstruction can be simplified by duplication of basic blocks.
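The basic-block-per-function scheme described above might be organized as in the following sketch; the block addresses, the accumulator machine, and the lookup function are invented for illustration.

#include <stdbool.h>
#include <stdint.h>

typedef uint32_t addr_t;
typedef addr_t (*BlockFn)(void);

static int32_t AC;                  /* processor state kept in global variables */
static bool isRunning = true;

/* Generated for the basic block starting at 0x1000. */
static addr_t bb_1000(void) {
    AC -= 1;                        /* ...straight-line instructions... */
    return AC ? 0x1000 : 0x1004;    /* the final branch selects the successor */
}

/* Generated for the basic block starting at 0x1004. */
static addr_t bb_1004(void) {
    isRunning = false;
    return 0;
}

/* A real simulator would use a table indexed by the block address. */
static BlockFn lookup(addr_t pc) {
    return pc == 0x1000 ? bb_1000 : bb_1004;
}

void simulate(void) {
    addr_t pc = 0x1000;             /* the generic simulation loop */
    while (isRunning)
        pc = lookup(pc)();
}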
3.2 Simulation of Pipelined Processors

The problem with compiled simulation for pipelined processors is that the execution of instructions is partitioned into subtasks which are executed during different cycles. Each of these subtasks must be scheduled statically to give correct behavior and high optimization potential. This is simple when the processor is an in-order architecture and the instructions do not cross basic block boundaries. When instructions cross basic block boundaries, duplication of basic blocks is necessary. Take, for example, the basic blocks of the control flow in Figure 7(a). Basic block 3 has only basic block 1 as predecessor. Therefore, instructions started in basic block 1 can be extended into basic block 3. Basic block 4 has to be duplicated, as depicted in Figure 7(b), because instructions started in basic blocks 1 and 2 can both be continued in basic block 4.
(a) Original control flow.
(b) Control flow with duplicated blocks.
Fig. 7 Simulation of processors with long pipelines across basic block boundaries using block duplication.
Basic block determination can be impossible when the program contains computed indirect branches. To handle this situation, it is necessary to combine the compiled simulator with an interpretive simulator. Whenever a branch target is computed which was not known to be the start of a basic block, the simulator switches into interpretive mode. Interpretation continues until the end of the basic block is reached. The interpretive mode can also be used to support debugging features, such as single stepping.
3.3 Static Binary Translation

Static binary translation is similar to compiled simulation, and it was in use even before compiled simulation [40][57]. The only difference is that instead of generating a simulator in a high-level language, machine code for the host architecture is generated directly. The advantage is that no code size restrictions or constraints on the organization of the machine code exist. The removal of these constraints usually leads to faster simulation. Additionally, the translation time is reduced, since the generation and parsing of the high-level language version of the simulator is eliminated. A disadvantage is that all optimizations which are usually done by the high-level language compiler have to be carried out by the binary translator. The use of a compiler framework greatly reduces the development time of a simulator based on static binary translation.
4 Dynamically Compiled Simulation

In the case of instruction set simulation based on dynamic binary translation, code generation is performed on-the-fly at simulation time. This approach was first used for execution profiling in the Shade system [12]. The advantage is that more information, such as the targets of indirect branches, is available, and that it is possible to handle self-modifying code. A disadvantage is that the translation is repeated for every simulation run, and the overall simulation time has to include that translation time.
4.1 Basic Simulation Principles

To keep the translation overhead small, simulation usually starts in interpretive mode. During interpretation, information on basic block execution frequencies is collected. When a predetermined execution threshold for a basic block is exceeded, this basic block is translated to machine code. Some systems use traces as the basic unit of compilation. A trace is a cycle-free linear sequence of instructions, possibly containing multiple branches into and out of the trace. The translated code blocks or translated traces are typically kept in a translation cache. If, during the simulation, a matching code fragment is found in the translation cache, the simulator jumps to the cached machine code and executes it; otherwise simulation continues using interpretation. When the end of the translated code is reached, control is returned to the simulator to resume the interpretive mode of execution. To avoid unnecessary switches back into interpretive mode, branches in already translated code are updated whenever a new block is translated. In the case of self-modifying code or dynamic code loading/generation within the simulated program, the translation cache has to be invalidated accordingly. Detecting memory updates that overlap with translated code regions may cause significant overhead.
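A sketch of the control loop just described follows; the cache operations, the hotness counter, and the threshold value are schematic stand-ins rather than the interface of any specific simulator.

#include <stdint.h>

#define HOT_THRESHOLD 50                         /* hypothetical compilation threshold */

typedef uint32_t addr_t;
typedef addr_t (*NativeBlock)(void);             /* a translated block returns the next PC */

/* Stand-ins for the real machinery (all hypothetical): */
NativeBlock cache_lookup(addr_t pc);             /* probe the translation cache */
void cache_insert(addr_t pc, NativeBlock code);  /* register a translated block */
NativeBlock translate(addr_t pc);                /* compile one basic block to native code */
addr_t interpret_block(addr_t pc);               /* interpret up to the end of the block */
static unsigned exec_count[1 << 16];             /* per-block execution counters (hashed) */

addr_t simulate_block(addr_t pc) {
    NativeBlock code = cache_lookup(pc);
    if (code)
        return code();                           /* jump into cached machine code */
    if (++exec_count[pc & 0xffff] > HOT_THRESHOLD)
        cache_insert(pc, translate(pc));         /* hot: translate for future visits */
    return interpret_block(pc);                  /* otherwise stay in interpretive mode */
}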
4.2 Optimizations

Since compilation time is included in the simulation time, only a few important optimizations are performed during dynamic code generation. The most important is register allocation. Registers and other hardware resources of the simulated architecture are usually stored in memory. At the beginning of a basic block or a trace, the state of the simulated processor is partially loaded into registers. Similarly, modified values are saved upon leaving the compiled code. The register allocation strategy typically follows some simple heuristics. For example, certain registers of the simulated processor may be mapped to registers of the host computer without performing any analysis of the code, in order to reduce compilation overhead.
Processor simulation often involves the update of status registers or other information that is later overwritten by subsequent simulated instructions without any intervening uses. Liveness analysis can detect whether such computed values are actually used by a subsequent instruction. For basic blocks and traces, this analysis is very fast. The gathered information may be used to avoid useless computations and can lead to significant performance improvements. Better code can be generated when regions of code which contain loops are optimized and translated to machine code. However, due to the more complex control flow, optimizing these regions requires more time. Therefore, only the most frequently executed code regions are considered for this translation method. For example, a mixed-mode simulator can be used with three different levels of optimization: seldom-executed code is interpreted, moderately executed code is optimized at the basic block level, and very frequently executed code is optimized at the region level [6].
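As a sketch of how this liveness check might look for status flags, a single backward pass over a block can mark which flag updates are actually observable; the instruction record is hypothetical.

#include <stdbool.h>

typedef struct {
    bool writes_flags, reads_flags;  /* decoded properties of the instruction */
    bool emit_flags;                 /* result: generate flag-update code? */
} Insn;

/* Backward liveness pass over one basic block: a flag update is dead if the
   flags are overwritten before being read. Be conservative at the block exit. */
void mark_live_flag_updates(Insn *block, int n) {
    bool flags_live = true;
    for (int i = n - 1; i >= 0; i--) {
        block[i].emit_flags = block[i].writes_flags && flags_live;
        if (block[i].writes_flags) flags_live = false;  /* kill */
        if (block[i].reads_flags)  flags_live = true;   /* gen  */
    }
}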
5 Hardware Supported Simulation

Although today's workstation computers provide tremendous performance, the speed of an instruction set simulator may be slow and often unsatisfactory. It is also likely that the simulation overhead for complex Systems-on-Chip (SoC) and multi-core devices will further increase in the future, and thus widen the gap between simulation time and the actual execution time on a real system. Even though instruction set simulation inherently offers a high degree of parallelism, this parallelism cannot be exploited due to communication and synchronization overhead. Dedicated hardware provides a promising approach to overcome this difficulty by leveraging fine-grained parallelism. This section is primarily focused on simulation platforms that use Field Programmable Gate Arrays (FPGAs) in order to connect the simulator to external hardware, or to perform simulation tasks within the FPGA itself. Other approaches that rely on custom ASIC designs [19][28][18] also show impressive results, but appear to be less applicable as general solutions.
5.1 Interfacing Simulators with Hardware

In the simplest case, instruction set simulation is performed entirely in software by a workstation computer, while other devices are realized using FPGAs or real hardware, as depicted in Figure 8. This approach is particularly suited to system integration and verification, but usually offers little or no improvement in terms of simulation speed. Keeping the hardware and software models synchronized is a major problem in such systems. There are problems both with simulation performance and with correctness and accuracy of the performance measurements.
Fig. 8 External hardware can interface with an instruction set simulator in software by driving the hardware clock from within the simulator. The transmission of I/O data is effectively synchronized and can be controlled entirely in software.
A typical solution is to control the hardware clock from within the software simulation [58][45], e.g., by toggling the value of an I/O pin. This allows the models to be kept in perfect synchronization on a cycle-by-cycle basis. However, depending on the interconnect, accessing the I/O pin might be costly. For example, initiating a bus transfer on every simulated cycle is likely to dominate the overall execution time on the simulator's host machine, as bus interfaces usually operate only at a fraction of the clock frequency of modern microprocessors. Transaction-based synchronization [35][56] can be used to reduce the number of synchronization operations, but also impacts the accuracy of the simulation results.
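A minimal sketch of the pin-toggling scheme, assuming a memory-mapped clock register; the register address and the bus-access routine are invented for illustration.

#include <stdint.h>

#define CLK_PIN_ADDR 0x40000000u                 /* hypothetical memory-mapped I/O pin */

void mmio_write(uint32_t addr, uint32_t value);  /* bus access to the hardware (assumed) */
void simulate_one_cycle(void);                   /* the software model steps one cycle */

void run_cycles(unsigned n) {
    while (n--) {
        simulate_one_cycle();        /* software model computes the next state */
        mmio_write(CLK_PIN_ADDR, 1); /* rising edge: the external hardware evaluates */
        mmio_write(CLK_PIN_ADDR, 0); /* falling edge: ready for the next cycle */
    }
}

Each simulated cycle costs two bus transfers here, which is exactly the overhead that transaction-based synchronization tries to amortize.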
5.2 Simulation in Hardware

Integrating the simulation tools with external devices that are implemented in hardware is a valuable approach, but does not improve the performance of the instruction set simulation itself. However, since a tight coupling between the software model and the external hardware has already been established, it is now possible to implement parts of the instruction set simulator within the hardware [60], for example by using an FPGA. This approach is especially interesting for architectures that are highly parallel and contain features that are hard to handle in software. Digital signal processors typically fall into this category, and may thus draw significant benefit from hardware supported simulation. In contrast to prototyping [10], the primary goal is not to implement the architecture design using an FPGA, but to implement a hardware structure that models the behavior at a certain level of granularity, e.g., on the cycle or transaction level.
Fig. 9 Simulation can be performed in part using software techniques and hardware components. An efficient partitioning of such a simulation system is to perform the complex timing emulation within an FPGA, while a purely functional simulation is executed on the host computer.
For example, executing a single cycle of the modeled architecture might be realized using several cycles in the FPGA [25]. The number of cycles in the FPGA might also vary depending on the complexity of the currently active instructions. This also implies that signals, registers, functional units, and other hardware components of the original architecture are not necessarily represented in the simulation model as such. The designer can thus choose to implement the simulation model using software techniques, a basic hardware implementation, or highly optimized hardware structures in order to find a suitable trade-off between simulation speed, design complexity, and flexibility. The observation that functional simulation usually can be performed efficiently on modern workstation computers, while the modeling of dynamic behavior and timing is usually overly complex and slow, led to the approach adopted in several simulation systems [15][25][48]. The simulation is split into two separate parts that operate in parallel: the functional model and the timing model. In the case of trace-driven simulation, sometimes also referred to as functional-first simulation, the simulation process is controlled by the functional model. On the other hand, in timing-directed simulation, also called timing-first simulation, the timing model controls and drives progress. In the former case, the functional model executes the program without considering dynamic effects, such as variable latency instructions, interlocking, branch prediction, or cache and memory delays. If, at some point, the timing model detects a mismatch between the instruction stream executed by the functional model and the instructions executed on a real system, e.g., due to a mispredicted branch, the simulation has to be reverted and restarted in order to correct the mismatch. In timing-directed systems, the timing model has control and triggers
the execution of the functional model at the appropriate times. Both approaches share the advantage that the functional model can be implemented using software techniques, and is thus more flexible and easier to extend than a hardware model. However, there is no reason why the functional part could not execute within the FPGA [48]. In addition, reuse is enabled by combining different timing models with the same functional model [48]. Another approach is to divide the simulation tasks between hardware and software depending on execution frequency and implementation complexity. Certain simulation tasks are selectively promoted to hardware if they dominate the performance of the system and if their implementation in hardware is cheap in terms of implementation effort and hardware resources.
6 Generation of Instruction Set Simulators

Designing and developing an instruction set simulator for a new architecture or processor design is a cumbersome and error-prone task. Especially during the early stages of the design process, the instruction set and its implementation may change frequently and thus complicate the task of simulator development. Processor description languages [42][32] provide a promising approach in which the instruction set and implementation details of a new design are specified using a formal language. These languages provide helpful abstractions, leading to processor models that are easy to maintain and extend, and thus reduce development time and cost. Furthermore, these languages allow for design space exploration, a technique for exploring different processor designs in order to find the best solution for a particular task. The resulting Application-Specific Instruction-set Processors (ASIPs) combine superior performance and reduced power consumption while still providing a certain degree of flexibility in comparison to pure hardware solutions.
6.1 Processor Description Languages

The development of a specialized DSP architecture is typically constrained by development cost, time to market, and the production cost per unit. Efficient and accurate simulation tools play a central role in exploring and evaluating design alternatives. In addition, these tools allow software development to start during the early stages of the design process. This not only leads to a shorter development cycle, but also lets developers collect data and perform various measurements that are crucial for tuning the design to the requirements of the application domain in question. However, developing a new instruction set simulator, or even customizing existing tools, is an error-prone and time-consuming task. Extensible and maintainable models of the processor design are thus desperately needed.
Processor description languages and related tools provide a promising solution to this problem. In recent years, these languages have evolved from pure research projects into products that are actively used and successfully adopted by leading companies in order to develop and design highly specialized and tuned architectures [30]. A processor model typically consists of several layers that include information on the hardware organization, the instruction set, instruction semantics, and timing. Meta information such as assembly syntax, Application Binary Interface (ABI) conventions, etc., can be specified in most languages. Structuring the models, for example by instruction classes, enables code reuse across different instructions and leads to concise and compact models. In this way, adding new instructions or adapting existing instructions is often only a matter of a few lines of code.

Processor description languages, sometimes also referred to as Architecture Description Languages, can roughly be categorized into three distinct groups:

1. Structural languages offer primitives that directly match abstractions typically found in hardware design. The processor model thus closely resembles the structure of the original hardware design.
2. Behavioral languages, on the other hand, primarily focus on the instruction set architecture and typically provide some means to structure and order instruction variants and meta information such as assembly syntax, binary encoding, and abstract instruction semantics.
3. Mixed approaches combine the structural and behavioral view of the processor and provide mechanisms to model both the hardware structure and the instruction set architecture. Usually these languages also provide a mechanism to specify a mapping between the behavioral and the structural description that allows one to relate the two models to each other.

The information that is available in these processor specifications can be used to (semi-)automatically generate software tools that are customized for the particular processor. This includes standard development tools such as a compiler, an assembler, and a linker. However, it also includes instruction set simulators and hardware models in a general purpose Hardware Description Language (HDL), such as VHDL or Verilog. The various flavors of languages are usually geared towards a particular application. Behavioral approaches, for example, typically support the generation of compiler back-ends and related development tools, such as assemblers and linkers, very well. However, they lack information needed for accurate simulation and automatic generation of hardware models. Structural languages are usually well suited for these low-level tasks, but in turn lack abstract semantic models of instructions. Generating a compiler is thus more complicated in these systems. To overcome such limitations, many systems have progressively adopted features and ideas from the other style. Most contemporary languages thus follow the mixed approach and support both application fields equally well. Figure 10 depicts an example workflow for the design of a new architecture using a processor description language. The designers iteratively extend and adapt the architecture model in order to reach the best possible solution. During each iteration step, a series of simulation runs is performed to evaluate the modification and expose
bottlenecks that have to be dealt with during the next iteration. This process is often referred to as Design Space Exploration (DSE). Although DSE can be applied in a more general setting even without the use of processor description languages, these languages effectively reduce the time spent on exploring and evaluating design alternatives.
Fig. 10 During design space exploration the processor model, specified using a processor description language, is iteratively adapted in order to achieve the best possible performance, power, and area trade-off for a given DSP application.
6.2 Retargetable Simulation

A key component that comes with virtually every processor description language is a simulation framework that can be customized and adapted quickly and easily to a new architecture model. This is usually achieved by generating the processor-specific portions of the simulator, e.g., a software instruction decoder and appropriate simulation functions that model the behavior of individual instructions. The simulation techniques employed by these retargetable simulation engines do not differ from those found in hand-coded simulators, and range from simple functional interpreters through sophisticated binary translators to highly precise event-driven simulators. However, these engines potentially have to cope with a large range of different architectures; for the sake of generality, some optimizations found in hand-written simulators are therefore not possible in these engines. It is thus not uncommon for retargetable simulation engines to restrict the range of supported architectures to certain architecture styles. In-order pipelined architectures, for example, are typically well supported, while less attention is typically paid to dynamic features of modern superscalar processors. It is important to note that the execution model of the processor description language and the techniques used to implement the execution model within the retargetable simulation framework may impact the accuracy of the simulation results. Designers thus need to evaluate whether the processor description language and its accompanying tools are suited for the particular task at hand.
As with hand-coded frameworks, the simplest form of retargetable simulation is interpretation. The semantic information that is available in the processor models is grouped into simple simulation functions that are invoked from within the interpreter loop. Each function performs an atomic simulation step that is indivisible for external observers during simulation. In the case of functional simulation, a single function is generated that captures the behavior of the instruction completely. For transaction-based or even cycle-accurate simulation, multiple simulation functions are emitted. The simple interpreter loop is usually also extended using state machines that control which functions need to be invoked. This is particularly important for handling various hazards, interlocks, and stalls that may emerge during the simulation. These techniques allow a very detailed simulation of the dynamic behavior of a processor; the generated simulators thus achieve very accurate results. However, simulation speed is usually not satisfactory. Retargetable simulators based on compiled simulation and dynamic binary translation can be used to circumvent the problem of unsatisfactory speed. In a similar manner to the interpreter approach, simulation functions are generated. However, these functions do not emulate the behavior of the instruction itself but instead generate code that is later executed. In the case of static compiled simulation, these functions emit code fragments in a high-level language, such as C, that precisely emulate the instruction's behavior. In the case of dynamic compilation and binary translation, machine code fragments or fragments of an intermediate representation that can be compiled to machine code are emitted. Integrating retargetable simulation engines with other simulation tools and external components is often a major problem. This is in part caused by the very nature of these engines. The generic, architecture-independent parts are incomplete without the generated portions of the simulator. Extending them is complex and often impossible without considering the processor model. The generated portions, on the other hand, cannot be modified directly without losing the advantage of retargetability. Leveraging existing general-purpose modeling frameworks such as SystemC [62] is a viable option that allows interfacing with a large number of existing models that come with these frameworks. Some processor description languages take this idea even a step further and do not supply a specialized language to specify the instruction semantics. Instead, a general purpose modeling language can be used, allowing the processor model to directly interact with external components and models.
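The difference between the two kinds of generated simulation functions can be sketched as follows; the ADD instruction, the CPU state layout, and the function names are invented for illustration.

#include <stdio.h>
#include <stdint.h>

typedef struct { int32_t R[32]; } CpuState;

/* Interpretive engine: the generated function emulates the instruction. */
void sim_ADD(CpuState *s, int rd, int rs, int rt) {
    s->R[rd] = s->R[rs] + s->R[rt];
}

/* Compiled-simulation engine: the generated function does not execute the
   add; it emits a C fragment that will emulate it when compiled later. */
void emit_ADD(FILE *out, int rd, int rs, int rt) {
    fprintf(out, "    s->R[%d] = s->R[%d] + s->R[%d];\n", rd, rs, rt);
}

In a dynamic binary translation engine, a function in the role of emit_ADD would instead append host machine code or intermediate-representation fragments to a buffer rather than printing C text.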
7 Related Work

Interpretive Simulation

Klint describes the classic implementation of an ISA interpreter [36]. In 1990, James Larus made SPIM available [37]. SPIM is an interpretive simulator for the MIPS architecture which is commonly used for teaching. SPIM takes assembly language programs as input, which are stored in an intermediate representation. Magnusson
shows how instruction cache simulation and execution profiling can be done efficiently with a threaded-code interpreter [39]. Processors with out-of-order execution are usually simulated using interpretive simulation, since static modeling of the dynamic behavior of the processor is nearly impossible. A famous interpretive simulator for such processors is SimpleScalar [2].

Compiled Simulation

Mills et al. were the first to use compiled simulation [38]. They presented a simple and efficient method to translate medium-sized applications into a fast compiled simulator. Their technique inspired all the later work. Reshadi et al. compile only the instruction decoding part to achieve the flexibility of an interpretive simulator while reaching a simulation speed which is close to compiled simulation [54]. SyntSim is a generator for functional simulators [7]. Using profile information, parts of the executable are compiled. The performance is between 2 and 16 times slower than native code. D'Errico et al. generate interpretive as well as static and dynamic compiled simulators from a simulator specification [23]. The dynamic simulator generates C code at run time which is compiled by GCC and dynamically loaded. Translation of aligned code pages is done identically for the static and the dynamic simulator.

Dynamically Compiled Simulation

SimOS [53] is a full-system simulator that offers various simulators including Embra [65], a fast dynamic translator. Embra focuses on the efficient simulation of the memory hierarchy, in particular the efficient simulation of the memory address translation and memory protection mechanisms, as well as caches. Embra follows a compile-only approach, i.e., simulated instructions are always translated to native code and then executed. The translated code fragments for basic blocks are stored in a translation cache to speed up the look-up of native code blocks during the simulation. The basic simulator can be extended and adapted using customized translations. For example, different cache configurations and coherency protocols are realized using these customized translations. Shade [12] is another dynamically compiling simulator that aims primarily at fast execution tracing. It offers a rich interface to trace and process events during simulation. Similar to the customized translations in Embra, Shade allows user-defined as well as pre-defined code to collect trace data. Trace collection is controlled by analyzers that specify whether information should be considered during tracing on a per-opcode or per-instruction basis. The tracing level, i.e., the amount of data collected during simulation, can be varied at runtime. In this way, only critical portions of the program execution need be executed with full tracing.
Ebcioğlu et al. present sophisticated code generation techniques to efficiently generate code for parallel processors at runtime [21][22]. Several code generator optimizations are presented and combined with statistics collected at runtime. For example, instructions of the target processor are initially interpreted, and compilation of tree regions is only triggered for hot paths of the simulated program. Instruction-level parallelism is further improved by aggressive instruction scheduling techniques based on these tree regions. The simulation of synchronous exceptions, i.e., irregular control flow due to page faults, division by zero, etc., is optimized by code motion techniques [27]. However, only the behavior of the target processor is modeled. The simulation includes neither a timing model nor a model of the target processor's memory organization. Jones et al. use Large Translation Units (LTUs) [33] to speed up the simulation of the EmCore processor, which is compatible with the ARC architecture. Several basic blocks are translated at once within an LTU, leading to reduced compilation overhead and improved simulation speed. The simulator supports two operation modes: (1) a fast cycle-approximate functional simulation, and (2) a cycle-accurate timing model. Similar to typical static compiled simulation, code generation is performed using the C programming language in combination with a standard C compiler. Shared libraries are generated from the C source code and dynamically loaded at runtime. The generated code is stored permanently and can thus be reused across several simulation runs to improve the simulation speed further.

Retargetable Simulation

Several retargetable simulation frameworks are publicly available from open source projects and universities; others are offered by commercial software vendors. SimpleScalar is a retargetable simulator based on interpretation [2]. New processors are described using DEF files, which contain a set of C macro definitions and plain C code. A similar environment is offered by the commercial Simics simulator [41], but instead of plain C code a specialized Device Modeling Language (DML) is offered. The Liberty Simulation Environment (LSE) [64] supports very detailed event-based simulation. Hardware structures are described using the Liberty Structural Specification Language (LSS) [63]. Another approach is taken by the Unisim project,1 which relies on the SystemC language and its tools [62] to provide an execution environment for the simulation of processors [1]. The project offers several components, such as memories, caches, buses, and even complete processors, with well defined interfaces that can quickly be integrated with other components. The listed simulation environments offer very powerful support for implementing efficient and accurate simulation tools. However, the focus of these systems is on simulation only. For design space exploration of application-specific processors, an efficient simulation engine alone is not enough. Processor description languages usually support the automatic generation of all the tools needed for DSE.
1 http://unisim.org/site/
A recent book by Mishra and Dutt [42] provides an excellent introduction to this topic and covers most of the contemporary languages. LISA by CoWare Inc. is one of the most successful processor description languages. The system provides powerful tools to automatically derive an efficient simulator based on compiled simulation [46] or interpretation from a processor specification. Nohl et al. propose the JIT-CC approach to speed up the simulation of LISA models [43] by caching information on previously decoded instructions. The LISA simulators can also be enhanced with hybrid simulation so that irrelevant parts of the program can be fast-forwarded using native execution [29]. Time-consuming accurate simulation is performed only where required. A similarly mature processor modeling environment is offered by Target Compiler Technologies. Based on the nML processor description language [26], a complete tool chain for the modeled processor can be derived [30], including the retargetable simulator CHECKERS. Both languages, LISA and nML, have successfully been used during the design of commercial processors in the digital signal processing domain, e.g., the CoolFlux DSP [50]. The EXPRESSION language [31] has been developed at the University of California, Irvine.2 Efficient simulators based on compiled simulation [54][52] and interpretive simulation can be derived from EXPRESSION models. The interpreter is based on Reduced Colored Petri Nets (RCPNs) and also supports the efficient simulation of out-of-order execution [51]. The system also supports a hybrid form of static and dynamic compiled simulation [55] in order to reduce the compilation overhead at runtime. Like the Unisim framework, the ArchC processor description language3 is based on SystemC [3]. It follows the SystemC TLM standard (described in more detail in another chapter of this book). It is thus possible to integrate the derived simulators with external SystemC modules. Farfeleder et al. present a very efficient compiled simulator based on a processor description language. Individual basic blocks of the simulated program are compiled to C functions [34][24]. A technique called basic block duplication is used for efficient simulation of pipelined architectures. A similar approach is taken in [6]. Instead of static compiled simulation via C code, machine code is generated directly using the LLVM compiler infrastructure.4 In addition to simple basic blocks, hot regions of code are recognized and recompiled using more aggressive optimizations. These regions may contain arbitrary control flow, including loops. The system is based on a structural processor description language and supports the efficient simulation of external interrupts using a rollback mechanism [9].
2 http://www.ics.uci.edu/~express/
3 http://archc.sourceforge.net/
4 http://www.llvm.org/
Hardware Supported Simulation

IBM has pioneered the use of dedicated hardware resources to speed up binary translation. The DAISY processor [19] and, later, BOA [59] were optimized to support the efficient execution of PowerPC programs. The VLIW processors offered dedicated registers to hold the processor state of the simulated CPU, additional shadow registers to support rollbacks if speculative execution fails, and extensions to support the efficient processing of external interrupts. Transmeta [18] has adopted this approach to simulate x86 processors. Both approaches are strictly functional in that timing is not considered. Schnerr et al. [56] use an off-the-shelf TI C6x VLIW processor to improve the simulation speed of embedded SoC systems. Cycle-accurate instruction set simulation is performed very efficiently using the VLIW processor. The simulation interfaces to the hardware components of the SoC using a synchronization protocol, either on a per-cycle basis or based on transactions [58]. A similar synchronization scheme was proposed by Nakamura et al.: a processor simulation in software is synchronized with hardware components in an FPGA using shared registers [45]. In addition, a processor emulation platform based on FPGAs is available [44]. The ReSim simulation engine [25] is based on traces. Traces are collected using a simple simulator, e.g., a functional SimpleScalar model, and later processed by a dedicated timing simulator implemented in an FPGA. This splitting lowers the complexity of the FPGA implementation and improves speed, and the traces can easily be replayed using different hardware configurations. Splitting the simulation task into a functional and a timing partition was first proposed by Chiou et al. for the FAST simulator [17][15][16]. In contrast to ReSim, the simulation is not based on traces; instead, the functional simulator runs in parallel, either on a host computer or on a general purpose processor that is integrated with the FPGA. In both cases the functional and the timing partition are tightly coupled. For example, the timing partition informs the functional simulator in the case of mispredictions. A different splitting of the simulation tasks is used by the ProtoFlex environment [11][13][14], which mainly targets the efficient simulation of multi-processor systems. Individual processor components and instructions are promoted to the more efficient FPGA emulation, depending on execution frequency and complexity trade-offs. The FPGA resources are shared among the individual emulated processors. The ASim simulation framework [20] was also extended to support partial execution in FPGAs. HASim [49][48] supports implementation of individual modules of a processor using either software or an FPGA.
References

1. August, D., Chang, J., Girbal, S., Gracia-Perez, D., Mouchard, G., Penry, D.A., Temam, O., Vachharajani, N.: An open simulation environment and library for complex architecture design and collaborative development, IEEE Computer Architecture Letters, 6(2):45–48, (2007)
2. Austin, T., Larson, E., Ernst, D.: SimpleScalar: An infrastructure for computer system modeling, Computer, 35(2):59–67, (2002)
3. Azevedo, R., Rigo, S., Bartholomeu, M., Araujo, G., Araujo, C., Barros, E.: The ArchC architecture description language and tools, Int. J. Parallel Program., 33(5):453–484, (2005)
4. Bartholomeu, M., Azevedo, R., Rigo, S., Araujo, G.: Optimizations for compiled simulation using instruction type information, Symposium on Computer Architecture and High Performance Computing (SBAC-PAD 2004), pages 74–81, (2004)
5. Bell, J.R.: Threaded code, Commun. ACM, 16(6):370–372, (1973)
6. Brandner, F., Fellnhofer, A., Krall, A., Riegler, D.: Fast and accurate simulation using the LLVM compiler framework, RAPIDO ’09: 1st Workshop on Rapid Simulation and Performance Evaluation: Methods and Tools, (2009)
7. Burtscher, M., Ganusov, I.: Automatic synthesis of high-speed processor simulators, MICRO 37: Proceedings of the 37th Annual IEEE/ACM International Symposium on Microarchitecture, pages 55–66, (2004)
8. Bermudo, N., Horspool, R.N., Krall, A.: Control flow graph reconstruction for reverse compilation of assembly language programs with delayed instructions, SCAM ’05: Proceedings of the Fifth International Workshop on Source Code Analysis and Manipulation, pages 107–116, (2005)
9. Brandner, F.: Precise simulation of interrupts using a rollback mechanism, SCOPES ’09: Proceedings of the 12th International Workshop on Software and Compilers for Embedded Systems, pages 71–80, (2009)
10. Cofer, R.C., Harding, B.: Rapid System Prototyping with FPGAs: Accelerating the Design Process, Newnes, (2005)
11. Chung, E.S., Hoe, J.C., Falsafi, B.: ProtoFlex: Co-simulation for component-wise FPGA emulator development, WARFP ’06: Proceedings of the 2nd Workshop on Architecture Research using FPGA Platforms, (2006)
12. Cmelik, B., Keppel, D.: Shade: A fast instruction-set simulator for execution profiling, SIGMETRICS ’94: Proceedings of the 1994 ACM SIGMETRICS Conference on Measurement and Modeling of Computer Systems, pages 128–137, (1994)
13. Chung, E.S., Nurvitadhi, E., Hoe, J.C., Falsafi, B., Mai, K.: A complexity-effective architecture for accelerating full-system multiprocessor simulations using FPGAs, FPGA ’08: Proceedings of the 16th International ACM/SIGDA Symposium on Field Programmable Gate Arrays, pages 77–86, (2008)
14. Chung, E.S., Papamichael, M.K., Nurvitadhi, E., Hoe, J.C., Mai, K., Falsafi, B.: ProtoFlex: Towards scalable, full-system multiprocessor simulations using FPGAs, ACM Transactions on Reconfigurable Technology and Systems, 2(2):1–32, (2009)
15. Chiou, D., Sunwoo, D., Kim, J., Patil, N., Reinhart, W.H., Johnson, D.E., Xu, Z.: The FAST methodology for high-speed SoC/computer simulation, ICCAD ’07: Proceedings of the 2007 IEEE/ACM International Conference on Computer-Aided Design, pages 295–302, (2007)
16. Chiou, D., Sunwoo, D., Kim, J., Patil, N., Reinhart, W.H., Johnson, D.E., Keefe, J., Angepat, H.: FPGA-accelerated simulation technologies (FAST): Fast, full-system, cycle-accurate simulators, MICRO ’07: Proceedings of the 40th Annual IEEE/ACM International Symposium on Microarchitecture, pages 249–261, (2007)
17. Chiou, D., Sanjeliwala, H., Sunwoo, D., Xu, J.Z., Patil, N.: FPGA-based fast, cycle-accurate, full-system simulators, WARFP ’06: Proceedings of the 2nd Workshop on Architecture Research using FPGA Platforms, (2006)
18. Dehnert, J.C., Grant, B.K., Banning, J.P., Johnson, R., Kistler, T., Klaiber, A., Mattson, J.: The Transmeta Code Morphing™ software: Using speculation, recovery, and adaptive retranslation to address real-life challenges, CGO ’03: Proceedings of the International Symposium on Code Generation and Optimization, pages 15–24, (2003)
19. Ebcioğlu, K., Altman, E.R.: DAISY: Dynamic compilation for 100% architectural compatibility, ISCA ’97: Proceedings of the 24th International Symposium on Computer Architecture, pages 26–37, (1997)
20. Emer, J., Ahuja, P., Borch, E., Klauser, A., Luk, C.-K., Manne, S., Mukherjee, S.S., Patil, H., Wallace, S., Binkert, N., Espasa, R., Juan, T.: Asim: A performance model framework, Computer, 35(2):68–76, (2002)
21. Ebcioğlu, K., Altman, E.R., Gschwind, M., Sathaye, S.: Optimizations and oracle parallelism with dynamic translation, MICRO 32: Proceedings of the 32nd Annual ACM/IEEE International Symposium on Microarchitecture, pages 284–295, (1999)
22. Ebcioğlu, K., Altman, E., Gschwind, M., Sathaye, S.: Dynamic binary translation and optimization, IEEE Transactions on Computers, 50(6):529–548, (2001)
23. D'Errico, J., Qin, W.: Constructing portable compiled instruction-set simulators – an ADL-driven approach, DATE ’06: Proceedings of the Conference on Design, Automation and Test in Europe, pages 112–117, (2006)
24. Farfeleder, S., Krall, A., Horspool, R.N.: Ultra fast cycle-accurate compiled emulation of in-order pipelined architectures, EUROMICRO Journal of Systems Architecture, 53(8):501–510, (2007)
25. Fytraki, S., Pnevmatikatos, D.: ReSim, a trace-driven, reconfigurable ILP processor simulator, DATE ’09: Proceedings of Design, Automation and Test in Europe 2009, (2009)
26. Fauth, A., Van Praet, J., Freericks, M.: Describing instruction set processors using nML, EDTC ’95: Proceedings of the 1995 European Conference on Design and Test, pages 503–507, (1995)
27. Gschwind, M., Altman, E.: Optimization and precise exceptions in dynamic compilation, ACM SIGARCH Computer Architecture News, 29(1):66–74, (2001)
28. Gschwind, M., Altman, E.R., Sathaye, S., Ledak, P., Appenzeller, D.: Dynamic and transparent binary translation, Computer, 33(3):54–59, (2000)
29. Gao, L., Kraemer, S., Leupers, R., Ascheid, G., Meyr, H.: A fast and generic hybrid simulation approach using C virtual machine, CASES ’07: Proceedings of the 2007 International Conference on Compilers, Architecture, and Synthesis for Embedded Systems, pages 3–12, (2007)
30. Goossens, G., Lanneer, D., Geurts, W., Van Praet, J.: Design of ASIPs in multi-processor SoCs using the Chess/Checkers retargetable tool suite, International Symposium on System-on-Chip, pages 1–4, (2006)
31. Halambi, A., Grun, P., Ganesh, V., Khare, A., Dutt, N., Nicolau, A.: EXPRESSION: A language for architecture exploration through compiler/simulator retargetability, DATE ’99: Proceedings of the Conference on Design, Automation and Test in Europe, pages 485–490, (1999)
32. Ienne, P., Leupers, R.: Customizable Embedded Processors: Design Technologies and Applications (Systems on Silicon), Morgan Kaufmann Publishers Inc., San Francisco, CA, USA, (2006)
33. Jones, D., Topham, N.P.: High speed CPU simulation using LTU dynamic binary translation, HiPEAC ’09: Proceedings of the 4th International Conference on High Performance Embedded Architectures and Compilers, pages 50–64, (2009)
34. Krall, A., Farfeleder, S., Horspool, R.N.: Ultra fast cycle-accurate compiled emulation of in-order pipelined architectures, SAMOS ’05: Proceedings of the International Workshop on Systems, Architectures, Modeling, and Simulation, LNCS 3553, pages 222–231, (2005)
35. Kudlugi, M., Hassoun, S., Selvidge, C., Pryor, D.: A transaction-based unified simulation/emulation architecture for functional verification, DAC ’01: Proceedings of the 38th Conference on Design Automation, pages 623–628, (2001)
36. Klint, P.: Interpretation techniques, Software: Practice and Experience, 11(9):963–973, (1981)
37. Larus, J.: Assemblers, linkers and the SPIM simulator, in Patterson, D.A., Hennessy, J.L., editors, Computer Organization and Design: The Hardware/Software Interface, Morgan Kaufmann, (2005)
38. Mills, C., Ahalt, S.C., Fowler, J.: Compiled instruction set simulation, Software: Practice and Experience, 21(8):877–889, (1991)
39. Magnusson, P.S.: Efficient instruction cache simulation and execution profiling with a threaded-code interpreter, WSC ’97: Proceedings of the 29th Conference on Winter Simulation, pages 1093–1100, (1997)
40. May, C.: Mimic: A fast System/370 simulator, Symposium on Interpreters and Interpretive Techniques, pages 1–13, (1987)
41. Magnusson, P.S., Christensson, M., Eskilson, J., Forsgren, D., Hållberg, G., Högberg, J., Larsson, F., Moestedt, A., Werner, B.: Simics: A full system simulation platform, Computer, 35(2):50–58, (2002)
42. Mishra, P., Dutt, N.: Processor Description Languages, Volume 1, Morgan Kaufmann Publishers Inc., San Francisco, CA, USA, (2008)
43. Nohl, A., Braun, G., Schliebusch, O., Leupers, R., Meyr, H., Hoffmann, A.: A universal technique for fast and flexible instruction-set architecture simulation, DAC ’02: Proceedings of the 39th Conference on Design Automation, pages 22–27, (2002)
44. Nakamura, Y., Hosokawa, K.: Fast FPGA-emulation-based simulation environment for custom processors, IEICE Transactions on Fundamentals of Electronics, Communications and Computer Sciences, E89-A(12):3464–3470, (2006)
45. Nakamura, Y., Hosokawa, K., Kuroda, K., Yoshikawa, K., Yoshimura, T.: A fast hardware/software co-verification method for system-on-a-chip by using a C/C++ simulator and FPGA emulator with shared register communication, DAC ’04: Proceedings of the 41st Annual Conference on Design Automation, pages 299–304, (2004)
46. Pees, S., Hoffmann, A., Meyr, H.: Retargetable compiled simulation of embedded processors using a machine description language, ACM Transactions on Design Automation of Electronic Systems, 5(4):815–834, (2000)
47. Proebsting, T.A.: Optimizing an ANSI C interpreter with superoperators, POPL ’95: Proceedings of the 22nd ACM SIGPLAN-SIGACT Symposium on Principles of Programming Languages, pages 322–332, (1995)
48. Pellauer, M., Vijayaraghavan, M., Adler, M., Arvind, Emer, J.: Quick performance models quickly: Closely-coupled partitioned simulation on FPGAs, ISPASS ’08: IEEE International Symposium on Performance Analysis of Systems and Software, pages 1–10, (2008)
49. Pellauer, M., Vijayaraghavan, M., Adler, M., Arvind, Emer, J.: A-Ports: An efficient abstraction for cycle-accurate performance models on FPGAs, FPGA ’08: Proceedings of the 16th International ACM/SIGDA Symposium on Field Programmable Gate Arrays, pages 87–96, (2008)
50. Roeven, H., Coninx, J., Ade, M.: CoolFlux DSP: The embedded ultra low power C-programmable DSP core, GSPx ’04: International Signal Processing Conference, pages 1–7, (2004)
51. Reshadi, M., Dutt, N.: Generic pipelined processor modeling and high performance cycle-accurate simulator generation, DATE ’05: Proceedings of the Conference on Design, Automation and Test in Europe, pages 786–791, (2005)
52. Reshadi, M., Dutt, N., Mishra, P.: A retargetable framework for instruction-set architecture simulation, ACM Transactions on Embedded Computing Systems, 5(2):431–452, (2006)
53. Rosenblum, M., Herrod, S.A., Witchel, E., Gupta, A.: Complete computer system simulation: The SimOS approach, IEEE Parallel & Distributed Technology, 3(4):34–43, (1995)
54. Reshadi, M., Mishra, P., Dutt, N.: Instruction set compiled simulation: A technique for fast and flexible instruction set simulation, Proceedings of the 40th Conference on Design Automation, pages 758–763, (2003)
55. Reshadi, M., Mishra, P., Dutt, N.: Hybrid-compiled simulation: An efficient technique for instruction-set architecture simulation, ACM Transactions on Embedded Computing Systems, 8(3):1–27, (2009)
56. Schnerr, J., Bringmann, O., Rosenstiel, W.: Cycle accurate binary translation for simulation acceleration in rapid prototyping of SoCs, DATE ’05: Proceedings of the Conference on Design, Automation and Test in Europe, pages 792–797, (2005)
57. Sites, R.L., Chernoff, A., Kirk, M.B., Marks, M.P., Robinson, S.G.: Binary translation, Communications of the ACM, 36(2):69–81, (1993)
58. Schnerr, J., Haug, G., Rosenstiel, W.: Instruction set emulation for rapid prototyping of SoCs, DATE ’03: Proceedings of the Conference on Design, Automation and Test in Europe, pages 562–567, (2003)
59. Sathaye, S., Ledak, P., Leblanc, J., Kosonocky, S., Gschwind, M., Fritts, J., Bright, A., Altman, E., Agricola, C.: BOA: Targeting multi-gigahertz with binary translation, Proceedings of the 1999 Workshop on Binary Translation, pages 2–11, (1999)
60. Suh, T., Lee, H.-H.S., Lu, S.-L., Shen, J.: Initial observations of hardware/software co-simulation using FPGA in architectural research, WARFP ’06: Proceedings of the 2nd Workshop on Architecture Research using FPGA Platforms, (2006)
61. Smith, J.E., Nair, R.: Virtual Machines, Morgan Kaufmann, (2005)
62. Open SystemC Initiative, http://www.systemc.org/home
63. Vachharajani, M., Vachharajani, N., August, D.I.: The Liberty Structural Specification Language: A high-level modeling language for component reuse, PLDI ’04: Proceedings of the ACM SIGPLAN 2004 Conference on Programming Language Design and Implementation, pages 195–206, (2004)
64. Vachharajani, M., Vachharajani, N., Penry, D.A., Blome, J.A., Malik, S., August, D.I.: The Liberty Simulation Environment: A deliberate approach to high-level system modeling, ACM Transactions on Computer Systems, 24(3):211–249, (2006)
65. Witchel, E., Rosenblum, M.: Embra: Fast and flexible machine simulation, SIGMETRICS ’96: Proceedings of the 1996 ACM SIGMETRICS International Conference on Measurement and Modeling of Computer Systems, pages 68–79, (1996)
66. Yi, J.J., Lilja, D.J.: Simulation of computer architectures: Simulators, benchmarks, methodologies, and recommendations, IEEE Transactions on Computers, 55(3):268–280, (2006)
Optimization of Number Representations

Wonyong Sung
Abstract In this section, automatic scaling and word-length optimization procedures for efficient implementation of signal processing systems are explained. For this purpose, a fixed-point data format that contains both integer and fractional parts is introduced and used for systematic and incremental conversion of floating-point algorithms into fixed-point or integer versions. A simulation-based range estimation method is explained and applied for automatic scaling of C language based digital signal processing programs. A fixed-point optimization method is also discussed, and optimization examples including a recursive filter and an adaptive filter are shown.
1 Introduction

Many embedded processors are not equipped with floating-point units, so integer versions of the code need to be developed for real-time execution of signal processing applications. Integer programs run much faster than the floating-point versions, but integer versions can suffer from overflows and quantization effects. Converting a floating-point program to an integer version requires scaling of data, which is known to be difficult and time-consuming. VLSI implementation of digital signal processing algorithms demands fixed-point arithmetic for reducing the chip area, circuit delay, and power consumption. With fixed-point arithmetic, it is possible to use the smallest number of bits for each signal and save chip area. However, if the number of bits is too small, quantization noise will degrade the system performance to an unacceptable level. Thus, a fixed-point optimization that minimizes the hardware cost while meeting the fixed-point performance requirements is much needed.

Wonyong Sung
Department of EE, Seoul National University, 599 Gwanangno, Gwanak-gu, Seoul, Republic of Korea, e-mail: [email protected]
S.S. Bhattacharyya et al. (eds.), Handbook of Signal Processing Systems, DOI 10.1007/978-1-4419-6345-1_25, © Springer Science+Business Media, LLC 2010
In Section 2, the data format for representing fixed-point data is presented. This format contains both integer and fractional parts, and is therefore very convenient for data conversion between floating-point and fixed-point data types. Section 3 contains the range estimation methods that are necessary for integer word-length determination and scaling. A simulation-based range estimation method is explained, which can be applied not only to linear but also to non-linear and time-varying systems. In Section 4, a floating-point to integer C program conversion process is shown. This code conversion process is especially useful for C language based implementation of signal processing systems. Section 5 presents the word-length optimization flow for signal processing programs, which is important for VLSI or 16-bit programmable digital signal processor (DSP) based implementations. In Section 6, a summary and related works are described.
2 Fixed-point Data Type and Arithmetic Rules

A fixed-point data type does not contain an exponent term, which makes hardware for fixed-point arithmetic much simpler than that for floating-point arithmetic. However, fixed-point data representation only allows a limited dynamic range, hence scaling is needed when converting a floating-point algorithm into a fixed-point version. In this section, fixed-point data formats, fixed-point arithmetic rules, and a simple floating-point to fixed-point conversion example will be shown. The two's complement format is used when representing negative numbers.
2.1 Fixed-Point Data Type

A widely used fixed-point data format is the integer format. In this format, the least significant bit (LSB) has the weight of 1, thus the maximum quantization error can be as large as 0.5 even if the rounding scheme is used. As a result, small numbers cannot be faithfully represented with this format. Of course, there can be overflows even with the integer format because an N-bit signed integer has a value that is between $-2^{N-1}$ and $2^{N-1}-1$. Another widely used format is the fractional format, in which the magnitude of a data value cannot exceed 1. This format seems convenient for representing a signal whose magnitude is bounded by 1, but it suffers from overflow or saturation problems when the magnitude of the value exceeds the bound. Fig. 1 shows two different interpretations of the binary data ‘01001000.’ With either the integer or the fractional format, an expert can design an optimized digital signal processing system by incorporating proper scaling operations, which is, however, very complex and difficult to manage. This conversion flow is not easy because all the intermediate variables or constants should be represented with only integers or fractions, whose represented values are usually much different from those of the corresponding floating-point data. The difference of the
Optimization of Number Representations
709
tation format also hinders incremental conversion from a floating-point design to a fixed-point one. For seamless floating-point to integer or fixed-point conversion, the semantic gap between the floating-point and fixed-point data formats needs to be eliminated. weight for integer
[Figure 1 shows the 8-bit binary pattern ‘01001000’ under two weightings: as an integer, with the sign bit weighted −2^7 and the remaining bits weighted 2^6 down to 2^0, giving the integer number 2^6 + 2^3 = 72; and as a fraction, with the sign bit weighted −2^0 and the remaining bits weighted 2^−1 down to 2^−7, giving the fractional number 2^−1 + 2^−4 = 0.5625.]
Fig. 1 Integer and fractional data formats.
To solve these problems, a generalized fixed-point data type that contains both integer and fractional parts can be used [17]. This fixed-point format contains the attributes specified as follows:

<wordlength, integer_wordlength, sign_overflow_quantization_mode>    (1)
The word-length (WL) is the total number of bits for representing a fixed-point data value. The integer word-length (IWL) is the number of bits to the left of the (hypothetical) binary point, and the fractional word-length (FWL) is the number of bits to the right of it. The sign can be either unsigned (‘u’) or two's complement (‘t’). Thus, the word-length (WL) corresponds to ‘IWL+FWL+1’ for signed data and to ‘IWL+FWL’ for unsigned data. If the fractional word-length is 0, data with this format can be represented with integers. In the same way, it becomes the fractional format when the IWL is 0. Note that the IWL or FWL can even be larger than the WL; in this case, the other part has a negative word-length. Fig. 2 shows an interpretation of an 8-bit binary data value employing the fixed-point format with the IWL of 3. The overflow and quantization modes are needed for arithmetic or quantization operations. The overflow mode specifies whether no treatment (‘o’) or the saturation (‘s’) scheme is used when overflow occurs, and the quantization mode denotes whether rounding (‘r’) or truncation (‘t’) is employed when least significant bits are quantized. Most of all, this fixed-point data representation is very convenient for translating a floating-point algorithm into a fixed-point version because the data values are not limited to integers or fractions. The range (R) and the quantization step (Q) are determined by the IWL and FWL, respectively: −2^IWL ≤ R < 2^IWL and Q = 2^−FWL = 2^−(WL−1−IWL) for the signed format. Assigning a large IWL to a variable can prevent overflows, but it increases the quantization noise. Thus, the optimum IWL for a variable should be determined according to its range or the possible maximum absolute value.
[Figure 2 shows the 8-bit binary pattern ‘01001000’ under the generalized fixed-point format with the IWL of 3: the sign bit has weight −2^3, the remaining bits have weights 2^2 down to 2^−4, and the hypothetical binary point lies between the 2^0 and 2^−1 positions, giving the fixed-point value 2^2 + 2^−1 = 4.5.]
Fig. 2 Generalized fixed-point data format.
The minimum IWL for a variable x, IWL_min(x), can be determined from its range, R(x), as follows:

IWL_min(x) = ⌈log2 R(x)⌉,    (2)

where ⌈x⌉ denotes the smallest integer that is equal to or greater than x. Note that preventing overflow or saturation is critical in fixed-point arithmetic because the magnitude of the error caused by them is usually much larger than that produced by quantization.
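Eq. 2 translates directly into code. The following is a minimal C++ sketch (the function name is illustrative, not from the text) that computes the minimum IWL from an estimated range:

    #include <cmath>

    // Minimal sketch of Eq. 2: the smallest IWL whose range covers R(x).
    int iwl_min(double range)
    {
        return static_cast<int>(std::ceil(std::log2(range)));
    }

For instance, iwl_min(5.3) returns 3, which matches the IWL assigned to the filter output in the conversion example of Section 2.3.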
2.2 Fixed-point Arithmetic Rules

Since the generalized fixed-point data format allows different integer word-lengths, two variables or constants that do not have the same integer word-length cannot be added or subtracted directly. Let us assume that x1 is ‘01001000’ with the IWL of 3 and x2 is ‘00010000’ with the IWL of 2. Since the interpreted value of x1 is 4.5 and that of x2 is 0.5, the result should be a number that corresponds to 5.0 or one close to it. However, direct addition of 01001000 (x1) and 00010000 (x2) does not yield the expected result because the two data have different integer word-lengths. The two fixed-point data should be added after aligning their hypothetical binary points. The binary point can be moved, or the integer word-length changed, by using arithmetic shift operations. An arithmetic right shift by one bit increases the integer word-length by one, while an arithmetic left shift decreases it. The number of shifts required for addition or subtraction can easily be obtained by comparing the integer word-lengths of the two input data formats. In the above example, x2, with the IWL of 2, should be shifted right by 1 bit before performing the integer addition to align the binary-point locations. As illustrated in Fig. 3, this results in the correct value of 5.0 when the result is interpreted with the IWL of 3. Note that the result of addition or subtraction sometimes needs an increased IWL. If the IWL of the added result is greater than those of the two input operands, the inputs should be scaled down to prevent overflows. Subtraction is treated in the same way as addition. The scaling rules for addition and subtraction are shown in Table 1, where Ix and Iy are the IWL's of the two input operands x and y, respectively, and Iz is the IWL of the result, z.
[Figure 3 illustrates the example: in Step 1, x2 (‘00010000’, IWL 2) is arithmetic-shifted right by 1 bit to ‘00001000’; in Step 2, the shifted x2 is added to x1 (‘01001000’) with an integer addition, giving ‘01010000’, which is interpreted as 5.0 with the IWL of 3.]
Fig. 3 Fixed-point addition with different IWL's.
Table 1 Fixed-point arithmetic rules.

Operation       floating-point   Ix > Iy                Iy > Ix                Iz > Ix, Iy                           result IWL
Assignment      x = y            x = y >> (Ix − Iy)     x = y << (Iy − Ix)     −                                     Ix
Addition/
Subtraction     x + y            x + (y >> (Ix − Iy))   (x >> (Iy − Ix)) + y   (x >> (Iz − Ix)) + (y >> (Iz − Iy))   max(Ix, Iy, Iz)
Multiplication  x ∗ y            mulh(x, y)                                                                          Ix + Iy + 1 or Ix + Iy

(z: the variable storing the result)
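As an illustration of the addition rule of Table 1, a small C++ helper might look as follows (the function name and the use of 32-bit int operands are illustrative assumptions, not from the original text):

    // Align both operands to the result IWL Iz before the integer addition;
    // Iz is assumed to satisfy Iz = max(Ix, Iy, Iz) per Table 1, so both
    // shift amounts are >= 0.
    inline int fx_add(int x, int Ix, int y, int Iy, int Iz)
    {
        return (x >> (Iz - Ix)) + (y >> (Iz - Iy));
    }

With Ix = 3, Iy = 2, and Iz = 3, this reproduces the one-bit alignment shift of the example in Fig. 3.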
In fixed-point multiplication, the word-length of the product is equal to the sum of the two input word-lengths. In two's complement multiplication, two identical sign bits are generated except when both input data correspond to the negative minimum, ‘100 · · · 0’. Ignoring this case, the IWL of the two's complement multiplied result becomes Ix + Iy + 1. Fig. 4 shows the product of two 8-bit fixed-point numbers. Interpreting it with the IWL of 5 yields the value 2.25.
2.3 Fixed-point Conversion Examples

To illustrate the fixed-point conversion process, a floating-point version of the recursive filter shown in Fig. 5-(a) is converted to a fixed-point hardware system.
[Figure 4 shows the 16-bit product of x1 = ‘01001000’ (IWL 3, value 4.5) and x2 = ‘00010000’ (IWL 2, value 0.5): the product carries two identical sign bits and, with the IWL of 5 and the FWL of 9, is interpreted as 2.25.]
Fig. 4 Fixed-point multiplication.
Assume that the input signal has the range of 1; that is, the input signal is between −1 and 1. The output signal is known, from floating-point simulation results, to be between −5.3 and 5.3. The coefficient is 0.9 and is unsigned. Hence, the range of the multiplied signal z[n] will be 4.77 (= 0.9 × 5.3). From the given range information, we can assign the IWL's of 0 for x[n], 3 for y[n], 3 for z[n], and 0 for the coefficient. The coefficient a is coded as ‘58982,’ which corresponds to the unsigned number 0.9 × 2^16. Since the multiplication of y[n] and a is conducted between signed and unsigned numbers, the IWL of z[n] is 3, which is Iy + Ia. If the coefficient a were coded in the two's complement format, the IWL of z[n] would be 4 due to the extra sign bit generated in the multiplication process. Since the precision of the hardware multiplier is 16 bits, only the upper 16 bits of y[n], including the sign bit, are used for the multiplication. The quantizer (Q) in this figure takes the upper 16 bits among the 32 bits of y[n]. Since the difference between Ix and Iz is 3, x[n] is scaled down, i.e., arithmetic-shifted right, by 3 bits, as the hardware in Fig. 5-(b) shows.
[Figure 5-(a) shows the floating-point filter with input xfl[n] (R(x) = 1.0), output yfl[n] (R(y) = 5.3), a unit delay, the coefficient 0.9, and the feedback product zfl[n]. Figure 5-(b) shows the fixed-point hardware: the 16-bit input x[n] (Ix = 0) is shifted right by 3 bits and added into the 32-bit y[n] (Iy = 3), which is quantized (Q) to 16 bits, delayed, and multiplied by the coefficient 58982 (= 0.9 × 2^16) to form the 32-bit z[n] (Iz = 3).]
Fig. 5 Floating-point to fixed-point conversion of a recursive filter.
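The datapath of Fig. 5-(b) can be sketched in C++ as follows; this is a minimal illustration under the stated assumptions (Ix = 0, Iy = Iz = 3, a 16-bit multiplier), not code from the original chapter:

    #include <cstdint>

    // One sample of the fixed-point filter of Fig. 5-(b).
    int32_t filter_step(int16_t x, int32_t *y_state)
    {
        int16_t yq = (int16_t)(*y_state >> 16);  // Q: keep upper 16 bits of y[n-1]
        int32_t z  = (int32_t)yq * 58982;        // signed x unsigned 0.9*2^16, Iz = 3
        // x has scale 2^-15 (Ix = 0); the accumulator has scale 2^-28 (Iy = 3),
        // so x is multiplied by 2^13, i.e., shifted left 16 then right 3 bits.
        int32_t y  = z + (int32_t)x * (1 << 13);
        *y_state = y;                            // 32-bit y[n] with Iy = 3
        return y;
    }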
There are a few different fixed-point implementations. One example is a fixed-point implementation that needs no shift operations. Note that no shift operation is needed when adding or subtracting two fixed-point data with the same integer word-length. In this case, the IWL of 3 is assigned to the input x[n], even though the range of x[n] is 1.0. This means that the input x[n] is un-normalized, and the upper 3 bits next to the sign bit are unused. Since the IWL's of x[n] and z[n] are the same, there is no need to insert a shifter. Fig. 6-(a) shows the resulting hardware. The SQNR (Signal to Quantization Noise Ratio) of the input is obviously lowered by employing the un-normalized scheme. Another fixed-point implementation, shown in Fig. 6-(b), uses a 16-bit adder. In this case, the quantizer (Q) is moved to the output of the multiplier. Note that the SQNR of this scheme is even lower than that of Fig. 6-(a).
[Figure 6-(a) shows the shift-free implementation: the 16-bit input x[n] with Ix = 3 is added directly to the 32-bit z[n] (Iz = 3), and the 32-bit y[n] (Iy = 3) is quantized (Q) to 16 bits before the delay and the multiplication by 58982 (= 0.9 × 2^16). Figure 6-(b) shows the 16-bit adder version, where the quantizer (Q) is moved to the 32-bit multiplier output so that the addition is performed with 16 bits.]
Fig. 6 Fixed-point filters with reduced complexity.
In the above example, the range of 1 assumed for the input x[n] comes from the floating-point design. However, assuming a range of 2 for the input x[n] does not change the resulting hardware, because the output range is doubled as well.
3 Range Estimation for Integer Word-length Determination

The floating-point to fixed-point conversion examples in the previous section show that estimating the ranges of all variables is the most crucial part of this conversion process. There are two different approaches to range estimation: one calculates the L1-norm of the system, and the other uses the simulation results of the floating-point system [17][12].
3.1 L1-norm based Range Estimation

The L1-norm of a linear shift-invariant system is the maximum absolute value of the output when the absolute value of the input is bounded by 1. If the unit-pulse response of the system is h[n], n = 0, 1, 2, 3, · · ·, the L1-norm of the system is defined as:

L1-norm(h[n]) = Σ_{n=0}^{∞} |h[n]|    (3)
Obviously, the L1-norm can easily be computed for an FIR system. There are also several analytical methods that compute the L1-norm of an IIR (infinite impulse response) system [12]. Since the unit-pulse response of an IIR system usually converges to zero after thousands of time-steps, it is practically possible to estimate the L1-norm of an IIR system with a simple C or Matlab program that sums up the absolute values of the unit-pulse response, instead of conducting a contour integration [12][11]. Since the L1-norm cannot easily be defined for time-varying or non-linear systems, the L1-norm based range estimation method is hardly applicable to systems containing non-linear or time-varying blocks. Another characteristic of the L1-norm is that it is a very conservative estimate: the range obtained with the L1-norm is the largest one for any possible input, and hence the result can be an over-estimate. For example, the L1-norm of the first order recursive system shown in Fig. 5-(a) is 10, which corresponds to the case that the input is a DC signal with the maximum value of 1. In a speech processing system, for example, an input with this characteristic is not likely to exist. With an over-estimated range, the data must be shifted down by more bits, which increases the quantization noise level. For a large scale system, L1-norm based scaling can be impractical because the accumulation of extra bits at each stage may seriously lower the accuracy of the output. However, if a very reliable system that must not experience any overflow is needed, L1-norm based scaling can be considered. L1-norm based scaling is of limited use in real applications because most practical systems contain time-varying or non-linear blocks.
3.2 Simulation based Range Estimation

The simulation based method estimates the ranges by simulating the floating-point design while supplying realistic input signals [17]. This method is especially useful when a floating-point simulation model exists, which can be a C program or a CAD system based design. The method can be applied to general systems, including non-linear and time-varying ones. Thus, provided that there is a floating-point version of the designed system and various input files for simulation, a CAD tool can convert a floating-point design to a fixed-point version automatically. One drawback of this method is that it needs extensive simulations with different environmental parameters and various input signal files. The scaling with this approach is not conservative, so overflows can occur if the statistics of the real input signal differ much from those used for range estimation. Therefore, it is necessary to employ various input files for simulation or to add some extra integer bits, called guard-bits, to secure an overflow-free design. This simulation based method can also be applied to word-length optimization. For unimodal and symmetric distributions, the range can be effectively estimated using the mean (μ) and the standard deviation (σ) obtained from the simulation results, as follows:

R = |μ| + n × σ    (4)

Specifically, we can use n = k + 4, where k is the kurtosis [17]. However, the above rule is not satisfactory for other distributions, which may be multimodal, non-symmetric, or have a non-zero mean. As an alternative rule, we can consider

R = R̂_99.9% + g    (5)

where g is a guard for the range. A partial maximum, R̂_P%, indicates a sub-maximum value that covers P% of the entire set of samples. Various sub-maxima are collected during the simulation. The more R̂_100% and R̂_99.9% differ, the larger the guard value needed.
3.3 C++ Class based Range Estimation Utility

This subsection describes a freely available range estimation utility for C language based digital signal processing programs [3]. This range estimation utility is not only essential for automatic integer C code generation, but is also useful for determining the number of shifts in assembly programming of fixed-point DSPs [16]. With this utility, users develop a C application program with floating-point arithmetic. The range estimator then gathers the statistics of internal signals throughout a floating-point simulation using real inputs, and determines the integer word-lengths of the variables. Although one could develop a separate program that traces the range information during simulation, such an approach may demand too much program modification. The developed range estimation class uses the operator overloading capability of the C++ language, so a programmer does not need to change the floating-point code significantly for range estimation. To record the statistics during the simulation, a new data class that traces the possible maximum value of a signal, i.e., the range, has been developed and named fSig. To prepare a range estimation model of a C or C++ digital signal processing program, it is only necessary to change the type of the variables from float to fSig. The fSig class not only computes the current value, but also keeps the records of a variable using private members. When the simulation is completed, the ranges of the variables declared as fSig are readily available from the records stored in the class.
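The chapter does not list the fSig class itself; the following is a minimal sketch of how such an operator-overloading wrapper might look, using the member names described in the next paragraph (Sum3, Sum4, and the relational operators are omitted for brevity):

    #include <cmath>

    class fSig {
    public:
        explicit fSig(const char* name)
            : Name(name), Data(0), Sum(0), Sum2(0), AMax(0), SumC(0) {}

        // Every assignment updates the running statistics of the variable.
        fSig& operator=(double v) {
            Data = v;
            Sum  += v;                                     // for the mean
            Sum2 += v * v;                                 // for the standard deviation
            if (std::fabs(v) > AMax) AMax = std::fabs(v);  // absolute maximum
            ++SumC;                                        // number of modifications
            return *this;
        }
        operator double() const { return Data; }  // mixes freely with float arithmetic

    private:
        const char* Name;                // e.g., "iir1/Yout" for the report
        double Data, Sum, Sum2, AMax;
        long   SumC;
    };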
The fSig class has several private members, including Data, Sum, Sum2, Sum3, Sum4, AMax, and SumC. Data keeps the current value, while Sum and Sum2 record the sum and the squared sum of past values, respectively. Sum3 and Sum4 record the third and fourth moments, respectively. They are needed to calculate the statistics of a variable, such as the mean, standard deviation, skewness, and kurtosis. AMax stores the absolute maximum value of a variable during the simulation. The class also keeps the number of modifications during the simulation in the SumC field. The fSig class overloads arithmetic and relational operators. Hence, basic arithmetic operations, such as addition, subtraction, multiplication, and division, are conducted automatically for fSig variables. The same holds for relational operators, such as ‘==,’ ‘!=,’ ‘>,’ ‘<,’ ‘>=,’ and ‘<=.’ Therefore, any fSig instance can be compared with floating-point variables and constants. The contents, or private members, of a variable declared with the fSig class are updated when the variable is assigned through one of the assignment operators, such as ‘=,’ ‘+=,’ ‘−=,’ ‘∗=,’ and ‘/=.’ The range estimator is executed after preparing the simulation model, which only requires modifying the variable declarations. After the simulation is completed, the mean (μ), standard deviation (σ), skewness (s), and kurtosis (k) can be calculated using the Sum, Sum2, Sum3, Sum4, and SumC information. Then, the statistical range of an fSig variable x can be estimated, and the integer word-lengths of all signals are obtained from their ranges. As an example, let us consider a first order digital IIR filter. The original C++ program for the filter and a translated version for range estimation are given in Fig. 7.

void iir1(short argc, char *argv[])
{
    float Xin;
    float Yout;
    float Ydly;
    float Coeff;
    int i;

    Coeff = 0.9;
    Ydly = 0.;
    for( i = 0; i < 1000; i++ ) {
        infile >> Xin;
        Yout = Coeff * Ydly + Xin;
        Ydly = Yout;
        outfile << Yout << '\n';
    }
}
(a) The original C++ program.

void iir1(short argc, char *argv[])
{
    float Xin;
    static fSig Yout("iir1/Yout");
    static fSig Ydly("iir1/Ydly");
    float Coeff;
    int i;

    Coeff = 0.9;
    Ydly = 0.;
    for( i = 0; i < 1000; i++ ) {
        infile >> Xin;
        Yout = Coeff * Ydly + Xin;
        Ydly = Yout;
        outfile << Yout << '\n';
    }
}
(b) A version for the range estimator.

Fig. 7 C++ programs for a first order IIR filter.
As shown in Fig. 7, developing a range estimation program only requires modifying the declaration part of the original floating-point version. In this example, a white noise sequence uniformly distributed between −1 and +1 is used for the input data, and four standard deviations are used for estimating the ranges. Note that the integer word-length for Xin is known to be 0. The range estimation result is shown in Fig. 8.

Statistics:
VarName      Mean     StdDev   Skewness  Kurtosis  R99.9%   R100%    Update
iir1/Ydly    -0.1133  +1.3076  +0.0220   -0.0258   +4.2638  +4.4214  3001
iir1/Yout    -0.1134  +1.3078  +0.0220   -0.0268   +4.2638  +4.4214  3000

Integer word-lengths:
VarName      Range       IWL
iir1/Ydly    +5.309891   +3
iir1/Yout    +5.309515   +3

Fig. 8 The result of the range estimator for the IIR filter.
The elements of an array variable are assumed to have the same IWL for simple code generation. If they were not, the scaled integer code would need to check the array index, which can slow down program execution significantly. For a pointer variable, the IWL is defined as that of the pointed-to variables. For example, when the pointer variables p and q have the IWL of 2 and 3, respectively, the expression ∗q = ∗p can be converted to ∗q = ∗p >> 1. Since the IWL of a pointer variable cannot change at runtime, a pointer cannot support two variables having different IWL's. In this case, the IWL's of these pointers are equalized automatically at the integer C code generation step described in the next section.
4 Floating-point to Integer C Code Conversion

The C language is most frequently used for developing digital signal processing programs. Although C is very flexible for describing algorithms with complex control flows, it does not support fixed-point data formats. In this section, a floating-point to integer C program conversion procedure is explained [24][13]. As shown in Fig. 9, the conversion flow utilizes the simulation based range estimation results to determine the number of shift operations for scaling. In addition, the number of shift operations is minimized by equalizing the IWL's of corresponding variables or constants in order to reduce the execution time.
4.1 Fixed-Point Arithmetic Rules in C Programs

As summarized in Table 1, the addition or subtraction of two input data with different IWL's needs an arithmetic shift before the operation is conducted. Fixed-point multiplication in C needs careful treatment because integer multiplication in ANSI C only stores the lower half of the multiplied result, while fixed-point multiplication needs the upper half.
[Figure 9 shows the conversion flow: the floating-point C code undergoes syntax analysis, range estimation, and profiling; an IWL annotator and an IWL check attach the IWL information; shift reduction and shift optimization minimize the scaling overhead; and integer C code generation finally produces the integer C code.]
Fig. 9 Floating-point to integer C code conversion flow.
Integer multiplication is intended to prevent any loss of accuracy in the multiplication of small numbers, and hence it can generate an overflow when large input data are applied. However, for signal processing purposes, the upper part of the result is needed to prevent overflows and keep accuracy. Integer and fixed-point multiplication operations are compared in Fig. 10-(a) and (b) [14][15].
[Figure 10 depicts an N-bit × N-bit multiplication producing a 2N-bit product: ANSI C integer multiplication keeps the lower N bits, fixed-point multiplication keeps the upper N bits, and the MPYH instruction multiplies the upper 16 bits of two 32-bit operands to form a 32-bit result.]
Fig. 10 Integer and fixed-point multiplications. (a) ANSI C integer multiplication. (b) Fixed-point multiplication. (c) MPYH instruction of TMS320C60.
In traditional C compilers, a double precision multiplication followed by a double-to-single conversion is needed to obtain the upper part, which is obviously very inefficient [28]. However, in recent C compilers for some DSPs, such as Texas Instruments' TMS320C55 (’C55), the upper part of the multiplied result can be obtained by combining multiply and shift operations [6]. In the case of the TMS320C60 (’C60), which has 32-bit registers and ALU's but only 16 by 16-bit multipliers, the multiplication of the upper 16 bits of two 32-bit operands is efficiently supported by C intrinsics, as depicted in Fig. 10-(c) [7]. If the C compiler provides no support for obtaining the upper part of the multiplied result, an assembly level implementation of fixed-point multiplication is useful.
For the Motorola 56000 processor, fixed-point multiplication is implemented with a single instruction using inline assembly coding [5]. Note that, in the Motorola 56000, the IWL of the multiplication result is Ix + Iy, because the output of the multiplier is left-shifted by one bit in hardware. The implementation of the macro or inline function for fixed-point multiplication, mulh(), depends on the compiler of the target processor, as illustrated in Table 2.

Table 2 Implementation of fixed-point multiplication.

Target processor   Implementation
TMS320C50          #define mulh(x,y) ((x)*(y)>>16)
TMS320C60          #define mulh(x,y) _mpyh(x,y)
Motorola 56000     __inline int mulh(int x, int y) { int z;
                     __asm("mpy %1,%2,%0" : "=D"(z) : "R"(x), "R"(y));
                     return z; }
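On a general purpose processor or compiler without such support, the upper half can be obtained portably with the double-width multiply mentioned above; the following C++ sketch for 32-bit operands is the straightforward, potentially slow fallback, not a DSP-specific implementation:

    #include <cstdint>

    // Fixed-point multiplication as in Fig. 10-(b): keep the upper 32 bits
    // of the 64-bit product of two 32-bit operands.
    inline int32_t mulh(int32_t x, int32_t y)
    {
        return (int32_t)(((int64_t)x * y) >> 32);
    }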
4.2 Expression Conversion Using Shift Operations

The most frequently used expression in digital signal processing is the accumulation of product terms, which can be generally modeled as follows:

x_i = Σ_{j,k} (x_j × x_k) + Σ_l x_l    (6)
Complex expressions in C programs are converted to several expressions having this form. Figure 11 shows one example.
y = (a+b)*c;    →    tmp = a+b;
                     y = tmp*c;

Fig. 11 An example of expression conversion.
Since the IWL of an added result is the maximum of the IWL's of the two input operands and the result, as shown in Table 1, the IWL of the right hand side expression, Irhs, is determined by the maximum IWL of the terms, as shown in Eq. 7:

Irhs = max_{j,k,l} (Ix_j + Ix_k + 1, Ix_l, Ix_i),    (7)
where Ix + Iy + 1 is used for the IWL of multiplied results. The number of scaling shifts for a product, an addend, or the assignment, represented as s_{j,k}, s_l, or s_i, respectively, is determined as follows:

s_{j,k} = Irhs − (Ix_j + Ix_k + 1)    (8)
s_l = Irhs − Ix_l    (9)
s_i = Irhs − Ix_i    (10)
Equation 6 is now converted to the scaled expression as follows:

x_i = {Σ_{j,k} ((x_j × x_k) >> s_{j,k}) + Σ_l (x_l >> s_l)} << s_i    (11)
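As a small worked instance of Eq. 11 (the IWL values here are illustrative assumptions): with one product term a*b and one addend c, where Ia = Ib = 1, Ic = 2, and the target IWL is 2, Eq. 7 gives Irhs = max(1+1+1, 2, 2) = 3, so s_{a,b} = 0, s_c = 1, and s_i = 1, and the scaled C expression becomes:

    /* y = a*b + c with Ia = Ib = 1, Ic = 2, Iy = 2; mulh() is the
       fixed-point multiplication of Table 2. */
    y = ((mulh(a, b) >> 0) + (c >> 1)) << 1;  /* the 0-bit shift is shown
                                                 only to mirror Eq. 11 */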
4.3 Integer Code Generation

The IWL information file generated in the range estimation step includes the scope of each variable, indicating whether it is global or local, the variable name, and its IWL. In the automatic scaling integer program converter for C, this information is attached to the symbol table of the floating-point program. The conversion of types and expression trees is conducted in the integer C code generation stage. The symbol tables are modified to replace floating-point types with integer types. Not only the float type but also float-based types, such as pointers to float, float arrays, and float functions, are converted to the corresponding integer-based types. The expression tree conversion that inserts scaling shifts uses the fixed-point arithmetic rules shown in Table 1. It is performed from the bottom to the top of the parse tree, and the IWL information of each tree node is propagated in the same way. In this step, pointer operations that involve different IWL's are also checked.

4.3.1 Shift Optimization

In many programmable DSP's, an implementation that needs no or fewer scaling shift operations is not only faster but also requires a smaller code size. Since no shift operation is needed for the addition or assignment of operands having the same IWL, the number of scaling shifts can be reduced by equalizing the IWL's of the relevant variables. An example implementation with shift reduction is illustrated in Fig. 6-(a). Note that only increasing the initial IWL's determined according to Eq. 2 is allowed, so the equalization can increase the quantization noise level. Scaling shift reduction requires global optimization because the IWL modification of a variable in one expression can incur extra scaling shifts in other expressions. Shift optimization also depends on the architecture of the target DSP. For example, if a DSP has a barrel shifter, the number of bits of a scaling shift, unless it is zero, does not affect the number of execution cycles.
However, if it has no barrel shifter and must conduct scaling with one-bit shift operations, the shift cost is also affected by the number of bits of each scaling operation. Minimizing the execution time also requires reducing the number of scaling operations inside long loops; thus, this optimization requires program-profiling results. The IWL modification that minimizes the scaling overhead is conducted as follows. First, the number of shifts for each expression is formulated in terms of the IWL's of the relevant variables and constants. Second, a cost function corresponding to the total overhead of the scaling shifts is built from the results of the first step, the target DSP architecture, and the program-profiling information. Finally, the cost function is minimized by modifying the IWL's using integer linear programming or the simulated annealing algorithm. Note that shift reduction for a DSP architecture without a barrel shifter can be modeled as an integer linear programming problem. The simulated annealing algorithm is a general optimization method, but the optimization can take much time. Detailed methods for shift reduction can be found in [13].
4.4 Implementation Examples

A fourth order IIR filter is implemented for TI's ’C50 and ’C60 and the Motorola 56000 using the developed scaling method. The floating-point C code for the fourth order IIR filter shown in Fig. 12 is given in Fig. 13-(a).
[Figure 12 shows the filter as two cascaded second order sections: the input x is scaled to x1 and filtered through the first section, with state delays d1[0] and d1[1], feedback coefficients b1[0] and b1[1], and feed-forward coefficients a1[0], a1[1], a1[2], to produce y1; y1 is then filtered through the second section, with states d2[0] and d2[1] and coefficients b2[0], b2[1] and a2[0], a2[1], a2[2], to produce the output y.]
Fig. 12 A fourth order IIR filter.
After the range estimation, the original IWL's are determined as shown in Table 3. With these initially determined IWL's, the integer C code shown in Fig. 13-(b) is generated using the IWL model of Ix + Iy + 1 for the multiplied results. First, the coefficients a1, a2, b1, and b2 are all converted to the two's complement format with the IWL of 1. Although some coefficients, such as 0.355407 or −0.75504, could be encoded with the IWL of 0, the same IWL is given to all the coefficients to reduce the number of shifts for scaling. As a result, the coefficient value of 1 is converted to 2^30 (= 1073741824), the value of 0.355407 is translated
to 0.355407 × 2^30 (= 381615360), and so on. The expression x1 = 0.01 * *x is converted as follows. The constant ‘0.01’ is changed to the two's complement integer ‘1374389534’ (= 0.01 × 2^37) because the decimal number 0.01 is represented in binary as ‘0.0 0000 0101 0001 1110 1011 · · · ,’ where there are six zeroes below the binary point. Thus, the IWL of −6 is given to this constant, which translates 0.01 to 0.01 × 2^(31+6). The input x is read from the file and has the IWL of 17 with the two's complement format. The multiplied result of the constant and x has the IWL of 12 (= −6 + 17 + 1), but x1 has the IWL of 10. Thus, a two-bit left shift is needed before assigning the multiplied result to x1, and the expression becomes x1 = sll(mulh(1374389534, *x), 2). Next, since the IWL of t1, or d1, is 12, the IWL of b1[0]*d1[0] and b1[1]*d1[1] is 14 (= 1 + 12 + 1). In the same way, since the IWL of x1 is 10 and that of mulh(*b1, *d1) or mulh(b1[1], d1[1]) is 14, x1 needs to be right-shifted by 4 bits before the addition. The added result is then assigned to t1, which has the IWL of 12, so a two-bit left shift is needed before the assignment, forming the expression t1 = sll((x1 >> 4) + mulh(*b1, *d1) + mulh(b1[1], d1[1]), 2). Because the IWL of t1 and d1 is 12, the multiplication with the coefficients a1 produces results having the IWL of 14 (= 12 + 1 + 1), and the multiplied result can be assigned to y1 without any shift. The rest of the code can be interpreted in the same way. Shift reduction is performed for a DSP architecture that employs a barrel shifter, and the integer C code shown in Fig. 13-(c) is generated by controlling the IWL's. Considering the expression x1 = sll(mulh(1374389534, *x), 2), we can eliminate the left shift operation by increasing the IWL of x1 by 2 bits. Note that the IWL reduction process requires simultaneous IWL modification of several variables or constants. Table 3 shows the optimized IWL's, and the optimized integer C code is shown in Fig. 13-(c). Although there can be a loss of bit resolution, several redundant shift operations are successfully eliminated. In the optimized code, the left shift operations are eliminated when calculating x1 and y, and the 4-bit right shift operations are removed when using x1 and y1.

Table 3 IWL determined by the range estimation for the fourth order filter.

Variable     Original IWL   Optimized IWL   IWL increment
x            17             20              3
y            15             18              3
x1           10             15              5
y1           14             18              4
t1, d1       12             13              1
t2, d2       16             16              0
b1, b2, a2   1              1               0
a1           1              4               3
float a1[3] = { 1, 0.355407, 1.0 };
float a2[3] = { 1, -1.091855, 1.0 };
float b1[2] = { 1.66664, -0.75504 };
float b2[2] = { 1.58007, -0.92288 };
float d1[2], d2[2];

void iir4(float *x, float *y)
{
    float x1, y1, t1, t2;
    x1 = 0.01 * *x;
    t1 = x1 + b1[0]*d1[0] + b1[1]*d1[1];
    y1 = a1[0]*t1 + a1[1]*d1[0] + a1[2]*d1[1];
    d1[1] = d1[0];
    d1[0] = t1;
    t2 = y1 + b2[0]*d2[0] + b2[1]*d2[1];
    *y = a2[0]*t2 + a2[1]*d2[0] + a2[2]*d2[1];
    d2[1] = d2[0];
    d2[0] = t2;
}
(a) The floating-point C code.

#define sll(x,y) ((x)<<(y))
int a1[3] = { 1073741824, 381615360, 1073741824 };
int a2[3] = { 1073741824, -1172370379, 1073741824 };
int b1[2] = { 1789541073, -810718027 };
int b2[2] = { 1696587243, -990934855 };
int d1[2];
int d2[2];

extern void iir4(int *x, int *y)
{
    int x1;
    int y1;
    int t1;
    int t2;
    x1 = sll(mulh(1374389534, *x), 2);
    t1 = sll((x1 >> 4) + mulh(*b1, *d1) + mulh(b1[1], d1[1]), 2);
    y1 = mulh(*a1, t1) + mulh(a1[1], *d1) + mulh(a1[2], d1[1]);
    d1[1] = *d1;
    *d1 = t1;
    t2 = sll((y1 >> 4) + mulh(*b2, *d2) + mulh(b2[1], d2[1]), 2);
    *y = sll(mulh(*a2, t2) + mulh(a2[1], *d2) + mulh(a2[2], d2[1]), 3);
    d2[1] = *d2;
    *d2 = t2;
}
(b) The integer C code before shift reduction.
#define sll(x,y) ((x)<<(y))
int a1[3] = { 134217728, 47701920, 134217728 };
int a2[3] = { 1073741824, -1172370379, 1073741824 };
int b1[2] = { 1789541073, -810718027 };
int b2[2] = { 1696587243, -990934855 };
int d1[2];
int d2[2];

extern void iir4(int *x, int *y)
{
    int x1;
    int y1;
    int t1;
    int t2;
    x1 = mulh(1374389534, *x);
    t1 = sll(x1 + mulh(*b1, *d1) + mulh(b1[1], d1[1]), 2);
    y1 = mulh(*a1, t1) + mulh(a1[1], *d1) + mulh(a1[2], d1[1]);
    d1[1] = *d1;
    *d1 = t1;
    t2 = sll(y1 + mulh(*b2, *d2) + mulh(b2[1], d2[1]), 2);
    *y = mulh(*a2, t2) + mulh(a2[1], *d2) + mulh(a2[2], d2[1]);
    d2[1] = *d2;
    *d2 = t2;
}
(c) The integer C code after shift reduction.

Fig. 13 The C codes for the fourth order IIR filter.
In the fourth order IIR filter, the speed-up, defined as the ratio of the execution time of the floating-point version to that of the integer version, was 29.8, 406, and 28.5 for the ’C50, ’C60, and Motorola 56000, respectively, as shown in Table 4. The remarkable speed-up of the ’C60 is mainly due to its deeply pipelined VLIW architecture with a large register file and its efficient C compiler. This machine can execute up to 8 integer operations in one cycle and store all the variables of a small loop kernel in registers, but it needs a large number of no-operation cycles for floating-point function calls to flush the pipeline registers. The compiler for the ’C60 is very efficient because the architecture has several compiler friendly features, such as a large general purpose register file, an orthogonal instruction set, and a VLIW scheduler [7]. The developed shift reduction technique is applied to this example. The number of shift operations in the converted C code is reduced from 7 to 2 without imposing an IWL upper bound for TI's ’C50 and ’C60. The number of shifts in the C codes for the Motorola 56000 differs from that for TI's DSP's because the IWL of multiplication results is different, as described in the previous section. The cycle counts of the shift reduced codes are shown in Table 5. As shown in this table, the ’C60 achieves a 33% speedup with the shift optimization. The low speedup of the ’C50 is because the shifts can be performed by load-store instructions with no additional cycle on the ’C50. For the Motorola 56000, a high speedup can be achieved because its shift cost is much higher than that of the other DSP's, which employ barrel shifters. The SQNR (Signal to Quantization Noise Ratio) of the fixed-point implementation was measured as 49.3 dB, 57.9 dB, and 78.5 dB for the ’C50, ’C60, and Motorola 56000, respectively. Note that
the ’C50 uses a 16-bit word-length for internal memory, while the ’C60 supports a native 32-bit word-length, although both machines have only 16-bit multipliers. In the ’C60, the upper 16 bits of the 32-bit data are multiplied to produce a 32-bit result. Thus, the ’C50 based implementation generates more quantization noise due to the truncation of the lower 16 bits of the 32-bit multiplied result. The Motorola 56000 uses a 24-bit data type for both addition and multiplication, and the upper 24 bits of the multiplier output are used as the multiplication result. When the IWL's are increased for shift reduction, the fixed-point performance is slightly degraded in the ’C60 examples because the FWL's are decreased, but the ’C50 example results do not agree with this expectation. This is because the quantization noise caused by the scale-down shifts is eliminated: in this example, the input is scaled instead of the internal signals to reduce the scaling shifts, and this results in less quantization noise at the output. Since the results of the optimization as a function of the IWL increase are not simple, a few different upper bounds need to be tried to find the optimum one.

Table 4 Performance comparison for the fourth order IIR filter.

         # of cycles             speed-up   SQNR
         floating-p.   integer              (integer)
’C50     2,980         100       29.8       49.3 dB
’C60     3,659         9         406.6      57.9 dB
56000    26,282        921       28.5       78.5 dB
Table 5 Shift reduction results of the fourth order IIR filter.

                               IWL increment upper bound
                               0 (no shift reduction)   3         infinite
       # of shifts in C codes  7                        4         2
’C50   # of cycles             100                      96        94
       speedup                 -                        4%        6%
       SQNR                    49.3 dB                  51.2 dB   54.1 dB
’C60   # of cycles             9                        6         8
       speedup                 -                        33%       11%
       SQNR                    57.9 dB                  57.1 dB   54.2 dB
       # of shifts in C codes  5                        3         2
56000  # of cycles             921                      675       577
       speedup                 -                        27%       37%
       SQNR                    78.5 dB                  78.5 dB   78.5 dB
5 Word-length Optimization

VLSI implementation of digital signal processing algorithms requires fixed-point arithmetic for the sake of circuit area minimization, speed, and low power consumption. Word-length optimization is also used for 16-bit programmable DSP and SIMD processor based implementations because some SIMD arithmetic instructions for embedded processors employ shortened word-length data, 8-bit or 16-bit, to increase the data-level parallelism. In word-length optimization, it is necessary to reduce the quantization effects to an acceptable level without increasing the hardware cost too much. If the number of bits assigned is too small, quantization noise degrades the system performance too much; on the other hand, if the number of bits is too large, the hardware cost becomes too high. Word-length optimization for linear time-invariant systems may be conducted by analytical methods [9]. However, the simulation based approach is preferred because it accommodates not only linear but also non-linear and time-varying blocks [23]. In this section, the finite word-length effects are described, a C++ class library based fixed-point simulation tool is presented, and the simulation based word-length optimization method is explained.
5.1 Finite Word-length Effects

Fixed-point arithmetic introduces quantization noise, by which the final system performance is inevitably degraded in most cases. Thus, the performance needed from a system with fixed-point arithmetic should be defined first for word-length optimization. The finite word-length effects in implementing a digital filter can be classified into coefficient quantization and signal quantization [8]. Coefficient quantization changes the system transfer function, so its effect can be observed by displaying the frequency response of the transfer function. The frequency responses of an FIR filter with floating-point, 12-bit, 8-bit, and 4-bit coefficients are shown in Fig. 14. The filter structure also affects the quantization effects: when implementing a high order recursive filter, second order cascaded forms usually show better results than direct forms. Signal quantization can be considered as adding noise, instead of distorting the system transfer function. One of the most widely used fixed-point performance measures is the SQNR (Signal to Quantization Noise Ratio), defined as Eq. 12, where y_fl[n] is the floating-point result and y[n] is the fixed-point result:

SQNR = P_signal / P_noise = E[y_fl^2] / E[(y_fl − y)^2]    (12)

The SQNR is a convenient measure, but it can usually be applied only to linear time-invariant systems, where the output quantization noise can be modeled as additive terms.
[Figure 14 plots the magnitude response in dB versus the normalized frequency (×π rad/sample) of the filter for floating-point coefficients and for 12-bit, 8-bit, and 4-bit fixed-point coefficients.]
Fig. 14 Coefficient quantization effects of an FIR filter.
Fig. 15 shows a simple additive noise model for the first order filter of Fig. 5-(a).

[Figure 15 shows the floating-point filter of Fig. 5-(a) with a quantization noise source q[n] injected at the adder that produces yfl[n].]
Fig. 15 Additive noise model.
Note that the quantization noise can be roughly modeled as uniformly distributed between −Δ/2 and Δ/2 when the rounding scheme is employed, where Δ corresponds to 2^−FWL. Thus, the maximum noise amplitude is halved when the FWL is increased by 1 bit. A quantization scheme that simply eliminates the least significant bits in word-length reduction creates DC biased quantization noise, which can cause a serious problem when it is followed by an accumulation circuit with a very high DC gain. Thus, as the FWL is decreased by 1 bit, the output noise power
becomes quadrupled, and the SQNR is decreased by 6 dB. In the same way, increasing the FWL by 1 bit increases the SQNR by 6 dB. For more complex systems having multiple quantization noise sources, the additive quantization noise model can be built in a similar way, and the output noise is the sum of the contributions of all the quantization noise sources [12]. Therefore, increasing all the fractional word-lengths by 1 bit increases the output SQNR by 6 dB in this case, too. The SQNR can also be used as the fixed-point performance measure for waveform coders, such as adaptive delta modulators (ADMs) or CELP vocoders; however, increasing the fractional word-length by 1 bit does not reduce the output noise power by 6 dB because these systems are not linear with respect to the quantization noise sources. For the case of an adaptive filter, word-length reduction causes slow convergence and a higher steady state noise power. Thus, the fixed-point performance for optimizing an adaptive filter can be modeled as the noise power after some time-off period [22].
5.2 Fixed-point Simulation using the C++ gFix Library

Although several analytical methods for evaluating the fixed-point performance of a digital signal processing algorithm have been developed using the statistical model of quantization noise, they are not easily applicable to practical systems containing non-linear and time-varying blocks [26][27]. The analysis is more complicated when a specific kind of input signal, such as speech, is required for the evaluation. To relieve these problems, simulation tools can be used for evaluating the fixed-point characteristics of a digital signal processing algorithm. There are a few commercially available fixed-point simulation tools for signal processing. SPD (Signal Processing Designer) of CoWare and Simulink of MathWorks provide fixed-point simulation libraries [1][4]. Mixed simulation of floating-point and fixed-point blocks is allowed with these libraries. A fixed-point block can be a simple adder or quite a complex one, such as an FFT or a digital filter. To assign a fixed-point format to a block, one only needs to open the block with a mouse click and edit its fixed-point attributes, such as the word-length, integer or fractional word-length, overflow or saturation mode, rounding or quantization mode, and so on. However, the widely used C programming language does not support fixed-point arithmetic. Fixed-point simulation of a C program for signal processing can be conducted with a fixed-point class that also uses the operator overloading capability of the C++ language. A new fixed-point data class, gFix, and its operators have been developed to prepare a fixed-point version of a floating-point program and to examine its finite word-length and scaling effects by simulation [17]. The gFix class follows the generalized fixed-point format. For example, gFix(10, 2, "tsr") represents a format with the WL of 10, the IWL of 2, and two's complement representation, saturation for overflow handling, and rounding for quantization. The gFix class supports all of the assignment and arithmetic operations of the C and C++ languages. There is no
loss of accuracy during fixed-point add or multiply operations. However, arithmetic right or left shifts may cause loss of accuracy or overflows. The assignment operator, ‘=,’ converts the input data according to the fixed-point format of the left side variable and assigns the converted data to that variable. If the given format of the left side variable does not have enough precision for representing the input data, the data is modified according to the attributes of the left side variable, such as saturation, rounding, or truncation. Since the format conversion occurs only at the assignment operator, the two programs shown in Fig. 16-(a) and Fig. 16-(b) can have different fixed-point results. In Fig. 16-(b), the result of a + b is format converted to 10-bit data, then added to the operand c, and then format converted again. By utilizing these characteristics, the fixed-point performance of different implementations of the same algorithm can be compared.

gFix a(12,0,"tsr");
gFix b(12,0,"tsr");
gFix c(12,0,"tsr");
gFix d(10,1,"tsr");

d = a + b + c;
(a)

gFix a(12,0,"tsr");
gFix b(12,0,"tsr");
gFix tmp(10,1,"tsr");
gFix c(12,0,"tsr");
gFix d(10,1,"tsr");

tmp = a + b;
d = tmp + c;
(b)

Fig. 16 Three operand addition using different architectures.
A fixed-point simulation model of a simple IIR filter converted from the floating-point C program is shown in Fig. 17. Note that only the types of the variables are converted to gFix; the other parts of the program are not changed.
5.3 Word-length Optimization Method

Word-length optimization tries to find the word-length vector that minimizes the hardware cost while meeting the system performance under fixed-point arithmetic. Since the optimum word-length depends on the desired fixed-point performance of the system, a performance measurement block has to be included as part of the system set-up. The performance measurement block must generate a positive result (or pass) when the quantization effects are acceptable. A hardware cost library is also needed to estimate the total hardware cost for a given word-length vector. Note that not only the algorithm but also the system architecture affects the hardware cost. In the simplest case, only one word-length is used for all arithmetic operations, which is called uniform word-length optimization. In uniform word-length optimization, a fixed-point simulation with a shorter (or longer) word-length than the optimum one should yield a fixed-point performance that is lower (or higher) than
void iir1(short argc, char *argv[])
{
    gFix Xin(12,0);
    gFix Yout(16,3);
    gFix Ydly(16,3);
    gFix Coeff(10,0);
    int i;

    Coeff = 0.9;
    Ydly = 0.;
    for( i = 0; i < 1000; i++ ) {
        infile >> Xin;
        Yout = Coeff * Ydly + Xin;
        Ydly = Yout;
        outfile << Yout << '\n';
    }
}

Fig. 17 A fixed-point C++ program for a first order IIR filter.
the needed performance. Thus, it is possible to arrive at the optimum word-length by increasing (or decreasing) the word-length when the obtained performance is lower (or higher) than the needed fixed-point performance. In the case of linear time-invariant systems, the number of simulations can be reduced by exploiting the fact that the SQNR becomes higher by 6 dB with each one-bit word-length increase. Usually, there are multiple word-lengths to optimize when implementing a fixed-point system. As the number of word-lengths to optimize increases, their optimization takes a longer time. In other words, minimizing the number of variables is very important for reducing the optimization time. In this optimization method, the number of different word-lengths is reduced by signal grouping, which assigns the same word-length to signals connected, for example, by a delay or a multiplexer block. The word-length sensitivity of a signal also needs to be considered for optimization. Some signals are very sensitive to quantization and thus need a long word-length. The minimum bound of the word-length for each signal group is inversely related to the sensitivity and can be used to reduce the search space. Finally, the optimization of different word-lengths requires a hardware cost model. The word-length optimization method in this section consists of four steps: signal grouping, sign and integer word-length determination, minimum word-length determination, and cost optimum word-length search. As an example, Fig. 18 shows the setup for the fixed-point optimization of a first order filter. The SQNR is used as the measure of the fixed-point performance.

5.3.1 Signal Grouping

The optimization method preprocesses the netlist of a signal flow block diagram to group the signals that can have the same fixed-point attributes and, as a result, to minimize the number of variables for optimization. Both automatic and manual grouping functions are employed. The automatic grouping rules are as follows.
[Figure 18 shows the setup: a signal source feeds an ADC, whose output enters a first order recursive filter with the coefficient 0.95 and a unit delay; the filter output drives a DAC connected to the signal sink. All signals initially carry the attribute <64,31,t>, and a reference output source feeds an SQNR measurement block with a desired SQNR of 50 dB.]
Fig. 18 Fixed-point setup for a first order filter.
1. Signals connected by a delay, a multiplexer, or a switch are grouped.
2. The input and output signals of an adder or a subtracter are grouped.
3. Signals connected by a multiplier, a quantizer, or a format converter can have different fixed-point data formats unless these signals are grouped together via another path in the signal flow block diagram.

In [18], more grouping rules were implemented to reduce the number of groups as much as possible. For the Fixed Point Optimizer of SPW, however, a manual grouping mechanism was developed instead of elaborate grouping rules. By manually adding a prefix to a fixed-point attribute parameter, signals that would otherwise be assigned to different groups can be combined into one group. For example, the ADC and the DAC in Fig. 18 can have the same fixed-point data format if the fixed-point format for both is specified with the same prefix, e.g., ‘ad<64,31,t>.’ In high-level synthesis, operations are bound in order to reduce the number of hardware components [20]; in this case, the binding results can be used for grouping.

5.3.2 Determination of Sign and Integer Word-Length

If the minimum value of a signal is not negative, the sign can be ‘u’ (unsigned). Otherwise, it should be ‘t’ (two's complement). The integer word-length for a signal can be determined from the range of the signal, R(x), using Eq. 2. Although each signal could have a separate integer word-length, one common integer word-length is assigned to all the signals in the same group in order to reduce the use of shifters in the implementation. The sign and the integer word-length are determined in just one simulation.
5.3.3 Determination of the Minimum Word-Length for Each Group

Assume a word-length vector w whose components are the word-lengths of the groups:

w = (w_1, w_2, · · ·, w_N),    (13)
where N is the number of groups. The performance of the fixed-point system, such as the SQNR or the negative of the mean squared error, is represented by p(w), and the hardware cost, usually the number of gates, by c(w). Then, the optimum word-length vector, w_opt, should have the minimum value of c(w) while p(w) is larger than p_desired. We assume the following relation between the word-length vector and the fixed-point performance:

p((w_1, w_2, · · ·, w_i, · · ·, w_N)) ≥ p((w_1, w_2, · · ·, w_i − 1, · · ·, w_N)),    (14)

where 1 ≤ i ≤ N. The above equation means that reducing the word-length of a group decreases, or at least does not increase, the fixed-point performance of the system. Then, the number of simulations required for the search can be reduced greatly by decomposing the procedure into two steps: the minimum word-length determination and the cost optimum word-length determination. The minimum word-length for a group, w_i,min, is the smallest word-length that satisfies the fixed-point performance of the system when the word-lengths of all the other groups are very large, typically a 64-bit fixed-point or the floating-point type. By the assumption shown in Eq. 14, this minimum word-length can be considered to be smaller than or equal to the optimum word-length for the group. The minimum word-length determination procedure for the first order digital filter is illustrated in Table 6. The word-length vector for this filter consists of three components: the ‘ADC,’ ‘filter,’ and ‘DAC’ word-lengths. The common, or uniform, word-length that satisfies the fixed-point performance is determined first. The uniform word-length is not only useful for some implementation architectures, such as bit-serial implementations, but is also the upper limit of the minimum word-length according to the assumption in Eq. 14. Thus, the search for the minimum word-length of each group goes downward from the determined uniform word-length. The number of simulations required for the minimum word-length determination procedure is typically two to four times the number of groups. The minimum word-length determination does not require the hardware cost model.

5.3.4 Determination of the Minimum Hardware Cost Word-Length Vector

The word-length vector that satisfies the system performance with the minimum hardware cost is determined in the next step. If the simulation using the minimum word-length vector, w_min, satisfies the desired performance, w_min becomes the optimum word-length vector, w_opt. If not, the word-lengths are increased, and the fixed-point system performance is measured by simulation. Three search strategies, an exhaustive and two heuristic methods, were developed for this step.
Table 6 Search sequence for the first order recursive filter.

word-length           simulation   hardware
ADC   filter   DAC    result       cost     comment
16    16       16     pass                  uniform word-length
14    14       14     fail                  determination
15    15       15     fail
14    64       64     pass                  group ADC minimum
12    64       64     pass                  word-length determination
10    64       64     fail
11    64       64     pass
64    14       64     fail                  group filter minimum
64    15       64     fail                  word-length determination
64    64       14     pass                  group DAC minimum
64    64       12     pass                  word-length determination
64    64       10     fail
64    64       11     pass
11    16       11     fail         8054     minimum word-length
12    16       11     fail         8254     exhaustive search
11    16       12     fail         8256
11    17       11     fail         8281
13    16       11     fail         8454
12    16       12     fail         8456
11    16       13     fail         8458
12    17       11     pass         8481     determined word-length
The search sequences for the first order recursive filter are illustrated as an example. The hardware cost of the components for implementing a DSP system, such as the ADC, adders, multipliers, and delays, has to be known for the minimum cost word-length optimization of the system. For example, when the price of the ADC is high, it may be justified to minimize the word-length of the ADC by increasing the word-lengths of the multipliers and adders. Adders, constant coefficient multipliers, and delays are modeled to require a number of gates proportional to their word-lengths. The number of gates for a multiplier depends on the product of its two input word-lengths. Note that the hardware cost model depends not only on the process technology but also on the implementation architecture. When a signal processing algorithm is implemented in a time-domain multiplexed mode, the per-bit cost of shared arithmetic blocks can be lowered. Since the hardware cost model is stored in an external file, a designer can assign an appropriate value to each component according to the implementation architecture and the ASIC library. In this example, a hardware cost model using VLSI Technologies' cell library is used, and a hardware cost of 200 and 202 is assumed for each bit of the ADC and the DAC, respectively. According to these models, the hardware cost increase per bit is 200 for group ‘ADC,’ 227 for group ‘filter,’ and 202 for group ‘DAC.’ In the exhaustive search algorithm, the word-length vector that requires the least cost is selected first for testing, starting from the minimum word-length vector.
For example, the word-length for group ‘ADC’ is increased first, as shown in Table 6. If the test fails, the word-length for group ‘DAC’ is increased instead of group ‘ADC.’ The search sequence using the exhaustive algorithm is shown in Table 6. The minimum cost word-length vector determined is (12, 17, 11), and the hardware cost required is 8481. Although the result is the minimum cost solution, this method is combinatorial in the number of groups and is practical only when the number of groups is small, usually less than 6. In the first heuristic search algorithm, all the word-lengths are increased by one, e.g., the word-length vector is increased to (12, 17, 12) from the minimum word-length vector (11, 16, 11), and the fixed-point performance is measured. If the result is not satisfactory, the above procedure is repeated. This step typically requires one to two simulations. Then, the word-length of the group whose cost saving would be the greatest is decreased. If the result does not satisfy the performance, the word-length is restored. This procedure is conducted for all the groups and, as a result, requires N simulations. The maximum number of simulations required for the heuristic search based word-length optimization, including the minimum word-length determination, is only linearly proportional to the number of groups, N. The word-length vector determined for the first order digital filter of Fig. 18 is (12, 17, 11), which is the same as the minimum cost word-length. According to experiments using 8 examples, the additional hardware cost is usually less than 5% of that required by the exhaustive search algorithm. In the second heuristic search algorithm, the word-length vector that has the best ratio of performance increment to cost increase is selected. The word-length of each group is increased by one bit, the fixed-point performance is measured by simulation for each case, and the best result is selected for the next iteration until the desired fixed-point performance is satisfied. Since the fixed-point performance increment is used in this method, this heuristic can only be applied when the fixed-point performance can be quantified, as with the SQNR. When the fixed-point performance is tested with a binary decision, e.g., pass or fail, this method cannot be used. The search sequence of this heuristic method for the first order recursive filter is shown in Table 7.

Table 7 Search sequences of the performance/cost directed heuristic method.

word-length          performance   hardware
ADC   filter   DAC   SQNR (dB)     cost     comment
11    16       11    34.27         8054     minimum word-length
12    16       11    38.56         8254     selected
11    17       11    34.83         8281
11    16       12    34.59         8256
13    16       11    39.65         8454
12    17       11    40.36         8481     determined word-length
12    16       12    37.43         8456
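The first heuristic can be summarized in a short C++ sketch; simulate() is a hypothetical interface standing in for the fixed-point simulator and performance test, and the per-group decrement is shown in plain group order rather than in order of cost saving for brevity:

    #include <vector>

    // First heuristic of Section 5.3.4: grow all word-lengths together until
    // the performance test passes, then try to take one bit back per group.
    std::vector<int> heuristic_search(std::vector<int> w,  // minimum WL vector
                                      bool (*simulate)(const std::vector<int>&))
    {
        while (!simulate(w))                          // uniform increment step
            for (std::size_t i = 0; i < w.size(); ++i)
                ++w[i];

        for (std::size_t i = 0; i < w.size(); ++i) {  // per-group decrement
            --w[i];
            if (!simulate(w))
                ++w[i];                               // restore on failure
        }
        return w;
    }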
5.4 Optimization Example

The impulse response of a channel can be identified by using an adaptive digital filter which adjusts its coefficients to minimize the channel modeling error. When the word-length of the filter coefficients or the input data is not sufficient, the system not only converges slowly, but the magnitude of the error also becomes large. Thus, a power estimation block that measures the average power of the error signal after an initial convergence period is used as the fixed-point performance measure [22].
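This performance measure can be illustrated with a small simulation sketch. The code below is illustrative only: the tap count, step size, settling period and quantization points are assumptions, not the actual system of Fig. 19.

# Sketch of the fixed-point performance measure for the LMS example:
# run the adaptive filter with quantized coefficients and report the
# average error power after an initial convergence (settling) period.
import random

def quantize(x, frac_bits):
    return round(x * 2**frac_bits) / 2**frac_bits

def lms_error_power(frac_bits, taps=8, mu=0.05, n=4000, settle=2000):
    channel = [quantize(random.uniform(-1, 1), 15) for _ in range(taps)]
    w = [0.0] * taps                     # adaptive coefficients
    x_hist = [0.0] * taps                # input delay line
    acc = 0.0
    for k in range(n):
        x_hist = [random.uniform(-1, 1)] + x_hist[:-1]
        d = sum(c * x for c, x in zip(channel, x_hist))   # channel output
        # coefficients are quantized at use to model limited word-length
        y = sum(quantize(wi, frac_bits) * xi for wi, xi in zip(w, x_hist))
        e = d - y
        w = [wi + mu * e * xi for wi, xi in zip(w, x_hist)]
        if k >= settle:                  # accumulate error power only
            acc += e * e                 # after the settling period
    return acc / (n - settle)

random.seed(0)
print(lms_error_power(frac_bits=13))     # smaller with more fractional bits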
[Figure: block diagram with blocks SIGNAL SOURCE, CHANNEL, two ADCs, LMS_8tap, PWR_EST, and SIGNAL SINK.]
Fig. 19 An LMS channel identification system
When the automatic grouping rules are applied, the coefficient in each tap becomes a separate group. In order to reduce the number of groups, the coefficients are manually grouped by assigning a fixed-point attribute of 'coef.' The grouping result for each tap is shown in Fig. 20.
[Figure: one tap of the LMS adaptive filter, with signals tap_in, tap_out (group u005<64,31,t>), sum_in, sum_out (group u001<64,31,t>), coefficient signals coef<64,31,t>, the error input, and z^{-1} delays.]
Fig. 20 Grouping result for one tap of LMS adaptive filter.
The 'tap_in' and 'tap_out' signals are automatically grouped because they are connected by a delay, and are assigned to 'u005.' The 'sum_in' and 'sum_out' signals
are also grouped, and assigned to the 'u001' group. The integer word-length, minimum word-length, and optimum word-length for each group of signals are shown in Table 8. The optimization results are compared for three search algorithms: uniform word-length, exhaustive search, and heuristic search based on uniformly increasing the minimum word-length vector. The uniform word-length optimization requires 13 bits for all the signals and, as a result, requires 96,265 gates of hardware. The exhaustive and heuristic search algorithms, both based on the minimum cost criterion, require 46,278 and 49,520 gates, respectively. The hardware cost of the minimum cost optimization is thus just 48% of that needed by uniform word-length determination in this example. The number of simulations needed to search for the optimum word-length vector, starting from the minimum word-length vector, is 27 and 5 for the exhaustive and heuristic search algorithms, respectively.

Table 8 Determined fixed-point attributes for the LMS adaptive filter

group  signal         integer word-length  minimum word-length  optimum word-length
u001   sum            3                    12                   12
u005   tap            2                    7                    7
u002   error          -4                   5                    5
coef   coefficient    1                    13                   13
u004   ADC (channel)  3                    8                    10
u006   ADC (source)   2                    7                    13
6 Summary and Related Works

Fixed-point hardware or integer arithmetic based implementation of digital signal processing algorithms is important not only for hardware cost minimization but also for power consumption reduction. The conversion of a floating-point algorithm into a fixed-point or integer version has been considered very time-consuming; it often takes more than 50% of the algorithm-to-implementation effort for hardware or software [25]. The fixed-point format discussed in Section 2 bridges the gap between the floating-point and fixed-point data types, and allows seamless and incremental conversion. Note that incremental conversion helps a designer greatly, because the effects of each small conversion step can readily be verified. There are other fixed-point formats used for similar purposes; one of them is the Q format [2]. In the Q format, the numbers of integer and fractional bits are given, while the sign is usually implied as two's complement. For example, Q1.30 describes fixed-point data with 1 integer bit and 30 fractional bits, stored as a 32-bit two's complement number. The Q format has been widely used for assembly programming of Texas Instruments digital signal processors.
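As a concrete illustration of the Q format, the following minimal sketch converts between floating-point values and Q1.30 and multiplies two fixed-point numbers. The rounding and saturation policies shown are common choices, not mandated by the format.

# Minimal sketch of Q-format (two's complement fixed-point) conversion.
# Q1.30: 1 sign bit, 1 integer bit, 30 fractional bits in a 32-bit word.
FRAC_BITS = 30

def to_q(x, frac_bits=FRAC_BITS, word=32):
    """Quantize a float to a Q-format integer (round to nearest, saturate)."""
    v = int(round(x * (1 << frac_bits)))
    lo, hi = -(1 << (word - 1)), (1 << (word - 1)) - 1
    return max(lo, min(hi, v))

def from_q(v, frac_bits=FRAC_BITS):
    return v / (1 << frac_bits)

def q_mul(a, b, frac_bits=FRAC_BITS):
    """Fixed-point multiply: the 2*frac_bits product is shifted back down."""
    return (a * b) >> frac_bits

a = to_q(0.75); b = to_q(-0.5)
print(from_q(q_mul(a, b)))   # -> -0.375 (up to quantization error)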
Simulation based range estimation and automatic conversion have become more and more popular as the computing power for simulation becomes cheaper. FRIDGE, a fixed-point design and simulation environment, supports an interpolative approach to reduce the range estimation overhead [25]. This tool also supports integer and VHDL code generation paths from a floating-point C program. These days, application-oriented languages, such as Matlab from MathWorks, are frequently used for signal processing application development. The Simulink package for Matlab and SPD (Signal Processing Designer) from CoWare support easy-to-use, GUI (Graphical User Interface) based fixed-point arithmetic [1][4]. With these tools, it is possible to convert a floating-point design into a fixed-point version in an incremental and convenient way. For word-length optimization of VLSI and FPGA based designs, several different search methods have been studied recently [21][10]. For linear systems, word-length optimization can be converted to an integer programming problem by exploiting the additive quantization noise model [9]. A fixed-point optimization flow that conducts high-level synthesis and fixed-point optimization simultaneously has also been developed [19]. As future work, integer code generation for SIMD (Single Instruction Multiple Data) architectures is needed; this requires combined processing of scaling, word-length optimization, and automatic vectorization. Fast word-length optimization of very large systems is also an interesting problem. In addition, word-length optimization and scaling software needs to be integrated into widely used CAD tools; the fixed-point optimization capabilities of current commercial CAD software can hardly meet users' expectations.
References

1. URL http://www.coware.com/products/signalprocessing.php
2. URL http://en.wikipedia.org/wiki/Q_(number_format)
3. Fixed-Point C++ class. URL http://msl.snu.ac.kr/fixim/
4. Simulink. URL http://www.mathworks.com/products/simulink/
5. DSP56KCC User's Manual. Motorola Inc. (1992)
6. TMS320C2x/C2xx/C5x Optimizing C Compiler (Version 6.60). Texas Instruments Inc., TX (1995)
7. TMS320C6x Optimizing C Compiler. Texas Instruments Inc., TX (1997)
8. Catthoor, F., Vandewalle, J., Man, H.D.: Simulated Annealing based Optimization of Coefficient and Data Word-Lengths in Digital Filters. Int. J. Circuit Theory and Applications 16, 371–390 (1988)
9. Constantinides, G., Cheung, P., Luk, W.: Wordlength optimization for linear digital signal processing. IEEE Trans. on Computer-Aided Design of Integrated Circuits and Systems 22(10), 1432–1442 (2003)
10. Han, K., Evans, B.L.: Optimum wordlength search using sensitivity information. EURASIP J. Appl. Signal Process. 2006, 76–76 (2006)
11. Han, K., Olson, A., Evans, B.L.: Automatic floating-point to fixed-point transformations. In: Proc. of the Fortieth Asilomar Conference on Signals, Systems and Computers (ACSSC '06), pp. 79–83 (2006)
12. Jackson, L.B.: On the Interaction of Roundoff Noise and Dynamic Range in Digital Filters. The Bell System Technical Journal, pp. 159–183 (1970)
13. Kum, K.I., Kang, J., Sung, W.: AUTOSCALER for C: an optimizing floating-point to integer C program converter for fixed-point digital signal processors. IEEE Trans. Circuits and Systems II: Analog and Digital Signal Processing 47(9), 840–848 (2000)
14. Kang, J., Sung, W.: Fixed-point C language for digital signal processing. In: Proc. of the 29th Annual Asilomar Conference on Signals, Systems and Computers, vol. 2, pp. 816–820 (1995)
15. Kang, J., Sung, W.: Fixed-point C compiler for TMS320C50 digital signal processors. In: Proc. of 1997 IEEE International Conference on Acoustics, Speech, and Signal Processing, pp. 707–710 (1997)
16. Kim, S., Sung, W.: A Floating-point to Fixed-point Assembly Program Translator for the TMS 320C25. IEEE Trans. on Circuits and Systems 41(11), 730–739 (1994)
17. Kim, S., Sung, W.: Fixed-point optimization utility for C and C++ based digital signal processing programs. IEEE Trans. on Circuits and Systems (to be published)
18. Kum, K.I., Sung, W.: VHDL based Fixed-point Digital Signal Processing Algorithm Development Software. In: Proceedings of the International Conference on VLSI and CAD '93, pp. 257–260. Korea (1993)
19. Kum, K.I., Sung, W.: Combined word-length optimization and high-level synthesis of digital signal processing systems. IEEE Trans. on Computer-Aided Design of Integrated Circuits and Systems 20(8), 921–930 (2001)
20. Micheli, G.D.: Synthesis and Optimization of Digital Circuits. McGraw-Hill, Inc., NJ (1994)
21. Shi, C., Brodersen, R.: Automated fixed-point data-type optimization tool for signal processing and communication systems. In: Proc. of the 41st Design Automation Conference, pp. 478–483 (2004)
22. Sung, W., Kum, K.I.: Word-Length Determination and Scaling Software for a Signal Flow Block Diagram. In: Proceedings of the International Conference on Acoustics, Speech, and Signal Processing '94, vol. 2, pp. 457–460. Adelaide, Australia (1994)
23. Sung, W., Kum, K.I.: Simulation-Based Word-Length Optimization Method for Fixed-Point Digital Signal Processing Systems. IEEE Trans. on Signal Processing 43(12), 3087–3090 (1995)
24. Willems, M., Bürsgens, V., Grötker, T., Meyr, H.: FRIDGE: An interactive code generation environment for HW/SW codesign. In: Proc. of 1997 IEEE International Conference on Acoustics, Speech, and Signal Processing, pp. 287–290 (1997)
25. Willems, M., Bürsgens, V., Meyr, H.: FRIDGE: Floating-point programming of fixed-point digital signal processors. In: Proc. of the International Conference on Signal Processing Applications and Technology (1997)
26. Wong, P.W.: Quantization and roundoff noises in fixed-point FIR digital filters. IEEE Trans. Signal Processing 39, 1552–1563 (1991)
27. Yun, I.D., Lee, S.U.: On the fixed-point error analysis of several fast DCT algorithms. IEEE Trans. Circuits and Systems for Video Technology 3(1), 27–41 (1993)
28. Živojnović, V.: Compilers for Digital Signal Processors. DSP & Multimedia Technology 4(5), 27–45 (1995)
Intermediate Representations for Simulation and Implementation

Jerker Bengtsson
Abstract Simulation and implementation of DSP systems are challenging due to their complex dynamic behavior and requirements on non-functional properties. This chapter presents a number of high-level intermediate representations for the implementation of design tools for parallel DSP platforms, considering modeling of non-constant program behavior; specialized models of computation; scheduling strategies; heterogeneous and hierarchical specifications of systems; and accurate performance analysis for design space exploration and optimization of assignments during a synthesis process. Four different intermediate representations, representative of the state of the art in techniques for simulation and implementation, are presented. The basic structure and usage of these representations are demonstrated with small examples.
1 Background

An intermediate representation (IR) is an abstraction of a program mapped onto a machine. One goal when designing an IR is that it should be independent of the details of a particular source language and should commit to as few details of the target machine as possible. In a typical compile flow, the front-end translates the source language and maps the program to the IR. The back-end then performs target-related optimizations and generates binary code for the target. This chapter introduces intermediate representations suitable for coarse-grained models of computation, for example synchronous dataflow (SDF) and cyclo-static dataflow (CSDF), see Chapter 30. These kinds of models of computation are very suitable for mapping onto parallel hardware and enable analysis of static as well as dynamic execution properties.

Jerker Bengtsson, Halmstad University, Box 823, SE-301 18 Halmstad, Sweden, e-mail: jerker.bengtsson@hh.se
Compared to mapping on single processors, mapping digital signal processing (DSP) applications onto parallel hardware, e.g. reconfigurable processors (Chapter 17) and multicores (Chapter 18), involves several complex steps, including scheduling, synchronization and communication. Furthermore, DSP applications are typically associated with constraints of a non-functional character, such as requirements on timing and power-efficiency. One classical problem is execution time analysis, which requires analysis techniques that consider variable computation and communication times. One complication is that the execution times of the different components of a parallel program are often dependent on the current state of the system (i.e. the state of the hardware, the program and the I/O). The cost, e.g. in execution time, of some particular operation typically includes both a static cost and a dynamic cost. A static cost of an operation is a cost that is independent of the time at which the particular operation is executed, for example the cost for a processor to execute n arithmetic or logical instructions. A dynamic cost of an operation is a cost that depends on the point in time at which the operation is executed; such a cost can, for example, be a core blocking time when multiple cores concurrently use shared resources (e.g. transactions over an interconnection network, or writing to or reading from a global memory). This chapter introduces examples of both untimed and timed intermediate representations. Unlike untimed representations, timed representations include a notion of system time, which enables representation, simulation and analysis of dynamic processing costs. For multi-objective implementation and optimization of DSP software, the IR constitutes an important source for analysis and optimization.

The process of mapping an application onto a parallel target is often divided into several steps. The number of steps and the task performed at each step may depend on the target machine architecture and the choice of implementation strategy. Considering multiprocessor and multicore targets, the mapping (scheduling) process is typically performed in three steps:

1. assigning tasks to processors (assignment),
2. determining the execution order of the tasks assigned to each processor (ordering),
3. determining when each task should be executed (timing).

These steps can be performed independently at either compile time or run time, which results in different scheduling strategies. The fundamental goal of scheduling is to find a schedule of the program that optimizes execution with respect to some performance measure. Often it is the architecture of the target hardware and the specific application domain that determine how efficiently a certain strategy can be implemented. Table 1 shows four different classes of classical scheduling strategies, using the taxonomy given by Lee and Ha [8]. This taxonomy does not provide exact bounds between all possible scheduling strategies, but it demonstrates typical combinations of compile-time and run-time decision making in the scheduling process. The scheduling strategy is particularly important to consider when analyzing non-functional properties of a program's execution, e.g. execution times and power consumption. Naturally, the accuracy of a simulated program execution at the level
Table 1 Scheduling Taxonomy

strategy           assignment  ordering  timing
fully dynamic      run         run       run
static-assignment  compile     run       run
self-timed         compile     compile   run
fully static       compile     compile   compile

Four examples of different classes of scheduling strategies. Scheduling decisions that are made at compile time are denoted compile, and scheduling decisions that are made during run time are denoted run.
of the IR depends on how well the IR models the execution behavior (e.g. task execution order, communication and timing of tasks) on the targeted hardware platform. One of the more popular and efficient scheduling strategies in the DSP domain is the self-timed scheduling strategy, in which processor assignment and task ordering are determined at compile time, see Table 1. One of the advantages of self-timed scheduling is that the execution times of the program do not have to be precisely known. Self-timed scheduling is therefore also considered robust, since small variations in execution times will not affect the functional correctness of the execution [7].

A schedule of a program executed in parallel is typically modeled using graph-theoretic models. One example of such a graph model is the inter-processor communication graph (IPC graph) $G_{ipc}$ [12]. IPC graphs are used to co-model both computation and communication and do not assume execution using global program synchronization. Furthermore, as will be presented later in this chapter, timing models can be added to IPC graphs to enable analysis of dynamic execution properties. Many implementations of intermediate representations for multiprocessor and multicore systems target the self-timed execution strategy.

The rest of this chapter presents four different but fairly closely related types of intermediate representations, which demonstrate state-of-the-art techniques for the representation of parallel systems. First, untimed representations are introduced in Section 2. Two concrete examples of such well-known representations - the system property intervals (SPI) and the FunState representation - are presented. The SPI representation enables representation of changing program properties, such as variable execution times. The FunState representation is an enhancement of SPI which, unlike SPI, includes more powerful means for representing different scheduling strategies. Finally, in Section 3, timed representations for dynamic simulation and analysis of non-functional properties are presented. Two examples of such closely related intermediate representations are introduced - job configuration networks and timed configuration graphs (TCFG).
2 Untimed Representations

There are a number of representations developed for the synthesis and analysis of parallel systems. This section introduces two examples of such closely related representations and how they are used. The first example, System Property Intervals (SPI), is a representation that uses the concept of constraint intervals to model non-constant behavior of a system, for example variable execution times of tasks. The second example, FunState, is based on the concept of functions driven by state machines, which enables the representation of different scheduling strategies. SPI and FunState are closely related and are both examples of representations that can be used for static (untimed) analysis of systems.
2.1 System Property Intervals

The System Property Intervals (SPI) representation was developed to represent the internal design of heterogeneous systems [4]. SPI is an untimed representation which can be used for static timing analysis of a system. The abstract behavior of a system is described in SPI by means of processes communicating via channels, see Figure 1. Each process is associated with a set of parameter intervals that abstractly represent the dynamic behavior of the system. The parameters represent aspects that are of importance for the analysis of a particular behavioral aspect of the application; typical examples are communication behavior, rules for the activation of tasks, and latencies of different resources. The values of the parameters are constrained by intervals in the form of minimum and maximum bounds (hence the name system property intervals). As an example, the execution time is a parameter that is often not constant; it can be specified by means of a best case (lower bound) and a worst case (upper bound).

An SPI model is formally a graph $G = (P, C, E)$ where

• $P$ is the set of process nodes,
• $C = Q \cup R$ is the set of channel nodes, and
• $E \subseteq (P \times C) \cup (C \times P)$ is the set of edges.

A channel $c \in C$ is represented either as a FIFO (first in first out) queue $q \in Q$ or as a register $r \in R$. Processes are data driven, which means that a process is activated when all the input required for its execution is available. To be able to derive rules for when processes are activated, it is necessary to know the amount of data communicated. The communication in an SPI graph is specified by assigning data rates to each input and output channel of every process. The data rate specifies how many tokens are produced or consumed through a channel. A token is an abstraction for the type of data communicated on a particular channel. Each process is associated with a set of input and output data rate constraining intervals; one constraining interval for each of the input and output channels, respectively. An input data rate constraining interval is specified as $R_c = [r_{c,min}, r_{c,max}]$,
where $c$ denotes the specific input channel, $r_{c,min}$ the minimum data rate and $r_{c,max}$ the maximum data rate. Similarly, each output channel is associated with an output data rate constraining interval $S_c = [s_{c,min}, s_{c,max}]$. The rate constraining intervals only specify the maximum and minimum possible rates for a channel, which enables the representation of non-constant data rates. The specific rate produced or consumed on a channel for the $k$th execution of a process is denoted $r_c(k)$. Thus, each input and output channel is associated with a rate sequence $[r_c(1), r_c(2), \ldots, r_c(n)]$ and $[s_c(1), s_c(2), \ldots, s_c(n)]$ respectively, where $n$ is the number of times a process is executed during one repetition of a schedule.
[Figure: SPI graph with process nodes P1, P2, P3 and channel nodes C1, C2, C3.]
Fig. 1 Simple SPI model graph. Process nodes are denoted P and channel nodes are denoted C.
Similarly to the data rate constraining intervals, each process is associated with a latency constraint interval. The latency constraint interval for a process $p$ is specified as $Lat_p = [lat_{p,min}, lat_{p,max}]$. The activation time $t_{act}$ is the time when all data is available and the process is activated. The starting time $t_{start}$ is the time when the input data has been read and the process starts its execution. Finally, the completion time $t_{comp}$ is the time when the output data has been written and the process stops its execution. The execution time for the $k$th execution of process $p$ is $t_{comp,p}(k) - t_{start,p}(k)$ and must be within the interval given by $Lat_p$.

For more complex applications it is desirable to be able to describe different execution modes of processes. For example, a process might configure its next execution depending on its current data input. This kind of process mode change can be managed using a combination of a process mode, a mode tag and an activation function specification. A process mode $m_p$ is abstractly described by means of a latency interval $Lat_p$, an input data rate interval $R_c$ and an output data rate interval $S_c$, for each of its input and output channels respectively. Each process is associated with a non-empty, finite set of process modes $M_p = \{m_{p,1}, \ldots, m_{p,n_p}\}$, where $n_p$ is the number of modes of the process. Consider the SPI graph in Figure 2, where process P1 has $i$ modes, process P2 has $j$ modes and process P3 has only one mode. The process modes for this graph are then described by
Fig. 2 Simple SPI graph with three processes. Process P1 has i modes, process P2 has j modes and process P3 has only one mode.
$m_{P_1,i} = ([lat_{P_1,min}(i), lat_{P_1,max}(i)], [r_{C_3,min}(i), r_{C_3,max}(i)], [s_{C_1,min}(i), s_{C_1,max}(i)])$
$m_{P_2,j} = ([lat_{P_2,min}(j), lat_{P_2,max}(j)], [r_{C_1,min}(j), r_{C_1,max}(j)], [s_{C_2,min}(j), s_{C_2,max}(j)])$
$m_{P_3,1} = ([lat_{P_3,min}(1), lat_{P_3,max}(1)], [r_{C_2,min}(1), r_{C_2,max}(1)], [s_{C_3,min}(1), s_{C_3,max}(1)])$

Note that process P1 has $i$ process mode descriptions and process P2 has $j$ descriptions; for simplicity, only the mode descriptions for the $i$th and $j$th modes of processes P1 and P2, respectively, are shown. Process P3 has only one mode, so only one mode description is required.

In many cases the execution mode of a process to be activated is determined by the content of the input data; that is, the input data for the next execution has to be inspected. This dynamic mode change behavior is represented by mode tags and so-called activation functions. Each process is associated with a set of mode tag production rules $TP_p$. Each of the mode tag production rules maps a specific input mode tag pattern to a specific output mode tag pattern, $tp_p : I_p \to O_p$, see Figure 3. $I_p$ is the finite set of possible input mode tag patterns and $O_p$ is the finite set of possible output mode tag patterns. An input mode tag pattern is a tuple having one entry for each input channel of a process. Similarly, an output mode tag pattern is a tuple having one entry for each output channel.
[Figure: SPI graph with process P2 and channels C1, C2, C3; mode tag production rules: C2.tag = 1 → C2.tag = 2 and C2.tag = 2 → C2.tag = 1.]
Fig. 3 An SPI model graph to the left in the figure and its associated mode tag production rules to the right. If the mode tag on channel C2 has the value 1, then a mode tag with the value 2 is produced on channel C2. Similarly, if the mode tag has the value 2, then a mode tag with the value 1 is produced on channel C2.
The role of the activation functions is to select the next execution mode of a process based on the current input mode tag patterns. An activation function is a finite set
of rules, each of which maps to a finite set of modes $M \subseteq M_p$. A rule is evaluated by means of a function of the number of available input tokens (c.num) and the first available input tag of some input channels of the process (c.tag). A process is activated if and only if at least one rule evaluates to true. To specify the cyclo-static mode change for process P2 in Figure 3, the set of rules is $\Phi = \{\phi_1, \phi_2\}$, where $\phi_1 : (c_2.tag = 1) \to m_{P_2,1}$ and $\phi_2 : (c_2.tag = 2) \to m_{P_2,2}$. Here the modes are activated on the basis of mode tags, while the number of input tokens on channel C1 is not considered.

2.1.1 Specification of Latency Constraints

Latency constraints can be specified for paths of channels. A latency constraint is the time interval within which tokens must be communicated along the path. A path latency constraint is a labeled path in the SPI model graph. A path latency constraint involving $n$ processes is described in the form $(p_1 \to c_1 \to p_2 \to \ldots \to c_{n-1} \to p_n)$. Process $p_1$ is the sending process and process $p_n$ is the receiving process. A latency path including $n$ processes is connected through $n - 1$ channels; such a path can be described in short form as $path = (c_1, \ldots, c_{n-1})$. Similarly to process latencies and channel data rates, path latency constraints are specified by means of intervals. A latency constraint interval is given by $LC_{path} = [t_{lat,min}, t_{lat,max}]$. Figure 4 shows an example of an SPI graph with a cyclic latency path starting and ending at process P1. The latency constraint for this path is given by $LC_{path} = [T_{min}, T_{max}]$.
Fig. 4 Example of a specification of a cyclic latency path that starts and ends at process P1. The dashed edges constitute the path $path = (C_1, C_2, C_3)$.
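To make the SPI abstractions concrete, the following sketch models channels, modes and activation rules for the cyclo-static example of Figure 3. The class and attribute names, and the latency values, are illustrative assumptions and not part of any SPI tool set.

# Illustrative sketch of SPI abstractions: processes with latency/rate
# constraint intervals and modes selected by activation rules over the
# head tag (c.tag) and token count (c.num) of input channels.
from dataclasses import dataclass, field

@dataclass
class Channel:                      # a FIFO queue channel
    tokens: list = field(default_factory=list)   # (tag, value) pairs
    def num(self):                                # c.num
        return len(self.tokens)
    def tag(self):                                # c.tag: first token's tag
        return self.tokens[0][0] if self.tokens else None

@dataclass
class Mode:
    lat: tuple        # latency interval (lat_min, lat_max)
    rates_in: dict    # input channel name -> (r_min, r_max)
    rates_out: dict   # output channel name -> (s_min, s_max)

@dataclass
class Process:
    modes: dict       # mode name -> Mode
    rules: list       # activation rules: (predicate(channels), mode name)
    def activate(self, channels):
        """Return the mode selected by the first true rule, else None."""
        for pred, name in self.rules:
            if pred(channels):
                return self.modes[name]
        return None

# cyclo-static mode change of P2 in Figure 3, mirroring phi_1 and phi_2:
# the head tag on channel c2 selects the next execution mode
c2 = Channel(tokens=[(1, "data")])
p2 = Process(
    modes={"m1": Mode((2, 4), {"c2": (1, 1)}, {"c2": (1, 1)}),
           "m2": Mode((3, 5), {"c2": (1, 1)}, {"c2": (1, 1)})},
    rules=[(lambda ch: ch["c2"].tag() == 1, "m1"),
           (lambda ch: ch["c2"].tag() == 2, "m2")])
print(p2.activate({"c2": c2}).lat)   # -> (2, 4)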
SPI models are very powerful in the sense that it is possible to describe heterogeneous compositions of models of computation. However, representations in the form of arbitrarily specified process network graphs are most often not suitable for analysis. Instead, the typical usage is to start from reduced specifications in one or several more decidable subclasses of process networks, such as parameterized synchronous dataflow (PSDF), cyclo-static dataflow (CSDF) or synchronous dataflow (SDF), which are discussed in Chapter 30. For these more restricted models there exist well-known and proven analysis techniques. It has been demonstrated how CSDF graphs can be transformed (unfolded) to homogeneous
synchronous dataflow (HSDF) graphs, which can then be mapped using the SPI representation. Furthermore, path latency constraints for such models can be analyzed using integer linear programming techniques. Necessary and sufficient conditions for restricting SPI graphs to CSDF behavior, as well as algorithms for unfolding CSDF graphs and latency path constraints, can be found in [13]. There are also other representations which, similarly to SPI, use mixed communication of control information and data. Two examples are Huss' codesign model CDM [3] and Eles' conditional process graph [5].
2.2 Functions Driven by State Machines

One of the disadvantages of SPI is that it is not possible to represent scheduling strategies. The FunState representation is a later enhancement of SPI which also enables the verification and representation of scheduling strategies, as introduced in Section 1. In contrast to SPI, where processes are autonomously controlled in dataflow style, FunState models dataflow separately from control flow: a FunState model is a network of functions driven by a state machine. Furthermore, hierarchical FunState networks and state machines can be represented. This section presents the basics of FunState and some basic examples of how it can be used to represent different scheduling strategies.

A FunState graph is a bipartite graph $G = (F, S, E)$ where $F$ is the set of functions (corresponding to processes in SPI), $S$ is the set of storage units (corresponding to channels in SPI), and $E$ is a set of directed edges. No edges connect two storage units or two functions.
[Figure: FunState network with functions f1, f2, f3 and queues q1, q2, q3, q4; the state machine has one state and the transitions q2# ≥ 4 ∧ q4# ≥ 3 / f1, q1# ≥ 1 / f2 and q3# ≥ 2 / f3.]
Fig. 5 An example of a simple FunState graph. The upper part shows the network of functions and its storage units. The lower part shows the finite state machine, here having one state and three possible transitions. The number of initial tokens in the storage units is represented by filled dots; no filled dots in a storage unit means zero initial tokens.
Figure 5 shows an example of a FunState model. The upper part of the figure shows the network of functions (f1, f2, f3) and its storage units (q1, q2, q3, q4), initialized with 1, 2, 0 and 3 tokens respectively. The lower part shows the finite state machine, which in this case has only one state and three possible transitions. Note that the number of states is not identical to the number of functions in the network.

There are two types of storage units: queues and registers. A queue is FIFO ordered and its length is unbounded. The number of tokens available in a queue is denoted q#, and q$i denotes the value of the i'th token in the queue. The number of tokens and their values are part of the system state. Registers are arrays of pairs (address, value); a register is denoted r and its values are denoted r$1, r$2, ..., r$n, where n is the id. Unlike the contents of queues, register values are constants.

Functions f ∈ F operate on tokens and values. Each input and output of a function is associated with a variable, which denotes the number of tokens consumed ($c_i$) or produced ($p_i$) on the input and output channels respectively. Production and consumption rates are specified by non-negative integer values. A variable evaluates to either a constant value or a random process value. Figure 6 shows a function (f3) that consumes c tokens from the queue q1 and 3 tokens from the register r1. It produces some number of tokens in the interval [1, 4] to queue q2 and p tokens to register r2.
Fig. 6 Example of a function f communicating via queues q and registers r. Consumption and production rates can be specified as constants (input rate 3 from r1), variables (input rate c from q1 and output rate p to r2) or by means of a random interval (output rate [1, 4] to q2).
State machines are used to specify the activation of functions and the states of a system. Transitions between states are specified by means of conditions and actions, see Figure 7. It is possible to specify conditions concerning only the number of tokens in some queue, for example q# > u, where u is some variable. Conditions can also concern the value of some queued token, for example q$2 < 1.5. An action specifies the function that should be activated if the transition is taken, see Figure 7.

When the execution of a FunState model is started, the current state of the system ($x_c$) is always reset to its initial state ($x_0$). Further, all queues are preloaded with initial tokens and all registers with initial values. During execution of the model, the following steps are repeatedly performed in order:

1. Evaluation of conditions: All conditions specified for the transitions possible from the current state are evaluated.
2. Check possible progress of the system: If no transition is evaluated to be enabled, the execution is stopped.
[Figure: state transition labeled (q# ≥ 3) ∧ (q$2 < 1.5) / f1, f2; the condition precedes the slash and the action follows it.]
Fig. 7 Example of a specification of conditions and actions for state transitions. The activation of functions f1 and f2 occurs only if q contains 3 or more tokens and the value of the second token in q is less than 1.5.
3. State machine reactions: One transition is chosen non-deterministically from the set of enabled transitions whose source is the current state. All functions related to the actions associated with the transition are activated.
4. Fire functions: All functions that have been activated are executed in non-deterministic order. When a function is executed, tokens and values are removed from its input queues and registers, and tokens and values are added to its output queues and registers, respectively. (A minimal code sketch of this execution loop is given below.)

The basic FunState model further allows the construction of hierarchical models including both components and state machines. The semantics of hierarchical FunState models is beyond the scope of this chapter and is not presented here. Furthermore, the underlying computational model of FunState can be described and analyzed using both static graph representations and dynamic state transition diagrams [13]. There are also extensions of FunState, including timed functions and the representation of timing constraints and timing properties [14].

2.2.1 Examples of Representation of Different Models of Computation

In contrast to its predecessor SPI, FunState enables the representation of different input specifications (heterogeneous combinations of different models of computation). Communicating finite state machines are asynchronously operating finite state machines (FSMs) that communicate via FIFO-ordered queues. Figure 8 shows an example of a simple communicating finite state machine model. The FSM M1 of component C1 can send a value to the queue q whenever it makes a transition. The transitions of the FSM M2 can then be guarded using predicates on the value of the first element in q.

Figure 9 shows another example, in which an SDF program is represented. The actors of the SDF program are represented in FunState by functions. Each channel (an edge in the SDF graph) is represented by a queue and two edges (one into and one out of the queue). The number of tokens in the FunState queues is initialized according to the number of initial tokens on the corresponding SDF channels. Each component of C contains a function and an FSM, which has one state and one transition. The condition of the transition corresponds to the consumption rates on all input edges of the dataflow actor, and the action is to activate the firing of the actor.
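Returning to the four execution steps listed above, the following is a minimal sketch of the FunState execution loop (assumed, simplified semantics: a single FSM state and queues only, no registers) for the model of Figure 5. The production and consumption rates used for f1, f2 and f3 are assumptions consistent with the transition conditions, since the figure's full rates are not reproduced here.

# Minimal sketch of the FunState execution loop (steps 1-4 above).
import random

def run_funstate(queues, transitions, max_steps=100):
    """queues: name -> list of tokens;
    transitions: list of (condition(queues), [fire(queues), ...])."""
    for _ in range(max_steps):
        enabled = [t for t in transitions if t[0](queues)]   # step 1
        if not enabled:                                      # step 2
            return
        _, actions = random.choice(enabled)                  # step 3
        for fire in random.sample(actions, len(actions)):    # step 4
            fire(queues)

def mk_fire(cons, prod):
    """Build a firing function with given consumption/production rates."""
    def fire(q):
        for name, n in cons.items():
            del q[name][:n]                 # consume n tokens
        for name, n in prod.items():
            q[name].extend([0] * n)         # produce n tokens
    return fire

# initial tokens of Fig. 5: q1=1, q2=2, q3=0, q4=3
queues = {"q1": [0], "q2": [0, 0], "q3": [], "q4": [0, 0, 0]}
transitions = [
    (lambda q: len(q["q2"]) >= 4 and len(q["q4"]) >= 3,      # / f1
     [mk_fire({"q2": 4, "q4": 3}, {"q1": 1})]),
    (lambda q: len(q["q1"]) >= 1,                            # / f2
     [mk_fire({"q1": 1}, {"q2": 2, "q3": 1})]),
    (lambda q: len(q["q3"]) >= 2,                            # / f3
     [mk_fire({"q3": 2}, {"q4": 3})]),
]
run_funstate(queues, transitions)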
Fig. 8 Hierarchical FunState model representing a small communicating finite state machine. The highest level, shown to the left in the figure, represents FSM M1 and FSM M2 by means of two abstract components C1 and C2. To the right, the specifications of the components C1 and C2 are shown.
Fig. 9 A synchronous dataflow graph to the left in the figure and a hierarchical FunState representation using local control to the right. Sub-model C1 of the complete model C is shown rightmost in the figure. The state machine makes a transition to f1 when there are 3 tokens available on q4 and 4 tokens available on q2.
In contrast to SDF, CSDF enables periodic changes of production and consumption rates. The graph to the left in Figure 10 shows a CSDF actor. In the FunState representation, to the right in the figure, the CSDF actor is represented using two functions (f1, f2). The state machine cycles through the different actor states, following the sequence of input token rates specified at compile time for the CSDF graph.

The last example demonstrates the representation of boolean dataflow (BDF) and dynamic dataflow (DDF) programs. BDF and DDF are two extensions of SDF which both enable the specification of data-dependent dataflow. Figure 11 shows switch and select actors in BDF. A select actor selects which of the inputs i1 and i2 should be transferred to the output when the actor is fired. The actor is activated when there is a true or false value (c$1 ∈ {true, false}) available on the control channel c. Similarly, the switch actor transfers the token from input i to either output o1 or output o2, depending on the value on the control channel c. Figure 12 shows a FunState representation of the switch and select actors from Figure 11.
Fig. 10 A CSDF model represented in FunState. The state machine has two states, one for each actor mode, which control the cyclic mode transitions of the CSDF actor.
i2
c
i c
o
o1
o2
Fig. 11 A select actor to the left in the figure and a switch actor to the right. The switch and select actors chooses its input (output) dependently on the boolean value on the control channel c.
The conditions for the state transitions of the select actor are defined as:

c1 : i1# ≥ 1 ∧ c# ≥ 1 ∧ c$1 = true
c2 : i2# ≥ 1 ∧ c# ≥ 1 ∧ c$1 = false

Similarly, the conditions for the state transitions of the switch actor are defined as:

c3 : i# ≥ 1 ∧ c# ≥ 1 ∧ c$1 = true
c4 : i# ≥ 1 ∧ c# ≥ 1 ∧ c$1 = false

Figure 13 shows a non-deterministic merge DDF actor and its FunState representation. The merge actor is activated for firing when there is at least one token present on either of the two input channels (i1, i2).
Fig. 12 Switch and select actors represented in FunState.
[Figure: merge actor with inputs i1, i2 and output o; FunState transitions i1# ≥ 1 / f1 and i2# ≥ 1 / f2.]
Fig. 13 A non-deterministic merge actor in DDF and its representation in FunState.
2.2.2 Representation of Schedules

FunState allows the representation of scheduling strategies. As an example of this, consider the SDF program shown in Figure 5. Only the state machine needs to be replaced to specify an exact static schedule for this graph. A periodic single-processor schedule for the actors f1, f2, f3 in the SDF graph is (f2, f3, f1, f2, f3, f3). Figure 14 shows two different possible representations when implementing this schedule using a static scheduling strategy. The representation to the left is a straightforward implementation, using a state machine whose states and transitions follow the order of this specific schedule. However, the subsequence (f2, f3) occurs twice in the schedule. This redundancy is taken into account in the corresponding representation to the right, where AND composition between parallel state machines is used. Similarly to FunState, many other representations that separate data and control flow have been proposed.
752
Jerker Bengtsson C / f2
/ f2
/ f1
/ f3 / f3
C
/ f3
/ f2
u/s
u/t
/ f3 , u
/s
/ f1 /t
/ f3
Fig. 14 Two different representations of the same static schedule. The representation to the left shows a straightforward representation of the static schedule. The representation to the right shows another representation of the same schedule but using AND compositions of FunState sub models.
Some examples are SDL [11], codesign finite state machines (CFSMs) [1], and approaches combining synchronous dataflow with finite state machines (FSMs) [9][6]. However, in comparison to FunState, many of these other approaches have limited composability, since control and data flow cannot be mixed arbitrarily across the hierarchical levels.
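As a concrete illustration of representing a static schedule, the following sketch encodes the periodic single-processor schedule (f2, f3, f1, f2, f3, f3) as a simple cycling state machine, corresponding to the straightforward left-hand representation of Figure 14; the API is illustrative only.

# Sketch: a static schedule encoded as a cycling state machine.
from itertools import cycle

def make_static_scheduler(schedule):
    """Return a step() function that fires the next actor in the schedule."""
    order = cycle(schedule)
    def step(fire):          # fire: actor name -> firing function
        actor = next(order)
        fire[actor]()
        return actor
    return step

trace = []
fire = {name: (lambda n=name: trace.append(n)) for name in ("f1", "f2", "f3")}
step = make_static_scheduler(["f2", "f3", "f1", "f2", "f3", "f3"])
for _ in range(6):
    step(fire)
print(trace)   # -> ['f2', 'f3', 'f1', 'f2', 'f3', 'f3']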
3 Timed Representations

Most embedded DSP applications have real-time constraints. To provide qualitative mappings with respect to performance and timing requirements, the intermediate representation must be amenable to timing analysis. This requires models representing both scheduling strategies and the timing properties of the target hardware. This section presents examples of timed representations and of how a machine's timing properties can be modeled. The first example introduces timing models for the analysis of IPC graphs. The second example includes a timing model and a machine model for a certain class of multicore processors (arrays of processing tiles).
3.1 Job Configuration Networks

In real-time analysis, applications are typically modeled as a set of real-time jobs. A job in this context is the task graph of a complete application. The SDF model of computation is very suitable for the specification and real-time analysis of DSP applications. A job configuration network is one proposed intermediate representation that can be implemented using process networks [10]. Processes represent processors (cores) and channels represent communication between processors.
3.1.1 Implementation of a Job Configuration Network

Similarly to the SPI and FunState representations, job configuration networks are suitably implemented using process networks, see Chapter 34. Process networks provide a good basis for the functional representation of parallel program execution on multiprocessors and multicores. The construction of a job configuration network starts from a homogeneous synchronous dataflow (HSDF) graph that describes the application. The HSDF graph is then partitioned and mapped onto the job configuration network. Each actor in the HSDF graph is assumed to be annotated with an execution time in the form of a real number; the execution time can be a fixed value, the worst-case execution time, or a variable number.

A job configuration network can be described by a directed graph. The nodes of the graph are processes (denoted P), each of which is assigned a subset of the mapped HSDF application graph. Thus, each process represents a processing tile and the set of computation actors mapped on it. Channels connect actors mapped on different processes. A channel in the job configuration network models unidirectional communication by means of an input buffer, a data connection, a flow-control mechanism and an output buffer. Data from the input buffer (located at the sending side) is pumped through the data connection and is stored in the output buffer (at the receiving side). The flow-control mechanism watches the state of the output buffer in order to prevent overflow. A channel input buffer is associated with a counter counting the number of free places (credits) in the buffer at the output side. Whenever a token is placed on the input of a channel, the credit counter is decremented. The channel is blocked for writing whenever this counter reaches zero. Every time the receiving process removes a token from the channel, a credit is generated and sent back to the sender, and the credit counter is incremented. The time elapsed from the departure of a token at the sending side to the arrival of the token at the receiving side constitutes the data propagation delay, denoted $\delta_D$. Correspondingly, the time for transferring a credit from the receiving side to the sending side is denoted $\delta_C$. In the job configuration network, channels are implemented using specific transfer processes (denoted T). A transfer process implements the delayed transfer of tokens from the input buffer to the output buffer according to $\delta_D$ and $\delta_C$.
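The credit-based flow-control mechanism just described can be sketched as follows; this is illustrative only, with method names and an explicit deliver step standing in for the transfer process as assumptions.

# Sketch of a job configuration network channel with credit-based
# flow control: the sender owns a credit counter sized to the output
# buffer; writes block at zero credits, and every token removed at
# the receiver returns one credit.
from collections import deque

class Channel:
    def __init__(self, buffer_size):
        self.credits = buffer_size    # free places in the output buffer
        self.in_flight = deque()      # tokens traversing the connection
        self.out_buffer = deque()

    def send(self, token):
        """Sender side: blocked (returns False) when credits reach zero."""
        if self.credits == 0:
            return False
        self.credits -= 1
        self.in_flight.append(token)
        return True

    def deliver(self):
        """Transfer process: move one token after propagation delay delta_D."""
        if self.in_flight:
            self.out_buffer.append(self.in_flight.popleft())

    def receive(self):
        """Receiver side: removing a token sends a credit back (delta_C)."""
        if not self.out_buffer:
            return None
        self.credits += 1
        return self.out_buffer.popleft()

ch = Channel(buffer_size=2)
print(ch.send("a"), ch.send("b"), ch.send("c"))   # -> True True False
ch.deliver(); print(ch.receive(), ch.send("c"))   # -> a True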
3.2 IPC Graphs

In order to analyze timing properties such as job throughput, execution time and buffer sizing, the job configuration network is transformed into an IPC graph. An IPC graph models the execution of a job on a parallel processor and can be extended with timing models to become amenable to timing analysis. In contrast to the application graph (HSDF) and the job configuration network (process network), the IPC graph contains two kinds of edges: data edges and sequence edges. Data edges represent data dependencies between actors. Sequence edges represent constraints on the execution order; no data is transferred on the sequence edges. The
sequence edges are used to enforce a specific sequencing of actors; for each individual processing resource and for the network, the sequence edges are used to create cyclic execution paths when analyzing the timing properties of the graph. In addition to the computation actors specified by the application graph, the IPC graph contains data copy actors (read and write actors). The data copy actors represent data transactions between local memories and global memory.
Fig. 15 A simple send receive application in the form of an HSDF graph is shown to the left, and an IPC graph of the same application to the right. Actors P1 and P2 send data to the receiving actor C. Two write actors (W1, W2) model the write operations of the sending actors (P1, P2), and two read actors (R1, R2) model the required read operations of the receiving actor C.
IPC graphs can be implemented by means of an HSDF graph. Every process and every channel of the job configuration graph can separately be translated into an HSDF model (IPC sub-graph); these are then assembled to form the complete IPC graph for the application. Processes are modeled by means of processor cycles, as illustrated by Figure 15, which includes compute actors (P1, P2, C), write actors (W1 and W2) and read actors (R1 and R2) that perform the read and write operations on the channels.

Figure 16 shows two examples of a buffer model. The first example, (a), is a buffer of size 2 without transfer delay. The dashed actors are mapped on processes that read and write to the channel. The edges drawn from left to right are data edges, which model data dependencies. The edges drawn from right to left are sequence edges, modeling the flow-control mechanism. The positions of the sequence edges and the number of initial tokens on each edge depend on the size of the buffer; only two tokens can be on the data edges at the same time. The second
Fig. 16 Two examples of buffer models mapped on an IPC graph. The graph to the left (a) implements a buffer of size 2 without transfer delay. Since there is only one initial credit on the outgoing edges from j = 1 and j = 2 respectively, actor i = 0 will block until there are new credits on the sequence edges. The graph to the right (b) implements a buffer of size 9 with transfer delays ($\delta_C$, $\delta_D$). Each of the sequence edges, from right to left, has 3 initial credits; thus only 3 iterations (9 data tokens produced) are possible until new credits have been placed on the sequence edges.
example, (b), shows a buffer with transfer delay and a buffer size of 9. The transfer delay is implemented by splitting the data and sequence edges and inserting a delay actor in between. Delay actors on data edges are annotated with a delay of $\delta_D$ and delay actors on sequence edges with a delay of $\delta_C$.

3.2.1 Timing Analysis of IPC Graphs

Figure 17 shows the final IPC graph, implemented by means of an HSDF model, for the small producer-consumer job shown in Figure 15. The timing models for the IPC graph assume self-timed execution and that multiple iterations of the graph are allowed to overlap during execution. For IPC graphs with actors annotated with constant (worst-case) execution times, it is possible to obtain measures of the worst-case throughput and lateness of the represented job. However, there is a set of restrictions the IPC graph must comply with:

1. The graph must be strongly connected. For any two actors in the graph, there must exist a directed cycle that contains both actors.
2. The graph must enforce first-in-first-out token production order: the outputs of actors must be produced in the same order as the actors' starting order.

The second restriction is necessary when actors have variable execution times. Consider the case where the execution time of iteration i + 1 of an actor is less than the
execution time of iteration i. Then the token output of instance i + 1 can overtake the output of instance i. This restriction can be enforced by the use of initial tokens on actor cycles: each actor in the graph must be part of at least one cycle that has only one initial token. This restriction is satisfied in the example in Figure 15, where each actor belongs to a processor cycle and every cycle has only one initial token.

A simple cycle in the graph is a cyclic path that cannot contain any actor more than once. The edges of a cycle can be either data edges or sequence edges. The cycle weight is the sum of the execution times of the actors in the cycle, which is a constant value since the actors' execution times are constant. The cycle mean is the cycle weight divided by the sum of all initial tokens on the edges of the cycle. The cycle in the graph that has the largest cycle mean is said to be the critical cycle; its cycle mean is the maximum cycle mean (MCM). The critical cycle constitutes the slowest part of the graph, and it is the cycle that constrains the speed of computation of the whole graph. The timing analysis demonstrated in this section makes use of three important properties of such HSDF graphs: monotonicity, periodicity and boundedness.

Periodicity. For self-timed execution of an IPC graph (HSDF) $G(V, E)$ with fixed execution times, the periodicity is:

$$s(v, k + N) = s(v, k) + p \times N; \quad v \in V,\ k \geq K \qquad (1)$$

where the iteration interval $p$ is the MCM, $k$ is the iteration index, $s(v, k)$ is the starting time of actor $v$, $K$ is the number of iterations required before the self-timed execution enters the "periodic regime", and $N$ is the number of iterations in one period. When the periodic regime has been entered, every actor is guaranteed to start exactly $N$ firings within any half-closed interval of length $p \times N$. Thus, by finding the actors' starting times $s(v, k)$ during one period, it is possible to characterize the periodic execution of the IPC graph. If this information is available, it is also possible to obtain a measure of the graph's lateness.

Lateness. The lateness of a graph is

$$\lambda = \max_{v \in V} \left( \max_{k = n \ldots n+N-1} \left( s(v, k) - p \times k + t(v) \right) \right) \qquad (2)$$

where $t(v)$ is the fixed execution time of actor $v$ and $n$ is an arbitrary integer $n \geq K$. The lateness (maximum latency) is a measure of how late iteration $k$ of the IPC graph, starting from time $p \times k$, can finish.

Boundedness. The upper bound on the completion time of $I$ iterations of the IPC graph (HSDF) is given by

$$\text{HSDF-BOUND}(I) = p \times (I - 1) + \lambda \qquad (3)$$

As mentioned, the value of the iteration interval $p$ can be found by computing the MCM (the critical cycle) of the IPC graph. To compute the lateness $\lambda$, it is necessary
to have a sample of the starting times $s(v, k)$. This means that $K$ and $N$ also have to be known. Furthermore, in case $I < K$, HSDF-BOUND($I$) can be found by simulating $I$ iterations of the graph [10].

Figure 17 shows the result of translating the configuration network for the producer-consumer example in Figure 15 into an IPC graph. There are rules for how to position the backward edges and how to set the number of initial tokens on these edges, but they will not be discussed further here [10]. The cycle (P1, W1, P2, W2) represents the computation of the actors mapped on the first processing tile. Similarly, the cycle (R1, R2, C) represents the actor C mapped on the second processing tile. The transfer actors (T1 and T2) and the delay actors ($\delta_{D1}$, $\delta_{D2}$, $\delta_{C1}$, $\delta_{C2}$) represent the channel connection between the two processing tiles.
Fig. 17 Complete IPC graph for the send receive example in Figure 15.
The IPC graph in Figure 17 is amenable to timing analysis. The average iteration interval $p$ (the MCM) can be determined by finding the critical cycle in the graph. In this case, the cycle (R1, $\delta_{C1}$, T2, $\delta_{D2}$, R2, C) is critical, resulting in an iteration interval of 22.5 time units. One way to improve the performance is to modify the size of the channel's output buffer. The size of the output buffer can be increased by adding extra initial tokens. If the buffer size is increased from 2 to 3, there will be one extra token on the edge (R1, T2). This will reduce the cycle mean, and the new critical cycle will be (R1, R2, C). Note that the structure of the graph has not been changed.
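The MCM computation underlying this analysis can be sketched by brute-force enumeration of simple cycles. The graph below is a stripped-down toy, not the exact graph of Figure 17, and its execution times and token counts are illustrative.

# Sketch: brute-force maximum cycle mean (MCM) of a small IPC graph.
# Each edge carries a number of initial tokens; each actor an execution
# time. Enumerating simple cycles is only practical for small graphs.
from itertools import permutations

def mcm(exec_time, edges):
    """exec_time: actor -> time; edges: (src, dst) -> initial tokens.
    Returns the maximum, over simple cycles, of cycle weight / tokens."""
    actors = list(exec_time)
    best = 0.0
    for n in range(1, len(actors) + 1):
        for cyc in permutations(actors, n):
            pairs = list(zip(cyc, cyc[1:] + cyc[:1]))
            if all(p in edges for p in pairs):
                tokens = sum(edges[p] for p in pairs)
                if tokens:
                    best = max(best,
                               sum(exec_time[a] for a in cyc) / tokens)
    return best

# toy two-tile example (times and token counts are illustrative):
exec_time = {"P": 5, "W": 1, "R": 1, "C": 16}
edges = {("P", "W"): 0, ("W", "P"): 1,       # producer tile cycle
         ("R", "C"): 0, ("C", "R"): 1,       # consumer tile cycle
         ("W", "R"): 0, ("R", "W"): 2}       # channel with 2 credits
print(mcm(exec_time, edges))   # critical cycle (R, C): mean 17.0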
3.3 Timed Configuration Graphs

If the goal were just to abstract away processor-specific details, the intermediate representation could very well be a fairly simple graph data structure. However, in order to analyze non-functional properties for a certain mapping of a graph, both time and the dynamic behavior of the hardware must also be represented. Timed configuration graphs (TCFGs) are closely related to job configuration networks. This section presents how TCFGs are constructed to represent programs mapped on tiled parallel processors. TCFGs can easily be implemented using a hierarchical heterogeneous model consisting of process networks and SDF. Programs are specified using a multi-rate SDF graph, and the processing resources are implemented abstractly using process networks. Furthermore, a machine model is used to describe the computational resources and performance of the processor target [2].
3.4 Set of Models

A program is specified using a multi-rate SDF graph. Each actor in the SDF graph is assumed to be associated with a tuple $\langle r_p, r_m, R_s, R_r \rangle$ where

• $r_p$ is the worst-case execution time, in number of operations,
• $r_m$ is the requirement on memory allocation, in words,
• $R_s = [r_{s_1}, r_{s_2}, \ldots, r_{s_n}]$ is a sequence where $r_{s_i}$ is the number of words produced on output channel $i$ each firing, and
• $R_r = [r_{r_1}, r_{r_2}, \ldots, r_{r_m}]$ is a sequence where $r_{r_j}$ is the number of words consumed on input channel $j$ each firing.

Scheduling algorithms require cost estimates in order to find a qualitative mapping with respect to some optimization objective. For communication resources on parallel processors, the cost typically comprises a static cost for sending and receiving data and a dynamic cost determined by the resource location, the amount of data to be communicated and the current state of the processor. Thus, the accuracy of the cost estimates depends on how well the dynamic overhead associated with communication, synchronization and accesses to off-chip resources can be captured.

The machine model presented here can be used to represent a certain class of array-structured multiprocessors and multicores. The term processing tile is here used to refer to one core or processor in such an architecture. The machine model comprises a set of parameters describing the computational resources and a set of abstract performance functions, which describe the performance of computations, communication and memory transactions. The processing tiles are assumed to be
tightly coupled via a mesh network. Each tile is assumed to have individual instruction sequencing capability. Memory transactions between tile-private and shared memory are managed in software. The resources of such an abstract tile architecture can be described using two tuples, M and F. M consists of a set of parameters describing the resources:

$M = \langle (x, y), p, b_g, g_w, g_r, o, s_o, s_l, c, h_l, r_l, r_o \rangle$

where

• $(x, y)$ is the number of rows and columns of cores.
• $p$ is the processing power (instruction throughput) of each core, in operations per clock cycle.
• $b_g$ is the global memory bandwidth, in words per clock cycle.
• $g_w$ is the penalty for a global memory write, in clock cycles.
• $g_r$ is the penalty for a global memory read, in clock cycles.
• $o$ is the software overhead for initiation of a network transfer, in clock cycles.
• $s_o$ is the core send occupancy, in clock cycles, when sending a message.
• $s_l$ is the latency for a sent message to reach the network, in clock cycles.
• $c$ is the bandwidth of each interconnection link, in words per clock cycle.
• $h_l$ is the network hop latency, in clock cycles.
• $r_l$ is the latency from the network to the receiving core, in clock cycles.
• $r_o$ is the core receive occupancy, in clock cycles, when receiving a message.

F is a set of abstract common functions describing the performance of computations, global memory transactions and local communication as functions of the resources M:

$F(M) = \langle t_p, t_s, t_r, t_c, t_{gw}, t_{gr} \rangle$

where

• $t_p$ is a function evaluating the time to compute a sequence of instructions.
• $t_s$ is a function evaluating the core occupancy when sending a data stream.
• $t_r$ is a function evaluating the core occupancy when receiving a data stream.
• $t_c$ is a function evaluating the network propagation delay for a data stream.
• $t_{gw}$ is a function evaluating the time for writing a stream to global memory.
• $t_{gr}$ is a function evaluating the time for reading a stream from global memory.
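To make the parameter set M concrete, it can be captured in a small C structure. This is only an illustrative sketch: the chapter defines M as an abstract tuple, so the type and field names below are our own.

/* Machine model M: resources of an abstract tiled processor. */
typedef struct {
    int    x, y;  /* number of rows and columns of cores             */
    double p;     /* instruction throughput per core, ops/cycle      */
    double bg;    /* global memory bandwidth, words/cycle            */
    int    gw;    /* global memory write penalty, cycles             */
    int    gr;    /* global memory read penalty, cycles              */
    int    o;     /* software overhead per network transfer, cycles  */
    int    so;    /* core send occupancy, cycles                     */
    int    sl;    /* latency for a message to reach the network      */
    double c;     /* link bandwidth, words/cycle                     */
    int    hl;    /* network hop latency, cycles                     */
    int    rl;    /* latency from network to receiving core, cycles  */
    int    ro;    /* core receive occupancy, cycles                  */
} MachineModel;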
The first step when modeling a specific processor target is to set the values of the parameters of M. The second step is to define the performance functions F(M).

3.4.1 Modeling a Tiled 16-Core Processor

This section demonstrates how the machine model can be configured to model an array-structured parallel processor. It is assumed that the processor has 16 (4 × 4) programmable tiles. Each tile includes a single-issue core and private
data and instruction memory. The tiles are tightly interconnected via a dimension-ordered (x-y), wormhole-routed network. Shared memory consists of four individually addressed memory banks located off-chip. Transactions between tile-local and shared memory are programmed in software, using message passing via the on-chip interconnection network. The parameters for a processor with this configuration can be specified with the following parameter setup:

$M = \langle (4, 4), 1, 1, 1, 6, 2, 5, 1, 1, 1, 1, 3 \rangle$

The core instruction throughput is p operations per clock cycle; thus, for a single-issue core, $p = 1$. The shared memory, consisting of four off-chip DRAMs, is connected to four separate I/O ports. The DRAMs can be accessed concurrently, each having a bandwidth of $b_g = 1$ word per clock cycle. The latency penalty for a DRAM write is $g_w = 1$ cycle, and for a read the latency is $g_r = 6$ cycles. The overhead for initiating a network transfer includes sending a header and possibly an address (when addressing any of the off-chip memories). The software overhead for this initiation is here set to $o = 2$. The on-chip networks are assumed to be register mapped, meaning that after a message header has been sent, the network can be treated as the destination or source operand of an instruction. Ideally, this means that the receive and send occupancy is zero. In practice, when multiple input and output channels are merged and physically mapped on a single network link, data needs to be buffered locally. Here a send and receive occupancy of $s_o = 2$ and $r_o = 2$ will be assumed. Note that this occupancy then also includes the overhead for the data to be read from and written to local memory. The network hop latency is set to $h_l = 1$ cycle per router hop, and the link bandwidth is assumed to be $c = 1$. Furthermore, the send and receive latency is one clock cycle when injecting and extracting data to and from the network: $s_l = 1$ and $r_l = 1$. After configuring the parameters of the machine model, the next step is to specify the performance functions F(M).

Compute

The time required to process the fire code of an SDF actor on a processing tile can be defined as
$t_p(r_p, p) = \frac{r_p}{p}$

which is a function of the requested number of operations $r_p$ (the worst-case execution time associated with each actor) and the core instruction throughput $p \in M$. All instructions except those related to network send and receive operations are counted in $r_p$.

Send

The time required for a processing tile to issue a network send operation can be defined as
$t_s(R_s, o, s_o) = \left\lceil \frac{R_s}{\mathit{framesize}} \right\rceil \times o + R_s \times s_o$
Here send is defined to be a function of the requested amount of words to be sent, $R_s$, the software overhead $o \in M$ when initiating a network transfer, and a possible send occupancy $s_o \in M$. The framesize is a processor-specific parameter that specifies the maximum length, in words, of a message. Thus, the first term of $t_s$ captures the software overhead for the number of messages required to send the complete stream of data. For connected nodes in the SDF graph that are mapped on the same core, it is also possible to represent channels mapped in local memory. In that case $t_s$ is set to zero.

Receive

Similar to send, the time required for a processing tile to issue a network receive operation can be defined as
$t_r(R_r, o, r_o) = \left\lceil \frac{R_r}{\mathit{framesize}} \right\rceil \times o + R_r \times r_o$

Network Propagation

Modeling communication accurately is difficult: on the one hand, high accuracy requires the use of a low machine abstraction level; on the other, the machine abstraction implemented by the model should be abstract enough to be able to model different processor targets. For simplicity, it will here be assumed that network communication is collision free. The network propagation time then consists of a possible network injection and extraction latency at the source and destination, and the propagation time for the router hops on the network. The network propagation time with these assumptions can be defined as
$t_c(R_s, x_s, y_s, x_d, y_d, s_l, h_l, r_l) = s_l + d(x_s, y_s, x_d, y_d) \times h_l + n_{turns}(x_s, y_s, x_d, y_d) + r_l$

Network injection and extraction latency are captured by $s_l$ and $r_l$, respectively. Further, the propagation time depends on the network hop latency $h_l$ and the number of network hops $d(x_s, y_s, x_d, y_d)$, which is a distance function of the source and destination coordinates. Routing turns add an extra cost of one clock cycle. This is captured by the value of $n_{turns}(x_s, y_s, x_d, y_d)$ which, similar to $d$, is a function of the source and destination coordinates.

Shared Memory Read

Each memory bank in the shared memory is assumed to be individually controlled by a memory controller. Reading from a shared memory bank first requires one send operation (the overhead of which is captured by $t_s$), in order to configure the memory controller and set the address of the memory to be read. The second step is to issue a receive operation to receive the memory contents at that address. The propagation time when receiving a stream of data from shared memory at a processing tile is here defined as
$t_{gr}(r_l, x_s, y_s, x_d, y_d, h_l) = r_l + d(x_s, y_s, x_d, y_d) \times h_l + n_{turns}(x_s, y_s, x_d, y_d)$

The memory read latency is not included in this expression; it needs to be accounted for in a memory model that has to be included in the intermediate representation.

Shared Memory Write

Like the memory read operation, writing to global memory requires two send operations: one for configuring the memory controller (setting write mode and address) and one for sending the data to be stored. The time required for streaming data from the sending core to global memory is evaluated by
$t_{gw}(s_l, x_s, y_s, x_d, y_d, h_l) = s_l + d(x_s, y_s, x_d, y_d) \times h_l + n_{turns}(x_s, y_s, x_d, y_d)$

As in the shared memory read function, the memory write latency is left to the memory model that has to be included in the intermediate representation.
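The performance functions can likewise be sketched directly in C. This is only a sketch: the distance function d is written out for the x-y mesh of the example, nturns follows the one-cycle-per-turn rule stated above, the ceilings in t_s and t_r reflect the per-message software overhead, and the stream size parameter of t_c is dropped since it does not appear in the formula.

#include <math.h>
#include <stdlib.h>

/* Hop count between tiles under dimension-ordered (x-y) routing. */
static int d(int xs, int ys, int xd, int yd)
{
    return abs(xd - xs) + abs(yd - ys);
}

/* One routing turn (one extra cycle) when both dimensions change. */
static int nturns(int xs, int ys, int xd, int yd)
{
    return (xs != xd && ys != yd) ? 1 : 0;
}

static double t_p(double rp, double p) { return rp / p; }

static double t_s(double Rs, int o, int so, int framesize)
{
    return ceil(Rs / framesize) * o + Rs * so;
}

static double t_r(double Rr, int o, int ro, int framesize)
{
    return ceil(Rr / framesize) * o + Rr * ro;
}

static double t_c(int xs, int ys, int xd, int yd, int sl, int hl, int rl)
{
    return sl + d(xs, ys, xd, yd) * hl + nturns(xs, ys, xd, yd) + rl;
}

static double t_gr(int rl, int xs, int ys, int xd, int yd, int hl)
{
    return rl + d(xs, ys, xd, yd) * hl + nturns(xs, ys, xd, yd);
}

static double t_gw(int sl, int xs, int ys, int xd, int yd, int hl)
{
    return sl + d(xs, ys, xd, yd) * hl + nturns(xs, ys, xd, yd);
}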
3.5 Construction of Timed Configuration Graphs

A timed configuration graph $G_A^M(V, E)$ represents a synchronous dataflow program A (with annotations $\langle r_p, r_m, R_s, R_r \rangle$ as described in Section 3.4) mapped on an abstract machine $\langle M, F(M) \rangle$. V is the set of nodes and E is the set of edges. There are two types of nodes in the graph: processor nodes, $v_p$, and memory nodes, $v_b$. A processor node represents a set of SDF sub-graphs of A mapped on a processing tile. Memory nodes represent buffers mapped in shared memory. The set of edges, E, represents the configuration of the on-chip network. When constructing a timed configuration graph $G_A^M(V, E)$, the SDF graph A is first clustered into sub-graphs. Each SDF sub-graph is then assigned to a processing tile in M during the scheduling step. The edges of the SDF graph that end up inside a node of type $v_p$ will be implemented using local memory, so they do not appear as top-level (visible) edges in $G_A^M$. The edges of the SDF that reach between pairs of nodes not belonging to the same SDF sub-graph can be mapped in two different ways:

1. as a network connection between the two processing tiles; such a connection is represented by an edge;
2. as a buffer in global memory; in this case, a memory node is introduced.
When $G_A^M$ has been completely constructed, each $v_p, v_b \in V$ and $e \in E$ has been assigned costs for computation, communication and memory reads and writes, respectively. The costs are calculated using the parameters of M and the performance functions F(M) (Section 3.4). These costs comprise the static part of the costs, relative to the current time and the current state of the system, when computing the total cost for executing an application. The graph to the left in Figure 18 shows a simple multi-rate SDF graph (A). The graph to the right in the figure shows one possible TCFG for this same SDF graph ($G_A^M$). One static actor firing schedule for A in this example is 6a 3b 3c 3d e: actor a fires 6 times, actors b, c and d fire 3 times each, and actor e fires once. This schedule is repeated indefinitely. Thus, no runtime scheduling supervision is required, since static code can be generated. The feedback channel from actor c to actor b is buffered in core-local memory. The edge from actor a to actor d is a buffer in shared (off-chip) memory, and the others are mapped as point-to-point connections on the network. The integer values represent the send and receive rates of the channels ($r_s$ and $r_r$), before and after A has been clustered and transformed to $G_A^M$, respectively. Note that these values in $G_A^M$ are the values in A multiplied by the number of times an actor fires, as given by the firing schedule.
[Figure 18 here depicts the multi-rate SDF graph A (actors A–E with their channel rates) and, to its right, the corresponding TCFG $G_A^M$ with clustered nodes 6A, 3B, 3C, 3D and E and correspondingly scaled channel rates.]
Fig. 18 The graph to the right is one possible graph $G_A^M$ for the application graph A to the left.
3.5.1 Abstract Interpretation of TCFGs

Dynamic analysis of a timed configuration graph can be performed using abstract interpretation techniques. An interpreter can be implemented by very simple means using process networks. This section briefly demonstrates how such an interpreter has been implemented using process networks. The implementation of timed configuration graphs demonstrated here comprises a heterogeneous and hierarchical dataflow model. The top level is a process network. Nodes representing processing tiles are implemented using specific processor actors (processes), and memory nodes are implemented using specific memory actors (processes). The SDF program, provided as input to the tool, is transformed to a distributed
(clustered) SDF model, where each cluster is embedded in a processor actor. The edges connecting the clusters in the distributed SDF graph are cut and replaced by process network channels, which represent the inter-processor communication. However, apart from this temporary cut, the SDF model is still intact. There are different possible approaches for analyzing the dynamic behavior of a dataflow model. One is to let the model calculate resource and timing properties on itself during execution of the model. Another is to generate an abstract representation of the model. Using abstract interpretation does not require modification of the underlying modeling architecture: the timed configuration graph and the abstract interpreter can be added on top of an existing modeling infrastructure. Furthermore, the approach of using an abstract representation of the mapped SDF graph is likely more beneficial in terms of modeling performance.

Program Abstraction

The firing of each SDF actor is abstracted by means of a sequence, S, consisting of receive, compute and send operations:

$S = t_{r_1}, t_{r_2}, \ldots, t_{r_n}, t_p, t_{s_1}, t_{s_2}, \ldots, t_{s_m}$

The abstract operations have been bound to static costs computed using the machine model M and the defined performance functions F(M), as exemplified in Section 3.4.

Time

There is no notion of global time in a process network. Each of the top-level vertices of $G_A^M$ (processing tiles and memories) is implemented by an individual process. Each of the processor processes has a local clock, t. The clock t is stepped by means of (unequal) time segments. The length of a time segment corresponds to the static cost bound to a certain operation in S, and possibly a dynamic cost (blocking time) when issuing send or receive operations addressed to other cores or shared memories.

Processing tiles and memory

A processor process (processing tile) implements a state machine. For each processor process that is firing, the current state is evaluated and then stored in the history. The history is a chronologically ordered list describing the state evolution from time t = 0. Memory processes, $v_b$, model competing memory transactions using a first-come, first-served policy. A read or write latency is added to the current time of a read or write operation.

States

For each processor process it is recorded during which segments of time computation and communication operations were issued. For each process, the current state maps to a state type $\in$ StateSet, a start time $t_{start}$, and a stop time $t_{stop}$. The state of a vertex is a tuple

$state = \langle type, t_{start}, t_{stop} \rangle$

The StateSet defines the set of possible state types:
StateSet = {receive, compute, send, receiveMem, sendMem}

The value of $t_{start}$ is the start of the time segment corresponding to the currently processed operation, and $t_{stop}$ is the end of the time segment. For all states, $t_{stop}$ corresponds to $t_{start} + \Delta$, where $\Delta$ includes the static cost bound to the particular operation. For send, receive, receiveMem and sendMem, $\Delta$ also possibly includes a dynamic cost (blocking time) while issuing the operations.

Clock synchronization

The timed configuration graph is synchronized by means of discrete events. Send and receive are blocking operations: a read operation blocks until data are available on the edge, and a write operation blocks until the edge is free for writing. During a time segment, only one message can be sent over an edge. Synchronization of time between communicating processes requires two-way communication. Thus, each edge in the mapped SDF graph is represented by a pair of oppositely directed edges in the implementation of $G_A^M$.

Network propagation time

Channels perform no computation. Therefore, delay actors are used to account for propagation times over edges. The delay actor adds the edge weight (corresponding to $t_c \in F(M)$) that has been assigned to each top-level edge during the construction of $G_A^M$.

Program interpretation

The core interpreter makes state transitions depending on the current operation, the associated static cost of the operation, and whether send and receive operations block or not (the dynamic cost). A state-generating function takes timing parameters as input and returns the next state $\in$ StateSet. In this example implementation of the TCFG, the network is modeled on a relatively high level, at which communication is represented using point-to-point channels. Thus, the model does not capture message concurrency and buffer capacity of the network. This is one example of a design trade-off made to keep the network abstraction at a high level, for reasons of modeling performance and higher generality of hardware representation. These problems can be solved, but at the price of a lower-level implementation of the intermediate representation.
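As a concrete illustration of the state bookkeeping described above, the state record kept per processor process might look as follows in C. This is a sketch with names of our own choosing.

/* Element of StateSet. */
typedef enum {
    RECEIVE, COMPUTE, SEND, RECEIVE_MEM, SEND_MEM
} StateType;

/* One recorded state = <type, t_start, t_stop>; the history is a
 * chronologically ordered list of such records. */
typedef struct {
    StateType type;
    double    t_start;  /* start of the time segment                     */
    double    t_stop;   /* t_start plus static cost (and blocking time)  */
} State;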
4 Summary

Implementation of DSP applications on parallel hardware is a complex task typically involving many design constraints. Most DSP systems are embedded real-time systems, which adds non-functional constraints such as timing. The problem of finding the best-fitting parallel implementation of a program, with respect to the available processor resources and the design constraints, is a non-trivial implementation task. This chapter has presented four examples of different types of intermediate representations useful for the implementation of design tools for DSP
development. The SPI representation allows representation of variable execution properties of programs, but does not provide means powerful enough for the representation of scheduling strategies. The FunState representation is an enhancement of SPI that is more complex but enables specification of different scheduling strategies. Job configuration networks, unlike SPI and FunState, are a timed representation that enables dynamic analysis and simulation. Finally, timed configuration graphs (TCFGs) are similar to job configuration networks and have been designed to enable dynamic analysis and simulation of the execution of multi-rate SDF graphs on parallel processors.
References

1. Balarin, F., Chiodo, M., Giusto, P., Hsieh, H., Jurecska, A., Lavagno, L., Passerone, C., Sangiovanni-Vincentelli, A., Sentovich, E., Suzuki, K., Tabbara, B.: Hardware-software co-design of embedded systems: the POLIS approach. Kluwer Academic Publishers, Norwell, MA, USA (1997)
2. Bengtsson, J., Svensson, B.: Manycore performance analysis using timed configuration graphs. In: IEEE International Conference on Embedded Computer Systems: Architectures, Modeling, and Simulation, pp. 108–117. Samos, Greece (2009)
3. Bossung, W., Huss, S.A., Klaus, S.: High-level embedded system specifications based on process activation conditions. VLSI Signal Processing Systems 21(3), 277–291 (1999)
4. Cieslok, F., Teich, J.: Timing analysis of process models with uncertain behaviour. Tech. rep., Computer Engineering Laboratory (DATE), University of Paderborn (2000)
5. Eles, P., Kuchcinski, K., Peng, Z., Doboli, A., Pop, P.: Scheduling of conditional process graphs for the synthesis of embedded systems. In: Design, Automation and Test in Europe Conference and Exhibition, pp. 132–139 (1998)
6. Grotker, T., Schoenen, R., Meyr, H.: PCC: a modeling technique for mixed control/data flow systems. In: Proceedings of the 1997 European Conference on Design and Test, pp. 482–486. IEEE Computer Society, Washington, DC, USA (1997)
7. Lee, E.A., Goe, E.E., Heine, H., Ho, W.H., Bhattacharyya, S., Bier, J.C., Guntvedt, E.: GABRIEL: a design environment for programmable DSPs. In: Proceedings of the 26th ACM/IEEE Design Automation Conference, pp. 141–146. ACM, New York, NY, USA (1989)
8. Lee, E.A., Ha, S.: Scheduling strategies for multiprocessor real-time DSP. In: Proceedings of the IEEE Global Telecommunications Conference, vol. 2, pp. 1279–1283 (1989)
9. Pankert, M., Mauss, O., Ritz, S., Meyr, H.: Dynamic data flow and control flow in high level DSP code synthesis. In: IEEE International Conference on Acoustics, Speech, and Signal Processing, vol. 2, pp. 449–452. Adelaide, SA, Australia (1994)
10. Poplavko, P., Basten, T., Bekooij, M.J.G., Meerbergen, J.V.V., Mesman, B.: Task-level timing models for guaranteed performance in multiprocessor networks-on-chip. In: Proceedings of the International Conference on Compilers, Architecture, and Synthesis of Embedded Systems, pp. 63–72 (2003)
11. Smith, J.R.W., Reed, R.: Telecommunications Systems Engineering Using SDL. Elsevier Science Inc., New York, NY, USA (1989)
12. Sriram, S., Lee, E.A.: Determining the order of processor transactions in statically scheduled multiprocessors. VLSI Signal Processing Systems 15(3), 207–220 (1997)
13. Thiele, L., Strehl, K., Ziegenbein, D., Ernst, R., Teich, J.: FunState – an internal design representation for codesign. In: ICCAD ’99: Proceedings of the 1999 IEEE/ACM International Conference on Computer-Aided Design, pp. 558–565. IEEE Press, Piscataway, NJ, USA (1999)
14. Thiele, L., Teich, J., Naedele, M., Strehl, K., Ziegenbein, D.: SCF - state machine controlled flow diagrams. Tech. Rep. TIK-33, Computer Engineering and Networks Lab (TIK), Swiss Federal Institute of Technology (ETH) Zurich, Gloriastrasse 35, CH-8092 Zurich (1998)
Embedded C for Digital Signal Processing

Bryan E. Olivier
Abstract The majority of microprocessors in the world do not sit inside a desktop or laptop computer as general-purpose processors, but have a dedicated purpose inside some kind of apparatus, like a mobile telephone, modem, washing machine, cruise missile, hard disk, DVD player, etc. Such processors are called embedded processors. They are designed with their application in mind and therefore carry special features. With the high volume and strict real-time requirements of mobile communication, the digital signal processor (DSP) emerged. These embedded processors feature special hardware and instructions to support efficient processing of the communication signal. Traditionally these special features were programmed through some assembly language, but with the growing volume of devices and software a desire arose to access these features from a standardized programming language. A work group of the International Organization for Standardization (ISO) recognized this desire and came up with an extension of their C standard to support those features. This chapter explains this extension and illustrates how to use it to program a DSP efficiently.
1 Introduction

Traditionally the core algorithms in digital signal processing are programmed in assembly or by using intrinsics, because this gives access to the special features offered by dedicated digital signal processors (DSPs). These algorithms are becoming increasingly complex because of error correction and encryption requirements. This increasing complexity makes the tradition of programming in assembly or with intrinsics more and more of a burden on maintenance and development. Therefore the desire to support such DSP features in the C language, as defined in [7], is growing.

Bryan E. Olivier
ACE Associated Compiler Experts bv., e-mail: [email protected]
With the Embedded C extension as described in the ISO technical report "Extensions for the programming language C to support embedded processors" [8], it has become possible to write such algorithms in portable and standard C without compromising the required performance. This chapter exhibits the support offered by this Embedded C extension and shows some examples of how to use it. The Embedded C extension adds four features to the C language [7]:

• Fixed point types
• Memory spaces
• Named registers
• Hardware I/O addressing
2 Typical DSP Architecture
[Figure 1 here sketches the datapath: an AD converter and generic memory feed, via DMA, an X memory (X _Fract * px;) and a Y memory (Y _Fract * py;); the MUL and ADD units and the register file (_Accum accu;) implement accu += (*px)*(_Accum)(*py);.]
Fig. 1 Typical DSP multiply/accumulate architecture.
Figure 1 shows the essence of architectures found in digital signal processors. The arithmetical unit can perform a multiply-accumulate in one cycle. The operands can be obtained from separate memories, which means both can be loaded from
memory in parallel. This is called a Harvard architecture. Conventional processors have one memory, and both operands would have to be loaded from memory one after another; this is referred to as a von Neumann architecture. The operands are of fixed point type, resulting in less chip surface for the arithmetical operations in comparison with conventional floating point. This architecture is motivated by the often-used finite impulse response (FIR) filter. The core of this algorithm consists of the inner product of two vectors. For performance reasons the elements of these vectors are of fixed point type, and both vectors are stored in two separate memories so they can be accessed simultaneously on a Harvard architecture. One of these memories can be filled with samples taken by an analog-digital (AD) converter. Such samples are also easily represented as fixed point values. Figure 1 uses the Embedded C syntax to show the elements in the architecture. Two pointers px and py are defined to point to two separate memory spaces X and Y. These two memories X and Y contain, for this example, the two vectors, and the multiply/accumulate (MAC) unit multiplies two values and adds the product to an accumulator register. Often these X and Y memories can only be filled with data through some Direct Memory Access (DMA) hardware. This means that these memories are not part of the global generic memory of the processor. If the architecture also supports zero overhead loops, then the inner product can be computed with one cycle for each pair of elements in the vectors. Zero overhead loops are supported by special hardware on the processor. With one instruction this hardware is set up to iterate a predetermined number of times. This way costly branches at the end of a loop body are avoided. The key elements in this architecture are the following:

• Zero overhead loops
• Multiply/accumulate unit
• Separate memories
• Fixed point types
Zero overhead loops can be recognized relatively easily by a compiler, as can the operations for the MAC unit. For these features of the architecture no language extensions are required. It is a lot harder to distribute data appropriately over separate memories, and virtually impossible to recognize fixed point operations written in standard ISO C. Embedded C is an extension of ISO C allowing the programmer to specify these last two elements in the application program in a portable way.
3 Fixed Point Types

As described in Sect. 2, most DSP architectures support fixed point types as an efficient alternative to floating point. Also the samples taken by an analog-digital converter are easily represented as fixed point values. The Embedded C report therefore introduces two new and related arithmetic types, _Fract and _Accum.
The names of the types start with an underscore and a capital to prevent pollution of the name space of existing programs. Such names are reserved by the ISO C standard and should not be used in an application. Embedded C requires however that every compiler environment defines the more natural spellings fract, accum and sat (see below) in the include file stdfix.h. Different sizes of these types can be specified through short and long. These types can also be specified as signed and unsigned, where the unspecified default is signed. These types can also be qualified as _Sat for saturated arithmetic. Saturation means that if the result of an arithmetical operation overflows (underflows), it will stick to the highest (lowest) possible value instead of wrapping around. This is often used in digital signal processing, because the error induced by saturation is smaller than that following from wrap-around.

Type              Value range
signed _Fract     [−1.0, +1.0)
unsigned _Fract   [0.0, +1.0)
signed _Accum     [−n, +n)
unsigned _Accum   [0.0, +m)
Fig. 2 Value ranges of fixed point types. n and m are some powers of 2.
Figure 2 describes the ranges of values that the different fixed point types may have. These ranges include the lower bound but exclude the upper bound. The _Accum types are introduced to safely accumulate a set of _Fract values without the risk of overflow. Only when this accumulator type is converted back to a regular _Fract type does overflow have to be accounted for. It should be noted that even though the _Accum types also come in a saturating flavor, this is hardly ever supported by the hardware. Application programmers are therefore advised to use the saturating accumulator types with care, because arithmetic will often be diverted to expensive emulation routines. Saturation on _Fract types, however, is often very well supported by the hardware.
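As a small illustration of this advice, the following sketch (using the natural spellings from stdfix.h; the function is our own) accumulates in accum and saturates only once, on the way back to fract:

#include <stdfix.h>

/* Accumulate in accum, whose integral bits absorb intermediate carries,
 * and saturate only on the final conversion back to fract. */
fract sum4(const fract v[4])
{
    accum sum = 0.0k;
    int i;
    for (i = 0; i < 4; i++)
        sum += v[i];          /* four terms, each < 1.0: no overflow */
    return (sat fract)sum;    /* clamp to the fract range, no wrap   */
}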
3.1 Fixed Point Sizes

The _Fract types only have a number of fractional bits, and the _Accum types additionally have a number of integral bits. Figure 3 gives the constraints set by Embedded C on these numbers of bits for the various types. Roughly, these constraints state that _Fract is at least as big as short _Fract, and long _Fract is at least as big as _Fract. Similar constraints are stated for the different flavors of _Accum. It also states that a _Fract has the same number of fractional bits as an _Accum, again for all flavors. Finally, it sets some minimum requirements on the numbers of fractional and integral
bits. Such minimum sizes can be relied upon by highly portable applications.

f(short _Fract) ≤ f(_Fract) ≤ f(long _Fract)
i(short _Fract) = i(_Fract) = i(long _Fract) = 0
f(short _Accum) = f(short _Fract) (recommended)
f(_Accum) = f(_Fract) (recommended)
f(long _Accum) = f(long _Fract) (recommended)
i(short _Accum) ≤ i(_Accum) ≤ i(long _Accum)
f(short _Fract) ≥ 7
f(_Fract) ≥ 15
f(long _Fract) ≥ 23 (31 recommended)
i(short _Accum) ≥ 4 (8 recommended)
i(_Accum) ≥ 4 (8 recommended)
i(long _Accum) ≥ 4 (8 recommended)

f(type) is the number of fractional bits, i(type) the number of integral bits of a type.

Fig. 3 Constraints on the various sizes of the fixed point types.
The signed and unsigned versions are required to have the same size. It is however allowed to use the sign bit as an additional fractional or integral bit in the unsigned version. The recommended values and constraints do not have to be honored by a compiler, but doing so improves the portability of applications across different architectures. The actual sizes of an implementation will be highly dependent on the architecture targeted by the compiler. It can be expected that different architectures for the same application domain will converge on the sizes of their fixed point hardware, because the application domain dictates a certain precision.
3.2 Fixed Point Constants

Fixed point constants look just like floating point constants, except for a suffix that specifies the exact type of the constant. The r is used to specify a _Fract and the k an _Accum. The h is used for short and the l for long. The u is used for unsigned. So 0.5uhr specifies an unsigned short _Fract constant, whereas 0.5lk specifies a signed long _Accum. All constants must be within the range of their types, with the exception of 1.0, which can be used to specify the largest possible value of a _Fract type, that is $1.0 - 2^{-f(\_Fract)}$, where f(_Fract) is the number of fractional bits as described in Sect. 3.1. Figure 4 lists all the suffixes for fixed point constants and their implied types.
Meaning of a suffix letter
r   _Fract
k   _Accum
h   short
l   long
u   unsigned

Suffix   Fixed-point type
hr       short _Fract
uhr      unsigned short _Fract
r        _Fract
ur       unsigned _Fract
lr       long _Fract
ulr      unsigned long _Fract

Suffix   Fixed-point accumulator type
hk       short _Accum
uhk      unsigned short _Accum
k        _Accum
uk       unsigned _Accum
lk       long _Accum
ulk      unsigned long _Accum
Fig. 4 The constant suffixes with their implied types.
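For illustration, here are a few constant declarations using these suffixes together with the natural spellings from stdfix.h (a sketch; the values are arbitrary):

#include <stdfix.h>

short fract    a = 0.25hr;  /* short _Fract                       */
unsigned fract b = 0.5ur;   /* unsigned _Fract                    */
long accum     c = 3.75lk;  /* signed long _Accum                 */
fract          d = 1.0r;    /* largest _Fract value, 1 - 2^-f     */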
3.3 Fixed Point Conversions and Arithmetical Operations

Conversions involving fixed point follow rules similar to those for integers in C. This means that the types of the operands determine the type of the operation through a ranking order, and the rounding and overflow behavior are dictated by this destination type. Conversion from a fixed point type to an integral type uses rounding towards zero.
[Figure 5 here diagrams the conversions among short, plain and long fixed point types.]
Fig. 5 Conversion in expressions with mixed sized fixed point types. Convert the smaller to the bigger type.
All sizes of _Fract are ranked as smaller than _Accum. This means that the resulting type of adding a long _Fract to a short _Accum will be short _Accum, thereby losing the precision of the long _Fract. See Fig. 5 for a visualization.
Figures 5 through 9 use brackets around elements of the types to denote that these are optional. However, within a figure such an optional element should be left out either in all occurrences or in none. Boxes represent objects, and a type without a box represents the type in which an operation is performed. Arrows going out of a box denote the types of the two operands. The arrow pointing to a box denotes the type of the result. Conversions from and to fixed point types in mixed expressions follow rules similar to those for other C types, with a few exceptions. If a fixed point type is combined with a floating point type in an operation, then the fixed point type is first converted to floating point. See Fig. 6 for a visualization.
[Figure 6 here diagrams the conversion of a fixed point operand to float or double.]
Fig. 6 Conversion in expressions with mixed floating point and fixed point types. Convert the fixed point type to the floating point type.
If two _Fract types are combined, the one with the lower precision is converted to the type with the higher precision before the operation is performed. An exception is made for combining integral types or _Accum types with _Fract types. In this case the result of the operation is calculated using the values of the two operands with their full precision. The result is rounded back to the required fixed point type dictated by the ranking of the fixed point types. The same holds for relational operators (<, <= etc.). See Fig. 7 for a visualization. The reason for this exception is that the normal conversion rules in C would result in pointless expressions. E.g. 3*0.1r would either result in 0.1r, if 3 were converted to a _Fract, or it would result in 0, if 0.1r were first converted to an integral type. With this exception the result is 0.3r, which was deemed to make more sense. If an unsigned fixed point type is combined with a signed fixed point type, then the unsigned value is first converted to its corresponding signed value, and the result will also have this type. See Fig. 8 for a visualization. If one of the operands in an expression is a saturating fixed point type, then the result is also a saturating fixed point type. See Fig. 9 for a visualization. The efficiency of the machine code generated for such combinations highly depends on the support in the architecture. Most processors with fixed point hardware support do not support such mixed operations, and the generated code will emulate the calculation using regular integral arithmetic.
[Figure 7 here diagrams mixed integral/fixed point operations: both operands enter the operation in full precision, and the result is converted to the resulting fixed point type.]
Fig. 7 Conversion in expressions with mixed integral and fixed point types. Compute the result in full precision and convert to the resulting type.
[Figure 8 here diagrams mixed signedness: the unsigned operand is converted to the corresponding signed type.]
Fig. 8 Conversion in expressions with mixed signedness of fixed point types. Convert the unsigned fixed point type to its corresponding signed type.
[Figure 9 here diagrams mixed saturation: combining a saturating and a non-saturating operand yields a saturating operation and result.]
Fig. 9 Conversion in expressions with mixed saturation of fixed point types. The operation and the result are saturating.
With the exception of bitwise AND (&), OR (|) and exclusive OR (^), all operations on fixed point types are supported.
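A small sketch of the mixed-type rule above: the multiplication of an int and a _Fract is carried out in full precision before rounding to the result type.

#include <stdfix.h>

fract scale(int n, fract x)
{
    /* Full-precision product rounded to _Fract, so scale(3, 0.1r)
     * yields 0.3r rather than 0.1r or 0 (assuming no overflow). */
    return n * x;
}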
3.4 Fixed Point Support Functions

As described in the previous section, all operations on combinations of integral and fixed point types yield a fixed point type. Other combinations are supported through special support functions. The value returned by these functions is defined similarly as for the regular operators: the result value is first calculated using the exact values of the operands and then rounded according to the result type. If the result type is an integral type, then overflow results in undefined behavior. Embedded C also specifies some useful support functions for which there is no operator. Figure 10 describes the various support functions, where the replacement for fx refers to a qualified fixed point type using the suffixes also used for fixed point constants. The integral types involved in these functions use the same qualifiers as the involved fixed point type, unless stated otherwise.

mulifx    Multiply 2 fixed point types resulting in an integral type.
          Example: int mulik(_Accum, _Accum)
divifx    Divide an integral by a fixed point type resulting in an integral type.
          Example: int divir(int, _Fract)
fxdivi    Divide 2 integral types resulting in a fixed point type.
          Example: long _Accum lkdivi(long int, long int)
idivfx    Divide 2 fixed point types resulting in an integral type.
          Example: int idivr(_Fract, _Fract)
          for fx in r, lr, k, lk, ur, ulr, uk, ulk
absfx     Compute the absolute value of a fixed point type.
          Example: long _Fract abslr(long _Fract)
          for fx in hr, r, lr, hk, k, lk
roundfx   Round a fixed point value to a number of fractional bits given by an int parameter.
          Example: long _Fract roundlr(long _Fract, int)
countlsfx Count the number of bits a fixed point value can be shifted without overflow. Always returns an int.
          Example: int countlsr(_Fract)
bitsfx    Compute an integral value with the same bit representation as a fixed point value.
          Example: int bitsr(_Fract), where int is big enough
fxbits    Compute a fixed point value with the same bit representation as an integral value.
          Example: _Fract rbits(int), the inverse of bitsr
strtofxfx Convert a string to a fixed point value.
          Example: short _Fract strtofxhr(const char * restrict nptr, char ** restrict endptr)
          for fx in hr, r, lr, hk, k, lk, uhr, ur, ulr, uhk, uk, ulk

Fig. 10 Various support functions as defined by Embedded C.
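A short usage sketch of some of these functions, following the signatures listed in Fig. 10:

#include <stdfix.h>

void support_demo(void)
{
    int   raw  = bitsr(0.5r);       /* raw bit pattern of 0.5r          */
    fract back = rbits(raw);        /* inverse operation: back == 0.5r  */
    int   room = countlsr(0.25r);   /* shift headroom without overflow  */
    fract q    = roundr(0.333r, 4); /* round to 4 fractional bits       */
    (void)back; (void)room; (void)q;
}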
Figure 11 shows the 4 new conversion specifiers used for the new fixed point types and the standard length modifiers that also apply to the conversion specifiers for fixed point types. The specification of precision and length is similar to that of the floating point formatter f.

Conversion specifiers          Length modifiers
%r   (signed) _Fract           h   short
%R   unsigned _Fract           l   long
%k   (signed) _Accum
%K   unsigned _Accum
Fig. 11 Additional formatting letters for fixed point types.
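Assuming a C library that implements these formatters (such as [10]), printing fixed point values then looks as follows; the precision field works as it does for f:

#include <stdfix.h>
#include <stdio.h>

int main(void)
{
    fract      g = 0.75r;
    long accum e = 12.5lk;
    printf("g = %.4r, e = %.2lk\n", g, e);  /* l selects long _Accum */
    return 0;
}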
4 Memory Spaces

The C language only knows two memory spaces, one for data and one for code. Embedded C facilitates multiple memory spaces for data. As mentioned in Sect. 2, most DSPs can access two memories simultaneously. Embedded C supports such architectures by allowing the programmer to locate objects in special memory spaces, called named memory spaces. Although Embedded C has provisions to declare new named memory spaces in the program code, few compilers will actually support that, and therefore a detailed description is omitted here. Memory spaces must comply with one important rule: any two memory spaces are either disjoint, or one is a subset of the other. The unqualified (generic) memory space must also always be supported. If this were not the case, regular portable C code would suddenly be illegal C in the context of a compiler that only supports named memory spaces. The named memory spaces, however, do not need to be a subset or superset of this generic memory space. Let A and B be two memory spaces. If A is a subset of B, then every pointer to A is compatible with a pointer to B and can be automatically converted. Conversion of a pointer to B to a pointer to A will also be automatic, but the result is undefined if the memory actually pointed to is not part of memory space A. If A and B are disjoint, then such conversions are not legal in Embedded C. Figure 12 shows some legal and one illegal memory space configuration. Arrows are included in the picture representing pointer conversions from one space to another. Regular arrows represent legal conversions. Arrows with a cross represent conversions that a compiler will flag as an error. Dashed arrows represent conversions of which the result is undefined, but which will pass through a compiler. Heap memory management routines, such as malloc, for named memory spaces can be supplied through dedicated routines, but Embedded C does not require such support to exist. Of course heap management in the generic memory space must be supported. Embedded C also stipulates that the standard C library shall only use generic pointers. It should be noted that applications using such named memory spaces are not portable to other architectures. The spaces only affect the data layout, not the values of the data. So it is possible to emulate such an application on some general purpose platform by using an appropriate library and set of include files. Such include files basically use a #define to remove the memory qualifiers and supply emulation routines for the special support functions required to handle such spaces.
[Figure 12 here shows memory space configurations a)–d): named spaces as subsets of the generic space, named spaces disjoint from the generic space, and an illegal configuration where spaces A and B overlap without one being a subset of the other; arrows mark the legal, undefined and illegal pointer conversions.]
Fig. 12 Some examples of legal and illegal memory space configurations.
This is very useful for testing an application on a native workstation. Often the hardware is not available early in the application development, or is hard to operate in a testing environment. The memory space qualifiers are put in the same place as other type qualifiers, for example const and volatile. As a rule of thumb: the closer the qualifier is to the variable identifier, the more it relates to the object itself. There is no difference between placing a qualifier directly before or after the base type, so const int c means the same as int const c. Following are a couple of examples of a pointer p pointing to an int, using named memory spaces X and Y:

int * X p;    /* p in memory X pointing to an int in generic memory */
int X * p;    /* p in generic memory pointing to an int in memory X */
int X * Y p;  /* p in memory Y pointing to an int in memory X       */
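The emulation trick mentioned above can be as simple as the following sketch, where NATIVE_BUILD is an example macro name of our own:

#ifdef NATIVE_BUILD
#define X   /* compile the memory space qualifiers away ...      */
#define Y   /* ... so the code runs in generic memory natively   */
#endif

X int coeff_table[64];  /* X memory on the DSP, generic memory natively */
Y int sample_buf[64];   /* Y memory on the DSP, generic memory natively */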
5 Applications

Any floating point algorithm can be converted to use fixed point for the sake of performance. However, this comes at a price in precision and range. It also requires careful thought about the range of the values involved in the computation, because if too much overflow/saturation takes place, the quality of the results will deteriorate. This section exhibits two examples of such algorithms and explains the analytic aspects involved.
5.1 FIR Filter

The most common DSP algorithms are the finite impulse response (FIR) filter and the fast Fourier transform (FFT). The FIR filter uses the inner product at its core (see below), and the FFT algorithm also has a variation of the inner product (now on complex numbers) at its core. This algorithm is the main motivation for the typical DSP architecture described in Sect. 2. It uses two memories X and Y to stream data through a multiply-accumulate operation.

#define N 16

X _Fract coeff[N] = { 0.7r, ... };

_Fract inner_product(Y _Fract *inp)
{
    int i;
    _Accum sum = 0.0k;
    for (i = 0; i < N; i++) {
        sum += coeff[i] * (_Accum)*(inp++);
    }
    return (_Sat _Fract)sum;
}

The above example sums the products in an _Accum typed variable sum. It is assumed that this summation will not overflow, because N does not exceed what the integral part of the _Accum can represent. Only the final result is saturated back to a _Fract. The cast to _Accum inside the loop makes sure that the multiplication is done in the _Accum type. Without this cast the multiplication would be done in the _Fract type, and if both the coefficient and the input had the exact value −1.0, the result would be 1.0, which is out of range for a regular _Fract, resulting in an overflow. In a normal FIR filter, however, there will not be a −1.0 among the coefficients. In that case the _Accum cast could safely be removed and the multiplication would be performed within the _Fract type.
The example also uses the memory spaces X and Y, from which the coeff constants and the inp samples, respectively, are retrieved. Assuming the target architecture looks like the one described in Sect. 2, every iteration of this loop could thus be performed in one cycle.
5.2 Sine Function in Fixed Point

Another example of converting a floating point algorithm into fixed point is the computation of the sine function.

#define F 5

static _Accum sinfract(_Accum x)
{
    int f;
    _Accum result = 1.0k;
    for (f = 2*F+1; f > 1; f -= 2) {
        result = 1.0k - result*x*x/(f*(f-1));
    }
    result *= x;
    return result;
}

The above algorithm is based on the Taylor expansion of the sine function, evaluated using a Horner scheme:
$\sin(x) = x - \frac{x^3}{3!} + \frac{x^5}{5!} - \frac{x^7}{7!} + \frac{x^9}{9!} - \frac{x^{11}}{11!} + \cdots$
It can be shown that result will always stay in the range of the _Accum if x is in the range $[-\pi, \pi]$. The precision of this approximation however deteriorates dramatically as x approaches $-\pi$ and $\pi$. With a 31-bit precision fixed point and 11 iterations, the error at $\pi$ is around $3.5 \cdot 10^{-4}$. To improve the results one can use the symmetry of the sine function and only use the approximation for $[-\pi/2, \pi/2]$. Still the approximation becomes worse as x approaches $-\pi/2$ and $\pi/2$. For a 31-bit precision fixed point and 11 iterations, the error at $\pi/2$ is around $4.5 \cdot 10^{-8}$. The results can be further improved by using the formula

$\sin(2x) = 2\sin(x)\cos(x)$

The approximations for sin(x) and cos(x) will now only be used for $x \in [-\pi/4, \pi/4]$. For this range only a few iterations are required to obtain high precision. With the
31-bit precision fixed point and 11 iterations, the error is erratic over the whole range and is at most $9 \cdot 10^{-10}$, which is around the precision of the fixed point. Note that if $x \in [-\pi/4, \pi/4]$, all values involved in the computation stay smaller than 1.0, except for the constant 1.0k itself. This suggests that using a _Fract type instead of _Accum would suffice. Experimental results show that replacing the 1.0k constant by the slightly smaller 1.0r hurts the precision only within the precision of the _Fract itself. An optimizing compiler would take x*x out of the loop and compute it only once. It could also avoid the integer multiplication and the complicated mixed-type division (integer and fixed point) by putting the values of 1.0k/(f*(f-1)) in a table. Alternatively, the compiler could simply fully unroll the whole loop. In that case f*(f-1) will be computed at compile time. Standard compiler optimizations can then turn the division by a constant into a multiplication by a constant. The resulting loop then only contains one multiplication and a multiply-subtract operation. If one takes a careful look at the accuracy of the above algorithm, one will see a trend in the error that is the result of the Taylor expansion used. Variations of the above Horner scheme exist, where the constants $1/(f(f-1))$ are slightly changed to obtain a uniform distribution of the error over the whole range of x. As a final note, the CORDIC algorithm is very often used for computing the trigonometric functions, as it requires only addition and shift operations.
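Returning to the range-reduction identity, the following sketch (reusing F and sinfract from the listing above) shows how the approximations are kept inside $[-\pi/4, \pi/4]$; cosfract is a hypothetical companion to sinfract, built from the cosine Taylor series and Horner scheme in the same way:

static _Accum cosfract(_Accum x)
{
    int f;
    _Accum result = 1.0k;
    /* cos(x) = 1 - x^2/2! + x^4/4! - ..., Horner evaluated */
    for (f = 2*F; f > 1; f -= 2) {
        result = 1.0k - result*x*x/(f*(f-1));
    }
    return result;
}

static _Accum sin_reduced(_Accum x)  /* valid for x in [-pi/2, pi/2] */
{
    _Accum h = x / 2;                /* h lies in [-pi/4, pi/4]      */
    return 2 * sinfract(h) * cosfract(h);
}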
6 Named Registers

The C syntax is extended by Embedded C to include a storage-class specifier for named registers. It can be used to identify a variable with a machine register. A variable declared this way has external linkage and static storage. The names of the registers must be reserved identifiers, so they are outside the name space of existing programs. A reserved identifier either starts with two underscores or with one underscore and a capital. For each supported register, only one variable can be associated with this register in every scope. The size of the type of the variable should be big enough to contain the full machine register; otherwise, which part is contained in the variable is implementation defined. The following is an example of a named register declaration, where _SP is an intrinsic name of the compiler referring to the stack pointer:

register _SP volatile void * stack_pointer;

Here stack_pointer is an object stored in register _SP and is a pointer to volatile void.
7 Hardware I/O Addressing

The Embedded C report also contains a section on supporting I/O hardware by a compiler. Apart from standardizing an interface to such I/O hardware, it is an elaborate description of how such support could be engineered. The goal of this part of the Embedded C report is to facilitate the separate development of the support in a compiler and the device driver itself.

void iogroup_acquire(iogroup);
void iogroup_release(iogroup);
void iogroup_map(iogroup, iogroup);
unsigned int iord(ioreg);
unsigned long iordl(ioreg);
void iowr(ioreg, unsigned int a);
void iowrl(ioreg, unsigned long a);
unsigned int iordbuf(ioreg, ioindex_t ix);
unsigned long iordbufl(ioreg, ioindex_t ix);
void iowrbuf(ioreg, ioindex_t ix, unsigned int a);
void iowrbufl(ioreg, ioindex_t ix, unsigned long a);
void ioand(ioreg, unsigned int a);
void ioor(ioreg, unsigned int a);
void ioxor(ioreg, unsigned int a);
void ioandl(ioreg, unsigned long a);
void ioorl(ioreg, unsigned long a);
void ioxorl(ioreg, unsigned long a);
void ioandbuf(ioreg, ioindex_t ix, unsigned int a);
void ioorbuf(ioreg, ioindex_t ix, unsigned int a);
void ioxorbuf(ioreg, ioindex_t ix, unsigned int a);
void ioandbufl(ioreg, ioindex_t ix, unsigned long a);
void ioorbufl(ioreg, ioindex_t ix, unsigned long a);
void ioxorbufl(ioreg, ioindex_t ix, unsigned long a);

Fig. 13 Functions to access I/O hardware.
Specialized I/O hardware is often connected to a processor through memory mapped I/O, that is, it is accessed through a reserved part of the memory space. Other processors, however, have dedicated instructions to communicate with specialized I/O hardware. The programmer of a device driver would prefer to write the driver so that it is portable across all platforms, independent of the way this hardware is accessed by the processor. This part of the Embedded C report aims to facilitate this through a standardized set of functions to access such devices. It also describes the way such an interface can be supported by a compiler. Figure 13 lists all functions that should be supported by a compiler. Central in this interface are the ioreg designators. Basically these are identifiers, one for each register in the I/O device. These identifiers can correspond to an object in the device driver, but they are more often used for token concatenation inside macros. Below some examples are given.
The engineering of I/O hardware support consists of intrinsic compiler support to access I/O registers from the processor, through for example memory mapped I/O or dedicated instructions. Next to that, the compiler offers an include file iohw.h mapping the standardized functions to this intrinsic compiler support. For each combination of I/O device and processor architecture, an include file is written specifying how each ioreg maps onto the intrinsic compiler support. As a result, a device driver can be written in a portable way, since it will only use the functions from the Embedded C report. In the following, two simple examples are given: one for regular memory mapped I/O, which actually does not require any additional intrinsic compiler support, and another for the PowerPC, which uses dedicated instructions to access I/O registers. For the latter, the CoSy style inline assembly is used to implement the intrinsic compiler support.

/* iohw.h */
#define MM_ACCESS(TYPE,ADR) \
        (*((TYPE volatile *)(ADR)))

#define MM_DIR_8_DEV8_RD(ADR) \
        (MM_ACCESS(uint8_t,ADR))
#define MM_DIR_8_DEV8_WR(ADR,VAL) \
        (MM_ACCESS(uint8_t,ADR) = (uint8_t)VAL)

#define iord(NAME) \
        NAME##_TYPE_RD(NAME##_ADR)
#define iowr(NAME,VAL) \
        NAME##_TYPE_WR(NAME##_ADR,(VAL))

/* iohw_device.h */
#define MYPORT1_TYPE_RD MM_DIR_8_DEV8_RD
#define MYPORT1_TYPE_WR MM_DIR_8_DEV8_WR
#define MYPORT1_ADR     0xBEAF
The compiler support is written in the iohw.h part. The macro MM_ACCESS dereferences a memory mapped I/O address. The MM_DIR.. macros map a read and a write operation onto this dereference. The prototypes as specified in Embedded C are mapped onto names in the device driver part by using token concatenation. The device driver part maps them back onto the MM_DIR.. macros. The actual memory mapped address is also specified through token concatenation.

/* iohw.h */
/* Write to a SPR (Special Purpose Register) */
asm void __mtspr(int spr, unsigned int v)
{
    @[ .spr-serialize; .constant spr]
    mtspr @{spr},@{v}
}
/* Read from a SPR (Special Purpose Register) */
asm unsigned int __mfspr(int spr)
{
    @[ .spr-serialize; .constant spr]
    mfspr @{}, @{spr}
}

#define IO_DIR_32_DEV32_RD(ADR) \
        (__mfspr(ADR))
#define IO_DIR_32_DEV32_WR(ADR,VAL) \
        (__mtspr(ADR,(unsigned int)(VAL)))

#define iord(NAME) \
        NAME##_TYPE_RD(NAME##_ADR)
#define iowr(NAME,VAL) \
        NAME##_TYPE_WR(NAME##_ADR,VAL)

/* iohw_device.h */
#define MYPORT2_TYPE_RD IO_DIR_32_DEV32_RD
#define MYPORT2_TYPE_WR IO_DIR_32_DEV32_WR
#define MYPORT2_ADR     311

Above is an example of I/O registers accessed through special instructions on the PowerPC architecture. The main difference with the memory mapped example is that an address in this case is the number of a special purpose register (SPR). The inline assembly routines above contain a clause instructing the compiler to serialize the accesses to the SPRs, and they specify that the number of the SPR must be a constant, as the processor does not support a dynamically computed SPR number. Embedded C also provides methods for dynamic (runtime) initialization of an I/O device, and for grouping a collection of I/O registers to represent a complete device.
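A device driver written against this interface then stays portable; for instance, with the memory mapped iohw_device.h above, the following sketch expands to plain volatile accesses of address 0xBEAF:

void myport1_clear(void)
{
    unsigned int status = iord(MYPORT1);  /* MM_DIR_8_DEV8_RD(0xBEAF)    */
    if (status != 0)
        iowr(MYPORT1, 0);                 /* MM_DIR_8_DEV8_WR(0xBEAF, 0) */
}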
8 History and Future of Embedded C

Embedded C [8] is rooted in DSP-C [4]. DSP-C was and still is used as an industry standard to support features of DSP processors. DSP-C supports fixed point types and memory spaces. The most important difference between DSP-C and Embedded C is that the latter supports the special combinations of integral and fixed point type operations. The DSP-C extension also supports circular buffers, which were not included in the Embedded C report because there were too many different implementations of such buffers. Hardware supporting circular buffers automatically
sets a pointer to this buffer back to the start of the buffer when it runs off the end of the buffer. This is also very useful in the implementation of FIR filters and fast Fourier transforms. ACE Associated Compiler Experts bv developed DSP-C in collaboration with Philips Semiconductors and submitted it for standardization by the ISO working group on C. This group adopted the rationale of DSP-C, and this resulted in Embedded C in February 2004 [8]. The standardized interface to I/O hardware does not seem to have been widely accepted by the industry. Embedded C is supported by the CoSy compiler development system [2], the Byte Craft Limited compilers [9], the EDG front-end [6], the Dinkumware libraries [10] and the Nullstone compiler performance analysis tools [5], and test suites are supplied in Perennial [11] and SuperTest [3]. With these tools a wide range of compiler support is covered. For more information on Embedded C see [1]. The report on the Embedded C extension mentions two features that may be supported in future versions. One is that if an approach for implementing circular buffers in hardware emerges in the industry, then this may be abstracted into an extension of Embedded C. The other is complex fixed point data types, because complex types are used very often in the industry and several DSP algorithms could be described more efficiently using them. As fixed point types are used in the industry to trade precision against speed, every application domain sets its own standard for the number of integral and fractional bits. From a hardware point of view there is very little difference in the complexity of implementing division for integer and for fixed point arithmetic. From a programming point of view, however, there is a big difference: conversions have to be made explicit using shifts, and the programmer must be aware which partitioning of the bits is actually used at every point in the program. This means that the industry would benefit if Embedded C supported every conceivable partitioning between integral and fractional bits.
9 Conclusions

Embedded C removes the need to use intrinsics or assembly to target the fixed point arithmetic found on typical DSP architectures. It thereby facilitates more readable and portable programs. It provides a natural specification of memory spaces, where one would otherwise have to resort to non-portable pragmas. There will still be features of an architecture that may have to be targeted through intrinsics, assembly, pragmas or other compiler-supported extensions. Good examples are complex fixed point types or special Viterbi instructions. A Viterbi instruction takes two arguments and produces two values and a flag; such operations are not easily captured in an expression.
References
1. www.embedded-c.org.
2. ACE Associated Compiler Experts bv. CoSy compiler development system.
3. ACE Associated Compiler Experts bv. SuperTest.
4. ACE Associated Compiler Experts bv. DSP-C, an extension to ISO/IEC IS 9899:1990. www.dsp-c.org, 2005.
5. Nullstone Corporation. www.nullstone.com.
6. Edison Design Group, Inc. www.edg.com.
7. ISO/IEC. International Standard ISO/IEC 9899:1999, Programming languages – C. 1999.
8. ISO/IEC. ISO/IEC TR 18037, Programming languages – C – Extensions to support embedded processors. 2008.
9. Byte Craft Limited. www.bytecraft.com.
10. Dinkumware Ltd. www.dinkumware.com.
11. Perennial. www.peren.com.
Part IV
Design Methods Managing Editor: Ed Deprettere
Signal Flow Graphs and Data Flow Graphs Keshab K. Parhi and Yanni Chen
1 Introduction

Signal processing programs differ from traditional computing programs in that they are non-terminating: input samples are processed periodically (typically with a certain iteration period or sampling period) and the tasks are repeated an infinite number of times. A traditional dependence graph representation of such a program would require an infinite number of nodes. Signal flow graphs and data flow graphs are powerful representations of signal processing algorithms and signal processing systems because they can represent the operations using a finite number of nodes. Signal flow graphs have long been used to analyze the transfer functions of linear systems. Data flow graphs are more general and are used to represent both linear and non-linear systems. They can also be used to represent multi-rate systems that contain rate-altering components such as interpolators and decimators. The z−1 elements in signal flow graphs and the delay elements in data flow graphs describe inter-iteration precedence constraints, i.e., constraints between two tasks of different iterations. A simple edge without any z−1 or delay element represents a precedence constraint within the same iteration; these are referred to as intra-iteration precedence constraints. Both types of precedence constraints describe the causality constraints among different operations. These causality constraints are important in both hardware designs and software implementations. In a hardware implementation, the critical path time is the time needed for satisfying the causality constraints among all tasks within the same iteration, and is a lower bound on the clock period.

Keshab K. Parhi
University of Minnesota, Department of Electrical and Computer Engineering, 200 Union St. S.E., Minneapolis, MN 55455 USA, e-mail: [email protected]

Yanni Chen
Marvell Semiconductor Inc., 5488 Marvell Lane, Santa Clara, CA 95054 USA, e-mail: [email protected]
In a software implementation, the scheduling algorithm must satisfy these causality constraints by satisfying both the intra-iteration and the inter-iteration constraints. Another important property of these flow graphs is that they can be transformed into equivalent forms by high-level transformations such as pipelining, retiming, unfolding, and folding. These equivalent forms have the same input-output characteristics but different sets of constraints. Pipelining and retiming can be used to reduce the clock period of a circuit. These transformations can also be used as a preprocessing step for folding, where the objective is to design time-multiplexed or folded architectures. The unfolding transformation can lead to lower iteration periods in software implementations, as it can unravel some of the concurrency hidden in the original data flow graph. In hardware implementations, unfolding leads to parallel implementations that can increase the effective sample speed while the clock speed remains unaltered. Both pipelined and parallel implementations can be used to reduce power consumption if achieving higher speed is less important. Alternatively, in deep submicron technologies with low supply voltage, pipelining and parallel processing can be used to meet the critical path requirements. This chapter provides a brief overview of signal flow graphs and data flow graphs, and is organized as follows. Section 2 introduces signal flow graphs and illustrates how they are used for deriving transfer functions, either by using Mason's gain formula or by solving a set of equations. This section also illustrates retiming and pipelining of signal flow graphs. Section 3 addresses single-rate and multi-rate data flow graphs, and illustrates how an equivalent single-rate data flow graph can be obtained from a multi-rate data flow graph. Section 4 provides an overview of the unfolding and folding transformations and their applications to hardware design. Section 5 illustrates how the causality constraints imposed by the intra-iteration and inter-iteration constraints are exploited for scheduling in software implementations. Sections 3-5 are adapted from the textbook [1].
2 Signal Flow Graphs

In this section, the notation of the signal flow graph (SFG) is first overviewed. Then two useful approaches, Mason's gain formula and equation solving, are explained in detail for deriving the transfer function of a given SFG.
2.1 Notation

Signal flow graphs have been used for the analysis, representation, and evaluation of linear digital networks, especially digital filter structures. An SFG is a collection of nodes and directed edges [2], where the nodes represent computations or tasks and a directed edge (j, k) denotes a branch originating from node j and terminating at node k. With an input signal at node j and an output signal at node k, the edge (j, k)
denotes a linear transformation from the signal at node j to the signal at node k. An example of an SFG is shown in Fig. 1, where both nodes j and k represent summing operations and the edge (j, k) denotes a unit-gain transformation.

Fig. 1 An example of signal flow graph.
Two types of nodes exist in an SFG: source nodes and sink nodes. A source node is a node with no entering edges; it is used to represent the injection of external inputs into a graph. A sink node is a node with only entering edges; it is used to extract outputs from a graph.
2.2 Transfer Function Derivation of SFG

For a given SFG, there are mainly two approaches to derive its corresponding transfer function. One is Mason's gain formula, which provides a step-by-step method to obtain the transfer function. The other is the equation-solving approach: label each intermediate signal, write down the equation for that signal in terms of the signals it depends on, and then solve the set of equations to express the output signal in terms of the input signal only. Note that the variables used in the signal flow graphs and the equations correspond to frequency-domain representations of signals.

2.2.1 Mason's Gain Formula

First of all, a few useful terminologies in Mason's gain formula have to be defined in relation to an SFG.
• Forward path: a path that connects a source node to a sink node in which no node is traversed more than once.
• Loop: a closed path that does not cross the same point more than once.
• Loop gain: the product of all the transfer functions in the loop.
• Non-touching or non-interacting loops: two loops are non-touching or non-interacting if they have no nodes in common.
In general, Mason's gain formula [3] is stated as follows:
M = Y/X = ( Σ_{j=1}^{N} Mj Δj ) / Δ    (1)

where
• M = transfer function or gain of the system
• Y = output node
• X = input node
• N = total number of forward paths between X and Y
• Δ = determinant of the graph = 1 − (sum of all loop gains) + (sum of products of non-touching loop gains taken two at a time) − (sum of products of non-touching loop gains taken three at a time) + · · ·
• Mj = gain of the j-th forward path between X and Y
• Δj = the value of Δ for the part of the graph remaining after eliminating the j-th forward path, i.e., after removing the loops touching the j-th forward path. If none of the loops remains, Δj = 1.

To illustrate the actual usage of Mason's gain formula, the transfer function for the example SFG shown in Fig. 1 is derived by following the steps below:
1. Find the forward paths and their corresponding gains. Two forward paths exist in this SFG: M1 = G1G2G3 and M2 = G4.
2. Find the loops and their corresponding gains. There are four loops in this example: Loop1 = −G1H1, Loop2 = −G3H2, Loop3 = −G1G2G3H3, Loop4 = −G4H3.
3. Find the Δj. If we eliminate the path M1 = G1G2G3 from the SFG, no complete loops remain, so Δ1 = 1. Similarly, if the path M2 = G4 is eliminated from the SFG, no complete loops remain either, so Δ2 = 1 as well.
4. Find the determinant Δ. Only one pair of non-touching loops exists in this SFG, i.e., Loop1 and Loop2; thus the sum of non-touching loop gains taken two at a time is (−G1H1)(−G3H2). Therefore, Δ = 1 − (sum of loop gains) + (sum of non-touching loop gains taken two at a time) = 1 − (−G1H1 − G3H2 − G1G2G3H3 − G4H3) + (−G1H1)(−G3H2) = 1 + G1H1 + G3H2 + G1G2G3H3 + G4H3 + G1G3H1H2.
5. The final step is to apply Mason's gain formula using the terms found above:

M = Y/X = ( Σ_{j=1}^{N} Mj Δj ) / Δ = (G1G2G3 + G4) / (1 + G1H1 + G3H2 + G1G2G3H3 + G4H3 + G1G3H1H2)    (2)
2.2.2 Equation-solving Based Transfer Function Derivation

As an alternative approach, the equation-solving method follows a different set of steps. Note that the intermediate signals have already been appropriately labeled in Fig. 1.
1. Write down the equations for each labeled signal in terms of the signals it depends on:
Y1 = X − Y H3
Y2 = Y1 − Y3 H1
Y3 = G1 Y2
Y4 = G2 Y3 − Y H2
Y5 = G3 Y4 + G4 Y2
Y = Y5
2. Solve the equations above to derive the relationship between the output node Y and the input node X:
Y = Y5 = G3 Y4 + G4 Y2 = G3 (G2 Y3 − Y H2) + G4 Y2 = G3 (G2 G1 Y2 − Y H2) + G4 Y2 = −G3 H2 Y + (G1G2G3 + G4) Y2
Therefore,

Y = (G1G2G3 + G4) / (1 + G3H2) · Y2    (3)

Note that

Y2 = Y1 − Y3 H1    (4)

By substituting both Y1 and Y3 into (4), we obtain

Y2 = X − Y H3 − Y3 H1 = X − Y H3 − G1 H1 Y2    (5)

Consequently,

Y2 = (X − Y H3) / (1 + G1H1)    (6)

Then, by substituting (6) into (3),

Y = (G1G2G3 + G4) / (1 + G3H2) · (X − Y H3) / (1 + G1H1) = (G1G2G3 + G4)(X − Y H3) / [(1 + G3H2)(1 + G1H1)]

As a result,

[1 + (G1G2G3 + G4)H3 / ((1 + G3H2)(1 + G1H1))] Y = (G1G2G3 + G4) / ((1 + G3H2)(1 + G1H1)) X

Y/X = (G1G2G3 + G4) / [(1 + G3H2)(1 + G1H1) + (G1G2G3 + G4)H3]

Finally, the transfer function between the output node Y and the input node X is

M = (G1G2G3 + G4) / (1 + G1H1 + G3H2 + G1G3H1H2 + G4H3 + G1G2G3H3)    (7)
which is exactly the same as the transfer function derived using Mason's gain formula in (2).
3 Data Flow Graphs

In this section, the notation of the data flow graph (DFG) is introduced, followed by an overview of the single-rate DFG and the multi-rate DFG. How to construct an equivalent single-rate DFG from a multi-rate DFG is then explained in detail. After that, the concepts of retiming and pipelining are briefly introduced for deriving equivalent DFGs.
3.1 Notation

In data flow graph representations, the nodes represent computations, functions, or subtasks, and the directed edges represent data paths (communications between nodes); each edge has a nonnegative number of delays associated with it. For example, Fig. 2(b) is the DFG corresponding to the block diagram in Fig. 2(a). Compared to the SFG in Sec. 2.1, the DFG can be seen as a more generalized form of the SFG in that it can effectively describe both linear single-rate and nonlinear multi-rate DSP systems.
Fig. 2 (a) Block diagram description of the computation y(n) = ay(n − 1) + x(n). (b) Conventional DFG representation. (c) Synchronous DFG representation.
A DFG describes the data flow among the subtasks or elementary computations, modeled as nodes, of a signal processing algorithm. Similar to SFGs, the various DFGs derived for one algorithm can be obtained from each other through high-level transformations. DFGs are generally used in high-level synthesis to derive concurrent implementations of DSP applications on parallel hardware, where subtask scheduling and resource allocation are the major concerns.
3.2 Synchronous Data Flow Graph

A synchronous data flow graph (SDFG) is a special case of data flow graph in which the number of data samples produced or consumed by each node in each execution is specified a priori [4]. For example, Fig. 2(c) is an SDFG for the computation y(n) = ay(n − 1) + x(n); it explicitly specifies that one execution of either node A or node B consumes one data sample and produces one output sample, so the system is single-rate. In addition, an SDFG can describe multi-rate systems in a simple way. For example, Fig. 3(a) shows an SDFG representation of a multi-rate system, where nodes A, B, and C operate at different frequencies fA, fB, and fC, respectively. Note that A processes fA input samples and produces 3fA output samples per time unit. Node B consumes 5 input samples during each execution, hence consumes 5fB input samples per time unit. Using the equality 3fA = 5fB, we have fB = 3fA/5. Similarly, the operating rate of node C can be computed as fC = 2fB/3 = 2fA/5. For a specified input sampling rate, the operating frequencies of nodes A, B, and C can thus be computed. An equivalent single-rate DFG for the multi-rate DFG in Fig. 3(a) is shown in Fig. 3(b); this single-rate DFG contains 10 nodes and 30 edges, as compared to 3 nodes and 4 edges in the SDFG representation. For more details on synchronous data flow graphs, the reader is referred to Chapter 30.
3.3 Construct an Equivalent Single-rate DFG from the Multi-rate DFG

By definition, each node in a single-rate DFG (SRDFG) is executed exactly once per iteration. In contrast, a multi-rate DFG (MRDFG) allows each node to be executed more than once per iteration, and two nodes are not required to execute the same number of times in an iteration. However, an SRDFG can still be used to represent a multi-rate system by first unfolding the multi-rate system to single-rate. An edge from the node U to the node V in an MRDFG is shown in Fig. 4, where OUV is the number of samples produced on the edge by an invocation of the node U, IUV is the number of samples consumed from the edge by an invocation of the node V, and iUV is the number of delays on the edge.
Fig. 3 Multi-rate DFG in (a) can be converted into single-rate DFG in (b), which can then be represented using linear SFG.
Fig. 4 An edge U → V in a multi-rate DFG.
If the nodes U and V are invoked kU times and kV times in one iteration, respectively, then the number of samples produced on the edge from the node U to the node V in this iteration is OUV kU, and the number of samples consumed from the edge by the node V in this same iteration is IUV kV. Intuitively, to avoid a buildup or deficiency of samples on the edge, the number of samples produced in one iteration must equal the number of samples consumed in one iteration. This relationship can be described mathematically as

OUV kU = IUV kV    (8)
An algorithm for constructing an equivalent SRDFG from an MRDFG is described as follows:
1. For each node U in the MRDFG
2.   For k = 0 to kU − 1
3.     Draw a node Uk in the SRDFG with the same computation time as U in the MRDFG
4. For each edge U → V with iUV delays in the MRDFG
5.   For j = 0 to OUV kU − 1
6.     Draw an edge U(⌊j/OUV⌋) → V(⌊(j+iUV)/IUV⌋ % kV) in the SRDFG with ⌊(j + iUV)/(IUV kV)⌋ delays
To determine how many times each node must be executed in an iteration, the set of equations found by writing (8) for each edge in the MRDFG must be solved such that the numbers of invocations of the nodes are coprime. For example, the set of equations for the MRDFG in Fig. 5 is
4ka = 3kb
kb = 2kc
kc = kc
3kc = 2ka
which has the solution ka = 3, kb = 4, kc = 2. Once the number of invocations of the nodes has been determined, an equivalent SRDFG can be constructed for the MRDFG. For the MRDFG in Fig. 5, the equivalent SRDFG is shown in Fig. 6.
Fig. 5 A multi-rate DFG.
3.4 Equivalent Data Flow Graphs

Data flow graphs can be transformed into different yet equivalent forms. Two common techniques to derive equivalent DFGs are introduced here: one is retiming and the other is pipelining. The transfer functions of these equivalent forms are either unaltered or differ only by a factor of z−i, i.e., they may contain i additional delay elements. Note that both the retiming and pipelining transformations can be applied to DFGs as well as SFGs in an identical manner.

3.4.1 Retiming

Retiming [5] is a transformation technique that changes the locations of the delay elements in a circuit without affecting its input/output characteristics. For example, although the DFGs in Fig. 7(a) and Fig. 7(c) have different numbers of delays at
Fig. 6 An equivalent SRDFG for the MRDFG in Fig. 5.
different locations, they share the same input/output characteristics. Furthermore, these two DFGs can be derived from one another using retiming. Retiming has many applications in synchronous circuit design, including reducing the clock period of the circuit by reducing the computation time of the critical path, decreasing the number of registers in the circuit, reducing the dynamic power consumption of the circuit by placing the registers at the inputs of nodes with large capacitances to reduce the switching activity, and logic synthesis. Cutset retiming is a special case of retiming: it only affects the weights of the edges in the cutset, which is a set of edges that can be removed from the graph to create two disconnected subgraphs. If these two disconnected subgraphs are labeled G1 and G2, as depicted in Fig. 7(b), then cutset retiming consists of adding k delays to each edge from G1 to G2 and removing k delays from each edge from G2 to G1. In the case of k = 1, the DFG in Fig. 7(a) can be transformed to the DFG in Fig. 7(c) by using the cutset retiming technique.

3.4.2 Pipelining

Pipelining is a special case of cutset retiming where there are no edges in the cutset from the subgraph G2 to the subgraph G1, as shown in Fig. 8(b); i.e., pipelining applies to graphs without loops. These cutsets are referred to as feed-forward cutsets, where the data move in the forward direction on all the edges of the cutset.
Fig. 7 (a) The unretimed DFG with a cutset shown as a dashed line. (b) The 2 graphs G1 and G2 formed by removing the edges in the cutset. (c) The retimed graph found using cutset retiming with k = 1.
Consequently, registers can be arbitrarily placed on a feed-forward cutset without affecting the functionality of the algorithm. If 2 registers are inserted on each edge, the DFG in Fig. 8(a) can be transformed into the DFG in Fig. 8(c) by applying the pipelining technique. Therefore, complex retiming operations can be described by multiple simple cutset retiming or pipelining operations applied in a step-by-step manner.
Fig. 8 (a) The unretimed DFG with a cutset shown as a dashed line. (b) The 2 graphs G1 and G2 formed by removing the edges in the cutset. (c) The graph obtained by cutset retiming with k = 2.
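Since cutset retiming only adjusts edge weights according to which side of the cut each endpoint lies on, its bookkeeping can be stated in a few lines. The sketch below assumes the node partition is given and applies the ±k rule described above, rejecting a k that would drive any edge weight negative.

struct dfg_edge { int u, v, w; };    /* directed edge with w delays */

/* Cutset retiming: in_g1[n] is nonzero for nodes in subgraph G1.  Adds k
   delays to every G1->G2 edge and removes k from every G2->G1 edge;
   returns 0 if some weight would become negative (k is infeasible). */
int cutset_retime(struct dfg_edge *e, int n_edges, const int *in_g1, int k)
{
    for (int i = 0; i < n_edges; i++) {
        int w = e[i].w;
        if (in_g1[e[i].u] && !in_g1[e[i].v]) w += k;
        else if (!in_g1[e[i].u] && in_g1[e[i].v]) w -= k;
        if (w < 0) return 0;
        e[i].w = w;
    }
    return 1;
}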
The pipelining transformation reduces the effective critical path by introducing pipelining registers along the datapath, which can be exploited either to increase the clock speed or sample speed, or to reduce the power consumption at the same speed. Consider the simple structure in Fig. 9(a), where the computation time of the critical path is 2TA. Fig. 9(b) shows the 2-level pipelined structure, where 1 register is placed between the two adders and hence the critical path is reduced by half.
Fig. 9 (a) A datapath. (b) The 2-level pipelined structure of (a).
Obviously, in an M-level pipelined system the number of delay elements in any path from input to output is (M − 1) greater than that in the same path in the original sequential circuit. While pipelining offers the benefit of critical path reduction, its two drawbacks are the increase in the number of registers and the increase in system latency, which is the difference between the times at which the first output data become available in the pipelined system and in the sequential system.
4 Applications to Hardware Design

In this section, two DFG-based high-level transformations applicable to practical hardware design, such as field programmable gate array (FPGA) or application specific integrated circuit (ASIC) implementations, are introduced: one is the unfolding transformation and the other is the folding transformation. Examples are given to demonstrate their usage in hardware design. For more details on how to map decidable signal processing graphs to FPGA implementations, the reader is referred to Chapter 31.
4.1 Unfolding

Unfolding is a transformation technique that can be applied to a DSP program to create a new program describing more than one iteration of the original program. More specifically, unfolding a DSP program by the unfolding factor J creates a new program that describes J consecutive iterations of the original program. As a result, in the unfolded system each delay is J-slow. Unfolding has applications in designing high-speed and low-power VLSI architectures. One application is to unfold the program to reveal hidden concurrencies so that the program can be scheduled to a smaller iteration period, thus increasing the throughput of the implementation. Another application is to design parallel architectures at the word level and bit level from their serial counterparts to increase the throughput or decrease the power consumption of the implementation.

4.1.1 The DFG Based Unfolding

Two approaches can be used to derive the J-unfolded DFG. One is to write the equations for the original and the J-unfolded programs and then draw the corresponding unfolded DFG. This method can be tedious for large values of J. The other approach is a graph-based technique that directly unfolds the original DFG to create the DFG of the J-unfolded program without explicitly writing the equations describing the unfolded system. For each node U in the original DFG, there are J nodes with the same function as U in the J-unfolded DFG. Additionally, for each edge in the original DFG, there are J edges in the J-unfolded DFG. Consequently, the DFG of the J-unfolded program contains J times as many nodes and edges as the original DFG. The following two-step algorithm can be used to construct a J-unfolded DFG:
1. For each node U in the original DFG, draw the J nodes U0, U1, . . . , U(J−1).
2. For each edge U → V with w delays in the original DFG, draw the J edges Ui → V((i+w)%J) with ⌊(i+w)/J⌋ delays, for i = 0, 1, . . . , J − 1.
Apparently, if an edge has w < J delays in the original DFG, unfolding produces J − w edges with no delays and w edges with 1 delay in the J-unfolded DFG. To demonstrate the unfolding algorithm, the DFG in Fig. 10(b), which corresponds to the DSP algorithm in Fig. 10(a), will serve as an example; the nodes A and B represent input and output, respectively, and the nodes C and D represent addition and multiplication by a, respectively. To unfold the DFG in Fig. 10(b) by an unfolding factor of 2 to obtain the 2-unfolded DFG shown in Fig. 10(c), the two steps of the unfolding algorithm are performed:
1. The 8 nodes Ai, Bi, Ci, and Di for i = 0, 1 are first drawn according to the first step of the unfolding algorithm.
2. After these nodes have been drawn, for an edge U → V such as D → C with no delays, this step reduces to drawing the J edges Ui → Vi with no delays.
Fig. 10 (a) The original DSP program describing y(n) = ay(n − 9) + x(n) for n = 0 to ∞. (b) The DFG corresponding to the DSP program in (a). (c) The 2-unfolded DFG. (d) The 2-unfolded DSP program describing y(2k) = ay(2k − 9) + x(2k) and y(2k + 1) = ay(2k − 8) + x(2k + 1) for k = 0 to ∞.
Additionally, for the edge C → D with w = 9 delays, there are the edges C0 → D((0+9)%2) = D1 with ⌊(0+9)/2⌋ = 4 delays and C1 → D((1+9)%2) = D0 with ⌊(1+9)/2⌋ = 5 delays. Referring to Fig. 10(c), the nodes C0 and C1 in the 2-unfolded DFG represent addition, as does the node C in the original DFG. Similarly, the nodes D0 and D1 in the 2-unfolded DFG represent multiplications, as does the node D in the original DFG. The node A in the original DFG represents the input x(n). The k-th iteration of the node Ai in the unfolded DFG executes the (Jk + i)-th iteration of the node A in the original DFG for i = 0, 1, . . . , J − 1 and k = 0 to ∞. Similarly, the node B0 corresponds to the output samples y(2k) and the node B1 corresponds to the output samples y(2k + 1). Therefore, the 2-unfolded DFG in Fig. 10(c) corresponds to the 2-unfolded DSP program in Fig. 10(d).
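The edge-unfolding rule used above is mechanical enough to state directly as code. The following sketch just prints the unfolded edges and reproduces the worked example: the delay-free edge D → C and the 9-delay edge C → D, both with J = 2.

#include <stdio.h>

/* J-unfolding of one edge U -> V with w delays: copy i of U feeds copy
   (i+w)%J of V with floor((i+w)/J) delays (step 2 of the algorithm). */
void unfold_edge(char u, char v, int w, int J)
{
    for (int i = 0; i < J; i++)
        printf("%c%d -> %c%d with %d delays\n",
               u, i, v, (i + w) % J, (i + w) / J);
}

int main(void)
{
    unfold_edge('D', 'C', 0, 2);  /* two delay-free edges          */
    unfold_edge('C', 'D', 9, 2);  /* C0->D1 (4 delays), C1->D0 (5) */
    return 0;
}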
4.1.2 Applications to Parallel Processing

A direct application of the general unfolding transformation is to design parallel processing architectures from serial processing architectures. At the word level, this means that word-parallel architectures can be designed from word-serial architectures. At the bit level, it means that bit-parallel and digit-serial architectures can be designed from bit-serial architectures.

Word-level Parallel Processing

In general, unfolding a word-serial architecture by J creates a word-parallel architecture that processes J words per clock cycle. As an example, consider the DSP program y(n) = ax(n) + bx(n − 4) + cx(n − 6) shown in Fig. 11(a). To create an architecture that can process more than 1 word per clock cycle, the first step is to draw the corresponding DFG as in Fig. 11(b). The next step is to unfold the DFG into the 3-unfolded DFG of Fig. 11(c) by following the steps described in Sec. 4.1.1. The final step is to draw the corresponding 3-unfolded DSP program as in Fig. 11(d). The exact details are omitted here and left to the reader as an exercise.

Bit-level Parallel Processing

Assume the wordlength of the data is W bits; the hardware implementation could then have the following possible architectures:
• Bit-serial processing: one bit is processed per clock cycle, and hence a complete word is processed in W clock cycles.
• Bit-parallel processing: one word of W bits is processed every clock cycle.
• Digit-serial processing: N bits are processed per clock cycle, and a word is processed in W/N clock cycles; N is referred to as the digit size.

Most bit-serial architectures contain an edge with a switch, which corresponds to a multiplexer in hardware. Consider the edge U → V in Fig. 12. To unfold this edge with unfolding factor J, two basic assumptions are made:
• The wordlength W is a multiple of the unfolding factor J, i.e., W = W′J.
• All edges into and out of the switch have no delays.

With these two assumptions in mind, the edge in Fig. 12 can be unfolded using the following two steps:
1. Write the switching instance as Wl + u = J(W′l + ⌊u/J⌋) + (u % J).
2. Draw an edge with no delays in the unfolded graph from the node U(u%J) to the node V(u%J), which is switched at time instance (W′l + ⌊u/J⌋).
Fig. 11 (a) The original DSP program. (b) The DFG for (a). (c) The 3-unfolded DFG. (d) The 3-parallel DSP program.
Fig. 12 A switch.
If the switch has multiple instances, then each switching instance is treated separately. In addition, if an edge contains a switch and a positive number of delays, a dummy node can be used to reduce this problem to the case where the edge contains no delay elements. Using these techniques for unfolding switches, bit-parallel and digit-serial architectures can be systematically designed from bit-serial architectures. This is demonstrated using the bit-serial adder in Fig. 13(a). A DFG corresponding to this adder is shown in Fig. 13(b). The 2-unfolded version of this DFG is shown in Fig. 13(c), and the architecture corresponding to this unfolded DFG in Fig. 13(d); the 2-unfolded architecture is a digit-serial adder with digit size equal to 2.
Fig. 13 (a) Bit-serial addition s = a + b for wordlength W = 4. (b) The DFG corresponding to the bit-serial adder. (c) The DFG resulting from unfolding the DFG using unfolding factor of J = 2. (d) The digit-serial adder designed by unfolding the bit-serial adder using J = 2.
For more details on how to unfold the DFG when the unfolding factor J is not a divisor of the wordlength W, the reader is referred to Sec. 5.5.2.2 in [1].

4.1.3 Infinite Unfolding of DFG

Any DFG can be unfolded by a factor of ∞. This infinitely unfolded DFG explicitly represents all intra-iteration and inter-iteration constraints. These DFGs correspond to the dependence graphs (DGs) of traditional terminating programs. The DG or the
infinitely unfolded DFG cannot contain any delay elements. Fig. 14(a) shows a DFG and Fig. 14(b) shows the corresponding infinitely unfolded DFG.
Fig. 14 (a) The original DFG. (b) The infinitely unfolded DFG.
4.2 Folding

The folding transformation is used to systematically determine the control circuits in DSP architectures where multiple algorithm operations, such as additions, are time-multiplexed onto a single functional unit, such as a pipelined adder. By executing multiple algorithm operations on a single functional unit, the number of functional units in the implementation is reduced, resulting in an integrated circuit with low silicon area. In general, folding can be used to reduce the number of hardware functional units by a factor of N at the expense of increasing the computation time by a factor of N. Consider the edge e connecting the nodes U and V with w(e) delays, as shown in Fig. 15(a). Let the executions of the l-th iteration of the nodes U and V be scheduled at the time units Nl + u and Nl + v, respectively, where u and v are the folding orders of the nodes U and V, satisfying 0 ≤ u, v ≤ N − 1, and N is the folding factor. The folding order of a node is the time partition in which the node is scheduled to execute in hardware. The functional units that execute the nodes U and V are denoted HU and HV, respectively. If HU is pipelined by PU stages, then the result of the l-th iteration of the node U is available at time unit Nl + u + PU. Since the edge e: U → V has w(e) delays, the result of the l-th iteration of the node U is used
by the (l + w(e))-th iteration of the node V, which is executed at N(l + w(e)) + v. Therefore, the result must be stored for

DF(U → V) = [N(l + w(e)) + v] − [Nl + PU + u] = N w(e) − PU + v − u    (9)

time units, which is independent of the iteration number l. The edge U → V is implemented as a path from HU to HV in the architecture with DF(U → V) delays, and data on this path are input to HV at the time instances Nl + v, as shown in Fig. 15(b).

Fig. 15 (a) An edge U → V with w(e) delays. (b) The corresponding folded datapath: the data begin at the functional unit HU, which has PU pipelining stages, pass through DF(U → V) delays, and are switched into the functional unit HV at the time instances Nl + v.

A folding set is an ordered set of operations executed by the same functional unit. Each folding set contains N entries, some of which may be null operations. The operation in the j-th position within the folding set, where j goes from 0 to N − 1, is executed by the functional unit during the time partition j. The biquad filter example shown in Fig. 16 is folded with a folding factor of N = 4, and the folding sets shown in the figure can be written as S1 = {4, 2, 3, 1} and S2 = {5, 8, 6, 7}, where the folding set S1 contains the addition operations executed by the same hardware adder and S2 contains the multiplication operations executed by the same hardware multiplier. To obtain the folded architecture shown in Fig. 17, corresponding to the DFG in Fig. 16, the folding equations for each of the 11 edges are written as below:
DF(1 → 2) = 4(1) − 1 + 1 − 3 = 1
DF(1 → 5) = 4(1) − 1 + 0 − 3 = 0
DF(1 → 6) = 4(1) − 1 + 2 − 3 = 2
DF(1 → 7) = 4(1) − 1 + 3 − 3 = 3
DF(1 → 8) = 4(2) − 1 + 1 − 3 = 5
DF(3 → 1) = 4(0) − 1 + 3 − 2 = 0
DF(4 → 2) = 4(0) − 1 + 1 − 0 = 0
DF(5 → 3) = 4(0) − 2 + 2 − 0 = 0
DF(6 → 4) = 4(1) − 2 + 0 − 2 = 0
DF(7 → 3) = 4(1) − 2 + 2 − 3 = 1
DF(8 → 4) = 4(1) − 2 + 0 − 1 = 1
Fig. 16 The retimed biquad filter with valid folding sets assigned.
Fig. 17 The folded biquad filter using the folding sets given in Fig. 16.
For a folded system to be realizable, DF(U → V) ≥ 0 must hold for all of the edges in the DFG. Once valid folding sets have been assigned, retiming can be used either to satisfy this property or to determine that the folding sets are infeasible. In general, the original DFG and the N-unfolded version of the folded DFG are retimed and/or pipelined versions of each other. Furthermore, an arbitrary DFG can be unfolded by a factor of N, and the unfolded DFG can be folded with many possible folding sets to generate a family of architectures. In order to obtain the original DFG
from the unfolded DFG via folding transformation, an appropriate folding set has to be chosen.
5 Applications to Software Design

In this section, the precedence constraints in a DFG are first introduced, followed by the definitions of the critical path and the iteration bound. A DFG-based scheduling algorithm is then explained in detail.
5.1 Intra-iteration and Inter-iteration Precedence Constraints

The DFG captures the data-driven property of DSP algorithms: any node can fire (perform its computation) whenever all of its input data are available. This implies that a node with no input edges can fire at any time, so many nodes can fire simultaneously, leading to concurrency. Conversely, a node with multiple input edges can only fire after all of its predecessor nodes have fired. The latter case imposes the precedence constraint between two nodes described by each edge. This precedence constraint is an intra-iteration precedence constraint if the edge has no delay elements, or an inter-iteration precedence constraint if the edge has one or more delays. Together, the intra-iteration and inter-iteration precedence constraints specify the order in which the nodes in the DFG can be executed. For example, the edge from node A to node B in Fig. 2(b) enforces an inter-iteration precedence constraint, which states that the execution of the k-th iteration of A must be completed before the (k + 1)-th iteration of B. On the other hand, the edge from B to A enforces an intra-iteration precedence constraint, which states that the k-th iteration of B must be executed before the k-th iteration of A.
5.2 Definition of Critical Path and Iteration Bound

The critical path of a DFG is defined to be the path with the longest computation time among all paths that contain no delay elements. The critical path in the DFG in Fig. 18(a) is the path A → B, which requires 6 u.t. Since the critical path is the longest path for combinational rippling in the DFG, the computation time of the critical path is the minimum computation time for one iteration of the DFG, where an iteration is the execution of each node in the DFG exactly once. A loop is a directed path that begins and ends at the same node, and the amount of time required to execute a loop can be determined from the precedence relations described by the edges of the DFG. According to these precedence constraints, iteration k of the loop consists of the sequential execution of Ak and Bk. Given that the
execution times of nodes A and B are 2 and 4 u.t., respectively, one iteration of the loop requires 6 u.t. This is the loop bound, which represents the lower bound on the loop computation time. Formally, the loop bound of the l-th loop is defined as tl/wl, where tl is the loop computation time and wl is the number of delays in the loop. As a result, the loop bound for the loop in Fig. 18(a) is 6/2 = 3 u.t. As another example, the DFG in Fig. 18(b) contains two loops, namely, the loops l1 = A → B → A and l2 = A → B → C → A. Therefore, the loop bounds for l1 and l2 are 6/2 = 3 u.t. and 11/1 = 11 u.t., respectively.

Fig. 18 (a) A DFG with one loop that has a loop bound of 6/2 = 3 u.t.; the iteration bound for this DFG is 3 u.t. (b) A DFG with iteration bound T∞ = max{6/2, 11/1} = 11 u.t.

The loop with the maximum loop bound is called the critical loop, and its corresponding loop bound is the iteration bound of the DSP program, which is the lower bound on the iteration or sample period of the DSP program regardless of the amount of computing resources available. Formally, the iteration bound is defined as

T∞ = max{l∈L} { tl/wl }    (10)
where L is the set of loops in the DFG, and tl and wl are the computation time and the number of delays of the loop l, respectively. Computing the iteration bound of a DFG by locating all the loops and directly evaluating (10) is rather straightforward. However, the number of loops in a DFG grows exponentially with the number of nodes, and therefore polynomial-time algorithms are desired for computing the iteration bound. Examples of polynomial-time algorithms include the longest path matrix algorithm [6] and the minimum cycle mean algorithm [7].
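For a DFG whose loops have already been enumerated, evaluating (10) is a single pass over the loop list, as in the sketch below (the loop list is that of Fig. 18(b)). The direct enumeration is only illustrative; as noted, practical tools rely on the polynomial-time algorithms of [6, 7].

#include <stdio.h>

struct loop { int t, w; };   /* loop computation time and loop delays */

double iteration_bound(const struct loop *L, int n)  /* Eq. (10) */
{
    double tmax = 0.0;
    for (int i = 0; i < n; i++) {
        double b = (double)L[i].t / L[i].w;
        if (b > tmax) tmax = b;
    }
    return tmax;
}

int main(void)
{
    struct loop L[] = { {6, 2}, {11, 1} };              /* Fig. 18(b) */
    printf("T_inf = %g u.t.\n", iteration_bound(L, 2)); /* prints 11  */
    return 0;
}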
5.3 Scheduling

DFGs are generally used in high-level synthesis to derive concurrent implementations of DSP applications on parallel hardware, where a major concern is subtask scheduling, which determines when, and in which hardware units, nodes are executed.
The scheduling algorithm consists of assigning a scheduling time to every operator in an architecture, where the time represents when the operation will normally take place. Each operator must be scheduled at a time after all of its inputs become available. Consequently, the scheduling problem can be formulated as a linear programming problem. Furthermore, because all scheduled times of the operators must be integers, the scheduling algorithm must find the optimal integer solution to this problem.

5.3.1 The Scheduling Algorithm

Consider a pair of operators joined by an edge, as shown in Fig. 19. One of the outputs produced by the operator OPi is the operand OPNDx, which in turn is the input to the operator OPj. The scheduled times of operators OPi and OPj are denoted by Ti and Tj, respectively. In addition, the timing specifications of the relevant output and input ports of OPi and OPj are denoted by ti and tj, respectively. Since OPi is scheduled at time Ti, the output OPNDx becomes available at time Ti + ti. Further, the same operand is required as an input to the operator OPj at time Tj + tj. By the requirement that the operand cannot be used by operator OPj before it is produced by operator OPi, the following inequality can be derived:
Fig. 19 A pair of operators showing the timing information.

Tj + tj ≥ Ti + ti    (11)
Such an inequality holds for each pair of operators joined by an edge in the DFG. In these inequalities, Ti and Tj are the unknowns, while the value ti − tj is a known constant. A solution to the set of inequalities can be determined by using common techniques for solving linear programming problems. Once a solution is found to this set of inequalities, the circuit may be correctly synchronized by inserting a delay equal to Tj − Ti − (ti − tj) clock cycles between the operators OPi and OPj. In general, many solutions satisfy the set of inequalities for a given architecture, and hence there are many ways to synchronize the circuit. The goal of optimal scheduling is to generate a solution that provides the minimal cost, where cost is defined to be the total number of shift-register delays required to properly
synchronize the circuit. If a linear cost function can be defined, the minimum cost problem can easily be formulated as a linear programming problem.

5.3.2 Minimum Cost Solution

Minimizing the total number of synchronization delays required for each edge between functional units of a circuit is not sufficient, because any functional unit may have multiple fanout; the delays on a multiple-fanout output should be allocated sequentially instead of in parallel. Consider a simple case where an output of some operator OPO is used as an input to 3 other operators, OPA, OPB, and OPC, as shown in Fig. 20(a), with delays of 10, 12, and 25 clock cycles, respectively. The total number of delays is equal to 10 + 12 + 25 = 47. An alternative, sequential arrangement of the delays is shown in Fig. 20(b), where the total number of delays is 25, the length of the longest delay. Therefore, the total number of delays that needs to be allocated to any node in the circuit is equal to the maximum delay that must be allocated.
Fig. 20 (a) Operator O with a fanout of three and no delay sharing. (b) Operator O with a fanout of three and delay sharing.
Let Dx represent the maximum delay that must be allocated to an operand OPNDx of width wx. A total cost function to be minimized can now be defined:

Cost = Σx Dx wx    (12)
where the sum is over each operand node in the circuit. For each node, as shown in Fig. 20(b), there exists a constraint equivalent to (11):
Tj − Ti ≥ ti − tj    (13)
In addition, the maximum delay on operand OPNDx from OPi to OPj must not exceed the maximum delay Dx, which is described by the constraint:

Tj − Ti − (ti − tj) ≤ Dx    (14)
These two constraints, along with the cost function to be minimized as in (12), describe a linear programming problem capable of providing the minimum cost scheduling.

5.3.3 Scheduling of Edges With Delays

The scheduling algorithm above generates optimal solutions for DFGs consisting of delay-free edges. To handle edges that contain delays, a preferable method is to incorporate the word delays directly into the linear programming formulation, which is achieved by slightly modifying the equations to take into account the presence of z−1 operators. Specifically, Fig. 21 shows a general situation in which the output of some operator OPi undergoes a z−n transformation before being used as an input to the operator OPj. The modified equations describing these scheduling constraints thus become:
Fig. 21 General code to be scheduled including z−1 operators.
Tj − Ti ≥ ti − tj − nW    (15)

and

Tj − Ti − (ti − tj) + nW ≤ Dx    (16)

where W is the number of clock cycles in a word, i.e., the wordlength. In this case, Tj − Ti − (ti − tj) + nW is the delay applied to the connection shown in the diagram, and Dx is the maximum delay applied to this variable.
6 Conclusions

This chapter has introduced signal flow graphs and data flow graphs. Several transformations, such as pipelining, retiming, unfolding, and folding, have been reviewed, and applications of these transformations to signal flow graphs and data flow graphs have been demonstrated for both hardware and software implementations. It is important to note that any DFG that can be pipelined can be operated in a parallel manner, and vice versa. Any transformation that can improve performance in a hardware system can also improve the performance of a software system. Thus, preprocessing of DFGs and SFGs by these high-level transformations can play a major role in system performance.

Acknowledgements Many parts of the text and figures in this chapter are taken from the textbook [1]. These have been reprinted with permission of John Wiley & Sons, Inc. The authors are grateful to John Wiley & Sons, Inc., for permitting the authors to use these figures and parts of the text from [1]. They are also grateful to George Telecki, associate publisher at Wiley, for his help in this regard.
References
1. Parhi, K. (1999) VLSI digital signal processing systems: design and implementation. John Wiley & Sons, New York.
2. Crochiere, R., Oppenheim, A., Analysis of linear digital networks. Proc. IEEE, no. 4, pp. 581–595, Apr. 1975.
3. Bolton, W. (1998) Newnes control engineering pocketbook. Oxford.
4. Lee, E., Messerschmitt, D., Synchronous data flow. Proc. IEEE, special issue on hardware and software for digital signal processing, pp. 1235–1245, Sept. 1987.
5. Leiserson, C., Rose, F., Saxe, J., Optimizing synchronous circuitry by retiming. Third Caltech Conference on VLSI, pp. 87–116, 1983.
6. Gerez, S., Heemstra de Groot, S., Herrmann, O., A polynomial-time algorithm for the computation of the iteration-period bound in recursive data flow graphs. IEEE Trans. on Circuits and Systems–I: Fundamental Theory and Applications, vol. 39, no. 1, pp. 49–52, Jan. 1992.
7. Ito, K., Parhi, K., Determining the minimum iteration period of an algorithm. Journal of VLSI Signal Processing, vol. 11, pp. 229–244, 1995.
Systolic Arrays Yu Hen Hu and Sun-Yuan Kung
Abstract This chapter reviews the basic ideas of the systolic array, its design methodologies, and the historical development of various hardware implementations. Two modern applications, namely, motion estimation in video coding and baseband processing in wireless communication, are also discussed.
1 Introduction

The systolic array [2, 13, 15] is an on-chip multi-processor architecture proposed by H.T. Kung in the late 1970s as an architectural solution to the anticipated on-chip communication bottleneck of modern very large scale integration (VLSI) technology. A systolic array features a mesh-connected array of identical, simple processing elements (PEs). According to H.T. Kung [13], "In a systolic system, data flows from the computer memory in a rhythmic fashion, passing through many processing elements before it returns to memory, much as blood circulates to and from the heart." As depicted in Figure 1, a systolic array is often configured as a linear array, a two-dimensional rectangular mesh array, or sometimes a two-dimensional hexagonal mesh array. In a systolic array, every PE is connected only to its nearest neighboring PEs through dedicated, buffered local buses. These localized interconnects and the regular array configuration allow a systolic array to grow in size without incurring excessive on-chip global interconnect delays due to long wires.
Yu Hen Hu University of Wisconsin, Department of Electrical and Computer Engineering, Madison, WI 53706-1691 e-mail: [email protected] Sun-Yuan Kung Princeton University, Department of Electrical Engineering, Princeton, NJ 08544 e-mail: [email protected]
Fig. 1 Common configurations of systolic architecture: (a) linear array, (b) rectangular array, (c) hexagonal array
Several key architectural concerns impacted the development of the systolic architecture [13]:
1. Simple and regular design – In order to reduce design complexity and design cost, and to improve testability and fault tolerance, it is argued that a VLSI architecture should consist of simple modules (cores, PEs, etc.) organized in regular arrays.
2. Concurrency and communication – Concurrent computing is essential to achieve high performance while conserving power. On-chip communication must be constrained to be local and regular to minimize the excessive overhead of long wires, long delays, and high power consumption.
3. Balanced on-chip computation rate and on/off-chip data input/output rate – Moving data on and off chip remains a communication bottleneck of modern VLSI chips. A sensible architecture must balance the demands of on/off-chip data I/O to maximize the utilization of the available computing resources.
The systolic array was proposed to implement application-specific computing systems. Toward this goal, one must map the computing algorithm onto a systolic array. This requirement stimulated two complementary research directions that have seen numerous significant and fruitful research results. The first research direction is to reformulate existing computing algorithms, or to develop novel computing algorithms, that can be mapped onto a systolic architecture to enjoy the benefits of systolic computing. The second research direction is to develop a systematic design methodology that automates the process of algorithm mapping. In Section 2 of this chapter, we provide a brief overview of the systolic algorithms that have been proposed. In Section 3, the formal design methodologies developed for automated systolic array mapping are reviewed. Systolic array computing was developed based on a globally synchronized, fine-grained, pipelined timing model. It requires a global clock distribution network free of clock skew to distribute the clock signal over the entire systolic array. Recognizing the technical challenge of developing a large-scale clock distribution network, S. Kung et al. [14–16] proposed a self-timed, data-flow-based wavefront array processor architecture that promises to alleviate the stringent timing constraint imposed
by the global clock synchronization requirement. In Section 4, the wavefront array architecture and its related design methodology are discussed. These architectural features of the systolic array have motivated numerous research and commercial computing architectures. Notable examples include the WARP and iWARP projects at CMU [1, 3, 7, 10], the Transputer™ of INMOS [8, 19, 22, 26], and the TMS320C40 DSP processor of Texas Instruments [23]. In Section 5 of this chapter, these systolic-array-motivated computing architectures are briefly surveyed. While the notion of the systolic array was first proposed three decades ago, its impact can be felt vividly today. Modern applications of the systolic array concept can be found in field programmable gate array (FPGA) chip architectures and network-on-chip (NoC) mesh-array multi-core architectures. Computation-intensive special-purpose architectures, such as the discrete cosine transform and block motion estimation algorithms in video coding standards, as well as the QR factorization for least squares filtering in wireless communication standards, have been incorporated in embedded chip designs. These latest real-world applications of the systolic array architecture are discussed in Section 6.
2 Systolic Array Computing Algorithms

A systolic array exhibits the characteristics of parallelism (pipelining), regularity, and local communication. A large number of signal processing algorithms and numerical linear algebra algorithms can be implemented using systolic arrays.
2.1 Convolution Systolic Array

For example, consider the convolution of two sequences {x[n]} and {h[n]}:

y[n] = Σ_{k=0}^{K−1} h[k] x[n − k],    0 ≤ n ≤ N − 1.    (1)
A systolic array realization of this algorithm is shown in Figure 2 (K = 4). In Figure 2(a), the block diagram of the systolic array and the pattern of data movement are depicted. The block diagram of an individual processing element (PE) is illustrated in Figure 2(b), where a shaded rectangle represents a buffer (delay element) that can be implemented with a register. The output y[n] begins its evaluation at the upper left input with initial value 0. When it enters each PE, the multiply-and-accumulate (MAC) operation

yout = yin + h[k] xin    (2)
Fig. 2 (a) Fully pipelined systolic array for convolution, (b) internal architecture of a single processing element (PE)
will be performed. The systolic array has the same length as the sequence {h[k]; 0 ≤ k ≤ K − 1}, with each h[k] residing in a register in its PE. The final result {y[n]} appears at the upper right output port; every other clock cycle, one output is evaluated. The input {x[n]} is provided from the lower left input port. It propagates toward the lower right output port without being modified (xout = xin). Along the way, it is properly buffered to keep pace with the evaluation of the convolution.
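A behavioral (untimed) emulation of the PE chain may make the dataflow easier to follow: each output enters the array as zero and accumulates one MAC, per (2), in every PE it passes through. The buffering and the interleaved movement of x are deliberately abstracted away, so this sketch reproduces the arithmetic of the array rather than its cycle-by-cycle schedule.

#include <stdio.h>

#define K 4   /* number of PEs = number of filter taps */

int convolve_sample(const int h[K], const int *x, int n)
{
    int y = 0;                           /* y enters at the left as 0 */
    for (int k = 0; k < K; k++)          /* one MAC per PE, Eq. (2)   */
        if (n - k >= 0)
            y += h[k] * x[n - k];
    return y;
}

int main(void)
{
    int h[K] = {1, 2, 3, 4}, x[8] = {1, 0, 0, 1, 0, 0, 0, 0};
    for (int n = 0; n < 8; n++)
        printf("y[%d] = %d\n", n, convolve_sample(h, x, n));
    return 0;
}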
2.2 Linear System Solver Systolic Array

Similar to the above example, a systolic algorithm is often presented in the form of a high-level block diagram of the systolic array configuration (e.g., Figure 1), complemented with labels indicating data movement within the processor array, and a detailed block diagram explaining the operations performed within an individual PE. Another example, given in [13], is shown in Figure 3. It consists of two systolic arrays for solving linear systems of equations. One triangular-configured systolic array is responsible for the orthogonal triangularization of a matrix using QR factorization, and the other, linear, systolic array is responsible for solving a triangular linear system using back-substitution. A linear system of equations is represented as Xb = y. Using a Jacobi rotation method, the first column of the X matrix enters the upper-left circular PE, where an angle θk is evaluated such that
Fig. 3 Systolic array for solving linear systems [13]
⎡ cos θk  −sin θk ⎤ ⎡ x11^(k−1) ⎤   ⎡ x11^(k) ⎤
⎣ sin θk   cos θk ⎦ ⎣ xk,1      ⎦ = ⎣ 0        ⎦ ,   k = 2, 3, . . .    (3)

Clearly, in this circular PE, the operation to be performed is

θk = −tan^(−1)( xk,1 / x11^(k−1) ).    (4)

This θk is then propagated to the square PEs to the right of the upper-left circular PE to perform the rotation operations

⎡ cos θk  −sin θk ⎤ ⎡ x12^(k−1) · · · x1N^(k−1)  y1^(k−1) ⎤   ⎡ x12^(k) · · · x1N^(k)  y1^(k) ⎤
⎣ sin θk   cos θk ⎦ ⎣ xk2^(k−1) · · · xkN^(k−1)  yk^(k−1) ⎦ = ⎣ xk2^(k) · · · xkN^(k)  yk^(k) ⎦    (5)

The second row of the result of the above equation is propagated downward to the next row of the triangular systolic array, which repeats what has been performed on the first row. After N − 1 iterations, the results are ready within the triangular array. Note that during this rotation process, the right-hand side y of the linear system of equations is subject to the same rotation operations. Equivalently, the operations taking place in the triangular systolic array amount to pre-multiplying the X matrix by a unitary matrix Q such that QX = U is an upper triangular matrix, and z = Qy. This yields the upper triangular system Ub = z.
Fig. 4 (a) A bubble sort systolic array, (b) operation performed within a PE
To solve this triangular system of equations, a back-substitution algorithm is used. Specifically, given

$$U\mathbf{b} = \begin{bmatrix} u_{11} & u_{12} & \cdots & u_{1N} \\ 0 & u_{22} & \cdots & u_{2N} \\ \vdots & \ddots & \ddots & \vdots \\ 0 & \cdots & 0 & u_{NN} \end{bmatrix} \begin{bmatrix} b_1 \\ b_2 \\ \vdots \\ b_N \end{bmatrix} = \begin{bmatrix} z_1 \\ z_2 \\ \vdots \\ z_N \end{bmatrix} = \mathbf{z}. \qquad (6)$$
The algorithm begins by solving uNN bN = zN for bN = zN / uNN. In the systolic array, uNN is fed from the last (lower-right corner) circular PE of the triangular array to the circular PE of the linear array to its right. The computed bN is then forwarded to the next square processor in the upper-right direction, to be substituted back into the next equation,

$$u_{N-1,N-1}\, b_{N-1} + u_{N-1,N}\, b_N = z_{N-1}. \qquad (7)$$

In the rectangular PE, the operation performed is zN−1 − uN−1,N bN. The result is then fed back to the circular PE to compute bN−1 = (zN−1 − uN−1,N bN) / uN−1,N−1.
2.3 Sorting Systolic Arrays

Given a sequence {x[n]; 0 ≤ n ≤ N − 1}, a sorting algorithm outputs a sequence {m[n]; 0 ≤ n ≤ N − 1} that is a permutation of {x[n]} such that m[n] ≥ m[n + 1]. There are many sorting algorithms available. A systolic array that implements the bubble sort algorithm is presented in Figure 4.
Each PE in this systolic array receives data a, b from its left and right sides. These two inputs are compared; the maximum of the two is output to the right-side buffer, while the minimum of the two is output to the left-side buffer. The input is loaded into the upper buffer according to a specific schedule. The left-most input is fixed at −∞. It has been shown that systolic arrays for insertion sort and selection sort can also be derived using a similar approach [15].
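A sequential software model makes the compare-exchange behavior concrete. The following sketch (our own hypothetical simulation, not part of the original presentation; it serializes what the hardware PEs do in parallel) applies the min/max operation of Figure 4 (b) as each input is fed in, so the stored values end up in the descending order m[n] ≥ m[n + 1]:

```python
# Software sketch of the bubble-sort systolic array of Fig. 4.
# Each list entry models the register of one PE; on every step a PE
# compares its stored value with the incoming one, keeps the maximum,
# and forwards the minimum (c = min{a,b}, d = max{a,b}).

NEG_INF = float("-inf")

def systolic_sort(x):
    pes = [NEG_INF] * len(x)          # PE registers, seeded with -inf
    for value in x:                   # feed one input per cycle
        for k in range(len(pes)):
            if value > pes[k]:        # compare-exchange in PE #k
                value, pes[k] = pes[k], value
    return pes                        # m[0] >= m[1] >= ...

print(systolic_sort([3, 1, 4, 1, 5]))   # [5, 4, 3, 1, 1]
```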
3 Formal Systolic Array Design Methodology

Owing to the regular structure, localized interconnect, and pipelined operation of systolic arrays, systematic design methodologies have been proposed that greatly reduce design complexity and open new avenues for seeking optimized systolic architectures. To introduce the systematic systolic array design methodology in this section, a few important representations will first be briefly surveyed.
3.1 Loop Representation, Regular Iterative Algorithm (RIA), and Index Space

Algorithms that are suitable for systolic array implementation must exhibit a high degree of regularity and require intensive computation. Such an algorithm can often be represented by a set of nested Do loops of the following general format:

L1: DO i1 = p1, q1
L2:   DO i2 = p2, q2
        ...
Lm:     DO im = pm, qm
          H(i1, i2, ..., im)
        END DO
        ...
      END DO
    END DO
where {Lm} labels the levels of the loop nest, {im} are loop indices, and [i1, i2, ..., im]^T is an m × 1 index vector representing an index point in an m-dimensional lattice. {pm, qm} are the loop bounds of the mth loop nest. H(i1, i2, ..., im) is the loop body and may have different granularity; that is, the loop body could represent bit-level operations, word-level program statements, or subroutine-level procedures. Whatever the granularity, it is assumed that the loop body is to be executed in a single PE. For convenience, the execution time of a loop body in a PE will be assumed to be one clock cycle in this chapter. In other words, it is assumed that the execution of a loop body within a PE cannot be interrupted. All data needed to execute the loop
body must be available before the execution of the loop body can start, and none of the output will be available until the execution of the entire loop body is completed. If the loop bounds are all constant, the set of indices corresponding to all iterations forms a rectangular parallelepiped. In general, the loop bounds are linear (affine) functions, with integer coefficients, of the outer loop indices, and can be represented with two inequalities:

$$\mathbf{p}_0 \le P\mathbf{i} \quad \text{and} \quad Q\mathbf{i} \le \mathbf{q}_0 \qquad (8)$$
where p0 and q0 are constant integer-valued vectors, and P and Q, respectively, are integer-valued upper triangular coefficient matrices. If P = Q, then the corresponding loop nest can be transformed in the index space such that the transformed algorithm has constant iteration bounds. Such a nested loop is called a regular nested loop. If an algorithm is formulated to contain only regular nested loops, it is called a regular iterative algorithm (RIA). Consider the convolution algorithm described in equation (1) of this chapter. The mathematical formula can be conveniently expressed with a 2-level loop nest, as shown in Figure 5.

For n = 0 to N − 1,
  y[n] = 0;
  For k = 0 to K − 1,
    y[n] = y[n] + h[k] ∗ x[n − k];
  end
end
Fig. 5 Convolution
In this formulation, n and k are loop indices having loop bounds (0, N − 1) and (0, K − 1), respectively. The loop body H(i) consists of a single statement y[n] = y[n] + h[k]x[n − k]. Note that

$$\mathbf{i} = \begin{bmatrix} n \\ k \end{bmatrix};\quad \mathbf{p}_0 = \begin{bmatrix} 0 \\ 0 \end{bmatrix} \le \begin{bmatrix} 1 & 0 \\ 0 & 1 \end{bmatrix}\begin{bmatrix} n \\ k \end{bmatrix} = P\mathbf{i} = Q\mathbf{i} \le \begin{bmatrix} N-1 \\ K-1 \end{bmatrix} = \mathbf{q}_0 \qquad (9)$$
Hence, this is an RIA.
3.2 Localized and Single Assignment Algorithm Formulation

As demonstrated in Section 2, a systolic array is a parallel, distributed computing platform in which every PE executes identical operations. By connecting PEs in
specific configurations and providing input data with the right timing, a systolic array is able to perform data-intensive computations in a rhythmic, synchronous fashion. Therefore, to implement a given algorithm on a systolic array, its formulation may need to be adjusted. Specifically, since computation takes place at physically separated PEs, data movement in a systolic algorithm must be explicitly specified. Moreover, unnecessary restrictions in the algorithm formulation that may impede the exploitation of inherent parallelism must be removed. A closer examination of Figure 5 reveals two potential problems according to the above arguments: (i) the variables y[n], h[k], and x[n − k] are one-dimensional arrays, while each index vector i = (n, k) resides in a two-dimensional space; (ii) the memory location y[n] is repeatedly assigned new values, K times during the k loop, before the final result is evaluated. Having one-dimensional variable arrays in a two-dimensional index space implies that the same input data will be needed when executing the loop body H(i) at different iterations (index points). In a systolic array, it is likely that H(i) and H(j), i ≠ j, may be executed at different PEs. As such, how these variables are distributed to the different index points where they are needed should be explicitly specified in the algorithm. Furthermore, a design philosophy that dominates the development of systolic arrays is to discourage on-chip global interconnect, due to its many potential drawbacks. Hence, data movement is restricted to local communication, namely, passing data from one PE to one or more of its nearest neighboring PEs in the systolic array. This restriction may be imposed by limiting the propagation of such a global variable from one index point to its nearest neighboring index points. For this purpose, we make the following modifications to the algorithm in Figure 5:

h[k] → h1[n, k] such that h1[0, k] = h[k], h1[n, k] = h1[n − 1, k];
x[n] → x1[n, k] such that x1[n, 0] = x[n], x1[n, k] = x1[n − 1, k − 1].

Note that the equations for h1 and x1 are chosen based on the fact that h[k] must be made available over the entire range of the index n, and x[n − k] must be made available to all (n′, k′) such that n′ − k′ = n − k. An algorithm with all of its variables passed from one iteration (index point) to its neighboring index points is called a (variable-)localized algorithm. The repeated assignment of different intermediate results of y[n] into the same memory location causes an unwanted output dependence relation in the algorithm formulation. Output dependence is a type of false data dependence that impedes the potential parallel execution of a given algorithm. The output dependence can be removed if the algorithm is formulated to obey a single assignment rule; that is, every memory address (variable name) is assigned a new value only once during the execution of the algorithm. As a remedy, one creates new memory locations to be assigned to these intermediate results by expanding the one-dimensional array {y[n]} into a two-dimensional array {y1[n, k]}:

y[n] → y1[n, k] such that y1[n, −1] = 0, y1[n, k] = y1[n, k − 1] + h1[n, k]x1[n, k]
where the previously localized variables h1 and x1 are used. With the above modifications, the algorithm in Figure 5 is reformulated as shown in Figure 6.

h1(0, k) = h(k), k = 0, ..., K − 1
x1(n, 0) = x(n), n = 0, ..., N − 1
y1(n, −1) = 0, n = 0, 1, ..., N − 1
For n = 0, 1, 2, ..., N − 1 and k = 0, ..., K − 1:
  y1(n, k) = y1(n, k − 1) + h1(n, k) ∗ x1(n, k)
  h1(n, k) = h1(n − 1, k)
  x1(n, k) = x1(n − 1, k − 1)
y(n) = y1(n, K − 1), n = 0, 1, 2, ..., N − 1
Fig. 6 Convolution (localized, single assignment version)
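The localized, single-assignment form translates almost line for line into software. The sketch below (ours; array names follow Figure 6, and the y1 index is shifted by one so that y1(n, −1) can be stored) checks that the reformulation still computes an ordinary convolution:

```python
# Localized, single-assignment convolution per Figure 6 (illustrative).
# h1 and x1 only ever copy values from a neighboring index point, and
# every y1 entry is written exactly once (single assignment rule).
def localized_convolution(h, x):
    K, N = len(h), len(x)
    h1 = [[0.0] * K for _ in range(N)]
    x1 = [[0.0] * K for _ in range(N)]
    y1 = [[0.0] * (K + 1) for _ in range(N)]   # y1[n][k+1] holds y1(n, k)
    for n in range(N):
        for k in range(K):
            h1[n][k] = h[k] if n == 0 else h1[n - 1][k]
            if k == 0:
                x1[n][k] = x[n]
            elif n >= 1:
                x1[n][k] = x1[n - 1][k - 1]    # stays 0 when n - k < 0
            y1[n][k + 1] = y1[n][k] + h1[n][k] * x1[n][k]
    return [y1[n][K] for n in range(N)]

print(localized_convolution([1, 2], [1, 1, 1]))   # [1, 3, 3]
```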
3.3 Data Dependence and Dependence Graph

An iteration H(j) is dependent on iteration H(i) if H(j) will read from a memory location whose value is last written during the execution of iteration H(i). The corresponding dependence vector d is defined as d = j − i. A matrix D consisting of all dependence vectors of an algorithm is called a dependence matrix. This inter-iteration dependence relation imposes a partial ordering on the execution of the iterative loop nest. From the algorithm in Figure 6, three dependence vectors can be derived:

$$\mathbf{d}_1 = \begin{bmatrix} n \\ k \end{bmatrix} - \begin{bmatrix} n \\ k-1 \end{bmatrix} = \begin{bmatrix} 0 \\ 1 \end{bmatrix};\quad \mathbf{d}_2 = \begin{bmatrix} n \\ k \end{bmatrix} - \begin{bmatrix} n-1 \\ k \end{bmatrix} = \begin{bmatrix} 1 \\ 0 \end{bmatrix};\quad \mathbf{d}_3 = \begin{bmatrix} n \\ k \end{bmatrix} - \begin{bmatrix} n-1 \\ k-1 \end{bmatrix} = \begin{bmatrix} 1 \\ 1 \end{bmatrix}. \qquad (10)$$

In the index space, a lattice point whose coordinates fall within the range of the loop bounds represents the execution of the loop body for those particular loop index values. The dependence vectors may be represented by directed arcs starting from the iteration that produces the data to the iteration where the data is needed. Together with these index points and directed arcs, one has a dependence graph (DG) representing the computation tasks required of a localized RIA. The corresponding DG of the convolution algorithm is depicted in Figure 7 for K = 5 and N = 7.
Fig. 7 Localized dependence graph of convolution algorithm
The dependence graph in Figure 7 is shift-invariant in that the dependence vector structure is identical at each (circled) lattice point of the iteration space. This regularity and modularity is the key feature of a systolic computing algorithm that lends itself to efficient systolic array implementation. Due to the shift-invariant nature, the DG of a localized RIA can be conveniently represented by the set of indices {i; p0 ≤ Pi ≤ q0} and the dependence vectors D at each index point.
3.4 Mapping an Algorithm to a Systolic Array

A schedule S : i → t(i) ∈ Z+ is a mapping from each index point i in the index space to a positive integer t(i), which dictates when this iteration is to be executed. An assignment A : i → p(i) is a mapping from each index point i onto a PE index p(i), where the corresponding iteration will be executed. Given the dependence graph of an algorithm, the development of a systolic array implementation amounts to finding a mapping of each index point i in the DG onto (p(i), t(i)). Toward this goal, two fundamental constraints will be discussed. First, it is assumed that each PE can execute only one task (loop body) at a time. As such, a resource constraint must be observed:

Resource Constraint: If t(i) = t(j), i ≠ j, then p(i) ≠ p(j); and if p(i) = p(j), i ≠ j, then t(i) ≠ t(j). (11)

In addition, the data dependence also imposes a partial ordering on the schedule. This data dependence constraint can be summarized as follows:

Data Dependence Constraint:
If index j can be reached from index i by following a path consisting of one or more dependence vectors, then H(j) should be scheduled after H(i). That is, if there exists a vector m of non-negative integers such that j = i + Dm, then

$$t(\mathbf{j}) > t(\mathbf{i}) \qquad (12)$$
where D is the dependence matrix. Since a systolic array often assumes a one- or two-dimensional regular configuration (cf. Figure 1), the PE index p(i) can be associated with a lattice point in a PE index space, just as each loop body in a loop nest is associated with an index point in the DG. To ensure that the resulting systolic array features local inter-processor communication, the localized dependence vectors in the DG should not require global communication after the PE assignment i → p(i). A somewhat restrictive constraint to enforce this requirement would be

Local Mapping Constraint: If j − i = dk (a dependence vector), then

$$\| p(\mathbf{j}) - p(\mathbf{i}) \|_1 \le \| \mathbf{d}_k \|_1 \qquad (13)$$
A number of performance metrics may be defined to compare the merits of different systolic array implementations. These include:

Total Computing Time:
$$T_C = \max_{\mathbf{i}, \mathbf{j} \in DG} \left( t(\mathbf{i}) - t(\mathbf{j}) \right) \qquad (14)$$

PE Utilization:
$$U_{PE} = N_{DG} \,/\, (T_C \cdot N_{PE}) \qquad (15)$$

where N_DG is the number of index points in the DG, and N_PE is the number of PEs in the systolic array. Now we are ready to formally state the systolic array mapping and scheduling problem:

Systolic Array Mapping and Scheduling Problem: Given a localized, shift-invariant DG and a systolic array configuration, find a PE assignment mapping p(i) and a schedule t(i) such that the performance is optimized, namely, the total computing time T_C is minimized and the PE utilization U_PE is maximized, subject to (i) the resource constraint, (ii) the data dependence constraint, and (iii) the local mapping constraint. Thus the systolic array implementation is formulated as a discrete constrained optimization problem. By fully exploiting the regular (shift-invariant) structure of both the DG and the systolic array, this problem can be further simplified.
3.5 Linear Schedule and Assignment

A linear schedule is an integer-valued scheduling vector s in the index space such that

$$t(\mathbf{i}) = \mathbf{s}^T \mathbf{i} + t_0 \in Z^+ \qquad (16)$$

where t0 is a constant integer. The data dependence constraint stipulates that

$$\mathbf{s}^T \mathbf{d} > 0 \quad \text{for any dependence vector } \mathbf{d}. \qquad (17)$$
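Eq. (17) already prunes the design space sharply. As a small illustration (our own helper, not from the chapter), one can enumerate small integer scheduling vectors against the dependence vectors of eq. (10) and keep only the admissible ones:

```python
# Enumerate candidate scheduling vectors s = (s1, s2) for the
# convolution DG and keep those with s^T d > 0 for every d (eq. (17)).
D = [(0, 1), (1, 0), (1, 1)]           # d1, d2, d3 from eq. (10)

valid = [(s1, s2)
         for s1 in range(-2, 3) for s2 in range(-2, 3)
         if all(s1 * d[0] + s2 * d[1] > 0 for d in D)]
print(valid)   # [(1, 1), (1, 2), (2, 1), (2, 2)]: both components positive
```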
Clearly, all iterations that reside on a hyper-plane perpendicular to s, called an equi-temporal hyperplane, must be executed in parallel at different PEs. The equi-temporal hyperplane is defined as

$$Q = \left\{ \mathbf{i} \;\middle|\; \mathbf{s}^T \mathbf{i} = t(\mathbf{i}) - t_0,\ \mathbf{i} \in DG \right\}.$$

According to the resource constraint, the maximum number of index points in Q determines the minimum size (number of PEs) of the systolic array. Assume that the PE index space is an (m − 1)-dimensional subspace of the iteration index space. Then the assignment of an individual iteration i to a PE index p(i) can be realized by projecting i onto the PE subspace along an integer-valued assignment vector a. Define an m × (m − 1) integer-valued PE basis matrix P such that P^T a = 0; then a linear PE assignment can be obtained via the affine transformation

$$p(\mathbf{i}) = P^T \mathbf{i} + \mathbf{p}_0 \qquad (18)$$

Combining eqs. (16) and (18), one has a node mapping procedure:

$$\begin{bmatrix} t(\mathbf{i}) \\ p(\mathbf{i}) \end{bmatrix} = \begin{bmatrix} \mathbf{s}^T \\ P^T \end{bmatrix} \mathbf{i} \qquad \text{(node mapping)} \qquad (19)$$
The node mapping procedure can also be extended to the subset of nodes where external data input and output take place. The same node mapping procedure will indicate where and when these external data I/O operations take place in the systolic array. This special mapping procedure is also known as I/O mapping. Different PEs in the systolic array are interconnected by local buses. These buses are implemented based on the need to pass data from one index point (iteration) to another, as specified by the dependence vectors. Hence, the orientation of these buses, as well as the buffers on them, can also be determined using P and s:

$$\begin{bmatrix} \mathbf{s}^T \\ P^T \end{bmatrix} D \qquad \text{(arc mapping)} \qquad (20)$$

where the first row of the product gives the number of first-in-first-out buffers required on each local bus, and the second row gives the orientation e of each local bus within the PE index space. Consider two iterations i, j ∈ DG, i ≠ j. If p(i) = p(j), it implies that

$$\mathbf{0} = p(\mathbf{i}) - p(\mathbf{j}) = P^T(\mathbf{i} - \mathbf{j}) \;\Rightarrow\; \mathbf{i} - \mathbf{j} = k\mathbf{a}. \qquad (21)$$
The resource constraint (cf. eq. (11)) stipulates that if p(i) = p(j), i ≠ j, then t(i) ≠ t(j). Hence,

$$t(\mathbf{i}) - t(\mathbf{j}) = \mathbf{s}^T(\mathbf{i} - \mathbf{j}) = k\,\mathbf{s}^T\mathbf{a} \ne 0 \qquad (22)$$

Example 1. Let us now use the convolution algorithm in Figure 6 and its corresponding DG in Figure 7 as an example, and set a^T = [1 0] and s^T = [1 1]. It is easy to derive the PE basis matrix P^T = [0 1]. Hence, the node mapping becomes

$$\begin{bmatrix} t(\mathbf{i}) \\ p(\mathbf{i}) \end{bmatrix} = \begin{bmatrix} \mathbf{s}^T \\ P^T \end{bmatrix}\begin{bmatrix} n \\ k \end{bmatrix} = \begin{bmatrix} 1 & 1 \\ 0 & 1 \end{bmatrix}\begin{bmatrix} n \\ k \end{bmatrix} = \begin{bmatrix} n+k \\ k \end{bmatrix}, \quad 0 \le n \le 6,\ 0 \le k \le \min(4, n). \qquad (23)$$

This implies that iteration (n, k) will be executed at PE #k of the systolic array, and its scheduled execution time slot is n + k. Next, the arc mapping can be found as:

$$\begin{bmatrix} \mathbf{s}^T \\ P^T \end{bmatrix} D = \begin{bmatrix} 1 & 1 \\ 0 & 1 \end{bmatrix}\begin{bmatrix} 1 & 0 & 1 \\ 0 & 1 & 1 \end{bmatrix} = \begin{bmatrix} 1 & 1 & 2 \\ 0 & 1 & 1 \end{bmatrix} \qquad (24)$$

The second row of the right-hand side (RHS) of eq. (24) indicates that there are three local buses. The first one has an entry "0", implying that this is a bus that starts and ends at the same PE. The other two have an entry "1", indicating that they are local buses in the increasing-k direction. The first row of the RHS gives the number of registers required on each local bus to ensure that the proper execution ordering is obeyed. Thus, the first two buses have a single buffer, while the third bus has two buffers. Note that the external data input {x[n]} is fed into the DG at {(n, 0); 0 ≤ n ≤ 6}, and the final output {y[n]} will be available at {(n, K); 0 ≤ n ≤ 6}, where K = 4. Thus, through I/O mapping, one has

$$\begin{bmatrix} \mathbf{s}^T \\ P^T \end{bmatrix}\begin{bmatrix} n & n \\ 0 & K \end{bmatrix} = \begin{bmatrix} 1 & 1 \\ 0 & 1 \end{bmatrix}\begin{bmatrix} n & n \\ 0 & K \end{bmatrix} = \begin{bmatrix} n & n+K \\ 0 & K \end{bmatrix}. \qquad (25)$$

This implies that the input x[n] will be fed into PE #0 of the systolic array at the nth clock cycle, and the output y[n] will be available at PE #K at the (n + K)th clock cycle. The node mapping, arc mapping, and I/O mapping are summarized in Figure 8. At the left of Figure 8, the original DG is overlaid with the equi-temporal hyperplanes, depicted as parallel dotted lines. To the right of Figure 8 is the systolic array, with its local buses and the number of buffers (delays) on each bus. This array is a more abstract version of what is presented in Figure 2.
Fig. 8 Linear assignment and schedule of convolution algorithm
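The numbers in Example 1 can be checked mechanically. The sketch below (ours, using NumPy; names are illustrative) reproduces the node mapping of eq. (23) and the arc mapping of eq. (24):

```python
# Numeric check of the linear schedule/assignment of Example 1.
import numpy as np

s = np.array([1, 1])                   # scheduling vector s^T
P = np.array([0, 1])                   # PE basis P^T (projects along a = [1, 0]^T)
D = np.array([[1, 0, 1],               # dependence matrix, columns ordered
              [0, 1, 1]])              # as in eq. (24)

# Node mapping, eq. (23): iteration (n, k) -> (time n + k, PE #k)
for n in range(7):
    for k in range(min(4, n) + 1):
        i = np.array([n, k])
        assert s @ i == n + k and P @ i == k

# Arc mapping, eq. (24): first row = buffers per bus, second row = direction
print(np.vstack([s, P]) @ D)    # [[1 1 2]
                                #  [0 1 1]]
```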
4 Wavefront Array Processors

4.1 Synchronous versus Asynchronous Global On-chip Communication

The original systolic array architecture adopted a globally synchronous communication model: it is assumed that a global clock signal is available to synchronize the state transitions of every storage element on chip. However, as predicted by Moore's law, in modern integrated circuits transistor sizes continue to shrink, and the number of transistors on a chip continues to increase. These trends make it more and more difficult to implement a globally synchronized clocking scheme on chip. On the one hand, the wiring propagation delay does not scale down as transistor feature sizes are reduced. As such, the signal propagation delay becomes very prominent compared to the logic gate propagation delay. As the on-chip clock frequency exceeds the gigahertz threshold, the adverse impacts of clock skew become more difficult to compensate. On the other hand, as the number of on-chip transistors increases, so do the complexity and size of the on-chip clock distribution network. The power consumption required to distribute a gigahertz clock signal synchronously over the entire chip becomes too large to be practical.
In view of the potential difficulties in realizing the globally synchronous clocking scheme required by the original systolic array design, an asynchronous array processor, known as the wavefront array processor, has been proposed.
4.2 Wavefront Array Processor Architecture

According to [14, 16], a wavefront array is a computing network with the following features:

• Self-timed, data-driven computation: No global clock is needed, as the computation is self-timed.
• Regularity, modularity, and local interconnection: The array consists of modular processing units with regular and (spatially) local interconnections.
• Programmability in a wavefront language or data flow graph (DFG): Computing algorithms implemented on a wavefront array processor may be represented with a data flow graph. Computation activities propagate through the processor array like a series of wavefronts propagating across the surface of water.
• Pipelinability with linear-rate speed-up: A wavefront array should exhibit a linear-rate speed-up. With M PEs, a wavefront array promises to achieve an O(M) speed-up in terms of processing rate.

The major distinction between the wavefront array and the systolic array is that there is no global timing reference in the wavefront array. In the wavefront architecture, information transfer is by mutual agreement between a PE and its immediate neighbors using, say, an asynchronous handshaking protocol [14, 16].
4.3 Mapping Algorithms to Wavefront Arrays

In general, there are three formal methodologies for the derivation of wavefront arrays [15]:

1. Map a localized dependence graph directly to a data flow graph (DFG). Here a DFG is adopted as a formal abstract model for wavefront arrays. A systematic procedure can be used to map a dependence graph (DG) to a DFG.
2. Convert a signal flow graph into a DFG (and hence a wavefront array) by properly imposing several key data flow hardware elements.
3. Trace the computational wavefronts and pipeline the fronts through the processor array. This approach is elaborated below.

The notion of computational wavefronts offers a very simple way to design wavefront computing, which consists of three steps:

1. Decompose an algorithm into an orderly sequence of recursions;
2. Map the recursions onto corresponding computational wavefronts in the array;
Fig. 9 Wavefront processing for matrix multiplication [15]
3. Pipeline the wavefronts successively through the processor array.
4.4 Example: Wavefront Processing for Matrix Multiplication

The notion of computational wavefronts may be better illustrated by the example of the matrix multiplication algorithm, where A, B, and C are assumed to be N × N matrices:

$$C = A \times B. \qquad (26)$$

The topology of the matrix multiplication algorithm can be mapped naturally onto the square, orthogonal N × N matrix array depicted in Figure 9. The computing network serves as a (data) wave-propagating medium. To be precise, let us examine the computational wavefront for the first recursion in matrix multiplication. Suppose that the registers of all the PEs are initially set to zero, that is, Cij(0) = 0. The elements of A are stored in the memory modules to the left (in columns) and those of B in the memory modules on the top (in rows). The process starts with PE (1, 1), which computes

C11(1) = C11(0) + a11 b11.
The computational activity then propagates to the neighboring PEs (1, 2) and (2, 1), which execute

C12(1) = C12(0) + a11 b12 and C21(1) = C21(0) + a21 b11.

The next front of activity will be at PEs (3, 1), (2, 2), and (1, 3), thus creating a computational wavefront traveling down the processor array. This computational wavefront is similar to optical wavefronts (they both obey Huygens' principle), since each processor acts as a secondary source and is responsible for the propagation of the wavefront. It may be noted that wave propagation implies localized data flow. Once the wavefront sweeps through all the cells, the first recursion is over. As the first wave propagates, we can execute an identical second recursion in parallel by pipelining a second wavefront immediately after the first one. For example, the (1, 1) processor executes

C11(2) = C11(1) + a12 b21 = a11 b11 + a12 b21.

Likewise, each processor (i, j) will execute (from k = 1 to N)

Cij(k) = Cij(k − 1) + aik bkj = ai1 b1j + ai2 b2j + ... + aik bkj

and so on. In wavefront processing, the pipelining technique is feasible because the wavefronts of two successive recursions never intersect. The processors executing the recursions at any given instant are different, so any contention problems are avoided. Note that the successive pipelining of the wavefronts furnishes an additional dimension of concurrency. The separate roles of pipelining and parallel processing also become evident when we carefully inspect how the parallel computational wavefronts are pipelined successively through the processor array. Generally speaking, parallel processing activities always occur at the PEs on the same front, whereas pipelining activities are perpendicular to the fronts. With reference to the wavefront processing example in Figure 9, PEs on the anti-diagonals of the wavefront array execute in parallel, since each of the PEs processes information independently. On the other hand, pipeline processing takes place along the diagonal direction, in which the computational wavefronts are piped. In this example, the wavefront array consists of N × N processing elements with regular and local interconnections. Figure 9 shows the first 4 × 4 processing elements of the array. The computing network serves as a (data) wave-propagating medium. Hence the hardware has to support pipelining the computational wavefronts as fast as resource and data availability allow. The (average) time interval T between two separate wavefronts is determined by the availability of the operands and operators.
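The front-by-front ordering is easy to mimic in software. The following sketch (our own model, not from the chapter) fires all PEs on one anti-diagonal together for each recursion k and checks the result against an ordinary matrix product:

```python
# Software model of wavefront matrix multiplication (Sec. 4.4):
# PEs on the same anti-diagonal i + j = f fire together; successive
# recursions k are pipelined as successive waves.
import numpy as np

def wavefront_matmul(A, B):
    N = A.shape[0]
    C = np.zeros((N, N))
    for k in range(N):                  # k-th recursion = k-th wave
        for f in range(2 * N - 1):      # fronts sweep the array
            for i in range(N):
                j = f - i
                if 0 <= j < N:          # PE (i, j) lies on front f
                    C[i, j] += A[i, k] * B[k, j]
    return C

A = np.arange(9.0).reshape(3, 3)
assert np.allclose(wavefront_matmul(A, np.eye(3)), A)
```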
4.5 Comparison of Wavefront Arrays against Systolic Arrays

The main difference between a wavefront array processor and a systolic array lies in hardware design, e.g., in clock and buffer arrangements, architectural expandability, pipeline efficiency, programmability in a high-level language, and the capability to cope with timing uncertainties in fault-tolerant designs. As to the synchronization aspect, the clocking scheme is a critical factor for large-scale array systems, and global synchronization often incurs severe hardware design burdens in terms of clock skew. The synchronization time delay in systolic arrays is primarily due to the clock skew, which can vary drastically depending on the size of the array. On the other hand, in the data-driven wavefront array, a global timing reference is not required, and thus local synchronization suffices. The asynchronous data-driven model, however, incurs a fixed time delay and hardware overhead due to handshaking. From the perspective of pipelining rate, the data-driven computing in the wavefront array may improve pipelinability. This becomes especially helpful when variable processing times are used in individual PEs. A simulation study on a recursive least squares minimization computation also reports a speedup by a factor of almost two in favor of the wavefront array over a globally clocked systolic array [4]. In general, a systolic array is useful when the PEs are simple primitive modules, since the handshaking hardware in a wavefront array would represent a non-negligible overhead for such applications. On the other hand, a wavefront array is more applicable when the modules of the PEs are more complex (such as floating-point multiply-and-add), when synchronization of a large array becomes impractical, or when a reliable computing environment (such as fault tolerance) is essential.
5 Hardware Implementations of Systolic Arrays

5.1 Warp and iWarp

Warp [1] is a prototype linear systolic array processor developed at CMU in the mid-1980s. As illustrated in Figure 10, the Warp array contains 10 identical Warp cells interconnected as a linear array. It is designed as an attached processor to a host processor through an interface unit. Each Warp cell has three inter-cell communication links: one address link and two data links (X and Y). They are connected to the nearest neighboring cells or the interface unit. Each cell contains two floating-point units (one for multiplication and one for addition) with corresponding register files, and two local memory banks (2K words each, with 32 bits/word) for resident and temporary data; each communication link also has a 512-word buffer queue. All these functional units are interconnected via a crossbar switch for intra-cell communication.
Fig. 10 Warp system overview [1]
The Warp cell is micro-programmed with horizontal micro-code. Although all cells execute the same cell program, broadcasting micro-code to all cells is not practical and would violate the basic principle of localized communication. A noticeable feature of the Warp processor array is that its inter-cell communication is asynchronous. It is argued [1] that the synchronous, fine-grained inter-PE communication scheme of the original systolic array is too restrictive and is not suitable for practical implementations. Instead, a hardware-assisted run-time flow control scheme together with a relatively large queue size allows more efficient inter-cell communication without incurring excessive overheads. The Warp array uses a specially designed programming language called "W2". It explicitly supports communication primitives such as "receive" and "send" to transfer data between adjacent cells. The program execution at a cell will stall if either the send or receive statement cannot be completed due to an empty receive queue (nothing to receive from) or a full send queue (nowhere to send to). Thus, the programmer bears the responsibility of writing a deadlock-free parallel program to run on the Warp processor array. The performance of the Warp processor array is reported to be hundreds of times faster than running the same type of algorithm on a VAX 11/780, a popular minicomputer at the time of Warp's development. The development of the Warp processor array is significant in that it is the first hardware systolic array implementation. Lessons learned from this project also motivated the development of iWarp. The iWarp project [3, 7] was a follow-up project of Warp and started in 1988. Its purpose was to investigate issues involved in building and using high-performance computer systems with powerful communication support. The project led to the construction of the iWarp machines, jointly developed by Carnegie Mellon University and Intel Corporation. As shown in Figure 11, the basic building block of the iWarp system is a full-custom VLSI component integrating an LIW (long instruction word) microprocessor, a network interface, and a switching node into one single chip of 1.2 cm × 1.2 cm silicon. The iWarp cell consists of a computation agent, a communication agent, and a local memory. The computation agent includes a 32-bit micro-processor with 96-bit-wide instruction words, an integer/logic unit, a floating-point multiplier, and
Fig. 11 Photograph of an iWarp chip [3]
a floating-point adder. It runs at a clock speed of 20 MHz. The communication agent has 4 separate full-duplex physical data links capable of transferring data at 40 MB/s. These data links can be configured into 20 virtual channels. The clock speed of the communication agent is 40 MHz. Each cell is attached to a local memory sub-system including up to 4 MB of static RAM (random access memory) and/or 16 MB of DRAM. The iWarp system is designed to be configured as an n × m torus array. A typical system would have 64 cells configured as an 8 × 8 torus array, yielding 1.2 GFLOP/s peak performance. The communication agent supports word-level flow control between connected cells and transfers messages word by word to implement wormhole routing [18]. Exposing this mechanism to the computation agents allows programs to communicate systolically. Moreover, a communication agent can automatically route messages to the appropriate destination without the intervention of the computation agent.
5.2 SAXPY Matrix-1

Claimed to be "the first commercial, general-purpose, systolic computer", the Matrix-1 [6] is a vector array processor developed by the Saxpy Computer Co. in 1987 for scientific and signal processing applications. It promises 1 GFLOP throughput by means of 32-fold parallelism, fast (64 ns) pipelined floating-point units, and fast and flexible local memories.
Fig. 12 Block diagram of a Matrix-1 system
At the system level, a Matrix-1 system (cf. Figure 12) consists of a system controller, system memory, and mass storage in addition to the matrix processor. These system components are interconnected with a high-speed (320 MB/sec) bus (S-bus). The system memory has a maximum capacity of 128 Mbytes. It uses only physical addresses and hence allows faster access. The Matrix Processor (Figure 13) is a ring-connected linear array of 8, 16, 24, or 32 vector processors. Each processor is called a computational zone. All zones receive the same control and address instructions at each clock cycle. The Matrix Processor can function in a systolic mode (in which data are transferred from one zone to the next in a pipelined fashion) or in a block mode (in which all zones operate simultaneously to execute vector operations). Each zone has a pipelined 32-bit floating-point multiplier; a pipelined 32-bit floating-point adder with logic capabilities; and a 4K-word local memory implemented as a two-way interleaved zone buffer. These components operate at a clock frequency of 16 MHz. With 32 zones, the maximum computing power approaches 960 MFLOP. The Matrix-1 employs an application programming interface (API) approach to interface with the host processor. The user program is written in C or Fortran and makes calls to the matrix processor subroutines. Experienced programmers may
Fig. 13 The Matrix Processor Zone architecture of SAXPY Matrix-1 computer
Fig. 14 INMOS T805 floating-point processor (http://www.classiccmp.org/transputer/)
also write their own matrix processor subroutines or directly engage in assembly-level programming of the matrix processors.
5.3 Transputer

The Transputer (transistor computer) [8, 19, 22, 26] is a microprocessor developed by Inmos Ltd. in the mid-1980s to support parallel processing. The name was selected to indicate the role the individual Transputers would play: numbers of them would be used as basic building blocks, just as transistors are in integrated circuits. The most distinctive feature of a Transputer chip is that it has four serial links to communicate with up to four other Transputers simultaneously, each at 5, 10, or 20 Mbit/s. The circuitry to drive the links is all on the Transputer chip, and only two wires are needed to connect two Transputers together. The communication links between processors operate concurrently with the processing unit and can transfer data simultaneously on all links without the intervention of the CPU. Supporting the links is additional circuitry that handles scheduling of the traffic over them. Processes waiting on communications automatically pause while the networking circuitry finishes its reads or writes, and other processes running on the Transputer are then given that processing time. These unique properties allow multiple Transputer chips to be easily configured into various topologies, such as linear arrays, mesh arrays, or trees, to support parallel processing. Depicted in Figure 14 are a chip layout picture and a floor plan of the Transputer T805. It has a 32-bit architecture running at a 25 MHz clock frequency. It has an IEEE 754 64-bit on-chip floating-point unit, 4 Kbytes of on-chip static RAM, and may
connect to 4 Gbytes of directly addressable external memory (no virtual memory) at a 33 Mbytes/sec sustained data rate. It uses a 5 MHz clock input and runs on a single 5 V power supply. Transputers were intended to be programmed using the Occam programming language, based on the CSP process calculus. Occam supported concurrency and channel-based inter-process or inter-processor communication as a fundamental part of the language. With the parallelism and communications built into the chip, and with the language interacting with it directly, writing code for things like device controllers became trivial. Implementations of more mainstream programming languages, such as C, FORTRAN, Ada, and Pascal, were also later released by both INMOS and third-party vendors.
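The flavor of Occam's channel style can be mimicked in ordinary software. Below is a small hypothetical Python sketch (ours, not Occam) in which two threads communicate over a blocking, bounded channel, loosely analogous to two Transputers joined by a serial link:

```python
# CSP-style channel sketch: a bounded queue plays the role of an
# Occam channel between a producer and a consumer "process".
import threading, queue

chan = queue.Queue(maxsize=1)      # rendezvous-like bounded channel

def producer():
    for v in range(5):
        chan.put(v)                # blocks until the consumer takes it
    chan.put(None)                 # end-of-stream marker

def consumer():
    while (v := chan.get()) is not None:
        print("received", v)

t1 = threading.Thread(target=producer)
t2 = threading.Thread(target=consumer)
t1.start(); t2.start(); t1.join(); t2.join()
```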
5.4 TMS320C40

The TMS320C40 [23] is Texas Instruments' floating-point digital signal processor, developed in the early 1990s. The 'C40 has six on-chip communication ports for processor-to-processor communication with no external glue logic. The communication ports remove input/output bottlenecks, and the independent smart DMA coprocessor is able to relieve the CPU of the input/output burden. Each of the six communication ports is equipped with a 20 Mbytes/s bidirectional interface and separate input and output 8-word-deep FIFO buffers. Direct processor-to-processor connection is supported by automatic arbitration and handshaking. The DMA coprocessor allows concurrent I/O and CPU processing for sustained CPU performance. The processor features single-cycle 40-bit floating-point and 32-bit integer multipliers, a 512-byte instruction cache, and 8 Kbytes of single-cycle dual-access program or data RAM. It also contains separate internal program, data, and DMA coprocessor buses to support massive concurrent input/output (I/O) of program and data. The TMS320C40 is designed to support general-purpose parallel computation in different configurations. With 6 bidirectional link ports, it directly supports a hypercube configuration containing up to 2^6 = 64 processing elements. It can, of course, also be easily configured to form a linear or two-dimensional mesh-connected processor array to support systolic computing.
Fig. 15 Block motion estimation
6 Recent Developments and Real-World Applications

6.1 Block Motion Estimation

Block motion estimation is a critical computation step in every international video coding standard, including MPEG-I, MPEG-II, MPEG-IV, H.261, H.263, and H.264. The algorithm consists of a very simple loop body (a sum of absolute differences) embedded in a six-level nested loop. For real-time, high-definition video encoding applications, the motion estimation operation must rely on special-purpose on-chip processor array structures that are heavily influenced by the systolic array concept. The notion of block motion estimation is illustrated in Figure 15. On the left of this figure is the current frame, which is to be encoded and transmitted from the encoding end. On the right is the reference frame, which has already been transmitted and reconstructed at the receiver end. The encoder also computes a copy of this reconstructed reference frame for the purpose of motion estimation. Both the current frame and the reference frame are divided into macro-blocks, as shown with dotted lines. Now focus on the current block, which is the shaded macro-block at the second row and the fourth column of the current frame. The goal of motion estimation is to find a matching macro-block in the reference frame, in the vicinity of the location of the current block, that resembles the current block. Usually, the current frame and the reference frame are separated by a couple of frames temporally and are likely to contain very similar scenes. Hence, there exists a high degree of temporal correlation between them. As such, there is a high probability that the current block can find a very similar matching block in the reference frame. The displacement between the location of the current block and that of the matching macro-block is the motion vector of the current block. This is shown on the right-hand side of Figure 15. By transmitting the motion vector alone to the receiver, a predicted copy of the current block can be obtained by copying the matching macro-block from the reference frame. That process is known as motion compensation. The similarity between the current block in the current frame and a candidate matching block in the reference frame is measured using a mean of absolute difference (MAD) criterion:
Do h = 0 to Nh − 1
  Do v = 0 to Nv − 1
    MV(h, v) = (0, 0)
    Dmin(h, v) = ∞
    Do m = −p to p
      Do n = −p to p
        MAD(m, n) = 0
        Do i = h ∗ N to (h + 1) ∗ N − 1
          Do j = v ∗ N to (v + 1) ∗ N − 1
            MAD(m, n) = MAD(m, n) + |x(i, j) − y(i + m, j + n)|
          End do j
        End do i
        If Dmin(h, v) > MAD(m, n)
          Dmin(h, v) = MAD(m, n)
          MV(h, v) = (m, n)
        End if
      End do n
    End do m
  End do v
End do h
Fig. 16 Full search block matching motion estimation
$$MAD(m, n) = \frac{1}{N^2} \sum_{i=0}^{N-1} \sum_{j=0}^{N-1} \left| x(i, j) - y(i + m, j + n) \right| \qquad (27)$$
where the size of the macro-block is N pixels by N pixels, x(i, j) is the value of the (i, j)th pixel of the current frame, and y(i + m, j + n) is the value of the (i + m, j + n)th pixel of the reference frame. MAD(m, n) is the mean absolute difference between the current block and the candidate matching block with a displacement of (m, n), −p ≤ m, n ≤ p, where p is a bounded, pre-set search range; the resulting search window is usually twice or thrice the size of a macro-block. The motion vector (MV) of the current block is found as

$$MV = \arg\min_{-p \le m, n \le p} MAD(m, n) \qquad (28)$$
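A direct software rendering of eqs. (27) and (28) for a single macro-block might look like the sketch below (our own; function and variable names are illustrative, and frames are assumed to be float arrays). Looping it over all (h, v) gives the six-level nest of Figure 16:

```python
# Full-search block matching for one macro-block, per eqs. (27)-(28).
import numpy as np

def full_search_mv(cur, ref, h, v, N, p):
    blk = cur[h*N:(h+1)*N, v*N:(v+1)*N]             # current block
    best_mad, best_mv = np.inf, (0, 0)
    for m in range(-p, p + 1):
        for n in range(-p, p + 1):
            r0, c0 = h*N + m, v*N + n
            if r0 < 0 or c0 < 0 or r0 + N > ref.shape[0] or c0 + N > ref.shape[1]:
                continue                             # candidate leaves the frame
            mad = np.abs(blk - ref[r0:r0+N, c0:c0+N]).mean()   # eq. (27)
            if mad < best_mad:
                best_mad, best_mv = mad, (m, n)      # eq. (28)
    return best_mv
```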
We assume each video frame is partitioned into Nh × Nv macro-blocks. With eqs. (27) and (28), one may express the whole-frame full-search block matching motion estimation algorithm as a six-level nested loop, as shown in Figure 16. The performance requirement for such a motion estimation operation is rather stringent. Take MPEG-II for example: a typical video frame in 1080p format contains 1920 × 1080 pixels. With a macro-block size N = 16, one has Nh = 1920/16 = 120 and Nv = 1080/16 = 68 (rounding up). Usually N = 16 and p = N/2. Since there are 30 frames per second, the number of sum-of-absolute-difference operations that need to be performed is around 30 × Nh × Nv × (2p + 1)² × N² ≈ 1.8 × 10^10 operations/second. Since motion estimation is only part of the video encoding operations, an application-specific hardware module is a desirable implementation option. In view of
Fig. 17 2-D array with spiral interconnection (N = 4 and p = 2) [27]
the regularity of the loop-nest formulation and the simplicity of the loop-body operations (addition/subtraction), a systolic array solution is a natural choice. In this direction, numerous motion estimation processor array structures have been proposed, including 2D mesh arrays, 1D linear arrays, tree-structured arrays, and hybrid structures. Some of these realizations focused on the inner 4-level nested loop of the algorithm in Figure 16 [12, 20], and some took the entire 6-level loop nest into account [5, 11, 27]. An example is shown in Figure 17. In this configuration, the search-area pixel y is broadcast to every processing element in the same column, and the current-frame pixel x is propagated along the spiral interconnection links. The constraint N = 2p is imposed to achieve a low input/output pin count. A simple PE is composed of only two 8-bit adders and a comparator, as shown in Figure 18. A number of video encoder micro-chips including motion estimation have been reported over the years. Earlier motion estimation architectures often use some variant of a pixel-based systolic array to evaluate the MAD operations. Often a fast search algorithm is used in lieu of the full search algorithm due to speed and power
Fig. 18 Block diagram of an individual processing element [27]
Table 1 MPEG-IV motion estimation chip features [17]

  Technology            TSMC 0.18 µm, 1P6M CMOS
  Supply Voltage        1.8 V (Core) / 3.3 V (I/O)
  Core Area             1.78 × 1.77 mm²
  Logic Gates           201 K (2-input NAND gates)
  SRAMs                 4.56 kB
  Encoding Feature      MPEG-4 SP
  Search Range          H [−16, +15.5], V [−16, +15.5]
  Operating Frequency   9.5 MHz (CIF), 28.5 MHz (VGA)
  Power Consumption     5 mW (CIF, 9.5 MHz, 1.3 V); 18 mW (VGA, 28.5 MHz, 1.4 V)
consumption concerns. One example is an MPEG-IV simple profile (SP) encoder chip reported in [17]. Some chip characteristics are given in Table 1. As shown in Figure 19, the motion estimation is carried out with 16 adder trees (processing units, PUs) for the sum-of-absolute-difference calculation, and the motion vectors are selected based on these results. A chip micro-graph is depicted in Figure 20.
6.2 Wireless Communication

Systolic arrays have also found interesting applications in wireless communication baseband signal processing. A typical block diagram of wireless transceiver
Fig. 19 Motion estimation architecture [17]
Fig. 20 MPEG-IV encoder chip die micro-graph [17]
baseband processing algorithms is depicted in Figure 21. It includes the fast Fourier transform (FFT) and inverse fast Fourier transform (IFFT), channel estimator/equalizer, data interleaver, and channel encoder/decoder, etc. In [21], a reconfigurable systolic array of CORDIC processing nodes (PNs) is proposed to realize the computation-intensive portion of the wireless baseband operations. CORDIC (COordinate Rotation DIgital Computer) [24, 25] is an arithmetic computing algorithm that has found many interesting signal processing applications [9]. Specifically, it is an efficient architecture for realizing unitary rotation operations such as the Jacobi rotation described in eqs. (3), (4), and (5) of this chapter. With CORDIC, the rotation angle θ is represented as a weighted sum of a sequence of elementary angles {a(i); 0 ≤ i ≤ n − 1}, where a(i) = tan⁻¹ 2⁻ⁱ. That is,
[Figure: receive path: RF/IF Receiver → Remove Cyclic Extension → FFT → Parallel to Serial → De-interleaver, channel equalizer, and error decoder; transmit path: Interleaver and error encoder → Serial to Parallel → IFFT → Add Cyclic Extension → RF/IF Transmit]
Fig. 21 A typical block diagram of wireless transceiver baseband processing
Fig. 22 CORDIC systolic array [21]
$$\theta = \sum_{i=0}^{n-1} \mu_i\, a(i) = \sum_{i=0}^{n-1} \mu_i \tan^{-1} 2^{-i}, \qquad \mu_i \in \{-1, +1\}. \qquad (29)$$
As such, the rotation operation through each elementary angle may easily be realized with simple shift-and-add operations:

$$\begin{bmatrix} x(i+1) \\ y(i+1) \end{bmatrix} = \begin{bmatrix} \cos a(i) & -\mu_i \sin a(i) \\ \mu_i \sin a(i) & \cos a(i) \end{bmatrix}\begin{bmatrix} x(i) \\ y(i) \end{bmatrix} = k(i)\begin{bmatrix} 1 & -\mu_i 2^{-i} \\ \mu_i 2^{-i} & 1 \end{bmatrix}\begin{bmatrix} x(i) \\ y(i) \end{bmatrix}$$

where $k(i) = 1/\sqrt{1 + 2^{-2i}}$. A block diagram of a 4 × 4 CORDIC reconfigurable systolic array is shown in Figure 22. The control unit is a general-purpose RISC (reduced instruction set computer) micro-processor. The PN array employs a data-driven (data flow) paradigm, so that globally synchronous clocking is not required. During the execution phase, the address generator provides an address stream to the data memory bank. Accessed data is fed from the data memory bank to the PN array and back via the memory interface, which adds a context pointer to the data. With the context pointer, dynamic reconfiguration of the PN array within a single clock cycle becomes possible. The PN architecture is depicted in Figure 23, where two CORDIC processing elements and two delay processing elements are interconnected via the communication agent, which also handles external communication with other PNs. Using this CORDIC reconfigurable systolic array, a minimum mean square error detector has been implemented for an OFDM (orthogonal frequency division multiplexing) MIMO (multiple input, multiple output) wireless transceiver. A QR decomposition recursive least squares (QRD-RLS) triangular systolic array is implemented on an FPGA prototype system and is shown in Figure 24.
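A scalar software model of the CORDIC iteration is a few lines long. The sketch below (ours; not the array implementation of [21]) picks the sign μi that drives the residual angle toward zero and applies the accumulated scaling at the end:

```python
# CORDIC rotation sketch per eq. (29): rotate (x, y) by theta using
# shift-and-add style updates; the k(i) factors are applied as one
# constant K after the iterations.
import math

def cordic_rotate(x, y, theta, n=32):
    K = 1.0
    for i in range(n):
        K /= math.sqrt(1.0 + 2.0 ** (-2 * i))      # product of k(i)
        mu = 1 if theta >= 0 else -1               # mu_i in {-1, +1}
        x, y = x - mu * y * 2.0 ** -i, y + mu * x * 2.0 ** -i
        theta -= mu * math.atan(2.0 ** -i)         # subtract mu_i * a(i)
    return K * x, K * y

print(cordic_rotate(1.0, 0.0, math.pi / 6))   # ~ (0.8660, 0.5000)
```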
Fig. 23 Processing node architecture [21]
7 Conclusions

In this chapter, the historically important systolic array architecture has been discussed. The basic systolic design methodology was reviewed, and the wavefront array processor architecture was surveyed. Several implementations of systolic-array-like parallel computing platforms, including Warp, SAXPY Matrix-1, the Transputer, and the TMS320C40, were briefly reviewed. Real-world applications of systolic arrays to video coding motion estimation and wireless baseband processing were also discussed.
Fig. 24 QRD-RLS triangular systolic array [21]
References

1. Annaratone, M., et al.: The WARP computer: Architecture, implementation, and performance. IEEE Trans. Computers 36, 1523–1538 (1987)
2. Arnould, E., Kung, H., et al.: A systolic array computer. In: Proc. IEEE International Conference on Acoustics, Speech, and Signal Processing, vol. 10, pp. 232–235 (1985)
3. Borkar, S., Cohn, R., Cox, G., Gross, T., Kung, H.T., Lam, M., Levine, M., Moore, B., Moore, W., Peterson, C., Susman, J., Sutton, J., Urbanski, J., Webb, J.: Supporting systolic and memory communication in iWarp. In: Proc. 17th Intl. Symposium on Computer Architecture, pp. 71–80 (1990)
4. Broomhead, D., Harp, J., McWhirter, J., Palmer, K., Roberts, J.: A practical comparison of the systolic and wavefront array processing architectures. In: Proc. Int'l Conf. Acoustics, Speech, and Signal Processing, vol. 10, pp. 296–299 (1985)
5. Chen, Y.K., Kung, S.Y.: A systolic methodology with applications to full-search block matching architectures. J. of VLSI Signal Processing 19(1), 51–77 (1998)
6. Foulser, D.E.: The Saxpy Matrix-1: A general-purpose systolic computer. IEEE Computer 20, 35–43 (1987)
7. Gross, T., O'Hallaron, D.R.: iWarp: Anatomy of a Parallel Computing System. MIT Press, Boston, MA (1998)
8. Homewood, M., May, D., Shepherd, D., Shepherd, R.: The IMS T800 Transputer. IEEE Micro 7(5), 10–26 (1987)
9. Hu, Y.H.: CORDIC-based VLSI architectures for digital signal processing. IEEE Signal Processing Magazine 9, 16–35 (1992)
10. iWarp project. URL http://www.cs.cmu.edu/afs/cs/project/iwarp/archive/WWW-pages/iwarp.html
11. Kittitornkun, S., Hu, Y.: Systolic full-search block matching motion estimation array structure. IEEE Trans. Circuits Syst. Video Technology 11, 248–251 (2001)
12. Komarek, T., Pirsch, P.: Array architectures for block matching algorithms. IEEE Trans. Circuits Syst. 26(10), 1301–1308 (1989)
13. Kung, H.T.: Why systolic architectures? IEEE Computer 15, 37–46 (1982)
14. Kung, S.Y.: On supercomputing with systolic/wavefront array processors. Proc. IEEE 72, 1054–1066 (1984)
15. Kung, S.Y.: VLSI Array Processors. Prentice Hall, Englewood Cliffs, NJ (1988)
16. Kung, S.Y., Arun, K.S., Gal-Ezer, R.J., Bhaskar Rao, D.V.: Wavefront array processor: Language, architecture, and applications. IEEE Trans. Computers 31(11), 1054–1066 (1982)
17. Lin, C.P., Tseng, P.C., Chiu, Y.T., Lin, S.S., Cheng, C.C., Fang, H.C., Chao, W.M., Chen, L.G.: A 5mW MPEG4 SP encoder with 2D bandwidth-sharing motion estimation for mobile applications. In: Proc. International Solid-State Circuits Conference, pp. 1626–1635. San Francisco, CA (2006)
18. Ni, L.M., McKinley, P.: A survey of wormhole routing techniques in direct networks. IEEE Computer 26, 62–76 (1993)
19. Nicoud, J.D., Tyrrell, A.M.: The transputer T414 instruction set. IEEE Micro 9(3), 60–75 (1989)
20. Pan, S.B., Chae, S., Park, R.: VLSI architectures for block matching algorithm. IEEE Trans. Circuits Syst. Video Technol. 6(1), 67–73 (1996)
21. Seki, K., Kobori, T., Okello, J., Ikekawa, M.: A CORDIC-based reconfigurable systolic array processor for MIMO-OFDM wireless communications. In: Proc. IEEE Workshop on Signal Processing Systems, pp. 639–644. Shanghai, China (2007)
22. Taylor, R.: Signal processing with occam and the transputer. IEE Proceedings F: Communications, Radar and Signal Processing 131(6), 610–614 (1984)
23. Texas Instruments: TMS320C40 Digital Signal Processors (1996). URL http://focus.ti.com/docs/prod/folders/print/tms320c40.html
24. Volder, J.E.: The CORDIC trigonometric computing technique. IRE Trans. on Electronic Computers EC-8(3), 330–334 (1959)
25. Walther, J.S.: A unified algorithm for elementary functions. In: Spring Joint Computer Conf. (1971)
26. Whitby-Strevens, C.: Transputers: past, present and future. IEEE Micro 10(6), 16–19, 76–82 (1990)
27. Yeo, H., Hu, Y.: A novel modular systolic array architecture for full-search block matching motion estimation. IEEE Trans. Circuits Syst. Video Technol. 5(5), 407–416 (1995)
Decidable Signal Processing Dataflow Graphs: Synchronous and Cyclo-Static Dataflow Graphs

Soonhoi Ha and Hyunok Oh
1 Introduction Digital signal processing (DSP) algorithms are often informally, but intuitively, described by block diagrams in which a block represents a function block and an arc or edge represents a dependency between function blocks. While a block diagram is not a programming model, it resembles a formal dataflow graph in appearance (see Chapter 28). Fig. 1 shows a block-diagram representation of a simple DSP algorithm, which can also be regarded as a dataflow graph of the algorithm. A dataflow graph is a graphical representation of a dataflow model of computation in which a node, or an actor, represents a function block that can be executed, or fired, when enough input data are available. An arc is a FIFO queue that delivers data samples, also called tokens, from an output port of a source node to an input port of a destination node. If a node has no input port, the node becomes a source node that is always executable. In DSP algorithms, a source node represents an interface block that receives triggering data from an outside source. The "Read" block in Fig. 1 is a source block that reads audio data samples from an outside source. A
[Figure: a dataflow graph in which Read feeds both Filter and Store, and Filter feeds Play]

Fig. 1 Dataflow graph of a simple DSP algorithm.
Soonhoi Ha Seoul National University, Korea, e-mail: [email protected] Hyunok Oh Hanyang University, Korea, e-mail: [email protected]
dataflow graph is usually assumed to execute indefinitely as long as the source block produces samples on its output ports. The dataflow model of computation was first introduced as a parallel programming model for the associated computer architecture, called dataflow machines [5]. While the granularity of a node is assumed to be as fine as a machine instruction in dataflow machine research, the node granularity can be as large as a well-defined function module, such as a filter or an FFT unit, in a DSP algorithm representation. The main advantage of the dataflow model as a programming model is that it specifies only the true dependencies between nodes, so that there are numerous ways of executing the nodes as long as the data dependencies between the nodes are preserved. For example, blocks "Filter" and "Store" in Fig. 1 can be executed in parallel after they receive data samples from the "Read" block. Thus the key technique for executing a dataflow graph is scheduling, which is to determine where and when to execute the nodes on a given target platform that may contain multiple processors and hardware IPs. The dynamic scheduling decision can be made at run time by examining the existence of data on the input arcs of nodes and executing the ready nodes on an appropriate processing element. There are a couple of drawbacks to dynamic scheduling. First, dynamic scheduling requires run-time overhead, in terms of both space and time, to manage the ready nodes and schedule them. Second, with a limited amount of resources, dynamic scheduling of nodes may incur buffer overflow or a deadlock situation if buffers are not carefully managed. It is not decidable in general whether a dataflow program can be executed without buffer overflow or deadlock when dynamic scheduling is used. On the other hand, some dataflow models have restricted semantics so that the scheduling decision can be made at compile time. If the execution order of nodes is determined statically at compile time, we can decide, before running the program, whether the program has the possibility of buffer overflow or deadlock. Such dataflow graphs are called decidable dataflow graphs. The SDF (synchronous dataflow) model, proposed by Lee in [8], is a pioneering decidable model that has been widely used for DSP algorithm specification in many design environments, including Ptolemy [4] and Grape II [7]. Subsequently, a number of generalizations of the SDF model have been proposed to extend its expressive capability. The most popular extension is CSDF (cyclo-static dataflow) [3]. In this chapter, we explain the decidable dataflow models and focus on their characteristics of decidability. For decidable dataflow graphs, the most important issue is to determine an optimal static schedule, since there are numerous ways to schedule the nodes.
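To make the contrast with static scheduling concrete, here is a toy dynamic scheduler for the graph of Fig. 1 (a sketch of our own; the chapter prescribes no such code). It fires any node whose input arcs all hold tokens, which is exactly the run-time readiness test described above:

```python
# Toy data-driven (dynamic) scheduler for the dataflow graph of Fig. 1.
from collections import deque

arcs = {("Read", "Filter"): deque(), ("Read", "Store"): deque(),
        ("Filter", "Play"): deque()}

def fire(node, sample):
    ins = [q for (s, d), q in arcs.items() if d == node]
    outs = [q for (s, d), q in arcs.items() if s == node]
    if node == "Read":                 # source node: always executable
        for q in outs:
            q.append(sample)
        return
    if all(q for q in ins):            # enough input data available?
        tok = [q.popleft() for q in ins][0]
        for q in outs:
            q.append(tok)

for step in range(3):                  # naive round-robin dynamic schedule
    for node in ("Read", "Filter", "Store", "Play"):
        fire(node, step)
print({arc: list(q) for arc, q in arcs.items()})   # all queues drained
```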
2 SDF (Synchronous Dataflow)

In a dataflow graph, the number of tokens produced (or consumed) per node firing is called the output (or input) sample rate of the corresponding port.
Fig. 2 (a) An SDF graph with the topology of Fig. 1: arc AB with rates 2 and 3, arc BC with rates 2 and 1, and arc AD with rates 1 and 1 and one initial delay (1D); and (b) some periodic schedules for the SDF graph: Σ1: ADADBCCADBCC; Σ2: 3(AD)2(B2(C)); Σ3: 3A3D2B4C.
The simplest dataflow model is single-rate dataflow (SRDF), in which all sample rates are unity. When a port may consume or produce multiple tokens per firing, we call the model multi-rate dataflow (MRDF). Among multi-rate dataflow graphs, synchronous dataflow adds the restriction that the sample rates of all ports are fixed and do not vary at run time. Fig. 2(a) shows an SDF graph with the same topology as Fig. 1, where each arc is annotated with the number of samples produced and consumed by the incident nodes. There may be sample delays associated with an arc. A sample delay is represented as initial samples that are queued in the arc buffer from the beginning and is denoted xD, where x is the number of initial samples, as shown on arc AD in Fig. 2(a). From the sample rate information on each arc, we can determine the relative execution rate of the two end nodes of the arc. In order not to accumulate tokens unboundedly on an arc, the number of samples produced by the source node must equal the number of samples consumed by the destination node in the long run. In the example of Fig. 2(a), node C must execute twice as often as node B on average. Based on this pair-wise information on execution rates, we can determine the ratio of execution rates among all nodes; for nodes A, B, C, and D in Fig. 2(a), the resulting ratio is 3:2:4:3.
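Spelling this out for Fig. 2(a), the per-arc balance relations are

$$2\,x(A) = 3\,x(B) \ (\text{arc } AB), \qquad 2\,x(B) = x(C) \ (\text{arc } BC), \qquad x(A) = x(D) \ (\text{arc } AD),$$

whose smallest positive integer solution gives exactly the stated ratio $x(A) : x(B) : x(C) : x(D) = 3 : 2 : 4 : 3$.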
2.1 Static Analysis

The key analytical property of the SDF model is that the node execution schedule can be constructed at compile time. The number of executions of node A within a schedule is called the repetition count x(A) of the node. A valid schedule is a finite schedule that does not reach deadlock and produces no net change in the number of samples accumulated on each arc. In a valid schedule, the ratio of repetition counts equals the ratio of execution rates among the nodes, so that one iteration of the schedule does not increase the number of samples queued on any arc. If there exists a valid schedule, the SDF graph is said to be consistent. We represent the repetition counts of nodes in a consistent SDF graph G by the vector q_G. For the graph of Fig. 2(a),

$$\mathbf{q}_G = (x(A), x(B), x(C), x(D)) = (3, 2, 4, 3). \qquad (1)$$
Fig. 3 (a) An SDF graph that is sample rate inconsistent and (b) the associated topology matrix

$$\Gamma = \begin{pmatrix} 2 & -1 & 0\\ 0 & 2 & -1\\ 1 & 0 & -2 \end{pmatrix}$$

with rows indexed by arcs AB, BC, and AC, and columns indexed by nodes A, B, and C.
Since an SDF graph imposes only partial ordering constraints between the nodes, the order of node invocations can be determined in various ways. Fig. 2(b) shows three possible valid schedules of the graph in Fig. 2(a). In Fig. 2(b), each parenthesized term $n(X_1 X_2 \ldots X_m)$ represents $n$ successive executions of the sequence $X_1 X_2 \ldots X_m$; such a schedule is called a looped schedule. If every block appears exactly once in the schedule, as in Σ2 and Σ3 of Fig. 2(b), the schedule is called a single appearance (SA) schedule. An SA-schedule that has no nested loop is called a flat SA-schedule: Σ3 of Fig. 2(b) is a flat SA-schedule, while Σ2 is not.

Static analysis of an SDF graph is performed by constructing a valid schedule; no valid schedule can be found for an erroneous SDF graph. To construct a valid schedule, we first compute the repetition counts of all nodes. For a given arc e, we denote by src(e) the source node and by snk(e) the destination node. The output sample rate of src(e) onto the arc is denoted prod(e), and the input sample rate of snk(e) is denoted cons(e). Then the following equation, called a balance equation, must hold for a consistent SDF graph:

$$x(src(e))\,prod(e) = x(snk(e))\,cons(e) \quad \text{for each arc } e. \qquad (2)$$
We can formulate the balance equations for all arcs compactly with the following matrix equation:

$$\Gamma\,\mathbf{q}_G^T = \mathbf{0}, \qquad (3)$$

where Γ, called the topology matrix of G, is a matrix whose rows are indexed by the arcs in G, whose columns are indexed by the nodes in G, and whose entries are defined by

$$\Gamma(e, A) = \begin{cases} prod(e), & \text{if } A = src(e)\\ -cons(e), & \text{if } A = snk(e)\\ 0, & \text{otherwise.} \end{cases} \qquad (4)$$

A valid schedule exists only if Eq. (3) has a non-zero solution for the repetition vector q_G. Mathematically, this condition is satisfied when the rank of the topology matrix is n − 1 [8], where n is the number of nodes in G. If no non-zero solution exists, the SDF graph is called sample rate inconsistent. Fig. 3(a) shows a simple SDF graph that is sample rate inconsistent, together with its associated topology matrix; note that the rank of this topology matrix is 3, not 2.
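For concreteness, the repetition vector can also be computed without explicitly forming Γ, by propagating rational firing rates along the arcs; the following C sketch (all type and function names are ours, purely illustrative) does this for the graph of Fig. 2(a). A full implementation would additionally verify, for every arc, that the propagated rates satisfy the balance equation, reporting sample rate inconsistency otherwise; that check is omitted here for brevity.

    /* Sketch: repetition vector of a connected SDF graph, obtained by
       propagating rational firing rates num[v]/den[v] (relative to node 0)
       along the arcs and then scaling to the smallest integer solution. */
    #include <stdio.h>

    #define MAX_NODES 8

    typedef struct { int src, snk, prod, cons; } Arc;

    static long gcd(long a, long b) { return b ? gcd(b, a % b) : a; }
    static long lcm(long a, long b) { return a / gcd(a, b) * b; }

    static void set_rate(long *num, long *den, int v, long n, long d) {
        long g = gcd(n, d);          /* store the rate in lowest terms */
        num[v] = n / g;
        den[v] = d / g;
    }

    void repetition_vector(int n_nodes, int n_arcs, const Arc *arc, long *q) {
        long num[MAX_NODES] = {0}, den[MAX_NODES];
        for (int v = 0; v < n_nodes; v++) den[v] = 1;
        num[0] = 1;                              /* rate of node 0 is 1/1 */
        for (int pass = 0; pass < n_nodes; pass++)   /* naive relaxation */
            for (int e = 0; e < n_arcs; e++) {
                const Arc *a = &arc[e];
                if (num[a->src] && !num[a->snk])  /* x(snk) = x(src)*prod/cons */
                    set_rate(num, den, a->snk,
                             num[a->src] * a->prod, den[a->src] * a->cons);
                else if (num[a->snk] && !num[a->src])
                    set_rate(num, den, a->src,
                             num[a->snk] * a->cons, den[a->snk] * a->prod);
            }
        long L = 1;                   /* scale all rates to integers */
        for (int v = 0; v < n_nodes; v++) L = lcm(L, den[v]);
        for (int v = 0; v < n_nodes; v++) q[v] = num[v] * (L / den[v]);
    }

    int main(void) {
        /* Fig. 2(a): A=0, B=1, C=2, D=3; arcs AB(2,3), BC(2,1), AD(1,1) */
        Arc arcs[] = { {0,1,2,3}, {1,2,2,1}, {0,3,1,1} };
        long q[4];
        repetition_vector(4, 3, arcs, q);
        printf("q = (%ld, %ld, %ld, %ld)\n", q[0], q[1], q[2], q[3]);
        return 0;                     /* prints q = (3, 2, 4, 3) */
    }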
Sample rate consistency does not guarantee that a valid schedule exists. A sample rate consistent SDF graph can be deadlocked, as illustrated in Fig. 4(a), if the graph has a cycle with an insufficient number of initial samples.

Fig. 4 (a) An SDF graph that is deadlocked, and (b) the modified graph with x initial samples on the feedback arc.
The repetition vector of this graph is q_G = (x(A), x(B), x(C)) = (2, 1, 2). However, no node in Fig. 4(a) is fireable, since all nodes wait for input samples from each other. So we modify the graph by adding initial samples on arc CA, as in Fig. 4(b). Suppose that the number of initial samples is one (x = 1). Then node A is fireable initially. After node A is fired, one sample is produced and queued into the FIFO queue of arc AB. But the graph is deadlocked again, since no node becomes fireable afterwards: node B needs two samples on arc AB. In this example, a minimum of two initial samples is needed to rescue the graph from the deadlock condition.

The simplest method to detect deadlock is to construct a static SDF schedule by simulating the SDF graph as follows:

1. Make an empty schedule list that will contain the execution sequence of nodes, and initialize the set of fireable nodes.
2. Select one of the fireable nodes and put it in the schedule list. If the set of fireable nodes is empty, exit the procedure.
3. Simulate the execution of the selected node by consuming the input samples from the input arcs and producing the output samples onto the output arcs.
4. Examine each destination node of the output arcs, and add it to the set of fireable nodes only if it becomes fireable and its execution count during the simulation is smaller than its repetition count.
5. Go back to step 2.

When the procedure terminates, we can determine whether the graph is deadlocked by examining the schedule list: if any node is scheduled fewer times than its repetition count, the graph is deadlocked; otherwise, the graph is deadlock-free.

In summary, an SDF graph is consistent if it is sample rate consistent and deadlock-free. Therefore, the consistency of an SDF graph can be statically verified by computing the repetition counts of all nodes (sample rate consistency) and by constructing a static schedule.
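The procedure above translates directly into code. The following C sketch (type and function names are illustrative, not from any published tool) constructs the schedule for the graph of Fig. 4(b) with x = 2 and reports whether the graph deadlocks.

    /* Sketch of the simulation-based scheduling procedure. Returns 1 if
       a valid single-iteration schedule exists, i.e., every node gets
       scheduled exactly q[v] times; returns 0 on deadlock. */
    #include <stdio.h>

    #define MAX_NODES 8

    typedef struct { int src, snk, prod, cons, tokens; } Arc;

    static int fireable(int v, int n_arcs, const Arc *arc,
                        const int *count, const int *q) {
        if (count[v] >= q[v]) return 0;      /* already fired q[v] times */
        for (int e = 0; e < n_arcs; e++)     /* every input arc must have */
            if (arc[e].snk == v && arc[e].tokens < arc[e].cons) return 0;
        return 1;                            /* enough tokens             */
    }

    static int build_schedule(int n_nodes, int n_arcs, Arc *arc, const int *q) {
        int count[MAX_NODES] = {0};
        for (;;) {
            int v = 0;                       /* pick any fireable node    */
            while (v < n_nodes && !fireable(v, n_arcs, arc, count, q)) v++;
            if (v == n_nodes) break;         /* no fireable node left     */
            for (int e = 0; e < n_arcs; e++) {   /* simulate one firing   */
                if (arc[e].snk == v) arc[e].tokens -= arc[e].cons;
                if (arc[e].src == v) arc[e].tokens += arc[e].prod;
            }
            count[v]++;
            printf("%c ", 'A' + v);          /* append to schedule list   */
        }
        for (int v = 0; v < n_nodes; v++)
            if (count[v] < q[v]) return 0;   /* deadlocked                */
        return 1;
    }

    int main(void) {
        /* Fig. 4(b) with x = 2 initial samples on arc CA: A=0, B=1, C=2 */
        Arc arc[] = { {0,1,1,2,0}, {1,2,2,1,0}, {2,0,1,1,2} };
        int q[] = {2, 1, 2};
        printf("\n%s\n", build_schedule(3, 3, arc, q) ? "deadlock-free"
                                                      : "deadlocked");
        return 0;                            /* prints: A A B C C         */
    }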
2.2 Software Synthesis from an SDF Graph

An SDF graph can be used as a graphical representation of a DSP algorithm from which target code is automatically generated. Software synthesis from an SDF graph includes the determination of an appropriate schedule and of a coding style for each dataflow node, both of which affect the memory requirements of the generated software.
    main() {
        for ( ; ; ) {
            for (3) {
                code block of A;
                code block of D;
            }
            for (2) {
                code block of B;
                for (2) { code block of C; }
            }
        } /* end of for */
    } /* end of main */
    (a)

    main() {
        int i = 1;
        for ( ; ; ) {
            switch (i) {
            case 1-3:
                code block of A;
                code block of D;
                break;
            case 4-5:
                code block of B;
                for (2) { code block of C; }
                break;
            default:
                i = 0;
                break;
            }
            i++;
        } /* end of for */
    } /* end of main */
    (b)

    A() { code block of A; }
    B() { code block of B; }
    C() { code block of C; }
    D() { code block of D; }
    main() {
        for ( ; ; ) {
            for (3) { A(); D(); }
            for (2) {
                B();
                for (2) { C(); }
            }
        } /* end of for */
    } /* end of main */
    (c)

Fig. 5 Three programs based on the same schedule Σ2 of Fig. 2(b). Here "for (n)" is schematic shorthand for a loop repeated n times, and "case a-b" for cases a through b.
One of the main scheduling objectives for software synthesis is to minimize the total (code plus data) memory requirement. For software synthesis, the kernel code of each node (function block) is assumed to be already optimized and provided in a predefined block library. Once a schedule is determined, the target software is synthesized by placing the function blocks at their scheduled positions. There are two coding styles, inline and function, which differ in how a function block is put into the target code: the former generates inline code for each node at its scheduled position, while the latter defines a separate function that contains the kernel of each node. Fig. 5 shows three programs based on the same schedule Σ2 of Fig. 2(b); the first two use the inline coding style and the third the function coding style. If we use function calls, we pay the run-time function-call overhead, which can be very detrimental if there are many function blocks of small granularity. If inlining is used, as in Fig. 5(a), there is instead a danger of large code size if a node is instantiated many times: for the example of Fig. 2, schedule Σ1 is not adequate for inlining, while the SA-schedules Σ2 and Σ3 yield the smallest code size. Fig. 5(b) shows an alternative that uses inlining without an increase of code size proportional to the number of node instantiations; the basic idea is a simple run-time system that executes the nodes according to the schedule sequence. One should note, however, the run-time overhead of the switch statements and the code overhead for managing the schedule sequence, even though these are smaller than in the other approaches.

For each schedule, the buffer size requirement can also be computed. Assuming that a separate buffer is allocated on each arc, the buffering requirement of an arc is the maximum number of samples accumulated on the arc during the simulation of the schedule.
For the example in Fig. 2, we can compare the buffering requirements of the three schedules as follows:

Table 1 Buffer requirements for three schedules of Fig. 2(b).

    Schedule               arc AB   arc AD   arc BC   total
    Σ1: ADADBCCADBCC          4        2        2        8
    Σ2: 3(AD)2(B2(C))         6        2        2       10
    Σ3: 3A3D2B4C              6        4        4       14
As Table 1 shows, the SA-schedules usually require larger buffers, while they guarantee the minimum code size for inline code generation. In multimedia applications, frame-based algorithms are common, where the size of a unit sample may be as large as a video or audio frame; minimizing the buffer size is then just as important as minimizing the code size. In general, both code size and buffer size should be considered when constructing a memory-optimal schedule.

Buffering requirements can be reduced if we allow buffer sharing. Arc buffers can be shared if their lifetimes do not overlap during an iteration of the schedule. The lifetime of an arc buffer is the set of durations from when a sample starts accumulating until the buffer becomes empty. If there were no initial sample on arc AD in Fig. 2, we could share the buffers of arcs AD and BC. A more aggressive buffer sharing technique, which separates global sample buffers from local pointer buffers when the sample size is large, has been developed for frame-based applications [9]. The key idea is to allocate a global buffer large enough to store the maximum amount of live samples during one iteration of the schedule; each arc is then assigned a pointer buffer that stores pointers into the global buffer.

Code size can also be reduced by sharing the kernel of a function block when there are multiple instances of the same block [15]. Since multiple instances of the same block are regarded as different blocks, the same kernel, possibly with different local states, may appear several times in the generated code. A technique has been proposed to share the same kernel by defining a shared function. Separate state variables and buffers, which define the context of each instance, must be maintained, and the shared function is called with the context of an instance as an argument at the scheduled position of that instance. To decide whether a code block should be shared, the overhead and the gain of sharing are compared: if α is the overhead incurred by function sharing, R is the code block size, and n is the number of instances of a block, code sharing is beneficial when

$$\frac{\alpha}{(n-1)R} < 1.$$
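As an illustrative calculation (the numbers are made up): for n = 4 instances of a block of size R = 200 bytes and a sharing overhead of α = 120 bytes, α/((n−1)R) = 120/600 < 1, so the shared-function version is smaller; for n = 2 instances with α = 250 bytes, 250/200 > 1, and keeping separate inline copies wins.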
2.3 Static Scheduling Techniques

Static scheduling of an SDF graph is the key technique of static analysis: it determines the possibility of deadlock and the memory requirement of the generated code. Since an SDF graph imposes only partial ordering constraints between the nodes, many valid schedules exist, and finding an optimal schedule has been actively researched.

2.3.1 Scheduling Techniques for Single Processor Implementations

The main objective for a single processor implementation is to minimize the memory requirement, considering both code and buffer memory. Since the problem of finding a minimum-buffer schedule for an acyclic graph is NP-complete even in its simplest form, various heuristic approaches have been proposed. Since a single appearance schedule guarantees the minimum code size for inline code generation, one group of researchers has focused on finding an SA-schedule that minimizes the data memory requirement. Bhattacharyya et al. developed two complementary heuristics, APGAN and RPMC, to find such a schedule [2]. Ritz et al. used an ILP formulation to minimize the data memory [14]; their approach differs from Bhattacharyya's in that it allows data buffer sharing based on a flat single appearance schedule. Since a flat SA-schedule usually requires more data buffer space than the optimal nested SA-schedule, the advantage of sharing the data buffer is not evident in general. Another line of research tries to minimize only the data memory. Ade et al. present an algorithm to determine the smallest possible data buffer size for arbitrary SDF applications [1]; although their work mainly targets mapping SDF applications onto a Field Programmable Gate Array (FPGA) in their GRAPE environment, the computed lower bounds on the buffer requirement are also applicable to software synthesis. Govindarajan et al. [6] developed a rate-optimal compile-time schedule that minimizes the buffer requirement using a linear programming formulation. Neither of the latter two works considers the code size, which is often the larger component of the memory requirement. No previous work considers all design factors, such as coding styles, buffer sharing, and code sharing, together. Therefore, despite extensive prior research, finding an optimal schedule that minimizes the total memory requirement remains an open problem, even for single processor implementations.

2.3.2 Scheduling Techniques for Multiprocessor Implementations

The key scheduling objective for a multiprocessor implementation is to minimize the schedule length or to maximize the throughput of a given SDF graph. While numerous techniques have been developed for multiprocessor scheduling, they usually assume homogeneous task graphs that execute each task node only once in a single iteration.
Fig. 6 (a) An APEG (acyclic precedence expanded graph) of the SDF graph in Fig. 2(a), and (b) a parallel scheduling result displayed with a Gantt chart.
They also primarily focus on exploiting the functional parallelism of applications. Therefore, an earlier work translates the input SDF graph into a single-rate task graph [12], called an APEG (acyclic precedence expanded graph) or simply an EG (expanded graph). A node of the SDF graph is expanded into as many EG nodes as its repetition count. The EG corresponding to the graph of Fig. 2(a) is shown in Fig. 6(a), where nodes A and D are expanded into 3 invocations each, node B into 2, and node C into 4 invocations. The number of samples communicated through each arc is unity unless specified otherwise. If a node has internal states, there are dependencies between its invocations: in the figure, we assume that nodes A and D have internal states, and the dependencies between invocations are represented by dashed arcs. An initial sample on arc AD breaks the dependency between A3 and D1, so that D1 can be executed before A3. Once an APEG is generated, any multiprocessor scheduling algorithm can be applied to minimize the schedule length. Fig. 6(b) shows a parallel scheduling result as a Gantt chart, where the vertical axis represents the processors of the target system and the horizontal axis the elapsed time. In this approach, loop-level data parallelism is translated into functional parallelism in the EG, so previous approaches to exploiting functional parallelism can be applied directly. While this may be a reasonable way to exploit loop-level data parallelism, it is a very expensive solution: the total number of nodes in the EG is the sum of all repetition counts of the nodes. Moreover, the same node (node D in Fig. 6(b)) can be mapped to different processors, which incurs significant overhead for managing the internal states, if any. Some parallel scheduling techniques based on ILP (integer linear programming) or evolutionary algorithms work with the SDF graph directly, without APEG generation; a recent approach considers functional parallelism, loop-level parallelism, and pipelining simultaneously to maximize the throughput of the graph [16].
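The expansion of an arc into APEG precedence edges can be sketched as follows (a hypothetical implementation with illustrative names; for brevity it ignores initial delays, which would shift the sink firing indices, as described above for arc AD).

    /* Sketch: expand one SDF arc into APEG precedences. Firing i of the
       source produces tokens [(i-1)*prod, i*prod); firing j of the sink
       consumes tokens [(j-1)*cons, j*cons); an edge i -> j exists when
       these token ranges overlap. */
    #include <stdio.h>

    typedef struct { int src, snk, prod, cons; } Arc;

    static void expand_arc(const Arc *a, const int *q) {
        for (int i = 1; i <= q[a->src]; i++) {
            long first = (long)(i - 1) * a->prod;   /* first token of firing i */
            long last  = (long)i * a->prod;         /* one past the last token */
            long j_lo  = first / a->cons + 1;
            long j_hi  = (last - 1) / a->cons + 1;
            for (long j = j_lo; j <= j_hi && j <= q[a->snk]; j++)
                printf("%c%d -> %c%ld\n", 'A' + a->src, i, 'A' + a->snk, j);
        }
    }

    int main(void) {
        /* Fig. 2(a) without the initial delay on arc AD: A=0, B=1, C=2, D=3 */
        Arc arcs[] = { {0,1,2,3}, {1,2,2,1}, {0,3,1,1} };
        int q[] = {3, 2, 4, 3};
        for (int e = 0; e < 3; e++) expand_arc(&arcs[e], q);
        return 0;   /* prints A1->B1, A2->B1, A2->B2, A3->B2, B1->C1, ... */
    }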
If the execution time of each function block is fixed and known at compile time, we can estimate the total execution duration of the SDF graph. Hence the SDF model is suitable for specifying applications with real-time constraints on throughput or latency. When the execution time of a node varies at run time, the worst-case execution time (WCET) should be used in parallel scheduling in order to guarantee satisfaction of the real-time constraints. Unfortunately, using WCETs is not enough to guarantee schedulability if an application consists of multiple tasks running concurrently. If the execution time of a node becomes smaller than its WCET, the order of node firings can vary at run time, which may lengthen the total execution time of the application. If multiple tasks run concurrently, resource contention may occur and the execution time may be prolonged at run time. While many DSP applications are multi-tasking applications, parallel scheduling of multi-tasking applications on heterogeneous multiprocessor systems remains an open problem: with a given schedule, it is not yet possible to decide whether the real-time constraints will be satisfied if the execution times of some nodes vary and/or there is resource contention during execution.
3 Cyclo-Static Data Flow (CSDF)

The strict restriction of the SDF model that all sample rates are constant limits its expressive capability. Fig. 7(a) shows a simple SDF graph that models a stereo audio processing application in which the samples of the left and right channels are interleaved at the source. The interleaved samples are distributed by node D to the left (node L) and right (node R) channels. In this example, the distributor node D waits until two samples have accumulated on its input arc to produce one sample on each output arc. A more natural implementation would be for the distributor to route each input sample alternately to the two output ports as it arrives.

A useful generalization of the SDF model, called cyclo-static data flow (CSDF), has been proposed [3]. In cyclo-static dataflow, the sample rate of an input or output port may vary in a periodic fashion. Fig. 7(b) shows how the CSDF model can specify the same application as Fig. 7(a). To specify the periodic change of sample rates, a tuple rather than an integer is annotated at each output port of node D'. For instance, "{1,0}" on arc D'R denotes that the rate pattern repeats every two executions, with rate 1 during the first execution and 0 during the second. Similarly, the periodic sample rate "{0,1}" means that the rate is 0 at every $(2n+1)$-th firing and 1 at the other firings. Note that Fig. 7(a) and (b) represent functionally the same application. One firing of node D in the SDF graph is broken down into two firings of node D' in the CSDF graph; thus the behavior of node D' is split into phases. The number of phases is determined by the periods of the sample rate patterns of all input and output ports. In general, we can convert a CSDF graph into an equivalent SDF graph by merging as many firings of a CSDF node as its number of phases into a single firing of an equivalent SDF node. For instance, node D' in the CSDF graph repeats its behavior every two firings, so its number of phases is 2.
Fig. 7 (a) An SDF graph where node D is a distributor block and (b) a CSDF graph that shows the same behavior with a different version of a distributor block.
Fig. 8 (a) A cyclo-static dataflow graph and (b) its corresponding SDF graph.
So an equivalent SDF node can be constructed by merging two firings of node D' into one, which yields node D of the SDF model. The sample rates of the input and output ports are adjusted accordingly by summing the numbers of samples consumed and produced during one cycle of the periodic behavior. The CSDF model has an advantage over the SDF model in that it can reduce the buffer requirements on the arcs: in the example of Fig. 7, the minimum input buffer size for the distributor is 2 in the SDF model but only 1 in the CSDF model.
3.1 Static Analysis

Since we can construct an equivalent SDF graph, the static analysis and scheduling algorithms developed for SDF are also applicable to CSDF. For formal treatment, we use vectors to represent the periodic sample rates in CSDF: for an arc e, the output sample rate vector of src(e) and the input sample rate vector of snk(e) are denoted by prod(e) and cons(e), respectively. Fig. 8(a) shows another CSDF graph, with non-trivial periodic sample rate patterns. The sample rate vectors of the graph are: prod(AC) = (1, 0, 0), cons(AB) = (0, 2), prod(BC) = (0, 1, 0), and prod(AB) = cons(BC) = cons(AC) = (1). To construct an equivalent SDF node for each CSDF node, we have to compute the repetition period of the phased operation of the CSDF node.
First, we obtain the period of the sample rate variation of each port, which is the size of the port's sample rate vector. Let dim(v) be the dimension of sample rate vector v. Then the repetition period of a CSDF node A, denoted p(A), is the least common multiple (lcm) of the dim(v) values over all input and output ports of the node. For the example of Fig. 8(a), the repetition periods become:

    p(A) = lcm(dim(prod(AB)), dim(prod(AC))) = lcm(1, 3) = 3,
    p(B) = lcm(dim(cons(AB)), dim(prod(BC))) = lcm(2, 3) = 6,
    p(C) = lcm(dim(cons(AC)), dim(cons(BC))) = lcm(1, 1) = 1.

If p(A) firings of CSDF node A are merged into a single firing, an equivalent SDF actor A' is obtained. The equivalent SDF graph is shown in Fig. 8(b), where node B' is obtained by merging 6 firings of node B of the CSDF graph; we denote this equivalence relation as B' ≈ 6B. For an equivalent SDF node, the scalar sample rate of each port must be determined. Let σ(v) be the sum of the elements of vector v. The total number of samples produced or consumed on arc e of CSDF node A per execution of the corresponding SDF node is

$$p(A)\,\frac{\sigma(prod(e))}{\dim(prod(e))} \quad \text{or} \quad p(A)\,\frac{\sigma(cons(e))}{\dim(cons(e))}.$$

So we can construct a topology matrix for the equivalent SDF graph as follows:

$$\Gamma(e, A) = \begin{cases} p(A)\,\dfrac{\sigma(prod(e))}{\dim(prod(e))}, & \text{if } A = src(e)\\[4pt] -p(A)\,\dfrac{\sigma(cons(e))}{\dim(cons(e))}, & \text{if } A = snk(e)\\[4pt] 0, & \text{otherwise.} \end{cases} \qquad (5)$$

We can check sample rate consistency with this topology matrix. For the graph in Fig. 8(b), the topology matrix and the repetition vector become

$$\Gamma = \begin{pmatrix} 3 & -6 & 0\\ 1 & 0 & -1\\ 0 & 2 & -1 \end{pmatrix}, \qquad \mathbf{q}_G = (2, 1, 2).$$

Since the rank of Γ is 2, the CSDF graph is sample rate consistent. Moreover, a valid schedule includes two invocations of nodes A' and C and one invocation of node B'; this means that a valid CSDF schedule contains 6A, 6B, and 2C, since A' ≈ 3A and B' ≈ 6B. The deadlock detection algorithm for an SDF graph from Section 2.1, which constructs a static schedule by simulating the graph, is likewise applicable to a CSDF graph.
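The conversion from CSDF rate vectors to the scalar rates of the equivalent SDF node is mechanical; the following C sketch (illustrative names) reproduces the numbers derived above for node B of Fig. 8(a).

    /* Sketch: repetition period p(A) and scalar SDF rates per Eq. (5). */
    #include <stdio.h>

    static int gcd(int a, int b) { return b ? gcd(b, a % b) : a; }
    static int lcm(int a, int b) { return a / gcd(a, b) * b; }

    /* p(A): lcm of the rate-vector lengths of all ports of the node */
    int period(const int *dims, int n_ports) {
        int p = 1;
        for (int i = 0; i < n_ports; i++) p = lcm(p, dims[i]);
        return p;
    }

    /* samples moved per firing of the equivalent SDF node:
       p(A) * sigma(v) / dim(v) for a port's rate vector v */
    int sdf_rate(int p, const int *v, int dim) {
        int sigma = 0;
        for (int i = 0; i < dim; i++) sigma += v[i];
        return p * sigma / dim;
    }

    int main(void) {
        /* node B of Fig. 8(a): cons(AB) = (0,2), prod(BC) = (0,1,0) */
        int consAB[] = {0, 2}, prodBC[] = {0, 1, 0};
        int dims[] = {2, 3};
        int p = period(dims, 2);            /* p(B) = lcm(2,3) = 6 */
        printf("p(B)=%d, cons(AB)->%d, prod(BC)->%d\n",
               p, sdf_rate(p, consAB, 2), sdf_rate(p, prodBC, 3));
        return 0;                           /* prints 6, 6, 2 */
    }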
3.2 Static Scheduling and Buffer Size Reduction

One strategy for scheduling a CSDF graph is to schedule the equivalent SDF graph and then to replace each execution of an equivalent node with the corresponding multiple invocations of the original CSDF node.
So we can obtain the following schedule for the graph in Fig. 8: Σ1 = 2A' B' 2C = 6A 6B 2C. The minimum buffer requirement on arc AB is then 6. We can construct better schedules, in terms of buffer requirements, by utilizing the phased operation of a CSDF node. For the CSDF graph of Fig. 8(a), a better schedule can be constructed as follows:

1. Initially nodes A and B are fireable, so schedule nodes A and B: Σ2 = "AB".
2. Since node A is the only fireable node, schedule node A again: Σ2 = "ABA".
3. Now two samples have accumulated on arc AB and the second phase of node B can start. Schedule node B: Σ2 = "ABAB".
4. Node C becomes fireable. Schedule node C for the first time: Σ2 = "ABABC".
5. Schedule nodes A and B twice: Σ2 = "ABABCABAB".
6. At this moment, only one sample is stored on arc AC, and we can fire node A or B. We choose to schedule the fifth invocation of node B to produce one sample on arc BC: Σ2 = "ABABCABABB".
7. Then node C becomes fireable. Schedule node C: Σ2 = "ABABCABABBC".
8. Finally, schedule node A twice and node B once to complete one iteration of the schedule: Σ2 = "ABABCABABBCAAB".
9. Since schedule Σ2 contains 6A, 6B, and 2C, the schedule is complete and no deadlock is detected.

Schedule Σ2 requires only 2 buffer slots on arc AB, three times fewer than Σ1. Generally, the more the sample rates vary, the more significant the buffer size reduction becomes. This gain is obtained by splitting the CSDF node into multiple phases; the price is a code size overhead, since a single appearance schedule is given up. In general, there are more valid schedules for a CSDF graph than for the equivalent SDF model, so the discussion of SDF scheduling carries over to the CSDF model, but with increased complexity of the scheduling problems.
3.3 Hierarchical Composition

Another advantage of CSDF is that it offers more opportunities for clustering when constructing a hierarchical graph. It also allows a seemingly delay-free cycle of nodes, while no delay-free cycle is allowed in SDF. Fig. 9(a) shows an SDF graph with four nodes A, B, C, and D. All sample rates are unity, since no sample rate is annotated on any arc. The graph can be scheduled without deadlock, since there is an initial delay sample between nodes A and B; the unique valid schedule of this graph is "BCDA". Suppose we cluster nodes A and B into a hierarchical node, W in CSDF and W' in SDF, as illustrated in Fig. 9(a) and (b) respectively. Since CSDF node W fires node B at every $(2n+1)$-th cycle and node A at every $2n$-th cycle, the input and output sample rate vectors of node W become "{0,1}" and "{1,0}" respectively. Therefore the CSDF graph can be scheduled without deadlock, and a valid static schedule is "WCDW", where node B is fired at the first invocation of node W and node A at the second.
Fig. 9 Clustering of nodes A and B into a hierarchical node in (a) CSDF and (b) SDF.
On the other hand, SDF node W' must execute both nodes A and B when it is fired. Therefore the SDF graph of Fig. 9(b) is deadlocked: nodes W', D, and C all wait for each other. Clustering of nodes may thus cause deadlock in SDF even though the original SDF graph is consistent. A CSDF graph that includes a cyclic loop without an initial delay, in contrast, can be scheduled without deadlock if the periodic rates are carefully determined. Therefore the CSDF model allows more freedom of hierarchical composition of the graph.
4 Other Decidable Dataflow Models

4.1 FRDF (Fractional Rate Data Flow)

The SDF model makes no assumption about the data types of samples, as long as the same data type is used between two communicating nodes. To specify multimedia or frame-based signal processing applications, it is natural to use composite data types such as video frames or network packets. With a composite data type, the buffer requirement for a single data sample can be huge, amounting to 176 × 144 pixels for a QCIF video frame, for instance. Then reducing the buffer size becomes more important than reducing the code size when constructing a schedule. Fig. 10(a) shows an SDF subgraph of an H.263 encoder algorithm for QCIF video frames. A QCIF video frame consists of 11 × 9 macroblocks of 16 × 16 pixels each. Node ME, which performs motion estimation, consumes the current and the previous frames as inputs.
Fig. 10 A subgraph of an H.263 encoder graph (a) in SDF and (b) in FRDF.
Internally, the ME block divides the current frame into 99 macroblocks and computes the motion vectors and the pixel differences from the previous frame. It produces 99 output samples at once, where each output sample is a macroblock-size datum representing a 16 × 16 array of pixel differences. Node EN performs macroblock encoding, consuming one macroblock at a time and producing one encoded macroblock as its output sample. This SDF representation is not efficient in terms of buffer requirement and performance. Since node ME produces 99 macroblock-size samples at once after consuming a single frame-size sample per invocation, we need a 99-macroblock buffer, i.e. a frame-size buffer (99 × 16 × 16 = 176 × 144 pixels), to store the samples on the arc between nodes ME and EN. Moreover, node EN cannot start execution before node ME finishes motion estimation for the whole input frame. As this example demonstrates, the SDF model has inherent difficulty in efficiently expressing a mixture of a composite data type and its constituents: a video frame and its macroblocks in this example. A video frame is regarded as a unit data sample in integer-rate dataflow graphs, and must be broken down into multiple macroblocks explicitly, consuming extra memory space.

To overcome this difficulty, fractional rate dataflow (FRDF), in which a fractional number of samples can be produced or consumed, has been proposed [10]. In FRDF, a fractional number can be used as a sample rate, as shown in Fig. 10(b), where the input sample rates of node frME are set to 1/99. The fractional rate means that the input data type of node frME is a macroblock, which corresponds to 1/99 of a frame.

Fig. 11 shows an FRDF graph in which the data type of arc BC is a composite type, as illustrated in the figure, while the data type of arc AC is a primitive type such as an integer or a float. A fractional sample rate has different meanings for composite and primitive data types. For a composite data type, the fractional sample rate indicates genuinely partial production or consumption of a sample. In the example graph, one firing of node B fills 1/3 of a sample on arc BC, and node C reads the first half of a sample at every $(2n+1)$-th firing and the second half at every $2n$-th firing. Hence, considering the execution order of nodes B and C, the schedule "BBCBC" is valid, since 2/3 of a sample is available after node B has fired twice, so node C becomes fireable. For primitive types, partial production or consumption is not feasible; a statistical interpretation is then applied to the fractional rate. In the example graph, the output sample rate of node A on arc AC is 1/3, which means that node A produces a single sample every three executions. Similarly, node C consumes one sample every two executions. Note that a fractional rate does not specify at which firings samples are produced.
Fig. 11 An FRDF graph in which the sample types on arcs BC and AC are a composite and a primitive type, respectively.
So node C becomes fireable only after node A has executed three times. Considering only the execution order of nodes A and C, the schedule "3A2C" is valid while "2ACAC" is not. Consequently, a valid schedule for the FRDF graph of Fig. 11 is "3(AB)2C". Regardless of the data type, a fractional sample rate p/q guarantees that p samples are produced or consumed in q firings of the node. Similarly to the CSDF case, we can construct an equivalent SDF graph by merging q firings of an FRDF node into an equivalent SDF node that produces or consumes p samples per firing. Therefore, the static analysis techniques for the SDF model can be applied to the FRDF model. For the analysis of sample rate consistency, however, we can use the fractional sample rates directly in the topology matrix. For the FRDF graph of Fig. 11, the topology matrix and repetition vector are

$$\Gamma = \begin{pmatrix} 1 & -1 & 0\\ 0 & \frac{1}{3} & -\frac{1}{2}\\ \frac{1}{3} & 0 & -\frac{1}{2} \end{pmatrix}, \qquad \mathbf{q}_G = (3, 3, 2).$$

Since the rank of Γ is 2, the graph is sample rate consistent. Deadlock can be detected by constructing a static schedule, similarly to the SDF case: if there exists a valid static schedule, the graph is free from deadlock. A static schedule is constructed by repeatedly inserting a fireable node into the schedule list and simulating its firing. An FRDF node has different firing conditions depending on the data types of its input ports; it is fireable, or executable, if every input arc satisfies the following condition for its data type:

1. If the data type is primitive, there must be at least as many stored samples as the numerator of the fractional sample rate. (An integer sample rate is a special fractional rate whose denominator is 1.)
2. If the data type is composite, there must be at least as large a fraction of a sample stored as the fractional input sample rate.

Special care should be taken with composite-type data. If the consumer and the producer interpret the fraction differently, the composite type must be regarded as a primitive type when the firing condition is examined. Suppose that, for a composite data type of a two-dimensional array, the producer regards it as an array of row vectors while the consumer regards it as an array of column vectors, as shown in Fig. 12. In this case, the two-dimensional array may not be treated as composite-type data. Therefore the schedule "DDUDU" is not valid, while "3D 2U" is.
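The two firing conditions can be checked with a single predicate if the amount of data on an arc is tracked as a fraction of a sample; the following C sketch (illustrative names, not from [10]) evaluates the condition for one input arc.

    /* Sketch of the FRDF firing condition for a single input arc. The
       amount stored on the arc is the fraction stored_num/stored_den of a
       sample; the port consumes p/q samples per firing. */
    #include <stdio.h>

    typedef struct { int p, q; } Rate;

    int arc_ready(Rate cons, long stored_num, long stored_den, int composite) {
        if (composite)   /* partial samples count: need stored >= p/q     */
            return stored_num * cons.q >= (long)cons.p * stored_den;
        else             /* primitive: need p whole samples per q firings */
            return stored_num >= (long)cons.p * stored_den;
    }

    int main(void) {
        Rate half = {1, 2};
        /* Fig. 11: after "BB", 2/3 of a composite sample sits on arc BC  */
        printf("composite, 2/3 stored: %d\n", arc_ready(half, 2, 3, 1));
        /* on arc AC the type is primitive: 2/3 of a sample is not usable */
        printf("primitive, 2/3 stored: %d\n", arc_ready(half, 2, 3, 0));
        printf("primitive, 1 stored:   %d\n", arc_ready(half, 1, 1, 0));
        return 0;        /* prints 1, 0, 1 */
    }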
Fig. 12 If the consumer and the producer have different access patterns for a composite type data, then the type should be treated as atomic.

Fig. 13 H.263 encoder in FRDF.
In general, the FRDF model yields an efficient implementation of a multimedia application in terms of buffer requirement and performance. Consider the example of Fig. 10(b) again. Since node frME uses macroblock-size samples, its output arc requires only a single macroblock-size buffer. For each arriving input video frame, node frME fires 99 times and consumes a macroblock-size portion of the input frame per firing. Since node EN can fire after each firing of node frME, a shorter latency is experienced than with the SDF implementation. Fig. 13 shows the whole H.263 encoding algorithm in FRDF, where the sample types of "current frame" and "previous frame" are video frames. It is worth noting that the entire previous frame is needed to perform motion estimation for each macroblock of the current frame, even though the sample rate of the bottom input port of node "Motion Estimation" is 1/99. Hence, although the previous frame is a composite data type, it must be regarded as a primitive data type; node "Motion Estimation" is then fireable only after the entire previous frame is available.

Fig. 14 shows an FRDF graph that corresponds to Fig. 8(a); both have the same equivalent SDF graph. Like the CSDF model, the FRDF model can reduce the buffer sizes compared with the corresponding SDF model. Since node A* produces a single sample per firing and B* consumes two samples every other execution, a valid schedule for the graph is "2A* 2B* 2A* B* C B* 2A* 2B* C", and the required buffer sizes on arcs A*B* and B*C equal those of the CSDF graph. The buffer size for arc A*C is, however, larger than in the CSDF graph, since the FRDF model does not specify when samples are produced and consumed, so the schedule for the FRDF model must assume the worst-case behavior: for an output port, the worst case is that all output samples are produced at the last phase, while for an input port it is that all input samples are consumed at the first phase. Therefore, a rate p/q for a primitive data type in the FRDF model corresponds to "{(q−1)×0, p}" for an output sample rate and "{p, (q−1)×0}" for an input sample rate in the CSDF model, where "(q−1)×0" denotes a run of q−1 zeros.
Fig. 14 An FRDF graph corresponding to Fig. 8.
Hence the CSDF model may generate better schedules than the associated FRDF model, since the node execution can be split into a finer granularity of phases at compile-time scheduling; this is not possible in FRDF if the data type is primitive. The expressive capabilities of the two models are, however, different: the CSDF model allows only a periodic pattern of sample rate variation, whereas the FRDF model has no such restriction as long as the average rate is constant over a period. In this sense the FRDF model has more expressive power than the CSDF model, since it allows dynamic behavior of an FRDF node within a period, and a periodic pattern can be regarded as a special case.
4.2 SPDF (Synchronous Piggybacked Data Flow)

The SDF model does not allow communication through a shared global variable, since the access order to the global variable can vary depending on the execution order of nodes. Suppose that, in a frame-based signal processing application, a source block produces a frame header that is to be used by several downstream blocks. A natural way of coding this in C is to define a shared data structure that the downstream blocks access through pointers; in a dataflow model, however, such sharing is not allowed. As a result, redundant copies of data samples are usually introduced into the code automatically generated from the dataflow model. Such overhead is usually not tolerable for embedded systems with tight resource and/or timing constraints. To overcome this limitation, an extended SDF model called SPDF (Synchronous Piggybacked Data Flow) has been proposed [11], introducing the notion of "controlled global states" and coupling each data sample with a pointer to the controlled global state.

The SPDF model was first proposed to support frame-based (or block-based) processing, which occurs frequently in multimedia applications. In frame-based processing, the system receives input streams of frames that consist of a frame header and data samples; the frame header contains information on how to process the data samples. An intuitive implementation is therefore to store this information in a global data structure, called a global state, which the data processing blocks consult before processing the data samples. Fig. 15 shows a simple SPDF graph in which node A reads a frame from a file and produces the header information and the data samples through different output ports. Suppose that a frame consists of 100 data samples in this example.
Fig. 15 An example SPDF graph that shows typical frame-based processing: the Piggyback block writes the header information to the global state, and the downstream blocks refer to the global state before processing the data samples.
The header information and the data samples are both connected to a special FRDF (fractional rate dataflow) block, called the "Piggyback" block, which has three parameters: "statename", "period", and "offset". The Piggyback block updates the global state named "statename" with the received header information periodically, with the specified "period". It piggybacks a pointer to the global state onto each input data sample before sending the sample through its output port. Since it needs to receive the header information only once per 100 executions in order to update the global state, the sample rate of the input port associated with the header information is set to the fractional value 1/100, which means that it consumes one sample per 100 executions in the FRDF model. The input port with this fractional rate is called the "state port" of the Piggyback block; the sample rate of the data port, on the other hand, is unity. The "offset" parameter indicates when to update the global state: the Piggyback block receives that many data samples before updating the global state. In this example, "offset" is set to its default value, zero, which makes the Piggyback block consume the header information and update the global state before it piggybacks the data samples with a pointer to the global state.

Note that the Piggyback block with a fractional-rate input port is the only extension to the SDF model. Since the sample rates of an SPDF graph are all static, the static analyzability of the SDF model is preserved even after the addition of the Piggyback block. Also, piggybacking data samples with pointers can be performed without any run-time overhead by utilizing the static schedule of the graph. Suppose that the SPDF graph in Fig. 15 has the static schedule "A 100(Piggyback, B, C, DownStream)". Then the pseudocode of the automatically generated code associated with this schedule is as follows:

    code block of A;
    for (int i = 0; i < 100; i++) {
        if (i == offset_Piggyback)
            update the global state "header";
        code block of B;
        code block of C;
        if (i == offset_Piggyback)
            update the local state of the DownStream block
                from the global state information;
        code block of DownStream;
    }
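The mechanism can be pictured with a small C sketch (the struct layout and all names are ours, purely illustrative): a token carries a pointer into a global-state table, and the Piggyback block refreshes the table entry once per period. The two table slots anticipate the double-buffering requirement discussed below for the schedule of Fig. 16.

    /* Sketch of the controlled-global-state mechanism. */
    #include <stdio.h>

    typedef struct { double header; } GlobalState;

    typedef struct {
        double       sample;     /* the data token itself          */
        GlobalState *state;      /* piggybacked state pointer      */
    } Token;

    static GlobalState g_table[2];   /* slots sized by the static   */
    static int g_slot = 0;           /* schedule's lifetime analysis */

    /* Piggyback: update the state once per 'period' tokens, then
       attach a pointer to the current state to every data token. */
    Token piggyback(double sample, const double *new_header,
                    int fire_index, int period, int offset) {
        if (fire_index % period == offset) {
            g_slot ^= 1;                         /* switch buffers  */
            g_table[g_slot].header = *new_header;
        }
        Token t = { sample, &g_table[g_slot] };
        return t;
    }

    int main(void) {
        double hdr = 10.0;
        for (int i = 0; i < 3; i++) {
            Token t = piggyback((double)i, &hdr, i, 100, 0);
            printf("sample %.0f sees header %.0f\n", t.sample, t.state->header);
        }
        return 0;
    }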
Fig. 16 An SPDF graph that produces a sinusoidal waveform with varying amplitude at run time: the "gain" state of the "Gain" block is updated by the "Ramp" block through the global state "global_gain".
Fig. 16 shows another example, which produces a sinusoidal waveform with varying amplitude at run time. The "Singen" block generates a sinusoidal waveform (N samples per period) of unit amplitude, and the "Gain" block amplifies the input samples by the "gain" parameter of the block. To control the amplitude, the graph places a Piggyback block after the "Singen" block, with another source block, "Ramp", connected to the state port of the Piggyback block. The "statename" of the Piggyback block is "global_gain", and the "gain" parameter of the "Gain" block is also set to "global_gain". Then the "gain" parameter of the "Gain" block is updated with the global state "global_gain", whose value is determined by the "Ramp" block. In this example, the period of the Piggyback block is set to N, so that the amplitude of the sinusoidal waveform is incremented by one every period, as shown in Fig. 16. If we insert two initial samples on the input arc of the "Gain" block, the "offset" parameter of the Piggyback block should be 2.

Thus the SPDF model provides a safe mechanism for delivering state values through shared memory instead of message passing. Communication through shared memory is prohibited in conventional dataflow models, since the access order to the shared memory may vary depending on the schedule. The SPDF model offers another solution, based on the observation that the side effect is caused by an implicit assumption that the global state is allocated to a memory location before a schedule is made. The SPDF model reverses this order: the memory for the global state is allocated after the schedule is made. Since the scheduling decision is made at compile time, we know the access order to the variable and the lifetime of each global state value. Suppose that the schedule of Fig. 16 becomes "2(100(Singen) Ramp Piggyback) 200(Gain Display)". From this static schedule, we know that we must maintain two memory locations for the global state "global_gain", since the Piggyback block writes the second value to the global state before the "Gain" block reads the first.

Allowing shared-memory communication without side effects brings a couple of significant benefits over conventional dataflow models. First, it removes the unnecessary data-copying overhead of message-passing communication, since the global state can be shared by multiple blocks.
    void add() {
        *addOut = *addIn1 + *addIn2;
    }
    (a)

    void add() {
        for (int i = 0; i < Nb; i++)
            addOut[i] = addIn1[i] + addIn2[i];
    }
    (b)

Fig. 17 Code of an "Add" actor (a) in SDF and (b) in SSDF, where Nb is the blocking factor.
Second, it greatly enhances the expressive capability of the SDF model without breaking its static analyzability: it provides an explicit mechanism for affecting the internal behavior of a block from the outside through global states.
4.3 SSDF (Scalable SDF)

DSP architectures have stringent on-chip memory limits, and off-chip memory access is costly. They also provide vector processing of instructions and arithmetic pipelining, such as multiply-accumulate (MAC) units, which attain peak performance only when the pipeline is fully utilized. Therefore, an SDF block that operates on primitive-type data at small granularity executes inefficiently. For example, an adder block performs a single addition per firing, reading two samples from memory and writing one sample back, and thus incurs a large run-time overhead of off-chip memory accesses (two reads and one write per addition).

To achieve an efficient implementation, the scalable synchronous dataflow (SSDF) model has been proposed [13]. The SSDF model has the same semantics as the SDF model, except that a node may consume or produce any integer multiple of its fixed rates per invocation. The actual multiple, called the blocking factor, is determined by a scheduler or an optimizer. Fig. 17(a) shows the code of an "Add" block in SDF; in SSDF, the code includes the blocking factor Nb as the number of loop iterations, as shown in Fig. 17(b). Since the call overhead of "add()" is larger than a single addition, the SSDF model amortizes the function-call overhead by performing Nb additions per call. When the blocking factor Nb is 1, an SSDF graph degenerates to an SDF graph. Therefore, the SSDF model has the same sample rates as the SDF model, and sample rate inconsistency can be checked using the topology matrix of the SDF model. Moreover, deadlock is detected by constructing a schedule with blocking factor Nb = 1. From the static analysis of the degenerated SDF graph, the repetition vector qG is obtained assuming Nb = 1 for all blocks; when Nb > 1, the repetition vector for SSDF becomes Nb qG.

A straightforward scheduling technique for an SSDF graph is to increase the minimal scheduling period by an integer factor Ng, the global blocking factor. Each node A of the graph is then invoked Ng x(A) times within one scheduling period, where x(A) is the repetition count of node A.
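To see the amortization quantitatively (an illustrative cost model, not taken from [13]): if each call to add() costs c cycles of call overhead plus w cycles of useful work per sample, the cost per sample with blocking factor $N_b$ is

$$w + \frac{c}{N_b},$$

so $N_b = 10$ already removes 90% of the per-sample call overhead, at the price of buffers of size $N_b$ on the node's ports.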
Fig. 18 A graph with a feedback loop.
Increasing Ng reduces the function-call overhead but requires larger buffer memory for graph execution. For instance, the "Add" node of Fig. 17 consumes Nb samples from each input port and produces Nb output samples, so all three buffers have size Nb, whereas they have size 1 when the blocking factor is unity. Moreover, increasing Ng lengthens the response time, although it does not decrease the throughput.

Another major obstacle to increasing the blocking factor is feedback loops. Vector processing is restricted by the number of initial delays on a feedback loop: if that number is smaller than Ng, the vector processing capability cannot be fully utilized. For example, the scheduling result for the graph of Fig. 18 is "A B G C D H E F" when the blocking factor is 1. If the blocking factor Ng becomes 5, the schedule becomes "5A 5B 5(GCDH) 5E 5F", in which nodes G, C, D, and H are repeated 5 times sequentially. Therefore, a scheduling algorithm for SSDF should consider the graph topology to minimize the program code size.

When feedback loops exist, strong components are clustered first. A strong component of a graph G is a subgraph F ⊆ G such that for every pair of nodes u, v ∈ F there exist paths p_uv (from u to v) and p_vu (from v to u). This clustering is performed hierarchically until the top graph has no feedback loop. Then a valid schedule for the SSDF graph can be constructed using the SDF scheduling algorithms, with each node scheduled using the global blocking factor Ng. For the SSDF graph in Fig. 18, the top graph consists of five nodes, "A B (CDGH) E F", where nodes C, D, G, and H are merged into a clustered node. When the blocking factor Ng is set to 5, a schedule for the top graph becomes "5A 5B 5(clustered-node) 5E 5F".

Next, the strong components are scheduled. The blocking factor within a strong component depends on the number of initial delay samples on the feedback loop. Let Nl(L) denote the maximum bound of the blocking factor on feedback loop L. Since feedback loops can be nested, the feedback loop with the largest bound Nl(L) is selected first, and subsequent loops are selected in descending order of Nl(L). Scheduling of the clustered subgraph starts with a node whose input ports carry enough initial delay samples to allow a large blocking factor; when the strong component "(CDHG)" of the SSDF graph is scheduled, actor G is fired first, since it has an initial delay sample. For a selected strong component, we schedule the internal nodes as follows, depending on the number of delays on the feedback loop.

Case 1: Ng is an integer multiple of Nl(L).
The scheduling order is repeated

$$N_g / N_l(L) \qquad (6)$$

times using Nb = Nl(L) for the internal nodes. In the example of Fig. 18, since Nl(L) = 1, Ng = 5, and Ng/Nl(L) is an integer, the schedule "GCDH" is repeated 5 times with blocking factor Nb = 1 for each node. Hence the final schedule is "5A 5B 5(GCDH) 5E 5F".

Case 2: Ng ≤ Nl(L). Blocking factor Nb = Ng is applied to all actors in the strong component. For example, if the number of delay samples in Fig. 18 increases to 5, then Nl(L) is 5, which equals Ng, and the schedule becomes "5A 5B (5G 5C 5D 5H) 5E 5F"; the blocking factor is fully utilized.

Case 3: Ng > Nl(L) but is not an integer multiple of it. One of two scheduling strategies can be applied:

1. The schedule for the strong component is repeated Ng times using Nb = 1 internally, which produces the smallest code at the cost of throughput.
2. The schedule is repeated with blocking factor Nb = Nl(L), and then once more for the remainder up to Ng. This improves throughput but also enlarges the code size.
When Nl(L) = 2, by increasing the number of delay samples to 2, a valid schedule for the strong component is "5(GCDH)" under the first strategy or "2(2G 2C 2D 2H) GCDH" under the second. Consequently, the final schedule is either "5A 5B 5(GCDH) 5E 5F" or "5A 5B 2(2G 2C 2D 2H) GCDH 5E 5F". Although the SSDF model was proposed to allow large blocking factors that exploit vector processing of simple operations within a node, the SSDF scheduling algorithm is also applicable to an SDF graph in which every node has an inline-style code specification: without modifying the SDF actors, the blocking factor can be applied to the SDF graph and schedule. For instance, when blocking factor Ng = 3 is applied to Fig. 2, a valid schedule is "9A 9D 6B 12C". For a given schedule with a blocking factor, programs can be synthesized as in Fig. 5, where each loop count in the code is multiplied by the blocking factor Ng (= 3).
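The case analysis above can be condensed into a small helper; the following C sketch (illustrative names) returns the internal blocking factor and repetition structure chosen by these rules, using strategy 2 for Case 3.

    /* Sketch: choose the internal blocking factor Nb, the number of
       repetitions of the strong component's schedule, and the size of
       the remainder pass (non-zero only in Case 3, strategy 2). */
    #include <stdio.h>

    void loop_blocking(int Ng, int Nl, int *Nb, int *reps, int *remainder) {
        *remainder = 0;
        if (Ng <= Nl) {                 /* Case 2: fully vectorized      */
            *Nb = Ng;  *reps = 1;
        } else if (Ng % Nl == 0) {      /* Case 1: repeat with Nb = Nl   */
            *Nb = Nl;  *reps = Ng / Nl;
        } else {                        /* Case 3, strategy 2            */
            *Nb = Nl;  *reps = Ng / Nl;  *remainder = Ng % Nl;
        }
    }

    int main(void) {
        int Nb, reps, rem;
        loop_blocking(5, 2, &Nb, &reps, &rem);  /* the Nl(L) = 2 example */
        printf("Nb=%d, reps=%d, remainder=%d\n", Nb, reps, rem);
        return 0;   /* prints Nb=2, reps=2, remainder=1,
                       i.e. "2(2G 2C 2D 2H) GCDH" */
    }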
References

1. Ade, M., Lauwereins, R., Peperstraete, J.A.: Implementing DSP applications on heterogeneous targets using minimal size data buffers. In: Proceedings of RSP'96, pp. 166–172 (1996)
2. Bhattacharyya, S.S., Murthy, P.K., Lee, E.A.: APGAN and RPMC: Complementary heuristics for translating DSP block diagrams into efficient software implementations. Journal of Design Automation for Embedded Systems, vol. 2, pp. 33–60 (1997)
3. Bilsen, G., Engels, M., Lauwereins, R., Peperstraete, J.A.: Cyclo-static dataflow. IEEE Transactions on Signal Processing, vol. 44, pp. 397–408 (1996)
4. Buck, J.T., Ha, S., Lee, E.A., Messerschmitt, D.G.: Ptolemy: A framework for simulating and prototyping heterogeneous systems. International Journal of Computer Simulation, special issue on Simulation Software Development, vol. 4, pp. 155–182 (1994)
5. Dennis, J.B.: Dataflow supercomputers. IEEE Computer Magazine, vol. 13 (1980)
6. Govindarajan, R., Gao, G., Desai, P.: Minimizing memory requirements in rate-optimal schedules. In: Proceedings of the International Conference on Application Specific Array Processors, pp. 75–86 (1993)
7. Lauwereins, R., Engels, M., Peperstraete, J.A., Steegmans, E., Ginderdeuren, J.V.: GRAPE: A CASE tool for digital signal parallel processing. IEEE ASSP Magazine, vol. 7, pp. 32–43 (1990)
8. Lee, E.A., Messerschmitt, D.G.: Static scheduling of synchronous dataflow programs for digital signal processing. IEEE Transactions on Computers, vol. C-36, pp. 24–35 (1987)
9. Oh, H., Ha, S.: Memory-optimized software synthesis from dataflow program graphs with large size data samples. EURASIP Journal on Applied Signal Processing, vol. 2003, pp. 514–529 (2003)
10. Oh, H., Ha, S.: Fractional rate dataflow model for memory efficient synthesis. Journal of VLSI Signal Processing, vol. 37, pp. 41–51 (2004)
11. Park, C., Chung, J., Ha, S.: Extended synchronous dataflow for efficient DSP system prototyping. Design Automation for Embedded Systems, vol. 3, pp. 295–322. Kluwer Academic Publishers (2002)
12. Pino, J., Ha, S., Lee, E.A., Buck, J.T.: Software synthesis for DSP using Ptolemy. Journal of VLSI Signal Processing, vol. 9, pp. 7–21 (1995)
13. Ritz, S., Pankert, M., Meyr, H.: High level software synthesis for signal processing systems. In: Proceedings of the International Conference on Application Specific Array Processors (1992)
14. Ritz, S., Willems, M., Meyr, H.: Scheduling for optimum data memory compaction in block diagram oriented software synthesis. In: Proceedings of ICASSP 95 (1995)
15. Sung, W., Ha, S.: Memory efficient software synthesis using mixed coding style from dataflow graph. IEEE Transactions on VLSI Systems, vol. 8, pp. 522–526 (2000)
16. Yang, H., Ha, S.: Pipelined data parallel task mapping/scheduling technique for MPSoC. In: DATE (Design, Automation and Test in Europe) (2009)
Mapping Decidable Signal Processing Graphs into FPGA Implementations Roger Woods
Abstract Field programmable gate arrays (FPGAs) are examples of complex programmable system-on-chip (PSoC) platforms, comprising dedicated DSP hardware resources and distributed memory. They are ideal platforms for implementing computationally complex DSP systems in image processing, radar, sonar and other signal processing applications. The chapter describes how decidable signal processing graphs are mapped into such platforms and shows how parallelism and pipelining can be controlled from a high level representation to achieve the required speed using minimal hardware resources. The process is demonstrated using a number of simple examples, namely a finite impulse response (FIR) filter, a lattice filter and a more complex adaptive signal processing design, a least mean squares (LMS) filter.
1 Introduction
Typically, DSP systems are implemented on either application specific integrated circuits (ASICs) or dedicated DSP microprocessor platforms. With the emergence of coarse grained processing elements such as dedicated multipliers, adders and DSP blocks in field programmable gate arrays (FPGAs), FPGAs have now emerged as a highly attractive platform for implementing high performance, complex DSP systems. In addition to the processing elements, distributed memory, both in the form of small distributed memory, e.g. registers and look up tables (LUTs), and embedded RAMs, allows high performance to be achieved by distributing the memory requirements and permitting highly parallel, pipelined implementations to be realized on the technology. With increasing non-recurrent engineering (NRE) costs and design challenges for sub-65nm CMOS technology, programmability is becoming increasingly important.
Roger Woods, ECIT Institute, Queen's University of Belfast, Queen's Road, Queen's Island, Belfast, UK, e-mail: [email protected]
Programmable platforms such as microprocessors and DSP processors allow users to develop software descriptions of the system and then use commercially available compilers and assemblers to create the resulting designs. This approach has numerous advantages including software portability, ease of product modification and migration, and reduced design time compared to ASIC or FPGA solutions. FPGAs also offer programmability but, unlike processors, involve the creation of the architecture, which is typically a complex design process. However, this offers the key advantage of obtaining very high performance while using pre-fabricated technology, thus avoiding the non-recurrent engineering (NRE) costs associated with ASICs. The main issue, therefore, is to map the DSP functionality efficiently onto the hardware, effectively creating a circuit architecture which best matches the system requirements. This is ideally achieved by making modifications to the signal flow graph (SFG) representation of the algorithm, such as those given in Chapters 28 and 29, so that an efficient architecture is produced. The basic process for realizing a circuit architecture from an SFG description is given in Fig. 1. The initial stage involves investigating performance using suitable design software such as Simulink or LabVIEW to determine correct wordlengths and performance; the next stage is to determine performance in terms of the sampling rate required, latency, etc., and then use these parameters to generate the circuit architecture from the SFG description; a key stage is then the iteration around this loop to optimize the performance, typically minimizing area requirements for the required throughput rate. As will be demonstrated in the chapter, the generation of the circuit architecture can be complicated when the SFG representation involves feedback loops. Given the make-up of FPGA architectures, with either separate multiplier and adder circuits or, in the case of more recent FPGA architectures, dedicated DSP blocks with fixed wordlength representations and distributed memory, the design focus is to optimize the circuit architecture in terms of these building blocks. This is the approach used in this chapter. It should be noted that some synthesis tools now offer some level of pipelining and retiming as an option, but these features have limited impact, as shown by the synthesis results given. The design process is illustrated with a number of simple examples, namely a FIR filter and a lattice filter, which represent non-recursive and recursive structures respectively, and a least mean squares (LMS) filter, which shows how decisions made in synthesizing the SFG can have an impact on performance.
1.1 Chapter breakdown
The chapter is organized as follows. A brief review of FPGA hardware platforms is given in section 2, concentrating on the features most relevant to DSP in the most recent and powerful FPGA families from Xilinx and Altera, namely the Virtex and Stratix families. The focus of the section is to highlight the key features of the families and not to provide a detailed description of the technologies. Section 3 highlights the key steps in creating FPGA circuit architectures for various speed and area requirements, covering retiming and delay scaling.
Optimizing the circuit architecture to meet certain performance criteria is achieved by applying folding and unfolding, either to increase speed or, if the speed requirement has been met, to achieve better performance by reducing area requirements. This is covered in section 4. Finally, the LMS filter is used to demonstrate the various optimizations and to cover various FPGA design trade-offs. The chapter concludes by looking at how some of the core features are exposed to the system designer and also how some of the techniques can be used to reduce dynamic power consumption.
2 FPGA hardware platforms
Historically, FPGAs were viewed as glue logic components where the design focus was on maximizing LUT utilization and minimizing programmable routing to provide interconnectivity at the lowest possible delay. With regard to structure, FPGAs are characterized by the following programmable components, although the granularity of some of the blocks has changed over the years:
• Programmable logic units that can be programmed to realize different digital functions.
• Programmable interconnect that allows different blocks to be connected together.
Fig. 1 SFG to FPGA circuit architecture design flow: algorithmic study (software simulations to determine wordlength requirements) feeds architecture synthesis (processor latency, truncation, wordlengths and internal word growth), followed by architecture retiming and optimisation to give the circuit architecture; the loop is reiterated to achieve an efficient area/time trade-off
• Programmable I/O pins.
The evolution of programmable adders in the 1990s meant that effort was concentrated on mapping particular DSP algorithms, such as fixed FIR filters and transforms, using distributed arithmetic techniques [2], [9]. In this era, the focus was to exploit the nature of some fixed DSP operations and transforms to allow fixed coefficient multipliers to be derived. These could be implemented efficiently using a small number of adders, and evolutions such as the reduced coefficient multipliers (RCMs) [10] allowed the creation of very efficient multipliers which could operate on a number of multiplicands rather than a single one. More recent evolutions, such as dedicated multipliers and, more recently, DSP blocks, have now changed the design focus once again [1]; the challenge is now to map the signal flow graph (SFG) not to LUTs and adders as previously, but to multiplier and adder blocks, meaning that pipelining is applied at the processor level rather than at the adder and LUT level; this is the focus taken here. As well as the complexity, overall performance has improved. For example, the Xilinx Virtex-5 family can now be clocked at 550MHz and has a number of on-board IP blocks and a number of DSP48E slices which combine to give a maximum of 352GMACS performance. With regard to connectivity, RocketIO GTP transceivers deliver 100Mbps-3.2Gbps. The Stratix III E family from Altera offers 48,000 to 338,000 equivalent LEs, up to 20Mbits of memory and a number of high-speed DSP blocks that provide implementations of multipliers, multiply accumulate blocks and FIR filters. It is worthwhile looking at this aspect of FPGA functionality in a little more detail.
2.1 FPGA logic blocks
The programmable logic block in FPGAs has largely remained unchanged over the years and comprises a lookup table (LUT), followed by dedicated circuitry for implementing fast addition, and a programmable storage unit, i.e. a flip-flop (see Fig. 2). The only major development has been the growth in LUT size, and 7-input or 6-input LUTs are now commonplace. This can be seen in the Altera Stratix III adaptive logic module (ALM) in Fig. 2(a) and in the Xilinx Virtex-5 configurable logic block (CLB) in Fig. 2(b). The adder configuration in FPGAs is typically based on a ripple carry adder structure, as this can be easily scaled in terms of input/output wordlength, even though it is conventionally the slowest of the adder structures [3]. In the case of the ALM, the adders can perform two two-bit additions or a single ternary addition, and with the Xilinx Virtex, it is possible to perform a fast addition using the fast carry logic circuitry. This gives a range of speeds from 1ns for an 8-bit addition to 2.5ns for a 64-bit addition.
Fig. 2 FPGA programmable logic blocks: (a) Stratix III ALM block diagram (6-input LUT); (b) Xilinx Virtex-5 CLB (carry logic)
2.2 FPGA DSP functionality
In addition to the programmable adder structures, both FPGA families have developed dedicated on-board DSP hardware, as shown in Fig. 3. A simplified version of the Xilinx DSP48 hardware block (Fig. 3(a)) comprises multiplicative and accumulation hardware which can be configured to implement one tap of an FIR filter. In addition, the programmable connectivity provided by the multiplexers (some of which are not shown) allows a variety of modes of operation, including single stage multiplication, single stage addition, multiply accumulation, and increased arithmetic computations achieved by chaining DSP blocks together using the connectivity shown at the bottom of the figure. A pattern detector is also provided on the output, which gives support for convergent rounding, overflow/underflow detection, block floating-point, and accumulator terminal count (counter auto reset).
The Altera Stratix equivalent circuit is shown in Fig. 3(b). Altera have opted for a more complex structure with 4 multipliers and adders/accumulators per stage. It has clearly been developed to support a number of specific DSP functions such as a 2-tap FIR filter, an FFT butterfly and a complex multiplication, namely (a + jb) × (c + jd). A number of wordlengths are supported, as indicated in Table 1. As with the Xilinx DSP hardware, functionality is also provided to support a number of modes of operation, including looping back from the output register (useful for recursion), connectivity of DSP blocks from above (as DSP blocks are connected in columns), and dedicated rounding, underflow and overflow circuitry.

Table 1 DSP block operation modes for Altera Stratix III FPGA technology [6]

Mode                     Multiplier width (bits)   No. of multipliers   No. per block
Independent multiplier   9/12/18/36/double         1                    8/6/4/2
Two-multiplier adder     18                        2                    4
Four-multiplier adder    18                        4                    2
Multiply-accumulate      18                        4                    2
Shift                    36                        1                    2
Fig. 3 DSP dedicated FPGA hardware: (a) Xilinx DSP48 simplified block diagram, comprising an 18×18 multiplier and adder/accumulator with ACin/BCin/PCin chaining from the lower DSP block and various rounding, under- and overflow circuitry; (b) Altera Stratix Half-DSP block, comprising four 18×18 multipliers feeding adder/accumulator stages with various rounding, under- and overflow circuitry
2.3 FPGA memory organization
A key aspect of FPGAs is memory distribution, which is important for DSP applications as pipelining is commonly used to speed up computations. The availability of registers in each ALM and CLB, as shown in Fig. 2, allows for direct implementation of pipeline registers. However, there are also other forms of distributed memory, in terms of distributed RAM blocks, which can be useful both for storing local data and for performing hardware sharing, where the FPGA may have to store the internal values for numerous independent calculations.
The various forms of memory for the Altera Stratix are shown in Table 2. This shows a number of different size memories that can be used in FPGAs. A higher level view of the use of this memory is given in Table 3 for the Xilinx Virtex-5 family, showing the speed and size trade-offs much more clearly. This represents a range of memories able to implement storage of coefficient data,
storage of internal data where block processing is being performed, and storage of internal data where cores may be used in hardware sharing mode in relatively low speed applications.

Table 2 Stratix memory types and usage [6]

Type    Bits/block   No. of blocks   Suggested usage
MLAB    640          6,750           Shift registers, small FIFO buffers, filter delay lines
M9K     9,216        1,144           General-purpose applications, cell buffers
M144K   147,456      48              Processor code storage, video frames

Table 3 Virtex-5 memory types and usage

Type              Unit size (Kb)   Total size (Kb)   Access time (ns)   Location
DDR/DDR2          4,096,000        4,096,000         10                 Off-chip
Block RAM         18               1728              2                  On-chip
Distributed RAM   0.016            288               2                  On-chip
Registers         0.001            36                2                  On-chip
In addition to the implementation of combinational logic circuits, the LUT can be used to implement a single bit RAM and also an efficient shift register. The principle by which the LUT can be used as a shift register is covered in detail in the Xilinx application note [7] and illustrated in Fig. 4. Fig. 4(a) shows how a 16:1 multiplexer is realized, where the 4-bit address input is used to address the specific input stored in the RAM, and Fig. 4(b) shows how this is extended to an addressable shift register. This provides an efficient means of implementing shift registers.
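To make the addressable shift register behaviour concrete, the following small behavioural model in Python captures an SRL16-style, 16-deep LUT shift register. It is an illustrative sketch of the mechanism described in [7], not vendor code; the class and method names are hypothetical.

# Sketch: behavioural model of a LUT used as a 16-deep addressable
# shift register (SRL16-style), as described in [7]. Illustrative only.

class SRL16:
    def __init__(self):
        self.regs = [0] * 16
    def clock(self, d):      # shift one bit in per clock cycle
        self.regs = [d] + self.regs[:15]
    def read(self, addr):    # 4-bit address selects the tap to output
        return self.regs[addr]

srl = SRL16()
for bit in [1, 0, 1, 1]:
    srl.clock(bit)
print(srl.read(3))  # 1: the bit shifted in four clocks ago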
2.4 FPGA design strategy
The availability of dedicated multipliers, adders, multiply-accumulate blocks and distributed memory, either in the form of distributed RAM blocks or registers, represents an ideal platform for implementing DSP systems. A number of points emerge from the process of implementing DSP systems on FPGA.
• The availability of dedicated multipliers, adders and memory elements suggests a direct mapping from the processing graphs to FPGAs, where each function can be implemented by a separate processing element, thereby allowing high levels of parallelism. Historically, a lot of effort had been dedicated to implementing multiplicative functionality using the LUT-based programming element (Fig. 2)
Fig. 4 Configurations for Xilinx CLB LUT [7]: (a) 16:1 multiplexer; (b) shift register
[2], [9], [10], but this has now been superseded by the recent developments in FPGA architectures, namely the DSP blocks.
• The plethora of small, distributed memories suggests that a highly pipelined approach is applicable to FPGA. Pipelining not only has the benefit of improving the throughput rate of many systems but can also act to reduce dynamic power consumption [1], [11], as it reduces average net length and switching activity.
• Given that typical FPGA implementations can be clocked at between 200-500MHz depending on application requirements, a key aspect is to apply hardware sharing through folding to allow better utilization of the FPGA hardware. This reduces the computational hardware but increases memory requirements (to allow storage of the current state for the many functions being implemented). Efficient implementation using the available memory resource is therefore a requirement.
3 Circuit architecture derivation
The availability of processing elements and distributed memory on FPGA provides a clear focus for investigating the mapping of DSP functions into highly pipelined, parallel circuit architectures. This is demonstrated using the simple examples of FIR and lattice filters, which highlight non-recursive and recursive computations respectively.
3.1 Basic mapping of DSP functionality to FPGA
DSP systems can be represented using an SFG, as illustrated for the FIR filter description given in Fig. 5(a) and represented by Equation (1). In SFG descriptions, the directed edge (j, k) denotes a linear transform, usually given as a multiplication, addition or delay element, from the signal at node j to the signal at node k. The data flow graph (DFG) representation is shown in Fig. 5(b); it is shown here as it is a more useful representation for applying much of the retiming optimization applied later in the chapter.

$$y_n = \sum_{i=0}^{N-1} a_i x_{n-i} \qquad (1)$$

Fig. 5 Representations of N-tap FIR filter: (a) SFG representation, with delays z^{-1} and coefficients a_0, a_1, ..., a_{N-1}; (b) DFG representation, with multipliers M_0, ..., M_{N-1}, adders A_1, ..., A_{N-1} and delays D
A direct mapping of the algorithm into hardware results in the multiplications and additions being mapped into the dedicated DSP blocks, and the delay blocks into registers in the CLB/ALM functional blocks. Multiplication of an m-bit multiplicand by an n-bit multiplier results in an (m + n)-bit output if the output is not truncated. The output wordlength of n successive additions of x-bit inputs is given as $x + \log_2 n$. Typically, truncation, or more probably rounding, is applied throughout the structure and modeled in Simulink or LabVIEW using suitable fixed-point libraries to capture the wordlength effects.
Table 4 and Table 5 give Xilinx Virtex-5 and Altera Stratix III area and speed figures respectively, for the 64-tap FIR filter of the form given in Fig. 5, synthesized using Synplicity Synplify Pro 9.4 synthesis tools from Synopsys. In the comparison, a 12-bit coefficient wordlength has been chosen and data wordlengths of
8-bit, 16-bit and 24-bit input wordlengths respectively, to illustrate how the mapping changes for each technology. The input wordlength exceeds the multiplier size in the Altera DSP blocks when the wordlength grows above 18 bits, as clearly seen in Fig. 3(b), so the number of blocks goes up, as well as the ALMs in the form of registers and LUTs. With the Xilinx device, the number of DSP48E blocks remains constant, as they can support one input word size of 25 bits, as can be seen in Fig. 3(a). The register and LUT counts then increase accordingly with word size. This highlights the importance of relating the wordlength chosen during the algorithmic development stages to the implementation on the specific FPGA platform.

Table 4 64-tap FIR filter implementation in Virtex-5 XC5VLX30

Wordlength   DSP48E   No. of registers   No. of LUTs
8            32       834                5384
16           32       1338               8367
24           32       1842               11683

Table 5 64-tap FIR filter implementation in Stratix III EP3SE50 FPGA

Wordlength   DSP blocks   No. of registers   No. of LUTs
8            8            1126               220
12           8            1622               284
16           64           2118               1061
3.2 Retiming
The throughput rates of the FIR filter implementations are not comparable to single DSP block performance. This is because the critical path of the FIR filter structure comprises one multiplier and 63 additions. Of course, a key advantage is that this speed is not the clock rate but the sampling rate, as 64 multiplications and 63 additions are performed on each clock cycle, unlike a sequential processor where the clock rate has to be divided by the number of operations. Pipelining can, however, be applied to achieve an improved throughput rate. This is achieved by performing retiming [12], which is a transformation technique used to move delays in a circuit without affecting the input/output characteristics. Retiming has been applied in synchronous designs for clock period reduction [12], power consumption reduction [13], and logic synthesis in general.
The basic process of retiming is given in Fig. 6, as taken from [5]. For a circuit with two nodes U and V and ω delays on the edge between them (as shown in Fig. 6(a)), a
retimed circuit can be derived where ω' delays are now present on the edge, as shown in Fig. 6(b). The ω' value is computed using equation (2), where r(U) and r(V) are the retiming values for nodes U and V respectively.
Fig. 6 Basic retiming process [5]: (a) original DFG representation, nodes U and V with retiming values r(U) and r(V) and ω delays on the edge; (b) retimed DFG representation with ω' delays
$$\omega'(e) = \omega(e) + r(U) - r(V) \qquad (2)$$
Retiming has a number of properties, which are summarized from [5]:
1. The weight of any retimed path is given by equation (2).
2. Retiming does not change the number of delays in a cycle.
3. Retiming does not alter the iteration bound in a DFG, as the number of delays in a cycle does not change.
4. Adding a constant value j to the retiming value of each node does not alter the number of delays in the edges of the retimed graph.
Of course, the key trick is to be able to work out the retiming values r(U) and r(V) which result in a better retimed graph. A number of retiming routines have been developed, but the most intuitive one is the cut-set retiming technique defined in [4].
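As a concrete illustration of equation (2), the following Python sketch applies a set of retiming values to a small DFG given as an edge list, following the sign convention as printed above, and checks that no edge weight goes negative. The example graph and retiming values are illustrative, not taken from the chapter.

# Sketch: applying the retiming equation (2), w'(e) = w(e) + r(U) - r(V),
# to a DFG given as an edge list. Graph and retiming values are
# illustrative only, not taken from the chapter.

def retime(edges, r):
    """edges: list of (u, v, w) with w delays on edge u -> v;
    r: retiming value per node. Returns the retimed edge list."""
    retimed = [(u, v, w + r[u] - r[v]) for u, v, w in edges]
    assert all(w >= 0 for _, _, w in retimed), "invalid retiming"
    return retimed

# move one of the two delays on A1 -> A2 back onto M -> A1
edges = [("M", "A1", 0), ("A1", "A2", 2)]
print(retime(edges, {"M": 1, "A1": 0, "A2": 1}))
# [('M', 'A1', 1), ('A1', 'A2', 1)]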
3.3 Cut-set theorem
A cut-set in an SFG (or DFG) is a minimal set of edges which partitions the SFG into two parts. The procedure is based upon two simple rules.
Rule 1: Delay Scaling. All delays D presented on the edges of an original SFG may be scaled as D → αD, where the single positive integer α is known as the pipelining period of the SFG. Correspondingly, the input and output rates also have to be scaled by a factor of α (with respect to the new time unit D'). Time scaling does not alter the overall timing of the SFG.
Rule 2: Delay Transfer [12]. Given any cut-set of the SFG, which partitions the graph into two components, we can group the edges of the cut-set into inbound and outbound edges, as shown in Fig. 7, depending upon the direction assigned to the edges. The delay transfer rule states that a number of delay registers, say k, may
be transferred from outbound to inbound edges, or vice versa, without affecting the global system timing.
Fig. 7 Cut-set theorem application: a cut partitions the graph into Processor 1 and Processor 2, with inbound and outbound edges crossing the cut
Consider the application of this to the DFG of the FIR filter given earlier. Its application is better understood by redrawing the FIR filter structure to show the connectivity to the input source x_n and the individual delay connections to each of the multipliers (see Fig. 8(a)). Application of a horizontal cut separates the multipliers and adders into two separate graphs, and pipelining after the multipliers is achieved. The application of several vertical cut-sets partitions the DFG into N − 2 separate sub-graphs, which allows the adders to be pipelined. A better approach is to apply the same cut-sets to the transposed version of the FIR filter, where the data is fed from right to left, resulting in a transfer of delays from the x_n input to the adder chain rather than the addition of extra delays. This results in the figures given in Table 6 for a Virtex-5 target technology. The figures included in brackets represent the values when the pipelining and retiming synthesis options are applied when synthesizing the designs using Synplicity Synplify Pro 9.4 synthesis tools from Synopsys.

Table 6 Various 64-tap FIR filter implementations with 8-bit input, 10-bit coefficients in a Xilinx Virtex-5 XC5VSX50T FPGA

Circuit     DSP blocks   Registers     LUTs         Throughput (MHz)
Fig. 8(a)   64 (64)      1147 (1341)   599 (516)    93 (107)
Fig. 8(b)   64 (64)      2119 (1580)   1105 (829)   122 (151)
Fig. 8(d)   64 (64)      1651 (1640)   68 (68)      271 (271)
Fig. 8 Cut-set retiming applied to the N-tap FIR filter DFG [5]: (a) original FIR DFG; (b) horizontal cut-set; (c) vertical cut-sets; (d) pipelined transposed FIR
3.4 Application to recursive structures
The previous structure was non-recursive and simply involved the application of delay transfer. Many DSP functions involve feedback, such as the lattice filter structure given in Fig. 9, which has a number of feedback loops.
Fig. 9 Lattice filter structure (multipliers M1-M4, adders A1-A4, delays D1 and D2)
Examining the DSP units in Fig. 3, it is clear that a good strategy for FPGA is to apply pipelining at the processor level. The first step is to determine the pipelining period of the structure, which is done by working out the possible paths between D1 and D1, between D1 and D2, and so on. The possible paths are given below.

D1 to D1 ⇒ D1 → M4 → A3 → D1
D1 to D2 ⇒ D1 → A4 → D2 and D1 → M4 → A3 → M3 → A4 → D2
D2 to D1 ⇒ D2 → M2 → A1 → A3 → D1
D2 to D2 ⇒ D2 → M2 → A1 → A3 → M3 → A4 → D2

The next stage is to determine the pipelining period by applying the longest path matrix algorithm [5]. This involves constructing a series of matrices. If d is the number of delays in the DFG, then a series of matrices $L^{(m)}$, m = 1, 2, ..., d, is created such that the element $l^{(m)}_{i,j}$ represents the longest path from delay element $D_i$ to $D_j$ which passes through exactly m − 1 delays (not including $D_i$ and $D_j$); if no path exists, then $l^{(m)}_{i,j}$ is −1. The iteration bound is found by examining the diagonal elements. The underlying assumption is pipelining at the processor level, and therefore a delay is required at each processing unit. Examining the paths shown above, we generate the matrix $L^{(1)}$ as shown below and then compute the higher order matrices. The higher order matrices do not need to be derived from the DFG and can be computed recursively using

$$l^{(m+1)}_{i,j} = \max_{k \in K} \left(-1,\; l^{(1)}_{i,k} + l^{(m)}_{k,j}\right)$$

For the lattice structure of Fig. 9 this gives:
$$L^{(1)} = \begin{pmatrix} 2 & 4 \\ 3 & 5 \end{pmatrix} \;\Rightarrow\; L^{(2)} = \begin{pmatrix} 7 & 9 \\ 8 & 10 \end{pmatrix}$$
The pipelining period is then calculated using equation (3):

$$T_\infty = \max_{i,\,m} \left\{ \frac{l^{(m)}_{i,i}}{m} \right\}, \quad m = 1, 2, \ldots, d \qquad (3)$$

For this example, this gives a pipelining period of 5, as shown below:

$$T_\infty = \max\left\{ \frac{2}{1}, \frac{5}{1}, \frac{7}{2}, \frac{10}{2} \right\} = 5$$

Using this value, the circuit is retimed using the cut-set theorem, as illustrated in Fig. 10. The process is more easily demonstrated by redrawing the lattice filter structure as shown in Fig. 10(a). Firstly, we transfer 4D from the output to the input edges, resulting in 4D on edge M3-A4 and 9D on edge A3-A4; the outputs are then reduced to D, which can be embedded into the processor to allow pipelining. 3D is then transferred in cut-set#2, giving the revised form in Fig. 10(b). Two additional cut-sets are employed in Fig. 10(b) to ensure that a single D is left available on the output of the processors, as shown in Fig. 10(c). In cut-set#7, applying the cut will result in a delay of −D on edge A1-M1, so to preempt this we transfer 3D from the output in cut-set#5 and transfer the delays as shown, giving Fig. 10(d). The presence of a single delay on each output edge means that the circuit can now be pipelined at the processor level.
Synthesis results of the original function and the pipelined version using Xilinx Virtex-5 FPGA technology are given in Table 7. The addition of the pipeline allows a neat mapping to the DSP48E processors, allowing the blocks to be fully used and reducing the number of LUTs needed. Of course, the throughput is reduced, as inputs are only required once every five clock cycles, although the clock rate has increased considerably.

Table 7 Synthesis results for the lattice filter using a Xilinx Virtex-5 XC5VSX50T FPGA
Circuit                           DSP blocks   LUTs   Clock (MHz)   Throughput (MHz)
Original (Fig. 9)                 4                   88            88
Scaled and retimed (Fig. 10(d))   4            109    266           53.2
Hardware shared (Fig. 11)         1            50     240           48
Fig. 10 Retiming applied to lattice filter: (a) redrawn lattice filter; (b) application of first cut-set; (c) application of second cut-set; (d) final pipelined structure
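The longest path matrix computation used above is easy to mechanize. The following Python sketch builds the higher order matrices recursively and recovers the pipelining period of 5 from the lattice filter's L(1); the function names are illustrative, not from the chapter.

# Sketch of the longest path matrix algorithm [5]: builds L(m+1) from
# L(1) and L(m) using the recursion above, then takes the maximum of
# the diagonal elements l(m)_{i,i} / m. Checked against the lattice
# filter numbers: L(1) = [[2,4],[3,5]] gives a pipelining period of 5.

def next_matrix(L1, Lm):
    d = len(L1)
    out = [[-1] * d for _ in range(d)]
    for i in range(d):
        for j in range(d):
            cands = [L1[i][k] + Lm[k][j] for k in range(d)
                     if L1[i][k] != -1 and Lm[k][j] != -1]
            out[i][j] = max(cands, default=-1)
    return out

def pipelining_period(L1):
    d = len(L1)
    Lm, T = L1, 0.0
    for m in range(1, d + 1):
        T = max(T, max(Lm[i][i] / m for i in range(d)))
        Lm = next_matrix(L1, Lm)
    return T

print(pipelining_period([[2, 4], [3, 5]]))  # 5.0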
4 Circuit architecture optimization
The techniques so far have involved the use of pipelining to achieve the required sampling rates, but some applications operate at lower sampling rates than those indicated in Tables 6 and 7; in addition, redundancy may have occurred as a result of pipelining, as illustrated by the lattice filter example, which runs at a high clock rate but at a comparably lower sampling rate than the original design. In these scenarios, the aim would be to share the hardware by folding.
4.1 Folding
In folding, the aim is to perform hardware sharing on the DFG, thereby allowing trade-offs to be made at the high level, as it would be relatively straightforward to work out the clock rate for the lattice filter after pipelining (by computing the delay for a pipelined multiplication and accumulation) without the need to perform FPGA
synthesis. Given that a scaling factor of 5 has been applied to the lattice filter, it is clear that we can schedule each multiplication and addition operation to take place in different cycles, so it is possible to share hardware between these operations using one multiplier and one adder without losing performance. This is achieved by folding [5]. Consider edge e connecting nodes U and V with w(e) delays. Let the executions of the l-th iteration of nodes U and V be scheduled at times Nl + u and Nl + v respectively, where u and v are folding orders that satisfy 0 ≤ u, v ≤ N − 1. If H_u and H_v are the functional units for U and V respectively, and H_u is pipelined by P_u stages, then the result of the l-th iteration of node U is available at time unit Nl + u + P_u. Since the edge has w(e) delays, the data will be used at the (l + w(e))-th iteration of V, which is executed at time N(l + w(e)) + v. Thus the delay on each folded edge can be calculated by equation (4):

$$D_F(U \xrightarrow{e} V) = N w(e) - P_u + v - u \qquad (4)$$

This expression can be systematically applied to the retimed lattice filter of Fig. 10(d) to derive the circuit of Fig. 11.
Fig. 11 Hardware sharing applied to lattice filter: a single multiplier and a single adder are time-multiplexed via multiplexers under control signals C1-C4
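A minimal sketch of evaluating the folding equation (4) follows. The edge parameters and folding orders below are illustrative values chosen for demonstration, not the chapter's actual lattice filter schedule; only N = 5 is taken from the text.

# Sketch: evaluating the folding equation (4). The edge parameters and
# folding orders below are illustrative values, not the chapter's
# actual lattice filter schedule.

def folded_delay(N, w, Pu, u, v):
    """Delays needed on the folded edge U -> V for folding factor N."""
    d = N * w - Pu + v - u
    assert d >= 0, "negative folded delay: retime or reschedule first"
    return d

# e.g. N = 5 (the lattice filter scaling factor), a one-delay edge,
# a 2-stage pipelined source unit, folding orders u = 1 and v = 3:
print(folded_delay(N=5, w=1, Pu=2, u=1, v=3))  # 5*1 - 2 + 3 - 1 = 5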
The synthesis results for the hardware shared version show a considerable reduction in hardware: one DSP48E block is required rather than four. The clock rate is reduced when compared to the retimed version in Fig. 10(d), due to the presence of the multiplexers, but given the resources used, it could be argued to give the best performance of all three solutions.
4.2 Unfolding
In addition to reducing the hardware, it is also possible to increase the amount of hardware by unfolding the algorithm to allow a greater throughput rate. Given that the speed is limited by the loops in this circuit, this does not make sense for this example unless a large number of independent computations are to be performed. For this reason, it is not appropriate to apply the transform here, but the reader is referred to other texts which highlight relevant examples of its application [1], [5].
4.3 Adaptive LMS filter example
To date, the examples have been quite simple, but a delayed LMS filter is now used to show how algorithmic design and architecture development are closely linked. An 8-tap transposed form of the delayed LMS (TF-DLMS) adaptive algorithm [14] is given in Fig. 12(a). It shows that the DLMS filter is constructed from an adder and eight instances of a tap cell, highlighted in dashed lines. One of the reasons that a TF-DLMS filter has been chosen is that the delay is increased to allow pipelining to be applied; however, this compromises performance and must be judged carefully. Thus, a key aspect of the design process is to determine the minimum value of m needed to create the pipelining.
There are, in effect, 16 delay loops in the circuit. Eight of the loops are trivial and refer to the adders A11, A21, ..., A81, each of which has one processing unit and one delay, thus giving a pipelining period of 1. The other 8 loops are given below, and the pipelining period is calculated as before. For loop 1, the path is given by y_n → mD → A1 → M81 → A81 → M82 → A82, which is equivalent to 5D if pipelining is to be applied at the processor level. Given that the programmable delay provides m + 1 delays in this loop, this gives a minimum value of m of 4. By considering the 7 other loops (only some of which are shown), it is seen that this represents the worst case.

Loop 1: y_n → mD → A1 → M81 → A81 → M82 → A81:  α₁ = 5/(m+1) ⇒ m₁ = 4
Loop 2: y_n → mD → A1 → M71 → A71 → M72 → A71 → A81:  α₂ = 6/(m+3) ⇒ m₂ = 3
Loop 3: y_n → mD → A1 → M61 → A61 → M62 → A61 → A71 → A81:  α₃ = 7/(m+5) ⇒ m₃ = 2
...
Loop 8: y_n → mD → A1 → M11 → A11 → M12 → A11 → A21 → A31 → A41 → A51 → A61 → A71 → A81:  α₈ = 12/(m+15) ⇒ m₈ = −3

m = max(m₁, ..., m₈) = 4
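This worst-case calculation is mechanical, as the short Python sketch below shows; the (processors, fixed delays) pairs are read directly off the loop listing above, with loops 4-7 omitted as in the text.

# Sketch: the minimum programmable delay m for the TF-DLMS loops above.
# Each loop needs m + fixed_delays >= processors for processor-level
# pipelining, so m_i = processors - fixed_delays.

loops = [(5, 1), (6, 3), (7, 5), (12, 15)]  # loops 4-7 omitted, as in the text
print(max(p - d for p, d in loops))         # 4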
Fig. 12 LMS filter: (a) detailed LMS filter SFG; (b) application of retiming (shown for the rightmost tap); (c) retimed LMS filter
The delay of 4D is then applied to the mD delay block and retiming is performed as illustrated in Fig. 12(b), which shows the rightmost tap of the filter; the process is repeated throughout the structure in the same way to give the final retimed circuit in Fig. 12(c). The first cut is made around processor A1, around which two delays are transferred from inputs to output, giving the second figure in Fig. 12(b). This leaves a delay at the output of A1, which signifies that it can be pipelined. A second cut is applied around multiplier M81 and a delay transferred from the output. Due to the fork in the output from A1, a delay appears as shown in the third figure in Fig. 12(b), which is, in effect, at the input to multiplier M71 (not shown on the diagram). A delay already exists around adder A81, so it does not require transfer. The final cut is made around A82, allowing transfer from the output of A82 to introduce delays on the two inputs, one of which is the output of multiplier M82. The same procedure is then applied to each tap in the circuit to produce a fully pipelined version, as shown in Fig. 12(c).
The circuits of Fig. 12(a) and Fig. 12(c) have been coded in VHDL and synthesized for the Xilinx Virtex-5 FPGA family. The results are given in Table 8. The throughput rate clearly shows the speed improvement when applying pipelining, with the only effective area increase being in the number of registers employed. This can be mitigated by employing the shift registers shown earlier, which is covered in detail in [1].

Table 8 64-tap LMS filter results in Virtex-5 FPGA

Circuit      DSP blocks   Registers   LUTs   Throughput (MHz)
Fig. 12(a)   128          4029        2440   123
Fig. 12(c)   129          20985       2952   203
4.4 Towards FPGA-based IP core generation
A key concern in modern electronic design is the design time, a concern which is most critical in SoC design as complexity grows. One way to reduce design time is to exploit design reuse, allowing designs or partial designs to be reused in the creation of future systems. This issue has been recognized by the International Technology Roadmap for Semiconductors 2005 [8], which indicates that levels of reused logic will increase from the 36% quoted in 2008 to an expected value of 48% by 2015. One of the ways to increase design reuse is to use silicon intellectual property (IP) cores. IP cores can range from dedicated circuit layout, known as hard IP, through efficient code targeted to programmable DSP or RISC processors (firm IP), to hardware description language (HDL) descriptions of archi-
tectures such as those derived in this chapter, known as soft IP. A key aspect of soft IP is that scalability and parameterization can be built into the cores, a feature which makes the cores particularly powerful, as FPGA-based synthesis tools such as Synplify DSP high-level synthesis from Synopsys allow the detailed performance to be quickly evaluated. This latter form of core is known as the parameterizable soft IP core; such cores can be designed so that they may be synthesized in hardware for a range of specifications and processes, such as filter tap size, transform point size, or wordlength in DSP applications, by setting generic parameters in the HDL descriptions [15]. Whilst parameters such as input/output and coefficient wordlength can be easily identified as parameters, the implementation of the level of pipelining is much more difficult to achieve, but is important as this is typically a key parameter in determining the throughput rate. Therefore a sound approach is to allow a level of pipelining, or pipelining period α, to be set by the user. From an SFG perspective, this requires the delay to be computed in terms of α for each edge of the SFG, as retiming usually results in a delay on every SFG edge following a processor and possibly on edges between delays. In this way, the level of pipelining can be introduced as an IP core parameter to the design, thereby giving the designer control to adjust the speed of the design.
5 Conclusions
The chapter has shown how decidable signal processing graphs are mapped into field programmable gate array (FPGA) hardware by adjusting the pipelining period and subsequently exploring folding and unfolding. The availability of hierarchical levels of memory, specifically flip-flops and distributed RAM, makes pipelining an attractive optimization to explore in FPGA implementations, particularly with programmable interconnect having such an impact on the critical path. The chapter has shown this for a number of applications, including a simple FIR filter and a more complex adaptive LMS filter. In most cases, real synthesis figures from Xilinx Virtex-5 FPGAs have been used to back up the theoretical analysis.
As much of the material in this book illustrates, the implementation of pipelining within such structures increasingly represents only a component design aspect in the design of complex DSP systems; a key feature, therefore, is the efficient incorporation of cores created for these system components in a higher level design flow, which is considered in section 5.1. Whilst the approaches in the chapter have mostly been explored for the optimization of speed against area, the implications for power are given in section 5.2.
5.1 Incorporation of FPGA cores into data flow based systems
One of the knock-on effects of applying pipelining for system level DSP implementation is the change in latency of the DSP component. For example, in the FIR filter of Fig. 8(a) there is no delay in generating the output; however, for the transposed pipelined version (Fig. 8(d)), the impact of retiming has been to improve the throughput rate but also to increase the latency to N delays. This latency needs to be reflected at a higher system level. In many cases this will have no impact, but the FIR filter will most probably be used in the creation of a more complex system, and this latency could have an impact on performance if it is in the critical path. Therefore, there may be a requirement to adjust the latency of the FIR filter at a later stage in the design flow. For the lattice filter example, the obvious choice would be to select the hardware shared version of the filter (Fig. 11), as it uses less hardware, but only if it gives the required speed. However, if the system requirements need several lattice filtering operations, the implementation of Fig. 10(d) may be preferable, as it allows several operations to be interlaced and has a higher clock rate than Fig. 11.
An initial viewpoint is to treat the synthesized circuit as a black box where all the latencies are coded as fixed values, or even programmable values if the delay is dependent on a programmable value such as the number of taps in the filter. The aim of the system level designer would then be to set this value from system level requirements. However, if the aim is to use the core to process several streams of data, as in the case of the lattice filter highlighted above, then state information would also need to be incorporated into the core. In this case, the aim would be to develop a strategy based around the "white box reuse" strategy [16], where previously synthesized systems allow some possibility of changing the internal architecture for pipelining or retiming reasons. This will increasingly become an issue as high level design tools emerge which increasingly have to interface with cores that have been developed effectively in a 'bottom-up' fashion.
5.2 Application of techniques for low power FPGA
Power consumption is becoming a critical issue in current systems and is one of the reasons why FPGAs are recommended over processors for DSP systems. Generally speaking, power consumption is separated into static power consumption, which is the power consumed when the circuit is in static mode, switched on but not necessarily processing data, and dynamic power consumption, which is the power used when the system is processing data. Static power consumption is important for battery life in standby mode, as it represents the power consumed whenever the device is powered up, and dynamic power is important for battery life when operating, as it represents the power consumed when processing data.
One of the interesting aspects of the techniques outlined in this chapter is that they generally result in solutions that have a reduced power consumption when com-
pared to the original circuits. At first, this is surprising, as pipelining would appear to increase hardware and therefore capacitance, due to the additional registers added. However, the dynamic power consumption does not just depend on capacitance but on switched capacitance, i.e. the product of the switching activity on a node and the capacitance of the node. The impact of pipelining as a system level technique is to provide a mechanism by which the FPGA place and route tools can reduce the interconnect length and therefore the capacitance.
This can have a major impact, as demonstrated in Table 9, which gives power figures for 32-tap versions of the FIR filter implementations given earlier, taken from real hardware [17]. The figures are taken from implementations on a Virtex-2 FPGA and show a substantial power gain due to the reduction in the length of interconnect. Obviously, this reduction would be smaller in more modern FPGA structures where the DSP blocks have dedicated routing.

Table 9 Internal signal/logic power consumption of various FIR filters [17]

Circuit     Power consumption (mW)
Fig. 8(a)   964.2
Fig. 8(b)   391.7 (-59.3%)
Fig. 8(c)   16.4 (-98%)
References
1. Woods R, McAllister J, Lightbody G and Yi Y (2008) FPGA-based Implementation of Signal Processing Systems. 1st edn. Wiley, UK.
2. Meyer-Baese U (2001) Digital Signal Processing with Field Programmable Gate Arrays. Springer, Germany.
3. Omondi AR (1994) Computer Arithmetic Systems. Prentice Hall Int'l., New York.
4. Kung SY (1988) VLSI Array Processors. Prentice Hall Int'l., Englewood Cliffs, NJ, USA.
5. Parhi KK (1999) VLSI Digital Signal Processing Systems: Design and Implementation. John Wiley and Sons, Inc., New York.
6. Altera Corp. (2007) Stratix III Device Handbook. Available via http://www.altera.com. Cited 6 March 2009.
7. Xilinx Inc. (2005) Using Look-up Tables as Shift Registers (SRL-16) in Spartan-3 Generation FPGAs. Available via http://www.xilinx.com. Cited 6 March 2009.
8. International Technology Roadmap for Semiconductors: Design. Available via http://www.itrs.net/Links/2005ITRS/Design2005.pdf.
9. White SA (1989) Applications of Distributed Arithmetic to Digital Signal Processing: a tutorial review. IEEE ASSP Mag. 6(3), 4-19.
10. Turner RH and Woods R (2004) Highly Efficient, Limited Range Multipliers for LUT-based FPGA Architectures. IEEE Trans. on VLSI Syst. 12, 1113-1118.
11. Wilton SJE, Luk W and Ang SS (2004) The Impact of Pipelining on Energy per Operation in Field-Programmable Gate Arrays. Proc. of Int'l Conf. on Field Programmable Logic and Application, 719-728.
12. Leiserson CE and Saxe JB (1983) Optimizing Synchronous Circuitry by Retiming. Proc. of the Third Caltech Conference on VLSI, 87-116.
13. Monteiro J, Devadas S and Ghosh A (1993) Retiming Sequential Circuits for Low Power. Proc. of IEEE Int'l Conf. on Computer Aided Design, 398-402.
14. Ting LK, Woods RF, Cowan CFN, Cork P and Sprigings C (2000) High-Performance Fine-grained Pipelined LMS Algorithm in Virtex FPGA. Proc. of SPIE Advanced Signal Processing Algorithms, Architectures and Implementations X, 4116, 288-299.
15. McCanny J, Hu Y, Ding T, Trainor D and Ridge D (1996) Rapid design of DSP ASIC cores using Hierarchical VHDL Libraries. Proc. of 30th Asilomar Conf. on Signals, Systems and Computers, 1344-1348.
16. Bringmann O and Rosenstiel W (1997) Resource Sharing in Hierarchical Synthesis. Proc. of Int'l Conf. on Computer-Aided Design, 318-325.
17. McKeown S, Fischaber S, Woods R, McAllister J and Malins E (2006) Low Power Optimisation of DSP Core Networks on FPGA for High End Signal Processing Systems. Proc. of Int'l Conf. on Military and Aerospace Programmable Logic Devices, Washington, D.C.
Dynamic and Multidimensional Dataflow Graphs Shuvra S. Bhattacharyya, Ed F. Deprettere, and Joachim Keinert
Abstract Much of the work to date on dataflow models for signal processing system design has focused on decidable dataflow models that are best suited for one-dimensional signal processing. In this chapter, we review more general dataflow modeling techniques that are targeted to applications that include multidimensional signal processing and dynamic dataflow behavior. As dataflow techniques are applied to signal processing systems that are more complex, and demand increasing degrees of agility and flexibility, these classes of more general dataflow models are of correspondingly increasing interest. We begin with a discussion of two dataflow modeling techniques — multidimensional synchronous dataflow and windowed dataflow — that are targeted towards multidimensional signal processing applications. We then provide a motivation for dynamic dataflow models of computation, and review a number of specific methods that have emerged in this class of models. Our coverage of dynamic dataflow models in this chapter includes Boolean dataflow, the stream-based function model, CAL, parameterized dataflow, and enable-invoke dataflow.
Shuvra S. Bhattacharyya, University of Maryland, USA, e-mail: [email protected]
Ed F. Deprettere, Leiden University, The Netherlands, e-mail: [email protected]
Joachim Keinert, Fraunhofer Institute for Integrated Circuits, Germany, e-mail: [email protected]

1 Multidimensional synchronous dataflow graphs (MDSDF)
In many signal processing applications, the tokens in a stream of tokens have a dimension higher than one. For example, the tokens in a video stream represent images, so that a video application is actually three-dimensional. Static multidimensional
(MD) streaming applications can be modeled using unidimensional dataflow graphs (see Chapter 30), but these are at best cyclo-static dataflow graphs, often with many phases in the actors' vector valued token production and consumption patterns. These models incur a high control overhead that can be avoided when the application is modeled in terms of a multidimensional dataflow graph, which may turn out to be just a multidimensional version of the unidimensional synchronous dataflow graph. Such graphs are called multidimensional synchronous dataflow graphs (MDSDF).
1.1 Basics
MDSDF is — in principle — a straightforward extension of SDF (see Chapter 30). Fig. 1 shows a simple example.

Fig. 1 A simple two-actor MDSDF graph: actor A produces (3,1) tokens, actor B consumes (1,3) tokens, and the arc carries an initial delay of (1,2); the numbers 1-6 indicate the production/consumption order
Actor A produces a (3, 1) token, which is shorthand for 3 rows and 1 column out of an array of elements, and actor B consumes a (1, 3) token (1 row and 3 columns). Note that only one dimension can be the streaming dimension. In the example of Fig. 1, this is the horizontal (column) dimension. The SDF 1D repetitions vector (see Chapter 30) becomes a repetitions matrix in the MDSDF case:

$$R = \begin{pmatrix} r_{A,1} & r_{B,1} \\ r_{A,2} & r_{B,2} \end{pmatrix}$$

where $r_{A,1} \times 3 = r_{B,1} \times 1$ and $r_{A,2} \times 1 = r_{B,2} \times 3$. Thus actor A has to produce (3,1) tokens a number of times, and actor B has to consume (1,3) tokens a number of times. The equations can be solved to yield $(r_{A,1}, r_{A,2}) = (1, 3)$ and $(r_{B,1}, r_{B,2}) = (3, 1)$. In other words, actor A has to produce three rows once and one column three times, and similarly for actor B. These equations are independent of any initial token tuples (x, y) in the arc's buffer. In Fig. 1, the buffer contains initial tokens in two columns of a single row, that is, a delay of (1,2). The production/consumption cycle proceeds in the order of the numbers given at the right in the figure. Note that the state returns to its initial value (1,2).
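Since the MDSDF balance equations decouple per dimension, they can be solved exactly as in 1D SDF. The following Python sketch does this for the two-actor graph of Fig. 1; the function name is illustrative.

# Sketch: per-dimension balance equations for the two-actor graph of
# Fig. 1, solved with the same lcm construction used for 1D SDF.

from math import gcd

def repetitions(prod, cons):
    """Smallest positive (r_A, r_B) with r_A * prod == r_B * cons."""
    l = prod * cons // gcd(prod, cons)
    return l // prod, l // cons

print(repetitions(3, 1))  # dimension 1 (rows):    (r_A1, r_B1) = (1, 3)
print(repetitions(1, 3))  # dimension 2 (columns): (r_A2, r_B2) = (3, 1)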
1.2 Arbitrary sampling
Two canonical actors that play a fundamental role in MD signal processing are the compressor (or downsampler) and the expander (or upsampler). An MD signal s(n), n ∈ R^m (in many practical applications n ∈ Z^m), is defined on a raster, or a lattice. A raster is generated by the columns of a (non-singular) sampling matrix V. For example, in the 2D case, the sampling matrix

$$V = \begin{pmatrix} v_{1,1} & 0 \\ 0 & v_{2,2} \end{pmatrix}$$

yields a rectangular sampling raster. Similarly, the sampling matrix

$$V = \begin{pmatrix} v_{1,1} = v_1 & v_{1,2} = v_2 \\ v_{2,1} = v_1 & v_{2,2} = -v_2 \end{pmatrix}$$

yields a hexagonal raster, as shown in the upper part of Fig. 2, where only a finite k-support defined by the columns of the matrix P = diag(7, 6) is displayed. Thus, n = Vk, k ∈ {P ∩ Z²}, where P is the parallelepiped defined by the columns of the matrix P. Notice that for an integral sampling matrix V, the number of integer points in the parallelepiped of V — denoted N(V) — is equal to |det(V)|. The parallelepipeds for the rectangular and the hexagonal sampling matrix examples in the upper part of Figure 2 are shown shaded.

Fig. 2 Rectangular and hexagonal sampling (upper, with V = diag(2, 3) on the left and V = (2 2; 2 −2) on the right), and downsampling and upsampling (lower, with M = (1 1; 1 −1) and L = (1 1; 1 −1))

Sampling lattices can be arbitrary, which makes MDSDF graphs more complicated than their unidimensional SDF counterparts. The black dots in Figure 2 are sample points n = Vk, k ∈ {P ∩ Z²}. The MD downsampler is characterized by an integral matrix M which takes sample points n to sample points m = n for all n = V_I M k, discarding all other points. Here, V_I is the sampling matrix at the input of the downsampler. The sampling matrix at the output is V_O = V_I M. The values of signal samples are carried over from input to output at the retained sample points. Similarly, the MD upsampler is characterized by an integral matrix L which takes sample points n to sample points m = n for all n = V_I k, with equal sample values, and creates additional sample points m = V_I L^{-1} k, giving the samples there the value 0. The output sampling matrix is V_O = V_I L^{-1}. The lower part of Fig. 2 shows how the upper left raster is downsampled (lower left), and the upper right raster is upsampled (lower right). Of course, the matrices M and L can be any non-singular integral matrices.
Consider again the upper part of Fig. 2. The raster points n = Vk, k ∈ {P ∩ Z²} have a support given by the integral points in a support tile defined by the columns of the support matrix Q = V^{-1}P:

$$Q = \begin{pmatrix} 3.5 & 0 \\ 0 & 2 \end{pmatrix}$$

for the sample raster at the left, and

$$Q = \begin{pmatrix} 1.75 & 1.5 \\ 1.75 & -1.5 \end{pmatrix}$$
for the sample raster at the right. If W_I is the support matrix at the input of an MD downsampler or upsampler, then the support matrix W_O at the output is W_O = M^{-1}W_I for the downsampler, and W_O = LW_I for the upsampler. In the example shown in Fig. 2, the output support matrix for the downsampling example is M^{-1}V^{-1}P, and for the upsampling example it is LV^{-1}P. They are

$$W_O = \begin{pmatrix} 1.75 & 1.0 \\ 1.75 & -1.0 \end{pmatrix}$$

for the downsampler (bottom left) in Fig. 2, and

$$W_O = \begin{pmatrix} 3.5 & 0 \\ 0 & 3 \end{pmatrix}$$

for the upsampler (bottom right) in Fig. 2.
The characterization of input-output sampling matrices and input-output support matrices is not enough to set up balance equations, because it still remains to be determined how the points in the support tiles are ordered. For example, consider a source actor S with sampling matrix V_O = I (the identity matrix) and support matrix W_O = diag(3, 3) (producing three rows and three columns, thus having a sampling rate (3, 3)), connected to an upsampler with
input sampling matrix V_I = I. The upsampler can scan its input in row scan order. Its output support matrix is W_O = LW_I. Thus the upsampler has to output |det(L)| samples in the parallelepiped of L, defined by the columns of the matrix L, for every input sample. For example, let

$$L = \begin{pmatrix} 2 & 2 \\ 3 & -2 \end{pmatrix}$$

The first column of L can be interpreted as the upsampler's horizontal direction, and the second column as its vertical direction in the generalized rectangle of L. There are L_1 = 5 groups of L_2 = 2 samples (that is, L_1 × L_2 = |det(L)| = 10 samples) in the generalized rectangle. The sampling rate at the output of the upsampler can be chosen to be (L_2, L_1), that is, L_2 rows and L_1 columns in natural order. The relation between the naturally ordered samples and the samples in the generalized rectangle of L is then

$$\begin{pmatrix} (0,0) & (1,0) & (2,0) & (3,0) & (4,0) \\ (0,1) & (1,1) & (2,1) & (3,1) & (4,1) \end{pmatrix} \leftrightarrow \begin{pmatrix} (0,0) & (0,1) & (0,2) & (1,2) & (1,3) \\ (-1,1) & (-1,2) & (-1,3) & (0,3) & (0,4) \end{pmatrix}$$

This is shown in Fig. 3.
Fig. 3 Ordering of the upsampler's output tokens: the parallelepiped of L (left) and the natural ordering of its samples (right)
904
Shuvra S. Bhattacharyya, Ed F. Deprettere, and Joachim Keinert
1.3 A complete example Consider the MDSDF graph shown in Figure 4.
S
Fig. 4 A combined upsampling and downsampling example
U
SU
L=
UD
2
−2
3
2
M=
D
DT
1
1
2
−2
T
It consists of a chain of four actors S, U, D, and T. S and T are source and sink actors, respectively. U is an expander with

$$L = \begin{pmatrix} 2 & -2 \\ 3 & 2 \end{pmatrix}$$

and D is a compressor with

$$M = \begin{pmatrix} 1 & 1 \\ 2 & -2 \end{pmatrix}$$

The arcs are labeled SU, UD, and DT, respectively. Let V_SU be the identity matrix and W_SU = diag(3, 3). The expander consumes the samples from the source in row scan order. The output samples of the upsampler are ordered in the way discussed above, one at a time (that is, as (1,1)), releasing |det(L)| samples at the output on each invocation. The decimator can consume the upsampler's output samples in a rectangle of samples defined by a factorization M_1 × M_2 of |det(M)|. This rectangle is deduced from the ordering given in Figure 3. Thus, for a 2 × 2 factorization, the (0,0) invocation of the downsampler consumes the (original) samples (0,0), (−1,1), (0,1) and (−1,2) or, equivalently, the naturally ordered samples (0,0), (0,1), (1,0), and (1,1). Of course, other factorizations of |det(M)| are possible, and the question is whether a factorization can be found for which the total number of samples output by the compressor equals the total number of samples consumed divided by |det(M)| in a complete cycle, as determined by the repetitions matrix, which in turn depends on the factorizations of |det(L)| and |det(M)|. Thus, the question is whether $N(W_{DT}) = N(W_{UD}) / |\det(M)|$. It can be shown [18] that this is not the case for the 2 × 2 factorization of |det(M)|. The condition is satisfied for a 1 × 4 factorization, though. Thus, denoting by r_{X,1} and r_{X,2} the repetitions of a node X in the horizontal and vertical directions, respectively, the balance equations in this case become
    3 · r_S,1 = 1 · r_U,1
    3 · r_S,2 = 1 · r_U,2
    5 · r_U,1 = 1 · r_D,1
    2 · r_U,2 = 4 · r_D,2
    r_D,1 = r_T,1
    r_D,2 = r_T,2

where it is assumed that the decimator produces (1,1) tokens at each firing, and the sink actor consumes (1,1) tokens at each firing. The resulting elements in the repetitions matrix are

    r_S,1 = 1     r_S,2 = 2
    r_U,1 = 3     r_U,2 = 6
    r_D,1 = 15    r_D,2 = 3
    r_T,1 = 15    r_T,2 = 3

Now, for a complete cycle we have W_SU = diag(3, 3) × diag(r_S,1, r_S,2), W_UD = L·W_SU, and W_DT = M^{-1}·W_UD. Because W_UD is an integral matrix, N(W_UD) = |det(W_UD)| = 180. There is no simple way to determine N(W_DT) because this matrix is not an integral matrix. Nevertheless, N(W_DT) can be found to be 45, which is, indeed, a quarter of N(W_UD) = 180. The reader may consult [18] for further details.
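As a quick consistency check on these numbers, the following Java sketch (ours, for illustration only) recomputes N(W_UD) = |det(L · W_SU)| and confirms that dividing by |det(M)| yields 45:

    public class BalanceCheck {
        static double det(double[][] m) {
            return m[0][0] * m[1][1] - m[0][1] * m[1][0];
        }

        static double[][] mul(double[][] a, double[][] b) {
            return new double[][] {
                { a[0][0] * b[0][0] + a[0][1] * b[1][0], a[0][0] * b[0][1] + a[0][1] * b[1][1] },
                { a[1][0] * b[0][0] + a[1][1] * b[1][0], a[1][0] * b[0][1] + a[1][1] * b[1][1] }
            };
        }

        public static void main(String[] args) {
            double[][] L = { { 2, -2 }, { 3, 2 } };   // expander
            double[][] M = { { 1, 1 }, { 2, -2 } };   // compressor
            double[][] WSU = { { 3, 0 }, { 0, 6 } };  // diag(3,3) x diag(r_S,1, r_S,2)
            double nWUD = Math.abs(det(mul(L, WSU)));
            System.out.println(nWUD);                        // 180.0
            System.out.println(nWUD / Math.abs(det(M)));     // 45.0
        }
    }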
1.4 Initial tokens on arcs

In the example in the previous subsection, the arcs were void of initial tokens. In unidimensional SDF, initial tokens on an arc are a prefix to the stream of tokens output by the arc's source actor. Thus, if an arc's buffer contains two initial tokens, then the first token output by the arc's source node is actually the third token in the sequence of tokens to be consumed by the arc's destination actor. In multidimensional SDF this "shift" is a translation along the vectors of a support matrix W (in the generalized rectangle space) or along the vectors of a sampling matrix V. Thus, for a source node with output sampling matrix

    V = [ 2 0 ]
        [ 1 1 ]

a multidimensional delay of (1,2) on its outgoing arc means that the first token released on this arc is actually located at (2,2) in the sampling space. Similarly, for a support matrix

    W = [  1 0 ]
        [ -2 4 ]
a delay of (1,2) means that the first token output is located at (1,2) in the generalized rectangle’s natural order. See [18] for further details.
2 Windowed synchronous/cyclo-static dataflow

As discussed in Section 1, specifying applications that process multidimensional arrays in terms of one-dimensional models of computation leads to cumbersome representations. Modeling multidimensional streaming applications as CSDF graphs, for instance, almost always leads to CSDF actors having many phases in their vector-valued token production and consumption cycles, which incur high control overhead (see Chapter 30). As was shown in Section 1, even the simplest of all dataflow graph models is expressive enough to deal with many multidimensional streaming applications if only tokens of dimension higher than one are introduced in the model. There is, however, a large class of image and video signal processing applications, among which are wavelet transforms, noise filtering, edge detection, and medical imaging, for which sliding window algorithms have been developed that can achieve higher execution throughput as compared with non-windowed versions. In windowed algorithms, the computation of an output token typically requires a set of input tokens that are grouped in a window of tokens, and windows overlap when successive output tokens are computed. These windows are multidimensional when the tokens in a stream are multidimensional. Moreover, in most sliding-window image and video algorithms, the borders of the images need special treatment without explicitly introducing awkward cyclo-stationary behavior. In particular, pixel planes that are produced by a source actor have to be extended with borders to ensure that pixel plane sizes at the input and the output of pixel-plane-transforming actors are consistent.
Fig. 5 Example sliding window algorithm showing a source actor (src), a vertical downsampler, and a sink (snk). Gray shaded pixels are extended border pixels.
Fig. 5 shows a simple example. A source actor produces a 4 × 6 pixel plane that is to be downsampled in the row dimension using a sliding window. The gray shaded pixels are border pixels that are not part of the pixel plane produced by the source, but are added to ensure a correct size of the output pixel plane. The window spans three rows in a single column. The downsampler produces one pixel per window, and the window slides two pixels down in the row dimension. The windows do not overlap when moving from one column to the next. Because two consecutive windows in the row dimension have two tokens in common, reading of tokens in a window is not necessarily destructive. The occurrence of overlapping windows and virtual border processing prevents this kind of processing from being represented as an MDSDF graph of the type introduced in Section 1. To overcome this limitation, the MDSDF model has been extended to include SDF/CSDF representations of windowed multidimensional streaming applications [12]. The extended model is called the windowed synchronous dataflow (WSDF) graph model. In the remainder of this section it is assumed that sampling matrices are diagonal matrices, and that pixel planes are scanned in raster scan order.
2.1 Windowed synchronous/cyclo-static data flow graphs

A WSDF graph consists of nodes — or actors — and edges, as do all dataflow graphs. However, a WSDF edge is annotated with more parameters than is the case for SDF, CSDF, or MDSDF, because of its extended expressiveness. Fig. 6 depicts a simple WSDF graph consisting of a source node A, a sink node B, and an edge (A, B) from source node to sink node.
Fig. 6 A simple WSDF graph with two actors and one edge
The source node produces p = (1, 2) tokens (that is, one row and two columns of a two-dimensional array) per invocation, which is similar to what a source node in an MDSDF graph would do.² The token consumption rule for the sink node in WSDF, however, is slightly more involved than in its MDSDF counterpart because windowing and border processing have to be taken into account. The sink node does not consume c = (3,5) tokens, but it has access to tokens in a 3 × 5 window of tokens.
² Recall that only one of the dimensions can be the streaming dimension. This could be the row dimension for the current example.
The movement of that window is governed by the edge parameter Δc, which is (1,1) in the example in Fig. 6, and would have been (2, 1) in the example in Fig. 5. The first window is located in the upper left part of the extended array, and further window positions are determined by the window movement parameter Δc. WSDF edges can carry initial tokens in exactly the same way as MDSDF edges do. In Fig. 6, there are D = (d1, d2) initial tokens, or equivalently, there is a delay D. The parameters b_u = (b_u1, b_u2) = (1, 1) and b_l = (b_l1, b_l2) = (1, 2), above and below the edge's border processing rectangle, respectively, in Fig. 6 indicate that there is a border extension of b_u1 rows at the upper side and b_l1 rows at the lower side of a (rectangular) group of array elements, as well as b_u2 columns at the left side and b_l2 columns at the right side. Negative border values indicate that array entries are suppressed. The group of array elements that is to be border-extended defines a virtual token v. The virtual token in Fig. 6 is v = (2, 3). In contrast with the tokens p = (1, 2) that are effectively produced by the source node A, the virtual token v = (2, 3) is not produced as such, but only specifies the array of elements that is to be border-extended. To distinguish between p and v, the token p is called an effective token, and the token v is called a virtual token. Fig. 7 and Fig. 8 illustrate the virtual token and bordering concepts.
Fig. 7 Different variants of border extension, and the resulting sliding window positions: single virtual token.
Fig. 8 corresponds to the example in Fig. 6, where the virtual token is v = (2, 3). The source node produces effective tokens p = (1, 2) that together form a 2 × 6 array. The virtual token v = (2, 3) specifies that this array is to be partitioned into two 2 × 3 subarrays (or units), each one being border-extended independently according to b_u = (1, 1) and b_l = (1, 2). The sliding of the c = (3, 5) window, as determined by the parameter Δc, applies to each unit independently. Thus, in Fig. 8, within a single unit, the 3 × 5 window moves by one column to the right, and after returning to the left border, one row down, and again one column to the right. The same holds for the second unit. Fig. 7 shows the sliding window operation for the case that the virtual token is v = (2, 6). In this case, all array elements produced by actor A are combined into a single unit that is border-extended. Consequently, 2 × 5 different sliding windows are necessary in order to cover the complete array.
Fig. 8 Different variants of border extension, and the resulting sliding window positions: array split into two units, separated by a vertical dash.
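The window counts stated for Fig. 7 and Fig. 8 can be reproduced mechanically. The following Java sketch (our illustration; WSDF tools expose no such API) enumerates the sliding-window origins over one border-extended unit, using the Fig. 7 parameters:

    public class WindowPositions {
        // Enumerate sliding-window origins over one border-extended unit.
        static int countWindows(int vRows, int vCols, int buR, int buC,
                                int blR, int blC, int cR, int cC, int dcR, int dcC) {
            int extRows = vRows + buR + blR;   // border-extended unit height
            int extCols = vCols + buC + blC;   // border-extended unit width
            int count = 0;
            for (int r = 0; r + cR <= extRows; r += dcR) {
                for (int c = 0; c + cC <= extCols; c += dcC) {
                    System.out.println("window origin at (" + r + ", " + c + ")");
                    count++;
                }
            }
            return count;
        }

        public static void main(String[] args) {
            // Fig. 7 parameters: v = (2,6), b_u = (1,1), b_l = (1,2), c = (3,5), step (1,1).
            System.out.println(countWindows(2, 6, 1, 1, 1, 2, 3, 5, 1, 1)); // 10 = 2 x 5
        }
    }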
2.1.1 Determination of extended border values

Extended border array elements are not produced by the source actor. They are added and given values by the border processing part of the underlying algorithm. Typically, border element values are derived from element values in the source-produced effective token array and from other border elements. Border processing does not affect token production and consumption in the WSDF model. For a producer/consumer pair, the producer produces effective tokens, and the consumer has access to a window of array elements which may contain extended border elements. However, when border element values have to be computed from source-produced elements and other border elements, the elements to be computed as well as the elements used must be contained in the window.
2.2 Balance equations

As is the case with SDF, CSDF, and MDSDF, WSDF has sample-rate-consistent periodic behavior and executes in bounded memory. Thus, given a WSDF graph and its initial state, i.e., the initial tokens on its arcs and the initial positions of the sliding windows, the graph can transform token sequences indefinitely in a periodic fashion such that, in each period, the actors are invoked a constant number of times, and at the end of each period the graph returns to its initial state, with the sliding windows back in their initial positions. In other words, a WSDF graph obeys balance equations in the same way as SDF, CSDF, and MDSDF graphs do. Because of the application of a sliding window in WSDF, successive invocations of an actor within a virtual token consume different numbers of array elements. As a consequence, WSDF balance equations resemble CSDF balance equations (see Chapter 30), with one equation per dimension. The first step in the proof is to establish the minimum number of times actors have to be invoked in each dimension to ensure that complete virtual tokens are consumed. For a single edge e, this number of invocations w_e can be obtained by
calculating how many sliding windows fit into the corresponding virtual token surrounded by extended borders:

    w_e = (v_e + b_e^u + b_e^l - c_e) / Δc_e + 1        (1)
In this equation, v_e, b_e^u, b_e^l, c_e, and Δc_e are the edge's virtual token size, upper and lower boundary extensions, window size, and window step size, respectively, in terms of numbers of rows for the row dimension, and of columns for the column dimension. Thus there is an equation w_e,x for every dimension x. For the example in Fig. 6, this leads to

    w_e,1 = (2 + 1 + 1 - 3)/1 + 1 = 2
    w_e,2 = (3 + 2 + 1 - 5)/1 + 1 = 2

Clearly, the sink actor B in Fig. 6 has to be invoked twice in both the row and column dimensions to consume a complete (border-extended) virtual token. See also Fig. 8. In the case that an actor A has several input edges, the minimum number of invocations N_A of actor A equals the least common multiple of the minimum numbers of invocations w_ek among this set of input edges. If there are no input edges, N_A is set to 1. Now the balance equations ensure, for each edge e, that tokens produced by the source node have been consumed by the sink node, and that both the source node and the sink node have exchanged complete virtual tokens to return to the initial state. The structure of such a balance equation for an edge e is as follows:

    N_src(e) · p_e · q_src(e) = (N_snk(e) · v_e · q_snk(e)) / w_e        (2)
where p_e and v_e are the effective and virtual token parameters of edge e for a single dimension. q_x defines the number of times the minimum period of actor x is to be executed. This number follows from the basic balance equation

    Γ · q = 0        (3)
The topology matrix Γ has one column for each actor, and one row for each edge. There is one basic balance equation for each dimension. The entries of the matrix are N_A × p_A when actor A is a source actor, -(N_A / w_A) × v_A when actor A is a sink actor, and 0 otherwise. For a connected graph, the matrix has to be of rank |V| - 1, where |V| is the cardinality of the set of actors V. Finally, the repetitions vector r satisfies the equation

    r = N · q        (4)
where the matrix N is a diagonal matrix with entries N_Ak, k = 1 . . . |V|. The set of vectors q, one for each dimension, gives a matrix Q, which in two dimensions would be

    Q = [ q_A1,1 . . . q_An,1 ]
        [ q_A1,2 . . . q_An,2 ]
Similarly, the set of repetitions vectors r gives a repetitions matrix R, which in two dimensions would be

    R = [ r_A1,1 . . . r_An,1 ]        (5)
        [ r_A1,2 . . . r_An,2 ]

2.2.1 A complete example

Fig. 9 shows a WSDF graph specification of a medical image enhancement multi-resolution filter application. Actor S1 produces a 480 × 640 image that is downsampled by actor C1. The downsampled image is sent to actor F2, as well as to an upsampler E1 that returns an image of size 480 × 640. Actor F1 operates on a sliding window c = (3, 3). Actor A1 combines the results of actors F1 and F2. Actor E2 ensures that both inputs to actor A1 have compatible sizes.
Fig. 9 WSDF graph representation of a multi-resolution filter
Application of equation (1), and calculation of the least common multiple over all input edges of each actor, yields the minimum number of actor invocations needed to consume complete virtual tokens (images in this case). They are listed in the following table.

    Actor     S1   F1   A1   C1   E1   F2   E2
    Row        1  480  480  240  240  240  240
    Column     1  640  640  320  320  320  320
Application of equation (2) gives the balance equations

    Row:
    e1: 1 · 1 · q_S1,1   = (480/480) · 480 · q_F1,1
    e2: 1 · 1 · q_S1,1   = (240/240) · 480 · q_C1,1
    e3: 240 · 1 · q_C1,1 = (240/240) · 240 · q_E1,1
    e4: 240 · 2 · q_E1,1 = (480/480) · 480 · q_F1,1
    e5: 240 · 1 · q_C1,1 = (240/240) · 240 · q_F2,1
    e6: 240 · 1 · q_F2,1 = (240/240) · 240 · q_E2,1
    e7: 240 · 2 · q_E2,1 = (480/480) · 480 · q_A1,1
    e8: 480 · 1 · q_F1,1 = (480/480) · 480 · q_A1,1

    Column:
    e1: 1 · 1 · q_S1,2   = (640/640) · 640 · q_F1,2
    e2: 1 · 1 · q_S1,2   = (320/320) · 640 · q_C1,2
    e3: 320 · 1 · q_C1,2 = (320/320) · 320 · q_E1,2
    e4: 320 · 2 · q_E1,2 = (640/640) · 640 · q_F1,2
    e5: 320 · 1 · q_C1,2 = (320/320) · 320 · q_F2,2
    e6: 320 · 1 · q_F2,2 = (320/320) · 320 · q_E2,2
    e7: 320 · 2 · q_E2,2 = (640/640) · 640 · q_A1,2
    e8: 640 · 1 · q_F1,2 = (640/640) · 640 · q_A1,2
The corresponding matrices have rank |V| - 1 = 7 - 1 = 6. Consequently, a minimal integer solution exists, as shown in the following table.

    Actor     S1   F1   A1   C1   E1   F2   E2
    Row      480    1    1    1    1    1    1
    Column   640    1    1    1    1    1    1
Finally, equation (5) leads to the repetitions given in the following table.

    Actor     S1   F1   A1   C1   E1   F2   E2
    Row      480  480  480  240  240  240  240
    Column   640  640  640  320  320  320  320

This confirms that the WSDF specification is correct in that it does not cause unbounded accumulation of tokens. In addition, it demonstrates that the WSDF graph works as expected in that actors E1, F2, and E2 execute less often than the other actors, since they read smaller images at their inputs. Furthermore, actor C1 is also less active, since it performs a downsampling operation: its sliding window moves by two pixels while each invocation generates only one pixel.
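The minimum invocation counts in the first table follow directly from equation (1). The following Java helper (ours, for illustration; the border parameters for edge e2 are read off Fig. 9 and should be treated as assumptions) reproduces the row entries for actors F1 and C1:

    public class WindowFirings {
        // w_e per equation (1): window positions per virtual token, one dimension.
        static int we(int v, int bu, int bl, int c, int dc) {
            return (v + bu + bl - c) / dc + 1;
        }

        public static void main(String[] args) {
            // Edge e1 (S1 -> F1), row dimension: v = 480, bu = bl = 1, c = 3, step 1.
            System.out.println(we(480, 1, 1, 3, 1)); // 480
            // Edge e2 (S1 -> C1), row dimension: v = 480, assumed bu = 1, bl = 0, c = 3, step 2.
            System.out.println(we(480, 1, 0, 3, 2)); // 240
        }
    }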
3 Motivation for dynamic DSP-oriented dataflow models

The decidable dataflow models covered in Chapter 30 are useful for their predictability, strong formal properties, and amenability to powerful optimization techniques. However, for many signal processing applications, it is not possible to represent all of the functionality in terms of purely decidable dataflow representations. For example, functionality that involves conditional execution of dataflow subsystems, or actors with dynamically varying production and consumption rates, generally cannot be expressed in decidable dataflow models.
The need for expressive power beyond that provided by decidable dataflow techniques is becoming increasingly important in the design and implementation of signal processing systems. This is due to the increasing levels of application dynamics that
must be supported in such systems, such as the need to support multi-standard and other forms of multi-mode signal processing operation; variable data rate processing; and complex forms of adaptive signal processing behaviors.
Intuitively, dynamic dataflow models can be viewed as dataflow modeling techniques in which the production and consumption rates of actors can vary in ways that are not entirely predictable at compile time. It is possible to define dynamic dataflow modeling formats that are decidable. For example, by restricting the types of dynamic dataflow actors, and by restricting the usage of such actors to a small set of graph patterns or "schemas", Gao, Govindarajan, and Panangaden defined the class of well-behaved dataflow graphs, which provides a dynamic dataflow modeling environment that is amenable to compile-time bounded memory verification [25]. However, most existing DSP-oriented dynamic dataflow modeling techniques do not provide decidable dataflow modeling capabilities. In other words, in exchange for the increased modeling flexibility (expressive power) provided by such techniques, one must typically give up guarantees on compile-time buffer underflow (deadlock) and overflow validation. In dynamic dataflow environments, analysis techniques may succeed in guaranteeing avoidance of buffer underflow and overflow for a significant subset of specifications, but, in general, specifications may arise that "break" these analysis techniques — i.e., that result in inconclusive results from compile-time analysis.
Dynamic dataflow techniques can be divided into two general classes: 1) those that are formulated explicitly in terms of interacting combinations of state machines and dataflow graphs, where the dataflow dynamics are represented directly in terms of transitions within one or more underlying state machines; and 2) those where the dataflow dynamics are represented using alternative means. The separation in this dichotomy can become somewhat blurry for models that have a well-defined state structure governing the dataflow dynamics, but whose design interface does not expose this structure directly to the programmer. Chapter 36 includes coverage of dynamic dataflow techniques in the first category described above — in particular, those based on explicit interactions between dataflow graphs and finite state machines. In this chapter, we focus on the second category. Specifically, we discuss dynamic dataflow modeling techniques that involve different kinds of modeling abstractions, apart from state transitions, as the key mechanisms for capturing dataflow behaviors and their potential for run-time variation.
Numerous dynamic dataflow modeling techniques have evolved over the past couple of decades. A comprehensive coverage of these techniques, even after excluding the "state-centric" ones, is beyond the scope of this chapter. Instead, we provide a representative cross-section of relevant dynamic dataflow techniques, with emphasis on techniques for which useful forms of compile-time analysis have been developed. Such techniques can be important for exploiting the specialized properties exposed by these models, and improving predictability and efficiency when deriving simulations or implementations.
4 Boolean dataflow

The Boolean dataflow (BDF) model of computation extends synchronous dataflow with a class of dynamic dataflow actors in which production and consumption rates on actor ports can vary as two-valued functions of control tokens, which are consumed from or produced onto designated control ports of dynamic dataflow actors. An actor input port is referred to as a conditional input port if its consumption rate can vary in such a way, and similarly, an output port with a dynamically varying production rate under this model is referred to as a conditional output port.
Given a conditional input port p of a BDF actor A, there is a corresponding input port C_p, called the control input for p, such that the consumption rate on C_p is statically fixed at one token per invocation of A, and the number of tokens consumed from p during a given invocation of A is a two-valued function of the data value that is consumed from C_p during the same invocation. The dynamic dataflow behavior for a conditional output port is characterized in a similar way, except that the number of tokens produced on such a port can be a two-valued function of a token that is consumed from a control input port or of a token that is produced onto a control output port. If a conditional output port q is controlled by a control output port C_q, then the production rate on the control output is statically fixed at one token per actor invocation, and the number of tokens produced on q during a given invocation of the enclosing actor is a two-valued function of the data value that is produced onto C_q during the same invocation of the actor.
Two fundamental dynamic dataflow actors in BDF are the switch and select actors, which are illustrated in Figure 10. The switch actor has two input ports, a control input port w_c and a data input port w_d, and two output ports w_x and w_y. The port w_c accepts Boolean-valued tokens, and the consumption rate on w_d is statically fixed at one token per actor invocation. On a given invocation of a switch actor, the data value consumed from w_d is copied to a token that is produced on either w_x or w_y depending on the Boolean value consumed from w_c. If this Boolean value is true, then the value from the data input is routed to w_x, and no token is produced on w_y. Conversely, if the control token value is false, then the value from w_d is routed to w_y with no token produced on w_x.
A BDF select actor has a single control input port s_c; two additional input ports (data input ports) s_x and s_y; and a single output port s_o. Similar to the control port of the switch actor, the s_c port accepts Boolean-valued tokens, and the production rate on the s_o port is statically fixed at one token per invocation. On each invocation of the select actor, data is copied from a single token from either s_x or s_y to s_o depending on whether the corresponding control token value is true or false, respectively.
Switch and select actors can be integrated along with other actors in various ways to express different kinds of control constructs. For example, Figure 11 illustrates an if-then-else construct, where the actors A and B are applied conditionally based on a stream of control tokens. Here A and B are SDF actors that each consume one token and produce one token on each invocation.
Buck has developed scheduling techniques to automatically derive efficient control structures from BDF graphs under certain conditions [24]. Buck has also shown
Fig. 10 Switch and select actors in Boolean dataflow.
Fig. 11 An if-then-else construct expressed in terms of Boolean dataflow.
that BDF is Turing complete, and furthermore, that SDF augmented with just switch and select (and no other dynamic dataflow actors) is also Turing complete. This latter result provides a convenient framework with which one can demonstrate Turing completeness for other kinds of dynamic dataflow models, such as the enable-invoke dataflow model described in Section 8. In particular, if a given model of computation can express all SDF actors as well as the functionality associated with the BDF switch and select actors, then such a model can be shown to be Turing complete.
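Under suitable conditions, the control structure derived from the if-then-else graph of Figure 11 amounts to a branch on each control token. The following Java fragment is a minimal sketch of that idea (our illustration; the actor interface is a hypothetical placeholder, not an API from [24]):

    // Hypothetical single-rate actor interface for the sketch below.
    interface UnaryActor { double fire(double in); }

    final class IfThenElseSchedule {
        // One iteration of a quasi-static schedule for Fig. 11: the switch/select
        // pair collapses into a branch, so exactly one of A or B fires per iteration.
        static double iterate(boolean control, double data, UnaryActor a, UnaryActor b) {
            return control ? a.fire(data) : b.fire(data);
        }
    }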
5 The Stream-based Function Model

One of the dynamic dataflow models is the Stream-based Function model. In fact, it is a model for the actors in dataflow graphs that otherwise obey, in principle, the communication semantics of the dataflow model of computation: actors communicate point-to-point over unbounded FIFO channels, and an actor blocks until sufficient tokens are available on the channel that is currently being addressed.
5.1 The concern for an actor model

It is convenient to call the active entities in dataflow graphs and networks actors, whether they are void of state, encompass a single thread of control, or are processes. Thus, static dataflow graphs (Chapter 30), dynamic dataflow graphs, polyhedral process networks (Chapter 33), and Kahn process networks (Chapter 34) can collectively be referred to as Dataflow Actor Networks (DFAN). They obey the Kahn actor coordination semantics, possibly augmented with actor firing rules, annotated with deadlock-free minimal FIFO buffer capacities, and having one or more global schedules associated with them.
A DFAN is very often an appealing specification model for signal processing applications. In many cases, the signal processing application's DFAN specification is to be implemented on a single-chip multiprocessor execution platform that consists of a number of computational elements (processors), and a communication, synchronization, and storage infrastructure. This requires a mapping that relates an untimed application model and a timed architecture model: actors are assigned to processors, FIFO channels are logically embedded in (often distributed) memory, and DFAN communication primitives and protocols are transformed into platform-specific communication primitives and protocols. The set of platform computational components may not be homogeneous, nor may all of them support multiple threads of control. Thus, the specifications of platform processors and DFAN actors may be quite different. In fact, the actors in a DFAN are poorly modeled, as they are, by convention, specified in a standard sequential host programming language. And platform processors may not be modeled at all, because they may be dedicated and dependent on the specification of the actors that are to be assigned to them. For example, when an actor is a process in which several threads of control can be recognized, it would be difficult, if at all possible, to relate it transparently and time-consciously to a computational component that can only support a single thread of control, that is, one with no embedded operating system. To overcome this problem, it makes sense to provide an actor model that can serve as an intermediate between the conventional DFAN actor
specification, and a variety of computational elements in the execution platform. The Stream-based Function model [16] is aimed at exactly that purpose.
5.2 The stream-based function actor model

The generic actor in a DFAN is a process that comprises one or more code segments consisting of a sequence of control statements, read and write statements, and function-call statements that are not structured in any particular way. The stream-based function actor model aims at providing a process structure that makes the internal behavior of actors transparent and time-conscious. It is composed of a set of functions, called the function repertoire; a transition and selection function, called the controller; and a combined function and data state, called the private memory. The controller selects a function from the function repertoire that is associated with the current function state, and makes a transition to the next function state. A selected function is enabled when all its input arguments can be read and all its results can be written; it is blocked otherwise. A selected non-blocked function must evaluate, or fire. Arguments and results are called input tokens and output tokens, respectively. An actor operates on sequences of tokens (streams or signals) as a result of a repetitive enabling and firing of functions from the function repertoire. Tokens are read from external ports and/or loaded from private memory, and are written to external ports and/or stored in private memory. The external ports connect to FIFO channels through which actors communicate point-to-point.
Figure 12 depicts an illustrative stream-based function (SBF) actor. Its function repertoire is P = { f_init, f_a, f_b }. The two read ports and the single write port connect to two input and one output FIFO channels, respectively. P contains at least an initialization function f_init, and is a finite set of functions. The functions in P must be selected in a mutually exclusive order. That order depends on the controller's selection and transition functions. The private memory consists of a function state part and a data state part that do not intersect.
Fig. 12 A simple SBF actor.
5.3 The formal model

If C denotes the SBF actor's function state space, and if D denotes its data state space, then the actor's state space S is the Cartesian product of C and D,

    S = C × D,   C ∩ D = ∅.        (6)
Let c ∈ C be the current function state. From c, the controller selects a function and makes a function state transition by activating its selection function μ and transition function ω as follows:

    μ : C → P,       μ(c) = f,         (7)
    ω : C × D → C,   ω(c, d) = c′.     (8)
The evaluation of the combined functions μ and ω is instantaneous. The controller cannot change the content of C; it can only observe it. The transition function ω allows for dynamic behavior, as it involves the data state space D. When the transition function is only a map from C to C, then the trajectory of selected functions will be static or cyclo-static (see Chapter 30 and Chapter 33). Assuming that the DFAN does not deadlock, and that its input streams are not finite, the controller will repeatedly invoke functions from the function repertoire P in an endless firing of functions,

    f_init --μ(ω(S))--> f_a --μ(ω(S))--> f_b --μ(ω(S))--> . . . f_x --μ(ω(S))--> . . .        (9)
Such a behavior is called a Fire-and-Exit behavior. It is quite different from a threaded approach in that all synchronization points are explicit. The role of the function f_init in Eq. 9 is crucial. This function has to provide the starting current function state c_init ∈ C. It evaluates first and only once, and it may read tokens from one or more external ports to return the starting current function state:

    c_init = f_init(channel_a, . . . , channel_z)        (10)
The SBF model is reminiscent of the Applicative State Transition (AST) node in the systolic array model of computation (Chapter 29) introduced by Annevelink [26], after the Applicative State Transition programming model proposed by Backus [27]. The AST node, like the SBF actor, comprises a function repertoire, a selection function μ, and a transition function ω. However, the AST node does not have a private memory, and the initial function state c_init is read from a unique external channel. Moreover, AST nodes communicate through register channels. The SBF actor model is also reminiscent of the CAL actor language, which is discussed in Chapter 36 and also in Section 6 of this chapter.
Clearly, decidable and polyhedral dataflow graphs and networks are special cases of SBF actor networks. Analysis is still possible in case the SBF network models an underlying weakly dynamic nested loop program [17]. A sequential nested loop program is said to be weakly dynamic when the loop bounds and the variable indexing functions are affine functions of the loop iterators and static parameters. Conditional statements are without any restriction. Notice that in case conditional statements are also affine in the loop iterators and static parameters, the nested loop program becomes a static affine nested loop program (Chapter 33). Analysis is also possible when, in addition, the parameters are not static [8].
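The controller mechanics of Eqs. (7)-(9) can be sketched compactly in code. The following Java skeleton is our illustration of the formal model, not code from [16]; port binding, blocking, and the firing of the selected function on channel data are abstracted into a single data-state update:

    import java.util.function.BiFunction;
    import java.util.function.Function;

    // Minimal SBF controller: mu selects the next function to fire from the
    // repertoire P; omega advances the function state using the data state.
    final class SbfController<C, D> {
        private final Function<C, Function<D, D>> mu;  // selection function: C -> P
        private final BiFunction<C, D, C> omega;       // transition function: C x D -> C
        private D dataState;                           // private memory (data state part)

        SbfController(Function<C, Function<D, D>> mu, BiFunction<C, D, C> omega, D d0) {
            this.mu = mu;
            this.omega = omega;
            this.dataState = d0;
        }

        // One firing: select f = mu(c), fire it on the data state, return the next state.
        C step(C c) {
            Function<D, D> f = mu.apply(c);
            dataState = f.apply(dataState);
            return omega.apply(c, dataState);
        }
    }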
5.4 Communication and scheduling

Actors in dataflow actor networks communicate point-to-point over FIFO-buffered channels. Conceptually, buffer capacities are unlimited, and actors synchronize by means of a blocking read protocol: an actor that attempts to read tokens from a specific channel will block whenever that channel is empty. Writing tokens will always proceed. When a function [c, d] = f(a, b) is invoked, it has to bind to input ports p_x and p_y when ports p_x and p_y are to deliver the arguments a and b, in this order. Thus, if the n-th channel to which port p_n has access is empty, so that the function's n-th argument cannot be delivered, then the function f(a_1, . . . , a_n, . . . , a_z) will block. This is the default DFAN communication synchronization protocol. Of course, channel FIFO capacities are not unlimited in practice, so mapped DFANs have a blocking write protocol as well: an actor that attempts to write tokens to a specific channel will block whenever that channel is full. Thus, a function [c, d] = f(a, b) has to bind to output ports q_x and q_y when ports q_x and q_y are to receive the results c and d, in this order.
The binding of a function to input and output ports can be implicit or explicit. In the case of implicit binding, the function is said to know its corresponding input and output ports. Because a particular function may read arguments from non-unique input ports and write results to non-unique output ports, the function repertoire will then contain function variants. Two functions in the function repertoire are function variants when they are functionally identical, and their arguments and results are read from and written to different channels. In the case of explicit binding, the controller's selection function selects both a function from the function repertoire and the corresponding input and output ports. Explicit binding makes it possible to separate function selection and binding selection, so that reading, executing, and writing can proceed in a pipelined fashion.
Both the implicit and explicit binding methods assume blocking read and write actor synchronization semantics. However, the SBF model allows for a more general deterministic dynamic approach. In this approach, the actor behavior is divided into a channel checking part and a scheduling part. In the channel checking part, channels are visited without empty- or full-channel blocking. A visited input channel C_in returns a C_in.1 signal when the channel is not empty, and a C_in.0 signal when the channel is empty; similarly for output channels. These signals indicate whether or not a particular function from the function repertoire can fire. In the scheduling part, a function can only be invoked when the channel checking signals allow it to
920
Shuvra S. Bhattacharyya, Ed F. Deprettere, and Joachim Keinert
fire. If not, then the function will not be invoked. As a consequence, the actor is blocked. Clearly, channel checking and scheduling can proceed in parallel.
5.5 Composition of SBF actors

It should be clear that an SBF actor has a single thread of control. However, when mapping a DFAN onto an execution platform, one may assign two or more actors to a single SBF processing unit. That processing unit may have an SBF-like model, so that a many-to-one mapping must conform to the single-thread-of-control model. This implies that one or more SBF actors in a DFAN have to be merged together into a single SBF actor. An example is shown in Fig. 13. Part of a DFAN is shown at the left in the figure. The functionality of the actors A, B, and C is also shown. The functions f1() and f2() of actor A are invoked in sequence, and so are the functions g1() and g2() of actor B, and the functions h1() and h2() of actor C. The three actors are merged into a single actor D as shown at the right of the figure. When modeling actor D as an SBF actor, the function repertoire will contain the functions f1(), f2(), g1(), g2(), h1(), and h2(). A possible single-thread schedule is also shown at the right in the figure, where {PM} stands for private memory. However, because actors in the original DFAN are concurrent threads, and because in the SBF actor channel checking and function scheduling proceed in parallel, the actual schedule of functions in the function repertoire of the SBF actor D may be different from the one shown. Indeed, if, for example, function h1() cannot fire, then function g1() may be fireable and get invoked before function h1(). Although this is a sort of multi-threading, it is safe and deterministic, because there is no preemption and no context switch.
    A: [4, 6] = f_1(1, 2)
       [4, 5] = f_2(1, 3)
    B: [8, 9] = g_1(4)
       [8]    = g_2(4, 5)
    C: [10]   = h_1(6)
       [10]   = h_2(6, 7)

    D: [4, 6]_{PM} = f_1(1, 2)
       [10]        = h_1(6)_{PM}
       [8, 9]      = g_1(4)_{PM}
       [4, 5]_{PM} = f_2(1, 3)
       [8]         = g_2(4, 5)_{PM}
       [4, 6]_{PM} = f_1(1, 2)
       [10]        = h_2(6_{PM}, 7)

Fig. 13 Merging SBF actors.
6 CAL

In addition to providing a dynamic dataflow model of computation that is suitable for signal processing system design, CAL provides a complete programming language and is supported by a growing family of development tools for hardware and software implementation. The name "CAL" is derived as a self-referential acronym for the CAL actor language. CAL was developed by Eker and Janneck at U.C. Berkeley [15], and has since evolved into an actively developed, widely investigated language for the design and implementation of embedded software and field programmable gate array applications (e.g., see [10] [3] [19]).
One of the most notable developments to date in the evolution of CAL has been its adoption as part of the recent MPEG standard for reconfigurable video coding (RVC) [1]. Specifically, to enable higher levels of flexibility and adaptivity in the evolution of RVC standards, CAL is used as a format for transmitting the appropriate decoding functionality along with the raw image data that comprises a video stream. In other words, CAL programs are embedded along with the desired video content to form an RVC bit stream. At the decoder, the dataflow graph for the decoder is extracted from the incoming bit stream, and references to CAL actors are mapped to platform-specific library components that execute specific decoding functions. These capabilities for encoding video decoding algorithms as coarse-grain dataflow graphs, and encoding specific decoding functions as references to pre-defined library elements, allow for a highly compact and flexible representation of video decoding systems.
A CAL program is specified as a network of CAL actors, where each actor is a dataflow component that is expressed in terms of a general underlying form of dataflow. This general form of dataflow admits both static and dynamic behaviors, and even non-deterministic behaviors. Like typical actors in any dataflow programming environment, a CAL actor in general has a set of input ports and a set of output ports that define interfaces to the enclosing dataflow graph. A CAL actor also encapsulates its own private state, which can be modified by the actor as it executes but cannot be modified directly by other actors.
The functional specification of a CAL actor is decomposed into a set of actions, where each action can be viewed as a template for a specific class of firings or invocations of the actor. Each firing of an actor corresponds to a specific action and executes based on the code that is associated with that action. The core functionality of actors is therefore embedded within the code of the actions. Actions can in general consume tokens from actor input ports, produce tokens on output ports, modify the actor state, and perform computation in terms of the actor state and the data obtained from any consumed tokens.
The number of tokens produced and consumed by each action with respect to each actor output and input port, respectively, is declared up front as part of the declaration of the action. An action need not consume data from all input ports, nor must it produce data on all output ports, but the ports with which the action exchanges data, and the associated rates of production and consumption, must be constant for
the action. Across different actions, however, there is no restriction of uniformity in production and consumption rates, and this flexibility enables the modeling of dynamic dataflow in CAL.
A CAL actor A can be represented as a sequence of four elements

    (σ_0(A), Σ(A), 𝒜(A), pri(A)),        (11)

where Σ(A) represents the set of all possible values that the state of A can take on; σ_0(A) ∈ Σ(A) represents the initial state of the actor, before any actor in the enclosing dataflow graph has started execution; 𝒜(A) represents the set of actions of A; and pri(A) is a partial order relation, called the priority relation of A, on 𝒜(A) that specifies relative priorities between actions.
Actions execute based on associated guard conditions as well as the priority relation of the enclosing actor. More specifically, each action has an associated guard condition, which can be viewed as a Boolean expression in terms of the values of actor input tokens and actor state. An action can execute whenever its associated guard condition is satisfied (true-valued), and no higher-priority action (based on the priority relation pri(A)) has a guard condition that is also satisfied.
In summary, CAL is a language for describing dataflow actors in terms of ports, actions (firing templates), guards, priorities, and state. This finer, intra-actor granularity of formal modeling within CAL allows for novel forms of automated analysis for extracting restricted forms of dataflow structure. Such restricted forms of structure can be exploited with specialized techniques for verification or synthesis to derive more predictable or efficient implementations.
An example of this capability for specialized region detection in CAL programs is the technique of deriving and exploiting so-called statically schedulable regions (SSRs) [3]. Intuitively, an SSR is a collection of CAL actions and ports that can be scheduled and optimized statically using the full power of static dataflow techniques, such as those available for SDF, and integrated into the schedule for the overall CAL program through a top-level dynamic scheduling interface. SSRs can be derived through a series of transformations that are applied on intermediate graph representations. These representations capture detailed relationships among actor ports and actions, and provide a framework for effective quasi-static scheduling of CAL-based dynamic dataflow representations. Here, by quasi-static scheduling, we mean the construction of dataflow graph schedules in which a significant proportion of the overall schedule structure is fixed at compile time. Quasi-static scheduling has the potential to significantly improve predictability, reduce run-time scheduling overhead, and, as discussed above, expose subsystems whose internal schedules can be generated using purely static dataflow scheduling techniques.
Further discussion about CAL can be found in Chapter 3, which discusses the application of CAL to reconfigurable video coding.
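The action selection discipline just described can be paraphrased in a few lines of Java. This is a minimal illustrative sketch, not the CAL language or any CAL tool API; for simplicity, the priority partial order is flattened into a list ordered from highest to lowest priority:

    import java.util.List;

    // An action carries a guard over the actor state (and peeked input tokens)
    // and a firing that consumes/produces tokens and returns the updated state.
    interface Action<S> {
        boolean guard(S state);
        S fire(S state);
    }

    final class GuardedActor<S> {
        private final List<Action<S>> actionsByPriority; // highest priority first
        private S state;

        GuardedActor(List<Action<S>> actionsByPriority, S initialState) {
            this.actionsByPriority = actionsByPriority;
            this.state = initialState;
        }

        // Fire the highest-priority action whose guard holds; returns true if fired.
        boolean step() {
            for (Action<S> a : actionsByPriority) {
                if (a.guard(state)) {
                    state = a.fire(state);
                    return true;
                }
            }
            return false;
        }
    }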
7 Parameterized dataflow

Parameterized dataflow is a meta-modeling approach for integrating dynamic parameters and run-time adaptation of parameters in a structured way into a certain class of dataflow models of computation, in particular those models that have a well-defined concept of a graph iteration [20]. For example, SDF and CSDF, which are discussed in Chapter 30, and MDSDF, which is discussed in Section 1, have well-defined concepts of iterations based on solutions to the associated forms of balance equations. Each of these models can be integrated with parameterized dataflow to provide a dynamically parameterizable form of the original model.
When parameterized dataflow is applied in this way to generalize a specialized dataflow model such as SDF, CSDF, or MDSDF, the specialized model is referred to as the base model, and the resulting, dynamically parameterizable form of the base model is referred to as parameterized XYZ, where XYZ is the name of the base model. For example, when parameterized dataflow is applied to SDF as the base model, the resulting model of computation, called parameterized synchronous dataflow (PSDF), is significantly more flexible than SDF, as it allows arbitrary parameters of SDF graphs to be modified at run-time. Furthermore, PSDF provides a useful framework for quasi-static scheduling, where fixed-iteration looped schedules, such as single appearance schedules [23], for SDF graphs can be replaced by parameterized looped schedules [20] [11], in which loop iteration counts are represented as symbolic expressions in terms of variables whose values can be adapted dynamically through computations that are derived from the enclosing PSDF specification (a sketch of such a schedule is given at the end of this section).
Intuitively, parameterized dataflow allows arbitrary attributes of a dataflow graph to be parameterized, with each parameter characterized by an associated domain of admissible values that the parameter can take on at any given time. Graph attributes that can be parameterized include scalar or vector attributes of individual actors, such as the coefficients of a finite impulse response filter or the block size associated with an FFT; edge attributes, such as the delay of an edge or the data type associated with tokens that are transferred across the edge; and graph attributes, such as those related to numeric precision, which may be passed down to selected subsets of actors and edges within the given graph.
The parameterized dataflow representation of a computation involves three cooperating dataflow graphs, which are referred to as the body graph, the subinit graph, and the init graph. The body graph typically represents the functional "core" of the overall computation, while the subinit and init graphs are dedicated to managing the parameters of the body graph. In particular, each output port of the subinit graph is associated with a body graph parameter such that data values produced at the output port are propagated as new parameter values of the associated parameter. Similarly, output ports of the init graph are associated with parameter values in the subinit and body graphs.
Changes to body graph parameters, which occur based on new parameter values computed by the init and subinit graphs, cannot occur at arbitrary points in time. Instead, once the body graph begins execution it continues uninterrupted through
a graph iteration, where the specific notion of an iteration in this context can be specified by the user in an application-specific way. For example, in PSDF, the most natural, general definition for a body graph iteration would be a single SDF iteration of the body graph, as defined by the SDF repetitions vector. Chapter 30 describes in detail this concept of an SDF iteration and the associated concept of the SDF repetitions vector. However, an iteration of the body graph can also be defined as some constant number of iterations, for example, the number of iterations required to process a fixed-size block of input data samples. Furthermore, parameters that define the body graph iteration can be used to parameterize the body graph or the enclosing PSDF specification at higher levels of the model hierarchy, and in this way, the processing that is defined by a graph iteration can itself be dynamically adapted as the application executes. For example, the duration (or block length) for fixed-parameter processing may be based on the size of a related sequence of contiguous network packets, where the sequence size determines the extent of the associated graph iteration. Body graph iterations can even be defined to correspond to individual actor invocations. This can be achieved by defining an individual actor as the body graph of a parameterized dataflow specification, or by simply defining the notion of iteration for an arbitrary body graph to correspond to the next actor firing in the graph execution. Thus, when modeling applications with parameterized dataflow, designers have significant flexibility to control the windows of execution that define the boundaries at which graph parameters can be changed. A combination of cooperating body, init, and subinit graphs is referred to as a PSDF specification. PSDF specifications can be abstracted as PSDF actors in higher level PSDF graphs, and in this way, PSDF specifications can be integrated hierarchically. Figure 14 illustrates a PSDF specification for a speech compression system. This illustration is adapted from [20]. Here setSp (“set speech”) is an actor that reads a header packet from a stream of speech data, and configures L, which is a parameter that represents the length of the next speech instance to process. The s1 and s2 actors are input interfaces that inject successive samples of the current speech instance into the dataflow graph. The actor s2 zero-pads each speech instance to a length R (R ≥ L) so that the resulting length is divisible by N, which is the speech segment size. The An (“analyze”) actor performs linear prediction on speech segments, and produces corresponding auto-regressive (AR) coefficients (in blocks of M samples), and residual error signals (in blocks of N samples) on its output edges. The actors q1 and q2 represent quantizers, and complete the modeling of the transmitter component of the body graph. Receiver side functionality is then modeled in the body graph starting with the actors d1 and d2, which represent dequantizers. The actor Sn (“synthesize”) then reconstructs speech instances using corresponding blocks of AR coefficients and error signals. The actor Pl (“play”) represents an output interface for playing or storing the resulting speech instances.
Fig. 14 An illustration of a speech compression system that is modeled using PSDF semantics. This illustration is adapted from [20].
The model order (number of AR coefficients) M, speech segment size N, and zero-padded speech segment length R are determined on a per-segment basis by the selector actor in the subinit graph. Existing techniques, such as the Burg segment size selection algorithm and the AIC order selection criterion [22], can be used for this purpose.
The model of Figure 14 can be optimized to eliminate the zero-padding overhead (modeled by the parameter R). This optimization can be performed by converting the design to a parameterized cyclo-static dataflow (PCSDF) representation. In PCSDF, the parameterized dataflow meta-model is integrated with CSDF as the base model instead of SDF. For further details on this speech compression application and its representations in PSDF and PCSDF, the semantics of parameterized dataflow and PSDF, and quasi-static scheduling techniques for PSDF, we refer the reader to [20].
Parameterized cyclo-static dataflow (PCSDF), the integration of parameterized dataflow meta-modeling with cyclo-static dataflow, is explored further in [13]. The use of different models of computation, including PSDF and PCSDF, for modeling software defined radio systems is explored in [6]. In [2], Kee et al. explore the application of PSDF techniques to field programmable gate array implementation of the physical layer for 3GPP Long Term Evolution (LTE). The integration of concepts related to parameterized dataflow into language extensions for embedded streaming systems is explored in [7]. General techniques for analysis and verification of hierarchically reconfigurable dataflow graphs are explored in [14].
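To make the notion of a quasi-static, parameterized looped schedule concrete, the following Java sketch mimics one body-graph iteration of the Compress example. The firing sequence and the R/N repetition count are our illustrative assumptions, not the schedule generated by the techniques of [20]:

    // Loop counts are symbolic in the graph parameters R and N, which the
    // init/subinit graphs may change between body-graph iterations.
    final class CompressBodySchedule {
        int L, R, N, M; // parameters set by Compress.init / Compress.subinit

        void iterate() {
            fire("setSp");                       // configures L for this speech instance
            int segments = R / N;                // assumed repetition count per instance
            for (int i = 0; i < segments; i++) {
                fire("An"); fire("q1"); fire("q2");
                fire("d1"); fire("d2"); fire("Sn");
            }
            fire("Pl");
        }

        void fire(String actor) { /* placeholder for an actor firing */ }
    }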
8 Enable-invoke dataflow

Enable-invoke dataflow (EIDF) is another DSP-oriented dynamic dataflow modeling technique [9]. The utility of EIDF has been demonstrated in the context of behavioral simulation, FPGA implementation, and prototyping of different scheduling strategies [9] [4] [5]. This latter capability — prototyping of scheduling strategies — is particularly important in analyzing and optimizing embedded software. The importance and complexity of carefully analyzing scheduling strategies are high even for the restricted SDF model, where scheduling decisions have a major impact on most key implementation metrics [21]. The incorporation of dynamic dataflow features makes the scheduling problem even more critical, since application behaviors are less predictable and more difficult to understand through analytical methods.
EIDF is based on a formalism in which actors execute through dynamic transitions among modes, where each mode has "synchronous" (constant production/consumption rate) behavior, but different modes can have differing dataflow rates. Unlike other forms of mode-oriented dataflow specification, such as stream-based functions (see Section 5), SDF-integrated starcharts (see Chapter 36), SysteMoc (see Chapter 36), and CAL (see Section 6), EIDF imposes a strict separation between fireability checking (checking whether or not the next mode has sufficient data to execute) and mode execution (carrying out the execution of a given mode). This allows for lightweight fireability checking, since the checking is completely separate from mode execution. Furthermore, the approach improves the predictability of mode executions, since there is no waiting for data (blocking reads): the time required to access input data is not affected by scheduling decisions or global dataflow graph state.
For a given EIDF actor, the specification for each mode of the actor includes the number of tokens that is consumed on each input port throughout the mode, the number of tokens that is produced on each output port, and the computation (the invoke function) that is to be performed when the actor is invoked in the given mode. The specified computation must produce the designated number of tokens on each output port, and it must also produce a value for the next mode of the actor, which determines the number of input tokens required for, and the computation to be performed during, the next actor invocation. The next mode can in general depend on the current mode as well as the input data that is consumed as the mode executes.
At any given time between mode executions (actor invocations), an enclosing scheduler can query the actor using the enable function of the actor. The enable function can only examine the number of tokens on each input port (without consuming any data), and based on these "token populations", the function returns a Boolean value indicating whether or not the next mode has enough data to execute to completion without waiting for data on any port.
The set of possible next modes for a given actor at a given point in time can in general be empty or contain one or multiple elements. If the next mode set is empty (the next mode is null), then the actor cannot be invoked again before it is somehow reset or re-initialized from the environment that controls the enclosing dataflow graph. A null next mode is therefore equivalent to a transition to a mode that
requires an infinite number of tokens on an input port. The provision for multi-element sets of next modes allows for natural representation of non-determinism in EIDF specifications. When the set of next modes for a given actor mode is restricted to have at most one element, the resulting model of computation, called core functional dataflow (CFDF), is a deterministic, Turing-complete model [9]. CFDF semantics underlie the functional DIF environment for behavioral simulation of signal processing applications. Functional DIF integrates CFDF-based dataflow graph specification using the dataflow interchange format (DIF), a textual language for representing DSP-oriented dataflow graphs, with Java-based specification of intra-actor functionality, including specification of enable functions, invoke functions, and next-mode computations [9]. Figure 15 and Figure 16 illustrate, respectively, the design of a CFDF actor and its implementation in functional DIF. This actor provides functionality that is equivalent to the Boolean dataflow switch actor described in Section 4.
Mode          | Control Input | Data Input | True Output | False Output
Control Mode  | 1             | 0          | 0           | 0
True Mode     | 0             | 1          | 1           | 0
False Mode    | 0             | 1          | 0           | 1
Fig. 15 An illustration of the design of a switch actor in CFDF.
public CFSwitch() {
    _control = addMode("control");
    _control_true = addMode("control_true");
    _control_false = addMode("control_false");

    _data_in = addInput("data_in");
    _control_in = addInput("control_in");
    _true_out = addOutput("true_out");
    _false_out = addOutput("false_out");

    _control.setConsumption(_control_in, 1);
    _control_true.setConsumption(_data_in, 1);
    _control_true.setProduction(_true_out, 1);
    _control_false.setConsumption(_data_in, 1);
    _control_false.setProduction(_false_out, 1);
}

public boolean enable(CoreFunctionMode mode) {
    if (_control == mode) {
        if (peek(_control_in) > 0) { return true; }
        return false;
    } else if (_control_true == mode) {
        if (peek(_data_in) > 0) { return true; }
        return false;
    } else if (_control_false == mode) {
        if (peek(_data_in) > 0) { return true; }
        return false;
    }
    return false;
}

public CoreFunctionMode invoke(CoreFunctionMode mode) {
    if (_init == mode) {
        return _control;
    }
    if (_control == mode) {
        if ((Boolean) pullToken(_control_in)) {
            return _control_true;
        } else {
            return _control_false;
        }
    }
    if (_control_true == mode) {
        Object obj = pullToken(_data_in);
        pushToken(_true_out, obj);
        return _control;
    }
    if (_control_false == mode) {
        Object obj = pullToken(_data_in);
        pushToken(_false_out, obj);
        return _control;
    }
    return null;
}
Fig. 16 An implementation of the switch actor design of Figure 15 in the functional DIF environment.
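The strict separation between enable and invoke makes the enclosing scheduler very simple. The following C sketch (all types and names here are hypothetical; functional DIF itself is Java-based) shows one way an enclosing scheduler might drive a set of EIDF-style actors, firing any actor whose current mode is enabled until no further progress is possible.

#include <stdbool.h>
#include <stddef.h>

/* Hypothetical C rendering of the EIDF enable/invoke contract. */
typedef struct Actor Actor;
struct Actor {
    int mode;                    /* current mode; -1 encodes the null next mode */
    bool (*enable)(Actor *);     /* may only inspect token populations          */
    int (*invoke)(Actor *);      /* executes the current mode, returns the next */
};

/* Fire enabled actors until no actor can make progress. */
void run_schedule(Actor **actors, size_t n) {
    bool progress = true;
    while (progress) {
        progress = false;
        for (size_t i = 0; i < n; ++i) {
            Actor *a = actors[i];
            /* enable() has no side effects, so a negative answer is cheap. */
            if (a->mode >= 0 && a->enable(a)) {
                a->mode = a->invoke(a);
                progress = true;
            }
        }
    }
}

Because enable never blocks, a scheduler of this kind can freely reorder its fireability checks, which is what makes EIDF convenient for prototyping different scheduling strategies.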
9 Summary

In this chapter, we have reviewed several DSP-oriented dataflow models of computation for representing multidimensional signal processing and dynamic dataflow behavior. As signal processing systems are developed and deployed for more complex applications, exploration of such generalized dataflow modeling techniques is of increasing importance. This chapter has complemented the discussion of Chapter 30, which focuses on the relatively mature class of decidable dataflow modeling techniques, and has built on the dynamic dataflow principles introduced in certain specific forms in Chapter 34 and Chapter 36.
References

1. S. S. Bhattacharyya, J. Eker, J. W. Janneck, C. Lucarz, M. Mattavelli, and M. Raulet. Overview of the MPEG reconfigurable video coding framework. Journal of Signal Processing Systems, 2010. DOI:10.1007/s11265-009-0399-3.
2. H. Kee, I. Wong, Y. Rao, and S. S. Bhattacharyya. FPGA-based design and implementation of the 3GPP-LTE physical layer using parameterized synchronous dataflow techniques. In Proceedings of the International Conference on Acoustics, Speech, and Signal Processing, pages 1510–1513, Dallas, Texas, March 2010.
3. R. Gu, J. Janneck, M. Raulet, and S. S. Bhattacharyya. Exploiting statically schedulable regions in dataflow programs. In Proceedings of the International Conference on Acoustics, Speech, and Signal Processing, pages 565–568, Taipei, Taiwan, April 2009.
4. W. Plishker, N. Sane, and S. S. Bhattacharyya. A generalized scheduling approach for dynamic dataflow applications. In Proceedings of the Design, Automation and Test in Europe Conference and Exhibition, pages 111–116, Nice, France, April 2009.
5. W. Plishker, N. Sane, and S. S. Bhattacharyya. Mode grouping for more effective generalized scheduling of dynamic dataflow applications. In Proceedings of the Design Automation Conference, pages 923–926, San Francisco, July 2009.
6. H. Berg, C. Brunelli, and U. Lucking. Analyzing models of computation for software defined radio applications. In Proceedings of the International Symposium on System-on-Chip, 2008.
7. Y. Lin, Y. Choi, S. Mahlke, T. Mudge, and C. Chakrabarti. A parameterized dataflow language extension for embedded streaming systems. In Proceedings of the International Symposium on Systems, Architectures, Modeling and Simulation, pages 10–17, July 2008.
8. H. Nikolov and E. F. Deprettere. Parameterized stream-based functions dataflow model of computation. In Proceedings of the Workshop on Optimizations for DSP and Embedded Systems, 2008.
9. W. Plishker, N. Sane, M. Kiemb, K. Anand, and S. S. Bhattacharyya. Functional DIF for rapid prototyping. In Proceedings of the International Symposium on Rapid System Prototyping, pages 17–23, Monterey, California, June 2008.
10. G. Roquier, M. Wipliez, M. Raulet, J. W. Janneck, I. D. Miller, and D. B. Parlour. Automatic software synthesis of dataflow program: An MPEG-4 simple profile decoder case study. In Proceedings of the IEEE Workshop on Signal Processing Systems, October 2008.
11. M. Ko, C. Zissulescu, S. Puthenpurayil, S. S. Bhattacharyya, B. Kienhuis, and E. Deprettere. Parameterized looped schedules for compact representation of execution sequences in DSP hardware and software implementation. IEEE Transactions on Signal Processing, 55(6):3126–3138, June 2007.
12. J. Keinert, C. Haubelt, and J. Teich. Modeling and analysis of windowed synchronous algorithms. In Proceedings of the International Conference on Acoustics, Speech, and Signal Processing, May 2006.
13. S. Saha, S. Puthenpurayil, and S. S. Bhattacharyya. Dataflow transformations in high-level DSP system design. In Proceedings of the International Symposium on System-on-Chip, pages 131–136, Tampere, Finland, November 2006. Invited paper.
14. S. Neuendorffer and E. Lee. Hierarchical reconfiguration of dataflow models. In Proceedings of the International Conference on Formal Methods and Models for Codesign, June 2004.
15. J. Eker and J. W. Janneck. CAL language report, language version 1.0 — document edition 1. Technical Report UCB/ERL M03/48, Electronics Research Laboratory, University of California at Berkeley, December 2003.
16. B. Kienhuis and E. F. Deprettere. Modeling stream-based applications using the SBF model of computation. Journal of Signal Processing Systems, 34(3):291–299, 2003.
17. T. Stefanov and E. Deprettere. Deriving process networks from weakly dynamic applications in system-level design. In Proceedings of the International Conference on Hardware/Software Codesign and System Synthesis, pages 90–96, October 2003.
18. P. K. Murthy and E. A. Lee. Multidimensional synchronous dataflow. IEEE Transactions on Signal Processing, 50(8):2064–2079, August 2002.
19. E. D. Willink, J. Eker, and J. W. Janneck. Programming specifications in CAL. In Proceedings of the OOPSLA Workshop on Generative Techniques in the Context of Model Driven Architecture, 2002.
20. B. Bhattacharya and S. S. Bhattacharyya. Parameterized dataflow modeling for DSP systems. IEEE Transactions on Signal Processing, 49(10):2408–2421, October 2001.
21. S. S. Bhattacharyya, R. Leupers, and P. Marwedel. Software synthesis and code generation for DSP. IEEE Transactions on Circuits and Systems — II: Analog and Digital Signal Processing, 47(9):849–875, September 2000.
22. S. Haykin. Adaptive Filter Theory. Prentice Hall, 1996.
23. S. S. Bhattacharyya, J. T. Buck, S. Ha, and E. A. Lee. Generating compact code from dataflow specifications of multirate signal processing algorithms. IEEE Transactions on Circuits and Systems — I: Fundamental Theory and Applications, 42(3):138–150, March 1995.
24. J. T. Buck. Scheduling Dynamic Dataflow Graphs with Bounded Memory using the Token Flow Model. Ph.D. thesis, Department of Electrical Engineering and Computer Sciences, University of California at Berkeley, September 1993.
25. G. R. Gao, R. Govindarajan, and P. Panangaden. Well-behaved programs for DSP computation. In Proceedings of the International Conference on Acoustics, Speech, and Signal Processing, March 1992.
26. J. Annevelink. HIFI: A Design Method for Implementing Signal Processing Algorithms on VLSI Processor Arrays. Ph.D. thesis, Delft University of Technology, Department of Electrical Engineering, Delft, The Netherlands, 1988.
27. J. Backus. Can programming be liberated from the von Neumann style? A functional style and its algebra of programs. Communications of the ACM, 21(8):613–641, 1978.
Polyhedral Process Networks

Sven Verdoolaege
Katholieke Universiteit Leuven, Department of Computer Science, Celestijnenlaan 200A, B-3001 Leuven, Belgium, e-mail: [email protected]
Abstract Reference implementations of signal processing applications are often written in a sequential language that does not reveal the available parallelism in the application. However, if an application satisfies certain constraints, then a parallel specification can be derived automatically. In particular, if the application can be represented in the polyhedral model, then a polyhedral process network can be constructed from the application. After introducing the required polyhedral tools, this chapter details the construction of the processes and the communication channels in such a network. Special attention is given to various properties of the communication channels, including their buffer sizes.
1 Introduction

Signal processing applications are prime candidates for a parallel implementation. As we have seen in previous chapters, there are several parallel models of computation that can be used for specifying such applications. However, many programmers are unfamiliar with these parallel models of computation. Writing parallel specifications can therefore be a difficult, time-consuming and error-prone process. For this reason, many application developers still prefer to specify an application as a sequential program, even though such a specification may not be suitable for a direct mapping onto a parallel multiprocessor platform. In this chapter, we present a technique for automatically extracting a parallel specification from a sequential program, provided the sequential program satisfies some conditions. In particular, the control flow of the program needs to be static, meaning that the control flow should not depend on the signals being processed. Furthermore, all loop bounds, conditions and array index expressions need to be
such that they can be expressed using affine constraints. All functions called should be pure. In particular, they should not change the values of loop iterators or arrays other than their output arguments. These requirements ensure that we can represent all relevant information about the program using mathematical objects called polyhedra. The resulting parallel model, a variation on Kahn process networks [8] (see Chapter 34), can then also be described using such polyhedra, whence the name polyhedral process networks. The concept of a polyhedral process network was developed in the context of the Compaan project [11], [16]. The exposition in this chapter closely follows that of [19].
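For contrast, the following hypothetical C fragment is not amenable to this treatment: the branch depends on the signal values themselves, so the control flow is not static.

/* Hypothetical example of data-dependent control flow: the condition
   inspects the processed signal x, so this loop violates the static
   control flow requirement stated above. */
void threshold_filter(const float *x, float *y, int N, float threshold) {
    for (int i = 0; i < N; i++)
        if (x[i] > threshold)      /* depends on the data, not on iterators */
            y[i] = 2.0f * x[i];
}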
2 Overview

This section presents a high-level overview of the process of extracting a process network from a sequential program. The extracted process network represents the task-level parallelism that is available in the program. The input program is assumed to consist of a sequence of nested loops performing various “operations”. These operations may be calls to functions that can be arbitrarily complicated. The operations are performed on data that has been computed in some iteration of another operation or in a previous iteration of the same operation. The output process network consists of a set of processes, each encapsulating all iterations of a given operation, and communication channels connecting the processes and representing the dataflow. The processes in the network can be executed independently of each other, as long as data is available from the channels from which the process reads and as long as buffer space is available in the channels to which the process writes. That is, the communication primitives implement blocking reads and blocking writes.
1   for (i = 0; i < K; i++)
2     for (j = 0; j < N; j++)
3       a[i][j] = ReadImage();
4   for (i = 1; i < K-1; i++)
5     for (j = 1; j < N-1; j++) {
6       Sbl[i][j] = Sobel(a[i-1][j-1], a[i][j-1], a[i+1][j-1],
7                         a[i-1][ j ], a[i][ j ], a[i+1][ j ],
8                         a[i-1][j+1], a[i][j+1], a[i+1][j+1]);
9       WriteImage(Sbl[i][j]);
10    }
Fig. 1 Source code of a Sobel edge detection program
Example 1. As a simple example, consider the code for performing Sobel edge detection in Figure 1. The first loop of this program reads the input image, while the second loop performs the actual edge detection and writes out the output image.
[Figure 2: the network contains the processes ReadImage, Sobel, and WriteImage. Nine channels, annotated a: 1, a: 2, a: 3, a: N-1, a: N, a: N+1, a: 2*N-3, a: 2*N-2, and a: 2*N-1, run from ReadImage to Sobel; one channel, annotated Sbl: 1, runs from Sobel to WriteImage.]
Fig. 2 A process network corresponding to the sequential program in Figure 1
A process network that corresponds to this program is shown in Figure 2. There are three processes in the network, each corresponding to one of the three “operations” performed by the program, i.e., reading, edge detection and writing. Data flows from the reading process to the edge detection process and from the edge detection process to the writing process, resulting in the communication channels shown in the figure. The annotations on the edges will be explained later.

The extraction of a polyhedral process network consists of several steps summarized below and explained in more detail in the following sections:
• In a first step, a model is extracted from the input program on which all further analysis will be performed. In particular, the program is represented by a polyhedral model. This model allows for an efficient analysis, but imposes some restrictions on the input programs. The polyhedral model is explained in Section 3, while some basic components for the analysis of polyhedral models are explained in Section 4.
• Dataflow analysis is performed to determine which processes communicate with which other processes and how, i.e., to determine the communication channels. For example, the results of the call to ReadImage in Line 3 of Figure 1 are stored in the a array, which is read by the call to Sobel in Line 6. Dataflow analysis therefore results in one or more communication channels from the reading process to the edge detection process. Dataflow analysis is explained in Section 5.
• In the next step, the type of each communication channel is determined. For example, the channel may be a FIFO, in which case the processes connected to the channel simply need to write to and read from the channel, or it may not be a FIFO, in which case additional processing will be required. The classification of channels is discussed in Section 6.
• The communication channels may need to buffer some data to ensure a deadlock-free execution of the network. Especially for a hardware implementation of process networks, it is important to know how large these buffers may need to grow. The computation of buffer sizes is the subject of Section 8.
• The number of processes in the network may exceed the number of processing elements available. Some processes may therefore need to be merged. This
merging requires the construction of a combined schedule, which is the subject of Section 7. Depending on the kind of dataflow analysis that was performed in constructing the network, some of this analysis may need to be updated or redone based on the merging decisions.
• Finally, code needs to be written out for each of the processes in the network. This code needs to execute all iterations of the single or multiple (in case of merging) operations and needs to read from and write to the appropriate communication channels at the appropriate times. The main difficulty in this step is writing out code to scan overlapping polyhedral domains, which is discussed in Section 4.5.

The buffer size computation itself (Section 8) consists of several substeps. First a global schedule is computed (Section 7), assigning a global time point to each iteration of each process. This global schedule is only used during the buffer size computation and not during the actual execution. For each channel and each time point, the number of tokens in the channel at that time point is then computed (Section 4.3). Finally, for each channel, an upper bound is computed for the maximal number of tokens in the channel throughout the whole execution (Section 4.4).
3 Polyhedral Concepts

The key to an efficient transformation from sequential code to a process network is the polyhedral model used to represent the sequential program and the resulting process network. This section defines both models and related concepts.
3.1 Polyhedral Sets and Relations

Each statement inside a loop nest is executed many times when the sequential program is run. Each of these executions can be represented by the values of the iterators in the loops enclosing the statement. This sequence of iterator values is called the iteration vector associated to a given execution of the statement. The set of all such iteration vectors is called the iteration domain of the statement. Assuming that each iterator is an integer that is incremented by one in each iteration of the loop, this iteration domain can be represented very succinctly by simply collecting the lower and upper bounds of each of the enclosing loops.

Example 2. The iteration domain associated to the ReadImage statement in Line 3 of Figure 1 is

D_1 = { (i, j) ∈ Z^2 | 0 ≤ i < K ∧ 0 ≤ j < N }.   (1)

The extraction of a process network requires several manipulations of iteration domains and related sets. To ensure that these manipulations can be performed efficiently or even performed at all, we need to impose some restrictions on how these
sets are represented. In particular, we require that the sets are described by integer affine inequalities and equalities over integer variables. An affine inequality is an inequality of the form a_1 x_1 + a_2 x_2 + … + a_d x_d + c ≥ 0, i.e., it expresses that some degree-1 polynomial over the variables is greater than or equal to zero. When dealing with several such inequalities, it is convenient to represent them using the matrix notation Ax + c ≥ 0. Sets of rational values described by affine inequalities have been the subject of extensive research and are called polyhedra.

Definition 1 (Rational Polyhedron). A rational polyhedron P is a subspace of Q^d bounded by a finite number of hyperplanes,

P = { x ∈ Q^d | Ax ≥ c },   (2)

with A ∈ Z^{m×d} and c ∈ Z^m.

The sequential code from which we want to extract a process network may contain parameters such as the number of rows K and the number of columns N in Figure 1. In such cases, we do not want to extract a distinct process network for each value of the parameters, but instead a single parametric process network that is valid for all values of the parameters. We therefore also need the concept of a parametric polyhedron.

Definition 2 (Parametric Rational Polyhedron). A parametric rational polyhedron P(s) is a family of rational polyhedra in Q^d parametrized by parameters s ∈ Q^n,

P : Q^n → 2^{Q^d} : s → P(s) = { x ∈ Q^d | Ax + Bs ≥ c },   (3)

with A ∈ Z^{m×d}, B ∈ Z^{m×n}, c ∈ Z^m and 2^{Q^d} the power set of Q^d, i.e., the set of all subsets of Q^d. The parameter domain D = pdom P of a parametric polyhedron P : Q^n → 2^{Q^d} is a (non-parametric) polyhedron D ⊆ Q^n containing all parameter values s for which P(s) is non-empty, pdom P = { s ∈ Q^n | P(s) ≠ ∅ }.

Bounded polyhedra are called polytopes. In the case of parametric polytopes, this means that each polyhedron in the family is a polytope, i.e., that P(s) is a polytope for each value of the parameters s.

Besides iterators and parameters, we may also need additional variables to accurately describe an iteration domain. These variables are not used to identify a given iteration, but rather to restrict the possible values of the iterators. This means that we do not care about the values of these variables, but only that some integer value exists for these variables that satisfies the constraints. These variables may therefore be existentially quantified. The need for such variables arises especially when loop bounds or guards contain modulos or integer divisions. Expressions that contain such constructs but that can still be expressed using affine constraints are called quasi-affine.
1  for (i = 0; i < N; i++)
2    if (i % 3 == 0)
3      a[i] = f1();
4    else
5      a[i] = f2();

Fig. 3 Source code of a loop nest with iteration domains requiring existentially quantified variables to represent
Example 3. Consider the program in Figure 3. The modulo constraint “i % 3 == 0” is not in itself an affine constraint, but it can be represented as an affine constraint i = 3ℓ by introducing an extra integer variable ℓ ∈ Z. The iteration domain of the statement in Line 3 can therefore be represented as

D_1 = { i ∈ Z | ∃ℓ ∈ Z : 0 ≤ i < N ∧ i = 3ℓ }.   (4)

Similarly, the iteration domain of the statement in Line 5 can be represented as D_2 = { i ∈ Z | ∃ℓ ∈ Z : 0 ≤ i < N ∧ 1 ≤ i − 3ℓ ≤ 2 }.

In general then, the “polyhedral sets” such as D_1 and D_2 that are used in the polyhedral representation of both the input program and the resulting process network are defined as follows.

Definition 3 (Polyhedral Set). A polyhedral set S is a finite union of basic sets S = ∪_i S_i, each of which can be represented using affine constraints

S_i : Q^n → 2^{Q^d} : s → S_i(s) = { x ∈ Z^d | ∃z ∈ Z^e : Ax + Bs + Dz ≥ c },

with A ∈ Z^{m×d}, B ∈ Z^{m×n}, D ∈ Z^{m×e} and c ∈ Z^m. The parameter domain of S is the (non-parametric) polyhedral set pdom S = { s ∈ Z^n | S(s) ≠ ∅ }.

Note that any polyhedral set can be represented in infinitely many ways. When talking about polyhedral sets, we will usually have a specific representation in mind. In a similar fashion, we can define “polyhedral relations” over pairs of sets.

Definition 4 (Polyhedral Relation). A polyhedral relation R is a finite union of basic relations R = ∪_i R_i of type Q^n → 2^{Q^{d_1+d_2}}, each of which can be represented using affine constraints

R_i = s → R_i(s) = { (x_1, x_2) ∈ Z^{d_1} × Z^{d_2} | ∃z ∈ Z^e : A_1 x_1 + A_2 x_2 + Bs + Dz ≥ c },

with A_i ∈ Z^{m×d_i}, B ∈ Z^{m×n}, D ∈ Z^{m×e} and c ∈ Z^m. The parameter domain of R is the (non-parametric) polyhedral set pdom R = { s ∈ Z^n | R(s) ≠ ∅ } = { s ∈ Z^n | ∃x_1 ∈ Z^{d_1}, x_2 ∈ Z^{d_2} : (x_1, x_2) ∈ R(s) }. The domain of R is the polyhedral set dom R = s → { x_1 ∈ Z^{d_1} | ∃x_2 ∈ Z^{d_2} : (x_1, x_2) ∈ R(s) }, while the range of R is the polyhedral set ran R = s → { x_2 ∈ Z^{d_2} | ∃x_1 ∈ Z^{d_1} : (x_1, x_2) ∈ R(s) }. The polyhedral relation R can also be interpreted as being of type Q^n → Q^{d_1} → 2^{Q^{d_2}}. Supplying two arguments then yields the image of an element t ∈ dom R(s), i.e., R(s, t) = { x_2 ∈ Z^{d_2} | (t, x_2) ∈ R(s) }.
Polyhedral sets and polyhedral relations have essentially the same type if we set d = d_1 + d_2. The difference is mainly a matter of interpretation. In polyhedral relations, we make a distinction between two sets of variables, whereas in polyhedral sets, there is no such distinction. Any polyhedral set can also be treated as a polyhedral relation with a zero-dimensional domain, i.e., by setting d_1 = 0, d_2 = d and A_2 = A. Any statement about polyhedral relations will therefore also hold for polyhedral sets. We will usually only treat the general case of polyhedral relations.
3.2 Lexicographic Order

For a proper analysis of a sequential program, we not only need to know for which iterator values a given statement is executed, but also in which order these instances are executed. A given instance is executed before another instance if they are executed in the same iteration of zero or more outermost loops and the second instance is executed in a later iteration of the next outermost loop. In other words, the execution order corresponds to the lexicographic order on iteration vectors.

Definition 5 (Lexicographic order). A vector a ∈ Z^n is said to be lexicographically smaller than b ∈ Z^n if for the first position i in which a and b differ, we have a_i < b_i, or, equivalently,

a ≺ b ≡ ⋁_{i=1}^{n} ( a_i < b_i ∧ ⋀_{j=1}^{i−1} a_j = b_j ).   (5)

Note that the lexicographic order can be represented as a polyhedral relation

L_n = { (a, b) ∈ Z^n × Z^n | a ≺ b }   (6)
    = ∪_{i=1}^{n} { (a, b) ∈ Z^n × Z^n | a_i < b_i ∧ ⋀_{j=1}^{i−1} a_j = b_j }.   (7)
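As a concrete illustration, the lexicographic order of Definition 5 can be computed for a pair of iteration vectors with a simple scan; a minimal C sketch:

#include <stdbool.h>

/* Returns true iff a ≺ b, i.e., at the first position where the two
   iteration vectors of length n differ, a has the smaller entry. */
bool lex_smaller(const int *a, const int *b, int n) {
    for (int i = 0; i < n; ++i)
        if (a[i] != b[i])
            return a[i] < b[i];
    return false;  /* equal vectors: neither precedes the other */
}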
3.3 Polyhedral Models

The polyhedral model of a sequential program mainly consists of the iteration domains of the statements (as explained in Section 3.1), and access relations. An access relation is a polyhedral relation R ⊆ Z^d × Z^a that maps the iteration vector of the corresponding statement to an array element. Here, d is the dimension of the iteration domain and a is the dimension of the array. A scalar can be considered as a zero-dimensional array. The access relation of an access to a scalar is therefore simply the Cartesian product of the iteration domain and the whole zero-dimensional space.

Example 4. The access relation of the first argument to the call to Sobel in Line 6 of Figure 1 is
{ (i, j) → (a, b) | 1 ≤ i < K − 1 ∧ 1 ≤ j < N − 1 ∧ a = i − 1 ∧ b = j − 1 }.   (8)
Finally, the model should contain information about the relative execution order of any pair of statements in order to determine whether an iteration of one statement precedes or follows an iteration of another statement. One way of representing this relative position is to keep for each pair of statements the number of enclosing loops that they have in common and the relative ordering of the two statements in the program text.

Example 5. In Figure 1, ReadImage precedes Sobel in the program text, which in turn precedes WriteImage. ReadImage shares no enclosing loops with either of the other two statements, while Sobel and WriteImage share two enclosing loops.

Another way of representing the relative order is to record for each statement the position of the statement itself and each of its enclosing loops in the program text. That is, for a statement with d_i enclosing loops, we keep a vector of d_i + 1 “positions”, e.g., line numbers, ordered from outermost to innermost. We call this vector the statement location.

Example 6. In Figure 1, ReadImage has statement location (1, 2, 3), Sobel has statement location (4, 5, 6) and WriteImage has statement location (4, 5, 9).

Note that the first representation can easily be derived from the second representation. Combining all this information, we have the following definition.

Definition 6 (Polyhedral Model). The polyhedral model of a sequential program consists of a list of statements, where each statement is in turn represented by
• an identifier,
• a dimension d_i,
• an iteration domain (a polyhedral set) D_i ⊆ Z^{d_i},
• a list of accesses and
• a statement location.
Finally, each array access is represented by an identifier, an access relation (a polyhedral relation) and a type (read or write).

The use of a polyhedral model imposes the following restrictions on the input program:
• static control flow,
• pure functions and
• loop bounds, conditions and index expressions that are quasi-affine expressions in the parameters and the iterators of enclosing loops.
The extraction of a polyhedral model from a sequential program is available in several modern industrial and research compilers, e.g., [13], [15], [1].
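To fix ideas, a polyhedral model along the lines of Definition 6 might be represented by data structures such as the following C sketch; all field and type names here are illustrative only, and real extractors use considerably richer representations.

/* Minimal sketch of the statement representation of Definition 6. */
typedef struct {
    int dim;            /* number of set variables                  */
    int n_constraints;
    long *coeffs;       /* flattened rows of A x + B s + D z >= c   */
} PolyhedralSet;

typedef enum { ACCESS_READ, ACCESS_WRITE } AccessType;

typedef struct {
    int id;                  /* access identifier                    */
    PolyhedralSet relation;  /* access relation (iteration -> array) */
    AccessType type;
} Access;

typedef struct {
    int id;                  /* statement identifier                 */
    int dim;                 /* d_i                                  */
    PolyhedralSet domain;    /* iteration domain D_i                 */
    Access *accesses;        /* list of accesses                     */
    int n_accesses;
    int *location;           /* statement location, length dim + 1   */
} Statement;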
3.4 Piecewise Quasi-Polynomials

As explained in Section 2, each channel in the process network has a buffer of a bounded size. If the network is parametric, then this bound will in general not simply be a number, but rather an expression in the parameters. Some of the intermediate steps in computing these bounds may also result in expressions involving both the parameters and the iterators, or just the iterators if the network is non-parametric. In both cases the expressions will be piecewise quasi-polynomials. Quasi-polynomials are polynomial expressions that may involve integer divisions of affine expressions in the variables. Piecewise quasi-polynomials are quasi-polynomials defined over polyhedral pieces of the domain. More formally, they can be defined as follows.

Definition 7 (Quasi-polynomial). A quasi-polynomial q(x) in the integer variables x is a polynomial expression in greatest integer parts of affine expressions in the variables, i.e., q(x) ∈ Q[⌊Q[x]_{≤1}⌋].

Definition 8 (Piecewise Quasi-polynomial). A piecewise quasi-polynomial q(x), with x ∈ Z^d, consists of a finite set of pairwise disjoint polyhedra K_i ⊆ Q^d, each with an associated quasi-polynomial q_i(x). The value of the piecewise quasi-polynomial at x is the value of q_i(x) with K_i the polyhedron containing x, i.e.,

q(x) = q_i(x) if x ∈ K_i, and q(x) = 0 otherwise.

Note that the usual polynomials are special cases of quasi-polynomials as ⌊x_j⌋ = x_j for x_j integer.

Example 7. Consider the statement in Line 3 of Figure 3. The number of times this statement is executed can be represented by the piecewise quasi-polynomial ⌈N/3⌉ if N ≥ 0. Note that there is only one polyhedral “piece” in this example.
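The count in Example 7 is easy to validate by brute force; the following C sketch checks ⌈N/3⌉ = ⌊(N+2)/3⌋ against a direct simulation of the loop in Figure 3.

#include <assert.h>

/* Count how often line 3 of Figure 3 executes for a given N. */
int count_executions(int N) {
    int count = 0;
    for (int i = 0; i < N; i++)
        if (i % 3 == 0)
            count++;
    return count;
}

int main(void) {
    /* ceil(N/3) equals floor((N+2)/3) in integer arithmetic. */
    for (int N = 0; N <= 100; ++N)
        assert(count_executions(N) == (N + 2) / 3);
    return 0;
}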
3.5 Polyhedral Process Networks

Now we can finally define the structure of a polyhedral process network. For simplicity we assume that no merging of processes has been performed, i.e., that each process corresponds to a statement in the polyhedral model of the input.

Definition 9 (Polyhedral Process Network). A polyhedral process network is a directed graph with as vertices a set of processes P and as edges communication channels C. Each process P_i ∈ P has the following characteristics:
• a statement identifier s_i,
• a dimension d_i,
• an iteration domain D_i ⊆ Z^{d_i}.
Each channel C_i ∈ C has the following characteristics:
• a source process S_i ∈ P,
• a target process T_i ∈ P,
• a source access identifier corresponding to one of the accesses in the statement s_{S_i},
• a target access identifier corresponding to one of the accesses in the statement s_{T_i},
• a polyhedral relation M_i ⊆ D_{S_i} × D_{T_i} mapping iterations from the source domain to the target domain,
• a type (e.g., FIFO),
• a piecewise quasi-polynomial buffer size.

The identifiers in the process network can be used to obtain more information about the statements and the accesses when constructing a hardware or software realization of the network. The mapping M_i identifies which iterations of the source process write to the channel, which iterations of the target process read from the channel and how these iterations are related. The buffer sizes are such that the network can be executed without deadlocks.

Example 8. Consider the network in Figure 2, derived from the code in Figure 1. All processes have dimension d_i = 2. The iteration domain D_1 of the ReadImage process was given in Example 2. The iteration domains of the other two processes are

D_2 = D_3 = { (i, j) ∈ Z^2 | 1 ≤ i < K − 1 ∧ 1 ≤ j < N − 1 }.   (9)

There are nine communication channels between the first and the second process and one communication channel between the second and the third. All communication channels in this network are FIFOs. How these communication channels are constructed is explained in Section 5.1 and how their types are determined is explained in Section 6. The arrows in Figure 2 representing the communication channels are annotated with the name of the array that has given rise to the channel and the buffer size. These buffer sizes are computed in Section 8.
4 Polyhedral Analysis Tools

The construction of a polyhedral process network relies on a number of fundamental polyhedral operations. This section provides an overview of these operations. In each case, one or more tools are mentioned with which the operation can be performed. This list of tools is not exhaustive.
4.1 Parametric Integer Programming

By far the most fundamental step in the construction of a process network is figuring out which processes communicate data with each other and in what way. At the level of the input source program, this means figuring out for each value read in the program where the value was written. That is, for each read access to an array element, we need to know what was the last write access to the same array element that occurred before the given read access. In particular, if many iterations of the same statement write to the same array element, we need the (lexicographically) last iteration of that statement. Computing such a lexicographically maximal (or minimal) element of a polyhedral set can be performed using parametric integer programming. Let us first define a lexmax operator on polyhedral relations.

Definition 10 (Lexicographic Maximum). Given a polyhedral relation R, the lexicographic maximum of R is a polyhedral relation with the same parameter domain and the same domain as R that maps each element i of the domain to the unique lexicographically maximal element that corresponds to i in R, i.e.,

lexmax R = { (i, j) ∈ R ⊆ Z^{d_1} × Z^{d_2} | ¬(∃j′ ∈ Z^{d_2} : (i, j′) ∈ R ∧ j ≺ j′) }.   (10)
Now we can define an operation for computing lexmax R as a polyhedral relation.

Operation 1 (Lexicographic Maximum).
Input:
• a basic polyhedral relation R : Z^n → Z^{d_1} × Z^{d_2}
• a basic polyhedral set S : Z^n → Z^{d_1}
Output:
• a polyhedral relation M = lexmax(R ∩ (S × Z^{d_2}))
• a polyhedral set E = S \ dom R
The polyhedral relation M satisfies the following additional conditions:
• every existentially quantified variable in M is explicitly represented as the greatest integer part of an affine expression in the parameters and the domain variables,
• every variable in the range of M is explicitly represented as an affine expression in the parameters, the domain variables and the existentially quantified variables.

Operation 1 can be performed using isl (http://freshmeat.net/projects/isl/) [17], or, with some additional transformations, using piplib (http://www.piplib.org/) [6]. The use of the output set E will become clear in Section 5.

Example 9. Let us compute the lexicographically maximal element of the polyhedral set D_1 (4) from Example 3. Recall that a polyhedral set can be treated as a polyhedral relation with zero-dimensional domain. The input to Operation 1 is then R = N → Z^0 × D_1(N). S may be taken as the universal zero-dimensional set with
universal parameter domain, i.e., S = N → Z^0. The output is the lexicographically maximal element of the set D_1:

M = N → Z^0 × lexmax D_1(N) = { i ∈ Z | ∃ℓ = ⌊(N+2)/3⌋ : i = 3ℓ − 3 ∧ N ≥ 1 }.

The set E describes the (parameter) values for which there is no element in the set D_1, i.e., E = N → { () | N ≤ 0 }, with () the single element of Z^0.

Using Operation 1, we can also compute the domain of a polyhedral relation as a polyhedral set. We simply compute the lexicographic maximum relation of each basic relation and then drop the range variables and the equalities that define them. Similarly, the range of a relation can be computed as the domain of the inverse relation. Alternatively, the lexicographic maximum can be computed using Omega (http://www.cs.umd.edu/projects/omega/) [9], by expressing the lexicographic order in (10) using linear constraints, as in (7). However, the result will not necessarily satisfy the two conditions of Operation 1. On the other hand, the Omega library provides built-in operations for computing domains and ranges of relations.
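The closed form in Example 9 can be validated by brute force; the following C sketch confirms that the lexicographic maximum of D_1(N) is 3⌊(N+2)/3⌋ − 3 for all N ≥ 1.

#include <assert.h>

int main(void) {
    for (int N = 1; N <= 100; ++N) {
        /* Enumerate D1(N) = { i | 0 <= i < N and i is a multiple of 3 }. */
        int max = -1;
        for (int i = 0; i < N; ++i)
            if (i % 3 == 0)
                max = i;
        assert(max == 3 * ((N + 2) / 3) - 3);
    }
    return 0;
}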
4.2 Emptiness Check

A very basic and frequently used operation is that of checking whether a given polyhedral set or relation contains any elements for any value of the parameters.

Operation 2 (Emptiness Check).
Input: a polyhedral relation R : Z^n → Z^{d_1} × Z^{d_2}
Output: true if ∀s ∈ Z^n : R(s) = ∅ and false otherwise

Operation 2 can be performed by applying Operation 1 on each of the basic polyhedral sets in a polyhedral set S ∈ Z^{n+d_1+d_2} with the same description as R, but where all parameters and input variables are treated as set variables. The relation R is empty iff S is empty, which in turn holds iff none of the basic polyhedral sets has a lexicographically maximal element. Since S is (a union of) Integer Linear Programming (ILP) problem(s), any other algorithm for testing the feasibility of an ILP problem will work as well.
4.3 Parametric Counting

An important step in the buffer size computation is the computation of the number of elements in the buffer before a given read. We will be able to reduce this
computation to that of counting the number of elements in the image of a polyhedral relation R(s), denoted #R(s, t), for which we will use the following operation.

Operation 3 (Number of Image Elements).
Input: a polyhedral relation R : Z^n → Z^{d_1} × Z^{d_2}
Output: a piecewise quasi-polynomial q : Z^n × Z^{d_1} → Q : (s, t) → q(s, t) = #R(s, t)

We already performed Operation 3 on a polyhedral set in Example 7. Operation 3 can be performed using barvinok (http://freshmeat.net/projects/barvinok/) [20], [18]. Note that this library takes a basic polyhedral set as input, but it can be made to handle basic polyhedral relations by treating the domain variables as extra parameters. Unions can be handled by first computing a disjoint union representation and then summing the number of elements in each of the individual basic sets in this representation.
4.4 Computing Parametric Upper Bounds

As explained in the previous section, Operation 3 can be used to compute the number of elements in a buffer at any given time. When allocating memory for this buffer, we need to know the maximal number of elements that will ever have to reside in the buffer. As usual, we want to perform this computation parametrically. In general, however, computing the actual maximum may be too difficult. We therefore settle for computing an upper bound that is reasonably close to the maximum.

Operation 4 (Upper Bound on a Quasi-polynomial).
Input:
• a piecewise quasi-polynomial q : Z^n × Z^d → Q
• a bounded polyhedral set S : Z^n → Z^d, the domain over which to compute the upper bound
Output: a piecewise quasi-polynomial u : Z^n → Q such that u(s) ≥ max_{t∈S(s)} q(s, t) for all s ∈ pdom S
Operation 4 can be performed using bernstein (http://icps.u-strasbg.fr/pco/bernstein.htm) [3]. Although the basic technique only applies to polynomials, extensions to quasi-polynomials are also available [5]. Alternatively, the quasi-polynomials, which are usually the result of a counting problem such as Operation 3, can be approximated by a polynomial during the counting process [12]. Combining Operation 3 and Operation 4 results in the following operation, by taking S = dom R.

Operation 5 (Upper Bound on the Number of Image Elements).
Input: a polyhedral relation R : Z^n → Z^{d_1} × Z^{d_2}
Output: a piecewise quasi-polynomial u : Z^n → Q such that u(s) ≥ max_{t∈dom R(s)} #R(s, t) for all s ∈ pdom R

for (i = 0; i < N; ++i)
  for (j = 0; j < i; ++j)
    b[i][j] = f(a[i + j]);

Fig. 4 A simple program reading the same array elements several times
Example 10. Consider the program in Figure 4 and assume we want to know the maximal number of times an unspecified element of array a is read. This number is the maximal number of domain elements in the access relation that map to the same array element. In terms of the operations we have defined above, it is the maximal number of image elements in the inverse of the access relation. The access relation is

A = { (i, j) → a | 0 ≤ i < N ∧ 0 ≤ j < i ∧ a = i + j }.

Its inverse is

A^{-1} = { a → (i, j) | 0 ≤ i < N ∧ 0 ≤ j < i ∧ a = i + j }.

Applying Operation 3 yields the number of times a given array element is read:

#A^{-1}(N, a) = a − ⌊a/2⌋ if 0 ≤ a < N, and #A^{-1}(N, a) = N − ⌊a/2⌋ − 1 if N ≤ a ≤ 2N − 3.

An upper bound on this number can then be computed using Operation 4, yielding

max_{a∈dom A^{-1}} #A^{-1}(N, a) ≤ u(N) = N/2 if N ≥ 2.
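The piecewise quasi-polynomial of Example 10 and the bound u(N) = N/2 can again be checked by direct enumeration; a small C sketch:

#include <assert.h>

/* Count how often element a[x] is read by the code in Figure 4. */
int reads_of(int N, int x) {
    int count = 0;
    for (int i = 0; i < N; ++i)
        for (int j = 0; j < i; ++j)
            if (i + j == x)
                count++;
    return count;
}

int main(void) {
    for (int N = 2; N <= 40; ++N)
        for (int x = 0; x <= 2 * N - 3; ++x) {
            int expected = (x < N) ? x - x / 2 : N - x / 2 - 1;
            assert(reads_of(N, x) == expected);
            assert(2 * reads_of(N, x) <= N);  /* the bound u(N) = N/2 */
        }
    return 0;
}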
4.5 Polyhedral Scanning

When writing out code for a process in a process network, we not only need to make sure that an operation is performed for each element of its iteration domain, but we also need to insert the appropriate reading and writing primitives in the appropriate places. In particular, if C_i is a communication channel with the given process as its source, then a write to the communication channel needs to be inserted in each element of the domain of its mapping relation M_i. Similarly, if C_j is a communication channel with the given process as its target, then a read from the communication channel needs to be inserted in each element of the range of its mapping relation M_j. Each of these domains and ranges is a polyhedral set and we see that we need to generate code for visiting each element of these sets. That is, we need the following operation.
Operation 6 (Code Generation).
Input: a set of polyhedral sets { S_i }
Output: code for visiting each element of each S_i in lexicographic order

Operation 6 can be performed using CLooG (http://www.cloog.org/) [2] or CodeGen [10] (part of the Omega library). Note that when generating code for a process, we cannot simply apply Operation 6 on the iteration domains and the domains and ranges of the communication channel mappings, as the lexicographic order does not make a distinction between several occurrences of the same element in two or more of these sets. Rather, we need to ensure that the reads happen before the actual operation and that the writes happen after the operation. To enforce this extra ordering constraint, we introduce an extra innermost dimension, assigning it the value 0 for reads, 1 for the iteration domain and 2 for writes.

Example 11. Consider a process with a one-dimensional iteration domain D = { i | 0 ≤ i < N } that reads a value from some other process in its first iteration, M_1 = { () → i | i = 0 }, propagates a value from one iteration to the next, M_2 = { i → i′ | 0 ≤ i < N − 1 ∧ i′ = i + 1 }, and then sends a value to some other process in its last iteration, M_3 = { i → () | i = N − 1 }. The process is shown in Figure 5. The code that results from scanning D × { 1 }, ran M_1 × { 0 }, dom M_2 × { 2 }, ran M_2 × { 0 } and dom M_3 × { 2 } is shown schematically in Figure 6.
Fig. 5 A process with a one-dimensional iteration domain
for (i = 0; i < N; i++) {
  if (i == 0)
    Read(1);
  if (i >= 1)
    Read(2);
  f();
  if (i < N - 1)
    Write(2);
  if (i == N - 1)
    Write(3);
}

Fig. 6 Code generated for the process in Figure 5
5 Dataflow Analysis

This section describes how the communication channels in the process network are constructed using exact dataflow analysis [7]. We first discuss the standard dataflow analysis and then explain how some inter-process communication can be avoided by considering reuse.
5.1 Standard Dataflow Analysis

Standard exact dataflow analysis [7] is concerned with finding, for each read of a value from an array element in the program, the write operation that wrote that value to that array element. By replacing the write to the array by a write to one or more communication channels (as many as there are accesses in the program where the value is read) and the read from the array by a read from the appropriate communication channel, we will have essentially constructed the communication channels. The effect of dataflow analysis can be described as the following operation.

Operation 7 (Dataflow Analysis).
Input:
• a read access relation R ⊆ Z^d × Z^a
• a list of [write] access relations W_i ⊆ Z^{d_i} × Z^a
Output:
• a list of polyhedral relations M_i ⊆ Z^{d_i} × Z^d
• a polyhedral set S ⊆ Z^d
The output satisfies the following constraints:
• each element in the domain of R is an element either of S or of the range of exactly one of the mappings M_i, i.e., { ran M_i }_i ∪ { S } partitions dom R,
• if a particular element in the domain of R is in the range of one of the mappings, i.e., j ∈ dom R ∩ ran M_k, then the corresponding [write] iteration of W_k, i.e., M_k^{-1}(j), was the last iteration in any of the domains of the input access relations W_i that accessed the same array element as j, [i.e., it wrote the value read by j] and was executed before j,
• if a read iteration belongs to S, i.e., j ∈ dom R ∩ S, then this iteration accesses an array element that was not accessed by any of the W_i [i.e., this iteration reads an uninitialized value].

Operation 7 is applied for each read access in the program. The list of access relations required in the input is constructed from all write accesses in the program that access the same array. For each M_i ≠ ∅, a communication channel is created from the process corresponding to the writing statement to the process corresponding to the reading statement, with the given M_i as mapping. The type and buffer size are defined in the following sections. The set S in the output is assumed to be empty.

Below, we briefly sketch how Operation 7 can be implemented using Operation 1, but first we define a global order on all iterations of all statements in the input program. Recall from Section 3.2 that within a single iteration domain, the execution order corresponds to the lexicographic order. In Section 3.3 we explained that the relative order of different statements can be expressed using statement locations. A global order can be obtained by combining these two pieces of information. In particular, let d = max_i d_i. We define a (2d + 1)-dimensional space where the even dimensions (0 to 2d) correspond to the elements of the statement locations (in order) and the odd dimensions (1 to 2d − 1) correspond to the dimensions of the iteration domains (in order). We can map each iteration domain in this way to the (2d + 1)-dimensional space and we call the result the extended iteration domain. For iteration domains with d_i < d, the remaining dimensions can be assigned an arbitrary value, say zero. The lexicographic order on this space corresponds to the execution order of the input program. In particular, if the first dimension in which two extended iteration vectors differ is 2n, then the two corresponding statements share n loops, the iterators of these n loops have the same value in both vectors and the statements have a different location inside the nth loop. If the first dimension in which two extended iteration vectors differ is 2n + 1, then the two corresponding statements share at least n + 1 loops, the iterators of the first n loops have the same value in both vectors, but the iterator of the next loop has a different value in the two vectors. The access relations can similarly be extended by extending their domains.

Example 12. From Example 2, we know that the iteration domain associated to the ReadImage statement in Line 3 of Figure 1 is { (i, j) ∈ Z^2 | 0 ≤ i < K ∧ 0 ≤ j < N } (1), while from Example 6, we know that its statement location is (1, 2, 3). The corresponding extended iteration domain is therefore { (1, i, 2, j, 3) ∈ Z^5 | 0 ≤ i < K ∧ 0 ≤ j < N }.

Let us now assume that the input to Operation 7 contains a single (extended) write access relation W. The composition of the read access relation with the inverse of the write access relation yields a mapping from read iterations to write iterations that wrote to the array element accessed by the read. The one that actually wrote the value that is read is the one that was the (lexicographically) last to write to the element before the read. That is, we want to compute

lexmax ( (W^{-1} ∘ R) ∩ L_{2d+1}^{-1} ),
with L_{2d+1} the lexicographically smaller relation from (7). The result of this computation is the inverse of the relation M in the output. Unfortunately, we cannot directly apply Operation 1 to compute this lexicographic maximum because L_{2d+1} (7) is a union of basic relations as soon as d ≥ 1. (If W or R are unions then the basic relations in these unions can be treated as separate writes or reads, so we need not worry about unions in this part of the equation.) However, an extended write iteration that shares i_1 iterator values with the extended read iteration will always be executed after an extended write iteration that only shares i_2 < i_1 iterator values with the extended read iteration. We can therefore first apply Operation 1 to the writes that share 2d iterator values and with S set to the domain of the access relation. If the resulting set E is not empty then we can continue with the writes that share 2d − 1 iterator values, with S set to each of the basic sets in E. This process continues until all the resulting sets E are empty or until we have finished the case of 0 common iterator values.

Example 13. Consider once more the access relation (8) of the first argument to the call to Sobel in Line 6 of Figure 1 from Example 4, but now in its extended form,

R = { (4, i, 5, j, 6) → (a, b) | 1 ≤ i < K − 1 ∧ 1 ≤ j < N − 1 ∧ a = i − 1 ∧ b = j − 1 }.

The only write to the a array occurs in Line 3, with extended access relation

W = { (1, i, 2, j, 3) → (a, b) | 0 ≤ i < K ∧ 0 ≤ j < N ∧ a = i ∧ b = j }.

Composition of these two relations yields

W^{-1} ∘ R = { (4, i, 5, j, 6) → (1, i′, 2, j′, 3) | 1 ≤ i < K − 1 ∧ 1 ≤ j < N − 1 ∧ i′ = i − 1 ∧ j′ = j − 1 }.

We see that the range iterators are uniquely defined in terms of the domain iterators, so in this case there is no need to compute the lexicographic maximum, as lexmax W^{-1} ∘ R would be identical to W^{-1} ∘ R. However, let us consider what would happen if we were to apply the above algorithm anyway. Since the first iterators in domain and range are distinct constants, the two iteration vectors never share any initial iterator values. The first 2d = 4 applications of Operation 1 therefore operate on an empty basic polyhedral relation R and simply return the input basic polyhedral set S = dom R as E. The final application returns M_1^{-1} = W^{-1} ∘ R and E = ∅. Dropping the extra iterators again, we obtain

M_1 = { (i′, j′) → (i, j) | 1 ≤ i < K − 1 ∧ 1 ≤ j < N − 1 ∧ i′ = i − 1 ∧ j′ = j − 1 }.   (11)

If there is more than one write access relation in the input of Operation 7, then the computation is a little bit more complicated. After computing the lexicographically maximal element of a write that shares i iterator values with the read, we still need to check that there is no other write in between. That is, we need to check if there is a write from a different write access that also shares i iterator values with the read, also occurs before the read, but occurs after the already found write. Again we have
to consider all possible cases of shared iterator values between the two writes, now from i to 2d + 1, as a write that occurs after a given iteration and that has fewer iterator values in common with the given iteration will be executed after an iteration that has more iterator values in common. The above sketch of an algorithm can still be significantly improved by checking the write access relations in the appropriate order and by taking into account the fact that the dimensions that correspond to the statement locations have fixed values. For a more detailed description, we refer to [7], where a variant of the above algorithm is applied directly on “quasts”, the output structure of piplib, which we will not explain here.
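To make the global order of this section concrete, the following C sketch builds the extended iteration vectors for the two-dimensional statements of Figure 1 (d = 2) and compares them lexicographically; the function names are illustrative.

#include <stdbool.h>

/* Interleave the statement location (even positions) with the loop
   iterators (odd positions) to obtain a (2d+1)-vector. */
void extend(const int loc[3], int i, int j, int out[5]) {
    out[0] = loc[0];  /* position of the outer loop */
    out[1] = i;       /* outer iterator             */
    out[2] = loc[1];  /* position of the inner loop */
    out[3] = j;       /* inner iterator             */
    out[4] = loc[2];  /* position of the statement  */
}

/* True iff instance a executes before instance b. */
bool executes_before(const int a[5], const int b[5]) {
    for (int k = 0; k < 5; ++k)
        if (a[k] != b[k])
            return a[k] < b[k];
    return false;
}

For instance, ReadImage iteration (0, 0) with location (1, 2, 3) extends to (1, 0, 2, 0, 3), which lexicographically precedes every extended Sobel iteration (4, i, 5, j, 6).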
5.2 Reuse

A process network constructed using the standard dataflow analysis of the previous section will always send data from the process that corresponds to the statement that wrote a given value in the sequential program to the process that corresponds to the statement that read the given value, even if the same value is read multiple times from the same process. Now, in practically any form of implementation of process networks, sending data from a process to itself (or simply keeping it) is much cheaper than sending data to another process. It is therefore better to only transmit data from one process to another the first time it is needed.

Obtaining a network that only transmits data to a given process once is actually fairly simple. When applying Operation 7 from the previous section, instead of only supplying a list of write accesses to the array read by the read access R, we extend the list with all read accesses to the given array (including R itself) from the same statement. Any value that was already read by the given process will then be obtained from the last of those reads from the same process instead of from the process that actually wrote the value.
Fig. 7 A process network corresponding to the sequential program in Figure 1 exploiting reuse
Example 14. In Example 1, we have shown a process network (Figure 2) derived from the code in Figure 1 without exploiting reuse inside the Sobel process. Figure 7 shows a process network derived from the same code, but in this case with reuse. The first network was created by only considering the write to array a in Line 3 of Figure 1 as a possible source for the reads in Line 6, while the second was created by also considering all the reads in Line 6 as possible sources. It is clear from the figures that in the first network the Sobel process does not communicate with itself, while in the second network it does. What may not be apparent, however, is that in both networks there are 9 channels from the ReadImage process to the Sobel process. The second figure only shows 3 channels because the software used to generate these process networks automatically combines some communication channels if this combination does not affect the type of the channels, as explained at the end of Section 6.1. The number of channels from the first process to the second has not changed because each read in Line 6 reads at least one value that is not read by any of the other reads. However, even though the number of channels may not have changed, the number of values transmitted over these channels has been drastically reduced. In particular, the image read by the ReadImage process is now transmitted only once, instead of nearly 9 times.
6 Channel Types

In the previous section, we showed how to determine the communication channels between the processes in the network. The practical implementation of these channels in hardware depends on certain properties that may or may not hold for the channels. In particular, it will be important to know if a communication channel can be implemented as a FIFO. Other properties include multiplicity and special kinds of internal channels. In this section, we describe how to check these properties.
6.1 FIFOs and Reordering Channels

The communication channels derived in Section 5 are characterized by a mapping M (see Definition 9) relating write iterations in one process to the corresponding read iterations in the same or another process. The channel needs to ensure that when the second process reads from the channel it is given the correct value and therefore needs to keep track of which iteration in the first process wrote a given value. There is, however, no need to keep track of this extra information if it can be shown that the values will always be read in the same order as that in which they were written. In such cases, the communication channel can simply be implemented as a FIFO. To check if the writes and reads are performed in the same order, we consider what happens when the order is not the same, i.e., when reordering occurs on the channel. In this case, there is a pair of writes (w_1, w_2) such that the corresponding
reads (r_1, r_2) are executed in the opposite order, i.e., w_1 ≻ w_2 while r_1 ≺ r_2. That is, reordering occurs if the relation

T = { w_1 → w_2 | ∃r_1, r_2 : (w_1, r_1), (w_2, r_2) ∈ M ∧ w_1 ≻ w_2 ∧ r_1 ≺ r_2 }   (12)
is non-empty, which can be verified by applying Operation 2. Note that T will only be considered empty if there is no reordering for any value of the parameters. Also note that we only need the lexicographic order within iteration domains and not across iteration domains, as we only compare reads to other reads and writes to other writes. If we identify two (or more) communication channels C_1 and C_2 as FIFOs, we can check if we can combine them into a single FIFO by performing two emptiness checks on a relation T as in (12), except that in one relation (w_1, r_1) is taken from M_1 and (w_2, r_2) is taken from M_2, and vice versa in the other relation.
Fig. 8 A program resulting in a network with a reordering channel:

for (i = 0; i <= N; i++)
  a[i] = g(i);
for (i = 0; i <= N; i++)
  b[i] = f(a[N-i]);
Example 15. Consider the program in Figure 8. The corresponding process network has two processes and one communication channel connecting them. The mapping on the channel is M = {w → r | 0 ≤ w ≤ N ∧ r = N − w} and we have T = { w1 → w2 | ∃r1 , r2 : (w1 , r1 ), (w2 , r2 ) ∈ M ∧ w1 > w2 ∧ r1 < r2 } = { w1 → w2 | 0 ≤ w1 , w2 ≤ N ∧ w1 > w2 ∧ N − w1 < N − w2 } = { w1 → w2 | 0 ≤ w2 < w1 ≤ N }. For N > 0, this relation is clearly non-empty. We conclude that we are dealing with a reordering channel. See Example 20 for a variation on this example where we find a FIFO.
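To make this emptiness check concrete, it can be mimicked by brute force once the parameter is fixed. The following C sketch is ours, not part of the chapter's toolflow, which performs the check symbolically with Operation 2 for all values of N at once; here we simply fix N and enumerate the relation T for the channel of Figure 8.

#include <stdio.h>

/* Brute-force emptiness check of the reordering relation T (12) for the
 * channel of Figure 8, where M = { w -> r | 0 <= w <= N, r = N - w }. */
static int has_reordering(int N) {
    for (int w1 = 0; w1 <= N; w1++)
        for (int w2 = 0; w2 <= N; w2++) {
            int r1 = N - w1, r2 = N - w2;   /* the reads paired with w1, w2 */
            if (w1 > w2 && r1 < r2)         /* writes and reads in opposite order */
                return 1;                   /* T non-empty: reordering channel */
        }
    return 0;                               /* T empty: the channel is a FIFO */
}

int main(void) {
    printf("N = 10: reordering? %s\n", has_reordering(10) ? "yes" : "no");
    return 0;
}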
6.2 Multiplicity

The standard dataflow analysis from Section 5.1 may result in communication channels that are read more often than they are written to when the same value is used in multiple iterations of the reading process. Such channels require special treatment and this multiplicity condition should therefore be detected by checking whether
there is any write iteration w that is mapped to multiple read iterations through the mapping M. In particular, let T be the relation

T = { w1 → w2 | ∃r1, r2 : (w1, r1), (w2, r2) ∈ M ∧ w1 = w2 ∧ r1 ≺ r2 }.

This relation contains all pairs of identical write iterations that correspond to a pair of distinct read iterations. If this relation is non-empty, then multiplicity occurs. As in the previous section, we can check the emptiness using Operation 2. It should be noted that if we use the dataflow analysis from Section 5.2, i.e., taking into account possible reuse, then the resulting communication channels will never exhibit multiplicity. Instead, two channels would be constructed, one for the first time a value is read and one for propagating the value read to later iterations. In general, it is preferable to detect reuse rather than multiplicity, because the analysis results in fewer types of channels and because taking advantage of reuse may split channels that exhibit both reordering and multiplicity into a pair of FIFOs.
Fig. 9 Outer product source code:

for (i = 0; i < N; ++i)
  a[i] = A(i);
for (j = 0; j < N; ++j)
  b[j] = B(j);
for (i = 0; i < N; ++i)
  for (j = 0; j < N; ++j)
    c[i][j] = a[i] * b[j];

Fig. 10 Outer product network with multiplicity

Fig. 11 Outer product network without multiplicity
Example 16. Consider the code for computing the outer product of two vectors shown in Figure 9. Figures 10 and 11 show the results of applying standard dataflow analysis (Section 5.1) and dataflow analysis with reuse (Section 5.2) respectively. Each figure shows the network on the left and the dataflow between the individual iterations of the processes on the right. The iterations are executed top-down, left-to-right. In the first network, the channel between b and c has mapping
Mb→c = { w → (r1, r2) | 0 ≤ r1, r2 < N ∧ w = r2 }. For testing multiplicity, we have the relation

T = { r1,2 → r2,2 | ∃r1,1, r2,1 ∈ Z : 0 ≤ r1,1, r1,2, r2,1, r2,2 < N ∧ r1,2 = r2,2 ∧ r1,1 < r2,1 },

which is clearly non-empty, while for testing reordering we use the relation

T = { r1,2 → r2,2 | ∃r1,1, r2,1 ∈ Z : 0 ≤ r1,1, r1,2, r2,1, r2,2 < N ∧ r1,2 > r2,2 ∧ r1,1 < r2,1 },

which is also non-empty. In the second network, the channel between b and c has mapping

Mb→c = { w → (r1, r2) | 0 ≤ r2 < N ∧ r1 = 0 ∧ w = r2 }.    (13)

There is no multiplicity since each write corresponds to exactly one read. There is also no reordering, since the writes and reads occur for increasing values of w = r2 in both processes.
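As a companion to the symbolic test, multiplicity on the channel from b to c in Figure 10 can again be checked by brute force for a fixed N. The sketch below is ours and purely illustrative; a symbolic tool would decide this for all N at once.

#include <stdio.h>

/* The mapping relates write w to every read (r1, r2) with w = r2.
 * Multiplicity holds as soon as one write is mapped to more than one
 * read iteration. */
static int has_multiplicity(int N) {
    for (int w = 0; w < N; w++) {
        int reads = 0;
        for (int r1 = 0; r1 < N; r1++)      /* enumerate all read iterations */
            for (int r2 = 0; r2 < N; r2++)
                if (w == r2)
                    reads++;                /* (r1, r2) is mapped to w */
        if (reads > 1)
            return 1;                       /* w is read N times in total */
    }
    return 0;
}

int main(void) {
    printf("N = 4: multiplicity? %s\n", has_multiplicity(4) ? "yes" : "no");
    return 0;
}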
6.3 Internal Channels

Internal channels, i.e., channels from a process to itself, can typically be implemented more efficiently than external channels. In particular, since processes are scheduled independently, there is no guarantee that when a value is read, it is already available, or that when a value is written, there is still enough room in the channel to hold the data. The external channels therefore need to implement both blocking on read and blocking on write. For internal channels, there is no need for blocking. In fact, blocking would only lead to deadlocks. Besides the blocking issue, there are some special cases that allow for a more efficient implementation than a general FIFO buffer.

6.3.1 Registers

The first special case is the one where the FIFO buffer holds at most one value. The FIFO buffer can then be replaced by a register. In Section 8 we will see how to compute a bound on a FIFO buffer in general, but the case where this bound is one can be detected in a simpler way. If each write w1 to a channel is followed by the corresponding read r1, without any other read occurring in between, then we indeed only need room for one value. On the other hand, if there is an intervening read, w1 ≺ r2 ≺ r1, then we will need more storage space. As usual, we can detect this situation by performing an emptiness check. Consider the set

T = { w1 | ∃r1, r2 : (w1, r1) ∈ M ∧ r2 ∈ ran M ∧ w1 ≺ r2 ≺ r1 }.
This set contains all writes w1 for which there exists a read r2 that occurs after w1 but before the read r1 that corresponds to w1. Note that it is legitimate to compare read and write iterations because we are dealing with internal channels, so these iterations belong to the same iteration domain.

6.3.2 Shift Registers

Another special case occurs when the number of iterations between a write to an internal channel and the corresponding read from the channel is constant. If so, the FIFO can be replaced by a shift register, shifting the data in the channel by one in every iteration, independently of whether any data was read or written in that iteration. Such shift registers can be implemented more efficiently than FIFOs in hardware. Checking whether we can use a shift register is as easy as writing down the relation

R = { (w, i) ∈ Z^d × Z^d | ∃r ∈ Z^d : (w, r) ∈ M ∧ i ∈ D ∧ w ≺ i ≺ r },

where M is the mapping on the channel and D is the iteration domain of the process, and applying Operation 3. The result is a piecewise quasi-polynomial q of type Z^n × Z^d → Q, where n is the number of parameters. If the expression is independent of the last d variables, i.e., if q is defined purely in terms of the parameters and not in terms of the write iterators, then we can use a shift register. Note that we also count iterations in which no value is read from or written to the channel. This means that the size of the shift register may be larger than the buffer size of the FIFO, as computed in Section 8.
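As an illustration of the structure this analysis enables, the following C sketch models a shift-register channel in software. The constant distance DIST and the function names are assumptions made for the example; an actual implementation would be a hardware shift register generated from the analysis.

#define DIST 3   /* assumed constant write-to-read distance, in iterations */

static int shift_reg[DIST];   /* the FIFO replaced by a shift register */

/* Called once per iteration of the process, whether or not the iteration
 * actually communicates: the contents shift by one position every
 * iteration.  The value falling out was shifted in exactly DIST
 * iterations ago, which is precisely the value a read expects. */
int shift_step(int in) {
    int out = shift_reg[DIST - 1];
    for (int k = DIST - 1; k > 0; k--)
        shift_reg[k] = shift_reg[k - 1];
    shift_reg[0] = in;
    return out;
}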
7 Scheduling

Although the processes in a process network are scheduled independently during their execution, there are two occasions where we may want to compute a common schedule for two or more processes. The first such occasion is when there are more processes in the network than there are processing elements to run them on. The second is when we want to compute safe buffer sizes as will be explained in Section 8. In principle, we could use the extended iteration domains from Section 5.1 as the common schedule, resulting in the same execution order as that of the input sequential program. This schedule may lead to very high overestimates of the buffer sizes, however, and we therefore prefer constructing a schedule that is more suited for the buffer size computation. There are many ways of obtaining such a schedule. Below we discuss a simple incremental technique.
7.1 Two Processes

Let us first consider the case where there are only two processes, P1 and P2, that moreover have the same dimension d = d1 = d2. We will relax these conditions in Section 7.2 and Section 7.4. Since the two iteration domains have the same dimension, we can consider the two iteration domains as being part of the same iteration space. However, if there are any communication channels between the two processes, then we need to make sure that the second does not read a value from any channel before it has been written. Let Mi ⊆ DP1 × DP2 be the mapping of one of these channels and consider the delays between a write to the channel and the corresponding read

Δi = { t ∈ Z^d | ∃(w, r) ∈ Mi : t = r − w }.

Note that the concept of a delay, in particular the subtraction r − w, is only valid because we now consider both iteration domains as being part of the same iteration space. If any of these delays is lexicographically negative, i.e., if the smallest delay

δi = lexmin Δi    (14)
is lexicographically negative, then some reads do occur before the corresponding writes and the schedule would be invalid. The solution is to delay the whole second process by just enough to make all the delays over the communication channels lexicographically non-negative. In particular, let
δ1→2 = lexmin{ δi }i,    (15)
with i ranging over all communication channels between processes P1 and P2. Then, replacing the iteration domain D2 of P2 by D2 − δ1→2 (and adapting all mappings Mi accordingly) will ensure that all delays over channels are lexicographically non-negative. In principle, we could choose any delay on the second iteration domain that is lexicographically larger than −δ1→2. However, delaying the reads more than strictly necessary would only increase the time the values stay inside the communication channel and could therefore only increase the buffer size needed on the channel. Furthermore, if there are also communication channels from P2 to P1, then we need to make sure the delays on these channels are also non-negative. Fortunately, if δ2→1 is the minimal delay over any such channel, then
δ1→2 + δ2→1 ⪰ 0.    (16)
Otherwise there would be a read operation in process P1 that indirectly depends on a write operation from the same process that occurs later or at the same time, which is clearly impossible if the process network was derived from a sequential program. By choosing a delay on process P2 of −δ1→2, the minimal delay on any channel from P2 to P1 becomes exactly δ1→2 + δ2→1 and is therefore safe.
There are still some details we glossed over in the discussion above. First, it should be clear that δi (14) can be computed using a variation of Operation 1 or even directly as −lexmax(−Δi). The minimal delay over all communication channels δ1→2 (15) can be computed as

lexmin{ δi }i = ⋃i { δi | δi ≺ δj for all j < i and δi ⪯ δj for all j > i },
where both i and j range over all communication channels between the two processes. That is, we select δi for the values of the parameters where it is strictly smaller than all previous δj and smaller than or equal to all subsequent δj. The result is a polyhedral set that contains a single vector for each value of the parameters. Finally, we need to deal with the fact that the above procedure can set the delay over a channel to zero, which means that the write and corresponding read would happen simultaneously. The solution is to assign an order of execution to the statements within a given iteration by introducing an extra innermost dimension, as we did in Section 4.5. The values for the extra dimension can be set based on a topological sort of a directed graph with as only edges those communication channels with a zero delay. The absence of cycles in this graph is guaranteed by (16).

Example 17. Consider a network composed of the first two processes in Figure 2. There are nine channels between the two processes. For one of these channels, we computed the mapping M1 (11) in Example 13. The delay between writes and reads on this channel is constant and we obtain Δ1 = { (1, 1) } and hence δ1 = (1, 1). The other channels also have constant delays and the smallest of these yields δ1→2 = (−1, −1). We therefore replace the second iteration domain D2 (9) by D2 + (1, 1), resulting in a new δ1→2 of (0, 0). To make this delay lexicographically positive (rather than just non-negative), we introduce an extra dimension and assign it the value 0 in P1 and 1 in P2. The new iteration domains are therefore

D1 = { (i, j, 0) ∈ Z^3 | 0 ≤ i < K ∧ 0 ≤ j < N },
D2 = { (i, j, 1) ∈ Z^3 | 2 ≤ i < K ∧ 2 ≤ j < N }.
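The lexicographic operations used above are straightforward to implement in the non-parametric case where every channel has a constant minimal delay, as in Example 17. The following C helper is ours and illustrative only; real tools compute lexmin symbolically over parametric sets using Operation 1.

/* Lexicographic comparison of two d-dimensional delay vectors:
 * negative, zero or positive when a is lexicographically smaller than,
 * equal to, or larger than b. */
static int lex_cmp(const int *a, const int *b, int d) {
    for (int k = 0; k < d; k++)
        if (a[k] != b[k])
            return a[k] < b[k] ? -1 : 1;
    return 0;
}

/* delta_{1->2} as in (15): the lexicographic minimum over the minimal
 * delays delta_i of the n channels between the two processes, here for
 * the 2-dimensional, constant-delay case. */
static void lexmin_delay(int delays[][2], int n, int out[2]) {
    out[0] = delays[0][0];
    out[1] = delays[0][1];
    for (int i = 1; i < n; i++)
        if (lex_cmp(delays[i], out, 2) < 0) {
            out[0] = delays[i][0];
            out[1] = delays[i][1];
        }
}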
7.2 More than Two Processes

If there are more than two processes in the network, then we can still apply essentially the same technique as that of the previous section, but we need to be careful about how we define the minimal delay δ1→2 (15) between two processes P1 and P2. It will not be sufficient to consider only the communication channels between P1 and P2 themselves. Instead, we need to consider all paths in the process network from P1 to P2, compute the sum of the minimal delays on the communication channels in each path and then take the minimum over all paths. In effect, this is just the shortest path between P1 and P2 in the process network with as weights the minimal delays on the communication channels. As in the case of a two-process network
(16), the minimal delay over a cycle is always (lexicographically) positive. We can therefore apply the Bellman-Ford algorithm (see, e.g., [14, Section 8.3]) to compute this shortest path. If the network not only contains more than two processes, but we also want to combine more than two processes, then we can apply the above procedure incrementally, in each step combining two processes into one process until we have performed all the required combinations. If many or all processes need to be combined, it may be more efficient to compute the all-pairs shortest paths using the Floyd-Warshall algorithm [14, Section 8.4] and then to update the minimal delays in each combination step.

Example 18. Consider once more the network in Figure 2. There are three processes, so we will need two combination steps. Let us for illustration purposes combine the first and the last process first, even though there is no direct edge between these two processes. (Normally, we would only combine processes that are directly connected.) The delay between P2 and P3 is constant and equal to δ2→3 = (0, 0), while the minimal delay between P1 and P2, as computed in Example 17, is δ1→2 = (−1, −1). The minimal delay between P1 and P3 is therefore δ1→3 = δ1→2 + δ2→3 = (−1, −1). We therefore replace the third iteration domain D3 by D3 + (1, 1), resulting in a new δ1→3 of (0, 0). δ1→2 remains unchanged, but δ2→3 changes to (1, 1). In the next combination step, D2 is again replaced by D2 + (1, 1) and δ1→2 and δ2→3 both become (0, 0). There are now two edges with a zero minimal delay, so we assign the processes an increasing value in a new innermost dimension. The final iteration domains of the first two processes are as in Example 17, while that of the third process is D3 = { (i, j, 2) ∈ Z^3 | 2 ≤ i < K ∧ 2 ≤ j < N }. Figure 12 shows the code for the completely merged process. For simplicity, we have written out this code in terms of the original statements and their original accesses rather than using reads and writes to communication channels. If we were to compute buffer sizes based on the original schedule (Figure 1), then each channel between the first and the second process would get assigned a buffer size of (K − 2)(N − 2) because in this original schedule the ReadImage process runs to completion before the Sobel process starts running. The significantly smaller buffer sizes that we obtain using the procedure of Section 8 based on the schedule computed here are shown in Figure 2.
7.3 Blocking Writes

In deriving combined schedules, we have so far only been interested in avoiding deadlocks. When merging several processes into a single process this is indeed all we need to worry about. However, when using a global schedule in the computation
for (i = 0; i < K; i++)
  for (j = 0; j < N; j++) {
    a[i][j] = ReadImage();
    if (i >= 2 && j >= 2) {
      Sbl[i-1][j-1] = Sobel(a[i-2][j-2], a[i-1][j-2], a[i][j-2],
                            a[i-2][j-1], a[i-1][j-1], a[i][j-1],
                            a[i-2][j],   a[i-1][j],   a[i][j]);
      WriteImage(Sbl[i-1][j-1]);
    }
  }
Fig. 12 Merged code for the process network of Figure 2
of valid buffer sizes, we may end up with buffer sizes that, while not introducing any deadlocks, may impose blocking writes, negatively impacting the throughput of the network. In particular, iterations of different processes that are mapped onto the same iteration in the global schedule (ignoring the extra innermost dimension) can usually be executed in a pipelined fashion, except when non-adjacent processes in this pipeline communicate with each other. In this case, the computed buffer sizes may be so small that the first of these non-adjacent processes needs to wait until the second reads from the communication channel before proceeding onto the next iteration. The solution is to enforce a delay between a writing process and the corresponding reading process that is equal to one iteration of the writing process. This delay ensures that the delay between non-adjacent processes in a pipeline will be large enough to avoid blocking writes that hamper the throughput. The required delay is the smallest variation in the iteration domain D of a process. Let P be the relation that maps a given iteration of D to the previous iteration of D, i.e.,

P(D) = lexmax{ i → i' | i, i' ∈ D ∧ i' ≺ i }.    (17)
The relation P(D) can be computed using Operation 1. The required delay is then the smallest difference between two subsequent iterations, i.e.,
δ = lexmin{ t | ∃(i, i') ∈ P(D) : t = i − i' }.

The delay is enforced by subtracting the δ corresponding to the writing process from each channel delay δi (14). Note, however, that in the presence of cycles, we may not always be able to enforce this extra delay. In particular, subtracting the δ's may result in negative total delays over these cycles. If this happens, we have to resort to the scheduling of Section 7.2, without the additional delays.

Example 19. Consider the code in Figure 13. There are three statements, f, g, and h, with three FIFO communication channels between them: one from f to g, one from g to h and one from f to h. The delays on these channels are all zero. The scheduling of Section 7.2 therefore leaves everything in place and the resulting code is identical
Fig. 13 A program illustrating blocking writes:

for (i = 0; i < N; ++i) {
  a[i] = f(i);
  b[i] = g(a[i]);
  c[i] = h(a[i], b[i]);
}
to the original. It is clear that in this schedule the size of each of the FIFOs can be set to 1. However, if we try to run the resulting process network, then we see that the second iteration of process f needs to wait for the first iteration of h to empty the FIFO before it can write to the FIFO of size 1. Because of this blocking write, the three processes cannot be fully pipelined, as illustrated in Figure 14. Since all domains are 1-dimensional, the internal delay δ in each process is 1. Subtracting this internal delay from each channel delay, we see that the minimal delay between process f and h is (−1) + (−1) = −2. Process h is therefore scheduled at position 2 relative to process f. This means that the channel between these two processes needs to be of size at least 2. This size in turn allows a fully pipelined execution of the three processes, as illustrated in Figure 15.
Fig. 14 Time diagram with blocking

Fig. 15 Time diagram without blocking
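The effect described in Example 19 can be reproduced with a toy simulation. The following C sketch is our illustration, not the chapter's tooling; the downstream-first firing order within a step and the unit capacities on the channels f→g and g→h are modeling assumptions. It fires each process once per step when its inputs are available and its output buffers have room, and shows that a capacity of 1 on the channel from f to h stalls f every other step, while a capacity of 2 yields the fully pipelined pattern of Figure 15.

#include <stdio.h>

static int fg, gh, fh;   /* occupancy of the FIFOs f->g, g->h, f->h */

/* One synchronous step: processes are tried downstream-first, so a
 * consumer can free a slot that its producer refills in the same step. */
static void step(int cap_fh, char fired[4]) {
    fired[0] = fired[1] = fired[2] = '.';
    fired[3] = '\0';
    if (gh > 0 && fh > 0) { fh--; gh--; fired[2] = 'h'; }      /* h reads both inputs */
    if (fg > 0 && gh < 1) { fg--; gh++; fired[1] = 'g'; }      /* g: capacity 1 on g->h */
    if (fg < 1 && fh < cap_fh) { fg++; fh++; fired[0] = 'f'; } /* f writes both outputs */
}

int main(void) {
    for (int cap = 1; cap <= 2; cap++) {
        fg = gh = fh = 0;
        printf("capacity of f->h = %d:", cap);
        for (int t = 0; t < 8; t++) {
            char fired[4];
            step(cap, fired);
            printf(" %s", fired);
        }
        printf("\n");   /* cap 1: f stalls every other step; cap 2: full pipeline */
    }
    return 0;
}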
7.4 Linear Transformations

The schedules that we have seen so far are all of the form σ(j) = Id+1,d j + b, where Id+1,d is the (d + 1) × d matrix with ik,k = 1 and ik,l = 0 for k ≠ l, and b ∈ Z^{d+1}. That is, we add an extra dimension (Id+1,d j) and apply a shift (b). Note that we have also assumed so far that all iteration domains have the same dimension. In principle, we could apply more general affine schedules σ(j) = Aj + b, i.e., including an arbitrary linear transformation Aj. Not only would this allow us to compute potentially tighter bounds on the buffer sizes, but it may even change the types of the communication channels.

Example 20. Consider once more the program in Figure 8. In Example 15 we have shown that the only channel in the corresponding network has reordering. If we reverse the second loop, however, i.e., if we apply the transformation j ↦ N − j, then the mapping on the channel becomes M = { w → r | 0 ≤ w ≤ N ∧ r = w }
and we have T = { w1 → w2 | 0 ≤ w1, w2 ≤ N ∧ w1 > w2 ∧ w1 < w2 }. This set is clearly empty and so the communication channel has become a FIFO.

The example shows that if we were to apply general linear transformations, then we would need to do this before the detection of channel types of Section 6. In fact, if we want to detect reuse, as we did in Section 5.2, then we would need to do this reuse detection after any linear transformation. The reason is that a linear transformation may change the internal order inside an iteration domain, meaning that what used to be a previous iteration, from which we could reuse data, may have become a later iteration. On the other hand, the dataflow in the sequential program imposes some restrictions on which linear transformations are valid, so we need to perform dataflow analysis (Section 5) before any linear transformation. The solution is to first apply standard dataflow analysis (Section 5.1), then to perform linear transformations and finally to detect reuse inside the communication channels that were constructed in the first step. A full discussion of general linear transformations is beyond the scope of this chapter.

There is however one case where we are forced to apply a linear transformation, and that is the case where not all iteration domains have the same dimension. Before we can merge two iteration domains, they have to reside in the same space. In particular, they need to have the same dimension. Let d be the maximal dimension of any iteration domain; then we could expand a di-dimensional iteration domain with di ≤ d using the transformation Id,di. This transformation effectively pads the iteration vectors with zeros at the end. This may, however, not be the best place to insert the zeros. A simple heuristic to arrive at better positions for the zeros is to look at the communication channels between a lower-dimensional domain and a full-dimensional domain and to insert the zeros such that iterators that are related to each other through the communication channels are aligned. Note that inserting zeros does not change the internal order inside the iteration domain. The same holds for any scaling that we may want to apply based on essentially the same heuristic.

Example 21. Consider the communication channel between processes b and c in the network of Figure 11 (see Example 16). The first process has a one-dimensional iteration domain, while the second process has a two-dimensional iteration domain. The mapping on the channel (13) shows that there is a relation w = r2 between the single iterator of the writing process and the second iterator of the reading process. We therefore insert a zero before the iteration vectors of the writing process such that the original iterator ends up in the second position. For the channel between a and c the relation is w = r1, so in this case we insert a zero after the original iterator. These choices are reflected by the orientations of these two one-dimensional iteration domains in Figure 11. With these orientations, the writes may be moved on top of the corresponding reads by simply shifting the iteration domains, meaning that a buffer size of at most 1 is needed. Had we chosen a different orientation for the two one-dimensional iteration domains, then this would not have been possible.
8 Buffer Size Computation

This section describes how to compute valid buffer sizes for all communication channels. The buffer sizes are valid in the sense that imposing them does not cause deadlocks. The first step is to compute a global schedule for all processes in the network, e.g., using the techniques of Section 7. This scheduling step effectively makes all communication channels internal (to the single combined process). We then compute buffer sizes for this particular schedule. The schedule itself is not used during the actual execution of the network. However, we know that there is at least one schedule for which the buffer sizes are valid. The blocking reads and writes on the communication channels will therefore not introduce any deadlocks.
8.1 FIFOs

We are given an internal communication channel that has been identified as a FIFO using the technique of Section 6.1 and we want to know how much buffer space is needed on the FIFO. Recall that any channel may be considered internal after a global scheduling step that maps all iteration domains to a common iteration space. The buffer should be large enough to hold the number of tokens that are in transit between a write and a read at any point during the execution. We therefore first count this number of tokens in terms of an arbitrary iteration and then compute an upper bound over all iterations. Let us look at these steps in a bit more detail.

The communication channel is described by a mapping M from the write iteration domain Dw to the read iteration domain Dr. The maximal number of tokens in the buffer will occur after some write to the buffer and before the first subsequent read from the buffer. It is therefore sufficient to investigate what happens right before a token is read from the buffer. Let W be the relation mapping any read iteration r to all write iterations that occur before this read and let R be the relation mapping the same read iteration to all previous read iterations, i.e.,

W = { r → w | r ∈ ran M ∧ w ∈ dom M ∧ w ≺ r }    (18)
R = { r → r' | r, r' ∈ ran M ∧ r' ≺ r }.    (19)

Then the number n(s, r) of elements in the buffer right before the execution of read r is the number of writes to the buffer before the read, i.e., #W(s, r), minus the number of reads from the buffer before the given read, i.e., #R(s, r), where, as usual, s are the parameters. Both of these computations can be performed using Operation 3. Finally, we apply Operation 4 to the piecewise quasi-polynomial n(s, r) = #W(s, r) − #R(s, r) and the polyhedral set ran M, resulting in a piecewise quasi-polynomial u(s) that is an upper bound on the number of elements in the FIFO channel during the whole execution.
Example 22. Consider once more the network in Figure 2. A global schedule for this network was derived in Examples 17 and 18, and code corresponding to this schedule is shown in Figure 12. Writing the mapping (11) on the channel constructed from the first argument to the call to Sobel in Line 6 of Figure 1 in Example 13 in terms of the new iterators, we obtain

M = { (i', j', 0) → (i, j, 1) | 2 ≤ i < K ∧ 2 ≤ j < N ∧ i' = i − 2 ∧ j' = j − 2 }.

From this relation we derive

W = { (i, j, 1) → (i', j', 0) | 2 ≤ i < K ∧ 2 ≤ j < N ∧ 0 ≤ i' < K − 2 ∧ 0 ≤ j' < N − 2 ∧ ((i' < i) ∨ (i' = i ∧ j' ≤ j)) }.

The number of image elements of this relation can be computed as

#W(K, N, i, j) =
  i(N − 2) + j + 1   if (i, j, 1) ∈ ran M ∧ i < K − 2 ∧ j < N − 2
  (i + 1)(N − 2)     if (i, j, 1) ∈ ran M ∧ i < K − 2 ∧ j ≥ N − 2
  (K − 2)(N − 2)     if (i, j, 1) ∈ ran M ∧ i ≥ K − 2.

For the number of reads before a given read (i, j, 1), we similarly find

#R(K, N, i, j) = (i − 2)(N − 2) + j − 2   if (i, j, 1) ∈ ran M.

Taking the difference yields

n(K, N, i, j) =
  2(N − 2) + 3             if (i, j, 1) ∈ ran M ∧ i < K − 2 ∧ j < N − 2
  3(N − 2) − j + 2         if (i, j, 1) ∈ ran M ∧ i < K − 2 ∧ j ≥ N − 2
  (K − i)(N − 2) − j + 2   if (i, j, 1) ∈ ran M ∧ i ≥ K − 2.

The maximum over all reads in ran M occurs in the first domain and is equal to 2(N − 2) + 3 = 2N − 1, which is the value shown on the first edge in Figure 2.
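The symbolic bound of Example 22 can be cross-checked by simulating the merged schedule of Figure 12 and tracking the channel occupancy directly. The following C sketch is a sanity check we add for illustration; the values of K and N are fixed arbitrarily.

#include <stdio.h>

#define K 16
#define N 12

int main(void) {
    int occupancy = 0, max_occ = 0;
    for (int i = 0; i < K; i++)
        for (int j = 0; j < N; j++) {
            /* write a[i][j]; it travels over this channel only if it is
             * later read as a[i'-2][j'-2] by some Sobel iteration (i', j') */
            if (i < K - 2 && j < N - 2)
                occupancy++;
            if (i >= 2 && j >= 2) {          /* Sobel reads a[i-2][j-2] */
                if (occupancy > max_occ)
                    max_occ = occupancy;     /* peak occurs right before a read */
                occupancy--;
            }
        }
    printf("max occupancy = %d, bound 2N-1 = %d\n", max_occ, 2 * N - 1);
    return 0;
}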
8.2 Reordering Channels

For channels that have been identified as exhibiting reordering using the technique of Section 6.1, we need to choose where we want to perform the reordering of the tokens. One option is to perform the reordering inside the channel itself. In this case we can apply essentially the same technique as that of the previous section to compute an upper bound on the minimal number of locations needed to store all elements in the buffer at any given time. However, since the tokens now have to be reordered, we need to be able to address them somehow. An efficient mapping from read or write iterators to the internal buffer of the channel may require more
space than strictly necessary. We refer to [4] for an overview and a mathematical framework for finding good memory mappings.

Another option is to perform the reordering inside the reading process. In this case, a process wanting to read a value from the reordering channel first looks in its internal buffer associated with the channel. If the value is not in the buffer, it reads values from the channel in the order in which they were written, storing all values it does not need yet in the local buffer until it has read the value it is actually looking for. In other words, the original reordering channel is split into a FIFO channel and a local reordering buffer. There are therefore two buffer sizes to compute in this case.

Before embarking upon the computation of these buffer sizes, it should be noted that nothing really interesting happens to the buffers during iterations that do not read from the FIFO. In particular, the maximal number of elements in the FIFO buffer will be reached right before a read from the FIFO and the maximal number of elements in the reordering buffer will be reached right before the read of the value from the FIFO that the process is actually interested in, i.e., after it has copied all intermediate data to the reordering buffer. The uninteresting iterations U are those reads r for which there is an earlier read r' that reads something that was written after the value read by r. This latter value will have been put in the reordering buffer at or perhaps even before read r'. That is,

U = { r | ∃w, w', r' : (w, r) ∈ M ∧ (w', r') ∈ M ∧ w ≺ w' ∧ r' ≺ r }

and S = ran M \ U is the set of "interesting" iterations. For the first of these interesting iterations, i.e., the first read r∗ of a value from the FIFO, the number of tokens in the FIFO is equal to the number of writes that have occurred before that read, i.e., #W(s, r∗), where W is as defined in (18). For any other read r ∈ S, we will also have read some values from the FIFO. In particular, we will have read all values that were written up to and including the value that we needed in the previous read r'. Let W' map a given read r to all these previously read values. Then the number of tokens in the FIFO right before r is #W(s, r) − #W'(s, r). The relation W' can be computed as

W' = { r → w | ∃r', w' : r' ∈ S ∧ (r, r') ∈ P ∧ (w', r') ∈ M ∧ w ⪯ w' },

where P = P(S) is the relation that maps a read in S to the previous read in S, as defined in (17). As before, the relation P can be computed using Operation 1. As a side effect, we obtain a polyhedral set E containing the first read r∗. The counts can be computed using Operation 3 and finally Operation 4 needs to be applied to obtain a bound on the FIFO buffer size that is valid for all reads.

As for the internal buffer, the number of tokens in this buffer after a read r from the FIFO is equal to the number of subsequent reads that read something (from the internal buffer) that was written (to the FIFO) before the token that was actually needed by r, i.e.,
#{ r → r' | ∃w, w' : (w, r), (w', r') ∈ M ∧ w' ≺ w ∧ r ≺ r' }.

Again, this number can be computed using Operation 3 and a bound on the buffer size can then be computed using Operation 4.
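The second option, reordering inside the reading process, might be organized as in the following C sketch. This is an illustration under assumptions of ours: values arrive over the FIFO tagged with their write iteration, fifo_read is an assumed blocking read primitive, and BUF_CAP stands for the reordering-buffer bound computed above.

#define BUF_CAP 64   /* stands for the bound derived in this section */

static int local[BUF_CAP];     /* local reordering buffer, indexed by tag */
static int present[BUF_CAP];

/* Assumed blocking FIFO read: returns the next value in write order and
 * stores its write-iteration tag (0 <= tag < BUF_CAP) in *tag. */
extern int fifo_read(int *tag);

/* Fetch the value written at iteration 'want': values that arrive earlier
 * over the FIFO, but are not needed yet, go to the local buffer. */
int read_reordered(int want) {
    if (present[want]) {            /* already copied to the local buffer */
        present[want] = 0;
        return local[want];
    }
    for (;;) {
        int tag;
        int v = fifo_read(&tag);
        if (tag == want)
            return v;               /* the value we were looking for */
        local[tag] = v;             /* stash everything read too early */
        present[tag] = 1;
    }
}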
8.3 Accuracy of the Buffer Sizes

Although the buffer sizes computed using the methods described above are usually fairly close to the actual minimal buffer sizes, there are some possible sources of inaccuracies that we want to highlight in this section. These inaccuracies will always lead to over-approximations. That is, the computed bounds will always be safe. For internal channels, the only operation that can lead to over-approximations is Operation 4. There are three causes for inaccuracies in this operation: the underlying technique [3] is defined over the rationals rather than the integers; in its basic form it only handles polynomials and not quasi-polynomials; and the technique itself will only return the actual maximum (rather than just an upper bound) if the maximum occurs at one of the extremal points of the domain. For external channels, we rely on a scheduling step to effectively make them internal. Since we only consider a limited set of possible schedulings, the derived buffer sizes may in principle be much larger than the absolute minimal buffer sizes that still allow for a deadlock-free schedule.
9 Summary

In this chapter we have seen how to automatically construct a polyhedral process network from a sequential program that can be represented in the polyhedral model. The basic polyhedral tools used in this construction are parametric integer programming, emptiness check, parametric counting, computing parametric upper bounds and polyhedral scanning. The processes in the network correspond to the statements in the program, while the communication channels are computed using dependence analysis. Several types of channels can be identified by solving a number of emptiness checks and/or counting problems. Safe buffer sizes for the channels can be obtained by first computing a global schedule and then computing an upper bound on the number of elements in each channel at each iteration of this schedule.
References

1. Amarasinghe, S.P., Anderson, J.M., Lam, M.S., Tseng, C.W.: The SUIF compiler for scalable parallel machines. In: Proceedings of the Seventh SIAM Conference on Parallel Processing for Scientific Computing (1995)
2. Bastoul, C.: Code generation in the polyhedral model is easier than you think. In: PACT '04: Proceedings of the 13th International Conference on Parallel Architectures and Compilation Techniques, pp. 7–16. IEEE Computer Society, Washington, DC, USA (2004)
3. Clauss, P., Fernández, F.J., Gabervetsky, D., Verdoolaege, S.: Symbolic polynomial maximization over convex sets and its application to memory requirement estimation. IEEE Transactions on Very Large Scale Integration (VLSI) Systems (2009). Accepted
4. Darte, A., Schreiber, R., Villard, G.: Lattice-based memory allocation. IEEE Trans. Comput. 54(10), 1242–1257 (2005)
5. Devos, H., Verdoolaege, S., Van Campenhout, J., Stroobandt, D.: Bounds on quasipolynomials for static program analysis (2010). Manuscript in preparation
6. Feautrier, P.: Parametric integer programming. RAIRO Recherche Opérationnelle 22(3), 243–268 (1988)
7. Feautrier, P.: Dataflow analysis of array and scalar references. International Journal of Parallel Programming 20(1), 23–53 (1991)
8. Kahn, G.: The semantics of a simple language for parallel programming. In: Proc. of the IFIP Congress 74, pp. 471–475. North-Holland Publishing Co. (1974)
9. Kelly, W., Maslov, V., Pugh, W., Rosser, E., Shpeisman, T., Wonnacott, D.: The Omega library. Tech. rep., University of Maryland (1996)
10. Kelly, W., Pugh, W., Rosser, E.: Code generation for multiple mappings. In: Frontiers '95: Symposium on the Frontiers of Massively Parallel Computation. McLean (1995)
11. Kienhuis, B., Rijpkema, E., Deprettere, E.: Compaan: deriving process networks from Matlab for embedded signal processing architectures. In: CODES '00: Proceedings of the Eighth International Workshop on Hardware/Software Codesign, pp. 13–17. ACM Press, New York, NY, USA (2000)
12. Meister, B., Verdoolaege, S.: Polynomial approximations in the polytope model: Bringing the power of quasi-polynomials to the masses. In: J. Sankaran, T. Vander Aa (eds.) Digest of the 6th Workshop on Optimization for DSP and Embedded Systems, ODES-6 (2008)
13. Pop, S., Cohen, A., Bastoul, C., Girbal, S., Jouvelot, P., Silber, G.A., Vasilache, N.: GRAPHITE: Loop optimizations based on the polyhedral model for GCC. In: 4th GCC Developers' Summit. Ottawa, Canada (2006)
14. Schrijver, A.: Combinatorial Optimization – Polyhedra and Efficiency. Springer (2003)
15. Schweitz, E., Lethin, R., Leung, A., Meister, B.: R-Stream: A parametric high level compiler. In: J. Kepner (ed.) Proceedings of HPEC 2006, 10th Annual Workshop on High Performance Embedded Computing. Lincoln Labs, Lexington, MA (2006)
16. Turjan, A.: Compaan – A Process Network Parallelizing Compiler. VDM Verlag (2008)
17. Verdoolaege, S.: An integer set library for program analysis (2009). ACES symposium, Edegem, 7–8 September
18. Verdoolaege, S., Beyls, K., Bruynooghe, M., Catthoor, F.: Experiences with enumeration of integer projections of parametric polytopes. In: R. Bodik (ed.) Proceedings of the 14th International Conference on Compiler Construction, Edinburgh, Scotland, Lecture Notes in Computer Science, vol. 3443, pp. 91–105. Springer-Verlag, Berlin (2005)
19. Verdoolaege, S., Nikolov, H., Stefanov, T.: pn: A tool for improved derivation of process networks. EURASIP Journal on Embedded Systems, special issue on Embedded Digital Signal Processing Systems 2007 (2007)
20. Verdoolaege, S., Seghir, R., Beyls, K., Loechner, V., Bruynooghe, M.: Counting integer points in parametric polytopes using Barvinok's rational functions. Algorithmica 48(1), 37–66 (2007)
Kahn Process Networks and a Reactive Extension Marc Geilen and Twan Basten
Abstract Kahn and MacQueen have introduced a generic class of determinate asynchronous data-flow applications, called Kahn Process Networks (KPNs), with an elegant mathematical model and semantics in terms of Scott-continuous functions on data streams, together with an implementation model of independent asynchronous sequential programs communicating through FIFO buffers with blocking read and non-blocking write operations. The two are related by the Kahn Principle, which states that a realization according to the implementation model behaves as predicted by the mathematical function. Additional steps are required to arrive at an actual implementation of a KPN, to take care of scheduling of independent processes on a single processor and to manage communication buffers. Because of the expressiveness of the KPN model, buffer sizes and schedules cannot be determined at design time in general and require dynamic run-time system support. Constraints are discussed that need to be placed on such system support so as to maintain the Kahn Principle. We then discuss a possible extension of the KPN model to include the possibility of sporadic, reactive behavior, which is not possible in the standard model. The extended model is called Reactive Process Networks. We introduce its semantics, look at analyzability, and at more constrained data-flow models combined with reactive behavior.
Marc Geilen Eindhoven University of Technology, Den Dolech 2, Eindhoven, The Netherlands, e-mail: m.c. [email protected] Twan Basten Embedded Systems Institute and Eindhoven University of Technology, Den Dolech 2, Eindhoven, The Netherlands e-mail: [email protected]
1 Introduction

Process networks are a popular model to express behavior of a data-flow and streaming nature. This includes audio, video and 3D multimedia applications such as encoding and decoding of MPEG video streams. Using process networks, an application is modeled as a collection of concurrent processes communicating streams of data through FIFO channels. Process networks make task-level parallelism and communication explicit, have a simple semantics, are compositional and allow for efficient implementations without time-consuming synchronizations. There are several variants of process networks. One of the most general forms is the Kahn process network [26], [27], where the nodes are arbitrary sequential programs that communicate via channels of the process network with blocking read and non-blocking write operations. Although harder to analyze than more restricted models, such as Synchronous Data Flow networks [31], the added flexibility makes KPNs a popular programming model. Where synchronous data-flow models can be statically scheduled at compile time, KPNs must in general be scheduled dynamically, because their expressive power does not allow them to be statically analyzed. A run-time system is required to schedule the execution of processes and to manage memory usage for the channels. On a heterogeneous implementation platform, a KPN may be distributed over several components with individual scheduling and memory management domains. In this case, the execution of the process network has to be coordinated in a distributed fashion.

A process network is determinate if its input/output behavior can be expressed as a function. Kahn Process Networks represent the largest class of determinate dataflow process networks if we compare them based on the input/output functions they can express. The model abstracts from timing behavior and focusses on functional input/output behavior of a network of parallel processes. In this chapter we discuss the syntax and operational semantics of the model, a denotational semantics and the Kahn Principle, which relates them. The denotational semantics of KPNs of [26] is attractive from a mathematical point of view, because of its abstractness and compositionality. In a realization of a process network, FIFO sizes and contents play an important role and are influenced by a run-time environment that governs the execution of the network. It is for this reason that we present a simple operational semantics of process networks, similar to [16], [45]. We study and prove properties of the resulting transition system and we give a simple proof of the Kahn Principle in a bit of mathematical detail, not because these are new results, but because we believe that they provide insight in the fundamental properties and limitations of the Kahn model. We look at methods and requirements for directly implementing Kahn Process Networks and we look at possible extensions of Kahn Process Networks, in particular with the ability to express reactive behavior.
Fig. 1 Example: a Kahn Process Network computing the Fibonacci sequence
Example

Figure 1 shows an example of a Kahn Process Network. It consists of four processes which compute deterministic functions. Processes have inputs and outputs. A KPN has unbounded FIFO channels connecting an output of a process to an input of a process. Unconnected channels represent inputs or outputs of the KPN. The network in Figure 1 has no inputs and one output (dk). The process Add reads one integer number from each input (ck and fk, k ≥ 0) and writes the sum of both numbers to its output (ak = ck + fk). The processes Delayn first write the value n to their output and subsequently start copying their input to their output. The fourth process, Split, copies its input to three separate outputs, implying ck = dk = ek = bk. If the network executes, Delay0 and Delay1 write the values 0 and 1, respectively, to their outputs. The value 1 is copied by the Split actor, giving d0 = 1. Now the Add process has the values 0 and 1 on its inputs, adds them up and writes the value 1 to its output. This value is copied again by Delay1 and Split and the value 1 (= d1) is produced again on the output. It is easy to see that dk = ak−1 for k ≥ 1 and fk = dk−1 = ak−2 for k ≥ 2. Thus, dk = dk−1 + dk−2 with d0 = d1 = 1; the recurrence equation corresponds to the Fibonacci sequence and, because of the proper initialization by the Delayn processes, the KPN starts producing the Fibonacci sequence.

A more practical example of a KPN is shown in Figure 2. It shows a part of a pipeline of a JPEG decoder. The input stream to the network is a stream of compressed data; the Variable Length Decoder function turns it into a stream of macro-blocks of 8 by 8 pixels. Those blocks are subsequently passed through three functions transforming those blocks, undoing the quantization, the zig-zag reordering of data and the discrete cosine transformation, respectively. The corresponding mathematical functions are in practice specified by code, as illustrated for the Inverse Zig Zag function (see Section 6). In this case the function is an infinite loop because the function operates block by block on a stream of blocks. It uses a read operation to read a macro block from its input in, then does some processing on it and at the end uses a write operation to write the result to its output.
Fig. 2 KPN of JPEG decoder
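The shape of such a process is easy to picture in code. The following C sketch of the Inverse Zig Zag process body is ours (the actual code is given in Section 6); the channel primitives read_block and write_block, the Block type and the zig-zag index tables are assumptions standing in for the real interface.

typedef struct { int v[8][8]; } Block;

/* Assumed channel primitives: a blocking read from input channel 'in'
 * and a (conceptually non-blocking) write to output channel 'out'. */
extern Block read_block(void);
extern void write_block(Block b);

/* Standard zig-zag scan order, assumed given as two index tables. */
extern const int zigzag_row[64], zigzag_col[64];

/* Process body: an infinite loop operating block by block. */
void inverse_zigzag(void) {
    for (;;) {
        Block in = read_block();            /* read one macro block */
        Block out;
        for (int k = 0; k < 64; k++)        /* undo the zig-zag ordering */
            out.v[zigzag_row[k]][zigzag_col[k]] = in.v[k / 8][k % 8];
        write_block(out);                   /* write the reordered block */
    }
}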
Preliminaries

We introduce some mathematical notation and preliminaries. We use N to denote the natural numbers including 0. We assume the reader is familiar with the concept of a complete partial order (see for instance [14]). We use (X, ⊑) to denote a CPO on the set X, with ⊑ ⊆ X × X the corresponding partial order relation. We use ⊔D to denote the least upper bound of a directed subset D of X. For convenience, we assume a universal, countable, set Chan of channels and for every channel c ∈ Chan a corresponding finite channel alphabet Σc. We use Σ to denote the union of all channel alphabets, and A∗ (Aω) to denote the set of all finite (infinite) strings over alphabet A, with A∗,ω = A∗ ∪ Aω. ⊑ also denotes the prefix relation on strings (a complete partial order, see for instance [14]). If σ and τ are strings, σ · τ denotes the usual concatenation of the strings. If σ ⊑ τ then τ − σ denotes the string τ without its prefix σ. A history of a channel denotes the sequence of data elements communicated along a channel [43].

Definition 1. (HISTORY) A history h of a set C of channels is a mapping from channels c ∈ C to strings over Σc. The set of all histories of C is denoted as H(C). If h ∈ H(C) and D ⊆ C, then h|D denotes the history obtained from h by restricting the domain to D. If h1 is a history of C1 and h2 is a history of C2, we write h1 ⊑ h2 iff C1 ⊆ C2 and for every c ∈ C1, h1(c) ⊑ h2(c).

The set of histories together with the relation ⊑ on histories forms a complete partial order with as bottom element the empty history ∅. If h1 ⊑ h2, then h2 − h1 denotes the history which maps a channel c ∈ C1 to h2(c) − h1(c). Histories h1 and h2 are called consistent iff they share an upper bound, i.e., if there exists some history h3 such that h1 ⊑ h3 and h2 ⊑ h3. If h1, h2 ∈ H(C), then the concatenation h1 · h2 is the history such that h1 · h2(c) = h1(c) · h2(c) for all c ∈ C. A history is called finite if it maps every channel in its domain to a finite string and only a finite number of channels to a non-empty string. A finite history is a finite element of the CPO of histories. (Recall from [14] that an element k ∈ X of a CPO (X, ⊑) is a finite element iff for every chain D ⊆ X, if k ⊑ ⊔D then there is some d ∈ D such that k ⊑ d.) This is expressed by the following proposition, which is straightforward to prove (see [17]).
Proposition 1. A finite history is a finite element of the CPO of histories.
2 Denotational Semantics

We discuss the formal definition of the semantics, i.e., the behavior, of a KPN. Traditionally, the focus is on the functional input/output behavior of the network, abstracting from the timing. More precisely, we are interested in a function relating the output produced by the network to the input provided to the network. We first concentrate on a denotational semantics, focussing on what input/output relation a KPN computes, and later, in Section 3, on an operational semantics, which also captures how that output is computed. The denotational semantics of KPNs [26] defines the behavior of a KPN as a Scott-continuous input/output function on the input and output stream histories. Assuming that the (Scott-continuous) input/output functions of the individual processes are known, the input/output function of the KPN as a whole can be defined as a least fixed-point of a collection of equations specifying the relations between channel histories based on the processes. A KPN process with m inputs and n outputs is a function f : (Σ∗,ω)m → (Σ∗,ω)n with the characteristic (Scott-continuity) that it preserves least upper bounds: f(⊔ik) = ⊔f(ik). This implies that such a process is monotone: if i ⊑ j then f(i) ⊑ f(j). Because processes may exhibit memory (the current output may depend on input from the past), the function is defined in terms of the sequences representing the entire input history of the channel. The functions of the individual processes of the Fibonacci example can be defined as follows. Delayn : N∗,ω → N∗,ω; i ∈ N∗,ω; n ∈ {0, 1}:

Delayn(i) = n · i    (1)
Add : (N∗,ω)2 → N∗,ω is defined inductively by the following equations (i, j ∈ N∗,ω; x, y ∈ N; ε the empty string):

Add(i, ε) = ε
Add(ε, i) = ε    (2)
Add(x · i, y · j) = (x + y) · Add(i, j)

(Strictly speaking, the inductive definition only defines the function for finite sequences, but we use the convention that the functions are continuous, f(⊔ik) = ⊔f(ik), to extend the definitions to infinite strings.) The FIFO channels in a KPN connect output streams of processes to input streams of other processes, creating relationships between the functions of the processes. In the example, the streams a, b, c, d, e and f are governed by the joint equations of the processes:
b = Delay1(a)
f = Delay0(e)
a = Add(c, f)    (3)
c = b, d = b, e = b
One can show that this set of equations has only a single, unique, solution, namely the Fibonacci numbers, for its output sequence d.

Network Equations

We show how, in general, the semantics of a KPN is defined in a similar way as illustrated for the Fibonacci example above. In [26], Kahn presented the denotational semantics of process networks as the solution to a set of equations that capture the input/output relations of the individual processes and the way they are connected in the network. These equations are called the network equations. We define a KPN by a set P of processes and a set S of streams, partitioned into inputs I, outputs O and internal channels C (thus S = I ∪ O ∪ C, I ∩ O = ∅, I ∩ C = ∅ and O ∩ C = ∅). The processes p, with inputs Ip and outputs Op, that constitute the network are described by (Scott-continuous) functions fp : H(Ip) → H(Op) that map input histories to output histories of the processes. The network as a whole can be described as a function, defined by Kahn's network equations as follows. For h ∈ H(C ∪ I ∪ O) to be a valid history describing a behavior of the whole KPN (by describing the data communicated on each of its channels), it must satisfy the network equations derived from the processes p ∈ P:

h|Op = fp(h|Ip) for all p ∈ P

This expresses that if we take from h the channels that are the inputs and outputs of p, then these must be related corresponding to the function fp. To characterize the behavior of the KPN as a function f that maps an input history i to a history of all channels (including the output channels, which we are after) in the KPN, f : H(I) → H(I ∪ C ∪ O), we can derive the following equations (substituting f(i) for h and adding an equation for the input channels):

f(i)|I = i
f(i)|Op = fp(f(i)|Ip) for all p ∈ P    (4)

Note that C ∪ O = ⋃p∈P Op; every internal or output channel of the KPN is an output channel of one of the processes of the KPN. Hence, all channels are defined by these equations. Equation (4) is a recursive equation and in general, there may be many functions f that satisfy this equation, but only one that corresponds to a behavior respecting causality (where, intuitively, symbols are produced before they are consumed). The causal solution is the smallest function f that satisfies the network equations. This can be obtained as the least fixed-point of an appropriate functional.
From Equation (4) we derive the functional

Φ : (H(I) → H(I ∪ C ∪ O)) → (H(I) → H(I ∪ C ∪ O))

defined as (compare Equation (4)):

Φ(f)(i)|I = i
Φ(f)(i)|Op = fp(f(i)|Ip) for all p ∈ P    (5)
Clearly, a fixed-point of Φ satisfies Equation (4). Moreover, Φ is a continuous function on a CPO and according to Kleene's fixed-point theorem [14] it has a least fixed-point, the limit of the ascending Kleene chain (Kleene iteration), ⊔{ Φ^k(hε) | k ≥ 0 }, where hε denotes the empty history associating an empty string with every channel. Therefore, Kleene iteration provides a way of constructively computing the network behavior, by updating the output of a process whenever its input has changed. If this procedure is repeated, then either it stops and the complete output has been computed, or it continues forever and the infinite output equals the least upper bound of the sequence of outputs computed during the procedure. Kleene iteration for the Fibonacci example (up to 12 steps, although the process continues ad infinitum) is illustrated in Table 1 and shows how it converges to the infinite Fibonacci sequence.

Table 1 Kleene iteration of the Fibonacci example

k    | 0 | 1 | 2 | 3   | 4   | 5   | 6     | 7     | 8     | 9       | 10      | 11      | 12
a(k) |   |   |   | 1   | 1   | 1   | 1.2   | 1.2   | 1.2   | 1.2.3   | 1.2.3   | 1.2.3   | 1.2.3.5
b(k) |   | 1 | 1 | 1   | 1.1 | 1.1 | 1.1   | 1.1.2 | 1.1.2 | 1.1.2   | 1.1.2.3 | 1.1.2.3 | 1.1.2.3
c(k) |   |   | 1 | 1   | 1   | 1.1 | 1.1   | 1.1   | 1.1.2 | 1.1.2   | 1.1.2   | 1.1.2.3 | 1.1.2.3
d(k) |   |   | 1 | 1   | 1   | 1.1 | 1.1   | 1.1   | 1.1.2 | 1.1.2   | 1.1.2   | 1.1.2.3 | 1.1.2.3
e(k) |   |   | 1 | 1   | 1   | 1.1 | 1.1   | 1.1   | 1.1.2 | 1.1.2   | 1.1.2   | 1.1.2.3 | 1.1.2.3
f(k) |   | 0 | 0 | 0.1 | 0.1 | 0.1 | 0.1.1 | 0.1.1 | 0.1.1 | 0.1.1.2 | 0.1.1.2 | 0.1.1.2 | 0.1.1.2.3
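Kleene iteration is also easy to carry out mechanically. The following C sketch is ours; the stream lengths and the step count are bounded arbitrarily. It applies the functional Φ of the Fibonacci network repeatedly, reproducing the columns of Table 1 and, on the stream d, the prefix 1 1 2 3 of the Fibonacci sequence.

#include <stdio.h>
#include <string.h>

#define STEPS 12
#define MAXLEN 32

typedef struct { int len; int s[MAXLEN]; } Stream;

static Stream delay(int n, Stream i) {      /* Delay_n(i) = n . i, Eq. (1) */
    Stream o;
    o.len = i.len + 1;
    o.s[0] = n;
    memcpy(o.s + 1, i.s, i.len * sizeof(int));
    return o;
}

static Stream add(Stream i, Stream j) {     /* pointwise sum, Eq. (2) */
    Stream o;
    o.len = i.len < j.len ? i.len : j.len;
    for (int k = 0; k < o.len; k++)
        o.s[k] = i.s[k] + j.s[k];
    return o;
}

int main(void) {
    Stream a = {0}, b = {0}, c = {0}, d = {0}, e = {0}, f = {0};
    for (int k = 1; k <= STEPS; k++) {      /* one application of Phi per step; */
        Stream a1 = add(c, f);              /* every right-hand side uses the   */
        Stream b1 = delay(1, a);            /* previous approximation, exactly  */
        Stream f1 = delay(0, e);            /* as in the columns of Table 1     */
        Stream c1 = b, d1 = b, e1 = b;
        a = a1; b = b1; c = c1; d = d1; e = e1; f = f1;
    }
    for (int k = 0; k < d.len; k++)
        printf("%d ", d.s[k]);              /* prints: 1 1 2 3 */
    printf("\n");
    return 0;
}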
3 Operational Semantics

The KPN denotational semantics specifies the input/output behavior of a KPN in an elegant, abstract way. This description however is far away from the actual operation or implementation of a KPN. Formal reasoning about implementations or run-time systems for KPNs can be easier based on an operational semantics. For instance, the denotational semantics provides no information to reason about buffer sizes required for the FIFO communication, or reasoning about potential deadlocks. In this section, we give a compositional operational semantics to hierarchical KPNs.
3.1 Labeled transition systems

We give an operational semantics to KPNs in the form of a labeled transition system (LTS). We use, more specifically, an LTS with an initial state, with designated input and output actions in the form of reads and writes of symbols on channels, as well as internal actions.

Definition 2. (LTS) An LTS is a tuple (S, s0, I, O, Act, →) consisting of a (countable) set S of states, an initial state s0 ∈ S, a set I ⊆ Chan of input channels, a set O ⊆ Chan (disjoint from I) of output channels, a set Act of actions consisting of input actions {c?a | c ∈ I, a ∈ Σc} ⊆ Act, output actions {c!a | c ∈ O, a ∈ Σc} ⊆ Act and (possibly) internal actions (all other actions), and a labeled transition relation → ⊆ S × Act × S describing possible transitions between states.

Thus, c!a is a write action to channel c with symbol a; c?a models passing of a symbol from input channel c to the LTS. We write s1 →α s2 if (s1, α, s2) ∈ → and s1 →α if there is some s2 ∈ S such that s1 →α s2. With a write operation, the symbol on the output channel is determined by the LTS. With a read operation, the symbol that appears on the input channel is determined by the environment of the LTS. Therefore, a read operation is modeled with a set of input actions that provides a transition for every possible symbol of the alphabet. If Act is a set of actions and C ⊆ Chan a set of channels, we write Act|C to denote {c!a, c?a ∈ Act | c ∈ C}, i.e., actions of Act on channels in C.

An execution of the transition system is a sequence ρ = s0 →α0 s1 →α1 ... of states si ∈ S and actions αi ∈ Act, such that si →αi si+1 for all i ≥ 0 (up to the length of the execution). If ρ is such an execution, then we use |ρ| ∈ N ∪ {ω} to denote the length of the execution: |ρ| = ω if ρ is infinite and |ρ| = n if ρ = s0 →α0 s1 →α1 ... →αn−1 sn. For k ≤ |ρ|, we use ρ^k to denote the prefix of the execution up to and including state sk. If s0 →α0 s1 →α1 ... →αn−1 sn, we write s0 →a sn, where a = α0 · α1 · ... · αn−1.

From a given execution ρ with actions a = α0 · α1 · ..., we extract the consumed input and the produced output on a set D ⊆ Chan of channels as follows.
• For a channel c ∈ D, a?c is a (finite or infinite) string over Σc that results from projecting a onto read actions on c;
• similarly, a!c is a (finite or infinite) string over Σc that results from projecting a onto write actions on c;
• finally, input history a?D = {(c, a?c) | c ∈ D} and output history a!D = {(c, a!c) | c ∈ D}.

Furthermore, we use the same notation for executions ρ with actions a: ρ?c = a?c, ρ!c = a!c, ρ?D = a?D and ρ!D = a!D. Thus ρ?I denotes the input consumed by the network in execution ρ and ρ!O denotes the output produced by the network. Note that these history projection operators are continuous w.r.t. the CPOs of strings, histories and executions. The I/O-history h(ρ) of an execution ρ is ρ?I ∪ ρ!O (again, if a is the string of actions executed in execution ρ, h(a) = h(ρ)).
To reason about the input offered to the network (consumed or not consumed), we say that σ is an execution with input i : I → Σ∗ iff σ?I ⊑ i (the consumed data is consistent with i, but need not be all of i). Note that if σ is an execution with input i and i ⊑ j, then σ is also an execution with input j. Executions in general may be only partially completed, or they may be unrealistic because certain actions are systematically being ignored. To be able to exclude such executions when necessary, we need the notions of maximality and fairness.

Definition 3. Let σ = s0 −λ0→ s1 −λ1→ ... be an execution of the LTS.
• (MAXIMALITY) Execution σ with input i is called maximal iff it is infinite, or in its last state only read actions on input channels from which all input of i has been consumed are possible, i.e., if |σ| = n and sn −λ→, then λ = c?a for some c ∈ I and a ∈ Σc with σ?c = i(c).
• (FAIRNESS) Execution σ is fair with input i iff it is finite, or it is infinite and if at some point an action is enabled, it is eventually executed or disabled (we will see that the latter does not occur for KPNs), i.e.,
– if for some n ∈ ℕ and internal or output action λ, sn −λ→, then there is some k ≥ n such that λk = λ or λ is not enabled in sk;
– if for some n ∈ ℕ, c ∈ I and a ∈ Σc, sn −c?a→, and i(c) = (σ^n?c) · a · w for some w ∈ Σc∗, then there is some k ≥ n such that λk = c?a or c?a is not enabled in sk.

Proposition 2. Every finite execution with input i of an LTS can be extended to a fair and maximal execution with input i.

Proof. One can keep adding enabled actions in a fair (e.g., round-robin) way until there are no more enabled actions, or towards an infinite execution.
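The round-robin construction used in this proof can be sketched in code. The sketch below is a schematic rendering under our own assumptions: the LTS is abstracted by an `enabled` function per action source and a `step` function, and a conceptually infinite execution is truncated by `max_steps`; none of these names come from the chapter.

```python
# Round-robin extension of an execution, in the spirit of the proof of
# Proposition 2: cycle over the action sources (e.g. processes) and
# execute an enabled action of each in turn, so that no continuously
# enabled source is ignored forever.

def extend_fairly(state, sources, enabled, step, max_steps=100):
    trace = []
    k = 0
    while k < max_steps:
        src = sources[k % len(sources)]        # whose turn it is
        acts = enabled(state, src)
        if acts:
            state = step(state, acts[0])
            trace.append(acts[0])
        elif not any(enabled(state, s) for s in sources):
            break                              # maximal: nothing enabled
        k += 1
    return state, trace

# Demo: two sources that can always emit a token; round-robin serves both.
_, tr = extend_fairly(0, ["p", "q"],
                      enabled=lambda s, src: [src + "!tok"],
                      step=lambda s, a: s + 1, max_steps=4)
assert tr == ["p!tok", "q!tok", "p!tok", "q!tok"]
```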
We can describe the externally observable behavior of a labeled transition system by relating the output actions the LTS produces to the input actions provided to the network. In general, this gives a relation between input histories and output histories. We restrict attention to 'proper' executions in the sense that only maximal and fair executions are taken into account.

Definition 4. (INPUT/OUTPUT RELATION) The input/output relation IO of an LTS is the relation {(i, σ!O) | σ is a maximal and fair execution with input i}.

Determinacy. In general, the input/output relation is too abstract to adequately characterize the behavior of an LTS. If it is non-deterministic, the order in which input is consumed or output is produced can be relevant (see Section 7). We will see that Kahn process networks do not exhibit non-determinism. Transitions are deterministic and if multiple actions are available at the same time, then they are independent (i.e., they can be executed in any order with an identical result). This leads to a special type of LTS, which we call determinate. Figure 3 shows the beginning of the LTS corresponding to the Fibonacci example. There is only very limited choice in choosing a
path from the initial top-left state, and any path we choose executes exactly the same actions in a slightly modified order caused by concurrency in the process network. We give a precise definition of this type of LTS and summarize its properties.

Definition 5. (DETERMINACY) LTS (S, s0, I, O, Act, →) is determinate iff for any s, s1, s2 ∈ S, λ1, λ2 ∈ Act, if s −λ1→ s1 and s −λ2→ s2, the following hold:
1. (DETERMINISM) if λ1 = λ2 is some input or output action, then s1 = s2, i.e., executing a particular action has a unique, deterministic result;
2. (CONFLUENCE) if λ1 and λ2 are not two input actions on the same channel (i.e., instances of the same read operation), then there is some s3 such that s1 −λ2→ s3 and s2 −λ1→ s3;
3. (INPUT COMPLETENESS) if λ1 = c?a for some c ∈ I, then for every a′ ∈ Σc, s −c?a′→, i.e., input symbols are completely defined by the environment; the LTS cannot be selective in the choice of symbols it accepts;
4. (OUTPUT UNIQUENESS) if λ1 = c!a and λ2 = c!a′ for some c ∈ O, then a = a′, i.e., output symbols are completely determined by the transition system. Here, the environment cannot be selective in its choice of symbol to receive.

A sequential LTS is a determinate LTS with the additional property that
5. (SEQUENTIALITY) if λ1 = c†a for some † ∈ {!, ?} and c ∈ I ∪ O (i.e., some read or write operation), then λ2 = c†a′ for some a′ ∈ Σc, i.e., the LTS accepts at most one input/output operation at any point in time and no other (for instance internal) actions.

A determinate LTS may have multiple actions enabled at the same time, but the order in which they are taken has no influence on the consumed input or produced output of the network in a fair and maximal execution. Two different executions are essentially the same, because the actions of one can be reordered, without changing the input or output histories, to obtain the other execution.

Proposition 3. The input/output relation of a determinate labeled transition system is a continuous function.

A detailed proof can be found in [17]. If T is a labeled transition system, we use f_T to denote its I/O relation. In particular, if T is determinate, this is the I/O function.

An important property in process networks is a deadlock condition. We can define a deadlock as the possibility to reach a state from which no further transitions are possible, a finite maximal execution. It is usually called a deadlock only if this is an unwanted situation. An important corollary from the analysis above is that if a determinate LTS has some execution which leads to a deadlock state, then all of its executions lead to the same deadlock state. In other words, the deadlock cannot be avoided by a scheduling strategy; it is inherent to the determinate LTS specification and, as we will see, thus also to KPNs.
3.2 Operational Semantics

We can now formalize an operational semantics of a KPN as a determinate LTS.

Definition 6. (KAHN PROCESS NETWORK) A Kahn process network is a tuple (P, C, I, O, Act, {T_p | p ∈ P}) that consists of the following elements.
• A finite set P of processes.
• A finite set C ⊆ Chan of internal channels, a finite set I ⊆ Chan of input channels and a finite set O ⊆ Chan of output channels, all distinct.
• Every process p ∈ P is defined by a determinate labeled transition system T_p = (S_p, s_{p,0}, I_p, O_p, Act_p, →_p), with I_p ⊆ I ∪ C and O_p ⊆ O ∪ C. The sets Act_p \ (Act_p|(I_p ∪ O_p)) of internal actions of the processes are disjoint.
• The set Act of actions consists of the actions of the constituent processes: Act = ⋃_{p∈P} Act_p.
• For every channel c ∈ C ∪ I, there is exactly one process p ∈ P that reads from it (c ∈ I_p) and for every channel c ∈ C ∪ O, there is exactly one process p ∈ P that writes to it (c ∈ O_p).

To define the operational semantics of a KPN, we need a notion of global state of the network; this state is composed of the individual states of the processes and the contents of the internal channels.

Definition 7. (CONFIGURATION) A configuration of the process network is a pair (π, γ) consisting of a process state π and a channel state γ, where
• a process state π : P → S = ⋃_{p∈P} S_p is a function that maps every process p ∈ P onto a local state π(p) ∈ S_p of its transition system;
• a channel state γ : C → Σ∗ is a history function that maps every internal channel c ∈ C onto a finite string γ(c) over Σc.

The set of all configurations is denoted by Confs and there is a designated initial configuration c0 = (π0, γ0), where π0 maps every process p ∈ P to its initial state s_{p,0} and γ0 maps every channel c ∈ C to the empty string ε.

Definition 8. (OPERATIONAL SEMANTICS OF A KPN) We assign to a KPN N = (P, C, I, O, Act, {T_p | p ∈ P}) an operational semantics in the form of an LTS (Confs, c0, I, O, Act, →). The labeled transition relation → is inductively defined by the following five rules (given in Plotkin style [41]; if the condition above the rule is satisfied, the conclusion below is also valid). For reading from and writing to internal channels by processes, we have the following two rules, respectively:
    π(p) −c?a→_p s,  γ(c) = a·w,  c ∈ C
    ------------------------------------------
    (π, γ) −c?a→ (π{s/p}, γ{w/c})

    π(p) −c!a→_p s,  γ(c) = w,  c ∈ C
    ------------------------------------------
    (π, γ) −c!a→ (π{s/p}, γ{w·a/c})

Input channels and output channels are open to the environment:

    π(p) −c?a→_p s,  c ∈ I
    ------------------------------------------
    (π, γ) −c?a→ (π{s/p}, γ)

    π(p) −c!a→_p s,  c ∈ O
    ------------------------------------------
    (π, γ) −c!a→ (π{s/p}, γ)
Fig. 3 Labeled transition system of the Fibonacci example
Individual processes may perform internal actions:

    π(p) −λ→_p s,  λ ∉ Act_p|(I_p ∪ O_p)
    ------------------------------------------
    (π, γ) −λ→ (π{s/p}, γ)
This LTS is denoted LTS(N). Note that it is possible to generalize the five rules into a single rule, albeit a slightly more cryptic one:

    π(p) −λ→_p s,  λ?C ⊑ γ
    ------------------------------------------
    (π, γ) −λ→ (π{s/p}, (γ − λ?C) · λ!C)

Here, λ?C is the input (at most one symbol) that λ consumes from the internal channels, λ!C is the output it produces on them, and the prefix removal and concatenation are applied per channel.
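As an illustration of these rules, the following Python sketch enumerates the enabled global transitions from a configuration. It assumes our own encoding: local transitions are given per process and local state as (action, next-state) pairs, with actions encoded as ("?", c, a), ("!", c, a) or ("tau", None, None), and channel contents as lists.

```python
# Enumerate the global transitions of Definition 8 from a configuration
# (pi, gamma). The five rules are folded into one case analysis, in the
# spirit of the generalized rule above.

def enabled_global(pi, gamma, local_trans, C, I, O):
    for p, s in pi.items():
        for (kind, c, a), s2 in local_trans[p].get(s, []):
            pi2 = dict(pi)
            pi2[p] = s2                               # pi{s/p}
            if kind == "?" and c in C:                # read internal channel
                if gamma[c] and gamma[c][0] == a:     # gamma(c) = a . w
                    g2 = dict(gamma)
                    g2[c] = gamma[c][1:]              # gamma{w/c}
                    yield (kind, c, a), (pi2, g2)
            elif kind == "!" and c in C:              # write internal channel
                g2 = dict(gamma)
                g2[c] = gamma[c] + [a]                # gamma{w.a/c}
                yield (kind, c, a), (pi2, g2)
            elif (kind == "?" and c in I) or (kind == "!" and c in O):
                yield (kind, c, a), (pi2, dict(gamma))  # open to environment
            elif kind == "tau":                       # internal action
                yield (kind, c, a), (pi2, dict(gamma))

# Demo: p can write 1 to internal channel c; q could then read it.
lt = {"p": {0: [(("!", "c", 1), 1)]}, "q": {0: [(("?", "c", 1), 1)]}}
steps = list(enabled_global({"p": 0, "q": 0}, {"c": []},
                            lt, {"c"}, set(), set()))
assert steps[0][0] == ("!", "c", 1)   # only the write is enabled initially
```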
We need to show that the LTS we have just defined is indeed determinate.

Proposition 4. The labeled transition system of a KPN is determinate.

Proof. We check the four properties of a determinate LTS.
• Determinism follows from determinism of the process that accepts or produces the input or output action, respectively.
• Confluence. If both actions originate from different processes, it can be checked that they cannot disable each other. If they originate from the same process, it follows from confluence of that process.
• Input completeness follows immediately from input completeness of the constituent processes.
• Output uniqueness similarly follows directly from output uniqueness of the process delivering the output.
Proposition 4 shows that the LTS of the KPN has the same property as the individual processes. The presented semantics of KPN is compositional: a determinate process network is constructed from individual determinate processes and can itself be used as a process in a larger KPN. This way, we can hierarchically construct KPNs. At the lowest level, we can start with primitive processes, for instance sequential processes, which are typically implemented by sequential programs with read and write operations.
If N is a KPN, then we use f_N to denote the I/O function realized by that KPN. In the remainder, we assume that N = (P, C, I, O, Act, {T_p | p ∈ P}) is a Kahn process network with LTS (Confs, c0, I, O, Act, →).

Using the operational semantics of KPN, we can reason about buffer capacities for the FIFOs connecting the processes. Abstracting from specific buffer sizes, we can also consider the sufficiency of finite buffer capacities. In the Fibonacci example of Figure 3, there exist executions which require a buffer size of 2 on channel e: for instance, if we take the path along the upper envelope of the picture, two write actions occur on channel e before the first read action. The path along the bottom envelope of the picture requires only a buffer size of 1 for channel e, because the first read occurs before the second write.

Definition 9. (BOUNDED EXECUTION) An execution σ is bounded iff there exists a mapping B : C → ℕ such that at any state (π, γ) of σ, |γ(c)| ≤ B(c) for all c ∈ C.

Not every KPN allows a bounded execution. Some KPNs may have both bounded and unbounded executions. This is relevant for realizations of KPNs, as discussed in Section 6.
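A bound as in Definition 9 can be checked on a finite execution prefix by replaying its internal-channel actions, as in the following sketch; the action encoding and the Fibonacci-style traces on channel e are illustrative assumptions.

```python
# Replay reads and writes on internal channels and record the peak
# occupancy of each channel; a finite prefix is bounded by B iff
# peak[c] <= B(c) for all channels c.

def peak_occupancy(trace, channels):
    occ = {c: 0 for c in channels}
    peak = {c: 0 for c in channels}
    for kind, c in trace:
        if c in occ:
            occ[c] += 1 if kind == "!" else -1
            peak[c] = max(peak[c], occ[c])
    return peak

# The two interleavings discussed above for channel e of Figure 3:
assert peak_occupancy([("!", "e"), ("!", "e"), ("?", "e")], {"e"})["e"] == 2
assert peak_occupancy([("!", "e"), ("?", "e"), ("!", "e")], {"e"})["e"] == 1
```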
4 The Kahn Principle

The operational semantics given in the previous section is a model closer to a realization of a KPN than the denotational semantics of Section 2. The denotational semantics defines behavior as the least solution to a set of network equations [26]. The correspondence between both semantics, demonstrating that they are consistent, is referred to as the Kahn Principle. It was stated convincingly, but without proof, by Kahn in [26] and was later proved by Faustini [16] for an operational model of process networks, in [45] for an operational model of concurrent transition systems, and in [35] for an operational characterization using I/O automata.

Based on the operational semantics, a determinate LTS, by Proposition 3, a functional relation is obtained between inputs and outputs (as defined by Definition 4). This function is shown to correspond to the least solution of Kahn's network equations. The proof presented here is similar to the proof of the Kahn Principle for I/O automata of [35]. We reproduce it here in outline, because it illustrates the connections between denotational and operational semantics and the essential properties of the KPN model.

We first show that if the operational behaviors of the individual processes of the KPN respect their functional specifications, then so does the KPN as a whole. For an execution σ with input i, we use the notation h(σ, i) to denote the history identical to h(σ) except for the input channels, which are mapped according to i, since some of the input of i offered to the network may not (yet) have been consumed. Thus h(σ, i) is equal to the mapping i ∪ σ!(C ∪ O). We can derive from an execution of the overall KPN how individual processes have contributed to that execution. σ|p denotes execution σ projected on process p. If

    σ = (π0, γ0) −λ0→ (π1, γ1) −λ1→ (π2, γ2) −λ2→ ...,

then

    σ|p = π_{n0}(p) −λ_{n0}→ π_{n1}(p) −λ_{n1}→ π_{n2}(p) −λ_{n2}→ ...,

where n0, n1, etcetera are such that n0 < n1 < ... and λ_{n0}, λ_{n1}, ... are precisely the actions from process p.

Lemma 1. If σ is a fair and maximal execution of a KPN with input i and p is a process of the KPN, then σ|p is a fair and maximal execution of T_p with input h(σ, i)|I_p.

Proof. That σ|p is an execution of T_p follows from the fact that if the KPN executes an action λ
not from p, then the configuration does not change with respect to the state of p. If λ does originate from p, then from (π, γ) −λ→ (π′, γ′), it follows that π(p) −λ→_p π′(p). Fairness follows from the fact that an enabled read or write operation of the process induces an enabled action of the KPN. Fairness of the execution of the KPN prescribes that the action is executed at some point in σ and hence also in σ|p. Similarly, maximality is obtained from maximality of the execution of the network.
Lemma 2. For every fair and maximal execution σ with input i, the history h(σ, i) satisfies the network equations.

Proof. Let σ be a fair and maximal execution with input i. It follows using Lemma 1 that h(σ, i)|O_p = f_p(h(σ, i)|I_p). Thus, h(σ, i) satisfies the network equations.
Lemma 3. A history corresponding to a fair and maximal execution of the KPN with input i corresponds to the smallest solution to the network equations with input i.

Proof. The history is unique (i.e., independent of the execution) by Proposition 3. In Lemma 2, we proved that it is a solution to the network equations. We have to prove that every solution to the network equations is an upper bound of the history of the execution. Let σ be a fair and maximal execution of the KPN with input i and let h be any history satisfying the network equations such that h|I = i. It suffices to prove that h is an upper bound of the history of every finite prefix σ′ of the execution, h(σ′, i) ⊑ h, proved by induction on the length of the execution. This is trivial for the empty execution (π0, γ0); we proceed with the induction step. Let σ′′ = σ′ −λ→ (πn, γn).
• If λ is internal to one of the processes or an input action of one of the processes, then h(σ′′, i) = h(σ′, i) and the result follows by the induction hypothesis.
• If λ is an output action of some process p, then by the induction hypothesis, h(σ′, i)|I_p ⊑ h|I_p. By monotonicity of f_p and the fact that h satisfies the network equations, it follows that f_p(h(σ′, i)|I_p) ⊑ f_p(h|I_p) = h|O_p. By monotonicity of f_p and σ′?I_p ⊑ h(σ′, i)|I_p, we also have f_p(σ′?I_p) ⊑ f_p(h(σ′, i)|I_p). Combining everything, we then have, using Lemmas 1 and 2, that h(σ′′, i)|O_p ⊑ f_p(σ′′?I_p) = f_p(σ′?I_p) ⊑ f_p(h(σ′, i)|I_p) ⊑ h|O_p. Hence, it follows that h(σ′′, i) ⊑ h.
Theorem 1. (KAHN PRINCIPLE) The I/O relation of a KPN, derived from the operational semantics, is a continuous function which corresponds to the denotational semantics of the KPN, i.e., to the least fixed point of the functional defined in Section 2.

Proof. It follows from Lemma 3 that for every input i, a fair and maximal execution of the KPN with input i yields the smallest channel history that satisfies the network equations. The least fixed point of the functional is the function that assigns to any input i precisely that smallest history.
5 Analyzability Results

Kahn Process Networks are a very expressive model of computation, despite the fact that the model only allows the specification of functional, determinate systems. In particular, compared to more restricted data-flow models such as Synchronous Data Flow, it allows the rates at which a process communicates on its outputs to be data dependent. Combined with the unlimited storage capacity of its unbounded FIFO buffers, this makes KPN an expressive model.

Expressiveness is usually in direct conflict with analyzability, the ability to (statically, off-line) analyze an application for its properties, such as deadlock-freedom (for buffers of a given, possibly unbounded, capacity), equivalence, minimum throughput (when adding timing information), static scheduling, and so on. It is known, for instance [12], that the problem to decide for a given KPN (with unbounded buffer capacities) whether it is deadlock-free is undecidable. The proof of this fact [12] relies on a reduction from the Halting Problem for Turing machines. It is shown that a Kahn process network can simulate a Universal Turing Machine; in other words, KPN is Turing complete. This allows the Halting Problem for Turing machines to be reduced to deadlock-freedom of KPNs. Because the former is known to be an undecidable problem, the latter must be undecidable as well. The details of the translation from Turing machines to KPNs are given in [12].

We can formalize the question of deadlock-free execution in the semantic framework of this chapter as follows.

Definition 10. (DEADLOCK) Given a KPN, is there some input i such that the KPN permits a finite, maximal execution with input i?

Note that in this case any maximal execution with input i is finite. Moreover, note that this does not always indicate a problem. Some KPNs may be designed to have only finite executions. The boundedness question can be formalized as follows.

Definition 11. (BOUNDEDNESS) Given a KPN with input/output function f, channels C and input channels I, do there exist finite channel capacities B : C → ℕ such that for every input history i, there is an execution σ_i such that h(σ_i, i)|O = f(i) and at any state (π, γ) of σ_i, |γ(c)| ≤ B(c) for all c ∈ C?

A slightly weaker form of boundedness can also be defined, in which for every input i there is a bounded execution, but there need not necessarily be a single bound for all inputs. A corresponding optimization problem, the buffer sizing problem, would be to try to find minimal such channel capacities. These capacities are not computable, because of the undecidability of deadlock-freedom.

In the same way, most non-trivial questions about KPNs are undecidable. What is the minimum throughput or maximum latency of a given KPN extended with timing information? What are sufficient FIFO buffer capacities to execute with a guaranteed minimal throughput? Is a particular channel or process ever used? Some of these questions have to be answered to arrive at an implementation of a KPN. We discuss this further in Section 6.
6 Implementing Kahn Process Networks

KPNs are a mathematical model of computation with conceptually unbounded FIFO buffers. KPN is a very expressive model and we have seen that many of its properties are not statically decidable. Therefore, in some cases, subclasses of KPN are used as a starting point for an (automated) synthesis trajectory. However, in many cases, the expressiveness of the model is exploited and KPNs are used directly as the basis for synthesis, or for simulation, in which case conceptually the same problems need to be solved.

Besides the infinite buffer capacities, another issue to be resolved is that the semantics involves potentially infinite behaviors. The Fibonacci example of Figure 1 represents a network that produces an infinite stream of numbers, the Fibonacci sequence. A real implementation, however, will never be able to produce an infinite sequence in a finite amount of time. To be of practical relevance, a judgement whether an implementation is correct should be based on its behavior in finite time. Section 9 discusses existing implementations of KPN.
6.1 Implementing Atomic Processes

The semantics of KPN assumes atomic processes which implement elementary continuous functions. Kahn and MacQueen suggest [27] that the atomic processes can be implemented with sequential program code which does not use any global variables shared with other processes, and with explicit read and write operations added for reading symbols from input channels and writing symbols to output channels. An example is the code of the Inverse Zig Zag process in Figure 2. If multiple such processes are required to run on a single processor, then typically a multi-threaded (light-weight) OS is used which spawns a single thread for every process. Determinacy of KPN guarantees that synchronization between these threads is only needed for communication on the FIFOs. When a read operation is executed on an empty FIFO, the reading process thread should stall until new symbols are written to the channel. (It is not allowed to do anything else, because that would break determinacy.) Similarly, if FIFO buffers of limited capacity are used in the implementation, a writing process thread may need to stall until sufficient space is available in the channel to complete the write operation. Threads can be, but need not be, scheduled in a preemptive manner. A minimal sketch of this implementation style is given below.
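The sketch assumes blocking bounded FIFOs; Python's queue.Queue already provides the required blocking read (on an empty FIFO) and blocking write (on a full one). The Channel class and the example processes are our own illustrations, not code from [27].

```python
# A minimal sketch of the Kahn/MacQueen implementation style: every
# process is a thread running sequential code with explicit, blocking
# read and write operations on FIFO channels.

import threading
import queue

class Channel:
    def __init__(self, capacity):
        self.q = queue.Queue(maxsize=capacity)
    def read(self):            # blocks while the FIFO is empty
        return self.q.get()
    def write(self, token):    # blocks while the FIFO is full
        self.q.put(token)

def producer(out, n):          # a sequential atomic process
    for i in range(n):
        out.write(i)

def doubler(inp, out, n):      # reads, transforms, writes; no shared state
    for _ in range(n):
        out.write(2 * inp.read())

a, b = Channel(capacity=2), Channel(capacity=2)
threads = [threading.Thread(target=producer, args=(a, 5)),
           threading.Thread(target=doubler, args=(a, b, 5))]
for t in threads:
    t.start()
print([b.read() for _ in range(5)])   # [0, 2, 4, 6, 8]
for t in threads:
    t.join()
```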
6.2 Correctness Criteria

An implementation of a KPN should respect the formal (denotational and operational) semantics. It is not entirely trivial how to define correctness, because the semantics talks about infinite executions and infinite streams as a convenient
abstraction of streaming computation. However, in the real world, we will never be able to observe any actual infinitely long executions. We break down correctness of a KPN implementation into three aspects: soundness, completeness and boundedness.

Soundness states that the KPN implementation should never produce any output that is different from the output predicted by the semantics, nor should it produce more output than predicted. For instance, if the semantics says that the Fibonacci KPN produces the outputs 1, 1, 2, 3, etcetera, then the implementation should not produce 1, 2, ....

Definition 12. (SOUNDNESS) An implementation strategy for KPN is sound iff for every KPN with input/output function f and every behavior the strategy may produce, if the strategy consumes input i and produces output o, then o ⊑ f(i).

Secondly, output should be complete. Intuitively, this means that the implementation should produce all output predicted by the semantics. But of course we cannot expect an implementation to produce an infinite amount of output in a finite amount of time. Every individual part (symbol) of the infinite output stream, however, is preceded by only a finite amount of output. Hence, we require from a good implementation that every bit of the predicted output is eventually (meaning after a finite amount of time) produced by the implementation. In other words, if we sample at finite time intervals the input consumed and output produced by the KPN implementation as a sequence of finite executions σ_k, then the limit of that sequence (its least upper bound) should be the entire infinite execution.

Definition 13. (COMPLETENESS) An implementation strategy for KPN is complete iff for every KPN, for every input i offered to this KPN and for every behavior the strategy may produce, if it is sampled at regular time intervals t_k, having produced the finite executions σ_k, then σ = ⊔{σ_k | k ≥ 0} is a maximal and fair execution with input i that is part of the operational semantics.

In particular, this implies that (a) any new amount of progress is made in a finite amount of time, and (b) no channels are excluded from making progress in finite time; there must be no starvation of parts of the network. We illustrate the importance of this constraint when we discuss run-time scheduling and buffer management below. Thirdly, it is important that this is achieved within bounded memory whenever possible. This amounts to keeping the FIFOs bounded, also in (conceptually) infinite computations.

Definition 14. (BOUNDEDNESS) An implementation strategy for KPN is bounded iff for every KPN and for every input offered to this KPN, if there exists a bounded execution according to the operational semantics, then this strategy will produce a bounded execution.

Note that not every KPN allows for bounded executions. In such a case, the strategy is allowed to produce an unbounded execution; one should then perhaps decide not to implement such a KPN, although one cannot, in general, automatically decide whether this will be the case.
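Soundness on an observed finite behavior amounts to a per-channel prefix check, as in the following sketch; the history encoding is our own, and the Fibonacci values mirror the example above.

```python
# Soundness check on observed finite behavior: the produced output must
# be a prefix of the predicted output on every channel (o is below f(i)
# in the prefix order). Histories are dicts from channel to symbol list.

def is_prefix(xs, ys):
    return len(xs) <= len(ys) and ys[:len(xs)] == xs

def sound(observed, predicted):
    return all(is_prefix(observed[c], predicted.get(c, []))
               for c in observed)

# Predicted Fibonacci output 1,1,2,3,5 on a channel f:
assert sound({"f": [1, 1, 2]}, {"f": [1, 1, 2, 3, 5]})
assert not sound({"f": [1, 2]}, {"f": [1, 1, 2, 3, 5]})   # 1, 2, ... is unsound
```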
Fig. 4 An artificial deadlock
6.3 Run-time Scheduling and Buffer Management

Conceptually, the FIFO communication channels of a KPN have an unbounded capacity. A realization of a process network has to run within a finite amount of memory. An unbounded capacity can be mimicked using dynamic allocation of memory for a write operation, but for reasons of efficiency it is better to allocate fixed amounts of memory to channels and to change this amount only sporadically, if necessary. An added advantage of the fixed capacity of channels is that one can use an execution scheme introduced by Parks [40], where a write action on a full FIFO blocks until room is freed up in the FIFO. Note that this behavior can be modeled within the KPN model using additional channels in opposite directions, similar to the well-known trick with Synchronous Data Flow graphs [40]; this shows that it does not impact determinacy of the model. This gives an efficient intermediate form of data-driven and demand-driven scheduling [40], [3], in which a process can produce output (in a data-driven way) until the channel it is writing to is full. Then it blocks and other processes take over, effectively regulating the relative speeds of the different processes.

As discussed previously, it is undecidable in general how much buffer capacity is needed in every channel [12]. If buffers are chosen too large, memory is wasted. If buffers are chosen too small, so-called artificial deadlocks may occur that are not present in the original KPN: processes are permanently blocked because of one or more full channels, an event which cannot occur in the original KPN. Therefore, a scheduler is needed that determines the order of execution of processes and that manages buffer sizes at run-time.

Figure 4 shows a process network in an artificial deadlock situation. Process w cannot continue because its input channel from process v is empty; it is blocked on a read action on the channel from v, denoted by the 'r' in the figure. The required input should be provided by v, but v is in turn waiting for input from u, and u is waiting for q. Process q is blocked because it is trying to output a token on the full channel to r; the block on the write action is denoted by a 'w'. Similarly, processes r, s and t are blocked by a full channel. Only w could start emptying these channels, but w is blocked. The processes are in artificial deadlock (q, r, s and t would not be blocked in the KPN) and can only continue if the capacity of one of the full channels is increased. Note that a blocked process depends on a unique channel on which it is blocked, either for reading or for writing, and hence it depends on a unique other process that is also connected to that channel. This gives rise to a chain of dependencies, and if this chain is cyclic, it is a deadlock. A real
(non-artificial) deadlock, in contrast, is entirely due to the KPN specification and cannot be avoided by buffer capacity selection. Because the KPN with bounded FIFOs is also determinate, we can conclude that buffer management is independent from scheduling; an artificial deadlock, like a real deadlock, cannot be avoided by a different scheduling. (See [19] for more details and a proof.) Thus, an important aspect of a run-time scheduler for KPNs is dealing with artificial deadlocks.

We can discern different kinds of deadlocks. A process network is in global deadlock if none of the processes can make progress. In contrast, a local deadlock arises if some of the processes in the network cannot progress and their further progress cannot be initiated by the rest of the network.

Parks' strategy [40] for dealing with scheduling, artificial deadlocks and buffer management is the following procedure. First, select some arbitrary, fixed initial buffer capacities. Then, execute the KPN, with blocking read and blocking write operations. If and when a global deadlock occurs and it is an artificial deadlock, increase the size of the smallest buffer and continue. (One can try to be more efficient in the selection of the buffer that needs to be enlarged [3], [19], but this does not fundamentally change the strategy.) Such a run-time scheduler thus needs to detect a global deadlock. In a single-processor multi-threaded implementation, this is typically achieved by having a lowest-priority thread of execution that becomes active only when all other threads are blocked, indicating a global deadlock. It then tests whether the deadlock is artificial and, if so, increases the size of a selected buffer, ultimately enabling one of the blocked processes again. In terms of the formulated correctness criteria, one can show this method to be sound and bounded, but not complete.

Proposition 5. The scheduling strategy of [40] for KPNs is sound and bounded, but not complete.

Proof. Soundness is straightforward: the appropriate code is executed and nothing else. The strategy of selecting the smallest buffer to increase guarantees that, after a finite number of resolved artificial deadlocks, the buffer capacities must have grown beyond the bounds needed when a bounded execution exists. Incompleteness of the scheduling strategy follows from the counterexample of Figure 5, discussed below.
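The core of this procedure can be sketched as follows. The status encoding (which process is blocked on which channel, reading or writing) is our own abstraction of a runtime, not an actual scheduler implementation.

```python
# A sketch of Parks' deadlock resolution step: on a global deadlock,
# check whether some process is blocked writing to a full channel (the
# deadlock is then artificial) and, if so, grow the smallest full buffer.

def is_global_deadlock(status):
    """status: process -> ("r", c) | ("w", c) | None (still running)."""
    return all(st is not None for st in status.values())

def parks_resolve(status, caps):
    """Return the channel whose capacity was grown, or None."""
    if not is_global_deadlock(status):
        return None
    full = [c for (kind, c) in status.values() if kind == "w"]
    if not full:
        return None            # a real deadlock: only empty-channel blocks
    smallest = min(full, key=lambda c: caps[c])
    caps[smallest] += 1        # unblocks the writer to this channel
    return smallest

# All processes blocked; q is blocked writing to full channel c1.
status = {"q": ("w", "c1"), "u": ("r", "c2"), "w": ("r", "c3")}
caps = {"c1": 1, "c2": 1, "c3": 1}
print(parks_resolve(status, caps))   # 'c1': the only full channel grows
```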
The strategy chosen in [3] is in this respect similar to the one of [40] and also leads to a bounded execution if one exists. To guarantee the correct output of a network, a run-time scheduler must detect and respond to artificial deadlocks. Parks proposes to respond to global artificial deadlocks. In implementations of this strategy [1], [22], [28], [46], [51], global deadlock detection is realized by detecting that all processes are blocked, some of them by writing on a full FIFO. Although this guarantees that execution of the process network never terminates because of an artificial deadlock, it does not guarantee the production of all output required by the KPN semantics; output may not be complete. For example, in the network of Figure 5, if the upper part reaches a local artificial deadlock, then the lower, independent part is not affected. The processes do not all come to a halt, so the local deadlock is not detected and not resolved, and the upper part may not produce the required output. Such situations exist in realistic networks.
A particular example is the case where multiple KPNs are controlled by a single run-time scheduler: one entire process network may get stuck in a deadlock. Hence, to achieve completeness, a deadlock detection scheme has to detect local deadlocks as well.

It is shown in Section 4 that the Kahn Principle hinges on fair scheduling of processes [10], [26], [45]. Fairness means that all processes that can make progress should make progress at some point. This is often a tacit but valid assumption if the underlying realization is truly concurrent, or fairly scheduled. However, in the context of bounded FIFO channels, where processes appear to be inactive while they are blocked for writing, fairness of a schedule is no longer evident. This issue is neglected if one responds to global deadlocks only, leading to a discrepancy with the behavior of the conceptual KPN. It is proved in [19], and illustrated below, that a perfect scheduler for KPN, satisfying all three correctness criteria for any KPN, cannot exist.

Theorem 2. ([19]) A scheduling strategy for KPNs satisfying soundness, completeness and boundedness does not exist.

The incompleteness of the scheduling method of [40] can have rather severe consequences, leading to starvation of parts of the network. The way to resolve this is to not wait until a global deadlock occurs before taking action, but to act (in finite time) when a local artificial deadlock occurs and resolve it. This is the adaptation that the strategy proposed by [19] makes to the original strategy of [40].

A problem for an implementation strategy for KPN is posed by the production of data that is never used. This is illustrated in Figure 6. Process p writes n data elements (tokens) on channel c connecting p to process q; after that, it writes tokens to output channel a forever. q never reads tokens from c and outputs tokens to channel b forever. If the capacity of c is insufficient for n tokens, then output a will never be written to unless the capacity of c is increased. q does not halt, and execution according to Parks' algorithm does not produce output on channel a, violating completeness, because in the KPN, infinite output is produced on both channels.

The above suggests that a good scheduler should eventually increase the capacity of channel c so that it can contain all n tokens. However, such a scheduler fails to correctly schedule another KPN. Consider a process network with the same structure as the one of Figure 6, but this time, p continuously writes tokens to output a, mixed with infinitely many tokens to channel c. q writes infinitely many tokens to b and reads infinitely many tokens from c, but at a different rate than p writes tokens to c. Note that a bounded execution exists; a capacity of one token suffices for channel c. If process p writes to c faster than q reads, channel c may fill up and the scheduler, not knowing whether the tokens on channel c will ever be read, decides to increase the channel capacity. A process q exists that always postpones its read actions until after the scheduler decides to increase the capacity; the execution will then be unbounded, although a bounded execution exists. The way out taken by [19] is to assume that in a reasonable KPN, every token that is written to a channel is eventually also read. Such KPNs are called effective.
Fig. 5 A local deadlock

Fig. 6 Non-effective network
The scheduling algorithm is based on the use of blocking write operations to full channels, as in [40]. In order to define local deadlocks, it builds upon the notion of causal chains as introduced in [3]. Any blocked process depends for its further progress on a unique other process that must fill or empty the appropriate channel. These dependencies give rise to chains of dependencies. If such a chain of dependencies is cyclic, it indicates a local deadlock; no further progress can be made without external help. In Figure 4, such a causal chain is indicated by the dashed ellipse marking the cyclic mutual dependencies of the processes q, r, s, t, u, v and w, which form a local deadlock. The scheduling strategy of [19] is identical to the scheduling strategy of [40], except that the search for an artificial deadlock does not wait until the network is in global deadlock; instead, the scheduler monitors the process network for the occurrence of local artificial deadlocks and resolves those in finite time. This scheduling strategy has the following characteristic property.

Theorem 3. The scheduling strategy of [19] for KPNs is sound, complete and bounded for effective KPNs.

This is proved in [19].
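The causal-chain idea can be sketched as a simple graph walk. The encoding of the blocked and peer relations below is our own; a real scheduler would additionally check whether the cycle contains a write-blocked process, which makes the local deadlock artificial and resolvable by enlarging that channel.

```python
# Local deadlock detection via causal chains: every blocked process
# depends on the unique process at the other end of the channel it is
# blocked on; following these dependencies either reaches a running
# process (no local deadlock) or revisits a process (a cyclic causal
# chain, i.e., a local deadlock).

def find_causal_cycle(blocked_on, peer):
    """blocked_on: process -> channel it is blocked on (absent if running).
    peer: (channel, process) -> the process that must fill/empty it."""
    for start in blocked_on:
        seen, p = set(), start
        while p in blocked_on:           # follow the dependency chain
            if p in seen:
                return seen              # a cycle was reached: local deadlock
            seen.add(p)
            p = peer[(blocked_on[p], p)]
    return None

# q is blocked writing to full channel c read by u; u is blocked reading
# empty channel d written by q: a two-process local deadlock, regardless
# of whatever the rest of the network is doing.
blocked_on = {"q": "c", "u": "d"}
peer = {("c", "q"): "u", ("d", "u"): "q"}
print(find_causal_cycle(blocked_on, peer))   # {'q', 'u'}
```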
7 Extensions of KPN

KPN is an expressive model that allows the description of a wide class of data-flow applications. Its expressiveness leads to undecidability of various elementary questions about KPNs (Section 5). Still, there is sometimes the desire to extend the KPN model with additional features. The most prominent ones are time, and events or reactive behavior. KPN describes the functional behavior of a process network, but makes no statements about the timing with which outputs are produced, or the latency or throughput that can or should be attained. Another important element on the wish list is the ability to deal with sporadic messages or events. Often, both are added at the same time to preserve determinacy of the model.
Events

The desire to deal with sporadic events is illustrated by the addition of the select statement in the KPN programming library YAPI [28]. This statement allows a process to see which of a number of input channels has data available and to read from that channel first. Similarly, [36] describes an extension of the data-flow model with the ability to probe a channel for the presence or absence of data. While both extensions clearly enhance the expressiveness of the model, they also destroy the property of determinacy, the independence of any concrete schedule, the denotational semantics of KPN and the Kahn Principle.

From a theoretical perspective, the essential ingredient to add to KPNs to deal with sporadic or reactive behavior is a merge process. A merge process has two inputs and one output, and it copies data arriving on any of its inputs to the single output, in an order which is based on the arrival order of the data, for instance the order in which the data arrive on the inputs. In this case, however, the input/output relation of the process is no longer a function; the same inputs can lead to different outputs, depending on the order in which they arrive. The KPN denotational semantics captures behavior as continuous input/output functions and thus no longer works.

It has been attempted to capture KPN with merge processes by input/output relations. It turns out, however, that this is not possible. This is commonly known as the Brock-Ackerman anomaly [9]. For non-determinate processes, their input/output relation does not sufficiently characterize the system's behavior to use it as a basis for a semantic framework. The counterexample is reproduced in Figure 7. It has a 'merge' process as the only indeterminate process in the network, a process 'first' which passes on only the first symbol, a process 'split' which splits the stream into two exact copies, and a process 'inc' which passes on all symbols after incrementing them by one. The network is provided with the input consisting of the single symbol 5. The merge process passes it on to its output, and the processes 'first' and 'split' pass it on as well. 'inc' turns it into 6 and passes it on to 'merge'. Operationally speaking, merge now passes the 6 to its output, and because the token 6 is causally dependent on the token 5 passing to the output of the merge first, the only possible output of the merge process is 5 · 6. On the other hand, the denotational semantics of the merge process specifies that for the inputs (5, 6), the possible outputs are {6 · 5, 5 · 6}; but 6 · 5 is not possible in reality. The causality information which is necessary for providing a correct semantics to indeterminate processes is lost in the input/output relation.
Fig. 7 Brock-Ackerman counterexample (from [9])
Brock and Ackerman showed [9] that I/O relations do not provide a good semantics for the so-called fair merge [39], which merges streams in such a way that all input tokens on either input are eventually produced at the output. Subtly different and weaker versions of merge processes (with strictly different expressiveness) can be defined, such as fair merge, infinity fair merge and angelic merge, but it was later shown (see [43] for an overview) that they all suffer from the same limitation: input/output relations are insufficient. Only merge processes which interleave symbols from the inputs in a fixed, a priori determined order, for instance alternatingly, are determinate. Alternatively, a trace-based semantics has been found to be able to serve as a fully abstract semantics [25], i.e., a semantic model which contains enough information, but no redundant information, to represent behavior. The use of a totally ordered trace indicates the loss of the independence of the scheduling of concurrent processes associated with the introduction of a merge construct, which is one of the strongest points of KPN.

Time

A timed process network model is a model which not only describes the data transformations of the network, but also the timing of such a process [11], [52]. Such a model is typically made by labeling symbols with time-stamps or tags from a particular time domain of choice (for instance the non-negative integers or reals) or, more generally, according to the tagged-signal model of [32], allowing for instance also partially ordered or super-dense time domains common in hardware description languages [34]. Alternatively, a stream can then be described as a mapping from the time domain to the channel alphabet. If this function is total, it may also specify the absence of data at certain points in time, which can then be deterministically exploited by processes. Time-stamps may be interpreted as the exact timing of the production of tokens or as a specification of a deadline for the production. In practice, however, timing constraints are typically end-to-end, for instance latency, or indirect, such as throughput constraints.

From a semantics perspective, time-stamps can be exploited to 'rescue' the merge process from indeterminacy and, in general, to make reactive behavior deterministic. Using time information, decisions can be taken in a deterministic way, based on the time-stamps of data, for instance to describe a merge process which merges symbols in the order in which they arrive (with some provision, for instance a fixed priority, when tokens arrive at the same time) [11], [52], [34]; a small sketch of such a merge follows below. Alternative network equations can be formulated and different fixed-point theorems (such as Kleene's or Banach's) can be used to show that they have a unique solution. The (big) price one has to pay for salvaging determinacy of the model is that it requires a global notion of synchronized (logical) time, which puts constraints on the implementation and requires additional synchronization. By adopting such a notion of synchronized global logical time, we end up close to another end of the spectrum of data-flow languages, the domain of synchronous
languages such as Esterel [6], Signal [4] or Lustre [23], where the execution of a network needs to be globally synchronized. This can be a strong disadvantage for a distributed implementation. Recent work [13], [5] in the synchronous language domain searches for conditions to relax the global synchrony constraints towards so-called GALS (globally asynchronous, locally synchronous) implementations, where particularly the most costly global synchronization may be eliminated.
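As an illustration of such a time-stamped merge, the sketch below merges two streams of (time-stamp, value) tokens deterministically, with a fixed priority for the left input on equal stamps; the token encoding is our own.

```python
# Deterministic merge on time-stamped tokens: order by time-stamp, break
# ties by a fixed channel priority (left before right). Given the same
# tagged input histories, the output is unique; no scheduling freedom
# remains, at the price of a global notion of logical time.

def timed_merge(left, right):
    tagged = [(t, 0, v) for (t, v) in left] \
           + [(t, 1, v) for (t, v) in right]   # 0/1 encodes the priority
    return [(t, v) for (t, _, v) in sorted(tagged)]

# The values 5 and 6 from the example above: the stamps now fix the order.
assert timed_merge([(1, 5)], [(2, 6)]) == [(1, 5), (2, 6)]
assert timed_merge([(3, 5)], [(2, 6)]) == [(2, 6), (3, 5)]
```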
8 Reactive Process Networks

The Reactive Process Networks (RPN) model of computation intends to provide a semantic framework and implementation model for data-flow networks with sporadic events, in the same way that KPN is a reference for determinate data-flow models of computation. RPN integrates control and event processing with stream processing in a unifying model of computation with a compositional operational semantics. The model tries to strike a balance in the trade-off between expressiveness, determinism and predictability, and implementability.
8.1 Introduction

We illustrate the Reactive Process Networks model by looking at the domain of multimedia applications, which work with information streams such as audio, video or graphics. In modern applications, these streams and their encodings can be very dynamic. Smart compression, encoding and scalability features make these streams less regular than they used to be.
Fig. 8 Embedding of types of streams
Streams are typically parts of larger applications. Other parts of these applications tend to be control-oriented and event-driven and interact with the streaming components. Modern (embedded) multimedia applications can often be seen as
instances of the structure depicted in Figure 8. At the heart of the application, computationally intensive data operations have to be performed on streams of, for instance, pixels, audio samples or video frames. The input and output of these processes are highly regular patterns of data. These data processing activities can often be statically analyzed and scheduled on efficient processing units. At a higher level, modern multimedia streams show a lot of dynamism. Object-based video (de)coders, for instance, work with dynamic numbers of objects that enter or leave a scene. Decoding of the individual objects themselves uses the static data processing functions, but these functions may need to be added, removed or adapted dynamically, for instance for changing encoding modes or frame types in MPEG audio or video streams. These dynamic streams still compute functions and processing is determinate, i.e., the functional result is independent of the order in which operations are executed. In turn, the processing of these dynamic data streams is governed by control-oriented components. This may for instance be used to convey user interactions to the streaming application or to respond to changing network conditions.

The three levels of an application require their own typical modeling and implementation techniques. A good candidate to describe static computation kernels, for instance, is Synchronous Data Flow (SDF) [31], discussed in Chapter 30. Dynamic stream processing can perhaps be best described using KPNs. KPNs are capable of expressing data-dependent behavior and dynamic changes in their processing, but they are still determinate and can be executed fully asynchronously. To specify the control-dominated parts of an application, there are many techniques, such as state machines and event-driven software models.

RPN defines a model and its formal operational semantics that allow for an integrated description and analysis of an application consisting of these three levels of computation. It is a unified model for streaming and control that is also hierarchical and compositional. The goal is to also be able to incorporate more analyzable, less expressive models, in the same way SDF fits within the KPN model, but still combining data-flow and control-oriented behavior. For instance, a combination of SDF with finite state machines such as HDF or SADF [21], [48], [47], which yields analyzable models, would fit within the framework. This is illustrated in Figure 9. Vertically, it shows the trade-off between expressiveness and analyzability, with a border of decidability. Horizontally, it shows streaming vs. control-oriented models.

Figure 10 shows the compositional integration of state machines with process networks. A process network is a component with stream input(s) (i in Figure 10(a)), stream output(s) (o) and event input(s) (e). At any point in time, the network operates in a mode that implements a particular streaming function, for instance mode h in Figure 10(a), implemented as the network drawn below it. At some later time, because of the occurrence of some event a, the function of the network needs to change to a mode n having a different streaming function. One could think, for instance, of a video system where a user changes settings or turns special image processing features on or off. One could view this as a (not necessarily finite) state machine where, in every state, the network implements a particular function and external events force the state machine to move from one state to another.
In every state, a particular process network performs operations on data streams. Similarly, such state machines
Fig. 9 Classification of models of computation

Fig. 10 Mixing state machines with process networks
can be embedded in components of a reactive process network, as shown in Figure 10(b). Events that are communicated to such components can be generated from an output port of another component. Since process networks can be hierarchical entities, we would like such a construct to be compositional and applicable at different levels of a network hierarchy. Note that this is the conceptual behavior of an RPN; it does not mean that this is exactly how it is implemented. In many cases, the events cause relatively small changes to the current streaming function, such as the changing of parameters or the activation or deactivation of individual functions.

A Reactive Process Network Example

An illustrative example of the type of application we are considering is shown in Figure 11. It depicts an imaginary game, which includes modes of 3-dimensional game play as well as streaming-video-based modes. The rendering pipeline, used in the
Fig. 11 An interactive 3D game
3D mode, is a dynamic streaming application. Characters or objects may enter or leave the scene because of player interaction, and rendering parameters may be adapted to achieve the required frame rates based on a performance-monitoring feedback loop. Overlayed graphics (for instance text or scores) may change. This happens under the control of the event-driven game control logic. At the processing core of the application, the streaming kernels, a lot of intensive pixel-based operations are required to perform the various texture mapping or video filtering operations. Special hardware or processors may be available to execute these operations, which can be scheduled off-line, very efficiently.

The model is organized around the two main modes (3D graphics, video). In these modes, the game dynamics and mode changes are influenced or initiated by user interaction, game play and performance feedback. This is the control-oriented part of the game, depicted as the automaton at the top of Figure 11. The self-loops on these states denote changes where the streaming network essentially stays the same, but its parameters may be changed. In the 3D graphics mode (enlarged at the bottom of the figure), a scene graph, describing all entities and their positions in 3D space, is rendered to a 2-dimensional view on the scene on some output device. The first process communicates the scene graphs with objects to the rendering component. The rendering component
transforms the scene graphs into 2-dimensional frames. The last process adds 2-dimensional video processing such as filtering, overlays, and so forth. The output is shown on the display device. Notice that the game as a whole has different modes of streaming (3D graphics, video), but, similarly, components in the stream processing part have different modes or states of streaming execution. The object renderer, for instance, can be reconfigured to different modes depending on the number of objects that need to be rendered (using the event channel nrObj controlled by the scene graph process). This illustrates the need for a hierarchical, compositional approach to combining state/event-based models with streaming and data-flow based models, as realized by the RPN model.
8.2 Design Considerations of RPN

We discuss the main concepts that have had an impact on the design of the model of Reactive Process Networks.

Streams, Events and Time

Streaming applications represent functions or data transformations. They absorb input and they produce the corresponding output. There is often no inherent notion of time, except for the ordering of tokens in the individual data streams. (We are aiming for an untimed model similar to KPN.) Tokens in different streams have no relation in time, except for causal relationships implicitly defined by the way processes operate on the tokens. These process networks are ideally determinate, i.e., the order and time in which processes execute are irrelevant for the functional result. The output of a process network is completely determined as soon as the input is known. The actual computation of this output introduces a certain latency in the reaction. This latency is not part of the functional specification, but merely a consequence of the computation process. It may be subject to constraints, such as a maximum latency. Time is sometimes implicitly present in the intention of streams. A stream may carry, for instance, a sequence of samples of an audio signal that are 1/44100th of a second apart, or video frames of which there are 25 or 30 in every second. Such streams are called periodic. For final realizations, time-related notions such as throughput, latency and jitter are of course important.

Events have a somewhat different relationship to time. An event is unpredictable, and the moment when it arrives, in relation to the streams, is significant, but unknown in advance. In many event-based models, the synchrony hypothesis applies, which states that the response to an event can be completed before the following event arrives or is taken into account. This simplifies specifying how a system responds to events. A classical model for event-based systems are state machines, where events make the system change from one state into another. Prominent characteristics
are non-determinism and a total ordering of events. If events come from outside and are not predictable a priori, then the system evolves in a non-deterministic way under the influence of the events, even if the response to a particular event is deterministic.

Discrete changes due to events alternate with stable periods of streaming. Conceptually, a discrete change occurs at a well-defined point within the stream. Because of the pipelined implementation of stream processing, however, there is not necessarily a point in time where the change can be applied instantaneously to the whole network. A video decoder, for instance, may be processing video frame by video frame in a pipelined fashion. A discrete change occurs between two frames such that the new frames are decoded according to new parameter settings. However, when the first new frame enters the pipeline, there are still old frames in the pipeline ahead of it. The StreamIt language [49], which employs a fairly synchronized model of streaming, allows a mechanism of delivering asynchronous messages (events) to processes in accordance with the 'information wavefront', i.e., in a pipelined fashion. In general, an important aspect of dealing with streaming computation and events is to coordinate their execution to implement a smooth transition.

There is a trade-off between predictability and synchronization overhead. Predictability of processing (non-deterministic) events is improved by added control over the moment when, and the way how, the event is processed relative to the streaming activities. Increased predictability requires more synchronization between processes and hence additional overhead. Such overhead is undesirable, especially if events occur only sporadically.

Semantic Model

A denotational semantics is often preferred to capture the intended functionality of a process network, or to define the functional semantics of a system or programming language implementing process networks, without specifying unnecessary implementation details. An operational semantics, on the other hand, allows reasoning about implementation details, such as artificial deadlocks [19] or required buffer capacities [8], [3]. For RPN, a denotational semantics in terms of input/output relations is not possible (Section 7), and one based on sequential execution traces is possible, but hides the parallelism of the model. The detailed operational semantics of RPN can be found in [20]. In the next section, we discuss the main concepts.

Communicating Events

Processes or actors in data-flow graphs communicate via FIFO channels. We want to add communication of events, and we have to decide what communication mechanism is used for events. It is often the case that what is perceived by a lower-level process as an event is considered to be part of streaming by higher-level processes. For instance, a video decoder decodes a stream of video frames, and the header
information for every frame is an integral part of the data stream. For the lower level frame decoder processes, the frame header information is seen as an event that initializes the component to deal with the specific parameters of the following frame. For this reason, we use the same infrastructure for communicating events as for streams. For the sending process, there is no difference at all; at the receiving side, we distinguish stream input ports and event input ports. The former are used in ordinary streaming activity; tokens arriving on the latter trigger discrete events.
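To make the port distinction concrete, the following minimal sketch (ours, not part of the original model definition; names such as ReactiveProcess and quant are hypothetical) realizes both port kinds with the same FIFO infrastructure, as described above: the sender writes tokens as usual, while the receiver treats tokens on the event port as triggers for discrete reconfiguration.

```python
import queue
import threading

class ReactiveProcess(threading.Thread):
    """Hypothetical RPN-style process: the same FIFO infrastructure serves
    both port kinds, but tokens on the event port cause discrete changes."""

    def __init__(self, stream_in, event_in, stream_out):
        super().__init__(daemon=True)
        self.stream_in = stream_in     # ordinary stream input port
        self.event_in = event_in       # tokens arriving here act as events
        self.stream_out = stream_out
        self.params = {"quant": 1}     # state changed by reconfiguration

    def run(self):
        while True:
            try:                       # events tested at a well-defined point
                self.params.update(self.event_in.get_nowait())
            except queue.Empty:
                pass
            token = self.stream_in.get()            # blocking KPN-style read
            self.stream_out.put(token * self.params["quant"])

stream_in, event_in, stream_out = queue.Queue(), queue.Queue(), queue.Queue()
ReactiveProcess(stream_in, event_in, stream_out).start()
event_in.put({"quant": 10})    # header-like information sent as an event
for sample in range(3):
    stream_in.put(sample)
print([stream_out.get() for _ in range(3)])   # -> [0, 10, 20]
```

The key point is that events are only tested at a well-defined point in the process body; how to coordinate that moment with the streaming activity is exactly the topic of the following sections.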
8.3 Operational Semantics of RPN

The operational semantics of reactive process networks associates RPNs with a corresponding labeled transition system. An RPN, like a KPN, consists of processes, which may in turn be other RPNs, or primitive processes defined through other means, such as sequential code segments, and the LTS is constructed compositionally, similar to the operational semantics of KPN in Section 3. A detailed description of the operational semantics of RPN can be found in [20]; here we concentrate on the important concepts. In general, two different types of things may happen to the network: data can be streaming through it, or it can encounter events that need to be processed. The events introduce non-determinism. The result should still be as predictable as possible. In particular, we want to guarantee that input consumed before the arrival of a new event will lead to the required output, even if that output has not been completed when the event arrives. In implementations, this is achieved at the expense of additional synchronization or coordination. In specific subclasses of the RPN model, this synchronization may be realized without much overhead. There are very regular subclasses of RPN, such as the SDF-based models of Scenario-Aware Data Flow (SADF) [48] or Heterochronous Data Flow (HDF) [21]. SDF graphs execute periodically repeating behavior, often called iterations. In-between such iterations is typically a good moment for reconfiguration. In [37], such moments are referred to as quiescent states, but as explained earlier, in a pipelined execution, such states may not occur naturally and may need to be enforced by stalling the pipeline. The operational semantics of RPN models streaming as a sequence of individual read and write actions of the processes involved in the computation of the output, comparable to the semantics of KPN of Section 3. Since, conceptually, the reaction of a process network to incoming data is immediately determined, it may not be disturbed by the processing of events. To this end, in the RPN semantics, the sequences of actions resulting from a data input are grouped together and represented as single, atomic transitions of the labeled transition system. Effectively, this gives internal actions (i.e., completing the reaction to already received input) priority over the processing of events. This leads to the concept of so-called streaming transactions, as illustrated in Figure 12. The picture on the left shows the states and transitions of a process network performing individual input and output actions.
Fig. 12 Streaming transactions
Input actions are shown as horizontal arrows, output actions as vertical arrows. Because of pipelining, the network can perform several input actions before the corresponding output actions are produced. Events should only be accepted in those states where no internal or output actions are available. In Figure 12, these are all states on the lower-left boundary, such as states s1 and s3. In state s2, some remaining output is still pending and an event cannot be processed yet. Sequences of actions starting from a state without enabled output or internal actions and going back to such a state represent complete reactions of the network to input stimuli. These sequences of actions are grouped together to form atomic transitions, the streaming transactions, that end in states where events can safely be processed. For example, the sequence of transitions with thick arrows in the picture from state s1 to s3 forms a streaming transaction; from the end state, only reading of new input is possible, and all received input has been fully processed. (These states correspond to end-points of maximal executions in accordance with Definition 3 for a particular finite input, and closely resemble the quiescent states of [37].) Some of the possible streaming transactions are shown in the picture on the right. The bold one corresponds to the bold path on the left. In this new LTS, the state space consists of only quiescent states. The streaming transactions of an RPN are determined in two steps. First, the individual streaming transactions of the constituent processes are determined, as well as the processing of events by these processes. Executions of these actions are then taken together to form maximal streaming transactions of the whole network. Such streaming transactions may not consume input events. However, they may include occurrences of internal events, and hence they can be indeterminate. Figure 13 shows a part of a possible execution of the 3D game example. The first transition is a streaming transaction, consisting of streaming actions of the processes, but also internal events (nrObj(2)). Note that we have abstracted from internal actions of, for instance, the rendering component, which might also be visible in these transactions. The second transition is an event; the player has reached the end of a level, and the game is reconfigured from 3D mode to video mode.
Fig. 13 Execution of the 3D game RPN
Such event transitions of an RPN are directly determined by the reception of input events, followed by the corresponding network transformation. The RPN semantics associates with every event an abstract transformation of the network structure. How such reconfigurations are precisely specified is left up to the concrete instances of models or languages based on this model. In *charts [21], which can be seen as an RPN instance, specification is done by direct refinement of FSM states by dataflow graphs. In SADF [48], actors or processes read control tokens sent to them by the controlling FSM and adapt or suspend their behavior accordingly. The behavior of the network as a whole is formed by interleaving transitions of both kinds, as in the example. An RPN thus has a labeled transition system with executions interleaving events and streaming transactions. In contrast with KPN, this LTS is indeterminate. With this LTS semantics, an RPN can be directly used as a process within a larger RPN, and this way supports hierarchical and compositional specification of RPNs.
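The following toy sketch (our illustration, with a deliberately simplified network interface; TinyNetwork and its methods are invented) mimics the priority of internal actions over events: a streaming transaction runs the network to a quiescent state, and only there an event is applied.

```python
from collections import deque

class TinyNetwork:
    """Toy stand-in for an RPN that doubles each token; tokens may be read
    long before the corresponding output is written (pipelining)."""
    def __init__(self):
        self.pending = deque()   # tokens read but not yet written out
        self.out = []
        self.factor = 2          # parameter changed by an event

    def has_enabled_action(self):
        return bool(self.pending)        # quiescent iff nothing is pending

    def step(self):                      # one internal/output action
        self.out.append(self.pending.popleft() * self.factor)

    def accept(self, token):             # input action
        self.pending.append(token)

    def apply(self, event):              # event transition (reconfiguration)
        self.factor = event

def streaming_transaction(net):
    # All enabled internal and output actions are grouped into one atomic
    # step: a streaming transaction ending in a quiescent state.
    while net.has_enabled_action():
        net.step()

net = TinyNetwork()
for token, event in [(1, None), (2, 10), (3, None)]:
    net.accept(token)
    streaming_transaction(net)   # complete the reaction to received input...
    if event is not None:
        net.apply(event)         # ...so the event can safely be applied
print(net.out)                   # -> [2, 4, 30]
```

Because the reaction to input consumed before the event is completed first, the outputs for the first two tokens are unaffected by the reconfiguration, matching the predictability requirement stated above.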
8.4 Implementation Issues

The operational semantics of RPN defines the boundaries of the behavior that correct implementations of RPNs should adhere to. Within these boundaries, there is still some room for making specific implementation decisions, depending on the application and the context. In this section, we briefly discuss some of these considerations.

Coordinating Streaming and Events

One of the most powerful aspects of KPNs is that execution can take place fully asynchronously. Processes need not synchronize, and determinacy of the output is automatically guaranteed. The (deliberate) drawback of the generalization to RPN is that this advantage is (partially) lost. Before processing an input event, the streaming input to the network, conceptually, needs to be frozen and all data must be processed internally. Only when all data has been processed can the event be applied and the data-flow continued. If implemented in this way, the pipelining of the data-flow may be disrupted, and deadlines may be missed if the nature of the application does not allow such a disruption.
In practice, one can do better for many classes of systems. Instead of processing an event for the whole process network at once, it may in some cases be possible to make the changes along with the 'information flow'. In particular, if the response of a network to an event is the forwarding of the event to one or more of its subprocesses, then this forwarding can be synchronized with the flow of data such that pipelining need not be interrupted. This is done, for instance, for the asynchronous messages following the stream wavefront in StreamIt [49]. The discussed operational semantics suggests that events must always be accepted by any process. In practice, it can be useful to allow a process some control over the moment when events are accepted, for example, to let it accept events only at moments when the corresponding transformation is easiest to perform because the process is in a well-defined state, e.g., at frame boundaries. Such an approach can, for example, be easily implemented if the underlying process network is an SDF graph that can be statically scheduled. It is then possible to define a cyclic schedule in such a way that one iteration of the cycle constitutes a single streaming transaction. A test for newly arrived events can then be inserted at the beginning of the cycle, where event transitions can be safely executed; a sketch of this scheme follows at the end of this section.

Deadlock Detection and Resolution

The correct execution of KPNs using bounded FIFO implementations depends on the run-time environment to deal with artificial deadlock situations (see Section 6, [40], [19]). The same situation may arise in reactive process networks. The dependencies may now also include event channels. One can deal with these artificial deadlocks in a similar manner as for ordinary KPNs. Solving an artificial deadlock may be needed for completing the maximal transaction before processing an event.
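As an illustration of the statically scheduled case, the sketch below (ours; the two-actor graph, its 2:1 rates, and the gain event are invented) runs one iteration of a precomputed cyclic SDF schedule as a single streaming transaction and tests an event queue only at the start of each cycle.

```python
import queue

class Src:
    """Produces one token per firing (rate 1); 'gain' is reconfigurable."""
    def __init__(self, out):
        self.out, self.gain, self.n = out, 1, 0
    def fire(self):
        self.out.append(self.gain * self.n)
        self.n += 1

class Snk:
    """Consumes two tokens per firing (rate 2) and sums them."""
    def __init__(self, inp, results):
        self.inp, self.results = inp, results
    def fire(self):
        self.results.append(self.inp.pop(0) + self.inp.pop(0))

chan, results, events = [], [], queue.Queue()
src, snk = Src(chan), Snk(chan, results)
schedule = [src.fire, src.fire, snk.fire]    # one SDF iteration (2 src, 1 snk)

events.put(lambda: setattr(src, "gain", 5))  # sporadic reconfiguration event
for _ in range(2):                           # each cycle = one transaction
    try:
        events.get_nowait()()                # event test at the cycle start
    except queue.Empty:
        pass
    for fire in schedule:
        fire()
print(results)   # -> [5, 25]
```

Because the channel is empty at the start of every cycle, applying the event there cannot interfere with any in-flight data of the current iteration.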
Fig. 14 SADF model of an MPEG-4 SP decoder
8.5 Analyzable Models Embedded in RPN

KPN is a model of determinate stream processing, but due to its expressiveness it has many undecidable aspects. In practice, Synchronous Data Flow (SDF) graphs or Cyclo-Static Data Flow (CSDF) graphs are often used as restricted but analyzable subsets of KPN. Obviously, as an extension of KPN, RPN also has many undecidable aspects. The SADF model of computation uses Synchronous Data Flow graphs as the streaming model, extended with time according to [44] to capture performance aspects. For the control aspects, it uses Markov chains, which can be seen as finite state machines decorated with transition probabilities. These probabilities are intended to capture the typical behavior of the application in a particular use case. As such, it is comparable to the earlier Heterochronous Data Flow [21] model, which also combines SDF and FSMs, but without time and probabilities. Figure 14 shows an example of an SADF graph of an MPEG-4 Simple Profile decoder. An input stream is analyzed by the FD (frame detector) process, which can observe the frame type of the next incoming frame. Different frames may employ a different number of macro blocks. In the figure, this number is denoted by x. For every given number x, the sub-network of the other processes is an SDF graph with fixed rates, which allows a static schedule and the performance of which can be analyzed off-line. For every frame, the frame detector sends an event message to the rest of the network indicating the proper frame type, and the sub-network makes the necessary changes and invokes the appropriate schedule. The Markov model of the transitions between frame types can be used to study the expected performance of the given system [48]. Alternatively, a worst-case analysis can be done to establish guaranteed performance using techniques such as those introduced in [42] and [18]. Another example of an efficient combination of (restricted structures of) SDF with events is StreamIt [49], which employs a concept called teleport messaging [50] for sending sporadic messages along the information wavefront. Upstream actors or processes can send event messages to downstream actors; such a message arrives at a specified iteration distance from the iteration in which the data that depends on the output currently being produced arrives. In this case, the compiler and scheduler can automatically take care of the required synchronization.
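To illustrate how the Markov chain over frame types supports expected-performance analysis, here is a small sketch (ours; the frame types, probabilities, and cycle counts are invented, and the Monte-Carlo estimate is a crude stand-in for the exact SADF analysis of [48]).

```python
import random

# Hypothetical scenario costs: each frame type fixes the SDF rates, so a
# static schedule with a known (here: invented) cycle count exists per type.
CYCLES = {"I": 420, "P": 150}
# Markov chain over frame types, as in the SADF control model (invented).
MARKOV = {"I": {"I": 0.1, "P": 0.9}, "P": {"I": 0.2, "P": 0.8}}

def expected_cycles_per_frame(steps=100_000, seed=0):
    """Monte-Carlo estimate of the long-run average cost per frame; a crude
    stand-in for the exact Markov-chain analysis of [48]."""
    rng, state, total = random.Random(seed), "I", 0
    for _ in range(steps):
        total += CYCLES[state]                  # run the scenario's schedule
        r, acc = rng.random(), 0.0
        for nxt, p in MARKOV[state].items():    # sample the next frame type
            acc += p
            if r < acc:
                state = nxt
                break
    return total / steps

print(expected_cycles_per_frame())  # close to 199 for these invented numbers
```

For these invented numbers, the same long-run average (about 199 cycles per frame) follows directly from the stationary distribution of the chain; exact techniques along those lines are what [48] uses instead of sampling.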
9 Bibliography

Kahn Process Networks have been introduced by Kahn in [26]. Kahn and MacQueen [27] presented a programming/implementation paradigm for KPNs as the behavior of a model of (sequential) programs reading and writing tokens on channels. The Kahn Principle, stating that the operational behavior of such an implementation model conforms to the denotational semantics of [26], was introduced, but not proved, by Kahn in [26]. It was proved in [16, 45] for an operational model of transition systems
and in [35] for I/O automata. The operational semantics in terms of I/O automata is very much like the semantics in Section 3, except that in the I/O-automata semantics, FIFOs are not modeled explicitly; it is assumed that they are implicit in the transition systems of the processes at the lowest level. This means that the processes will accept any input at any given time. Composition of networks is then achieved with synchronous communication. By making channels and their FIFOs more explicit, one can reason about realizations including memory management (FIFO capacities). KPN is often informally regarded as the upper bound of a hierarchy of data-flow models, although technically speaking it is not entirely obvious how to compare data-flow processes based on firing rules with either the denotational or operational semantics of KPN. The relationship between both types of models is elaborated in [33]. Scheduling and resource management, possibly in a distributed fashion, are important subjects for realizing KPN implementations or simulators. The scheduling of process networks using statically bounded channels by Thomas Parks [40] is an important contribution towards this aim, introducing an algorithm that uses bounded memory if possible. Based on this scheduling policy, a number of tools and libraries have been developed for executing KPNs. YAPI [28] is a C++ library for designing stream-processing applications. Ptolemy II [30] is a framework for codesign using mixed models of computation. The process-network domain is described in [22]. The Distributed Process Networks of [51] form the computational back end of the Jade/PAGIS system for processing digital satellite images. [46] covers an implementation of process networks in Java. [1] is another implementation for digital signal processing. Common among all these implementations is a multi-threading environment in which the processes of the KPN execute in their own threads of control and channels are allocated a fixed capacity. Semaphores control access to channels and block the thread when reading from an empty or writing to a full channel. This raises the possibility of a deadlock in which one or more processes are permanently blocked on full channels. A special thread (preempted by the other threads) is used to detect a deadlock and initiate a deadlock resolution procedure when necessary. This essentially realizes the scheduling policy of [40]. The algorithm of Parks leaves some room for optimization of memory usage by careful selection of initial channel capacities (using profiling) and clever selection of channels when the capacity needs to be increased; see [3]. [3] also introduces causal chains, used in this chapter as well to define deadlocks. It is argued in [19], as discussed in Section 6, that a run-time scheduler for KPN should include local deadlock detection. Such deadlock detection has subsequently been implemented in a number of KPN implementations [2, 15, 24, 38]. To optimize the process, it can be organized in a distributed fashion [2] or in a hierarchical way [24]. The desire to express non-deterministic behavior and event-based communication has led to additions to pure data-flow models. Examples are the probes of [36] and the select statement of [28]. [36] describes an extension of the data-flow model with so-called probes, the possibility to test whether a channel has data available for reading. This can be
a powerful construct, but it destroys the property of determinacy; the behavior is no longer independent of the concrete scheduling. This probe construct inspired the designers of YAPI [28] to introduce the select statement, which has the same disadvantage. In Ptolemy [22, 30], a framework is defined to connect multiple models of computation, including data-flow and event-based ones. The combination of the reactive and process network domains, however, induces too much synchronization overhead to be used for implementation. Many of the combinations of data-flow and reactive behavior are based on a combination of the Synchronous Data Flow model with some form of reactive behavior [21, 29, 47–49]. The use of an analyzable model such as SDF is natural because it allows for a predictable and determinate combination. The reactive part is frequently specified using (hierarchical) state machines. [29] describes a combination of hierarchical state machines and SDF models, where a complete iteration of the SDF is taken as an action of the state machine. A state machine inside an SDF is required to adhere to the SDF firing characteristics. FunState [47] defines a model, described as 'functions driven by state machines', which deliberately tries to separate functional behavior from control. For FunState, the control part also extends to the coordination of the processes that form a data-flow graph, and the functions are comparable to separate KPN processes. *charts [21] separates hierarchical finite state machine models from concurrency models. In the data-flow domain, a combination of hierarchical finite state machines with Synchronous Data Flow is presented, called Heterochronous Data Flow (HDF). To combine the finite state transitions of an FSM with the typically infinite behavior associated with concurrent models of computation, the infinite behavior is split into finite pieces. For SDF, separate iterations are used. Similarly, FunState uses individual functions for that purpose. For KPN inside RPN, we have used complete reactions in Section 8. Scenario-Aware Data Flow (SADF) [18, 48] models a system as a finite state machine (decorated with transition probabilities to form a Markov chain) of SDF graphs and is similar to HDF, although primarily with the intention to capture dynamic variation within a determinate execution rather than to capture indeterminate behavior. SADF focuses less on functional specification than on performance modeling, by including timing behavior and a stochastic abstraction of the automaton behavior. Moreover, SADF deals explicitly with the effects of pipelining different iterations of the SDF graphs and the intermediate automaton transitions. StreamIt [49] employs an SDF-like model of computation. It allows sending sporadic events between processes in the network by a construct called teleport messaging [50]. The scheduling and compilation framework automatically ensures that all data in the pipeline between the sender and receiver is processed before the message is delivered to the receiving process, similar to the RPN model. [7] describes parameterizable SDF models, allowing dynamic reconfiguration at run-time. Processes can change their data-flow behavior depending on parameter settings. In a given SDF configuration, actor executions are characterized by iterations, which fire sub-processes in a particular order that returns the internal buffer states to their original configuration. Such a process can (only) be
reconfigured between such iterations, and if the data-flow behavior has changed, a new schedule is determined (at run-time). Similar to RPN in generality is the approach of Neuendorffer and Lee [37]. It focuses on reconfiguration as a particular kind of event handling, or change of model parameters. It defines quiescent states as the states where reconfigurations are allowed. These quiescent states are strongly related to the maximal streaming transactions. They also propose to use FIFO (First In First Out) channel communication for events or parameters, and to divide input ports into streaming input ports and parameter input ports. In contrast with [37], RPN considers more general event handling and reconfiguration than changing parameters. [37] also focuses on formalizing dependencies between parameters and quiescent points at different levels of a system hierarchy. The operational semantics of RPN connects easily to implementations of the model. This operational semantics is, however, not fully abstract, i.e., it contains more details than strictly necessary. [43] discusses fully abstract semantics of indeterminate data-flow models. The traditionally separate branch of synchronous data-flow languages such as Lustre [23] or Signal [4] treats sporadic events in a completely different way. Because these models are based on a global synchrony hypothesis (all system components execute conceptually at the pace of a single global clock), the absence of an event in a particular clock cycle is easily and deterministically detected. In this case, sporadic events do not necessarily lead to a non-deterministic model. The overhead of the globally synchronous clock may impact efficiency, however.
References

1. Allen G, Evans B, Schanbacher D (1998) Real-time sonar beamforming on a UNIX workstation using process networks and POSIX threads. In: Proc. of the 32nd Asilomar Conference on Signals, Systems and Computers, IEEE Computer Society, pp 1725–1729
2. Allen G, Zucknick P, Evans B (2007) A distributed deadlock detection and resolution algorithm for process networks. In: Acoustics, Speech and Signal Processing, 2007. ICASSP 2007. IEEE International Conference on, vol 2, pp II–33–II–36, DOI 10.1109/ICASSP.2007.366165
3. Basten T, Hoogerbrugge J (2001) Efficient execution of process networks. In: Chalmers A, Mirmehdi M, Muller H (eds) Proc. of Communicating Process Architectures 2001, Bristol, UK, September 2001, IOS Press, pp 1–14
4. Benveniste A, Le Guernic P (1990) Hybrid dynamical systems theory and the Signal language. IEEE Trans Automat Contr 35:535–546
5. Benveniste A, Caillaud B, Carloni LP, Caspi P, Sangiovanni-Vincentelli AL (2008) Composing heterogeneous reactive systems. ACM Trans Embed Comput Syst 7(4):1–36
6. Berry G, Gonthier G (1992) The Esterel synchronous programming language: Design, semantics, implementation. Sci Comput Program 19:87–152
7. Bhattacharya B, Bhattacharyya S (2001) Parameterized dataflow modeling for DSP systems. IEEE Transactions on Signal Processing 49(10):2408–2421
8. Bhattacharyya S, Murthy P, Lee E (1999) Synthesis of embedded software from synchronous dataflow specifications. J VLSI Signal Process Syst 21(2):151–166
9. Brock J, Ackerman W (1981) Scenarios: A model of non-determinate computation. In: Díaz J, Ramos I (eds) Formalization of Programming Concepts, International Colloquium, Peniscola, Spain, April 19-25, 1981, Proceedings, LNCS Vol. 107, Springer Verlag, Berlin, pp 252–259
10. Brookes S (1998) On the Kahn principle and fair networks. Tech. Rep. CMU-CS-98-156, School of Computer Science, Carnegie Mellon University
11. Broy M, Dendorfer C (1992) Modelling operating system structures by timed stream processing functions. Journal of Functional Programming 2(1):1–21, URL citeseer.nj.nec.com/broy92modelling.html
12. Buck J (1993) Scheduling dynamic dataflow graphs with bounded memory using the token flow model. PhD thesis, University of California, EECS Dept., Berkeley, CA
13. Carloni LP, Sangiovanni-Vincentelli AL (2006) A framework for modeling the distributed deployment of synchronous designs. Form Methods Syst Des 28:93–110
14. Davey BA, Priestley HA (1990) Introduction to Lattices and Order. Cambridge University Press, Cambridge, UK
15. Dulloo J, Marquet P (2004) Design of a real-time scheduler for Kahn Process Networks on multiprocessor systems. In: Proceedings of the International Conference on Parallel and Distributed Processing Techniques and Applications, PDPTA, pp 271–277
16. Faustini A (1982) An operational semantics for pure dataflow. In: Nielsen M, Schmidt EM (eds) Automata, Languages and Programming, 9th Colloquium, Aarhus, Denmark, July 12-16, 1982, Proceedings, LNCS Vol. 140, Springer Verlag, Berlin, pp 212–224
17. Geilen M (2009a) A hierarchical compositional operational semantics of Kahn Process Networks and its Kahn Principle. Tech. rep., Electronic Systems Group, Dept. of Electrical Engineering, Eindhoven University of Technology
18. Geilen M (2009b) Synchronous data flow scenarios. Transactions on Embedded Computing Systems, Special issue on Model-driven Embedded-system Design, to be published
19. Geilen M, Basten T (2003) Requirements on the execution of Kahn process networks. In: Degano P (ed) Proc. of the 12th European Symposium on Programming, ESOP 2003, Held as Part of the Joint European Conferences on Theory and Practice of Software, ETAPS 2003, Warsaw, Poland, April 7-11, 2003, LNCS Vol. 2618, Springer Verlag, Berlin
20. Geilen M, Basten T (2004) Reactive process networks. In: EMSOFT ’04: Proceedings of the 4th ACM International Conference on Embedded Software, ACM, New York, NY, USA, pp 137–146, DOI http://doi.acm.org/10.1145/1017753.1017778
21. Girault A, Lee B, Lee E (1999) Hierarchical finite state machines with multiple concurrency models. IEEE Transactions on Computer-aided Design of Integrated Circuits and Systems 18(6):742–760
22. Goel M (1998) Process networks in Ptolemy II. Technical Memorandum UCB/ERL No. M98/69, University of California, EECS Dept., Berkeley, CA
23. Halbwachs N, Caspi P, Raymond P, Pilaud D (1991) The synchronous programming language LUSTRE. Proceedings of the IEEE 79:1305–1319
24. Jiang B, Deprettere E, Kienhuis B (2008) Hierarchical run time deadlock detection in process networks. In: Signal Processing Systems, 2008. SiPS 2008. IEEE Workshop on, pp 239–244, DOI 10.1109/SIPS.2008.4671769
25. Jonsson B (1989) A fully abstract trace model for dataflow networks. In: POPL ’89: Proceedings of the 16th ACM SIGPLAN-SIGACT Symposium on Principles of Programming Languages, ACM, New York, NY, USA, pp 155–165
26. Kahn G (1974) The semantics of a simple language for parallel programming. In: Rosenfeld J (ed) Information Processing 74: Proceedings of the IFIP Congress 74, Stockholm, Sweden, August 1974, North-Holland, Amsterdam, Netherlands, pp 471–475
27. Kahn G, MacQueen D (1977) Coroutines and networks of parallel processes. In: Gilchrist B (ed) Information Processing 77: Proceedings of the IFIP Congress 77, Toronto, Canada, August 8-12, 1977, North-Holland, pp 993–998
28. de Kock E, et al (2000) YAPI: Application modeling for signal processing systems. In: Proc. of the 37th Design Automation Conference, Los Angeles, CA, June 2000, IEEE, pp 402–405
29. Lee B (2000) Specification and design of reactive systems. PhD thesis, Electronics Research Laboratory, University of California, EECS Dept., Berkeley, CA, Memorandum UCB/ERL M00/29
30. Lee E (2001) Overview of the Ptolemy project. Technical Memorandum UCB/ERL No. M01/11, University of California, EECS Dept., Berkeley, CA
31. Lee E, Messerschmitt D (1987) Synchronous data flow. Proceedings of the IEEE 75(9):1235–1245
32. Lee E, Sangiovanni-Vincentelli A (1998) A framework for comparing models of computation. IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems 17(12):1217–1229, DOI 10.1109/43.736561
33. Lee EA, Matsikoudis E (2007) The semantics of dataflow with firing. Preprint of a chapter from "From Semantics to Computer Science: Essays in Memory of Gilles Kahn", Cambridge University Press. URL http://chess.eecs.berkeley.edu/pubs/428.html
34. Liu X, Lee EA (2008) CPO semantics of timed interactive actor networks. Theor Comput Sci 409(1):110–125, DOI http://dx.doi.org/10.1016/j.tcs.2008.08.044
35. Lynch N, Stark E (1989) A proof of the Kahn principle for Input/Output automata. Information and Computation 82(1):81–92, URL citeseer.nj.nec.com/lynch89proof.html
36. Martin A (1985) The probe: An addition to communication primitives. Information Processing Letters 20(3):125–130
37. Neuendorffer S, Lee EA (2004) Hierarchical reconfiguration of dataflow models. In: Proc. Second ACM-IEEE International Conference on Formal Methods and Models for Codesign (MEMOCODE 2004), IEEE Computer Society Press, to appear
38. Olson A, Evans B (2005) Deadlock detection for distributed process networks. In: Acoustics, Speech, and Signal Processing, 2005. Proceedings. (ICASSP ’05). IEEE International Conference on, vol 5, pp v/73–v/76, DOI 10.1109/ICASSP.2005.1416243
39. Park D (1979) On the semantics of fair parallelism. In: Abstract Software Specifications, LNCS Vol. 86, Springer Verlag, Berlin
40. Parks T (1995) Bounded scheduling of process networks. PhD thesis, University of California, EECS Dept., Berkeley, CA
41. Plotkin G (1981) A structural approach to operational semantics. Tech. Rep. DAIMI FN-19, Århus University, Computer Science Department, Århus, Denmark
42. Poplavko P, Basten T, van Meerbergen J (2007) Execution-time prediction for dynamic streaming applications with task-level parallelism. In: DSD ’07: Proceedings of the 10th Euromicro Conference on Digital System Design Architectures, Methods and Tools, IEEE Computer Society, Washington, DC, USA, pp 228–235, DOI http://dx.doi.org/10.1109/DSD.2007.52
43. Russell J (1989) Full abstraction for nondeterministic dataflow networks. In: Proceedings of the Symposium on Foundations of Computer Science, pp 170–175, DOI http://doi.ieeecomputersociety.org/10.1109/SFCS.1989.63474
44. Sriram S, Bhattacharyya SS (2000) Embedded Multiprocessors: Scheduling and Synchronization. Marcel Dekker, Inc., New York, NY, USA
45. Stark E (1987) Concurrent transition system semantics of process networks. In: Proc. of the 1987 SIGACT-SIGPLAN Symposium on Principles of Programming Languages, Munich, Germany, January 1987, ACM Press, pp 199–210
46. Stevens R, Wan M, Laramie P, Parks T, Lee E (1997) Implementation of process networks in Java. Technical Memorandum UCB/ERL No. M97/84, University of California, EECS Dept., Berkeley, CA
47. Strehl K, Thiele L, Gries M, Ziegenbein D, Ernst R, Teich J (2001) FunState - an internal design representation for codesign. IEEE Transactions on Very Large Scale Integration (VLSI) Systems 9(4):524–544, URL citeseer.nj.nec.com/strehl01funstate.html
48. Theelen BD, Geilen M, Basten T, Voeten J, Gheorghita SV, Stuijk S (2006) A scenario-aware data flow model for combined long-run average and worst-case performance analysis. In: MEMOCODE, pp 185–194
49. Thies W, Karczmarek M, Amarasinghe S (2002) StreamIt: A language for streaming applications. In: Horspool RN (ed) Compiler Construction, 11th International Conference, CC 2002, Held as Part of the Joint European Conferences on Theory and Practice of Software, ETAPS 2002, Grenoble, France, April 8-12, 2002, Proceedings, LNCS Vol. 2306, Springer Verlag, Berlin, pp 179–196
50. Thies W, Karczmarek M, Sermulins J, Rabbah R, Amarasinghe S (2005) Teleport messaging for distributed stream programs. In: PPoPP ’05: Proceedings of the Tenth ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming, ACM, New York, NY, USA, pp 224–235, DOI http://doi.acm.org/10.1145/1065944.1065975
51. Vayssière J, Webb D, Wendelborn A (1999) Distributed process networks. Tech. Rep. TR 9903, University of Adelaide, Department of Computer Science, South Australia 5005, Australia
52. Yates RK (1993) Networks of real-time processes. In: Best E (ed) CONCUR’93: Proc. of the 4th International Conference on Concurrency Theory, Springer Verlag, Berlin, Heidelberg, pp 384–397
Methods and Tools for Mapping Process Networks onto Multi-Processor Systems-On-Chip

Iuliana Bacivarov, Wolfgang Haid, Kai Huang, Lothar Thiele
Computer Engineering and Networks Laboratory, ETH Zurich, Switzerland
Applications based on the Kahn process network (KPN) model of computation are determinate, modular, and based on FIFO communication for inter-process communication. These properties not only allow KPN applications to execute efficiently on multi-processor systems-on-chip (MPSoC), they also enable the automation of the design process. This chapter focuses on the second aspect and gives an overview of methods for automating the design process of KPN applications implemented on MPSoCs. Whereas previous chapters mainly introduced techniques that apply to restricted classes of process networks, this overview deals with general Kahn process networks.
1 Introduction

The multi-processor system-on-chip (MPSoC) is one of the most promising and solid paradigms for implementing embedded systems for signal processing in communication, medical, and multi-media applications. MPSoC platforms are heterogeneous by nature, as they use multiple computation, communication, memory, and peripheral resources. They allow the parallel execution of (multiple) applications and, at the same time, they offer the flexibility to optimize performance, energy consumption, or cost of the system. Nevertheless, to optimize an MPSoC in the presence of tight time-to-market and budget constraints, a systematic design flow is required. To deal with this challenge, Kienhuis et al. [1] suggested structuring the design flow in a certain manner, now commonly referred to as the Y-chart approach. It is a systematic methodology for selecting an embedded system implementation from a set of alternatives, a process often denoted as design space exploration. One key idea underlying this approach is to explicitly separate application and architecture
specifications. A separate mapping specification describes how the application is spatially (binding) and temporally (scheduling) executed on the architecture. Design space exploration is then performed by iteratively analyzing and optimizing the application, the structure of the underlying (hardware) architecture, as well as candidate mappings, as shown in Fig. 1.
Fig. 1 Y-chart approach for designing MPSoC.
Many design flows implementing the Y-chart approach have been proposed. For a review, see [2]. These flows have in common that they impose a set of system-level concepts to facilitate design space exploration, such as the use of a formal model of computation, restrictions on the set of scheduling policies, and reliance on modular specifications. For instance, the application may be formally specified as a data flow model, a synchronous model, or a discrete event model, in order to enable automated performance analysis. In a similar way, resource sharing policies may be limited to an event-triggered or a time-triggered policy, to prune the design space. Finally, using modular system-level specifications enables quick system modifications concerning the application, architecture, and mapping. In the context of (array) signal processing applications executing on MPSoCs, the Kahn process network (KPN) model of computation [3] is frequently used. Assuming a network of autonomous, concurrently executing processes that communicate point-to-point via unbounded FIFO channels, the KPN model has additional favorable properties. The KPN model is determinate, i.e., the functional behavior is independent of the scheduling of processes. The inter-process communication via FIFO channels using blocking read semantics can be efficiently implemented in software, in hardware, or in heterogeneous HW/SW systems. Computation and control are completely distributed, requiring no global synchronization, communication, or memory. The resulting modularity allows applications to be scaled easily and opens up many degrees of freedom for implementing a system. Due to these properties, the KPN model of computation is “compatible” with the Y-chart approach and has led to numerous design flows. Although they share the same model of computation, these design flows consider different design objectives, they focus on different aspects, and they leverage different properties of the KPN model
or one of its subclasses. In this chapter, an overview of KPN-based design flows is given, emphasizing both similarities and differences in these flows. The following section reviews existing design flows and the way they relate to the Y-chart. Afterwards, a closer look at the individual steps in the design flow is taken and several methods to tackle them are presented. Finally, for exemplification, a specific design flow is considered in detail.
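Before turning to concrete flows, a small sketch may help fix ideas about the Y-chart separation (hypothetical throughout; every process, resource, and policy name is invented). The three specifications are kept as separate pieces of plain data; only the mapping ties the application to the architecture.

```python
# Hypothetical Y-chart specifications as plain data; all names are invented.
application = {                      # KPN: processes and FIFO channels
    "processes": ["producer", "filter", "consumer"],
    "channels":  [("producer", "filter"), ("filter", "consumer")],
}
architecture = {                     # available hardware resources
    "processors":    ["dsp0", "risc0"],
    "interconnects": ["bus0"],
    "memories":      ["shared_mem"],
}
mapping = {                          # separate: binding plus scheduling
    "binding":     {"producer": "dsp0", "filter": "dsp0", "consumer": "risc0"},
    "channel_via": {("filter", "consumer"): ("bus0", "shared_mem")},
    "scheduling":  {"dsp0": "fixed_priority", "risc0": "fifo"},
}
```

Design space exploration then amounts to varying the mapping (and, for configurable platforms, the architecture) while the application specification stays untouched.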
2 KPN Design Flows for Multiprocessor Systems

Several design flows based on the Y-chart approach and the KPN model have been developed. Table 1 shows an (incomplete) list of design flows targeted at the implementation of KPN applications on MPSoC platforms. In addition to the listed design flows, there are other approaches related to the KPN model with different aims. Ptolemy [4] and Metropolis [5] allow the analysis as well as the simulation of applications specified as KPNs, among other models of computation. However, they are targeted more towards hardware/software codesign, and in particular towards system synthesis and verification. Design space exploration is not the main focus of these frameworks. The Mathworks Real-Time Workshop [6] and the National Instruments LabVIEW Microprocessor SDK [7] target the implementation of signal processing applications on single-processor systems. SystemCoDesigner [8] and PeaCE [9] are HW/SW codesign flows based on a model of computation that combines the KPN model with finite state machines, see chapter 4.09. Note that even though the focus of this chapter is on the design flows for MPSoC listed in Table 1, many of the presented ideas also apply to the other mentioned design flows.

Table 1 KPN design flows for MPSoCs.

design flow | web page
Artemis [10] | http://daedalus.liacs.nl
Distributed Operation Layer (DOL) [11] | http://www.tik.ee.ethz.ch/~shapes
Embedded System-Level Platform Synthesis and Application Mapping (ESPAM) [12] | http://daedalus.liacs.nl
Koski [13] | not available online
Multiapplication and Multiprocessor Synthesis (MAMPS) [14] | http://www.es.ele.tue.nl/mamps
Open Dataflow (OpenDF) [15] | http://opendf.sourceforge.net
Software/Hardware Integration Medium (SHIM) [16] | not available online
StreamIt [17] | http://www.cag.lcs.mit.edu/streamit
Generally, KPN design flows for MPSoCs respect the four design phases of the Y-chart: system specification, performance analysis, design space exploration, and system synthesis, as shown in Fig. 1. Based on these four phases, the design process can be described as follows. The starting point of the design flow is a parallelized KPN specification of the
application. In this specification, the coarse-grain data and functional parallelism of the application is made explicit. Fine-grained word- or instruction-level parallelism can effectively be handled by today’s compilers. Usually, the KPN is manually specified by the programmer. There are, however, also tools available that allow deriving a KPN from sequential programs, such as the Compaan [18] and pn [19] tools. KPN design flows usually provide a functional simulation capability that enables the execution of the KPN specification on a standard single-processor machine in a multitasking environment. Due to the determinacy of KPNs, the timing-independent functionality of the application can be validated this way. Second, the architecture needs to be specified. This is frequently done in the form of a system-level specification describing the architectural resources, such as processors, memories, interconnects, and I/O devices. This specification can either describe a fixed MPSoC or the template of a configurable MPSoC platform. In both cases, the architecture specification needs to contain all the information required for design space exploration and performance analysis. In the case of a configurable platform, the architecture specification is also the basis for the synthesis of the final target platform later in the design flow. Hence, it needs to contain the information required by the RTL synthesis tool, such as references to VHDL or Verilog code of hardware components, complete IP blocks, and configuration files. The application and architecture specification phase is followed by defining a mapping of the application onto the architecture. In this step, processes are bound to processors, and channels are bound to communication paths containing memories and interconnects. In addition, the scheduling and arbitration policies for shared resources are defined. Usually, the final mapping is the result of a design space exploration, which is done based on system performance analysis. The methods applied for performance analysis range from simple back-of-the-envelope calculations to formal analysis methods, simulations, and measurements. In KPN design flows, performance analysis during design space exploration is possible and is usually done at a rather high level of abstraction. As shown in the next section, different methods targeted towards KPN applications have been proposed in this context that achieve high accuracy within short analysis times. Being able to defer the use of simulation or measurements until late in the design cycle is one of the key advantages of KPN design flows. After manual or automated design space exploration, the system is finally implemented by making use of appropriate synthesis techniques. For this purpose, KPN design flows feature powerful synthesis tools that implement a system based on the application, architecture, and mapping specification in software, hardware, or both hardware and software. Clearly, this is a key advantage of KPN design flows, because the pitfalls of implementing a parallel system, such as hardware-software interface generation, deadlocks, starvation, and data races, are handled in an automated way. The design flows listed in Table 1 implement this basic Y-chart approach in different ways: On the one hand, the methods that are applied in each of the four phases differ between the design flows, as discussed in the next section. On the other hand, the scope (set of optimization variables) of design space exploration is different.
Table 2 Summary of case studies performed using different KPN design flows.

design flow | case study application | target platform | performance analysis | exploration method
Artemis [20] | motion-JPEG encoder | Molen architecture on Xilinx Virtex-II Pro FPGA | trace-driven simulation | evolutionary algorithm
DOL [11] | MPEG-2 decoder | Atmel DIOPSIS 940 | real-time analytic model | evolutionary algorithm
ESPAM [12] | motion-JPEG encoder | multi-MicroBlaze on Xilinx Virtex-II Pro FPGA | high-level simulation | exhaustive search
Koski [13] | WLAN terminal | multi-NIOS on Altera Stratix-II FPGA | high-level simulation | simulated annealing
MAMPS [14] | H263 and JPEG decoders | multi-MicroBlaze on Xilinx Virtex-II Pro FPGA | measurement | dedicated heuristic
OpenDF [21] | MPEG-4 SP decoder | FPGA (no particular type specified) | not applicable | not applicable
SHIM [22] | JPEG decoder | Sony/Toshiba/IBM Cell BE | not applicable | not applicable
StreamIt [17] | 12 streaming applications | RAW architecture | SDF analytic model | simulated annealing
Basically, one can distinguish between software design flows, where the target platform is fixed, and hardware/software co-design flows, where a template of a target platform is given and the instantiation of a specific platform is part of the design space exploration. This is shown in Table 2, where a few case studies are summarized that have been performed using the design flows listed in Table 1. DOL, SHIM, and StreamIt assume fixed hardware platforms, whereas the scope of the other design flows encompasses the implementation of the target platform on FPGAs.
3 Methods

KPN design flows attempt to assist a system designer in implementing an application as a hardware/software system by offering support for several activities such as:

• system specification,
• system synthesis,
• performance analysis, and
• design space exploration.
For each of these activities, methods have been proposed that differ in goal, scope, degree of automation, and complexity. In the previous chapters, mainly methods for subclasses of KPNs have been discussed. In this chapter, we give an overview of methods that are applicable to general KPNs in the context of MPSoCs. For each of the activities mentioned above, we discuss the challenges and proposed solutions.
3.1 System Specification

Developing applications that run correctly and efficiently on MPSoCs is challenging. The difficulty lies in finding an appropriate level of abstraction that balances the conflicting goals of (a) developing applications in a productive manner and (b) enabling efficient automated implementation. While productivity is usually achieved by programming at a high abstraction level, efficiency is usually achieved by optimizing code at a low abstraction level. Many case studies provide evidence that, for streaming applications, the KPN model of computation achieves a good trade-off between those two goals. On the one hand, streaming applications can often naturally be modeled as a KPN, which promotes productivity. On the other hand, runtime environments have been developed that efficiently implement processes and channels. Specifically, the KPN model can be seen as a coordination model [23], which considers the programming of a distributed system as the combination of two distinct activities: the actual computing part, comprising a number of processes involved in manipulating data, and a coordination part, reflecting the communication and cooperation between the processes. The coordination model allows reuse of components, because the application programmer can easily build new algorithms by a new composition of existing processes. Furthermore, the coordination model allows applications to be ported to different target architectures, because usually only the glue code that implements the coordination part is architecture dependent. For these reasons, KPN applications are usually specified in a way that reflects the coordination model; a sketch of this separation follows at the end of this section. Two different approaches can be distinguished, namely specification using a host together with a coordination language, and specification using a domain-specific language. When using separate host and coordination languages, the KPN processes are specified in a host language (often C or C++), whereas the coordination part is specified separately using a coordination language (often XML or UML). This is, for instance, the approach taken in the Artemis and DOL design flows, where C and XML are used. When using a domain-specific language, computation and coordination are expressed in a single language that provides constructs for both parts. OpenDF and StreamIt, for instance, are based on domain-specific languages. In both cases, applications are usually expressed based on the principles of encapsulation and explicit concurrency: Each process completely encapsulates its own state together with the code that operates on it, and operates independently from other processes in the system, except for the data dependencies that are made explicit by channels. This allows for modular, scalable, and platform-independent application specifications. System specification for KPNs is thus different from two other frequently used approaches for MPSoC software development, namely specification based on a board support package and specification based on a high-level application programming interface (API). When developing an application based on the board support package that is usually shipped with an MPSoC, the abstraction level is rather low. The focus is thus often on correctly implementing an application using low-level
primitives for initialization, communication, or synchronization, rather than on optimizing an application. When using a high-level API, the designer is relieved from dealing with low-level details (provided that the API has been ported to the target MPSoC). Compared to the KPN-based approach, however, automatically optimizing programs written using a high-level API, such as MPI or OpenMP, is more difficult: Due to the lack of an underlying model of computation, the basis for automatically analyzing and optimizing a program is essentially missing.
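The promised sketch of the coordination-model separation follows (ours; in a flow like DOL the kernels would be C functions and the network description XML, whereas here both are expressed in Python for brevity, and all names are invented).

```python
# Computation part: ordinary sequential kernels, unaware of the network.
def scale(token):
    return 2 * token

def offset(token):
    return token + 1

# Coordination part: a separate, declarative network description; in a flow
# like DOL this would be XML and the kernels C.
NETWORK = {
    "processes": {"scale": scale, "offset": offset},
    "pipeline":  ["scale", "offset"],   # channel structure of this toy chain
}

def run(network, tokens):
    """Trivial single-machine glue code: only this part is target-specific;
    the kernels and the coordination description stay unchanged when porting."""
    out = []
    for t in tokens:
        for name in network["pipeline"]:
            t = network["processes"][name](t)
        out.append(t)
    return out

print(run(NETWORK, [1, 2, 3]))  # -> [3, 5, 7]
```

Reusing a process in a new composition amounts to editing only the coordination description, which is what makes this style productive and portable.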
3.2 System Synthesis

The Y-chart approach opens a gap between the system-level specification and the actual implementation of the design, sometimes referred to as the implementation gap. The challenge in bridging this gap is to preserve the KPN semantics on the one hand and to achieve the desired performance on the other hand. Also, the pitfalls of parallel programming, such as deadlocks, starvation, and data races, need to be handled. This is the task of system synthesis. Different approaches for the synthesis of KPNs on software, hardware, and combined hardware/software platforms have been proposed. The target architectures comprise single processors, multi-processors, and FPGAs. In all cases, system synthesis deals with the implementation of processes and channels, as well as the arbitration of resources in case processes and channels are mapped to shared resources. If not all parameters of an implementation are fixed before synthesis, the remaining degrees of freedom need to be exploited during system synthesis. In that case, system synthesis is often considered as an optimization problem, where frequently considered optimization goals are the minimization of code size, the minimization of buffer requirements, or the determination of schedules that minimize delays and maximize system throughput. While many of these problems can be solved only for restricted subclasses, a few observations apply to general KPNs:

• First, KPN applications can be efficiently implemented on architectures with different processor, interconnect, and memory configurations, as shown in Table 2. As an example, KPN applications can be implemented on distributed memory (message-passing) architectures as well as shared memory architectures. The FIFO communication can be implemented using dedicated hardware FIFOs or buses, but also more complex communication topologies, such as hierarchical buses or networks-on-chip.
• Second, KPN applications can be executed in a purely data-driven manner based on their determinacy. This means that resources can operate independently from each other without any global synchronization. Pair-wise synchronization is only needed between processes that are directly connected by channels. From another perspective, this means that KPN applications can be scheduled with any scheduling policy that prevents deadlocks, i.e., preemptive, non-preemptive, or cooperative scheduling could be used. Due to these very relaxed
requirements, KPN applications can usually be implemented easily on top of existing (real-time) operating systems. On the other hand, implementing a runtime environment for a new platform from scratch is also possible, because not many services need to be provided by the runtime environment.
• Third, KPN applications can be easily partitioned into processes running in hardware and processes running in software. This is due to the parallel specification of the KPN application on the one hand, and due to the simple interaction of processes over FIFO channels on the other hand, which facilitates the synthesis of the HW/SW interface.

The observations above indicate that synthesizing a KPN is conceptually not a very difficult task. Implementing a KPN based on a multi-processor operating system, for instance, is rather simple: Processes can be implemented as operating system processes or threads, and channels can be implemented using existing inter-process communication schemes (see the sketch at the end of this section). The difficulties in KPN synthesis originate from optimizing an implementation by minimizing the overhead for FIFO communication and the runtime environment. This can be achieved by considering low-level details of an implementation, for instance, by efficiently using the hardware communication infrastructure (e.g., DMA engines) or by efficiently using the memory hierarchy (e.g., caches or scratchpad memories). On the other hand, optimizations can also be done at a high level, for instance, by (automatically) adjusting the granularity and topology of a KPN to the target architecture. This includes the replication of processes to increase the parallelism in a KPN, or the merging of processes to reduce inter-process communication. We refrain from giving further details here and refer to the previous chapters for details on applicable techniques. Finally, a further problem needs to be considered in the synthesis of KPNs: The denotational semantics of the Kahn model is based on FIFO channels with unbounded capacity. Since unbounded channels cannot be realized in physical implementations, KPNs need to be transformed in a way that allows for an implementation with channels of finite capacity. It can be shown that an operational semantics of KPNs based on channels with finite capacity matches the denotational semantics when artificial deadlocks can be avoided. An artificial deadlock is a deadlock caused by one or more channels having insufficient capacity. Due to the Turing-completeness of KPNs, however, it is in general not possible to determine sufficient channel sizes at design time. One possibility to deal with this situation are runtime approaches that detect and resolve artificial deadlocks during execution [24]. Another possibility is to restrict the communication behavior of the processes such that the channels become amenable to analysis at design time.
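The following runnable sketch (ours, not taken from any of the cited tools) shows how simple such a thread-based implementation can be: each process is a thread, each channel a bounded FIFO with blocking reads and writes. Note that a production runtime would additionally need the artificial-deadlock detection discussed above, which this sketch omits.

```python
import threading, queue

def process(read, write, fn):
    """KPN process as a thread: blocking reads, and blocking writes on a
    bounded FIFO; synchronization is pair-wise, through the channels only."""
    def body():
        while True:
            token = read.get()        # blocks when the channel is empty
            if token is None:         # simple end-of-stream convention
                write.put(None)
                return
            write.put(fn(token))      # blocks when the channel is full
    return threading.Thread(target=body, daemon=True)

a, b, c = (queue.Queue(maxsize=4) for _ in range(3))   # bounded channels

process(a, b, lambda t: t * t).start()   # two-stage pipeline
process(b, c, lambda t: t + 1).start()

for t in list(range(5)) + [None]:
    a.put(t)
print(list(iter(c.get, None)))           # -> [1, 2, 5, 10, 17]
```

Determinacy guarantees that the printed output is independent of how the operating system interleaves the two threads; only the channel capacities influence how far the pipeline can run ahead.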
3.3 Performance Analysis

During the design process, a designer is typically faced with questions such as whether the timing properties of a certain design will meet the design requirements, which architectural element will act as a bottleneck, or what the memory
requirements will be. Consequently, one of the major challenges in the design process is to quantitatively analyze specific characteristics of a system, such as end-to-end delays, buffer requirements, throughput, or energy consumption. We refer to this analysis as performance analysis. The performance analysis of KPNs executing on MPSoCs poses a major challenge due to the multiple and heterogeneous hardware resources, the distributed execution of the application, and the interaction of computation and communication on shared resources. To deal with these challenges, multiple methods have been successfully used in the context of KPN design flows. These methods differ in accuracy, evaluation time, set-up effort, and scope.
Fig. 2 Scope of different performance analysis methods for MPSoC (measurement, simulation, WCET/BCET analysis, and probabilistic analysis, compared against the best-case/worst-case range of a performance metric in the real system).
In Fig. 2, the scope of different performance analysis methods is compared. Leftmost, the interval of values for a performance metric as occurring in the real system is shown. This performance metric could be the end-to-end delay of a system, the utilization of a computation or communication resource, or the occupation of a channel buffer, for instance. The different performance analysis methods differ regarding the values that can be obtained. When taking measurements of the real system, the measured values represent only a subset of all possible values. Most likely, due to insufficient coverage of corner cases and the limited number of measurement samples, the interval bounds can only be estimated based on the measurements. This observation applies to simulation as well. Best-case and worst-case analysis methods take a different approach by providing safe results about the interval bounds, i.e., upper and lower bounds on the worst-case and best-case behavior, respectively. On the other hand, usually not all parts of a system can be modeled accurately. In that case, (safe) optimistic and pessimistic assumptions need to be made, leading to bounds on system performance measures that are not tight. Finally, probabilistic methods are also used to provide quantitative statements about system behavior. In the following, we take a closer look at simulation and best-case/worst-case analysis, due to their frequent application in KPN design flows.
Simulation is presumably the most frequently used method for performance analysis. This is reflected by the availability of a wide range of simulation tools that are applicable at different levels of abstraction. The most accurate but also slowest class are cycle-accurate simulators. Instruction-accurate simulators (also referred to as instruction-set simulators or virtual platforms) provide a good trade-off between speed and accuracy, which allows entire MPSoCs to be modeled and simulated. An example is the so-called full system simulator of the Cell Broadband Engine, which also allows switching between simulation modes of different accuracy [25]. Besides performance analysis, virtual platforms can also be used for software development and debugging. For this purpose, the full system simulator of the Cell Broadband Engine provides a fast, purely functional simulation mode in which timing is not considered. At higher levels of abstraction, other kinds of simulation are used for performance analysis as well. One example is trace-based simulation in the Artemis design flow [10]. In trace-based simulation, an untimed execution trace of the application is recorded first that contains the computation and communication events of processes and channels. Based on an architecture description, the mapping of the application onto the architecture, and estimates of the time to process events, this trace is then refined towards timing behavior. This technique allows designers to estimate the system performance. Depending on the level of detail in the trace and the modeling of the execution platform, estimation errors of less than 5% have been reported, with a significantly reduced simulation time compared to instruction-accurate simulation.

For the design of hard real-time systems, worst-case guarantees on the system timing need to be given. As stated above, worst-case bounds are difficult to obtain from simulation due to insufficient corner case coverage and the often prohibitively long execution times of a simulation run. Therefore, analytic methods are a promising means for providing worst-case guarantees even in the case of complex and large-scale MPSoC implementations. Prominent methods for analytic performance analysis are listed in the following.
• Holistic Methods: Holistic analysis is a collection of techniques for the analysis of distributed systems. The principle is to extend concepts of classical single-processor scheduling theory to distributed systems, integrating the analysis of computation and communication resource scheduling. Several holistic analysis techniques have been aggregated in the modeling and analysis suite for real-time applications (MAST) [26].
• Compositional Performance Analysis Methods: The basic idea of compositional performance analysis methods is to construct an analysis model of small components and propagate timing information between these components. Typical components model the execution of processes on a processor, the transmission of data packets on interconnects, or traffic shapers. Timing information is described by event models, such as periodic, periodic with jitter and bursts, or more general models in terms of arrival curves (see the sketch at the end of this subsection). Prominent methods of compositional performance analysis are modular performance analysis (MPA) [27] and symbolic timing analysis for systems (SymTA/S) [28]. Both methods support a rich
set of scheduling policies, such as preemptive and non-preemptive fixed-priority scheduling, earliest deadline first scheduling, or time division multiple access. MPA is used in the DOL design flow, for instance.
• Automata-Based Analysis: Performance analysis of MPSoCs has also been tackled using state-based formalisms. One example are timed automata [29]: the approach is to model a system as a network of interacting timed automata and to formally verify its behavior by means of reachability analysis using the Uppaal model checker [30].

A comparison of these performance analysis methods is provided in [31]. Note that besides being suited for the analysis of real-time systems, analytic models are often used as the basis for system optimization, such as scheduling parameter optimization [32] or robustness optimization [33]. Finally, one can observe that none of the methods shown in Fig. 2 can fulfill all the requirements concerning accuracy, scope, and set-up effort. Therefore, combinations of the different methods have been proposed: simulation has been coupled with native execution on the target platform to reduce simulation time [34]; different analytic methods have been coupled to broaden the analysis scope [35, 36]; and subsystems in simulation have been replaced by analytic models to reduce simulation time and to eliminate the need to generate a detailed simulation model of a component [37]. In these efforts, the modularity of KPNs is often leveraged by using the FIFO channels as the interface between the different performance analysis methods.
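To make the event models used by compositional methods concrete, the following sketch evaluates the commonly cited upper and lower arrival curves of a periodic event stream with period P and jitter J, namely α^u(Δ) = ⌈(Δ + J)/P⌉ and α^l(Δ) = max(0, ⌊(Δ − J)/P⌋). The function names are ours, and the burst correction needed for J > P is omitted.

    #include <math.h>

    /* Maximum number of events in any time window of length delta
     * for a periodic event stream with period P and jitter J.      */
    long alpha_upper(double delta, double P, double J) {
        if (delta <= 0.0) return 0;
        return (long)ceil((delta + J) / P);
    }

    /* Minimum number of events in any time window of length delta. */
    long alpha_lower(double delta, double P, double J) {
        double n = floor((delta - J) / P);
        return n > 0.0 ? (long)n : 0;
    }

For example, with P = 10 ms and J = 4 ms, any window of 25 ms contains between alpha_lower(25, 10, 4) = 2 and alpha_upper(25, 10, 4) = 3 events; such curves are exactly the kind of timing information that is propagated between components.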
3.4 Design Space Exploration

Designers of MPSoCs face a large design space due to the combinatorial explosion of the available degrees of freedom. At several points in the design flow and at various levels of abstraction, they need to decide between design alternatives. Specifically, the design space of MPSoCs can be roughly divided into three domains: the application design space, the architecture design space, and the mapping design space. These three domains can be further split up, as shown in Fig. 3.

[Figure: the multi-processor system design space split into application (algorithmic transformations, source transformations), architecture (programmable hardware, configurable hardware), and mapping (binding, scheduling).]
Fig. 3 Application, architecture, and mapping design space.
Exploration of the application design space can be split up into two main kinds of transformations, namely algorithmic and source transformations. Algorithmic transformations make the coarse-grained parallelism in a sequential application explicit by transforming it into a KPN. Given a KPN application, source transformations split and merge processes to trade off parallelism against communication overhead. Exploration of the architecture design space attempts to find an optimized architecture for a given application. The goal is to instantiate programmable and (re-)configurable hardware components that allow an efficient implementation of an application. Exploration of the mapping design space is the last step in the design space exploration. Given a KPN application and an architecture, processes and channels of the KPN are bound to processors and interconnects in the architecture, and scheduling policies are defined on shared communication and computation resources.

Usually, design space exploration is a multi-objective optimization problem. The goal is thus to find a set of Pareto-optimal designs which represent solutions with different trade-offs between the optimization goals, such as performance, cost, or energy consumption. The final choice is left to the designer, who needs to decide which of the Pareto-optimal designs to implement or to refine to the next level of abstraction. Available approaches to the exploration of design spaces can be characterized as follows. Gries [38] presents a more detailed survey of automated design space exploration and performance analysis in different design flows.
• Manual Exploration: The selection of design points is done by the designer. When taking this approach, the advantage of using a KPN design flow lies in the efficient performance analysis and automated synthesis of selected designs.
• Exhaustive Search: All design points in a specified region of the design parameters are evaluated. Very often, this approach is combined with local optimization in one or several design parameters in order to reduce the size of the design space. Due to the availability of fast performance analysis techniques for KPNs, exhaustive search is a realistic option if the design space is limited (or can be pruned) to roughly a few thousand designs.
• Reduction to Single Objective: For design space exploration with multiple conflicting criteria, there are several approaches available that reduce the problem to a single-criterion optimization. For example, manual or exhaustive sampling is done in one (or several) directions of the search space, and a constrained optimization, e.g. iterative improvement or analytic methods, is done in the others. One may also combine the various objectives into a single criterion by means of a weighted sum whose weights express the preferences of the designer (see the sketch after this list).
• Black-box Randomized Search: The design space is sampled and searched via a black-box optimization approach, i.e. new design points are generated based on the information gathered so far and by defining an appropriate neighborhood function (variation operator). The properties of these new design points are estimated, which increases the available information about the design space. Examples of sampling and search strategies are Pareto simulated annealing, Pareto
tabu search, or evolutionary multi-objective optimization. These black-box optimization methods are often combined with local search methods that optimize certain design parameters or structures. This approach is the one most frequently used in KPN design flows, as illustrated in Table 2.
• Problem-Dependent Approaches: In addition to the above methods, one can also find a close integration of the exploration with a problem-dependent performance analysis of implementations. This approach is often used in design flows that are based on subclasses of KPNs. The StreamIt and MAMPS design flows, for instance, are based on SDF (Synchronous Data Flow) graphs and use adapted techniques for design space exploration, see chapter 4.03.
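As a deliberately minimal illustration of the weighted-sum reduction mentioned above (names and data layout are ours):

    /* Combine n objectives (all to be minimized) into a single criterion.
     * The non-negative weights w[i] express the designer's preferences.   */
    double weighted_sum(const double *f, const double *w, int n_obj) {
        double s = 0.0;
        for (int i = 0; i < n_obj; i++)
            s += w[i] * f[i];
        return s;
    }

The main drawback of this reduction is well known: for a fixed choice of weights, each optimization run yields only one point of the Pareto front, and points in non-convex regions of the front cannot be reached at all.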
4 Specification, Synthesis, Analysis, and Optimization in DOL

Up to this point, this chapter has introduced KPN design flows and the corresponding main design activities. This section provides additional technical details by means of a concrete example of a typical design flow: the distributed operation layer (DOL) [11]. The underlying concepts for system specification, synthesis, performance analysis, and design space exploration are considered, as well as a few typical experimental results concerning the size of the implementation and the runtime and accuracy of the applied methods.
4.1 Distributed Operation Layer

The distributed operation layer (DOL) [11] is a platform-independent MPSoC design flow based on the Kahn process network (KPN) model of computation [3] and targeted at real-time multimedia and (array) signal processing applications. The DOL design cycle, shown in Fig. 4, follows the Y-chart approach in which the application specification is platform-independent and needs to be related to a concrete architecture by means of an explicit mapping. As usual, the design starts with the specification of the application and the architecture (and sometimes even a mapping). Then, code for the functional simulation of the application is automatically generated for testing and debugging the parallel application code with standard debugging tools on a standard PC/workstation. Once the application is functionally correct, it can be mapped onto the target architecture. Based on the architecture and the mapping specification, the system is synthesized by generating the corresponding binaries. Note that, here, system synthesis refers to software synthesis only, as the architecture specification is considered to be unaltered during the exploration phase. The synthesis involves the generation of the mapping-dependent source code for the processors, its compilation, and the linking against platform-specific libraries as well as the runtime environment. Generated binaries can either be executed on a simulator of the target platform or on the real MPSoC.
[Figure: the DOL design flow: the application specification (XML & C), mapping specification (XML), and architecture specification (XML) drive functional simulation generation, system synthesis, and analysis model generation; simulation on a workstation (with test and debug), simulation on a virtual platform or execution on the physical platform, and evaluation on a workstation yield performance data, which calibrate the analysis model inside a design space exploration loop.]
Fig. 4 Overview of the DOL design flow.
Both the functional and the low-level simulation provide performance figures that enrich the application specification. This information is used in later phases for the calibration of the analysis model. The design flow described so far is typical for MPSoC design and very similar to the other design flows listed in Table 1 and explained in the previous chapters. What is different in DOL is its focus on the design and analysis of real-time signal processing applications. To this end, an analytic worst-case/best-case performance analysis method has been embedded into the design flow. Besides enabling the analysis of real-time systems, using an analytic method for performance analysis facilitates rapid design space exploration due to short analysis times. The resulting performance data are embedded in a design space exploration loop in search of an optimal mapping.
4.2 System Specification

When designing the specification format for an MPSoC, one has to consider three criteria. First, the specification format should be expressive enough to represent the class of envisioned applications, i.e. (real-time) signal processing applications. Second, the specification should facilitate the automation of system synthesis and analysis. The third criterion is the possibility of mapping an application in different ways onto an architecture. In the DOL framework, these criteria are met by specifying the application as a Kahn process network [3] and by specifying the application independently of the architecture.
When designing parallel applications irrespective of architectures, an important feature is the ability to specify different topologies of the process network with different degrees of parallelism. For this reason, the KPN coordination part (described in XML) is kept separate from the source code of the individual processes (described in C/C++), see Fig. 5. Similar hybrid XML/C formats are employed by other frameworks as well (e.g. in Artemis [10], ESPAM [12], and MAMPS [14]).
[Figure: left, an XML fragment of the process network structure, declaring processes with ports and source files and connecting them by FIFOs, along the lines of <process name="p1"> <port type="output" name="out"/> <source type="c" location="p1.c"/> </process> ... <connection name="c_fifo"> <origin name="p1"> <port name="out"/> </origin> <target name="p2"> <port name="in"/> </target> </connection>; middle, a small example network of processes and channels; right, per-process pseudocode: procedure INIT(Process p) initializes the process state, and procedure FIRE(Process p) performs READ(input, size, buf), some processing, and WRITE(output, size, buf).]
Fig. 5 Kahn process network model. Left: XML description of the process network structure. Right top: example of a process network. Right bottom: C code of individual processes.
While the syntax of the XML file is specified using an XML schema, the C code is based on a simple API. As shown in Fig. 5, this API basically consists of four functions, two of which concern computation, namely INIT and FIRE, and two of which concern communication inside the FIRE procedure, namely READ and WRITE:
• INIT contains the code that is executed once at start-up to initialize a process.
• FIRE contains the code that is repeatedly called by the scheduler.
• READ implements the blocking read from a FIFO channel.
• WRITE implements the blocking write to a FIFO channel.
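The sketch below shows how a simple process might look against this four-function API. The exact DOL signatures (process handles, port descriptors) are not reproduced in this chapter, so the declarations here are illustrative assumptions only.

    #include <stdint.h>

    /* Assumed declarations, standing in for the actual DOL header: */
    typedef struct { int fired; } Process;       /* process-local state     */
    extern void *PORT_IN, *PORT_OUT;             /* port handles set up by  */
    void READ (void *port, void *buf, int len);  /* the runtime environment */
    void WRITE(void *port, const void *buf, int len);

    /* INIT: executed once at start-up to initialize the process state. */
    void square_init(Process *p) {
        p->fired = 0;
    }

    /* FIRE: called repeatedly by the scheduler; consumes one token and
     * produces one token per invocation.                                */
    void square_fire(Process *p) {
        int32_t x;
        READ(PORT_IN, &x, sizeof(x));    /* blocks until a token arrives     */
        x = x * x;
        WRITE(PORT_OUT, &x, sizeof(x));  /* blocks while the channel is full */
        p->fired++;
    }

Because all inter-process interaction goes through READ and WRITE, the same process code can be reused unchanged in the functional simulation, on the virtual platform, and on the final MPSoC.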
A similar API is defined, for instance, by Y-API [39], a library for specifying and executing Kahn process networks. The architecture model in DOL is an abstract representation of the underlying execution platform. Its purpose is to determine, at the system level, the consequences of the application mapping. This abstract architecture models the topology (i.e. the set of processors and the communication paths between processors) and includes performance figures of the underlying platform useful for performance analysis, e.g. the clock frequency and throughput of architectural resources. The architecture model is a structural description that does not express the functional behavior, and it is specified in XML, similar to the application model. This XML architecture representation is not specific to DOL, but is also encountered in other frameworks, such as Artemis and MAMPS.
[Figure: a KPN with processes p1–p4 and channels c1–c4 is mapped onto an architecture with two processors (each with program and data memory) connected by a bus to a shared memory; the excerpt of the mapping XML binds processes to processors (e.g. <process name="p1"/> to <processor name="processor1"/>), software channels to communication paths (e.g. <sw_channel name="c2"/> to <writepath name="dmem1_bus_shm"/>), and declares schedules such as <schedule name="sched1" type="fixedpriority">.]
Fig. 6 Mapping of a Kahn process network onto a two-processor architecture and an example of a corresponding mapping XML file.
The application model is brought into correspondence with the architecture model by a mapping (see Fig. 6), which can either be established manually by an experienced designer or generated automatically by design space exploration. This mapping fixes the allocation of hardware resources, the binding of the application elements onto these resources, and the scheduling on shared resources. For the mapping specification, once more, the XML format is used. The mapping XML serves as an intermediate format and interface between tools, i.e. the design space exploration tool generates a mapping XML as output, which is the input for the software synthesis tool. The application XML, the architecture XML, and the mapping XML are the basis for the following DOL synthesis steps, i.e. for the functional simulation and the implementation of the final MPSoC, but also for the generation of the analytic performance analysis model (see Fig. 4).
4.3 System Synthesis

Similar to other frameworks, an application specified in DOL cannot be executed directly by just compiling the provided source code of the processes. A synthesis step is required that generates the "glue code" implementing the processes and channels, the bootstrapping, and the scheduling of the application. Specifically, synthesis is done first for a standard PC/workstation to support the functional verification and debugging of the application (in which case it should rather be termed functional simulation generation) and second for the target MPSoC (which is properly known as system synthesis). Due to their similarities, however, the two steps are treated together in the following subsections as facets of system synthesis in the DOL design flow.
Note that when the behavior of a part of an application can be restricted to a subclass of KPN, the general approach described below can be combined with one of the corresponding synthesis techniques described in the previous chapters.

4.3.1 Functional Simulation Generation

The purpose of providing a functional simulation that can be executed on a standard PC/workstation is to give the application developer a convenient means to test and debug the application. Specifically, functional bugs within the application can be exposed and debugged by running a functional simulation on a standard PC and using standard debugging tools, e.g. the GNU debugger gdb. A second role of the functional simulation is to obtain architecture-independent application parameters for performance analysis, such as the amount of data transferred between processes or the number of activations of processes. These parameters can easily be obtained from the functional simulation by monitoring the calls of the READ, WRITE, and FIRE methods. By back-annotating these parameters to the application specification as shown in Fig. 4, they can be referred to during performance analysis, as explained later in this section.

Using the DOL design flow, a functional simulation can be automatically generated from the application specification: for each process, an execution thread is instantiated, and the software channels are implemented by inter-thread communication channels. The execution of the application is then controlled by a simple data-driven scheduler. Since Kahn process networks are determinate, this is a viable approach because the scheduling does not influence the input/output behavior of the application. In DOL, the functional simulation is based on the SystemC library. Processes can therefore be implemented as user-space threads, which incurs less runtime overhead than using an operating system thread library, such as the pthread library. Fig. 7 shows the software architecture of the functional simulation based on SystemC: each Kahn process is embedded into a SystemC thread, whereas each Kahn software channel is implemented as a SystemC channel. Moreover, the main file that bootstraps the process network and implements the scheduler to coordinate the quasi-parallel execution of processes is generated automatically as well. Another frequently chosen library is the pthreads library. On multi-core machines, where individual operating system threads can execute on different cores, a functional simulation based on pthreads can even achieve a speed-up compared to the sequential version of the application. In [40], speed-ups of more than 3 have been reported for executing applications specified in SHIM on a quad-core Intel Xeon processor.
[Figure: a scheduler invokes fire() on several SystemC threads (one per Kahn process); the threads communicate through write() and read() calls on SystemC ports connected by SystemC channels.]
Fig. 7 Software architecture of the functional simulation of a KPN application based on SystemC.
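For illustration, the following sketch shows what such a data-driven scheduler can look like, written here in plain C rather than SystemC; the structure and names are our own. Determinacy of KPNs guarantees that this (or any other) schedule yields the same input/output behavior.

    /* A process is ready when its input channels hold enough data and
     * its output channels have enough space for one firing.            */
    typedef struct Proc {
        int  (*ready)(const struct Proc *self);
        void (*fire)(struct Proc *self);
    } Proc;

    /* Repeatedly fire any ready process; stop when no process can fire,
     * i.e. the application has terminated or deadlocked.                */
    void run_data_driven(Proc *procs[], int n) {
        int progress = 1;
        while (progress) {
            progress = 0;
            for (int i = 0; i < n; i++) {
                if (procs[i]->ready(procs[i])) {
                    procs[i]->fire(procs[i]);
                    progress = 1;
                }
            }
        }
    }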
4.3.2 Software Synthesis

After the application has been functionally verified by the functional simulation, it is ported to the target platform. This requires an architecture-dependent runtime environment in which the application is executed. The role of the runtime environment is to hide architectural details of the MPSoC platform by providing a set of high-level services enabling the execution of an application on the platform, such as task scheduling, inter-process communication, or inter-processor communication. Depending on the target platform, developing (parts of) the runtime environment might be necessary to create the basis for software synthesis. The DOL design flow currently supports three different hardware MPSoC platforms, all of them distributed memory architectures:
• Cell Broadband Engine [41]: MPSoC consisting of a PowerPC-based Power Processor Element and eight DSP-like Synergistic Processing Elements interconnected via a ring bus.
• Atmel Diopsis 940 [42]: tile-based MPSoC, where a single tile is composed of an ARM9 processor and a DSP interconnected by an AMBA bus; up to eight tiles are interconnected via a network-on-chip.
• MPARM [43]: homogeneous MPSoC consisting of identical ARM7 processors connected by an AMBA bus.

Fig. 8 depicts a block diagram of the MPARM architecture. Software synthesis for MPARM is based on the RTEMS (Real-Time Executive for Multi-processor Systems) [44] operating system. Basic services provided by RTEMS are the scheduling of processes, device drivers for inter-process communication, and device drivers for system input/output. Based on these services, it is rather simple to bootstrap and execute a process network. As an example, Listing 1 illustrates parts of the code for bootstrapping a process network based on the RTEMS API. Software synthesis for the Cell Broadband Engine is described in [45] in the context of the DOL design flow, and in [22] in the context of the SHIM design flow, for instance.
[Figure: ARM tiles 1 to N, each containing an ARM core, instruction and data memory, scratchpad memory, and a DMA controller, attached via master and slave ports to a shared bus.]
Fig. 8 Block diagram of the MPARM architecture.
Listing 1 shows parts of a main file for a producer-consumer type application running on MPARM. In lines 2–3, memory is allocated for the local data of the producer and consumer processes. In lines 5–9, two tasks are created for the processes by allocating a task control block, assigning a task name and a task ID, allocating a stack, and setting initial attributes like the task priority and the task mode. Lines 11–12 show the creation of a message queue. In lines 14–18, the rtems_task_start directive puts the tasks into the ready state, enabling the scheduler to execute them. Finally, the initialization task deletes itself (line 20).

Listing 1 RTEMS initialization task in which two tasks are bootstrapped to run a producer and consumer process of a process network.
 1  rtems_task Init(rtems_task_argument arg) {
 2      producer_wrapper = malloc(sizeof(RtemsProcessWrapper));
 3      consumer_wrapper = malloc(sizeof(RtemsProcessWrapper));
 4
 5      for (j = 0; j < 2; j++) {
 6          status = rtems_task_create(j + 1, 128, RTEMS_MINIMUM_STACK_SIZE,
 7                                     RTEMS_DEFAULT_MODES,
 8                                     RTEMS_DEFAULT_ATTRIBUTES, &(task_id[j]));
 9      }
10
11      status = rtems_message_queue_create(1, 10, 1,
12                  RTEMS_DEFAULT_ATTRIBUTES, &queue_id[0]);
13
14      status = rtems_task_start(task_id[1], producer_task,
15                  (rtems_task_argument) producer_wrapper);
16
17      status = rtems_task_start(task_id[2], consumer_task,
18                  (rtems_task_argument) consumer_wrapper);
19
20      rtems_task_delete(RTEMS_SELF);
21  }
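Listing 1 only bootstraps the two tasks. For completeness, the sketch below indicates how the task bodies might exchange single-byte tokens over the created message queue using the standard RTEMS directives rtems_message_queue_send and rtems_message_queue_receive; the token handling is an illustrative assumption, not DOL's generated code.

    /* Producer: send one-byte tokens into the queue created in Listing 1.
     * The message size of 1 matches the maximum message size given at
     * queue creation time; status checking is omitted for brevity.       */
    rtems_task producer_task(rtems_task_argument arg) {
        unsigned char token = 0;
        while (1) {
            rtems_message_queue_send(queue_id[0], &token, 1);
            token++;
        }
    }

    /* Consumer: block until a message arrives (RTEMS_WAIT, no timeout). */
    rtems_task consumer_task(rtems_task_argument arg) {
        unsigned char token;
        size_t size;
        while (1) {
            rtems_message_queue_receive(queue_id[0], &token, &size,
                                        RTEMS_WAIT, RTEMS_NO_TIMEOUT);
            /* ... process the token ... */
        }
    }

Note that rtems_message_queue_receive blocks on an empty queue, directly providing the blocking read required by the KPN model, whereas rtems_message_queue_send returns an error status when the queue is full; a real implementation therefore has to add back-pressure handling (e.g. retrying, or an additional counting semaphore) to obtain blocking-write semantics.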
For all the mentioned platforms, the main challenge is to provide an efficient FIFO channel implementation that allows overlapping computation and communication in order to reduce the runtime overhead as much as possible.
Aspects that play an important role in this context are the size and location of the channel buffers, the efficient use of DMA controllers for data transfers between processors, and the minimization of synchronization messages.
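One common way to overlap computation and communication is double buffering on the producing side, sketched below. The dma_start/dma_wait calls and compute_token are hypothetical placeholders for a platform's DMA driver and the process's computation; the synchronization with the consumer is omitted.

    #define TOKEN_SIZE 256
    static unsigned char buffer[2][TOKEN_SIZE];

    extern void compute_token(unsigned char *buf);             /* placeholder */
    extern void dma_start(const unsigned char *src, int len);  /* placeholder */
    extern void dma_wait(void);                                /* placeholder */

    /* While the DMA engine transfers buffer[cur] to the remote channel
     * buffer, the process already computes the next token into the other
     * buffer, hiding communication latency behind computation.            */
    void produce_stream(int n_tokens) {
        int cur = 0;
        compute_token(buffer[cur]);
        for (int i = 1; i < n_tokens; i++) {
            dma_start(buffer[cur], TOKEN_SIZE);  /* transfer in background */
            cur = 1 - cur;
            compute_token(buffer[cur]);          /* overlapped computation */
            dma_wait();                          /* other buffer is free   */
        }
        dma_start(buffer[cur], TOKEN_SIZE);      /* flush the last token   */
        dma_wait();
    }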
4.4 Performance Analysis

The DOL design flow is targeted towards the design of real-time multimedia and signal processing applications. These systems must meet real-time constraints, which means that not only the correctness and performance of a system are of major concern but also the timeliness of the computed results. Typical questions in this context are:
• What is the response time to certain events? Is this response time within the required real-time limits?
• Can the system accept additional load and still meet the quality-of-service and real-time constraints?
• Is the system schedulable, that is, are all real-time constraints met?

To be able to answer these questions, a suitable combination of system design and performance analysis is required. To this end, it is essential that the architecture, application, and runtime environment of a system are amenable to formal analysis, because simulation and measurements cannot provide guarantees about timing properties. On the other hand, performance analysis methods with a reasonable scope and accuracy need to be employed such that the effects occurring in the system implementation can be faithfully modeled. For MPSoC applications, this includes the modeling of heterogeneous resources and their sharing, the modeling of complex timing behavior arising from variable execution demands and interference on shared resources, and the modeling of different processing semantics.

Many approaches have been proposed to solve this problem, see [46] for an overview. Time-triggered and synchronous approaches are frequently used, for instance. In purely time-triggered approaches, such as the time-triggered architecture [47] or Giotto [48], the processing time of resources is allocated to tasks in fixed time slots. This fixed allocation facilitates analysis, but dimensioning the slots turns out to be difficult for varying workloads. For instance, using the worst-case workload for setting the slot sizes might lead to over-dimensioned systems. Purely synchronous approaches implemented in synchronous languages, such as Esterel, Lustre, and Signal, rely on a global clock that divides the execution of a system into a sequence of atomic processing steps [49]. While synchronous approaches are successfully used for single-processor systems, applying them to MPSoCs is difficult because MPSoCs are usually split up into different (asynchronous) clock domains such that the synchrony assumption does not hold.

The approach taken in DOL relies on compositional performance analysis, where a system is modeled as a set of processing components that interact via event
streams. This is a good match for MPSoCs as well as for KPNs. Contrary to other approaches, this approach is rather flexible in that it is not limited to a certain system architecture, scheduling policy, or execution semantics. Specifically, modular performance analysis (MPA) is integrated into the DOL design flow. In the following, the basic concepts of MPA are reviewed. Afterwards, it is summarized how MPA is integrated into the DOL design flow.

4.4.1 Modular Performance Analysis (MPA)

MPA [27, 50] is an analytical approach for the analysis of real-time systems. It is based on real-time calculus [51], which has its foundations in network calculus, a method for the worst-case analysis of communication networks [52, 53]. With MPA, hard upper and lower bounds for the performance metrics of a distributed real-time system can be computed. As shown in Fig. 9, the performance model of a system is decomposed into a network of abstract processing components that model the computation and communication in the system. These processing components are connected by abstract event streams that model the timing properties of the data streams flowing through the system. Finally, resources are modeled by resource streams that describe the availability of processing resources to computation and communication tasks. Different scheduling policies can be modeled by connecting processing components and resource streams in different ways. As an example, Fig. 9 illustrates fixed-priority (FP) scheduling on the processors and time-division multiple-access (TDMA) scheduling on the bus. The processing components modeling computation and communication are characterized by the worst-case/best-case execution demand and the minimal and maximal token size, respectively. Event streams are characterized by so-called arrival curves and resource streams by service curves. Summarizing, these abstractions allow the modeling of computation and communication on heterogeneous resources in a unified manner.

[Figure: processes p1–p4 mapped onto processor1 and processor2 (both scheduled with fixed priority) and communicating over a TDMA bus; in the corresponding MPA model, each process execution and each bus transfer becomes an abstract processing component.]
Fig. 9 MPA model of a system with two processors connected by a bus on which a KPN with four processes is executing. In the MPA model, horizontal edges represent event streams whereas vertical edges represent resource streams.
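The quantities MPA computes from these curves can be stated compactly. The following is a restatement of the standard bounds from the real-time calculus literature [51–53], not a formula reproduced from this chapter: the maximum delay experienced by a stream with upper arrival curve $\alpha^u$ at a component with lower service curve $\beta^l$, and the corresponding maximum backlog, are bounded by the horizontal and vertical distance between the two curves:

\[
d_{\max} \le \sup_{\lambda \ge 0} \, \inf \{ \tau \ge 0 : \alpha^u(\lambda) \le \beta^l(\lambda + \tau) \},
\qquad
b_{\max} \le \sup_{\lambda \ge 0} \{ \alpha^u(\lambda) - \beta^l(\lambda) \}.
\]

The backlog bound directly translates into a buffer size for the corresponding FIFO channel, which is one of the system properties mentioned below.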
Based on these abstractions, the system is analyzed by consecutively propagating the event streams between components. Depending on the mapping of processing components to a resource and its scheduling/arbitration, the timing properties of the streams change. System properties such as resource utilization, system throughput, end-to-end delays, or buffer sizes can be derived this way. Tool support for actually performing the analysis is provided by a freely available Matlab toolbox [54] that implements the underlying algebraic operations.

4.4.2 Integration of MPA into the DOL Design Flow

It has been mentioned that the goal of system synthesis is to bridge the implementation gap, that is, to refine a high-level system specification into an actual system implementation. Similarly, there is an "abstraction gap" between an MPA model and the implementation: the execution of sequential processes on a processor is modeled by an abstract processing component, the availability of resources is abstracted by service curves, and the dataflow through the system by arrival curves. Bridging this abstraction gap, that is, creating an analysis model that correctly reflects the implementation, is a non-trivial task. On the one hand, the high-level system specification is conceptually similar to the analysis model but does not contain all the parameters required to generate an MPA model, such as best-case/worst-case task execution times or token sizes. The implementation, on the other hand, implicitly contains this information, but extracting it is not straightforward.

In the DOL design flow, the abstraction gap is bridged by analysis model generation and calibration. In model generation, the high-level system specification is translated into a corresponding MPA model. In model calibration, the required model parameters are obtained. In both steps, the modular structure of the application and architecture specification is leveraged. The basic approach is depicted in Fig. 10: the analysis model generation represents a branch in the design flow that runs parallel to the system synthesis, and the analysis model calibration later uses this feature to build a database with the performance data needed by the formal model. In the following, the basic approach is described; further details are provided in [55].

The goal of model generation is to translate the high-level application, architecture, and mapping specification into a corresponding MPA model implemented as a Matlab script. Due to the modular specification of the application that is made explicit in the process network XML description, this is straightforward: each process is simply modeled as an abstract processing component. Similarly, the communication channels between processes are modeled as abstract communication components and connected to the processing components according to the topology of the KPN. The aim of model calibration is to obtain the quantities needed to parameterize the generated model such that it correctly models the implementation. Basically, three different types of parameters can be distinguished:
[Figure: the system specification is both translated by model generation into an analysis model (yielding performance predictions) and turned by synthesis & compilation into an implementation, whose runtime behavior is observed via simulation; the observations calibrate the analysis model.]
Fig. 10 Analysis model generation and calibration in the DOL flow.
• First, there are the application parameters that are architecture- and timing-independent. An example is the minimal and maximal size of the tokens transmitted over each channel.
• Second, there are the parameters that depend only on the architecture and the runtime environment. Examples are the throughput of the different communication resources or the context switch time of the runtime environment.
• Third, there are the application parameters that depend on the architecture and the mapping. Basically, whenever the architecture or the mapping changes, these parameters need to be determined anew. An example is the worst-case/best-case execution time of a process on a processor.

Depending on the parameter type, there are different ways to obtain them. Timing-independent parameters can be obtained from the functional simulation, due to the determinism of KPN applications. Architecture-dependent parameters need to be obtained once a new hardware architecture or runtime environment is employed for realizing a system. The parameters of the third category, i.e. application parameters that depend on the architecture and the mapping, are more difficult to determine. Similar to system-level performance analysis, different methods for worst-case/best-case execution time analysis have been proposed, for instance [56]. In the DOL design flow, timed simulation on a virtual platform is employed. Note that compared to formal methods, this approach is not suitable for the calibration of hard real-time system models unless complete coverage of corner cases is achieved in the calibration simulation runs. Finally, one can observe that similar approaches are taken in other design flows. In the Artemis design flow, for instance, model generation and calibration are used to create a model for trace-based simulation [57].
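Parameters of the first category can be gathered with very little machinery. As an illustration (the wrapper and its names are ours, not DOL's instrumentation code), the READ call of the functional simulation can be wrapped to record per-channel token-size extrema for back-annotation:

    typedef struct {
        long read_calls;            /* number of activations observed      */
        int  min_token, max_token;  /* token-size extrema for this channel */
    } ChannelStats;

    /* Forward to the original blocking read, then update the statistics
     * that are later back-annotated to the application specification.    */
    void monitored_read(void *port, void *buf, int len, ChannelStats *st) {
        READ(port, buf, len);
        if (st->read_calls == 0 || len < st->min_token) st->min_token = len;
        if (st->read_calls == 0 || len > st->max_token) st->max_token = len;
        st->read_calls++;
    }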
4.5 Design Space Exploration

The final piece of the DOL flow is design space exploration, built on top of the analysis and synthesis tools to find an optimal mapping. In general, the problem of optimally mapping an application onto a heterogeneous distributed architecture is known to be NP-complete. Even for systems of modest complexity, one thus needs to resort to heuristics. In addition, the mapping problem is usually multi-objective, such that there is no single optimal solution but a set of Pareto-optimal solutions constituting a so-called Pareto front.

In DOL, the aim of the design space exploration is to compute the set of Pareto-optimal solutions representing different trade-offs in the design space. Based on the (approximated) Pareto front, the designer chooses the final solution to implement. Therefore, the mapping problem is specified as a multi-objective optimization problem. Formally, a multi-objective optimization problem is defined on the decision space X, which contains all possible design decisions, i.e. architectures, applications, and mappings. With each implementation x ∈ X there is associated an objective vector f in the objective space Z that consists of n objectives f = (f1, ..., fn), which should be minimized (or maximized). An order relation ≤ is defined on the objective space Z, which induces a preference relation ⪯ on the decision space X: x1 ⪯ x2 ⇔ f(x1) ≤ f(x2), for x1, x2 ∈ X. In other words, for the mapping problem, for instance, if mapping x1 is better (minimal) in all objectives than mapping x2, the optimization algorithm will "preferentially" select mapping x1.

The design search space, symbolized with Ψ, is the set of all subsets of X, i.e. it includes all possible solution sets A ⊆ X. The final goal is to determine an optimal element of Ψ, i.e. an optimal subset of all possible implementations X. This subset should reflect all trade-offs induced by the multiple objectives. The preference relation ⪯ on X defined above can now be used to define a corresponding set preference relation, again written ⪯, on the search space Ψ. This set preference relation provides the information on the basis of which two candidate Pareto sets can be compared: A ⪯ B ⇔ ∀ b ∈ B, ∃ a ∈ A : a ⪯ b. This property reflects the concept of Pareto-dominance: a design point dominates another one if it is equal or better in all criteria and strictly better in at least one. The search in the design space is pursued until a good Pareto-optimal set approximation A ∈ Ψ is found.

In DOL, evolutionary algorithms are used to solve the mapping optimization problem. Evolutionary algorithms find solutions to a given problem by iterating two main steps [58]: (1) selection of promising candidates, based on an a-priori evaluation of candidate solutions, and (2) generation of new candidates by variation of previously selected candidates. The principle of the selection in evolutionary multi-objective optimization is sketched in Algorithm 3. For a complete description, we refer to [59]. The starting point is a randomly generated population P ∈ Ψ of size m. During optimization, a heuristic mutation operator based on selection and variation generates another set P′ ∈ Ψ, which is intended to be better than P in the sense of the predefined set preference relation ⪯, i.e. P′ ⪯ P. Finally, P is replaced by P′ if
the latter is preferable to the former (i.e., P′ ⪯ P), and P remains unchanged otherwise.

Algorithm 3 Main optimization function.
1: randomly choose A ∈ Ψ            {generate initial set P of size m}
2: P ← A
3: while termination criterion not fulfilled do            {main optimization loop}
4:     P′ ← heuristicSetMutation(P)
5:     if P′ ⪯ P then
6:         P ← P′
7:     end if
8: end while
9: return P
Algorithm 4 Heuristic set mutation function.
1: function heuristicSetMutation(P)
2:     generate {r1, ..., rk} ∈ X based on P
3:     P′ ← P ∪ {r1, ..., rk}
4:     while |P′| > m do
5:         choose p ∈ P′ with fitness(p) = min_{a ∈ P′} {fitness(a)}
6:         P′ ← P′ \ {p}
7:     end while
8:     return P′
9: end function
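Both the acceptance test in Algorithm 3 and the truncation in Algorithm 4 ultimately rest on comparing objective vectors by Pareto-dominance. A minimal sketch of this test (all objectives to be minimized; the data layout is ours):

    #include <stdbool.h>

    /* Returns true iff design point a dominates design point b, i.e. a is
     * no worse in every objective and strictly better in at least one.    */
    bool dominates(const double *fa, const double *fb, int n_obj) {
        bool strictly_better = false;
        for (int i = 0; i < n_obj; i++) {
            if (fa[i] > fb[i]) return false;  /* a is worse in objective i */
            if (fa[i] < fb[i]) strictly_better = true;
        }
        return strictly_better;
    }

In SPEA2, the fitness used for truncation is not this raw dominance test itself but a combination of dominance strength and density information; the sketch only shows the underlying relation.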
The heuristic set mutation operator is detailed in Algorithm 4. First, k new solutions are created based on P, after an appropriate selection and variation operation. While the variation is problem-specific, the selection is independent of the problem, using either uniform random selection or fitness-based selection. Then, the k new solutions are added to P, resulting in a set P′ of size m + k. P′ is iteratively truncated to size m by removing the solutions with the worst fitness values. Note that the fitness values are assigned in a performance evaluation process, i.e., in DOL we use the MPA framework as described in Section 4.4.

While the selection algorithm described is domain-independent, specific methods are used to include domain-specific knowledge in the search process and to select the "best" solutions among a population. These are the domain representation (i.e., the system mapping), the evaluation of designs (i.e., the MPA analysis), and the variation of a population of solutions by mutation and crossover operations. Although standard variation schemes exist for mutation and crossover, their implementation strongly depends on the system properties. The mutation generates a local neighborhood of selected design points. In the DOL context, the mutation affects the mapping solutions; for instance, different mappings can be generated with different bindings of processes onto processors. The crossover recombines two selected solutions to generate a new one. Note that during mutation or crossover, infeasible (mapping) solutions can be generated. In this situation, a repair strategy is invoked which, in conformity with the evolutionary algorithm principle, attempts
to maintain a high diversity in the population. An example could be the rerouting of inter-process communication when, during re-mapping, a process was bound to a new location.

In DOL, the design space exploration framework includes several tools, as shown in Fig. 11. In particular, the EXPO [60] tool is the central module of the framework. As the underlying multi-objective search algorithm, the Strength Pareto Evolutionary Algorithm (SPEA2) [61] is used, which communicates with EXPO via the PISA [62] interface. Similarly, the design space exploration framework in Artemis [10] is also based on PISA and SPEA2.

[Figure: EXPO builds an internal system representation from the application and architecture XML; it generates and varies mappings, drives the Matlab-based performance analysis (MPA) via generated Matlab scripts to obtain performance metrics, exchanges populations with the evolutionary algorithm (SPEA2) through the PISA interface, and outputs the selected mapping as mapping XML.]
Fig. 11 Design space exploration in the DOL framework.
Using the frameworks of EXPO and PISA relieves the designer from implementing those parts of the design space exploration that are independent of the actual optimization problem. For example, the selection is handled inside the multi-objective optimizer SPEA2. The designer just needs to focus on the problem-specific parts, that is, the generation, variation, and evaluation of solutions. The implementation of the problem-specific parts starts with the specification, where the application and architecture (and later on, the mapping) are automatically extracted from the corresponding XML files and represented in the design space exploration framework. Then, candidate mappings are generated and varied (as described above) by means of the provided variation methods. Finally, during design space exploration, the objective values of all candidate mappings are computed by generating the corresponding Matlab MPA scripts and interfacing Matlab for their evaluation. In DOL, all these operations are automatically parameterized using the application and architecture specification. Note that the approach described above is a heuristic search procedure. Therefore, it does not guarantee the optimality of the final set of solutions. However, in our experiments we have observed that after several design space exploration cycles, the found solutions are close to optimal even for large problem complexities.
4.6 Results of the DOL Framework

In this section, a few results are highlighted that have been obtained by applying the DOL design flow described above. Specifically, the design and analysis of a Motion-JPEG (MJPEG) decoder [63] running on MPARM [43] is considered. For the execution of the system, we used a 31-frame input bitstream encoded in the QVGA (320×240) YUV 444 format. The MJPEG decoder decompresses a sequence of frames by applying JPEG decompression to each frame. Because of the inherent parallelism in the MJPEG algorithm, the decoding is done in a pipeline with five stages, each stage being implemented as a Kahn process. The first and last stages are the splitting of streams into frames (ss) and the merging of frames back into streams (ms). The variable-length decoding and the splitting of frames into macroblocks form the second stage (sf). The zigzag scan, inverse quantization, and inverse discrete cosine transform form the third stage (zii), while combining macroblocks back into frames forms the fourth stage (mf).

Using the design space exploration framework of DOL based on the PISA interface and SPEA2, one can compute the Pareto-optimal mappings of the MJPEG application onto an MPARM system with a variable number of processors. The mapping has been optimized with respect to two design objectives: (1) the end-to-end delay in the system computed by MPA analysis, which is an upper bound on the actual end-to-end delay, and (2) the cost of the system, evaluated as the sum of the costs associated with the used processors, memories, and the bus. In the experiments, a population size of 60 individuals has been chosen and the algorithm has been executed for 50 generations. These parameters generally depend on the complexity of the problem to solve. The obtained Pareto front, consisting of 6 mapping solutions onto different numbers of processors, is shown in Fig. 12. The search in this design space took about 2 hours.
[Figure: screenshot plotting the current population by cost (horizontal axis) versus end-to-end delay (vertical axis), with solutions using 1, 2, 3, and 4 processors.]
Fig. 12 Pareto-optimal solutions resulting from the design space exploration (screenshot of the EXPO tool).
For illustration purposes only, we employ a simple configuration with a small number of processes that can be mapped in different ways onto the architecture and can communicate via different hardware communication paths. A more efficient implementation could be obtained if the same application were specified with a scalable number of processes processing data in parallel. This would increase the parallelism of the design but also the dimensionality of the design space. A more complex design space exploration with the DOL framework is shown in [11], where a scaled version of an MPEG-2 decoder has been mapped onto a tile-based heterogeneous architecture. Other KPN flows, like Sesame in the Artemis project [10], that use exactly the same optimization frameworks of PISA and SPEA2 report comparable parameters and results for the design space exploration. However, their design space exploration is considerably shorter, i.e. 5 s for a design with 8 processes, because they use a much simpler additive performance model. Of course, for larger problem sizes all the parameters will scale, and the design space exploration can take much longer if more accurate analysis methods, like MPA or simulation, are used. However, this is an acceptable cost since designers explore the entire design space only once.

In the remainder of this section, a mapping of the MJPEG application onto a 3-tile MPARM system, that is, three ARM processors interconnected via a shared AMBA bus, is considered. The resulting MPA model is shown in Fig. 13.

[Figure: the five MJPEG pipeline processes (ss, sf, zii, mf, ms) are distributed over the three fixed-priority-scheduled tiles ARM1–ARM3 and communicate through numbered software channels routed over the shared TDMA bus.]
Fig. 13 MPA model of the MJPEG application mapped onto a 3-tile MPARM platform. Px are the five processes of the MJPEG decoding pipeline that communicate via the Cy software channels.
To evaluate the efficiency of the design flow (i.e., the time spent in obtaining results), Table 3 lists the durations of the individual design steps in the performance analysis of the different design solutions. Several conclusions can be drawn:
• Automated software synthesis can be done fast. Actually, most of the time required to generate the functional simulation or the binaries for the target platform is spent compiling the generated source code rather than generating this code.
• The table shows that timed simulation on the virtual platform is the most time-consuming step in the design flow. Minimizing simulation time is thus paramount and actually possible in a systematic design flow, as has been shown above. Conversely, simulation time can become a major bottleneck in MPSoC design when following a less systematic design flow requiring many design iterations involving timed simulation.
• The generation and analysis of a system's MPA model is a matter of seconds. Note that similar times have been reported for alternative performance analysis methods like trace-based simulation in the Artemis design flow, for instance. While further reducing this time is desirable, it is a reasonable time frame for performance analysis within a design space exploration loop.
• The one-time calibration to obtain the parameters for the MPA model takes several seconds and is completely automated. Extracting these parameters manually would be a major effort.
Table 3 Duration of different design steps in the MJPEG design, measured on a 1.86 GHz Intel Pentium Mobile machine with 1 GB of RAM. The simulations were executed to decode a 31-frame input bitstream encoded in the QVGA (320×240) YUV 444 format.

step                                             duration
functional simulation generation                 42 s
functional simulation                            3.6 s
model calibration (one-time effort)              4 s
synthesis (generation of binary)                 –
simulation on MPARM                              13550 s
log-file analysis and back-annotation            12 s
model generation                                 1 s
performance analysis based on generated model    2.5 s
In order to evaluate the accuracy of the MPA estimations, the performance bounds computed with MPA are compared to the actual (average-case) quantities observed during system simulation. The differences are in a range of 10–20%, which is typical for a compositional performance analysis. Differences in the same range have been observed for several systems in [31], for instance. There are two main reasons for these differences. First, several operators in the formal performance analysis do not yield tight bounds. Second, the simulation of a complex system in general cannot determine the actual worst-case and best-case behavior: the system-level simulations do not use exhaustive test patterns and do not cover all possible corner cases of the interference on shared resources.

Finally, the DOL framework itself is evaluated in terms of the code size of the prototype implementation. The DOL design flow and the associated tools are implemented in Java. To give an indication of the size of the implementation, Table 4 shows the code size of different parts of the design flow (excluding the plug-ins for
design space exploration). One can see that apart from the tool-internal representations of the system specification, the largest part is the MPA code generator for performance analysis. The software synthesizers and the monitoring for the MPA model calibration are comparatively small. Similar observations can be made for other design flows as well.

Table 4 Java code size of different parts of the DOL design flow.

part of design flow                                     lines of code
DOL representation of system specifications             6200
functional simulation generator                         4100
MPARM code generator                                    2100
MPA Matlab code generator                               5000
log-file analysis of functional and timed simulation    1200
5 Concluding Remarks

The mapping of process networks onto multi-processor systems requires a systematic and automated design methodology. This chapter has provided an overview of different existing methods and tools, all of which start from a general Kahn process network (KPN) model of computation and implement the established Y-chart approach. Due to fundamental properties of the Kahn model, many problems in the design process can be solved in an automated manner. Thus, system specification, synthesis, performance analysis, and design space exploration can be implemented in a fully automated way. After an overview of all these activities, this chapter provided a practical illustration of their implementation in the distributed operation layer (DOL) framework.

The design steps followed by DOL are largely common to all KPN flows. What is specific to the DOL framework is the embedding of an accurate formal performance analysis model into the design flow. This presents a clear advantage over the standard simulation-based approaches employed for performance analysis, which typically take more time to execute than a formal model and cannot offer guarantees for (hard) real-time signal processing applications due to their incomplete coverage of the design space. Another key point is the need for a scalable design flow which enables the design of large and complex MPSoC systems, as can clearly be seen from Table 2. As soon as we are faced with more complex MPSoCs, this forthcoming difficulty needs to be considered in all steps of the design trajectory. In particular, it will have an impact at the system level, where the basic design decisions are taken. In this sense, the Kahn model and design methods based on it are promising candidates due to the modular system specification, which offers great potential for compositional (and fast) performance analysis and design space exploration. Taking a closer look at the DOL framework, one can observe that it features a specification format that can
easily be scaled (provided by its XML and C basis), that it includes a compositional formal performance analysis in the design flow, and that the optimization is done with the support of modular tools such as EXPO and PISA. These features provide the basis for scalable mappings and mapping optimizations.
References

1. Kienhuis, B., Deprettere, E., Vissers, K., van der Wolf, P.: An Approach for Quantitative Analysis of Application-Specific Dataflow Architectures. In: Proc. Int'l Conf. on Application-Specific Systems, Architectures and Processors (ASAP), Washington, DC, USA (July 1997) 338–349
2. Densmore, D., Sangiovanni-Vincentelli, A., Passerone, R.: A Platform-Based Taxonomy for ESL Design. IEEE Design & Test of Computers 23(5) (May 2006) 359–374
3. Kahn, G.: The Semantics of a Simple Language for Parallel Programming. In: Proc. IFIP Congress, Stockholm, Sweden (August 1974) 471–475
4. Ptolemy Web Site. http://ptolemy.eecs.berkeley.edu
5. Balarin, F., Watanabe, Y., Hsieh, H., Lavagno, L., Passerone, C., Sangiovanni-Vincentelli, A.: Metropolis: An Integrated Electronic System Design Environment. Computer 36(4) (April 2003) 45–52
6. MathWorks Real-Time Workshop. http://www.mathworks.com/products/rtw/
7. NI LabVIEW Microprocessor SDK. http://www.ni.com/labview/microprocessor_sdk.htm
8. Keinert, J., Streubühr, M., Schlichter, T., Falk, J., Gladigau, J., Haubelt, C., Teich, J., Meredith, M.: SystemCoDesigner — An Automatic ESL Synthesis Approach by Design Space Exploration and Behavioral Synthesis for Streaming Applications. ACM Trans. on Design Automation of Electronic Systems 14(1) (January 2009) 1:1–1:23
9. Ha, S., Kim, S., Lee, C., Yi, Y., Kwon, S., Joo, Y.P.: PeaCE: A Hardware-Software Codesign Environment for Multimedia Embedded Systems. ACM Trans. on Design Automation of Electronic Systems 12(3) (August 2007) 1–25
10. Pimentel, A.D., Erbas, C., Polstra, S.: A Systematic Approach to Exploring Embedded System Architectures at Multiple Abstraction Levels. IEEE Trans. on Computers 55(2) (February 2006) 99–112
11. Thiele, L., Bacivarov, I., Haid, W., Huang, K.: Mapping Applications to Tiled Multiprocessor Embedded Systems. In: Proc. Int'l Conf. on Application of Concurrency to System Design (ACSD), Bratislava, Slovak Republic (July 2007) 29–40
12. Nikolov, H., Stefanov, T., Deprettere, E.: Systematic and Automated Multiprocessor System Design, Programming, and Implementation. IEEE Trans. on Computer-Aided Design of Integrated Circuits and Systems 27(3) (March 2008) 542–555
13. Kangas, T., Kukkala, P., Orsila, H., Salminen, E., Hännikäinen, M., Hämäläinen, T.D.: UML-Based Multiprocessor SoC Design Framework. ACM Trans. on Embedded Computing Systems 5(2) (May 2006) 281–320
14. Kumar, A., Fernando, S., Ha, Y., Mesman, B., Corporaal, H.: Multiprocessor Systems Synthesis for Multiple Use-Cases of Multiple Applications on FPGA. ACM Trans. on Design Automation of Electronic Systems 31(3) (July 2008) 40:1–40:27
15. Bhattacharyya, S.S., Brebner, G., Janneck, J.W., Eker, J., von Platen, C., Mattavelli, M., Raulet, M.: OpenDF — A Dataflow Toolset for Reconfigurable Hardware and Multicore Systems. In: First Swedish Workshop on Multi-Core Computing (MCC), Uppsala, Sweden (November 2008)
16. Edwards, S.A., Tardieu, O.: SHIM: A Deterministic Model for Heterogeneous Embedded Systems. IEEE Trans. on VLSI Systems 14(8) (August 2006) 854–867
17. Gordon, M.I., Thies, W., Amarasinghe, S.: Exploiting Coarse-Grained Task, Data, and Pipeline Parallelism in Stream Programs. In: Proc. Int'l Conf. on Architectural Support for Programming Languages and Operating Systems (ASPLOS), San Jose, CA, USA (October 2006) 151–162
1038
18. 19. 20. 21.
22. 23. 24. 25. 26. 27. 28. 29. 30. 31.
32. 33. 34.
35. 36.
Bacivarov, Haid, Huang, Thiele
Programming Languages and Operating Systems (ASPLOS), San Jose, CA, USA (October 2006) 151–162 Kienhuis, B., Rijpkema, E., Deprettere, E.: Compaan: Deriving Process Networks from Matlab for Embedded Signal Processing Architectures. In: Proc. of the Int’l Workshop on Hardware/Software Codesign (CODES), San Diego, CA, USA (May 2000) 13–17 Verdoolaege, S., Nikolov, H., Stefanov, T.: PN: A Tool for Improved Derivation of Process Networks. EURASIP Journal on Embedded Systems 2007 (2007) Pimentel, A.D.: The Artemis Workbench for System-Level Performance Evaluation of Embedded Systems. Int. J. Embedded Systems 3(3) (2008) 181–196 Janneck, J.W., Miller, I.D., Parlour, D.B., Roquier, G., Wipliez, M., Raulet, M.: Synthesizing Hardware from Dataflow Programs: An MPEG-4 Simple Profile Decoder Case Study. In: IEEE Workshop on Signal Processing Systems (SiPS), Washington, D.C., USA (October 2008) 287–292 Vasudevan, N., Edwards, S.A.: Celling SHIM: Compiling Deterministic Concurrency to a Heterogeneous Multicore. In: Proc. ACM Symposium on Applied Computing (SAC), Honolulu, HI, USA (March 2009) 1626–1631 Gelernter, D., Carriero, N.: Coordination Languages and Their Significance. Commun. ACM 35(2) (February 1992) 97–107 Parks, T.M.: Bounded Scheduling of Process Networks. PhD thesis, University of California, Berkeley (December 1995) IBM SDK for Multicore Acceleration. http://www-128.ibm.com/developerworks/power/cell/ González Harbour, M., Gutiérrez García, J.J., Palencia Gutiérrez, J.C., Drake Moyano, J.M.: MAST: Modeling and Analysis Suite for Real Time Applications. In: Proc. Euromicro Conference on Real-Time Systems, Delft, The Netherlands (June 2001) 125–134 Chakraborty, S., Künzli, S., Thiele, L.: A General Framework for Analyzing System Properties in Platform-Based Embedded System Design. In: Proc. Design, Automation and Test in Europe (DATE), Munich, Germany (March 2003) 190–195 Henia, R., Hamann, A., Jersak, M., Racu, R., Richter, K., Ernst, R.: System Level Performance Analysis — The SymTA/S Approach. IEE Proceedings Computers and Digital Techniques 152(2) (March 2005) 148–166 Alur, R., Dill, D.L.: A Theory of Timed Automata. Theoretical Computer Science 126(2) (April 1994) 183–235 Hendriks, M., Verhoef, M.: Timed Automata Based Analysis of Embedded System Architectures. In: Workshop on Parallel and Distributed Real-Time Systems, Rhodes, Greece (April 2006) Perathoner, S., Wandeler, E., Thiele, L., Hamann, A., Schliecker, S., Henia, R., Racu, R., Ernst, R., González Harbour, M.: Influence of Different System Abstractions on the Performance Analysis of Distributed Real-Time Systems. In: Proc. Int’l Conf. on Embedded Software (EMSOFT), Salzburg, Austria (October 2007) 193–202 Wandeler, E., Thiele, L.: Optimal TDMA Time Slot and Cycle Length Allocation for Hard Real-Time Systems. In: Proc. Asia and South Pacific Conf. on Design Automation (ASPDAC), Yokohama, Japan (January 2006) 479–484 Hamann, A., Racu, R., Ernst, R.: Multi-Dimensional Robustness Optimization in Heterogeneous Distributed Embedded Systems. In: Proc. Real Time and Embedded Technology and Applications Symposium (RTAS), Bellevue, WA, United States (April 2007) 269–280 Kraemer, S., Gao, L., Weinstock, J., Leupers, R., Ascheid, G., Meyr, H.: HySim: A Fast Simulation Framework for Embedded Software Development. In: Proc. Int’l Conf. 
on Hardware/Software Codesign and System Synthesis (CODES+ISSS), Salzburg, Austria (October 2007) 75–80 Schliecker, S., Stein, S., Ernst, R.: Performance Analysis of Complex Systems by Integration of Dataflow Graphs and Compositional Performance Analysis. In: Proc. Design, Automation and Test in Europe (DATE). (March 2007) 273–278 Künzli, S., Hamann, A., Ernst, R., Thiele, L.: Combined Approach to System Level Performance Analysis of Embedded Systems. In: Proc. Int’l Conf. on Hardware/Software Codesign and System Synthesis (CODES/ISSS), Salzburg, Austria (October 2007) 63–68
37. Künzli, S., Poletti, F., Benini, L., Thiele, L.: Combining Simulation and Formal Methods for System-Level Performance Analysis. In: Proc. Design, Automation and Test in Europe (DATE) (March 2006) 236–241
38. Gries, M.: Methods for Evaluating and Covering the Design Space during Early Design Development. Integration, the VLSI Journal 38(2) (December 2004) 131–183
39. de Kock, E.A., Essink, G., Smits, W.J.M., van der Wolf, P., Brunel, J.Y., Kruijtzer, W.M., Lieverse, P., Vissers, K.A.: YAPI: Application Modeling for Signal Processing Systems. In: Proc. Design Automation Conference (DAC), Los Angeles, CA, USA (June 2000) 402–405
40. Edwards, S.A., Vasudevan, N., Tardieu, O.: Programming Shared Memory Multiprocessors with Deterministic Message-Passing Concurrency: Compiling SHIM to Pthreads. In: Proc. Design, Automation and Test in Europe (DATE), Munich, Germany (March 2008) 1498–1503
41. Pham, D.C., Aipperspach, T., Boerstler, D., Bolliger, M., Chaudhry, R., Cox, D., Harvey, P., Harvey, P.M., Hofstee, H.P., Johns, C., Kahle, J., Kameyama, A., Keaty, J., Masubuchi, Y., Pham, M., Pille, J., Posluszny, S., Riley, M., Stasiak, D.L., Suzuoki, M., Takahashi, O., Warnock, J., Weitzel, S., Wendel, D., Yazawa, K.: Overview of the Architecture, Circuit Design, and Physical Implementation of a First-Generation Cell Processor. IEEE Journal of Solid-State Circuits 41(1) (January 2006) 179–196
42. Paolucci, P.S., Jerraya, A.A., Leupers, R., Thiele, L., Vicini, P.: SHAPES: A Tiled Scalable Software Hardware Architecture Platform for Embedded Systems. In: Proc. Int'l Conf. on Hardware/Software Codesign and System Synthesis (CODES+ISSS), Seoul, South Korea (October 2006) 167–172
43. Benini, L., Bertozzi, D., Bogliolo, A., Menichelli, F., Olivieri, M.: MPARM: Exploring the Multi-Processor SoC Design Space with SystemC. The Journal of VLSI Signal Processing 41 (September 2005) 169–182
44. RTEMS Home Page. http://www.rtems.com
45. Haid, W., Schor, L., Huang, K., Bacivarov, I., Thiele, L.: Efficient Execution of Kahn Process Networks on Multi-Processor Systems Using Protothreads and Windowed FIFOs. In: Proc. IEEE Workshop on Embedded Systems for Real-Time Multimedia (ESTIMedia), Grenoble, France (October 2009) 35–44
46. Edwards, S., Lavagno, L., Lee, E.A., Sangiovanni-Vincentelli, A.: Design of Embedded Systems: Formal Models, Validation, and Synthesis. Proceedings of the IEEE 85(3) (March 1997) 366–390
47. Kopetz, H., Bauer, G.: The Time-Triggered Architecture. Proceedings of the IEEE 91(1) (January 2003) 112–126
48. Henzinger, T.A., Horowitz, B., Kirsch, C.M.: Giotto: A Time-Triggered Language for Embedded Programming. Proceedings of the IEEE 91(1) (January 2003) 84–99
49. Benveniste, A., Caspi, P., Edwards, S.A., Halbwachs, N., Le Guernic, P., De Simone, R.: The Synchronous Languages 12 Years Later. Proceedings of the IEEE 91(1) (January 2003) 64–83
50. Wandeler, E., Thiele, L., Verhoef, M., Lieverse, P.: System Architecture Evaluation Using Modular Performance Analysis: A Case Study. Int'l Journal on Software Tools for Technology Transfer (STTT) 8(6) (November 2006) 649–667
51. Thiele, L., Chakraborty, S., Naedele, M.: Real-Time Calculus for Scheduling Hard Real-Time Systems. In: Proc. Int'l Symposium on Circuits and Systems (ISCAS), Volume 4, Geneva, Switzerland (March 2000) 101–104
52. Cruz, R.L.: A Calculus for Network Delay, Part I: Network Elements in Isolation. IEEE Trans. Inf. Theory 37(1) (January 1991) 114–131
53. Le Boudec, J.Y., Thiran, P.: Network Calculus — A Theory of Deterministic Queuing Systems for the Internet. Volume 2050 of Lecture Notes in Computer Science. Springer Verlag (2001)
54. Wandeler, E., Thiele, L.: Real-Time Calculus (RTC) Toolbox. http://www.mpa.ethz.ch/Rtctoolbox (2006)
55. Haid, W., Keller, M., Huang, K., Bacivarov, I., Thiele, L.: Generation and Calibration of Compositional Performance Analysis Models for Multi-Processor Systems. In: Proc. Int'l Conf. on Systems, Architectures, Modeling and Simulation (IC-SAMOS), Samos, Greece (July 2009) 92–99
56. Wilhelm, R., Engblom, J., Ermedahl, A., Holsti, N., Thesing, S., Whalley, D., Bernat, G., Ferdinand, C., Heckmann, R., Mitra, T., Mueller, F., Puaut, I., Puschner, P., Staschulat, J., Stenström, P.: The Worst-Case Execution Time Problem — Overview of Methods and Survey of Tools. ACM Trans. on Embedded Computing Systems 7(3) (April 2008) 36:1–36:53
57. Pimentel, A.D., Thompson, M., Polstra, S., Erbas, C.: Calibration of Abstract Performance Models for System-Level Design Space Exploration. Journal of Signal Processing Systems 50(2) (February 2008) 99–114
58. Zitzler, E., Thiele, L.: Multiobjective Evolutionary Algorithms: A Comparative Case Study and the Strength Pareto Approach. IEEE Trans. on Evolutionary Computation 3(4) (November 1999) 257–271
59. Zitzler, E., Thiele, L., Bader, J.: SPAM: Set Preference Algorithm for Multiobjective Optimization. In: Conf. on Parallel Problem Solving From Nature (PPSN), Dortmund, Germany (September 2008) 847–858
60. Thiele, L., Chakraborty, S., Gries, M., Künzli, S.: A Framework for Evaluating Design Tradeoffs in Packet Processing Architectures. In: Proc. Design Automation Conference (DAC), New Orleans, LA, USA (June 2002) 880–885
61. Zitzler, E., Laumanns, M., Thiele, L.: SPEA2: Improving the Strength Pareto Evolutionary Algorithm for Multiobjective Optimization. In: Proc. Evolutionary Methods for Design, Optimisation, and Control (EUROGEN), Athens, Greece (2001) 95–100
62. Bleuler, S., Laumanns, M., Thiele, L., Zitzler, E.: PISA — A Platform and Programming Language Independent Interface for Search Algorithms. In: Int'l Conf. on Evolutionary Multi-Criterion Optimization (EMO), Faro, Portugal (April 2003) 494–508
63. Wallace, G.K.: The JPEG Still Picture Compression Standard. IEEE Trans. on Consumer Electronics 38(1) (February 1992) 18–34
Integrated Modeling using Finite State Machines and Dataflow Graphs

Joachim Falk, Joachim Keinert, Christian Haubelt, Jürgen Teich, and Christian Zebelein
Abstract In this chapter, different application modeling approaches based on the integration of finite state machines with dataflow models are reviewed. A particular focus is put on the analyzability of these models and on how restricted Models of Computation are exploited in design methodologies to optimize the hardware/software implementation of a given application model.
Joachim Falk · Christian Haubelt · Jürgen Teich · Christian Zebelein
University of Erlangen-Nuremberg, Hardware-Software-Co-Design, Am Weichselgarten 3, 91058 Erlangen, Germany, e-mail: {joachim.falk,haubelt,teich,christian.zebelein}@cs.fau.de

Joachim Keinert
Fraunhofer Institute for Integrated Circuits IIS, Am Wolfsmantel 33, 91058 Erlangen, Germany, e-mail: [email protected]

1 Introduction

Dataflow graphs are widely accepted for modeling DSP algorithms, e.g., multimedia and signal processing applications. On the one hand, well-known optimization strategies exist for static dataflow graphs; on the other hand, modeling complex multimedia applications using only static dataflow models is a challenging task. These problems can be circumvented by using dynamic dataflow models, which are a good fit for multimedia and signal processing applications. However, the ad-hoc representation of control flow in dynamic dataflow actors has led to problems concerning the extraction of adequate information from these models to serve as a basis for the analysis of these graphs. Hence, neither static nor dynamic dataflow graphs seem to be the best choice upon which to establish a design methodology. One of the key problems is due to the fact that dataflow actors often show control-dominant behavior which cannot be adequately expressed in many dataflow models. Consequently, in the past, different modeling approaches integrating Finite State Machines (FSMs)
with dataflow graphs have been proposed. However, this often comes with the drawback of decreased analysis capabilities. Nevertheless, several successful academic system-level design methodologies based on integrated modeling approaches have emerged recently. In this chapter, we will present different integrated modeling approaches (Section 2). Based on these models, different design methodologies, namely Ptolemy II [30], OpenDF [3], and SystemCoDesigner [26], will be presented in Section 3. We will focus particularly on the analyzability of the application model and the exploitation of these analysis capabilities in the corresponding design methodology. Finally, in Section 4, a future direction in design methodologies based on integrated FSMs and multidimensional dataflow models will be presented.
2 Modeling Approaches

In this section, we will discuss modeling approaches combining dataflow graphs with Finite State Machines (FSMs). Starting with *charts [17], one of the first integrated modeling approaches implemented in the Ptolemy project [11], the California Actor Language (CAL) [9], Extended CoDesign Finite State Machines (ECFSMs) [35], and SysteMoC [12] will be discussed.
2.1 *charts

One of the first modeling approaches integrating Finite State Machines (FSMs) with dataflow models is *charts (pronounced "star charts") [17]. The concurrency model in *charts is not restricted to dataflow: other choices are discrete event models or synchronous/reactive models. However, in the scope of this chapter, we will limit our discussion to the FSM/dataflow integration. We start by formally defining deterministic finite state machines:

Definition 1 (FSM). A deterministic FSM is a five-tuple (Q, Σ, Ω, δ, q0), where Q is a finite set of states, Σ is a set of symbols denoting the possible inputs, Ω is a set of symbols denoting the possible outputs, δ : Q × Σ → Q × Ω is the transition mapping function, and q0 ∈ Q is the initial state.

In one reaction, an FSM maps its current state p ∈ Q and an input symbol a ∈ Σ to a next state q ∈ Q and an output symbol b ∈ Ω, where (q, b) = δ(p, a). Given a sequence of input symbols, an FSM performs a sequence of reactions starting in its initial state q0. Thereby, a sequence of output symbols in Ω is produced. FSMs are often represented by state transition diagrams. In state transition diagrams, vertices correspond to states and edges model state transitions, see Fig. 1 from [17]. Edges are labeled by guard/action pairs, where guard ∈ Σ specifies the input symbol triggering the corresponding state transition and action ∈ Ω specifies the output symbol to be produced by the FSM reaction. The edge without source state points to the initial state.

Fig. 1 A state transition graph representing an FSM, with transitions labeled a/u and b/v
One reaction of the FSM is given by triggering a single enabled state transition. An enabled state transition is an outgoing transition from the current state whose guard matches the current input symbol. Thereby, the FSM changes to the destination state of the transition and produces the output symbol specified by the action. To simplify notation, each state is assumed to have implicit self transitions for input symbols not used as guards of any explicit transition. The action in such cases is empty or the special output symbol ε ∈ Ω indicating the absence of output. FSMs as presented in Definition 1 can easily be extended to multiple input signals and multiple output signals. In such a scenario, input and output symbols are defined by subdomains, i.e., Σ = Σ1 × · · · × Σn and Ω = Ω1 × · · · × Ωm.

2.1.1 Heterogeneity

The *charts approach allows arbitrary nesting of dataflow graphs and FSMs by either refining dataflow actors by FSMs or by refining states of an FSM by dataflow graphs. For this purpose, *charts assumes a "black box" approach, i.e., on each level of hierarchy, a set of connected modules is described and modules are treated as black boxes. Some Model of Computation (MoC) is selected to govern the module interaction on each hierarchical level. However, this MoC is not responsible for governing the modules' content. The only requirement is that the modules' interfaces conform to some standard accepted by the outer MoC. Based on this idea, it is possible to extend FSMs by refining either states or transitions by dataflow graphs which react to some subset of the input signals and produce a subset of the output signals defined by the guards and actions of state transitions. A reaction of an FSM will usually need to take finite time. Hence, it has to be guaranteed that the computation of the dataflow graph eventually halts. As dataflow systems are typically designed to work on an infinite stream of data, a solution is needed that implies the finiteness of the computation. For many non-terminating dataflow graphs, their execution can be divided into finite iterations. An iteration is a minimal set of actor firings returning the dataflow graph to its initial state. With each reaction of the FSM, an iteration of the dataflow graph is associated.
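To make the reaction semantics of Definition 1 concrete, the following C++ sketch (our own illustration; the state and symbol names are hypothetical and not taken from [17]) implements a deterministic FSM as a transition table. Each reaction consumes one input symbol and either fires an explicit transition or falls back to an implicit self transition emitting the absent symbol ε (spelled "eps" here).

#include <iostream>
#include <map>
#include <string>
#include <utility>

// A deterministic FSM (Q, Sigma, Omega, delta, q0) per Definition 1.
// "eps" models the absent output symbol of implicit self transitions.
struct Fsm {
  // delta: (current state, input symbol) -> (next state, output symbol)
  std::map<std::pair<std::string, std::string>,
           std::pair<std::string, std::string>> delta;
  std::string state; // current state, initialized to q0

  // One reaction: consume one input symbol, emit one output symbol.
  std::string react(const std::string& in) {
    auto it = delta.find({state, in});
    if (it == delta.end()) return "eps"; // implicit self transition
    state = it->second.first;
    return it->second.second;
  }
};

int main() {
  // A two-state FSM in the spirit of Fig. 1: transitions a/u and b/v.
  Fsm fsm{{{{"q0", "a"}, {"q1", "u"}},
           {{"q1", "b"}, {"q0", "v"}}},
          "q0"};
  for (const std::string& in : {"a", "b", "a", "c"})
    std::cout << fsm.react(in) << '\n'; // prints: u v u eps
}

The last input symbol, "c", is not used as a guard of any explicit transition, so the reaction leaves the state unchanged and emits the absent symbol.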
On the opposite side, an FSM can be used to specify the behavior of some dataflow actor. However, in this case the dataflow MoC is responsible for unambiguously providing the input symbols to the FSM.

2.1.2 Synchronous Dataflow Graphs

The existence of a finite iteration is undecidable for general dataflow graphs. Hence, the application of *charts is restricted to certain subclasses of dataflow graphs. In Synchronous Dataflow (SDF) [31] graphs, a firing of an actor is an atomic computation that consumes a fixed number of tokens from each incoming edge and produces a fixed number of tokens on each outgoing edge. The most restrictive subclass of SDF graphs are Homogeneous SDF (HSDF) graphs, where each actor consumes and produces exactly one token from each incoming and on each outgoing edge, respectively. Edges are conceptually unbounded FIFO queues representing a possibly infinite stream of data. The consumption and production rates can be used to unambiguously define an iteration that returns the queues to their original state. This can be done by solving the balance equation γi · pi = γj · cj, where the edge (i, j) is assumed to be directed from actor i to actor j. Here, pi is the number of tokens produced by actor i onto this particular edge, whereas cj is the number of tokens consumed by actor j; γi and γj denote the number of firings of actor i and actor j, respectively, required to return the edge (i, j) to its original state. Besides the trivial solutions (γi = γj = 0 or γi = γj = ∞), the balance equation may also be satisfied by a finite number of firings. Considering a network of N actors with M edges, there will be M equations with N unknowns. For connected graphs, there exists either only the solution γi = 0, 1 ≤ i ≤ N, or a unique smallest solution, called the minimal repetition vector γ, with γi > 0 for all i.

When an FSM refines an SDF actor, it must externally obey SDF semantics, i.e., it must consume and produce a fixed number of tokens in each firing. In the simplest case, the FSM refines an HSDF actor: if activated, each incoming edge to the actor carries at least one token which takes on some value from an alphabet. The cross product of these alphabets forms the input alphabet of the FSM, perfectly matching the FSM model (see Definition 1). The action of the enabled state transition emits a symbol from the output alphabet of the FSM which is used to derive a token value to be produced on each outgoing edge of the HSDF actor. If values of consumed or produced tokens are not specified in the guard or action of an enabled state transition, they are interpreted as ε, indicating the absence of values.

The example in Fig. 2 is taken from [17]. The top-level HSDF model consists of two actors connected by a single edge. The minimal repetition vector is γA = γB = 1. Since the edge (A, B) contains no initial tokens, actor A fires before actor B in each iteration. Both actors are refined by an FSM. The names at the inputs and outputs of the actors (a, b, x, c, y) are used by the refining FSMs to access the corresponding token values.
Fig. 2 Refinement of SDF actors by FSMs in *charts
In the first iteration, actor A fires if at least one token is available on each input edge. As FSM A is in its initial state and a is present, the firing of HSDF actor A is performed by executing the state transition a/x, thereby emitting x and subsequently consuming both input tokens and producing an output token with value x. Afterwards, HSDF actor B is enabled in the first iteration, but HSDF actor A might also already be enabled in the second iteration, assuming sufficient input data is available. Hence, FSM A and FSM B can execute concurrently. The extension towards non-homogeneous SDF actors is done by systematically differentiating each token consumed or produced, appending its occurrence index to its name, i.e., a[0], a[1], . . ., e.g., as seen in Figure 3. This will also be illustrated in the next subsection.

If an SDF graph refines a state of an FSM and this state is the current state, the next reaction of the FSM will consist of an iteration of the SDF graph followed by the reaction of the FSM. If the SDF graph is homogeneous, then it naturally fits the FSM model. If the SDF graph is not homogeneous, the semantics become more subtle. The example in Fig. 3(a) is taken from [17]. The top-level FSM A consists of two states, one of which is refined by an SDF graph consisting of two actors B and C connected by a single edge. The minimal repetition vector is γB = 2, γC = 1, i.e., the repetition vector is satisfied by two firings of actor B followed by one firing of actor C, implementing one iteration of the SDF graph. Assuming the FSM A is in the refined state, the next activation of its outgoing transition leads to one iteration of the SDF graph followed by a state transition to the other state. Thereby, the FSM will provide the values of b[0] and b[1] as well as a[0] = · · · = a[3] = ε as input tokens to the SDF graph and emit the token values y[0], y[1] resulting from the SDF graph iteration. Assuming the above behavior, the SDF graph can be substituted by a single SDF actor as shown in Fig. 3(b). However, replacing an SDF graph by a single SDF actor is not possible in general due to feedback loops. Furthermore, one must be aware that the above semantics are only correct if the FSM is indeed the top-level system or again refines an SDF actor.

2.1.3 Dynamic Dataflow

Dynamic dataflow (DDF) actors consume and produce a variable number of tokens on each firing. To combine FSMs with DDF, the concept of firing rules can be used [29].
Fig. 3 (a) An SDF graph refining a state. (b) The SDF graph substituted by a single SDF actor
In this case, it is assumed that a DDF actor must assert, prior to any firing, the required number of tokens on each input. If an FSM refines a DDF actor, then in each state of the FSM, the number of tokens that need to be consumed from each incoming edge must be determined for the next firing. The semantics are as follows: At least one token is consumed from each incoming edge mentioned in the guard of any outgoing state transition from the current state. If multiple tokens are mentioned, e.g., a[0], a[1], . . ., a[i], the largest index i is determined over the guards of all outgoing state transitions, and the actor will consume i + 1 tokens on the next firing. If a DDF graph refines a state of an FSM, the firing rules of the DDF graph are exported to the environment of the FSM, i.e., the FSM will become a DDF actor. This actor is activated if the firing rules are met. However, most realizations of DDF semantics are not compositional, i.e., in general, DDF graphs cannot be represented by a single actor. Techniques to make DDF compositional are discussed in [29].
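The balance equations of Section 2.1.2 lend themselves to a direct implementation. The following C++ sketch (our own illustration, assuming a connected SDF graph with integer rates) computes the minimal repetition vector γ by propagating rational firing ratios along the edges and normalizing the result; it returns an empty vector for inconsistent graphs, for which only the all-zero solution exists.

#include <cstdint>
#include <iostream>
#include <numeric>
#include <vector>

// One SDF edge (i, j): actor i produces p tokens per firing,
// actor j consumes c tokens per firing.
struct Edge { int src, dst; int64_t p, c; };

// Computes the minimal repetition vector of a connected SDF graph by
// propagating the balance equation gamma_i * p_i = gamma_j * c_j along
// each edge and normalizing the resulting rational firing ratios.
std::vector<int64_t> repetitionVector(int n, const std::vector<Edge>& edges) {
  std::vector<int64_t> num(n, 0), den(n, 1);
  num[0] = 1;                                  // seed: fire actor 0 once
  for (bool changed = true; changed; ) {       // fixed-point propagation
    changed = false;
    for (const Edge& e : edges) {
      if (num[e.src] != 0 && num[e.dst] == 0) {
        num[e.dst] = num[e.src] * e.p;         // gamma_dst = gamma_src * p / c
        den[e.dst] = den[e.src] * e.c;
        changed = true;
      } else if (num[e.src] != 0 && num[e.dst] != 0) {
        // consistency check: gamma_src * p must equal gamma_dst * c
        if (num[e.src] * e.p * den[e.dst] != num[e.dst] * e.c * den[e.src])
          return {};                           // inconsistent graph
      }
    }
  }
  int64_t l = 1;                               // clear denominators ...
  for (int i = 0; i < n; ++i) l = std::lcm(l, den[i]);
  std::vector<int64_t> gamma(n);
  for (int i = 0; i < n; ++i) gamma[i] = num[i] * (l / den[i]);
  int64_t g = 0;                               // ... and reduce to the minimum
  for (int64_t x : gamma) g = std::gcd(g, x);
  for (int64_t& x : gamma) x /= g;
  return gamma;
}

int main() {
  // The graph of Fig. 3(a): B produces 1 token per firing, C consumes 2.
  std::vector<Edge> edges = {{0, 1, 1, 2}};    // edge (B, C)
  for (int64_t x : repetitionVector(2, edges)) std::cout << x << ' ';
  // prints: 2 1  (gamma_B = 2, gamma_C = 1)
}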
2.2 The California Actor Language

The California Actor Language (CAL) is a programming language which was originally conceived [9] to describe actors and their behavior only; see also Chapter 32. Later on, the language was extended to allow the specification of dataflow graphs containing these actors. We will first briefly review the syntax of CAL actors and the accompanying formal model. For the mathematical model, we assume that all data values processed by CAL actors are from the universal set of values V. The CAL language is primarily concerned with the definition of the actors themselves and assumes an environment which provides the concrete dataflow graph and channel types in which the actors are instantiated. Data types which are underspecified in the CAL actor description are assumed to be inferred from the types used in the concrete channels to which the actor will be connected in the environment. For an in-depth review of the type system used in CAL, we refer the interested reader to the language manual [9].
1 actor Add () T t1, T t2 ==> T s :
2   action [a], [b] ==> [sum]
3   var sum do sum := a + b; end
4 end

(a) Addition actor consuming one token from each of its input ports t1 and t2, producing one token on the output port s containing the summation of the input tokens.
1 actor Select () S, A, B ==> O :
2   action S:[s], A:[v] ==> [v]
3   guard s end
4
5   action S:[s], B:[v] ==> [v]
6   guard not s end
7 end

(b) The Select actor forwarding tokens from either port A or port B depending on a control token on port S.
Fig. 4 Two simple CAL actor examples: an adder and the BDF [6] Select actor
CAL actors are described by atomic firing steps [7], where each firing step consumes a finite (possibly empty) sequence of tokens from each input port of the actor and produces a finite (possibly empty) sequence of tokens on each output port. A simple example of a CAL actor can be seen in Figure 4(a): it consumes one token from each of the input ports t1 and t2, calculates the sum of the data values in these tokens, and produces an output token on the output port s containing this data value. The names of the input and output ports as well as optional type declarations for the tokens are specified by the port signature of the actor, e.g., as seen in Figure 4(a) on Line 1, where the two input ports t1 and t2 as well as the output port s are constrained to have the same token type T inferred from the environment. The computation executed by an actor specified in CAL is performed by its actions, e.g., as depicted in Lines 2-3, where the summation operation of the Add actor is performed. This action is guarded by an input/output pattern (Line 2) encoding the requirements on the number of available tokens on the input ports and on the number of free token places on the output ports. Furthermore, actions can be guarded by general conditions depending on the values of the input tokens, as used in the modeling of the Select actor known from Boolean dataflow (BDF) [6]. The Select actor, depicted in Figure 4(b), forwards a token from either its input port A or B to its sole output port O depending on a control input port S. If the token on S carries a true value, then the token from input port A is consumed and forwarded to the output port (Lines 2-3). Otherwise, the token from input port B is forwarded to the output port (Lines 5-6). In both cases, the control token from port S is consumed. Moreover, CAL actors themselves can have a state σ ∈ Σ, e.g., representing state variables or states of a schedule FSM controlling action execution. An example of an actor using state variables is depicted in Figure 5(a), where the state variable is defined on Line 2 and carries the accumulated integration value. Furthermore, there can be additional conditions depending on these state variables guarding action execution, e.g., the schedule FSM of the actor named PingPong depicted
1 actor Integrate () T t ==> T s :
2   T sum := 0;
3
4   action [x] ==> [sum]
5   do sum := sum + x; end
6 end

(a) Integration actor using the state variable sum to keep the current integration value.
1 actor PingPong () A, B ==> O :
2   FA: action A:[v] ==> [v] end
3   FB: action B:[v] ==> [v] end
4   schedule fsm s0 :
5     s0 (FA) --> s1;
6     s1 (FB) --> s0;
7   end
8 end

(b) Example of a simple merge actor merging from port A, then B, repeated ad infinitum.
Fig. 5 Two simple stateful CAL actors: one with a state variable, the other with a schedule FSM controlling action activation
1 actor NDMerge () A, B ==> O :
2   action A:[v] ==> [v] end
3   action B:[v] ==> [v] end
4 end

(a) Example of a simple non-deterministic merge actor specified in CAL.
1 actor PMerge () A, B ==> O :
2   FA: action A:[v] ==> [v] end
3   FB: action B:[v] ==> [v] end
4   priority FA > FB end
5 end

(b) Example of a merge actor preferring tokens from input port A over tokens from B.
Fig. 6 Two CAL actors using different strategies for selecting the action to execute in the presence of multiple enabled actions
in Figure 5(b) is such an example, where the execution of action FA is guarded by the condition that the schedule FSM state is equal to s0, and the action FB is guarded by the condition that the schedule FSM state is equal to s1. In this sense, schedule FSMs only represent syntactic sugar of the CAL language, as they could be replaced by an explicit actor state variable and guard conditions on this state variable.1 However, explicitly using schedule FSMs clearly separates the actor data manipulation, which calculates the token values produced, from the communication behavior, which controls the amount of tokens consumed and produced.

Every atomic firing of an action is executed in a state σ, consumes a finite sequence of input tokens s ∈ S from each input port, produces a finite sequence of output tokens s′ ∈ S, and determines a new actor state σ′.2 The action can only be executed if its guard conditions are satisfied. More formally, the actions can be seen as transitions σ −(s→s′)→ σ′ in a transition relation τ ⊆ Σ × Sn × Sm × Σ, where n and m are the number of input and output ports, respectively.

If multiple transitions σ −(s→s′)→ σ′ ∈ τ are enabled simultaneously, the CAL environment can select one of these transitions non-deterministically for execution. An example of such a CAL actor is depicted in Figure 6(a), which describes a non-deterministic merge actor where the environment can select either the token on port A or on port B to forward to the output port O.

1 The CAL language even allows schedule FSMs to be specified by a regular expression.
2 We use S to denote the finite sequences of tokens from the universal set of values V.
If this decision should instead be encoded in the actor, CAL provides a priority mechanism which creates a non-reflexive, anti-symmetric, and transitive partial order relation on the transitions. Actions which are lower in this partial order cannot be executed if there is at least one enabled action which is higher in the partial order. To exemplify this, we can look at the priority merge actor from Figure 6(b), which prioritizes the action FA over the action FB. Therefore, action FB can never be executed if there is a token present on input port A. With the priority extension, the formal model of a CAL actor consists of a triple (σ0, τ, Π) containing its initial actor state σ0 ∈ Σ, its transition relation τ ⊆ Σ × Sn × Sm × Σ encoding actions and guards, and the partial order Π ⊂ τ × τ encoding the priority relationship of the actions.
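Operationally, the priority order Π restricts the non-deterministic choice: an enabled action may only fire if no other enabled action dominates it in Π. A minimal C++ sketch of this selection rule (with hypothetical types of our own choosing; a real CAL runtime would attach patterns, guards, and bodies to each action):

#include <set>
#include <utility>
#include <vector>

using Action = int; // hypothetical handle for a CAL action

// Selects one action to fire from the currently enabled actions while
// respecting a strict partial order 'higherThan' (the priority relation
// Pi): an enabled action is eligible only if no other enabled action is
// strictly higher. Returns -1 if no action is enabled.
int selectAction(const std::vector<Action>& enabled,
                 const std::set<std::pair<Action, Action>>& higherThan) {
  for (Action a : enabled) {
    bool dominated = false;
    for (Action b : enabled)
      if (higherThan.count({b, a})) { dominated = true; break; }
    if (!dominated) return a; // a maximal enabled action
  }
  return -1;
}

// For the PMerge actor of Figure 6(b) with FA = 0, FB = 1, and the
// priority FA > FB, i.e., higherThan = {(0, 1)}, selectAction returns
// FA whenever both actions are enabled, and FB only when FA is not.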
2.3 Extended Codesign Finite State Machines

The Extended Codesign Finite State Machine (ECFSM) Model of Computation (MoC) has been presented in [35]. As will be seen, the original Codesign Finite State Machine (CFSM) MoC used in POLIS [1] is a special case of the ECFSM MoC. Both models, however, are refinements of the Abstract Codesign Finite State Machine (ACFSM) model, also presented in [35]. In this section, we will therefore first introduce the ACFSM Model of Computation, followed by a discussion of the refinement step which transforms a given ACFSM into an ECFSM. The ACFSM model is a formal model consisting of a network of FSMs (in the following called actors) connected via unbounded FIFO channels. As in CAL, tokens (also called events) are transmitted across the channels, carrying data values of some data type. Figure 7 shows an example of such a dataflow network consisting of several ACFSMs.
Fig. 7 Filter example dataflow graph: The ACFSMs (Producer, Controller, Filter and Consumer) are connected via FIFO channels. The Filter actor has two inputs (in and coeff) from which tokens are consumed and one output (out) onto which tokens are produced.
Briefly, a single ACFSM consists of an FSM controlling the communication behavior of the actor, transforming a finite sequence of input tokens into a finite sequence of output tokens. In order to formally describe the behavior of a single ACFSM, we will first look at the behavior of the Filter actor from Figure 7. Its task is to filter a sequence
of pixels from the Producer actor by multiplying all pixels of a frame by a coefficient (provided by the Controller actor). At the beginning of each frame, the Producer actor writes two tokens onto the output out containing the number of lines per frame and the number of pixels per line, respectively. These two values are read by the Filter actor from the input in, whereas the initial multiplication coefficient is read from the input coeff. For each line, the coefficient value may change, depending on whether or not a token is available on the input coeff. Subsequently, the filtered pixels are written onto the output out.

 1 module Filter {
 2   input byte in, coeff;
 3   output byte out;
 4   int nLines, nPix, line, pix;
 5   byte k;
 6   forever {
 7     (nLines, nPix) = read(in, 2);
 8     k = read(coeff, 1);
 9     for (line = 0; line < nLines; ++line) {
10       if (present(coeff, 1))
11         k = read(coeff, 1);
12       for (pix = 0; pix < nPix; ++pix) {
13         write(out, read(in, 1) * k, 1);
14       }
15     }
16   }
17 }

Fig. 8 Behavior of the Filter actor from Figure 7
The behavior of the Filter actor can be represented by C-like pseudocode as shown in Figure 8. There exist three primitives which can be used in order to access the FIFO channels connected to the inputs and outputs of the actor: read(in, n) and write(out, data, n) consume and produce n tokens from input in and output out, respectively. Note that whereas read blocks until enough tokens are available, write never blocks, as the FIFO channels are unbounded. The third primitive present(in, n) returns true if at least n tokens are available on the input in. In order to transform the Filter pseudocode into an ACFSM, we will now discuss the formal definition of an ACFSM: Definition 2. An ACFSM is a triple a = (I, O, T ) consisting of a finite ordered set of inputs I = {i1 , i2 , . . . , in }, a finite ordered set of outputs O = {o1 , o2 , . . . , on }, and a finite set of transitions T . In the following, we will use i[k] to indicate the token on the k-th position on the FIFO channel connected to input i ∈ I (as seen from the ACFSM). Analogously, we will use o[k] to indicate the token produced by a transition onto the k-th position on the FIFO channel connected to output o ∈ O.
Definition 3. A transition of an ACFSM is a tuple T = (req, cons, prod, fguard, faction): The input enabling rate req : I → N0 maps each input i ∈ I to the number of tokens which must be available on the channel connected to i in order to enable the transition. The input consumption rate cons : I → N0 ∪ {ALL} specifies for each input i ∈ I the number of tokens which will be consumed from the channel connected to i when the transition is executed. Specifying cons(i) = ALL for a given input i resets the channel connected to i, i.e., all tokens currently stored in the channel are consumed. Otherwise, a transition is not allowed to consume more tokens than requested, i.e., cons(i) ≤ req(i). Analogously, the output production rate prod : O → N0 specifies for each output o ∈ O the number of tokens which will be produced on the channel connected to o when the transition is executed. The guard function fguard is a boolean-valued expression over the values of the tokens required on the inputs, i.e., {i[k] | i ∈ I ∧ 1 ≤ k ≤ req(i)}. Note that a transition can only be executed if both the input enabling rate req and the guard function fguard are satisfied. Finally, the action function faction determines the values of the tokens which are to be produced on the outputs, i.e., {o[k] | o ∈ O ∧ 1 ≤ k ≤ prod(o)}.

In contrast to CAL, ACFSMs cannot have explicit state variables. Nevertheless, these can be modeled by state feedback channels, i.e., by an input/output pair of an ACFSM which is connected to the same channel containing a certain state variable. If a transition wants to update the state variable, it must consume the token containing the old value and produce a token containing the new value.

The execution semantics of an ACFSM can be described as follows: Initially, an ACFSM is idle, i.e., waiting for input tokens. An ACFSM is ready if at least one transition t ∈ T is enabled, i.e., if both the input enabling rate t.req and the guard function t.fguard are satisfied. Subsequently, a single enabled transition t will be chosen, transitioning the ACFSM into the executing state. During the executing state, the ACFSM atomically consumes tokens from the inputs according to t.cons, performs the action function t.faction, and produces tokens on the outputs according to t.prod. It should be noted that due to the non-blocking reads which can be implemented by ACFSMs, they are not continuous in Kahn's sense [21], i.e., the arrival of tokens at different times may change the behavior of the dataflow graph.

In [35], the authors claim that "ACFSM transitions [. . . ] can be conditioned to the absence of an event over a signal". In principle, there are two possibilities for how this could be modeled:

• Provide a special syntax for the input enabling rate such that checks for empty channels can also be expressed. However, the input enabling rate as given in Definition 3 only permits the expression of minimum conditions. Therefore, an input i with req(i) = 0 could lead to the execution of the corresponding transition even if the channel connected to i is not empty.
• Specify priorities for the transitions of an ACFSM such that transitions requiring more tokens can have a higher priority than transitions requiring fewer tokens. However, in the original paper it is also stated that "if several transitions can be executed [. . . ], the ACFSM is non-deterministic and can execute any of the matching transitions", i.e., there seems to be no concept of priorities for transitions.
For example, lines 10-11 in Figure 8 cannot be adequately modeled without checking for the absence of tokens: One transition is needed which checks if at least one token is available on input coeff, i.e., req(coeff) = 1. However, as we do not want to block execution if there is no token available, a second transition is needed with req(coeff) = 0. As discussed above, the second transition could be executed even if the corresponding channel is not empty. Without the means of expressing a check for an empty channel, we can only assume that the first transition has a higher priority than the second, allowing us to check for the absence of a new multiplication coefficient.

The behavior of the Filter actor given in Figure 8 can be transformed into an ACFSM as follows: First, feedback channels are added for the state variables nLines, nPix, line, pix and k. Additionally, a feedback channel is needed for storing the current state of the ACFSM, state. The corresponding input/output pairs will be named accordingly, resulting in

• the set of inputs I = {in, coeff, nLines, nPix, line, pix, k, state} and
• the set of outputs O = {out, nLines, nPix, line, pix, k, state}.

We assume that an initial token with a value of 0 is placed on each state feedback channel. The resulting transitions are shown in Table 1. For example, the first transition can be enabled if two tokens are available on input in, one token on coeff, and one token on each input connected to a state feedback channel (except pix). Additionally, the current state of the ACFSM must be the initial state, i.e., the value of the (single) token on input state must represent the initial state, i.e., state[1] = 0. If these requirements are satisfied, the transition will be executed and the specified output tokens produced: For example, the value of the first token consumed from input in is used to produce the single token on output nLines. Note that the value of the current state is set to 1, allowing other transitions to be executed.

In conclusion, ACFSMs specify the topology of the dataflow graph and the functional behavior of each module. However, the communication over unbounded channels with FIFO semantics needs to be refined in order to be implemented with a finite amount of resources. This refinement from infinite-sized queues to implementable finite-sized queues is exactly the transformation from an Abstract CFSM into an Extended CFSM, and will be briefly discussed in the following.

Write operations carried out by the transitions of an ACFSM are always non-blocking, as the corresponding channels have an infinite capacity. For a given finite-sized channel, however, if write operations should be non-blocking and the channel in question is full, new data arriving on the channel will overwrite old data stored in the channel. If this behavior is not desired, different approaches may be applied. First, it is often sufficient to increase the channel's capacity. This approach fails, however, if the average production rate of the producer is greater than the average consumption rate of the consumer. In general, calculating sufficient channel capacities is undecidable. Therefore, ECFSMs introduce blocking behavior to prevent data loss.
The six transitions of the Filter ACFSM, ordered by priority (highest first), are:

1. Guard state[1] = 0: two tokens are required on in and one on coeff, plus one token on each state feedback input except pix; the transition produces nLines[1] ⇐ in[1], nPix[1] ⇐ in[2], line[1] ⇐ 0, k[1] ⇐ coeff[1], and state[1] ⇐ 1.
2. Guard state[1] = 1 ∧ line[1] = nLines[1]: nLines and line are read without being consumed, i.e., (req, cons) = (1, 0); the transition produces state[1] ⇐ 0.
3. Guard state[1] = 1 ∧ line[1] < nLines[1], with one token required on coeff: the transition produces k[1] ⇐ coeff[1], line[1] ⇐ line[1] + 1, pix[1] ⇐ 0, and state[1] ⇐ 2.
4. Guard state[1] = 1 ∧ line[1] < nLines[1]: as the previous transition, but without updating k, i.e., it produces line[1] ⇐ line[1] + 1, pix[1] ⇐ 0, and state[1] ⇐ 2.
5. Guard state[1] = 2 ∧ pix[1] = nPix[1]: nPix and pix are read without being consumed; the transition produces state[1] ⇐ 1.
6. Guard state[1] = 2 ∧ pix[1] < nPix[1]: one token is required on in; the transition produces out[1] ⇐ in[1] ∗ k[1] and pix[1] ⇐ pix[1] + 1.
Table 1 Corresponding ACFSM of the Filter actor shown in Figure 8, consisting of six transitions. In order to adequately model the multiplication coefficient update (Lines 10-11 in Figure 8), we assume that the transitions are assigned priorities according to their ordering, with higher priorities first. Note also that, for reasons of readability, cons(i) is omitted if req(i) = cons(i).
Finally, traditional CFSMs are a special case of ECFSMs, where all channels have a capacity of one.
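The idle/ready/executing protocol described above can be summarized in code. The sketch below (our own illustration, with hypothetical C++ types) tests whether an ACFSM transition per Definition 3 is enabled on a set of input FIFOs and then executes it atomically; the ALL shorthand and the blocking writes of ECFSMs are omitted for brevity.

#include <deque>
#include <functional>
#include <map>
#include <string>

using Token = int;
using Channel = std::deque<Token>;
using Channels = std::map<std::string, Channel>;

// One ACFSM transition per Definition 3 (illustrative types only).
struct Transition {
  std::map<std::string, int> req, cons;   // input enabling/consumption rates
  std::map<std::string, int> prod;        // output production rates
  std::function<bool(Channels&)> guard;   // f_guard over required token values
  std::function<void(Channels&, Channels&)> action; // f_action
};

// A transition is enabled iff every input holds at least req(i) tokens
// and the guard over the required token values evaluates to true.
bool enabled(const Transition& t, Channels& ins) {
  for (auto& [port, n] : t.req)
    if ((int)ins[port].size() < n) return false;
  return t.guard(ins);
}

// Atomic execution: run the action, then consume cons(i) tokens per input.
// Writes never block here, since output channels are conceptually unbounded.
void execute(const Transition& t, Channels& ins, Channels& outs) {
  t.action(ins, outs);                    // produce output tokens
  for (auto& [port, n] : t.cons)          // consume input tokens
    ins[port].erase(ins[port].begin(), ins[port].begin() + n);
}

A scheduler would repeatedly scan the transition set, pick one enabled transition (by priority, if priorities are assumed as for Table 1), and call execute.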
2.4 SysteMoC

SysteMoC [12] is an implementation of the ECFSM idea in SystemC, extended by the notion of actor state and hierarchy. At a glance, the notation extends conventional dataflow graph models by finite state machines controlling the consumption and production of tokens by actors. We will first introduce the non-hierarchical SysteMoC model. The vertices of the graph correspond to actors, which are more formally defined below:

Definition 4 (Actor). An actor is a tuple a = (I, O, Ffunc, R) containing actor ports partitioned into actor input ports I and actor output ports O, the actor functionality Ffunc, and the actor FSM R.
In contrast to CAL and ECFSM, we have split the state of the actor into two parts: one manipulated by the actor functionality Ffunc, and the other, the explicit actor FSM state, contained in R. This distinction enables us to mathematically express the separation of the actor data manipulation, which calculates the token values produced (e.g., fcopyApprox as shown in Figure 9), from the communication behavior, which controls the amount of tokens consumed and produced. Subsequent analysis steps will abstract from the state of the actor functionality and only consider the actor FSM state. More intuitively, one can think of the state of the actor functionality as state variables of the actor, which are primarily modified by the data path of the actor modeled as functions, e.g., fcopyApprox contained in Ffunc.
Fig. 9 Visual representation of the SqrLoop actor a2 as used in the network graph gsqr,flat displayed in Figure 10(a). The SqrLoop actor is composed of input ports and output ports, its functionality, and the actor FSM determining the communication behavior of the actor. The condition ix(n)/ox(n) on an actor FSM transition denotes a predicate that tests the availability of at least n tokens/free places on the actor input port ix/output port ox.
Furthermore, we have extended the well-known dataflow graph definition with a notion of ports, i.e., channels no longer connect actors directly but actor ports, as depicted in Figure 10(a). We will call such an extended representation a network graph, formally defined below:

Definition 5 (Non-Hierarchical Network Graph). A non-hierarchical network graph is a directed graph gn = (A, C, P) containing a set of actors A, a set of channels C ⊆ A.O × A.I, and a channel parameter function P : C → N × V∗ which associates with each channel c ∈ C its buffer size n ∈ N = {1, 2, 3, . . .} and a possibly non-empty sequence of initial tokens from V∗.3 Furthermore, we require that all actor ports A.I ∪ A.O are connected to exactly one channel c.
3 We use the '.'-operator, e.g., a.I, for member access of tuples whose members have been explicitly named in their definition, e.g., member I of actor a ∈ A from Definition 4. Moreover, this member access operator has a trivial extension to sets of tuples, e.g., A.I = ⋃a∈A a.I, which is also used throughout this document. We use V∗ to denote the set of all possible finite sequences of tokens v ∈ V, i.e., V∗ = ⋃n∈{0,1,...} Vn.
Fig. 10 (a) Example of a non-hierarchical network graph gsqr,flat, and (b) a corresponding hierarchical version gsqr. The network graphs displayed above implement Newton's iterative algorithm for calculating the square roots of an infinite input sequence of numbers generated by the Src actor a1. The square root values are generated by Newton's iterative algorithm (the SqrLoop actor a2 for the error bound checking, and a3-a4 to perform an approximation step). After satisfying the error bound, the result is transported to the Sink actor a5.
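As a structural illustration of Definition 5 (with hypothetical C++ type names of our own choosing), a network graph can be captured by a few plain data types; the channel parameter function P becomes per-channel fields for the buffer size and the initial tokens:

#include <string>
#include <utility>
#include <vector>

using Value = int; // stand-in for the universal value set V

// An endpoint names an actor together with one of its ports.
using Endpoint = std::pair<std::string, std::string>;

// A channel c in C together with its parameters P(c): a buffer size
// n >= 1 and a possibly non-empty sequence of initial tokens from V*.
struct Channel {
  Endpoint from;   // actor output port
  Endpoint to;     // actor input port
  int bufferSize = 1;
  std::vector<Value> initialTokens;
};

// A non-hierarchical network graph per Definition 5; the requirement
// that every actor port is connected to exactly one channel is not
// checked here.
struct NetworkGraph {
  std::vector<std::string> actors;
  std::vector<Channel> channels;
};

// For example, a fragment of gsqr,flat from Fig. 10(a):
//   NetworkGraph g{{"Src", "SqrLoop", "Approx", "Dup", "Sink"},
//                  {{{"Src", "o1"}, {"SqrLoop", "i1"}, 1, {}}}};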
The execution of SysteMoC models can be divided into three phases: (i) checking for enabled transitions for each actor, i.e., sufficient tokens and free places on input and output ports and guard functions evaluating to true, (ii) selecting and executing one enabled transition per actor, and (iii) consuming and producing the tokens needed by the transition. Note that step (iii) might enable new transitions. Hence, SysteMoC is similar in this regard to ECFSMs, where actors are blocked as long as the output buffers cannot store the results. More formally, the actor FSM is defined as follows:

Definition 6 (Actor FSM). The actor FSM of an actor a ∈ A is a tuple a.R = (T, Qfsm, q0,fsm) containing a finite set of transitions T, a finite set of states Qfsm, and an initial state q0,fsm ∈ Qfsm. The transitions t = (qfsm, req, cons, prod, fguard, faction, q′fsm) ∈ T are extensions of the ACFSM transitions from Definition 3 by a current state qfsm ∈ Qfsm and a next state q′fsm ∈ Qfsm. The enabling rate req : I ∪ O → N0, the input consumption rate cons : I → N0, and the output production rate prod : O → N0 map each input/output to the required number of tokens/places to fire the transition, the number of tokens consumed if the transition is fired, and the number of tokens produced if the transition is fired, respectively. The special ALL notation of ACFSMs is not supported. The guard function fguard is a boolean-valued expression over the values of the tokens required on the inputs and the state of the actor functionality. Finally, the action function faction from the actor functionality Ffunc determines the values of the tokens which are to be produced on the outputs.

Next, we introduce the notion of clusters, which will be used to introduce hierarchy into the non-hierarchical network graph model by allowing actors in a network graph or cluster to be either proper actors or clusters themselves.
A cluster itself is simply a network graph extended by the notion of ports.

Definition 7 (Cluster). A cluster is a graph gγ = (I, O, A, C, P) containing cluster ports partitioned into cluster input ports I ⊆ A.I and cluster output ports O ⊆ A.O, a set of actors A, a set of channels C ⊆ A.O × A.I, and a channel parameter function P : C → N × V∗ which associates with each channel c ∈ C its buffer size n ∈ N = {1, 2, 3, . . .} and a possibly non-empty sequence of initial tokens from V∗. Furthermore, the requirement that all actor ports are connected to exactly one channel is relaxed to only include actor ports not in the set of cluster ports. The cluster ports themselves must not be connected to any channels inside the cluster.

As can be seen from Definition 5 and Definition 7, a non-hierarchical network graph gn is simply a cluster without cluster ports, i.e., gn.I = gn.O = ∅, where all actors gn.A are proper actors, i.e., gn.A ∩ G = ∅.4 We will call a cluster (network graph) gγ non-hierarchical if it contains no subcluster, i.e., gγ.A ∩ G = ∅, and hierarchical otherwise. More formally, we define a hierarchical network graph as follows:

Definition 8 (Hierarchical Network Graph). A hierarchical network graph gn is a cluster without cluster ports, i.e., gn.I = gn.O = ∅.

The transformation of a possibly non-hierarchical original cluster into a hierarchical cluster will be denoted by the cluster operator κ : G × 2A → G × G. This operation takes an original cluster and a set of actors Ato cluster to generate a clustered network graph containing a subcluster which replaces the set of clustered actors Ato cluster. An example of such a cluster operation can be seen in Figure 10(a), depicting the original network graph gsqr,flat, and Figure 10(b), showing the clustered network graph gsqr containing the subcluster gsqr,γ which replaces the clustered actors, i.e., κ(gsqr,flat, Ato cluster) = (gsqr, gsqr,γ). The semantics of a hierarchical network graph gn are defined by the semantics of its corresponding non-hierarchical network graph. This non-hierarchical network graph is obtained by recursively applying the dissolve operation κ−1, the inverse of the cluster operation, until the resulting network graph no longer contains any subclusters. Hence, so far we have only introduced structural hierarchy. To allow arbitrary nesting of dataflow graphs and FSMs, we will introduce the notion of a cluster FSM gγ.R associated with a cluster gγ, responsible for the scheduling of the dataflow graph contained in gγ.

Definition 9 (Cluster FSM). The cluster FSM of a cluster gγ is a tuple gγ.R = (T, Qfsm, q0,fsm) containing a finite set of transitions T ∋ t = (qfsm, req, cons, prod, fsched, q′fsm), a finite set of states Qfsm, and an initial state q0,fsm ∈ Qfsm. Cluster and actor FSMs differ in their possible actions faction/fsched, which are selected from the actor functionality Ffunc in the case of actor FSMs and are scheduling sequences
4 We use G to denote the set of all possible clusters.
fsched ∈ Fsched in the case of cluster FSMs. A scheduling sequence fsched ∈ Fsched is a finite sequence of actor/cluster firings from the actors/clusters contained in the dataflow graph gγ, i.e., Fsched = gγ.A∗.

In summary, a cluster FSM is an actor FSM which simply has another sort of action controlling the execution of the actors contained in the cluster. If no cluster FSM is present for a cluster, the contained actors are self-scheduled.
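To summarize the three execution phases and Definition 6, the following sketch mimics an actor with a single input and a single output port in plain C++. Note that this is our own structural illustration of the model only and deliberately does not reproduce the actual SysteMoC library syntax.

#include <cstddef>
#include <deque>
#include <functional>
#include <vector>

// One actor FSM transition per Definition 6, specialized to an actor
// with a single input and a single output port (illustrative only).
struct FsmTransition {
  int from, to;                                   // q_fsm and q'_fsm
  std::size_t cons, prod;                         // consumption/production rates
  std::function<bool(const std::deque<int>&)> guard;                // f_guard
  std::function<std::vector<int>(const std::vector<int>&)> action;  // f_action
};

struct Actor {
  std::deque<int> in, out;                        // FIFO channels
  std::size_t outCapacity;                        // finite output buffer size
  std::vector<FsmTransition> transitions;
  int state = 0;                                  // initial state q_0,fsm

  // One scheduling step, phases (i)-(iii): check, select, fire.
  bool fireOnce() {
    for (const FsmTransition& t : transitions) {
      if (t.from != state || in.size() < t.cons) continue;        // (i) tokens
      if (outCapacity - out.size() < t.prod) continue;            // (i) places
      if (t.guard && !t.guard(in)) continue;                      // (i) guard
      std::vector<int> consumed(in.begin(), in.begin() + t.cons); // (ii) select
      for (int v : t.action(consumed)) out.push_back(v);          // (iii) produce
      in.erase(in.begin(), in.begin() + t.cons);                  // (iii) consume
      state = t.to;
      return true;
    }
    return false;  // actor is blocked, as in ECFSMs with full output buffers
  }
};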
3 Design Methodologies

In this section, we will discuss design flows based on integrated Finite State Machine (FSM) and dataflow modeling. A focus is thereby given to the exploitation of any knowledge about restricted Models of Computation that increases analyzability and, hence, enables optimization of the implementation.
3.1 Ptolemy II

The Ptolemy II project [11] started in 1996 as the successor of the Ptolemy Classic project [5] at the University of California, Berkeley. Ptolemy II is a software infrastructure for modeling, analysis, and simulation of embedded systems. The project focuses on the integration of different Models of Computation (MoCs) via so-called hierarchical heterogeneity. Currently, the supported MoCs include continuous time, discrete event, synchronous dataflow, finite state machines, concurrent sequential processes, process networks, etc. Ptolemy II supports domain polymorphism, i.e., the reuse of actors in different MoCs. The coupling of different MoCs is realized via the *charts approach, as presented in Section 2.1. Ptolemy II therefore uses an abstract syntax based on the concept of so-called clustered graphs. A clustered graph consists of entities and relations. Entities have ports, and relations connect to ports. In general, entities correspond to actors and relations to edges. This enables the designer to model, analyze, or simulate heterogeneous systems. Each hierarchy level of the clustered graphs is associated with a certain MoC. For the analysis of such a hierarchy level, the classical approaches known for the corresponding MoC are employed. However, the decomposition of an application into hierarchy levels with explicit MoC assignments can be problematic for the designer. Therefore, a tendency to select the most general MoC, e.g., process networks instead of synchronous dataflow, for the hierarchy levels can be observed. Furthermore, Ptolemy II only allows simulation and target code generation. Also, there is no explicit support for mixed hardware/software implementation. In the following, we will discuss OpenDF, a design framework based on CAL, and SystemCoDesigner, a design methodology based on SysteMoC. Both support the synthesis of integrated FSM and dataflow models into hardware/software implementations.
3.2 The OpenDF Design Flow

The OpenDF project [3] provides design entry, simulation, and debugging support for the CAL language [9]. Dataflow graphs are specified via the network language (NL), which instantiates the CAL actors. The actor parameters derived from the instantiation of the actors in the NL file are inserted, and the CAL source code is parsed and transformed into an XML intermediate representation. This representation is used for various transformations, e.g., type inference, type checking, constant propagation, and dataflow analysis. The result is an intermediate representation of the CAL source code in static single assignment (SSA) form. From this result, either a simulation can be executed or one of two synthesis backends generating C++ or Verilog source code can be started.5

The software synthesis tool CAL2C [19] supports two synthesis targets: one targeting POSIX threads and software FIFOs, the other targeting the SystemC environment, using it as a process network (PN) simulation framework. The synthesis process, also depicted in Figure 11, is done by (i) translating the NL dataflow description file into a header instantiating all actors and FIFOs and their initial token valuation, (ii) translating the CAL actor itself by emitting all the actions and internal state variables as C code in an intermediate step via the C Intermediate Language [34] and its pretty printer, and (iii) emitting code corresponding to the action schedule responsible for the activation of the actions in the functionality. This code corresponds to the guards, the schedule FSM, and the priority specification of the CAL actor.
[Figure: CAL2C synthesis flow: the CAL source and the NL network description are each parsed into a CAL AST and an NL AST; transformations and actor parameter elaboration yield a CIL representation from which code generation emits the *.cpp/*.hpp actor sources, while the NL AST drives generation of the top-level *.hpp header.]
Fig. 11 Design flow of the CAL2C utility compiling CAL into C++ code.
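To illustrate step (iii), the sketch below shows the rough shape such a generated action scheduler might take. It is a hypothetical illustration with invented names (Fifo, actor_schedule, action_a, ...), not actual CAL2C output:

  // Hypothetical sketch of generated actor code (names ours, not CAL2C output).
  struct Fifo;                       // software FIFO type from the generated header
  extern Fifo *in0, *out0;           // channels instantiated from the NL description

  static int fsm_state = 0;          // internal state variable of the schedule FSM

  static void action_a(void) { /* action body emitted via CIL */ }
  static void action_b(void) { /* action body emitted via CIL */ }

  // Action scheduler: encodes the guards, the schedule FSM, and the priorities.
  void actor_schedule(void) {
    for (;;) {
      if (fsm_state == 0 /* && guard of action_a */) { action_a(); fsm_state = 1; }
      else if (fsm_state == 1 /* && guard of action_b */) { action_b(); fsm_state = 0; }
      else break;                    // no action enabled: return to the PN scheduler
    }
  }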
The hardware synthesis tool CAL2HDL [20] starts by evaluating the NL dataflow description, deriving the set of actors to synthesize and their parameters. Each actor in this set is synthesized separately, i.e., no scheduling strategy optimization is applied, resulting in a purely self-scheduled actor system. The actor synthesis proceeds from the XML intermediate representation by inserting the actor parameters from the NL dataflow description as constants into the actors, performing constant propagation, and inlining all functions into actions. The resulting intermediate representation is self-contained, i.e., it no longer references any external functions. After this, the intermediate representation is transformed to SSA form. The next step is the generation of a corresponding hardware module for each action by translating the SSA
5 The CAL simulator component of the OpenDF environment can be downloaded from SourceForge [10].
representation to RTL via behavioral compilation, e.g., generating multiplex/demultiplex logic for register access, generating control logic for control flow elements (if-then-else, looping constructs, etc.), and statically scheduling operators where possible. Furthermore, a central scheduler hardware module is instantiated, which is responsible for executing the schedule FSM by evaluating the guards and selecting the highest-priority transition to execute according to the priority specification of the CAL actor. All these modules communicate via RTL signals. Finally, network synthesis is performed by a straightforward transformation of the dataflow graph from the NL dataflow description into a corresponding hardware structure. Edges in the dataflow graph are translated to FIFOs with the sizes specified in the NL file. Furthermore, actor instantiations in NL may carry clock domain annotations: actors in the same clock domain are connected via synchronous FIFOs, while actors in different domains are connected via asynchronous FIFOs.
3.3 SystemCoDesigner
SystemCoDesigner [26] is used as the design methodology for development with SysteMoC [12]. It provides timing simulation for possible architecture mappings as well as software and hardware synthesis backends for SysteMoC.
3.3.1 Overview
The overall design flow of the SystemCoDesigner design methodology is based on (i) a behavioral SysteMoC model of an application, (ii) generation of hardware accelerators for some or all SysteMoC modules using behavioral synthesis, (iii) determination of their performance parameters, like required hardware resources, throughput, and latency, or estimated software execution times, (iv) design space exploration for finding the best candidate architectures, and (v) rapid prototype generation for FPGA platforms. Figure 12 shows the design flow implemented in SystemCoDesigner. The first step in the SystemCoDesigner design flow is to describe the application in the form of a SysteMoC model. Each SysteMoC actor can then be transformed and synthesized into both hardware accelerators and software modules. Whereas the latter is achieved by simple code transformations, the hardware accelerators are built by high-level synthesis using the Forte Cynthesizer [14]. This allows for quick extraction of important performance parameters like the throughput achieved and the area required by a particular hardware accelerator. If Xilinx FPGAs are used as the target platform, hardware resources in the form of flip-flops, look-up tables, and block RAMs are estimated. These values can be used to evaluate different solutions found during automatic design space exploration. The performance information together with the executable specification and a so-called architecture template serve as the input model for design space exploration
[Figure: SystemCoDesigner design flow: the designer models the application as a SystemC/SysteMoC model, selects CPUs, buses, and hardware accelerators from a component library, and specifies the mapping; Forte Cynthesizer performs behavioral synthesis; the resulting exploration model feeds design space exploration, which yields optimized solutions from which an implementation is selected for rapid prototyping.]
Fig. 12 Design flow using SystemCoDesigner: The application is given by a SysteMoC behavioral model. Forte Cynthesizer [14] is used to automatically create hardware accelerators. Design space exploration using multi-objective evolutionary algorithms automatically searches for the best architecture candidates. The entire hardware/software implementation is prototyped automatically for FPGA-based platforms. The SysteMoC behavioral model of the application is used for both automatic design space exploration and creation of hardware accelerators, allowing for rapid prototyping.
(cf. Figure 12). The architecture template is represented by a graph which contains all possible hardware accelerators, processors, memories, and the communication infrastructure, from which the design space exploration has to select those components which are necessary in order to fulfill the user requirements in terms of overall throughput and chip size. The fewer components are allocated, the fewer hardware resources are required. This, however, generally comes along with a reduction in throughput, thus leading to tradeoffs between execution speed and implementation cost. Multi-objective optimization together with symbolic optimization techniques [32] is used to find sets of optimized solutions. From this set of optimized solutions, the designer selects the hardware/software implementation best suited to his or her needs. Once this decision has been made, the last step of the proposed ESL design flow is the rapid prototyping of the corresponding FPGA-based implementation. The implementation is assembled by interconnecting the hardware accelerators and processors of the selected implementation with special communication modules provided in a component library. Furthermore, the program code for all microprocessors is generated by compiling source code derived from the SysteMoC model and linking it with a runtime library for the microprocessor. Finally, the entire platform is compiled into an FPGA bit stream using, e.g., the Xilinx Embedded Development Kit (EDK) [37] tool chain for Xilinx FPGAs.
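The front end of this flow is the SysteMoC model itself. As a reminder of the modeling style, the following is a schematic SysteMoC-style actor showing how ports, the actor FSM, and the functionality fit together; it is a simplified sketch in the spirit of [12], not the verbatim library API:

  // Schematic SysteMoC-style actor (simplified sketch; see [12] for details).
  class Square : public smoc_actor {
  public:
    smoc_port_in<double>  i1;   // input port
    smoc_port_out<double> o1;   // output port
  private:
    smoc_firing_state start;    // actor FSM with a single state
    void sqr() { o1[0] = i1[0] * i1[0]; }  // functionality fired by the transition
  public:
    Square(sc_module_name name) : smoc_actor(name, start) {
      // Transition: one token on i1 and one free place on o1 enable firing sqr().
      start = i1(1) >> o1(1) >> CALL(Square::sqr) >> start;
    }
  };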
3.3.2 Exploiting Static MoCs
Generating the program code for a microprocessor requires a scheduling strategy for the actors mapped to this microprocessor. While static scheduling is solved [4], [18] for models with limited expressiveness, e.g., static dataflow graphs, real-world designs also contain dynamic actors. The most basic strategy is to postpone all scheduling decisions to runtime (dynamic scheduling), with significant scheduling overhead and, hence, reduced system performance as a result. This strategy is clearly suboptimal if it is known that some of the actors exhibit regular communication behavior like SDF and CSDF actors. The integration of finite state machines via the actor FSM formalism can be exploited by the classification methodology presented in [38] to identify the static parts of a design. The scheduling overhead problem could be mitigated by coding the actors of the application at an appropriate level of granularity, i.e., combining enough functionality into one actor that the computation costs dominate the scheduling overhead. This can be thought of as the designer manually making clustering decisions. The appropriate level of granularity must then be chosen by the designer by trading schedule efficiency, which improves bandwidth, against resource consumption (e.g., buffer capacities) and scheduling flexibility, which improves latency. Combining functionality into one actor can also mean that more data has to be produced and consumed atomically, which requires larger buffer capacities and may delay data unnecessarily, thereby degrading latency if multiple computation resources are available for execution of the application. Furthermore, if the mapping of actors to resources is itself part of the synthesis step, as it is for design space exploration, an appropriate level of granularity can no longer be chosen in advance by the designer, as it depends on the mapping itself. In order to obtain efficient schedules, subgraphs of static actors can be pre-scheduled at compile time with those scheduling decisions that can be performed at this time, while scheduling decisions not amenable to compile-time analysis are deferred to runtime. Unfortunately, existing algorithms might result in infeasible schedules or greatly restrict the clustering design space by considering only SDF subgraphs that can be clustered into monolithic SDF actors without introducing deadlocks. The problem stems from the fact that static scheduling adds constraints to the possible schedules of a system; imposing overly strict constraints excludes all feasible schedules, deadlocking the system. In [13], a generalized clustering algorithm is presented which annotates a quasi-static schedule (QSS) via the cluster FSM formalism given in Definition 9. Intuitively, a QSS denotes a schedule in which a relatively large proportion of the scheduling decisions have been made at compile time. Advantages of QSSs include lower runtime overhead and improved predictability compared to general dynamic schedules. Clustering decisions, e.g., which subgraph should be clustered, can be driven by the SystemCoDesigner design methodology presented in the previous section. In the following, we will also call such a scheduled cluster a composite actor, as it can be viewed as a black box where only the cluster FSM influences the communication behavior of the cluster. To exemplify the computation of a quasi-static schedule, we consider the cluster g̃sqr from Figure 10(b) on Page 1055 implementing the ApproxBody
[Figure: (a) dataflow graph g containing the dynamic dataflow actor a4 and a cluster g̃ with actors a1, a2, a3 and cluster ports i1, i2, o1, o2; (b) the CSDF composite actor a replacing g̃, with phase-wise rates [2,0] on i1/o1 and [0,2] on i2/o2.]
Fig. 13 Second example with two feedback loops o1 → i2 and o2 → i1 over the dynamic dataflow actor a4 and a CSDF composite actor a as replacement candidate for cluster g̃.
of Newton’s iterative square root algorithm, This cluster contains two SDF actors a3 |Approx (doing the actual approximation) and a4 |Dup (a simple token duplicator). A valid static schedule for the cluster is a3 .i1 (i)&a4 .o2 (1)/(a3 , a4 ) which checks the prerequisite6 of the availability of one token on the input a3 .i1 and on free place on the output a4 .o2 and fires actor a3 followed by actor a4 if the prerequisite is satisfied. 7 This schedule can be expressed by a cluster FSM with a single state with the a3 .i1 (i)&a4 .o2 (1)/(a3 , a4 ) transition as a self loop. However, the presented static schedule corresponds to a conversion of the cluster into an SDF composite actor. This type of conversion can only express fully static schedules which is insufficient for replacing general static clusters. We will show why composite actors only exhibiting SDF or even CSDF semantics are insufficient to replace a general static cluster as they may introduce deadlocks in the presence of a dynamic dataflow environment. To exemplify this, we consider Figure 13(a) with the cluster g containing the actors a1 , a2 and a3 where i1 , i2 and o1 , o2 are the input and output ports of the cluster. The simplest possible case is the static schedule i1 (2)&i2 (2)&o1 (2)&o2 (2)/ (a1 , a1 , a2 , a3 , a3 ), i.e., check for prerequisite of two tokens on port i1 and i2 as well as two free places on both output ports o1 and o2 , and, if the prerequisite is satisfied, fire actor a1 twice, then a2 , and finally a3 twice. This schedule corresponds to the static schedule of the SDF graph contained in the cluster. Clearly, assuming actor a4 only routes tokens, i.e., the sum of all tokens on channels connected to actor a4 is the same before and after a firing of actor a4 , this schedule is infeasible as always two tokens are missing to activate the cluster, therefore deadlocking the system. One might be tempted to solve the problem by converting the cluster g into a CSDF composite actor a as shown in Figure 13(b). The CSDF composite actor has two phases and thus implements a quasi-static schedule. In the first phase, a positive check for two tokens on port i1 is followed by the execution of the static schedule (a2 ) while the second CSDF phase checks for two tokens on port i2 and executes the static schedule (a1 , a3 , a1 , a3 ). However, also this schedule fails in the 6
6 Note that we use p(n) to denote that at least n tokens/free places must be available on the channel connected to the actor port p.
7 In general, the longer the static scheduling sequence which can be executed by checking a single prerequisite, the less scheduling overhead is imposed by this schedule.
[Figure: (a) Hasse diagram of the iostates (ni1, ni2, no1, no2) of cluster g̃, with equivalent iostates aggregated into one vertex, e.g., (0,0,0,0)/(2,2,2,2)/..., (2,0,2,0)/(4,2,4,2)/..., (2,1,2,1)/(4,3,4,3)/..., and (0,1,0,1)/(2,3,2,3)/..., connected by edges e1 to e5; (b) the cluster FSM of the composite actor a with states s0 to s3 and transitions t1 to t5 annotated with preconditions and schedules, e.g., i1(2)/(a2) and i2(1)/(a1, a3).]
Fig. 14 The composite actor a replacing cluster g̃ can be represented via a cluster finite state machine. The cluster FSM is derived from the Hasse diagram of the iostates of the cluster g̃. The transitions are annotated with preconditions and the static scheduling sequences to execute if a transition is taken.
general case if we assume that the dynamic dataflow actor a4 randomly switches between two forwarding modes: forwarding mode (i) forwards two tokens to port i1 followed by two tokens to port i2, while forwarding mode (ii) forwards one token to port i2, then two tokens to port i1, followed finally by one token to port i2. Clearly, forwarding mode (i) can be satisfied with the suggested CSDF schedule. Unfortunately, forwarding mode (ii) will deadlock with the same schedule. We could try to solve this problem by substituting another CSDF composite actor which, in its first phase, checks for one token on port i2 and executes the schedule (a1, a3), and, in its second phase, checks for two tokens on port i1 and one token on port i2 and executes the schedule (a1, a2, a3). However, this CSDF actor would fail for forwarding mode (i). The above examples demonstrate that, in the general case, a static cluster can be converted into neither an SDF nor a CSDF composite actor without possibly introducing a deadlock into the transformed dataflow graph. However, via the cluster FSM representation, it is possible to represent a quasi-static schedule for all clusters where no unbounded accumulation of tokens on the internal FIFOs of the cluster can result if unbounded output FIFOs are connected to the cluster output ports.8 Via the cluster FSM notation, the cluster g̃ can be replaced by a composite actor a which can be expressed as shown in Figure 14(b). As can be seen, two transitions t1 and t2 leave the start state s0. Whereas t1 requires at least two tokens on input port i1 (denoted by the precondition i1(2)) and executes the static schedule (a2), t2 requires at least one input token on input port i2 (denoted by i2(1)) and executes the static schedule (a1, a3). The key idea for producing a deadlock-free QSS is to assume the worst-case environment for the cluster. In this environment, each output port o ∈ g̃.O is part of a feedback loop to each input port i ∈ g̃.I, and any token produced on an output port o ∈ g̃.O is needed for the activation of an actor a ∈ g̃.A in the same cluster through these feedback loops. In particular, postponing the production of an output token results in a deadlock of the entire system. Hence, the QSS determined by the
8 A more formal definition of this condition can be found in [13].
(ni1, ni2, no1, no2): (0, 0, 0, 0), (0, 1, 0, 1), (2, 0, 2, 0), (2, 1, 2, 1), (2, 2, 2, 2), (2, 3, 2, 3), (4, 2, 4, 2), (4, 3, 4, 3), (4, 4, 4, 4), · · ·
Table 2 Minimal numbers of input tokens ni1/ni2 necessary on the cluster g̃ input ports i1/i2 and maximal numbers of output tokens no1/no2 producible on the cluster output ports o1/o2 with these input tokens. For example, to produce two tokens on output port o2, two tokens are required from both input ports i1 and i2. The maximal numbers of output tokens producible from these input tokens are two output tokens on each of o1 and o2.
clustering algorithm guarantees the production of a maximal number of output tokens from the consumption of a minimal number of input tokens. We represent the completion of the production of a maximal number of output tokens from a minimal number of input tokens by so-called input/output states (iostates) of the static cluster. For the cluster g̃, this relation is given in Table 2, which depicts the iostates of the cluster g̃. For an efficient computation of this table, we refer the interested reader to [13]. We first note that this table is infinite but periodic, as can be seen from the depiction of the iostates in the corresponding Hasse diagram (cf. Figure 14(a)). Equivalent values, i.e., iostates which can be reached from one another by firing all actors of the cluster according to the repetition vector of the cluster, are aggregated into one vertex in the Hasse diagram. This is also the reason why the Hasse diagram contains cycles, as normally these values would all be depicted as unique vertices. As can be seen from Figure 14(a) and Figure 14(b), there is a one-to-one correspondence between the vertices and edges in the Hasse diagram and the states and transitions in the cluster FSM. For a more technical explanation of how the cluster FSM is derived from the Hasse diagram, we again refer the reader to [13].
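To illustrate how a composite actor could execute such a quasi-static schedule at runtime, the following is a minimal sketch of a cluster-FSM interpreter. It is hypothetical C++ with invented names, not the implementation of [13]:

  // Minimal cluster-FSM driven quasi-static scheduler (hypothetical C++).
  #include <cstddef>
  #include <functional>
  #include <utility>
  #include <vector>

  struct Transition {
    std::size_t src, dst;                               // FSM states
    std::function<bool()> precondition;                 // e.g. i1(2): two tokens on i1
    std::vector<std::function<void()>> staticSchedule;  // compile-time firing sequence
  };

  class ClusterFsm {
    std::size_t state = 0;                              // start state s0
    std::vector<Transition> transitions;
  public:
    explicit ClusterFsm(std::vector<Transition> t) : transitions(std::move(t)) {}
    // One scheduling step: take the first enabled outgoing transition, if any.
    bool step() {
      for (const Transition& t : transitions) {
        if (t.src == state && t.precondition()) {
          for (const auto& fire : t.staticSchedule) fire();  // run static sequence
          state = t.dst;
          return true;
        }
      }
      return false;  // composite actor currently not activatable
    }
  };

For the FSM of Figure 14(b), transition t1 would, for example, be encoded with a precondition returning true whenever at least two tokens are available on i1 and with the firing sequence (a2).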
4 Integrated Finite State Machines for Multidimensional Data Flow
The previous sections have described how complex applications can be modeled, analyzed, optimized, and synthesized using different dataflow models of computation. The use of finite state machines permits the description of control flow in order to cover applications whose execution cannot be predicted at compile time but depends on the concrete data values processed during runtime. Whereas various publications were able to demonstrate the benefits of such approaches, they are all restricted to one-dimensional streams of data that consist of a temporal sequence of tokens. On the other hand, there exist many applications that process huge multidimensional arrays. Corresponding examples can be found in the domain of digital image and video processing, in simulations describing heat dissipation or electromagnetic waves, or in computations on very large matrices. Such multidimensional algorithms often require data reordering or sampling of an input array with possibly overlapping windows. When modeling these applications
with one-dimensional dataflow, one possibility consists in considering each array as an atomic token. Unfortunately, such a representation hides important information about data parallelism, data reuse, and parallel access to several data elements. As a consequence, it is insufficient for the design of efficient circuits or the implementation of multicore systems. In order to overcome these difficulties, multidimensional dataflow models of computation have been proposed by recent research activities [33], [36], [16], [24] (see also Chapter 32). They combine the advantages of dataflow-based system design and of polyhedral analysis and synthesis, thus allowing (i) the block-diagram-like specification of complex systems, (ii) the intuitive representation of multidimensional algorithms, (iii) efficient polyhedral analysis in order to determine, for instance, the required memory sizes, and (iv) the generation of high-performance hardware accelerators. Additionally, the use of finite state machines enables the representation of control flow in multidimensional application parts. Furthermore, it permits tight interaction with one-dimensional models of computation, thus offering new possibilities for system modeling, analysis, and synthesis. Consequently, this section aims to give an overview of the principles and achievements obtained by combining multidimensional dataflow modeling with finite state machines.
4.1 Array-OL
Array-OL [16] addresses the specification of regular applications in the form of process graphs communicating multidimensional data. Originally developed by Thomson Marconi Sonar9, Array-OL serves as the model of computation for the Gaspard2 design flow and has also been integrated into the Ptolemy framework [8]. In order to express the full application parallelism, the execution order by which the data is read and written is not specified beyond the minimal partial order resulting from the specified data dependencies. In other words, the model only defines which output pixel depends on which input pixel. For this purpose, applications are modeled as a hierarchical task graph consisting of (i) elementary, (ii) hierarchical, and (iii) repetitive tasks. Elementary tasks are defined by their input and output ports as well as an associated function that transforms input arrays on the input ports into output arrays on the output ports. Hierarchical tasks connect Array-OL sub-tasks into acyclic graphs. Each connection between two tasks corresponds to the exchange of a multidimensional array. Repetitive tasks, as shown exemplarily in Fig. 15, are finally the key element for expressing data parallelism. In principle, they represent special hierarchical tasks in which a sub-task is repeatedly applied to the input arrays. Since the different sub-task invocations are considered independent if not specified otherwise, they can be executed in parallel or sequentially in any order. So-called input tilers specify for
9 Now Thales.
each sub-task invocation the required part of the input arrays. These sub-arrays are also called patterns or tiles. The produced results are combined by output tilers into the resulting array. Each tiler is defined by a paving matrix and a fitting matrix as well as an origin vector $\vec{o}$ (see Fig. 15). The paving matrix $P$ permits calculating, for each task invocation $\vec{i}$, the origin of the input or output pattern:

$$\forall \vec{i},\ \vec{0} \leq \vec{i} < \vec{s}_{repetition}:\quad \vec{o}_{\vec{i}} = \left(\vec{o} + P \cdot \vec{i}\right) \bmod \vec{s}_{array}$$

Here, $\vec{s}_{repetition}$ defines the number of repetitions of the sub-task, and $\vec{s}_{array}$ is the size of the complete array. The modulo operation takes into account that a toroidal coordinate system is assumed; in other words, patterns that leave the array on the right border re-enter it on the left side. Based on the origin $\vec{o}_{\vec{i}}$, the data elements belonging to a certain tile can be calculated by means of the fitting matrix $F$ using the following equation:

$$\forall \vec{j},\ \vec{0} \leq \vec{j} < \vec{s}_{pattern}:\quad \vec{x}_{\vec{j}} = \left(\vec{o}_{\vec{i}} + F \cdot \vec{j}\right) \bmod \vec{s}_{array}$$

$\vec{s}_{pattern}$ defines the pattern size and equals the required array size annotated at each port.
[Figure: repetitive task example: two input arrays (of sizes (9, 8) and (3, 2)) are decomposed by input tilers, each defined by an origin vector and fitting and paving matrices, into patterns for s_repetition = (3, 4) invocations of an elementary component; an output tiler assembles the produced patterns into an output array of size (11, 6).]
Fig. 15 Specification of repetitive tasks via tilers [16]. Each port is annotated with the produced and consumed array size. The number of sub-task invocations is defined by the vector s_repetition. The tiler associated with each communication edge specifies the extraction of sub-arrays.
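The tiler equations above translate directly into index arithmetic. The following is a small illustrative sketch (hypothetical C++ of our own, with matrices stored column-wise) of how a tiler enumerates pattern coordinates:

  // Array-OL tiler index arithmetic (illustrative sketch, names ours).
  #include <cstddef>
  #include <vector>

  using Vec = std::vector<long>;
  using Mat = std::vector<Vec>;  // Mat[k] is the k-th column of the matrix

  // Computes (base + M * idx) mod sArray, component-wise (toroidal wrap-around).
  Vec addMod(Vec base, const Mat& M, const Vec& idx, const Vec& sArray) {
    for (std::size_t d = 0; d < base.size(); ++d) {
      for (std::size_t k = 0; k < idx.size(); ++k) base[d] += M[k][d] * idx[k];
      base[d] = ((base[d] % sArray[d]) + sArray[d]) % sArray[d];
    }
    return base;
  }

  // Origin of the pattern of task invocation i:  o_i = (o + P * i) mod s_array
  Vec patternOrigin(const Vec& o, const Mat& P, const Vec& i, const Vec& sArray) {
    return addMod(o, P, i, sArray);
  }

  // Data element j of that pattern:  x_j = (o_i + F * j) mod s_array
  Vec patternElement(const Vec& oi, const Mat& F, const Vec& j, const Vec& sArray) {
    return addMod(oi, F, j, sArray);
  }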
4.1.1 Mode-Automata
Whereas the above modeling approach permits the description of data-intensive static applications, it is not possible to represent control flow. To this end, references [28], [27] introduce the concept of synchronous mode-automata. The basic idea consists in providing several calculation rules for one and the same output variable; the currently activated mode decides which of them shall effectively be used. For this purpose, a special task type, called a controlled task, can be applied as depicted in Fig. 16. It can select between different sub-tasks depending on a control token
consumed from a control port. Each sub-task must expose the same interface, hence the same input and output ports. If this is not the case, a corresponding hierarchical wrapper must be generated that offers default values for the non-connected input ports and the non-available output ports. The values for the control port are generated by a so-called controller task. It can be described either by a UML activity diagram or by means of a finite state machine in the form of a UML StateChart diagram. In the latter case, each state corresponds to a desired mode of the controlled task. State transitions can depend on external events or on input tokens generated by ordinary data tasks, such that data-dependent decisions are possible. In order to obtain deterministic models, Array-OL applies the synchrony hypothesis, meaning that the system is fast enough to completely process an incoming event before the next one arrives. The granularity at which the controlled task switches between alternative sub-tasks can be adjusted by intelligent creation of task hierarchies and sizing of the control array. By this means, it is possible to ensure that mode switches can only occur after complete processing of an image, a line, a block, or an individual data element. However, independently of the granularity, Array-OL requires a strict dependency relation between a produced output value and one or several input tokens. In other words, independently of the currently selected mode, it can be decided which input data values are required in order to produce a given output value.
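The mode-selection mechanism itself is simple; the following sketch (hypothetical C++, names invented) illustrates a controlled task dispatching between two sub-tasks that share one interface:

  // Controlled task selecting between sub-tasks with identical interfaces,
  // driven by a mode token (illustrative sketch, names ours).
  #include <vector>

  using Array = std::vector<double>;  // stand-in for a multidimensional array
  enum class Mode { A, B };           // modes delivered on the control port

  Array subTaskA(const Array& in) { return in; }  // e.g. one filter variant
  Array subTaskB(const Array& in) { return in; }  // e.g. another filter variant

  // Each invocation consumes one mode token and applies the selected sub-task.
  Array controlledTask(Mode mode, const Array& in) {
    return mode == Mode::A ? subTaskA(in) : subTaskB(in);
  }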
[Figure: a controlled task containing two alternative repetitive sub-task hierarchies; a control port delivering the execution mode (array size s_array,3) selects which sub-task, each with its own tilers, repetition vector s_repetition, and pattern sizes s_pattern,1/s_pattern,2, transforms the input array of size s_array,1 into the output array of size s_array,2.]
Fig. 16 Modeling of the different running modes.
4.1.2 Gaspard2 Design Flow
Gaspard2 [15] is a UML-based system-level design tool for high-performance applications described in Array-OL. As a consequence, it is particularly adapted to regular iterative applications such as image filtering or matrix multiplication. The system specification consists of three parts, namely (i) application modeling, (ii) architecture modeling, and (iii) mapping of tasks to hardware resources. The corresponding modeling semantics are defined by the so-called MARTE profile, which
defines a meta-model describing the properties that an application model must obey. In particular, it enables the description of the parallel computations in the application software part, the parallel structure of its hardware architecture part, and the association of both parts. Furthermore, Gaspard2-specific extensions are provided in order to support, for instance, IP-core based hardware synthesis. Several transformations have been developed in order to automatically transform the input specification into different target languages, such as Lustre, Signal, OpenMP C, SystemC, or VHDL. The mapping specification is such that repetitive tasks can be distributed to multiple resources of a regular architecture. In order to refine the elementary tasks to the desired target technology, manually generated IP cores have to be provided [15]. Alternatively, it is possible to use a fine-grained UML specification that permits, for instance, automatic RTL code generation. The tilers interconnecting the different tasks can be synthesized automatically. In the case of FPGA targets, they are implemented by combinatorial logic extracting all input patterns and merging all output patterns, as depicted in Fig. 17. Consequently, all executions of an elementary task can be performed in parallel. Alternatively, the input tilers can be followed by multiplexers such that the different task executions are performed sequentially in order to reduce the resource requirements [2]. The achievable parallelism that does not exceed the available FPGA resources can be determined by means of an automatic design space exploration. For this purpose, different model types are provided. The black-box view does not consider any details of the FPGA and is only used when a system with more than one FPGA or other heterogeneous components shall be represented. The quantitative view enumerates the available resources like BRAMs and DSPs and helps in finding the optimal parallelization of the considered algorithm. The physical view finally also takes routing resources into account. It is used for optimized mapping by taking the regular data dependencies of the Array-OL specification into account. As a result, a constraint file is generated that can be used by commercial place and route tools.
[Figure: an input tiler extracts all input patterns, which are processed by parallel elementary task executions; an output tiler merges the produced patterns.]
Fig. 17 Communication synthesis in Gaspard2 [2].
4.2 Windowed Data Flow
In addition to Gaspard2 with Array-OL, the SystemCoDesigner methodology discussed in Section 3.3 also provides a means for the description, analysis, and synthesis of multidimensional applications. To this end, it provides the Windowed Data Flow (WDF) model of computation [23] and its static counterpart, Windowed Synchronous Data Flow (WSDF) [24]. Both of them are particularly adapted to image processing applications by covering so-called global, local, and point algorithms. Point algorithms such as a gamma transform only require one input pixel in order to calculate a corresponding output pixel. Local algorithms, on the other hand, need a small subregion of the input image in order to generate the associated output pixel. Often, this kind of image manipulation can be implemented as a sliding window operation, where a window traverses each input pixel. Edge detection or the wavelet and discrete cosine transforms are only some of the typical operations belonging to this category. Global algorithms such as image segmentation finally cannot start operation until the complete input image is available. In order to cover these algorithms, Windowed Data Flow decomposes an application into a graph G = (A, E) of actors a ∈ A, which are interconnected by communication channels as exemplified in Fig. 18. Each of these channels transports a sequence of multidimensional arrays, also called virtual tokens, of size $\vec{v}(e) \in \mathbb{N}^{n_e}$, where $n_e \in \mathbb{N}$ corresponds to the number of array dimensions. In order to permit fine-grained dependency analysis and hardware synthesis, the arrays are, however, not considered as atomic units. Instead, each write operation of the source actor src(e) of an edge e ∈ E generates a so-called effective token of constant size that is specified by a vector $\vec{p}(e) \in \mathbb{N}^{n_e}$. In the case of Fig. 18, the effective tokens consist of one single data element ($\vec{p} = (1, 1)^T$) and are combined into an overall array of five columns and two rows ($\vec{v} = (5, 2)^T$). The arrays so generated are read by the sink actor snk(e) by sampling them with possibly overlapping windows of constant size $\vec{c} \in \mathbb{N}^{n_e}$, as illustrated in Fig. 18(c). A vector $\Delta\vec{c} \in \mathbb{N}^{n_e}$ defines the window movement in each dimension $1 \leq i \leq n_e$. In the case of Fig. 18, the window moves by two data elements in the horizontal direction and by one data element in the vertical direction ($\Delta\vec{c} = (2, 1)^T$). Since sliding window algorithms often require border processing, this is also supported by the Windowed Data Flow model of computation. To this end, it is assumed that each virtual token is surrounded by a border whose size can be configured by means of two vectors $\vec{b}^s(e), \vec{b}^t(e) \in \mathbb{N}^{n_e}$. In the case of Fig. 18, the border encompasses one pixel. These border data elements are, however, not produced by the source actor. Instead, they have a predefined value that can, for instance, be determined via pixel mirroring or by assuming a constant border value. Additionally, if required, initial data elements can be put onto a WDF communication channel in order to allow feedback loops.
[Figure: WDF communication semantics for an edge from src to snk with virtual token size v = (5, 2)^T, effective token size p = (1, 1)^T, window size c, window movement Δc = (2, 1)^T, and a one-pixel border b^s, b^t. (a) Replacement by a multidimensional FIFO with data/wr_ena/full/wr_count and data/rd_ena/empty/rd_count interfaces. (b) WDF edge. (c) Sampling with overlapping windows.]
Fig. 18 WDF communication semantics. Striped data elements correspond to virtually extended border pixels, while data elements generated by each source invocation are shaded with two different levels of gray.
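From these edge parameters, simple bookkeeping formulas follow. As one example, the sketch below computes how many sliding-window positions a sink has per virtual token, assuming the usual relation for a window of size c moved by Δc over the border-extended array; this is our own illustrative C++, not part of the WDF tool flow:

  // Number of sink window positions per virtual token on a WDF edge,
  // assuming windows of size c moved by dc over the array extended by
  // the borders bs and bt (illustrative sketch, names ours).
  #include <cstddef>
  #include <vector>

  struct WdfEdge {
    std::vector<long> v, c, dc, bs, bt;  // virtual token, window, movement, borders
  };

  long windowsPerVirtualToken(const WdfEdge& e) {
    long n = 1;
    for (std::size_t i = 0; i < e.v.size(); ++i)
      n *= (e.v[i] + e.bs[i] + e.bt[i] - e.c[i]) / e.dc[i] + 1;  // positions in dim i
    return n;
  }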
4.2.1 Communication Order
By means of the communication semantics introduced previously, it is possible to describe which sliding window depends on which input effective token. This, however, is not sufficient for the complete specification of image processing applications, since they often require data reordering. A corresponding example can be found, for instance, in the context of a Motion-JPEG decoder when transforming the macro blocks into a format suitable for display devices. Fig. 19(a) illustrates the order in which the macro blocks are generated by the vertical inverse discrete cosine transform (IDCT). It assumes an image with three color components and a size of 8 × 8 pixels. Furthermore, the macro blocks shall consist of 4 × 4 data elements.10 As can be seen, the IDCT generates a complete macro block column at once and outputs the color components sequentially. Display devices, on the other hand, require all three color components in parallel, using a raster scan order instead of the block-based processing. Consequently, the data elements belonging to the first row of the upper right macro block are read before those belonging to the second row of the upper left macro block, whereas for the production order it is just the opposite. In other words, this leads to so-called out-of-order communication. In order to support such scenarios, the Windowed Data Flow model of computation permits the specification of read and write orders by means of so-called firing blocks.
10 Whereas the JPEG standard demands a macro block size of 8 × 8 pixels, the following discussion uses smaller dimensions for illustration purposes.
[Figure: out-of-order communication for the shuffle operation: effective tokens of size p are produced block by block for the blue, green, and red color components following the write order O_src with firing blocks (4, 1, 3)^T and (8, 2, 3)^T, while windows of size c are consumed in raster scan order following the read order O_snk. (a) Write order. (b) Read order.]
Fig. 19 Communication order for the shuffle operation.
Each firing block groups those read or write operations that shall be executed in raster scan order. Alternatively, it can also group smaller firing blocks that shall be executed in raster scan order. Both variants can be covered by means of a sequence of vectors, as exemplified in Fig. 19. $O_{src} = \langle (4, 1, 3)^T, (8, 2, 3)^T \rangle$, for instance, means that the source performs four write operations in dimension $\vec{e}_1$, one in dimension $\vec{e}_2$, and three in dimension $\vec{e}_3$ in raster scan order, as depicted in Fig. 19(a). These firing blocks of
4 × 1 × 3 write operations are then concatenated in raster scan order so as to result in a firing block of 8 × 2 × 3 write operations, which corresponds to the overall image. On the read side, on the other hand, there exists only one firing block, since the data is read in raster scan order without following the macro block structure.
4.2.2 Interaction with a Finite State Machine for Communication Control
Once both the array sampling and the communication order are specified for a WDF communication channel, it can be provided with a FIFO-like interface offering information about the current fill level. The resulting communication primitive is called a multidimensional FIFO. This name shall allude to the similarities with classical one-dimensional FIFOs, although the multidimensional counterpart offers significantly enhanced capabilities, such as (i) out-of-order communication, (ii) virtual border extension, (iii) sampling with possibly overlapping windows, and (iv) access to several data elements in parallel. In order to illustrate the principal idea of the multidimensional FIFO, consider again the communication edge shown in Fig. 18 and assume that both the source and the sink communicate in raster scan order. Since the multidimensional FIFO depicted in Fig. 18(a) knows all WDF edge parameters together with the production and consumption orders $O_{src}^{write}$ and $O_{snk}^{read}$, it can always determine whether the window required next by the sink is already available or not. By means of Fig. 18(c), it is, for instance, obvious that the first sliding window for the sink actor is ready as soon as the source has produced the second pixel of the second row. Moreover, given a current write position of the source actor, it is possible to calculate how many
read operations will be possible before a required data element is missing. This reasoning also holds in the case of out-of-order communication. In other words, the multidimensional FIFO can provide an empty signal and a fill level indication rd_count similar to what designers are used to from one-dimensional FIFOs. Once the buffer size is fixed, the multidimensional FIFO can also determine whether an effective token from the source can be accepted or not. This decision has to take into account which data elements are still required and which effective token will be produced next. In other words, the multidimensional FIFO is able to provide a full flag and a wr_count fill level indication, as usual for one-dimensional FIFOs. By means of this fill level indication, the multidimensional FIFO allows employing fine-grained and memory-efficient scheduling. In other words, it is not necessary to provide a buffer space big enough to store a complete image. Instead, only parts of the image generated by the source must be buffered, which permits an efficient implementation. Furthermore, the multidimensional FIFO can be directly integrated into the communication finite state machine discussed in Definition 6. To this end, an actor can be equipped with both one- and multidimensional ports, connected to one- and multidimensional channels, respectively. In the case of a multidimensional input port im, the predicate im(n) simply translates into the verification of whether the multidimensional input channel provides enough input data for n sliding input windows. Similarly, for a multidimensional output port om, the predicate om(n) checks whether the multidimensional output channel provides enough free memory space to store n effective tokens. In consequence, SystemCoDesigner is one of the first approaches able to describe actors that simultaneously perform one- and multidimensional communication. In particular, it is possible to describe multidimensional algorithms containing complex control flow. Note that, in contrast to Array-OL, WDF is not restricted to switching between different modes requiring the same input and output interface, thus offering new possibilities for the specification of mixed one- and multidimensional algorithms. In order to translate multidimensional applications into hardware implementations, SystemCoDesigner provides an automatic synthesis facility [25] that generates a high-speed VHDL implementation for each multidimensional FIFO. It supports both in-order and out-of-order communication, as well as the writing and reading of several data elements per clock cycle. As a consequence, the multidimensional FIFO enhances the possibility for module reuse, since a sink module does not need to know in which order the source actor will produce the data to process. Pipelined interface definitions enable very high clock frequencies that permit processing images in cinema resolution in real time on a standard FPGA. In addition, it is possible to generate different implementation alternatives that trade achievable throughput against required hardware resources.
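The predicate translation just described can be made concrete with a small sketch; the channel interface below is hypothetical C++ of our own, not the SystemCoDesigner implementation:

  // Guard evaluation for FSM transitions with multidimensional ports
  // (hypothetical C++ interface, names ours).
  struct MdInChannel {
    long completeWindows = 0;                        // fully available sliding windows
    long availableWindows() const { return completeWindows; }
  };
  struct MdOutChannel {
    long freeTokens = 0;                             // storable effective tokens
    long freeEffectiveTokens() const { return freeTokens; }
  };

  // Predicate i_m(n): at least n complete sliding windows can be read.
  bool mdInputPredicate(const MdInChannel& ch, long n) {
    return ch.availableWindows() >= n;
  }
  // Predicate o_m(n): at least n effective tokens can be written.
  bool mdOutputPredicate(const MdOutChannel& ch, long n) {
    return ch.freeEffectiveTokens() >= n;
  }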
4.3 Exploiting Knowledge about the Model of Computation
The separation of functionality, communication, and control obtained by integrating a finite state machine into each actor offers the advantage of covering both data-dependent and static algorithms. Moreover, analysis of the finite state machine [38] enables the tools to identify static application parts. This is important, since this algorithm class offers a huge potential for analysis and optimization. In the case of Windowed Data Flow (WDF), for instance, it is possible to automatically calculate the required communication buffer sizes [22] by first calculating a valid schedule and then determining the resulting memory requirements. To this end, a WDF graph of static actors is translated into a lattice representation, where each actor invocation is associated with a point in a multidimensional grid, as exemplified in Fig. 20. The example belongs to Fig. 18, where the source actor executes 5 × 2 times while the sink only performs 3 × 2 invocations per image. Each of these invocations is represented by a semicircle in Fig. 20(a). The arrows depict which source invocation a sink execution depends on. This information permits shifting the lattice points belonging to different actors in such a way that no actor invocation requires data elements that will only be produced in the future. Fig. 20(b) shows the corresponding result for the lattice given in Fig. 20(a). After the lattice shift, an execution of the lattice points in raster scan order ensures causal data dependencies, because data elements are only read after being written. In other words, the corresponding lattice can be considered a valid schedule. Special care is taken that out-of-order communication is also correctly taken into account. Based on this schedule, the required communication buffer sizes can be calculated by means of integer linear programming, where the usage of efficient graph algorithms allows limiting the computational complexity. Furthermore, it is possible to trade communication memory against the hardware resources required for the individual actors. Comparison with simulative buffer analysis shows that the polyhedral approach is 150 times faster and results in a 40% smaller buffer size. Consequently, this clearly demonstrates the benefits of multidimensional modeling combined with a finite state machine for automatic actor classification while still permitting the description of data-dependent operations.
[Figure: (a) lattice representation; (b) lattice representation shifted in order to obtain a valid schedule.]
Fig. 20 Principle of polyhedral buffer analysis for multidimensional dataflow graphs.
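To convey the flavor of the lattice shift, the following is a strongly simplified sketch (our own C++, not the algorithm of [22]); it computes a component-wise shift of the sink lattice so that every sink invocation lies no earlier, in any dimension, than the source invocations it depends on:

  // Strongly simplified lattice-shift sketch (illustrative only; the real
  // algorithm [22] handles raster-scan ordering and out-of-order communication).
  #include <algorithm>
  #include <cstddef>
  #include <utility>
  #include <vector>

  using Point = std::vector<long>;
  // Each dependency arc pairs a sink invocation with the latest source
  // invocation producing data for it.
  using Dependency = std::pair<Point, Point>;  // {sinkPoint, sourcePoint}

  Point computeShift(const std::vector<Dependency>& deps) {
    const std::size_t dims = deps.front().first.size();  // assumes deps non-empty
    Point shift(dims, 0);
    for (const auto& [snk, src] : deps)
      for (std::size_t d = 0; d < dims; ++d)
        shift[d] = std::max(shift[d], src[d] - snk[d]);  // delay sink far enough
    return shift;  // add to every sink lattice point
  }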
References
1. Balarin, F., Giusto, P., Jurecska, A., Passerone, C., Sentovich, E., Tabbara, B., Chiodo, M., Hsieh, H., Lavagno, L., Sangiovanni-Vincentelli, A., Suzuki, K.: Hardware-Software Co-Design of Embedded Systems: The POLIS Approach. Kluwer Academic Publishers (1997)
2. Beux, S.L.: Un flot de conception pour applications de traitement du signal systématique implémentées sur FPGA à base d'ingénierie dirigée par les modèles. Ph.D. thesis, Université des Sciences et Technologies de Lille (2007)
3. Bhattacharyya, S., Brebner, G., Eker, J., Mattavelli, M., Raulet, M.: OpenDF - A Dataflow Toolset for Reconfigurable Hardware and Multicore Systems. In: First Swedish Workshop on Multi-Core Computing (MCC), Ronneby, Sweden, November 27-28 (2008)
4. Bhattacharyya, S.S., Buck, J.T., Ha, S., Lee, E.A.: Generating Compact Code from Dataflow Specifications of Multirate Signal Processing Algorithms. IEEE Transactions on Circuits and Systems I: Fundamental Theory and Applications 42(3), 138-150 (1995)
5. Buck, J., Ha, S., Lee, E.A., Messerschmitt, D.G.: Ptolemy: A Framework for Simulating and Prototyping Heterogeneous Systems. International Journal in Computer Simulation 4(2), 155-182 (1994)
6. Buck, J.T.: Scheduling Dynamic Dataflow Graphs with Bounded Memory Using the Token Flow Model. Ph.D. dissertation, Technical Report UCB/ERL 93/69, Dept. of EECS, UC Berkeley, Berkeley, CA, USA (1993)
7. Dennis, J.B.: First version of a data flow procedure language. In: Programming Symposium, Proceedings Colloque sur la Programmation, pp. 362-376. Springer-Verlag, London, UK (1974)
8. Dumont, P., Boulet, P.: Another multidimensional synchronous dataflow: Simulating Array-OL in Ptolemy II. Tech. Rep. 5516, Institut National de Recherche en Informatique et en Automatique, Villeneuve d'Ascq Cedex (2005)
9. Eker, J., Janneck, J.W.: CAL language report - language version 1.0. Tech. rep., University of California at Berkeley (2003)
10. Eker, J., Janneck, J.W.: http://opendf.sourceforge.net (2009)
11. Eker, J., Janneck, J.W., Lee, E.A., Liu, J., Liu, X., Ludvig, J., Neuendorffer, S., Sachs, S., Xiong, Y.: Taming heterogeneity - the Ptolemy approach. In: Proceedings of the IEEE (2002)
12. Falk, J., Haubelt, C., Teich, J.: Efficient Representation and Simulation of Model-Based Designs in SystemC. In: Proc. FDL'06, Forum on Design Languages 2006, pp. 129-134, Darmstadt, Germany (2006)
13. Falk, J., Keinert, J., Haubelt, C., Teich, J., Bhattacharyya, S.: A Generalized Static Data Flow Clustering Algorithm for MPSoC Scheduling of Multimedia Applications. In: EMSOFT'08: Proceedings of the 8th ACM International Conference on Embedded Software (2008)
14. Forte Design Systems: http://www.forteds.com (2009)
15. Gamatié, A., Beux, S.L., Piel, É., Etien, A., Ben-Atitallah, R., Marquet, P., Dekeyser, J.L.: A model driven design framework for high performance embedded systems. Tech. Rep. 6614, Institut National de Recherche en Informatique et en Automatique (2008)
16. Gamatié, A., Rutten, E., Yu, H., Boulet, P., Dekeyser, J.L.: Synchronous modeling and analysis of data intensive applications. EURASIP Journal on Embedded Systems 2008(561863), 1-22 (2008)
17. Girault, A., Lee, B., Lee, E.: Hierarchical finite state machines with multiple concurrency models. IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems 18(6), 742-760 (1999)
18. Hsu, C., Bhattacharyya, S.S.: Cycle-Breaking Techniques for Scheduling Synchronous Dataflow Graphs. Tech. Rep. UMIACS-TR-2007-12, Institute for Advanced Computer Studies, University of Maryland at College Park (2007)
19. Janneck, J.W., Miller, I.D., Parlour, D.B., Roquier, G., Wipliez, M., Raulet, M.: Automatic software synthesis of dataflow programs: An MPEG-4 simple profile decoder case study. In: Proc. of the IEEE Workshop on Signal Processing Systems (SiPS'08), pp. 281-286 (2008)
20. Janneck, J.W., Miller, I.D., Parlour, D.B., Roquier, G., Wipliez, M., Raulet, M.: Synthesizing hardware from dataflow programs: An MPEG-4 simple profile decoder case study. In: Proc. of the IEEE Workshop on Signal Processing Systems (SiPS'08), pp. 287-292 (2008)
21. Kahn, G.: The semantics of a simple language for parallel programming. In: IFIP Congress, pp. 471-475 (1974)
22. Keinert, J., Dutta, H., Hannig, F., Haubelt, C., Teich, J.: Model-based synthesis and optimization of static multi-rate image processing algorithms. In: Proceedings of Design, Automation & Test in Europe, pp. 135-140 (2009)
23. Keinert, J., Falk, J., Haubelt, C., Teich, J.: Actor-oriented modeling and simulation of sliding window image processing algorithms. In: Proceedings of the 2007 IEEE/ACM/IFIP Workshop on Embedded Systems for Real-Time Multimedia (ESTIMEDIA 2007), pp. 113-118 (2007)
24. Keinert, J., Haubelt, C., Teich, J.: Modeling and analysis of windowed synchronous algorithms. In: Proceedings of ICASSP 2006, vol. III, pp. 892-895 (2006)
25. Keinert, J., Haubelt, C., Teich, J.: Automatic synthesis of design alternatives for fast stream-based out-of-order communication. In: Proceedings of the 2008 IFIP/IEEE WG 10.5 International Conference on Very Large Scale Integration (VLSI-SoC), pp. 265-270, Rhodes Island, Greece (2008)
26. Keinert, J., Streubühr, M., Schlichter, T., Falk, J., Gladigau, J., Haubelt, C., Teich, J., Meredith, M.: SystemCoDesigner - An Automatic ESL Synthesis Approach by Design Space Exploration and Behavioral Synthesis for Streaming Applications. Transactions on Design Automation of Electronic Systems 14(1), 1-23 (2009)
27. Labbani, O.: Modélisation à haut niveau du contrôle dans des applications de traitement systématique à parallélisme massif. Ph.D. thesis, Université des Sciences et Technologies de Lille, Laboratoire d'Informatique Fondamentale de Lille, Villeneuve d'Ascq (2006)
28. Labbani, O., Dekeyser, J.L., Boulet, P., Rutten, E.: Introduction of control into the Gaspard application UML metamodel: Synchronous approach. Tech. Rep. 5794, Laboratoire d'Informatique Fondamentale de Lille, Université des Sciences et Technologies de Lille, Villeneuve d'Ascq Cedex, France (2005)
29. Lee, E.A.: A denotational semantics for dataflow with firing. Tech. rep., EECS, University of California, Berkeley, CA, USA (1997)
30. Lee, E.A.: Overview of the Ptolemy project. Technical Memorandum UCB/ERL M03/25, Department of Electrical Engineering and Computer Sciences, University of California, Berkeley, CA, USA (2004)
31. Lee, E.A., Messerschmitt, D.G.: Synchronous Data Flow. Proceedings of the IEEE 75(9), 1235-1245 (1987)
32. Lukasiewycz, M., Glaß, M., Haubelt, C., Teich, J., Regler, R., Lang, B.: Concurrent topology and routing optimization in automotive network integration. In: Proceedings of the 2008 ACM/EDAC/IEEE Design Automation Conference (DAC'08), pp. 626-629, Anaheim, USA (2008)
33. Murthy, P.K., Lee, E.A.: Multidimensional synchronous dataflow. IEEE Transactions on Signal Processing 50(7), 2064-2079 (2002)
34. Necula, G.C., McPeak, S., Rahul, S.P., Weimer, W.: CIL: Intermediate language and tools for analysis and transformation of C programs. Lecture Notes in Computer Science 2304, 209-265 (2002)
35. Sangiovanni-Vincentelli, A., Sgroi, M., Lavagno, L.: Formal models for communication-based design. In: Proceedings of CONCUR '00 (2000)
36. Thörnberg, B., Palkovic, M., Hu, Q., Olsson, L., Kjeldsberg, P.G., O'Nils, M., Catthoor, F.: Bit-width constrained memory hierarchy optimization for real-time video systems. IEEE Trans. on CAD of Integrated Circuits and Systems 26(4), 781-800 (2007)
37. Xilinx: Embedded System Tools Reference Manual - Embedded Development Kit EDK 8.1ia (2005)
38. Zebelein, C., Falk, J., Haubelt, C., Teich, J.: Classification of General Data Flow Actors into Known Models of Computation. In: Proc. 6th ACM/IEEE International Conference on Formal Methods and Models for Codesign (MEMOCODE 2008), pp. 119-128, Anaheim, CA, USA (2008)
Index
≺, see lexicographic order 3D DRAM, 521 3D logic-memory integration, 519 ABS, 19 Abstract Decoder Model, 46 accelerator, 454 access cycle, 138 accumulator, 771 Acoustic field visualization, 260 addition, 290 floating-point, 317 multi-operand, 296 parallel prefix, 292 address code generation, 617 Advanced Video Coding, 46 Amplifier, 562 Analog to digital converter, 549 architecture for dynamically reconfigurable embedded systems (ADRES), 468 arithmetic coding, 112 arithmetic shift, 398 artificial deadlock, 984 ASIC, 875 ASIC Implementation, 358 ASIP, 415 ASIP design flow, 417 Auditory scene capture and reproduction, 256 automatic scaling, 707 availability, 125 back-plane, 69 background elimination, 268 background subtraction, 268 balance, 854 barrel shifter, 400 Barrett modular reduction, 164
baseband, 434 BD Live, 33 BDF, 914 beamforming, 246 benchmark, 419 binary translation, 689 Bitstream Syntax Description, 48 Bitstream Syntax Description Language, 46 Bode Sensitivity Integral, 9 Boolean dataflow, 914 Booth encoding, 299 border extensions, 908 boundedness, 979, 981, 983 Broack-Ackerman anomaly, 988 buffer size, 961–964 bunch crossing, 186 C compiler, 576 C language extensions, 577 DSP-C, 577, 578 Embedded C, 577, 578 C standard, 770 CAL, 921 CAL Actor Language, 49 calorimeter, 184 camera calibration, 276 CFDF, 927 Chinese Remainder Theorem, 167 circular buffer, 409, 785 clock and data recovery, 70 cluster assignment, 619 clustered VLIW architecture, 609, 619 coalescing of live ranges, 621 Coarse grain accelerator, 331 coarse-grained reconfigurable array (CGRA), 450 code generation
1078 polyhedral, see polyhedral code generation Code optimizations, 579 Address code optimizations, 580 Offset assignment, 582 Pointer-to-Array Conversion, 580 Code compaction, 595 Control flow optimizations, 582 If-Conversion, 584 Zero overhead loops, 582 Loop optimizations, 585 Loop fusion, 587 Loop invariant code motion, 590 Loop reversal, 587 Loop tiling, 587 Loop unrolling, 586 Loop vectorization, 589 Software pipelining, 591 Memory optimizations, 592 Dual memory bank assignment, 592 Optimizations for code size, 594 Coalescing of repeated instruction sequences, 597 Dead code elimination, 595 Redundant code elimination, 595 Strength reduction, 595 code profiling, 417 coefficient quantization, 726 color processing, 105 Compact Muon Solenoid, 183 compiled simulation, 686 Compiler benchmarks, 578 DSPstone, 578 EEMBC, 578 UTDSP, 578 compiler optimization, 469 Compiler-known functions (CKFs), 577 complete partial order, 970 completeness, 983 compression, 27, 29 consistent, 853 control dependency, 653 control hazard, 609 CORDIC, 319 core functional dataflow, 927 correctness boundedness, 983 completeness, 983 soundness, 983 critical path, 811 critical path scheduling, 624 Current-steering D/A converter, 558 data dependence graph (DDG), 467 data dependency, 653
Index data flow graph, 792 data path, 395 data-centric, 124, 145 data-level parallelism (DLP), 459 Dataflow, 239 dataflow actor network, 916 dataflow analysis, 946–950 dataflow interchange format, 927 deadlock, 981 artificial, 984 local, 985 Decimation filter, 556 delay scaling, 885 delay slot, 411 delay transfer, 885 Delay-locked loop, 569 delayed branch, 411 denotational semantics, 971 dependence analysis, 653 design space exploration, 694, 696 design space exploration (DSE), 475 determinacy, 968, 975, 978, 982, 984 device driver, 783 DICOM, 238 DIF, 927 Differential nonlinearity, 548 digital audio broadcasting, 30 Digital Living Network Alliance, 24 digital living network alliance, 36 digital media player, 36 digital media receivers, 24 digital media server, 36 digital music pipeline, 29 digital photo pipeline, 25 digital photography, 25 digital rights management, 37 digital signal processing, 769 digital signature algorithm, 162 Digital to analog converter, 557 digital video broadcasting, 33 digital video pipeline, 31 digital video recording, 33 Direct digital synthesis, 570 dispersion, 69 distributed arithmetic, 308, 878 distributed smart camera, 267 division, 312 DLNA, 36 domain-specific processor, 403 DRAM, 521 DRAM based cache, 524 DSP, 769 DSP Module, 373 DSP48 Slice, 372
Index duty cycling, 136 DVR, 34 dynamic analysis, 758 dynamic dataflow, 913 dynamic dataflow (DDF), 647 Dynamic Matrix Control, 15 dynamic power consumption, 896 effective Kahn Process Network, 986 effective token, 908 EIDF, 926 electronic dispersion compensation, 71 elliptic curve cryptography, 164 embedded processor, 769 embedded RAMs, 875 emptiness check, 942 Enable-invoke dataflow, 926 energy scavenging, 134 entropy coding, 111 Error correcting codes, 344 event rate, 185 Extensible Markup Language, 54 fairness, 975 fast Fourier Transform, 780 fetch packet, 604 FFT, 780 Field Programmable Gate Array, 365 FIFO, 950–951 Fine grain accelerator, 332 finite impulse response, 771, 780 finite impulse response (FIR), 474 finite word-length effects, 726 finite wordlength, 287 FIR filter, 771, 780, 876 firmware design, 441 fixed point, 771, 780 fixed-point, 973 fixed-point arithmetic, 707 fixed-point arithmetic rules, 710 fixed-point conversion, 711 fixed-point data type, 708 fixed-point format, 709 fixed-point simulation, 728 Flash converter, 551 float valve level control, 5 floating point, 780 flow analysis, 653 folding, 792 FPGA, 226, 239, 365, 691, 875 FPGA board, 168 FPGA definition, 878 FPGA implementation, 343 fractional, 772
fractional format, 708
fractional word-length, 709
Fractional-N synthesis, 568
Frequency synthesis, 567
FU Network Description, 48
Functional Unit Network Language, 45
Functional Units, 46
functions driven by state machines, 746
fuzzy secrets, 172
generalized rectangle, 903
gesture recognition, 275
GPU, 236, 239
GPU computing, 261
group of pictures (GOP), 108
guard bits, 396
H.264/AVC, 113
hardware looping, 411
Harvard architecture, 405, 771
hash functions, 162
HD Radio, 31
head-related transfer function (HRTF), 248
hierarchical models, 748
history, 970
HRTF measurement, 252
Huffman coding, 112
human-computer interaction, 243
IEEE 1451, 127
IEEE 802.15.4, 128
ILP
  parametric, see parametric integer programming
in-band on-channel technology, 31
instruction-level parallelism (ILP), 450, 487, 603
instruction scheduling, 616, 622
instruction selection, 616, 617
instruction set architecture (ISA), 679
instruction set design, 418
instruction set simulator, 440
integer code generation, 720
integer format, 708
integer word-length, 709
integral, 772
Integral nonlinearity, 548
integrated code generation, 630
Integrating converter, 550
inter processor communication graph, 753
inter-iteration constraints, 791
interconnects, 461, 479
intermediate representation (IR), 616, 739
interpreter, 681
intra-iteration constraints, 791
ISA100.11a, 128
ISO, 786
issue packet, 604
issue width, 604
iteration bound, 812
jets, 187
job configuration network, 752
JPEG, 28
K-best detector, 344
Kahn Principle, 979, 980
Kahn process networks (KPN), 646, 968, 977
  effective, 986
Kalman filter, 3, 269
kernel-only loops, 473
L1-norm, 713
labeled transition system, 974
Large Hadron Collider, 182
latency, 607
latency constraints, 745
lattice filter, 876
LDPC decoder, 356
LDPC decoder computation complexity, 356
Level-1 trigger, 186
lexicographic maximum, 941
lexicographic order, 937
Linear Quadratic Gaussian, 18
list scheduling, 623
LMS filter, 876
load-store architecture, 397
local deadlock, 985
localization, 126, 148
logical shift, 398
loop transformations, 470
  loop combination, 471
  loop fusion, 471
  loop interchange, 471
  loop unrolling, 470
LQG, 18
luminosity, 186
LUT, 875
machine model, 758
mapping, 659
Markov chain Monte Carlo, 270
Mason’s gain formula, 792
maximality, 975
maximum
  lexicographic, see lexicographic maximum
maximum likelihood, 223
MDSDF balance equations, 904
medium access control, 136
memory space, 778
microarchitecture, 444
MicroBlaze, 367, 382
microphone array, 244
Microsoft Windows DRM, 38
middleware, 130
MIMO, 15
MIMO channel equalizer, 336
MIMO detector, 339
MIMO detector computational complexity, 342
MIMO receiver, 335
model
  polyhedral, see polyhedral model
Model Predictive Control, 15
modulo routing resource graph (MRRG), 467
modulo scheduling, 469, 626
Montgomery modular reduction, 164
morphological operations, 230
Motion estimation, 529
Moving Picture Experts Group, 44
MPC, 15
MPEG, 31
MPEG-2, 33
MPEG-4 SP Decoder, 58
MPEG-audio, 30
MPI, 238, 239
MPSoC, 487
MPSoC compiler, 640
multi-cycle branch, 411
multi-rate data flow graph, 792
multidimensional delay, 905
multidimensional downsampler, 901
multidimensional synchronous dataflow graphs, 900
multidimensional upsampler, 901
multimedia processing, 476
multiple access memory, 409
multiple inputs and multiple outputs, 15
multiplication, 298
  constant, 305
  fixed-width, 301
  floating-point, 318
Multiview Video Coding (MVC), 118
network equations, 972
Nios, 383
non-functional properties, 758
non-recurring engineering, 876
number of image elements, 943
number representation, 283
  carry-save, 285
  floating-point, 315
  signed-digit, 285
  two’s complement, 284
Nyquist rate, 11
occupation time, 606
operating system, 130
operational semantics, 973, 996
  KPN, 977
  RPN, 996
optical, 69
order
  lexicographic, see lexicographic order
Oversampled A/D converter, 553
PACS, 238
parallel processing, 805
parallel programming model, 645
parallelism, 652
Parameterized cyclo-static dataflow, 925
Parameterized dataflow, 923
parameterized synchronous dataflow, 923
parametric integer programming, 941–942
Pareto, 480
particle filter, 269
particle identification, 205
partitioning, 204, 650
PCSDF, 925
phase-decoupled code generation, 630
phase-locked loop, 70, 567
physically unclonable functions, 172
PID controller, 3, 12
pipelined, 491
pipelining, 792
placement and routing, 466
polyhedral code generation, 944–945
polyhedral model, 585, 938
polyhedral process network, 939
polyhedral relation, 936
polyhedral set, 936
polynomial approximation, 321
power consumption, 896
predication, 472
process network
  polyhedral, see polyhedral process network
  reactive, 990
  timed, 989
process networks, 968
processors, 488
programming model, 501, 643
PSDF, 923
public-key cryptography, 164
Q format, 736
quality of service, 125
quantization, 111, 547
quasi-polynomial, 939
  upper bound, see upper bound on a quasi-polynomial
quasi-static scheduling, 922
R-2R D/A converter, 558
Radon transform, 220, 221
range estimation, 713, 715
ray tracing, 236
reactive behavior, 987
Reactive Process Network, 990
Receding Horizon Control, 15
reconfigurable, 456
Reconfigurable Video Coding, 44
reduced coefficient multiplier, 878
register allocation, 616, 620
relation
  polyhedral, see polyhedral relation
representation of scheduling strategies, 751
reservation table, 606
residue number system, 167
resource allocation, 618
resource reservation, 198
retargetable simulation, 696
retiming, 792
reverberation, 248
ring buffer, 409
rounding, 709
routing, 466
routing resource graph (RRG), 467
RSA cryptosystem, 164
run-time scheduler, 985
RVC-CAL, 45
Sample and hold, 545
Sampling, 544
sampling lattices, 901
sampling matrix, 901
saturation, 709
SBF channel checking, 919
SBF fire-and-exit behavior, 918
SBF function repertoire, 917
SBF function scheduling, 919
SBF transition and selection functions, 917
Scalable Video Coding (SVC), 46, 115
scale-invariant feature transform, 271
SCAS, 4
scheduler
  run-time, 985
scheduling, 467, 659
SDF (synchronous data flow) model, 852
secret-key cryptography, 162
Segmented D/A converter, 559
select actor, 914
self-timed execution, 741
semantics
  denotational, 971
  operational, 973
sensing, 126, 133
sensor clique, 269, 271
sensor groups, 271
serial copy management system, 37
set
  polyhedral, see polyhedral set
SFG, 878
shift optimization, 720
shift register, 881
side channel attacks, 170
SIFT, 271
Sigma-delta A/D converter, 556
Sigma-delta modulator, 554
signal flow graph, 792
signal quantization, 726
signal to quantization noise ratio, 726
Signal-to-noise ratio, 549
SIMD instruction, 605, 617
similarity metric, 233, 234
simulation-based range estimation, 714
single appearance (SA) schedule, 854
single instruction multiple data (SIMD), 458, 460
single-input single-output, 14
single-rate data flow graph, 792
SISO, 14
sliding window, 906
SoC, 486
software pipelining, 469, 624
software-defined radio (SDR), 476
SoPC, 367
soundness, 983
source localization, 244
sparse camera network, 277
spatial prediction, 106
Sphere detector, 343
spherical microphone array, 257
SPI, 239
spilling, 621
square root, 315
stability and control augmentation system, 4
static power consumption, 896
static timing analysis, 742
statically schedulable regions, 922
Stratix, 368
stream-based function model, 917
structural hazard, 606
Sub-ranging A/D converter, 552
Successive-approximation converter, 551
Sum of absolute differences (SAD), 529
superscalar, 490
support matrix, 901
support tile, 901
switch actor, 914
switched capacitance, 897
Switched-capacitor filter, 561
synchronization, 126, 149
synchronous dataflow (SDF), 647
system property intervals, 742
table-based approximation, 323
temporal prediction, 106
Thermometer-code, 548
thread-level parallelism, 488
threads, 488
Three-dimensional (3D) integration, 515, 517
Through silicon via (TSV), 517
TI C62x DSP processor family, 611
timed configuration graphs, 758
timed process network, 989
timing properties, 752
timing-directed simulation, 693
toolchain, 439
trace scheduling, 627
trace-driven simulation, 693
tracking, 275
transformation, 110
translation cache, 690
triggering, 181, 186
truncation, 709
Turbo decoder, 349
two’s complement format, 708
unfolding, 792
Unimodular transformations, 585
  Loop interchange, 585
  Loop reversal, 585, 587
  Loop skewing, 585
upper bound on a quasi-polynomial, 943
variable gain amplifier, 80
verification, 203
very long instruction word (VLIW), 452, 523
video compression, 103
Video encoder, 528
video scalability, 115
Video Tool Library, 48
VIPERS, 385
Virtex, 368
Virtex FPGA, 878
Virtex-5 FPGA, 878
virtual auditory scene, 247
virtual token, 908
Viterbi decoder, 345
VLIW (Very Long Instruction Word) DSP architecture, 603
von Neumann architecture, 405
well-behaved dataflow graphs, 913
window movement parameter, 908
windowed static dataflow, 907
WirelessHART, 128
word-length, 709
word-length optimization, 707, 726, 729
worst-case analysis, 199
WSDF balance equations, 909
zero-overhead looping, 411
ZigBee, 128