Lecture Notes in Computer Science Commenced Publication in 1973 Founding and Former Series Editors: Gerhard Goos, Juris Hartmanis, and Jan van Leeuwen
Editorial Board David Hutchison Lancaster University, UK Takeo Kanade Carnegie Mellon University, Pittsburgh, PA, USA Josef Kittler University of Surrey, Guildford, UK Jon M. Kleinberg Cornell University, Ithaca, NY, USA Alfred Kobsa University of California, Irvine, CA, USA Friedemann Mattern ETH Zurich, Switzerland John C. Mitchell Stanford University, CA, USA Moni Naor Weizmann Institute of Science, Rehovot, Israel Oscar Nierstrasz University of Bern, Switzerland C. Pandu Rangan Indian Institute of Technology, Madras, India Bernhard Steffen University of Dortmund, Germany Madhu Sudan Microsoft Research, Cambridge, MA, USA Demetri Terzopoulos University of California, Los Angeles, CA, USA Doug Tygar University of California, Berkeley, CA, USA Gerhard Weikum Max-Planck Institute of Computer Science, Saarbruecken, Germany
5657
Koen Bertels Nikitas Dimopoulos Cristina Silvano Stephan Wong (Eds.)
Embedded Computer Systems: Architectures, Modeling, and Simulation 9th International Workshop, SAMOS 2009 Samos, Greece, July 20-23, 2009 Proceedings
Volume Editors

Koen Bertels, Stephan Wong
Delft University of Technology
Mekelweg 4, 2628 CD Delft, The Netherlands
E-mail: {k.l.m.bertels,j.s.s.m.wong}@tudelft.nl

Nikitas Dimopoulos
University of Victoria, Department of Electrical and Computer Engineering
P.O. Box 3055, Victoria, BC, V8W 3P6, Canada
E-mail: [email protected]

Cristina Silvano
Politecnico di Milano, Dipartimento di Elettronica e Informazione
P.za Leonardo Da Vinci 32, 20133 Milan, Italy
E-mail: [email protected]
Library of Congress Control Number: 2009930367
CR Subject Classification (1998): C, B
LNCS Sublibrary: SL 1 – Theoretical Computer Science and General Issues
ISSN 0302-9743
ISBN-10 3-642-03137-4 Springer Berlin Heidelberg New York
ISBN-13 978-3-642-03137-3 Springer Berlin Heidelberg New York
This work is subject to copyright. All rights are reserved, whether the whole or part of the material is concerned, specifically the rights of translation, reprinting, re-use of illustrations, recitation, broadcasting, reproduction on microfilms or in any other way, and storage in data banks. Duplication of this publication or parts thereof is permitted only under the provisions of the German Copyright Law of September 9, 1965, in its current version, and permission for use must always be obtained from Springer. Violations are liable to prosecution under the German Copyright Law. springer.com © Springer-Verlag Berlin Heidelberg 2009 Printed in Germany Typesetting: Camera-ready by author, data conversion by Scientific Publishing Services, Chennai, India Printed on acid-free paper SPIN: 12718269 06/3180 543210
Preface
The SAMOS workshop is an international gathering of highly qualified researchers from academia and industry, sharing ideas in a 3-day lively discussion on the quiet and inspiring northern mountainside of the Mediterranean island of Samos. The workshop meeting is one of two co-located events (the other event being the IC-SAMOS). As a tradition, the workshop features presentations in the morning, while after lunch all kinds of informal discussions and nut-cracking gatherings take place. The workshop is unique in the sense that not only solved research problems are presented and discussed, but also (partly) unsolved problems and in-depth topical reviews can be unleashed in the scientific arena. Consequently, the workshop provides the participants with an environment where collaboration rather than competition is fostered.

The SAMOS conference and workshop were established in 2001 by Stamatis Vassiliadis with the goals outlined above in mind, and located on Samos, one of the most beautiful islands of the Aegean. The rich historical and cultural environment of the island, coupled with the intimate atmosphere and the slow pace of a small village by the sea in the middle of the Greek summer, provides a very conducive environment where ideas can be exchanged and shared freely.

SAMOS IX followed the series of workshops started in 2001 with a new, expanded program including three special sessions to discuss challenging research trends. This year, the workshop celebrated its ninth anniversary, and 18 papers were presented, carefully selected out of 52 submissions, resulting in an acceptance rate of 34.6%. Each submission was thoroughly reviewed by at least three reviewers and considered by the international Program Committee during its meeting at Delft in March 2009. Indicative of the wide appeal of the workshop is the fact that the submitted works originated from a wide international community. In more detail, the regular papers came from 19 countries: Austria (2), Belgium (2), Brazil (1), Canada (1), Finland (8), France (6), Germany (5), Greece (3), India (1), Italy (3), Japan (1), Norway (2), Russia (1), Spain (2), Sweden (2), Switzerland (2), The Netherlands (7), UK (1) and USA (2).

Additionally, three special sessions were organized on topics of current interest: (1) “Instruction-set Customization”, (2) “Reconfigurable Computing and Processor Architectures”, and (3) “Mastering Cell BE and GPU Execution Platforms”. Each special session used its own review procedures, and was given the opportunity to include, in addition to some invited papers, relevant works selected from the regular papers submitted to the workshop. Globally, 14 papers were included in the three special sessions. The workshop program also included one keynote speech by Yale Patt from the University of Texas at Austin.
A workshop like this cannot be organized without the help of many people. First of all, we would like to thank the members of the Steering and Program Committees and the external referees for their dedication and diligence in selecting the technical papers. The investment of their time and insight was very much appreciated. Then, we would like to express our sincere gratitude to Karin Vassiliadis for her continuous dedication in organizing the workshop. We also would like to thank Carlo Galuzzi for managing the financial issues, Sebastian Isaza for maintaining the Website and publicizing the event, Zubair Nawaz for managing the submission system, and Dimitris Theodoropoulos and Carlo Galuzzi (again) for preparing the workshop proceedings. We also thank Lidwina Tromp for her continuous effort in the workshop organization. We hope that the attendees enjoyed the SAMOS IX workshop in all its aspects, including many informal discussions and gatherings. We trust that you will find this year’s SAMOS workshop proceedings enriching and interesting. July 2009
Koen Bertels Cristina Silvano Nikitas Dimopoulos Stephan Wong
Organization
General Co-chairs
N. Dimopoulos, University of Victoria, Canada
S. Wong, TU Delft, The Netherlands

Program Co-chairs
K. Bertels, TU Delft, The Netherlands
C. Silvano, Politecnico di Milano, Italy

Special Session Co-chairs
L. Carro, UFRGS, Brazil
E. Deprettere, Leiden University, The Netherlands
C. Galuzzi, TU Delft, The Netherlands
A. Varbanescu, TU Delft, The Netherlands
S. Wong, TU Delft, The Netherlands

Proceedings Co-chairs
C. Galuzzi, TU Delft, The Netherlands
D. Theodoropoulos, TU Delft, The Netherlands

Web and Publicity Chair
S. Isaza, TU Delft, The Netherlands

Submissions Chair
Z. Nawaz, TU Delft, The Netherlands

Finance Chair
C. Galuzzi, TU Delft, The Netherlands

Symposium Board
S. Bhattacharyya, University of Maryland, USA
G.N. Gaydadjiev, TU Delft, The Netherlands
J. Glossner, Sandbridge Technologies, USA
A.D. Pimentel, University of Amsterdam, The Netherlands
J. Takala, Tampere University of Technology, Finland (Chairperson)
Steering Committee
L. Carro, UFRGS, Brazil
E. Deprettere, Leiden University, The Netherlands
N. Dimopoulos, University of Victoria, Canada
T.D. Hämäläinen, Tampere University of Technology, Finland
S. Wong, TU Delft, The Netherlands
Program Committee
C. Basto, NXP, USA
J. Becker, Karlsruhe University, Germany
M. Berekovic, TU Braunschweig, Germany
S. Chakraborty, University of Singapore, Singapore
F. Ferrandi, Politecnico di Milano, Italy
G. Fettweis, TU Dresden, Germany
J. Flich, Technical University of Valencia, Spain
W. Fornaciari, Politecnico di Milano, Italy
P. French, TU Delft, The Netherlands
K. Goossens, NXP, The Netherlands
D. Guevorkian, Nokia, Finland
R. Gupta, University of California Riverside, USA
C. Haubelt, University of Erlangen-Nuremberg, Germany
M. Hännikäinen, Tampere University of Technology, Finland
D. Iancu, Sandbridge Technologies, USA
V. Iordanov, Philips, The Netherlands
H. Jeschke, University of Hannover, Germany
C. Jesshope, University of Amsterdam, The Netherlands
W. Karl, University of Karlsruhe, Germany
M. Katevenis, FORTH-ICS and University of Crete, Greece
A. Koch, TU Darmstadt, Germany
K. Kuchcinski, Lund University, Sweden
D. Liu, Linköping University, Sweden
W. Luk, Imperial College, UK
J. McAllister, Queen's University of Belfast, UK
D. Milojevic, Université Libre de Bruxelles, Belgium
A. Moshovos, University of Toronto, Canada
T. Mudge, University of Michigan, USA
N. Navarro, Technical University of Catalonia, Spain
A. Orailoglu, University of California San Diego, USA
B. Pottier, Université de Bretagne Occidentale, France
K. Rudd, Intel, USA
T. Sauter, Austrian Academy of Sciences, Austria
P.-M. Seidel, SMU University, USA
H. Schröder, University of Dortmund, Germany
F. Silla, Technical University of Valencia, Spain
M. Sima, University of Victoria, Canada
G. Theodoridis, Aristotle University of Thessaloniki, Greece
L. Vintan, University of Sibiu, Romania
Reviewers

Aaltonen, Timo; Agosta, Giovanni; Ali, Zeyshan; Alvarez, Mauricio; Arnold, Oliver; Arpinen, Tero; Azevedo, Arnaldo; Basto, Carlos; Becker, Juergen; Becker, Tobias; Berekovic, Mladen; Blume, Steffen; Bournoutian, Garo; Buchty, Rainer; Chakraborty, Samarjit; Chen, MingJing; Ciobanu, Catalin; Deprettere, Ed; Dimitrakopoulos, Giorgos; Dimopoulos, Nikitas; Ehliar, Andreas; Feng, Min; Ferrandi, Fabrizio; Fettweis, Gerhard; Flatt, Holger; Flich, José; Flynn, Michael; Fornaciari, William; French, Paddy; Galuzzi, Carlo; Gelado, Isaac; Glossner, John; Goossens, Kees; Guevorkian, David; Gupta, Rajiv; Hanke, Mathias; Hännikäinen, Marko; Haubelt, Christian; Iancu, Daniel; Isaza, Sebastian; Jeschke, Hartwig; Jesshope, Chris; Jin, Qiwei; Kakarountas, Athanasios; Karl, Wolfgang; Karlström, Per; Katevenis, Manolis; Kellomäki, Pertti; Klussmann, Heiko; Kuchcinski, Krzysztof; Lee, Kwangyoon; Limberg, Torsten; Liu, Dake; Luk, Wayne; Mamidi, Suman; Martin-Langerwerf, Javier; Martorell, Xavier; McAllister, John; Merino, Julio; Milojevic, Dragomir; Moshovos, Andreas; Mudge, Trevor; Nagarajan, Vijay; Najjar, Walid; Navarro, Nacho; Nikolopoulos, Dimitrios; Nolte, Norman; Norkin, Andrey;
Orailoglu, Alex; Pilato, Christian; Pottier, Bernard; Rudd, Kevin; Sauter, Thilo; Sazeides, Yiannakis; Schröder, Hartmut; Schulte, Michael; Seo, Sangwon; Silla, Federico; Silvano, Cristina; Sima, Mihai; Sima, Vlad-Mihai; Spinean, Bogdan; Takala, Jarmo; Theodoridis, George; Thomas, David; Tian, Chen; Tsoi, Brittle; Vintan, Lucian; Westermann, Peter; Woh, Mark; Wong, Stephan; Wu, Di; Yang, Chengmo; Zaccaria, Vittorio
Table of Contents
Beachnote

What Else Is Broken? Can We Fix It? . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1
   Yale Patt

Architectures for Multimedia

Programmable and Scalable Architecture for Graphics Processing Units . . . 2
   Carlos S. de La Lama, Pekka Jääskeläinen, and Jarmo Takala

The Abstract Streaming Machine: Compile-Time Performance Modelling of Stream Programs on Heterogeneous Multiprocessors . . . . . . . . . . . . . . . . . 12
   Paul M. Carpenter, Alex Ramirez, and Eduard Ayguade

CABAC Accelerator Architectures for Video Compression in Future Multimedia: A Survey . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 24
   Yahya Jan and Lech Jozwiak

Programmable Accelerators for Reconfigurable Video Decoder . . . . . . . . . . 36
   Tero Rintaluoma, Timo Reinikka, Joona Rouvinen, Jani Boutellier, Pekka Jääskeläinen, and Olli Silvén

Scenario Based Mapping of Dynamic Applications on MPSoC: A 3D Graphics Case Study . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 48
   Narasinga Rao Miniskar, Elena Hammari, Satyakiran Munaga, Stylianos Mamagkakis, Per Gunnar Kjeldsberg, and Francky Catthoor

Multiple Description Scalable Coding for Video Transmission over Unreliable Networks . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 58
   Roya Choupani, Stephan Wong, and Mehmet R. Tolun
Multi/Many Cores Architectures

Evaluation of Different Multithreaded and Multicore Processor Configurations for SoPC . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 68
   Sascha Uhrig

Implementing Fine/Medium Grained TLP Support in a Many-Core Architecture . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 78
   Roberto Giorgi, Zdravko Popovic, and Nikola Puzovic

Implementation of W-CDMA Cell Search on a FPGA Based Multi-Processor System-on-Chip with Power Management . . . . . . . . . . . . . 88
   Roberto Airoldi, Fabio Garzia, Tapani Ahonen, Dragomir Milojevic, and Jari Nurmi

A Multiprocessor Architecture with an Omega Network for the Massively Parallel Model GCA . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 98
   Christian Schäck, Wolfgang Heenes, and Rolf Hoffmann

VLSI Architectures Design

Towards Automated FSMD Partitioning for Low Power Using Simulated Annealing . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 108
   Nainesh Agarwal and Nikitas J. Dimopoulos

Radix-4 Recoded Multiplier on Quantum-Dot Cellular Automata . . . . . . . 118
   Ismo Hänninen and Jarmo Takala

Prediction in Dynamic SDRAM Controller Policies . . . . . . . . . . . . . . . . . . . 128
   Ying Xu, Aabhas S. Agarwal, and Brian T. Davis

Inversion/Non-inversion Implementation for an 11,424 Gate-Count Dynamic Optically Reconfigurable Gate Array VLSI . . . . . . . . . . . . . . . . . . 139
   Shinichi Kato and Minoru Watanabe
Architecture Modeling and Exploration Tools

Visualization of Computer Architecture Simulation Data for System-Level Design Space Exploration . . . . . . . . . . . . . . . . . . . . . . . . . . . . 149
   Toktam Taghavi, Mark Thompson, and Andy D. Pimentel

Modeling Scalable SIMD DSPs in LISA . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 161
   Peter Westermann and Hartmut Schröder

NoGAP: A Micro Architecture Construction Framework . . . . . . . . . . . . . . . 171
   Per Karlström and Dake Liu

A Comparison of NoTA and GENESYS . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 181
   Bernhard Huber and Roman Obermaisser

Special Session 1: Instruction-Set Customization

Introduction to Instruction-Set Customization . . . . . . . . . . . . . . . . . . . . . . . 193
   Carlo Galuzzi

Constraint-Driven Identification of Application Specific Instructions in the DURASE System . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 194
   Kevin Martin, Christophe Wolinski, Krzysztof Kuchcinski, Antoine Floch, and François Charot

A Generic Design Flow for Application Specific Processor Customization through Instruction-Set Extensions (ISEs) . . . . . . . . . . . . . . . . . . . . . . . . . . . 204
   Kingshuk Karuri, Rainer Leupers, Gerd Ascheid, and Heinrich Meyr

Runtime Adaptive Extensible Embedded Processors – A Survey . . . . . . . . 215
   Huynh Phung Huynh and Tulika Mitra
Special Session 2: The Future of Reconfigurable Computing and Processor Architectures

Introduction to the Future of Reconfigurable Computing and Processor Architectures . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 226
   Luigi Carro and Stephan Wong

An Embrace-and-Extend Approach to Managing the Complexity of Future Heterogeneous Systems . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 227
   Rainer Buchty, Mario Kicherer, David Kramer, and Wolfgang Karl

Applying the Stream-Based Computing Model to Design Hardware Accelerators: A Case Study . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 237
   Frederico Pratas and Leonel Sousa

Reconfigurable Multicore Server Processors for Low Power Operation . . . 247
   Ronald G. Dreslinski, David Fick, David Blaauw, Dennis Sylvester, and Trevor Mudge

Reconfigurable Computing in the New Age of Parallelism . . . . . . . . . . . . . . 255
   Walid Najjar and Jason Villarreal

Reconfigurable Multithreading Architectures: A Survey . . . . . . . . . . . . . . . 263
   Pavel G. Zaykov, Georgi K. Kuzmanov, and Georgi N. Gaydadjiev

Special Session 3: Mastering Cell BE and GPU Execution Platforms

Introduction to Mastering Cell BE and GPU Execution Platforms . . . . . . 275
   Ed Deprettere and Ana L. Varbanescu

Efficient Mapping of Multiresolution Image Filtering Algorithms on Graphics Processors . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 277
   Richard Membarth, Frank Hannig, Hritam Dutta, and Jürgen Teich

Implementing Blocked Sparse Matrix-Vector Multiplication on NVIDIA GPUs . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 289
   Alexander Monakov and Arutyun Avetisyan

Experiences with Cell-BE and GPU for Tomography . . . . . . . . . . . . . . . . . . 298
   Sander van der Maar, Kees Joost Batenburg, and Jan Sijbers

Realizing FIFO Communication When Mapping Kahn Process Networks onto the Cell . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 308
   Dmitry Nadezhkin, Sjoerd Meijer, Todor Stefanov, and Ed Deprettere

Exploiting Locality on the Cell/B.E. through Bypassing . . . . . . . . . . . . . . . 318
   Pieter Bellens, Josep M. Perez, Rosa M. Badia, and Jesus Labarta

Exploiting the Cell/BE Architecture with the StarPU Unified Runtime System . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 329
   Cédric Augonnet, Samuel Thibault, Raymond Namyst, and Maik Nijhuis

Author Index . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 341
What Else Is Broken? Can We Fix It?

Yale Patt
The University of Texas at Austin
Abstract. The founder and soul of this conference, Professor Stamatis Vassiliadis, always wanted a Keynote on the beach. A keynote without PowerPoint, air conditioning, and all the other usual comforts of keynotes, comforts both for the speaker and for the audience. After all, the great thinkers of this ancient land did their thinking, teaching, and arguing without PowerPoint and without air conditioning. But they were they and we are we, and no sane SAMOS keynote speaker would put himself in the same league with those masters. Nonetheless, Stamatis wanted it, and I never found it easy to say no to Stamatis, so last year at SAMOS VIII, I agreed to give a Keynote on the Beach. It has been subsequently relabeled The Beachnote, and I have been asked to do it again. The question of course is what subject to explore in this setting, where the sound of the speaker’s voice competes with the sounds of the waves banging against the shore, where the image of the speaker’s gestures competes with the image of the blue sky, bright sun, and hills of Samos. I decided last summer to choose a meta-topic, rather than a hard core technical subject: ”Is it broken,” with particular emphasis on professors – are they ready to teach, are they ready to do research, and students – are they learning, is their education preparing them for what is needed after they graduate. My sense is that for this environment, a meta-topic is the right model, and so I propose to visit it again. For example: our conferences and journals. Are they broken? Can we fix them? Somewhat more technical: The interface between the software that people write to solve problems and the hardware that has to run that software. Is it broken? Can we fix it? These are just examples of some of the things we might explore in this year’s Beachnote. As I said last year, I will welcome other suggestions from the audience as to what they think is broken. My hope is to have us all engaged in identifying and discussing some of the fundamental problems that plague our community.
Programmable and Scalable Architecture for Graphics Processing Units

Carlos S. de La Lama 1, Pekka Jääskeläinen 2, and Jarmo Takala 2

1 Universidad Rey Juan Carlos, Department of Computer Architecture, Computer Science and Artificial Intelligence, C/ Tulipán s/n, 28933 Móstoles, Madrid, Spain
[email protected]
2 Tampere University of Technology, Department of Computer Systems, Korkeakoulunkatu 10, 33720 Tampere, Finland
[email protected], [email protected]
Abstract. Graphics processing is an application area with a high level of parallelism at the data level and at the task level. Therefore, graphics processing units (GPUs) are often implemented as multiprocessing systems with high-performance floating point processing and application-specific hardware stages for maximizing the graphics throughput. In this paper we evaluate the suitability of Transport Triggered Architectures (TTA) as a basis for implementing GPUs. TTA improves scalability over the traditional VLIW-style architectures, making it interesting for computationally intensive applications. We show that TTA provides high floating point processing performance while allowing more programming freedom than vector processors. Finally, one of the main features of the presented TTA-based GPU design is its fully programmable architecture, making it a suitable target for the general-purpose computing on GPU APIs which have become popular in recent years.

Keywords: GPU, GPGPU, TTA, VLIW, LLVM, GLSL, OpenGL.
1 Introduction
3D graphics processing can be seen as a compound of sequential stages applied to a set of input data. Commonly, graphics processing systems are abstracted as so-called graphics pipelines, with only minor differences between the various existing APIs and implementations. Therefore, stream processing [1], where a number of kernels (user defined or fixed) are applied to a stream of data of the same type, is often thought of as the computing paradigm of graphics processing units. Early 3D accelerating GPUs were essentially designed to perform a fixed set of operations in an effective manner, with no capabilities to customize this process [2]. Later, some vendors started to add programmability to their GPU
products, leading to standardization of “shading languages”. Both of the major graphics APIs (OpenGL and DirectX) proposed their own implementation of such languages. DirectX introduced the High Level Shading Language [3], while OpenGL defined the OpenGL Shading Language (GLSL) [4], first supported as an optional extension to OpenGL 1.4 and later becoming part of the standard in OpenGL 2.0. GLSL is similar to the standard C language, but includes some additional data types for vectors and matrices, and library functions to perform the common operations with these data types.

Programs written in GLSL (called shaders) can customize the behavior of two specific stages of the OpenGL graphics pipeline (dashed boxes in Figure 1) [5]. Vertex shaders are applied to the input points defining the vertices of the graphics primitives (such as points, lines or polygons) in a 3D coordinate system called model space. Depending on the type of primitive being drawn, the rasterizer then generates a number of visible points between the transformed vertices. These new points are called fragments. Each drawn primitive usually produces as many fragments as there are covered pixels on the screen. The rasterizer interpolates several attributes, such as color or texture coordinates, between vertices to find the corresponding value (called varying) for each fragment, and the programmable fragment shader can postprocess and modify those values.

The move to allow programming of parts of the graphics pipeline led to GPU vendors providing custom APIs for using their GPUs for more general purpose computing (GPGPU) [6], extending the application domain of GPUs to a wide range of programs with highly parallelizable computation. Finally, at the end of 2008, a vendor-neutral API for programming heterogeneous platforms (which can also include GPU-like resources) was standardized. The OpenCL standard [7] was welcomed by the GPGPU community as a generic alternative to platform-specific GPGPU APIs such as NVIDIA's CUDA [8].

This paper presents work in progress on the design of a programmable and scalable GPU architecture based on the Transport Triggered Architecture (TTA), a class of VLIW architectures. The proposed architecture, which we call TTAGPU, is fully programmable and implements all of the graphics pipeline in software. TTAGPU can be scaled at the instruction and task level to produce GPUs with varying size/performance ratios, enabling its use in both embedded and desktop systems. Furthermore, the full programmability allows it to be adapted for the GPGPU style of computation and, for example, to support the OpenCL API. While common practice in GPU design goes through the intensive use of data-parallel models,
(Vertices → Vertex Shader → Transformed vertices → Rasterizer → Fragments → Fragment Shader → Colored fragments → Framebuffer → To screen)
Fig. 1. Simplified view of the customizable OpenGL pipeline
our approach tries to exploit parallelism at instruction level, thus avoiding the programmability penalty caused by SIMD operations. The rest of the paper is organized as follows. Section 2 discusses briefly the related work, Section 3 describes the main points in the TTAGPU design, Section 4 provides some preliminary results on the floating point scalability of the architecture, and Section 5 concludes the paper and discusses the future directions.
2 Related Work
The first generation of programmable GPUs included specialized hardware for vertex processing and fragment processing as separate components, together with texture mapping units and rasterizers, set up in a multi-way stream configuration to exploit the inherent parallelism present in 3D graphics algorithms. As modern applications needed to customize the graphics processing to a higher degree, it became obvious that such heterogeneous architectures were not the ideal choice. Therefore, with the appearance of the unified shader model in 2007 [9], the differences between vertex and fragment shaders began to disappear. Newer devices have a number of unified shaders that can do the same arithmetic operations and access the same buffers (although some differences in the instruction sets are still present). This provides better programmability of the graphics pipeline, while the fixed hardware on critical parts (like the rasterizer) ensures high performance. However, the stream-like connectivity between computing resources still limits the customization of the processing algorithm. The major GPU vendors (NVIDIA and ATI) follow this approach in their latest products [10,11].

The performance of the unified shader is evaluated in [12] by means of implementing a generic GPU microarchitecture and simulating it. The main conclusion of the paper is that although graphical performance improves only marginally with respect to non-unified shader architectures, it has real benefits in terms of efficiency per area. The shader performance analysis in the paper uses shaders implemented with the OpenGL ARB assembly-like low-level language. Although already approved by the Architecture Review Board, this is still an extension to the OpenGL standard, while GLSL is already part of it, which is why we have used GLSL as our shader program input language. Furthermore, new trends in parallel non-graphical computation on GPUs are geared towards using high-level languages.

A different approach to achieving GPU flexibility is being proposed by Intel with its Larrabee processor [13]. Instead of starting from a traditional GPU architecture, they propose an x86-compatible device with additional floating point units for enhanced arithmetic performance. Larrabee includes very little specific hardware, the most notable exception being the texture mapping unit. Instead, the graphics pipeline is implemented in software, making it easier to modify and customize. Larrabee is to be deployed as a “many-core” solution, with 64 or more cores. Each core comprises a 512-bit vector FPU capable of 16 simultaneous single-precision floating point operations.
3 TTAGPU Architecture
The goal of the TTAGPU design is to implement an OpenGL-compliant graphics API which is accelerated with a customized TTA processor, supports the programming of the graphics pipeline as described in the OpenGL 2.1 specification [14] (GLSL-coded shaders) and allows high-level language programmability, especially with support for the OpenCL API in mind. Therefore, the design follows a software-based approach, similar to Larrabee, with additional flexibility provided through programmability. However, as it is not tied to the x86 architecture, the datapath resource set can be customized more freely to accelerate the GPU application domain.

3.1 Transport Triggered Architectures
VLIWs are considered interesting processor alternatives for applications with high requirements for data processing performance [15] and with limited control flow, such as graphics processing. Transport Triggered Architectures (TTA) is a modular processor architecture template with high resemblance to VLIW architectures. The main difference between TTAs and VLIWs can be seen in how they are programmed: instead of defining which operations are started in which function units (FU) at which instruction cycles, TTA programs are defined as data transports between register files (RF) and FUs of the datapath. The operations are started as side-effects of writing operand data to the “triggering port” of the FU. Figure 2 presents a simple example TTA processor [16].

Fig. 2. Example of a TTA processor

The programming model of VLIW imposes limitations for scaling the number of FUs in the datapath. Upscaling the number of FUs has been problematic in VLIWs due to the need to include as many write and read ports in the RFs as there are FU operations potentially completed and started at the same time. Additional ports increase the RF complexity, resulting in larger area and critical path delay. Also, adding an FU to the VLIW datapath requires potentially new bypassing paths to be added from the FU's output ports to the input ports of the other FUs in the datapath, which increases the interconnection network complexity. Thanks to its programmer-visible interconnection network, TTA datapath
can support more FUs with simpler RFs [17]. Because the scheduling of data transports between datapath units is programmer-defined, there is no obligation to scale the number of RF ports according to the number of FUs [18]. In addition, the datapath connectivity can be tailored according to the application at hand, adding only the bypassing paths that benefit the application the most.

In order to support fast automated design of TTA processors, a toolset project called TTA-based Codesign Environment (TCE) was started in 2003 at Tampere University of Technology [19]. TCE provides a full design flow from software written in C code down to a parallel TTA program image and a VHDL implementation of the processor. However, as TTAGPU was evaluated only at the architectural level for this paper, the most important tools used in the design were its cycle-accurate instruction set simulator and the compiler, both of which automatically adapt to the set of machine resources in the designed processors.

Because TTA is a statically scheduled architecture with a high level of detail exposed to the programmer, the runtime efficiency of the end results produced with the design toolset depends heavily on the quality of the compiler. TCE uses the LLVM Compiler Infrastructure [20] as the backbone for its compiler tool chain (later referred to as 'tcecc'), and thus benefits from its global optimizations such as aggressive dead code elimination and link-time inlining. In addition, the TCE code generator includes an efficient instruction scheduler with TTA-specific optimizations, and a register allocator optimized to produce better instruction-level parallelism for the post-pass scheduler.

3.2 Scaling on the Instruction Level
The TTAGPU OpenGL implementation is structured into two clearly separated parts. The first part is the API layer, which is meant to run on the main CPU in the real scenario. It communicates with the GPU by a command FIFO, each command having a maximum of 4 floating-point arguments. The second part is the software implementation of the OpenGL graphics pipeline running in the TTA. We have tried to minimize the number of buffers to make the pipeline stages as long as possible, as this gives the compiler more optimization opportunities.

The OpenGL graphics pipeline code includes both the software implementation of the pipeline routines itself and the user-defined shader programs defined with GLSL. For the graphics pipeline code, we have so far implemented a limited version capable of doing simple rendering, allowing us to link against real OpenGL demos with no application code modification. Because tcecc already supports compilation of C and C++, it is possible to compile the user-defined GLSL code with little additional effort by using C++ operator overloading and a simple preprocessor, and merge the shader code with the C implementation of the graphics pipeline. Compiling GLSL code together with the C-based implementation of the graphics pipeline allows user-provided shaders to override the programmable parts, while providing an additional advantage of global optimizations and code specialization that is done after the final program linking. For example, if a custom shader program does not use a result produced by some of the fixed functionality of the
for i = 1...16 do
    f = produce_fragment()          // the rasterizer code
    f = glsl_fragment_processor(f)
    write_to_framebuffer_fifo(f)

Fig. 3. Pseudocode of the combined rasterizer/fragment shader loop body
graphics pipeline code, the pipeline code will be removed by the dead code elimination optimization. That is, certain types of fragment shader programs compiled with the pipeline code can lead to higher rasterizer performance.

Preliminary profiling of the current software graphics pipeline implementation showed that the bottleneck so far is in the rasterizer and, depending on its complexity, in the user-defined fragment shader. This makes sense as the data density in the pipeline explodes after rasterizing, as usually a high number of fragments is generated by each primitive. For example, a line can be defined using two vertices, from which the rasterizer produces enough fragments to represent all the visible pixels between the two vertices. Thus, in TTAGPU we concentrated on optimizing the rasterizer stage by creating a specialized rasterizer loop which processes 16 fragments at a time. The combined rasterizer/custom fragment shader loop (pseudocode shown in Fig. 3) is fully unrolled by the compiler, implementing effectively a combined 16-way rasterizer and fragment processor in software. The aggressive procedure inlining converts the fully unrolled loop to a single big basic block with the actual rasterizer code producing a fragment and the user-defined fragment shader processing it, without the need for large buffers between the stages. In addition, the unrolled loop bodies can often be made completely independent from each other, improving the potential for a high level of ILP exposed to the instruction scheduler. In order to avoid extra control flow in the loop, which makes it harder to extract instruction-level parallelism (ILP) statically, we always process 16 fragments at a time “speculatively” and discard the possible extra fragments at the end of computation.
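To make the structure of this loop concrete, the following C sketch writes out one batch of the combined rasterizer/fragment shader. The fragment type, the helper functions and the validity flag used to discard surplus fragments are assumptions of this sketch, not code taken from TTAGPU.

    /* Sketch of the 16-fragment batch of Fig. 3 (assumed types and helpers). */
    typedef struct { float x, y, depth; float r, g, b, a; } fragment_t;

    fragment_t produce_fragment(int i);                /* rasterizer: i-th fragment of the batch */
    fragment_t glsl_fragment_processor(fragment_t f);  /* compiled user-defined fragment shader  */
    void write_to_framebuffer_fifo(fragment_t f, int valid); /* custom FIFO write operation      */

    void rasterize_batch(int fragments_left)
    {
        /* Always run all 16 iterations "speculatively"; surplus fragments are
         * marked invalid instead of being skipped, so the body has no
         * data-dependent control flow and the iterations stay independent. */
        for (int i = 0; i < 16; i++) {                 /* fully unrolled by the compiler */
            fragment_t f = produce_fragment(i);
            f = glsl_fragment_processor(f);
            write_to_framebuffer_fifo(f, i < fragments_left);
        }
    }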
3.3 Scaling on the Task Level
In order to achieve scalability on the task level, we placed hardware-based FIFO buffers at certain points in the software graphics pipeline. The idea is to add “frontiers” at suitable positions of the pipeline allowing multiple processors to produce and process the FIFO items arbitrarily. It should be noted, however, that it is completely possible in this configuration that the same processor produces and processes the items in the FIFOs. In this type of single core setting, the hardware FIFO merely reduces memory accesses required to pass data between the graphics pipeline stages. The guidelines followed when placing these buffers were: 1) separate stages with different data densities, 2) place the FIFOs in such positions that the potential for ILP at each stage is as high as possible, and 3) compile the user-defined shader code and related graphics pipeline code together to maximize code specialization and ILP.
(OpenGL API and OpenGL state on the driver side feed a command FIFO; on the GPU side, vertex processing and clipping feed a vertex FIFO, rasterization/fragment processing feeds a fragment FIFO, and framebuffer writing consumes it)
Fig. 4. High-level software structure
These three points are met by placing two hardware FIFOs in the pipeline. One is placed after the vertex processing, as the number of processed vertices needed for primitive rasterizing changes with the different rendering modes (points, lines or polygons), resulting in varying data density. This FIFO allows vertex processing to proceed until enough vertices for primitive processing are available. It also serves as an entry point for new vertices generated during clipping. The second FIFO is placed after fragment processing, and before the framebuffer writing stage. Framebuffer writing has some additional processing to perform (ownership test, blending, etc.) that cannot be performed completely on a per-fragment basis as it depends on the results of previous framebuffer writes. This FIFO allows us to create the highly parallelizable basic block performing rasterization and fragment processing with no memory writes, as the frame buffer writing is done with a custom operation accessing the FIFO.

The hardware-supported FIFOs have a set of status registers that can be used to poll for FIFO emptiness and fullness. This enables us to use lightweight cooperative multithreading to hide the FIFO waiting time with processing of elements from the other FIFOs. The software implementation structure is shown in Figure 4. The clean isolation between stages allows the system to connect sets of processors that access the FIFO elements as producers and/or consumers, making the system flexible and scalable at the task level. Scaling at the task level can be done simply by adding either identical TTAs or even processors with completely different architectures to the system. The only requirement placed on the added processors is access to the hardware FIFOs.
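As an illustration of this cooperative scheme, consider the C sketch below. The status-polling functions and stage routines are hypothetical names introduced here, not part of the TTAGPU code; they stand in for reads of the FIFO status registers and for the pipeline stages of Figure 4.

    /* Cooperative worker loop (illustrative sketch with assumed helper names). */
    #include <stdbool.h>

    bool vertex_fifo_has_space(void);   /* polls a FIFO status register (assumed name)      */
    bool fragment_fifo_has_data(void);  /* polls a FIFO status register (assumed name)      */
    void run_vertex_stage(void);        /* produce vertices/primitives into the vertex FIFO */
    void run_framebuffer_stage(void);   /* drain colored fragments from the fragment FIFO   */
    bool frame_finished(void);

    void pipeline_worker(void)
    {
        /* Instead of blocking on one FIFO, the core polls the status registers
         * and works on whichever stage is ready, hiding the FIFO waiting time. */
        while (!frame_finished()) {
            if (vertex_fifo_has_space())
                run_vertex_stage();
            if (fragment_fifo_has_data())
                run_framebuffer_stage();
        }
    }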
4 Results
In order to evaluate the ILP scalability of the TTAGPU in the combined rasterizer/fragment processor loop, we implemented a simple example OpenGL
Table 1. Resources in the TTAGPU variations

resource                      1 FPU   2 FPU   4 FPU   8 FPU   16 FPU
floating point units            1       2       4       8       16
32 bit x 32 register files      1       2       4       8       16
1 bit boolean registers         2       4       8       16      32
transport buses                 3       6       12      24      48
integer ALUs 32 bit             1       1       1       1       1
load-store units 32 bit         1       1       1       1       1
shifters                        1       1       1       1       1
application that renders a number of lines randomly to the screen and colors them with a simple fragment shader. The goal of this experiment was to see how well single TTAGPU cores scale at the instruction level only by adding multiples of resource sets to the architecture and recompiling the software using tcecc. The resource set we used for scaling included a single FPU, three transport buses, and a register file with 32 general-purpose 32-bit registers. The resources in the benchmarked TTAGPU variations are listed in Table 1.

In order to produce realistic cycle counts for floating point code, we used the pipeline model of the MIPS R4000 floating point units, a description of which was available in the literature [21]. The unit includes eight floating-point operations that share eleven different pipeline resources. However, our benchmark used only addition, division, multiplication and comparison of floating point values. The benchmark was executed using the TCE cycle-accurate processor architecture simulator for TTAGPUs with the different numbers of resource sets.

Figure 5 shows the speedup improvements in the unrolled rasterizer loop from just adding multiples of the “scaling resource sets” to the machine and recompiling the code. This figure indicates that the ILP scalability of the heavily utilized rasterizer loop is almost linear thanks to the aggressive global optimizations and a register allocator that avoids the reuse of registers as much as possible, reducing the number of false dependencies limiting the parallelization between the
(speedup over one resource set: 1.0x, 1.8x, 3.8x, 7.2x and 11.5x for 1, 2, 4, 8 and 16 FPU resource sets, respectively)
Fig. 5. Scalability of the rasterizer loop with different number of floating point resources
loop iterations. The scaling gets worse when getting closer to the 16-FPU version because of a hard limit of about 500 general-purpose registers in our compiler, and because the loop was implemented with only 16 iterations. With a larger iteration count there would be more operations with which to hide the latencies of the previous iterations.
5 Conclusions
In this paper we have proposed a mainly software-based implementation of a graphics processing unit based on the scalable TTA architecture. We have shown TTA is an interesting alternative to be used for applications where high data processing performance is required, as is the case with GPUs. TTA provides improved scalability at the instruction level in comparison to VLIWs, due to its programmer-visible interconnection network. The scalability of the proposed TTAGPU on both the task and the instruction level makes the system an interesting platform also to be considered for other data parallel applications designed to be executed on GPU-type platforms. Evaluating the proposed TTAGPU platform for supporting applications written using the OpenCL 1.0 standard [7] is being worked on. Additional future work includes completing the OpenGL API implementation, evaluating the multi-core performance of TTAGPU and implementing an actual hardware prototype. Acknowledgments. This research was partially funded by the Academy of Finland, the Nokia Foundation and Finnish Center for International Mobility (CIMO).
References

1. Stephens, R.: A survey of stream processing. Acta Informatica 34(7), 491–541 (1997)
2. Crow, T.S.: Evolution of the Graphical Processing Unit. Master's thesis, University of Nevada, Reno, NV (December 2004)
3. St-Laurent, S.: The Complete Effect and HLSL Guide. Paradoxal Press (2005)
4. Kessenich, J.: The OpenGL Shading Language. 3DLabs, Inc. (2006)
5. Luebke, D., Humphreys, G.: How GPUs work. Computer 40(2), 96–100 (2007)
6. Owens, J.D., Luebke, D., Govindaraju, N., Harris, M., Krüger, J., Lefohn, A.E., Purcell, T.J.: A Survey of General-Purpose Computation on Graphics Hardware. Computer Graphics Forum 26(1), 80–113 (2007)
7. Khronos Group: OpenCL 1.0 Specification (February 2009), http://www.khronos.org/registry/cl/
8. Halfhill, T.R.: Parallel Processing with CUDA. Microprocessor Report (January 2008)
9. Lindholm, E., Nickolls, J., Oberman, S., Montrym, J.: NVIDIA Tesla: A unified graphics and computing architecture. IEEE Micro 28(2), 39–55 (2008)
10. Wasson, S.: NVIDIA's GeForce 8800 graphics processor. Tech Report (November 2007)
11. Wasson, S.: AMD Radeon HD 2900 XT graphics processor: R600 revealed. Tech Report (May 2007)
12. Moya, V., González, C., Roca, J., Fernández, A., Espasa, R.: Shader Performance Analysis on a Modern GPU Architecture. In: 38th IEEE/ACM Int. Symp. Microarchitecture, Barcelona, Spain, November 12-16. IEEE Computer Society, Los Alamitos (2005)
13. Seiler, L., Carmean, D., Sprangle, E., Forsyth, T., Abrash, M., Dubey, P., Junkins, S., Lake, A., Sugerman, J., Cavin, R., Espasa, R., Grochowski, E., Juan, T., Hanrahan, P.: Larrabee: A Many-Core x86 Architecture for Visual Computing. ACM Transactions on Graphics 27(18) (August 2008)
14. Segal, M., Akeley, K.: The OpenGL Graphics System: A Specification. Silicon Graphics, Inc. (2006)
15. Colwell, R.P., Nix, R.P., O'Donnell, J.J., Papworth, D.B., Rodman, P.K.: A VLIW architecture for a trace scheduling compiler. In: ASPLOS-II: Proc. second int. conf. on Architectural support for programming languages and operating systems, pp. 180–192. IEEE Computer Society Press, Los Alamitos (1987)
16. Corporaal, H.: Microprocessor Architectures: from VLIW to TTA. John Wiley & Sons, Chichester (1997)
17. Corporaal, H.: TTAs: missing the ILP complexity wall. Journal of Systems Architecture 45(12-13), 949–973 (1999)
18. Hoogerbrugge, J., Corporaal, H.: Register file port requirements of Transport Triggered Architectures. In: MICRO 27: Proc. 27th Int. Symp. Microarchitecture, pp. 191–195. ACM Press, New York (1994)
19. Jääskeläinen, P., Guzma, V., Cilio, A., Takala, J.: Codesign toolset for application-specific instruction-set processors. In: Proc. Multimedia on Mobile Devices 2007, pp. 65070X-1–65070X-11 (2007), http://tce.cs.tut.fi/
20. Lattner, C., Adve, V.: LLVM: A compilation framework for lifelong program analysis & transformation. In: Proc. Int. Symp. Code Generation and Optimization, Palo Alto, CA, March 20-24, p. 75 (2004)
21. Hennessy, J.L., Patterson, D.A.: Computer Architecture: A Quantitative Approach, 3rd edn. Morgan Kaufmann Publishers Inc., San Francisco (2003)
The Abstract Streaming Machine: Compile-Time Performance Modelling of Stream Programs on Heterogeneous Multiprocessors

Paul M. Carpenter, Alex Ramirez, and Eduard Ayguade

Barcelona Supercomputing Center, C/ Jordi Girona, 31, 08034 Barcelona, Spain
{paul.carpenter,alex.ramirez,eduard.ayguade}@bsc.es
Abstract. Stream programming offers a portable way for regular applications such as digital video, software radio, multimedia and 3D graphics to exploit a multiprocessor machine. The compiler maps a portable stream program onto the target, automatically sizing communications buffers and applying optimizing transformations such as task fission or fusion, unrolling loops and aggregating communication. We present a machine description and performance model for an iterative stream compilation flow, which represents the stream program running on a heterogeneous multiprocessor system with distributed or shared memory. The model is a key component of the ACOTES open-source stream compiler currently under development. Our experiments on the Cell Broadband Engine show that the predicted throughput has a maximum relative error of 15% across our benchmarks.
1 Introduction
Many people [1] have recognized the need to change the way software is written to take advantage of multi-core systems [2] and distributed memory [3,4,5]. This paper is concerned with applications such as digital video, software radio, signal processing and 3D graphics, all of which may be represented as block diagrams, in which independent blocks communicate and synchronize only via regular streams of data. Such applications have high task and data parallelism, which is hidden when the program is written in C or a similar sequential programming language, requiring the programmer to apply high level optimizations such as task fusion, fission and blocking transformations by hand. Recent work on stream programming languages, most notably StreamIt [6] and Synchronous Data Flow (SDF) [7], has demonstrated how a compiler may potentially match the performance of hand-tuned sequential or multi-threaded code [8]. This work is part of the ACOTES project [9], which is developing a complete open-source stream compiler for embedded systems. This compiler will automatically partition a stream program to use task-level parallelism, size communications buffers and aggregate communications through blocking. This paper describes the Abstract Streaming Machine (ASM), which represents the target system to this compiler. Figure 1 shows the iterative compilation flow, with a
(source + SPM pragmas → Mercurium (task fusion, allocation) → source + acolib → gcc with ICI plugin (blocking) → executable → trace → search algorithm, with an alternative feedback path through the ASM simulator)
Fig. 1. The ACOTES iterative stream compiler
search algorithm determining the candidate mapping, which is compiled using Mercurium [10] and GCC. The Mercurium source-to-source convertor translates from the SPM source language [11,12], and performs task fusion and allocation. The resulting multi-threaded program is compiled using GCC, which we are extending within the project to perform blocking to aggregate computation and communication. Running the executable program generates a trace, which is analysed by the search algorithm to resolve bottlenecks. An alternative feedback path generates a trace using the ASM simulator, which is a coarse-grain model of the ASM. This path does not require recompilation, and is used when resizing buffers or to approximate the effect of fission or blocking.
2 Stream Programming
There are several definitions of stream programming, differing mostly in the handling of control flow and restrictions on the program graph topology [13]. All stream programming models, however, represent the program as a set of kernels, communicating only via unidirectional streams. The producer has a blocking push primitive and the consumer has a blocking pop primitive. This programming model is deterministic provided that the kernels themselves are deterministic, there is no other means of communication between kernels, each stream has one producer and one consumer, and the kernels cannot check whether a push or pop would block at a particular time [14]. When the stream program is compiled, one or more kernels are mapped to each task, which is executed in its own thread. The communications primitives
are provided by the ACOTES run-time system, acolib, which also creates and initializes threads at the start of the computation, and waits for their completion at the end. The run-time system supports two-phase communication, and can be implemented for shared memory, distributed memory with DMA, and hardware FIFOs. On the producer side, pushAcquire returns a pointer to an empty array of np elements; the np parameter is equal to the producer’s blocking factor, and is supplied during stream initialization. When the task has filled this buffer with new data, it calls pushSend to request that acolib delivers the data to the consumer. On the consumer side, popAcquire returns a pointer to the next full block of nc elements. When the consumer has finished with the data in this block, it calls popDiscard to mark the block as empty.
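The four primitives compose into the usual producer and consumer loops. The C sketch below shows this composition; the stream handle types and the exact prototypes are assumptions made for the example, since the text names the primitives but does not give their signatures.

    /* Illustrative use of the acolib two-phase primitives (assumed signatures). */
    #include <stddef.h>

    typedef struct ostream ostream_t;   /* producer-side stream handle (assumed type) */
    typedef struct istream istream_t;   /* consumer-side stream handle (assumed type) */

    float *pushAcquire(ostream_t *s);   /* returns an empty block of np elements        */
    void   pushSend(ostream_t *s);      /* asks acolib to deliver the block downstream  */
    float *popAcquire(istream_t *s);    /* returns the next full block of nc elements   */
    void   popDiscard(istream_t *s);    /* marks the consumed block as empty            */

    void producer_task(ostream_t *out, size_t np, size_t iterations)
    {
        for (size_t i = 0; i < iterations; i++) {
            float *block = pushAcquire(out);     /* blocks until an empty buffer is free */
            for (size_t j = 0; j < np; j++)
                block[j] = (float)(i * np + j);  /* fill the block */
            pushSend(out);
        }
    }

    float consumer_task(istream_t *in, size_t nc, size_t iterations)
    {
        float sum = 0.0f;
        for (size_t i = 0; i < iterations; i++) {
            float *block = popAcquire(in);       /* blocks until a full block arrives */
            for (size_t j = 0; j < nc; j++)
                sum += block[j];
            popDiscard(in);
        }
        return sum;
    }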
3 ASM Machine Description
The target is represented as a bipartite graph of processors and memories in one partition, and interconnects in the other. Figure 2 shows the topology of two example targets. Each processor and interconnect is defined using the parameters summarized in Figures 3 and 4, and described below. The machine description defines the machine visible to software, which may not exactly match the physical hardware. For example, the OS in a Playstation 3 makes six of the eight SPEs available to software. We assume that the processors used by the stream program are not time-shared with other applications while the program is running. Each processor is defined using the parameters shown in Figure 3(a). The details of the processor’s ISA and micro-architecture are described internally to the back-end compiler, so are not duplicated in the ASM. The processor description includes the costs of the acolib library calls. The costs of the pushSend and popAcquire primitives are given by a staircase function; i.e. a fixed cost, a
((a) Cell-based system: the PPE and SPE0–SPE7, each SPE with its local store LS0–LS7, and main memory, all attached to the EIB bus; (b) shared-memory system: processors P1–P3 with caches $1–$3 and memory on a shared bus)
Fig. 2. Topology of two example targets
Parameter           Description                                                        Value
name                Unique name in platform namespace                                  'SPEn'
clockRate           Clock rate, in GHz                                                 3.2
hasIO               True if the processor can perform IO                               False
addressSpace        List of the physical memories addressable by this processor
                    and their virtual address                                          [(LSn,0)]
pushAcqCost         Cost, in cycles, to acquire a producer buffer (before waiting)     448
pushSendFixedCost   Fixed cost, in cycles, to push a block (before waiting)            1104
pushSendUnit        Number of bytes per push transfer unit                             16384
pushSendUnitCost    Incremental cost, in cycles, to push pushUnit bytes                352
popAcqFixedCost     Fixed cost, in cycles, to pop a block (before waiting)             317
popAcqUnit          Number of bytes per pop transfer unit                              16384
popAcqUnitCost      Incremental cost, in cycles, to pop popUnit bytes                  0
popDiscCost         Cost, in cycles, to discard a consumer buffer (before waiting)     189

(a) Definition of a processor

Parameter           Description                                                        Value
name                Unique name in platform namespace                                  'EIB'
clockRate           Clock rate, in GHz                                                 1.6
elements            List of the names of the elements (processors and memories)
                    on the bus                                                         ['PPE', 'SPE0', ..., 'SPE7']
interfaceDuplex     If the bus has more than one channel, then define for each
                    processor whether it can transmit and receive simultaneously
                    on different channels                                              [True, ..., True]
interfaceRouting    Define for each processor the type of routing from this bus:
                    storeAndForward, cutThrough, or None                               [None, ..., None]
startLatency        Start latency, L, in cycles                                        80
startCost           Start cost on the channel, S, in cycles                            0
bandwidthPerCh      Bandwidth per channel, B, in bytes per cycle                       16
finishCost          Finish cost, F, in cycles                                          0
numChannels         Number of channels on the bus                                      3
multiplexable       False for a hardware FIFO that can only support one stream        True

(b) Definition of an interconnect

Fig. 3. Processor and interconnect parameters of the Abstract Streaming Machine and values for the Cell Broadband Engine
block size, and an incremental cost for each complete or partial block after the first. This variable cost is necessary both for FIFOs and for distributed memory with DMA. For distributed memory, the size of a single DMA transfer is often limited by hardware, so that larger transfers require additional processor time in pushSend to program multiple DMA transfers. The discontinuity at 16K in Figure 5 is due to this effect.

The addressSpace and hasIO parameters provide constraints on the compiler mapping, but are not required to evaluate the performance of a valid mapping. The former defines the local address space of the processor; i.e. which memories are directly accessible and where they appear in local virtual memory, and is used to place stream buffers. The model assumes that the dominant bus traffic is communication via streams, so either the listed memories are private local stores, or they are shared memories accessed via a private L1 cache. In the latter case, the cache should be sufficiently effective that the cache miss traffic on the interconnect is insignificant. The hasIO parameter defines which processors can perform system IO, and is a simple way to ensure that tasks that need system IO are mapped to a capable processor.

Each interconnect is defined using the parameters shown in Figure 3(b). The system topology is given by the elements parameter, which for a given interconnect lists the adjacent processors and memories. Each interconnect is modelled as a bus with multiple channels, which has been shown to be a good approximation to the performance observed in practice when all processors and memories on a single link are equidistant [15]. Each bus has a single unbounded queue to hold the messages ready to be transmitted, and one or more channels on which to transmit them. The compiler statically allocates streams onto buses, but the choice of channel is made at runtime. The interfaceDuplex parameter defines, for each resource (i.e. processor or memory), whether it can simultaneously read and write on different channels. The bandwidth and latency of each channel are controlled using four parameters: the start latency (L), start cost (S), bandwidth (B), and finish cost (F). In transferring a message of size n bytes, the latency of the link is given by L + S + n/B and the cost incurred on the link by S + n/B + F. This model is natural for distributed memory machines, and amounts to the assumption of cache-to-cache transfers on shared memory machines. Hardware routing is controlled using the interfaceRouting parameter, which defines for each processor whether it can route messages from this interconnect: each entry can take the value storeAndForward, cutThrough or None.

Each memory is defined using the parameters shown in Figure 4. The latency and bandwidth figures are currently unused in the model, but may be used by the compiler to refine the estimate of the run time of each task. The memory definitions are used to determine where to place communications buffers, and provide constraints on blocking factors.
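As a worked reading of the cost model, the sketch below evaluates the two formulas above for the Cell EIB parameters of Fig. 3(b); the struct and function names are illustrative only and do not belong to the ASM toolset.

    /* Per-message bus latency and cost (L + S + n/B and S + n/B + F). */
    #include <stdio.h>

    typedef struct {
        double L;   /* start latency, in cycles             */
        double S;   /* start cost on the channel, in cycles */
        double B;   /* bandwidth per channel, bytes/cycle   */
        double F;   /* finish cost, in cycles               */
    } bus_params;

    static double message_latency(const bus_params *p, double n) { return p->L + p->S + n / p->B; }
    static double channel_cost(const bus_params *p, double n)    { return p->S + n / p->B + p->F; }

    int main(void)
    {
        bus_params eib = { 80.0, 0.0, 16.0, 0.0 };   /* EIB values from Fig. 3(b) */
        double n = 16384.0;                          /* one 16 KB block           */
        printf("latency = %.0f cycles\n", message_latency(&eib, n));  /* 80 + 16384/16 = 1104 */
        printf("cost    = %.0f cycles\n", channel_cost(&eib, n));     /* 16384/16 = 1024      */
        return 0;
    }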
Parameter  Description                        Value
name       Unique name in platform namespace  'LSn'
size       Size, in bytes                     262144
clockRate  Clock rate, in GHz                 3.2
latency    Access latency, in cycles          2
bandwidth  Bandwidth, in bytes per cycle      128
Fig. 4. Memory parameters of the Abstract Streaming Machine and values for the Cell Broadband Engine
4 ASM Program Description
The compiled stream program is a connected directed graph of tasks and point-to-point streams, as described in Section 2. All synchronization between tasks happens in the blocking acolib communications primitives described above. A task may have complex data-dependent or irregular behaviour. The basic unit of sequencing inside a task is the subtask, which pops a fixed number of elements from each input stream and pushes a fixed number of elements on each output stream. In detail, the work function for a subtask is divided into three consecutive phases. First, the acquire phase obtains the next set of full input buffers and empty output buffers. Second, the processing phase works locally on these buffers, and is modelled using a fixed processing time, determined from a trace. Finally, the release phase discards the input buffers, and sends the output buffers, releasing the buffers in the same order they were acquired. This three-stage model is not a deep requirement of the ASM, and was introduced as a convenience in the implementation of the simulator, since our compiler will naturally generate subtasks of this form.

A stream is defined by the size of each element, and the location and length of either the separate producer and consumer buffers (distributed memory) or the single shared buffer (shared memory). These buffers do not have to be of the same length. If the producer or consumer task uses the peek primitive, then the buffer length should be reduced to model the effective size of the buffer, excluding the elements of history that share the buffer. The Finite Impulse Response (FIR) filters in the GNU radio benchmark of Section 6 are described in this way. It is possible to specify a number of elements to prequeue on the stream before execution begins.
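The three-phase subtask model can be illustrated with the following sketch; the acquire()/release() interface assumed here mirrors the blocking acolib primitives but is not the real acolib API.

```python
def run_subtask(in_streams, out_streams, process):
    """One firing of a subtask: acquire -> process -> release.

    The stream objects are assumed to expose acquire()/release() on fixed-size
    blocks; this is a sketch of the ASM model, not the acolib interface.
    """
    # 1. Acquire: block until full input buffers and empty output buffers are available.
    in_bufs  = [s.acquire() for s in in_streams]
    out_bufs = [s.acquire() for s in out_streams]

    # 2. Process: work locally on the acquired buffers.  In the ASM simulator this
    #    phase is charged a fixed processing time taken from a trace.
    process(in_bufs, out_bufs)

    # 3. Release: discard the inputs and publish the outputs, in acquisition order.
    for s, b in zip(in_streams, in_bufs):
        s.release(b)
    for s, b in zip(out_streams, out_bufs):
        s.release(b)
```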
5 Implementation and Methodology
We use a small suite of benchmarks and target platforms, which have been translated by hand into the description files. The benchmarks were evaluated on an IBM QS20 blade, which has two Cell processors. The producer-consumer benchmark is used to determine basic parameters, and has two actors: a producer, and
Fig. 5. Results for producer-consumer benchmark: (a) time per iteration (usecs per firing); (b) throughput (Gigabytes per second); measured vs. simulated results.
consumer, with two buffers at each end. The chain benchmark is a linear pipeline of n tasks, and is used to characterize bus contention. The chain2 benchmark is used to model latency and queue contention, and is a linear pipeline, similar to chain, but with an extra cut stream between the first and last tasks. The number of blocks in the consumer-side buffer on the cut stream is a parameter, c. For all benchmarks, the number of bytes per iteration is denoted b. Figure 5 shows the time per iteration for producer-consumer, as a function of b. The discontinuity at b = 16K is due to the overhead of programming two DMA transfers. For b < 20.5K, the bottleneck is the computation time of the producer task, as can be seen in Figure 7(a) and (b), which compare real and simulated traces for b = 8K. For b > 20.5K, the bottleneck is the interconnect, and the
Fig. 6. Time per iteration for the chain and chain2 benchmarks: (a) chain, real results; (b) chain, averaged real results; (c) chain, simulated results (measured vs. simulated, curves for n = 2 to 8); (d) chain2, time per iteration (curves for c = 1, 2, 3 and 6).
slope of the line is the reciprocal of the bandwidth: 25.6GB/s. Figure 7(c) and (d) compare real and simulated traces for b = 24K. The maximum relative error for 0 < b < 32K is 3.1%. Figure 6 shows the time per iteration for chain, as a function of n, the number of tasks, and b, the block size. Figure 6(a) shows the measured performance on the IBM QS20 blade, when tasks are allocated to SPEs in increasing numerical order. The EIB on the Cell processor consists of two clockwise and two anticlockwise rings, each supporting up to three simultaneous transfers provided that they do not overlap. The drop in real measured performance from n = 4 to n = 5 and from n = 7 to n = 8 is due to contention on particular hops of the EIB, which the ASM does not attempt to model. As described in Section 3, the ASM models an interconnect as a set of parallel buses connecting an (unordered) set of processors. Figure 6(b) shows the average of the measured performance of three random permutations of the SPEs. The simulated results in Figure 6(c) are hence close to the expected results, in a probabilistic sense, when the physical ordering of the SPEs is not known. Figure 6(d) shows the time per iteration for chain2, as a function of the number of tasks, n, and the size of the consumer-side buffer of the shortcut stream between the first and last tasks, denoted c. The bottleneck is either the computation time of the first task (1.27us per iteration), or is due to the latency of the chain being exposed by the finite length of the queue on the shortcut
Fig. 7. Comparison of real and simulated traces: (a) compute bound (real); (b) compute bound (simulated); (c) communication bound (real); (d) communication bound (simulated); (e) queuing bound (real); (f) queuing bound (simulated). Activities shown: processing, pop wait, pop work, push local wait, push remote wait, push work.
stream. Figure 7(e) and (f) show real and simulated traces for the latter case, with n = 7 and c = 2.
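The producer-consumer behaviour described above can be summarized by a simple bottleneck model: the iteration time is the larger of the producer's computation time and the per-block transfer time on the interconnect. The sketch below is only an illustration of that reading of the results; the 0.82 us producer time is a hypothetical value chosen to reproduce the reported crossover near b = 20.5K, not a figure taken from the paper.

```python
BUS_BANDWIDTH = 25.6e9      # bytes/s, peak EIB bandwidth quoted in the text
PRODUCER_TIME = 0.82e-6     # seconds; hypothetical compute time per firing,
                            # chosen so the crossover falls near b = 20.5K

def iteration_time(b_bytes):
    """Simple bottleneck model: max of compute time and transfer time."""
    transfer = b_bytes / BUS_BANDWIDTH
    return max(PRODUCER_TIME, transfer)

for b in (8 * 1024, 16 * 1024, 24 * 1024, 32 * 1024):
    t = iteration_time(b)
    print(f"b = {b // 1024:>2}K  time/iter = {t * 1e6:.2f} us  "
          f"throughput = {b / t / 1e9:.1f} GB/s")
```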
6 Validation
This section describes the validation work using our GNU radio benchmark, which is based on the FM stereo demodulator in GNU Radio [16]. Table 1(a) shows the computation time and multiplicity per kernel, the latter being the number of times it is executed per pair of l and r output elements. Four of the kernels, being FIR filters, peek backwards in the input stream, requiring history as indicated in the table. Other than this, all kernels are stateless. Table 1 shows two possible mappings of the GNU radio benchmark onto the Cell processor, being the mapping of kernels to tasks and the blocking factors. The first allocates one task per kernel, using a total of seven of the eight available SPEs. Based on the resource utilization, the Carrier kernel was split into two worker tasks and the remaining kernels were partitioned onto two other SPEs. This gives 79% utilization of four processors, and approximately twice the throughput of the unoptimized mapping, at 7.71ms per iteration, rather than 14.73ms per iteration. The throughput and latency from the simulator are within 0.5% and 2% respectively.

Table 1. Kernels and mappings of the GNU radio benchmark

(a) Kernels
Kernel            Multiplicity  History buffer  Time per firing (us)  % of total load
Demodulation      8             n/a             398                   1.7%
Lowpass (middle)  1             1.6K            7,220                 3.8%
Bandpass          8             1.6K            7,246                 30.4%
Carrier           8             3.2K            14,351                60.2%
Frequency shift   8             n/a             12                    0.1%
Lowpass (side)    1             1.6K            7,361                 3.9%
Sum               1             n/a             13                    0.0%

(b) Naive mapping
Task  Kernel            Blocking factor
1     Demodulation      512
2     Lowpass (middle)  128
3     Bandpass          1024
4     Carrier           1024
5     Frequency shift   1024
6     Lowpass (side)    128
7     Sum               128

(c) Optimized mapping (kernels grouped onto four tasks)
Kernel            Blocking factor
Demodulation      1024
Bandpass          1024
Carrier (even)    1024
Carrier (odd)     1024
Lowpass (middle)  128
Frequency shift   1024
Lowpass (side)    128
Sum               128
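The utilization figure quoted above can be checked with a few lines of arithmetic from Table 1(a). The kernel-to-task grouping used below (each Carrier half alone, Bandpass alone, everything else together) is an assumption consistent with the description in the text, not stated explicitly in the paper.

```python
# Per-iteration cost of each kernel = multiplicity * time per firing (us), from Table 1(a)
cost = {
    "Demodulation":     8 * 398,
    "Lowpass (middle)": 1 * 7220,
    "Bandpass":         8 * 7246,
    "Carrier":          8 * 14351,
    "Frequency shift":  8 * 12,
    "Lowpass (side)":   1 * 7361,
    "Sum":              1 * 13,
}

# Assumed optimized grouping onto four SPEs: Carrier split in two, Bandpass alone,
# the remaining kernels sharing a fourth SPE.
tasks = [
    cost["Carrier"] / 2,
    cost["Carrier"] / 2,
    cost["Bandpass"],
    sum(cost[k] for k in ("Demodulation", "Lowpass (middle)",
                          "Frequency shift", "Lowpass (side)", "Sum")),
]

bottleneck = max(tasks)                          # the slowest processor sets the pace
utilization = sum(tasks) / (len(tasks) * bottleneck)
print(f"utilization ~ {utilization:.0%}")        # about 82%, in line with the reported 79%
```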
7 Related Work
Most work on machine description languages for retargetable compilers has focused on describing the ISA and micro-architecture of a single processor. Among others, the languages ISP, LISA, and ADL may be used for simulation, and CODEGEN, BEG, BURG, nML, EXPRESSION, Maril and GCC's .md machine description are intended for code generation (see, e.g., [17]). The ASM describes the behaviour of the system in terms of that of its parts, and is designed to co-exist with these lower-level models.

The Stream Virtual Machine (SVM) is an intermediate representation of a stream program, which forms a common language between a high-level and a low-level compiler [18,19]. Each kernel is given a linear computation cost function, comprised of a fixed overhead and a cost per stream element consumed. There is no model of irregular dataflow. The SVM architecture model is specific to graphics processors (GPUs), and characterizes the platform using a few parameters such as the bandwidth between local and global memory.

The PCA Machine Model [20], by the Morphware Forum, is an XML definition of a reconfigurable computing device, in terms of resources, which may be processors, DMA engines, memories and network links. The reconfigurable behaviour of a target is described using ingredients and morphs. Unlike the ASM, the PCA Machine Model describes the entire target, including low-level information about each processor's functional units and number of registers.

ORAS is a retargetable simulator for design-space exploration of stream-based dataflow architectures [21]. The target is specified by the architecture instance, which defines the hardware as a graph of architecture elements, similar to the resources of the ASM. Since the purpose is performance analysis rather than compilation, the system is specified to a greater level of detail than the ASM.

Gordon et al. present a compiler for the StreamIt language targeting the Raw Architecture Workstation, applying similar transformations to those discussed in this paper [22]. As the target is Raw, there is no general machine model similar to the ASM. The compiler uses simulated annealing to minimize the length, in cycles, of the critical path. Our approach has higher computational complexity in the compiler's cost model, but provides retargetability and greater flexibility in the program model.

Gedae is a proprietary stream-based graphical programming environment for signal processing applications in the defense industry. The developer specifies the mapping of the stream program onto the target, and the compiler generates the executable implementation [23]. There is no compiler search algorithm or cost model. A version of Gedae has been released for the Cell processor.
Acknowledgements
The researchers at BSC-UPC were supported by the Spanish Ministry of Science and Innovation (contract no. TIN2007-60625), the European Commission in the
context of the ACOTES project (contract no. IST-34869) and the HiPEAC Network of Excellence (contract no. IST-004408). We would also like to acknowledge our partners in the ACOTES project for the insightful discussions on the topics presented in this paper.
References
1. Sutter, H., Larus, J.: Software and the concurrency revolution. Queue 3(7), 54–62 (2005)
2. Parkhurst, J., Darringer, J., Grundmann, B.: From single core to multi-core: preparing for a new exponential. In: Proc. ICCAD 2006, pp. 67–72. ACM Press, New York (2006)
3. Chaoui, J., Cyr, K., Giacalone, J., Gregorio, S., Masse, Y., Muthusamy, Y., Spits, T., Budagavi, M., Webb, J.: OMAP: Enabling Multimedia Applications in Third Generation (3G) Wireless Terminals. In: SWPA001 (2000)
4. Chen, T., Raghavan, R., Dale, J., Iwata, E.: Cell Broadband Engine Architecture and its first implementation. IBM developerWorks (2005)
5. ClearSpeed: CSX Processor Architecture (2005), http://www.clearspeed.com/docs/resources/ClearSpeed_Architecture_Whitepaper_Feb07v2.pdf
6. Thies, W., Karczmarek, M., Amarasinghe, S.: StreamIt: A Language for Streaming Applications. ICCC 4 (2002)
7. Lee, E., Messerschmitt, D.: Synchronous data flow. Proceedings of the IEEE 75(9), 1235–1245 (1987)
8. Gummaraju, J., Rosenblum, M.: Stream Programming on General-Purpose Processors. In: Proc. MICRO 38, Barcelona, Spain (November 2005)
9. ACOTES IST-034869, Advanced Compiler Technologies for Embedded Streaming, http://www.hitech-projects.com/euprojects/ACOTES/
10. Balart, J., Duran, A., Gonzalez, M., Martorell, X., Ayguade, E., Labarta, J.: Nanos Mercurium: a Research Compiler for OpenMP. In: Proceedings of the European Workshop on OpenMP, vol. 2004 (2004)
11. Carpenter, P., Rodenas, D., Martorell, X., Ramirez, A., Ayguadé, E.: A streaming machine description and programming model. In: Vassiliadis, S., Bereković, M., Hämäläinen, T.D. (eds.) SAMOS 2007. LNCS, vol. 4599, pp. 107–116. Springer, Heidelberg (2007)
12. ACOTES: IST ACOTES Project Deliverable D2.2 Report on Streaming Programming Model and Abstract Streaming Machine Description Final Version (2008)
13. Stephens, R.: A survey of stream processing. Acta Informatica 34(7), 491–541 (1997)
14. Kahn, G.: The semantics of a simple language for parallel processing. Information Processing 74, 471–475 (1974)
15. Girona, S., Labarta, J., Badia, R.: Validation of Dimemas communication model for MPI collective operations. In: Dongarra, J., Kacsuk, P., Podhorszki, N. (eds.) PVM/MPI 2000. LNCS, vol. 1908, pp. 39–46. Springer, Heidelberg (2000)
16. GNU Radio, http://www.gnu.org/software/gnuradio/
17. Ramsey, N., Davidson, J., Fernandez, M.: Design principles for machine-description languages. ACM Trans. Programming Languages and Systems (1998)
18. Labonte, F., Mattson, P., Thies, W., Buck, I., Kozyrakis, C., Horowitz, M.: The stream virtual machine. In: Proc. PACT 2004, pp. 267–277 (2004)
19. Mattson, P., Thies, W., Hammond, L., Vahey, M.: Streaming virtual machine specification 1.0. Technical report (2004), http://www.morphware.org
20. Mattson, P.: PCA Machine Model, 1.0. Technical report (2004)
21. Kienhuis, B.: Design Space Exploration of Stream-based Dataflow Architectures: Methods and Tools. Delft University of Technology, Amsterdam, The Netherlands (1999)
22. Gordon, M., Thies, W., Amarasinghe, S.: Exploiting coarse-grained task, data, and pipeline parallelism in stream programs. In: Proc. ASPLOS 2006, pp. 151–162 (2006)
23. Lundgren, W., Barnes, K., Steed, J.: Gedae: Auto Coding to a Virtual Machine. In: Proc. HPEC (2004)
CABAC Accelerator Architectures for Video Compression in Future Multimedia: A Survey
Yahya Jan and Lech Jozwiak
Faculty of Electrical Engineering, Eindhoven University of Technology, The Netherlands
{Y.Jan,L.Jozwiak}@tue.nl
Abstract. The demands for high quality, real-time performance and multi-format video support in consumer multimedia products are ever increasing. In particular, future multimedia systems require efficient video coding algorithms and corresponding adaptive high-performance computational platforms. The H.264/AVC video coding algorithms provide high enough compression efficiency to be utilized in these systems, and multimedia processors are able to provide the required adaptability, but the complexity of the algorithms demands more efficient computing platforms. Heterogeneous (re-)configurable systems composed of multimedia processors and hardware accelerators constitute the main part of such platforms. In this paper, we survey the hardware accelerator architectures for Context-based Adaptive Binary Arithmetic Coding (CABAC) of the Main and High profiles of H.264/AVC. The purpose of the survey is to deliver a critical insight into the proposed solutions, and in this way to facilitate further research on accelerator architectures, architecture development methods and supporting EDA tools. The architectures are analyzed, classified and compared based on the core hardware acceleration concepts, algorithmic characteristics, video resolution support and performance parameters, and some promising design directions are discussed. The comparative analysis shows that the parallel pipeline accelerator architecture seems to be the most promising.
Keywords: RC hardware architectures, accelerators, multimedia processing, UHDTV, video compression, H.264/AVC, CABAC.
1 Introduction
The real-time performance requirements of modern multimedia applications, like video conferencing, video telephony, camcorders, surveillance, medical imaging, and especially High Definition Television (HDTV) and the emerging Ultra HDTV (UHDTV) in the video broadcasting domain, demand highly efficient computational platforms. The problem is amplified by the quickly growing requirements for higher and higher quality, especially in the video broadcast domain, which results in a huge amount of data processing for the new standards of digital TV, like UHDTV, which requires a resolution of 7680x4320 (~33 Megapixel)
with a data rate of 24Gbps. Additionally, the video coding algorithms of the latest standards are much more complex, due to the digital multimedia convergence, specifically the access to multimedia through a variety of networks and the different coding formats used by a single device, as well as the slow vanishing of the old video coding standards (e.g. MPEG-2) and the widespread adoption of the new standards (e.g. H.264/AVC, VC1, etc.). The computational platforms for multimedia are also required to be (re-)configurable, to enable their adaptation to the various domains, accessing networks, standards and work modes. Hardware accelerators constitute the kernel of such (re-)configurable high-performance platforms. Despite spectacular advances in the microelectronic industry, the future multimedia systems cannot be realized using the conventional processor architectures or the existing multimedia processors. They require highly efficient specialized hardware architectures to satisfy the stringent functional and non-functional requirements, have to be flexible enough to support multiple domains, standards and modes, and have to be implemented with SoC platforms involving embedded (re-)configurable hardware. In particular, (re-)configurable hardware accelerators are indispensable for the development of these specialized and demanding systems, as well as new design and design automation methodologies to support the development of such accelerators. H.264/AVC [1] is the latest multi-domain video coding standard; it provides compression efficiency almost 50% higher than former standards (e.g. MPEG-2) due to its advanced coding tools. However, its computational complexity is about four times higher compared to its predecessors, and induces the necessity of real-time video coding through a sophisticated dedicated hardware design. H.264/AVC supports two entropy coding modes: Context Adaptive Variable Length Coding (CAVLC) and Context-based Adaptive Binary Arithmetic Coding (CABAC) [2]. CAVLC covers the Baseline profile of H.264/AVC for low-end applications, like video telephony, while CABAC targets the Main and High profiles for high-end applications, like HDTV. CABAC improves the compression efficiency by 9%-14% compared to CAVLC, at the cost of an increase in complexity of 25-30% for encoding and 12% for decoding, in terms of access frequency [2][3][4]. A purely software based implementation results in unsatisfactory performance even for low quality and resolution video (e.g. 30-40 cycles are required on average for a single bin decoding on a DSP [3]). The situation is much worse for High Definition (HD) video, as the maximum bin rate requirement of HD (level 3.1 to 4.2) in H.264/AVC, averaged across a coded picture, ranges from 121 Mbins/s to 1.12 Gbins/s [5]. This makes a software based implementation inadequate to achieve real-time performance for HD video, as a multi-gigahertz RISC processor would be required for HD encoding in real time [6]. Moreover, the serial nature of CABAC paralyzes the other processes in the video codec that could be performed in parallel, making CABAC a bottleneck in the overall codec performance.
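To see why software decoding falls short, one can combine the cycles-per-bin figure with the HD bin-rate requirement quoted above; the back-of-the-envelope sketch below does this (the 35 cycles/bin value is simply the midpoint of the 30-40 range reported for a DSP, not a measured figure).

```python
cycles_per_bin = 35          # midpoint of the 30-40 cycles/bin reported for a DSP [3]
hd_bin_rates = {
    "HD level 3.1 (min)": 121e6,    # bins/s
    "HD level 4.2 (max)": 1.12e9,   # bins/s
}

for level, bins_per_s in hd_bin_rates.items():
    required_hz = bins_per_s * cycles_per_bin
    print(f"{level}: ~{required_hz / 1e9:.1f} GHz of CPU dedicated to CABAC alone")
```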
However, the bitwise serial processing nature of CABAC, the strong dependencies among the different partial computations, a substantial number of memory accesses, and a variable number of cycles per bin processing pose a huge challenge for the design of such an effective and efficient hardware accelerator. Numerous research groups from academia and industry all over the world have proposed different hardware architectures for CABAC, using different hardware acceleration concepts and schemes. Our work reported in this paper is performed in the framework of a research project that aims to develop an adequate design methodology and propose supporting EDA tools for the development of demanding (re-)configurable hardware accelerators. This paper surveys several of the most interesting recently proposed hardware accelerator architectures for CABAC. Its main purpose is to deliver a critical insight into the proposed hardware accelerator solutions, and in this way facilitate our own and other researchers' further work on (re-)configurable accelerator architectures for future complex multimedia applications, architecture development methods and supporting EDA tools. The architectures are analyzed, classified and compared based on the core hardware acceleration concepts, algorithmic characteristics, video resolution support and performance parameters in the hardware accelerator domain, like throughput, frequency, resource utilization and power consumption. Based on the critical architecture comparisons, some promising design directions are discussed in view of the requirements of current and future digital multimedia applications. The rest of the paper is organized as follows. Section 2 introduces CABAC. Section 3 covers the main hardware accelerator concepts and classification. Using them, Section 4 presents a critical review of hardware accelerator architectures for CABAC, compares the various architectures and discusses some promising design directions. Section 5 concludes the paper.
2 Introduction to CABAC
CABAC utilizes three elementary processes to encode a syntax element (SE), i.e. an element of data (motion data, quantized transform coefficients data, control data) represented in the bitstream to be encoded. The processes are: binarization, context modeling and binary arithmetic coding, as shown in Figure 1. The binarization maps a non-binary valued SE to a unique binary representation referred to as a bin string. Each bit of this binary representation is called a bin. The reduction of the SE alphabet size to binary in binarization not only minimizes the complexity of the arithmetic coder, but also enables the subsequent context modeling stage to more efficiently model the statistical behavior of the syntax elements (SEs). Four basic binarization schemes are used in CABAC [2]. The context modeling process determines the probabilities of the bins using pre-defined context (probability) models, before they are encoded arithmetically. The context models are selected taking into account the neighboring information of the bins/SEs, referred to as the context. CABAC defines 460 unique context models, each of which corresponds to a certain bin or several bins of an SE and is
Fig. 1. Block Diagram of CABAC Encoder
updated after bin encoding in order to adapt the models to the varying statistics of the video data. Each context model comprises the 6-bit probability state index (pStateIdx) and the most probable symbol (MPS) value of the bin [2]. CABAC utilizes a table-based binary arithmetic coder [7] to avoid the costly multiplication in the probability calculation. The binary arithmetic coding engine consists of two sub-engines: regular and bypass, as shown in Figure 1. The regular coding engine utilizes adaptive probability models, while the bypass coding engine assumes a uniform probability model to speed up the encoding process. To encode a bin, the regular coding engine requires the probability model (pStateIdx, MPS) and the corresponding interval range (width) R and base (lower bound) L of the current code interval. The interval is then divided into two subintervals according to the probability estimate (ρLPS) of the least probable symbol (LPS). Then, one of the subintervals is chosen as the new interval based on whether the bin is equal to the MPS or the LPS, as given in the following equations [2]:

Rnew = R − RLPS  and  Lnew = L                if bin = MPS    (1)
Rnew = RLPS      and  Lnew = L + R − RLPS     if bin = LPS    (2)
where RLPS = R · ρLPS represents the size of the subinterval associated with the LPS. The probability model is then updated, and renormalization takes place to keep R and L within their legal ranges. The process repeats for the next bin. In bypass encoding the probability estimation and update processes are bypassed, because a uniform probability is assumed for a bypass bin.
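The following sketch implements the interval update of equations (1) and (2) directly with a multiplication; it is a simplified illustration of the regular coding engine, not the table-based, renormalizing coder defined by the standard (probability-state update and renormalization are deliberately omitted).

```python
def encode_regular_bin(bin_val, mps, p_lps, R, L):
    """One regular-bin interval update following Eqs. (1) and (2).

    bin_val, mps : 0 or 1
    p_lps        : current probability estimate of the LPS (0 < p_lps <= 0.5)
    R, L         : current interval range and base
    Returns the updated (R, L); state update and renormalization are omitted.
    """
    R_lps = R * p_lps                 # subinterval associated with the LPS
    if bin_val == mps:                # Eq. (1): keep the MPS subinterval
        R_new, L_new = R - R_lps, L
    else:                             # Eq. (2): take the LPS subinterval
        R_new, L_new = R_lps, L + R - R_lps
    return R_new, L_new

# Example: encode an MPS followed by an LPS, starting from R = 510, L = 0
R, L = encode_regular_bin(1, mps=1, p_lps=0.2, R=510, L=0)   # -> (408.0, 0)
R, L = encode_regular_bin(0, mps=1, p_lps=0.2, R=R, L=L)     # -> (81.6, 326.4)
```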
3 Main Concepts of Hardware Acceleration
A hardware accelerator is an application-specific hardware sub-system that can implement a given function more effectively and efficiently than software running on a conventional processor. A good example is the graphics accelerator. The main concepts of hardware acceleration can be summarized as follows:
– Parallelism exploitation for the execution of a particular computation instance, due to the availability of multiple application-specific operational resources working in parallel;
– Parallelism exploitation for the execution of several different computation instances at the same time, due to pipelining;
– Application-specific processing units with tailored processing and data granularity.

More specifically, these concepts can be oriented towards data parallelism, functional parallelism, or their mixture. In data parallelism, multiple data instances of the same type are processed in parallel, provided the application allows it and the resources are available. Functional parallelism simultaneously performs different operations on (possibly) different data instances. Also, speculative execution can be used to enable more parallelism. To design a high quality hardware accelerator, it is necessary to perform a thorough analysis of the application algorithms and exploit specific computational characteristics inherent to these algorithms. The different characteristics discovered and accounted for result in different approaches to the design of hardware accelerators, and therefore a number of different basic architecture types have been proposed in the past:

– Straightforward datapath/controller hardware architecture
– Parallel hardware architecture
– Pipeline hardware architecture
– Parallel pipeline hardware architecture
– General purpose processor (GPP) augmented by a loosely coupled hardware accelerator, resulting from the HW/SW co-design approach
– Extensible/Customizable Application Specific Instruction Set Processor (ASIP) with basic accelerators in the form of instruction set extensions (ISE)

These basic architectures will be used to categorize the CABAC accelerators.
4 Overview of Hardware Accelerators for CABAC
The accelerator architectures are analyzed here in a systematic, conceptual way, starting from the computational characteristics of CABAC, and thus differently than in the sporadic, fragmentary comparisons that can be found in the literature. Moreover, we focus on the main problems and solutions that drastically affect the achieved results. We will not consider the mixed HW/SW solution for CABAC, i.e. a GPP augmented by a loosely coupled hardware accelerator, because this option is not promising for satisfying the real-time requirements, due to the strong dependencies in the computations and the resultant high communication overhead. The performance of the different approaches is analyzed and the results are compared, focusing on the throughput, maximum frequency and area. In almost all of the reviewed papers
no systematic analysis is provided or methods proposed on how to integrate the CABAC accelerator in a complete H.264/AVC en-/decoder. Before considering the accelerator architectural approaches, we have to give a brief overview of the main implementation issues in CABAC. Five memory operations are involved in the en-/decoding of a single bin, and there are two blocking dependencies that hamper the parallel and pipeline approaches. The first dependency is related to the context model update: until the context model has been updated for the current bin, the next bin's processing cannot be started, because the same context model may be used to en-/decode the next bin. The other dependency involves the update of the interval range (R) and base (L): until both have been renormalized in the renormalization stage, which involves multiple branches and a variable number of cycles, the next bin's processing cannot be initiated, because the probability estimation of the next bin depends on the current interval range. These strong dependencies are some of the main challenges in the accelerator design, and a number of solutions have been proposed to tackle these problems.
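As an illustration of why these two dependencies serialize the decoder, the following sketch shows the per-bin loop with the blocking points marked. It is schematic pseudo-Python, not the H.264/AVC reference decoder; the callbacks stand for engine primitives and are assumptions.

```python
def decode_bins(ctx_indices, contexts, decode_decision, update_context, renormalize, R, L):
    """Schematic serial CABAC decoding loop: every iteration depends on the previous one.
    decode_decision/update_context/renormalize are assumed primitives, not standard code."""
    bins = []
    for ctx_idx in ctx_indices:
        ctx = contexts[ctx_idx]                      # read context model (pStateIdx, MPS)

        bin_val, R, L = decode_decision(ctx, R, L)   # interval subdivision, Eqs. (1)-(2)

        # Dependency 1: the next bin may use this very context model, so the
        # write-back below must finish before the next iteration's read.
        contexts[ctx_idx] = update_context(ctx, bin_val)

        # Dependency 2: R and L must be renormalized (a variable number of steps)
        # before the next bin's subinterval can be computed.
        R, L = renormalize(R, L)

        bins.append(bin_val)
    return bins
```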
4.1 Straightforward Datapath/Controller Accelerators
The straightforward datapath/controller approach relies on the data flows in the algorithm of the software based implementation. This accelerates the computations to some degree, but does not exploit the true (parallel) nature of the application algorithm, nor the improvement achievable using the hardware acceleration approach. This approach is followed in some CABAC accelerators, in the sense that processing is performed sequentially on a per-bin basis, and possibilities for multi-bin parallel processing are not explored. This always limits the performance to at most 1 bin/cycle; a simple serial hardware implementation without any optimizations takes as many as 14 cycles to encode a single bin [8]. Some optimization techniques, like pre-fetching and simple parallelism [8], were proposed that enable processing one bin in 5 cycles. Chen et al. [9] proposed an FSM and a memory scheme for neighboring SEs, which results in a decoding throughput of 0.33∼0.50 bin/cycle. However, it decodes only CIF video at 30fps.
4.2 Parallel Hardware Accelerators
The inefficiency of the straightforward acceleration approaches in en-/decoding HD video in real time motivated the research community to explore alternative approaches to the design of CABAC accelerators. The most promising approach to achieve real-time performance for high resolution video is to process more than one bin/cycle, i.e. to utilize a parallel approach. However, in the en-/decoding of even a single bin, complex interdependencies have to be resolved as discussed before, and consequently the algorithm cannot be parallelized in its true basic nature. Utilizing the static and dynamic characteristics of the SEs that can be discovered through an analysis of the CABAC algorithm for real video sequences, parallelism can be achieved up to some level for some specific SEs, which can result in the processing of more than one bin/cycle. However, in parallel
en-/decoding of two or more regular bins, the context models have to be supplied to the coding engines. Due to the blocking dependencies, this cannot be performed in parallel. Also, the context model fetching takes a substantial time. The details of these characteristics of SEs and the corresponding parallel schemes are discussed below. Yu et al. [3] proposed the first parallel architecture for CABAC decoding. Unlike the conventional approaches [8][9] that take a number of cycles to decode a single bin, this architecture decodes 1∼3 bins/cycle. The parallelism in this architecture is achieved through a cascade of arithmetic decoding engines: two regular ones and two bypass. This enables the decoding of 1 Regular Bin (1RB), 1RB with 1 Bypass Bin (1BB), 2RB with 1BB, or 2BB in parallel for frequently occurring SEs, like residual data. To reduce the context memory accesses, the relevant context models of an SE or group of SEs are accessed in blocks and stored in a high speed register bank. However, this comes at the extra cost of the register bank. The architectures [10][11][12][13][14] are based on the same concept, but after some specific extensions are capable of en-/decoding HD video. In [15] sixteen cascaded regular decoding units are used for more speed-up for frequent SEs. However, due to dependencies the throughput remains less than 1 bin/cycle, and it causes an increase in the critical path latency and circuit area. In [16] five different architectures for the CABAC encoder are designed and analyzed for the area/performance tradeoff. Results show that the architectures with two regular bins plus bypass bins perform better for high quality video than the others. A predictive approach is employed by Kim et al. [17]. Unlike the architectures [3][10][11][12][14][15], in which there is a latency due to the cascaded arithmetic coding engine, this architecture initiates the decoding of two bins simultaneously by prediction. However, due to mis-prediction only 0.41 bin/cycle is achieved, although at a high frequency of 303MHz. Algorithmic optimizations can expose far more parallelism than is available in the original application algorithm. A novel algorithm is proposed by Sze et al. [5] which is fundamentally parallel in nature and deterministically en-/decodes several (N) bins with different contexts at the same time. The context models for the different bins are determined simultaneously using conditional probabilities, which is different from the predictive strategy [17] and the cascaded approaches [3][10][11][12][14][15]. The two possible context models that could be used for the second bin are determined by taking into account the two possible values of the first bin (0 or 1). A software implementation (N=2) enables 2 bins/cycle at a cost of a 0.76% increase in bit rate compared to the original CABAC algorithm. However, such optimizations require 3 to 4 multiplications for two-bin en-/decoding, as well as comparators, which could make a hardware implementation costly in resources.
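The common idea behind these two-bin schemes can be sketched as follows: both candidate context models for the second bin are prepared while the first bin is still being resolved, and the correct one is late-selected. The context-selection function below is hypothetical; it only illustrates the structure of the speculation, not any specific published architecture.

```python
def decode_two_bins(decode_bin, contexts, ctx_idx1, ctx_idx_for):
    """Speculatively prepare both possible contexts for bin 2.

    decode_bin(ctx)   -> decoded bin value (assumed decoding primitive)
    ctx_idx_for(bin1) -> context index of bin 2 given the value of bin 1
                         (hypothetical selection function)
    """
    # Fetch bin 1's context and, in parallel in hardware, both candidates for bin 2.
    ctx1      = contexts[ctx_idx1]
    ctx2_if_0 = contexts[ctx_idx_for(0)]
    ctx2_if_1 = contexts[ctx_idx_for(1)]

    bin1 = decode_bin(ctx1)
    # Late select: use the pre-fetched candidate that matches the outcome of bin 1.
    bin2 = decode_bin(ctx2_if_1 if bin1 else ctx2_if_0)
    return bin1, bin2
```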
4.3 Pipeline Hardware Accelerators
Although the parallelism in the form of multi-bin processing in CABAC outperforms the conventional approach, it increases the complexity of the architecture, especially in the renormalization and context management. The cascaded multiple
processing engines also increase the critical path delay. Moreover, the hardware resources increase considerably, with not much gain from the acceleration [4]. In addition, multi-bin processing only accelerates the decoding of certain frequent SEs, is not equally effective for all SEs, and the number of cycles per bin processing varies. Therefore, the pipeline concept of hardware acceleration is also utilized in CABAC, with the prime goal of achieving real-time performance for HD video. A number of pipeline schemes have been proposed to effectively overcome the problems and complexities of the other schemes discussed earlier. However, pipeline hazards appear as a byproduct of pipelining, due to the tight dependencies in the CABAC algorithm. There are two pipeline hazards in CABAC: data and structural. A data hazard occurs when the same context model is used for the next bin as for the current bin, which is a read-after-write (RAW) data hazard. A structural hazard occurs when the context memory is accessed at the same time by the context model write for the current bin and the context model read for the next bin. These hazards cause pipeline stalls that decrease the throughput of a purely pipelined architecture from the maximum of 1 bin/cycle to a lower value. Below, the details of the pipeline schemes, the solutions for pipeline hazards and the performance of the proposed pipeline accelerators are discussed. Zheng et al. [18] proposed a two stage pipeline decoding architecture for residual SEs only. The stalls in the pipeline are eliminated using the standard look-ahead (SLA) technique, which determines the context model for the next bin using both possible values of the current bin. The proposed architecture supports HD1080i video. This SLA approach is also used in the pipeline architectures [19][20][21]. Yi et al. [22] proposed a two stage pipeline decoding architecture, instead of the usual 4 stages, to reduce the pipeline latency and to increase the throughput. The data hazards are removed using a forwarding approach, and the structural hazards by using a context model reservoir (CMR) with the context memory. However, SE switching causes stalls due to the CMR update, and this limits the throughput to an average of 0.25 bin/cycle. This problem is solved in [23] by using an SE predictor, which increases the throughput to 0.82 bin/cycle. Li et al. [4] proposed a three stage dynamic pipeline codec architecture. The pipeline is dynamic in the sense that the pipeline latency varies between one and two cycles depending on the bin type. No pipeline stalls occur for BB, or for RB of value MPS with the interval range (R) within its limit. For data hazard removal a pipeline bypass scheme is used, and for structural hazards a dual-port SRAM. The bin processing rate of [18][23] is higher than that of [4], because of the coarse pipeline stages with efficient context management. Tian et al. [24] proposed a three stage pipeline encoding architecture. Two pipeline buffers are introduced to resolve the pipeline hazards and the latency issue of [4], which results in a throughput of exactly 1 bin/cycle. Chang [25] proposed a three stage pipeline architecture that combines the different speed-up methods proposed earlier, like: reduction of pipeline stalls due to SE switching, context model clustering to decrease context memory accesses, and a two-bin arithmetic decoding engine. The architecture achieves an average throughput of 0.63 bin/cycle at a comparatively high frequency, as shown in Table 1.
Table 1. Comparison of Different Hardware Accelerator Architectures

Approach           Design        Freq.  Throughput        VLSI Tech.  Circuit Area         Resolution
                                 (MHz)  Bin(s)/Cycle      TSMC (µm)   (gates)              Support
Datapath/Control   [8]  Codec    30     0.2               Virtex-II   80,000 (Inc.)*       SD480i@30fps
                   [9]  Decoder  200    0.33∼0.5          0.13        138,226 (Inc.)       CIF@30fps
Parallel           [3]  Decoder  149    1∼3               0.18        0.3mm2+32x105 reg    SD480@30fps
                   [14] Encoder  186    1.9∼2.3           0.35 AMS    19,426 (Exc.)        CIF, HD
                   [15] Decoder  45     <1                0.18        42,000 (Exc.)        HD1080i@30fps
                   [17] Decoder  303    0.41              0.18        -                    SD480i@30fps
Pipeline           [4]  Codec    230    0.60Enc/0.50Dec   0.18        0.496mm2 (Inc.)      HD1080i@30fps
                   [18] Decoder  160    1                 0.18        46,400 (Inc.)        HD1080i@30fps
                   [22] Decoder  225    0.25/0.82 [23]    0.18        81,162+12.18KB       HD1080p@25fps
                   [24] Encoder  186    1                 0.35 AMS    19,100 (Exc.)        -
                   [25] Decoder  250    0.63              0.18        35,615 (Exc.)        HD1080p@30fps
Parallel pipeline  [21] Decoder  200    1.27              0.18        28,956+10.81KB       HD1080i@30fps
ASIP/ISE           [31] Decoder  120    0.021/0.028**     -           -                    -

*Context Memory included in the area calculation.  **LPS/MPS bins.
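To make the two hazard types of Section 4.3 concrete, the sketch below shows how a simple two-stage decoder model would detect them between consecutive bins; the stage structure and names are illustrative assumptions, not a description of any surveyed design.

```python
def classify_hazards(curr_ctx_idx, next_ctx_idx, single_port_ctx_mem=True):
    """Hazards between two consecutive bins in a two-stage model
    (stage 1: context fetch, stage 2: decode + context write-back)."""
    hazards = []
    if curr_ctx_idx == next_ctx_idx:
        # RAW data hazard: next bin needs the context the current bin is still updating.
        hazards.append("data (RAW on context model)")
    if single_port_ctx_mem:
        # Write-back of bin i collides with the fetch of bin i+1 on a single-ported
        # context memory -> structural hazard (avoidable with a dual-port SRAM).
        hazards.append("structural (context memory port)")
    return hazards

print(classify_hazards(37, 37))                             # both hazards
print(classify_hazards(37, 52, single_port_ctx_mem=False))  # no hazard, no stall
```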
4.4 Parallel Pipeline Hardware Accelerators
The parallel pipeline schemes combine the acceleration features of both approaches, which often results in a very fast accelerator. We could benefit fully from this approach if we were able to process multiple bins in a pipelined fashion without any stall. Although this cannot be fully achieved, because it would make the accelerator architecture very complex or even impossible to design, a limited practical application is possible by utilizing the characteristics of the SEs, like the processing of a single regular bin with one or more bypass bins in a parallel pipelined fashion. This approach drastically improves the throughput, which is the requirement of future high quality and high resolution systems. Shi et al. [21] proposed a parallel pipeline approach for real-time decoding of HD video with 4 stages that can decode 1RB or 2BB bin(s)/cycle without any stall. Structural hazards are solved using two dual-port SRAMs, and data hazards using a forwarding technique and redundant circuitry. Two bypass bins are processed in parallel with no pipeline stalls due to switching from the regular to the bypass mode and back to the regular mode, which makes this architecture unique. Due to the processing of multiple bypass bins in the pipeline, an average throughput of 1.27 bins/cycle is achieved.
ASIP/ISE Based CABAC Accelerators
The configurability and extensibility makes ASIP interesting option for the highend adaptive applications. The extensibility in the form of ISE could be used to
CABAC Accelerator Architectures for Video Compression
33
cope with evolving standards and results in an efficient real-time processing for high resolution video applications. Flordal et al. [26] proposed a multi-standard (JPEG2000, H.264) CABAC encoder. A Multi-branch instruction is proposed here for the renormalization in CABAC. Unfortunately, this approach results in unsatisfactory performance for HD video. The work of Osorio et al. [27] is also in this direction, but it is based on an array of simple processors implementing the different tasks of various entropy en-/decoders (e.g for CABAC 5 processors array). Comparative study with Texas Instruments TMS320C6711 VLIW DSP shows only results for QCIF resolution video, however no real figures are given for HD video. Nunez et al. [28] extended the ISA of SPARC-compatible Leon CPU with 7 instructions just to integrate the CABAC in H.264/AVC encoder. The CABAC algorithm is implemented without context management and binarization as a single hardware accelerator unit in a pipeline style. Similarly, in [29] two new instructions are proposed for Trimedia TM3270 media-processor that accelerate only the arithmetic coding part of the CABAC, which supports only D1 resolution video. Tensilica 388VDO and Silicon HiveFlex VSP2500 video processors also utilizes specific instruction set extensions for CABAC implementation that support D1 and HD resolution videos, respectively. Since Multiprocessor SoC (MPSoC) are becoming more and more popular in accelerating the back-end of H.264/AVC. Osorio et al. [30] proposed a novel microprogrammed CABAC decoder for MPSoC based H.264/AVC Codec. Rouvinen et al. [31] utilize the Transport Triggered Architecture (TTA) for implementation of CABAC. Easy customization of TTA resources and programmable visible interconnect structure give many possibilities for the designer to get an optimized ASIP solution. Nine transport buses with other special functional units (SFU) are proposed for according to the CABAC requirement. However, 36 and 48 cycles are consumed in the MPS and LPS bins decoding, respectively. This solution is thus not suitable for HD applications but its flexibility favors multi-standard codec design. Most of the ASIP/ISE approaches discussed so far do not support HD video. We conclude that for HD video any ASIP/ISE other than implementing the whole coarse-grain CABAC accelerator as a single instruction seems to be less effective due to the introduction of extra clock cycles for each instruction fetch. 4.6
Comparison
Most of the results are compared in the previous sections along with the architecture discussion. However, the overall view is also important. The straightforward approach usually en-/decode from 0.2 to 0.5 bin/cycle, as shown in Table 1. Since the SEs are processed in a sequential manner, no substantial speed up is achieved. In the parallel approach the number of bins/cycle fluctuates between a certain maximum and minimum, as it depends on the type of SE. It may result in 2, 3 or even 4 bins/cycle if supported by the architecture as in [12][14]. In the purely pipeline approach the throughput never goes above more than 1 bin/cycle, but independent of SEs it remains at 1 or close to 1 bin/cycle. However, in the parallel pipeline approach some extra performance is obtained from the characteristics
34
Y. Jan and L. Jozwiak
of the SEs that enable to process some bins in parallel, like in reference [21] for the bypass bins. This result in average throughput of more than one bin/cycle for HD video. The decoding rate of 254Mbins/s at operating frequency of 200MHZ of this approach is much higher than ∼45Mbins/s required for HD1080i video. This can be further improved, if the processing of one or more regular bin(s) and/or one or more bypass bin(s) is performed in parallel, but with steady and balanced pipeline to maintain the throughput consistently, simple control and minimal area. Also, the parallel approach seems to be more effective in case of CABAC encoding due to the availability of next bin(s) of a SE. However, in CABAC decoding the next bin information is available only after the processing of the current bin, so pipelined architectures perform better. These kind microarchitectural decisions could be employed in high-level synthesis (HLS) tools to automate the design of such complex accelerators.
5
Conclusion
In this paper, we reviewed numerous approaches to the hardware accelerator architectures for CABAC from the viewpoint of the hardware acceleration concepts and performances. The features and issues involved in each architectural approach are discussed with focus on the real-time, high resolution and high quality video processing capabilities. From the analysis and comparisons it follows that the parallel pipeline accelerator approach seems to be the most promising, because of high and steady throughput, simple control and hardware efficiency as compared to other architectures. However, the computational requirements of the current and future multimedia systems are ever increasing and require further research on accelerator architecture concepts, as well as adequate design methodologies and EDA tools for the development of accelerator architectures.
References 1. ITU-T: Recommendation and Final Draft International Standard of Joint Video Specification (ITU-T Rec. H. 264— ISO/IEC 14496-10 AVC) (May 2003) 2. Marpe, D.a.: Context-based adaptive binary arithmetic coding in the h.264/avc video compression standard. IEEE Transactions on CSVT, 620–636 (July 2003) 3. Yu, W., et al.: A high performance cabac decoding architecture. IEEE Transactions on Consumer Electronics, 1352–1359 (November 2005) 4. Li, L., et al.: A hardware architecture of cabac encoding and decoding with dynamic pipeline for h.264/avc. J. Signal Process. Syst., 81–95 (2008) 5. Sze, V., et al.: Parallel cabac for low power video coding. In: 15th IEEE International Conference on ICIP 2008, October 2008, pp. 2096–2099 (2008) 6. Shojania, et al.: A high performance cabac encoder. In: NEWCAS, pp. 315–318 (2005) 7. Marpe, D., et al.: A highly efficient multiplication-free binary arithmetic coder and its application in video coding. In: ICIP 2003, September 2003, pp. 263–266 (2003) 8. Ha, V., et al.: Real-time mpeg-4 avc/h.264 cabac entropy coder. In: 2005 Digest of Technical Papers. In: International Conference on ICCE, January 2005, pp. 255– 256 (2005)
CABAC Accelerator Architectures for Video Compression
35
9. Chen, J.W., et al.: A hardware accelerator for context-based adaptive binary arithmetic decoding in h.264/avc. In: ISCAS 2005, May 2005, pp. 4525–4528 (2005) 10. Mei-hua, et al.: Optimizing design and fpga implementation for cabac decoder. In: International Symposium on HDP 2007, June 2007, pp. 1–5 (2007) 11. Bingbo, L., et al.: A high-performance vlsi architecture for cabac decoding in h.264/avc. In: 7th International Conference on ASICON, October 2007, pp. 790– 793 (2007) 12. Depr´ a, D.A., et al.: A novel hardware architecture design for binary arithmetic decoder engines based on bitstream flow analysis. In: SBCCI, pp. 239–244 (2008) 13. Jian, et al.: A high-performance hardwired cabac decoder. In: ICASSP 2007, pp. 37–40 (2007) 14. Osorio, R.R., et al.: High-throughput architecture for h.264/avc cabac compression system. IEEE Transactions on CSVT, 1376–1384 (November 2006) 15. Zhang, P., et al.: High-performance cabac engine for h.264/avc high definition realtime decoding. In: International Conference on ICCE 2007, January 2007, pp. 1–2 (2007) 16. Pastuszak, G.: A high-performance architecture of the double-mode binary coder for h.264.avc. IEEE Transactions on CSVT, 949–960 (July 2008) 17. Kim, C., et al.: High speed decoding of context-based adaptive binary arithmetic codes using most probable symbol prediction. In: ISCAS 2006, p. 4 (2006) 18. Zheng, J., et al.: A novel pipeline design for h.264 cabac decoding. In: Ip, H.H.-S., Au, O.C., Leung, H., Sun, M.-T., Ma, W.-Y., Hu, S.-M. (eds.) PCM 2007. LNCS, vol. 4810, pp. 559–568. Springer, Heidelberg (2007) 19. Eeckhaut, H., et al.: Optimizing the critical loop in the h.264/avc cabac decoder. In: IEEE International Conference on FPT 2006, December 2006, pp. 113–118 (2006) 20. Yang, Y.C., et al.: A high throughput vlsi architecture design for h.264 cabac decoding with look ahead parsing. In: Multimedia and Expo., pp. 357–360 (2006) 21. Shi, B., et al.: Pipelined architecture design of h.264/avc cabac real-time decoding. In: 4th IEEE International Conference on ICCSC 2008, May 2008, pp. 492–496 (2008) 22. Yi, Y., et al.: High-speed h.264/avc cabac decoding. IEEE CSVT, 490–494 (2007) 23. Son, W., et al.: Prediction-based real-time cabac decoder for high definition h.264/avc. In: IEEE International Symposium on ISCAS 2008, May 2008, pp. 33– 36 (2008) 24. Tian, X.a.: Implementation strategies for statistical codec designs in h.264/avc standard. In: 19th IEEE International Symposium on RSP, June 2008, pp. 151– 157 (2008) 25. Chang, Y.T.: A novel pipeline architecture for h.264/avc cabac decoder. In: IEEE Asia Pacific Conference on APCCAS 2008, December 2008, pp. 308–311 (2008) 26. Flordal, O., et al.: Accelerating cabac encoding for multi-standard media with configurability. In: 20th International IPDPS 2006, April 2006, p. 8 (2006) 27. Osorio, R.R., et al.: Entropy coding on a programmable processor array for multimedia soc. In: International Conference on ASAP 2007, July 2007, pp. 222–227 (2007) 28. Nunez, et al.: Design and implementation of a high-performance and silicon efficient arithmetic coding accelerator for the h.264 video codec. In: ASAP 2005, pp. 411– 416 (2005) 29. van de Waerdt, J.W., et al.: The tm3270 media-processor. In: 38th IEEE/ACM International Symposium on Microarchitecture 2005, pp. 331–342 (2005) 30. Osorio, R.R., et al.: An fpga architecture for cabac decoding in manycore systems. In: International Conference on ASAP 2008, July 2008, pp. 293–298 (2008) 31. 
Rouvinen, J., et al.: Context adaptive binary arithmetic decoding on transport triggered architectures. In: SPIE Conference Series (March 2008)
Programmable Accelerators for Reconfigurable Video Decoder Tero Rintaluoma1 , Timo Reinikka2 , Joona Rouvinen4 , Jani Boutellier2 , Pekka J¨ aa¨skel¨ainen3, and Olli Silv´en2 1
2
On2 Technologies, Oulu, Finland
[email protected] Department of Electrical and Information Engineering, University of Oulu, Finland {timo.reinikka,jani.boutellier,olli.silven}@ee.oulu.fi 3 Tampere University of Technology, Tampere, Finland
[email protected] 4 Valmet Automotive, Uusikaupunki, Finland
[email protected]
Abstract. Practically all modern video coding standards such as H.264, MPEG-4 and VC-1 are based on hybrid transform based block motion compensated techniques, that employ almost the same coding tools. The same approach is used with numerous non-standard proprietary codecs, with decoders available via Internet as browser plugins. For mobile devices power efficient hardware accelerators have been developed, but usually only a few standards are supported. Consequently, the decoding of the other formats, including the non-standard ones is done by software, sacrificing the battery life. In this paper we present programmable accelerators for arithmetic code decoding and motion compensation, that can be used with multiple video standards. These functions consume more than half of the cycles in software based decoders. Although the accelerators were originally designed for H.264 standard, they are rather generic for the respective coding tools. They have been implemented on application specific processor technology for flexibility and energy efficiency with the aim of achieving the performance needed for decoding high definition (1920x1088, 30 fps) video.
1
Introduction
Wireless wide band data connections have made the Internet available on mobile communication devices. This digital convergence has led to a situation where higher and higher demands are placed on wireless terminals. For example, High Definition (HD) video camcorder applications are coming to mobile communication devices in the next few years and numerous different video coding formats need to be supported on playback. Many of the video formats are proprietary ones, often developed to avoid the patent pool fees of the international standards. The Internet has made it simple to distribute the decoders of non-standard formats as browser plugins. This is fine for PC users. However, from the mobile device power efficiency point of view K. Bertels et al. (Eds.): SAMOS 2009, LNCS 5657, pp. 36–47, 2009. c Springer-Verlag Berlin Heidelberg 2009
Programmable Accelerators for Reconfigurable Video Decoder
37
these are unfortunate solutions, as they need to be executed in software by the CPU. For the standards based decoders hardware acceleration is occasionally available, although the number of formats supported by the same device is often limited. Table 1 below shows the relative power needs of actual software and hardware implementations of MPEG-4 [1] and H.264 [2] video decoders [3]. Table 1. Relative energy consumption of decoder implementations Software Monolithic on ARM11 hardware MPEG-4 1 0.05 H.264 2 0.25
The popular video codecs like MPEG-4, H.264 and VC-1 [4] all define hybrid transform based block motion compensated techniques that employ mostly the same coding tools. The same organization and coding tools are employed even in the non-standard codecs. This observation has ultimately been a motivator for defining MPEG Reconfigurable Video Coding (RVC) framework [5]. The idea in RVC is to send a description of the codec with the bitstream, and to reconfigure the coding tools accordingly on-the-fly. This is targeted to facilitate multi-format codec design and it apparently favors flexible software solutions. However, a this kind of configurable video codec is a big challenge for the implementors of mobile multimedia devices that need to rely on hardware acceleration and functionally pipelined computing. One proposed hardware architecture for RVC has been presented by Hsiao and Tsai [5]. In the context of this paper we propose implementing the most expensive coding tools needed by HD video decoders using programmable accelerators. This approach is more restricted than the RVC framework, but can still support a large number of video formats, both standardised and proprietary ones. The workload shares of different coding stages in a H.264 software decoder shown in Table 2 illustrate the demands. Although excluded from our treatment, it is noteworthy that motion compensation and deblocking are both FIR filtering intensive functionalities, and could be implemented on the same hardware architecture in an efficient manner. In addition to being the costliest ones, variable length code decoding and motion compensation are also the functionalities that differ most between the standards. In other words, these are the same components that have made the hardware implementations of multi-format decoders very demanding, and forced to limit the number of formats supported by the same hardware. Our Variable Length Code (VLC) decoding accelerator is demonstrated for a Context Adaptive Binary Arithmetic Code (CABAC) decoder needed in H.264. CABAC is a heavily bit-serial algorithm without data parallelism, inefficient on CPUs, and has therefore been a target for earlier hardware designs [6]. The designed motion compensation accelerator has been tested for H.264 style interpolation. Both accelerators are application specific processors based on an exposed communications type architecture. The designs have been made using
Table 2. Cycle count distribution for Main Profile H.264 software decoder

Operation           Share (%)
Interpolation       44.4
Deblocking          19.0
CABAC               12.8
Output MB writing    5.4
IQ + Transform       2.0
Memset               1.8
Intra prediction     0.9
Others              13.6
the TTA-based Codesign Environment (TCE) tools for Transport Triggered Architecture (TTA) processors [7].
2 Video Decoder Reconfigurability
Supporting multiple video formats is straightforward in software: the respective player applications or browser plugins are simply launched on demand. However, in hardware it is difficult to justify multiple decoders, as most of the functional blocks are only slightly dissimilar between the designs. As a result, the vendors of commercial hardware based multi-format decoders conserve silicon area by re-using the hardwired blocks via advanced finite-state-machine (FSM) control [8]. This way of thinking can be defended by examining the typical functionalities and organization of the hybrid motion compensated transform based video decoders in Figure 1. In the first stage, the decoder reads the video bitstream and performs entropy decoding. This is a bit-serial process that cannot be easily parallelized to improve performance. The decoded coefficients are then inverse quantized (IQ) and inverse transformed (IT) to obtain the original residual data, which is then added
Fig. 1. Hybrid motion compensated transform based video decoder
to predicted pixels. Prediction can be made both for inter and intra pictures. Intra prediction uses only information from the current frame, but inter prediction performs motion compensation (MC) by reading a block of pixels pointed to by motion vectors from the reference frame. After adding the predicted data to the residual data, deblocking filtering is usually performed to smooth block edges in the image. The coding tools in the functional blocks depend on the video formats, so supporting new formats in an existing design requires flexibility not ordinarily found in hardware. This problem can be approached from two aspects. First, the hardware design could be completely reconfigurable, that is, the decoder could be synthesized from scratch for any format. The second option is to make the functional blocks programmable, with their internal architectures optimized for certain types of computations. The latter approach is more restricted, but can employ efficient hand-optimized designs. A notable example of the first approach in its most advanced form is MPEG Reconfigurable Video Coding (RVC), which provides for reconfiguration of the decoders to support multiple standards and adoption of new coding tools [9]. It defines a conformance point at tool level, allows coding tools to be picked based on the application, and supports on-the-fly reconfiguration of the decoder. Figure 2 illustrates traditional and MPEG RVC conformance levels [5], showing how the coding tools can be grouped together and implemented on reconfigurable application specific processors. In RVC the video coding tools, such as the 8x8 IT and 1/4 pixel MC, are defined in the Video Tool Library (VTL). The decoder can be reconfigured by sending decoder and bitstream syntax descriptions as shown in Figure 3. The Decoder Description Language (DDL) defines which coding tools are needed and the connections between those coding tools. The coded bitstream format is in turn defined by the Bitstream Syntax Description Language (BSDL) [10]. The RVC standardization effort recognizes the importance of energy efficiency. Instead of relying on synthesis, the tools can be hand optimized for a particular platform such as application specific processors [11]. This is important in
Fig. 2. MPEG RVC conformance level
[Figure 3 depicts the MPEG RVC decoding flow: the transmitter/source sends the DDL decoder description, the decoder reconfiguration information described in BSDL, and the coded bitstream to the receiver/decoder, which configures its decoding process from the video tool library and outputs the decoded video; a feedback channel runs from the receiver back to the transmitter.]
Fig. 3. MPEG RVC decoding
computing intensive parts of a decoder. By using the feedback channel, the receiver can even direct the transmitter to encode a video stream that can be decoded using the coding tools supported by the receiver [9]. Our goal has been to implement the coding tools using programmable application specific processors (ASPs) without compromising throughput or energy efficiency. Although the designs were originally made to improve the flexibility of "conventional" multi-format decoders, the concepts appear to be fairly compatible with the ideas and potential of MPEG RVC.
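Before moving to the individual accelerators, the generic organization of Figure 1 can be made concrete with a short sketch of the per-frame decoding loop. The code below only illustrates the stage ordering discussed above; all type and function names are hypothetical placeholders rather than part of any particular decoder.

#include <stdint.h>

typedef struct Bitstream Bitstream;   /* opaque handles, illustrative only */
typedef struct Frame Frame;
typedef struct {
    int is_intra;
    int intra_mode;
    int mv[2];                        /* motion vector for inter prediction */
    int16_t levels[256];              /* quantized transform coefficients   */
} MbSyntax;

/* The individual stages are assumed to exist elsewhere. */
void entropy_decode(Bitstream *bs, MbSyntax *s);
void inverse_quantize(const int16_t *levels, int16_t *coeff);
void inverse_transform(const int16_t *coeff, int16_t *residual);
void intra_predict(const Frame *cur, int mb, int mode, uint8_t *pred);
void motion_compensate(const Frame *ref, int mb, const int mv[2], uint8_t *pred);
void reconstruct_mb(Frame *cur, int mb, const uint8_t *pred, const int16_t *residual);
void deblock_filter(Frame *cur);

void decode_frame(Bitstream *bs, const Frame *ref, Frame *out, int num_mb)
{
    for (int mb = 0; mb < num_mb; mb++) {
        MbSyntax s;
        int16_t coeff[256], residual[256];
        uint8_t pred[256];

        entropy_decode(bs, &s);                      /* bit-serial VLC/CABAC stage  */
        inverse_quantize(s.levels, coeff);           /* IQ                          */
        inverse_transform(coeff, residual);          /* IT                          */

        if (s.is_intra)
            intra_predict(out, mb, s.intra_mode, pred);
        else
            motion_compensate(ref, mb, s.mv, pred);  /* MC from the reference frame */

        reconstruct_mb(out, mb, pred, residual);     /* add residual, clip to 0-255 */
    }
    deblock_filter(out);                             /* smooth block edges          */
}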
3 Motion Compensation
In the decoder of block motion compensated video, a key issue is the 1/2 or 1/4 pixel precision of the motion vectors. This requires interpolating the image data in the respective block by a factor of 2 or 4 to reconstruct the result, as shown in Figure 4. In practical implementations, algorithms that combine the necessary upsampling and filtering operations in clever ways are employed. As a consequence of optimization, the interpolation algorithms may include format specific rounding rules even at the intermediate stages of the computations. The needed conditional operations are most efficiently performed with hardwired pipelines, but those make support for multiple formats difficult to provide. Efficient programmable designs are clearly desirable.
Fig. 4. Quarter pixel motion compensation
Below, to illustrate the demands, we consider motion compensation for H.264, which is a complex case. H.264 coding can employ several temporal prediction options, and is based on using 16x16 pixel macroblocks. Temporally predicted macroblocks are usually used as such or partitioned into 16x8, 8x16 or 8x8 blocks. Furthermore, the 8x8 blocks can be divided into 8x4, 4x8 or 4x4 subblocks [2]. For simplicity, the explanation below covers only the luma component. In H.264, samples at half pixel positions are interpolated by using a 6-tap FIR filter (1, -5, 20, 20, -5, 1). Half pixel positions that lie on the same row or column with full sample positions are filtered, scaled, rounded and clipped to the range [0, 255]. However, pixel positions that lie on half pixel positions both horizontally and vertically are calculated using different rounding rules. First, half pixel positions are calculated either horizontally or vertically without scaling, rounding and clipping; then the half pixel positions in the other direction are calculated from these intermediate values, and finally the results are scaled, rounded and clipped to the correct range. Such rounding and clipping variations make generic implementations much more complex to implement efficiently and easily increase the cycle count needed for interpolation. In addition, samples at quarter pixel positions are in turn derived by averaging the original samples and the samples at half pixel positions [2]. Full High Definition video (1920x1088, 30 fps) has to be processed at a rate of 244 800 macroblocks per second, so each macroblock must be processed in at most 4.084 µs. If the clock frequency of the motion compensation processor is 150 MHz, 613 clock cycles are available for each macroblock. This translates into a rather high degree of required parallelism in processing.
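To make the half-pixel rules above concrete, a plain (non-optimized) C sketch of the horizontal 6-tap filtering step is given below. It applies the (1, -5, 20, 20, -5, 1) kernel with the final scaling, rounding and clipping to [0, 255]; the two-dimensional half-pixel case with its unscaled intermediate values and the quarter-pixel averaging are omitted, and bounds handling is left to the caller.

#include <stdint.h>

static inline uint8_t clip255(int v)
{
    return (uint8_t)(v < 0 ? 0 : (v > 255 ? 255 : v));
}

/* Horizontal half-pixel interpolation with the H.264 6-tap kernel.
 * src points to full-pixel samples of one row; the caller must guarantee
 * that src[-2] .. src[width + 2] are valid (padded) samples. */
void halfpel_horizontal(const uint8_t *src, uint8_t *dst, int width)
{
    for (int x = 0; x < width; x++) {
        int sum = src[x - 2] - 5 * src[x - 1] + 20 * src[x]
                + 20 * src[x + 1] - 5 * src[x + 2] + src[x + 3];
        dst[x] = clip255((sum + 16) >> 5);   /* scale, round, clip to [0, 255] */
    }
}

For the positions that are half-pixel in both directions, the unscaled sums of one pass are fed to the second 6-tap pass and only then scaled with a wider rounding term, which is exactly where the format specific rounding differences mentioned above enter the picture.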
3.1 Application Specific Processor Implementation
The motion compensation processor shown in Figure 5 was implemented using the TCE toolkit for Transport Triggered Architectures (TTAs). The processor is most efficient for blocks that are 4, 8 or 16 columns wide, with a multiple of four rows in height. The interpolator uses 12 FIR filters, four vertical and eight horizontal ones. The same kind of structure has been used in many earlier hardwired implementations such as [12] and [13], but our approach is fully programmable. In our design the interpolation filters are implemented in their own functional unit, contributing to the flexibility that allows for handling the different rounding and clipping rules found in video coding standards. The processor can be programmed in the C language, but efficient handling of different rounding rules and complete utilization of resources required assembly coding. This way it was possible to achieve the performance required by HD video at low clock rates. It should be noted that pure software solutions running on general purpose processors also need assembly coding if SIMD instructions are to be utilized efficiently. Thanks to programmability, new features can easily be added. For instance, the motion compensation algorithms of MPEG-2, MPEG-4, and VC-1 are straightforward to add to the design.
Fig. 5. MC processor; LU (Load Unit), LSU (Load and Store Unit), GCU (Global Control Unit), ALU (Arithmetic Logic Unit)
3.2 Performance
Table 3 shows the cycles required by the application specific processor to interpolate a WxH block and a macroblock divided into WxH blocks. In addition, worst case clock frequencies for HD video (1920x1088, 30 fps) are given. When the block sizes go down, the execution time for the macroblocks increases, because the relative size of the interpolation window grows. Interpolation of a 16x16 block can require a 21x21 window, while an 8x8 block demands a 13x13 window, and a 4x4 block may need a 9x9 pixel window, which is about five times the block size. Also, block sizes such as 8x16 and 4x8 require more cycles than their transposed versions because the four FIR filters are organized for four parallel rows. For this reason, the taller blocks have more initialization overhead. For H.264 Baseline profile level 4 HD video the processor needs only a 108 MHz clock rate. In the worst case, with Main and High profiles at level 4, one macroblock requires at most 680 cycles. Then, decoding HD video requires a 166 MHz clock rate. Table 4 compares the proposed programmable implementation to earlier ones in the worst case, when a macroblock is divided into 4x4 blocks. The design in [14] uses a MIPS Plasma processor and a 4x4 block based hardware accelerator for interpolation, while [12], [13] and [15] are pure hardware implementations. In [12] luma is interpolated using 13 FIR filters and [15] uses 4 FIR filters for 4x4 based processing. In contrast, our design can interpolate a 16x16 luma block in 210 cycles whereas [12] requires about 256 cycles and [15] at least 208

Table 3. Cycles required for interpolation and clock frequency at HD video (1920x1088, 30 fps) when only P-slices are used

Block size WxH   Cycles per block   Cycles per macroblock   Clock frequency (MHz)
16x16            210                210                     51
16x8             112                224                     55
8x16             155                310                     76
8x8              85                 340                     83
8x4              43                 344                     84
4x8              55                 440                     108
4x4              29.5               472                     116
Table 4. Comparison of gate count of interpolators and maximum cycles needed to interpolate a luma macroblock

             Zatt et al. [14]   Wang et al. [12]   Li & He [15]   Tsai et al. [13]   Our Design
Cycles       1328               432                272            432-488            472
Gate count   N/A                20686              13027          21506              20500 (est.)
cycles when samples at the most complex pixel positions are interpolated. The block size 16x16 is the most common size in many typical video sequences [12]. It is estimated that the size of our motion compensation accelerator is around 40k gates, including the TTA processor without external memories and the special functional unit for chroma interpolation. The added flexibility of the programmable motion compensation processor does not incur serious performance or silicon area overheads. We conclude that the programmable design is very attractive performance-wise; it is not much more complex than the others, but it is more flexible. In fact, the high throughput enables using the C language for lower resolution decoder implementations.
4 Variable Length Code Decoding
Video codecs regularly employ variable-length coding for lossless compression, and in decoding it is among the most time consuming functions. The popular choices include variants of Huffman, arithmetic, and Golomb codes, which are all extremely difficult to support with the same hardwired solution. On the other hand, the inherently serial data flow of variable length decoding makes the exploitation of parallelism in software implementations challenging. Context Adaptive Binary Arithmetic Coding (CABAC) is an entropy coding method that losslessly compresses coefficients into a binary form according to their probabilities in the given context. The context is the type of data, such as motion vectors, transform coefficients, etc. CABAC in H.264 uses sophisticated models for the different types of data, with 460 different contexts in the High Profile [16]. The flow of the CABAC algorithm is presented in Figure 6, exposing the high degree of data-dependency that favors hardwired critical paths. Earlier designs have almost invariably resorted to this approach, resulting in dedicated solutions in which the functional blocks cannot be re-used with other codes. An interesting hardware solution for a CABAC decoder is presented by Eeckhaut et al. [6], in which the critical path has been optimized by predictive execution and advanced pipelining. The throughput is very good, 1 decoded bit per cycle, and their solution also gives hints on how to accelerate the arithmetic decoding algorithm on application specific processors. Our problem has been to find a good compromise that is not too specific to CABAC, but could be used with various range codes as well.
[Figure 6 is a flowchart of the CABAC bin decoding loop: (1) get the probability model (state and MPS values); (2) fetch rLPS from a static table according to the state and the current range, and calculate range - rLPS; (3) if value > range - rLPS, take the LPS branch: recalculate value because the zero point changes, set the new range to rLPS, change the meaning of MPS if the state is zero, and take the next state from the LPSnextState table; (4) otherwise take the MPS branch: the new range is range - rLPS and the next state comes from the MPSnextState table; (5) while range < 0x100, run the renormalization process: shift the range left and read a new bit into value.]
Fig. 6. The H.264 CABAC decoding process
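The control flow of Figure 6 can be written out compactly in C. The sketch below follows the steps of the figure (rLPS table look-up, MPS/LPS branch, state update and renormalization); the table contents and the bit input routine are assumed to come from elsewhere, and the indexing has been simplified, so it should be read as an illustration of the loop structure rather than a bit-exact H.264 decoder.

#include <stdint.h>

/* Per-context probability model: 6-bit state plus the current MPS value. */
typedef struct { uint8_t state; uint8_t mps; } CabacContext;

/* Static tables defined by the standard (contents omitted here). */
extern const uint16_t rangeTabLPS[64][4];
extern const uint8_t  transIdxMPS[64];
extern const uint8_t  transIdxLPS[64];

extern int read_bit(void);     /* assumed bitstream input, returns 0 or 1 */

/* Decode one binary symbol (bin) with the given context.
 * range and value are the registers of the arithmetic decoder. */
int cabac_decode_bin(CabacContext *ctx, uint32_t *range, uint32_t *value)
{
    uint32_t rLPS = rangeTabLPS[ctx->state][(*range >> 6) & 3];  /* steps 1-2 */
    int bin;

    *range -= rLPS;
    if (*value < *range) {                    /* MPS path (step 4)            */
        bin = ctx->mps;
        ctx->state = transIdxMPS[ctx->state];
    } else {                                  /* LPS path (step 3)            */
        bin = !ctx->mps;
        *value -= *range;                     /* zero point changes           */
        *range = rLPS;
        if (ctx->state == 0)                  /* change the meaning of MPS    */
            ctx->mps = !ctx->mps;
        ctx->state = transIdxLPS[ctx->state];
    }

    while (*range < 0x100) {                  /* renormalization (step 5)     */
        *range <<= 1;
        *value = (*value << 1) | (uint32_t)read_bit();
    }
    return bin;
}

Every decoded bin depends on the updated range, value and context state of the previous one, which is the data dependency that makes the algorithm so hostile to both SIMD software and multi-format hardwired designs.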
4.1 Application Specific Processor Implementation
Our variable length decoding processor design, shown in Figure 7, is based on the Transport Triggered Architecture template, and could ultimately include a critical path function unit for each variable length code. However, we decided to employ a more general design that includes a look-up table for Huffman codes and CABAC style renormalizations, while a multiplier is provided to enable coping with range coding schemes such as the Boolean coder used with the On2 VP6 codec [17]. The cost of flexibility is a longer critical path, which was acceptable due to the availability of earlier code-specific hardwired designs. Table 5 illustrates the performance and silicon area of hardwired, software, and application specific TTA processor implementations of the CABAC decoder.
Fig. 7. VLC processor; LSU (Load and Store Unit), Renormalize unit for CABAC, Mul (16 bit multiplier for range codes), RLPS (table look-up unit), GCU (Global Control Unit), ALU (Arithmetic Logic Unit)
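The 16-bit multiplier unit mentioned in the figure is what allows the same datapath to serve multiplication based range coders as well. As an indication of the kind of inner loop this enables, the sketch below follows the Boolean decoder structure used in On2-style codecs; the exact split computation and renormalization details differ between VP6 and its relatives, so this is an illustration rather than a reference implementation.

#include <stdint.h>

typedef struct {
    const uint8_t *buf;    /* coded data                                  */
    int       pos;         /* next byte to consume                        */
    uint32_t  range;       /* kept in [128, 255] after renormalization    */
    uint32_t  value;       /* a two-byte window of the coded bits         */
    int       bit_count;   /* bits consumed from the low byte of value    */
} BoolDecoder;

/* Decode one binary symbol whose probability of being 0 is prob/256.
 * The split is computed with a multiplication, hence the multiplier FU. */
int bool_decode(BoolDecoder *d, uint8_t prob)
{
    uint32_t split = 1 + (((d->range - 1) * prob) >> 8);
    uint32_t big_split = split << 8;
    int bit;

    if (d->value >= big_split) {        /* symbol 1 */
        bit = 1;
        d->range -= split;
        d->value -= big_split;
    } else {                            /* symbol 0 */
        bit = 0;
        d->range = split;
    }
    while (d->range < 128) {            /* renormalize one bit at a time */
        d->range <<= 1;
        d->value <<= 1;
        if (++d->bit_count == 8) {      /* shift in a new input byte     */
            d->bit_count = 0;
            d->value |= d->buf[d->pos++];
        }
    }
    return bit;
}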
Table 5. The performances and silicon areas of CABAC arithmetic decoder implementations in 90 nm CMOS technology

              General purpose processor (ARM926EJ-S)   Our design     Hardwired accelerator
Cycles/bit    172-248                                  51-66          1-4
Area (mm2)    1.40 [18]                                0.085 (est.)   0.093
The hardwired decoder is a commercial footprint-optimized (33k gates) design used with a monolithic HD resolution video decoder. It performs best by a wide margin, but is only intended for CABAC. The code for the programmable processors has been written in the C language. The application specific processor runs CABAC about three times faster than an embedded microcontroller, but at realistic clock rates they both fall short of decoding HD resolution bitstreams, which in bursts can go up to 20 Mbit/s. However, even with the current design, the TTA processor, when clocked at 166 MHz, decodes 3 Mbit/s, which is satisfactory for mobile applications. Assembly coding could speed up both implementations approximately by a factor of 2-3. In mass market devices the silicon area of the individual designs can be an important cost issue. The CABAC software implementation on a general purpose RISC processor, such as the ARM9, uses most of its resources and consumes a relatively large silicon area. The silicon needs of the hardwired CABAC accelerator and the TTA processor are close to each other, and the inclusion of the CABAC critical path as a special function unit in the TTA processor would not result in a significant increase in area. In fact, the area of a separate CABAC functional unit would be at the same level as the dedicated hardwired accelerator. Unfortunately, power figures are currently not available.
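The 3 Mbit/s figure is consistent with the cycle counts of Table 5: taking a representative 55 cycles per decoded bin from the 51-66 range,

\[
\frac{166 \times 10^{6}\ \text{cycles/s}}{55\ \text{cycles/bit}} \approx 3.0\ \text{Mbit/s},
\]

while the 1-4 cycles per bit of the hardwired accelerator leaves it a comfortable margin for 20 Mbit/s bursts at the same clock rate.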
5 Conclusions
Programmable accelerators based on application specific processors make it possible to implement multi-standard video decoders that have sufficient performance to process high definition video without losing power efficiency. At the same time it is possible to maintain the flexibility needed to support reconfigurability for several video coding standards. These qualities are hard to obtain with plain hardware or software solutions, even though the performance improvements of application and DSP processors targeted at mobile multimedia devices have been substantial in the last few years. Based on our simulations, the designed accelerator for motion compensation offers almost the same level of performance as pure hardware solutions and is capable of supporting other video coding standards in addition to H.264. Furthermore, our accelerator for arithmetic code decoding can offer 3-4 times better performance in comparison to a software implementation on an ARM9 processor. However, there are still opportunities for improvement. To achieve performance
close to monolithic hardware would require including the hardwired CABAC critical path in the programmable solution.
Acknowledgments. This research has been funded by the Finnish Technology Development Agency (TEKES). We also wish to thank Professor Jarmo Takala for his contributions.
References
1. ISO/IEC 14496-2:2004: Information technology - Coding of audio-visual objects - Part 2: Visual. ISO/IEC, Third edn. (June 2004)
2. ISO/IEC 14496-10:2005; Recommendation ITU-T H.264: Series H: Audiovisual and Multimedia Systems - Infrastructure of audiovisual services - Coding of moving video: Advanced video coding for generic audiovisual services. ITU-T (November 2005)
3. Fitzek, F.H.P., Reichert, F.: Mobile Phone Programming: and its Application to Wireless Networking. Springer, Heidelberg (2007)
4. SMPTE 421M-2006: VC-1 Compressed Video Bitstream Format and Decoding Process. SMPTE (February 2006)
5. Jer-Min, H., Chun-Jen, T.: Analysis of an SoC architecture for MPEG reconfigurable video coding framework. In: IEEE International Symposium on Circuits and Systems, ISCAS 2007, May 27-30, pp. 761–764 (2007)
6. Eeckhaut, H., Christiaens, M., Stroobandt, D., Nollet, V.: Optimizing the critical loop in the H.264/AVC CABAC decoder. In: IEEE International Conference on Field Programmable Technology, FPT 2006, December 2006, pp. 113–118 (2006)
7. Jääskeläinen, P., Guzma, V., Cilio, A., Pitkänen, T., Takala, J.: Codesign toolset for application-specific instruction-set processors, vol. 6507. SPIE (2007) 65070X
8. On2 Technologies (2008), http://www.on2.com
9. ISO/IEC JTC1/SC29/WG11 N8069: Reconfigurable Video Coding Requirements v.2.0. ISO/IEC (November 2006)
10. Richardson, I., Bystrom, M., Kannangara, S., Frutos, D.: Dynamic configuration: Beyond video coding standards. In: IEEE System on Chip Conference. IEEE, Los Alamitos (2008)
11. Lucarz, C., Mattavelli, M., Thomas-Kerr, J., Janneck, J.: Reconfigurable media coding: A new specification model for multimedia coders. In: IEEE Workshop on Signal Processing Systems, October 2007, pp. 481–486 (2007)
12. Wang, S.Z., Lin, T.A., Liu, T.M., Lee, C.Y.: A new motion compensation design for H.264/AVC decoder. In: IEEE International Symposium on Circuits and Systems, ISCAS 2005, May 2005, vol. 5, pp. 4558–4561 (2005)
13. Tsai, C.Y., Chen, T.C., Chen, T.W., Chen, L.G.: Bandwidth optimized motion compensation hardware design for H.264/AVC HDTV decoder. In: 48th Midwest Symposium on Circuits and Systems, August 2005, vol. 2, pp. 1199–1202 (2005)
14. Zatt, B., Ferreira, V., Agostini, L.V., Wagner, F.R., Susin, A.A., Bampi, S.: Motion compensation hardware accelerator architecture for H.264/AVC. In: Mery, D., Rueda, L. (eds.) PSIVT 2007. LNCS, vol. 4872, pp. 24–35. Springer, Heidelberg (2007)
15. Li, Y., He, Y.: Bandwidth optimized and high performance interpolation architecture in motion compensation for H.264/AVC HDTV decoder. J. Signal Process. Syst. 52(2), 111–126 (2008)
16. Marpe, D., Schwarz, H., Wiegand, T.: Context-based adaptive binary arithmetic coding in the H.264/AVC video compression standard. IEEE Transactions on Circuits and Systems for Video Technology 13(7), 620–636 (2003)
17. On2 Technologies (2008), http://www.on2.com
18. ARM (2008), http://www.arm.com/products/cpus/arm926ej-s.html
Scenario Based Mapping of Dynamic Applications on MPSoC: A 3D Graphics Case Study

Narasinga Rao Miniskar1,3, Elena Hammari2, Satyakiran Munaga1,3, Stylianos Mamagkakis1, Per Gunnar Kjeldsberg2, and Francky Catthoor1,3

1 IMEC, Kapeldreef 75, Leuven 3001, {miniskar,satyaki,mamagka,catthoor}@imec.be
2 NTNU, 7491 Trondheim, Norway, {hammari,per.gunnar.kjeldsberg}@iet.ntnu.no
3 K.U. Leuven, ESAT Dept., Leuven 3001
Abstract. Modern multimedia applications are becoming increasingly dynamic. The state-of-the-art scalable 3D graphics algorithms are able to adapt their hardware resource allocation requests at run-time according to the input, resource availability and a number of quality metrics. Additionally, the resource management mechanisms are becoming more dynamic themselves and are able to cope efficiently at run-time with these varying resource requests, the available hardware resources and competing requests from other applications. In this paper, we study the dynamic resource requests of the Wavelet Subdivision Surfaces (WSS) based scalable 3D graphics application. We also show how to schedule its computational resources at run-time with the use of the Task Concurrency Management (TCM) methodology and the System Scenario based approach on an MPSoC platform with very heterogeneous Processing Elements (including RISC, VLIW and FPGA accelerator resources).
1 Introduction
It is common for embedded hardware platforms, nowadays, to feature multiple Processing Elements (PEs) and thus to be able to execute highly complex multimedia algorithms with large computational resource requirements. Scalable 3D graphics algorithms can present multimedia content that is created once and then scaled each time to match the characteristics of the embedded system where it is deployed (e.g., according to different consumer device displays) [7]. Therefore, such software applications have very dynamic computational resource requests, because the timing and size of each request is not known at design-time and is only determined at run-time based on the user actions and the dynamic scalability response of the algorithm itself according to the quality targeted [13]. Additionally, it is very likely that other software applications will be executing concurrently on the embedded platform, sharing the available resources dynamically as they are loaded and unloaded at run-time (e.g., you receive an email as you play a 3D game on your mobile phone). Therefore, scheduling the
tasks of these software applications is very challenging, because it is not known at design-time: (i) the computational resource requests of each task, (ii) the number of tasks executed by each software application and (iii) the combination of software applications that will be executed concurrently. The solution given today to this problem is calculating the worst case resource request of any combination of tasks and software applications and allocating resources accordingly at design-time. This solution requires a huge amount of computational resources and is very energy inefficient, as it cannot adapt the scheduling to the actual run-time real-time demands of the software applications. The scheduler therefore not only pre-allocates but also uses the maximum amount of processing resources. In this paper, we characterize the run-time computational resource needs of a very demanding and dynamic scalable 3D graphics application [13], which executes on a heterogeneous Multiple Processor System on Chip (MPSoC) concurrently with other applications. For this case study, we implement the Task Concurrency Management (TCM) scheduling methodology [9] to schedule its tasks both at design-time and run-time, and also implement the System Scenario approach [5] to avoid using one worst-case resource requirement solution. It is the first time a hardware accelerator is included in an MPSoC platform used for run-time scheduling algorithms like TCM. The computationally demanding Wavelet Subdivision Surfaces (WSS) software for scalable 3D graphics applications, as well as three more task graphs generated with TGFF [2], will be scheduled on the aforementioned platform. The structure of this paper is as follows. The next section describes the related work. In Section 3, we describe the Task Concurrency Management (TCM) methodology and the System Scenario approach. In Section 4, we describe the Wavelet Subdivision Surfaces (WSS) case study, implement the methodology and show some experimental results. Finally, in Section 5, we draw our conclusions.
2 Related Work
In the context of scenarios, scenario-based design [1] has been used for some time in both hardware [10] and software design [3] of embedded systems. They use use-case diagrams [4], which enumerate, from a functional and timing point of view, all possible user actions and the system reactions that are required to meet a proposed system function. These scenarios are called use-case scenarios and do not concentrate on the resources required by a system to meet its constraints. In this paper, we concentrate on a different and complementary type of scenarios, which we call system scenarios [5]. These are derived from the combination of the behavior of the application and its mapping on the system platform. These scenarios are used to reduce the system cost by exploiting information about what can happen at run-time to make better design decisions at design-time, and to exploit the time-varying behavior at run-time. While use-case scenarios classify the application's behavior based on the different ways it can be used, system scenarios classify the behavior based on the multi-dimensional cost
trade-off during the implementation trajectory. In the context of this paper, whenever scenarios are mentioned they imply system scenarios. In the context of task scheduling, a good overview of early scheduling algorithms can be found in [11]. This paper uses the term task scheduling for both the ordering and the assignment. Scheduling algorithms can be roughly divided into dynamic and static scheduling. In a multiprocessor context, when the application has a large amount of non-deterministic behavior, dynamic scheduling has the flexibility to balance the computation load of the processors at run-time and to make use of the extra slack time coming from variations from the Worst Case Execution Time (WCET). In the context of this paper, we have selected the Task Concurrency Management methodology [9] to implement a combination of design-time and run-time scheduling, which can balance performance versus energy consumption trade-offs.
3 Task Concurrency Management and System Scenarios Overview
As can be seen in Fig. 1, we use the Task Concurrency Management (TCM) methodology and the System Scenario approach to do the task scheduling of WSS and any other applications that might be executing concurrently on the MPSoC platform. The two phases shown on the left are carried out at design-time and the two phases on the right side are carried out at run-time. All four phases are part of the System Scenario approach [5] and are instantiated for the TCM scheduling [9].
– At the Scenario identification phase, we define a number of scenarios for each application executed on the selected MPSoC platform. These scenarios are defined based on typical application inputs and their impact on the control flow and data flow of the software application. For each one of these scenarios, we evaluate the execution time and energy consumption of each task if it would be mapped on any of the Processing Elements (PEs) of the selected MPSoC platform. We extract these energy and timing values at design-time via simulation and profiling and insert them, in addition to the application's task graph, into the Grey box model for TCM scheduling [9].
– At the Scenario exploitation phase, we produce at design-time a set of Pareto-optimal schedules using the Grey box model in the TCM methodology. Each one of these schedules is a specific ordering of one scenario's tasks and their assignment on specific PEs. Each schedule is represented by a Pareto point on an energy vs. performance Pareto curve and each scenario is represented by the Pareto curve itself.
– At the Scenario detection phase, we monitor and detect at run-time the pre-calculated scenarios. Each time a scenario is detected, the Scenario switching mechanism is initiated. Additionally, the energy and real-time constraints are monitored in order to select the optimal performance vs. energy consumption trade-off.
Fig. 1. Four phase System Scenario based TCM methodology
– At the Scenario switching phase, when a scenario is detected, the TCM run-time scheduler selects and implements one of the pre-calculated Pareto-optimal schedules, thus switching from one Pareto curve to another. If the TCM run-time scheduler evaluates that a constraint (e.g., real time) will not be met, then it switches from one Pareto point to another (within the same Pareto curve) in order to meet this constraint at the expense of another resource (e.g., energy). The final result at every moment at run-time is the execution of one Pareto-optimal, pre-selected schedule.
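A minimal sketch of this run-time selection step is given below. The data layout and names are our own illustration of the idea (pick the most energy efficient pre-computed schedule whose execution time still meets the deadline), not the actual TCM scheduler code.

#include <stddef.h>

/* One pre-computed design-time schedule, i.e. one Pareto point. */
typedef struct {
    double exec_time_ms;   /* predicted execution time of this schedule       */
    double energy_mj;      /* predicted energy consumption                    */
    int    schedule_id;    /* handle used to activate the ordering/assignment */
} ParetoPoint;

/* One scenario = one Pareto curve, assumed sorted by increasing execution
 * time (and therefore decreasing energy, by Pareto optimality). */
typedef struct {
    const ParetoPoint *points;
    size_t             num_points;
} ParetoCurve;

/* Return the most energy efficient point that still meets the deadline,
 * or the fastest point if none does (a deadline miss is then unavoidable). */
const ParetoPoint *select_schedule(const ParetoCurve *curve, double deadline_ms)
{
    const ParetoPoint *best = &curve->points[0];   /* fastest, highest energy */
    for (size_t i = 0; i < curve->num_points; i++) {
        if (curve->points[i].exec_time_ms <= deadline_ms)
            best = &curve->points[i];              /* later points use less energy */
        else
            break;
    }
    return best;
}

Scenario switching then simply repeats this selection on the Pareto curve of the newly detected scenario.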
4 WSS Characterization and Task Scheduling

4.1 Software Application Description
3D content made available on the Internet, such as X3D/VRML and MPEG-4 content, is transported over and consumed on many different networks (wireless and wired) and terminals (mobile phones, PDAs, PCs, ...). A WSS based scalable 3D graphics framework is proposed in [13], where the object's quality dynamically adapts to the viewing conditions while respecting constraints such as platform resources. Based on the user commands (e.g., Move forward, Move
backward, rotate, pitch, yaw, etc.), the best triangle budget and the related Level Of Detail (LOD) settings for each visible object are decided online at run-time. Once the triangles are obtained for each visible object, they are passed to the renderer. Depending on the abrupt changes in the number of visible objects caused by the user commands, the object-level LOD settings will vary. In the implemented framework, the overall scene-level quality is fixed. The software modules responsible for the processing of each 3D scene content are given below.
– Initialization Module
  • Process Scene Description: Reads the 3D scene information and all 3D objects' base meshes, textures, pre-processed quality error details, and scenario based visibility and area data, and holds it in its internal data structure, shown as the Mesh Database (DB). This is a one-time initialization process of the 3D content.
  • WSS-Decoder: The WSS based decoding [13] is a very computation heavy task (∼1.8 secs on the TI-C64X+ and ∼33 secs on the StrongARM processor), thus this task is mapped to a hardware accelerator and all the objects are decoded to the maximum LOD in the initialization phase itself. The triangles for all LOD settings are held in storage memory.
– Frame based invocation modules: These modules are called for each frame when the user presses a command like rotate, yaw, pitch, etc.
  • Check Visibility: This module checks the visibility of each object based on the bounding box concept explained in [13] and outputs the visibility information to the Mesh DB.
  • Build Global Pareto curve: This module generates the global Pareto curve (trade-off curve) based on the error contributed by each visible object and its corresponding triangle budget. It uses the gradient descent algorithm to build this [13]. After building the global Pareto plot, this module calls the decoder based on the changes in the quality parameters.
  • Converter (Prepare for Renderer): Finally, this module converts the triangles into vertices in the 3D scene, which are later passed to the renderer [13].
For each frame, the application calls the above mentioned modules in the sequence shown above. In this application there is object level parallelism, as the objects can potentially be handled in parallel, which enables multiple parallelized versions of this application. However, in the context of this paper, we want to show the benefits of TCM scheduling even without exploiting parallelism (this is listed as future work).
4.2 Target Platform
The generic MPSoC platform that we decided to use in this paper is shown in Fig. 2. More specifically, we have considered a heterogeneous platform with two RISC processors (StrongARM 1100x), two VLIWs with six FUs each
(TI-C64X+) and one RISC processor (StrongARM 1100x) with an FPGA hardware accelerator (Virtex-5 XC5VSX95t, speed grade -3) as a co-processor. These handle control functions, and data and instruction level parallelism, respectively. The StrongARM 1100x processors run at 1.48 V, 0.1632 A, with a clock frequency of 133 MHz, the TI-C64X+ processors run at 1.2 V with a clock frequency of 500 MHz, and the FPGA runs at 100 MHz and 1.0 V. We assume a cache memory (C) for each processor and a single shared global memory (SM). The hardware accelerator is implemented with a Virtex-5 SXT FPGA from Xilinx for the purpose of accelerating the GlobalParetoBuildTime software module of the WSS task. The communication topology considered is a cross-bar switch. Any potential implications of communication or memory latency bottlenecks are not calculated in the context of this paper and are subject to future study.
Fig. 2. Target platform
Fig. 3. Gray-box-model of WSS
In order to apply the TCM methodology we require energy and timing information for each task (see Thread Frames and Thread Nodes in [9]) for the Gray-box model. For these experiments we have used profiling information from the StrongARM and TI-C64X+ processor simulators (the SimIt-ARM simulator and TI CCStudio v3.3, respectively). For obtaining energy information for tasks running on the StrongARM processor we have used JouleTrack [12]. For the TI-C64X+ energy profiling, we have used the functional level power analysis model of the TI-C6X [8], modified for the functional units, instruction set, and memory hierarchy of the TI-C64X+. Parts of the heterogeneous platform have been used for experiments in earlier papers [9]. It is, however, the first time an FPGA-based hardware accelerator is included. We have explored the performance and energy consumption of parts of the GlobalParetoBuild task when implemented on a Virtex-5 SXT FPGA from Xilinx. The parts selected for hardware acceleration are the division operation and its combinations with square root, multiplication and subtraction operations, which appear in the application code but do not have dedicated instructions in the StrongARM and TI-C64X+ processors. Ready-to-use division, square root, multiplication and subtraction IP blocks from the library of the high-level
hardware development tool Xilinx System Generator [14] were utilized. The low-level hardware description for downloading to the FPGA was generated using the Xilinx ISE Design Suite, and its Timing Analyzer and XPower Analyzer tools, along with the ModelSim simulator, were employed to find the execution time and energy consumption of the obtained hardware accelerator.
4.3 Inputs to 3D-WSS and Scenarios
In the experiments below, we consider WSS in a scalable 3D graphics engine for 3D games. In the game example there is an input representing 3 rooms (R1, R2 and R3), with 13 objects (fruits) in R1, 17 objects in R2 and 22 objects in R3. There are 4 cameras (C1, C2, C3 and C4) fixed in 4 positions in each room to see the objects from different angles. As an in-game character enters from one room to another, the best camera position can be chosen by the application controller according to the in-game cinematic effects. In each room of the game we have applied a sequence of user commands that define the in-game character's movement and perspective. We have applied the systematic scenario identification methodology proposed in [5] based on an in-depth analysis of the application. We have identified a few parameters that cause variation in the software modules of this application (the room number and the camera position selected by the application). Based on the room in which the in-game character is present and the camera position, we can derive 12 scenarios (i.e., 3 rooms times 4 camera positions). For each scenario we obtain one Gray-box model for the TCM scheduling. For each scenario, we have considered worst case estimates of the number of objects, the number of Pareto points over all objects, and the number of triangles, and computed the Worst Case Execution Time (WCET) for all three software modules of this application. There is still dynamism in the execution time within each scenario, because of the varying number of visible objects (e.g., 8 to 13 in R1), the varying number of Pareto points (e.g., 200 to 350 in R1), and the varying number of triangles (e.g., 47000 to 50000 in R1) in each room. However, we do not yet have a methodology to deal with the dynamism due to these kinds of variables (data variables), which take a large range of values during program execution. In this paper we focus only on the parameters that can be handled through the systematic scenario identification methodology. As we have computed the WCETs using profiling, we cannot give any hard guarantees based on these execution times.
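Before clustering, the scenario index follows directly from the two identified parameters; a trivial sketch of this mapping (our own illustration, not code from the framework) is:

/* Map the two run-time parameters onto one of the 12 pre-clustering
 * scenarios.  room is 1..3 (R1..R3) and camera is 1..4 (C1..C4). */
static inline int scenario_index(int room, int camera)
{
    return (room - 1) * 4 + (camera - 1);   /* 0 .. 11 */
}

The clustering step described next then maps these 12 indices onto the smaller set of task-graph level scenarios, e.g. through a small look-up table filled at design time.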
4.4 Gray Box Modelling
The task graph of the WSS application is shown in Fig. 3; it has 3 tasks: CheckVisibility (T1), Build-Pareto & Selection (T2), and Prepare Triangles (T3). The Gray-box model for the 3D-WSS is constructed based on this task graph and the captured profiling information (energy, execution time). For this we need to consider the WCET for each scenario and for each module of WSS. The WCETs of the 3 modules for the StrongARM & TI are shown in Table 1. However, not all scenarios are required, due to the similarity in their WCETs. We have clustered the WCETs manually according to the systematic clustering approach [5,6], and they
are represented in colors for each module in Table 1. For the GlobalParetoBuild module the clusters are (C1,C2,C4) and (C3) for each room, hence 6 scenarios in total instead of 12. Similarly, the scenarios derived after clustering for the CheckVisibility module are 3 for all rooms, and for the PrepareRender module 8 for all rooms. We have repeated the clustering for the TI processor WCETs too; however, we have obtained the same clusters as for the StrongARM processor. Each cluster is treated as one scenario for the module. The scenarios obtained at this point are at sub-task level. However, we need the scenarios at the complete task-graph level, using which we optimize the Pareto-optimal mappings at run-time. Based on the clusters we have obtained from the sub-tasks, we have selected the sub-task clustering with the maximum number of clusters among all. We have checked whether all other clusters fall under this; if not, we can further cluster the set. In the case of WSS, the maximum clusters obtained from the PrepareRender module are sufficient for the whole task-graph level too. In the end, we have obtained a total of 8 scenarios for the task graph.

Table 1. Scenario clustering of WCETs (m.secs) for StrongARM & TI

Ri, Ci   CheckVisibility      GlobalParetoBuild    PrepareRender
         ARM      TI          ARM      TI          ARM      TI
R3-C1    45.036   0.97        106.91   4.72        386.08   25.23
R3-C2    44.772   0.96        101.97   4.6         426.97   29.08
R3-C3    45.587   1.01        35.63    1.72        174.44   10.85
R3-C4    45.656   1.00        102.76   4.62        389.35   24.71
R2-C1    34.832   0.74        79.43    3.49        362.65   23.13
R2-C2    34.626   0.74        73.08    3.31        374.22   25.10
R2-C3    35.198   0.78        30.71    1.45        154.33   9.43
R2-C4    35.266   0.77        73.57    3.32        359.25   22.33
R1-C1    26.624   0.57        52.11    2.31        268.52   17.09
R1-C2    26.509   0.57        50.23    2.27        274.12   18.02
R1-C3    26.892   0.59        18.47    0.88        99.51    5.93
R1-C4    26.905   0.58        45.36    2.07        210.06   14.39
After the Scenario identification phase, the Gray-box model is constructed based on the obtained WCET profiling information for the 8 scenarios. An example of the Gray-box model generated per scenario is shown in Fig. 3.
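In data structure terms, the per-scenario Gray-box information handed to the design-time scheduler boils down to a small table of per-task, per-PE cost pairs plus the task-graph dependencies. The sketch below is our own simplified rendering of that idea and not the actual TCM model format.

#define NUM_TASKS    3   /* T1 CheckVisibility, T2 Build-Pareto & Selection, T3 Prepare Triangles */
#define NUM_PE_TYPES 3   /* StrongARM, TI-C64X+, ARM with FPGA accelerator */

/* Worst-case cost of mapping one task onto one processing element type. */
typedef struct {
    double wcet_ms;      /* worst-case execution time in this scenario */
    double energy_mj;    /* corresponding energy estimate              */
} TaskCost;

/* Gray-box model of one scenario: the cost table and the (chain-shaped)
 * task-graph dependencies. */
typedef struct {
    TaskCost cost[NUM_TASKS][NUM_PE_TYPES];
    int      depends_on[NUM_TASKS];   /* index of the predecessor task, -1 for none */
} ScenarioModel;

/* Eight clustered scenarios, filled in from the profiling data at design time. */
extern ScenarioModel scenario_models[8];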
4.5 TCM Design Time Scheduling
At the Scenario exploitation phase, the TCM design-time scheduler [15] produces 8 Pareto curves (i.e., one per scenario). In Fig. 4, we show 4 of these Pareto curves as an example. Each Pareto curve has a number of Pareto points which represent a Pareto-optimal ordering and assignment of the WSS tasks to the PEs. In Fig. 4, we show 4 of these schedules (P1-P4) belonging to two Pareto curves (i.e., scenarios). Additionally, we have applied the same approach to TGFF [2] generated task graphs to demonstrate the schedule that would be produced for a random
Fig. 4. Four Pareto points representing 4 TCM design-time schedule examples (P1-P4) of the WSS application. The four Pareto curves represent four scenarios (C1-C4 in R3). Q1 represents the schedule of another application competing for resources (generated automatically with TGFF).
application sharing the MPSoC resources. The TGFF application schedule is represented by Q1. Each Pareto curve occupies ∼1.3 KBytes of memory and there are approximately 8 to 15 points in each per-scenario Pareto curve of each task.
4.6 TCM Run Time Scheduling
At the Scenario switching phase, the TCM run-time scheduler implements a Pareto point selection algorithm [15] to select at run-time the optimal schedule for each application, while meeting the real-time constraints and minimizing the energy consumption. The deadline for the WSS application is set initially at 0.5 secs (i.e., the frame will be updated by input user commands coming at a rate of one per 0.5 secs). The TGFF application also has the same frame rate. If the deadline is changed to 0.115 secs (e.g., because user commands come faster) and both applications are sharing resources, then the TCM run-time scheduler switches to a faster schedule point P1. However, if the deadline is increased to 0.22 secs, then the TCM run-time scheduler chooses schedule point P2 for WSS, consuming 10% less energy, as in State B, without missing any real-time constraints. At the Scenario detection phase, if the camera position switches from C2 to C3, then the TCM run-time scheduler is activated again and selects a schedule from the corresponding Pareto curve. In this particular case it would move from schedule P2 to schedule P4, thus saving an additional 55% of energy consumption, without missing any real-time constraints. From the heterogeneity experiments, we have identified that there is a 10% performance gain and energy gain just with a small hardware accelerator for two kinds of instructions. This gain will be larger if we extend it to the whole GlobalParetoBuild module. The slack gained from the hardware accelerator is used by the TCM run-time scheduler and provides a 16% more energy efficient solution for the WSS application.
5 Conclusions and Future Work
In this paper, we have characterized the computational and energy resource requests of the Wavelet Subdivision Surfaces (WSS) algorithm for scalable 3D graphics on an MPSoC platform. Moreover, this paper demonstrates the implementation of a number of experimental design-time and run-time techniques at various abstraction levels and shows how they can be combined for a single case study. More specifically, we have demonstrated WSS at the application level, System Scenarios at the system level and Task Concurrency Management at the middleware (resource management) level. In our future work, we plan to parallelize certain tasks of the WSS algorithm, do a more thorough exploration of the available WSS scenarios and MPSoC platform options, and calculate the switching overhead between the TCM schedules at run-time. We have also explored the TCM methodology for the heterogeneous platform that includes a hardware accelerator and shown additional gains from the TCM methodology.
References
1. Carroll, J.M.: Scenario-based design: envisioning work and technology in system development. John Wiley and Sons, Chichester (1995)
2. Dick, R.P.: TGFF: task graphs for free. In: CODES/CASHE 1998 (1998)
3. Douglass, B.P.: Real Time UML: Advances in the UML for Real-Time Systems, 3rd edn. Addison-Wesley Professional, Reading (2004)
4. Fowler, M., Scott, K.: UML Distilled: A Brief Guide to the Standard Object Modeling Language, 3rd edn. Addison-Wesley Professional, Reading (2003)
5. Gheorghita, S.V., Palkovic, M., Hamers, J., Vandecappelle, A., Mamagkakis, S., Basten, T., Eeckhout, L., Corporaal, H., Catthoor, F., Vandeputte, F., Bosschere, K.D.: System-scenario-based design of dynamic embedded systems, vol. 14, pp. 1–45. ACM, New York (2009)
6. Hamers, J.: Resource prediction for media stream decoding. In: DATE (2007)
7. ISO/IEC: ISO/IEC 14496 Information technology - Coding of audio-visual objects - Part 10: Advanced Video Coding (2004)
8. Laurent, J.: Functional level power analysis: An efficient approach for modeling the power consumption of complex processors. In: DATE (2004)
9. Ma, Z., Scarpazza, D.P., Catthoor, F.: Run-time task overlapping on multiprocessor platforms. In: ESTImedia, pp. 47–52 (2007)
10. Paul, J.M.: Scenario-oriented design for single-chip heterogeneous multiprocessors. IEEE Transactions on VLSI Systems (2006)
11. Ramamritham, K., Fohler, G., Adan, J.M.: Issues in the static allocation and scheduling of complex periodic tasks. IEEE Computer Society, Los Alamitos (1993)
12. Sinha, A., Chandrakasan, A.: JouleTrack - a web based tool for software energy profiling. In: DAC (2001)
13. Tack, N., Lafruit, G., Catthoor, F., Lauwereins, R.: Pareto based optimization of multi-resolution geometry for real time rendering. In: Web3D. ACM, New York (2005)
14. Xilinx Inc.: System Generator for DSP User Guide. Version 10.1 (March 2008)
15. Yang, P., Catthoor, F.: Pareto-optimization-based run-time task scheduling for embedded systems. In: ISSS, California, USA (2003)
Multiple Description Scalable Coding for Video Transmission over Unreliable Networks

Roya Choupani1,2, Stephan Wong1, and Mehmet R. Tolun2

1 Computer Engineering Department, TU Delft, Delft, The Netherlands
2 Computer Engineering Department, Cankaya University, Ankara, Turkey
[email protected], [email protected], [email protected]
Abstract. Developing real time multimedia applications for best effort networks such as the Internet requires coping with jitter delay and frame loss. This problem is further complicated in wireless networks, as the rate of frame corruption or loss is higher while such networks generally have lower data rates than wired networks. On the other hand, variations of the bandwidth and of the receiving device characteristics require data rate adaptation capability from the coding method. Multiple Description Coding (MDC) methods are used to solve the jitter delay and frame loss problems by making the transmitted data more error resilient; however, this results in a reduced data rate because of the added overhead. MDC methods do not address the bandwidth variation and receiver characteristics differences. In this paper a new method based on integrating MDC and the scalable video coding extension of the H.264 standard is proposed. Our method can handle both the jitter delay and frame loss problems and the data rate adaptation problem. Our method utilizes a motion compensation scheme and, therefore, is compatible with current video coding standards such as MPEG-4 and H.264. Under the simulated network conditions, our method shows promising results and we have achieved up to 36 dB for the average Y-PSNR.
Keywords: Scalable Video Coding, Multiple Description Coding, Multimedia Transmission.
1 Introduction
Communications networks, both wireless and wired, offer variable bandwidth channels for video transmission [1], [3]. Display devices have a variety of characteristics, ranging from low resolution screens in small mobile terminals to high resolution projectors. The data transmitted for this diverse range of devices and bandwidths has different sizes and should be stored on media with different capacities. Moreover, an encoding which makes use of a single encoded stream for all types of bandwidth channels and display device capacities could be of remarkable significance in multimedia applications. Scalable video coding (SVC) schemes are intended to be a solution for the Internet heterogeneity and receiver
display diversity problem by encoding the data at the highest quality but enabling the transmitter or receiver to utilize it partially, depending on the desired quality or the available bandwidth and display capacities. The main drawback of the available scalable video coding methods is that they are not suitable for non-reliable environments with a high rate of frame loss or corruption, such as wireless networks. This problem stems from the fact that the methods are based on the motion compensated temporal filtering scheme and frames are coded as differences with a (generally prior) reference frame. In case a reference frame is lost or corrupted, the whole chain of difference frames depending on it becomes unrecoverable. To increase the error resilience of video coding methods, Multiple Description Coding (MDC) methods have been introduced [4], [5], [7]. These methods improve the error resilience of the video at the cost of adding redundancy to the code. In case a frame is lost or corrupted, the redundancy is used to replace it with an estimated frame. Franchi et al. proposed a method to send a video by utilizing independent multiple descriptions. Their method, however, does not combine scalability features with multiple description coding; it therefore only addresses frame loss or corruption, while variations of the bandwidth are not dealt with [16]. The combination of scalable video coding methods and multiple description coding has attracted the interest of researchers recently [2], [3], [13]. The recent introduction of the scalable extension of the H.264 standard, which relaxes some of the restrictions of other video coding schemes such as using the immediately prior frame as the reference frame, provides a suitable framework for combining the scalability of H.264 with the error resilience of MDC schemes. This paper describes a new method which combines the SVC extension of the H.264 standard with MDC schemes in a way that no redundancy in the form of extra bits is introduced during the video coding. The remainder of this paper is organized as follows. Section 2 introduces the main multiple description coding methods. Section 3 explores the scalability features of the H.264 standard which are used in our proposed method. Section 4 describes the details of our proposed method. In Section 5, we introduce the theoretical basis of our performance evaluation method and provide the experimental results and finally, in Section 6, we draw the conclusions.
2 Multiple Description Coding
As a way of encoding and communicating visual information over lossy packet networks, multiple descriptions have attracted a lot of attention. A multiple description coder divides the video data into several bit-streams called descriptions, which are then transmitted separately over the network. All descriptions are equally important and each description can be decoded independently from the other descriptions, which means that the loss of some of them does not affect the decoding of the rest. The accuracy of the decoded video depends on the number of received descriptions. Descriptions are defined by constructing P non-empty sets summing up to the original signal f. Each set in this definition corresponds to a description. The sets, however, are not necessarily disjoint. A signal sample
may appear in more than one set to increase the error resilience of the video. Repeating a signal sample in multiple descriptions is also a way of assigning higher importance to some parts/signals of the video. The more a signal sample is repeated, the more reliably it is transmitted over the network. The duplicated signal values increase the redundancy and hence the data size, which results in reduced efficiency. Designing descriptions as a partition does not necessarily mean that there is no redundancy in the data. In fact, designing the descriptions as a partition prevents extra bits from being added to the original data for error resilience, but the correlation between spatially or temporally close data can still be used for estimating the lost bits. The estimation process is commonly referred to as error concealment and relies on the correlation preserved in constructing the descriptions. Fine Granular Scalability (FGS)-based MDC schemes partition the video into one base layer and one or several enhancement layers [8]. The base layer can be decoded independently from the enhancement layers, but it provides only the minimum spatial, temporal, or signal to noise ratio quality. The enhancement layers are not independently decodable. An enhancement layer improves the decoded video obtained from the base layer. MDC schemes based on FGS put the base layer together with one of the enhancement layers in each description. This helps to partially recover the video when data from one or some of the descriptions are lost or corrupt. Repeating the base layer bits in each description is the overhead added for better error resilience. In Forward Error Correction (FEC)-based MDC methods, it is assumed that the video is originally defined in a multi-resolution manner [6], [9]. This means that if we have M levels of quality, each one adds to the fidelity of the video with respect to the original one. This concept is very similar to the multi-layer video coding method used by the FGS scheme. The main difference, however, is that there exists a mandatory order in applying the enhancements. In other words, it is sensitive to the position of the losses in the bitstream, e.g., a loss early in the bitstream can render the rest of the bitstream useless to the decoder. FEC-based MDCs aim to provide the desired feature that the delivered quality becomes dependent only on the fraction of packets delivered reliably. One method to achieve this is Reed-Solomon block codes. Mohr et al. [15] used Unequal Loss Protection (ULP) to protect video data against packet loss. ULP is a system that combines a progressive source coder with a cascade of Reed-Solomon codes to generate an encoding that is progressive in the number of descriptions received, regardless of their identity or order of arrival. The main disadvantage of the FEC-based methods is the overhead added by the insertion of error correction codes. Discrete Wavelet Transform (DWT)-based video coding methods are amenable to multiple description coding. In the most basic method, wavelet coefficients are partitioned into maximally separated sets and packetized so that simple error concealment methods can produce good estimates of the lost data [2], [10], [11]. More efficient methods utilize Motion Compensated Temporal Filtering (MCTF), which is aimed at removing the temporal redundancies of video sequences. If a video signal f is defined over a domain D, then the domain can be expressed as a collection of sub-domains {S1,...,Sn} where the union of these
sub-domains is a cover of D. Besides, a corrupt sample can be replaced by an estimated value using the correlation between the neighboring signal samples. Therefore, the sub-domains should be designed in a way that the correlation between the samples is preserved. Domain-based multiple description schemes are based on partitioning the signal domain. Each partition, which is a subsampled version of the signal, defines a description. Chang [8] utilizes the even-odd splitting of the coded speech samples. For images, Tillo et al. [11] propose splitting the image into four subsampled versions prior to JPEG encoding. There, domain partitioning is performed first, followed by discrete cosine transform, quantization and entropy coding. The main challenge in domain-based multiple description methods is designing sub-domains so that the minimum distance between values inside a domain (inter-domain distance) is maximized while preserving the auto-correlation of the signal.
3 Scalable Video Coding Extension of H.264
As a solution to the unpredictability of traffic loads and the varying delays on the client side, the video data is encoded in a rate-scalable form, which enables adaptation to the receiver or network capacities. This adaptation can be in the number of frames per second (temporal scalability), the frame resolution (spatial scalability), or the number of bits allocated to each pixel value (signal-to-noise ratio scalability). In this section, we briefly review the scalability support features of the H.264 standard which are used in our proposed method. The scalability support features of the H.264 standard were introduced based on an evaluation of the proposals carried out by the MPEG and ITU-T groups. Scalable video coding (SVC) features were added as an amendment to the H.264/MPEG-4 AVC standard [14].

3.1 Temporal Scalability
Temporal scalability is achieved by dropping some of the frames in a video to reach the desired (lower) frame rate. As the motion compensated coding used in video coding standards encodes the difference of the blocks of a frame with respect to its reference frame (the frame coming immediately before it), dropping frames for temporal scalability can cause some frames to become unrecoverable. The H.264 standard relaxes the restriction of choosing the previous frame as the reference frame for the current frame. This makes it possible to design hierarchical prediction structures that avoid the reference-frame loss problem when adjusting the frame rate.

3.2 Spatial Scalability
In supporting spatially scalable coding, H.264 utilizes the conventional approach of multilayer coding; however, additional inter-layer prediction mechanisms are incorporated. In inter-layer prediction, the information in one layer is used in the other layers. The layer that is employed for inter-layer prediction is called the
reference layer, and its layer identifier number is sent in the slice header of the enhancement layer slices [12]. The inter-layer coding mode is applied when the macroblock in the base layer is inter-coded. To simplify encoding and decoding of macroblocks in this mode, a new block type named base mode block was introduced. This block does not include any motion vector or reference frame index number; only the residual data is transmitted in the block. The motion vector and reference frame index information are copied from those of the corresponding block in the reference layer.
4 Our Proposed Method
Our proposed method uses the scalability features of the H.264 standard. To make the video resilient against frame loss or corruption errors, we define multiple descriptions. However, to achieve a performance comparable to single-stream coding, we do not include any error correction code in the descriptions. The error concealment in our proposed method is based on the autocorrelation of the pixel values, which is a decreasing function of spatial distance. Generally, the differences among the pixel values around a given point are expected to be low. Based on this idea we have considered four descriptions D1 to D4 representing four spatial subsets of the pixels in a frame, as depicted in Figure 1. Each description corresponds to a subset S_i for i = 1, ..., 4. The subsets define a partition, since they do not overlap and their union is the initial set:

S_i \cap S_j = \emptyset \quad \text{for } i, j = 1, \ldots, 4,\ i \neq j, \qquad \bigcup_{i=1}^{4} S_i = D
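To make the splitting concrete, the following sketch shows one plausible reading of this partition as a 2x2 polyphase split of the luma plane. Since Figure 1 is not reproduced here, the exact pixel-to-description mapping (by coordinate parity) and the assumption of even frame dimensions are ours, not the paper's definition.

// Illustrative only: split a frame into four subsampled descriptions D1..D4 by the
// parity of the pixel coordinates (assumed mapping; even frame dimensions assumed).
public class DescriptionSplit {
    public static int[][][] split(int[][] frame) {
        int h = frame.length, w = frame[0].length;          // assumed even
        int[][][] d = new int[4][h / 2][w / 2];
        for (int y = 0; y < h; y++)
            for (int x = 0; x < w; x++)
                d[2 * (y % 2) + (x % 2)][y / 2][x / 2] = frame[y][x];
        return d;
    }

    // Merging the four descriptions back reverses the mapping; a lost description
    // leaves pixels that the error concealment step has to interpolate.
    public static int[][] merge(int[][][] d) {
        int h = d[0].length * 2, w = d[0][0].length * 2;
        int[][] frame = new int[h][w];
        for (int y = 0; y < h; y++)
            for (int x = 0; x < w; x++)
                frame[y][x] = d[2 * (y % 2) + (x % 2)][y / 2][x / 2];
        return frame;
    }
}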
Each description is divided into macro-blocks, motion compensated, and coded independently. The decoder extracts the frames and combines them as depicted in Figure 1.
Fig. 1. Organization of the pixels in the descriptions
Fig. 2. Pixels used (blue) for interpolating the value of a missing pixel (red)
Fig. 3. Multiple Description Schemes with a) 9 Descriptions, b) 16 Descriptions
When a description is lost or corrupted, the remaining three descriptions provide nine pixel values around each pixel of the lost description for interpolation during error concealment. Figure 2 depicts the pixel values utilized for interpolating a pixel value from a lost description. For the interpolation, we use a weighted average whose weights are the inverse Euclidean distances of each pixel from the center, normalized as given below:

W = \frac{1}{6.828} \begin{pmatrix} \sqrt{2}/2 & 1 & \sqrt{2}/2 \\ 1 & 0 & 1 \\ \sqrt{2}/2 & 1 & \sqrt{2}/2 \end{pmatrix}

We have assumed that the residual values, motion vectors, and other meta-data of a macroblock are transmitted as one data transmission unit and hence are not available when the data packet is lost. The succeeding frames which utilize the estimated frame as their reference frame will suffer from the difference between the reconstructed frame and the original one. The error generated in this way is propagated till the end of the GOP.
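A minimal sketch of this concealment step follows, using the weight matrix above (the centre weight is zero, so the eight surrounding pixels contribute). The clamping at frame borders is our simplification, not part of the paper.

// Illustrative only: estimate a missing pixel from its 3x3 neighborhood in the merged
// frame, weighted by inverse Euclidean distance and normalized by 4 + 2*sqrt(2) = 6.828.
public class Concealment {
    private static final double S = Math.sqrt(2) / 2;
    private static final double[][] W = { { S, 1, S }, { 1, 0, 1 }, { S, 1, S } };
    private static final double NORM = 4 + 2 * Math.sqrt(2);   // approx. 6.828

    public static int conceal(int[][] frame, int y, int x) {
        double sum = 0;
        for (int dy = -1; dy <= 1; dy++)
            for (int dx = -1; dx <= 1; dx++) {
                int yy = Math.min(Math.max(y + dy, 0), frame.length - 1);    // clamp at borders
                int xx = Math.min(Math.max(x + dx, 0), frame[0].length - 1);
                sum += W[dy + 1][dx + 1] * frame[yy][xx];
            }
        return (int) Math.round(sum / NORM);
    }
}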
Fig. 4. Coding efficiency comparison between single layer and our proposed method using the City video segment (City, CIF 30Hz; average Y-PSNR [dB] vs. bit rate [kbit/s])

Fig. 5. Coding efficiency comparison between single layer and our proposed method using the Foreman video segment (Foreman, CIF 30Hz; average Y-PSNR [dB] vs. bit rate [kbit/s])
However, if no other frame from the same GOP is lost, the error does not accumulate. The multilayer hierarchical frame structure of H.264 reduces the impact of a frame loss to at most log2(n) succeeding frames, where n is the number of frames in a GOP. Our proposed method has the following features.
– Multiple description coding is combined with scalable video coding methods with no redundant bits added.
– Each description is independent from the rest, and no base-enhancement relationship exists between them. This feature comes without the extra cost of forward error correction bits added to the descriptions. Any lost or corrupted description can be concealed regardless of its position or order with respect to the other descriptions.
– The proposed method is compatible with the definition of the multi-layer spatial scalability of the H.264 standard. This compatibility is due to the possibility of having the same resolution in two different layers in H.264 and using inter coding at each layer independently. We have not set the motion prediction flag and let each description have its own motion vectors, because each description is coded independently. Setting the motion prediction flag can speed up the encoder, but it slightly reduces the coding efficiency, as the most similar regions do not always occur at the same place in different descriptions.
– The proposed method can be extended to a larger number of descriptions if the error rate of the network is high, a higher level of fidelity with the original video is required, or higher levels of scalability are desired.
5 Experimental Results
For evaluating the performance of our proposed method, we have considered measuring the Peak Signal to Noise Ratio of the Y component of the macroblocks (Y-PSNR). Equations 1 and 2 describe the Y-PSNR used in our implementation mathematically.

PSNR = 20 \log_{10} \frac{Max_I}{\sqrt{MSE}}    (1)

MSE = \frac{1}{3mn} \sum_{i=0}^{m-1} \sum_{j=0}^{n-1} \lVert I(i,j) - I'(i,j) \rVert^2    (2)
where Max_I indicates the largest possible pixel value, I is the original frame, and I' is the decoded frame at the receiver side. The Y-PSNR is applied to all frames of the video segments listed in Table 1 by comparing the corresponding frames of the original video segment and the one obtained after using our multiple description coding method. We have considered the case where one of the descriptions is lost and interpolated; the erroneous description is selected randomly. We put 32 frames in each GOP, and a dyadic hierarchical temporal structure has been used for motion compensated coding. For simplicity, we have furthermore imposed the same reference frame for all macroblocks of a frame, although H.264 supports utilizing different reference frames for the macroblocks of a frame. Additionally, we have restricted the number of descriptions lost to one per GOP. This means that at most one fourth of a frame is estimated during the error concealment step. The location of the lost description in the GOP is selected randomly and the Y-PSNR is averaged over each video segment. The average Y-PSNR values are reported in Table 1.
Table 1. Average Y-PSNR values when loss is in only one frame of each GOP

Sequence Name     Resolution   Frame rate   Average Y-PSNR (dB)
Foreman           352 × 288    30           36.345
Stefan & Martin   768 × 576    30           33.110
City              704 × 576    60           34.712
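For reference, the following small helper computes the metric of Equations (1)-(2) for one frame pair. It is our own sketch (with Max_I = 255 assumed for 8-bit video), not the authors' implementation; the 1/(3mn) factor is kept exactly as written in Equation (2).

// Illustrative only: MSE and PSNR as in Eqs. (1)-(2).
public class Psnr {
    public static double psnr(int[][] orig, int[][] dec, int maxI) {
        int m = orig.length, n = orig[0].length;
        double sse = 0;
        for (int i = 0; i < m; i++)
            for (int j = 0; j < n; j++) {
                double diff = orig[i][j] - dec[i][j];
                sse += diff * diff;
            }
        double mse = sse / (3.0 * m * n);
        return 20 * Math.log10(maxI / Math.sqrt(mse));
    }
}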
The second set of evaluation tests considers the change in the average Y-PSNR value for each video segment with respect to the number of frames affected by the lost description. However, we still assume that only one description is lost each time and that the GOP length is 32. Figure 3 depicts the result of multiple frame reconstruction for three video segments. Despite having multiple frames affected by the loss or corruption, the results indicate that the peak signal-to-noise ratio remains relatively high. As a benchmark to evaluate the efficiency of our algorithm, we have compared the average Y-PSNR values of the Foreman and City video segments with single layer video coding. Figures 4 and 5 depict the comparison results.
6 Conclusion
A new method for handling data loss during the transmission of video streams has been proposed. Our proposed method is based on multiple description coding; however, coding efficiency is not sacrificed, as no extra bit redundancy is introduced to increase the resilience of the video. The proposed method can be used as a scalable coding method, and any data loss or corruption is reflected only as a slight reduction in the quality of the video. Except for the case when all descriptions are lost, the video streams do not experience jitter at playback. The compatibility of the proposed method with the H.264 standard simplifies the implementation process. Our proposed method is based on the spatial scalability features of H.264; a reasonable extension of the work is the inclusion of SNR scalability.
References

1. Conklin, G., Greenbaum, G., Lillevold, K., Lippman, A., Reznik, Y.: Video Coding for Streaming Media Delivery on the Internet. IEEE Transactions on Circuits and Systems for Video Technology (March 2001)
2. Andreopoulos, Y., van der Schaar, M., Munteanu, A., Barbarien, J., Schelkens, P., Cornelis, J.: Fully-scalable Wavelet Video Coding using In-band Motion-compensated Temporal Filtering. In: Proc. IEEE International Conference on Acoustics, Speech and Signal Processing, pp. 417–420 (2003)
3. Ohm, J.: Advances in Scalable Video Coding. Proceedings of the IEEE 93(1) (January 2005)
4. Goyal, V.K.: Multiple Description Coding: Compression Meets the Network. IEEE Signal Processing Magazine 18(5), 74–93 (2001)
5. Wang, Y., Reibman, A.R., Shunan, L.: Multiple Description Coding for Video Delivery. Proceedings of the IEEE 93(1) (January 2005)
6. Puri, R., Ramchandran, K.: Multiple Description Source Coding using Forward Error Correction Codes. Signals, Systems, and Computers 1, 342–346 (1999)
7. Venkataramani, R., Kramer, G., Goyal, V.K.: Multiple Description Coding with Many Channels. IEEE Transactions on Information Theory 49(9), 2106–2114 (2003)
8. Chang, S.K., Sang, L.: Multiple Description Coding of Motion Fields for Robust Video Transmission. IEEE Transactions on Circuits and Systems for Video Technology 11(9), 999–1010 (2001)
9. Wang, Y., Lin, S.: Error-resilient Video Coding using Multiple Description Motion Compensation. IEEE Transactions on Circuits and Systems for Video Technology 12(6), 438–452 (2002)
10. Xuguang, Y., Ramchandran, K.: Optimal Subband Filter Banks for Multiple Description Coding. IEEE Transactions on Information Theory 46(7), 2477–2490 (2000)
11. Tillo, T., Olmo, G.: A Novel Multiple Description Coding Scheme Compatible with the JPEG 2000 Decoder. IEEE Signal Processing Letters 11(11), 908–911 (2004)
12. Wiegand, T., Sullivan, G.J., Bjontegaard, G., Luthra, A.: Overview of the H.264/AVC Video Coding Standard. IEEE Transactions on Circuits and Systems for Video Technology 13(7) (July 2003)
13. Schwarz, H., Marpe, D., Wiegand, T.: Overview of the Scalable Video Coding Extension of the H.264/AVC Standard. IEEE Transactions on Circuits and Systems for Video Technology (2007)
14. Hewage, C., Karim, H., Worrall, S., Dogan, S., Kondoz, A.: Comparison of Stereo Video Coding Support in MPEG-4 MAC, H.264/AVC and H.264/SVC. In: Proceedings of the 4th Visual Information Engineering Conference, London (July 2007)
15. Mohr, A.E., Riskin, E.A., Ladner, R.E.: Unequal Loss Protection: Graceful Degradation of Image Quality over Packet Erasure Channels through Forward Error Correction. IEEE Journal on Selected Areas in Communications 18(6), 819–828 (2000)
16. Franchi, N., Fumagalli, M., Lancini, R., Tubaro, S.: A Space Domain Approach for Multiple Description Video Coding. In: ICIP 2003, vol. 2, pp. 253–256 (2003)
Evaluation of Different Multithreaded and Multicore Processor Configurations for SoPC Sascha Uhrig Institute of Computer Science University of Augsburg 86159 Augsburg Germany
[email protected]
Abstract. Multicore processors are getting more and more popular, even in embedded systems. Unfortunately, these types of processors require a special kind of programming technique to offer their full performance, i.e., they require a high degree of thread-level parallelism. In this paper we evaluate the performance of different configurations of the same processor core within an SoPC: a single threaded single core, a multithreaded single core, a single threaded multicore, and a multithreaded multicore. The core used is the jamuth core, a multithreaded Java processor able to execute Java bytecode directly in hardware. The advantage of Java in a multicore environment is that it brings the threading concept for free, i.e., software developers are already familiar with it. Our evaluations show that the cores within a multicore processor should be at least two-threaded to bridge the higher memory delays caused by the contention at the shared bus.
1 Introduction

In recent years, multicore processors have gained more and more importance in high performance applications as well as in embedded systems. We concentrate especially on processor architectures for embedded systems. Unfortunately, it is not trivial to design an application that can benefit optimally from multithreaded and multicore processor architectures. This is because both architectures require a certain amount of thread level parallelism to utilize the offered functional units best. In this paper we present a multicore-enabled version of the jamuth core, a multithreaded Java processor core for Systems-on-Programmable-Chip (SoPC). A Java Virtual Machine (JVM) dealing with the special multicore topics is also available. The great advantage of Java in multicore systems is the thread concept deeply embedded within Java, which enables ordinary Java applications to fit well into the programming paradigms of multicores. The jamuth processor is a multithreaded core which executes most Java bytecodes directly within the processor in a single clock cycle or as sequences of microoperations in several clock cycles (see [1]). More complex instructions are executed with the help of trap routines. Additionally, all required device drivers are also implemented in Java. Hence, the overhead of an operating system layer and the need of a (JIT) compiler or
interpreter is eliminated. These circumstances lead to a Java system which is able to work with very limited hardware resources. We evaluated different parameters of the multithreading and multicore design space: a single threaded single core serves as the baseline processor, which is extended to a multithreaded processor with different numbers of thread slots, to a multicore with different numbers of cores, and to a combination of both. We measured the performance of the different architectures in terms of overall and single thread performance together with the processor utilization. This paper is organized as follows: in Section 2 we discuss several related multithreaded and multicore environments. Section 3 presents our single core baseline jamuth Java processor core, followed by several extensions for multicore support and the design of a multicore SoPC in Section 4. The evaluations are shown in Section 5, before Section 6 concludes the paper.
2 Related Work In recent years, several multithreaded and multicore processor architectures have been established on the market of server, desktop and portable computers. A comparison of single and multicore desktop architectures based on Intel’s Smithfield and Prescott architectures is given by Sibai [2]. A simulator for simultaneous multithreaded single core processors was developed by D.M. Tullsen [3]. He and Seng reported in several publications [4][5] a formidable speedup of multithreaded architectures. Donald et al. [6] presented a new methodology to extend single core processor simulators to multicore simulators by using one software thread per simulated core. Besides this methodology, the more important topic of that work is that the new multicore simulators profit well from the execution on multicore and multithreaded processors. Gerdes et al. [7] described the MERASA project. Its focus is the development of a multithreaded multicore processor for embedded real-time systems. The multicore processor will be modelled with SystemC and an FPGA prototype should be the outcome of the project. Pitter et al. [8] presented the design of a single threaded Java multicore processor for FPGAs. They evaluated a multicore containing up to 8 Java cores. The cores are interconnected by the so-called SimpCon bus [9] and synchronized by a special synchronization logic. Except for the latter one, no other work evaluates multicore architectures for SoPC. Most work has been done concerning desktop and server applications but embedded systems are rarely the focus of multicore research. The contribution of this paper is the simple design and the evaluation of a multithreaded multicore environment especially for embedded systems based on FPGAs.
3 Architecture of the Multithreaded jamuth Single Core Processor

The processor core (see figure 1) contains a multithreaded five-stage pipeline (instruction fetch, instruction decode, operand fetch, execute, and the stack cache).
Fig. 1. Block diagram of the jamuth core
The integrated real-time scheduler shown in the figure is responsible for a real-time capable thread schedule if the core is used in real-time applications. Most of the Java integer bytecodes as well as several long, float, and double bytecodes are executed directly in hardware, mostly within a single execution cycle. Instructions of medium complexity are realized by microcode, and the complex operations, such as new, athrow, monitorenter, and most floating point instructions, call trap routines. As the operand stack, a 2k-entry stack cache is integrated within the pipeline, which is shared between the hardware thread slots. During the initialization phase, different portions of the stack cache can be assigned to the thread slots so that one thread works entirely on the stack cache while the data of other threads has to be swapped in and out. As an example, the garbage collection should run without swapping, due to its continuous progress, to avoid additional and unnecessary memory accesses for swapping. For predictability, real-time threads should also run without swapping. Each thread slot possesses an instruction window (IW) into which up to six bytecodes can be prefetched. These instruction windows decouple fetching and decoding of the bytecodes. Instructions can be fetched from three different sources: external memory, instruction cache, and scratch RAM. The instruction cache and the scratch RAM are integrated within each processor core, whereas the memory has to be connected via appropriate interfaces. The integrated real-time scheduler supports two scheduling schemes: a simple fixed priority preemptive (FPP) scheduling scheme and the so-called guaranteed percentage (GP [10]) scheduling scheme. Each hardware thread slot is assigned to one of
these scheduling schemes. For FPP scheduling, a priority in the range from 0 to #threadslots − 1 is required, and for GP a percentage between 0 and 100. A special jamuth Java class supports the handling of the hardware thread slots and the scheduling policies. In our case, the scheduler performs the simple fixed priority scheduling for the thread slots inside a single core. The IP core contains three additional units: the Debug unit, the Java timer, and an IRQ controller. The first unit is responsible for debug and observation functionalities, the timer is required for the JVM (System.currentTimeMillis() and the sleep methods), and the IRQ controller translates interrupt requests from the peripheral components into wake-up signals for the thread slots or interrupt signals for already running threads.
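As a rough illustration of guaranteed-percentage scheduling, the sketch below allocates each GP thread slot a cycle budget per fixed interval. This is our own reading of the scheme from [10], not jamuth code; the 100-cycle interval length and the pick-lowest-index policy among slots with remaining budget are assumptions.

// Illustrative only: per-interval cycle budgeting for GP-scheduled thread slots.
class GpScheduler {
    private final int[] percentage;   // configured guarantee per thread slot (0..100)
    private final int[] budget;       // cycles left for each slot in the current interval
    private int cycleInInterval = 0;

    GpScheduler(int[] percentage) {
        this.percentage = percentage.clone();
        this.budget = new int[percentage.length];
        refill();
    }

    /** Returns the slot to issue from in this cycle, or -1 if every budget is spent. */
    int nextSlot() {
        if (cycleInInterval++ == 100) { cycleInInterval = 1; refill(); }
        for (int s = 0; s < budget.length; s++)
            if (budget[s] > 0) { budget[s]--; return s; }
        return -1;   // spare cycles could go to FPP-scheduled slots
    }

    private void refill() {
        for (int s = 0; s < percentage.length; s++) budget[s] = percentage[s];
    }
}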
4 Building a Multicore

In this section we first describe the extensions of the single core processor that make it suitable for a multicore environment. Second, we show how anyone can assemble a multicore for FPGAs using the Altera SoPC builder [11] together with our processor core.

4.1 Extensions to the Original Core

One of the most interesting topics of multicore designs is the memory connection of the processor cores. Because our system targets a small number of cores (i.e., 2 to 4 cores), a simple bus connection suits this architecture well. A big advantage of the bus structure is that the standard Avalon switch fabric for Altera FPGAs can be used for the core connections. Besides the connection structure, two further topics must be handled in multicore designs: the software synchronization between the cores and the cooperation of the cores. Because of the real-time capability of the original multithreaded single core processor, the used core does not support bus locking. Instead of atomic read-modify-write memory accesses, the core offers a pair of atomlock/atomunlock instructions for the synchronization of the integrated hardware thread slots. This technique is extended to synchronize multiple cores by an external lock token ring. The lock token is sent from one core to the next in a ring. If a core requires the lock, it takes the token from the ring until it releases the atomic lock. This solution is simple to implement and works well for a small number of cores (about 2–4), because the maximum latency to get the token, if it is on the ring, grows linearly with the number of cores. Unfortunately, this solution is not suitable for real-time requirements because no priorities are taken into account. The second topic, the cooperation of the cores, is required for the system software (i.e., the JVM). Because of the design of the processor core, it is not possible to access the control registers and the Java runtime stacks from outside the core. Hence, one core cannot initialize a thread slot within any other core, nor can it resume a thread slot or change its scheduling parameters. To deal with these restrictions, we introduced a central event interrupt. The JVM contains a todo list for each core which is handled at the occurrence of the central event. Possible items in the list are create thread, resume thread slot, and set scheduling parameters.
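The following is a behavioral sketch of the lock token ring just described, written by us for illustration (it is not the jamuth hardware): the token advances one core per cycle and is parked at a core for the duration of its atomic section, so the worst-case acquisition latency grows linearly with the number of cores.

// Illustrative only: cycle-stepped model of the external lock token ring.
class LockTokenRing {
    private final boolean[] wantsLock;   // set by a core on atomlock, cleared on atomunlock
    private int tokenAt = 0;             // index of the core currently holding the token
    private boolean held = false;        // true while the token is kept for an atomic section

    LockTokenRing(int cores) { wantsLock = new boolean[cores]; }

    void atomlock(int core)   { wantsLock[core] = true; }
    void atomunlock(int core) { wantsLock[core] = false; held = false; }

    /** True if the given core owns the lock in the current cycle. */
    boolean owns(int core) { return held && tokenAt == core; }

    /** Advance the ring by one cycle. */
    void step() {
        if (held) return;                                     // token parked at the lock owner
        if (wantsLock[tokenAt]) { held = true; return; }      // grab the token as it passes by
        tokenAt = (tokenAt + 1) % wantsLock.length;           // otherwise pass it on
    }
}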
Fig. 2. Architecture of a Multicore SoPC (Example)
4.2 Multicore SoPC Design

The processor core is designed as an IP core (a component) for Altera's System-on-Programmable-Chip (SoPC) builder with two standard Avalon bus master interfaces. Hence, it can easily be combined with other Altera, third party, or custom-made components into an SoPC. As external system components, only a memory device like an SDRAM is required. For the design of a multicore, several of our processor cores have to be connected to the memory bus of the SoPC. The peripheral components like a MAC, a UART, or a TFT controller should be combined using a peripheral bus which is connected to all processor cores by a Pipelined Avalon Bridge (a standard component of the SoPC builder) to reduce the size of the switch fabric. Figure 2 illustrates an example system architecture with several peripheral components and two processor cores.
5 Evaluation

We evaluated different combinations of multicore and multithreaded configurations. Using each of the configurations, we measured the overall performance, the performance of a single highest priority thread, and the processor's overall utilization.

5.1 Methodology

As benchmarks, we used modified versions of three benchmarks out of the JOP embedded benchmark suite [12]: the Sieve benchmark, the Kfl benchmark, and the UdpIp benchmark. The modifications were necessary because the original benchmarks cannot be executed in parallel. Hence, we had to change all static variables into object fields. Additionally, the central loops within the benchmarks are modified to unbounded loops
which are executed as long as a special run flag is set. The benchmark results are obtained by counting the loop iterations until the run flag is cleared. Because of these changes, the values presented here must not be compared to any results of the JOP benchmarks in other publications.

Depending on the processor configuration, we started several identical benchmark threads in parallel within fixed thread slots of the core(s). In contrast to normal Java thread execution, we bypassed the thread scheduler and bound the benchmark threads to predefined thread slots. Additionally, we assigned fixed priorities to the thread slots, enabling us to pick one thread to serve as the highest priority, i.e., most important, thread. Its performance is considered as the single thread performance in our multithreaded and/or multicore environment. After the initialization phase, we executed the benchmark threads for 100 million clock cycles and then cleared the run flag. The number of loop iterations is measured for each thread independently. Hence, we are able to present the performance of a single thread and the overall performance of the system, as well as the performance of a selected core. We executed all three benchmarks with different configurations sequentially for 100 million clock cycles and recorded their performance values as well as the pipeline utilization. The values presented here are the average values of all three benchmarks. As configurations, we used all possible combinations of one to three cores and one to three threads per processor. The fourth thread in each core was required for starting and stopping the benchmark threads and for the measurements. Former studies with only one core showed that the fourth thread brings no noticeable performance gain.

As test platform we used an FPGA prototype board (DBC3C40 from Devboards [13]) with an Altera Cyclone III EP3C40F484C7 FPGA, within which it is possible to implement three processor cores. The results concerning the pipeline utilization are measured by the Debug module of the processor core. It counts all clock cycles in which a new instruction is decoded within the decode unit, the number of latency cycles of the threads, and the overall number of cycles. For the multicore configurations, all cores are connected to a single standard SDRAM controller from Altera. All cores have the same priority, i.e., requests from the cores are serviced round robin. If a core does not request a memory access in its cycle, a possible request of the next core is serviced.

5.2 Evaluation Results

The most important characteristic of a processor is its performance. So first, we present the overall performance as the average of the three benchmarks in figure 3. The values shown are normalized to the performance of a single threaded single core processor. The number of active threads per core is shown on the x-axis and the number of active cores on the y-axis. A big performance improvement can be observed at the change from one to two in both directions, i.e., from a single processor to a dual core as well as from a single thread to two threads per core. A noticeable performance gain is also obtained with the third core, but the third thread leads only to a marginal increase.

A second, very important factor is the performance of a single highest priority thread. It is important because a high number of applications can only partly be parallelized (if at all), i.e., several parts of the applications must be executed sequentially. Hence, the performance of these sequential parts still has a high impact.
74
S. Uhrig
3,5 3
Performance
2,5 2 1,5 1 0,5
so rs
3 2 1
Pr oc es
0
1 2
3
Threads
Fig. 3. Speedup of different multithreaded and multicore configurations normalized to the single threaded single core version
1 Thread/Core 2 Threads/Core 3 Threads/Core
1 0,9
Relative Performance
0,8 0,7 0,6 0,5 0,4 0,3 0,2 0,1 0 1
2 Processors
3
Fig. 4. Performance impact of multithreaded and multicore execution on a single highest priority thread
Figure 4 shows the single thread performance of the highest priority thread of one core.
Fig. 5. Performance factor between two and three threads per core (single thread performance loss vs. overall performance gain, for one to three processors)

Fig. 6. Pipeline utilization of the first core (for 1, 2, and 3 threads per core vs. number of processors)
Because all cores have the same priority at the SDRAM interface, the performance of the highest priority threads of all cores is identical (not shown here), but the performance of one core in a dual core environment is reduced compared to a single core. Because of conflicts at the memory interface, it is easy to understand why the performance drops in the case of additional threads, independent of the core which executes the other thread. The fact that two threads on a single core harm the first thread's performance more than a thread on another core can be explained in two different ways:

1. Threads on one core use the same shared instruction cache, whereas different cores use different caches. Hence, cache thrashing can occur if two threads are executed on the same core, even if they run the same program, because in general the program counters point to different areas of the code.
2. Conflicts between memory accesses at the SDRAM interface are handled in a pipelined way. In contrast, a conflict concerning a memory access within a core cannot be pipelined, because of the real-time capability of the original core and the unpredictable period of the accesses.

Concerning the single thread performance, a behavior similar to the overall performance can be observed: the impact of the third thread per core is marginal. With the overall performance in mind it is even worse: figure 5 shows the reduction of single thread performance compared to the gain of the overall performance if the number of active threads per core is increased from two to three. Using a single or a dual core, the overall performance gain is still higher than the performance loss of the single highest priority thread. But if we use a triple core system, we reduce the single thread performance more than we can profit from the third thread on each core.

Another interesting characteristic of a processor is its efficiency. We measured the efficiency of a processor core in terms of its pipeline utilization. Figure 6 presents the utilization of the first core's pipeline depending on the number of active threads within one core and on the number of active cores. Because the used processor pipeline is a single issue pipeline without any data cache, the IPC (instructions per cycle) of the single threaded execution is relatively low (about 0.35). If we increase the number of processor cores and, hence, the memory access latencies increase too, the utilization of a single core is further reduced (down to about 0.31). As multithreading is a general approach to bridge memory latencies, we executed two and three threads per core. The measured pipeline utilization shows the same characteristic as the overall performance presented in figure 3: introducing the second thread leads to a strong increase of the utilization (by a factor of 1.23 to 1.45), whereas the third thread increases the efficiency only by a factor in the range of 1.01 to 1.04.
6 Conclusions

We presented a simple method to build a multicore System on Programmable Chip (SoPC) using Altera's SoPC builder together with our multithreaded Java processor core. A suitable Java Virtual Machine (JVM) with additional functionalities to support multithreading as well as a multicore environment is also available. We evaluated different configurations of multithreading and multicore processors. The focus of our analysis is the number of threads per core and the number of cores. An Altera Cyclone III EP3C40 FPGA is used to implement up to three cores, each with three threads. A standard Altera Avalon bus serves as interconnection network and an SDRAM as main memory. Our results show that, on the one hand, a higher performance gain can be reached by increasing the number of cores instead of the number of threads per core. On the other hand, the resource consumption then increases linearly, while the resource utilization is disappointing. A better utilization can be reached by introducing multithreading into the cores. Adding a second thread increases pipeline utilization in the range from 23% to 45% and performance in the range from 26% to 45%. The integration of a third thread causes only a performance gain of up to 6%. As an overall result, we conclude that multicore
processors should support two thread slots per core: just one thread suffers too much from conflicts at the common memory interface, whereas the gain of a third thread is too small compared to its cost.
References

1. Kreuzinger, J., Brinkschulte, U., Pfeffer, M., Uhrig, S., Ungerer, T.: Real-time Event-handling and Scheduling on a Multithreaded Java Microcontroller. Microprocessors and Microsystems 27, 19–31 (2003)
2. Sibai, F.N.: Evaluating the performance of single and multiple core processors with PCMark 2005 and benchmark analysis. ACM SIGMETRICS Performance Evaluation Review, pp. 62–71 (2008)
3. Tullsen, D.M.: Simulation and modeling of a simultaneous multithreading processor. In: The 22nd Annual Computer Measurement Group Conference (1996)
4. Tullsen, D.M., Eggers, S.J., Emer, J.S., Levy, H.M., Lo, J.L., Stamm, R.L.: Exploiting choice: Instruction fetch and issue on an implementable simultaneous multithreading processor. In: 23rd International Symposium on Computer Architecture (ISCA 1996), Philadelphia, PA, USA, pp. 191–202 (1996)
5. Seng, J., Tullsen, D., Cai, G.: Power-sensitive multithreaded architecture. In: 2000 IEEE International Conference on Computer Design: VLSI in Computers and Processors, Austin, TX, USA, pp. 199–206 (2000)
6. Donald, J., Martonosi, M.: An efficient, practical parallelization methodology for multicore architecture simulation. Computer Architecture Letters 5 (2006)
7. Gerdes, M., Wolf, J., Zhang, J., Uhrig, S., Ungerer, T.: Multi-Core Architectures for Hard Real-Time Applications. In: ACACES 2008 Poster Abstracts, L'Aquila, Italy (2008)
8. Pitter, C., Schoeberl, M.: Performance evaluation of a Java chip-multiprocessor. In: 3rd IEEE Symposium on Industrial Embedded Systems, Montpellier, France (2008)
9. Schoeberl, M.: SimpCon - a simple and efficient SoC interconnect. In: 15th Austrian Workshop on Microelectronics, Graz, Austria (2007)
10. Kreuzinger, J., Schulz, A., Pfeffer, M., Ungerer, T., Brinkschulte, U., Krakowski, C.: Real-time Scheduling on Multithreaded Processors. In: 7th International Conference on Real-Time Computing Systems and Applications (RTCSA 2000), Cheju Island, South Korea, pp. 155–159 (2000)
11. Altera: Quartus II Handbook Volume 4: SOPC Builder (version 8.0) (June 2008)
12. Schoeberl, M.: JavaBenchEmbedded V1.0, http://www.jopdesign.com/perf.jsp
13. Devboards: Datasheet, DBC3C40 Cyclone III Development Board, http://www.devboards.de/pdf/DBC3C40_Vs.1.04.pdf
Implementing Fine/Medium Grained TLP Support in a Many-Core Architecture Roberto Giorgi, Zdravko Popovic, and Nikola Puzovic Department of Information Engineering, University of Siena, Italy http://www.dii.unisi.it/~{giorgi,popovic,puzovic}
Abstract. We believe that future many-core architectures should support a simple and scalable way to execute the many threads that are generated by parallel programs. A good candidate for implementing an efficient and scalable execution of threads is the DTA (Decoupled Threaded Architecture), which is designed to exploit fine/medium grained Thread Level Parallelism (TLP) by using a hardware scheduling unit and relying on existing simple cores. In this paper, we present an initial implementation of the DTA concept in a many-core architecture where it interacts with other architectural components designed from scratch in order to address the problem of scalability. We present initial results, obtained using a many-core simulator written in SARCSim (a variant of UNISIM) with DTA support, that show the scalability of the solution. Keywords: many-core architectures, DTA.
1 Introduction

Many-core architectures offer an interesting possibility for efficiently utilizing the increasing number of transistors that are available on a single chip. Several many-core architectures have been developed in industry [1-3] and have been proposed in academic research projects [4, 5]. Although many-core architectures offer hundreds of computational cores, they have to be properly programmed in order to utilize their computing power potential [6]. Decoupled Threaded Architecture (DTA) is a proposal for exploiting the fine/medium grained TLP that is available in programs [7]. Even though other types of parallelism are typically present in programs, like Instruction Level Parallelism (ILP) and Data Level Parallelism (DLP), they are not the focus of this paper: we assume that the overall architecture will be offloading parts of the computation with TLP potential to small "TLP-accelerators", e.g., simple in-order cores, and that other types of accelerators could take care of ILP and DLP. DTA also provides distributed hardware mechanisms for efficient and scalable thread scheduling, synchronization, and decoupling of memory accesses. Previous research experimented with DTA using a simplified framework in order to prove the concept [7]. In this paper, we present an initial implementation of DTA support in a heterogeneous many-core architecture that is compatible with the SARC project [8] architecture, and we describe the hardware extensions that are needed for DTA support.
2 DTA Concept

The key features of the DTA concept are: i) communication and ii) non-blocking synchronization among threads, iii) decoupling of memory accesses based on the Scheduled Data-Flow (SDF) concept [9], iv) clusterization of resources in nodes (differently from SDF), and v) the use of a distributed hardware scheduler (which was centralized in SDF). Data is communicated via frames, which are portions of local memory assigned to each thread. A per-thread synchronization counter (SC) is used to represent the amount of input data that the thread needs. This counter is decremented each time a datum is stored in a thread's frame, and when it reaches zero (when all input data have arrived) that thread is ready to execute. In this way, DTA provides dataflow-like communication between threads (dataflow at the thread level) and non-blocking synchronization (threads can be synchronized using the SC, and while one thread is waiting for data, processors are available to execute other ready threads).
Fig. 1. An example of communication and synchronization among threads in DTA: thread th0 stores data into the frames of threads th1 and th2 (STORE), which read it with LOAD and can run in parallel; thread th3 (SC = 2) is synchronized with th1 and th2, since its execution does not start before both data items have been stored into its frame.
Threads in DTA are logically divided into smaller phases, which are called code blocks. At the beginning of a thread, the Pre-Load (PL) code block reads the data from the frame and stores them into registers. Once the PL phase completes, the Execution (EX) code block starts; it reads data from the registers and performs calculations. At the end of a thread, the Post-Store (PS) code block writes data to the frames of other threads. Another possibility, as in the SDF architecture [9], is to use more than one type of pipeline, one to handle PL and PS code blocks, named the Synchronization Pipeline (SP), and the other to execute EX code blocks, named the Execution Pipeline (XP); in this work, we do not want to lose the flexibility of using existing and smaller cores. Communication of data via frames is preferable, but it is not always
possible to replace accesses to global data in main memory with accesses to frame memory, so the threads can access main memory at any point during execution; in this case, a DMA-assisted prefetching mechanism can be used to completely decouple memory accesses [10]. In order to overcome the effects of wire delay, Processing Elements (PEs) in DTA are grouped into nodes. The nodes are dimensioned so that all PEs in one node can be synchronized using the same clock [7], and so that fast communication can be achieved among them using a simple interconnection network inside a node. On the other hand, communication between nodes is slower and the interconnection network is more complex, but this is necessary to achieve scalability as the available number of transistors increases. The first DTA-specific hardware structure is the Frame Memory (FM). This is a local memory that is located near each PE and is used for storing thread data. Access to the frame memory is usually fast and should not cause any stalls during execution. Another DTA-specific hardware structure is the Distributed Scheduler (DS), which consists of Local Scheduler Elements (LSEs) and Distributed Scheduler Elements (DSEs). Each PE contains one LSE that manages the local frames and forwards requests for resources to the DSE. Each node contains one DSE that is responsible for distributing the workload between the processors in the node, and for forwarding it to other nodes in order to balance the workload among them. The DSE together with all LSEs provides dynamic distribution of the workload between processors. The schedulers communicate with each other by sending messages. These messages can signal the allocation of a new frame (FALLOC request and response messages), the release of a frame (FFREE message), and stores to remote frames [7]. Besides these structures, DTA requires minimal support in the ISA of the processing element for the creation and management of DTA threads. This support includes new instructions for assigning (FALLOC) and releasing (FFREE) frames, instructions for storing data to other threads' frames (STORE) and for loading data from the frame (LOAD). In the case where PEs cannot access the main memory directly, instructions for reading and writing data to and from main memory are also needed. Further details on DTA, as well as one possible implementation, are given in Section 4.
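As a behavioral illustration of this message-based scheduling (written by us, not the actual hardware), a DSE can track the free frames of each PE, serve FALLOC requests, and queue requests it cannot serve until an FFREE arrives; these bookkeeping structures correspond to the FFT and PFQ detailed in Section 4. Choosing the PE with the most free frames is our assumption about the balancing policy.

// Illustrative only: node-level handling of FALLOC/FFREE messages at a DSE.
import java.util.ArrayDeque;
import java.util.Arrays;
import java.util.Queue;

class DseModel {
    private final int[] freeFrames;                                 // free frames per DTA-PE
    private final Queue<int[]> pendingFalloc = new ArrayDeque<>();  // queued {ip, sc, senderId}

    DseModel(int nPEs, int framesPerPE) {
        freeFrames = new int[nPEs];
        Arrays.fill(freeFrames, framesPerPE);
    }

    /** FALLOC request: returns the chosen PE, or -1 if the request had to be queued. */
    int falloc(int ip, int sc, int senderId) {
        int pe = peWithMostFreeFrames();
        if (pe < 0) { pendingFalloc.add(new int[]{ip, sc, senderId}); return -1; }
        freeFrames[pe]--;                 // frame reserved; the response message is not modeled
        return pe;
    }

    /** FFREE message: the frame returns to the pool; serve one pending request, if any. */
    void ffree(int pe) {
        freeFrames[pe]++;
        int[] req = pendingFalloc.poll();
        if (req != null) falloc(req[0], req[1], req[2]);
    }

    private int peWithMostFreeFrames() {
        int best = -1;
        for (int pe = 0; pe < freeFrames.length; pe++)
            if (freeFrames[pe] > 0 && (best < 0 || freeFrames[pe] > freeFrames[best])) best = pe;
        return best;
    }
}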
3 Heterogeneous Many-Core Architectures

Future chip multiprocessor architectures aim to address scalability, power efficiency and programmability issues. In order to achieve the goal of creating a scalable chip, the architecture is typically subdivided into nodes (Fig. 2) that are connected via a Network on Chip (NoC), where each node contains several processors (or domain-specific accelerators). The communication inside a node is done by using a faster network that connects all the elements that belong to one node (a crossbar for example). Since the network inside a node is very efficient, and adding more elements to the node would cause the degradation of its performance, the architecture scales up by adding more nodes to the configuration, and not by increasing the node size. Efficient many-core architectures should also be designed to target diverse application domains, from single threaded programs to scientific workloads. In order to address these domains, the architecture can contain heterogeneous nodes (Fig. 2).
Implementing Fine/Medium Grained TLP Support in a Many-Core Architecture
81
NoC
L2
P11
L2
P12
P11
LS
P12
…
L3
MI
An1 An2 An3 An4 An5 An6 An7 An8
P13 Node 1
P14
A21 A22 A23 A24 Node 2
Node n NoC P
Network on Chip General purpose processor that contains L1 cache A Domain-specific accelerator MI Memory Interface L2,L3 Caches LS Local Store
Fig. 2. An instance of many-core architecture with heterogeneous nodes
Each node contains a mix of general purpose processors and/or accelerators together with local memory (such as a shared L2 cache or a Local Store). Typically, a general purpose processor performs the control of the nodes and provides operating system services, and may address aggressive Instruction Level Parallelism (ILP) needs. On the other hand, domain specific accelerators will speed up applications that have specific processing needs (such as vector or multimedia). We use a shared memory model as it simplifies the programmability of the machine. Several programming models for shared memory multiprocessors have been considered recently, such as OpenMP [11] and Capsule [12].
4 Implementing DTA in a Many-Core Processor

A possible instance of a many-core processor with DTA support should contain a memory controller and multiple DTA nodes that contain DTA accelerators (Fig. 3). The system needs to contain a general purpose processor (P) that is responsible for sending the code to the DTA nodes and for initiating the DTA activity. A crossbar provides fast communication for the elements inside a node and can be part of the more complex Network on Chip (NoC) that is used to connect the nodes. The Distributed Scheduler is located in each node, and since it communicates mostly with the LSEs inside the same node, it is attached directly to the crossbar. In this study, the DTA accelerators are based on the Synergistic Processing Unit (SPU) [13] from the Cell processor. The SPU is an in-order SIMD processor which can issue two instructions in each cycle (one memory and one calculation). In order to keep the processor simple, the designers did not implement any branch prediction, and the SPU relies on the compiler to give hints on branches. It also does not have any caches, but uses a local store for data and instructions. For the purpose of running DTA
programs, the SPU is extended with the Local Scheduling Element, and the frames for the threads that execute on one SPU are stored in the Local Store (LS). The SPU's ISA is extended with the DTA-specific instructions, and communication with the rest of the system is handled by the LSE. An SPU with the DTA-specific extensions is called a DTA-PE (Fig. 3). Since the SPU contains only one pipeline, all code blocks (PL, EX and PS) execute on it in sequence. However, the SPU's pipeline is able to issue one memory and one execution instruction at the same time, and for instance it can overlap a load from the LS with subsequent instructions.
Fig. 3. An instance of a many-core architecture with DTA support
The LSE manages the threads that execute on one DTA-PE, and it contains structures with information about the current state of the DTA-PE and the frame memory: the Pre-Load Queue (PLQ) and the Waiting Table (WT). The Pre-Load Queue contains information about threads that are ready to run (SC == 0). It is implemented as a circular queue, and each entry contains the Instruction Pointer (IP) of the first instruction of the thread and the address of the thread's frame (Frame Pointer, FP). The Waiting Table contains information about threads that are still waiting for data (SC != 0). The number of entries in the WT is equal to the maximal number of frames that are available in the DTA-PE, and it is indexed by the frame number. Each entry in the WT contains the IP and FP of the thread and the synchronization count, which is decremented on each write to the thread's frame. Once the SC reaches zero, the IP and FP are transferred to the PLQ. In order to be able to distribute the workload optimally, the DSE must know the number of free frames in each DTA-PE. This information is contained in the Free Frame Table (FFT), which contains one entry with the number of free frames for each DTA-PE. When a FALLOC request is forwarded to a DTA-PE, the corresponding number of free frames is decremented, and when a FFREE message arrives, the count is incremented.
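A small software model of this per-PE bookkeeping follows (written by us for illustration; class and method names are not from the paper): a STORE into a frame decrements the SC in the Waiting Table, and when it reaches zero the (IP, FP) pair moves to the Pre-Load Queue.

// Illustrative only: Waiting Table and Pre-Load Queue behavior of an LSE.
import java.util.ArrayDeque;
import java.util.Queue;

class LseModel {
    static class WtEntry { int ip; int fp; int sc; }               // one WT slot per frame

    private final WtEntry[] waitingTable;                          // indexed by frame number
    private final Queue<int[]> preLoadQueue = new ArrayDeque<>();  // {ip, fp} of ready threads

    LseModel(int nFrames) { waitingTable = new WtEntry[nFrames]; }

    /** FALLOC: bind a new thread (ip, sc) to a free frame of this DTA-PE. */
    void falloc(int frame, int ip, int sc) {
        WtEntry e = new WtEntry(); e.ip = ip; e.fp = frame; e.sc = sc;
        waitingTable[frame] = e;
        if (sc == 0) makeReady(e);                   // a thread with no inputs is ready at once
    }

    /** STORE into a thread's frame: decrement SC, move to the PLQ when it reaches zero. */
    void storeToFrame(int frame) {
        WtEntry e = waitingTable[frame];
        if (e != null && --e.sc == 0) makeReady(e);
    }

    /** Next ready thread as {ip, fp}, or null if none. */
    int[] nextReady() { return preLoadQueue.poll(); }

    private void makeReady(WtEntry e) {
        preLoadQueue.add(new int[] { e.ip, e.fp });
        waitingTable[e.fp] = null;   // simplification: the frame itself stays allocated until FFREE
    }
}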
Since it may happen that a FALLOC request cannot be served immediately, a Pending FALLOC Queue (PFQ) is provided, which stores the pending frame requests. Each entry in this queue contains the parameters of the FALLOC request and the ID of the DTA-PE that sent the request. When a free frame is found, the corresponding entry is removed from this queue. Most of the additional hardware cost that DTA support introduces comes from the structures needed for storing information about threads. These costs are given in Table 1 for an implementation with one DTA node, using the rbe (register bit equivalent) [14] as the unit of measure. The register bit equivalent equals the area of a bit storage cell, a six-transistor static cell with high bandwidth that is isolated from its input/output circuits [14]. In the remainder of this section we give an estimate of the hardware cost that DTA introduces for the case of one node.

Table 1. Storage cost of DTA components expressed in register bit equivalent units: nDTA-PEs - number of DTA-PEs in the node, sizeFP - size of FP in bits, nPFQ - number of PFQ entries, sizeIP - size of IP in bits, sizeSC - size of SC in bits
Component   Structure   Size [rbe]
DSE         FFT         sizeFFT-entry × nDTA-PEs
DSE         PFQ         nPFQ × (sizeIP + sizeSC + sizeID)
LSE         PLQ         nF × (sizeIP + sizeFP)
LSE         WT          nF × (sizeIP + sizeFP + sizeSC)
The parameters that influence the hardware cost of the DTA support are the number of DTA-PEs in one node (nDTA-PEs), the number of frames in each DTA-PE (nF), the number of bits needed to store the Synchronization Counter (sizeSC, in bits), the Instruction Pointer size (sizeIP, in bits), and the number of entries in the PFQ (nPFQ). The required storage size for keeping an FP entry (sizeFP) in the LSE is log2 nF bits, since instead of keeping the entire address it is enough to keep the frame number, from which the address can be reconstructed using a simple translation. The Pre-Load Queue contains nF entries (FP and IP), and the Waiting Table contains nF entries (FP, IP and SC). The frames are kept in the Local Store, and no additional storage is needed for them. The Free Frame Table in the DSE contains nDTA-PEs entries, where each entry can have a value from zero to nF, since nF is the maximum number of free frames in one DTA-PE (hence, the size of an entry is sizeFFT-entry = log2 (nF+1) bits). The Pending FALLOC Queue contains nPFQ entries, where each entry contains the IP (sizeIP bits), the SC (sizeSC bits), and the ID of the sender (sizeID = log2 nDTA-PEs). The total size of the structures needed for the hardware support is the sum of the costs of the LSEs and the cost of the DSE, and it is a function of nDTA-PEs and nF. Take for example a DTA node with nDTA-PEs = 8 DTA-PEs, where each DTA-PE has a 256 kB Local Store, an Instruction Pointer of sizeIP = 32 bits, a maximal value for the SC of 256 (hence, sizeSC = 8 bits), nF = 128 frames per DTA-PE (each frame with 64 4-byte entries), and a DSE that can store 8 pending FALLOC requests (one from each DTA-PE). The frames occupy 32 kB in each Local Store (256 kB total) and the rest of the LS can be used for storing the code and other data that cannot be
communicated using the frame memory and need to be fetched from the main memory (using a DMA unit, for example). In this case, the Pre-Load Queue has 4992 bits and the Waiting Table has 6016 bits of storage, which gives a total of 1.3 kB in the LSE. The Pending FALLOC Queue takes 392 bits in the DSE, and the Free Frame Table has 64 bits of storage, which yields a total of 49 B in the DSE. Hence, all needed structures in one node have 10.8 kB of storage space, which is 0.5% when compared to the total LS size. If we double the number of frames in each DTA-PE to nF = 256 (taking ½ of the LS), the required storage space increases to 22 kB, which is 1.07% of the LS size. Increasing the number of frames even more, to 384 (taking ¾ of the LS), gives a total size of 33.05 kB, which represents 1.6% of the LS size. Based on these values, and neglecting small contributions to the total size, we arrive at the formula

size(nDTA-PEs, nF) = K1 · nDTA-PEs · nF · (2 · log2(nF) + K2)

where K1 = 11/85, K2 = 70, and the size is expressed in bytes.
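The following snippet (ours) recomputes the example above directly from the expressions in Table 1; the variable names and the rounding of the log2 terms up to whole bits are our assumptions. For the parameters above it reports 4992-bit PLQs, 6016-bit WTs, and a node total of about 10.8 kB, matching the figures in the text.

// Illustrative only: storage estimate for the DTA scheduler structures per Table 1.
public class DtaStorageEstimate {
    static int log2ceil(int x) { return 32 - Integer.numberOfLeadingZeros(x - 1); }

    public static void main(String[] args) {
        int nPE = 8, nF = 128, sizeIP = 32, sizeSC = 8, nPFQ = 8;
        int sizeFP = log2ceil(nF);                       // frame number instead of full address
        int sizeID = log2ceil(nPE);                      // sender ID in a PFQ entry
        int plq = nF * (sizeIP + sizeFP);                // Pre-Load Queue, bits
        int wt  = nF * (sizeIP + sizeFP + sizeSC);       // Waiting Table, bits
        int fft = nPE * log2ceil(nF + 1);                // Free Frame Table, bits
        int pfq = nPFQ * (sizeIP + sizeSC + sizeID);     // Pending FALLOC Queue, bits
        int totalBits = nPE * (plq + wt) + fft + pfq;
        System.out.printf("per-LSE: PLQ=%d bits, WT=%d bits%n", plq, wt);
        System.out.printf("node total: %.1f kB%n", totalBits / 8.0 / 1024.0);
    }
}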
5 Experimental Results

5.1 Methodology

In order to validate the DTA support, we extended the SARCSim simulator and tested it with several simple kernel benchmarks. SARCSim is a many-core simulator that is based on the UNISIM framework [15] and developed to simulate the SARC architecture [8]. SARCSim/UNISIM allowed us to use already existing modules (such as processors, network and memory) and to implement only the DTA-specific extensions. The configuration used for performing the simulations is the one described in the previous section, and we have varied the memory latencies throughout the experiments in order to determine whether the memory is the limiting factor for scalability. The size of the LS used in the experiments is 256 kB per DTA-PE. The tested configuration did not have caches implemented, and all requests went directly to the memory. However, we have performed tests with the memory latency set to one cycle in order to simulate the situation in which caches are present and requests always hit. The benchmarks used for performing the tests are:

– The bitcount (bitcnt) benchmark from the MiBench [16] suite is a program that counts bits in various ways for a certain number of iterations (an input parameter). Its parallelization has been performed by unrolling both the main loop and the loops inside each function.
– Fibonacci (fib) is a program that recursively calculates Fibonacci numbers. Each function call is a new DTA thread. The main purpose of this benchmark is to create a vast number of small threads in order to stress the DTA scheduler.
– Matrix multiply (mmul) is a program that just does what the name implies. Calculations are performed in threads that work in parallel. The number of working threads is always a power of two. Inputs are two n by n matrices.
– Zoom is an image processing kernel for zooming. It is parallelized by sending different parts of the picture to different processors. Input is an n by n picture.
Fig. 4. Execution times and speedup when varying memory latency for a) Bitcnt (10000), b) Fibonacci (10), c) mmul (32), and d) zoom (32). Execution time (millions of cycles) is shown using bars and speedup using lines; the X axis shows the number of DTA-PEs. Memory latency: 1 cycle vs. 150 cycles.
All these benchmarks were first hand-coded for the original DTA architecture, and then translated in order to use the SPU ISA with the DTA extensions.
5.2 Preliminary Results
The first set of experiments shows the scalability of the DTA TLP support when the number of DTA-PEs is increased from 1 to 8 in one node (Fig. 4). All benchmarks scale well except for Fibonacci, as the number of requests for new threads exceeds the DSE's capabilities. We encountered this situation in a previous study [7] and overcame the problem by using virtual frame pointers, as described in the same study. In this work, we have not yet considered the use of virtual frame pointers. As expected, the configuration with the memory latency set to 1 cycle has a lower execution time than the configuration with the memory latency set to 150 cycles. However, the scalability is the same in both cases, and the speedup is close to ideal.
6 Related Work
Most of the leading hardware producers have recently introduced many-core architectures. Examples are Cyclops-64 [1], a multi-core multithreaded chip currently under development by IBM, UltraSPARC T2 [2] from Sun Microsystems, and Plurality [3], which uses a pool of RISC processors with uniform memory, a hardware scheduler, a synchronizer and a load balancer. DTA mainly differs from these architectures in the execution model, which in the case of DTA is based on Scheduled DataFlow. Academic research projects are also focusing on many-core architectures. Speculative Data-Driven Multithreading (DDMT) [5] exploits the concept of dataflow at the thread level, like DTA. The main difference is that DDMT uses static scheduling, while in DTA scheduling is done dynamically at run time in hardware. TRIPS [4] uses a tiling paradigm with different types of tiles, which are reconfigurable in order to exploit different types of parallelism. TRIPS uses the dataflow concept inside a thread and control flow at the thread level, which is the opposite of what DTA does. TAM [17] defines a self-scheduled machine language with parallel threads that communicate among themselves in a dataflow manner, and that can be compiled to run on any multiprocessor system without hardware support (unlike DTA, which has hardware support).
7 Conclusions
In this paper, we have presented one possible implementation of TLP support for a many-core architecture, which targets fine/medium grained threads via the hardware scheduling mechanism of the DTA. The initial tests show that the scalability of the architecture is promising in all cases up to 8 processors per node. The overall conclusion is that, since this implementation of TLP support scales well, it is well suited to the many-core environment. As future work, we want to test different configurations with more nodes and to implement some techniques present in native DTA (e.g., virtual frame pointers). We will also focus on a tool that would allow us to automatically extract DTA code from high-level programming languages by using methods like OpenMP, which would allow us to perform tests with more benchmarks. Acknowledgements. This work was supported by the European Commission in the context of the SARC integrated project #27648 (FP6) and by the HiPEAC Network of Excellence (FP6 contract IST-004408, FP7 contract ICT-217068).
References 1. Almási, G., et al.: Dissecting Cyclops: a detailed analysis of a multithreaded architecture. SIGARCH Comput. Archit. News 31(1), 26–38 (2003) 2. Shah, M., et al.: UltraSPARC T2: A highly-threaded, power-efficient, SPARC SOC. In: IEEE Asian Solid-State Circuits Conference, ASSCC 2007, Jeju (2007)
3. Plurality architecture, http://www.plurality.com/architecture.html 4. Sankaralingam, K., et al.: Exploiting ILP, TLP, and DLP with the polymorphous TRIPS architecture. In: Proceedings of the 30th annual international symposium on Computer architecture, pp. 422–433. ACM Press, San Diego (2003) 5. Kyriacou, C., Evripidou, P., Trancoso, P.: Data-Driven Multithreading Using Conventional Microprocessors. IEEE Trans. Parallel Distrib. Syst. 17(10), 1176–1188 (2006) 6. Harris, T., et al.: Transactional Memory: An Overview. IEEE Micro. 27(3), 8–29 (2007) 7. Giorgi, R., Popovic, Z., Puzovic, N.: DTA-C: A Decoupled multi-Threaded Architecture for CMP Systems. In: 19th International Symposium on Computer Architecture and High Performance Computing, SBAC-PAD 2007, Gramado, Brasil, pp. 263–270 (2007) 8. SARC Integrated Project, http://www.sarc-ip.org 9. Kavi, K.M., Giorgi, R., Arul, J.: Scheduled Dataflow: Execution Paradigm, Architecture, and Performance Evaluation. IEEE Transaction on Computers 50(8), 834–846 (2001) 10. Giorgi, R., Popovic, Z., Puzovic, N.: Exploiting DMA mechanisms to enable non-blocking execution in Decoupled Threaded Architecture. In: Proceedings of the Workshop on Multithreaded Architectures and Applications (MTAAP 2009), held in conjunction with the 23rd IEEE International Parallel and Distributed Processing Symposium (IPDPS 2009), Rome, Italy, May 25-29, 2009, pp. 1–8 (2009) ISBN 978-1-4244-3750-4 11. The OpenMP API specification for parallel programming, http://openmp.org 12. Pierre, P., Yves, L., Olivier, T.: CAPSULE: Hardware-Assisted Parallel Execution of Component-Based Programs. In: Proceedings of the 39th Annual IEEE/ACM International Symposium on Microarchitecture. IEEE Computer Society, Los Alamitos (2006) 13. Kahle, J.A., et al.: Introduction to the cell multiprocessor. IBM J. Res. Dev. 49, 589–604 (2005) 14. Flynn, M.J.: Computer Architecture. Jones and Bartlett Publishers, Sudbury (1995) 15. August, D., et al.: UNISIM: An Open Simulation Environment and Library for Complex Architecture Design and Collaborative Development. IEEE Comput. Archit. Lett. 6(2), 45–48 (2007) 16. Guthaus, M.R., et al.: MiBench: A free, commercially representative embedded benchmark suite. In: Proceedings of the Workload Characterization, WWC-4, 2001. IEEE International Workshop, pp. 3–14. IEEE Computer Society, Los Alamitos (2001) 17. Culler, D.E., et al.: TAM - a compiler controlled threaded abstract machine. J. Parallel Distrib. Comput. 18(3), 347–370 (1993)
Implementation of W-CDMA Cell Search on a FPGA Based Multi-Processor System-on-Chip with Power Management
Roberto Airoldi1, Fabio Garzia1, Tapani Ahonen1, Dragomir Milojevic2, and Jari Nurmi1
1 Tampere University of Technology, Department of Computer Systems, P.O. BOX 553, FIN-33101, Tampere, Finland, {firstname.lastname}@tut.fi
2 Université Libre de Bruxelles, Bio, Electro and Mechanical Systems, CP165/56, av. F. Roosevelt 50, B-1050 Brussels, Belgium, {Dragomir.Milojevic}@ulb.ac.be
Abstract. In this paper we describe a general purpose, homogeneous Multi-Processor System-on-Chip (MPSoC) based on 9 processing clusters using COFFEE RISC processors and a hierarchical Network-on-Chip, implemented on an FPGA device. The MPSoC platform integrates a cluster clock gating technique, enabling independent core and memory sleep modes. The low cluster turn-on delay allows frequent use of this technique, resulting in power savings. In order to quantify the performance of the proposed platform and the reduction of power consumption, we implement the Target Cell Search part of WCDMA, a well-known SDR application. We show that the proposed MPSoC platform achieves an important speed-up (7.3X) when compared to a comparable single-processor platform. We also show that a significant reduction in dynamic power consumption can be achieved (50% for the complete application) using the proposed cluster clock-gating technique.
1 Introduction and Related Work
Embedded applications, such as Software Defined Radio (SDR), require more and more processing power that has to be delivered under the constraints of low cost and low power consumption. For the most demanding embedded applications standard DSP cores alone cannot be used any more, because they do not have enough processing power. This is one of the reasons why in the past years dedicated solutions based on specific hardware platforms have been proposed in the literature. One can mention a multithreaded, low power processor for SDR applications combining classical integer and SIMD units and embedded into a more complex SoC, proposed in [1]. The typical power consumption of this implementation is 150mW per core, running at 600MHz, with a 0.9V supply. In the context of MPSoC platforms for SDR applications, a fully programmable, 4-SIMD-core architecture has been proposed in [2], achieving 2Mbps Wide-band Code-Division Multiple-Access (WCDMA) for a power consumption of 270mW,
and 24Mbps 802.11a at about 370mW for 90nm technology. More specifically, a low power cell search in WCDMA has been described in [3], with power consumption of 36mW for a circuit built in 350nm. The increased performance of the new generation FPGA circuits, together with low power strategies allow the implementation of the complex SoCs on these devices, some of them being dedicated for the SDR applications [4]. Implementation of OFDM on FPGA has been described in [5] and a more complex SoC using a Network-on-Chip as communication infrastructure has been described in [6]. In this paper we present a medium size, general purpose, homogeneous MultiProcessor System-on-Programmable-Chip (MPSoPC), called Coffee Machine. The platform is based on 9 Coffee RISC processors described in [7] and incorporates a hierarchical Network-on-Chip (NoC) for inter-processor communication, introduced in [8,9,10]. The proposed platform runs the Target Cell Search algorithm, part of the WCDMA[11], a widely adopted air interface for 3rd Generation (3G) mobile communication systems. This paper is organized as follows. In Section 2 we will give an overview of the proposed MPSoC platform architecture and will provide a more detailed description of the different building blocks. We will further concentrate on the utilized software management of clock gating islands for the processing cores and local memories. In Section 3 we will shortly discuss the WCDMA Target Cell Search application mapped on the MPSoC platform and focus on some of the implementation details. Section 4 gives more detailed information on the experimental set-up and system performance: processing capability, area, power consumption with and without cluster clock-gating. Finally, Section 5 concludes the paper and discusses future directions.
2 MPSoC Platform with Dynamic Power Management
Coffee Machine is an MPSoC platform derived from the Silicon Cafe template architecture developed at Tampere University of Technology. Coffee Machine is composed of nine computational clusters (CCs) built around the Coffee RISC [7] processor core. Transactions between CCs take place through a mesh network that taps directly into the local, switched communication infrastructure. A simplified view of the Coffee Machine is provided in figure 1. The nine CCs of the Coffee Machine enable efficient parallelization of the target application. Operation is conducted hierarchically by means of centralized task distribution and data flow management. These run-time management functions include allocating resources for processing tasks as well as parallelizing and scheduling data streams. The management functions are mapped to the central CC (CC4 in figure 1) making it in essence the system master. The master CC is directly connected to off-chip peripherals and main memory. Acting as slaves to the master CC, the remaining eight CCs are utilized for parallel processing. The master CC’s central position in the network topology ensures balanced communication latency across the slaves as the maximum number of hops between any slave and the master is two. This enables wide utility
Fig. 1. Simplified view of the Coffee Machine MPSoC platform
Fig. 2. Computational cluster built around the Coffee RISC processor core
of the slaves for varying subtasks regardless of the physical location. Workload distribution between the eight slave CCs is highly uniform. There are power-of-two elements in most of the target application's data streams. Consequently, a significant amount of work can be divided into eight approximately equivalent and independent threads. Figure 2 illustrates the CC structure in more detail. Each CC contains a Coffee RISC processor with local data and instruction scratchpads, and a Bridge Interface (BIF). The BIF is connected with the global network and is composed of an initiator side, a target side and a local arbiter. The master CC contains off-chip I/O peripherals in addition. Data and instruction scratchpad capacities are 256kB and 32kB for the master, 32kB and 16kB for a slave CC. System-wide access to the local resources (memories and peripherals) is provided through non-blocking switching interconnections.
2.1 Interconnection Architecture
CCs have two contending initiators (masters): the processor and the initiator side of the bridge (B/I). Targets (slaves) are accessed in parallel through the local switches unless both request the exact same target at the exact same time. Arbitration both here and in the global switches is based on programmable priority sequences. In case of negative arbitration result, stop signal is asserted forcing the request to be held until go is issued. Processor’s requests to access remote peripherals, that is, peripherals of another cluster, are directed by the local switches to the target side of the bridge (B/T). The B/T contains a run-time reconfigurable lookup table where routing request sequences are assigned to memory pages. These sequences steer the request to its destination cluster where it is absorbed by the respective B/I. The B/I executes remote read and write operations on the local cluster as well as responds to remote reads by passing the data to the B/T for return delivery.
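As an illustration of how a remote access could be steered by the B/T's page-based lookup table, consider the following software model; the page size, table depth and field layout are assumptions made for the sketch, not the actual hardware interface.

#include <stdint.h>

#define PAGE_SHIFT 12            /* assumed 4 kB remote memory pages        */
#define NUM_PAGES  64            /* assumed number of mapped remote pages   */

/* A routing request sequence: at most two hops separate any slave from the
 * master, so two switch selections suffice in this model.                  */
typedef struct {
    uint8_t hop[2];
    uint8_t hop_count;
} route_t;

static route_t page_table[NUM_PAGES];   /* run-time reconfigurable, as in the B/T */

/* Map a remote address to the routing sequence that steers the request to
 * the cluster owning the page containing that address.                     */
static route_t lookup_route(uint32_t remote_addr)
{
    return page_table[(remote_addr >> PAGE_SHIFT) % NUM_PAGES];
}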
Multicasting and broadcasting can be achieved in two ways. The data and its remaining routing request sequence can be switched to multiple output ports, thereby supporting regular casting patterns. For non-regular casting patterns, data can be targeted to a remote B/T in addition to the remote peripheral. The remote B/T will pass a copy to another CC, where the same procedure can be repeated if necessary. As indicated in Table 2, the speedup achieved with the Target Cell Search application was 7.3X in comparison to a single-processor system. The broadcast mechanism significantly reduces the communication overhead due to the distribution of samples to the slave cores. In fact, the overall speedup would have been less than 5.2X without broadcasting support. Details of the architecture are described in [10] and [9].
2.2 Power Management Employed
In an MPSoC it is likely to have an ongoing data transfer, while there’s nothing to process at some part of the chip. This is the case in the Coffee Machine for example during retrieval of results from a slave CC’s data scratchpad. It is thus advantageous to manage the local memory and processing clocks independently from each other. Based on initial experiments with the power management techniques available on Stratix II FPGAs, a software controlled clock gating scheme was chosen for the Coffee Machine. The scheme allows disabling idle processors and/or memories instantly for arbitrary periods of time. There is no adverse effect on the maximum operating frequency and the associated resource utilization overhead is relatively low. Clock gating state is modifiable through a control register with individual enable/disable bits for the memory and processing clocks of the slave CCs.
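A minimal sketch of how software might drive such a clock-gating control register is shown below; the register address and bit assignment are purely illustrative assumptions.

#include <stdint.h>

/* Hypothetical memory-mapped control register: one enable bit per slave CC
 * for the processing clock and one for the memory clock (address and bit
 * layout are assumptions).                                                  */
#define CLKGATE_REG      (*(volatile uint32_t *)0x80000000u)
#define CPU_CLK_BIT(cc)  (1u << (cc))         /* processing clock of slave cc (0..7) */
#define MEM_CLK_BIT(cc)  (1u << ((cc) + 8))   /* memory clock of slave cc (0..7)     */

/* Stop a slave's processor clock while keeping its scratchpad clocked, e.g.
 * while results are being retrieved from its data scratchpad over the NoC.  */
static void gate_cpu_keep_memory(int cc)
{
    uint32_t v = CLKGATE_REG;
    v &= ~CPU_CLK_BIT(cc);   /* disable the processing clock  */
    v |=  MEM_CLK_BIT(cc);   /* keep the memory clock enabled */
    CLKGATE_REG = v;
}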
3 Mapping the Cell Search on the Proposed MPSoC
In this work we will concentrate on one particular part of the W-CDMA system, the target cell search, an operation that can be divided into three steps:
1. Slot Synchronization
2. Frame Synchronization
3. Scrambling Code Identification
These functional blocks will be mapped on the proposed MPSoC platform using the power management technique described in the previous section.
3.1 Slot Synchronization
During the Slot Synchronization phase (task graph in Fig. 3), the receiver uses the sequence obtained from the Primary-Synchronization Channel (P-SCH) to synchronize with a cell. The samples coming from one slot (2560 samples) are analyzed in this phase. The operation is performed in two steps.
Fig. 3. Task graph for the slot synchronization. Tasks in black executed by the main core, tasks in white executed in parallel by all the slaves.
In the first step we identify the possible slot boundary. This is achieved by correlating 256 samples from the input sequence with the 256 coefficients of the known start code. A value higher than a fixed threshold in the correlation output indicates that the slot boundary has been detected. The task graph in Fig. 3 shows how the workload is distributed. In this first part, the main core broadcasts the coefficients for the correlation and distributes a part of the input sequence (256/8 values to each slave core). The slave cores perform the correlations (tasks in white), and the main core evaluates whether the peak has been found. If not, new samples are sent to the slaves and the same operations are repeated. In the second step, the system tries to synchronize with the strongest path, since it is possible to have different translated versions of the synchronization signal in the air. This is achieved by calculating correlations over a fixed time frame (1024 samples) for the next 4 slots, and returning the average. Also in this case the main core takes care of data distribution and of calculating the average, while the slaves calculate the correlations. In the context of the power analysis, this last step will not be taken into account, because its simulation cannot be performed using the current environment due to computational and memory resource limitations. For the idle time estimation of this operation we will simply assume that all cycles are busy. The number of idle cycles decreases significantly with the peak distance. For example, we calculated that 39% of the total platform cycles are idle if a peak is found after 50 samples, while it is 20% for a peak detected after 470 samples.
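The per-slave work in the first step amounts to a partial correlation against the known start code; the following plain-C sketch models that decomposition (sample width, variable names and the threshold test are our assumptions).

#include <stdint.h>

#define PSC_LEN  256                    /* length of the known start code   */
#define N_SLAVES 8
#define CHUNK    (PSC_LEN / N_SLAVES)   /* 256/8 = 32 values per slave core */

/* Slave side: partial correlation over this core's 32-sample chunk. */
static int32_t slave_partial_corr(const int16_t *samples, const int16_t *coeff)
{
    int32_t acc = 0;
    for (int n = 0; n < CHUNK; n++)
        acc += (int32_t)samples[n] * coeff[n];
    return acc;
}

/* Master side: accumulate the eight partial results and apply the threshold. */
static int master_peak_found(const int32_t partial[N_SLAVES], int32_t threshold)
{
    int32_t corr = 0;
    for (int s = 0; s < N_SLAVES; s++)
        corr += partial[s];
    return corr > threshold;
}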
3.2 Frame Synchronization
The receiver uses the Secondary-Synchronization Channel (S-SCH) to perform Frame Synchronization, identifying the code group of the cell found in the Slot Synchronization step of the algorithm. Frame Synchronization consists of 16 parallel correlations. The correlations are computed over all the slots that compose one frame, that is 15 slots. Correlations are executed between the received signal and all possible Secondary Synchronization Code Sequences (SSCS), which are 16. Considering the task graph in Fig. 4, the main core transfers two of the 16 SSCSs data to each slave cluster and then 4 samples of the incoming data stream in broadcast.
Fig. 4. Task graph for the frame synchronization
Fig. 5. Task graph for the scrambling code identification
When the 16 correlations are computed, the master core builds the received codeword. Then each core compares the obtained codeword with a local subset of all possible codewords to identify the code group. Each comparison yields a weight, and the master core finds the maximum among these weights. In this second phase, the share of idle cycles is 37.5%.
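The codeword construction and code-group search can be pictured with the following sketch; the per-slot maximum selection and the matching-positions weight are our reading of the description, not the exact metric used on the platform.

#include <stdint.h>

#define N_SSCS          16   /* Secondary Synchronization Code Sequences */
#define SLOTS_PER_FRAME 15

/* Master side: for each slot, pick the SSCS with the strongest correlation;
 * the 15 winning indices form the received codeword.                        */
static void build_codeword(const int32_t corr[SLOTS_PER_FRAME][N_SSCS],
                           uint8_t codeword[SLOTS_PER_FRAME])
{
    for (int slot = 0; slot < SLOTS_PER_FRAME; slot++) {
        int best = 0;
        for (int s = 1; s < N_SSCS; s++)
            if (corr[slot][s] > corr[slot][best]) best = s;
        codeword[slot] = (uint8_t)best;
    }
}

/* Slave side: weight of one candidate codeword from the local subset,
 * here simply the number of matching positions.                             */
static int codeword_weight(const uint8_t rx[SLOTS_PER_FRAME],
                           const uint8_t cand[SLOTS_PER_FRAME])
{
    int w = 0;
    for (int i = 0; i < SLOTS_PER_FRAME; i++)
        if (rx[i] == cand[i]) w++;
    return w;
}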
3.3 Scrambling Code Identification
During Scrambling Code Identification, the receiver determines the exact primary scrambling code used by the cell. The primary scrambling code is identified through symbol-by-symbol correlation between the received data stream and all possible codes within the code group, identified during Frame Synchronization. This phase is characterized by only 4 tasks (Fig. 5). There is first a sequential transfer of the possible scrambling codes to different slave cores, and then a broadcast transfer of the input samples. At this point slave cores perform the correlations and the master searches for the maximum. In this case the amount of data to be sent sequentially is much higher than the amount of data sent by broadcast. Since most of the time is spent in the transfer, the estimated share of idle cycles for this step is 78.5%.
4 Experimental Set-Up and Results
The MPSoC platform is described in VHDL. The target FPGA device is Altera Stratix II EP2S180. The application has been compiled and run in the simulation framework of the proposed MPSoC platform. Switching activity information is passed to a power analysis tool using Value Change Dump (VCD) files created during simulation.
4.1 Synthesis Results
The Coffee Machine utilizes 72% of the programmable resources on an Altera EP2S180 FPGA device and works at a maximum frequency of 75 MHz. The area breakdown of the individual functional blocks of one processing cluster, the global communication and the whole system is given in Table 1.
Table 1. Area breakdown of the Coffee Machine

Component             Adapt. LUT   Registers   Utilization %
CPU                   7779         4944        7.4
Init NI               12           114         0.1
Target NI             65           110         0.1
Total/Cluster         8085         6887        7.7
Global Communication  2751         3440        3.1
Total                 76648        50218       72
Table 2. Cycle count of the Target Cell Search on a single processing cluster and Coffee Machine

Cell Search Step                     Single [cycles]   MPSoC [cycles]   Speed-Up
Single Correlation Point             12381             2546             5X
Slot Synchronization (Fixed Part)    52890764          7147387          7.5X
Frame Synchronization                3750593           471458           8X
Scrambling Code ID                   149973            56203            2.7X
Entire Application                   57410380          7802348          7.3X
4.2 Processing Performance of the Platform
For comparison purposes we also implemented the same application on a system containing a single processing cluster only (using the same processor instance as in the MPSoC version). Results of cycle count comparisons for different steps of the application are indicated in Table 2. The first step (Slot Synchronization) is characterized by an undefined number of correlations followed by a fixed part of the code. The number of correlations depends on how long it takes to detect the first peak. For benchmarking purposes, we consider the speedup in calculation of a single correlation point and the speedup related with the fixed part. We observe a speedup of 5 of the proposed MPSoC platform over the single processor cluster. The less than ideal speedup is mainly attributed to the sequential transfer of the samples to all slave cores. The fixed part, that requires only an initial transfer, gives a 7.5X speedup. The speedup for Frame Synchronization is 8X. Such a high value can be explained by the fact that all the transfers use the broadcast mechanism, thus minimizing the communication overhead. The Scrambling Code Identification Step gives the worst speedup (2.7X). In this case the slave cores use different coefficients, known only after Frame Synchronization. Therefore the transfer of these coefficients cannot benefit from the broadcast mechanism, introducing a significant transfer time overhead. We evaluated also the speedup related with the execution of the entire application, supposing that the first peak is found after 50 correlations. The result is 7.3X. Even though the application requires a large number of data transfers,
its MPSoC implementation benefits from the code parallelization and from hardware mechanisms that reduce the communication overhead. Considering the FPGA implementation, the entire cell search requires 104ms. The largest fraction of time is spent in slot synchronization. Since it cannot be done in real time (one slot is 0.67ms wide), the buffering of 5 slots is required (25 KB altogether). After that, the receiver should keep synchronization with the incoming samples, for example by buffering only one slot and overwriting the values when the buffer is full. The frame synchronization is performed in 6.2ms, and since it is faster than the acquisition of a frame of samples, it does not require additional buffering. The same considerations apply to the scrambling code identification, which requires 0.75ms. We are assuming here that the three steps are executed sequentially.
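As a quick consistency check (ours, not from the paper), the reported times follow directly from the cycle counts of Table 2 and the 75 MHz operating frequency:

t_{\text{cell search}} \approx \frac{7\,802\,348\ \text{cycles}}{75\ \text{MHz}} \approx 104\ \text{ms}, \qquad
t_{\text{scrambling code ID}} \approx \frac{56\,203\ \text{cycles}}{75\ \text{MHz}} \approx 0.75\ \text{ms}.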
4.3 Power Consumption
Power consumption figures of the initial design and the design including processing cluster clock gating technique are shown on Figure 6 (respectively indicated on abscissa as NCLKG and CLKG) for different steps of the Target Cell Search. We present dynamic power consumption components (logic and routing power, respectively shown as light and dark gray boxes) of: the individual clusters (starting from Cluster 0 to Cluster 8) and the complete COFFEE Machine platform. Static power dissipation is not included, because it remains constant with and without clock gating and has been reported at 1.68mW. The power consumption
Fig. 6. Power consumption of the Coffee Machine without and with proposed cluster clock gating
reduction for different steps is respectively: 18, 44 and 46%. Total reduction in power consumption for the complete Target Cell Search application is 33%. While the absolute power dissipation is high in comparison to ASIC implementations proposed in the literature, one can argue that this is because of the low power efficiency of FPGA circuits. A power scaling factor for FPGA to ASIC technology migration can be used for comparison purposes, such as the one proposed in [12] where 90nm ASIC and FPGA designs were compared suggesting a factor of 12 reduction in dynamic power consumption.
5 Conclusions
We presented a homogeneous MPSoC platform with 9 processing clusters. The architecture is based on computational clusters built around the COFFEE RISC processor and a hierarchical NoC with broadcasting support. For a representative SDR application, the Target Cell Search of the W-CDMA, we show that such platform can provide enough processing power running on an FPGA circuit (Altera Stratix II EP2S180 device; with an operating frequency of 75MHz). It was noted that broadcasting capability of the proposed MPSoC platform contributed significantly to the speedup achieved with the parallelized application. The additional overall speedup thanks to broadcasting was 1.41X, while some parts of the application enjoyed more than twice the performance of a system without broadcasting support. Furthermore, we described a simple power management technique based on cluster clock gating, enabling individual processor and memory sleep modes. We show that depending on the subtask of the application, the power savings vary from 18% to 46%, for a total of 33% for the complete Target Cell Search application.
References 1. Mamidi, S., Blem, E.R., Schulte, M.J., Glossner, J., Iancu, D., Iancu, A., Moudgill, M., Jinturkar, S.: Instruction set extensions for software defined radio on a multithreaded processor. In: CASES 2005: Proceedings of the 2005 international conference on Compilers, architectures and synthesis for embedded systems, pp. 266–273. ACM, New York (2005) 2. Lin, Y., Lee, H., Harel, Y., Woh, M., Mahlke, S., Mudge, T., Flautner, K.: A System Solution for High-Performance, Low Power SDR. In: Proceeding of the SDR 2005 Technical Conference and Product Exposition (2005) 3. Li, C.-F., Chu, Y.-S., Ho, J.-S., Sheen, W.-H.: Cell Search in WCDMA Under Large-Frequency and Clock Errors: Algorithms to Hardware Implementation. IEEE Transactions on Circuits and Systems I: Regular Papers 55(2), 659–671 (2008) 4. Jenkins, C., Ekas, P.: Low-power Software-Defined Radio Design Using FPGAs. In: Software Defined Radio Technical Conference and Product Exposition, Orlando, Florida, November 13-16 (2006) 5. Dick, C., Harris, F.: FPGA implementation of an OFDM PHY. In: Conference Record of the Thirty-Seventh Asilomar Conference on Signals, Systems and Computers, 9-12 November, vol. 1, pp. 905–909 (2003)
6. Delorme, J., Martin, J., Nafkha, A., Moy, C., Clermidy, F., Leray, P., Palicot, J.: A FPGA partial reconfiguration design approach for cognitive radio based on NoC architecture. In: Circuits and Systems and TAISA Conference, NEWCAS-TAISA 2008. 2008 Joint 6th International IEEE Northeast Workshop, June 2008, pp. 355– 358 (2008) 7. Kylli¨ ainen, J., Ahonen, T., Nurmi, J.: General-purpose embedded processor cores the COFFEE RISC example. In: Nurmi, J. (ed.) Processor Design: System-on-Chip Computing for ASICs and FPGAs, ch.5, pp. 83–100. Kluwer Academic Publishers / Springer Publishers (June 2007) ISBN-10: 1402055293, ISBN-13: 978-1-4020-5529-4 8. Ahonen, T., Nurmi, J.: Synthesizable switching logic for network-on-chip designs on 90nm technologies. In: Proceedings of the 2006 International Conference on IP Based SoC Design (IP-SOC 2006), December 6-7, pp. 299–304. Design and Reuse S.A (2006) 9. Ahonen, T., Nurmi, J.: Programmable switch for shared bus replacement. In: Proceedings of the 2006 International Conference on Ph.D. Research in Microelectronics and Electronics (PRIME 2006), June 11-15, pp. 241–244. IEEE, Los Alamitos (2006) 10. Ahonen, T., Nurmi, J.: Hierarchically Heterogeneous Network-on-Chip. In: The International Conference on Computer as a Tool, EUROCON, 2007, September 9-12, pp. 2580–2586 (2007) 11. Wang, Y.-P.E., Ottosson, T.: Cell search in W-CDMA 18(8), 1470–1482 (2000) 12. Kuon, I., Rose, J.: Measuring the Gap Between FPGAs and ASICs. IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems 26(2), 203–215 (2007)
A Multiprocessor Architecture with an Omega Network for the Massively Parallel Model GCA
Christian Schäck, Wolfgang Heenes, and Rolf Hoffmann
Technische Universität Darmstadt, FB Informatik, FG Rechnerarchitektur, Hochschulstraße 10, D-64289 Darmstadt, Germany
{schaeck,heenes,hoffmann}@ra.informatik.tu-darmstadt.de
Abstract. The GCA (Global Cellular Automata) model consists of a collection of cells which change their states synchronously depending on the states of their neighbors like in the classical CA (Cellular Automata) model. In differentiation to the CA model the neighbors are not fixed and local, they are variable and global. The GCA model is applicable to a wide range of parallel algorithms. In this paper a general purpose multiprocessor architecture for the massively parallel GCA model is presented. In contrast to a special purpose implementation of a GCA algorithm the multiprocessor system allows the implementation in a flexible way through programming. The architecture mainly consists of a set of processors (Nios II) and a network. The Nios II features a general-purpose RISC CPU architecture designed to address a wide range of applications. The network is a well-known omega network. Only read-accesses through the network are necessary in the GCA model leading to a simplified structure. A system with up to 32 processors was implemented as a prototype on an FPGA. The analysis and implementation results have shown that the performance of the system scales with the number of processors. Keywords: Global Cellular Automata, multiprocessor architecture, omega network, FPGA.
1 Introduction
The GCA (Global Cellular Automata) model [1] is an extension of the classical CA (Cellular Automata) model [2]. In the CA model the cells are arranged in a fixed grid with fixed connections to their local neighbors. Each cell computes its next state by the application of a local rule depending on its own state and the states of its neighbors. The data accesses to the neighbors' states are read-only and therefore no write conflicts can occur. The rule can be applied to all cells in parallel and therefore the model is inherently massively parallel. The CA model is suited to all kinds of applications with local communication, like physical fields, lattice-gas models, models of growth, moving particles, fluid flow, routing problems, picture processing, genetic algorithms, and cellular neural networks.
The GCA model is a generalization of the CA model which is also massively parallel. It is not restricted to the local communication because any cell can be a neighbor. Furthermore the links to the neighbors are not fixed; they can be changed by the local rule from generation to generation. Thereby the range of parallel applications is much wider for the GCA model. Typical applications besides the CA applications are graph algorithms [3], hypercube algorithms, logic simulation [4], numerical algorithms, communication networks, neural networks, and graphics. The state of a GCA cell consists of a data part and one or more pointers (Fig. 1). The pointers are used to dynamically establish links to global neighbors. We call the GCA model one handed if only one neighbor can be addressed, two handed if two neighbors can be addressed and so on. In our investigations about GCA algorithms we found out that most of them can be described with only one link.
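A plain-C software model of a one-handed GCA cell and of one synchronous generation makes the model concrete (the names and the double-buffered update below are our rendering, not a prescribed implementation):

#include <stddef.h>

/* One-handed GCA cell: a data part and a single pointer (link) to a global
 * neighbor; the link itself may be changed by the rule every generation.    */
typedef struct {
    int    data;
    size_t link;                     /* index of the current global neighbor */
} gca_cell;

/* One synchronous generation: every cell reads its neighbor (read-only, so
 * no write conflicts) and computes its next data and next link into 'next'. */
static void gca_generation(const gca_cell *cur, gca_cell *next, size_t n,
                           gca_cell (*rule)(gca_cell self, gca_cell neighbor))
{
    for (size_t i = 0; i < n; i++)
        next[i] = rule(cur[i], cur[cur[i].link]);
}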
Fig. 1. The operation principle of the GCA
The aim of our research is the hardware and software support of this model. There are mainly three possibilities for an implementation.
1. Fully Parallel Architecture. A specific GCA algorithm is directly mapped into the hardware using registers, operators and hardwired links which may also be switched if necessary. The advantage of such an implementation is a very high performance [5], but the problem size is limited by the hardware resources and the flexibility to apply different rules is low.
2. Partially Parallel Architecture with Memory Banks. This architecture [6,7] also offers a high performance, is scalable and can cope with a large number of cells. The flexibility to cope with different rules is restricted.
3. Multiprocessor Architecture. This architecture is not as powerful as the above two, but it has the advantage that it can be tailored to any GCA problem by programming. It also allows integrating standard or other computational models.
In [8] we described a 16-bit multiprocessor architecture with a multiplexer network, used for dedicated algorithms like the fast Fourier transform and bitonic merge. In this contribution we are presenting a multiprocessor architecture for the GCA model with an omega network [9], based on Nios II, which was also implemented in
FPGA logic. The performance of multiprocessor SoC architectures depends on the communication network and the processor architecture. Several contributions have described and compared different bus architectures [10]. Multiprocessor architectures with the Nios II softcore were presented in [11].
2 Multiprocessor Architecture
2.1 Overview
– The system consists of p processors with internal memories (program and data) and an omega interconnection network (Fig. 2).
– The data memories are implemented as dual port memories. One port is directly connected to its associated processor. The second port is connected to the network, enabling all other processors to access the data in the memory.
– Each data memory holds a part of the GCA cell field of the application.
– A processor can modify only its own cells stored in its internal data memory.
– Each processor has only read access to the other data memories; write accesses need not be implemented due to the GCA model.
– The local GCA rule is programmable by processor instructions.
– The processor instructions support the accesses to the cells in the internal memory and the read accesses to external cells stored in the other memories.
One processor with its associated dual-port data memory is called a Processing Unit (PU). The omega network interconnects the processing units. Therefore the network complexity depends on the complexity of the communication patterns which is needed for the class of GCA algorithms to be implemented. For many GCA algorithms [12] the communication pattern is rather simple, which simplifies the design of the network. Simple or specialized networks can be implemented with multiplexers or fixed connections [8,9]; complex networks have to be able to manage concurrent read accesses to arbitrary external memory locations. As in the GCA model a cell is not allowed to modify the contents of another cell, the
Fig. 2. General system architecture with datapath
network design is simplified because write accesses need not be implemented. Memory arbitration of concurrent read accesses is handled in a decentralized manner within the switching elements of the omega network. Thereby concurrent read accesses need not be handled by each PU, which reduces the overall control logic. Therefore it makes no difference whether a PU has to wait due to a blocked network or due to a concurrent read access.
2.2 The NIOS II Softcore Processor
The Nios II processor [13] is a general-purpose RISC processor core, providing:
– Full 32-bit instruction set, data path, and address space
– 32 general-purpose registers
– Dedicated instructions for computing 64-bit and 128-bit products of multiplication
– Floating-point instructions for single-precision floating-point operations
– Custom instructions
For the implementation of the multiprocessor architecture, the custom instructions [13] turned out to be very useful (see section 2.4).
2.3 The Omega Network
The omega network is commonly used in parallel computer systems. It belongs to the category of blocking networks. Figure 3 shows the possible switching states of one switching element. The switching elements are able to process a broadcast request. If two processing units want to access the very same memory location, the concurrent access can be resolved with broadcasting. Figure 4 shows the control signals of a switching element. Input 0 (ain, req0) has higher priority compared to input 1. The switching elements consist of combinatorial logic only. Figure 5 shows the multiprocessor architecture with eight processing units. For a better understanding of the architecture the processing units are shown twice. This example also shows that each processor can access its own memory either internally or by using the network. Different accesses to the same memory location are sequentialized by using different priorities.
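For an 8x8 omega network such as the one in Fig. 5, routing is self-directed by the destination address; the sketch below models the classic rule in software and is an illustration of the topology, not of the actual switch logic.

#define STAGES 3   /* log2(8) stages of 2x2 switching elements */

/* At stage s (s = 0 first), bit (STAGES-1-s) of the destination selects the
 * switch output: 0 = upper output, 1 = lower output. Two requests that need
 * the same output of one switch conflict; input 0 wins, as in the hardware. */
static int omega_output_port(int dest, int stage)
{
    return (dest >> (STAGES - 1 - stage)) & 1;
}

/* Perfect-shuffle permutation of the 3-bit line index applied between stages. */
static int shuffle(int line)
{
    return ((line << 1) | (line >> (STAGES - 1))) & 0x7;
}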
2.4 Custom Instructions
We expanded the NIOS II instruction set with one extended custom instruction. The custom instruction can be specified with one of the following function values.
Fig. 3. Switching elements
Fig. 4. Control signals of a switching element
Fig. 5. Multiprocessor architecture with eight processing units
– LOCAL_GET_P: internal read of the pointer p (Fig. 1) from internal memory
– LOCAL_GET_D: internal read of the data d from internal memory
– SET_PS: write pointer p to the internal memory and shift right by one
– SET_P: write pointer p to the internal memory
– GET_P: (synchronized) read pointer p by using the network
– SET_D: write data d to the internal memory
– GET_D: (synchronized) read data d by using the network
– NEXTGEN: processor synchronization command (see Section 2.5)
The usage of custom instructions will be discussed in section 4.
2.5 Blocking and Synchronization
The omega network is a blocking network. Using eight processors (Fig. 5) it is possible that up to four of them will be blocked during a concurrent read access to eight different memories. This leads to unsynchronized processor execution of the cellular field. As a consequence of the partially sequential calculation two memories (old/new generation) are needed therefore data consistency has to be
considered. Due to the blocking read access the processors might calculate different generations at a time. To avoid such a behavior the processors need to be synchronized at generation transition which is done by using a special synchronization command at this state. A barrier-synchronization technique [14, p. 165] is implemented by AND gating. The implementation of the omega network also requires a synchronization for each read access over the network.
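Conceptually, the old/new-generation memories behave like a double buffer that is swapped only after every processor has passed the NEXTGEN barrier; the following fragment is a software analogy, not the platform code.

#define LOCAL_CELLS 128

/* cur holds the previous generation (readable by every PU over the network),
 * nxt receives the newly computed cells of this PU.                          */
static int buf_a[LOCAL_CELLS], buf_b[LOCAL_CELLS];
static int *cur = buf_a, *nxt = buf_b;

static void end_of_generation(void)
{
    /* On the real hardware the barrier is the NEXTGEN custom instruction:
     * ALT_CI_CUSTOM_NETWORK_EXT_INST(NEXTGEN, 0, 0);                         */
    int *tmp = cur;          /* swap old and new generation after the barrier */
    cur = nxt;
    nxt = tmp;
}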
3 FPGA Prototype Implementation
3.1 FPGA Implementation - Synthesis Results
The prototyping platform is a Cyclone FPGA with the Quartus II 8.1 synthesis software from Altera. The Cyclone II FPGA contains 68,416 logic elements (LE) and 1,152,000 RAM bits [15]. The implementation language is Verilog HDL. The NIOS II processor is built with the SOPC-Builder. We used the Nios II/f processor. This core has the highest execution performance and uses a 6 stage pipeline. The DMIPS rate is 218 at the maximum clock frequency of 185 MHz. The synthesis settings are optimized for speed. Table 1 shows the numbers of logic elements (LE), logic elements for network, the memory usage, the maximum clock frequency and the estimated max. DMIPS [16] rate for all processors [13, Chapter 5]. With the DMIPS rate the speed-up can be calculated.

Table 1. Resources and clock frequency for the Cyclone II FPGA

processing units   total LEs   LEs for network   memory bits   register bits   max. clock (MHz)   max. DMIPS   DMIPS speed-up
1                  2,824       0                 195,296       1,585           145                168          1.0
4                  10,165      154               289,568       4,910           105                487          2.8
8                  19,985      866               415,264       9,351           75                 696          4.1
16                 40,878      2,145             666,656       18,244          55                 1020         6.0
The number of logic elements scales well with the number of processing units. The cost of the network in terms of logic elements was not of significant relevance for p = 16 processors. The clock frequency drops significantly due to the omega network: only half of the clock frequency can be used for 16 processors compared to 4 (Table 1). Although the clock frequency decreases, the DMIPS rate increases. Comparing the one-processor against the 16-processor implementation shows that, because of the decreasing clock frequency, a DMIPS speed-up of only 6 was achieved. Synthesis of 32 processors is not possible on the Cyclone II due to limited resources. The second prototyping platform was a Stratix II FPGA with 143,520 ALUTs, 71,760 ALMs and 9,383,040 RAM bits [17]. Table 2 shows the required resources. With 32 processors about 30% of the ALUT resources are used. Synthesis for 64 processors generally seems to be possible but could not be accomplished due to problems with the synthesis software.
Table 2. Resources and clock frequency for the Stratix II FPGA

processing units   total ALMs   total ALUTs   ALUTs for network   memory bits   register bits   max. clock (MHz)   max. DMIPS   DMIPS speed-up
1                  1,203        1,592         0                   195,296       1,520           200                232          1.0
4                  4,147        5,354         279                 289,568       4,813           133                617          2.6
8                  8,397        10,634        818                 415,264       9,200           100                928          4.0
16                 17,448       21,439        2,344               666,656       17,994          75                 1392         6.0
32                 36,131       43,815        6,092               1,169,440     35,591          55                 2041         8.7
Comparing one processor against the 32 processor implementation shows that a speed-up of 8.7 was achieved.
4 An Application: Merging Bitonic Sequences
The principle of operation of the multiprocessor system will be demonstrated by parallel merging. In the GCA model each cell holds its own value and compares it with a neighbor cell at a changing distance. The distance is a power of two. If its own value does not have the desired relation (e.g., ascending order) the cell changes its own value to the value of the neighbor. If the pointer points to a higher-indexed cell the minimum is computed, otherwise the maximum. The number of generations is log2(N), where N is the number of cells. Listing 1 shows the C implementation for one processing unit. At the beginning all processors are synchronized (the two NEXTGEN calls at the start of main). Synchronization is done by calling the custom instruction. This instruction takes three parameters: the first parameter defines the action (e.g., NEXTGEN) that should be processed, the second parameter defines the global address (global address = processor ID, memory address) if applicable, and the third parameter defines the data to be written, valid only for the write actions. The C implementation consists of one part which handles the external operations (reading external memory with GET_D) and one part which handles the internal operations (reading internal memory with LOCAL_GET_D and writing it with SET_D). The usage of the barrier-synchronization command (see section 2.5) is shown by the NEXTGEN call at the end of the generation loop. If the GCA model is executed sequentially on one processor, n steps are necessary in each generation, leading to a total number of n · log2(N) steps. The execution time is the number of clock cycles divided by the clock frequency. Table 3 shows a linear speed-up in the clock cycles needed to execute the program for N = 2048. The bitonic merge algorithm causes no blocking accesses to the network and therefore the used cycles show a linear speed-up. The system was implemented for p = 1, 4, 8, 16, 32 processors and N = 2048 cells. Each processor holds n = N/p cells. In the first processing part only external cells (stored in external memories) are compared (Fig. 6). We call these generations external generations. In the second processing part only internal cells are compared. We call these generations internal generations.
#include <stdio.h>
#include "system.h"

#define LOCAL_GET_P 8
#define LOCAL_GET_D 7
#define SET_PS      6
#define GET_P       5
#define SET_P       4
#define TIME        3
#define GET_D       2
#define NEXTGEN     1
#define SET_D       0

#define SC        128*1   /* SC = start cell. Defines memory address offset    */
#define DAT_BLKS  2048    /* total amount of data blocks                       */
#define LOCL_BLKS 128     /* amount of data blocks processed by this processor */
#define GENS      11      /* amount of generations. 2^GENS = DAT_BLKS          */

int main()
{
  int i, j, neighbor, ii, N, I, k;

  ALT_CI_CUSTOM_NETWORK_EXT_INST(NEXTGEN, 0, 0);   /* synchronization at startup */
  ALT_CI_CUSTOM_NETWORK_EXT_INST(NEXTGEN, 0, 0);

  for (j = 0; j < GENS; j++)
  {
    for (i = 0; i < LOCL_BLKS; i++)
    {
      ii = SC + i;                                              /* calculate memory address */
      k  = ALT_CI_CUSTOM_NETWORK_EXT_INST(LOCAL_GET_P, ii, 0);  /* load cell pointer        */

      if ((ii & k) == 0) neighbor = ii + k;
      else               neighbor = ii - k;

      if (j > 3)                                   /* decide between internal/external read */
        N = ALT_CI_CUSTOM_NETWORK_EXT_INST(LOCAL_GET_D, neighbor, 0);
      else
        N = ALT_CI_CUSTOM_NETWORK_EXT_INST(GET_D, neighbor, 0);

      I = ALT_CI_CUSTOM_NETWORK_EXT_INST(LOCAL_GET_D, ii, 0);   /* read local cell value    */

      if ((ii & k) == 0) {                                      /* write D                  */
        if (N < I) ALT_CI_CUSTOM_NETWORK_EXT_INST(SET_D, ii, N);
        else       ALT_CI_CUSTOM_NETWORK_EXT_INST(SET_D, ii, I);
      } else {
        if (N > I) ALT_CI_CUSTOM_NETWORK_EXT_INST(SET_D, ii, N);
        else       ALT_CI_CUSTOM_NETWORK_EXT_INST(SET_D, ii, I);
      }

      ALT_CI_CUSTOM_NETWORK_EXT_INST(SET_PS, ii, k);            /* write P and shift        */
    }
    ALT_CI_CUSTOM_NETWORK_EXT_INST(NEXTGEN, 0, 0);
  }
  return 0;
}
Listing 1. C Program for Bitonic Merge
Fig. 6. Example Bitonic Merge with four processors and 16 cells
Table 3. Clock cycles, wait cycles, execution time (Stratix II) and speed-up

processing units   cycles    absolute wait cycles   % of wait cycles   cycle speed-up   execution time   real speed-up
1                  781567    0                      0                  1.00             3.9 ms           1.0
4                  199878    4807                   2.4 %              3.9              1.5 ms           2.6
8                  101279    2592                   2.56 %             7.7              1.0 ms           3.9
16                 51776     1503                   2.90 %             15.0             0.7 ms           5.6
32                 26212     911                    3.48 %             29.8             0.4 ms           8.2
If one considers the degradation of the clock frequency for 32 processors, a real speed-up of about 8.2 is reached. Comparing these values to Table 2 shows that the speed-up is lower than the estimated speed-up calculated from the DMIPS rate. The column "% of wait cycles" in Table 3 shows that the synchronization overhead increases with the number of processing units. Compared with the fully parallel [5,9] and partially parallel [7,9] architectures, the execution time for bitonic merge on the multiprocessor architecture is greater.
5 Conclusion
A programmable multiprocessor architecture for the massively parallel GCA model was designed and implemented as a prototype in FPGA technology. The architecture consists of p processing units with internal memories and a read-only interconnection network. Compared to a dedicated implementation the proposed architecture is very flexible because it can be easily adapted to different GCA algorithms by programming. The speed-up of the prototype increases with the number of processors for the investigated algorithm. A DMIPS speed-up of 8.7 could be reached by increasing the number of processors from 1 to 32. The speed-up regarding clock cycles is 29.8 while the real speed-up is 8.2 comparing the 1 against the 32 processor implementation. If the number of processors gets very high, the cost and time delay of the network have to be taken into account. To keep the frequency high, additional registers have to be inserted into the network which is a subject under investigation. We also work on different bus architectures like a ring network as a simulation platform for multi-agent systems.
References 1. Hoffmann, R., V¨ olkmann, K.P., Waldschmidt, S., Heenes, W.: GCA: Global Cellular Automata, A Flexible Parallel Model. In: Malyshkin, V.E. (ed.) PaCT 2001. LNCS, vol. 2127, p. 66. Springer, Heidelberg (2001) 2. von Neumann, J.: Theory of Self-Reproducing Automata. University of Illinois Press, Urbana (1966)
3. Jendrsczok, J., Hoffmann, R., Keller, J.: Implementing Hirschberg’s PRAMAlgorithm for Connected Components on a Global Cellular Automaton. In: 21st International Parallel & Distributed Processing Symposium (IPDPS), 9th Workshop on Advances in Parallel and Distributed Computational Models (2007) 4. Wiegand, C., Siemers, C., Richter, H.: Definition of a configurable architecture for implementation of global cellular automaton. In: M¨ uller-Schloer, C., Ungerer, T., Bauer, B. (eds.) ARCS 2004. LNCS, vol. 2981, pp. 140–155. Springer, Heidelberg (2004) 5. Heenes, W., Hoffmann, R., Kanthak, S.: FPGA Implementations of the Massively Parallel GCA Model. In: International Parallel & Distributed Processing Symposium (IPDPS), Workshop on Massively Parallel Processing (WMPP) (2005) 6. Heenes, W., V¨ olkmann, K.P., Hoffmann, R.: Architekturen f¨ ur den globalen Zellularautomat. In: 19. PARS Workshop, Gesellschaft f¨ ur Informatik (GI) (2003) 7. Hoffmann, R., V¨ olkmann, K.P., Heenes, W.: GCA: A massively parallel Model. In: International Parallel & Distributed Processing Symposium (IPDPS), Workshop on Massively Parallel Processing (WMPP) (2003) 8. Heenes, W., Hoffmann, R., Jendrsczok, J.: A Multiprocessor Architecture for the Massively Parallel Model GCA. In: International Parallel & Distributed Processing Symposium (IPDPS), Workshop on System Management Tools for Large-Scale Parallel Systems (SMTPS) (2006) 9. Heenes, W.: Entwurf und Realisierung von massivparallelen Architekturen f¨ ur Globale Zellulare Automaten. PhD thesis, Technische Universit¨ at Darmstadt (2007) 10. Ryu, K.K., Shin, E., Mooney, V.J.: A Comparison of Five Different Multiprocessor SoC Bus Architectures. In: DSD 2001: Proceedings of the Euromicro Symposium on Digital Systems Design, Washington, DC, USA, pp. 202–209. IEEE Computer Society, Los Alamitos (2001) 11. Kulmala, A., Salminen, E., H¨ am¨ al¨ ainen, T.D.: Evaluating Large System-on-Chip on Multi-FPGA Platform. In: Vassiliadis, S., Berekovi´c, M., H¨ am¨ al¨ ainen, T.D. (eds.) SAMOS 2007. LNCS, vol. 4599, pp. 179–189. Springer, Heidelberg (2007) 12. Heenes, W.: Globaler Zellularer Automat: Algorithmen und Architekturen. Master’s thesis, Technische Universit¨ at Darmstadt (2001) 13. Altera, NIOS II Website (2009), http://www.altera.com/products/ip/processors/nios2/ni2-index.html 14. Ungerer, T.: Parallelrechner und parallele Programmierung. Spektrum Akademischer Verlag, Heidelberg (1997) 15. Altera, Datasheet Cyclone II (2006), http://www.altera.com/literature/hb/cyc2/cyc2_cii5v1.pdf 16. Weiss, A.R.: Dhrystone benchmark - history, analysis,“scores” and recommendations (2002), http://www.ebenchmarks.com/download/ECLDhrystoneWhitePaper.pdf 17. Altera, Datasheet Stratix II (2006), http://www.altera.com/literature/hb/stx2/stratix2_handbook.pdf
Towards Automated FSMD Partitioning for Low Power Using Simulated Annealing Nainesh Agarwal and Nikitas J. Dimopoulos Department of Electrical and Computer Engineering University of Victoria Victoria, B.C., Canada {nagarwal,nikitas}@ece.uvic.ca
Abstract. We propose a technique to efficiently partition a FSMD (Finite State Machine with Datapath) using a simulated annealing approach. The FSMD is split into two or more simpler communicating processors. These separate processors can then be clock gated or power gated to achieve dramatic power savings since only one processor is active at any given time. We develop a framework to estimate the potential power savings from partitioning. Using several sample circuits, the estimation framework shows that when the original machine is partitioned into two submachines, on average, 32% static power savings and 19% dynamic power savings can be expected, with a performance impact of 2%. The power savings with more than two partitions can be even higher, with a larger performance impact.
1 Introduction
We focus on the class of sequential circuits characterized by an FSMD (Finite State Machine with Datapath) representation. Partitioning is one technique used to facilitate logic isolation in FSMD circuits [1]. Normally the isolated circuit components are switched off (power gated) [2] or clock gated [1, 3] to conserve static or dynamic power, respectively. Two methods are generally employed for the partitioning of these sequential circuits. The first method relies on disabling parts of the FSM (Finite State Machine) controller. Here, the controller is partitioned into two or more mutually-exclusive FSMs. Each partition is then selectively clock gated [1, 3] or power gated [2]. Thus, only one FSM is active at any given time, while the others are idle and their clocks are stopped, or their power is gated off. The second method tries to discover idle periods in one or more datapath components of the circuit. These components can then be clock gated or power gated. In [4], idle periods in the ALU are discovered and for these periods the ALU is power gated. In [5], individual registers are clock gated, while in [6], individual registers are power gated. Although gating parts of either the controller or the datapath has been shown to be highly effective in reducing power, further savings can be achieved if both
the controller and the datapath are considered together. This was proposed in [1], where a simple heuristic was used in a branch and bound method to partition the FSMD. Further, the method was more suited for a clock gating environment. We use a more thorough and detailed model and hope to achieve better power reduction in a power gating environment where static power is of significant concern. We formulate FSMD partitioning as a non-linear programming problem, which we solve using the simulated annealing algorithm [7]. Our objective is to maximize the isolation of circuit components by minimizing the communication between the partitioned FSMDs. This maximizes the number of components that can be put to sleep thus reducing the overall power dissipation. In [8] we developed a linear model for partitioning and solved it using ILP (Integer Linear programming). We experimented initially with simple models using both ILP and simulated annealing. Both methods yielded similar results. However, the simulated annealing method runs orders of magnitude faster than the ILP method. This makes ILP impractical for circuits beyond about 20 states. We have therefore adopted the simulated annealing approach, and we have also enhanced our model. Our enhanced models yields partitions that better preserve the locality of execution and, thus, provide better power saving opportunities. Although we have not explored them here, other stochastic optimization methods, such as genetic algorithms [9] and swarm intelligence [10], may also be suitable candidate methods to provide rapidly converging solutions. We first proposed a version of the partitioning model presented here in [11]. In this work, we have significantly enhanced the approach we use to partition the circuit. In particular, we have amended the cost function used in the nonlinear programming method by incorporating terms that reflect the contribution of loops in the code that specifies the functionality implemented by the circuit under partition. We have also modified our partitioning method to yield multiple partitions instead of two. Additionally, we have enlarged the set of benchmarks we use to evaluate our methods.
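A generic simulated-annealing loop over state-to-partition assignments, of the kind referred to above, can be sketched as follows; the cost shown counts only shared (duplicated) bits across partitions and stands in for the full model of Section 2, and all names, sizes and cooling constants are our own choices.

#include <stdlib.h>
#include <math.h>

#define N_STATES     64
#define N_PARTITIONS 2

/* D[i][j]: shared register bits between states i and j (filled by a front end).
 * The full cost model also includes transition bits and the loop penalty.      */
static double D[N_STATES][N_STATES];

static double partition_cost(const int assign[N_STATES])
{
    double c = 0.0;
    for (int i = 0; i < N_STATES; i++)
        for (int j = 0; j < N_STATES; j++)
            if (assign[i] != assign[j])
                c += D[i][j];               /* bits shared across a partition cut */
    return c;
}

static void anneal(int assign[N_STATES])
{
    double T = 100.0, cost = partition_cost(assign);
    while (T > 1e-3) {
        for (int it = 0; it < 100; it++) {
            int s   = rand() % N_STATES;          /* pick a state to move    */
            int old = assign[s];
            assign[s] = rand() % N_PARTITIONS;    /* propose a new partition */
            double c = partition_cost(assign);
            double d = c - cost;
            if (d <= 0 || exp(-d / T) > (double)rand() / RAND_MAX)
                cost = c;                         /* accept the move         */
            else
                assign[s] = old;                  /* reject: undo            */
        }
        T *= 0.95;                                /* geometric cooling       */
    }
}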
2 FSMD Partitioning
The proposed partitioning technique works at the behavioral level, before synthesis. The FSMD described at the behavioral level is split into two or more separate FSMD units. At any given time, only one FSMD is active while the others are powered off. This results in significant power savings (static and dynamic). If there are data components, namely registers, that are used in multiple partitions, they are kept alive whenever any partition that uses them is active. Ideally, data components should be isolated as much as possible so that they can be turned off for as long as possible. Further, when data components are shared between FSMDs, their updated values need to be communicated to the newly activated FSMD. This communication overhead results in power dissipation that should be minimized. Another important criterion is to minimize the number of transitions between partitions. Each time a transition occurs we not only have the communication
penalty, but also encounter a startup delay whereby the capacitances of the newly activated FSMD are charged up. To reduce this performance penalty a lookahead mechanism can be used [6]. To minimize these adverse effects we need to efficiently partition the FSMD to reduce the amount of shared data components. To achieve such an efficient partition we have formulated a non-linear programming problem and have used the simulated annealing [7] algorithm to solve it. Our objective is to minimize the number of shared components between partitions and also minimize the number of possible transitions between the partitions.
2.1 Model Formulation
Let P be a finite state machine with datapath (FSMD) consisting of a set of N finite states defined as S = {s1, ..., sN} and transitions. We represent the set of transitions (edges) of the FSMD as Eij, which is a binary variable. It is 1 if and only if there exists an edge or transition from state si to sj. Further, let VAR be the set of storage variables which are used in the various expressions and statements of the FSMD. The partitions of P are subsets of S, along with the transitions related to the states in S. Let the number of partitions be M. The set of partitions can then be identified as Pk for all k ∈ [1, M]. It is our goal here to partition machine P into submachines (partitions) Pk such that the interaction between these submachines is minimized. Given an FSMD, we first determine the set of variables that are shared between the various states. A variable v ∈ VAR is considered shared between states si and sj if the variable is read or written in state si and read or written in state sj. We categorize shared variables into two groups. The first group, which we call duplicated variables, consists of those which are either read in both states si and sj, or written in both states. The second group is called transition variables, which consists of registers that are read in state si and written in state sj, or vice versa. We represent the total number of duplicated register bits between states si and sj as Dij, while the total number of transition bits is denoted Tij. Another important criterion in the model is to penalize the partitioning of a loop in the FSMD. To capture this, we introduce pseudo-edges between any two states si, sj belonging to the same loop in the FSMD. Then the binary variable Lij denotes the existence of such a pseudo-edge between states si and sj (Lij = 1). If si and sj do not belong to the same loop, Lij = 0. We introduce the binary variables sik for all i ∈ [1, N], which are 1 if and only if state si belongs to partition k. Here, N = |S| is the number of states of the original machine P. The total number of duplicated bits between partitions can be counted as

$$D_{\mathrm{total}} = \sum_{i,j=1}^{N} D_{ij}\left(1 - \sum_{k=1}^{M} s_{ik}\,s_{jk}\right), \qquad (1)$$

where M is the number of partitions. Also, we must adhere to the constraint

$$\forall i \in [1, N]: \quad \sum_{k=1}^{M} s_{ik} = 1, \qquad (2)$$

which requires that each state si belong to one and only one partition k. Equation 1 can be simplified to

$$D_{\mathrm{total}} = \sum_{i,j=1}^{N} D_{ij} - \sum_{i,j=1}^{N} D_{ij}\sum_{k=1}^{M} s_{ik}\,s_{jk}. \qquad (3)$$

The first term in equation 3 is constant since it represents all the shared variable bits in the original machine. Therefore it can be ignored in the optimization. We now get

$$D_{\mathrm{total}} \approx -\sum_{i,j=1}^{N} D_{ij}\sum_{k=1}^{M} s_{ik}\,s_{jk}. \qquad (4)$$
The summations in equation 4 can be switched and the equation can be specified more concisely in vector form as

$$D_{\mathrm{total}} \approx -\operatorname{trace}\bigl(s^{T} D\, s\bigr). \qquad (5)$$

As with the total number of duplicated bits between partitions, the number of transition bits can be represented as

$$T_{\mathrm{total}} \approx -\operatorname{trace}\bigl(s^{T} T\, s\bigr). \qquad (6)$$

The total number of edges between all partitions can be counted as

$$E_{\mathrm{total}} \approx -\operatorname{trace}\bigl(s^{T} E\, s\bigr). \qquad (7)$$

The total number of pseudo-edges between partitions, due to loops, can be counted as

$$L_{\mathrm{total}} \approx -\operatorname{trace}\bigl(s^{T} L\, s\bigr). \qquad (8)$$

The objective function to be minimized can now be formulated as a combination of the parameters introduced earlier. It can be stated as

$$\min\;\bigl[\alpha_D D_{\mathrm{total}} + \alpha_T T_{\mathrm{total}} + \lambda\,(\alpha_E E_{\mathrm{total}} + \alpha_L L_{\mathrm{total}})\bigr], \qquad (9)$$

where the edges are weighted by

$$\lambda = \sum_{v \in VAR} |v|, \qquad (10)$$

which is the sum of all register bits in the original partition P. This is because, in the worst case, all register bits may need to be communicated from one partition to the other. We have also introduced the factors α, which allow us to adjust relative weights for the four parameters based on their relative importance in the final result. Through experimentation we have determined that
αD = 0.5, αT = 1, αE = 1 and αL = 2 provide effective results. This means that we penalize heavily for breaking loops, while the minimization of duplicated variables is less important. Although we chose the values of the parameters αD, αT, αE and αL experimentally, we have not performed a complete design-space exploration nor studied the relative impact of each of these parameters on the final partitioning of the circuit. We anticipate that we will employ a Plackett-Burman methodology [12] to ascertain the impact of each of these parameters as we further explore their optimum values. Equation 9 can be simplified further and can be stated as

$$\min\; -\operatorname{trace}\bigl(s^{T}\,[\alpha_D D + \alpha_T T + \lambda\,(\alpha_E E + \alpha_L L)]\, s\bigr), \qquad (11)$$

where the goal is to find the optimal matrix s that minimizes this cost function. In each iteration of the simulated annealing algorithm, the partition of a randomly chosen state is modified. We have added a quality constraint to this update procedure such that each partition contains at least a few of the total states from the original machine P. This constraint can be specified as

$$\forall k \in [1, M]: \quad \sum_{i=1}^{N} s_{ik} \ge \phi\,\frac{N}{M}, \qquad 0 < \phi < M, \qquad (12)$$

where φ is a configurable parameter. Experimentation shows that φ = 0.6 is able to effectively eliminate highly unbalanced partitions. This implies that for M = 2 partitions, each partition must contain at least 30% of the total number of states. For the simulated annealing algorithm, the cooling schedule used is Ti+1 = 0.9 Ti. This is experimentally found to give a good convergence profile.
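To make the optimization concrete, the following minimal sketch evaluates the cost of equation 11, enforces the balance constraint of equation 12, and applies the Ti+1 = 0.9 Ti cooling schedule. It is a Python/NumPy illustration, not the Matlab-based solver the authors use; the matrices D, T, E, L, the weight λ, and the move/acceptance details (a standard Metropolis rule, a balanced round-robin initial assignment) are assumptions not specified in the text.

```python
import numpy as np

def partition_cost(s, D, T, E, L, lam, alpha=(0.5, 1.0, 1.0, 2.0)):
    """Objective of equation 11 (smaller is better). s is the N x M 0/1
    assignment matrix; D, T, E, L are the N x N matrices of duplicated bits,
    transition bits, edges and loop pseudo-edges; lam is the weight of eq. 10."""
    a_d, a_t, a_e, a_l = alpha
    W = a_d * D + a_t * T + lam * (a_e * E + a_l * L)
    return -np.trace(s.T @ W @ s)

def balanced(s, phi=0.6):
    """Quality constraint of equation 12: every partition keeps at least phi*N/M states."""
    N, M = s.shape
    return bool(np.all(s.sum(axis=0) >= phi * N / M))

def anneal(D, T, E, L, lam, M=2, T0=100.0, steps=5000, seed=0):
    """Simulated annealing: each iteration moves one randomly chosen state
    to another partition, using the cooling schedule T_{i+1} = 0.9 T_i."""
    rng = np.random.default_rng(seed)
    N = D.shape[0]
    s = np.zeros((N, M), dtype=int)
    s[np.arange(N), np.arange(N) % M] = 1        # balanced initial assignment (assumption)
    best = partition_cost(s, D, T, E, L, lam)
    temp = T0
    for _ in range(steps):
        cand = s.copy()
        i = rng.integers(N)
        cand[i] = 0
        cand[i, rng.integers(M)] = 1             # reassign state i to a random partition
        if balanced(cand):
            cost = partition_cost(cand, D, T, E, L, lam)
            if cost < best or rng.random() < np.exp((best - cost) / temp):
                s, best = cand, cost
        temp = max(0.9 * temp, 1e-9)             # cooling schedule of section 2.1
    return s, best
```

The assignment matrix returned by the sketch plays the role of the optimal matrix s of equation 11; in the authors' flow this role is filled by the Matlab simulated annealing algorithm referenced in section 3.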
3 Evaluation Framework
To test the effectiveness of our FSMD partitioning approach we have examined a counter circuit and the set of circuits from the DSP kernel suite of the DSPstone benchmark [13]. The designs are implemented using CoDeL [14, 15], which allows system description at the algorithmic level, and produces synthesizable FSMD descriptions in VHDL. The CoDeL compiler has now been augmented to provide all the required parameters for our model. The solution to our model is obtained using a simulated annealing algorithm [16] implemented in Matlab [17]. The power savings from the partition suggested by the simulated annealing algorithm are estimated using the framework presented in section 5.
4 Example
We present here an example using a simple 8 bit counter, implemented using CoDeL, whose FSMD is partitioned into two submachines. The CoDeL compiler
produces a 4-state FSMD in VHDL. Figure 1 presents the STG representation of the counter. After partitioning, we find that states 0 and 3 have been partitioned into submachine, P1, while states 1 and 2 are in submachine, P2 (see figure 1). This is ideal since the variables count, countOut and the adder are only used in states 1 and 2. This allows the register count, the output latch countOut, and the adder to be completely isolated into partition P2. Therefore, they need not exist in P1. As a consequence, whenever partition P1 is active and P2 is inactive, we save power that would normally be dissipated in a non-partitioned FSMD. Further, due to the complete isolation of the count and countOut variables, no data transfer is needed when the active partition is changed. This saves communication overhead. It should be noted that the variable state_value is a special register that encodes state and must exist in all partitions.

Fig. 1. Counter STG with partition
5 Power Estimation
Here we present a framework for estimating the potential power savings from partitioning an FSMD. This framework provides a coarse-level approximation to the expected power savings based on examining the proportion of time each partition is active and the complexity of each partition. The savings in power dissipation can be broken down into savings in static power and savings in dynamic power. Using experimentation we have found that, at least in the circuit implementations we have used, the static power of the circuit is roughly proportional to the amount of sequential logic in the circuit (a power-area relationship has also been exploited in [2, 3]). Thus, by examining the number of sequential elements in the partitions, and the proportion of time they are put to sleep, we can estimate the static power savings. Let Q equal the total number of register bits in the original partition P, and Qk equal the number of register bits in partition Pk. The values for Q and Qk can be calculated as

$$Q = \sum_{v_i \in VAR} |v_i|, \qquad Q_k = \sum_{v_i \in VAR_k} |v_i|,$$

where |vi| is the bit length of register vi, and VARk ⊆ VAR is the set of registers in partition k. The static power (SP) savings can now be expressed as

$$\mathrm{SP\ Savings} = 1 - \sum_{k=1}^{M} \frac{P(P_k)\cdot Q_k}{Q} - \frac{\eta}{Q}, \qquad (13)$$

where the parameter P(Pk) is the proportion of total time spent in partition k, obtained through trace analysis using behavioral simulation. The parameter η is used to capture the addition of any extra register bits required upon partitioning. This is required when extra state encoding bits are needed to incorporate the addition of any entry and exit states in the partitioned state machine. This only becomes significant when the number of states in a submachine is small (less than 4).

The dynamic power dissipation in a circuit is due to switching activity. After partitioning, the largest component of power savings is from the reduction of clocking of register components. All other activity in the circuit is necessarily the same as the unpartitioned FSMD to achieve the desired functionality. This includes register value updates and arithmetic computations. Thus, the dynamic power savings can be estimated by examining the reduction in the number of register bits that need to be clocked after FSMD partitioning. However, we need to take into account the overhead added due to data communication whenever a change in active partition occurs. This switching overhead is given by

$$\mathrm{Overhead} = 0.5 \sum_{k=1}^{M} \sum_{l=k+1}^{M} NPC_{kl} \sum_{v_i \in TVAR_{kl}} |v_i|, \qquad (14)$$

where NPCkl is the number of partition changes between partitions k and l over a time period T, TVARkl = VARk ∩ VARl is the set of variables shared between partitions k and l, and |vi| is the bit length of variable vi. The factor 0.5 is used to capture that on average roughly half the bit values will be modified. This factor of 0.5 may be a bit conservative, but we find that the switching overhead is so small (less than 0.5%) that the effect of this parameter on the overall dynamic power savings is negligible. The dynamic power (DP) savings can be estimated as

$$\mathrm{DP\ Savings} = \alpha_C \cdot \left(1 - \sum_{k=1}^{M} \frac{P(P_k)\cdot Q_k}{Q} - \frac{\mathrm{Overhead}}{f\cdot T\cdot Q}\right), \qquad (15)$$

where f is the circuit frequency, T is the run time, and αC is the proportion of dynamic power due to clocking. In using the CoDeL environment to generate FSMD circuits, we have found that the switching clock accounts for about 60% of the total dynamic power dissipation [5]. Therefore, we must adjust the total dynamic power savings by this factor. Thus, we use αC = 0.6.

The performance overhead from partitioning is determined by the number of extra cycles spent in changing partitions. Our power gating implementation estimates a penalty of three clock cycles for each partition change. The performance overhead can then be measured as

$$\mathrm{Perf.\ Overhead} = \frac{3 \sum_{k=1}^{M} \sum_{l=k+1}^{M} NPC_{kl}}{f \cdot T} \cdot 100\%. \qquad (16)$$
In the case where power gating is not employed and only clock gating is used, there is no delay in charging up capacitances. In this case there is a penalty of only a single cycle, resulting in one-third of the performance overhead estimated in equation 16.
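The way equations 13-16 combine into the three reported metrics can be sketched as follows. This is a hypothetical Python helper, not part of the authors' tool flow; the residency fractions, bit counts and partition-change counts are assumed to come from the trace analysis described above.

```python
def estimate_savings(p_time, q_bits, q_total, eta, npc, shared_bits,
                     f, T, alpha_c=0.6, cycles_per_switch=3):
    """Coarse estimates following equations 13-16.
    p_time[k]          : fraction of execution time spent in partition k (trace analysis)
    q_bits[k]          : register bits in partition k;  q_total : bits in the original FSMD
    eta                : extra state-encoding bits introduced by partitioning
    npc[(k, l)]        : number of partition changes between partitions k and l
    shared_bits[(k,l)] : total bits of the registers shared by partitions k and l
    f, T               : clock frequency and run time (f * T = executed clock cycles)
    cycles_per_switch  : 3 for power gating, 1 if only clock gating is used"""
    residency = sum(p * q for p, q in zip(p_time, q_bits)) / q_total

    sp_savings = 1.0 - residency - eta / q_total                             # eq. 13

    overhead = 0.5 * sum(npc[kl] * shared_bits[kl] for kl in npc)            # eq. 14
    dp_savings = alpha_c * (1.0 - residency - overhead / (f * T * q_total))  # eq. 15

    perf_overhead = cycles_per_switch * sum(npc.values()) / (f * T) * 100.0  # eq. 16
    return sp_savings, dp_savings, perf_overhead
```

Passing cycles_per_switch = 1 reproduces the clock-gating case above, where only a single-cycle penalty per partition change is incurred.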
6 Results
For effective power estimation, trace data is used from circuit simulation using Synopsys. This data provides information on the state transition sequence during computation. In table 1 we present the estimated power savings using the framework from section 5. The original machines have been partitioned into two, three and four submachines using the model presented in section 2.1. For the case of two partitions, we find that in most cases the static power savings are between 30% and 40%, while the dynamic power savings are between 10% and 20%. For the case of three and four partitions, we see that the average power savings continues to increase. However, the rate of increase in power savings is decreasing, as the amount of logic isolation decreases and we end up with more transition variables. Further, there is a rise in the number of partition changes as the number of partitions increases, affecting the dynamic power savings. This rise in the number of partition changes also results in significant performance loss. This phenomenon is exaggerated in the case of the fir2dim kernel, as the power savings go down significantly with 4 partitions.

Table 1. Power savings and performance overhead based on the estimation framework. (The average values for the percentage power savings and the overhead are given as a geometric mean. The average for the number of cycles is an arithmetic mean.)

                    Execution   2 Partitions             3 Partitions             4 Partitions
Circuit             Cycles      SP(%)  DP(%)  Perf.(%)   SP(%)  DP(%)  Perf.(%)   SP(%)  DP(%)  Perf.(%)
counter                  6      51.0   37.6   100.0      41.2   38.8   200.0      38.2   44.1   300.0
dot product             17      37.6   22.6    35.3      59.9   36.0    52.9      68.1   40.8    70.6
real update             17      33.6   20.1    35.3      52.8   31.7    70.6      61.9   37.2    88.2
complex multiply        20      39.1   23.4    30.0      49.9   29.9    60.0      64.1   38.5    75.0
complex update          25      37.0   22.2    36.0      51.1   30.7    48.0      58.6   35.2    60.0
iir one biquad          38      42.1   25.2    15.8      47.6   28.6    39.5      59.4   35.6    39.5
Average                 21      39.7   24.6    35.9      50.1   32.4    66.3      57.4   38.4    83.2
mat1x3                 120      29.8   17.9     5.0      41.4   24.8     7.5      44.4   26.6    15.0
convolution            182      34.7   20.8     3.3      37.0   22.2     6.6      37.4   22.5     8.2
iir n biquads          198      45.4   27.2     4.5      49.7   29.8     6.1      54.5   32.7    16.7
fir                    203      34.8   20.9     3.0      41.0   24.6     5.9      42.1   25.3     7.4
n real updates         284      30.5   18.3     2.1      32.7   19.6     5.3      41.8   25.1    38.0
lms                    356      22.0   13.2     2.5      26.8   16.1     4.2      30.5   18.3     5.1
n complex updates      554      32.2   19.3     1.1      38.2   22.9    19.0      44.1   26.5    27.1
fir2dim                919      30.5   18.3     0.7      37.0   22.2     1.3      21.8   13.1    38.5
matrix                5360      29.0   17.4     0.2      29.4   17.6     0.3      29.8   17.9     1.3
Average                908      31.6   19.0     1.7      36.5   21.9     4.0      37.3   22.4    11.7

(SP = static power savings, DP = dynamic power savings, Perf. = performance overhead.)
The circuits in table 1 are arranged in the order of increasing algorithm complexity based on the number of execution cycles. Examining the performance overhead, we see that the impact is quite large for simpler kernels, while the more complex kernels show little loss of performance. In fact, the presence of loops facilitates the partitioning: by ensuring that the computation trace remains within one partition for long periods and avoids expensive partition switches, loops provide large power savings with minimal performance loss. This is exemplified by the matrix and fir2dim benchmarks. Both include loops. For the matrix benchmark, the innermost loop, which is executed a large number of times, is confined in its entirety within a partition in all cases (two, three and four partitions). Meanwhile, for the fir2dim benchmark, when four partitions are used the innermost loop is split between partitions and this results in both performance and power losses. For the simple kernels (the upper group in table 1), the high performance impact suggests partitioning is not advisable. For more complex algorithms, where the performance overhead is less than 5%, a real power savings opportunity is present with little impact on performance.
7 Conclusions
In this paper, we have presented an FSMD partitioning technique, which efficiently decomposes the controller and the datapath into multiple partitions, using the simulated annealing algorithm. An estimation framework is developed to evaluate the potential power savings from the partitioning method developed. This framework estimates the power savings as a portion of the power of the non-partitioned system, and it is thus clock-frequency independent (the quantity f·T in equation 15 equates to the number of execution clock cycles and is thus frequency independent). Using this framework on a broad set of circuits, we find that, in most cases, significant power savings can be expected using our partitioning approach. It is important to note that upon partitioning a circuit into smaller submachines the critical path may be shortened, leading to a possibility of faster timing. This will result in a reduction in the performance penalty, and in some cases may even allow the partitioned circuit to have shorter run times than the unpartitioned circuit. This needs to be explored further through implementation. In an effort to better understand the power savings potential of the proposed partitioning method and to refine our power estimation framework, we are working on implementing all the circuits from table 1 and will report these results soon.
Acknowledgments

This work was supported by grants from the Natural Sciences and Engineering Research Council of Canada (NSERC) and the University of Victoria through the Lansdowne Chair.
References

1. Hwang, E., Vahid, F., Hsu, Y.C.: FSMD functional partitioning for low power. In: DATE 1999 (1999)
2. Liu, B., Cai, Y., Zhou, Q., Bian, J., Hong, X.: FSM decomposition for power gating design automation in sequential circuits. In: ASICON 2005 (October 2005)
3. Gao, F., Hayes, J.P.: ILP-based optimization of sequential circuits for low power. In: ISLPED 2003 (2003)
4. Hu, Z., Buyuktosunoglu, A., Srinivasan, V., Zyuban, V., Jacobson, H., Bose, P.: Microarchitectural techniques for power gating of execution units. In: ISLPED 2004, pp. 32–37. ACM Press, New York (2004)
5. Agarwal, N., Dimopoulos, N.J.: Efficient automated clock gating using CoDeL. In: Vassiliadis, S., Wong, S., Hämäläinen, T.D. (eds.) SAMOS 2006. LNCS, vol. 4017, pp. 79–88. Springer, Heidelberg (2006)
6. Agarwal, N., Dimopoulos, N.J.: Automated power gating of registers using CoDeL and FSM branch prediction. In: Vassiliadis, S., Bereković, M., Hämäläinen, T.D. (eds.) SAMOS 2007. LNCS, vol. 4599, pp. 294–303. Springer, Heidelberg (2007)
7. Kirkpatrick, S., Gelatt Jr., C.D., Vecchi, M.P.: Optimization by simulated annealing. Science 220(4598), 671–680 (1983)
8. Agarwal, N., Dimopoulos, N.J.: FSMD partitioning for low power using ILP. In: ISVLSI 2008: Proceedings of the 2008 IEEE Computer Society Annual Symposium on VLSI, Washington, DC, USA, pp. 63–68. IEEE Computer Society, Los Alamitos (2008)
9. Goldberg, D.E.: Genetic Algorithms in Search, Optimization and Machine Learning. Addison-Wesley Longman Publishing Co., Inc., Boston (1989)
10. Beni, G., Wang, J.: Swarm intelligence. In: Proceedings of the 7th Annual Meeting of the Robotics Society of Japan, pp. 425–428. RSJ Press (1989)
11. Agarwal, N., Dimopoulos, N.J.: FSMD partitioning for low power using simulated annealing. In: Proc. ISCAS 2008 (May 2008)
12. Plackett, R.L., Burman, J.P.: The design of optimum multifactorial experiments. Biometrika 33(4), 305–325 (1946)
13. Zivojnovic, V., Martinez, J., Schläger, C., Meyr, H.: DSPstone: A DSP-oriented benchmarking methodology. In: Proc. ICSPAT 1994 (October 1994)
14. Sivakumar, R., Dimakopoulos, V., Dimopoulos, N.: CoDeL: A rapid prototyping environment for the specification and automatic synthesis of controllers for multiprocessor interconnection networks. In: Proc. SAMOS III, July 2003, pp. 58–63 (2003)
15. Agarwal, N., Dimopoulos, N.J.: Using CoDeL to rapidly prototype network processor extensions. In: Pimentel, A.D., Vassiliadis, S. (eds.) SAMOS 2004. LNCS, vol. 3133, pp. 333–342. Springer, Heidelberg (2004)
16. Vandekerckhove, J.: General simulated annealing algorithm, http://www.mathworks.com/matlabcentral/fileexchange/loadCategory.do
17. MathWorks, http://www.mathworks.com/
Radix-4 Recoded Multiplier on Quantum-Dot Cellular Automata

Ismo Hänninen and Jarmo Takala
Tampere University of Technology, Department of Computer Systems
PO BOX 553, FI-33101 Tampere, Finland
{ismo.hanninen,jarmo.takala}@tut.fi
http://www.tkt.cs.tut.fi/index-english.html
Abstract. This paper describes the implementation of an advanced multiplication algorithm on quantum-dot cellular automata (QCA) nanotechnology, promising molecular density circuits with extreme operating frequencies, using a single homogeneous layer of the basic cells. The multiplier layout is verified with time-dependent quantum mechanical simulation, to ensure stable ground state computation under the fine-grained pipelining constraints of the technology. The novel design utilizes radix-4 modified Booth recoding and ultra-fast carry-save addition, resulting in stall-free pipeline operation, with twice the throughput of the previous sequential structure and minimized active circuit area.

Keywords: QCA, nanotechnology, arithmetic, multiplication.
1 Introduction
Quantum-dot cellular automata (QCA) is a promising nanotechnology, which offers ways to reach molecular circuit densities and clock frequencies surpassing traditional digital technologies by several orders of magnitude, possibly reaching the terahertz regime. The concept was introduced in the early 1990s [1, 2] and has been demonstrated in a laboratory environment with small systems [3, 4, 5]. The revolutionary operating principle promises outstanding energy-efficiency and computing performance, which has evoked considerable interest in the digital design community, although the implementation technologies are still in development. The primitive device of QCA technology is a bistable cellular automaton, which is operated under clocked control. The physical cells and clocking can be created in several ways, and the approaches promising the fastest switching and highest performance are based on electrostatic coupling between the automata, manufactured with a semiconductor, metal island, or molecular process [3, 4, 5]. Information is encoded into the local position of excess charged particles in the QCA cell, having two opposite polarization states to represent binary zero and one, as shown in Figs. 1(a) and 1(b). The state of a cell can be copied into the neighboring cell, enabling a simple wire construct, and a coplanar wire crossing shown in Fig. 1(c). A universal combinatorial logic set is usually constructed with
an inverter gate and a three-input majority gate, shown in Figs. 1(d) and 1(e). The majority voter can be configured as a two-input AND gate by fixing the third input to zero value, or an OR gate, with the third input fixed to one. [1, 2] A clocking field determines when the cells are reset to an un-polarized state, latch their input values, and start driving neighboring cells. This naturally enables sequential logic, but on the nanotechnology, also ensures that the array of cells reaches a stable ground state and has true signal gain. Typically, the clock is applied by interleaving four clocking zones repeatedly across the design, and creating an ultra-fine-grained pipeline. [2, 6, 7, 8]

Fig. 1. QCA primitives: a) type 1 cell and wire, b) type 2 cell and wire, c) coplanar wire crossing, d) inverters, and e) three-input majority gate

This paper presents the implementation of radix-4 modified Booth multiplication on QCA, describing the resulting systolic layout. Noise coupling due to imperfect cellular interaction is avoided by the clock zone placement approach [9], which has a serious consequence: the additional latencies translate into a stalling pipeline, wrecking the performance completely. We avoid this problem by using ultra-fast carry-save addition, and develop components that have matched data rates. The stability of the ground state is verified by time-dependent simulation, using the coherence vector engine of QCADesigner software [10]. The novel multiplier reaches twice the throughput of the previous sequential unit, using only minimal circuit area for active logic, while passive wiring dominates with square-law dependence on the operand word length. The previous designs represent the simple extremities of parallel computation [11, 12, 13], while the more advanced algorithm leads to many ways to adjust the degree of parallelism, which enables exploring the design space for fault-tolerance.

The rest of this paper is organized as follows: Section 2 summarizes the previous work on QCA arithmetic and Section 3 the applied multiplication algorithm. Section 4 describes the proposed implementation, while Section 5 presents design analysis. The conclusions follow in Section 6.
2 Related Work
There has been a considerable amount of research into arithmetic circuits on QCA, aiming to solve the challenges of more general digital design. The first papers described a full adder unit [1, 2], followed by the optimal majority logic formulation [14]. The clock zone approach to avoid radius-of-effect induced noise coupling was considered in [9], and a dense layout with one-cycle carry latency, required in stall-free multi-bit pipelines, was presented in [15, 16]. Multi-bit addition was proposed using a standard bit-serial adder [11, 17] and a ripple carry adder (RCA) [14, 18, 15, 16]. Advanced structures with reduced latency, the carry lookahead adder (CLA) and the conditional sum adder (CSA), were adapted to QCA and analyzed in [19]. The first multiplier proposal for QCA was the serial-parallel structure [11, 12], processing one of the operands as a parallel word and the other bit-serially. This was followed by a fully parallel cellular array multiplier, which was shown to be a quite competitive structure on QCA, due to the wiring overhead evening out the area difference between the designs [13, 20]. The novel approach presented here is situated between the previous proposals, and offers several directions for further improvements, not available for the simpler structures.
3 Radix-4 Multiplication
The popular radix-4 modified Booth algorithm handles two's complement numbers directly, and enables a systolic structure with no control overhead or feedback signals, which is a very important simplification on QCA. The approach, as shown in Fig. 2, scans the multiplier word in groups of three bits, including an overlapping lookbehind reference, and transforms it into a high-radix signed-digit (SD) number, where the digits represent the required operations. The number of clock cycles is halved, in comparison to the direct sequential algorithm. The recoding function shown in Table 1 enables skipping over continuous sequences of zero or one bits in the multiplier word, replacing a sequence of operations with just two additions and shifting, while handling also isolated ones efficiently, as opposed to the standard radix-2 Booth algorithm, which can double the number of operations in the worst case. Since the transformation has no dependence between adjacent digits, all the groups can be recoded in parallel, opening a way to compute also the partial products in parallel. This would not be possible with canonical SD recoding, which offers the maximum number of zero digits.
Fig. 2. Radix-4 overlapped scanning, processing two bits and a lookbehind reference
Table 1. Recoding function, converting multiplier bits into the radix-4 operation

Multiplier bits               Comment           Operation
b_{i+1}  b_i  b_{i-1}
  0       0      0            string of zeros        0
  0       1      0            single one            +A
  1       0      0            start of ones        −2A
  1       1      0            start of ones         −A
  0       0      1            end of ones           +A
  0       1      1            end of ones          +2A
  1       0      1            single zero           −A
  1       1      1            string of ones         0
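Behaviorally, the recoding of Table 1 amounts to evaluating −2b_{i+1} + b_i + b_{i−1} for each overlapped group. The following small Python sketch is an illustration of the table only, not the QCA recoder logic itself; word length handling (even n, implicit lookbehind of zero) follows the scanning scheme of Fig. 2.

```python
def booth4_recode(b, n):
    """Radix-4 modified Booth recoding of an n-bit (n even) two's complement
    multiplier b, following Table 1: returns n/2 signed digits in
    {-2, -1, 0, +1, +2}, least significant group first, where digit d
    stands for the operation d*A."""
    digits = []
    b_prev = 0                                    # implicit lookbehind bit b_{-1} = 0
    for i in range(0, n, 2):
        b_i  = (b >> i) & 1
        b_i1 = (b >> (i + 1)) & 1
        digits.append(-2 * b_i1 + b_i + b_prev)   # value of the scanned bit group
        b_prev = b_i1                             # overlapping reference for the next group
    return digits

# Example: the 4-bit multiplier 0110 (6) recodes to [-2, +2],
# i.e. 6 = (-2)*4**0 + (+2)*4**1.
assert booth4_recode(0b0110, 4) == [-2, 2]
```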
4 Implementation
The main design objective was to achieve continuous operation without stalling the pipeline, since this would deteriorate the performance and require complex control, which currently seems impractical on QCA due to the timing restrictions. The proposed multiplier is a pipelined systolic structure with no feedback signals, achieving the halved latency and doubled throughput promised by the radix-4 algorithm. However, the design does not benefit from skipping of continuous bit sequences in the multiplier word, which on traditional technologies gives a variable performance boost, depending on the particular multiplier word. The reason for this is the ultra-low latency carry-save adder (CSA), which we are using to accumulate the partial products, making the cost of just shifting relatively high, since the physical distance translates directly into latency and additional pipeline stages on the nanotechnology. If the adder had lower throughput than the other components, the zero/shift operations could be utilized to skip the adder completely, but with considerable control cost. The top-level structure is shown in Fig. 3, having eight modules: two blocks for multiplier word recoding (shift register and recoder logic), three blocks for creating and selecting the multiples (complementer, distribution network, and mux), and three blocks for accumulating the partial product and converting it into the final result (carry-save adder, vector merge adder, and shift register).

Table 2. Latency and area of the components

Component                Latency (clock cycles)    Area (cell area units)
Multiplier shift reg.    2                         100 nB + 100
Multiplier recoder       4                         1650
Multiplicand compl.      nA                        200 nA
Multiple distribution    (nA + 1)/2 + 6            140 nA^2 + 500 nA
Multiple select mux      (nA + 1)/2 + 3            900 nA + 900
Carry-save adder         (nA + 1)/2 + 1            360 nA + 700
Vector merge adder       4                         700
Result shift reg.        (nA + nB)/2 − 1           200 (nA + nB)/2
Fig. 3. Radix-4 recoded multiplier on QCA, block diagram
Table 2 summarizes the operand word length dependence of the latency and area of the separate components, but the pipelined bit slices run in parallel, effectively hiding the internal delays. The total latency is linear: L_total = 2 nA + nB/2 + 16, where nA is the multiplicand word length and nB the multiplier word length. The handcrafted layouts were verified with the QCADesigner tool (v. 2.0.3), using the coherence vector engine [10]. The fixed size modules and bit slices were simulated exhaustively, covering all the pipeline states, and the combined blocks were tested with a number of cases and word lengths, as exhaustive runs were prevented by the exponential number of states. Correct functionality was obtained with simulated clock frequencies up to one terahertz, using default parameters and cell width 18 nm, corresponding to a semiconductor technology.
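As an illustration, for 16-bit operands (nA = nB = 16) this expression evaluates to L_total = 2·16 + 16/2 + 16 = 56 clock cycles.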
4.1 Recoding the Multiplier and Creating the Multiples
The nB -bit multiplier word B is initially fed into the shift register for conversion from parallel format into serial stream of bit groups, effectively providing a 3-bit sliding window into the multiplier word, shifting two bits per clock cycle (one overlapping reference bit). The recoder block transforms the groups into the radix-4 SD format sequentially (nB /2 digits), the digits internally expressed as the multiple selection control signals seldouble , selnegate , selnull . The nA -bit multiplicand word A (parallel format) is initially fed into the block forming the negated multiple (−A) (two’s complement, with linear latency), while the following network produces the doubled multiples (2A, −2A) by wiring
that implements the 1-bit shift and extension into nA + 1 bit length. The four words are interleaved by the bit position and fed into the mux, which selects one of them or produces the zero-multiple (choosing from +A, +2A, −A, −2A, 0), based on the three control signals. This wiring as a whole has a linear latency, and a dominating square-law area, in respect to the operand word length.
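The selection step can be summarized behaviorally as follows. This is a Python sketch for illustration only; the control-signal names mirror the sel_double, sel_negate and sel_null signals above, and the final masking stands in for the wired 1-bit shift and extension to nA + 1 bits.

```python
def select_multiple(a, n_a, sel_double, sel_negate, sel_null):
    """Behavioral model of the multiple-selection mux: given the n_a-bit
    two's complement multiplicand a and the three control signals produced
    by the recoder, return the (n_a + 1)-bit two's complement encoding of
    the chosen multiple (+A, +2A, -A, -2A or 0)."""
    if sel_null:
        return 0
    m = a << 1 if sel_double else a          # 2A (1-bit shift done by wiring) or A
    if sel_negate:
        m = -m                               # -A or -2A via two's complement
    return m & ((1 << (n_a + 1)) - 1)        # extension/truncation to n_a + 1 bits
```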
4.2 Accumulating the Multiples
The word-sequentially operating carry-save adder, shown in Fig. 4, accumulates the selected partial product and shifts the carry-save formatted sum two digits to the right. This effectively combines three full width binary operands (new multiple, sum vector, and carry vector) and computes the whole nA + 1 bit accumulation in one clock cycle, matching the input data rate and enabling the continuous operation of the pipeline. A ripple-carry adder on QCA would require a linear number of stall cycles, and the best lookahead and conditional sum adders logarithmic, on the operand word length. This would stop the pipeline between each addition, waiting for the availability of the feedback value. The carry-save adder consists of two-digit wide slices that compute in parallel, to obtain the continuous data flow using the QCA full adders as (3,2)-counters. The full adder has a carry latency of one clock cycle and sum latency of two cycles (the fastest noise rejecting design [15, 16]), which matches the two-bit shifting very well: there is just enough time for the critical path to produce a digit for the next slice, to be used on the next clock cycle. Because of the shifting, a carry bit is moved one digit position to LSB direction, and the sum bit, requiring longer
computation time, two digit positions to LSB direction. An accumulation step is spread in time, each digit slice computing with a different operand.

Fig. 4. Word-sequential carry-save adder with shifting: a) conceptual structure, b) logic with full adder (FA) components, and c) QCA layout with 9-bit operand
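A purely behavioral view of this step, ignoring the QCA clock-zone timing, is sketched below in Python for illustration only; the word width, bit ordering and the handling of the shifted-out digits are assumptions.

```python
def csa_accumulate_step(sum_vec, carry_vec, multiple):
    """One accumulation step: reduce the running sum vector, carry vector and
    the newly selected multiple with full adders acting as (3,2)-counters,
    then shift the carry-save pair two digit positions to the right.
    All arguments are equal-length bit lists, LSB first."""
    width = len(sum_vec)
    new_sum, new_carry = [0] * width, [0] * width
    for i in range(width):
        total = sum_vec[i] + carry_vec[i] + multiple[i]
        new_sum[i] = total & 1                 # sum output of the (3,2)-counter
        if i + 1 < width:
            new_carry[i + 1] = total >> 1      # carry output, weight 2^(i+1)
    # The low-order carry-save digits shifted out here are the ones the
    # vector merge adder (Sect. 4.3) converts into final product bits.
    shifted_out = (new_sum[0], new_sum[1], new_carry[1])
    return new_sum[2:] + [0, 0], new_carry[2:] + [0, 0], shifted_out
```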
4.3 Merging the Vectors
The accumulated partial product has to be processed into a standard non-redundant binary number (radix-2). The vector merge adder, shown in Fig. 5, adds the sequential stream of sum and carry vector bits with proper weights, propagating the carries from the LSB side towards the MSB side. On each clock cycle, two bits from the sum vector (s0, s1) and one bit from the carry vector (only c0, since c−1 does not exist) are transformed into two bits of the final result (pi, pi+1), which are interleaved by the shift register to produce the parallel result word P. The vector merge adder consists logically of a half adder (HA) and a full adder (FA) component, but the existing QCA designs for these blocks could not be utilized, since their combination would have a total latency of two clock cycles, stalling the pipeline. A layout-level optimization of the combination was designed, to match the data rate and solve the carry dependence between two bit positions, since we are summing two radix-4 digits, obtaining bi-directional carry exchange during one clock cycle.
Fig. 5. Vector merge adder: a) logical structure and b) latency-optimized QCA layout
5 Design Analysis
The parallelism of the novel radix-4 recoded multiplier is between the two previous proposals, i.e. the area of the implementation is traded against time. The serial-parallel multiplier [11, 13, 12] uses time-multiplexed hardware, fully parallel in respect to the multiplicand operand, bit-serial in respect to the multiplier operand, while the array multiplier [13,20] has dedicated adder rows, fully parallel in respect to both operands. The performance and cost metrics are compared in Table 3, as functions of the common operand word length n bits.
Fig. 6. a) Circuit area of the multipliers and b) Radix-4 multiplier component areas
The latency of the nanotechnology multipliers is in the linear regime, but there is a tremendous difference in throughput, which is usually considered the most important performance metric. After the pipelines are filled, the clear winner is the array multiplier, producing a new result on each clock cycle. Our novel radix-4 unit produces a new result once in every n cycles, which is the theoretical maximum when the partial products are accumulated sequentially. The fewest results are obtained from the previous bit-serial-parallel multiplier. The circuit area of the units grows with a square-law dependence on the operand length, as shown in Fig. 6(a), due to the dominating wiring overhead [13]. The smallest design is the serial-parallel [11, 13, 12], while the novel radix-4 multiplier is about five times as large. The massive array multiplier [13, 20] reaches 40 times the size of the smallest design. The ratios settle asymptotically. The contribution of active and passive circuitry has a common trend: the previous designs suffer from the need to synchronize the operands to the pipelined bit slices with extensive delay wiring (square-law), while the active circuitry of the serial-parallel design grows only linearly and the core of the array multiplier with quadruple dependence on the operand word length. The active circuitry of the novel radix-4 unit is very small and also limited to linear growth, while the huge multiple distribution network consumes most of the area, as shown in Fig. 6(b). The compared designs are all area-optimized and will not yield to further layout-level improvements, leaving architectural and algorithmic approaches as the only way to reduce the high wiring overhead. The previous multiplier proposals are structurally simple, but in the novel radix-4 design, a compromise in the degree of parallelism has a complexity cost.

Table 3. Multiplier comparison (common word length n)

Design                                         Latency (cycles)        Throughput (results per c.)   Area (cells)
Proposed radix-4                               2.5n + 16               1/n                           140 n^2
Serial-parallel [11, 13], optimized in [12]    3n + 2 (2n optimized)   1/(2n)                        26 n^2
Array [13, 20]                                 4n − 1                  1                             1100 n^2
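As an illustration, evaluating the area expressions of Table 3 at n = 16 bits gives roughly 26·16² ≈ 6,700 cells for the serial-parallel design, 140·16² ≈ 35,800 cells for the proposed radix-4 unit, and 1100·16² ≈ 281,600 cells for the array multiplier, consistent with the roughly fivefold and fortyfold ratios noted above.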
The overhead (nearly all in the multiple distribution wiring) increases the area over the previous sequential design, but we are achieving the doubled throughput and utilizing the underlying full adders with 100% efficiency, while the previous design can feed the bit slices with new operands only in about 75% of the cycles. The radix-4 recoded multiplier can be customized to obtain variable degrees of parallel computation and tolerance against malfunctioning low-level hardware, physical defects and faults. Instead of recoding only one digit of the multiplier per cycle, we can obtain several or all of them at once, and use a tree of carry-save adders to sum several multiples in parallel. Sequential operation with several adders offers the option of module-level redundancy to increase the reliability, but requires a runtime control mechanism.
6 Conclusions
This paper has described the design of a novel recoded radix-4 multiplier, under the fine-grained pipelining constraints needed to obtain stable ground-state computation on quantum-dot cellular automata. Correct operation of the layout-level implementation has been verified with time-dependent quantum mechanical simulation, up to one terahertz frequency. The unit has twice the throughput (results/cycle) of the previous sequential design, at the cost of complexity and larger area, but provides also a versatile starting point for development. Our work is one of the first attempts to design advanced computer arithmetic for the very promising QCA technology, which has several challenges hindering large scale manufacturing. The fundamental issue still to be addressed is the presence of defects and faults inherent to the molecular implementations, resulting in the need to find a hierarchical redundancy scheme to enable practical large systems. Another equally important issue is the power dissipation: QCA is predicted to re-use signal energy so efficiently that the most important heat source will be the irreversible bit erasures, limiting the operating frequency and giving incentive to develop reversible computing approaches. The challenges need attention on the architectural level. Our aim is to incorporate fault-tolerance into the multiplier design by using redundant hardware in the adder portion, and evaluate the gains and costs of this both on the data path and the required control subsystem. In the long run, practical reversible computing might be obtained, since there is preliminary evidence that multiplication hardware might be able to maintain information about the system state trajectory, with relatively low cost. [13]
References

1. Lent, C., Tougaw, P., Porod, W.: Quantum cellular automata: the physics of computing with arrays of quantum dot molecules. In: Proc. Workshop Phys. Comp., Dallas, TX, November 17-20, pp. 5–13 (1994)
2. Lent, C., Tougaw, P.: A device architecture for computing with quantum dots. Proc. IEEE 85(4), 541–557 (1997)
3. Snider, G., Orlov, A., Amlani, I., Bernstein, G., Lent, C., Merz, J., Porod, W.: Quantum-dot cellular automata. In: Dig. Papers of Microprocesses and Nanotechnology Conf., Yokohama, Japan, July 6-8, pp. 90–91 (1999)
4. Orlov, A., Kummamuru, R., Ramasubramaniam, R., Lent, C., Bernstein, G., Snider, G.: Clocked quantum-dot cellular automata devices: experimental studies. In: Proc. IEEE Conf. Nanotechnology, Maui, HI, October 28-30, pp. 425–430 (2001)
5. Kummamuru, R., Orlov, A., Ramasubramaniam, R., Lent, C., Bernstein, G., Snider, G.: Operation of a quantum-dot cellular automata (QCA) shift register and analysis of errors. IEEE Trans. Electron Devices 50(9), 1906–1913 (2003)
6. Blair, E., Lent, C.: Quantum-dot cellular automata: an architecture for molecular computing. In: Proc. Int. Conf. Simulation of Semiconductor Processes and Devices, Boston, MA, September 3-5, pp. 14–18 (2003)
7. Lent, C., Liu, M., Lu, Y.: Bennett clocking of quantum-dot cellular automata and the limits to binary logic scaling. Nanotechnology 17(16), 4240–4251 (2006)
8. Frost-Murphy, S., DeBenedictis, E., Kogge, P.: General floorplan for reversible quantum-dot cellular automata. In: Proc. ACM Int. Conf. Computing Frontiers, Ischia, Italy, May 7-9, pp. 77–81 (2007)
9. Kim, K., Wu, K., Karri, R.: The robust QCA adder designs using composable QCA building blocks. IEEE Trans. Computer-Aided Design 26(1), 176–183 (2007)
10. Walus, K., Jullien, G.: Design tools for an emerging SoC technology: quantum-dot cellular automata. Proc. IEEE 94(6), 1225–1244 (2006)
11. Walus, K., Jullien, G., Dimitrow, V.: Computer arithmetic structures for quantum cellular automata. In: Conf. Rec. 37th Asilomar Conf. Signals, Systems and Computers, Pacific Grove, CA, November 9-12, pp. 1435–1439 (2003)
12. Cho, H., Swartzlander, E.: Serial parallel multiplier design in quantum-dot cellular automata. In: Proc. IEEE Symp. Computer Arithmetic, Montpellier, France, June 25-27, pp. 7–15 (2007)
13. Hänninen, I., Takala, J.: Binary multipliers on quantum-dot cellular automata. Facta Universitatis 20(3), 541–560 (2007)
14. Wang, W., Walus, K., Jullien, G.: Quantum-dot cellular automata adders. In: Proc. IEEE Conf. Nanotechnology, San Francisco, CA, August 11-14, pp. 461–464 (2003)
15. Hänninen, I., Takala, J.: Binary adders on quantum-dot cellular automata. J. Sign. Process. Syst. (to appear), http://dx.doi.org/10.1007/s11265-008-0284-5
16. Hänninen, I., Takala, J.: Robust adders based on quantum-dot cellular automata. In: Proc. IEEE Int. Conf. Application-Specific Systems, Architectures and Processors, Montréal, QC, Canada, July 8-11, pp. 391–396 (2007)
17. Fijany, A., Toomarian, N., Modarress, K., Spotnitz, M.: Bit-serial adder based on quantum dots. Tech. Rep. NPO-20869, NASA's Jet Propulsion Laboratory, Pasadena, CA (2003)
18. Zhang, R., Walus, K., Wang, W., Jullien, G.: Performance comparison of quantum-dot cellular automata adders. In: IEEE Int. Symp. Circ. and Syst., Kobe, Japan, May 23-26, pp. 2522–2526 (2005)
19. Cho, H., Swartzlander, E.: Adder designs and analyses for quantum-dot cellular automata. IEEE Trans. Nanotechnol. 6(3), 374–383 (2007)
20. Hänninen, I., Takala, J.: Pipelined array multiplier based on quantum-dot cellular automata. In: Proc. European Conf. Circuit Theory and Design, Seville, Spain, August 26-30, pp. 938–941 (2007)
Prediction in Dynamic SDRAM Controller Policies

Ying Xu, Aabhas S. Agarwal, and Brian T. Davis
Department of Electrical and Computer Engineering and School of Technology
Michigan Technological University, 1400 Townsend Drive, Houghton, MI 49931
{yixu,asagarwa,btdavis}@mtu.edu
Abstract. Memory access latency can limit microcontroller system performance. SDRAM access control policies impact latency through SDRAM device state. It is shown that execution time can be reduced by using a state machine which predicts, for each access, the policy which will minimize latency. Two-level dynamic predictors are incorporated into the SDRAM controller. A range of organizations for dynamic predictors are described, and the performance improvements predicted by simulation are compared using execution time and prediction accuracy as metrics. Results show that predictive SDRAM controllers reduce execution time by 1.6% to 17% over static access control policies. The best predictor achieves 93% overall prediction accuracy, with 87% accuracy for OP state preferred accesses and 96% for CPA state preferred accesses. Results show that execution time is strongly correlated to the prediction accuracy of OP, suggesting directions for future predictor development.

Keywords: SDRAM, Memory Latency, Access Control Policy.
1 Introduction

SDRAM controllers use static access control policies, either Open Page (OP) or Close Page Autoprecharge (CPA). Currently, the policy used is either determined at design time, or by the BIOS during boot up [10]. OP leaves the accessed row open after each access, whereas CPA closes it by performing a bank precharge. The static controller policy that yields minimal benchmark execution time largely depends upon the SDRAM access pattern of an application. OP is beneficial for applications containing a large amount of data locality in the memory access stream; while for applications with little data locality, CPA provides lower execution time. In this paper, dynamic controller policies using the history of memory accesses to select the preferred controller policy for each pending access are explored. From Table 1 it can be seen that to provide the lowest possible latency, OP should be used for row hits, while CPA should be used for row conflicts. To examine the performance available through dynamic access control policies, the oracle dynamic controller policy (DYN_UPB) uses knowledge of future accesses to always choose correctly whether to maintain an accessed row open or to precharge the bank.
DYN_UPB provides an upper bound on the performance improvement achievable using dynamic controller policies without changing the access schedule [4] or data placement. DYN_UPB is not feasible to implement.

Table 1. Memory access latency assuming zero contention

Controller policy    Row hit           Row conflict
OP                   tcl               tcl + trcd + trp
CPA                  tcl + trcd        tcl + trcd
DYN_UPB              tcl               tcl + trcd

trp: precharge to row active delay; trcd: row active to column access delay; tcl: column access to first data output delay
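The latency model of Table 1 can be written directly as a small helper; this is a Python illustration only, where t_cl, t_rcd and t_rp are the timing parameters defined in the table notes.

```python
def access_latency(policy, row_hit, t_cl, t_rcd, t_rp):
    """Zero-contention access latency of Table 1 for one access, given the
    controller policy and whether the access hits the open row."""
    if policy == "OP":
        return t_cl if row_hit else t_cl + t_rp + t_rcd
    if policy == "CPA":
        return t_cl + t_rcd              # the row is always activated first
    if policy == "DYN_UPB":              # oracle: the correct choice was always made
        return t_cl if row_hit else t_cl + t_rcd
    raise ValueError("unknown policy: " + policy)
```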
An SDRAM controller that utilizes different policies for each access will increase the complexity of the controller finite state machine (FSM). As the number of transistors per unit area continues to increase [3], it is possible to implement more complicated SDRAM controllers without increasing the die size and cost. This paper makes the following contributions:

• Introduce organizations for two-level dynamic SDRAM controller policy.
• Compare the performance of dynamic policy predictors against static policies.
• Study the impact of history register length on prediction accuracy.
• Study the relationship between prediction accuracy and execution time.
The remainder of this paper is organized as follows. Section 2 discusses SDRAM operation and prior work. Section 3 describes the methodology and design of multiple, two-level dynamic policy predictors. Section 4 presents the simulation results and analysis. Section 5 draws conclusions and suggests future work.
2 Background

SDRAM devices contain multiple independent banks; each bank contains an array of memory cells. Depending on the address of an access and the state of the SDRAM devices, a typical SDRAM operation may require up to three phases to access the desired data: bank precharge, row activation, and column access. Bank precharge prepares the bank for row activation. Through row activation, the data contained in one row of the memory array is transferred into the bank's sense amplifiers. While a row is active in the sense amplifiers, one or more column accesses can be performed to retrieve or modify the desired data. The order and occurrence of these three phases are dependent upon application access pattern and device state. The multi-dimensional structure of SDRAM devices, with dimensions of channel, rank, bank, row and column, results in non-uniform access latency [6]. Whether an access is going to fall within the same row as the previous access is dependent upon the access pattern of an
application. Thus the preferred access control policy is dependent upon the application, and a static access control policy cannot result in the lowest latency for all accesses. Prior work [5] has been done to predict the duration that an SDRAM page should be maintained in the open state. Another study [9] predicts not only whether a DRAM page should be precharged subsequent to an access, but also the next desired page to be opened. Our approach differs from prior work in the design of the predictors, the benchmarks, the evaluation metric (execution time), and restricting prediction to the decision to precharge the DRAM row subsequent to an access. The predictive SDRAM controller policy is independent of application, access ordering, and requires no changes to software, microprocessor architecture or SDRAM device architecture. This research has the potential to reduce execution time through modifications to the controller alone. The decision to leave the accessed row open or closed after an access is a binary choice. Keeping a row open (OP) is similar to a taken branch (binary 1), while closing a row (CPA) is similar to a non-taken branch (binary 0). However, unlike branch predictions, mispredictions of access control policy require no roll-back of the state, greatly simplifying implementation. Misprediction will decrease system performance but will not affect program correctness. Prediction mechanisms proven successful in branch prediction [12] are examined as an initial design for controller policy prediction.
3 Dynamic Controller Policy Predictors

Two-level dynamic SDRAM controller policy predictors utilize SDRAM access history to make predictions for SDRAM controller policy.

3.1 Overview of Dynamic Policy Predictors

The structure inside the predictor that collects the row hit or row conflict behavior for the last n accesses is an n-bit shift register labelled the history register (HR). Each of the 2^n patterns possible in the HR corresponds to an entry in the Pattern History Table (PHT) [12]. Each entry in the PHT uses a 2-bit saturating counter and keeps track of the controller policy which would have resulted in the lowest access latency for the
prior occurrences of a specific pattern in these n entries. Figure 1 shows the structure of the dynamic policy predictors examined herein. Transitions of the PHT can be seen in Figure 2. The PHT entries are initialized at the beginning of execution.

Fig. 1. Structure of dynamic policy predictors

3.2 Predictor Operation

When an access occurs, the row hit or row conflict behavior of this access is recorded into the HR. If a dynamic PHT is used, the behavior is also used to update the PHT entry addressed by the previous access, using the appropriate FSM from Figure 2. The HR is then used to address the PHT. The addressed PHT entry is then used to predict the controller policy for the current access. Conclusive knowledge of whether OP or CPA is preferred is non-speculative only after the successive access address is known. For this reason, the behavior of access N can only be used to update the entry in the PHT addressed by access N-1, where access N-1 and access N are within the same bank. Two different PHT FSMs that bias the prediction toward OP are simulated, due to results suggesting that OP accuracy has a more significant impact upon performance than CPA accuracy. Compared with the non-biased update method, the biased update method, shown in Figure 2(b), always updates the state of a PHT entry to OP on a row hit, regardless of its previous state. As shown in Figure 2(c), the biased PHT method uses OP policy for three states of the PHT entry.
Fig. 2. State transition diagram of dynamic predictors: (a) non-biased, (b) biased update, (c) biased PHT
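As an illustration of the mechanism described in Sections 3.1 and 3.2, the sketch below models a single global-history, global-PHT predictor in Python. It is a simplification: the real controller trains only when consecutive accesses map to the same bank, and the counter encoding and PHT initialization value are assumptions. Counter values 3 and 2 are taken to correspond to the OP and WEAK_OP states of Figure 2(a), values 1 and 0 to WEAK_CP and CP; the biased-update option jumps to the strong OP state on any row hit, and the biased-PHT variant of Figure 2(c) would instead predict OP for any non-zero counter.

```python
class PolicyPredictor:
    """Two-level dynamic policy predictor: an n-bit history register of
    row hit (1) / row conflict (0) outcomes indexes a table of 2-bit
    saturating counters that select OP or CPA for each access."""

    def __init__(self, history_bits=4, biased_update=False):
        self.n = history_bits
        self.hr = 0                                 # history register
        self.pht = [2] * (1 << history_bits)        # init toward weak OP (assumption)
        self.biased_update = biased_update
        self.prev_entry = None                      # PHT entry addressed by the previous access

    def update(self, row_hit):
        """The row hit/conflict outcome of the current access trains the PHT
        entry addressed by the previous access, then enters the history."""
        if self.prev_entry is not None:
            c = self.pht[self.prev_entry]
            if row_hit:
                c = 3 if self.biased_update else min(3, c + 1)
            else:
                c = max(0, c - 1)
            self.pht[self.prev_entry] = c
        self.hr = ((self.hr << 1) | int(row_hit)) & ((1 << self.n) - 1)

    def predict(self):
        """Controller policy for the current access, from the entry the HR selects."""
        self.prev_entry = self.hr
        return "OP" if self.pht[self.hr] >= 2 else "CPA"


# Per access: the observed outcome first trains the previous prediction,
# then a policy is chosen for the row left behind by this access.
pred = PolicyPredictor(history_bits=4, biased_update=True)
for hit in [True, True, False, True]:       # example row hit/conflict stream
    pred.update(hit)
    policy = pred.predict()
```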
3.3 Organizations of Dynamic Policies

Depending on how the history information is maintained, there are three variations for the HR: Global (G), per-Bank (B), and per-Page (P), and similarly three possible organizations for the dynamic PHT: global (g), per-bank (b), and per-page (p). Global uses one data structure (HR/PHT), per-bank means that each bank has its own data structure, and per-page means that each row of an SDRAM bank has its own data structure. Depending on the HR organization, the PHT organization, whether the PHT is static (S) or adaptive (A), and whether a biased PHT FSM is used, dynamic predictors have many possible organizations; those simulated are shown in Table 2. The nomenclature used in this paper is similar to [12]. Not all possible predictors are simulated: some are excluded because their performance is lower than those presented, others due to implementation costs. The cost analysis for each of these prediction policies, in bits of storage required and estimated delay, is given in [11].

Table 2. Simulated SDRAM controller policies

Category                                   Description and Name
Theoretical Upper Bound                    DYN_UPB
Static Policy                              OP, CPA
Dynamic Policy Predictors (static PHT)     non-biased static {BSg, PSg}; biased static b_int {BSg, PSg}
Dynamic Policy Predictors (dynamic PHT)    non-biased dynamic {BAb, PAg, PAb}; biased update dynamic b_upd {BAb, PAg, PAb}; biased PHT dynamic b_pht {BAb, PAg, PAb}
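The storage cost of these organizations is easy to estimate to first order. The sketch below is our own back-of-the-envelope calculation, not the analysis of [11]: it counts only HR and PHT bits, assumes 2-bit PHT entries, and treats the bank and page counts as free parameters.

def predictor_storage_bits(n, hr_org="G", pht_org="g", banks=4, pages_per_bank=8192):
    """Approximate storage (bits) of a dynamic policy predictor.
    hr_org:  'G' global, 'B' per-bank, or 'P' per-page history registers (n bits each).
    pht_org: 'g' global, 'b' per-bank, or 'p' per-page PHTs (2^n entries of 2 bits each)."""
    copies = {"G": 1, "B": banks, "P": banks * pages_per_bank}
    hr_bits = n * copies[hr_org]
    pht_bits = 2 * (1 << n) * copies[pht_org.upper()]
    return hr_bits + pht_bits

# Example: a PAb predictor (per-page HRs, per-bank PHTs) with a 5-bit history on a
# hypothetical 4-bank device with 8192 rows per bank needs about
# 5*4*8192 + 2*32*4 = 164,096 bits, dominated by the per-page history registers.
print(predictor_storage_bits(5, "P", "b"))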
4 Simulation Results

Execution-driven simulation is utilized for the analysis presented. Execution time and prediction accuracy are the primary criteria for evaluating performance.

4.1 Simulation Setup

A revised version of SimpleScalar v3.0d [1] is used in the simulations performed. The characteristics of the baseline machine are given in Table 3. The main memory model in SimpleScalar is replaced with a more accurate DDR SDRAM model along with a detailed SDRAM controller. The write buffer performs writeback only when the SDRAM bus has been idle for three cycles [7]. The memory access scheduler selects pending accesses from the memory access queue, interleaving accesses across memory banks to increase utilization. As a result, memory accesses to a common bank are executed in order, while accesses to distinct banks may be executed out of order. The simulated processor has a 16-entry register update unit (RUU) and an 8-entry load-store queue (LSQ). The memory access queue has enough entries to hold the maximal number of outstanding memory accesses supported by the LSQ and RUU.

Table 3. Baseline machine characteristics

CPU                        2.4 GHz, 4-way out-of-order execution processor
L1 cache                   64KB I-cache and 64KB D-cache, 2-way, 32B cache line
L2 cache                   512KB, 16-way, 64B cache line
Main memory                2GB DDR400, 3-4-4-8, single channel, 4 ranks, burst of 8
SDRAM controller policy    OP
Write buffer               16 entries

Figure 3 shows the total number of SDRAM accesses for each of the 24 correctly executing SPEC CPU2000 [8] benchmarks. Benchmarks which have very few memory accesses will not be significantly impacted by an improvement in the SDRAM system; thus, from the 24 benchmarks, 17 benchmarks are selected for simulation [2].
Fig. 3. Total SDRAM accesses for each benchmark
4.2 Access Policy Preference by Benchmark Figure 4 shows the execution time for each benchmark using OP, CPA and DYN_UPB. Benchmarks for which OP results in a lower execution time are shown to the left of the dashed line, CPA to the right. Figure 5 shows the access policy preference for each simulated benchmark. When OP preferred accesses exceed 11% of the total accesses of a benchmark, OP results in lower execution time than CPA. It is notable that even though 89% of the accesses prefer CPA, use of the OP policy results in a lower execution time than CPA.
Fig. 4. Execution time for all benchmarks normalized to OP
Fig. 5. Static controller policy preference of benchmark accesses: (a) CPA-preferred benchmarks, (b) OP-preferred benchmarks
4.3 Execution Time Figure 6 shows the execution times of static and predictive SDRAM controller policies, including dynamic policy predictors with history register length of 5, averaged across all
benchmarks and normalized to OP. On average, DYN_UPB improves the performance by 3.7% compared to OP and 19% compared to CPA. The b_upd PAb dynamic predictor achieves the best performance shown, with an improvement of 1.0% compared to OP and 16% compared to CPA. The b_upd PAb predictor achieves 27% of the performance improvement of DYN_UPB over OP and 89% of the performance improvement of DYN_UPB over CPA. All dynamic policy predictors work better than CPA, whereas BAb, BSg, PSg, and b_int_BSg do not achieve the performance of OP.
Fig. 6. Normalized average execution time (HR = 5)
4.4 Averaged Prediction Accuracy

Figure 7 shows the averaged prediction accuracy for OP-preferred accesses, the averaged prediction accuracy for CPA-preferred accesses, and the averaged overall prediction accuracy. All predictors achieve an overall prediction accuracy above 81%. The PAb predictor reaches the highest overall prediction accuracy, 91%, and also the highest prediction accuracy for CPA-preferred accesses. The b_upd PAb predictor achieves the highest prediction accuracy for OP-preferred accesses, 76%.
Fig. 7. Averaged prediction accuracy of all predictors
Predictors with a low prediction accuracy for OP-preferred accesses have a high execution time, regardless of their prediction accuracy for CPA-preferred accesses. This is the motivation for biased predictors. The biased predictors sacrifice CPA prediction accuracy, and thus overall prediction accuracy, to improve OP accuracy, resulting in lower execution times. The b_upd PAb predictor, which has the highest prediction accuracy for OP-preferred accesses, provides the lowest execution time.

4.5 Correlation between Prediction Accuracy and Execution Time

Figure 8 shows the correlation between execution time and the prediction accuracy of OP, the prediction accuracy of CPA, and the overall prediction accuracy for all simulated benchmarks. All benchmarks, with the exception of vpr, exhibit a strong negative correlation between execution time and OP prediction accuracy. This indicates that an improvement in OP prediction accuracy should result in reduced execution time, and it motivates the biasing of the predictors discussed in Section 3.2 toward the best possible OP prediction accuracy.
Fig. 8. Correlation of execution time and prediction accuracy
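The correlation coefficients plotted in Figure 8 can be reproduced directly from the per-benchmark simulation statistics. The sketch below uses hypothetical numbers (one benchmark, five predictors) purely to show the calculation; it is not measured data.

from statistics import mean

def pearson(xs, ys):
    """Pearson correlation coefficient between two equally long sequences."""
    mx, my = mean(xs), mean(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    return cov / (sum((x - mx) ** 2 for x in xs) * sum((y - my) ** 2 for y in ys)) ** 0.5

# Hypothetical per-predictor statistics for a single benchmark:
op_accuracy = [0.47, 0.55, 0.62, 0.70, 0.76]   # prediction accuracy on OP-preferred accesses
exec_time   = [1.08, 1.05, 1.02, 1.00, 0.99]   # execution time normalized to OP

print(pearson(op_accuracy, exec_time))          # strongly negative, as in Figure 8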
For most benchmarks, the prediction accuracy of OP is the dominant factor affecting performance, since it has the strongest negative correlation with execution time. Wupwise and vpr are the exceptions; for these two or similar benchmarks, biasing the predictors toward OP may negatively impact execution time.

4.6 Effects of History Register Length

The normalized execution times of predictors with increased history register lengths are shown in Figure 9. Execution times are again normalized to OP and averaged across all simulated benchmarks. The execution times of the predictors examined decrease as the length of the history register increases.
Fig. 9. Normalized execution time for different HR lengths
With a short history register, the biased dynamic-update predictors achieve the lowest execution time; with a long history register, the biased static predictors do. For example, with history register lengths of 5 and 7, b_upd PAb achieves the lowest execution time, whereas with a history register length of 11, b_int PSg achieves the lowest execution time. b_int PSg reduces execution time by 1.6% compared to OP and 17% compared to CPA. This performance improvement is 44% of the improvement available from DYN_UPB over OP.
5 Conclusions

SDRAM controller policy impacts access latency, making it advantageous to apply a different controller policy to different accesses. This approach can provide performance improvements for all application types. Applications such as DVR, Apache (web server), and other applications which make extensive use of SDRAM are expected to show more significant improvement than the SPEC CPU2000 benchmarks shown herein. A predictive dynamic SDRAM controller approach can be used in conjunction with other memory optimization techniques such as prefetching, access scheduling, and data placement. Multiple optimizations operating in conjunction may constructively improve performance, but an exhaustive examination of all combinations and permutations is beyond the scope of this work. The degree to which an SDRAM policy predictor will reduce application execution time depends upon the memory access pattern. The best dynamic controller policy predictor simulated achieves an average reduction in execution time of 1.6%. While this improvement is a modest 44% of the improvement available given the framework constraints {benchmarks, access schedule, and data placement}, the changes required for this performance improvement are limited to the SDRAM controller. The prediction accuracy of the different dynamic policy predictors ranges from 47% to 87% on OP-preferred accesses and from 73% to 96% on CPA-preferred accesses. Lengthening the history register results in lower execution time and higher prediction
accuracy. Benchmark execution time is strongly correlated with the prediction accuracy of OP, suggesting that any future work in access-control prediction should focus primarily on predicting those accesses for which OP is preferred. This study illustrates that with 400 MHz DDR SDRAM (PC3200) devices, in-order access scheduling, and default placement, modest gains in performance are possible through prediction of the SDRAM access control policy. Use of a memory technology with increased non-uniformity between access latencies (i.e., higher-frequency SDRAM) is expected to enable more significant performance gains. The gains available through access policy prediction motivate exploration of access scheduling and placement techniques in conjunction with the predictive access control policy.
References

[1] Burger, D., Austin, T.M.: The SimpleScalar Tool Set, Version 2.0. SimpleScalar LLC
[2] Citron, D.: MisSPECCulation: Partial and Misleading Use of SPEC CPU2000 in Computer Architecture Conferences. In: Proceedings of ISCA-30, pp. 52–61 (2003)
[3] Hamilton, S.: Taking Moore's Law into the Next Century. Computer 32(1), 43–48 (1999)
[4] Hur, I., Lin, C.: Adaptive History-Based Memory Schedulers. In: Proceedings of MICRO-37, pp. 343–354 (2004)
[5] Ma, C., Chen, S.: A DRAM Precharge Policy Based on Address Analysis. In: Proceedings of DSD, pp. 244–248 (2007)
[6] Rixner, S., Dally, W.J., Kapasi, U.J., Mattson, P., Owens, J.D.: Memory Access Scheduling. In: Proceedings of ISCA-27, pp. 128–138 (2000)
[7] Skadron, K., Clark, D.W.: Design Issues and Tradeoffs for Write Buffers. In: Proceedings of HPCA-3, pp. 144–155 (1997)
[8] SPEC CPU2000 V1.2, Standard Performance Evaluation Corporation (December 2001)
[9] Stankovic, V., Milenkovic, N.: DRAM Controller with a Complete Predictor: Preliminary Results. In: Proceedings of TELSIKS, pp. 593–596 (2005)
[10] Wong, A.: Breaking Through the BIOS Barrier: The Definitive BIOS Optimization Guide for PCs. Prentice Hall, Englewood Cliffs (2004)
[11] Xu, Y.: Prediction in Dynamic SDRAM Controller Policy. MSEE Thesis, College of Engineering, Michigan Tech. (2006)
[12] Yeh, T., Patt, Y.N.: A Comparison of Dynamic Branch Predictors that Use Two Levels of Branch History. In: Proceedings of ISCA-20, pp. 257–266 (1993)
Inversion/Non-inversion Implementation for an 11,424 Gate-Count Dynamic Optically Reconfigurable Gate Array VLSI Shinichi Kato and Minoru Watanabe Electrical and Electronic Engineering Shizuoka University 3-5-1 Johoku, Hamamatsu, Shizuoka 432-8561, Japan
[email protected]
Abstract. To date, various optically reconfigurable gate arrays (ORGAs) have been developed to realize both fast reconfiguration and numerous reconfiguration contexts. Optically differential reconfigurable gate arrays (ODRGAs) offer advantageous capabilities compared with other ORGAs: they provide an increased reconfiguration frequency per unit of laser power and a reduced optical power consumption. Dynamic optically reconfigurable gate arrays (DORGAs) can realize the highest gate density, but an important disadvantage of DORGAs is that their reconfiguration frequency is lower, and their optical power consumption greater, than that of ODRGAs. Therefore, a novel inversion/non-inversion dynamic optically reconfigurable gate array that adopts only the good factors of both architectures has been developed. This paper presents an inversion/non-inversion implementation for a fabricated 11,424 gate-count dynamic optically reconfigurable gate array VLSI. Based on that implementation, three factors are discussed: gate density, reconfiguration frequency per unit of laser power, and optical power consumption.
1 Introduction

Demand for high-speed reconfigurable devices has continued to increase. If a gate array can be reconfigured rapidly, an idle circuit can be removed from the gate array and other necessary circuitry can be programmed onto it at that time, thereby increasing the gate-array activity. However, FPGAs, the major programmable devices, are unsuitable for such dynamic reconfiguration because they require more than several milliseconds of reconfiguration time [1]–[3]. On the other hand, high-speed reconfigurable devices have been developed, e.g., DAP/DNA chips, DRP chips, and multi-context FPGAs [4]–[7]. Their chip packages include reconfiguration memories and a microprocessor array or a gate array. The internal reconfiguration memory stores reconfiguration contexts of 4–16 banks, which can be switched from one to another during a clock cycle. Consequently, the arithmetic logic unit or the gate array of such devices can be reconfigured in a few nanoseconds on every clock cycle. However, increasing the internal reconfiguration memory while maintaining the gate density is extremely difficult.
For that reason, various optically reconfigurable gate arrays (ORGAs) have been developed to realize both fast reconfiguration and numerous reconfiguration contexts [8]–[9]. Optically differential reconfigurable gate arrays (ODRGAs) present the important advantages of providing a higher reconfiguration frequency per unit of laser power and a lower optical power consumption than other ORGAs [10]–[11]. On the other hand, although dynamic optically reconfigurable gate arrays (DORGAs) can realize the highest gate density, which is an important benefit, their reconfiguration frequency per unit of laser power is lower than that of ODRGAs and their optical power consumption is greater than that of ODRGAs [12]–[13]. Each of these ORGAs therefore has strong and weak points. For this reason, a novel inversion/non-inversion dynamic optically reconfigurable gate array that extracts only the good factors from both architectures has been developed, and its effectiveness has been confirmed using an emulated 68 gate-count ORGA-VLSI [14,15]. However, the reconfiguration time of that emulation was on the order of milliseconds, so its fast reconfiguration capability has never been confirmed; furthermore, the gate density of such emulated VLSI chips is very low. This paper presents an inversion/non-inversion implementation for a fabricated 11,424 gate-count dynamic optically reconfigurable gate array VLSI. Based on that implementation, three factors are discussed: gate density, reconfiguration frequency per unit of laser power, and optical power consumption.
2 Conventional ORGA Architectures 2.1 Basic Construction An ORGA optical system comprises laser sources, an optical holographic memory, and a programmable gate array VLSI. The holographic memory can store numerous reconfiguration contexts. The reconfiguration contexts in the holographic memory are addressed by a laser diode array that is mounted on the top side of the holographic memory. The diffraction pattern from the holographic memory can be received as a reconfiguration context on a photodiode-array that is implemented in a programmable gate array of an ORGA-VLSI. Using this arrangement, this architecture enables fast reconfiguration, along with the use of numerous reconfiguration contexts [8]–[15]. 2.2 Optically Differential Reconfigurable Gate Array Optically Differential Reconfigurable Gate Arrays (ODRGAs) have been developed to increase reconfiguration frequency [10]–[11]. Figure 1 shows that each reconfiguration circuit of optically differential reconfigurable gate arrays includes a photodiode, a refresh transistor, and toggle flip flops. In the ODRGA-VLSI, photodiodes are placed near and are directly connected to programming elements of a programmable gate array through static configuration memory. Therefore, the ODRGA-VLSI can be reconfigured perfectly in parallel with no overhead. The reconfiguration frequency is much higher than that of FPGAs. Moreover, the differential reconfiguration architecture of ODRGA-VLSIs offers the advantage of further increasing the reconfiguration speed compared to that of other ORGAs using equivalent laser power [11]. The diffraction light intensity from a holographic memory is inversely proportional to the number of bright bits included in a configuration context. Because the reconfiguration speed can
Fig. 1. Conventional optical differential reconfiguration circuit with four configuration bits
Fig. 2. Conventional optical dynamic reconfiguration circuit with four configuration bits of a DORGA-VLSI
be accelerated by increasing the optical power reaching each photodiode, the reconfiguration speed can be increased without any increase of laser power if the number of bright bits in a configuration context can be decreased. The differential reconfiguration strategy is a method of reducing the number of bright bits, i.e., the bits in state '1' of a configuration context. Heretofore, the reconfiguration speed of an ODRGA-VLSI itself has been measured in nanoseconds by exploiting this architecture [11]. However, the static configuration memory has prevented the realization of high-gate-count ORGA-VLSIs.

2.3 Dynamic Optically Reconfigurable Gate Array

To realize a high-gate-count ORGA-VLSI, a dynamic optically reconfigurable gate array (DORGA) architecture was proposed [12]–[13]. In this architecture, the static configuration memory is entirely removed; the junction capacitances of the photodiodes are used as dynamic configuration memory to store a context, as portrayed in Fig. 2. An 11,424-gate-count dynamic optically reconfigurable gate array VLSI has been fabricated in a 0.35 μm CMOS process [13], and larger gate-count VLSIs will become available through the use of more advanced process technologies. However, an important disadvantage is that the reconfiguration frequency per unit of laser power is lower than that of ODRGAs and the optical power consumption is greater than that of ODRGAs.
3 Novel Inversion/Noninversion Zero-Overhead Configuration Method To realize the advantages of fast configurations and high gate count, a novel inversion/noninversion optical configuration method has been proposed [14,15]. The inversion/noninversion optical configuration method can perfectly remove the static configuration memory and reduce the number of bright bits in a configuration context. Thereby, it can provide a higher reconfiguration frequency per unit of laser power. 3.1 Generation Method of an Optical Configuration Vector In this strategy, configuration data for logic blocks, switching matrix, and I/O blocks are divided into small segments. For the small segments, the inversion/noninversion optical
configuration method is applied. The following discussion uses the bit-length N, i.e., the number of bits included in a small segment. A configuration vector α used for programming the gate array and an optical configuration vector γ, whose information is programmed onto a holographic memory, are represented respectively as an N-dimensional vector and an (N+1)-dimensional vector:

\alpha = (\alpha_1, \alpha_2, \ldots, \alpha_N),    (1)
\gamma = (\gamma_1, \gamma_2, \ldots, \gamma_N, \gamma_{N+1}),    (2)

where each element of the vectors takes a binary value {0,1}. The optical configuration vector γ is defined as

\gamma_{(1,\ldots,N)} = \alpha \oplus \gamma_{N+1},    (3)

where α is a certain configuration context, γ_{N+1} is the inversion bit included in the optical configuration vector γ, and ⊕ denotes the exclusive-OR operator. The role of the inversion bit γ_{N+1} is to determine whether an optical configuration is executed based on a certain configuration context α or on its inverted context \bar{\alpha}. The inversion bit is defined as

\gamma_{N+1} = \begin{cases} 1 & \text{if } \sum_{i=1}^{N} \alpha_i \ge \lfloor N/2 + 1 \rfloor, \\ 0 & \text{otherwise}, \end{cases}    (4)

where \lfloor N/2 + 1 \rfloor denotes N/2 + 1 rounded down to the nearest whole number. Consequently, a calculated optical configuration vector γ, including its inversion bit, is programmed onto a holographic memory as a segment of a configuration context.

3.2 Operation of a Gate Array VLSI

In advance, an optical configuration vector γ, calculated using Eqs. (3) and (4), is programmed onto a holographic memory. During reconfiguration, the optical configuration vector γ, including the inversion bit γ_{N+1}, is read out from the holographic memory and programmed onto the ORGA-VLSI. The configuration vector α of a segment of the gate array on the ORGA-VLSI is generated by an exclusive-OR operation between the received optical configuration vector γ_{(1,\ldots,N)} and the inversion bit γ_{N+1}:

\alpha = \gamma_{(1,\ldots,N)} \oplus \gamma_{N+1}.    (5)
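As a sanity check of Eqs. (3)–(5), the encoding and decoding steps can be modeled in a few lines of Python. This is an illustrative sketch only — on the real device the decoding is performed by the exclusive-OR gates of the configuration circuit — and it also brute-forces the average number of bright bits for N = 4, reproducing the 1.5625 figure derived in the estimation that follows.

from itertools import product

def encode(alpha):
    """Eqs. (3)-(4): map an N-bit segment alpha to an (N+1)-bit optical vector gamma.
    The appended bit is the inversion bit; the data bits are inverted whenever the
    segment contains at least floor(N/2 + 1) ones."""
    n = len(alpha)
    inv = 1 if sum(alpha) >= n // 2 + 1 else 0
    return [a ^ inv for a in alpha] + [inv]

def decode(gamma):
    """Eq. (5): recover the configuration segment (done by XOR gates on the VLSI)."""
    inv = gamma[-1]
    return [g ^ inv for g in gamma[:-1]]

# Decoding always recovers the original segment, and for N = 4 the average number of
# bright ('1') bits per optical vector drops from 2.0 to 1.5625.
N = 4
total_bright = 0
for alpha in product((0, 1), repeat=N):
    gamma = encode(list(alpha))
    assert decode(gamma) == list(alpha)
    total_bright += sum(gamma)
print(total_bright / 2 ** N)   # -> 1.5625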
3.3 Estimation

Here, the reduction in the number of bright bits achieved by the inversion/noninversion configuration method is discussed. It is assumed that configuration contexts are given continuously to an ORGA-VLSI and that they uniformly include all possible patterns; for example, for a 4-bit configuration, all possible patterns means the 16 patterns "0000", "0001", ..., "1111". Under this condition, the average fraction of bright bits in a configuration context of a conventional ORGA is estimated first. Bright ('1') bits correspond to laser irradiation; their average fraction is obtained by counting the '1' bits of all possible vectors and dividing by N 2^N, the total number of bits over all possible vectors:

\kappa_{ORGA} = \frac{1}{N 2^{N}} \sum_{r=1}^{N} r \cdot {}_{N}C_{r} = \frac{1}{2},    (6)

where {}_{N}C_{r} denotes the binomial coefficient. For the inversion/noninversion method, the corresponding average fraction of bright bits, including the inversion bit, is

\kappa_{new} = \frac{\sum_{r=1}^{\lfloor N/2 \rfloor} r \cdot {}_{N}C_{r}}{N 2^{N}} + \frac{\sum_{r=\lfloor N/2+1 \rfloor}^{N} (N - r + 1) \cdot {}_{N}C_{r}}{N 2^{N}}.    (7)
The first term on the right-hand side of Eq. (7) has the same form as Eq. (6) and corresponds to the case in which the inversion bit γ_{N+1} equals 0; the second term corresponds to the case in which γ_{N+1} equals 1. In the case of four configuration bits, the average number of bright bits per segment is thereby decreased from 2.0 for conventional ORGAs to 1.5625. Hence, using this inversion/noninversion dynamic optical configuration method with four bits, about 22% of the bright bits are removed, and the reconfiguration frequency can be increased accordingly.

3.4 Inversion/Noninversion Dynamic Optical Configuration Circuit

Figure 3 portrays the circuit diagram of an inversion/noninversion dynamic optical configuration circuit with four configuration bits. The configuration circuit consists of charge-integrated photo-circuits and exclusive-OR gates. Since the configuration circuit is based on the dynamic optically reconfigurable gate array architecture, each photodiode is used not only for detecting a configuration context but also as dynamic configuration memory; the static configuration memory is completely removed, so that a high gate-count VLSI can be realized. The only difference from the reconfiguration circuits of the dynamic optically reconfigurable gate array architecture is that an inversion photodiode and exclusive-OR gates are added. This circuit can execute the procedure shown
Fig. 3. Circuit diagram of an inversion/noninversion dynamic optical configuration circuit including four configuration bits
in Eq. (5) perfectly. Of course, in this method, some increase of the area is necessary. Nevertheless, the increased area is less than 14%; in fact, it can be ignored.
4 Experimental System

4.1 Holographic Memory Pattern Calculation

The holographic memory patterns used for these experiments are calculated as follows. The hologram for the ORGA is assumed to be a thin holographic medium, to suit a liquid crystal spatial light modulator. The laser aperture plane, the holographic plane, and the ORGA-VLSI plane are parallel. The laser beam is collimated and propagates onto the holographic plane. The holographic medium comprises rectangular pixels on the x_1–y_1 holographic plane, whose values are treated as analog. The input object, on the other hand, is made up of rectangular pixels on the x_2–y_2 object plane, each of which can be modulated to be either on or off. The intensity distribution of the holographic medium is calculated as

H(x_1, y_1) \propto \int_{-\infty}^{\infty} \int_{-\infty}^{\infty} O(x_2, y_2) \sin(kr) \, dx_2 \, dy_2, \quad r = \sqrt{Z_L^2 + (x_1 - x_2)^2 + (y_1 - y_2)^2},    (8)

where O(x_2, y_2) is the binary value of a reconfiguration context, k is the wave number, and Z_L is the distance between the holographic plane and the object plane. The value H(x_1, y_1) is normalized to the range 0–1 using the minimum intensity H_{min} and maximum intensity H_{max}:

H'(x_1, y_1) = \frac{H(x_1, y_1) - H_{min}}{H_{max} - H_{min}}.    (9)
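A direct numerical evaluation of Eqs. (8) and (9) can be sketched as follows. This is a simplified model rather than the authors' code: the double integral is approximated by a sum over the 'on' object pixels, and the grid size, pixel pitch, wavelength and distance Z_L are taken from the setup described later in this section (300 x 300 hologram pixels of 8.5 μm, a 532 nm laser, and a 100 mm distance).

import numpy as np

def hologram_intensity(on_pixels, xs, ys, wavelength=532e-9, z_l=0.1):
    """Approximate Eqs. (8)-(9): interference pattern of the bright object pixels,
    normalized to the 0-1 range that is later quantized to 256 gradations."""
    k = 2.0 * np.pi / wavelength                       # wave number
    x1, y1 = np.meshgrid(xs, ys)                       # holographic-plane coordinates
    h = np.zeros_like(x1)
    for x2, y2 in on_pixels:                           # sum replaces the double integral
        r = np.sqrt(z_l ** 2 + (x1 - x2) ** 2 + (y1 - y2) ** 2)
        h += np.sin(k * r)
    return (h - h.min()) / (h.max() - h.min())         # Eq. (9)

# Example: a 300x300-pixel hologram (8.5 um pitch) encoding two bright configuration
# bits spaced 34.5 um apart on the object plane.
coords = (np.arange(300) - 150) * 8.5e-6
pattern = hologram_intensity([(0.0, 0.0), (34.5e-6, 0.0)], coords, coords)
levels = np.round(pattern * 255)                       # gradation levels for the LC-SLM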
Finally, the normalized image H' is used for implementing the holographic memory. The other areas on the holographic plane are opaque to the illumination. The holographic memory patterns of the NAND and OR circuits were calculated using the method explained above. In this experiment, holographic memory patterns were generated to
Fig. 4. Holographic memory patterns of a NAND circuit: (a) conventional DORGA, (b) new method
Fig. 5. Holographic memory patterns of an OR circuit: (a) conventional DORGA, (b) new method
compare the inversion/noninversion dynamic optically reconfigurable gate array and conventional dynamic optically reconfigurable gate array (DORGA). The holographic memory patterns, including a NAND circuit, of the inversion/noninversion dynamic optically reconfigurable gate array and the conventional dynamic optically reconfigurable gate array (DORGA) are shown in Figs. 4(a) and 4(b). Additionally, the holographic memory patterns, including an OR circuit, of the inversion/noninversion dynamic optically reconfigurable gate array and the conventional dynamic optically reconfigurable gate array (DORGA), are presented in Figs. 5(a) and 5(b). The x-direction distance between bits and the y-direction distance between bits in a configuration context are the same as 34.5 μ m and 33.0 μ m of the photodiode-distances of a fabricated ORGA-VLSI. The pixel quantity in the holographic memory is 300 × 300. The intensity distribution of the holographic memory was normalized to 256 gradations, which is the same as that of an LC-SLM after the calculated floating point value. 4.2 Fabricated ORGA-VLSI A new 11,424-gate-count DORGA-VLSI chip was designed and fabricated using a 0.35 μ m standard CMOS process technology [13]. Voltages of the core and I/O cells were designed identically using 3.3 V. The acceptance surface size of the photodiode is 9.5 μ m × 8.8 μ m. The photodiodes were constructed between an N+ diffusion layer and a Psubstrate. Photodiode cells are arranged at 34.5 μ m horizontal intervals and at 33.0 μ m vertical intervals. This design incorporates 37,856 photodiodes. The average aperture ratio of the entire VLSI is 4.24%. In this design, considering the resolution of optical components and simplified justification of the positioning between a VLSI part and an optical part, photodiodes and their spacing were designed to be large. The top metal layer was used for guarding transistors from light irradiation; the other two layers were used for wiring. The gate array of the DORGA-VLSI uses an island style. In all, 336 optically reconfigurable logic blocks (ORLBs) including two 4-input 1-output LUTs, 360 optically reconfigurable switching matrices (ORSMs), and 8 optically reconfigurable I/O blocks (ORIOBs), which include 4 programmable I/O bits, were implemented. The ORLBs, ORSMs, and ORIOBs are programmable block-by-block, respectively, through 59, 49, and 49 optical connections. 4.3 Experimental System An ORGA optical system comprises laser sources, an optical holographic memory, and an ORGA-VLSI, as shown in Fig. 6. Reconfiguration contexts are stored in a holographic memory and are addressed using a laser array. However, in this experiment, to estimate the reconfiguration speed of an ORGA architecture simply and precisely, a simple single-laser ORGA holographic memory system was constructed using a liquid crystal spatial light modulator (LC-SLM) as a holographic memory and a 532 nm laser (torus 532; Laser Quantum) as a light source. The laser power was about 300 mW. The 1.7-mm-diameter beam from the laser source is expanded by five times to 8.5 mm using two lenses with a 50 mm focal length and a 250 mm focal length. The expanded beam is incident to a holographic memory on an LC-SLM. The LC-SLM is a projection TV panel (L3D07U-81G00; Seiko Epson Corp.). It is a 90◦ twisted nematic device with a
Fig. 6. Experimental system. Panel (a) portrays a block diagram of an entire DORGA holographic reconfiguration architecture. Panel (b) depicts a photograph of the DORGA holographic reconfiguration architecture. Panel (c) is an expanded photograph of the area around the DORGA-VLSI. Panel (d) shows the 11,424-gate-count ORGA-VLSI chip using a 0.35 μ m, 9.8 mm2 CMOS process chip, and its board including the chip.
thin film transistor. The panel consists of 1,920 × 1,080 pixels, each of which is 8.5 × 8.5 μ m2 . The LC-SLM is connected to an evaluation board (L3B07-E60A; Seiko Epson Corp.). The video input of the board is connected to the external display terminal of a personal computer. Programming for the LC-SLM is executed by displaying a holographic memory pattern with 256 gradation levels on the personal computer display. The DORGA-VLSI was placed 100 mm distant from the LC-SLM.
5 Experimental Results

Using this optical system, the reconfiguration-time improvement of the new inversion/noninversion dynamic optically reconfigurable gate array over the conventional dynamic optically reconfigurable gate array was measured. In this experiment, a NAND circuit and an OR circuit were implemented. The CCD-captured configuration context images of the NAND circuit and the OR circuit are portrayed in Figs. 7 and 8; the figures show the conventional dynamic optically reconfigurable gate array's configuration context and the inversion/noninversion dynamic optically reconfigurable gate array's configuration context, respectively. The contrast of both types of context was very good. The optical contexts were programmed onto the conventional dynamic optically reconfigurable
Fig. 7. CCD-captured configuration context images of a NAND circuit: (a) the conventional dynamic optically reconfigurable gate array's context, (b) the inversion/noninversion dynamic optically reconfigurable gate array's context
Fig. 8. CCD-captured configuration context images of an OR circuit: (a) the conventional dynamic optically reconfigurable gate array's context, (b) the inversion/noninversion dynamic optically reconfigurable gate array's context
gate array VLSI and the new inversion/noninversion dynamic optically reconfigurable gate array VLSI. The reconfiguration time and retention time of the NAND circuit on the conventional dynamic optically reconfigurable gate array were measured as 34 μs and 280 μs, respectively, and those of the OR circuit on the conventional dynamic optically reconfigurable gate array as 67 μs and 662 μs. On the new inversion/noninversion dynamic optically reconfigurable gate array, the reconfiguration time of the NAND circuit was improved to 18 μs, with a sufficient retention time of 243 μs; in this case the optical configuration power consumption was estimated to be reduced by 48%. Likewise, the reconfiguration time of the OR circuit on the new inversion/noninversion dynamic optically reconfigurable gate array was improved to 62 μs, with a sufficient retention time of 200 μs, and the optical configuration power consumption was estimated to be reduced by 7%. These results confirm that the inversion/non-inversion implementation for the fabricated 11,424 gate-count dynamic optically reconfigurable gate array VLSI supports a reconfiguration frequency about 1.5 times higher than that provided by a conventional dynamic optically reconfigurable gate array.
6 Conclusion This paper has presented an inversion/non-inversion implementation for a fabricated 11,424 gate-count dynamic optically reconfigurable gate array VLSI. Its performance demonstrated that the inversion/non-inversion implementation for a fabricated 11,424 gate-count dynamic optically reconfigurable gate array VLSI can accelerate its reconfiguration frequency to 1.5 times faster than that of a conventional dynamic optically
reconfigurable gate array. Additionally, a 28% reduction of the optical configuration power consumption was confirmed. This architecture will therefore be very useful as a next-generation ORGA architecture in terms of gate density, reconfiguration frequency, and power consumption.
Acknowledgments This research was supported by the Ministry of Education, Science, Sports and Culture, Grant-in-Aid for Scientific Research on Innovative Areas, No. 20200027, 2009. The VLSI chip in this study was fabricated in the chip fabrication program of VLSI Design and Education Center (VDEC), the University of Tokyo in collaboration with Rohm Co. Ltd. and Toppan Printing Co. Ltd.
References

1. Altera Corporation, Altera Devices, http://www.altera.com
2. Xilinx Inc., Xilinx Product Data Sheets, http://www.xilinx.com
3. Lattice Semiconductor Corporation, LatticeECP and EC Family Data Sheet (2005), http://www.latticesemi.co.jp/products
4. http://www.ipflex.co.jp
5. Nakano, H., Shindo, T., Kazami, T., Motomura, M.: Development of dynamically reconfigurable processor LSI. NEC Tech. J. (Japan) 56(4), 99–102 (2003)
6. DeHon, A.: Dynamically Programmable Gate Arrays: A Step Toward Increased Computational Density. In: Fourth Canadian Workshop on Field Programmable Devices, pp. 47–54 (1996)
7. Jones, D., Lewis, D.M.: A time-multiplexed FPGA architecture for logic emulation. In: Custom Integrated Circuits Conference, pp. 495–498 (1995)
8. Mumbru, J., Panotopoulos, G., Psaltis, D., An, X., Mok, F., Ay, S., Barna, S., Fossum, E.: Optically Programmable Gate Array. In: SPIE of Optics in Computing 2000, vol. 4089, pp. 763–771 (2000)
9. Mumbru, J., Zhou, G., Ay, S., An, X., Panotopoulos, G., Mok, F., Psaltis, D.: Optically Reconfigurable Processors. In: SPIE Critical Review 1999 Euro-American Workshop on Optoelectronic Information Processing, vol. 74, pp. 265–288 (1999)
10. Miyano, M., Watanabe, M., Kobayashi, F.: Optically Differential Reconfigurable Gate Array. Electronics and Computers in Japan, Part II 90(11), 132–139 (2007)
11. Watanabe, M., Shiki, T., Kobayashi, F.: Scaling prospect of optically differential reconfigurable gate array VLSIs. Analog Integrated Circuits and Signal Processing (2008)
12. Seto, D., Watanabe, M.: A dynamic optically reconfigurable gate array - perfect emulation. IEEE Journal of Quantum Electronics 44(5), 493–500 (2008)
13. Watanabe, M.: A 11,424 gate-count zero-overhead dynamic optically reconfigurable gate array VLSI. In: IEEE International SOC Conference, September 2007, pp. 75–78 (2007)
14. Watanabe, M., Nakajima, M., Kato, S.: An inversion/non-inversion dynamic optically reconfigurable gate array VLSI. World Scientific and Engineering Academy and Society Transactions on Circuits and Systems 8(1), 11–20 (2009)
15. Kato, S., Watanabe, M.: Inversion/non-inversion zero-overhead dynamic optically reconfigurable gate array VLSI. In: IEEE International Conference on Field-Programmable Technology, pp. 377–380 (2008)
Visualization of Computer Architecture Simulation Data for System-Level Design Space Exploration Toktam Taghavi, Mark Thompson, and Andy D. Pimentel Computer Systems Architecture group Informatics Institute, University of Amsterdam, The Netherlands {T.TaghaviRazaviZadeh,M.Thompson,A.D.Pimentel}@uva.nl
Abstract. System-level computer architecture simulations create large volumes of simulation data to explore alternative architectural solutions. Interpreting and drawing conclusions from this amount of simulation results can be extremely cumbersome. In other domains that also struggle with interpreting large volumes of data, such as scientific computing, data visualization is an invaluable tool. Such visualization is often domain specific and has not become widely studied and utilized for evaluating the results of computer architecture simulations. In this paper, we describe an interactive visual tool for exploring and analyzing alternative architectural solutions at multiple levels of abstraction. As a proof of concept, we have used this tool to create a coordinated, multiple-view visualization for our computer architecture simulation and exploration environment, called Sesame, which aims at system-level performance analysis and design space exploration of multi-core embedded systems. Our results show that our multivariate visualization support can help designers to more easily understand the reasons behind the differences in performance of different design choices, and thus gain more insight in the performance landscape of the design space. Keywords: Computer architecture simulation, design space exploration, exploratory visualization, linked views, multiple views, coordination.
1 Introduction

Rapid advances in chip technology and the ever-increasing demands of computer applications have resulted in unprecedented complexity in embedded computer system design, often reflected by the different forms of concurrency exploited in the system architectures. This trend shows no signs of abating and is forcing designers to start modeling and simulating architectural components and their interactions at the very early design stages. Design Space Exploration (DSE), in which alternative architectural solutions are assessed, plays a crucial role in this system-level design. It is imperative to have good exploration methods, techniques and tools in the early design stages, where the design space is at its largest and where a wrong design decision can make the difference between the success or failure of the final product. System-level simulation frameworks that aim for early design space exploration create large volumes of simulation data in exploring alternative architectural solutions. Interpreting and drawing conclusions from these copious simulation results can
be extremely cumbersome. In other domains that also struggle with interpreting large volumes of data, such as scientific computing, data visualization has become an invaluable tool to facilitate the data analysis. Such visualization is often domain specific and has not become widely used in evaluating the results of computer architecture simulations. Especially in the domain of system-level DSE of embedded systems, surprisingly little research has been undertaken in the interactive visualization to support and guide the process of DSE. In this paper, we describe an interactive visual tool, based on the Improvise [14, 15] framework, for exploring and analyzing alternative architectural solutions at multiple levels of abstraction. We have used this tool to create a coordinated, multiple-view visualization for our computer architecture simulation and exploration environment, called Sesame, which aims at system-level performance analysis and DSE of multi-core embedded systems. However, our visualization tool is not limited to Sesame and may be used for other DSE environments. Multiple views enable users to look at different aspects of their data using different types of visualizations. The various representations enable users to interpret information from different perspectives, thus gaining additional insight into the underlying information. Coordination between the views keeps them synchronized during interaction, which enables users to relate information between the views. Such visualization support can significantly help designers to understand the reasons behind the differences in performance of different designs, and thus gain more insight in the performance landscape of the design space. The remainder of this paper is organized as follows. Section 2 describes related work. In Section 3, we provide a short introduction on information visualization, and briefly describe the Improvise framework, which has been used to construct our visualization tool. Section 4 gives an outline of the Sesame simulation environment for which we have developed the multiple-view visualization tool. In Section 5, we elaborate on the various multivariate views that are provided by our visualization tool, followed by an evaluation in Section 6 illustrating how these different views improve the analysis of DSE data produced by Sesame. Finally, Section 7 concludes the paper.
2 Related Work The use of multiple views in information visualization is very common and useful. This technique has been applied in a variety of domains. Examples are: Navigational View Builder [1] for web site visualization, FilmFinder [2] and Cinegraph [3] for exploring and analyzing film databases, SeeDiff [4] for analyzing changes in source code files, and visualizing and exploring census data [7, 8]. However, in the field of computer architecture simulations, and especially those aimed at system-level DSE, little research has been undertaken on (interactive) information visualization. Most of the visualization work in this area focuses on educational purposes (e.g., [11, 12, 13]), or only provides some basic support for the visualization of simulation results in the form of 2D (and sometimes 3D) graphs. The work of [16,17] provides advanced and generic visualization support, but tries to do so for a wide range of computer system related information which may not necessarily be applicable to computer architecture simulations and in particular to DSE, with its own domain-specific requirements. Vista [18] aims at visualization support
for computer architecture simulations, but it does not target system-level simulations, which may have a serious impact on the scalability requirements of the visualization, nor does it address the needs for visualization from the perspective of DSE. The ARL Trade Space Visualizer [19] is an engineering decision-making tool for complex engineering systems that allows exploring a multi-dimensional trade-off space in a visually intuitive manner. However, it is not applicable to computer architecture simulations (e.g., it only supports numerical data types).
3 Information Visualization and Improvise Visualization is all about representing information in a visual form to help a viewer to efficiently and effectively explore, analyze and explain complex information. A key challenge in visualization is designing suitable visual metaphors that enable users to better understand the underlying information and to convey the information that the user is looking for. Information Visualization (InfoVis) [5] focuses on techniques of visualization that deal with abstract data sets, that is, data without “natural” physical or geometric representation such as hierarchical or textual information. Therefore, the user has no predetermined mental model about it. Information visualization systems have two main components: representation and interaction. The representation component involves the way that data is mapped to the visual form. Subsequently, the interaction component allows a user to directly manipulate the representation and to explore the data set to discover additional insights. Examples of interaction techniques are: select, filter, zoom, rotate and scroll.Visual data exploration usually follows a three step process: 1) overview 2) zoom and filter, and 3) details on demand (which has been called the Information Seeking Mantra [6]). First, the user needs to get an overview of the data. The overview represents the whole data entity and provides a general context for understanding the data set. In the overview, the user identifies an interesting, significant or unusual subset of data and focuses on that. Evidently, the actual analysis of the selected subset that follows depends entirely on the nature of this data subset. Improvise [14, 15] is an exploratory information visualization environment which we have used to visualize the data generated by our Sesame simulation and DSE framework. Improvise is an end-user application for building visualizations of structured information such as tabular data. It enables users to load data, create views, specify visual abstractions (i.e., how specific data attributes are mapped into graphical attributes in views), and set up coordination between the views. Views can be coordinated in a variety of ways such that interacting with one view causes visual and meaningful effects in appearance or behaviour of other views. To this end, Improvise enables users to define complex interactive dependencies between the views. These dependencies keep views synchronized during interaction which enable users to relate information between the views. Furthermore, it is scalable in the amount of data to be visualized, number of views and number of coordinates. Improvise is open source software written in Java. Moreover, its visualizations are saved and loaded as regular XML documents in a platform-independent format. Therefore the results can be shared and disseminated easily.
4 Sesame

Sesame is a modeling and simulation framework geared towards efficient performance evaluation of embedded Multi-Processor System-on-Chip (MPSoC) platforms in the multi-media domain [9, 10]. Models in Sesame are defined at a high level of abstraction and capture only the most important characteristics of the components in the system. By omitting detailed component properties, the simulation of an entire system can be much faster than with traditional simulation approaches. This allows for the (performance) assessment of a large number of design options.

A key element in Sesame is the recognition of separate application models and architecture models. The application model is an actual application program expressed as a process network consisting of communicating concurrent tasks. The architecture model represents the hardware components in the system, such as processors, memories and interconnection networks, and captures their performance constraints. The application and architecture models are subsequently co-simulated, using a trace-driven mechanism, to assess the performance of a certain mapping of a (concurrent) application onto the underlying (parallel) architecture. To this end, the application model generates event traces which drive the architecture model. These events are an abstract workload representation that only captures the most important behavioral actions of an application, such as reading from and writing to other processes or the execution of a significant unit of computation. The architecture model simulates the execution of every event and associates timing latencies with it where applicable. A global system clock monitors the progress through time of the system model as a whole.

As mentioned, there is an explicit mapping of application tasks (i.e., processes) onto architecture components (typically processors). If multiple tasks are mapped onto the same processor, a scheduler decides on the order in which the processor model component processes the events from each of the tasks. The mapping as well as the scheduling policy, which are important factors in the overall system performance, are specified as parameters of the simulation.

During execution of the system model, the simulation runtime system collects various performance statistics which are useful for evaluating the system. The most important statistic is perhaps the "Elapsed Time", which describes the total execution time of the system in terms of the number of simulated processor cycles. However, more detailed statistics on a per-component basis are also available, such as, for example, the utilization of each component. Utilization is defined as the percentage of the Elapsed Time that a component was busy processing events or interacting with other components. Such statistics could, e.g., reveal possible bottlenecks in the system, and many other statistics can help the designer to evaluate other performance properties of the system.

Sesame models are highly parameterized, so that a single model can be used for evaluating a large number of different configurations of a system, called design instances. One important parameter is the mapping specification mentioned before. Other parameters may describe properties of the architecture, e.g., the number and type of processors, scheduling strategies, network types, memory sizes, or the processing/communication speed of different components.
The combination of all possible parameters forms the design space that needs to be explored. Combinatorial
explosion of the parameters can easily make the design space very large. Therefore, iterative simulation of each instance will result in a huge amount of statistical data that need to be evaluated by the designer.
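Purely as an illustration of how such a trace-driven co-simulation produces per-instance statistics (this is not Sesame code; the event types, latency numbers, task names and mapping format are invented for the example), a single design instance could be evaluated along the following lines.

from collections import defaultdict

# Hypothetical per-event latencies (in cycles) for two processor types.
LATENCY = {
    "MicroBlaze": {"read": 10, "write": 10, "execute": 40},
    "PowerPC":    {"read": 12, "write": 12, "execute": 55},
}

def cosimulate(trace, mapping, proc_types):
    """trace: (task, event) pairs emitted by the application model.
    mapping: task -> processor instance; proc_types: processor instance -> type.
    Returns the elapsed time and per-processor utilization, ignoring contention,
    communication delays and scheduling policy for the sake of brevity."""
    busy = defaultdict(int)
    for task, event in trace:
        proc = mapping[task]
        busy[proc] += LATENCY[proc_types[proc]][event]
    elapsed = max(busy.values())     # processors work through their queues in parallel
    return elapsed, {p: t / elapsed for p, t in busy.items()}

# One design instance: two tasks of a toy application mapped onto two processors.
trace = [("taskA", "read"), ("taskA", "execute"), ("taskA", "write"),
         ("taskB", "read"), ("taskB", "execute"), ("taskB", "write")]
elapsed, util = cosimulate(trace,
                           mapping={"taskA": "proc0", "taskB": "proc1"},
                           proc_types={"proc0": "MicroBlaze", "proc1": "PowerPC"})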
5 Visualization of Computer Architecture Simulation Data

In this section we explain how the large collection of statistical data generated by Sesame is transformed into a visual form. Note that these transformation techniques are not limited to Sesame and may be used for other DSE environments. Multiple Coordinated Views is a specific exploratory visualization technique that enables users to explore their data [5]. Displaying the data in multiple ways enables users to understand the information through different perspectives, gaining additional insight and better understanding of the underlying information, overcoming possible misinterpretation and finding interdependencies between data. We have developed the following views that show the simulation data from different perspectives and which are coordinated with each other:

• Overview+Detail views: selecting an item in the "overview" navigates the "detail view" to the corresponding details. Items are represented visually smaller in the overview, which provides context and allows direct access to details; the detail view is a zoomed-in view of the overview. In these views, all design instances are shown in one scatter plot, sorted by elapsed time, so the best instances are on the left side.
• Table view: some attributes of all design instances are shown in a tabular view and can be sorted in either ascending or descending order.
• Latency view: the read and/or write communication latencies of instances selected in the overview are shown.
• Matrix view: a multivariate view to compare selected instances in terms of a large variety of characteristics, such as task mapping, number and type of processors, scheduling policy, etc.
• Task view: shows the application task mapping and task execution times of an instance selected from the matrix view.
• Method view: shows a break-down of the time spent in each method call (read, write, execution, etc.) for each processor component in the architecture model of an instance selected from the matrix view.
In Fig. 1, an example screenshot of our visualization is shown. A demo and color versions of the pictures are available on the homepage of the primary author. The remainder of this section will discuss each of the above views in more detail. For more details about the experiment that generated the data in this figure, see section 6. 5.1 Overview+Detail Views Usually, the amount of data to be displayed is too large to fit on the screen completely. It is also useful to be able to zoom in on certain parts of the data. In such cases, one wants to focus on certain data, without losing track of the position in the
Fig. 1. Screenshot of the coordinated multiple-view visualization
whole data set. Therefore, two separated views are used. One is the overview that provides a global map of all data and the other is the detail view that provides a zoomed-in-view for detailed information about a small portion of the data. Users can select items from the overview to navigate to corresponding detailed information in the detail view and vice versa, navigating in the detail view indicates the corresponding selection in the overview. Overview and detail diagrams (Fig. 1-A and 1-B) are scatter plots in which the xaxis shows the design instance number and the y-axis shows the elapsed time. Instances are sorted by elapsed time. As a result, the best instances (instances with minimum elapsed time) are on the left side of the diagram. For each instance, there is a nested bar plot that shows the load balance for each processor in the design instance. It uses one bar per processor in which the color of the bar shows the processor type and the scheduling policy of that processor. This is called color coding in literature and in this case each color indicates one property and the color shade shows another property. For example, in our diagrams we use color to identify the processor type and the color shade to identify the scheduling type. If a user selects one or more instances in the overview diagram, the corresponding instances in the detail view will be highlighted, and vice versa, selecting some instances in the detail view highlights corresponding instances in the overview. By selecting some instances in either overview or detail view, the matrix and latency views will also be filled with those instances’ information. The overview and detail diagrams are mainly used to recognize general performance trends, such as finding the best and worst design instances, and retrieving some high-level information about them (e.g., about the number of used processors, processor types, scheduling policy, and processor utilization). 5.2 Table View In the table view (Fig. 1-C), all instances are displayed in rows with columns that contain various attributes of the instances, like instance number, number of processors, type of processors, mapping of application tasks onto processors, and elapsed time. The user can sort the table in both ascending and descending order on any of its attributes. In addition, the table view supports the sorting on multiple columns, allowing a user to sort design points by multiple attribute values. In multiple sort, a number will be written in the column header that indicates the order of that column in the multiple sort. Selecting one or more rows in the table view will highlight the corresponding instances in the overview and detail views. Also, the matrix and latency views will be loaded with detailed information about those instances. The table view is useful for finding and selecting some specific instances. For example, if a user wants to select design instances that contain two MicroBlaze processors and one PowerPC processor (indicated by the string “PC MB MB” in Sesame) and sort them by elapsed time, then the table is first sorted by architecture string and then by elapsed time in descending order. (Fig. 1-C).
5.3 Latency View

The latency view (Fig. 1-D) is another scatter plot in which the x-axis represents the instances selected from the overview diagram and the y-axis shows the application tasks. For any instance selected in the overview, the latency view shows the amount of time each task is waiting for read and/or write communications. Here, again, color coding is used to represent the latency: yellow to red for read latency and green to blue for write latency. For each instance and task in the latency view, two rectangles filled with these colors are drawn, representing the read and write latency. Moreover, the user can choose to see only read latencies, only write latencies, or both.

5.4 Matrix View

The matrix view (Fig. 1-E) shows more detailed information about selected instances and can be used to compare them. The columns of the matrix represent the application tasks while the rows are the selected instances. The instances in the matrix view are sorted by elapsed time, so the instances with lower elapsed times are at the top. Each cell is filled with a color matching the type of the processor and the scheduling policy; the color coding is the same as in the overview diagram. The small rectangles drawn at the bottom right of each cell identify the tasks that have been executed on the same processor: such tasks have the same color in this small rectangle. In addition, the number written at the upper left of a cell shows the percentage of time that the corresponding processor was busy executing the task. The value at the bottom left of the cell denotes the execution time of this task relative to the execution of the same task in all other instances in the experiment. The value "min" means that no other instance executes the task faster, and "max" means that no other instance executes the task slower. Note that the performance of an instance for a particular task can also lie between "min" and "max": we then provide a normalized number between 0 and 1 with respect to the "min" and "max" values. If the user selects one instance in the matrix view, more details about the task mapping and methods are shown in the task and method views. Furthermore, the selected instance is highlighted in this diagram as well as in the overview and latency diagrams.

5.5 Task View

This view (Fig. 1-F) shows the application task mapping of an instance selected from the matrix view. For each processor of the selected instance, a stacked bar chart is drawn. Each stack shows the tasks executed on the processor using different colors, and the stack height shows the percentage of time that the processor was executing the task. The idle time percentage is shown at the top of the stacked bar in light yellow. The stacks are sorted by their heights (the task with the maximum time usage percentage is at the bottom of the chart).
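Returning briefly to the min/max annotation of the matrix view: the exact normalization is not spelled out above, but a plausible reading is plain linear scaling between the fastest ("min") and slowest ("max") execution of the same task across all instances, as in this small illustrative sketch (hypothetical names; Python used only for illustration).

# Illustrative sketch of one plausible normalization for the matrix view:
# map a task's execution time onto [0, 1] relative to the fastest ("min")
# and slowest ("max") execution of that task over all instances.
def normalize(task_time, t_min, t_max):
    if t_max == t_min:              # task performs identically everywhere
        return 0.0
    return (task_time - t_min) / (t_max - t_min)

print(normalize(2.0, 2.0, 6.0))     # 0.0 -> would be annotated "min"
print(normalize(6.0, 2.0, 6.0))     # 1.0 -> would be annotated "max"
print(normalize(4.0, 2.0, 6.0))     # 0.5 -> in-between value shown in the cell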
5.6 Method View

This view (Fig. 1-G) shows a breakdown of method call statistics for the processor components in the architecture model of a selected design instance. Essentially, it shows the intensity of reading data, writing data, and executing for the different processors in the selected design instance.
6 Evaluation

By default, Sesame outputs various simulation statistics in human-readable text files. For a DSE experiment consisting of many simulations, the designer would either have to go through the files manually, or write an evaluation program to extract particular relevant information from the files and represent it in a useful way (e.g., average numbers, distributions, or graphics). The former is time-consuming, error-prone, and, most importantly, overwhelms the designer with detailed statistics. The latter is time-consuming as well, since evaluation programs may not be easily reusable between experiments. Moreover, the designer may not know exactly what needs to be evaluated before an initial survey of the results has been performed. In this section, we show a number of example observations made using our visualization tool that could not be made so easily before. The experiment explores differently configured instances of a multi-processor system-on-chip (MPSoC) running a parallelized video encoder application. Instances consist of 1, 2, or 3 processors of one of two types: MicroBlaze (MB) or PowerPC (PC). Furthermore, we consider every possible mapping of tasks onto processors.

Using the overview diagram, we find that the best instances (Fig. 2, on the left) are mostly colored red (MB), while in the worst instances most of the processors are blue (PC). Overall, we can conclude that better instances contain more MB processors. This result was to be expected, since we knew beforehand that for our application the MB processor is faster for almost all application tasks. So a very basic design question is whether we should include a PowerPC (PC) in the design at all. When the designer zooms in on the left-hand side of the detail view, it becomes clear that the optimal instances actually have one PC. However, from this view it is also clear that the performance difference with the best 3-MB system is quite small. Now the designer can make a trade-off between system performance and other design criteria. For example, the cost of IP licenses for a heterogeneous system may not be worth the small performance gain. Similarly, system flexibility, die size, or reliability may influence the final design decision.

Now, let us look at the influence of the task mappings on performance. Using the selection mechanism of the visualization tool (table view), we selected all instances that use the same mapping as the optimal instance (instance 297), but have a different underlying architecture. In Fig. 3-A, we can see that these instances are all grouped relatively close to the optimum. Further investigation of other mappings close to the optimum shows a similar result. However, when we select some of the worst mappings, we see that identical mappings are more evenly distributed among the 36th–100th percentile (Fig. 3-B). From this, we can conclude that a good mapping is less dependent on the underlying architecture than a bad one. Using the visualization tool, this trend could be discovered in mere minutes; without it, it could have gone unnoticed by the designer or would have been found out much later.
Fig. 2. Overview and detail views showing best design instances
Fig. 3-A. Best mapping subset
Fig. 3-B. Worst mapping subset
Conversely, we can also look at the different mappings for a certain architecture (data not shown here). From this, we learn that a 2- or 3-processor instance with only MBs always performs within the 0–48th percentile, implying that even the worst mapping is still better than 52% of all instances. Further, multi-processor instances with a heterogeneous architecture (using both MB and PC) span the entire performance range (from poor to good, and everything in between). The 2- and 3-processor instances with only PowerPCs yield a performance that always falls in the 14th–100th percentile, which means that even the best mapping for those architectures is worse than 14% of all instances. This corroborates our previous findings and, moreover, is interesting for a designer looking for a system that can flexibly deal with different task mappings: an all-MB system will give reasonable (but not optimal) performance. The visualization can also be used to understand some exceptional conditions. In Fig. 4 we show various instances where task 2 is mapped onto a separate processor. From the latency view we can easily see that in these mappings task 2 has a higher write latency compared to other mappings. This is because task 2 never reads but only generates data, and the subsequent task 3 cannot read the data fast enough. Therefore, the rate of data production is faster than that of consumption, and the write latency is increased.
Fig. 4. Write latencies for a selection of instances
Fig. 1-D shows a selection of instances with increasingly high read latencies for tasks 3, 4, 5, and 6. Furthermore, from the min/max annotation in the matrix view (Fig. 1-E) we see that for the worst instances most tasks execute on a processor that is not the optimal processor type for that task. These observations are indications of bottlenecks in the system, but a detailed explanation is outside the scope of this paper. Each of the above observations could be made much more easily and quickly using the visualization than with our previous analysis techniques. Moreover, the visualization invites the designer to look at data that would otherwise have been ignored.
7 Conclusion

System-level (embedded) computer architecture simulations create large volumes of simulation data when exploring alternative design solutions. Interpreting and drawing conclusions from this amount of simulation results can be extremely hard and time-consuming. So far, little research has been undertaken on applying techniques from the field of information visualization to facilitate such analysis. In this paper, we presented a multiple-coordinated visualization tool and used it to explore simulation results from our Sesame simulation and design space exploration framework. The overall premise for this visualization tool is that users understand their data better if they interact with the presented information and view it through different representations. Our visualization tool therefore allows a designer to 1) get a quick and clear overview of the performance of a large number of evaluated design points, 2) select a set of design instances for further investigation, 3) compare the selected design points in terms of different characteristics and metrics using multivariate visualization, 4) look at the simulation results from different levels of abstraction, and 5) find relationships between design parameters and their effects on performance. In the future, we will perform a case study with more complicated application and architecture models. This way, we can further test and improve the capabilities of the visualization for even more intricate exploration case studies.
Modeling Scalable SIMD DSPs in LISA
Peter Westermann and Hartmut Schröder
Technische Universität Dortmund, CAS Lab, Otto-Hahn-Str. 4, 44221 Dortmund, Germany
{peter.westermann,hartmut.schroeder}@tu-dortmund.de
Abstract. Single instruction multiple data (SIMD) processing is an important technique for achieving high performance in applications with innate data level parallelism, such as applications from the Software Defined Radio (SDR) domain. This paper investigates using the LISA 2.0 language to facilitate the development of scalable SIMD digital signal processors (DSPs). Our work shows that limitations in LISA hinder the development of SIMD data paths; therefore, extensions to LISA that enable the generation of a wide SIMD data path from a single scalar processing element have been introduced. Furthermore, generators for SIMD permutation networks with arbitrary SIMD widths have been implemented. The presented solution simplifies the development of scalable SIMD DSPs in LISA considerably. Keywords: LISA, SIMD DSPs, Scalable Processor Models.
1 Introduction
In recent years, different research groups demonstrated that single instruction multiple data (SIMD) processors are well suited for applications from the software defined radio (SDR) domain [1]. Algorithms for W-CDMA and OFDMA based systems have been efficiently implemented on SIMD-based digital signal processors (DSPs) [2,3,4]. However, next generation SDR technology requires significantly greater computational performance [5]; the increased demands may require new SIMD processors with increased SIMD widths. In this context, we investigated using the LISA 2.0 language [6] for modeling scalable SIMD vector DSPs. LISA is an acronym for "language for instruction set architecture"; it is the key component of the CoWare Processor Designer toolkit [7]. A processor model in LISA can be utilized for automatically generating an instruction set simulator (ISS), synthesizable register transfer level (RTL) code, and software development tools from one common description. Some work on modeling instruction set extensions with LISA has been done [8,9]; however, the development of SIMD DSPs with a scalable SIMD width in LISA has not yet been investigated. Seidel et al. [10] developed an instruction set simulator for a SIMD DSP using LISA; yet, the SIMD DSP is modeled in GenCore and only exported as LISA code. This paper contributes to the modeling of scalable SIMD DSPs in LISA as follows:
• Requirements for modeling SIMD architectures are defined and the LISA language is analyzed based on these requirements (Section 2). The analysis shows that limitations in the LISA language prevent the modeling of data paths for scalable SIMD DSPs.
• We introduce extensions for the LISA language that resolve the problem of modeling data paths for SIMD DSPs (Section 3). The extensions have been implemented in the GNU M4 [11] macro language and realized as a preprocessing step; hence, access to the source code for LISA is not necessary. Using the extensions, a complete SIMD data path may be generated from the specification of a single scalar data path, which contains the functional units and local registers.
• As SIMD permutation networks/permutation units are an important component of any SIMD processor, generators for different permutation networks with regular topologies have been implemented (Section 4). Using this approach, a selected permutation network may be automatically adjusted for varying SIMD widths.
2 Modeling SIMD Data Paths with LISA
2.1 Structure of SIMD DSPs
A SIMD DSP consists of at least a control unit and a SIMD data path (an additional scalar data path is of no interest to the investigation of scalable SIMD models). The control unit fetches and decodes instructions and triggers the execution on the SIMD data path. The SIMD data path consists of NSIMD identical scalar data paths, where NSIMD denotes the SIMD width. Each of these scalar data paths is a slave to the control unit and contains arithmetic units and local register files. Hence, from a modeling perspective, a description of the following three aspects is required to model a SIMD data path for a given NSIMD:

• The behavior and inner structure of the control unit must be defined.
• Processing units and registers in a single scalar data path need to be modeled; NSIMD copies of the data path for the complete SIMD data path should be generated automatically.
• Control signals from the control unit to the SIMD data path need to be modeled.

A language for describing SIMD data paths should thus allow multiple data paths to be created automatically from a single template and also define means to transfer control signals to all data paths simultaneously. In the following, LISA language elements that may be used to solve part of this task are investigated.
2.2 Overview of LISA Language Elements
LISA models consist of resource declarations and operations. Resources define storage elements such as registers, memories, and pipelines. Operations describe the behavior, structure, and instruction set of the processor architecture. The attributes of operations are defined in several sections:

• The DECLARE section contains local declarations, e.g., references to child operations. All operations that are referenced/used inside an operation have to be declared first.
• The CODING and SYNTAX sections define the binary coding and the assembly syntax of instructions, respectively.
• The BEHAVIOR and EXPRESSION sections describe the behavioral model of an operation as C code. Other operations may be executed inside a BEHAVIOR section in a mechanism similar to function inlining.
• The ACTIVATION section allows activating further operations from the current operation. Activations allow modeling resource sharing and the execution of operations across pipeline stages.

Operations are organized in a hierarchy by assigning operations to pipeline stages and building a chain of activations. The example in Fig. 1 shows a simple four-stage pipeline with pipeline stages FE, DC, EX, and WB. Here, the add and sub operations share the common ALU resource; all operations share a common writeback operation.
Fig. 1. An example of an operation hierarchy with a four-stage pipeline. Grey boxes describe pipeline stages. Arrows describe the activation of operations.
2.3 Modeling Multiple Data Paths Using Templates
LISA contains language elements to model multiple resources or operations that share a common identifier – and, in the case of operations, a common behavior. These language elements are called template resources and template operations. Template operations and template resources are defined using angle brackets; two examples are given below:

REGISTER uint16 opnd<1..16>;
OPERATION alu IN pipe.EX { .. }
Here, the index parameter may be used inside the operation to index further template operations – building an operation hierarchy – or to access template resources. Template resources may only be indexed by constants or by the index parameters of template operations. Similar or identical data paths may be defined by template operations; template resources make it possible to define resources that are local to these data paths. These two language elements can be used to model very long instruction word (VLIW) architectures with multiple processing units of the same type (this was the original purpose for introducing these language elements) and can also be used to model data paths in a SIMD architecture. However, modeling a SIMD data path is still problematic: LISA does not allow activating multiple instances of the same template operation, and scaling has to be done manually. The following example shows the activation of a template operation for an ALU data path from a control operation. The activated child operation first needs to be declared – with a constant index – and may then be used in other sections of the operation. However, the index in angle brackets is only used in the DECLARE section; in the other sections, the alu operation is used without an index. Hence, only one instance of a template operation may be declared and used, as different indices cannot be distinguished in the other sections of the operation.

OPERATION alu_control IN pipe.EX {
  DECLARE { INSTANCE alu<1>; }
  ...
  ACTIVATION { alu }
}

VLIW architectures with multiple identical data paths can be modeled using operation GROUP declarations that are coupled with the binary coding of the different operations. However, SIMD data paths cannot be modeled in this manner, as the data paths share a binary coding.
3 SIMD Extensions for LISA Models
The analysis in Sect. 2 has shown that LISA does not allow activating multiple copies of a template operation in parallel. Hence, while SIMD data paths may be described in LISA, they cannot be utilized.
Fig. 2. Activation of SIMD data paths in LISA using wrapper operations for template operations
Below, a workaround for this problem is presented first. Afterwards, extensions to LISA that enable modeling scalable SIMD data paths in LISA are introduced.
3.1 Activating SIMD Data Paths in LISA
The problem of SIMD activations can be solved by introducing an additional layer of operations in the LISA model. For each index element of a template operation, a wrapper operation, which activates the template operation, should be introduced. An example activation chain for this approach is depicted in Fig. 2. Each wrapper operation is assigned a unique name based on the name of the template resource and an integer number. A wrapper operation only contains an INSTANCE declaration of the child operation and the activation of the child operation. As the wrapper operations have unique identifiers, it is now possible to activate all lanes of a SIMD data path by activating the wrapper operations. While this approach solves the problem of activating multiple copies of a template operation (representing the lanes of a SIMD data path), it is not sufficient for a scalable model of a SIMD processor, as the wrapper operations have to be introduced and activated manually, which is also error-prone and not user-friendly. Therefore, an automated approach using macros for generating scalable LISA code has been adopted.
3.2 Macros for Generating Scalable SIMD Operations
The GNU M4 macro processor [11] has been utilized to extend LISA to support scalable SIMD models (the LISA tool set uses a platform-dependent C preprocessor that is not sufficient for generating scalable SIMD models; therefore, an external tool is necessary). Amongst other features, M4 supports text replacement, string manipulation, and conditional evaluation (e.g., for generating multiple copies of a statement in a loop).
Table 1. Overview of macros for SIMD operations and resource usage

SIMD_WIDTH: Global definition of the width of the SIMD data path. In each macro, SIMD_WIDTH may be locally reset by appending an additional argument N=value to the macro call.
SIMD_OPERATION(op): Defines wrapper functions for template operation op as in Fig. 2. An optional parameter assigns the wrapper operations to a pipeline stage.
SIMD_INSTANCE(op): Generates instances of the wrapper functions for op in the DECLARE section of an operation.
SIMD_ACTIVATION(op): Generates activations for the wrapper functions for op in the ACTIVATION section of an operation.
SIMD_ASSIGNMENT(op1,op2): Assigns a scalar value or elements of a template resource op2 to the elements of template resource op1.
SIMD_SET_ELEM(op1,op2,width): Initializes the elements of template resource op1 by copying SIMD_WIDTH blocks of width bits from resource op2.
SIMD_GET_ELEM(op1,op2,width): Inverse operation for SIMD_SET_ELEM. Copies elements of op2 into SIMD_WIDTH blocks of width bits in resource op1.
Macros for automating the activation of template operations representing SIMD data paths and for accessing template resources that model local resources in SIMD data paths have been implemented. An overview of important macros is given in Table 1. In the concrete implementation, adjustable parameters (e.g., SIMD_WIDTH) are defined in a parameter file. Each LISA project furthermore requires a makefile that applies the macros to all LISA files in the project directory and generates output in a subdirectory. The output files may then be processed by the standard LISA tool set. The following example shows the usage of these macros for activating a SIMD data path, which is defined by template operations. Processing the example by expanding the macros results in an activation chain as in Fig. 2 (with SIMD_WIDTH = 4).

OPERATION alu { ... }
SIMD_OPERATION(alu)

OPERATION alu_control {
  DECLARE { SIMD_INSTANCE(alu) }
  ...
  ACTIVATION { SIMD_ACTIVATION(alu) }
}
Here, wrapper operations for alu are generated first, using the SIMD_OPERATION macro. Inside the parent operation alu_control, the SIMD version of operation alu is first instantiated and then activated.
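To make the expansion scheme more concrete, the following Python sketch imitates what the SIMD_OPERATION, SIMD_INSTANCE, and SIMD_ACTIVATION macros conceptually produce for a given SIMD width. It is an illustration only, not the actual M4 implementation, and the emitted LISA text is simplified (the wrapper naming and section layout are assumptions).

# Illustration only (not the real M4 macros): generate one uniquely named
# wrapper operation per SIMD lane, plus the INSTANCE/ACTIVATION text a
# parent operation such as alu_control would use.  LISA syntax simplified.
SIMD_WIDTH = 4

def simd_operation(op, width=SIMD_WIDTH):
    """Wrapper operations as in Fig. 2: wrapper i instantiates op<i>."""
    return "\n".join(
        f"OPERATION {op}_wrap_{i} {{\n"
        f"  DECLARE {{ INSTANCE {op}<{i}>; }}\n"
        f"  ACTIVATION {{ {op} }}\n"
        f"}}"
        for i in range(width)
    )

def simd_instance(op, width=SIMD_WIDTH):
    return " ".join(f"INSTANCE {op}_wrap_{i};" for i in range(width))

def simd_activation(op, width=SIMD_WIDTH):
    return ", ".join(f"{op}_wrap_{i}" for i in range(width))

print(simd_operation("alu"))
print(simd_instance("alu"))
print(simd_activation("alu"))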
4 Generation of SIMD Permutation Networks
As many algorithms require a reordering of SIMD vector elements, a SIMD permutation network/unit is an important part of any SIMD DSP. Therefore, as presented below, some regular permutation networks have been realized in LISA via M4 macros. The permutation network macros also demonstrate the flexibility of extending LISA with a powerful macro language such as M4.
4.1 SIMD Permutation Networks
Full crossbar networks, single-stage shuffle exchange/inverse shuffle exchange networks, and multi-stage cube networks have been selected as samples of SIMD permutation networks. Full crossbar networks allow arbitrary permutations of SIMD vector elements; i.e., all input elements are connected to all output elements. However, the flexibility of a full crossbar is often not needed in signal processing algorithms [12]. A single-stage shuffle exchange/inverse shuffle exchange network only supports a limited number of permutations; complex permutations require repeated permutation operations. A shuffle exchange network consists of a perfect shuffle network followed by 2 × 2 cross point elements for exchanging adjacent elements. The perfect shuffle permutation for N values is defined by the following index transformation [13], where x and σ(x) describe the input and shuffled indices:

σ(x) = (2x + ⌊2x/N⌋) mod N    (1)

The permutation interleaves elements from the upper and lower half of the input vector (see Fig. 3). The inverse perfect shuffle permutation reverses this operation. The M4 macro implementation combines the shuffle exchange and inverse shuffle exchange networks and enables both types of permutations. A cube network is an example of a multi-stage interconnection network. A cube network for N inputs consists of log2(N) stages [14]; each stage contains N/2 switching elements (SEs) with two inputs and two outputs. The switching elements can be configured to swap the inputs, broadcast one input, or simply forward inputs to outputs without permutation. The input and output lines of a switching element have the same index. At stage i, the input lines that differ only in the i-th bit position are paired together as inputs for one SE. An example is depicted in Fig. 4.
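A quick way to see what (1) does is to evaluate it directly. The short sketch below (Python, for illustration only; not part of the generated hardware) prints the perfect shuffle destinations for N = 8 and shows the interleaving of the lower and upper halves.

# Evaluate the perfect shuffle permutation sigma(x) = (2x + floor(2x/N)) mod N.
def perfect_shuffle(n):
    return [(2 * x + (2 * x) // n) % n for x in range(n)]

N = 8
print(perfect_shuffle(N))
# -> [0, 2, 4, 6, 1, 3, 5, 7]: the lower half (inputs 0..3) lands on the
# even output positions and the upper half (inputs 4..7) on the odd
# positions, i.e. the two halves are interleaved, as stated above.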
4.2 Generation of Macros for Permutation Networks
All three presented permutation networks share a common characteristic: They are based on a set of rules that defines the network structure for arbitrary widths.
Fig. 3. 16-way shuffle exchange (a), inverse shuffle exchange (b), and combined shuffle exchange/inverse shuffle exchange (c) networks
Fig. 4. 8-way cube network: Switching elements are represented by boxes, the numbers describe the input (and output) ordering for each network stage
A cube network is defined by the pairing of input lines based on bit positions, (inverse) shuffle exchange networks require an evaluation of (1), and crossbar networks simply connect all inputs and outputs. The required evaluations can be implemented in M4. A macro call then automatically generates and activates operations that realize the different stages of the network. For each example of a regular permutation network, two macros have been implemented. The first type of macros (GEN_CROSSBAR, GEN_SHUFFLE_EXCH, and GEN_CUBE) allows generating permutation networks that operate on one
input vector and produce one output vector. The second type (GEN_CROSSBAR2, GEN_SHUFFLE_EXCH2, and GEN_CUBE2) can be used to generate permutation networks for pairs of vectors (two inputs, two outputs). The macros allow generating SIMD networks that can be automatically scaled with the SIMD width. The syntax of the macros for permutation networks is as follows: the first parameter, id, identifies the name of the generated operations; id can be used in SIMD_INSTANCE and SIMD_ACTIVATION macros to access the permutation network. The next parameters define template resources for input and output (in and out for single vector networks and in1, in2, out1, and out2 for networks on pairs of vectors). Furthermore, template resources for control signals that describe the permutation operation have to be specified as parameters (with one parameter for single vector networks and a pair of parameters for networks on two input vectors).
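The generator macros essentially evaluate such wiring rules for the configured width. As a language-neutral illustration (Python, not M4, and not the actual generator code), the sketch below computes the switching-element pairings of a cube network for an arbitrary power-of-two width, pairing at stage i the lines whose indices differ only in bit i.

# Illustration only: switching-element pairings of a cube network.
# Stage i pairs the input lines whose indices differ only in bit i.
def cube_stage_pairs(n_inputs):
    n_stages = n_inputs.bit_length() - 1       # log2(n_inputs), power of two assumed
    stages = []
    for i in range(n_stages):
        pairs = [(line, line ^ (1 << i))       # flip bit i to find the partner
                 for line in range(n_inputs)
                 if line < line ^ (1 << i)]    # emit each pair only once
        stages.append(pairs)
    return stages

for stage, pairs in enumerate(cube_stage_pairs(8)):
    print("stage", stage, pairs)
# stage 0 pairs (0,1) (2,3) (4,5) (6,7); stage 1 pairs (0,2) (1,3) (4,6) (5,7);
# stage 2 pairs (0,4) (1,5) (2,6) (3,7).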
5 Conclusions
The Processor Designer tool kit based on the LISA language facilitates the development of DSP architectures by automatically generating compilers, assemblers, simulators, and RTL code from a single source. However, analyzing the modeling capabilities of LISA for scalable SIMD architectures showed that such architectures may not be modeled efficiently in LISA: While a scalable number of data paths may be defined in LISA, there is no mechanism to activate them in parallel. This issue can be solved by extending LISA with macro functions that provide activation mechanisms. For this purpose, a framework based on the GNU M4 language has been implemented. Uses of M4 macro extensions for LISA are not limited to SIMD data paths; in general, any parametrized processing unit that is based on a regular set of design rules can be constructed from M4 macros. As an example, we demonstrated the generation of single- and multi-stage permutation networks for a SIMD DSP. Due to the regular structure of the networks, they may be easily modeled in M4 and valid LISA code can be generated automatically. Hence, the proposed M4 extension for LISA has two major benefits: It enables the development of scalable SIMD architectures in LISA, which was not possible before, and it augments LISA with means to describe regular and flexible processing units automatically.
References
1. Becher, R., Dillinger, M., Haardt, M., Mohr, W.: Broadband wireless access and future communication networks. Proc. IEEE 1, 58–75 (2001)
2. van Berkel, K., Heinle, F., Meuwissen, P.P.E., Moerman, K., Weiss, M.: Vector processing as an enabler for software-defined radio in handheld devices. EURASIP Journal on Applied Signal Processing 16, 2613–2625 (2005)
3. Lin, Y., Lee, H., Woh, M., Harel, Y., Mahlke, S., Mudge, T., Chakrabarti, C., Flautner, K.: SODA: A Low-power Architecture For Software Radio. In: Proc. 33rd Intl. Symposium on Computer Architecture (ISCA) (2006)
4. Westermann, P., Beier, G., Ait-Harma, H., Schwoerer, L.: Developing FFTs for SC-FDMA on the Embedded Vector Processor. In: Proceedings of the 13th International OFDM-Workshop (InOWo 2008) (2008)
5. Woh, M., Lin, Y., Seo, S., Mudge, T., Mahlke, S.: Analyzing the scalability of SIMD for the next generation software defined radio. In: IEEE International Conference on Acoustics, Speech and Signal Processing, ICASSP 2008, March 31–April 4, pp. 5388–5391 (2008)
6. Pees, S., Hoffmann, A., Zivojnovic, V., Meyr, H.: LISA–machine description language for cycle-accurate models of programmable DSP architectures. In: DAC 1999: Proceedings of the 36th ACM/IEEE Conference on Design Automation, pp. 933–938. ACM, New York (1999)
7. CoWare: Processor Designer Reference Manual. Product version v2007.1.1 edn. (March 2008)
8. Rashid, M., Apvrille, L., Pacalet, R.: Evaluation of ASIPs Design with LISATek. In: Bereković, M., Dimopoulos, N., Wong, S. (eds.) SAMOS 2008. LNCS, vol. 5114, pp. 177–186. Springer, Heidelberg (2008)
9. von Sydow, T., Blume, H., Kappen, G., Noll, T.G.: ASIP-eFPGA architecture for multioperable GNSS receivers. In: Bereković, M., Dimopoulos, N., Wong, S. (eds.) SAMOS 2008. LNCS, vol. 5114, pp. 136–145. Springer, Heidelberg (2008)
10. Seidel, H., Matus, E., Cichon, G., Robelly, J.P., Bronzel, M., Fettweis, G.: Generated DSP Cores for Implementation of an OFDM Communication System. In: Pimentel, A.D., Vassiliadis, S. (eds.) SAMOS 2004. LNCS, vol. 3133, pp. 353–362. Springer, Heidelberg (2004)
11. Seindal, R., Pinard, F., Vaughan, G.V., Blake, E.: GNU M4, version 1.4.12 - A powerful macro processor. 1.4.12 edn. (September 2008)
12. Raghavan, P., Munaga, S., Ramos, E.R., Lambrechts, A., Jayapala, M., Catthoor, F., Verkest, D.: A Customized Cross-Bar for Data-Shuffling in Domain-Specific SIMD Processors. In: Lukowicz, P., Thiele, L., Tröster, G. (eds.) ARCS 2007. LNCS, vol. 4415, pp. 57–68. Springer, Heidelberg (2007)
13. Parker, D.S.: Notes on Shuffle/Exchange-Type Switching Networks. IEEE Trans. Comput. 29(3), 213–222 (1980)
14. Siegel, H.J.: Interconnection Networks for SIMD Machines. Computer, Special Issue on Circuit Switching 12(6), 57–69 (1979)
NoGAP: A Micro Architecture Construction Framework
Per Karlström and Dake Liu
Department of EE, Linköping University, Linköping, Sweden
[email protected], [email protected]
Abstract. Flexible Application Specific Instruction set Processors (ASIPs) are starting to replace monolithic ASICs in a wide variety of fields. However, the design of an ASIP is today a substantial design effort. This paper discusses NoGap (Novel Generator for ASIP), a tool for ASIP design utilizing hardware multiplexed data paths. One of the main advantages of NoGap compared to other ADL tools is that it does not impose limits on the architecture and thus on design freedom. To reach this flexibility, NoGap makes heavy use of the compositional design principle and is therefore divided into three parts: Mage, Mase, and Castle. This paper presents the central concepts of NoGap to show that it is possible to reach this advertised flexibility and still be able to generate HDL code and tools such as simulators and assemblers.
1 Introduction

The design and implementation of a new processor is usually the result of a substantial design effort. There are a number of different software tools that relax the design effort in one way or another; however, all these tools force the designer into a predefined architecture template. This limitation in design flexibility often makes designers of novel ASIP processors and programmable accelerators revert back to an HDL language, e.g., Verilog or VHDL. HDL languages offer full design flexibility at the register transfer level, but the flexibility comes at the cost of increased design complexity: all details, e.g., register forwarding and/or pipeline control, have to be handled manually. The description used in higher-level tools is often called an Architecture Description Language (ADL); therefore, these kinds of design tools will be referred to as "ADL tools" in this paper. Since an ADL tool presents an abstracted view of the design, some details have to be hidden and there must be some a priori assumptions about the device being designed. This fragments the ADL tools in different directions with different areas of use. Currently, most ADL tools either describe systems at a higher level of abstraction, modeling transactions and interaction of modules at the system-wide scope, or aim to ease processor design with supporting compilers and simulators. In both of these ADL tool classes there are a number of mature and well performing products which can be used with great success if the system being designed fits into the tool's a priori assumptions. The risk of using an ADL tool is, however, that the final design is just a product of what the ADL tool supported, rather than the innovative product the designer had in mind.
NoGap does not aim to make processor construction easy for the masses. NoGap aims to support smart designers, i.e., designers who know what they are doing and use NoGap for support with tedious tasks. This paper presents the major parts of NoGap, highlighting its commonalities and differences with existing tools. The aim of NoGap is to fill a gap in the ADL toolbox, targeting the design of novel ASIP architectures. NoGap will hopefully be another powerful tool to use when designing advanced and/or complex hardware.
2 Related Work

This section reviews the current state of the art and describes how NoGap compares to these tools. A number of tools, such as LISA [8], EXPRESSION [3], nML [1], MIMOLA [5], ArchC [7], and ASIP Meister [4], support processor design. All of these tools, however, force a designer into a predefined template architecture. On the other end of the spectrum of design tools are HDL languages such as Verilog, VHDL, or SystemC [6]. These tools, however, require manual handling of all the minuscule details of an RTL design. NoGap offers a unique trade-off between these two extremes: no template design is assumed, but support is given for managing details regarding pipelined instruction controlled architectures. ASIP Meister is the existing tool closest to NoGap. Both tools generate data flow graphs between a number of primitives and perform hardware multiplexing. While ASIP Meister is limited to RISC architectures, NoGap does not assume any base architecture; only minimal a priori knowledge about processor architectures is used in NoGap. Tools such as Catapult-C [2] offer C-to-HDL generation for specific algorithms. Although a powerful tool for fixed function DSP hardware, it leaves little room for instruction controlled data paths. NoGap is not primarily designed for fixed function data path design, although it is possible; NoGap's strengths lie in the design of hardware multiplexed, instruction controlled data paths.
3 NoGap Overview

NoGap consists of a number of components. The central component is the NoGap Common Description (NoGapCD). NoGapCD in turn consists of three parts: the Micro Architecture Structure Expression (Mase), the Micro Architecture Generation Essentials (Mage), and the Control Architecture STructure LanguagE (Castle). The Mase description is an annotated graph representing connections between functional units. The Mage description is an Abstract Syntax Tree (AST) representation of leaf modules containing the actual functionality, and the Castle description contains directives for how instruction decoders should be generated. Even though the NoGapCD can be constructed directly with a C++ API, this is not the intended way to construct a new micro architecture.
Fig. 1. NoGap principle
The intended flow is to use a front-end tool called a facet and let the facet generate the NoGapCD. NoGap has a default facet for this, implemented as a language called the NoGap Common Language (NoGapCL). The final stage in the NoGap flow is to produce something useful; this is done by implementing a tool that reads the NoGapCD and generates a useful output from it. These output tools are called spawners in the NoGap terminology. A spawner can, for example, be an RTL code generator or a cycle accurate simulator generator. The NoGap framework principle is depicted in Figure 1, where a number of possible facets generate a NoGapCD and, from that, a number of spawners generate the various tools needed. This allows for a modularized approach with the possibility to reuse code over different projects. For example, a cycle accurate simulator spawner could have been implemented for an earlier project, while a new facet would ease the design effort for a new project; in this case, only the new facet has to be implemented, and time and money can be saved by reusing the old spawner.
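As a purely hypothetical sketch of this compositional structure (the class and method names below are ours, not NoGap's actual C++ API), a facet builds the common description and each spawner consumes it:

# Hypothetical sketch of the facet/spawner composition; names invented for
# illustration and unrelated to NoGap's real C++ API.
from abc import ABC, abstractmethod

class NoGapCD:
    """Container for the three parts of the common description."""
    def __init__(self, mage, mase, castle):
        self.mage, self.mase, self.castle = mage, mase, castle

class Facet(ABC):
    @abstractmethod
    def build(self, source_text):
        """Parse a front-end description and produce a NoGapCD."""

class Spawner(ABC):
    @abstractmethod
    def spawn(self, cd):
        """Turn a NoGapCD into one useful output (RTL, simulator, ...)."""

def run_flow(facet, spawners, source_text):
    cd = facet.build(source_text)           # one common description ...
    return [s.spawn(cd) for s in spawners]  # ... many generated outputs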
4 Pipelined Instruction Driven Architectures

Traditionally, hardware architectures have been classified by how often their function can be changed: never → ASICs, once in a while → configurable data paths, every cycle → processors.

Table 1. Classification of Pipelined Instruction Driven (PID) architectures

PID0: A fixed function unit (just one instruction). Example: fixed function ASICs.
PID1: A multi function functional unit whose function depends on the operations fed to the data path. The unit is responsible for transforming incoming operations to control signals and delaying the respective control signals according to the pipeline. Example: instruction driven data paths.
PID2: Same as PID1, but operations are generated internally in the unit. Example: processors, microcoded or FSM driven accelerators.
This classification is important, but for the sake of clarifying the fundamentals of NoGap we classify all Pipelined Instruction Driven (PID) architectures according to Table 1. Among the large variety of existing ADL tools, there is an abstraction gap between the RTL and ADL tools. The existing ADL tools for PID architecture implementation can best be described as instruction based, (barely) RTL aware tools. The aim of NoGap is to reverse this situation and deliver an RTL based, instruction aware tool. The foundation for this reversal is the assumption that the problem of implementing PID architectures is to manage pipeline complexity, not to implement the separate functional units. Describing the separate functional units at the register transfer level is usually a good idea: it retains control of the hardware and enables the flexibility to implement novel functionality. The main problem when designing PID architectures is to manage the complexity of pipelines and hardware multiplexing. Managing and synchronizing a larger pipelined architecture in an RTL language is a delicate dance with the devil, and doing design space exploration in this manner is an even harder exercise. A reasonable question to ask is then whether it is possible to construct a tool that delivers design flexibility yet gives support for any kind of PID architecture. Realizing that processor architectures fall into the PID class, a natural second question is whether such a tool, if it satisfies the first question, can also be extended to ease processor design. We argue that this is possible, and with a much larger design flexibility than any current tool delivers. To investigate whether this is indeed possible, we developed NoGap guided by the following two rules:

Fully flexible PID1 architectures. The central dogma of NoGap is to deliver complete design freedom for PID1 architectures.
Minimal a priori assumptions about PID2 architectures. While still supporting PID2 architectures, the a priori assumptions about how such an architecture looks shall be minimal, to maintain as much design flexibility as possible.

These rules force us to define a minimal set of a priori knowledge needed for PID architectures. PID1 architectures only need a priori knowledge about what a pipeline is and what operations are. PID2 architectures need more a priori knowledge. Table 2 presents the a priori knowledge programmed into NoGap. NoGap does not treat storage elements in any special way; a register file, for example, will be treated like any other Functional Unit (FU).

Table 2. Minimal a priori knowledge for PID1 and PID2 architectures

Needed for PID1 architectures:
• operations: what an operation is and how it affects a hardware structure.
• pipelining: what a pipeline is and how it functions.
Needed for PID2 architectures:
• stalling: stalling to resolve data and structural hazards.
• forwarding: forwarding can be used to minimize stalling.
• jumping: change of operation flow depending on computed results.
• flushing: if needed for change of operation flow.
To inform NoGap about the possible source and destination operands, the Castle description is used to denote certain nodes as data sources and other nodes as data destinations. This is all according to the powerful yet flexible design-by-composition principle prevalent in NoGap.
5 NoGapCD

This section outlines the components of the NoGapCD, with emphasis on the Mase description. When constructing NoGap we were faced with a threefold problem: first, we needed a way to describe the hardware to be used; second, we needed a way to define operations and relate them to the hardware, in a manner resulting in efficient hardware utilization; third, we had to find a way to describe the general architecture of the control path.

5.1 NoGapCL

NoGapCL is the default NoGap facet used to construct the Mase, Mage, and Castle descriptions. Since NoGap is a tool aimed at hardware design, its central language in many ways resembles a synthesizable subset of VHDL and Verilog, with a number of important differences that are outlined in this section. From the language point of view, Mase and Mage are fused together into one common language; however, the different parts are handled very differently in the compiler. Mage is the part closest to hardware and essentially describes the functionality of the system. A Mage description is compiled into and saved as an abstract syntax tree from which the various spawners can extract the information they need to synthesize the leaf modules. The Mase description captures the spatial and causal relationship between the modules described in Mage.

5.2 Operations in Mage Descriptions

To manage complexity while still maintaining flexibility, NoGap's central design principle is compositional design. For this reason, a Mage FU cannot be aware of what an operation is: it should be usable either as an instruction driven module, e.g., an ALU with an operation select input, from the operation construct in Mase descriptions, or as a normal hardware module, perhaps as a sub module in a Mage description. For this reason, an FU cannot have any special operation select constructs, but there still needs to be a way to access different code parts depending on different inputs to the FU. A novel approach, which we call dynamic clause selection, is used to solve this dilemma in NoGap; it works as described below.

Dynamic Clause Selection. Dynamic clause selection is used to make FUs behave as operation controlled FUs. It works as follows. All clauses can be named. An FU with named clauses can be used in an operation construct where the clause to be accessed can be specified. NoGap then calculates which input signals need to be set, and how, to reach that specific clause. This comes down to a satisfiability problem and is as such NP-complete.
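Conceptually, this computation can be pictured as a search over the values of the control inputs. The toy sketch below (Python, invented clause names, brute-force enumeration) only illustrates the idea; it is not NoGap's actual solver.

# Toy illustration of dynamic clause selection as a satisfiability search:
# enumerate the control-input values of a small FU and record which values
# make its behaviour reach a given named clause.  An empty result means the
# clause is unreachable; wide control inputs would need a real SAT solver.
def alu_behaviour(op_sel):
    """Toy FU behaviour: return the name of the clause that executes."""
    if op_sel == 0:
        return "add_clause"
    if op_sel == 1:
        return "sub_clause"
    return "pass_clause"

def inputs_reaching(clause, input_bits=2):
    return [v for v in range(2 ** input_bits) if alu_behaviour(v) == clause]

print(inputs_reaching("sub_clause"))   # [1]
print(inputs_reaching("mul_clause"))   # []  -> unreachable clause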
In practice, any module will have a limited set of problems to solve, which keeps this approach practical; however, if the problem becomes too complex to solve in a reasonable amount of time, the FU is considered illegal for dynamic clause selection. Note that this is solved once for each FU and only affects compile times. Some cases are also unsatisfiable, e.g., if a clause is unreachable or cannot be uniquely activated; thus not every FU is a legal dynamic clause selection FU, but a large set of FUs can be designed to be legal. The input ports used to control clause access are marked as control inputs and become part of the synthesized control path. Ports marked as control inputs can never be used as data inputs from other FUs.

5.3 Port and Wire Sizing

NoGap can dynamically determine the size of wires and ports, even if the input port sizes of a Mase FU are not known at compile time. This extends the functionality already existing in Verilog and VHDL with parameters and generics, respectively. While parameterized modules can have their port sizes set from the outside, they cannot determine their own port sizes depending on where in the data path they end up. The dynamic port sizing in NoGap ensures that adding an instruction will not break already existing functionality, even if the same FU is used by other instructions through hardware multiplexing. The sizing problem is complicated by the fact that a Mase graph might contain loops. The sizing algorithm is therefore split into two phases: first an annotation phase and then a size solving phase. The annotation phase annotates all sizes with either values or, if no value is available, symbolic expressions. The solving phase uses a graph to direct the symbolic manipulations of the expressions. Although an interesting topic, the exact details of the algorithm are outside the scope of this paper. The overall algorithm for sizing the wires can be summarized in the following steps:
1. Make an initial annotation in the graph of all edges to be sized.
2. Resolve all size equation relations; simplify and substitute trivial relations.
3. Resolve loop dependencies by iterating the specified number of iterations.

A Mage description using dynamic input port sizing will usually require dynamic output port sizing as well. The size of an output port can be written as a function of one or several input port sizes. Apart from the normal four operators (add, subtract, divide, and multiply), two additional operators can be used: binary maximum ($) and binary minimum (@). For example, an adder would probably have its output port expression set to (opa$opb) + 1, where opa and opb are the input port sizes. This approach presents two problems. First, if one or several Mase inputs are dynamically sized, i.e., their size is not known at compile time, how are the sizes computed for the rest of the Mase graph? Second, how shall loops in the Mase graph be handled? The first problem is solved by doing symbolic computation of the wire sizes. The second problem, handling loops, is harder; it is solved using a two-phase approach consisting of the annotation phase and the solver phase. In the annotation phase, all wires are assigned two size expressions, one called the source expression and one called the target expression. Source expressions are set when processing output nodes, and target expressions are set when processing source nodes. The algorithm processes the graph with a wave-front traversal.
Target expressions can either be set to a copy of a valid (already set) source expression or to a new unique symbolic variable if no valid source expression exists for the wire.
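Before continuing with the solver phase, the size-expression operators introduced above can be illustrated numerically. The sketch below (Python, illustration only; NoGap's actual solver works symbolically) evaluates an adder-style output-size expression once the input port sizes are known.

# Numeric illustration of output-size expressions with the binary maximum
# ($) and binary minimum (@) operators; the real solver manipulates these
# expressions symbolically before any concrete sizes are known.
def size_max(a, b):    # the '$' operator
    return max(a, b)

def size_min(a, b):    # the '@' operator
    return min(a, b)

def adder_output_size(opa, opb):
    # output port expression (opa$opb) + 1 from the text above
    return size_max(opa, opb) + 1

print(adder_output_size(16, 12))   # 17-bit result for a 16-bit + 12-bit adder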
When the annotation phase is completed, all wire size relations are compiled into a graph representing their relationships to each other and to the introduced symbols. This new graph is then used to back-annotate size expressions to symbols that are trivially equal. Looped relationships are detected in the graph and solved according to the maximum allowed iterations specified by the designer. The exact algorithm for this processing is outside the scope of this paper; a separate paper detailing how this is done will be presented later.

5.4 Mage

The Mage description is similar to HDL languages such as Verilog and VHDL. The main differences from Verilog and VHDL are that no combinatorial loops are allowed, that ports are dynamically sized, and that clock and reset inputs are implicit. Combinatorial loops can be checked for with the techniques for SSP generation discussed in Section 6.2. An FU does not need a clock or reset input; these are implicitly assumed for all FUs using the cycle construct. All logic takes place within either a cycle or a comb block: the comb block describes combinatorial logic and the cycle block describes clocked logic. If a Mage FU should be controllable with operations, clauses need to be named so that dynamic clause selection can take place, as described in Section 5.2.

5.5 Mase

Mase is an annotated graph representation of connections between functional units. Mase is created by constructing a graph with all connections needed for all instructions of the module in question. A simple initial Mase description (just mapping two input ports to two output ports) is depicted in Figure 2. In order for the spawners to build their tools, the Mase description might have to be simplified or modified. Typical Mase transformations are flip flop insertion, flip flop combining, mux insertion, edge combining, wire naming, and wire and port sizing. Figure 3 shows the resulting Mase description after applying these transformations. Both these graphs are generated by running NoGap and are shown as is. Note how operation edges have been combined, how flip flops (FF *) and muxes (MUX *) have been inserted, how unique names have been assigned to each edge, and how the correct edge and output port sizes have been set. New ports have also been created for the mux control signals.

5.6 Data Paths and Control Paths

A Mase FU will have data and control inputs. The control inputs are either mux selection signals or control signals for the various FUs. For a PID1 architecture, only a simple control path, delaying control signals and doing instruction decoding, is needed, since instruction sequencing is done from the outside. Data signals are pipelined internally in the Mase graph and need no other consideration. The control signals, including the control signals for the muxes, are however directly connected to their respective control nodes and therefore need to be delayed and perhaps muxed if different operations use the same FU but at different times.
Fig. 2. Original graph, generated by NoGap
Fig. 3. After all transformations, generated by NoGap
These pipeline timing muxes are controlled by pipeliner nodes. A pipeliner node accepts as input a vector of delayed operation classes. From this vector, the pipeliner node decides which operation gets priority; the default action is to give priority to the longest operation, i.e., the operation class with the longest delay, thus ignoring shorter operations competing for the same resource. NoGap is responsible for doing the instruction classification depending on an operation's resource utilization; the exact details of how this is done are outside the scope of this paper. There is a close relationship between the data path and the control path. In a Mase graph all data will be pipelined correctly, but the control signals to control nodes, e.g., inserted mux control nodes or FU control input nodes, are not pipelined. A separate control path therefore needs to be added, which keeps the control data in synchronization with the data to be processed. This control path can also be set to handle instruction hazards and register forwarding if needed.

5.7 Castle

Castle is the part of NoGap dealing with internal instruction generation, essentially turning a PID1 architecture into a PID2 architecture. The part that describes how instruction streams are generated is called a sequencer in NoGap. The data path itself should not know how a sequencer operates, apart from being able to stall the instruction stream if necessary. The sequencer description contains information about how source and destination operands are generated, if and how register forwarding is implemented, and how jumps are handled. Interrupt handling is also specified in the Castle description. Although Castle is an important part of NoGap, the finer details will be presented in a later paper.
6 Generators
This section outlines the techniques used to generate synthesizable HDL, software simulators, and compilers.
6.1 HDL Generation
Generation of synthesizable HDL code from either a Mase or a Mage description is a straightforward task. The Mase graph needs to have muxes and flip-flops inserted, and wires need to be uniquely named and sized correctly (see Section 5.3); after that, the translation to HDL is straightforward. Castle descriptions will in the end be transformed into Mase and Mage descriptions and are thus generated with the same mechanism as Mage and Mase. Mage descriptions require a little more work: they need to be translated to HDL from their ASTs. This process is, however, also rather straightforward, as Mage descriptions are already HDL-like.
6.2 Cycle Accurate Simulators
As stated from the start, it must be possible to generate simulators from Mage and Mase descriptions. Simulation requires linearization of a parallel hardware architecture. The rule in NoGap that no combinatorial cycles are allowed ensures that it will be possible to find such a serialization on a cycle-by-cycle basis. The serialization involves creating the data flow graph for the current instruction and performing a partial ordering to find the execution order. A partial ordering requires the graph to be a DAG, but internal registers in the FUs in the DFG can break otherwise illegal loops. To check for and resolve these fake loops, each FU needs to be split into three partitions: source, sink, and pass, called an SSP partition, as depicted in Figure 4. Input ports not directly affecting the output are connected to the sink node. Output ports not directly dependent on the input ports are connected to the source node. Finally, all input and output ports that have a combinatorial relationship are connected to the pass node. Note that the SSP partitioning might be different for different operations. The SSP partitioning can be extracted from the variable dependency graph computed for Mage serialization. Mage FUs also need to be serialized; this is doable since combinatorial loops are illegal. This serialization requires us to compute a variable dependency graph for the FU. An example of such a graph for a simple FU is shown in Figure 5, where Co= is a combinatorial dependency, Cy= is a register assignment, and Op denotes operands in an expression. If the Cy= edges are removed and there are still loops in the graph, the FU contains combinatorial loops and is thus illegal, as is the case with the bottom graph in Figure 5. If there are no combinatorial loops, a partial ordering of the graph gives the serialization.
Fig. 4. SSP decomposition of FU
Fig. 5. Variable dependency graph for Mage FU, generated by NoGap
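As a rough illustration of the serialization check in Section 6.2, the sketch below builds a variable dependency graph from combinatorial (Co=) and register (Cy=) edges, drops the register edges, and either reports a combinatorial loop or returns a partial (topological) order. The graph encoding and function name are assumptions for illustration, not NoGap's actual data structures.

from collections import defaultdict, deque

def serialize(co_edges, cy_edges):
    """co_edges: combinatorial dependencies (Co=), cy_edges: register assignments (Cy=).
    Register edges are dropped; a cycle in the remaining graph is an illegal
    combinatorial loop, otherwise a topological order gives the serialization."""
    nodes = {n for e in co_edges + cy_edges for n in e}
    succ = defaultdict(list)
    indeg = {n: 0 for n in nodes}
    for src, dst in co_edges:          # only combinatorial edges constrain ordering
        succ[src].append(dst)
        indeg[dst] += 1
    ready = deque(n for n in nodes if indeg[n] == 0)
    order = []
    while ready:                       # Kahn's algorithm for the partial ordering
        n = ready.popleft()
        order.append(n)
        for m in succ[n]:
            indeg[m] -= 1
            if indeg[m] == 0:
                ready.append(m)
    if len(order) != len(nodes):
        raise ValueError("combinatorial loop detected: FU is illegal")
    return order

# A register on the feedback path breaks the loop, so this FU is legal.
print(serialize(co_edges=[("a", "acc_next")], cy_edges=[("acc_next", "a")]))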
6.3 Compiler Generation
Compilers are important tools for many PID2 architectures, especially if different architectures need to be evaluated. The NoGapCD is, however, too flexible to accurately generate compilers. Compiler generation would require using a subset of NoGapCD with extra information extracted from a processor generation facet.
7 Conclusions
This paper started by classifying hardware architectures according to Table 1. Since we argue that there is an abstraction gap between HDL tools and ADL tools, we implemented NoGap, and this paper described its major parts. NoGap offers design flexibility while managing complexity by using a compositional design approach. To be precise, NoGap offers full flexibility for PID1 architectures and minimal a priori assumptions for PID2 architectures. NoGap uses an intermediate representation called NoGapCD, which can be used to generate a number of tools needed for processor design. The immediate focus of NoGap was not automatic compiler generation, but the already existing framework can be used together with a compiler generation facet. NoGap's HDL generation is stable and complete, control path generation is working, and a proof-of-concept simulator generator has been developed.
A Comparison of NoTA and GENESYS Bernhard Huber and Roman Obermaisser Institute of Computer Engineering, Vienna University of Technology, Treitlstr. 3, 1040 Wien, Austria {huberb,romano}@vmars.tuwien.ac.at
Abstract. In the field of embedded systems, the trend of converging application systems (e.g., multimedia systems entering the automotive domain) is observable. In recent years, the European Union has started undertakings to tackle the challenges introduced by this digital convergence. GENESYS is such a European project that aims at developing a cross-domain architecture for embedded systems, which facilitates this digital convergence. This paper compares the GENESYS architecture with the Network on Terminal Architecture (NoTA), which is primarily targeted at the consumer applications domain, and elaborates on their major commonalities and differences, such as service-orientation, component-based design, or the need for rigorous interface specifications. Keywords: Embedded Systems, Cross-Domain Architectures.
1 Introduction
The significance of embedded systems has enormously increased over the past years. Mainly due to the tremendous gain in performance, shrinking geometries, and decreasing costs of embedded computing devices, many functionalities that were unimaginable until recently are nowadays realized by embedded systems. This trend is observable across application domains: The consumer applications domain has experienced a boost due to ubiquitous mobile applications. Control applications benefit from the increasing performance by the ability to solve more complex control tasks and from the decreasing costs by the ability to increase system dependability by deploying replicated nodes with lower cost overhead. These advancements of embedded systems have led to specific experts and technologies in the individual application domains. However, nowadays this trend is changing: For instance, multimedia systems are integrated in automotive systems and not only in high-end luxury cars (e.g., integration of mobile telephony in the vehicle's entertainment system). On the other hand, control applications make use of Internet technologies, such as enabling remote observation, diagnosis, and maintenance of industrial plants. This trend imposes new challenges on the entire embedded systems industry. To address these challenges, European industry in conjunction with the EC has set up the Technology Platform ARTEMIS, which defines in its Strategic Research Agenda [1] medium-term visions and research targets for the European embedded system industry, including
the development of a cross-domain reference architecture. As a first initiative, the European project GENESYS was started in January 2008 with the objective of developing a blueprint of such a cross-domain architecture. Independent of the ambitious goals of ARTEMIS, the development of the Network on Terminal Architecture (NoTA) was started in 2003. NoTA addresses the ever-increasing challenge of the consumer applications domain to integrate services originating from various technology providers into a single system. Therefore, the objectives of NoTA include increasing the interoperability, composability, and reuse of systems to shorten the time-to-market, to reduce development costs, and to provide a platform for application development that is nearly agnostic to changing implementation technologies. This paper presents a comparison of both architectures and elaborates on potential synergies. The rest of the paper is structured as follows. Section 2 introduces NoTA, whereas Section 3 is devoted to GENESYS. The commonalities and differences of both architectures are discussed in Section 4. The feasibility of combining NoTA and GENESYS is addressed in Section 5. The paper finishes with a conclusion in Section 6.
2 Network on Terminal Architecture
NoTA is a modular service-based system architecture for mobile and embedded devices. It is an open architecture with the primary goal to define a unified interface for embedded devices in order to ease the development and integration of interoperable services and devices. The development of NoTA is driven by an open architecture initiative, initially started at Nokia Research Center in 2003, with the aim to provide a solution that can be used throughout industry, academia, and the developer community. Since then, several releases of the implementation of the interconnect have been developed and are now open to the public. These results are currently used in industrial product implementations. NoTA does not define services for any specific domain or products, but provides a service-oriented framework for the design of embedded applications, which is driven by end-user requirements [2]. NoTA identifies devices that consist of service nodes and application nodes. Devices communicate via the so-called Device Interconnect Protocol (DIP). The DIP offers two communication modes: message-based communication and streaming communication. The message-based communication is bi-directional and used by application nodes to exert control over service nodes. The streaming communication is uni-directional and enables the transfer of data (e.g., multimedia data).
2.1 System Structuring
NoTA provides three distinct abstraction levels for system design, denoted as functional architecture, logical architecture, and implementation architecture [3]. The functional architecture describes the functional aspects of the system by means of application nodes (AN) and service nodes (SN) interconnected by the DIP. Application nodes interact with the user of the system and make use of the services provided by the SNs. Service nodes provide their services to ANs and
utilize the services of other SNs to fulfill their specified service. Services provided by ANs and SNs are solely exploited via service interfaces. These interfaces are described by the Service Interface Specification (SIS), which defines the service syntax, the behavior (time-free, via finite state machines), and bounds for non-functional properties such as latencies, bandwidth, and energy consumption [4]. The logical architecture describes the grouping of ANs and SNs into subsystems. Besides this logical viewpoint, subsystems also exhibit a physical viewpoint: All resources required for implementing the ANs and SNs are part of the hosting subsystem and are not shared with other subsystems. Hence, services of different subsystems can access shared resources solely via the service interface. Thus, the partitioning of services into subsystems is a first important design decision influencing system performance versus service independence and encapsulation. The implementation architecture describes the physical implementation of the system. At this viewpoint, the conceptual element provided by NoTA is a device. A device is a physical entity that provides resources like processors, buses, memories, and peripherals, which can host one or more subsystems. An exemplary implementation architecture of a NoTA system is depicted in Figure 1.
2.2 Device Interconnect Protocol
The DIP represents the communication infrastructure of a NoTA system. The DIP consists of two protocol layers – the Low Interconnect (L_IN) and the High Interconnect (H_IN) (cf. Figure 1). Essentially, the L_IN connects subsystems by mapping the communication requests of services to the actual physical communication infrastructure. It provides uniform socket-based communication mechanisms. The L_IN is further responsible for the discovery of the physical entities that are the endpoints for the communication activities. For providing a uniform interface to the H_IN independent of the underlying physical transport protocol, the L_IN is split into two layers. While the higher layer provides stable services that are independent of the transport protocol, the services of the lower layer are tailored to the characteristics of a particular physical interface. The main purpose of the H_IN is the registration, discovery, and activation/deactivation of services. Service registration and discovery is managed by one dedicated H_IN – denoted HManager. A new service is registered at the HManager with its service ID and its interconnect address, i.e., the information on which device and subsystem the particular service is located. For service discovery, a query containing the service ID is sent to the HManager, which resolves the according interconnect address. Service IDs are allocated based on service ontologies by a dedicated service node, the Service Level Resource Manager [5]. The DIP is nearly independent of a specific communication technology. Hence, it is possible to replace an off-chip network with an on-chip network with low overhead, e.g., when a chip is replaced by an IP core due to technological advancements. In such a case, the transport protocol specific part of the L_IN needs to be adapted, whereas the remaining parts of the DIP, in particular the interface towards the application and service nodes, remain unaffected.
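To make the registration and discovery role of the H_IN concrete, here is a minimal, hypothetical sketch of an HManager-style registry in Python; the class and method names are illustrative only and do not reflect the actual NoTA API.

class HManager:
    """Toy registry mapping service IDs to interconnect addresses
    (device, subsystem), in the spirit of the H_IN manager role."""
    def __init__(self):
        self._registry = {}

    def register(self, service_id: str, device: str, subsystem: str) -> None:
        # A new service announces its ID together with its location.
        self._registry[service_id] = (device, subsystem)

    def resolve(self, service_id: str):
        # Discovery: a query with the service ID returns the interconnect address.
        return self._registry.get(service_id)

hm = HManager()
hm.register("audio.decoder", device="Device1", subsystem="MediaSubsystem")
print(hm.resolve("audio.decoder"))   # ('Device1', 'MediaSubsystem')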
Fig. 1. NoTA System Structure
Fig. 2. GENESYS Integration Levels
3 Generic Embedded Systems Architecture
GENESYS (GENeric Embedded SYStem Platform) is an FP7-ICT collaborative project that aims at the development of novel technologies for embedded computer systems across individual application domains. The objectives of the project are driven by the aim to enable significant improvements concerning time-to-market, cost, and robustness for a wide range of applications, from mobile phones to safety-critical computers in airplanes. First results describing the central architectural principles and services are already available as a public report (http://www.genesys-platform.eu/documents.htm). Concrete instantiations of GENESYS for the industrial domain are the scope of the INDEXYS project, which starts on April 1, 2009. It is also intended to extend the GENESYS instantiations to other domains in subsequent research projects. In the following, important architectural principles characterizing the GENESYS architecture are introduced [6].
3.1 Component-Oriented Architecture
The GENESYS architecture follows a strict component orientation: A component is a self-contained hardware/software subsystem that is used as a building block in the design of a larger system. It encapsulates the implementation and hides implementation details. The provided services are exposed via a set of interfaces. Since a component hides the (complex) internal structure from its user, it offers an appropriate unit of abstraction for system design. Therefore, it helps to tackle the challenge of design complexity in modern embedded systems. Message-based Communication Infrastructure. For reducing system design complexity, GENESYS strives for a consistent decoupling of the (computational) components from the communication infrastructure to facilitate design and analysis of both parts in isolation. This decoupling is realized by using message passing, i.e., the unidirectional exchange of messages, as the primary communication paradigm. The exchange of messages is determined by the interface
design of the components; thus, the communication infrastructure is unaffected by modifications of the internals of components as long as the behavior at the interface is not changed. GENESYS supports three distinct communication modes: periodic messages, where the instant of transmission is determined by an a priori defined, collision-free message schedule; sporadic messages, where the transmission of messages can be triggered by any significant event; and synchronized data streams, which enable the on-the-fly processing of synchronized streaming data from multiple sources. The message paradigm is a universal model, since on top of a basic message passing service it is possible to realize other communication paradigms such as shared memory or higher-level communication protocols [7]. Component Interfaces. The interfaces of a component abstract from the internals of the component and enable the exchange of one component independently of other components as long as the behavior at the component's interface remains stable. The GENESYS architectural style distinguishes two types of component interfaces: local interfaces and linking interfaces (LIFs). Local interfaces are interfaces to the component's environment. LIFs are used for the integration of components. For facilitating composability and interoperability, a precise LIF specification in the temporal and value domains is required [8].
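The three communication modes can be summarized in a small, hypothetical type sketch; the field names below are illustrative and are not taken from the GENESYS specification.

from dataclasses import dataclass
from enum import Enum, auto
from typing import Optional

class CommMode(Enum):
    PERIODIC = auto()   # sent at instants fixed by an a priori, collision-free schedule
    SPORADIC = auto()   # triggered by a significant event
    STREAMING = auto()  # synchronized data stream processed on the fly

@dataclass
class LifMessageSpec:
    """Sketch of what a LIF-level message specification could record."""
    name: str
    mode: CommMode
    period_us: Optional[int] = None   # only meaningful for PERIODIC messages
    max_latency_us: Optional[int] = None

wheel_speed = LifMessageSpec("wheel_speed", CommMode.PERIODIC, period_us=1000, max_latency_us=100)
print(wheel_speed)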
3.2 System Structuring
GENESYS addresses embedded systems design at various levels of integration. At the lowest level, GENESYS considers the closed integration of IP cores interconnected by a Network-on-Chip (NoC). The other extreme is the ad-hoc integration of independently developed, autonomous systems. Common to all integration levels is the waistline architecture, which means that there is a small set of services that needs to be provided by every GENESYS-compliant platform. Additional services, which are beneficial for particular domains, can be built upon those mandatory services. Integration Levels. Due to the substantial differences of service characteristics at different integration levels, three different integration levels are introduced: the chip level, the device level, and the system level (see Figure 2). System Level. A system consists of devices, each of which is a spatially and logically self-contained apparatus (e.g., an ECU in a car or a mobile phone). Devices are connected with each other via their LIFs. This connection can be realized in an open or closed environment. In an open environment the composition of devices occurs dynamically during system operation. In a closed environment all devices are known a priori. Device Level. If a device has an internal structure that is relevant from the point of view of the architecture, then the device can be decomposed into a set of chips that interact via inter-chip LIFs. Chip Level. If a chip is a Multi-Processor System-on-a-Chip (MPSoC) that has an internal structure with relevance from the point of view of the architecture,
then the chip can be decomposed into a set of IP cores. The IP cores communicate with each other using inter-IP-core LIFs via NoCs. Not all three levels must be present in every instantiation of the GENESYS architecture. For instance, in a single-chip system the connection to the open world, e.g., the Internet, could already be at the chip integration level, removing the need to consider the device level. Waistline Structure. The GENESYS architecture defines generic platform services (e.g., operations to send and receive a message) as a foundation for the development of applications. There are two types of platform services at any given level of integration: core services and optional services. The role of the core and optional services is depicted in Figure 3. The core services are mandatory in every instantiation of the architecture. At any given integration level, the core services form a waist that should include exactly those features that are required by all the targeted application domains. For instance, a message transport service is a core service. Optional services build upon the core services and provide functionality to customize the architecture towards a particular application domain. However, these services need not be present in every instantiation of the architecture. For instance, an encryption service could be an optional service. Abstraction Levels. GENESYS distinguishes two abstraction levels for the design of embedded systems, namely a logical and a physical system view. The logical system view describes the services provided by the system and their relationships. In GENESYS terminology, a system consists of jobs that interact (and provide their services) via message-based interfaces. A job is regarded as an atomic unit, and the internals of a job are only relevant for the developer of the job but not for system integration. A set of jobs that form a logical unit is called a Distributed Application Subsystem (DAS). A DAS is a nearly independent subsystem that provides a part of the system's overall functionality. The logical structuring of a system into DASs and jobs is independent of the physical location of the jobs in the final embedded system. The physical system view describes the concrete implementation of the system. It consists of components, each being a physical entity that can host a job. The three integration levels possess specific types of components, namely IP cores at the chip level, chips at the device level and devices at the system level.
4 Commonalities and Contrasts
Both architectures, NoTA and GENESYS, are driven by very similar objectives: to tackle the ongoing digital convergence in modern embedded systems and the resultant challenges. This section elaborates on the similarities of both architectures, but also points out the major differences.
4.1 Commonalities
Service Orientation. Both architectures focus on the identification and specification of services. In terms of GENESYS, this is the identification of cohesive
subsystems, the DASs, and their further decomposition into jobs, each of which is providing a well-defined application service. In NoTA the system’s functionality is described by means of service nodes and application nodes, which are logically grouped to subsystems. Furthermore, system design is concerned with a reasonable allocation of those functional entities onto physical hardware entities. In NoTA the term device is used to denote a component, i.e., the entity that represents the integration of hardware and software to implement a dedicated part of the overall system’s functionality. Depending on the level of integration, the terms used in GENESYS are IP cores, chips, and devices. Component-Based Design. Modularity, reuse, and insensitivity to technological changes are major foci of both architectures. For this purpose, both architectures push the construction of components with explicitly defined interfaces. A component is a self-contained subsystem that can be independently developed and used as a building block in the design of a larger system. It is a replaceable part of a system that encapsulates the implementation. This encapsulation of functionality facilitates the evolvability of the system: For instance, due to the explicit interface specification, components can be exchanged without the need to alter any interacting components as long as the component’s behavior at its interfaces remains stable. Interface Specification. To enable component-based design and to achieve interoperability between components, interface specifications need to precisely describe the component’s behavior in the value domain and the temporal domain. This includes functional and non functional properties (e.g., dependability properties) as well as syntactic properties of the exchanged information. Furthermore, a semantic specification is required to decide on the interoperability of different component implementations. In GENESYS components are interconnected via LIFs. A precise LIF specification includes input and output assertions, specification of syntactic, temporal and dependability properties, semantic specification, and a periodic ground state (i.e., interface state at the restart instant of a component). The interface specification in NoTA, denoted Service Interface Specification (SiS), is split in two parts [4]: The control interface describes the input and output messages sent or received by a service as well as the externally observable states of the service. The data interface comprises a list of data types each service supports for communicating with other services as well as a description of non functional properties for the data transfer between services. Stable Platform Services. Both architectures define a stable set of platform services that can be utilized for the development of applications. The interface to these services is independent from the actual underlying implementation technology. This minimizes the migration effort of already implemented applications to new technologies. The platform services that are available in every instantiation of the architecture are denoted core services in GENESYS. They form the stable waist of the architecture and hide changes of the implementation technology
from the application. Among the core services are basic communication services, diagnostic services, security services, and resource management services. NoTA provides a stable API to ANs and SNs to access the services provided by the DIP (e.g., service registration and discovery). The H_IN is completely independent of the underlying communication technology; thus, the application would be unaffected by technology changes. Since particular applications (e.g., legacy software) may require additional functionality that exceeds the capabilities of the platform services, both architectures provide means for extensibility: The GENESYS waistline architecture serves exactly this purpose. While the set of core services remains stable, optional services can be used to extend the architecture and provide higher-level services to an application developer. Similarly, the basic services provided by NoTA can be extended using proxy layers on top of the H_IN, which enable the provision of higher-level services (e.g., the Khronos protocol stack) upon the NoTA communication infrastructure.
4.2 Differences
Architecture Extent and Focus. NoTA is a framework for the development and integration of applications, mainly targeted at the needs of the consumer applications industry. It provides well-specified interfaces and a standardized interconnect protocol in order to facilitate the integration of independently developed services. The GENESYS architecture goes beyond this core functionality. It defines architectural principles that each instantiation of the GENESYS architecture needs to adhere to, i.e., a guideline for how to develop a concrete instantiation of the architecture. In addition, it describes a concrete set of services: core services (e.g., communication services, diagnostic services) and optional services (e.g., security services). Guarantees for Architectural Properties. It is the intention of NoTA to provide an interface to the applications that abstracts completely from the implementation technology, i.e., from the transport layer that is accessed by the L_IN to connect different components. Although functional and non-functional properties of component interactions are included in the SiS (e.g., communication latency, energy efficiency), it is hard to guarantee those properties, since they very much depend on the actual implementation of the underlying platform. For instance, if Ethernet is used for data transport, a maximum transmission latency between components A and B cannot be guaranteed in case a (perhaps faulty) component C monopolizes the communication link to component B. GENESYS also abstracts from the concrete implementation technology. However, the architectural style enforces characteristics of the used platform that ensure that important properties of the architecture can be guaranteed. For instance, the periodic message transport service ensures the timely transport of messages with respect to guaranteed bandwidth, transmission latency, and latency jitter. Also, fault isolation between components is provided by encapsulation mechanisms of the GENESYS architecture [6]. Thus, only platforms that ensure those properties are suitable technologies for GENESYS.
Multiple Integration Levels. For the design and integration of embedded applications with NoTA a single level of abstraction is used: Multiple devices, each of which is hosting one or more subsystems, are integrated via the DIP. The internal physical structure of a device is only of minimal relevance for NoTA: The device is required to provide adequate resources for the implementation of the hosted subsystem(s) and if multiple subsystems are located on a single device, the only way subsystems can interact is via the service interfaces. So NoTA does not explicitly state at which level the integration of ANs and SNs takes place, i.e., a NoTA application can be implemented on a single chip, on multiple chips, or on a set of physical devices. GENESYS rigorously distinguishes multiple integration levels. At each integration level the only way components can interact is via the LIFs of the component. At the system level devices are integrated, which are interconnected by an inter-device LIF. If the internal structure of the devices is of relevance for the system designer, a device is decomposed into a set of chips, interacting via inter-chip LIFs. If relevant, chips are further decomposed into a set of IP cores interacting via inter IP core LIFs. The interface specification is identical at each integration level. However, the functional and non-functional properties of the LIFs may differ at the individual integration levels depending on the deployed communication infrastructure.
5 Instantiation of NoTA on GENESYS
The similarities between NoTA and GENESYS as discovered in the previous section raise the question whether it might be possible to combine the benefits of both architectures and instantiate NoTA as an optional service on top of the GENESYS architecture. It is the purpose of this section to provide a theoretical evaluation of this matter.
5.1 Compatibility in System Structure
The first question to be answered is whether the way systems are structured in both architectures matches. In NoTA, applications are described in terms of service nodes (SNs) and application nodes (ANs), which interact via strictly defined interfaces. The counterparts hereto are jobs in the GENESYS terminology. In NoTA, subsystems are used to group SNs and ANs that belong together into larger functional units. Subsystems in the terminology of NoTA are not only logical constructs, but also represent a physical entity that provides adequate resources to implement the hosted services. Hence, subsystems also comprise a physical interface to the physical interconnect. If interacting subsystems are located on the same device, this interface might connect the subsystem to an intra-device interconnect, whereas subsystems located on different devices communicate via an inter-device interconnect. The GENESYS architecture provides the concept of DASs to perform a logical integration of multiple jobs into a single logical entity. A direct equivalent to the concept of subsystems is not provided, since the physical structuring of a system depends on the integration level
in GENESYS: At the chip level, the GENESYS architectural style enforces a strict one-to-one mapping of jobs to IP cores. Thus, multiple jobs that form a logical unit and require spatial proximity for an efficient implementation due to their interaction patterns are implemented as IP cores on the same chip. This corresponds to the representation of NoTA subsystems. In order to implement the physical interface of a NoTA subsystem, a dedicated gateway IP core can be used in the GENESYS architecture to interconnect each chip to a chip-external network. The next integration level in GENESYS, the device level, provides the encapsulation of several chips into one device. This is analogous to the integration of multiple NoTA subsystems into a single NoTA device. The physical interconnection of NoTA devices to the overall system is represented by the system level in GENESYS.
5.2 NoTA on Top of GENESYS
The GENESYS architecture is devised by experts from industry, research institutes, and academia from many different domains to ensure that important requirements of all targeted domains are covered in the resulting architecture template. That is, the GENESYS architecture comprises know-how and experience from different fields of application to facilitate its deployment across domain boundaries. In particular with respect to temporal guarantees for communication and encapsulation of applications, the GENESYS architecture could improve the existing platforms for consumer applications, since those properties are key requirements for component-oriented design, component integration, and reuse. To broaden the applicability of the GENESYS architecture for consumer applications, the provision of the NoTA application programming interface as an optional service on top of the GENESYS core services would be highly advantageous. This could ease the change for consumer applications from their specialized hardware platform towards GENESYS and would increase the acceptance of the GENESYS architecture in this high-volume domain. To facilitate this instantiation, mainly the DIP of NoTA needs to be realized as an optional service for GENESYS. Firstly, the DIP provides a stable interface to the applications; secondly, parts of the lower layer of the DIP are used to adapt the interconnect implementation to the actual transport protocol. The examples described in [9,10] use such adapters to stack the DIP on top of the MIPI UniPro protocol. Likewise, it is reasonable to build the DIP as an optional service on top of the GENESYS core services and to develop an adapter that connects the NoTA interconnect to the core communication services of GENESYS. The GENESYS architecture seems to be well-suited as an implementation platform for NoTA: The DIP provides message-based communication and streaming communication. Both types of communication are also natively provided by GENESYS (via periodic or sporadic message-based communication as well as real-time streaming). The example outlined in Figure 3 depicts a GENESYS system at the chip level using, e.g., the TTSoC architecture [11] as the platform to provide the core services at this level. The main extensions of NoTA compared to the GENESYS core services are service registration/discovery and
Fig. 3. Placing NoTA as Optional Service in the GENESYS Waistline Architecture
NoTA-specific resource management. In NoTA, the so-called resource management service node (RMSN) is used to implement this functionality. The RMSN is realized as a dedicated node in one subsystem. The same strategy can be followed when implementing NoTA as an optional service for GENESYS, i.e., to use a dedicated job for implementing the functionality of the RMSN.
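A hedged sketch of the layering proposed here: a DIP-style send call forwarded to a GENESYS-style core messaging service through an adapter. All class and method names below are hypothetical; they only illustrate where the transport-specific lower layer of the L_IN would be replaced, and they are not part of either architecture's actual API.

class GenesysCorePort:
    """Stand-in for a GENESYS core communication service (sporadic messages)."""
    def send_sporadic(self, dst: str, payload: bytes) -> None:
        print(f"core service: sporadic message to {dst}, {len(payload)} bytes")

class DipOnGenesysAdapter:
    """Replaces the transport-specific lower layer of the L_IN: DIP requests
    are mapped onto the core message-transport service instead of, e.g., Ethernet."""
    def __init__(self, port: GenesysCorePort, resolver):
        self._port = port
        self._resolve = resolver   # service ID -> interconnect address (H_IN role)

    def dip_send(self, service_id: str, payload: bytes) -> None:
        address = self._resolve(service_id)
        self._port.send_sporadic(address, payload)

adapter = DipOnGenesysAdapter(GenesysCorePort(), resolver=lambda sid: f"core:{sid}")
adapter.dip_send("audio.decoder", b"\x01\x02")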
6 Conclusion
The development of NoTA and GENESYS is inspired by very similar challenges. For solving those challenges, many commonalities between both architectures can be found. The main similarity is the component orientation and that both architectures take the achievement of composability and interoperability as an objective with top priority. As a consequence, both architectures are based on service specifications which include rigorous specifications of interfaces. In contrast to GENESYS, NoTA is tailored to applications of the consumer domain. Therefore, the NoTA framework is not intended to provide services for a broad range of applications across different domains. Hence, the design of NoTA focuses on the provision of a minimal set of services that is required to facilitate the integration of applications provided by various parties. Due to the similarities of both architectures and their non-contradicting architectural concepts, it seems possible to combine both concepts by realizing the NoTA interconnect as an optional service on top of the GENESYS core services. This could beneficially influence both architectures: GENESYS, by taking advantage of a mature architecture from the consumer applications domain, which may strengthen the position of GENESYS in this domain; NoTA, by using a platform that is designed to provide fault tolerance and resilience against transients [6] in order to improve the reliability of products in the consumer domain. Given that a cost-efficient instantiation of the GENESYS architecture is available, the use of GENESYS as a platform for the implementation of NoTA-based consumer applications could result in a substantial qualitative improvement of products in this domain. First results will be shown by the prototype instantiations of the GENESYS architecture developed in the scope of the project as well as by upcoming research projects.
Acknowledgments. This work has been supported by the European FP7-ICT research project GENESYS under project number FP7/213322 and the research project INDEXYS under ARTEMIS JU grant agreement number 100021, respectively BMVIT project number 820404. The discussions within the research group at the TU Vienna and with partners of the GENESYS project, particularly with Kari Tiensyrjä, Eila Ovaska, and Heikki Waris, are warmly acknowledged.
References
1. ARTEMIS: Strategic research agenda - 1st edn. (2006), http://www.artemis-sra.eu/downloads/SRA_MARS_2006.pdf
2. Kronlöf, K., Kontinen, S., Oliver, I., Eriksson, T.: A method for mobile terminal platform architecture development. Advances in Design and Specification Languages for Embedded Systems, 285–300 (2007)
3. Suoranta, R.: New directions in mobile device architectures. In: 9th EUROMICRO Conf. on Digital System Design: Architectures, Methods and Tools, pp. 17–26 (2006)
4. Lilius, J., Lindqvist, J., Porres, I., Truscan, D., Eriksson, T., Latva-Aho, A., Rakkola, J.: Testable specifications of NoTA-based modular embedded systems. TUCS Techn. Report No 841, Turku Centre for Computer Science (September 2007)
5. Lappeteläinen, A., Tuupola, J.M., Palin, A., Eriksson, T.: Networked systems, services and information. The ultimate digital convergence. In: First International Network on Terminal Architecture Conference (2008)
6. Obermaisser, R., El Salloum, C., Huber, B., Kopetz, H.: Fundamental design principles for embedded systems: The architectural style of the cross-domain architecture GENESYS. In: Proc. of IEEE Int. Symposium on Object/Component/Service-Oriented Real-Time Distributed Computing (March 2009)
7. Kopetz, H., Obermaisser, R., Peti, P., Suri, N.: From a federated to an integrated architecture for dependable embedded real-time systems. Technical Report 22, Technische Universität Wien, Institut für Technische Informatik (2004)
8. Kopetz, H., Suri, N.: Compositional design of RT systems: A conceptual basis for specification of linking interfaces. In: Proc. of IEEE Int. Symposium on Object-Oriented Real-Time Distributed Computing, pp. 51–60. IEEE, Los Alamitos (2003)
9. Lappeteläinen, A.: Extending NoTA for distributed products. Slides of the presentation at First International Network on Terminal Architecture Conference (2008)
10. Pourbigharaz, F., Aleksic, M.: NoTA imaging solution. Slides of the presentation at First International Network on Terminal Architecture Conference (2008)
11. Kopetz, H., El Salloum, C., Huber, B., Obermaisser, R., Paukovits, C.: Composability in the Time-Triggered System-on-Chip architecture. In: 21st Annual IEEE International SoC Conference (September 2008)
Introduction to Instruction-Set Customization Carlo Galuzzi TU Delft, The Netherlands
Over the last years, we have witnessed the increased use of customizable processors. These processors can be tuned and/or extended to meet specific requirements and to achieve a balance between performance, power, hardware resources and time-to-market. The problems involved are numerous and they tend to have high computational complexity. One of the main approaches is the identification of the most suitable instructions to include in the instruction-set. The customization of a given instruction-set with specialized application-specific instructions has become a common technique used to speed up the execution of applications. As a result, in this session we first present two papers focusing on methods to design custom instructions and then a paper which gives an overview of contemporary architectures that offer dynamic instruction-set support. Kevin Martin et al. present a constraint-driven method for fast identification of computational patterns which form a base for application-specific instruction selection and processor extension generation. Kingshuk Karuri et al. present a generic and easily adaptable flow for application-oriented instruction-set extension design that supports both of the prevalent ASIP design paradigms - complete ISA design from scratch through an extensive design-space exploration, or limited ISA adaptation for a pre-designed and pre-verified base-processor core. Huynh Phung Huynh and Tulika Mitra provide a detailed survey of the contemporary architectures that offer dynamic instruction-set support and discuss compiler and/or runtime techniques to exploit such architectures.
Constraint-Driven Identification of Application Specific Instructions in the DURASE System
Kevin Martin1, Christophe Wolinski1,2, Krzysztof Kuchcinski3, Antoine Floch1, and François Charot1
1 INRIA, Centre Rennes - Bretagne Atlantique, France
2 University of Rennes I, Irisa, France
3 Dept. of Computer Science, Lund University, Sweden
Abstract. This paper presents a new constraint-driven method for fast identification of computational patterns that is part of the DURASE system (Generic Environment for Design and Utilization of Reconfigurable Application-Specific Processors Extensions). The patterns identified by our system form a base for application-specific instruction selection and processor extension generation. Our method identifies all computational patterns directly from an application graph while satisfying all architectural and technological constraints imposed by target processors and FPGA devices. The considered constraints include the number of inputs and outputs, the number of operators, and the delay of the pattern critical path. Therefore the identified patterns can be well tailored to target processors. Our approach heavily uses constraint programming methods, which makes it possible to mix graph isomorphism constraints with other constraints in one formal environment. We have extensively evaluated our algorithm on MediaBench and MiBench benchmarks with tough architectural and technological constraints. The obtained patterns provide good coverage of application graphs while limiting the number of operators and fulfilling architectural and technological constraints.
1 Introduction
Our DURASE system enables automatic synthesis of application-specific processor extensions that speed up application execution. The extensions are tightly connected to a target processor and used through newly created instructions (see Figure 2 for an example of the NIOS II processor and its extension). The design flow adopted in the DURASE system is presented in Figure 1. The input to our system is application code written in C, a target processor instruction set, and an architecture model. The output is a processor extension and application-specific instructions for accessing this extension. The processor extension is built using a merged pattern implementing all selected computational patterns. Our system also generates an interface to the processor and the transformed application source code, including application-specific instructions. In this paper, we discuss the identification of computational patterns in detail, but our design process also involves the selection of specific patterns that speed up application execution. The pattern identification and selection are executed in two consecutive steps. In the first step, we explore typical computational patterns and identify the most useful ones for a given application. The identified computational patterns are then
Fig. 1. Generic hardware and software extension set generation flow
Fig. 2. The ASIP “NIOS II” processor with its extension
used in the mapping and scheduling step, where a subset of patterns is selected for implementation. The developed DURASE system uses advanced technologies, such as algorithms for graph matching and graph merging (recently developed by our laboratories [1]), together with constraint programming methods. Our system also uses the generic compilation platform GECOS, recently extended with polyhedral transformations [2]. The internal representation of the DURASE system is the Hierarchical Conditional Dependency Graph (HCDG) [3,4], capturing both the application control-flow and data-flow. It supports formal graph transformations and optimizations. In the HCDG, graph nodes are guarded by Boolean conditions and polyhedra depending on loop indices and parameters. After data-flow analysis [5], each read statement in a graph has a set of possible source contributions relying on its execution contexts and polyhedral conditions. Finally, loops are unrolled totally or partially to generate an HCDG, which is the input to the pattern generator.
In this paper, we consider the architecture model of an ASIP processor with an extended instruction set. Extended instructions implement identified and selected computational patterns and can be executed sequentially with the ASIP processor instructions [6]. Our generic simplified architecture is depicted in Figure 2. It is composed of one functionally reconfigurable cell implementing a set of computational patterns (selected by the DURASE system). This cell is connected directly to the processor's data-path. Before the cell synthesis, the patterns are merged by our merging procedure [1]. The cell also contains registers for cases when generated patterns have more than two inputs and one output (the case for the NIOS II). The number of registers and the structure of interconnections are application-dependent. The paper is organized as follows. In Section 2 the related work on pattern identification is discussed. Section 3 briefly introduces the constraint programming used in our approach. Pattern identification is discussed in Section 4. Section 5 presents experimental results. Finally, conclusions are presented in Section 6.
2 Related Work
Previous research on pattern extraction, such as [7,8,9], is characterized by combined pattern matching and pattern generation for ASIPs. In [7], this is achieved with clustering that uses information on the frequency of node type successions. The authors of [8] and [9] use incremental clustering with different heuristic approaches, with the common aim of identifying frequently occurring patterns. Another method is presented in [10], where the pattern searching algorithm identifies a big pattern using convexity and input/output constraints. Some improvements of this method were proposed in [11]. Pattern searching under input/output constraints is also used in [12]. The basic algorithm starts from each exit node of the basic block and constructs a sub-graph by recursively trying to include parent nodes. The assembled sub-graph is considered as a potential new instruction. The quality of this instruction is then determined by their system. In [13], a set of maximal Multiple Input Single Output sub-graphs (MaxMISO) is identified first. Each MaxMISO sub-graph is not contained in any other MISO sub-graph. In the next step, a candidate set composed of two-input/one-output MISOs found inside the MaxMISO set is selected. Finally, using the selected candidates, the application graph is partitioned by a nearly-exhaustive search method using the branch-and-bound algorithm. In [14], a complete processor customization flow was presented where patterns are clustered, one after the other, making some local decisions. In the method presented in [15], the patterns are incrementally assembled by adding the neighbor nodes to existing matches corresponding to non-isomorphic patterns formed in the previous iteration. In our previous work, we used a similar incremental method [16], but the sub-graph isomorphism constraint was used for evaluating the number of possible matches. This method did not provide the possibility to easily control the architectural constraints of the created patterns. Our current approach is radically different and overcomes the drawbacks of the previous approaches. We do not generate patterns by assembling them incrementally and we are able to consider technological constraints during pattern
identification. We instead impose constraints defining valid patterns and combine them with architectural and technological constraints. Then the constraint solving method identifies all patterns fulfilling these constraints. The final selection of patterns identified with this method is controlled by a smart filtering process. The smart filtering process uses information derived by a special method that is based on sub-graph isomorphism constraints and constraint programming.
3 Constraint Programming
In our work we use constraint satisfaction methods implemented in the constraint programming environment JaCoP [17]. A constraint satisfaction problem is defined as a set of variables (V), a set of finite domains (FD) of these variables (D), and a set of constraints (C). Finite domain variables (FDVs) are defined by their domains, i.e., the values that are possible for them. A finite domain is usually expressed using integers, for example x :: 1..7. A constraint c(x1, x2, ..., xn) ∈ C among variables of V is a subset of D1 × D2 × ... × Dn that restricts which combinations of values the variables can simultaneously take. Equations, inequalities and even programs can define a constraint. In this paper, we use the graph matching constraint GraphMatch. This constraint defines conditions for (sub-)graph isomorphism between target and pattern graphs (the pattern graph can be defined as a set of separate sub-graphs). It has been implemented using a pruning algorithm developed specially for this purpose [16] and used previously in our UPaK system [6]. A solution to a CSP is an assignment of a value from the variable's domain to every variable, in such a way that all constraints are satisfied. The specific problem to be modeled will determine whether we need just one solution, all solutions, or an optimal solution given some cost function defined in terms of the variables. The solver is built using the constraints' own consistency methods and systematic search procedures. Consistency methods try to remove inconsistent values from the domains in order to reach a set of pruned domains such that their combinations are valid solutions. Each time a value is removed from an FD, all the constraints that contain that variable are revised. Most consistency techniques are not complete and the solver needs to explore the remaining domains for a solution using search. Solutions to a CSP are usually found by systematically assigning values from the variables' domains to the variables. It is implemented as a depth-first search. The consistency method is called as soon as the domains of the variables for a given constraint are pruned. If a partial solution violates any of the constraints, backtracking will take place, reducing the size of the remaining search space.
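A minimal illustration of the ideas above (finite domains, constraint checking, depth-first search with backtracking). This toy solver is not JaCoP and performs no pruning beyond rejecting partial assignments that already violate a constraint; the example variables and constraints are invented for the sketch.

def solve(domains, constraints, assignment=None):
    """domains: dict var -> iterable of values; constraints: callables that
    return False only when they can already be decided to be violated."""
    assignment = dict(assignment or {})
    if len(assignment) == len(domains):
        return assignment
    var = next(v for v in domains if v not in assignment)   # naive variable order
    for value in domains[var]:
        assignment[var] = value
        if all(c(assignment) for c in constraints):
            result = solve(domains, constraints, assignment)
            if result is not None:
                return result
        del assignment[var]      # backtrack and try the next value
    return None

# x in 1..7, y in 1..7, with x + y = 9 and x < y.
doms = {"x": range(1, 8), "y": range(1, 8)}
cons = [
    lambda a: "x" not in a or "y" not in a or a["x"] + a["y"] == 9,
    lambda a: "x" not in a or "y" not in a or a["x"] < a["y"],
]
print(solve(doms, cons))   # {'x': 2, 'y': 7}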
4 Pattern Generation
Pattern generation is defined, in our approach, for an acyclic application graph G = (N, E), where N is a set of nodes and E is a set of edges. These graphs correspond to our HCDG representations. A pattern is a subgraph P = (N_p, E_p) of graph G where
DIPS ← ∅
for each n_s ∈ N
    TPS ← ∅
    CPS ← FindAllPatterns(G, n_s)
    for each p ∈ CPS
        if ∀ pattern ∈ TPS : p ≢ pattern
            TPS ← TPS ∪ {p}, NMP_p ← |FindAllMatches(G, p)|
    NMP_ns ← |FindAllMatches(G, n_s)|
    for each p ∈ TPS
        if coef · NMP_ns ≤ NMP_p
            DIPS ← DIPS ∪ {p}
return DIPS
Fig. 3. Pattern identification process
N_p ⊆ N and E_p ⊆ E. Pattern P is also sub-graph isomorphic to graph G. This sub-graph isomorphism is found, in our system, by defining a set of constraints and finding solutions to them. To be able to define these constraints we introduce a number of definitions. A set of successor nodes of node n is defined by succ(n) = {n' : (n, n') ∈ E}. Similarly, we define predecessors of node n as pred(n) = {n' : (n', n) ∈ E}. All successors of node n are defined recursively as allsucc(n) = {n' ∪ allsucc(n') : (n, n') ∈ E}. A path between two nodes n_i and n_j in graph G is defined as a string of nodes path(G, n_i, n_j) = (n_i, n_{i+1}, ..., n_j), where each pair of consecutive nodes n_k, n_{k+1} in this string satisfies either (n_k, n_{k+1}) ∈ E or (n_{k+1}, n_k) ∈ E. The pattern generation process is depicted in Figure 3. In the first step of this algorithm, all computational patterns formed around each seed node n_s ∈ N satisfying all architectural and technological constraints are identified. It is also possible to identify patterns for representative seed nodes using different heuristics, but this is not considered in this paper. This is carried out by the FindAllPatterns(G, n_s) function, which is implemented using constraint programming methods. The constraints for finding all valid computational patterns are defined in Subsection 4.1. Subsection 4.2 presents architectural constraints on input/output requirements and Subsection 4.3 introduces constraints that control the delay of the pattern critical path. In the next step, the temporary pattern set (TPS) is expanded by non-isomorphic patterns coming from the current pattern set (CPS). Finally, only patterns whose number of matches in the application graph is high enough compared to the number of matches obtained for single-node patterns composed of their seed nodes are added to the definitively identified pattern set (DIPS). The number of matches of a given pattern in the application graph is also obtained using the constraint programming method, implemented by the FindAllMatches function. We use for this purpose our special graph matching constraints developed for the UPaK system [6].
4.1 Pattern Constraints
The computational pattern created around seed node n_s in application graph G is an acyclic graph P_ns = (N_p, E_p). An example pattern graph is depicted in Figure 4. The constraint program builds patterns by imposing constraints that define valid patterns.
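Before turning to the individual constraints, the identification loop of Figure 3 can be paraphrased in Python for readability. FindAllPatterns and FindAllMatches stand for the constraint-programming procedures described in the paper and are left abstract here; the isomorphism test, the data representation, and the coefficient value are placeholders assumed for this sketch.

def identify_patterns(graph, seeds, find_all_patterns, find_all_matches,
                      isomorphic, coef=0.5):
    """Sketch of the pattern identification process of Figure 3.
    find_all_patterns(G, ns) -> candidate patterns around seed ns (CPS),
    find_all_matches(G, p)   -> all matches of pattern p (or node ns) in G."""
    dips = []                                   # definitively identified pattern set
    for ns in seeds:
        tps, matches = [], {}                   # temporary pattern set + match counts
        for p in find_all_patterns(graph, ns):
            if not any(isomorphic(p, q) for q in tps):
                tps.append(p)
                matches[id(p)] = len(find_all_matches(graph, p))
        nmp_seed = len(find_all_matches(graph, ns))
        for p in tps:                           # keep patterns that match often enough
            if coef * nmp_seed <= matches[id(p)]:
                dips.append(p)
    return dips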
Only solutions that fulfill all these constraints are accepted. Constraints (1)-(5) define valid computational patterns. Constraint (1) states that each node n ∈ N_p, different from the seed node n_s, must form a path in the undirected pattern graph P_ns starting at this node and leading to the seed node. We consider here the non-directed edges of the pattern graph, as defined before.

∀ n ∈ N_p, n ≠ n_s : ∃ path(P_ns, n, n_s)        (1)
Furthermore, each node n ∈ N is modeled by one finite domain variable n_sel. The value of this variable is 1 if node n ∈ N_p and 0 otherwise. n_sel = 1 for the seed node n_s, because the seed node always belongs to the pattern created around it. The constraints for all nodes in a pattern are defined in formulas (2)-(5).

∀ n ∈ N − (allsucc(n_s) ∪ {n_s}) : n_sel = 1 ⇒ ∑_{m ∈ succ(n)} m_sel ≥ 1        (2)

∀ n ∈ N − (allsucc(n_s) ∪ {n_s}) : ∑_{m ∈ succ(n)} m_sel = 0 ⇒ n_sel = 0        (3)
Formulas (2) and (3) express two constraints between each node n ∈ N − (allsucc(n_s) ∪ {n_s}) and its direct successors; for example, n ∈ {N0, N1, N2, N3} for the pattern graph depicted in Figure 4. Constraint (2) requires the selection of at least one node among the successors of node n if this node is selected to be part of the pattern. Constraint (3), on the other hand, prohibits node n from belonging to the pattern if none of its successors belongs to the pattern.

∀ n ∈ allsucc(n_s) : n_sel = 1 ⇒ ∑_{m ∈ pred(n) ∩ (allsucc(n_s) ∪ {n_s})} m_sel ≥ 1        (4)

∀ n ∈ allsucc(n_s) : ∑_{m ∈ pred(n) ∩ (allsucc(n_s) ∪ {n_s})} m_sel = 0 ⇒ n_sel = 0        (5)
Constraints (4) and (5) define relations between each node n ∈ allsucc(n_s) and its direct predecessors; for example, n ∈ {N4, N5, N6, N7, N8} for the pattern graph depicted in Figure 4.
Fig. 4. Example of the pattern
Constraint (4) requires the selection of at least one node from the predecessor set of n, reduced to the nodes in allsucc(n_s) ∪ {n_s}, if node n is selected. Constraint (5) prohibits the nodes from the predecessor set of node n, reduced to the nodes in allsucc(n_s) ∪ {n_s}, from belonging to the pattern if node n does not belong to the pattern.

4.2 Interface Constraints

Computational patterns identified with the constraints defined in subsection 4.1 can have any number of inputs and outputs. To fulfill architectural interface requirements we need to control the number of pattern inputs and outputs, and to achieve this we impose additional constraints. To be able to define these constraints, each node n ∈ N has two associated constants, indegree_n and outdegree_n, defining its number of input edges and output edges respectively. Some of these edges can become pattern external inputs or outputs. We also assume that identified patterns cannot exceed the number of inputs PatternInputs and the number of outputs PatternOutputs.

∀ n ∈ N : n_in = indegree_n − ∑_{m ∈ pred(n)} m_sel        (6)

∑_{n ∈ N} (n_sel · n_in) ≤ PatternInputs        (7)

∀ n ∈ N : n_out = outdegree_n − ∑_{m ∈ succ(n)} m_sel        (8)

∑_{n ∈ N} (n_sel · n_out) ≤ PatternOutputs        (9)
An edge becomes a pattern input (output) if node n ∈ N_p has an external input (output) edge or if the source (destination) node of this edge does not belong to the pattern. Constraints (7) and (9) limit the number of inputs and outputs of the created pattern over the nodes belonging to the pattern. For fan-out nodes n ∈ N_p with several outgoing edges in the graph, i.e., nodes that send the same data to several other nodes, constraint (8) is replaced by constraint (10).

∀ n ∈ N : ∑_{m ∈ succ(n)} m_sel < outdegree_n ⇔ n_out = 1        (10)
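As an illustration, the sketch below evaluates the interface requirements of constraints (6)-(10) directly for a given node selection instead of through finite domain variables. It assumes every node produces a single value, so the fan-out rule of constraint (10) is applied uniformly; external edges of the application graph (primary inputs/outputs not present in E) would have to be added on top.

# Direct evaluation of the interface constraints (6)-(10) for a candidate
# node selection; illustrative only (the paper models them as FD constraints).

def io_counts(E, selected):
    """Return (number of pattern inputs, number of pattern outputs)."""
    inputs = outputs = 0
    for n in selected:
        preds = [m for (m, a) in E if a == n]      # keep parallel edges
        succs = [m for (a, m) in E if a == n]
        # n_in: incoming edges whose source is outside the pattern (constraint (6)).
        inputs += sum(1 for m in preds if m not in selected)
        # n_out: one output if any consumer lies outside the pattern,
        # i.e., the fan-out rule of constraint (10).
        if any(m not in selected for m in succs):
            outputs += 1
    return inputs, outputs

def satisfies_interface(E, selected, pattern_inputs, pattern_outputs):
    nin, nout = io_counts(E, selected)
    return nin <= pattern_inputs and nout <= pattern_outputs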
4.3 Critical Path Constraints

The pattern critical path determines the clock cycle requirements for the entire architecture and therefore needs to be controlled. To be able to control the delay of the pattern critical path, two new finite domain variables, start_n and delay_n, are introduced for each node n ∈ N. These variables define the starting time of the node and the delay of the node. For nodes n ∉ N_p, delay_n = 0 (for example, each node n ∈ {N0, N1, N2, N5, N6, N7, N8} from Figure 4 has delay_n = 0). Data dependencies for each edge (n, m) ∈ E are defined by constraint (11).

∀ (n, m) ∈ E : start_n + (delay_n · n_sel) ≤ start_m        (11)
Constraint (12) defines the completion time EndTime_n for each node n ∈ N.

∀ n ∈ N : (start_n + delay_n) · n_sel = EndTime_n        (12)
EndTime_n is different from zero only for nodes n ∈ N_p. Constraint (13) imposes the maximal delay for the entire pattern (assuming that PatternCriticalPath specifies the maximal delay of the critical path).

∀ n ∈ N : EndTime_n ≤ PatternCriticalPath        (13)
In some cases we would like to control not only the delay of the critical path but also the size of the pattern. Constraint (14) imposes the maximal number of nodes in the created pattern. It is possible to control the size even more precisely using a weighted sum, applying different weights to different node types; in this way area requirements can be taken into account.

∑_{n ∈ N} n_sel ≤ PatternNumberOfNodes        (14)
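The timing and size conditions of constraints (11)-(14) can likewise be checked directly for a candidate selection with a single longest-path pass over the DAG, as in the hedged sketch below; node delays of unselected nodes are taken as zero, exactly as in constraint (11), and the interfaces are ours.

# Direct check of the critical-path and size constraints (11)-(14) for a
# candidate pattern; a longest-path pass replaces the FD formulation.

def pattern_critical_path(topo_nodes, E, selected, delay):
    start = {n: 0 for n in topo_nodes}
    end = {}
    for n in topo_nodes:                           # topological order of G
        d = delay[n] if n in selected else 0       # delay_n * n_sel
        end[n] = start[n] + d                      # completion time of n
        for (a, m) in E:
            if a == n:
                start[m] = max(start[m], end[n])   # data dependency, constraint (11)
    # only selected nodes contribute, mirroring EndTime_n in (12)-(13)
    return max((end[n] for n in selected), default=0)

def pattern_ok(topo_nodes, E, selected, delay, max_delay, max_nodes):
    return (pattern_critical_path(topo_nodes, E, selected, delay) <= max_delay  # (13)
            and len(selected) <= max_nodes)                                     # (14)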
Figure 5 shows an example of the application graph corresponding to the partially unrolled FIR application covered by the automatically generated patterns.
Fig. 5. Example of a graph covering
5 Experimental Results

Our new computational pattern identification method has been evaluated using applications from the MediaBench and MiBench benchmark sets, written in C. We have used two scenarios and compared them to the original results of our previous UPaK system (the part of the table with the heading UPaK) [6]. In the first scenario, our method has been applied to the set of computational patterns identified by UPaK to find patterns fulfilling the architectural constraints; UPaK uses a greedy method to generate the initial set of patterns. In the second scenario, our method has been applied directly to the application graph. All results are presented in Table 1. They have been obtained under the following constraints: PatternNumberOfNodes ≤ 10 and PatternCriticalPath ≤ 15 ns. This critical path corresponds to three processor cycles of the Nios2Fast processor running at 200 MHz on an Altera Stratix II FPGA. The coefficient coef = 0.5, since it has been verified
Table 1. Results obtained for the applications from MediaBench and MiBench benchmark sets
[Table 1 reports, for each benchmark (JPEG Write BMP Header, JPEG Smooth Downsample, JPEG IDCT, EPIC Collapse, BLOWFISH encrypt, SHA transform, MESA invert matrix, FIR unrolled, FFT) and on average: the number of application graph nodes; the results of the original UPaK flow; and, for scenario 1 (UPaK + new approach) and scenario 2 (new approach applied directly), under both 2-in/1-out and 4-in/2-out interface constraints, the number of identified and selected patterns, the run time (s) and the application graph coverage [%].]
Fig. 6. Number of generated patterns and application graph coverage as a function of the coefficient coef for the smart filtering method
experimentally that, with this coefficient, the number of generated patterns decreases while the graph coverage stays almost the same, as depicted in Figure 6. The sign "-" in the table indicates that only one-node patterns have been identified and selected; for some applications two-input patterns either do not exist or are very rare. The experiments were carried out on a PC with a T7600 processor running at 2.33 GHz. It can be noticed that our new method generates excellent patterns compared to the results obtained by the UPaK system, which constrains only the number of nodes in the generated patterns. The number of selected patterns is small and the graph coverage is high. When four-input, two-output patterns were allowed, the average graph coverage was even better than the one obtained with our state-of-the-art UPaK system, which does not support the generation of patterns under architectural and technological constraints.
6 Conclusion

In this paper, we have presented the part of the DURASE system corresponding to the pattern generation process. We have shown that our approach is radically different and
that it overcomes the drawbacks of previous approaches. We do not generate patterns by assembling them incrementally; instead, we impose constraints defining valid patterns and combine them with architectural and technological constraints. The constraint solving method then identifies all patterns fulfilling these criteria. In our approach, the smart filtering process is responsible for the final pattern selection. This process uses information derived by a special method based on sub-graph isomorphism constraints and constraint programming. The obtained results show the excellent quality of the identified patterns.
References
1. Wolinski, C., Kuchcinski, K., Raffin, E.: Architecture-driven synthesis of reconfigurable cells (March 2009) (submitted for publication)
2. GeCoS: Generic compiler suite, http://gecos.gforge.inria.fr/
3. Kountouris, A.A., Wolinski, C.: Efficient scheduling of conditional behaviors for high-level synthesis. ACM Trans. Des. Autom. Electron. Syst. 7(3), 380–412 (2002)
4. Kuchcinski, K., Wolinski, C.: Global approach to assignment and scheduling of complex behaviors based on HCDG and constraint programming. Journal of Systems Architecture 49(12-15), 489–503 (2003)
5. Feautrier, P.: Dataflow analysis of array and scalar references. International Journal of Parallel Programming 20 (1991)
6. Wolinski, C., Kuchcinski, K.: Automatic selection of application-specific reconfigurable processor extensions. In: Proc. Design, Automation and Test in Europe, Munich, Germany, March 10-14 (2008)
7. Kastner, R., Kaplan, A., Memik, S.O., Bozorgzadeh, E.: Instruction generation for hybrid reconfigurable systems. ACM Trans. Des. Autom. Electron. Syst. 7(4) (2002)
8. Arnold, M., Corporaal, H.: Designing domain specific processors. In: Proceedings of the 9th International Workshop on Hardware/Software CoDesign, Copenhagen, April 2001, pp. 61–66 (2001)
9. Clark, N., Zhong, H., Mahlke, S.: Processor acceleration through automated instruction set customization. In: 36th Annual International Symposium on Microarchitecture (2003)
10. Atasu, K., Pozzi, L., Ienne, P.: Automatic application-specific instruction-set extensions under microarchitectural constraints. In: 40th Design Automation Conference (DAC) (2003)
11. Biswas, P., Banerjee, S., Dutt, N., Pozzi, L., Ienne, P.: ISEGEN: Generation of high-quality instruction set extensions by iterative improvement. In: 42nd Design Automation Conference (DAC) (2005)
12. Peymandoust, A., Pozzi, L., Ienne, P., De Micheli, G.: Automatic instruction set extension and utilization for embedded processors. In: ASAP (2003)
13. Guo, Y.: Mapping applications to a coarse-grained reconfigurable architecture. PhD Thesis, University of Twente, The Netherlands (September 8, 2006)
14. Leupers, R., Karuri, K., Kraemer, S., Pandey, M.: A design flow for configurable embedded processors based on optimized instruction set extension synthesis. In: DATE (2006)
15. Guo, Y., Smit, G., Broersma, H., Heysters, P.: A graph covering algorithm for a coarse grain reconfigurable system. In: Languages, Compilers, and Tools for Embedded Systems (LCTES 2003), San Diego, California, June 11-13 (2003)
16. Wolinski, C., Kuchcinski, K.: Identification of application specific instructions based on sub-graph isomorphism constraints. In: IEEE 18th Intl. Conference on Application-specific Systems, Architectures and Processors, Montréal, Canada, July 8-11 (2007)
17. Kuchcinski, K.: Constraints-driven scheduling and resource assignment. ACM Transactions on Design Automation of Electronic Systems (TODAES) 8(3), 355–383 (2003)
A Generic Design Flow for Application Specific Processor Customization through Instruction-Set Extensions (ISEs) Kingshuk Karuri, Rainer Leupers, Gerd Ascheid, and Heinrich Meyr Institute for Integrated Signal Processing Systems, RWTH Aachen University, Aachen, Germany {karuri,leupers,ascheid,meyr}@iss.rwth-aachen.de
Abstract. Instruction-Set Extensions (ISEs) have gained prominence in the past few years as a useful method for tailoring the ISAs (Instruction-Set Architectures) of ASIPs (Application Specific Instruction-Set Processors) to the computational requirements of various embedded applications. This work presents a generic and easily adaptable flow for application-oriented ISE design that supports both of the prevalent ASIP design paradigms: complete ISA design from scratch through an extensive design-space exploration, or limited ISA adaptation for a pre-designed and pre-verified base-processor core. The broad applicability of this design flow is demonstrated using ISA customization case studies for both design philosophies.
1 Introduction
Over the past few years, programmable processor cores have started to feature prominently in an increasing number of embedded SoC (System-on-Chip) designs. Their programmability ensures shorter time-to-market for new products through a high degree of code reuse, and facilitates longer time-in-market and simplified maintenance through software upgrades and bug fixes. However, even with these key advantages, general-purpose programmable processors are mostly inadequate to cope with the increasing computational complexity and energy efficiency requirements of various embedded multimedia and wireless applications. As a consequence, a new breed of programmable processing engines - namely ASIPs [14] - has emerged in the last few years to reconcile the conflicting demands of programmability and computational performance/energy efficiency. An ASIP combines a programmable base-processor with various special ISA extensions and micro-architectural enhancements for meeting the computational requirements of a single, or at most a small set of, target applications. One major form of application-specific customization for an ASIP is to extend the processor ISA by adding a set of special purpose instructions. These special purpose instructions, usually known as ISEs (Instruction-Set Extensions), are usually implemented inside a CFU (Custom Functional Unit) - a co-processor tightly coupled to the base-processor core. Classical instances of ISEs can be
found in many domain-specific architectures like digital signal processors (e.g. multiply-accumulate and add-compare-select instructions) or network processing units (e.g. bit-slice manipulation instructions for packet processing). ASIPs are often designed to deliver much higher computational efficiency than even such domain-specific processors, and therefore need far more complex ISEs to meet the performance constraints. Consequently, ISA customization often forms the primary hurdle in the ASIP design process, and has deservedly received a lot of attention in industry and academia [7][11][12][10][15][4]. This paper presents a generic design-flow for automated ASIP ISA customization. Unlike the previous works in this area, which mostly focus on the various algorithmic aspects of automatic ISE identification, this work describes a framework which can be easily integrated into a wide variety of ASIP design tool-chains. This framework can either support limited customization of existing ASIP ISAs, or, more importantly, facilitate quick design-space exploration to develop optimized ISA and CFU structures completely from scratch. The rest of the paper is organized as follows. The next section describes related work and highlights our contributions. Sections 3 through 5 outline the various components of our design flow. Section 6 presents two case studies - one each for limited ASIP customization and complete design from scratch. The final section summarizes the paper and provides some pointers for future work.
2 Background and Related Work
An ASIP design for a given target application can either be done from scratch, or can be accomplished by limited customization of an existing base processor. The state-of-the-art tool-chains to assist ASIP design from scratch are usually based on Architecture Description Languages (ADLs) [6][1]. These tool-chains allow automatic generation of the entire software ecosystem (i.e. the compiler tool chain and the Instruction-Set Simulator) and the RTL hardware of a processor from its ADL description. The ADL-based flow grants designers complete freedom in tuning an ASIP to the exact needs of an application, but it also requires more design time and effort. The state-of-the-art in limited ASIP customization are configurable-processor-based design flows [3][16], which require much less time and know-how than ADL-based tools, but usually produce less efficient designs. The ISA customization process, in both kinds of design flows, generally starts with the source code of the target application written in a high-level programming language - usually C. The process usually consists of three major steps: (i) application profiling and characterization, (ii) automated ISE identification and (iii) ISE verification and integration into an ASIP design. Application profiling and characterization is commonly the starting point of ISA customization. Application profiling is used to identify the most computationally demanding segments - also known as hot-spots - of the target application. The second ISE customization step involves automated ISE identification from the hot-spots identified through profiling. A typical identification algorithm looks inside the DFG (Data Flow Graph) of each hot-spot and attempts to combine several arithmetic/logical/data-transfer operations into a single special instruction.
ISE verification and integration is the final step of the whole process. The output of the automated ISE identification step is a set of ISEs - usually represented as a set of DAGs (Directed Acyclic Graphs) with operations as nodes. The final steps of the customization flow involve conversion of these DAGs into data-path descriptions inside an ASIP's hardware model, re-targeting of the ASIP's compiler tool chain, and verification of the correctness of the ISEs through hardware or Instruction-Set simulation. So far, the majority of the work on ISA customization has focused on automated ISE identification algorithms. There exist two prominent approaches to the identification problem. The first attempts to find small but reusable DAG patterns in the target application. The second tries to find the largest possible sub-graphs from application hot-spots without considering reusability. For the rest of this paper, we will refer to the second approach because it is usually better for obtaining hardware acceleration. Identification of large ISEs from a DFG has been formulated as a maximal convex sub-graph identification problem [7]. Each convex sub-graph of a DFG is a potential ISE, provided it conforms to a set of architectural constraints imposed by the base-processor/CFU interface. Examples of such constraints include restrictions on the data-bandwidth between the base-processor and the CFU (i.e. restrictions on the maximum number of General Purpose Register and main-memory reads/writes allowed from each ISE), and area/timing restrictions on the ISE data-path. Numerous approaches have been proposed to identify convex sub-graphs under these architectural constraints [7][11][12][15]. Some of these works [15][13][2][8] also suggest methods to overcome the data-bandwidth constraints of the base-processor/CFU interface, because the overall quality of ISEs (i.e. the available hardware acceleration) largely depends on them. An area which has been mostly neglected by these previous works is how the ISE identification process can be seamlessly embedded in an ASIP design-flow. Many of these previous works either work with abstract performance estimation models, or target only a single configurable processor for presenting results. To our knowledge, a generic design-flow which can be extended to a large variety of ASIP design technologies has not been reported so far. Another important area that most of the previous works completely ignore is the possibility of design-space exploration while designing an ASIP from scratch. Such design-space exploration is absolutely necessary to find the right data-bandwidth for the base-processor/CFU interface, or the right number of arithmetic/logic units like adders or multipliers in the CFU. The current work concentrates on building an ISA customization framework which addresses both of these issues.
3 Design Flow
The architecture of our design flow is presented in Figure 1. The entry point of the design-flow is the source code of the target application in ANSI-C, which is parsed by the LANCE C front-end to generate a Three-Address Code Intermediate Representation (3-AC IR). All arithmetic/logical operations and type conversions
Fig. 1. Our ISA Customization Flow
(including implicit operations from array and structure address computations), memory accesses and control transfer operations are explicitly visible in this IR. As shown in Figure 1, the design-flow consists of three major components: (i) the ISE identification infrastructure, (ii) the µProfiler and (iii) the ISE back-end. The whole process of ISE customization is guided through a user-friendly GUI which allows designers to set various interface and architectural constraints during ISE identification. The ISE identification infrastructure translates the 3-AC IR into a CDFG (Control Data Flow Graph) and provides a set of well-defined interfaces through which an ISE identification algorithm can analyze, manipulate and transform the CDFG structure. The identified ISEs as well as their inputs and outputs can be annotated on the CDFG itself. The µProfiler is a novel, target-architecture-independent application profiler [9]. Since it collects profiling information by instrumenting the fine-grained LANCE IR, it can accurately report the computational complexity of different application hot-spots in terms of the number of executed arithmetic, logical, memory access and control transfer operations. Additionally, it can provide fairly accurate cycle count information for RISC-like architectures using a configurable cost model. The µProfiler can also estimate cycle savings for a set of ISEs by reading in the annotated CDFGs produced by the ISE identification infrastructure. The ISE back-end translates the annotated CDFG to a hardware description and inserts the identified ISEs into the original application's source code through assembly functions. Depending on the ASIP design-flow, the outputs of the back-end vary widely. Currently, the back-end can produce ISE behaviors for two configurable processors and one ADL-based design flow.
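A minimal sketch of what such an annotated CDFG interface could look like is given below; the class and field names are ours and do not correspond to the actual framework API.

# Minimal sketch of an annotated CDFG interface; names are illustrative only.
from dataclasses import dataclass, field
from typing import List, Optional

@dataclass
class CdfgNode:
    op: str                                   # e.g. "add", "mul", "load", "br"
    preds: List["CdfgNode"] = field(default_factory=list)
    succs: List["CdfgNode"] = field(default_factory=list)
    ise_id: Optional[int] = None              # ISE this operation was folded into

@dataclass
class BasicBlock:
    nodes: List[CdfgNode]
    exec_count: int = 0                       # filled in from profiling data

def annotate_ise(nodes: List[CdfgNode], ise_id: int) -> None:
    """Mark a group of operations as belonging to one identified ISE."""
    for n in nodes:
        n.ise_id = ise_id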
Our design-flow allows two levels of design-space exploration. The first level of design-space exploration is carried out using the µProfiler and the ISE identification infrastructure. For each set of architectural constraints (entered using the GUI), the µProfiler presents a cost-benefit analysis in the form of the area of the identified ISEs vs. the total cycle savings. By comparing the cost-benefit analyses of various CFU configurations, designers can easily identify the best possible set of ISEs and base-processor/CFU interface for an ASIP. In this level of design-space exploration, only the relative merits/de-merits (and not exact values) of the various CFU configurations are important. The next level of design-space exploration can be carried out using the LISA ADL based design-flow. Our ISE back-end can be easily parameterized to generate ISE definitions for different ASIP configurations described in LISA. This permits exact evaluation of the design points short-listed in the first level of exploration.
4 ISE Identification Infrastructure
The ISE identification infrastructure converts each function definition in the original application's source code into a CDFG structure. Each CDFG consists of several basic blocks connected by control-flow edges. The set of most frequently executed basic blocks from all the CDFGs of an application represents the hot-spots. A basic block corresponding to a program hot-spot is represented by a DAG G = (V, E), where V is the set of nodes representing operations and E is the set of edges representing data-dependencies between these operations. There may exist a non-empty set VF (VF ⊂ V) of operations which must be implemented in the base-processor core. Nodes belonging to VF are called forbidden nodes, and may include (but are not limited to) function calls and jumps, floating-point operations, and memory loads/stores (if memory accesses are not permitted from custom instructions). The rest of the nodes can be combined into various ISEs. The CDFG data-structure provides access methods that any ISE identification algorithm can use to inspect the individual operations and their data-dependencies. For each identified ISE, the constituent CDFG nodes and the input/output edges can be annotated with a unique identifier corresponding to that ISE. The ISE back-end uses these annotations to construct the data-path and data-dependencies for each ISE. For each operation v_i ∈ V, two important parameters - software latency (SW_i) and hardware latency (HW_i) - are required by any ISE identification algorithm. HW_i represents the delay of executing v_i in hardware, normalized to the base-processor clock. SW_i represents the number of cycles it takes to execute v_i in software using a Base-Processor Instruction (BPI). These values are passed to our framework through the GUI or a parameter file. Currently, two ISE identification algorithms have been integrated into our infrastructure. While a detailed discussion of these algorithms is out of the scope of the current work, we will provide a brief understanding of how the various architectural constraints are taken care of in these two algorithms. The first algorithm, based on ILP (Integer Linear Programming), iterates over the nodes of V
and constructs one ISE in each iteration by selecting a set of nodes. The selected nodes must maximize an objective function while obeying all the architectural constraints, represented by a set of linear inequalities. The nodes selected in one iteration are excluded from consideration in the next iteration. The process continues until only forbidden nodes are left outside ISEs. A detailed description of the algorithm can be found in [15]. The second algorithm, based on HLS (High-Level Synthesis), uses resource-constrained scheduling to pipeline the entire DFG of a hot-spot. Each forbidden node is scheduled alone in a single scheduling step and is executed as a BPI. Several non-forbidden nodes can be combined in a single scheduling step through chaining and data-parallelism. Each such scheduling step becomes a single ISE.
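A much-simplified sketch of this HLS-style grouping is shown below: the DFG is walked in topological order, forbidden nodes are emitted as stand-alone BPIs, and non-forbidden operations are packed into the current scheduling step until an (abstracted) constraint check fails. The real algorithm additionally exploits data-parallelism and the detailed constraint bookkeeping of the following subsections; the interface is ours.

# Much-simplified sketch of the HLS-style grouping: forbidden nodes execute as
# BPIs, non-forbidden nodes are packed into scheduling steps (candidate ISEs).
# `violates` abstracts the GPR/IR, resource and latency checks of Sections 4.1-4.4.

def schedule_into_ises(topo_nodes, forbidden, violates):
    steps = []       # each step is a list of nodes; non-forbidden steps become ISEs
    current = []
    for n in topo_nodes:
        if n in forbidden:
            if current:
                steps.append(current)     # close the current candidate ISE
                current = []
            steps.append([n])             # forbidden node is scheduled alone (a BPI)
        elif current and violates(current + [n]):
            steps.append(current)         # constraint violated: start a new step
            current = [n]
        else:
            current.append(n)             # chain the operation into the current ISE
    if current:
        steps.append(current)
    return steps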
4.1 Register Constraints
Because a large ISE might combine several operations, it generally needs many inputs and produces more than one output. However, usually only a limited number of GPR (General Purpose Register) I/O ports are made available to the CFU, for a variety of reasons (difficulty in providing data-forwarding paths, increased area/timing of the GPR file due to a higher number of ports, etc.). One possible way to overcome this problem is to add IRs (Internal Registers) to the CFU, which can be used to communicate data between two different ISEs. Since IRs are only visible inside the CFU, ISEs must communicate with BPIs using GPRs. Our framework allows putting a restriction on the number of GPR as well as IR ports available to the CFU. An example usage of IRs is presented in Figure 2. The dark operation nodes are part of various ISEs, while the light nodes are executed as BPIs. Inputs/outputs communicated via GPRs are represented by dashed edges, whereas those passed through IRs are shown using solid lines. If a 2-1 GPR I/O constraint (i.e. a maximum of 2 GPR reads and 1 GPR write per ISE) is assumed, then ISE1 cannot be formed in the absence of IRs since it has 3 inputs and 2 outputs. In the presence of an IR file, the input coming from the BPI b1 first has to be moved from a GPR to an IR (with a special move instruction). Similarly, the node v1
Fig. 2. Increasing Data-Bandwidth to ISEs using IRs
can produce its output into another IR, which can be directly used by v3 through the IR file, but has to be moved to a GPR before usage in b4. Note that nodes belonging to different ISEs can communicate using GPRs as well (v2 ∈ ISE1 and v4 ∈ ISE3). The handling of IRs in the ILP model cannot be covered within the scope of this work; interested readers can refer to [15] for more details. In the HLS algorithm, each scheduling level is associated with a count of its current inputs through GPRs and IRs. When a node is added to a scheduling level, the status of each of its predecessors is checked to determine the new count of IR and GPR inputs. If a predecessor is also a predecessor of another node in the same scheduling level, then it does not change the IR/GPR input count. Otherwise, the GPR input count is incremented if it is a forbidden node, and the IR input count is incremented if it is non-forbidden. If the addition of the node violates either the GPR or the IR input constraint, then it is removed from the scheduling level and added to a new succeeding scheduling level. Both algorithms apply a number of post-ISE-identification optimizations, including a left-edge-algorithm-based IR allocation, to reduce the total number of IRs in the CFU.
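The following sketch mirrors the bookkeeping just described: when a node is tentatively added to a scheduling level, each not-yet-counted external predecessor contributes a GPR input if it is forbidden (its value comes from a BPI) and an IR input otherwise. The names and the interface are ours, not the framework's.

# Sketch of the GPR/IR input bookkeeping described above; illustrative only.

def try_add_to_level(level, node, pred, forbidden, max_gpr_in, max_ir_in):
    members = set(level) | {node}
    gpr_in = ir_in = 0
    counted = set()
    for n in members:
        for p in pred[n]:
            if p in members or p in counted:
                continue              # produced inside the level, or already counted
            counted.add(p)
            if p in forbidden:
                gpr_in += 1           # operand arrives through a GPR (from a BPI)
            else:
                ir_in += 1            # operand arrives through an internal register
    if gpr_in <= max_gpr_in and ir_in <= max_ir_in:
        level.append(node)
        return True
    return False                      # caller starts a new succeeding scheduling level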
4.2 Computational Resource Constraints
Our framework allows designers to constrain the number of computational resources (e.g. adders, subtractors, multipliers, comparators, shifters) inside the CFU so as to control the CFU area. It is even possible to declare customized computational resources (e.g. min/max/clip operations) and restrict their numbers in the CFU. Such custom computational units must appear in the source code as function calls, and the ISE identification algorithms recognize them by simple name matching. The user can specify a set of resources R, and the maximum number, r_count, for each resource type r ∈ R in the CFU. Each operation v ∈ V needs a number of resources for completing its execution. If a set of nodes V_ise = {v_1, v_2, ..., v_n} belongs to the same ISE, then the condition

∑_{i=1}^{n} rusage_{r,i} ≤ r_count   ∀ r ∈ R        (1)
must be satisfied. rusage_{r,i} in the above equation represents the number of resources of type r required by a node v_i ∈ V_ise. The ILP simply generates the corresponding inequalities for each of the resource constraints, whereas the HLS algorithm checks whether the addition of a node to a scheduling level violates any of the resource constraints.
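For illustration, the resource check of constraint (1) amounts to summing the per-type usage of all nodes grouped into one ISE and comparing it against the declared CFU limits, as in the sketch below; the data layout is ours.

# Resource check of constraint (1): the summed per-type resource usage of an
# ISE's nodes must stay within the declared CFU limits. Data layout is illustrative.
from collections import Counter

def resources_ok(ise_nodes, rusage, rcount):
    """rusage[v]: {resource type: units needed by v}; rcount: CFU limit per type."""
    total = Counter()
    for v in ise_nodes:
        total.update(rusage[v])
    return all(total[r] <= limit for r, limit in rcount.items())

# Example: an ISE needing three adders fails against a CFU with only two.
rusage = {"v1": {"add": 2}, "v2": {"add": 1, "mul": 1}}
print(resources_ok(["v1", "v2"], rusage, {"add": 2, "mul": 2}))   # -> False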
4.3 Memory and Scratch-Pad I/O Constraints
Our framework allows designers to specify whether memory accesses are permitted from the CFU. Additionally, designers are allowed to add a set of scratch-pads to the CFU and to map specific data objects to each of these scratch-pads (scratch-pads are small and fast memories which can be used as
software-controlled caches). Memory and scratch-pad access constraints can be modeled as resource constraints and are handled accordingly in the identification algorithms.
4.4 Latency Constraints
In our framework, designers are allowed to specify the maximum number of cycles, Lat_max, an ISE can execute before it must commit its results to the GPR file. This constraint can be handled by verifying the following inequality. Let v_1, v_2, ..., v_n be any sequence of operations such that there exists an edge between each v_i and v_{i+1} for 1 ≤ i ≤ n − 1. If all the nodes of the sequence belong to the same ISE, then the following constraint must hold:

∑_{i=1}^{n} HW_i ≤ Lat_max        (2)
5 The ISE Back-End
The task of the back-end is to generate hardware descriptions and compiler re-targeting information for the identified ISEs. Our hardware generation back-end can convert the behavior of the ISEs - represented by DAGs - to either C or RTL (Register Transfer Level) Verilog. The generated Verilog can be seamlessly integrated into an ARC [3] configurable processor model. The C code can be integrated automatically into either CoWare CorXpert for MIPS processors [5] or the LISATek ASIP design framework [6] based on the LISA ADL. The C or Verilog output can also be integrated into other processor models through some manual modifications. This leads to a very generic ISA customization flow. In our design flow, the back-end generates a modified C application source code where the ISEs are inserted through assembly function calls. The definitions of these assembly functions are also generated into a header file which is included from the modified source file. This simplifies the process of compiler re-targeting to a great extent. The back-end can also produce modified instruction schedulers for the LISA C compiler. The method is quite generic and can be adopted in other C compilers as well.
5.1 The LISA Generation Back-End
The LISA generation back-end is very important from the design-space exploration perspective, since it allows exact characterization of various ISEs and base-processor/CFU interfaces in terms of area, timing and power consumption. Designers can easily parameterize it to generate ISEs for a variety of base-processor structures. We will explain the working of the LISA generation back-end using Figure 3, which shows two ISEs integrated into a LISA processor model. LISA facilitates a hierarchical description of the processor ISA behavior, syntax and coding. The basic element of a LISA description is a LISA OPERATION.
Fig. 3. Example LISA Description Generated by Back-end
The behavior of an operation, described in a variant of C, can invoke other operations through ACTIVATION, and such an invocation chain usually describes the behavior of one complete instruction (e.g. the invocation chain of decode, ise_dc, ise_ex defines the ISEs in Figure 3(b)). Various instructions can share the same functionality by invoking the same OPERATION. An operation can access global RESOURCEs such as registers, pipeline registers, wires and memories just like accessing C variables or arrays. The resources are declared in a RESOURCE section (1 in Figure 3(b)). The pipeline structure is imposed on the model by assigning operations to different pipeline stages. The data-path generator for LISA currently assumes a classical 5-stage pipelined base-processor architecture where the execution of an ISE starts from the EX stage (e.g. 2 in Figure 3(b)). Memory accesses from ISEs are always initiated in the EX stage (e.g. 4 in Figure 3(b)) and completed in the MEM stage. Since the back-end cannot deviate from this base-processor pipeline structure, multi-cycle operations have to be manually broken into several execution stages. The back-end, however, allows changes in the data-forwarding architecture. For example, the designer can specify that one output of the EX (or MEM) stage is forwarded to only one input of the EX stage. In that case the back-end uses the forwarded value from the pipeline registers for the first GPR input, and directly reads the second value from the GPRs (e.g. 3 in Figure 3(b)). The back-end also allows the designers to specify the instruction coding schemes and the register, memory and pipeline stage naming conventions to be used for output generation. As a consequence, even with the restrictions on the base-processor structure, a wide variety of ASIP descriptions can be used in our design-flow.
The back-end generates hints on which computational resources can be shared between two instructions. However, the actual resource sharing has to be performed manually (e.g. the adder sharing shown in the behavior of ise_ex in Figure 3(b)).
6 Case Studies
This section presents results from two case-studies. The goal of the first case-study was to tailor the CFU as well as the base-processor/CFU interface of a LISA processor model to a target application - the H.264 video decoder. The second case-study customized an ARC base-processor with ISEs for the X.264 encoder (a public-domain implementation of the H.264 video standard). The results of the case-studies are summarized in Table 1. The starting processor model for the first case study was a simple 5-stage pipelined RISC processor model named LTRISC. The µProfiling results showed that a large portion of the execution time is spent inside a function named Clip, which simply clips the input data between given upper and lower limits. As a consequence, we decided to include Clip as a special functional unit in the CFU. We targeted a total of 8 hot-spot functions for ISE identification. For these 8 hot-spots, we generated ISEs for a variety of CFU configurations with different register/memory I/O and computational resource constraints. Using the µProfiler-based performance estimation, we found that the best CFU configuration has 2/1 GPR I/Os, one memory read/write port and 14/12 IR I/Os. In actual Instruction-Set Simulation, the customized processor runs 1.95× faster than the original processor core for the decoder application. In the second case-study, an ARC600 processor core was customized for various hot-spots of the X.264 encoder. Since the CFU interface was fixed in this case (2/1 GPR I/O ports, no memory ports), we could not perform any meaningful design space exploration. A number of ISEs were generated for two hot-spot functions of the source code - mc_chroma and pixel_satd_wxh. On an FPGA prototype of the customized processor, the target application runs 1.58× faster than the original software execution. The case-studies show that our design-flow is generic enough to be used with different ASIP design tool-chains. The first case-study also demonstrates how our design-flow can support design space exploration during ISE customization.

Table 1. Summary of LISA and ARC case-studies
ASIP Design Tool-Chain | Target Application | Base Processor | CFU Interface | Application Speed-up (× pure software execution)
LISA | H.264 Video Decoder | LTRISC, 16 GPR, 5-stage pipeline | 2/1 GPR I/O, 14/12 IR I/O, 1 Mem. Port | 1.95×
ARC | X.264 Video Encoder | ARC600 | 2/1 GPR I/O, no mem. ports | 1.58×
7 Summary
In this paper, we have presented a generic design-flow for ISA customization of application specific processors. The design-flow allows several levels of design space exploration while designing an ASIP from scratch. The generality of the design-flow has been demonstrated by showing results for two state-of-the-art ASIP design frameworks. Currently, the LISA generation back-end can only generate ADL descriptions for a fixed base-processor pipeline structure. In the future, we would like to extend it to encompass a larger variety of pipeline architectures. We would also like to automate the process of resource sharing between various ISEs, which can lower the area overheads of CFUs.
References
1. Halambi, A., Grun, P., Ganesh, V., Khare, A., Dutt, N., Nicolau, A.: EXPRESSION: A language for architecture exploration through compiler/simulator retargetability. In: DATE (1999)
2. Verma, A.K., Brisk, P., Ienne, P.: Rethinking custom ISE identification: a new processor-agnostic method. In: CASES (2007)
3. ARC International, http://www.arc.com
4. Galuzzi, C., Panainte, E., Yankova, Y., Bertels, K., Vassiliadis, S.: Automatic selection of application-specific instruction-set extensions. In: CODES+ISSS (2006)
5. CoWare CorXpert, http://www.coware.com/news/press439.htm
6. CoWare LISATek Tools, http://www.coware.com/
7. Atasu, K., Pozzi, L., Ienne, P.: Automatic Application Specific Instruction Set Extensions Under Microarchitectural Constraints. In: DAC (2003)
8. Karuri, K., Chattopadhyay, A., Hohenauer, M., Leupers, R., Ascheid, G., Meyr, H.: Increasing data-bandwidth to instruction-set extensions through register clustering. In: ICCAD (2007)
9. Karuri, K., Faruque, M., Kraemer, S., Leupers, R., Ascheid, G., Meyr, H.: Fine-grained application source code profiling for ASIP design. In: DAC (2005)
10. Clark, N.T., Zhong, H., Mahlke, S.A.: Processor Acceleration through Automated Instruction-set Customization. In: MICRO-36 (2003)
11. Biswas, P., Dutt, N., Pozzi, L., Ienne, P.: ISEGEN: Generation of High-Quality Instruction Set Extensions through Iterative Improvement. In: DATE (2005)
12. Yu, P., Mitra, T.: Scalable Custom Instructions Identification for Instruction Set Extensions. In: CASES (2004)
13. Jayaseelan, R., Liu, H., Mitra, T.: Exploiting forwarding to improve data bandwidth of instruction-set extensions. In: DAC (2006)
14. Leupers, R., Ienne, P. (eds.): Customizable Embedded Processors: Design Technologies and Applications. Morgan Kaufmann, San Francisco (2006)
15. Leupers, R., Karuri, K., Kraemer, S., Pandey, M.: A design flow for configurable embedded processors based on optimized instruction set extension synthesis. In: DATE (2006)
16. Tensilica, http://www.tensilica.com/
Runtime Adaptive Extensible Embedded Processors — A Survey Huynh Phung Huynh and Tulika Mitra School of Computing National University of Singapore {huynhph1,tulika}@comp.nus.edu.sg
Abstract. Current generation embedded applications demand that the computation engine offer high performance similar to custom hardware circuits while preserving the flexibility of software solutions. Customizable and extensible embedded processors, where the processor core can be enhanced with application-specific instructions, provide a potential solution to these conflicting requirements of performance and flexibility. However, due to the limited area available for the implementation of custom instructions in the datapath of the processor core, we may not be able to exploit all custom instruction enhancements of an application. Moreover, a static extensible processor is fundamentally at odds with highly dynamic applications where the custom instruction requirements vary substantially at runtime. In this context, a runtime adaptive extensible processor that can quickly morph its custom instructions and the corresponding custom functional units at runtime, depending on workload characteristics, is a promising solution. In this article, we provide a detailed survey of the contemporary architectures that offer such dynamic instruction-set support and discuss compiler and/or runtime techniques to exploit such architectures.
1 Introduction

The ever increasing demand for high performance at low power in the embedded domain is fueling the trend towards customized embedded processors [13]. A customized processor is designed specifically for an application domain (e.g., network, multimedia etc.), enabling it to offer significantly higher performance compared to its general-purpose counterparts while consuming much lower energy. This dual improvement in power-performance is achieved by eliminating certain structures (e.g., floating-point unit) that are redundant for the particular application domain, while choosing appropriate dimensions for other structures (e.g., cache, TLB, register file). The elimination of redundant structures cuts down energy/area wastage, and tailor-made dimensioning of the required structures improves performance at a reduced power budget. A further step towards customization is instruction-set extensible processors, or extensible processors for short. An extensible processor opens up the opportunity to customize the Instruction-Set Architecture (ISA) through application-specific extension instructions or custom instructions. Each custom instruction encapsulates a frequently occurring complex pattern in the data-flow graph of the application(s). Custom instructions are implemented as Custom Functional Units (CFUs) in the data-path of the processor core. As multiple instructions from the base ISA are folded into a single custom
instruction, we save fetching/decoding costs and improve code size. More importantly, the CFU can typically achieve significantly lower latency through parallelization and chaining of basic operations (the latency is determined by the critical path in the data-flow graph of the corresponding custom instruction) compared to executing one operation per cycle sequentially in the original processor. On the other hand, as custom instructions are exposed to the programmer, extensible processors offer great flexibility just like any software-programmable general-purpose processor. The large number of commercial extensible processors available in today's market (e.g., Xtensa [9], Lx [8], ARC configurable cores [2], OptimoDE [7], MIPS CorExtend [18]) is a testament to their wide-spread popularity.

There are, however, some drawbacks to traditional extensible processors. First, we need to design and fabricate a different customized processor for each application domain. A processor customized for one application domain may fail to provide any tangible performance benefit for a different domain. Soft-core processors with extensibility features that are synthesized in FPGAs (e.g., Altera Nios [1], Xilinx MicroBlaze [21]) somewhat mitigate this problem, as the customization can be performed post-fabrication. Still, customizable soft cores suffer from lower frequency and higher energy consumption because the entire processor (and not just the CFUs) is implemented in FPGAs. Apart from cross-domain performance problems, extensible processors are also limited by the amount of silicon available for the implementation of the CFUs. As embedded systems progress towards highly complex and dynamic applications (e.g., MPEG-4 video encoder/decoder, software-defined radio), the silicon area constraint becomes a primary concern. Moreover, for highly dynamic applications that can switch between different modes (e.g., runtime selection of the encryption standard) with unique custom instruction requirements, a customized processor catering to all scenarios will clearly be a sub-optimal design.

Runtime adaptive extensible embedded processors offer a potential solution to all these problems. An adaptive extensible processor can be configured at runtime to change its custom instructions and the corresponding CFUs. Clearly, to achieve runtime adaptivity, the CFUs have to be implemented in some form of reconfigurable logic, while the base processor is implemented in ASIC to provide high clock frequency and better energy efficiency. As the CFUs are implemented in reconfigurable logic, these extensible processors offer full flexibility to adapt (post-fabrication) the custom instructions to the requirements of the application running on the system, and even midway through the execution of the application. Such adaptive extensible processors can be broadly classified into two categories:

– Explicit Reconfigurability: This class of processors needs full compiler or programmer support to identify the custom instructions, synthesize them, and finally cluster them into one (or more) configurations that can be switched at runtime. In other words, custom instructions are generated off-line and the application is recompiled to use them.

– Transparent Reconfigurability: This class of processors does not expose the extensibility feature to the compiler or the programmer. In other words, the extensibility is completely transparent to the user. Instead, the runtime system identifies the custom instructions and synthesizes them while the application is running on the system. These systems are more complex, but may provide better performance as the decisions are taken at runtime.

In this article, we will first provide a quick survey of the architecture of explicit runtime adaptive extensible processors, followed by the compiler support required for such processors. Next, we will discuss transparent reconfigurable processors and their runtime systems. Finally, we will conclude this survey by outlining the challenges and opportunities in this domain.
2 Explicit Runtime Adaptive Extensible Processors

In this section, we focus on extensible processors that require extensive compiler or programmer intervention to achieve runtime reconfigurability.

2.1 Architecture

Temporal Reconfiguration. We start with architectures that enable temporal reconfiguration, where only one custom instruction can exist at any point of time; that is, there is no spatial sharing of the reconfigurable logic among custom instructions. PRISC (PRogrammable Instruction Set Processor) [17] is one of the very first architectures to include temporal reconfigurability of the custom functional units. Temporal reconfiguration virtually enlarges the limited reconfigurable hardware, which is tightly attached to the datapath of the core processor. PRISC supports a set of configurations, each of which contains a computation kernel or a custom instruction. At any point of time, there is only one active configuration of the reconfigurable hardware. However, each of the configurations can become active at some point of time through time-multiplexing. Therefore, temporal reconfiguration can extend the computational ability of the reconfigurable hardware at the cost of reconfiguration overhead. Figure 1 shows the Programmable Functional Unit (PFU) in parallel with the other traditional functional units in the datapath of the PRISC processor. PFU data communication is similar to that of the other functional units. However, the PFU can support only two input operands and one output operand. With this limitation on the number of input and output operands, PRISC cannot implement large custom instructions that could potentially provide more performance benefit through instruction-level parallelism as well as higher latency reduction. Moreover, as each configuration can include only one instruction, PRISC effectively restricts the number of custom instructions per loop body to one;
Fig. 1. PRISC Architecture [17]
otherwise, the temporal reconfiguration cost within the loop body will typically outweigh any benefit of custom instructions. OneChip [14] reduces the reconfiguration overhead by allowing multiple configurations to be stored in the PFU, although only one configuration is active at any point of time. Moreover, OneChip comprises a superscalar pipeline with a PFU to achieve higher performance for streaming applications. However, OneChip lacks the details of how programmers specify or design the hardware that is mapped onto the reconfigurable logic.

Spatial and Temporal Reconfiguration. Both PRISC and OneChip allow only one custom instruction per configuration, which can result in high reconfiguration cost, especially if two custom instructions in the same code segment are executed frequently, for example inside a loop body. Our next set of architectures enables spatial reconfiguration, that is, the reconfigurable hardware can be shared among multiple custom instructions. The combination of spatial and temporal reconfiguration is a powerful feature that partitions the custom instructions into multiple configurations, each of which contains one or more custom instructions. This clustering of multiple custom instructions into a single configuration can significantly reduce the reconfiguration overhead. Chimaera [22], which is inspired by PRISC, is one of the original works considering temporal plus spatial configuration of the custom functional units. Chimaera tightly couples a Reconfigurable Functional Unit (RFU) with a superscalar pipeline. The main innovation of the Chimaera RFU is that it uses nine input registers to produce the result in one destination register. Simple compiler support is provided to automatically map groups of normal instructions into custom instructions. However, the Chimaera compiler lacks support for spatial and temporal reconfiguration of custom instructions that would make runtime reconfiguration more efficient. The Stretch S6000 [10] commercial processor follows this research trend. Figure 2 shows the Stretch S6000 engine that incorporates the Tensilica Xtensa LX dual-issue VLIW processor [9] and the Stretch Instruction Set Extension Fabric (ISEF). The ISEF is a software-configurable datapath based on programmable logic. It consists of a plane of Arithmetic/Logic Units (AUs) and a plane of Multiplier Units (MUs) embedded and interlinked in a programmable, hierarchical routing fabric. This configurable fabric acts as a functional unit to the processor: it is built into the processor's datapath and resides alongside other traditional functional units. The programmer-defined application-specific instructions (Extension Instructions) are implemented in this fabric. When an extension instruction is issued, the processor checks whether the corresponding configuration (containing the extension instruction) is loaded into the ISEF. If the required configuration is not present in the ISEF, it is automatically loaded prior to the execution of the user-defined instruction. The ISEF provides high data bandwidth to the core processor through 128-bit wide registers. In addition, a 64 KB embedded RAM is included inside the ISEF to store temporary results of computation. With all these features, a single custom instruction can potentially implement a complete inner loop of the application. The Stretch compiler fully unrolls any loop with constant iteration counts.

Partial Reconfiguration. Partial reconfiguration provides the ability to reconfigure only part of the reconfigurable fabric.
With partial reconfiguration, idle custom instructions
Fig. 2. Stretch S6000 datapath [10]
can be removed to make space for new instructions. Moreover, as only a part of the fabric is reconfigured, the reconfiguration cost is reduced. DISC (Dynamic Instruction Set Computer) [20] is one of the earliest attempts to provide a partial reconfiguration feature in an extensible processor. DISC implements each instruction of the instruction set as an independent circuit module. It can page in and page out individual instruction modules onto the reconfigurable fabric in a demand-driven manner. DISC supports relocatable circuit modules, such that an existing instruction module can be moved inside the fabric to generate enough contiguous space for an incoming instruction module. The drawback of the DISC system is that both standard and custom instructions are implemented in reconfigurable logic, causing significant performance overhead. On the other hand, the host processor is under-utilized as it only performs resource allocation and reconfiguration. The Extended Instruction Set RISC (XiRisc) [15] follows this line of development by coupling a VLIW datapath with pipelined run-time reconfigurable hardware. XiRisc has a five-stage pipeline with two symmetrical execution flows called Data Channels. The reconfigurable datapath supports up to four source operands and two destination operands for each custom instruction. Moreover, the reconfigurable hardware can hold internal state across several computations so as to reduce register pressure. However, configuration caching is missing in XiRisc, leading to high reconfiguration overhead. Moreover, there is a lack of compiler support for designers to automatically generate custom instructions. The Molen [19] polymorphic processor incorporates an arbitrary number of reconfigurable functional units. Molen resolves the issue of opcode space explosion for custom functions as well as the data bandwidth limitation of the reconfigurable hardware. Moreover, the Molen architecture allows two or more independent functions to be executed in parallel in the reconfigurable logic. To achieve these features, Molen requires a new programming paradigm that enables general-purpose instructions and hardware descriptions of custom instructions to coexist in a program. A one-time instruction set extension
of eight instructions is added to support the functionality of the reconfigurable hardware. The Molen compiler automatically generates optimized binary code for C applications with pragma annotations for custom instructions. The compiler can also generate appropriate custom instructions for each implementation of the reconfigurable logic. The reconfiguration cost is hidden by scheduling the instructions appropriately such that the configuration corresponding to a custom instruction can be prefetched before that custom instruction is scheduled to execute.

2.2 Compiler Support

Most of the runtime adaptive extensible processors lack appropriate compiler support to automate the design flow. However, given the tight time-to-market constraints of embedded systems, compiler support is instrumental in achieving greater acceptance of these architectures. Currently, the burden is entirely on the programmer to select appropriate custom instructions and cluster them into one or more configurations. Choosing an appropriate set of custom instructions for an application is itself a difficult problem. Significant research effort has been invested in developing automated selection techniques for custom instructions [13]. Runtime reconfiguration has the additional complication of both temporal and spatial partitioning of the set of custom instructions in the reconfigurable fabric. We have recently developed an efficient framework [12] that starts with an application specified in ANSI-C and automatically selects appropriate custom instructions as well as clusters them into one or more configurations (see Figure 3). We first extract a set of compute-intensive candidate loop kernels from the application through profiling. For each candidate loop, one or more Custom Instruction Set (CIS) versions are generated, differing in their performance gain and area tradeoffs. The control flow among the hot loops is captured in the form of a loop trace (execution sequence of the loops) obtained through profiling. The hot loops with multiple CIS versions and the loop trace are fed to the partitioning algorithm that decides the appropriate CIS version and configuration for each loop. The key component of the framework is an iterative
Fig. 3. Compiler framework for runtime adaptive extensible processors [12]
Fig. 4. A set of periodic task graphs and the corresponding schedule [11]
partitioning algorithm. We model the temporal partitioning of the custom instructions into different configurations as a k-way graph partitioning problem. A dynamic programming based pseudo-polynomial time algorithm determines the spatial partitioning of the custom instructions within a configuration. The selected CIS versions to be implemented in hardware pass through a datapath synthesis tool, which generates the bitstream corresponding to each configuration (based on the outcome of the temporal partitioning). These bitstreams are used to configure the fabric at runtime. The remaining loops are implemented in software on the core processor. Finally, the source code is modified to exploit the new custom instructions. We also extend our work to include runtime reconfiguration of custom instructions for multiple tasks with timing constraints [11]. An application is modeled as a set of periodic task graphs, each associated with a period and a deadline. Multiple CIS versions are generated for each constituent task of a task graph. Each task has many instances in the static non-preemptive schedule over the hyper-period (the least common multiple of the task graph periods), as shown in Figure 4. The objective is to minimize processor utilization by exploiting runtime reconfiguration of the custom instructions while satisfying the deadline constraints. To achieve this goal, temporal partitioning divides the schedule into a number of configurations, where an area constraint is imposed on each configuration. For example, Figure 4 illustrates an initial fragment of the schedule and its partitioning into three configurations. Note that each configuration contains a disjoint subsequence of task instances from the original schedule. Temporal partitioning allows a larger virtual area at the cost of reconfiguration overhead. The area within a configuration is spatially partitioned among the task instances assigned to it by choosing an appropriate CIS version for each task instance. A dynamic programming based algorithm is enhanced with various constraints to efficiently solve the problem.
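The spatial partitioning step within a single configuration can be viewed as a multiple-choice knapsack problem: for every loop (or task instance) mapped to the configuration, exactly one CIS version must be picked so that the total area fits the fabric and the overall performance gain is maximized. The following C sketch illustrates one way such a dynamic programming formulation could look; the data structures and names are illustrative only and do not correspond to the actual implementation in [11,12].

#include <stdlib.h>

typedef struct { int area; int gain; } CISVersion;          /* one candidate CIS version   */
typedef struct { int num_versions; CISVersion *v; } Loop;   /* hot loop (or task instance) */

/* Best total performance gain for the loops assigned to one configuration,
 * using at most 'area_budget' units of fabric area and exactly one version
 * per loop; a pure-software fallback (area 0, gain 0) is always available,
 * so the problem stays feasible.  Complexity: O(n * area_budget * versions). */
int spatial_partition(const Loop *loops, int n, int area_budget)
{
    int *dp  = calloc(area_budget + 1, sizeof(int));   /* best gain per area bound */
    int *nxt = calloc(area_budget + 1, sizeof(int));

    for (int i = 0; i < n; i++) {
        for (int a = 0; a <= area_budget; a++) {
            int best = dp[a];                           /* software fallback        */
            for (int k = 0; k < loops[i].num_versions; k++) {
                const CISVersion *c = &loops[i].v[k];
                if (c->area <= a && dp[a - c->area] + c->gain > best)
                    best = dp[a - c->area] + c->gain;
            }
            nxt[a] = best;
        }
        int *tmp = dp; dp = nxt; nxt = tmp;             /* advance to the next row  */
    }
    int result = dp[area_budget];
    free(dp);
    free(nxt);
    return result;
}

The outer temporal (k-way) partitioning that decides which loops or task instances share a configuration would sit around such a routine, re-invoking it for each candidate configuration.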
3 Transparent Extensible Processors

We now proceed to describe extensible processors that are reconfigured transparently by the runtime system. Configurable Compute Accelerators (CCA): Transparent instruction-set customization supports a plug-and-play model for integrating a wide range of accelerators into a predesigned and verified processor core. Moreover, instruction-set customization occurs at runtime. An architectural framework for transparent instruction-set customization has
Fig. 5. Transparent Instruction Set Customization. (a) Subgraph Identification and (b) Runtime Processing [6].
been proposed in [5]. The framework comprises static identification of subgraphs for execution on the CCA [6] and runtime selection of the custom instructions to be synthesized to the CCA, as shown in Figure 5. First, the program is analyzed to identify the most frequent computation subgraphs (custom instructions) to be mapped onto the CCA. Figure 5(a) shows that two subgraphs have been selected. They are treated as normal functions and will be replaced by function calls. At runtime, the first time a selected subgraph is encountered, it is executed in the core pipeline while a hardware engine determines the CCA configuration concurrently. From the second execution onwards, the subgraph is implemented in the CCA, as shown in Figure 5(b). Static subgraph extraction and replacement are achieved by adding a few steps to the conventional code generation process, which comprises prepass scheduling, register allocation and postpass scheduling of spill code, as shown in Figure 6. These steps are shaded in gray in the figure. First, given a dataflow graph, subgraph identification selects a set of potential subgraphs, which will later be implemented on the CCA. Subgraph identification is a well-studied problem; interested readers can refer to [13] for a detailed exposition of the solutions. Note that subgraph identification is performed before register allocation to avoid false dependencies within the data flow graph. After subgraph identification, the selected subgraphs are collapsed into single instructions. However, when collapsing subgraphs, code motion ensures correctness if a subgraph crosses branch boundaries. Before register allocation, each collapsed instruction is expanded so that the register allocator can assign registers to the internal values. The advantage of this approach is that even a processor without a CCA can execute the subgraphs (because they are treated as normal functions). More importantly, subgraph expansion ensures that register allocation remains relatively unchanged. After register allocation, each subgraph is compacted to an atomic node and passed on as input to
Fig. 6. Compiler Flow for CCA Architecture [6]
postpass scheduling. When postpass scheduling completes, each subgraph is expanded once again and a function is created for each subgraph along with a function call. WARP: At the other end of the spectrum, we have WARP [16], which has been developed with completely transparent instruction-set customization in mind. The WARP processor consists of a main processor with instruction and data caches, an on-chip profiler, a WARP-oriented FPGA and an on-chip computer-aided design (CAD) module. The execution of an application starts only on the main processor. During the execution, the profiler determines the critical kernels of the application. Then, the CAD module invokes the Riverside On-Chip CAD (ROCCAD) tool chain. The ROCCAD tool chain starts with decompilation of the application binary code of software loops into a high-level representation that is more suitable for synthesis. Next, the partitioning algorithm determines the most suitable loops to be implemented in the FPGA. For the selected kernels, ROCCAD uses behavioral and Register Transfer Level (RTL) synthesis to generate appropriate circuit descriptions. Then, ROCCAD configures the FPGA by using Just-In-Time (JIT) FPGA compilation tools. The JIT compiler performs logic synthesis to optimize the hardware circuit, followed by technology mapping to map the hardware circuit onto the reconfigurable logic fabric. Placement and routing are then performed to complete the JIT compilation. Finally, ROCCAD updates the application binary code to utilize the custom accelerators inside the FPGA. RISPP (Rotating Instruction Set Processing Platform) [4] is a recent architecture that offers a unique approach towards runtime customization. RISPP introduces the notion of atoms and molecules for custom instructions. An atom is a basic datapath, while a combination of atoms creates a custom instruction molecule. Atoms can be reused across different custom instruction molecules. Compared to contemporary reconfigurable architectures, RISPP reduces the overhead of partial reconfiguration substantially through an innovative gradual transition of the custom instruction implementations from software into hardware. At compile time, only the potential custom instructions (molecules) are identified, but these molecules are not bound to any datapath in hardware. Instead, a number of possible implementation choices are available, including a purely software implementation. At runtime, the implementation of a molecule can gradually “upgrade” to hardware as and when the atoms it needs become available. If no atom is available for a custom instruction, it will be executed in the core pipeline using the software implementation. RISPP requires a fast design space exploration technique at runtime to combine appropriate elementary datapaths and evaluate tradeoffs between performance and hardware area of the custom instructions [3]. A greedy heuristic is proposed to select the appropriate implementation for each custom instruction.
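As a rough illustration of the kind of runtime selection RISPP performs, the following C sketch greedily picks, for one custom instruction, the best implementation whose required atoms are currently available in hardware, falling back to software otherwise. The data structures and the cost model are purely illustrative and are not taken from [3,4].

#include <stdbool.h>

#define MAX_ATOMS 32

typedef struct {
    unsigned atoms_needed;   /* bitmask over MAX_ATOMS atom types        */
    int      speedup;        /* estimated benefit; 0 = software fallback */
} Molecule;

/* Select the best currently executable implementation of one custom
 * instruction: the molecule with the highest estimated speedup whose atoms
 * are all present in 'atoms_available'.  Index 0 is assumed to be the pure
 * software molecule (atoms_needed == 0, speedup == 0), so a valid choice
 * always exists. */
int select_implementation(const Molecule *m, int num_molecules,
                          unsigned atoms_available)
{
    int best = 0;
    for (int i = 1; i < num_molecules; i++) {
        bool executable = (m[i].atoms_needed & ~atoms_available) == 0;
        if (executable && m[i].speedup > m[best].speedup)
            best = i;
    }
    return best;   /* re-evaluated as further atoms are loaded over time */
}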
4 Conclusions

In this article, we presented a detailed survey of extensible processors that provide runtime reconfiguration of their custom instruction sets. We observe that these architectures span a large spectrum, starting from the simplest solutions that provide only temporal reconfiguration of a single custom instruction, through more complex partial reconfiguration, to completely transparent reconfiguration solutions where the custom instructions are identified and implemented at runtime. We also discuss compiler
support necessary to exploit and harness this unique reconfiguration capability. Even though the architectural landscape in this domain looks quite promising, there is a serious lack of software tool support to take these solutions forward. In particular, runtime reconfiguration demands spatial and temporal partitioning of the custom instructions of an application into multiple configurations, a challenging problem for which only preliminary solutions exist today. Transparent extensible processors offer an interesting alternative to customization; however, the runtime overhead for design space exploration and synthesis somewhat limits the effectiveness of these proposals. We hope future research will bridge the gap between architecture and application to create an end-to-end solution for mapping applications to dynamic architectures.
Acknowledgements This work is partially supported by NUS research project R-252-000-292-112.
References
1. Altera. Introduction to the Altera Nios II Soft Processor, ftp://ftp.altera.com/up/pub/Tutorials/DE2/Computer Organization/tut nios2 introduction.pdf
2. ARC. Customizing a Soft Microprocessor Core (2002), http://www.arc.com/upload/download/ARCIntl 0126 CustomizingSoftMicCore wp.pdf
3. Bauer, L., Shafique, M., Henkel, J.: Run-time instruction set selection in a transmutable embedded processor. In: DAC (2008)
4. Bauer, L., Shafique, M., Kramer, S., Henkel, J.: RISPP: Rotating instruction set processing platform. In: DAC (2007)
5. Clark, N., Blome, J., Chu, M., Mahlke, S., Biles, S., Flautner, K.: An architecture framework for transparent instruction set customization in embedded processors. In: ISCA (2005)
6. Clark, N., Kudlur, M., Park, H., Mahlke, S., Flautner, K.: Application-specific processing on a general-purpose core via transparent instruction set customization. In: MICRO (2004)
7. Clark, N., Zhong, H., Fan, K., Mahlke, S., Flautner, K., Van Nieuwenhove, K.: OptimoDE: Programmable Accelerator Engines through Retargetable Customization. In: Hot Chips (2004)
8. Faraboschi, P., Brown, G., Fisher, J.A., Desoli, G., Homewood, F.: Lx: A technology platform for customizable VLIW embedded processing. In: ISCA (2000)
9. Gonzalez, R.E.: Xtensa: A configurable and extensible processor. IEEE Micro 20(2) (2000)
10. Gonzalez, R.E.: A software-configurable processor architecture. IEEE Micro 26(5) (2006)
11. Huynh, H.P., Mitra, T.: Runtime reconfiguration of custom instructions for real-time embedded systems. In: DATE (2009)
12. Huynh, H.P., Sim, J.E., Mitra, T.: An efficient framework for dynamic reconfiguration of instruction-set customization. In: CASES (2007)
13. Ienne, P., Leupers, R. (eds.): Customizable Embedded Processors. Morgan Kaufmann, San Francisco (2006)
14. Jacob, J.A., Chow, P.: Memory interfacing and instruction specification for reconfigurable processors. In: FPGA (1999)
15. Lodi, A., Toma, M., Campi, F., Cappelli, A., Canegallo, R., Guerrieri, R.: A VLIW processor with reconfigurable instruction set for embedded applications. IEEE Journal of Solid-State Circuits 38(11) (2003)
16. Lysecky, R., Stitt, G., Vahid, F.: WARP processors. ACM Transactions on Design Automation of Electronic Systems 11(3) (2006)
17. Razdan, R., Smith, M.D.: A high-performance microarchitecture with hardware-programmable functional units. In: MICRO (1994)
18. MIPS Technologies. MIPS Configurable Solutions, http://www.mips.com/everywhere/technologies/configurability
19. Vassiliadis, S., Wong, S., Gaydadjiev, G., Bertels, K., Kuzmanov, G., Panainte, E.M.: The MOLEN Polymorphic Processor. IEEE Transactions on Computers 53(11) (2004)
20. Wirthlin, M.J., Hutchings, B.L.: A Dynamic Instruction Set Computer. In: FCCM (1995)
21. Xilinx. Microblaze Processor, http://www.xilinx.com/products/design resources/proc central/microblaze.htm
22. Ye, Z.A., Moshovos, A., Hauck, S., Banerjee, P.: CHIMAERA: A high-performance architecture with a tightly-coupled reconfigurable functional unit. In: ISCA (2000)
Introduction to the Future of Reconfigurable Computing and Processor Architectures
Luigi Carro (UFRGS, Brazil) and Stephan Wong (TU Delft, The Netherlands)
As technology scales up, design productivity slows down, and new approaches must be sought to reach the design productivity (or bridge the gap) that is nowadays common for general-purpose processor design. New tools, technologies, languages, operating systems, and design approaches are likely to be needed. Nowadays, extra transistors are utilized for acceleration purposes (next to prototyping circuits), as they were used in the past for floating-point operations, and as is the case today for special blocks embedded into computer architectures such as MMX, SSE, and GPUs. In this special session, selected papers add to the discussion on how this symbiosis will evolve, showing how general-purpose or multicore computing can benefit from reconfiguration, how one can generalize current accelerators, and how all this will affect the way compilers produce code or deal with virtualization, as well as the integrating role of the OS. Moreover, one should look at how technology evolution will cope with current possible show stoppers, such as the communication problem and the interface to the operating system. Five papers covering these different aspects are presented in this session. The first one discusses the role of the OS in managing reconfigurable resources. The next two papers discuss the role of reconfigurable devices used as accelerators in new application domains, followed by an analysis of the role of reconfigurable computing in the current search for parallelism exploitation. Finally, the session ends with a survey on the use of reconfiguration in multithreaded architectures.
An Embrace-and-Extend Approach to Managing the Complexity of Future Heterogeneous Systems
Rainer Buchty, Mario Kicherer, David Kramer, and Wolfgang Karl
Universität Karlsruhe (TH), Institut für Technische Informatik, Lehrstuhl für Rechnerarchitektur, 76128 Karlsruhe, Germany
{buchty,kicherer,kramer,karl}@ira.uka.de
Abstract. In this paper, we present a particularly lightweight, integrative approach to programming and executing applications targeting heterogeneous, dynamically reconfigurable parallel systems. Based on an analysis of existing approaches, we strictly focused on compatibility and a lightweight design. Our approach therefore follows an embrace-and-extend strategy and achieves the desired functionality by adopting and augmenting existing system services. We implemented this concept using the Linux OS and demonstrated its suitability on a heterogeneous platform comprising IA32 multicore processors and current FPGA accelerator hardware using state-of-the-art HyperTransport interconnection technology.
1 Introduction and Motivation

Driven by advances in fabrication technology, the computing power of individual microprocessor cores has been constantly increasing over the last decades, with clock rates boosted by three orders of magnitude and increasingly complex processor microarchitectures. This led to current superscalar designs exploiting ILP using sophisticated out-of-order execution and prediction techniques. With the addition of integer and floating-point vector units, data parallelism was also exploited, paving the way for ubiquitous multimedia applications. The increased capabilities were implicitly usable by the programmer and, in the worst case, required the use of dedicated libraries and recompilation. Due to technological limitations, these past approaches of automatically gaining more performance by increasing single-core performance have come to a halt. Forecasts indicate a massive growth of manycore architectures, reverting the trend towards most powerful individual “all-purpose” processor cores. For future multi- and manycore architectures a mix of trimmed general-purpose processors and dedicated application accelerators, e.g. cryptographic units or network processors, is envisioned: recent examples are Intel’s Atom processor (lacking costly out-of-order execution), or their reanimation of 1994’s P54C processor architecture (now offering an improved V pipeline for vector operations) for use in the upcoming Larrabee architecture. The combination of general-purpose and application-specific processing units within a current multicore processor is demonstrated by Sun’s T1 (Niagara II) processor. Improvements in FPGA technology foster the use of reconfigurable logic instead of static accelerators, enabling a more flexible use of the silicon resources by dynamically
reconfiguring the logic resources to fulfill desired application-supporting functionality. For established programming paradigms this is an even harder problem than heterogeneity, which by itself already demands costly hardware-aware programming efforts: coping with dynamically changing platforms typically requires overloading the core program logic with associated control structures handling the heterogeneous, dynamic nature of the underlying architecture. Unlike with previous technology improvements, the potential processing power of such future heterogeneous multicore architectures requires explicit use of parallel programming techniques, leading to a break with conventional approaches to program development and execution. Easing programmability of heterogeneous architectures and dealing with runtime reconfigurability is hence considered one of the Grand Challenges of Computer Engineering [5]. Typically, related approaches use a high level of abstraction, avoiding the need to specifically address the underlying hardware; this is left to a virtualization or runtime layer. A common drawback of most of these approaches is, however, tying the concept to a dedicated programming language, programming model, or virtual machine. In addition, none of these approaches take application requirements into consideration when performing the application-to-hardware (A2H) mapping. Disregarding these requirements is likely to break running applications dependent on the fulfilment of certain requirements such as latency, throughput, or accuracy. We therefore propose a lightweight embrace-and-extend approach to the process of creating a hardware-independent application description and performing application-aware A2H mapping. Cornerstones of this approach are universal applicability, compatibility, and a lightweight implementation. The approach is independent of programming language, programming model, and operating system. It provides a seamless and most compatible path for migrating from conventional programming to programming of heterogeneous and dynamic architectures without breaking existing program code, development tools, and infrastructures. This is achieved by careful extension of techniques already employed in modern OSes’ runtime systems. We start this paper with an overview of existing approaches to motivate our work, followed by an introduction of our concept and its exemplary implementation, demonstrating how carefully embracing and extending the existing OS infrastructure leads to a lightweight, compatible approach to application programming and execution on parallel, heterogeneous systems. We show that such an approach achieves a seamless upgrade path from conventional to heterogeneous execution without breaking compatibility or introducing measurable penalties in execution time.
2 Program Execution on Heterogeneous Parallel Systems

In order to harness the power of heterogeneous parallel architectures, infrastructures supporting program development for and execution on such systems must comply with the specific requirements of both aspects, parallelism and heterogeneity: they need to enable the exploitation of thread-level parallelism as well as the assignment of compute-intensive kernels to dedicated accelerator hardware, including the configuration of this hardware if reconfigurable technology is used. This field is actively researched, with a number of already existing and currently developed approaches.
A recent example of a heterogeneous system is the MOLEN reconfigurable processor [15]. It employs function-level granularity and offers dynamically alterable coprocessor instructions. Its associated development framework focuses on the generation of integrated applications consisting of a software description, the design and integration of the required accelerator hardware and its integration into the instruction set, as well as the generation of the according software tools and synthesis scripts. It is therefore purely application-centric, and no specific runtime and/or compatibility aspects are considered; ongoing work, however, targets OpenMP interoperability. Several approaches ease application description for heterogeneous and parallel systems. They typically operate on function-level granularity rather than instruction level and typically feature an associated runtime system. The currently most prominent example is CUDA [6], targeting heterogeneous platforms consisting of a host machine running the basic control flow, and GPUs used as highly parallel FP accelerator units. CUDA features an extension to the C language dealing with partitioning and offloading compute kernels to the GPU. These GPUs employ a dedicated memory hierarchy and a hardware thread manager to cope with the high level of parallelism involved. CUDA requires only minimal changes on the software level, basically rewriting accelerable functions using CUDA primitives, therefore providing a smooth upgrade path leading to quick adoption by programmers. This concept is extended by the so-called OpenCL [11] approach, targeting a more flexible distribution of the workload between CPU and GPU, so that individual functions may run on either hardware. The commercial RapidMind [13] platform follows a similar concept, consisting of a gradual language extension (mainly providing a uniform datatype declaration) and an according runtime system. An application is written in a way that enables the framework to create binary representations to be executed on various types of computing nodes, including GPUs, IA32 CPUs, and the Cell BE architecture. A dedicated runtime layer ensures the application task distribution and the interplay of the individual application tasks. The last three approaches do not specifically take application requirements into account, but solely rely on implicit declarations like providing correct data types or following a programmer-defined partitioning. However, these approaches demonstrate how a smooth upgrade path may lead to quick adoption by programmers and integration into operating systems. Intel’s C for Throughput Computing (Ct) [7] focuses on dynamic exploitation of parallelism, improving the use of parallel systems through an automated, dynamic partitioning and orchestration of a running application. To enable this, Ct requires adopting a dedicated C++ library and adjusting the program code accordingly. This approach is a strong point for intelligent, self-partitioning runtime systems; however, in its current form it is not suitable for hardware-constrained systems. Neither does it address the fulfilment of application requirements. The EXOCHI project [16] directly approaches the programming of heterogeneous platforms, featuring a C/C++ environment supporting the creation of a unified application description. This description is compiled into a so-called fat binary containing individual binary representations for a given heterogeneous system.
Using a dedicated, OpenMP-derived runtime system, this binary is then mapped to the individual cores. The Merge framework [9] extends the EXOCHI concept towards dynamic mapping of individual
application parts to the underlying heterogeneous architecture; it uses the map-reduce parallel programming language. EXOCHI/Merge provide an interesting way to achieve unified binaries and exploit parallelism, but again do not address specific application requirements. The approach is furthermore tied to a specific VM and, in the case of Merge, requires the use of a dedicated programming language. The recent Lime project [4,1,14] aims at a unified application description by extending the Java framework and focuses on removing the border between software and hardware: an application written in Java may be either executed in software on a JVM or transformed into a dedicated hardware description. This most interesting approach is again based on a careful extension of an existing programming and execution infrastructure. It is therefore tied to Java and the JVM, which restricts the use of legacy code and might also pose a problem for certain hardware-constrained systems. Of the existing approaches, only very few – typically vendor-provided and platform-restricted – are so far used by a significant number of programmers. What these approaches share is a moderate extension of existing programming methods, e.g. by introducing certain keywords and data types. However, even with such moderate extensions the compatibility with existing code is broken. Other approaches even require the adoption of unusual programming models and/or programming languages. For our universally applicable approach, we therefore define the following cornerstones: compatibility and fulfilment of application requirements. Maintaining source-code and binary compatibility ensures a smooth upgrade path from conventional systems. Hence, the approach must rely neither on a dedicated programming language nor on a dedicated programming model. With respect to the specific requirements of parallel systems, compatibility and interoperability with commonly used parallel programming models such as OpenMP is mandatory. Also, runtime compatibility must be ensured, i.e. using legacy binaries in new hardware-aware environments and vice versa. This can be achieved by using a lightweight embrace-and-extend approach, i.e. the extension of already present system layers and clever exploitation of their properties. Strict fulfilment of specific requirements is mandatory for certain applications: disregarding these will lead to noticeable effects ranging from simple slowdown to breaking the application. This is most notable for all real-time applications such as real-time media streaming, transcoding, or securing, but also vital for certain numerical computations where, e.g., a certain minimum computing accuracy is required for individual program phases. We therefore require compatible ways to express application requirements and ensure their fulfilment. For legacy applications a mechanism is required ensuring the same performance as experienced on these applications’ native systems. Existing approaches typically either completely disregard such requirements or are based on a worst-case scenario, the latter degrading the system’s flexibility in task/thread mapping and overall efficiency. Our approach, to be presented in the following section, is specifically designed to fulfill the above cornerstones. It puts a strong focus on universal applicability, ranging from high-performance computing to embedded systems. It neither breaks existing code nor disables mixed-code execution, and it enables the declaration and fulfilment of application requirements.
As a result of the overall lightweight design, the approach does not impose an execution time penalty compared to native execution.
3 A Lightweight and Universal Approach to Parallel and Heterogeneous Program Execution

In order to achieve the required interoperability, compatibility, and light weight, we carefully extended and augmented existing concepts employed in modern operating systems to fulfill the specific requirements of describing and executing programs on a heterogeneous, dynamically configurable parallel system. Figure 1 shows the basic architecture of our concept spanning the compiler, runtime, and hardware domains. A key part of this architecture is a guided function mapping process taking place during runtime: an application typically decomposes into a control thread and potentially accelerable computing kernels. These kernels are mapped to suitable SW and HW implementation alternatives during runtime, with the according hardware reconfiguration taking place where required. This level of granularity is common to almost all existing approaches. In contrast to these, however, we do not employ an additional runtime layer but rather embrace and extend the OS’s own runtime system. Modern OSes typically employ so-called runtime (or dynamic) linking, performing function resolution if non-statically linked library routines are called. The used binary formats support this dynamic linking process by dedicated data structures; an example
Fig. 1. Concept Outline
for this is the Executable and Linking Format (ELF): ELF executables employ several sections, some of which are dedicated to the resolution of function symbols, most notably the so-called Global Offset Table (GOT). During runtime, these sections are searched for functions to be called and resolved to the resulting address. For later reference, this address is stored in the GOT to speed up later function calls. This process is called lazy linking. An obvious approach therefore is to extend this technique in order to enable dynamic re-linking of already resolved functions upon demand. This approach, depicted in Figure 2, is ideal with respect to compatibility and interoperability as only the linking process is altered. Only the required switching logic needs to be added to the OS which, with modularized kernels, can be done using a loadable kernel module. This process is completely transparent to the programmer and the remaining parts of the OS. Being an extension to the existing linking process, the approach enables compatibility with existing legacy code. By design it is furthermore completely independent of compiler and programming language. However, the GOT-based approach is not thread-safe: being part of the binary, the maintenance structures used for dynamic function resolution are only present per task, i.e. per running application instance. Hence, they offer no possibility to increase the resolution from application/task to thread level. We therefore designed a second, thread-safe approach supporting dynamic function resolution by embedding the according maintenance information into the Task State Segment (TSS). This is possible because the TSS – despite the fact that certain CPUs offer hardware support for TSS management – is entirely managed and maintained by the OS. This approach, dubbed Dynamic Linking System (DLS) and illustrated by Figure 3, employs so-called proxy functions in place of function calls. During runtime, this proxy is dynamically adjusted to the desired function implementation. The number of function pointer substitutions is only limited by system memory and address space, therefore providing a theoretically unlimited number of function pointer substitutions, even with an overlapping set of functions and libraries. Depending on whether runtime addition and removal of further dynamically mappable functions is required, this approach can be realized as two different implementations, resembling static (DLS-SL) and dynamic linking (DLS-DL) of conventional
Fig. 2. Function-mapping using GOT manipulation
Fig. 3. Function-mapping using a TSS-based Proxy System
binaries. In the static approach, DLS employs only the dynamic mapping required for thread-safe function mapping; in the dynamic approach, a dynamic linking mechanism is additionally used for function resolution. Because of the added functionality, the changes to the kernel are slightly more complex; therefore no easy module-based approach is possible, and the changes have to be made in the kernel source code.

3.1 Providing and Achieving Mapping Guidance

In order not to break application constraints, e.g. throughput or computation accuracy, the aforementioned mapping process must strictly adhere to application requirements. This requires mechanisms for providing these requirements on the source-code and binary level, and a control interface by which the runtime mapping process is steered. For the control interface we again embraced already existing techniques: targeting a Unix-based infrastructure for our test implementation, we integrated this control interface using the so-called proc filesystem [10] or procfs; procfs is a pseudo file-system used to access process information from the kernel and is available on most Unix and Linux environments. Through this interface, not only can guidance information be provided for steering the mapping process in order to fulfill application requirements, but specific function implementations may also be added, removed, or selected during runtime. This interface gives the flexibility to control the steering process either manually, script-based, or through a dedicated control daemon. These requirements can either be provided as independent control data or, preferably, be included in the corresponding application and library binaries by employing capabilities of current file formats. The aforementioned ELF format, for instance, enables transparent augmentation by additional sections. This ensures compatible execution, i.e. augmented binaries may be executed on standard systems and vice versa. During runtime, the function resolver evaluates function call requirements against implementation capabilities and chooses from an (ideally) Pareto-optimal selection
based on guidance information stored in the application binary and/or individual libraries. Typically, a programmer would like to provide such guidance information at programming time. This is possible in the code using so-called pragmas. Pragmas are a method to provide additional information to an according compiler infrastructure; this information is only extracted and processed if the compiler knows about the specific pragmas. Otherwise, pragmas do not affect the program generation process and therefore provide a compatible way of augmenting the application description with guidance information.
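As a rough, user-space illustration of the two ideas above, the following C sketch shows a hypothetical guidance pragma attached to an accelerable function and a simple proxy function pointer that a runtime component could re-link between a software and a hardware-backed implementation. The pragma name, the attribute keys, and the switching call are invented for illustration only; the actual DLS mechanism operates inside the kernel on the GOT or TSS structures described above.

/* Hypothetical guidance pragma: ignored by compilers that do not know it.   */
/* Such information could end up in an additional ELF section of the binary. */
#pragma map_guidance(fir_filter, min_throughput = "50MB/s", accuracy = "float")

static long fir_filter_sw(const float *in, float *out, int n);   /* software      */
static long fir_filter_hw(const float *in, float *out, int n);   /* accelerator   */

/* Proxy: all callers go through this pointer; re-linking means overwriting  */
/* it with the address of another implementation at runtime.                 */
static long (*fir_filter)(const float *, float *, int) = fir_filter_sw;

/* Called by a (hypothetical) control daemon or kernel module when the       */
/* accelerator configuration becomes available or is evicted again.          */
void remap_fir_filter(int hw_available)
{
    fir_filter = hw_available ? fir_filter_hw : fir_filter_sw;
}

static long fir_filter_sw(const float *in, float *out, int n)
{
    for (int i = 0; i < n; i++) out[i] = in[i];   /* placeholder computation  */
    return 0;
}

static long fir_filter_hw(const float *in, float *out, int n)
{
    /* would hand the buffers to the FPGA accelerator via a device driver     */
    return fir_filter_sw(in, out, n);
}

int main(void)
{
    float in[8] = {0}, out[8];
    fir_filter(in, out, 8);        /* resolved through the proxy pointer      */
    remap_fir_filter(1);
    fir_filter(in, out, 8);        /* now served by the HW-backed version     */
    return 0;
}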
4 Implementation and Evaluation

Key parts of the described approach were implemented and evaluated on a Linux-based system, demonstrating the applicability and suitability of this approach. Our test setup comprises a general-purpose multicore PC platform enhanced by a dedicated FPGA accelerator card comprising a Xilinx Virtex4-FX100, using state-of-the-art HyperTransport interconnection technology. The software stack is depicted in Figure 4. Here, we see the interplay of the various software layers from application to kernel space down to hardware access. To properly separate these layers, we employ two OS-provided control interfaces: inter-process communication (IPC) based on the aforementioned procfs interface between the runtime part and the control daemon, and hardware device drivers through which this daemon accesses the accelerator hardware. For our test implementation we partitioned the accelerator’s resources in order to obtain a heterogeneous, parallel application accelerator. Hence, the logic resources are provided as 6 individually configurable and accessible slots attached to a static HT-based interface infrastructure [8]. To enable a uniform, configuration-independent interface, dedicated abstraction layers are provided in hardware. With this platform, we could demonstrate that our solution does not show a negative impact on application runtime for standard operation, i.e. when no function mapping occurs [2]. Extending these measurements, we furthermore quantified the latency involved in function switching, as presented in Table 1. In comparison with the set-up latencies for CUDA compute kernels and FPGA hardware reconfiguration shown in Table 2, we can safely state that not only is our approach several orders of magnitude faster, but the cost of re-resolving a function pointer also becomes invisible, as the function change time is dominated by the accelerator hardware setup costs. Using dedicated test applications, we ensured the proper functioning of our infrastructure [12,3]. One test application for this setup was hardware-accelerated 3DES encryption [2]. Depending on the data payload size, either the software or the hardware implementation was selected in order to achieve maximum performance. With this very test
Table 1. Function switching latencies
Mechanism   Latency   Switching   Interface
DLS-SL      1.6 µs    0.5 µs      1.1 µs
DLS-DL      4.8 µs    3.3 µs      1.5 µs

Table 2. Accelerator setup latencies
Setup Type          Latency   Factor
CUDA kernel init.   ∼0.6 s    100,000
FPGA config.        ∼10 ms    2,000
Fig. 4. Software Component Interplay
application, we furthermore successfully demonstrated compatibility with the OpenMP programming model.
5 Conclusion and Outlook

Starting with a requirements analysis and a discussion of related approaches, we motivated an alternative, lightweight and most compatible concept addressing program generation and execution on heterogeneous, dynamically changing parallel systems. By carefully extending existing concepts and techniques employed in modern OSes, our concept enables full compatibility with existing application binaries and libraries, showing no performance degradation compared to native execution. Furthermore, our approach enables full compatibility and interoperability with existing parallel programming models such as OpenMP or MPI. Exploiting the capabilities of modern binary formats, we ensured backwards compatibility with conventional runtime systems not performing any mapping process. We provide implicit and explicit mechanisms to apply guidance information for steering the mapping process. If no such information is applied, a best-effort fall-back strategy is executed. The presented approach is by design language-agnostic and not tied to a specific infrastructure. The applicability and suitability of the concept was demonstrated by a test implementation based on the Linux operating system, using the ELF file format. This solution can easily be transferred to other Unix-based systems and other OSes employing similar runtime systems and binary formats. As of now, the prototype implementation offers all basic building blocks required for dynamic function mapping. Currently under development is an autonomous control daemon which automatically guides the mapping process with respect to the present hardware resources and application requirements.
References
1. Hormati, A., Kudlur, M., Bacon, D., Mahlke, S., Rabbah, R.: Optimus: Efficient Realization of Streaming Applications on FPGAs. In: Proceedings of the 2008 International Conference on Compilers, Architecture, and Synthesis for Embedded Systems (CASES) (October 2008)
2. Buchty, R., Kramer, D., Kicherer, M., Karl, W.: A Light-weight Approach to Dynamical Runtime Linking Supporting Heterogenous, Parallel, and Reconfigurable Architectures. In: Architecture of Computing Systems – ARCS 2009, 22nd International Conference. LNCS, vol. 5455. Springer, Heidelberg (2009)
3. Buchty, R., Kramer, D., Nowak, F., Karl, W.: A Seamless Virtualization Approach for Transparent Dynamical Function Mapping targeting Heterogeneous and Reconfigurable Systems. In: Becker (ed.) ARC 2009. LNCS, vol. 5453, pp. 362–367. Springer, Heidelberg (2009)
4. Bacon, D.F., Rabbah, R.: Liquid Metal (Lime) (August 2008), http://domino.research.ibm.com/comm/research projects.nsf/pages/liquidmetal.index.html
5. Ungerer, T., et al.: Grand Challenges der Technischen Informatik. VDE Verband der Elektrotechnik, Elektronik, Informationstechnik e.V. (2008)
6. Halfhill, T.R.: Parallel processing with CUDA. Microprocessor Report (January 2008)
7. Intel Corp. Ct: C for Throughput Computing (2007-2009), http://techresearch.intel.com/articles/Tera-Scale/1514.htm
8. Kramer, D., Vogel, T., Buchty, R., Nowak, F., Karl, W.: A general purpose HyperTransport-based Application Accelerator Framework. In: Proceedings of the First International Workshop on HyperTransport Research and Applications (WHTRA 2009). Universitätsbibliothek, Heidelberg (2009)
9. Linderman, M.D., Collins, J.D., Wang, H., Meng, T.H.: Merge: a programming model for heterogeneous multi-core systems. In: ASPLOS XIII: Proceedings of the 13th international conference on Architectural support for programming languages and operating systems, pp. 287–296. ACM, New York (2008)
10. Tim Jones, M.: Access the Linux kernel using the /proc filesystem. IBM developerWorks (2006), http://www.ibm.com/developerworks/library/l-proc.html
11. Munshi, A., Sandmel, J.: Data Parallel Computing on Multiple Processors. Patent Publication No. US2008004648 (April 2008), http://www.wipo.int/
12. Nowak, F., Buchty, R., Kramer, D., Karl, W.: Exploiting the HTX-Board as a Coprocessor for Exact Arithmetics. In: Proceedings of the First International Workshop on HyperTransport Research and Applications (WHTRA 2009). Universitätsbibliothek, Heidelberg (2009)
13. RapidMind, Inc. RapidMind Multi-Core Development Platform (2008), http://www.rapidmind.net/
14. Huang, S.S., Hormati, A., Bacon, D., Rabbah, R.: Liquid Metal: Object-Oriented Programming Across the Hardware/Software Boundary. In: Vitek, J. (ed.) ECOOP 2008. LNCS, vol. 5142, pp. 76–103. Springer, Heidelberg (2008)
15. Vassiliadis, S., Wong, S., Cotofana, S.D.: The MOLEN µ-coded Processor. In: Brebner, G., Woods, R. (eds.) FPL 2001. LNCS, vol. 2147, p. 275. Springer, Heidelberg (2001)
16. Wang, P.H., Collins, J.D., Chinya, G.N., Jiang, H., Tian, X., Girkar, M., Yang, N.Y., Lueh, G.-Y., Wang, H.: EXOCHI: architecture and programming environment for a heterogeneous multi-core multithreaded system. SIGPLAN Not. 42(6), 156–166 (2007)
Applying the Stream-Based Computing Model to Design Hardware Accelerators: A Case Study
Frederico Pratas and Leonel Sousa
INESC-ID/IST TULisbon, Rua Alves Redol, 9, 1000-029 Lisboa, Portugal
{fcpp,las}@inesc-id.pt
Abstract. To facilitate the design of hardware accelerators, we propose in this paper the adoption of the stream-based computing model and the usage of Graphics Processing Units (GPUs) as prototyping platforms. This model exposes the maximum data parallelism available in the applications and decouples computation from memory accesses. The design and implementation procedures, including the programming of GPUs, are illustrated with the widely used MrBayes bioinformatics application. Experimental results show that a straightforward mapping of the stream-based program for the GPU into hardware structures leads to improvements in performance, scalability and cost. Moreover, it is shown that a set of simple optimization techniques can be applied in order to reduce the cost and the power consumption of hardware solutions.
1 Introduction
Reconfigurable hardware can be used as a very efficient co-processing solution to accelerate certain types of applications. According to Pareto’s principle, 80% of the time spent executing an application corresponds to only 20% of the code. Although reconfigurable hardware is particularly suitable to accelerate small but computationally intensive kernels, there is no straightforward approach to efficiently map algorithms into hardware. Most of the time, the original algorithms have to be modified or substituted by other algorithms in order to achieve efficient hardware implementations. In fact, mapping algorithms into hardware is still an open research topic [1]. Streaming architectures have shown significant performance improvements over traditional architectures [2]. This programming paradigm exposes most of the application parallelism required to efficiently utilize parallel architectures. Due to the advantages of the stream-based programming approach, even general-purpose CPU architectures have been considered as targets for this type of programming model [3]. Moreover, the development of powerful Graphics Processing Units (GPUs) has led to their programming for general-purpose computation (GPGPU) according to the stream-based programming model. For example, with CUDA [4], it is relatively simple for programmers to create efficient stream-based implementations on NVIDIA GPUs, in contrast with the effort needed to map applications into hardware.
In this paper we propose a new hardware design approach based on the stream programming model, using GPUs at an intermediate phase of the design to simplify the algorithm mapping and achieve efficient hardware implementations. Porting applications to the streaming model, which is a current research topic in itself [3,5], is not the target of this paper. An efficient streaming program is able to exploit most of the existing concurrency, namely by dividing the work into several primitive kernels that can take advantage of the GPU’s numerous simple processing elements to be executed in parallel. Since this approach reveals most of the existing concurrency in order to be efficiently parallelized on the GPU, it naturally leads to scalable hardware architectures. To illustrate the proposed approach for designing hardware accelerators we use MrBayes [6], a bioinformatics program that performs the Bayesian inference of evolutionary (phylogenetic) trees, as a case study. The field of phylogenetic inference deals with the reconstruction of the evolutionary history of a set of organisms based on a multiple sequence alignment of molecular sequence data. The scoring function used in MrBayes is also adopted in other popular programs for phylogenetic inference [7]. While our results are based on MrBayes, the work herein presented is of general interest, as it discusses an efficient technique to map a stream-based program into hardware.
2 MrBayes: A Program for Bayesian Phylogenetic Inference
MrBayes is based on the Maximum Likelihood (ML) model [8], which represents a broadly accepted criterion to score phylogenetic trees. Profiling has shown that the most computationally intensive core of the application consists of two Phylogenetic Likelihood Functions (PLFs), CondLikeDown and CondLikeRoot, which account for more than 85% of the total execution. All current PLF-based programs spend the largest part of their run time, typically around 85-95%, on computing the PLF [9]. Thus, the PLF represents a typical example of a candidate for parallelization at a fine level of granularity. The PLFs are computed on a fixed tree, with the estimation of the branch lengths and the parameters of the statistical model of nucleotide substitution. For DNA data sequences, a model of nucleotide substitution is provided by a 4x4 matrix (denoted as Q and shown in Figure 1). This matrix contains the instantaneous transition probabilities for a certain DNA nucleotide (A - Adenine, C - Cytosine, G - Guanine, or T - Thymine) to mutate into a nucleotide A, C, G, or T. In this application, the extended Γ model [10] is adopted as usual, with 4 discrete rates, r0, ..., r3 (see Figures 1 and 2). To compute the likelihood of a fixed unrooted tree topology with given branch lengths and model parameters, one initially needs to compute the entries of all internal likelihood vectors. They contain the probabilities of observing an A, C, G, or T for each column of the input alignment. Hence, the conditional likelihood vectors “cl” have the same length “m” as the sequences in the input alignment. The PLFs mainly consist of independent for loops with a computational load that depends on
Fig. 1. Nucleotide substitution matrix Q

Fig. 2. Conditional likelihood (cl) vector (detail of one element)

Fig. 3. PLF computation. (a) Primitive kernel:
  Input: cl arrays (Left/Right); substitution matrices Q (Left/Right)
  Output: result clP
  foreach cl element i do
    foreach discrete rate r do
      foreach Q row j do
        calculate the inner products (Q_Left,j, cl_Left) and (Q_Right,j, cl_Right)
      end
      multiply the final arrays
    end
  end
(b) Inner product dependencies
the sequence length (m) and the number of discrete rates (r), as described in Figure 3(a). In each iteration, the functions CondLikeDown and CondLikeRoot multiply the likelihood vector elements by the substitution matrix for each of the defined discrete rates. Thus, considering 4 discrete rates, the computation of a likelihood element requires 4 matrix-vector multiplications, or, in other words, 16 inner products. The inner product can be seen as a reduction, namely multiply and accumulate operations over 2 vectors of 4 floating-point numbers, as depicted in Figure 3(b). The PLF implementation in MrBayes uses single-precision floating-point arithmetic.
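To make the structure of this primitive kernel concrete, the following C sketch computes one conditional likelihood entry from the left and right child entries; each entry holds 4 rates times 4 nucleotide states of single-precision values. The variable names and the array layout are chosen for illustration and do not reflect the actual MrBayes data structures.

#define STATES 4   /* A, C, G, T           */
#define RATES  4   /* discrete Gamma rates */

/* One likelihood entry: for each rate r and Q row i, the result is the
 * product of the left and right inner products, as in Figure 3(a).        */
void cond_like_entry(const float qL[RATES][STATES][STATES],
                     const float qR[RATES][STATES][STATES],
                     const float clL[RATES][STATES],
                     const float clR[RATES][STATES],
                     float       clP[RATES][STATES])
{
    for (int r = 0; r < RATES; r++) {            /* 4 discrete rates           */
        for (int i = 0; i < STATES; i++) {       /* 4 rows of Q                */
            float left = 0.0f, right = 0.0f;
            for (int j = 0; j < STATES; j++) {   /* two 4-element inner prods  */
                left  += qL[r][i][j] * clL[r][j];
                right += qR[r][i][j] * clR[r][j];
            }
            clP[r][i] = left * right;            /* multiply the final arrays  */
        }
    }
}

/* The full PLF applies this kernel to every one of the m alignment columns,
 * which is why each entry, and each of its inner products, can be processed
 * independently, as exploited in the next section.                          */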
3 PLF Stream-Based Computation
As already mentioned, we use a streaming algorithm as a starting point. In our case study, for each PLF invocation, the input data is made available before
processing, provided in a stream format to the processor and, after being processed, returned to memory as a stream of processed data. In the GPU case, the input data is transferred to the device memory, copied to the local memory of the GPU, and then provided in sequence to execute the PLF function in parallel. At the end, the processed data returns to the main memory of the Central Processing Unit (CPU). When considering a GPU architecture and the CUDA API model for the stream-based implementation, there are some particular features to take into account, namely: i) the high amount of parallelism that may be exploited considering the large number of processing elements; ii) the data transfer synchronization procedure is handled automatically by CUDA; iii) data must be stored according to the hierarchical memory structure. To efficiently exploit the GPU resources, the number of threads should be maximized, each thread being responsible for a simple computation. This decomposition is also suitable for hardware design and implementation. In our case study, the parallelization should be performed at the level of each likelihood vector entry, i.e., the computation of each vector entry can be considered as a primitive kernel. Besides, these computations are independent: each thread can be assigned to compute one inner product while, for a given discrete rate, each group of four threads can be assigned to process a stream of likelihood vector entries (see Figures 1 and 2), as depicted in Figure 4(b). In order to maximize our design efficiency we had to properly balance the workload by partitioning the data at three levels: i) global partitioning splits the data when its size is larger than the device memory, thus guaranteeing full scalability; ii) block partitioning evenly distributes the m likelihood array elements among the CUDA blocks, which are executed independently; iii) thread partitioning assigns a group of four threads to compute a stream of inputs. This configuration allows
Fig. 4. GPU data partition and likelihood vector computation and synchronization. (a) GPU data partition example; (b) thread scheduling, serial reduction
to fully parallelize the likelihood vector computation among the several cores of the GPU in a balanced and scalable way. The same parallelization approach is used in both PLFs. The fact that the likelihood vector is organized in multiples of 4 floats can be used to further improve the GPU performance. Since groups of 4 threads are assigned to each likelihood vector discrete rate (an array of 4 floats), memory accesses can be coalesced because the threads access subsequent memory locations [4]. By extending this technique, several groups of 4 threads can access neighboring likelihood vector discrete rates to further improve the performance.
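The per-thread work implied by this partitioning can be written down as a small C function: given a global thread index, it determines which likelihood entry, discrete rate, and Q-matrix row the thread handles and computes the corresponding pair of inner products. In the CUDA version this body would form the per-thread kernel code, with the block and thread partitioning supplying the index; the flat data layout and names below are illustrative only.

#define STATES 4
#define RATES  4

/* Work of one logical thread 'tid' (0 <= tid < m * RATES * STATES): threads
 * are grouped in fours, one group per (entry, rate) pair, so that consecutive
 * threads touch consecutive memory locations (coalesced accesses).          */
void plf_thread(long tid,
                const float *qL, const float *qR,    /* RATES x 4 x 4 each   */
                const float *clL, const float *clR,  /* m x RATES x 4 each   */
                float *clP)                          /* m x RATES x 4         */
{
    int  row   = tid % STATES;                 /* which Q row / output slot  */
    int  rate  = (tid / STATES) % RATES;       /* which discrete rate        */
    long entry = tid / (STATES * RATES);       /* which alignment column     */

    const float *l  = clL + (entry * RATES + rate) * STATES;
    const float *r  = clR + (entry * RATES + rate) * STATES;
    const float *ql = qL  + (rate * STATES + row) * STATES;
    const float *qr = qR  + (rate * STATES + row) * STATES;

    float left = 0.0f, right = 0.0f;
    for (int j = 0; j < STATES; j++) {          /* two 4-element inner products */
        left  += ql[j] * l[j];
        right += qr[j] * r[j];
    }
    clP[(entry * RATES + rate) * STATES + row] = left * right;
}

A host-side loop over all m x RATES x STATES indices reproduces the GPU result sequentially and is a convenient reference when validating the hardware mapping discussed next.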
4
From GPU to Reconfigurable Hardware
The interesting issue is how the stream-based model used by the GPUs can also be mapped into hardware. Our main goal is to use the stream-based CUDA/GPU implementation as an intermediate step allowing the hardware designer to use widely available programming and debugging tools, while abstracting himself/herself from the complex low level details of the hardware implementation. Here we present the procedure to translate the streaming GPU implementation of the MrBayes bioinformatics application presented in Section 3 into hardware. As stated before, the computation of each likelihood vector entry depicted in Figure 4(b), performed independently by each thread, can be identified as the primitive kernel. By considering the floating-point adder and multiply units as unitary, i.e., each of them takes one slot of time to produce a result, the direct mapping of the primitive kernel results in the schedule shown in Figure 5(a). Typical folding procedures [11] can be used systematically to optimize the operations and efficiently implement this process in hardware. For the floating-point operations depicted in Figure 5(a) we can use a single multiplier and adder concurrently; to achieve this solution, some of the steps shown in Figure 5(a) can be collapsed as depicted in Figure 5(b). The direct connection in step S1 of Figure 5(a) can be substituted by an addition with zero, and according to Figure 5(b) it is clear that stages S0 to S4 can be collapsed to reduce the number of floating-point units required, up to the minimum depicted in Figure 5(c); this solution, which has two streams of data as inputs and iteratively accumulates the partial results, has exactly the same latency as the structure in Figure 5(b). Moreover, advantage can be taken of the characteristics of the floating-point units to design more efficient structures; for example, if both the multiplier and the adder are pipelined structures, in our case with two stages, we can merge the right and left inner products required to compute each likelihood entry by interleaving inputs as shown in Figure 5(d). The Final Collapsed Architecture (FCA) is a compact design able to compute the same amount of data as the one in Figure 5(a), but optimized regarding the amount of hardware. At a higher abstraction level, the computation of one likelihood discrete rate array requires the use of 4 FCA units as shown in Figure 6(a). Following the same approach adopted to map the hardware structures from Figure 5(c) to Figure 5(d), if the floating-point units have 8 pipeline stages the 4 FCA units
Fig. 5. Mapping the GPU implementation into hardware: (a) Direct map from the GPU implementation; (b) Concurrent scheduling; (c) Hardware reutilization; (d) Final Collapsed Architecture (FCA)
Fig. 6. Higher abstraction level: (a) Abstract view of 4 units; (b) Collapsed units
can be collapsed to use the same hardware (2 (Left/Right) × 4 (FCA units)) as depicted in Figure 6(b). Once more, the input streams arriving from the substitution matrix have to be interleaved. This final structure computes one conditional likelihood discrete array.
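Behaviorally, each FCA reduces to a multiply-accumulate over the interleaved left and right streams; the C fragment below is only a functional sketch of that arithmetic and does not model the pipelining of the shared floating-point units.

/* Functional sketch of the FCA arithmetic: a single multiplier and a
 * single adder are reused, the accumulators start from the '0' input
 * of Figure 5(c), and the left/right streams are interleaved as in
 * Figure 5(d).  Four such units compute one discrete-rate array.    */
void fca(const float clL[4], const float pL[4],
         const float clR[4], const float pR[4],
         float *left, float *right)
{
    float accL = 0.0f;               /* the addition with zero that    */
    float accR = 0.0f;               /* replaces the direct connection */
    for (int j = 0; j < 4; j++) {
        accL += clL[j] * pL[j];      /* "left" slot of the interleave  */
        accR += clR[j] * pR[j];      /* "right" slot of the interleave */
    }
    *left  = accL;
    *right = accR;
}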
Table 1. Systems Setup. (1) Each Virtex 4 slice contains two 4-input LUTs and two flip-flops, while Virtex 5 slices contain four 6-input LUTs and four flip-flops. (2) Each Virtex 4 DSP contains an 18 x 18 multiplier and a 48-bit adder, while in Virtex 5 each DSP contains a 25 x 18 multiplier and a 48-bit adder.

Characteristics           Baseline    GPU               Virtex 4     Virtex 5     Virtex 5
                          Intel x86   GeForce 8800 GT   LX100        LX110        FX200T
Frequency [GHz]           3.000       0.575             0.450        0.667        0.450
Power [Watts]             65          105               <10          <10          <10
# Cores                   1           112               –            –            –
# Slices(1) / # DSP(2)    –           –                 49152 / 96   17280 / 64   122880 / 384
Technology                45-nm       65-nm             90-nm        65-nm        65-nm
Finally, the hardware in Figure 6(b) can be replicated into a vectorial structure, according to the processing time requirements and/or the cost constraints for the target reconfigurable device. To simplify the hardware design, the number of replicated units should be related to the number of parameters of the rate heterogeneity model adopted for this application. Therefore, for the Γ model, 4 units are needed, each one having its respective input data. This allows each substitution matrix to be assigned to a different structure. Moreover, in order to control the data flow from/to the external memory, a control unit with a simple FSM must be implemented. A single control unit can be used for several units in a SIMD fashion. The procedure described here corresponds to the PLF CondLikeDown. However, it can be equally applied to the CondLikeRoot. The resultant architecture for the CondLikeRoot function has one more input stream, requiring floating-point units with 12 pipeline stages (3 (Left/Right/Parent) × 4 (FCA units)) and one additional multiplier. Hardware structures can be implemented for both PLFs simultaneously, or dynamic reconfiguration can be used to implement one at a time in a reconfigurable device.
5 Experimental Setup and Results
To evaluate our proposal we assessed the performance of the stream-based parallelization of the PLFs on the NVIDIA GPU and on the Xilinx FPGAs [12]. The configuration details for these systems are provided in Table 1. Results for the FPGAs were obtained with the Xilinx ISE 10.1 tools, after Place-and-Route. Timing results were obtained by running the input data sets 5 times and calculating the average. The power values reported are the Thermal Design Point (TDP) power consumption for the processors only. Given that we are comparing the computation only, we do not consider the power consumption of the rest of the systems. The architecture denoted as Baseline, an Intel x86 general-purpose processor at 3.0GHz, is used as the reference system. In this study we use MrBayes version 3.1.2 and, as inputs, simulated DNA test data sets of various sizes, generated with Seq-Gen [13] (v1.3.2). In the rest
of this paper we use a two-number convention to distinguish the data sets: the first number corresponds to the leaves and the second number to the columns (e.g., 20 5K), which are directly related to the number of PLF calls and the input stream size, respectively. Finally, we have also used a subalignment of a real-world phylogenomic alignment of mammalian sequences with 20 organisms, 28,740 alignment columns, and 8,543 distinct column patterns, which is denoted as 20 8543. MrBayes was executed with fixed random number seeds and a fixed number of generations to ensure comparability of results. The results presented in Figure 7 show the execution time (bars) and speedup (line) for the real-world data set on all systems. These results were obtained for implementations with the maximum number of pipelined FCAs that could fit into each FPGA. Relative to the baseline system, each of the four systems used to implement the stream-based solution shows a significant improvement in performance. The results for Virtex 4 and Virtex 5 LX110, which have a similar amount of hardware resources, show that an efficient hardware implementation can be straightforwardly achieved by using the stream-based computing model. The execution times of the CondLikeRoot function are slightly different for the two FPGAs due to the number of FCAs that can be configured in each one (see Table 2). Virtex 5 FX200T achieves a much higher speedup at the cost of area, since 8x more FCAs can be configured in this device, thus increasing the degree
Fig. 7. Total time and SpeedUp for all systems, real data set (20 8543)
Fig. 8. Scalability of Virtex 5 FX200T and comparison with the GPU
Table 2. FPGA results for maximum occupancy

PLF function    Characteristics              Virtex 4     Virtex 5    Virtex 5
                                             LX100        LX110       FX200T
CondLikeDown    Occupation (Slices / DSPs)   15% / 100%   17% / 75%   63% / 100%
                Frequency [MHz]              128          104         54
                # of cycles                  40           40          40
                Maximum # FCAs               8            8           64
CondLikeRoot    Occupation (Slices / DSPs)   15% / 54%    31% / 88%   84% / 95%
                Frequency [MHz]              98           105         56
                # of cycles                  72           72          72
                Maximum # FCAs               4            8           52
of parallelism. The obtained speedup increases by more than 4 times, even with the frequency reduced to half, as shown in Table 2. The results in Figure 8 show the scalability achieved with both the GPU and the Virtex 5 FX200T systems for the different input data sets. The chart in this figure organizes the data sets in four groups, according to their data size and number of iterations. One can observe from the figure that GPU performance is significantly improved with the increase of the data set size. The same does not happen with the Virtex 5 implementation, where a similar performance is obtained for all data sets independently of their size. In GPUs, thread scheduling and block switching have more impact on smaller tests. Table 2 summarizes the results obtained for the different Virtex FPGA models. With the exception of the Virtex 4, where the number of DSPs is not sufficient to accommodate more elements in CondLikeRoot, in all the other cases the required occupation of both logic slices and DSP slices is similar for the two PLFs. In this case it seems that this difference does not have an important impact on the frequency. Moreover, the results of Virtex 5 LX110 and Virtex 5 FX200T show that the frequency is greatly affected by the increase in the number of FCAs. This last effect is related to the increase in the routing overhead when implementing many FCAs and to the FPGA resources being almost 100% occupied. Nevertheless, the ratio between the increase in the number of FCAs and the frequency reduction is 4. Moreover, for any of the FPGA implementations the power consumption is one order of magnitude lower than in the GPU and in the baseline processor.
6 Conclusions
In this work we proposed a simple approach to design hardware accelerators based on the stream-based computing model, using GPUs in an intermediate phase of the design. Using our proposal, hardware designers can develop efficient stream-based solutions, program them on a GPU using the available tools for tuning and debugging, and efficiently map the obtained solution into hardware. Since this approach reveals most of the program's concurrency in order to parallelize it on the GPU, it naturally leads to scalable hardware architectures. We adopted MrBayes, a Bioinformatics program that performs the Bayesian inference of evolutionary (phylogenetic) trees, as a case study to illustrate our
proposal, but the results are applicable to a wider range of applications. The experimental results show that a straightforward mapping of the stream-based program for the GPU leads to an efficient hardware implementation of the processing kernels, with improvements in performance and cost. The results also show that the proposed approach is scalable and that the efficiency of the achieved hardware structures is almost independent of the input data set size. Based on the obtained results, software tools can be developed to assist the design of hardware architectures by adopting the stream-based computing model.
Acknowledgment Acknowledgments are due to Dr Alexandros Stamatakis for providing the original MrBayes software and input data sets used in this paper. Thanks are also due to Dr Pedro Trancoso for a critical reading of this paper.
References 1. Plishker, W., Sane, N., Kiemb, M., Anand, K., Bhattacharyya, S.: Functional DIF for Rapid Prototyping. In: RSP, pp. 17–23 (2008) 2. Dally, J., Labonte, F., Das, A., Hanrahan, P., Ahn, J., Gummaraju, J., Erez, M., Jayasena, N., Buck, I., Knight, J., Kapasi, U.: Merrimac: Supercomputing with Streams. In: Proc. of the Int. Conf. on Supercomputing, USA (2003) 3. Gummaraju, J., Rosenblum, M.: Stream Programming on General-Purpose Processors. In: MICRO 2005, Washington, DC, USA, pp. 343–354. IEEE Computer Society, Los Alamitos (2005) 4. Lindholm, E., Nickolls, J., Oberman, S., Montrym, J.: NVIDIA Tesla: A Unified Graphics and Computing Architecture. In: MICRO 2008, March 2008, vol. 28(2), pp. 39–55 (2008) 5. Thies, W., Karczmarek, M., Amarasinghe, S.: StreamIt: A Language for Streaming Applications. In: Horspool, R.N. (ed.) CC 2002. LNCS, vol. 2304, pp. 179–196. Springer, Heidelberg (2002) 6. Ronquist, F., Huelsenbeck, J.: MrBayes 3: Bayesian Phylogenetic Inference Under Mixed Models. Bioinformatics 19(12), 1572–1574 (2003) 7. Stamatakis, A., Ludwig, T., Meier, H.: RAxML-III: A Fast Program for Maximum Likelihood-based Inference of Large Phylogenetic Trees. Bioinformatics 21(4), 456– 463 (2005) 8. Felsenstein, J.: Evolutionary Trees from DNA Sequences: A Maximum Likelihood Approach. Journal of Molecular Evolution 17, 368–376 (1981) 9. Ott, M., Zola, J., Stamatakis, A., Aluru, S.: Large-scale Maximum Likelihoodbased Phylogenetic Analysis on the IBM BlueGene/L. In: On-Line Proc. of IEEE/ACM Supercomputing Conf. (2007) 10. Yang, Z.: Maximum Likelihood Phylogenetic Estimation from DNA Sequences with Variable Rates Over Sites. Journal of Molecular Evolution 39, 306–314 (1994) 11. Parhi, K.: VLSI Digital Signal Processing Systems. Wiley, New York (1999) 12. Xilinx: Virtex-5 Family Overview. Xilinx Product Specification (February 2009) 13. Rambaut, A., Grass, N.: Seq-Gen: An Application for the Monte Carlo Simulation of DNA Sequence Evolution along Phylogenetic Trees. C. App. in BioSc. 13, 235– 238 (1997)
Reconfigurable Multicore Server Processors for Low Power Operation Ronald G. Dreslinski, David Fick, David Blaauw, Dennis Sylvester, and Trevor Mudge University of Michigan, Advanced Computer Architecture Laboratory, 2260 Hayward, Ann Arbor, MI, 48109 {rdreslin,dfick,blaauw,dennis,tnm}@eecs.umich.edu
Abstract. With power becoming a key design constraint, particularly in server machines, emerging architectures need to leverage reconfigurable techniques to provide an energy-optimal system. The need for a single-chip solution that fits all needs in a warehouse-sized server is important for designers. It allows for simpler design, ease of programmability, and part reuse in all segments of the server. A reconfigurable design would allow a single chip to operate efficiently in all aspects of a server, providing both single-thread performance for tasks that require it and efficient parallel processing that helps to reduce power consumption. In this paper we explore the possibility of a reconfigurable server part and discuss the benefits and open questions still surrounding these techniques. Keywords: Reconfigurable, Low Power, Server Architectures.
1 Introduction The exponential growth of the web has yielded an equally dramatic increase in the demand for server style computers. According to IDC the installed base of servers will exceed 40 million by 2010. In fact, these figures may be conservative as there is a continual flow of unanticipated applications coming online. For example, Facebook, which is only 3 years old, is expected to grow from 1,000 to 10,000 servers in a year. The growth in servers has been accompanied by an equally dramatic growth in the demand for energy to power them. Furthermore, the cost of this power and its associated cooling is approaching the cost of the servers themselves [1]. For example, it is estimated that the five largest internet sites consume at least 5MWh of power each [2]. Meanwhile, through technology advancements and new design techniques such as 3D die stacking, Moore’s law continues to hold. This means there is an increasing number of transistors available on a chip. However the power allocated for those transistors remains constant. This power envelope means that either new, low power architectures need to be developed, or the additional transistors supplied by Moore’s law will go unutilized. At the same time the need for a chip that satisfies all applications within a server environment is critical. Designers prefer to use the same chip for all portions of the server to ease the programming constraints as well as keep cost and maintenance to a
minimum. Simply creating a low power chip capable of handling only a portion of the workloads required in a server will not be economically viable unless current designers rework their approach. Therefore, there is a need for a system that supports both low power throughput computing and fast single-threaded performance to address the coming problems in large scale servers. To address this we propose a system leveraging near-threshold voltage scaling and parallel processing to reduce the power consumption of the throughput-oriented applications in a server. At the same time we employ reconfigurable techniques to adapt to the response-time requirements of throughput computing or to handle applications where single-threaded performance is critical. This reconfigurability will also rely heavily on the ability of the OS to measure and adapt the chip to reduce power consumption while still maintaining the needed performance. In this paper we explore an example architecture and some of the difficulties associated with the OS scheduling. We present some early solutions and point to future research directions for remaining problems.
2 Reconfigurable Architecture Although the techniques discussed in this paper are not directly tied to a particular architecture, the following processor description will serve as an example for illustration throughout the rest of the paper. In Figure 1, we present our example reconfigurable architecture, which has several interesting design points that we will discuss in the following subsections. The basic design is a machine with 66 cores. Each core is tied to the same ISA so that code can be migrated freely between cores without concern for the ability of any given core to complete the task.
Fig. 1. Example reconfigurable server architecture, using clustering techniques
2.1 Core Types There are two basic core types in the design. The first core type is a simple, in-order execution core. This core is replicated 64 times across the architecture and is to be used in highly parallel tasks to reduce energy consumption. In Figure 1 it is the core replicated 4 times in each cluster. The core itself is designed to run at a range of frequency/voltage pairs. Depending on computational demands the cores can be scaled to offer faster performance, but with increased power consumption. At the lowest end of operation the cores operate in what we term the near-threshold (NTC) operating region [3]. This region is where the supply voltage is at or just above the threshold voltage of the transistor. In this region cores see approximately a 100x reduction in power with only a 10x reduction in performance, resulting in a 10x reduction in energy. The processor is not operated at subthreshold [4,5] voltage levels, due to the poor energy/performance tradeoff that occurs at this point. See Figure 2 for a view of the tradeoffs of delay and energy in different supply voltage regions.
Fig. 2. Energy and delay for different supply voltage operating regions
The second core type is a more complex out-of-order core. This core is designed to operate only at full voltage and is used to perform single-threaded tasks that cannot be parallelized, or time-critical tasks that would take too long on the simple cores. In a server farm many tasks still require single-thread performance, and any single-solution chip needs to be able to provide this performance in addition to any energy-saving parallel cores it offers. These cores can either be enabled or power gated when not needed to reduce energy. Depending on the thermal characteristics of these cores, nearby simple core clusters may need to be disabled while operating these cores.
2.2 Clustered Architecture In the work done by Zhai et al. [3,6], they propose the use of parallelism in conjunction with NTC operation to achieve an energy efficient system. While traditional superthreshold many-core solutions have been studied, the NTC domain presents unique challenges and opportunities for architects. Of particular impact are the reliability of NTC memory cells and the differing energy-optimal points for logic and memory, as discussed below. Zhai’s work showed that SRAMs, commonly used for caches, have a higher energy-optimal operating voltage than processors, by approximately 100mV [3]. This results from the lower activity in caches, which amplifies leakage effects. SRAM designs also face reliability issues in the NTC regime, leading to a need for larger SRAM cells or error correction methods, further increasing leakage and the energy-optimal operating voltage. Due to this higher optimal operating voltage, SRAMs remain energy efficient at higher supply voltages, and thus at higher speeds, compared to logic. Hence, there is the unique opportunity in the NTC regime to run caches faster than processors for energy efficiency, which naturally leads to architectures where multiple processors share the same first level cache.
Fig. 3. Cluster Based Architecture
These ideas suggest that we create an architecture with n clusters, each with k cores, where each cluster shares a first level cache that runs k times faster than the cores (Figure 3). Different voltage regions are presented in different colors and use level converters at the interfaces. This architecture results in several interesting tradeoffs. First, applications that share data and communicate through memory, such as certain classes of scientific computing, can avoid coherence messages to other cores in the same cluster. This reduces energy from memory coherence. However, the cores in a cluster compete for cache space and incur more conflict misses, which may in turn increase energy use. This situation can be common in high performance applications where threads work on independent data. However, these workloads often execute the same instruction sequences, allowing opportunity for savings with a
clustered instruction cache. Initial research on this architecture [3,6] shows that with a few processors (6-12), a 5-6X performance improvement can be achieved. Since the caches are operating at a frequency that is higher than that of the cores, it is possible to turn off some of the cores in the cluster, clock the remaining cores at a higher frequency, and not have to change the cache frequency. This is beneficial because the SRAM needs to be validated at each operating frequency and the timing of signals, particularly those relating to the sense amplifier, is sensitive to timing changes, whereas core logic timing scales predictably with voltage. In our example architecture we present a system with clusters of size 4, meaning there are 4 cores in the cluster and the cache is operated at 4x the frequency of the cores. We propose to allow each cluster to be operated in 4 modes. These modes correspond to the number of cores being operated at a particular time, while the remaining cores are power gated off. Figure 4 shows how each of the 4 configurations would look, assuming a 75MHz NTC operating point when all 4 cores are enabled. The benefit of using these modes is that the system is able to trade off power for response time. So if the system is given a response time constraint, the OS can adapt the number of cores and the frequency to achieve the energy-optimal solution. Figure 5 shows a simulation run using the M5 [7] full system simulator for a cluster of size 4. The benchmark being run is a simplified version of the SpecWeb [8] benchmark, in a system with a 64kB clustered ICache and a 256kB clustered DCache. As can be seen, the throughput in the system remains nearly constant, but the response time of the system can be reduced at the expense of additional power.
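As an illustration only, the C fragment below sketches how an OS policy could use such a mode table to meet a response-time constraint at minimal power; the frequencies follow one plausible reading of the 75MHz example (cache fixed at 4x75MHz, core frequency scaling with the number of active cores), while the cost model and all names are placeholders.

/* Hypothetical mode-selection sketch: pick the lowest-power cluster
 * mode whose estimated response time still meets the constraint.    */
struct cluster_mode {
    int active_cores;        /* cores left powered on                */
    int core_freq_mhz;       /* per-core frequency in this mode      */
};

static const struct cluster_mode modes[4] = {
    { 4,  75 }, { 3, 100 }, { 2, 150 }, { 1, 300 }   /* placeholder  */
};

int pick_mode(double work_cycles, double max_response_ms,
              double (*power_estimate)(const struct cluster_mode *))
{
    int best = -1;
    double best_power = 1e30;
    for (int m = 0; m < 4; m++) {
        /* crude per-thread response-time estimate, in milliseconds  */
        double resp_ms = work_cycles / (modes[m].core_freq_mhz * 1e3);
        if (resp_ms > max_response_ms)
            continue;                        /* misses the constraint */
        double p = power_estimate(&modes[m]);
        if (p < best_power) {                /* lowest-power feasible */
            best_power = p;
            best = m;
        }
    }
    return best;   /* index into modes[], or -1 if none is feasible  */
}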
Fig. 4. Modes of operating a 4 core cluster with 75MHz NTC frequency
Fig. 5. Tradeoff of response time and power for a cluster based architecture
3 Thread to Core Mappings and Thread Migration To fully utilize the capabilities of the reconfigurable system the operating system will need to carefully orchestrate the mappings and migrations of threads to cores. The system will need to have a set of constraints in terms of the desired performance, and the operating system will perform some heuristic mapping and migration schemes to optimize the system for power consumption. In the following subsections we detail some of the decisions the operating system will need to make and present some examples of performance metrics that can be used to guide these decisions.

3.1 Large Data Cache Requirement Since the cores within a cluster share the same L1 cache space, there is the potential for threads running on cores within the same cluster to thrash each other's data. In order to prevent this from impacting the overall runtime, the OS will need to detect these competing threads and migrate them to different clusters. Preferably this will be done by collocating threads with large data cache requirements with threads having small cache demands, or on clusters where fewer cores are enabled, reducing the cache pressure. One possible detection scheme for this condition would be a performance counter that tracks the number of cache evictions that a particular thread causes of data that belonged to a different thread. This technique would employ 2 bits of overhead on each cache line to distinguish the core for which the cache line was originally fetched. On an eviction, if the core is not the one that originally fetched the data, a counter is incremented for the core causing the eviction. When the
OS reads these counters it will be able to get an approximation of the negative impact a particular thread has on others in the same cluster, allowing a decision to be made about thread migration.

3.2 Shared Instruction Stream/Data In some cases threads can either run the same instruction streams, i.e. SIMD, or may operate on the same pieces of shared data. In these cases it is beneficial for the OS to map these threads to the same cluster. There are several advantages to doing so. First, there are fewer cluster-based evictions due to the sharing of the cache line. Second, in the case of data, it avoids the costly process of moving data around when multiple cores wish to modify the data. Third, the threads act as prefetchers for each other, reducing the latency of the system. As in the previous case, Section 3.1, the same 2-bit field can be added to each cache line noting the core in the cluster that fetched the cache line. 10 counters can be used to keep track of pairs of cores that shared a line. If a cache line is ever read by a core different from the one that fetched the data, the corresponding counter is incremented. This provides the OS with an indication, for the threads currently scheduled in a cluster, of the amount of sharing that is taking place. To detect threads that are not currently in the same cluster but would benefit from being in the same cluster, a more complex scheme would need to be designed on top of the coherence protocol to detect these conditions. This is left for future work.

3.3 Producer/Consumer Communication Patterns In some programming models there are producer/consumer data relationships. This is where one thread's output is constantly the input to another thread. In these types of patterns it is beneficial for the OS to migrate the threads to the same cluster. By doing so, the consumer can avoid going out of the cluster to get the data produced by the producer. Depending on the interconnect in the system, if it is not possible to put them in the same cluster, then there is also benefit in putting them in clusters near each other. For example, in a network-on-chip style interconnect, having them close reduces the number of cycles it takes to transfer the data, and it relieves congestion on the network. An elegant solution to detecting these communication patterns has not yet been worked out, but a possible solution in a directory-based coherence machine might involve tracking some read/write patterns on cache lines at the home node. This again is left for future work.

3.4 Single Thread Performance Some threads will require more performance because they may lie on the critical path of execution. The OS needs to identify these threads and migrate them either to clusters with fewer cores enabled, and thus a faster frequency, or in the extreme case to one of the complex out-of-order cores in the system. These threads can hopefully be identified by either the programmer or the compiler, but in some cases may rely on hardware feedback. In most cases these threads tend to serialize the operation of the machine, so in situations where few threads are running, identifying the ones that have been running the longest may indicate which threads need to be migrated to faster cores. The OS may need to migrate threads to the complex cores for short
periods of time and measure overall utilization to determine if there is a positive impact of running the threads at these locations.
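The counter schemes of Sections 3.1 and 3.2 can be summarized with the following software model; the pair encoding and all structure names are one possible reading of the description, not a specification of the hardware.

/* Illustrative model of the proposed detection counters: every cache
 * line carries a 2-bit tag naming the core that fetched it; evicting
 * another core's line bumps the evictor's counter (Section 3.1), and
 * reading another core's line bumps the counter of that core pair
 * (Section 3.2).  With 4 cores, 10 counters cover all core pairs.    */
#define CORES 4

struct cluster_counters {
    unsigned thrash[CORES];      /* cross-thread evictions caused     */
    unsigned share[10];          /* sharing events per pair of cores  */
};

static int pair_index(int a, int b)   /* map unordered pair to 0..9   */
{
    if (a > b) { int t = a; a = b; b = t; }
    return a * CORES - a * (a - 1) / 2 + (b - a);
}

void on_eviction(struct cluster_counters *c, int owner_tag, int evictor)
{
    if (owner_tag != evictor)
        c->thrash[evictor]++;    /* this thread hurt another thread   */
}

void on_read(struct cluster_counters *c, int owner_tag, int reader)
{
    if (owner_tag != reader)
        c->share[pair_index(owner_tag, reader)]++;
}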
4 Conclusions With power becoming a major concern, particularly in servers, new energy-optimal architectures need to be explored. In this paper we looked at the use of reconfigurable architectures to provide a single chip solution for a broad range of targets. When parallelism is abundant the system can adapt and save large amounts of energy; at the same time, when single-thread performance is the bottleneck, the system can be reconfigured to provide the necessary performance. The architecture explored in this paper employed near-threshold techniques, L1 cache clustering, and heterogeneous design (in-order and out-of-order cores) to achieve extremely energy efficient computation. The paper also looked forward at the difficult task the OS will have in managing the thread to core mapping for energy optimality, and proposed some initial techniques that might be employed by the OS.
References 1. Lim, K., Ranganathan, P., Chang, J., Patel, C., Mudge, T., Reinhardt, S.: Understanding and Designing New Server Architectures for Emerging Warehouse-Computing Environments. In: Proceedings of the 35th ISCA, pp. 315–326 (2008) 2. Report to Congress on Server and Data Center Energy Efficiency, US Environmental ProtectionAgency, http://www.energystar.gov/ia/partners/prod_development/ downloads/EPA_Datacenter_Report_Congress_Final1.pdf 3. Zhai, B., Dreslinski, R.G., Mudge, T., Blaauw, D., Sylvester, D.: Energy efficient nearthreshold chip multi-processing. In: ACM/IEEE ISLPED, pp. 32–37 (2007) 4. Zhai, B., Blaauw, D., Sylvester, D., Flautner, K.: Theoretical and practical limits of dynamic voltage scaling. In: ACM/IEEE Design Automation Conference, pp. 868–873 (2004) 5. Wang, A., Chandrakasan, A.: A 180mV FFT processor using subthreshold circuit techniques. In: IEEE International Solid-State Circuits Conference, pp. 292–529 (2004) 6. Dreslinski, R.G., Zhai, B., Mudge, T., Blaauw, D., Sylvester, D.: An Energy Efficient Parallel Architecture Using Near Threshold Operation. In: Proceedings of the 16th PACT, pp. 175–188 (2007) 7. Binkert, N.L., Dreslinski, R.G., Hsu, L.R., Lim, K.T., Saidi, A.G., Reinhardt, S.K.: The M5 Simulator: Modeling Networked Systems. IEEE Micro. 26(4), 52–60 (2006) 8. SpecWeb99 benchmark, http://www.spec.org/web99
Reconfigurable Computing in the New Age of Parallelism Walid Najjar and Jason Villarreal Department of Computer Science and Engineering University of California Riverside Riverside, CA 92521, USA {najjar,villarre}@cs.ucr.edu
Abstract. Reconfigurable computing is an emerging paradigm enabled by the growth in size and speed of FPGAs. In this paper we discuss its place in the evolution of computing as a technology as well as the role it can play in the current technology outlook. We discuss the evolution of ROCCC (Riverside Optimizing Compiler for Configurable Computing) in this context. Keywords: Reconfigurable computing, FPGAs.
1 Introduction Reconfigurable computing (RC) has emerged in recent years as a novel computing paradigm, most often proposed as complementing traditional CPU-based computing. This paper is an attempt to situate this computing model in the historical evolution of computing in general over the past half century and, in doing so, define the parameters of its viability, its potentials and the challenges it faces. In this section we briefly summarize the main factors that have contributed to the current rise of RC as a computing paradigm. The RC model, its potentials and challenges are described in Section 2. Section 3 describes ROCCC 2.0 (the Riverside Optimizing Compiler for Configurable Computing), a C-to-HDL compilation tool whose objective is to raise the programming abstraction level for RC while providing the user with both a top-down and a bottom-up approach to designing FPGA-based code accelerators. 1.1 The Role of the von Neumann Model Over a little more than half a century, computing has emerged from non-existence, as a technology, to being a major component in the world’s economy. It has a profound impact on the daily life of a large number of this planet’s inhabitants. Such a rapid evolution of a technology is unprecedented in human history. Probably, the single most important enabling factor of this emergence has been the von Neumann, or stored program, model of computation where a single storage unit holds both instructions and data. Prior to the proposal of this model by Alan Turing and John von Neumann, “computers” relied on a fixed program architecture where re-programming involved re-wiring or the re-setting of switches.
In this model a program is expressed as a sequence of instructions stored in memory. A Central Processing Unit (CPU) fetches the instructions from memory and executes them. The execution of an instruction implies a predetermined order of operations. Each instruction is assumed to be atomic, i.e. uninterruptible, and sequential, i.e. no instruction is started before the previous one is completed. The program dynamically determines the ordering of instructions. The stored program model provides the common conceptual framework upon which all the elements of modern computer systems are built: architectures, microarchitectures, languages, compilers, operating systems, applications, I/O devices, all refer to this common framework. Instruction Set Architectures (ISAs), reflecting this model, quickly defined the boundary between the software and hardware realms. Later, micro-architectures provided a variety of structural implementations within the same ISA. The model is implicit in the definition of all imperative languages (e.g. FORTRAN, Algol, Pascal, C, C++, Java etc.) as evidenced by expressions such as I = I+1, where I points to a memory location rather than a mathematical variable. It is impossible to over-emphasize the role this common framework has played in the evolution of computing as we have experienced it. However, the stored program model has its limitations: The von Neumann bottleneck refers to the limited bandwidth between the memory and CPU. This limitation has given rise to cache architectures and virtual memory. The sequential execution severely limits the potential parallelism in programs. Efforts to overcome this limitation include instruction level parallelism in the micro-architecture, macro-level parallel architectures such as SIMD, MIMD, vector machines, SPMD etc. These limitations have provided the main impetus to the vast and very fruitful research efforts in computer architecture, micro-architecture, parallel architectures, compilers, language designs, algorithms, operating systems etc. This work has provided a tremendous insight into the nature of computing using this paradigm, which was translated into a tremendous improvement in performance.

1.2 The Role of Moore’s Law Over the past 50 years, the achievable computing performance increased by more than 10 orders of magnitude! This was the VLSI (Very Large Scale Integration) revolution. It was driven by two factors: (1) Moore’s Law, which stated that the number of transistors on a die would double approximately every two years, and (2) the shrinking of the feature size of transistors, which resulted in a dramatic increase of the affordable clock frequency. This dramatic performance increase was accompanied by a comparable decrease in energy expended, cost and volume. The side effect of the VLSI revolution was that the thriving research in parallel computing models, algorithms, programming languages, parallelizing compilers and parallel architectures that had flourished in the 70s, 80s and early 90s came to a standstill and died. A very large number of companies, and an even larger number of research projects, that had developed platforms on various parallel models (SIMD,
MIMD, SMP, message passing and SPMD) folded because it was impossible to compete with Moore’s Law. Today, VLSI technology has reached a point where the shrinking of the feature size of transistors faces major technological barriers. The number of dopant ions in a single device has become so small that it poses a risk to the stability of the device over time. Furthermore, the sheer cost of new fabrication lines has become prohibitively high. However, the number of transistors on a die keeps on increasing in the form of multi-cores and many-cores. This phenomenon is ushering in the New Age of Parallelism. In this new age of parallelism we see the same, or very similar, topics being addressed in the context of multi-core computing and networks on a chip.

1.3 Field Programmable Gate Arrays Gate arrays started life as devices intended for use as “glue logic” to replace the large numbers of LSI chips that played this role with one single device per board. As their size and speed grew, FPGAs evolved into platforms used for functional verification and rapid prototyping and then for rapid time to market. With the widening range of FPGA applications came new features in the FPGA architectures: embedded DSP cores, CPU cores, on-chip Block RAM etc. Even though research in FPGA-based hardware code acceleration has been carried out for over 10 years, this application is a relatively new comer and has not yet had any major impact on the overall market for FPGAs and hence on their internal architectures.
2 Reconfigurable Computing, Potentials and Challenges There is no formal definition of reconfigurable computing. However, the generally agreed upon definition is that of a hardware structure whose functionality is defined by the user. One can argue, and some have, that an ALU is a reconfigurable structure as it can be an adder or a multiplier etc. These functionalities, however, have been defined at design time and cannot be changed. In the late 50s and early 60s, Gerald Estrin of UCLA proposed the fixed plus variable model where a programmable hardware structure acted as a co-processor to a fixed datapath CPU [1,2,3]. This work is sometimes recognized as the early version of a reconfigurable computing structure. 2.1 The Role of FPGAs There is no doubt that without the very rapid increase in size and speed of available FPGA devices, riding the curve of Moore’s Law, there would not be any discussion of reconfigurable computing. They have provided the components used in the very early reconfigurable computing machines such as the Splash 1 and Splash 2 [4]. It has often been argued that the architecture of FPGAs is too fine grained and hence their use in reconfigurable computers implies a significant amount of overhead as opposed to coarser granularity architectures. This is true. However, FPGAs have two fundamental advantages: (1) they are available now, and (2) any coarser granularity must target a specific subset of applications and hence would suffer similar, if not larger, overhead on other applications. There have been a number of academic and
industry projects that have set out to develop coarse-grained reconfigurable architectures. They faced two major obstacles:
1. Competing with the VLSI revolution and Moore’s Law. During the time it takes to conceive of a design, develop it and have it fabricated, the performance of CPUs and DSPs has increased at least by a factor of two and the size and speed of FPGAs have increased by a similar factor.
2. Developing a suitable tool chain for their programmability. This is the most challenging obstacle. These tools have to parallelize a source code written in a sequential language (C/C++), partition it and map it on an array of processors and schedule the execution. While some very significant breakthroughs have been achieved these remain daunting obstacles.
However, the main challenge faced by these efforts is application specialization. It is clear that for a narrowly defined set of applications one can develop a coarse-grained architecture that outperforms FPGAs. However, as the set of applications gets broader, the efficiency advantages of this architecture diminish.
2.2 Applications of Reconfigurable Computing Applications of reconfigurable computing today can be divided into two broad categories:
• Embedded systems. Including high-performance embedded systems, air- or space-borne computers, telecommunication machinery etc. In these systems the vendors develop the applications and the end user is generally not expected to develop new applications.
• High performance computing. These systems attempt to leverage the tremendous potential of high-end FPGAs to deliver a speed-up, over traditional processors, of several orders of magnitude.
While this distinction is not always clear cut, these two application domains have greatly varying requirements. In the first, the cost and power consumption are primary considerations. In the second, the speed-up must be large enough to justify the added costs of a large number of high-end FPGA devices.
2.3 Potentials A survey of all the applications that have been tried on reconfigurable computing platforms and the achievable speed-ups reported is beyond the scope of this paper. However, we will discuss why and how such speed-ups are achieved to gain a better insight in developing solutions to the challenges facing this model. In doing so we will report on the analysis done in [5]. In [5] the authors present an analysis of the speed-up that can, and often is, achieved by mapping computations on an FPGA through which data is streamed. The analysis points to the main sources of the speed-up:
• The elimination of overhead operations. These are instructions in a loop body that manage data movement to/from the memory, index calculations and control operations.
• The parallelism that can be achieved on the FPGA. Typically done by unrolling loop bodies.
Other factors, not addressed in that paper, include the folding of constants into the computation, the tailoring of the data bit width, the distribution of data storage on the data path, deep pipelining, multiple clock domains etc. The first factor of the speed-up is inherent in the von Neumann model. Decoupling the fetching of the data from the computation proper eliminates it. This advantage is mitigated or eliminated when the data must be accessed randomly in a pattern determined dynamically at run time. The streaming of the data also eliminates a number of pipeline bubbles due to cache misses (data and instruction), control operations etc. Reported experimental measurements show this inefficiency factor to be about one order of magnitude. The clock frequency that can be achieved on an FPGA is typically about an order of magnitude smaller than that of a CPU. It is compensated, however, by the inefficiency factor itself. The parallelism that can be achieved on an FPGA is due, primarily, to the sheer size of these devices (FPGAs are the largest chips being fabricated) and is very substantially enhanced by optimizations that shrink the size of the circuits such as constant folding (i.e. using logic instead of registers), variable bit width, distributed storage etc. The degree of parallelism reported in research papers is typically measured in orders of magnitude. It is the primary factor of the measured speed-up over a microprocessor. In spite of their tremendous potential, FPGAs and reconfigurable computing in general are not yet in the mainstream as computing platforms. The adoption of this new paradigm hinges on the mitigation of a number of challenges. One non-technical obstacle is the perception of FPGAs as exotic devices. Compared to microprocessors and DSPs, FPGAs are relative newcomers. The research into their use for general applications is less than a decade old. They are not covered or used in traditional computer science curricula. However, addressing the other challenges can lift this obstacle. 2.4 Challenges Programmability and Tool Chain. The programming of FPGAs is typically done using HDLs (Hardware Description Languages) and requires a solid background in logic and circuit design. Furthermore, the programming tool chain is long and complex when compared to the simple compilation step of traditional languages. HDLs rely on a radically different paradigm than imperative languages: they describe reactive entities, support timing and concurrency. A wider acceptance of FPGAs as code accelerators requires the existence of a programmability model that leverages the popularity of widely used programming languages such as C/C++ and Java as well as a programming tool chain that is capable of abstracting away the details and intricacies of FPGA architectures.
Algorithms and Applications. Ideally, the porting of an application code to an FPGA accelerator would involve the compiling of the frequently executed code segments (typically a loop nest) to hardware. However, this is hardly the case. Most high-performance application codes have been thoroughly optimized to perform well on sequential machines and do not lend themselves to such a simple porting. Most often a complete re-working of the algorithm and application code is necessary to achieve a speed-up that justifies the added costs of an FPGA platform. FPGA acceleration is ideally suited for applications where large amounts of data can be streamed through a circuit. Furthermore, the computation that is implemented on the circuit cannot rely on a large data structure, as is often the case on a von Neumann platform. Very large hash tables are a most typical example. Access to such a table from an FPGA would eliminate all potential for parallelism on the device as memory accesses would have to be serialized. What is required, therefore, is either a restructuring of the existing algorithm or the development of a new one that (1) supports the streaming of data and (2) relies on a relatively small state space for the computation.
3 Generating Hardware Accelerators – ROCCC 2.0 The Riverside Optimizing Compiler for Configurable Circuits (ROCCC) was designed to provide an alternative to hand coding hardware for FPGAs without major loss of performance. ROCCC is an optimizing compiler designed to create hardware accelerators of software systems by translating kernels of C code into highly parallel and optimized hardware circuits. ROCCC performs many high level and parallelizing optimizations on C code, including all standard compiler transformations on a much greater scale than a compiler for a von Neumann machine. The objectives of these optimizations are the maximization of the parallelism in the generated circuit, the maximization of the clock rate of the generated circuit, and the minimization of the number of off-chip memory accesses. ROCCC is not designed to compile complete systems but instead to create highly efficient hardware accelerators for use in large systems. 3.1 Philosophy of ROCCC 2.0 For every high level algorithm and system there are a near-infinite number of ways to express the execution in C. Although each C algorithm can be expanded and converted into hardware in numerous ways, the best software algorithm is rarely the best hardware implementation. An inefficient hardware implementation may slow down total system execution, as described in [6] while exploring the hardware acceleration of molecular dynamics implementations. When a good hardware implementation is known, the C system code has to be structured differently and the compiler requires heavy guidance on what to generate. This varies for every application and made the implementation and use of the ROCCC compiler overly complex. Additionally, FPGAs exist on a wide variety of platforms. When the compiler has full knowledge of the target platform, the code generation and optimization ordering become completely intertwined and unmanageable. A new platform would require not only a new port, but also a complete retuning of each optimization to match the available resources.
These observations led to the development of ROCCC 2.0. The main idea behind the development of ROCCC 2.0 is the description of hardware circuits in a bottom-up fashion through the creation of modules while still preserving the optimizations that generated efficient circuits. Each module can exist either as software or hardware. Designers describe modules using C, compile them with ROCCC, and have access to a library of previously compiled modules. IP cores may also be imported directly into the module library and then can be accessed by other modules and systems with a function call. When building complete systems, ROCCC is able to parallelize and replicate modules as needed and places each module in the correct location in a large pipeline. ROCCC 2.0 also separates the platform specifics from a standard abstraction layer that interfaces with different memory models, vastly simplifying both the implementation and interface of each module. 3.2 Example Describing a module is done using standard C code without any new keywords. The interface to the module is described with a struct detailing all of the inputs and outputs. The implementation of the module is described in a function that takes and returns an instance of this struct. Figure 1 shows a 5-TAP FIR filter module. The FIR filter has 5 inputs and one output specified in the interface struct. The implementation performs multiplication and addition against a constant array, which is propagated and eliminated through high-level transformations.

typedef struct {
  int A0_in ;
  int A1_in ;
  int A2_in ;
  int A3_in ;
  int A4_in ;
  int result_out ;
} FIR_t ;

FIR_t FIR(FIR_t f)
{
  const int T[5] = {3,5,7,9,11};
  f.result_out = f.A0_in * T[0] +
                 f.A1_in * T[1] +
                 f.A2_in * T[2] +
                 f.A3_in * T[3] +
                 f.A4_in * T[4] ;
  return f ;
}
Fig. 1. A 5-TAP FIR Filter Module in C
#include "roccc-library.h"

void FIRSystem()
{
  int A[100] ;
  int B[100] ;
  int i ;
  int output ;
  for (i = 0 ; i < 100 ; ++i)
  {
    FIR(A[i], A[i+1], A[i+2], A[i+3], A[i+4], output) ;
    B[i] = output ;
  }
}
Fig. 2. A Complete System in C That Includes the FIR Module
Once compiled with ROCCC, the FIR filter may be used by other C code either as a C procedure (by calling the function in Figure 1) or as a hardware module. Figure 2 illustrates both the creation of a complete system in hardware, which includes accessing streams of data, and the inclusion of a hardware module. All modules available
for use are exported in a library format as both hardware descriptions in VHDL and C function declarations for use at the C level. The arrays A and B in Figure 2 are detected and analyzed by ROCCC to be input and output streams of data, and the appropriate abstracted interface is created. The stream interface is defined by ROCCC and not platform specific; instead it merely is a generic fifo interface with storage for data reuse. By including roccc-library.h, C code has access to call any previously compiled module directly. This is accomplished by calling a function with the name of the module and passing parameters that correspond to the inputs and outputs listed in the original struct in the same order in which they appear in the declaration. The code in Figure 2 consists of ROCCC 1.0 code with the addition of modules. This allows us to leverage all of the previous parallelizing transformations and provide good performance of the generated hardware. The compilation process is user-directed and with the addition of modules allows for greater control of the final circuit’s architecture.
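To illustrate how the module library grows in this bottom-up style, the fragment below adds a second, hypothetical module following the same conventions as Figures 1 and 2; the module, its 1-2-1 smoothing function and the calling system are invented for this example and are not part of the ROCCC distribution.

typedef struct {
  int S0_in ;
  int S1_in ;
  int S2_in ;
  int result_out ;
} Smooth_t ;

Smooth_t Smooth(Smooth_t s)
{
  /* simple 1-2-1 weighting; output scaling is omitted for brevity */
  s.result_out = s.S0_in + 2 * s.S1_in + s.S2_in ;
  return s ;
}

/* system code, in a separate file, using the compiled module */
#include "roccc-library.h"

void SmoothSystem()
{
  int A[102] ;
  int B[100] ;
  int i ;
  int smoothed ;
  for (i = 0 ; i < 100 ; ++i)
  {
    /* arguments follow the struct field order: inputs, then output */
    Smooth(A[i], A[i+1], A[i+2], smoothed) ;
    B[i] = smoothed ;
  }
}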
4 Conclusion and Future Outlook Reconfigurable computing shows great potential in improving the execution performance of a large class of applications. The main difficulty in utilizing this potential, however, is in the programmability of the platform and the translation of sequential systems into concurrent streaming engines. Recognizing this need, ROCCC 2.0 begins to bridge the gap between the high level sequential languages developers are familiar with, and the streaming circuit structure required for good performance. The lowering of this barrier would allow for a wider adoption of this model, the development of application codes that are well suited for RC and pave the way for the evolution of architectures specifically designed to support reconfiguration.
References 1. Estrin, G., Viswanathan, C.R.: Organization of a “fixed-plus-variable” structure computer for eigenvalues and eigenvectors of real symmetric matrices. Journal of the ACM 9(1), 41– 60 (1962) 2. Estrin, G., Turn, R.: Automatic assignment of computations in a variable structure computer system. IEEE Transactions on Electronic Computers EC-12(5), 755–773 (1963) 3. Estrin, G., Bussell, B., Turn, R., Bibb, J.: Parallel processing in a restructurable computer system. IEEE Transactions on Electronic Computers EC-12(5), 747–755 (1963) 4. Buell, D., Arnold, J., Kleinfelder, W.: Splash 2: FPGAs in a Custom Computing Machine. IEEE CS Press, Los Alamitos (1996) 5. Guo, Z., Najjar, W., Vahid, F., Vissers, K.: A Quantitative Analysis of the Speedup Factors of FPGAs over Processors. In: Symp. Field-Programmable Gate Arrays (FPGA), Monterey, CA (February 2004) 6. Villarreal, J., Najjar, W.: Compiled Hardware Acceleration of Molecular Dynamics Code. In: Int. Conf. on Field Programmable Logic and Applications (FPL 2008), Heidelberg, Germany (Septermber 2008)
Reconfigurable Multithreading Architectures: A Survey Pavel G. Zaykov, Georgi K. Kuzmanov, and Georgi N. Gaydadjiev Computer Engineering, EEMCS, TU Delft, The Netherlands {Zaykov,G.K.Kuzmanov,G.N.Gaydadjiev}@tudelft.nl
Abstract. This paper provides a survey of the existing proposals in the field of reconfigurable multithreading (ρMT) architectures. Until now, reconfigurable architectures have been classified according to implementation or architectural criteria, but never based on their ρMT capabilities. More specifically, we identify reconfigurable architectures that provide implicit, explicit or no architectural support for ρMT. For each of the proposals, we discuss the conceptual model, the limitations and the typical application domains. We also summarize the main design problems and identify some key research questions related to highly efficient ρMT support. In addition, we discuss the application perspectives and propose possible research directions for future investigations.
1 Introduction
Many applications running on modern embedded devices are composed of multiple threads, typically processing (exchanging) data among multiple sources. In the quest for maximum performance and flexibility, hybrid architectures combining one or more embedded General Purpose Processors (GPPs) with reconfigurable logic have emerged. There is a clear trend which shows that in the near future there will be more embedded systems integrating reconfigurable technology [1], [2], [3]. It is envisioned that multithreading support will become an important property of such systems. One of the fundamental problems in multithreaded architectures is efficient system resource management. This has been successfully solved in contemporary GPPs using various implicit and explicit methods. In the literature [4], the explicit techniques have been further partitioned into three main categories: Block Multithreading (BMT) - employing Operating System (OS)/ compiler approaches - and Interleaved/ Simultaneous Multithreading (IMT/ SMT) - using hardware techniques. However, none of these solutions can be applied straightforwardly to managing reconfigurable hardware resources. The main reason is that the reconfigurable hardware changes its behavior per application, unlike GPPs, which have a fixed hardware organization regardless of the programs running on them. Yet, current state-of-the-art architectures do not provide efficient holistic solutions for accelerating multithreaded applications by reconfigurable hardware. In this paper we approach the reconfigurable multithreading (ρMT) architectural problems both from the hardware and the software perspective.
The specific contributions of this paper are as follows:
– We analyze a number of existing reconfigurable proposals with respect to their architectural support of ρMT. Based on this analysis, we propose a taxonomy with three main classes, namely: reconfigurable architectures with explicit, implicit and no ρMT support;
– We summarize several design problems and state open research questions addressing performance-efficient management, mapping, sharing, scheduling and execution of threads on reconfigurable hardware resources;
– We provide our vision for promising research directions and possible solutions to the identified design problems.
The paper is organized as follows: in Section 2, a taxonomy covering related projects is presented. More details about the design problems and the status of the current state of the art, complemented with our vision on some possible application prospects and potential research directions, are given in Section 3. Finally, the concluding remarks are presented in Section 4.
2 A Taxonomy of Existing Proposals
A taxonomy of Custom Computing Machines (CCM) with respect to explicit configuration instructions has already been proposed in [5]. However, that study did not consider multithreading support as a distinguishing feature. In this section, we introduce a taxonomy of existing reconfigurable architectures with respect to the ρMT support they provide. We identify three main classes of such architectures, namely: with explicit, with implicit, and with no architectural ρMT support. Note that the meaning we attach to the definitions of explicit and implicit ρMT differs from what is used in GPP systems. In general-purpose systems, the classification is based on multithreading support from an algorithmic point of view [4]. In our taxonomy, the distinguishing feature is the presence of architectural/μ-architectural extensions for the creation/termination of multiple threads on reconfigurable logic. If we classified the ρMT research projects based on the GPP explicit multithreading techniques, our taxonomy would look as follows:
– Reconfigurable Block Multithreading (ρBMT): e.g. [1], [6], [7];
– Reconfigurable Interleaved Multithreading (ρIMT): e.g. [8];
– Reconfigurable Simultaneous Multithreading (ρSMT): e.g. [9], [10].
In this paper, we consider a different classification perspective. In architectures with no ρMT support, application threads are mapped onto reconfigurable hardware using software techniques – either at the OS or at the compiler level. This software approach provides unlimited flexibility, but the performance overhead too often penalizes the overall execution time, especially for real-time implementations. On the other hand, architectures with implicit ρMT support provide performance-efficient solutions at the cost of almost no flexibility, due to the fixed underlying microarchitecture (μ-architecture) facilitating multithreading.
Fig. 1. A Conceptual Behavioral Model of ρMT Related Projects
To exploit the flexibility provided at the software level as well as at the architectural and μ-architectural levels, and to achieve higher system performance, a third, emerging class of architectures is identified and termed architectures with explicit ρMT support. Hereafter, we illustrate the proposed taxonomy through examples of existing reconfigurable architectures. A conceptual behavioral model of a ρMT system is depicted in Figure 1. It represents the basic steps in the management and execution of multiple threads. Initially, the programmer creates applications (tasks – Section A) or kernel services (Section B) composed of multiple threads. Later, at run-time, when an application is selected for execution, the Top-level Scheduler (Section C) passes threads to the local schedulers (Sections D and E), depending on the system status information. The local reconfigurable scheduler (Section E) accommodates multiple units: queues, a scheduling algorithm, a placement technique and a loading process. The synchronization between different threads is managed by, e.g., semaphores (Section F). The different sections of the behavioral model depicted in Figure 1 are implemented either at the software level or at the μ-architectural level, depending on the particular architecture. Hereafter, we show how different popular reconfigurable proposals implement the scheme from Figure 1 and, based on their architectural support for ρMT, we classify them.
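As a purely illustrative, architecture-neutral sketch of this two-level model (not taken from any of the surveyed proposals), the following C++ fragment mirrors Figure 1: a top-level scheduler dispatches ready threads either to a software scheduler or to a local reconfigurable scheduler that performs placement and configuration loading. All type and function names are assumptions made for this sketch.

#include <deque>
#include <string>

// One application or kernel-service thread (Sections A/B of Fig. 1).
struct ThreadDesc {
    int         id;
    bool        hw_capable;     // has a reconfigurable (CCU-style) implementation
    int         area_slices;    // reconfigurable area it would occupy
    std::string bitstream;      // configuration data for the hardware version
};

// Local reconfigurable scheduler (Section E): queue, placement, loading.
class ReconfigurableScheduler {
    std::deque<ThreadDesc> queue_;
    int free_slices_;
public:
    explicit ReconfigurableScheduler(int slices) : free_slices_(slices) {}
    bool try_accept(const ThreadDesc& t) {
        if (!t.hw_capable || t.area_slices > free_slices_) return false;
        free_slices_ -= t.area_slices;      // placement decision (simplified)
        load_bitstream(t.bitstream);        // configuration loading
        queue_.push_back(t);
        return true;
    }
private:
    void load_bitstream(const std::string&) { /* device-specific */ }
};

// Local software scheduler (Section D) -- placeholder.
class SoftwareScheduler {
public:
    void accept(const ThreadDesc&) { /* enqueue for the GPP */ }
};

// Top-level scheduler (Section C): decides where each thread goes.
void dispatch(const ThreadDesc& t, ReconfigurableScheduler& hw, SoftwareScheduler& sw) {
    if (!hw.try_accept(t))   // fall back to software if no area / no HW version
        sw.accept(t);
}

In an architecture with explicit ρMT support, the dispatch and acceptance decisions above would be backed by dedicated instructions or μ-architectural mechanisms rather than by plain software calls.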
2.1 Modern State-of-the-Art Reconfigurable Architectures
Reconfigurable hardware allows the designer to extend the processor functionality both statically and at run-time, speeding up the application by executing its critical parts in hardware. In [11], a survey of architectural proposals targeting GPP cores extended with reconfigurable logic is presented. However, that paper did not consider ρMT as a classification criterion. In the years after,
a few more reconfigurable proposals have been introduced that can be supported by an OS without any specific hardware modifications. We briefly introduce two of these more recent reconfigurable projects, not covered by [11], because we consider them a natural evolution of contemporary embedded systems and potentially good candidates for future explicit ρMT extensions:
MOLEN: We choose the Molen Polymorphic Processor [1], proposed by the CE Lab, TU Delft, The Netherlands, as an example of a tightly coupled (processor/co-processor) fine-grained reconfigurable architecture. It combines a GPP with several reconfigurable Custom Computing Units (CCUs). The processor has an arbiter, which partially decodes instructions and issues them to either the GPP or the reconfigurable coprocessor. In the original Molen papers, multithreading has not been discussed, but follow-up research towards multithreading has been reported in [9]. An overview of this enhanced MT version of Molen is given in Subsection 2.3.
Montium TP: As an example of a Coarse-Grained Reconfigurable Array processor core, we choose the Montium TP [2], designed by RECORE Systems. This architecture has the following features: once configured, it does not issue any instructions (it just processes the data); it does not have a fixed instruction set architecture (ISA) – the application is encoded at the microcode level; and it has a fast reconfiguration response time because of its coarse-grained hardware structure. In its current implementation, the Montium TP is capable of supporting the execution of multiple threads (applications), but only at the OS level. The processor originally targeted the domain of streaming applications.
2.2 Architectures with No ρMT Support
As already indicated by our classification, architectures with no ρMT support provide simultaneous execution of multiple threads at the software level only – either by the OS or by the compiler, without any explicit architectural support.
OS support for ρMT – In this category, we group all known operating systems targeting reconfigurable devices that implement Sections A, B, C, D, E and F from Figure 1 in software. The first proposal that identifies some of the necessary services an operating system for reconfigurable devices should support is presented in [6] and [12] by a research group at the University of South Australia.
BORPH [13]: This research work, presented by the University of California, Berkeley, identifies that migrating an application from one reconfigurable computing platform to another using conventional codesign methodologies requires the designer to learn a new language and new APIs, to get familiar with new design environments and to re-implement existing designs. Therefore, BORPH is introduced as an OS designed specifically for reconfigurable computers, sharing the same UNIX interface among hardware and software threads, which speeds up the design process. The proposal has the following limitation: hardware threads execute on, but do not share, the reconfigurable resources.
Experimental results are produced with simple applications such as wireless signal processing, a low-density parity-check decoder and MPEG-2 decoding.
SHUM-uCOS [14]: Another design tackling the problems caused by the essential differences between software and hardware tasks is SHUM-uCOS by Fudan University, China [14]. The authors propose a real-time OS (RTOS) for reconfigurable systems employing a uniform multi-task model. It traces and manages the utilization of the reconfigurable resources and improves the utilization and parallelism of the tasks with hardware task preconfiguration. The limitations of the current implementation of SHUM-uCOS are the static scheduling approach and the resource reuse, which is supported by the compiler only. For the evaluation of the system, the authors use benchmarks and VoIP algorithms.
Compiler techniques for multithreading on reconfigurable platforms – The most common feature of the architectures grouped in this subcategory is that the compiler is responsible for task partitioning, scheduling and management of the system resources. The major reason to employ multithreading in these architectures is to hide reconfiguration latencies.
MT-ADRES [7]: In MT-ADRES by IMEC, Belgium, the compiler framework has been extended to support several threads. The most significant limitation of this proposal is the inability to start/terminate threads at run-time, which is imposed by the compiler's static scheduling and optimization algorithms operating on a Control Data Flow Graph (CDFG). Control decisions, such as hiding the reconfiguration latencies and resource management, are taken at compile time. All experiments providing information about MT-ADRES performance are obtained through multimedia simulations.
UltraSONIC [15]: Another proposal falling in this category is the UltraSONIC project, presented by Sony Research Labs, UK. It is a reconfigurable architecture optimized for video processing. It has a set of Plug-In Processing Elements connected through several buses. The programmer is given an architecture abstraction through an API. Multitasking is achieved through an algorithm (implemented in the compiler) working on the application's Directed Acyclic Graph.
2.3 Architectures with Implicit ρMT Support
The proposals in this category share one common feature: the multithreading support for reconfigurable threads is implicit, i.e., hidden from the system programmer. The ISA does not have dedicated instructions for thread creation and termination. The functionality is achieved with μ-architectural extensions while preserving the architectural model.
Reconfigurable Extensions for the CarCore Processor: In [9], the authors combine a simultaneous multithreading (SMT) processor with a Molen-style reconfigurable coprocessor [1]. To minimize the complexity of the implementation, the authors impose several constraints on the architecture. They modeled a hardware scheduler, which supports execution on reconfigurable logic of only one
thread at a time, preserving the real-time capability for it. Once a thread is started for hardware execution, it cannot be interrupted until it has finished (no context switching). There are no additional ISA extensions for reconfigurable thread management. Meanwhile, other non-real-time threads can continue their execution, exploiting the latencies of the real-time thread. The implementation includes two scheduling policies – fixed-priority and round-robin – over four active threads.
Hthreads [10]: The Hthreads (Hybrid Threads) model presented by the University of Kansas is a multi-layer computational architecture which aims to bridge the gap between programmers and complex reconfigurable devices. Some of its main features are the migration of thread management, synchronization primitives and run-time scheduling services (Figure 1, Section F) for both hardware and software threads into a dedicated hardware module accessed from the GPP only through a universal bus. The authors represent hardware threads by a user-defined component, a state controller and a universal interface (register set). Synchronization is performed through semaphores. Because the system does not have modifications at the architectural and μ-architectural levels, the proposal is classified as implicit ρMT. Experimental results are provided in the image processing application domain.
2.4 Architectures with Explicit ρMT Support
The basic idea of this ρMT class is to combine the flexibility of software and reconfigurable hardware with the potential performance efficiency of the latter, and to support ρMT both at the software level and at the μ-architectural level. There are several partial solutions in the literature, but they do not provide such a complete mixed model of ρMT, in which software and hardware cooperate to provide simultaneous execution of multiple threads. In such a model, the system services (e.g., scheduling, resource management) should be optimally divided between the software and μ-architectural levels. Combined with efficient memory management and the exchange of thread/function parameters through dedicated registers, an architecture with explicit ρMT support would potentially reduce the intra- and inter-thread communication costs. Similar approaches are taken in the following proposals:
OS4RS [16]: In [16], a research group at IMEC, Belgium, investigates the concepts and reveals some of the open questions raised by run-time multithreading and interconnection networks for heterogeneous reconfigurable SoCs. The novelty of their approach resides in the integration of the reconfigurable hardware in a multiprocessor system completely managed by the OS (OS4RS). The system manages several threads with a two-level scheduler. In the current implementation, the top-level scheduler (Figure 1, Section C) is implemented in software. The low-level/local scheduler can be implemented in software (Figure 1, Section D) or hardware (Figure 1, Section E), depending on the type of the slave computing resources (GPP or reconfigurable logic). In the current implementation, the local-level hardware (reconfigurable) scheduler is not implemented yet. The authors also propose a
proof-of-concept method for context switching and migration between heterogeneous resources by saving the task state. OS4RS has been tested on JPEG frame decoding and an experimental 3D video game.
Reconfigurable Multithreaded processor [8] by the University of Wisconsin-Madison: The authors augment a multiprocessor system, composed of a multithreaded Digital Signal Processor (DSP) and a RISC processor, with multiple Polymorphic Hardware Accelerators (PHAs) – reconfigurable hardware units. The PHAs are implemented as functional units at the execution stage of the processor pipeline. The instruction set is extended with four instructions for reading/modifying the PHA state and the mapping procedure. Multithreading is mainly employed to hide the reconfiguration time. Once configured, a PHA can be reused by different threads in the case of identical PHA instructions. Because the PHAs do not share the same reconfigurable area, there is no need for a placement algorithm. The architecture is limited to a form of Interleaved Multithreading called Token Triggered Threading. The authors argue for this approach instead of Simultaneous Multithreading because of the possible reduction in power consumption. The authors propose two PHA binding techniques – static and dynamic. The implementation includes only the static (compile-time) mapping approach. In the case of run-time binding, the system would have to provide real-time guarantees by restricting PHA reuse among threads.
2.5 Summary of the Proposed Taxonomy
Based on the criterion of the provided ρMT support, the aforementioned architectures can be briefly classified as follows. A more elaborate discussion and a full list of references can be found in [17]:
I. No architectural ρMT support:
I.1. OS support for ρMT: Molen [1], Montium [2], Convey hybrid-core HC-1 [18], RAMP [19], South Australia [6], BORPH [13], SHUM-uCOS [14];
I.2. Compiler techniques for ρMT: MT-ADRES [7], UltraSONIC [20];
II. Implicit architectural ρMT support: CarCore Processor extensions [9], REDEFINE [21], Hthreads [10], ρMT Architectural Model [22], University of Karlsruhe [23];
III. Explicit architectural ρMT support:
III.1. μ-architecture + OS: reconfigurable architectures of this kind are just emerging. This approach is promising for highly performance-efficient scheduling and execution of threads on reconfigurable hardware due to the hardware and software co-design of the ρMT management mechanisms. OS4RS [16];
III.2. μ-architecture + compiler: Reconfigurable Multithreaded processor [8].
3 Design Problems and Open Research Questions
The most basic design questions related to thread scheduling on reconfigurable resources are: – Which threads to execute, schedule or preempt at a certain instant of time
(e.g., when the reconfigurable area requested by hardware threads that are ready for execution exceeds the available area)? – Where to place a thread (in case there are several possibilities)? – When to reallocate newly created threads, and how to efficiently hide the reconfiguration latencies?
Depending on the model assumptions, from a complexity point of view the scheduling problem on reconfigurable logic can be reduced to several well-known NP-hard problems [24], [25], [26]. Therefore, one way to solve it is to introduce an advanced heuristic algorithm. Some open research questions and several partially and completely solved design problems, grouped by topic, are presented below. For more details, the interested reader is referred to [17].
Hiding reconfiguration latencies: In reconfigurable systems, the reconfiguration latency is caused by the time needed for the configuration bitstream to set up the reconfigurable device for the particular operation. Typically, configuration latency is introduced during the initial task loading (tasks are composed of one or multiple threads). This is one of the major system delays and causes severe performance degradation in case of frequent reconfigurations. In the literature, the most common ways to hide or minimize the reconfiguration latency are: 1. Compressing the task's bitstream; different techniques are examined in [27]; 2. Employing a prefetch technique for earlier reconfiguration (overlapped with computation) and local caching. The existing proposals can be grouped into three categories: – Static – predictions are performed at design time by the compiler (e.g., the Molen compiler [28]); – Dynamic – at runtime by the reconfigurable scheduler, which stores the most recent configurations [29]; – Hybrid (combining the static and dynamic approaches) [29]. In case of misprediction, the alternative hybrid methods [29] always pay a time penalty by delaying the reconfiguration.
Scheduling and placement algorithms: In the research work presented in [30] by ETH Zurich, the authors propose several algorithms to manage the sharing of resources on the reconfigurable surface. Their proposal includes system services for partial reconfiguration which, by scheduling the dynamically incoming threads, solve the problems of complex allocation situations. The primary idea of the project is to separate threads into two groups according to their arrival times – synchronous and arbitrary – which are scheduled by different heuristic algorithms. Each of the scheduling techniques is combined with optimized placement methods. The algorithms are further enhanced by a research group at Fudan University [25]. The authors prove that the combination of a scheduling algorithm with a recognition-complete placement method does not result in a recognition-complete technique. They also investigate the cases of potential thread migration – a newly arrived thread is started either in software or in hardware. A slightly
different approach is proposed in [31] by a research group at Paderborn University. They enhance a single-processor algorithm (e.g., a stochastic server) with preemption support for hardware tasks (limited to the time of reconfiguration).
Context switching: In [32], the authors clearly identify the two possible techniques for context switching of hardware threads in partially reconfigurable FPGAs. The techniques are named as follows: 1) Thread-Specific Access Structures – when the scheduler decides to switch a thread, its current state is saved in an external structure. The major advantages of this approach are its high data efficiency and architecture independence. The disadvantages come from the fact that each thread is different and it is difficult to design a standard generic interface. In [33], the authors explore the control software required to support thread switching as well as the requirements and features of context saving and restoring in the FPGA coprocessor context. A similar approach is taken in [34] – each hardware thread is represented by one complicated finite state machine. 2) Configuration Port Access – the thread bitstream is completely read back from the FPGA chip and the state information is filtered out. In [32], the authors design custom tools for offline bitstream processing. The advantage of this approach is that no additional design effort and no information about the internal thread behavior are needed. In [35], the authors additionally compress the bitstream to minimize the size and delay of the downloaded data.
Real-time support for reconfigurable hardware threads: In the literature, there are two basic approaches (described below) capable of delivering real-time support for software/hardware heterogeneous platforms: 1) Per-case solutions using heuristic algorithms – many of the proposed algorithms support a "commitment test": each newly created hardware thread is checked for successful termination before its deadline and for critical effects (e.g., delays) on other executing threads. Unfortunately, the proposed heuristic algorithms are designed only for independent hardware threads with known execution times; therefore, they are not applicable to hardware threads with data, resource or communication dependencies. 2) Complete solutions on conventional reconfigurable platforms (e.g., BORPH [13], UltraSONIC [20], Hthreads [10]) – none of them supports reconfigurable resource sharing among executing threads. In case the reconfigurable area is shared, all possible resource collisions are resolved at compile time.
Application prospects and potential research directions: One of the direct gains from employing a ρMT architecture, after solving the open questions from Section 3, would be the capability for time-efficient run-time creation, termination and management of multiple threads sharing the reconfigurable resources without critically affecting (delaying) each other. Possible future research could extend the functionality and overcome some limitations, e.g.: 1. Real-time and runtime support of multiple hardware threads through archi-
tecture-agnostic hardware scheduler. It could support run-time creation and termination of multiple threads mapped onto reconfigurable logic, together with a hardware system implementation. 2. More sophisticated scheduling policies capable of fairly distributing resources among multiple resource-dependent hardware threads, and the introduction of a metric evaluating the resource distribution and potential thread starvation. 3. Hiding of reconfiguration latencies and an efficient thread preemption and migration model with estimation of the performance costs. For periodic and sporadic threads, the migration might take place right after the end of the current iteration.
The following list summarizes the topics presented in Section 3:
Partially [PS] and Completely [CS] Solved Design Problems
[CS] - Hiding reconfiguration latencies by prefetching, context switching and resource reuse among threads [36], [29], [27];
[PS] - Optimized inter-thread communication scheme [34];
[PS] - Real-time thread support by the reconfigurable architecture [9], [20], [10];
[PS] - Preemptive techniques (context switching) for threads with arbitrary arrival times, considering inter-thread data dependencies, free reconfigurable area and communication profile [30], [25], [24];
[PS] - Thread migration between software and hardware [33], [32], [35];
[PS] - Virtualization and protection [37], [22];
[PS] - Rescheduling of threads depending on the workload [31];
[PS] - Run-time creation and termination of threads [34], [13].
Open Research Questions [O]
[O] - Hardware scheduler agnostic to the employed embedded GPP processor;
[O] - System performance evaluation parameters;
[O] - Intra-thread management by the scheduler.
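As a purely illustrative complement to the latency-hiding discussion above (the "Dynamic" prefetching category and configuration reuse among threads), the following C++ sketch shows a simple configuration cache that prefetches a bitstream ahead of execution and reuses already loaded configurations. The class and the LRU policy are assumptions made for this sketch and are not taken from any of the cited systems.

#include <algorithm>
#include <cstddef>
#include <list>
#include <string>

// A tiny LRU cache of configurations already present on the fabric.
// prefetch() is called as soon as the scheduler predicts the next hardware
// thread, so (re)configuration overlaps with the currently running computation.
class ConfigCache {
    std::list<std::string> loaded_;   // most recently used at the front
    std::size_t capacity_;            // how many configurations fit on chip
public:
    explicit ConfigCache(std::size_t capacity) : capacity_(capacity) {}

    // Returns true if the configuration was already on chip (reuse, no latency).
    bool prefetch(const std::string& bitstream_id) {
        auto it = std::find(loaded_.begin(), loaded_.end(), bitstream_id);
        if (it != loaded_.end()) {            // hit: reuse among threads
            loaded_.splice(loaded_.begin(), loaded_, it);
            return true;
        }
        if (loaded_.size() == capacity_)      // evict the least recently used one
            loaded_.pop_back();
        start_reconfiguration(bitstream_id);  // overlaps with the current thread
        loaded_.push_front(bitstream_id);
        return false;
    }
private:
    void start_reconfiguration(const std::string&) { /* device-specific */ }
};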
4 Conclusions
In this paper, we provided a survey and proposed a taxonomy of existing reconfigurable architectures with respect to their support of multithreading on reconfigurable resources. We identified three main classes – explicit, implicit and no ρMT support, each one of them with several sub-categories. We further summarized a number of identified design problems and several research questions, which addressed performance efficient management, mapping, sharing, scheduling and execution of threads on reconfigurable hardware resources. We provided our vision for potential research directions and possible solutions of some open research topics. We marked which of the identified design problems have been partially or completely solved and which research questions remain open.
Acknowledgments This work was supported by the HiPEAC European Network of Excellence - cluster 1200 (FP6-Contract number IST-004408) and by the Dutch
Technology Foundation STW, applied science division of NWO (project DSC.7533).
References 1. Vassiliadis, S., Wong, S., Cotofana, S.D.: The MOLEN μρ-coded processor. In: Brebner, G., Woods, R. (eds.) FPL 2001. LNCS, vol. 2147, pp. 275–285. Springer, Heidelberg (2001) 2. Heysters, P.M.: Coarse-grained reconfigurable computing for power aware applications. In: ERSA, pp. 272–280 (2006) 3. Seno, K., Yamazaki, M.: Virtual mobile engine (VME) LSI that “changes its spots” achievies ultralow power and diverse functionality. CX-News 42 (2005), http://www.sony.com 4. Ungerer, T., Robic, B., Silc, J.: A survey of processors with expliclicit multithreading. ACM Computing Surveys 35(1), 29–63 (2003) 5. Sima, M., Vassiliadis, S., Cotofana, S.D., van Eijndhoven, J.T.J., Vissers, K.A.: Field-programmable custom computing machines - a taxonomy. In: Glesner, M., Zipf, P., Renovell, M. (eds.) FPL 2002. LNCS, vol. 2438, pp. 79–88. Springer, Heidelberg (2002) 6. Wigley, G.B., Kearney, D.A.: The first real operating system for reconfigurable computers. In: ACSAC, pp. 129–136. IEEE Computer Society Press, Los Alamitos (2000) 7. Wu, K., Kanstein, A., Madsen, J., Berekovic, M.: MT-ADRES: Multithreading on coarse-grained reconfigurable architecture. In: Diniz, P.C., Marques, E., Bertels, K., Fernandes, M.M., Cardoso, J.M.P. (eds.) ARCS 2007. LNCS, vol. 4419, pp. 26–38. Springer, Heidelberg (2007) 8. Mamidi, S., Schulte, M., Iancu, D., Glossner, J.: Architecture support for reconfigurable multithreaded processors in programmable communication systems. In: ASAP, pp. 320–327. IEEE Press, Los Alamitos (2007) 9. Uhrig, S., Maier, S., Kuzmanov, G.K., Ungerer, T.: Coupling of a reconfigurable architecture and a multithreaded processor core with integrated real-time scheduling. In: RAW, pp. 209–217 (2006) 10. Peck, W., Anderson, E., Agron, J., Stevens, J., Baijot, F., Andrews, D.: HTHREADS: a computational model for reconfigurable devices. In: FPL, pp. 885– 888 (2006) 11. Compton, K., Hauck, S.: Reconfigurable computing: a survey of systems and software. ACM Computing Surveys 34(2), 171–210 (2002) 12. Diessel, O., Wigley, G.B.: Opportunities for operating systems research in reconfigurable computing. In: ACRC (1999) 13. So, H.K.-H., Brodersen, R.: A unified hardware/software runtime environment for FPGA-based reconfigurable computers using BORPH. ACM Transactions on Embedded Computing Systems 7(2), 1401–1407 (2008) 14. Zhou, B., Qui, W., Peng, C.-L.: An operating system framework for reconfigurable systems. In: CIT, pp. 781–787 (2005) 15. Noguera, J., Badia, R.M.: Multitasking on reconfigurable architectures: microarchitecture support and dynamic scheduling. Trans. on Embedded Computing Sys. 3(2), 385–406 (2004) 16. Marescaux, T., Nollet, V., Mignolet, J.-Y., Bartic, A., Moffat, W., Avasare, P., Coene, P., Verkest, D., Vernalde, S., Lauwereins, R.: Run-time support for heterogeneous multitasking on reconfigurable SoCs. Integration 38(1), 107–130 (2004)
17. Zaykov, P.G., Kuzmanov, G.K., Gaydadjiev, G.N.: State-of-the-art reconfigurable multithreading architectures. Technical Report - CE-TR-2009-02 (2009) 18. The convey HC-1 computer, architecture overview (white paper), p. 11 (2008), http://www.conveycomputer.com 19. Gibeling, G., Schultz, A., Asanovic, K.: The RAMP architecture & description language. In: WARFP (2006) 20. Haynes, S.D., Epsom, H.G., Cooper, R.J., McAlpine, P.L.: UltraSONIC: A reconfigurable architecture for video image processing. In: Glesner, M., Zipf, P., Renovell, M. (eds.) FPL 2002. LNCS, vol. 2438, pp. 482–491. Springer, Heidelberg (2002) 21. Satrawala, A., Varadarajan, K., Lie, M., Nandy, S., Narayan, R.: Redefine: Architecture of a soc fabric for runtime composition of computation structures. In: FPL 2007, pp. 558–561 (2007) 22. Wallner, S.: A reconfigurable multi-threaded architecture model. In: Omondi, A.R., Sedukhin, S.G. (eds.) ACSAC 2003. LNCS, vol. 2823, pp. 193–207. Springer, Heidelberg (2003) 23. Bauer, L., Shafique, M., Kreutz, S., Henkel, J.: Run-time system for an extensible embedded processor with dynamic instruction set. In: DATE, pp. 752–757 (2008) 24. Steiger, C., Walder, H., Platzner, M.: Heuristics for online scheduling real-time tasks to partially reconfigurable devices. In: FPL, pp. 575–584 (2003) 25. Zhou, X., Wang, Y., Huang, X.-Z., Peng, C.-L.: On-line scheduling of real-time tasks for reconfigurable computing system. In: FPT, pp. 57–64 (2006) 26. Angermeier, J., Teich, J.: Heuristics for Scheduling Reconfigurable Devices with Consideration of Reconfiguration Overheads. In: Proceedings 15th Reconfigurable Architectures Workshop, Miami, Florida (2008) 27. Resano, J., Mozos, D., Verkest, D., Catthoor, F.: A reconfiguration manager for dynamically reconfigurable hardware. IEEE Design & Test of Computers 22(5), 452–460 (2005) 28. Panainte, E.M.: The Molen compiler for reconfigurable architectures. Ph.D. dissertation, TU Delft (2007) 29. Li, Z., Hauck, S.: Configuration prefetching techniques for partial reconfigurable coprocessor with relocation and defragmentation. In: FPGA, pp. 187–195 (2002) 30. Steiger, C., Walder, H., Platzner, M., Thiele, L.: Online scheduling and placement of real-time tasks to partially reconfigurable devices. In: RTSS, pp. 224–235. IEEE Computer Society, Los Alamitos (2003) 31. Dittmann, F.: Methods to exploit reconfigurable fabrics - making reconfigurable systems mature. Ph.D. dissertation, University of Paderborn (2007) 32. Kalte, H., Porrmann, M.: Context saving and restoring for multitasking in reconfigurable systems. In: FPL, pp. 223–228. IEEE Press, Los Alamitos (2005) 33. Simmler, H., Levinson, L.: Multitasking on FPGA coprocessors. In: Gr¨ unbacher, H., Hartenstein, R.W. (eds.) FPL 2000. LNCS, vol. 1896, pp. 121–130. Springer, Heidelberg (2000) 34. Majer, M., Teich, J., Ahmadinia, A., Bobda, C.: The Erlangen Slot Machine: A dynamically reconfigurable fpga-based computer. VLSI Signal Processing 47(1), 15–31 (2007) 35. Ahmadinia, A., Bobda, C., Koch, D., Majer, M., Teich, J.: Task scheduling for heterogeneous reconfigurable computers. In: SBCCI, pp. 22–27 (2004) 36. Chen, Y., Chen, S.Y.: Cost-driven hybrid configuration prefetching for partial reconfigurable coprocessor. In: IPDPS, pp. 1–8. IEEE Press, Los Alamitos (2007) 37. Wallner, S.: Micro-task processing in heterogeneous reconfigurable systems. J. Comput. Sci. Technol. 20(5), 624–634 (2005)
Introduction to Mastering Cell BE and GPU Execution Platforms
Ed Deprettere (Leiden University, The Netherlands) and Ana L. Varbanescu (TU Delft, The Netherlands)
Both Cell BE-type and GPU processors have emerged as multi-processor execution platforms that can outperform general-purpose multi-core computers in certain application domains. The two architectures are quite different, and by no means interchangeable. GPUs are reminiscent of fine-grained systolic array architectures, while the Cell BE is suited to executing a set of coordinated coarse-grained tasks. By now, enough applications have been mapped on either of these two processors, mostly by hand, that the pros and cons tables can be filled. The next step is to provide mappings that are based on efficient programming models and methods, in particular methods that minimize communication overheads. The six papers in this special session are attempts to take precisely that route. Three of them take the GPU as the underlying execution platform, one of these also taking the Cell BE multicore processor into consideration. The other three papers target the Cell BE processor. Richard Membarth et al. present an efficient mapping of a multiresolution image processing algorithm on NVIDIA's Tesla C870 Graphics Processing Unit, using the CUDA programming model. A speedup of 33x is obtained as compared to a parallelized implementation on a Xeon Quad Core. Alexander Monakov and Arutyum Avetisyan consider the execution of large sparse matrix problems on Graphics Processing Units with the aim of reducing memory bandwidth requirements, since memory bandwidth is key to performance. This goal is achieved by introducing a new hybrid blocked storage format. Sander van der Maas et al. consider the mapping of the basic operations of iterative tomography reconstruction algorithms onto a GPU accelerator, which yields superior performance compared to a similar mapping on a Cell BE execution platform. Dmitry Nadezhkin et al. focus on the mapping of Kahn Process Network streaming application specifications on the Cell BE multiprocessor execution platform, with the aim of optimizing FIFO communication to reduce synchronization overhead. Pieter Bellens et al. introduce a main memory 'bypassing' technique for Producer-Consumer pair communication, to circumvent bandwidth limitations and improve program performance for Cell BE-mapped applications. The technique is integrated in Cell Superscalar (CellSs).
Cédric Augonnet et al. report on a runtime system for improving programmability and providing performance portability for heterogeneous multicore processors, such as the Cell BE execution platform. This system, StarPU, provides a high-level, unified execution model that is tightly coupled to an expressive data management library.
Efficient Mapping of Multiresolution Image Filtering Algorithms on Graphics Processors Richard Membarth, Frank Hannig, Hritam Dutta, and Jürgen Teich Hardware/Software Co-Design, Department of Computer Science, University of Erlangen-Nuremberg, Germany {richard.membarth,hannig,dutta,teich}@cs.fau.de
Abstract. In the last decade, there has been a dramatic growth in research and development of massively parallel commodity graphics hardware both in academia and industry. Graphics card architectures provide an optimal platform for parallel execution of many number crunching loop programs from fields like image processing, linear algebra, etc. However, it is hard to efficiently map such algorithms to the graphics hardware even with detailed insight into the architecture. This paper presents a multiresolution image processing algorithm and shows the efficient mapping of this type of algorithms to the graphics hardware. Furthermore, the impact of execution configuration is illustrated and a method is proposed to determine the best configuration offline in order to use it at run-time. Using CUDA as programming model, it is demonstrated that the image processing algorithm is significantly accelerated and that a speedup of up to 33x can be achieved on NVIDIA’s Tesla C870 compared to a parallelized implementation on a Xeon Quad Core.
1 Introduction and Related Work Nowadays, noise-reducing filters are employed in many fields like digital film processing or medical imaging to enhance the quality of images. These algorithms are computationally intensive and operate on single images. Therefore, dedicated hardware solutions have been developed in the past [1, 2] in order to process images in real-time. However, with the overwhelming development of graphics processing units (GPUs) in the last decade, graphics cards have also become a serious alternative and were consequently deployed as accelerators for image processing [3]. In many fields, multiresolution algorithms are used to process a signal at different resolutions. In the JPEG 2000 and MPEG-4 standards, the discrete wavelet transform, which is also a multiresolution filter, is used for image compression [4]. Object recognition benefits from multiresolution filters as well by gaining scale invariance. This paper presents a multiresolution algorithm for image processing and shows the efficient mapping of this type of algorithm to graphics hardware. The computationally intensive algorithm is accelerated on commodity graphics hardware and a performance comparable to dedicated hardware solutions is achieved. Furthermore, the impact of the execution configuration is illustrated. A design space exploration is presented and a method is proposed to determine the best configuration. This is done offline and the information is used at run-time to achieve the best results on different GPUs. We use the
Compute Unified Device Architecture (CUDA) to implement the algorithm on GPUs from NVIDIA. This work is related to other studies. Ryoo et al. [5] present a performance evaluation of various algorithm implementations on the GeForce 8800 GTX. Their optimization strategy is however limited to compute-bound tasks. In another paper the same authors determine the optimal tile size by an exhaustive search [6]. Baskaran et al. [7] show that code could be generated for explicit managed memories in architectures like GPUs or the Cell processor that accelerate applications. However, they consider only optimizations for compute-bound tasks since these predominate. Similarly, none of them shows how to obtain the best configuration and performance on different graphics cards. The remaining paper is organized as follows: Section 2 gives an overview of the hardware architecture used within this paper and the following Sec. 3 illustrates the efficient mapping of multiresolution applications to the graphics hardware. The application accelerated using CUDA is explained in Sec. 4, while Sec. 5 shows the results of mapping the algorithms to the GPU. Finally, in Sec. 6 conclusions of this work are drawn and suggestions for future work are given.
2 Architecture In this section, we present an overview of the Tesla C870 architecture, which is used amongst others as accelerator for the algorithms within this paper. The Tesla is a highly parallel hardware platform with 128 processors integrated on a chip as depicted in Fig. 1. The processors are grouped into 16 streaming multiprocessors. These multiprocessors comprise eight scalar streaming processors. While the multiprocessors are responsible for scheduling and work distribution, the streaming processors do the calculations. For extensive transcendental operations, the multiprocessors also accommodate two special function units. A program executed on the graphics card is called a kernel and is processed in parallel by many threads on the streaming processors. Therefore, each thread calculates a
Fig. 1. Tesla architecture (cf. [8]): 128 streaming processors distributed over 16 multiprocessors
small portion of the whole algorithm, e.g. one pixel of a large image. A batch of these threads is always grouped together into a thread block that is scheduled to one multiprocessor and executed by its streaming processors. One of these thread blocks can contain up to 512 threads, which is specified by the programmer. The complete program has to be divided into such sub-problems that can be processed independently on one multiprocessor. The multiprocessor always executes a batch of 32 threads, also called a warp, in parallel. The two halves of a warp are sometimes further distinguished as halfwarps. NVIDIA calls this new streaming multiprocessor architecture single instruction, multiple thread (SIMT) [9]. For all threads of a warp the same instructions are fetched and executed for each thread independently, i.e. the threads of one warp can diverge and execute different branches. However, when this occurs the divergent branches are serialized until both branches merge again. Thereafter, the whole warp is executed in parallel again. Each thread executed on a multiprocessor has full read/write access to the 1.5 GB device memory of the Tesla. This memory has, however, a long memory latency of 400 to 600 clock cycles. To hide this long latency each multiprocessor is capable to manage and switch between up to eight thread blocks, but not more than 768 threads in total. In addition 8192 registers and 16384 bytes of on-chip shared memory are provided to all threads executed simultaneously on one multiprocessor. These memory types are faster than the global device memory, but shared between all thread blocks executed on the multiprocessor. The capabilities of the Tesla architecture are summarized in Table 1. Table 1. Hardware capabilities of the Tesla C870
Threads per warp: 32
Warps per multiprocessor: 24
Threads per multiprocessor: 768
Blocks per multiprocessor: 8
Registers per multiprocessor: 8192
Shared memory per multiprocessor: 16384 bytes
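The per-device limits listed in Table 1 can also be queried at run time through the CUDA runtime API. The following minimal host-side sketch (not from the paper) prints the corresponding fields of cudaDeviceProp; note that regsPerBlock and sharedMemPerBlock are per-block limits, which on the Tesla C870 coincide with the per-multiprocessor figures above.

#include <cstdio>
#include <cuda_runtime.h>

int main() {
    cudaDeviceProp prop;
    cudaGetDeviceProperties(&prop, 0);           // query device 0

    printf("device:                  %s\n", prop.name);
    printf("multiprocessors:         %d\n", prop.multiProcessorCount);
    printf("warp size:               %d\n", prop.warpSize);
    printf("max threads per block:   %d\n", prop.maxThreadsPerBlock);
    printf("registers per block:     %d\n", prop.regsPerBlock);
    printf("shared memory per block: %zu bytes\n", prop.sharedMemPerBlock);
    return 0;
}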
3 Mapping Methodology To map algorithms efficiently to graphics hardware, we distinguish between two types of kernels executed on the GPU: compute-bound and memory-bound kernels. For each type a different optimization strategy applies. While the execution time of compute-bound kernels is determined by the speed of the processors, for memory-bound kernels the limiting factor is the memory bandwidth. However, there are measures to achieve a high throughput and good execution times for both kernel types. A flowchart of the used approach is shown in Fig. 2. First, for each task of the input application corresponding kernels are created. Afterwards, the memory access of these kernels is optimized and the kernels are added either to a compute-bound or a memory-bound kernel set. Optimizations are applied to both kernel sets and the memory access
Fig. 2. Flowchart of mapping strategy
pattern of the resulting kernels is again checked. Finally, the optimized kernels are obtained and the best configuration for each kernel is determined by a configuration space exploration. 3.1 Memory Access Although for both types of kernels different mapping strategies apply, a proper memory access pattern is necessary in all cases to achieve good memory transfer rates. Since all kernels get their data in the first place from device memory, reads and writes to this memory have to be coalesced. This means that all threads in both half-warps of the currently executed warp have to access contiguous elements in memory. For coalesced memory access, the access is combined to one memory transaction utilizing the entire memory bandwidth. Uncoalesced access needs 16 separate memory transactions instead and has a low bandwidth utilization. Reading from global memory has a further restriction for the Tesla C870 to achieve coalescing: The data accessed by the entire half-warp has to reside in the same segment of the global memory and has to be aligned to its size. For 32-bit and 64-bit data types the segment has a size of 64 bytes and 128 bytes, respectively. Since most algorithms do not adhere to these constraints, two methods are used to get still the same memory performance as for coalesced memory access. Firstly, for both, memory reads and writes, the faster on-chip shared memory is used to introduce a new memory layer. This new layer reduces the performance penalty of uncoalesced memory access significantly as the access to shared memory can be as fast as for registers.
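As a purely illustrative sketch of this shared-memory staging (not code from the paper), the kernel below loads a tile with coalesced global reads, performs an arbitrary in-block permutation in shared memory, and writes the results back with coalesced global stores. The kernel name and the chosen permutation are assumptions, and the data size is assumed to be a multiple of the block size.

// Assumes blockDim.x == 256 and gridDim.x * 256 == number of elements.
__global__ void permute_via_smem(const int *in, int *out)
{
    __shared__ int tile[256];

    int g = blockIdx.x * blockDim.x + threadIdx.x;
    tile[threadIdx.x] = in[g];        // coalesced load from global memory
    __syncthreads();

    // Any in-block permutation: only the shared-memory read is scattered,
    // the global read above and the global write below stay coalesced.
    int src = blockDim.x - 1 - threadIdx.x;
    out[g] = tile[src];               // coalesced store to global memory
}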
When threads of a half-warp need data elements residing permuted in global memory, each thread fetches coalesced data from global memory and stores the data to the shared memory. Only reading from shared memory is then uncoalesced. The same applies when writing to global memory. Secondly, the texturing hardware of the graphics card is used to read from device memory. Texture memory does not have the constraints for coalescing. Instead, texture memory is cached which has further benefits when data elements are accessed multiple times by the same kernel. Only the first data access has the long latency of the device memory and subsequent accesses are handled by the much faster cache. However, texture memory has also drawbacks since this memory is read-only and binding memory to a texture has some overhead. Nevertheless, most kernels benefit from using textures. An alternative to texture memory is constant memory. This memory is also cached and is used for small amounts of data when all threads read the same element. 3.2 Compute-Bound Kernels Most algorithms that use graphics hardware as accelerator are computationally intensive and also the resulting kernels are limited by the performance of the streaming processors. To accelerate these kernels further – after optimizing the memory access – either the instruction count can be decreased or the time required by the instructions can be reduced. To reduce the instruction count traditional loop-optimization techniques can be adopted to kernels. For loop-invariant computationally intensive parts of a kernel it is possible to precalculate these offline and to retrieve these values afterwards from fast memory. This technique is also called loop-invariant code motion. The precalculated values are stored in a lookup table which may reside in texture or shared memory. Constant memory is chosen when all threads in a warp access the same element of the lookup table. The instruction performance issue is addressed by using intrinsic functions of the graphics hardware. These functions accelerate in particular transcendental functions like sinus, cosine, and exponentiations at the expense of accuracy. Also other functions like division benefit from these intrinsics and can be executed in only 20 clock cycles instead of 32. 3.3 Memory-Bound Kernels Compared to the previously described kernels, memory-bound kernels benefit from a higher ratio of arithmetic instructions to memory accesses. More instructions help to avoid memory stalls and to hide the long memory latency of device memory. Considering image processing applications, kernels operate on two-dimensional images that are processed typically using two nested loops on traditional CPUs. Therefore, loop fusion [10] can merge multiple kernels that operate on the same image as long as no inter-kernel data dependencies exist. Merging kernels provides often new opportunities for further code optimization. Another possibility to increase the ratio of arithmetic instructions to memory accesses is to calculate multiple output elements in each thread. This is true in particular when integers are used as data representation like in many image processing algorithms. For instance, the images considered for the algorithm presented next in this paper use a 10-bit grayscale representation. Therefore, only a fraction of the 4 bytes an integer occupies are needed. Because the memory hardware of GPUs is optimized for 4 byte operations, short data types yield inferior performance.
However, data packing can be used to store two pixel values in the 4 bytes of an integer. Afterwards, integer operations can be used for memory access. Doing so increases also the ratio of arithmetic instructions to memory accesses. 3.4 Configuration Space Exploration One of the basic principles when mapping a problem to the graphics card using CUDA is the tiling of the problem into smaller, independent sub-problems. This is necessary because only up to 512 threads can be grouped into one thread block. In addition, only threads of one block can cooperate and share data. Hence, proper tiling influences the performance of the kernel, in particular when intra-kernel dependencies prevail. The tiles can be specified in various ways, either one-, two-, or three-dimensional. The used dimension is such chosen that it maps directly to the problem, e.g. for image processing two-dimensional tiles are used. The tile size has not only influence on the number of threads in a block and consequently how much threads in a block can cooperate, but also on the resource usage. Registers and shared memory are used by the threads of all scheduled blocks of one multiprocessor. Choosing smaller tiles allows a higher resource usage on the one hand, while larger tiles support the cooperation of threads in a block on the other hand. Furthermore, the shape of a tile has influence on the memory access pattern and the memory performance, too. Consequently, it is not possible to give a formula that predicts the influence of the thread block configuration on the execution time. Therefore, configurations have to be explored in order to find the best configuration, although the amount of relevant configurations can be significantly narrowed down. Since the hardware configuration varies for different GPUs, also the best block configuration changes. Therefore, we propose a method that allows to use always the best configuration for GPUs at run-time. We explore the configuration space for each graphics card model offline and store the result in a database. Later at run-time, the program identifies the model of the GPU and uses the configuration retrieved from the database. In that way there is no overhead at run-time and there is no penalty when a different GPU is used. In addition, the binary code size can be kept nearly as small as the original binary size.
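A minimal host-side sketch of this run-time configuration selection could look as follows. The table entries and device-name strings are illustrative placeholders (the block sizes echo the optima reported later in the exploration, but their pairing with specific devices is an assumption); the actual database used by the authors is not specified in the paper.

#include <cuda_runtime.h>
#include <map>
#include <string>

// Offline-explored block configurations, keyed by the GPU model name.
static const std::map<std::string, dim3> bestBlock = {
    {"Tesla C870",   dim3(64, 1)},   // placeholder entry
    {"GeForce 8400", dim3(32, 6)},   // placeholder entry
};

// At run time, identify the GPU and pick the stored configuration,
// falling back to a generic block shape for unknown devices.
dim3 pickBlockConfig(int device, dim3 fallback = dim3(16, 16)) {
    cudaDeviceProp prop;
    cudaGetDeviceProperties(&prop, device);
    auto it = bestBlock.find(prop.name);
    return it != bestBlock.end() ? it->second : fallback;
}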
4 Multiresolution Filtering The multiresolution application considered here utilizes the multiresolution approach presented by Kunz et al. [11] and employs a bilateral filter [12] as filter kernel. The application is a nonlinear multiresolution gradient adaptive filter for images and is typically used for inter-frame image processing, i.e. only the information of one image is required. The filter reduces noise significantly while sharp image details are preserved. Therefore, the application uses a multiresolution approach representing the image at different resolutions so that each feature of the image can be processed on its most appropriate scale. This makes it possible to keep the filter window small. Figure 3 shows the used multiresolution application: In the decompose phase, two image pyramids with subsequently reduced resolutions (g0 (1024 × 1024), g1 (512 × 512), ... and l0 (1024 × 1024), l1 (512 × 512), ...) are constructed. While the images of the
Fig. 3. Multiresolution filter application with five layers
first pyramid (gx) are used to construct the image of the next layer, the second pyramid (lx) represents the edges in the image at different resolutions. The operations involved in these steps are to a large extent memory intensive with little computational complexity, like upsampling, downsampling, or a lowpass operation. The actual algorithm of the application works in the filter phase on the images produced by the decompose phase (l0, ..., l4, g5). This algorithm is described below in detail. After the main filter has processed these images, the output image is reconstructed again, reverting the steps of the decompose phase. The bilateral filter used in the filter phase of the multiresolution application applies the principle of traditional domain filters also to the range. Therefore, the filter has two components: one operates on the domain of an image and considers the spatial vicinity of pixels, their closeness; the other component operates on the range of the image, i.e., the vicinity refers to the similarity of pixel values. Closeness (Eq. (1)), hence, refers to geometric vicinity in the domain, while similarity (Eq. (3)) refers to photometric vicinity in the range. We use Gaussian functions of the Euclidean distance for the closeness and similarity functions, as seen in Eq. (2) and (4). The pixel in the center of the current filter window is denoted by x, whereas ξ denotes a point in the neighborhood of x. The function f is used to access the value of a pixel.

c(\xi, x) = e^{-\frac{1}{2}\left(\frac{d(\xi, x)}{\sigma_d}\right)^2}   (1)
d(\xi, x) = d(\xi - x) = \|\xi - x\|   (2)
s(\xi, x) = e^{-\frac{1}{2}\left(\frac{\delta(f(\xi), f(x))}{\sigma_r}\right)^2}   (3)
\delta(\phi, f) = \delta(\phi - f) = \|\phi - f\|   (4)
The bilateral filter replaces each pixel by an average of geometrically nearby and photometrically similar pixel values, as described in Eq. (5) with the normalizing function of Eq. (6). Only pixels within the neighborhood of the relevant pixel are used. The neighborhood, and consequently also the kernel size, is determined by the geometric spread σd. The parameter σr (photometric spread) in the similarity function determines the amount of combination: when the difference of pixel values is less than σr, these values are combined, otherwise not.

h(x) = k^{-1}(x) \int_{-\infty}^{\infty}\int_{-\infty}^{\infty} f(\xi)\, c(\xi, x)\, s(f(\xi), f(x))\, d\xi   (5)
k(x) = \int_{-\infty}^{\infty}\int_{-\infty}^{\infty} c(\xi, x)\, s(f(\xi), f(x))\, d\xi   (6)
Compared to the memory access dominated decompose and reconstruct phases, the bilateral filter is compute intensive. Considering a 5 × 5 filter kernel (σd = 2), 50 exponentiations are required for each pixel of the image – 25 for each, the closeness and similarity function. While the mask coefficients for the closeness function are static, those for the similarity function have to be calculated dynamically based on the photometric vicinity of pixel values.
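To make the mapping concrete, the following CUDA sketch shows one way the 5 × 5 bilateral filter could be written along the lines described above and optimized as in Section 5: the closeness weights of Eq. (1) are precomputed on the host (e.g., copied with cudaMemcpyToSymbol) into constant memory, and the similarity weights of Eq. (3) are evaluated per pixel with the __expf intrinsic. This is a hedged illustration, not the authors' implementation; the kernel and variable names, the data type and the border handling are assumptions.

#define RADIUS 2   // sigma_d = 2 corresponds to a 5x5 filter window

__constant__ float c_close[2 * RADIUS + 1][2 * RADIUS + 1];  // closeness LUT, Eq. (1)

__global__ void bilateral5x5(const float *in, float *out,
                             int width, int height, float sigma_r)
{
    int x = blockIdx.x * blockDim.x + threadIdx.x;
    int y = blockIdx.y * blockDim.y + threadIdx.y;
    if (x >= width || y >= height) return;

    float center = in[y * width + x];
    float sum = 0.0f, norm = 0.0f;

    for (int dy = -RADIUS; dy <= RADIUS; ++dy) {
        for (int dx = -RADIUS; dx <= RADIUS; ++dx) {
            int xx = min(max(x + dx, 0), width  - 1);   // clamp at the border
            int yy = min(max(y + dy, 0), height - 1);
            float v = in[yy * width + xx];
            float d = v - center;
            // combined weight: precomputed closeness times similarity, Eq. (3)
            float w = c_close[dy + RADIUS][dx + RADIUS]
                    * __expf(-0.5f * (d / sigma_r) * (d / sigma_r));
            sum  += w * v;
            norm += w;
        }
    }
    out[y * width + x] = sum / norm;   // h(x) = k^{-1}(x) * sum, Eq. (5)/(6)
}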
5 Results This section shows the results when the described mapping strategy of Sec. 3 is applied to the multiresolution filter implementation. We show the improvements that we attain for compute-bound kernels as well as memory-bound kernels. Furthermore, our proposed method for optimal configuration is shown exemplary for a Tesla C870 and GeForce 8400. For the compute-bound bilateral filter kernel, loop-invariant code is precalculated and stored in lookup tables. This is done for the closeness function as well as for the similarity function. In addition, texture memory is used to improve the memory performance. Aside from global memory, linear texture memory as well as a two-dimensional texture array are considered. The left graph of Fig. 4 shows the impact of the lookup tables and texture memory on the execution time. The lookup tables are stored in constant memory. First, it can be seen that textures reduce significantly the execution times, in particular when linear texture memory is used. The biggest speedup is gained for a lookup table for the closeness function while the speedup for the similarity function is only marginal. Using lookup tables for both functions has no further improvement. In the closeness function all threads access the same element of the lookup table. Since the constant memory is optimized for such access patterns, this lookup table shows the biggest acceleration. In the right graph intrinsic functions are used in addition. Compiling a program with the -use_fast_math compiler option enables intrinsic functions for the whole program. In particular the naive implementation benefits from this, having most arithmetic operations of all implementations. Altogether, the execution time is reduced more than 60% for the best implementation using a lookup table for the closeness
Fig. 4. Optimization of the compute-bound bilateral filter (σd = 2) kernel: Shown is the influence of loop-invariant code motion and intrinsic functions for an image of 512 × 512 using different memory types
This implementation achieves up to 63 GFLOPS, counting a lookup table access as one operation. For the naive implementation, over 80 GFLOPS are achieved using intrinsic functions.
The kernels for the decompose and reconstruct phases are memory-bound. Initially, a separate kernel is used for each task of these phases, i.e., one kernel for lowpass, upsample, downsample, etc. Subsequently, these kernels are merged as far as data dependencies permit. Figure 5 shows the impact of merging kernels for an exemplary sequence of tasks, in the following called the expand operator: first, the image is upsampled, then a lowpass filter is applied to the resulting image, and finally the values are multiplied by a factor of four. This operator is used in the decompose phase as well as in the reconstruct phase. Merging the kernels for these tasks reduces global memory accesses and allows further optimizations within the new kernel. The execution time for an image of 512 × 512 could be reduced significantly from about 0.38 ms to 0.19 ms. However, the merged kernel writes its results back to global memory uncoalesced, since each thread has to write two consecutive data elements after the upsample step. Therefore, shared memory is used to buffer the results of all threads and afterwards write them back to global memory in a coalesced manner, as illustrated in the sketch below. This reduces the execution time further to 0.09 ms. These optimizations are also applied to the other tasks of the decompose and reconstruct phases. In total, the execution time of the first implementation using global memory is reduced from 4.36 ms to 0.20 ms for decompose and from 2.29 ms to 0.10 ms for reconstruct.
After the algorithm is mapped to the graphics hardware, the thread block configuration is explored. The configuration space for two-dimensional tiles comprises 3280 possible configurations. Since 16 elements always have to be accessed in a row for coalescing, only such configurations are considered. This reduces the number of relevant configurations to 119, that is, 3.6% of the whole configuration space. From these configurations, we assumed that a square block of 16 × 16 threads would yield the best performance for the bilateral filter kernel: because each thread also loads its neighboring pixels, a square block configuration utilizes the texture cache best when loading data. However, the exploration shows that the best configuration has 64 × 1 threads on the Tesla C870 and 32 × 6 threads on the GeForce 8400.
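The following CUDA sketch illustrates the shared-memory staging used to keep the writes of such a merged kernel coalesced. It is not the authors' expand operator: the lowpass step is omitted, nearest-neighbor duplication is assumed for the horizontal upsampling, and all names are hypothetical; the point is only that buffering the doubled row in shared memory lets consecutive threads write consecutive addresses.

    #define BLOCK 128   // threads per block; must match the launch configuration

    // Upsample one image row by a factor of two in x and scale the values.
    // Writing the two output pixels per thread directly would be uncoalesced;
    // staging them in shared memory allows a coalesced write-out.
    __global__ void upsample_x_scale(const float *in, float *out,
                                     int in_w, float scale) {
        __shared__ float buf[2 * BLOCK];
        int row  = blockIdx.y;
        int in_x = blockIdx.x * BLOCK + threadIdx.x;
        if (in_x < in_w) {
            float v = scale * in[row * in_w + in_x];
            buf[2 * threadIdx.x]     = v;   // duplicated sample
            buf[2 * threadIdx.x + 1] = v;
        }
        __syncthreads();

        // consecutive threads now write consecutive output addresses
        int out_w = 2 * in_w;
        int base  = blockIdx.x * 2 * BLOCK;
        for (int i = threadIdx.x; i < 2 * BLOCK; i += BLOCK) {
            int out_x = base + i;
            if (out_x < out_w) out[row * out_w + out_x] = buf[i];
        }
    }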
Fig. 5. Optimization of the memory-bound expand operator: Shown is the influence of merging multiple kernels and utilization of shared memory to achieve coalescing
Fig. 6. Configuration space exploration for the bilateral filter (σd = 2) for an image of 1024 × 1024 on the Tesla C870 and GeForce 8400, respectively
Figure 6 shows the execution times of the 119 considered configurations for both cards. The data set is plotted in 2D for better visualization: plotted against the x-axis is the total number of threads per block, so the configurations 16 × 16 and 32 × 8, for instance, share the same x-value. The best configuration takes 4.19 ms on the Tesla and 58.04 ms on the GeForce, whereas the 16 × 16 configuration previously assumed to be optimal takes 4.67 ms and 59.22 ms, respectively. While the best configuration is 10.3% faster on the Tesla, it is only about 2% faster on the GeForce. Compared to the worst (yet still coalesced) configuration, the best configuration is more than 30% faster in both cases. This shows that the best configuration for an application is not predictable and that an exploration is needed to determine the best configuration for each graphics card. These configurations are determined once offline and stored in a database. Later, at run time, the application only has to load its configuration from the database. This way the best performance can always be achieved with only a moderate increase in code size.
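A minimal sketch of such a run-time lookup is given below. The table contents, the device-name matching, and all identifiers are assumptions made for illustration; in practice the entries would be produced by the offline exploration.

    #include <cuda_runtime.h>
    #include <string.h>

    struct KernelConfig { const char *device; int bx, by; };

    // Hypothetical results of the offline exploration for the bilateral filter kernel.
    static const KernelConfig bilateral_configs[] = {
        { "Tesla C870",   64, 1 },
        { "GeForce 8400", 32, 6 },
    };

    // Return the explored block configuration for the current device,
    // falling back to 16x16 if the device was never explored.
    static dim3 lookupBlockConfig(void) {
        cudaDeviceProp prop;
        cudaGetDeviceProperties(&prop, 0);
        for (unsigned i = 0; i < sizeof(bilateral_configs) / sizeof(bilateral_configs[0]); ++i)
            if (strcmp(prop.name, bilateral_configs[i].device) == 0)
                return dim3(bilateral_configs[i].bx, bilateral_configs[i].by);
        return dim3(16, 16);
    }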
A comparison of the complete multiresolution filter implementation with a CPU implementation shows the speedup that can be achieved. The CPU implementation runs on a Xeon Quad Core (2.66 GHz) and uses OpenMP to utilize all four cores of the CPU. As seen in Table 2, the Tesla achieves a speedup between 21x and 33x compared to the CPU. Images up to a resolution of 2048 × 2048 can also be processed in real time using a 5 × 5 filter.

Table 2. Speedup and frames per second (FPS) for the multiresolution application on a Tesla C870 and a Xeon Quad Core (2.66 GHz) for σd = 2 and different image sizes

              512 × 512   1024 × 1024   2048 × 2048   4096 × 4096
FPS (Xeon)        17.29          4.64          1.08          0.11
FPS (Tesla)      382.13        130.99         36.17          2.32
Speedup           22.09         28.17         33.24         21.05
6 Conclusions

In this paper it has been shown that multiresolution filters can leverage the potential of current highly parallel graphics card hardware using CUDA. The image processing algorithm was accelerated by more than one order of magnitude. Depending on the task, different approaches have been presented in order to achieve remarkable speedups. Memory-bound tasks benefit from a higher ratio of arithmetic instructions to memory accesses, whereas for compute-bound kernels the instruction count has to be decreased at the expense of additional memory accesses. Finally, the best configuration for the kernels is determined by exploration of the configuration space. To avoid exploration at run time for different graphics cards, the best configuration is determined offline and stored in a database. At run time the application retrieves the configuration for its card from the database. That way, the best performance can be achieved independent of the hardware used.
Applying this strategy to a multiresolution application with a computationally intensive filter kernel yielded remarkable speedups. The implementation on the Tesla outperformed an optimized and also parallelized CPU implementation on a Xeon Quad Core by a factor of up to 33x. The computationally most intensive part of the multiresolution application achieved over 80 GFLOPS, taking advantage of the highly parallel architecture. The configuration space exploration for the kernels revealed configurations more than 10% faster than those thought to be optimal. An implementation of the multiresolution filter as a GIMP plugin is also available online (http://www12.cs.fau.de/people/membarth/cuda/), showing the impressive speedup compared to conventional CPUs.
In future work the configuration space exploration could be integrated into the workflow of existing tools. Also, the capabilities of newer graphics hardware, which supports asynchronous concurrent execution between the CPU and GPU, could be used to share
the workload between the host and device. Lower resolutions can be calculated on the CPU, while the computationally more intensive higher resolutions are processed on the graphics card. This introduces a new level of parallelism using heterogeneous processing architectures, where tasks can be mapped to the architecture that better suits the algorithm.
Implementing Blocked Sparse Matrix-Vector Multiplication on NVIDIA GPUs
Alexander Monakov and Arutyun Avetisyan
Institute for System Programming of RAS, Moscow, Russia
{amonakov,arut}@ispras.ru
Abstract. We discuss implementing blocked sparse matrix-vector multiplication for NVIDIA GPUs. We outline an algorithm and various optimizations, and identify potential future improvements and challenging tasks. In comparison with a previously published implementation, ours is faster on matrices having many high-fill-ratio blocks but slower on matrices with a low number of non-zero elements per row.
1 Introduction
Modern graphics processors are massively parallel computing devices with outstanding computational power and memory bandwidth. For example, the NVIDIA GeForce GTX 285 peaks at 1063 GFLOPS in single precision and 159 GBytes/s memory bandwidth. As a result, GPUs are increasingly used for accelerating appropriate compute-intensive tasks. NVIDIA GPUs are programmed in a programming model called CUDA [1]. In sparse matrices, the fraction of non-zero elements is small. While it is possible to use generic data structures and routines to perform computations with such matrices, it is inefficient (as most calculations on zero elements are redundant) and sometimes even impractical due to the large dimensions of the matrix. In practice, sparse matrices are stored in specialized data structures whose size is proportional to the number of non-zero elements. Calculations involving sparse matrices arise in many numerical computations. For example, solving a partial differential equation using the finite element method boils down to solving a system of linear equations Ax = b, where A is sparse. Non-zero elements of A would be arranged in a regular or an irregular pattern depending on the selection of a structured or unstructured mesh for discretization of the original problem. Solving Ax = b for sparse A is frequently done using iterative methods, in which case the most time-consuming step is computing the matrix-vector product y = y + At for some t. In the conjugate gradient method, other steps operate on vectors and are relatively easy to implement efficiently. In this paper we discuss implementing the sparse matrix-vector product on NVIDIA GPUs with no specific assumptions about the structure of A. If values or locations of non-zero elements can be efficiently computed (e.g. when A is derived from discretization on a regular mesh, its non-zero elements occupy
several diagonals), a specialized implementation will likely demonstrate better performance. Optimizing for symmetric sparse matrices A or computing y = At for multiple t simultaneously is also out of the scope of this article. We argue that reducing the memory bandwidth requirement is key to improving performance and use blocking to achieve this goal. Our benchmarks show that our implementation compares favorably to a previously published non-blocked implementation.
2 The CUDA Programming Model
The CUDA programming model closely resembles the organization of graphics hardware. Computations are launched in a parallel manner, with all threads executing the same function, called a kernel. The threads are partitioned into blocks. Threads within a block have access to common shared memory and may synchronize using a barrier synchronization instruction. Physically, a block of threads is executed on a GPU "core" called a multiprocessor. NVIDIA GT200 GPUs include up to 30 such multiprocessors. Each multiprocessor contains eight single-precision ALUs, one double-precision ALU and one instruction issue unit. Threads of a block are partitioned into 32-thread SIMD groups called warps. The instruction issue unit can switch between active warps with low overhead, which allows instruction latency to be hidden. GT200 allows for up to 32 active warps (and hence up to 1024 active threads) per multiprocessor. The ratio of active warps to the theoretical maximum is called occupancy. Maximizing occupancy is important when optimizing memory-intensive tasks, as it improves memory latency hiding by switching between active warps. Each multiprocessor contains a 16-KiB storage area, which serves as shared memory for the thread blocks running on this multiprocessor. For example, allocating 5 KiB for each block does not allow for more than 3 active blocks per multiprocessor. The register file is another partitioned resource, as there is no register save/restore on warp switching. Therefore, the amount of shared memory per block and the number of registers allocated per thread both affect the possible number of active warps. Memory requests are serviced for halves of a warp at a time. To achieve the highest memory throughput, the memory access pattern must follow coalescing rules: accessed addresses must fit into a 64- or 128-byte window, and the window itself must be aligned.
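The effect of these coalescing rules can be illustrated with two trivial copy kernels; this is an illustrative sketch, not code from the paper.

    // Coalesced: thread i of a half-warp touches element base + i, so the
    // 16 accesses fall into one aligned segment and are serviced together.
    __global__ void copy_coalesced(const float *in, float *out, int n) {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n) out[i] = in[i];
    }

    // Uncoalesced: consecutive threads access addresses 'stride' elements apart,
    // so the request of a half-warp is split into many separate transactions.
    __global__ void copy_strided(const float *in, float *out, int n, int stride) {
        int i = (blockIdx.x * blockDim.x + threadIdx.x) * stride;
        if (i < n) out[i] = in[i];
    }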
3 Sparse Matrix-Vector Multiplication
We implement y = y + Ax, where A is an N × M sparse matrix, and x and y are dense M-element and N-element vectors, respectively. The performance of SpMV (sparse matrix-vector multiplication) largely depends on memory throughput: for every matrix element Aij only two floating-point
operations are performed (a multiplication by input vector element xj and an accumulation to output vector element yi). Since the GPU memory bandwidth is nearly an order of magnitude larger than that of modern x86 systems, it is tempting to utilize GPUs as accelerators for numerical methods involving sparse matrices. When performing SpMV, memory bandwidth is primarily used for:
1. Reading non-zero elements of A;
2. Reading coordinates of non-zero elements;
3. Servicing cache misses on accesses to x (if x is allocated in read-only cached texture memory, as in our implementation);
4. Reading and writing y elements.
Bandwidth consumption by items 1 and 4 obviously does not depend on the matrix storage format. The number of cache misses on fetches from vector x depends on the pattern of non-zero elements, the hardware implementation and the organization of computations. The lack of details on the hardware implementation makes estimating cache misses a hard problem, but this may become possible in the future: Volkov and Demmel [2] have experimentally discovered important characteristics of GPU caches, and a CUDA-compatible GPU simulator is currently being developed [3]. In this paper, we present an approach that addresses issue 2, that is, reducing the memory bandwidth required to read the coordinates of non-zero elements by using a blocked storage format.
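As a point of reference for this bandwidth accounting, a straightforward non-blocked CSR kernel with one thread per row is sketched below. It is not the implementation described in this paper; it only illustrates how little arithmetic accompanies each value and index loaded from memory.

    // CSR arrays: row_ptr[n_rows + 1], col_idx[nnz], val[nnz].
    // Each thread computes one row of y += A * x.
    __global__ void spmv_csr_scalar(int n_rows, const int *row_ptr,
                                    const int *col_idx, const float *val,
                                    const float *x, float *y) {
        int row = blockIdx.x * blockDim.x + threadIdx.x;
        if (row >= n_rows) return;
        float sum = y[row];
        for (int j = row_ptr[row]; j < row_ptr[row + 1]; ++j)
            sum += val[j] * x[col_idx[j]];   // two flops per value + column index read
        y[row] = sum;
    }

Note also that in this layout consecutive threads of a warp read non-contiguous val and col_idx entries, so these accesses are not coalesced.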
4 Previous Work
Optimization of SpMV for CPUs has been extensively studied (e.g. see Vuduc's dissertation [4], which includes descriptions of many storage formats and provides experimental data on CPUs). Many researchers note that SpMV implementations usually extract only a few percent of a CPU's peak performance and stress the importance of blocking to reduce pressure on the memory subsystem and to explicitly express data reuse. A recent work [5] presents an evaluation of optimized SpMV implementations on multi-core x86 processors, STI Cell and Sun Niagara 2. Operations on dense matrices on GPUs have been thoroughly analyzed, which is partly due to the more regular nature of the problem. Volkov and Demmel [2] present an experimental study of the GPU memory subsystem and an efficient implementation of dense matrix-matrix multiplication. The implementation is shown to be nearly optimal under the constraints of the hardware implementation. Exploration of algorithms on sparse matrices on GPUs has not yet received as much attention in comparison. Bell and Garland [6] investigate the performance of several non-blocked methods. They propose using a hybrid approach to sparse matrix storage, which results in an efficient SpMV implementation for a variety of test matrices. In earlier work [7], Buatois et al. note the importance of using blocked formats but do not optimize for memory coalescing.
In this paper we describe a new hybrid blocked storage format and present experimental results that are better than those reported in previous work [6,7] on matrices with a high block fill ratio.
5 Implementing Blocked SpMV in CUDA
5.1 Non-blocked Storage Formats
We refer to the following non-blocked storage formats for sparse matrices (a small concrete example is given below):
1. Coordinate format (COO). For each non-zero element, both its column and row indices are explicitly stored. Elements may either be stored in any order, or elements from the same row may be required to be packed together.
2. Compressed sparse row format (CSR). Elements are sorted by row index (the ordering of elements within a row is not specified). For each element, only its column index is explicitly stored. Additionally, an array with the index of the first element in each row is stored.
3. ELLPACK format. For each row, exactly K elements are stored (extra zero elements are introduced for rows that contain less than K non-zero elements, and it is not possible to represent matrices with more than K non-zero elements in any row). Like in CSR, elements are sorted by row index and only column indices are explicitly stored.
Bell and Garland [6] provide a thorough description of these and other storage formats along with a performance evaluation of SpMV on the GPU for each of them.
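For illustration, the CSR and ELLPACK representations of a small example matrix (chosen by us, not taken from the paper) look as follows; padding conventions for ELLPACK vary between implementations.

    /* Example 4x4 matrix:
     *     | 1 0 2 0 |
     * A = | 0 3 0 0 |
     *     | 4 0 5 6 |
     *     | 0 0 0 7 |
     */

    /* CSR: values and column indices in row order, plus row start offsets. */
    const float csr_val[]    = { 1, 2,  3,  4, 5, 6,  7 };
    const int   csr_col[]    = { 0, 2,  1,  0, 2, 3,  3 };
    const int   csr_rowptr[] = { 0, 2, 3, 6, 7 };

    /* ELLPACK with K = 3: every row padded to exactly K entries
     * (padding values are zero; padding column indices repeat a valid index). */
    const float ell_val[4][3] = { {1, 2, 0}, {3, 0, 0}, {4, 5, 6}, {7, 0, 0} };
    const int   ell_col[4][3] = { {0, 2, 2}, {1, 1, 1}, {0, 2, 3}, {3, 3, 3} };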
5.2 Blocked Storage Formats
Basic blocked storage formats for sparse matrices are BCOO (blocked coordinate format) and BCSR (blocked CSR format). In both formats the matrix is subdivided into blocks of size Br × Bc that are stored in dense form (which may require storing some extra zeros). In BCOO, both coordinates of each block are stored. In BCSR, blocks are assumed to begin in a row divisible by Br, which allows storing only the column coordinate of each block explicitly and encoding the row coordinates implicitly by sorting blocks by row and storing the index of the first block in each row, as in the CSR format. For the purpose of a GPU implementation we employ a hybrid blocked format combining features of BCOO and BCSR. First, we subdivide the matrix into strips of Sr consecutive rows. Within each strip, blocks are stored in BCOO format. Note that, assuming Sr is sufficiently small, it is possible to store the block column index and the offset from the strip's top row in one word (e.g. using bits 0-6 for the offset, when strips are no more than 128 rows in height, leaves bits 7-31 for column indexing, allowing matrices with up to 33554432 columns; a possible encoding is sketched below). We do not require that any of the block coordinates is divisible by Br or Bc. Blocks are sorted by strip index and, similar to CSR, an index of the first block in a strip is stored for each strip.
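The packed coordinate word could be encoded as follows; the field split matches the example in the text (7 offset bits, 25 column bits), while the helper names are ours.

    // Pack a block's column index and its row offset within the strip into one
    // 32-bit word: bits 0-6 hold the offset (strips of up to 128 rows),
    // bits 7-31 hold the column index (up to 2^25 = 33554432 columns).
    __host__ __device__ inline unsigned pack_block(unsigned col, unsigned offset) {
        return (col << 7) | (offset & 0x7Fu);
    }

    __host__ __device__ inline unsigned block_offset(unsigned packed) {
        return packed & 0x7Fu;
    }

    __host__ __device__ inline unsigned block_col(unsigned packed) {
        return packed >> 7;
    }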
5.3 Work Distribution
We need to partition the computation into tasks that do not require communication and thus can be assigned to CUDA thread blocks. This is naturally achieved by processing each strip completely within one thread block. Threads of a block are logically subdivided into groups of size Br × Bc. Each of the logical groups reads a matrix block (thus each thread reads one element of a block) and multiplies it by the corresponding element of vector x. After all blocks are processed, the threads synchronize and update the corresponding portion of vector y. Note that if the number of blocks in the strip is not evenly divisible by the number of logical groups, some logical groups will wait at the synchronization point while other logical groups execute an extra iteration. In order to maximize occupancy, we choose between 128, 256 or 512 threads per block (512 threads per block does not allow for high occupancy on first-generation CUDA hardware, since the hardware is limited to 24 active warps or 768 threads per multiprocessor). We choose the minimal value, 128 threads per block, to reduce the inefficiency caused by uneven work distribution within a block. In order to simplify control flow, we limit ourselves to Sr, Br and Bc that are powers of two. Furthermore, in order to avoid synchronization we mandate that Bc does not exceed the warp size (i.e., Bc ≤ 32).
5.4 Shared Memory Allocation
Since we allow arbitrary placement of blocks within a strip, we need to allocate addressable storage for Sr elements in shared memory. We also want to avoid synchronization as much as possible to maximize performance. Therefore, we assign Sr elements to each logical group. Since logical groups are parts of warps, they are able to update their portion of this temporary storage without explicit synchronization.
5.5 Occupancy Considerations
Shared memory consumption is the limiting factor of occupancy in our implementation. Recall that 200-series GPUs host at most 1024 threads per multiprocessor. Since shared memory capacity is 16 KiB, this leaves less than 4 single-precision floats per thread on average before occupancy is constrained by shared memory requirement. For our implementation this sets the highest possible strip height at twice the block size in single precision.
6 Hybrid Approach
Note that blocked formats require the storage of extra zero elements if the non-zero elements of the matrix do not form dense blocks of the required size. This wastes both storage and bandwidth. In order to mitigate this deficiency of the blocked approach, we employ a hybrid approach, inspired by [6], where the best performance is observed for a hybrid ELL/COO method.
We have chosen to use the ELL format for storing non-zero values not covered by blocks, because it allows for a very efficient implementation. However, the ELL format itself has some storage overhead if the number of non-zero elements to be stored differs from row to row. To reduce the cost of this overhead we do the following:
1. We allow the number of non-zero elements per row to vary from strip to strip. This allows storing shorter rows for strips whose longest row is shorter than the longest row over all strips, at the cost of storing one additional array of indices, similar to the CSR format. Note that such an optimization is applicable in every implementation using the ELL format.
2. We implement a simple matrix reordering heuristic to pack rows with a similar number of non-zero elements into strips. This is not necessarily beneficial, because it may reduce the locality of non-zero elements and cause fewer blocks to be identified.
To implement the reordering, we iterate over all rows in the order they appear in the initial matrix and add each row to the corresponding bucket. The buckets are assigned based on the number of non-zero elements in the row: specifically, we use separate buckets for rows with up to 30 non-zero elements, and one more bucket for all rows with 31 or more non-zero elements. When adding a row to a bucket makes the total number of rows in this bucket reach the strip size Sr, we output these rows into the reordered matrix, thus creating a strip of rows of equal length (or length over 31 if these rows come from the last bucket). The resulting permutation is recorded and applied to the columns after the row order is decided. Even though the number of non-zero elements is exactly equal within strips with a low number of non-zero elements after such reordering, block selection may again make it uneven.
We also note that launching two CUDA kernels that implement multiplication by the block and ELL portions is inefficient, because it causes the y vector to be read from global memory and written back twice: for example, for a very sparse matrix with an average of two non-zero elements per row, reading and writing the y vector requires the same bandwidth as reading the non-zero elements of the matrix itself. Instead, we combine both kernels by hand into one (a sketch of the ELL portion is given below). Such an approach also benefits from better x vector caching if accesses from the block and ELL portions have spatial locality.
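The ELL portion of the combined kernel could look roughly as sketched below. This is not the authors' code: the strip height, the array names, and the column-major layout of the ELL data within each strip (which makes the one-thread-per-row accesses coalesced) are assumptions made for illustration.

    #define S_R 64   // strip height; assumed value for illustration

    // One thread per row.  Within each strip the ELL data is stored column-major
    // (all first elements of the S_R rows, then all second elements, ...), so
    // consecutive threads read consecutive addresses.
    __global__ void spmv_ell_strips(int n_rows,
                                    const int *strip_start,   // data offset of each strip
                                    const int *strip_width,   // per-strip row length K
                                    const float *ell_val, const int *ell_col,
                                    const float *x, float *y) {
        int row = blockIdx.x * blockDim.x + threadIdx.x;
        if (row >= n_rows) return;
        int strip = row / S_R;
        int lane  = row % S_R;            // position of the row inside its strip
        int base  = strip_start[strip];
        int width = strip_width[strip];
        float sum = y[row];
        for (int k = 0; k < width; ++k) {
            float v = ell_val[base + k * S_R + lane];
            if (v != 0.0f)                // padding entries are stored as zero
                sum += v * x[ell_col[base + k * S_R + lane]];
        }
        y[row] = sum;
    }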
7 Block Selection
The performance of the described approach largely depends on the quality of the block selection pass. It must partition each strip into block and ELL parts so that the total processing time is minimized. This does not necessarily imply minimizing the space required for storing the strip, as the loops processing the block and ELL parts show different throughput. Even when minimizing storage space, optimal block selection is a non-trivial task. We have implemented two approaches to this problem.
The first uses dynamic programming to calculate an optimal selection of blocks under the constraint that block row coordinates are divisible by Br. The drawbacks of this approach are its high time and memory requirements. The second is a rather simple heuristic approach based on greedy block selection. It gives consistently worse results than the first, but is much faster and may be used as a fallback implementation when the first requires too much memory or a fast conversion is required. In the greedy approach we parametrize block selection by a minimum block fill fmin. The algorithm outline is as follows:
1. Identify all blocks whose number of non-zero elements is greater than or equal to fmin and record them in a heap structure.
2. Do a selection pre-pass by consecutively choosing the block with the largest count of non-zero elements, until the block with the most non-zero elements has fewer than fmin non-zero elements (as a block is chosen, the number of non-zero elements in blocks that contain elements of the chosen block is reduced).
3. Record the maximum residual count Kest of non-zero elements in a row. This would be the row length of the ELL partition if we selected all blocks found in the pre-pass for the block partition.
4. Do the final selection pass in a fashion similar to the pre-pass, but skip blocks in rows that have no more than Kest non-zero elements left. Choosing such blocks saves neither memory nor computations.
Here, choosing arbitrary blocks sometimes leads to selections where the length of the first and last rows of a strip is not or only slightly reduced, as blocks on the strip boundary have a small number of non-zero elements after the more populated blocks further from the boundary have been chosen. Again, we restrict block row coordinates to be divisible by Br to mitigate this effect.
8 Performance Evaluation
We evaluated our implementation on a Geforce GTX280 with CUDA toolkit version 2.1. Results with 4 × 4 blocks are presented in Table 1. Most test matrices are those referenced in [5] and [6]. A few matrices were added to the suite, namely bcsstm35 and msc23052 to evaluate performance on relatively small matrices with few non-zero elements, and also nd6k and nd12k as matrices with a high ratio of non-zero elements per row. These matrices are taken from the University of Florida sparse matrix collection [8]. Rows, Columns and Nonzeros show the base characteristics of the matrices. The Blocks column shows the number of blocks identified in each matrix. The In blocks column shows the percentage of non-zero elements assigned to blocks; the rest is stored in ELL format. Av. fill gives the average percentage of non-zero elements in blocks and can be derived from the previous three columns. The Reference and Our columns show the performance of the reference and our implementations in GFLOPS, respectively. The listed value is calculated as 2 · Nnz / T, where Nnz is the number of non-zero elements and T is the time required for one multiplication by a vector, not including the time to copy data to the device and back.
Table 1. Performance results on Geforce GTX280, single precision, 4 × 4 blocks

Name         Rows     Columns  Nonzeros   Blocks  In blocks, %  Av. fill, %  Reference  Our    Speedup
bcsstm35     30237    30237    20619      4041    60.8          19.4         0.70       0.8    1.13
cant         62451    62451    4007383    143388  47.8          83.5         16.56      19.3   1.17
consph       83334    83334    6010480    210768  49.7          88.6         21.09      21.25  1.01
cop20k_A     121192   121192   2624331    13116   3.8           47.5         8.32       9.81   1.18
dense2       2000     2000     4000000    250000  100           100          3.92       14.03  3.58
mac_econ     206500   206500   1273389    2333    0.5           17.1         7.78       7.83   1.01
mc2depi      525825   525825   2100225    0       0             n/a          19.12      7.41   0.39
msc23052     23052    23052    1154814    36524   37.6          74.3         12.22      16.44  1.35
nd12k        36000    36000    14220946   803195  82.1          90.9         12.10      21.83  1.80
nd6k         18000    18000    6897316    391516  82.3          90.6         12.29      20.32  1.65
pdb1HYS      36417    36417    4344765    273790  82.6          81.9         13.28      19.94  1.50
pwtk         217918   217918   11634424   635930  78.4          89.6         21.48      22.21  1.03
qcd5_4       49152    49152    1916928    0       0             n/a          21.48      17.6   0.82
rail4284     4284     1092610  11279748   809     0.1           87.1         2.54       1.25   0.49
rma10        46835    46835    2374001    139934  68.5          72.6         11.16      15.64  1.40
scircuit     170998   170998   958936     4159    1.5           21.6         6.81       4      0.59
shipsec1     140874   140874   7813404    375870  57.4          74.6         18.22      18.31  1.01
webbase-1M   1000005  1000005  3105536    1087    0.5           89.3         6.50       1.38   0.21
As a reference, we use Bell and Garland's implementation of the hybrid method [6]. All tests use single precision. Our implementation uses texture lookups to fetch data where coalesced access is not possible. Speedup is calculated as the ratio of the reference time to this implementation's time, so a speedup value below one means that the described implementation is actually slower than the reference. The exceptional speedup on the dense matrix is due to the fact that the reference implementation does not fully exploit the capabilities of the device on this test case, since it spawns only as many threads as there are rows in the matrix. Our implementation not only spawns four times more threads, but also requires less memory bandwidth, hence the speedup. Some of the matrices are processed much slower: webbase-1M, mc2depi, scircuit and rail4284. We explain the slowdown on the first three tests by the small number of non-zero elements per row, which causes a high percentage of idle threads in our implementation. The slowdown on rail4284 is probably explained by its highly irregular structure (the reference implementation stores more than 50% of its non-zero elements in the COO partition, which indicates that the number of non-zero elements per row varies a lot). Finally, there is some slowdown on the qcd5_4 test, but it is hard to explain; we note that when tested on a Geforce GTX260, the performance of both approaches is equal. Performance on the other tested matrices varies, with the best speedups observed on the nd6k and nd12k matrices, which are characterized by high counts of non-zero elements per row and a structure that allows many blocks to be identified. Note that the average block fill in some matrices is below 50%, which means many blocks are used to avoid expanding the ELL partition (e.g., when the block size and strip height are equal, adding one block has a significantly lower cost than adding one ELL column in the block selection based on dynamic programming).
9 Future Work
We see the following topics as important for future research in this area:
1. Investigating how matrix reordering can be used to reduce cache misses on GPU hardware.
2. It is possible to reduce the storage space requirement of symmetric matrices by storing only one of the symmetric halves. However, efficiently parallelizing SpMV for GPUs is not trivial in this case.
3. Some applications include computing y = Ax for multiple x. This imposes different optimization criteria: as the number of x vectors grows, they consume more bandwidth than the sparse matrix itself. As a result, a relatively simple storage method like CSR for the matrix is likely to suffice, and the optimization challenge shifts to efficient storage of the input vectors: e.g., for 16 vectors it is better to store them in an interleaved fashion to allow perfectly coalesced access.
Acknowledgements We thank Alan Mycroft, Anton Lokhmotov and anonymous reviewers for their helpful comments on this paper. We also thank Russian Foundation of Basic Research and Royal Society for financial support of this work and the Software Performance Optimisation research group at Imperial College London for providing access to a GTX280 card.
References
1. NVIDIA Corporation: NVIDIA CUDA Programming Guide 2.1 (2008)
2. Volkov, V., Demmel, J.W.: Benchmarking GPUs to tune dense linear algebra. In: SC 2008: Proceedings of the 2008 ACM/IEEE Conference on Supercomputing, pp. 1–11. IEEE Press, Piscataway (2008)
3. Collange, S., Defour, D., Parello, D.: Barra, a modular functional GPU simulator for GPGPU. Technical report, CCSd/HAL: e-articles server (based on gBUS), France (2009), http://hal.ccsd.cnrs.fr/oai/oai.php
4. Vuduc, R.W.: Automatic performance tuning of sparse matrix kernels. Technical report (2003)
5. Williams, S., Oliker, L., Vuduc, R., Shalf, J., Yelick, K., Demmel, J.: Optimization of sparse matrix-vector multiplication on emerging multicore platforms. In: Proceedings of SC 2007 (2007)
6. Bell, N., Garland, M.: Efficient sparse matrix-vector multiplication on CUDA. NVIDIA Technical Report NVR-2008-004 (2008)
7. Buatois, L., Caumon, G., Lévy, B.: Concurrent number cruncher: An efficient sparse linear solver on the GPU. In: Perrott, R., Chapman, B.M., Subhlok, J., de Mello, R.F., Yang, L.T. (eds.) HPCC 2007. LNCS, vol. 4782, pp. 358–371. Springer, Heidelberg (2007)
8. Davis, T.A.: University of Florida sparse matrix collection. NA Digest 92 (1994)
Experiences with Cell-BE and GPU for Tomography
Sander van der Maar, Kees Joost Batenburg, and Jan Sijbers
IBBT-Vision Lab, Department of Physics, University of Antwerp, Belgium
{Sander.vanderMaar,Joost.Batenburg,Jan.Sijbers}@ua.ac.be
Abstract. Tomography is a powerful technique for three-dimensional imaging, that deals with image reconstruction from a series of projection images, acquired along a range of viewing directions. An important part of any tomograph system is the reconstruction algorithm. Iterative reconstruction algorithms have many advantages over non-iterative methods, yet their running time can be prohibitively long. As these algorithms have high potential for parallelization, multi-core architectures, such as the Cell-BE and GPU, can possibly alleviate this problem. In this paper, we describe our experiences in mapping the basic operations of iterative reconstruction algorithms onto these platforms. We argue that for this type of problem, the GPU yields superior performance compared to the Cell-BE. Performance results of our implementation demonstrate a speedup of over 40 for a single GPU, compared to a single-core CPU version. By combining eight GPUs and a quad-core CPU in a single system, similar performance to a large cluster consisting of hundreds of CPU cores has been obtained.
1 Introduction
Computer tomography (CT) is a well-known imaging technique in which a 3D image of an object is reconstructed from a series of 2D projection images. A prominent example of this technique can be found in medical CT scanners, which are capable of creating high resolution images of a patient's internal organs. The projection images are acquired using a scanning device. From the information present in these images, a 3D representation of the object is computed by a tomographic reconstruction algorithm. Currently, the most popular reconstruction algorithms are filtered backprojection (see Ch. 3 of [1]) and the Feldkamp algorithm [4]. Though highly computationally efficient, these methods suffer from several disadvantages. In particular, they require a large number of projections to obtain accurate reconstructions, and are sensitive to noise in the measured data. Iterative reconstruction techniques, on the other hand, do not suffer from these drawbacks. However, they are highly computationally intensive since the reconstructed image is obtained from repeated forward and backward projections. These operations impede the use of iterative reconstruction techniques in practice, especially for the reconstruction of large images.
The basic operations of forward and back projection are well suited for parallelization, in particular on SIMD architectures. A series of relatively simple operations must be applied to a large number of data elements independently. On the other hand, parallelization of these operations is not straightforward. First of all, the algorithm needs to be restructured in such a way that it can be executed in parallel correctly, without the possibility of memory access collisions. Second, to obtain optimal performance, the algorithm implementation must be tuned with respect to the particular parallel hardware architecture. Over the last few years, several software libraries have been released which allow the application of consumer hardware to perform parallel computations. IBM released an API, including a full processor simulator, to allow programming of the Cell Broadband Engine Architecture [9]. NVIDIA developed the CUDA programming platform that allows programming of their graphics cards in a C-like language [5]. Both platforms promise (theoretical) TeraFLOPS performance using just low-priced consumer hardware. In this paper, we investigate the suitability of these two platforms (Cell-BE and GPU) for iterative tomography computations. The paper starts with an introduction to the basic concepts of tomography and iterative reconstruction algorithms in Section 2. Section 3 provides a discussion of the features of both hardware platforms, in the context of tomography. We argue that the GPU is much more suitable for this type of computation. Section 4 describes how the basic tomography operations can be performed on NVIDIA GPUs using CUDA. In many tomography applications, the speedup that can be obtained using a single GPU is not sufficient. Section 5 describes our work on developing a desktop system that combines the computational power of eight GPUs. Experimental results are presented in Section 6. It is shown that our single 8-GPU system can outperform a large CPU cluster, consisting of hundreds of CPUs, for tomography computations. Section 7 concludes this paper.

Fig. 1. Basic setting of transmission tomography
2 Tomography
Fig. 1 shows the basic setting of parallel beam tomography. Projections are acquired of an unknown physical object along a range of angles. Although the figure shows a 2D setting, practical applications of tomography typically involve 2D projections of a 3D object. Projections are measured along lines lθ,t = {(x, y) ∈ R² : x cos θ + y sin θ = t}, where θ represents the angle between the line and the y-axis and t represents the coordinate along the projection axis. Iterative tomography algorithms typically discretize the image of the unknown object, forming an array of pixels (or voxels, in the case of 3D imaging). The
projection process can then be modeled as a system of linear equations

W x = p.    (1)

The matrix W is called the projection matrix, of which the elements represent the contribution of a voxel in x to a projection value in p. Reconstruction of x from its projections p would require the inverse of W. This matrix W, however, typically contains hundreds of billions of elements, ruling out the possibility of computing its generalized inverse explicitly or even storing it in memory. Hence, the set of linear equations given in Eq. (1) is solved in an iterative way. In such algorithms, the entries of the projection matrix are typically not stored in memory, but computed on-the-fly. A variety of iterative reconstruction algorithms are used in tomography, such as SART, OSEM and SIRT (see Ch. 7 of [1]). All of these algorithms have a similar structure and consist of a sequence of forward projection and back projection steps. For brevity, we focus on the SIRT algorithm, which consists of the following steps, applied iteratively (a sketch of one iteration in code follows below):
– Forward projection: The projection data p(k) of the current reconstruction x(k) are computed: p(k) = W x(k).
– Difference computation: The difference between the computed projections p(k) and the measured projections p is determined. This difference is weighted by a diagonal matrix β that describes the contribution of each voxel to a projection value: e(k) = β(p − p(k)).
– Backprojection: The difference for each detector element is smeared out across the corresponding line through the reconstructed volume: u(k) = γ W^T (p − p(k)), with γ a diagonal matrix that describes the contribution of each detector element to a reconstruction value.
– Update step: The reconstruction is updated with u(k) to yield a new reconstruction x(k+1): x(k+1) = x(k) + u(k).
In this paper, we focus on parallel beam tomography, where the projection for each angle is sampled along parallel rays. In such a setup, each 2D slice of a 3D voxel volume can be reconstructed independently, which allows parallel processing of slices without the need for communication between threads processing different slices. When processing a large 3D volume consisting of 1024^3 voxels, along with a large series of 1024 projection images, the number of operations required to compute a single forward or backprojection is of the order of 10^12. Moreover, these operations are applied iteratively, possibly for a large number of iterations. Even on a modern CPU, a single run of an iterative reconstruction algorithm can easily take weeks for such a data set.
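A sketch of how these steps could be chained on the GPU is given below. It is not the authors' code: the projection kernels are only declared (their implementation is discussed in Section 4), all names are hypothetical, and the diagonal matrices β and γ are represented by precomputed per-element weight vectors, with γ applied during the update step, which is mathematically equivalent to the formulation above.

    // Hypothetical projection kernels, assumed to be implemented elsewhere:
    // one thread per detector value / per voxel, respectively.
    __global__ void forward_project(const float *x, float *p_k, int n_rays);
    __global__ void back_project(const float *e_k, float *u_k, int n_voxels);

    // e_k = beta .* (p - p_k), computed in place of the computed projections
    __global__ void weighted_residual(const float *p, float *p_k,
                                      const float *beta, int n_rays) {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n_rays) p_k[i] = beta[i] * (p[i] - p_k[i]);
    }

    // x_{k+1} = x_k + gamma .* u_k
    __global__ void apply_update(float *x, const float *u_k,
                                 const float *gamma, int n_voxels) {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n_voxels) x[i] += gamma[i] * u_k[i];
    }

    // One SIRT iteration: x <- x + gamma .* W^T( beta .* (p - W x) ).
    void sirt_iteration(float *d_x, const float *d_p, float *d_sino, float *d_upd,
                        const float *d_beta, const float *d_gamma,
                        int n_voxels, int n_rays) {
        const int B = 256;
        forward_project<<<(n_rays + B - 1) / B, B>>>(d_x, d_sino, n_rays);
        weighted_residual<<<(n_rays + B - 1) / B, B>>>(d_p, d_sino, d_beta, n_rays);
        back_project<<<(n_voxels + B - 1) / B, B>>>(d_sino, d_upd, n_voxels);
        apply_update<<<(n_voxels + B - 1) / B, B>>>(d_x, d_upd, d_gamma, n_voxels);
    }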
3 Tomography on Parallel Hardware
In this section, the Cell-BE and GPU hardware platforms will be discussed, in the context of tomography. Both platforms are potential candidates for accelerating
iterative reconstruction algorithms. We argue that for this type of problem, the GPU yields superior performance compared to the Cell-BE.
3.1 Cell Broadband Engine Architecture
The Cell is a heterogeneous architecture [9], containing two different types of processors. A PowerPC, containing two hardware threads, has a number of SPE (Synergistic Processing Element) co-processors at its disposal for performing calculations. SPEs are able to access main memory and perform calculations independently, resulting in a flexible programming model with a wide field of potential applications. The maximum memory bandwidth between the Cell and main memory is 25.6 GB/s. To limit the load on main memory, every SPE has 256 kB of on-chip memory. Most algorithms will be able to reduce in- and outgoing communication by utilizing this programmer-managed cache. When RAM is accessed from an SPE, addressing needs to be performed in memory-aligned blocks. Memory copy operations are performed asynchronously, allowing computations to be performed while waiting for a memory access to finish.
3.2 GPU Architecture
Recent GPUs have a computational speed of up to 1 TeraFLOPS. Graphics hardware is optimized for the operations needed when rendering 3D scenes. During rendering, the same operations are performed on a large number of pixels, independent of the pixel values. GPUs implement an SIMD architecture to exploit this data parallelism. Also, because of the high data rates required in texture-based rendering, the memory bandwidth between the GPU and video RAM is high. High-end cards offer a bandwidth of over 100 GB/s. Part of the explanation for the GPUs' high computation speed is their relative design simplicity. Since most operations performed in rendering are relatively elementary, GPUs consist of a large number of processors implementing a limited instruction set. On NVIDIA GPUs, access to global memory can be sped up significantly by a technique known as coalescing. When all threads executed by an SIMD processor address adjacent memory blocks, the operation is performed in the same time as one read or write. Less restrictive memory access is allowed on 4 kB of on-chip memory, which is available to groups of threads executed by a single SIMD processor.
3.3 Tomography: Cell vs. GPU
The forward projection and back projection operations have several characteristic properties that determine the optimal hardware architecture for parallelization of these steps:
– High memory access intensity. In order to compute a forward projection for all projection angles, the full image data set has to be read from memory,
possibly many times, and an equally large set of projection data must be written. When comparing the amount of data read from and written to main memory with the time spent on calculation, it becomes clear that memory access is the main bottleneck.
– Potential for creating a large number of threads. Both forward projection and back projection have high potential for parallelization. We recall that the forward projection and backprojection operations both execute a linear transformation. This means that the projection of an image equals the sum of the projections of all its pixels.
– Data locality. While computing a projection, adjacent voxels are projected to detector pixels that are also adjacent. This allows threads executed by the same SIMD unit to address contiguous memory blocks throughout the projection computation.
– Low complexity of control statements; requirement for high-speed trigonometric operations. The control flow of iterative reconstruction algorithms is completely data-independent (and, as a consequence, predictable) and essentially consists of a series of nested loops that must be executed for a fixed number of iterations. Typically, trigonometric operations, such as sine and cosine, are executed in the inner loop to establish the mapping between the reconstructed image and the projection data. Algorithm performance depends critically on the ability to perform such operations fast.
The Cell and GPU have different characteristics with respect to these algorithm properties. The memory bandwidth limitation is much stronger for the Cell than for the GPU platform. Although the on-chip memory that the SPEs can access is much larger than the local cache available to the GPU SIMD units, this local cache cannot be employed to reduce memory access, as every forward projection and back projection requires the whole data set to be processed. The fact that algorithm parallelization can result in a large number of independent threads has more impact on a GPU implementation than on a Cell implementation, as the number of processing units on the GPU is much higher. Data locality is an important algorithm property on both platforms. Even though the instruction set implemented by GPUs is more limited than the operations offered by the Cell, they fit very well with the requirements of iterative tomography. Examples of operations often performed during a projection include trigonometric functions and float-to-integer conversions. These are implemented efficiently in GPU hardware.
GPUs have already been used for tomography computations by several research groups [6,7]. Typically, shader languages such as 'Cg' have been used for this purpose. Mueller et al. successfully implemented an iterative reconstruction algorithm (SART). Part of their solution consists of working around the limited accuracy caused by the hardware's representation of numbers as eight-bit color channel intensities. To increase accuracy, some operations had to be performed on the CPU. This approach of using shader languages has several drawbacks. It hides a number of hardware features from the programmer, such as on-board shared
memory and flexible memory write operations. It is nearly impossible to efficiently use multiple GPUs for one reconstruction. The fact that images have to be considered as textures by the programmer, on which a sequence of graphics operations is performed, imposes the need for a rather artificial translation step between the actual problem model and its representation on the GPU. The NVIDIA CUDA platform, released in 2007, allows for more general access to the GPU hardware by exposing the underlying SIMD architecture. The GPU can be programmed in a C-like language, which also facilitates access to a local cache. Using multiple graphics cards is straightforward as well. As the GPU architecture matches well with all four tomography algorithm properties listed above, it forms a highly attractive platform for tomography. The limited bandwidth and lower number of computation units of the Cell make the platform less suitable for iterative tomography algorithms. In [8], we implemented an iterative tomography algorithm on the Cell architecture, obtaining a speedup of 6.5× compared to a sequential CPU implementation. The next part of this paper focuses on our GPU implementation of tomography algorithms. As will be shown in Section 6, much higher speedups can be realized on the GPU platform.
4 Implementation Details
We implemented the SIRT algorithm, including the forward projection and backprojection operations, using the NVIDIA CUDA platform. As a complete description of our implementation is beyond the scope of this paper, we highlight several design choices in this section.
– Coalesced memory access. For optimal performance, global memory access should always be coalesced, meaning that the data should be accessed in aligned, well-ordered blocks. If this requirement is not met, memory operations are performed on a thread-by-thread basis, increasing the access time by up to a factor of 16. The data locality present in tomography problems can be exploited to obtain fully coalesced memory access, both for the projection data and the image data. Even if memory accesses are performed coalesced, latency is still very high. Typical values are 400 to 600 cycles, predicted to rise in the future. This fits with the trend seen in the field of computer architecture, often named the 'memory wall'. NVIDIA GPUs implement latency hiding by stalling threads that are waiting for a memory operation to finish and replacing them with active threads. In our implementation, several thousands of threads are created to benefit from latency hiding.
– Avoiding memory access collisions. GPUs do not offer mutex functionality, which introduces the risk of memory-write collisions: multiple threads that write to the same location in memory simultaneously, yielding undefined results. To eliminate all write collisions, forward projections are performed ray driven, where each thread is assigned a location in the projection data buffer. Back projections are performed voxel driven; in this case, a number of pixels are updated exclusively by one particular thread. Figures 2(a) and 2(b) visualize this approach; a simplified kernel sketch is given after this list.
Fig. 2. (a): Every thread is attached to a detector pixel during forward projecting; (b): Backwards projecting is accomplished by attaching every thread to a voxel in the data set
– Avoiding CPU-GPU and inter-GPU communication. A final consideration relates to the reduction of CPU-GPU communication. Even though the calculation of the difference between the measured and computed projections is not much faster on the GPU than on the CPU (and therefore performing it on the GPU might seem unnecessary), keeping it on the GPU reduces traffic over the PCIe bus considerably. This communication would become a major bottleneck if the difference calculation were performed on the CPU. This constraint becomes more severe when multiple GPUs are installed in one PC, a situation we will discuss in the next section.
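The voxel-driven backprojection can be sketched as follows for a single 2D slice. This is an illustrative kernel, not the authors' code; it assumes the per-angle sines and cosines have been precomputed on the host and that the detector has n_det bins with the rotation center mapped to bin n_det/2.

    // Voxel-driven backprojection for one 2D slice (parallel beam geometry).
    // Each thread accumulates the update for exactly one pixel, so no two
    // threads ever write to the same memory location.
    __global__ void back_project_slice(const float *sino,    // n_angles x n_det residuals
                                       float *update,        // width x height output
                                       const float *cos_t, const float *sin_t,
                                       int n_angles, int n_det,
                                       int width, int height) {
        int px = blockIdx.x * blockDim.x + threadIdx.x;
        int py = blockIdx.y * blockDim.y + threadIdx.y;
        if (px >= width || py >= height) return;

        // pixel coordinates relative to the rotation center
        float x = px - 0.5f * width;
        float y = py - 0.5f * height;

        float sum = 0.0f;
        for (int a = 0; a < n_angles; ++a) {
            // detector coordinate t = x*cos(theta) + y*sin(theta)
            float t  = x * cos_t[a] + y * sin_t[a] + 0.5f * n_det;
            int   t0 = (int)floorf(t);
            float w  = t - t0;
            if (t0 >= 0 && t0 + 1 < n_det)   // linear interpolation between two bins
                sum += (1.0f - w) * sino[a * n_det + t0] + w * sino[a * n_det + t0 + 1];
        }
        update[py * width + px] = sum;
    }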
5 FASTRA: Eight-GPU Desktop Super Computer
As will be shown in Section 6, our GPU implementation of the Forward Projection and Back Projection operations yields a considerable speedup compared to our sequential CPU implementation (over 40×). Still, attaining a speedup that is again an order of magnitude larger remains highly desirable. Here, we list some of the reasons to strive for a much higher speedup:
– Problem size. If the size of the 2D detector doubles in both dimensions, the size of the reconstructed volume increases by a factor of eight, whereas the number of required projections typically also doubles, thereby increasing the running time of the basic tomography operations by a factor of 16. In recent years, the detector size has steadily increased, up to the point where detectors of 4096×4096 pixels are now common in many applications. To deal with such huge data requirements, even a speedup of 40 is not sufficient.
– Rapid prototyping. Development of new reconstruction algorithms requires considerable parameter-tuning and testing. Even for small data sets, reconstructing the data sets many times, varying algorithm parameters, is highly time-consuming.
– Real-time applications. Certain tomography applications, such as airport luggage scanning, require real-time reconstruction. To achieve real-time reconstruction at an acceptable image resolution, an additional speedup is required.
A straightforward way to obtain an additional large speedup would be the use of a large supercomputer cluster. For the case of parallel projections, distributing the computation across a large number of nodes is straightforward: the volume can be partitioned into a stack of 2D slices that can each be reconstructed independently. A GPU cluster, where each node contains a GPU, would therefore yield an expected speedup that is almost linear in the number of nodes. Alternatively, a CPU cluster can be used, containing hundreds of CPUs. Both approaches (GPU cluster, CPU cluster) are not ideally suited for use in tomography practice. Contrary to many problems in scientific computing that are currently solved by supercomputers, tomography can be considered as an application that is routinely used by a large user base. Effectively, this would require a cluster for every tomography scanner, which is simply not feasible. Even for scientific use, where access to a supercomputing facility is commonly available, reserving full CPU time for this single application is not desirable.
To deal with this problem, we considered the use of multiple GPUs within a single workstation. By combining four dual-GPU NVIDIA 9800GX2 graphics cards, it is possible to have eight GPUs within a single workstation. Figure 3 shows the special workstation we developed for accelerating tomography on the GPU, called FASTRA. We refer the reader to [2] for a complete description of the system. As individual slices can be distributed among the different GPUs, each working independently, the expected speedup for large data sets is almost linear in the number of GPUs. As we will show in Section 6, this linear speedup can also be observed in real experiments.

Fig. 3. FASTRA

Our special workstation, which could also be considered as a desktop supercomputer, offers the following advantages compared to larger cluster solutions:
– Supercomputer performance for this application. As the speedup obtained using a single GPU is already 40×, a speedup of over 300× can be obtained using the FASTRA system, compared to a single CPU core. Therefore, performance similar to a moderately sized CPU cluster, consisting of 100s of CPU cores, can be achieved.
– Affordable. Because the system consists of consumer-level hardware, the system costs less than $5000.
– Mobile. As all eight GPUs are combined in a single system, contrary to a cluster solution, the FASTRA system can be easily moved.
– Energy efficient. The power consumption of the entire system does not exceed 1500 W.
The next section reports on experimental results comparing our multi-GPU workstation with a large supercomputer cluster.
6 Experimental Results
In this section, we present experimental results comparing our CUDA implementation of SIRT with an existing CPU implementation that has been in active use for several years. Given FASTRA's theoretical computation speed of 4 TeraFLOPS, we expected its performance to be comparable to that of an actual supercomputer cluster. The University of Antwerp has a moderately sized cluster, CalcUA [3], consisting of 512 Opteron nodes with a peak performance of 2 TeraFLOPS. Slices of 1024 × 1024 voxels were reconstructed from projection data acquired from 1024 projection directions, recorded with 1024 detector pixels. Optimized CPU code running on the cluster reconstructs 512 slices of this data set in 67.4 seconds. Using our implementation, FASTRA performs this reconstruction in 59.9 seconds; see Figure 4(a). A single core of a 2.40 GHz Intel Quad Core CPU requires just over 40 minutes for the same reconstruction, which corresponds to a speedup of about 40 with respect to a recent processor. Figure 4(b) shows how the computation scales with the number of GPUs. Since the bandwidth between the CPU and the PCIe cards is fixed and shared by all graphics cards, a large number of GPUs installed on one motherboard reduces the available per-GPU communication bandwidth considerably. By measuring the reconstruction speed when only a limited number of GPUs is employed, the impact of this reduced communication bandwidth can be assessed. As our results show, the reconstruction speed depends linearly on the number of GPUs. This can be explained by the very limited amount of communication required during a reconstruction.
Fig. 4. (a): FASTRA is able to perform a reconstruction in less time than CalcUA, the supercomputer cluster of the University of Antwerp; (b): The reconstruction speed (iterations per minute) scales linearly with the number of GPUs used during the reconstruction
7 Conclusions
Tomography is an important technique for 3D imaging. In many cases, iterative reconstruction algorithms yield more accurate reconstructions than direct methods, yet at a high computational cost. Computation time often limits their practical applicability. The basic operations of iterative tomography algorithms, forward projection and back projection, are well suited for parallel execution. Both the Cell-BE and the GPU offer high computation speeds and fast memory access. Combined with their low price, this makes them attractive candidates for speeding up iterative reconstruction methods. We argued that GPUs are better suited than the Cell for performing iterative tomography. In general, similar arguments also hold for other applications that require large data sets to be processed with relatively simple operations. In such cases memory throughput forms the main bottleneck, and it is considerably higher for GPUs than for the Cell. We demonstrated a substantial speedup for the iterative SIRT algorithm using GPUs. FASTRA, a workstation containing eight GPUs, computes a reconstruction in less time than a moderately sized cluster containing hundreds of CPUs.
References
1. Kak, A.C., Slaney, M.: Principles of Computerized Tomographic Imaging, chapter: Algorithms for Reconstruction with Non-diffracting Sources, pp. 49–112. IEEE Press, New York (1988)
2. FASTRA GPU SuperPC (2008), http://fastra.ua.ac.be
3. Core Facility CalcUA (2008), http://www.calcua.ua.ac.be
4. Feldkamp, L.A., Davis, L.C., Kress, J.W.: Practical cone-beam algorithm. Journal of the Optical Society of America A: Optics, Image Science, and Vision 1(6), 612–619 (1984)
5. NVIDIA Corporation: NVIDIA CUDA Compute Unified Device Architecture, Programming Guide Version 1.0 (June 2007)
6. Xu, F., Mueller, K.: Real-time 3D computed tomographic reconstruction using commodity graphics hardware. Physics in Medicine and Biology 52, 3405–3419 (2007)
7. Mueller, K., Xu, F., Neophytou, N.: Why do commodity graphics hardware boards (GPUs) work so well for acceleration of computed tomography? In: SPIE Electronic Imaging (2007)
8. van der Maar, S.: Tomography mapped onto the Cell Broadband Processor. Master's thesis, Universiteit Leiden, The Netherlands (August 2007)
9. Gschwind, M.: The Cell Broadband Engine: exploiting multiple levels of parallelism in a chip multiprocessor. Int. J. Parallel Program. 35(3), 233–262 (2007)
Realizing FIFO Communication When Mapping Kahn Process Networks onto the Cell
Dmitry Nadezhkin, Sjoerd Meijer, Todor Stefanov, and Ed Deprettere
Leiden Institute of Advanced Computer Science, Leiden University, Niels Bohrweg 1, 2333CA Leiden, The Netherlands
{dmitryn,smeijer,stefanov,edd}@liacs.nl
Abstract. Kahn Process Networks (KPNs) are an appealing model of computation for specifying streaming applications. When a KPN has to execute on a multi-processor platform, a mapping of the KPN model to the execution platform model should mitigate all possible overhead introduced by the mismatch between the primitives realizing the communication semantics of the two models. In this paper, we consider mappings of KPN specifications of streaming applications onto the Cell BE multi-processor execution platform. In particular, we investigate how to realize the FIFO communication of a KPN on the Cell BE in order to reduce the synchronization overhead. We present a solution based on token packetization and show the performance results of five different streaming applications mapped onto the Cell BE.
Keywords: Models of Computation, Kahn Process Networks, distributed FIFO communication, the Cell BE platform.
1 Introduction
One of the driving forces that motivated the emergence of multi-processor systems on chip (MPSoCs) originates from the complexity of modern applications [1]. Many applications are specified with complex block diagrams that incorporate multiple algorithms. Such applications are called heterogeneous. The emergence of heterogeneous applications led to the design of heterogeneous MPSoC architectures, which provide improved performance by executing the different algorithms that are part of an application on optimized/specific processing components of an MPSoC. However, heterogeneous MPSoCs are very hard to program efficiently, and it is still not clear how this can be done in a systematic and possibly automated way. It is a common belief that the key to solving the programming problem is to use parallel models of computation (MoCs) to specify applications [2]. This is because the structure and execution semantics of parallel MoCs match those of MPSoCs: a parallel MoC consists of tasks that can execute in parallel, and an MPSoC consists of processing components that run in parallel. Nevertheless, in many cases there is a mismatch between the communication semantics of a MoC and the communication infrastructure available in an MPSoC.
Therefore, a major issue when programming an MPSoC is to figure out how to bridge this mismatch, i.e., how to realize the communication semantics of a MoC using the available communication infrastructure of the MPSoC. Unfortunately, a solution to this mismatch is specific to a given MPSoC platform and MoC. In this paper, we share our experience in bridging the mismatch between the communication semantics of the Kahn Process Network (KPN) model of computation and the communication infrastructure of the Cell BE platform. The Cell BE platform [3] is a good representative of state-of-the-art heterogeneous MPSoC platforms. It has a PowerPC host processor (PPE) and a set of eight computation-specific processors, known as synergistic processing elements (SPEs). The memory subsystem offers a private memory for each SPE and a global memory space; only the PPE has direct access to the global memory, while each SPE accesses it through its accompanying Memory Flow Controller. The processors and I/O interfaces are connected by the coherent interconnect bus, which is a synchronous communication bus. The KPN model is a good representative of the class of dataflow models used to specify streaming applications. A KPN is a graph in which the nodes are active entities (processes, tasks or threads) that communicate point-to-point over FIFO channels. Most dataflow models are of the same nature, i.e., they consist of tasks connected by FIFO channels. The mismatch mentioned earlier is illustrated by the example in Figure 1, where a KPN consisting of 7 processes and 7 FIFO channels is mapped onto the Cell BE platform. Processes P1, P2 and P7 are mapped onto the PPE, and processes P3 to P6 are mapped onto the SPEs. The FIFO communication channels have to be mapped onto the Cell BE communication, synchronization and storage infrastructure. On the one hand, the semantics of FIFO communication is very simple: the producer and consumer processes in a producer/consumer pair interact asynchronously with the communication channel they are connected to, and synchronization is by means of blocking read/write. On the other hand, in the Cell BE platform the processors are connected to a synchronous communication bus and there is no specific hardware support for blocking FIFO communication. Therefore, the KPN communication model and the Cell BE communication infrastructure do not match.
Fig. 1. A 6-process dataflow network mapped onto the Cell BE platform
The KPN FIFO channels have to be realized using the private memory of an SPE and/or the global memory, together with the Cell BE-specific synchronization methods, which may be costly in terms of communication latency. The challenge is how to do this in the most efficient way, i.e., how to minimize the communication latency. In the following section, we give a survey of related work. In Section 3 we consider the particular issues of realizing the FIFO communication semantics of the KPN model of computation on the Cell BE platform. Section 4 illustrates our realization of the FIFO communication channels on the Cell BE platform. In Section 5, we show some experiments with real-world applications. Section 6 concludes the paper.
2 Related Work
In a similar work [4], KPNs have been mapped onto the Intel IXP processor. The IXP, however, has hardware support for FIFO buffers, and no optimizations have been applied to reduce communication latencies. Another model-based project that is similar to our approach to programming the Cell BE platform is the architecture-independent stream-oriented language StreamIt [5], which shares some properties with the Synchronous DataFlow (SDF) [6] model of computation. The Multicore Streaming Layer (MSL) [7] framework realizes the StreamIt language on the Cell BE platform, focusing on automatic management and optimization of communication between cores. All data transfers in the MSL are explicitly controlled by a static scheduler, and thus synchronization in the FIFO communication is not an issue. However, this approach is limited to applications that can be specified with SDF. Our approach is more general, as a broader class of applications can be specified with KPNs than with SDF, at the cost of introducing blocking read and write FIFO primitives, which ensure that processes block if data is not available or cannot be written. The introduced synchronization becomes an issue, which we tackle in this paper. As to the low-level communication on the Cell, the MPI-like Cell Messaging Layer [8] library is guided by an idea similar to ours, i.e., receiver-initiated communication. However, that library offers only low-level send and receive primitives and does not focus on realizing a FIFO abstraction.
3 Issues of Mapping KPNs onto the Cell BE Platform
In mapping KPN processes onto the processing elements of the Cell BE platform, different assignment options are possible: each processor can host one or more KPN tasks. For the PPE processor, which has two hardware threads and runs a multitasking operating system, a threading library can be used to host several KPN tasks. Although multitasking on the SPEs is also possible, in practice it is inefficient because context switching is very expensive: all code and data have to be saved to global memory on every switch. Thus, in this paper we consider that only one KPN process is assigned to each SPE processor.
Given the considerations above, there is a variety of mapping strategies, which lead to different types of FIFO communication channels. For example, in Figure 1 processes P1 (producer) and P2 (consumer) are mapped onto the PPE, and we say that the FIFO channel connecting them is of PPE-to-PPE type. If the producer and the consumer are one and the same process mapped onto an SPE (like process P3 in Figure 1), then we refer to the FIFO channel as being of SPE-to-self type. Similarly, we identify PPE-to-self, SPEi-to-SPEj, PPE-to-SPE, and SPE-to-PPE types of FIFO communication channels. All of them require different implementations, as different components of the Cell BE platform are involved. Thus, we identify the following classes of FIFO channels, classified by connection type: a) class self (PPE-to-self and SPE-to-self), b) class intra (PPE-to-PPE), and c) class inter (SPEi-to-SPEj, PPE-to-SPE and SPE-to-PPE). The first two classes of FIFO channels are easy to implement efficiently, as FIFOs of these classes are realized using just local memories and synchronization primitives. We will not discuss their detailed implementation in this paper; however, we briefly explain the realization. In the class self, the FIFO channel connects a process with itself. Since there is only one thread of control, access to the FIFO is ordered and therefore no special synchronization is required. In the class intra, where the producer and consumer processes are mapped onto the PPE, a FIFO channel is a shared resource in shared memory to which mutually exclusive access is applied. We rely on the pthread library to implement this producer/consumer communication (a sketch is given below). In this paper we focus on the class inter FIFO channels, which connect producer and consumer processes mapped onto two different processing elements of the Cell BE platform. The first issue to be addressed is where the memory buffer of a FIFO has to reside. The Cell BE platform provides two memory storages; thus, the buffer can reside in global memory or be distributed partly between the private memories of the producer and consumer processes. The drawback of the former approach lies in the presence of a shared component, which must be accessed in a mutually exclusive manner. For example, an SPE process connected to a class inter FIFO would not only have to compete for the memory resource, but also move the data from the global storage to its local memory prior to computation. The implication is an enormous synchronization overhead, making the performance of this approach no better than that of the sequential version of an application. When the memory buffer of a FIFO channel is distributed between the private memories of the producer and consumer processes, the issue is how to implement the FIFO semantics over the distributed memory buffer such that it does not mask the performance benefits of going distributed. The FIFO semantics is realized by means of Direct Memory Access (DMA) transfers and synchronization messages between the producer and consumer processes. The more communication dominant a KPN is, the more synchronization overhead is generated, which can lead to a performance penalty. Therefore, the issue is to minimize the number of data transfers over the distributed FIFO channels as much as possible.
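For the class intra case, a standard bounded buffer protected by a mutex and condition variables already provides the required blocking read/write semantics. The following sketch is illustrative only, not the code generated by our tool; it assumes a single producer, a single consumer and fixed-size tokens.

/* Illustrative realization of a class intra (PPE-to-PPE) FIFO channel:
 * a bounded buffer with blocking read/write built on the pthread library. */
#include <pthread.h>
#include <string.h>

#define FIFO_CAPACITY 16
#define TOKEN_SIZE    64          /* bytes per token, application dependent */

typedef struct {
    unsigned char   buf[FIFO_CAPACITY][TOKEN_SIZE];
    int             head, tail, count;
    pthread_mutex_t lock;
    pthread_cond_t  not_empty, not_full;
} intra_fifo_t;

void intra_fifo_init(intra_fifo_t *f)
{
    f->head = f->tail = f->count = 0;
    pthread_mutex_init(&f->lock, NULL);
    pthread_cond_init(&f->not_empty, NULL);
    pthread_cond_init(&f->not_full, NULL);
}

/* blocking write: blocks while the FIFO is full */
void intra_fifo_write(intra_fifo_t *f, const void *token)
{
    pthread_mutex_lock(&f->lock);
    while (f->count == FIFO_CAPACITY)
        pthread_cond_wait(&f->not_full, &f->lock);
    memcpy(f->buf[f->tail], token, TOKEN_SIZE);
    f->tail = (f->tail + 1) % FIFO_CAPACITY;
    f->count++;
    pthread_cond_signal(&f->not_empty);
    pthread_mutex_unlock(&f->lock);
}

/* blocking read: blocks while the FIFO is empty */
void intra_fifo_read(intra_fifo_t *f, void *token)
{
    pthread_mutex_lock(&f->lock);
    while (f->count == 0)
        pthread_cond_wait(&f->not_empty, &f->lock);
    memcpy(token, f->buf[f->head], TOKEN_SIZE);
    f->head = (f->head + 1) % FIFO_CAPACITY;
    f->count--;
    pthread_cond_signal(&f->not_full);
    pthread_mutex_unlock(&f->lock);
}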
4 Solution Approach
Our approach to minimizing the number of DMA data transfers is based on packetizing tokens: a number of tokens are grouped into a single packet, which is transferred as one DMA transfer. Packetizing decreases the number of DMA data transfers, or, in other words, the number of synchronizations. Determining the packet size becomes a very important issue and, as will be shown, it depends on how the DMA data transfers are initiated. It will also be shown that in some cases an incorrect packet size may lead to a deadlock. Before determining the size of a packet, we need to consider in detail the possible protocols for realizing the FIFO semantics over the distributed memory buffer. As all FIFO channels we consider are point-to-point, tokens can be transferred either in a data-driven or in a data-demand fashion. The former case follows a push strategy, in which the producer initiates a data transfer as soon as it has produced data, whereas the latter case follows a pull strategy, in which the consumer initiates a data transfer as soon as it requires data. The two strategies are shown in Figure 2, where the numbered circles indicate the order of the synchronization messages and the DMA data transfer; nodes P and C represent the producer and the consumer, respectively. We explain both strategies in detail, discuss their pros and cons, and motivate our choice. In the push strategy, depicted in Figure 2a, the producer first makes a write request as soon as one or more tokens have been produced (1). Then, the consumer transfers the data with a DMA (2) and sends a notification message to the producer (3). Thus, two synchronization messages are required to complete one DMA data transfer. Packetizing of tokens happens on the producer side, and the size of the packet must be known before the data transfer. However, a wrongly computed size may result in a deadlock for some network topologies. Consider, for example, Figure 3. The network consists of 3 tasks (P1, P2 and P3) and 3 FIFO channels (F1, F2 and F3). Assume that for channel F1 the packet size which guarantees deadlock-free execution of the network equals 3, and that we change the packet size of F1 to 4 tokens. When P1 has generated 3 tokens, instead of sending them to P2, P1 continues to produce new tokens to fill the packet, reading from its input channel F3. Process P2 cannot proceed, as the packet from P1 has not been sent (the packet is not complete) and the data is not available. Similarly, process P3 gets blocked reading data from P2, and thus it cannot produce the token P1 needs. The network is in a deadlock.
P
DMA
2
Notify 3
(a) push
Request 1
C
P
DMA
2
C
Ack 3
(b) pull
Fig. 2. Push and pull strategies for class inter FIFO channels
Fig. 3. An example of a deadlock in the push strategy
Hence, for the push strategy, a safe packet size for all FIFO channels should be computed at compile time.
The pull strategy for realizing the FIFO semantics over the distributed memory buffer is composed of the following three steps, shown in Figure 2b:
1. Read request (1). The consumer first tries to read from its local buffer. If this buffer does not contain the required data, it sends a request message to the producer and blocks on reading the acknowledgement message from the producer. The request message contains the maximum number of tokens the consumer can accept.
2. Data transfer (2). The producer that receives the read request can either be blocked on writing to its local storage or be busy executing a function. If it is blocked, it serves the request immediately; if it is executing, it serves the request immediately after execution. In either case the producer handles the request and transfers all tokens it has available for the consumer as one packet by means of a DMA transfer.
3. Acknowledgement (3). The producer notifies the consumer after completion of the data transfer by issuing a message containing the total number of tokens transferred as one packet in the previous step.
A sketch of the consumer side of this protocol is given at the end of this section. In the pull strategy, two synchronization messages are likewise required for every DMA data transfer, and the size of the packet to be transferred is computed dynamically in step (2) of the protocol given above. The only way to control the dynamic packetizing is by setting the size of the memory buffer: the larger the buffer, the larger the packet that can be assembled. Since the consumer gets the data as soon as it is available, a deadlock is impossible in the pull strategy. Both strategies have their own advantages and disadvantages. On the one hand, in the push strategy, the required computation of packet sizes at compile time is not always possible for an arbitrary KPN, because the rates of production and consumption of tokens in the channels might not be known at compile time. In the pull strategy, the computation of packet sizes is dynamic, i.e., performed at run time, and hence always possible. On the other hand, the push strategy is predictable, i.e., packet sizes are known at compile time and this information can be used to reason about performance, whereas in the pull strategy such reasoning is not possible. Moreover, in some networks the dynamically computed packet sizes may never be larger than one token; a simple example is a producer/consumer pair where the rates of token production and consumption are equal. Regarding the synchronization overhead, both strategies require the same number of synchronization messages for a single DMA data transfer.
Based on the above comparison between the push and the pull strategies, we have chosen the pull strategy, as it is generic and can be used for any KPN.
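To make the pull protocol concrete, the sketch below outlines the consumer side of a class inter FIFO read. It is illustrative only and not the code generated by our tool: the helpers send_read_request(), wait_for_ack() and copy_token_from_local_buffer() are hypothetical placeholders for the Cell BE mailbox/signal mechanisms and the producer-side DMA transfer.

/* Illustrative consumer side of the pull strategy for a class inter FIFO. */
#define LOCAL_FIFO_CAPACITY 32

typedef struct {
    int tokens_available;               /* tokens already in the local store */
    int read_pos;                       /* next token to hand to the task    */
    /* ... local buffer, producer identifier, etc. ... */
} inter_fifo_t;

/* hypothetical: send a request message stating how many tokens fit locally */
extern void send_read_request(inter_fifo_t *f, int max_tokens);
/* hypothetical: block until the producer's acknowledgement arrives; returns
 * the number of tokens the producer transferred as one packet */
extern int  wait_for_ack(inter_fifo_t *f);
/* hypothetical: copy one token out of the local buffer */
extern void copy_token_from_local_buffer(inter_fifo_t *f, void *token);

/* blocking read on the consumer side (steps 1-3 of the pull protocol) */
void inter_fifo_read(inter_fifo_t *f, void *token)
{
    if (f->tokens_available == 0) {
        int room = LOCAL_FIFO_CAPACITY;           /* step 1: read request    */
        send_read_request(f, room);
        f->tokens_available = wait_for_ack(f);    /* steps 2-3: DMA and ack  */
    }
    copy_token_from_local_buffer(f, token);
    f->tokens_available--;
}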
5 Experimental Evaluation
In this section we present several experiments with KPNs mapped onto the Cell platform. The main goal is to show the impact of token packetizing on the synchronization overhead induced in the class inter FIFO channels when using the pull strategy. To carry out the experiments, we have developed a tool named Leiden Cell C-code Generator (LCCG), which maps a KPN specification onto the Cell BE platform in an automated way. The tool accepts KPNs generated by the pn and Compaan compilers [9,10]. Given a KPN specification and a mapping file in which processes are assigned to processors, LCCG generates C code for the PPE and SPE processor elements, as well as specific FIFO read and write primitives for each type of communication channel. After running the LCCG tool, the generated code is compiled on the PlayStation 3 platform with IBM's XLC compiler using the libspe2 library.
Experiment: JPEG Encoder
In this experiment we map a JPEG encoder application onto the Cell BE platform. The encoder takes a stream of frames of 512 × 512 pixels and applies the JPEG algorithm to these frames. The KPN is depicted as a graph in Figure 4. It consists of 7 processes and 15 FIFO channels. Each task of the application corresponds to a node in the graph; every channel is annotated with the name of the data structure that specifies a token and with the FIFO size that guarantees deadlock-free execution of the network. We map the computationally intensive processes DCT, Q and VLE on different SPEs, whereas the other processes are mapped onto the PPE.
Fig. 4. KPN specification of the JPEG encoder
For this application, FIFO buffer sizes of 1 already give a deadlock-free network, which means that we can observe the effect of token packetizing by increasing the buffer sizes. Therefore, we run the KPN with four different configurations, using FIFO buffer sizes of 1, 16, 32 and 48 tokens. The columns in Figure 5a depict the distribution of the time the DCT, Q and VLE tasks spend computing, stalling and communicating; they show how much time the processes spend on real computation and thus also how much time is spent on communication overhead. While stalling, a process is waiting for synchronization messages from other processes, which reflects the synchronization overhead. In the communicating phase, a process is transferring the actual data. The first 3 bars in Figure 5a correspond to the configuration with all buffer sizes set to one token; the remaining bars show the results of the configurations with larger buffer sizes, illustrating the effect of token packetizing. We observe a redistribution between the computation and stalling fractions in all tasks: the stalling parts decrease, while the computation parts increase. Thus, packetizing decreases the synchronization overhead. The overall performance of the different versions is depicted in Figure 5b. We observe that as the processors spend less time in synchronization, the performance increases.
Fig. 5. Results of experiments with JPEG encoder: a) distribution of times the DCT, Q and VLE processes of JPEG encoder spend in computation, stalling and communication for non-packetized and packetized versions; b) throughput of JPEG encoder with different FIFO sizes
Other Experiments
In further experiments we investigated the benefits of packetizing for applications with different computation-to-communication ratios. For that purpose, we mapped the JPEG2000, MJPEG, Sobel and Demosaic applications onto the Cell BE. The first two applications have coarse-grained computation tasks, while the latter two are communication dominant. For each application, we compared the throughput of the sequential version running on the PPE with that of two parallel versions: the first with the minimum buffer sizes that guarantee a deadlock-free network, i.e., without packetizing, and the second with larger buffer sizes that allow packetizing. The results are depicted in Figure 6; the y-axis shows the throughput in Mbit/s on a log scale. For all applications, the packetized versions perform better than the non-packetized ones.
Fig. 6. Throughput comparison of sequential, non-packetized and packetized versions of JPEG2000, MJPEG, Sobel, and Demosaic applications
As JPEG2000 and MJPEG are characterized by coarse-grained tasks, the communication overhead is insignificant, and we see that the parallel versions are faster than the sequential version in all cases except the non-packetized MJPEG version. The Sobel and Demosaic kernels have very lightweight tasks; thus, the introduced inter-processor communication and synchronization overhead are more costly than the computation itself. This is the reason why the columns of the third and fourth experiments in Figure 6 show a significant slow-down compared to the sequential application. The conclusion is that fine-grained parallelization of applications on the Cell BE platform using FIFO communication should be avoided.
6 Conclusion
In this paper, we presented a solution for bridging the mismatch between the KPN communication model and the communication primitives of the Cell BE platform. The absence of hardware support for FIFO communication in the Cell BE makes reading from and writing to FIFO channels expensive operations. We have investigated several approaches to realizing the FIFO communication on the Cell. As a result of our investigation, we selected the pull strategy, which is based on packetizing of tokens. The experimental results show that this approach always gives better performance than a network without packetization. All types of FIFO communication channels have been implemented in a tool which automatically generates C code for the KPN tasks and FIFO channels, or, in other words, automatically maps a KPN specification onto the Cell BE platform.
Acknowledgments. We would like to thank Bart Kienhuis and Hristo Nikolov for the useful discussions on this paper and its results.
References
1. Wolf, W., Jerraya, A.A., Martin, G.: Multiprocessor System-on-Chip (MPSoC) Technology. IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems 27(10) (2008)
2. Martin, G.: Overview of the MPSoC design challenge. In: DAC 2006: Proceedings of the 43rd Annual Conference on Design Automation, pp. 274–279. ACM, New York (2006)
3. Kahle, J.A., Day, M.N., Hofstee, H.P., Johns, C.R., Maeurer, T.R., Shippy, D.: Introduction to the Cell multiprocessor. IBM J. Res. Dev. 49(4/5), 589–604 (2005)
4. Meijer, S., Kienhuis, B., Walters, J., Snuijf, D.: Automatic partitioning and mapping of stream-based applications onto the Intel IXP network processor. In: SCOPES 2007: Proceedings of the 10th International Workshop on Software & Compilers for Embedded Systems, pp. 23–30. ACM, New York (2007)
5. Thies, W., Karczmarek, M., Amarasinghe, S.P.: StreamIt: A language for streaming applications. In: Horspool, R.N. (ed.) CC 2002. LNCS, vol. 2304, pp. 179–196. Springer, Heidelberg (2002)
6. Lee, E.A., Messerschmitt, D.G.: Static scheduling of synchronous data flow programs for digital signal processing. IEEE Trans. Computers 36(1), 24–35 (1987)
7. Zhang, X.D., Li, Q.J., Rabbah, R., Amarasinghe, S.: A lightweight streaming layer for multicore execution
8. Pakin, S.: Receiver-initiated message passing over RDMA networks. In: IEEE International Symposium on Parallel and Distributed Processing (IPDPS 2008), pp. 1–12 (April 2008)
9. Verdoolaege, S., Nikolov, H., Stefanov, T.: pn: a tool for improved derivation of process networks. EURASIP Journal on Embedded Systems, Special Issue on Embedded Digital Signal Processing Systems 2007 (2007)
10. Kienhuis, B., Rijpkema, E., Deprettere, E.F.: Compaan: Deriving Process Networks from Matlab for Embedded Signal Processing Architectures. In: Proc. 8th International Workshop on Hardware/Software Codesign (CODES 2000), San Diego, CA, USA, May 3-5 (2000)
Exploiting Locality on the Cell/B.E. through Bypassing
Pieter Bellens1, Josep M. Perez1, Rosa M. Badia1,3, and Jesus Labarta1,2
1 Barcelona Supercomputing Center, Spain
2 Universitat Politecnica de Catalunya, Spain
3 Consejo Superior de Investigaciones Cientificas, Spain
Abstract. Cell Superscalar (CellSs) provides a simple, flexible and easy programming approach for the Cell Broadband Engine (Cell/B.E.) that automatically exploits the inherent concurrency of applications at a function or task level. The CellSs environment is based on a source-to-source compiler that translates annotated C or Fortran code, and a runtime library tailored for the Cell/B.E. that orchestrates the concurrent execution of the application. In the context of our parallel runtime we analyse the effect of the bandwidth of the Element Interconnect Bus (EIB) on an application's performance. We introduce a technique called bypassing that potentially increases the observed bandwidth and improves the execution time of applications with a distributed computation pattern. Although the integration of bypassing with CellSs is work in progress, we present results for five fundamental linear algebra kernels to demonstrate the applicability of bypassing and to attempt to quantify the benefit that can be reaped.
1 Introduction and Related Work
Multi-core architectures improve performance through parallelism instead of the traditional increase in clock speed or fancier superscalar pipeline designs. They typically amass powerful yet simple cores on a single chip to achieve a high performance-to-area ratio and to reduce power dissipation. But there is no such thing as a free lunch: this novel architecture brings up two equally challenging problems. This paper focuses on the Cell Broadband Engine (Cell/B.E.) as an exponent of the multi-core concept. The first issue concerns the interconnect that couples the processing elements. Performance is not determined by raw processing power but by memory bandwidth. Hence the success of multi-cores depends on the solutions they offer for overcoming the Memory Wall [1,2]. The increase in computational power puts even higher demands on the memory system regarding high bandwidth and low latency [3]. All cores share the memory bandwidth, aggravating the bottleneck compared to single-core processors [4]. Whereas the multi-core architecture represents an evolution in processor design, a revolution is required when it comes to memory design.
As Wulf [1] puts it: “We are going to hit a wall in the improvement of system performance unless something basic changes.” For multi-core processors the importance of this warning cannot be overstated. The interconnect of the Cell/B.E. is termed the Element Interconnect Bus (EIB) and has been detailed extensively in recent literature, ranging from architectural descriptions [5] to experimental evaluations [6,7]. This previous work demonstrates that special care must be taken in order to actually achieve the advertised bandwidth of the Cell/B.E.: theoretical peak capacity and operational reality are not always in accordance. For example, performance analyses of applications [8,9] on the Cell/B.E. invariably include techniques to overcome bandwidth limitations and improve execution time. This concern is not limited to specific applications but also manifests itself in more general tools such as compilers [10] that, e.g., incorporate support for a software cache. We are interested in the bandwidth characteristics of the Cell/B.E. and in particular their effect on overall performance. Secondly, there is the question of how to program such a complex processor, or rather of parallel programming in general. Established standards like OpenMP [11] and MPI [12] offer powerful and extensive functionality, but they are lacking in economics of use and the results are implementation-dependent. In this paper we use Star Superscalar (StarSs) [13] for developing parallel code. The incarnation tailored for the Cell/B.E. is called Cell Superscalar (CellSs) [14]. CellSs consists of a set of tools that assist in efficiently developing parallel applications for the Cell/B.E. The CellSs programming model hides the complexity of the architecture from the programmer and enables code written with a sequential execution model in mind to behave like parallel code at runtime. The burden of dealing with multiple threads, synchronisation, scheduling and data sharing shifts from the programmer to the CellSs runtime. The remainder of this paper is organised as follows. Section 2 introduces some architectural characteristics of the Cell/B.E. of interest to us. Our programming environment for the Cell/B.E., CellSs, is the subject of Section 3. Section 4 uses the implementation details of the Cell/B.E. and the CellSs runtime to argue for a runtime technique, bypassing, that circumvents bandwidth limitations and improves program performance. Section 5 establishes the validity of bypassing on a set of benchmarks from the field of linear algebra implemented with CellSs.
2 The Cell Broadband Engine
The Cell/B.E. (Figure 1) is a multi-core chip that consists of a PowerPC Processor Element (PPE), a 64-bit, 2-way multi-threaded, in-order PowerPC processor, and multiple Synergistic Processor Elements (SPEs), which are in-order, 128-bit wide SIMD cores. The ALUs are referred to as the PowerPC Processor Unit (PPU) and the Synergistic Processor Units (SPUs). The PPE and the SPEs are connected to the EIB, which also couples main memory (via the Memory Interface Controller, or MIC) and the I/O devices. The SPEs access main memory exclusively via DMA transfers by programming their individual Memory Flow Controllers (MFCs). Data and code reside in each SPE's 256 KB Local Store (LS). The Cell/B.E. is basically a single-chip MIMD machine.
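As a brief illustration of how an SPE programs its MFC, the following SPU-side sketch pulls a block of data from an effective address in main memory into the Local Store, processes it, and writes it back. It is a minimal sketch assuming the spu_mfcio.h interface of the Cell SDK (mfc_get, mfc_put and the tag-status calls); BLOCK_SIZE and process_block() are illustrative placeholders.

/* Minimal SPU-side DMA sketch (assumes the Cell SDK's spu_mfcio.h API). */
#include <spu_mfcio.h>

#define BLOCK_SIZE 16384                       /* multiple of 128 bytes   */

static volatile char block[BLOCK_SIZE] __attribute__((aligned(128)));

static void process_block(volatile char *data, unsigned int size)
{
    /* placeholder for the actual computation on the block */
    (void)data; (void)size;
}

void fetch_process_store(unsigned long long ea /* effective address */)
{
    const unsigned int tag = 1;

    /* DMA the block from main memory into the Local Store */
    mfc_get(block, ea, BLOCK_SIZE, tag, 0, 0);
    mfc_write_tag_mask(1 << tag);
    mfc_read_tag_status_all();                 /* wait for completion     */

    process_block(block, BLOCK_SIZE);

    /* DMA the result back to main memory */
    mfc_put(block, ea, BLOCK_SIZE, tag, 0, 0);
    mfc_write_tag_mask(1 << tag);
    mfc_read_tag_status_all();
}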
Fig. 1. The Cell Broadband Engine as seen from the Element Interconnect Bus
Fig. 2. Run time (a) and $WT_r$ (b) for BasicCellComp. The experiment was repeated for different degrees of bypassing (“bp”).
The SPU instruction pipeline sustains four single-precision floating-point multiply-accumulate operations per cycle. At a clock rate of 3.2 GHz this adds up to 204.8 GFLOPS for the 8 cores of the Cell/B.E. Each bus cycle an SPE can read and write 16 bytes from/to the EIB, but the peak bandwidth is bounded by the cache snooping protocol, topping off at 204.8 GB/s. Jimenez et al. [6] experimentally verify the EIB's bandwidth in different scenarios for the Cell/B.E. clocked at 2.1 GHz. The perceived EIB bandwidth can differ from the theoretical upper bound because of blocking and contention. The EIB consists of four unidirectional rings for data transfers, and each ring can carry three simultaneous non-overlapping transfers. The physical location of the bus elements is important: communications from SPE3 and SPE5 to SPE7 block transactions from SPE1 to SPE7. Communications between elements further apart are more expensive since they have to span more links, so data paths are more likely to overlap.
Contention occurs when EIB traffic is directed at a single EIB element. For example, SPE5, SPE7, SPE2 and SPE4 would overwhelm SPE6 with their maximum aggregate bandwidth of 102.4 GB/s if all of the former tried to communicate simultaneously with the latter. Bandwidth limitations can be countered by software caches and by double buffering. An SPE software cache reduces the number of communications by caching and reusing data inside the SPE, effectively lowering the degree of contention and blocking. Instead of trying to reduce the load on the EIB, one can also try to tolerate the resulting latency: double buffering does not reduce contention and blocking, but their effect on the computation is mitigated by overlapping communication with computation, thereby hiding the communication latency.
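The double-buffering pattern mentioned above can be sketched as follows on the SPU side: while the block in one buffer is being processed, the DMA for the next block is already in flight on a second tag. The sketch again assumes the spu_mfcio.h interface; BLOCK_SIZE, NUM_BLOCKS and process_block() are illustrative placeholders, and the write-back of results is omitted for brevity.

/* Double buffering on the SPU: overlap the DMA for block i+1 with the
 * computation on block i by alternating two buffers and two DMA tags. */
#include <spu_mfcio.h>

#define BLOCK_SIZE 16384
#define NUM_BLOCKS 64

static volatile char buf[2][BLOCK_SIZE] __attribute__((aligned(128)));

static void process_block(volatile char *data, unsigned int size)
{
    (void)data; (void)size;                /* placeholder computation */
}

void process_stream(unsigned long long ea /* start address in main memory */)
{
    int cur = 0;

    /* prefetch the first block on tag 0 */
    mfc_get(buf[0], ea, BLOCK_SIZE, 0, 0, 0);

    for (int i = 0; i < NUM_BLOCKS; i++) {
        int nxt = cur ^ 1;

        /* start the DMA for the next block on the other buffer/tag */
        if (i + 1 < NUM_BLOCKS)
            mfc_get(buf[nxt], ea + (unsigned long long)(i + 1) * BLOCK_SIZE,
                    BLOCK_SIZE, nxt, 0, 0);

        /* wait only for the current buffer's tag, then compute on it */
        mfc_write_tag_mask(1 << cur);
        mfc_read_tag_status_all();
        process_block(buf[cur], BLOCK_SIZE);

        cur = nxt;
    }
}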
3 Cell Superscalar
The CellSs environment consists of a compiler and a library that together implement a programming interface for the Cell/B.E. Basically, it offers a convenient way to convert standard (sequential) C or Fortran code into a parallel equivalent. The user adds pragmas to the original code to mark the functions (or tasks) intended to be executed on an SPE. At run time, CellSs executes the user code and internally organises the parallel execution: it tracks data dependencies, resolves them and schedules tasks onto the multiple cores. The main program of a CellSs application runs on the PPE, together with the CellSs PPE runtime library that orchestrates the execution and delegates the execution of tasks to the SPEs. The PPE library runs two separate threads, one of which (the master thread) executes the user application. It also renames arguments to avoid false dependencies and derives the task precedence from the remaining true dependencies. The tasks and the associated dependence information are visible to the other thread running on the PPE, the helper thread. In turn, the latter uses this dependence information to build the task dependence graph of the application. As the helper thread has global dependence information at its disposal, it can perform task scheduling, and it is in charge of the communication and synchronisation with the SPEs through callbacks. The CellSs SPE runtime library transfers arguments between main memory and the local store and executes the tasks. It implements a local software cache per SPE to reduce the number of data transfers, and double buffering to hide the DMA latency.
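To give an impression of this programming style, the fragment below annotates a block operation as a task in the spirit of the CellSs pragmas. The exact pragma spelling and clause names are shown only as an approximation of the real interface and should be treated as illustrative, not as a definitive reference for the CellSs syntax.

/* Illustrative CellSs-style task annotation (pragma syntax approximate).
 * The annotated function would be offloaded to an SPE; the runtime tracks
 * the dependencies between successive block operations automatically. */
#define BS 64                     /* block size, illustrative */

typedef float block_t[BS][BS];

#pragma css task input(a, b) inout(c)
void block_multiply(block_t a, block_t b, block_t c)
{
    for (int i = 0; i < BS; i++)
        for (int j = 0; j < BS; j++)
            for (int k = 0; k < BS; k++)
                c[i][j] += a[i][k] * b[k][j];
}

/* The main program simply calls the task; each call becomes a task
 * instance that the runtime schedules onto the SPEs asynchronously.
 * A, B and C are nb*nb arrays of blocks stored in row-major order. */
void matmul(int nb, block_t *A, block_t *B, block_t *C)
{
    for (int i = 0; i < nb; i++)
        for (int j = 0; j < nb; j++)
            for (int k = 0; k < nb; k++)
                block_multiply(A[i * nb + k], B[k * nb + j], C[i * nb + j]);
}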
4 Bypassing
Whether an application performs well on the Cell/B.E. depends primarily on the ability to feed all cores with sufficient data. If the bandwidth of the EIB is insufficient to prevent starvation of the SPUs, performance will suffer. Ideally an SPU never has to wait for data to arrive, i.e., the DMAs finish before the corresponding synchronisation takes place.
Therefore, the elapsed time an SPE spends waiting for a DMA to finish, the wait time, is a natural measure for the bandwidth requirements of an application. In our experience the performance of the vast majority of code on the Cell/B.E. degrades because of bandwidth issues: in practice, the wait time is seldom insignificant. If $T$ is the set of all DMA transfers performed by an application, and $\mathit{wait}_t$ is the time spent blocking for transfer $t \in T$ to finish, we define the following two measures for the wait time: $WT_r = \frac{1}{|T|}\sum_{t \in T} \mathit{wait}_t$ and $WT_a = \sum_{t \in T} \mathit{wait}_t$. Consider for example the following computation pattern, which appears frequently on the Cell/B.E.: the PPE threads break up the computation and divide the work over the SPEs, which repeatedly transfer in some data from main memory, operate on it, and write the results back to memory. We have developed two synthetic benchmarks (BasicCellComp and ExtCellComp) that try to quantify the importance of balancing the data streams on the Cell/B.E. for this general computation template. Both benchmarks update all the elements of a sizeable array; the update operations can be performed independently of one another. Each SPE repeats a three-step cycle until the entire input array has been processed. First, an SPE starts a DMA transfer to bring part of the input array to its LS. When the array has arrived in the LS, the SPE proceeds to update the array elements. After the partial array has been processed, it is transferred back to main memory synchronously. BasicCellComp considers the worst-case scenario in which the aforementioned update operation regresses to a NOP and the granularity of the computation is zero; there is no software cache, nor does this benchmark use double buffering. ExtCellComp uses double buffering [15] and executes an update kernel with a uniform non-zero run time. According to this general computation pattern, data flows from SPE A to SPE B via main memory: SPE A computes a datum and transfers it to main memory, where SPE B retrieves it and copies it to its LS. A better approach bypasses main memory and moves data directly between the LS of A and the LS of B in an attempt to reduce contention and blocking. As a consequence the wait time shrinks and the execution time is reduced, as in Figure 3. Main memory is only accessed as a last resort. We have designed a general mechanism, bypassing, that automatically detects and exploits such bypass opportunities independently of the type of application. Figure 2 depicts the reduction in computation time and wait time for increasing degrees of bypassing for BasicCellComp.
Fig. 3. SPE stages with and without bypassing
The bypassed accesses are distributed uniformly. For this particular application the execution time (Figure 2(a)) grows with the number of SPEs. This seemingly contradictory behaviour is due to the severe impact of contention on the EIB: with each additional SPE the contention and blocking get worse, and the wait time grows to the point that BasicCellComp slows down instead of speeding up. The zero granularity of the update task only serves to exaggerate this effect. $WT_r$ equals 10 µs for 8 SPEs (bp 0% in Figure 2(b)), which is of the same order of magnitude as the execution time of an optimised matrix multiplication kernel [16]. As the number of bypassed transfers increases, the wait time shrinks and the execution time improves. Also note that bypassing affects the performance more as the number of SPEs grows and contention and blocking get worse. Figure 4 repeats the same experiment for ExtCellComp. The number of SPEs is fixed at 8, but this time we vary the array slice size and the execution time of the array update kernel. The trends observed in the first benchmark remain valid for this second, more realistic benchmark. For small argument sizes/array slices and slower update kernels, double buffering hides the DMA latency and reduces the effect of bypassing. For example, the run times for array slices of 32 KB (Figure 4(a)) with the 20 µs kernel are similar with and without bypassing. For larger array slices the version without bypassing cannot hide the contention on the EIB through double buffering, and the benchmark slows down compared to the versions that bypass arguments. For an SPE A to successfully bypass a communication, the object first has to be present in another SPE B. This premise holds as long as B recently computed on the given object and B has a software cache. The limited size of the LS necessarily implies that only a small number of objects can be held before they are flushed back to main memory. Hence bypassing can help to overcome the EIB bottleneck provided that the computation exhibits temporal locality.
Fig. 4. Run time (a) and $WT_r$ (b) for ExtCellComp. The experiment was repeated for different run times of the array update kernel (“rt”) and different degrees of bypassing (“bp”).
Conversely, bypassing is a technique that exploits non-trivial forms of temporal locality on the Cell/B.E. The reuse of an object is not limited to a single SPE: bypassing makes the object available to the whole system. The next section experimentally verifies the applicability of our implementation of bypassing on the Cell/B.E.
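The wait time that enters $WT_r$ and $WT_a$ can be accounted for directly around the tag-status wait on the SPU, for instance by reading the SPU decrementer before and after the blocking call. The sketch below is illustrative only; it assumes the spu_mfcio.h interface, including the decrementer helpers spu_write_decrementer()/spu_read_decrementer(), and omits the conversion of decrementer ticks to seconds via the timebase frequency.

/* Illustrative accounting of DMA wait time on the SPU, feeding the
 * WT_r and WT_a measures defined above. */
#include <spu_mfcio.h>

static unsigned long long total_wait_ticks = 0;   /* WT_a in ticks        */
static unsigned long long num_transfers    = 0;   /* |T|                  */

void wait_time_init(void)
{
    spu_write_decrementer(0x7fffffff);   /* start the cycle counter       */
}

/* wait for the DMAs issued on 'tag' and account the time spent blocking */
void wait_for_dma(unsigned int tag)
{
    unsigned int start = spu_read_decrementer();

    mfc_write_tag_mask(1 << tag);
    mfc_read_tag_status_all();

    unsigned int end = spu_read_decrementer();
    total_wait_ticks += (unsigned long long)(start - end);  /* counts down */
    num_transfers++;
}

/* WT_r: average wait time per DMA transfer (in decrementer ticks) */
double wait_time_ratio(void)
{
    return num_transfers ? (double)total_wait_ticks / (double)num_transfers
                         : 0.0;
}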
5 Experiments
We have extended CellSs with the bypassing capability described in Section 4. Our implementation uses the Atomic Cache Unit (ACU) of an SPE to track the location of task arguments. A task argument can reside in main memory, in the LS of one or more SPEs, or in both. Bypassing follows a pull protocol: an SPE locates the arguments it requires, acquires the necessary locks and initiates the DMAs. Once the objects have been transferred to the LS, the locks are released and the presence of these objects in the LS of this SPE is recorded. The exact implementation and the details of the consistency protocol are beyond the scope of this paper. Our bypassing mechanism aims to improve performance by reducing the wait time. Two conditions must be fulfilled for this to make sense in CellSs. Firstly, there must be ample opportunity to apply bypassing (Section 5.1): unless the task scheduler in CellSs successfully exposes temporal locality for task arguments, no improvement is to be expected. Secondly, we should check that the wait time effectively decreases when bypassing is used (Section 5.2). We omit absolute performance results in this section. The reduction in wait time (Section 5.2) corroborates the soundness of the bypassing idea, but whether this reduction carries over to a reduction in the execution time depends on the implementation of CellSs (e.g., if the execution time of the CellSs runtime is typically dominated by computation in the PPU threads, the benefit of bypassing will be compromised). Performance results therefore tell more about the current implementation of CellSs than about the merit of bypassing, and we chose to defer absolute performance results until CellSs has been tuned to fully exploit the bypassing mechanism. All experiments were conducted on a Cell blade at the Barcelona Supercomputing Center, and the presented numbers are averages over 100 executions. We present results for blocked matrix applications that take 32 × 32 hypermatrices consisting of 64 × 64 blocks of single-precision floats as input: matmul (a blocked matrix multiplication, implemented with the kernel from the Cell SDK), lu (a blocked LU decomposition of a square matrix A, which computes a lower triangular matrix L and an upper triangular matrix U and checks whether A = L × U up to a certain accuracy), choleskyC (a blocked Cholesky factorisation that traverses the matrix by columns), choleskyR (a blocked Cholesky factorisation that traverses the matrix by rows), and jacobi (a blocked version of the Jacobi eigenvalue algorithm that calculates the eigenvalues and eigenvectors of a real symmetric matrix).
5.1 Bypassing Applicability
For each benchmark we measure the frequency with which an SPE finds a task's argument in the software cache, transfers it via a bypassed DMA, or transfers it from main memory. Figure 5 shows that for 8 SPEs, 10% to 20% of all task arguments can be transferred via bypassing. Except in Figure 5(c), the software cache in an SPE seems to be more successful in exploiting the temporal locality of the computation. One should keep in mind, though, that the CellSs task scheduler has been designed with the software cache in mind [14]. Our runtime is biased towards efficiency of the software cache, and this result serves to illustrate that bias.
Fig. 5. Shares of task arguments found in an SPE's software cache (softcache), transferred from main memory (MM DMA) and bypassed from another SPE's LS (bp)
In some cases the software cache hit rate decreases with the number of SPEs (figures 5(a) and 5(c)). For these applications the scheduler does a good job at uncovering temporal locality but spreads out the reused objects over different SPEs. The software cache hits fall and bypassing makes up for it. Generally one can conclude that bypassing is more appealing for a larger number of SPEs. More SPEs means a larger working set of arguments and hence more opportunities to transfer arguments from one LS to another.
Fig. 6. Effect of bypassing on $WT_a$ (defined in Section 4). For all the benchmarks, bypassing reduces the time spent waiting for DMAs to finish once more than 4 SPEs are involved.
5.2 Bypassing Effect on the Wait Time
For 8 SPEs, bypassing reduces $WT_a$ by 20% to 50% in CellSs (Figure 6). For fewer SPEs the overhead of this solution becomes prohibitive and nothing is gained. Fewer SPEs also means less blocking and contention, and consequently the effect of bypassing diminishes as well. As in the previous section, we conclude that an execution only benefits from bypassing if most of the SPEs of the Cell/B.E. take part in the computation.
6 Conclusion
We have introduced and motivated bypassing and demonstrated its applicability for improving the observed bandwidth on the Cell/B.E. Our implementation in CellSs is generic and automatic, in the sense that it is application-independent and needs no intervention from the programmer. Our experiments on a representative set of benchmarks show that applications on the Cell/B.E. are susceptible to improvements of their wait time via bypassing. CellSs with bypassing is work in progress. In this paper we have focused on the correctness of the bypassing idea rather than on the performance of the runtime as a whole; that is why we prefer to report solely on the reduction of the wait time in Section 5 instead of publishing premature performance results. Future efforts will improve the integration of CellSs with our bypassing code and detail the bypassing mechanism and the coherence protocol.
References
1. Wulf, W.A., McKee, S.A.: Hitting the memory wall: implications of the obvious. SIGARCH Comput. Archit. News 23(1), 20–24 (1995)
2. Wilkes, M.V.: The memory wall and the CMOS end-point. SIGARCH Comput. Archit. News 23(4), 4–6 (1995)
3. Rafique, N., Lim, W.-T., Thottethodi, M.: Effective management of DRAM bandwidth in multicore processors. In: PACT 2007: Proceedings of the 16th International Conference on Parallel Architecture and Compilation Techniques, Washington, DC, USA, pp. 245–258. IEEE Computer Society, Los Alamitos (2007)
4. Rixner, S., Dally, W.J., Kapasi, U.J., Mattson, P., Owens, J.D.: Memory access scheduling. In: ISCA 2000: Proceedings of the 27th Annual International Symposium on Computer Architecture, pp. 128–138. ACM, New York (2000)
5. Ainsworth, T.W., Pinkston, T.M.: Characterizing the Cell EIB on-chip network. IEEE Micro 27(5), 6–14 (2007)
6. Jiménez-González, D., Martorell, X., Ramírez, A.: Performance analysis of Cell Broadband Engine for high memory bandwidth applications. In: ISPASS, pp. 210–219. IEEE Computer Society, Los Alamitos (2007)
7. Ainsworth, T.W., Pinkston, T.M.: On characterizing performance of the Cell Broadband Engine element interconnect bus. In: Proceedings of the First International Symposium on Networks-on-Chip (2007)
8. Chow, A.C., Fossum, G.C., Brokenshire, D.A.: A Programming Example: Large FFT on the Cell Broadband Engine. IBM (May 2005)
9. Blagojevic, F., Nikolopoulos, D.S., Stamatakis, A., Antonopoulos, C.D.: Dynamic multigrain parallelization on the Cell Broadband Engine. In: PPoPP 2007: Proceedings of the 12th ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming, pp. 90–100. ACM, New York (2007)
10. Eichenberger, A.E., O'Brien, J.K., O'Brien, K.M., Wu, P., Chen, T., Oden, P.H., Prener, D.A., Shepherd, J.C., So, B., Sura, Z., Wang, A., Zhang, T., Zhao, P., Gschwind, M.K., Archambault, R., Gao, Y., Koo, R.: Using advanced compiler technology to exploit the performance of the Cell Broadband Engine™ architecture. IBM Systems Journal 45(1), 59–84 (2006)
11. OMP community: The community of OpenMP users, researchers, tool developers and providers website (2006), http://www.compunity.org/
12. Snir, M., Otto, S.: MPI-The Complete Reference: The MPI Core. MIT Press, Cambridge (1998)
13. Planas, J., Badia, R.M., Ayguadé, E., Labarta, J.: Hierarchical task based programming with StarSs. International Journal of High Performance Computing Applications (under evaluation)
14. Perez, J.P., Bellens, P., Badia, R.M., Labarta, J.: CellSs: making it easier to program the Cell Broadband Engine processor. IBM J. Res. Dev. 51(5), 593–604 (2007)
15. Cell Broadband Engine Programming Handbook, version 1.1, International Business Machines Corporation, Sony Computer Entertainment Incorporated, Toshiba Corporation (2007)
16. Hackenberg, D.: Fast matrix multiplication on Cell (SMP) systems website (2008), http://www.tu-dresden.de/zih/cell/matmul
Exploiting the Cell/BE Architecture with the StarPU Unified Runtime System
Cédric Augonnet1, Samuel Thibault1, Raymond Namyst1, and Maik Nijhuis2
1 INRIA Bordeaux Sud-Ouest – LaBRI – University of Bordeaux
2 Vrije Universiteit Amsterdam
[email protected], {samuel.thibault,raymond.namyst}@labri.fr, [email protected]
Abstract. Core specialization is currently one of the most promising ways of designing power-efficient multicore chips. However, approaching the theoretical peak performance of such heterogeneous multicore architectures with specialized accelerators is a complex issue. While substantial effort has been devoted to efficiently offloading parts of the computation, designing an execution model that unifies all computing units is the main challenge. We therefore designed the StarPU runtime system to provide portable support for heterogeneous multicore processors to high-performance applications and compiler environments. StarPU provides a high-level, unified execution model which is tightly coupled to an expressive data management library. In addition to our previous results on using multicore processors alongside graphics processors, we show that StarPU is flexible enough to efficiently exploit the heterogeneous resources of the Cell processor. We present a scalable design supporting multiple different accelerators while minimizing the overhead on the overall system. Using experiments with classical linear algebra algorithms, we show that StarPU improves programmability and provides performance portability.
1 Introduction
Multicore architectures are now widely adopted. Desktop personal computers and even laptops typically contain a multicore CPU and a powerful graphics card (GPU). The multicore Cell processor is used both in the PlayStation 3 gaming console and in the Roadrunner supercomputer. Similarly to the use of accelerating devices such as GPUs, core specialization addresses HPC's demand for more computational power, along with better power efficiency. Future processors will therefore not only get more cores, but some of them will be tailored for specific workloads. While such designs intend to address architectural limits, exploiting them efficiently introduces numerous challenging issues at all levels, ranging from programming models and compilers to the design of scalable hardware solutions. The design of efficient runtime systems for these architectures is therefore a critical issue. In a previous study, we have shown that the StarPU runtime system efficiently supports platforms with multicore CPUs alongside GPUs [1]. By delegating the management
management of low-level resources to StarPU, compilation environments and high-performance libraries can concentrate on their primary algorithmic concerns in a portable fashion. In this paper, we demonstrate that the design of StarPU is flexible enough to efficiently support the Cell architecture while taking its specificities into account. Using an asynchronous approach, we show that StarPU handles multiple accelerators with little overhead. The effort required to port an application to the Cell architecture is very limited when using StarPU. Furthermore, StarPU's high-level scheduling optimizations are directly applicable on a variety of heterogeneous platforms. StarPU therefore makes a significant step towards performance portability. The remainder of this paper is organized as follows. Section 2 introduces StarPU. Section 3 presents the architecture of the Cell processor and the Cell Run Time Library (Cell-RTL), which is used by StarPU. The design of a driver for Cell-RTL is discussed in Section 3.2, and we evaluate its efficiency in Section 4. After comparing our results with existing work in Section 5, we conclude this paper and give future work directions in Section 6.
2 StarPU, a Unified Runtime System
The StarPU runtime system offers support for heterogeneous multicore processors through a high-level abstraction of tasks, called codelets, which can be executed on different architectures such as homogeneous multicore processors, GPUs and Cell processors. StarPU uses all computing resources at the same time by transparently mapping codelets as efficiently as possible onto all available resources while hiding low-level technical mechanisms. From a programming point of view, StarPU is not a new language but a library that executes tasks explicitly submitted by the application. StarPU also takes particular care to schedule those tasks efficiently, and allows scheduling experts to implement custom scheduling policies in a portable fashion. The design of StarPU is organized around two main components: a data management library that offers a high-level interface for manipulating data distributed across a heterogeneous machine, and a unified execution model which executes the tasks encapsulated in the codelet structure.

A data management library. Accessing main memory from an accelerator core, such as a GPU, is usually either impossible or extremely costly. StarPU therefore provides a data management library which offers a high-level interface for manipulating the pieces of data used by codelets [1]. This library offers a distributed shared memory by enforcing data coherency across the machine while protecting the data from concurrent modifications. Its transparent data migration and replication mechanisms also perform extra transformations that are useful in a hybrid environment, such as automatically converting endianness or remapping data into a layout that is more suitable for the target architecture. As shown in Figure 1, applications describe the data layout so that the drivers can perform data transfers at a high abstraction level.
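To make the discussion concrete, the short listing below registers a vector with the data management library and submits one codelet that scales it. It is a minimal sketch written against StarPU's current public C API (starpu_vector_data_register, starpu_task_create, starpu_task_submit, and so on); field and constant names have evolved since the version of StarPU described in this paper, so the exact spelling should be taken as an assumption rather than as the interface discussed here.

#include <stdint.h>
#include <starpu.h>

/* CPU implementation of the codelet: scale a vector in place. */
static void scal_cpu(void *buffers[], void *cl_arg)
{
    unsigned n = STARPU_VECTOR_GET_NX(buffers[0]);
    float *v   = (float *) STARPU_VECTOR_GET_PTR(buffers[0]);
    float factor = *(float *) cl_arg;
    for (unsigned i = 0; i < n; i++)
        v[i] *= factor;
}

static struct starpu_codelet scal_cl = {
    .cpu_funcs = { scal_cpu },  /* implementations for other workers (CUDA, Cell) would be listed here */
    .nbuffers  = 1,
    .modes     = { STARPU_RW },
};

int main(void)
{
    float v[1024], factor = 3.14f;
    for (int i = 0; i < 1024; i++) v[i] = 1.0f;

    starpu_init(NULL);

    /* Hand the vector over to the data management library. */
    starpu_data_handle_t handle;
    starpu_vector_data_register(&handle, STARPU_MAIN_RAM,
                                (uintptr_t) v, 1024, sizeof(float));

    /* Submit one codelet asynchronously; StarPU picks a worker and moves the data. */
    struct starpu_task *task = starpu_task_create();
    task->cl          = &scal_cl;
    task->handles[0]  = handle;
    task->cl_arg      = &factor;
    task->cl_arg_size = sizeof(factor);
    starpu_task_submit(task);

    starpu_task_wait_for_all();
    starpu_data_unregister(handle);
    starpu_shutdown();
    return 0;
}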
Fig. 1. The path of a codelet through StarPU
A unified execution model. Applications asynchronously submit tasks to StarPU in the form of codelets. A codelet encapsulates a task that can be executed on one or more of the compute resources controlled by StarPU; we will refer to these compute resources as workers. A codelet contains a high-level description of the data it accesses, its implementations on the various workers, and a callback function which StarPU calls after executing the codelet. Programmers can also include extra hints (e.g., priorities) to guide the scheduling engine. Figure 1 shows that supporting a new architecture in StarPU requires limited effort. On the one hand, such a driver must launch the computation of the codelets it gets from the scheduler. On the other hand, it needs to implement the methods for transferring buffers to and from the corresponding architecture. There is exactly one driver per worker, so there can be multiple instances of a driver. The drivers currently available for multicore CPUs and NVIDIA GPUs are synchronous: while a codelet is executing, the corresponding driver waits for its completion and cannot execute other codelets.
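The following self-contained toy program sketches what such a synchronous driver loop amounts to, following the steps of Figure 1; every name in it is hypothetical and does not reflect StarPU's actual internals.

#include <stdio.h>

/* A codelet as seen by a driver: an implementation for this worker plus a
   callback to run once the codelet has completed. */
typedef struct {
    void (*run)(void *);
    void (*callback)(void *);
    void *arg;
} codelet_t;

static void kernel(void *arg) { printf("executing codelet %s\n", (char *) arg); }
static void notify(void *arg) { printf("codelet %s finished\n",  (char *) arg); }

/* Stands in for the scheduling engine (step 3 of Figure 1). */
static codelet_t queue[] = { { kernel, notify, "A" }, { kernel, notify, "B" } };
static unsigned next;
static codelet_t *scheduler_pop(void) { return next < 2 ? &queue[next++] : NULL; }

/* Stands in for the data management library (step 4 of Figure 1). */
static void fetch_input_data(codelet_t *c) { (void) c; /* transfer buffers to the worker */ }

int main(void)
{
    codelet_t *c;
    while ((c = scheduler_pop())) {  /* synchronous: one codelet at a time */
        fetch_input_data(c);
        c->run(c->arg);              /* wait for completion before the next one */
        c->callback(c->arg);         /* notify the application */
    }
    return 0;
}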
3 Extending StarPU for the Cell Processor

The Cell processor is a heterogeneous multicore chip composed of a main hyperthreaded core, the Power Processing Unit (PPU), and eight coprocessors named Synergistic Processing Units (SPUs). Each SPU can only directly access its 256 KB of local memory. All cores are interconnected by the Element Interconnect Bus (EIB). Data transfers between main memory and the SPUs' local memories require explicit DMA operations, which use the EIB. The EIB also provides hardware mailbox mechanisms for synchronization purposes.

3.1 The Cell Run Time Library

The Cell Run Time Library (Cell-RTL), co-developed at the Vrije Universiteit Amsterdam and the Université de Bordeaux, was initially designed as a back-end for the Cell-Space framework for building streaming applications on Cell processors [9]. Cell-RTL executes tasks on the SPUs with little overhead while applying optimizations such as multibuffering (overlapping multiple memory transfers and computation) and task reordering. As shown in Figure 2, Cell-RTL implements efficient data transfers, and since it offers an interface to asynchronously submit tasks to the SPUs, adding a basic Cell-RTL driver requires limited effort.
Fig. 2. The Cell-RTL and the Cell architecture
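For reference, the explicit DMA transfers and the multibuffering optimization mentioned above look roughly as follows on the SPU side. This is a minimal sketch assuming the Cell SDK's spu_mfcio.h intrinsics (mfc_get, mfc_write_tag_mask, mfc_read_tag_status_all); the block size, the effective-address layout and the compute() kernel are illustrative and do not describe Cell-RTL's actual protocol.

#include <stdint.h>
#include <spu_mfcio.h>

#define BLOCK 16384   /* one DMA transfer of at most 16 KB */

static volatile char buf[2][BLOCK] __attribute__((aligned(128)));

static void compute(volatile char *data, unsigned size)
{
    (void) data; (void) size;   /* the user kernel would process the block here */
}

/* Stream nblocks blocks from main memory (effective address ea) while
   overlapping the DMA of block i+1 with the computation on block i. */
void stream_in(uint64_t ea, unsigned nblocks)
{
    unsigned cur = 0;
    mfc_get(buf[cur], ea, BLOCK, cur, 0, 0);              /* prefetch first block */
    for (unsigned i = 0; i < nblocks; i++) {
        unsigned nxt = cur ^ 1;
        if (i + 1 < nblocks)                              /* start next transfer  */
            mfc_get(buf[nxt], ea + (uint64_t)(i + 1) * BLOCK, BLOCK, nxt, 0, 0);
        mfc_write_tag_mask(1 << cur);                     /* wait for current one */
        mfc_read_tag_status_all();
        compute(buf[cur], BLOCK);                         /* overlaps with the    */
        cur = nxt;                                        /* transfer in flight   */
    }
}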
All these characteristics make Cell-RTL an excellent target for our StarPU runtime system, which can be seen as an extension of Cell-RTL that adds a data management library protecting data from concurrent accesses, and a scheduling engine distributing the codelets over the accelerators. A few considerations must however be studied before designing a Cell-RTL driver for StarPU. In particular, since SPUs typically handle fine-grain tasks, it is crucial to hide most of the overhead by overlapping DMA transfers and computation. Cell-RTL therefore supports multiple pending tasks on a single SPU, and automatically applies multibuffering techniques. Unlike the current implementation of the StarPU drivers for multicore CPUs and NVIDIA GPUs, the Cell-RTL driver for the SPUs therefore has to be asynchronous, which means that task submission is non-blocking so that multiple jobs can be submitted simultaneously.

3.2 Designing a StarPU Driver for Cell-RTL

In this section, we study the requirements for implementing the Cell-RTL driver. Firstly, the fine granularity of the jobs that typically run on SPUs makes it crucial that StarPU asynchronously submits multiple jobs to Cell-RTL; moreover, to scale with the number of SPUs, a single thread must control all SPUs at the same time in order to avoid the significant cost of context switches on the PPU (we measured a 1.6 µs overhead per context switch). Secondly, having an SPU send an interrupt to the PPU to notify the termination of a task is extremely costly. In Cell-RTL, SPUs instead send notification messages by means of DMA transfers. As a result, we need a mechanism to poll Cell-RTL regularly in the background, which we call progression. Thirdly, synchronizing cores within the Cell architecture is expensive. In order to limit the amount of interaction between the PPU and the different SPUs, we use a Cell-RTL mechanism called job chaining [9].

An asynchronous driver controlling multiple workers. Instead of synchronously submitting a single task at a time and waiting for its termination, the driver for Cell-RTL requests multiple codelets from the StarPU scheduling engine and submits them all at once to Cell-RTL. Since both StarPU and Cell-RTL use similar callback mechanisms, the callback of a Cell-RTL job is responsible for the termination of the corresponding StarPU codelet.
Similarly to the driver for NVIDIA GPUs [1], the Cell-RTL driver requires one CPU thread, which in turn needs to run on a CPU core. In contrast with usual multicore machines equipped with GPUs, the Cell has more coprocessors than CPU cores (up to 8 SPUs vs. 2 PPU contexts if we consider hyperthreading). Having a separate driver for each coprocessor would therefore overload the PPU due to the numerous context switches between the driver threads, hurting the overall performance. An asynchronous driver which controls multiple workers does not have this problem and can scale with the number of accelerators. This approach can also be applied to multi-GPU setups, which are now becoming standard.

Progression within StarPU. Efficient synchronization between the different cores is a challenging issue which Cell-RTL addresses by using DMA to notify task terminations instead of the inefficient hardware mailbox mechanisms [9]. Although this DMA approach is very efficient, it requires the application to poll regularly to detect the completion of SPU tasks. Thus, StarPU needs a progression mechanism to poll the Cell-RTL driver. Polling only before or after the submission of a new task is not sufficient: StarPU must never be blocked waiting for a resource that can only be released by the Cell-RTL driver. Consider for instance two tasks A and B which both modify a data item D. First, the driver gets A from the scheduler, takes the lock that protects D, asynchronously submits task A to Cell-RTL, and then gets B from the scheduler. Before executing B, it needs to take the lock that protects D. Waiting for this lock would block the driver and cause a deadlock: while waiting for the lock, the driver does not poll Cell-RTL, so it never notices the completion of A, and A never releases its lock on D. Not only the Cell-RTL driver, but the whole of StarPU therefore needs a progression mechanism which avoids such deadlocks.

The simplest solution to ensure progression is to launch a separate progression thread. Since the hyperthreaded PPU supports two hardware threads, adding a second thread next to the one already devoted to the driver may seem reasonable. This however yields too much overhead, as shown in Section 4. Moreover, in most cases polling before the submission of a new task is sufficient. Considering that deadlocks can only occur when the driver for Cell-RTL is blocked, another solution consists of adding a progression mechanism within every blocking procedure that could wait for the termination of a task. As illustrated by Figure 1, the only resources manipulated by a StarPU driver are codelets (step 3) and data (step 4). Instead of blocking while waiting for a lock that protects a piece of data used by another task, the data management library makes StarPU progress until the resource is available again. Likewise, if a driver requests a codelet from the scheduling engine and none is schedulable, the scheduler makes StarPU progress. In both cases, consuming CPU cycles to poll on behalf of the blocked driver is not an issue, since the corresponding worker would have been stalled anyway. Moreover, such polling is likely to improve reactivity, thereby reducing the time the worker would otherwise spend waiting. The toy program below illustrates why blocking primitives must drive this progression.
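This self-contained toy mimics the A/B scenario above: a blocking lock acquisition that keeps polling the driver instead of sleeping cannot deadlock, because only the polling can observe the completion of A and release the lock on D. All names are hypothetical; the real mechanism lives inside StarPU's data management library and scheduler.

#include <stdbool.h>
#include <stdio.h>

static bool d_locked;        /* lock protecting data item D */
static bool task_A_pending;  /* A was submitted asynchronously; it only
                                "completes" when we poll the driver */

/* Polls Cell-RTL for finished jobs; A's callback releases the lock on D. */
static void driver_progress(void)
{
    if (task_A_pending) {
        task_A_pending = false;
        d_locked = false;
        printf("task A completed, lock on D released\n");
    }
}

/* Blocking primitive that keeps making StarPU progress instead of sleeping. */
static void lock_with_progression(void)
{
    while (d_locked)
        driver_progress();
}

int main(void)
{
    /* Submit A: take the lock on D and hand the job to the (simulated) SPUs. */
    d_locked = true;
    task_A_pending = true;

    /* B also modifies D. Simply sleeping on the lock would deadlock, since
       only driver_progress() can observe A's completion. */
    lock_with_progression();
    printf("task B may now run on D\n");
    return 0;
}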
Transparent job chaining. Instead of submitting jobs to Cell-RTL one at a time, programmers can directly inject a list of jobs that the SPUs fetch autonomously: instead of handling submission and termination one job at a time, the PPU submits a single chained job and gets notified upon the completion of the entire chain. StarPU exploits this job chaining mechanism transparently for the programmer. Since the scheduling engine of StarPU is organized around a set of codelet queues, the Cell-RTL driver can directly request a list of codelets from the scheduler. It is worth noting that the scheduling engine may have applied scheduling optimizations beforehand, e.g., reordering the codelet list. In other words, the Cell-RTL driver gets all the benefits of the scheduling policies enforced by StarPU, regardless of the underlying hardware.

When using Cell-RTL without the support of StarPU, programmers have to construct those chains explicitly. Manually selecting the optimal number of chains and their respective sizes is not necessarily simple. For example, programmers have to supply at least two chains per SPU to ensure that multibuffering remains possible. Requiring application programmers to have such specific knowledge of the Cell architecture and of the design of Cell-RTL itself is not compatible with our portability goals. In contrast, having StarPU construct those chains automatically relieves programmers of a serious burden. When the Cell-RTL driver gets a list of codelets from the scheduler, the list is split into multiple chunks. To make multibuffering possible, there are up to two chunks per SPU; however, we create fewer chunks if their length would fall below a minimal size, so as to avoid injecting inefficiently small chains. The sketch following this paragraph illustrates such a chunking policy.
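A possible way to compute the number of chains is sketched below; the constant MIN_CHAIN_LEN and the exact policy are illustrative assumptions, not the values used by StarPU.

#include <stdio.h>

#define MIN_CHAIN_LEN 4   /* assumed minimal chain length */

/* Start from two chains per SPU (so multibuffering is possible) and merge
   chains while they would be shorter than the minimal length. */
static unsigned chain_count(unsigned n_codelets, unsigned n_spus)
{
    unsigned chains = 2 * n_spus;
    while (chains > 1 && n_codelets / chains < MIN_CHAIN_LEN)
        chains--;
    return chains;
}

int main(void)
{
    unsigned n = 30, spus = 6;
    unsigned chains = chain_count(n, spus);
    printf("%u codelets on %u SPUs -> %u chains of about %u codelets each\n",
           n, spus, chains, n / chains);
    return 0;
}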
4 Evaluation

In order to validate our approach, we first present how well StarPU performs with a few applications running on the Cell processor. We then discuss the effects of the various design choices we have made.

Experimentation platform. All the experiments presented in this paper were executed on a Sony PlayStation 3 running Fedora Core 8 with a Linux 2.6.26 kernel and IBM SDK 3.1. Applications can use 1 PPU and 6 SPUs, clocked at 3.2 GHz. It is important to note that, with this configuration, less than 200 MB of memory is available. We therefore only present benchmarks with very small problem sizes.

Benchmarks. We first analyze the behaviour of two single-precision dense linear algebra applications written on top of StarPU. Since there is no comprehensive implementation of the BLAS kernels currently available for the SPUs, we use the SGEMM kernel from the IBM SDK, and the SPOTRF and STRSM kernels written by Kurzak et al. [7]. Figures 3 and 4 illustrate the benefits of our different optimizations. A direct implementation on top of Cell-RTL indicates the overhead caused by StarPU itself. Matrix multiplication is a set of independent identical tasks, so we obtain nearly linear speedups in Figure 3. The scalability of the Cholesky decomposition algorithm is, however, limited by a lack of parallelism, which our optimization techniques help to mitigate.
Fig. 3. Matrix multiplication (GFlops vs. number of SPEs; curves: linear speedup, Cell-RTL, StarPU, progression thread, no chaining, IBM BLAS, no chaining + progression thread)

Fig. 4. Cholesky decomposition (GFlops vs. number of SPEs; curves: linear speedup, StarPU, progression thread, no chaining, no chaining + progression thread)
The impact of job chaining. By reducing the need for synchronization, job chaining reduces the amount of interaction between StarPU and the coprocessors. As a result, Table 1 shows that job chaining yields a higher task throughput, so that each SPU achieves better performance. Chaining also improves scalability on both benchmarks, which hardly get any speedup otherwise.

The choice of the progression mechanism. Avoiding a separate progression thread brings further benefits in scalability and overall performance. As mentioned in Section 3.2, the use of a separate progression thread places a noticeable burden on the operating system, which is quantified in Table 2. This table gives the output of the Unix time command, which reports not only the total execution time but also the amount of time spent in the Linux kernel.
Table 1. Impact of job chaining on scalability

                           Speed on 1 SPU (GFlops)            Speedup on 6 SPUs (vs. 1 SPU)
                           without chaining  with chaining    without chaining  with chaining
Matrix multiplication           13.67            14.90              2.89             5.42
Cholesky decomposition          10.89            11.91              1.63             3.77
Table 2. OS overhead during Cholesky decomposition

                                          With progression thread   No progression thread
Total execution time                              3.2 s                     2.1 s
Time spent in the OS                          1.5 s (47.9%)             0.48 s (22.9%)
Calls to schedule() per second             CPU0: 142, CPU1: 4242      CPU0: 47, CPU1: 51
We also use the contents of the /proc/schedstat file to determine the number of calls to the schedule() function of the Linux kernel. When using a separate thread, this number explodes from one call every 20 ms to one every 250 µs (i.e., from roughly 50 to about 4000 calls per second on CPU1, consistent with Table 2), which shows the pressure put on the operating system. Avoiding overloading the system not only saves computational power for the application code that runs on the PPU, but also improves overall system reactivity and thus reduces synchronization overhead.

A low overhead. It is interesting to compare our results with other runtime systems running the same computation kernels. The simplicity of the matrix multiplication algorithm makes it easy to re-implement with different tools. Figure 3 compares the performance of different implementations of a matrix multiplication based on the same computation kernel on the SPUs. We first built a reference implementation which directly uses the PPU interface of the SPU-accelerated BLAS library shipped with IBM SDK 3.1 for the Cell processor. We then wrote a StarPU implementation and an implementation directly on top of Cell-RTL which does not use StarPU. All implementations use the same computation kernel, so they obtain similar results on a single SPU; however, the performance of the reference implementation does not scale. Although StarPU adds an additional software layer and protects its data from concurrent accesses, Figure 3 shows that StarPU has a low overhead: the performance of the Cell-RTL implementation is only slightly better than that of the StarPU implementation. Moreover, StarPU exploits job chaining automatically, whereas we had to construct chains of the proper size explicitly in the Cell-RTL implementation.

The benefits of StarPU scheduling policies. The scheduling engine, which is independent of the drivers, can reorganize the lists of codelets according to the scheduling policy before the Cell-RTL driver requests codelets from those lists. Table 3 illustrates the benefits of a scheduling policy with support for priorities. We previously obtained similar improvements with this scheduling policy and the same application on a multicore machine equipped with a GPU [1].
Table 3. The impact of prioritized tasks on Cholesky decomposition

            Without priority    With priority
1 SPU        11.57 GFlops        11.91 GFlops
6 SPUs       43.42 GFlops        44.96 GFlops
This experiment confirms that StarPU is generic enough to efficiently exploit architectures as different as GPUs and Cell processors.

A portable approach. Provided with the proper compute kernels, we only had to ensure proper data alignment and add the kernels to the SPU function library in Cell-RTL. Porting our benchmarks therefore only required adding a couple of lines to the code that already runs on multicore processors alongside a GPU. The benefits obtained from the scheduling policies show that StarPU offers portable performance.
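From the application's point of view, the priority hint behind Table 3 boils down to a single field of the task structure. The fragment below is a sketch against StarPU's current public API; the potrf_cl codelet and the diag_tile handle are assumed to be defined elsewhere, and field names may differ from the prototype described in this paper.

#include <starpu.h>

extern struct starpu_codelet potrf_cl;   /* SPOTRF codelet, assumed defined elsewhere */
extern starpu_data_handle_t diag_tile;   /* diagonal tile of the current panel (assumed) */

void submit_potrf(void)
{
    struct starpu_task *task = starpu_task_create();
    task->cl         = &potrf_cl;
    task->handles[0] = diag_tile;
    task->priority   = STARPU_MAX_PRIO;  /* hint: schedule ahead of trailing updates */
    starpu_task_submit(task);
}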
5 Related Work

Accelerators are nowadays typically programmed using the manufacturers' low-level APIs, regardless of portability concerns. In addition to standard graphics APIs, AMD's FireStream and especially NVIDIA's CUDA are currently the most common ways to program GPUs. Cell processors are usually programmed directly with libspe. While most efforts aim at optimizing computation kernels, the demand for a unified approach is illustrated by the OpenCL standard, which not only attempts to unify programming paradigms but also proposes a low-level device interface. Likewise, substantial attention is given to the use of well-established standards such as MPI [10] or OpenMP [2]. StarPU could provide a common runtime system for the numerous programming languages that were designed (or extended) to exploit accelerators [8,4].

Various runtime systems have been designed to offer support for the Cell architecture [3,9,12]. While most approaches adopt similar tasking APIs, few of them target heterogeneous platforms with several accelerator technologies at the same time, and even fewer consider the use of both accelerators and multicore CPUs. IBM ALF [3] and Sequoia [5], for instance, target both Cell and multicore processors, but not GPUs. Charm++ also offers support for both the Cell [6] and GPUs [12]; however, to the best of our knowledge, no performance evaluation is available yet for GPUs. Contrary to StarPU, which offers a flexible interface for designing portable scheduling policies, Sequoia and Cellgen [11] do not focus on scheduling issues and only use basic load balancing mechanisms, which may not be sufficient when dealing with irregular applications and heterogeneous platforms.
6 Conclusions and Future Work

In this paper, we described how our StarPU runtime system was extended to take advantage of the Cell architecture efficiently. We discussed how to support multiple accelerators while keeping a low overhead on the operating system, thereby saving computational power on the PPU.
We have shown that StarPU and Cell-RTL integrate well. Cell-RTL efficiently executes the tasks injected by the StarPU driver while applying low-level, Cell-specific optimizations on the SPUs. StarPU, with its higher-level abstractions, enforces data consistency and maps codelets onto the different SPUs. Surprisingly, when porting our existing linear algebra benchmarks, the most serious concern was finding the proper compute kernels, given the very limited number of BLAS kernels currently implemented for the SPUs. Once these kernels were available, porting the benchmarks to the Cell only required adding a few lines of code, which demonstrates the contribution of StarPU in terms of programmability. StarPU is not only portable in the sense that it allows a single application to run on multiple architectures; it also provides performance portability, since the application benefits from StarPU's optimizations regardless of the underlying hardware.

In the future, we plan to make our data management library fully asynchronous. This enhancement will allow consistency to be maintained directly at the SPUs, and will also enable transparent support for SPU pipelining, which is particularly useful for streaming applications. We will also port our model to other architectures, for instance by adding a driver for the OpenCL unified device interface, and implement dynamic code loading mechanisms in Cell-RTL to avoid wasting the limited local memory of the SPUs. We are currently porting the PaStiX and MUMPS sparse matrix solvers. We also envision using StarPU as a back-end for high-level languages (e.g., CellSs or HMPP) that could automatically generate StarPU codelets. Selecting the optimal codelet granularity is a difficult algorithmic issue, especially in a heterogeneous context; StarPU could thus give feedback to the application or the compiler environment so that they can adapt their behaviour dynamically or even offline.
References

1. Augonnet, C., Namyst, R.: A unified runtime system for heterogeneous multicore architectures. In: Highly Parallel Processing on a Chip (2008)
2. Bellens, P., Perez, J.M., Badia, R.M., Labarta, J.: CellSs: a programming model for the Cell BE architecture. In: ACM/IEEE Conference on Supercomputing (2006)
3. Crawford, C.H., Henning, P., Kistler, M., Wright, C.: Accelerating computing with the Cell Broadband Engine processor. In: Conference on Computing Frontiers (2008)
4. Dolbeau, R., Bihan, S., Bodin, F.: HMPP: A Hybrid Multi-core Parallel Programming Environment. Technical report, CAPS entreprise (2007)
5. Fatahalian, K., Knight, T.J., Houston, M., Erez, M., Horn, D.R., Leem, L., Park, J.Y., Ren, M., Aiken, A., Dally, W.J., Hanrahan, P.: Sequoia: Programming the Memory Hierarchy. In: ACM/IEEE Conference on Supercomputing (2006)
6. Kunzman, D., Zheng, G., Bohm, E., Kalé, L.V.: Charm++, Offload API, and the Cell Processor. In: Workshop on Programming Models for Ubiquitous Parallelism, Seattle, WA, USA (September 2006)
7. Kurzak, J., Buttari, A., Dongarra, J.: Solving Systems of Linear Equations on the CELL Processor Using Cholesky Factorization. IEEE Transactions on Parallel and Distributed Systems 19(9) (2008)
8. McCool, M.D.: Data-Parallel Programming on the Cell BE and the GPU using the RapidMind Development Platform. In: GSPx Multicore Applications Conference (2006)
9. Nijhuis, M., Bos, H., Bal, H., Augonnet, C.: Mapping and synchronizing streaming applications on Cell processors. In: International Conference on High Performance Embedded Architectures and Compilers (2009)
10. Ohara, M., Inoue, H., Sohda, Y., Komatsu, H., Nakatani, T.: MPI Microtask for programming the Cell Broadband Engine processor. IBM Syst. J. 45(1) (2006)
11. Schneider, S., Yeom, J.S., Rose, B., Linford, J.C., Sandu, A., Nikolopoulos, D.S.: A comparison of programming models for multiprocessors with explicitly managed memory hierarchies. In: PPoPP 2009 Proceedings. ACM, New York (2009)
12. Wesolowski, L.: An Application Programming Interface for General Purpose Graphics Processing Units in an Asynchronous Runtime System. Master's thesis (2008)