Lecture Notes in Computer Science Commenced Publication in 1973 Founding and Former Series Editors: Gerhard Goos, Juris Hartmanis, and Jan van Leeuwen
Editorial Board David Hutchison Lancaster University, UK Takeo Kanade Carnegie Mellon University, Pittsburgh, PA, USA Josef Kittler University of Surrey, Guildford, UK Jon M. Kleinberg Cornell University, Ithaca, NY, USA Alfred Kobsa University of California, Irvine, CA, USA Friedemann Mattern ETH Zurich, Switzerland John C. Mitchell Stanford University, CA, USA Moni Naor Weizmann Institute of Science, Rehovot, Israel Oscar Nierstrasz University of Bern, Switzerland C. Pandu Rangan Indian Institute of Technology, Madras, India Bernhard Steffen TU Dortmund University, Germany Madhu Sudan Microsoft Research, Cambridge, MA, USA Demetri Terzopoulos University of California, Los Angeles, CA, USA Doug Tygar University of California, Berkeley, CA, USA Gerhard Weikum Max Planck Institute for Informatics, Saarbruecken, Germany
6566
Mladen Berekovic William Fornaciari Uwe Brinkschulte Cristina Silvano (Eds.)
Architecture of Computing Systems ARCS 2011 24th International Conference Como, Italy, February 24-25, 2011 Proceedings
Volume Editors Mladen Berekovic Institut für Datentechnik und Kommunikationsnetze Hans-Sommer-Straße 66, 38106 Braunschweig, Germany E-mail:
[email protected] William Fornaciari Dipartimento di Elettronica e Informazione Via Ponzio 34/5, 20133 Milano, Italy E-mail:
[email protected] Uwe Brinkschulte Johann Wolfgang Goethe-Universität Frankfurt am Main Robert-Mayer-Straße 11-15, 60325 Frankfurt am Main, Germany E-mail:
[email protected] Cristina Silvano Dipartimento di Elettronica e Informazione Via Ponzio 34/5, 20133 Milano, Italy E-mail:
[email protected]
ISSN 0302-9743 e-ISSN 1611-3349 ISBN 978-3-642-19136-7 e-ISBN 978-3-642-19137-4 DOI 10.1007/978-3-642-19137-4 Springer Heidelberg Dordrecht London New York Library of Congress Control Number: 2011920161 CR Subject Classification (1998): C.2, C.5.3, D.4, D.2.11, H.3.5, H.4, H.5.4 LNCS Sublibrary: SL 1 – Theoretical Computer Science and General Issues
© Springer-Verlag Berlin Heidelberg 2011 This work is subject to copyright. All rights are reserved, whether the whole or part of the material is concerned, specifically the rights of translation, reprinting, re-use of illustrations, recitation, broadcasting, reproduction on microfilms or in any other way, and storage in data banks. Duplication of this publication or parts thereof is permitted only under the provisions of the German Copyright Law of September 9, 1965, in its current version, and permission for use must always be obtained from Springer. Violations are liable to prosecution under the German Copyright Law. The use of general descriptive names, registered names, trademarks, etc. in this publication does not imply, even in the absence of a specific statement, that such names are exempt from the relevant protective laws and regulations and therefore free for general use. Typesetting: Camera-ready by author, data conversion by Scientific Publishing Services, Chennai, India Printed on acid-free paper Springer is part of Springer Science+Business Media (www.springer.com)
Preface
The ARCS series of conferences has over 30 years of tradition in reporting top results in computer architecture and operating systems research. It is organized by the special interest group on Computer and System Architecture of the GI (Gesellschaft für Informatik e.V.) and the ITG (Informationstechnische Gesellschaft im VDE - Information Technology Society). In 2011, ARCS was hosted by Politecnico di Milano, the largest technical university in Italy, on the campus located in Como. Lake Como (Lago di Como in Italian, also known as Lario, from the Latin Larius Lacus) is a Y-shaped glacial lake surrounded by the Alps. Around Lake Como, there are many interesting sites to visit: historical monuments, beautiful villas and breathtaking sights. Besides these tourist attractions, today Como is a dynamic business city with a notable past in the textile (silk) industry. This year, the conference topics comprised design aspects of multi/many-core architectures, network-on-chip architectures, processor and memory architecture optimization, adaptive system architectures such as reconfigurable systems in hardware and software, customization and application-specific accelerators in heterogeneous architectures, organic and autonomic computing, energy-awareness, system aspects of ubiquitous and pervasive computing, and embedded systems. The call for papers attracted about 62 submissions from all around the world. Each submission was assigned to at least three members of the Program Committee for review. The Program Committee decided to accept 22 papers, which were arranged in seven technical sessions. The Program Committee meeting was held on November 19 at the VDE Haus in Frankfurt am Main, Germany. The accepted papers are from Cyprus, Czech Republic, France, Germany, Iran, Italy, Japan, The Netherlands, Norway, Spain, USA and UK. Two keynotes on computing systems complemented the strong technical program. We would like to thank all those who contributed to the success of this conference, in particular the members of the Program Committee (and the additional reviewers) for carefully reviewing the contributions and selecting a high-quality program. The workshops and tutorials were organized and coordinated by Wolfgang Karl and Dimitrios Soudris. Our special thanks go to the members of the Organizing Committee for their numerous contributions: Giovanni Agosta, Finance Chair, set up the conference software; Yvonne Bernard, Web Chair, designed and maintained the website; Carlo Galuzzi, Proceedings Chair, took over the tremendous task of preparing this volume; Christian Hochberger served as
Industry Liaison; and Gianluca Palermo served as Publicity Chair. We would especially like to thank Simone Corbetta and Patrick Bellasi for taking care of the local arrangements and the many other aspects of preparing the conference. We trust that you will find this year's ARCS proceedings enriching and hope you enjoyed the warmth of the Italian people and the unique taste of Italian cuisine.

February 2011
Mladen Berekovic William Fornaciari Uwe Brinkschulte Cristina Silvano
Organization
The conference was held during February 24–25, 2011 on the Como Campus of the Politecnico di Milano, Como, Italy.
General Chairs
Mladen Berekovic, TU Braunschweig, Germany
William Fornaciari, Politecnico di Milano, Italy

Past General Chair
Christian Mueller-Schloer, Leibniz University Hannover, Germany

Program Chairs
Uwe Brinkschulte, University of Frankfurt, Germany
Cristina Silvano, Politecnico di Milano, Italy

Finance Chair
Giovanni Agosta, Politecnico di Milano, Italy

Workshop and Tutorial Chairs
Wolfgang Karl, Karlsruhe Institute of Technology (KIT), Germany
Dimitrios Soudris, National Technical University of Athens, Greece

Industry Liaison
Christian Hochberger, TU Dresden, Germany

Publicity Chair
Gianluca Palermo, Politecnico di Milano, Italy

Proceedings Chair
Carlo Galuzzi, TU Delft, The Netherlands
Local Arrangements Chairs
Simone Corbetta, Politecnico di Milano, Italy
Patrick Bellasi, Politecnico di Milano, Italy

Web Chair
Yvonne Bernard, Leibniz University of Hannover, Germany
Program Committee
Michael Beigl, KIT Karlsruhe, Germany
Koen Bertels, Technical University of Delft, The Netherlands
Mladen Berekovic, TU Braunschweig, Germany
Arndt Bode, TU Munich, Germany
Plamenka Borovska, TU Sofia, Bulgaria
Juergen Branke, University of Warwick, UK
Jürgen Brehm, Leibniz University Hannover, Germany
Uwe Brinkschulte, University of Frankfurt, Germany
Philip Brisk, UC Riverside, USA
João Cardoso, INESC-ID, Lisboa, Portugal
Luigi Carro, UFRGS, Brazil
Nate Clark, Georgia Institute of Technology, USA
Koen De Bosschere, Ghent University, Belgium
Nikitas Dimopoulos, University of Victoria, Canada
Oliver Diessel, University of New South Wales, Australia
Falko Dressler, University of Erlangen, Germany
Paolo Faraboschi, HP Labs Barcelona, Spain
Fabrizio Ferrandi, Politecnico di Milano, Italy
Alois Ferscha, University of Linz, Austria
Pierfrancesco Foglia, Università di Pisa, Italy
William Fornaciari, Politecnico di Milano, Italy
Björn Franke, University of Edinburgh, UK
Roberto Giorgi, Università di Siena, Italy
Joerg Henkel, Karlsruhe Institute of Technology, Germany
Andreas Herkersdorf, TU Muenchen, Germany
Christian Hochberger, TU Dresden, Germany
Murali Jayapala, IMEC, Belgium
Gert Jervan, Tallinn University of Technology, Estonia
Chris Jesshope, University of Amsterdam, The Netherlands
Ben Juurlink, TU Berlin, Germany
Wolfgang Karl, Karlsruhe Institute of Technology (KIT), Germany
Andreas Koch, TU Darmstadt, Germany
Krzysztof Kuchcinski, Lund University, Sweden
Paul Lukowicz, University of Passau, Germany
Erik Maehle, Universität zu Lübeck, Germany
Christian Mueller-Schloer, Leibniz University Hannover, Germany
Dimitrios Nikolopoulos, FORTH, Greece
Alex Orailoglu, UCSD, USA
Daniel Gracia Pérez, CEA, France
Pascal Sainrat, Université Paul Sabatier, Toulouse, France
Toshinori Sato, Fukuoka University, Japan
Hartmut Schmeck, University of Karlsruhe, Germany
Karsten Schwan, Georgia Tech, USA
Cristina Silvano, Politecnico di Milano, Italy
Olaf Spinczyk, University of Dortmund, Germany
Martin Schulz, LLNL, USA
Dimitrios Soudris, Technical University of Athens, Greece
Leonel Sousa, TU Lisbon, Portugal
Rainer G. Spallek, TU Dresden, Germany
Benno Stabernack, Fraunhofer HHI, Germany
Jarmo Takala, Tampere University of Technology, Finland
Jürgen Teich, Universität Erlangen, Germany
Pedro Trancoso, University of Cyprus, Cyprus
Theo Ungerer, University of Augsburg, Germany
Mateo Valero, UPC, Spain
Stephane Vialle, Supelec, France
Lucian Vintan, Lucian Blaga University of Sibiu, Romania
Klaus Waldschmidt, University of Frankfurt, Germany
Stephan Wong, Delft University of Technology, The Netherlands
Sami Yehia, Thales, France
List of All Reviewers Involved in ARCS 2011
Al Faruque, Mohammad A.; Andersson, Per; Angermeier, Josef; Anjam, Fakhar; Beigl, Michael; Berekovic, Mladen; Bernard, Yvonne; Bertels, Koen; Bode, Arndt; Boppu, Srinivas; Borovska, Plamenka; Brandon, Anthony; Branke, Juergen; Brehm, Jürgen; Brinkschulte, Uwe; Brisk, Philip; Cardoso, João; Carro, Luigi; Cazorla, Fran; Clark, Nate; De Bosschere, Koen; Di Massa, Vincenzo; Diessel, Oliver; Dimopoulos, Nikitas; Dressler, Falko; Ebi, Thomas; Faraboschi, Paolo; Ferrandi, Fabrizio; Ferscha, Alois; Foglia, Pierfrancesco;
Fornaciari, William; Franke, Björn; Giorgi, Roberto; Gruian, Flavius; Guzma, Vladimir; Henkel, Joerg; Herkersdorf, Andreas; Hochberger, Christian; Huthmann, Jens; Ilic, Aleksandar; Jayapala, Murali; Jervan, Gert; Jesshope, Chris; Juurlink, Ben; Karl, Wolfgang; Kissler, Dmitrij; Knoth, Adrian; Koch, Andreas; Kuchcinski, Krzysztof; Lange, Holger; Lukowicz, Paul; Maehle, Erik; Mameesh, Rania; Meyer, Rolf; Moreto, Miquel; Mueller-Schloer, Christian; Nadeem, M. Faisal; Naghmouchi, Jamin; Nikolopoulos, Dimitrios; Orailoglu, Alex; Palermo, Gianluca; Pérez, Daniel Gracia; Pericas, Miquel; Pitkänen, Teemu; Portero, Antonio; Pratas, Frederico; Puzovic, Nikola; Roveri, Manuel; Sainrat, Pascal; Salami, Ester; Santos, André C.; Sato, Toshinori; Schmeck, Hartmut; Schmid, Moritz; Schulz, Martin; Schuster, Thomas; Schwan, Karsten; Seedorf, Roel; Silvano, Cristina; Soudris, Dimitrios; Sousa, Leonel; Spallek, Rainer G.; Spinczyk, Olaf; Stabernack, Benno; Takala, Jarmo; Teich, Jürgen; Thielmann, Benjamin; Trancoso, Pedro; Tumeo, Antonino; Ungerer, Theo; Valero, Mateo; Vialle, Stephane; Vintan, Lucian; Waldschmidt, Klaus; Wink, Thorsten; Wong, Stephan; Yehia, Sami; Zgeras, Iannis; Zhibin, Yu
Table of Contents
Customization and Application Specific Accelerators

A Code-Based Analytical Approach for Using Separate Device Coprocessors in Computing Systems ..... 1
Volker Hampel, Grigori Goronzy, and Erik Maehle

Scalability Evaluation of a Polymorphic Register File: A CG Case Study ..... 13
Cătălin B. Ciobanu, Xavier Martorell, Georgi K. Kuzmanov, Alex Ramirez, and Georgi N. Gaydadjiev

Experiences with String Matching on the Fermi Architecture ..... 26
Antonino Tumeo, Simone Secchi, and Oreste Villa
Multi/Many-Core Architectures

Using Amdahl's Law for Performance Analysis of Many-Core SoC Architectures Based on Functionally Asymmetric Processors ..... 38
Hao Shen and Frédéric Pétrot

Application-Aware Power Saving for Online Transaction Processing Using Dynamic Voltage and Frequency Scaling in a Multicore Environment ..... 50
Yuto Hayamizu, Kazuo Goda, Miyuki Nakano, and Masaru Kitsuregawa

Frameworks for Multi-core Architectures: A Comprehensive Evaluation Using 2D/3D Image Registration ..... 62
Richard Membarth, Frank Hannig, Jürgen Teich, Mario Körner, and Wieland Eckert
Adaptive System Architectures

Emulating Transactional Memory on FPGA Multiprocessors ..... 74
Matteo Pusceddu, Simone Ceccolini, Antonino Tumeo, Gianluca Palermo, and Donatella Sciuto

Architecture of an Adaptive Test System Built on FPGAs ..... 86
Jörg Sachße, Heinz-Dietrich Wuttke, Steffen Ostendorff, and Jorge H. Meza Escobar

An Extensible Framework for Context-Aware Smart Environments ..... 98
Angham A. Sabagh and Adil Al-Yasiri
Processor Architectures

Analysis of Execution Efficiency in the Microthreaded Processor UTLEON3 ..... 110
Jaroslav Sykora, Leos Kafka, Martin Danek, and Lukas Kohout

A Dynamic Instruction Scratchpad Memory for Embedded Processors Managed by Hardware ..... 122
Stefan Metzlaff, Irakli Guliashvili, Sascha Uhrig, and Theo Ungerer

Exploring the Prefetcher/Memory Controller Design Space: An Opportunistic Prefetch Scheduling Strategy ..... 135
Marius Grannaes, Magnus Jahre, and Lasse Natvig
Memory Architectures Optimisation

Compiler-Assisted Selection of a Software Transactional Memory System ..... 147
Martin Schindewolf, Alexander Esselson, and Wolfgang Karl

An Instruction to Accelerate Software Caches ..... 158
Arnaldo Azevedo and Ben Juurlink

Memory-, Bandwidth-, and Power-Aware Multi-core for a Graph Database Workload ..... 171
Pedro Trancoso, Norbert Martinez, and Josep-Lluis Larriba-Pey
Organic and Autonomic Computing

A Light-Weight Approach for Online State Classification of Self-organizing Parallel Systems ..... 183
David Kramer, Rainer Buchty, and Wolfgang Karl

Towards Organic Active Vision Systems for Visual Surveillance ..... 195
Michael Wittke, Carsten Grenz, and Jörg Hähner

Emergent Behaviour in Collaborative Indoor Localisation: An Example of Self-organisation in Ubiquitous Sensing Systems ..... 207
Kamil Kloch, Gerald Pirkl, Paul Lukowicz, and Carl Fischer
Network-on-Chip Architectures

An Improvement of Router Throughput for On-Chip Networks Using On-the-fly Virtual Channel Allocation ..... 219
Son Truong Nguyen and Shigeru Oyanagi
Energy-Optimized On-Chip Networks Using Reconfigurable Shortcut Paths ..... 231
Nasibeh Teimouri, Mehdi Modarressi, Arash Tavakkol, and Hamid Sarbazi-azad
A Learning-Based Approach to the Automated Design of MPSoC Networks ..... 243
Oscar Almer, Nigel Topham, and Björn Franke
Gateway Strategies for Embedding of Automotive CAN-Frames into Ethernet-Packets and Vice Versa ..... 259
Andreas Kern, Dominik Reinhard, Thilo Streichert, and Jürgen Teich
Author Index ..... 271
A Code-Based Analytical Approach for Using Separate Device Coprocessors in Computing Systems Volker Hampel, Grigori Goronzy, and Erik Maehle University of Lübeck Institute of Computer Engineering Ratzeburger Allee 160, 23562 Lübeck, Germany {hampel,maehle}@iti.uni-luebeck.de,
[email protected]
Abstract. Special hardware accelerators like FPGAs and GPUs are commonly introduced into a computing system as a separate device. Consequently, the accelerator and the host system do not share a common memory. Sourcing out the data to the additional hardware thus introduces a communication penalty. Based on a combination of a program’s source code and execution profiling we perform an analysis which evaluates the arithmetic intensity as a cost function to identify those parts most reasonable to source out to the accelerating hardware. The basic principles of this analysis are introduced and tested with a sample application. Its concrete results are discussed and evaluated based on the performance of a FPGA-based and a GPU-based implementation. Keywords: FPGA, GPU, hardware accelerator, profiling, analysis.
1 Introduction
Adding special hardware to serve as a coprocessor in computing systems has a long history, a prominent instance being Intel's 8087 floating point processor extending the capabilities of 8086/8088-based systems [2]. A few years ago, graphics cards started to be used not to render graphics but to provide additional computing power. From this trend general purpose graphics processing units (GPGPU) have evolved. They intentionally soften the shading pipeline paradigm, and they also offer an application programming interface (API) such as Nvidia's CUDA [15] to make easier and better use of such coprocessors. Further activities focus on the development of manufacturer-independent APIs like OpenCL [14]. Besides GPUs, Field Programmable Gate Arrays (FPGA) have also been added to computing systems to serve as coprocessors. FPGAs are configured using hardware description languages like Verilog and VHDL, implementing processing features at the register transfer level. An FPGA-based coprocessor thus usually does not execute programs and hence lacks a fixed instruction set architecture (ISA). On the other hand, this lack of an ISA allows more flexible ways
to process data on the level of single bits and/or of bit widths wider than usual, for example, offering extensive bit-level parallelism. Differing from the 8087 coprocessor, which is coupled to the host CPU over the instruction stream [2], GPUs and FPGAs are separate devices connected to the host system over buses, which form a natural communication bottleneck. With separate device memories, the coprocessors are usually mapped into the memory address space of the host system. Data is sent to the coprocessor by explicitly copying it to the corresponding memory section, utilizing high-level languages like C. When writing a program employing special hardware, one faces the problem of finding those parts of a program that contribute the most benefit in terms of overall execution time when sourced out to a coprocessor. In this paper we address this problem and propose a combined approach of static analysis and runtime profiling to obtain a program's computational characteristics. The number of instructions executed in a part of a program and the communication effort, i.e., the input and the output data of this part, are combined to give the arithmetic intensity [1] as a cost measure. This cost measure is evaluated for all possible program parts, resulting in a profile of the whole program that allows better selections of a coprocessor's functionality. In the following, related work is discussed first. A sample application is introduced in Sec. 3. The combined approach is presented in detail in the subsequent section. Two sample coprocessor implementations of the selections made utilizing the proposed method are presented in Sec. 5. A discussion of this paper's contributions and future work is given in the concluding section.
2 Related Work
Several analytical approaches to the usage of coprocessors have been presented in the past: In [3] Trident is introduced. It translates high-level programming language code of algorithms into a hardware description which can be synthesized and implemented to be run on a FPGA. The software code is thus analyzed under aspects of hardware design in terms of data flow and control flow. Its basis is a graph representation of the computations to be performed and the application of several algorithms for scheduling, allocation, etc. thus optimizing the computations. In the field of GPUs analytical work has also been published: In [7] image processing algorithms are implemented and evaluated on a GPU in a systematic way. Six metrics related to activities within the GPU or general characteristics of the algorithms are deduced to serve as a guideline in the process of implementation. The study presented in [5] compares the implementations of Quantum Monte Carlo methods to find initial ground states in atomic or molecular clusters for several different GPU types and a FPGA. Again a metric is used to optimize the implementations. However, both works are application specific in the analysis of the respective algorithms. Additionally, the metrics are primarily
tailored to GPUs, i.e., their aim is to help to make best use of the GPU architecture. Other more general but less analytical studies on how to best use the GPU architectures based on metrics are presented in [8] and [9]. These studies almost neglect the fact that the GPU or the FPGA is hosted in a system. A sample GPU implementation of an ultra-wideband synthetic aperture radar using synchronous impulse reconstruction for data acquisition is presented in [6]. The initial sequential high-level language code is "profiled in an attempt to isolate compute intensive sections of the software", i.e., to make a fact-based choice on which algorithm to source out to the coprocessor. As they proceed to multi-thread the host-processor's activity with the coprocessor's activity, the system oriented development process becomes even more obvious. A completely different view on coprocessor development is expressed in [4], saying that it usually is the programmer's responsibility to "identify a kernel, and package it as a separate function", insinuating a trial-and-error development process. According to this view, the expressed intention is to ease the programming of the coprocessors to allow faster and hence more trials.
3 Sample Application
Ray tracing has been chosen as a sample application. It provides high computational complexity and is also a well understood and well documented algorithm in computing. Starting from a basic ray tracer based on [10], some alterations have been applied: Previous work (see [11]) hinted at the troubles of sourcing out parts of a program to a coprocessor. That work re-affirmed the need to always keep the coprocessor communication coarse-grained, i.e., to send data to the coprocessor in bulk rather than as single values. Knowing that one of the coprocessor implementations would employ an FPGA, it would not be possible to move the whole of the ray tracing algorithm to it due to its complexity. Thus the initial object oriented ray tracer from [10] has been rewritten in sequential C code, allowing a part of it to be selected and moved to the coprocessor. Usually ray tracing is performed ray by ray, contradicting the requirements of efficient coprocessor communication. As a consequence multiple rays have been bundled to form a tile of the overall picture. The rays are kept in arrays of structures, forming consecutive memory sections which can easily be memcopied to the usually memory-mapped coprocessors (see [12], [15]); a small code sketch of this layout is given at the end of this section. Each single part, i.e., each functional stage, of the ray tracer is executed on a set of rays. The intermediate results are also kept as bundled data sets passed on to later stages of the algorithm. As a result, the tracer has become more memory-intensive. The sizes of the data sets are limited by the host system's memory size as well as the coprocessor's memory, in which some of the data may have to be buffered. The tracer algorithm runs through the following stages: (1) The input rays are checked to see whether they have reached a recursion limit, which terminates possibly infinite reflections. If the input rays are to be traced, they are intersected with all objects in a scene (2). Only the rays which have hit an object must be further treated, and they are sorted into a new subset of rays (3). Evaluating the shadows is
done in two separate major steps. In the first one, the shadow rays from ambient lighting are created (4), intersected with the objects of a scene (5), and finally used to compute a basic coloring of the rays (6). Further major steps evaluate the shadows caused by point lights. Starting with some supplementary calculations (7) which are then used to create the shadow rays from all point lights (8), these shadow rays are intersected with all objects in the scene as well (9). The results are combined with the basic coloring originating from the ambient light (10), concluding the basic tracer algorithm. Because some materials may be reflective, the results from (10) are preliminary and have to be sorted into a data structure corresponding to the pixel positions of the rays (11). Finally, a set of reflected rays is created (12) which can be reintroduced to the overall tracing algorithm. As this brief description hints, the ray tracer is limited in some ways: Objects may be planes, spheres, or boxes. The objects’ surfaces may be reflective, checkered, or matte of several colors. Besides ambient lighting there may be several point lights of various colors. The numbers, sizes, and positions of the objects, the point lights, and the viewers position can be freely defined. As this work is not a work about ray tracing but the integration of coprocessors, neither special optimizations of the algorithms nor unusual or difficult optical effects have been implemented.
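The tile-of-rays layout described at the beginning of this section can be pictured with a short C sketch. The struct fields, the tile size, and the mapped-pointer handling below are illustrative assumptions, not the data structures actually used in this work:

    #include <string.h>

    #define TILE_RAYS 4096               /* hypothetical tile size */

    /* One ray of a bundle; the concrete fields are illustrative only. */
    typedef struct {
        double origin[3];
        double direction[3];
        int    depth;                     /* recursion depth of this ray */
    } ray_t;

    /* A tile keeps its rays in one consecutive memory block. */
    typedef struct {
        ray_t rays[TILE_RAYS];
        int   count;                      /* number of valid rays in the tile */
    } ray_tile_t;

    /* Send a whole tile to a memory-mapped coprocessor in one bulk copy.
     * 'coproc_in' is assumed to point into the address range the device is
     * mapped to by the host system; a real implementation might copy only
     * the 'count' valid rays instead of the full array. */
    static void send_tile(volatile void *coproc_in, const ray_tile_t *tile)
    {
        memcpy((void *)coproc_in, tile, sizeof(*tile));
    }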
4 Combined Analysis

4.1 General Concept
As mentioned in Sec. 1, utilizing a coprocessor in a separate device requires transferring the user data from the host system memory to the coprocessor and its memory, respectively. This communication thus is a cost factor when using hardware accelerators and should be kept reasonably low. At the same time the computing effort sourced out to the coprocessor should be as large as possible. Both dimensions can be measured based on a program's code: An interval with included borders A and B is defined: [A; B]. Each border stands for a line of code of the program. The lines of code inside the interval represent the part of the program's functionality which might be sourced out to a coprocessor, and thus the two borders become representatives in the source code of the communication that has to be carried out. In order to evaluate the ratio of sourced out computing effort and its corresponding communication costs, both have to be measured. Using the software profiling suite Valgrind with the Cachegrind tool [13], the number of instructions per line of source code can be profiled. Cachegrind executes the program in a simulation environment and counts the hits and misses to the level 1 and level 2 instruction and data caches, respectively. Adding up the level 1 instruction cache hits and misses gives the total number of instructions executed, as the level 2 cache is accessed only if a level 1 access produced a miss. Cachegrind can be set to annotate these sums to the line of code these instructions have been triggered by. This results in a profile of the program in terms of computing effort, i.e., which parts cause the largest computing load when executing the program.
Adding up the numbers of instructions executed for all the lines of code within the interval gives the total number of instructions which might be executed on the coprocessor. The communication costs a potential sourcing out would cause are determined by identifying the data used, generated, or altered by the program's parts within the interval. As mentioned above, the interval's beginning A represents the input communication to the coprocessor. Hence all data that is initialized prior to A and that is read between A and B has to be transferred to the coprocessor. An equivalent principle applies for the interval's ending B and thus its output communication: All data that is written inside the interval and read after B has to be transferred back to the host system. Summing up all the communication at the interval's borders gives a measure of the overall communication effort for an interval. The ratio of the total number of instructions executed inside the interval and the communication costs is calculated for all intervals [A; B] with A, B ∈ [0; LastLineOfCode] and A < B to give the arithmetic intensity. As the combined analysis is performed on the level of source code, some conditions should be met or at least kept in mind:
1) The interval shall not break basic block statements like loops, if-, and case-statements. Combining the else-block of an if-statement with the first half of a loop obviously doesn't make sense, not to mention the difficulty of an appropriate coprocessor implementation.
2) Accesses to data shall not overlap if there are no dependencies between them. Such accesses can be moved within the source code to solve the issue as they are independent from each other.
3) The accessed data shall not contain references to memory locations which are to be dereferenced, as in a linked list, for example.
4) If the interval includes a function call, the data accesses inside the function have to be elevated to the source code level the analysis is performed on, i.e., they also need to be taken into account.
5) Data that is initialized on a higher source code level than the analysis level has to be treated as initialized, i.e., as if a write access occurs prior to the interval.
Estimating the costs for each interval is of quadratic complexity as the number of intervals is (n² − n)/2. In addition, the time to profile the execution of the program as well as other parameters like the number of memory sections and the number of accesses to these sections should not be neglected.
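Written out as a formula (the symbols below are introduced here only to summarize the text and do not appear in the original), the arithmetic intensity of an interval is

    % Instr  = instructions executed inside [A;B] (from Cachegrind),
    % D_in   = data that must be copied to the coprocessor at A,
    % D_out  = data that must be copied back to the host at B.
    AI(A,B) = \frac{\mathrm{Instr}(A,B)}{D_{\mathrm{in}}(A,B) + D_{\mathrm{out}}(A,B)}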
4.2 Implementation
The concept of the combined analysis from Sec. 4.1 in conjunction with the coarse-grained software design described in Sec. 3 has been implemented as a Java program. It stores the accesses to the data sections as lists of events with each event being characterized by its access type, a read or a write, and its occurrence at a line of code. These lists are ordered by the occurrence of the events in ascending order, following the sequential execution of the program. The interval borders A and B are inserted into the lists as events of special types, keeping the lists sorted. Iterating all data sections’ lists of events, the input communication effort is calculated by adding up the data sections’ sizes if
event A is not the first element in the list, i.e., reads and/or writes occur prior to the interval, and event A's later neighbor is a read or read-write event. In the next step the event lists are traversed from event B to event A and all read events are removed from the lists until a write event occurs. Doing so ensures that B's earlier neighbor is either event A or a write event. The output communication effort is again calculated by adding up the data sections' sizes if event B is not the last element in the list, B's later neighbor is a read event, and B's earlier neighbor is not event A. The data set (A, B, ArithmeticIntensity) is written to a file for storage and for later analysis of the raw data and their graphical representation.
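The procedure just described can be sketched in C as follows; the type and function names are hypothetical (the authors' tool is written in Java), and the sketch covers a single data section only:

    #include <stddef.h>

    /* Event types; names are illustrative, not taken from the authors' tool.
     * A combined read-write access would count as a read here.              */
    typedef enum { EV_READ, EV_WRITE, EV_BORDER } ev_type_t;

    typedef struct event {
        ev_type_t     type;
        int           line;               /* line of code of the access */
        struct event *prev, *next;
    } event_t;

    typedef struct {
        event_t *head, *tail;             /* events ordered by line number */
        size_t   bytes;                   /* size of this data section     */
    } section_t;

    /* Communication cost of one data section for an interval whose border
     * events A and B have already been inserted into the ordered list.    */
    static size_t section_cost(const section_t *s, event_t *a, event_t *b)
    {
        size_t cost = 0;

        /* Input: data touched before A and first read inside the interval. */
        if (a != s->head && a->next && a->next->type == EV_READ)
            cost += s->bytes;

        /* Unlink reads directly before B until a write (or A) is reached,
         * so that B's earlier neighbor is either A or a write event.        */
        while (b->prev != a && b->prev->type == EV_READ) {
            event_t *r = b->prev;
            r->prev->next = b;
            b->prev = r->prev;
        }

        /* Output: data written inside the interval and read after B. */
        if (b != s->tail && b->next && b->next->type == EV_READ && b->prev != a)
            cost += s->bytes;

        return cost;
    }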
4.3 General Result Interpretation
Figure 1 shows three instances of a graphical representation of the costs calculated following the combined analysis introduced in Sec. 4.1. The dotted line from the lower left corner to the upper right corner in all of the three graphs represents an interval of length zero, and only the upper left part holds valid results. The functionality included in the interval increases moving from the diagonal to the upper left corner of the diagram, as indicated by a gradient filling in the left diagram. The higher the complexity the higher the chances are that the functionality may not be implementable on a coprocessor due to limited hardware resources. In general, a series of combinations leading to reasonably high costs appears as a dark rectangle in the graphical representation of the overall result. Such a hot-spot is shown in the center diagram of Fig. 1. Its width and height corresponds to the number of lines of code with few instructions to be executed before and after the interval, respectively. Thus the smallest reasonable interval, i.e., the borders A and B, is found just off the right bottom corner of the hot-spot. The intervals with smaller arithmetic intensity, or costs, are represented by lighter colors. Because all possibilities of intervals are treated in the analysis, a hot-spot may “cast shadows”, i.e., the more optimal interval is included in a larger interval, depicted in the right diagram of Fig. 1. The larger interval has costs less than the optimal interval and thus will be represented by a lighter color. Shadows appear aligned horizontally and vertically to the actual
Fig. 1. Combined analysis’s results: The coprocessor implementation complexity gradient (left), a general hot-spot (center), a hot-spot and its shadows (right)
hot-spot, usually separated by white spaces resulting from block-statements. The hot-spots give two hints: 1) They indicate which interval's functionality is most efficient to be sourced out to a coprocessor and thus promises to boost the overall performance, and 2) they suggest a tendency on how complex the coprocessor's implementation will be.
4.4 Combined Analysis of the Sample Application
Performing the analysis presented in Sec. 4.1 on the ray tracing sample application introduced in Sec. 3 gives the results illustrated in Fig. 2. The leftmost series of hot-spots corresponds to an interval with a fixed beginning at the start of the tracing function and an ending that runs through the whole function until the interval covers it completely. The costs change with each block that is newly integrated into the interval. Table 1 gives detailed descriptions of these hot-spots.
Fig. 2. Graphical representation of the results of the analysis applied to the sample application presented in Sec. 3. The cross-marked intervals have been sourced out to a coprocessor. For graphical reasons costs larger than 2000 are represented by the same color.
Table 1. Detailed descriptions of the sample application's hot-spots for the complete algorithm with A = 0 and floating B, corresponding to the leftmost vertical sequence of hot-spots, to add some functionality-based orientation to Fig. 2

#    B =        instr./com.   functionality
1    22-35      ca. 1100      evaluating the object hit functions ...
2    67-70      ca. 1300      ... sorting out the no hit rays from the intermediate results ...
3    71-74      ca. 1700      ... creating the shadow rays for ambient lighting ...
4    82-103     5600-6200     ... evaluating the shadow hit functions, performing the basic ray coloring and performing supplementary calculations ...
5    129-132    ca. 6400      ... creating the shadow rays for point lights ...
6    147        ca. 9400      ... evaluating the shadow hit functions for point lights (being a single line notation, this interval is not visible in Fig. 2) ...
7    181-183    ca. 10200     ... further coloring ...
8    194-197    ca. 10600     ... finalizing the preliminary colors ...
9    208-213    ca. 10500     ... setting up the reflected rays ...
10   260-262    ca. 10500     ... finalizing the overall results, i.e., completing the full algorithm.
As indicated in Sec. 4.3, functionality of low implementation complexity to be sourced out to a coprocessor should be found near the diagonal in the graph. A horizontally shadowed instance of block 4 from Tab. 1 appears closest to this line. The corresponding interval [A = 72...75; B = 82...85] has costs of about 5200, representing the evaluation of the shadow hit function for ambient lighting. Block 6 is also horizontally shadowed close to the diagonal, [A = 133; B = 147], doing the same as block 4 for the point lights' shadow rays at costs of about 2000. Unfortunately, this block and its shadow correspond to single intervals which remain invisible in the graphical representation due to the image resolution. Both intervals are highlighted with crosses in Fig. 2, and they both have been chosen to be sourced out to a coprocessor due to their costs. They also appear to be of manageable complexity and thus do not challenge the FPGA's resources. Additionally, both blocks represent the same functionality performed on different data. The coprocessor thus can be used at two separate points in the program, making an FPGA reconfiguration unnecessary.
5 Implementations
The following sections give details on the evaluation of an FPGA-based and a GPU-based coprocessor implementation, based on the results of the combined analysis of the sample application. Their performances are summarized in Tab. 2.
Table 2. Performance summary of the FPGA-based and the GPU-based evaluation implementations for cases a) ambient lighting and b) the point lights

                                     FPGA                       GPU
                               ambient      point        ambient      point
runtime, SW only               5,514 µs   12,525 µs      3,149 µs    5,779 µs
runtime with coprocessor       4,204 µs    8,188 µs        994 µs    1,187 µs
speedup                           1.31        1.53          3.17        3.34
communication overhead             29%         29%           72%         54%
runtime minus communic. time   2,991 µs    5,775 µs        275 µs      541 µs
speedup                           2.85        4.20         11.45       10.69

5.1 FPGA-Based Implementation
The first coprocessor implementation is targeted at a Xilinx Virtex 4 LX160 FPGA inside a Cray XD1 system. A single FPGA is mapped into the memory of an SMP node of four AMD Opteron cores running at 2.2 GHz. In the host system's programming model, data is transferred to the FPGA by copying it to the mapped memory. Device drivers and special hardware then take care of the communication. The host system and the FPGA may simultaneously initiate independent 64-bit transfers with a maximum bandwidth of roughly 1.6 GBytes/s (see [12]). The maximum communication bandwidth also depends on the FPGA's clock speed, which can be set between 67 and 199 MHz. The FPGA has four banks of memory attached to it, each bank providing a 64-bit data interface, totaling a maximum access width of 256 bits per cycle. Because most of the computations in ray tracing are based on a spatial model, three-dimensional vectors are frequently used. The geometric calculations are performed in double precision floating point numbers, and usually all three components of two vectors are used. Consequently, 384-bit accesses to buffers would be necessary. Hardware constraints prohibit such an implementation. So the coprocessor reads the same component of each of the two operand vectors simultaneously and the whole of the components sequentially, resulting in a 128-bit read access in each cycle, and a vector operation takes three cycles to issue. The coprocessor's datapath is composed of a set of heavily pipelined Xilinx CoreGen floating point cores and buffering registers. The design allows launching new vector operations every three clock cycles. Three different calculations can be executed on the datapath, implementing the three object types' shadow hit functions. The FPGA takes 162 cycles to evaluate one shadow hit function. Running at 149 MHz, and depending on the object type, the FPGA performs 1.341 to 2.334 GFlops in double precision. Parameters of the whole scene are the properties of the objects and their materials. Thus, these parameters have to be transferred to the coprocessor once. The coprocessor has been implemented in a way which allows buffering only one set of light properties. Correspondingly, three data sets are transferred prior to
each computation: the shadow rays of a tile, their number, and the current light's properties. The computation itself is started by sending a start-flag. The host system then polls a memory location for the coprocessor's finishing signal, with the results having been transferred to another memory location. The runtimes for intersecting the shadow rays with the objects have been measured for a) ambient lighting and for b) both the point lights, and for each tile and each reflection depth in both cases. Using the pure software implementation, a single tile takes an average of 5,514 microseconds in case a) and 12,525 microseconds in case b) to evaluate the first reflection depth only. Using the FPGA-based coprocessor, a) takes 4,204 microseconds and b) takes 8,188 microseconds, resulting in partial speedups of 1.31 for a) and 1.53 for b). If the communication overhead is neglected, the pure computing times are 2,991 microseconds and 5,775 microseconds for a) and b), respectively, corresponding to speedups of 1.84 and 2.17.
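As a cross-check, several of the figures above can be reproduced from the stated parameters; the expressions below are one consistent reading of the reported numbers, not additional measurements:

    % Peak issue rate: one vector operation every three cycles at 149 MHz
    149\,\mathrm{MHz} / 3 \approx 49.7 \cdot 10^{6}\ \text{vector operations/s}
    % Pipeline latency of one shadow hit evaluation
    162\ \text{cycles} / 149\,\mathrm{MHz} \approx 1.09\,\mu\mathrm{s}
    % Partial speedups and communication share (consistent with the 29% in Table 2)
    5{,}514/4{,}204 \approx 1.31, \quad (4{,}204-2{,}991)/4{,}204 \approx 29\%
    12{,}525/8{,}188 \approx 1.53, \quad (8{,}188-5{,}775)/8{,}188 \approx 29\%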
5.2 GPU-Based Implementation
The second coprocessor implementation is aimed at an Intel Core 2 Quad system (Yorkfield Q8200 at 2.33 GHz) hosting an Nvidia GTX 285 graphics processing unit. The GTX 285 offers 1 GByte of memory of 512-bit width, which can be initialized by the host system as mapped memory or as pinned memory [16]. It has 240 compute cores which are organized in ten stream processor units. The GPU is connected to the host system over a PCIe 2.0 x16 bus, enabling a theoretical transfer rate of 8 GBytes/s. Unlike with the FPGA-extended system in Sec. 5.1, communication costs with the GPU do not depend on the actual implementation of the sourced out functionality. For a comparable study, the GPU-based coprocessor implementation does not use all of its capacities, namely its huge onboard memory, and shares the tile size with the FPGA-based coprocessor. In addition, both coprocessors implement the functionality of the same part of the ray tracer, and they both work with double precision floating point numbers. However, some optimizations within these limits have been implemented: Besides the objects' properties and the materials' properties, all lights' properties are transferred to the GPU only once. With each tile, only the shadow rays and their number have to be sent to the coprocessor. Memory accesses are done through pinned memory, which results in more effective communication between both devices, reaching its typical in-field peak transfer rate of roughly 5 GBytes/s. As with the FPGA-based coprocessor, the two cases a) and b) are evaluated separately. A pure software implementation takes an average of 3,149 microseconds per tile for case a) and an average of 5,779 microseconds for case b). The GPU implementation's runtimes are 994 microseconds for a) and 1,728 microseconds for b) which, including the communication overhead, result in partial speedups of 3.17 and 3.34, respectively. Neglecting the communication effort, the runtimes for execution of the computations are 275 microseconds for case a) and 541 microseconds for case b), with corresponding speedups of 11.45 and 10.68.
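The pinned-memory transfer path mentioned above can be sketched with the CUDA runtime API as below; the tile type, buffer names, and sizes are illustrative assumptions and are not taken from the authors' code:

    #include <cuda_runtime.h>
    #include <string.h>

    /* Illustrative tile of shadow rays; the real layout is not given here. */
    typedef struct { double origin[3], direction[3]; } shadow_ray_t;

    /* Copy 'count' shadow rays to the device through a page-locked buffer.
     * In practice the tile would be built directly in the pinned buffer to
     * avoid the extra host-side copy. Returns 0 on success.                */
    int send_tile_pinned(const shadow_ray_t *src, int count, void **dev_out)
    {
        size_t        bytes  = (size_t)count * sizeof(shadow_ray_t);
        shadow_ray_t *pinned = NULL;
        void         *dev    = NULL;

        /* Page-locked (pinned) host memory enables faster DMA transfers. */
        if (cudaHostAlloc((void **)&pinned, bytes, cudaHostAllocDefault) != cudaSuccess)
            return -1;
        memcpy(pinned, src, bytes);

        if (cudaMalloc(&dev, bytes) != cudaSuccess) {
            cudaFreeHost(pinned);
            return -1;
        }
        /* Host-to-device copy of the whole tile in one bulk transfer. */
        if (cudaMemcpy(dev, pinned, bytes, cudaMemcpyHostToDevice) != cudaSuccess) {
            cudaFree(dev);
            cudaFreeHost(pinned);
            return -1;
        }
        cudaFreeHost(pinned);
        *dev_out = dev;
        return 0;
    }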
6 Conclusions
One basis of the combined analysis presented in Sec. 4 is the line of code in which instructions are executed or communications begin and end. As lines of code are written by a programmer, the concentration of functional complexity per line is highly dependent on the coding style. Further investigations on this aspect should be carried out, although the usage of lines of code is common with profiling tools like Valgrind [13]. Both coprocessors lead to a moderate speedup and thus show the generality of our approach with respect to memory-mapped coprocessors. But future work should be done in refining the counting of instructions. A first measure would be to list the contributions of different instruction types to the computing effort. Doing so also promises to give hints on which kind of coprocessor is best applied to which part of the program, e.g., sections with a high floating point load should be sourced out to a GPU rather than an FPGA. A second measure would focus on a purely static analysis of a program by, e.g., analyzing the compiler's assembler code. Doing so would also give a view on compute times independent of the input data, which can, e.g., determine the number of loop iterations. Of course this would result in more leveled costs and thus less obvious design choices. Although the two coprocessors and their respective host systems cannot be compared in absolute values to each other because of their different hardware generations, the two speedup values can. As expected, the GPU coprocessor can handle the many double precision floating point operations much better than the FPGA implementation. However, the GPU did not outperform the FPGA as clearly as expected. In both cases, communication still is a significant penalty. These findings support further efforts on integrating one or several accelerators and standard CPUs into a single piece of hardware. A recent step in this direction is Convey's HC1 [17]. Its host CPU shares the system memory with several FPGAs, which promises to reduce the communication penalty. Our approach allows parts of a program to be methodically selected for coprocessor-based acceleration, which enables a faster design process, as otherwise an expert would have to study the initial code. Though complexity is an issue with very large programs, our approach could surely be automated to cover mid-sized problems and to interact with other tools like Trident [3] to generate an HDL description of an FPGA-based coprocessor. At this point, however, the developer should still be able to have a final say on the results.
References 1. Harris, M.: Mapping Computational Concepts to GPUs. In: Pharr, M. (ed.) GPU Gems 2, ch. 31, Addison-Wesley Longman, Amsterdam (2005) 2. Palmer, J.: The Intel 8087 numeric data processor. In: ISCA 1980: Proceedings of the 7th annual symposium on Computer Architecture, La Baule, USA, pp. 174–181 (1980), http://doi.acm.org/10.1145/800053.801923 3. Tripp, J.L., Gokhale, M.B., Peterson, K.D.: Trident: From High-Level Language to Hardware Circuitry. Computer 40(3), 28–37 (2007), http://dx.doi.org/10.1109/MC.2007.107
4. Han, T.D., Abdelrahman, T.S.: hiCUDA: High-Level GPGPU Programming. IEEE Transactions on Parallel and Distributed Systems (March 31, 2010), http://doi.ieeecomputersociety.org/10.1109/TPDS.2010.62 5. Weber, R., Gothandaraman, A., Hinde, R.J., Peterson, G.D.: Comparing Hardware Accelerators in Scientific Applications: A Case Study. IEEE Transactions on Parallel and Distributed Systems (June 02, 2010), http://doi.ieeecomputersociety.org/10.1109/TPDS.2010.125 6. Park, S.J., Ross, J., Shires, D., Richie, D., Henz, B., Nguyen, L.: Hybrid Core Acceleration of UWB SIRE Radar Signal Processing. IEEE Transactions on Parallel and Distributed Systems (May 27, 2010), http://doi.ieeecomputersociety.org/10.1109/TPDS.2010.117 7. Park, I.K., Singhal, N., Lee, M.H., Cho, S., Kim, C.: Design and Performance Evaluation of Image Processing Algorithms on GPUs. IEEE Transactions on Parallel and Distributed Systems (May 27, 2010), http://doi.ieeecomputersociety.org/10.1109/TPDS.2010.115 8. Ryoo, S., Rodrigues, C.I., Stone, S.S., Baghsorkhi, S.S., Ueng, S.-Z., Stratton, J.A., Hwu, W.W.: Program optimization space pruning for a multithreaded gpu. In: CGO 2008: Proceedings of the 6th Annual IEEE/ACM International Symposium on Code Generation and Optimization, Boston, MA, USA, pp. 195–204 (2008), http://doi.acm.org/10.1145/1356058.1356084 9. Ryoo, S., Rodrigues, C.I., Baghsorkhi, S.S., Stone, S.S., Kirk, D.B., Hwu, W.W.: Optimization principles and application performance evaluation of a multithreaded GPU using CUDA. In: PPoPP 2008: Proceedings of the 13th ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming, Salt Lake City, UT, USA, pp. 73–83 (2008), http://doi.acm.org/10.1145/1345206.1345220 10. Suffern, K.G.: Ray Tracing from the Ground up, A K Peters Ltd (2007) 11. Sobe, P., Hampel, V.: FPGA-Accelerated Deletion-Tolerant Coding for Reliable Distributed Storage. In: Lukowicz, P., Thiele, L., Tr¨ oster, G. (eds.) ARCS 2007. LNCS, vol. 4415, pp. 14–27. Springer, Heidelberg (2007), http://dx.doi.org/10.1007/978-3-540-71270-1_2 12. Cray Inc.: Cray XD1 FPGA Development. Release 1.4 (2006) 13. Valgrind Developers: Valgrind User Manual. Release 3.5.0 (August 19, 2009) 14. Munshi, A. (ed.): The OpenCL-Specification. Version 1.1 (June 11, 2010) 15. Nvidia Corp.: NVIDIA CUDA C Programming Guide. Version 3.2 (September 8, 2010) 16. Nvidia Corp.: NVIDIA OpenCL Best Practices Guide. Version 2.3 (August 31, 2009) 17. Brewer, T.M.: Hybrid-core Computing: Punching through the power/performance wall. Scientific Computing, November/December (2009), http://www.conveycomputer.com/Resources/ScientificComputing62629.pdf
Scalability Evaluation of a Polymorphic Register File: A CG Case Study

Cătălin B. Ciobanu (1), Xavier Martorell (2,3), Georgi K. Kuzmanov (1), Alex Ramirez (2,3), and Georgi N. Gaydadjiev (1)

(1) Computer Engineering Laboratory, Electrical Engineering Department, Delft University of Technology, The Netherlands {c.b.ciobanu,g.k.kuzmanov,g.n.gaydadjiev}@tudelft.nl
(2) Universitat Politècnica de Catalunya, Spain
(3) Barcelona Supercomputing Center {xavier.martorell,alex.ramirez}@bsc.es
Abstract. We evaluate the scalability of a Polymorphic Register File using the Conjugate Gradient method as a case study. We focus on a heterogeneous multi-processor architecture, taking into consideration critical parameters such as cache bandwidth and memory latency. We compare the performance of 256 Polymorphic Register File-augmented workers against a single Cell PowerPC Processor Unit (PPU). In such a scenario, simulation results suggest that for the Sparse Matrix Vector Multiplication kernel, absolute speedups of up to 200 times can be obtained. Moreover, when an equal number of workers in the range 1-256 is employed, our design is between 1.7 and 4.2 times faster than a Cell PPU-based system. Furthermore, we study the impact of memory latency and cache bandwidth on the sustainable speedups of the system considered. Our tests suggest that a 128 worker configuration requires the caches to deliver 1638.4 GB/sec in order to preserve 80% of its peak speedup.
1 Introduction
Recent generations of processor designs have reached a point where just increasing the clock frequency in order to gain performance is no longer feasible because of power and thermal constraints. As more transistors are available in each generation of CMOS technology, designers have followed two trends in order to improve performance: the specialization of the cores, targeting improved performance in certain classes of applications, and the use of Chip Multi-Processor (CMP) designs in order to extract more performance in multi-threaded applications. Examples of specialized extensions include Single Instruction Multiple Data (SIMD) extensions such as Altivec [9], which are designed to exploit the available Data Level Parallelism, as well as the hardware support for the Advanced Encryption Standard [8], which provides improved performance for data encryption. A typical example of a heterogeneous CMP architecture is the Cell
processor [12]. This shift in processor architectures demands new programming paradigms and has a significant impact on how programs have to be optimized in order to maximize performance. Engineers have to consider both single-threaded performance and multi-processor scalability. In our previous work we have proposed a Polymorphic Register File (PRF) [4], which provides an easier programming model targeting high performance vector processing. More specifically, in this paper we investigate the scalability of such a PRF-augmented vector accelerator when integrated in a multi-processor system. The study focuses on the achievable performance with respect to the number of processors when employed in a complex computational problem, namely the Conjugate Gradient (CG) method. CG is one of the most commonly used iterative methods for solving systems of linear equations [19]. The iterative nature of CG makes it a good option for solving sparse systems that are too large to be handled by direct methods. CG scalability is critical, as it determines the maximum problem size which can be processed within a reasonable execution time. Previous studies have shown that 1D and 2D vector architectures can significantly accelerate the execution of this application - more than 10 times compared to a scalar processor [4]. In this work, we analyze the performance of such accelerators in a heterogeneous multicore processor with specialized workers - the SARC architecture [16]. Moreover, we consider critical parameters such as the available memory bandwidth and the memory latency. More specifically, the main contributions of this paper are the following:
– Performance evaluation of the Sparse Matrix Vector Multiplication (SMVM) kernel, comparing a vector processor using a Polymorphic Vector Register File implementation to the Cell BE and the PowerXCell 8i [10]. The Polymorphic vector register file system achieved speedups of up to 8 times compared to the Cell PowerPC Processor Unit (PPU);
– Scalability analysis of the SMVM kernel: simulation results suggest that a system comprising 256 PRF accelerators can reach absolute speedups of up to 200 times compared to a single Cell PPU worker. The system scales almost linearly for up to 16 workers, and more than 50% of the single core relative speedup is preserved when using up to 128 PRF cores;
– Evaluation of the impact of memory latency and shared cache bandwidth on the sustainable performance of the SMVM kernel. We consider scenarios of up to 128 PRF workers and target at least 80% of their theoretical peak speedups. The memory latency simulations indicate that the system can tolerate latencies up to 64 cycles to sustain that performance. The cache tests suggest that such a configuration requires a bandwidth of 1638.4 GB/sec.
The rest of the paper is organized as follows: Section 2 provides the background information on the competitive architectures we have selected and the target application, and describes related work. The case study scenario is presented in Section 3. Simulation data along with their analysis are presented in Section 4. Finally, the paper is concluded in Section 5.
2 Background and Related Work
A Polymorphic Register File (PRF) is a parameterizable register file [4], which can be logically reorganized by the programmer or by the runtime system to support multiple register dimensions and sizes simultaneously. Figure 1 shows an example of a two-dimensional PRF, assuming that the physical register file is 128 by 128 elements. The physical register storage space is allocated to a number of 1D and 2D logical vector registers, while the remaining space is available for defining more logical registers. The benefits of this architecture are:
1. Potential performance gain by increasing the number of elements processed with a single instruction, due to multi-axis vectorization;
2. A more efficient utilization of the register file storage, eliminating the potential storage waste of fixed register size organizations;
3. A variable number of registers which can be defined in order to arbitrarily partition the available physical register space;
4. Reduced static code size, as the target algorithm may be expressed with higher level instructions.
The same binary instructions may be used regardless of the shape, dimensions or data type of the operands. The compatibility of the operands is checked by the microarchitecture. The logical registers are defined by adding a second register bank to the architecture - the Register File Organization (RFORG) Special Purpose Registers (SPR). For each logical register, it is required to specify the coordinates: the location of the upper left corner (Base), the horizontal and vertical dimensions (Horizontal Length and Vertical Length), as well as the data type, using a 3-bit field supporting 32/64-bit floating point or 8/16/32/64-bit integer values. More details on the organization of the PRF can be found in [4].
Fig. 1. The Polymorphic Register File
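The RFORG description above can be pictured with a small C struct; the field names and widths are illustrative assumptions based on the text and do not reflect the actual SPR encoding:

    #include <stdint.h>

    /* Supported element types of a logical register (a 3-bit field). */
    typedef enum {
        DT_FLOAT32, DT_FLOAT64,
        DT_INT8, DT_INT16, DT_INT32, DT_INT64
    } prf_dtype_t;

    /* One RFORG entry describing a logical register placed inside the
     * 128 x 128 physical register file.                                */
    typedef struct {
        uint8_t     base_row;   /* upper-left corner, row index        */
        uint8_t     base_col;   /* upper-left corner, column index     */
        uint8_t     hlength;    /* horizontal length (columns)         */
        uint8_t     vlength;    /* vertical length (rows)              */
        prf_dtype_t dtype;      /* element data type                   */
    } rforg_entry_t;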
The Conjugate Gradient Method is one of the most important methods used for solving a system of linear equations, with the restriction that its matrix is symmetric and positive definite [19]. The iterative nature of the algorithm makes it suitable for solving very large sparse systems for which applying a direct method is not feasible. The CG version we have used is part of the NAS Parallel Benchmarks [1]. By profiling the code we have found that the main computational kernel is the double precision Sparse Matrix - Dense Vector Multiplication (SMVM), which accounts for 87.32% of the total execution time in the scalar version of CG. The Compressed Sparse Row (CSR) format is used to store the sparse matrices. The following pseudo code sequence presents the SMVM kernel, where a is a one-dimensional array storing all the non-zero elements of the sparse matrix, p is the dense vector and w stores the result of the multiplication. colidx and rowstr contain the extra information required by the CSR format. for (j = 1; j