Lecture Notes in Computer Science Commenced Publication in 1973 Founding and Former Series Editors: Gerhard Goos, Juris Hartmanis, and Jan van Leeuwen
Editorial Board David Hutchison Lancaster University, UK Takeo Kanade Carnegie Mellon University, Pittsburgh, PA, USA Josef Kittler University of Surrey, Guildford, UK Jon M. Kleinberg Cornell University, Ithaca, NY, USA Friedemann Mattern ETH Zurich, Switzerland John C. Mitchell Stanford University, CA, USA Moni Naor Weizmann Institute of Science, Rehovot, Israel Oscar Nierstrasz University of Bern, Switzerland C. Pandu Rangan Indian Institute of Technology, Madras, India Bernhard Steffen University of Dortmund, Germany Madhu Sudan Massachusetts Institute of Technology, MA, USA Demetri Terzopoulos University of California, Los Angeles, CA, USA Doug Tygar University of California, Berkeley, CA, USA Moshe Y. Vardi Rice University, Houston, TX, USA Gerhard Weikum Max-Planck Institute of Computer Science, Saarbruecken, Germany
4644
Nadine Azemard Lars Svensson (Eds.)
Integrated Circuit and System Design Power and Timing Modeling, Optimization and Simulation 17th International Workshop, PATMOS 2007 Gothenburg, Sweden, September 3-5, 2007 Proceedings
Volume Editors Nadine Azemard LIRMM, UMR CNRS/Université de Montpellier II 161 rue Ada, 34392, Montpellier, France E-mail:
[email protected] Lars Svensson Chalmers University of Technology Department of Computer Engineering 412 96 Göteborg, Sweden E-mail:
[email protected]
Library of Congress Control Number: 2007933304
CR Subject Classification (1998): B.7, B.8, C.1, C.4, B.2, B.6, J.6
LNCS Sublibrary: SL 1 – Theoretical Computer Science and General Issues
ISSN 0302-9743
ISBN-10 3-540-74441-X Springer Berlin Heidelberg New York
ISBN-13 978-3-540-74441-2 Springer Berlin Heidelberg New York
This work is subject to copyright. All rights are reserved, whether the whole or part of the material is concerned, specifically the rights of translation, reprinting, re-use of illustrations, recitation, broadcasting, reproduction on microfilms or in any other way, and storage in data banks. Duplication of this publication or parts thereof is permitted only under the provisions of the German Copyright Law of September 9, 1965, in its current version, and permission for use must always be obtained from Springer. Violations are liable to prosecution under the German Copyright Law. Springer is a part of Springer Science+Business Media springer.com © Springer-Verlag Berlin Heidelberg 2007 Printed in Germany Typesetting: Camera-ready by author, data conversion by Scientific Publishing Services, Chennai, India Printed on acid-free paper SPIN: 12111398 06/3180 543210
Preface
Welcome to the proceedings of PATMOS 2007, the 17th in a series of international workshops. PATMOS 2007 was organized by Chalmers University of Technology, with technical co-sponsorship from the IEEE Sweden Chapter of the Solid-State Circuits Society and sponsorship from IEEE CEDA. Over the years, PATMOS has evolved into an important European event, where researchers from both industry and academia discuss and investigate the emerging challenges in future and contemporary applications, design methodologies, and tools required for the development of the upcoming generations of integrated circuits and systems. The technical program of PATMOS 2007 consisted of state-of-the-art technical contributions, three invited talks, and an industrial session on design challenges in real-life projects. The technical program focused on timing, performance, and power consumption, as well as architectural aspects, with particular emphasis on modeling, design, characterization, analysis, and optimization in the nanometer era. The Technical Program Committee, with the assistance of additional expert reviewers, selected the 55 papers presented at PATMOS. The papers were organized into 9 technical sessions and 3 poster sessions. As is always the case with the PATMOS workshops, full papers were required, and several reviews were received per manuscript. Beyond the presentations of the papers, the PATMOS technical program was enriched by a series of speeches offered by world-class experts on important emerging research issues of industrial relevance. Jean Michel Daga spoke about “Design and Industrialization Challenges of Memory Dominated SOCs”, Davide Pandini spoke about “Statistical Static Timing Analysis: A New Approach to Deal with Increased Process Variability in Advanced Nanometer Technologies”, and Christer Svensson spoke about “Analog Power Modelling”. Furthermore, the technical program was augmented by two industrial talks, given by leading experts from industry.
Fredrik Dahlgren, from Ericsson Mobile Platforms, spoke about “Technological Trends, Design Constraints and Design Implementation Challenges in Mobile Phone Platforms”, and Anders Emrich, from Omnisys Instruments AB, spoke about “System Design from Instrument Level down to ASIC Transistors with Speed and Low Power as Driving Parameters”. We would like to thank the many people who worked voluntarily to make PATMOS 2007 possible: the expert reviewers, the members of the technical program and steering committees, and the invited speakers, who offered their skill, time, and deep knowledge to make PATMOS 2007 a memorable event. Last but not least, we would like to thank the sponsors of PATMOS 2007, Ericsson, Omnisys, Chalmers University, and the city of Göteborg, for their support. September 2007
Nadine Azemard Lars Svensson
Organization
Organizing Committee
General Chair: Lars Svensson, Chalmers University, Sweden
Technical Program Chair: Nadine Azemard, LIRMM, France
Secretariat: Ewa Wäingelin, Chalmers University, Sweden
Proceedings: Nadine Azemard, LIRMM, France
PATMOS Technical Program Committee
A. Alvandpour, Linköping Univ., Sweden
D. Atienza, EPFL, Switzerland
N. Azemard, LIRMM, France
P. A. Beerel, Univ. of Southern California, USA
D. Bertozzi, Univ. of Ferrara, Italy
N. Chang, Seoul National Univ., Korea
J. J. Chico, Univ. de Sevilla, Spain
J. Figueras, Univ. de Catalunya, Spain
E. Friedman, Univ. of Rochester, USA
C. E. Goutis, Univ. of Patras, Greece
E. Grass, IHP, Germany
J. L. Güntzel, Univ. Fed. de Pelotas, Brazil
R. Hartenstein, Univ. of Kaiserslautern, Germany
N. Julien, LESTER, France
K. Karagianni, Univ. of Patras, Greece
P. Marchal, IMEC, Belgium
P. Maurine, LIRMM, France
V. Moshnyaga, Univ. of Fukuoka, Japan
W. Nebel, Univ. of Oldenburg, Germany
D. Nikolos, Univ. of Patras, Greece
A. Nunez, Univ. de Las Palmas, Spain
V. Paliuras, Univ. of Patras, Greece
D. Pandini, ST Microelectronics, Italy
F. Pessolano, Philips, The Netherlands
H. Pfleiderer, Univ. of Ulm, Germany
C. Piguet, CSEM, Switzerland
M. Poncino, Politecnico di Torino, Italy
R. Reis, Univ. of Porto Alegre, Brazil
M. Robert, Univ. of Montpellier, France
J. Rossello, Balearic Islands Univ., Spain
D. Sciuto, Politecnico di Milano, Italy
J. Segura, Balearic Islands Univ., Spain
D. Soudris, Univ. of Thrace, Greece
L. Svensson, Chalmers Univ. of Technology, Sweden
A. M. Trullemans, Univ. LLN, Belgium
D. Verkest, IMEC, Belgium
R. Wilson, ST Microelectronics, France
PATMOS Steering Committee
Antonio J. Acosta, University of Sevilla/IMSE-CNM, Spain
Nadine Azemard, LIRMM - University of Montpellier, France
Joan Figueras, Universitat Politècnica de Catalunya, Spain
Reiner Hartenstein, University of Kaiserslautern, Germany
Jorge Juan-Chico, University of Sevilla/IMSE-CNM, Spain
Enrico Macii, Politecnico di Torino (POLITO), Italy
Philippe Maurine, LIRMM - University of Montpellier, France
Wolfgang Nebel, OFFIS, Germany
Vassilis Paliouras, University of Patras, Greece
Christian Piguet, CSEM, Switzerland
Dimitrios Soudris, Democritus University of Thrace (DUTH), Greece
Lars Svensson, Chalmers University of Technology, Sweden
Anne-Marie Trullemans, Université Catholique de Louvain (UCL), Belgium
Diederik Verkest, IMEC, Belgium
Roberto Zafalon, ST Microelectronics, Italy
Executive Steering Sub-committee
President: Enrico Macii, Politecnico di Torino (POLITO), Italy
Vice-president: Vassilis Paliouras, University of Patras, Greece
Secretary: Nadine Azemard, LIRMM - University of Montpellier, France
Table of Contents
Session 1 - High-Level Design (1) System-Level Application-Specific NoC Design for Network and Multimedia Applications . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Lazaros Papadopoulos and Dimitrios Soudris
1
Fast and Accurate Embedded Systems Energy Characterization Using Non-intrusive Measurements . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Nicolas Fournel, Antoine Fraboulet, and Paul Feautrier
10
A Flexible General-Purpose Parallelizing Architecture for Nested Loops in Reconfigurable Platforms . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Ioannis Panagopoulos, Christos Pavlatos, George Manis, and George Papakonstantinou An Automatic Design Flow for Mapping Application onto a 2D Mesh NoC Architecture . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Julien Delorme
20
31
Session 2 - Low Power Design Techniques Template Vertical Dictionary-Based Program Compression Scheme on the TTA . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Lai Mingche, Wang Zhiying, Guo JianJun, Dai Kui, and Shen Li Asynchronous Functional Coupling for Low Power Sensor Network Processors . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Delong Shang, Chihoon Shin, Ping Wang, Fei Xia, Albert Koelmans, Myeonghoon Oh, Seongwoon Kim, and Alex Yakovlev
43
53
A Heuristic for Reducing Dynamic Power Dissipation in Clocked Sequential Designs . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Noureddine Chabini
64
Low-Power Content Addressable Memory With Read/Write and Matched Mask Ports . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Saleh Abdel-Hafeez, Shadi M. Harb, and William R. Eisenstadt
75
The Design and Implementation of a Power Efficient Embedded SRAM . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Yijun Liu, Pinghua Chen, Wenyan Wang, and Zhenkun Li
86
Session 3 - Low Power Analog Circuits Design of a Linear Power Amplifier with ±1.5V Power Supply Using ALADIN . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Björn Lipka and Ulrich Kleine
97
Settling Time Minimization of Operational Amplifiers . . . . . . . . . . . . . . . . Andrea Pugliese, Gregorio Cappuccino, and Giuseppe Cocorullo
107
Low-Voltage Low-Power Curvature-Corrected Voltage Reference Circuit Using DTMOSTs . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Cosmin Popa
117
Session 4 - Statistical Static Timing Analysis Computation of Joint Timing Yield of Sequential Networks Considering Process Variations . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Amit Goel, Sarvesh Bhardwaj, Praveen Ghanta, and Sarma Vrudhula A Simple Statistical Timing Analysis Flow and Its Application to Timing Margin Evaluation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . V. Migairou, R. Wilson, S. Engels, Z. Wu, N. Azemard, and P. Maurine A Statistical Approach to the Timing-Yield Optimization of Pipeline Circuits . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Chin-Hsiung Hsu, Szu-Jui Chou, Jie-Hong R. Jiang, and Yao-Wen Chang
125
138
148
Session 5 - Power Modeling and Optimization A Novel Gate-Level NBTI Delay Degradation Model with Stacking Effect . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Hong Luo, Yu Wang, Ku He, Rong Luo, Huazhong Yang, and Yuan Xie Modelling the Impact of High Level Leakage Optimization Techniques on the Delay of RT-Components . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Marko Hoyer, Domenik Helms, and Wolfgang Nebel Logic Style Comparison for Ultra Low Power Operation in 65nm Technology . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Mandeep Singh, Christophe Giacomotto, Bart Zeydel, and Vojin Oklobdzija Design-In Reliability for 90-65nm CMOS Nodes Submitted to Hot-Carriers and NBTI Degradation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . CR. Parthasarathy, A. Bravaix, C. Guérin, M. Denais, and V. Huard
160
171
181
191
Session 6 - Low Power Routing Optimization Clock Distribution Techniques for Low-EMI Design . . . . . . . . . . . . . . . . . . Davide Pandini, Guido A. Repetto, and Vincenzo Sinisi
201
Crosstalk Waveform Modeling Using Wave Fitting . . . . . . . . . . . . . . . . . . . Mini Nanua and David Blaauw
211
Weakness Identification for Effective Repair of Power Distribution Network . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Takashi Sato, Shiho Hagiwara, Takumi Uezono, and Kazuya Masu
222
New Adaptive Encoding Schemes for Switching Activity Balancing in On-Chip Buses . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . P. Sithambaram, A. Macii, and E. Macii
232
On the Necessity of Combining Coding with Spacing and Shielding for Improving Performance and Power in Very Deep Sub-micron Interconnects . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . T. Murgan, P.B. Bacinschi, S. Pandey, A. García Ortiz, and M. Glesner
242
Session 7 - High Level Design (2) Soft Error-Aware Power Optimization Using Gate Sizing . . . . . . . . . . . . . . Foad Dabiri, Ani Nahapetian, Miodrag Potkonjak, and Majid Sarrafzadeh Automated Instruction Set Characterization and Power Profile Driven Software Optimization for Mobile Devices . . . . . . . . . . . . . . . . . . . . . . . . . . . Matthias Grumer, Manuel Wendt, Christian Steger, Reinhold Weiss, Ulrich Neffe, and Andreas Mühlberger
255
268
RTL Power Modeling and Estimation of Sleep Transistor Based Power Gating . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Sven Rosinger, Domenik Helms, and Wolfgang Nebel
278
Functional Verification of Low Power Designs at RTL . . . . . . . . . . . . . . . . . Allan Crone and Gabriel Chidolue
288
XEEMU: An Improved XScale Power Simulator . . . . . . . . . . . . . . . . . . . . . Zoltán Herczeg, Ákos Kiss, Daniel Schmidt, Norbert Wehn, and Tibor Gyimóthy
300
Session 8 - Security and Asynchronous Design Low Power Elliptic Curve Cryptography . . . . . . . . . . . . . . . . . . . . . . . . . . . . Maurice Keller and William Marnane
310
Design and Test of Self-checking Asynchronous Control Circuit . . . . . . . . Jian Ruan, Zhiying Wang, Kui Dai, and Yong Li
320
An Automatic Design Flow for Implementation of Side Channel Attacks Resistant Crypto-Chips . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Behnam Ghavami and Hossein Pedram
330
Analysis and Improvement of Dual Rail Logic as a Countermeasure Against DPA . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . A. Razafindraibe, M. Robert, and P. Maurine
340
Session 9 - Low Power Applications Performance Optimization of Embedded Applications in a Hybrid Reconfigurable Platform . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Michalis D. Galanis, Gregory Dimitroulakos, and Costas E. Goutis
352
The Energy Scalability of Wavelet-Based, Scalable Video Decoding . . . . . Hendrik Eeckhaut, Harald Devos, and Dirk Stroobandt
363
Direct Memory Access Optimization in Wireless Terminals for Reduced Memory Latency and Energy Consumption . . . . . . . . . . . . . . . . . . . . . . . . . . Miguel Peon-Quiros, Alexandros Bartzas, Stylianos Mamagkakis, Francky Catthoor, Jose M. Mendias, and Dimitrios Soudris
373
Poster 1 - Modeling and Optimization Exploiting Input Variations for Energy Reduction . . . . . . . . . . . . . . . . . . . . Toshinori Sato and Yuji Kunitake
384
A Model of DPA Syndrome and Its Application to the Identification of Leaking Gates . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . A. Razafindraibe and P. Maurine
394
Static Power Consumption in CMOS Gates Using Independent Bodies . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . D. Guerrero, A. Millan, J. Juan, M.J. Bellido, P. Ruiz-de-Clavijo, E. Ostua, and J. Viejo Moderate Inversion: Highlights for Low Voltage Design . . . . . . . . . . . . . . . Fabrice Guigues, Edith Kussener, Benjamin Duval, and Hervé Barthelemy On Two-Pronged Power-Aware Voltage Scheduling for Multi-processor Real-Time Systems . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Naotake Kamiura, Teijiro Isokawa, and Nobuyuki Matsui
404
413
423
Semi Custom Design: A Case Study on SIMD Shufflers . . . . . . . . . . . . . . . Praveen Raghavan, Nandhavel Sethubalasubramanian, Satyakiran Munaga, Estela Rey Ramos, Murali Jayapala, Oliver Weiss, Francky Catthoor, and Diederik Verkest
433
Poster 2 - High Level Design Optimization for Real-Time Systems with Non-convex Power Versus Speed Models . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Ani Nahapetian, Foad Dabiri, Miodrag Potkonjak, and Majid Sarrafzadeh Triple-Threshold Static Power Minimization in High-Level Synthesis of VLSI CMOS . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Harry I.A. Chen, Edward K.W. Loo, James B. Kuo, and Marek J. Syrzycki A Fast and Accurate Power Estimation Methodology for QDI Asynchronous Circuits . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Behnam Ghavami, Mahtab Niknahad, Mehrdad Najibi, and Hossein Pedram
443
453
463
Subthreshold Leakage Modeling and Estimation of General CMOS Complex Gates . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Paulo F. Butzen, André I. Reis, Chris H. Kim, and Renato P. Ribas
474
A Platform for Mixed HW/SW Algorithm Specifications for the Exploration of SW and HW Partitioning . . . . . . . . . . . . . . . . . . . . . . . . . . . . Christophe Lucarz and Marco Mattavelli
485
Fast Calculation of Permissible Slowdown Factors for Hard Real-Time Systems . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Henrik Lipskoch, Karsten Albers, and Frank Slomka
495
Design Methodology and Software Tool for Estimation of Multi-level Instruction Cache Memory Miss Rate . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . N. Kroupis and D. Soudris
505
Poster 3 - Low Power Techniques and Applications A Statistical Model of Logic Gates for Monte Carlo Simulation Including On-Chip Variations . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Francesco Centurelli, Luca Giancane, Mauro Olivieri, Giuseppe Scotti, and Alessandro Trifiletti Switching Activity Reduction of MAC-Based FIR Filters with Correlated Input Data . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Oscar Gustafson, Saeeid Tahmasbi Oskuii, Kenny Johansson, and Per Gunnar Kjeldsberg
516
526
Performance of CMOS and Floating-Gate Full-Adders Circuits at Subthreshold Power Supply . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Jon Alfredsson and Snorre Aunet
536
Low-Power Digital Filtering Based on the Logarithmic Number System . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . C. Basetas, I. Kouretas, and V. Paliouras
546
A Power Supply Selector for Energy- and Area-Efficient Local Dynamic Voltage Scaling . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Sylvain Miermont, Pascal Vivet, and Marc Renaudin
556
Dependability Evaluation of Time-Redundancy Techniques in Integer Multipliers . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Henrik Eriksson
566
Keynotes Design and Industrialization Challenges of Memory Dominated SOCs . . . J.M. Daga
576
Statistical Static Timing Analysis: A New Approach to Deal with Increased Process Variability in Advanced Nanometer Technologies . . . . . D. Pandini
577
Analog Power Modelling . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . C. Svensson
578
Industrial Session - Design Challenges in Real-Life Projects Technological Trends, Design Constraints and Design Implementation Challenges in Mobile Phone Platforms . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . F. Dahlgren
579
System Design from Instrument Level Down to ASIC Transistors with Speed and Low Power as Driving Parameters . . . . . . . . . . . . . . . . . . . . . . . . A. Emrich
580
Author Index . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
581
System-Level Application-Specific NoC Design for Network and Multimedia Applications Lazaros Papadopoulos and Dimitrios Soudris VLSI and Testing Center, Dept. of Electrical and Computer Engineering, Democritus University of Thrace, 67100, Xanthi, Greece {lpapadop,dsoudris}@ee.duth.gr
Abstract. Nowadays, embedded consumer devices execute complex network and multimedia applications that require high performance and low energy consumption. Implementing such applications on Networks-on-Chip (NoCs) calls for a design methodology that performs exploration at the NoC system level in order to select the optimal application-specific NoC architecture. The design methodology we present in this paper is based on the exploration of different NoC characteristics and is supported by a flexible NoC simulator which provides the evaluation metrics essential for selecting the optimal communication parameters of a NoC architecture. We show that, using the evaluation metrics provided by the simulator, it is possible to explore several NoC aspects and to select the optimal communication characteristics for NoC platforms implementing network and multimedia applications.
1 Introduction In recent years, network and multimedia applications have come to be implemented on embedded consumer devices. Modern portable devices (e.g., PDAs and mobile phones) implement complex network protocols in order to access the internet and communicate with each other. Moreover, embedded systems implement text, speech, and video processing multimedia applications, such as MPEG and 3D video games, which are growing rapidly in variety, functionality, and complexity. Both areas are characterized by a rapidly increasing demand for computational power to process complex algorithms. Implementing these applications on portable devices is a difficult task, due to the limited resources and stringent power constraints of such systems. Single-processor systems cannot provide the computational power required by complex applications with Task-Level Parallelism (TLP) and hard real-time constraints (e.g., the frame rate in MPEG). To face these problems, multiprocessor systems are used to provide the computational concurrency required to handle concurrent events in real time. Technology scaling enables the integration of billions of transistors on a chip. Current MPSoC platforms usually contain bus-based interconnection infrastructures. Bus-based structures suffer from limited scalability, poor performance for large systems, and high energy consumption. The computational power along with
energy efficiency that modern applications require cannot be provided by shared-bus interconnections. During the last five years, the Network-on-Chip (NoC) has been proposed as the new SoC design paradigm. Instead of using bus structures or dedicated wires, a NoC is composed of an on-chip, packet-switched network, which provides more predictability and better scalability than bus communication schemes [1]. Satisfactory design quality for modern complex applications implemented on NoCs is possible only when both computation and communication refinement are performed. Current NoC simulators focus mainly on communication aspects at a high abstraction level; however, they are not flexible enough to achieve satisfactory communication refinement. Additionally, they do not provide enough evaluation metrics to enable the designer to choose the optimal characteristics of the NoC architecture. In this paper we present a NoC design methodology for system-level exploration of NoC parameters. The methodology is based on a new NoC simulator which extends the Nostrum NoC Simulation Environment (NNSE) [2]. Thanks to the flexibility of this tool, the designer can easily explore many NoC characteristics, such as topology, packet size, and routing algorithm, and choose the application-specific NoC architecture that is optimal in terms of performance, communication energy, and link utilization. In this work, we chose to explore different NoC topologies for specific network and multimedia applications. The remainder of the paper is organized as follows. In Section 2, we describe related work. In Section 3, we analyze the design methodology and the NoC simulator that supports it. In Section 4, our benchmarks are introduced and the experimental results are presented. Finally, in Section 5 we draw our conclusions.
2 Related Work NoC as a scalable communication architecture is described in [1]. The existing NoC research is surveyed in [3], which shows that NoCs constitute a unification of current trends in intra-chip communication. Many NoC design methodologies focus on application-specific NoC design: the NoC architecture is customized for each application in order to achieve optimal performance and energy trade-offs. For instance, linear-programming-based techniques for the synthesis of custom NoC architectures are presented in [4]. The work in [5] describes a tool for the automatic selection of the optimal topology and core mapping for a given application. Other design methodologies are presented in [6], [7], and [8]. The design approach described in this work is complementary and can be used along with the aforementioned methodologies, since the flexible NoC simulator we present is useful for exploring and evaluating NoC characteristics such as topology, mapping, and routing. From a different perspective, several NoC implementations have been proposed. For instance, the SPIN network [9] implements a fat-tree topology using wormhole routing. The CHAIN NoC [10] is implemented using asynchronous circuit techniques. MANGO [11] is a clockless NoC which provides both best-effort and guaranteed services. However, these approaches are not flexible enough, since they limit the design choices. Also, they do not focus on application-specific NoC design, and therefore they are not suitable for exploring NoC parameters
according to the characteristics of the application under study. Thus, a flexible system-level simulator is needed for exploring interconnection characteristics. The Nostrum NoC simulator (NNSE) [2] focuses on grid-based, router-driven communication media for on-chip communication. As we show in the next section, the new tool that supports our methodology builds on Nostrum, adding new features and allowing the easy exploration of several NoC parameters.
3 NoC Design Methodology and Simulator Description In this section we analyze the application-specific NoC design methodology and describe the NoC simulator we developed to support it. 3.1 NoC Design Methodology The NoC design methodology we present is based on the exploration of NoC characteristics at the system level and the selection of the optimal architecture that meets the design constraints.
Fig. 1. Application-specific NoC design methodology
Figure 1 presents the NoC design methodology. The input of the methodology is the communication task graph (CTG) of the application, and the output is an optimal application-specific NoC architecture which meets the design constraints. The first step of the design process is the partitioning of the application into a number of distinct tasks and the extraction of the CTG. The parallelization of the application can be done with several methods, such as the one described in [13]. The second step of the methodology is the construction of the system-level NoC architecture using the NoC simulator. In this step, one or more NoC aspects are
explored, such as topology, packet size, and buffer size. As we show in the following subsection, the flexibility of the simulator allows several NoC communication characteristics to be explored easily. The next step is the scheduling of application tasks onto the NoC architecture; in this methodology we assume static scheduling. This step can be implemented with a scheduling algorithm, e.g., [14]. After task scheduling, step 4 covers the simulation process and the extraction of the evaluation metrics. The designer analyzes the experimental results and determines whether the chosen architecture satisfies the imposed design constraints. If so, the NoC architecture can be selected; otherwise, the exploration of the same or other NoC aspects continues. Thus, the design methodology is an iterative process that ends when the design constraints are satisfied. Although the methodology can be used for any application domain, in this work we focus specifically on network and multimedia applications, because modern applications from both areas are implemented on embedded devices and demand increased computational power. 3.2 NoC Simulator Description The NoC simulator is developed for NoC exploration at the system level. The tool emphasizes communication aspects (such as packet rates and buffer sizes). Therefore, it abstracts away lower-level aspects, such as cache and other memory effects, in order to keep the complexity of the NoC model under control. For instance, in Section 4 we try to capture the impact of topology on the overall behavior of the network. The high abstraction level of the simulator and its simplicity allow easy exploration and quick modifications. Additionally, it provides a variety of metrics that allow the evaluation of a specific NoC implementation. Table 1. New features added to Nostrum NoC Simulation Environment
Topologies: Irregular topologies
Routing: Both XY routing and routing tables
Evaluation metrics: Performance, throughput, total number of packets, link utilization, communication energy consumption

The NoC simulator is developed as an extended version of the Nostrum NoC Simulation Environment (NNSE) and adds a number of new features to Nostrum. The simulator allows the construction of irregular topologies, and routing can be done either using XY routing or routing tables. It also provides more
System-Level Application-Specific NoC Design
evaluation metrics such as performance, average throughput, link utilization and communication energy consumption. Thus, the simulator we provide allows an in-depth exploration of different NoC aspects at system level. The additional features are summarized in Table 1.
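To make the iterative flow of the methodology concrete, the sketch below shows the exploration loop in plain C++. The `Metrics`, `Constraints`, `explore` and `toy_simulate` names are illustrative stand-ins of our own, not the actual interfaces of the tool:

```cpp
#include <string>
#include <vector>

// Hypothetical evaluation metrics extracted in step 4 of the methodology.
struct Metrics {
    double throughput;        // packets per cycle
    double link_utilization;  // fraction of link capacity used
    double energy;            // Joules
};

// Design constraints imposed on the NoC platform (illustrative).
struct Constraints {
    double min_throughput;
    double max_energy;
    bool satisfied_by(const Metrics& m) const {
        return m.throughput >= min_throughput && m.energy <= max_energy;
    }
};

// Stand-in for steps 3 and 4: schedule the tasks, simulate, extract metrics.
using Simulator = Metrics (*)(const std::string& topology);

// Iterate over candidate NoC aspects (here: topologies) until the design
// constraints are met; return the first satisfying topology, or "" if none.
std::string explore(const std::vector<std::string>& topologies,
                    Simulator simulate, const Constraints& c) {
    for (const auto& t : topologies) {
        Metrics m = simulate(t);
        if (c.satisfied_by(m)) return t;
    }
    return "";  // constraints unsatisfied: widen the exploration
}

// Toy stand-in for the simulator: pretend the torus doubles throughput.
inline Metrics toy_simulate(const std::string& t) {
    return Metrics{t == "torus" ? 2.0 : 1.0, 0.5, 0.1};
}
```

The loop mirrors the iterative character of the methodology: exploration ends as soon as a candidate meets all constraints.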
Fig. 2. The pseudocode of the NoC simulator
The simulator is developed in SystemC, and its pseudocode is depicted in Figure 2. Resources, switches and channels are implemented as sc_modules. The application's threads are allocated to the resources, which implement the read and write functions. Resources, which are an abstraction of processing elements, provide the interface required for allocating application threads on them, and the network interface required to connect the resource to the network. The resources communicate via the channels; the way the resources are connected and the number of channels used are defined by the designer, so various topologies can be implemented. Each channel handles data independently, according to its bandwidth restriction. Every resource contains a switch, which implements the selected routing algorithm; by using adaptive routing, congestion avoidance can be implemented.

The communication process between two resources is shown in Figure 3. First, channels are opened between the resources to construct the specific topology. Then, the application's threads are implemented on the simulator. During simulation, threads trigger the resources to perform read and write functions, and packets are thus injected into the network. Switches and channels handle the packets according to their destination.
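As an illustration of the switch logic described above, the following plain-C++ sketch shows an XY routing decision for a mesh; the real simulator wraps equivalent logic in SystemC sc_modules, and the `Coord` type and port names here are our own illustrative choices:

```cpp
#include <cstdlib>
#include <string>

// Node position in a 2D mesh (illustrative type, not the tool's interface).
struct Coord { int x, y; };

// XY routing: a packet first travels along X until the column matches the
// destination, then along Y; "local" delivers to the attached resource.
std::string xy_route(Coord here, Coord dest) {
    if (dest.x > here.x) return "east";
    if (dest.x < here.x) return "west";
    if (dest.y > here.y) return "north";
    if (dest.y < here.y) return "south";
    return "local";
}

// Hops taken by an XY-routed packet on a 2D mesh: the Manhattan distance.
int xy_hops(Coord src, Coord dest) {
    return std::abs(dest.x - src.x) + std::abs(dest.y - src.y);
}
```

XY routing is deterministic and deadlock-free on a mesh, which is why it is a common default; routing tables (also supported by the simulator) allow irregular topologies where no such coordinate rule exists.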
L. Papadopoulos and D. Soudris

[Fig. 3 depicts the pipeline between two threads: the sending thread's data is packetized by the resource and network interface, and the packets are transmitted to the switch; the switch receives packets in a round-robin fashion, determines the output according to the specified routing algorithm, and stores each packet in the appropriate output buffer; the channel then transmits the packet, until the destination resource (the one connected to the last switch) receives the packets from the buffer and reassembles the data.]

Fig. 3. The communication process between two threads in the NoC simulator
The simulator provides the essential evaluation metrics for exploring NoC characteristics. Average packet delay refers to the time a packet needs to be transferred from source to destination through the network. Energy consumption is calculated as described in [12] and is affected by the switch architecture and the number of hops traversed by packets.
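A minimal sketch of such a hop-based communication energy estimate, in the spirit of (but not identical to) the model of [12]; the coefficients and the switch/link split are illustrative placeholders, not calibrated values:

```cpp
// Hop-based communication energy: each packet pays a per-switch and a
// per-link cost on its path. Coefficients are illustrative placeholders.
struct EnergyCoeffs {
    double e_switch;  // J per packet per switch traversal
    double e_link;    // J per packet per link traversal
};

// A packet crossing h hops traverses h+1 switches and h links, so the
// energy grows linearly with the hop count, as the text describes.
double packet_energy(int hops, const EnergyCoeffs& c) {
    return (hops + 1) * c.e_switch + hops * c.e_link;
}
```

This makes explicit why topology matters: topologies with shorter average paths (fewer hops) directly reduce communication energy.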
4 Experimental Results

We used the methodology presented in the previous section to determine optimal NoC platforms for network and multimedia applications. The NoC aspect we chose to explore in this work is the topology; the topologies we implemented are a 6×6 2D-Mesh, a 6×6 Torus, and a Binary Tree and a Fat Tree of 31 nodes each. Other NoC characteristics, such as buffer size, packet size and routing algorithm, can also be explored. The scheduling algorithm we chose for the third step of the design methodology is the one presented in [14]. The cost factors we used to evaluate each NoC topology are throughput, link utilization and communication energy consumption.

The applications we used as benchmarks in this work are taken from the Embedded System Synthesis Benchmark Suite (E3S) [15]. The first one is the Consumer application, which contains the kernels of four multimedia applications (including JPEG
compression/decompression). The second one is the Office application, which contains three kernels from the multimedia application domain. The last one is the Networking application, which comprises the kernels of three network protocols. The mapping of the IP cores of the E3S benchmarks onto the NoC has been done manually: the purpose of this work is to evaluate the system-level NoC design methodology we present and to prove the effectiveness of the NoC simulator we designed, and the development of a mapping algorithm to be included in the design process will be part of our future work.

4.1 Methodology Applied to Multimedia Applications

We applied the proposed methodology to the Consumer and Office applications from the E3S benchmarks. The Consumer application consists of four multimedia kernels. We implemented the CTG of the application on the four different topologies, and the results are shown in Figure 4.
Fig. 4. Normalized throughput, link utilization and communication energy consumption of the Consumer application
From Figure 4 it can be deduced that the optimal topology in terms of throughput is the Torus. Implementing the application on the Binary Tree topology achieves optimal link utilization, but also increases communication energy consumption by 38% compared to the Torus topology. The designer can choose the topology that satisfies the imposed design constraints of the NoC platform.

The Office application consists of three multimedia kernels. The evaluation metrics obtained by applying the proposed methodology are presented in Figure 5, from which it can be concluded that the Torus implementation leads to increased throughput, but also to high communication energy consumption. On the other hand, the 2D-Mesh implementation results in 41% lower energy consumption, and increased link utilization is achieved by implementing the application on the Fat Tree topology.
Fig. 5. Normalized throughput, link utilization and communication energy consumption of the Office application
4.2 Methodology Applied to Network Application

We evaluated the design methodology on the Networking application from the E3S benchmarks. The application is composed of three network kernels: the Open Shortest Path First (OSPF) protocol, packet flow and route lookup. The evaluation metrics obtained for each explored topology are depicted in Figure 6.
Fig. 6. Normalized throughput, link utilization and communication energy consumption of the Networking application
From Figure 6 it can be deduced that implementing the application on the Torus topology achieves 54% higher throughput. 37% higher link utilization is experienced when the Networking application is implemented on the Binary Tree. Finally, the Fat Tree topology results in 52% less communication energy consumption compared to the 2D-Mesh.
The experimental results presented above show the importance of the exploration procedure included in our methodology. They prove that there is no general answer to the topology selection problem; instead, for every specific application, exploration is needed in order to identify the optimal topology that meets the design constraints.
5 Conclusion

For the efficient design of future embedded platforms, system-level methodologies and efficient tools are highly desirable. Towards this end, we have presented a systematic methodology for NoC design, supported by a flexible NoC simulator. As our case studies show, using the proposed methodology we managed to choose the optimal communication architecture for a number of network and multimedia applications. Our future work addresses the systematic use and the further automation of our approach.
References

1. Benini, L., De Micheli, G.: Networks on chips: A new SoC paradigm. IEEE Computer 35(1) (2002)
2. Millberg, M., Nilsson, E., Thid, R., Kumar, S., Jantsch, A.: The Nostrum backbone, a communication protocol stack for networks on chip. In: Proc. VLSI Design (2004)
3. Bjerregaard, T., Mahadevan, S.: A survey of research and practices of network-on-chip. ACM Computing Surveys 38(1) (2006)
4. Srinivasan, K., Chatha, K.S., Konjevod, G.: Linear programming based techniques for synthesis of network-on-chip architectures. In: Proc. ICCD, pp. 422–429 (2004)
5. Murali, S., De Micheli, G.: SUNMAP: A tool for automatic topology selection and generation for NoCs. In: Proc. DAC, San Diego, pp. 914–919 (2004)
6. Jalabert, A., Murali, S., Benini, L., De Micheli, G.: xpipesCompiler: A tool for instantiating application specific networks on chip. In: Proc. DATE (2004)
7. Ogras, U.Y., Marculescu, R.: Energy- and performance-driven NoC communication architecture synthesis using a decomposition approach. In: Proc. DATE, pp. 352–357 (2005)
8. Pinto, A., Carloni, L.P., Sangiovanni-Vincentelli, A.L.: Efficient synthesis of networks on chip. In: Proc. ICCD (2003)
9. Guerrier, P., Greiner, A.: A generic architecture for on-chip packet-switched interconnections. In: Proc. DATE, pp. 250–256 (2000)
10. Bainbridge, W., Furber, S.: CHAIN: A delay-insensitive chip area interconnect. IEEE Micro 22(5), 16–23 (2002)
11. Bjerregaard, T.: The MANGO clockless network-on-chip: Concepts and implementation. Ph.D. thesis, Information and Mathematical Modeling, Technical University of Denmark, Lyngby, Denmark
12. Ye, T.T., Benini, L., De Micheli, G.: Packetized on-chip interconnect communication analysis for MPSoC. In: Proc. DATE (2003)
13. The Cadence Virtual Component Co-design (VCC), http://www.cadence.com/company/pr/09_25_00vcc.html
14. Meyer, M.: Energy-aware task allocation for network-on-chip architectures. MSc thesis, Royal Institute of Technology, Stockholm, Sweden
15. Dick, R.P.: The Embedded System Synthesis Benchmarks Suite (E3S) (2002)
Fast and Accurate Embedded Systems Energy Characterization Using Non-intrusive Measurements

Nicolas Fournel¹, Antoine Fraboulet², and Paul Feautrier¹

¹ INRIA/Compsys, ENS de Lyon/LIP, Lyon F-69364, France
² INRIA/Compsys, INSA-Lyon/CITI, Villeurbanne F-69621, France
Abstract. In this paper we propose a complete-system energy model based on non-intrusive measurements. This model aims at being integrated into fast cycle-accurate simulation tools to give energy consumption feedback for embedded systems software design. Estimations take into account the whole system consumption, including peripherals. Experiments on a complex ARM9 platform show that our model estimates are in error by less than 10% from the real system consumption, which is precise enough for source-code application design, while simulation speed remains fast.
1 Introduction
With present-day technology, it is possible to build very small platforms with enormous processing power. However, physical laws dictate that high processing power is linked to high energy consumption. Embedded platforms are mostly used in hand-held appliances, and since battery capacity does not increase at the same pace as clock frequency, designers are faced with the problem of minimizing power requirements under performance constraints. The first approach is the devising of low-energy technologies, but this is outside the scope of this paper. The second approach is to make the best possible use of the available energy, e.g. by adjusting the processing power to the instantaneous needs of the application, or by shutting down unused parts of the system. These tasks can be delegated to the hardware; however, it is well known that the hardware's only source of knowledge is the past of the application; only software can anticipate future needs.

Energy can also be minimized as a side effect of performance optimization. For instance, replacing a conventional Fourier transform by an FFT greatly improves the energy budget; the same can be said of data locality optimization, which aims at replacing costly main memory accesses by low-power cache accesses.

The ultimate judge in the matter of energy consumption is measurement of the finished product. However, software designers, compilers and operating systems need handier methods for assessing the qualities of their designs and directing possible improvements. Hence designers need simple analytical models, expressed in terms of software-visible events like instructions,

N. Azemard and L. Svensson (Eds.): PATMOS 2007, LNCS 4644, pp. 10–19, 2007.
© Springer-Verlag Berlin Heidelberg 2007
cache hits and misses, peripheral activity, and the like. There are several ways of constructing such models. One possibility is electrical simulation of the design; this method is too time-consuming for use on systems of realistic size. Another is to interpolate/extrapolate from measurements on a prototype; this is the method we have applied in this work.

The paper is organized as follows. After reviewing state-of-the-art techniques in Section 2, we present in Section 3 a methodology for building complete-platform energy consumption models oriented towards software development. Section 4 presents the resulting model for an ARM9 development platform; this section also validates our model on more significant pieces of code, multimedia applications, thanks to its implementation in a fast, cycle-accurate simulation tool. We then conclude and discuss future work.
2 Related Works
Many works focus on the energy characterization of VLSI circuits. We can organize them using two main criteria: their level of hardware abstraction and their calibration method. For the first criterion, we can group the models into three main categories which are, by increasing level of abstraction, transistor/gate-level models, architectural-level models and finally instruction-level models. Across these categories there are usually three methods for building consumption models: the first is analytical construction, the second is simulation based, and the third is based on physical measurements.

In transistor (gate) level models, all transistor (gate) state changes are computed to give an energy consumption approximation for a VLSI component. This method is highly accurate, but a complete description of the component is needed; models built at this level of abstraction are generally reserved to hardware designers and are very long to simulate.

At the upper level of abstraction, the architectural or RTL level, the system is divided into functional units. Each unit can be represented by a specific model adapted to its internal structure (e.g. bit-dependent or bit-independent models for Chen et al. [1]). To be more accurate, some works, like Kim et al. [5], subdivide each block into sub-blocks and apply a different model to each sub-block. This family of models makes it possible to extend a model to a complete platform, but the models proposed so far are not able to execute a full software system.

The highest level is the instruction/system level of abstraction. At this level, models are based on events such as instruction execution ([13,7,9]). Tiwari et al. [13] propose to characterize the inter-instruction energy consumption, which represents the logic switching between two different instructions; other works also take into account the logic switching due to data parameters [11]. The system considered in these models is generally composed of CPU, bus and memory.
Only a few works focus on modeling a complete platform. Among them, EMSIM [12] is a simulator based on the model of Simunic et al. [10] for StrongARM SA-110 energy characterization; it characterizes the peripherals poorly. The SoftWatt simulator proposed in [4] uses a complete system simulator based on
SimOS to monitor a complex operating system and the interactions among the CPU, the memory hierarchy and hard disk operations. Their simulator is modified to include analytical energy models, and the output reports kernel- and user-mode energy consumption up to the granularity of operating system services. Data are sampled during simulation and dumped to log files at a coarser granularity than cycle level, leading to average power consumption figures. The closest work to ours, AEON's model [6], is a complete-platform energy consumption model based on measurements and on the simulator's internal energy counters. The targeted platform is an 8-bit AVR micro-controller based sensor network node that does not include CPU pipelines, a complex memory hierarchy or peripherals. Our model allows the simulation of much more complex hardware architectures while being independent of simulator internals.

As far as calibration methods are concerned, analytical models are generally based on manufacturers' data; e.g. in Simunic et al. [10] the model is built from datasheet information. Simulation-based calibration needs full knowledge of the underlying architecture, which means a description of the low-level hardware (VHDL or Verilog). Measurement-based methods only need little information on the hardware, and works like [13,2] have shown that it is possible to extract internal unit consumption from system measurements.

In this paper we propose a methodology for building complete-platform energy consumption models based on simple and non-intrusive measurements. The model is built at a level of abstraction close to the system level presented above, but is extended to the complete platform by coupling it with the architectural-level principles presented by Kim et al. [5]. We also take peripheral energy models and dynamic voltage and frequency scaling into account.
3 Model Construction Basics
We present in this section our methodology for building complete-platform models. We first give more details on the structure and the parameters of the energy model; Section 4 will present the target-dependent model parameters through a case study on an ARM9-based platform.

3.1 Model Structure and Parameters
Our choice among all the modeling methods presented in Sect. 2 is to build an architectural-level model, in which the system is divided into its main functional blocks at the platform level, such as CPU, interconnection bus, memory hierarchy, and peripherals. The energy consumption of an application, Eapp, is obtained by adding all block consumptions Ebl. Each block can have its own energy consumption model. To have a platform model better suited for software development, we use an instruction-level abstraction for the CPU. The CPU energy consumption ECPU is described in equation (1):

    ECPU = Einsn + Ecache + EMMU    (1)
The energy consumption is thus the sum of the energy consumed by instruction execution, plus the cache and MMU overhead consumptions, plus the consumption of all other blocks of the platform:

    Eapp = ECPU + Σblocks Ebl    (2)

This model aims at being integrated into a full-platform cycle-accurate simulation tool. The most convenient way of writing the model for this kind of purpose is to define a time-slot energy consumption; the chosen time slot is the CPU instruction execution. There are two reasons for choosing this time reference. First, it is the finest time reference, since the CPU generally has the highest clock frequency in an embedded system. Second, interrupt requests, the only means for the hardware peripherals to interact with the software, are managed at the end of instruction execution; from a software point of view, there is no need for a finer time reference to report hardware events more precisely. The model can be rewritten in a form where the consumption of the CPU and the other blocks is reported for the currently executed instruction. The E∗ notations will be kept for overall application consumptions; for the sake of notational simplicity, consumptions at instruction-level granularity will also be noted E∗. The new model formula is expressed in the following equation:

    Eslot = ECPU + Σblocks Ebl    (3)
The last peculiarity of this model is the measurement-based data collection. As we only get global measures of the platform consumption, the base consumptions of the individual blocks will not be easily distinguishable: once the embedded system is put in its laziest state, for example idle with all possible units powered off, the resulting consumption is considered a base consumption regrouping the base consumption of every powered peripheral. Obviously, a part of this consumption is static power dissipation. We call this term Ebase; it is important to note that this consumption is reported to the instruction currently executed on the CPU, and is therefore dependent on the instruction length linsn in clock cycles, as expressed in equation (5). Equation (3) becomes equation (4):

    Eslot = Ebase + ECPU + Σblocks Ebl    (4)

    Ebase = linsn × Ec_base    (5)
The CPU and other block consumptions are then expressed as overheads against the idle state. As described in equation (1), the CPU energy consumption is given by the executed instruction's energy cost; this model can be simplified by regrouping instructions into classes, as proposed in [7]. As far as the other blocks are concerned, we can expand them into bus, memories and other peripherals. This is convenient, since the bus and memories are subject to events generated by the processor, such as memory writes, while the peripherals are modeled by state machines giving the energy consumption of each peripheral during the time slot.
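Equations (4) and (5) amount to a simple per-instruction accumulation, which can be sketched as follows (all numeric inputs are illustrative, not measured costs):

```cpp
#include <vector>

// Per-instruction energy slot following equations (4) and (5):
// Eslot = Ebase + ECPU + sum over blocks of Ebl, with the base
// consumption scaled by the instruction length in clock cycles.
double slot_energy(double e_c_base,   // base cost per cycle (Ec_base)
                   int insn_cycles,   // instruction length l_insn
                   double e_cpu,      // instruction overhead (E_CPU)
                   const std::vector<double>& e_blocks) {  // per-block overheads
    double e_base = insn_cycles * e_c_base;  // equation (5)
    double e = e_base + e_cpu;               // equation (4), first two terms
    for (double e_bl : e_blocks) e += e_bl;  // sum over the other blocks
    return e;
}
```

Summing `slot_energy` over all executed instructions reproduces the application-level consumption of equation (2).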
The last step in model construction consists in defining all possible parameters for these components. Owing to the limited information available, developers will not necessarily know the behavior of intra-block logic. The parameters for the CPU are already selected, since it is modeled through instruction consumptions; the same can be done for cache, MMU and even co-processor consumptions. The parameters for the other blocks are limited to behavioral parameters (e.g. a UART sending a byte) and their states, such as operating modes. Each energy cost in this model is a function of the running frequency and power supply voltage, so that the dynamic voltage and frequency scaling capabilities of the platform can be modeled. An example of this is presented in the next section.
4 Model Construction Case Study
In this section we present an example application of our methodology on an ARM-based development board. This platform uses an ARM922T and the usual embedded-system peripherals (e.g. UART, timers, network interface) on the same chip. Our hardware architecture exploration reveals that the platform has three distinct levels of memory: a cache, a scratchpad and main memory. All peripherals are accessible through two levels of AMBA bus. We give details of the energy consumption model construction for this platform, then check the accuracy of the resulting model.

4.1 Methodology Application
The complete-platform modeling method presented in Section 3.1 is applied to our ARM9 platform in this section. The measurement setup used for these experiments is close to the one depicted in [9]: we used a digitizing oscilloscope, the shunt resistor is replaced by a current probe, and we also used a voltage probe.

Calibration benchmarks. We built benchmarks to calibrate our model, more precisely our block models. The hardware exploration gives us the main blocks to be modeled, namely the CPU, the different bus levels, the memory levels, and the other peripherals such as the UART, interrupt controller or timers. For example, the selected parameters for our CPU model are the CPU instructions, or possibly classes of instructions, plus cache and MMU activities. We thus built benchmarks to evaluate the cost of possible parameters, in order to select only relevant ones. Here are examples of benchmarks that were used, and their target events:

• loop-calibration: measurement loop overhead benchmark. By running an empty loop, we can estimate the loop overhead.
• insn-XXX: comparison of CPU instruction execution costs (add, mov, ...). The target instruction is executed many times inside a loop.
• XXX-access: calibration of the cost of each bus level (AHB1/2) and memory level (cache, scratchpad or main memory), depending on the address accessed.
Table 1. Benchmark results for simple operation energy calibration

  bench name           length   energy (nJ)   error (pJ)
  loop-calibration        4       69.084        5.1777
  insn-nop                1       16.747        1.2884
  AHB1-access             6      101.33         7.7132
  AHB2-access            18      300            22.998
  Dcache-access           1       17.146        1.3007
  mem-access             40      775.44        54.551
  spm-access              8      131.72        10.168
  timer-test on(nop)      1       16.754        1.2857
• timer-test: example of peripheral energy characterization; this benchmark allows us to measure the timer power consumption. It is subdivided into two benchmarks, one in which the timer is stopped and one in which the timer is running. The structure of the loop is the same as in the insn-XXX benchmark, with a nop instruction.

Calibration results. Example benchmark energy results are listed in Table 1; full results are available in [3]. For each benchmark, the table gives the length of the calibrated event in CPU clock cycles (second column), the per-event raw energy cost measured on the complete platform (third column) and the measurement error (fourth column). Energy costs reported here give the consumption of the complete platform for a full event execution. These raw costs have to be refined to get the final parameters. As an example, the scratchpad memory access benchmark (spm-access) gives the energy consumed by the CPU executing a load instruction, by the bus conveying the load request and response, and finally by the scratchpad memory itself. The bus access cost includes the register accesses in the targeted peripheral, since it is impossible to dissociate their costs. By removing the consumption of the CPU (one load and seven nops) and the bus consumption, we finally obtain the scratchpad memory access cost. Experiments reported in [3] show that the scratchpad memory does not consume more energy than a register accessed via the bus.

Other model simplifications are possible. For example, the CPU cache model is simplified by taking into account only memory access bursts in case of misses, since the remaining overhead can be neglected. The basic model presented in Section 3.1 can then be rewritten using the simplifications obtained by calibration. We found that most instructions have the same energy consumption as long as they stay inside the CPU. Currently only the ARM 32-bit instruction set is modeled.
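The two refinement steps just described, deriving a per-event raw cost from a calibration loop and then subtracting already-calibrated contributions, reduce to simple arithmetic; a sketch, with invented numbers rather than the measured values:

```cpp
// Per-event raw cost from a calibration loop: the target event is run n
// times inside a loop, and the measured overhead of an empty loop
// (the loop-calibration benchmark) is subtracted first.
double per_event_cost(double e_loop_with_event, double e_empty_loop, int n) {
    return (e_loop_with_event - e_empty_loop) / n;
}

// Refinement: subtract already-calibrated contributions from a raw cost.
// E.g. for spm-access: raw cost minus the CPU part (one load, seven nops)
// minus the bus conveyance cost leaves the scratchpad's own contribution.
double isolate_block_cost(double raw, double cpu_part, double bus_part) {
    return raw - cpu_part - bus_part;
}
```

Because every raw measurement covers the whole platform, each new block cost is obtained by peeling off the blocks calibrated before it, so the calibration benchmarks must be processed in dependency order (CPU first, then bus, then memories).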
The Thumb (16-bit) instruction set could be modeled using the same benchmark methodology. In our setup it is not possible to isolate the instruction cache consumption, which is lumped with the instruction consumption; instruction cache misses can be modeled as memory accesses. We finally have a model in which CPU instructions are grouped in two classes: arithmetic and logic intra-CPU instructions, and load/store instructions. A memory load access is modeled as a load instruction, plus a bus overhead, plus a memory overhead. Peripheral energy consumption is taken into account through state machines that give each peripheral's consumption during instruction execution. The final model is given in equation (6):

    Eslot = Ebase + Einsn + Ebus_access + Emem + Σperiph Eperiph_state    (6)

Eslot is the energy consumption of the instruction execution time slot, Einsn is the cost of the instruction given by its class cost, Ebus_access is the bus overhead for load or store instructions, and Emem is the overhead for memory accesses. The last term is the sum of the energy overheads of the peripheral states. These costs are all overhead costs: the full consumption of a peripheral, for example, is given by its base energy cost comprised in Ebase plus the overhead.

Frequency Scaling. The model presented above is valid for full-speed software execution. The Integrator CM922T has frequency scaling capabilities but no dynamic voltage scaling (DVS), hence reducing the frequency does not let us decrease the energy consumption. Repeating five benchmarks at different frequencies yields the curves of Fig. 1. This figure plots the per-event energy values of the five benchmarks as a function of the clock divisor r = fref/f, where fref is the nominal frequency (198 MHz here).

[Fig. 1 plots energy per event (J), from 0 to 2.0e-06, against the clock divisor (1 to 17) for the benchmarks AHB2-reg-write, AHB1-reg-write, loop-calibration, insn-cmp_mul and insn-cmp_nop; every curve grows linearly with the divisor.]

Fig. 1. Multiple frequency experiments: this figure shows that the energy per event increases linearly with the clock period

These curves show that the energy per event increases when the frequency is decreased, which may seem counter-intuitive. To understand these results, observe first that a given event, e.g. the execution of some specific instruction, entails an almost constant number of bit flips, and that each flip uses a fixed amount of energy. Hence, to a first approximation, and in the absence of voltage scaling, the energy for a given event should be a constant. However, in our platform, frequency scaling acts only on the processor and the Excalibur embedded peripherals; the consumption of the other peripherals and external memories is not affected. Hence the addition of a parasitic term which is roughly proportional to
the duration of the event, or inversely proportional to the frequency. This is clearly the case for the curves of Fig. 1. We must underline that all five benchmarks generate activity in the modified clock domain (the CPU), but not in the remaining part of the platform; on top of that, we kept all peripherals in the modified clock domain in an idle state. Hence, in the event energy cost Eevt (where an event can be an instruction execution or a bus access, for example) we can identify two sources: the energy due to the modified clock domain, Emc, which is constant, and the energy due to the remaining part of the platform, Erp_base. Their relation in the total consumption of an event is given by:

    Eevt = Erp_base × linsn × r + Emc    (7)

Table 2. Linear regression from the curves of Fig. 1, based on equation (7)

  Benchmark name     Erp_base (nJ)   Emc (nJ)   error (pJ)
  insn-mul               10.91         26.37      572.36
  loop-calibration       10.52         19.22      258.90
  insn-nop               10.54          6.35      105.61
  access-AHB1            11.06         36.72     1085.37
  access-AHB2            11.06        106.32     3431.46
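Fitting equation (7) to the measurements is an ordinary least-squares problem, with x = linsn × r as the regressor and the per-event energy as the response; a self-contained sketch (the data passed in would be the measured energies, here illustrative):

```cpp
#include <cstddef>
#include <vector>

// Least-squares fit of y = slope * x + intercept. For equation (7),
// x = l_insn * r and y = E_evt, so the slope estimates Erp_base and
// the intercept estimates Emc.
struct Fit { double slope, intercept; };

Fit fit_line(const std::vector<double>& x, const std::vector<double>& y) {
    std::size_t n = x.size();
    double sx = 0, sy = 0, sxx = 0, sxy = 0;
    for (std::size_t i = 0; i < n; ++i) {
        sx += x[i]; sy += y[i];
        sxx += x[i] * x[i]; sxy += x[i] * y[i];
    }
    double slope = (n * sxy - sx * sy) / (n * sxx - sx * sx);
    return Fit{slope, (sy - slope * sx) / n};
}
```

Applied to the five benchmark curves, such a regression yields the per-benchmark Erp_base and Emc values reported in Table 2.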
The first term depends on the frequency ratio r and the instruction length linsn, whereas the second does not. Linear regressions on the results presented in Fig. 1 are shown in Table 2; as this table shows, equation (7) gives a good explanation of the clock frequency variation experiments. These results give us an estimate of what we can consider as the base energy, which does not change with software execution: it can be approximated by the mean value of the Erp_base column, 10.82 nJ per cycle (with a standard deviation of ±2.6 × 10−2). The last two columns give the events' real consumption and the regression error.

4.2 Model Validation
We describe here our accuracy test experiments: our model is implemented in a simulator, and its results are compared to physical measurements.

Simulator Integration. Our model is implemented in a simulation tool suite composed of two simulators. The first is a complete-platform functional simulator in charge of generating a cycle-accurate execution trace of the software; this trace reports all executed instructions and all peripheral activities (state changes). This first step allows software developers to functionally debug their applications, and supplies the material for the second simulation step. To fulfil this first task, we implemented the behaviour of the Integrator platform in the open-source simulator Skyeye, which we also upgraded to generate cycle-accurate traces. The second tool is the energy simulation tool, which implements the model presented in the previous
Table 3. Simulators results: the results obtained for execution time and energy consumption by real hardware measurement are shown in second and third columns, the simulation ones in fourth and fifth columns. The last two columns give the error percentile of the simulation. Measured values Simulated values Error Benchmark cycles energy (J) cycles energy (J) cycles (%) energy (%) jpeg 6916836 1.142440e-01 6607531 1.037940e-01 - 4.4 - 9.1 jpeg2k 7492173 1.268535e-01 7663016 1.200488e-01 + 2.2 - 5.3 mpeg2 13990961 2.335522e-01 14387358 2.208065e-01 + 2.8 - 5.4
section. Its main task is to compute the model parameters from the cycle-accurate execution trace. It accumulates all computed energies and reports them in an energy profile file used for source-level (C code) instrumentation and annotation.

Validation Methodology. To check the accuracy of the resulting model, we compare the consumption estimations of the model, as implemented in our tool, to physical measurements on the real platform. The test applications chosen for this validation are widespread multimedia applications: JPEG, JPEG2000 and MPEG2. The implementations of these three applications rely on standard Linux libraries; hence they use operating system services and standard libc functions. All experiments could have been run under Linux (or even uClinux), since the simulation tools are complete enough to run these operating systems. Because of our limited measurement time window, however, we replaced this heavyweight OS by a lightweight one, Mutek [8]: the Linux hardware abstraction layer makes interrupt request management too slow to allow a reasonably sized image to be decoded within our measurement window. The three applications are executed in the simulation tools to obtain estimations of their execution.

Accuracy. The results of the model estimations and the physical measurements are presented in Table 3. The second and third columns report the physical measurement results, in terms of execution duration in CPU clock cycles and of energy consumption in Joules. The fourth and fifth columns give the same information for the simulation results. Finally, the last two columns give the percentage error of the simulation results against the physical measurements on the target hardware platform. These results show that an error rate of about 10% can be achieved by our simple complete-platform energy model. These estimations are obtained in less than a minute (25 s for the first simulation plus 20 s for the second).
We think that an error rate of 10% is largely acceptable in view of the simulation time.
5 Conclusion
In this paper we have explained how an accurate energy consumption model for a full embedded system can be built from external measurements and microbenchmarks. Our methodology requires a prototype platform of comparable
Fast and Accurate Embedded Systems Energy Characterization
19
technology. Quantitative energy data are gathered at the battery output and translated into per-instruction energy figures by data analysis. The resulting model is thus driven by the embedded software activity and can be used with a simulation execution trace as input. It is therefore easy to add an energy estimator to a functional software simulator so as to get feedback at the source level. As modifications to the simulation tools are kept to a minimum, the simulation speed is not impacted. The consumption data clearly identify power-hungry operations, thus offering guidelines for software design tradeoffs. The model built on an ARM9-based development board using this methodology achieved an error rate of less than 10% at the source level, which is acceptable given its simplicity of implementation and its fast running time.
References

1. Chen, R.Y., Irwin, M.J., Bajwa, R.S.: Architecture-level power estimation and design experiments. In: ACM TODAES, January 2001, vol. 6, pp. 50–66. ACM Press, New York (2001)
2. Contreras, G., Martonosi, M., Peng, J., Ju, R., Lueh, G.-Y.: XTREM: a power simulator for the Intel XScale core. In: LCTES '04, pp. 115–125 (2004)
3. Fournel, N., Fraboulet, A., Feautrier, P.: Embedded Systems Energy Characterization using non-Intrusive Instrumentation. Research Report RR2006-37, LIP - ENS Lyon (November 2006)
4. Gurumurthi, S., Sivasubramaniam, A., Irwin, M.J., Vijaykrishnan, N., Kandemir, M., Li, T., John, L.K.: Using complete machine simulation for software power estimation: The SoftWatt approach. In: International Symposium on High Performance Computer Architecture (2002)
5. Kim, N.S., Austin, T., Mudge, T., Grunwald, D.: Power Aware Computing. In: Challenges for Architectural Level Power Modeling, Kluwer Academic Publishers, Dordrecht (2001)
6. Landsiedel, O., Wehrle, K., Götz, S.: AEON: Accurate Prediction of Power Consumption in Sensor Nodes. In: SECON, October 2004, Santa Clara (2004)
7. Lee, M.T.-C., Fujita, M., Tiwari, V., Malik, S.: Power analysis and minimization techniques for embedded DSP software. IEEE Transactions on VLSI Systems (1997)
8. Pétrot, F., Gomez, P.: Lightweight Implementation of the POSIX Threads API for an On-Chip MIPS Multiprocessor with VCI Interconnect. In: DATE 03 Embedded Software Forum, pp. 51–56 (2003)
9. Russell, J.T., Jacome, M.F.: Software power estimation and optimization for high performance, 32-bit embedded processors. In: International Conference on Computer Design (October 1998)
10. Simunic, T., Benini, L., De Micheli, G.: Cycle-accurate simulation of energy consumption in embedded systems. In: 36th Design Automation Conference, May 1999, pp. 867–872 (1999)
11. Steinke, S., Knauer, M., Wehmeyer, L., Marwedel, P.: An accurate and fine grain instruction-level energy model supporting software optimizations. In: PATMOS (2001)
12. Tan, T.K., Raghunathan, A., Jha, N.K.: EMSIM: An Energy Simulation Framework for an Embedded Operating System. In: ISCAS 2002 (May 2002)
13. Tiwari, V., Malik, S., Wolfe, A., Lee, M.: Instruction level power analysis and optimization of software. Journal of VLSI Signal Processing (1996)
A Flexible General-Purpose Parallelizing Architecture for Nested Loops in Reconfigurable Platforms*

Ioannis Panagopoulos¹, Christos Pavlatos¹, George Manis², and George Papakonstantinou¹

¹ Dept. of Electrical and Computer Engineering, National Technical University of Athens, Zografou 15773, Athens, Greece
{ioannis,cpavlatos,papakon}@cslab.ece.ntua.gr
http://www.cslab.ece.ntua.gr
² Dept. of Computer Science, University of Ioannina, P.O. Box 1186, Ioannina 45110, Giannena, Greece
[email protected]
Abstract. We present an innovative general-purpose architecture for the parallelization of nested loops in reconfigurable architectures, in an effort to achieve better execution times while preserving design flexibility. It is based on a new load-balancing technique which distributes the initial nested loop's workload to a variable, user-defined number of Processing Elements (PEs) for execution. The flexibility offered by the proposed architecture rests on “algorithm independence”, on the possibility of on-demand addition/removal of PEs depending on the performance-area tradeoff, on dynamic reconfiguration for handling different nested loops, and on its availability for any application domain (design reuse). An additional innovative feature of the proposed architecture is the hardware implementation of the dynamic generation of the loop indices of loop instances that can be executed in parallel (dynamic scheduling), and the flexibility this implementation offers. To the best of our knowledge this is the first hardware dynamic scheduler proposed for fine-grain parallelism of nested loops with dependencies. Performance estimation results and limitations are presented both analytically and through two case studies from the image processing and combinatorial optimization application domains.
1 Introduction

The platform-based design methodology has proven to be an effective approach for reducing the computational complexity involved in the design process of embedded systems [1]. Reconfigurable platforms consist of several programmable components*
* This work is co-funded by the European Social Fund (75%) and National Resources (25%) under the Operational Program for Educational and Vocational Training II (EPEAEK II), and particularly the Program PYTHAGORAS II.
N. Azemard and L. Svensson (Eds.): PATMOS 2007, LNCS 4644, pp. 20–30, 2007. © Springer-Verlag Berlin Heidelberg 2007
(microprocessors) interconnected to hardware and reconfigurable components. Reconfigurable components allow the flexibility of selecting specific computationally intensive parts (mainly nested loops) of the initial application to be implemented in hardware during hardware/software partitioning [2], in an effort to achieve the best possible increase in performance while obeying specific area, cost and power consumption design constraints. The most straightforward approach to speeding up the execution of a nested loop is to implement the nested loop in hardware and place it in the platform, either on an FPGA or as a dedicated ASIC (Application Specific Integrated Circuit) [3][4]. Another common approach also entails the migration of the nested loop to hardware but, prior to that, applies several theoretical algorithmic transformations (usually in an effort to parallelize its execution) to further improve its performance. It has been mainly influenced by similar techniques exploited in general-purpose computers [5][6][7]. In the second approach a scheduling algorithm is needed to act as a controller indicating the loop instances that will be executed in parallel at every control step. Existing scheduling algorithms are classified into two main categories: static and dynamic. Static algorithms [8] are executed prior to the loop's execution and produce a mapping of loop instances to specific control steps. Static algorithms require space in memory for storing the execution map. This makes them inappropriate for reconfigurable embedded systems, where memory utilization needs to be minimized and the system needs to handle nested for-loops with different communication and computation times. In contrast, dynamic algorithms attempt to use the runtime state information of the system in order to make informed decisions on balancing the workload, and are executed during the computation of the nested loop.
This makes them applicable to a much larger spectrum of applications, and since they neither require a static map of the execution sequence nor are limited to a specific loop, they are the best candidates for loop parallelization in embedded systems. An important class of dynamic scheduling algorithms are the self-scheduling schemes presented in [9]. Our proposed architecture is based on the second approach and is applied to grids. We target fine-grain parallelism and apply the proposed scheduling algorithm on FPGAs. To the best of our knowledge this is the first implementation of a dynamic scheduling algorithm handling data dependencies on embedded reconfigurable systems. The presented architecture allows the parallelization of the initial nested loop through the use of a number of interconnected PEs implemented in hardware that work in parallel and are coordinated by a main execution controller. It does not require any initial application-specific algorithmic transformation of the nested loop or architectural transformation of the hardware implementation, which is the case for most systolic approaches. Grids are a widely known family of graphs and have been applied to a variety of diverse application domains such as numerical analysis (e.g. convolution, solving differential equations), image and video processing (e.g. compression, edge detection), digital design, fluid dynamics, etc. Our load-balancing architecture addresses design flexibility, optimization and generality in those domains. The proposed architecture is presented as follows: initially, we establish the notation and definitions that will be used throughout this paper and present the theoretical framework that governs our approach (Section 2). Then, we provide a general overview of the
architecture and pinpoint the issues that need to be tackled and the questions that need to be answered to establish the performance merit of the approach (Section 3). Section 4 deals with those issues and presents the way they are resolved. Finally, in Section 5, two case studies are presented that evaluate the actual speed-up gained by the application of the proposed architecture. Conclusions and future work follow in Section 6.
2 Definitions-Notation

A perfectly nested loop has the general representation illustrated in Listing 1.

for (i1=0; i1<L1; i1++)
  for (i2=0; i2<L2; i2++)
    ...
      for (in=0; in<Ln; in++) {
        S;
      }

Listing 1. General abstract representation of a perfectly nested loop
where i1, i2, …, in are the loop indices, L1, L2, …, Ln are the loop bounds and S is a set of statements. Statements may be of any kind, from conditional branches to I/O operations. A loop instance is defined as a single execution of the statements in the loop's body for specific index values. We can represent the loop instances in an n-dimensional space (for an n-dimensional loop) where each coordinate is a specific index of the for-loop. Each point in this space represents a specific loop instance. We use arrows between points (loop instances) in the graph to express data dependencies (e.g. Figure 1(a)). The extracted dependencies among loop instances are used to calculate sets of points that can be executed in parallel. Those sets of points are separated by hyperplanes [5]. The hyperplanes formed for the previous example are illustrated in Figure 1(b) as dashed lines. This paper considers the case where the for-loop has unitary dependencies among loop instances (grids). An example is illustrated in Figure 1(c). The hyperplanes define sets of points that can be executed in parallel [5].
Fig. 1. (a) Graphical representation of a 2D loop. (b) The hyperplanes formed according to data dependencies. (c) A 2-dimensional loop with unitary dependencies.
Moreover, there exists an analytical expression for finding those points:

i1 + i2 + … + iN = c    (1.1)
where c defines the time at which all points satisfying equation (1.1) will be executed in parallel. Loop instances are executed on Processing Elements. Let P = {PE1, PE2, …, PEk} be a set of processing elements and M their shared memory, where all variables reside. The memory M can be accessed by only one PE at a time. We also define the following times:
Tc is the time required by a PE to perform the loop body computations.
Tr is the time required by a PE to read the input variables.
Tw is the time required by a PE to write the output variables.
3 Architectural-Functional Overview

The general architecture of the proposed design is illustrated in Figure 2(a).
Fig. 2. (a) The system’s architecture (b) Internal architecture of the PEs
The pacemaker of the architecture is the “Controller”. The “Controller” component is initiated by the CPU and is responsible for dispatching the current parallel time of execution. The parallel time is defined as the time unit at which the loop instances belonging to a specific hyperplane can be executed in parallel. The “Controller” component dispatches this time to the “Point Generator” component. The “Point Generator” is initiated, finds the solutions of equation (1.1) for the received time (see Section 4), and stores the solutions' index values in its interfacing FIFO buffers. The “Point Distributor” component receives those index values and sends them to the requesting PEs. The PEs receive the index values of the loop instances to be processed, fetch the requested data from memory, perform the loop instance's calculations and store the results back to memory. Upon completion of the generation of the index values for a specific parallel set, the “Point Generator” component informs the controller, which waits until all PEs have performed their computations and then dispatches the next time unit. This stalling of the “Controller” is crucial to prevent violation of loop dependencies in the case where some PEs are still calculating loop instances of the previous time step. The PEs' internal architecture for the execution of the loop instance's statements is presented in Figure 2(b). The “Point Reception Unit” implements the communication protocol with the “Point Distributor” component. The received index values are used as an offset from a base address by the “Memory Access Unit” component, which fetches the required data from main memory. The “Calculations” component is then initiated and performs the loop instance's calculations. The results are stored to the main memory
24
I. Panagopoulos et al.
through the “Memory Access Unit”. Reconfiguration of the proposed architecture at runtime can occur during the execution of a specific loop, trading area for a further speed-up by adding/removing “I-modules” without any further change in the architecture (for the “I-modules” see Section 4). The realization of this architecture presents the following problems that need to be tackled (see Section 4):

• The “Point Generator” component needs to be fast enough in generating the indices of the loop instances that will be executed in parallel; an efficient way of finding the solutions to equation (1.1) is needed.
• The “Point Generator” component needs to be implemented in hardware and be easily reconfigurable to handle for-loops of various dimensions.
• The upper limit of parallelization imposed by memory access conflicts must be determined.

In the following section, those considerations are thoroughly analyzed.
4 Analysis

The “I-module” architecture used for the implementation of the “Point Generator” is based on a refinement of a brute-force algorithm for finding and enumerating the solutions of equation (1.1) (a first-order Diophantine equation with unitary coefficients). The trivial approach is to implement a brute-force algorithm that iterates through all possible values of the indices and generates the solutions (Listing 2). L[i], i=0…N-1, holds the upper bound for the indices in each dimension. Array I[…] is used for temporary storage of a candidate solution.

for (I[0]=0; I[0]<=L[0]; I[0]++)
  for (I[1]=0; I[1]<=L[1]; I[1]++)
    ...
      for (I[N-1]=0; I[N-1]<=L[N-1]; I[N-1]++)
        CheckSolution();

void CheckSolution () {
  if ((I[0]+I[1]+...+I[N-1]) == c)
    ; /* A solution has been found */
  else
    ;
}

Listing 2. The initial brute-force algorithm based on for-loops
void FindSolution (int pos) {
  for (i=0; i<=L[pos]; i++) {
    I[pos] = i;
    if (pos == N-1)
      CheckSolution();
    else
      FindSolution(pos+1);
  }
}

Listing 3. Brute-force algorithm for finding and enumerating the solutions of the first-order Diophantine equation with unitary coefficients (pos holds the current position in the indices vector)
An alternative brute-force algorithm is based on a recursive function whose template is shown in Listing 3. This algorithm is more suitable for implementation in hardware since it requires the design of only one hardware module (the FindSolution() function), which is replicated “as-is” depending on the size of the initial vector. The function CheckSolution() evaluates the currently generated vector and stores the result if the sum of its index values equals c. The proposed, refined algorithm is given in Listing 4. The function RemSumL(k) returns the value of Lk + Lk+1 + … + LN; the initial value of pos is 0 and that of r is c.
void FindSolution (int pos, int r) {
  if (pos == N-1) {
    I[pos] = r;
    CheckSolution();
    return;
  }
  for (i=0; i<=Min(L[pos],r); i++) {
    I[pos] = i;
    if (((r-i) <= RemSumL(pos+1)) && ((r-i) >= 0))
      FindSolution(pos+1, r-i);
  }
}

Listing 4. Proposed refinement of the brute-force algorithm for finding and enumerating the solutions of the first-order Diophantine equation with unitary coefficients
Fig. 3. An interconnected series of I-modules generating the solutions of equation (1.1) for a 3D vector. Note that additional I-modules can be added at the end of the series to solve the equation for higher dimensional vectors.
This refined algorithm will be used for the generation of the index values of the loop instances to be executed in parallel in the proposed implementation. It gives an average saving of 87% in the computation steps required. To preserve the generality of the approach, the “Point Generator” is implemented in hardware as a line of smaller interconnected modules called I-modules. Each I-module is responsible for generating one index value participating in a single solution of equation (1.1) and passes control either to the next or to the previous I-module in line, depending on its current state. This behaviour implements the execution of Listing 4, where the function FindSolution() either recursively calls a new function or returns (equivalently, going right or left in the I-modules). The implementation of the “Point Generator” component as a series of interconnected modules guarantees the generation of a new solution in at most D stages (where D is the index vector dimension). Moreover, it can easily be extended to the solution of equations of higher dimensions by adding more I-modules in line (Figure 3).

Tcompl = (L1 L2 … LN) T / N    (1.2)

Tc ≥ (N−1)(Tr + Tw)    (1.3)

N ≤ ⌊Tc / (Tr + Tw)⌋ + 1    (1.4)

Tcompl = (L1 L2 … LN) T / N,   if N ≤ ⌊Tc / (Tr + Tw)⌋ + 1 = NU
Tcompl = (L1 L2 … LN) T / NU,  if N > ⌊Tc / (Tr + Tw)⌋ + 1    (1.5)
We finally need to provide a theoretical estimate of the speed-up achieved by the proposed approach, which also defines the limitations of the proposed architecture. In the case where there are no dependencies among loop instances and memory accesses are non-blocking, the solution is straightforward. Each PE carries out one execution of a loop instance in time T = Tr + Tc + Tw. Therefore, the time required for the completion of the execution of the nested loop on N PEs is given by (1.2). In the case
where memory accesses are blocking, (1.2) holds up to the point where the number of PEs does not create any conflicts on memory accesses. Therefore, there exists an upper bound on the number of PEs above which there is no further improvement in performance. (1.2) holds when the number of PEs is such that, during the computation time of one PE, all others have enough time to complete their memory accesses¹. This is true when (1.3) holds. Solving (1.3) with respect to N yields the upper bound on the number of PEs above which there is no further performance improvement (1.4). So, in the case where there are no dependencies and memory accesses are blocking, (1.2) is reformulated as (1.5). Solving (1.1) for each c, we get groups of points that can be executed in parallel. The value of c ranges from 0 to (L1−1)+(L2−1)+…+(LN−1). To simplify the theoretical analysis, we consider equal upper bounds for all loop limits, L = max(L1, L2, …, LN). This approach, although more pessimistic, still yields acceptable optimization results. Given this assumption, the loop points form a hypercube which is symmetric with respect to the hyperplane that satisfies equation (1.1) with c = L−1. The number of points fc that need to be processed in each hyperplane c, for c ∈ [0…L−1], is given by the following equation (only for grids) [11]:

fc(D) = D + Σ_{m=1}^{D−1} Σ_{n=1}^{c−1} fn(m)    (1.6)
where D is the loop's dimension. The maximum number of points that need to be processed within a single hyperplane occurs when c = L−1. Since the total space is symmetric with respect to this hyperplane, the total execution time can be estimated by considering the following times (Figure 4):
Tcompl = 2Tlt + Tc + 2Tgt    (1.7)
Fig. 4. Example of the 2D index space of a 2-dimensional loop whose indices have equal constant upper bounds. Lines define different regions (each hyperplane is a line parallel to the boundary lines).
¹ Memory accesses are not perfectly distributed over time, and thus even in this case memory conflicts will occur. However, as long as the computation time is larger than the overall communication time, the overall execution time will not be affected, since in a given period of time all PEs will have had time to access the memory.
where:

Tlt = P T
Tgt = Σ_{i=P+1}^{L−2} fi(D) T / N
Tc = fL−1(D) T / N    (1.8)

We do not consider the case where memory access conflicts start affecting the execution time; that is, we limit our analysis to numbers of PEs satisfying (1.4). The times above are:

• Tlt: the time required to process the P hyperplanes whose total number of points is less than or at most equal to the number of available PEs. Due to the symmetry shown in Figure 4, this time is doubled.
• Tgt: the time required to process the Q = (L−2)−P hyperplanes whose total number of points is greater than the number of available PEs, up to the symmetry hyperplane. Due to the symmetry shown in Figure 4, this time is also doubled.
• Tc: the time required to process the points belonging to the symmetric hyperplane.
The total completion time is given by equation (1.7). Equations (1.7) and (1.8) yield the performance estimate of our proposed approach under the limitation of equation (1.4), and verify that a considerable performance improvement does exist in this case. The theoretical result will be verified in the following two case studies.
5 Case Studies

For the case studies, a simulation environment in SystemC was initially created. Then, the architecture was implemented in Verilog and simulated in the Xilinx ISE environment using ModelSim.

5.1 Image Filtering
A 5×5 filter is applied to a 640×480-pixel image. The filtering algorithm is presented in Listing 5. We parallelize the execution of the outer 2D loop that cycles through the points of the image. The 5×5 = 25 values composing the filter are stored within each PE. The loop body's computations require 75 tu (time units) to read the image data and 9 tu to write the filtered data. The computation takes approximately 975 tu to complete². The performance of the final architecture with respect to the number of PEs used is illustrated in Figure 5.
² Exact times in ns or ms depend on the technology used to implement the required hardware. To establish a common base for performance comparisons, we introduce the measurement unit of 1 tu, the time a typical RISC microprocessor needs to perform a simple operation such as an addition or a register load. Exact times can be calculated from the clock period allowed by the technology used for the implementation of the system's combinatorial circuits.
for (x=0; x<640; x++)
  for (y=0; y<480; y++) {
    red = green = blue = 0;
    for (i=0; i<5; i++)
      for (j=0; j<5; j++) {
        red   = red   + image[(x-5/2+i+640)%640][(y-5/2+j+480)%480].red   * filter[i][j];
        green = green + image[(x-5/2+i+640)%640][(y-5/2+j+480)%480].green * filter[i][j];
        blue  = blue  + image[(x-5/2+i+640)%640][(y-5/2+j+480)%480].blue  * filter[i][j];
      }
    Filtered[x][y].red   = min(max(red,0),255);
    Filtered[x][y].green = min(max(green,0),255);
    Filtered[x][y].blue  = min(max(blue,0),255);
  }
Listing 5. An image filtering algorithm
Fig. 5. (a, left) Performance estimation (total number of clock cycles) of the parallel implementation of the image filtering algorithm. (b, right) Speed-up (relative to the purely sequential approach) achieved for various numbers of PEs. (For numbers of PEs above those indicated in the figure, the curve bends due to memory conflicts.)
5.2 Combinatorial Optimization in Hardware/Software Codesign
A hardware/software codesign problem is defined as follows. We are given an architecture consisting of one microprocessor interconnected to an FPGA. The CPU and FPGA communicate through a shared memory. Let Tasks = {T1, T2, …, TN} be a set of given tasks of an initial specification and G = {Tasks, E} a directed acyclic graph whose edges E define data dependencies among tasks. For each task, we are given the time required for its execution in hardware (th) (FPGA) and in software (ts) (CPU), together with its algorithmic description. We need to find the partitioning strategy that leads to the optimal performance, given the constraint that at most 4 tasks can be programmed on the FPGA. Clearly, this is a combinatorial optimization problem. For each permutation of 4 tasks to be implemented on the FPGA, a simulator is executed that simulates the execution of the current configuration and outputs a performance estimate. This estimate is used as the evaluation function of the optimization algorithm. For this case study, we use a simple iterative climber as the basis of the optimization algorithm. Iterative climbers start from a random solution (a specific mapping of 4 tasks to be implemented on the FPGA), check its neighbours, and select the one which offers the greatest increase in the evaluation function. The neighbourhood is defined as the set of mappings that emerge from the current solution by exchanging any two tasks between the FPGA and the CPU. Since no diversification exists in the iterative
for (i=0; i<640; i++)
  for (j=0; j<640; j++)
    for (k=0; k<640; k++) {
      a[i][j][k] = max(a[i-1][j][k], a[i][j-1][k], a[i][j][k-1]);
      CurMapping = GetSelectedMapping();
      NewMapping = RandomMapping();
      BestNeighboorNew = EvaluateNeighboors(NewMapping);
      BestNeighboorCur = EvaluateNeighboors(CurMapping);
      if (BestNeighboorNew is better than BestNeighboorCur)
        a[i][j][k] = BestNeighboorNew;
      else
        a[i][j][k] = BestNeighboorCur;
    }
Listing 6. The 3 dimensional loop used for the combinatorial optimization problem
climber, the algorithm is very likely to be trapped in a local optimum. To refine the algorithm, we need to allow diversification (the algorithm may choose solutions that do not offer an increase in the evaluation function, in an effort to search other parts of the solution space). A three-dimensional loop is used for this approach (Listing 6). The algorithm in Listing 6 is parallelized using the proposed architecture. For a design problem of 30 tasks, the loop body's computations require 21 tu (time units) to read the solutions and 7 tu to write the computed solution. The computation takes approximately 1500 tu to complete. The performance of the final architecture with respect to the number of PEs used is illustrated in Figure 6.
Fig. 6. (a) Performance estimation (total number of clock cycles) of the parallel implementation of the combinatorial optimization algorithm. (b) Speed-up (relative to the purely sequential approach) achieved for various numbers of PEs. (For numbers of PEs above those indicated in the figure, the curve bends due to memory conflicts.)
6 Conclusion

An overview of the innovative features and claimed characteristics of the proposed architecture for parallelizing nested loops in reconfigurable platforms is listed below:

• It offers a sufficient performance improvement compared to the non-parallel hardware implementation of the nested loop, without sacrificing flexibility.
• To the best of our knowledge, it is the first hardware implementation of a dynamic scheduling algorithm for nested loops with dependencies targeting fine-grain parallelism.
• It encapsulates an innovative hardware implementation of the solution of the first-order Diophantine equation with unitary coefficients that can easily be reconfigured to handle vector spaces of various dimensions.
Those claims have been verified analytically, and the application of the proposed architecture has been tested in two case studies from the image processing and combinatorial optimization application domains.
References

1. Chang, H., Cooke, L., Hunt, M., Martin, G., McNelly, A., Todd, L.: Surviving the SOC Revolution. Kluwer Academic Publishers, Dordrecht (1999)
2. Li, Y., Callahan, T., Darnell, E., Harr, R., Kurkure, U., Stockwood, J.: Hardware-Software Co-Design of Embedded Reconfigurable Architectures. In: DAC 2000, pp. 507–512 (2000)
3. Economacos, G., Economacos, P., Poulakis, I., Panagopoulos, I., Papakonstantinou, G.: Behavioural synthesis with SystemC. In: Design Automation and Test in Europe Conference and Exhibition (DATE01), Munich, Germany, pp. 21–25 (2001)
4. Galanis, M.D., Dimitroulakos, G., Goutis, C.: Performance Improvements from Partitioning Applications to FPGA Hardware in Embedded SoCs. The Journal of Supercomputing 35, 185–199 (2006)
5. Moldovan, D.: Parallel Processing: From Applications to Systems. Morgan Kaufmann, San Mateo, CA (1993)
6. Cheng, C., Parhi, K.K.: A Novel Systolic Array Structure for DCT. IEEE Transactions on Circuits and Systems 52(7), 366–369 (2005)
7. Bednara, M., Teich, J.: Automatic Synthesis of FPGA Processor Arrays from Loop Algorithms. The Journal of Supercomputing 26, 149–165 (2003)
8. Kruskal, C.P., Weiss, A.: Allocating independent subtasks on parallel processors. IEEE Trans. on Software Engineering 11(10), 1001–1016 (1985)
9. Ciorba, F.M., Andronikos, T., Riakiotakis, I., Chronopoulos, A.T., Papakonstantinou, G.: Dynamic Multi Phase Scheduling for Heterogeneous Clusters. In: Proc. of the 20th IEEE International Parallel and Distributed Processing Symposium (IPDPS '06), Rhodes, Greece (2006)
10. Andronikos, T., Koziris, N., Papakonstantinou, G., Tsanakas, P.: Optimal Scheduling for UET-UCT Generalized n-Dimensional Grid Task Graphs. In: Proceedings of the 11th IEEE/ACM International Parallel Processing Symposium (IPPS97), Geneva, Switzerland, pp. 146–151 (1997)
11. Panagopoulos, I., Pavlatos, C., Dimopoulos, A., Papakonstantinou, G.: Hardware Solution of a First-Order Diophantine Equation. In: HERCMA 2007, Athens, Greece (to be presented)
An Automatic Design Flow for Mapping Application onto a 2D Mesh NoC Architecture

Julien Delorme

INSA/IETR Laboratory, 20 avenue des Buttes de Coesmes, 35043 Rennes Cedex, France
[email protected]
Abstract. Complex application-specific SoCs are often based on the Network-on-Chip (NoC) approach. NoCs have been under investigation for several years, and many architectures have been proposed. Generic NoCs often come with a synthesis tool, so that a solution can rapidly be tailored for a specific application implementation. The optimized mapping of cores onto a NoC and the optimized NoC configuration in terms of, for instance, topology, FIFO, and link sizes is a new research area which is now being investigated in depth. Validation and evaluation of solutions is often conducted through simulations, and comparisons between proposed optimization approaches are difficult, as each uses its own evaluation application. Benchmarking is a classical solution to normalize comparisons. In this paper we propose a complete design flow which performs an automatic Algorithm Architecture Adequation (AAA) onto a NoC architecture. This flow is based on a SystemC model simulation at TLM level. We illustrate this design flow with a benchmark of a 4G radiocommunication application.
1 Introduction
Future Systems-on-Chip (SoC) for multimedia, video, or telecommunication will contain a large number of IP blocks. All of these have to be connected together and require a high bandwidth to satisfy the Quality of Service (QoS). Existing interconnects (bus topologies) may no longer be feasible for SoCs with many IP blocks, due to dynamic communication requirements. The leading features for SoC design are scalability, flexibility, reusability, and reprogrammability. The Network-on-Chip (NoC) paradigm [1] has therefore been proposed and used to interconnect the cores, replacing the bus topology. A NoC interconnection has several advantages, including better structure, performance, and modularity. As a consequence, CAD tools have to explore the NoC parameters before synthesis with respect to the target application requirements. Our proposition is an automatic design flow for application mapping onto a NoC architecture. The adequation of the application to the architecture constraints is made automatically through a heuristic-based algorithm. The flow automatically generates the NoC SystemC model to perform simulations at transaction level (TLM). The proposed benchmarking application is a mobile terminal MC-CDMA chain applied to the future 4G radio telecommunication standards. This application is divided into twenty IP cores which have been evaluated separately. Throughput, data size, and treatment latency are provided for each computing resource. With such parameters it is possible to conduct a full NoC architectural exploration. This paper illustrates the design flow of our exploration tool. The flow also gives the designer the possibility to specify the constraints of the architecture manually along with the application.

N. Azemard and L. Svensson (Eds.): PATMOS 2007, LNCS 4644, pp. 31–42, 2007. © Springer-Verlag Berlin Heidelberg 2007
2 Context
Regular tile-based NoC architectures are becoming the communication medium of the next SoC generation. Such a communication structure consists of regular tiles, where each tile can be a general-purpose processor, a DSP, a memory subsystem, or a dedicated IP block. A router is embedded within each tile to connect it to its neighboring tiles; inter-router communication is then achieved by routing packets. Given a target application described as a set of concurrent tasks which have been assigned and scheduled, the fundamental question in exploiting such an architecture is how to make the adequation between the application tasks and the architecture. Each task of the application graph must be placed so as to minimize the routing path length and the communication latency. Figure 1 shows the mapping and routing problem on a NoC architecture.
Fig. 1. 2D mesh architecture mapping and routing problems (a communication task graph T1–T6 to be mapped and routed onto a 4×4 2D mesh NoC architecture)

Fig. 2. NoC architecture (router with NORTH, SOUTH, EAST, and WEST ports plus one port pair — IP: input port, OP: output port — reserved for the functional block, connected through its network interface (NI) and arbiter; each port carries DATA with SEND/ACCEPT handshake signals)
The mapping and routing problems described in the figure above represent a new challenge, especially in the context of regular tile-based architectures, as they significantly impact the performance metrics of the system. In this paper, we focus on this mapping problem by proposing an efficient algorithm. This algorithm proposes a suitable solution for mapping and routing IP blocks on a regular 2D mesh NoC. In Section 3, we describe the target architecture used with its imposed constraints. Then, Section 4 details the algorithm employed in our design flow to tackle the problem of mapping and routing path allocation. Section 6 describes the 4G radiocommunication experiment. Experimental results in Section 7 show that significant mapping and routing trade-offs can be achieved while guaranteeing the specified system performance.
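The minimal routing paths on such a 2D mesh can be sketched in a few lines. The following is an illustrative XY-routing example of our own (not the algorithm of the flow): a packet first travels along the X dimension, then along Y, crossing exactly the Manhattan distance in links.

```python
# Minimal XY routing on a 2D mesh: route along X first, then along Y.
# Routers are identified by (x, y) coordinates; a link is a pair of
# adjacent router coordinates. Illustrative sketch, not the FAUST code.

def xy_route(src, dst):
    """Return the ordered list of links of the XY-minimal path src -> dst."""
    (x, y), (dx, dy) = src, dst
    path = []
    while x != dx:                      # horizontal leg first
        nx = x + (1 if dx > x else -1)
        path.append(((x, y), (nx, y)))
        x = nx
    while y != dy:                      # then the vertical leg
        ny = y + (1 if dy > y else -1)
        path.append(((x, y), (x, ny)))
        y = ny
    return path

# A minimal path crosses exactly the Manhattan distance in links.
links = xy_route((0, 0), (2, 3))
assert len(links) == 5
```

Such a route function is what the bandwidth accounting of Section 4 iterates over when it charges each arc of the application graph to the links of its path.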
3 Platform Description
NoCs are emerging solutions for future on-chip interconnections, and a lot of investigation has been done in this field over the last ten years. NoCs are flexible and offer significant aggregated bandwidth, since they allow simultaneous communications. As mentioned earlier, a solution based on independent data processing units interconnected by a NoC seems the most appropriate for a high-data-rate baseband system. In data processing applications, most of the modules communicate mainly with a few others. As a consequence, the network does not need a higher-order topology than a 2D mesh [2]. Also, data processing units are specialized and traffic is rather predictable. Finally, communications are often short and bursty, so packet switching is preferable to circuit switching, since no time-consuming connection setup is needed. As a consequence, the FAUST NoC [3] uses the wormhole switching mode [4] with credit-based flow control. It has low latency, saves memory buffers and, with an appropriate routing algorithm, deadlocks are avoided [5]. The QoS (Quality-of-Service) employed in this NoC is of type BE (Best Effort). As a consequence, there is no bandwidth guarantee for communications. This QoS offers better bandwidth exploitation of each link between routers compared to GT (Guaranteed Traffic). GT [6,7] guarantees bandwidth reservation for communications, at the cost of low link exploitation and added complexity in the routers or NIs. Functional units are connected to the network through a network interface (NI). The NI manages the NoC communication mechanism by using credits to fill the input FIFOs and empty the output FIFOs of the application cores (HW or SW). The NoC architecture is well adapted to our radiocommunication application because it decouples computing from communication. The NoC architecture used is composed of 5 ports per router, of which one is reserved for the IP block connection (see Figure 2).
In the context of radio telecommunication applications, specific characteristics can be taken into account to obtain a well-adapted network. Thus, to reach the performance constraints of the application mapped onto the architecture, it is necessary to carefully define the mapping and routing algorithm. This algorithm has to take into account the bandwidth requirements between each vertex of the application graph. Such an algorithm is mandatory given the BE QoS used in the NoC, where no guarantees on communications are possible.
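The credit-based flow control mentioned above can be illustrated with a toy model (a simplification of ours, not the FAUST implementation): a FLIT may only be injected when the sender holds a credit for a free slot of the downstream FIFO, and draining the FIFO returns a credit upstream.

```python
# Toy credit-based flow control between a sender and a receiver FIFO.
# The sender holds one credit per free slot of the receiver's input FIFO;
# sending consumes a credit, and draining the FIFO returns one.

from collections import deque

class CreditLink:
    def __init__(self, fifo_depth):
        self.credits = fifo_depth      # initially, all slots are free
        self.fifo = deque()

    def send(self, flit):
        if self.credits == 0:
            return False               # back-pressure: sender must stall
        self.credits -= 1
        self.fifo.append(flit)
        return True

    def drain(self):
        flit = self.fifo.popleft()     # receiver consumes one FLIT...
        self.credits += 1              # ...and returns a credit upstream
        return flit

link = CreditLink(fifo_depth=2)
assert link.send("f0") and link.send("f1")
assert not link.send("f2")             # FIFO full: transfer stalls
link.drain()
assert link.send("f2")                 # credit returned, sending resumes
```

This is why the NI FIFO depth matters so much later in Section 7: a deeper FIFO means more credits in flight and fewer sender stalls.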
4 Problem Formulation

Simply stated, for a given application, we try to find onto which router each IP should be mapped with the shortest communication paths (packet and credit paths). This criterion is evaluated against the bandwidth reservation of each link, so that the theoretical bandwidth of the links between routers is never violated. To formulate this problem more formally, we define the following terms:

Definition 1. An application graph (APG) G = G(C, A) is a directed graph where each vertex ci represents a selected IP block, and each directed arc ai,j represents the communication from vertex ci to vertex cj. Each ai,j has the following properties:

– v(ai,j): arc volume from ci to cj, which stands for the amount of bits transmitted for the communication from ci to cj.
– b(ai,j): arc bandwidth requirement from ci to cj, which stands for the bandwidth (bits/sec) requested between ci and cj to meet the timing constraints of the application.

Definition 2. An architecture graph (ARG) G = G(T, P) is a directed graph where each vertex ti represents a router of the architecture, and each directed arc pi,j represents the routing path between ti and tj. Each pi,j has the following properties:

– Pi,j: the set of minimal paths between the two routers ti and tj.
– b(li,j): the bandwidth available on the link between the routers ti and tj.

Using these definitions, the performance-aware mapping and routing-path allocation problem can be formulated as follows:

    Size(APG) ≤ Size(ARG)                                          (1)

The mapping function map() from the APG to the ARG is defined with respect to the criteria below:

    ∀ ci ∈ C, map(ci) ∈ T                                           (2)
    ∀ ci ≠ cj ∈ C, map(ci) ≠ map(cj)                                (3)
    ∀ lk, B(lk) ≥ Σ∀ai,j b(ai,j) × f(lk, p(map(ci), map(cj)))        (4)

where B(lk) is the maximal bandwidth of a link lk, and:

    f(lk, pm,n) = 0 if lk ∉ L(pm,n),  1 if lk ∈ L(pm,n)              (5)

These conditions mean that each IP block can be placed on only one router, and that the mapping must not violate the theoretical bandwidth of any link between the routers of the architecture. This method is based on the work developed in [8], [9], [10].
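Constraints (2)–(5) can be checked for a candidate mapping with a short routine. The sketch below is our own illustration (function and variable names are ours, and a simple 1D route function stands in for the set of minimal mesh paths); it verifies that the placement is injective and that no link's capacity is exceeded by the summed arc bandwidths routed over it.

```python
# Sketch of the feasibility check behind constraints (2)-(5): a mapping is
# valid if every IP block sits on its own router and, for every link l_k,
# the sum of b(a_ij) over arcs whose route crosses l_k stays within B(l_k).

def mapping_is_valid(arcs, mapping, route, link_capacity):
    """arcs: {(ci, cj): bandwidth}, mapping: {ip: router},
    route(ti, tj): list of links, link_capacity: B(l_k), same for all links."""
    # Constraints (2)/(3): injective placement of IP blocks onto routers.
    if len(set(mapping.values())) != len(mapping):
        return False
    load = {}
    for (ci, cj), bw in arcs.items():
        # f(l_k, p) = 1 exactly for the links of the chosen routing path.
        for link in route(mapping[ci], mapping[cj]):
            load[link] = load.get(link, 0) + bw
    # Constraint (4): no link may exceed its theoretical bandwidth.
    return all(bw <= link_capacity for bw in load.values())

def line_route(ta, tb):
    """Toy route function: minimal path on a 1D chain of routers."""
    lo, hi = min(ta, tb), max(ta, tb)
    return [(k, k + 1) for k in range(lo, hi)]

arcs = {("A", "B"): 60, ("B", "C"): 60}
mapping = {"A": 0, "B": 1, "C": 2}
assert mapping_is_valid(arcs, mapping, line_route, link_capacity=100)
assert not mapping_is_valid(arcs, mapping, line_route, link_capacity=50)
```

The heuristic of the flow explores candidate placements and routing paths until such a check passes; the check itself is only the feasibility test, not the search.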
5 The Design Flow
The place-and-route algorithm presented in Section 4 is integrated in our design flow. Our flow allows creating and simulating a NoC, with an application mapped onto it, as a SystemC description at transaction level. These explorations are mandatory to help the designer specify the hardware description of the NoC before starting the real implementation. The flow therefore has two input entries: the automatic generation managed by the algorithm presented above, and the semi-automatic generation managed by an Excel description. This second option is necessary in cases where the designer has specific constraints, especially on the hardware part. The proposed flow is presented in Figure 3.

Fig. 3. The design flow (automatic generation: the ARG and APG feed the AAA step; semi-automatic generation: manual specification of the constraints in Excel; followed by the structure of the UTs and NoC with placement, configuration validation of the NIs, hardware generation and software configuration of the NoC, validation of the configurations by the control processor, launch of simulations with parameter adjustments based on router and UT performance reports (XML), and finally emulation on an FPGA platform)
This flow allows studying the performance of the UTs and the routers, with the aim of verifying that the timing constraints of the application mapped onto the NoC are respected. Architectural and applicative parameters can be adjusted, through a loop back to the initial conditions, until the timing constraints are reached. The flow will also allow generating the equivalent VHDL code automatically, to accelerate the exploration phases. Another aspect of our flow is to provide a routing algorithm which can manage two different contexts: mono-component or multi-component. The second context is presented within the framework of the European project 4MORE, in which we were involved. Due to the specific hardware constraints induced by the project, we have used the semi-automatic generation mode of our flow. This mode offers the designer the possibility to specify the hardware and software constraints manually.
6 4G Radio Communication Application

6.1 4G Context Specification
Future telecommunication systems need to be more and more powerful, with greater bandwidth, higher mobility, and autonomy, to answer the needs of a growing market. 3G standard technology is fully operational, but researchers are already working to specify the fourth generation of mobile systems. Performance and optimized power consumption are still the key factors for these systems. These new systems have to be highly flexible and have to support different standards to allow the evolution and updating of SoCs. Such constraints imply a radical change in present design methodologies for future SoC designs. A high-performance candidate for future mobile systems is the Multi-Carrier Code Division Multiple Access (MC-CDMA) technique [11]. This new modulation technique brings new algorithms and computation constraints, and designers have to deal with these requirements for implementation. We focus here on the physical layer, precisely on the implementation of an MC-CDMA transmitter and receiver in a baseband modem. This baseband modem implements different chained processing sub-systems, schematically presented in Figure 4. The different functions in the physical layer impose strong constraints on the IP blocks concerning computational power, performance, and complexity. Our proposition is to plug each IP block onto the NoC in adequacy with the hardware resources available for the final demonstrator. The 4MORE project demonstrator is composed of four components: two ASICs and two FPGAs. The SystemC simulations have to fit these requirements in order to be as realistic as possible with respect to the hardware platform implementation.

6.2 IP Block Model
As mentioned before, the idea of our simulations is to validate the 4G chain presented in Figure 4 with the IP block models proposed in Figures 6 and 7. Indeed, each resource requires different input data sizes and produces different output data sizes. Moreover, the resources do not have the same data treatment time, implying different, data-dependent latencies between blocks. As a consequence, in a NoC implementation, this kind of evaluation is mandatory to respect the real-time constraints. This is especially true for this application, where we have to respect the frame time conditions mentioned in Figure 5. As specified before, the NoC uses a wormhole packet-switching technique with two virtual channels and a BE QoS [12]. This QoS makes simulations mandatory to take decisions on topology, data paths, credit and data amounts, and hardware dimensioning. Thus, all the implemented resources are parameterized with the necessary parameters:

– Input data size
– Output data size
– Computing time
– Input data path (source)
– Output data path (destination)
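These per-resource parameters can be captured in a small record for the simulation (the field names are our own illustration; the flow itself takes them from the Excel description). The FFT values used below come from the TX table of Fig. 6.

```python
# Parameter record for one computing resource, as fed to the TLM simulation.
# Sizes are in FLITs (32-bit words) and compute time in clock cycles, so the
# model stays independent of silicon technology and design frequency.

from dataclasses import dataclass

@dataclass
class ResourceParams:
    name: str
    input_flits: int        # input data size required to start treatment
    output_flits: int       # output data size produced per treatment
    compute_cycles: int     # data treatment time, in clock cycles
    source: str             # input data path (producer resource)
    destination: str        # output data path (consumer resource)

# Example: the TX FFT of Fig. 6 (24 FLITs in, 1280 out, 2620 cycles);
# the source/destination names are hypothetical labels.
fft = ResourceParams("FFT", input_flits=24, output_flits=1280,
                     compute_cycles=2620, source="MIMO_enc",
                     destination="BB_to_RF")
assert fft.output_flits * 32 == 40960   # FLITs are 32-bit words
```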
Fig. 4. Block diagram of the TX and RX MC-CDMA physical layer (TX: data user → channel coder → interleaver → CBS → AMRC → MIMO encoder → iFFT → base band to RF; RX: RF to base band → ROTOR → CFO → channel estimator → FFT → MIMO decoder → AMRC⁻¹ → CBS⁻¹ → deinterleaver → channel decoder → data user)

Fig. 5. 4G timing constraints of the frame (alternating TX and RX slots separated by guard times, with TG = 20.8 µs, TSLOT = 0.667 ms, and TOFDM = 20.8 µs)
With this information, we can model the global behavior of each block inside the chain without modeling the algorithms used precisely. All input and output data sizes are expressed in FLITs (32 bits) and the compute time is in clock cycles, in order to be independent of the silicon technology and the design frequency. All these parameters specify the core treatment of each block of the TX and RX chains, but do not represent the treatment of a complete OFDM symbol for the whole slot. Some particular cases with long treatment times, like the FFT, can be a bottleneck for satisfying the real-time constraints of the whole frame. Another constraint that has to be pointed out is the hardware constraint. In the project, an ASIC using the FAUST NoC and integrating some of the already available IP blocks of the chain has been realized. As a consequence, the final demonstrator of the 4MORE project is composed of several components (2 ASICs and 2 FPGAs), and the hardware constraints are presented in Figure 8 below. The flexibility of the flow thus makes it possible to check that the timing constraints are respected and to study the congestion which could appear at the I/O level between the components. Finally, with these constraints, we present in
Fig. 6. TX chain resource parameters:

    Resource           Input Data   Output Data   Compute Time
    Channel decoder    1            2             64
    Bit Interleaving   96           96            6144
    Mapping            1            6             6
    Spreading          8            8             48
    MIMO encoding      48           48            50
    FFT 1024           24           1280          2620
    Base band to RF    1            0             10

Fig. 7. RX chain resource parameters:

    Resource                 Input Data   Output Data   Compute Time
    RF to BB 1 & 2           1280         1280          0
    ROTOR 1 & 2              672 + 1      672           672
    CFO 1 & 2                23           1             23
    MIMO channel est 1 & 2   1344         2688          2688
    IFFT 1 & 2 (1024pt)      1280         695           2620
    MIMO decoding 1 & 2      4032         1344          1344
    De-spreading             672          84            4032
    De-mapping               12           12            10
    De-interleaving          2016         2016          2016
    Conv Decoder             8            1             32

Fig. 8. Hardware constraints of the 4MORE project (the TX and RX chain IP blocks are distributed over four components — ASIC 1, ASIC 2, FPGA 1, and FPGA 2 — each embedding a CPU, RAMs, an external RAM controller, an Ethernet interface, and a NoC performance monitor, with RF front-end blocks such as BB-to-RF, RX RAM RF, and CFO attached)
the next section the feasibility of the integration of this 4G radiocommunication MIMO MC-CDMA chain in the context of the final demonstrator of the 4MORE project using a NoC.
7 Simulation Results

7.1 Simulation Exploration Methodology
Our explorations are based on the validation of the 4G radio application on the FAUST NoC, through SystemC model generation. The proposed design flow
Fig. 9. NoC mapping and validation methodology (assessment of I/O → topology decision on the FPGA → study of data paths → FIFO dimensioning → credit and data packet amount modification → decision on NoC frequency)
offers enough flexibility to rapidly reach the tradeoff that demonstrates the feasibility of the adequation. These explorations have to be realistic with respect to the implementation on the final hardware demonstrator structure. Our study was made with the aim of validating a final demonstrator composed of four components (two ASICs and two FPGAs). The IP block mapping topology is already fixed inside the ASICs, and flexibility is only available in the FPGA components. As a consequence, to match the timing constraints, our methodology is based on the load of the router links to make decisions on the FPGA topology, as shown in Figure 9. After balancing the traffic load on the I/O ports of the four components, throughput is improved by modifying the NI parameters of each IP block. More precisely, we highlight the impact of the NI FIFO sizes on the global performance of the NoC. Modifying the FIFO sizes inside the NIs impacts the NoC traffic congestion (especially for a NoC with BE QoS). This criterion allows realizing a pipeline for the whole application, which leads to a modification of the global throughput of the NoC. Increasing the FIFO sizes is a tradeoff between timing performance on one side and hardware cost and energy consumption on the other. This methodology has been applied to our 4G application, and we present in the next section the results obtained on the RX part. In our explorations, we check that the latency induced by the NoC communication mechanism, plus the compute time of the resources, respects the timing frame constraint of 20.8 µs (the timing constraint of the frame, Figure 5) for each OFDM symbol. The simulations have to demonstrate that the system fulfils the real-time requirements while taking into account the hardware constraints. These explorations have been realized using the semi-automatic generation of the flow, for the advantages mentioned previously.

7.2 Simulation Results
We focus here on the RX part, as the TX timing constraint is looser due to a lower data density. We check the respect of the timing constraints of the radiocommunication
frame, focusing on the treatment time of each OFDM symbol. The treatment time represents the global latency of each IP block inside the NoC and is defined as:

    T = Ti + Tt + To                                                (6)

where Ti is the time elapsed between the first FLIT read and the last FLIT read in the input FIFO until the required amount is satisfied, Tt is the data treatment time, and To is the time elapsed between the first FLIT and the last FLIT sent from the output FIFO to the target resource. To fulfil the real-time frame, T must be under 20.8 µs. In a first step, we fixed the frequency of both components at 125 MHz to explore the impact of the NI FIFO sizes on the communication latencies (the global performance of the application). The ASIC NI FIFOs are fixed to 16, except for the OFDM modulation, whose NI FIFO sizes can be modified. Considering the real implementation, this exception is allowed by the fact that this module integrates a memory which makes it possible to store a whole OFDM symbol. The NI FIFOs of the FPGAs, on the other hand, can be increased thanks to the flexibility offered by their structure. The tradeoff between frequency and NI FIFO sizes is presented in Figure 10.
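The real-time check of equation (6) reduces to simple arithmetic. The sketch below is illustrative (the pairing of cycle counts with frequencies is our own example, not a measured result); only the 20.8 µs bound comes from the frame of Fig. 5.

```python
# Real-time check for one resource: T = Ti + Tt + To must stay under the
# OFDM symbol period of 20.8 us (frame constraint of the 4G application).

FRAME_CONSTRAINT_NS = 20_800            # 20.8 us expressed in ns

def treatment_time_ns(t_in_ns, t_compute_ns, t_out_ns):
    """Ti: fill input FIFO, Tt: data treatment, To: drain output FIFO."""
    return t_in_ns + t_compute_ns + t_out_ns

def meets_frame_constraint(t_in_ns, t_compute_ns, t_out_ns):
    return treatment_time_ns(t_in_ns, t_compute_ns, t_out_ns) < FRAME_CONSTRAINT_NS

# Illustration: 2620 compute cycles at 125 MHz (8 ns/cycle) = 20960 ns would
# already violate the bound even with zero FIFO latency...
assert not meets_frame_constraint(0, 2620 * 8, 0)
# ...while the same treatment at 175 MHz (~5.71 ns/cycle) would fit.
assert meets_frame_constraint(0, int(2620 * 1000 / 175), 0)
```

This is consistent with the remark in Section 6.2 that long-latency blocks such as the FFT can be the bottleneck for the whole frame.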
Fig. 10. RX OFDM treatment time performance when increasing the NI FIFO sizes from 16 up to 2048 FLIT at 125 MHz (average resource treatment time, in ns, vs. OFDM NI FIFO size in bits, for OFDM 1, OFDM 2, and the frame constraint)

Fig. 11. RX OFDM treatment time performance with 16 and 1024 FLIT NI FIFOs vs. increase of frequency
The upper frequency that can be reached is 175 MHz, fixed by the ASIC constraints. As a consequence, we have chosen to increase the NI FIFO sizes in order to reduce the communication latencies, instead of increasing the frequency. A 1024 FLIT deep NI FIFO significantly reduces the communication latencies (a size close to that of the OFDM symbol). To complete these results, we have explored the impact of frequency variations when the NI FIFO sizes are fixed to 16 and 2048 FLIT. The results obtained are shown in Figure 11. With these results, we have the possibility to save hardware resources (SoC running at a frequency above 250 MHz) or to save electrical power consumption
(SoC running at a frequency above 125 MHz) while respecting the real-time constraints of the application. As a consequence, taking into account the maximum frequency of the ASIC, we have fixed the NI FIFO size inside the FPGAs to 1024 FLIT. These constraints have been validated and integrated in the hardware platform of the final demonstrator of the project. In this section we have presented the impact of the NI FIFO sizes on the global performance of a NoC running at a fixed frequency. These results point out that, for a NoC with BE QoS, a tradeoff between hardware resources and NoC frequency has to be made. Consequently, taking into account the ASIC specifications, we have made the choice of a hardware-consuming design. We plan, however, to improve these studies by analyzing each FIFO more precisely, in order to adapt its size to its load while keeping the same performance.
8 Conclusion and Future Works
In this paper we have presented our design flow for dimensioning a NoC with BE QoS. This flow offers enough flexibility to adjust the various parameters of the NoC in order to find the best tradeoff with respect to the timing constraints of the application. Validation of the flow has been made using a 4G application as a benchmarking approach to evaluate the mapping of cores. We have shown the feasibility of mapping a 4G chain onto a NoC system under hardware restrictions (FIFO sizes and frequency). We have also highlighted the performance gain obtained by increasing the NI FIFOs, compared to varying the NoC frequency. Our future exploration is to check the impact of different frequencies on the ASICs and FPGAs on the global performance (study of node bottlenecks in the case of different frequencies). The next exploration will be to take the NI FIFO load into account in order to optimize precisely the size of each FIFO of the IP blocks mapped onto the NoC (low hardware consumption). We are already working on the validation of the automated mapping and routing heuristic for mapping IPs onto the NoC.
References

1. de Micheli, G., Benini, L.: Networks on chip: A new paradigm for systems on chip design. In: DATE '02: Proceedings of the Conference on Design, Automation and Test in Europe, p. 418. IEEE Computer Society Press, Washington, DC, USA (2002)
2. Third, M.M.R., Jantsch, A.: Evaluating NoC communication backbones with simulation. In: IEEE NorChip Conference, November 2003. IEEE Computer Society Press, Los Alamitos (2003)
3. Durand, Y., Bernard, C., Lattard, D.: FAUST: On-chip distributed architecture for a 4G baseband modem SoC. In: IP-SOC 2005 (December 2005)
4. Felperin, P.R.S., Upfal, E.: A theory of wormhole routing in parallel computers. IEEE Transactions on Computers, 704–713 (1996)
5. Glass, C.J., Ni, L.M.: The turn model for adaptive routing. Journal of the ACM 41(5), 874–902 (1994)
6. Hansson, A., Goossens, K., Radulescu, A.: A unified approach to constrained mapping and routing on network-on-chip architectures. In: CODES+ISSS '05: Proceedings of the 3rd IEEE/ACM/IFIP International Conference on Hardware/Software Codesign and System Synthesis, pp. 75–80. ACM Press, New York, USA (2005)
7. Goossens, K., Dielissen, J., Gangwal, O.P., Pestana, S.G., Radulescu, A., Rijpkema, E.: A design flow for application-specific networks on chip with guaranteed performance to accelerate SoC design and verification. In: DATE '05: Proceedings of the Conference on Design, Automation and Test in Europe, pp. 1182–1187. IEEE Computer Society Press, Washington, DC, USA (2005)
8. Hu, J., Marculescu, R.: Exploiting the routing flexibility for energy/performance aware mapping of regular NoC architectures. In: DATE '03: Proceedings of the Conference on Design, Automation and Test in Europe, p. 10688. IEEE Computer Society Press, Washington, DC, USA (2003)
9. Hu, J., Marculescu, R.: Energy-aware mapping for tile-based NoC architectures under performance constraints. In: Proceedings of the ASP-DAC 2003, Asia and South Pacific Design Automation Conference, pp. 233–239. IEEE Computer Society Press, Washington, DC, USA (2003)
10. Hu, J., Marculescu, R.: Energy-aware communication and task scheduling for network-on-chip architectures under real-time constraints. In: DATE '04: Proceedings of the Conference on Design, Automation and Test in Europe, p. 10234 (2004)
11. Chouly, A.B.A., Jourdan, S.: Orthogonal multicarrier techniques applied to direct sequence spread spectrum CDMA systems. In: GLOBECOM '93, pp. 1723–1728 (1993)
12. Rijpkema, E., Goossens, K.G.W., Radulescu, A., Dielissen, J., van Meerbergen, J., Wielage, P., Waterlander, E.: Trade offs in the design of a router with both guaranteed and best-effort services for networks on chip. In: DATE '03: Proceedings of the Conference on Design, Automation and Test in Europe, p. 10350. IEEE Computer Society Press, Washington, DC, USA (2003)
Template Vertical Dictionary-Based Program Compression Scheme on the TTA

Lai Mingche, Wang Zhiying, Guo JianJun, Dai Kui, and Shen Li

School of Computer, National University of Defense Technology, Changsha, P.R. China
[email protected]
Abstract. As a critical technology in today's embedded systems, program code compression can improve code density and reduce power consumption. For the Transport Triggered Architecture (TTA) in particular, the long instruction word is one of the key problems degrading processor performance. In this paper, based on an analysis of the spatial locality of the data transports, a template vertical dictionary-based program compression scheme is proposed. It not only efficiently eliminates the redundant empty slots as well as the invalid long-immediate encodings, but also applies vertical dictionary-based compression at the slot level. The experiments show that this scheme achieves a compression ratio of 32.3% while requiring only a tiny dictionary. The effects on area and power consumption are also measured: the total area of the processor core and the local instruction memory can be reduced by about 29%, and the power consumption by nearly 25%. Keywords: transport triggered architecture, code compression, spatial locality, dictionary-based, power consumption.
1 Introduction

With all kinds of electronic equipment widely used in PDAs and automobiles, Very Long Instruction Word (VLIW) architectures have gained considerable popularity in embedded systems. Nowadays, typical embedded DSPs such as the Trimedia series [1] from Philips and the TMS320C64 series from Texas Instruments [2] follow the VLIW architecture, where, with effective compiler assistance, the multiple operations packed into a long instruction word are issued to concurrently operating functional units to exploit the instruction-level parallelism. However, this instruction encoding leads to poor code density [3]. First, generating enough operations in a code fragment always requires loop unrolling and software pipelining, thereby increasing the code size. Second, the unused functions translate to wasted bits in the instruction encoding.

N. Azemard and L. Svensson (Eds.): PATMOS 2007, LNCS 4644, pp. 43–52, 2007. © Springer-Verlag Berlin Heidelberg 2007

The TTA proposed by Corporaal can be viewed as a superset of the traditional VLIW architecture [4]. Its fine-grained and flexible programming model is particularly suitable for customizing the instructions and tailoring the hardware resources according to the requirements of the applications. However, TTA processors also suffer from poor code density [5]. Especially with the extension of the target
applications, its automatic design process [8] will upgrade the architecture configuration, which leads to a longer instruction word as well as a need for larger memory resources and, consequently, to a system where the memory might consume more area than the processor core. In addition, poor code density increases the program I/O bandwidth, which seriously increases the power dissipation. Several program compression methods have been proposed for the TTA. In one of the approaches, entropy encoding [5] was used to improve the code density, where a compression ratio of nearly 60% could be achieved. Then, empty slots were eliminated from the instructions by using templates in [6]; the compression ratio was reported to be around 40%. Finally, Heikkinen evaluated the dictionary-based compression method on the TTA [7]. Frequent words were stored in a dictionary at different levels. The experimental results in [7] showed that this compression was not well suited to the instruction level, while the best compression ratio was achieved at the slot level, where one instruction includes multiple move slots, and a compression ratio of 0.53 was achieved. In this paper, the generated codes are first analyzed during the ASIP design process [8]. Then, exploiting the abundance of empty slots and the spatial locality of the data transports, a template vertical dictionary-based compression that achieves a compression ratio of 32.3% is proposed. The hardware prototype is implemented in a 180 nm CMOS technology. This prototype not only has a tiny dictionary, but also performs run-time decompression and obtains power consumption benefits. This paper introduces the TTA in Section 2. The code characteristics analysis is performed in Section 3. Then, a new code compression scheme and its implementation are presented in Sections 4 and 5. Section 6 illustrates the experimental results. Finally, the paper concludes with Section 7.
2 Transport Triggered Architecture
The general structure of the TTA is very simple, as shown in Fig. 1. The major difference between TTA and VLIW architectures is the way an operation is executed. In the TTA model, the program specifies only the data transports on the interconnection network; operations occur as a “side effect”. Thus, the fact that there is only one transport template in the instructions results in the SIMT (Single Instruction Multiple Transports) model of TTA programming [12]. As shown in Fig. 1, four function units (FUs) and one register file (RF) are connected to four buses by sockets. Every FU or RF has one or more operator registers (ORs), result registers
Fig. 1. General Structure of TTA
Template Vertical Dictionary-Based Program Compression Scheme on the TTA
(RRs), but only one trigger register (TR). Data transferred to the trigger register triggers the FU to operate. A function unit can execute different operations, indicated by operation codes. For example, one addition operation is divided into three data transports. First, the values of r1 and r2 are written into the operator register and the trigger register of the integer ALU, and the data transferred to the trigger register triggers this function unit. Second, the add operation executes and the result is transferred from the result register to r3:

ADD r3, r2, r1  =>  r1 -> ADD_O; r2 -> ADD_T
                    ADD_R -> r3

Then, as shown in Fig. 1, the connections between the network and the FUs are called input and output sockets. An input socket can be viewed as a gateway that selects one data item for the operator registers or the trigger register, while an output socket is responsible for driving the result value onto the buses. For the further discussion of the TTA instruction template, the input sockets and output sockets of an individual function unit or register file are redefined as entry ports and exit ports respectively, as illustrated in Fig. 1.
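As a toy illustration (in Python) of how a transport to the trigger register fires the operation as a side effect — the register and port names are simply those of the ADD example above:

```python
# Toy model of transport-triggered execution: writing the trigger
# register (ADD_T) fires the function unit as a side effect.
class AddUnit:
    def __init__(self):
        self.operand = 0   # operator register ADD_O
        self.result = 0    # result register ADD_R

    def write_trigger(self, value):
        # data transferred to the trigger register starts the operation
        self.result = self.operand + value

regs = {"r1": 5, "r2": 7, "r3": 0}
alu = AddUnit()
alu.operand = regs["r1"]          # r1 -> ADD_O
alu.write_trigger(regs["r2"])     # r2 -> ADD_T (triggers the add)
regs["r3"] = alu.result           # ADD_R -> r3
assert regs["r3"] == 12
```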
3 Code Characteristics Analysis
As illustrated in Fig. 2, in order to support parallel operations in a single TTA instruction, the TTA instruction template includes multiple move slots, and every slot defines the data transports on the corresponding bus. Each move slot contains three fields, as illustrated in Fig. 2. The guard field specifies the guard value that controls whether the data transport on the bus is executed. The destination field specifies the destination of the transport, e.g. the operator registers, the trigger registers or the RF write ports. Note that, because the destination field is composed of the port address and the operation code as shown in Fig. 2, the encodings of the destination fields of transports to the same target FU are consecutive, since the target unit has a single entry port address, as illustrated in Fig. 1. The source field similarly encodes the result registers and the RF read ports. As shown in Fig. 2, the upper half of the source space is occupied by the short immediate, so the source field can be interpreted as an immediate flag followed by either a register id or a short immediate. This paper studies the TTA embedded processor architecture according to the ASIP design flow [8]. First, the ASIP methodology analyzes the characteristics of the embedded applications, including the 64-point complex FFT, the IDCT with 8×8
Fig. 2. TTA instruction template
Table 1. Hardware resources of the two TTA processor configurations
Conf. | Function units                                                                              | Register files        | Buses | Instr. width
A     | 2 ALU, 1 multiplier, 1 compare, 1 jump, 1 ld/st, 2 logic&shifter, 1 float unit, 1 float division | 16×32b/bank, 2 banks | 6     | 152 bits
B     | 4 ALU, 1 multiplier, 2 compare, 1 jump, 2 ld/st, 2 logic&shifter, 1 float unit, 1 float division | 16×32b/bank, 4 banks | 8     | 200 bits
block, MPEG2 decoder and JPEG compression [10]. Second, the TTA processors are designed using the design space explorer of the TTA framework. Two configurations are chosen, as shown in Table 1: a compromise between cost and performance (A) and a high-performance configuration (B). To exploit the characteristics of the TTA transports in the generated code [12], this paper uses the software simulator to display the percentage of the various data transports
(i = 1, …, N). As illustrated in Fig. 3, we divide all the data transports into several groups, including the empty transports, the short immediate transports, the transports between FUs and RFs, etc. From the percentages displayed in Fig. 3, it is clear that the programs contain parts where data dependencies limit the parallelism, resulting in sequences of instructions that contain a lot of empty transports.
Fig. 3. Percentage distribution of the different transports
Next, the spatial locality of the data transports in the TTA architecture is examined. Because of the transport-triggered characteristic, and especially the effective scheduling measures, most of the data transports are concentrated on a limited set of data paths. As shown in Fig. 4, the statistics from the simulation of mpeg2dec on configuration A illustrate this spatial locality: a large fraction of the data transports is concentrated on the few data paths whose transport counts exceed 200, while the other data paths are seldom used in our simulation. Due to this property, the transport vector <exit port, entry port>, where exit port and entry port represent the port addresses of the source unit and destination unit respectively as shown in Fig. 1, can be used as a dictionary item in the following compression scheme. Next, Table 2 displays the number of data transports involved in these limited data paths. It is evident that the number of data transports covered by the N most frequent path vectors (i = 1, …, N) increases with N, and the most frequent path vectors cover most of the data transports when N equals 64.
Fig. 4. Transport distribution of the mpeg2dec (A)

Table 2. The number of data transports involved in the most frequent vectors
            | Total | N=4  | N=8  | N=16  | N=32  | N=64  | N=128
fft(B)      | 582   | 99   | 186  | 291   | 430   | 559   | 582
idct(A)     | 671   | 120  | 228  | 369   | 550   | 657   | 671
mpeg2dec(B) | 36841 | 2947 | 5526 | 9947  | 17315 | 28735 | 33156
cjpeg(A)    | 42417 | 4241 | 7635 | 12725 | 21208 | 34630 | 39872
In particular, we consider that traditional dictionary-based compression [13] is actually not well suited to compressing the move slot. For example, as shown in Fig. 3, nearly 18.3% of the slots are occupied by transports from FUai to FUbi, where FUai represents the function unit with port address ai. Although these slots have the same encoding in their source fields because of the single result register, the various triggered operations of the target unit may lead to different encodings of the destination fields. Thus, the numerous triggered operations may result in a large dictionary, or a limited number of dictionary items cannot achieve a satisfactory compression ratio. Based on the locality properties above, it is better to treat the path vectors as the dictionary items and the operation id as the content offset. As illustrated in Fig. 3, nearly 8.3% of the slots are occupied by transports from FUs to RFs, to which a similar compression method can be applied. The set of transports from RFai to FUbi is divided into two subsets: (i) transports from RFai to trigger registers, and (ii) transports from RFai to operator registers. The latter, occupying 5.1% of the slots, is also suitable for the compression method above.
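The locality statistics above reduce to counting <exit port, entry port> pairs and keeping the N most frequent ones. A minimal sketch, with made-up transport data:

```python
from collections import Counter

# Each executed data transport is an (exit_port, entry_port) pair;
# the values below are made up for illustration.
transports = [(3, 7), (3, 7), (2, 5), (3, 7), (2, 5), (1, 4)]

def frequent_path_vectors(transports, n):
    """Return the n most frequent <exit port, entry port> vectors."""
    return [vec for vec, _ in Counter(transports).most_common(n)]

assert frequent_path_vectors(transports, 2) == [(3, 7), (2, 5)]
```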
4 Code Compression Algorithm
4.1 Template-Based Compression
Most programs contain parallelism for the TTA compiler to exploit. However, the data dependencies in the programs limit the parallelism and result in a lot of
empty slots in the program code. An instruction template method to avoid explicit specification of NOPs for TTA architectures was proposed in [6]. The template-based compression method provided in this paper is similar: an instruction template indicates the valid move slots, while the rest of the buses implicitly receive an empty transport. For each instruction, the guard field of each move slot specifies the empty transport. Thus, by analyzing the guard fields ahead of the instructions, together with the flag describing whether the long immediate slot is used, the instruction decoder can obtain the number of slots in the template, their widths, and their bit positions.
4.2 Vertical Dictionary-Based Compression
Based on the code characteristics analysis, this paper searches for the most frequent path vectors as dictionary items to compress the slots vertically. During the ASIP design flow, the TTA scheduler therefore tries to allocate the different transports with the same path vector to the specified bus, emphasizing the spatial locality of data transports on the individual slot. First of all, a pretreatment is introduced: the source fields and destination fields are exchanged for data transports from RFai to the operator registers of the FUs, ensuring that the content offsets generated in the following scheme come from the encoding of RFai. Secondly, the dictionary-based compression algorithm is applied to a matrix in which each row, called an uncompressed vector, corresponds to a data transport; the vectors in the matrix correspond to all the transports on the same slot, as shown in Fig. 2. The compression procedure is as follows:
1) Select a maximal vector from the uncompressed vectors in the matrix as the next dictionary item T_i.
2) For each slot vector Y labeled as uncompressed in the matrix, perform

       Y = T_i - Y                                                    (1)

3) A new matrix is thus generated, in which any vector smaller than the threshold θ is labeled as compressed. If all the vectors in the new matrix are compressed, the compression finishes; otherwise, go to step 1.
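As a rough illustration, the iterative procedure above can be sketched in Python. Plain integers stand in for the slot bit patterns, and the threshold value is hypothetical:

```python
# Sketch of the vertical dictionary-based compression loop described above.
# Vectors are modeled as plain integers; THETA is a hypothetical threshold.
THETA = 16

def compress(vectors, theta=THETA):
    """Return (dictionary_items, residues) for the slot vectors."""
    residues = list(vectors)            # working copy of the matrix
    compressed = [r < theta for r in residues]
    dictionary = []
    while not all(compressed):
        # step 1: pick a maximal uncompressed vector as the next item Ti
        t_i = max(r for r, c in zip(residues, compressed) if not c)
        dictionary.append(t_i)
        # step 2: replace every uncompressed vector Y by Ti - Y
        for k, c in enumerate(compressed):
            if not c:
                residues[k] = t_i - residues[k]
        # step 3: label vectors below the threshold as compressed
        compressed = [c or r < theta for r, c in zip(residues, compressed)]
    return dictionary, residues

items, offsets = compress([100, 101, 37, 36, 5])
assert all(o < THETA for o in offsets)
```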
Completing the above procedure, the dictionary is generated, but it is not yet suited for decompression. In fact, this dictionary only needs some mathematical rework to achieve a simple implementation and fast decompression. Suppose there is an uncompressed slot sequence x_1, x_2, …, x_n. Following the scheme above, this sequence becomes y_1, y_2, …, y_n (∀i: y_i < θ) while producing the dictionary items T_1, T_2, …, T_n. If a certain transport vector x_i is compressed to the vector y_i by the dictionary items T_1, T_2, …, T_w, the following equation holds:

    T_w - (T_{w-1} - ... - (T_1 - x_i)) = y_i                         (2)

Expanding equation (2) gives

    T_w - T_{w-1} + ... + (-1)^{w-1} T_1 + (-1)^w x_i = y_i           (3)

so that

    x_i = t_w - (-1)^{w-1} y_i                                        (4)

where

    t_w = (T_w - T_{w-1} + ... + (-1)^{w-1} T_1) (-1)^{w-1}           (5)
In equation (3), the slot vector x_i uses the first w dictionary items for its compression, so fast decompression can be realized using only equation (4). If t_w is treated as the new dictionary item, equation (4) can be written as

    x_i = t_w + (-1)^w y_i                                            (6)

Then, using the new dictionary items t_w, the compressed vector y_i can be decompressed to the original slot vector x_i. A key factor of any program compression scheme is the dictionary size, which a good code compression method should limit. The approach used here to reduce the dictionary size works as follows: if the value of δ' in equation (7) is less than the threshold δ*, which is determined by the dictionary size and the size constraint, the item t_w is deleted. Here δ(x) represents the entropy of item x, and the sums run over the p slot vectors compressed with t_w:

    δ' = Σ_{j=1}^{p} δ(x_{i_j}) - Σ_{j=1}^{p} δ(y_{i_j}) - δ(t_w)     (7)
Finally, this algorithm retains the high-entropy dictionary items, which correspond approximately to the most frequent path vectors, or even to combinations of neighboring path vectors.
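To make equations (5) and (6) concrete, the following sketch (using the integer vector model and hypothetical item values from above) derives the items t_w from T_1, …, T_w and checks that decompression recovers x_i:

```python
# Sketch: derive the decompression items t_w (eq. (5)) and apply eq. (6).
# T holds hypothetical dictionary items T_1..T_w from the compression pass.
T = [101, 65]

def derive_t(T):
    """t_w = (T_w - T_{w-1} + ... + (-1)^(w-1) T_1) * (-1)^(w-1)."""
    t = []
    for w in range(1, len(T) + 1):
        s = sum(((-1) ** k) * T[w - 1 - k] for k in range(w))
        t.append(s * (-1) ** (w - 1))
    return t

def decompress(y, w, t):
    """x_i = t_w + (-1)^w * y_i  (eq. (6)); w counts the items used."""
    return t[w - 1] + ((-1) ** w) * y

t = derive_t(T)          # t_1 = 101, t_2 = -(65 - 101) = 36
# x = 100 was compressed with one item:  y = 101 - 100 = 1
assert decompress(1, 1, t) == 100
# x = 37 was compressed with two items:  y = 65 - (101 - 37) = 1
assert decompress(1, 2, t) == 37
```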
5 Hardware Prototype
The compressed TTA instruction is composed of the compressed move slots and the guard fields, which include the compress flag and the long-immediate flag. During code decompression, the decoder reads all the guard fields to determine the valid move slots and their positions. Then, again according to the guard flags, the decompression unit decides whether each move slot needs to be decompressed. A compressed slot is split into two parts: the upper bits are used as the dictionary index, and the less significant bits as the content offset. The decompression then only needs to perform a subtraction or addition between the dictionary item and the offset, as in equation (6). If the decompression is added as an extra pipeline stage, the branch latency grows with the pipeline depth, resulting in increased cycle counts. For these benchmark applications, the cycle counts increase on average by 7.3% on configuration A and 3.9% on configuration B. The prototype above is described in Verilog RTL and synthesized using Design Compiler [11] in a 0.18 μm CMOS standard-cell technology. The report indicates that the critical path takes 3.4 ns, while the decompression and decoder units only take about
Fig. 5. Decompression prototype
1.6 ns and 1.5 ns respectively. Thus, it is reasonable to combine the decoder and the decompression into one pipeline stage. In addition, as shown in Fig. 5, using an item of the Address Table (AT), which maps an original address to the corresponding compressed address, a compressed instruction can be accessed normally, in particular when a branch operation is executed.
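The Address Table lookup mentioned above can be pictured as a simple mapping from original branch-target addresses to compressed addresses (the address values here are hypothetical):

```python
# Hypothetical Address Table: original instruction address -> address
# of the corresponding compressed instruction in the instruction RAM.
address_table = {0x0000: 0x0000, 0x0040: 0x0018, 0x0100: 0x005C}

def branch_target(original_addr):
    """Redirect a branch to the compressed instruction stream."""
    return address_table[original_addr]

assert branch_target(0x0040) == 0x0018
```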
6 Experiments
6.1 Compression Ratio
For configurations A and B, experiments using a software simulator written in a high-level language are performed on four applications: fft, idct, mpeg2dec and cjpeg. By adjusting the number of items in the slot dictionary, Fig. 6 illustrates the compression effect. The dictionary size is another key parameter for any program compression method. As shown in Fig. 6, the program compression of some applications, e.g. fft and idct, is sensitive to the dictionary size: increasing the dictionary size strongly influences the compression ratio. For configurations A and B, the total size of the eight dictionaries is set to 152 B and 200 B respectively. With the template method as illustrated in [6], a compression ratio of 42.1% is achieved here, whereas Table 3 shows a compression ratio of 32.3% on average with 8 items in each slot dictionary.
Fig. 6. The compression ratio under different dictionary sizes

Table 3. The compression result on the different applications

No. | program name | original code size | compressed size | compression ratio
1   | fft(A)       | 3,059B             | 1,120B          | 0.367
2   | fft(B)       | 3,550B             | 1,212B          | 0.342
3   | idct(A)      | 3,914B             | 1,472B          | 0.376
4   | idct(B)      | 4,650B             | 1,636B          | 0.352
5   | mpeg2dec(A)  | 208,715B           | 62,196B         | 0.298
6   | mpeg2dec(B)  | 249,450B           | 66,852B         | 0.268
7   | cjpeg(A)     | 265,848B           | 78,956B         | 0.297
8   | cjpeg(B)     | 287,325B           | 80,736B         | 0.281
    | average      |                    |                 | 0.323
6.2 Area and Power Consumption
The effects on area and power consumption are also measured. Due to its small size, the dictionary in this scheme is suited to a standard-cell implementation, for both access time and programmability. In the implementation, the processor core is synthesized using the Synopsys Design Compiler [11] with standard cells in a 180 nm CMOS process, while the register files and the local instruction RAM come from a memory compiler [11]. Note that the original cases are deployed with 19 KB and 25 KB local instruction RAMs respectively, where the programs of some large benchmarks may have to be loaded from the memory controllers several times. Thanks to the compression scheme, the required local instruction RAM is reduced. Table 4 lists the areas of the core and the local instruction RAM. With the 8 KB and 10 KB instruction RAM deployments in the compressed case, the area is reduced by 30.3% on average on configuration A and 28.1% on configuration B. The power consumption is reduced mainly because of the lower memory traffic and the smaller local instruction RAM. Based on PrimePower [14] estimation of the switching activities, the power consumption of the processor core and the local instruction RAM is depicted in Fig. 7: it is reduced by 25.1% on average on configuration A and 24.6% on configuration B. Note that the power consumption reductions obtained here are smaller than those already reported for TTA in [13], but the power consumption considered in [13] only included the instruction RAM and the control logic, excluding the processor core, which accounts for a large share of the power dissipation as shown in Fig. 7. Moreover, the logic of the huge dictionary in [13] is minimized for power consumption without any programmability, which is necessary for embedded processors.
In contrast, one of the advantages of our scheme is the tiny dictionary size combined with high programmability.

Table 4. Areas of the processor core and the instruction ram

                   | proc. core    | instr ram     | Total
original conf. A   | 101,528 Gates | 133,238 Gates | 234,766 Gates
compressed conf. A | 107,510 Gates | 56,100 Gates  | 163,610 Gates
original conf. B   | 172,080 Gates | 175,312 Gates | 347,392 Gates
compressed conf. B | 179,693 Gates | 70,125 Gates  | 249,818 Gates
Fig. 7. Power consumptions of the processor core and the instruction ram
7 Conclusions
The key factors in evaluating a code compression method include the compression ratio, the hardware cost, real-time decompression, the dictionary size, the dependence on the processor, and so on. This paper presents a new template vertical dictionary-based program compression scheme. The experiments show that this scheme not only corresponds to a simple decompression prototype and a tiny dictionary, but also achieves a compression ratio of 32.3%. Correspondingly, the area of the local instruction RAM and the processor core is reduced by about 29%, and the power consumption by nearly 25%. The following related problems remain for further study. The first is strategies to reduce the percentage of transports from the RF to the operator registers, to further optimize the compression ratio. The second is to apply this scheme to a cache architecture and evaluate its performance.
References
1. Riemens, A.K., Vissers, K.A., Schutten, R.J.: TriMedia CPU64 application domain and benchmark suite. In: ICCD99, pp. 580–585 (1999)
2. TMS320C64x CPU: Instruction Set Reference Guide, Texas Instruments, USA (2000)
3. Colwell, R.P., Nix, R.P., O’Connel, J.J.: A VLIW architecture for a trace scheduling compiler. IEEE Trans. Comput. 37(8), 679–967 (1988)
4. Corporaal, H.: Microprocessor Architecture from VLIW to TTA. John Wiley & Sons Ltd, West Sussex, England (1998)
5. Kuukkanen, P., Takala, J.: Bitwise and dictionary modeling for code compression on transport triggered architectures. WSEAS Transactions on Circuits and Systems 3(9), 1750–1755 (2004)
6. Heikkinen, J., Rantanen, T., Cilio, A.G.M., Takala, J., Corporaal, H.: Evaluating Template-Based Instruction Compression on Transport Triggered Architectures. In: IWSOC 2003, pp. 192–195 (2003)
7. Heikkinen, J., Cilio, A., Takala, J., Corporaal, H.: Dictionary-Based Program Compression on Transport Triggered Architectures. In: Proc. IEEE Int. Symp. on Circuits and Systems, Kobe, Japan, May 23-26, pp. 1122–1125 (2005)
8. Hong, Y., Li, S., Kui, D., Zhiying, W.: A TTA-based ASIP design methodology for embedded systems. Journal of Computer Research and Development 43(4), 752–758 (2006)
9. Witten, I.H., Moffat, A., Bell, T.C.: Managing Gigabytes: Compressing and Indexing Documents and Images. Morgan Kaufmann Publishers, San Francisco, CA, U.S.A. (1999)
10. Lee, C., Potkonjak, M., Mangione-Smith, W.H.: MediaBench: A tool for evaluating and synthesizing multimedia communications systems. In: Proc. 30th Ann. IEEE/ACM Int. Symp. Microarchitecture, Research Triangle Park, December 1-3, 1997, pp. 330–335 (1997)
11. http://www.synopsys.com/products/logic/design_compiler.html
12. Corporaal, H., Hoogerbrugge, J.: Code generation for Transport Triggered Architectures, Code Generation for Embedded Processors (1995)
13. Heikkinen, J., Takala, J.: Effects of Program Compression.
In: Vassiliadis, S., Wong, S., Hämäläinen, T.D. (eds.) SAMOS 2006. LNCS, vol. 4017, pp. 259–268. Springer, Heidelberg (2006) 14. Data Sheet: PrimePower Full-Chip Dynamic Power Analysis for Multimillion-Gate Design, Synopsys, Inc. (2004)
Asynchronous Functional Coupling for Low Power Sensor Network Processors Delong Shang1, Chihoon Shin2, Ping Wang1, Fei Xia1, Albert Koelmans1, Myeonghoon Oh2, Seongwoon Kim3, and Alex Yakovlev1 1
Microelectronic System Design Group, School of EECE, University of Newcastle, Newcastle upon Tyne, NE1 7RU, United Kingdom {delong.shang,ping.wang,fei.xia,albert.koelmans, alex.yakovlev}@ncl.ac.uk 2 Korea University of Science and Technology, Korea {cshin,mhoonoh}@etri.re.kr 3 Electronics Telecommunications Research Institute, Korea [email protected]
Abstract. This paper describes the design of an asynchronous implementation of a sensor network processor. The main purpose of this work is the reduction of power consumption in sensor network node processors and the research presented here tries to explore the suitability of asynchronous circuits for this purpose. The Handshake Solutions toolkit is used to implement an asynchronous version of a sensor processor. The design is made compact, trading area and leakage power savings with dynamic power costs, targeting the typical sparse operating characteristics of sensor node processors. It is then compared with a synchronous version of the same processor based on a reasonable power metric to guarantee accurate comparison. Apart from that, we also compare the design effort between synchronous and asynchronous implementations.
1 Introduction
Wireless sensor networking is one of the most exciting technologies to emerge in recent years [9], because advances in hardware and wireless network technologies have made it possible to create low-cost, low-power, multi-functional miniature sensor devices. A sensor network (SN) can provide access to information anytime, anywhere by collecting, processing, analyzing and disseminating data. Thus, the network actively participates in creating a smart environment. The architecture of a sensor node’s hardware usually consists of five components: sensing hardware, processor, memory, power supply and transceiver [10]. Reference [6] compares several traditional microprocessors which have been used as sensor processors. They have several operating modes in order to save power; under some modes, the microprocessors consume a small amount of power as they either enter idle states or switch to a low frequency. Power efficiency is a prime concern in wireless sensors, whether powered by a battery or an energy-scavenging module. The wireless sensor node, being a microelectronic device, can only be equipped with a limited power source. As a result, the main part of the node, the processor, is always required to consume as little power as possible.
N. Azemard and L. Svensson (Eds.): PATMOS 2007, LNCS 4644, pp. 53–63, 2007. © Springer-Verlag Berlin Heidelberg 2007
The majority of current SN research is based on software and communication, usually using off-the-shelf microprocessors which have higher speed and power consumption than most SN applications require [1]. In general, traditional microprocessor design strategies have not reached the best possible power consumption, especially for the specialized application set of sensing networks. One generally ignored technique is asynchrony. Asynchronous circuits are a potential way to save power, as they completely remove the global clock signals. Normally the clock signals consume more than 40% of the total power dissipation; in the Alpha 21064, for example, about 65% of the total power goes to driving the clock signals. In addition, asynchronous circuits are event-driven: power is only consumed when and where necessary. Beyond such considerations of dynamic power, asynchronous circuits may also be suitable for reducing leakage power, since compact designs, with smaller area and thus lower leakage, can be derived more easily. This is of further potential significance for sensor processors because sensor applications tend to have sparse loads and low duty cycles. In this paper, we present the design of an asynchronous sensor processor based on the common understanding of SN applications, to explore the suitability of asynchronous circuits for low power sensor processors. We pay special attention to the question of trading area and leakage gains against dynamic power costs. The remainder of the paper is organized as follows: Section 2 introduces general low power design techniques at the logic and circuit levels; Section 3 introduces the TI MSP430 microprocessor; Section 4 describes an asynchronous implementation of the MSP430; Section 5 gives comparisons in terms of power consumption and logical design effort; and Section 6 gives conclusions and future work.
2 Low Power Design Techniques
In general, power reduction techniques exist at all levels of a system. In this work, we pay special attention to the following techniques at the logic and circuit level:

• Edge-triggered latches: by changing from a flow-through latch to a unique edge-triggered configuration, it is estimated that the clock load could be reduced by half. Another benefit is the absence of spurious transitions on the latch outputs compared to flow-through [8].
• Pipelining: typically used to increase throughput, pipelining also allows us to reduce frequency and thus reduce power consumption under a given performance requirement [8].
• Reducing clock skew: two separate clock signals are used, one for the controller and one for the datapath, to reduce power by using simple clock distribution circuits [8].
• Clock gating techniques: reducing switching activities [8].
• Short pulses of control signals [12,13].
• Removing all possible glitch pulses (hazards) in the controller to save power [8].
3 TI MSP 430 Microprocessor The MSP family of microprocessors from Texas Instruments has recently seen wider use for sensor networking applications and there are reasons to expect that MSPs will out-perform the older architectures in use until now. One microprocessor of the family, the TI MSP430, incorporates a 16-bit CPU, peripherals, and flexible clock system that interconnect using a von-Neumann common memory address bus and memory data bus. Partnering a modern CPU with modular memory-mapped analogue and digital peripherals, the MSP430 offers solutions for demanding mixed-signal applications. MSP430’s clock system is designed specifically for battery-powered applications and uses six different operating modes, ranging from fully active to not clocking the core, to keeping the digital oscillator running to generate the clock but disabling the loop control to save power, to fully powered down [3]. All these features look attractive in the context of SN systems. The complete MSP430 instruction set consists of 27 core instructions and 24 emulated instructions. The core instructions are instructions that have unique op-codes decoded by the CPU. The emulated instructions are instructions that make code easier to write and read, but do not have op-codes themselves, instead they are replaced automatically by the assembler with equivalent core instructions. In this work, we derive our MSP430 implementations using only the MSP430’s instruction set and specifications found in its user guide. The “real” MSP430 hardware is not in the public domain for logic and circuit level analyses.
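Emulated instructions are assembler-level aliases for core instructions. As a sketch, two standard MSP430 examples of this rewriting (modeled here as a simple rewrite table, not the assembler's actual data structure):

```python
# Emulated instructions are replaced by the assembler with equivalent
# core instructions; two well-known MSP430 examples:
EMULATED = {
    "CLR dst": "MOV #0, dst",   # clear destination
    "INC dst": "ADD #1, dst",   # increment destination
}

def expand(instr):
    """Rewrite an emulated instruction to its core equivalent."""
    return EMULATED.get(instr, instr)

assert expand("CLR dst") == "MOV #0, dst"
assert expand("MOV #0, dst") == "MOV #0, dst"   # core instructions pass through
```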
4 Asynchronous SN Processor Implementation
Recently, a benchmark method for SN processors has been proposed [1]. This takes into account that for SN processors pure performance is less important, because SN applications are usually periodic with extremely low duty cycles, as shown in Table 1.

Table 1. Sensor Network Application Sampling Rate

Application               | Duty Cycle (Hz)
Atmospheric temperature   | 0.017 – 1
Barometric pressure       | 0.017 – 1
Body temperature          | 0.1 – 1
Natural seismic vibration | 0.2 – 100
Engine temperature        | 100 – 150
Blood pressure            | 50 – 100
The duty cycle is defined as in equation (1):

    DutyCycle = 1 / (1 + IdleTime / RunTime)                          (1)
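As a worked example of equation (1), a sensor that runs for 1 ms and then idles for 999 ms has a duty cycle of 0.001:

```python
def duty_cycle(run_time, idle_time):
    # DutyCycle = 1 / (1 + IdleTime / RunTime), per equation (1)
    return 1.0 / (1.0 + idle_time / run_time)

assert abs(duty_cycle(1.0, 999.0) - 0.001) < 1e-12
```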
In sensor-based applications, duty cycles are extremely low, since the idle time after the end of the previous job lasts as long as the period of the physical phenomenon. When the idle time
gets longer, the power consumption during this period becomes more significant. Normally, static energy, such as circuit leakage, and some dynamic energy, such as clock transitions, determine the power consumption at idle time. Although several projects have tried to meet the low power requirements by designing specialized processors for SNs, the lack of a shared common understanding of SN applications has made their results impractical [1, 2]. Asynchronous CMOS circuits have the potential advantage of very low power consumption because they only dissipate power when and where necessary and the global clock signals are completely removed. Some of the techniques introduced in Section 2 are thus automatically realized in asynchronous circuits, for example glitch removal and short control pulses. However, it is very difficult to design such asynchronous circuits at the gate level [4]. Substantial research has focused on developing synthesis tools for asynchronous circuits, and Handshake Solutions offers one of the most successful toolkits of this type. This toolkit includes TiDE, the Handshake Solutions design flow, which has been in use for many years. It also includes Haste, used as a general-purpose programming language defined specifically for the design of VLSI circuits and supported by a dedicated tool set for compilation, simulation, and analysis. In this work, we use the TiDE design flow to design our asynchronous MSP430 implementation. In addition, to compare design effort and other parameters, we designed a synchronous version of the MSP430 as well. Unlike synchronous design flows, which mostly start from RTL-level specifications, TiDE accepts a behavioral specification. Fig. 1 shows an example. At the top of the left-hand side, the code is the Verilog behavioral specification, which carries out the function shown at the bottom of the left-hand side.
The code at the top of the right-hand side is the Haste specification, and the bottom of the right-hand side is the representation of that specification based on handshake protocols. An important notion to be aware of when designing Haste programs is that of transparency. Transparency, which can briefly be described as “what you program is

Behavioral code (Verilog):

module inc (out, in);
  input  [15:0] in;
  output [15:0] out;
  reg [15:0] x, out;
  always @(in) begin
    x = in;
    x = x + 1;
    #10 out = x;
  end
endmodule
Asynchronous code (Haste):

int = type [0..15]
& inc : main proc(in? chan int & out! chan int).
begin
  x : var int
& y : var int
| in?x
; y := x + 1
; out!y
end
Fig. 1. An example
what you get” has both advantages and disadvantages. An important consequence of using a transparent compiler is that the programmer should, at least to some extent, be aware of the cost of language constructs, and sometimes even of details of the compilation steps from Haste to the netlist level [11]. A good Haste behavioural specification therefore depends on the architecture, and we believe the architecture can be important for power issues. For sensor processors, because of the low sampling frequency, a compact architecture is most appropriate. In general, an architecture is designed based on the instruction set and the performance and area requirements. A compact architecture is mostly realized by sharing hardware resources to reduce the area. Normally this sharing is called functional coupling, which allows several operations to share the same hardware resources at different times. The reduction in area usually helps reduce static power consumption such as leakage power, but the function sharing will usually cause an increase in dynamic power. When the system duty cycle is low and processing is sparse, system idle time tends to be greater than system active time; in SN applications the duty cycle is usually dominated by idle time. In this type of situation compact architectures can be good for reducing system power consumption [13], by trading dynamic power loss for leakage power gains.
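Functional coupling as described here can be pictured as time-multiplexing one functional unit among several operations. A toy model (the operand-address values are illustrative):

```python
# Toy model of functional coupling: several operations share one ALU
# at different times instead of instantiating one adder per operation.
class SharedALU:
    def __init__(self):
        self.uses = 0
    def add(self, a, b):
        self.uses += 1
        return a + b

alu = SharedALU()
src_addr = alu.add(0x0100, 2)   # source operand address reuses the adder
dst_addr = alu.add(0x0200, 8)   # destination address reuses it too
assert (src_addr, dst_addr, alu.uses) == (0x0102, 0x0208, 2)
```

The area saving comes from instantiating the adder once; the cost is the extra multiplexing and the serialization of the two computations, which is the dynamic-power/leakage trade-off discussed above.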
[Fig. 2 shows the datapath blocks: IR, Dst_temp, Src_temp, Data_reg, and the Addr, Dst and Src connections.]

Fig. 2. An architecture for MSP430 datapath

[Fig. 3 shows, for each instruction (MOV, ADD, RRA, SXT, ..., CMP, ..., RETI, JC, ..., JMP, JEQ), its allocated operations: IF, IF2, IF3, ID, OF, EX, OC, WB.]

Fig. 3. Operation allocation of instructions
D. Shang et al.

[Fig. 4 groups the instructions (MOV, ADD, ..., RRA, CMP, SXT, ..., RETI, JC, ..., JMP, JEQ) around shared operation blocks (IC, IF, IF2, IF3, ID, EX, OC, WB).]

Fig. 4. Operation allocation after functional coupling
Since the high-level asynchronous technique is actually a behavioural (logical) level specification, functional coupling is quite easy for asynchronous systems compared to the modifications required in synchronous systems, which are based on complicated state-machine mechanisms. It is thus considerably simpler to derive a compact asynchronous design from the result of a direct TiDE synthesis run than to derive a compact synchronous design from the result of a direct Synopsys flow applied to the same specification. Based on the specification of the MSP430 and the performance requirements, the compact architecture shown in Fig. 2 is used to implement the core of our MSP430 versions. Two buses (a data bus and an address bus) form the interface between the core and the other parts of the processor (ROM, RAM, timer, watchdog, etc.). Inside the core, the register file, instruction register (IR), address register (AREG), data register (DREG), source temporary register and destination temporary register serve as data storage. In addition, the ALU and SHIFTER work as functional units, multiplexers (de-multiplexers) route data, and a PC increment mechanism is included. Finally, decoding logic and other control logic manage the core. To keep the design compact, apart from the PC increment, all other computations, such as calculating operand addresses, share the ALU unit. In addition, functional coupling is also applied to the instruction set. Fig. 3 shows the operation allocation of the instruction set. In general, after an instruction is fetched and decoded, the operations for that instruction are known and scheduled. For example, if the instruction is "add", the operands must be fetched twice, followed by the 'add' execution (EX) and a write-back. In practice, some operations can be executed on shared components, such as operand fetching (OF), execution (EX) and write-back (WB), because they do not happen at the same time. Fig. 4 shows the block diagram of the operation allocation after functional coupling. Based on this compact architecture and the functional coupling techniques, an asynchronous MSP430 was specified in Haste and input to the TiDE design flow.
[Fig. 5 shows simulation waveforms over roughly 1,200,000 ps: the address bus ABus[15:0], the data bus DBus_in[15:0], the ir_read signals, and the program counter probe_pc[15:0].]

Fig. 5. Waveforms obtained
Fig. 6. Power behavior of asynchronous processor when idle time increases
Our asynchronous MSP430 was implemented with the TiDE design flow in AMS 130 nm CMOS technology. As the design starts from a behavioural specification without converting complicated state machines, we can design an algorithm and synthesize it directly to an actual circuit. Once the gate-level netlist of the design was obtained, a corresponding SDF (Standard Delay Format) file was generated as well; it is used together with the simulation toolkit to simulate the design with more accurate timing information such as gate delays. A waveform obtained from the simulation tools (Synopsys toolkit) is shown in Fig. 5, in which ABus stands for the address bus, DBus for the data bus, probe_pc for the program counter, and so on. The Synopsys PrimePower toolkit, a full-chip dynamic power analysis tool for multimillion-gate designs, was used to analyze the power consumption of the design. As the power simulation result shown in Fig. 6 indicates, leakage power becomes more significant than dynamic power as the idle time increases. This supports our approach of trading dynamic power losses for leakage power gains using compact designs.
5 Comparison

In order to put the asynchronous design into perspective, we implemented a synchronous MSP430 in the same AMS 130 nm CMOS technology, using the Synopsys
Table 2. Comparison Between Low Power Processors (J/instruction)

Processors      Clk   Bits   Proc.   Vdd     Power (J/ins)   Freq.
ATmega128L      o     8      0.35    3 V     1.5 n           4 MHz
80C51           x     8      0.18    1.8 V   89 p            88 MHz
BitSNAP         x     16     0.18    0.5 V   43 p            4 MHz
Lutonium        x     8      0.18    1.8 V   500 p           200 MHz
Clever Dust 2   o     8      0.25    1 V     12 p            500 KHz
                                                             142 KHz
                                     0.6 V   17 p            6 MHz
                                     1.8 V   152 p           54 MHz
Michigan        o     8      0.13    0.2 V   600 f
Our Async       x     16     0.13    1.2 V   60 p            16 MHz
Our Sync        o     16     0.13    1.2 V   135 p           16 MHz
toolkits. We started the design from a specification in Verilog. After compilation and mapping, a gate-level netlist and a corresponding SDF file were generated. We then simulated the design based on the netlist and the SDF file, and analyzed the power consumption using the PrimePower toolkit. Comparisons were made between this design and the asynchronous versions of the MSP430, especially regarding power consumption characteristics. To compare power consumption, a reasonable power metric is required. Table 2 compares the power of some existing commercial microprocessors and ours, using energy per instruction as the metric. In the table, Our Async stands for our asynchronous MSP430 and Our Sync for our synchronous MSP430; the figures in these two rows were calculated by ourselves, while the other figures are taken from the related papers. Our asynchronous MSP430 has a 16-bit datapath, works at 1.2 V, and is implemented in 130 nm CMOS technology. It consumes 60 pJ per instruction working at 16 MHz. However, this metric is useful only for a rough overall power comparison, since almost every approach has a different ISA and architecture. For a more accurate comparison, we use the metric EPB (Energy Per Bundle), which represents the actual energy consumed by the processor core to handle a bundle of samples during one duty cycle of work [1]. We use one practical SN application as the test bench. The application, RLE_stream, emulates environmental monitoring: it periodically collects data from outside, filters them through a Schmitt-trigger-based threshold detector, and compresses them using Run Length Encoding (RLE). In this paper, we accurately measure the power consumption of our own implementations (Our Async and Our Sync) on this test bench using the EPB metric. Comparisons with the other microprocessors could not be made for lack of data. The final power simulation results are shown in Fig. 7 and Fig. 8.
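The RLE_stream behaviour described above can be sketched in a few lines of Python. The hysteresis thresholds and the sample values below are illustrative assumptions, not values taken from the paper:

```python
# Sketch of the RLE_stream benchmark: samples pass through a
# Schmitt-trigger-like threshold detector (with hysteresis), and the
# resulting bit stream is compressed by run-length encoding.
def schmitt(samples, low=0.3, high=0.7):
    # Output switches to 1 only above `high` and back to 0 only below `low`.
    out, state = [], 0
    for x in samples:
        if state == 0 and x > high:
            state = 1
        elif state == 1 and x < low:
            state = 0
        out.append(state)
    return out

def rle(bits):
    # Collapse the bit stream into (value, run-length) pairs.
    runs = []
    for b in bits:
        if runs and runs[-1][0] == b:
            runs[-1] = (b, runs[-1][1] + 1)
        else:
            runs.append((b, 1))
    return runs

print(rle(schmitt([0.1, 0.8, 0.9, 0.5, 0.2, 0.1])))
# [(0, 1), (1, 3), (0, 2)]
```

The hysteresis keeps the mid-range sample 0.5 from toggling the detector, which is exactly what makes the subsequent RLE stage effective on slowly varying sensor data.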
We ran the test bench, which takes about 40 us. In these two figures, Sync is our synchronous MSP430, which works from a global clock signal; in the implementation, a clock tree is used to balance the global clock. Sync idle denotes the synchronous version with the clock gated during idle time to stop clock transitions. Async is our asynchronous MSP430 without compact
Fig. 7. Relative distribution between dynamic and static power
Fig. 8. Battery life time versus idle time
optimization, and Async (functionally coupled) is the version with compact optimization. In Fig. 7, the idle time is 64 ms, and in Fig. 8 the lifetime is that of a single 1.5 V 1000 mAh AAA-size battery. Through coupling to derive a compact implementation, the area was reduced by almost 30%. On the other hand, dynamic power increased by about as much as static power decreased. Since area-dependent leakage is the dominant power factor for SN applications, battery lifetime increased by more than 30% when the idle time is long enough. In addition, we also compared the design effort for the synchronous and asynchronous MSP430 microprocessors. The synchronous design took about two persons working full time for two months, of which about one month was spent by two persons on designing the architecture. The Verilog code is more than 8000 lines long. In comparison, as the asynchronous design starts from the behavioural level, it took only about 20 days with a single person focusing on the code, not counting the time spent designing the architecture. The Haste code is about 1200 lines long.
6 Conclusions and Future Work

A low power version of the MSP430 for sensor applications was designed. Because of the specific requirements of sensor processors, especially low duty cycles, a compact architecture was used to reduce leakage power in idle mode. Logic simulation shows that the design is correct and functions as expected. Compared to standard microprocessors, the asynchronous MSP430 specially designed for sensor applications shows good low-power properties; for example, battery lifetime is increased by 30% by trading dynamic power for area and leakage power. Comparisons with existing microprocessors are also given in terms of energy dissipation per instruction, and a more accurate comparison based on EPB is given for our asynchronous and synchronous implementations. Designing dedicated microprocessors for specific applications is a new research topic, especially with asynchronous design technology. In this paper, we used a commercial microprocessor, the TI MSP430, as the prototype; however, it is still designed as a general-purpose microprocessor. In the future we will investigate methods of developing dedicated microprocessors and their instruction sets as well.
Acknowledgement

This work is supported by the EPSRC project NEGUS (EP/C512812/1) and by the Korea Research Foundation Grant funded by the Korean Government (MOEHRD) (KRF-2006-612-D00063) at the University of Newcastle upon Tyne. We would also like to thank our colleagues for many useful discussions.
References

1. Nazhandali, L., Minuth, M., Austin, T.: SenseBench: Toward an Accurate Evaluation of Sensor Network Processors. In: ISCA 2006 (2006)
2. Nazhandali, L., Minuth, M., Zhai, B., Olson, J., Austin, T., Blaauw, D.: A Second-Generation Sensor Network Processor with Application-Driven Memory Optimizations and Out-of-Order Execution. In: ISCA 2005 (2005)
3. TI, MSP430, http://focus.ti.com
4. Sparso, J., Furber, S.: Principles of Asynchronous Circuit Design – A System Perspective. Kluwer Academic Publishers, Dordrecht (2002)
5. Handshake Solutions, Philips: TiDE — Timeless Design Environment, http://www.handshakesolutions.com/products_services/tools/Index.html
6. Lynch, C., O'Reilly, F.: Processor Choice for Wireless Sensor Networks. In: Proc. of REALWSN 2005, June 20-21, Stockholm, Sweden (2005)
7. Moyer, B.: Low Power Design for Embedded Processors. Proc. of IEEE 89(11) (November 2001)
8. Dobberpuhl, D.: The Design of High Performance Low Power Microprocessors. In: ISLPED 1996 (1996)
9. Tilak, S., Abu-Ghazaleh, N., Heinzelman, W.: A Taxonomy of Wireless Micro-Sensor Network Models. ACM Mobile Computing and Communications Review (MC2R) 6(2) (April 2002)
10. Tubaishat, M., Madria, S.: Sensor Networks: An Overview. IEEE Potentials (2003)
11. Peeters, A., Wit, M.: Haste Manual. Handshake Solutions
12. Bajwa, R.S., Hiraki, M., Kojima, H., Gorny, D.J., Shridhar, A., Seki, K.: Instruction Buffering to Reduce Power in Processors for Signal Processing. IEEE Trans. on VLSI 5(4), 417–423 (1997)
13. Markovic, D., Stojanovic, V., Nikolic, B., Horowitz, M.A.: Methods for True Energy-Performance Optimization. IEEE Journal of Solid-State Circuits 39(8) (August 2004)
A Heuristic for Reducing Dynamic Power Dissipation in Clocked Sequential Designs

Noureddine Chabini

Department of Electrical and Computer Engineering, Royal Military College of Canada
P.B. 17000, Station Forces, Kingston, ON, K7K 7B4, Canada
[email protected]
Abstract. Assigning computational elements to low supply voltages can reduce dynamic power dissipation, but it increases execution delays. The problem of reducing dynamic power consumption by assigning low supply voltages to computational elements off critical paths is NP-hard in general. It has been addressed in the case of combinational designs. It becomes more difficult in the case of clocked sequential designs, since critical paths are defined relative to the position of registers. By repositioning some registers, some computational elements can be moved off critical paths, and hence their supply voltages can be scaled down. In this paper, we propose a polynomial time algorithm to determine solutions to this problem in the case of clocked sequential designs. Experimental results show that the proposed algorithm is able to significantly reduce dynamic power dissipation.
1 Introduction

For today's and next-generation digital systems, designers have to deal not only with increasing the execution speed of these systems, but also with reducing their power consumption. Reducing power consumption is required in particular for battery-operated portable digital systems and for high performance systems. Complementary Metal-Oxide-Semiconductor (CMOS) technology continues to be the dominant technology used to implement digital designs. In CMOS technology, three components contribute to the power dissipation of a digital system: dynamic power, short-circuit power, and leakage power. Dynamic power, denoted here by Pd, is consumed due to the charging and discharging of capacitances. It is approximately defined by the following equation [1][2]:

Pd ≈ K · C · f · Vdd²    (1)
where K is the switching activity factor, C is the load capacitance, f is the clock frequency, and Vdd is the supply voltage. As one can deduce from equation (1), scaling down the supply voltage of a computational element reduces its dynamic power dissipation. The effect of scaling down the supply voltage of a computational element is an increase of its execution delay, as one can deduce from the following equation [3]:
d ≈ (D · Vdd) / (Vdd − Vth)²    (2)

N. Azemard and L. Svensson (Eds.): PATMOS 2007, LNCS 4644, pp. 64–74, 2007.
© Springer-Verlag Berlin Heidelberg 2007
where d is the execution delay of the computational element using Vdd as the supply voltage, and D is a constant. Consequently, to still satisfy timing constraints, only the supply voltages of computational elements off critical paths can be scaled down, assuming fixed threshold voltages. Designs with multiple supply voltages are reported in the literature [4][5] and demonstrate the benefits of using multiple supply voltages to reduce dynamic power consumption. Scaling the supply voltage to minimize power consumption under timing constraints is an NP-hard problem in general [7], which opens the door to the development of heuristic approaches. The problem of reducing dynamic power dissipation using supply voltage scaling under timing constraints has been addressed in the literature for the case of combinational designs. This problem becomes more difficult in the case of clocked sequential designs, since the critical paths are defined relative to the position of registers in the design. A computational element can be moved off critical paths by repositioning some registers in the design, and hence its supply voltage can then be scaled down to reduce dynamic power dissipation while still satisfying timing constraints. The process of repositioning registers in a mono-phase clocked sequential design to optimize a certain objective function is called retiming, which was proposed in [8]. Exact approaches to address the above problem for the case of mono-phase clocked sequential digital designs are reported in [6][9]. The basic idea in these approaches is to optimally combine retiming and supply voltage scaling, which leads to a global solution to the problem instead of a local one. These approaches are based on Mixed-Integer/Integer Linear Programs (MILP/ILP). Since the problem is NP-hard in general, these MILP/ILP formulations cannot be solved in reasonable run-time for large designs.
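The trade-off between equations (1) and (2) can be made concrete numerically. The sketch below (Python) uses assumed example values for K, C, f, D and the voltages; they are illustrative only, not values from the paper:

```python
# Illustration of equations (1) and (2) with made-up example constants.
def dynamic_power(k, c, f, vdd):
    # Eq. (1): Pd ≈ K * C * f * Vdd^2
    return k * c * f * vdd ** 2

def delay(d_const, vdd, vth):
    # Eq. (2): d ≈ (D * Vdd) / (Vdd - Vth)^2
    return d_const * vdd / (vdd - vth) ** 2

# Scaling Vdd from 5.0 V down to 3.3 V (with Vth = 0.7 V) reduces the
# dynamic power quadratically but lengthens the execution delay:
p_ratio = dynamic_power(0.1, 1e-12, 50e6, 3.3) / dynamic_power(0.1, 1e-12, 50e6, 5.0)
d_ratio = delay(1.0, 3.3, 0.7) / delay(1.0, 5.0, 0.7)
print(round(p_ratio, 3), round(d_ratio, 3))   # 0.436 1.805
```

This is exactly why only elements off critical paths are candidates for the low supply voltage: the roughly 56% power saving comes with an 80% delay penalty in this example.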
Our focus in this paper is to provide a polynomial time algorithm to compute solutions to this problem for designs of any size. The rest of this paper is organized as follows. In the next section, we show how we model a clocked sequential design as a cyclic graph. We also give an introduction to retiming and to valid periodic schedules, which are needed for our proposed approach. In Section 3, we present our proposed polynomial time algorithm for computing solutions to the target problem of this paper. Experimental results are provided and discussed in Section 4. Conclusions are presented in Section 5.
2 Preliminaries

2.1 Cyclic Graph Model
In this paper, a mono-phase clocked sequential design is modeled as a directed cyclic graph G = (V, E, d, w), where V is the set of computational elements in the design, and E is the set of arcs that represent wires. We denote by d(v) the execution delay of the node v ∈ V, and by w(eu,v) the weight of the arc eu,v ∈ E. The value of w(eu,v) is the number of registers on the wire modeled by the arc eu,v ∈ E. As an example, a directed cyclic graph model for a correlator [8] is given in Fig. 1. For this example, the computational elements are labeled v1, v2, ..., v8. The number inside each circle represents the execution delay of the computational element vi, and the value on each arc is the number of registers on that arc.
N. Chabini

[Fig. 1 is a directed cyclic graph over the nodes v1–v8; the number inside each node is its execution delay (3 or 7, with 0 for v8) and the number on each arc (0 or 1) is its register count.]

Fig. 1. Cyclic graph model for a digital correlator as provided in [8]
2.2 Retiming
Let G = (V, E, d, w) be a cyclic graph modeling a clocked sequential design. Retiming r is defined (as in [8]) as a function r : V → Z, which transforms G into a functionally equivalent cyclic graph Gr = (V, E, d, wr). The set Z represents the integers. The weight of each arc eu,v in Gr is defined as follows:
wr(eu,v) = w(eu,v) + r(v) − r(u), ∀eu,v ∈ E.    (3)
Since the weight of each arc in Gr represents the number of registers on this arc, we must have:
wr(eu,v) ≥ 0, ∀eu,v ∈ E.    (4)
Any retiming r that satisfies inequality (4) is called a valid retiming. From expressions (3) and (4), one can deduce the following inequality:
r(u) − r(v) ≤ w(eu,v), ∀eu,v ∈ E.    (5)
Let P(u, v) denote a path from node u to node v in V. Equation (3) implies that, for every two nodes u and v in V, the change in the register count along any path P(u, v) depends only on its two endpoints:

wr(P(u, v)) = w(P(u, v)) + r(v) − r(u), ∀u, v ∈ V,    (6)

where

w(P(u, v)) = Σ_{ex,y ∈ P(u,v)} w(ex,y).    (7)
Let d(P(u, v)) denote the delay of a path P(u, v) from node u to node v: d(P(u, v)) is the sum of the execution delays of all the computational elements that belong to P(u, v). A 0-weight path is a path such that w(P(u, v)) = 0. The minimal clock period of a synchronous sequential digital design is the length of its longest 0-weight path. It is defined by the following equation:

Π = Max_{∀u,v ∈ V} { d(P(u, v)) | w(P(u, v)) = 0 }.    (8)
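Expressions (3), (4) and (8) can be exercised on a toy model. The Python sketch below uses a small hypothetical three-node design (not the correlator of Fig. 1); the dict encoding of delays and arc register counts is our own assumption:

```python
# Minimal sketch of the retiming update (3), validity check (4), and the
# minimal clock period (8) on a hypothetical three-node cyclic design.
import functools

delays = {'a': 3, 'b': 7, 'c': 7}                    # d(v), in ns
arcs = {('a', 'b'): 0, ('b', 'c'): 0, ('c', 'a'): 2}  # register counts w(e)

def clock_period(delays, arcs):
    # Eq. (8): longest 0-weight path, via memoized DFS over the acyclic
    # subgraph formed by the zero-register arcs.
    zero_succ = {u: [] for u in delays}
    for (u, v), w in arcs.items():
        if w == 0:
            zero_succ[u].append(v)

    @functools.lru_cache(maxsize=None)
    def longest_from(u):
        return delays[u] + max((longest_from(v) for v in zero_succ[u]), default=0)

    return max(longest_from(u) for u in delays)

def retime(arcs, r):
    # Eq. (3): wr(e_{u,v}) = w(e_{u,v}) + r(v) - r(u)
    return {(u, v): w + r[v] - r[u] for (u, v), w in arcs.items()}

print(clock_period(delays, arcs))        # 17  (0-weight path a -> b -> c)
wr = retime(arcs, {'a': 0, 'b': 1, 'c': 2})
print(all(w >= 0 for w in wr.values()))  # True: inequality (4) holds
print(clock_period(delays, wr))          # 10  (0-weight path is now c -> a)
```

Moving the two registers off the arc (c, a) and into the long combinational chain shortens the longest 0-weight path from 17 to 10, without changing the circuit's function.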
[Fig. 2 is the graph of Fig. 1 with the arc register counts updated by the retiming.]
Fig. 2. Retiming the design in Fig. 1 to achieve a clock period Π = 13
Two matrices called W and D are very important to retiming algorithms. They are defined in [8] as follows:

W(u, v) = Min { w(P(u, v)) }, ∀u, v ∈ V,    (9)

and

D(u, v) = Max { d(P(u, v)) | w(P(u, v)) = W(u, v) }, ∀u, v ∈ V.    (10)
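One common way to obtain W and D is an all-pairs shortest-path pass over the ordered pair (register count, −delay). The Python sketch below is our own illustration on a small hypothetical three-node graph, not a circuit from [8]:

```python
# Sketch of computing W(u, v) and D(u, v) per definitions (9) and (10),
# Floyd-Warshall style, on a hypothetical three-node design.
INF = float('inf')

delays = {'a': 3, 'b': 7, 'c': 7}
arcs = {('a', 'b'): 0, ('b', 'c'): 0, ('c', 'a'): 2}

def wd_matrices(delays, arcs):
    nodes = list(delays)
    dist = {(u, v): (INF, INF) for u in nodes for v in nodes}
    for (u, v), w in arcs.items():
        # Lexicographic weight: minimize registers first (gives W), then
        # minimize -delay, i.e. maximize delay (gives D).
        dist[u, v] = min(dist[u, v], (w, -delays[u]))
    for k in nodes:                      # relaxation over intermediates
        for u in nodes:
            for v in nodes:
                wa, da = dist[u, k]
                wb, db = dist[k, v]
                if wa < INF and wb < INF:
                    dist[u, v] = min(dist[u, v], (wa + wb, da + db))
    W, D = {}, {}
    for (u, v), (w, neg_d) in dist.items():
        if w < INF:
            W[u, v] = w
            D[u, v] = -neg_d + delays[v]   # add the endpoint's own delay
    return W, D

W, D = wd_matrices(delays, arcs)
print(W['a', 'c'], D['a', 'c'])   # 0 17 : register-free path a->b->c, delay 17
print(W['b', 'a'], D['b', 'a'])   # 2 17 : path b->c->a carries 2 registers
```

The accumulated delay along a path sums the delay of each arc's source node, so the endpoint's delay d(v) is added at the end to match definition (10).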
The matrices W and D can be computed as explained in [8]. Minimizing the clock period of a synchronous sequential digital design is one of the original applications of retiming reported in [8]. For instance, the clock period of the design in Fig. 1 is Π = 24, which is equal to the sum of the execution delays of the computational elements v4, v5, v6 and v7. However, we can obtain Π = 13 if we apply the retiming vector {−1, −2, −2, −3, −2, −1, 0, 0} to the vector of nodes {v1, v2, v3, v4, v5, v6, v7, v8} in G. The retimed graph Gr is presented in Fig. 2; the weight of each arc in Gr is computed using expression (3). The minimal clock period Π = 13 in Gr corresponds to the length of the longest 0-weight path, composed of the computational elements v7, v8, v1 and v2. For the purposes of this paper, we quote the following theorem from [8], where it is also proved.

Theorem 1: Let G = (V, E, d, w) be a synchronous digital design, and let Π be a positive real number. Then there is a retiming r of G such that the clock period of the resulting retimed design Gr is less than or equal to Π if and only if there exists an assignment of an integer value r(v) to each node v in V such that the following conditions are satisfied: (1) r(u) − r(v) ≤ w(eu,v), ∀eu,v ∈ E, and (2) r(u) − r(v) ≤ W(u, v) − 1, ∀u, v ∈ V | D(u, v) > Π.

2.3 Valid Periodic Schedule
Let G = (V, E, d, w) be a directed cyclic graph modeling a mono-phase clocked sequential design. Each node v in this graph executes more than once over time; each execution of v is called an iteration. At each iteration, all the nodes of G are executed, and they should finish executing before the next iteration starts. Let N be the set of non-negative integers. We define a schedule as a function s : N × V → Z that, for each iteration k ∈ N, determines the start execution time sk(v) of each node v ∈ V (start times are taken relative to a reference node and may therefore be negative).
We mean by a periodic schedule with period Π any schedule s that satisfies (11):

sk(v) = s0(v) + k · Π, ∀k ∈ N, ∀v ∈ V,    (11)
where s0(v) is the start execution time of the first instance of the node v. Also, we mean by a valid schedule any schedule s that satisfies the data dependency constraints. In terms of start execution times, this is equivalent to (12):

s(k + w(eu,v))(v) ≥ sk(u) + d(u), ∀k ∈ N, ∀eu,v ∈ E.    (12)
Using equation (11), inequality (12) transforms into:

s0(v) − s0(u) ≥ d(u) − Π · w(eu,v), ∀eu,v ∈ E.    (13)
Let Gs = (V, E, w) be a cyclic graph, where V and E are as defined for the graph G = (V, E, d, w). The weight of each arc eu,v ∈ E in Gs is equal to the value (d(u) − Π · w(eu,v)) on the right-hand side of inequalities (13). The system of inequalities (13) can be solved using algorithms for computing the length of the longest path in the graph Gs. For instance, Bellman-Ford's algorithm [10][11] for longest paths from a chosen node vx in the graph Gs to the other nodes can be used to solve this system [12]. A solution can be computed only if this graph does not have any cycle with positive weight [12]. By fixing Π to the value of the clock period of the synchronous sequential design, the graph Gs cannot have any cycle with positive weight, since otherwise this synchronous sequential design could not operate correctly with this value of the clock period. As Soon As Possible (ASAP) and As Late As Possible (ALAP) schedules [11] are two possible solutions to the system of inequalities (13). To compute an ASAP schedule, Bellman-Ford's algorithm [10][11] for longest paths from a chosen node vx in the graph Gs to the other nodes can be used. Finding an ALAP schedule relative to the node vx can be achieved as follows [12]:

• Invert the direction of every arc in the graph Gs. The weight of each arc stays the same, as defined in the previous paragraph.
• Apply Bellman-Ford's algorithm for longest paths from node vx to the other nodes in the resulting graph Gs.
• Multiply the values computed in the previous step by −1.
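The steps above can be sketched directly in Python. The three-node design below is a hypothetical example (not the correlator of Fig. 2), and the dict encoding is our own assumption:

```python
# ASAP/ALAP schedules from inequalities (13) via Bellman-Ford longest paths.
delays = {'a': 3, 'b': 7, 'c': 7}
arcs = {('a', 'b'): 0, ('b', 'c'): 0, ('c', 'a'): 2}   # register counts

def longest_paths(nodes, edges, src):
    # Bellman-Ford relaxed upward, for LONGEST paths from src.
    dist = {v: float('-inf') for v in nodes}
    dist[src] = 0
    for _ in range(len(nodes) - 1):
        for (u, v), w in edges.items():
            if dist[u] + w > dist[v]:
                dist[v] = dist[u] + w
    return dist

def asap_alap(delays, arcs, period, ref):
    # Gs arc weight is d(u) - period * w(e_{u,v}), the RHS of (13).
    gs = {(u, v): delays[u] - period * w for (u, v), w in arcs.items()}
    asap = longest_paths(delays, gs, ref)
    # ALAP: invert every arc (same weight), rerun, then negate the results.
    rev = {(v, u): w for (u, v), w in gs.items()}
    alap = {v: -d for v, d in longest_paths(delays, rev, ref).items()}
    return asap, alap

asap, alap = asap_alap(delays, arcs, period=17, ref='a')
mobility = {v: alap[v] - asap[v] for v in delays}
print(asap)      # {'a': 0, 'b': 3, 'c': 10}
print(alap)      # {'a': 0, 'b': 20, 'c': 27}
print(mobility)  # {'a': 0, 'b': 17, 'c': 17}
```

The reference node always has mobility 0 by construction; nodes with positive mobility have slack between their earliest and latest legal start times.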
For instance, relative to the node v1 and for Π = 13, the ASAP and ALAP schedules for the design in Fig. 2 are given in Table 1 below. For the graph Gs, the mobility of a node v is equal to (ALAP(v) − ASAP(v)). The mobility of a node in Gs indicates whether or not this node is on a critical path: when the mobility of a node is 0, the node is on a critical path in the graph Gs. Notice that even though a node in Gs has a mobility greater than 0, this node can sometimes still be on a critical path in G. Indeed, from Table 1, we observe that the mobility of nodes v2 and v7 is non-zero, yet these nodes are on the critical path in the graph G given by Fig. 2, since Π = 13 is determined by the longest 0-weight path composed of v7, v8, v1 and v2. However, the mobility of a node in Gs is still useful in selecting a candidate node in G to be assigned to a low supply voltage, as we will explain in the next section. Notice from Table 1 that the mobility of the nodes v7, v8, v1 and v2, which are on the critical path, is smaller than the mobility of the other nodes, which are not on the critical path.

Table 1. The ASAP, ALAP, and Mobility for nodes of the design in Fig. 2, using Π = 13

Nodes      ASAP   ALAP   Mobility
v1           0      0      0
v2         -10     -4      6
v3         -20    -11      9
v4         -30    -11     19
v5         -17     -8      9
v6          -7     -1      6
v7           3      6      3
v8          10     13      3
3 Problem Description and Proposed Resolution Approach

Without loss of generality, we assume in this paper that two supply voltages are available: a high supply voltage (denoted henceforth by VddH) and a low supply voltage (denoted henceforth by VddL). When a computational element u is powered by VddH, its dynamic power dissipation is PdH(u) and its execution delay is dH(u). If u is powered by VddL, then its dynamic power dissipation is PdL(u) and its execution delay is dL(u). We assume that the given clocked sequential design modeled by G = (V, E, d, w) operates originally with VddH. To reduce the dynamic power dissipation of the input design G = (V, E, d, w), our aim is then to power as many computational elements as possible with VddL, but without increasing the clock period Π. To achieve this goal, the basic idea in our proposed algorithm is to iteratively select a best candidate node v from G to be tried for assignment to VddL. To find a best candidate node, we propose the following objective function to carry out the selection:

Gain(v) = 0 if Mobility(v) = 0, and Gain(v) = (PdH(v) − PdL(v)) / Mobility(v) otherwise, ∀v ∈ V.    (14)
In our approach, when Mobility(v) = 0, the node v is on a critical path and hence cannot be assigned to VddL; so we have Gain(v) = 0. Otherwise, we propose to use (PdH(v) − PdL(v)) and Mobility(v) to define the objective function Gain(v). Why does it make sense to proceed as given by expression (14)? There are at least two scenarios that help explain this. As a first scenario, for nodes that have the same value of the term (PdH(v) − PdL(v)), we can deduce from (14) that the node with the smallest value of Mobility(v) will have the highest gain. This makes sense, since one needs to give a higher priority to the node which is more critical, i.e., to the node with the smallest mobility. As a second scenario, for nodes that have the same value of the term Mobility(v), we can deduce from (14) that the node with the highest value of the term (PdH(v) − PdL(v)) will have the highest gain. This makes sense, since one needs to give a higher priority to the node that allows the greatest reduction of dynamic power dissipation.

Now, assume that a node v is selected to be tried for assignment to VddL. Since the execution delay of this selected node will increase, and because the design G must still operate with the clock period Π, we propose to retime G after temporarily setting the execution delay of this node v to dL(v). Based on Theorem 1, if no retiming exists then the node v cannot be assigned to VddL; otherwise the execution delay and dynamic power of v are fixed permanently to dL(v) and PdL(v), respectively. The process of selecting another node is repeated until all the nodes have been tried. Our proposed algorithm is given below. In the next paragraph, we will run this algorithm on the example design in Fig. 2 with Π = 13 and give intermediate results, which should help the reader get a clear idea of how the algorithm works.

Algorithm
Input: Graph G = (V, E, d, w), clock period Π, and the low supply voltage to be tried for each node. Nodes of G are assumed to originally operate under the high supply voltage. The algorithm tries to assign the given low supply voltage to as many nodes as possible, reducing dynamic power dissipation as much as it can without violating timing constraints.
Output: Graph G = (V, E, d, w) with reduced dynamic power dissipation while still operating with clock period Π.
Begin
1. Mark all the nodes in G = (V, E, d, w) as not visited yet.
2. While (all the nodes in G = (V, E, d, w) are not visited yet) do
3.   Using the given Π, compute ASAP, ALAP and the Mobility of each node in G as in Subsection 2.3.
4.   For each node v which is not visited yet, calculate its dynamic power gain as defined by expression (14).
5.   For each node in G, if its Mobility is 0 then mark this node as visited.
6.   Among all the nodes that are not visited yet, select the node v which has the highest gain computed above.
7.   Mark the selected node v as visited.
8.   Assign the node v to the low supply voltage. Its execution delay and dynamic power become the delay and dynamic power corresponding to this supply voltage.
9.   Since one node is now slowed down, retime the graph G = (V, E, d, w) to check whether it can still operate with the clock period Π. For this, use one of the algorithms reported in [8][11]. If the design cannot operate with this clock period Π, then assign the node v back to the high supply voltage; its execution delay and dynamic power revert to the values corresponding to this supply voltage.
10. End of the while loop.
End.
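A runnable sketch of the whole heuristic is given below in Python, under the following assumptions (ours, not the paper's): a design is encoded as a dict of node delays plus a dict of arc register counts; the mobility is computed, for simplicity, on the original register placement (the existence of a retiming for a given period does not depend on where the registers currently sit); ties on the gain are broken by node order; and the three-node design with its delays and powers is a made-up example:

```python
# Sketch of the proposed heuristic: gain-driven selection (eq. (14)) with a
# Theorem-1 feasibility check for the trial low-voltage assignment.
INF = float('inf')

def longest_paths(nodes, edges, src):
    dist = {v: -INF for v in nodes}
    dist[src] = 0
    for _ in range(len(nodes) - 1):
        for (u, v), w in edges.items():
            if dist[u] + w > dist[v]:
                dist[v] = dist[u] + w
    return dist

def mobility(delays, arcs, period, ref):
    # ALAP(v) - ASAP(v) on the constraint graph Gs (Subsection 2.3).
    gs = {(u, v): delays[u] - period * w for (u, v), w in arcs.items()}
    asap = longest_paths(delays, gs, ref)
    rev = {(v, u): w for (u, v), w in gs.items()}
    alap = {v: -d for v, d in longest_paths(delays, rev, ref).items()}
    return {v: alap[v] - asap[v] for v in delays}

def retiming_exists(delays, arcs, period):
    # Theorem 1 as difference constraints r(u) - r(v) <= c, solved by
    # Bellman-Ford-style relaxation; W and D come from an all-pairs pass.
    nodes = list(delays)
    dist = {(u, v): (INF, INF) for u in nodes for v in nodes}
    for (u, v), w in arcs.items():
        dist[u, v] = min(dist[u, v], (w, -delays[u]))
    for k in nodes:
        for u in nodes:
            for v in nodes:
                wa, da = dist[u, k]
                wb, db = dist[k, v]
                if wa < INF and wb < INF:
                    dist[u, v] = min(dist[u, v], (wa + wb, da + db))
    cons = dict(arcs)                       # condition (1): r(u)-r(v) <= w(e)
    for (u, v), (w, neg_d) in dist.items():
        if w < INF and -neg_d + delays[v] > period:
            cons[u, v] = min(cons.get((u, v), INF), w - 1)   # condition (2)
    r = {v: 0 for v in nodes}
    for _ in range(len(nodes)):
        changed = False
        for (u, v), c in cons.items():
            if r[u] > r[v] + c:
                r[u] = r[v] + c
                changed = True
        if not changed:
            return True
    return False                            # negative cycle: not achievable

def assign_low_vdd(dH, dL, PH, PL, arcs, period, ref):
    delays, visited, low = dict(dH), set(), set()
    while len(visited) < len(delays):
        mob = mobility(delays, arcs, period, ref)
        visited |= {v for v, m in mob.items() if m == 0}        # step 5
        cand = [v for v in delays if v not in visited]
        if not cand:
            break
        v = max(cand, key=lambda x: (PH[x] - PL[x]) / mob[x])   # steps 4/6
        visited.add(v)                                          # step 7
        trial = dict(delays)
        trial[v] = dL[v]                                        # step 8
        if retiming_exists(trial, arcs, period):                # step 9
            delays = trial
            low.add(v)
    return low

dH = {'a': 3, 'b': 7, 'c': 7}
dL = {'a': 4, 'b': 10, 'c': 10}
PH = {'a': 20, 'b': 100, 'c': 100}
PL = {'a': 15, 'b': 60, 'c': 60}
arcs = {('a', 'b'): 0, ('b', 'c'): 0, ('c', 'a'): 2}
print(sorted(assign_low_vdd(dH, dL, PH, PL, arcs, period=17, ref='a')))
# ['b', 'c']
```

In this example both slow nodes b and c can be moved to the low supply voltage because the two registers on the back arc can be redistributed by retiming, while the reference node a is immediately marked visited (mobility 0).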
In this paragraph, we briefly present and discuss intermediate results from executing the above-proposed algorithm on the design in Fig. 2, assuming Π = 13. These intermediate results are reported in Table 3 and were obtained assuming the supply voltages, execution delays and dynamic powers presented in Table 2. For each iteration, Table 3 reports the mobility and the gain (defined by expression (14)) of each node. As can be observed from this table, exactly one node is marked by ** or by * at each iteration. A node marked by ** is the node that was selected for assignment to the low supply voltage and was finally assigned to it. A node marked by * is the node that was selected for assignment to the low supply voltage but finally could not be assigned to it. As can be observed from Table 3, at each iteration the algorithm selected a node with the highest gain. Also, if a node was assigned to the low supply voltage, the mobilities of the nodes were updated in the next iteration; otherwise they were kept intact (see iterations 3 and 4: node v5 was not assigned to the low supply voltage in iteration 3). As can be deduced from the algorithm, its Step 2 will execute no more than |V| times, where |V| is the number of

Table 2. Example of execution delays and dynamic power dissipations

Supply Voltage   Quantity       v1   v2   v3   v4   v5   v6   v7   v8
VddH             dH(v) (ns)      3    3    3    3    7    7    7    0
                 PdH(v) (uW)    20   20   20   20  100  100  100    0
VddL             dL(v) (ns)      4    4    4    4   10   10   10    0
                 PdL(v) (uW)    15   15   15   15   60   60   60    0
Table 3. Results of running the proposed algorithm above on the design in Fig. 2 using Π = 13. For each node, execution delay and dynamic power are as given by Table 2.

Node                      v1   v2    v3     v4      v5      v6       v7       v8
Iteration 1  Mobility(v)  0    6     9      19      9       6        3        3
             Gain(v)      0    0.83  0.55   0.26    4.44    6.66     13.33**  0
Iteration 2  Mobility(v)  0    3     6      16      6       3        0        0
             Gain(v)      0    1.66  0.83   0.31    6.66    13.33**  0        0
Iteration 3  Mobility(v)  0    0     3      13      3       0        0        0
             Gain(v)      0    0     1.66   0.38    13.33*  0        0        0
Iteration 4  Mobility(v)  0    0     3      13      3       0        0        0
             Gain(v)      0    0     1.66*  0.38    13.33   0        0        0
Iteration 5  Mobility(v)  0    0     3      13      3       0        0        0
             Gain(v)      0    0     1.66   0.38**  13.33   0        0        0
N. Chabini
computational elements in the design. However, because the mobility of some nodes is 0, Step 2 executes fewer than V times. For the design in Fig. 2, Step 2 executes no more than 8 times. As can be deduced from Table 3, the algorithm stopped after 5 iterations instead of 8: at iteration 1, nodes v1 and v8 were immediately marked as visited and never tried, since their mobility is 0, and at iteration 3, node v2 was immediately marked as visited and never tried, since its mobility is 0. Finally, for the example described in this paragraph, the proposed algorithm reduced dynamic power dissipation by 22.36%. Let V and E be the number of computational elements and the number of arcs in G = (V, E, d, w), respectively. Step 2 of the algorithm executes no more than V times. Steps 3 and 9 can be carried out using the Bellman-Ford algorithm [10][11], so their time complexity is O(V · E). The time complexity of all the other steps is in O(V · E). Thus, the time complexity of the algorithm is in O(V² · E).
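The O(V · E) cost of each feasibility check comes from running Bellman-Ford on a system of difference constraints derived from G and Π. The construction of the retiming constraints follows [8][10][11] and is omitted here; this generic checker (our sketch, with our own names) only illustrates where the per-call O(V · E) bound comes from.

```python
def feasible(num_vars, constraints):
    """Bellman-Ford feasibility test for a system of difference constraints
    x_u - x_v <= c, given as (u, v, c) triples over variables 0..num_vars-1.
    Returns True iff the system is feasible (no negative-weight cycle),
    i.e. a legal retiming exists for the derived constraint set."""
    dist = [0.0] * num_vars            # implicit zero-weight virtual source
    for _ in range(num_vars):          # at most V relaxation passes ...
        changed = False
        for u, v, c in constraints:    # ... over E constraint edges
            if dist[v] + c < dist[u]:
                dist[u] = dist[v] + c
                changed = True
        if not changed:
            return True                # converged: feasible
    # any further improvement would mean a negative cycle: infeasible
    return all(dist[v] + c >= dist[u] for u, v, c in constraints)
```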
4 Experimental Results

The objective of this section is to assess the proposed algorithm in terms of reductions of dynamic power dissipation. To this end, we coded the algorithm in C++ and executed it on the set of designs reported in the first column of Table 5. The supply voltages, execution delays, and dynamic power dissipations used for this assessment are presented in Table 4 and are similar to those provided in [7]. For each design, the clock period Π is fixed to the minimal clock period obtained by first retiming the design assuming all the computational elements operate with the high supply voltage: i) 5.0 V when the two supply voltages are (5.0 V, 3.3 V), ii) 3.3 V when the two supply voltages are (3.3 V, 2.4 V), and iii) 2.4 V when the two supply voltages are (2.4 V, 1.5 V). Let PdH(design) be the dynamic power consumed by the design using VddH only, and let PdH+L(design) be the dynamic power consumed by the design using a combination of VddH and VddL. Relative dynamic power saving (in %) is defined here as ((PdH(design) − PdH+L(design)) / PdH(design)) × 100. As can be noticed from Table 5, the proposed algorithm is able to reduce the dynamic power dissipation for this set of designs by factors ranging from 18.95% to

Table 4. Example of execution delays and dynamic power dissipations to be used for assessing the proposed algorithm. These data are similar to those in [7].

             5.0 V             3.3 V             2.4 V             1.5 V
Element   Delay  Power      Delay  Power      Delay  Power      Delay  Power
          (ns)   (uW)       (ns)   (uW)       (ns)   (uW)       (ns)   (uW)
Mul       100    2504       175    1090       287    577        717    225
Add       20     118        36     52         57     27         143    11
Sub       20     118        36     52         57     27         143    11
Table 5. Assessment of the proposed algorithm, assuming data provided in Table 4

                       Relative Dynamic Power Saving (%)
Design                 VddH=5.0V,     VddH=3.3V,     VddH=2.4V,
                       VddL=3.3V      VddL=2.4V      VddL=1.5V
DES                    18.95          15.82          0.88
Correlator, order 10   42.84          36.21          0.29
Correlator, order 20   43.00          35.95          2.64
Correlator, order 30   43.05          35.94          2.64
Correlator, order 40   43.07          34.90          2.64
Correlator, order 50   43.08          34.29          2.64
FIR, order 10          48.75          40.61          52.68
FIR, order 20          51.34          42.77          55.48
FIR, order 30          52.20          43.48          56.41
FIR, order 40          52.63          43.84          56.88
FIR, order 50          52.89          44.06          57.16
IIR, order 10          48.64          40.52          52.56
IIR, order 20          51.28          42.72          55.42
IIR, order 30          52.16          43.45          56.37
IIR, order 40          52.60          43.82          56.85
IIR, order 50          52.87          44.04          57.13
52.89% in the case of (VddH = 5.0 V, VddL = 3.3 V), from 15.82% to 44.06% in the case of (VddH = 3.3 V, VddL = 2.4 V), and from 0.29% to 57.16% in the case of (VddH = 2.4 V, VddL = 1.5 V).
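The relative-saving metric is easy to check by hand. Using Table 2's powers and the Table 3 assignments for the Fig. 2 example (v7, v6, and v4 moved to VddL), the arithmetic below — ours, not the paper's — reproduces the 22.36% figure quoted earlier:

```python
def relative_saving(p_high_only, p_mixed):
    """Relative dynamic power saving (%): ((PdH - PdH+L) / PdH) * 100."""
    return (p_high_only - p_mixed) / p_high_only * 100.0

# Fig. 2 example: all-high power = 4*20 uW + 3*100 uW = 380 uW; with v7 and
# v6 moved to 60 uW each and v4 to 15 uW (Table 3), the mixed power is 295 uW.
saving = relative_saving(380, 295)   # about 22.4%, consistent with the 22.36% reported
```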
5 Conclusions

The problem of reducing dynamic power dissipation by assigning low supply voltages to computational elements off critical paths, without violating timing, is in general NP-hard. This problem has been addressed in the case of combinational designs. It is more difficult in the case of clocked sequential designs, since critical paths are defined relative to the positions of registers. We addressed this problem for clocked sequential designs and proposed a polynomial-time algorithm for it. Experimental results have shown that the proposed algorithm is able to significantly reduce dynamic power dissipation.
References

1. Benini, L., Macii, E., De Micheli, G.: Designing Low Power Circuits: Practical Recipes. IEEE Circuits and Systems Magazine 1(1), 6–25 (2001)
2. Raghunathan, A., et al.: High-Level Power Analysis and Optimization. Kluwer, Norwell, MA (1997)
3. Usami, K., Horowitz, M.: Clustered voltage scaling technique for low-power design. In: Proceedings of the Int. Workshop on Low Power Design, pp. 3–8 (1995)
4. Usami, K., et al.: Automated low power technique exploiting multiple supply voltages applied to a media processor. IEEE Journal of Solid-State Circuits 33(3), 463–472 (1998)
5. Pering, T., Burd, T.D., Brodersen, R.W.: Voltage scheduling in the lpARM microprocessor system. In: Proceedings of the Int. Symp. on Low Power Electronics and Design, pp. 96–101 (2000)
6. Chabini, N., Chabini, I., Aboulhamid, E.-M., Savaria, Y.: Unification of basic retiming and supply voltage scaling to minimize dynamic power consumption for synchronous digital designs. In: Proc. Great Lakes Symp. on VLSI, Washington, DC, pp. 221–224 (2003)
7. Chang, J., Pedram, M.: Energy Minimization Using Multiple Supply Voltages. IEEE Transactions on Very Large Scale Integration Systems 5(4), 436–443 (1997)
8. Leiserson, C.E., Saxe, J.B.: Retiming Synchronous Circuitry. Algorithmica, 5–35 (1991)
9. Sheikh, F., Kuehlmann, A., Keutzer, K.: Minimum-power retiming for dual-supply CMOS circuits. In: Proceedings of the 8th ACM/IEEE Workshop on Timing Issues in the Specification and Synthesis of Digital Systems, pp. 43–49. IEEE Computer Society Press, Los Alamitos (2002)
10. Cormen, T.H., Leiserson, C.E., Rivest, R.L.: Introduction to Algorithms. McGraw-Hill, New York (1990)
11. De Micheli, G.: Synthesis and Optimization of Digital Circuits. McGraw-Hill, Inc., New York (1994)
12. Boyer, F.-R., Aboulhamid, E.-M., Savaria, Y., Boyer, M.: Optimal design of synchronous circuits using software pipelining techniques. ACM Trans. Design Autom. Electr. Syst. 6(4), 516–532 (2001)
Low-Power Content Addressable Memory With Read/Write and Matched Mask Ports

Saleh Abdel-Hafeez(1), Shadi M. Harb(2), and William R. Eisenstadt(2)

(1) Department of Computer Engineering, Jordan University of Science & Technology, Irbid, Jordan 21110
[email protected]
(2) Department of Electrical & Computer Engineering, University of Florida, Gainesville, FL 32611, USA
{sharb,wre}@tec.ufl.edu
Abstract. A low-power content addressable memory (CAM) with read/write and mask match ports is proposed. The CAM cell is based on the conventional 6T cross-coupled inverters used for storing data, with the addition of two NMOS transistors for reading out. In addition, the CAM has another four transistors for the mask comparison operation, carried out through a classical pre-charge operation. The read-out port exploits a pre-charge reading mechanism in order to avoid the power consumption of the sense amplifiers and related synchronization circuits that are otherwise placed in every column of the memory. Thus, read and match operations can proceed concurrently. An experimental CAM structure of storage size 64-bit x 128-bit is designed in 0.18-µm CMOS with single poly and three metal layers, measuring a cell die area of 24.4375 µm² and a total silicon area of 0.269192 mm². The circuit works up to 200 MHz in simulation, with a total power consumption of 0.016 W at a 1.8-V supply voltage. Keywords: CAM, low power, pre-charge, sense amplifier, 6T-cell, 8T-cell.
1 Introduction

A Content-Addressable Memory (CAM) is a parallel functional memory that holds large amounts of stored data for simultaneous comparison with input data. The match result of the CAM is the address of the matching data. CAMs provide a highly efficient architecture for high-speed, fully parallel data searching and are used in a wide range of applications such as high-performance graphics, associative computing, data compression, processor caches, lookup tables, TLBs, database accelerators, neural networks, image coding, and IP classification [1]-[4]. In standard CMOS technology, several 6T SRAM cell structures have been utilized for CAM memory cells [4]-[5], where the traditional CAM cell requires nine transistors: six for read/write and three for comparison. However, most of these designs have a reliability problem, high power consumption, or are unsuitable for continued technology scaling. For
N. Azemard and L. Svensson (Eds.): PATMOS 2007, LNCS 4644, pp. 75–85, 2007. © Springer-Verlag Berlin Heidelberg 2007
example, the 9T traditional CAM inherits a stability problem when performing a disturb-read operation, which affects memory reliability and significantly decreases the static noise margin (SNM). Furthermore, most of these designs consume a considerable amount of power during the read operation, due mainly to the I/O interface buffer with standby sense amplifier currents. In addition, these designs normally lack read-out and mask ports, where concurrent read and match operation is an important feature for the testing procedure; the read function is also important for data retrieval and refresh purposes. On the other hand, several CAM designs with fewer than nine transistors have been proposed in the literature [6][7], which require special manufacturing technology [8]. For example, a 4T dynamic CAM presented in [6] achieves high memory densities but suffers from a match line coupling effect between adjacent cells and write/match interference. A selective-precharge CAM is proposed in [9] to reduce the match line power consumption by performing a partial comparison first; only if a partial match is obtained in a given row is the match line pre-charged. However, this approach carries a time penalty in the worst case, if all lines are pre-charged. The toggling match line CAM [9] alternates between active-high and active-low match lines to cut their switching activity in half. However, this technique requires an additional active-high/active-low (AHAL) signal, which incurs more hardware cost, and it also increases the hit-match power consumption. Other work has been based on a single bit line design with a five-transistor D-latch and different comparison circuit topologies, such as 7T, 9T, and 10T CAMs [9]-[10]. To avoid the drawbacks of dynamic circuit design, such as noise margin, clock skew, and charge sharing, these designs exploit a static pseudo-NMOS logic structure to reduce the switching activity in the match line.
However, these designs suffer from considerable static power consumption. To realize fast access time, low-voltage technology features, low power consumption, and comparable silicon area with low area overhead, the 8T cell presented in [11]-[13] is adopted in our CAM cell, which results in a new 12T CAM structure. In the 8T structure, adding two stacked nFETs to a 6T cell provides a read mechanism that doesn't disturb the internal node of the cell. This requires separate read and write word lines and can accommodate dual-port operation with separate single/multi-port read and write bit lines. The absence of a read-disturb issue allows more scaling, by lowering the Vth of the cell nFETs to the same degree as the Vth of CMOS logic transistors [13]. Furthermore, the dual-port 8T cell alleviates any stability problem, providing a significantly larger SNM, especially at low voltage, and it even provides a performance advantage over the 6T cell if the pass-gates and read buffer are designed as strong devices [13]. The area of the 8T cell is 30% larger than the 6T cell [12]-[13], but it was shown in [13] that for the same speed the complete SRAM module area is reduced by 15%. Although this area overhead is large in the field of memory, [13] shows that the overall memory silicon area for the 8T cell is smaller for the same capacity and operating speed, owing to the elimination of the synchronization and sequential timing adjustment circuits used in 6T-cell SRAM design. The paper is organized as follows: Section 2 discusses the CAM cell structure and layout, the design topology is given in Section 3, a functional overview is given in Section 4, simulation and results are given in Section 5, and the conclusion is given in Section 6.
2 12T CAM Cell

2.1 Cell Structure

The proposed 12T CAM cell contains an eight-transistor memory structure for the memory read/write operation and four additional transistors for data comparison. In the 8T memory structure, the conventional CMOS six-transistor cross-coupled inverters are employed, since they are stable, compact, and more reliable than any other regenerative circuit design. Consequently, the 6T portion of the cell is optimized with minimum sizes for the write operation only, while the two additional stacked NMOS transistors are optimized with minimum sizes for reading out, as shown in Figure 1. An added transistor (K1), controlled by a control signal called mask (MK), is used to prevent the path to ground that would otherwise form during the pre-charge phase. This modification gives the match line high flexibility in selecting the cells needed for comparison. When the mask line is low, no comparison takes place and no activity occurs at the match line. Once the mask signal is enabled, the stored data are compared against the incoming input data through the complementary bit lines. If the stored data and input data match in the masked cell, the match line is kept high; otherwise, the match line is pulled down to the ground level.
Fig. 1. 12T CAM Cell with suggested transistor sizes using 0.18µm
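The masked compare described above can be summarized behaviorally (our illustration, not the circuit netlist): the pre-charged match line is pulled down by any masked cell whose stored bit differs from the input bit, while cells with MK low never discharge it.

```python
def match_line(stored, data, mask):
    """Behavioral model of one row's masked compare.
    Returns True iff the pre-charged match line stays high, i.e. every
    masked cell's stored bit equals the corresponding input bit; cells
    with mask (MK) low are ignored and cannot pull the line down."""
    return all(s == d for s, d, m in zip(stored, data, mask) if m)

match_line([1, 0, 1, 1], [1, 1, 1, 1], [1, 0, 1, 1])  # True: the only mismatch is unmasked
match_line([1, 0, 1, 1], [1, 1, 1, 1], [1, 1, 1, 1])  # False: a masked mismatch discharges the line
```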
2.2 Cell Layout

The layout of the 12T cell has an efficient geometry arrangement and small area, as shown in Figure 2. The proposed cell is designed in 0.18 µm with three metal layers. The write operations are performed in a 6T geometry of approximate size 2.2 x 2.2 µm², while the total cell size is 5.75 x 4.25 µm². The read bit line is implemented with a separate poly line, which yields isolation between the read
and write mechanisms. Furthermore, the sizes of the two NMOS read transistors provide fast and sufficient conduction capability. On the other hand, the sizes of the transistors in the write portion of the cell are optimized for the write operation only, since the read operation uses different portions of the cell. This makes the write portion of the 8T cell 6% smaller than the conventional 6T SRAM cell, which implies that the presented cell's total area is only 30% larger than that of the 6T cell, as presented in [12]-[13]. This area overhead comprises not only the two added transistors but also the contact areas of VDD, GND, WWL, RWL, BL, BLB, and RBL for the complete memory cell array overlap structure. This simplifies routing for the complete structure by eliminating intermediate buses between cells.
Fig. 2. CAM Cell Layout
3 Design Topology

3.1 Architectural Overview

The high-level CAM architecture for 64-bit (height) x 128-bit (width) is shown in Figure 3. The main functional parts are the CAM array, organized as rows and columns with parallel search capability, the read/write pre-decoder/decoder circuitry, and the I/O buffers. The CAM structure is split into two main blocks, with the address decoders passing through the middle in order to minimize the loading capacitances on the signal drivers and optimize the layout routing geometry.
Fig. 3. Architectural Overview
3.1.1 CAM Array Structure

The CAM array structure presented in Figure 4 consists of an array of the proposed 12T cells with read, write, and comparison capabilities. The read word line (RWL), the write word line (WWL), and the match line (MT) traverse the CAM array horizontally for each row using metal 2 (M2). On the other hand, the write bit line (WBL) and the mask control signal (MK) traverse the CAM array vertically for each column using metal 3 (M3). A pull-up pre-charge pseudo-PMOS transistor, driven by the match clock signal (MCLK), is attached to each row's match line. The match line (MT) capacitance is the transistor (K2) diffusion capacitance per cell plus wire capacitance, while the read line capacitance is the transistor (N2) diffusion capacitance per cell plus wire capacitance. The power supply line (VDD) runs horizontally using M2, while the ground line (GND) runs vertically using M3.
Fig. 4. CAM Array Structure
Fig. 5. Read/Write Pre-decoder/Decoder Circuit
3.1.2 Output Buffer Sense Circuitry

A low-power, low-voltage, and high-speed sense buffer is designed to realize the pre-charge mechanism as well as to sense the output value on the read lines, as shown in Figure 5. As with the sense amplifier circuit of the 6T structure, the output buffer is implemented for every column. The PMOS transistor (P1) pre-charges the RBL at the falling edge of the read clock, while the output latch holds the previous outcome. During the rising edge of the clock, the P1 transistor is disabled and the output latch updates to the current RBL value. The timing intervals between P1, the output latch, and the RWL are preserved in order to minimize power dissipation and provide high-speed read access with a reliable outcome. The timing synchronization is accomplished simply through the use of INV1, INV2, and INV3.
4 Functional Overview

Since the write mechanism in the 8T cell is similar to the conventional 6T approach, the focus in this brief is only on the read and comparison operations. For the read operation, the RBL is pre-charged while RCLK is asserted low, implying

Tc2 = α·CLm / (βp·Vdd) = (α·Tox / (μ·ε)) · (Lp / Wp) · (CLm / Vdd),   (1)

where α·Tox/(μ·ε) collects the 0.18-µm technology parameters and Vdd is the power supply voltage (1.8 V); Lp and Wp are the length and width of the pre-charge sense circuit transistor (P1), selected as Lp = 0.18 µm and Wp = 8 µm; and CLm is the RBL capacitance. On the other hand, the RWL is enabled during the rising edge of RCLK, and the pre-charge sense buffer transistor (P1) is disabled. Thus, the data cells of the selected row are evaluated and latched through the sense buffers with an access time of Tdacc and held in the output buffer until the next rising edge, as shown in Fig. 6. The worst-case read access time is given by

Tdacc = (α·Tox / (μ·ε)) · (Lneff / Wneff) · (CLm / Vdd) + sense buffer delay time,   (2)
where Wneff and Lneff are the effective sizes of the cell read transistors N1 and N2. Furthermore, the power consumed by a single RBL is simply given by

P = CLm · Vdd² · f,   (3)

where f is the maximum RCLK operating frequency.
Accordingly, the overall read bit line power of a CAM block with N column lines is

Prt = N · CLm · Vdd² · f.   (4)
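Equations (3) and (4) amount to the following sketch. The CLm value used below is purely illustrative — the paper does not report it — and only Vdd = 1.8 V and f = 200 MHz come from the text.

```python
def rbl_power(c_lm, vdd, f):
    """Eq. (3): dynamic power of one read bit line, P = CLm * Vdd^2 * f."""
    return c_lm * vdd ** 2 * f

def total_read_power(n, c_lm, vdd, f):
    """Eq. (4): overall read bit line power of a block with N columns."""
    return n * rbl_power(c_lm, vdd, f)

# With an assumed CLm of 100 fF at Vdd = 1.8 V and f = 200 MHz:
p_line = rbl_power(100e-15, 1.8, 200e6)                # 64.8 uW per line
p_block = total_read_power(128, 100e-15, 1.8, 200e6)   # about 8.3 mW for 128 columns
```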
Similarly, for the comparison operation, the match line of each row is pre-charged at the falling edge of MCLK, just as in the pre-charge phase of RCLK. At evaluation, the pre-charge is disabled and the mask lines are decoded, which enables the masked cells for comparison. As long as all cells in the masked row match, the match line stays pre-charged; otherwise, it is pulled down to ground. It is worth mentioning that read/write and comparison can be performed concurrently. In addition, all the previous equations for the read operation also apply to the comparison operation, since both use the same structure and the same sense buffers.
Fig. 6. (a) Read timing constraints with respect to read address and read data
5 Simulation and Results

The simulation of a 64-bit x 128-bit CAM, conducted with the SpectreS simulator in Cadence, is presented in Figures 7-9. As can be observed, the input data written into the memory cells is read out correctly, and the memory functions correctly. Figure 9 shows the waveforms of the comparison operation. As shown, the input data is written in one cycle and compared against a different input value in the next cycle, which results in discharging the match line as soon as the mask line is enabled. The worst-case output delay is measured to be 3.91 ns, which implies that the CAM component can operate at 200 MHz. In addition, the worst-case power consumption is 0.016 W, while that of the reported 6T-cell SRAM-based CAM is 0.2936 W; the difference is due mainly to the standby current of the sense amplifiers placed in every column. Figure 10(a) compares the maximum power consumption of the 6T and 8T cells at different operating frequencies and shows that the 8T structure reduces the power consumption by an average of 34%
Fig. 7. Simulation results of memory read outcome
Fig. 8. Simulation results of memory write outcome
Fig. 9. Simulation Results of memory comparison
for the same capacity and speed. On the other hand, Figure 10(b) shows the area cost of the 8T cell: it is 15% to 5% less than that of the 6T cell at sizes below 20 kbits, owing to the area overhead induced by the synchronization circuitry in the 6T design, while it is larger otherwise due to the larger layout of the 8T cell.
Fig. 10. (a) Maximum power dissipation versus operating frequency; (b) total layout area in µm² versus memory size in bits of storage
6 Conclusion

A novel 12T CAM cell for a low-power, low-supply-voltage, and high-density embedded CAM structure is proposed. The cell's read and match portions provide fast conduction capability with concurrent operation. Moreover, the proposed cell has separate read, write, and compare mechanisms, and all traditional differential-pair sense amplifiers are eliminated together with their synchronization overhead circuitry. The proposed pre-charge-and-evaluate sense buffer is designed for low
power consumption and high-speed operation. Accordingly, the simulation results show that our CAM (64-bit x 128-bit) operates at 200 MHz and consumes about 34% less power than the 6T SRAM-based CAM structure. In addition, the overall silicon area is about 12% less than that of a similar CAM of the same storage capacity built with 6T SRAM cells and operating at the same speed.
References

1. Pei, T.B., Zukowski, C.: VLSI implementation of routing tables: tries and CAMs. In: Proc. IEEE INFOCOM, vol. 2, pp. 515–524 (1991)
2. Panchanathan, S., Goldberg, M.: A content-addressable memory architecture for image coding using vector quantization. IEEE Trans. Signal Processing 39(9), 2066–2078 (1991)
3. Lee, C.Y., Yang, R.Y.: High-throughput data compressor designs using content addressable memory. IEE Proc. Circuits, Devices and Systems 142(1), 69–73 (1995)
4. Wade, J.P., Sodini, C.G.: A ternary content-addressable search engine. IEEE J. Solid-State Circuits 24, 1003–1013 (1989)
5. Miyatake, H., Tanaka, M., Mori, Y.: A design for high-speed low-power CMOS fully parallel content-addressable memory macros. IEEE J. Solid-State Circuits 36(6), 956–968 (2001)
6. Delgado-Frias, J.G., Yu, A., Nyathi, J.: A dynamic content addressable memory using a 4-transistor cell. In: Proc. Third International Workshop on Design of Mixed-Mode Integrated Circuits and Applications, pp. 110–113 (1999)
7. Lin, C.S., Chang, J.C., Liu, B.D.: Design for low-power, low-cost, and high-reliability precomputation-based content-addressable memory. In: Proc. Asia-Pacific Conference on Circuits and Systems (APCCAS '02), vol. 2, pp. 319–324 (2002)
8. Miwa, T., Yamada, H., Hirota, Y., Satoh, T., Hara, H.: A 1-Mb 2-Tr/b nonvolatile CAM based on flash memory technologies. IEEE J. Solid-State Circuits 31(11), 1601–1609 (1996)
9. Thirugnanam, G., Vijaykrishnan, N., Irwin, M.J.: A novel low power CAM design. In: Proc. 14th Annual IEEE International ASIC/SOC Conference, pp. 198–202. IEEE Computer Society Press, Los Alamitos (2001)
10. Cheng, K.-H., Wei, C.-H., Chen, Y.-W.: Design of low-power content-addressable memory cell. In: Proc. 46th IEEE International Midwest Symposium on Circuits and Systems (MWSCAS '03), vol. 3, pp. 1447–1450 (2003)
11. Abdel-Hafeez, S.M., Sribhashyam, S.P.: System and method for efficiently implementing double data rate memory architecture. US Patent No. 6,356,509, issued March 12, 2002
12. Chang, L., et al.: Stable SRAM cell design for the 32 nm node and beyond. In: Symp. VLSI Technology Dig., pp. 128–129 (2005)
13. Takeda, K., et al.: A Read-Static-Noise-Margin-Free SRAM Cell for Low-Vdd and High-Speed Applications. IEEE J. Solid-State Circuits 41(1), 113–121 (2006)
14. Stojanovic, V., Oklobdzija, V.G.: Comparative Analysis of Master-Slave Latches and Flip-Flops for High-Performance and Low-Power Systems. IEEE J. Solid-State Circuits 34(4), 536–548 (1999)
The Design and Implementation of a Power Efficient Embedded SRAM

Yijun Liu, Pinghua Chen, Wenyan Wang, and Zhenkun Li

The Faculty of Computer, Guangdong University of Technology, Guangzhou, Guangdong, China, 510006
[email protected]
Abstract. In this paper, a power-efficient 2-Kb asynchronous SRAM is presented for embedded applications. The SRAM adopts a low-swing write scheme, which greatly reduces the power dissipated by charging and discharging the bitlines. A small dual-rail decoder is proposed to compensate for the extra silicon area needed by the low-swing write technique. The new SRAM is demonstrated to achieve a factor-of-4 improvement in power efficiency over a commercial SRAM macro. It is also 30% faster than the commercial SRAM macro, with only a 3% area overhead.
1 Introduction
As important components, embedded SRAMs consume a significant portion of the power dissipated in System-on-Chip (SoC) circuits, because they contain high-capacitance bit lines which are frequently switched. A number of techniques [1][2] have been proposed to minimize the power dissipation of SRAMs. Although these techniques differ, they are all based on a few basic principles: minimizing the active capacitance and reducing the swing voltage. Dynamic power consumption of a conventional static CMOS circuit is the major portion of its total power consumption during full operation. Dynamic power consumption can be calculated by the following equation:

P = (1/2) · C · Vdd² · f,   (1)
where C is the actual switched capacitance of the CMOS circuit, Vdd is the supply voltage, and f is the operating frequency. Since the power consumption of a CMOS circuit is directly proportional to its swing voltage squared, reducing the swing voltage can achieve a large power saving. In conventional SRAMs, during read operations the bitlines are not fully swung, because the driving ability of the RAM cells is too weak to fully swing the high-capacitance bitlines. Sense amplifiers are used to amplify the voltage difference between the bitlines; they not only increase the speed of read operations but also, by Equation 1, reduce their power consumption. For write operations, the bitlines are fully swung, because they must overpower the memory cell flip-flop with new values. So one write operation usually
N. Azemard and L. Svensson (Eds.): PATMOS 2007, LNCS 4644, pp. 86–96, 2007. © Springer-Verlag Berlin Heidelberg 2007
consumes more power than one read operation. Moreover, the capacitance of the bitlines increases with the number of rows in an SRAM, and the size of the write drivers must also be increased to maintain the write speed. This further increases the power consumption of write operations: write operations use 90% of the total power when the number of words in a column reaches 256 [3]. As a result, minimizing write power is critical in the design of a power-efficient SRAM. This paper discusses the design of a 2-Kb SRAM using a low voltage swing during both read and write operations. A novel dual-rail decoder is proposed, which significantly reduces the size of the SRAM. The specific circuit implementation and experimental results are presented.
2 Low-Swing Write Techniques
The critical problem of a low-swing write is overwriting the memory cell flip-flop while maintaining the stability of the cell during reads. Two general techniques have been proposed to achieve low-swing writes:
– reduce the reference voltage of the bit lines;
– use more 'input-sensitive' memory cells.
In conventional embedded SRAM design, the bit lines are referenced to Vdd by precharging, and during writes the write drivers discharge half of the bit lines to ground. Thus, write power can be minimized by reducing the bit line reference voltage. Allowersson and Andersson [4] proposed a technique using a low bit line reference voltage of 0.5 V under Vdd = 5 V. A small bit line differential is propagated to the internal cell nodes through the access n-transistors when the word line is activated. When the word line is turned off, the small voltage differential is amplified by the positive feedback of the cross-coupled inverters. To maintain the stored Boolean value, a lower word line voltage is used during reads, weakening the access n-transistors and preventing a spurious discharge of the internal cell nodes. The penalty of this technique is that when the access n-transistors are gated by a lower voltage, read accesses become slower, resulting in a longer read delay. Mai et al. use the same principle, except with a different bit line reference voltage [5]. They propose a half-swing pulse-mode technique which can easily generate the voltage Vdd/2, which they use as the bit line reference voltage. During a write, half of the bit lines are discharged to ground. Since the bit lines swing only half way, 75% of the bit line write power can be saved compared to conventional techniques. However, using a Vdd/2 reference for the bit lines may potentially lead to cell instability during reads, because of the leakage current from logic-high internal nodes. This problem can be solved by increasing the internal cell voltage.
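The 75% figure follows directly from the quadratic swing dependence of Equation 1, as this small check shows. The capacitance and frequency values below are illustrative placeholders, not figures from the paper; the constant 1/2 of Equation 1 cancels in the ratio.

```python
def swing_power(c_bl, v_swing, f):
    """Bit line switching power modeled as C * Vswing^2 * f
    (Equation 1 up to the constant 1/2, which cancels in ratios)."""
    return c_bl * v_swing ** 2 * f

full = swing_power(1e-12, 1.8, 100e6)   # full-swing write, Vdd = 1.8 V (assumed)
half = swing_power(1e-12, 0.9, 100e6)   # half-swing write, Vdd/2
saving = (full - half) / full * 100     # 75.0, as stated in the text
```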
The advantage of low-swing write techniques based on reducing the bit line reference voltage is that they use the same RAM cells as conventional SRAMs, so there is no area penalty in the cell arrays. However, these techniques have the following disadvantages:
– Special techniques must be used to generate the bit line reference voltage and the internal cell voltage.
– Since the bit line reference voltage is reduced, the differential voltage between the cell's internal nodes becomes smaller than usual, which inevitably results in a longer write delay. An alternative way to achieve low swing writes is to use more ‘input-sensitive’ memory cells. Wang et al. [6] propose a current-mode RAM cell as shown in Figure 1(a). The RAM cell is similar to a conventional 6-transistor RAM cell, except for an equalization n-transistor, Meq. During reads, the cell-equalization signal weq is kept low, and the RAM cell acts the same as a conventional RAM cell. The equalization n-transistor clears the content of the memory cell prior to a write, so that the bit lines can overwrite the internal cell nodes even with a small voltage differential. P-transistors are used as access transistors, which results in a very compact cell area. However, since the gain factor of p-transistors (βp) is smaller than that of n-transistors (βn), the RAM cell has a long access delay during reads and writes. If n-type access transistors were used instead, the area would be much bigger, since each RAM cell would contain five n-transistors, which is hard to lay out regularly. Another input-sensitive RAM cell is proposed by Kanda et al. in [3], as shown in Figure 1(b). The RAM cell is called a sense-amplifying RAM cell (SAC) because it acts as a sense amplifier. During reads, SLC is kept high, making the RAM cell act the same as a conventional RAM cell. Before a write, SLC is dropped low and the Vss switch is turned off before the word line turns on. Even with a very small voltage differential between a pair of bit lines, the cell node can be inverted, because the driver n-transistors do not draw current. After the word line goes low, the Vss switch is turned on and the small voltage differential is amplified to full swing inside the RAM cell. The SAC technique is similar to another technique proposed by Amrutur [7], shown in Figure 2. All the cells in a given row share a virtual ground (vgnd), which is driven by an AND-gate. During reads, the virtual ground is driven low, making the SRAM act the same as a conventional SRAM. Before a write, the virtual ground is first driven high to reset the contents of the cells. Then the word line is pulsed high, transferring the small-swing bit line signal to the internal cell nodes. After that, the virtual ground is driven low, restoring full swing inside the
Fig. 1. ‘Input-sensitive’ RAM cells
The Design and Implementation of a Power Efficient Embedded SRAM
89
cells. However, this technique has problems during reads. Since all the cells in the row share the virtual ground, the read currents of the cells gather on the virtual ground and flow to ground through an n-transistor inside the AND-gate. This causes two problems: 1. The current is so large that the virtual ground needs a wide metal wire, wider than the RAM cell height; 2. The current raises the voltage of the virtual ground, resulting in read instability. One solution is to distribute the n-transistor of the AND-gate across the cells in the row, as the SAC scheme does.
Fig. 2. Amrutur’s low write scheme
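Both write schemes above follow the same reset-then-amplify pattern: stop the cell from driving, let a small bit line differential set the state, then restore the feedback path. A behavioral sketch of the SAC write sequence (pure illustration; the signal names follow the text, and all analog timing is abstracted away):

```python
def sac_write(cell, bl_diff_mv):
    """Behavioral model of a SAC write (illustration, not a circuit simulation)."""
    # 1. SLC drops low: the Vss switch opens, so the cross-coupled
    #    drivers stop drawing current and cannot fight the bit lines.
    # 2. Word line pulses high: even a small bit line differential
    #    (here just the sign of bl_diff_mv) sets the internal nodes.
    cell["A"] = 1 if bl_diff_mv > 0 else 0
    cell["B"] = 1 - cell["A"]
    # 3. Word line low, Vss switch closes: positive feedback amplifies
    #    the small differential to full swing inside the cell.
    return cell

print(sac_write({"A": 0, "B": 1}, bl_diff_mv=200))   # {'A': 1, 'B': 0}
```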
Comparing these techniques, we decided to use the SAC scheme for low swing writes because: – the RAM cell is similar to the conventional 6-transistor RAM cell; – the write and read delays are relatively short compared to other techniques; – the RAM cell maintains a reasonable static noise margin. The design of the sense-amplifying RAM cell is very important in the SAC scheme, since memory cell design is a tradeoff among silicon area, read speed and noise margin. The basic requirement of RAM cells is a safe read. Figure 3(a) illustrates the read current path of a sense-amplifying cell. Let us assume that internal node A is low, B is high and the bit lines are precharged to Vdd − VT (VT is the threshold voltage of the n-transistors). If T2 stays off during a read, the state held inside the RAM cell will not be flipped and read operations are safe. It is easy to verify that all three n-transistors (Taccess, T1 and Tss) are in their linear regions, since they all satisfy the condition 0 < Vds < Vgs − VT. The n-transistors Taccess, T1 and Tss are fabricated in the same nMOS technology and have the same channel lengths, so their resistances are inversely proportional to their widths. The equivalent circuit of Figure 3(a) is shown in Figure 3(b). From Figure 3(b),

VdsT1 = (Vdd − VT) × (α/W1) / (α/Waccess + α/W1 + α/Wss)
      = (Vdd − VT) × 1 / (W1/Waccess + W1/Wss + 1)   (2)

VgsT2 = VdsT1   (3)
where α is a constant defined by the CMOS technology. We assume that all transistors are designed to have the minimum channel length.
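As a quick numeric check of Equations (2) and (3), the divider can be evaluated for a candidate sizing. The supply and threshold values below are plausible 0.18 μm numbers assumed for illustration, not figures from the paper:

```python
def vgs_t2(vdd, vt, w_access, w1, w_ss):
    """V_gsT2 = V_dsT1 from the resistive divider of Eq. (2):
    (Vdd - VT) / (W1/Waccess + W1/Wss + 1)."""
    return (vdd - vt) / (w1 / w_access + w1 / w_ss + 1)

# Hypothetical widths in um (access at minimum size, W1 = 2x, Wss = 8x)
v = vgs_t2(vdd=1.8, vt=0.45, w_access=0.28, w1=0.56, w_ss=2.24)
print(f"VgsT2 = {v:.3f} V, read safe: {v < 0.45}")
```

With these assumed values, VgsT2 stays below VT, so T2 remains off and the read is non-destructive.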
Fig. 3. Discharge current during read
Fig. 4. Shared Tss scheme to reduce area overhead (transistor sizes W/L in μm: 0.28/0.18 for the access and load transistors, 0.56/0.18 for the driver transistors, 2.24/0.18 for the shared Tss)
The requirement that guarantees T2 stays off is VgsT2 < VT. Since the width of Taccess is normally designed at the minimum size, Equation 2 shows that increasing the width of T1 and reducing the size of Tss both reduce VgsT2, thus increasing the read margin. However, the contribution of a smaller Tss to the read margin is relatively small, and a small Tss also results in a small discharge current and a long read delay. This can also be seen from Figure 3(b):

Tdischarge ∝ Cbitline × ΔV × R ∝ α × (1/Waccess + 1/W1 + 1/Wss)   (4)

So the selection of the channel width of Tss is a tradeoff between speed and size. Another concern in SAC RAM cell design is where to put Tss. If each cell contains its own Tss transistor, the area overhead of the RAM cell array is very large.
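The speed/margin tradeoff of Equations (2) and (4) can be made concrete with a small sweep over Wss. The widths and voltages are illustrative assumptions, not the paper's extracted data:

```python
def read_margin_v(vdd, vt, w_access, w1, w_ss):
    # V_gsT2 from Eq. (2): a lower value means more margin before T2 turns on
    return (vdd - vt) / (w1 / w_access + w1 / w_ss + 1)

def t_discharge_rel(w_access, w1, w_ss):
    # Relative discharge time from Eq. (4); the constant alpha is dropped
    return 1 / w_access + 1 / w1 + 1 / w_ss

for w_ss in (0.56, 1.12, 2.24, 4.48):   # hypothetical Tss widths in um
    v = read_margin_v(1.8, 0.45, 0.28, 0.56, w_ss)
    t = t_discharge_rel(0.28, 0.56, w_ss)
    print(f"Wss={w_ss:4.2f}  VgsT2={v:.3f} V  rel. Tdischarge={t:.2f}")
```

Shrinking Wss lowers VgsT2 (more read margin) but raises the discharge time (slower read), matching the tradeoff stated above.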
Kanda et al. put one big Tss transistor on every 4 RAM cells. We use the same scheme, but we change the transistor sizes. Figure 4 illustrates the architecture and transistor sizes. The p-transistors and access n-transistors are set to the minimum size, W1 and W2 are set to 2 × Waccess, and Wss is set to 8 × Waccess. From our experiments and calculations, the penalties of the SAC RAM cell are a 12% area increase, a 25% noise margin decrease and a 7% read latency increase.
3 Dual-Rail Decoder
In large SRAMs, there are two kinds of decoders: row decoders and column decoders. The row decoder activates one of the word lines in a block, which connects the RAM cells of that row to the bit lines. The column decoder controls a big switch and connects one bit line column to the peripheral interface. Our design contains only a row decoder, since all the RAM cells are placed in a single column. Figure 5 illustrates a conventional two-level SRAM decoder, which takes an n-bit address input and activates one of 2^n word lines. To improve speed, first-level predecoders are used. Each predecoder decodes n/2 address bits to drive one of 2^(n/2) predecode lines. The second-level decoder is a column of 2-input AND gates, each of which ANDs two predecode lines to generate a word line signal.

Fig. 5. A two-level decoder
However, the decoder shown in Figure 5 has two problems: 1. One of the word lines is always active; 2. Two word lines may be active simultaneously for a short period when the address bits change. To shorten the active duty cycle of the word lines and keep the positive pulses of the word lines from overlapping, pulsed word line techniques are needed. A pulsed word line can be achieved either by gating the second-level AND gates with a pulsed enable signal, as shown in Figure 6(a), or by putting an address transition detection (ATD) pulse generator on each word line, as shown in Figure 6(b). The drawbacks of using a pulsed enable signal are: (1) the enable signal is a long wire with a big capacitance, and its fanout is 2^n; (2) the second-level AND gates have three stacked n-transistors, which increases the delay. With the second technique, the pulse generators increase the size of the decoder.
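The two-level decode itself is easy to model. The sketch below mirrors the structure of Figure 5 (two predecoders each decoding n/2 bits to one-hot lines, then a column of 2-input ANDs) in plain Python; it is an illustration, not the circuit netlist:

```python
def two_level_decode(addr_bits):
    """Activate one of 2^n word lines from an n-bit address (n even)."""
    n = len(addr_bits)
    half = n // 2
    hi = int("".join(map(str, addr_bits[:half])), 2)   # upper predecoder input
    lo = int("".join(map(str, addr_bits[half:])), 2)   # lower predecoder input
    pre_hi = [i == hi for i in range(2 ** half)]       # one-hot, 2^(n/2) lines
    pre_lo = [i == lo for i in range(2 ** half)]
    # Second level: each word line ANDs one line from each predecoder
    return [pre_hi[w >> half] and pre_lo[w % (2 ** half)]
            for w in range(2 ** n)]

wls = two_level_decode([1, 0, 1, 1])    # address 0b1011 = 11
print(wls.index(True), sum(wls))        # 11 1
```

Exactly one word line is active for any address, which is precisely the static behavior whose duty cycle the pulsing techniques above are meant to shorten.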
Fig. 6. Pulsing wordline techniques
Fig. 7. The proposed dual-rail decoder
To overcome the problems mentioned above, a novel dual-rail row decoder is proposed, as shown in Figure 7(a). In the new dual-rail decoder, the pulsed enable signal does not gate the last-level AND gates but gates the inputs of the address bits and their complements, giving the address bits a dual-rail format. The fanout of the enable signal is therefore 2n, which is much smaller than 2^n when
n > 4. The enable signal is generated by an ATD pulse generator under the control of an asynchronous Request signal. The schematic of the ATD pulse generator is illustrated in Figure 7(b). The pulse generator detects the rising edges of the Request signal and then generates a positive pulse whose width is determined by the delay element. A predicating circuit prevents the pulse generator from generating several pulses for one rising input edge. Since all the signals inside the dual-rail decoder are pulsed, it is safe to use dynamic AND gates inside the decoder, as shown in Figure 7(c). Each dynamic AND gate contains only one p-transistor, which minimizes the size and the power consumption of the decoder. The circuit in Figure 7(c) generates a word line signal and an SLC signal. The SLC signal is gated by the Read signal, since during reads SLC must stay low. In summary, the proposed dual-rail decoder is smaller and more power-efficient than conventional decoders, and the new scheme causes no speed penalty.
4 Architecture and Timing
The proposed SRAM is embedded in an asynchronous microprocessor, so its interface is designed to communicate with the outside using a normal 4-phase handshake protocol [8]. Inside the SRAM, a delay-matching scheme is used. The proposed SRAM is 2 Kb (64 × 32 bits), and its architecture is illustrated in Figure 8(a). Delay1 and delay2 are ATD pulse generators with asymmetric delays; they control the timing of the write enable signal (WEn) and the sense amplifier enable signal (REn), respectively. The write and read timing diagrams are shown in Figure 8(b) and (c), respectively. The acknowledge signal (ACK) is generated by a dummy column, which lies on the critical path of the SRAM. The ACK signal matches the read and write delays; its rising edges mean the current read or write operation has finished, and its falling edges mean precharging has finished. [3] uses a special voltage to generate the bit line voltage differential during writes; in contrast, we use only one voltage in the chip. Since the discharging n-transistors of the write drivers operate in the linear region, the voltage change on the bit lines is directly proportional to the discharge time and to the width of the discharging n-transistors, provided the channel length of the transistors is the minimum size. We have:

ΔV ∝ Tdischarge · W

The voltage differential can thus be controlled by adjusting the size of the n-transistors and the pulse width of WEn. We use a 0.18 μm CMOS technology, and the capacitance attached to the bit lines is 75 fF. To get a voltage differential of 0.2 volt, the size of the discharging n-transistors is W/L = 1.00 μm/0.18 μm, and the WEn pulse width is 0.12 ns. Since only a 0.2 volt differential needs to be discharged, the write driver is smaller than conventional write drivers, which partly compensates for the area overhead of the sense-amplifying RAM cell array.
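A back-of-envelope check of this sizing, using only the numbers quoted above (the constant-current assumption holds only roughly for a transistor in its linear region):

```python
# With the driver sinking a roughly constant current I, the bit-line
# differential is dV = I * t / C. From the text: C = 75 fF, dV = 0.2 V,
# WEn pulse width = 0.12 ns.
C_BL = 75e-15        # bit-line capacitance (F)
DV = 0.2             # target differential (V)
T_WEN = 0.12e-9      # WEn pulse width (s)

I_req = C_BL * DV / T_WEN            # average sink current required
print(f"required sink current ~ {I_req * 1e6:.0f} uA")   # ~125 uA
# Since dV ~ Tdischarge * W, halving the 1.00 um driver width would need
# roughly twice the WEn pulse width for the same differential.
```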
Fig. 8. The architecture and timing of the proposed SRAM
5 Experimental Results
The 2 Kb embedded SRAM was designed using an ST 0.18 μm technology. The chip microphotograph is shown in Figure 9. The circuit was simulated using HSPICE under typical operating conditions (supply voltage = 1.8 V, temperature = 27 °C). The experimental results and the comparisons with a commercial RAM macro (an ST SRAM macro [9]) are given in Table 1. The ST SRAM macro uses the same CMOS technology as the proposed SRAM. As can be seen from the table, the new SRAM reduces the power consumption of the ST SRAM macro by almost a factor of 4 with only a 3% area overhead. With an asynchronous control circuit, the new SRAM can have different read and write delays. During writes, the new SRAM is 50% faster than the ST RAM, and during reads it is 25% faster. The ST RAM uses a more compact memory cell, which is 11% smaller than the RAM cell used in this paper. The Tss n-transistors cause another 12% area overhead, so the proposed SRAM has a memory cell array 23% bigger than the ST memory cell array. However, the two RAMs have almost the same total size. This is because the proposed dual-rail decoder
Fig. 9. The layout of the proposed SRAM

Table 1. Comparisons between the new SRAM and the ST macro

              Proposed SRAM   ST macro      Ratio
Area (μm2)    245 × 126       186 × 160     1/0.97
Delay/Write   1 ns            2 ns          1/2
Delay/Read    1.5 ns          2 ns          1/1.33
Power/Write   4.62 μA/MHz     21.0 μA/MHz   1/4.55
Power/Read    4.70 μA/MHz     16.1 μA/MHz   1/3.42
and the small write driver needed by the low swing write compensate for the area overhead of the memory cell array. The decoder and the periphery circuits in the proposed SRAM are about 34% smaller than those in the ST RAM.
6 Conclusions
This paper presents a 64 × 32 bit SRAM for embedded applications. A novel dual-rail decoder is proposed, which cooperates efficiently with a low swing write scheme. We demonstrated a factor-of-4 improvement in power efficiency over a commercial RAM macro. There is hardly any area penalty in the proposed SRAM, because the area overhead of the sense-amplifying cell array is offset by the small decoder and write driver. Different read and write delays are possible with asynchronous control logic, which makes the proposed SRAM about 30% faster than the commercial SRAM macro.
Acknowledgements

The project was funded by a grant from the Ministry of Personnel of the People's Republic of China (No. 2006-164) and a grant from the Guangdong Science and Technology Department (No. 2006B11801010). The authors gratefully acknowledge these grants.
References
1. Itoh, K., Sasaki, K., Nakagome, Y.: Trends in Low-Power RAM Circuit Technologies. Proceedings of the IEEE 83(4), 524–543 (1995)
2. Margala, M.: Low-power SRAM circuit design. In: IEEE International Workshop on Memory Technology, Design and Testing, August 9–10, 1999, pp. 115–122 (1999)
3. Kanda, K., Sadaaki, H., Sakurai, T.: 90% write power-saving SRAM using sense-amplifying memory cell. IEEE Journal of Solid-State Circuits 39(6), 927–933 (2004)
4. Alowersson, J., Andersson, P.: SRAM cells for low-power write in buffer memories. In: IEEE Symposium on Low Power Electronics, October 9–11, 1995, pp. 60–61 (1995)
5. Mai, K.W., Mori, T., et al.: Low-power SRAM design using half-swing pulse-mode techniques. IEEE Journal of Solid-State Circuits 33(11), 1659–1671 (1998)
6. Wang, J., Yang, P., Tseng, W.: Low-power embedded SRAM macros with current-mode read/write operations. In: 1998 International Symposium on Low Power Electronics and Design, August 10–12, pp. 282–287 (1998)
7. Amrutur, B.S.: Design and Analysis of Fast Low Power SRAMs. PhD thesis, Stanford University
8. Sparsø, J., Furber, S. (eds.): Principles of Asynchronous Circuit Design: A Systems Perspective. Kluwer Academic Publishers, Dordrecht (2001)
9. Hemos8dSPS4 datasheet. ST Co. (April 04, 2002)
Design of a Linear Power Amplifier with ±1.5V Power Supply Using ALADIN Björn Lipka and Ulrich Kleine IESK, Otto-von-Guericke University of Magdeburg, P.O. Box 4120, 39016 Magdeburg, Germany {Bjoern.Lipka,Ulrich.Kleine}@E-Technik.uni-magdeburg.de http://iesk.et.uni-magdeburg.de/is/
Abstract. This paper presents an extension of ALADIN for approximately calculating the current density and heat distribution of analog circuits. The tools are integrated into the ALADIN package, which allows designers to create analog circuit layouts automatically. The optimization is sped up and the reliability of the design is improved. The benefit of ALADIN is demonstrated with the design of a linear power amplifier with a ±1.5 V power supply.
1 Introduction

In general, each analog application needs a special layout which pays attention to analog-specific requirements. Due to technology scaling, the parameters of MOS transistors get worse for analog integrated circuits, and therefore the design window in which the dimensions of elements must fit decreases. Hence, the amount of optimization will increase in the future, especially for low-voltage applications. For a reliable optimization, knowledge of the electrical parameters caused by the layout is required. Today, most analog layouts are hand-crafted by specialists, which is a very time-consuming process. Consequently, the optimization is normally performed without the layout: the parasitic elements are only roughly estimated during the optimization phase, and different topology alternatives are seldom checked. The analog circuit design tool ALADIN (Automatic Layout Design Aid for Analog Integrated Circuits) provides the ability to create high-quality layouts with minimum effort [1-4]. In Fig. 1(a) an overview of ALADIN is given. The entire system consists of three main components: the Design Assistant, the Module Generator and the Technology Interface. The Design Assistant provides a graphical user interface to optimize an analog circuit from schematic to layout. It is integrated into the Cadence Design Framework II and can easily be adapted to other commercial software. The Module Generator environment allows designers to write technology- and application-independent adjustable modules, as complex as necessary, using a natural description language. The Technology Interface applies a new methodology of design rule description and automates the adaptation to a new technology. The circuit design flow using ALADIN is illustrated in Fig. 1(b). In a schematic editor, a simulated circuit is partitioned into several modules.

N. Azemard and L. Svensson (Eds.): PATMOS 2007, LNCS 4644, pp. 97–106, 2007. © Springer-Verlag Berlin Heidelberg 2007

The parasitic capacitances in
the module layout are controlled by a capacitance sensitivity matrix. The electrical rules, such as electromigration and matching, are considered during module generation. The steps from selecting the elements of a module to layout generation are repeated until all the modules are defined. During the placement of all modules [3, 4], which is performed simultaneously with the global routing, an appropriate topology of each module is chosen from all possible alternatives. After the layout generation, an extraction is performed and the extracted parasitic elements are automatically annotated back into the schematic. The entire circuit is then simulated with a good parasitic estimation, and the optimization loop starts by changing parameters of the circuit and/or by redefining constraints. The concept of module generation in ALADIN is that the layout synthesis is not bound to a fixed generator library. Rather, ALADIN offers a natural description language, MOGLAN (module generator language) [1], with which analog circuit designers can easily modify existing topologies or create new generators. MOGLAN enables designers to describe parameterizable modules hierarchically and independently of the design rules. The language features loops, conditional statements and a set of functions to create and wire primitive geometries without considering absolute coordinates. Moreover, the design rules are automatically evaluated and fulfilled. Basic geometric objects are generated by calling primitive functions; complex modules are constructed by compacting either geometric primitives or hierarchically built objects onto an existing structure. The subject of this paper is the design of a ±1.5 V power amplifier using ALADIN. The buffer can drive a maximum of ±0.7 V into a 50 Ω load.
In order to optimize analog circuits with respect to current density and heat distribution, new tools have been developed and added to ALADIN. The paper is organized as follows. In the next section, the novel tools of the ALADIN package, which calculate the current density of modules and the heat distribution of circuits, are described. Then the design of a linear power amplifier and its layout solution are explained. Finally, the simulation results are discussed and conclusions are drawn.
Fig. 1. Structure Overview of ALADIN (a), circuit design flow within ALADIN (b)
2 Novel ALADIN Tools

2.1 Modified Calculation of Current Density

Due to electromigration effects, especially in driver circuits, excessive currents have to be avoided: narrow metal connections do not withstand the resulting current densities, causing electromigration. In general, the maximum current density should not exceed one milliampere per micrometer of wire width; this limit is most easily exceeded at the inner bend of a 90° corner [5]. A tool has been developed and integrated into the ALADIN package which performs a rough calculation of the current density of an arbitrary shape [12]. Besides the simulation of arbitrary shapes, the tool offers designers the possibility to simulate parts of a layout or the whole layout structure. All layers can be selected for the current density calculation. In this manner, it is possible to optimize a module or a layout with respect to the current density distribution. In earlier publications, the resistance of arbitrary shapes was calculated by solving for the potential distribution [6]. The presented program takes another approach: the given shape is divided into small squares of the same size, and each square is modeled by four resistors, one per side, each with twice the sheet resistance value. By solving the resulting resistor network, it is possible to obtain the approximate current density distribution of a selected shape. The size of the squares can be chosen arbitrarily to improve the accuracy of the results. In Fig. 2 the simulation results for a composite shape are shown. The left half of the figure consists of a 90° corner bend. As can be seen, the highest current density occurs at the inner bend, whereas the current density within the surrounding area and the remaining figure is comparatively low.
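A minimal version of this square-decomposition idea can be sketched in a few lines: tile the shape with unit squares, connect neighboring squares with equal resistors, fix the terminal potentials, and relax the node potentials (Gauss-Seidel) until Kirchhoff's current law holds everywhere. The L-shape, terminal placement and iteration count below are illustrative, not ALADIN's data structures:

```python
# Nodes of an L-shaped conductor, one node per unit square
shape = {(x, y) for x in range(6) for y in range(2)}   # horizontal arm
shape |= {(x, y) for x in range(2) for y in range(6)}  # vertical arm

fixed = {(5, 0): 1.0, (5, 1): 1.0, (0, 5): 0.0, (1, 5): 0.0}  # terminals
V = {n: 0.0 for n in shape}
V.update(fixed)

for _ in range(2000):               # Gauss-Seidel relaxation to steady state
    for n in shape:
        if n in fixed:
            continue
        nbrs = [(n[0] + dx, n[1] + dy)
                for dx, dy in ((1, 0), (-1, 0), (0, 1), (0, -1))]
        nbrs = [m for m in nbrs if m in shape]
        V[n] = sum(V[m] for m in nbrs) / len(nbrs)   # KCL with equal resistors

# The current on an edge is proportional to the potential drop across its
# resistor; large drops indicate current crowding (e.g., at the inner bend).
drops = [abs(V[a] - V[b]) for a in shape for b in
         ((a[0] + 1, a[1]), (a[0], a[1] + 1)) if b in shape]
print(f"max edge current (arb. units): {max(drops):.3f}")
```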
Fig. 2. Simulated current density of a shape composed of a 90° corner bend and a 135° corner bend
On the right half of the shape, the simulation results for a 135° corner bend are depicted. Even though some loading can be observed at the inner bends, it is quite low within the endangered areas. A graphical user interface has been implemented in Tcl/Tk to view the results of a current density calculation. The actual simulation is performed by a program written in C++. This tool of the ALADIN package can be used independently of Cadence or other commercial CAD software.

2.2 Heat Simulation

The increasing miniaturization of integrated structures requires paying more and more attention to electrothermal interactions. Many mechanisms in ICs depend on the operating temperature, such as offset voltages, electrical overstress and electromigration. Designers of integrated circuits with critical thermo-electrical behavior need assistance during the development process, and there have been several research efforts dealing with the solution of electrothermal problems in electronic circuits [7-9]. The presented heat simulation program is a tool of the ALADIN package that assists the designer in optimizing a layout with respect to the heat distribution of the circuit. The input data for the heat simulation is based on the DC simulation of the analog extract of a circuit; specifically, the power dissipation of the transistors is chosen as input data and extracted by the tool. Furthermore, it is assumed that the thermal conductivity is temperature-independent, that the heat spreading is isotropic, and that only the steady state is simulated. Under these assumptions, the heat conduction equation can be written as
λ(x, y, z) · [∂²T(x, y, z)/∂x² + ∂²T(x, y, z)/∂y² + ∂²T(x, y, z)/∂z²] = −g(x, y, z)   (1)
where T is the temperature, g is the power density of the heat sources and λ is the thermal conductivity. Solving the differential equation numerically generates a large linear equation system:

A · T = G   (2)
where T is the vector of temperatures, A is the matrix of thermal conductivities and G is the vector of input heat sources. The size of the equation system (2) depends on the size of the circuit and on the grid resolution, which can be specified by the designer. To solve the linear equation system, a conjugate gradient method has been implemented. The results are only rough estimations of the temperature differences between transistors. After a successful heat simulation, the results are automatically inserted by the tool into the analog extract. A subsequent DC simulation of the circuit then shows the impact of the heat distribution. Based on the results, the designer is able to rate the influence on the respective modules of the simulated circuit and to choose another arrangement if required. For instance, the simulation of the presented power amplifier indicates an increase of the offset voltage of the output stage of the two-stage operational amplifier as a
result of the small distance to the p-channel output transistor of the power amplifier. In this case the offset voltage amounts to 11 μV; by simulating a modified arrangement of the modules, the offset voltage is reduced to 4 μV. The user interface of the heat simulation tool has been implemented in SKILL, the interpreter language of Cadence. The algorithms for calculating the heat distribution have been written in C++.
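The solver step can be illustrated compactly. Discretizing Eq. (1) yields the symmetric positive-definite system A·T = G of Eq. (2), which the conjugate gradient method solves without factorizing A. The sketch below does this for a 1D rod with fixed-temperature ends and a single heat source; the numbers are illustrative, not ALADIN's thermal model:

```python
def conjugate_gradient(matvec, g, n, tol=1e-20, max_iter=200):
    """Solve A*t = g for symmetric positive-definite A given as matvec."""
    t = [0.0] * n
    r = g[:]                        # residual for the zero initial guess
    p = r[:]
    rs = sum(x * x for x in r)
    for _ in range(max_iter):
        ap = matvec(p)
        alpha = rs / sum(pi * ai for pi, ai in zip(p, ap))
        t = [ti + alpha * pi for ti, pi in zip(t, p)]
        r = [ri - alpha * ai for ri, ai in zip(r, ap)]
        rs_new = sum(x * x for x in r)
        if rs_new < tol:
            break
        p = [ri + (rs_new / rs) * pi for ri, pi in zip(r, p)]
        rs = rs_new
    return t

N = 9                               # interior grid nodes of the rod
def laplacian_1d(v):                # A = tridiag(-1, 2, -1), ends held at T = 0
    return [2 * v[i] - (v[i - 1] if i > 0 else 0.0)
            - (v[i + 1] if i < N - 1 else 0.0) for i in range(N)]

g = [0.0] * N
g[4] = 1.0                          # one power-dissipating transistor
T = conjugate_gradient(laplacian_1d, g, N)
print(f"peak temperature rise: {max(T):.2f} (at the source)")
```

Conjugate gradients suit this problem because A only ever appears as a matrix-vector product, so the sparse grid structure never has to be stored densely.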
3 Design of the Linear Power Amplifier

The structure of the amplifier [10-12] is shown in Fig. 3. It uses two two-stage operational amplifiers and a one-stage preamplifier. The amplifier A1 drives the gate of the p-channel transistor M1 and pulls up the output. In a similar manner, the amplifier A2 drives the n-channel transistor M2 and pushes down the output. If the two amplifiers A1 and A2 have a small offset voltage, the transistors M1 and M2 are switched off in the quiescent state. The quiescent output current is controlled by the transistors M3–M6. If the difference is nearly equal to zero, the transistors M1 or M2 can be switched off. The crossover distortion is compensated by transistors M5 and M6. For example, the actual current control through M5 is achieved when A1 equalizes the source voltages of M3 and M5. The main limiting factor on the output swing is the large threshold voltage of M5 and M6 due to back-bias effects; if a low-threshold p-channel device were available, the output swing could be improved. Although the transistors M5 and M6 supply a fraction of the load current, their real usefulness lies in quiescent current control. The two amplifiers A1 (Fig. 4) and A2 (Fig. 5) use a two-stage architecture. The compensation is realized by a passgate, which operates as a resistor and a capacitor. The simulation results are listed in Table 2 and Table 3.
Fig. 3. Schematic of a Linear Power Amplifier
Fig. 4. Schematic of the two-stage operational amplifier A1

Fig. 5. Schematic of the two-stage operational amplifier A2

Fig. 6. Schematic of the one-stage operational amplifier A3
Table 1. Component sizes (W/L) of the operational amplifiers A1–A3

OpAmp A1: MP0 400/2, MP1 70/2, MP2 70/2, MP3 2600/2, MN0 400/2, MN1 400/2, MN2 4000/2, MN3 400/1.2, MN4 400/1.2, MN5 200/2
OpAmp A2: MP0 800/2, MP1 800/2, MP2 8000/2, MP3 400/1.2, MP4 400/1.2, MP5 400/2, MN0 200/2, MN1 35/2, MN2 35/2, MN3 1300/2
OpAmp A3: MP0 200/0.6, MP1 100/0.6, MP2 200/0.6, MP3 100/0.5, MP4 50/0.5, MP5 50/0.5, MP6 200/0.6, MP7 200/0.8, MP8 200/0.8, MP9 200/0.8, MP10 1000/0.4, MP11 1000/0.4, MN0 400/0.6, MN1 400/0.6, MN2 400/0.6, MN3 200/0.5, MN4 400/0.5, MN5 400/0.5
In Fig. 6 the structure of the preamplifier A3 is depicted. It uses a one-stage architecture. By introducing an additional current path within the cascode, it is possible to increase the positive output voltage swing; the output swing of the output stage is comparable to that of a two-stage operational amplifier. Due to the high-ohmic output resistance, the power supply rejection ratio and the noise behavior are much better than for a two-stage operational amplifier.
4 Layout Solutions

With the help of the ALADIN package and its new features, module generators have been written to create the layout of the linear power amplifier. The layout of this amplifier is depicted in Fig. 7; a quite dense layout results, with a core area of 530 μm × 430 μm. The preamplifier A3 is located in the upper left part of the layout, the operational amplifier A1 has been placed on the right-hand side, and below it the amplifier A2 is arranged. Between the amplifiers A2 and A3, the current mirrors formed by the transistors M3/M5 and M4/M6 are positioned. In the right-most position, the power transistors M1 and M2 can be seen. The presented tool for the calculation of current density can be used to optimize the created modules and structures so as to reduce the effect of electromigration. For instance, with the help of this tool, a high load at the inner bends of the conductors in the upper and lower right side of the power amplifier was determined; in order to reduce the impact, 135° bends have been chosen. The heat simulation allows analyzing the impact of the heat distribution: the designer can decide to choose other geometries for particular modules or another arrangement of the modules. The effort for a new generation of the layout is quite minimal, because it is done automatically by ALADIN. For example, based on the results of a heat simulation of the linear power amplifier, the distance between the power transistor M1 and the operational amplifier A1 was increased to reduce the mutual temperature influence. Compared to a manual layout, in which most of the checks can only be applied as rules of thumb, and therefore not accurately because of lack of time, the design safety is increased by this automatic solution.
Fig. 7. Layout of a Linear Power Amplifier
5 Simulation Results

In Tables 2–4, the post-layout simulation results of the operational amplifiers A1–A3 are listed. The unity-gain frequency of the amplifier A1 is 66 MHz, that of the amplifier A2 is 60 MHz, and that of the one-stage amplifier A3 is 100 MHz. The DC open-loop gain amounts to 84 dB for A1 and A2 and 68 dB for amplifier A3. The phase margin is 53° for amplifier A1 and 71° for amplifiers A2 and A3.

Table 2. Simulation results for operational amplifier A1 using a 14 pF load

VDD [V]                  2.7    3      3.3
DC Open Loop Gain [dB]   83     84     86
Bandwidth [MHz]          64     66     64
Phase [°]                55     53     52
Slewrate [V/μs]          84     90     97
Area [μm2]               337 x 181
As can be seen from the results in Tables 2 - 4, the operational amplifiers A1 - A3 work over the power supply range from 2.7 V up to 3.3 V.
Design of a Linear Power Amplifier with ±1.5V Power Supply Using ALADIN
Table 3. Simulation results for operational amplifier A2 using a 7 pF load

VDD [V]   DC Open Loop Gain [dB]   Bandwidth [MHz]   Phase [°]   Slewrate [V/μs]   Area [μm²]
2.7       83                       58                74          76
3         84                       60                71          74                492 x 146
3.3       85                       59                71          71
Fig. 8. Transient response of the power amplifier

Table 4. Simulation results for operational amplifier A3 using a 7 pF load

VDD [V]   DC Open Loop Gain [dB]   Bandwidth [MHz]   Phase [°]   Slewrate [V/μs]   Area [μm²]
2.7       67                       98                71          69
3         68                       100               71          78               210 x 94
3.3       69                       101               72          78
In Fig. 8 the transient response of the analog power amplifier is shown. The first curve is the input, the second is the output and the third is the resulting current through a 50 Ω output resistor. The power supply voltage is 3 V. The maximum sampling frequency amounts to more than 1 MHz.
6 Conclusions

In this paper, novel simulation and optimization aids of ALADIN for the design of analog integrated circuits have been presented. With knowledge of the current density and the heat distribution, the optimization of analog circuits is much more effective and reliable. In manual designs, these parameters are normally only estimated, due to time limitations. With these new features, the reliability and the quality of analog layout generation are improved. Furthermore, the design of a linear power amplifier has been presented. Due to the use of the two-stage operational amplifier, it is possible to build an analog power amplifier with an output voltage of ±0.7 V for a ±1.5 V power supply. As can be seen in Fig. 8, the effect of offset voltages is eliminated.
References
1. Wolf, M., Kleine, U., Hosticka, B.: A Novel Analog Module Generator Environment. In: Proc. European Design & Test Conference, March 1996, pp. 388–392 (1996)
2. Wolf, M., Kleine, U.: A Novel Design Assistant for Analog Circuits. In: Proc. ASP-DAC, February 1998, pp. 495–500 (1998)
3. Zhang, L., Kleine, U., Jiang, Y.: An Automated Design Tool for Analog Layouts. IEEE Trans. on VLSI 14(8), 881–894 (2006)
4. Zhang, L., Raut, R., Jiang, Y., Kleine, U.: Placement Algorithm in Analog-Layout Design. IEEE Trans. on CAD 25(10), 1889–1903 (2006)
5. Lienig, J., Jerke, G.: Electromigration-Aware Physical Design of Integrated Circuits. In: VLSID '05: Proc. Conference on Embedded Systems Design, January 2005, pp. 77–82 (2005)
6. Horowitz, M., Dutton, R.W.: Resistance Extraction from Mask Layout Data. IEEE Transactions on CAD CAD-2, 145–150 (1983)
7. Cheng, Y.-K., Rosenbaum, E., Kang, S.-M.: ETS-A: A New Electrothermal Simulator for CMOS VLSI Circuits. In: Proc. European Design & Test Conference, March 1996, pp. 560–570 (1996)
8. Tsai, C.-H., Kang, S.-M.: Fast Temperature Calculation for Transient Electrothermal Simulation by Mixed Frequency/Time Domain Thermal Model Reduction. In: DAC '00: Proc. Design Automation Conference, June 2000, pp. 750–755 (2000)
9. Lee, S.S., Allstot, D.J.: Electrothermal Simulation of Integrated Circuits. IEEE J. Solid-State Circuits, 1283–1293 (December 1993)
10. Gray, P.R., Meyer, R.G.: MOS Operational Amplifier Design – A Tutorial Overview. IEEE J. Solid-State Circuits SC-17, 969–982 (1982)
11. Fisher, J.A.: A High-Performance CMOS Power Amplifier. IEEE J. Solid-State Circuits SC-20, 1200–1205 (1985)
12. Lipka, B., Lindroos, J., Kleine, U.: Design of an Analog Power OpAmp with 1.5V and 45° Structures to Reduce the Effect of Electromigration. In: Kleinheubacher Tagung 2006, September 2006, Miltenberg (2006)
Settling Time Minimization of Operational Amplifiers Andrea Pugliese, Gregorio Cappuccino, and Giuseppe Cocorullo Department of Electronics, Computer Science and Systems, University of Calabria, Via P. Bucci, 42C, 87036-Rende (CS), Italy Tel: +39 984 494769 Fax: +39 984 494834 {a.pugliese,cappuccino,cocorullo}@deis.unical.it
Abstract. Settling time is one of the most important performance parameters for a whole class of amplifiers, such as those employed in switched-capacitor-based circuits and analog-to-digital converters. In this work, an analysis to predict and to minimize the settling time of amplifiers characterized by first-, second- and third-order system-wise behaviour is developed. The proposed method is very useful for design purposes. It allows amplifier poles to be placed directly in the complex plane, achieving the best-settled time response in accordance with the desired accuracy level.
1 Introduction

The time response of operational amplifiers (op-amps) plays a fundamental role in determining the speed of circuits such as switched-capacitor (SC) filters and analog-to-digital converters [1], [2]. In these designs, the main issue is to have amplifiers characterized by a step response that is both precise and fast, i.e. by a low steady-state error and a low settling time, respectively. This task requires a time-domain approach, a distinctly harder assignment than the optimization of other amplifier parameters, which is usually performed in the frequency domain. Analytic expressions of the steady-state error are generally simple to derive, since they depend only on the asymptotic behaviour of the system. On the contrary, owing to the non-linear dependence on the system pole and zero locations, settling time expressions are very complicated. As a consequence, the analysis and the minimization of the settling time become in most cases hard to accomplish. In the past, the minimization of the settling time of SC circuits was analyzed in [3]. In that paper, both frequency and time responses are derived, based on a second-order (two-pole) transfer-function description of the amplifier. Recent studies [4] highlight more generally the relationship between the settling time and the pole/zero placement of a generic second-order system, considering also the presence of a pole-zero couple (doublet), a common situation in real circuits. However, a second-order transfer function may not suffice to describe the dynamic behaviour of modern multistage op-amps. In many practical design cases, the employed op-amp is characterised by a third-order system behaviour. One example is the cascade configuration of three amplifier
N. Azemard and L. Svensson (Eds.): PATMOS 2007, LNCS 4644, pp. 107–116, 2007. © Springer-Verlag Berlin Heidelberg 2007
stages [5], a widely exploited scheme for satisfying the huge demand for high-performance op-amps in current low-voltage CMOS technologies. An accurate study of the dynamic settling error (DSE) of such systems can be found in [6]. In the same paper, the relationships between open-loop and closed-loop poles and zeros were also defined by means of step-response analysis. Based on these expressions, it is theoretically possible to find the open-loop pole location that minimizes the settling time. For this purpose, two possible criteria to reduce the settling time to some degree are reported in [6], but no suggestions about how to actually obtain the fastest response are given. The aim of the present work is to provide a method to minimize the small-signal settling time of a generic op-amp; slew-rate limitations are not considered, being unimportant in many SC applications [3]. In the next sections, a formal formulation of the settling time will be presented. Then the optimum pole positioning for achieving the minimum settling time (MST) will be obtained for amplifiers characterized by first-, second- and third-order transfer functions, respectively. Finally, the effectiveness of the proposed method will be demonstrated by a practical design example, and conclusions will be drawn.
2 Minimization of the Settling Time of a Generic Op-amp

The settling time is a commonly used figure of merit for the time response of op-amps. In general, it refers to the amplifier's capability of reaching a fixed steady-state output voltage value when a step signal is applied to its inputs. It can be defined as the time required by the output signal to reach and stay within a given error band, centred on the resulting steady-state output level, starting from the time instant of application of a small-signal input voltage step. For a generic feedback op-amp, the time response y(t) to an input step of amplitude A0 can be carried out by means of the inverse Laplace transformation:
y(t) = L^-1[ (A0/s) · G(s) ]   (1)
where G(s) is the closed-loop transfer function of the system. Usually, the resulting expression of y(t) is a strongly nonlinear function of the time t, whose parameters are the locations in the complex plane of the G(s) poles and zeros. When the system is stable, the time response (1) converges to its steady-state value y(∞), i.e. the limit lim_{t→∞} y(t) = y(∞) exists and is finite.
A direct measurement of the difference between the instantaneous value of y(t) and y(∞) is the DSE:
ξ(t) = ( y(∞) − y(t) ) / y(∞)   (2)
Settling Time Minimization of Operational Amplifiers
109
whose modulus Ξ(t) = |ξ(t)| can be conveniently written, by using the Laplace final-value theorem¹, as follows:

Ξ(t) = | 1 − L^-1[ (1/s) · (G(s)/G0) ] |   (3)
where G0 is the closed-loop DC gain of the op-amp. Then, in accordance with the definition given for the settling time at the beginning of this section, it follows that, for a given level of accuracy ψ, the settling time tS is the minimum interval needed to obtain a DSE (2) bounded in the range [−ψ, +ψ], i.e.
tS(ψ) = min{ t* : Ξ(t) ≤ ψ for all t ≥ t* }.   (4)
For instance, settling accuracies of 1%, 0.1% and 0.01% require that the modulus of the DSE (3) become less than or equal to -40 dB, -60 dB and -80 dB, respectively, for all t ≥ tS. It is worth pointing out that, for stable systems, tS always exists for any level of accuracy. In fact, for any ε > 0, however small, a time τ can be found such that |y(t) − y(∞)| < ε for all t ≥ τ. Furthermore, when (3) is a bijective (i.e. invertible) function, each of its points is in one-to-one correspondence with the time t, and the minimum interval (i.e. the settling time) required to achieve the level of accuracy ψ is tS = Ξ^-1(ψ). Unfortunately, the modulus of the DSE (3) is generally a transcendental function of the time t, and a closed-form expression for the settling time is practically feasible only in very simple cases such as first-order transfer functions. The achievement of the MST tSMIN for the system requires an accurate placement of the G(s) poles and zeros:
tSMIN(ψ) = min{ tS(ψ) over all G(s) }.   (5)
Actually, the transfer-function zeros of real internally compensated op-amps should not be treated as independent parameters. In fact, the compensation elements chosen to satisfy amplifier stability introduce dominant poles in the response, but dominant zeros as well [8]. These zeros affect the pole residues and may thus affect the time response even significantly, although they cannot be located in the complex plane independently of the poles by choosing the compensation network. Thus any choice aimed at obtaining a desired pole placement will automatically result in a modification of the zeros of the transfer function. From a practical point of view, however, the number of independent parameters to be considered in the time response of a system with zeros remains unchanged with respect to the case of an all-pole system, as does the minimization methodology presented in the following.
¹ The Laplace final-value theorem can be applied under the hypothesis that the function s·G(s) is analytic on the imaginary axis and in the right half of the s-plane (i.e. if s·G(s) does not have poles with zero or positive real part) [7].
2.1 First Order Response

In many practical cases, op-amp behaviour is characterized by a simple dominant-pole closed-loop dynamics, for example when feedforward compensation techniques are used to cancel the effect of high-frequency poles, as indicated in [9]. The small-signal behaviour of these amplifiers is well described by a first-order transfer function:
G(s) = G0 / (1 + s/p1)   (6)
where G0 is the DC gain and s = −p1 is the angular frequency of the single pole. The DSE (3) results in Ξ(t) = e^(−p1·t), for which a variable normalization is introduced that is also used in the rest of the paper. It consists in substituting the time t with a normalized time α·t (α > 0). In accordance with the time-scaling property of the Laplace transformation [4], the resulting normalized-time DSE ΞN(t), the settling time TS and the MST TSMIN refer to a normalized-frequency s/α representation of the system, whose equivalent transfer function is GN(s) = G(α·s). Then ΞN(t), TS and TSMIN are defined as in (2), (4) and (5), respectively, but with fewer degrees of freedom. In fact, once the time-scaling factor α is chosen equal to the real part of the dominant-pole angular frequency, the system GN(s) always has a pole with real angular frequency equal to −1 rad/sec, independently of the original pole locations in G(s). It is worth pointing out, however, that the DSE (3) can be reconstructed directly from ΞN(t) by means of the de-normalization procedure Ξ(t/α) = ΞN(t). Similarly, the MST tSMIN of the original system is given by tSMIN = TSMIN/α. It follows that the settling properties of a generic op-amp can be fully analyzed by using its normalized representation GN(s). For a first-order system, GN(s) becomes GN(s) = G0/(1 + s) and the only DSE function is ΞN(t) = e^(−t). In this case, a closed-form relationship between the MST TSMIN and the level of accuracy ψ can simply be carried out:

TSMIN = −ln(ψ).   (7)
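A quick numeric check of (7); this is an illustrative sketch, not code from the paper.

```python
import math

def t_smin_first_order(psi):
    """Eq. (7): normalized MST of a one-pole system, T_SMIN = -ln(psi)."""
    return -math.log(psi)

print(round(t_smin_first_order(0.01), 3))   # 4.605 for 1% (-40 dB) accuracy
print(round(t_smin_first_order(0.02), 2))   # 3.91 for 2% (-34 dB) accuracy
# De-normalized: for a single pole at -p1 rad/sec, t_SMIN = T_SMIN / p1.
```

The 2% value reappears as the first-order MST in the application example of Section 3.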
Expression (7) is always valid since, for a one-pole system, ψ ≤ 0 dB always holds.

2.2 Second Order Response
The analysis of the second-order system is extremely useful for design purposes. In particular, in feedback configuration, as a consequence of the internal frequency compensation usually adopted, the dynamics of many op-amps is characterized by a dominant couple of complex poles, even though the amplifier open-loop transfer
function is generally of a high order. The second-order transfer function of a generic op-amp in closed-loop configuration, under the assumption of two complex poles and no zeros, is:
G(s) = G0·ωn² / ( ωn² + 2ζωn·s + s² )   (8)
where G0 is the DC gain and ωn and ζ (0 < ζ < 1) are the natural frequency and the damping factor of the complex-conjugate poles, respectively. Choosing the real part of the poles (i.e. the ζωn product) as scaling factor, a ζ-parametric function GN(s) = G0 / (1 + 2ζ²s + ζ²s²) can be obtained. The corresponding (normalized) DSE (3):
ΞN(t) = ( e^(−t) / √(1−ζ²) ) · | sin( (√(1−ζ²)/ζ)·t + φ ) |,   φ = sin^-1 √(1−ζ²)   (9)
is a transcendental function of the time t and, in this case, it is not possible to carry out an analytic expression for TSMIN as was done for the first-order system. In fact, owing to the oscillations of the response, a given settling-error level may be reached at two or more different time instants, i.e. the DSE (9) is not a bijective function.
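Although no closed-form TSMIN exists here, definition (4) is easy to evaluate numerically from the DSE (9). The sketch below is an illustrative implementation, not code from the paper.

```python
import numpy as np

def dse_second_order(t, zeta):
    """Normalized DSE (9): both complex poles have real part -1 rad/sec."""
    wd = np.sqrt(1.0 - zeta**2) / zeta            # normalized damped frequency
    phi = np.arcsin(np.sqrt(1.0 - zeta**2))
    return np.exp(-t) / np.sqrt(1.0 - zeta**2) * np.abs(np.sin(wd * t + phi))

def settling_time(zeta, psi, t_max=30.0, n=200_000):
    """Definition (4): smallest T such that the DSE stays within psi for t >= T."""
    t = np.linspace(0.0, t_max, n)
    outside = np.nonzero(dse_second_order(t, zeta) > psi)[0]
    return 0.0 if outside.size == 0 else t[min(outside[-1] + 1, n - 1)]
```

For example, `settling_time(1/np.sqrt(2), 0.01)` returns the normalized TS that would be read from a plot like Fig. 1a at ψ = -40 dB.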
Fig. 1. Normalized settling time TS of second-order systems (continuous line) obtained from DSE functions (9) (dashed lines): a) when ζ = 1/√2; b) when ζ = 0.65, 1/√2, 0.75
However, it can be "inverted" to find the settling time TS graphically, i.e. by drawing all DSE points satisfying (4), as done in Fig. 1a for a system with ζ = 1/√2, for example. It is worth noting that the settling time TS is a discontinuous function. In particular, the first discontinuity, occurring at ψ = -27.15 dB, arises when the
maximum overshoot y_max = e^(−πζ/√(1−ζ²)) of the step response y(t) [7] is equal to -27.15 dB (4.3%). Moreover, the plot in Fig. 1b shows how the settling time TS changes when the damping factors ζ = 0.65, ζ = 1/√2 and ζ = 0.75 are considered. Evidently, finding the MST requires that all DSE functions (9) for 0 < ζ < 1 be considered. Actually, for
ψ = -27.15 dB the settling time cannot be reduced, however close to ζ = 1/√2 the damping factor is chosen. In fact, to each accuracy level ψ corresponds one and only one damping factor value ζopt:
ζopt(ψ) = −ln(ψ) / √( ln²(ψ) + π² ).   (10)
ζopt guarantees the MST for ψ and is obtained by inverting the equation ψ = y_max.
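A numeric check of (10) (an illustrative sketch, not code from the paper) confirms that it inverts ψ = y_max exactly and reproduces the ζopt = 0.78 used later in the design example for ψ = -34 dB:

```python
import math

def zeta_opt(psi):
    """Eq. (10): MST damping factor for accuracy level psi (linear units)."""
    return -math.log(psi) / math.sqrt(math.log(psi) ** 2 + math.pi ** 2)

def y_max(zeta):
    """Maximum overshoot of the second-order step response."""
    return math.exp(-math.pi * zeta / math.sqrt(1.0 - zeta ** 2))

psi = 0.02                          # -34 dB, the accuracy used in Section 3
z = zeta_opt(psi)
print(round(z, 2))                  # prints 0.78
assert abs(y_max(z) - psi) < 1e-12  # (10) is the exact inverse of psi = y_max
```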
Fig. 2. a) Normalized MST TSMIN and b) damping factor ζopt of second-order systems
Plots of the MST TSMIN and of the corresponding MST damping factor ζopt (10) are reported in Fig. 2. Differently from previous approaches, these plots lead the designer directly to the best-settled second-order response on the basis of the maximum DSE that can be tolerated. The plots of Fig. 2 are therefore a very useful tool for achieving fast settling behaviour in many real op-amps.

2.3 Third Order Response
Many modern op-amps, such as the low-voltage cascade topologies [5], are characterised by a third-order system-wise behaviour. For a generic op-amp, the third-order closed-loop transfer function, under the assumption of a pole at s = −p1, two complex poles with natural frequency ωn and damping factor ζ, and no zeros, is:
G(s) = G0·ωn² / [ (1 + s/p1)·( ωn² + 2ζωn·s + s² ) ]   (11)
Choosing the ζωn product (i.e. the complex-pole real part, supposed dominant without loss of generality) as scaling factor, ρ = p1/(ζωn) is the normalized real pole and

GN(s) = G0 / [ (1 + s/ρ)·(1 + 2ζ²s + ζ²s²) ]

is the two-parameter transfer function. The corresponding (normalized) DSE (3) is:
ΞN(t) = | e^(−ρt) / (ρ²ζ² − 2ρζ² + 1) + [ ρζ / ( √(1−ζ²) · √(ρ²ζ² − 2ρζ² + 1) ) ] · e^(−t) · sin( (√(1−ζ²)/ζ)·t + φ ) |,
φ = sin^-1 √(1−ζ²) − sin^-1( √(1−ζ²) / √(ρ²ζ² − 2ρζ² + 1) )   (12)
Fig. 3. Normalized settling time TS of third-order systems (continuous line) obtained from DSE functions (12) (dashed lines): a) when ζ = 1/√2 and ρ = 0.98÷1.06; b) when ζ = 1/√2 and ρ = 1.08; c) when ζ = 1/√2 and ρ = 1.06÷1.1; d) when ρ = 1.08 and ζ = 0.65÷0.75
Depending on the ρ and ζ values, some approximations are possible when one exponential term of (12) is dominant over the other. In fact, for all t ≥ 0 and 0 < ζ < 1,
Fig. 4. a) Normalized MST TSMIN, b) normalized real pole ρopt and damping factor ζopt of third-order systems
when ρ << 1 the DSE (12) degenerates to the (non-normalized) first-order system with pole at s = −ρ, ΞN(t) ≈ e^(−ρt), as it does for all ρ when ζ ≈ 0. Similarly, for all t ≥ 0 and 0 < ζ < 1, when ρ > 10 it degenerates to the DSE (9) of a (normalized) second-order system. Excluding these degenerate cases, however, the DSE (12) must be considered as it is in order to find the MST of the system (11). Then, as done for second-order systems, a plot of the settling time TS can be obtained by drawing all DSE points satisfying (4), taking into account variations owing to both independent parameters ρ and ζ. For a system with ζ = 1/√2, Fig. 3 shows that: 1) the settling time TS improves with response oscillations around its steady-state value (i.e. for ρ > 1) (Fig. 3a); 2) when ρ = 1.08, for instance, the discontinuity occurring at ψ = -59 dB arises when the first two peaks of the step response y(t) are equal to the upper and lower borders of the accuracy-level error band [−ψ, +ψ], respectively (Fig. 3b); 3) for the accuracy level ψ = -59 dB, TS cannot be reduced, however close to ρ = 1.08 the normalized real pole ρ is chosen (Fig. 3b); and, similarly, 4) TS cannot be reduced however close to ζ = 1/√2 the damping factor is chosen for ρ = 1.08. In fact, to each accuracy level ψ corresponds the couple of damping factor and real pole (ρopt, ζopt) that guarantees the MST for that accuracy level. However, for third-order systems the ψ-ρ and ψ-ζ relationships are transcendental functions and an analytic expression such as (10) cannot be derived. Plots of the MST TSMIN and of the corresponding MST normalized real pole ρopt and damping factor ζopt, obtained by means of numerical simulations, are reported in Fig. 4. In current op-amp design practice, especially for low-voltage discrete-time applications, these plots are very useful to locate the poles allowing the best-settled time response to be reached. In the following, the usefulness of the proposed method is demonstrated by an application example.
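The numerical search behind plots like Fig. 4 can be sketched as a coarse grid search over the normalized DSE (12). This is an illustrative sketch (written for ρ ≥ 1, the region of interest here), not the authors' code.

```python
import numpy as np

def dse_third_order(t, rho, zeta):
    """Normalized DSE (12): real pole at -rho, complex poles with real part -1."""
    d = rho**2 * zeta**2 - 2.0 * rho * zeta**2 + 1.0
    wd = np.sqrt(1.0 - zeta**2) / zeta
    phi = (np.arcsin(np.sqrt(1.0 - zeta**2))
           - np.arcsin(np.sqrt(1.0 - zeta**2) / np.sqrt(d)))
    return np.abs(np.exp(-rho * t) / d
                  + rho * zeta / np.sqrt((1.0 - zeta**2) * d)
                  * np.exp(-t) * np.sin(wd * t + phi))

def settling_time(rho, zeta, psi, t_max=30.0, n=200_000):
    """Definition (4) evaluated on a dense time grid."""
    t = np.linspace(0.0, t_max, n)
    outside = np.nonzero(dse_third_order(t, rho, zeta) > psi)[0]
    return 0.0 if outside.size == 0 else t[min(outside[-1] + 1, n - 1)]

def grid_search(psi, rhos, zetas):
    """Return (T_S, rho, zeta) minimizing the settling time on the grid."""
    return min((settling_time(r, z, psi), r, z) for r in rhos for z in zetas)
```

With ψ = 0.02 and a grid around ρ ≈ 1.3, ζ ≈ 0.45, the search lands near the (ρopt, ζopt) values read from Fig. 4.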
Fig. 5. Comparison of the MST TSMIN of first-, second- and third-order systems

Fig. 6. Comparison of a) MST step responses y(t) and b) DSE functions of first-, second- and third-order systems
3 Application Example

At equal time-scaling factor, the MST TSMIN of first-, second- and third-order systems is compared in Fig. 5. It is worth pointing out that third-order op-amps can be designed to achieve the fastest time response with respect to first- and second-order amplifiers, for any accuracy level. Consider, for instance, that a settling accuracy level of ψ = -34 dB (i.e. 2%) is required and has to be reached in tSMIN = 1 μs by a feedback op-amp in response to a unity-step signal. The plots of Fig. 5 directly allow the MST TSMIN to be evaluated as equal to 3.91, 2.81 and 2.08 for op-amps characterized by first-, second- and third-order transfer functions, respectively, with closed-loop poles as resulting from the plot of Fig. 2 (for the second-order system, ζopt = 0.78) and Fig. 4 (for the third-order system, ρopt = 1.35 and ζopt = 0.45). The plots of step responses and DSE functions of Fig. 6 confirm the settling-time performance of the three designs, as expected from Fig. 5. Then, the de-normalizing operation allows the absolute positions of the poles to be calculated in order to guarantee
tSMIN = 1 μs for the three designs (i.e. p1 = 3.914·10^6 rad/sec, ζωn = 2.812·10^6 rad/sec and ζωn = 2.077·10^6 rad/sec for the first-, second- and third-order op-amps, respectively).
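The de-normalizing operation is simple arithmetic; as a sketch, using the TSMIN values read from Fig. 5:

```python
# alpha = T_SMIN / t_SMIN is the time-scaling factor, i.e. the real part
# (in rad/sec) of the dominant pole(s) guaranteeing t_SMIN = 1 us.
t_smin = 1e-6                                  # required settling time
for order, T_norm in [(1, 3.91), (2, 2.81), (3, 2.08)]:
    alpha = T_norm / t_smin
    print(f"order {order}: pole real part = {alpha:.2e} rad/sec")
```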
4 Conclusions and Future Work

The settling-time minimization of a generic op-amp described by its transfer function has been analyzed. The MST, and the pole locations achieving it, have been obtained for first-, second- and third-order responses. The proposed method is very useful for op-amp design purposes, because the designer can directly place the system poles in the complex plane to achieve the best-settled time response, according to the accuracy level required by the application. In circuit practice, information about the optimum closed-loop pole positioning can be helpfully exploited for sizing the frequency-compensation network as well as for choosing other important amplifier parameters such as stage transconductances. Obviously, an optimally settled op-amp cannot be exactly realized, owing to nominal-value deviations of circuit elements caused by unavoidable variations of integrated-circuit manufacturing processes. However, sub-optimal solutions are also easily identified by the proposed procedure when, for instance, a worst-case settling accuracy level is considered. Finally, the development of relationships linking the gain-bandwidth product of first-, second- and third-order systems optimized to achieve the MST is currently under investigation.
References
1. Choi, T.C., Kaneshiro, R.T., Brodersen, R.W., Gray, P.R., Jett, W.B., Wilcox, M.: High-Frequency CMOS Switched-Capacitor Filters for Communications Application. IEEE J. Solid-State Circuits SC-18, 652–664 (1983)
2. Geerts, Y., Marques, A.M., Steyaert, M.S.J., Sansen, W.: A 3.3-V, 15-bit Delta-Sigma ADC with a Signal Bandwidth of 1.1 MHz for ADSL Applications. IEEE J. Solid-State Circuits 34, 927–936 (1999)
3. Yang, H.C., Allstot, D.J.: Considerations for Fast Settling Operational Amplifiers. IEEE Transactions on Circuits and Systems 37(3), 326–334 (1990)
4. Schlarmann, M.E., Geiger, R.L.: Relationship Between Amplifier Settling Time and Pole-Zero Placements for Second-Order Systems. In: IEEE Proc. Midwest Symposium on Circuits and Systems, vol. 1, pp. 54–59. IEEE, Los Alamitos (2000)
5. Eschauzier, R.G.H., Huijsing, J.H.: Frequency Compensation Techniques for Low-Power Operational Amplifiers. Kluwer, Boston, MA (1995)
6. Marques, A., Geerts, Y., Steyaert, M., Sansen, W.: Settling Time Analysis of Third Order Systems. In: IEEE Int. Conf. on Electronics, Circuits and Systems, 1998, vol. 2, pp. 505–508 (1998)
7. Kuo, B.C.: Automatic Control Systems. Prentice-Hall, Englewood Cliffs, NJ (1975)
8. Leung, K.N., Mok, P.K.T.: Analysis of Multistage Amplifier-Frequency Compensation. IEEE Trans. Circuits Syst. I Fundam. Theory Appl. 48 (September 2001)
9. Thandri, B.K., Silva-Martinez, J.: A Robust Feedforward Compensation Scheme for Multistage Operational Transconductance Amplifiers with No Miller Capacitors. IEEE J. Solid-State Circuits 38(2), 237–243 (2003)
Low-Voltage Low-Power Curvature-Corrected Voltage Reference Circuit Using DTMOSTs Cosmin Popa Faculty of Electronics, Telecommunications and Information Technology Bucharest, 1-3 Iuliu Maniu, Bucharest, Romania
Abstract. A low-voltage low-power voltage reference realized in 0.35 μm CMOS technology will be presented. In order to achieve two important goals of low-power high-performance bandgap references realized in newer CMOS technologies (a low supply voltage and a small value of the temperature coefficient), a modified structure using dynamic MOS transistors (equivalent to a virtual lowering of the material bandgap) and a square-root curvature correction will be implemented. The accuracy of the output voltage will be increased using an Offset Voltage Follower Block as PTAT voltage generator, with the advantage that matched resistors are replaced by matched transistors. The low-power operation of the circuit will be achieved by using exclusively subthreshold-operated MOS devices. Experimental results confirm the theoretical estimations, showing a minimum supply voltage of 2.5 V and a temperature coefficient of about 9.4 ppm/K over an extended temperature range (173 K < T < 423 K).
1 Introduction

Voltage references are widely used in applications such as A/D and D/A converters, data acquisition systems and smart sensors. As the accuracy and precision of these circuits increase, the requirements for the temperature stability of the voltage references have also increased. In CMOS systems, two important trends can be distinguished: one focusing on high-performance operation and the other on low-power low-voltage operation, the latter associated with the requirement of compatibility with the latest CMOS technologies. With the system trend toward high performance, the system typically operates on the nominal supply voltage, while all the circuits must reach high performance. Because of the superior performance (due to the better parameter matching) of bipolar voltage references with respect to circuits using MOS transistors, the first high-performance voltage references were implemented in bipolar technology. However, because of the nonlinear temperature dependence of the base-emitter voltage [1], there exists a theoretical limit for improving the temperature stability of a simple BGR. First-order compensated bandgap references, having a temperature coefficient (TCR) greater than 50 ppm/K and useful only for applications that do not require a very good accuracy of the reference voltage, have been reported
N. Azemard and L. Svensson (Eds.): PATMOS 2007, LNCS 4644, pp. 117–124, 2007. © Springer-Verlag Berlin Heidelberg 2007
in [2] - [5]. In order to improve the temperature behavior of the bandgap reference, many correction techniques have been developed, resulting in so-called curvature-corrected bandgap references. A first way to improve the TCR of a bandgap reference is to correct the nonlinear temperature dependence of the base-emitter voltage by a suitable biasing of the bipolar transistor. Thus, biasing at a PTAT^3 + PTAT^4 collector current was proposed in [1], allowing a simulated temperature coefficient in the range of 4 - 8 ppm/K (depending on the supply voltage). The curvature-correction technique from [6], based on the polarization of the bipolar transistor at a PTAT^n current, reports a TCR of about 20 ppm/K without trimming and chip temperature stabilization. Another possibility to improve the temperature dependence of a BGR is to compensate the nonlinear temperature characteristic of the base-emitter voltage by a correction voltage added to the basic reference voltage [7], or by a correction current added to the PTAT current [8], [9]. The voltage reference presented in [7] has a relatively large temperature coefficient, about 30 ppm/K over a limited temperature range, caused by MOS parameter mismatches. The compensation technique based on the correction current decreases the temperature coefficient below 10 ppm/K, but it has the disadvantages of a large occupied silicon area and of incompatibility with CMOS processes. When designing CMOS bandgap references, the required bipolar devices are realized as parasitic vertical or lateral transistors available in CMOS technology. The result is a small degradation of the temperature behavior of the circuit, caused by the poorer matching of MOS device parameters with respect to those of bipolar transistors. With the system trend toward low-power operation, the supply voltage is typically much lower than the nominal supply voltage for the process and is often dictated by the minimum voltage of a battery.
In this case, none of the previous implementations is possible because of the large value of 0.6 - 0.7 V required for the base-emitter voltage, making a full CMOS realization, i.e. one without any parasitic bipolar transistors, necessary. The problem of the large threshold voltage can be overcome by a virtual reduction of VT, using DTMOS (Dynamic Threshold MOS) transistors. In this paper, a low-power bandgap reference with a square-root curvature correction will be presented. The circuit is implemented in 0.35 μm CMOS technology using MOS transistors working in weak inversion and DTMOSTs, resulting in a much smaller power consumption and a lower supply voltage with respect to previously reported implementations of voltage reference circuits.
2 Theoretical Analysis

The proposed circuit represents a new approach to the bandgap reference, using MOS transistors in weak inversion to replace the classical bipolar transistor. The disadvantage of the poorer parameter matching of MOS transistors with respect to bipolar transistors is strongly compensated by a much smaller occupied silicon area and by the small supply voltage values that can be achieved using DTMOSTs.
2.1 The Temperature Behavior of the New CMOS Weak Inversion Bandgap Reference
The proposed weak-inversion circuit represents a new extension of a classical bipolar bandgap reference, realized by replacing the bipolar transistor with a MOS transistor working in weak inversion, with the previously mentioned advantages. Thus, because the reference voltage is the sum of a gate-source voltage and a PTAT voltage, it will have the following expression:

V_REF(T) = V_GS(T) − T · ( dV_GS(T)/dT )|_(T=T0) ,   (1)
where T0 is the reference temperature. Considering subthreshold operation of the MOS transistor, its logarithmic law can be expressed as:

V_GS(T) = V_T(T) + n·V_t · ln[ I_D(T) / ( (W/L)·I_D0 ) ] ,   (2)
V_T being the threshold voltage, with the following temperature dependence (neglecting the bulk effect):

V_T(T) = V_FB + 2Φ_F(T) .   (3)
Φ_F is the Fermi potential and it has a linear temperature dependence:

Φ_F(T) = V_t · ln( N_A / A ) + E_G / 2 .   (4)
n is a constant parameter, V_t is the thermal voltage, E_G is the silicon bandgap energy (considered, in a first-order analysis, independent of temperature), I_D(T) is the temperature-dependent drain current and I_D0 is a parameter with the following expression (given by the continuity between the weak inversion and saturation regions):
I_D0 = 2·μ_n(p)·C_ox·(n·V_t)² / e² .    (5)
μ_n(p) is the carrier mobility, with a temperature dependence given by:

μ_n(p) = B·T^(−γ) .    (6)
From the previous relations, considering a temperature dependence of the drain current expressed by:

I_D(T) = C·T^α ,    (7)
the reference voltage expression results:

V_REF(T) = V_FB + E_G + n(α + γ − 2)·(kT/q)·[ln(T/T0) − 1] .    (8)
120
C. Popa
So, in order to cancel the main temperature dependence of the reference voltage, it is necessary that α = 2 − γ = 0.5, for the usual value γ = 1.5. In this case, only a small temperature dependence remains, having the following causes: a. the silicon bandgap energy variations (9); b. the imperfect matching of the MOS transistor parameters; c. second-order effects; d. the approximate values of γ and α.

E_G(T) = a − b·T − c·T² .    (9)
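The cancellation condition is easy to check numerically. The sketch below evaluates (8) with illustrative values V_FB + E_G ≈ 1.2 V and n = 1.3 (assumptions for illustration, not values from the paper): with α = 2 − γ the temperature-dependent term vanishes exactly, while a slightly wrong α leaves a small residual drift.

```python
import numpy as np

k_q = 8.617e-5                 # Boltzmann constant over electron charge, in V/K
V0, n, T0 = 1.2, 1.3, 300.0    # assumed V_FB + E_G (volts), n, and reference temperature

def v_ref(T, alpha, gamma):
    # Eq. (8): V_REF(T) = V_FB + E_G + n*(alpha+gamma-2)*(kT/q)*(ln(T/T0) - 1)
    return V0 + n * (alpha + gamma - 2.0) * k_q * T * (np.log(T / T0) - 1.0)

T = np.linspace(233.0, 373.0, 141)
exact = v_ref(T, alpha=0.5, gamma=1.5)   # alpha = 2 - gamma: the coefficient is zero
off   = v_ref(T, alpha=0.6, gamma=1.5)   # mismatched alpha: a residual drift remains

print(exact.max() - exact.min())   # 0.0 -- fully compensated in this first-order model
print(off.max() - off.min())       # ~1e-4 V of residual variation over the range
```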
So, in order to correct the characteristic (8) of the MOS bandgap reference, it is necessary to implement a current with a C·T^0.5 temperature dependence.

2.2 The Implementation of the C·T^0.5 Block
The CMOS implementation of the C·T^0.5 block is presented in Fig. 1. All transistors work in weak inversion. The left part of this block represents an auxiliary first-order compensated voltage reference, used to implement two currents: I_0, which is, in a first-order analysis, independent of temperature, and I_1, with a linear dependence on temperature. The self-biasing of the current source assures a low sensitivity to supply voltage variations. It is not possible to implement a cascode self-biased current source because of the low-voltage requirements of the entire circuit.
Fig. 1. The CMOS implementation of the C·T^0.5 block
2.3 Low-Power Operation Bandgap Reference Using DTMOSTs
A very important trend in the design of low-voltage low-power analog and/or digital circuits is the continuous decrease of the supply voltage. When correctly designed, bandgap reference circuits give an output voltage that is somewhat higher than the material bandgap extrapolated to 0 K. Typical bandgap references need at least an
extra voltage of 100–300 mV for proper operation. The minimum supply voltage for the typical bandgap circuit is then approximately:

V_DDmin = V_REF + 300 mV ≅ 1.5 V .    (10)
With newer CMOS processes, the nominal supply voltage decreases. In 0.5 μm CMOS processes, the nominal supply voltage is 3.3 V; for 0.25 μm CMOS it is 2.5 V, and for 0.15 μm CMOS it is 1.5 V, and it will continue to decrease with each newer process generation. In conclusion, for these processes, the traditional bandgap reference circuit cannot be used because the supply voltage is too small. In order to obtain proper operation of the classical bandgap reference at low supply voltages, there are several possible solutions: • Using a low-bandgap material (germanium) to realize low-bandgap diodes. This is very expensive and not available in standard CMOS processes; • Using a fraction of the voltage across the diodes, by resistive subdivision. The disadvantage is a considerable area consumption because of the large-value resistors imposed by low-power operation; • Virtually lowering the material bandgap using an electrostatic field, by means of the dynamic threshold MOS transistor (DTMOST); this approach is used in this section. At the layout level, the DTMOST is a MOS transistor with an interconnected well and gate. A cross section of a DTMOST is presented in Fig. 2.
Fig. 2. Cross section of a DTMOST
Noting that from the outside of the device we only see the externally applied V_GS, the apparent material bandgap of the DTMOST will be lower than the silicon bandgap of a classical MOS transistor [10]. Its temperature dependence can be approximated by a constant (the apparent material bandgap extrapolated to 0 K) and a term linearly dependent on temperature. Using diodes with a low apparent material bandgap (implemented with DTMOSTs) in the CMOS reference circuit, the result will be a bandgap reference with a lower output voltage and a lower required supply voltage.
2.4 The Implementation of PTAT Voltage Generator Using OVF (Offset Voltage Follower) Block
The main cause limiting the reference voltage accuracy is the mismatch of the resistors. In order to increase the accuracy of the output voltage, a non-conventional way of implementing the PTAT voltage [10] is presented in Fig. 3.

Fig. 3. Low-voltage bandgap reference using DTMOSTs
Fig. 4. CMOS implementation of the low-power weak inversion DTMOST bandgap reference
The circuit on the right side of Fig. 3 represents an "Offset Voltage Follower" with a built-in PTAT offset voltage. The advantage of this circuit with respect to a classical PTAT voltage generator is that matched resistors are replaced by matched transistors [10]. The output voltage of the offset voltage follower is:
V_PTAT = n·V_t·ln(M·N) .    (13)
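As a quick sanity check of (13), the sketch below evaluates V_PTAT for assumed values n = 1.3 and M = N = 8 (illustrative, not the paper's device ratios):

```python
import math

k_q = 8.617e-5                            # Boltzmann constant over charge, in V/K

def v_ptat(T, n=1.3, M=8, N=8):
    # Eq. (13): V_PTAT = n * Vt * ln(M*N), with Vt = kT/q the thermal voltage
    return n * (k_q * T) * math.log(M * N)

print(round(v_ptat(300.0), 4))                  # ~0.14 V at room temperature
print(round(v_ptat(310.0) - v_ptat(300.0), 5))  # PTAT slope: ~4.7 mV per 10 K
```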
2.5 CMOS Implementation of the Low-Voltage Low-Power Weak Inversion DTMOST Bandgap Reference
The previously described bandgap reference is implemented in 0.35 μm CMOS technology. The circuit diagram is presented in Fig. 4.
3 Experimental Results

The SPICE simulation of V_REF(T), based on the previously mentioned technology parameters, is presented in Fig. 5.
Fig. 5. SPICE simulation of V_REF(T) for the square-root curvature-corrected bandgap reference

Fig. 6. I_D(V_GS) for a DTMOST

Fig. 7. I_D(V_GS) for an NMOST
The supply voltage is V_DD = 2.5 V. The most important MOS parameters used in the previous simulations are V_Tn = 0.4 V and V_Tp = −0.5 V. The simulated temperature coefficient of the voltage reference is TCR = 9.4 ppm/K for an extended temperature range of 173 K < T < 423 K. The reduction of the minimal supply voltage of the circuit as a result of decreasing the threshold voltage by using DTMOSTs is shown in Figs. 6 and 7.
4 Conclusions

A CMOS bandgap reference designed for low-voltage low-power operation was presented. In order to reach this goal, classical diodes are replaced by dynamic threshold MOS transistors, which present the advantage of a virtually lower material bandgap energy. The novelty of the proposed superior-order curvature-correction technique lies in the use of an Offset Voltage Follower block for implementing the linear correction, with the advantages of increased accuracy and a much smaller silicon area, and, additionally, in an original superior-order correction based on a square-root circuit. The voltage reference was implemented in 0.35 μm CMOS technology, reporting a minimum supply voltage of 2.5 V and a temperature coefficient of about 9.4 ppm/K for an extended temperature range of 173 K < T < 423 K.
References

1. Filanovsky, I.M., Chan, Y.F.: BiCMOS Cascaded Bandgap Voltage Reference. In: IEEE 39th Midwest Symposium on Circuits and Systems, pp. 943–946 (1996)
2. Ferro, M., Salerno, F., Castello, R.: A Floating CMOS Bandgap Voltage Reference for Differential Applications. IEEE Journal of Solid-State Circuits, 690–697 (1989)
3. Tham, K.M., Nagaraj, K.: A Low Supply Voltage High PSRR Voltage Reference in CMOS Process. IEEE Journal of Solid-State Circuits, 586–590 (1995)
4. Vermaas, L.L.G.: A Bandgap Voltage Reference Using Digital CMOS Process. In: IEEE International Conference on Electronics, Circuits and Systems, pp. 303–306 (1998)
5. Banba, H.: A CMOS Bandgap Reference Circuit with Sub-1-V Operation. IEEE Journal of Solid-State Circuits, 670–674 (1999)
6. Popa, C.: Curvature-Compensated Bandgap Reference. In: The 13th International Conference on Control System and Computer Science, University "Politehnica" of Bucharest, pp. 540–543 (2001)
7. Salminen, O.: The Higher Order Temperature Compensation of Bandgap Voltage References. In: IEEE International Symposium on Circuits and Systems, pp. 1388–1391 (1992)
8. Gunawan, M.: A Curvature-Corrected Low-Voltage Bandgap Reference. IEEE Journal of Solid-State Circuits, 667–670 (1993)
9. Lee, I., Kim, G., Kim, W.: Exponential Curvature-Compensated BiCMOS Bandgap References. IEEE Journal of Solid-State Circuits, 1396–1403 (1994)
10. Annema, A.J.: Low-Power Bandgap References Featuring DTMOSTs. Philips Research Laboratories, The Netherlands
Computation of Joint Timing Yield of Sequential Networks Considering Process Variations

Amit Goel, Sarvesh Bhardwaj, Praveen Ghanta, and Sarma Vrudhula

Department of Electrical Engineering, Department of Computer Science and Engineering, and Consortium for Embedded Systems, Arizona State University, Tempe, AZ 85281
Abstract. This paper presents a framework for estimating the timing yield of sequential networks in the presence of process variations. We present an accurate method for characterizing parameters of the sequential elements in the network, such as the setup time, hold time, and clock-to-output delay. Using these models together with models of interconnect delays, gate delays, and clock skews, we perform statistical timing analysis of the combinational blocks in the circuit. The result of the timing analysis is a set of constraints involving random process variables that the network has to satisfy jointly in order to work correctly. We compute the joint yield of all the constraints to estimate the yield of the entire network. The proposed method provides a speedup of up to 400× compared to 10,000 Monte Carlo simulations, with an average error of less than 1% in the mean and 5% in the standard deviation.
1 Introduction
The ultra-large scale of integration in current process technologies has enabled the design of complex, high-performance circuits. However, process variations due to the aggressive reduction of transistor feature sizes can have a significant impact on the timing yield of the manufactured circuits [3, 10]. Hence, it has become absolutely necessary to estimate the timing yield of a design before it is manufactured, in order to perform design optimizations that improve the timing yield. The estimation of the timing yield of a large design consisting of various library cells requires: (1) accurate characterization of the different components in the circuit as functions of the underlying process parameters, (2) an efficient method for analyzing the yield of the various sub-blocks in the design, and (3) a method for computing the yield of the entire design based on the joint yield of each individual block. To this end, this paper presents a framework for efficiently analyzing the timing yield of large sequential circuits. With critical dimensions reaching 45 nm, process variability has become the primary cause of concern during the design of high-performance circuits. These variations can result in a large spread in the performance and leakage power of the manufactured circuits [3, 10]. Process variations can be broadly classified into two categories: (1) inter-die variations (between different chips), and

N. Azemard and L. Svensson (Eds.): PATMOS 2007, LNCS 4644, pp. 125–137, 2007.
© Springer-Verlag Berlin Heidelberg 2007
126
A. Goel et al.
(2) intra-die variations (on the same chip). The traditional static timing analysis (STA) technique of evaluating a design at multiple process corners can handle the inter-die variations, as it assumes that all components in the design are operating at the same process corner. However, to analyze the impact of intra-die variations, we need methods to compute the statistics of the design metrics (delay, leakage etc.) of a circuit based on the statistics of the process parameters. This has led to a number of methods for performing statistical timing [18, 1, 5, 20, 21] and leakage analysis [16, 4] of combinational circuits. An important step in performing block-level statistical analysis is the characterization of the various metrics, such as the delay and leakage of a cell, as functions of its device parameters. A significant amount of work has been done to model leakage [16], interconnect delays [19, 13], gate delays [2, 5, 18, 14, 20] and clock skew [1, 12] as functions of the process variables. Previous methods for modeling gate delay as a function of its parameters either used the Taylor series expansion of the delays [5, 18] and obtained the sensitivities (coefficients of the expansion) numerically from SPICE simulations, or used traditional response surface methods to fit a model for the delay [20, 14]. Recently, in [2] an alternate method of expressing delay as an orthogonal polynomial series in the process variables was proposed. In comparison to the work on gate delay modeling, relatively little work has been done in the area of characterization of sequential elements. In [17] the authors model the setup time of a data flip-flop (DFF) considering the effect of random dopant variations in the threshold voltage (Vt) of the transistors in the master stage of the DFF.
In [6] the authors obtain linear models of the setup and hold times and the propagation delay empirically, by running a large number of Monte Carlo simulations and assuming independent uniform distributions of the process variables. In this work, we characterize the setup and hold times and the propagation delays of the flip-flops as second-order functions of the correlated process parameters of the devices in both the master and the slave stage. Compared to previous work, we propose a more efficient and accurate method to perform library characterization of FFs. Using the models of the gate delays, a number of block-based methods for performing statistical static timing analysis (SSTA) of combinational circuits have been proposed [5, 20, 21, 18]. The earlier methods for SSTA were based on linear approximations of the gate delays [5, 18], assuming the parameters to be Gaussian random variables (RVs). Correlated random variables are transformed using principal component analysis (PCA) to obtain a new set of independent random variables. Using the result of PCA, a canonical form of the delay is propagated, based on Clark's approach [7]. Since the max of two Gaussians is not a Gaussian, modeling the max as a linear function can introduce significant inaccuracies. Thus, methods have been proposed [21, 20] to propagate quadratic expansions of the delay. Even though a number of techniques have been proposed for analyzing the timing yield of combinational circuits, to the best of our knowledge, only [15] has addressed the problem of estimating the timing yield of sequential circuits. However, our approach differs significantly from [15] in two ways. First,
Computation of Joint Timing Yield of Sequential Networks
127
Fig. 1. Framework for timing yield estimation
we model the delays as quadratic functions of the process parameters, as compared to the linear models used in [15]. Second, we model all the flip-flop parameters as random variables (as they should be when accounting for process variations), whereas in [15] they are modeled as deterministic quantities. The proposed framework for analyzing the timing yield of sequential circuits is outlined in Fig. 1. We start by characterizing all the sequential and combinational elements in the circuit and de-correlating the parameters of the different gates using PCA, to represent the setup time, hold time, propagation delay and clock skews of the various elements in a canonical form. To remove the cycles and feedbacks, we partition the circuit into a number of sub-circuits. This is followed by a run of SSTA on each of the sub-circuits to obtain constraints corresponding to the setup and hold time violations. Once all the sub-circuits have been analyzed, we compute the joint probability of satisfying all the constraints for a particular clock frequency freq to obtain the timing yield at freq. The organization of the rest of the paper is as follows: Section 2 describes the process of characterization and modeling of timing properties of an edge-triggered data flip-flop using second-order models. Section 3 outlines the algorithm for joint yield estimation and statistical timing analysis of large sequential circuits considering clock skew variations. The results of the proposed framework are presented and compared with MC for accuracy in Section 4. Finally, Section 5 gives the conclusions of our work.
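The grid-based correlation model and its PCA de-correlation step can be sketched as follows; the grid size, variance, and correlation length below are illustrative assumptions, not the paper's characterized values.

```python
import numpy as np

# 3x3 grid of correlated per-grid parameters with a radial exponential covariance
# C(i,j) = sigma^2 * exp(-dist(i,j)/eta); PCA (eigendecomposition) yields the
# independent variables xi used in the canonical delay forms.
n, sigma, eta = 3, 1.0, 2.0
xy = np.array([(i, j) for i in range(n) for j in range(n)], dtype=float)
dist = np.linalg.norm(xy[:, None, :] - xy[None, :, :], axis=-1)
cov = sigma**2 * np.exp(-dist / eta)

evals, evecs = np.linalg.eigh(cov)            # PCA of the n^2 x n^2 grid covariance
A = evecs * np.sqrt(np.clip(evals, 0.0, None))

# A correlated parameter vector is then p = p_mean + A @ xi with xi ~ N(0, I);
# the transform reproduces the target covariance: A @ A.T == cov.
print(np.allclose(A @ A.T, cov))
```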
2 Characterization of Flip Flops
In this work, we use (and thus characterize) the mux-type edge triggered D flip-flops (EDFF) shown in Fig. 2(d). Flip flops that have different architectures
than the one shown in Fig. 2(d) can also be characterized using the proposed approach. In addition to the propagation delay (dCQ) from the clock (C) to the output (Q) of the flip-flop, the DFF has two additional parameters: (1) the setup time (TSU), and (2) the hold time (TH). The setup time (hold time) is defined as the minimum time that the data signal should be held stable prior to (after) the triggering clock edge so that the data can be correctly captured at the output. As shown in Fig. 2 (a), (b), and (c), the dCQ of a flip-flop depends on the difference between: (1) the data arrival time and the clock arrival time, also known as the setup slack (TSS), and (2) the data removal time and the clock arrival time, also known as the hold slack (THS). As the setup slack and the hold slack are decreased, dCQ starts to increase, and beyond a certain point the flip-flop fails to latch the input data. To be conservative, TSU is set to be the setup slack at which dCQ increases by 10% from its minimum value. The minimum value of dCQ is the propagation delay when the setup slack is sufficiently large. For example, in Fig. 2 (b), the setup time for sample 2 (solid curve) is S2. Thus, the TSU of a flip-flop is computed by sweeping the setup slack TSS from a sufficiently large value down to the point where dCQ increases by 10%. The hold time TH is computed similarly by sweeping the hold slack THS.
Fig. 2. (a) Timing diagram of a flip-flop; (d) mux-type edge-triggered D flip-flop
In the presence of process variations, the parameters of the flip-flop have to be modeled as RVs. Hence, each manufactured flip-flop will have a different set of parameters, resulting in different timing characteristics. For example, Fig. 2 (b) shows dCQ as a function of TSS for three different samples. Clearly, estimating the statistics of TSU and TH using Monte Carlo (MC) simulations would be computationally prohibitive. Thus, we propose a method that requires only a small number of MC simulations to represent TSU, TH, and dCQ as functions of the parameters of the flip-flop.
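The 10% push-out extraction described above can be sketched as a bisection over the setup slack. Here dcq_of_slack is a hypothetical analytic stand-in for a SPICE measurement (the monotone blow-up shape and all numeric values are assumptions; only the 10% criterion comes from the text).

```python
import math

def dcq_of_slack(tss_ps, d0=50.0, t0=12.0, tau=2.0):
    # Stand-in for SPICE: dCQ approaches d0 for large slack, blows up as slack shrinks.
    return d0 * (1.0 + math.exp(-(tss_ps - t0) / tau))

def extract_setup_time(dcq, lo=0.0, hi=100.0, pushout=1.10, iters=60):
    """Bisect for the slack where dCQ has grown 10% over its large-slack minimum."""
    target = pushout * dcq(hi)            # dcq(hi) ~ minimum dCQ at ample slack
    for _ in range(iters):                # dcq is monotone decreasing in the slack
        mid = 0.5 * (lo + hi)
        if dcq(mid) > target:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)

tsu = extract_setup_time(dcq_of_slack)
print(tsu)   # slack (ps) at the 10% push-out point; here 12 + 2*ln(10) ~ 16.6 ps
```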
2.1 Models for Flip-Flop Parameters
A second-order stochastic process can be approximated using an infinite series expansion in orthogonal basis polynomials Ψ_k(ζ) in the variables ζ as

f(ζ) = Σ_{k=0}^{∞} λ_k Ψ_k(ζ),    (1)
where the coefficients λ_k can be shown to be the inner product of f(ζ) and Ψ_k(ζ), based on the orthogonality of Ψ_j(ζ) and Ψ_k(ζ) (j ≠ k) [8]. Thus λ_k = ⟨f(ζ), Ψ_k(ζ)⟩, where the inner product is defined as

⟨f(ζ), Ψ_k(ζ)⟩ = ∫ f(ζ) · Ψ_k(ζ) · w(ζ) · dζ.    (2)

The equality in (1) is in the norm ||·|| = √⟨·,·⟩ of the space. This expansion can be shown to be optimal with respect to the underlying distribution of the process parameters, a property that other methods based on Taylor series expansions and regression models cannot guarantee. In (2), w(ζ) is the probability density function (PDF) of ζ. In practice, the expansion in (1) has to be truncated to a finite number of terms. To model the various parameters of the flip-flops, we consider variations in the threshold voltages of the nMOS (V_tn) and the pMOS (V_tp), the effective channel length (L_eff) and the oxide thickness (T_ox) of the devices. We assume that the parameters have Gaussian distributions, so they can be expressed in terms of standard normal random variables ζ = (ζ1, ζ2, ζ3, ζ4) as

V_tn = V̄_tn + σ_Vn ζ1,  V_tp = V̄_tp + σ_Vp ζ2,  T_ox = T̄_ox + σ_T ζ3,  L_eff = L̄_eff + σ_L ζ4,    (3)

where p̄ and σ_p represent the mean and standard deviation, respectively, of the parameter p. For Gaussian RVs, the Hermite polynomials [8] (H_k(ζ)) form an orthogonal basis Ψ_k(ζ). Thus, we can represent the setup time T_SU(ζ) as
T_SU(ζ) = Σ_{k=0}^{N} λ_k · H_k(ζ) = λ_0 + Σ_{j=1}^{4} λ_1j ζ_j + Σ_{j=1}^{4} λ_2j (ζ_j² − 1) + Σ_{j,k=1; j<k}^{4} λ_3jk ζ_j ζ_k .    (4)
The coefficients λ_k in (4) can be obtained as

λ_k = E[ T_SU(ζ) · H_k(ζ) ] = ∫_{ζ∈(−∞,∞)^4} T_SU(ζ) H_k(ζ) φ(ζ) dζ,    (5)
where φ(·) is the standard normal probability density function. The value of the above integral is computed using Gauss-Hermite quadrature [11]. The values of the setup time T_SU(ζ) at the quadrature points are obtained by running SPICE simulations and are then used to evaluate the integral in (5). We characterized the mux-type flip-flop shown in Fig. 2(d) and compared our delay model with 10,000 MC simulations for two cases: (1) considering variations in the complete flip-flop, and (2) considering
Fig. 3. (a) Setup time prob. distribution, and (b) Setup time prob. density
variations in only the master stage of the flip-flop, as in [17]. The distributions of the setup time obtained are shown in Fig. 3. As shown in the figure, the PDF of the setup time considering variations in only the master stage shows a significant departure from the simulation results. In comparison, our model matches the MC results very closely, with an error of 0.02% in the mean and 0.4% in the standard deviation. Using the same method, we obtained the delay models for the setup time and propagation time of flip-flops, and for the delays of logic gates and interconnects. Next, we proceed to model the intra-die variations. Since it is practically impossible to consider individual correlations, due to the large number of RVs (one for each gate) required to model them, the chip area is divided into smaller grids as described in [5]. A single variable is used for each parameter, and total correlation between the parameters is assumed for the gates in the same grid. The covariance matrix representing correlations between the grids is derived from the original covariance function, and a new set of uncorrelated random variables (ξ) for each process parameter is then obtained using PCA. Thus, we have in total r = n² × p RVs in the new expansion, where n² is the number of grids and p is the number of process parameters.
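In one dimension, the coefficient computation of Eq. (5) reduces to a Gauss-Hermite quadrature of the sampled response against each Hermite polynomial. The sketch below uses a toy analytic response exp(0.1·ζ) in place of the SPICE samples (an assumption for illustration), for which the expansion coefficients are known in closed form.

```python
import numpy as np
from numpy.polynomial.hermite_e import hermegauss, hermeval
from math import factorial

# Probabilists' Gauss-Hermite rule: nodes/weights for the weight exp(-x^2/2).
x, w = hermegauss(20)
w = w / np.sqrt(2.0 * np.pi)       # normalize so the weights average against N(0,1)

f = np.exp(0.1 * x)                # toy "setup time" samples at the quadrature points

# lambda_k = E[f(z) He_k(z)] / k!  (probabilists' He_k has norm E[He_k^2] = k!)
lam = np.array([np.sum(w * f * hermeval(x, np.eye(5)[k])) / factorial(k)
                for k in range(5)])

# Analytically exp(s*z) = exp(s^2/2) * sum_k (s^k / k!) He_k(z), so for s = 0.1
# lam[0] = exp(0.005) and lam[1] = 0.1 * exp(0.005).
print(lam[:2])
```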
3 Timing Yield of Sequential Networks
Given the representations of the parameters of all the elements in the circuit, our objective is to compute the timing yield at a target frequency freq. A sequential circuit can have feedbacks, and therefore we have to: (1) partition the circuit to remove the feedbacks, (2) perform timing analysis on the sub-circuits obtained as a result of partitioning, and (3) compute the joint yield of each of the sub-circuits. To remove the feedbacks, we partition each of the flip-flops as shown in Fig. 4(a) [15]. In the new sub-circuits, the arrival times at the additional primary inputs (e.g. Q1_PI) introduced as a result of the partitioning should be the same as the arrival times of the corresponding signals before partitioning. In order to ensure this, we assign these signals an arrival time equal to the sum of the clock arrival time and the dCQ of the partitioned flip-flop.
Fig. 4. (a) Partitioning the circuit to remove feedbacks; (b) timing constraints for a single-stage pipeline
3.1 SSTA of Combinational Circuits
After partitioning the original circuit into a number of sub-circuits with no feedbacks, we perform SSTA on each of the sub-circuits one at a time. We use a method similar to that proposed in [2] for performing SSTA; a brief description of the method follows. Using the characterization described in Section 2.1, the delay expressions of all the elements in the circuit can be represented in the following canonical form:

d(β) = β_0 + Σ_{j=1}^{r} β_1j ξ_j + Σ_{j=1}^{r} β_2j (ξ_j² − 1) + Σ_{j,k=1; j<k}^{r} β_3jk ξ_j ξ_k .    (6)
To perform block-based SSTA, the gates are considered in their topological order. At each gate, the statistical max of all the inputs is evaluated and added to the delay of the gate to compute the arrival time at the output of the gate. The max of two delay expansions is also represented in the same form as (6). As described in Section 2.1, the coefficients in the expansion of the max are the inner products of the max with the corresponding basis polynomials. For example, the max d_max of two delay expansions d(α) and d(β) is computed as follows:

d_max = max{d(α), d(β)} = d(α) + max{0, d(β) − d(α)} = d(α) + max{0, d(δ)} ≈ d(α) + d(γ) ,    (7)
where δ = β − α and d(γ) ≈ max{0, d(δ)}. The essence of this approximation is that max{0, d(δ)} is approximated using an expansion of the form (6), with coefficients being γ. The advantage of this approximation is that once the coefficients γ have been computed, they can be added to α to approximate dmax using d(α + γ).
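The projection that yields the γ coefficients can be illustrated in one dimension, where max{0, d(δ)} for a linear d(δ) has a closed-form mean to check against; d0 and d1 below are toy values (assumptions for illustration).

```python
import numpy as np
from numpy.polynomial.hermite_e import hermegauss, hermeval
from math import factorial, sqrt, pi, erf, exp

x, w = hermegauss(60)
w = w / np.sqrt(2.0 * np.pi)             # weights for averaging against N(0,1)

d0, d1 = 0.2, 0.5                        # toy 1-D form d(delta) = d0 + d1*z
g = np.maximum(0.0, d0 + d1 * x)         # max{0, d(delta)} at the quadrature nodes

# gamma_k = E[max{0, d(delta)} * He_k(z)] / k!
gamma = np.array([np.sum(w * g * hermeval(x, np.eye(3)[k])) / factorial(k)
                  for k in range(3)])

# Sanity check: gamma[0] is E[max{0, X}] for X ~ N(d0, d1^2), whose closed form
# is m*Phi(m/s) + s*phi(m/s) with m = d0, s = d1.
m, s = d0, d1
closed = m * 0.5 * (1.0 + erf(m / (s * sqrt(2.0)))) \
         + s * exp(-m * m / (2.0 * s * s)) / sqrt(2.0 * pi)
print(gamma[0], closed)                  # the two agree closely
```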
The coefficients γ can be computed as the inner product given in (2) using Gaussian quadrature. For example, the coefficient γ_1i in the expansion d(γ) can be obtained by computing the inner product ⟨max{0, d(δ)}, ξ_i⟩, where

d(δ) = δ_0 + δ_1i ξ_i + δ_2i (ξ_i² − 1) + ξ_i · Σ_{j=1; j≠i}^{r} δ_3ij ξ_j + Σ_{j=1; j≠i}^{r} [ δ_1j ξ_j + δ_2j (ξ_j² − 1) ] + Σ_{j,k=1; j<k; j,k≠i}^{r} δ_3jk ξ_j ξ_k .
Since the number of RVs (r) in the expansion can be extremely large, evaluating the inner product using full Gauss-Hermite quadrature can be very expensive. Therefore, we propose a moment-matching based dimensionality-reduction technique that transforms the expansion d(δ) into a function of only 3 random variables. This method is based on mapping the last three terms in the equation above to two functions of new Gaussian random variables ζ_1i and ζ_2i using moment matching, by defining X_i = Σ_{j=1; j≠i}^{r} δ_3ij ξ_j = a_i ζ_1i and

Y_i = Σ_{j=1; j≠i}^{r} [ δ_1j ξ_j + δ_2j (ξ_j² − 1) ] + Σ_{j,k=1; j<k; j,k≠i}^{r} δ_3jk ξ_j ξ_k ,    Ỹ_i = b_1i ζ_2i + b_2i (ζ_2i² − 1) ,    (8)
where Ỹ_i is an approximation of Y_i. The means of both X_i (Gaussian) and Y_i above are zero, and a_i = √( Σ_{j≠i} δ_3ij² ). The coefficients b_1i and b_2i are obtained by minimizing the difference in the skew of Y_i and Ỹ_i while matching their variances. We compute the solution of this minimization problem in closed form to obtain b_1i and b_2i. The expansion d(δ) is now expressed as a function of only 3 random variables: ξ_i, ζ_1i, and ζ_2i. However, since ζ_1i and ζ_2i are functions of the same set of random variables, we de-correlate them using the correlation coefficient of Y_i and X_i to obtain two new independent random variables χ_1 and χ_2. Now that we have the representation of d(δ) in terms of 3 independent random variables, we compute the inner product ⟨max{0, d(δ)}, ξ_i⟩ using Gauss-Hermite quadrature to obtain the coefficient γ_1i. Similarly, the other coefficients in the expansion can be computed to obtain d(γ), which is added to d(α) to obtain d_max. The result of the max can then be added to the expansion of the gate delay to obtain the output arrival time for that gate. This is repeated to compute the arrival times of all the primary outputs.

3.2 Clock Skew Analysis
Clock skew is a phenomenon in synchronous circuits in which the clock signal (sent from the clock source) arrives at different components at different times. Clock skew can be introduced at design time (due to routing and/or unequal loads at sinks) or during the fabrication process. Although most designs today include techniques to minimize the skew introduced at design time, in the presence of significant process variations even perfectly balanced clock paths can give rise to skew. Statistical analysis of clock skew is also important to solve the problem of clock network pessimism introduced in the design due to
the traditional static timing analysis algorithm [9]. This problem has been addressed using heuristic and graph-based techniques [9]. It is not an issue in SSTA, because the arrival time at any gate in the circuit is a random variable and, for a given realization, has a single value, albeit with some probability. If d_S,i and d_S,j are the distributions of the delays for the paths from the clock source S to sinks i and j, respectively, then the skew can be expressed as:

τ_i,j = d_S,i − d_S,j .    (9)
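A toy Monte Carlo experiment illustrates the common-path cancellation implied by (9); the clock-path delay distributions below are assumptions for illustration.

```python
import numpy as np

rng = np.random.default_rng(1)
n = 200_000
d_common = rng.normal(100.0, 5.0, n)     # shared trunk of the clock tree (ps)
branch_i = rng.normal(20.0, 2.0, n)      # independent branch to sink i
branch_j = rng.normal(22.0, 2.0, n)      # independent branch to sink j

# Eq. (9): tau_ij = dS,i - dS,j; the 5 ps trunk variation cancels exactly.
skew = (d_common + branch_i) - (d_common + branch_j)

print(np.mean(skew))   # ~ -2 ps (difference of branch means)
print(np.std(skew))    # ~ sqrt(2^2 + 2^2) ~ 2.83 ps, independent of the trunk sigma
```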
In our analysis, the delay of the path from the source to any sink node can be evaluated by computing the sum of the delay expansions of the interconnects and buffers on the path. Also, the skew between any two sink nodes can be obtained by computing the difference of the two path delays. The effect of common paths is canceled in SSTA when the skew is calculated as given in (9). However, correlations between the delay random variables along different capturing and launching subpaths can be included, leading to a more accurate analysis.

3.3 Timing Constraints
For a pipelined sequential circuit such as the one shown in Fig. 4(b) to operate correctly, the propagation delay of the combinational block between the launch (FF_i) and capture (FF_{i+1}) flip-flops, the clock skew between the two flip-flops, the d_CQ of the launch flip-flop, and the setup (hold) time of the capture flip-flop should satisfy certain constraints. For example, the setup time constraint is obtained as follows: the arrival time of the signal at the input of the target flip-flop (FF_{i+1}) should be less than the sum of the clock arrival time and the clock period (T_clk), minus the setup time of FF_{i+1}. That is:

d_S,i + d_CQ,i + d_comb < d_S,i+1 − T_SU,i+1 + T_clk ,  i.e.,  d_CQ,i + d_comb + T_SU,i+1 + τ_i,i+1 < T_clk .    (10)
Similarly, the hold time constraint can be obtained as:

d_S,i + d_CQ,i + d_comb > T_H,i+1 + d_S,i+1 ,  or  d_CQ,i + d_comb + τ_i,i+1 − T_H,i+1 > 0 .    (11)
In a sequential circuit with multiple pipeline stages, for the complete circuit to work we need to satisfy the constraint equations of all the sequential logic blocks, and this gives rise to the problem of SSTA in sequential networks. We need to evaluate the constraint equations for all sequential blocks and compute their joint distribution to analyze the circuit.

3.4 Joint Timing Yield Estimation
From (10), all the quantities on the LHS have the canonical form of (6). Since the LHS of (10) is simply a sum of canonical forms, the LHS of the setup constraint corresponding to the i-th flip-flop can be written as:

S(β_i) = β_i0 + Σ_{j=1}^{r} β_1ij ξ_j + Σ_{j=1}^{r} β_2ij (ξ_j² − 1) + Σ_{j,k=1; j<k}^{r} β_3ijk ξ_j ξ_k .    (12)
134
A. Goel et al.
Similarly, the LHS of the hold time constraint of the i-th flip-flop can be written as:

H(γ_i) = γ_{i0} + Σ_{j=1}^{r} γ_{1ij} ξ_j + Σ_{j=1}^{r} γ_{2ij} (ξ_j² − 1) + Σ_{j,k=1; j<k}^{r} γ_{3ijk} ξ_j ξ_k.    (13)
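Each side of these constraints is a second-order polynomial in independent standard normals, so its distribution is easy to sample. A sketch with invented coefficients (the cross-term coefficients β3 are set to zero here, mirroring the cross-product terms that the experiments later ignore):

```python
import numpy as np

def sample_canonical(beta0, b1, b2, b3, n=100_000, seed=0):
    """MC samples of S = beta0 + sum_j b1[j]*xi_j
    + sum_j b2[j]*(xi_j**2 - 1) + sum_{j<k} b3[j,k]*xi_j*xi_k,
    with xi_j i.i.d. N(0, 1). All coefficients are illustrative."""
    rng = np.random.default_rng(seed)
    r = len(b1)
    xi = rng.standard_normal((n, r))
    s = beta0 + xi @ b1 + (xi ** 2 - 1) @ b2
    for j in range(r):
        for k in range(j + 1, r):
            s += b3[j, k] * xi[:, j] * xi[:, k]
    return s

b1 = np.array([3.0, 1.0])   # linear sensitivities (invented)
b2 = np.array([0.5, 0.2])   # quadratic sensitivities (invented)
b3 = np.zeros((2, 2))       # cross terms set to zero in this sketch
s = sample_canonical(100.0, b1, b2, b3)
print(s.mean(), s.std())  # mean ~100: the (xi^2 - 1) terms are zero-mean
```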
The timing yield Y of a sequential circuit, defined as the probability that the setup time and hold time constraints at all the flip-flops in the circuit are satisfied simultaneously, is

Y = Prob(S ∩ H),    (14)
where S = {S(β_i) < T_clk : for all i} is the set of setup time constraints and H = {H(γ_i) > 0 : for all i} is the set of hold time constraints. The event S can be rewritten as

S = {max{S(β_i) : for all i} − T_clk < 0} = {S_max < 0},    (15)

where S_max = max{S(β_i) : for all i} − T_clk. Likewise, the event H can be rewritten as

H = {max{−H(γ_i) : for all i} < 0} = {H_max < 0},    (16)

where H_max = max{−H(γ_i) : for all i}. We compute S_max and H_max using the max operator described in Section 3.1. Finally, S ∩ H can be represented as {max{H_max, S_max} < 0}, which is also evaluated using the max operator; the timing yield of the network is thus obtained.
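The max operator itself (Section 3.1) is outside this excerpt; a classical ingredient of such operators is Clark's moment matching for the maximum of two jointly Gaussian variables, sketched here for the Gaussian special case (the paper's operator acts on full quadratic forms):

```python
import math

def clark_max(mu1, s1, mu2, s2, rho=0.0):
    """First two moments of max(X, Y) for jointly Gaussian X, Y with
    correlation rho (Clark, 1961). Returns (mean, std)."""
    phi = lambda x: math.exp(-x * x / 2) / math.sqrt(2 * math.pi)
    Phi = lambda x: 0.5 * (1 + math.erf(x / math.sqrt(2)))
    a = math.sqrt(max(s1 * s1 + s2 * s2 - 2 * rho * s1 * s2, 1e-30))
    alpha = (mu1 - mu2) / a
    m = mu1 * Phi(alpha) + mu2 * Phi(-alpha) + a * phi(alpha)
    m2 = ((mu1 ** 2 + s1 ** 2) * Phi(alpha)
          + (mu2 ** 2 + s2 ** 2) * Phi(-alpha)
          + (mu1 + mu2) * a * phi(alpha))
    return m, math.sqrt(max(m2 - m * m, 0.0))

# pairwise reduction over hypothetical per-flip-flop setup statistics
mu, sd = clark_max(1.0, 0.2, 0.9, 0.3)
print(mu, sd)  # mean exceeds max(1.0, 0.9), as expected for a maximum
```

Clark's formulas give the exact first two moments, though the maximum itself is not Gaussian; the yield then follows as Prob(max − T_clk < 0) under the fitted distribution.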
4 Experimental Results
The framework for SSTA using quadratic models was implemented in C++. All experiments were carried out on a 2.16 GHz Intel Core Duo machine with 2 GB of memory. The delay models for all the cells in the library and the EDFF (Fig. 2(d)) were obtained for a 90-nm technology using BSIM4 predictive technology models. HSPICE was used to carry out the SPICE simulations. The experiments were run on sequential circuits from the ISCAS89 benchmark suite. The circuits were technology-mapped to a cell library using SIS v1.3. Placement and routing of the technology-mapped netlist were done using the Metaplacer tool in UMpack. Variations of 10% at 3σ were taken for V_tn, V_tp, L_eff, and T_ox. It is worth noting that the variations in threshold voltage modeled here represent the effect of random dopant concentration in the manufactured devices; the change in threshold voltage due to variations in channel length is accounted for in the SPICE models. The correlation between these variables was modeled using a radial exponential covariance function. The chip was divided into grids depending on the size of the circuit, and each process variable was then modeled using n² variables for a circuit divided into an n × n grid. Thus, a total of r = 4n² = 36 random variables were used for the second-order delay expansion. The library characterization results obtained by ignoring the cross-product terms gave an error of less than 1% in mean and less
Computation of Joint Timing Yield of Sequential Networks
135
Table 1. Results of SSTA for ISCAS89 benchmark sequential circuits (quad% and lin% are the errors of the quadratic and linear models vs. MC)

Circuit   #Gates  #DFFs   μ_quad   μ_MC     quad(%)  lin(%)   σ_quad  σ_MC   quad(%)  lin(%)   MC/SSTA (s)  Speedup
s38417    23815   1636    422.6    422.1    0.1      2.43     10.1    12.3   17.8     64.8     42422/102    416
s38584    20705   1452    255.9    255.6    0.1      0.85     7.45    8.23   9.5      39.7     35435/88     402
s35932    17793   1782    355.9    352.9    0.84     4.07     11.9    11.1   6.9      21.64    45030/323    139
s15850    10369   597     704.3    704.2    0.02     0.19     22.6    22.6   0.01     48.69    19120/60     319
s9234     5825    228     500.4    500.3    0.01     0.18     20.8    20.6   0.81     12.22    7531/19      396
s5378     2958    179     220.57   220.54   0.01     0.15     8.73    7.90   10.5     16.3     7785/23      338
s1423     731     74      983.58   983.53   0.00     0.27     28.0    28.0   0.01     60.5     3405/6       567
s1238     526     18      329.16   329.12   0.01     0.13     14.6    14.6   0.00     43.7     1656/6       276
s838      390     32      146.35   146.8    0.30     2.86     5.65    5.69   0.97     30.22    284/1        284
s510      211     6       128.6    128.83   0.17     1.32     5.07    5.16   1.97     26.7     202/1        202
Average                                     0.015    1.37                    4.85     35.48                 334
Fig. 5. Comparison of the Yield using our approach with quadratic MC and linear MC
than 3% in standard deviation when compared with MC simulations. Therefore, the cross-product terms were ignored in the timing analysis of the circuits. Results of the proposed method were compared with MC simulations (based on the same delay models) for accuracy. To keep the error in the variance of the MC estimate below 1%, we used 10,000 MC samples. However, to evaluate the error introduced by ignoring the cross-product terms, we did include these terms in the MC simulations. The hold time of the flip-flop shown in Fig. 2(d) was negative when designed using the 90nm BSIM4 model. Therefore, the constraint in (11) is trivially satisfied for all flip-flops in the circuit, and we only need to evaluate the joint setup constraint of the network to determine its timing yield. Clock skew at a flip-flop adds to the setup time in setup constraint evaluation and is subtracted from the hold time in hold constraint evaluation; we therefore account for clock skew variations through the variations in the setup and hold times of the circuit. In Fig. 5, we show the timing yield of the benchmark circuit s35932 (19521 cells) obtained using our approach. We compare our results with the distributions obtained from MC simulations using exact quadratic models and using linear models as proposed in previous works. Our results show a good match with the MC results obtained using exact delay models. However, the MC results using linear models show a significant difference when compared with MC
results using quadratic models. This underscores the need for second-order delay models. Table 1 shows the results of our experiments on the ISCAS89 benchmark circuits. The average errors in the mean and standard deviation of the distribution of the joint constraint equation obtained by our approach are less than 0.5% and 5%, respectively. The results for MC simulations using the linear model show large errors in mean and standard deviation when compared with MC using the quadratic model. Our approach gives an over 300X improvement in runtime compared to MC.
5 Conclusions
In this paper, we propose accurate methods for modeling the timing characteristics of flip-flops in the presence of process variations. Using second-order models of library cells, we propose a comprehensive framework for statistical timing analysis of large sequential circuits considering spatial correlations and clock skew variations. The results of the timing analysis algorithm form the basis for joint yield estimation of the sequential blocks in the network at the desired frequency. Our results demonstrate the feasibility and accuracy of statistical analysis of large sequential networks.
A Simple Statistical Timing Analysis Flow and Its Application to Timing Margin Evaluation

V. Migairou1, R. Wilson1, S. Engels1, Z. Wu2, N. Azemard2, and P. Maurine2

1 STMicroelectronics Central CAD & Design Solution, 850 rue Monnet, 38926 Crolles, France
2 LIRMM, UMR CNRS/Université de Montpellier II, (C5506), 161 rue Ada, 34392 Montpellier, France
Abstract. The increase of within-die variations and design margins is creating a need for statistical design methods. This paper proposes a simple statistical timing analysis method that considers the lot-to-lot process shifts occurring during production. The method is first validated on 90nm and 65nm processes. Finally, this statistical timing analysis is applied to basic ring oscillators to evaluate the timing margins introduced at the design level by the traditional corner-based approach.
1 Introduction

While statistical simulation methods have long been adopted by analogue designers to evaluate the quality of their circuits, they have traditionally suffered from run times that grow prohibitively with circuit size, making them infeasible for digital designs. Due to the growing impact of process variations, the situation is changing, and a need for statistical timing engines and design methods has recently appeared [1]. Several reasons may explain this change; the increase of within-die variations and the need to increase design margins to take into account new nanometer effects (e.g., NBTI) are two of them. Within-die variations are mainly due to lithographic distortions [2] and random variations of the intrinsic characteristics of the materials involved in the fabrication of integrated circuits. While the impact of die-to-die variations [3] may properly be captured by corner-case analyses, the impact of within-die variations may not be captured by corner analysis. As an illustration, the worst timing configurations of races between clock paths and data paths are not necessarily captured by multiple corner analyses [4], and the traditional approach then requires additional design margins or validation steps. While this situation is currently tolerable, it will quickly become impractical: increasing design margins will induce convergence problems during the design steps, more precisely during the timing analysis and optimization steps. There is thus a future for statistical timing engines. Several statistical timing methods are available in the literature [1, 3] and industrial timing engines are under development or test [7]. While these broad-scope methods solve, partially or fully, several problems such as spatial correlations and path re-convergence, they are quite complex and difficult to put into practice. The main reason lies in the difficulty to

N. Azemard and L. Svensson (Eds.): PATMOS 2007, LNCS 4644, pp. 138–147, 2007. © Springer-Verlag Berlin Heidelberg 2007
provide appropriate process data to such tools. As an example, it is extremely time-consuming to properly evaluate the spatial correlations of even a few key parameters, and questions arise: how many parameters should we monitor to accurately capture the process characteristics? Which parameters should we monitor in order to have the data required for digital design validation? Within this context, the contribution of this paper is to introduce a simple statistical timing analysis method and to provide a first evaluation of the actual timing margins. The remainder of the paper is organized as follows. In Section 2, the proposed statistical timing analysis flow is introduced. In Section 3, the proposed method is validated on 90nm and 65nm processes. Finally, a conclusion is presented in Section 4.
2 Statistical Timing Analysis Flow

Traditionally, timing analysis deals with deterministic values, either best- or worst-case timings. These values, obtained during the characterization step for several operating conditions, are usually stored in timing library files. These files are then provided to timing analyzers to compute the timing properties of circuits during the timing analysis step. This separation between the characterization and performance analysis steps constitutes the keystone of the traditional standard cell design flow. To be fully compliant with the standard design flow, any statistical timing analysis method should adopt the same approach. This was a main target for the development of the method proposed herein. More precisely, our objective is to introduce a simple statistical timing methodology:

- providing the same results as Monte Carlo simulations, while drastically reducing the induced CPU cost,
- requiring only the data provided by standard SPICE modelling, such as statistical SPICE models.
Considering these constraints, we adopted the statistical timing analysis flow represented in Fig. 1. As shown, the characterization and timing analysis tasks are separated, and the link between these two steps, i.e. between the synthesis and physical design steps, is ensured by two files: the standard.lib and stat.lib timing library files generated during the standard cell characterization step.

2.1 Standard Cell Characterization

The goal of the standard cell characterization step is to provide timing tables representing the timing behaviour of cells with respect to their operating conditions, namely the input ramp duration τIN and the output load CL. This implies launching a great number of electrical simulations to obtain sampled values of both the propagation delay and the transition times for a set of predefined couples (τIN, CL). These tables are then fed to the timing analysis engine, which computes (by interpolation) the timings of a complete design for any couple (τIN, CL). As shown in Fig. 1, the goal of the statistical standard cell characterization is roughly the same as that of the traditional characterization approach. The main difference
being that Monte Carlo simulations are performed rather than the usual transient analyses, to obtain both the means and standard deviations of the timings. Based on the results of these Monte Carlo simulations, the stat.lib is generated. It gathers the mean and standard deviation values of all timings in tabular form. While this modification of the method appears minor, it has a significant impact on the CPU time required to characterize a library, due to the large increase in the number of simulations needed. Some solutions have been proposed to reduce this CPU time. Starting from first-order analytical expressions of the standard deviations of CMOS timings, we demonstrated how to define a statistical timing characterization protocol for simple combinational cells. This method allows up to an 80% reduction in the CPU time needed to produce the required timing views [5]. Note also that it implies (from a statistical point of view) that all distributions are assumed to be normal.
[Flow diagram: a characterization tool (Eldo or SPICE, plus scripts) takes the statistical SPICE model and the standard cell SPICE netlists and produces two timing look-up table files, standard.lib and stat.lib. The standard.lib feeds the traditional timing analysis (PrimeTime), yielding corner values (SS, FF) and worst/best propagation delays. The stat.lib and the gate-level path netlists feed the statistical timing analysis scripts — (a) Comp_stat_cell_delay, (b) Comp_stat_cell_transition_time, (c) Cumulate_stat_path_delay, (d) Simplify_sampled_path_delay — yielding the mean and standard deviation of the path delay and transition time distributions.]

Fig. 1. Standard and statistical characterization and timing analysis flow
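Both .lib branches in Fig. 1 rest on table look-up plus interpolation over (τIN, CL). A sketch of that step with an invented two-dimensional table (axes and values are illustrative, not from a real stat.lib):

```python
import numpy as np

# Hypothetical stat.lib entry: mean delay (ps) tabulated over the input
# ramp duration tau_in (ps) and the output load C_L (fF).
tau_axis = np.array([20.0, 60.0, 140.0])
cap_axis = np.array([2.0, 8.0, 32.0])
mean_tab = np.array([[40.0, 55.0, 95.0],    # tau_in = 20
                     [48.0, 63.0, 103.0],   # tau_in = 60
                     [70.0, 85.0, 125.0]])  # tau_in = 140

def lookup(tab, tau, cl):
    """Bilinear interpolation between the four characterized corners
    surrounding (tau, cl), as a timing engine would do."""
    i = int(np.clip(np.searchsorted(tau_axis, tau) - 1, 0, tau_axis.size - 2))
    j = int(np.clip(np.searchsorted(cap_axis, cl) - 1, 0, cap_axis.size - 2))
    t = (tau - tau_axis[i]) / (tau_axis[i + 1] - tau_axis[i])
    u = (cl - cap_axis[j]) / (cap_axis[j + 1] - cap_axis[j])
    return ((1 - t) * (1 - u) * tab[i, j] + t * (1 - u) * tab[i + 1, j]
            + (1 - t) * u * tab[i, j + 1] + t * u * tab[i + 1, j + 1])

print(lookup(mean_tab, 40.0, 5.0))  # 51.5, the average of the four corners
```

A stat.lib would hold one such table for the mean and one for the standard deviation of each timing arc.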
2.2 Timing Analyzer Core

As illustrated in Fig. 1, the proposed statistical timing analysis flow adopts a path-based approach [4] and assumes that all distributions are Gaussian; the re-convergence problem is therefore not considered. The core of this flow is constituted by scripts bridging the gap between the traditional corner approach and the statistical approach. Among all the developed scripts, four (Comp_stat_cell_delay, Comp_stat_cell_transition_time, Cumulate_stat_path_delay, Simplify_sampled_path_delay) are of prime importance. As illustrated in Fig. 1, the tool takes as input the stat.lib file and the gate-level netlists of the paths.
2.2.1 Cell Output Transition Time Distribution

The Comp_stat_cell_transition_time routine takes as input the incoming transition time distribution (pdf) and computes the output transition time pdf of all cells, with respect to their operating conditions, for the paths being analysed. To perform this computation, a sampled representation of normal distributions was adopted. To ensure quick computation while keeping a good level of accuracy, the number of samples has been limited to 6, as shown in Fig. 2.
[Diagram: the input transition time distribution (<τ>IN, σIN) is sampled at six points (τIN1, pIN1) … (τIN6, pIN6); for each sample, the stat.lib gives <τ>OUTi = fn(τINi) and σOUTi = fn(τINi); the six resulting sampled normal distributions are combined into a sampled distribution (*) of the output transition time, from which <τ>OUT and σOUT are deduced assuming a normal distribution.]

Fig. 2. From input to output transition time distribution
As shown, the first step is to compute, for each sample τINi (i = 1…6) of the input transition time distribution, the associated probability pINi. Then, reading the mean and standard deviation values (which are functions of τIN) in the stat.lib file, a sampled normal distribution is generated for each τINi. This process captures the impact of the input transition time variation (i.e., the output transition variation of the driving gate) on the output transition time distribution; in other words, it captures the electrical correlations induced by the switching process of CMOS gates. Once the six sampled distributions are computed, a normalized sampled distribution of the output transition time is obtained. From this set of samples, the mean and standard deviation of the output transition time are deduced.
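The six-point propagation just described amounts to a small Gaussian-mixture computation. The exact sample points, weights, and stat.lib dependences are not given in this excerpt, so every number below is illustrative:

```python
import numpy as np

# Six samples of a standard normal and normalized weights (illustrative;
# the paper's exact sampling of the input-slope pdf is not specified here).
z = np.array([-2.5, -1.5, -0.5, 0.5, 1.5, 2.5])
w = np.exp(-z ** 2 / 2)
w /= w.sum()

# Hypothetical stat.lib dependence of the output-slope moments on tau_in (ps)
mean_out = lambda tau: 30.0 + 0.4 * tau
std_out = lambda tau: 3.0 + 0.02 * tau

def propagate(mu_in, sigma_in):
    """Mean/std of the output transition time, modeled as a six-component
    Gaussian mixture indexed by the sampled input slope."""
    taus = mu_in + sigma_in * z          # the six tau_INi samples
    m, s = mean_out(taus), std_out(taus)
    mu = float(np.sum(w * m))                             # mixture mean
    var = float(np.sum(w * (s ** 2 + m ** 2))) - mu ** 2  # mixture variance
    return mu, var ** 0.5

mu, sd = propagate(50.0, 5.0)
print(mu, sd)
```

The routine then refits a normal with this (μ, σ) pair, consistent with the flow's assumption that all distributions are normal.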
2.2.2 Cell Delay Distribution

The Comp_stat_cell_delay routine is fed with the input transition time distribution (pdf) and computes the propagation delay pdf of any cell with respect to its operating conditions. The process used to compute this distribution is roughly similar to that of the output transition time discussed above. The only difference is that the resulting set of samples (* in Fig. 2) is not reduced to a Gaussian distribution but is directly used by the Cumulate_stat_path_delay routine described below.

2.2.3 Path Delay Distribution

The path delay distribution is computed iteratively, from the path input to the path output. This task is carried out by the Cumulate_stat_path_delay routine, which is called each time the delay distribution of a cell has been computed, to obtain the new 'current' path delay distribution. In effect, it performs the convolution product of the 'current' path delay and 'current' cell delay distributions to obtain the new path delay distribution. This process is illustrated in Fig. 3.
[Diagram: the delay distribution of gate n is convolved (⊗) with the current path delay distribution (gates 1 … n−1) to obtain the next path delay distribution.]

Fig. 3. From cell to path delay distribution
However, performing the convolution product leads to an ever-increasing number of samples, resulting in a large CPU time for long paths. To overcome this problem, a routine called Simplify_sampled_path_delay is used to keep the number of samples below a user-defined limit. During the validation step described below, we limited the number of samples to 200.
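The two routines can be sketched numerically — the grids and the two example cell distributions are invented; only the 200-sample cap comes from the text:

```python
import numpy as np

def trapz(y, x):
    # simple trapezoidal integration (kept local for portability)
    return float(np.sum((y[1:] + y[:-1]) * np.diff(x)) / 2)

def convolve_pdfs(x1, p1, x2, p2, dx):
    """Cumulate_stat_path_delay, sketched: the pdf of a sum of independent
    delays is the convolution of their sampled pdfs (uniform grid, step dx)."""
    p = np.convolve(p1, p2) * dx
    x = x1[0] + x2[0] + np.arange(p.size) * dx
    return x, p / trapz(p, x)

def simplify(x, p, max_samples=200):
    """Simplify_sampled_path_delay, sketched: bound the sample count
    (the paper uses 200) by decimation, then renormalize."""
    if x.size <= max_samples:
        return x, p
    idx = np.linspace(0, x.size - 1, max_samples).round().astype(int)
    return x[idx], p[idx] / trapz(p[idx], x[idx])

dx = 0.5
x = np.arange(0, 40, dx)
gauss = lambda m, s: np.exp(-(x - m) ** 2 / (2 * s ** 2)) / (s * np.sqrt(2 * np.pi))
xs, ps = convolve_pdfs(x, gauss(10, 1.5), x, gauss(12, 2.0), dx)
mean = trapz(xs * ps, xs)
xs2, ps2 = simplify(xs, ps, max_samples=100)
print(round(mean, 2), xs2.size)  # ~22.0 (means add under convolution), 100
```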
3 Validation

The statistical timing analysis flow was validated in two successive steps. In the first step, we compared the values obtained for several paths with the proposed analysis flow to those obtained with Monte Carlo simulations, considering both 90nm and 65nm processes. In the second step, we compared the values obtained with our flow and with Monte Carlo simulations to those measured on silicon for several test structures. Since silicon data are confidential, a multiplicative coefficient has been applied to all simulated and measured data given in the next paragraphs.
3.1 Monte Carlo vs. Statistically Calculated Values

To validate the statistical static timing analysis flow (SSTAF) introduced in Section 2, we compared the calculated propagation delay distributions of several paths to those obtained with Monte Carlo simulation (Eldo). The logic depth of the simulated paths ranges from 9 to 100. The 90nm and 65nm electrical model cards used were BSIM4 statistical SPICE model cards, including both inter-die and intra-die variations. Table 1 reports the means and standard deviations, as well as the best-case (1.1·Vdd, −40°C) and worst-case (0.9·Vdd, 125°C) corner timings, obtained with both methods for 5 paths.

Table 1. Path timing data (process: 90nm)

                           SSTAF (ps)      Monte Carlo (ps)   Error (%)      Corners (ps)
Path  Depth  Input edge   mean    σ       mean    σ          mean   σ       Best    Worst
1     9      Fall         886     27      903     27         2%     0%      681     1195
             Rise         858     24      882     24         3%     0%      668     1168
2     15     Fall         2095    58      2002    58         5%     0%      1608    2826
             Rise         2136    59      2110    56         1%     5%      1575    2860
3     26     Fall         1210    33      1196    32         1%     3%      908     1609
             Rise         1214    35      1201    34         1%     3%      898     1594
4     50     Fall         1699    50      1744    50         3%     0%      1300    2331
             Rise         1737    52      1732    52         0%     0%      1290    2316
5     100    Fall         3140    96      3140    96         0%     0%      2324    4218
             Rise         3149    97      3128    95         1%     2%      2314    4203
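To make the corner-vs-statistical comparison concrete, the margin arithmetic can be reproduced from the path 5 (falling input edge) row of Table 1; how the margin is normalized is not spelled out in this excerpt, so the definition below (relative to the worst corner) is one plausible reading:

```python
# Path 5 (logic depth 100, falling input edge) from Table 1, in ps:
mean, sigma = 3140.0, 96.0        # SSTAF mean and standard deviation
best, worst = 2324.0, 4218.0      # best/worst-case corner timings

hi = mean + 3 * sigma             # statistical worst case (mean + 3*sigma)
margin = (worst - hi) / worst     # share of the corner budget unused statistically
print(round(hi), round(100 * margin, 1))  # 3428 18.7
```

The resulting ~19% is consistent with the "up to 20%" margin reported in the text, though the corner values also fold in voltage and temperature sensitivity.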
As deduced from Table 1, the agreement between simulated and calculated (SSTAF) values is good, with errors lower than 10%. Note, however, that the maximum absolute errors are always obtained on the mean values. This is due to the interpolation between the data points stored in the stat.lib file, which constitutes the main source of discrepancy. Note also that comparing the mean ± 3σ values to the best- and worst-case timing values leads to timing margins representing up to 20% of the mean timing value. These timing margins, however, include the process, temperature and voltage sensitivities of the timings. The timing margins due to the consideration of worst- or best-case process conditions alone will be discussed in the next paragraphs.

3.2 Statistically Calculated Values vs. Silicon Data

The statistical timing analysis flow was applied to compute the propagation delay distributions of various ring oscillators used as process monitoring circuits. The obtained values were compared to data measured over several manufactured lots. Fig. 4 gives the delay distributions obtained with the SSTA flow, with Monte Carlo simulations, and computed from data measured on a single lot. The results obtained with different corner analyses are also given.
[Plot: delay distributions from SSTAF, Monte Carlo simulation and silicon measurements, together with the TT, SS_P, FF_P, SS_PVT and FF_PVT corner timings, plotted against timing.]

Fig. 4. SSTAF versus silicon and simulation results
FF_PVT and SS_PVT denote the best-case (1.1·Vdd, −40°C) and worst-case (0.9·Vdd, 125°C) timing corners, while SS_P and FF_P are the timings obtained with, respectively, the worst- and best-case process model cards under the supply voltage and temperature of the measurement conditions. For the lot under consideration, the agreement between the simulated (Monte Carlo), calculated (SSTAF) and measured distributions is good, thus validating the timing analysis flow introduced in Section 2. This figure also allows estimating the timing margins between the mean ± 3σ values and the different corner timing values. The CAD timing margins (SS_CAD − (mean + 3σ)) considered to take process, voltage and temperature variations into account represent up to 20% of the mean timing value, while the margins considered to take only process variations into account represent up to 8% of the mean or typical timing value, depending on the path under consideration. This highlights the great sensitivity of the timings to supply voltage variations [6].

3.3 Timing Margins and Improved SSTA Flow

The results above suggest that the timing margins that could ideally be recovered using a statistical timing design tool and methodology are about 20% for the design and process under consideration, depending on the accuracy of the STA tools. This is significant; however, this result was obtained for a single and quite well-centred lot, i.e. a lot having a mean timing value close to that obtained for typical PVT conditions. Due to possible process decentring of the production lines, the centres of the distributions may also be shifted, as illustrated in Fig. 5. Even if this process centring is kept under control, it may significantly affect the performance distributions obtained for each lot during production.
As a result, the process timing margins that may be recovered at the design stage using statistical timing analysis could be lower, at constant timing yield. Thus, the timing margins introduced by worst- or best-case analysis strongly depend on the standard deviation σC of the probability density function, over time, of the lots' mean timing value, as illustrated by Fig. 5. σC depends on the maturity of the process; more precisely, its value quickly decreases from a large value during
[Plots: Fig. 5 shows, against the TT_PVT, SS_P, FF_P, SS_PVT and FF_PVT corners, the lot process timing margin, the distribution of the lot mean timing value, and the production process timing margin. Fig. 6 shows the measured frequency distributions of lots 1, 2 and 3 and their combined distribution, against the TT, SS_PVT and FF_PVT corners.]

Fig. 5. (a) Timing distributions obtained for three lots produced at different times; (b) distribution of the lot mean timing value during production; (c) timing distribution over the full production

Fig. 6. Ring oscillator frequency distribution measured for three lots during the ramp-up
process ramp-up towards a smaller value during production. As an illustration, Fig. 6 gives the measured distributions obtained for several lots, and the corresponding overall distribution of the ring oscillators' frequency, during the ramp-up of the process under consideration. Considering the process centring of the production lines, which can be intentionally shifted to improve the yield of a product once its exact behaviour in silicon is known, the SSTA flow must be modified to obtain more accurate distributions. This would further increase the design robustness to possible process-centring fluctuations, or allow intentional process shifts to be controlled. An improved statistical timing analysis flow is represented in Fig. 7. As illustrated, the main difference is an additional input provided
by the production line: the probability density function of the lot mean timing value. The latter can be used to compute the distribution of the path timings taking process-centring variations into account, leading to more robust design rules and ensuring critical-path validation with a higher level of confidence. For example, taking into account process centring and the actual design margins, one can decide to reduce leakage and estimate the induced impact on margins, and hence on yield.
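For Gaussian shapes, the extra computation the improved flow calls for is simple: convolving the within-lot timing distribution with the distribution of the lot mean adds their variances. A sketch with invented spreads:

```python
from math import erf, sqrt

Phi = lambda z: 0.5 * (1 + erf(z / sqrt(2)))   # standard normal CDF

sigma_lot = 30.0   # ps: within-lot spread, from the SSTA flow (invented)
sigma_C = 20.0     # ps: lot-to-lot spread of the mean, from the fab (invented)
sigma_prod = sqrt(sigma_lot ** 2 + sigma_C ** 2)  # full-production spread
print(round(sigma_prod, 2))  # 36.06

# A margin sized at 3 sigma of a single centred lot covers less of the
# full-production distribution:
coverage = Phi(3 * sigma_lot / sigma_prod)
print(round(coverage, 4))  # below the single-lot 0.9987
```

This is why margins derived from one well-centred lot can understate what the full production requires at constant timing yield.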
[Flow diagram: same as Fig. 1, with one additional input to the statistical timing analysis scripts — the distribution of the lot mean timing value provided by the production line — so that the flow outputs path delay and transition time distributions both for a centred lot and over the full production, together with the worst and best propagation delays.]

Fig. 7. Improved statistical timing analysis flow
4 Conclusion

We have introduced a simple statistical timing analysis flow allowing us to obtain, at the design level, the propagation delay distributions of logical paths. To validate the proposed flow, the calculated distributions were compared with those obtained from Monte Carlo simulations, and the agreement was satisfactory. Comparisons with silicon data have confirmed that the predicted probability density functions describe the silicon behaviour well. These comparisons have enabled us to define an improved statistical design flow taking into account process centring during chip production, thereby increasing yield at a given performance.
References

[1] Amin, C.S., et al.: Statistical static timing analysis: how simple can we get? In: Proceedings of the 42nd Design Automation Conference, pp. 652–657 (2005)
[2] Gupta, P., Kahng, A.B., Sylvester, D., Yang, J.: A Cost-Driven Lithographic Correction Methodology Based on Off-the-Shelf Sizing Tools. In: Proc. ACM/IEEE Design Automation Conf., pp. 16–21 (June 2003)
[3] Borkar, S., et al.: Parameter Variations and Impact on Circuits and Microarchitecture. In: Proceedings of the 40th Annual ACM/IEEE Design Automation Conference, pp. 338–342 (2003)
[4] Leonard, L., et al.: A path-based methodology for post-silicon timing validation. In: ICCAD 2004, pp. 713–720 (2004)
[5] Migairou, V., et al.: Statistical Characterization of Library Timing Performance. In: Vounckx, J., Azemard, N., Maurine, P. (eds.) PATMOS 2006. LNCS, vol. 4148, Springer, Heidelberg (2006)
[6] Lasbouygues, B., et al.: Temperature and Voltage Aware Timing Analysis. IEEE Transactions on Computer-Aided Design (to appear, 2007)
[7] http://www.extreme-da.com/
A Statistical Approach to the Timing-Yield Optimization of Pipeline Circuits

Chin-Hsiung Hsu1, Szu-Jui Chou1, Jie-Hong R. Jiang1,2, and Yao-Wen Chang1,2

1 Graduate Institute of Electronics Engineering, 2 Department of Electrical Engineering, National Taiwan University, Taipei 10617, Taiwan
{arious,rerechou}@eda.ee.ntu.edu.tw, {jhjiang,ywchang}@cc.ee.ntu.edu.tw
Abstract. The continuous miniaturization of semiconductor devices poses serious threats to design robustness against process variations and environmental fluctuations. Modern circuit designs may suffer from design uncertainties that are unpredictable in the design phase or even after manufacturing. This paper presents an optimization technique that makes pipeline circuits robust against delay variations and thus maximizes timing yield. By trading larger flip-flops for smaller latches, the proposed approach can be used as a post-synthesis or post-layout optimization tool, allowing accurate timing information to be available. Experimental results show an average of 31% timing yield improvement for pipeline circuits. They suggest that our method is promising for high-speed designs and is capable of tolerating clock variations.
1 Introduction
As semiconductor fabrication technology advances into the sub-100nm feature-size regime, the sensitivity of IC designs to process variations and environmental fluctuations is ever-increasing. To maintain design robustness against these uncertainties, it becomes more and more apparent that traditional design methodologies need to be modified to consider variations at an early stage of the design flow, since not all process variations can be diminished by technology advances. In recent years, statistical approaches to circuit analysis and optimization have been revolutionizing the EDA community. They are mostly centered around delay and power, the two main concerns affected by design uncertainties. In this paper, we focus on timing. Traditional approaches to timing optimization were based on worst-case analysis. For instance, a gate delay under a certain operating condition may be set to a deterministic value fixed at the 3σ point to ensure enough margin for tolerating variations. However, worst-case analysis is too conservative, especially under increasingly stringent timing constraints. Furthermore, as designs become more sensitive to process variations, it is harder to make a design safe under worst-case variations.

N. Azemard and L. Svensson (Eds.): PATMOS 2007, LNCS 4644, pp. 148–159, 2007.
© Springer-Verlag Berlin Heidelberg 2007
Fig. 1. A motivating example for timing yield improvement by replacing DFFs with latches
Due to the inadequacy of traditional worst-case analysis, the need for statistical analysis emerges and has attracted intensive research efforts. Statistical optimization is the next step as statistical analysis matures. Based on statistical timing analysis, most existing statistical optimization approaches have focused on gate sizing, e.g., [4,5,6,11], and clock skew scheduling, e.g., [1,7,10,14]. Instead, we propose a new statistical optimization methodology, which is orthogonal and complementary to gate sizing and can possibly be combined with clock scheduling for further improvement. We take advantage of the transparency property of level-sensitive latches to tolerate delay uncertainties. There were prior efforts on the tradeoff between flip-flops and latches in other optimization contexts; for instance, flip-flops may be replaced with latches to optimize storage [16] or power [8]. However, to the best of our knowledge, no prior work addressed timing-yield optimization in the statistical domain. Consider Figure 1 for a motivating example. In the circuit, assume the delays (in nanoseconds) of an and gate, a not gate, and a wire are in the normal distributions N(5, 1), N(3, 1), and N(0, 0), respectively. (That is, we neglect the wire delay and assume that the and- and not-gate delays have mean values 5 and 3, respectively, and the same variance 1.) Suppose the clock period is 8ns. By Monte Carlo simulation, the timing yield of the circuit with all positive-edge-triggered D-type flip-flop (D-FF) registers is 33.19%; after replacing r2 with an active-high latch, the yield increases to 93.02%. A nearly 60% improvement is achieved by replacing a single D-FF with a latch. Note that in this replacement the number of pipeline stages remains unchanged. Given a design with an edge-triggered D-FF implementation of its state-holding elements (i.e.,
registers), we substitute level-sensitive latches for D-FFs such that the timing yield is maximally improved. In addition, this substitution also enhances the tolerance to clock-skew uncertainty, as is known in the timing community. Based on dynamic programming, we devise an optimal algorithm for pipelined circuits and generalize it to arbitrary sequential circuits. The proposed method can be used for pre-layout optimization under a statistical model of design uncertainties. Moreover, because latches are smaller than D-FFs, the substitution is possible without affecting nearby circuit structures and thus can be performed even after physical design; thereby, accurate timing information may be used. In contrast, yield improvement by gate sizing may invalidate
prior physical design when devices are sized up, and thus may suffer from the design-closure problem. Why is latch substitution challenging? Firstly, statistical timing analysis for latch-based designs is itself tricky compared with that for combinational designs and D-FF based sequential designs [3]. Secondly, aside from the timing analysis issue, there is an exponential number of register configurations to be explored during optimization. Essentially, each register can be of type D (standing for a D-FF), H (an active-high latch), or L (an active-low latch). Thus, for a design with n registers, there are 3^n possible configurations, each requiring the above analysis to determine its timing yield. Despite these challenges, there exist effective approaches to the latch substitution problem. We organize our exposition as follows. Section 2 gives preliminaries on our models and the underlying timing analysis. Section 3 analyzes the effect of substituting latches for D-FFs and formalizes our optimization objectives. Section 4 presents our algorithms, which are evaluated with experimental results in Section 5. Finally, concluding remarks are given in Section 6.
2
Preliminaries
2.1
Statistical Timing Models and Analysis
To simplify our exposition, we shall assume that gates are the main delay sources; however, wire delays can be taken into account straightforwardly as well. Using the model of [15], global and local variations, as well as correlations, can be handled. By statistical static timing analysis (SSTA), the input-to-output delay distributions of a combinational block in a sequential circuit can be obtained. Thus we may compute the longest combinational-path delay distribution Δ(ri, rj) (resp. the shortest combinational-path delay distribution δ(ri, rj)) from register ri to register rj by Gaussian-approximating the max [2] (resp. min) and sum operations over Gaussian random variables.¹ While δ(ri, rj) is immaterial in combinational timing analysis, it is crucial in analyzing sequential circuits involving latches. Note that Δ(ri, rj) (and similarly δ(ri, rj)) is not a distribution for a single fixed path; rather, it may probabilistically correspond to different paths.
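The max and sum operations over Gaussians mentioned above can be sketched with Clark's moment-matching formulas [2]. The version below is a minimal, illustrative one (the function names are ours; the two-path composition at the end simply reuses the and/not delay figures N(5, 1) and N(3, 1) of the motivating example):

```python
import math

def phi(x):      # standard normal PDF
    return math.exp(-0.5 * x * x) / math.sqrt(2.0 * math.pi)

def Phi(x):      # standard normal CDF
    return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))

def gauss_sum(a, b):
    """Sum of two independent Gaussians (mu, sigma) -- exact."""
    (m1, s1), (m2, s2) = a, b
    return (m1 + m2, math.sqrt(s1 * s1 + s2 * s2))

def gauss_max(a, b, rho=0.0):
    """Clark's moment-matching Gaussian approximation of max(X, Y)
    for jointly Gaussian X, Y with correlation rho; returns (mu, sigma)."""
    (m1, s1), (m2, s2) = a, b
    theta = math.sqrt(s1 * s1 + s2 * s2 - 2.0 * rho * s1 * s2)
    if theta == 0.0:                      # perfectly aligned variables
        return (max(m1, m2), s1)
    alpha = (m1 - m2) / theta
    mean = m1 * Phi(alpha) + m2 * Phi(-alpha) + theta * phi(alpha)
    second = ((m1 * m1 + s1 * s1) * Phi(alpha)
              + (m2 * m2 + s2 * s2) * Phi(-alpha)
              + (m1 + m2) * theta * phi(alpha))
    return (mean, math.sqrt(max(second - mean * mean, 0.0)))

# Longest delay of a block with two parallel two-gate paths
# (and-gate delays N(5, 1), not-gate delays N(3, 1); paths are illustrative):
p1 = gauss_sum((5.0, 1.0), (3.0, 1.0))    # and -> not
p2 = gauss_sum((5.0, 1.0), (5.0, 1.0))    # and -> and
print(gauss_max(p1, p2))                  # approximate block delay (mu, sigma)
```

For correlated delays the same formulas apply with the correlation ρ entering through θ; production SSTA tools additionally track each delay's dependence on the underlying variation sources, which this sketch omits.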
2.2 Timing Yield of Sequential Circuits
Let T = TH + TL be the clock period, with high interval TH and low interval TL. Given a design with some target operation speed, its timing yield is the probability that no violation occurs with respect to the timing constraints; see, e.g.,

¹ For circuits with pure D-FF registers, analyzing register-to-register delays may seem far from necessary: computing the longest delay of every combinational block is enough. However, for circuits containing latches, computing register-to-register delays is necessary, because the transparency of latches makes combinational blocks not well separable for timing analysis.
Fig. 2. A single-path pipelined circuit and timing diagrams. (a) type(r0) = type(r1) = type(r2) = D; (b) type(r1) = H and type(r0) = type(r2) = D; (c) type(r1) = L and type(r0) = type(r2) = D.
[3]. In the simplest case, when a circuit is implemented with D-FFs for all of its registers, its timing yield is the probability

Pr[ ⋀(ri,rj) (Δ(ri, rj) ≤ T) ],    (1)

where the conjunction ranges over every register pair (ri, rj) with a combinational path from ri to rj. For example, in Figure 2 (a), where registers r0, r1, r2 are of type D, the yield is

Pr[(Δ(r0, r1) ≤ T) ∧ (Δ(r1, r2) ≤ T)].    (2)

3 Timing Yield and Register Configuration

3.1 Timing Yield Changed by Latch Replacement
We study the effects of substituting latches for D-FFs. To begin with, consider the single-path pipelined circuit of Figure 2. Intuitively, an active-high latch can tolerate a longer delay of its fan-in combinational block than a D-FF can. If the type of r1 is changed to H, as shown in Figure 2 (b), the longest delay of combinational
block C1 can exceed T. For a circuit to operate without any timing violation, essentially four cases need to be analyzed depending on Δ(r0, r1):

Case 1, 0 ≤ Δ(r0, r1) < TH: The signal of C1 arrives at r1 within the active interval and can pass directly to C2; so T < Δ(r0, r1) + Δ(r1, r2) ≤ 2T must hold. In addition, T < δ(r0, r1) + δ(r1, r2) ≤ 2T must be satisfied for r2 to latch the right value.

Case 2, TH ≤ Δ(r0, r1) < T: The signal of C1 arrives at r1 before r1 is turned on, so it must wait until r1 becomes active again at T. C2 must satisfy Δ(r1, r2) ≤ T. In addition, C1 must satisfy δ(r0, r1) > TH; otherwise the earliest and latest signals of C1 arrive at C2 in different clock cycles.

Case 3, T ≤ Δ(r0, r1) < T + TH: The latest signal of C1 arrives within the active interval of r1 and can pass directly to C2; so T < Δ(r0, r1) + Δ(r1, r2) ≤ 2T must hold. Also, δ(r0, r1) > TH must hold for the same reason as in case 2.

Case 4, T + TH ≤ Δ(r0, r1) < 2T: The signal of C1 cannot pass through r1 within 2T, so this case is forbidden.

Although case 1 incurs no timing violation in this example, it is problematic if r1 has a designated initial value (which would be erased) or if r1 fans out to a primary output, since then the number of pipeline stages seen from the output differs. We exclude it from our yield calculation and consider only the legal cases 2 and 3. For these two cases, the delay between r0 and r1 is restricted to TH ≤ Δ(r0, r1) < T + TH and δ(r0, r1) > TH, while the delay between r1 and r2 is restricted to max{Δ(r0, r1), T} + Δ(r1, r2) ≤ 2T, where max{Δ(r0, r1), T} equals T in case 2 and Δ(r0, r1) in case 3. Thus, the yield equals

Pr[case 2] + Pr[case 3]
  = Pr[(TH ≤ Δ(r0, r1) < T) ∧ (δ(r0, r1) > TH) ∧ (Δ(r1, r2) ≤ T)]
    + Pr[(T ≤ Δ(r0, r1) < T + TH) ∧ (δ(r0, r1) > TH) ∧ (T < Δ(r0, r1) + Δ(r1, r2) ≤ 2T)]    (3)
  = Pr[(TH ≤ Δ(r0, r1) < T + TH) ∧ (δ(r0, r1) > TH) ∧ (max{Δ(r0, r1), T} + Δ(r1, r2) ≤ 2T)].    (4)
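The benefit captured by cases 2 and 3 can be checked numerically. The sketch below estimates, by Monte Carlo sampling, the yield of the single-path pipeline of Figure 2 with r1 implemented as a D-FF versus as an active-high latch. The block-delay distributions, T = 8, and TH = 4 are illustrative assumptions; since each block has a single path, δ = Δ, so the δ(r0, r1) > TH condition reduces to the lower bound on Δ(r0, r1).

```python
import random

def yields(T, TH, mu1, sd1, mu2, sd2, n=200_000, seed=1):
    """Monte Carlo timing yield of the r0 -> C1 -> r1 -> C2 -> r2 pipeline.
    C1 and C2 delays are independent Gaussians; for a single path the
    shortest and longest block delays coincide (delta == Delta)."""
    rng = random.Random(seed)
    dff = latch = 0
    for _ in range(n):
        d1 = rng.gauss(mu1, sd1)            # Delta(r0, r1)
        d2 = rng.gauss(mu2, sd2)            # Delta(r1, r2)
        # r1 = D-FF: each block must fit in one clock period
        if d1 <= T and d2 <= T:
            dff += 1
        # r1 = active-high latch: the legal cases 2 and 3 above
        if TH <= d1 < T + TH and max(d1, T) + d2 <= 2 * T:
            latch += 1
    return dff / n, latch / n

y_dff, y_latch = yields(T=8.0, TH=4.0, mu1=7.5, sd1=1.0, mu2=6.0, sd2=1.0)
print(y_dff, y_latch)   # the latch configuration tolerates C1 overshooting T
```

With these (made-up) numbers, the latch configuration recovers most of the samples in which C1 slightly exceeds the clock period, mirroring the motivating example of Section 1.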
In contrast, if the type of r1 is changed to L, as shown in Figure 2 (c), four cases similar to the above ones need to be analyzed depending on Δ(r0, r1); we omit them due to limited space. This analysis forms the basis of our yield calculation. It extends to pipeline circuits since every pair of adjacent registers can be transformed into a circuit as in Figure 2. In computing the timing yield of a pipeline circuit, the timing constraints of a combinational block depend on the types of its preceding registers, which leads to complex computations, especially for latches. Due to the transparency of latches, delay distributions need to be propagated across latches. For example, Δ(r0, r1) is needed in Equation (4) when calculating the yield between registers r1 and r2. (For D-FF based designs, there is no need to propagate distributions across register boundaries since the output of a D-FF has zero arrival time.) To resolve this complication, we shift the delay distribution of a combinational
block to make the equations for the three types of registers identical. That is, we modify the delay distribution of a register input and pass it as a slack to the fan-out blocks. Thereby we may propagate probability distributions across latches. Precisely speaking, for active-high latches, by defining Δshift(r0, r1) ≡ Δ(r0, r1) − TH, δshift(r0, r1) ≡ δ(r0, r1) − TH, Δshift(r1, r2) ≡ max{Δ(r0, r1) − T, 0} + Δ(r1, r2), and δshift(r1, r2) ≡ δ(r1, r2), Equation (4) can be rewritten as

Pr[case 2] + Pr[case 3]
  = Pr[(Δshift(r0, r1) < T) ∧ (δshift(r0, r1) > 0) ∧ (Δshift(r1, r2) < T) ∧ (δshift(r1, r2) > 0)].    (5)
For active-low latches, a similar rewriting is available, which we omit due to limited space. For D-FFs, on the other hand, no shifting is needed. With the distributions shifted as above, all longest-delay constraints are compared against T and all shortest-delay constraints against 0, as in Equation (5). Finally, for any register pair ri and rj connected by a combinational block under analysis, we perform the max operation over {Δshift(ri, rj)} and the min operation over {δshift(ri, rj)}, and obtain the probability that the combinational block has no timing violation as Pr[(max{Δshift(ri, rj)} < T) ∧ (min{δshift(ri, rj)} > 0)].

3.2 Problem Formulation
Definition 1. Let R be a nonempty set of registers of a sequential circuit. A register configuration of R is a total function ρ : R → {D, H, L}.

D-FFs are the most common implementation of the state-holding elements of sequential circuits due to their simple edge-triggered timing constraints. We assume that a given design is initially in D-FF implementation. By changing the initial register configuration, a circuit can be made less sensitive to timing variations while maintaining its behavior. Essentially, the number of pipeline stages must not change when the register configuration is modified. Therefore, no two latches of the same type can be connected by a combinational path. Furthermore, even two latches of different types cannot be connected by a combinational path, because the number of pipeline stages would decrease unless the total number of registers increased. Hence we require that the fan-in and fan-out registers of a latch be of type D. (Note that a positive-edge-triggered D-FF can be decomposed into an active-low latch followed by an active-high latch, so it is possible to maintain the pipeline stages by increasing the register count, which we disallow in this paper.) The optimization problem can be stated as follows. Yield optimization problem: Given a sequential circuit with ρ(r) = D for every register r, and the distributions of its gate and wire delays, find the register
Fig. 3. The flowchart of statistical latch replacement
configuration such that timing yield is maximally improved subject to the above replacement criterion.
4 Statistical Latch Replacement

4.1 Optimization Flow Overview
The flow of our algorithm is shown in Figure 3. First, the input circuit is abstracted and converted into a register dependency graph, with statistical timing models and analysis capturing the essential timing information. Second, all cycles of the register dependency graph are broken with respect to a chosen minimal feedback vertex set. Third, the resulting acyclic graph is levelized in topological order from inputs to outputs. Fourth, our statistical dynamic programming algorithm is run forward over the levelized acyclic graph. The optimal configuration can then be derived by tracing backward from outputs to inputs. Finally, Monte Carlo simulation can optionally be applied to justify the yield improvement.

4.2 Statistical Dynamic Programming
We abstract a given input circuit C with a register dependency graph G = (V, E), where a vertex vi ∈ V represents a register ri in C and there is a directed edge (vi, vj) ∈ E if and only if there is a combinational path from ri to rj in C. Also, the register-to-register distributions Δ(ri, rj) and δ(ri, rj) are computed according to the delay distributions of C, and are associated with the corresponding edge (vi, vj) ∈ E. If a circuit has feedback, there will be cycles in the converted graph. In order to levelize the register dependency graph, we break all cycles by finding a minimal feedback vertex set (FVS) [9]. After making the register dependency graph acyclic, we levelize it in topological order such that each vertex is labelled with the longest distance from an input vertex. Given a levelized acyclic register dependency graph, we derive a register configuration
Algorithm: StatisticalDynamicProgramming
Input: levelized register dependency graph G = (V, E) and delay distributions on E
Output: optimal register configuration for yield
begin
  set level-1 registers to D-FFs with local yield 1
  ℓ := LevelCount(G)
  for i = 2, ..., ℓ
    let Ri be the set of registers at level i
    for every register configuration α of Ri
      compute the highest local yield Yα of α subject to the configurations of Ri−1 and their local yields
      record the configuration of Ri−1 responsible for Yα
  set Rℓ to the configuration βℓ of all D-FFs
  for i := ℓ−1, ℓ−2, ..., 2
    set Ri to the configuration βi responsible for βi+1
  return the βi's
end
Fig. 4. The Statistical Dynamic Programming Algorithm
with maximal timing yield by the statistical dynamic programming algorithm outlined in Figure 4. We add artificial D-FFs at the primary inputs and outputs when converting a circuit to a register dependency graph. Hence we set level-1 and level-ℓ registers to be of type D, where ℓ is the number of levels in the levelized register dependency graph. In addition, we define the local yield of a register to be the accumulated yield computed forward from the level-1 registers, each having local yield 1. The statistical dynamic programming algorithm computes and stores the optimal configurations and the corresponding local yields in a forward direction, based on the timing analysis introduced in Section 3. Take a single-path pipelined circuit as an example. The statistical dynamic programming algorithm proceeds in two phases, as shown in Figure 5 (a) and (b). In the first phase, the three configurations {D, H, L} are considered for each register in a forward direction. Since we require that the fan-in and fan-out registers of a latch be of type D, only a subset of the configurations of two consecutive levels needs to be considered, as indicated by the arrows of Figure 5 (a). At each level, the maximal local yield is kept for each configuration in {D, H, L}. Once the final level is reached, the algorithm enters the second phase: it extracts the optimal configuration for each register backward. For a register dependency graph with large pipeline widths, the above algorithm becomes inefficient (in fact, of complexity exponential in the pipeline width) since it considers all possible configurations of the registers at each level. We alleviate this problem by greedily optimizing one register at a time without considering
Fig. 5. Statistical dynamic programming for a single-path pipelined circuit optimization. (a) Forward yield calculation. Only feasible edges are shown. (b) Backward tracing the optimal configuration.
the configurations of other registers at the same level. Thus, we may need to handle the consistency problem for conflicting register type assignments. Note that because we only consider one register at a time, the result may differ from the global optimum. It is a tradeoff between optimality and efficiency.
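For a single-path pipeline, the recursion of Figure 4 specializes to a small dynamic program over the three register types. The sketch below is a minimal illustration: `local_yield` stands in for the statistical per-stage yield evaluation of Section 3 (the numbers in `toy_yield` are made up), and the allowed-pair set enforces that every latch has D-type fan-in and fan-out registers.

```python
# Allowed types of two consecutive registers: a latch (H or L) must be
# preceded and followed by a D-FF, so no two latches are ever adjacent.
ALLOWED = {("D", "D"), ("D", "H"), ("D", "L"), ("H", "D"), ("L", "D")}

def optimize_chain(n, local_yield):
    """Maximize the product of per-stage local yields over a chain of n
    registers.  local_yield(i, prev, cur) is the probability that the
    combinational block between registers i-1 and i meets timing, given
    their types.  Registers 1 and n (the I/O boundaries) are forced to D."""
    best = {"D": 1.0}          # level-1 local yield
    back = []                  # one backpointer table per level 2..n
    for i in range(2, n + 1):
        types = ("D",) if i == n else ("D", "H", "L")
        nxt, ptr = {}, {}
        for cur in types:
            cands = [(best[p] * local_yield(i, p, cur), p)
                     for p in best if (p, cur) in ALLOWED]
            if cands:
                nxt[cur], ptr[cur] = max(cands)
        best = nxt
        back.append(ptr)
    # Second phase: trace the optimal configuration backward (Fig. 5 (b)).
    cfg = [max(best, key=best.get)]
    for ptr in reversed(back):
        cfg.append(ptr[cfg[-1]])
    cfg.reverse()
    return cfg, max(best.values())

# Hypothetical per-stage yields: a latch relaxes the constraint on its
# fan-in block, so latch types get higher (made-up) local yields.
def toy_yield(i, prev, cur):
    return {"D": 0.80, "H": 0.95, "L": 0.85}[cur]

cfg, y = optimize_chain(4, toy_yield)
print(cfg, y)      # a single latch is inserted; no two latches are adjacent
```

The greedy variant discussed above replaces the per-level enumeration with a one-register-at-a-time decision, trading the guarantee of optimality for efficiency on wide pipelines.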
5 Experimental Results
The proposed algorithm is implemented in C++. The experiments were conducted on a Linux machine with a Pentium IV 3.2GHz CPU and 3GB memory. Two sets of circuits are used, pipeline circuits and general sequential circuits, all from the ISCAS benchmark suites. The pipeline circuits were generated from combinational circuits by adding 4-stage pipelines. For a given circuit, technology mapping was conducted under the SIS [13] environment to obtain delay information, and then minimum-period retiming was performed (thus registers were relocated evenly over the circuit). In addition, the circuits were synthesized to balance long and short combinational paths. (Note that, in high-speed and/or low-power designs, long and short paths tend to be balanced; for instance, performance-driven logic optimization and power optimization with dual threshold-voltage assignments tend to balance long and short delays. Thus, design trends meet our timing requirements.) All delay variations are normally distributed with 10–20% deviation. Table 1 shows the results for 10% and 20% delay deviations. Columns 1, 2, and 3 show the circuits, the numbers of pipeline stages, and the numbers of registers, respectively. The clock periods are shown in the fourth column, where the clock period of a circuit is determined by requiring the timing yield of the circuit with all-D-FF registers to fall between 60–65%. The numbers of D-FFs replaced by level-sensitive latches are shown in the fifth column. Columns 6, 7, and 8 list
Table 1. ISCAS benchmark circuits with 10% and 20% delay deviations (each paired cell is 10% / 20%)

Pipeline circuits with clock minimization (ISCAS85):

Circuit  # of    Total  Clock period    Replaced  Original      Final           Impv. (%)      CPU time (s)
         stages  reg.   (ns)            reg.      yield (%)     yield (%)
c432     5       214    8.13 / 8.58     18 / 28   64.4 / 63.2   100.0 / 97.2    35.6 / 34.0    0.21 / 0.20
c499     5       186    8.65 / 9.37     13 / 8    65.0 / 62.2   100.0 / 100.0   35.0 / 37.8    0.11 / 0.11
c880     5       242    7.36 / 7.74     14 / 16   61.5 / 62.7   67.0 / 98.7     5.5 / 36.0     0.14 / 0.13
c1355    5       218    9.42 / 10.18    9 / 10    60.3 / 62.3   100.0 / 99.8    39.7 / 37.5    0.16 / 0.16
c1908    5       240    13.44 / 14.26   19 / 19   62.5 / 64.0   100.0 / 98.1    37.5 / 34.1    0.19 / 0.19
c3540    5       278    11.14 / 11.96   78 / 62   64.0 / 62.3   94.1 / 93.9     30.1 / 31.6    0.42 / 0.40
c5315    5       867    11.88 / 12.60   0 / 0     60.1 / 61.7   60.1 / 61.7     0.0 / 0.0      0.61 / 0.63
c7552    5       879    11.26 / 12.12   56 / 69   63.7 / 63.5   99.6 / 99.9     35.9 / 36.4    0.71 / 0.68
Average                                                                         27.41 / 30.93  0.284 / 0.313

Sequential circuits (ISCAS89):

s1196    -       18     50.24 / 53.54   3 / 4     62.9 / 59.7   67.2 / 62.4     4.3 / 2.7      0.04 / 0.05
s5378    -       179    47.79 / 52.98   10 / 10   65.2 / 61.1   71.9 / 65.2     6.7 / 4.1      0.44 / 0.45
s9234    -       211    108.57 / 118.86 8 / 8     54.7 / 57.8   56.0 / 59.3     1.3 / 1.5      0.90 / 0.89
Average                                                                         4.10 / 2.77    0.460 / 0.463
Fig. 6. Experimental results. Yield vs. clock period for circuit c1355.
the original, final, and improved timing yields, respectively. The yields are justified with Monte Carlo simulation. The CPU times reported in the ninth column do not include the Monte Carlo simulation. Each of Columns 4–9 is divided further into two sub-columns for 10% and 20% delay deviations. As can be seen, the improvements are consistently above 30% for all of the pipeline circuits except c880 and c5315. For c880 in the 10%-deviation case, some inaccuracy in the timing analysis causes inadequate latch replacements and degrades the yield improvement. For c5315, a similar reason causes inadequate latch replacements, which are later cancelled by the Monte Carlo justification. This problem can be overcome by using more accurate SSTA tools. Nevertheless, the average improvements for the pipeline circuits are 27% and 31% for deviations of 10% and 20%, respectively, suggesting that our approach to yield improvement is robust against changes in delay deviation. It is interesting to note that, as shown in Column 5, the numbers of replaced registers for the 20%-deviation cases are in general larger than those for the 10%-deviation ones. It suggests the importance of latch replacement for
increased delay deviations. On the other hand, for cyclic sequential circuits such as s1196, s5378, and s9234, our approach yields only mild improvements. This is understandable because the register dependency graphs of these circuits are close to complete graphs, which makes latch replacement almost impossible. To see the relation between the clock period and the yield improvement, we conducted another experiment on circuit c1355; the result is plotted in Figure 6. As can be seen, when the clock period is reduced, the yield of the original design with all D-FFs drops very quickly from 100% to 0%, whereas that of the optimized version remains high and stable for another unit of delay. (The glitch in the figure is due to different optimal register configurations for different clock periods.) The result suggests that our latch replacement algorithm is robust against clock variation and suitable for high-speed designs. Hence our approach is promising for yield improvement in the current trend of high-speed designs.
6 Conclusions and Future Work
Based on statistical timing analysis, we have proposed an algorithm to optimize the timing yield of a sequential circuit. Experimental results show that, by substituting latches for D-FFs, the timing yield can be improved by about 31% on average for pipelined circuits. In addition, the results suggest that latch replacement tends to tolerate clock variations. Complementary to other design-for-yield methodologies such as gate sizing and clock skew scheduling, our technique may be combined with them for further improvement. Since most circuits use D-FFs for their register implementation, our approach may be widely applicable to standard designs. Since replacing D-FFs with latches incurs no area penalty, the proposed algorithm can be used not only for pre-layout but also for post-layout optimization, where accurate timing information is available. For future work, since our approach yields only mild timing-yield improvements for cyclic sequential circuits, some work needs to be done to overcome this limitation. We may also consider multiple-phase clocking schemes, which may lead to further yield improvements. Also, setup-time and hold-time constraints may be added to our framework.
Acknowledgments

This work was supported in part by NSC grants 94-2218-E-002-083, 95-2221-E-002-432, and 95-2218-E-002-064-MY3.
References

1. Albrecht, C., Korte, B., Schietke, J., Vygen, J.: Cycle time and slack optimization for VLSI-chips. In: Proc. ICCAD, pp. 232–238 (1999)
2. Clark, C.E.: The greatest of a finite set of random variables. Operations Research 9(2), 145–162 (1961)
3. Chao, C.-T., Wang, L.-C., Cheng, K.-T., Kundu, S.: Static statistical timing analysis for latch-based pipeline designs. In: Proc. ICCAD (2004)
4. Choi, S.-H., Paul, B., Roy, K.: Novel sizing algorithm for yield improvement under process variation in nanometer technology. In: Proc. DAC (2004)
5. Chopra, K., Shah, S., Srivastava, A., Blaauw, D., Sylvester, D.: Parametric yield maximization using gate sizing based on efficient statistical power and delay gradient computation. In: Proc. ICCAD (2005)
6. Guthaus, M., Venkateswaran, N., Visweswariah, C., Zolotov, V.: Gate sizing using incremental parameterized statistical timing analysis. In: Proc. ICCAD (2005)
7. Hurst, A., Brayton, R.: Computing clock skew schedules under normal process variation. In: Proc. IWLS (2005)
8. Lalgudi, K., Papaefthymiou, M.: Fixed-phase retiming for low power design. In: Proc. ISLPED (1996)
9. Lin, H.-M., Jou, J.-Y.: On computing the minimum feedback vertex set of a directed graph by contraction operations. IEEE Trans. on CAD 19(3) (2000)
10. Neves, J., Friedman, E.: Optimal clock skew scheduling tolerant to process variations. In: Proc. DAC, pp. 623–628 (1996)
11. Raj, S., Vrudhula, S., Wang, J.: A methodology to improve timing yield in the presence of process variations. In: Proc. DAC, pp. 448–453 (2004)
12. Sakallah, K., Mudge, T., Olukotun, K.: checkTc and minTc: Timing verification and optimal clocking of synchronous digital circuits. In: Proc. ICCAD, pp. 552–555 (1990)
13. Sentovich, E.M., et al.: SIS: a system for sequential circuit synthesis. Technical Report UCB/ERL M92/41, UC Berkeley (1992)
14. Tsai, J.-L., Baik, D., Chen, C.-P., Saluja, K.: A yield improvement methodology using pre- and post-silicon statistical clock scheduling. In: Proc. ICCAD, pp. 611–618 (2004)
15. Visweswariah, C., Ravindran, K., Kalafala, K., Walker, S., Narayan, S.: First-order incremental block-based statistical timing analysis. In: Proc. DAC, pp. 331–336 (2004)
16. Wu, T.-Y., Lin, Y.-L.: Storage optimization by replacing some flip-flops with latches. In: Proc. DAC (1996)
A Novel Gate-Level NBTI Delay Degradation Model with Stacking Effect

Hong Luo¹, Yu Wang¹, Ku He¹, Rong Luo¹, Huazhong Yang¹, and Yuan Xie²

¹ Circuits and Systems Division, Dept. of EE, Tsinghua Univ., Beijing, 100084, P.R. China
[email protected]
² CSE Department, Pennsylvania State University, University Park, PA, USA
[email protected]
Abstract. In this paper, we propose a gate-level NBTI delay degradation model in which the stress-voltage variability due to the stacking effect of PMOS transistors is considered for the first time. Experimental results show that our gate-level NBTI delay degradation model yields a tightened upper bound for circuit performance analysis; the traditional circuit degradation analysis leads to a 59.3% overestimation on average. The pin-reordering technique can mitigate 6.4% of the performance degradation on average in our benchmark circuits.
1 Introduction

As technology scales, the accelerated aging of nanoscale devices [1] poses a key challenge: designers must find countermeasures that effectively mitigate the degradation and prolong a system's lifetime. Negative bias temperature instability (NBTI), which has a deleterious effect on the threshold voltage and the drive current of PMOS transistors, is emerging as one of the major reliability concerns [2]. Due to the NBTI effect, the threshold voltage of a PMOS transistor is shifted, the carrier mobility and drain current are reduced [3], and performance degradation occurs [4,5,6]. NBTI phenomena can be classified into static NBTI and dynamic NBTI. Static NBTI occurs under the DC stress condition; the detailed physical mechanism was described in [7]. The impact of electrical and environmental parameters (such as the electric field across the oxide and the temperature) on interface trap generation was studied in [8,9]. Dynamic NBTI, under the AC stress condition, leads to a less severe parameter shift over long times because of the recovery phenomenon [4,9,10,11]. Many analytical NBTI models have been proposed recently. The impact of NBTI on the worst-case performance degradation of digital circuits was analyzed in [12]. An analytical model for multi-cycle dynamic NBTI was proposed in [13], where a recursion process was used to evaluate the NBTI effect. A predictive NBTI model was proposed in [14,15], where the effect of various process and design parameters was described. An accurate and fast closed-form analytical model was proposed in [16], where temperature-aware NBTI modeling was also considered.
This work was supported by grants from the 863 program of China (No. 2006AA01Z224) and NSFC (No. 60506010, No. 90207001). Yuan Xie's work was supported in part by NSF CAREER 0643902 and the MARCO/DARPA GSRC.
N. Azemard and L. Svensson (Eds.): PATMOS 2007, LNCS 4644, pp. 160–170, 2007. c Springer-Verlag Berlin Heidelberg 2007
Most of these previously proposed NBTI models may suffer from inaccuracy or high computational complexity, and gate-level NBTI modeling is still in its infancy. In this paper, based on an accurate and fast closed-form analytical model [16], we propose a gate-level NBTI delay degradation model considering the stacking effect. Our contribution in this paper distinguishes itself in the following aspects:
• A single-transistor analytical NBTI model is extended to a novel gate-level model, which for the first time considers the variability of the stress voltage due to the stacking effect;
• A novel, accurate gate-level delay model for Vth degradation is proposed for the first time. A tightened upper bound for circuit performance degradation can be achieved with our new gate-level delay model.
The rest of the paper is organized as follows. In Section 2, we first review previous NBTI models; then our model considering the variability of the stress voltage due to the stacking effect is described. In Section 3, the new gate-level delay model is presented based on traditional delay analysis. The simulation results for the ALU benchmark circuits are shown and analyzed in Section 4. Finally, Section 5 concludes the paper. Note that the simulation results in the following sections are based on a standard cell library constructed using the PTM 90nm bulk CMOS model [17]. Vdd = 1.2 V and |Vth| = 200 mV are set for all transistors in the circuits. The operation time is set to 3 × 10^8 s (about 10 years).
2 NBTI Model

2.1 Previous NBTI Models

A threshold voltage degradation ΔV_th is caused by the interface trap generation due to the PMOS NBTI effect, which is described by [18]

    ΔV_th = −(1 + m) · q_e · N_it(t) / C_ox    (1)

where m represents the equivalent V_th shift due to mobility degradation, q_e is the electronic charge, C_ox is the gate oxide capacitance, and N_it(t) is the interface trap density generated by the PMOS NBTI effect. The interface trap generation is often described by the reaction-diffusion (R-D) model [19]. An analytical solution exists under the DC stress condition, which is regarded as the static NBTI model,

    N_it(t) = 1.16 · sqrt(k_f N_0 / k_r) · (D_H t)^(1/4) = B t^(1/4)    (2)

where N_0 is the concentration of initial interface defects; k_f is the dissociation rate, which depends on the electric field across the gate oxide; k_r is the constant self-annealing rate; and D_H is the corresponding diffusion coefficient [19].
H. Luo et al.
Fig. 1. Comparison between Kumar's model and our model: interface trap density ΔN_it (×10^12 cm^-2) vs. duty cycle (simulation parameters: t = 3 × 10^8 s, k_f = 0.01 s^-1, k_r = 1.0 × 10^-18 cm^3/s, N_0 = 1.24 × 10^14 cm^-2, D_H = 1.0 × 10^-17 cm^2/s; maximum deviation: 9.08 × 10^12 cm^-2 from our model vs. 8.89 × 10^12 cm^-2 from Kumar's model)
In the multi-cycle dynamic NBTI model proposed by Kumar et al. [13], the interface trap generation can be evaluated by a recursion formula,

    N_it[(n + p_s)T] = N_it^0 · [ p_s + ( N_it(nT) / N_it^0 )^4 ]^(1/4)    (3)

where N_it^0 = B T^(1/4) and β = (1 − p_s)/2; T and p_s are the period and the duty cycle of the stress waveform, respectively. Eq. (3) actually describes a tight upper bound over all the relaxation phases.

A closed-form equation was proposed in [16] using a fitting approach. The model in [16] describes the dynamic NBTI effect with the same accuracy but faster. In this paper, we use the same method as [16] to construct our dynamic NBTI model, with Eq. (3) of Kumar's model [13] as the fitting target function. Hence, the interface trap generation can be described as

    N_it(t) = 1.16 · ξ(p_s) · sqrt(k_f N_0 / k_r) · (D_H t)^(1/4)    (4)

where ξ(p_s) = p_s^(0.27 p_s + 0.28). The comparison between Kumar's model and ours is shown in Fig. 1; the maximum error of N_it(t) is 2.14% (9.08 × 10^12 cm^-2 from our model vs. 8.89 × 10^12 cm^-2 from Kumar's model in Fig. 1).

2.2 Our Novel Gate-Level NBTI Model with Stacking Effect

Traditionally, the V_th degradation of a logic gate due to NBTI is estimated as follows: the PMOS transistors and their corresponding inputs are first assumed to be mutually independent; then the V_th degradation of each PMOS transistor is analyzed independently based on Eq. (4); finally, the maximum value is chosen as the V_th degradation of the gate and can be used to calculate the delay degradation of the gate.
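As a rough numerical sanity check of Eq. (4) and of this traditional per-transistor estimate, the closed-form model can be evaluated directly. A minimal sketch in Python, using the simulation parameters annotated in Fig. 1 as defaults (the list of per-transistor duty cycles is a made-up example):

```python
import math

def xi(ps):
    """Duty-cycle factor of the fitted closed-form model, Eq. (4)."""
    return ps ** (0.27 * ps + 0.28)

def n_it(t, ps, kf=0.01, kr=1.0e-18, n0=1.24e14, dh=1.0e-17):
    """Interface trap density N_it(t) under dynamic NBTI, Eq. (4).
    Default parameters are those annotated in Fig. 1."""
    return 1.16 * xi(ps) * math.sqrt(kf * n0 / kr) * (dh * t) ** 0.25

# Traditional per-transistor estimate: evaluate each PMOS transistor
# independently and keep the maximum (the duty cycles are made up).
duty_cycles = [0.9, 0.5, 0.3]
worst = max(n_it(3e8, ps) for ps in duty_cycles)
print(f"worst-case N_it after ~10yr: {worst:.2e} cm^-2")
```

At a duty cycle of 0.9 this evaluates to roughly 9 × 10^12 cm^-2, consistent with the high-duty-cycle values plotted in Fig. 1.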
Fig. 2. Stress waveform applied to the gate of the PMOS transistor (R is “relaxation”, and FS, PS are “full stress”, “partial stress”, respectively)
Obviously, the above method is not accurate for gate-level NBTI analysis, because the stacking effect is not considered. In [14], V_th variability due to the body effect in the transistor stack was considered, but only the static NBTI effect was analyzed. In this section, a novel gate-level NBTI model with stacking effect is proposed, based on the transistor-level NBTI model given in Section 2.1.

In this paper, the stress voltage variability due to the stacking effect in a logic gate is considered. Because of the resistance of the transistors, the internal nodes are biased at an intermediate voltage, which leads to different V_gs for the PMOS transistors. Therefore, when a PMOS transistor is under stress, it is not always biased at −V_dd. We denote the stress condition at −V_dd as "full stress" (FS), and stress at a lower voltage as "partial stress" (PS).

Before the new V_th degradation model with stacking effect is proposed, the interface trap generation due to the dynamic PMOS NBTI effect with mixed "full stress" and "partial stress" phases should be analyzed. A random aperiodic signal can be converted to a deterministic periodic waveform based on its signal probability (SP); with the same SP, the NBTI effect will be the same [6]. Hence, we use the waveforms shown in Fig. 2 as the input of the PMOS transistor. In the first waveform, the "full stress" phase precedes the "partial stress" phase in each cycle; the second waveform shows the reversed order. From numerical simulation based on the reaction-diffusion model [19], we find that the order of these phases has negligible impact on the final interface trap generation. Fig. 3 shows the comparison between the stress waveforms of Fig. 2; the error is 0.18% (5.50 × 10^12 cm^-2 vs. 5.51 × 10^12 cm^-2). In the following part of this paper, we assume that "full stress" always precedes "partial stress" in a cycle.

Fig. 4 shows the numerically simulated interface trap generation due to dynamic NBTI under different time ratios of the "full stress" phase to the "partial stress" phase. We find that the mixed effect of these two stress phases can be derived by weighted averaging of the "full stress" and "partial stress" effects, which is described as

    N_it,mixed = p_FS / (p_FS + p_PS) · N_it,FS + p_PS / (p_FS + p_PS) · N_it,PS    (5)

where N_it,FS is the interface trap generation if all stress phases are "full stress", and N_it,PS is the interface trap generation if all stress phases are "partial stress". The parameters p_FS and p_PS are the signal probabilities of "full stress" and "partial stress", respectively. By
Fig. 3. The impact of the stress phases' order on interface trap generation (FS: k_f = 0.01 s^-1, PS: k_f = 0.004 s^-1; 5.50 × 10^12 cm^-2 with FS ahead of PS vs. 5.51 × 10^12 cm^-2 with PS ahead of FS)
Fig. 4. The analysis of the mixed NBTI effect of "full stress" and "partial stress" (curves for 60% FS; 40% FS, 20% PS; 30% FS, 30% PS; 20% FS, 40% PS; and 60% PS; the pure cases reach 6.57 × 10^12 cm^-2 and 4.18 × 10^12 cm^-2, the 30%/30% mix 5.50 × 10^12 cm^-2)
calculating, we find the maximum error occurs at "30% FS, 30% PS": the simulated trap generation is 5.50 × 10^12 cm^-2 as shown in Fig. 4, while Eq. (5) estimates 5.38 × 10^12 cm^-2; the maximum error is therefore 2.18%.

However, in a transistor stack the PMOS transistors can be biased at various voltages, so there exists more than one "partial stress" condition, and the rule described above must be extended to more than two different stress conditions. First, we number these stress conditions S_0, S_1, S_2, ..., where S_0 is always the "full stress" condition. The signal probabilities of these stress conditions are p_0, p_1, p_2, ..., respectively, and the signal probability of the relaxation condition is denoted r; the duty cycle p_s of all the stress conditions together is then

    p_s = Σ_i p_i = 1 − r    (6)

where the number of indices i is related to the number of PMOS transistors in the stack.
As the threshold voltage degradation is proportional to the interface trap generation, the final V_th degradation due to the PMOS NBTI effect with more than one stress condition is modeled according to Eq. (5) as

    ΔV_th = Σ_i ΔV_th,i · (p_i / p_s)    (7)

where ΔV_th,i is the corresponding threshold voltage degradation if all stress phases are S_i. According to Eqs. (1) and (4), ΔV_th,i is expressed as

    ΔV_th,i = η_i · p_s^(0.27 p_s + 0.28) · t^(1/4)    (8)

and the parameter η_i is determined by the predictive model proposed in [14],

    η_i = A · T_ox · C_ox (V_gs,i − V_th) · exp(E_ox,i / E_0) · exp(−E_a / (k_b T))    (9)

where V_gs,i is the stress voltage corresponding to the different stress phases due to the stacking effect first described in this paper, and the other parameters are the same as in [14]. If only one "full stress" condition is considered, that is, p_s = p_0, Eq. (7) simplifies to

    ΔV_th = ΔV_th,0 = η_0 · p_s^(0.27 p_s + 0.28) · t^(1/4)    (10)

which is consistent with the NBTI model in Section 2.1.
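A minimal sketch of Eqs. (6)-(8) in Python (the η values used below are made-up placeholders; in practice each η_i would come from Eq. (9)):

```python
def delta_vth_stack(stress_probs, etas, t):
    """Gate-level V_th degradation with stacking effect, Eqs. (6)-(8).

    stress_probs[i]: signal probability p_i of stress condition S_i
                     (S_0 is full stress); etas[i]: eta_i from Eq. (9).
    """
    ps = sum(stress_probs)                 # total stress duty cycle, Eq. (6)
    xi = ps ** (0.27 * ps + 0.28)          # duty-cycle factor, as in Eq. (4)
    # Eq. (8) per condition, weighted per Eq. (7)
    return sum(p / ps * eta * xi * t ** 0.25
               for p, eta in zip(stress_probs, etas))

# With a single full-stress condition this reduces to Eq. (10)
dvth = delta_vth_stack([0.5], [2.0e-4], 3e8)   # hypothetical eta_0
```

With a single stress condition the weighted sum collapses to ΔV_th,0, matching Eq. (10); with several conditions, transistors deeper in the stack (lower stress voltage, hence smaller η_i) contribute less to the gate's degradation.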
3 Gate-Level Delay Degradation Analysis

3.1 Traditional Gate Delay Model

The propagation delay of a gate can be approximately expressed as [18]

    t_pd = C_L V_dd / I_d = C_L V_dd L_eff / (μ C_ox W_eff (V_gs − V_th)^α)    (11)

where α is the velocity saturation index and C_L contains the parasitic capacitance. The shift in the transistor threshold voltage ΔV_th can be derived using Eq. (10). Hence, with a Taylor series expansion, the delay degradation Δt_pd of the gate is derived as

    Δt_pd = α ΔV_th / (V_gs − V_th) · t_pd0    (12)

where t_pd0 is the original delay of the gate without any V_th degradation, which can be extracted from third-party timing analysis tools.

3.2 Our Novel Gate-Level Delay Model

The proposed NBTI model with stacking effect described in Section 2.2 leads to a different V_th degradation for each PMOS transistor in a logic gate, but the gate delay model
Fig. 5. The schematic of the NOR4 gate: PMOS stack M0-M3 (inputs D, C, B, A, with M0 closest to V_dd), internal nodes u, v, w, output Y, parallel NMOS transistors M4-M7, and external load capacitance C_L
described in Section 3.1 is incapable of handling this situation. Therefore, a novel gate-level delay model is proposed in this paper. A NOR4 gate is used to illustrate our derivation; its schematic is shown in Fig. 5, where C_L is the external load capacitance.

If the V_th degradations of the PMOS transistors are small, the gate delay can be considered linear in ΔV_th and C_L:

    t_pd = t_pd0 + Δt_pd = t_pd0 + Σ_i (α_i · C_L + β_i) ΔV_th,Mi ,   i = 0, 1, 2, 3    (13)

where the parameters α_i and β_i describe the effect of charging the external load capacitance C_L and the internal parasitic capacitance, respectively; they depend only on the gate type. In order to directly reuse existing results extracted from timing analysis tools, the term C_L in Eq. (13) should be eliminated. From Eq. (11) we can derive another linear relation: the original propagation delay is linear in the external load capacitance C_L,

    t_pd0 = P · C_L + Q    (14)

where P is the load delay factor and Q describes the intrinsic delay. From Eqs. (13) and (14), t_pd can be derived as

    t_pd = t_pd0 + Σ_i [ α_i · (t_pd0 − Q) / P + β_i ] ΔV_th,Mi
         = t_pd0 + t_pd0 · Σ_i (g_i ΔV_th,Mi) + Σ_i (h_i ΔV_th,Mi)    (15)

    g_i = α_i / P ,   h_i = β_i − α_i Q / P    (16)

where g_i and h_i depend only on the gate type. In standard-cell design, the parameters g_i and h_i of all gates in the cell library can be calculated in advance, and a lookup table created.
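Evaluating Eq. (15) from such a lookup table is then a few multiply-accumulates per gate. A sketch in Python (the (g_i, h_i) pairs below are made-up placeholders, not characterized library data):

```python
# Hypothetical per-pin coefficients (g_i in 1/V, h_i in s/V), one pair per
# PMOS transistor M0..M3; real values would be characterized per cell library.
GATE_LUT = {
    "NOR4": [(0.30, 2.0e-11), (0.25, 1.5e-11), (0.20, 1.0e-11), (0.15, 0.5e-11)],
}

def degraded_delay(gate_type, tpd0, dvth):
    """Eq. (15): tpd = tpd0 + tpd0*sum(g_i dVth_i) + sum(h_i dVth_i)."""
    lin = sum(g * dv for (g, _), dv in zip(GATE_LUT[gate_type], dvth))
    const = sum(h * dv for (_, h), dv in zip(GATE_LUT[gate_type], dvth))
    return tpd0 + tpd0 * lin + const

# Per-transistor V_th shifts of Table 1 (M0..M3, in volts)
tpd = degraded_delay("NOR4", 168.1e-12, [0.0257, 0.0206, 0.0145, 0.0055])
```

Because each transistor's own ΔV_th,Mi enters the sum, a stack in which only the bottom transistor is degraded contributes far less delay than the worst-case assumption ΔV_th = ΔV_th,M0 used by the traditional model.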
Table 1. Threshold voltage degradation in the NOR4 gate

    Transistor   M0       M1       M2       M3
    ΔV_th        25.7mV   20.6mV   14.5mV   5.5mV

Table 2. Comparison between the traditional gate delay analysis and our gate-level delay model

    Gate   Original delay   Hspice   ------ Our model ------   -- Traditional model --
    type   t_pd0            Δt_pd    Δt_pd   Estimation error   Δt_pd   Overestimation
    NOR4   168.1ps          6.9ps    7.0ps   1.4%               10.5ps  50.0%
    NOR3   142.5ps          5.7ps    5.8ps   1.8%                8.4ps  44.8%
    NOR2   111.2ps          4.7ps    4.7ps   0.0%                6.1ps  29.8%
    INV     83.5ps          3.6ps    3.6ps   0.0%                3.6ps   0.0%
We now demonstrate the impact of the stress voltage variability due to the stacking effect on NBTI analysis. The signal probabilities of all input patterns are set equal. The V_th degradations of all PMOS transistors in the NOR4 gate are shown in Table 1. Transistor M0, which is closest to the power supply as shown in Fig. 5, has the largest threshold voltage degradation, while M3 has the smallest. Therefore, gate-level delay analysis is necessary for an accurate estimation of the NBTI effect.

The comparison between the traditional gate delay analysis and our novel gate-level delay model is shown in Table 2. The third column of Table 2 gives the gate delay degradation with stacking effect simulated by Hspice, the fourth column is calculated with our delay model, and the estimation error of our model is shown in the fifth column. These data demonstrate that our gate-level delay model is accurate enough for delay analysis. If the traditional approach is used to analyze the gate delay degradation, the worst case ΔV_th,M0 = 25.7mV is used as ΔV_th in Eq. (12); the results are shown in the sixth column. The traditional gate delay analysis overestimates the delay degradation; the overestimations relative to our model are shown in the seventh column. More transistors in the PMOS stack lead to more overestimation: 29.8% for the NOR2 gate, but 50.0% for the NOR4 gate. Table 2 also shows that for gates without stacking effect, such as the inverter (INV) and the AND gate, gate-level delay analysis yields the same result as the traditional analysis; only the result for the INV gate is listed in Table 2 for brevity.
4 Experimental Results

In this section, several ALU circuits and the c6288 circuit from ISCAS85 are used as benchmarks to investigate the circuit performance degradation using our NBTI delay degradation model. As the stacking effect leads to different V_th degradations of the PMOS transistors, the gate delay can be minimized using a pin reordering technique, just as in leakage minimization [20]. An exhaustive-search pin reordering is used in our experiments simply to estimate the upper bound of our gate-level model in mitigating circuit performance degradation due to NBTI.
Table 3. Delay degradation of benchmark circuits

    Circuit  R_stack    Original delay  No stacking effect  Stacking effect  Pin reordering
                        t_pd0 (ns)      Δt_pd,ns (ps)       Δt_pd,ws (ps)    Δt_pd,pr (ps)
    array4     61/125   4.11            204.9               157.6            153.4
    array8    347/663   4.89            212.8               188.5            179.2
    bk16       47/177   2.31            171.7                74.6             67.0
    bk32      124/384   3.15            240.1               119.9             77.8
    booth9    277/603   2.91            147.3               138.3            134.9
    ks16       31/99    2.71            181.0               100.8             99.3
    ks32      138/375   3.87            289.1               126.0            120.0
    log16     135/160   1.56             79.5                45.2             45.2
    log32     268/457   2.20            122.9                60.5             60.5
    pm8       278/613   2.71            106.9                95.5             93.1
    pm16    1713/3042   4.48            227.9               210.1            205.8
    c6288   2128/2447   8.54            564.0               457.0            410.6
    Avg.      N/A       N/A              59.3%               0.0%            -6.4%
The results are shown in Table 3. The circuits array4 and array8 are 4x4 and 8x8 array multipliers; bk16 and bk32 are 16-bit and 32-bit Brent-Kung adders; booth9 is a 9x9 Booth multiplier; ks16 and ks32 are 16-bit and 32-bit Kogge-Stone adders; log16 and log32 are 16-bit and 32-bit log shifters; and pm8 and pm16 are 8x8 and 16x16 parallel multipliers. R_stack is the ratio of gates containing a PMOS transistor stack. The original delay t_pd0 is extracted from an STA tool. The delay degradation without stacking effect, Δt_pd,ns, is evaluated using the transistor-level NBTI model, Eq. (4), and the gate delay model, Eq. (12). The delay degradation with stacking effect, Δt_pd,ws, is evaluated using our novel gate-level NBTI and delay models, Eqs. (7) and (15). In Table 3, the fifth column (Δt_pd,ws) is the reference to which the fourth and sixth columns are compared.

From Table 3 we can see that the traditional method introduces, on average, a 59.3% overestimation of the circuit delay degradation, while the pin reordering technique yields an average 6.4% improvement of circuit performance. Both the overestimation and the improvement achievable by pin reordering depend not only on R_stack, but also on the contribution of gates with PMOS stacks to the critical paths of the circuit. For example, ks32 shows a 129.4% overestimation of delay degradation, much larger than pm16, although the R_stack of ks32 is smaller than that of pm16. bk32 and ks32 have almost the same R_stack and both show large overestimations, but bk32 gains a larger performance improvement from pin reordering. Almost all gates in the c6288 circuit are NOR2 gates, so its overestimation of circuit delay degradation, 23.4%, is very close to the 29.8% overestimation of a single NOR2 gate.
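The averages in the last row of Table 3 can be reproduced directly from the per-circuit columns; a small Python check (data copied from Table 3):

```python
# Recomputing the "Avg." row of Table 3 from the per-circuit data.
# Columns (all in ps): Dt_pd,ns (no stacking), Dt_pd,ws (stacking),
# Dt_pd,pr (stacking + pin reordering).
ROWS = {
    "array4": (204.9, 157.6, 153.4), "array8": (212.8, 188.5, 179.2),
    "bk16":   (171.7,  74.6,  67.0), "bk32":   (240.1, 119.9,  77.8),
    "booth9": (147.3, 138.3, 134.9), "ks16":   (181.0, 100.8,  99.3),
    "ks32":   (289.1, 126.0, 120.0), "log16":  ( 79.5,  45.2,  45.2),
    "log32":  (122.9,  60.5,  60.5), "pm8":    (106.9,  95.5,  93.1),
    "pm16":   (227.9, 210.1, 205.8), "c6288":  (564.0, 457.0, 410.6),
}
over = [(ns - ws) / ws for ns, ws, _ in ROWS.values()]  # traditional overestimation
gain = [(pr - ws) / ws for _, ws, pr in ROWS.values()]  # pin-reordering change
print(f"avg. overestimation: {100 * sum(over) / len(over):+.1f}%")  # about +59.3%
print(f"avg. pin reordering: {100 * sum(gain) / len(gain):+.1f}%")  # about -6.4%
```

This matches the 59.3% and -6.4% averages reported in the table.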
5 Conclusion

Negative bias temperature instability is emerging as one of the major circuit performance degradation concerns. Fast and accurate analysis of NBTI-induced circuit degradation is important for circuit designers seeking mitigation solutions. In this paper, we use a simple closed-form analytical V_th degradation model for PMOS transistors to develop a novel gate-level NBTI and delay model. The stress voltage variability due to the PMOS transistors' stacking effect is considered for the first time in gate-level NBTI modeling. The traditional analysis of gate delay degradation due to NBTI results in a 50.0% overestimation for a NOR4 gate, while in circuit performance degradation analysis the maximum overestimation is 130.2%, for the 16-bit Brent-Kung adder (bk16) circuit. The mitigation of performance degradation by the pin reordering technique reaches up to 35.1%, in the 32-bit Brent-Kung adder (bk32) circuit.
References

1. Borkar, S.: Designing reliable systems from unreliable components: The challenges of transistor variability and degradation. IEEE Micro 25(6), 10–16 (2005)
2. Huard, V., Denais, M., Parthasarathy, C.: NBTI degradation: From physical mechanisms to modelling. Microelectron. Reliab. 46(1), 1–23 (2006)
3. Kufluoglu, H., Alam, M.A.: Theory of interface-trap-induced NBTI degradation for reduced cross section MOSFETs. IEEE Trans. Electron Devices 53(5), 1120–1130 (2006)
4. Wittmann, R., Puchner, H., Hinh, L., Ceric, H., Gehring, A., Selberherr, S.: Impact of NBTI-driven parameter degradation on lifetime of a 90nm p-MOSFET. In: Proc. Intl. Integrated Reliab. Workshop Final Report, pp. 99–102 (2005)
5. Reddy, V., Krishnan, A.T., Marshall, A., Rodriguez, J., Natarajan, S., Rost, T., Krishnan, S.: Impact of negative bias temperature instability on digital circuit reliability. Microelectron. Reliab. 45(1), 31–38 (2005)
6. Kumar, S., Kim, C., Sapatnekar, S.: Impact of NBTI on SRAM read stability and design for reliability. In: Proc. ISQED, pp. 210–218 (2006)
7. Ogawa, S., Shiono, N.: Generalized diffusion-reaction model for the low-field charge-buildup instability at the Si-SiO2 interface. Physical Review B 51(7), 4218–4230 (1995)
8. Alam, M., Mahapatra, S.: A comprehensive model of PMOS NBTI degradation. Microelectron. Reliab. 45(1), 71–81 (2005)
9. Mahapatra, S., Saha, D., Varghese, D., Kumar, P.: On the generation and recovery of interface traps in MOSFETs subjected to NBTI, FN, and HCI stress. IEEE Trans. Electron Devices 53(7), 1583–1592 (2006)
10. Chen, G., Li, M., Ang, C., Zheng, J., Kwong, D.: Dynamic NBTI of p-MOS transistors and its impact on MOSFET scaling. IEEE Electron Dev. Lett. 23(12), 734–736 (2002)
11. Mahapatra, S., Bharath Kumar, P., Dalei, T., Saha, D., Alam, M.: Mechanism of negative bias temperature instability in CMOS devices: degradation, recovery and impact of nitrogen. In: IEDM Tech. Dig., pp. 105–108 (2004)
12. Paul, B., Kang, K., Kufluoglu, H., Alam, M., Roy, K.: Impact of NBTI on the temporal performance degradation of digital circuits. IEEE Electron Dev. Lett. 26(8), 560–562 (2005)
13. Kumar, S., Kim, C., Sapatnekar, S.: An analytical model for negative bias temperature instability. In: Proc. IEEE/ACM ICCAD, pp. 493–496 (2006)
14. Vattikonda, R., Wang, W., Cao, Y.: Modeling and minimization of PMOS NBTI effect for robust nanometer design. In: Proc. DAC, pp. 1047–1052 (2006)
15. Bhardwaj, S., Wang, W., Vattikonda, R., Cao, Y., Vrudhula, S.: Predictive modeling of the NBTI effect for reliable design. In: Proc. CICC, pp. 189–192 (2006)
16. Luo, H., Wang, Y., He, K., Luo, R., Yang, H., Xie, Y.: Modeling of PMOS NBTI effect considering temperature variation. In: Proc. ISQED, pp. 139–144 (2007)
17. Nanoscale Integration and Modeling Group, ASU: Predictive Technology Model (PTM)
18. Paul, B., Kang, K., Kufluoglu, H., Alam, M., Roy, K.: Temporal performance degradation under NBTI: Estimation and design for improved reliability of nanoscale circuits. In: Proc. DATE, vol. 1, pp. 1–6 (2006)
19. Stathis, J., Zafar, S.: The negative bias temperature instability in MOS devices: A review. Microelectron. Reliab. 46(2-4), 270–286 (2006)
20. Sultania, A., Sylvester, D., Sapatnekar, S.: Transistor and pin reordering for gate oxide leakage reduction in dual Tox circuits. In: Proc. ICCD, pp. 228–233 (2004)
Modelling the Impact of High Level Leakage Optimization Techniques on the Delay of RT-Components

Marko Hoyer¹, Domenik Helms¹, and Wolfgang Nebel²

¹ OFFIS Research Institute, D-26121 Oldenburg, Germany
² University of Oldenburg, D-26121 Oldenburg, Germany
{hoyer|helms|nebel}@offis.de
Abstract. To address the problem of static power consumption, approaches such as ABB and AVS have been proposed to reduce runtime leakage in integrated circuits. Applying these techniques is a trade-off between power and delay, which is best decided early in the design flow; therefore, high-level power and delay estimation is needed. In this work, we present a fast RT-level delay macro model considering supply and bias voltages and temperature. Errors below 5%, combined with the need for only little characterization data, enable this approach to be used by high-level design tools to support leakage optimization by, e.g., ABB and AVS.
1 Introduction

In the past years, a lot of work has focused on the problem of static power consumption. Two major types of leakage current can be identified: gate leakage will likely be brought under control using high-k materials [1] isolating the gate from the channel, but subthreshold leakage will remain a problem. Several approaches have been presented for reducing subthreshold leakage, of which MTCMOS (power gating), ABB (adaptive body biasing) and AVS (adaptive voltage scaling) are the most promising. These approaches differ in their applicability: while power gating can only be applied to idle parts of a circuit, ABB and AVS can reduce leakage without losing the circuit's functionality.

Due to additional cost in terms of area, power and delay, these techniques have to be considered at a high abstraction level in the design process, where they have the most potential to save power. Thus, high-level models are required that determine the power savings but also the additional cost. In the case of power gating, only power and area are additional cost; ABB and AVS also impact performance, since a reduction of leakage directly results in a slower circuit. Thus, delay estimation has to be performed when applying these techniques, in order to meet the timing constraints of a design.

Delay estimation of digital circuits is well established because it is needed at several levels of abstraction throughout the whole design process. If a fully specified
This work was supported by the European Commission within the Sixth Framework Programme through the CLEAN project (contract no. FP6-IST-026980).
circuitry view is available, the circuit simulator SPICE together with the BSIM transistor model [2] can be used to accurately estimate the delay. At higher design levels, missing information and the high complexity prohibit this approach; less accurate table-based gate-level models combined with statistical area-based wireload models are the state of the art in delay estimation.

Such models are not sufficient for the purpose of leakage optimization. Parameters such as supply voltage or temperature are only described for corner cases. High-level power optimization may directly influence the worst-case temperature, and at high level the system-wide temperature distribution is available. Thus, the temperature assumptions are less pessimistic, generating the need for delay models below the absolute worst-case temperature. To support ABB optimization, our models also need continuous supply-voltage and body-voltage dependencies. Thus, the delay models currently available are not suited for low-leakage optimization.

In this work we present an RTL delay model considering all parameters needed for high-level leakage optimization. Fast estimation, combined with only little characterization data for whole technologies, makes this approach useful for high-level design tools supporting low-leakage design. In the remainder of this work, the gate modelling is presented, followed by a description of the RT model in Sections 3 and 4. The evaluation results are presented in Section 5, followed by the conclusion.
2 Related Work

Several approaches have been proposed that apply ABB and AVS to reduce leakage. In [3], static bias voltages are applied to a circuit after manufacturing to adjust differences in power and speed caused by process variations. The approach presented in [4] dynamically adapts the bias voltage at runtime, considering temperature and process variation. [5] has shown that a combination of ABB and AVS results in large power savings. [6] presents an approach in which RT components with timing slack are clustered into islands with an optimal combination of supply and body bias voltages.

As mentioned before, applying all these techniques requires estimation of the savings in terms of power and of the additional costs in terms of delay. Detailed physics of the main sources of leakage are already known and analytically modelled at transistor level, e.g. by [7,2]. [8] presented an approach to estimate leakage at gate level; an extension to RT level, also regarding supply and bias voltages, temperature and process variations, is proposed in [9].

Delay estimation can be done at transistor level using the low-level circuit simulator SPICE with the BSIM transistor model [2]. Simple table-lookup gate-level models, characterized by circuit simulations and combined with a statistical wireload model, are currently used by commercial EDA design tools to estimate the delay of RT components and whole designs. In [10], a new gate model called CCS is presented, taking into account the rising resistances and capacitances of interconnects; second-order effects in the capacitive behaviour of a transistor are also regarded.
As these approaches have not been developed for use in leakage optimization, some parameters are not regarded: supply voltage and temperature are typically considered only in corner cases, and the body bias voltage needed for ABB is not taken into account. This drawback can be eliminated by using the half-analytical gate macro model presented in [11], where an inverter is modelled considering the supply and threshold voltage and the driven load; the extension to different slew rates and other gates allows this approach to be adapted for gate delay estimation. To the best of our knowledge, no fast RTL delay models exist that consider supply and bias voltages as well as temperature and can thus be used to guide leakage optimization techniques at a high abstraction level.
3 Gate Level Model

In addition to supply and bias voltages and temperature, delay models have to regard the driven load and the slew rate at the input, as these strongly affect the delay. Furthermore, the extension to our RTL macro model requires modelling the delay between each input and the output. To reduce the complexity of the model, rising and falling delays are not modelled separately. Considering the fact that in typical CMOS gates the signal is inverted from input to output, nearly the same number of rising and falling delays exist in the critical path of an RT component; thus, this simplification is valid for the final purpose of RT-level delay modelling.

The main idea of our approach is to divide the model into two parts. While the reference part is identical for all gates of the technology and describes the principal timing behaviour of the technology, the individual part has to be characterized for each pin of each gate. The following consideration shows that a simple inverter model is suitable for the reference part. To propagate an edge from one input pin of a gate to its output, exactly one conducting path from the positive to the negative power rail containing the two switching transistors exists. All other transistors behave like a capacitance in the blocking case, and like a resistance followed by a capacitance in the conducting case. Replacing all blocking transistors by a capacitance and all conducting transistors by a resistance and a capacitance, as displayed in Figure 1, a simple inverter remains. As the supply voltage and the bias voltage hardly affect the behaviour of the static, non-switching transistors, their influence on the delay can be modelled in the reference model. The temperature and the load must be considered separately for each pin of each gate.

3.1 Inverter Model
The reference inverter is modelled using a simple three-dimensional sampling field over the two bias voltages and the supply voltage. As the dependency on the slew rate at the gate's input is strongly nonlinear and also depends on the type and the input pin of the gate [11], this parameter cannot be modelled this way. Because the gate models are used as the base of an RT model, a simplification concerning the slew rate can be made. Given a synthesised design, the slew rate
Fig. 1. A NAND3 (a) and its equivalent gate (b) consisting of an inverter, resistances in the path from positive to negative power rail and additional capacitances
at each input of all contained gates is nearly constant, as it is mainly determined by the fanout of the previous gate's output. Only the second-order effect that changing supply and body bias voltages also change the slew rate between two gates has to be considered. In our approach, this indirect modelling of the slew rate is realized by a single inverter driving the inverter to be modelled when sampling the field. The driving inverter always has the same supply and bias voltages as the gate to be modelled. This way, the indirect influence of the bias and supply voltages on the input slew, usually set by the driving gate, is considered.

3.2 Modelling Other Gates Also Considering Temperature and Load
The next step is to extend the inverter model to support the other gates of the technology. Therefore, the additional RC chain in the equivalent gate (cf. Figure 1(b)) has to be modelled. The serial resistances linearly scale down the current through the gate and thus linearly increase the delay; the capacitive part increases the delay by a constant value. In our approach, the equation d_gate at input = p0 · d_inv + p1, with p0 and p1 being two constants, is used to map the reference model d_inv to any input pin of any gate.

In a final step, the temperature and load dependence is added to the model. Experiments over a large number of gates show that the temperature and the load can be separated from the rest of the model with errors of less than 6%; thus, their behaviour can be characterized identically for all combinations of supply and bias voltage. As can be seen in Figure 2, the delay depends linearly on the load. The temperature dependence, presented in Figure 3, can be approximated by a quadratic polynomial, in which the linear coefficient must be zero in order to obtain a strictly increasing function over the whole temperature range. Using these abstractions, the delay of each gate at each pin can be modelled by

    d_gate at input = (p0 · d_inv + p1)(p2 · load + p3)(p4 · temp^2 + p5)    (1)

where six free parameters have to be characterized.
Fig. 2. The influence of the load on the delay of an OAI21 gate and an inverter. The dependence is nearly linear but differs between the gates of a technology.
Fig. 3. The influence of the temperature on the delay of an OAI21 gate and an inverter. The dependence can be approximated by an a · temp^2 + b polynomial.

3.3 Model Characterization
In the first step of the characterization flow (cf. Figure 4), the parameters of the reference model are captured using a circuit simulator with a compatible transistor model (e.g. SPICE with BSIM [2]) and the schematic view of the inverter. For temperature and load, nominal values are selected. To match this model to other gates, three measurement fields with the same dimensions as the reference field are required for each pin. The first one is simulated using the same temperature and load as the reference model; it is needed to fit p0 and p1 in Equation (1) using linear regression. The second and third fields are simulated with a different temperature and with a different load, respectively. In combination with the first one, they are used to determine the parameters p2 to p5 of the gate model. Transforming Equation (1) to

d_gate at input = g0 · (d_inv + g1) · (load + g2) · (temp² + g3)   (2)

with

g0 = p0 · p2 · p4,  g1 = p1/p0,  g2 = p3/p2,  g3 = p5/p4

reduces the six parameters to four independent ones per pin and gate.
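The reduction from Equation (1) to Equation (2) can be checked mechanically; a small sketch with illustrative parameter values:

```python
def reduce_params(p):
    """Collapse the six parameters of Eq. (1) into the four of Eq. (2)."""
    p0, p1, p2, p3, p4, p5 = p
    return (p0 * p2 * p4, p1 / p0, p3 / p2, p5 / p4)

def delay_eq1(d_inv, load, temp, p):
    p0, p1, p2, p3, p4, p5 = p
    return (p0 * d_inv + p1) * (p2 * load + p3) * (p4 * temp ** 2 + p5)

def delay_eq2(d_inv, load, temp, g):
    g0, g1, g2, g3 = g
    return g0 * (d_inv + g1) * (load + g2) * (temp ** 2 + g3)

# Hypothetical parameters; both forms agree at any operating point.
p = (1.2, 0.05, 0.01, 1.0, 2e-5, 1.0)
g = reduce_params(p)
```

Factoring p0, p2 and p4 out of the three brackets is what makes g1, g2 and g3 independent of the overall scale g0.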
Fig. 4. Overview of the characterization flow. In three steps, a simple inverter model considering Vdd, VbbPMOS and VbbNMOS is extended to a delay model supporting all gates of a technology, also regarding temperature and driving load.
M. Hoyer, D. Helms, and W. Nebel

4 Extension to RT Level
The delay of an RT component is defined as the maximum time between a changing signal at one of the component's inputs and the corresponding change at one of the outputs. At gate level, this time can be computed by adding the individual delays of the gates and wires along the longest path (in terms of delay) through the component. Thus, this critical path has to be modelled to determine the delay of an RT component.

4.1 Modelling the Critical Path of an RT Component
Interconnect delay is modelled conventionally, as the effect of supply and bias voltages and temperature on it is weak. The load caused by the wires, however, is not negligible: at design sizes of 90nm and below it is of the same order of magnitude as the load caused by the cells. A statistical wireload model, as contained in conventional technology libraries, can be used to determine the capacitance of a wire connected to a gate's output in the design. Finding the critical path within a gate level description of the component is straightforward. For each gate, the load to be driven is computed by summing up the input capacitances of all connected gates and of the wires between them. A default load is selected for the gates which are connected to an output, assuming the RT component is driving interconnect which is either short or driven by an inverter. Knowing the load of each gate in the component, the delay can be computed using the models described in Section 3. The critical path can then be determined from the gate delays and the gate level description of the component using standard graph algorithms. As the delay of the RT component is the sum of the gate delays along the critical path, the delay model of this path is the simple summation of the gate models and can be expressed by

d_RT(Vdd, Vbbp, Vbbn, temp) = Σ_i d_gate at pin,i(Vdd, Vbbp, Vbbn, temp, Ci)   (3)
where Ci is the load of gate i. Inserting the gate models developed in Section 3 and expanding this equation results in a bilinear expression (4), which depends on the square of the temperature and on the delay calculated by the reference model:

d_RT(d_inv, temp) = r0 · d_inv · temp² + r1 · d_inv + r2 · temp² + r3   (4)

with

r0 = Σ_i g0i · (Ci + g2i),  r1 = Σ_i g0i · (Ci + g2i) · g3i,  r2 = Σ_i g0i · (Ci + g2i) · g1i,  r3 = Σ_i g0i · (Ci + g2i) · g1i · g3i
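Both computational steps of this section — finding the critical path with a longest-path search over the gate-level netlist, and collapsing the per-gate models into the four constants of Equation (4) — can be sketched as below. The netlist, delays, and parameter values are hypothetical.

```python
from collections import defaultdict

def critical_path_delay(edges, delay):
    """Longest path (in delay) through a gate-level DAG, via Kahn's topological order."""
    preds, succ, indeg = defaultdict(list), defaultdict(list), defaultdict(int)
    nodes = set(delay)
    for u, v in edges:
        succ[u].append(v)
        preds[v].append(u)
        indeg[v] += 1
        nodes.update((u, v))
    frontier = [n for n in nodes if indeg[n] == 0]
    arrival = {}
    while frontier:
        n = frontier.pop()
        # latest arrival at the output of gate n
        arrival[n] = delay.get(n, 0.0) + max((arrival[p] for p in preds[n]), default=0.0)
        for m in succ[n]:
            indeg[m] -= 1
            if indeg[m] == 0:
                frontier.append(m)
    return max(arrival.values())

def rt_coefficients(path_gates):
    """Collapse per-gate (g0, g1, g2, g3, Ci) tuples into r0..r3 of Equation (4)."""
    r0 = r1 = r2 = r3 = 0.0
    for g0, g1, g2, g3, ci in path_gates:
        h = g0 * (ci + g2)       # load factor is fixed per gate
        r0 += h
        r1 += h * g3
        r2 += h * g1
        r3 += h * g1 * g3
    return r0, r1, r2, r3

def d_rt(d_inv, temp, r):
    r0, r1, r2, r3 = r
    return r0 * d_inv * temp ** 2 + r1 * d_inv + r2 * temp ** 2 + r3

# Toy netlist: paths a->c->d and b->c->d
cp = critical_path_delay([("a", "c"), ("b", "c"), ("c", "d")],
                         {"a": 1.0, "b": 2.0, "c": 3.0, "d": 1.0})
```

The collapsed form reproduces the direct per-gate summation exactly, which is easy to verify numerically for any gate list.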
As can be seen, the critical path of an RT component can be modelled by an inverter model considering the supply and bias voltages and four additional
Table 1. Differences between modelled and measured delays of 3 RT components due to the indirect modelling of the slew rate

component    length of the critical path   measured delay [ns]   modelled delay [ns]   derating factor
AddCla04     11                            23.4                  30.6                  0.76
AddCla16     15                            33.6                  46.3                  0.73
MultCsa8x8   49                            97.6                  143.1                 0.68
constants, which can easily be computed from the delay models of the gates along the path. In Table 1, the maximum delays of three RT components, measured with SPICE and computed with our model, are listed. Due to the indirect modelling of the slew rate at gate level, this preliminary model overestimates the delay by about 20% to 30%, depending on the length of the critical path and the gates it contains. As the effects of the model parameters on the slew rate are still modelled at gate level, a derating approach is selected to map the modelled delay to the measured one. To this end, the delay of the critical path has to be simulated at one combination of the input parameters. The derating factor is then given by the ratio between this measured and the modelled delay.
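The derating step amounts to a single scalar correction; the numbers below are taken from the AddCla16 row of Table 1.

```python
modelled = 46.3   # ns, preliminary RT model
measured = 33.6   # ns, one SPICE run at the same operating point
derating = measured / modelled   # about 0.73, matching Table 1

def derated_delay(model_delay):
    """Map a modelled delay to the measured scale."""
    return derating * model_delay
```

Once the factor is fixed by that single simulation, it is applied to every subsequent evaluation of the model.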
4.2 Varying Critical Paths
Using the derating, the resulting model is able to accurately estimate one critical path in the RT component. Due to differences in the design, the gates show different delay behaviour when the temperature, supply voltage or bias voltage changes. Thus, the critical path of the RT component can change with a variation of these parameters. In Figure 5, a simple example is presented in which the critical path changes with temperature. A simple extension of this example can result in several critical paths varying with the parameters, making a model splitting for different parameters impractical. When modelling the delay, a slight overestimation, as shown by the dashed line in Figure 5, is a conservative approximation. Considering the quadratic dependence on the temperature, only the delays at the lowest and at the highest temperature have to be determined. Since each path delay is linear in temp², their maximum is convex, so the interpolation through the two extreme delays always lies above it and leads to a conservative overestimation of the delay. As the gradients of the different functions differ only slightly, this approach will not result in large errors. Equation (4), which is used to model the critical path of an RT component, expresses a bilinear dependence on the square of the temperature and on the delay of the reference model. Thus, the same overestimation approach as in the one-dimensional example above can be used. Therefore, the delays at the four corner cases, defined by the minimal and maximal temperature and the minimal and maximal delay of the reference model, have to be determined. To find the overestimating
Fig. 5. A simple configuration in which the critical path varies within the range of the modelled temperature. An overestimating approximation is plotted with the dashed line.
model, the four constants of the bilinear equation must be selected in such a way that the four delays at the corner cases lie on the function. As a result, a hard macro model for an RT component has been developed which can estimate the maximum delay considering the supply and bias voltages and the temperature. Characterized once for a target technology, the model can be used in high level optimization tools.
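Fixing the four constants of the bilinear form from the four corner delays is equivalent to a bilinear interpolation in (d_inv, temp²); a sketch with hypothetical corner values:

```python
def corner_model(d_lo, d_hi, t_lo, t_hi, f):
    """Build d(d_inv, temp) passing through the four corner delays.

    f[(a, b)] is the delay at d_inv = (d_lo, d_hi)[a] and temp = (t_lo, t_hi)[b].
    """
    x0, x1 = d_lo, d_hi
    y0, y1 = t_lo ** 2, t_hi ** 2        # the model is linear in temp**2
    area = (x1 - x0) * (y1 - y0)

    def d(d_inv, temp):
        x, y = d_inv, temp ** 2
        return (f[(0, 0)] * (x1 - x) * (y1 - y)
                + f[(1, 0)] * (x - x0) * (y1 - y)
                + f[(0, 1)] * (x1 - x) * (y - y0)
                + f[(1, 1)] * (x - x0) * (y - y0)) / area
    return d

# Hypothetical corner delays (ns) over d_inv in [0.02, 0.05] and temp in [0, 125]:
f = {(0, 0): 10.0, (1, 0): 14.0, (0, 1): 12.5, (1, 1): 17.5}
model = corner_model(0.02, 0.05, 0.0, 125.0, f)
```

Expanding the bilinear surface gives exactly the r0·d_inv·temp² + r1·d_inv + r2·temp² + r3 form of Equation (4), so forcing it through the four corners determines r0..r3 uniquely.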
5 Evaluation
The evaluation results are obtained by comparing the modelled delay against simulations at transistor level, using the SPICE simulator combined with the transistor model BSIM4.50. The whole characterization of our model has been done for a 90nm technology. The analyzed RT components have been synthesized using a commercial synthesis tool. For each component, we extracted the critical path and limited the simulation to it, to ensure measuring the maximum delay. Table 2 holds the ranges and the count of samples for each dimension of the reference inverter's sample field. Interpolation between the samples causes a maximum error of 2% with a standard deviation of 0.1%.

Table 2. Interpolation between the samples of the reference model within the listed ranges with the listed count of samples causes maximum errors below 2% with a std. deviation of 0.1%

        range           count of samples
Vdd     0.8V - 1.8V     11
Vbbp    -0.5V - 0.2V    2
Vbbn    -0.5V - 0.2V    2

In Table 3, the errors caused in the gate models by circuitry, temperature and load separation are listed. The NAND and NOR gates represent the extreme
Table 3. The errors when mapping the reference model to gates and separating temperature and load

gate     circuitry sep. Std.[%]   load sep. Std.[%]   temp sep. Std.[%]
NAN2     1.78                     2.4                 2.9
NAN3     3.4                      3.7                 4.0
NOR2     2.6                      4.6                 5.5
NOR3     2.3                      3.4                 5.1
OAI221   1.8                      2.1                 4.3
Table 4. Evaluation of whole RT components with different lengths of the critical path

component    length of cp.   Std.[%]   Max.[%]   Mean[%]
AddCla4      11              2.14      10.09     -0.02
AddCla16     15              2.39      10.1      -0.39
AddRpl16     32              4.61      13.49     2.29
MultCsa8x8   49              1.8       5.7       0.31
cases of a regular serial NMOS and PMOS stack, while the OAI221 gate is a combination of serial and parallel devices. One can see that the temperature separation causes the highest errors with a standard deviation of 5.5% in the NOR2 gate. A Monte Carlo evaluation over the whole parameter range using a wide selection of gates has been done to determine the accuracy of the whole gate model approach. The results of some gates are presented in Figure 6.
Fig. 6. Monte Carlo evaluation of several gates over the whole range of input parameters. For each pin of the gate, mean and maximum error and the standard deviation are presented. Single errors with maximum values up to 22% occur when a very low supply voltage is selected.
A standard deviation of 7% with single errors up to 22% is the upper bound of the errors at gate level. Analyses have shown that the high single errors appear when using a supply voltage near the lower bound of the modelled range. At RT level, it has to be shown that the derating approach, used to correct differences caused by the indirect modelling of the slew rate, is feasible over the whole range of the input parameters. Therefore, a Monte Carlo evaluation has been done using several RT components with different architectures and different counts and types of gates along their critical paths. The results, listed in Table 4, show that the errors depend on the length of the critical path but also on the architecture of the component. Due to the combination of gates with different errors, the standard deviation (up to 5%) and the maximum errors (up to 15%) are lower than the errors at gate level.
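A Monte Carlo accuracy check of the kind described can be scripted generically; the model and "reference" below are stand-ins for illustration, not the characterized data of the paper.

```python
import math
import random

def error_stats(model, reference, sample, n=10000, seed=1):
    """Mean, max-abs, and std of the relative error (%) over n random points."""
    rng = random.Random(seed)
    errs = []
    for _ in range(n):
        point = sample(rng)
        ref = reference(*point)
        errs.append(100.0 * (model(*point) - ref) / ref)
    mean = sum(errs) / n
    std = math.sqrt(sum((e - mean) ** 2 for e in errs) / n)
    return mean, max(abs(e) for e in errs), std

# Stand-in reference and a model that is uniformly 2% high:
ref = lambda vdd, temp: 1.0 / vdd * (1.0 + 1e-5 * temp ** 2)
mod = lambda vdd, temp: 1.02 * ref(vdd, temp)
sample = lambda rng: (rng.uniform(0.8, 1.8), rng.uniform(25.0, 125.0))
mean, worst, std = error_stats(mod, ref, sample)
```

Replacing the stand-ins with the RT model and a SPICE wrapper yields exactly the mean/max/std figures reported in Table 4 and Figure 6.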
6 Conclusion
Our bottom-up approach results in an RTL delay hard macro model considering supply and bias voltages and temperature. The speedup in comparison to approaches at lower abstraction levels enables using this model to apply leakage optimization techniques at high abstraction levels in the design process. Maximum errors of 15% with a standard deviation below 5% over the whole input parameter range are more accurate than required by this use case. The small amount of characterization data allows our approach to be used for characterizing whole technologies and thus to be applied in tools supporting high level design optimization. In order to increase the accuracy of our model, we are currently working on describing the interconnect delay more precisely, as its contribution to the total delay grows with shrinking technology size.
References
1. Technology Working Groups: ITRS - Process Integration, Devices and Structures. http://www.itrs.net/Links/2005ITRS/PIDS2005.pdf
2. Hu, C.: BSIM model for circuit design using advanced technologies. In: Symposium on VLSI Circuits, Digest of Technical Papers, pp. 5–6 (2001)
3. Keshavarzi, A., Tschanz, J., Narendra, S., De, V., Daasch, W., Roy, K., Sachdev, M., Hawkins, C.: Leakage and Process Variation Effects in Current Testing on Future CMOS Circuits. IEEE Design & Test of Computers 19, 36–43 (2002)
4. Tschanz, J., Kao, J., Narendra, S., Nair, R., Antoniadis, D., Chandrakasan, A., De, V.: Adaptive Body Bias for Reducing Impacts of Die-to-Die and Within-Die Parameter Variations on Microprocessor Frequency and Leakage. IEEE Journal of Solid-State Circuits 37, 1396–1402 (2002)
5. Meijer, M., Pessolano, F., de Gyvez, J.P.: Technology Exploration for Adaptive Power and Frequency Scaling in 90nm CMOS. In: ISLPED 2004 (2004)
6. Helms, D., Meyer, O., Hoyer, M., Nebel, W.: Voltage- and Abb-Island Optimization in High Level Synthesis. In: Proceedings of the 2007 International Symposium on Low Power Electronics and Design (2007)
7. Roy, K., Mukhopadhyay, S., Mahmoodi-Meimand, H.: Leakage Current Mechanisms and Leakage Reduction Techniques in Deep-Submicrometer CMOS Circuits. Proceedings of the IEEE 91(2) (2003)
8. Rao, R.M., Burns, J.L., Devgan, A., Brown, R.B.: Efficient Techniques for Gate Leakage Estimation. In: Proceedings of the 2003 International Symposium on Low Power Electronics and Design, pp. 100–103 (2003)
9. Helms, D., Hoyer, M., Nebel, W.: Accurate PTV, State, and ABB Aware RTL Blackbox Modeling of Subthreshold, Gate, and PN-junction Leakage. In: Vounckx, J., Azemard, N., Maurine, P. (eds.) PATMOS 2006. LNCS, vol. 4148, Springer, Heidelberg (2006)
10. Synopsys: CCS Timing, Technical White Paper (December 2006), http://www.synopsys.com/products/libertyccs/ccs_timing_wp.pdf
11. Daga, J.M., Auvergne, D.: A Comprehensive Delay Macro Modeling for Submicrometer CMOS Logics. IEEE Journal of Solid-State Circuits (January 1999)
Logic Style Comparison for Ultra Low Power Operation in 65nm Technology

Mandeep Singh1, Christophe Giacomotto1, Bart Zeydel1, and Vojin Oklobdzija2

1 University of California, Davis, Davis, CA 95616 USA
{mandeep,christophe,zeydel}@acsel-lab.com
2 University of Sydney, Sydney, NSW, 2006 Australia
[email protected]
Abstract. Design considerations for ultra low power circuits are presented through a study of circuit families operating at ultra low supply voltages. We examine static CMOS logic versus pass-transistor logic to determine which logic style is best suited for ultra low power design. Furthermore, we present a modification to Complementary Pass-gate Logic which improves its operation under ultra low power conditions. The operation of this modified CPL (MTCPL) at ultra low supply voltages is compared to CMOS+, Dual Value Pass-transistor Logic, and static CMOS in the same environment. The results show that although CMOS+ demonstrates the best energy-delay characteristics for ultra low power design, MTCPL yields the best energy at low data activities.
1 Introduction

Several works [2][9][10] have explored low power aspects of pass transistor families. In this work we re-evaluate the potential benefit of using pass logic in a modern 65nm process. CMOS+, which combines standard CMOS gates and the dual pass gate, has rarely been second-guessed as a design choice. This is mainly due to its simplicity, easy integration into design tools, and reliability at ultra low voltages. The reason we wish to explore pass transistor logic families further is the nature of pass gate based designs, which allow good control over leakage paths. Intuitively, because logic operations are carried out by passing charge from one transistor to another without dumping any to ground, pass transistor logic is expected to yield lower energy consumption. And, since the connections to either a voltage supply source or ground are at the edges of the logic, in a purely pass transistor based circuit the potential leakage paths are limited to these locations. Design to control these leakage paths can be achieved through simple rules and without a complicated analysis of all permutations of circuit paths. Since leakage power is becoming an increasingly difficult problem to address, this can be very useful.

N. Azemard and L. Svensson (Eds.): PATMOS 2007, LNCS 4644, pp. 181–190, 2007. © Springer-Verlag Berlin Heidelberg 2007

In most processes, transistors with higher threshold voltages can be placed either at the head or foot of a circuit to reduce
M. Singh et al.
leakage, and as long as they are not in the critical path this makes for effective leakage control. With the use of pass transistor based logic families this procedure is further simplified, since the number of paths to power and ground is limited. Standard static CMOS is typically the logic family of choice for most non-critical data paths. This is due to two factors: 1) current technology nodes require high gain efficiency within the logic to overcome interconnect parasitics; 2) synthesis tool chains are much simpler for standard CMOS than for pass gate logic in general. However, [13] shows that integrating pass gate logic with standard CMOS is applicable to ultra low power / ultra low voltage designs and can improve energy efficiency by reducing leakage paths. The domain for pass gate type circuits is therefore ultra low power, low duty cycle, non-time-critical circuits. Sensor devices, with their low duty cycles and other power and energy saving requirements, could offer a suitable target for this technology and logic family. In this work, leakage and sub-threshold supply voltage effects on performance and power are examined on an arithmetic unit. In the next section we cover the previous work in this area and detail the logic families selected for comparison. In Section 3 we examine how pass transistor based families work at ultra low voltages and how their energy-delay tradeoffs at those voltages can be improved. Section 4 details our simulation setup and Section 5 discusses the results, followed by the conclusion and future work.
2 Logic Families Selection for Design Comparison

2.1 Previous Work

In [1] the authors describe the synthesis process for Complementary Pass Transistor Logic (CPL), Double Pass-transistor Logic (DPL) and DVL based circuits. The synthesis procedure described there was followed to produce the circuits analyzed in this work. The authors in [2] provide a detailed overview of several pass transistor families including DPL and CPL. Physical aspects of the designs are also discussed in detail along with power and delay. The work [2] does not take into consideration DVL, which is an optimized implementation of DPL, and it also ignores CMOS+ based hybrid designs, which are mostly static CMOS but use transmission gates to improve the efficiency of multiplexers and XOR gates. Also, previous works on this topic do not look at circuit performance at voltages close to and below the threshold voltage of the technology. The impact of voltage reduction on the operation and energy of circuits in these different logic families is essential for determining which logic family is best suited for ultra low power operation. Multi-threshold design techniques [8][10][11] can significantly reduce energy consumption when combined with voltage scaling. For example, Kao et al. [8] show the benefit of a Multi-Threshold technique (MTCMOS) on a clock storage element. And Lindert et al. [10] provide a design methodology based on active body biasing which dynamically skews the transistor threshold. These techniques improve power consumption at low voltage and can be applied to most logic styles.
2.2 Selected Logic Families

The logic families that we considered for this comparison are all static logic families, due to the power-hungry nature of dynamic logic styles. Other pass transistor families such as DCVSL were left out of the comparison because the differential restore logic makes them very inefficient as far as power is concerned. DPL, a family close to DVL, was left out because we consider DVL to be an optimized version of DPL. Static CMOS, CMOS+, CPL and DVL are compared in this work, and MTCPL is explored as a better alternative to CPL at ultra low voltages (see Section 3). Fig. 1 shows the traditional NOR logic gate in CPL, DPL, DVL. For further information on CPL, DPL, DVL see [1].
Fig. 1. NOR in: (I) DPL, (II) CPL, (III) DVL
3 Multi-threshold Complementary Pass-Gate Logic (MTCPL) for Ultra Low Voltage Operation

3.1 Basic Concept

The theoretical operation limit of pass transistor based circuits such as CPL lies at a supply of 2VT (twice the threshold voltage). At this supply, the VT drop across the pass transistor device causes the voltage seen at the input of the restore logic to be very close to VDD/2 when it is supposed to evaluate a logic high. With a further drop in supply voltage this logic high will be less than VDD/2 because of the VT drop and cannot be evaluated by the restore logic; this should result in a logic failure. CPL, though, can operate well below 2VT, and with careful design there is potential for this logic family to be energy efficient in ultra low power circuits. Operation at voltages below 2VT is possible due to leakage paths that charge internal nodes to potentials that allow the restore logic to work. Fig. 2 shows an XOR gate implemented in CPL with the dual PMOS voltage restore transistors. The figure also illustrates some of the leakage paths that exist for this type of circuit; since this basic structure is representative of the logic style, these paths are common to most CPL circuits. The voltage at node 'x' needs to be restored to supply level due to the VT drop across the preceding transistors. The PMOS transistors help pull 'x' up to the supply voltage. Another charging mechanism exists in the leakage path from the input inverter to
Fig. 2. General Structure of CPL based pass transistor circuits. (I) Input arrangement (II) XOR Gate Implementation and respective Leakage Paths.
'x', which is shown as the 'charging leakage path'. Although this path charges 'x' very slowly, at ultra low voltages it helps restore the swing to a level acceptable for the restore logic and the output buffers to work properly. This effect is shown more clearly in the succeeding section, where we analyze this slow charging and then improve on the design.

3.2 Delay Analysis

To demonstrate the leakage charging effect and how it scales, we show node X of Fig. 3 charging to its maximum voltage without the restore logic present, and how the time it takes to charge scales with decreasing supply voltages. This is demonstrated in Fig. 4, where we can see the exponential increase in the time it takes to reach the maximum achievable voltage, which is above (Vdd - VT). Although the delay increases significantly, this figure demonstrates the ability of pass transistor logic to work at ultra low voltages. To improve the energy-delay tradeoff by improving the leakage characteristics, CPL can be enhanced through the changes discussed in the introduction. These are discussed in the next section.
Fig. 3. Simple Pass transistor circuit without the restore logic
Fig. 4. XOR Node X Charge Time
3.3 Proposed Changes for Ultra Low Power Operation

It is known that the delay and energy characteristics of pass transistor based circuits degrade severely at low voltages [10]. At ultra low voltages the leakage energy per operation starts to increase exponentially as the supply voltage is lowered, causing a further degradation of the energy characteristics of these circuits. To overcome this increase in leakage energy and have CPL deliver energy and delay savings comparable to the respective static CMOS designs, we use high-VT transistors to plug leakage paths in an effort to reduce the leakage current further (Fig. 5).
Fig. 5. XOR Logic with High VT transistors in the leakage paths
Multi-threshold transistors allow for regulation of leakage currents. Forward leakage current through the pass gate logic is required to charge the internal nodes to the voltage levels needed for operation at low supply levels. Therefore, when designing the circuit, the transistors used for implementing the logic should be fast, to allow reliable internal node charging and to increase circuit speed, especially at low voltages. Fast transistors are not required for the restore logic, however, and it can be implemented with high threshold devices to cut the leakage paths to the supply planes. This approach significantly reduces the energy at low supply voltages and allows CPL to function more energy-efficiently. We refer to CPL with high-VT devices as multi-threshold CPL (MTCPL). There is a tradeoff between high threshold restore logic and the recharge time needed for reliable operation: high-VT restore logic and input inverters reduce the leakage paths, but since it is the trickle leakage currents that allow for level restoration below a 2VT supply, the operation of the circuits becomes very slow.
4 Setup and Experimental Procedure

The circuits were simulated with fixed input and output loads with FO4 slopes. All devices were minimum size, and sizing was not considered as a variable in our experiment because minimum size is accepted as the minimum energy transistor sizing solution [6][7]. If the circuits required true and complementary inputs (as is the case for DVL and CPL), the energy spent by the logic used to generate such inputs was taken into account. A 65nm CMOS technology was used for the data collection. Typical process corners were used to conduct the simulations at 25°C, with the VT value for the transistors at those corners being close to 240mV. The circuits used for the comparison were all 8-bit ripple carry adders implemented in each particular logic style. To account for the energy, they were simulated over their critical path time for representative vectors over the range of voltages presented. In the following section, data activity level refers to the activity at the input of the circuit, i.e., the utilization of the circuit. All inputs switching at every cycle would thus imply 100% data activity.
5 Ultra Low-Power Performance

Fig. 6 shows the leakage energy associated with each of the logic styles examined on an 8-bit ripple carry adder. A DVL based adder was also simulated, but its leakage was found to be high and not of interest. The results show the average leakage energy for these circuits at the associated voltage. It can be seen that the standby energy associated with MTCPL is significantly lower than that of CMOS+. With MTCPL we were able to improve over static CMOS as well as CPL by a minimum of 30% at the minimum leakage (optimum voltage) point. At 0.45V with MTCPL we saw close to 30% savings over static CMOS, which improves significantly as we move towards higher supply voltages. The low leakage is an important property for circuits which operate on a very low duty cycle and spend most of their time idle.
Fig. 6. Leakage Energy(0% Activity) Vs. Voltage for 8 Bit Ripple Carry Adders
However, when active energy is taken into account, the minimum energy for each logic style varies greatly, and we can see that the leakage energy savings of MTCPL translate well with activity. This behavior is detailed in Fig. 7, which shows the minimum energy achieved by each of the logic styles with increasing data activity from 0 to 0.5. Even though the leakage energy is an order of magnitude less than the active energy of these families, at low data activity (0.1-0.2) MTCPL is clearly the optimum logic style as far as energy is concerned. Because CMOS+ has very low active energy, it starts to become a better choice at higher activity levels. Every transistor in the CMOS and CMOS+ designs has its dual (nMOS and pMOS tree). Similarly, CPL and MTCPL have their complementary nMOS trees. Consequently, all four designs have the same amount of gate capacitance. However, MTCPL and CPL also need to drive the capacitive parasitic drain-source connections, resulting in worse active energy. Also, [13] shows how stacking leakage and sneak leakage currents cause an increase in the leakage energy associated with CMOS+, which is the reason for the high leakage seen in Fig. 6. Fig. 7 exhibits an important property of these logic styles by outlining the data activity regions in which each style is preferable. It also shows how significant the leakage energy is compared to the active energy at the minimum energy point of each logic family. The point at which the characteristic curve for CMOS+ crosses the one for MTCPL shows the activity level needed to compensate for the leakage energy overhead of CMOS+. And, since the active energy for CMOS+ is smaller than for the rest of the logic styles, it becomes the optimum style above that activity level.
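The crossover activity follows from equating the two energy lines E(a) = a·E_active + E_leak. The energy values below are hypothetical, chosen only so that the crossover lands near the 0.17 reported later in the text.

```python
def crossover_activity(e_act_a, e_leak_a, e_act_b, e_leak_b):
    """Data activity at which styles A and B consume equal energy per cycle,
    given per-cycle energies E(a) = a * e_act + e_leak for each style."""
    return (e_leak_b - e_leak_a) / (e_act_a - e_act_b)

# A = MTCPL (higher active energy, lower leakage); B = CMOS+ (the opposite).
a_star = crossover_activity(e_act_a=1.0, e_leak_a=0.05,
                            e_act_b=0.5, e_leak_b=0.135)
```

Below a_star the low-leakage style (MTCPL) wins on total energy; above it the low-active-energy style (CMOS+) does, which is the qualitative picture of Fig. 7.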
Fig. 7. Minimum Energy Vs. Data Activity
The energy-delay characteristics of each logic style are examined in Fig. 8. This figure shows CMOS+ and CMOS as the dominant styles at 25% data activity, with CMOS+ achieving the lowest energy with the best delay. CMOS+ is at least 50% faster at the minimum energy point than both static CMOS and MTCPL. Again, DVL was simulated under the same conditions but was found to have greater energy consumption, making it unfit for comparison. The results demonstrate that CMOS+ is best suited for ultra low power operation with data activity above 0.17 in terms of delay and energy. However, for lower activity MTCPL has better energy characteristics even though CMOS+ is faster. Fig. 8 also demonstrates the rapid increase in energy once the supply voltage goes below the threshold voltage of the technology at the simulated corner. This can be traced back to the leakage energy increase at this point (Fig. 6). It is due to the MOS sub-threshold current characteristics (IDS increases exponentially below VT) and the non-full-rail swing associated with MOS transistors at sub-threshold supplies. As the supply voltage goes below VT, the ON transistors conduct less while the OFF transistors still leak, ending up in a typical voltage divider situation where the resistances are of the same order of magnitude. Consequently, the output of a simple inverter is expected to have a voltage offset from both logic level high and low. This effect gets further amplified since the next stage of logic is driven by non-full-rail inputs. Since the inputs are not full rail and the leakage is exponentially dependent on VGS, the ratio ION/IOFF decreases exponentially. This in turn increases the delay faster than the leakage current is reduced by VDD scaling, resulting in increasing energy consumption per logic operation.
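The ION/IOFF argument can be made concrete with the standard subthreshold current expression I ∝ exp((VGS − VT)/(n·vT)); the coefficient values below are generic textbook numbers, not extracted from the 65nm process used here.

```python
import math

def sub_vt_current(vgs, vt=0.24, n=1.5, v_therm=0.026, i0=1e-7):
    """Subthreshold drain current (A), exponential in VGS - VT."""
    return i0 * math.exp((vgs - vt) / (n * v_therm))

def on_off_ratio(vdd, swing_loss=0.0):
    """ION/IOFF when the 'high' input only reaches vdd - swing_loss."""
    return sub_vt_current(vdd - swing_loss) / sub_vt_current(0.0)

# Full-rail vs. degraded-swing inputs at vdd = 0.20 V (below VT):
full = on_off_ratio(0.20)
degraded = on_off_ratio(0.20, swing_loss=0.05)
```

Each 50 mV of lost swing shrinks ION/IOFF by a fixed exponential factor exp(−0.05/(n·vT)), which is why the degraded swing of one stage compounds through the next.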
Fig. 8. Energy Vs. Delay as a Function of Voltage @ 25% Data Activity
Intuitively, this means the minimum energy point is attained at an optimum supply voltage at or above the threshold voltage. Fig. 8 shows this: the threshold voltage for this technology is ~240mV and the optima are around 300-350mV. This has been shown for CMOS+ [6], and in this work we confirm it is true for all the logic families considered. Thus, from a minimum energy standpoint it is never advisable to drop the supply voltage below VT, regardless of the logic style.
6 Conclusion

Through this study we present a comparison between pass transistor, static CMOS and the hybrid CMOS+ logic styles. We also introduced MTCPL, a modified CPL structure which improves the energy characteristics of CPL through effective leakage control. Through the multi-threshold transistors used for leakage control, MTCPL exhibits the minimum leakage of any logic style, but it is worse than standard CMOS and CMOS+ logic in average energy at data activity rates higher than 0.17. Our analysis thus suggests that the leakage energy benefits of MTCPL make it a good design choice for low activity circuits. So even though CMOS+ remains the optimal design choice for ultra low power circuits due to its major advantages in active energy and delay, MTCPL is a better design choice at lower activities if delay is not a concern, due to the major savings in leakage energy consumption.
References
1. Markovic, D., Nikolic, B., Oklobdzija, V.G.: A general method in synthesis of pass-transistor circuits. Elsevier Microelectronics Journal 31, 991–998 (2000)
2. Zimmermann, R., Fichtner, W.: Low-power logic styles: CMOS versus pass-transistor logic. IEEE Journal of Solid-State Circuits 32(7), 1079–1090 (1997)
3. Shams, A.M., Darwish, T.K.: Performance Analysis of 1-Bit CMOS Full Adder Cells. IEEE Trans. on Very Large Scale Integration (VLSI) Systems 10(1) (2002)
4. Shalem, R., John, E., John, L.K.: A novel low power energy recovery full adder cell. In: Proceedings of the Ninth Great Lakes Symposium on VLSI (March 4-6, 1999)
5. Radhakrishnan, D.: Low voltage, low power CMOS full adder. IEE Proc.-Circuits Devices Syst. 148(1) (February 2001)
6. Vratonjic, M., Zeydel, B.R., Oklobdzija, V.G.: Circuit Sizing and Supply-Voltage Selection for Low-Power Digital Circuit Design. In: Vounckx, J., Azemard, N., Maurine, P. (eds.) PATMOS 2006. LNCS, vol. 4148, pp. 148–156. Springer, Heidelberg (2006)
7. Dao, H.Q., Zeydel, B.R., Oklobdzija, V.G.: Energy Optimization of Pipelined Digital Systems Using Circuit Sizing and Supply Scaling. IEEE Transactions on VLSI Systems 14(2), 122–134 (2006)
8. Kao, J., Chandrakasan, A.: MTCMOS Sequential Circuits. ESSCIRC (September 2001)
9. Song, M., Asada, K.: Design of Low Power Digital VLSI Circuits Based on a Novel Pass-transistor Logic. IEICE Trans. Electron. E81-C(11) (November 1998)
10. Lindert, N., Sugii, T.: Dynamic Threshold Pass-Transistor Logic for Improved Delay at Low Power Supply Voltages. IEEE JSSC 34(1) (January 1999)
11. Calhoun, B.H., Honoré, F.A., Chandrakasan, A.P.: A Leakage Reduction Methodology for Distributed MTCMOS. IEEE JSSC 39(5) (May 2004)
12. Bui, H.T., Wang, Y., Jiang, Y.: Design and Analysis of Low Power 10-Transistor Full Adders Using Novel XOR-XNOR Gates. IEEE Transactions on Circuits and Systems II 49(1) (January 2002)
13. Wang, A., Chandrakasan, A.: A 180mV FFT Processor Using Subthreshold Circuit Techniques. In: Proceedings, IEEE International Solid-State Circuits Conference (2004)
14. Calhoun, B.H., Wang, A., Chandrakasan, A.P.: Device Sizing for Minimum Energy Operation in Subthreshold Circuits. In: IEEE Custom Integrated Circuits Conference (CICC), October 2004, pp. 95–98 (2004)
Design-In Reliability for 90-65nm CMOS Nodes Submitted to Hot-Carriers and NBTI Degradation

CR. Parthasarathy¹, A. Bravaix², C. Guérin¹,², M. Denais¹, and V. Huard³

¹ STMicroelectronics, 850, rue Jean Monnet, 38926 Crolles, France
² L2MP – ISEN, UMR CNRS 6137, Maison des Technologies, 83000 Toulon, France
³ Philips R&D Crolles, 850, rue Jean Monnet, 38926 Crolles, France
[email protected], [email protected], [email protected]
Abstract. A practical and accurate Design-in Reliability methodology has been developed for designs in 90-65nm technologies to quantitatively assess the degradation due to Hot-Carrier injection and Negative Bias Temperature Instability. Simulation capability has been built on top of the existing analog simulator ELDO. Circuits are analyzed using this methodology, illustrating its capabilities as well as highlighting the impacts of the two degradation modes.
1 Introduction

Reliability issues are becoming increasingly difficult to optimize in the latest technology nodes due to diverse device offerings, complex products and overdrive requirements. Design-in Reliability (DiR) seeks to provide a quantitative assessment of reliability at the design stage, thereby enabling judicious margins to be taken beforehand. As a methodology, it refers to the ensemble of modeling, providing simulation capability, and analyzing designs, all with respect to CMOS device reliability. Shrinking device sizes induce a significant electric-field increase which deviates from the ideal scaling law due to process complexity and the type of application. This leads to Hot-Carrier Injection (HCI) at the drain, which reduces current drivability in both N- and P-channel devices in advanced CMOS nodes [1]. Core circuits and Input/Output (IO) devices operate at high temperature, which induces a strong current reduction under the sole effect of a negative gate potential on p-channel transistors, a typical reliability issue addressed as Negative Bias Temperature Instability (NBTI). As both degradation modes are cumulative, they have a net effect on circuit timing and need to be addressed for DiR [2-4]. In this work, we provide details of the DiR methodology developed for designs in 90 to 65nm CMOS nodes, including IO devices, addressing the effects of both HCI and NBTI. First, the modeling aspects of the NMOS HCI and PMOS NBTI phenomena are described. Subsequently, the implementation of an efficient and user-friendly reliability simulation engine inside the analog simulator ELDO is elaborated. Application of the methodology to designs is then presented.

N. Azemard and L. Svensson (Eds.): PATMOS 2007, LNCS 4644, pp. 191–200, 2007. © Springer-Verlag Berlin Heidelberg 2007
2 Modeling of HCI and NBTI Degradation

HCI and NBTI experiments were conducted on different devices of this technology – thin gate-oxide (1.6 ≤ Tox ≤ 2nm) and thick-oxide devices (3-6.5nm) – under different stress conditions, enabling extraction of the model coefficients for HCI and NBTI. The degradation models are presented in Table 1. The parameter ΔD chosen to represent the degradation forms the bridge between the physical degradation and the evolution of the MOS parameters. The models are linearized in time to be suitable for integration during circuit simulation. The boundaries of the integrals represent the window during which the stress assessment is made (during circuit simulation).

2.1 HCI in NMOS

The equation for channel hot-carrier degradation is a classical one, monitored using the substrate current [5]. l_factor represents the length dependence [6] and has been implemented as:
l_factor = A · (L / Lmin)^N    (1)
Lmin is the minimum (nominal) length; L is the length of the device. A and N are empirically determined. The saturation level for HCI, where present, is usually the same for different stress conditions. The transconductance peak Gmmax is taken as the degradation parameter ΔD_HCI.
Fig. 1. Percent change in currents for different operating conditions after NBTI stress in PMOS (left) and HCI stress in NMOS (right)
Figure 1 presents data showing the time evolution of the percentage change in degradation of the Ids-Vds curves (for different Vgs) for NBTI (left) and HCI (right), each under a given set of stress conditions. This degradation of the I-V curves was mapped to the BSIM4 and BSIM3 MOS models. In the case of NBTI, the complete I-V curve set can
be derived from variations in the threshold voltage (Vth0) and the weak-inversion slope. In the BSIM (3/4) models, the shift of Vth0 automatically represents a mobility degradation, the effective mobility having a (Vgs+Vth)-like dependence (MOBMOD=1, 3 in BSIM3), so a separate mobility adjustment is not found to be necessary. In the case of HCI, it may be observed that the linear regime degrades very rapidly, conforming to the view of localized degradation around the drain region which is screened when operating in the saturation region. HCI requires adjustments in the Vth0, U0, UA and NFACTOR MOS parameters. However, as evident from the figure, the very large shifts in the linear regime need to be artificially compensated to emulate the very small shifts seen in the saturation regime. The thick (IO) and thin gate-oxide transistors (high-speed core) exhibit distinct Vgs dependences of the HCI phenomena. In thick-oxide PMOS, HCI degradation at low Vgs (channel-length shortening) is quickly compensated by the NMOS degradation, which dominates in the degraded CMOS cells. In contrast, in thin oxide both devices show a reduction in speed performance due to the saturation-current degradation (and Gmmax). In this case, the model is customized by enabling MOS parameter adjustments in different ways in each of the regimes.
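The mapping of a measured current degradation onto a threshold-voltage shift can be illustrated with a simple sketch. The paper maps the degraded I-V curves onto full BSIM3/BSIM4 model cards; below, a square-law MOSFET stands in for BSIM, and the bias values are illustrative assumptions only.

```python
import math

def vth0_shift_for_idsat_loss(vgs, vth0, loss_pct):
    """Return the Vth0 increase that reproduces a loss_pct drop in Idsat.

    Square-law stand-in for BSIM: Idsat ∝ (Vgs - Vth0)^2, so a relative
    Idsat loss corresponds to shrinking the overdrive by sqrt(1 - loss).
    """
    vov = vgs - vth0                                  # fresh overdrive voltage
    vov_aged = vov * math.sqrt(1.0 - loss_pct / 100.0)  # degraded overdrive
    return vov - vov_aged                             # equivalent ΔVth0
```

For example, with the assumed bias Vgs = 1.1 V and Vth0 = 0.35 V, a 5% Idsat loss maps to a ΔVth0 of roughly 19 mV in this simplified model.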
Fig. 2. (a) HCI degradation in thick-oxide (IO) NMOS transistors for two bias conditions. (b) Vgs-dependence of HCI damage in thin-oxide (1.6nm) and thick-oxide (3nm) NMOSFETs
2.2 NBTI Modeling

From experimental results, NBTI is seen to depend on the applied Vgs – with an acceleration factor γ – and on temperature – with activation energy Ea – and has a time exponent n around 0.15-0.2. Vth is taken as the degradation parameter ΔD_NBTI. The degradation due to NBTI saturates after a given period of time, and the amount of this saturated degradation depends on the applied stress; it can be viewed as the maximum damage (trapping or trap generation) a given set of stress conditions can create. Saturation can be taken into account during simulation by changing the time exponent in parts (piece-wise linear), or continuously (emulating a logarithmic behavior). This work presents a simpler approach where the degradation has a fixed time exponent for a certain amount of stress time, followed by saturation. Capability exists to take into account the NBTI dependence on switching signals (AC) at the level of changing MOS parameters, if needed. The effect of Vds on PMOS, which reduces NBTI degradation and creates a hot-carrier-like degradation for short channel lengths, is not discussed in this paper and can be found in another work [4].

Table 1. HCI and NBTI degradation models used as a function of time, bias and temperature

NBTI:  (ΔD_NBTI)^(1/n) ∝ ∫[tstart→tstop] exp(−Ea/kT) · exp(γ·Vgs) · dt

HCI:   (ΔD_HCI)^(1/n) ∝ ∫[tstart→tstop] (Id/W) · (Ib/Id)^m · l_factor · dt
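The Table 1 models can be sketched numerically. The following is only an illustration of the model structure: the coefficient values (Ea, γ, n, m, A, N) are placeholders, not the calibrated technology values, and the transient waveform is represented by discrete samples.

```python
import math

K_BOLTZMANN = 8.617e-5  # Boltzmann constant, eV/K

def l_factor(L, Lmin=0.065e-6, A=1.0, N=1.5):
    """Empirical channel-length dependence of HCI damage, eq. (1)."""
    return A * (L / Lmin) ** N

def delta_d_nbti(samples, Ea=0.1, gamma=3.0, n=0.17, temp=398.0):
    """Integrate NBTI stress over (dt, Vgs) samples and return ΔD_NBTI.

    Only negative Vgs phases (PMOS under NBTI stress) contribute; the
    stress magnitude enters the acceleration term exp(γ·|Vgs|).
    """
    acc = 0.0
    for dt, vgs in samples:
        if vgs < 0.0:  # NBTI stress phase only
            acc += math.exp(-Ea / (K_BOLTZMANN * temp)) * math.exp(gamma * abs(vgs)) * dt
    return acc ** n  # (ΔD)^(1/n) ∝ integral  =>  ΔD ∝ integral^n

def delta_d_hci(samples, W=1e-6, L=0.065e-6, m=2.9, n=0.5):
    """Integrate HCI stress over (dt, Id, Ib) samples and return ΔD_HCI."""
    acc = 0.0
    for dt, i_d, i_b in samples:
        if i_d > 0.0:
            acc += (i_d / W) * (i_b / i_d) ** m * l_factor(L) * dt
    return acc ** n
```

In the real flow these integrals are accumulated on the fly by the simulator over the [TSTART, TSTOP] window rather than from stored samples.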
3 Reliability Simulation

The existing analog simulator ELDO has been enhanced to communicate with a proprietary Application Programming Interface (API), named the User-Defined Reliability Model (UDRM) [7]. The models described in the previous section are encoded into this UDRM, providing flexibility in terms of stress and parameter-update modeling. For example, for the NBTI mode, the instantaneous Vgs and temperature entering the model of Table 1, together with the time for which the NBTI stress is present (up to a certain fraction of the PMOS bulk voltage), are recorded during circuit simulation and then used to compute the saturation level.
Fig. 3. Schematic of the reliability simulation flow
Figure 3 presents the reliability simulation flow. During the "fresh" simulation, the operating conditions present on the different transistors in the circuit are monitored, and the stress on the devices is computed as a function of their respective operating conditions. This stress assessment is done by the On-The-Fly technique [8], including recoverable (charge detrapping) and permanent components, in parallel to a normal analog simulation, minimizing CPU time consumption and data-management requirements. In the "aged" run, the stress seen during the simulation is extrapolated to the required aging time and corrections are applied to each device's MOS model corresponding to its stress. When simulations are re-run using these degraded MOS models, the effect of aging on the circuit operation becomes apparent. Using a simple syntax, the designer can specify at which "AGE" the circuit must be re-simulated, how many intermediate steps are used for sampling the stress evolution, and the time window in the simulation used to calculate the stress:

.AGE TAGE=value [NBRUN=value]
+ [TSTART=value] [TSTOP=value]
+ [MODE=AGEsim | AGEload | AGEsave]
+ [AGELIB=file_name]

The tool allows storing the library of degraded device models and reloading it in a future simulation. This is useful when the stress in the circuit occurs in one mode and the impact is in another mode, e.g., a power-down mode followed by a start-up mode. The implementation is such that the designer can choose the combination of degradation modes (HCI/NBTI) to be analyzed by setting a variable in the simulation netlist. An example stress simulation is given in Figure 4, showing the HCI (VDS=VDD, VIN rises) and NBTI (Vgsp=VIN−VDD) components for an arbitrary input on an inverter. This trace is produced at the same time as the simulation and helps the designer to view the operating conditions around the occurrences of stress phases during the cycles.
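The two-pass fresh/aged flow can be sketched as follows. This is a simplified stand-in, not the ELDO/UDRM implementation: the stress computed during the "fresh" transient is represented by a single accumulated number, the extrapolation uses an assumed fixed time exponent with hard saturation (the piece-wise treatment of Section 2.2), and only Vth0 is corrected in the aged model card.

```python
def extrapolate_degradation(stress, t_sim, t_age, n=0.17, t_sat=1e7, d_sat=0.05):
    """Extrapolate the stress integral from a short transient to TAGE.

    A fixed time exponent n applies up to t_sat seconds of operation,
    after which the degradation saturates at d_sat (illustrative values).
    """
    rate = stress / t_sim                     # stress accumulated per second
    delta_d = (rate * min(t_age, t_sat)) ** n
    return min(delta_d, d_sat)

def aged_model_card(fresh_card, delta_vth0):
    """Return a degraded MOS model card; only Vth0 is corrected here."""
    aged = dict(fresh_card)
    aged["vth0"] = fresh_card["vth0"] + delta_vth0
    return aged

# Two-pass flow: the "fresh" run yields the stress, the aged card feeds the re-run.
fresh = {"vth0": 0.35, "u0": 0.025}
delta = extrapolate_degradation(stress=1e-9, t_sim=20e-6, t_age=3.15e8)
aged = aged_model_card(fresh, delta)
```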
Fig. 4. Illustration of instantaneous stress computation for an arbitrary stimulus at an inverter input, with substrate (HCI) current during VIN transient and negative Vgsp=-VDD (VIN=0)
4 Silicon Verification

The simulation results are verified first at the transistor level, ensuring that the I-V curves during simulation emulate the I-V curves of a degraded transistor. Subsequently, they
are applied to an inverter tested on silicon, which serves to validate device-level (PMOS and NMOS) degradation while operating as a circuit. Validation of NBTI is reasonably straightforward when it is a DC stress. Validation of HCI damage during switching is difficult due to the high ratios of switching time to clock period (lengthening the time required for degradation to appear), and to device corners and spread. In order to facilitate a minimum set of validations, some non-standard pulses were applied at the input of the inverter at 125°C. Saturation currents were measured during stress by biasing input and output at 0 for the PMOS and at nominal VDD for the NMOS. Examples of correlation data are presented in Figure 5. The good correlation obtained, particularly for NBTI, is due to the measurements at device level (Section 2) and at inverter level having been done under similar conditions (that is, measurement done with a given time lag after stress interruption). It is seen that the frequency does not play a role as long as the high level of the amplitude is less than about VDD/2, which means NBTI can be computed quasi-statically. Also, the effect of frequency and duty cycle at low gate bias remains to be verified. For these reasons, NBTI degradation is treated quasi-statically for stress around nominal VDD. The input stimuli are:

A: IN=0 (supply=VDD1)
B: IN=100KHz pulse; amplitude 0-VDD1/4 (tr = tf = 1us)
C: IN=100KHz pulse; amplitude 0-VDD1/2 (tr = tf = 1us)
D: IN=1MHz pulse; amplitude 0-VDD1/2 (tr = tf = 0.1us)
E: IN=100KHz pulse; amplitude 0-VDD2/2 (tr = tf = 1us)
Fig. 5. Verification of the stressed inverter, measured for different input stimuli at 125°C
4.1 Analysis of Circuits

Circuits have been treated extensively in the literature to analyze hot-carrier effects [8, 9]. In this work, the aim is to analyze both HCI and NBTI and examine how their contributions interact with each other. The inverter was simulated with two switching configurations – a fast (small rise time) and a slow (large rise time) input – under given conditions of supply (which determines the stress) and temperature. The delay contributions due to HCI and NBTI and the net delay degradation measured at nominal VDD are shown in Figure 6. It is seen that the delay degradations for HCI and NBTI add up, and in the case of a slow
input they are in opposite directions and tend to compensate one another. This is caused by both the PMOS and NMOS remaining ON and interacting during the transition. In the simulation tool, we can turn on the effects individually. If only NBTI is considered, the PMOS is weakened. As a result, the output fall transitions are faster than before (a negative change), since the NMOS can discharge faster, and the output rise transitions are slower, since the PMOS is weaker. If only CHC on the NMOS is considered, the situation is the opposite – output fall transitions are slower as the NMOS is weaker, while output rise transitions, driven by the PMOS against a weak NMOS, are faster (again, negative delay changes). The net delay is the sum of these changes, and in this case is seen to be dominated by the CHC damage on the NMOS. In the case of an input with fast transitions (fast input), the transitions are determined by only one of the transistors, since the fast edges switch on one transistor and switch off the other rapidly. In this case, the net delay change is dominated by the damage on the transistor which is switched ON – the NMOS in the case of output fall transitions and the PMOS in the case of output rise transitions.
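The fast-input case above can be sketched with a toy model in which each output edge is driven by a single device and delay scales as 1/Idsat of that device. The percentage losses below are illustrative assumptions, not measured values.

```python
def delay_change(idsat_loss_pct):
    """Relative delay increase when the driving device loses idsat_loss_pct
    of its saturation current (delay ∝ 1/Idsat in this toy model)."""
    return 1.0 / (1.0 - idsat_loss_pct / 100.0) - 1.0

rise_change = delay_change(4.0)   # output rise: PMOS weakened by NBTI → slower
fall_change = delay_change(6.0)   # output fall: NMOS weakened by CHC  → slower
```

In the slow-input case both devices conduct during the transition, so the two contributions partially cancel instead of acting independently, which is why no single-device model captures the net delay there.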
Fig. 6. Delay change due to HCI in NMOS and NBTI in PMOS occurring in an inverter (e.g., IN_R_OUT_F – Input Rising : Output Falling)
The overall implication is that the effect of combined device degradation on circuit performance cannot be predicted generically; instead, it is a function of the operating conditions of the circuit. It is also seen in this case that hot-carrier damage can dominate the change in delay. HCI could also dominate for very long stress times, as the saturation levels are quite different for HCI (higher) and NBTI (lower).

4.1.1 Application to Digital Design Flow

For a typical digital flow using standard cells, it is also important that the degradation data is propagated to gate-level timing analyses. The degradation was characterized for the various timing arcs of a 700-cell core standard-cell library (LP 90nm, 2nm gate-oxide transistors) operating at 1.1V. In this case, the CHC degradation was evaluated to have negligible impact, and only the NBTI impact was estimated, as NBTI is the
relevant degradation mode. Figure 7 shows the distribution of the degradation of the various timings, which formed a new corner for timing assessment at the gate level. The negative timing changes are not a surprise, as they occur in places where an NMOS has to pull down a node balanced by a PMOS weakened by NBTI. Tests on calibration critical paths brought up 3-5% additional timing-margin requirements.
Fig. 7. Distribution of delay changes (% change in timing) due to NBTI stress in different cells in a 700-cell core library
4.1.2 Custom Design Blocks

The reliability analysis was also applied to certain custom blocks, and in this case the flexibility of the tool with respect to different types of analysis plays a role. A two-stage opamp with a PMOS differential input is stressed with asymmetric input voltages for given operating conditions and a given extrapolation time. In this case, the degradation is mainly due to NBTI (long channel lengths are used). The effect of this stress can be seen in different ways, as shown in Figure 8. The DC operating-point calculation shows an input offset, which is approximately equal to the threshold shift of the MOS under stress. A transient simulation of this opamp as a voltage follower highlights this input offset. AC analysis shows that if this offset is taken into account, the open-loop gain and its gain and phase margins do not alter much. NMOS devices in analog circuits could be susceptible to HCI if their lengths are not long enough.

SRAM cells form an important component in a chip. In a classical 6-T SRAM cell, one of the two PMOS is in the NBTI stress configuration. The stored data is also susceptible to NBTI, and the worst case would be the need to read non-changing data, inducing a mismatch between the two PMOS which could deteriorate the sensing function. In the case of SRAMs, the analysis at the memory-cell level is to examine its tendency to fail in terms of Static Noise Margin (SNM) as well as the read and write margins. It is noted that usually, in the case of NBTI, the read margin tends to decrease, while, analogous to the delay improvements in logic buffers, the write margin is seen to increase. At a global level, the impact of the stress is assessed based
on changes in timings, such as access times, since these are specification-driven and need to remain within a fixed range. Hence, the SNM is seen to degrade, but was verified to remain within acceptable limits.

Another example is a phase frequency detector (PFD), which is typically the input block of a phase-locked loop (PLL); it compares the incoming and generated low-frequency clocks and generates correction signals for the charge pump to set the output frequency. Any drift in this block, such as slow drifts under aging, would directly result in a drift of the output frequency, meaning that jitter would be introduced. Figure 9 shows the degradation in the correction pulses when the signals are locked. The simulated VCO was constructed in a 65nm gate-length technology. The initial voltage-to-current conversion is performed by the 5nm gate-oxide devices working at 2.5V, while the oscillator and buffers are constructed from the 1.7nm gate-oxide devices working at 1V. We study the stress at a value VH under power-down conditions. When the PLL is woken up, the VCO sees the degradation, which relaxes with operating time. In the case of the VCO, we need to see its response in terms of its role in the PLL. When the PLL starts after a power-down signal, it undergoes a phase of 'locking' to the input signal, which can typically take about 1 ms. The initial relaxation rate of the VCO could be high and could cause jitter, since the value of VCONT at which the right frequency is obtained would be changing. Eventually, the relaxation slows down, the PLL starts seeing an almost quasi-static state of the VCO, and the PLL locks to a given value of VCONT to achieve its loop balance. For this VCONT, the frequency will change slowly (about 1%) due to relaxation.
Fig. 8. Opamp in stress configuration (top left). The resulting impact analyzed in transient analysis in a voltage follower configuration (bottom left) and in small signal AC analysis to examine the open loop configuration (bottom right).
CR. Parthasarathy et al.
Divider
Divider
VCO + Filter
DN
Voltage (AU)
Voltage (AU)
Voltage (AU)
Voltage (AU)
REF
Fout
UP Charge Pump
INF
Phase Frequency Detector
200
0
Time (s)
2n
Time (AU)
Fig. 9. Top: PFD inside a PLL. Degradation of UP and DN signals after operating at room temperature (bottom left) and after a power down at high temperature (bottom right).
5 Conclusions

DiR methodology has become a sought-after capability for current-generation technologies. We have described the development of a practical DiR methodology and demonstrated its capabilities in providing a quantitative reliability assessment for CMOS designs, taking into account both HCI and NBTI degradation.
References
[1] Leblebici, Y., Kang, S.M.: Hot-carrier reliability of MOS VLSI circuits. Kluwer Academic Publishers, Dordrecht (1993)
[2] Thewes, R., et al.: Microelectronics Reliability 40, 1545–1554 (2000)
[3] Reddy, V., et al.: IEEE IRPS Proc., pp. 248–254 (2002)
[4] Parthasarathy, C.R., et al.: Microelectronics Reliability 46, 1464–1471 (2006)
[5] Hu, C., et al.: IEEE Trans. on Electron Dev. 32(2), 375–385 (1985)
[6] Mistry, K., Doyle, B.S.: IEEE Electron Dev. Lett. 10(11), 500–502 (1989)
[7] ELDO User's Manual, Mentor Graphics Version 6.5_1 Release 2005.1 (2005)
[8] Denais, M., et al.: IEEE IEDM Proc., pp. 109–112 (2004)
[9] Mistry, et al.: IEEE Trans. on Electron Dev. 40(7), 1284–1293 (1993)
[10] Hsu, W.J., et al.: IEEE JSSC 27(3), 247–257 (1992)
Clock Distribution Techniques for Low-EMI Design

Davide Pandini, Guido A. Repetto, and Vincenzo Sinisi

Central CAD and Design Solutions, STMicroelectronics, Agrate Brianza, 20041 Italy
{davide.pandini,guido.repetto,vincenzo.sinisi}@st.com
Abstract. In modern digital ICs, the increasing demand for performance and throughput requires higher operating frequencies of hundreds of megahertz, in several cases exceeding the gigahertz range. Following the technology scaling trends, this request will continue to rise, thus increasing the electromagnetic interference (EMI) generated by electronic systems. The enforcement of strict governmental regulations and international standards, mainly (but not only) in the automotive domain, is driving new efforts towards design solutions for electromagnetic compatibility (EMC). Hence, EMC/EMI is rapidly becoming a major concern for high-speed circuit and package designers. The on-chip clock signals with fast rise/fall times are among the most detrimental sources of electromagnetic (EM) noise, since not only do they generate radiated emissions, but they also have a large impact on the conducted emissions, as the power rail noise localized in close proximity of the toggling clock edges propagates to the board through the power and ground pins. In this work, we analyze the impact of different clock distribution solutions on the spectral content of typical on-chip waveforms, in order to develop an effective methodology for EMC-aware clock-tree synthesis, which globally reduces the EM emissions. Our approach can be seamlessly integrated into a typical design flow, and its effectiveness is demonstrated with experimental results obtained from the clock distribution network of an industrial digital design.
1 Introduction The continuous shrinking of the device feature sizes introduced by aggressive technology scaling trends, and the increasing complexity of digital ICs, require higher operating frequencies with faster clock rates. The clock signals are among the primary sources of EMI in high-performance digital circuits, where the high-speed clock drivers and associated circuitry generate most of the EM emissions in the whole system. The level of these radiations depends on several factors including choice of components (mainly buffers), clock frequency and edge rates, circuit layout, and shielding. Moreover, the simultaneous switching of logic gates and blocks typical of synchronous circuits causes narrow current and voltage glitches on the power distribution network, localized in close proximity of the clock edges. Such power rail noise can propagate to the whole board through the die power and ground pins, thus originating conducted EM emissions. In this work, we analyze the EM emissions introduced by the clock signal and the power rail current noise, and we present a practical methodology to reduce the rise/fall times N. Azemard and L. Svensson (Eds.): PATMOS 2007, LNCS 4644, pp. 201–210, 2007. © Springer-Verlag Berlin Heidelberg 2007
of the clock waveforms that can be exploited during clock-tree synthesis (CTS), and seamlessly integrated into a typical digital design flow. Furthermore, we evaluate the impact of clock edge relaxation against clock skew on both radiated and conducted emissions, and we propose functional guidelines for EMC-aware clocking. This paper is organized as follows: Section 2 describes the major sources of EMI in digital circuits and overviews some relevant previous work, while Section 3 provides a theoretical analysis of the spectral content of typical on-chip digital signals. In Section 4, clock distribution techniques for low EMI are discussed, and our clock-tree synthesis methodology for EMC is proposed. In Section 5, experimental results showing the effectiveness of our approach are presented, and Section 6 summarizes a few conclusive remarks.
2 EMI in Digital Design: Previous Work

In high-performance microcontrollers and cores, clock rates are steadily increasing, forcing rise/fall times to decrease and clocking to be strongly synchronous all over the chip. This means that all clock edges occur at the same time, a goal known as "zero clock skew". Today's design tools support this trend. However, for reduced EM emissions, clock spreading is recommended. Given the periodic nature and fast transition time of clock signals, the energy is concentrated in narrow bands near the harmonic frequencies [1]. Signal edges should be distributed over a timeslot which should be as long as allowed by the operating frequency and circuit delay paths. Although not directly supported by design tools, chip designers should take advantage of this clock-spreading concept to reduce EMI from the system clock drivers and associated circuitry. Spread spectrum clock (SSC) techniques modulating the clock frequency have been proposed in [2] and [3]. Since fast current variations show a large harmonic content, it is well known that a critical detrimental factor for EMI at the chip level is the simultaneous switching noise (SSN) caused by the dynamic power and ground rail current spikes. The L·dI/dt noise (or SSN) originates from package and power-rail inductance and, following the increasing circuit frequencies and decreasing supply voltages, is becoming more and more important with each technology generation [6][8][9]. The impact of power and ground supply-level variations on the EM emissions of digital circuits was discussed in [4], and in [5] an approach to analyze the on-chip power supply noise of high-performance microprocessors was presented.
Because the dynamic SSN is a major source of EMI, one of the critical tasks to develop an effective EMC-aware design methodology is the deployment of a CAD solution that helps designers to reduce SSN by optimizing the placement of on-chip decoupling capacitors (i.e., decaps) in proximity of current-hungry blocks [7]. In [10] a systematic study to understand the effects of off-chip and on-chip decoupling on the radiated emissions was presented. It was demonstrated that off-chip decaps are effective at reducing the EM radiation in the low-frequency range, while they do not have any significant impact in the high-frequency region. In contrast, on-chip decoupling is a valuable technique to suppress the EM radiation in the high-frequency region, and by a careful combination of off- and on-chip decoupling it was possible to achieve a significant suppression of EM emissions over the whole frequency spectrum. A qualitative assessment on the
relative impact of design techniques to reduce on-chip EMI was presented in [11], to help designers make effective decisions at an early stage in the design process.

Fig. 1. Periodic triangular pulse train (power/ground rail noise representation)
3 Spectral Analysis of Digital Signals

The typical on-chip clock signals are periodic, approximately trapezoidal waveforms with finite rise/fall times. Essentially, a digital clock is characterized by the following parameters: rise time tr, fall time tf, period T, amplitude A, and duty cycle τ. To obtain a closed-form expression for the spectrum magnitude, we assume the same rise and fall times, i.e., tr = tf. In this case, the analytical expression of the coefficients of the one-sided (positive frequencies) spectrum magnitude can be computed with the Fourier series [12], and is given by:

c_n^+ = 2·A·(τ/T) · [sin(π·n·τ/T) / (π·n·τ/T)] · [sin(π·n·tr/T) / (π·n·tr/T)]    (1)

In order to extract more qualitative and intuitive information from (1), we replace the discrete spectrum with a continuous envelope by substituting f = n/T, thus obtaining:

envelope = 2·A·(τ/T) · [sin(π·f·τ) / (π·f·τ)] · [sin(π·f·tr) / (π·f·tr)]    (2)
and from a careful analysis of (2) it is possible to assess the impact of some relevant clock parameters on the spectral composition of the clock waveform. It is evident that the high-frequency content of a trapezoidal pulse train is primarily due to the rise/fall times. Signals with a small (fast) rise/fall time will have larger high-frequency harmonics than pulses with larger (slow) rise/fall times. Therefore, to reduce the high-frequency spectrum, it is necessary to decrease the slopes of the clock pulses, as it was discussed in [11]. Ideally, by means of the mathematical model (2), designers can assess the effectiveness of increasing the clock rise/fall times on the high-frequency spectrum of the clock signals, and generate constraints and guidelines for the CTS tool. SSN is another relevant source of EM emissions due to the harmonics generated by fast supply current variations. To derive a closed-form expression, the current waveform on the power/ground rail can be approximated by a periodic triangular pulse train as shown in Figure 1 without introducing significant errors, where the triangular shapes are characterized by their height Imax and base a (which is nearly equal to the
clock rise/fall time), the clock period T, and the duty cycle expressed as the difference with respect to half-period: T/2−δ. Therefore, the one-sided spectrum of the triangular pulse train is given by:
c_n^+ = (4·Imax·a/T) · [sin(n·ω0·a/2) / (n·ω0·a/2)]² · sin(n·ω0·a/2),  n odd
c_n^+ = (4·Imax·a/T) · [sin(n·ω0·a/2) / (n·ω0·a/2)]² · cos(n·ω0·a/2),  n even    (3)

where ω0 = 2πf0, and f0 is the operating frequency. The experimental results presented in [18] demonstrated that the most important parameter in (3) that can be controlled to reduce the harmonic amplitude is the current-noise peak Imax, generated by high-driving-strength components and switching power-hungry blocks. Hence, a significant mitigation of the conducted EM emissions can be achieved by reducing the peak of the high-frequency current that ultimately propagates on the PCB from the power/ground pins of the silicon die.
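Equations (2) and (3) can be evaluated numerically to check the qualitative claims above. The clock parameters below (100 MHz period, 1 V amplitude, 50% duty cycle) and the current numbers are illustrative assumptions, not taken from the paper.

```python
import math

def sinc(x):
    """Unnormalized sinc, sin(x)/x, with the x = 0 limit handled."""
    return 1.0 if x == 0.0 else math.sin(x) / x

def trapezoid_envelope(f, A=1.0, T=10e-9, tau=5e-9, tr=0.5e-9):
    """Magnitude envelope of a trapezoidal clock spectrum, eq. (2)."""
    return abs(2.0 * A * (tau / T) * sinc(math.pi * f * tau) * sinc(math.pi * f * tr))

def triangular_harmonic(n, Imax=0.01, a=0.5e-9, T=10e-9):
    """One-sided harmonic amplitude of the triangular SSN pulse train, eq. (3)."""
    x = n * (2.0 * math.pi / T) * a / 2.0
    factor = math.sin(x) if n % 2 else math.cos(x)
    return (4.0 * Imax * a / T) * sinc(x) ** 2 * factor
```

Slowing the clock edges (larger tr) pushes the second sinc corner down in frequency and attenuates the high-frequency envelope, while halving Imax halves every SSN harmonic, consistent with the two mitigation levers discussed in the text.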
4 Clock Distribution Techniques for EMC-Aware Design

For system timing, an ideal digital clock is an infinite succession of very fast-edged, identical pulses with a perfectly repeated structure. Unfortunately, from an EMC perspective, such a periodic waveform is also one of the principal sources of EMI, because it radiates electromagnetic energy with high spectral density. The spectral power of a simple, repetitive signal such as a clock is concentrated around a relatively small number of discrete frequencies, and consequently a small number of radiated modes. In contrast, random data signals spread their spectral power across a much larger number of radiated modes, each mode having a smaller average power. Less power in each mode is better, because both FCC [14] and EN [15] emission regulations penalize the worst-case (peak) radiation in any given mode. Several techniques have been proposed for modulating the clock frequency and distributing the accumulated spectral power over a large number of new modes, each with reduced energy content. The usefulness of spread spectrum clocking (SSC) in reducing interference is often debated, but in general, modulating the clock reduces the measured peak radiation. Unfortunately, the practical realization of this technique comes at a very high architectural cost. In the automotive domain, EMI requirements are very stringent, but at the same time an increasingly competitive market puts strong pressure on system designers to avoid costly solutions such as SSC. One of the major problems with SSC is that most CPUs rely on a PLL to generate their internal clock, and modulating the reference clock introduces several architectural challenges whose solution not only adds extra cost to the final product, but also impacts the time-to-market. Therefore, we believe that if EMC is among the product requirements, then other techniques, such as:

• Inserting both off- and on-chip decaps;
• Slowing the rise/fall time of the clock;
• Introducing a sustainable amount of clock skew;
Clock Distribution Techniques for Low-EMI Design
205
should be exploited by the system and chip designers, since SSC cannot be safely used in digital systems when the clock must be synchronized to other clock or timing signals, or in the presence of multiple clock domains.

4.1 Clock-Tree Synthesis for Low EMI

The spectral analysis of the typical clock waveform carried out in Section 3 has demonstrated that fast edge rates and high-driving-strength components are among the major sources of EMI, because they directly impact the peak energy (i.e., amplitude) of the harmonics, whereas the operating frequency (i.e., clock rate) only locates the harmonics and does not directly influence EMI. Therefore, a practical and low-cost design solution to reduce on-chip EM emissions is to increase the rise/fall time of the clock by weakening the driving strength of the clock-tree buffers. For high-speed designs with reduced timing budgets, this technique might be difficult to implement. However, the typical frequency range of automotive products (up to a few hundred MHz) allows increasing the rise/fall times of the clock signals (by weakening the clock-tree buffers) without impairing the set-up/hold times and the overall circuit functionality. Hence, we believe this is a practical approach for this class of applications. The clock signal rise/fall time relaxation can be achieved during the clock-tree synthesis step of the design flow. Given the complexity and size of modern SoCs, and increasingly tight skew constraints, the depth and structure of the clock tree are chosen automatically by clock-tree synthesis and clock-buffer insertion tools, in order to equalize the delay to the leaf nodes by balancing the interconnect and buffer delays. The functional elements at the leaf nodes (i.e., registers and flip-flops) are thereby tightly synchronized by the clock distribution structure.
The commercial CTS tools currently used in digital flows are developed to aggressively minimize the skew and the clock transitions, given the available buffer and inverter library. Hence, it is only possible to set the maximum allowable rise/fall time; it is not possible to limit the minimum clock transition time. In order to force the CTS tool to synthesize a clock tree with rise/fall times as close as possible to the maximum value defined to decrease the frequency spectrum magnitude, libraries with weaker buffers must be used. Therefore, with the tools currently available, our CTS approach for reducing the on-chip EM emissions can only achieve partial and quite suboptimal results. However, we believe that even though such results are preliminary, they demonstrate the beneficial impact of this technique for EMI reduction. Obviously, the new rise (fall) time must be validated against the set-up/hold times of the clock-tree leaf registers and flip-flops. It is worth pointing out that our approach can be seamlessly integrated into a standard digital flow, and the resulting circuit has lower harmonics in the clock waveform, less SSN on the power rails, smaller area, and lower leakage. Moreover, it is a systematic technique that provides a global EMI reduction across the overall chip.

4.2 Clock Skew for Low EMI

Clock skew results from the unbalanced structure of the clock distribution network (clock tree, clock grid, etc.), caused by differences in interconnect length and uneven loads.
The clock is unable to reach different leaf nodes (FFs and registers) at the same time because of the different RC delays on the clock paths. As technology moves deeper into the nanometer regime, this problem is exacerbated by the increasing impact of process variations. In digital ICs, the clock network is designed to ensure minimum skew and minimum latency, given power dissipation and clock-buffer area limitations. However, when the skew is minimized, a new problem arises, linked to EMI. In balanced clock distribution networks, the buffers of each level, and the leaf FFs and registers, ideally have exactly the same behavior and toggle at the same time, introducing high dynamic current spikes localized around the fast clock transitions, and consequently more SSN. The impact on SSN stems from the driving strength of the components selected by the CTS tool to minimize the clock skew and rise/fall times, and from the switching activity concentrated in close proximity to the sharp clock edges. Unless full driving strength is required for timing reasons, reduced driving strength is recommended for these components. Moreover, the current/power peak in synchronous designs is usually generated in association with the active clock phase, and such a peak strongly depends on the clock skew. Controlled clock skew can be inserted by using well-known gate/wire sizing, buffer insertion, and routing algorithms. Although not focused on reducing EMI, an approach for clock skew optimization was presented in [16], while [17] described a technique to spread the clock edges among sequential blocks to properly shape the power supply current and reduce the conducted emissions.
5 Experimental Results

In order to develop a methodology for EMC-aware clock-tree synthesis, we evaluated the techniques described in Sections 4.1 and 4.2 on a digital block from an automotive SoC design in 90nm technology. The operating frequency was 100MHz, in the typical range for this class of applications; the duty cycle was fixed at 50%, while the clock signal amplitude was 1V, a typical value for this technology. Another important issue is that commercial CTS tools are not designed to increase the clock transition time or to introduce skew. In fact, through the CTS constraints it is only possible to define the maximum allowable clock rise/fall time, not the minimum tolerable transition time. Therefore, given the buffer and inverter library, the CTS tool will try to minimize the clock rise/fall times while maintaining the nominal operating frequency and duty cycle.¹ The practical solution adopted to reach this goal was to remove the buffers with higher driving strength from the library when performing the clock-tree synthesis for larger transition times. The original clock tree was obtained by setting the rise/fall time for the CTS tool to 0.2ns (standard CTS), while the clock tree for low EMI was synthesized with a nominal transition time of 0.4ns (EMC-aware CTS). Since in practice we could not enforce such transition time values across the clock tree and at the leaf nodes, we computed the distribution of all the clock slopes present in the clock network (i.e., at the buffer input and output pins, and at the clock pins of the registers and flip-flops).
¹ The target skew value for this experiment was set to zero, thus forcing the CTS tool to minimize the skew as much as possible.
Table 1. Clock rise/fall times (distribution peak)

                 Standard²   EMC³
Rise time [ns]   0.10        0.27
Fall time [ns]   0.08        0.23
The rise/fall times were calculated with circuit simulations on the RC representation of the clock tree obtained with a commercial parasitic extraction tool. The distributions of the rise times for the clock networks obtained with standard and EMC-aware CTS are shown in Figure 2; the corresponding distributions of the fall times are similar. Both are summarized in Table 1.

Fig. 2. Rise time distribution (number of transitions vs. rise time [ns], for standard and EMC-aware CTS)

The spectral analysis of the clock waveforms was performed with the FFT [13], considering the rise/fall times corresponding to the peaks of the distributions in Table 1. These transition times are, however, faster than the nominal values defined as maximum rise/fall times in the CTS tool, which are the values necessary to obtain the target reduction in harmonic amplitude. This discrepancy was dictated by the limitations of our CTS tool, which attempts to minimize the clock transition times given the library of buffers and inverters. The harmonic amplitudes of the clock waveforms obtained with standard and EMC-aware CTS, for the rise/fall times reported in Table 1, are summarized in Table 2, which shows that this technique effectively reduces the radiated emissions in the high-frequency spectrum. By slowing down the rise/fall times of the system clock, the CTS tool uses fewer and weaker buffers in the implementation of the clock distribution network. Therefore, there is a significant reduction in active area and leakage power. In all our experiments, the set-up/hold times were always verified. In contrast, the introduction of clock skew did not achieve any significant reduction in the spectral content of the clock waveform. Hence, clock skew techniques bring no practical benefit for the EM radiated emissions.
In order to evaluate the impact of clock skew on the power rail noise, and consequently on the conducted EM emissions, we constructed the IC model shown in Figure 3, where the component values were extracted from a 90nm technology. Note that we neglected the contribution of the on-chip power-strap parasitic inductance. This assumption was verified experimentally: we demonstrated that, for this technology and this package, the package parasitic inductances dominate the on-chip power grid inductances.
² Standard clock-tree synthesis.
³ EMC-aware clock-tree synthesis.
Table 2. Spectrum amplitude [dBμV]

Freq. (GHz)                  0.9    1.1    1.9    2.1    2.9    3.1
Standard                     96.40  94.40  87.80  86.40  80.31  78.72
EMC                          93.20  89.73  76.21  72.96  65.57  63.59
Amplitude reduction [dBμV]   3.20   4.67   11.59  13.44  14.74  15.13
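Since the entries of Table 2 are in dBμV, each amplitude reduction is a plain difference of the Standard and EMC rows, corresponding to a linear amplitude ratio of 10^(Δ/20). A quick consistency check:

```python
standard = [96.40, 94.40, 87.80, 86.40, 80.31, 78.72]
emc      = [93.20, 89.73, 76.21, 72.96, 65.57, 63.59]

# dB values subtract directly; the equivalent linear amplitude ratio is 10^(dB/20)
reduction_db = [round(s - e, 2) for s, e in zip(standard, emc)]
linear_ratio = [10 ** (d / 20.0) for d in reduction_db]
```

Here `reduction_db` reproduces the last row of Table 2; the 15.13dBμV drop at 3.1GHz corresponds to an amplitude roughly 5.7 times smaller in linear scale.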
We considered the same clock tree obtained with the CTS flow previously described, and several leaf FFs with their output capacitive loads, connected between the power and ground lines of the IC model, with a maximum clock-skew value between the FFs of 1ns. The skew was obtained by either reducing the driving strength (i.e., the gate size) of some buffers, or inserting more buffers on the clock-tree paths from the root buffer to the leaf FFs.

Fig. 3. IC model for current spectrum analysis (die model with clock tree, leaf FFs, and decap Cdec; package model with Rpkg, Lpkg, Cpkg, RvDD, RvSS on the VDD and VSS rails)

The typical frequency spectrum of the power rail current noise is reported in Figure 4. Clock skew has a significant impact, reducing the current harmonic peak by about 15dBμA around the ringing frequency, which for this experiment was about 1.5GHz.
Fig. 4. Power rail current spectrum (clock skew)
Fig. 5. Power rail current spectrum (clock rise/fall time relaxation)
In contrast, the clock-edge relaxation has only a negligible impact on the harmonic amplitude of the current noise waveform, and consequently on the conducted EM emissions, as illustrated in Figure 5.
6 Conclusions

In this paper we have discussed the increasing importance of electromagnetic compatibility and electromagnetic emission reduction in electronic system design, and we have proposed two practical techniques for EMC-aware clock-tree synthesis that can be seamlessly integrated into a standard digital design flow. To develop EMC-compliant products, systematic design techniques that reduce the level of on-chip EM emissions globally, across the entire design flow, are necessary. It is worth pointing out that, even if the techniques presented in this work seem straightforward, today there are no CAD tools that specifically address the EMC problem; hence, even implementing these simple techniques may really help in reducing the level of on-chip EM emissions. Our experimental results demonstrated that the CTS approach presented in this work, although based on tools that are not specifically developed for low-EMI design, can nevertheless reduce the global level of EM emissions across the overall circuit, while preserving correct functionality. In particular, we have analyzed two different methods: a) clock signal rise/fall time increase; b) clock skew. In summary:

• Clock edge relaxation is very effective at reducing the global on-chip radiated emissions, but it has a negligible impact on the conducted emissions;
• Clock skew significantly reduces the on-chip conducted emissions, but it does not significantly impact the radiated emissions.
Therefore, an effective approach for low-EMI design would exploit a combination of these two methods according to the other design constraints. Our future work will address this topic. Finally, it is also important to note that, in general, designing for low EMI may introduce extra design cost, and EMC will be traded off against the other typical constraints such as area, timing, and power. Hence, in order to effectively manage this new design objective, current CAD tools should be extended to implement design solutions for EMC, and should include analysis capabilities that allow assessing the effectiveness of such solutions and their impact on EMC compliance before fabrication, to avoid costly re-design. We believe that this target can be successfully achieved through closer cooperation between the EDA community and the semiconductor industry.
References [1] Ott, H.W.: Noise Reduction Techniques in Electronic Systems. J. Wiley and Sons (1988) [2] Hardin, K.B., Fessler, J.T., Bush, D.R.: Spread Spectrum Clock Generation for the Reduction of Radiated Emissions. In: Proc. Intl. Symp. on Electromagnetic Compatibility, February 1994, pp. 227–231 (1994) [3] Kim, J., Kam, D.G., Jun, P.J., Kim, J.: Spread Spectrum Clock Generator with Delay Cell Array to Reduce Electromagnetic Interference. IEEE Trans. on Electromagnetic Compatibility 47, 908–920 (2005) [4] Osterman, T., Deutschman, B., Bacher, C.: Influence of the Power Supply on the Radiated Electromagnetic Emission of Integrated Circuits. Microelectronics Journal 35, 525– 530 (2004)
[5] Chen, H.H., Ling, D.D.: Power Supply Noise Analysis Methodology for Deep-Submicron VLSI Chip Design. In: Proc. Design Automation Conf., June 1997, pp. 638–647 (1997) [6] Larsson, P.: Resonance and Damping in CMOS Circuits with On-Chip Decoupling Capacitance. IEEE Trans. on CAS-I 45, 849–858 (1998) [7] Bobba, S., Thorp, T., Aingaran, K., Liu, D.: IC Power Distribution Challenges. In: Proc. Intl. Conf. on Computer-Aided Design, November 2001, pp. 643–650 (2001) [8] Cha, H.-R., Kwon, O.-K.: An Analytical Model of Simultaneous Switching Noise in CMOS Systems. IEEE Trans. on Advanced Packaging 23, 62–68 (2000) [9] Tang, K.T., Friedman, E.G.: Simultaneous Switching Noise in On-Chip CMOS Power Distribution Networks. IEEE Trans. on VLSI Systems 10, 487–493 (2002) [10] Kim, J., Kim, H., Ryu, W., Kim, J., Yun, Y.-h., Kim, S.-h., Ham, S.-h., An, H.-k., Lee, Y.-h.: Effects of On-chip and Off-chip Decoupling Capacitors on Electromagnetic Radiated Emission. In: Proc. Electronic Components and Technology Conf., May 1998, pp. 610–614 (1998) [11] Pandini, D., Repetto, G.A.: Spectral Analysis of the On-chip Waveforms to Generate Guidelines for EMC-aware Design. In: Vounckx, J., Azemard, N., Maurine, P. (eds.) PATMOS 2006. LNCS, vol. 4148, pp. 532–542. Springer, Heidelberg (2006) [12] Paul, C.R.: Introduction to Electromagnetic Compatibility. J. Wiley and Sons, New York, NY (1992) [13] Press, W.H., Teukolsky, S.A., Vetterling, W.T., Flannery, B.P.: Numerical Recipes in C. Cambridge Univ. Press, Cambridge U.K. (1992) [14] FCC, FCC Methods of Measurements of Radio Noise Emissions from Computing Devices. FCC/OST MP-4 (July 1987) [15] EN 55022:1995 (CISPR 22:1993), Limits and Methods of Measurement of Radio Disturbance Characteristics of Information Technology Equipment [16] Benini, L., Vuillod, P., Bogliolo, A., De Micheli, G.: Clock Skew Optimization for Peak Current Reduction. 
Journal of VLSI Signal Processing 16, 117–130 (1997) [17] Blunno, I., Gregoretti, F., Passerone, C., Peretto, D., Reyneri, L.M.: Designing Low Electromagnetic Emissions Circuits through Clock Skew Optimization. In: Proc. ICECS, September 2002, pp. 417–420 (2002) [18] Hockanson, D.M., Slone, R.D.: Reducing Radiated Emissions from CPUs through Core Power Interconnect Design. In: Proc. Intl. Symp. on EMC, August 2005, pp. 927–932 (2005)
Crosstalk Waveform Modeling Using Wave Fitting

Mini Nanua¹ and David Blaauw²

¹ Sun Microsystems Inc., Austin, TX, USA
[email protected]
² University of Michigan, Ann Arbor, MI, USA
[email protected]
Abstract. Crosstalk analysis has become an essential part of high-performance design in nanometer technologies. Interconnects in nanometer technology have increased resistance and coupling capacitance due to process scaling. The crosstalk pulses are complex and require a new modeling approach. We show that current models such as triangular and Weibull exhibit as much as 31% error in the propagated pulse for some crosstalk waves in nanometer technology. We present a methodology based on wave fitting as a model for crosstalk waves. We compare the accuracy of the proposed wave fitting model with existing wave models: Weibull, isosceles triangular, and trapezoidal. We present simulation results for different gates in a 65nm bulk CMOS technology and compare the error statistics for the propagated crosstalk pulse. We demonstrate that our approach has an average propagated pulse error of less than 5% and improves the overall crosstalk analysis results by at least 67%.
1 Introduction

Crosstalk analysis has become a significant part of the nanometer design cycle. Technology scaling, increasing interconnect density, faster signal transition times, and increasing system frequency have all contributed to an increase in coupling noise in nanometer designs [1]. A typical crosstalk analysis methodology [2]-[4] solves for the height and width of the crosstalk voltage pulse induced on a quiet victim interconnect by the switching of the interconnects coupled to it, called the aggressors. A crosstalk pulse on an interconnect is determined to cause a functional failure in the design if its height is greater than the noise immunity (or noise rejection curve) of the gate(s) connected to that interconnect. The noise rejection curve at the gate input is analogous to noise propagation at its output. All crosstalk analysis schemes in general require pre-characterization of the design library to determine the gate noise rejection curves. A typical noise rejection curve for an inverter in 65nm technology with a fanout of 4 is shown in Fig. 1. The entire region above the noise rejection curve is the failure region. To determine the noise rejection curve of a gate, its input is sensitized and a pulse with varying height and width is applied to it. For a given pulse width, the smallest pulse height that results in the propagation of a noise pulse larger than an acceptable threshold is recorded as the noise immunity for that gate input. As the pulse width increases, the noise immunity reaches saturation, as shown in Fig. 1.

N. Azemard and L. Svensson (Eds.): PATMOS 2007, LNCS 4644, pp. 211–221, 2007. © Springer-Verlag Berlin Heidelberg 2007
Fig. 1. Noise immunity curve example (pulse height [V] vs. pulse width [ps]; active, failure, and saturated regions)
Fig. 2. Wave modeling inaccuracy (plots A and B: noise wave and propagated waves for Model1 and Model2)

Fig. 3. Weibull modeling inaccuracy (noise wave and its Weibull model, with the actual and Weibull propagated waves)
For a design library in a given technology, the pulse width corresponding to noise immunity saturation, Ws, is constant. For pulse widths larger than Ws, a crosstalk wave model with large pulse-width modeling errors can be tolerated, but a good pulse model is necessary for widths less than Ws. This is illustrated in plots A and B of Fig. 2. Two crosstalk pulses with pulse widths of 800ps (at the base) and 112ps, respectively, are modeled with two different isosceles triangular models. Model1 derives its width by conserving the pulse area, and Model2 derives its width by conserving the pulse base width. Even though the two models have large width discrepancies, the model propagated wave heights are within 4% of the actual propagated wave in plot A, whereas the propagated pulses in plot B have errors of 9% and 18%, respectively.
The crosstalk pulse shapes in nanometer designs are complex and have small pulse widths that are not adequately modeled by traditional models such as triangular and trapezoidal. These models, although simple and adequate for wide pulses, are subject to large errors for narrow noise pulses. A Weibull model based on the probability density function has been suggested [5]-[7] for modeling the crosstalk wave shape in recent publications. However, the noise pulses in nanometer technology designs tend to have multiple peaks and complex shapes not easily modeled by Weibull models, resulting in either a high modeling error or an excessively complex wave model. Fig. 3 shows a crosstalk wave in a 65nm bulk CMOS technology induced on a victim net by the switching of four aggressors, resulting in the superposition of four crosstalk events over a short period of time. The resulting crosstalk wave and its Weibull model are shown in Fig. 3. In this case, the Weibull propagated wave has a 31% error in the propagated peak compared to the actual propagated wave. We propose a wave fitting methodology to model crosstalk pulses in a design in a 65nm bulk CMOS technology. This technique has been previously used for enhancing timing results [8], but it has not been extended to crosstalk analysis. We show that a crosstalk pulse (or a set of crosstalk pulses) can be transformed to represent other complex noise pulses in a design for accurate crosstalk analysis. We demonstrate that our scheme exhibits a propagated pulse height error of less than 5% and an improvement of 67%-80% in crosstalk analysis results compared to Weibull-based analysis. Our scheme can also be expanded easily to increase the accuracy of the analysis as crosstalk pulse shapes become more complicated in emerging technologies.
2 Methodology Overview

In our approach to crosstalk wave modeling, we follow the steps outlined below.

1. Generate a representative set of crosstalk waves in the given design, Wc.
2. Generate a set of waves called reference waveforms, Wr, such that it can be used for modeling any wave in Wc.
3. Generate the noise immunity curves for the design library for every waveform in Wr.
4. During crosstalk analysis, for a computed noise pulse (per victim), select the reference waveform in Wr with the least fitting error.
5. Use the noise immunity curves corresponding to that reference waveform for failure determination.
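Steps 4-5 of the flow can be sketched as follows; the data shapes (equally-sampled waveforms, and per-reference immunity curves as callables mapping pulse width to the minimum failing height) are hypothetical illustrations, not the authors' implementation:

```python
def fit_error(w, r):
    # Least-square fitting error between two equally-sampled waveforms.
    return sum((a - b) ** 2 for a, b in zip(w, r))

def check_victim(pulse_samples, height, width, refs, immunity):
    # Step 4: pick the reference waveform with the least fitting error.
    k = min(range(len(refs)), key=lambda i: fit_error(pulse_samples, refs[i]))
    # Step 5: compare the pulse against that reference's immunity curve.
    return height > immunity[k](width), k
```

A pulse is flagged as a violation only against the immunity curve characterized for its best-matching reference waveform.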
3 Other Analytical Models Typical waveform approximation models used in industrial crosstalk analysis are: isosceles triangular, Weibull and trapezoidal. The modeling for each is explained in the following sub-sections. We use these models for error comparison with our
proposed wave fitting model. The model parameters are expressed in terms of the following actual noise pulse parameters (shown in Fig. 4): A = area of the noise pulse, H = height of the noise pulse, Tp = time to peak of the noise pulse, T = time duration of the noise pulse.

Fig. 4. Noise pulse parameters (A, H, Tp, T, T50%)

3.1 Trapezoid Waveform

The trapezoid model is a piecewise-linear model that preserves the noise pulse height. Its base is modeled as twice T50%, and its height is taken as H. The top side of the trapezoid, parallel to its base, is simply the gate delay. Fig. 5 shows an actual noise pulse along with its trapezoidal equivalent and their gate responses.
Fig. 5. Trapezoid model
3.2 Isosceles Triangular Wave Model

For any given actual noise pulse, its isosceles triangular equivalent waveform is represented as a piecewise-linear waveform with voltage height H. The width can be computed either by matching the pulse base width T, or derived from the pulse area A. We use the latter approach and compute the width as:

Wtri = 2·A / H    (1)
For this model to exhibit small errors, it is essential to conserve the pulse area. If the pulse height alone is conserved, the model is prone to large errors. An example of an actual noise waveform and its isosceles equivalent is shown in the top plot of Fig. 6.
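The area-conserving construction of (1) is a three-point piecewise-linear waveform; a minimal sketch (the point-list representation is a hypothetical choice):

```python
def isosceles_triangle(A, H, t0=0.0):
    # Eq. (1): width chosen so the triangle conserves the pulse area A
    # while matching the pulse height H; returns PWL (time, voltage) points.
    w = 2.0 * A / H
    return [(t0, 0.0), (t0 + w / 2.0, H), (t0 + w, 0.0)]
```

By construction, the area of the returned triangle is (1/2)·Wtri·H = A.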
Fig. 6. Isosceles triangular model
The bottom plot shows the gate response when the actual noise pulse and its isosceles triangular model are applied to the gate input.

3.3 Weibull Wave Model

The Weibull probability density function approximation for a noise pulse starting at ti is given by the following formula [7]:
f(t) = a · ((c−1)/c)^((1−c)/c) · ((t−ti)/b)^(c−1) · e^(−(((t−ti)/b)^c − (c−1)/c))    (2)

where,

a = H    (3)

b = (A·c/a) · ((c−1)/c)^((c−1)/c) · e^((1−c)/c)    (4)

c = 1 / (1 + ln(1 − Tp/T))    (5)
Parameters a, b, and c are computed from the actual noise pulse parameters as defined in (3)-(5), and are used to construct the Weibull approximation described by (2). The plots in Fig. 7 show an actual noise pulse with its Weibull model, and the gate response to both.
Fig. 7. Weibull model
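The Weibull construction can be sketched numerically. The exact closed form of the scale parameter b is hard to read in the original typesetting, so the form below is an assumption, chosen so that the model conserves the pulse area A and peaks exactly at the height H (consistent with a = H):

```python
import math

def weibull_params(A, H, Tp, T):
    # Model parameters from area A, height H, time-to-peak Tp, and
    # duration T. Requires Tp/T < 1 - 1/e so that c > 1.
    a = H                                        # peak value
    c = 1.0 / (1.0 + math.log(1.0 - Tp / T))     # shape parameter
    # Scale b (assumed form): chosen so the pulse area equals A.
    b = (A * c / a) * ((c - 1.0) / c) ** ((c - 1.0) / c) * math.exp((1.0 - c) / c)
    return a, b, c

def weibull_wave(t, ti, a, b, c):
    # Weibull approximation of a noise pulse starting at ti.
    if t <= ti:
        return 0.0
    x = (t - ti) / b
    return (a * ((c - 1.0) / c) ** ((1.0 - c) / c)
              * x ** (c - 1.0) * math.exp(-(x ** c - (c - 1.0) / c)))
```

A quick numerical check confirms that the constructed wave peaks at H and integrates to A.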
For all the wave models considered, notice that the propagated noise height error is small in the example plots. This is due to the insensitivity of the propagated pulse height to pulse-width modeling errors for noise waves with a wide width (larger than 200ps). However, the noise waves in nanometer high-performance designs tend to have narrow pulse widths (less than 200ps), and therefore these modeling approaches exhibit large errors. This is further illustrated in the following sections.
4 Comprehensive Waveform Set Generation

In order to generate the set of reference crosstalk waves used in our proposed modeling approach, we must first generate a superset of the crosstalk wave shapes that exist in the design. We used the circuit shown in Fig. 8 to generate noise waveforms in an industrial microprocessor core designed in a 65nm bulk CMOS technology. Interconnects in the testbench circuit are modeled as 3-π RC distributed circuits, as shown in Fig. 8. The model impedance varies in accordance with the length and metal layer input parameters specified. The length is varied from 200um to 500um for lower metal layers and from 500um to 2mm for higher metal layers. These lengths reflect typical values in designs in this technology. The interconnect drivers are modeled as a voltage source in series with a linear device resistance. The device resistance typically varies from 40ohm to 400ohm. This simple driver model is sufficient because the goal is to capture all the unique wave shapes, not necessarily all pulse peak and width combinations. A sample of the waveforms generated is shown in Fig. 9. We also ensure the waveforms generated are realistic by controlling the device resistance and interconnect length combinations to within the range defined by the design slew requirements.
Fig. 8. Test circuit for crosstalk wave generation (victim with four aggressors; 3-π interconnect models with coupling capacitances Cc/6, Cc/3 and ground capacitances Cg/6, Cg/3)

Fig. 9. Sample waveforms
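The 3-π distributed model splits the total interconnect resistance into three series segments, with the half-capacitances of adjacent π sections merged at the internal nodes (hence end caps of C/6 and internal caps of C/3 in Fig. 8). A sketch:

```python
def three_pi(r_total, c_total):
    # 3-pi ladder for one interconnect: three series resistors of R/3,
    # node capacitors of C/6 at the two ends and C/3 at the two internal
    # nodes; total R and C are conserved.
    resistors = [r_total / 3.0] * 3
    capacitors = [c_total / 6.0, c_total / 3.0, c_total / 3.0, c_total / 6.0]
    return resistors, capacitors
```

The same split applies to both the ground capacitance Cg and the coupling capacitance Cc of each line.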
5 Reference Waveform Set

An important step in using the wave fitting model is determining the set of reference crosstalk waves, Wr. This is a unate covering problem, which is NP-hard [9]. We define εpmax as the user-defined maximum allowed error for the propagated pulse. In order to generate Wr, we need to perform two steps: waveform transformation and error matrix generation.
wi ∈ Wc can be transformed to model another wave w j ∈ Wc using
voltage linear scaling to match the pulse height as shown in Figure 10. The pulse width can be matched using the temporal least square transformation for selected equipotential points on the two waveforms such that: wj = a + b ⋅ wi
(6)
The transformation variables a and b are defined in terms of the temporal data points x ∈ wi, y ∈ wj, for n data points, as:
a = (∑y · ∑x² − ∑x · ∑x·y) / (n · ∑x² − (∑x)²)    (7)

b = (n · ∑x·y − ∑x · ∑y) / (n · ∑x² − (∑x)²)    (8)
The equations (7) and (8) are obtained by setting the first derivatives of the least-square error with respect to a and b to zero and solving for a and b, respectively. We used 9 data points distributed over the wave and found this sufficient to reduce the fitting error.
Fig. 10. Waveform transformation (voltage scaling and temporal transformation)
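Equations (7)-(8) in code form, as a sketch with hypothetical sample lists:

```python
def affine_fit(x, y):
    # Least-square a, b for y = a + b*x over n sample points, per (7)-(8).
    n = len(x)
    sx, sy = sum(x), sum(y)
    sxx = sum(v * v for v in x)
    sxy = sum(u * v for u, v in zip(x, y))
    den = n * sxx - sx * sx
    a = (sy * sxx - sx * sxy) / den      # Eq. (7)
    b = (n * sxy - sx * sy) / den        # Eq. (8)
    return a, b
```

With the 9 equipotential time points mentioned in the text, the fit recovers the affine temporal transformation exactly when the two waveforms are related by (6).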
5.2 Error Matrix Generation

In order to determine Wr, we generate the error matrix Ei,j such that every element corresponds to the error in the propagated noise peak, εpi,j, when wave wi is used to model wave wj. An example of Ei,j is shown in Table 1 for 5 sample waves. In order to reduce the error matrix computations, we pre-screen waveforms based on the least-square error value. We also do not simulate the diagonal elements of Ei,j, since these would obviously have 0% error. The steps are therefore summarized as follows:

1. For every wave wi ∈ Wc, select a candidate wave wj ∈ Wc and compute the least-square fitting error.
2. If the fitting error is less than 10%, determine the propagation error εpi,j ∈ Ei,j. If the fitting error is larger than 10%, skip to the next waveform in the set.
3. Repeat the above two steps until all waveforms have been selected as reference waveforms.

We examine the error matrix and select the waves representing the rows such that every column has at least one element smaller than the required error, εpmax. This ensures that the overall error has an upper bound of εpmax. In our experiments we selected an εpmax of 5%. For the sample error matrix shown in Table 1 with an εpmax of 5%, we can construct the base wave set Wb as {2}. The number of computations done to generate the error matrix in Table 1 is 11 instead of 25 (5x5), which is a 44% reduction in the number of computations for this small example. We found an overall 25% reduction in computations for our testcase with 5104 waves.

Table 1. Sample error matrix Ei,j
Wave |   0   |   1   |   2   |   3   |   4
  0  |   0   | 1.1%  | 0.3%  | 0.6%  | 7.6%
  1  | 0.2%  |   0   |   -   | 0.2%  | 6.1%
  2  |   0   | 1.9%  |   0   | 0.2%  | 3.9%
  3  | 0.6%  | 1.1%  | 0.2%  |   0   | 4.9%
  4  | 0.5%  | 1.4%  |   -   | 0.8%  | 7.8%
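The pre-screening and base-wave selection steps above can be sketched as follows. The entries of Table 1 are used as example data; the fit and propagation callbacks are stand-ins for the actual least-square fit and circuit simulation, and the greedy cover is one plausible way to realize the row selection, not necessarily the authors' exact algorithm:

```python
def build_error_matrix(waves, fit_error, prop_error, fit_tol=10.0):
    """Steps 1-3: simulate eps_p[i][j] only when modeling wave j with
    reference wave i passes the least-square pre-screen (fit_tol, in %)."""
    E = {}
    for i in waves:
        for j in waves:
            if i == j:
                E[(i, j)] = 0.0               # diagonal: 0% error by definition
            elif fit_error(i, j) <= fit_tol:
                E[(i, j)] = prop_error(i, j)  # the expensive simulation
            # else: entry skipped, shown as '-' in Table 1
    return E

def select_base_waves(E, waves, eps_max=5.0):
    """Greedy cover: pick reference waves until every column of E has at
    least one propagation error not exceeding eps_max (%)."""
    uncovered, base = set(waves), []
    while uncovered:
        best = max(waves, key=lambda i: sum(
            1 for j in uncovered if E.get((i, j), float("inf")) <= eps_max))
        covered = {j for j in uncovered
                   if E.get((best, j), float("inf")) <= eps_max}
        if not covered:
            raise ValueError("required error bound not achievable")
        base.append(best)
        uncovered -= covered
    return base

# Entries of Table 1 (reference wave i -> modeled wave j), in percent;
# missing keys correspond to the pre-screened '-' entries.
table = {(0, 0): 0.0, (0, 1): 1.1, (0, 2): 0.3, (0, 3): 0.6, (0, 4): 7.6,
         (1, 0): 0.2, (1, 1): 0.0, (1, 3): 0.2, (1, 4): 6.1,
         (2, 0): 0.0, (2, 1): 1.9, (2, 2): 0.0, (2, 3): 0.2, (2, 4): 3.9,
         (3, 0): 0.6, (3, 1): 1.1, (3, 2): 0.2, (3, 3): 0.0, (3, 4): 4.9,
         (4, 0): 0.5, (4, 1): 1.4, (4, 3): 0.8, (4, 4): 7.8}
base = select_base_waves(table, list(range(5)), eps_max=5.0)
```

With εpmax = 5%, the greedy pass selects wave 2 alone, matching the base set Wb = {2} derived in the text.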
6 Noise Analysis with Wave Fitting

In our proposed analysis methodology, as outlined in Section 2, the proposed wave model is used for library characterization and for the determination of violations during crosstalk analysis. Fig. 11 shows the 3σ error distribution in propagated pulse height for different gate types with different wave models for narrow pulse widths (<200 ps).
Crosstalk Waveform Modeling Using Wave Fitting
219
The mean of each distribution is also marked in the plot. Our proposed wave model is denoted XWave or XWave2, depending on whether one or two reference waves are used. The average error of our proposed model is 0.01% for an inverter and 1.5% for a nand2 or a nor2, respectively. All other models have a much wider error distribution (±40%) than our proposed model.
Fig. 11. Propagated pulse height error distribution for narrow pulses (panels: Inv, Nand2, Nor2; horizontal axis: %Error; mean error marked)
We analyze 5105 actual noise waves in an industrial microprocessor core designed in a 65nm bulk CMOS technology. We record the propagated waves when the input waves are modeled with the different models discussed in Section 3 and with our proposed wave-fitting method, and use them to determine violations. For this comparison we used three reference waveform sets comprising two, four, and six waves, respectively. The noise waves in the reference wave sets are shown in Fig. 12. We incrementally added reference waves to expand the original set and improve the accuracy of the results. Table 2 shows the number of violations reported with each model, as well as the SPICE-reported violations for the actual waveforms, for three different classes of receiver circuits: inverter, nand, and nor.
Fig. 12. Waves in the reference wave set (voltage vs. time)
The noise violations for each modeling technique reported in Table 2 are divided into: violations common with the SPICE results (Common), violations reported by
SPICE but filtered out by the model (Filtered), and violations reported by the model but not by SPICE (False). Triangular wave modeling filters the largest number of real violations for all receiver types, whereas trapezoidal modeling reports a large number of false violations for all receiver types. The Weibull model also reports a small but significant number of filtered and false violations. Our proposed wave-fitting models (XWav2, XWav4, and XWav6) improve on the Weibull model by reducing the number of false violations by 67%–80%, depending on the receiver type. Our proposed model also does not filter any true violations. With this example we also demonstrate that a large number of wave shapes can be adequately represented by a very small number of representative wave shapes. In addition, we illustrate that our approach can be expanded to increase accuracy.

Table 2. Noise Failures for 65nm Design
Circuit | Wave Model | Common | Filtered | False
INV     | Tri        |  2832  |   163    |    0
INV     | Weibull    |  2991  |     4    |   30
INV     | Trapz      |  2995  |     0    |  132
INV     | XWav2      |  2979  |    16    |    7
INV     | XWav4      |  2991  |     4    |    7
INV     | XWav6      |  2995  |     0    |    7
NAND    | Tri        |  1514  |   277    |    3
NAND    | Weibull    |  1761  |    30    |   48
NAND    | Trapz      |  1787  |     4    |  312
NAND    | XWav2      |  1791  |     0    |   82
NAND    | XWav4      |  1791  |     0    |   81
NAND    | XWav6      |  1791  |     0    |   16
NOR     | Tri        |  2658  |   140    |    8
NOR     | Weibull    |  2797  |     1    |   85
NOR     | Trapz      |  2794  |     4    |  253
NOR     | XWav2      |  2798  |     0    |   88
NOR     | XWav4      |  2798  |     0    |   85
NOR     | XWav6      |  2798  |     0    |   16

SPICE-reported violations: INV 2995, NAND 1791, NOR 2798.
We do not include the inaccuracies in propagated pulse width in our results, because all crosstalk analysis schemes known to us determine violations based on the propagated pulse height. Analysis schemes that rely on propagation simply add the propagated pulse height to the coupling pulse height for worst-case temporal alignment.
7 Conclusions

We demonstrated the need for a better crosstalk wave model for nanometer designs, due to the complex wave shapes that exist in such designs. We showed that current modeling approaches adequately represent wide pulses but can have as much as 31% error in propagated pulse height for narrow crosstalk pulses. We compared crosstalk analysis results with different modeling approaches and demonstrated that analysis based on our wave-fitting technique does not filter true violations and reports 67%–80% fewer false violations.
Weakness Identification for Effective Repair of Power Distribution Network

Takashi Sato, Shiho Hagiwara, Takumi Uezono, and Kazuya Masu
Tokyo Institute of Technology, Yokohama 226-8503, Japan
Abstract. A procedure called box-scan search, which identifies possible weaknesses in the power distribution network of an LSI, is proposed. In the procedure, node pairs that have a large voltage difference but are located in close proximity are considered good candidates for improved connections using additional wire. The virtual box is the grid space within which the node voltages are compared. Scanning the node voltages of the virtual boxes generates a prioritized list of fixing-point candidates. Experimental results show the effectiveness of the proposed procedure in pointing out node pairs that require a low-impedance connection.
1 Introduction

On-chip power supply integrity analysis, typically conducted as voltage drop analysis, has become one of the most important verification steps in modern LSI design. Recent design flows strongly rely on information obtained through voltage drop analysis [1][2]. The supply voltage of each logic instance is not just constrained by the specification, but is also used in timing analysis. As the technology node advances, voltage drop constraints become even tighter due to the increase in power density and the lowering of the supply voltage. Once a violation related to the power distribution network (PDN) is found late in the design stages, significant revision is required to fix it. One of the difficulties in fixing the violation is that it is hard to determine the most efficient location to modify. When a design is interconnect limited, wiring tracks are mostly filled and there is little room for change. Other difficulties include: it is nontrivial to tell whether newly added wires are effective, and it is time-consuming to evaluate the improvement quantitatively. For these reasons, repairing the PDN tends to become a painful and expensive task. A good guide for PDN fixing is also useful for tweaking a synthesized PDN at an early design stage. A methodology that clearly indicates the exact location to fix and that accurately estimates the voltage drop recovery is keenly needed. There is little literature that handles potential weakness prediction in the PDN. Leung proposed a post-layout addition of redundant PDN connections that serve as dummy metal fill [3]. Rauscher et al. discuss a noise reduction algorithm for the power supplies of I/O circuits [4]; it reduces the oscillating voltage amplitude by inserting a damping resistor between the bonding pad and the on-chip supply ring. Voltage drop fixing, as well as signal setup and hold time fixing, is discussed in [5]. However, it does not mention

N. Azemard and L. Svensson (Eds.): PATMOS 2007, LNCS 4644, pp. 222–231, 2007. © Springer-Verlag Berlin Heidelberg 2007
how to determine the optimal fixing point, nor the required resistance of the additional connection. In this paper, we contribute to PDN fixing as follows.

– We address the non-obviousness of effective PDN fixing.
– We propose a procedure that efficiently identifies potential weaknesses.
– We propose a procedure that estimates the relationship between added wire resistance and voltage drop improvement.

This paper is organized as follows. In Sec. 2, the issues associated with PDN repair are discussed, and the repair problem is formally defined. In Sec. 3, we propose a box-scan search procedure for strengthening the PDN; estimation of the voltage recovery of a repaired PDN is also discussed. Experimental results are presented in Sec. 4. Concluding remarks are given in Sec. 5.
2 PDN Weakness Identification and Fixing Problem

2.1 Issues in PDN Fixing

In this section, we first describe the issues associated with PDN fixing. Figure 1 shows a simple schematic diagram of an example PDN. In the figure, the resistors represent the equivalent parasitic wire resistance of the PDN, and the current sources represent the power consumption of the circuit blocks. The objective here is to improve the maximum voltage drop in the circuit by connecting a pair of nodes with an additional resistor. In a real design, the additional resistor corresponds to a via and/or wire connection.
Fig. 1. A simple example of the PDN fixing problem (two rows of nodes, A0–A4 and B0–B4, fed from 1 V supplies through parasitic wire resistances of 1–13 Ω, with load currents of 1–3 mA)
Table 1 summarizes the lowest node voltage in the circuit when a pair of nodes is connected by an additional resistor with a value ranging from 1 to 50 Ω. The first column shows the connected node pairs. The connections listed in the table are limited such that one node is selected from Ai and the other from Bj (i, j = 0, . . . , 4). In the original circuit in Fig. 1, the worst voltage drop is 29.72 mV, at node B2. The selection of the node pair is critically important, since the improvement differs substantially.
Table 1. The worst voltage drop in the example circuit when different pairs of nodes are connected by a resistor (unit: mV). An asterisk (*) marks the best voltage recovery for each resistance used.

Node pair | 1 Ω    | 5 Ω    | 10 Ω   | 50 Ω   | distance
(A0, B0)  | 29.69  | 29.71  | 29.71  | 29.72  | 0
(A1, B1)  | 27.35  | 27.82  | 28.21  | 29.14  | 0
(A2, B2)  | 24.55  | 25.66  | 26.51  | 28.52  | 0
(A3, B3)  | 23.80  | 25.45  | 26.55  | 28.68  | 0
(A4, B4)  | 29.32  | 29.57  | 29.63  | 29.70  | 0
(A0, B1)  | 20.89  | 23.51  | 25.19  | 28.29  | 1
(A1, B0)  | 29.95  | 29.87  | 29.83  | 29.75  | 1
(A1, B2)  | 23.52  | 24.90  | 25.95  | 28.34  | 1
(A2, B1)  | 27.82  | 28.19  | 28.49  | 29.24  | 1
(A2, B3)  | 27.24  | 27.81  | 28.23  | 29.18  | 1
(A3, B2)  | 21.56  | 22.41  | 24.19  | 27.84  | 1
(A3, B4)  | 30.27  | 30.06  | 29.95  | 29.78  | 1
(A4, B3)  | *17.97 | 20.66  | 23.44  | 27.97  | 1
(A0, B2)  | 18.03  | *17.93 | *20.36 | *26.79 | 2
(A1, B3)  | 26.49  | 27.26  | 27.82  | 29.05  | 2
(A2, B0)  | 29.94  | 29.87  | 29.83  | 29.75  | 2
(A2, B4)  | 30.49  | 30.26  | 30.11  | 29.84  | 2
(A3, B1)  | 25.79  | 26.69  | 27.37  | 28.87  | 2
(A4, B2)  | 18.03  | 17.96  | 20.41  | 26.87  | 2
(A0, B3)  | *17.97 | 20.64  | 23.25  | 27.91  | 3
(A1, B4)  | 30.45  | 30.22  | 30.08  | 29.83  | 3
(A3, B0)  | 29.87  | 29.81  | 29.78  | 29.73  | 3
(A4, B1)  | 21.30  | 23.79  | 25.39  | 28.35  | 3
(A0, B4)  | 29.18  | 29.50  | 29.59  | 29.69  | 4
(A4, B0)  | 29.73  | 29.72  | 29.72  | 29.72  | 4
Considering the physical layout, the distance between nodes is important, since it determines the achievable resistance of the additional connection. If the nodes are far apart, connecting them with a small parasitic resistance is infeasible. In the table, the last column gives the pseudo distance |i − j| between nodes. In this example, connecting nodes (A3, B3) is preferable to nodes (A3, B4), since we should have a better chance of connecting them with a shorter wire. In Table 1, asterisks mark the best result for each additional resistance. For example, (A4, B3) and (A0, B3) are the best node pairs for 1 Ω connections. When we limit the distance to at most 1, the best choice becomes the single pair (A4, B3), and the second best is (A0, B1), for 1 Ω. If a 10 Ω connection is what we can achieve, (A4, B3) is still the best choice, but the second best changes to (A3, B2). In general, the achievable resistance is determined by the distance between the nodes as well as the congestion of the routing tracks between them. It is surprising to see that the worst voltage does not necessarily decrease as the additional resistance decreases. This happens when the node having the worst voltage changes depending on the resistance of the additional wire. A typical example is the pair (A0, B2): when a 1 Ω resistor is used, the worst voltage is found at node A2, and the worst node changes to B1 when a 5 Ω resistor is used. In this example, the average voltage drops over all nodes for the 1 Ω and 5 Ω resistors are 12.56 mV and 15.23 mV, respectively, implying that the connection with lower resistance still yields the better result overall, as expected. This means that a strategy that randomly adds wires to fill empty routing tracks may not be effective for all circuits.¹ During PDN fixing, previously added wires may become obstacles for other supply-rail connections that might better improve the overall PDN quality. That is why prioritization is required.

¹ A strategy that fills empty tracks haphazardly may end up with discouraging results in more advanced technology nodes, since more power domains are expected to be used in the pursuit of lower power consumption.
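Under the distance constraint discussed above, picking repair candidates from Table 1 reduces to a simple filtered ranking. The sketch below uses values transcribed from the 1 Ω column of Table 1 (pairs with distance ≤ 1 only); the function name and data layout are illustrative:

```python
# Worst-case voltage drop (mV) after adding a 1-ohm connection, and the
# pseudo distance |i - j|, both transcribed from Table 1.
drop_1ohm = {
    ("A0", "B0"): (29.69, 0), ("A1", "B1"): (27.35, 0), ("A2", "B2"): (24.55, 0),
    ("A3", "B3"): (23.80, 0), ("A4", "B4"): (29.32, 0), ("A0", "B1"): (20.89, 1),
    ("A1", "B0"): (29.95, 1), ("A1", "B2"): (23.52, 1), ("A2", "B1"): (27.82, 1),
    ("A2", "B3"): (27.24, 1), ("A3", "B2"): (21.56, 1), ("A3", "B4"): (30.27, 1),
    ("A4", "B3"): (17.97, 1),
}

def best_pairs(table, max_distance):
    """Rank candidate pairs by the resulting worst-case drop, keeping only
    pairs whose distance allows a low-resistance connection."""
    feasible = [(drop, pair) for pair, (drop, dist) in table.items()
                if dist <= max_distance]
    return [pair for drop, pair in sorted(feasible)]

ranking = best_pairs(drop_1ohm, max_distance=1)
```

For this data the ranking starts with (A4, B3) and then (A0, B1), matching the observation in the text.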
2.2 Assumptions and Problem Formulation

We define the PDN weakness identification problem under the following assumptions.

– The voltage drop of each node is defined using static voltage drop analysis; temporal change of the voltage waveform is not considered.
– The layout design is available — i.e., both the PDN wires and the circuit power are modeled deterministically using electrical equivalent models.

A chip contains nodes labeled ni (i = 1, . . . , N), where N is the total number of nodes. The voltage drop of node i is defined as ΔVni = Vdd − Vni, where Vdd is the supply voltage. Define ΔVmax as the maximum voltage drop in the chip:

ΔVmax = max over ni (ΔVni)    (1)

Then, the PDN weakness fixing problem is defined as: given a PDN, minimize the maximum voltage drop ΔVmax by connecting a pair of nodes i and j using a resistor of value Ra. The objective of this work is to approximately solve this problem by separating it into two sub-problems: 1) PDN weakness identification and 2) voltage drop recovery estimation.
3 Solving the PDN Fixing Problem

3.1 PDN Weakness Identification: Box-Scan Search

We propose a straightforward procedure called box-scan search (BSS), which approximately but efficiently identifies the weaknesses of the PDN where strengthening is required. The idea of the virtual box (VB) used in the BSS is conceptually illustrated in Fig. 2. The wires in the figure represent a part of the PDN. Ideally, all wires would provide the same voltage, Vdd, but in reality the voltages differ because of the voltage drop. Wire H (H1, H2, H3) represents a global power supply wire that uses an upper-level tier. Wires A through E are local power distribution wires. Wires A and B are connected to the global trunk wire using vias, but wires C, D, and E have no via connection to
Fig. 2. Concept of the box-scan search (BSS) and the virtual box (VB): global wire H (H1–H3) on an upper tier, local wires A–E with load currents IA–IE, and virtual boxes VB1–VB3
wire H in the illustrated area. Current sources, whose sizes represent the relative power consumption of the circuit, are connected to the local wires. Currents IA and ID are larger than IB, IC, and IE, which results in the largest voltage drop at node D. In the BSS procedure, we choose the node pair having the largest voltage difference within each VB. The VB is a spatially discretized grid over the layout; in Fig. 2, the VBs are the rectangular boxes drawn with dashed lines. In the BSS, all VBs are scanned to find the maximum voltage difference. Once the size of the VB is determined, nodes within a VB are considered close, and an upper bound on the connection resistance is easily calculated. In this example, the maximum- and minimum-voltage nodes in VB1 are H1 and A, respectively, so the largest voltage difference for VB1 is calculated as δVVB1 = VH1 − VA. Similarly, δVVB2 = VH2 − VC and δVVB3 = VH3 − VD, where VX denotes the voltage of node X. Considering the via connections and the relative currents, δVVB3 is the largest. The aim of the BSS is to generate a sorted list of voltage differences that are potentially caused by missing vias or missing connections, as at node D in VB3. Let us now define the BSS procedure a little more formally. Assume the chip size is (Lcx, Lcy, Lcz), where Lcx and Lcy are the x- and y-directional chip sizes and Lcz is the total wire thickness, defined by the distance between the uppermost and lowermost metal layers used for the PDN. Let the size of a VB be (Lvx, Lvy, Lvz). Then the chip can be discretized into Nbx, Nby, and Nbz VBs in the x-, y-, and z-directions, respectively, where Nbx = int(Lcx/Lvx) + 1, Nby = int(Lcy/Lvy) + 1, and Nbz = int(Lcz/Lvz) + 1. Each VB then has a unique name VB(cx, cy, cz), where cx = 1 . . . Nbx, cy = 1 . . . Nby, and cz = 1 . . . Nbz. The BSS generates a prioritized list of fixing-required node pairs using the following procedure.

1. Calculate the numbers of divisions (Nbx, Nby, Nbz).
2.
For all VB(cx, cy, cz) with cx = 1..Nbx, cy = 1..Nby, cz = 1..Nbz, calculate δVVB(cx,cy,cz), defined as the difference between the maximum and the minimum node voltage in the VB.
3. If δVVB(cx,cy,cz) exceeds a predefined threshold, put the node pair and δV into a fixing candidate set S.
4. Sort the members of the candidate set S in descending order.

In step 2, calculating δVVB(cx,cy,cz) is efficient, since finding the maximum and minimum node voltages has linear complexity in the number of nodes. In practice, designers want to repair more than one part of a chip in order to make the PDN as robust as possible. For fixing at multiple points, a greedy approach can be used that repairs in the order given by the candidate list S.

3.2 Voltage Recovery Estimation

The nodal equation for a given PDN has the following form:

G V = J    (2)

Here, G is the conductance matrix of the PDN, V is the node voltage vector, and J is the current vector. We divide the nodes into two sets: S, the nodes selected by the BSS, and
U − S, the nodes not selected by the BSS, where U is the set of all nodes in the PDN. After node ordering, Eq. (2) can be transformed into

[ G0  B^T ] [ V0 ]   [ J0 ]
[ B   Gs  ] [ Vs ] = [ Js ]    (3)
In the above equation, the equations and variables are sorted according to the priority given by the BSS. G0 is the conductance matrix for the nodes that are not selected (i.e., the nodes in U − S), and Gs is the conductance matrix for the selected nodes in S. B and B^T are conductance matrices representing the connections between selected and non-selected nodes. V0 and Vs are the node voltage vectors of the non-selected and selected nodes, respectively, and J0 and Js are the corresponding current vectors. As we are interested in connecting the best and worst voltage nodes, we eliminate the other, non-selected nodes:

(Gs − B G0^-1 B^T) Vs = Js − B G0^-1 J0    (4)
Connecting the best and worst nodes in S superposes a conductance stamp matrix C for the additional resistance between the nodes connected by the added wire:

G′ Vs = J′    (5)

where G′ = Gs − B G0^-1 B^T + C and J′ = Js − B G0^-1 J0. Equation (5) gives an estimate of the voltage recovery versus the connecting resistance. It can be solved easily, since the number of equations equals the number of nodes in S — at most twice the number of VBs — which is in general much smaller than the size of the original matrix. The elimination of the non-selected node equations (the nodes in U − S) restricts the additional resistor to connecting nodes in S only; therefore, the voltage estimate obtained from the above equation is not necessarily exact, but an approximation. Several parameters of the BSS procedure can be controlled.

– For chips in recent technologies, the metal layers close to the top are dedicated to power distribution. To find design flaws such as missing vias from those layers down to the local distribution layers, division in the z-direction is unnecessary in many cases.
– There is a trade-off between the size of the VB and the parasitic resistance of the added connections. If the VB is too large, the best and worst nodes may be distant and hard to connect; a larger VB also enlarges the error of the voltage recovery estimate, since the best or worst voltage node may be very local. On the other hand, if the VB is too small, we may overlook a "globally better" solution. There should be an appropriate VB size, but obtaining it is left as future work. Using incrementally larger VB sizes is a simple but effective approach.
– In Fig. 2, adjacent VBs do not overlap. To find more desirable connections around the borders, it would be better to allow slight overlap between VBs.
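The reduction of Eqs. (3)–(5) can be illustrated on a toy three-node network. The conductance values, the partition, the small Gaussian-elimination helper, and the 1 Ω repair resistor are all illustrative assumptions, not the authors' implementation:

```python
def solve(A, b):
    """Gauss-Jordan elimination with partial pivoting (toy sizes only)."""
    n = len(b)
    M = [row[:] + [bi] for row, bi in zip(A, b)]
    for c in range(n):
        p = max(range(c, n), key=lambda r: abs(M[r][c]))
        M[c], M[p] = M[p], M[c]
        for r in range(n):
            if r != c and M[r][c] != 0.0:
                f = M[r][c] / M[c][c]
                M[r] = [x - f * y for x, y in zip(M[r], M[c])]
    return [M[r][n] / M[r][r] for r in range(n)]

# Toy PDN, Eq. (2): G V = J.  Node 0 is eliminated; the selected set S = {1, 2}.
G = [[3.0, -1.0, -1.0],
     [-1.0, 2.0, -1.0],
     [-1.0, -1.0, 3.0]]
J = [1.0, 0.0, 2.0]
V_full = solve(G, J)                       # reference: full solve

# Blocks of Eq. (3): G0 couples the eliminated node, Gs the selected ones,
# and B holds the coupling between the two sets.
G0_inv = 1.0 / G[0][0]
B = [G[1][0], G[2][0]]
Gs_red = [[G[r][c] - B[r - 1] * G0_inv * B[c - 1] for c in (1, 2)]
          for r in (1, 2)]                 # Eq. (4): Gs - B G0^-1 B^T
Js_red = [J[1] - B[0] * G0_inv * J[0],
          J[2] - B[1] * G0_inv * J[0]]     # Eq. (4): Js - B G0^-1 J0
Vs = solve(Gs_red, Js_red)                 # Eq. (5) before stamping C

# Stamping a hypothetical 1-ohm repair resistor between the two selected
# nodes (matrix C) and re-solving estimates the voltages after the fix.
Ra = 1.0
Gs_fix = [[Gs_red[0][0] + 1 / Ra, Gs_red[0][1] - 1 / Ra],
          [Gs_red[1][0] - 1 / Ra, Gs_red[1][1] + 1 / Ra]]
Vs_fix = solve(Gs_fix, Js_red)
```

Since only exact elimination is applied here, the reduced solve reproduces the selected node voltages of the full system; in the real flow the approximation arises because additional resistors are restricted to nodes in S.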
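The box-scan procedure of Sec. 3.1 (steps 1–4) can be sketched as follows. Node coordinates, names, and voltages are assumed to come from a prior DC analysis; the data structures and toy example below are illustrative:

```python
def box_scan_search(nodes, box_size, threshold):
    """Steps 1-4 of the BSS: bucket nodes into virtual boxes, take the
    max/min-voltage pair per box, and return candidates sorted by delta-V.

    `nodes` is a list of ((x, y, z), name, voltage) tuples; coordinates
    and box_size share one unit (e.g. micrometres).
    """
    boxes = {}
    for (x, y, z), name, v in nodes:
        key = (int(x // box_size[0]), int(y // box_size[1]),
               int(z // box_size[2]))
        boxes.setdefault(key, []).append((v, name))
    candidates = []
    for members in boxes.values():
        v_min, lo = min(members)
        v_max, hi = max(members)
        if v_max - v_min > threshold:          # step 3: threshold filter
            candidates.append((v_max - v_min, hi, lo))
    candidates.sort(reverse=True)              # step 4: descending delta-V
    return candidates

# Toy example: node D shares a box with global node H3 but sits 40 mV lower,
# mimicking the missing-via situation of Fig. 2.
nodes = [((10, 10, 5), "H1", 1.00), ((12, 8, 0), "A", 0.99),
         ((60, 10, 5), "H3", 1.00), ((62, 8, 0), "D", 0.96)]
cands = box_scan_search(nodes, box_size=(50, 50, 10), threshold=0.005)
```

The top-ranked candidate pairs the weak node D with the nearby strong node H3, which is exactly the kind of low-impedance connection the procedure is meant to suggest.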
4 Simulation Examples

Figure 3 shows the circuit of Fig. 1 again, together with the VBs. Referring to Table 1, we investigate the trade-off between resistance (node distance) and voltage recovery. Figure 4(a) shows the maximum voltage drop for different node pairs. The connection between (A3, B3) yields a smaller maximum voltage drop than any other 0-distance connection over a wide range of resistances below 9 Ω. This shows that the VB provides good guidance on fixing candidates.
Fig. 3. Example circuit of Fig. 1 with the VBs (virtual boxes VB0–VB4)
Fig. 4. (a) Change of the worst node voltage as a function of added resistance for various node pairs. (b) Node voltages as functions of the added resistance between (A0, B2).
Figure 4(a) also includes the worst node voltage for node pairs with distances 1 and 2. Although (A0, B2) works well for resistances larger than 5 Ω, the maximum voltage drop in the circuit saturates beyond that point. This is, as mentioned earlier, due to the change of the worst-voltage node. For the connection (A0, B2), the worst node changes B2 → B1 → A2 as the resistance decreases. At 5 Ω, where the transition of the worst node from B1 to A2 occurs, the worst-voltage curve turns upward. For further clarification, Fig. 4(b) shows the node voltages when (A0, B2) are connected. The voltage of B2 drops steeply to 5 mV, since B2 now connects tightly to the supply source through A0. However, the connection between A0 and B2 does not contribute to reducing the voltage drop for
nodes Ai (i = 0 . . . 4). Rather, the added connection increases the current flow at node A0, which eventually increases the voltage drop for the nodes in the Ai row. As a result, the voltage drop of the interior nodes between A0 and A4 increases. This suggests that the reach of a fix can be quite local, and that a thorough scan of all VBs is necessary. Figure 5 shows the power distribution of a more realistic PDN design. The structural design of the PDN is summarized in Table 2. The entire chip area is surrounded by a 30 μm-wide ring using M9, M8, M7, and M6. Power supply pads are tapped on the ring. In this experiment, the PDN wires in M1 are not considered; the logic cell currents on each segment of M1 are summed up and distributed at the tap points of M6. The chip includes three large macro blocks, labeled BLK1, BLK2, and BLK3, with power densities of 3, 1, and 2 mW/100 μm², respectively. The remaining area is filled with standard logic cells.
Fig. 5. Power distribution of an experimental design (power density map, mW/100 μm², over the 8 mm × 8 mm die; BLK1–BLK3 marked)

Table 2. PDN structure and power consumption summary

Chip size: 8 mm × 8 mm
Chip supply voltage: 1 V
Metal 9, 8, 7, 6 density (width/pitch): 20/40, 6/100, 2/200, 2/200 μm/μm
Total power consumption: 4.26 W
BLK1, BLK2, BLK3: 0.57, 0.58, 1.16 W
Figure 7 shows the voltage drop distribution at layer M6. The PDN in Fig. 7(b) has basically the same structure as that in Fig. 7(a), but is slightly modified for demonstration purposes. F1 and F2 mark the modified locations shown in Fig. 6. At F1, the M9 wires with coordinates between (2.0, 2.0) and (1.6, 2.0) are removed, which breaks the y-directional connection in M9 and the vias between M8 and M9. Removing vias imitates an unintended design flaw in the PDN. A similar modification has been made at F2. In this example, the modifications increase the maximum voltage drop by only 10 mV. That is, it is difficult to notice the flaw just by looking at the voltage drop map in Fig. 7(b); this becomes even more difficult when, as is generally the case, the reference of Fig. 7(a) is unavailable. Figure 8 shows the maximum voltage drop difference in each VB, calculated by the BSS procedure. The size of the VB is 250 × 250 × 10 μm — i.e., all metal layers are covered by each VB. In Fig. 8(a), there is no noticeable peak in ΔV, which means the PDN is
Fig. 6. Disconnections in the PDN (missing vias between M9 and M8 and disconnected wires in M9, at locations F1 and F2)
7
30
6
20
5 10
4 3
0 mV
2
7
30
6
20
5 10
4 3
0 mV
2 1
1 0
y coordinate (mm)
y coordinate (mm)
8
0
1
2 3 4 5 6 x coordinate (mm)
7
(a) The original PDN
8
0
0
1
2 3 4 5 6 x coordinate (mm)
7
8
(b) The PDN with missing connections
Fig. 7. Voltage drop map comparison for two PDNs
well designed. In Fig. 8(b), however, we see distinct peaks around the coordinates (2, 2) and (6, 6). These exactly match the locations F1 and F2 where we removed connections. Note that the voltage difference at F1 is larger than that at F2, since F1 is located above the block with the higher current density. Since the BSS procedure uses the voltage drop distribution, its result unavoidably depends on the assumed power consumption pattern. However, checking several power consumption scenarios highlights existing flaws, providing an automated way of finding where to direct attention. Once the candidate list is obtained, Eq. (5) can be used to estimate the voltage recovery versus the connection resistance. Note also the series of large peaks near one edge of the chip. This is due to current concentration in the wires connecting the supply pads to the trunk ring. The BSS result thus also suggests that, to alleviate the worst voltage drop, it is efficient to improve the connections to the pads. Table 3 summarizes the CPU time required for the BSS procedure. Since the DC analysis result can be reused for different VB divisions — only the maximum- and minimum-voltage nodes of the finer VBs are candidates for the maximum and minimum nodes of the coarser ones — the analysis is efficient. The total CPU time for three different VB sizes is less than 1 minute.
Fig. 8. Comparison of the maximum voltage difference (ΔV per VB, 0–8 mV): (a) the original PDN; (b) the PDN with missing connections.

Table 3. Simulation time summary

# Resistors | DC (sec) | Estimate (sec): 64×64 | 256×256 | 1024×1024
36k         | 2.2      | 0.3                   | 3.9     | 18.9
211k        | 5.6      | 11.7                  | 12.8    | 32.7
5 Conclusion

In this paper, the difficulties associated with PDN fixing are pointed out, and the BSS procedure, which identifies possible weaknesses in the power distribution network of an LSI, is proposed. In the procedure, nodes having large voltage differences but located in close proximity are considered good candidates for effective improvement using additional wire connections. The closeness between nodes is determined efficiently using the virtual box concept. A DC analysis and scans of the VBs efficiently generate a prioritized candidate node list. Experimental results, including a 211k-resistor PDN example, show the effectiveness of the proposed procedure, which successfully pointed out the candidate nodes.
References

1. Dalal, A., Lev, L., Mitra, S.: Design of an efficient power distribution network for the UltraSPARC-I™ microprocessor. In: Proc. ICCD, pp. 118–123 (1995)
2. Dharchoudhury, A., Panda, R., Blaauw, D., Vaidyanathan, R.: Design and analysis of power distribution networks in PowerPC™ microprocessors. In: Proc. DAC, pp. 738–743 (1998)
3. Leung, K.-S.: SPIDER: simultaneous post-layout IR-drop and metal density enhancement with redundant fill. In: Proc. IEEE/ACM Int. Conf. on Computer-Aided Design, pp. 33–38 (2005)
4. Rauscher, J., Pfleiderer, H.J.: Power supply noise reduction using additional resistors. In: Proc. Signal Propagation on Interconnects, pp. 193–196 (2005)
5. Zhu, Q.K., Kolze, P.: Metal fix and power network repair for SOC. In: Proc. IEEE Computer Society Annual Symp. on Emerging VLSI Technologies and Architectures, pp. 33–37. IEEE Computer Society Press, Los Alamitos (2006)
New Adaptive Encoding Schemes for Switching Activity Balancing in On-Chip Buses

P. Sithambaram, A. Macii, and E. Macii
Politecnico di Torino, Torino-10129, Italy
{prassanna.sithambaram,alberto.macii,enrico.macii}@polito.it
Abstract. Thermal Spreading has been shown to be a successful approach to bus temperature minimization. The idea at the basis of this technique is to periodically permute the routing of the input bitstreams to the various bus lines, with the objective of temporally and spatially distributing the number of transitions over the entire bit-width, thus preventing high switching activity from always occurring on a few lines, which obviously causes an unnatural increase in temperature. In this paper, we propose new encoding schemes that improve the ability of the Thermal Spreading approach to balance the switching activities over the bus wires. The solutions we introduce are adaptive and dynamic in nature, as they select which bitstream goes to which bus line based on the actual bus traffic, thanks to an on-line monitoring capability offered by an ad-hoc hardware unit that runs in parallel at the transmitting and receiving ends of the bus. The experimental results show that, on average, the proposed encoding schemes improve the transition-balancing capability of the Thermal Spreading technique by a significant amount.
1 Introduction
As CMOS technology keeps scaling (the 65nm node is now mainstream and 45nm devices are close to mass manufacturing), the number of transistors per unit area is increasing rapidly. This aggressive scaling of geometries helps realize highly complex SoCs. On the downside, though, it brings many issues that affect the reliability and robustness of such complex chips. Geometry scaling usually implies an increase in power density, leading to rising chip temperatures. In addition, the scaling of interconnects in advanced processes to realize efficient communication also results in increased power dissipation, and thus in a temperature rise in buses and wires. This phenomenon, termed self-heating [1], is further worsened by technological factors, such as the low-thermal-conductivity inter- and intra-metal dielectrics used in current ICs to reduce power consumption and crosstalk, and by higher power densities. This soaring of temperature to alarming levels has put severe demands on packaging and cooling systems, thereby significantly increasing the cost per chip. In the past, several techniques for reducing bus power consumption have been proposed. Bus power is proportional to the product of the average number of

N. Azemard and L. Svensson (Eds.): PATMOS 2007, LNCS 4644, pp. 232–241, 2007. © Springer-Verlag Berlin Heidelberg 2007
New Adaptive Encoding Schemes
233
signal transitions and line capacitance. Hence, one way of reducing power dissipation on bus lines is to encode the data sent on the bus with schemes that reduce the average number of transitions [2]. Power consumption translates into heat dissipation, and hence into a temperature rise that may negatively impact interconnect performance and reliability [1,3,4]. Unfortunately, the bus encoding techniques found in the literature, conceived for minimizing bus power consumption, do not directly address chip temperature minimization. In fact, while power simply depends on the number of transitions of a wire (i.e., the number of times the capacitance of the wire is charged and discharged), temperature is also influenced by the times at which transitions occur. In other terms, for reducing power it is sufficient to minimize the switching activity of a wire, while for limiting temperature it is mandatory to control the temporal distribution of the wire transitions. In [5], the authors propose an improved bus energy and thermal model. Based on such a model, they present an encoding scheme that spreads the switching activity among all the bus lines in order to reduce the peak bus temperature. With respect to traditional low-power bus encoding schemes, the Thermal Spreading approach has lower area and power overheads and a very simple architecture, consisting of a shift register and crossbar logic. The switching activities are distributed over all the bus lines and, as a consequence, the total bus power consumption is uniformly distributed over all the wires (i.e., each bus line gives a similar contribution to total power). The basic concept is to migrate the switching activities among all bus lines by rotating the bus line positions at a certain period. In this way, the spreading prevents the situation of having a few very “hot” bus lines and the majority at a lower temperature.
The more frequently the rotation is performed, the more evenly the switching activities are distributed. However, the area and power overhead of the codec also increases with the rotation frequency, and a trade-off has to be made. In this paper, we present new encodings that improve the performance of Thermal Spreading [5]. In particular, we propose two schemes of increasing complexity, which are based on the concept of on-line monitoring of the switching activities of all the bus lines, and on dynamic selection of which input bitstream should be routed to which bus line in order to better equalize the distribution of the transitions over the entire bus. It is important to observe, at this point, that as in the case of Thermal Spreading, the proposed encoding solutions do not contribute to a reduction of the switching activity; they simply help in better distributing (temporally and spatially) the transitions. Therefore, traditional low-power bus encoding can be successfully paired with the approaches of this paper to further decrease the power and temperature of on-chip buses. The rest of the paper is organized as follows: Section 2 details two different encoding schemes; in particular, Section 2.1 describes a coarse-grained solution, where all the bus lines are permuted every μ bus cycles, while Section 2.2 highlights a fine-grained approach, where pairwise permutations (i.e., swappings) are performed as soon as the activities of two lines become highly unbalanced; this usually happens with a period much shorter than μ; after μ cycles, a
234
P. Sithambaram, A. Macii, and E. Macii
complete reconfiguration of the bus lines is performed, as in the coarse-grained technique. Section 3 shows a hardware implementation of the encoder, Section 4 presents some experimental results, and Section 5 concludes the paper.
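For reference, the rotation mechanism of the Thermal Spreading baseline can be sketched in a few lines of Python. This is a behavioral sketch only, not the implementation of [5]; the function name, the tuple-based stream representation, and the rotation period are illustrative choices of ours:

```python
def thermal_spreading_counts(stream, m, rotation_period):
    """Count per-wire transitions when the bitstream-to-wire mapping is
    rotated round-robin across the m bus wires every rotation_period cycles."""
    wire_transitions = [0] * m
    prev_on_wire = [None] * m           # last value driven on each wire
    offset = 0                          # current rotation offset
    for cycle, pattern in enumerate(stream):   # pattern: tuple of m bits
        if cycle and cycle % rotation_period == 0:
            offset = (offset + 1) % m          # rotate the mapping
        for j in range(m):
            wire = (j + offset) % m            # bitstream j -> wire
            bit = pattern[j]
            if prev_on_wire[wire] is not None and bit != prev_on_wire[wire]:
                wire_transitions[wire] += 1
            prev_on_wire[wire] = bit
    return wire_transitions
```

Rotating the mapping leaves the total number of transitions essentially unchanged (up to boundary effects at rotation instants); it only redistributes them over the wires.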
2 Adaptive Encoding Schemes for Activity Balancing
As highlighted in the introduction, our objective is to define new encoding schemes which improve the performance of the Thermal Spreading approach of [5] by better balancing the temporal and spatial distribution of the transitions over the bus lines. The solutions we propose rely on the idea of on-line monitoring of the bus traffic to drive the actual assignment of the input bitstreams to the bus wires. We propose two techniques of increasing effectiveness and complexity. The first one, which we call Coarse-Grained Balancing Scheme, samples the activities on the various bus lines at fixed intervals (i.e., every μ bus cycles), and performs bitstream routing according to the collected information. The second solution, named Fine-Grained Balancing Scheme, applies corrections to the bitstream routing as soon as the activity becomes significantly unbalanced between two or more bus wires. The remainder of this section presents the two aforementioned codes in detail, while a possible hardware architecture for the implementation of the proposed encodings is sketched in Section 3. We anticipate here that, timing-wise, the insertion of the encoder-decoder logic (codec, hereafter) is not critical, as it does not belong to the direct bus path, except for a couple of very cheap crossbar switches. Instead, the impact on area and power consumption needs to be carefully controlled. For this reason, we are currently working on the development of a more efficient implementation than the simple one of Section 3.

2.1 Coarse-Grained Balancing Scheme
Let us consider a communication bus of m wires, w_1, ..., w_m, which connects module A to module B, and assume that a stream S of N m-bit patterns, p_1, ..., p_N, has to be transferred from A to B. Let us call p_i(j) the j-th bit of pattern p_i. Usually, the transmission works in such a way that bus wire w_j takes care of transferring from module A to module B the bitstream B^j corresponding to the j-th bit of all the patterns of stream S, that is, B^j = {p_1(j), p_2(j), ..., p_N(j)}. Let t_j = Σ_{i=1}^{N−1} t_j(i, i+1) be the total number of transitions occurring on wire w_j when bitstream B^j is transmitted from A to B. The total number of transitions occurring on the entire bus when stream S is transmitted is given by:

T(S) = Σ_{j=1}^{m} t_j
If we assume that the load capacitance of each bus wire is the same, T is proportional to the power consumed by the bus. Experimental data show that, typically, transitions are not distributed uniformly across the bus width. In other terms, when a stream of patterns is transmitted, some wires exhibit a higher number of transitions than others
(i.e., t_w ≫ t_z for some w and z). This implies that different wires may consume different amounts of power. In [5] it was observed that, as the temperature of a metal wire depends on the power that the wire dissipates, distributing the transitions more uniformly over the various bus wires helps to equalize the power consumed by each line, thus decreasing the maximum temperature that the hottest wires reach. A simple round-robin rotation scheme (called Thermal Spreading) was thus proposed to make sure that bitstreams originating high numbers of transitions are not always routed onto the same wires. Rotations occur at a fixed period, N_T. The shorter N_T, the better the balancing, but the higher the power overhead required to re-route the bitstreams from one wire to another. The Coarse-Grained Balancing Scheme we introduce here pushes the concept of switching activity balancing even further. More specifically, we propose a solution for distributing the bus transitions both spatially (as done by Thermal Spreading) and temporally. The idea is to “adaptively” adjust the routing of the bitstreams to the bus wires in such a way that, in the long run, the distribution of the transitions gets equalized in space (i.e., each wire gets a similar number of transitions), and periods of high activity of a wire are interleaved with periods of low activity (temporal balancing), thus resulting in a limited increase in wire temperature. Let Cnt_j be a counter which, for any pattern p_i of stream S that is transmitted over the bus, keeps track of the cumulative number of transitions that have occurred on wire w_j. After a given number of bus cycles, say μ, the status of each counter is checked and the bus wires are sorted in descending order of their transition counts.
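One reconfiguration step of this scheme — read the counters after μ cycles, sort, and remap the most active bitstreams to the least active wires — can be sketched as follows. This is an illustrative Python sketch, not the hardware of Section 3; the names are ours:

```python
def coarse_grained_remap(counts, mapping):
    """One coarse-grained reconfiguration step.

    counts[w]  : transitions accumulated on wire w during the last mu cycles
    mapping[j] : wire currently carrying bitstream j
    Returns a new mapping routing the most active bitstream to the least
    active wire, the 2nd most active to the 2nd least, and so on."""
    m = len(counts)
    # Activity of bitstream j is the count of the wire it was mapped to.
    streams_by_activity = sorted(range(m), key=lambda j: counts[mapping[j]],
                                 reverse=True)                 # hottest first
    wires_by_activity = sorted(range(m), key=lambda w: counts[w])  # coolest first
    new_mapping = [0] * m
    for stream, wire in zip(streams_by_activity, wires_by_activity):
        new_mapping[stream] = wire
    return new_mapping
```

After remapping, the counters are reset and counting restarts for the next interval of μ cycles.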
Let us assume, without loss of generality, that wire w_1 is the one with the highest number of transitions at time μ, and wire w_m the one with the lowest. Then, we apply a permutation of the mapping of the bitstreams so that B^1 is routed to w_m, B^2 to w_{m−1}, and so forth. After the remapping is completed, the counters are reset and the process is restarted, until another μ bus cycles have elapsed and a new permutation is applied. Obviously, as for the Thermal Spreading technique, the total number of transitions originated by the transmission of stream S (i.e., parameter T(S)) is not reduced by the Coarse-Grained Balancing Scheme. However, as demonstrated in [5], traditional low-power bus encoding can be successfully coupled with codes that target activity balancing to reduce the total power (and temperature). The choice of μ is critical: a small value implies a finer degree of temporal distribution of the activities, thus resulting in a tighter control of the temperature, but it requires more work from the bitstream re-routing point of view, which induces a higher power overhead in the hardware in charge of managing the re-routing. Experiments have shown that a value of μ = √N gives, on average, the best results, although at the moment no algorithm is available to optimally define the value of parameter μ, it being very much dependent on the characteristics of the data stream that needs to be transmitted. Needless to say, the Coarse-Grained Balancing Scheme works best for highly stationary data streams; in fact, if the bit-wise statistics tend to become stable after some patterns have been transmitted, the adaptation mechanism tends to lead to an optimal distribution of the transitions. If, instead, the temporal distribution of the transitions on each bitstream changes significantly over time, re-routing every μ cycles may not be sufficient to achieve a satisfactory temporal balancing of the activity. In this case, some finer-granularity adaptation may be needed, such as that provided by the Fine-Grained Balancing Scheme described next.

2.2 Fine-Grained Balancing Scheme
In order to overcome the problem that the scheme of Section 2.1 may encounter when weakly stationary streams are transmitted, we propose a variant which increases the capability of the code to apply early bitstream re-routing in case of a sudden change in the bit-wise statistics. The idea behind the Fine-Grained Balancing Scheme is to add some intermediate checks on the current transition counts of each bus line and, if needed, apply immediate swapping of critical wires. Assume that the transmission of stream S starts normally. Suppose that, after some bus cycles, a few bus wires w_1, ..., w_l, with l ≪ m, reach a cumulative number of transitions which is much higher than that of most of the other wires w_{l+1}, ..., w_m. In this case, bitstreams B^1, ..., B^l are immediately re-routed to wires with small transition counts (say, w_{m−l}, ..., w_m). To limit the number of times these intermediate re-routings take place, and thus guarantee that the correction is applied only in meaningful cases, we have defined two conditions that must be satisfied simultaneously in order for the immediate swapping of two wires w_j and w_q to start:

1. At least ν > μ/α bus cycles must have elapsed since the last reconfiguration;
2. t_j > β · t_q.
The first condition prevents the swapping from starting too early, that is, when the statistics have not yet settled; the second condition ensures that only critical wires are swapped between two consecutive reconfigurations (which occur every μ bus cycles). The choice of parameters α and β plays a fundamental role in the performance of the code. Extensive, yet empirical, experimentation has brought us to the conclusion that an assignment of these parameters which guarantees a good trade-off between the performance of the code and the power overhead due to the intermediate wire swapping is the following:
– α = 10;
– β = 2.
However, as for parameter μ, no optimal way of determining the values of α and β has been devised at this point in time. This is the subject of current research.
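The two trigger conditions can be expressed as a small predicate. This is a sketch only; the function name is ours, the parameter defaults follow the values above, and the counter values t_j, t_q and the cycle count would come from the monitoring hardware of Section 3:

```python
def should_swap(t_j, t_q, cycles_since_reconf, mu, alpha=10, beta=2):
    """Return True when wires j and q should be swapped immediately:
    (1) more than mu/alpha cycles have elapsed since the last complete
        reconfiguration, and
    (2) the activity of wire j exceeds beta times that of wire q."""
    return cycles_since_reconf > mu / alpha and t_j > beta * t_q
```

With the default α = 10 and β = 2, a swap of w_j and w_q is triggered only once a tenth of the reconfiguration period has passed and w_j has accumulated more than twice as many transitions as w_q.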
3 Codec Architecture
The basic elements of the encoder architecture are an array of resettable counters, a decoding block, and a crossbar switch (see Figure 1). The counters, which work in parallel with the bus transmission path, are in charge of collecting information on the number of transitions that occur on each bus line. This information is passed to the decoder block, which decides which input-output pairs need to be connected in the crossbar switch. In the case of the Coarse-Grained Balancing Scheme, an additional counter (CntC) provides the information on when the complete reconfiguration of the crossbar switch has to happen (i.e., when μ bus cycles have elapsed). In the case of the Fine-Grained Balancing Scheme, the decoder block which drives the crossbar switch is more complex, as it also has to handle the pairwise bus line swaps, which are triggered by unbalanced activity conditions and can occur more often than every μ bus cycles.
Fig. 1. Encoder Architecture (transition counters Cnt and CntC feeding a decoder block that drives the crossbar switch configuration and the Reconf_Signal)

Fig. 2. Decoder Architecture (a crossbar switch driven by the configuration data and the Reconf_Signal coming from the bus encoder)
The decoder has a much simpler structure, as the only task it has to accomplish is to restore the mapping between the incoming bitstreams and the physical bus wires. The inputs to its crossbar switch are the configuration data sent by the encoder and the Reconf_Signal, which triggers the reconfiguration of the switch.
4 Experimental Results

4.1 Experimental Set-Up
We have implemented a bus transition simulator which evaluates the number of transitions occurring on each bus wire when a given data trace is fed to it. The simulator supports the Thermal Spreading encoding of [5], as well as the Coarse-Grained Balancing Scheme and the Fine-Grained Balancing Scheme of this paper. To evaluate and compare the performance of the various encoding schemes, we have used the variance of the number of transitions as the cost metric. Assuming that T(S) is the total number of transitions originated by the transmission of stream S on the bus, and that the bus is composed of m lines, the mean number of transitions per line is M = T(S)/m. Then, the variance associated with stream S is given by:

σ²(S) = (1/m) Σ_{j=1}^{m} (t_j − M)²

Clearly, σ²(S) is zero in the case of a perfectly balanced number of transitions over all the bus wires, and it increases as the balancing decreases. Therefore, a lower value of σ²(S) indicates a higher efficiency of the encoding.
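The cost metric can be computed directly from the per-wire transition counts (a Python sketch of σ²(S) as defined above; the function name is ours):

```python
def transition_variance(t):
    """Variance of the per-wire transition counts t = [t_1, ..., t_m]:
    sigma^2(S) = (1/m) * sum_j (t_j - M)^2,  with  M = T(S)/m."""
    m = len(t)
    mean = sum(t) / m                       # M = T(S)/m
    return sum((tj - mean) ** 2 for tj in t) / m
```

A perfectly balanced bus yields zero; concentrating all activity on one wire maximizes the value.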
4.2 Parameter Space Exploration
The performance of the encoding schemes of Section 2 depends on some parameters. More specifically, the Coarse-Grained Balancing Scheme relies on parameter μ to trigger a reconfiguration of the crossbar switch, while in the Fine-Grained Balancing Scheme parameters α and β are responsible for the frequency at which intermediate wire swaps take place between two complete permutations. In this section, we study how sensitive the two encoding schemes are to the values of such parameters. We report the results of the parameter exploration for a specific bus trace. However, the selection of the values of parameters μ, α and β used for collecting the experiments of the next section was determined by considering the results of a similar exploration process applied to all the data streams used in the experiments. Concerning parameter μ, we have run the Coarse-Grained Balancing Scheme on trace say, which consists of a total of N = 10⁶ patterns, for 15 different values of μ, ranging from √N to N/2. As expected, the best performance is achieved for μ = √N, and the value of σ² monotonically increases as μ increases, as shown in the plot of Figure 3. Regarding parameters α and β, we explored their impact separately. In a first experiment, we fixed the value of β = 2 and ran the Fine-Grained Balancing Scheme for 20 different values of α, ranging from 1 to 20. The plot of Figure 4 shows how σ² decreases as the value of α increases. This is quite intuitive, as a smaller α implies that intermediate swaps start happening later (for
instance, for α = 1 we have ν = μ, and thus no intermediate wire swap takes place). We observe, however, that a high value of α implies many intermediate swaps, leading to a significant power overhead in the re-routing hardware. As mentioned earlier, the value that best trades off overhead and efficiency is α = 10. The exploration of parameter β was performed in a similar way to that of α. In particular, we fixed α = 10 and calculated σ² for 9 different values of β, ranging from 2 to 10. A smaller β clearly triggers the swapping more often, implying potentially better performance of the code. The plot of Figure 5 shows the results of the exploration. We notice that the variance increases rapidly as β increases (for a very large β, intermediate swaps occur very seldom, and the results flatten to those of the Coarse-Grained Balancing Scheme). This behavior holds for all the data streams we considered; therefore, we have come to the conclusion of adopting a value of β = 2 for all the experiments of the next section.
Fig. 3. Plot of σ² as a function of μ
Fig. 4. Plot of σ² as a function of α
Fig. 5. Plot of σ² as a function of β
4.3 Experimental Data
This section reports the results of the experiments collected by applying the encoding schemes of Section 2. We have considered a total of 8 binary streams captured using SimpleScalar [6] and corresponding to the traffic of a 32-bit address bus. The stream lengths range from 10⁶ patterns (stream say) to 1.5·10⁷ patterns (stream dijkstra). Table 1 summarizes the data. In particular, column Thermal Spreading shows the σ² resulting from the application of the encoding method presented in [5], while columns Coarse-Grained and Fine-Grained report the variance (σ²) and the percentage improvement (Δ) over the Thermal Spreading approach achieved by the two activity balancing encodings proposed in this paper. In order to guarantee a fair comparison, the Thermal Spreading method has been run with a period equal to the value of μ chosen for executing the Coarse-Grained

Table 1. Experimental Results

Stream     | Thermal Spreading σ² | Coarse-Grained σ² | Δ [%] | Fine-Grained σ²  | Δ [%]
dijkstra   | 6482524632.88        | 4986197800.80     | 23.08 | 4736887910.76    | 26.93
djpeg      | 1582421922.74        |  923305760.56     | 41.65 |  821742126.90    | 48.07
gsme       | 1730105614.58        | 1121074902.85     | 35.20 |  986545914.51    | 42.98
lame       | 1111718746.84        |  929554002.86     | 16.39 |  743643202.28    | 33.11
mad        | 1298489106.19        |  793356078.75     | 38.90 |  761621835.60    | 41.35
rijndael_i | 1101662903.80        |  783470235.81     | 28.88 |  650280295.72    | 40.97
rijndael_o | 1152124876.81        |  693592773.68     | 39.80 |  631169424.05    | 45.22
say        |   65664855.13        |   44767874.62     | 31.82 |   37605014.68    | 42.73
Average    |                      |                   | 31.96 |                  | 40.16
and the Fine-Grained Balancing Schemes, that is, μ = √N, where N is the length of the data stream. The results clearly show that the encoding techniques described in Section 2 are superior to the Thermal Spreading of [5]. The average variance reduction achieved by the Coarse-Grained Balancing Scheme is around 32%; this rises to around 40% in the case of the Fine-Grained Balancing Scheme. The improvement provided by the Fine-Grained over the Coarse-Grained Balancing Scheme varies from 2% (stream mad) to 17% (stream lame), and it greatly depends on the statistics of the trace, as these are what enables the intermediate swaps.
5 Conclusions
In this paper, we have presented two adaptive encoding schemes for balancing the switching activities over the bus wires. The proposed solutions improve on existing approaches by monitoring the switching activities of all the bus lines on-line, and by dynamically selecting which input bitstream should be routed to which bus line in order to better equalize the temporal and spatial distribution of the transitions over the entire bus. We have also described a possible hardware implementation of the codec, which shows a negligible impact on bus latency. The experimental results confirm that the proposed solutions improve the transition balancing capabilities of the Thermal Spreading technique.
References

1. Banerjee, K., Mehrotra, A.: Coupled Analysis of Electromigration Reliability and Performance in ULSI Signal Nets. In: ICCAD-01: IEEE/ACM International Conference on Computer-Aided Design, San Jose, CA, November 2001, pp. 158–164 (2001)
2. Macii, E., Pedram, M., Somenzi, F.: High-Level Power Modeling, Estimation, and Optimization. IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems 17(11), 1061–1079 (1998)
3. Chiang, T.-Y., Banerjee, K., Saraswat, K.: Compact Modeling and SPICE-Based Simulation for Electrothermal Analysis of Multilevel ULSI Interconnects. In: ICCAD-01: IEEE/ACM International Conference on Computer-Aided Design, San Jose, CA, November 2001, pp. 165–172 (2001)
4. Ajami, A., Banerjee, K., Pedram, M.: Modeling and Analysis of Nonuniform Substrate Temperature Effects on Global ULSI Interconnects. IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems 24(6), 849–861 (2005)
5. Wang, F., Xie, Y., Vijaykrishnan, N., Irwin, M.J.: On-chip Bus Thermal Analysis and Optimization. In: DATE-06: IEEE Design Automation and Test in Europe, Munich, Germany, March 2006, pp. 850–855 (2006)
6. Burger, D.C., Austin, T.M., Bennett, S.: Evaluating Future Microprocessors – The SimpleScalar Toolset. Technical Report 1342, University of Wisconsin, Department of CS (1997), http://www.simplescalar.com
On the Necessity of Combining Coding with Spacing and Shielding for Improving Performance and Power in Very Deep Sub-micron Interconnects

T. Murgan¹, P.B. Bacinschi¹, S. Pandey², A. García Ortiz³, and M. Glesner¹

¹ Inst. of Microelectronic Systems, Darmstadt Univ. of Technology, Darmstadt, Germany
² Group for Computer Architecture, Univ. of Bremen, Bremen, Germany
³ AnaFocus, Seville, Spain
Abstract. In this work, the necessity of combining signal encoding schemes with low-level anti-crosstalk techniques like spacing and shielding is analyzed. It is shown that, in order to improve throughput and/or reduce power consumption, coding schemes should be integrated with layout techniques, since methods like spacing and shielding can themselves be regarded as very simple encoding schemes. On this basis, a theoretical framework for assessing the improvement in throughput and/or power consumption is constructed. Furthermore, several possibilities for integrating coding with classical anti-crosstalk techniques are discussed.
1 Introduction

On-chip VDSM (very deep sub-micron) interconnects increasingly affect the overall chip power consumption and performance with every technology downscaling. Interconnect analysis and optimization at high levels of abstraction is extremely attractive, since it offers much more room for improvement than optimization at lower levels. However, optimizing performance and power consumption in interconnects at multiple levels can offer even more improvement opportunities. For this purpose, efficient high-level models for delay and power consumption in VDSM interconnects are required. The main objective of this work is to prove the necessity of integrating signal encoding schemes with low-level anti-crosstalk techniques like spacing and shielding. In order to improve throughput and/or reduce power consumption, encoding schemes should be combined with layout methods like spacing and shielding. For this purpose, a theoretical framework for assessing the improvement in throughput and/or power consumption is constructed. Additionally, several possibilities to integrate coding with classical anti-crosstalk techniques are discussed. This work is organized as follows. In Sec. 2, we briefly discuss the notions of coupling and self transition activity, as well as that of inter-wire coupling. Furthermore, Sec. 3 gives an overview of bus encoding techniques in capacitively and inductively coupled interconnects, while Sec. 4 discusses the applicability of coding, spacing, and shielding. Afterwards, Sec. 5 shows the efficiency, in terms of power and/or performance, of combining coding with those anti-crosstalk techniques. The work ends with some concluding remarks.

N. Azemard and L. Svensson (Eds.): PATMOS 2007, LNCS 4644, pp. 242–254, 2007.
© Springer-Verlag Berlin Heidelberg 2007
2 Delay and Transition Activity in VDSM Interconnects

As an effect of technology scaling, the coupling capacitance between neighboring wires has increased with each technology node and currently dominates the overall line capacitance. Nevertheless, the second-order coupling capacitances are almost perfectly shielded by the first-order ones. Thus, the capacitive coupling of one bus line with its two neighbors accurately determines the dynamic power consumption and the time required for a transition to complete on that bus line [16]. In the following, we denote the transition on line i of an n-bit wide bus as Δb_i = b_i^+ − b_i^−, where b_i^− and b_i^+ represent the initial and final value on line i, respectively. We also define the transition vector Δb = [Δb_1, Δb_2, ..., Δb_n]^t. The set of line delays is a function of the transition vector. In capacitively-coupled interconnects, when inductances can be safely ignored, the delay on line k, δ_k, of a symmetric bus is given in [14] as:

δ_k = τ_0 [(1 + 2κ) Δb_k² − κ Δb_k (Δb_{k−1} + Δb_{k+1})],   (1)

where τ_0 is the delay of the crosstalk-free line and κ = C_c/C_s is the so-called bus aspect factor, i.e., the ratio of the coupling capacitance C_c to the self (ground) capacitance C_s (C_g). Note that, for convenience, the term Δb_k² is used instead of |Δb_k|. Inductive coupling is a long-range effect, in contrast to the short-range capacitive coupling. Therefore, the effect of aggressors of order higher than two cannot be discarded in an accurate analysis. In [11], an extended linear delay model has been proposed that takes into account the pattern-dependency of the delay under consideration of inductive coupling effects. The delay predicted on line k is thus:
δ_k = α_k Δb_k² + Σ_{i=1, i≠k}^{n} α_{ik} Δb_i · Δb_k = Σ_{i=1}^{n} α_{ik} Δb_i · Δb_k,   (2)
where α_k := α_{kk} represents the delay on line k with quiet aggressors, and α_{ik} for i ≠ k denotes the contribution of aggressor line i to the delay of line k. In buses exhibiting both capacitive and inductive coupling, the coefficients for the first-order neighbors can be either negative or positive, while the coefficients of order higher than one are purely inductive and thus always non-negative. When b_i^+ = 0, only the NMOS transistor of the line driver is active and ideally no energy is drawn; when b_i^+ = 1, all the current required for loading the capacitors is taken from the supply. As shown in [3], the mean value of the total energy consumption (E_t) can be computed as:

E_t = (V_dd² / 2) [C_s T_s + 2 C_c T_c]   (3)

where:

T_s = Σ_{i=1}^{n} t_{s,i} = 2 Σ_{i=1}^{n} E[b_i^+ Δb_i]   (4)

T_c = Σ_{i=1}^{n} t_{c,i} = Σ_{i=1}^{n} E[b_i^+ (2Δb_i − Δb_{i+1} − Δb_{i−1})]   (5)
are the so-called total self and coupling transition activities, respectively. Here, E[·] denotes the expectation operator. Relating the coupling transition activity – which is actually an inter-wire energy measure – to one single line, namely the driver that feeds that line, is an elegant approach, as the energy budget of each driver can be written in a simple fashion. Nonetheless, the ratio between the coupling capacitance and the self capacitance of two lines varies rapidly with small changes in the inter-wire spacing. The bus aspect factor κ is therefore defined only for a symmetric bus, and the coupling becomes a measure of the total energy consumption related not to a driver but to a pair of lines. Since the coupling capacitance varies much more rapidly than the self capacitance, the ground capacitance is, for simplicity, considered constant with spacing in the following. Thus, the self capacitance of every line is C_s, while the coupling capacitances between line i and line j, C_{c,ij}, vary as a function of spacing. Let κ_ij be the bus aspect factor between line i and line j: κ_ij = C_{c,ij}/C_s, where κ_ij = 0 if |i − j| ≠ 1, and let κ_i := κ_{i,i+1} be the aspect factor between line i and line i + 1. Let θ_{c,ij} be the inter-wire coupling activity between line i and line j:

θ_{c,ij} = E[(b_i^+ − b_j^+)(Δb_i − Δb_j)] = t_{s,i}/2 + t_{s,j}/2 − E[b_i^+ Δb_j] − E[b_j^+ Δb_i]   (6)

and let θ_{c,i} := θ_{c,i,i+1} be the inter-wire coupling between line i and line i + 1. It can easily be verified that:

T_c = Σ_{i=1}^{n} t_{c,i} = Σ_{i=0}^{n} θ_{c,i},   (7)

where θ_{c,0} and θ_{c,n} are computed as a function of the bus margins. Thus, the weighted total coupling activity, TW_c, can be defined as:

TW_c = Σ_{i=0}^{n} (C_{c,i,i+1}/C_s) · θ_{c,i} = Σ_{i=0}^{n} κ_i θ_{c,i}.   (8)

Consequently, the total energy consumption becomes:

E_t = (C_s V_dd² / 2) · (T_s + 2 TW_c).   (9)
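Replacing the expectation E[·] by an empirical average over a sample stream, the activity measures of Eqs. (4)–(5) and the energy of Eq. (3) can be sketched as follows. This is an illustrative Python sketch of ours for a symmetric bus, assuming quiet margin lines at the bus boundaries:

```python
def bus_activities(stream):
    """Empirical total self and coupling transition activities (Eqs. (4)-(5))
    for a stream of n-bit tuples; the average over the observed transitions
    stands in for the expectation E[.]. Returns (Ts, Tc)."""
    n = len(stream[0])
    N = len(stream) - 1                 # number of observed transitions
    Ts = Tc = 0.0
    for prev, cur in zip(stream, stream[1:]):
        for i in range(n):
            db = cur[i] - prev[i]                           # Delta b_i
            db_l = cur[i - 1] - prev[i - 1] if i > 0 else 0  # quiet margin
            db_r = cur[i + 1] - prev[i + 1] if i < n - 1 else 0
            Ts += 2 * cur[i] * db / N                       # t_s contribution
            Tc += cur[i] * (2 * db - db_l - db_r) / N       # t_c contribution
    return Ts, Tc

def total_energy(Ts, Tc, Cs, Cc, Vdd):
    """Mean total energy, Eq. (3): (Vdd^2 / 2) * (Cs*Ts + 2*Cc*Tc)."""
    return Vdd ** 2 / 2 * (Cs * Ts + 2 * Cc * Tc)
```

For instance, two adjacent lines rising together draw no coupling energy from each other, but each still charges its coupling capacitance toward the quiet margins, which the sketch accounts for through the zero boundary transitions.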
3 Improving Throughput and Power in Buses by Coding

A powerful method to improve performance and/or reduce crosstalk-induced noise in interconnects is to avoid worst-case patterns by means of signal encoding [12,15]. Thereby, the transmitted data can be modified in such a way that the transitions that spawn large delay or crosstalk values are eliminated. Data encoding can be employed to reduce transition activity and thus power consumption [10,18,19,1], an idea which was proposed more than a decade ago and has reached a remarkable maturity, as a multitude of general as well as application-specific codes have been proposed in the literature. Simply put, the underlying idea behind coding schemes for power is to find a
(static or dynamic) mapping function that transforms the input codeword alphabet into another one that assigns low-power transitions to the most probable switching patterns and low-power states to the most frequent patterns. Furthermore, several methods have been proposed to reduce bus delay, improve throughput, and decrease crosstalk-induced noise [4,17,20]. Rao et al. proposed in [13] an encoding algorithm, with a corresponding circuit scheme, for on-chip buses that simultaneously decreases capacitive crosstalk and leakage-induced power consumption. Inductive noise can be reduced by the so-called bus stuttering method, which inserts dummy states in order to avoid worst-case noise-generating states [7]. Let us consider, for simplicity of formalism, that τ_0 = 1 and Δb_i = 1, where line i is the line under analysis. Thus, the delay in capacitively coupled buses can be expressed as:

δ_i = (1 + 2κ) − κ (Δb_{i−1} + Δb_{i+1}),   (10)

where δ_i is the delay on line i. Thus, there are five possible delay classes for a switching line, from 1 to 1 + 4κ in steps of κ. For k = 1, ..., 5, we can define the delay class Δ_k as a set:

Δ_k = {Δb | max_i δ_i = 1 + (k − 1)κ, i = 1, ..., B},   (11)

where B is the bus width. The delay 1 + (k − 1)κ is called the characteristic delay of the delay class Δ_k. The aforementioned partitioning into delay classes does not hold when inductive effects cannot be neglected anymore. When only the inter-wire capacitances have to be taken into consideration, all possible delay values are tightly clustered around the characteristic values of the delay classes. However, with increasing inductive effects, the delay classes are intervals rather than fixed values. Conceptually, the coefficients of the extended delay model can be split into those standing for capacitive coupling (α_{ij,C}) and those standing for inductive coupling (α_{ij,L}). Thus: α_{ij} = α_{ij,C} + α_{ij,L}, i ≠ j, where α_{ij,C} ≤ 0 and α_{ij,L} ≥ 0.
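For a purely capacitive symmetric bus, the delay class of a given transition vector can be evaluated directly from Eq. (10). The following Python sketch (our own illustration; quiet margin lines are assumed at the bus boundaries) generalizes the formula to Δb_i ∈ {−1, 0, +1} by normalizing the victim's own transition:

```python
def delay_class(delta_b, kappa=0.5, tau0=1.0):
    """Worst-case normalized line delay for a transition vector delta_b
    (entries in {-1, 0, +1}): for each switching line i, per Eq. (10),
    delta_i = tau0 * ((1 + 2*kappa) - kappa * db_i * (db_{i-1} + db_{i+1}))."""
    n = len(delta_b)
    worst = 0.0
    for i, db in enumerate(delta_b):
        if db == 0:
            continue                      # quiet lines incur no delay
        left = delta_b[i - 1] if i > 0 else 0
        right = delta_b[i + 1] if i < n - 1 else 0
        d = tau0 * ((1 + 2 * kappa) - kappa * db * (left + right))
        worst = max(worst, d)
    return worst
```

A lone switching line yields the characteristic delay 1 + 2κ, while a victim flanked by two opposite-switching aggressors yields the worst case 1 + 4κ.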
Capacitive coupling is a short-range effect and thus only first-order neighbors need to be considered, as in the models proposed in [14,17]. Consequently, the delay in line k can be written as:

δk = αk Δb²k + (α_{k,k−1,C} Δb_{k−1} + α_{k,k+1,C} Δb_{k+1}) Δb_k + (α_{k,k−1,L} Δb_{k−1} + α_{k,k+1,L} Δb_{k+1}) Δb_k + Σ_{i≥2} (α_{k,k−i,L} Δb_{k−i} + α_{k,k+i,L} Δb_{k+i}) Δb_k. (12)

In a symmetric bus, α_{k,k−i} = α_{k,k+i} =: α_k^{(i)}, if the corresponding neighbor exists; α_{k,C}^{(i)} and α_{k,L}^{(i)} are defined analogously. When the corresponding neighbors do not exist, either the coefficients or the associated transitions can be set to zero. For a symmetric bus, the delay becomes:

δk = αk Δb²k + (α_{k,C}^{(1)} + α_{k,L}^{(1)}) (Δb_{k−1} + Δb_{k+1}) Δb_k + S_ind(k) Δb_k, (13)

where S_ind(k) stands for the cumulative influence of the inductive aggressors of order two and higher:

S_ind(k) = Σ_{i≥2} (α_{k,k−i,L} Δb_{k−i} + α_{k,k+i,L} Δb_{k+i}). (14)
246

T. Murgan et al.

Let η := max{S_ind(k)} = max{Σ_{i≥2} (α_{k,k−i,L} + α_{k,k+i,L})} ≥ 0 be the maximum cumulative effect of the inductive aggressors. For a symmetric bus, η = 2 max{Σ_{i≥2} α_{k,L}^{(i)}}.

In a classically operated bus, the clock period Tck must be chosen large enough that any transition can be completed, i.e., Tck ≥ τ0(1 + 4κ). One efficient way of avoiding delays of certain classes is to eliminate from the symbol alphabet a selected set of states involved in those toggling patterns. The codec is then at most a static map. It is nevertheless obvious that by forbidding a set of states, the maximum achievable information rate on the bus is reduced. For an n-bit wide bus, the bit-rate reduction factor ζb(n, k) is defined as the ratio between the maximum achievable information rate on the coded bus (determined by the number of allowed states) and the actual bus width. Further, we can also define the speed increase factor ζs(n, k), which stands for the interconnect delay reduction rate:

ζs(n, k) = (1 + 4κn) / (1 + kκn), (15)
where k ∈ {0, 1, 2, 3, 4} indicates the highest allowed delay. Note that for a fixed physical bus width, κ is actually a function of n, hence the notation κn. In the rest of this work, the notations κ and ζs(k) are used whenever the bus width is not subject to modification. For an efficient encoding, we have in general k ∈ {2, 3} [14]. Thus, the total throughput increase rate can be defined as:

ζt(n, k) = ζs(n, k) · ζb(n, k). (16)
In order for a code to be efficient, the achieved throughput increase rate must be higher than one, i.e., ζt(n, k) > 1.

In the case of inductive coupling, ζb(n, k) is the same as for the capacitive case. However, in the case of disjoint classes, i.e., α_k^{(1)} ≤ −2η for all k = 1, . . . , B, we have

ζs(n, k) = (1 + 4κn − 2α_{k,L}^{(1)} + η) / (1 + kκn − (k − 2)α_{k,L}^{(1)} + η), (17)
where τ0 = 1 and k = 0, . . . , 3. It can easily be shown that coding for performance would be more efficient with inductive coupling only if κ ≤ −α_{k,L}^{(1)}/(2α_{k,L}^{(1)} + η) ≤ 0 [9]. Nevertheless, κ is a non-negative parameter, which renders the aforementioned inequality impossible to satisfy. Thus, the only possibility for coding to be more efficient in the inductive case is directly related to the possibility of reducing η. As discussed later, this can be achieved by shielding.
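As a numeric sketch of Eqs. (15)–(16) — κ is an arbitrary illustrative value here, and the codeword count F_{n+2} of the RLL(1,∞) code is the one derived in Sec. 4.1:

```python
from math import log2

# Throughput bookkeeping of Eqs. (15)-(16) for an RLL(1, inf)-coded bus.
def fib(n):                      # Fibonacci numbers, F_0 = 0, F_1 = 1
    a, b = 0, 1
    for _ in range(n):
        a, b = b, a + b
    return a

def zeta_s(kappa, k):            # speed increase factor, Eq. (15)
    return (1 + 4 * kappa) / (1 + k * kappa)

def zeta_b_rll(n):               # bit-rate reduction factor: log2(F_{n+2}) / n
    return log2(fib(n + 2)) / n

n, kappa = 8, 2.0                # illustrative geometry
zt = zeta_s(kappa, 2) * zeta_b_rll(n)   # Eq. (16), highest allowed delay 1 + 2*kappa
assert zt > 1                           # coding pays off for this kappa
```

For this κ, the 1.8× speed-up outweighs the roughly 28 % loss of information rate, so ζt > 1.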
4 Coding and Classic Anti-crosstalk Techniques

Spacing (increased metal separation) and shielding are probably the most common crosstalk reduction techniques. In the sequel, we first review a simple throughput-improving coding scheme; we then compare spacing and shielding without considering any form of coding, so that the worst-case delay class is always Δ5. At first, inductive coupling is neglected.
4.1 Simple Coding Schemes for Improving Throughput

In [6], Konstantakopoulos and Sotiriadis introduced the so-called D-RLL(1,∞) encoding scheme, i.e., the Differential Run-Length-Limited (1,∞) code. The code employs the same decorrelator proposed by Stan and Burleson for decreasing power consumption (see [18, 19]), which reduces the two-dimensional problem of mapping the most probable transitions to the least power-hungry ones to the one-dimensional problem of reducing the number of ones in the code words. When coding for performance, no bus line exhibits a delay belonging to a class higher than Δ3, that is 1 + 2κ, if no vector at the output of the static mapper contains successive ones. This rule defines the so-called RLL(1,∞) codes. Let Φn be the set of possible codewords for bus width n. It can be shown that the number of elements of Φn is directly related to the Fibonacci sequence:

|Φn| = F_{n+2}, (18)
where |Φn| represents the cardinality of the set Φn and F_{n+2} stands for the (n + 2)-nd element of the Fibonacci sequence: F_{n+2} = F_{n+1} + F_n, with F0 = 0 and F1 = 1. Thus, the bit-rate reduction factor becomes: ζb(n, 2) = log2 F_{n+2} / n.

4.2 Spacing and Shielding

If the bus width is increased by inserting r redundant bits, then with respect to the unshielded bus the information rate remains unaltered, i.e., ζb^{(sh)} = 1, where ζb^{(sh)} represents the bit-rate reduction factor. However, the speed increase factor for an uncoded shielded bus must be redefined as follows:

ζs^{(sh)}(n, r) = (C_{g,n} / C_{g,n+r}) · (1 + 4κn) / (1 + 4κ_{n+r}), (19)
where C_{g,k} and κk represent the ground capacitance and the capacitance factor for a k-bit wide bus, respectively. Basically, there are two opposing forces: with increased spacing, C_{g,k} increases while C_{c,k} decreases. Nevertheless, it can be shown that the increase of C_{g,k} is much less aggressive than the decrease of C_{c,k} [9].

In addition, shielding can be analyzed by regarding it as a coding technique applied to a bus of the same total width, i.e., n + r. Thus, if two inserted shields span at least three signal lines, the highest allowed delay class remains Δ5 and no delay improvement is achieved. Hence ζb(n + r, 4) = n/(n + r), and the total throughput increase rate is strictly less than one. Consequently, spacing is in general a more effective technique than shielding for reducing delay in capacitively coupled interconnects. When inductive coupling is no longer negligible, the condition for shielding to be more efficient than spacing becomes:

(1 + 4κn − 2α_{k,L}^{(1)} + η) / (1 + 4κ_{n+r} − 2α_{k,L,sh}^{(1)} + η_sh) > C_{g,n+r} / C_{g,n}, (20)
where α_{k,L,sh}^{(1)} and η_sh denote, for the shielded bus, the first-order inductive coupling and the maximum cumulative inductive coupling of the aggressors of order two and higher,
Table 1. Combining coding and spacing (n = 8, k = 1, . . . , 4, m = 2, . . . , 8)

 m   Δ2      Δ3      Δ4      Δ5
 2   1.6515  1.7634  1.7259  1.6900
 3   2.2890  2.3740  2.3143  2.1720
 4   2.7307  2.7328  2.4885  2.2003
 5   2.9917  2.7515  2.3448  1.9747
 6   2.9974  2.5177  2.0357  1.6569
 7   2.7362  2.1335  1.6616  1.3227
 8   2.2710  1.6763  1.2737  1
respectively. It can be noticed that, in comparison with the previous case, three decisive parameters have to be taken into account. In lines that exhibit a significant amount of inductive coupling, shielding is employed in order to reduce that coupling: low-resistance ground and Vdd lines are inserted between signal lines to provide closely spaced return paths [2, 5, 14]. When inductive coupling is significant, the high values of η can be dramatically reduced through shielding at the expense of a slightly higher α_{k,L}^{(1)} and an increased κ. Consequently, for inductively coupled lines, shielding generally performs better than spacing [8]. The main disadvantage of spacing is that, in order to obtain a convenient bus layout, the data characteristics have to be known at design time.
5 Combining Coding with Spacing and Shielding

Spacing and shielding can be regarded as simple, more "primitive" coding schemes that try to reduce the effects of crosstalk on parameters like delay and power consumption. On the one hand, increasing line spacing implies a reduction in bit rate per cycle and can be regarded as a serialization of the data transmission; on the other hand, shielding is equivalent to introducing redundancy at a constant total bit rate. Moreover, coding can be combined with such lower-level techniques in order to achieve an even higher total throughput increase rate. In the following, the suitability of combining spacing with coding is analyzed.

5.1 Combining Coding with Spacing

Let n, k, and m denote the initial bus width, the maximum allowed delay class, and the resulting bus width, respectively. The bit-rate reduction factor and the speed increase factor can then be defined in a more general fashion:

ζb(n, k, m) := (m/n) · ζb(m, k), (21)

ζs(n, k, m) := (C_{g,n} + 4C_{c,n}) / (C_{g,m} + kC_{c,m}) = (C_{g,n}/C_{g,m}) · (1 + 4κn)/(1 + kκm) = (C_{g,n}/C_{g,m}) · (1 + 4κn)/(1 + 4κm) · ζs(m, k), (22)

where

ζb(n, k) := ζb(n, k, n) and ζs(n, k) := ζs(n, k, n). (23)
Thus, the total throughput increase rate becomes:

ζt(n, k, m) = (m/n) · (C_{g,n}/C_{g,m}) · (1 + 4κn)/(1 + 4κm) · ζt(m, k). (24)

Further, if the bus width is fixed, the total throughput increase rate can be written as:

ζt(n, k) := ζt(n, k, n). (25)
Tab. 1 illustrates the theoretical total throughput increase rate for several hybrid schemes consisting of coding and spacing, applied to an eight-bit wide bus with t = 1.2 µm, w = 0.4 µm, and h = 0.6 µm. It can be observed that for the described bus geometry, coding (the last row) theoretically performs slightly better than pure spacing (the last column). However, when coding and spacing are combined, the maximum achievable throughput increase rate grows by more than 30 %. It is to be noticed that finding the best spacing is equivalent to finding the optimal compromise between serial and parallel data transmission. Basically, for every fixed k, the goal is to find an m that maximizes the achievable total throughput increase rate.

5.2 Optimizing Delay and Transition Activity

Konstantakopoulos developed in [6] a scheme that implements a version of the aforementioned D-RLL(1,∞) code. A 4-bit wide bus is extended to 6 bits, and the encoder maps the input data to symbols that have no adjacent bits equal to one. Thus, the highest allowed delay class is Δ3. As previously discussed, the effectiveness of such a bus-expanding implementation depends on the resulting increased bus aspect ratio. The scheme is efficient only if:

(C_{g,4}/C_{g,6}) · (1 + 4κ4)/(1 + 2κ6) ≥ 1. (26)
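Condition (26) can be evaluated directly once the capacitance parameters of the two geometries are known; the values below are made up for illustration:

```python
# Eq. (26): the 4-bit -> 6-bit D-RLL(1, inf) expansion is worthwhile only if
# the delay-class gain outweighs the denser wiring of the widened bus.
# All parameter values here are illustrative, not extracted from the paper.
def expansion_efficient(cg4, cg6, kappa4, kappa6):
    return (cg4 / cg6) * (1 + 4 * kappa4) / (1 + 2 * kappa6) >= 1

assert expansion_efficient(cg4=1.0, cg6=0.9, kappa4=2.0, kappa6=2.4)       # pays off
assert not expansion_efficient(cg4=1.0, cg6=1.2, kappa4=0.5, kappa6=2.0)   # does not
```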
The goal in [6] was to design an encoder consisting of as few gates as possible. For this purpose, one input bit has been hardwired directly to an output bit, which requires the use of at least four symbols of weight three. Nevertheless, the encoder can also be designed to minimize the self activity on the bus, i.e., to reduce the mean symbol weight [1]. Tab. 2 illustrates one of the many possible implementations of a codec characterized by codewords with weights not larger than two. The total self transition activity depends on the statistical data characteristics; for simplicity, uniformly distributed data is considered here. The mean symbol weight is then reduced from 1.75 to 1.5, which corresponds to an improvement of about 14.28 % in total self activity with respect to the scheme developed in [6].

5.3 Optimizing Delay and Total Transition Activity

The scheme developed for the D-RLL(1,∞) code could also be improved to additionally reduce the coupling transition activity or the total equivalent coupling activity. The
Table 2. D-RLL(1,∞) implementation for minimal self transition activity

 b_in,3..0    b_out,5..0
 0 0 0 0      0 0 0 0 0 0
 0 0 0 1      0 0 0 0 0 1
 0 0 1 0      0 0 0 0 1 0
 0 0 1 1      0 1 0 0 0 0
 0 1 0 0      0 0 0 1 0 0
 0 1 0 1      0 0 0 1 0 1
 0 1 1 0      1 0 0 0 0 1
 0 1 1 1      1 0 0 0 0 0
 1 0 0 0      0 0 1 0 0 0
 1 0 0 1      0 0 1 0 0 1
 1 0 1 0      0 0 1 0 1 0
 1 0 1 1      1 0 0 0 1 0
 1 1 0 0      1 0 0 1 0 0
 1 1 0 1      0 1 0 1 0 0
 1 1 1 0      0 1 0 0 1 0
 1 1 1 1      1 0 1 0 0 0
problem formulation is to find, for an n-bit wide bus, the so-called power-optimal mapping function (code) ΨC among all mapping functions, such that the average energy consumption is minimized. For a constant bit rate per symbol on an n-bit wide bus, the cardinality of a D-RLL(1,∞) alphabet is F_{m+2}, where log2 F_{m+2} ≥ n. Therefore, the total number of possible mapping functions is given by:

P^{2^n}_{F_{m+2}} = F_{m+2}! / (F_{m+2} − 2^n)! = 2^n! · (F_{m+2} choose 2^n), (27)

where P^k_n = n!/(n − k)! represents the number of permutations of n different things taken k at a time. For n = 8, the first Fibonacci number greater than 2^8 = 256 is F14 = 377. Thus, m = 12 and the total number of possible codes is P^256_377 = 377!/121!. Consequently, it is virtually impossible to search exhaustively for the best code, even for narrow buses.

Another efficient way to reduce power consumption in a bus, when the data statistics are known prior to design, is asymmetrical spacing. In this way, neighboring lines exhibiting a high coupling activity are more widely spaced than those with a low coupling activity. Asymmetrical spacing is thus a technique that trades performance for power by finding a set of spacings s_{i,i+1}, i = 1, . . . , n − 1, that minimizes the weighted coupling transition activity TCw:
TCw = Σ_{i=1}^{n−1} t_c(i, i+1) · κ(i, i+1) = Σ_{i=1}^{n−1} t_c(i, i+1) · κ(s_{i,i+1}), (28)

with

Σ_{i=1}^{n−1} s_{i,i+1} = constant. (29)
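The optimization of Eqs. (28)–(29) can be illustrated on a toy instance. The brute-force search below stands in for the branch-and-bound algorithm used to obtain Table 3; the activity profile t_c and the model κ(s) ∝ 1/s are made up for illustration:

```python
from itertools import product

# Toy instance of Eqs. (28)-(29): pick per-gap spacings on a 4-line bus that
# minimize the weighted coupling activity under a constant total spacing.
def weighted_activity(spacings, t_c):
    return sum(t / s for t, s in zip(t_c, spacings))   # kappa(s) ~ 1/s (assumed)

t_c  = [0.9, 0.3, 0.1]            # coupling activity of the three adjacent pairs
grid = [0.1, 0.2, 0.3, 0.4]       # candidate spacings (arbitrary units)
best = min((s for s in product(grid, repeat=3) if abs(sum(s) - 0.6) < 1e-9),
           key=lambda s: weighted_activity(s, t_c))
assert best[0] > best[2]          # the busiest pair gets the widest spacing
```

As in Table 3, the optimizer widens the gaps with high coupling activity and narrows the quiet ones.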
Table 3. Normalized power and κ for different minimum spacings

 smin [µm]  Norm. Pow.  κmax   s1,2   s2,3   s3,4   s4,5   s5,6   s6,7   s7,8
 0.18       93.94 %     4.59   0.220  0.220  0.210  0.200  0.180  0.180  0.190
 0.16       88.20 %     5.43   0.230  0.230  0.230  0.220  0.160  0.160  0.170
 0.14       64.86 %     6.55   0.240  0.240  0.240  0.220  0.160  0.140  0.160
 0.12       80.14 %     8.09   0.280  0.280  0.220  0.220  0.160  0.120  0.120
 0.10       76.80 %    10.27   0.280  0.280  0.220  0.220  0.200  0.100  0.100
 0.08       72.83 %    13.60   0.280  0.280  0.240  0.240  0.200  0.080  0.080
 0.06       70.80 %    19.17   0.275  0.275  0.275  0.260  0.185  0.060  0.070
 0.04       69.67 %    21.19   0.275  0.275  0.275  0.265  0.190  0.065  0.055
 0.02       68.42 %    30.22   0.280  0.280  0.280  0.260  0.200  0.060  0.040
Tab. 3 shows the optimal spacings calculated with a branch-and-bound algorithm for a synthetic 8-bit signal with μ = 0, σn = 0.19531, and ρ = 0.93. Thickness, height, and width have been set to 1.2 µm, 0.6 µm, and 0.4 µm, respectively. The minimum permitted spacing has been varied between 0.02 µm and 0.18 µm. It can be observed that while the power consumption decreases at a significant rate, the bus aspect factor increases much more rapidly for this type of bus geometry and coupling activity. Asymmetrical spacing can thus achieve a significant reduction of the total transition activity, however at the expense of an important performance loss. Contrary to coding, which can be employed at run time, this technique is applicable only at design time, and only if the data statistics are known a priori.

At run time, instead of asymmetrical spacing, one can implement an active shielding scheme whenever the bus width is greater than the number of bits required for data representation. For instance, one bus line remains unused when a 7-bit wide signal is mapped onto an 8-bit wide fixed bus. In order to reduce power consumption, the unused bit should be placed between the bits exhibiting the highest coupling activity. The effectiveness of the shielding depends on the exact coupling activity. Tab. 4 shows the coupling activity for non-isolated, non-shielded and quietly shielded 2-bit uncorrelated and uniformly distributed data. It can be easily shown that the coupling activity with a shield inserted between the two signal lines is equal to the self activity, and thus Tc = 0.5 in both cases. Consequently, for uncorrelated uniformly distributed 2-bit data, the quiet shield does not bring any advantage. In order to reduce the coupling activity between b0 and b1, the following coding function that expands a bus from n to n + 1 bits can be defined:

ΨC([b_{n−1}, b_{n−2}, . . . , b_0]) = [b_{n−1}, b_{n−2}, . . . , b_1, b_sh, b_0], (30)

Table 4. Bit coupling activity for unshielded (left) and shielded (right) 2-bit data

       00  01  10  11            00  01  10  11
 00     0   1   1   0      00     0   1   1   2
 01     0   0   2   0      01     0   0   1   1
 10     0   2   0   0      10     0   1   0   1
 11     0   1   1   0      11     0   0   0   0
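The Tc = 0.5 result can be reproduced with a small sketch. The charge-counting model below — coupling activity equals |ΔV| across a coupling capacitor whenever the final voltage across it is nonzero, since a discharge to 0 V draws no supply charge — is our reading of how the Table 4 entries are obtained; it matches both matrices:

```python
from itertools import product

# Coupling activity per clock cycle, averaged over uniform uncorrelated data.
def cap_activity(v_from, v_to):
    return abs(v_to - v_from) if v_to != 0 else 0

states = list(product((0, 1), repeat=2))
pairs = list(product(states, states))            # 16 equiprobable transitions

# Unshielded: one coupling cap sees V = b1 - b0.
tc_plain = sum(cap_activity(a1 - a0, b1 - b0)
               for (a1, a0), (b1, b0) in pairs) / len(pairs)

# Quiet shield at 0 V between the lines: two caps see b1 and b0 directly.
tc_shield = sum(cap_activity(a1, b1) + cap_activity(a0, b0)
                for (a1, a0), (b1, b0) in pairs) / len(pairs)

assert tc_plain == tc_shield == 0.5              # the quiet shield buys nothing
```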
[Figure: a five-state Markov chain over the states 000, 001, 111, 100, 101 of the bits (b1, bsh, b0); edges omitted.]

Fig. 1. Markov process describing the coupling activity in a shielded 2-bit bus
where the shielding line b_sh is computed as a Boolean function of the previous (−) and next (+) values of the neighboring lines, b_sh = f(b_0^−, b_1^−, b_0^+, b_1^+). (31)
The transitions of the bits b1, b_sh, and b0 are equivalent to the stochastic process represented in Fig. 1. The edges have equal probability and therefore the associated stochastic matrix is:

P = (1/4) · A_G = (1/4) ·
| 1 1 1 0 1 |
| 1 1 1 1 0 |
| 1 1 1 1 0 |     (32)
| 1 1 1 1 0 |
| 1 1 1 0 1 |

where A_G denotes the adjacency matrix. Thus, the state probability vector (the fixed vector of P) is:

w = (1/12) · (3 3 3 2 1)^t. (33)

Consequently, the resulting coupling activity is Tc = 3/8. The cost for the reduction in coupling activity is a small self transition activity on the shielding line, i.e., t_s,sh = 1/16.

When combining coding for performance and coding for power into a single scheme, one has to apply the coding for speed first. After the highest tolerated delay class is fixed, the remaining redundancy can be used to further reduce the transition activity, if possible. As previously shown, coupling activity can also be traded for self activity.
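The fixed vector of Eq. (33) can be checked mechanically against the matrix of Eq. (32) as transcribed above:

```python
from fractions import Fraction

# Verify that w = (1/12)(3, 3, 3, 2, 1)^t satisfies w P = w for P = (1/4) A_G.
A_G = [
    [1, 1, 1, 0, 1],
    [1, 1, 1, 1, 0],
    [1, 1, 1, 1, 0],
    [1, 1, 1, 1, 0],
    [1, 1, 1, 0, 1],
]
w = [Fraction(x, 12) for x in (3, 3, 3, 2, 1)]
wP = [sum(w[i] * Fraction(A_G[i][j], 4) for i in range(5)) for j in range(5)]
assert wP == w and sum(w) == 1        # w is indeed the stationary distribution
```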
6 Concluding Remarks

In this work, we have shown that in order to improve the throughput and/or power consumption of on-chip VDSM interconnects, signal encoding schemes must be combined with classical low-level (layout) anti-crosstalk techniques like spacing and shielding. For this purpose, a theoretical framework for assessing the improvement in throughput and/or power consumption has been constructed, and several possibilities to integrate coding with classical anti-crosstalk techniques have been discussed. In the future, with increasing skin and proximity effects, coding can also be combined with other layout techniques like wire splitting and wire tapering.
References

1. Benini, L., Macii, A., Macii, E., Poncino, M., Scarsi, R.: Architectures and Synthesis Algorithms for Power-Efficient Bus Interfaces. IEEE Trans. on Computer-Aided Design of Integrated Circuits and Systems 19(9), 969–980 (2000)
2. El-Moursy, M.A., Friedman, E.G.: Design Methodologies for On-Chip Inductive Interconnects. In: Interconnect-Centric Design for Advanced SoC and NoC, pp. 85–124. Kluwer, Dordrecht, The Netherlands (2004)
3. García Ortiz, A., Murgan, T., Kabulepa, L.D., Indrusiak, L.S., Glesner, M.: High-Level Estimation of Power Consumption in Point-to-Point Interconnect Architectures. J. of Integrated Circuits and Systems 1(1), 23–31 (2004)
4. Hirose, K., Yasuura, H.: A Bus Delay Reduction Technique Considering Crosstalk. In: Design Automation and Test in Europe (DATE), Paris, France, March 2000, pp. 441–445 (2000)
5. Ismail, Y.I., Friedman, E.G.: On-Chip Inductance in High Speed Integrated Circuits. Kluwer, Norwell, Massachusetts (2001)
6. Konstantakopoulos, T.K.: Implementation of Delay and Power Reduction in Deep Sub-Micron Buses Using Coding. Master's thesis, Massachusetts Inst. of Technology (May 2002)
7. LaMeres, B.J., Khatri, S.P.: Bus Stuttering: An Encoding Technique to Reduce Inductive Noise in Off-Chip Data Transmission. In: Design Automation and Test in Europe (DATE), Munich, Germany, March 2006, pp. 522–527 (2006)
8. Massoud, Y., Kawa, J., MacMillen, D., White, J.: Modeling and Analysis of Differential Signaling for Minimizing Inductive Cross-talk. In: Design Automation Conf. (DAC), Las Vegas, Nevada, June 2001, pp. 804–809 (2001)
9. Murgan, T.: High-Level Optimization of Performance and Power in Very Deep Sub-Micron Technologies. WiKu-Verlag, Duisburg–Cologne, Germany (2007)
10. Murgan, T., Bacinschi, P.B., García Ortiz, A., Glesner, M.: Partial Bus-Invert Bus Encoding Schemes for Low-Power DSP Systems Considering Inter-Wire Capacitance. In: Vounckx, J., Azemard, N., Maurine, P. (eds.) PATMOS 2006. LNCS, vol. 4148, pp. 169–180. Springer, Heidelberg (2006)
11. Murgan, T., Momeni, M., García Ortiz, A., Glesner, M.: A High-Level Compact Pattern-Dependent Delay Model for High-Speed Point-to-Point Interconnects. In: Intl. Conf. on Computer-Aided Design (ICCAD), San Jose, California, November 2006, pp. 323–328 (2006)
12. Rabaey, J.M., Chandrakasan, A., Nikolić, B.: Digital Integrated Circuits: A Design Perspective, 2nd edn. Prentice-Hall, Upper Saddle River, New Jersey (2003)
13. Rao, R.R., Deogun, H.S., Blaauw, D., Sylvester, D.: Bus Encoding for Total Power Reduction Using a Leakage-Aware Buffer Configuration. IEEE Trans. on Very Large Scale Integration (VLSI) Systems 13(12), 1376–1383 (2005)
14. Sotiriadis, P.P.: Power Reduction Coding for Buses. In: Interconnect-Centric Design for Advanced SoC and NoC, pp. 177–206. Kluwer, Dordrecht, The Netherlands (2004)
15. Sotiriadis, P.P., Chandrakasan, A.P.: Reducing Bus Delay in Submicron Technology Using Coding. In: Asia and South Pacific Design Automation Conf. (ASP-DAC), Yokohama, Japan, January–February 2001, pp. 109–114 (2001)
16. Sotiriadis, P.P., Chandrakasan, A.P.: A Bus Energy Model for Deep-Submicron Technology. IEEE Trans. on Very Large Scale Integration (VLSI) Systems 10(3), 341–350 (2002)
17. Sridhara, S.R., Ahmed, A., Shanbhag, N.R.: Area and Energy-Efficient Crosstalk Avoidance Codes for On-Chip Buses. In: Intl. Conf. on Computer Design (ICCD), San Jose, California, October 2004, pp. 12–17 (2004)
18. Stan, M.R., Burleson, W.P.: Bus-Invert Coding for Low-Power I/O. IEEE Trans. on Very Large Scale Integration (VLSI) Systems 3(1), 49–58 (1995)
19. Stan, M.R., Burleson, W.P.: Low-Power Encodings for Global Communication in CMOS VLSI. IEEE Trans. on Very Large Scale Integration (VLSI) Systems 5(4), 444–455 (1997)
20. Victor, B., Keutzer, K.: Bus Encoding to Prevent Crosstalk Delay. In: Intl. Conf. on Computer-Aided Design (ICCAD), San Jose, California, November 2001, pp. 57–69 (2001)
Soft Error-Aware Power Optimization Using Gate Sizing

Foad Dabiri, Ani Nahapetian, Miodrag Potkonjak, and Majid Sarrafzadeh

Computer Science Department, University of California Los Angeles, Los Angeles, CA 90095, USA
{dabiri,ani,miodrag,majid}@cs.ucla.edu
Abstract. Power consumption has emerged as the premier and most constraining aspect in modern microprocessor and application-specific designs. Gate sizing has been shown to be one of the most effective methods for power (and area) reduction in CMOS digital circuits. As the feature size of logic gates (and transistors) becomes smaller and smaller, the effect of soft errors caused by single event upsets (SEUs) grows exponentially. As a consequence of technology feature size reduction, the SEU rate for typical microprocessor logic at sea level will go from one in a hundred years to one every minute. Unfortunately, the gate sizing requirements of power reduction and of resiliency against SEUs can be contradictory. 1) We consider the effects of gate sizing on SEUs and incorporate the relationship between power reduction and SEU resiliency to develop a new method for power optimization under SEU constraints. 2) Although a non-linear programming approach is the more obvious solution, we propose a convex programming formulation that can be solved efficiently. 3) Whereas many existing optimal techniques for gate sizing deal with an exponential number of paths in the circuit, we prove that it is sufficient to consider a linear number of constraints. As an important preprocessing step, we apply statistical modeling and validation techniques to quantify the impact of fault masking on the SEU rate. We evaluate the effectiveness of our methodology on the ISCAS benchmarks and show that error rates can be reduced by a factor of 100% to 200% while, on average, the power saving decreases by less than 7% and 12%, respectively, compared to the optimal power saving with no error rate constraints.
1 Introduction
Single event upsets (SEUs) arising from transient faults have emerged as a key challenge in logic circuitry design [14]. Recent studies indicate that from 1992 to 2011 the SEU rate for logic will increase by more than a billion times and will surpass the soft error rate of unprotected memory. As a consequence of technology feature size reduction, the SEU rate for typical microprocessor logic at sea level will go from one in a hundred years to one every minute [14], resulting in a clear need for addressing the problem in a systematic way. SEU faults arise from energetic particles, such as neutrons from cosmic rays and alpha particles

N. Azemard and L. Svensson (Eds.): PATMOS 2007, LNCS 4644, pp. 255–267, 2007.
© Springer-Verlag Berlin Heidelberg 2007
256
F. Dabiri et al.
from packaging material, which generate electron-hole pairs as they pass through a semiconductor device [18]. During an IC's normal operation, such faults can also be caused by electromagnetic interference (EMI). Transistor source and diffusion nodes can collect these charges, and a sufficient amount of accumulated charge may invert the state of a logic device, such as an SRAM cell, a latch, or a gate, thereby introducing a logical fault into the circuit's operation. Because this type of fault does not reflect a permanent failure of the device, it is termed a soft error (SE) or transient fault (TF).

Advances in microelectronic technology, which shrink IC feature sizes to the nanometer range while also reducing the supply voltage, are making electronic circuits increasingly susceptible to transient faults. In fact, the reduction of the charge stored on circuit nodes, along with the decrease in noise margins, greatly increases the probability of voltage glitches temporarily altering nodes' voltage values [4]. Meanwhile, the continuous increase in ICs' operating frequencies makes the sampling of such glitches increasingly probable. Consequently, TFs will become a frequent cause of failure in many applications as technology advances.

Power consumption has been recognized as the critical constraint in modern microprocessor and application-specific designs, and gate sizing has been one of the most effective methods for power minimization in CMOS digital circuits. Unfortunately, the gate sizing requirements for power reduction and for resiliency against SEUs are contradictory. We consider the effects of gate sizing on SEUs, incorporate the relationship between power reduction and SEU resiliency, and develop a new method of power optimization under SEU constraints that leverages convex programming to obtain provably optimal solutions. As an important preprocessing step, we apply statistical modeling and validation techniques.
Gate sizing is a timing optimization step in high-performance VLSI circuit design. In this design process, the size of each gate in a combinational circuit is tuned so that circuit area and/or overall power dissipation are minimized under specified timing constraints. Gate sizing, and the similar problem of transistor sizing, has been an active research topic in recent years, and many approaches have been proposed [13], [2], [11], [17]. Previous approaches that take power considerations into account during transistor sizing include [1], [10], [16]. The approach in [10] utilizes the linear dependency between power and gate sizes; however, since it optimizes one path at a time, it may lead to suboptimal solutions. A linear programming approach for exploring the power-delay-area tradeoff of a CMOS circuit is presented in [1]; we use the more accurate nonlinear logical effort delay model in this work. Another linear programming based approach is presented in [16]. Power optimization with convex programming is proposed by Menezes, Baldick, and Pileggi [13]. In their work, timing constraints are constructed for every path in the circuit, which can potentially generate a very large (exponential) number of constraints. In order to capture the effects of logic fault
masking, we introduce resubstitution-based statistical methodology and techniques [6], [5] for quantifying error propagation through logic circuitry. We also introduce a new formulation of the gate sizing problem using convex programming. Our approach differs from previous ones in that the number of constraints in our formulation is linear in the circuit size, as opposed to the exponential number of constraints in previous work. At the same time, we impose a bound on soft error rates and evaluate the performance of gate sizing considering single event upsets. The rest of the manuscript is organized as follows. Section 2 covers the preliminaries and the models we use for delay and soft errors. Section 3 introduces the statistical methodology we have incorporated to calculate logical masking probabilities. Our formulation and problem transformation is presented in Section 4, and Section 5 illustrates the simulation results on the ISCAS85 benchmarks.
2 Preliminaries

2.1 Power and Delay Models
Power dissipation of gates in digital CMOS circuits is composed of dynamic and static components. Dynamic power corresponds to the power dissipated in charging and discharging the internal capacitances of every gate, and is given by Equation (1):

P_dynamic = Σ_{i=1}^{N} C_i f_clock V_DD² = Σ_{i=1}^{N} Φ_i · W_i, (1)
where C_i is the effective switching capacitance of gate i, which is a linear function of the size of the gate, f_clock is the clock frequency, and V_DD is the supply voltage. Φ_i simply represents the linear dependency of the power on the size, since the capacitance of a gate is a linear function of the size (width) of the gate. The sum is taken over all the gates in the circuit.

Gate delay can be represented as a function of the internal capacitances of the logic gates. We use the logical effort method to model the delay [15]. The delay d_i of gate i can be written as:

d_i = p_i + g_i h_i, (2)

where p_i is the parasitic delay of the gate and is independent of size, g_i is the logical effort of the gate, which intuitively captures its driving capability, and h_i is the electrical effort (gain). h_i is the size-dependent term in the delay:

h_i = (Σ_j C_j) / C_i, (3)

where the sum is taken over all the loads that gate i drives. As stated previously, gate capacitance is linearly dependent on the size of the gate and therefore it
is a function of size. To represent the dependency of the delay on the sizes, we rewrite Equation (2) using a new function κ:

d_i = κ_i(W_i, W_j, . . .) = p_i + g_i (Σ_j k_j W_j) / (k_i W_i), (4)
where C_j/C_i = k_j W_j / (k_i W_i). The logical effort model is a simplified gate delay model which may not be very accurate for current circuit technologies; we have chosen it to illustrate the impact of the soft error rate on power optimization and how this issue can be addressed. Our method can be generalized to more elaborate delay models as well.

2.2 Single Event Upset
A single event upset (SEU) occurs when a charged particle deposits some of its charge in a microelectronic device, such as a CPU, memory chip, or power transistor. This happens when cosmic particles collide with atoms in the atmosphere, creating cascades or showers of neutrons and protons. At deep sub-micrometer geometries, this affects semiconductor devices even at sea level. In space, the problem is worse because particle energies are higher; similar energies occur on terrestrial flights over the poles or at high altitude. Trace amounts of radioactive elements in chip packages also lead to SEUs. Frequently, SEUs are referred to as bit flips. A method for estimating the soft error rate (SER) in CMOS SRAM circuits was recently developed in [9]. This model estimates the SER due to atmospheric neutrons (neutrons with energies > 1 MeV) for a range of submicron feature sizes. It is based on a verified empirical model for the 600 nm technology, which is then scaled to other technology generations. The basic form of this model is:

$$SER = F \times A \times e^{-\frac{Q_{crit}}{Q_S}} \qquad (5)$$

where F is the neutron flux with energy ≥ 1 MeV, in particles/(cm²·s); A is the area of the circuit sensitive to particle strikes, in cm² (the sensitive area is the area of the source of the transistors, which is a function of gate size); Qcrit is the critical charge, in fC; and QS is the charge collection efficiency of the device, in fC.1 A very accurate model for Qcrit and its dependency on gate sizes is presented in [4]. In this model, Qcrit for gate i depends on gate sizes as follows:

$$Q_{crit_i} = Q_{crit_{min}} + a_i (W_i - W_{i_{min}}) + \sum_j b_j (W_j - W_{j_{min}}) \qquad (6)$$
where Qcrit_min is the critical charge for minimum driver conductance, minimum diffusion capacitances, and minimum fan-out gate input capacitance. The coefficients a_i and b_j are constant parameters that weight the contributions to Qcrit. The

1 The term "gate" is used to represent both "logic gates" and the "gate terminal" of a CMOS transistor, which can be misleading.
Soft Error-Aware Power Optimization Using Gate Sizing
259
sum is taken over all gates driven by gate i. As seen in Equation 6, gate sizing has an effect on Qcrit; we therefore use a function Θ to represent this dependency:

$$Q_{crit_i} = \Theta_i(W_i, W_j, \ldots) \qquad (7)$$
Furthermore, the area A sensitive to SEUs is linearly dependent on size:

$$A_i = \alpha_i W_i \qquad (8)$$
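Pulling Equations 5 through 8 together, the per-gate SER evaluation can be sketched as follows. All numeric values and the two-gate topology are illustrative placeholders, not values from the paper:

```python
import math

def q_crit(i, W, a, b, fanout, q_min, W_min):
    """Equation 6: critical charge of gate i as a function of gate widths."""
    q = q_min + a[i] * (W[i] - W_min[i])
    q += sum(b[j] * (W[j] - W_min[j]) for j in fanout[i])
    return q

def gate_ser(i, W, flux, alpha, q_s, **model):
    """Equations 5, 7, 8: SER_i = F * (alpha_i * W_i) * exp(-Qcrit_i / QS)."""
    area = alpha[i] * W[i]                       # Eq. 8: A_i = alpha_i * W_i
    return flux * area * math.exp(-q_crit(i, W, **model) / q_s)

# Illustrative two-gate example: gate 0 drives gate 1 (made-up numbers).
params = dict(a=[1.0, 1.0], b=[0.2, 0.2], fanout={0: [1], 1: []},
              q_min=10.0, W_min=[1.0, 1.0])
W = [2.0, 1.5]
ser0 = gate_ser(0, W, flux=1e-3, alpha=[1e-8, 1e-8], q_s=5.0, **params)
```

Note how enlarging a gate raises both its critical charge (lowering the exponential factor) and its sensitive area (raising the linear factor), which is the tension discussed below.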
Substituting Qcrit and Ai in Equation 5 gives a nonlinear relationship between error rate and gate sizes for a given logic gate. It is important to notice that even if a soft error is generated in a logic gate, it does not necessarily propagate to the output. A soft error can be masked due to the following factors:

– Logical masking occurs when the output is not affected by the error in a logic gate because the outputs of subsequent gates depend only on their other inputs.
– Temporal masking (latching-window masking) occurs in sequential circuits when the pulse generated by a particle hit reaches a latch, but not at the clock transition, so the wrong value is not latched.
– Electrical masking occurs when the pulse resulting from an SEU attenuates as it travels through logic gates and wires. Pulses outside the cutoff frequency of CMOS elements also fade out [7][14].

Therefore we assign a probability ρ to each logic gate indicating how likely a pulse resulting from an SEU is to survive to the output and cause an error. The final error rate (λ) assigned to each gate i is λi = SERi · ρi. In Section 3 we introduce a methodology for statistically computing these probabilities.

2.3 System Lifetime and MTTF
In this paper we consider the soft error rate as a measure of system failure. If the error rate in a system is λ, the mean time to failure is MTTF = 1/λ. Therefore, if an MTTF greater than a given value Υ is desired, it implies that λ ≤ 1/Υ. In digital circuits, since all gates are potentially prone to soft errors, the total error rate of the circuit (Λ) is $\Lambda = \sum_{\forall gates} \lambda_i$. Using Equation 5 we can derive the following equation for the total error rate of a digital circuit:

$$\Lambda = \sum_i \rho_i \, SER_i = \sum_i \rho_i \, F \, A_i \, e^{-\frac{Q_{crit_i}}{Q_{S_i}}} \qquad (9)$$

3 Statistical Analysis of Gate Masking

3.1 General Approach
Extensive statistical analysis was done on the circuits from the ISCAS 85 and 89 benchmarks to determine the impact that gate masking can have on the circuit
level soft error rate. The first approach was to observe statistically what impact an error in a specific gate would have on the observed error of the circuit. In other words, we compared the global outputs observed with and without soft errors in gates. From this analysis we determined the probability that an error in a specific gate results in an error at the circuit outputs. The analysis was conducted by simulating the output values of the circuits for randomly selected input values. First, we simulated the circuit a statistically large number of times, using independently generated random values for all inputs. Then we compared the output of the properly functioning circuit with that of the circuit where a single event upset had occurred, i.e., where a bit had been flipped. As expected, because of gate masking, the effect of the flipped bit was not always visible at any of the global outputs. We carried out this simulation for every gate in the circuit, for all the benchmark circuits.

3.2 Reliability of Results
In our experiments we considered 2000 independently generated random input values. Of course, this is a small fraction of the number of possible input values, but experimentation over various runs with different input instances revealed a large correlation among runs. We verified our results by running various instances of the experiments to confirm that the results were indeed consistent. One such instance, benchmark c432, is shown in Figure 1. The graph shows fifteen different runs on the same benchmark, each for a statistically significant number of randomly selected input values. The results obtained are compared with the total over all fifteen runs: each run deviates by less than 3% from the values obtained with the fifteen-times-larger test case. The results
Fig. 1. Shown here for a single benchmark, c432, across 15 different runs, we see less than a 3% variation from the values obtained using the total iterations across all 15 of the runs. This provides evidence that our results are statistically close to the actual circuit characteristics.
help to demonstrate the reliability of the statistical analysis conducted, and they give statistical evidence of the correlation between the results obtained and the actual characteristics of the circuit.
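The repeated fault-injection runs described in this section can be sketched as follows. The three-gate netlist is a hypothetical toy circuit (not an ISCAS benchmark), used only to show the mechanics of flipping one gate's output and comparing global outputs:

```python
import random

# Toy combinational netlist: each gate is (function, input names).
NETLIST = {
    "g1": (lambda a, b: a & b, ("x1", "x2")),
    "g2": (lambda a, b: a | b, ("x2", "x3")),
    "g3": (lambda a, b: a & b, ("g1", "g2")),   # primary output
}
OUTPUTS = ["g3"]

def evaluate(inputs, flipped=None):
    """Evaluate the netlist in topological order; optionally flip one
    gate's output to model a single event upset."""
    values = dict(inputs)
    for name, (fn, ins) in NETLIST.items():
        v = fn(*(values[i] for i in ins))
        if name == flipped:
            v ^= 1
        values[name] = v
    return tuple(values[o] for o in OUTPUTS)

def masking_survival(gate, runs=2000, seed=0):
    """Fraction of random input vectors for which a flip in `gate` is
    visible at a primary output, i.e. is NOT logically masked."""
    rng = random.Random(seed)
    visible = 0
    for _ in range(runs):
        inp = {x: rng.randint(0, 1) for x in ("x1", "x2", "x3")}
        if evaluate(inp) != evaluate(inp, flipped=gate):
            visible += 1
    return visible / runs
```

A flip in the output gate g3 is always visible, while a flip in g1 is masked whenever g2 evaluates to 0, so `masking_survival("g1")` converges to about 0.75 for this toy circuit.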
4 Problem Formulation
In logic synthesis, circuits are usually modeled as a directed acyclic graph G = (V, E) (see Figure 2a). In this model, nodes represent the logic gates and edges stand for the precedence relations between them. We transform the given graph G into G′ in such a way that each node v in G is split into two nodes v1 and v2 and an edge connecting v1 to v2. In the transformed graph, the new edges are essentially the logic gates from the original graph. Figure 2a shows an example of such a transformation. In order to have a single input and a single output, nodes s and t have been added to the graph; s is connected to all primary inputs, and all primary outputs are connected to t. The delay of a path p = <s, v1, v2, ..., t> from node s to node t is equal to the sum of the delays of the edges along the path, where the sum is taken over all edges in p. We use the terms 'delay of a path' and 'distance between nodes s and t' interchangeably. The problem is defined as: given a DAG G = (V, E), a timing constraint T, and an error rate constraint Υ,

$$\text{minimize} \sum_{\forall e_{ij} \in E} P_{ij} \qquad (10)$$
such that the delay of every path from s to t is less than or equal to T and the error rate (caused by SEUs) is less than Υ. Pij is the power consumption of the ij-th edge in the DAG, which is a function of the capacitance of the gate.2 The timing constraint can be stated as $\sum_{e_{ij} \in p_k} d_{ij} \le T$ for every path pk from s to t. Note that the number of paths in a DAG is exponential in the number of edges, so this formulation is not efficient. Throughout the rest of this section, we convert it to a formulation with the same objective function but a linear number of constraints.

Theorem 1: There is an optimal gate sizing solution on a DAG such that the distance between any node u and the output t is independent of the choice of the path taken between them, and this distance is unique.

Proof: Suppose the claim is not true, i.e., there exists a node v whose distance to t through path P1 is less than through P2 (see Figure 2b). Before getting to the proof, it is important to note that the edge euv is a split node; in other words, it represents the gate w in the original graph, and therefore the outgoing edges
2 Indices for gate parameters such as power (P) and delay (d) are changed starting from this section of the paper, because every gate is represented by an edge in the transformed graph (see Figure 2). For example, instead of Pi we use Pij for the power consumption of a gate.
Fig. 2. (a) DAG representation of the circuit (upper left) and its transformation; each node (gate) in the original DAG is replaced by an edge in G′. (b) Figure for Theorem 1.
from v represent actual edges from the original graph, not gates. Without loss of generality we can assume P1 is shorter than P2. We claim that there exists an edge e* in P1 that can be slowed down without violating the timing constraint, because P2 is on the critical path from v to t. One immediate candidate for e* is the first edge in P1. Increasing the delay of e* by dP2 − dP1 will not cause a timing violation, and since e* contributes neither to the cost function nor to the error rate constraint, the total power dissipation and error rate remain constant. This increase is made by assigning a "dummy" delay to e*. Therefore we can maintain the same objectives by equalizing the delays of P1 and P2, and since this optimization problem has only one global minimum, there must exist an optimal solution in which the statement of Theorem 1 holds. Theorem 1 also holds for more complex delay-power models: as long as the power dissipation of a gate is a nondecreasing function of gate size (which it indeed is), the theorem holds. The following observation is immediately inferred from the above theorem: since the delay from every node to the destination in the optimal solution is independent of the path taken, let ti be a variable assigned to each node vi that represents its distance to t. A similar technique was proposed in [8], resulting in an efficient integer delay budgeting algorithm. We call ti the distance variable of node vi; in other words, ti is the delay of the system from node vi to the output. Therefore, the delay and power consumption of each edge (node in the original graph) are given by:

$$d_{ij} = t_i - t_j = p_{ij} + g_{ij} h_{ij}, \qquad P_{ij} = \phi_{ij} W_{ij}, \qquad \forall e_{ij} \in E$$
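To make the split-node transformation and the distance variables concrete, here is a small sketch. The three-gate circuit, wire topology, and delays are made-up illustrative values, not from the paper; with edge delays fixed, the tightest feasible distance variable $t_i$ is the longest delay from node i to the sink t:

```python
import functools

def transform(gates, wires, primary_inputs, primary_outputs):
    """Split-node transformation: gate v -> edge (v_in, v_out) in G';
    wires and the s/t connections become ordinary (zero-delay) edges."""
    edges = []
    for v in gates:
        edges.append((v + "_in", v + "_out"))      # the gate itself
    for u, v in wires:
        edges.append((u + "_out", v + "_in"))      # gate-to-gate wire
    for v in primary_inputs:
        edges.append(("s", v + "_in"))             # source s feeds inputs
    for v in primary_outputs:
        edges.append((v + "_out", "t"))            # outputs feed sink t
    return edges

EDGES = transform(["g1", "g2", "g3"],
                  [("g1", "g3"), ("g2", "g3")],
                  ["g1", "g2"], ["g3"])

# Illustrative gate delays; wire and s/t edges have zero delay.
DELAY = {("g1_in", "g1_out"): 2.0, ("g2_in", "g2_out"): 1.0,
         ("g3_in", "g3_out"): 1.5}

@functools.lru_cache(maxsize=None)
def dist(node):
    """Distance variable t_node: longest delay from `node` to t."""
    out = [(u, v) for (u, v) in EDGES if u == node]
    if not out:
        return 0.0
    return max(DELAY.get(e, 0.0) + dist(e[1]) for e in out)

# Every edge then satisfies t_i - t_j >= d_ij (cf. Eqs. 11-12), and
# t_s is the critical-path delay checked against T (cf. Eq. 13).
ok = all(dist(u) - dist(v) >= DELAY.get((u, v), 0.0) for u, v in EDGES)
```

For this toy graph the critical path runs through g1 and g3, so `dist("s")` is 3.5, and every edge constraint holds with equality on the critical path.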
Thus, instead of having a constraint for each path from s to t, we construct the following constraints:

$$t_i - t_j \ge p_i + g_i h_i, \quad \forall e_{ij} \in E(G') - E(G) \qquad (11)$$

$$t_i - t_j \ge 0, \quad \forall e_{ij} \in E(G') \cap E(G) \qquad (12)$$

$$t_s - t_t \le T \qquad (13)$$

$$W_{ij} \ge W_{min} \qquad (14)$$
Equation 11 enforces that the delay assigned to each gate is greater than or equal to its minimum (parasitic) delay, while Equation 12 assigns a non-negative delay to those edges in the original graph that represent connections between gates. Equation 13 guarantees that the distance from s to t is less than or equal to the timing constraint T; it can be interpreted as a delay bound on the virtual edge between s and t, also shown in Figure 2a. The constraint in Equation 14 is simply the lower bound on transistor width. The error rate can be bounded by the following constraint:
$$\sum_i \rho_i \, F \, \alpha_i W_i \, e^{-\frac{\Theta_i(W_i, W_j, \ldots)}{Q_{S_i}}} \le \Upsilon \qquad (15)$$
in which Υ is the desired upper bound on the soft error rate. All the timing constraints on paths are reformulated as edge constraints, and the optimization problem can be restated as:

$$\text{minimize } f(W) = \sum_{\forall e_{ij} \in E} \phi_{ij} W_{ij} \qquad (16)$$

subject to the constraints in Equations 11–15. The objective in Equation 16, together with these constraints, forms a nonlinear optimization problem with a linear number of constraints in terms of the size of the graph. In the next section we modify and solve this problem using a convex programming method.

4.1 Convexity of the Optimization Problem
Convex optimization problems are far more general than linear programming problems, but they share the desirable properties of LP problems: they can be solved quickly and reliably, even in very large cases. A convex optimization problem is a problem where all of the constraints are convex functions and the objective is a convex function to be minimized, or a concave function to be maximized. With a convex objective and a convex feasible region, there can be only one optimal solution, which is globally optimal. Several proposed methods, notably the interior point method, can either find the globally optimal solution or prove that there is no feasible solution to the problem. The objective defined in Equation 16 is a linear function of the variables Wi and is therefore convex. To establish the convexity of the proposed formulation, it is
required to show that the feasible solution space created by the constraints is in fact convex. The constraints in Equation 11 can be rewritten as:

$$\frac{t_j}{t_i} + \frac{p_i}{t_i} + \frac{g_i h_i}{t_i} \le 1 \qquad (17)$$
Since all the variables in the above equation are positive, the constraint in Equation 17 is a posynomial expression. Posynomials are not convex in this form, but with a change of variables they can be mapped to a convex space: if each variable x in a posynomial expression is substituted with e^z, the resulting expression becomes a convex exponential function [3]. The constraint presented in Equation 15 is a nonlinear function which in general is very hard to handle. It has both linear and exponential dependency on the variables, which results in a non-monotone function. Each variable Wi contributes linearly to only one term of Equation 15 (through the corresponding sensitive area Ai) but appears exponentially in several terms (in Qcrit for gate i and all fan-in gates). Therefore, not only do more terms include Wi in the exponent, but their effect is much more significant than the linear dependency of A on Wi. On the other hand, shrinking a gate reduces both its power consumption and its area sensitive to SEUs; hence the exponential contribution of gate sizes and the power dissipation are the two contradictory factors, not the sensitive area. This observation led us to modify Equation 15: we assume an average value for each Ai and change the constraint to an exponential form, which is convex. This modification, together with the exponential transformation of the posynomial constraints, keeps the objective and the other constraints convex, and the solution space, being the intersection of subspaces created by convex constraints, is itself convex. In the simulation section, after calculating gate sizes, we use the actual resulting gate sizes to recompute the total error rate and report the correct numbers.
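The posynomial-to-convex change of variables can be checked numerically on a single term. Substituting $x = e^z$ turns $c \, x_1^{a_1} x_2^{a_2}$ into $\exp(\log c + a_1 z_1 + a_2 z_2)$, the exponential of an affine function, which is convex in $(z_1, z_2)$. The coefficients below are illustrative:

```python
import math

def term(z1, z2, c=2.0, a1=1.0, a2=-1.0):
    """One posynomial term c * x1^a1 * x2^a2 after substituting x = e^z."""
    return math.exp(math.log(c) + a1 * z1 + a2 * z2)

# Midpoint convexity check: f((p+q)/2) <= (f(p) + f(q)) / 2.
p, q = (0.3, -1.2), (2.0, 0.7)
mid = term((p[0] + q[0]) / 2, (p[1] + q[1]) / 2)
assert mid <= (term(*p) + term(*q)) / 2
```

The midpoint inequality holds for any pair of points here, since the exponential of an affine function is convex everywhere; note that the negative exponent $a_2 = -1$ (as in the $1/t_i$ factors of Equation 17) poses no problem after the substitution.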
5 Simulation Results
We used the MOSEK convex optimization tool [12] to solve the proposed formulation on the ISCAS benchmarks. Although the ISCAS benchmarks do not include very large circuits, they have proven to be a proper set of test cases to show the validity of a proposed methodology and concept. For each benchmark, we calculate the total delay of the circuit, T, with the initial size values and then use that delay as the timing constraint for the optimization problem. Each benchmark also has a total error rate with the initial sizes, Λ, which we use as a constraint on the error rate. For each circuit, we minimize the power consumption with four different bounds on the error rate, Υ, starting with no bound on the error rate, then the error rate of the initial circuit, and up to a rate reduced by a factor of 200%. Figures 3 and 4 illustrate the power consumption reduction versus the bound on the error rate for combinational circuits. A point (x, y) in these graphs means that the power dissipation has been reduced by y% while the total error rate has been decreased by at least x% compared to the error rate of the initial circuit.
Fig. 3. Soft error rate can be reduced by a huge factor without compromising the power savings in these combinational circuits
Fig. 4. Simulation results for 4 larger combinational circuits: power saving in the c6288 benchmark is more dependent on the bounds on error rates
It can be observed that an average power saving of 63% can be achieved without any constraint on the error rate for these benchmarks. Obviously, the power saving percentage depends on the initial design with which we compare our results; if the original design is far from optimal, the power saving ratio becomes larger. The significance of the results is that we can achieve optimal power reduction in the presence of soft error rate constraints. This is the reason why we have not compared our results with any other power optimization method. We ran the same optimization on sequential circuits as well; Figures 5 and 6 summarize these results. The average power reduction is about 61% for these benchmarks. Comparing the results for combinational circuits to sequential ones, we observe that the error rate bounds on sequential circuits are less restrictive in the power optimization process. One immediate reason could be the fact that temporal masking reduces the effect of soft errors in sequential circuits, whereas no such masking is present in combinational circuits.
Fig. 5. It can be observed that the error rate in these sequential circuits can be improved dramatically without compromising power savings
Fig. 6. Simulation results for 4 larger sequential circuits, including s1423, which is more dependent on the bounds on error rates compared to the other benchmarks
6 Conclusion
In this paper, we have introduced a new formulation for gate sizing that simultaneously targets power optimization, resiliency against SEUs, and timing constraints. As a preprocessing step for the optimization, we have developed a statistical modeling and validation technique that quantifies the impact of fault masking in combinational logic. We formulated the problem as a convex optimization problem of linear size, as opposed to previous convex programming approaches, which could potentially be exponential in size. The MOSEK convex optimization tool was used to evaluate the proposed approach on the ISCAS benchmarks. We were able to minimize the power dissipation for a given timing constraint and various upper bounds on the error rates caused by single event upsets. An important practical result is that convex programming-based gate sizing can simultaneously reduce power consumption and improve SEU resiliency. Our simulations showed that different circuits from the various benchmarks behave differently when carrying out power optimization while considering soft errors. As future work, we propose examining the question of designing circuits such that, through gate sizing, the most power savings can be achieved while enforcing low error rates.
References [1] Berkelaar, M.R.C.M., Jess, J.A.G.: Gate sizing in mos digital circuits with linear programming. In: EURO-DAC ’90: European design automation conference, Los Alamitos, CA, USA, 1990, pp. 217–221. IEEE Computer Society Press, Los Alamitos (1990) [2] Borah, M., Owens, R.M., Irwin, M.J.: Transistor sizing for minimizing power consumption of cmos circuits under delay constraint. In: ISLPED ’95, International Symposium on Low Power Design, New York, NY, USA, pp. 167–172. ACM Press, New York (1995) [3] Boyd, S., Vandenberghe, L.: Convex Optimization. Cambridge University Press, New York, NY, USA (2004) [4] Cazeaux, J.M., Rossi, D., Omana, M., Metra, C., Chatterjee, A.: On transistor level gate sizing for increased robustness to transient faults. In: IOLTS ’05, International On-Line Testing Symposium, Washington, DC, USA, 2005, pp. 23–28. IEEE Computer Society, Los Alamitos (2005) [5] Efron, B.: The Jackknife, the Bootstrap, and Other Resampling Plans. S.I.A.M., Philadelphia (1982) [6] Tibshirani, R.J., Efron, B.: An Introduction to the Bootstrap. Chapman & Hall/CRC, New York, NY, USA (1994) [7] Mitra, S., et al.: Logic soft errors in sub-65nm technologies design and cad challenges. In: DAC ’05, Design Automation Conference, New York, NY, USA, 2005, pp. 2–4. ACM Press, New York (2005) [8] Ghiasi, S., Bozorgzadeh, E., Choudhuri, S., Sarrafzadeh, M.: A unified theory of timing budget management. In: ICCAD ’04, International conference on Computer-aided design, Washington, DC, USA, 2004, pp. 653–659. IEEE Computer Society Press, Los Alamitos (2004)
Soft Error-Aware Power Optimization Using Gate Sizing
267
[9] Hazucha, P., Svensson, C.: Cosmic-ray soft error rate characterization of a standard 0.6-μm CMOS process. IEEE Journal of Solid-State Circuits, 1422–1429 (2000) [10] Hedlund, K.S.: Aesop: a tool for automated transistor sizing. In: DAC ’87: Proceedings of the 24th ACM/IEEE conference on Design automation, New York, NY, USA, pp. 114–120. ACM Press, New York (1987) [11] Menezes, N., Baldick, R., Pileggi, L.T.: A sequential quadratic programming approach to concurrent gate and wire sizing. In: ICCAD ’95: Proceedings of the 1995 IEEE/ACM international conference on Computer-aided design, Washington, DC, USA, 1995, pp. 144–151. IEEE Computer Society, Los Alamitos (1995) [12] MOSEK ApS, Denmark. The MOSEK optimization tools manual (2002), http://www.mosek.com [13] Sapatnekar, S.S., Chuang, W.: Power-delay optimizations in gate sizing. ACM Trans. Des. Autom. Electron. Syst. 5(1), 98–114 (2000) [14] Shivakumar, P., Kistler, M., Keckler, S.W., Burger, D., Alvisi, L.: Modeling the effect of technology trends on the soft error rate of combinational logic. In: DSN ’02, Dependable Systems and Networks, Washington, DC, USA, 2002, pp. 389–398. IEEE Computer Society, Los Alamitos (2002) [15] Sutherland, I.E., Sproull, R.F.: Logical effort: designing for speed on the back of an envelope. In: Proceedings of the 1991 University of California/Santa Cruz conference on Advanced research in VLSI, Cambridge, MA, USA, 1991, pp. 1–16. MIT Press, Cambridge (1991) [16] Tamiya, Y., Matsunaga, Y., Fujita, M.: Lp based cell selection with constraints of timing, area, and power consumption. In: ICCAD ’94, International Conference on Computer-Aided Design, Los Alamitos, CA, USA, 1994, pp. 378–381. IEEE Computer Society Press, Los Alamitos (1994) [17] Tennakoon, H., Sechen, C.: Efficient and accurate gate sizing with piecewise convex delay models. In: DAC ’05, Design Automation Conference, New York, NY, USA, 2005, pp. 807–812.
ACM Press, New York (2005) [18] Weaver, C., Emer, J., Mukherjee, S.S., Reinhardt, S.K.: Techniques to reduce the soft error rate of a high-performance microprocessor. In: ISCA ’04: Proceedings of the 31st annual international symposium on Computer architecture, Washington, DC, USA, 2004, p. 264. IEEE Computer Society, Los Alamitos (2004)
Automated Instruction Set Characterization and Power Profile Driven Software Optimization for Mobile Devices Matthias Grumer1, Manuel Wendt1, Christian Steger1, Reinhold Weiss1, Ulrich Neffe2, and Andreas Mühlberger2 1
Institute for Technical Informatics Graz University of Technology {grumer,wendt,steger,rweiss}@iti.tugraz.at 2 NXP Semiconductors Business Line Identification {ulrich.neffe,andreas.muehlberger}@nxp.com
Abstract. The complexity of mobile devices is continuously growing due to increasing performance requirements. In portable systems such as smart cards, not only is performance an important attribute, but also the power and energy consumed by a given application. It is mandatory to accomplish software power optimizations based on accurate power consumption models characterized for the processor. Both the optimization and the characterization are mostly carried out manually and are thus very time-consuming processes. This paper presents an environment for automated instruction set characterization based on physical power measurements. Further, an optimization system is presented that allows an automated reduction of power consumption based on a compiler optimization.
1 Introduction
The complexity and functionality of mobile devices is growing continuously, resulting in higher energy consumption. Mobile devices are often supplied by a battery; to increase battery lifetime, the devices should be optimized for low power. In particular, power peaks reduce battery lifetime due to the physical properties of batteries [1]. Mobile devices like smart cards are also often supplied by a radio frequency (RF) field, which provides a strictly limited amount of power. If the power consumed by such a device exceeds this limit, a reset can be triggered by the power control unit, or the chip may end up in an unpredictable state. Furthermore, the transmission from an RF system to a reader is often done via amplitude shift keying; power peaks, which result in an unwanted modulation of the field, can disturb this communication. Therefore the smart card has to be optimized for low power
This work was funded by the Austrian Federal Ministry for Transport, Innovation, and Technology under the FFG contract FFG 810124.
N. Azemard and L. Svensson (Eds.): PATMOS 2007, LNCS 4644, pp. 268–277, 2007. © Springer-Verlag Berlin Heidelberg 2007
Automated Instruction Set Characterization
269
Fig. 1. Automated power characterization and software optimization for embedded systems
with the constraint of avoiding peaks in power consumption. Mobile devices are often used to process and store confidential information. Simple power analysis (SPA) and differential power analysis (DPA) are attacks based on the analysis of the power consumption profile of a smart card [2]; reducing power peaks, and thus flattening the power consumption profile, can hinder these attacks. Different solutions have been proposed to address these problems by reducing power consumption at different system levels. As power peaks are mainly caused by particular instruction sequences, in this work we focus on the software level. We present a new concept where the optimization is done by the compiler. As depicted in Fig. 1, first the processor is characterized at the instruction level. An automated characterization framework has been implemented to facilitate fast operation. From this characterization an energy model is derived, which is used by a power simulator. The power simulator returns a cycle-accurate power profile of the executed program to the debugger. An analysis module abstracts this data and sends it back to the compiler. Based on this information and the energy model, the compiler is able to optimize the source code and to verify the optimization. The remainder of this paper is organized as follows: Section 2 surveys related work on software power optimization, Section 3 describes the characterization tool in detail, Section 4 describes the optimization system, results are presented in Section 5, and conclusions are summarized in Section 6.
2 Related Work
Tiwari et al. [3,4] outlined the importance of energy optimization at the software level in embedded systems already in the nineties. They presented different optimization techniques for reducing software energy consumption, all based on instruction-level power analysis. The underlying energy model
270
M. Grumer et al.
defines base costs (BC) to characterize a single instruction, while the circuit state overhead (CSO) describes circuit switching activity between two consecutive instructions. The cost factors are extracted by averaged measurements of the current drawn by the processor as it repeatedly executes short instruction sequences. To determine the BC of an instruction, a sequence repeating that same instruction is used; a loop with several pairs of two instructions, minus the BC of each instruction, gives the CSO cost of that pair. This requires n + n² measurement cycles, where n is the number of instructions. Several authors presented techniques to minimize the required measurements by grouping instructions depending on the functional unit they use [5,6]. However, up to 3 man-months are necessary to characterize a processor, depending on the processor architecture. The characterization methods are heavily dependent on the processor and require detailed knowledge of the architecture. Brandolese et al. propose a general methodology, independent of the specific processor [7]. The methodology obviates the architectural level and focuses on the functionalities involved in instruction execution. The proposed model is built on a priori knowledge of both the energy characterization of a set of instructions and the relevant functional characteristics of each instruction of the set; the model is thus a trade-off between accuracy and generalization. Based on such energy models, different optimization strategies at the software level have been implemented. Tiwari et al. propose different compilation techniques for low-energy software in [8]. They suggest the use of a code-generator-generator like IBURG [9], where the cost of each pattern is defined by the energy consumption. This technique, however, was not effective, as the authors observed that the energy-based and the cycle-based code generators produced very similar code.
In the same work the authors propose reordering instructions such that circuit switching activity is minimized; the instructions are scheduled depending on the circuit state overhead. On a 486DX2 architecture this technique only led to an energy reduction of 2%. Similarly, a novel list-scheduling algorithm for low-energy program execution is presented in [10]. The algorithm is based on a basic-block approach: first a dependency table is derived for each block, then registers are renamed to reduce output dependencies, and from the ready set of instructions, the instruction which minimizes the inter-instruction cost is selected to be scheduled. The results of this algorithm have shown a 4.54% decrease in total energy dissipation; coupled with other optimization strategies such as operand replacements, total energy savings of over 9% are achieved. While there are many projects dealing with energy reduction at the software level, references dealing with the reduction of power peaks are scarce.
3 Characterization Tool
The developed tool allows an automated characterization of a processor. The tool is built in a modular way to allow an easy adaptation to other processor architectures. Figure 2 depicts the overall structure of the characterization tool.
An instruction set description file contains an entry for each instruction with all the information necessary to create the test programs. The format of the file should be usable for most processors and is easily extendable. Based on the instruction set description file, the testbench generator generates assembler programs for each measurement cycle. To measure the base costs, a loop repeating the same instruction is generated from each entry in the description file; the loop length can be chosen freely. To measure the CSO costs, a loop alternating the two instructions is generated for each pair of instructions defined in the description file. As the tool works in a completely automated way, the CSO costs of every instruction pair can be determined. In a last step, the testbench generator produces different assembler files to characterize different data dependency models. The generator is connected via an interface directly to the assembler and linker. A bootloader loads the executable file directly onto the processor, and the measurement software takes control from that point on. The measured values are stored in a well-defined XML format and can be visualized from the tool. The following sections explain the measurement circuit in more detail.
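The testbench generation described above can be sketched as follows. The mnemonics, loop syntax, and loop lengths are hypothetical placeholders, not a real instruction set; the point is that one base-cost (BC) program is emitted per instruction and one circuit-state-overhead (CSO) program per ordered pair:

```python
def bc_loop(instr, length=100):
    """BC test program: one instruction repeated in an endless loop."""
    body = "\n".join(instr for _ in range(length))
    return f"loop_start:\n{body}\n    jmp loop_start\n"

def cso_loop(instr_a, instr_b, pairs=50):
    """CSO test program: two instructions alternating in an endless loop."""
    body = "\n".join(f"{instr_a}\n{instr_b}" for _ in range(pairs))
    return f"loop_start:\n{body}\n    jmp loop_start\n"

# Hypothetical instruction list; with n instructions this yields n BC
# programs plus one CSO program per ordered pair of distinct instructions.
instructions = ["add r1, r2", "mul r1, r2", "ld r1, [r2]"]
programs = [bc_loop(i) for i in instructions]
programs += [cso_loop(a, b) for a in instructions
             for b in instructions if a != b]
```

For n = 3 this produces 3 + 6 = 9 test programs; including same-instruction pairs would give the n + n² measurement cycles mentioned in Section 2.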
Fig. 2. Automatic characterization tool
3.1 Measurement Circuit
The accuracy of the instruction set model is limited by the errors introduced during the physical power measurement. Especially for mobile-device processors, where currents on the order of a few mA at several MHz have to be measured, dedicated power measurement hardware is necessary to meet the desired model accuracy. The novel cycle-accurate measurement method developed in the course of this project is based on current integration while minimizing the supply voltage fluctuation. In this clock-driven sampling, the consumed average power, given by P = IDD · VDD, is recorded for each processor cycle. Since the supply voltage VDD of a processor is constant, the power only depends on the average current IDD. An averaged measurement of the current can be done with the circuit shown in Figure 3: while one capacitor of the circuit is charged with an exact copy of IDD, the other one is discharged over the resistor R. At the end of each clock cycle, the voltage VC on the charged capacitor is sampled and the capacitors are switched. Assuming the delay time of the switches is much less than the clock
272
M. Grumer et al.
Fig. 3. Measurement circuit: clock driven sampling
period, the integration interval is given by 1/f_CLK and the average current can be calculated as I_DD = V_C · C · f_CLK. The switch control circuit delays the processor clock by at least one switch delay. This causes current integration to start just a tick before the processor clock rises. The reason for this short forward shift of the integration interval is that most of the energy in synchronous circuits is consumed during signal transitions at the beginning of each clock cycle. At the end of a clock period, the processor is in a stable state and consumes only leakage power. Since leakage currents are nearly constant throughout a cycle, the amount of leakage current measured is not influenced by moving the measurement window.
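The per-cycle current and power recovery from the sampled capacitor voltage follows directly from the formulas above; a minimal sketch, with illustrative component values:

```python
def average_current(v_c, capacitance, f_clk):
    """Per-cycle average supply current: I_DD = V_C * C * f_CLK."""
    return v_c * capacitance * f_clk

def average_power(v_c, capacitance, f_clk, v_dd):
    """Per-cycle average power: P = I_DD * V_DD."""
    return average_current(v_c, capacitance, f_clk) * v_dd

# A sampled 0.5 V on a 1 nF capacitor at 2 MHz corresponds to 1 mA
i_dd = average_current(0.5, 1e-9, 2e6)
```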
3.2 Energy Model
The energy model developed is a flexible and accurate combination of an instruction-level energy model and a data-dependent model. The total energy E_total consumed by a program is the sum over all clock cycles n_total of the energy consumption per cycle E_cycle:

$$E_{total} = \sum_{n=0}^{n_{total}} E_{cycle}(n). \qquad (1)$$
The energy per clock cycle can be further decomposed into four parts: instruction-dependent energy dissipation E_i, data-dependent energy dissipation E_d, energy dissipation of the cache system E_c, and finally the dissipation of all external components E_e, including the bus system, memories and peripherals:

$$E_{total} = \sum_{n=0}^{n_{total}} \left[ E_i(n) + E_d(n) + E_c(n) + E_e(n) \right]. \qquad (2)$$
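The per-cycle decomposition amounts to a straightforward accumulation over a simulated cycle trace; a minimal sketch with made-up per-cycle values:

```python
def total_energy(cycle_trace):
    """Total program energy: sum of instruction (Ei), data (Ed),
    cache (Ec) and external (Ee) contributions over all cycles."""
    return sum(c["Ei"] + c["Ed"] + c["Ec"] + c["Ee"] for c in cycle_trace)

# Two illustrative cycles (values in joules, not measured data)
trace = [
    {"Ei": 1.2e-9, "Ed": 0.3e-9, "Ec": 0.1e-9, "Ee": 0.4e-9},
    {"Ei": 1.0e-9, "Ed": 0.2e-9, "Ec": 0.0,    "Ee": 0.4e-9},
]
e_total = total_energy(trace)
```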
The instruction-dependent part of this model is based on the instruction-level energy model defined by Tiwari et al. [3].
Automated Instruction Set Characterization
A pipeline-aware model is used for the cycle-by-cycle estimation. The main idea of this approach is to distribute the measured costs among all pipeline stages. All other inter-instruction costs (IIC), caused by inter-instruction locks, translation look-aside buffer misses or data/instruction cache misses, can thus be modeled by pipeline stalls. Each instruction is responsible for the calculation of its energy consumption for the pipeline stages. A microarchitectural energy model shared between all instructions is used to consider data-dependent switching activity of busses and functional units. Bit patterns and specific properties of an instruction are calculated with its own instruction energy model. Our results in [11] show an accuracy of more than 95% for all benchmarks.
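The pipeline-aware accounting can be sketched as follows, assuming a classic five-stage pipeline; the paper does not specify the stage split, so the per-stage cost shares and the stall cost below are placeholders:

```python
STAGES = ["IF", "ID", "EX", "MEM", "WB"]  # assumed 5-stage pipeline

def cycle_energy(pipeline, stage_costs, stall_cost=0.1e-9):
    """Energy of one cycle: each stage contributes the share of the
    instruction currently occupying it; a stalled stage (None) adds
    a fixed stall cost instead."""
    total = 0.0
    for stage, instr in zip(STAGES, pipeline):
        total += stall_cost if instr is None else stage_costs[instr][stage]
    return total

# Illustrative per-stage cost shares for one instruction (joules)
costs = {"add": {"IF": 0.2e-9, "ID": 0.1e-9, "EX": 0.4e-9,
                 "MEM": 0.1e-9, "WB": 0.1e-9}}
e = cycle_energy(["add", None, "add", None, "add"], costs)
```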
4 Optimization System
The whole optimization system is depicted in Fig. 4. The source code of an application is compiled to target code. As compiler, the GNU Compiler Collection (GCC) is used. The target architecture is a MIPS32 4KSc processor. The target code is then executed via a debugger on a cycle-accurate instruction set simulator. The simulator is directly attached to the energy analysis unit which, based on the energy model, calculates the energy consumption per instruction.
Fig. 4. Software energy optimization system overview
Using symbolic information, the energy analysis unit abstracts these values and calculates the energy consumption at the intermediate language level. The resulting reports and the energy model are then used by the optimization unit to perform optimizations at the intermediate language level.
4.1 GCC RTL Representation
The intermediate representation under consideration in this work is the Register Transfer Language (RTL) representation. In this language, the instructions to be output are described in an algebraic form specifying what each instruction does. GCC implements many passes for transforming and optimizing the RTL representation. In a final pass, assembler code is generated from the RTL representation. This generation is performed based on the machine description of the target processor. The machine description contains a pattern for each instruction that the target machine supports. These patterns are first used to generate an RTL list based on named instruction patterns. In this generation process, a pattern code and a unique id number are assigned to each RTL expression. After performing all the optimization passes on the RTL list, the list is matched against the RTL templates to produce assembler code. This work concentrates on the instruction scheduling pass. Originally, this pass looks for instructions whose output will not be available by the time it is used in subsequent instructions. It reorders instructions within a basic block to try to separate the definition and use of items that would otherwise cause pipeline stalls. This pass is used in this work to implement the power optimization scheme. For further details on GCC see [12].
4.2 Optimization Algorithms
The system performs two optimizations at the intermediate language level. Both optimizations are implemented in the instruction scheduling pass of the compiler and are described in the following two sections.

Instruction Packing. As there are no CSO costs when two equal instructions are executed successively, this optimization simply tries to group equal instructions. For this purpose, the algorithm searches the ready list for expressions having the same pattern code as the last expression scheduled. If there is no expression with the same pattern code, normal scheduling is continued.

Instruction Scheduling by CSO Costs. A new greedy algorithm has been implemented, which schedules the RTL expressions depending on their CSO costs. This algorithm substitutes the original selection algorithm in the Haifa scheduler. The algorithm requires double compilation of an application. In the first compilation cycle, the algorithm matches the unique id generated for each RTL expression against the CSO cost. This is necessary because the mapping of RTL expressions to instructions is ambiguous. In the second compilation cycle, for every RTL expression scheduled, the algorithm searches for the RTL expression with the lowest CSO costs in the ready list. The algorithm schedules this RTL statement, updates the ready list and identifies the next RTL statement to be scheduled. This algorithm doubles the compilation time. However, applications for embedded systems tend to be relatively small, so the compile time remains acceptable.
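The greedy selection can be sketched as below. This is a simplified illustration that ignores data dependencies (which the real Haifa ready list already enforces); the instruction names and CSO costs are made up. Since CSO costs between equal instructions are zero, instruction packing emerges as a special case:

```python
def schedule_by_cso(ready, cso_cost, first):
    """Greedy order: always pick the ready instruction with the lowest
    change-of-state (CSO) cost after the previously scheduled one."""
    ready = list(ready)
    order = [first]
    ready.remove(first)
    while ready:
        nxt = min(ready, key=lambda ins: cso_cost.get((order[-1], ins), 0.0))
        order.append(nxt)
        ready.remove(nxt)
    return order

# Illustrative CSO cost table; CSO(x, x) = 0 for equal instructions
cost = {("add", "add"): 0.0, ("add", "mul"): 3.0, ("mul", "add"): 3.0,
        ("add", "lw"): 1.0, ("lw", "mul"): 2.0, ("lw", "add"): 1.0}
order = schedule_by_cso(["mul", "add", "lw"], cost, first="add")
```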
5 Evaluation and Results
In this section, the results concerning the measurement circuit and the optimization system are presented.
5.1 Measurement Circuit
The experimental diagram in Fig. 5 presents the performance characteristics of the clock-driven sampling discussed in Section 3.1. The diagram represents typical cases from the multiple measurement tests that have been performed repeatedly in the lab. To determine the measurement accuracy, all measurement results have been compared to a constant current at the input of the current mirror. This input current was drawn through shunt resistors with a tolerance of less than 0.05%. The voltage drop across these resistors was measured to calculate the magnitude of the input current.
Fig. 5. Experimental results for the clock-driven sampling at 2 MHz with two different capacitances
The experimental results have shown a strong dependency between operation range, measurement accuracy and hardware configuration. If the input current is too high compared to the chosen capacitors, the capacitors become fully charged before the clock cycle ends. On the other hand, too large capacitors cause only a small voltage drop, which leads to errors in the analog-to-digital conversion.
5.2 Optimization System
For a first evaluation of the optimization system, six evaluation programs, consisting mainly of algebraic and array functions, were compiled. Table 1 shows the changes in mean value, total energy consumption and standard deviation when applying the instruction packing algorithm. While the mean value and the total energy consumption are only reduced by up to 1%, the standard deviation was reduced by up to 7%. It can be deduced that instruction packing does not reduce the energy consumption, but it produces a significantly smoother power profile. While the reduction of the mean value corresponds in all programs to the reduction of the total energy consumption,
Table 1. Changes of the mean value, total energy consumption and standard deviation when applying the instruction packing algorithm

Program   Mean value   Total energy   Standard deviation
          Gain [%]     Gain [%]       Gain [%]
C1        0.96          0.96          4.7
C2        0.83          0.83          6.4
C3        0.17          0.17          5.9
C4        0.63         -2.57          7.2
C5        0.32          0.32          5.6
C6        0.92          0.92          5.1
this is not the case in program "C4". The reason is that reordering the instructions increased the execution time, probably due to additional pipeline stalls, which also increases the total energy consumption. Applying the instruction scheduling by CSO costs to the same programs delivers almost the same results. A closer look at the produced assembler files shows that both optimization algorithms produce nearly the same code. Obviously, if there is the possibility to schedule the same instruction as the last scheduled one, scheduling by CSO costs behaves like instruction packing, because there are no CSO costs between equal instructions. All six programs are quite small and each of them fits into the cache, so no cache miss can occur. When concatenating the six programs and compiling them into one executable, cache misses do occur. The results of this evaluation show only a reduction of the standard deviation of 3.4% for instruction packing and 1.9% for instruction scheduling by CSO costs. The high power consumption of memory accesses could be an explanation for this.
6 Conclusion
The minimization of power peaks in the power profile of mobile devices is an important aspect for better utilization of the energy sources, system stability and system security. In this paper we presented a new approach to flatten the power profile of an application executed on an embedded processor. The results have shown that optimizations at the software level can produce a flatter power profile. Furthermore, such optimizations can be done automatically at the compiler level. The automated power characterization of a processor significantly reduces the development effort for these optimizations.
References

1. Haid, J., Kargl, W., Leutgeb, T., Scheiblhofer, D.: Power Management for RF-Powered vs. Battery-Powered Devices. In: Proceedings of the Workshop on Wearable and Pervasive Computing, Graz, Austria (2005)
2. Rothbart, K., Neffe, U., Steger, C., Weiss, R., Rieger, E., Muehlberger, A.: Power consumption profile analysis for security attack simulation in smart cards at high abstraction level. In: EMSOFT '05: Proceedings of the 5th ACM International Conference on Embedded Software, Jersey City, NJ, USA, pp. 214–217. ACM Press, New York (2005)
3. Tiwari, V., Malik, S., Wolfe, A.: Power analysis of embedded software: a first step towards software power minimization. In: ICCAD '94: Proceedings of the 1994 IEEE/ACM International Conference on Computer-Aided Design. IEEE Computer Society Press, Los Alamitos (1994)
4. Tiwari, V., Malik, S., Wolfe, A., Lee, M.T.C.: Instruction level power analysis and optimization of software. J. VLSI Signal Process. Syst. 13(2-3), 223–238 (1996)
5. Nikolaidis, S., Laopoulos, T.: Instruction-level power consumption estimation of embedded processors for low-power applications. vol. 24, pp. 133–137. Elsevier Science Publishers B.V., Amsterdam, The Netherlands (2002)
6. Laopoulos, T., Neofotisots, P., Kosmatopulos, C., Nikolaidis, S.: Measurement of current variations for the estimation of software-related power consumption. IEEE Transactions on Instrumentation and Measurement (2003)
7. Brandolese, C., Fornaciari, W., Salice, F., Sciuto, D.: An instruction-level functionally-based energy estimation model for 32-bit microprocessors. In: DAC '00: Proceedings of the 37th Conference on Design Automation, Los Angeles, California, USA, pp. 346–351. ACM Press, New York (2000)
8. Tiwari, V., Malik, S., Wolfe, A.: Compilation techniques for low energy: an overview. In: IEEE Symposium on Low Power Electronics, San Diego, California, USA, pp. 38–39. IEEE Computer Society Press, Los Alamitos (1994)
9. Fraser, C.W., Hanson, D.R., Proebsting, T.A.: Engineering a simple, efficient code-generator generator. ACM Lett. Program. Lang. Syst. 1(3), 213–226 (1992)
10. Sinevriotis, G., Stouraitis, T.: A novel list-scheduling algorithm for low energy program execution. In: IEEE International Symposium on Circuits and Systems, pp. 97–100. IEEE Computer Society Press, Los Alamitos (2002)
11. Neffe, U., Rothbart, K., Steger, C., Weiss, R., Rieger, E., Muehlberger, A.: A Flexible and Accurate Model of an Instruction-Set Simulator for Secure Smart Card Software Design. In: Macii, E., Paliouras, V., Koufopavlou, O. (eds.) PATMOS 2004. LNCS, vol. 3254, pp. 491–500. Springer, Heidelberg (2004)
12. Stallman, R.M.: GNU Compiler Collection Internals (GCC). GCC Developer Community (2005), http://gcc.gnu.org
RTL Power Modeling and Estimation of Sleep Transistor Based Power Gating

Sven Rosinger, Domenik Helms, and Wolfgang Nebel

OFFIS Research Institute / University of Oldenburg, D-26121 Oldenburg, Germany
{rosinger|helms|nebel}@offis.de
Abstract. We present an accurate RT-level estimation methodology describing the power consumption of a component under power gating. By developing separate models for the on- and off-state and the transition cost between them, we can limit errors to below 10% compared to SPICE. The models support several implementation styles of power gating, such as NMOS/PMOS or Super-Cutoff, and additionally allow the sleep transistors to be sized more accurately. We show how the models can be integrated into a high-level power estimation framework supporting design space exploration for several design-for-leakage methodologies.
1 Introduction
In today's digital integrated circuits, leakage currents are responsible for a dominating part of the total energy consumption. Especially in low-utilization parts of the circuit, energy is wasted as a result of leakage currents. Regarding leakage minimization, there are three dominating management strategies for converting idle time of unused RT-components into leakage savings: power gating, adaptive body biasing and the use of a minimal leakage vector. In designs with small utilization ratios and long idle periods of RT-components, power gating has the highest potential for saving energy because it leaves the least remaining leakage current in the off-state. In contrast to the implementation of power gating on lower design levels, the entry level of development is shifting from traditional RTL specifications in VHDL or Verilog to higher levels of abstraction. To evaluate different design alternatives early, power estimation considering power gating is thus becoming more and more important. This power estimation has to be completely automated because the designer does not want to deal with design decisions on transistor level, for example the dimensioning of sleep transistors, which implies a tradeoff between faster computation and less leakage. To deal with this conflict, RTL models for power gated components are needed, and they must be able to size the sleep transistor
This work was supported by the European Commission within the Sixth Framework Programme through the MAP2 project (contract no. FP6-2004-SME-COOP MAP2031984).
N. Azemard and L. Svensson (Eds.): PATMOS 2007, LNCS 4644, pp. 278–287, 2007. c Springer-Verlag Berlin Heidelberg 2007
while maintaining the required performance constraints. These models also have to consider the parameters of the gating circuitry during estimation. In our work we present a set of models allowing a holistic, cycle-accurate and fast estimation of power gated RT-components. Combined with only a small amount of characterization data, this modeling approach forms a cornerstone of our high-level leakage optimization framework. The remainder of this work is organized as follows. In Section 2, previous work on modeling is summarized and existing inaccuracies are identified. Section 3 presents a modeling methodology for the additional power costs induced by the gating circuitry, followed by our modeling approach that distinguishes between the on-state, the off-state and the switch-over. The estimation framework in which we use our models is presented in Section 4. The evaluation results of the models and the estimation flow are presented in Sections 5 and 6, followed by a conclusion.
Fig. 1. Power gated RT-component using a PMOS sleep transistor and a buffer chain driving the sleep transistor
2 Related Work
The idea of using sleep transistors for leakage reduction is well known. Previous work [1, 2, 3] only considers the implementation on transistor level and treats the possible leakage savings in an abstract way. In [4] the authors describe the first system-level trade-off analysis of sleep-transistor-based power gating techniques. They detail most of the costs in area, performance and power caused by the gating circuitry shown in Figure 1, but they do not provide accurate models that can be used for power estimation. In particular, temperature dependency and support for different gating types (e.g., SCCMOS techniques) are not available. Second, they do not consider the buffer chain needed to drive the sleep transistor. One major inaccuracy in all previous work is the approximation of the sleep transistor as a fixed resistance R_sleep. The resulting voltage drop across the sleep transistor is then given by V_sleep(t) = R_sleep · I(t), where I(t) is the current through the sleep transistor. This assumption can be traced back to [5] from 1997 and is still used in recent work on sizing the sleep transistor [6]. Simple simulations using SPICE indicate that this simplification leads to errors of 20%
in comparison to the real voltage drop, for transistor technologies of 90nm and below. As a consequence, the estimation becomes inaccurate and, even worse, the sizing of the sleep transistor is based on a faulty assumption. Regarding the dynamic power of transitions from the off-state to the on-state, previous modeling approaches are based on the assumption that, on average, half of the capacitances within the RT-component are charged during startup [4], [7]. In these papers, incomplete and spurious transitions during startup are neglected. SPICE simulations show that the resulting error reaches 50% for a 4-bit adder and even more than 100% for an 8-bit multiplier. For the reasons mentioned above, a holistic model of sleep-transistor-based power-gated RT-components that can be used in automated EDA tools for estimation and optimization is necessary; it is presented in the remainder of this work.
3 Modeling of Power Gating
As mentioned, power gating is not free of additional costs. Because of the different characteristics in each power state, the modeling is divided into modeling the on-state with a conducting sleep transistor, modeling the off-state with a blocking sleep transistor, and modeling the transition of the sleep transistor from off to on.
3.1 Modeling the On-State
During the on-state, the RT-component consumes dynamic power due to activity at its inputs. A lot of previous work deals with modeling this energy considering data dependency, component bitwidth and supply voltage. The latter is reduced in this case to a virtual supply voltage due to the voltage drop across the sleep transistor. To get precise and cycle-accurate estimation results regarding power gating, the average voltage drop has to be determined first; the resulting virtual supply voltage can then be used for estimation with existing models. The problem in using the virtual supply voltage in estimation is that it depends on the actual current through the circuit, while the current flow in turn depends on the virtual supply voltage. To avoid a complete re-characterization of each RT-component with the sleep transistor size as a parameter, we build a voltage drop model separately and combine both models to estimate the energy. Important parameters of this model are transistor size, temperature, current flow and supply voltage. Because, for large transistors, doubling the transistor width exactly halves the voltage drop over the whole range of all other variables, this parameter is separated from the others. For the remaining parameters, a three-dimensional measuring field is created using SPICE simulations. To reduce the amount of characterization data, we use the Levenberg-Marquardt fitting algorithm presented in [12]. We tried to fit the influence of the current on the voltage drop to a wide range of polynomial, exponential, logarithmic and root functions. We found that the dependency is best described by a polynomial of fourth order:

$$V_{sleep}(I) = \alpha_4 I^4 + \alpha_3 I^3 + \alpha_2 I^2 + \alpha_1 I \qquad (1)$$
The measuring field is then simplified to a two-dimensional field storing a set of the four parameters $(\alpha_1, \alpha_2, \alpha_3, \alpha_4)$ for each combination of temperature and supply voltage. For a given temperature $\hat{t}$, supply voltage $\hat{v}$, current $\hat{i}$ and transistor width $\hat{w}$, our model first interpolates between the four adjacent polynomials using the bilinear interpolation shown in equation 2 and then uses the resulting polynomial to compute the voltage drop $V_{sleep}^{(\hat{t},\hat{v})}(\hat{i})$ under the given current. As a final step, the model result is scaled according to $\hat{w}$.

$$V_{sleep}^{(\hat{t},\hat{v})}(I) = \alpha_4^{t,v} I^4 + \alpha_3^{t,v} I^3 + \alpha_2^{t,v} I^2 + \alpha_1^{t,v} I$$

$$\alpha_n^{t,v} = (1-\delta t)(1-\delta v)\,\alpha_n^{(i,j)} + \delta t(1-\delta v)\,\alpha_n^{(i+1,j)} + (1-\delta t)\,\delta v\,\alpha_n^{(i,j+1)} + \delta t\,\delta v\,\alpha_n^{(i+1,j+1)} \qquad (2)$$

with $t_i \le \hat{t} \le t_{i+1}$, $v_j \le \hat{v} \le v_{j+1}$, $\delta t = \hat{t} - t_i$, $\delta v = \hat{v} - v_j$.
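The coefficient lookup can be sketched as below; the grid axes and coefficient values are placeholders, not characterized data, and δt/δv are normalized to the grid spacing so the interpolation also works for non-unit grids:

```python
import numpy as np

temps = np.array([25.0, 75.0, 125.0])   # temperature grid (placeholder)
vdds = np.array([0.6, 1.0, 1.4])        # supply voltage grid (placeholder)
# coeffs[i, j] = (a4, a3, a2, a1) at (temps[i], vdds[j]), for reference width w0
coeffs = np.ones((3, 3, 1)) * np.array([0.0, 0.0, 5.0, 100.0])

def v_sleep(t, v, current, width, w0=1000e-9):
    """Voltage drop: bilinearly interpolate the four coefficients,
    evaluate the 4th-order polynomial, scale inversely with width."""
    i = int(np.clip(np.searchsorted(temps, t) - 1, 0, len(temps) - 2))
    j = int(np.clip(np.searchsorted(vdds, v) - 1, 0, len(vdds) - 2))
    dt = (t - temps[i]) / (temps[i + 1] - temps[i])
    dv = (v - vdds[j]) / (vdds[j + 1] - vdds[j])
    a = ((1 - dt) * (1 - dv) * coeffs[i, j] + dt * (1 - dv) * coeffs[i + 1, j]
         + (1 - dt) * dv * coeffs[i, j + 1] + dt * dv * coeffs[i + 1, j + 1])
    drop = a[0] * current**4 + a[1] * current**3 + a[2] * current**2 + a[3] * current
    return drop * (w0 / width)  # doubling the width halves the drop
```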
To handle the interdependency between current flow and voltage drop during estimation, a simple calculation of the working point is not possible, because the models are not available in closed analytical form. Thus an iteration-based approach is used, illustrated in Figures 2 and 3. Figure 2 shows the voltage drop-current characteristics of the sleep transistor and the RT-component, each on its own. Assuming a voltage drop of V_sleep = 0, the average current through the sleep transistor is 0 and the current through the RT-component is largest. Assuming a voltage drop of V_sleep = V_DD, it is the other way round. Starting with V_DD as supply voltage (meaning V_sleep = 0) we get the current I_MAX,RT. Using this current as input to the voltage drop model, we get a first rough approximation of the voltage drop. By iterating this procedure we converge to the fixed working point defined by the intersection of the two graphs, because in the real system the current through the RT-component must equal the current through the sleep transistor. This voltage drop can be used for the virtual supply voltage computation and thus for estimation. As can be seen in Figure 3, the iteration almost reaches the steady state after five iteration steps.
Fig. 2. Voltage drop-current characteristics of sleep tr. and RT-comp
Fig. 3. Resulting cycle-averaged voltage drop in each iteration step
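The fixed-point iteration described above can be sketched with two placeholder characteristics (a component current that grows with the virtual supply voltage and a transistor drop that grows with the current):

```python
def find_working_point(i_component, v_drop_model, v_dd, steps=5):
    """Iterate I -> V_sleep -> I ... starting from V_sleep = 0."""
    v_sleep = 0.0
    for _ in range(steps):
        i = i_component(v_dd - v_sleep)  # current at the virtual VDD
        v_sleep = v_drop_model(i)        # drop caused by that current
    return v_sleep

# Illustrative linear characteristics, not characterized models:
v = find_working_point(lambda vdd: 2e-3 * vdd, lambda i: 50.0 * i, v_dd=1.2)
```

With these linear placeholders the exact intersection lies at V_sleep = 0.12/1.1 V, and five steps already land within a few microvolts of it.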
To guarantee that the iteration converges to a steady state, I_MAX,RT has to be smaller than I_MAX,sleep, which is always the case if the sleep transistor is sized meaningfully. To estimate leakage and performance of RT-components in the on-state, we use the leakage and delay models presented in [11] and [10]. Additional power costs that have to be modeled occur in terms of gate leakage of the sleep transistor and leakage of the buffer chain. Our sleep transistor leakage model is based on a single-transistor characterization using SPICE, considering temperature, supply voltage and the voltage drop across the transistor as degrees of freedom. Due to the fact that the leakage current depends linearly on the transistor width [11], the characterization can be done using one fixed width and scaling the model result accordingly. The resulting measuring field is compressed using the same procedure mentioned above. To estimate the leakage of a complete buffer chain, we sum up the PMOS and NMOS widths of all transistors within the buffer chain and use the leakage model presented in [11].
3.2 Modeling the Off-State
In the off-state there are two aspects that have to be modeled: the remaining leakage of the RT-component and the leakage of the buffer chain. The first is limited to the subthreshold and gate leakage current through the sleep transistor. It is important to model these leakage currents because they counteract the savings compared to the leakage in the on-state. The most important parameters influencing the remaining leakage are the temperature, supply voltage, voltage drop and sleep transistor width. To support super-cutoff techniques [8], which use gate voltages greater than V_DD for PMOS sleep transistors (and lower than GND in the NMOS case), we also consider the gate voltage of the sleep transistor. In our approach, we assume a maximal voltage drop V_sleep = V_DD across the sleep transistor, leading to the worst-case remaining leakage. This assumption induces an error in terms of an overestimation of up to 15% but simplifies the overall model. Analogous to the previously mentioned leakage models, the linear impact of the transistor width is exploited again. To express the influence of the remaining parameters on the leakage currents, we created a measuring field based on SPICE simulations. Because of the nearly exponential correlation, we took the logarithm of the measuring field and used a linear regression to reduce the amount of characterization data. The leakage of the buffer chain in the off-state is modeled identically to the on-state.
3.3 Modeling the Switch-Over
The main overhead in power, dominating the break-even time after which a switched-off component starts saving energy, is the dynamic power during the switch-over. It can be divided into the energy consumed by the RT-component, the sleep transistor and the buffer chain.
As indicated in Section 2, dynamic energy estimation approaches based on the charged capacitances cause unacceptable errors, because incomplete and spurious transitions are not considered. Attempts at scaling these models depending on the logical gate depth of the circuit failed. For this reason we created switch-over models for all components in our library. We simulated each circuit starting with a falling edge at the PMOS sleep transistor input (in case of NMOS gating, a rising edge) until the component reached a steady state. During this time we gathered the currents through the sleep transistor and the virtual supply voltages and integrated the product over time to get an energy value. This was done for each component at different bitwidths, temperatures and supply voltages, and the resulting model interpolates between these values. After careful analysis, we neglect the input vector of the component, because the induced error is at most about 13% for an 8-bit adder and diminishes with rising component size. If a component is switched off, the sleep transistor prevents short-circuit currents, so the component does not consume any energy. Regarding the dynamic energy consumed by the sleep transistor and the buffer chain, we used the same parameters mentioned in Section 3.1 for leakage modeling. The characterization process was done in nearly the same way as mentioned above. In this case, the only difference is that characterizations were done for both rising and falling input edges to handle the different short-circuit currents separately.
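Integrating the product of sampled current and virtual supply voltage, as done during characterization, can be sketched with a simple trapezoidal sum (the waveform samples below are illustrative, not SPICE output):

```python
def switchover_energy(t, i_sleep, v_virt):
    """E = integral of I(t) * V(t) dt, trapezoidal rule over samples."""
    energy = 0.0
    for k in range(1, len(t)):
        p0 = i_sleep[k - 1] * v_virt[k - 1]
        p1 = i_sleep[k] * v_virt[k]
        energy += 0.5 * (p0 + p1) * (t[k] - t[k - 1])
    return energy

# A constant 1 mA at 1 V for 10 ns corresponds to 10 pJ
e = switchover_energy([0.0, 5e-9, 10e-9], [1e-3] * 3, [1.0] * 3)
```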
4 Estimation Framework
The modeling methodology described above was developed to be used inside our algorithmic-level power estimation framework. Figure 4 shows the estimation framework and how the methodology is integrated. We use SystemC as input specification language, which is translated into a control data flow graph (CDFG). This CDFG is scheduled, and afterwards the binding is realized using a relaxation-based heuristic [9] that tries to build clusters of operations with small idle times between the operations and bind them to the same resource, in order to maximize idle time elsewhere. Information on allocation and bitwidths of the components is used to initialize the models. During this step the sleep transistor is sized first, using the voltage drop model described in Section 3.1, the user-given maximal delay increase (by upscaling the real V_DD, the performance penalty can also be 0) and the power and delay models [10]. Afterwards the buffer chain is dimensioned depending on the sleep transistor size. The third step is the wake-up time computation under consideration of the SLEEP signal propagation delay. At last, the break-even time is computed as the ratio of the additional energy overhead (Section 3.3) to the leakage power savings (Sections 3.1 and 3.2). The estimation process then uses activity data gathered from an execution of the specification to feed the models with realistic data patterns. The fixed schedule is used to decide whether to switch the RT-component off, if the idle time exceeds the break-even time provided by the models, or to keep the state as
Fig. 4. Flow used to estimate designs using power gating
it stands. In the first case, the wake-up time provided by the models is used to turn the components back on in due time.
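The break-even decision can be sketched as follows; all energy and power numbers are illustrative placeholders:

```python
def break_even_time(e_overhead, p_leak_on, p_leak_off):
    """Idle duration after which gating starts saving energy:
    switch-over overhead divided by the leakage power saved while off."""
    return e_overhead / (p_leak_on - p_leak_off)

def worth_gating(idle_time, e_overhead, p_leak_on, p_leak_off, t_wakeup):
    """Gate only if the idle phase covers break-even plus wake-up."""
    return idle_time > break_even_time(e_overhead, p_leak_on, p_leak_off) + t_wakeup

# 20 pJ overhead against 4 uW of saved leakage: 5 us break-even
t_be = break_even_time(20e-12, 5e-6, 1e-6)
```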
5 Model Evaluation
The evaluation was done by comparing the model results to results obtained from SPICE simulations using the BSIM4.50 transistor model. Except for the RT-component switch-over energy, the modeling, and thus the evaluation, could be done independently of a specific RT-component. In all our models we used 27°C to 127°C as temperature range and 0.6 V to 1.5 V as voltage range. We modeled the offset voltage of the sleep transistor, used in SCCMOS techniques to enforce cutoff, in the range 0 V to ±0.3 V relative to the supply voltage or ground. Additionally, all modeling was done for different gating techniques, including Cutoff-CMOS and Super-Cutoff-CMOS techniques implemented using PMOS or NMOS sleep transistors. In Table 1 the evaluation results are presented for 45nm and 65nm technology. The evaluation was done using an isolated sleep transistor with a width of 1000 nm, setting or measuring voltages and currents at all transistor connections. The results show average relative errors and standard deviations of less than 2% for all models. Maximum errors of up to 17% were observed at the corner cases of the parameter domains. The leakage model we use to estimate the leakage of RT-components and the buffer chain is evaluated in detail in [11]. Table 2 shows evaluation results of the energy model presented in Section 3.3 for different RT-components. As the characterization process was done using
Table 1. Errors [%] of the voltage drop, leakage and switch-over-energy models of the sleep transistor and buffer chain

Tech  Gating        Stat   Voltage  Sleep tr.  Sleep tr.  Sleep tr.
                           drop     leak. on   leak. off  sw.-over
45nm  PMOS CCMOS    max     0.60     1.93      16.10       3.89
                    ∅       0.05     0.41       0.21       0.35
                    std     0.09     0.36       1.61       0.96
      NMOS CCMOS    max    11.50     0.92       5.19       3.27
                    ∅       0.16     0.24       0.31       0.41
                    std     0.58     0.22       1.07       0.78
      PMOS SCCMOS   max     0.34     2.13      11.30       4.46
                    ∅       0.05     0.35       0.41       0.44
                    std     0.06     0.32       1.94       1.28
      NMOS SCCMOS   max    14.10     0.94       3.18       3.30
                    ∅       0.21     0.22       0.19       0.42
                    std     0.62     0.22       0.55       0.88
65nm  PMOS CCMOS    max     1.50     2.14      17.10       4.17
                    ∅       0.29     0.48       0.26       0.41
                    std     0.18     0.42       1.92       1.04
      NMOS CCMOS    max    13.10     0.97       5.01       3.46
                    ∅       0.14     0.27       0.33       0.39
                    std     0.60     0.23       0.13       0.77
      PMOS SCCMOS   max     1.46     2.51      12.60       4.55
                    ∅       0.32     0.42       0.38       0.45
                    std     0.24     0.38       1.99       1.34
      NMOS SCCMOS   max    11.00     1.02       3.58       3.42
                    ∅       0.19     0.26       0.21       0.43
                    std     0.49     0.23       0.62       0.91

The buffer chain switch-over errors are independent of the gating technique: 45nm: max 4.84, ∅ 1.56, std 0.60; 65nm: max 4.10, ∅ 1.43, std 0.54.

Table 2. Errors [%] of the switch-over-energy model of different RT-components in 45nm and 65nm technologies

RT-comp.    45nm max       ∅       std     65nm max       ∅       std
AddCla     −4.60/+3.70   −0.44    0.45    −2.99/+4.84    1.00    0.45
AddRpl    −11.9/+16.4     4.84    1.79    −9.46/+10.2    4.86    1.45
DecRpl     −1.12/+1.65   −0.04    0.14    −0.86/+1.12   −0.13    0.11
IncRpl     −1.56/+5.36    2.70    0.66    −2.96/+2.25    0.47    0.23
MultCsa   −32.7/+17.0     8.14    2.24    −6.41/+4.02    1.61    0.48
MultWall   −3.51/+5.08    1.40    0.52    −4.03/+4.58    1.96    0.53
Mux21      −6.69/+3.71    1.19    0.37    −2.39/+1.96    0.07    0.18
Mux31      −1.91/+5.60    0.68    0.34    −0.91/+1.90    0.20    0.14
SubCla     −1.85/+1.70   <0.01    0.21    −2.16/+2.04   −0.17    0.32
SubRpl     −2.00/+2.72    0.28    0.24    −2.11/+3.72    0.50    0.35
S. Rosinger, D. Helms, and W. Nebel
4- and 32-bit components while we use 16-bit components in the evaluation, the presented errors also reflect the bitwidth interpolation of the models. In comparison to the errors of other approaches mentioned in Section 2, our models allow a significantly more accurate estimation. To verify the modeling process we also modeled and evaluated 90nm and 130nm technologies with all gating techniques mentioned above and obtained very similar results.
6 Estimation Flow Evaluation
Using the estimation flow presented in Section 4, we estimated the functional units of different input algorithms. Figure 5 shows the estimation results for a multiplier, adder, incrementer and decrementer used in a JPEG algorithm. For each component, the adjacent columns compare the cell- (yellow), net- (green) and leakage-energy (purple) between the PMOS power-gated (left) and ungated unit (right). The fourth energy type (blue) in the left columns indicates the energy overhead mentioned in Section 3.3. Leakage of the sleep transistor and buffer chain is included in the leakage segment. In this benchmark, the utilization ratio of the functional units is low, and the break-even time is passed often. Thus, the units idle most of the time and power gating is very effective. At best, the power of the decrementer can be reduced by a factor of 29; in the worst case, the power of the busy adder can still be reduced by a factor of 2.1.
Fig. 5. JPEG algorithm power gating estimation results of functional units
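The break-even reasoning above can be made concrete with a small sketch (Python; every number below is hypothetical and not taken from the paper's benchmark). Gating only pays off when the leakage energy saved during an idle interval exceeds the switch-over energy spent toggling the sleep transistor and driving the buffer chain.

```python
# Illustrative break-even check for power gating (all figures invented).
# Gating a unit saves idle leakage power but costs a fixed switch-over
# energy each time the sleep transistor is toggled; gating pays off only
# when the idle interval exceeds the break-even time.

def break_even_time(p_leak_on, p_leak_off, e_switch_over):
    """Idle time above which power gating saves energy.

    p_leak_on     -- leakage power of the ungated unit (W)
    p_leak_off    -- residual leakage with the sleep transistor off (W)
    e_switch_over -- energy to switch the unit off and back on (J)
    """
    return e_switch_over / (p_leak_on - p_leak_off)

def gating_saves(idle_time, p_leak_on, p_leak_off, e_switch_over):
    return idle_time > break_even_time(p_leak_on, p_leak_off, e_switch_over)

# Hypothetical figures: 10 uW idle leakage, 0.5 uW residual leakage,
# 50 nJ switch-over energy -> break-even time around 5.3 ms.
print(gating_saves(10e-3, 10e-6, 0.5e-6, 50e-9))  # long idle phase: worth gating
print(gating_saves(1e-3, 10e-6, 0.5e-6, 50e-9))   # short idle phase: not worth it
```

With these assumed figures a 10 ms idle phase is worth gating while a 1 ms phase is not, which is exactly the utilization effect visible in the benchmark.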
7 Conclusion
We presented models for all power states and the additional hardware needed for power gating. The accuracy of these models is more than sufficient for the
purpose of high-level design-space exploration. By integrating the models into our synthesis framework, we can trade off the savings per component and per design. We will focus our research on describing the additional hardware overhead for generating the power-off signals, and on improving the synthesis (especially the scheduling and binding) to increase the gain of power gating. Additionally, we will try to consider and reduce the effects of process variability, which are already described in our leakage and delay models [10,11], by optimizing power gating not for the average but for the worst case.
References
1. Benini, L., De Micheli, G.: Dynamic Power Management: Design Techniques and CAD Tools. Kluwer Academic Publishers, Dordrecht (1998)
2. Pedram, M., Rabaey, J.: Power Aware Design Methodologies. Kluwer Academic Publishers, Dordrecht (2002)
3. Min, K.-S., Sakurai, T.: Zigzag Super Cut-off CMOS (ZSCCMOS) Scheme with Self-Saturated Virtual Power Lines for Subthreshold-Leakage-Suppressed Sub-1V-VDD LSI's. In: Proceedings of the 28th European Solid-State Circuits Conference, pp. 679–682 (2002)
4. Jiang, H., Marek-Sadowska, M., Nassif, S.R.: Benefits and Costs of Power-Gating Technique. In: Proceedings of the 2005 International Conference on Computer Design, pp. 559–566 (2005)
5. Kao, J., Chandrakasan, A., Antoniadis, D.: Transistor sizing issues and tool for multi-threshold CMOS technology. In: Proceedings of the 34th Design Automation Conference, pp. 409–414 (1997)
6. Ramalingam, A., Zhang, B., Devgan, A., Pan, D.Z.: Sleep transistor sizing using timing criticality and temporal currents. In: Proceedings of the 2005 Conference on Asia South Pacific Design Automation, pp. 1094–1097 (2005)
7. Hu, Z., Buyuktosunoglu, A., Srinivasan, V., Zyuban, V., Jacobson, H., Bose, P.: Microarchitectural techniques for power gating of execution units. In: ISLPED '04: Proceedings of the 2004 International Symposium on Low Power Electronics and Design (2004)
8. Kawaguchi, H., Nose, K., Sakurai, T.: A super cut-off CMOS (SCCMOS) scheme for 0.5-V supply voltage with picoampere stand-by current. IEEE Journal of Solid-State Circuits 35, 1498–1501 (2000)
9. Kruse, L.: Estimating and Optimizing Power Consumption of Integrated Macro Blocks at the Behavioral Level. Dissertation (2001)
10. Hoyer, M., Helms, D., Nebel, W.: Modeling the Impact of High Level Leakage Optimization Techniques on the Delay of RT-Components. In: PATMOS 2007. LNCS, Springer, Heidelberg (2007)
11. Helms, D., Hoyer, M., Nebel, W.: Accurate PTV, State, and ABB Aware RTL Blackbox Modeling of Subthreshold, Gate, and PN-junction Leakage. In: Vounckx, J., Azemard, N., Maurine, P. (eds.) PATMOS 2006. LNCS, vol. 4148, pp. 56–65. Springer, Heidelberg (2006)
12. Kanzow, C., Yamashita, N., Fukushima, M.: Levenberg-Marquardt methods for constrained nonlinear equations with strong local convergence properties. Journal of Computational and Applied Mathematics 172, 375–397 (2004)
Functional Verification of Low Power Designs at RTL

Allan Crone¹ and Gabriel Chidolue²

¹ Mentor Graphics Corporation, Kista Science Tower, SE-164 51 Kista, Sweden
[email protected]
² Mentor Graphics IRL (UK Branch), Rivergate, Newbury Business Park, London Road, RG14 2QB Newbury, United Kingdom
[email protected]
Abstract. Power is the number one constraint impacting today's electronic designs. The need to minimize dynamic and static power consumption creates unique verification challenges. A common low power design technique involves switching off certain portions of the design (power islands) when their functionality is not required, to reduce leakage power, and restoring power when that functionality is needed again. This creates the need to save and restore state information with retention flops and latches, and to ensure the power island returns to a known good state when powered up. Verification of correct design functionality of power islands within the context of a power management scheme has traditionally been performed at the gate level, if at all. Defect rectification at this level is costly in terms of resources and design cycle. This paper discusses the application of innovative techniques to enable power-aware verification at the RTL with traditional RTL design styles and reusable blocks.

Keywords: Low power aware management, Simulation, Retention, Corruption, UPF, PCF.
1 Introduction

The increasing demand for high performance, portable, battery operated, system-on-chip (SoC) designs in communication and computing has highlighted power dissipation as a critical design constraint. What may not be as well known is the need to reduce power consumption for non-portable systems such as base stations, where heat dissipation and energy consumption are key drivers. The focus has shifted to low power consumption from traditional constraints like area, performance, cost and reliability. This has led designers to explore different techniques to reduce power dissipation in their designs and increase battery life [1, 2]. With each progression deeper into sub-micron processes, power consumption due to leakage rivals and, eventually, surpasses consumption due to dynamic power (switching activity). Reports suggest that leakage may constitute over 40% of total
N. Azemard and L. Svensson (Eds.): PATMOS 2007, LNCS 4644, pp. 288–299, 2007. © Springer-Verlag Berlin Heidelberg 2007
power at the 70 nm technology node [3]. The quadratic dependency of leakage power on total transistor count easily qualifies leakage optimization as a key design objective. Traditional low power techniques such as lowering the power supply voltage VDD and gating clocks become less effective and even undesirable, as voltage reduction increases circuit delay, which in turn forces a decreased threshold voltage to maintain performance. Unfortunately, this radically increases leakage current due to the exponential nature of leakage current in the sub-threshold regime of the transistor. However, the impact of leakage power strongly depends on the specific circuit operating mode [4]:
• Continuously operated circuits, which are characterized by a high switching activity and by a small idle time of internal components
• Event driven circuits, where external events trigger burst computations that are separated by long idle times
In the first case, the switching component of the power dissipation typically dominates that of leakage. Leakage is the main fraction of the total power for event driven circuits, where the switching activity is kept null during idle time by gating the clock. While dynamic power is eliminated, the circuit still dissipates energy due to leakage. Most handheld devices fall under the category of event driven circuits; hence they are characterized by intermittent operation with long periods of idle time (standby). The sleep concept is commonly used for leakage power reduction in the ASIC domain. This gives rise to power gating structures (MTCMOS) [5], one of the well-known techniques for reducing leakage power in standby mode while still maintaining high speed in active mode. With power gating, during the standby state, one or more portions of the circuit are switched off and the passage from the source to ground is blocked, eliminating leakage in those portions of the circuit.
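The distinction between the two operating modes can be illustrated with a toy energy budget (Python; the power figures below are invented for illustration, not measured). With the clock gated during idle time, dynamic energy scales with the active time only, while leakage accumulates over the whole period.

```python
# Illustrative (hypothetical numbers): energy split of a block over one
# period. Dynamic energy is paid only while the block is active; leakage
# is paid for the entire period, even while the clock is gated.

def energy_split(p_dyn, p_leak, t_active, t_total):
    e_dyn = p_dyn * t_active    # switching energy, active phase only
    e_leak = p_leak * t_total   # leakage accumulates over the whole period
    return e_dyn, e_leak

# Continuously operated: active 90% of a 1 s period -> dynamic dominates.
print(energy_split(50e-3, 5e-3, 0.9, 1.0))
# Event driven: active 2% of the period, clock-gated otherwise
# -> leakage dominates despite the gated clock.
print(energy_split(50e-3, 5e-3, 0.02, 1.0))
```

This is why power gating, which attacks the leakage term itself, targets the event-driven case.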
One side effect of power gating is that data in storage elements is lost. As a result, designers use retention memory elements to retain key state data during sleep mode and restore it upon power-up. Current verification techniques available in the market enable gate-level verification of circuits involving power reduction techniques (power-aware behavior). Gate-level verification poses many issues, such as:
• Time: gate-level simulations are slow
• Debugging: debugging at the gate level is difficult, as the original RTL specification has been compiled through synthesis
• Problem rectification: it takes longer and requires more resources to resolve functional problems uncovered at the gate level than at RTL
• Standard: no standard way exists to perform power-aware verification at the gate level
Power Aware Simulation (PA Simulation) solves the problem of functional verification of power-aware designs. It gives designers the ability to functionally
verify their power management techniques at the RTL, reducing the cost significantly in terms of both effort and time. It works with normal RTL coding styles. There is no need to hand-instantiate gate-level retention cells for state data in the design. There is no need to tightly intertwine the power control network (PCN) with the RTL functional network. Thus, legacy RTL blocks are easily reused without modifying the RTL code, and new reusable blocks can be created independent of the power-aware environment they may be used within. This paper discusses the organization of power aware designs, the PA simulation flow, and an example.
2 Power Management Techniques

Designers have developed various power management techniques for low-power designs. These power management techniques make use of a "sleep operation" to reduce power consumption. In the circuits used for these designs (MTCMOS), the circuit states are lost in sleep mode because the virtual power supply lines connected to the logic circuits float in order to cut off standby leakage current. Some designs have a conventional sleep operation where the power supply of the design is cut off when the circuit is not in use. Such designs do not require data to be retained in the registers/latches used in the design. "Power down" in notebook computers is an example of such a sleep mode. Even with conventional sleep operation, where no state information needs to be saved, functional verification of the design is required to ensure that the "awake" portions of the design function properly when the rest are sleeping, and that when power is restored to sleeping blocks, the system will continue with correct operation. Alternatively, in retention sleep mode only the operation of the logic circuit is stopped when the circuit is not in use. Some low power devices in portable equipment use this sleep operation during intermittent operation such as waiting for input from a keyboard or communicating through a slow interface. These devices resume operation without restarting because the common registers and the pipeline registers also preserve their data during sleep. These sequential cells are also termed retention registers (RFFs) or retention latches (RLAs). To implement sleep power management techniques, a well-defined Power Management Block (PMB) is created with specific power control signals.

2.1 Power Management Design Structure

As the power management techniques employ turning various segments of a design off and on, such designs are typically divided into different domains or islands based on their power control signals.
Normally, they are categorized into voltage domains and power islands. A voltage domain runs at a particular voltage or set of voltages. Power islands (the terms power island and power domain are used interchangeably) have a single set of power control signals. Also, the design has at least one domain
Fig. 1. Voltage and power domains (or islands)
which is always on. This is normally known as the wake-up domain. An example set of power control signals is:
• Power_good/on: controls the power on/off of each power domain.
• RET/SAVE/RESTORE: controls sequential retention cells.
• ISO: controls isolation cells. Isolation cells are logic gates that determine the output values driven by a power island when it is powered down.

2.2 Retention Memory Elements

To maintain the state of registers and latches in sleep mode, retention elements are employed that can save and restore their data contents in sleep mode. Retention registers (RFFs) and latches (RLAs) are affected by the power control signals controlling the RTL region to which they belong. After RFFs/RLAs have performed a successful SAVE operation, depending on the type of register and proper control signals, these registers will start driving 'X' when the power is switched off [5, 7]. They will drive the saved value only when a successful RESTORE operation is completed. If either the SAVE or the RESTORE operation is not successful, 'X' is restored during the RESTORE operation. These elements behave as normal memory elements when the power is switched on and no power control signals are asserted. An example of a retention flip-flop is the clock-low retention FF [8]. This type of RFF has one control signal, known as RET. When RET is asserted and the clock is low, it performs the SAVE operation; when RET is de-asserted and the clock is low, the RESTORE operation is performed. It behaves as a normal FF when RET is low (Fig. 2).
Fig. 2. Clocking diagram of clock low retention flip flop
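The SAVE/RESTORE protocol of the clock-low retention FF described above can be captured in a small behavioral sketch (written in Python rather than HDL, purely to illustrate the semantics; it simplifies the clock to a level input and is not any vendor's retention model):

```python
# Behavioral sketch of the clock-low retention FF semantics: SAVE when
# RET is high while the clock is low, 'X' corruption while power is off,
# RESTORE when RET is low while the clock is low. A missing SAVE leaves
# the shadow empty, so RESTORE yields 'X', matching the paper's rules.

class ClockLowRetentionFF:
    def __init__(self):
        self.q = 'X'
        self.shadow = None          # retention (balloon) storage

    def step(self, clk, ret, power, d):
        if not power:
            self.q = 'X'            # corruption while powered down
            return self.q
        if ret and not clk:
            self.shadow = self.q    # SAVE: RET high, clock low
        elif not ret and not clk and self.shadow is not None:
            self.q = self.shadow    # RESTORE: RET low, clock low
            self.shadow = None
        elif not ret and clk:
            self.q = d              # normal FF (clock edge simplified to level)
        return self.q

ff = ClockLowRetentionFF()
ff.step(clk=1, ret=0, power=1, d=1)      # normal operation: q = 1
ff.step(clk=0, ret=1, power=1, d=0)      # SAVE: shadow <= 1
ff.step(clk=0, ret=1, power=0, d=0)      # power off: q drives 'X'
q = ff.step(clk=0, ret=0, power=1, d=0)  # power on + RESTORE: q = 1 again
print(q)  # 1
```

The sequence mirrors Fig. 2: save before power-down, 'X' while off, restore after power-up.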
2.3 Specifying Design Intent of Low Power Designs

Most registers and latches in an RTL design are inferred and not explicitly coded as part of the HDL. The RTL coding style would be significantly impacted if designers were required to explicitly instantiate retention registers and latches wherever persistent state information occurs. Furthermore, when designs go through a technology spin, there is no guarantee that the retention cells will be the same in the new technology library. Similarly, explicit routing of power control signals to retention cells and the instantiation of isolation cells and level shifters within the RTL code creates an unnecessarily tight coupling between the design functionality and the low power design intent. If the low power design intent can be specified separately from the RTL code but related to it, then familiar RTL coding styles can be maintained and reuse of the RTL maximized. A means of specifying the low power intent separately from the functionality in RTL is key. It should provide the information required to overlay the RTL functionality with the Power Control Network (PCN) and power aware functionality. The power specification data comprises information about voltage and power islands, corruption semantics, isolation strategy, level-shifting requirements, the power control signals, power and ground information and their associated switches. It also maps state data that needs to be retained, including memories, to the retention behavior that will be realized in the gate-level implementation. This information can be used to functionally verify the design at RTL (or higher) and to ensure the low power design intent is implemented through synthesis in the gate-level design. Synthesis tools can use the same information to automatically create an appropriate power aware gate-level netlist.
Accellera, an organization focused on identifying and creating new standards and methodologies for the electronic design industry, recently approved a standard for low power design intent specification. This standard is called the Unified Power Format (UPF) 1.0. In addition to supporting the Unified Power Format, Mentor Graphics also supports a proprietary format known as the Power Configuration Format (PCF). Throughout the remainder of this paper, both UPF and PCF will be referred to as the power configuration file.

Table 1. Power instance corruption rules

Mode       Behavior
Power off  Force the outputs of the instance and the outputs of all registers
           and latches within the instance to: 'X'; the left attribute of a
           VHDL user-defined type; the default initial value for SystemVerilog
           user-defined types; or a random or some user-defined value.
Power on   Release the outputs of the instance. A re-evaluation of the signal
           must occur. Release the outputs of all registers and latches within
           the instance after all the outputs are released.
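The corruption rules above can be illustrated with a minimal sketch (Python; this mimics the stated semantics and is not the simulator's actual mechanism): at power-down every register in the island is forced to 'X', and at power-up the forces are released and the signals re-evaluate.

```python
# Illustrative model of instance corruption: power-off forces all state
# in the island to 'X'; power-on releases the forces, here modeled by
# applying the re-evaluated (e.g. reset or restored) values.

class PowerIsland:
    def __init__(self, regs):
        self.regs = dict(regs)   # name -> current value
        self.on = True

    def power_off(self):
        self.on = False
        self.regs = {name: 'X' for name in self.regs}   # corrupt everything

    def power_on(self, reevaluated):
        self.on = True
        self.regs.update(reevaluated)   # signals re-evaluate after release

island = PowerIsland({'state': 3, 'count': 7})
island.power_off()
print(island.regs)               # every register reads 'X'
island.power_on({'state': 0, 'count': 0})
print(island.regs)               # known good values after power-up
```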
3 Power Aware Simulation

The simulator must be able to:
• Identify all sequential elements inferred by the RTL design (registers, latches and memories)
• Overlay the RTL design with the PCN
• Pull in the appropriate retention cell model behavior
• Dynamically modify the behavior of the design to reflect the specified low power design intent in power-down and power-up situations
4 PA Simulation Flow

PA Simulation is broadly categorized into four steps based on the requirements discussed in the last section:
• Register/latch recognition from the RTL design
• Identification of power elements and their power control signals
• Elaborating the power aware design
• Power aware simulation
4.1 Register, Latch, and Memory Recognition

Register, latch and memory recognition from the HDL is normally associated with the front end of synthesis, which transforms the HDL into a structural netlist containing flip-flops, latches, memories (essentially a collection of addressable registers) and
other logic. To accomplish recognition of sequential elements, Mentor Graphics enhanced its verification platform, Questa, with the ability to infer sequential elements and recognize memory structures, providing a complete power aware verification solution.

4.2 Identification of Power Elements

The power configuration file contains the low power design intent and is parsed for the following information:
• Power and voltage domains and power control signals
• Mapping of sequential elements to retention models (RFFs, RLAs, memories)
• Intended corruption semantics (outputs of a power island and sequential elements)
Using the UPF or PCF and the sequential element information, the simulator is able to create the Power Control Network, overlay the functional network with the Power Control Network, and integrate the specified power-aware behavior (retention, corruption and isolation) with the RTL functionality.

4.3 Elaborating the Power Aware Design

The system's power management block is specified in RTL the same as any other block in the design. The power management block drives the signals that compose the PCN based on the system's power management strategy: signaling power islands to retain state, powering them down, powering them up and then signaling to restore state. The signals of the PCN exist in the design, but are not routed throughout the RTL. During elaboration, the PCN is routed according to the power specification in the power configuration file. This is the process of overlaying the RTL functional netlist with the PCN. During this process, power domains that have been specified to have state retention in the power configuration file will be identified.
The sequential elements within these domains will be identified and, based on the corruption extent specification, power aware retention cell models are inserted to augment the behavior of these sequential elements with power aware behavior: power-down behavior (corruption), state retention, power-up behavior, and state restoration. All other logic elements within a power domain will have their behavior modified such that at power-down their outputs are corrupted and on power-up they resume normal operation.

4.4 Power Aware Simulation

After integrating the low power design specification with the RTL functional specification, the PA Simulation is ready to start. The design simulates normally. Typically the testbench will exercise the Power Management Block so that it turns specified power islands on and off following the specified power-down, isolation and save/retention sequence. While specific island(s) are powered down by the Power Management Block, the testbench can verify that the "awake" portions of the design continue to operate properly. Assertions written using PSL or SVA can also be employed to verify the
correct sequence for power up/down, retention and isolation. They can also be used to ensure proper operation of the "awake" portions of the design while specified power islands are powered down. When the power management strategy (based on test stimulus) determines that power island(s) need to be turned on, the Power Management Block enables the power to the power islands that were previously turned off and signals restoration of retained values for the sequential elements. Verification continues to ensure that the power islands come up in known good states and the entire system can continue operating normally. PA simulation exposes many functional bugs, including:
• Failure to retain sufficient state information to enable restoration of functionality when power is restored
• Dependency on output values of powered-down islands
• Problems when interacting state machines in different power islands restore to states that create deadlock or livelock situations
• Improper sequencing of save and restore operations by the PMB
• Failure to reset a block at power-on to a known good state for non-retentive blocks
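The sequencing rules that such PSL/SVA assertions typically encode can be illustrated by a simple trace checker (Python; the event names are invented for illustration): state must be saved and outputs isolated before power-off, and power must be back on before state is restored.

```python
# Illustrative checker for a power-gating control sequence. It replays a
# trace of control events and flags the ordering violations an assertion
# would catch: power-off without prior save/isolation, or a restore
# attempted while the island is still off.

def check_power_sequence(trace):
    saved = isolated = False
    power = True
    for event in trace:
        if event == 'save':
            saved = True
        elif event == 'isolate':
            isolated = True
        elif event == 'power_off':
            if not (saved and isolated):
                return False        # powered down without save/isolation
            power = False
        elif event == 'power_on':
            power = True
        elif event == 'restore':
            if not power:
                return False        # restore while the island is still off
            saved = isolated = False
    return True

print(check_power_sequence(
    ['isolate', 'save', 'power_off', 'power_on', 'restore']))  # legal: True
print(check_power_sequence(
    ['power_off', 'power_on', 'restore']))                     # illegal: False
```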
5 Example

To demonstrate the specification of low power design intent with an RTL design, an example memory subsystem is presented. Details of the dram controller and dram blocks are not particularly relevant; therefore, only the RTL for the memory subsystem is shown. The reader can imagine that the memory controller implements an FSM with a register for the current state, and that the memory it controls declares one or more variables for the memory contents. Zeidman [7] provides an RTL implementation which serves as the inspiration.

Assumptions:
• A PMB exists somewhere else in the design
• The PMB drives the signals:
  o mem_power: 1 if power is on
  o mem_save: retention save, active high
  o mem_restore: retention restore, active high
• Both the memory controller and the memory are within the same power and voltage domain.

5.1 Sample Design

module Mem_subsys(…);
  reg        clock;
  reg        reset;
  reg  [7:0] addr_in;
  reg        as;
  reg        rw;
  wire       we_n;
  wire       ras_n;
  wire       cas_n;
  wire       ack;
  wire [3:0] addr_out;
  wire [7:0] data;

  // Data bus (data_out is assumed to be declared and driven in the
  // omitted portion of the design)
  assign data = (as & we_n) ? 8'hzz : data_out;

  dram_control Dram_control(
    .clock(clock), .reset_n(~reset), .as_n(~as),
    .addr_in(addr_in), .addr_out(addr_out), .rw(rw),
    .we_n(we_n), .ras_n(ras_n), .cas_n(cas_n), .ack(ack));

  dram Dram(
    .ras(~ras_n), .cas(~cas_n), .we(~we_n),
    .address(addr_out), .data(data));

endmodule
5.2 Power Configuration File

HEADER (VERSION="2.7"), (SEPARATOR="/");
SCOPE /top/dut;
CEXTENT = OUT_SEQ_AND_WIRES;
POWER mem_subsys_pi \
    -i mem_subsys, (mem_power) \
    RETENTION mem_save mem_restore ;
DEFAULT_MAP MY_CKFRRFF : FF_CKFR;
MAP ABC_RETMEM : RETMEM_CKHI \
    mem_save mem_restore, \
    -s mem_subsys/dram/mem;

The header statement documents the version of the PCF being used and the path separator character employed in the file. The PCF is designed to work with any language, and each HDL has its preferred separator character. The scope statement sets the context within the design hierarchy that applies to all relative paths and statements from that point until the next scope statement. The corruption extent (CEXTENT) statement specifies that all outputs, internal wires driven by internal combinational logic, and sequential elements (retention and non-retention) of the power island are to be corrupted on power-down. The power statement defines a power island named mem_subsys_pi. The power island is defined as the instance mem_subsys. The "-i" indicates that the name that follows identifies an instance in the design. The granularity can also include signals (-s) and processes (-p). By default, the power statement applies recursively to all child instances below any instance specified; in this case, the dram_control and dram instances are part of this power domain. The power statement can be made non-recursive by using a -nr switch. Signal mem_power indicates whether power is on or off. In general, this can be any legal Verilog expression where a value of 1 indicates power is on and 0 indicates it is off. Support for expressions allows the "logical and" of global and local signals to indicate power on. The retention clause within the power statement is optional. If specified, the retention control signal(s) provided will apply, by default, to any retention elements in the power island.
The retention clause can specify one (for example, "sleep") or two retention control signals. In this example, we have specified two: mem_save and mem_restore. The default map statement states that any inferred register (flip-flop, FF) in this scope (/top/dut), whether positive or negative clock-edge triggered, should be mapped to MY_CKFRRFF. MY_CKFRRFF is a retention FF implemented in Verilog
and conforming to the modeling guidelines for use in PA simulation. Memories and flip-flops are handled separately during the pseudo-synthesis stages of elaboration in vopt. The map statement can be used to override the DEFAULT_MAP statement, explicitly mapping a specific retention flip-flop or memory model to a specific region in the design. In the PCF file, the MAP statement is used to map a specific retention memory model to a specific signal that models a memory in RTL (the retention model could be provided by the foundry). The default map and map statements illustrate precedence: specifications at a lower level of the design hierarchy override higher level specifications, and more specific statements, such as the map statement, override default statements (the default map). Combining default recursive application with this precedence policy results in very concise specifications through the use of defaults that are overridden infrequently.
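The precedence policy can be sketched as a simple lookup (Python; illustrative only, with the model and path names taken from the example PCF and dictionaries standing in for the PCF statements): an explicit MAP for a specific design path wins over the scope-wide DEFAULT_MAP.

```python
# Sketch of the precedence rule: specific MAP statements override the
# DEFAULT_MAP of the enclosing scope, which itself applies recursively.

DEFAULT_MAP = {'/top/dut': 'MY_CKFRRFF'}              # all inferred FFs in scope
MAP = {'/top/dut/mem_subsys/dram/mem': 'ABC_RETMEM'}  # explicit, most specific

def retention_model(path):
    if path in MAP:                        # specific statements win
        return MAP[path]
    for scope, model in DEFAULT_MAP.items():
        if path.startswith(scope + '/'):   # default applies recursively
            return model
    return None                            # outside any specified scope

print(retention_model('/top/dut/mem_subsys/dram/mem'))            # ABC_RETMEM
print(retention_model('/top/dut/mem_subsys/dram_control/state'))  # MY_CKFRRFF
```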
6 For Further Investigation

In the process of further refining our initial power-aware verification capabilities, we discovered that most users appreciate the ability to specify low power design intent separately from RTL functionality. However, some users are integrating the low power and functional design intent in a single "golden" HDL source. It is clear that, prior to the availability of a separate specification mechanism, users had no choice but to model the low power intent in the HDL code. It is less clear that the approach of integrating low power and functional design intent has inherent advantages beyond reducing the number of files to manage and ensuring that any change to either aspect of the design implementation forces a full re-verification of the block. We continue to gather information on these requirements. It would not be surprising to find that both approaches to low power design intent specification are required; for example, it is not uncommon to see isolation wrappers embedded in RTL code. As stated earlier, the Accellera UPF standard format for specifying low power intent can be used to overcome some of these issues as tool support for UPF in areas such as simulation, synthesis and equivalence checking increases.
7 Conclusion

The naturally loose coupling of low power and functional design intent permits separate specification of both. Separate specification provides the freedom to define a very concise and high-level mechanism for specifying low power design intent. It also enables greater reuse of IP within different low power architectures. Mentor's low power design specification and verification solution has been developed in conjunction with partners who are successfully using the solution in production designs and have found issues that would have required respins. Recent standardization efforts of low power design specification and flow will greatly benefit the design community.
References
1. Chandrakasan, A., Sheng, S., Brodersen, R.: Low-Power CMOS Digital Design. IEEE J. of Solid-State Circuits 27(4), 473–484 (1992)
2. Singh, D., Rabaey, J., Pedram, M., Catthoor, F., Rajgopal, S., Sehgal, N., Mozdzen, T.: Power Conscious CAD Tools and Methodologies: a Perspective. Proc. of the IEEE 83(4), 570–594 (1995)
3. Kao, J., Narendra, S., Chandrakasan, A.: Subthreshold leakage modeling and reduction techniques. In: IEEE International Conference on Computer-Aided Design, pp. 141–148. IEEE Computer Society Press, Los Alamitos (2002)
4. Chandrakasan, A., Yang, I., Vieri, C., Antoniadis, D.: Design considerations and tools for low-voltage digital system design. In: Proc. of Design Automation Conf. (1996)
5. Mutoh, S., Douseki, T., Matsuya, Y., Aoki, T., Shigematsu, S., Yamada, J.: 1-V power supply high-speed digital circuit technology with multi-threshold-voltage CMOS. IEEE J. Solid-State Circuits 30, 847–854 (1995)
XEEMU: An Improved XScale Power Simulator

Zoltán Herczeg¹, Ákos Kiss¹, Daniel Schmidt², Norbert Wehn², and Tibor Gyimóthy¹

¹ University of Szeged, Department of Software Engineering, Árpád tér 2., H-6720 Szeged, Hungary
{zherczeg,akiss,gyimothy}@inf.u-szeged.hu
² University of Kaiserslautern, Department of Electrical Engineering, Erwin-Schrödinger-Straße, D-67663 Kaiserslautern, Germany
{schmidt,wehn}@eit.uni-kl.de
Abstract. Energy efficiency is a top requirement in embedded system design. Understanding the complex issue of software power consumption in early design phases is of extreme importance to make the right design decisions. Power simulators offer flexibility and allow a detailed view on the sources of power consumption. In this paper we present XEEMU, a fast, cycle-accurate simulator, which aims at the most accurate modeling of the XScale architecture possible. It has been validated using measurements on real hardware and shows a high accuracy for runtime, instantaneous power, and total energy consumption estimation. The average error is as low as 3.0% and 1.6% for runtime and energy consumption estimation, respectively.
1 Introduction
Due to advances in microelectronics and communication technology, complex information processing can be efficiently embedded in an increasing range of portable products like mobile phones, MP3 players and PDAs [1]. Low cost, energy efficiency and fast time to market are the top requirements in embedded system design. Typical portable appliances are microprocessor-centric architectures. Power and energy optimization of hardware is a well-investigated discipline; modern microprocessor architectures are therefore equipped with many knobs for minimizing energy and power, such as clock gating, power supply and frequency scaling, leakage control, etc. The actual power and energy consumption of a microprocessor, however, is determined by the application running on it, and the interrelation between software and hardware with respect to power and energy consumption and optimization is very challenging to capture. Understanding the complex issue of software power consumption in combination with the underlying processor system in early design phases of embedded systems is of extreme importance to make the right design decisions. Several possibilities exist to evaluate the energy and power consumption, the most obvious being measurements on real hardware. This is seldom possible,

N. Azemard and L. Svensson (Eds.): PATMOS 2007, LNCS 4644, pp. 300–309, 2007.
© Springer-Verlag Berlin Heidelberg 2007
however, as the platform is often not fully defined in the early phases of embedded systems design. Hardware measurements on a specific platform also do not allow evaluating the effect of different architectural parameters, such as different processors, cache models and sizes. Power simulators offer more flexibility and overcome the problem of noise and side effects that influence measurements. With simulation tools it is possible to gather large amounts of data automatically. They also allow a much more detailed view of the sources of power consumption than power measurement. Hence, the availability of accurate power simulation tools is the key to energy-efficient embedded software design.

In this paper we present XEEMU, a fast, cycle-accurate simulator for the XScale architecture [2], one of the most widespread RISC architectures in the embedded domain. In contrast to many other existing power simulators, such as [3,4], which simulate power and performance of theoretical microprocessor architectures, XEEMU aims at the most accurate modeling of the XScale architecture possible, trading off flexibility for much more reliable results. XEEMU proves to be more accurate than the well-known and freely available XScale simulator XTREM [5] in terms of runtime and energy estimation, due to its improved pipeline and power model and the cycle-accurate simulation of the SDRAM subsystem. It offers high flexibility through freely and independently configurable frequencies for the core clock and the memory. Nonetheless, XEEMU reaches more than twice the simulation speed of XTREM.

The methodology we used to create XEEMU is an iterative two-step process, combining measurements and simulation. As the total energy consumption is heavily dependent on the runtime of a program, it is mandatory to model the behavior cycle-accurately first, before the power model is created in the second step.
The creation of both the power and the behavioral pipeline model is done iteratively, with a validation against measurements on an evaluation board at the end of each iteration. Where publicly available documentation is not sufficient for the creation of models, synthetic benchmarks that stress one particular effect are used to isolate and test individual parts of a model. The refinement of a model is done in two ways: extension and correction if the observed effects are not considered in the model, or parameter fitting when the effects are considered but over- or underestimated. The methodology is a general approach that can be used for a wide range of different processors. Our case study, which resulted in the creation of XEEMU, shows the pitfalls in creating precise pipeline models and power models. It points out aspects that have to be taken into account when creating power simulators targeted at other architectures as well.

The rest of the paper is organized as follows: Section 2 presents related work done so far on power and energy simulation and measurement, Section 3 describes how XEEMU was developed from an existing simulator by correcting the identified problems, and Section 4 shows the results of the improvements. Finally, Section 5 summarizes the gathered experience and gives directions for future work.
2 Related Work
The simulation of the behavior as well as the power dissipation of microprocessors can be done on various abstraction levels, namely circuit level, architectural level, and instruction level. Spice [6] is the de facto industrial-standard general circuit simulator. While it offers high accuracy, it requires a precise description of the hardware at the transistor level. Due to the complexity of a microprocessor this results in unacceptably low simulation speed. Moreover, the circuit-level layout of a processor is usually not available to the public.

Much greater simulation speeds can be achieved at the instruction level. Tiwari et al. proposed a methodology [7,8] to characterize the power consumption of individual instructions by hardware measurements. It requires measurement of benchmarks for every single instruction as well as additional benchmarks for inter-instruction dependencies and data dependencies. The accuracy of this approach is limited by the fact that in the highly optimized pipelines of today's microprocessors, such as the XScale, the execution times of instructions may vary strongly. This is especially true for external memory accesses, but also for stalls due to mispredicted branches. The existence of a nearly cycle-accurate instruction-level simulator for the XScale architecture is claimed in [9]. However, this tool is not available to the public. Furthermore, it was not compared to other existing simulators, so the results can neither be verified nor exploited.

To model exact, cycle-accurate behavior of a pipeline, simulation at the microarchitectural level is mandatory. SimpleScalar [10] is an open-source, configurable, generic processor core simulator. It is capable of cycle-accurate pipeline simulation covering all internal effects, like stalling. It offers high flexibility and enables designers to evaluate architectural optimizations. SimplePower [11] extends SimpleScalar with power estimation functions for each of the pipeline stages.
In every simulated clock cycle the functions calculate the power consumption for every functional unit, based on analytical power models and lookup tables. Because the pipeline organization differs heavily from that of the XScale architecture, results are not comparable. Sim-Panalyzer [3] (formerly named PowerAnalyzer) is a power estimation tool based on SimpleScalar. It interprets ARM instructions, just like XScale processors, but as it does not alter the generic simulation core, it still cannot be used to simulate the XScale core accurately. Wattch [4] is a parameterized power model of common structures present in modern microprocessors, which has also been used to extend SimpleScalar, but it is flexible and can also be integrated into other architectural simulators. It is based on mathematical formulas but focuses only on dynamic power consumption, completely omitting the growing effect of leakage. Based on the Wattch power model, Contreras et al. created XTREM [5], an architectural power simulator tailored for the XScale core. It differs from the previously mentioned works in that it focuses on one specific architecture, aiming at a more accurate power and performance simulation at the cost of less
architectural flexibility. Since this approach matches our intentions, we will investigate this tool in more detail in the next sections.
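As background for the comparisons that follow, the instruction-level costing of Tiwari et al. [7,8] can be sketched in a few lines. All cost numbers below are invented for illustration; they are not measured XScale values.

```python
# Sketch of Tiwari-style instruction-level energy estimation: each instruction
# carries a base energy cost, and each adjacent pair an inter-instruction
# overhead (circuit-state change).  All numbers are invented for illustration.
BASE_COST_NJ = {"add": 1.0, "mul": 2.2, "ldr": 2.8}           # per instruction
INTER_COST_NJ = {("add", "mul"): 0.3, ("mul", "ldr"): 0.5}    # per adjacent pair

def trace_energy(trace):
    """Total energy (nJ) of an instruction trace: base costs + pair overheads."""
    total, prev = 0.0, None
    for op in trace:
        total += BASE_COST_NJ[op]
        if prev is not None:
            # the overhead table is symmetric; look up either ordering
            total += INTER_COST_NJ.get((prev, op),
                                       INTER_COST_NJ.get((op, prev), 0.0))
        prev = op
    return total

print(trace_energy(["add", "mul", "ldr", "add"]))  # 1.0+2.2+2.8+1.0 + 0.3+0.5
```

As the text notes, such a model assigns a fixed cost per instruction and therefore cannot capture variable-latency effects like cache misses or mispredicted branches, which is precisely what motivates cycle-accurate microarchitectural simulation.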
3 XTREM – An In-Depth Review

3.1 Experimental Setup and Benchmarks
To validate XTREM, we use the ADI 80200EVB XScale-based evaluation board from ADI Engineering. The board is equipped with an Intel 80200 XScale processor [12] and 32 Mbytes of SDRAM. Input clocks provided by the board are 66MHz for CLK and a 100MHz MCLK for the peripheral bus controller and the SDRAM. CLK is used to generate the XScale internal core clock CCLK, which we set to 600MHz to achieve the best MIPS/power ratio [2]. The core operating voltage is 1.3V. The memory and peripheral bus controller is hosted in a Xilinx FPGA [13]. The board provides a jumper to measure core power consumption. Using a Tektronix TPS 2014 digital storage oscilloscope connected to a PC, we sampled the power consumption with a 0.1Ω resistor in line with the processor core. The core voltage during operation remained constant. To eliminate noise in the measurements, we calculated the average value for each sample over seven runs of each benchmark.

The results of simulation and measurement are compared for 10 test programs selected from the CSiBE benchmark environment [14]. The selection contains various command-line tools such as data and image compressors, converters and parsers. These programs not only test the overall accuracy of the simulator but exploit the special characteristics of the architecture as well. The JPEG compressor (cjpeg) utilizes many shift operations. The hex encoder-decoder pair (enhex, dehex) executes many conditional block data transfer instructions. The VSL abstract machine (vam) is a computation-dominant program, i.e., it rarely accesses the memory and fits in the caches. In contrast to vam, minigzip and the PNG encoder (pnm2png) are memory-dominant programs, causing a high number of data cache misses. All these programs are written in standard C and are compiled with the GCC-based Wasabi cross-compiler tool chain [15] to stand-alone binaries, using optimization option -O3.
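The shunt-resistor setup described above maps to core power straightforwardly: with a constant 1.3 V core voltage, P = V_core · I = V_core · V_drop / R. A small sketch (the sample values are made up for illustration):

```python
# Converting oscilloscope samples to core power for the setup above:
# a 0.1 Ohm shunt in line with the core and a constant 1.3 V core voltage,
# so P = V_core * V_drop / R.  The sample values below are made up.
R_SHUNT_OHM = 0.1
V_CORE_V = 1.3

def core_power(v_drop_samples):
    """Map shunt voltage-drop samples (V) to core power samples (W)."""
    return [V_CORE_V * v / R_SHUNT_OHM for v in v_drop_samples]

def average_runs(runs):
    """Average corresponding samples across runs to suppress noise."""
    return [sum(samples) / len(samples) for samples in zip(*runs)]

# seven (here identical) runs of one benchmark, as in the measurement flow
runs = [core_power([0.031, 0.029, 0.035]) for _ in range(7)]
print(average_runs(runs))
```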
On the hardware and in the simulator the programs are executed with no underlying operating system and all I/O operations mapped to main memory, thus eliminating a source for side effects.

3.2 Performance Validation
It is very well known that there is a strong correlation between runtime and energy consumption. Thus, cycle-accurate simulation is necessary before starting to create a power model. Therefore, we first validated the runtime model of XTREM. We configured XTREM to the same frequency and voltage as the hardware, i.e. 600MHz core clock. The memory clock in XTREM is, however, not configurable, but instead it assumes a constant memory access latency. As due to cost
Fig. 1. Simulated runtime for different values of memory latency in XTREM (18, 48, and 78 CCLK cycles) compared to measurements on the board
limitations DRAM dominates for larger memories in embedded systems, this assumption is invalid. Depending on the state of the memory controller and the memory banks, and on the clock frequency of the memory subsystem, memory access times may vary heavily. Using the processor-internal performance counters we observed cache miss costs varying from 78 to 126 CCLK cycles for the chosen setup. To reflect these effects, a cycle-accurate model of the memory controller and the memory banks in the memory clock domain is mandatory.

However, setting the memory latency in XTREM to 78 CCLK cycles results in a runtime overestimation for all test programs, by more than a factor of 2 for the memory-intensive pnm2png. The optimal correlation of benchmark run times as simulated by XTREM and measured on the processor was found with a fixed memory access latency of only 18 CCLK cycles, as is shown in Fig. 1. Obviously this number is much smaller than the measured memory latency of at least 78 CCLK cycles. This is due to the fact that XTREM does not consider that, by pipelining up to 4 outstanding memory requests, the memory latency can be partially hidden, thus preventing the XScale pipeline from stalling in reality. To counterbalance this effect, the memory latency in XTREM has to be set much smaller. In the synthetic benchmark (cache_miss), which was designed to produce a high number of cache misses, latency hiding is less effective, resulting in a higher average stall rate. Here, XTREM has to be configured with a memory latency of 48 CCLK cycles to estimate the runtime of this benchmark correctly.

An in-depth investigation of XTREM reveals a number of other inaccuracies in the modeling of the caches and the pipeline (the pipeline is shown in Fig. 2). The problem with the cache handling of XTREM is that, while the XScale processors support both write-through and write-back policies, it supports the write-through policy only, and even that is handled incorrectly.
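The latency-hiding argument can be illustrated with a toy queue model. This is not XEEMU's SDRAM model; the issue gap and queue depth are illustrative, only the 78-cycle latency and the 4-entry request queue come from the text.

```python
# Toy model of why a single fixed latency misfits: with up to 4 outstanding
# requests, the memory pipeline hides most of the per-request latency.
def stall_cycles(n_requests, latency, max_outstanding, issue_gap):
    """Core stall cycles for a burst of back-to-back memory requests."""
    in_flight = []          # completion times of outstanding requests, oldest first
    clock, stalls = 0, 0
    for _ in range(n_requests):
        if len(in_flight) == max_outstanding:     # request queue full: stall
            wait = max(0, in_flight[0] - clock)
            stalls += wait
            clock += wait
            in_flight.pop(0)
        in_flight.append(clock + latency)
        clock += issue_gap  # the core reaches its next access a few cycles later
    stalls += max(0, in_flight[-1] - clock)       # wait for the final result
    return stalls

# one isolated miss exposes nearly the whole 78-cycle latency ...
print(stall_cycles(1, 78, 4, 2))
# ... but in a burst of misses far less latency is exposed per miss
print(stall_cycles(40, 78, 4, 2) / 40)
```

With these toy parameters the average exposed latency per miss in a burst lands in the same ballpark as the 18-cycle value that best fits XTREM, while an isolated miss still pays close to the full 78 cycles, mirroring the cache_miss observation.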
Moreover, XTREM does not simulate the fetch buffers between the caches and the main memory, which are used in the processor. As for the pipeline, in the Instruction Fetch 1 (IF1) stage
Fig. 2. The pipeline of XScale: IF1, IF2, ID and RF, feeding the integer (X1, X2, XWB), memory (D1, D2, DWB) and multiply (M1, M2, Mx) pipelines
XTREM uses a branch target buffer (BTB) of a different size and with a different indexing algorithm than the real XScale processor for predicting branches. XTREM also does not support queuing of fetched instructions between the pipeline stages IF2 and Instruction Decode (ID), while this queue is used in the XScale architecture so that IF1 and IF2 can still operate when later stages of the pipeline are stalling. In XTREM, some functionalities are implemented in the Execute 1 (X1) stage, while they reside in earlier stages in XScale: the flow generator, which translates complex, CISC-like instructions (like block data transfers) to micro-operations (μops), works in ID, and the shifter unit is in the Register File (RF) stage. These differences cause improper stalling behavior. Even more important is that the operation of XTREM's flow generator is incorrect as well, since it generates almost twice as many μops for data transfer instructions as the XScale. To achieve cycle-accurate simulation of the XScale processor we extended the simulator with a cycle-accurate model of the SDRAM subsystem and fixed all the detected pipeline-related problems. The BTB and the caching strategies were modified according to the documentation of the XScale architecture.

3.3 Power Modeling
The implementation of the power model of XTREM (inherited from Wattch) is based on activity counters and assumes that inactive functional units consume almost no power because of effective clock gating. However, our measurements have shown that the power dissipation decreases only by a small amount (about 20%) during core stalls. Besides this issue, the power model does not support dynamic frequency (CCLK) and voltage scaling. Hence, this power model is not suitable for accurate power modeling of the XScale. Instead of developing a new power model, we adopted and fine-tuned the versatile power model of the Sim-Panalyzer tool. During the iterative process of fine-tuning we compared measurements and simulation and fitted the parameters of the power model of each functional unit or pipeline stage using synthetic benchmarks that stress certain features of the pipeline.
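The impact of the clock-gating assumption can be quantified with a two-line model. The active-power figure below is illustrative; only the ~20% stall-power drop comes from the measurements reported above.

```python
# Contrast of the two stall-power assumptions: the Wattch/XTREM model gates
# idle units to ~zero power, while the measurements reported here show only
# an ~20% power drop during core stalls.  P_ACTIVE_W is illustrative.
P_ACTIVE_W = 0.5        # assumed power when the core is fully busy
STALL_FACTOR = 0.8      # measured: a stalled cycle still draws ~80%

def avg_power_ideal_gating(stall_fraction):
    """Average power if stalled cycles consumed no power at all."""
    return P_ACTIVE_W * (1.0 - stall_fraction)

def avg_power_measured(stall_fraction):
    """Average power with the observed ~20% drop during stalls."""
    return P_ACTIVE_W * (1.0 - stall_fraction + STALL_FACTOR * stall_fraction)

for f in (0.1, 0.56):   # 56%: roughly the stall share of minigzip (Sect. 4)
    print(f, avg_power_ideal_gating(f), avg_power_measured(f))
```

The gap between the two estimates grows with the stall fraction, which is why XTREM's energy error is largest for memory-dominant programs.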
4 Experimental Results
The modifications made to XTREM in response to the problems described in Section 3 resulted in a new improved power and performance simulation tool,
Fig. 3. Runtime and energy consumption (ADI 80200EVB, XTREM, XEEMU)
called XEEMU. We compared run times and energy consumption for our chosen benchmark set and synthetic benchmarks on XTREM, XEEMU, and the evaluation board, as described in Section 3.1. Results are shown in Fig. 3. As mentioned earlier, XTREM was configured with a memory latency of 18 CCLK cycles, even though the synthetic benchmark cache_miss clearly indicates a higher memory latency. It can be seen that the runtime of memory-dominated benchmarks is underestimated by XTREM, while it overestimates the runtime of all other benchmarks. The average error (not including the synthetic benchmark) is 13.0%, the maximum error 19.7%. XEEMU shows a much more accurate prediction of the runtime for every single benchmark and improves the average and maximum errors to 3.0% and 6.4%, respectively. XEEMU also works very well on all synthetic benchmarks.

The same applies to the energy consumption. XTREM overestimates the effectiveness of clock gating in the case of pipeline stalls and thus underestimates the energy consumption of memory-dominant programs. Due to the much more accurate pipeline model, the improved simulation of the memory subsystem, and the fine-tuned power model, the energy estimates provided by XEEMU are accurate within 4.5% in the worst case and 1.6% in the average case.

In Fig. 4 the instantaneous power dissipation of two test programs is shown. Since the real and simulated run times of the programs differ, we normalized them to allow comparison. The first program, jikespg, was chosen as it has several different regions of execution (delimited by dashed vertical lines in the charts). Region A is a computation-dominant part of the program. In this region the inaccurate pipeline model of XTREM results in non-existing stalls during the simulation, which cause the mispredicted relative decrease in dissipation. In region B the number of read and write accesses to the external
Fig. 4. Power dissipation of two programs (jikespg and minigzip) with normalized runtime: board measurement, XTREM, and XEEMU
Fig. 5. Comparison of simulation times for XEEMU and XTREM
bus increases. Because of the inaccuracy of the power model of XTREM, this results in a spurious increase in the simulated dissipation. Memory read instructions are frequent in region C as well, but contrary to region B, they cause low external
bus activity, i.e., it is mostly the cache that serves the requests. The improper modeling of the external bus results in the drop in the dissipation graph of XTREM. The other example is minigzip, a heavily memory-dominant program, which causes a huge number of cache misses, accounting for 56% of the runtime. This is especially true for region D, where the core stalls most of the time, except at the peaks found in the dissipation graphs of the board and XEEMU. In the graph of XTREM the extra peaks are caused by the wrong memory access model and memory latencies. In contrast to the problems observed with XTREM, the dissipation graphs of XEEMU demonstrate the accuracy of its power model, pipeline model, and memory model.

Another important benefit of XEEMU over XTREM is its improved simulation speed, which we measured on a standard PC (dual-core AMD, 2.2GHz, 4GB RAM) running Debian Linux, kernel 2.6.19.2. As shown in Fig. 5, XEEMU simulates the benchmarks on average 2.5 times faster than XTREM.
5 Conclusions and Future Work
In this paper we presented XEEMU, a new power and energy simulation tool for XScale processor cores. XEEMU extends the architectural model of XTREM with a precise, cycle-accurate model of the memory subsystem and corrects many errors we found in the simulation of the pipeline. It has been validated using measurements on real hardware and shows a high accuracy for runtime, instantaneous power, and total energy consumption estimation. With a low average error of only 3.0% for runtime and only 1.6% for energy consumption estimation, it clearly outperforms XTREM in all test cases. Using a different and less computationally complex power model than XTREM, XEEMU also offers a more than two-fold increase in simulation speed.

XEEMU was designed for maximum simulation accuracy while still offering high flexibility. In contrast to XTREM, it already offers two completely asynchronous, configurable clock domains for the core and memory clock, as well as a cycle-accurate SDRAM memory subsystem simulation. We are currently extending the simulation model of the memory subsystem with a power model, making XEEMU one of the first realistic system simulators available for cycle-accurate performance and power evaluation. Additionally, we will implement support for dynamic voltage scaling, offering a valuable tool for compiler and embedded software designers. XEEMU will be released to the public for wider use.

Acknowledgement. This work was (partially) supported by NKTH (Hungarian National Office for Research and Technology) and BMBF (German Federal Ministry of Education and Research) within the framework of the Bilateral German-Hungarian Collaboration Project on Ambient Intelligence Systems (BelAmI), and by the Péter Pázmány Program (no. RET-07/2005).
References

1. ARTEMIS Strategic Research Agenda Working Group: Strategic research agenda, 1st edn. (2006)
2. Clark, L.T.: An embedded 32-b microprocessor core for low-power and high-performance applications. IEEE Journal of Solid-State Circuits 36(11), 1599–1608 (2001)
3. Mudge, T., Austin, T., Grunwald, D.: The SimpleScalar-Arm power modeling project, http://www.eecs.umich.edu/~panalyzer/
4. Brooks, D., Tiwari, V., Martonosi, M.: Wattch: A framework for architectural-level power analysis and optimizations. In: Proceedings of the 27th Annual International Symposium on Computer Architecture (June 2000)
5. Contreras, G., Martonosi, M., Peng, J., Ju, R., Lueh, G.Y.: XTREM: a power simulator for the Intel XScale core. SIGPLAN Not. 39(7), 115–125 (2004)
6. Rabaey, J.M.: The Spice Home Page, http://bwrc.eecs.bekeley.edu/Classes/IcBook/SPICE/
7. Tiwari, V., Malik, S., Wolfe, A.: Power analysis of embedded software: a first step towards software power minimization. IEEE Transactions on Very Large Scale Integration (VLSI) Systems 2(4), 437–445 (1994)
8. Tiwari, V., Malik, S., Wolfe, A., Lee, M.: Instruction-level power analysis and optimization of software. VLSI Signal Processing 13, 223–238 (1996)
9. Varma, A., Debes, E., Kozintsev, I., Jacob, B.: Instruction-level power dissipation in the Intel XScale embedded microprocessor. In: Sudharsanan, S., Bove Jr., V.M., Panchanathan, S. (eds.) Proceedings of SPIE, Embedded Processors for Multimedia and Communications II, San Jose, California, USA, vol. 5683, pp. 1–8. SPIE (2005)
10. Austin, T., Larson, E., Ernst, D.: SimpleScalar: an infrastructure for computer system modeling. Computer 35(2), 59–67 (2002)
11. Ye, W., Vijaykrishnan, N., Kandemir, M., Irwin, M.J.: The design and use of SimplePower: A cycle-accurate energy estimation tool. In: Proc. of 37th DAC, Los Angeles, California, pp. 340–345 (2000)
12. Intel Corporation: Intel 80200 Processor based on Intel XScale Microarchitecture: Developer's Manual. Order Number: 273411-003 (March 2003)
13. Intel Corporation: High Performance Memory Controller for the Intel 80200 Processor. Order Number: 273494-001 (March 2001)
14. Department of Software Engineering, University of Szeged: GCC code-size benchmark environment (CSiBE), http://www.csibe.org/
15. Wasabi Systems Inc.: Wasabi Systems GNU tools version 031121 for Intel XScale microarchitecture, http://www.intel.com/design/intelxscale/dev_tools/031121/wasabi_031121.htm
Low Power Elliptic Curve Cryptography

Maurice Keller and William Marnane

Dept. of Electrical and Electronic Engineering, University College Cork, Cork City, Ireland
{mauricek,liam}@rennes.ucc.ie
Abstract. The designer of an elliptic curve processor is faced with many design choices that include the algorithm and coordinate system to be used. The power consumption of elliptic curve processors is becoming increasingly important as such processors find new uses in power constrained environments. This paper studies the effect that algorithm and coordinate choices have on the power consumption and energy per point multiplication of an FPGA based, reconfigurable elliptic curve processor.
1 Introduction
Elliptic curve cryptography has become a popular choice for implementing public key cryptosystems due to its ability to offer a level of security similar to traditional systems, such as RSA, but with smaller key lengths [1]. The smaller key length results in smaller memory and bandwidth requirements for elliptic curve cryptosystems. This makes them suitable for use in resource-constrained environments which require security protocols, such as wireless sensor networks.

Elliptic curve point scalar multiplication is the computation of Q = [k]P, where Q and P are points on the elliptic curve E defined over GF(p^m) and k is a scalar. The security of elliptic curve cryptosystems is based on the elliptic curve discrete logarithm problem (ECDLP): the problem of determining k given P and Q. This problem is deemed computationally infeasible for an appropriate choice of system parameters such as those defined in [2]. The efficiency of an elliptic curve cryptosystem depends on the efficiency with which Q = [k]P can be performed. Since elliptic curves were first proposed for use in cryptography by Koblitz [3] and Miller [4], many different algorithms for computing Q = [k]P have appeared in the literature. It is also possible to use different coordinate systems to represent an elliptic curve point P. The choice of coordinates affects the performance of an elliptic curve cryptosystem. The finite field over which the elliptic curve is defined also affects the efficiency of the system. Finite fields of characteristic 2 (p = 2) are a popular choice, as arithmetic over these fields can be performed efficiently in hardware.
This research was funded by the Embark Initiative Postgraduate Research Scholarship Scheme from the Irish Research Council for Science, Engineering and Technology (IRCSET).
N. Azemard and L. Svensson (Eds.): PATMOS 2007, LNCS 4644, pp. 310–319, 2007. c Springer-Verlag Berlin Heidelberg 2007
Until recently most research on elliptic curve implementations focused on minimising the time per point multiplication. Recently, however, several implementations aimed at minimising power consumption for applications in constrained environments have appeared [5,6]. An ASIC based low power elliptic curve digital signature chip was presented in [7]. All of these ASIC based architectures use one particular algorithm and coordinate system to perform the point scalar multiplication. In [8] a low power elliptic curve processor based on an Atmel 8-bit microprocessor and a dedicated ASIC coprocessor to perform field multiplications is presented. This work studies the effect that algorithm and coordinate choice have on the power consumption of a reconfigurable GF (2m ) elliptic curve processor implemented on an FPGA, in order to identify the set of choices which minimises the power and energy consumption of the processor.
2 Characteristic 2 Finite Fields
The finite field GF(2^m) consists of 2^m elements. In this work a polynomial basis is used to represent elements of GF(2^m), i.e., a = Σ_{i=0}^{m-1} a_i x^i ∈ GF(2^m), a_i ∈ {0, 1}. Addition of two elements is simply a bitwise XOR of the elements. Multiplication, squaring and division are performed modulo an irreducible polynomial f(x). The field is closed under the operations of addition, multiplication, squaring and division. The security of an elliptic curve cryptosystem depends on the size of the underlying field GF(2^m), i.e., the size of m. For this work the field size m = 163 is used, which is considered equivalent to 1024-bit RSA [9]. However, the processor presented in Section 4.2 can be used for any field size m.
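These field operations map directly to bit operations. A minimal software sketch for GF(2^163), assuming the NIST B-163/K-163 reduction polynomial f(x) = x^163 + x^7 + x^6 + x^3 + 1 (the paper does not state its f(x)) and Fermat's theorem for inversion:

```python
M = 163
# Reduction polynomial x^163 + x^7 + x^6 + x^3 + 1 (the NIST B-163/K-163
# polynomial; an assumption here, since the text does not state f(x)).
F = (1 << 163) | (1 << 7) | (1 << 6) | (1 << 3) | 1

def gf_add(a, b):
    return a ^ b                      # addition is bitwise XOR

def gf_mul(a, b):
    """Shift-and-add multiplication with interleaved reduction mod f(x)."""
    r = 0
    while b:
        if b & 1:
            r ^= a
        a <<= 1
        if (a >> M) & 1:              # degree reached m: reduce by f(x)
            a ^= F
        b >>= 1
    return r

def gf_inv(a):
    """Inversion via Fermat: a^(2^m - 2), by square-and-multiply."""
    result, e = 1, (1 << M) - 2
    while e:
        if e & 1:
            result = gf_mul(result, a)
        a = gf_mul(a, a)
        e >>= 1
    return result

x = 0x2F3A9C5D1E                       # an arbitrary nonzero element
assert gf_add(x, x) == 0               # every element is its own additive inverse
assert gf_mul(x, gf_inv(x)) == 1       # x * x^-1 = 1
```

Note that this generic square-and-multiply inversion uses far more multiplications than the optimised 9M + 162S chain quoted later for GF(2^163); it is only meant to show the arithmetic.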
3 Elliptic Curves
A non-supersingular elliptic curve over GF(2^m) is defined by the equation:

E(GF(2^m)): y^2 + xy = x^3 + ax^2 + b,  b ≠ 0    (1)
An elliptic curve point P ∈ E is defined as a pair (x, y) of elements of GF(2^m) which satisfies equation (1). The points on E form a commutative finite group under the point addition operation. The special point ϕ, known as the point at infinity, is the additive identity of the group. Addition of two points on E is performed using the well-known "chord and tangent" process [1]. The underlying operations used to perform point addition are GF(2^m) arithmetic operations. Point doubling is a special case of point addition where the two input points are the same. Elliptic curve point scalar multiplication is performed by repeated point additions and doublings, i.e., Q = [k]P = P + P + ... + P.

3.1 Coordinate System
Elliptic curve points can be represented using different coordinate systems. Affine coordinates, as seen in the previous section, represent each point using two
GF(2^m) elements. Addition and doubling of affine points can be described by a set of algebraic equations [1]. Using these equations, point addition can be performed in 1M + 1D + 1S + 8A (where M, D, S, A refer to GF(2^m) multiplication, division, squaring and addition, respectively). Similarly, a point doubling can be performed in 1M + 1D + 1S + 6A. Addition is simply a bitwise XOR of the two operands and can be implemented virtually for free (one clock cycle) in hardware relative to the other operations. Therefore the additions are neglected when discussing the cost of point operations.

A projective elliptic curve point P = (X, Y, Z) consists of three elements of GF(2^m). To convert the affine point (x, y) to projective coordinates one simply sets Z = 1, i.e., (x, y, 1). There are several different types of projective coordinates, and they differ in how the projective point maps to an affine point. In an elliptic curve cryptosystem the result of a point scalar multiplication will usually need to be transmitted between two parties. As affine points are represented in 2m bits (and can be compressed further [10]) they are preferred for transmission. Therefore, projective coordinates are generally used for internal computations, but the resultant projective point is converted to an affine point before being transmitted.

A Jacobian projective point P = (X, Y, Z) maps to the affine point P = (X/Z^2, Y/Z^3). Algebraic equations for Jacobian point addition and doubling can be found in [1]. Jacobian point addition and doubling can be performed in 15M + 5S and 5M + 5S, respectively. The cost of converting a Jacobian point to an affine point is 3M + 1D + 1S. When performing an elliptic curve point scalar multiplication it can be assumed that the input point P will always be in affine coordinates. Therefore, when converted to Jacobian coordinates, it will have Z = 1.
In the special case that one of the input points to a Jacobian point addition has Z1 = 1, the cost of addition can be reduced to 11M + 4S. A further improvement can be obtained for the special case that the elliptic curve parameter a = 1. This saves 1M from the cost of Jacobian addition.

A Lopez-Dahab projective point P = (X, Y, Z) maps to the affine point P = (X/Z, Y/Z^2). This coordinate system and the corresponding equations for point addition and doubling are defined in [11]. Point addition and doubling cost 10M + 4S and 5M + 5S in the case Z1 = 1. The cost of conversion back to affine coordinates is 2M + 1D + 1S. The special case of a = 0 or 1 saves 1M for both point addition and doubling. Using the formula described in [12], a further 1M can be saved from the point addition at the cost of 1S. For this work the NIST pseudo-random curve over GF(2^163) [2] was used; therefore a = 1. Table 1 summarises the cost of elliptic curve point addition and doubling for the various coordinate systems.

For both types of projective coordinates described here, only one division is required at the end of a point scalar multiplication, to convert the result back to affine coordinates. In Appendix A.4.4 of [10] an algorithm for computing the inverse of an element of GF(2^m) using repeated multiplications and squarings is described. For the field GF(2^163) the inversion can be implemented in 9M and 162S. In a hardware elliptic curve processor using projective coordinates this
Low Power Elliptic Curve Cryptography
313
Table 1. Cost of Point Operations (a = 1)

Coordinates        | Addition  | Doubling  | Conversion | Conv. no divider
Affine             | 1M+1D+1S  | 1M+1D+1S  | -          | -
Jacobian (Z1 = 1)  | 10M+4S    | 5M+5S     | 3M+1D+1S   | 12M+163S
Lopez-Dahab        | 8M+5S     | 4M+5S     | 2M+1D+1S   | 11M+163S
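The field operations counted above can be sketched in software to make the cost model concrete. The following Python sketch is an illustration, not the paper's hardware architecture: addition is XOR, multiplication is shift-and-add with interleaved reduction, and inversion uses Fermat's method (x^(-1) = x^(2^m - 2)) via repeated multiplications and squarings, in the spirit of Appendix A.4.4 of [10] (the schedule there achieves 9M + 162S; the plain square-and-multiply below is the unoptimised variant). The NIST GF(2^163) reduction polynomial x^163 + x^7 + x^6 + x^3 + 1 is assumed.

```python
M = 163
F = (1 << 163) | (1 << 7) | (1 << 6) | (1 << 3) | 1  # NIST reduction polynomial

def gf_add(a, b):
    # GF(2^m) addition (A): bitwise XOR, essentially free in hardware
    return a ^ b

def gf_mul(a, b):
    # GF(2^m) multiplication (M): shift-and-add with interleaved reduction
    r = 0
    while b:
        if b & 1:
            r ^= a
        a <<= 1
        if (a >> M) & 1:
            a ^= F          # reduce: clear bit m, fold in the low-order terms
        b >>= 1
    return r

def gf_sqr(a):
    # squaring (S), here realised as a special case of multiplication
    return gf_mul(a, a)

def gf_inv(a):
    # Fermat inversion: a^(2^m - 2), i.e. repeated squarings and multiplications
    r, e = 1, (1 << M) - 2
    while e:
        if e & 1:
            r = gf_mul(r, a)
        a = gf_sqr(a)
        e >>= 1
    return r
```

A division (D) is then gf_mul(x, gf_inv(y)), which is why reducing a point multiplication to a single final division pays off.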
Algorithm 1. Binary Double and Add Point Multiplication
Input: P ∈ E(GF(2^m)), k = Σ_{i=0}^{t} k_i 2^i
Output: [k]P
Initialise:
1. Q = P
Run:
2. for i = t − 1 downto 0 do
3.   Q = [2]Q            /* Point Double */
4.   if (k_i = 1) then
5.     Q = Q + P         /* Point Addition */
6.   end if
7. end for
Return:
8. Q = [k]P
algorithm can be utilised to remove the need for a divider, hence reducing the area footprint of the processor. 3.2
Point Scalar Multiplication Algorithms
Computing Q = [k]P is a special case of the general problem of exponentiation in Abelian groups and of the shortest addition chain problem for integers. This problem can be stated as: starting from the integer 1, and at each step computing the sum of two previous results, what is the minimum number of steps required to reach k? The simplest method for computing [k]P is the binary double and add method [1], presented in Algorithm 1. Algorithm 1 iterates through the binary expansion of k. On each iteration a point doubling is performed; if k_i = 1, a point addition is also performed. In general the binary expansion of k will have approximately m bits, and on average half of these bits will be equal to one. Therefore the cost of a point scalar multiplication using Algorithm 1 is given in equation (2).

N_binary = (m − 1)N_double + (m/2 − 1)N_add    (2)
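The structure of Algorithm 1 can be sketched independently of the curve arithmetic. In the sketch below (an illustration, not the paper's implementation) the group law is passed in as double/add callbacks, and a toy additive group of integers modulo a prime stands in for the elliptic curve so the result can be checked against ordinary multiplication; the operation counters reproduce the (m − 1) doublings and roughly m/2 additions behind equation (2).

```python
def double_and_add(k, P, double, add):
    # Algorithm 1: scan the binary expansion of k from the MSB down
    bits = bin(k)[2:]                        # k >= 1, so the leading bit is 1
    Q, n_dbl, n_add = P, 0, 0
    for b in bits[1:]:
        Q, n_dbl = double(Q), n_dbl + 1      # a point double on every iteration
        if b == '1':
            Q, n_add = add(Q, P), n_add + 1  # a point addition when k_i = 1
    return Q, n_dbl, n_add

# toy stand-in group: integers modulo 1009 under addition, so [k]P = k*P mod 1009
p = 1009
Q, n_dbl, n_add = double_and_add(200, 7, lambda x: 2 * x % p,
                                 lambda x, y: (x + y) % p)
assert Q == 200 * 7 % p
assert n_dbl == (200).bit_length() - 1       # (t) doublings for a (t+1)-bit k
assert n_add == bin(200).count('1') - 1      # one addition per non-leading 1-bit
```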
The basic binary double and add algorithm can be improved upon by using an addition-subtraction algorithm. In this case a signed digit expansion k = Σ_{i=0}^{l} s_i 2^i, with s_i ∈ {−1, 0, 1}, is used. Non-adjacent form (NAF) is a signed digit representation in which there are no adjacent non-zero digits. The NAF has the fewest non-zero coefficients of any signed binary expansion of k. The NAF of a non-negative integer given in binary representation can be computed using
314

M. Keller and W. Marnane

Algorithm 2. NAF Addition-Subtraction Chain Point Multiplication
Input: P ∈ E(GF(2^m)), k = Σ_{i=0}^{l} s_i 2^i
Output: [k]P
Initialise:
1. Q = P
Run:
2. for i = l − 1 downto 0 do
3.   Q = [2]Q            /* Point Double */
4.   if (s_i = 1) then
5.     Q = Q + P         /* Point Addition */
6.   else if (s_i = −1) then
7.     Q = Q − P         /* Point Subtraction */
8.   end if
9. end for
Return:
10. Q = [k]P
Algorithm IV.5 in [1]. It was shown in [13] that the expected weight of an NAF of length l is l/3. Algorithm 2 presents the point scalar multiplication algorithm based on the NAF expansion of k. When s_i = −1 an elliptic curve point subtraction is performed. Point subtraction is performed by adding the negative of the point, i.e. P0 − P1 = P0 + (−P1). The negative of an affine elliptic curve point P = (x, y) is given by −P = (x, x + y). The negative of a projective point P = (X, Y, Z) is given by −P = (X, XZ + Y, Z). The cost of a point scalar multiplication using Algorithm 2 is given in equation (3).

N_NAF = (l − 1)N_double + (l/3 − 1)N_add    (3)
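The NAF itself is easy to compute digit by digit. The sketch below is a standard construction (not transcribed from Algorithm IV.5 of [1]); it outputs the signed digits least-significant first, and both defining properties can be checked directly: the digits re-encode k, and no two adjacent digits are non-zero.

```python
def naf(k):
    # non-adjacent form of k, least significant digit first
    digits = []
    while k > 0:
        if k & 1:
            d = 2 - (k % 4)   # pick +1 or -1 so that (k - d) is divisible by 4
            k -= d
        else:
            d = 0
        digits.append(d)
        k >>= 1
    return digits

d = naf(478)
assert sum(di * (1 << i) for i, di in enumerate(d)) == 478        # re-encodes k
assert all(d[i] == 0 or d[i + 1] == 0 for i in range(len(d) - 1))  # non-adjacent
```

Over many inputs roughly one third of the digits are non-zero, which is the l/3 expected weight quoted above from [13].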
Montgomery [14] proposed a different method for computing [k]P which maintains the relationship P2 − P1 as an invariant. Montgomery's method is presented in Algorithm 3. Algorithm 3 performs an addition and a doubling on each iteration of the loop, regardless of the value of k_i. The advantage of Montgomery's method is that it operates only on the x coordinate in affine coordinates, and on the X and Z coordinates in projective coordinates. After the loop, the y coordinate of the resultant point can be computed. The cost of Montgomery's method is given in equation (4). Algebraic equations for implementing Montgomery's method in both affine and Lopez-Dahab projective coordinates are given in [15].

N_Montgomery = N_double + (m − 1)N_loop + N_compute_y    (4)

4 Hardware Implementation

4.1 GF(2^m) Components
As described in Section 2, GF(2^m) addition is simply a bitwise XOR of the elements and so can be implemented in hardware using m XOR gates with a latency of one XOR gate delay. GF(2^m) multiplication is implemented using the digit-serial algorithm described in [16]. This architecture performs a multiplication
Algorithm 3. Montgomery Point Multiplication
Input: P ∈ E(GF(2^m)), k = Σ_{i=0}^{t} k_i 2^i
Output: [k]P
Initialise:
1. P1 = P; P2 = [2]P
Run:
2. for i = t − 1 downto 0 do
3.   if (k_i = 1) then
4.     P1 = P1 + P2; P2 = [2]P2
5.   else
6.     P2 = P2 + P1; P1 = [2]P1
7.   end if
8. end for
9. compute y1
Return:
10. Q = P1 = [k]P
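The control structure of Algorithm 3 can again be sketched over an abstract group (an illustration only; the x-coordinate-only formulas of [15] are not shown). The loop maintains the invariant P2 − P1 = P: after processing a bit prefix with value v, P1 = [v]P and P2 = [v + 1]P. A toy additive group modulo a prime stands in for the curve so the result can be checked.

```python
def montgomery_ladder(k, P, double, add):
    # Algorithm 3: one addition and one doubling per bit, regardless of k_i
    bits = bin(k)[2:]
    P1, P2 = P, double(P)              # initialise: P1 = P, P2 = [2]P
    for b in bits[1:]:
        if b == '1':
            P1, P2 = add(P1, P2), double(P2)
        else:
            P2, P1 = add(P2, P1), double(P1)
    return P1                          # P1 = [k]P; y is recovered afterwards

# toy stand-in group: integers modulo 1009 under addition
p = 1009
assert montgomery_ladder(200, 7, lambda x: 2 * x % p,
                         lambda x, y: (x + y) % p) == 200 * 7 % p
```

Because both branches perform the same pair of operations, the work per iteration is independent of k_i, which is also what makes the ladder attractive against simple power analysis.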
in n = ⌈m/d⌉ clock cycles, where d is the variable digit size of the architecture. A bit-parallel squaring architecture was used, which has a purely combinational delay and computes the result in one clock cycle. Division is implemented using the division architecture described by Shantz in [17]; it computes the result in 2m clock cycles.

4.2
Reconfigurable Elliptic Curve Processor
Figure 1 shows the reconfigurable elliptic curve processor that was designed to implement the various algorithms described in Section 3 for any field size m. The processor is based on one each of the four GF (2m ) computational units described in the previous section. The processor also contains two storage elements to store the m-bit inputs, outputs and variables required during the computation of [k]P . The computational units and memory are connected by an m-bit bidirectional data bus. To minimise power consumption each computational unit is only enabled when it is required to perform a calculation. The processor is controlled by a Finite State Machine (FSM), counter and ROM. The ROM contains a sequence of 32-bit instructions which control the data transfer between the computational units and memory to perform the required operations to implement elliptic curve point addition and doubling, and conversion from projective to affine coordinates (if necessary). The FSM enables the counter to iterate through the ROM instructions. The counter contains several load pins. These allow the state machine to jump to various predefined locations within the ROM instruction set. This means that the ROM need only contain one set of instructions for doubling, adding and converting a point. The master counter tracks how many bits of the scalar k have been processed and indicates when the loop in the point scalar multiplication algorithm has been completed. This processor architecture can be used to implement the various coordinates and algorithms described in Sections 3.1 and 3.2. The only changes required are to the instructions contained in ROM and to the FSM controller. The number of m-bit memory locations required also varies depending on the algorithm used.
Fig. 1. Reconfigurable GF(2^m) Elliptic Curve Processor
Table 2. Clock Cycles Per Stage

Algorithm              | N_Setup | N_Double / N_Loop | N_Add    | N_convert
Binary Affine          | 10      | 2m+n+44           | 2m+n+48  | -
Binary Jacobian        | 12      | 5n+70             | 10n+138  | 2m+3n+41
Binary Jac no div      | 12      | 5n+70             | 10n+138  | 12n+823
Montgomery Affine      | 2m+16   | 4m+28 (loop)      | -        | 2m+2n+39
Montgomery Lopez-Dahab | 20      | 6n+79 (loop)      | -        | 10n+111, D ≤ 3; 5n+2m+58, D > 3
Montgomery L-D no div  | 20      | 6n+79 (loop)      | -        | 19n+905
NAF Affine             | 16      | 2m+n+44           | 2m+n+48  | -
NAF Jacobian           | 18      | 5n+70             | 10n+138  | 2m+3n+40
NAF Jac no div         | 18      | 5n+70             | 10n+138  | 12n+819
Table 2 details the clock cycles required for each stage of the computation for the various algorithm and coordinate combinations implemented. As mentioned in Section 3.1, only one division is required when using projective coordinates. To this end, for each of the algorithms using projective coordinates two processors were implemented, one with and one without a GF (2m ) divider. In the case of no divider the division is implemented using repeated multiplications and squarings. The total number of clock cycles given in Table 2 includes clock cycles for calculation as well as control overheads such as data transfer between the components. At the beginning of each computation of [k]P a certain number of clock cycles are required to setup and initialise the processor, hence the setup stage (Nsetup ). The convert stage is required to convert the projective result into affine coordinates and, in the case of the processors based on the Montgomery method, to compute the y coordinate of the result.
5 Results
Each of the algorithms listed in Table 2 was implemented using the architecture described in the previous section. Two different digit sizes of the underlying GF(2^m) multiplier were used. A digit size of d = 1 means that a GF(2^m) division costs the same as two GF(2^m) multiplications. A digit size of d = 16, giving a D/M ratio of thirty-two, was also implemented. The designs were prototyped on a Xilinx Spartan3L xc3s1000l-ft256-4 FPGA using Xilinx ISE 9.1.01i. The memory required for each processor was implemented using the FPGA Block RAM resources. The power consumption of the elliptic curve processor for each choice of algorithm was measured using XPower. The minimum post place-and-route clock frequency reported over all the designs and both digit sizes was 80 MHz; therefore all the following results are for a clock frequency of 80 MHz. The results also assume an average value for the scalar k.
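The role of the digit size d can be made concrete with a toy software model of a digit-serial multiplier (a sketch, not the architecture of [16]): each "clock" consumes d bits of the multiplier, so a product takes n = ⌈m/d⌉ steps, while the divider always needs 2m cycles, so the D/M ratio grows with d. A small field GF(2^7) keeps the example short; the field, polynomial and digit size here are illustrative choices.

```python
M, F, D = 7, (1 << 7) | (1 << 1) | 1, 2   # toy field GF(2^7), f = x^7 + x + 1, d = 2

def reduce(x):
    # reduce a polynomial of degree >= m modulo f
    for i in range(x.bit_length() - 1, M - 1, -1):
        if (x >> i) & 1:
            x ^= F << (i - M)
    return x

def gf_mul_digit_serial(a, b):
    # consume d bits of b per step; returns (product, number of steps taken)
    r, steps = 0, 0
    while b:
        digit = b & ((1 << D) - 1)
        for i in range(D):                 # small d-bit partial product
            if (digit >> i) & 1:
                r ^= a << i
        r = reduce(r)
        a = reduce(a << D)                 # advance a by d positions, keep it reduced
        b >>= D
        steps += 1
    return r, steps

prod, steps = gf_mul_digit_serial(0b1011011, 0b1100101)
assert steps == 4                          # ceil(7 / 2) = 4 "clock cycles"
```

Doubling d halves the multiplication time at the cost of wider partial-product hardware, which is exactly the area/time/energy trade-off explored in the figures below.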
Fig. 2. Power Consumption of Various Algorithms
Fig. 3. Point Multiplication Time of Various Algorithms
Figure 2 illustrates the power consumption of the various algorithms. Note that these power values are intended to compare the effect of algorithm choice on power consumption; the architecture has not been optimised to minimise power consumption. The minimum power consumption for a digit size of one was found to be for the Montgomery algorithm with Lopez-Dahab coordinates and no divider. For a digit size of sixteen the minimum power is reported for the binary algorithm using Jacobian coordinates and no divider. Removing the divider from the circuit reduces the power consumption of all the projective algorithms implemented, due to the reduced circuit area; the penalty is a slightly increased computation time, as seen in Figure 3. All of the algorithms implemented have different times per point multiplication. Therefore it is relevant to discuss the energy expended per point multiplication, as illustrated in Figure 4. It can be seen that increasing the digit size from
Fig. 4. Energy Usage of Various Algorithms per Point Multiplication
Fig. 5. Area of Various Algorithms
one to sixteen reduces the energy per point multiplication. This occurs because increasing the digit size reduces the computation time. The exception is the Montgomery affine processor, which has virtually identical computation time and energy consumption for both digit sizes; this is because no multiplications are required in the dominant loop section of the algorithm (see Table 2). The minimum energy for a digit size of one is 0.288 mJ, reported for the Montgomery affine processor. For d = 16 the minimum energy is 0.055 mJ, reported for the Montgomery Lopez-Dahab processor with no divider. These results coincide with the fastest processors for both digit sizes. This indicates that while power consumption does not vary dramatically between the algorithms, the time per point multiplication does. Therefore the algorithms with minimum computation time tend to have the minimum energy per point multiplication when implemented on an FPGA.
6
Conclusions
A reconfigurable GF (2m ) elliptic curve processor architecture was presented in this paper. The architecture was used to implement several different algorithms and coordinates for elliptic curve point scalar multiplication. The effect of algorithm and coordinate choice on power and energy consumption of the processor implemented on an FPGA was then investigated. It was found that for small D/M ratio the Montgomery method with affine coordinates uses the least energy per point multiplication. For large D/M ratio the Montgomery method with Lopez-Dahab projective coordinates minimises the energy required to perform a point multiplication. Future work includes examining the energy consumption of additional methods for improving the performance of the point scalar multiplication. The architecture could also be optimised to minimise the absolute power and energy consumptions.
Low Power Elliptic Curve Cryptography
319
References
1. Blake, I.F., Seroussi, G., Smart, N.P.: Elliptic Curves in Cryptography. London Mathematical Society Lecture Note Series 265, Cambridge University Press (1999)
2. National Institute of Standards and Technology (NIST): Recommended elliptic curves for federal government use. NIST Special Publication (1999)
3. Koblitz, N.: Elliptic curve cryptosystems. Mathematics of Computation 48, 203–209 (1987)
4. Miller, V.: Uses of elliptic curves in cryptography. In: Williams, H.C. (ed.) CRYPTO 1985. LNCS, vol. 218, pp. 417–426. Springer, Heidelberg (1986)
5. Ozturk, E., Sunar, B., Savas, E.: Low-power elliptic curve cryptography using scaled modular arithmetic. In: Joye, M., Quisquater, J.-J. (eds.) CHES 2004. LNCS, vol. 3156, pp. 107–118. Springer, Heidelberg (2004)
6. Batina, L., Guajardo, J., Kerins, T., Mentens, N., Tuyls, P., Verbauwhede, I.: An elliptic curve processor suitable for RFID-tags. Cryptology ePrint Archive, Report 2006/227 (2006)
7. Schroeppel, R., Beaver, C.L., Gonzales, R., Miller, R., Draelos, T.: A low-power design for an elliptic curve digital signature chip. In: Kaliski Jr., B.S., Koç, Ç.K., Paar, C. (eds.) CHES 2002. LNCS, vol. 2523, pp. 366–380. Springer, Heidelberg (2003)
8. Bertoni, G., Breveglieri, L., Venturi, M.: Power aware design of an elliptic curve coprocessor for 8-bit platforms. In: Proceedings of the Fourth Annual IEEE International Conference on Pervasive Computing and Communications Workshops (PERCOMW'06), pp. 337–341. IEEE, Los Alamitos (2006)
9. National Institute of Standards and Technology (NIST): Recommendation for key management, part 1: general (revised). NIST Special Publication 800-57 (2006)
10. IEEE P1363: Standard Specifications for Public Key Cryptography. IEEE Std 1363-2000 (2000)
11. Lopez, J., Dahab, R.: Improved algorithms for elliptic curve arithmetic in GF(2^n). In: Tavares, S., Meijer, H. (eds.) SAC 1998. LNCS, vol. 1556, pp. 201–212. Springer, Heidelberg (1999)
12. Al-Daoud, E., Mahmod, R., Rushdan, M., Kilicman, A.: A new addition formula for elliptic curves over GF(2^n). IEEE Transactions on Computers 51(8), 972–975 (2002)
13. Morain, F., Olivos, J.: Speeding up the computations on an elliptic curve using addition-subtraction chains. Theoretical Informatics and Applications 24, 531–543 (1990)
14. Montgomery, P.: Speeding the Pollard and elliptic curve methods of factorisation. Mathematics of Computation 48, 243–264 (1987)
15. Lopez, J., Dahab, R.: Fast multiplication on elliptic curves over GF(2^m) without precomputation. In: Koç, Ç.K., Paar, C. (eds.) CHES 1999. LNCS, vol. 1717, pp. 316–327. Springer, Heidelberg (1999)
16. Song, L., Parhi, K.: Low energy digit-serial/parallel finite field multipliers. Kluwer Journal of VLSI Signal Processing Systems 19(2), 149–166 (1998)
17. Shantz, S.C.: From Euclid's GCD to Montgomery multiplication to the great divide. Technical Report TR-2001-95, Sun Microsystems (2001)
Design and Test of Self-checking Asynchronous Control Circuit Jian Ruan, Zhiying Wang, Kui Dai, and Yong Li School of Computer, National University of Defense Technology Changsha, Hunan, P.R. China
Abstract. The application of asynchronous circuits has been greatly restricted by the lack of effective test technologies. Exploiting the self-checking property of asynchronous control circuits offers a good way to address this problem. In this paper, we put forward an improved, fail-stop David Cell, describe a way of designing self-checking asynchronous control circuits by the direct mapping technique, and propose a testing method for single stuck-at faults. The results show that the self-checking circuit can be tested at normal operating speed and that the area overhead is acceptable.
1
Introduction
Asynchronous circuits promise a number of advantages over synchronous ones: they are free of clock skew, can be designed for average-case performance, can potentially power down unused circuitry, and have a higher degree of modularity [4,8]. As asynchronous circuits become larger and start to be used in commercial products, it is necessary to develop effective testing techniques, at least as successful as those currently available for synchronous circuits. The testing of asynchronous circuits is more complex than that of synchronous ones; the major factors are [6]:
– the absence of a global synchronisation clock drastically decreases the amount of test control over the circuit, as it cannot easily be "single stepped" through a sequence of states, a common way to test synchronous circuits;
– the presence of a large number of state-holding elements in an asynchronous circuit makes the generation of tests harder or even impossible, and design techniques to ease testing carry a higher area overhead;
– an asynchronous circuit may have hazards or races when faulty, which are difficult to detect.
However, the self-synchronous nature of asynchronous circuits yields a self-checking property: the circuit can detect particular types of faults during its own operation [1]. Martin [10] demonstrated that a single stuck-at fault in a nonredundant delay-insensitive circuit causes a transition either not to occur or an output to become enabled in an illegal state. Varshavsky [16] showed that semimodular circuits
N. Azemard and L. Svensson (Eds.): PATMOS 2007, LNCS 4644, pp. 320–329, 2007.
© Springer-Verlag Berlin Heidelberg 2007
Design and Test of Self-checking Asynchronous Control Circuit
321
are totally self-checking with respect to output stuck-at faults (OSAFs), if the faults do not cause the circuit to enter an invalid state (exitory faults) or an alternating cycle of valid states (substitutional faults). Beerel [1] related the semimodular circuit testability results to speed-independent circuits and extended the study of OSAFs to certain input stuck-at faults (ISAFs). Liebelt [9] showed that in a strongly connected atomic-gate implementation of a live semimodular specification state graph, all exitory multiple output stuck-at faults of the gates feeding the state variables will cause the circuit to halt if the next-state values for invalid states are appropriately assigned. Based on the impressive results described above, we bring forward a way of designing self-checking asynchronous control circuits to facilitate the testing of single stuck-at faults. The remainder of the paper is organised as follows. Section 2 is devoted to the formal model for asynchronous control circuits and the basics of the direct mapping technique. Section 3 presents an improved fail-stop David Cell (FSDC) [3] and analyses the self-checking property of the linear FSDC chain for single stuck-at faults. In Section 4, we study the method of designing and testing self-checking asynchronous control circuits, including linear, fork, join, choice and merge structures. Finally, Section 5 offers some conclusions and directions for future work.
2
Background
In this section we first introduce the formal model used to specify the behavior of asynchronous control circuits. Secondly, the basics of the direct mapping technique are presented briefly. 2.1
Labeled Petri Net
In this paper we use the labeled Petri Net to capture the concurrent behavior of asynchronous control circuits. Traditionally, a Petri Net is defined as a tuple Σ = ⟨P, T, F, M0⟩ comprising finite disjoint sets of places P and transitions T, a flow relation F ⊆ (P × T) ∪ (T × P), and an initial marking M0. There is an arc between x and y if and only if (x, y) ∈ F. The preset of a node x is defined as •x = {y | (y, x) ∈ F} and the postset as x• = {y | (x, y) ∈ F}. A marking is a mapping M : P → {0, 1}. It is assumed that •t ≠ ∅ ≠ t• for every transition t ∈ T. The evolution of a Petri Net Σ from the initial marking M0 to a marking Mn by executing transitions results in a firing sequence σ = t1 t2 · · · tn, where each ti is such that Mi−1 [ti⟩ Mi, for i = 1, · · · , n.¹ An extension of the Petri Net model is the contextual net [11]. It uses additional elements such as non-consuming arcs, which only control the enabling of a transition and do not consume tokens. A Petri Net Σ extended with a type of non-consuming arcs, namely read-arcs, is defined as PN = ⟨P, T, F, R, M0⟩. The set of read-arcs R is defined as R ⊆ (P × T); there is a read-arc between place p and transition t if and only if (p, t) ∈ R.
¹ Mn is called a reachable marking.
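The token-game semantics just defined, including read-arcs, can be captured in a few lines. The sketch below is an illustration for 1-safe nets with set-valued markings: enabling is checked against both the preset and the read-arc places, but firing only consumes from the preset.

```python
class PetriNet:
    """1-safe Petri net with read-arcs; a marking is a frozenset of places."""
    def __init__(self):
        self.pre, self.post, self.read = {}, {}, {}

    def add_transition(self, t, pre, post, read=()):
        self.pre[t], self.post[t], self.read[t] = set(pre), set(post), set(read)

    def enabled(self, t, marking):
        # a transition needs tokens in its preset AND in its read-arc places
        return self.pre[t] <= marking and self.read[t] <= marking

    def fire(self, t, marking):
        assert self.enabled(t, marking)
        # consume the preset, produce the postset; read-arc tokens are untouched
        return frozenset((marking - self.pre[t]) | self.post[t])

net = PetriNet()
net.add_transition('T0', pre={'P0'}, post={'P1'}, read={'Pc'})
M0 = frozenset({'P0', 'Pc'})
M1 = net.fire('T0', M0)
assert M1 == frozenset({'P1', 'Pc'})              # Pc was only tested, not consumed
assert not net.enabled('T0', frozenset({'P0'}))   # empty read-arc place disables T0
```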
322
J. Ruan et al.
A Labeled Petri Net (LPN) is a PN whose transitions are associated with a labeling function λ, i.e. LPN = ⟨P, T, F, R, M0, λ⟩ [17]. 2.2
Direct Mapping
To overcome the state explosion problem, direct mapping techniques, whose complexity depends linearly on the size of the specification, have been proposed. The direct mapping technique has a long history originating from Huffman's work, where a method of one-relay-per-row realization of asynchronous sequential circuits was introduced. This approach has been further developed by many researchers, such as Unger [15], Hollaar [5], Kishinevsky [7] and the Asynchronous Group at the University of Newcastle upon Tyne [2,13,14]. The main idea of the direct mapping technique is that a graph specification of a circuit can be translated directly, without computationally hard transformations, into the circuit netlist, in such a way that graph nodes correspond to circuit elements and graph arcs correspond to interconnects [12]. In this method, every place of a labeled Petri net is associated with a David cell. Figure 1(a) shows the schematic of a gate-level DC which uses logic 1 as active high; its STG for a token moving from place p1 to place p2 is shown in Figure 1(b).
0
r1−
0 a1
a 0 r1 0
0
r
a+ s−
(a) Schematic
r+
a1+
p1 r1+
p2 a1−
a− s+ r−
(b) STG
Fig. 1. David Cell
The request and acknowledgement functions of each DC are generated from the structure of the LPN, as shown in Figure 2 [14]. The request function of the DC is shown in its top-left corner, and the acknowledge function in the bottom-right corner.
3
Fail-Stop David Cell
The DC in Figure 1(a) consists of NOR gates only, whereas two redundant OR gates are added in its improved counterpart, shown in Figure 3. In terms of the number of transistors required, the area overhead is 33.3%. Detailed fault analysis shows that a single SAF will either cause the improved DC to halt or have no influence on its behavior. In the following, we analyse the self-testability of a linear FSDC chain for single stuck-at faults. Table 1 shows the position of each single stuck-at fault (column 2) and the behavior at the interface of the faulty linear FSDC chain (column 3). According to Table 1,
Fig. 2. Mapping of a Labelled PN place into a DC
Fig. 3. Improved, Fail-stop David Cell
we may say that the linear FSDC chain is totally self-checking, except for fault #16, which causes self-excitation of the circuit.
4
Design and Test of Self-checking Asynchronous Circuits
In this section, the technique of designing and testing self-checking asynchronous control circuits based on direct mapping is presented. Linear, fork/join and choice/merge structures are studied in turn.² 4.1
Linear Structure
The linear structure is the simplest and the easiest to test: all single stuck-at faults can be self-detected. Consider Figure 4, an example with 3 places and 3 transitions. The dashed read-arcs and places represent the interface between the control and data paths. Signals z_ack and z_req compose the handshake interface to the environment. When set, signal z_req means that the computation is complete and the output data is ready to be consumed, while signal z_ack is set when the output of the previous computation cycle has been consumed and new input data is ready to be processed. Associating each place with an FSDC, we obtain the structure illustrated in Figure 5(a).
² For convenience, the asynchronous control circuits will be tested on the assumption that the corresponding datapath circuit is faultless.
Fig. 4. Linear LPN structure

Fig. 5. Schematic of linear LPN structure (a), and its compact version (b)
In the netlist, the dotted wires can actually be removed, thus simplifying the request functions of P0, P1, P2; the compact version is shown in Figure 5(b). Those wires are redundant because the trigger signals from the environment uniquely identify which FSDC should be activated, even without a context signal from the preceding FSDCs. Owing to the state-holding property of the Muller C-element, a single stuck-at-0/stuck-at-1 fault on it will disable the set/reset of the request signal in the subsequent FSDC, which causes the circuit to deadlock. Based on the fault analysis of the linear FSDC chain obtained above, we conclude that the improved implementation of a linear structure is totally self-checking.
Testing Approach
1. Initialize the circuit to be empty and observe the primary output z_req to examine it for the self-excitation fault, that is, fault #16 mentioned in Table 1.
2. Send a handshake protocol into place P0 from the primary input signal z_ack and let it propagate through the circuit. If the primary output signal z_req remains 0 after a predefined period³, we may confirm that the circuit is faulty. Otherwise, the circuit is faultless or has only a redundant fault that does not influence its behavior.
³ In general, the period is greater than the worst-case cycle time.
Table 1. Single Stuck-at Fault Analysis

#  | Fault Point | Behavior of Fail-stop DC Chain
1  | G0: a ≡ 1   | Deadlock in predecessor.
2  | G0: b ≡ 1   | a1 ≡ 1; r0↑ → s0↑ → n0↓ → Halt
3  | G0: o ≡ 1   | Deadlock.
4  | G0: a ≡ 0   | r1↑ → s1↑ → n1↓ → m1↑ → Halt
5  | G0: b ≡ 0   | Redundant fault.
6  | G0: o ≡ 0   | Same as #4.
7  | G1: a ≡ 1   | Deadlock in predecessor.
8  | G1: b ≡ 1   | n1 ≡ 0, s1, r2, a1, a2 ≡ 1; r0↑ → s0↑ → n1↓ → Halt
9  | G1: o ≡ 1   | Deadlock.
10 | G1: a ≡ 0   | r1↑ → Halt
11 | G1: b ≡ 0   | Redundant fault.
12 | G1: o ≡ 0   | Same as #10.
13 | G2: a ≡ 1   | Deadlock.
14 | G2: b ≡ 1   | m1 ≡ 0; r1↑ → s1↑ → n1↓ → Halt
15 | G2: o ≡ 1   | Same as #7.
16 | G2: a ≡ 0   | Self-excitation: m1↑ → n1↓ → r2↑ → a2↑ → m1↓ → n1↑ → r2↓ → a2↓
17 | G2: b ≡ 0   | Deadlock: r1↑ → s1↑ → n1↓ → m1↑ → a1↑ → r1↓ → r2↑ → a2↑ → Halt
18 | G2: o ≡ 0   | Same as #13.
19 | G3: a ≡ 1   | Deadlock.
20 | G3: b ≡ 1   | r2 ≡ 0; r1↑ → s1↑ → n1↓ → m1↑ → a1↑ → r1↓ → Halt
21 | G3: o ≡ 1   | Same as #7.
22 | G3: a ≡ 0   |
23 | G3: b ≡ 0   | Redundant fault.
24 | G3: o ≡ 0   | Same as #19.
25 | G4: a ≡ 1   | Same as #7.
26 | G4: b ≡ 1   | Deadlock.
27 | G4: o ≡ 1   | r1↑ → s1↑ → Halt
28 | G4: a ≡ 0   | Redundant fault.
29 | G4: b ≡ 0   | Same as #27.
30 | G4: o ≡ 0   | Same as #7.

4.2
Fork/Join Structure
The LPN with fork/join structures for an asynchronous control circuit can be refined into a linear one. Consider Figure 6, an LPN obtained by the transition expansion technique involving fork/join structures. Firstly, this LPN can be optimized by leaving the acknowledgement d_ack, the request e_req and place P3 in the datapath, highlighted with a gray box. The optimized LPN is shown in Figure 7. Secondly, places P1 and P2 are merged into one place
Fig. 6. LPN with fork and join structure

Fig. 7. Optimized LPN of Figure 6

Fig. 8. Resultant LPN of Figure 6
P12, which denotes the data input stage. The resultant LPN, shown in Figure 8, is a linear structure of the kind discussed in the previous subsection. 4.3
Merge/Choice Structure
Now we will see how single stuck-at faults can be detected in choice/merge structures. For these structures we make use of the well-known GCD (greatest common divisor) controller example. Figure 9 shows the refined LPN for the GCD control circuit; the compact GCD control circuit schematic is illustrated in Figure 10. To enable testing of the stuck-at faults in the merge/choice structures, we need to control which branch is tested first. Our strategy is to modify the circuit by adding extra test circuitry. Specifically, we add multiplexors to the input request signals of the FSDCs following the choice structure, which allows the choice structure to be controlled externally during testing.
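The datapath that this controller sequences computes the GCD by repeated comparison and subtraction. A software shadow of that behaviour (a sketch; the gt/lt/eq branch names refer to the comparator outcomes that drive the controller's choice) makes the choice/merge structure explicit:

```python
def gcd_subtractive(x, y):
    # compare-and-subtract GCD: the compare stage resolves the gt / lt / eq
    # choice, and the two subtract branches merge back before the next comparison
    assert x > 0 and y > 0
    while x != y:
        if x > y:            # gt branch: subtract y from x
            x -= y
        else:                # lt branch: subtract x from y
            y -= x
    return x                 # eq: result ready, the controller raises z_req

assert gcd_subtractive(48, 36) == 12
assert gcd_subtractive(7, 13) == 1
```

Each pass through the while loop corresponds to one traversal of the choice structure and one re-entry through the merge place, which is exactly the path the test circuitry below must be able to steer.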
Fig. 9. Refined LPN for GCD control circuit

Fig. 10. Compact GCD control circuit schematic

Fig. 11. GCD control circuit schematic with extra test circuitry
Assume we want to test the upper branch of the circuit, as shown in Figure 11.
Testing Approach
1. Initialize the circuit to be empty and observe the primary output z_req to determine whether there is a self-excitation fault in any FSDC.
2. Set the multiplexor MUX1 on the upper branch so that the value of the request signal of FSDC p_sub_gt can be set from the testing environment (ctl1 = 1).
Table 2. Test pattern for the single stuck-at faults

#  | Fault Point  | Test Pattern (ctl1, ext1) | Observation Point (Correct, Faulty)
1  | MUX1: a ≡ 1  | <0,1>, 1                  | 0, 1
2  | MUX1: c ≡ 0  | 0, <1,0>                  | 0, 1
   |              | <1,0>, 1                  | 1
3  | OR2: a ≡ 0   | 1, 1                      | 0, 1
4  | Others       | 1, <1,0>                  | 1, 0
3. Send a handshake protocol into the circuit from the external signal ext1; it propagates through the circuit, effectively exposing the single stuck-at faults throughout the upper branch and the merge structure.
4. Read the output from signal cmp_req, the observation point, and examine it for errors.
The test pattern used to test for the single stuck-at faults is shown in Table 2.⁴ As for the stuck-at faults in the other circuitry, we can use the same approach as described in subsection 4.1.
5
Conclusion and Future Work
This paper presented an improved, fail-stop David Cell, put forward a way of designing self-checking asynchronous control circuits in the direct mapping domain, and proposed an effective strategy to test single stuck-at faults. By making use of the inherent self-checking property of asynchronous control circuits, the method achieves low overhead and high speed. Moreover, our approach applies not only to linear structures but also to those with fork/join and choice/merge structures, and is therefore applicable to direct-mapped asynchronous circuits with arbitrary topology. At the same time, there are some limitations to our approach. Firstly, there are still several redundant faults that cannot be detected. Secondly, the stuck-at-0 fault at the input port of the multiplexor cannot be detected expediently. Thus, one extension of the approach could be to look for an effective way to improve the controllability of the output of the datapath circuit, which would reduce the testing complexity of asynchronous control circuits with choice/merge structures.
Footnote 4: The single stuck-at-0 faults at inputs b and c of the OR gate can be detected via the middle and lower branches of Figure 10, respectively.
Design and Test of Self-checking Asynchronous Control Circuit
An Automatic Design Flow for Implementation of Side Channel Attacks Resistant Crypto-Chips

Behnam Ghavami and Hossein Pedram

Computer Engineering Department, Amirkabir University of Technology (Tehran Polytechnic), 424 Hafez Ave, Tehran 15785, Iran
{ghavami,pedram}@ce.aut.ac.ir
Abstract. Recently, it has been proven that asynchronous circuits possess considerable inherent resistance to side channel attacks. Despite the advantages of such circuits for secure cryptography, the lack of automatic design tools and standard methods makes exploiting them difficult. In this paper, a fully automated secure design flow and a set of secure library cells resistant to power analysis and fault injection attacks are introduced for QDI asynchronous circuits. The flow includes a standard cell library that is resistant to differential power analysis on faulty hardware. The results show that cells designed with this scheme are approximately 5.62 times more balanced than the best cells designed using previous synchronous balancing methods. To verify the efficiency of the presented flow, we applied it to an implementation of the AES cryptography algorithm; this implementation shows a 2.8 times throughput improvement over a synchronous implementation in the same technology.
1 Introduction

Cryptographic algorithms play an important part in today's digital society by ensuring the confidentiality of sensitive information. To provide security, contemporary strong cryptographic algorithms are designed to withstand rigorous cryptanalysis. The resistance of an algorithm to attack depends on its complexity, which is studied in the realm of cryptanalysis. The validity of mathematical security models for these algorithms rests on the assumption that attackers have no access to the intermediate computational data; any information about this intermediate data simplifies cryptanalysis dramatically. Since cryptographic algorithms are secure from a mathematical perspective, attackers nowadays turn to analyzing the physical aspects of a system to obtain the intermediate computational data. Such attacks combine information leaked from physical side channels with information derived from the mathematical model of the algorithm. Several attacks take advantage of physical implementation properties and information leaked from side channels [1][2][3][4]; attacks that exploit these physical weaknesses are known as side channel attacks. Several countermeasures against side channel attacks have been proposed at both the software and hardware level [4][5][6][7].

N. Azemard and L. Svensson (Eds.): PATMOS 2007, LNCS 4644, pp. 330–339, 2007. © Springer-Verlag Berlin Heidelberg 2007

Recently, in addition to
asynchronous circuit design advancements [18], it became clear that this design methodology is suitable for secure cryptography systems. Since the clock signal is eliminated in asynchronous circuits, they are resistant to fault injection via the clock. Clock elimination also attenuates electromagnetic radiation, making such attacks much more complex. Power consumption in dual-rail QDI asynchronous circuits is independent of the input data [11], so these circuits resist differential power attacks. The purpose of this paper is to present a new method that eliminates the physical limitations of contemporary cryptography systems. The methodology provides a complete design cycle for secure cryptography systems by introducing a practical, commercial-quality Electronic Design Automation (EDA) synthesis tool that eliminates all sources of leaked information from side channels. The design flow presented here is based on QDI asynchronous circuits built from customized standard cells. This method not only preserves performance but also simultaneously strengthens the countermeasures against power, timing, and fault injection attacks. In addition, circuits implemented by this method are resistant to the new and effective attack that combines fault injection and power analysis.
2 Motivation

Countermeasures against side-channel analysis are necessary to pass certification of semiconductor products for security applications such as smart cards. To prevent side-channel attacks, several countermeasure solutions have been proposed that aim to reduce or eliminate the amount of information that can be inferred about intermediate data in a hardware implementation of a cryptographic algorithm [4][5][6][7][10]. One of the most effective countermeasures against power analysis attacks is the use of specially designed balanced gates whose power consumption is equal for all data and all transitions of the gate. Several such gates have been presented previously (SABL [5], DyCML [6], BSDT [14], WDDL [7]). The SABL [15] gate is based on the StrongArm 110 flip-flop: it keeps the sense-amplifier half of the flip-flop and replaces the input differential pair with a differential pull-down network (Figure 1). WDDL [7] is constructed from regular standard cells and is applicable to FPGAs. Mace et al. [6] suggest using Dynamic Current Mode Logic (DyCML) to counteract power analysis, investigating the use of dynamic and differential logic styles for this purpose. Their comparison of SABL and DyCML gates [6] showed that both logic styles significantly decrease circuit energy variations compared with a standard CMOS technology, while DyCML also significantly reduces the power-delay product. However, most of these approaches, such as those of [5][6], have no countermeasures against glitch and fault-injection attacks and require additional protection. More importantly, the differential and dynamic (DD) approaches of [5][6] require dynamic logic cell design; the usage of DD gates is therefore limited to custom or semi-custom design, which greatly limits the perceived universality of DD-based circuitry.
In [7], two major reasons were discussed why EDA support of dynamic-logic-based design is very difficult in a synchronous methodology. First, each synchronous dynamic gate requires a clock input and uses both levels of the clock signal, which means that from
Fig. 1. (a) Balanced NAND gate proposed by Cryptographic Research [15]. (b) SABL XOR gate [5]. (c) DyCML XOR gate [6].
the point of view of EDA tools, each gate behaves like a flip-flop. Second, early/late arrival, charge sharing, clock distribution problems with small clocking granularity, and uncertainty about worst-case delay make static timing analysis (STA) of dynamic circuits very problematic. Because these problems make power-balanced dynamic circuitry practically unavailable for rapid ASIC development, researchers resort to less secure (i.e., less balanced) but easier-to-implement solutions based on standard static non-balanced gate libraries. A design methodology based on dynamic asynchronous micropipelines, which addresses the lack of a secure hardware design flow (SHDF), was proposed in [8]. That methodology allows the incorporation of existing synchronous dynamic gate designs and circuit structures; the combination of asynchronous operation and balanced dynamic gates enables automated designs that are highly resistant to side-channel attacks. In addition, a balanced library was specifically designed for the fine-grained asynchronous template (Balanced Symmetric with Discharge Tree (BSDT) gates [14]). All currently known balanced gate designs require considerable hardware redundancy and overhead to ensure balanced computation [14][6][7]. Much of this redundant hardware is not directly associated with the logical (Boolean) function of the gate; it is present to ensure power balance during computation. This redundancy is also a weakness of present balanced gate designs: there exist many internal transistor-level faults that do not affect the Boolean function of the gate but do affect its balance. As shown in [9], a small number of faults can potentially make power analysis attacks feasible even on protected devices. This vulnerability opens the possibility of new and effective methods of attack, based on a combination of fault and power attacks.
Our approach incorporates dynamic gate balancing techniques with asynchronous design principles to address the timing and clock-related problems associated with current and future balanced dynamic gate designs, and to enable their use in an automatic standard-cell-based design flow.
3 QDI Asynchronous Circuits

An asynchronous circuit is composed of individual modules that communicate with each other by means of point-to-point communication channels. A given
module becomes active when it senses the presence of incoming data. It then performs its computation and sends the result via output channels. Communication through channels is controlled by handshake protocols [18]. An asynchronous circuit is called delay-insensitive (DI) if it preserves its functionality independent of the delays of gates and wires [18]. Quasi-delay-insensitive (QDI) circuits are like DI circuits with one weak timing constraint: isochronic forks. Channels can be encoded in a variety of ways; a return-to-zero handshake protocol with dual-rail data encoding, which switches the output from data to spacer and back, is the most common QDI implementation form. We use a dual-rail encoding. The data channel contains valid data (a token) when exactly one of its two wires is high. When both wires are low, the channel contains no valid data and is said to be neutral (Figure 2). One of the major protocols used in asynchronous circuits is the four-phase protocol. In a four-phase protocol, a receive action consists of four steps: (1) wait for the input to become valid; (2) acknowledge the sender after the computation is performed (Lack); (3) wait for the inputs to become neutral; (4) lower the acknowledgement signal. A send action likewise consists of four phases: (1) send a valid output; (2) wait for the acknowledge; (3) make the output neutral; (4) wait for the acknowledge to be lowered. As mentioned, using a four-phase handshake protocol with dual-rail data encoding yields data-independent timing and power emissions, which is necessary for side-channel-attack-resistant crypto-chips.
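As an illustrative software abstraction (not the paper's circuitry), the interleaving of these sender and receiver steps can be modelled as:

```python
# A minimal model of the four-phase handshake: one channel with a data-valid
# condition and an acknowledge wire, driven through the four phases in order.

class Channel:
    def __init__(self):
        self.valid = False   # data wires hold a valid codeword
        self.ack = False     # receiver's acknowledge wire

def handshake(ch, trace):
    # Sender phase (1): drive a valid output.
    ch.valid = True;  trace.append("data valid")
    # Receiver steps (1)-(2): wait for valid input, compute, raise acknowledge.
    assert ch.valid
    ch.ack = True;    trace.append("ack raised")
    # Sender phases (2)-(3): see the acknowledge, return the output to neutral.
    assert ch.ack
    ch.valid = False; trace.append("data neutral")
    # Receiver steps (3)-(4): wait for neutral input, lower the acknowledge.
    assert not ch.valid
    ch.ack = False;   trace.append("ack lowered")
    return trace
```

Running one cycle yields the sequence "data valid", "ack raised", "data neutral", "ack lowered", after which both wires are back in their initial state and the next token can be sent.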
                d.t   d.f
Neutral ("E")    0     0
Valid '0'        0     1
Valid '1'        1     0
Not used         1     1

Fig. 2. Dual rail coding
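The codeword table of Figure 2, combined with return-to-zero signalling, can be modelled in a few lines; this is an illustrative sketch, and the transition count used as a proxy for dynamic power is our assumption:

```python
NEUTRAL = (0, 0)  # spacer "E": no valid data on the channel

def encode(bit):
    # (d.t, d.f): exactly one wire is high in each valid codeword
    return (1, 0) if bit else (0, 1)

def decode(dt, df):
    if (dt, df) == NEUTRAL:
        return None                          # channel is neutral
    if dt and df:
        raise ValueError("illegal codeword (1, 1)")
    return 1 if dt else 0

def transmit(bits):
    """Return-to-zero signalling: every codeword is followed by the spacer."""
    wires = []
    for b in bits:
        wires.append(encode(b))
        wires.append(NEUTRAL)
    return wires

def transitions(stream, start=NEUTRAL):
    # count wire transitions, a crude proxy for dynamic power
    count, prev = 0, start
    for w in stream:
        count += sum(a != b for a, b in zip(prev, w))
        prev = w
    return count

assert [decode(*w) for w in transmit([1, 0])] == [1, None, 0, None]
# Each bit costs exactly two wire transitions, regardless of its value:
assert transitions(transmit([0, 0, 0])) == transitions(transmit([1, 1, 1]))
```

The last assertion is the point of the encoding: the switching activity, and hence the modelled power, is the same for all-zero and all-one data.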
Many of the properties that designers try to add artificially to synchronous cryptographic designs come naturally in QDI asynchronous circuits. Since these circuits have no clock, clock-glitch attacks are removed. Furthermore, the electromagnetic signature is strongly reduced by replacing a synchronous processor with an asynchronous one (no clock harmonics) [19]. Asynchronous circuits typically use a redundant encoding scheme (e.g., dual-rail), and this mechanism provides a means to encode an alarm signal. Circuits employing dual-rail (or multi-rail) codes can be balanced to reduce data-dependent emissions. In the above encoding, whether we have a logical 0 or a logical 1, the encoding of the bit ensures that data is transmitted and computations are performed with constant Hamming weight. This is important since side-channel analysis is based on leakage of the Hamming weight of the sensitive data [2]. While dual-rail coding might also be used in a clocked environment, one would have to ensure that the combinational circuits were balanced and glitch-free; return-to-zero (RTZ) signaling is also required to ensure data-independent power emissions.
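A one-screen illustration (our sketch, not from the paper) of why constant Hamming weight matters: a side channel that leaks the Hamming weight of the wires distinguishes single-rail bits but not dual-rail codewords.

```python
def hamming_weight(wires):
    return sum(wires)

# Single-rail: the wire value is the bit itself, so its weight leaks the data.
single_rail = {0: (0,), 1: (1,)}
assert hamming_weight(single_rail[0]) != hamming_weight(single_rail[1])

# Dual-rail: both valid codewords weigh exactly 1, so a Hamming-weight
# side channel cannot distinguish a logical 0 from a logical 1.
dual_rail = {0: (0, 1), 1: (1, 0)}
assert hamming_weight(dual_rail[0]) == hamming_weight(dual_rail[1]) == 1
```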
4 Persia: A Synthesis Tool for QDI Circuits

Persia is a toolset developed for automatic synthesis of QDI asynchronous circuits, with adequate support for GALS systems. The structure of Persia is based on the design flow shown in Figure 3, which can be divided into four individual portions: QDI synthesis, GALS synthesis, layout synthesis, and simulation at various levels. The QDI and GALS synthesis flows join in the layout stage. The simulation flow is intended to verify the correctness of the synthesized circuit at all levels of abstraction. In this paper, we only outline the QDI synthesis flow, for its security benefits. Persia's synthesis approach is based on predesigned asynchronous four-phase dual-rail templates; it uses PCFBs [18] as its predefined templates (Figure 4). Persia uses Verilog-CSP [17], an extension of standard Verilog that supports asynchronous communication, as the hardware description language for all levels of abstraction except the netlist, which uses standard Verilog. The input of Persia is a Verilog-CSP description of a circuit. This description is converted to a netlist of standard-cell elements through several steps of the QDI synthesis flow. In the following subsections we briefly describe the functionality of these three stages.
Fig. 3. Persia synthesis flow [16]
Fig. 4. The 1-bit PCFB buffer
4.1 Arithmetic Function Extractor (AFE)

The Technology Mapper, as a part of the Template Synthesizer, is only able to synthesize one-bit assignments containing logical operators such as AND, OR, and XOR. Arithmetic operations are not synthesizable by the Template Synthesizer, so Persia extracts these operations from the CSP source code and implements them with pre-synthesized standard templates. This is the role of the first stage of the QDI synthesis flow, the Arithmetic Function Extractor (AFE). The AFE extracts each assignment that contains arithmetic operations, such as addition, subtraction, or comparison, and generates a tree of standard circuits that implements the extracted assignment. The
communication between the main circuit and the arithmetic circuit is made by introducing new channels and adding read/write macros [17]. As a result, the main circuit contains only logical assignments, and the arithmetic computations are performed in standard unconditional modules that are designed and included in the library.

4.2 Decomposition

The high-level CSP description of even very simple practical circuits is not directly convertible to PCFB [18] templates. The intention of the Decomposition stage is to decompose [20] the original description into an equivalent collection of smaller interacting processes that are compatible with these templates and synthesizable by the following stages of the QDI synthesis flow. Decomposition also enhances parallelism between the resulting processes by eliminating unnecessary dependencies and sequencing in the original CSP description.

4.3 Template Synthesizer (TSYN)

The Template Synthesizer, the final stage of the QDI synthesis flow, receives CSP source code containing a number of PCFB-compatible modules (and optionally a top-level netlist) and generates a netlist of standard-cell elements with dual-rail ports that can be used to create the final layout. TSYN can synthesize all logical operations, including AND, OR, and XOR, with conditional or unconditional READs and WRITEs. In addition, TSYN adds acknowledge signals to the I/O ports, converts the top-level netlist to dual-rail form, and makes the appropriate connections between ports and acknowledge signals. The output of TSYN can be simulated in standard Verilog simulators using the behavioral description of the standard-cell library elements.
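As a toy analogue of the AFE step described in Subsection 4.1 (not Persia's implementation: READ, the module names ADDER/SUBTRACTOR, and the ch_i channels are invented here, and the real AFE operates on Verilog-CSP rather than Python), an extraction pass over a one-line assignment might look like this:

```python
import ast
import itertools

# Map arithmetic operators to hypothetical pre-synthesized library modules.
LIB = {ast.Add: "ADDER", ast.Sub: "SUBTRACTOR"}

def extract(expr, instances, counter):
    # Lift arithmetic operators bottom-up, replacing each with a new channel
    # fed by a library-module instance.
    if isinstance(expr, ast.BinOp) and type(expr.op) in LIB:
        left = extract(expr.left, instances, counter)
        right = extract(expr.right, instances, counter)
        ch = f"ch{next(counter)}"
        instances.append((LIB[type(expr.op)], left, right, ch))
        return ch
    return expr.id  # a plain signal name: nothing to extract

def afe(assignment):
    stmt = ast.parse(assignment).body[0]
    instances, counter = [], itertools.count()
    result = extract(stmt.value, instances, counter)
    # The main circuit is left with only a channel read.
    return instances, f"{stmt.targets[0].id} = READ({result})"

instances, main = afe("z = (a + b) - c")
assert instances == [("ADDER", "a", "b", "ch0"),
                     ("SUBTRACTOR", "ch0", "c", "ch1")]
assert main == "z = READ(ch1)"
```

The arithmetic becomes a tree of standard module instances connected by freshly introduced channels, while the original assignment is reduced to a read of the tree's root channel.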
5 Cell-Library Customization

We now focus on the power-balancing requirements of the PCFB. Our analysis and simulation results show that the function and operation of the handshake (control) part of a PCFB template (sections 4 and 5 in Figure 4) is completely data independent; only the input and output validity checkers require a trivial power-balancing consideration, which can easily be met with two additional transistors in the NAND/NOR gate (Figure 5(a)). However, as mentioned in [9], due to the hardware redundancy in balanced gate designs, there are many faults that make a balanced gate imbalanced without causing logical errors. Because of this redundancy, such faults might not create logical errors and hence would not be detected by traditional voltage-level testing and reliability measurements. This vulnerability opens the possibility of new methods of attack based on a combination of fault and power attacks [9]. To overcome this vulnerability, a better balancing solution can be obtained by using duplicated NOR gates and a C-element [12] in the input and output validity checkers of a PCFB template (Figure 5(b)). The C-element's output changes only when both of its inputs have the same value and that value is the opposite of the C-element's current output; by nature, a C-element is intrinsically balanced. If a fault is injected into the proposed balancing circuits, the circuit deadlocks (and the alarm signal becomes active). With this method, when the output must be charged to
Fig. 5. (a) Balanced NOR gate [5]. (b) Our enhanced balanced NOR gate for the input validity checker.
one, and the circuit is fault-free, both branches of the pull-up network charge the inputs of the C-element; as a result, the output of the C-element is charged to one. But if a fault is injected into one of the pull-up network's branches, the C-element is not charged to the new value. Consequently, we are able to discover the attacker's injected fault at the logic level and avoid any hybrid fault-analysis/differential-power attack [9] (DPA/FI). This method, together with the proposed computational section (for the AND/OR/XOR PCFB templates), is the main contribution of our work against DPA/FI. The computational section is the last module requiring balancing consideration, and it is naturally the main source of power imbalance. By using a SABL [5] gate as the computational section of the PCFB template, a QDI balanced gate results. The asynchronous handshake part removes the clocking and timing difficulties normally associated with dynamic gates, and it further enhances their security applications owing to the benefits of asynchronous behavior mentioned in Section 3. By using a modified SABL [5] gate and employing a Discharge Tree (DT) [14] as the computational section of the PCFB template, as shown in Figure 6, a fully QDI balanced gate results. Using the DT causes the parasitic capacitances at the intermediate nodes to discharge in each evaluation phase; this yields a fixed amount of capacitive charging and discharging that is totally independent of the input data. Figure 6 shows the XOR and AND gates implemented in this style. Our simulation results show that the balance of the resulting gates is approximately 6 times better than that of the synchronous SABL implementations. Furthermore, the timing and voltage tolerance of the QDI implementation allows more aggressive dynamic designs, which can achieve better balance than previous designs.
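The duplicated-gate-plus-C-element behaviour described above can be sketched abstractly. This model (gate choice, naming, and the software abstraction are ours, for illustration only) shows how an injected stuck-at fault freezes the C-element output instead of silently unbalancing the gate:

```python
class CElement:
    """Muller C-element model: the output follows the inputs only when they
    agree; otherwise it holds its previous value. Intrinsically balanced."""
    def __init__(self):
        self.out = 0
    def update(self, a, b):
        if a == b:
            self.out = a
        return self.out

def nor(a, b):
    return int(not (a or b))

def checked_nor(a, b, stuck_branch=None):
    """Two copies of the same NOR feed a C-element. A stuck-at fault on one
    copy makes the two inputs disagree, so the C-element never switches: the
    handshake stalls (deadlock) rather than proceeding with a faulted gate."""
    c = CElement()
    y1 = nor(a, b)
    y2 = nor(a, b) if stuck_branch is None else stuck_branch
    return c.update(y1, y2)

assert checked_nor(0, 0) == 1                  # fault-free: output rises
assert checked_nor(0, 0, stuck_branch=0) == 0  # faulted copy: output held low
```

Since a validity checker whose output never rises cannot complete the four-phase handshake, the fault manifests as a pipeline stall, which is exactly the fail-stop behaviour the alarm logic turns into a detectable event.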
However, due to the redundancy in the Discharge Tree, some faults might not create logical errors and hence would not be detected by traditional voltage-level testing and reliability measures. To overcome this vulnerability, we add a circuit to the computational part of the template that detects those faults and generates an Alarm-Signal (Figure 6). The Alarm-Signal becomes active when a fault occurs in one of the transistors in the Discharge Tree, and it is employed to cause a pipeline stall, which naturally prevents further data processing and creates deadlock within the pipeline. This signal behaves like the alarm signal introduced in [11]. The chip alarm signal results from the Alarm-Signals of the individual templates, which are wired-AND together. Because of the Alarm-Signal generator, the balance of our cells decreases by approximately 6.34%; however, unlike the BSDT cell library [14], our proposed cell library is resistant to DPA/FI attacks.
Fig. 6. (a) The Balanced AND computational section. (b) The Balanced XOR computational section.
The sequencing of the asynchronous handshake protocol adds natural fault resistance to the design. Almost all single stuck-at faults, inside and outside of the complete balanced asynchronous gate, do not merely influence the functionality of the circuit; they drive the system into deadlock. That is, such faults prevent or stop the necessary four-phase handshake protocol between gates, which stalls the communication between dependent downstream gates and prevents any further data processing. Synchronous balanced dynamic logic gates have no comparable property. This additional property should make invasive attacks on a circuit much harder, since almost all tampering would be detected by a pipeline stall. Additional error detection based on other high-level fault-tolerant methods can be added easily thanks to the CSP specification of the circuit.
6 AES Implementation

To estimate the efficiency of the proposed methodology, we compare the performance of automatically synthesized synchronous and asynchronous (using the proposed balanced cell library) implementations of the AES algorithm in TSMC 0.18um technology. The same specification of AES with 128-bit keys/inputs [22] was used for both implementations. The synchronous implementation was synthesized with the Artisan Sage-X [21] standard cell library in the TSMC 0.18um technology. With an automatically pipelined synchronous implementation (Synopsys Design Compiler, maximum-performance setting), the performance was 43.69 MHz, whereas the performance of our asynchronous implementation exceeds 122.37 MHz.
Note that the synchronous case includes no side-channel-attack protection; if it did, a significant performance overhead would result. We performed a full analysis of the side-channel information leakage from a sample implementation. Initial simulations of power and timing analysis attacks on the S-box of the AES indicate the benefits of the balanced dynamic gates and QDI asynchronous circuits, which are the main goals of our proposed implementations. DFA attacks were also applied to our implementation, with desirable simulation results. We are currently evaluating whether any weakness results from the combination of the countermeasures, but up to this point, none has been found.
7 Conclusion

A fully automated design flow and a set of secure library cells resistant to power analysis and fault injection attacks were introduced for the implementation of secure QDI asynchronous circuits. Furthermore, a test methodology was presented that addresses the faults which would make our templates imbalanced; this methodology makes DPA/FI attacks on the proposed library cells almost impossible. The results show that our proposed cell library is approximately 5.62 times more balanced than the best cells designed using previous synchronous balancing methods. As mentioned before, some transistor-level single stuck-at faults inside and outside of the complete PCFB template do not create deadlock within the pipeline. Attackers may be motivated by this deficiency in the future, so we are working on a PCFB template in which 100% of single stuck-at faults result in a pipeline stall.
References
1. Kocher, P., Jaffe, J., Jun, B.: Differential Power Analysis. In: Wiener, M.J. (ed.) CRYPTO 1999. LNCS, vol. 1666, pp. 388–397. Springer, Heidelberg (1999)
2. Quisquater, J.J., Samyde, D.: Side-channel Cryptanalysis. In: Proc. SECI, September 2002, pp. 179–184 (2002)
3. Kocher, P.: Timing Attacks on Implementations of Diffie-Hellman, RSA, DSS and Other Systems. In: Koblitz, N. (ed.) CRYPTO 1996. LNCS, vol. 1109, pp. 104–113. Springer, Heidelberg (1996)
4. Quisquater, J.J., Samyde, D.: ElectroMagnetic Analysis (EMA): Measures and Countermeasures for Smart Cards. In: Wiener, M.J. (ed.) CRYPTO 1999. LNCS, vol. 1666, Springer, Heidelberg (1999)
5. Tiri, K., Akmal, M., Verbauwhede, I.: A Dynamic and Differential CMOS Logic with Signal Independent Power Consumption to Withstand Differential Power Analysis on Smart Cards. In: 28th European Solid-State Circuits Conference (ESSCIRC 2002), September 2002, pp. 403–406 (2002)
6. Mace, F., Standaert, F.X., Quisquater, J.J., Legat, J.D.: A Design Methodology for Secured ICs Using Dynamic Current Mode Logic. In: Paliouras, V., Vounckx, J., Verkest, D. (eds.) PATMOS 2005. LNCS, vol. 3728, pp. 550–560. Springer, Heidelberg (2005)
7. Tiri, K., Verbauwhede, I.: A Logic Level Design Methodology for a Secure DPA Resistant ASIC or FPGA Implementation. In: Design, Automation and Test in Europe Conference (DATE 2004), February 2004, pp. 246–251 (2004)
8. Kulikowski, K., Smirnov, A., Taubin, A.: Automated Design of Cryptographic Devices Resistant to Multiple Side-Channel Attacks. In: Goubin, L., Matsui, M. (eds.) CHES 2006. LNCS, vol. 4249, Springer, Heidelberg (2006)
9. Kulikowski, K., Karpovsky, M., Taubin, A.: DPA on Faulty Cryptographic Hardware and Countermeasures. In: Fault Diagnosis and Tolerance in Cryptography, 3rd International Workshop (2006)
10. Kulikowski, K., Karpovsky, M., Taubin, A.: Robust Codes for Fault Attack Resistant Cryptographic Hardware. In: Fault Diagnosis and Tolerance in Cryptography, 2nd International Workshop, Edinburgh (2005)
11. Bouesse, F., Fesquet, L., Renaudin, M.: QDI Circuits to Improve Smartcard Security. In: 2nd Asynchronous Circuit Design Workshop (ACiD 2002), Munich, Germany, January 2002, pp. 28–29 (2002)
12. Renaudin, M.: Asynchronous Circuits and Systems: A Promising Design Alternative. In: Senn, P., Renaudin, M., Boussey, J. (eds.) Microelectronics for Telecommunications: Managing High Complexity and Mobility (MIGAS 2000). Special issue of the Microelectronics Engineering Journal 54(1-2), 133–149 (2000)
13. Biham, E., Shamir, A.: Differential Fault Analysis of Secret Key Cryptosystems. In: Kaliski Jr., B.S. (ed.) CRYPTO 1997. LNCS, vol. 1294, pp. 513–525. Springer, Heidelberg (1997)
14. MacDonald, D.J.: A Balanced-Power Domino-Style Standard Cell Library for Fine-Grain Asynchronous Pipelined Design to Resist Differential Power Analysis Attacks. Master of Science Thesis, Boston University, Boston (2005), available at http://reliable.bu.edu/Projects/MacDonald_thesis.pdf
15. Jaffe, J., Kocher, P., Jun, B.: Hardware-level Mitigation and DPA Countermeasures for Cryptographic Devices. US Patent 6654884
16. http://www.asynch.ir/persia
17. Seifhashemi, A., Pedram, H.: Verilog HDL, Powered by PLI: A Suitable Framework for Describing and Modeling Asynchronous Circuits at All Levels of Abstraction. In: Proc. of the 40th DAC, June 2003, Anaheim, CA, USA (2003)
18. Sparso, J., Furber, S.: Principles of Asynchronous Circuit Design – A Systems Perspective. Kluwer Academic Publishers (2002)
19. McCardle, J., Chester, D.: Measuring an Asynchronous Processor's Power and Noise. In: SNUG (2001)
20. Martin, A.J.: Synthesis of Asynchronous VLSI Circuits. Caltech, CS-TR-93-28 (1991)
21. TSMC 0.18μm Process 1.8-Volt Sage-X Standard Cell Library Databook (September 2003)
22. FIPS PUB 197: Advanced Encryption Standard, http://csrc.nist.gov
Analysis and Improvement of Dual Rail Logic as a Countermeasure Against DPA

A. Razafindraibe, M. Robert, and P. Maurine

LIRMM, University of Montpellier II, 161 rue Ada, 34392 Montpellier, France
Abstract. Dual rail logic is considered a relevant hardware countermeasure against Differential Power Analysis (DPA), since it makes power consumption data independent. In this paper, from a thorough analysis of the robustness of dual rail logic against DPA, we deduce the design range in which it can be considered effectively robust. Surprisingly, this secure design range is quite narrow. We therefore propose an improved logic, called Secure Triple Track Logic, as an alternative to more conventional dual rail logics. To validate the claimed benefits of the proposed logic, we have implemented a sensitive block of the Data Encryption Standard (DES) algorithm and carried out DPA attacks by simulation.
1 Introduction

It is now well recognized that the Achilles' heel of secure applications lies in their physical implementation. Among the potential techniques to retrieve a secret key, one can mention side channel attacks. Of the many side channel attacks, the DPA attack [1] is considered one of the most efficient, since it requires fewer skills and less equipment than other attacks, such as electromagnetic attacks, to be implemented successfully. Because of its dangerousness, many countermeasures have been proposed in former works [2, 3]. Recently, synchronous [4] and asynchronous dual rail logic [5, 6] have been identified as promising solutions to increase the robustness of secure applications. However, experiments have shown that the use of basic dual rail structures is not sufficient to warrant a high level of robustness against DPA. To overcome this problem, specific dual rail cells [4, 7, 8] and ad hoc place-and-route methods [9] have been developed. The goals of these countermeasures are to make the power consumption of logic gates independent of the manipulated data and to balance the wire capacitances of each differential pair during the place & route steps. Within this context, the first contribution of this paper is a thorough analysis of the robustness of dual rail logic against DPA and the identification of its secure design range. From the latter, we identify the most sensitive parameters in secure dual rail design and propose adequate countermeasures while staying as close to a classical design flow as possible. Looking closely at this secure design range, it appears to be too narrow. To address this problem, we propose the STTL (Secure Triple Track Logic) secure logic for implementing the key modules of ciphering algorithms. This is the second contribution of the paper.

N. Azemard and L. Svensson (Eds.): PATMOS 2007, LNCS 4644, pp. 340–351, 2007. © Springer-Verlag Berlin Heidelberg 2007
Analysis and Improvement of Dual Rail Logic as a Countermeasure Against DPA
341
The remainder of the paper is organized as follows. First, the basics of DPA are briefly summed up and the claimed benefits of dual rail logic (DRL) are reviewed. Then, the implicit assumptions supporting these claims are identified and their validity range is evaluated by simulation on a 130 nm process. After a discussion of this secure design range, STTL is introduced as an adequate hardware countermeasure against DPA. The design features of this logic are also detailed. Before concluding, validations of the robustness of the proposed logic are presented.
2 Differential Power Analysis DPA, first introduced in [1], succeeds in retrieving the secret key by exploiting the fact that the power consumption of cryptosystems is data dependent. A DPA attack is generally executed in three phases: data collection, data sorting and data analysis. Data collection consists in running a large number of cryptographic operations and recording the corresponding sampled power traces. Data sorting consists in extracting, for each possible sub-key, two sets of power traces from the whole power trace collection. These sets of power traces, S'0' and S'1', are built according to the expected value of the bit under attack, given both the guessed value of the sub-key and the input data. Data analysis consists in computing, for each possible guess of the secret key, the average power traces of S'0' and S'1' and in taking the difference between the averages. Finally, the secret key is usually disclosed by identifying the guess leading to the difference with the highest amplitude. If this protocol is quite simple, one can wonder which syndrome is really captured by the DPA when applied to a dual rail circuit. In order to identify it, let us consider that a DPA is performed, with v vectors (∈ V), on the output bit z of a logic block made of P gates. Among the v vectors applied to the cryptosystem, t of them (the set T) force z to the logic value '1', while the remaining f = v − t (the set F) force z to '0'. With such definitions, the syndrome SDPA captured by the DPA is:

SDPA(z) = (1/t)·Σu=1..t Iu(t) − (1/f)·Σw=1..f Iw(t)    (1)
In the above expression, Iu(t) and Iw(t) are the current profiles of the whole block under attack while vectors u ∈ T and w ∈ F are applied on its inputs. These current profiles can be defined as the current consumed by all P gates making up the block:

Iu(t) = Σp=1..P ip(t),    Iw(t) = Σp=1..P ip(t)    (2)
Considering the definitions above and defining rpT and fpT (respectively rpF and fpF) as the numbers of vectors of T (respectively F) forcing the output of gate p to a logic '1' and '0' respectively, it is then possible to deduce from (1) the following DPA signature expression:

SDPA(z) = Σp=1..P−1 ( fpF/f − fpT/t )·Δip(t) + Δiz(t)    (3)
342
A. Razafindraibe, M. Robert, and P. Maurine
where Δip(t) is the differential current profile of gate p, and Δiz(t) is the differential current profile of the gate driving the bit under attack. Here, we denote by differential switching current the waveform obtained by taking the difference between the currents drawn from VDD by the considered gate to settle a logic '1' and a logic '0' on its output, respectively. Note that if fpT/t and fpF/f are close to each other, expression (3) reduces to the differential current profile of the gate driving z, within its operating context. This highlights the great sensitivity of DPA.
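The three phases above amount to a difference-of-means test. The sketch below is a minimal illustration of that protocol, not the authors' tooling; the function name `select_bit`, the trace shapes and the synthetic usage are assumptions made only for the example:

```python
import numpy as np

def dpa_signature(traces, inputs, guess, select_bit):
    """Difference-of-means DPA as described in the text.

    traces     : (v, n_samples) array of recorded power traces
    inputs     : the v input data words applied to the cryptosystem
    guess      : candidate sub-key value
    select_bit : function (input, guess) -> predicted value of bit z
    """
    # Data sorting: split the traces into the sets S'1' and S'0'
    pred = np.array([select_bit(x, guess) for x in inputs])
    s1 = traces[pred == 1]
    s0 = traces[pred == 0]
    # Data analysis: difference of the average power traces
    return s1.mean(axis=0) - s0.mean(axis=0)

# The guess whose signature peaks highest usually discloses the sub-key:
# best = max(guesses, key=lambda g: np.abs(dpa_signature(T, X, g, f)).max())
```

The commented line shows the usual decision rule: the correct guess produces the signature of highest amplitude, as stated in the text.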
3 Dual Rail Logic: A Countermeasure Against DPA To secure a cryptosystem against such an attack, the first action to take is to break its assumptions by making the power consumption independent of the manipulated data. Countermeasures have been proposed in [2] at all levels of abstraction. Most of them aim at reducing the correlation between the data and the leaking syndromes. Dual rail logic is one of these countermeasures. The main advantage of dual rail logic lies in the associated encoding used to represent logic values. Indeed, with such an encoding, a rising transition on one of the two wires indicates that a bit is set to a valid logic '1' or '0', while a falling edge indicates that the bit returns to the invalid state, which has no logical meaning. Consequently, the transmission of a valid logic '1' or '0' always requires switching one rail to VDD. Therefore the differential current profiles of dual rail cells, and thus circuits, should be significantly lower than those of single ended gates. However, this claim holds if and only if the power consumption and the propagation delay of dual rail cells are data independent, i.e. if the current waveforms related to the settlement of a logic '1' and a logic '0' are rigorously the same.

Fig. 1. A dual rail cell within its context (a DR gate with input transition times τ1, τ2, arrival times AT1, AT2, arrival time skew Δ, and output loads C1, C2)
Since conventional dual rail cells, such as DCVSL or asynchronous DIMS logic [12], do not have perfectly balanced power consumption, a lot of effort has been devoted in [4,7,8] to defining secure dual rail cells. In his seminal paper, K. Tiri introduced the Sense Amplifier Based Logic (SABL) [8] as a logic with constant power consumption. Dynamic Current Mode Logic has also been identified in [10] as an alternative to SABL, while secure dual rail CMOS schematics are given in [4, 7].
Even if all these formerly proposed solutions appear efficient at cell level to counteract DPA, they all rest on three crude assumptions. Indeed, in all these works, it is assumed that after the place & route steps:
• Assumption n°1: each wire of each differential output is loaded by an identical capacitance value (C2 = C1);
• Assumption n°2: all the inputs of the gate under consideration are controlled by identical drivers, i.e. the transition times (labeled τ in the remainder of the paper) of all the input signals have the same value (τ1 = τ2);
• Assumption n°3: the switching process of the gate under consideration always starts at the same time (AT1 = AT2).
Considering that both the power consumption and the timings of dual rail CMOS gates strongly depend on the transition time of the signals triggering the gate switching, and on the switched output capacitance, one can wonder about the validity domain of the three aforementioned assumptions.
4 Secure Design Range To evaluate this validity domain, modeling the switching current waveform of a CMOS dual rail gate is of prime importance. Considering that any single rail gate can be reduced to an equivalent inverter [11,14] or buffer (Fig. 1), we modeled, at first order, the maximum amplitude ΔiMAX of the differential switching current profile of a dual rail gate loaded by unmatched capacitances, controlled by imbalanced transition times, and triggered by signals with imbalanced arrival times [21]. Considering ITH as the smallest current imbalance that can be monitored with a given number N of current profile measurements, according to the SNR definition:

SNR = (ITH/σ)·√N    (4)
we deduced from the modeling of ΔiMAX [21] three criteria allowing a quick estimate of the robustness against DPA of a dual rail cell within its context. These criteria are the following:

C2/C1 |Crit = max{ (VDSAT/(β·VDD))·(1/(Ri·(1−β))) + 1 ; (VDSAT/VDD)^(−1)·(1 − 1/Ri)^(−1) },  if IMAX > ITH    (5)

τ1/τ2 |Crit = 1 − ((VDD − VT)/VDD)·(1/Ri),  if IMAX > ITH    (6)

Δ/τ |Crit = ((VDD − VT)/VDD)·(1/Ri),  if IMAX > ITH    (7)

Ri = IMAX/ITH    (8)
In the above expressions, VDD, VT and VDSAT are the supply, threshold and saturation voltages of the considered transistor, and β is the ratio of the currents provided by a transistor when its drain-source voltage equals VDSAT and VDD, respectively. As shown, the first criterion allows evaluating the robustness of a dual rail cell in presence of unmatched loads. More precisely, for a given current threshold ITH, expression (5) provides the imbalance that can be tolerated between the outputs. In the same way, the second and third criteria allow evaluating the robustness of a dual rail cell in presence of imbalanced input transition times and arrival times, respectively.
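As a quick numeric illustration of expression (4) and of the role of Ri in the criteria: the smallest observable imbalance decreases as more traces are averaged. The figures below are assumptions chosen only to mirror the 100 µA / 50 µA working point analysed later in the text:

```python
import math

def smallest_observable_imbalance(sigma, n_traces, snr=1.0):
    """ITH from expression (4): SNR = (ITH / sigma) * sqrt(N)."""
    return snr * sigma / math.sqrt(n_traces)

def robustness_ratio(i_max, sigma, n_traces, snr=1.0):
    """Ri = IMAX / ITH, the quantity criteria (5)-(7) depend on."""
    return i_max / smallest_observable_imbalance(sigma, n_traces, snr)

# Assumed figures: IMAX = 100 uA, noise sigma = 500 uA, N = 100 traces.
# Averaging reduces the noise floor to ITH = 50 uA, hence Ri = 2.
ri = robustness_ratio(100e-6, 500e-6, 100)
```

The sketch makes the attacker's trade-off explicit: quadrupling the number of recorded traces halves ITH, and thus doubles Ri, shrinking the secure design range.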
Fig. 2. Simulated and calculated values of C1/C2 |Crit vs. Ri = IMAX/ITH for the SABL and2/nand2 and xor2/xnor2 gates [8]
Fig. 3. Simulated and calculated values of τ1/τ2 |Crit vs. Ri = IMAX/ITH for the SABL and2/nand2 and xor2/xnor2 gates [8]
One property of these criteria is that they depend only on process parameters. This implies that, for a given cell topology, it is possible to obtain, by electrical simulation [19], characteristic curves of its robustness against DPA in presence of load, transition time and arrival time imbalances. This provides a really interesting way to compare the robustness of different cell topologies regardless of their sizing, provided that a unique gate sizing policy is applied for all drive strengths. To demonstrate the validity of these first order criteria, we simulated and computed the critical load, transition time and arrival time imbalance curves of the SABL and2/nand2 and xor2/xnor2 gates. Figs. 2, 3 and 4 report the results obtained. As shown, the accuracy of the proposed robustness criteria is satisfactory. However, a detailed interpretation of these characteristics provides more interesting results.
Fig. 4. Simulated and calculated values of |Δ/τ| |Crit vs. Ri = IMAX/ITH for the two SABL gates [8]
Let us consider that Ri = IMAX/ITH is equal to 2 (100 µA / 50 µA). For such an Ri value, we may conclude that the two considered SABL gates remain robust against DPA if:
- the load imbalance C1/C2 remains greater than 0.7, i.e. if C2 remains smaller than 1.4 times C1;
- the transition time imbalance τ1/τ2 remains greater than 0.7;
- and the arrival time imbalance |Δ/τ| is smaller than 0.2 (τ1 = τ2 = τ), i.e. if all the signals triggering the gate arrive within a time window of width equal to 0.2 times the smallest input transition time τ.
This is quite small, considering that typical transition time values range between 20 ps and 300 ps for the 130 nm process under consideration. This demonstrates that dual rail logic may be considered robust against DPA in presence of significant load and transition time imbalances, but cannot tolerate any significant arrival time imbalance. This is all the more true since arrival time imbalances may grow with the logic depth of the data paths. From the preceding expressions and results, it appears that there is effectively a design range in which dual rail logic can be considered robust against DPA. However, this secure design space is quite narrow, since the tolerable arrival time imbalances are quite small. Based on the previous expressions, Ri (i.e. IMAX) must be made as small as possible to enlarge this secure design range. One possible solution is to work with reduced VDD values; however, this requires properly managing the power versus timing trade-off. Considering once again the narrowness of the secure design range, another alternative lies in the progressive development of dedicated CAD tools and/or design solutions to balance not only the parasitic capacitances introduced during place & route, as proposed in [9], but also the transition and arrival times.
Within this context, expressions (5)-(8) constitute useful design criteria to evaluate how exposed elementary cells are within a secure dual rail circuit. However, as such CAD tools will not be available in the near future, we concentrate our effort on design solutions, and more precisely on the structures of the dual rail cells used to implement secure designs.
5 Secure Triple Track Logic If the results obtained above demonstrate that the main benefit of the dual rail countermeasure lies in its ability to reduce the differential current profiles, and thus the correlation between data and power consumption, they also point out its main weakness: dual rail logic does not sufficiently reduce the correlation between data and computation times to constitute an extremely robust countermeasure. To eliminate this remaining weakness, we have developed a CMOS logic with data independent timing and power consumption, called Secure Triple Track Logic (STTL in the rest of the paper). In fact, it is a variant of dual rail logic.
Fig. 5. STTL and2/nand2 gate, with a graph illustrating its operation (valid output states S = S0S1SV = (0,1,1) and (1,0,1); invalid state S = (0,0,0); Enable rises once the validity rails av and bv are set)
To introduce the main characteristics of this logic style, an STTL and2/nand2 gate is represented in Fig. 5, together with a graph illustrating its operation. As shown, instead of using two output wires to convey one logical value, STTL uses three. Indeed, an additional output wire SV is used to indicate whether the output data S is valid or not. Similarly, two additional input wires, av and bv, indicating the validity of the incoming signals a and b, are used. STTL thus operates according to a kind of triple rail encoding of data (Fig. 6). Note that this is not the first time that the use of an additional wire to encode the validity of a signal is proposed. Indeed, in [20] an additional wire is used to obtain "efficient hardware implementations", but not to obtain secure designs or a data independent logic.

Fig. 6. Data encoding used by STTL: valid data S0S1SV = (0,1,1) or (1,0,1); invalid data S0S1SV = (0,0,0)

As illustrated by Fig. 6, the encoding of data is not a true triple rail encoding, since the additional code value is redundant and does not convey any information about the bit value itself. This additional code value (and thus the power consumption of the greyed
gates on Fig. 5) is therefore uncorrelated with the value of the input data. This property is extremely important. Indeed, one key design characteristic of an STTL gate is that all validity signals (av, bv and SV) are delivered by low switching current gates (greyed gates in Fig. 5), i.e. gates having greater delays than the high switching current gates (blackened gates in Fig. 5), in order to ensure that the input validity signals (av, bv) settle after the data signals (a0, a1, b0, b1). This can easily be obtained by sizing the transistors of the greyed cells smaller than those of the blackened cells. With such a return-to-zero encoding of data and these specific gate design rules, the and2/nand2 gate represented in Fig. 5 operates as follows. Starting from the invalid state, the data signals (a0, a1, b0, b1) settle first. In a second step, the validity signals (av, bv) rise, forcing 'Enable' to '1', which allows, in a third step, the computation of the outputs. The return to the invalid state is performed in a similar way. First, the data signals (a0, a1, b0, b1) return to '0'. Then the validity signals (av, bv) are also forced to '0' by the environment, allowing the gate to return to the invalid state. If the use of an additional wire implies, at cell level, a data independent power overhead, estimated roughly to be within 10% to 30% compared to the dual rail cells introduced in [12], depending on the complexity of the gate, it allows designing STTL gates having four interesting properties from security and design points of view:
• First: no internal cell activity occurs while the data signals (a0, a1, b0, b1) are settling, since no current may flow as long as the validity signals (av, bv) are not true;
• Second: a quasi data independent power consumption, as for most of the previously proposed secure dual rail gates [4,7,8,10,12,17];
• Third: quasi data independent propagation delays, at block level, since the firing of gates is always triggered by a data independent signal (Enable) computed from the validity signals (av, bv and SV), which are also data independent;
• Fourth: STTL gates are quite compact compared to other dual rail cells. As an illustration, Table 1 gives the number of transistors required to realize different basic functions in STTL and in other design styles.

Table 1. Transistor count comparison

Gates                 STTL  [16]  [4]  [8]
Nand2/Nor2/And2/Or2     27    64  112   14
Nand3/Nor3/And3/Or3     29   128  224   28
Xor2/Xnor2              29    68   80   18
Xor3/Xnor3              35   136  160   36
AO21/AOI21              39   128  224   28
AO22/AOI22              42   192  336   42
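The return-to-zero protocol described above (data rails settle, validity rails rise, and only then do the outputs fire) can be sketched behaviourally. This is an illustrative model only; the rail convention (rail 1 carries the logic '1') and the function name are assumptions for the example, not the authors' netlist:

```python
def sttl_and2_nand2(a0, a1, av, b0, b1, bv):
    """Behavioural sketch of the STTL and2/nand2 gate.

    Every signal is triple-rail encoded: (rail0, rail1, validity),
    with (0,1,1) a valid '1', (1,0,1) a valid '0', (0,0,0) invalid.
    The outputs stay in the invalid state until both input validity
    rails are set -- the 'Enable' condition of Fig. 5.
    """
    enable = av and bv
    if not enable:
        return (0, 0, 0)          # invalid state, no output activity
    a = bool(a1)                  # assumption: rail 1 carries logic '1'
    b = bool(b1)
    s = a and b                   # and2 value (rail s0 is the nand2 value)
    s0, s1 = (0, 1) if s else (1, 0)
    return (s0, s1, 1)            # validity SV raised with the data

# a = '1', b = '0': the gate outputs a valid '0', encoded (1,0,1)
assert sttl_and2_nand2(0, 1, 1, 1, 0, 1) == (1, 0, 1)
# Same data rails but bv still low: the output remains invalid
assert sttl_and2_nand2(0, 1, 1, 1, 0, 0) == (0, 0, 0)
```

The second assertion captures the first property listed above: with Enable low, the output (and thus the gate's activity) is independent of the data rails.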
The third property counterbalances the weakness (relative to arrival time imbalances) identified above for basic or secure dual rail gates introduced in former works. Indeed, the gate firings are independent of the processed data provided that the uncertainty on the arrival times of all input signals, introduced by the place and route steps, is smaller than the time window Q that separates the settlement of the data signals (E0, E1) from that of the validity signal EV (see Fig. 7). An important point here is that this time window Q can be tuned by adequately sizing the low switching current gates. In other words, the robustness of an STTL circuit can easily be managed by enlarging or reducing the width of this time window.
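The timing rule above can be stated compactly: the gate fires data-independently as long as the routing-induced spread of the data arrival times fits inside Q. A small helper illustrates this; the function names and picosecond figures are assumptions:

```python
def required_window_q(data_arrivals_ps):
    """Minimum window Q the low switching current gates must provide:
    the spread of the data signal arrival times after place & route."""
    return max(data_arrivals_ps) - min(data_arrivals_ps)

def firing_is_data_independent(data_arrivals_ps, ev_arrival_ps):
    """True when every data rail has settled before the validity
    signal EV fires, i.e. the arrival uncertainty fits inside Q."""
    return max(data_arrivals_ps) < ev_arrival_ps

# Routing skews the four data rails by up to 35 ps; sizing the greyed
# gates so that EV fires at 120 ps keeps the firing data independent.
assert required_window_q([80, 95, 100, 115]) == 35
assert firing_is_data_independent([80, 95, 100, 115], 120)
```

This is exactly the tuning knob mentioned in the text: enlarging Q (slower validity gates) buys robustness against place-and-route skew at the cost of longer I/O delays.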
Fig. 7. Timing behaviour of an STTL gate: the data signals E0 or E1 rise first, the validity signal EV rises last, separated by the tunable time window Q
In order to evaluate the effectiveness of STTL, we have implemented a sensitive sub-module of the DES algorithm [18], namely the sbox1, driven by XOR gates. To this end, we used our STTL library, as well as previously introduced dual rail logics for comparison purposes. Among these other dual rail libraries, we may distinguish the ones including secure dual rail gates [4, 12], the SABL gates [8], and finally the AO222 based logic [13, 16]. With such a simulation setup, the expected properties of STTL were analysed and verified. In a first validation step, we performed simulated DPA attacks on the four output bits of the sbox1 in order to identify precisely the impact of the routing on the robustness of STTL. More precisely, to obtain a thorough evaluation of the robustness against DPA of the considered logic styles, all the simulations were first run on ideal netlists (without parasitic capacitances) and subsequently on back-annotated netlists. Note that, to be fair, we adopted the same sizing policy for all the dual rail cells, as well as the same parameters for the place & route steps, done with the SoC Encounter tool [15]. Fig. 8 gives, for the 64 possible guesses of the secret key, the DPA signatures obtained during the attack of the third output bit (S3) of the sbox1 implemented with STTL gates. From this figure, two conclusions may be drawn. First, performing DPA attacks on S3, as on the three other bits, does not provide any information about the value of the secret key, since its DPA curve is not distinguishable from the 63 other DPA signatures. STTL therefore counteracts the attack in this case. The most important result that can be drawn from Fig. 8 is that STTL is, as expected, quasi insensitive to the load imbalances introduced by the place and route steps, since the DPA signatures obtained with the ideal and back-annotated netlists are quasi identical.
In a second validation step, we wanted to demonstrate that STTL effectively leads to quasi data independent propagation delays at block level. We therefore extracted, from electrical simulations of back-annotated netlists, the time spent by the signals to propagate from the inputs to the outputs. This was done for all possible input vectors, considering STTL as well as the other logic styles introduced in [4, 8, 12]. Note that all input signals were assumed to be stable at t = 0. In Fig. 9, we plotted the time spent by the signals to propagate from the inputs to output S3. More precisely, this figure gives the propagation delay distributions while S3 settles a logic '1' and a logic '0', respectively, for the different dual rail logics. Note that we have also reported the average propagation delay values spent by the circuit to settle a '1' and a '0' on output S3. The obtained representation is useful to evaluate the robustness of a logic block against DPA. Indeed, the more symmetrical the distributions are, the more data independent the considered logic is, and thus the more robust the physical implementation.
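The symmetry argument above can be quantified with the simplest possible metric, the gap between the average delays of the two distributions (the quantity reported alongside Fig. 9); the delay figures below are made up for illustration:

```python
def timing_gap(delays_settle_1, delays_settle_0):
    """Gap between the average I/O delays observed while the output
    settles a logic '1' and a logic '0' (in ps). A small gap means a
    quasi data independent timing behaviour, hence a harder DPA target."""
    mean = lambda xs: sum(xs) / len(xs)
    return abs(mean(delays_settle_1) - mean(delays_settle_0))

# An STTL-like block (near-symmetrical distributions) ...
assert timing_gap([3000, 3010, 2990], [3002, 2998, 3006]) == 2.0
# ... versus an imbalanced dual rail block
assert timing_gap([1500, 1600], [1700, 1820]) == 210.0
```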
Fig. 8. DPA signatures of output bit 3 obtained with and without considering routing capacitances (top: ideal netlist; bottom: back-annotated netlist)
13
S3 settles a logic ‘1’
3
S3 settles a logic ‘0’ 7 [16]
[MAU03] [8]
17
STTL 27 1500
[GUI04] 2000
2500
3000
3500
4000
I/O propagation delay (ps)
Fig. 9. Some Timing data
As shown, depending on the logic, the gap between the average times spent to settle a logic '1' and a logic '0' can be quite small (a few ps) or significant (several tens of ps). Obviously, STTL exhibits a quasi data independent timing behaviour, while the logics introduced in [8, 12] do not. However, the price to be paid is longer I/O propagation delays, due to the use of low switching current gates to control the validity signals. In a final step, we compared the robustness against DPA of all the considered logic styles. We thus performed simulated DPA attacks on all the outputs of the sbox1 structure. The netlists considered during these simulations were back-annotated ones. Fig. 10 reports some relevant results we have obtained. These results may be summarized as follows. First, for all the attacked output bits, DPA was
Fig. 10. Simulated DPA signatures of the sbox1 outputs with back-annotated netlists, for the 64 key guesses and 64 plaintexts (one panel per output bit and logic style; X-axis unit is ns)
unsuccessful when performed on the STTL implementation. Second, these attacks may be considered successful when performed on SABL [8] and on circuits implemented with the gates introduced in [4, 13, 16]. However, as shown in Fig. 10, for [4] the revealed syndrome is quite small.
6 Conclusion A thorough evaluation of the robustness of dual rail logic has been carried out in this paper. This analysis has pointed out that dual rail logic does not sufficiently reduce the correlation between data and computation times to be a fully robust countermeasure against DPA. This observation has led to the proposal of an improved logic called STTL. The main characteristics that make STTL a robust countermeasure against DPA are its quasi data independent power consumption and timing behaviour. The latter ensures that STTL is particularly robust to the load and arrival time imbalances introduced by the place and route steps, while the resulting cells remain quite compact with respect to formerly introduced logic styles.
References [1] Kocher, P., et al.: Differential Power Analysis. In: Wiener, M.J. (ed.) CRYPTO 1999. LNCS, vol. 1666, pp. 388–397. Springer, Heidelberg (1999) [2] Suzuki, D., et al.: Random Switching Logic: A Countermeasure against DPA based on Transition Probability. Cryptology ePrint Archive, Report 2004/346 (2004)
[3] Bystrov, A., Yakovlev, A., Sokolov, D., Murphy, J.: Design and Analysis of Dual-Rail Circuits for Security Applications. IEEE Trans. on Computers 54(4), 449–460 (2005) [4] Guilley, S., et al.: CMOS Structures Suitable for Secure Hardware. In: Design, Automation and Test in Europe Conference and Exposition (DATE 2004), Paris, France, February 16-20, 2004 [5] Fournier, J.J.A., et al.: Security Evaluation of Asynchronous Circuits. In: Walter, C.D., Koç, Ç.K., Paar, C. (eds.) CHES 2003. LNCS, vol. 2779, pp. 137–151. Springer, Heidelberg (2003) [6] Bouesse, G.F., et al.: DPA on Quasi Delay Insensitive Asynchronous Circuits: Formalization and Improvement. In: Design, Automation and Test in Europe Conference and Exposition (DATE 2005), Munich, Germany, March 7-11, 2005 [7] Razafindraibe, A., et al.: Secure structures for secure asynchronous QDI circuits. In: 19th International Conference on Design of Circuits and Integrated Systems (DCIS'04), Bordeaux, France, November 24-26, 2004 [8] Tiri, K., et al.: Securing Encryption Algorithms against DPA at the Logic Level: Next Generation Smart Card Technology. In: Walter, C.D., Koç, Ç.K., Paar, C. (eds.) CHES 2003. LNCS, vol. 2779, pp. 125–136. Springer, Heidelberg (2003) [9] Tiri, K., et al.: A VLSI Design Flow for Secure Side-Channel Attack Resistant ICs. In: Design, Automation and Test in Europe Conference and Exposition (DATE 2005), Munich, Germany, March 7-11, 2005 [10] Mace, F., et al.: A dynamic current mode logic to counteract power analysis attacks. In: 19th International Conference on Design of Circuits and Integrated Systems (DCIS'04), Bordeaux, France, November 24-26, 2004 [11] Maurine, P., et al.: Transition time modeling in deep submicron CMOS. IEEE Trans. on Computer Aided Design 21, 1352–1363 (2002) [12] Razafindraibe, A., et al.: Asynchronous Dual Rail Cells to Secure Cryptosystems against Side Channel Attacks. In: SAME 2005, Sophia Antipolis, France, October 5-6, 2005 [13] Maurine, P., et al.: Static Implementation of QDI Asynchronous Primitives. In: Chico, J.J., Macii, E. (eds.) PATMOS 2003. LNCS, vol. 2799, pp. 181–191. Springer, Heidelberg (2003) [14] Chatzigeorgiou, A., et al.: Collapsing the Transistor Chain to an Effective Single Equivalent Transistor. In: Design, Automation and Test in Europe (DATE '98), Paris, France, February 23-26, 1998 [15] http://www.cadence.com/products/digital_ic/soc_encounter/index.aspx [16] Piguet, C., et al.: Electrical Design of Dynamic and Static Speed Independent CMOS Circuits from Signal Transition Graphs. In: 8th International Workshop on Power and Timing Modeling, Optimization and Simulation (PATMOS '98), Technical University of Denmark, October 7-9, 1998, pp. 357–366 [17] Kulikowski, K.J., et al.: Delay Insensitive Encoding and Power Analysis: A Balancing Act. In: 11th IEEE International Symposium on Asynchronous Circuits and Systems (ASYNC 2005), New York City, USA, March 13-16, 2005, pp. 116–125 [18] National Bureau of Standards: Data Encryption Standard. Federal Information Processing Standards Publication, vol. 46 (January 1977) [19] Eldo User's Manual. Mentor Graphics Corp. (1998) [20] Meng, T.H.-Y., et al.: Automatic Synthesis of Asynchronous Circuits from High-Level Specifications. IEEE Trans. on Computer Aided Design 8(11) (November 1989) [21] Razafindraibe, A., Robert, M., Renaudin, M., Maurine, P.: Evaluation of the robustness of dual rail logic against DPA. In: IEEE International Conference on Integrated Circuit Design and Technology, May 24-26, 2006
Performance Optimization of Embedded Applications in a Hybrid Reconfigurable Platform Michalis D. Galanis, Gregory Dimitroulakos, and Costas E. Goutis VLSI Design Laboratory, ECE Department, University of Patras {mgalanis,dhmhgre,goutis}@ece.upatras.gr
Abstract. This work presents an extensive study of the speedups achieved by mapping real-life applications onto different instances of a hybrid reconfigurable system. The embedded heterogeneous system is composed of reconfigurable hardware units of different granularity. The fine-grain reconfigurable logic is realized by an FPGA, while the coarse-grain reconfigurable hardware is a 2-Dimensional array of word-level Processing Elements. Performance gains are achieved by mapping time critical loops, which execute slowly on the FPGA, onto the Coarse-Grain Reconfigurable Array. An automated design flow was developed for mapping applications onto the reconfigurable units of the platform. The conducted experiments illustrate that the speedups relative to an all-FPGA execution range from 2.33 to 6.42, close to the theoretical speedup bounds.
1 Introduction Reconfigurable architectures have been a topic of intensive research in the past few years. Reconfigurable hardware can merge the performance of ASICs with the flexibility offered by microprocessors [1]. Hybrid granularity reconfigurable systems [2, 3] offer extra advantages in terms of performance, power dissipation and flexibility for efficiently implementing computationally intensive applications, like DSP and multimedia, which are characterized by mixed functionality, data and control. Hybrid architectures usually consist of fine-grain reconfigurable units, typically implemented in Field Programmable Gate Array (FPGA) technology, coarse-grain reconfigurable units realized in ASIC technology, instruction-set microprocessor(s), and data and instruction memories. Certain parts of an application are better suited for execution on the coarse-grain reconfigurable units and others on the fine-grain units, due to the special characteristics of the heterogeneous (hybrid) reconfigurable units included in the system architecture. Fine-grain reconfigurable hardware can efficiently execute small bit-width operations, such as bit-level ones, since the granularity of the Configurable Logic Blocks (CLBs) of modern FPGAs is typically four or five bits. Furthermore, tasks with Finite State Machine functionality are also good candidates for implementation on the fine-grain reconfigurable hardware. Coarse-grain reconfigurable architectures have been mainly proposed for accelerating loop structures of multimedia and DSP applications in embedded systems. Their coarse N. Azemard and L. Svensson (Eds.): PATMOS 2007, LNCS 4644, pp. 352–362, 2007. © Springer-Verlag Berlin Heidelberg 2007
Performance Optimization of Embedded Applications
353
granularity greatly reduces the execution time, power consumption and reconfiguration time relative to an FPGA device, at the expense of flexibility [1]. These architectures consist of a large number of Processing Elements (PEs) with word-level data bit-widths (like 16-bit ALUs) connected by a reconfigurable interconnect network. This work considers a subset of coarse-grain architectures where the PEs are organized in a 2-Dimensional (2D) array and connected with mesh-like reconfigurable networks [4, 5]. In this paper, these architectures are called Coarse-Grain Reconfigurable Arrays (CGRAs). The main contribution of this paper is the extensive exploration of the performance improvements of embedded applications in a hybrid reconfigurable platform. Loops that execute slowly on the FPGA, characterized as kernels, are detected and accelerated on the coarse-grain reconfigurable hardware. The fine-grain reconfigurable unit executes the non-critical parts of the application segment mapped on the Reconfigurable Functional Units (RFUs) of the hybrid system architecture. Our study estimates, using an automated design flow, the performance of five real-world applications with respect to the size of the CGRA, the size of the FPGA and the reconfiguration time of the FPGA. The rest of the paper is organized as follows. The hybrid reconfigurable platform is presented in section 2. The design method is described in section 3. The experiments are presented in section 4 and section 5 concludes this paper.
2 Hybrid Reconfigurable Architecture

A general diagram of the considered hybrid reconfigurable system architecture, which mainly targets DSP and multimedia applications, is shown in Fig. 1. The system includes: (a) coarse and fine-grain reconfigurable hardware units for executing computationally intensive parts, (b) a shared system data RAM, (c) instruction and configuration memories, and (d) an instruction-set microprocessor. Together, the coarse and the fine-grain hardware units compose the RFU of the hybrid system.
Fig. 1. Hybrid system architecture
In this work, the coarse-grain reconfigurable hardware is a CGRA architecture, while the fine-grain one is realized by an FPGA. The microprocessor executes non-critical parts of an application and also controls both types of reconfigurable hardware by properly enabling the execution of application parts.
M.D. Galanis, G. Dimitroulakos, and C.E. Goutis
Communication between the coarse and fine-grain reconfigurable hardware takes place via the system's shared data memory. The microprocessor exchanges data with the RFU through the data memory and via a direct data bus. The configuration memory holds the configurations for the fine and coarse-grain reconfigurable hardware. Local data and configuration memories exist in each type of reconfigurable unit for quickly loading data and configurations, respectively. The targeted FPGA consists of a two-dimensional array of configurable logic blocks, executing bit-level operations, with a grid of interconnect lines running among them, as in commercial and academic FPGAs [8]. For the CGRA, a flexible template architecture is considered, which allows exploration with respect to various parameters. An overview of the considered CGRA template is shown in Fig. 2a. The Processing Elements (PEs) are organized in a 2-Dimensional array and each PE is synchronously connected to its nearest neighbours (Fig. 2a). Direct connections among all the PEs across a column and a row are also supported (Fig. 2b). A scratch-pad memory serves as a local data RAM for quickly loading data into the PEs. The PEs residing in a row or column share a common bus connection to the scratch-pad memory. A control unit manages the execution of the CGRA every cycle by defining the operations performed by the PEs and the loading/storing of data from/to the memory. Each PE contains one Functional Unit (FU), which can be configured to perform a specific word-level operation each cycle. Our previous work [7] provides more details on the CGRA.
Fig. 2. (a) Outline of a CGRA architecture, (b) Point-to-point connectivity
3 Design Flow

Fig. 3a shows a generic flow for mapping an application to a heterogeneous reconfigurable system. First, a hardware/software partitioning stage defines the parts to be executed on the processor and on the RFU. The software parts selected for execution on the RFU are loops, since these typically contribute the most to the execution time of an application. Existing design flows for reconfigurable systems likewise move loops to hardware to improve performance [6]. Our methodology focuses on mapping the loop code assigned to the RFU onto the fine and coarse-grain reconfigurable hardware to improve performance. The flow of the proposed RFU design methodology is illustrated in Fig. 3b.
In the RFU design flow, the loop code is initially mapped on the FPGA using the automated mapping process described in Section 3.1. The execution cycles on the FPGA are estimated using the high-level mapping phase of the mapper; the complete mapping flow for the FPGA is not enabled, as it is not necessary at this point. The loops are ordered in descending order of execution cycles. Note that loops containing operations not supported by the PEs of the CGRA (such as divisions and bit-level operations) are excluded from execution on the CGRA. An execution-cycles threshold is provided by the designer. Loops above the threshold are selected to execute on the CGRA, while the rest of the loops (non-critical ones) are executed on the FPGA. Performance is improved because the critical loops (kernels) are accelerated on the CGRA; this design choice leads to excellent speedups, as shown in other design flows for reconfigurable systems [6]. The non-critical loops are mapped on the FPGA using the flow presented in Section 3.1. The mapping of the kernels (critical loops) onto the CGRA architecture is performed by utilizing our previously developed software pipelining algorithm [7]. The mapper outputs the execution cycles and the CGRA configurations.
Fig. 3. General design flow for hybrid reconfigurable systems (a), Design flow for the RFU (b)
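The loop-ordering and threshold step of the RFU flow described above can be sketched as follows. The function, loop names and cycle counts are illustrative assumptions; the default threshold rule (half the cycles of the slowest loop) matches the choice used later in Section 4.2.

```python
# Sketch of the loop-ordering/threshold step of the RFU design flow.
# Loop names and cycle counts are hypothetical examples.

def partition_loops(loop_cycles, threshold=None):
    """Split loops into kernels (to the CGRA) and non-critical loops (to the FPGA).

    loop_cycles: dict mapping loop name -> estimated FPGA execution cycles.
    threshold:   cycle count above which a loop is a kernel; defaults to
                 half the cycles of the most time-consuming loop.
    """
    ordered = sorted(loop_cycles.items(), key=lambda kv: kv[1], reverse=True)
    if threshold is None:
        threshold = ordered[0][1] / 2
    kernels = [name for name, cyc in ordered if cyc > threshold]
    non_critical = [name for name, cyc in ordered if cyc <= threshold]
    return kernels, non_critical

# Hypothetical cycle estimates for the loops of one application:
cycles = {"loop0": 12000, "loop1": 9000, "loop2": 3000, "loop3": 500}
kernels, rest = partition_loops(cycles)
print(kernels)  # ['loop0', 'loop1'] -> mapped on the CGRA
print(rest)     # ['loop2', 'loop3'] -> mapped on the FPGA
```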
The separation of the application's loop code to be executed on the RFU into critical and non-critical parts defines the data communication requirements between the CGRA and the FPGA. The proposed design flow considers the communication time when calculating the execution time on the RFU. The communication between the two types of reconfigurable hardware, through the shared data memory of Fig. 1, is incorporated into the mapping procedures of the FPGA and the CGRA via the load and store operations that refer to the shared system data RAM.

3.1 FPGA Mapping Procedure

The diagram of the procedure for mapping a loop, described in C, on the FPGA is presented in Fig. 4. The mapping flow consists of three phases, described in the following. In the front-end phase, the CDFG of the loop is created using the SUIF2/MachineSUIF infrastructures. The developed high-level mapping phase utilizes a hierarchical CDFG for modeling data and control-flow dependencies. Then, optimizations are automatically applied to the CDFG, like dead code elimination, common sub-expression elimination and constant propagation. The considered high-level mapping procedure for the FPGA is based on a temporal partitioning method. Temporal partitioning resolves the hardware implementation of
an application that does not fit into the FPGA by time-sharing the device such that each partition fits in the available hardware resources, i.e. the CLBs of the FPGA. The partitioned application is then executed by time-sharing the device such that the initial functionality is maintained. Time-sharing is achieved through dynamic reconfiguration of the device, a mechanism supported by modern FPGAs, both commercial and academic [8]. A temporal partitioning procedure results in the concept of virtual hardware. The optimized CDFG is input to the high-level mapping phase. The second input to the high-level mapper is the model of the FPGA. The fine-grain reconfigurable hardware is abstracted by the mapping algorithm: the algorithm considers the total number of logic blocks in the FPGA, the size in logic blocks of the operators (e.g. a multiplier) present in the CDFG, and the execution (propagation) delays of the operators. This abstraction of the FPGA makes the high-level mapper retargetable with respect to the FPGA. The considered temporal partitioning algorithm classifies the nodes (operations) of the input DFG according to their ASAP levels. The ASAP levels expose the parallelism hidden in the DFG, i.e. all DFG nodes with the same level can be considered for parallel execution. The approach followed is that the nodes are executed in increasing order of their ASAP levels. Such an approach also exploits the maximum operation parallelism of the input DFG, which leads to the fastest possible execution on the FPGA. For the temporal partitioning of the DFG, the ASAP level of each DFG node ui, the FPGA area in logic blocks - size(ui) - occupied by each node, and the area AFPGA available for mapping the DFG operations on the FPGA are considered. The algorithm traverses the nodes of the DFG, level by level, and assigns them to partitions. The DFG nodes are assigned to partitions numbered 1 and beyond.
All the nodes from level 1 to the maximum level of any node in the DFG are traversed. Nodes of the same ASAP level are placed in a single partition. If the available area in the FPGA is exhausted, the remaining nodes are assigned to the next partition. If all the nodes in the current ASAP level have been assigned to a partition, the next level's nodes are considered for that partition. As size(ui) and AFPGA are parameters, the temporal partitioning algorithm is retargetable to the type of FPGA.
Fig. 4. Diagram of the mapping flow for FPGAs
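The level-by-level temporal partitioning described above can be sketched in a few lines. The node sizes, ASAP levels and A_FPGA value below are illustrative assumptions; since the sizes and the available area are parameters, the routine is retargetable to the FPGA type, mirroring the algorithm's retargetability.

```python
# Minimal sketch of the level-by-level temporal partitioning algorithm.
# All inputs are hypothetical examples.

def temporal_partition(nodes, asap_level, size, area_fpga):
    """Assign DFG nodes to temporal partitions numbered 1, 2, ...

    nodes:      iterable of node ids
    asap_level: dict node -> ASAP level (1-based)
    size:       dict node -> area in logic blocks, size(ui)
    area_fpga:  total logic blocks available (A_FPGA)
    """
    partition_of = {}
    current, used = 1, 0
    max_level = max(asap_level[u] for u in nodes)
    for level in range(1, max_level + 1):
        for u in (n for n in nodes if asap_level[n] == level):
            if used + size[u] > area_fpga:  # area exhausted: open next partition
                current += 1
                used = 0
            partition_of[u] = current
            used += size[u]
    return partition_of

# Four hypothetical nodes on two ASAP levels, FPGA of 100 logic blocks:
levels = {"a": 1, "b": 1, "c": 2, "d": 2}
sizes = {"a": 60, "b": 60, "c": 30, "d": 30}
print(temporal_partition(["a", "b", "c", "d"], levels, sizes, 100))
# {'a': 1, 'b': 2, 'c': 2, 'd': 3}
```

Note how node "c" (level 2) joins partition 2 once level 1 is fully assigned, exactly as the algorithm fills a partition with next-level nodes when space remains.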
The generated temporal partitions are input to a high-level synthesis process for determining the execution time of each partition. For minimizing the execution time of each partition of the input CDFG, the As Soon As Possible (ASAP) scheduling algorithm was implemented. This type of scheduling can be performed since the temporal partitioning algorithm does not consider resource sharing and all the operations of a partition fit on the FPGA. An input to the ASAP scheduler is the user-defined clock period. Typically, the clock period of synthesized designs is set to accommodate the delay of a functional unit, for example a 16-bit addition. The developed ASAP scheduler is extended to support scheduling of multi-cycle operations and operation chaining, achieving an efficient schedule regardless of the chosen clock period. The high-level mapping procedure outputs the scheduled temporal partitions and the overall execution time of the partitioned loop. Note that using different optimizations on the loop's CDFG in the front-end phase can lead to different execution times; a feedback script is therefore included in the mapping flow for optimizing the performance of loops executed on the FPGA. The synthesized temporal partitions, once an optimized performance has been achieved by the high-level mapper, are input to the low-level mapping phase. The data-paths of the partitions are translated to Register-Transfer Level (RTL) VHDL. After correct RTL simulation, logic and layout synthesis is performed on each temporal partition and the bitstreams for programming the FPGA are produced. For logic and layout synthesis, commercial or academic tools can be utilized; for example, Synplify Pro can be used for logic synthesis, whereas the Xilinx ISE toolset or the academic VPR can be utilized for layout synthesis. Data memories are used for storing the input and output values among the temporal partitions.
For example, local data memories embedded in the FPGA can be used. For each partition, full reconfiguration of the FPGA is performed.
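The extended ASAP scheduler with operation chaining and multi-cycle operations can be sketched as follows. This is a hedged illustration, not the authors' implementation: the DAG, the delays (in ns) and the clock period are invented examples; chaining means an operation starts mid-cycle if it still fits before the cycle boundary, and an operation longer than the clock period simply spans several cycles.

```python
# Illustrative ASAP scheduler with operation chaining and multi-cycle
# operations. All names, delays and the clock period are assumptions.

def asap_schedule(preds, delay, clock):
    """Return (start_time, finish_time) in ns for each operation.

    preds: dict op -> list of predecessors, given in topological order
    delay: dict op -> propagation delay in ns
    clock: user-defined clock period in ns
    """
    start, finish = {}, {}
    for op in preds:
        ready = max((finish[p] for p in preds[op]), default=0.0)
        cycle = int(ready // clock)
        slack = (cycle + 1) * clock - ready  # time left in the current cycle
        if delay[op] <= slack:
            start[op] = ready  # chained with its predecessors in this cycle
        else:
            # does not fit in the remaining slack: start at the next clock
            # edge (a delay > clock then simply spans multiple cycles)
            start[op] = ready if ready == cycle * clock else (cycle + 1) * clock
        finish[op] = start[op] + delay[op]
    return start, finish

# 10 ns clock; two chained 4 ns adds followed by a 12 ns (multi-cycle) multiply
preds = {"add1": [], "add2": ["add1"], "mul": ["add2"]}
delay = {"add1": 4.0, "add2": 4.0, "mul": 12.0}
start, finish = asap_schedule(preds, delay, clock=10.0)
print(start)   # both adds chained in cycle 0; the multiply starts at the 10 ns edge
print(finish)  # the multiply finishes at 22 ns, occupying two cycles
```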
4 Experimental Results

4.1 Set-Up

The five real-world DSP applications, described in the C language, used in the experiments are given in Table 1. A brief description of each application is given in the second column, while the third one presents the input sequence used.

Table 1. Benchmark applications

Application   Brief description                 Inputs
OFDM trans.   IEEE 802.11a OFDM transmitter     4 payload symbols
Cavity det.   Medical imaging technique         640x400 byte image
Compressor    Wavelet-based image compressor    512x512 byte image
QSDPCM        Video compression technique       2 frames of 176x144 bytes each
JPEG enc.     Still-image JPEG encoder          256x256 byte image
In the performed experiments, a hypothetical hybrid platform, assumed to be realized in a 130 nm CMOS process, is considered. The embedded microprocessor of the
system is an ARM926EJ-S clocked at 266 MHz. For the first FPGA (FPGA1), the area equals 3000 logic blocks, i.e. AFPGA=3000. The second FPGA (FPGA2) is composed of 800 logic blocks (AFPGA=800). For estimating the delay and the area of basic operators, it is assumed that the architecture of the FPGA resembles the fine-grain reconfigurable logic of the Xilinx Virtex-II Pro family; we note that Virtex-II Pro devices are fabricated in a 130 nm process technology. The full reconfiguration of each of the two FPGA devices is assumed to last 5 clock cycles, as in the case of the FPGA part of the Garp architecture [8]. The clock frequency of the FPGA logic is set to 100 MHz for executing the application loops. The clock period is defined such that 16-bit addition/subtraction operations have unit execution delays. By performing logic and layout synthesis of temporal partitions of application loops - described in RTL VHDL - on a XC2VP4 with Synplify Pro 7.3 and Xilinx ISE 6.1i, respectively, we found that a clock frequency of 100 MHz can accommodate the propagation delay of a 16-bit adder/subtractor, the time required to transfer data values to/from the registers of the synthesized data-paths, and the propagation delay of a register. We decided to clock the FPGA at 100 MHz since, in the application loops mapped on the RFU, most operations were 16-bit additions/subtractions. Two different CGRA architectures were used for accelerating the critical kernels. The first architecture is a 4x4 array of PEs (CGRA1), while the second consists of 36 PEs connected in a 6x6 array (CGRA2). In both architectures, the PEs are directly connected to all other PEs in the same row and column through vertical and horizontal interconnections. The data-width throughout the CGRA is 16 bits. The FU in each PE can execute any supported operation in one clock cycle.
Two buses per row are dedicated to transferring data to the PEs from the scratch-pad memory. Each bus transfers one 16-bit word per clock cycle. We have synthesized the RTL VHDL description of a single PE, with the above-mentioned features, with Synplify ASIC using a 130 nm CMOS process. A minimum clock period of 5.1 ns was reported. We found that the critical path of the CGRA starts at the registered output of the CGRA control unit, which stores the central configuration pointer, passes through the local configuration RAM, and ends at the data output of the FU, i.e. at the PE's data register file. The FU dictates the critical path delay. Thus, since the FU remains the same for the 4x4 and the 6x6 CGRA, these architectures are expected to have similar critical path delays. A clock frequency of 150 MHz can be achieved for both CGRAs, as is the case in the experiments of this work.

4.2 Experimentation

The number of application loops selected for execution on the RFU part of the system is shown in the second column of Table 2. The rest of the code of the five applications is executed on the microprocessor of the considered hybrid SoC. The parts of the five applications to be executed on the RFU were defined by a hardware/software partitioning stage applied prior to our partitioning method, as shown in Fig. 3. We used a straightforward hardware/software partitioning approach in which loops contributing more than 5% to the total execution time of the application on the ARM926EJ-S were selected for execution on the RFU. The sum of the temporal partitions (TPs) of each application, as reported by the FPGA high-level
Table 2. Number of loops mapped on the RFU, number of temporal partitions and number of kernels executed on the CGRAs

Application   # of loops in RFU   Sum of TPs (FPGA1 / FPGA2)   # of Kernels
OFDM trans.   9                   11 / 15                      3
Cavity det.   9                   9 / 14                       4
Compressor    10                  11 / 15                      4
QSDPCM        11                  12 / 16                      4
JPEG enc.     8                   8 / 13                       3
mapping algorithm when the loops are mapped on the FPGA1 and the FPGA2, is given in the third column of Table 2. For the FPGA2 device there is a larger number of temporal partitions due to the smaller number of available logic blocks. The number of kernels for the considered applications' parts mapped on the RFU is given in the fourth column of Table 2. The threshold was set to half of the execution cycles of the most time-consuming loop; we found that this threshold selection contributed the most to the performance improvements for the considered applications. The kernels of the five applications are located in innermost loops and they consist of word-level operations that match the granularity of the PEs in the CGRA. We note that the detected loops represent critical loops for both FPGA1 and FPGA2.

The execution times and overall speedups for the five applications are presented in Table 3. TimeFPGA_total represents the execution time on the FPGA1 of the loops decided to be mapped on the RFU. Ideal sp. represents the speedup that would ideally be achieved, according to Amdahl's Law, if the application's kernels were executed on the CGRA in zero time. TimeRFU corresponds to the execution time when the kernels are executed either on the CGRA1 or on the CGRA2. All times are normalized to the software execution time of the loops on the ARM926EJ-S. Sp. is the estimated speedup, after utilizing the developed partitioning method, over the execution of the applications' loops on the FPGA. The estimated speedup is calculated as: Sp = TimeFPGA_total / TimeRFU.

Table 3. Execution times and speedups from using the partitioning methodology for an RFU composed of the FPGA1

                                          FPGA1 & CGRA1       FPGA1 & CGRA2
Application   TimeFPGA_total   Ideal sp.  TimeRFU    Sp.      TimeRFU    Sp.
OFDM trans.   0.0286           4.12       0.0079     3.60     0.0078     3.68
Cavity det.   0.0385           4.51       0.0095     4.07     0.0091     4.23
Compressor    0.0400           3.51       0.0128     3.12     0.0125     3.21
QSDPCM        0.0208           3.46       0.0072     2.90     0.0066     3.17
JPEG enc.     0.0313           2.57       0.0134     2.33     0.0132     2.37
Average       0.0318           3.63       0.0102     3.20     0.0098     3.33

From the results of Table 3, it is inferred that the execution time of the applications is significantly reduced when the CGRA accelerates kernels. When the RFU is
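The speedup arithmetic behind Table 3 can be checked in a couple of lines, here for the OFDM transmitter; the slight discrepancy with the tabulated 3.60 stems from the rounding of the normalized times.

```python
# Reproducing the Sp = TimeFPGA_total / TimeRFU calculation of Table 3
# for one application (OFDM trans.). Times are normalized to software
# execution on the ARM926EJ-S, as in the table.

time_fpga_total = 0.0286  # all RFU loops executed on the FPGA1
time_rfu = 0.0079         # kernels accelerated on the 4x4 CGRA (CGRA1)

sp = time_fpga_total / time_rfu
print(round(sp, 2))  # ~3.62, matching the reported 3.60 up to rounding
```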
composed of the FPGA1 and the CGRA1, the average estimated speedup equals 3.20. When the CGRA2 is used to accelerate the time-critical loops, the average speedup is 3.33. The larger speedups when the CGRA2 is used are due to the better kernel speedups. However, even though the kernel speedup is significantly improved in the CGRA2 case, the overall speedup increases only slightly, because the non-critical loops are still executed on the FPGA1. From Table 3, it is also inferred that the estimated speedups for each application are fairly close to the theoretical speedup bounds imposed by Amdahl's Law, especially when the RFU employs the CGRA2. Thus, the proposed partitioning method quite effectively utilizes the processing capabilities of the CGRAs, improving the overall performance of the applications' loops mapped on the RFU to near the ideal speedups.

4.2.1 Effect of the Number of Logic Blocks on Speedup

We have estimated the overall speedups when the FPGA2 device (AFPGA=800) is used in the RFU. These speedups are compared with the ones obtained when the FPGA1 is employed, which are given in Table 3. Fig. 5a shows this comparison when the CGRA1 accelerates the kernels, while Fig. 5b illustrates the improvements when the CGRA2 is used in the system. The average values of the speedups are also shown. From the presented results it is inferred that the speedup increases when a smaller FPGA is utilized in the system. More specifically, for the 4x4 CGRA based systems, the speedup increases on average by 41% when the FPGA2, instead of the FPGA1, maps the non-critical loops. The average increase equals 43% for the 6x6 CGRA based systems. The speedup increase is due to the fact that the kernels are executed more slowly on the FPGA2 than on the FPGA1, owing to the larger number of temporal partitions (presented in the third column of Table 2), which increases the execution time of a loop.
We mention that the achieved speedups, relative to the execution on the FPGA2, are quite close to the ideal speedups, as in the case of the FPGA1.
Fig. 5. Speedup comparison for the FPGA1 and the FPGA2 when kernels are accelerated (a) on the 4x4 CGRA and (b) on the 6x6 CGRA
4.2.2 Effect of the FPGA Reconfiguration Time on Speedup

We have performed an experiment with respect to the reconfiguration time of the FPGA. We have assumed two different times for the full reconfiguration of the FPGA: (a) 5 cycles, which was the main case in our experiments, and (b) 1,000,000 FPGA cycles, which is on the order of milliseconds, as in commercial FPGAs. The CGRA1 accelerates the critical kernels. Fig. 6a illustrates the speedups for the two reconfiguration time scenarios when the FPGA1 is used, while Fig. 6b shows the speedup comparison when the FPGA2 is employed.
Fig. 6. Speedup comparison for two different reconfiguration times for (a) AFPGA=3000 and (b) AFPGA=800. The kernels are accelerated on the 4x4 CGRA.
Comparing the speedups for these two FPGA reconfiguration times, it is inferred that even for small reconfiguration times (which can occur in academic FPGAs [8]), the speedups achieved are not significantly smaller than the ones reported with a more realistic FPGA reconfiguration time. More specifically, for the FPGA1-based systems the average speedup equals 3.55 for the reconfiguration time of 10^6 cycles, whereas for the case of 5 cycles per reconfiguration the average speedup is 3.20. Thus, an average increase of 11% is reported for the larger reconfiguration time. For the FPGA2 systems the average increase is slightly larger and equals 14%. The speedups for the reconfiguration overhead of 10^6 cycles are not significantly higher than in the 5-cycle case, since most of the loops fit in one temporal partition, especially when mapped on the FPGA1. Thus, the execution time on the FPGA is affected less than it would be with a larger number of temporal segments, and consequently the kernel speedup gain is smaller. The sum of temporal partitions is larger for the FPGA2 device, as shown in Table 2, and this is the reason for the slightly larger increase, relative to the 5-cycle scenario, compared with the FPGA1-based systems.
5 Conclusions

An analytical study of the performance improvements obtained by executing kernels on coarse-grain reconfigurable hardware, relative to an all-FPGA execution, was presented. The average performance improvement is 3.33 relative to the execution on an FPGA of 3000 logic blocks. The speedup is larger, 4.75 on average, when the coarse-grain reconfigurable hardware is coupled with a smaller FPGA. Experiments with different CGRA sizes and FPGA reconfiguration times were also performed.
Acknowledgements

This work was partially funded by the Alexander S. Onassis Public Benefit Foundation.
References

1. Todman, T.J., et al.: Reconfigurable computing: architectures and design methods. IEE Proc. Comput. Digit. Tech. 152(2), 193–207 (2005)
2. Kastner, R., et al.: Instruction Generation for Hybrid Reconfigurable Systems. ACM TODAES 7(4), 605–627 (2002)
3. Rauwerda, G.K., et al.: Mapping Wireless Communication Algorithms onto a Reconfigurable Architecture. The Journal of Supercomputing 30(3), 263–282 (2004)
4. Miyamori, T., Olukotun, K.: REMARC: Reconfigurable Multimedia Array Coprocessor. IEICE Trans. on Information and Systems, 389–397 (1999)
5. Singh, H., et al.: MorphoSys: An Integrated Reconfigurable System for Data-Parallel and Computation-Intensive Applications. IEEE Trans. Computers, 465–481 (2000)
6. Stitt, G., et al.: Energy Savings and Speedups from Partitioning Critical Software Loops to Hardware in Embedded Systems. ACM TECS 3(1), 218–232 (2004)
7. Dimitroulakos, G., Galanis, M.D., Goutis, C.E.: Exploring the Design Space of an Optimized Compiler Approach for Mesh-Like Coarse-Grained Reconfigurable Architectures. In: Proc. of 20th IPDPS, April 25-29, 2006, Rodos Island, Greece (2006)
8. Callahan, T.J., Hauser, J.R., Wawrzynek, J.: The Garp Architecture and C Compiler. IEEE Computer 33(4), 62–69 (2000)
The Energy Scalability of Wavelet-Based, Scalable Video Decoding

Hendrik Eeckhaut, Harald Devos, and Dirk Stroobandt

Ghent University, ELIS, Parallel Information Systems
Sint-Pietersnieuwstraat 41, 9000 Ghent, Belgium
[email protected]
Abstract. Scalable video allows a single video stream, or part of it, to be decoded at varying quality of service (QoS). Since the amount of calculation scales with the QoS, energy dissipation is expected to scale similarly. To investigate the relation between QoS and energy dissipation, we measured the energy dissipation of a scalable video decoder implementation on an FPGA. The measurements show how dissipation effectively scales with the QoS and indicate how energy can be saved by rescaling the QoS and reconfiguring the FPGA accordingly.
1 Introduction

Scalable video is a hot topic in the multimedia community. "Scalable" means that the quality of service (QoS), i.e. the image quality, frame rate, resolution and color depth of the decoded video, can be freely adapted without having to re-encode the video stream or having to decode the whole video stream if only a lower-quality version is required. Scalable video has advantages for both the server (the provider of the content) and the clients. On the one hand, the server scales well since it has to produce only one encoded video stream that can be broadcast to all clients, irrespective of their QoS requirements. On the other hand, the client (or the network) can easily adapt the decoding parameters to its needs. This way it is possible to optimize the use of the network, the display, the required processing power, the required memory, etc. Scalable video decoders have a very high complexity. Therefore, a real-time implementation of a scalable video decoder requires the use of specialised hardware that can provide sufficient computational power. Field Programmable Gate Arrays (FPGAs) [2] are a perfect fit for decoding scalable video because they can provide the required computational power. Moreover, they offer flexibility because they can be reconfigured each time the QoS requirements change. In the RESUME project (http://www.elis.ugent.be/resume), we explored hardware-accelerated scalable video by actually developing an FPGA implementation of a wavelet-based scalable video decoder. The real power of implementing a scalable video decoder in reconfigurable hardware is in the fact that different QoS requirements can be handled by different hardware instantiations, so that the hardware resources are truly scaled together with the scaling of the QoS. In this paper we study the relation between the energy dissipated by the decoder hardware and the delivered QoS. Our FPGA design was not primarily optimized for

N. Azemard and L. Svensson (Eds.): PATMOS 2007, LNCS 4644, pp. 363–372, 2007. © Springer-Verlag Berlin Heidelberg 2007
low power. Hence, when interpreting our results, we are more interested in the relative impact of scalability than in the absolute power figures. Moreover, since the implementation effort is so high, it is difficult to find comparable results. Unless indicated differently, all measurements in this paper were performed on the well-known reference video sequence 'Foreman'. Five GOPs (Groups Of Pictures), i.e. 76 frames, were encoded once at CIF (352 × 288 pixels) resolution and decoded at varying QoS.
2 System Overview

The algorithmic structure of the RESUME scalable video coder and decoder (codec) is shown in Figure 1 and is described in [4]. The encoder consists of a motion estimation step (ME) [6], which exploits the temporal redundancy in the video stream by looking for similarities between adjacent frames. To obtain temporal scalability, motion is estimated in a hierarchical way. This hierarchical temporal decomposition enables decoding of the video stream at different frame rates, because the decoder can choose up to which (temporal) level the stream is decoded. Each extra level doubles the frame rate. A discrete wavelet transform (DWT) [5] separates the spatial low-pass and high-pass frequency components. Each LL-subband is a low-resolution version of the original frame. The discrete inverse wavelet transform (IDWT) in the decoder can stop at an arbitrary level, resulting in resolution scalability. The wavelet entropy encoder (WEE) [4] is responsible for entropy encoding the wavelet-transformed frames. The frames are encoded bit layer by bit layer, yielding progressive accuracy of the wavelet coefficients (Figure 2), which influences the PSNR (Peak Signal to Noise Ratio) of the decoded frames. In the video community this is called quality scalability. The WEE consists of two parts: the model selector (MS) for statistical context modelling and the arithmetic encoder (AE) for the actual compression. Finally, the packetizer (P) packs all encoded parts of the video together into one bit stream. This video stream is adapted to the required QoS and transmitted to a decoder, which performs the inverse operations of the encoder.
Fig. 1. High-level overview of the video encoder and decoder
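Why each wavelet level yields resolution scalability can be illustrated with a toy LL-subband computation. The 2x2-averaging (Haar-like) filter below is a deliberate simplification of the actual filter bank used in the codec; the frame values are invented.

```python
# Illustrative sketch of resolution scalability: the LL subband of a
# (Haar-like) wavelet level is a half-resolution version of the frame.
# The 2x2 averaging is a simplification of the real analysis filters.

def ll_subband(frame):
    """Return the LL subband (half width, half height) of an even-sized frame."""
    h, w = len(frame), len(frame[0])
    return [[(frame[y][x] + frame[y][x + 1] +
              frame[y + 1][x] + frame[y + 1][x + 1]) // 4
             for x in range(0, w, 2)]
            for y in range(0, h, 2)]

frame = [[10, 10, 20, 20],
         [10, 10, 20, 20],
         [30, 30, 40, 40],
         [30, 30, 40, 40]]
print(ll_subband(frame))  # [[10, 20], [30, 40]]
```

A decoder that stops the IDWT one level early simply displays such an LL subband instead of the full-resolution frame.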
In order to move from a pure software version of the video codec to a hardware-accelerated HW/SW codesign, we refined the basic structure at the bottom of Figure 1 into the architecture shown in Figure 3. For demonstration purposes, we used as hardware platform an Altera PCI high-speed development board [1], equipped with a Stratix S60 FPGA and 256 MiB of DDR SDRAM, plugged into a standard PC with two monitors: one dedicated to displaying the decoded video, the other to interacting with the system. The main control over the decoder resides on the host PC. All decoder blocks, but the
Fig. 2. Quality scalability: decoding more bit layers gives a more accurate wavelet-transformed frame. The distortions of the images on the bottom row are slightly exaggerated for visual clarity.
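The bit-layer mechanism behind this quality scalability can be sketched in a few lines: keeping only the most significant bit planes of a wavelet coefficient yields a coarser approximation, which converges to the exact value as more layers are decoded. The coefficient value and the 8-bit depth below are illustrative assumptions.

```python
# Sketch of quality (PSNR) scalability via bit layers: decoding only the
# top bit planes of a coefficient gives a progressively refined value.
# The coefficient and bit depth are hypothetical examples.

def truncate_to_bit_layers(coeff, layers, depth=8):
    """Keep the `layers` most significant bit planes of a `depth`-bit value."""
    drop = depth - layers
    return (coeff >> drop) << drop

c = 0b10110101  # 181
for layers in (2, 4, 8):
    print(layers, truncate_to_bit_layers(c, layers))
# 2 128  -> coarse approximation
# 4 176  -> closer
# 8 181  -> exact (all bit layers decoded)
```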
Fig. 3. Overview of hardware architecture of the RESUME decoder. Continuous lines indicate direct (master) write transfers, dashed lines indicate DMA (slave) transfers.
depacketizer (DP), are implemented on the FPGA. Data is transferred from one block to another through the off-chip, but on-board, DDR SDRAM, because the FPGA does not have enough internal memory to store the intermediate results. The CPU drives the FPGA board over the PCI bus. The FPGA is fed by the DP on the CPU, which copies the encoded video data into the DDR memory, where it is consumed by the decoding pipeline. The pipeline consists of a wavelet entropy decoder (WED), an assembler (AS), an IDWT, a motion compensator (MC) and a color convertor (CC). After decoding the video, the resulting data is (DMA) transferred from the on-board DDR to a dedicated NVidia GeForce 5200 VGA card on the PCI bus, which displays the decoded frames. As can be seen in Figure 3, the software architecture (Figure 1) was substantially modified. The entropy decoder was split into two separate components, the WED and the AS. The WED no longer produces ready-made wavelet frames; instead it constructs individual bit layers of the wavelet frames. The AS was introduced to reconstruct the wavelet frames from the individual bit layers, and it substantially improves the use of the memory bandwidth. Functionally, the IDWT and the MC are identical to their software counterparts. The frames produced by the MC are in YUV format, while the visualization of the frames occurs in RGB mode. Therefore, the frames need to be converted from the YUV to the RGB color space; this is the responsibility of the CC.
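The conversion performed by the CC can be sketched per pixel as follows. The full-range ITU-R BT.601 coefficients below are a common choice for this conversion; the exact coefficients and fixed-point scheme used in the RESUME hardware are assumptions not stated here.

```python
# Illustrative YUV (YCbCr) to RGB conversion, as performed by the CC block.
# Full-range BT.601 coefficients are assumed; the hardware may differ.

def yuv_to_rgb(y, u, v):
    """Convert one full-range YUV pixel (0..255 each) to RGB, clamped to 0..255."""
    r = y + 1.402 * (v - 128)
    g = y - 0.344136 * (u - 128) - 0.714136 * (v - 128)
    b = y + 1.772 * (u - 128)
    clamp = lambda x: max(0, min(255, int(round(x))))
    return clamp(r), clamp(g), clamp(b)

print(yuv_to_rgb(128, 128, 128))  # mid-gray: (128, 128, 128)
```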
366
H. Eeckhaut, H. Devos, and D. Stroobandt

Table 1. Synthesis results of the video decoder

Component   #LE     #9×9   #18×18   #Regs   Mem      Clk
IDWT        19733   0      9        1978    395752   54
PCI         4284    0      0        1816    23568    65 (&66)
WED         4133    1      0        1716    107392   59
AS          2894    0      2        1402    65024    65
MC          2115    0      0        1112    25344    65
CC          1315    0      0        500     36894    65
DDR         1356    0      0        978     4608     65
DMA         767     0      0        313     16384    65
Others      7161    0      0        3448    0        65
Total       43758   1      11       13263   674966
#LE: number of logic elements, #9×9: number of 9-bit multipliers, #18×18: number of 18-bit multipliers, #Regs: number of 1-bit registers, Mem: bits of on-chip RAM, Clk: the clock frequency of the component in MHz. Others consists mostly of the Avalon Switch Fabric (Altera SOPC Builder), which interconnects the different blocks of the decoder and takes care of clock domain crossings. With the clock settings shown, the design decodes 26.5 lossless CIF frames/s.
The resulting hardware implementation achieves real-time, lossless decoding of CIF sequences (352 × 288 pixels) at 25 frames per second. Synthesis results using Quartus II 6.1 are shown in Table 1. The IDWT clearly takes most of the resources. This is because this component was not designed manually but generated, after loop transformations to improve the memory access pattern, with our CLooGVHDL back-end [3], which is still under active development. Measurements (Figure 4) show that the execution times of the different components scale similarly with regard to temporal or spatial scalability. PSNR scalability only influences the blocks that work on a bit-plane level, i.e., the WED, AS, and DP. The curves are not entirely smooth because the QoS adaptation algorithm is only a prototype with simple heuristics. Figure 4 suggests that our design is not real-time at the highest QoS settings: it takes 3.7 s to decode a 3.04 s sequence. This is only apparent, because the plots also include the latency, the time before the first frame is displayed (≈ 0.85 s); subtracting it leaves ≈ 2.85 s of actual decoding time, within the 3.04 s budget.
3 Energy Measurement Setup and First Results
Measuring the energy dissipation for decoding a video sequence on our hardware platform turned out to be non-trivial. The board was designed as a prototyping platform and has no special provisioning for current measurement of the FPGA power supplies. To measure the current through the FPGA, which is controlled and powered over the PCI interface, we used a PCI extender card (Sycard Technology, PCIextend 177) which allows us to monitor all power supplies. The PCI standard provides multiple power lines: 5V, 3.3V, VIO, VAUX, +12V and −12V. Tracking all these lines simultaneously would be very hard in practice. Fortunately, we found that only two supplies are actually needed to run our design: the 3.3V line for the FPGA and the 5.0V line for the DDR. The +12V
[Figure 4: three panels plotting execution time (s, 0–4) against temporal scalability (decoded frames per GOP, 1/16–16/16), spatial scalability (quarter/half/full resolution), and quality scalability (PSNR, 30–80 dB), with curves for Total, VGA, CC, MC, IDWT, AS, WED and DP.]
Fig. 4. Execution time for decoding the Foreman sequence for the three types of scalability
line could be disconnected since it is only used to power some indication LEDs. The other lines (VIO, VAUX and −12V) are not used by the board. Because the current through the 5.0V line is very small – only a few mA, even during decoding – its power dissipation is ignored. The remaining 3.3V line is measured with a current probe (Tektronix TCP202, 3% accuracy) connected to an oscilloscope (Tektronix TDS7104). The disadvantage of our approach is that we measure the dissipation of the entire board, not that of the FPGA alone, let alone that of the individual components inside the FPGA. The only option to achieve that would be a low-level power simulation. This is, however, practically infeasible due to the huge simulation times required (1 GOP = 16 frames = 0.64 s in real time) and the strong interaction between the hardware components, the shared memory and the control software. Since we are mainly interested in the relative impact of different QoS video settings on the energy dissipation, having to measure the dissipation of the entire board is not a serious limitation: the extra power dissipation of the board stays the same, irrespective of the specific QoS settings.
[Figure 5: (a) current (A, 1.8–3.2) vs. time (s, 0–7); (b) current (A, 2.0–2.7) vs. time (s, 1.640–1.655), with the Y, U and V segments marked.]
Fig. 5. (a) Current measured while decoding 5 GOPs at full quality. (b) Current drawn by the IDWT by transforming the three channels Y, U and V over 3 levels.
Figure 5(a) shows the current measured while decoding the Foreman sequence at full quality. When the decoder is not active, a steady-state current of 2 A is measured. After a start signal, the pipeline of the decoder has to be filled and only part of the decoder is active (1.2–2.2 s), which leads to a minor increase of the current. Once the pipeline is completely filled, all blocks can work in parallel, and the power dissipation reaches its maximum (2.2–5.3 s). Finally, one by one, the blocks finish their jobs and the dissipation drops back to the initial 2 A. To illustrate the accuracy of our current measurements, we plotted the oscilloscope trace of the inverse wavelet transform of one YUV frame in Figure 5(b). It is easy to recognize the processing of the Y frame followed by the U and V frames, the latter two with a quarter of the frame size and execution time. Within a frame, the different transformation levels are visible, also scaling with a factor of four. Zooming in further would reveal the transforms of individual lines of the wavelet frame.
Note that these plots display the power consumption of the entire board. As a consequence, they contain both the static and dynamic current drawn by the FPGA fabric as well as the current drawn by the I/O pins to access the external memory and PCI bus. As can be seen in Figure 5(a), the board draws 2 A when it is not decoding video and just waiting for input. This steady-state current (Iss) is mostly used for distributing the clock signal inside the FPGA, which is filled to more than 75%. To interpret this current value, we also measured the Iss of two other FPGA designs: a minimal design, and a design which only contains a PCI, DDR and DMA core (with the same settings as the video decoder design). The minimal design, which only drives the output pins to a constant value, uses just one logic element and invariably draws 1.2 A. The other design uses 8543 LEs (15%), 44560 memory bits, one PLL and one DLL, and draws a steady-state current of 1.3 A. The cores apparently draw about 0.1 A, and inserting our decoder components in the design leads to an extra dissipation of 0.7 A. The remaining 1.2 A is a combination of the power dissipation of the DDR memory chips, the clock oscillators, the voltage regulator circuits, glue logic, etc. Because we are only interested in the energy needed for the actual video decoding, the steady-state current is ignored in the remainder of this paper.
4 Energy Scalability
Measuring the instantaneous current enables us to determine the amount of power and energy needed to decode a video sequence. Energy (E) and power (P) are calculated from the measured current i(t) by

  in(t) = i(t) − Iss                                  (1)
  P(t)  = Vsource × in(t) = 3.3 V × in(t)             (2)
  E     = ∫ P(t) dt ≈ Σi P(ti) Δt ,                   (3)
where Δt is the sampling period of the oscilloscope. To explore the impact of quality scalability on energy dissipation, we automated the energy measurement process to measure the energy dissipation for different quality settings. The results for decoding 5 GOPs of the Foreman and Mobile sequences are plotted in Figure 6. The energy dissipation clearly scales with the PSNR. Lossless decoding (∞ dB) dissipates almost twice the amount of energy needed for decoding at 30 dB. There is also a clear difference in dissipation between the different video sequences: the Mobile sequence needs at least 0.5 J more energy for decoding at the same quality. The curves are not perfectly smooth. This is not due to variance in the energy measurements (smaller than 7 × 10−5 J over 10 executions of this experiment), but to an imperfect bitrate reduction (= QoS adaptation) algorithm. To obtain a more detailed view of where the energy is actually dissipated in the video decoder, we also measured the energy dissipation per component. Unfortunately, it is not possible to measure this concurrently; we therefore have to measure the components one by one. This was accomplished by instrumenting the original decoder control software to log all hardware component instructions and replaying them later per component. The infrastructure to replay the commands is very similar to the original
[Figure 6: energy dissipation (J, 3–6.5) vs. PSNR (dB, 20–80) for the Mobile and Foreman sequences.]
Fig. 6. Energy as a function of PSNR for two different video sequences
control software, so its use does not jeopardize a comparison with the total energy dissipation measurements of Figure 6. To obtain relevant measurements, it is also necessary to use the same input data for each of the components as in the complete decoding pipeline. To that end, we modified the control software to never deallocate DDR memory, so that all intermediate results remain unmodified and available for replay. The results of this approach for decoding the Foreman sequence at different quality, resolution and temporal scalability settings are plotted in Figure 7. The impact of temporal scalability is fully in line with expectations: the energy scales directly with the number of decoded frames. The same applies to spatial scalability: the energy scales directly with the number of decoded pixels, i.e., four times smaller per level. The quality scalability plot is different. As expected, the WED scales significantly with increasing quality; lossless decoding dissipates almost ten times more energy than decoding at low quality. The impact on the AS is imperceptible: although more bit planes have to be fetched to decode at higher quality, the energy dissipation does not increase accordingly. Unexpectedly, the IDWT does scale with quality, even though its control flow, and hence the number of calculations, is constant. With increasing quality, a larger fraction of the wavelet coefficients becomes non-zero, so more registers toggle and energy dissipation increases. The IDWT is by far the largest energy consumer because it uses almost half of the system resources, due to its automatic generation. In the MC and CC, energy dissipation is not influenced by the quality and remains invariant. The sum of the energy of these components (E(WED) + E(AS) + E(IDWT) + E(MC) + E(CC)) differs from the total energy mostly because two components were not separately measured: the INPUT-step and the VGA-step.
The INPUT-step, the part of the DP-step that copies the encoded data from the host PC to the FPGA board, is expected to scale mildly with quality, since more data has to be transferred through the PCI and DDR cores. The energy dissipation of the VGA-step is expected to be invariant, similar to the CC. Another reason for the
[Figure 7: three panels plotting energy dissipation (J, log scale for the first two) against temporal scalability (decoded frames per GOP), spatial scalability (resolution) and quality scalability (PSNR, dB), with curves for Total, Sum, CC, MC, IDWT, AS and WED.]
Fig. 7. Energy dissipation for decoding the Foreman sequence for the three types of scalability
discrepancy between ‘Total’ and ‘Sum’ is the absence of interaction between the components when replaying them per component. When using all components concurrently in parallel, they compete to access the DDR. When quality increases, the data flow increases, resulting in more conflicting DDR requests, which leads to a slightly longer execution time and thus a slightly larger energy dissipation.
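The energy computation of Eqs. (1)–(3) from a sampled current trace can be sketched as follows. The trace values and the sampling period `DT` are hypothetical; `I_SS` (2 A) and `V_SOURCE` (3.3 V) follow the text.

```python
# Sketch: energy from a sampled current trace, following Eqs. (1)-(3).
# The trace values and DT below are hypothetical; I_SS and V_SOURCE
# follow the measurement setup described in the text.
I_SS = 2.0        # steady-state board current (A), measured while idle
V_SOURCE = 3.3    # voltage of the measured 3.3 V supply line (V)
DT = 1e-3         # oscilloscope sampling period (s), assumed

def energy(samples, dt=DT):
    """E = sum_i P(t_i) * dt, with P(t) = V_SOURCE * (i(t) - I_SS)."""
    return sum(V_SOURCE * (i - I_SS) * dt for i in samples)

trace = [2.0, 2.4, 2.8, 2.8, 2.4, 2.0]  # hypothetical current samples (A)
print(energy(trace))                    # joules above the steady-state baseline
```

Subtracting Iss before integrating is what restricts the result to the energy of the actual decoding, as done in the measurements above.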
5 Conclusions
In this paper we investigated the impact of using a wavelet-based, scalable video codec on energy dissipation. We found that the energy indeed scales gracefully with the QoS: more energy is needed to decode at higher quality, frame rate and resolution. Even though our video decoder was designed for speed and not for low power, our measurements led to some interesting conclusions. They indicate how energy could be saved by rescaling QoS settings on real low-power scalable video devices, and also suggest that reconfiguring the FPGA with one of several QoS-specific configurations could lead to significant power savings.
Acknowledgment This research is supported by the I.W.T., grant 020174, the F.W.O., grant G.0021.03, the GOA project 12.51B.02 of Ghent University and the Altera university program.
References
1. Altera: PCI High-Speed Development Kit, Stratix Pro Edition, 1.1.0 edn. (October 2005)
2. DeHon, A.: The density advantage of configurable computing. IEEE Computer 33(4), 41–49 (2000)
3. Devos, H., Beyls, K., Christiaens, M., Van Campenhout, J., D'Hollander, E.H., Stroobandt, D.: Finding and applying loop transformations for generating optimized FPGA implementations. In: Transactions on HiPEAC, LNCS 4050, 1(1), pp. 151–170. Springer, Heidelberg (2007)
4. Eeckhaut, H., Christiaens, M., Devos, H., Stroobandt, D.: Implementing a hardware-friendly wavelet entropy codec for scalable video. In: Proceedings of SPIE: Wavelet Applications in Industrial Processing III, Boston, USA, vol. 6001, pp. 169–179 (October 2005)
5. Munteanu, A.: Wavelet Image Coding and Multiscale Edge Detection – Algorithms and Applications. PhD thesis, Vrije Universiteit Brussel (2003)
6. Munteanu, A., Andreopoulos, Y., van der Schaar, M., Schelkens, P., Cornelis, J.: Control of the distortion variation in video coding systems based on motion compensated temporal filtering. In: International Conference on Image Processing (ICIP). IEEE, Los Alamitos (September 2003)
Direct Memory Access Optimization in Wireless Terminals for Reduced Memory Latency and Energy Consumption

Miguel Peon-Quiros1, Alexandros Bartzas2, Stylianos Mamagkakis3, Francky Catthoor3,4, Jose M. Mendias1, and Dimitrios Soudris2

1 DACYA/UCM, Avda. Complutense s/n, 28040 Madrid, Spain
2 VLSI Design Center – Democritus Univ. Thrace, 67100 Xanthi, Greece
3 IMEC vzw, Kapeldreef 75, 3001 Heverlee, Belgium
4 Also Professor at the Katholieke Universiteit Leuven, Belgium
Abstract. Today, wireless networks are becoming increasingly ubiquitous. Usually, several complex multi-threaded applications are mapped on a single embedded system, all triggered by a single wireless stream (which corresponds to the dynamic run-time behavior of the user). It is almost impossible to analyze these systems fully at design-time; therefore, run-time information must also be used to produce an efficient design. This introduces new challenges, especially for embedded system designers using a Direct Memory Access (DMA) module, who have to know in advance the memory transfer behavior of the whole system in order to design and program their DMA efficiently. In this paper, we propose a mixed hardware/software optimization at the system level. More specifically, we propose to adapt DMA usage parameters automatically at run-time based on online information. With our proposed optimization approach we reduce the mean latency of the memory transfers while optimizing energy consumption and system responsiveness. We evaluate our approach using a set of real-life applications and real wireless dynamic streams.
1 Introduction
Wireless communication between mobile devices is becoming very popular through the use of embedded system implementations. Wireless network streams are received from and transmitted to many different types of systems which encapsulate a wide range of applications (e.g., Voice over IP (VoIP), video codecs, etc.). The streams themselves vary in data rate and throughput at run-time according to the contents of the transmission and the behavior of the user who requested them. The most obvious example is two people talking over VoIP while surfing the Internet. The result is that the input changes frequently and unpredictably at run-time. Thus, the control and data behavior of each of the applications cannot be fully defined at design-time.
This paper is part of the 03ED593 research project, implemented within the framework of the “Reinforcement Programme of Human Research Manpower” (PENED) and co-financed by National and Community Funds (75% from E.U.-European Social Fund and 25% from the Greek Ministry of Development-General Secretariat of Research and Technology).
N. Azemard and L. Svensson (Eds.): PATMOS 2007, LNCS 4644, pp. 373–383, 2007. c Springer-Verlag Berlin Heidelberg 2007
374
M. Peon-Quiros et al.
Additionally, the applications running on embedded systems contain multiple threads of execution, while data are commonly shared or exchanged between them. Multi-threading means that memory accesses from different threads interleave in a fine-grained way. This introduces new challenges in the mapping of the aforementioned applications, because the different threads that make up the applications can no longer be analyzed independently. The implications for programming the Direct Memory Access (DMA) module are especially relevant, since it traditionally depends on a predefined, predictable behavior of the data transfers [1]. The role of the DMA in an embedded system is to access the memory for reading and/or writing independently of the microprocessor [2]. Therefore, DMA usage has become popular in embedded system designs that map a single, predictable application (e.g., a motion estimation algorithm [3]), because it enhances the processing speed of the system with minimal area and power consumption cost [4]. DMA is especially profitable in these systems because it can be applied not only to I/O data transfers but also to the explicit management of scratchpad memories. Nevertheless, traditional DMA programming cannot handle the unpredictable behavior of wireless network streams for multi-threaded applications, because the characteristics of the data transfers vary according to the current input. Thus, it is limited to I/O (e.g., in ring/circular buffers), to implementations where the behavior can be predicted, or to assuming a worst-case behavior bound. In this paper, we evaluate a wide range of wireless network streams and analyze the resulting memory transfers in a complex, multi-threaded environment. We can systematically deduce at design-time the most frequent scenarios and optimize the DMA programming only for them.
After exhaustive experiments, the minimum size of DMA transfers (a SW parameter) and the DMA memory word length (a HW parameter) are optimized only for the dominant scenarios. Finally, we propose the automatic selection of the optimized DMA parameters according to the scenario identified at run-time. We also introduce a number of tools to assist the designer in the proposed optimization process. With the proposed approach we achieve up to a 79.6% reduction in the mean latency of the memory transfers, and typically 16% compared to a best-effort design-time solution, without a penalty in memory energy consumption. The rest of the paper is organized as follows. In Sect. 2 related work is described. An overview of the proposed methodology is presented in Sect. 3 and the dominant scenarios are identified in Sect. 4. In Sect. 5 the profiling framework, DMA parameters and the proposed optimizations are presented. In Sect. 6 we show how DMA parameters can be selected at run-time. In Sect. 7 experimental results are presented that validate our proposals. Finally, in Sect. 8 conclusions are drawn.
2 Related Work The classical design for a simple DMA device was first introduced for the SAGE computer [5] and has always been the reference design. In [3, 6] DMA combined with a software pre-fetch mechanism exploits the a priori access pattern of multimedia applications. A DMA architecture for high data rate implemented on a TI C6x is proposed in [7]. A method for application specific DMA controller synthesis is presented in [1].
Direct Memory Access Optimization in Wireless Terminals
375
In [4] DMA engines are used to provide an efficient dynamic layout of data in memories. A run-time scratchpad management scheme is proposed in [8], where DMA engines reduce the cost of copying data to scratchpad memories. The main differentiator of this paper is its 2-phase approach: identifying the dominant scenarios at design-time and choosing the optimal parameters at run-time. Thus, the DMA adapts to the input stream, choosing optimal sizes for the minimum block transfer and the DMA word length. Our work is complementary to the scratchpad memory optimizations presented in [9, 10] and the energy optimizations for DMA presented in [11]. Finally, the work of different groups on workload characterization [12], scenario identification [13] and scenario exploitation [14] is very relevant in the context of our proposed approach. These works focus on defining the characteristics of the run-time situations that trigger specific application behavior with significant impact on the resource usage of the applications under study. We apply concepts related to theirs in the context of DMA design in embedded multi-threaded wireless systems.
3 Overview of the Proposed Methodology
Our proposed methodology consists of 4 steps (depicted in Fig. 1), conducted at design-time (steps 1 and 2) and at run-time (steps 3 and 4).
– STEP 1: At design-time, we select a number of representative traces of real wireless network streams and statistically analyze them with our tool, in order to evaluate which situations are most likely to occur at run-time.
– STEP 2: Our profiling framework logs every memory access of all concurrent applications for all the input traces. Then, an intelligent analyzer is used to compact individual memory references into potential data transfers. This step is of crucial importance, as it allows us to analyze the application of block-oriented optimizations, such as the use of DMAs, without having to modify the original source code, thus saving time at early design stages. Next, the memory transfer log is employed as input to our DMA simulator in order to evaluate the memory latency for different word lengths and for different thresholds of the minimum size of the DMA data transfers. These DMA parameters are pruned and only the ones with the lowest latency (for each of the proposed scenarios) are kept. The simulation phase may be repeated, starting from the log, as many times as needed to evaluate the different architecture proposals with the different dynamic inputs. Because application execution and profiling are cleanly separated from the architectural exploration, most of the processing overhead is removed and every solution can be evaluated against full user sessions in a few seconds.
– STEP 3: At run-time, the active scenario for the streaming data is periodically identified.
[Figure 1: design-time flow (concurrent multi-threaded apps → profiling framework → memory accesses log (R/W) → DMA exploration simulator → potential DMA memory transfers and word lengths; real wireless network streams → statistical analysis tool → dominant scenarios → optimal DMA parameters: memory transfers and word lengths) and run-time flow (streaming data → identify scenario → select correct DMA parameter for the running multi-threaded apps).]
Fig. 1. Overview of the proposed methodology
– STEP 4: According to the detected scenario and the information extracted at design-time, it is decided whether it is efficient to use the DMA for the memory transfers. If so, a lightweight monitoring system is used to determine the optimal value for the DMA word length from the ones precalculated at design-time.
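Steps 3 and 4 can be sketched as a periodic monitor that identifies the scenario from the recent stream and looks up precalculated DMA parameters. The parameter values in the table below are illustrative placeholders, not the measured optima; the 80-byte threshold comes from the scenario definition in Sect. 4.

```python
# Sketch: run-time selection of DMA parameters from a design-time table.
# Parameter values are illustrative placeholders, not the measured optima.
PARAMS = {
    1: {"use_dma": False, "min_block": None, "word_len": None},  # small packets
    2: {"use_dma": True,  "min_block": 32,   "word_len": 64},    # large packets
}

def identify_scenario(packet_sizes, threshold=80):
    """Step 3: classify the recent stream by its average packet size."""
    avg = sum(packet_sizes) / len(packet_sizes)
    return 1 if avg < threshold else 2

def select_dma_params(packet_sizes):
    """Step 4: look up the precalculated parameters for that scenario."""
    return PARAMS[identify_scenario(packet_sizes)]

print(select_dma_params([40, 40, 1500]))  # scenario 2 -> DMA enabled
```

The point of the table lookup is that nothing expensive happens at run-time: the costly exploration was done once at design-time, and the monitor only computes a running average and indexes the table.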
4 Step 1: Dominant Scenario Definition
In order to define the dominant scenarios, we use the 14 most active traces of a real wireless network as input to our system. The representativeness of these traces is assured by the fact that they were obtained from 18 different 802.11b network “sniffers” in 5 different buildings of the Dartmouth Campus [12]. The proposed approach is applicable to any collection of real network traces. We define as a user the joint stream of all the packets sent by all the applications originating from one IP address during a session. Each of the streams we have used has different characteristics. For example, the number of packets sent varies from 4,687 up to 59,800 and the total number of bytes sent varies from 14,822 up to 16,942,305. We statistically analyzed the streams with the use of our tool. It takes as input the whole network traces and produces multiple tables with statistical data such as the number of packets sent, bytes sent, average size of the packets, etc. The streams are dominated by two main types of packets (Fig. 2(a)): TCP acknowledgements (ACK) and packets whose size is equal to the Maximum Transmission Unit (MTU). ACK packets are sent quite often (79.8% of all packets). Packets with a size equal to the MTU make up only 1.03% of the total, but due to their size they dominate the use of network and memory bandwidth. The remaining 19.17% is distributed among other packet sizes.
[Figure 2: (a) Distribution of sizes (in bytes) among packets per wireless network stream, per user (8, 9, 21, 51, 52, 55, 58, 79, 80, 81, 86, 87, 91, 99). (b) Mean memory latency (cycles) while changing the minimum block size assigned to the DMA (B≥8 up to B≥1536, left to right), for DMA word lengths of 16 and 64.]
Fig. 2. Distribution of packet sizes and illustration of DMA parameter exploration
Based on the described nature of the wireless network streams and the statistical distribution of density of bytes per packet, it is natural to define two different scenarios for the system: – Scenario 1: less than 80 bytes per packet on average (users 8, 21, 79, 80, 86, 99). – Scenario 2: more than 80 bytes per packet on average (users 9, 51, 52, 55, 58, 81, 87, 91). We will further refine this scenario in Sect. 5.3.
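The scenario split above reduces to a one-line classifier over per-user trace statistics: the average bytes per packet compared against the 80-byte threshold. The user numbers and byte averages below are illustrative, not the actual trace statistics.

```python
# Sketch of the Step-1 scenario split: users averaging fewer than 80 bytes
# per packet fall into Scenario 1, the rest into Scenario 2.
THRESHOLD = 80  # bytes per packet, from the statistical analysis above

def classify(avg_bytes_per_packet):
    return 1 if avg_bytes_per_packet < THRESHOLD else 2

# Illustrative per-user averages (not the actual trace statistics):
users = {8: 52.0, 87: 640.0}
scenarios = {u: classify(b) for u, b in users.items()}
print(scenarios)  # {8: 1, 87: 2}
```

ACK-dominated users land in Scenario 1 and MTU-heavy users in Scenario 2, matching the bimodal packet-size distribution of Fig. 2(a).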
5 Step 2: DMA Parameters Definition in Multi-threaded Systems
5.1 Profiling Framework
We assume that the system and the platform on which it runs provide full support for multi-threading (e.g., using [15]) and that the system contains applications running concurrently, triggered by a wireless network stream. Each kernel is executed on its own independent thread and communicates asynchronously with the other threads. All the queues have locking mechanisms to ensure proper synchronization between threads; their functionality is explained in Sect. 7. We use a profiling method similar to [16] to trace all the accesses to data in the memory. The memory accesses that occur when a wireless network stream triggers the system are:
– When a thread receives a packet, it is removed from the queue in which it resided, thus freeing memory space for new packets. When a thread finishes processing the packet, the packet is written to the corresponding queue. For example, when the packet is formed, it is forwarded, according to its destination port, to the entry queue of an encryption thread or to the queue of an Internet checksum thread. If the destination queue is full, the thread waits until there is enough space (in the meanwhile no new packets are processed by that thread). The system works as a pipeline; thus, no packet is discarded once it is accepted by the first thread.
– Each thread works on specific packet fields, thus it performs different memory accesses. Part of this work can be performed while the DMA does other transfers.
As this is a multi-threaded system, memory accesses do not happen in a sequential manner. Many packets are alive at the same time during execution, so memory accesses happen concurrently from different application threads, meaning that no single access pattern and memory transfer behavior can be extracted and utilized at design-time.
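The queue-based pipeline described above can be sketched with blocking queues: a full queue blocks the producing thread, so no packet is discarded once accepted. The two-stage structure, queue capacity and per-stage processing are illustrative.

```python
# Sketch: two pipeline stages connected by a bounded, locking queue.
# A full queue blocks the producer, as described for the packet pipeline.
import queue
import threading

q = queue.Queue(maxsize=4)  # illustrative capacity
results = []

def producer(packets):
    for p in packets:
        q.put(p)        # blocks while the consumer's queue is full
    q.put(None)         # end-of-stream marker

def consumer():
    while True:
        p = q.get()
        if p is None:
            break
        results.append(p.upper())  # stand-in for per-stage processing

t1 = threading.Thread(target=producer, args=(["hdr", "payload"],))
t2 = threading.Thread(target=consumer)
t1.start(); t2.start(); t1.join(); t2.join()
print(results)  # ['HDR', 'PAYLOAD']
```

Because the stages run concurrently, the interleaving of their memory accesses depends on the input stream, which is exactly why no single access pattern can be fixed at design-time.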
5.2 DMA Word Length and DMA Minimum Block Transfer Size
The decision to use the DMA for a given data transfer is traditionally taken at design-time for all instances of that transfer and subsequently hard-coded in the application source code. Our analysis of multi-threaded systems shows that this full design-time approach is not optimal for current and future embedded systems based on wireless network streams. Moreover, it is not affordable to introduce a full run-time approach that would try to identify, for any given data transfer made during execution, whether to program the DMA to do it and issue other operations, while
waiting for completion, or to call a SW function like memcpy() to do the copy. Therefore, we propose an integrated HW/SW approach for designing and implementing data transfers in embedded systems, based on a design-time identification of the most dominant scenarios and the extraction of the optimal (precalculated) DMA parameter solution for each scenario. Based on the previous considerations we propose:
– In the HW part, a mechanism should be introduced so that, when it is not possible to expropriate the ownership of the bus, long DMA transfers can be momentarily interrupted to serve individual processor memory accesses and guarantee acceptable concurrency levels, with a parameterizable granularity. Thus, DMA operation should also be configured with the maximum number of bus cycles that a processor access can be held before interrupting the current DMA transfer. We will call this parameter the DMA word length.
– In the SW part, we propose a mechanism that decides whether or not to do the next transfer using the DMA, and with which DMA word length, according to the size of the block to be transferred and the type of input stream being processed at the time. These parameters are decided according to the current scenario and not hard-coded into the application source code.
5.3 Analysis of the Proposed HW/SW Modifications for the DMA
In our experiments (Fig. 2(b)), to check how the nature of the input affects system performance, we assume that the DMA engine can complete a memory transfer to the external DRAM in one cycle (one cycle to initiate the bus transfer and one cycle for each transferred word). We also assume that the processor can access the external DRAM in two cycles (each word is transferred independently and therefore the bus must be negotiated for each of them). Both cases are met under the constraint that burst transfer mode is used inside one DRAM row.
A DMA transfer moves block data from the DRAM to the internal SRAM, where it is accessed by the processor. However, the processor accesses scalar data directly in the DRAM when transferring them to the internal memory makes no sense (no locality). Two groups of input streams are clearly present: for one group, memory latency improves (lower values) when the DMA is used for more blocks (latency decreases towards the left), whereas for the other it improves when the use of the DMA is restricted to the bigger blocks (latency decreases towards the right). The rightmost points represent the situation where the DMA is not used at all; in this case the memory latency is simply that of the processor acting alone, without interference, roughly two cycles per memory access. As an example: – For user 86, the lowest memory latency is obtained when the DMA module is not used for any block transfer (this user has no packets larger than 40 bytes). – For user 87, the best configuration is to use the DMA module for all packet transfers (block size greater than 32 bytes) with a reasonably high DMA word length: 64 bytes already allows near-optimal latency without compromising system responsiveness. If small transfers are left to the processor, however, spatial locality is broken in such a way that latency worsens. Again, completely avoiding the use of DMA
brings the latency back to typical processor-only values (roughly two cycles). This case shows the importance of using system resources correctly. The results show how the mean latency can be improved by using the DMA module (latency lower than 2 cycles), but also the danger of increasing it sharply (well above 2 cycles) if this is not done in the right way. Figure 2(b) (top) shows that a wrong value for the DMA word length (processor able to interrupt the DMA every 16 cycles) can increase the mean latency up to seven cycles per memory access. This happens because interference between the processor and the DMA module forces too many unnecessary row activations in the DRAM. The proposed solution improves over these numbers when latency is critical, while still guaranteeing that latency never exceeds the base case of performing all memory accesses serially with the main processor alone. Additionally, we ran simulations varying the DMA word length so as to minimize DRAM row activations. The results depicted in Fig. 3(a) show how this effect translates directly into an even lower latency, a greatly reduced number of cycles spent by the DMA and the processor in the memory subsystem, and an appreciable reduction of the energy consumed by the memory modules. The figure shows the gains obtained when moving from the high-responsiveness scenario to the high-performance one, for the users that belong to scenario 2. The trade-off is an increased latency for memory accesses from the processor (which has to wait for long DMA transfers to complete), which in turn means worse reaction times to external events, such as user inputs or network signals, and reduced overall concurrency. To balance these two effects we introduce a run-time mechanism that chooses the right value for the DMA word length parameter according to the current input, effectively subdividing the second scenario into two.
– Scenario 2a: High responsiveness mode with a moderate value for the DMA word length (64 bus transfer cycles).
– Scenario 2b: High performance mode, when responsiveness can be sacrificed to achieve higher performance and lower energy consumption.

[Fig. 3(a) plots, per user (9, 51, 52, 55, 58, 81, 87, 91, and on average), the improvement (0-45%) of the high-performance scenario over the high-responsiveness one for five metrics: latency, memory energy, DMA cycles, processor cycles, and row activations.]

[Fig. 3(b) sketches the simulator organization: processor, DMA and local SRAM connected through a system bus and bus arbiter to the external DRAM, driven by a timing agent and a transfers log. The DRAM parameters used are:

Energy/access: 3.5 nJ
Energy activate/precharge: 10 nJ
CAS latency: 2 cycles
Precharge latency: 2 cycles
Active to read or write: 2 cycles
Write recovery: 2 cycles
Last data-in to new read/write: 1 cycle (individual transfers)
Max burst length: 1024 words]

Fig. 3. Comparison between DMA scenarios and organization of the simulator with its technological parameters
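The word-length mechanism proposed in Sect. 5.2, letting a pending processor access interrupt a long DMA transfer only at DMA-word-length boundaries, can be sketched as a small behavioral model (hypothetical helper, counting words rather than bus cycles for simplicity; not the simulator's actual arbiter):

```python
def run_transfer(block_words, dma_word_length, cpu_request_at):
    """Interleave one DMA block transfer with pending processor accesses.
    cpu_request_at: set of word indices at which a processor access arrives.
    The DMA yields the bus only after dma_word_length consecutive words."""
    schedule = []
    pending = 0   # processor accesses waiting for the bus
    burst = 0     # words moved since the DMA last yielded
    for word in range(block_words):
        if word in cpu_request_at:
            pending += 1
        if pending and burst >= dma_word_length:
            schedule.append("CPU")      # DMA momentarily interrupted
            pending -= 1
            burst = 0
        schedule.append("DMA")
        burst += 1
    schedule.extend(["CPU"] * pending)  # drain remaining processor accesses
    return schedule

# With word length 4, a request arriving at word 1 is served after word 4.
print(run_transfer(8, 4, {1}))
```

A small word length keeps the processor responsive at the cost of breaking DMA bursts (and, on DRAM, row locality); a large one does the opposite, which is exactly the trade-off explored in Fig. 3(a).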
Having identified these three scenarios, we investigated further to find the reason for this behavior. The answer lies in the different ways in which the processor and the DMA interact for each input stream: – In scenario 1, where many but small packets are sent, the memory transfers are so small that the improvement achieved by the DMA accessing data elements more efficiently than the processor is outweighed by the loss in spatial locality. If no DMA is used, the processor performs all accesses in sequential order; if the DMA is employed, accesses from both intermingle and the number of distinct DRAM row activations grows out of control. – In scenarios 2a and 2b, where mainly big packets are sent, memory transfers are big enough to take full advantage of the DMA capabilities.
6 Steps 3 and 4: Run-Time Scenario Identification and DMA Parameter Selection At run time, we propose to program the DMA with the correct parameters among those defined at design time. To this end, we propose a software monitor that periodically calculates the average occurrence rate of ACK and MTU packets in the stream data. According to the percentage of these packets, it decides whether the system is running in scenario 1 or in one of the sub-scenarios of scenario 2. A major change in the distribution of the packets means that the user is starting to behave differently. If scenario 1 is detected, the DMA module is not used and all memory transfers are performed by the processor. If scenario 2 is detected, DMA usage is enabled and run-time tests are performed in order to switch, if possible, from scenario 2b to 2a, because the latter makes the system more reactive to its input. After an application-specific time period spent in scenario 2b, the system switches to scenario 2a. While in scenario 2a, it monitors the mean memory latency to check whether it rises above a given threshold. If the memory latency remains low, the system can continue in this scenario, as the user's Quality of Experience will be better. If, however, the latency is detected to increase beyond acceptable levels, the system switches back to scenario 2b, increasing the DMA word length for every transfer and limiting the number of interruptions to DMA transfers. Keep in mind that too many interruptions of DMA transfers by the processor translate directly into more DRAM row misses (Fig. 3(a)) and hence into increased memory latency and, moreover, greatly increased energy consumption [17].
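The run-time monitor described above can be sketched as follows. The packet-size classification, the 50% threshold and the latency limit are illustrative placeholders, not values from the paper:

```python
# Sketch of the run-time scenario selector: classify the recent packet mix
# (ACK-sized vs MTU-sized) and pick the DMA configuration accordingly.
# ACK_SIZE/MTU_SIZE and both thresholds are assumed, illustrative values.

ACK_SIZE, MTU_SIZE = 40, 1500          # typical TCP ACK / Ethernet MTU, bytes

def detect_scenario(recent_packet_sizes, mean_latency, latency_limit=2.0):
    acks = sum(1 for s in recent_packet_sizes if s <= ACK_SIZE)
    if acks / len(recent_packet_sizes) > 0.5:
        return "1"                     # mostly small packets: do not use DMA
    # Mostly large packets: prefer the responsive sub-scenario 2a, falling
    # back to the high-performance 2b when mean latency degrades too much.
    return "2a" if mean_latency <= latency_limit else "2b"

print(detect_scenario([40] * 8 + [1500] * 2, 1.8))   # mostly ACKs -> "1"
print(detect_scenario([1500] * 9 + [40], 1.8))       # big packets -> "2a"
print(detect_scenario([1500] * 9 + [40], 3.5))       # latency too high -> "2b"
```

In the actual system this decision would run periodically, with hysteresis (the application-specific dwell time in 2b mentioned above) to avoid oscillating between sub-scenarios.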
7 Experimental Results In order to validate our approach, we have modeled in our framework the following concurrent threads, triggered by wireless streams in a Linux multi-threaded environment [15] and relevant for embedded systems: – Simulated VoIP, FTP and Web browsing activity (as measured in [12]). This thread sends the data for TCP/IP packet formation.
– TCP/IP packet formation (building the complete TCP/IP packet and filling in the header fields). This thread writes the new packet into the queue for encryption or the queue for TCP checksum, according to whether the connection is encrypted or not. – Encryption (packets that belong to an encrypted connection are processed with the DES algorithm). This thread accesses the data field of the packet in blocks of 8 bytes; after finishing the encryption work it pushes the packet into the queue for TCP checksum. – TCP checksum (calculated by applying the 16-bit one's complement sum to the whole TCP packet and the so-called "IP pseudo header"). Once the checksum field of the packet's header is filled, the packet is handed to the next queue. – Quality-of-Service manager and Deficit Round Robin (building a prioritized list of destinations). When a packet arrives at the subsystem, it is queued in one of the priority classes. Packets are extracted from these classes and forwarded to the network adapter according to a simplified Deficit Round Robin algorithm. With profiling data from this system and the simulator for DMA exploration we have built, we calculated the mean latency and energy consumption of memory accesses. These values were modeled according to the Micron PC100 specification (with CL=2) [18]. A state machine controls whether accesses must open a new row in the memory module or not. The simulator models a system with one processor, one simple DMA controller, one local SRAM (scratchpad) and one bus to access the external DRAM (Fig. 3(b)). Access to the bus is decided by a bus arbiter (with policies open for exploration). The DMA and processor modules must issue bus requests to the arbiter in order to gain access to the bus. Processor accesses to the local SRAM are always resolved with one-cycle latency. Since we are doing early-stage evaluation, the simulation model is not cycle-accurate but transaction-based.
Nevertheless, the core module in the simulator is the timing manager, which decides when, and to which element (processor or DMA), the next memory transfer should be assigned. This manager uses policies that are open for exploration to decide the assignment. Table 1(a) shows the reductions achieved with our proposal when compared to a fully design-time system that uses either the best-effort solution (the most efficient combination of word length and minimum transfer size for which the DMA is used) or the worst one (the designer failing to identify the right parameters for all cases), and to a system not using the DMA at all. With the proposed techniques it is possible to reduce the mean latency of memory accesses by up to 79.6% (for user 51). Table 2(a) shows that, on average, the reduction of the mean latency of memory accesses is 16% when using the high-performance mode and 7.8% when using the high-responsiveness one. In data-dominated systems, a reduced memory latency, together with the fact that the DMA and the processor can work at the same time, translates directly into improved system performance. Tables 1(b) and 2(b) show that the proposed technique also improves the energy consumption of the memory subsystem over the best-effort (on average 7.3% when using scenarios 1 and 2b, and 5.4% when using 1 and 2a) and worst (up to 52% for user 86) design-time decisions. However, it can also be observed that the energy consumption of the memory subsystem when the DMA module is never used is in all cases lower than when it is used. The reason is that the number of row activations is the
Table 1. Results on memory latency and energy consumption (our solutions in bold)
(a) Mean latency of memory accesses for all the simulated inputs (in cycles)

User | Best-Effort (D-T) | Scen.1 | Scen.2b Perform. | Scen.2a Respon. | Worst (D-T) | No DMA
08   | 2.74 | 2.23 | —    | —    | 7.40 | 2.23
09   | 2.30 | —    | 2.11 | 2.30 | 5.74 | 2.13
21   | 2.82 | 2.25 | —    | —    | 6.92 | 2.25
51   | 1.60 | —    | 1.33 | 1.60 | 6.52 | 2.06
52   | 2.17 | —    | 1.70 | 2.17 | 5.81 | 2.07
55   | 1.96 | —    | 1.72 | 1.96 | 5.64 | 2.10
58   | 1.70 | —    | 1.49 | 1.70 | 6.54 | 2.09
79   | 2.67 | 2.19 | —    | —    | 6.34 | 2.19
80   | 2.40 | 2.14 | —    | —    | 5.5  | 2.14
81   | 1.95 | —    | 1.77 | 1.95 | 5.64 | 2.11
86   | 2.98 | 2.13 | —    | —    | 7.44 | 2.13
87   | 1.51 | —    | 1.22 | 1.51 | 3.22 | 2.06
91   | 2.01 | —    | 1.71 | 2.01 | 5.29 | 2.12
99   | 2.49 | 2.15 | —    | —    | 5.98 | 2.15

(b) Total energy consumption of the memory subsystem (in mJ)

User | Best-Effort (D-T) | Scen.1 | Scen.2b Perform. | Scen.2a Respon. | Worst (D-T) | No DMA
08   | 5.92  | 5.33  | —     | —     | 10.35 | 5.53
09   | 18.46 | —     | 17.98 | 18.46 | 28.99 | 16.54
21   | 13.19 | 11.71 | —     | —     | 21.53 | 11.71
51   | 8.62  | —     | 8.35  | 8.62  | 16.54 | 8.26
52   | 2.17  | —     | 2.03  | 2.17  | 3.52  | 1.94
55   | 10.78 | —     | 10.44 | 10.78 | 17.70 | 10.05
58   | 28.40 | —     | 27.57 | 28.33 | 53.18 | 27.09
79   | 12.89 | 11.21 | —     | —     | 20.17 | 11.21
80   | 10.63 | 9.47  | —     | —     | 15.95 | 9.47
81   | 14.45 | —     | 14.17 | 14.45 | 23.7  | 13.50
86   | 4.46  | 3.67  | —     | —     | 7.64  | 3.67
87   | 29.85 | —     | 28.93 | 29.85 | 38.70 | 28.88
91   | 19.05 | —     | 18.19 | 19.05 | 29.62 | 17.70
99   | 11.12 | 9.80  | —     | —     | 17.39 | 9.80
Table 2. Average improvements

(a) Average latency improvement (our approach)

Scenarios    | Best-Effort D-T | Worst D-T | No DMA
Scen. 1 + 2b | 16%             | 68.4%     | 12.7%
Scen. 1 + 2a | 7.8%            | 65.5%     | 5.3%

(b) Average energy improvement (our approach)

Scenarios    | Best-Effort D-T | Worst D-T | No DMA
Scen. 1 + 2b | 7.3%            | 42.7%     | -1.7%
Scen. 1 + 2a | 5.4%            | 41.5%     | -3.9%
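As a quick arithmetic check, the headline figures quoted in the text follow directly from the Table 1 entries for user 51 (latency, worst design-time vs scenario 2b) and user 86 (energy, worst design-time vs scenario 1):

```python
# Reproduce the quoted reductions: our solution vs the worst design-time
# configuration, as a percentage of the worst case.

def reduction(worst, ours):
    return (worst - ours) / worst * 100

# User 51: worst-case latency 6.52 cycles, scenario 2b latency 1.33 cycles.
print(round(reduction(6.52, 1.33), 1))   # -> 79.6 (% latency reduction)
# User 86: worst-case energy 7.64 mJ, scenario 1 energy 3.67 mJ.
print(round(reduction(7.64, 3.67)))      # -> 52 (% energy reduction)
```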
lowest if only the processor accesses the DRAM (completely serialized transfers, less interference), so no energy is wasted in unnecessary row activations and precharges. However, the execution time increases (as the latency does) because all data transfers are performed by the processor, and the total energy consumption of the overall system also increases, since DMAs perform memory accesses more efficiently than general-purpose processors.
8 Conclusions We have proposed the use of scenarios, studied at design time and deployed at run time, to optimize DMA-assisted memory transfers. We have also introduced a number of tools (profiling of memory accesses, identification of relevant data transfers) to assist the designer in the proposed optimization process. With our approach we achieve up to a 79.6% reduction in the mean latency of memory transfers (typically 16% compared to a best-effort design-time solution) and allow the Quality of Experience manager of the system to choose between a high performance and a high
responsiveness mode (by locking the system into scenario 2b) when usage of the DMA engine is indicated. All this is achieved without any penalty in memory energy consumption; on the contrary, a slight reduction is even obtained, typically between 5.4% and 7.3%.
Acknowledgements This work is partially supported by the Spanish Government Research Grant TIN 20055619 and E.C. Marie Curie Fellowship contract HPMT-CT-2000-00031.
References
1. O’Nils, M., Jantsch, A.: Synthesis of DMA Controllers from Architecture Independent Descriptions of HW/SW Communication Protocols. In: Proc. of Conf. on VLSI Design (1999)
2. Blumrich, M., Dubnicki, C., Felten, E., Li, K.: Protected, user-level DMA for the SHRIMP network interface. In: Proc. of HPCA (1996)
3. Dasygenis, M., Brockmeyer, E., Durinck, B., Catthoor, F., Soudris, D., Thanailakis, A.: A combined DMA and application-specific prefetching approach for tackling the memory latency bottleneck. IEEE Trans. on VLSI Systems (2006)
4. Absar, M., Poletti, F., Marchal, P., Catthoor, F., Benini, L.: Fast and Power-Efficient Dynamic Data-Layout with DMA-Capable Memories. In: 1st Int’l Wksp on Power-Aware Real-Time Computing (2004)
5. Hennessy, J., Patterson, D.: Computer Architecture: A Quantitative Approach. Morgan Kaufmann, San Francisco (1990)
6. Kim, D., Managuli, R., Kim, Y.: Data Cache and Direct Memory Access in Programming Mediaprocessors. IEEE Micro (2001)
7. Comisky, D., Fuoco, C.: A Scalable High-Performance DMA Architecture for DSP Applications. In: Proc. of ICCD, IEEE Computer Society, Los Alamitos (2000)
8. Poletti, F., Marchal, P., Atienza, D., Benini, L., Catthoor, F., Mendias, J.: An integrated HW/SW approach for run-time scratchpad management. In: Proc. of DAC (2004)
9. Kandemir, M., Choudhary, A.: Compiler-directed scratch pad memory hierarchy design and management. In: Proc. of DAC, pp. 628–633. ACM Press, New York (2002)
10. Udayakumaran, S., Barua, R.: Compiler-decided dynamic memory allocation for scratch-pad based embedded systems. In: Proc. of CASES, pp. 276–286. ACM Press, New York (2003)
11. Pandey, V., Jiang, W., Zhou, Y., Bianchini, R.: DMA-aware memory energy management. In: Proc. of the 12th Int’l Symp. HPCA, pp. 133–144 (2006)
12. Henderson, T., Kotz, D., Abyzov, I.: The Changing Usage of a Mature Campus-wide Wireless Network. In: Proc. of MobiCom (2004)
13. Eeckhout, L., Vandierendonck, H., De Bosschere, K.: Quantifying the Impact of Input Data Sets on Program Behavior and its Applications. J. of Instruction-Level Parallelism (2003)
14. Gheorghita, S., Basten, T., Corporaal, H.: Intra-task scenario-aware voltage scheduling. In: Proc. of CASES, ACM Press, New York (2005)
15. Nichols, B., Buttlar, D., Farrell, J.P.: Pthreads Programming. O’Reilly (1996)
16. Poucet, C., Atienza, D., Catthoor, F.: Template-Based Semi-Automatic Profiling of Multimedia Applications. In: Proc. of ICME (2006)
17. Gomez, J., Marchal, P., Bruni, D., Benini, L., Prieto, M., Catthoor, F., Corporaal, H.: Scenario Based SDRAM Energy Aware Scheduling for Dynamic Multi-Media Applications on Multi-Processor Platforms. In: Proc. of WASP (2002)
18. Micron Technology, Inc.: 128Mb SDRAM, http://www.micron.com/dram
Exploiting Input Variations for Energy Reduction

Toshinori Sato¹ and Yuji Kunitake²

¹ System LSI Research Center, Kyushu University, 3-8-33-3F Momochihama, Sawara-ku, Fukuoka, 814-0001 Japan
[email protected]
² Graduate School of Computer Science and System Engineering, Kyushu Institute of Technology, 680-4 Kawazu, Iizuka, 820-8502 Japan
[email protected]
Abstract. Deep submicron semiconductor technologies will make worst-case design impossible, since they cannot provide the design margins it requires. Research should move towards typical-case design methodologies, where designers focus on typical cases rather than worrying about very rare worst cases. These methodologies make it possible to eliminate design margins as well as to tolerate parameter variations. We are investigating canary logic, which we proposed as a promising technique enabling typical-case design. Currently, we utilize canary logic for power reduction by exploiting input variations; a power-reduction potential of 30% in adders has been estimated through gate-level simulations. In this paper, we evaluate how effective canary logic is at reducing the power of the entire microprocessor and find a 9% energy reduction. Keywords: typical-case design, deep sub-micron, DVFS, dynamic retiming, reliable microarchitecture, robust microarchitecture.
1 Introduction Parameter variations are predicted to present critical challenges for manufacturability in future LSIs [2, 8, 17]. In deep submicron (DSM) semiconductor technologies, the traditional worst-case design will not work, since process variations increase the design margins it requires. The trend towards lower supply voltages and higher clock frequencies makes voltage and temperature variations more serious. In order to realize robust designs under these conditions, designers have to be aware of design for manufacturability (DFM). One of the keys to solving this serious problem is exploiting typical cases: since worst cases rarely occur, it is better for designers to focus on typical cases. We call this approach typical-case design methodologies. Recently, several typical-case designs have been investigated, such as Razor [4, 5], approximation circuits [11], constructive timing violation (CTV) [13], algorithmic noise tolerance (ANT) [15], and TEAtime [16]. This paper focuses on the Razor logic. We proposed canary logic, an improvement on the Razor logic. In our preliminary study [14], a power-reduction potential of 30% was found in the case of carry select adders. However, it is N. Azemard and L. Svensson (Eds.): PATMOS 2007, LNCS 4644, pp. 384–393, 2007. © Springer-Verlag Berlin Heidelberg 2007
unclear whether canary logic can reduce the energy consumption of an entire microprocessor. The aim of this paper is to evaluate this open issue. This paper is organized as follows. Section 2 introduces a typical-case design methodology. Section 3 describes related work, with an emphasis on the Razor logic. Section 4 proposes the canary logic. Section 5 explains our evaluation methodology and Section 6 presents experimental results. Finally, Section 7 concludes.
2 Typical-Case Design Methodologies DSM technologies increase variations, and hence the design margins that the traditional worst-case design methodology requires. The conservative approach will not work, so design methodology should be reconsidered for DFM. Typical-case design methodologies are among the promising alternatives. They exploit the observation that worst cases are rare: designers should focus on typical cases rather than worst cases. In the typical-case design methodologies, designers apply two methods to a circuit design at a time. One is performance-oriented design, where only typical cases are under consideration; since worst cases are not considered, design constraints are relaxed, resulting in easy designs. The other is function-guaranteed design: while worst cases are considered, designers do not have to consider performance. They only have to guarantee correct function, and thus the design can be simple, resulting in easy verification.
[Fig. 1 shows the typical-case design: the inputs feed a main part and a checker part in parallel; the main part produces the outputs, while the checker part raises an error signal.]

Fig. 1. Typical-Case Design
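The main/checker split of Fig. 1 can be illustrated with a software analogy (not the paper's circuit): the main part is fast but may err on rare inputs (here, a deliberately injected typical-case bug), while the checker only has to flag disagreement.

```python
# Software analogy of Fig. 1: an optimized 'main part' that may err, paired
# with a slow but trusted 'checker part' that raises an error signal.

def main_part(a, b):
    # Performance-oriented: imagine a speculative adder that drops the
    # carry out of the low byte (a deliberately injected typical-case bug).
    low = (a & 0xFF) + (b & 0xFF)
    high = (a >> 8) + (b >> 8)          # carry into the high part is lost
    return (high << 8) | (low & 0xFF)

def checker_part(a, b):
    return a + b                        # function-guaranteed reference

def run(a, b):
    result = main_part(a, b)
    error = result != checker_part(a, b)
    return result, error                # 'error' triggers recovery upstream

print(run(3, 4))        # typical case: correct result, no error
print(run(0xFF, 1))     # rare case: carry lost, error signal raised
```

As in the hardware scheme, the checker needs no performance tuning at all; it only has to be functionally correct, since the error signal hands recovery to a separate mechanism.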
The concept of a typical-case design methodology is as follows. Every critical function in an LSI chip is designed by two methods. The design consists of two components, as shown in Fig. 1: one is called the main part, and the other the checker part. While the two parts share the single function, their roles and implementations are mutually different. When designing the main part, performance is optimized, but correct function is not guaranteed; the main part might cause errors. That is, it is implemented by performance-oriented design. The checker part is provided as a safety net for the unreliable main part. It detects errors that occur in the main part, and thus it has to satisfy all design constraints in the chip. However, when designing the checker part, while designers have to guarantee the function, they need not optimize either performance or power. That is, it is implemented by function-guaranteed design. If an error is detected by the checker part, the circuit state has to be recovered to a safe point by some means.
3 Related Work Examples of typical-case designs include Razor [4, 5], approximation circuits [11], CTV [13], ANT [15], and TEAtime [16]. In the approximation circuits [11], instead of implementing the complete circuit necessary to realize a desired functionality, a simplified circuit that approximates it is implemented. The approximation circuit works at a higher frequency than the complete circuit does, and usually produces correct results. If it fails, the system utilizing the approximation circuit has to recover to a safe point. CTV [13] exploits input value variations. Considering that the critical path in the system is not always active, a clock frequency and supply voltage that violate the critical path delay are selected in use. In order to guarantee correct operation, a system utilizing CTV has a conservative circuit realizing the desired functionality to detect timing violations. In ANT [15], information-theoretic techniques are employed to determine lower bounds on energy and performance, and circuit- and algorithm-level techniques are evolved to approach these bounds. TEAtime [16] uses a tracking circuit to mimic the worst-case delay. As long as the tracking circuit works correctly, the clock frequency can be increased and the supply voltage can be decreased; usually, a 1-bit-wide critical path is used for the tracking circuit. 3.1 Razor Logic Razor [4, 5] permits timing constraints to be violated in order to improve energy efficiency. Razor works at a higher clock frequency than that determined by the critical path delay, and removes voltage margin for power reduction. The voltage control adapts the supply voltage based on timing error rates. Figure 2 shows Razor's dynamic voltage scaling (DVS) system. If the error rate is low, it indicates that the supply voltage could be decreased; if the rate is high, it indicates that the supply voltage should be increased. The control system works to maintain a predefined error rate, Eref.
At regular intervals the error rate, Esample, is computed and the rate differential, Ediff = Eref – Esample, is calculated. If the differential is positive, the supply voltage can be decreased; otherwise, it should be increased. In order to detect timing errors, the Razor flip-flop (FF) shown in Fig. 3 is proposed. Each timing-critical FF (main FF) has a shadow FF, to which a delayed clock is delivered so that its timing constraints are met. In other words, the shadow FFs are expected to always hold correct values. If the values latched in the main and shadow FFs do not match, a timing error is detected. When a timing error is detected in the microprocessor pipeline, the processor state is recovered to a safe point with the help of a mechanism based on counterflow pipelining. One of the difficulties of Razor is how to guarantee that the shadow FF always latches correct values: the delayed clock has to be carefully designed, considering the so-called short-path problem [5].
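The control loop of Fig. 2 reduces to a sign test on the rate differential. A behavioral sketch follows; the target rate, step size and voltage bounds are illustrative assumptions, not values from the Razor papers:

```python
def razor_dvs_step(vdd, e_sample, e_ref=0.01, step=0.05,
                   vmin=0.988, vmax=1.340):
    """One control interval of Razor's DVS loop: Ediff = Eref - Esample.
    A positive differential (error rate below target) allows the supply
    voltage to be decreased; otherwise it is increased.  Target rate,
    step size and voltage bounds here are illustrative assumptions."""
    e_diff = e_ref - e_sample
    vdd += -step if e_diff > 0 else step
    return min(max(vdd, vmin), vmax)

v = 1.2
v = razor_dvs_step(v, e_sample=0.001)   # few errors: Vdd steps down
v = razor_dvs_step(v, e_sample=0.200)   # error burst: Vdd steps back up
print(round(v, 2))                      # -> 1.2, back near the start
```

A real controller would low-pass-filter Esample and scale the step with |Ediff|, but the sign test alone captures the feedback principle.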
[Fig. 2 shows Razor's DVS system: the pipeline's error signals are summed into Esample, the differential Ediff = Eref - Esample is formed, and a voltage controller adjusts Vdd accordingly.]

Fig. 2. Razor’s DVS System

[Fig. 3 shows the Razor flip-flop: between pipeline stages, the main FF is clocked by clk and a shadow FF by a delayed clk; a comparator between them raises the error signal. Fig. 4 shows the canary flip-flop: a delay buffer feeds a canary FF clocked by the same clk, and a comparator raises the alert signal.]

Fig. 3. Razor Flip-Flop

Fig. 4. Canary Flip-Flop
4 Canary Logic While Razor is a smart technique for eliminating design margins, its circuit implementation can be further improved. 4.1 Canary Flip-Flop
Each FF in the design is augmented with a delay buffer and a canary FF, as shown in Fig. 4. Like a canary in a coal mine, the canary FF helps detect whether a timing error is about to occur. Timing errors are predicted by comparing the main FF value with that of the canary FF, which runs into the timing error a little before the main FF does. The alert signal triggers voltage or frequency control. Utilizing canary FFs has the following three advantages. - Elimination of the delayed clock: Using a single-phase clock significantly simplifies clock tree design. It also eliminates the short-path problem [5] of the Razor FF, and hence its minimum-path-length constraint need not be considered. - Protection against timing errors: As explained above, the canary FF protects the main FF against timing errors. This freedom from timing errors eliminates any complex recovery mechanism. The selector placed in front of the main FF is removed, so that some timing pressure is relaxed. Instead, the signal generated by the comparator triggers voltage or frequency control: if a timing error is alerted, the supply voltage stops falling or the clock frequency is lowered. - Robustness to variations: The canary FF is variation-resilient. The delay buffer always has a positive delay, even when parameter variations affect it. Hence, the canary FF always encounters a timing error before the main FF.
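The prediction mechanism can be modeled with two thresholds: the main FF fails only when the data-path delay exceeds the clock period, while the canary FF, seeing the data through the extra buffer, fails earlier. A sketch with illustrative delay values:

```python
def sample(path_delay, clock_period, buffer_delay):
    """Model of a main/canary FF pair: the canary sees the data delayed by
    the buffer, so it runs into a timing error before the main FF does.
    All delay values below are illustrative, not from the paper."""
    main_error = path_delay > clock_period                 # real failure
    alert = path_delay + buffer_delay > clock_period       # canary mismatch
    return main_error, alert

# Illustrative numbers (ns): period 1.0, canary buffer delay 0.1.
print(sample(0.85, 1.0, 0.1))   # (False, False): comfortably safe
print(sample(0.95, 1.0, 0.1))   # (False, True): alert before any real error
print(sample(1.05, 1.0, 0.1))   # (True, True): too late - DVS must back off
```

The middle case is the useful one: the alert fires while the main FF is still correct, giving the voltage/frequency controller time to react without any recovery mechanism.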
4.2 Power Reduction with Canary FFs
Figure 5 explains how DVS techniques utilize the canary FFs; the horizontal and vertical axes represent time and supply voltage, respectively. At regular intervals, the supply voltage is decreased if no timing error was predicted during the last interval. This is possible because the input values activating the critical path are limited to a few variations; for example, it has been reported that nearly 80% of paths have delays of half the critical time [18]. Timing errors rarely occur even if the timing constraints on the critical path are not satisfied, so the input value variations can be exploited to decrease the supply voltage. Because the supply voltage is lower than that determined by the critical path delay, significant power reduction is achieved in the canary logic, as in the Razor logic [4, 5]. When a timing error is predicted to occur, the supply voltage is increased.

[Fig. 5 plots supply voltage over time: the voltage ramps down step by step and is raised again at each alert.]

Fig. 5. Canary’s DVS
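The interval-based policy of Fig. 5 can be sketched as follows; the step size and voltage range are illustrative assumptions (the range reuses Table 1's span), not values from the paper:

```python
def canary_dvs(alerts, step=0.048, v0=1.340, vmin=0.988):
    """Interval-based DVS with canary alerts: lower Vdd one step after each
    alert-free interval, raise it again when an alert is predicted.
    Step size and voltage range are illustrative, not from the paper."""
    vdd, trace = v0, []
    for alerted in alerts:
        vdd = min(vdd + step, v0) if alerted else max(vdd - step, vmin)
        trace.append(round(vdd, 3))
    return trace

# Voltage ramps down while no alerts arrive, then steps back up on an alert.
print(canary_dvs([False, False, False, True, False]))
```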
After the first timing error is predicted, timing errors will be repeatedly alerted if the supply voltage is decreased again at the end of the current interval, as shown in Fig. 5. Since every supply-voltage switch makes the processor unavailable during the transition, this oscillation has a serious impact on performance and power efficiency. In order to prevent the oscillation, we use a counter that counts the number of alerts, and we stop decreasing the supply voltage when the count exceeds a predefined threshold. The number of clock cycles during which supply voltage reduction is stopped is determined so that the overhead due to the switches remains small. After this period passes, the DVS system begins to work again. 4.3 Canary FF Implementation via Scan Reuse
Scan resources, implemented for production testing, can be reused to realize the canary FF. Figure 6 shows a scan FF design [12] consisting of a system FF (the lower part of the figure) and a scan portion (the upper part). The SI input is connected to the SO output of the next scan FF so as to form a shift register. In test mode, clocks SCA and SCB are applied to shift a test pattern into latches LA and LB. Next, the UPDATE clock is applied to write the test pattern in LB into the system latch, PH1. Then the CLK clock is applied to capture the system response to the test pattern. After that, the CAPTURE signal is applied to move the contents of PH1 to LA. Last, clocks SCA and SCB are applied to shift the system response out. In system operation mode, latches LA and LB are not utilized. The canary FF can therefore be implemented at little hardware cost by reusing the latches in the scan portion. Figure 7 shows how reusing the scan FF design realizes the canary FF. The FF design's
[Figs. 6 and 7 show the scan cell and the canary scan cell: latches LA and LB form the scan portion above the system FF (latches PH1 and PH2); in the canary version, the scan latches are reused in system mode and a comparator produces the ALERT signal.]

Fig. 6. Scan Cell [12]

Fig. 7. Canary’s Scan Cell
test-mode operation is identical to that of the design in Fig. 6. In system operation mode, latches LA and LB hold replicas of PH2 and PH1, respectively. If no timing error occurs, the ALERT signal is low and the delayed version of D is written into LA. This reuse is possible because the canary logic does not require a delayed clock; it cannot be applied to the Razor logic.
5 Evaluation Methodology First, we show how the timing error rate is determined, and then describe the architectural-level simulation environment. 5.1 Timing Error Rates
We estimate the timing error rates of the entire microprocessor using a 32-bit carry select adder (CSLA), since the yield of the pipeline is mainly determined by timing errors in the execution stage [10]. Synopsys Design Compiler logic-synthesizes the CSLA with Hitachi 0.18 um standard cell libraries. The clock frequency and supply voltage combinations of the Intel Pentium M [7], shown in Table 1, are used: we project the highest clock frequency, determined by the CSLA's critical path delay reported by Design Compiler, onto the Pentium's highest clock frequency. In order to estimate how timing errors occur, we simulate the CSLA using the Cadence Verilog-XL simulator. Gate-level simulation results are shown in Fig. 8: it is observed that supply voltage reduction down to 1.18 V incurs few timing errors. We use these timing error rates in the architectural-level simulations explained in the next section. 5.2 Architectural-Level Simulation Environment
The SimpleScalar/PISA tool set [1, 3] is used for architectural-level simulation. Table 2 summarizes the processor configurations. Six integer programs from the SPEC2000 CINT

Table 1. Frequency-Voltage Specifications

F (GHz)   Vdd (V)
2.1       1.340
1.8       1.276
1.6       1.228
1.4       1.180
1.2       1.132
1.0       1.084
0.8       1.036
0.6       0.988
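For reference, the operating points of Table 1 can be kept as a small lookup table. The values below are transcribed from the table; the selection helper is our own sketch, not part of the paper's infrastructure:

```python
# The (frequency, Vdd) operating points of Table 1, highest to lowest.
# Values are from the table; the selection logic is our own sketch.
OPERATING_POINTS = [
    (2.1, 1.340), (1.8, 1.276), (1.6, 1.228), (1.4, 1.180),
    (1.2, 1.132), (1.0, 1.084), (0.8, 1.036), (0.6, 0.988),
]

def vdd_for_frequency(freq_ghz):
    """Lowest Table-1 supply voltage whose paired frequency still
    covers the requested clock frequency."""
    for f, v in reversed(OPERATING_POINTS):   # try the lowest Vdd first
        if f >= freq_ghz:
            return v
    raise ValueError("frequency above the highest operating point")

assert vdd_for_frequency(2.1) == 1.340   # top speed needs the top voltage
assert vdd_for_frequency(1.5) == 1.228   # next point at or above 1.5 GHz
```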
T. Sato and Y. Kunitake
Fig. 8. Error Rate - Vdd

Table 2. Processor Configurations

Clock frequency           2 GHz
Fetch width               8 instructions
L1 instruction cache      16K, 2 way, 1 cycle
Branch predictor          gshare + bimodal
gshare predictor          4K entries, 12 histories
Bimodal predictor         4K entries
Branch target buffer      1K sets, 4 way
Dispatch width            4 instructions
Instruction window size   128 entries
Issue width               4 instructions
Integer ALUs              4 units
Integer multipliers       2 units
Floating ALUs             1 unit
Floating multipliers      1 unit
L1 data cache ports       2 ports
L1 data cache             16K, 4 way, 2 cycles
Unified L2 cache          8M, 8 way, 10 cycles
Memory                    Infinite, 100 cycles
Commit width              8 instructions
benchmark are used. For each program, 1 billion instructions are skipped before the actual simulation begins. After that, each program is executed for 2 billion instructions. We evaluate three intervals between supply voltage scaling decisions: 100K, 1M, and 10M clock cycles. It is assumed that every supply voltage switch requires 10μs [6]. Two threshold values, 2 and 8, on the canary alert sequence are selected for stopping the supply voltage switching. The number of cycles during which supply voltage reduction is prohibited is determined so that the impact of switching on performance is less than 0.1%. For example, if the interval is 100K cycles and the threshold is 2, switching is prohibited until 400 intervals pass. As explained above, we model the timing error rate of the entire processor based on that of the CSLA. Since it is difficult to consider input value variations in architectural-level simulations, we assume that timing errors occur at random at the
error rate shown in Figure 8 at each supply voltage. For example, in the case of 164.gzip, 98.8% of dynamic ALU instructions encounter a timing error at 0.988V. Since our previous study on the CTV [9] found that performance results with and without consideration of actual input variations did not show any significant difference, we expect that the evaluation methodology in the present paper is sufficiently accurate for a preliminary evaluation. We are currently building the simulation environment proposed in [9] in order to improve the accuracy.
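The prohibition-period arithmetic described in Section 5.2 (a 10 μs switching cost, a 0.1% performance budget, and the 2 GHz clock of Table 2) reduces to a few lines. The helper below is our own reconstruction of that calculation:

```python
# Sketch of the prohibition-period arithmetic: amortize the cost of the
# wasted voltage switches so the total overhead stays below 0.1%.
# The 10 us switch cost and 0.1% budget are from the text; the 2 GHz
# clock is from Table 2.

CLOCK_HZ = 2e9          # processor clock (Table 2)
SWITCH_TIME_S = 10e-6   # cost of one supply-voltage switch [6]
MAX_IMPACT = 0.001      # allowed performance impact: 0.1%

def prohibition_intervals(interval_cycles, threshold):
    """Number of voltage-scaling intervals during which switching is
    prohibited once `threshold` alerts have stopped the down-scaling."""
    switch_cycles = SWITCH_TIME_S * CLOCK_HZ        # 20,000 cycles each
    # `threshold` switches were spent before stopping; spread their cost
    # over enough cycles that the overhead is below MAX_IMPACT.
    prohibited_cycles = threshold * switch_cycles / MAX_IMPACT
    return int(prohibited_cycles / interval_cycles)

# Matches the worked example in the text: a 100K-cycle interval with
# threshold 2 prohibits switching for 400 intervals.
assert prohibition_intervals(100_000, 2) == 400
```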
6 Results

Figure 9 presents which supply voltages are selected during each program's execution. For each program, the left three bars are for the case where the threshold is two, and the right ones for the case where it is eight. Within each group of three bars, the left, middle, and right bars indicate results for intervals of 100K, 1M, and 10M cycles, respectively. First, it is observed that only three voltages are selected although eight combinations of voltage and frequency are provided. In addition, the supply voltage of 1.276V is selected almost all the time. This means that the DVS system loses the chance to exploit lower supply voltages. Second, no considerable differences are observed among the three intervals between voltage scaling decisions. Since we determined the number of cycles during which supply voltage reduction is prohibited so that the impact of switching on performance is identical among the different intervals, the configurations with larger intervals perform voltage switching less frequently. Hence, the interval has little impact on supply voltage selection. Last, similarly to the second observation, no significant differences are found between the two configurations with different thresholds. The impact on performance is very small; the increase in execution cycles is at most 0.23%, although this is two times larger than we expected. We find that the oscillation of supply voltage switching is prevented. On the other hand, the configurations with a threshold value of 2 suffer slightly less performance penalty than those with a threshold of 8. Based on these observations, we expect that every configuration achieves a similar energy reduction. Figure 10 presents how energy consumption is reduced. The layout of the graph is the same as that of Fig. 9, and the right six bars indicate the averages. As can be expected,
Fig. 9. Breakdown of Supply Voltage
Fig. 10. Energy Reduction
there are no significant differences among the configurations evaluated in this study. Every configuration achieves approximately 9% energy reduction. These results imply that the energy reduction was attained simply by selecting 1.276V for the supply voltage instead of 1.340V. In order to exploit input value variations more efficiently, the adaptive DVS system should be improved.
7 Conclusions

As the complexity of the semiconductor manufacturing process increases, it is likely that process variations will become more difficult to control. Under these circumstances, DSM semiconductor technologies will make worst-case design impossible, since they cannot provide the design margins it requires. In order to attack this problem, we proposed the canary logic as an alternative to the Razor logic, which is a smart technique to eliminate design margins. The canary logic eliminates the delayed clock required by the Razor logic, resulting in an easier design. The canary FF relies on a delay buffer, which always has a positive delay, and hence it is variation resilient. In this paper, we utilized the canary logic for power reduction by exploiting input value variations. Since timing errors are expected to occur only rarely even if the timing constraints on the critical path are not satisfied, input variations can be exploited to decrease the supply voltage. Because the supply voltage is lower than that determined by the critical path delay, significant power reduction is achieved. From the detailed simulation results, we found a potential energy reduction of 9% without serious impact on performance. One future direction of this study is to build the entire processor at RT- or gate-level to evaluate the canary's usefulness more accurately. In particular, we are interested in how eliminating design margins for parameter, voltage, and temperature variations improves energy efficiency. We are also investigating ways to improve the robustness of microprocessors by utilizing the canary FF.

Acknowledgments. Hitachi 0.18um standard cell libraries are provided by VDEC (VLSI Design and Education Center) at the University of Tokyo. This work is partially supported by the CREST (Core Research for Evolutional Science and Technology) program of the Japan Science and Technology Agency.
References

1. Austin, T., Larson, E., Ernst, D.: SimpleScalar: an Infrastructure for Computer System Modeling. IEEE Computer 35(2) (2002)
2. Borkar, S., Karnik, T., Narendra, S., Tschanz, J., Keshavarzi, A.: Parameter Variations and Impact on Circuits and Microarchitecture. In: 40th Design Automation Conference (2003)
3. Burger, D., Austin, T.M.: The SimpleScalar Tool Set, Version 2.0. ACM SIGARCH Computer Architecture News 25(3) (1997)
4. Das, S., Sanjay, P., Roberts, D., Lee, L.S., Blaauw, D., Austin, T., Mudge, T., Flautner, K.: A Self-Tuning DVS Processor Using Delay-Error Detection and Correction. In: Symposium on VLSI Circuits (2005)
5. Ernst, D., Kim, N.S., Das, S., Pant, S., Rao, R., Pham, T., Ziesler, C., Blaauw, D., Austin, T., Flautner, K., Mudge, T.: A Low-Power Pipeline Based on Circuit-Level Timing Speculation. In: 36th International Symposium on Microarchitecture (2003)
6. Gochman, S., Ronen, R., Anati, I., Berkovits, A., Kurts, T., Naveh, A., Saeed, A., Sperber, Z., Valentine, R.C.: The Intel Pentium M Processor: Microarchitecture and Performance. Intel Technology Journal 7(2) (2003)
7. Intel Corporation: Intel Pentium M Processor on 90nm Process with 2-MB L2 Cache. Datasheet (2006)
8. Karnik, T., Borkar, S., De, V.: Sub-90nm Technologies: Challenges and Opportunities for CAD. In: International Conference on Computer Aided Design (2002)
9. Kunitake, Y., Chiyonobu, A., Tanaka, K., Sato, T.: Challenges in Evaluations for a Typical-Case Design Methodology. In: 8th International Symposium on Quality Electronic Design (2007)
10. Li, H., Chen, Y., Roy, K., Koh, C.-K.: SAVS: A Self-Adaptive Variable Supply-Voltage Technique for Process-Tolerant and Power-Efficient Multi-Issue Superscalar Processor Design. In: 11th Asia and South Pacific Design Automation Conference (2006)
11. Lu, S.-L.: Speeding up Processing with Approximation Circuits. IEEE Computer 37(3) (2004)
12. Mitra, S., Seifert, N., Zhang, M., Shi, Q., Kim, K.S.: Robust System Design with Built-In Soft-Error Resilience. IEEE Computer 38(2) (2005)
13. Sato, T., Arita, I.: Constructive Timing Violation for Improving Energy Efficiency. In: Benini, L., Kandemir, M., Ramanujam, J. (eds.) Compilers and Operating Systems for Low Power, Kluwer Academic Publishers, Dordrecht (2003)
14. Sato, T., Kunitake, Y.: A Simple Flip-Flop Circuit for Typical-Case Designs for DFM. In: 8th International Symposium on Quality Electronic Design (2007)
15. Shanbhag, N.R.: Reliable and Efficient System-on-chip Design. IEEE Computer 37(3) (2004)
16. Uht, A.K.: Going beyond Worst-case Specs with TEAtime. IEEE Computer 37(3) (2004)
17. Unsal, O.S., Tschanz, J.W., Bowman, K., De, V., Vera, X., Gonzalez, A., Ergin, O.: Impact of Parameter Variations on Circuits and Microarchitecture. IEEE Micro 26(6) (2006)
18. Usami, K., Igarashi, M., Minami, F., Ishikawa, T., Kanazawa, M., Ichida, M., Nogami, K.: Automated Low-Power Technique Exploiting Multiple Supply Voltages Applied to a Media Processor. IEEE Journal of Solid-State Circuits 33(3) (1998)
A Model of DPA Syndrome and Its Application to the Identification of Leaking Gates

A. Razafindraibe and P. Maurine
University of Montpellier II / LIRMM, 161 rue Ada, 34392 Montpellier, France
Abstract. Within the context of secure applications, side channel attacks are a major threat. The main characteristic of these attacks is that they exploit physical syndromes, such as power consumption, rather than Boolean data. Among all known side channel attacks, differential power analysis appears as one of the most efficient. This attack constitutes the main topic of this paper. More precisely, a design-oriented modelling of the syndrome (signature) obtained while performing the Differential Power Analysis of Kocher is introduced. As a validation of this model, it is shown how it allows identifying the leaking nets and gates during the logic synthesis step. The technology considered herein is a 130nm process.
1 Introduction

Although there are many side channel attacks, differential power analysis [1] appears as a major threat since it requires less equipment than other attacks, such as fault injection, to be successfully implemented. Because of this threat, many countermeasures have been proposed in former works. Among them, one can find techniques aiming at reducing the correlation between the power consumption and the data processed by introducing randomness within the circuit. Time randomization of the computations [2], random permutation of datapaths [3], and random data insertion [4] are some examples of countermeasures adopting this approach. A second approach also aims at reducing or masking all potential sources of correlation, but rather than introducing randomness in the circuit: smoothing the variations of the current flowing through the supply rails using ad-hoc on-chip circuits is one possible countermeasure [5], whereas using redundant logic, such as dual rail logic, is another technique adopting this second approach [6]. While many works have proposed countermeasures against differential power analysis, no effort has been devoted to the development of a physically oriented modelling of the DPA syndrome. More precisely, to our knowledge only little physical information related to what the DPA syndrome is can be found in the literature. This lack of design-oriented information is detrimental, since designers may only rely on their own experience to evaluate, before fabrication, the robustness of their design against DPA. In this paper, a design-oriented modelling of the DPA syndrome is introduced. This is the main contribution of the paper. To validate the aforementioned modelling, the

N. Azemard and L. Svensson (Eds.): PATMOS 2007, LNCS 4644, pp. 394–403, 2007. © Springer-Verlag Berlin Heidelberg 2007
latter is applied to identify, during the logic synthesis step, the critical gates in terms of DPA, i.e., the gates that contribute the most to the DPA syndrome. This application will lead us to define the concept of a critical gate. This is the second contribution of this work. The remainder of this paper is organized as follows. A brief review of the different differential power analyses available in the literature is given in section 2. The design-oriented modelling of the DPA syndrome is then introduced in section 3. Finally, the latter is applied in section 4 to the identification of critical gates, and a conclusion is drawn in section 5.
2 Different Differential Power Analyses

There is not only one differential power analysis but several. In this section we briefly sum up the basics of two of them. The first one is the differential power analysis introduced by P. Kocher in his seminal paper [1]. It will be denoted by DPA of Kocher in the remainder of the paper. The second one [2] is a generalisation of the attack introduced in [1] to a larger target. It will be denoted by multi-bit DPA subsequently. These two attacks constitute the historical approach to power consumption analysis. A second approach has been suggested in various papers [2,10,11], which proposes to use the correlation factor between the power samples and either the Hamming weight or the Hamming distance of the manipulated data to retrieve the secret key. Attacks falling within this second approach are not considered in the remainder of this paper.

2.1 Differential Power Analysis of Kocher
The differential power analysis introduced by P. Kocher in [1] is based on the fact that the power consumed by a ciphering circuit depends strongly on the manipulated data. This attack is usually performed in three steps: data collection, data sorting and data analysis. Data collection consists in sampling and recording the current flowing through the ground or supply pad of the circuit under attack. This is done for a large number of cryptographic operations, leading to an important collection of current or power traces. Data sorting consists in extracting, for each possible key guess kg, two sets of power traces. These two sets, S'0'kg and S'1'kg, are defined considering the expected value of the bit under attack. Let us assume that the bit z is the target of the differential power analysis of Kocher. In this case, S'1'kg (S'0'kg) contains all the power samples corresponding to input plain texts expected to force z to '1' ('0') according to the key guess kg. Data analysis consists in computing, in a first step, for all possible values of kg, the average power samples <S'0'kg> and <S'1'kg> of sets S'0'kg and S'1'kg. In a second step, the differences <S'0'kg> - <S'1'kg> are evaluated for all kg values, resulting in a collection of kg differential power traces. Among these differential traces, one corresponds to the correct secret key kr. The latter is usually, but not necessarily, disclosed by identifying the guess kg leading to the curve with the highest amplitude.
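The three steps above can be sketched end-to-end on synthetic data. The toy "cipher" (a 4-bit nonlinear target function), the single-sample leakage model and all names below are our own assumptions for the demonstration; only the attack structure follows the text:

```python
import random

# Toy demonstration of Kocher's DPA: data collection, data sorting and
# data analysis on synthetic single-sample "power traces". The S-box
# bit below is an invented nonlinear (bent) function, chosen so that
# wrong key guesses decorrelate from the leakage.
SBOX_LSB = [0, 0, 0, 1, 0, 0, 0, 1, 0, 0, 0, 1, 1, 1, 1, 0]

def bit_z(plaintext, key):
    """Expected value of the attacked bit z under key guess `key`."""
    return SBOX_LSB[(plaintext ^ key) & 0xF]

def dpa_kocher(plaintexts, traces):
    """Return {kg: syndrome} where the syndrome is <S'1'kg> - <S'0'kg>."""
    syndromes = {}
    for kg in range(16):
        # Data sorting: split the traces by the predicted value of z.
        s1 = [tr for pt, tr in zip(plaintexts, traces) if bit_z(pt, kg)]
        s0 = [tr for pt, tr in zip(plaintexts, traces) if not bit_z(pt, kg)]
        # Data analysis: difference of the two average traces.
        syndromes[kg] = sum(s1) / len(s1) - sum(s0) / len(s0)
    return syndromes

# Data collection: each "trace" leaks the real value of z plus noise.
rng = random.Random(1)
secret = 11
plaintexts = [rng.randrange(16) for _ in range(4000)]
traces = [bit_z(pt, secret) + rng.gauss(0, 0.5) for pt in plaintexts]

syndromes = dpa_kocher(plaintexts, traces)
# The guess with the highest syndrome amplitude discloses the key.
assert max(syndromes, key=syndromes.get) == secret
```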
The above protocol for performing a DPA of Kocher may be formalized in order to obtain a mathematical expression of the DPA syndrome, SDPA. Let us consider that an attack is performed on the bit z of a ciphering block. Let V be the number of plain texts (input vectors) v applied on the inputs of the block. Let Tkg be the number of vectors among them expected to force z to the logic value '1' according to the key guess kg, and let Fkg (Fkg = V - Tkg) be the number of plain texts expected to force z to the logic value '0'. Finally, let Iv(t) be the current waveform observed on the ground or supply rail while the vector v is applied on the block inputs. With these definitions, one can show that the DPA syndrome associated with the guess kg is

$$S_{DPA}(z, k_g) = \frac{1}{T_{k_g}} \sum_{t \in \mathcal{T}_{k_g}} I_t(t) \;-\; \frac{1}{F_{k_g}} \sum_{f \in \mathcal{F}_{k_g}} I_f(t) \qquad (1)$$
This expression is valid for all possible values of the key and therefore for the correct key kr. We may therefore conclude that the DPA of Kocher will disclose the secret key if SDPA(z, kr) has a greater amplitude than SDPA(z, kg) for all other possible values of the key. Although this formalism allows a quick understanding of what a DPA of Kocher is, it does not provide any physical information about what to do, or not to do, to increase the robustness of a circuit during the design.

2.2 Multi-bit DPA

As aforementioned, the multi-bit DPA is a generalisation of the DPA of Kocher. Indeed, the main difference between the two attacks lies in their respective targets. Thus a multi-bit DPA is roughly performed like a DPA of Kocher, i.e., following the same three steps: data collection, data sorting and data analysis. However, the data sorting step is slightly different: the power samples are sorted according to the expected values of m target bits rather than the value of a single bit. As an example, let us consider a multi-bit DPA targeting two bits, x and z. In this case, the sorting consists in defining two sets of power traces, S'00'kg and S'11'kg, according to the key guess kg. S'00'kg (S'11'kg) contains all the power traces corresponding to input vectors expected to force x and z to the logic value '0' ('1'). Note that this sorting does not exploit all the data collected during the data collection step, unlike in the case of a DPA of Kocher. The protocol for performing a multi-bit DPA may also be formalized to obtain a mathematical expression of the multi-bit DPA syndrome. In the case of an attack targeting the two bits x and z, the formalization leads to

$$S_{DPA}(x, z, k_g) = \frac{1}{T'_{k_g}} \sum_{t \in \mathcal{T}'_{k_g}} I_t(t) \;-\; \frac{1}{F'_{k_g}} \sum_{f \in \mathcal{F}'_{k_g}} I_f(t) \qquad (2)$$

where T'kg < Tkg and F'kg < Fkg are respectively the numbers of vectors forcing (x, z) to the values '11' and '00'.
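The modified sorting step can be sketched as follows; `predict_bits` is a placeholder for any function returning the expected values of the m target bits under a key guess (an assumption for the illustration):

```python
# Hedged sketch of the multi-bit sorting step: only the vectors whose m
# target bits are all expected to be '1' or all '0' are kept; the rest
# of the collected traces are discarded, unlike in a DPA of Kocher.

def sort_multibit(plaintexts, traces, predict_bits, kg):
    s11, s00 = [], []
    for pt, tr in zip(plaintexts, traces):
        bits = predict_bits(pt, kg)        # e.g. the expected (x, z)
        if all(b == 1 for b in bits):
            s11.append(tr)                 # set S'11'_kg
        elif all(b == 0 for b in bits):
            s00.append(tr)                 # set S'00'_kg
    return s11, s00

# Toy predictor: the two low bits of pt ^ kg.
predict = lambda pt, kg: (((pt ^ kg) >> 1) & 1, (pt ^ kg) & 1)
s11, s00 = sort_multibit([0, 1, 2, 3], [10, 20, 30, 40], predict, 0)
assert s11 == [40] and s00 == [10]   # pt=3 -> '11', pt=0 -> '00'
```

Note how the vectors predicting '01' or '10' (pt = 1 and pt = 2 here) are simply dropped, which is the data loss mentioned in the text.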
3 Design Oriented Modelling of DPA Syndrome

In the preceding section, the basic protocols for performing a differential power analysis of Kocher or a multi-bit one have been summarized. These protocols have been formalized to obtain a mathematical expression of the DPA syndrome. However, the expression obtained does not give circuit designers insight into what should be done, or not done, to obtain a robust circuit. This explains why a first-order physical model of the DPA syndrome is introduced in this section. For both differential power analyses considered (DPA of Kocher and multi-bit DPA), we obtained in section 2 a generic expression of the DPA syndrome for every guessed value of the key. For the correct secret key, this expression may be rewritten:

$$S_{DPA}(C, k_r) = \frac{1}{T_{k_r}} \sum_{t \in \mathcal{T}_{k_r}} I_t(t) \;-\; \frac{1}{F_{k_r}} \sum_{f \in \mathcal{F}_{k_r}} I_f(t) \qquad (3)$$
where C denotes the target of the attack, i.e. a single bit or m different bits. Considering that the power trace It(t) (or If(t)) is the sum of the currents provided to, or drained from, the supply (or ground) rail when the vector t (f) of Tkr (Fkr) is applied on the inputs of the ciphering block, expression (3) may be rewritten: S DPA ( C ,kr ) =
1 Tk r
1
∑ ∑ itp ( t ) − F ∑ ∑ i pf ( t )
t∈Tk p∈P r
(4)
k r f ∈Fkr p∈P
where p denotes a gate among the P gates constituting the block under attack, and ipt(t) (ipf(t)) is the current drained from the supply (or ground) rail while the vector t (f) of Tkr (Fkr) is applied on its inputs. Applying the vector t (f) on the inputs of the block may produce three different events at the output sp of gate p: sp remains stable, sp switches from the logic value '0' to '1', or sp switches from '1' to '0'. This leads to the definition of six numbers that characterize the behaviour of gate p during the differential power analysis: fp0, fp1, fpS, tp0, tp1 and tpS. These numbers are defined as follows:

- fp0, fp1 are the numbers of vectors of Fkr inducing a falling and a rising transition of sp, respectively;
- in the same way, tp0, tp1 are the numbers of vectors of Tkr inducing a falling and a rising transition of sp, respectively;
- finally, fpS and tpS are respectively the numbers of vectors of Fkr and Tkr that leave the output sp of gate p unchanged.

Considering these definitions, we obtain the following expression of the DPA syndrome, with equivalent expressions for wrong guesses of the key:

$$S_{DPA}(C, k_r) = \sum_{p \in P} \left[ \left( \frac{t_p^1}{T_{k_r}} - \frac{f_p^1}{F_{k_r}} \right) i_p^1(t) + \left( \frac{t_p^0}{T_{k_r}} - \frac{f_p^0}{F_{k_r}} \right) i_p^0(t) + \left( \frac{t_p^S}{T_{k_r}} - \frac{f_p^S}{F_{k_r}} \right) i_p^S(t) \right] \qquad (5)$$
where ip0(t), ip1(t) and ipS(t) are the currents provided to or drained from the supply (or ground) rail while the output sp of gate p switches from '1' to '0', from '0' to '1', or remains stable, respectively. Assuming that the switching current ipS(t) of gate p is negligible when its output remains stable, (5) can be simplified:

$$S_{DPA}(C, k_r) = \sum_{p \in P} \left[ \left( \frac{t_p^1}{T_{k_r}} - \frac{f_p^1}{F_{k_r}} \right) \Delta i_p^{V_{DD}}(t) + \left( \frac{t_p^1 + t_p^0}{T_{k_r}} - \frac{f_p^1 + f_p^0}{F_{k_r}} \right) i_p^0(t) \right]$$

$$\phantom{S_{DPA}(C, k_r)} = \sum_{p \in P} \left[ \left( \frac{t_p^0}{T_{k_r}} - \frac{f_p^0}{F_{k_r}} \right) \Delta i_p^{Gnd}(t) + \left( \frac{t_p^1 + t_p^0}{T_{k_r}} - \frac{f_p^1 + f_p^0}{F_{k_r}} \right) i_p^1(t) \right] \qquad (6)$$
where ΔipVDD(t) and ΔipGnd(t) are called the differential switching currents:

$$\Delta i_p^{V_{DD}}(t) = i_p^1(t) - i_p^0(t), \qquad \Delta i_p^{Gnd}(t) = i_p^0(t) - i_p^1(t) \qquad (7\text{-}8)$$
At this point, it is important to decide whether the power traces are obtained by probing the supply rail VDD or the ground rail Gnd. Let us consider that the measurements are done on the VDD rail. In this case, ip0(t) can be considered small compared to ΔipVDD(t), since only the short-circuit current is drained from the supply rail. Simplifying expression (6), we finally obtain the DPA syndrome associated with a DPA performed on the supply rail VDD:

$$S_{DPA}^{V_{DD}}(C, k_r) = \sum_{p \in P} \left( \frac{t_p^1}{T_{k_r}} - \frac{f_p^1}{F_{k_r}} \right) \Delta i_p^{V_{DD}}(t) = \sum_{p \in P} \varepsilon_p^{V_{DD}} \cdot \Delta i_p^{V_{DD}}(t) \qquad (9)$$
In the same way, one can show that the DPA syndrome associated with a DPA performed on the ground rail can be expressed as:

$$S_{DPA}^{Gnd}(C, k_r) = \sum_{p \in P} \left( \frac{t_p^0}{T_{k_r}} - \frac{f_p^0}{F_{k_r}} \right) \Delta i_p^{Gnd}(t) = \sum_{p \in P} \varepsilon_p^{Gnd} \cdot \Delta i_p^{Gnd}(t) \qquad (10)$$
Considering (9) and (10), one may conclude that the DPA syndromes associated with the DPA of Kocher or with the multi-bit DPA are linear combinations of the differential switching currents of all gates. One important point to note here is that the multiplicative coefficients εpVDD and εpGnd are independent of the physical implementation of the block, since they are functions only of the numbers of rising and falling transitions. Therefore they depend only on the logical structure of the block and can thus be evaluated during the logic synthesis step. Note also that the coefficients of the gates controlling the m bits targeted by the attack are necessarily equal to one if the guessed value of the key is the right one.
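Because the εp coefficients depend only on logic-level transition counts, they can be computed from an ordinary HDL simulation dump, as the authors do in Section 4.1 with Matlab scripts. Below is our own minimal sketch of that computation for the VDD-rail coefficient of Eq. (9); the data layout is an assumption for the illustration:

```python
# Hedged sketch: compute eps_p^VDD = t_p^1/T - f_p^1/F for one gate from
# logic-level simulation results. `values_per_vector` holds the gate's
# (previous, new) output values for each applied plaintext, and
# `expected_bit` the predicted value of the attacked bit for the same
# vectors under a given key guess (both formats are our assumptions).

def epsilon_vdd(values_per_vector, expected_bit):
    T = sum(expected_bit)                 # size of the T-set
    F = len(expected_bit) - T             # size of the F-set
    # t_p^1 / f_p^1: rising transitions among T-set / F-set vectors.
    t1 = sum(1 for (prev, new), e in zip(values_per_vector, expected_bit)
             if e == 1 and (prev, new) == (0, 1))
    f1 = sum(1 for (prev, new), e in zip(values_per_vector, expected_bit)
             if e == 0 and (prev, new) == (0, 1))
    return t1 / T - f1 / F

# The gate driving the attacked bit rises exactly when the bit is 1,
# so its coefficient is 1, as noted in the text.
vals = [(0, 1), (0, 0), (0, 1), (0, 0)]
bits = [1, 0, 1, 0]
assert epsilon_vdd(vals, bits) == 1.0
```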
Fig. 1. Structure considered during the validation step
In a similar way, the differential switching currents depend only on the physical synthesis (place and route …) of the circuit. Indeed, ΔipVDD(t) depends strongly on various physical design parameters, such as the load driven by p, the transition times of the signals driving p, and the sizing of gate p. From the preceding remarks, we may conclude that the proposed model establishes a link between the DPA syndrome and both the logic and the physical synthesis. In order to demonstrate the interest of such a link, we show in the next section how to apply this model to identify the leaking gates, i.e., the gates that contribute the most to the DPA syndrome.
4 Leaking Gates Identification

In order to validate the proposed modelling of the DPA syndrome and to demonstrate its usefulness, we apply it in this section to identify, just after logic synthesis, the critical gates and nets of a Verilog netlist. At this point, a critical gate is a gate that contributes more than the others to the DPA syndrome. The application example considered in this section is the well-known substitution box of the DES algorithm [13], represented in Fig. 1.
Fig. 2. Distribution of the εpVDD values (x-axis: εp, y-axis: number of gates; 155 gates in total) for all possible secret keys; the gate driving S3 and the leaking gates are annotated
4.1 εpVDD Distribution Analysis

The logic synthesis of the structure represented in Fig. 1 has been performed with RTL Encounter from Cadence [8], using a reduced 130nm standard cell library containing only simple gates: inverters, (n)and2, (n)and3, (n)and4, (n)or2, (n)or3 and (n)or4. The Verilog file [12] obtained after synthesis has been simulated with the HDL simulator NCSim [8]. More precisely, a single sequence of five thousand vectors (plain texts) has been applied to the structure of Fig. 1 for all possible values of the sub-key K. These simulations provided the five thousand final logic values of all nets. These values were stored in .csv files readable by Matlab [9], and Matlab scripts were developed to quickly compute the values of the coefficients εpVDD. Fig. 2 gives the histogram of εpVDD for all gates and all possible correct keys while a DPA of Kocher targeting the output bit S3 (see Fig. 1) is performed. As shown, the coefficient of the gate driving S3 is equal to 1. Beside this expected result, one can note that most gates (>95%) have a coefficient value ranging from -0.2 to 0.2, while two gates have an absolute coefficient value |εp| greater than 0.2, and this for all possible values of the correct key. These gates (denoted cg1 and cg2) have been identified. Their main characteristic is to be located (in terms of logical depth) close to the gate driving the output net S3. More precisely, the logic depth separating the inputs of these gates from the net S3 was found to be smaller than or equal to 2. These two gates are indicated as leaking gates in Fig. 2 since, according to the model introduced in section 3, they may contribute more than the other gates to the DPA signature.

4.2 DPA Syndrome Analysis

The analysis of the εpVDD histogram has indicated that gates cg1 and cg2 are critical, or leaking, gates according to the DPA syndrome modelling.
In order to validate the model, we have verified this result at the electrical level. We therefore generated, from the Verilog description, three different SPICE netlists of the structure represented in Fig. 1. The first netlist is a direct transcription of the Verilog description into a SPICE netlist; the resulting netlist is denoted 'n_ref' afterwards. The second and third netlists are modifications of the reference netlist 'n_ref'. More precisely, 'n_ref' was first modified in order to multiply by three the current drained from the VDD rail by the critical gates cg1 and cg2; the resulting netlist is called 'n_crit'. Finally, 'n_ref' was modified in order to multiply by three the current drained from the VDD rail by two gates having an εpVDD close to zero, i.e., uncritical gates; the resulting netlist is called 'n_not_crit'. The multiplication of the current drained by these critical and uncritical gates was not done by sizing the P transistors three times bigger. We rather used current-controlled current sources (CCCS in SPICE format), as shown in Fig. 3. This solution was chosen since it guarantees no change at all in the behaviour of the rest of the circuit. Therefore, any change in the DPA syndrome will only be due to the multiplication by three of the current drained from VDD by the modified gates.
These modifications done, we simulated the three netlists. More precisely, a given sequence of two thousand vectors was applied to these three structures for all possible values of the correct key. As a result, 64*2000 power traces were collected. These traces were used to perform, by simulation, a DPA of Kocher targeting the bit S3, and the DPA syndromes obtained with the three different netlists were compared. More precisely, we compared the DPA syndromes obtained with the critical and uncritical netlists ('n_crit' and 'n_not_crit') to the DPA syndrome obtained with the reference netlist 'n_ref'. Fig. 4 gives the differences obtained for 32 different values of the correct key.
Fig. 3. Modifications done on two critical and uncritical gates
As expected, the multiplication by three of the current consumed by the critical, or leaking, gates induces a significant modification of the DPA syndrome. As shown in Fig. 4 (left), the difference may reach 80µA, which represents 100% of the maximum amplitude of the DPA syndrome obtained with the reference netlist. Conversely, the multiplication by three of the current drained from VDD by the two less critical gates induces only small modifications of the DPA syndrome: the difference remains smaller than 15µA. Note also that the observed differences (Fig. 4, right) are either positive or negative. This means that the amplitude of the DPA syndrome could either be reduced or increased. Beside the validation of the DPA syndrome modelling introduced in section 3, these results lead us to define the criticality of gates and nets with respect to DPA at the logical level: the greater the |εp| value, the more critical the gate p.

4.3 Critical and Uncritical Gates and Nets

Beside this definition, one can wonder how many gates are extremely critical in a design and how many are uncritical. To provide initial answers to these questions, we computed, from the data obtained with NCSim, the coefficients for differential power analyses of Kocher targeting each of the output bits of the structure represented in Fig. 1. As an illustration, Fig. 5 gives the coefficient values of all nets (and thus of all gates driving these nets) in the case of a DPA of Kocher targeting S3. For this attack, nets n_156, n_153, n_141 and n_160 were found to be the most critical. Proceeding as described above for the three other output bits, we successively identified all the critical nets in case of a DPA of Kocher targeting one of the four Sbox output bits. We found only 18 (11%) extremely critical gates for a total number of
402
A. Razafindraibe and P. Maurine
[Fig. 4 plots, in µA over 0–1 ns, S_DPA^(n_crit)(kr, S3) − S_DPA^(n_ref)(kr, S3) (left) and S_DPA^(n_not_crit)(kr, S3) − S_DPA^(n_ref)(kr, S3) (right).]
Fig. 4. Differences between the DPA syndromes obtained with the reference netlist and those obtained with the critical (left) and uncritical (right) netlists
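For reference, the difference-of-means DPA syndrome compared in Fig. 4 can be sketched as follows. This is a minimal illustration with a toy leakage model and synthetic traces — the names and power model are ours, not the authors' NCsim/Matlab flow:

```python
import random

def dpa_syndrome(traces, predicted_bits):
    """Difference-of-means DPA syndrome: mean trace of the group where the
    predicted target bit is 1, minus the mean trace of the group where it is 0."""
    ones = [t for t, b in zip(traces, predicted_bits) if b == 1]
    zeros = [t for t, b in zip(traces, predicted_bits) if b == 0]
    n = len(traces[0])
    mean = lambda group, i: sum(t[i] for t in group) / len(group)
    return [mean(ones, i) - mean(zeros, i) for i in range(n)]

# Toy demo: power at sample 3 depends on a secret bit; elsewhere it is noise.
random.seed(0)
secret_bits = [random.randint(0, 1) for _ in range(2000)]
traces = [[random.gauss(0.0, 1.0) for _ in range(8)] for _ in secret_bits]
for t, b in zip(traces, secret_bits):
    t[3] += 5.0 * b            # leakage: bit-dependent consumption

syndrome = dpa_syndrome(traces, secret_bits)
peak = max(range(8), key=lambda i: abs(syndrome[i]))
print(peak)                    # the leaking sample stands out
```

With 2000 traces the averaging suppresses the noise, so the syndrome peaks at the sample where consumption correlates with the targeted bit.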
Fig. 5. εp values wrt net names for a DPA of Kocher targeting S3
gates of 155. Conversely, we found only 2 gates having an absolute coefficient value |εp| smaller than 0.04. In other words, only 2 cells among 155 contribute twenty times less to the DPA syndrome than the cell driving the attacked bit. From these results, we may conclude that the number of extremely leaking or uncritical gates in an Sbox is small. Since the number of critical nets and gates is small, it appears possible to constrain the place-and-route steps and the timing optimization in order to increase the robustness of a circuit against DPA. As an example, a critical gate should be placed as close as possible to its drivers and to its loading gates. This allows reducing the time the critical gate spends switching, by controlling both the transition times of the signals applied to its inputs and its output load. Moreover, this avoids the insertion of buffers on critical nets during timing optimization. This is extremely important, since introducing a buffer on a critical net is equivalent to introducing an additional critical gate, i.e., to increasing the DPA syndrome associated to the correct secret key.
A Model of DPA Syndrome and Its Application to the Identification of Leaking Gates
403
5 Conclusion
A design-oriented modelling of the DPA syndrome has been introduced and validated in this paper. This modelling has led to a definition of critical gates (and nets) with respect to DPA. Based on this definition, the model allows identifying, during the logic synthesis step, the gates that will contribute the most to the DPA syndrome. This advantage has been demonstrated in this paper on a well-known example: the Sbox of a DES. The results obtained suggest that the number of leaking gates is small, at least for the considered example.
References
[1] Kocher, P., et al.: Differential power analysis. In: Wiener, M.J. (ed.) CRYPTO 1999. LNCS, vol. 1666, pp. 388–397. Springer, Heidelberg (1999)
[2] Daemen, J., Rijmen, V.: Resistance against implementation attacks: a comparative study of the AES proposals. In: Proceedings of the Second Advanced Encryption Standard Candidate Conference (1999)
[3] Goubin, L., Patarin, J.: DES and Differential Power Analysis, the duplication method. In: Koç, Ç.K., Paar, C. (eds.) CHES 1999. LNCS, vol. 1717, pp. 3–15. Springer, Heidelberg (1999)
[4] Messerges, T.S., Dabbish, E.A., Sloan, R.H.: Investigations of power analysis attacks on smartcards. In: USENIX Workshop on Smartcard Technology (1999)
[5] Shamir, A.: Protecting Smart Cards from passive power analysis with detached power supplies. In: Kaliski Jr., B.S., Koç, Ç.K., Paar, C. (eds.) CHES 2002. LNCS, vol. 2523, pp. 71–77. Springer, Heidelberg (2003)
[6] Tiri, K., Verbauwhede, I.: Securing encryption algorithms against DPA at the logic level: next generation smart card technology. In: ESSCIRC 2000 (2000)
[7] Messerges, T.S., et al.: Examining Smart Card Security under the Threat of Power Analysis Attacks. IEEE Trans. on Computers 51, 541–552 (2002)
[8] http://www.cadence.com/products/digital_ic/rtl_compiler
[9] http://www.mathworks.com/
[10] Coron, J.S., Kocher, P., Naccache, D.: Statistics and Secret Leakage. In: Bellare, M. (ed.) CRYPTO 2000. LNCS, vol. 1880, pp. 157–173. Springer, Heidelberg (2000)
[11] Mayer-Sommer, R.: Smartly Analysing the Simplicity and the Power of Simple Power Analysis on Smartcards. In: Paar, C., Koç, Ç.K. (eds.) CHES 2000. LNCS, vol. 1965, p. 231. Springer, Heidelberg (2000)
[12] Thomas, D., Moorby, P.: The Verilog Hardware Description Language. Kluwer Academic Publishers, Dordrecht, http://www.whap.com
[13] DES: Data Encryption Standard, FIPS 46-2, www.itl.nist.gov/fipspub/fip46-2.htm
Static Power Consumption in CMOS Gates Using Independent Bodies D. Guerrero, A. Millan, J. Juan, M.J. Bellido, P. Ruiz-de-Clavijo, E. Ostua, and J. Viejo Departamento de Tecnología Electrónica de la Universidad de Sevilla, Escuela Técnica Superior de Ingeniería Informática, Avda. Reina Mercedes S/N, 41012 Sevilla, Spain* {guerre,amillan,jjchico,bellido,pruiz,ostua,julian}@dte.us.es http://www.dte.us.es
Abstract. It has been reported that the use of independent body terminals for series transistors in static bulk-CMOS gates improves their timing and dynamic power characteristics. In this paper, the static power consumption of gates using this approach is addressed. When compared to conventional common-body static CMOS, important static power enhancements are obtained. Accurate electrical simulation results reveal improvements of up to 35% and 62% in NAND and NOR gates, respectively.
1 Introduction
Over the past several years, the static CMOS logic design style has played a dominant role in digital VLSI design because of its relatively high performance, low static power dissipation, high input impedance, cost effectiveness and many other remarkable qualities [1]. However, static CMOS gates present a strong performance degradation as the number of inputs increases, due to the so-called body effect and the internal parasitic capacitance associated to series-connected transistors [1,2]. Also, static power is becoming relevant in CMOS logic as the transistor size decreases, so that reducing leakage current is more and more important [3]. The body effect can be modeled as a dependence of the threshold voltage Vt on Vsb as follows [4]:
  Vt = Vt0 + φB + γ·√(φB + Vsb)    (1)
where Vt0 is the flat-band voltage, γ is the body effect coefficient and φB is determined from technological and physical conditions. In a standard CMOS NAND implementation like the one in Fig. 1a, if the input vector changes from (I3, I2, I1, I0)=(1, 1, 1, 0) to (I3, I2, I1, I0)=(1, 1, 1, 1) the series connected NMOS transistors corresponding to inputs I1, I2 and I3 will suffer from a *
This work has been partially supported by the Spanish Government’s MEC META project TEC-2004-00840-MIC and the Andalusian Regional Government’s CICE DHPMNDS projects EXC-TIC-1023 and EXC-TIC-635.
N. Azemard and L. Svensson (Eds.): PATMOS 2007, LNCS 4644, pp. 404–412, 2007. © Springer-Verlag Berlin Heidelberg 2007
Static Power Consumption in CMOS Gates Using Independent Bodies
405
reduced conductance due to an increased threshold voltage, lowering the performance of the gate. Additionally, the parasitic capacitances associated to nodes A, B and C in Fig. 1a will also negatively affect the delay and power consumption of the gate.
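The body-effect dependence of Eq. (1) can be evaluated numerically to see why a series transistor with a raised source node is slower. The parameter values below are illustrative placeholders, not the paper's 0.18 µm process parameters:

```python
import math

def threshold_voltage(vsb, vt0=-0.9, phi_b=0.7, gamma=0.45):
    """Eq. (1): Vt = Vt0 + phi_B + gamma * sqrt(phi_B + Vsb).
    vt0 (flat-band voltage), phi_b and gamma are illustrative values."""
    return vt0 + phi_b + gamma * math.sqrt(phi_b + vsb)

vt_bottom = threshold_voltage(0.0)  # grounded source: no body effect
vt_series = threshold_voltage(0.5)  # series transistor with raised source node
shift = vt_series - vt_bottom
print(round(shift, 3))              # positive: the stacked device has higher Vt
```

The positive Vt shift is exactly what the INBO connection (Vsb = 0 for every transistor) removes.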
Fig. 1. Schematics of two implementations of a NAND gate
In order to solve these problems, it has been proposed to make Vsb = 0 in every series transistor [5]. To do that, the source and body terminals of every NMOS in a NAND gate, and of every PMOS in a NOR gate, are connected together using twin-well and triple-well technologies that provide independent bodies for both types of transistors (Fig. 1b). The possibility of using independent bodies for each series transistor has not typically been considered in bulk-CMOS design because of its obvious area penalty. However, this would allow connecting source and body, which has two remarkable consequences:
• The source-to-body junction parasitic capacitance can be neglected. Furthermore, drain-to-body junction capacitances in the series chain are no longer grounded and are charged at a lower voltage, accumulating less charge. As a consequence, the total parasitic capacitance in the series structure is greatly reduced.
• Since Vsb = 0 for every transistor, the body effect is avoided and transistor conductance is improved.
Electrical simulations of gates implemented using this technique (called INBO, for Independent Bodies) and using the conventional common-body design style (called COBO, for Common Bodies) have shown remarkable improvements in delay and dynamic power consumption [5] (both were reduced by over 20%). Also, delay and power measurements were more homogeneous across input terminals. This makes the INBO technique adequate for the design of gates with a large number of inputs. Static power consumption has two main causes in CMOS gates [1]:
• Reverse-bias leakage currents
• Subthreshold conduction
406
D. Guerrero et al.
The first cause is explained by the existence of parasitic diodes in CMOS gates such as the inverter shown in Fig. 2. Each p-n junction forms a parasitic diode, so there is one for each drain and source terminal and one for the n-well.
Fig. 2. Parasitic diodes in a CMOS inverter
Depending on the input value, the diodes corresponding to the source and drain terminals can be reverse-biased, driving a small reverse-bias leakage current that contributes to the static power consumption. The second cause of static power consumption is that the conductivity of a MOS transistor is not completely zero when the gate voltage does not reach Vt. Hence, inactive transistors allow a small subthreshold current to flow from supply to ground. Since subthreshold conduction decreases as Vt increases, and Vt increases with Vsb, some digital techniques like Variable Threshold CMOS (VTCMOS) [6] increase Vsb when the circuit is in idle state in order to reduce subthreshold currents. The objective of this paper is to study the static power consumption of gates using the independent-body (INBO) approach and to show that this design style is able to provide important static power savings when compared to the conventional common-body (COBO) approach, due to the fact that the independent bodies of the transistors in the series tree provide additional isolation for the drain and source diodes. In Section 2 the test set-up is described. Simulation results are presented and analyzed in Section 3. Finally, we summarize some conclusions.
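The exponential sensitivity of subthreshold leakage to Vt, which VTCMOS exploits, can be illustrated with a back-of-the-envelope model. The values of i0, n and the voltages below are hypothetical, chosen only to show the order of magnitude of the effect:

```python
import math

UT = 0.026  # thermal voltage at room temperature (V)

def subthreshold_current(vgs, vt, i0=1e-7, n=1.4):
    """Exponential subthreshold conduction model; i0 and n are illustrative."""
    return i0 * math.exp((vgs - vt) / (n * UT))

# VTCMOS idea: raise Vt (through body bias) while idle to cut leakage.
leak_active = subthreshold_current(vgs=0.0, vt=0.45)
leak_idle = subthreshold_current(vgs=0.0, vt=0.60)  # Vt raised by 150 mV
ratio = leak_active / leak_idle
print(ratio)
```

Even a modest 150 mV increase of Vt reduces the leakage by more than an order of magnitude in this sketch, which is why threshold modulation is so effective against subthreshold currents.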
2 Test Set-Up
Four-input NAND and NOR gates have been designed using the COBO and INBO styles in order to measure their static power consumption. The gates have been implemented using a 0.18 µm triple-well CMOS process with transistor sizes wn = 240 nm, wp = 240 nm for NAND (Figs. 3 and 4) and wn = 240 nm, wp = 1440 nm for NOR (Figs. 5 and 6), and minimum lengths. In a NAND gate, any input with logic value 0 forces the output value to 1, so 0 is typically called the controlling value for NAND gates. Analogously, in NOR gates the controlling value is 1. Inputs are named I0, I1, I2 and I3, with the input index increasing for series transistors nearer the output (Fig. 1). The input vectors are numbered so that the number associated to the input vector (I3, I2, I1, I0) is I3·2³ + I2·2² + I1·2¹ + I0·2⁰. Input vector 5, for example, corresponds to (I3, I2, I1, I0) = (0, 1, 0, 1). Input vectors are classified depending on the output value they produce. In a NAND gate, input vectors 0 to 14 contain the controlling value, so they produce an output value of 1; in this case, the leakage current is driven by the series NMOS tree. Input vector 15 corresponds to an output value of 0, and makes the leakage current be driven by the parallel PMOS tree. In a NOR gate, input vectors 1 to 15 correspond to leakage current flowing in the series PMOS tree, while input vector 0 corresponds to leakage current in the parallel NMOS tree.
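The vector numbering and the controlling-value classification above can be checked with a short script (the helper names are ours, not from the paper):

```python
def vector_number(i3, i2, i1, i0):
    """Vector (I3, I2, I1, I0) -> I3*2^3 + I2*2^2 + I1*2^1 + I0*2^0."""
    return i3 * 8 + i2 * 4 + i1 * 2 + i0

def bits(v):
    """Vector number -> (I3, I2, I1, I0)."""
    return (v >> 3 & 1, v >> 2 & 1, v >> 1 & 1, v & 1)

def nand4(i3, i2, i1, i0):
    return 0 if (i3 and i2 and i1 and i0) else 1

def nor4(i3, i2, i1, i0):
    return 0 if (i3 or i2 or i1 or i0) else 1

assert vector_number(0, 1, 0, 1) == 5            # the example from the text
nand_outputs = [nand4(*bits(v)) for v in range(16)]
nor_outputs = [nor4(*bits(v)) for v in range(16)]
print(nand_outputs)   # vectors 0..14 contain a 0 -> output 1; vector 15 -> 0
print(nor_outputs)    # vector 0 -> output 1; vectors 1..15 contain a 1 -> 0
```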
Fig. 3. Layout of a COBO NAND gate
Fig. 4. Layout of an INBO NAND gate
The static power consumption is measured after parasitic extraction using the HSPICE [7] electrical simulator with the model card provided by the foundry and a nominal supply voltage of 1.8V.
408
D. Guerrero et al.
Fig. 5. Layout of a COBO NOR gate
Fig. 6. Layout of an INBO NOR gate
3 Simulation Results
The static power consumption of NAND and NOR gates has been simulated for all the possible input patterns and both the INBO and COBO styles. Results are analyzed in two cases: patterns containing the controlling value, which correspond to static power dissipated in the series tree, and the pattern with all inputs set to the non-controlling value, which corresponds to static power dissipated in the parallel tree. The first case is especially interesting, since it is where the INBO and COBO styles differ.
Static Power Consumption in CMOS Gates Using Independent Bodies
409
3.1 Input Patterns Containing the Controlling Value
The static power consumption for all possible inputs containing the controlling value is shown in Figs. 7 and 8 for NAND and NOR gates, respectively. The input vectors have been grouped depending on the number of inputs set to the controlling value. As can easily be seen, the INBO gates perform better in most of the cases. From one group of patterns to the next, the major contribution is due to subthreshold current: from group to group, the number of cut-off transistors increases, and the total impedance of the chain increases as well in both types of gates (COBO and INBO). Static power in INBO gates is almost constant inside a group, while it varies significantly in COBO gates.
Fig. 7. Static power consumption of NAND gates for input vectors containing the controlling value
In the COBO style, the static consumption within a group depends on the number of source/drain terminals that are reverse-biased. This is determined by the number of transistors in the series tree connected to the output by an active path. Thus, in the COBO NAND gate the input vector 12 (I3I2I1I0 = 1100) is leakier than input vector 10 (I3I2I1I0 = 1010), since for the first vector there are three reverse-biased diodes while for the second there are only two (Fig. 9a). Input vectors 9 and 10 are in turn leakier than 3, 5 and 6, since in the latter only the drain diode connected to the output is reverse-biased. In the INBO NAND gate, on the other hand, the source-to-body parasitic diodes can be neglected since source and body are connected together, and the drain-to-body diodes cannot be reverse-biased since the wells are not connected to ground (except the one corresponding to input I0), as shown in Fig. 9b. The lack of reverse leakage currents in the INBO gates makes them less leaky than their COBO counterparts in almost all the cases, and is the reason why their static power consumption is almost constant inside each group of patterns.
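The diode-counting argument can be sketched as a first-order model. This is our own simplification, valid only for vectors that contain the controlling value (so the pull-down path never reaches ground): the count is one for the output drain diode, plus one per internal node pulled high through the chain of ON transistors nearest the output.

```python
def reverse_biased_diodes_nand_cobo(i3, i2, i1, i0):
    """First-order count of reverse-biased junctions in the COBO NAND series
    NMOS chain: terminals pulled high through an active path from the output.
    Only meaningful for vectors containing at least one 0 (controlling value)."""
    count = 1                      # the drain diode at the output node
    for bit in (i3, i2, i1, i0):   # walk from the output towards ground
        if bit == 0:               # off transistor: the active path stops here
            break
        count += 1                 # the next internal node is also pulled high
    return count

assert reverse_biased_diodes_nand_cobo(1, 1, 0, 0) == 3   # vector 12
assert reverse_biased_diodes_nand_cobo(1, 0, 1, 0) == 2   # vector 10
print([reverse_biased_diodes_nand_cobo(0, i2, i1, i0)      # vectors 3, 5, 6:
       for (i2, i1, i0) in ((0, 1, 1), (1, 0, 1), (1, 1, 0))])  # output diode only
```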
410
D. Guerrero et al.
Fig. 8. Static power consumption of NOR gates for input vectors containing the controlling value
Fig. 9. NMOS tree in a NAND gate. a) COBO implementation b) INBO implementation
As mentioned in Section 1, the transistors in the INBO gates present better conductance because they do not suffer from the body effect. This is especially evident when there is only one cut-off transistor in the tree: the transistors between the cut-off one and the output have better conductance in the INBO case than in the COBO case. This is the reason why the INBO NAND gate is a bit leakier for input vectors 7, 11, 13 and 14, while the INBO NOR gate is slightly leakier for input vector 8. Table 1 shows the minimum and maximum enhancements of INBO with respect to COBO for each gate, and the overall power consumption assuming equal probabilities for each input vector. Improvements are especially remarkable for the NOR gate (over 30% overall), with particular improvements around 50% in both cases.
Static Power Consumption in CMOS Gates Using Independent Bodies
411
Table 1. Minimum, maximum and overall static power consumption enhancements of INBO gates with respect to COBO gates, for input vectors containing the controlling value
                COBO (pW)   INBO (pW)   Enhancement
NAND minimum        9          6.577        27 %
NAND maximum     120.735     121.183       -0.4 %
NAND overall      42.5        40.4          5.1 %
NOR minimum        6.1392      3.58        42 %
NOR maximum       32.742      21.18        35 %
NOR overall       14.325       9.83        31.4 %
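The enhancement column of Table 1 follows from the COBO and INBO columns as 100·(1 − INBO/COBO); recomputing it from the printed pW values reproduces the table within rounding (e.g., the NAND overall entry comes out near 4.9% against the printed 5.1%, because the 42.5 and 40.4 pW figures are themselves rounded):

```python
cobo = {"NAND min": 9.0, "NAND max": 120.735, "NAND overall": 42.5,
        "NOR min": 6.1392, "NOR max": 32.742, "NOR overall": 14.325}
inbo = {"NAND min": 6.577, "NAND max": 121.183, "NAND overall": 40.4,
        "NOR min": 3.58, "NOR max": 21.18, "NOR overall": 9.83}

# Enhancement in percent: positive means INBO leaks less than COBO.
enhancement = {k: 100.0 * (1.0 - inbo[k] / cobo[k]) for k in cobo}
for k in cobo:
    print(f"{k}: {enhancement[k]:.1f} %")
```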
3.2 Input Patterns Not Containing the Controlling Value
The static power consumption when all the inputs are set to the non-controlling value is shown in Table 2. In this case, the voltage of the body of the transistors in the series tree is the same in the COBO and INBO implementations, so their behavior is almost the same. The power consumption comes from the subthreshold currents of the transistors in the parallel tree, which is identical in both implementations. From Table 2, it is also clear that NMOS transistors are leakier than PMOS, as is also deduced from Figs. 7 and 8.
Table 2. Power consumption when all the inputs are set to the non-controlling value
        NAND (pW)   NOR (pW)
COBO       76.1        483
INBO       76.0        483
Table 3. Overall power consumption considering all the input vectors
              NAND (pW)   NOR (pW)
COBO            44.6        43.6
INBO            42.6        39.4
Enhancement      5 %        10 %
To summarize, Table 3 shows the overall power consumption considering the same probability for all the input vectors. The overall INBO enhancement for the NOR gate is reduced with respect to Table 1 due to the large contribution of pattern 0 (Table 2).
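The overall figures of Table 3 are simply the equal-probability average of the fifteen controlling-value vectors (the overall rows of Table 1) and the single non-controlling vector (Table 2), which can be checked directly:

```python
def overall_power(series_avg, parallel, n_inputs=4):
    """Equal-probability average over all 2**n input vectors: 2**n - 1 vectors
    contain the controlling value (series-tree leakage), one does not."""
    n_vectors = 2 ** n_inputs
    return ((n_vectors - 1) * series_avg + parallel) / n_vectors

# Values in pW taken from Tables 1 and 2
print(round(overall_power(42.5, 76.1), 1))    # COBO NAND -> 44.6
print(round(overall_power(14.325, 483), 1))   # COBO NOR  -> 43.6
print(round(overall_power(40.4, 76.0), 1))    # INBO NAND -> 42.6
print(round(overall_power(9.83, 483), 1))     # INBO NOR  -> 39.4
```

All four values match Table 3, confirming the equal-probability weighting stated in the text.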
4 Conclusions
As triple-well technologies become mainstream, the use of independent bodies (INBO) for each series transistor in static CMOS logic brings remarkable performance improvements in speed and in dynamic and static power consumption when compared to the conventional common-body approach (COBO), at the cost of some area penalty. INBO consumes less static power than COBO for almost all the input patterns, with improvements up to 45% in NAND gates and up to 62% in NOR gates. This result encourages further investigation into the general application of the INBO style.
References
1. Weste, N., Eshraghian, K.: Principles of CMOS VLSI Design. Addison Wesley, Reading (1993)
2. Veendrick, H.: Deep-Submicron CMOS ICs. Kluwer Academic Publishers, Ten Hagen en Stam, Deventer, The Netherlands (2000)
3. Helms, D., Schmidt, E., Nebel, W.: Leakage in CMOS circuits – an introduction. In: Macii, E., Paliouras, V., Koufopavlou, O. (eds.) PATMOS 2004. LNCS, vol. 3254, pp. 17–35. Springer, Heidelberg (2004)
4. Tsividis, Y.: Operation and Modelling of the MOS Transistor. McGraw-Hill, New York (1987)
5. Guerrero, D., Millan, A., Chico, J.: Improving the performance of static CMOS gates by using independent bodies. Journal of Low Power Electronics 3(1) (2007)
6. Tadahiro, K.: Low power CMOS digital design for multimedia processors. In: Proceedings of 6th International Conference on VLSI and CAD (ICVC), Seoul, pp. 359–367 (1999)
7. HSPICE Simulation and Analysis User Guide, Synopsys Inc. (2003)
Moderate Inversion: Highlights for Low Voltage Design
Fabrice Guigues¹, Edith Kussener¹, Benjamin Duval², and Hervé Barthelemy³
¹ ISEN-Toulon, L2MP UMR 6137 CNRS, 83000 Toulon, France
² STMicroelectronics, ZI Rousset-Peynier, 13106 Rousset, France
³ Polytech' Marseille, L2MP UMR 6137 CNRS, 13451 Marseille, France
[email protected]
Abstract. This paper proposes to use a non-conventional mode of operation to meet low-voltage constraints: moderate inversion. The EKV 2.0 MOS model provides equations applicable to hand calculation, while a BSIM3v3 simulation model is sufficiently accurate to design static circuits. The self cascode structure, studied in the light of the EKV 2.0 MOS model, is shown to be linear with temperature from weak to strong inversion: simulation and experimental data are provided. This self cascode linear-with-temperature voltage reference is the starting point of a new self-biased current reference: an all-in-moderate-inversion example is given, and simulation results are provided.
1 Introduction
The need for reliability leads to a continuing decrease of the supply voltage of digital CMOS circuits. Analog circuits implemented on mixed-signal chips in digital CMOS technology must conform to this trend, and this supply voltage constraint is becoming strong enough to be an important issue. Therefore, a low-voltage technique based on a non-conventional mode of operation, moderate inversion, is explored in this paper. If the failure of strong inversion (SI) to meet low-voltage operation is obvious, weak inversion (WI) is not the best inversion mode for such requirements either: as introduced in [1], moderate inversion (MI) clearly presents the best compromise between analog parameters of interest such as speed, mismatch, gain, supply voltage or silicon area. Nevertheless, moderate inversion is usually avoided in circuit design, because the WI and SI asymptotes were until recently the only equations available to designers, and also because the accuracy in moderate inversion of the commonly used third-generation simulation models (BSIM3v3 or MM9) is little known. But the EKV 2.0 MOS model [2] provides equations that are continuous over the inversion level and, in addition to being invertible, can easily be used for hand calculation; moreover, [3] demonstrates that BSIM3v3 accuracy is good enough for static applications. A proportional-to-absolute-temperature (PTAT) voltage reference is an important issue for analog integrated circuits: it is the starting point of many temperature-compensated current [4, 5] or voltage [6] references. Yet the two most commonly used compact PTAT voltage references [7] are based on MOSFETs biased in weak inversion. In addition to consuming a large silicon area in the case of standard applications (in the µA range), the use of weak inversion causes leakage at high temperature [7, 5].
N. Azemard and L. Svensson (Eds.): PATMOS 2007, LNCS 4644, pp. 413–422, 2007. © Springer-Verlag Berlin Heidelberg 2007
This paper provides a study of the second Vittoz reference [7], the self cascode structure, in the light of the EKV 2.0 model. The structure is linear with temperature (∝ aT + b) if not PTAT (∝ aT), depending on the technology, and this without inversion level constraint. Finally, this linear-with-temperature voltage reference is the starting point of an all-in-moderate-inversion current reference.
2 EKV 2.0 MOS Model
The EKV 2.0 MOS model provides equations that are continuous over the inversion levels (weak – moderate – strong), in conduction as in saturation. Even if newer versions are more precise, this version is the last one which is analytically invertible. EKV MOS modelling is based on the assumption that the drain current can be split into a forward current (IF) and a reverse current (IR), which describe respectively the influence of the source and drain voltages on the drain current:

  ID = IF − IR = F(VP − VS) − F(VP − VD)    (1)

where VP, VS and VD are respectively the pinch-off, source and drain voltages, referenced to the bulk voltage. All currents and voltages can be normalized according to:

  id = ID/IS,  if = IF/IS,  ir = IR/IS,  v = V/UT    (2)

where UT is the thermal voltage and IS the specific current defined by:

  IS = 2·n·μn·Cox·UT²·(W/L) = 2·n·β·UT²    (3)

where n is the slope factor, μn the mobility, Cox the gate oxide capacitance per unit area, and W/L the transistor aspect ratio. The EKV 2.0 forward and reverse inversion levels introduced in (2) are given by:

  if,r = ln²(1 + e^((VP − VS,D)/(2UT)))    (4)

The inversion coefficient IC (= if) is particularly important: under 0.1 the transistor is assumed to be in weak inversion (WI), and above 10 it reaches strong inversion (SI); between these two limits the MOSFET is biased in moderate inversion. Finally, (4) can be approximated in weak and strong inversion to obtain the respective asymptotes associated to the forward and reverse currents:

  IF,R = (n·β/2)·(VP − VS,D)²       (SI)
  IF,R = IS·e^((VP − VS,D)/UT)      (WI)    (5)
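Equation (4) and the IC thresholds can be turned into a small inversion-region classifier (UT and the example voltages are illustrative):

```python
import math

UT = 0.026  # thermal voltage (V)

def forward_inversion_level(vp_minus_vs):
    """Eq. (4): if = ln^2(1 + exp((VP - VS)/(2*UT)))."""
    return math.log(1.0 + math.exp(vp_minus_vs / (2.0 * UT))) ** 2

def region(ic):
    """IC thresholds from the text: WI below 0.1, SI above 10."""
    if ic < 0.1:
        return "weak"
    if ic > 10.0:
        return "strong"
    return "moderate"

print(region(forward_inversion_level(-0.2)))  # weak
print(region(forward_inversion_level(0.0)))   # moderate (if = ln(2)^2 ~ 0.48)
print(region(forward_inversion_level(0.3)))   # strong
```

Note that at VP = VS the interpolation gives if ≈ 0.48, squarely inside the moderate-inversion window, which is why (4) is needed there: neither asymptote of (5) applies.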
3 Self Cascode Structure
3.1 Traditional Approach
The self cascode structure (Fig. 1) is usually used in WI, with N1 and N2 biased in saturation and conduction respectively. The use of the standard WI asymptotic equations yields:

  Vref = UT·ln(1 + (ID2·SN1)/(ID1·SN2))    (6)

where UT is the thermal voltage, and SN1 and SN2 are the W/L ratios of MOSFETs N1 and N2 respectively. Equation (6) shows that the obtained voltage reference Vref is insensitive to Vdd variations and to the current level, as long as N1 and N2 are in WI.
Fig. 1. Self cascode structure: a low-voltage CMOS bandgap reference [7]
3.2 Proposed Approach
All the following equations, valid without inversion level constraint, are derived from the EKV 2.0 expressions; (4) leads to the following Vref expression:

  Vref = 2UT·[ln(e^√if2 − 1) − ln(e^√if1 − 1)]    (7)
3.3 Temperature Considerations
The EKV 2.0 formulation makes it possible to demonstrate that the self cascode structure is linear with temperature (LWT) in strong inversion (SI). As N1 is saturated, its reverse current can be considered negligible. Consequently, its forward normalized current if1 is also its inversion level IC1, and can be approximated from (1) and (2) by:

  if1 ≅ ID1/(2·n·β1·UT²)    (8)
The temperature dependence of if1 is driven by UT and β. While the temperature dependence of UT is obvious, that of β, due to the mobility, is affected by several scattering mechanisms [8]. These combined effects can be approximated by μ ∝ cte·T^(−α), with α proportional to the doping concentration [8]:

  α ≈ 2 for Nb = 10¹⁶ cm⁻³    (9)
  α ≈ 1 for Nb = 10¹⁸ cm⁻³    (10)

Finally, (7) can be approximated in SI by:

  Vref ≅ 2UT·(√if2 − √if1)    (11)
Consequently, from (1), (2), (4) and (8), the temperature dependence of Vref is proportional to:

  Vref ∝ T^(α/2)·[ √( 2·ID2/(n·Cox·cte·(W/L)2) + 2·ID1/(n·Cox·cte·(W/L)1) ) − √( 2·ID1/(n·Cox·cte·(W/L)1) ) ]    (12)

Thus, when Nb = 10¹⁶ cm⁻³, the self cascode structure provides a PTAT voltage reference not only in WI but also in SI. Consequently, in such conditions the self cascode structure is PTAT without inversion level constraint. Unfortunately, when Nb > 10¹⁶ cm⁻³ (α < 2), this is no longer the case. However, as shown in Fig. 2, the error made by considering T^0.5 ≅ aT + b over the temperature range of interest (−40 to 125 °C) is negligible (< 1%). The self cascode structure can thus be considered LWT in SI, while being PTAT in WI.
3.4 Measurements and Discussion
The temperature dependence of the voltage reference provided by four self cascode structures at different inversion levels (from WI, IC ≈ 9·10⁻³, to SI, IC ≈ 30) is shown in Fig. 3. Simulations were made with a BSIM3v3 model from a 0.35 µm standard CMOS technology having a doping concentration of about 10¹⁷ cm⁻³. The voltage reference was obtained for ID2 = (3/2)·ID1. As shown, experimental data closely match the simulated curves. The only parasitic effect involved, bulk leakage at high temperature (in weakly inverted transistors), appears above 50 °C: starting at 25 °C is thus sufficient to characterize the temperature dependence of the voltage reference. This bulk leakage problem appears in Fig. 3 for the weakly inverted case (IC ≈ 9·10⁻³): it is responsible for the voltage reference falling at high temperature. Consequently, the curves are linear with temperature from weak to strong inversion for this 10¹⁷ cm⁻³-doped technology.
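The claim that a T^0.5 dependence (the α = 1 case) is linear with temperature to better than 1% over −40..125 °C can be verified with a plain least-squares fit:

```python
# Least-squares fit of a*T + b to T**0.5 over -40..125 C (233..398 K),
# checking that the linearization error stays below 1 %.
temps = [233.0 + k for k in range(166)]          # 233..398 K, 1 K steps
y = [t ** 0.5 for t in temps]

n = len(temps)
sx, sy = sum(temps), sum(y)
sxx = sum(t * t for t in temps)
sxy = sum(t * v for t, v in zip(temps, y))
a = (n * sxy - sx * sy) / (n * sxx - sx * sx)    # closed-form linear regression
b = (sy - a * sx) / n

max_rel_err = max(abs(a * t + b - v) / v for t, v in zip(temps, y))
print(max_rel_err)   # below 0.01, i.e. < 1 %
```

The residual is largest at the ends of the range, where the concavity of the square root departs most from the fitted line, yet it stays well under the 1% bound quoted in the text.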
Fig. 2. Error made by considering the curve linear with temperature is negligible
Fig. 3. Measurements and simulations of the Vref temperature dependence
4 Self Cascode Current Reference (SCCR)
4.1 Current Reference Topology
The self cascode LWT voltage reference is the starting point of a new current reference. The self cascode current reference (SCCR) proposed in this paper (Fig. 4) is based on the same concept as Oguey's [4]: a low-power bandgap reference, the self cascode in the present case, is used to provide a LWT voltage reference to the drain of a transistor in conduction, which acts as a resistor and fixes the reference current. However, the self cascode topology, combined with the need for supply voltage independence, does not allow a simple NMOS current mirror configuration for biasing the conduction transistor. The solution consists in using the mirror proposed by Prodanov in [9], composed of transistors N3, N4 and N5. This mirror presents the great advantage of using a conduction transistor as an input stage, and in the present case of allowing the
Fig. 4. Self Cascode Current Reference
current mirroring with PMOS while preserving the supply voltage independence. This results in an optimum supply voltage design: indeed, only one VGS and one VDSsat are stacked per branch (i.e. Vddmin ≥ VGS + VDSsat). Obviously, this design needs a start-up, as do many self-biased current sources, but that is out of the scope of this paper and thus not depicted here.
4.2 Design Equations
Prodanov's Mirror: As introduced above, this design uses a Prodanov mirror, whose input and output currents are:

  Iin = IDN4 − IDN3 = IDN1 − IDN2    (13)
  Iout = IDN5 = Iref    (14)
Equation (1), combined with the design topology (ifN4 = ifN5, irN4 = ifN3, and N3 and N5 saturated), leads to:

  IDN4 = ISN4·(IDN5/ISN5 − IDN3/ISN3)    (15)

With (13) it comes:

  Iout = (ISN5/ISN4)·Iin + IDN3·(ISN5/ISN4 + ISN5/ISN3)    (16)
SCCR: The EKV 2.0 equations allow finding an Iref formulation valid without inversion level constraint. On the one hand, (13) used in (15) yields:

  (IDN1 − IDN2 + IDN3)/ISN4 = IDN5/ISN5 − IDN3/ISN3    (17)
On the other hand, equations (1), (2) and (4), applied to N2, give the IDN2 expression:

  IDN2 = ISN2·[ln²(1 + e^(VPN2/(2UT))) − ln²(1 + e^((VPN2 − Vref)/(2UT)))]    (18)

with:

  VPN2 − Vref = VPN1 − Vref = 2UT·ln(e^√ifN1 − 1)    (19)
Finally, (17), (18) and (19), combined with the PMOS mirroring, yield the following Iref formulation:

  Iref = ISN2·ln²(1 + e^(Vref/(2UT))·(e^√ifN1 − 1)) / [ ISP1/ISP2 + ISP3/ISP2 + (ISP1·ISN2)/(ISP2·ISN1) + (ISP3·ISN4)/(ISP2·ISN3) − ISN4/ISN5 ]    (20)

As ifN1 = ISP1·Iref/(ISP2·ISN1), Iref is thus expressed as a function of Iref itself. Fig. 5 shows that, under conditions on transistor sizes and current level, this equation has two solutions. The upper one is stable, but the design certainly needs a start-up.
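The two-solution fixed point of Eq. (20) and the resulting need for a start-up can be illustrated with a toy concave map: a curve of the same shape as Fig. 5 crosses the identity line at a trivial lower solution and at a stable upper one. The constant here is arbitrary and not derived from the actual design:

```python
import math

def g(i_ref):
    """Illustrative self-bias map, concave in Iref like the curve of Fig. 5.
    Fixed points of i = g(i): the trivial i = 0 and a stable upper solution."""
    return 2e-4 * math.sqrt(i_ref)   # amperes; arbitrary illustrative constant

# Iterating from any non-zero start converges to the upper fixed point ...
i = 1e-9
for _ in range(60):
    i = g(i)
print(i)   # converges to (2e-4)**2 = 4e-8 A, i.e. 40 nA

# ... but from exactly zero the circuit stays off, hence the start-up.
assert g(0.0) == 0.0
```

The iteration converges because the slope of g at the upper crossing is below one (here 1/2), which is the sense in which that solution is "stable".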
Fig. 5. Characteristic of Iref
4.3 Design Methodology
As introduced in [10], the temperature dependence of the current reference comes from the temperature dependence of the conduction transistor combined with that of its bias transistors (N3, N4 and N5 in the present case). The temperature dependence of a MOSFET is governed by two different phenomena: the threshold voltage variation and the temperature coefficient of the mobility. In SI, the relative effect of the two phenomena provides a compensation for a particular value of n·(VP − VS,D) [11]. Thus, an accurate polarisation of N3, N4 and N5 allows the temperature compensation of the current reference, but in such a case the design is not optimized for minimum supply voltage. Indeed, compensation occurs in "high" moderate inversion (for 7 < IC < 10).
Fig. 6. Current reference against supply voltages variations
Fig. 7. Temperature dependence of the current reference
Fig. 8. Voltage reference against supply voltage variations
4.4 Computer Simulation Results
The proposed current reference has been simulated (BSIM3v3) in a 0.35 µm standard CMOS technology, whose threshold voltages are about 500 mV and 700 mV for NMOS and PMOS, respectively. The transistor sizes and gate voltages of the proposed example are given in Table 1: all transistors are thus biased in moderate inversion. The area of the presented cell is estimated at less than 800 µm². The simulation results show that the sensitivity of the reference current to the supply voltage is adequate for industrial applications but should be optimized. Sensitivity to process variations is quite acceptable. On the other hand, the temperature dependence, with under 8% of deviation across 160 °C of temperature variation, is relatively low and quite acceptable for most applications.

Table 1. Transistor sizes and gate voltages

            N1     N2     N3     N4     N5     P1    P2,3
  W [µm]   1.5    1.5      9     30      3     60     30
  L [µm]    20     11      1      1     81      2      2
  VG           430 mV         620 mV          600 mV
Finally, the voltage reference's independence of process and supply voltage variations is very good.
5 Conclusion

Moderate inversion (MI) has been successfully used in two designs: a linear-with-temperature (LWT) voltage reference, which is the base of a new all-in-MI current reference. The EKV 2.0 MOS model provides analog designers with invertible expressions to design circuits without inversion-level constraints. Experimental data confirm that the accuracy of the BSIM3v3 simulation model is good enough to design static circuits in MI. The Vittoz self-cascode structure has been studied in the light of EKV. This structure of course provides a compact PTAT voltage reference when biased in WI, but this paper shows that it is also the case in SI if Nb ≈ 10^16 cm−3. Moreover, the structure is shown to be LWT in MI and SI for commonly used MOS technologies where Nb ≈ 10^16 cm−3. This inversion-level-free characteristic makes it possible to optimize silicon area with regard to the current and minimum-supply-voltage targets, and to minimize high-temperature leakage. An inversion-level-free, self-biased current reference has been proposed. This design, optimum in terms of supply voltage (Vdd,min ≥ VGS + VDS), allows great flexibility by not imposing the inversion level of the transistors. The proposed example (35nA at 1V supply voltage) is compensated in temperature and process variations, while consuming a relatively small silicon area (estimated at less than 800μm2).
References
1. Binkley, D., Hopper, C., Tucker, S., Moss, B., Rochelle, J., Foty, D.: A CAD methodology for optimizing transistor current and sizing in analog CMOS design. IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems 22(2), 225–237 (2003)
2. Enz, C.C., Krummenacher, F., Vittoz, E.A.: An analytical MOS transistor model valid in all regions of operation and dedicated to low-voltage and low-current applications. Analog Integrated Circuits and Signal Processing 8, 83–114 (1995)
3. Terry, S.C., Rochelle, J.M., Binkley, D.M., Blalock, B.J., Foty, D.P., Bucher, M.: Comparison of a BSIM3V3 and EKV MOSFET model for a 0.5 μm CMOS process and implications for analog circuit design. IEEE Transactions on Nuclear Science 50(4), 915–920 (2003)
4. Oguey, H., Aebischer, D.: CMOS current reference without resistance. IEEE J. Solid-State Circuits 32(7), 1132–1135 (1997)
5. Camacho-Galeano, E.M., Galup-Montoro, C., Schneider, M.C.: A 2-nW 1.1-V self-biased current reference in CMOS technology. IEEE Trans. Circuits Syst. II, Exp. Briefs 52(2), 333–336 (2005)
6. Najafizadeh, L., Filanovsky, I.M.: Towards a sub-1 V CMOS voltage reference. In: ISCAS, pp. 53–56 (2004)
7. Vittoz, E.A., Neyroud, O.: A low-voltage CMOS bandgap reference. IEEE J. Solid-State Circuits 14(3), 573–577 (1979)
8. Sze, S.M.: Semiconductor Devices: Physics and Technology, 2nd edn. John Wiley & Sons, Chichester (1981)
9. Prodanov, V., Green, M.: CMOS current mirrors with reduced input and output voltage requirements. Electronics Letters 32, 104–105 (1996)
10. Guigues, F., Kussener, E., Malherbe, A., Duval, B.: Sub-1V Oguey's current reference without resistance. In: Proceedings of the 2006 13th IEEE International Conference on Electronics, Circuits and Systems, Nice (2006)
11. Enz, C.C., Vittoz, E.A.: Charge-based MOS Transistor Modeling. John Wiley & Sons, Chichester (2006)
On Two-Pronged Power-Aware Voltage Scheduling for Multi-processor Real-Time Systems Naotake Kamiura, Teijiro Isokawa, and Nobuyuki Matsui Graduate School of Engineering, University of Hyogo, 2167 Shosha, Himeji, Japan {kamiura,isokawa,matsui}@eng.u-hyogo.ac.jp
Abstract. A power-aware voltage-scheduling heuristic is presented for a hard real-time multi-processor system. Given a task graph, the offline component first allocates a certain percentage of the worst-case execution units of some tasks to them as portions to be executed at a higher voltage. Once some path has been sped up, the rest of the offline component chooses and speeds up one of the paths sharing tasks with that path. The online component reclaims the slack, which occurs when some task actually finishes, to slow down the execution speed of its successor. Experimental results are finally provided to demonstrate the effectiveness of the proposed heuristic. Keywords: hard real-time system, voltage scheduling, energy saving, dependent tasks.
1 Introduction

Power management has become popular for modern high-performance multiprocessor systems [1]. If a hard real-time system runs on multiple voltages, energy savings can successfully be achieved by voltage scheduling [2]-[12]. In particular, the voltage-scheduling algorithm in [10] and [11] is applicable both to independent tasks and to task graphs with precedence constraints. It reclaims the slack, which is the time actually not consumed by a task, to reduce the execution speed of its successor, provided that all the units of the successor are executed at a single speed. The power-aware heuristic in [12] is divided into offline and online components. The offline component yields a static voltage configuration for a graph of tasks, under its worst-case execution profile. The configuration for a task consists of a portion to be run at a higher voltage and one to be run at a lower voltage. Each time a task finishes, the online component reclaims the slack that occurs for the just-completed task to slow down the system. In [12], it is experimentally established that the heuristic outperforms the algorithm in [10] in energy savings. This is due to allowing voltage switching at most once per task. The offline component, however, sometimes causes heavy concentrations of portions executed at a higher voltage on certain tasks. This paper proposes a two-pronged power-aware voltage scheduling for a hard real-time system supporting three voltage levels. It also includes offline and online components, and allows voltage switching for each task. The offline component first allocates a certain percentage of the worst-case execution units of some tasks to them just
N. Azemard and L. Svensson (Eds.): PATMOS 2007, LNCS 4644, pp. 423–432, 2007. © Springer-Verlag Berlin Heidelberg 2007
424
N. Kamiura, T. Isokawa, and N. Matsui
one time as portions to be executed at a higher voltage. The percentage is systematically determined using one of the paths (i.e., sets of tasks) not meeting the deadline of the task graph. The offline component then chooses a path sharing tasks with the path that has most recently been sped up, and speeds it up so that the deadline can be met for it. The above strategies allow us to avoid heavy concentrations of portions executed at a higher voltage, and thereby improve the energy-saving performance of the online component. Simulation experiments are performed to establish that the proposed heuristic achieves more energy savings than the heuristic in [12] if the deadline is comparatively relaxed.
2 Preliminaries

In CMOS devices, the circuit delay at a reduced power supply voltage equals CL·vDD/(K(vDD − vT)^α) [13], [14], where vDD is the power supply voltage, CL is the circuit output load capacitance, K is a constant depending on the process and gate size, vT is the threshold voltage, and α equals 2 in this paper as well as in [12]. The system model is as follows. Each processor is independent and is connected by a low-cost fast interconnection network. A processor operates at three voltage levels: vHI, vLO, and vIDLE (vHI > vLO > vIDLE). vHI and vLO are voltages at which the processors can do useful computation, whereas vIDLE is the voltage necessary to sustain the system in the idle state. Inter-task communication only happens after each task has finished its computation. The communication cost is therefore ignored. The energy cost when the processors are idle is also ignored. Besides, the voltage-switching cost is considered negligible both with respect to the time needed and the energy expended. The factor by which the processor is slower at voltage v relative to when at the highest voltage vHI is
slow(v) = (v/vHI) · ((vHI − vT)/(v − vT))^2    (1)
One unit of execution is defined as the computation performed by the processor at vHI in unit time. It takes slow(v) units of time at voltage v. The ratio of the energy consumed per cycle by a processor at voltage v relative to that at voltage vHI is (v/vHI)^2. The above formulas and assumptions are also employed in [12]. Given a task precedence graph (TPG), the offline component in [12] schedules vHI and vLO for tasks on the basis of their worst-case execution profiles. A path is a set of tasks from a source to a sink of the TPG. Critical paths are paths that miss the deadline of the TPG under the current voltage configuration. The weight of a task equals the frequency with which the task is included in all of the critical paths. The rslack is the difference between the deadline and the worst-case execution time of the critical path under the current voltage configuration. Let top_level (or bottom_level) denote the maximum of the sum of the worst-case execution units from any connected source to the given task (or from the given task to any connected sink). The execution units of the given task are excluded from (or included in) the top_level (or bottom_level).
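The top_level / bottom_level definitions can be sketched directly. The graph below is a hypothetical instance: its worst-case units and edges are inferred from the paths of the Fig. 3 example later in the paper (edges 1→5, 2→4, 4→5, 4→6, 4→7, 3→7).

```python
# Hypothetical TPG inferred from the paper's Fig. 3 example
wc = {1: 28, 2: 4, 3: 28, 4: 30, 5: 20, 6: 16, 7: 18}
succ = {1: [5], 2: [4], 3: [7], 4: [5, 6, 7], 5: [], 6: [], 7: []}
pred = {t: [s for s in wc if t in succ[s]] for t in wc}

def top_level(t):
    # max sum of worst-case units from any connected source to t, excluding t itself
    return max((top_level(p) + wc[p] for p in pred[t]), default=0)

def bottom_level(t):
    # max sum of worst-case units from t to any connected sink, including t itself
    return wc[t] + max((bottom_level(s) for s in succ[t]), default=0)

# list-scheduling priority used later in the paper: top_level + bottom_level
priority = {t: top_level(t) + bottom_level(t) for t in wc}
```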
On Two-Pronged Power-Aware Voltage Scheduling
425
The static voltage scheduling (offline component) in [12] is summarized as follows.
<Static Voltage Scheduling in [12]>
Step 1 Provided that all of the tasks in the TPG are run at vLO, worst-case execution units are calculated for arbitrary paths, and a set of critical paths is generated.
Step 2 Weights are calculated for the tasks, each of which includes a portion to be run at vLO, in the critical paths. One of these tasks is chosen in the next step.
Step 3 The task with maximum weight is chosen as taskId. If more than one task has the same maximum weight, the task with minimum bottom_level is chosen.
Step 4 The path with minimum rslack is chosen as pathId, among all the critical paths having taskId as a member task. It is sped up in the next step.
Step 5 Appropriate units of taskId are changed to run at vHI instead of vLO. If the rslack of pathId is not covered by this adjustment, the entire taskId is run at vHI.
Step 6 Path execution times are updated, and any path that now meets the deadline is removed from the critical-path set. If no critical paths remain in the set, stop this scheduling; otherwise, go to Step 2.
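The greedy loop of Steps 1-6 can be sketched in Python. This is a simplified, hypothetical reconstruction, not the authors' code: the TPG, units, and deadline are those of the Fig. 3 worked example later in the paper (D = 99, slow(vLO) = 2.25), and reading "minimum rslack" as the smallest deadline shortfall is an assumed tie-break, chosen because it reproduces the vHI allocations quoted there for [12] (10.8, 7.2 and 3.6 units for tasks 4, 5 and 7).

```python
SLOW = 2.25          # slow(vLO) in the paper's worked example
D = 99.0             # deadline of the example TPG
wc = {1: 28, 2: 4, 3: 28, 4: 30, 5: 20, 6: 16, 7: 18}        # worst-case units
paths = [(1, 5), (2, 4, 5), (2, 4, 6), (2, 4, 7), (3, 7)]    # source-to-sink paths
bottom = {1: 48, 2: 54, 3: 46, 4: 50, 5: 20, 6: 16, 7: 18}   # bottom_level per task
hi = {t: 0.0 for t in wc}                                    # units moved to vHI

def ptime(p):
    # worst-case path time: vHI units run at speed 1, the rest run slowed down
    return sum(hi[t] + (wc[t] - hi[t]) * SLOW for t in p)

while True:
    crit = [p for p in paths if ptime(p) > D + 1e-9]         # Steps 1/6: critical paths
    if not crit:
        break
    cand = {t for p in crit for t in p if hi[t] < wc[t]}     # tasks with a vLO portion
    # Steps 2/3: max weight (frequency in critical paths), tie-break min bottom_level
    task_id = max(cand, key=lambda t: (sum(t in p for p in crit), -bottom[t]))
    # Step 4 (assumed reading): critical path through task_id with smallest shortfall
    path_id = min((p for p in crit if task_id in p), key=lambda p: ptime(p) - D)
    # Step 5: moving u units from vLO to vHI saves u * (SLOW - 1) time units
    need = (ptime(path_id) - D) / (SLOW - 1)
    hi[task_id] = min(wc[task_id], hi[task_id] + need)
```

On this toy instance the loop ends with hi = {4: 10.8, 7: 3.6, 5: 7.2} and every path at exactly the 99-unit deadline.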
3 Power-Aware Heuristic for Task Graphs

3.1 Task Assignment and Online Voltage Scheduling
A list scheduling heuristic adapted from [15] is employed in [12]. It gives the sum of top_level and bottom_level to every task as its priority. Whenever a free slot is found on a processor, it assigns the ready task with the highest priority (i.e., the largest sum) to that processor. It is also adopted in this paper. The following online voltage scheduling is referred to as the dynamic resource reclamation (DRR) employed in [12]. Once the offline voltage scheduling described in Subsect. 3.2 is complete, a start_time and a commit_time are assigned to every task. The start_time of a task is the latest time, relative to the beginning of the execution of the TPG, at which the task must be invoked. The commit_time is the time by which the task must complete its execution. In addition, the current_time is the time when the task actually finishes on the processor. Since the offline voltage scheduling is based on the worst-case execution profiles, a task will finish before or at its commit_time during actual runtime. If its successor has no other pending dependencies, the DRR reclaims the slack, equal to the difference between the start_time of the successor and the current_time of the just-completed task, to transfer execution units denoted by units_to_LO from running at vHI to running at vLO for the successor. units_to_LO is given as follows:
units_to_LO = slack / (slow(vLO) − 1)    (2)
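Eq. (2) follows because transferring u units from vHI to vLO lengthens the successor's worst-case time by u·(slow(vLO) − 1); a minimal sketch (function name illustrative):

```python
def units_to_lo(slack, slow_lo):
    # largest number of units that can move from vHI to vLO without
    # exceeding the reclaimed slack: u * (slow_lo - 1) <= slack
    return slack / (slow_lo - 1)
```

For example, with slow(vLO) = 2.25, a slack of 2.5 time units lets exactly 2 execution units move down to vLO.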
3.2 Static Voltage Scheduling
If all of the paths in Fig.1 are critical and there are no branching points in part of the path between tasks Ti1 and Tim, where i=1, 2, 3, the offline voltage scheduling in [12]
426
N. Kamiura, T. Isokawa, and N. Matsui
Tβ
Tγ Tσ
T11 T12
T1m
T21 T22
T2m
T31 T32
T3m
Fig. 1. Example TPG with a task on which a heavy concentration of portions executed at vHI occurs
chooses Tσ as taskId up to five successive times. Units to be run at vHI would then concentrate in it. Besides, since Tσ appears at a comparatively early slot in the Gantt chart, this is a hard case for the DRR. The strategies in Subsects. 3.2.1 and 3.2.2 are adopted to overcome this difficulty.

3.2.1 PEAK Allocation

The first strategy is to allocate a PEAK, which is defined as a comparatively short run of units to be executed at vHI, to every task in all critical paths. pathId and taskId are first determined by means of Steps 1-4 in Sect. 2. The first chosen pathId and taskId are referred to as pathIdf and taskIdf, respectively, and D denotes the deadline of the TPG. Let lf denote the sum of the worst-case execution units of the member tasks in pathIdf. A PEAK equals 100×PerPEAK percent of the worst-case execution units of a task, where PerPEAK is
PerPEAK = (D − lf · slow(vLO)) / (lf · (1 − slow(vLO)))    (3)
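Eq. (3) as a one-liner, checked against the worked example later in this section (D = 99, pathIdf = 2→4→6 with lf = 50 worst-case units, slow(vLO) = 2.25). The function name is illustrative.

```python
def per_peak(deadline, lf, slow_lo):
    # Eq. (3): PEAK fraction; numerator and denominator are both negative
    # for a critical pathIdf (lf * slow_lo > deadline, slow_lo > 1)
    return (deadline - lf * slow_lo) / (lf * (1.0 - slow_lo))
```

For the worked example this gives (99 − 112.5)/(50 × (−1.25)) = 0.216, matching the value quoted in the paper.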
Note that the length of the PEAK allocated to one task differs from that allocated to another if the worst-case execution units of the former are not equal to those of the latter. In the DRR phase, enough slack is unavailable for tasks appearing at the early slots in the Gantt chart, because few (or, at worst, no) tasks precede them. Short PEAK's are therefore favorable for such early-scheduled tasks. pathIdf is one of the shortest paths, and hence Eq. (3) yields a comparatively small value of PerPEAK. This is why pathIdf is used for the PEAK calculation. PEAK's are allocated only once, at the beginning of the offline component. Execution times required for member tasks in critical paths are then updated, and rslack is also recalculated for every critical path.

3.2.2 Critical Path Choice Dependent on Task Sharing

The second strategy is to choose the critical path sharing member tasks with pathId. SId hereinafter means the set of pathId and the critical paths that share member tasks with
pathId from a source of a TPG to taskId. In Fig. 1, if pathId is the path Tβ→Tσ→T1→…→T1m, we first have three paths, Tβ→Tσ→Ti→…→Tim, where i=1, 2, 3, as member paths of SId. taskId (Tσ in Fig. 1) is included in each member of SId. pathId has the shortest rslack in SId, and is chosen first. Paths chosen among the members of SId after pathId has been sped up are referred to as n_pathId's. In an n_pathId, the task with maximum weight among all of the member tasks, each of which has a smaller bottom_level than taskId, is referred to as n_taskId. The n_taskId in some n_pathId is nearer to the leaf of the TPG than the taskId in it. If several member tasks in n_pathId are available as n_taskId, the task with minimum bottom_level is chosen. In Fig. 1, Tim is the n_taskId for n_pathId Tβ→Tσ→Ti→…→Tim, where i=2, 3. The proposed static (offline) voltage scheduling is summarized as follows.
<Static Voltage Scheduling>
Step 1 The PEAK allocation and rslack recalculation in Subsect. 3.2.1 are made, after SId is generated for a given TPG. Any path with rslack equal to zero units is removed from SId. If SId becomes empty, go to Step 2; otherwise, go to Step 5.
Step 2 As long as there exist critical paths in the TPG, the following steps are repeated. Steps 2-4 in Sect. 2 determine taskId and pathId under the current voltage configuration. SId is then generated.
Step 3 Step 5 in Sect. 2 adjusts taskId so that pathId can meet the deadline.
Step 4 Execution times of all critical paths in the TPG are updated, and any path with zero units of rslack is removed from SId. If SId becomes empty, go to Step 2.
Step 5 Weights are calculated for all the tasks, each of which includes a portion to be run at vLO. The next step will never choose a task without such a portion.
Step 6 The member path with minimum rslack among all the paths in SId is chosen as n_pathId. If there are no tasks available as n_taskId's in n_pathId, go to Step 2; otherwise, choose one of the available tasks as n_taskId.
Step 7 Appropriate units of n_taskId are changed to run at vHI instead of vLO so that n_pathId can meet the deadline. If its rslack is not covered by this adjustment, the entire n_taskId is run at vHI.
Step 8 Execution times of all critical paths in the TPG are updated, and any path with rslack equal to zero units is removed from SId. If SId becomes empty, go to Step 2; otherwise, go to Step 5.
Step 1 generates the first SId with pathIdf. Once PEAK's are allocated to tasks, pathIdf meets the deadline. This is why Steps 2-4 are skipped for the first SId. The offline component in [12] simply tries to speed up critical paths in ascending order of their rslack's, by adjusting taskId's. Recall the example of Fig. 1. To speed up the paths Tβ→Tσ→T2→…→T2m and Tβ→Tσ→T3→…→T3m, the component in [12] would adjust the taskId Tσ two successive times. The proposed component adjusts the n_taskId T2m (or T3m) for the former (or latter) path. Steps 1-8 are thus useful in addressing the issue of concentrations of portions executed at vHI on certain tasks. Once Steps 1-8 are complete, a start_time and a tentative commit_time are given to every task. The final commit_time is determined as follows. Let us assume that task Tp precedes q tasks, Tδ1, Tδ2, …, and Tδq, and that Tp and Tk are assigned to the same processor as shown in Fig. 2. In the chart, Tk is scheduled close on the heels of Tp. STi and COTi mean the start_time and commit_time of task Ti, respectively. The COTp in
[Figure: (a) task Tp preceding Tδ1, Tδ2, …, Tδq; (b) timeline showing Tp followed by Tk, with the tentative COTp, STk, and MIN1≤j≤q STδj]
Fig. 2. Process to fix commit_time of Tp
Fig. 2(b) is tentatively given by Steps 1-8. Let MIN1≤j≤qSTδj denote the earliest start_time out of q STδj’s. Final COTp is fixed at MIN1≤j≤qSTδj if and only if tentative COTp is earlier than MIN1≤j≤qSTδj and MIN1≤j≤qSTδj is earlier than STk. This process then allows us to expand the portion executed in vLO for Tp. Note that, if Tp is the task last scheduled for a processor in the Gantt chart, final COTp is fixed at D. Assigning the final commit_time to every task yields the static voltage configuration. Fig. 3 depicts a TPG used in [12] as an example. Two numbers on the side of each circle are the worst-case execution units (in bold) and the actual execution units at runtime for some execution instance, respectively. The assumptions employed in [12] are as follows: the three-processor system, vHI=3.3 V, vLO=2 V, slow(vLO)=2.25, and D=99 (i.e., deadline). In [12], tasks (1, 5), (2, 4, 6), and (3, 7) are assigned to Processors 1, 2, and 3, respectively, in ascending order of task numbers. The proposed heuristic is examined under the above assumptions and assignment.

[Figure: example TPG; worst-case / actual execution units per task: task 1: 28/22.88, task 2: 4/2.5, task 3: 28/26.1, task 4: 30/26.8, task 5: 20/11.8, task 6: 16/14.86, task 7: 18/11.2]
Fig. 3. Example TPG with execution times in terms of vHI
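The commit_time-fixing rule stated before Fig. 3 (final COTp moves to MIN1≤j≤q STδj only when tentative COTp < MIN1≤j≤q STδj < STk, and the last task on a processor commits at D) can be sketched as follows; the function name and arguments are illustrative, not from the paper.

```python
def final_commit(tentative_cot, succ_start_times, st_next_task, deadline,
                 is_last_on_processor=False):
    # the last task scheduled on a processor commits at the deadline D
    if is_last_on_processor:
        return deadline
    m = min(succ_start_times)               # MIN of successors' start_times
    if tentative_cot < m < st_next_task:    # rule illustrated in Fig. 2(b)
        return m                            # Tp's vLO portion can expand up to m
    return tentative_cot
```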
The worst-case path execution times in terms of vLO are first calculated. For example, the time required for path 2→4→6 is 112.5 (=(4+30+16)×2.25). All the five paths (1→5, 2→4→5, 2→4→6, 2→4→7, 3→7) are critical. Task 4 with weight 3 and 2→4→6 are chosen as taskIdf and pathIdf, respectively. The first SId equals {2→4→5, 2→4→6, 2→4→7}. Eq.(3) calculates 0.216 as PerPEAK, and PEAK’s are allocated as
follows: 6.048 units to each of tasks 1 and 3, 0.864 units to task 2, 6.48 units to task 4, 4.32 units to task 5, 3.456 units to task 6, and 3.888 units to task 7. For every task, the portion to be run at vHI is placed behind that at vLO. This assignment makes it possible to further reduce energy expenditure, because the task may complete its execution while the processor still runs at vLO. SId equals {2→4→5, 2→4→7} after the PEAK allocation. Then, the worst-case execution time for 2→4→7 (or 2→4→5) is recalculated as 102.96 (or 106.92). 2→4→7 is next sped up as n_pathId by adding 3.168 units to the PEAK (3.888 units) of the n_taskId (task 7). Task 5 is similarly adjusted as n_taskId for speeding up 2→4→5. Final commit_time's are lastly determined for tasks 1 and 3 by transferring their PEAK's (6.048 units for each of them) from running at vHI to running at vLO. Fig. 4(a) depicts the final static voltage configuration. The offline component in [12] gives 10.8, 7.2 and 3.6 units to tasks 4, 5, and 7 as portions to be run at vHI, respectively.
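The PEAK allocation and the path-time recalculations in this example can be replayed numerically. This is a sketch: PER_PEAK = 0.216 comes from Eq. (3), and the variable names are illustrative.

```python
SLOW = 2.25
PER_PEAK = 0.216
wc = {1: 28, 2: 4, 3: 28, 4: 30, 5: 20, 6: 16, 7: 18}
peak = {t: PER_PEAK * u for t, u in wc.items()}      # e.g. task 4 -> 6.48 units

def task_time(t, extra_hi=0.0):
    # worst-case time with (PEAK + extra_hi) units at vHI and the rest at vLO
    hi = peak[t] + extra_hi
    return hi + (wc[t] - hi) * SLOW

t_247 = task_time(2) + task_time(4) + task_time(7)   # 102.96 after the PEAKs
extra_7 = (t_247 - 99) / (SLOW - 1)                  # extra vHI units for task 7
```

This reproduces the 102.96-unit worst-case time of path 2→4→7 and the 3.168 extra vHI units added to task 7.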
Fig. 4. Gantt charts for graph shown in Fig.3. Fig.4 (a) depicts the static voltage configuration, whereas Fig.4 (b) depicts actual behavior of tasks.
For Fig. 4(a), the DRR is first invoked at time 5.625. Due to units_to_LO=1.836, the configuration for task 4 is then updated as follows: 25.356 units (57.051 units in terms of time at vLO) for the portion to be run at vLO and 4.644 units for that at vHI. The sum of 5.625, 57.051 and 4.644 equals the commit_time of task 4 in Fig. 4(a). The DRR thus never changes the commit_time of the successor, and never causes a deadline miss even if the successor actually uses up the time specified by its worst-case execution profile. Fig. 4(b) depicts the result of the DRR. The units consumed at vHI are 1.444, whereas they are 4.9 for task 4 if the offline and online components in [12] are applied. It is thus possible for the proposed heuristic to achieve further energy savings over the heuristic in [12].
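This first DRR invocation can also be replayed. The inputs below (task 2 actually finishing at 5.625, task 4's start_time of 7.92, and its static 6.48 vHI / 23.52 vLO split) are taken or inferred from the worked example; this is a sketch, not the authors' code.

```python
SLOW_LO = 2.25
slack = 7.92 - 5.625                  # start_time(task 4) - current_time(task 2)
units_to_lo = slack / (SLOW_LO - 1)   # Eq. (2): units moved down to vLO
hi_units = 6.48 - units_to_lo         # remaining vHI portion of task 4
lo_units = 23.52 + units_to_lo        # vLO portion of task 4
commit = 5.625 + lo_units * SLOW_LO + hi_units   # worst-case finish of task 4
```

This reproduces units_to_LO = 1.836, the 25.356/4.644 split, and an unchanged commit_time of 67.32.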
4 Experimental Results

In simulation experiments, the formula Esave=(E−EPVS)/E is used to assess energy savings, where EPVS (or E) denotes the energy expenditure when the proposed voltage scheduling (or another scheduling) is applied. The voltage levels are as follows: vHI=1.75 V, vLO=1.0 V, and vT=0.2 V. They are also used in [12]. The processors have been modeled based on the technology used for Intel XScale processors [12], [16]. Graphs
referred to as Robot control and random TPG's are prepared. Robot control [17], with 88 tasks and 131 edges, is related to the Newton-Euler dynamic control calculation for a manipulator. A set of graphs randomly generated by four methods (sameprob, samepred, layrprob, layrpred) is also published in [17]. Twenty graphs, each of which has 50 tasks, are randomly selected from this set as random TPG's. The worst-case execution units of each task are given in [17]. The actual execution units of each task are randomly determined in the range [AR, 100] percent of its worst-case profile. Given a TPG, each offline component yields a static voltage configuration. A set of 20 combinations, each of which has actual execution units randomly determined for arbitrary tasks under some AR value, is then generated. The online component is conducted under the condition specified by each member combination. Esave is calculated using the results of the online component for the two static voltage configurations. This calculation is made for all of the member combinations in the above set. The proposed heuristic is compared with others based on averaged Esave's. In the simulations, the deadline of Robot control is simply varied in regular intervals, while DP in the following equation is varied for each random TPG:
deadline = DMIN + DP × (DMAX − DMIN)/10,    (4)

where DMAX denotes the shortest execution time required for the given TPG when it is scheduled in such a way that all of its tasks are entirely run at vLO, and DMIN equals the longest path length. This is due to the fact that DMIN and DMAX of one random TPG probably differ from those of another. Note that DMIN and DMAX are based on the worst-case profiles of all the tasks. The proposed heuristic is first compared with the case where a graph is scheduled under vHI alone. Fig. 5(a) (or 5(b)) depicts results for Robot control, on condition of
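Eq. (4) is a simple linear sweep of the deadline between the two bounds; a minimal sketch (function name illustrative):

```python
def deadline(d_min, d_max, dp):
    # Eq. (4): dp = 0 gives DMIN, dp = 10 gives DMAX
    return d_min + dp * (d_max - d_min) / 10.0
```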
[Plots: energy savings (%) versus deadline (600–1200); (a) curves for AR = 20, 40, 60, 80, 100; (b) curves for 8, 10, 12, and 14 processors]
Fig. 5. Energy savings after runtime adjustments for Robot control. To estimate the savings, the system where the proposed heuristic is applied is compared with a system with no voltage scheduling, that is, one where all tasks run at the higher voltage vHI. The plots in Fig. 5(a) are obtained by varying the variance in execution time, whereas those in Fig. 5(b) are obtained by varying the number of processors.
12-processor systems (or AR=60). The plots in Fig. 5 (a) establish that Esave’s increase as AR’s decrease. This implies that sufficient slack occurs in the case where AR is small, and it is effectively reclaimed. The plots in Fig. 5 (b) demonstrate that better Esave’s are achieved with an increasing number of processors. It seems that the parallelism inherent in the system can be well exploited.
[Plots: Esave (%) comparison; (a) versus deadline (600–1200) for Robot control, (b) versus DP + 1 (0–12) for random TPG's; curves for AR = 20, 40, 60, 80, 100]
Fig. 6. Comparison of the proposed heuristic with the heuristic in [12]. The plots in Fig. 6(a) are obtained for Robot control, whereas those in Fig. 6(b) are obtained for random TPG's.
The heuristic in [12] is the most closely related to the proposed heuristic, and hence a comparison of the two is discussed. Figs. 6(a) and 6(b) depict results for Robot control and random TPG's, respectively. The 12-processor (or 14-processor) systems are prepared for Robot control (or random TPG's). The offline component in [12] adjusts a task with maximum weight per speedup. It affects a number of paths while paying the energy price only once, and often yields a static voltage configuration with shorter runs of units to be consumed at vHI than the proposed heuristic. The proposed heuristic tries to assign units to be run at vHI to later slots than the heuristic in [12], except for PEAK's. Fig. 6 establishes the above. When AR≤80 for Robot control, the proposed heuristic consumes less energy than the heuristic in [12]. A similar tendency applies to random TPG's. The DRR thus achieves better performance for static voltage configurations yielded by the proposed scheme than by the scheme in [12], if enough slack occurs at runtime.
5 Conclusions

This paper has proposed a power-aware heuristic that schedules voltage levels for hard real-time systems. PEAK's, defined as short runs of units to be executed at vHI, are first allocated to some tasks. The critical path to be sped up is then chosen in accordance with the concept of task sharing. The DRR is applied to the static voltage configuration for a TPG, to further save energy. Numerical results have established that the proposed
heuristic makes it possible to reduce energy expenditure, compared with the heuristic in [12], if enough slack can be expected to occur during actual runtime. In future studies, the proposed heuristic will be modified so that it is applicable to processors running under four or more voltage levels.
References
1. Hariyama, M., Aoyama, T., Kameyama, M.: Genetic Approach to Minimizing Energy Consumption of VLSI Processors Using Multiple Supply Voltages. IEEE Trans. on Comput. 54(16), 642–650 (2005)
2. Burd, T.D., Pering, T.A., Stratakos, A.J., Brodersen, R.W.: A Dynamic Voltage Scaled Microprocessor System. IEEE Journal of Solid-State Circuits 35(11), 1571–1580 (2000)
3. Hong, I., Potkonjak, M., Srivastava, M.: On-line Scheduling of Hard Real-Time Tasks on Variable Voltage Processor. In: Proc. of International Conference on Computer Aided Design, pp. 653–656 (1998)
4. Pering, T., Burd, T., Brodersen, R.: Voltage Scheduling in the lpARM Microprocessor System. In: Proc. of Int. Symp. on Low-Power Electronics and Design, pp. 96–101 (2000)
5. Krishna, C.M., Lee, Y.-H.: Voltage-Clock-Scaling Adaptive Scheduling Techniques for Low Power in Hard Real-Time Systems. IEEE Trans. on Comput. 52(12), 1586–1593 (2003)
6. Barnett, J.A.: Dynamic Task-Level Voltage Scheduling Optimizations. IEEE Trans. on Comput. 54(5), 508–520 (2005)
7. Gruian, F., Kuchcinski, K.: LEneS: Task Scheduling for Low-Energy Systems Using Variable Voltage Processors. In: Proc. of Asia South Pacific Design Automation Conference, pp. 449–455 (2001)
8. Zhang, Y., Hu, X., Chen, D.Z.: Task Scheduling and Voltage Selection for Energy Minimization. In: Proc. of the 39th Design Automation Conference, pp. 183–188 (2002)
9. Mishra, R., Rastogi, N., Zhu, D., Mossé, D., Melhem, R.: Energy Aware Scheduling for Distributed Real-Time Systems. In: Proc. of Int. Parallel and Distributed Processing Symposium, pp. 22–26 (2003)
10. Zhu, D., Melhem, R., Childers, B.R.: Scheduling with Dynamic Voltage/Speed Adjustment Using Slack Reclamation in Multi-processor Real-Time Systems. In: Proc. of IEEE Real-Time Systems Symposium, pp. 84–92. IEEE Computer Society Press, Los Alamitos (2001)
11. Zhu, D., AbouGhazaleh, N., Mossé, D., Melhem, R.: Power Aware Scheduling for AND/OR Graphs in Multi-processor Real-Time Systems. In: Proc. of Int. Conference on Parallel Processing, pp. 593–601 (2002)
12. Roychowdhury, D., Koren, I., Krishna, C.M., Lee, Y.-H.: A Voltage Scheduling Heuristic for Real-Time Task Graphs. In: Proc. of the Performance and Dependability Symposium, pp. 741–750 (2003)
13. Chandrakasan, A.P., Sheng, S., Brodersen, R.W.: Low Power CMOS Digital Design. IEEE Journal of Solid-State Circuits 27(4), 473–484 (1992)
14. Ishihara, T., Yasuura, H.: Voltage Scheduling Problem for Dynamically Variable Voltage Processors. In: Proc. of Int. Symp. on Low-Power Electronics and Design, pp. 197–201 (1998)
15. Yang, T., Gerasoulis, A.: List Scheduling with and without Communication Delays. Parallel Computing 19(12), 1321–1344 (1993)
16. http://www.intel.com/design/intelxscale/
17. http://www.kasahara.elec.waseda.ac.jp/schedule/
Semi Custom Design: A Case Study on SIMD Shufflers Praveen Raghavan1,2, Nandhavel Sethubalasubramanian1,3, Satyakiran Munaga1,2 , Estela Rey Ramos1 , Murali Jayapala1, Oliver Weiss5 , Francky Catthoor1,2 , and Diederik Verkest1,2,4 1
IMEC vzw, Heverlee, Belgium {ragha,satyaki,reyramos,jayapala,catthoor,verkest}@imec.be, [email protected], [email protected] 2 ESAT, KULeuven, Leuven, Belgium 3 International University Bremen, Germany 4 Dept. of Electrical Engineering, VUB, Brussels, Belgium 5 Chair of Electrical Engineering and Computer Systems, RWTH Aachen University, Aachen, Germany
Abstract. Power has become the most important candidate for optimization in today's designs. This is necessary for further functionality and processing capability to be added to a design. Standard cell design is the de facto standard for most IC designs. The other end of the spectrum is full custom design, whose efficiency is very high but which requires a large design time. In this paper we investigate the use of prototype module generators to improve the energy efficiency of a design over standard cell design while trading off some design time. We investigate this on an interconnect-intensive design, namely the SIMD shuffler, which is one of the important parts of a low-power embedded processor's datapath. We show that using module generators, we can reduce the energy consumption of the shuffler by about 30%. We also point out research opportunities for filling in the missing EDA tool support for low power.
1 Introduction
Given the continuous increase in demand for computational performance, power has become one of the most important optimization objectives. To keep up with the growing performance demand, embedded processors have started using various data-level parallelism (SIMD) [1,2,3] techniques. Furthermore, as technology scales, it is known that energy consumption in interconnect is bound to dominate [4]. This is because the etch-stop layers between the metal layers do not scale, and therefore the net-k1 between two metal wires does not scale. This implies that the parts of the processor that are interconnect-dominated will start consuming a large amount of energy. One of the most commonly used methods to reduce interconnect is full custom design. While full custom design is very efficient, the design time required for full
“net-k” is the total dielectric equivalent between two metal wires routed on the chip.
N. Azemard and L. Svensson (Eds.): PATMOS 2007, LNCS 4644, pp. 433–442, 2007. c Springer-Verlag Berlin Heidelberg 2007
P. Raghavan et al.
custom design is very high. At the other end of the spectrum resides standard cell design, which has a fast design time but poor quality [5,6]. There are also hybrid techniques like [7], which perform module generator based design. These module generator based design methodologies are especially interesting for datapaths, as datapaths typically exhibit a high degree of locality in their design (for example, the carry chain in adders). In this paper, we present a case study of a datapath generator based design of an interconnect-dominated part of a processor datapath, viz. the SIMD shuffler/permutation unit. The shuffler is one of the datapath elements that shows regularity in its design. We present the different advantages and disadvantages of using a datapath generator based methodology compared to standard cell based design. We analyze the gains obtained using the datapath generator based flow compared to a standard cell based design, provide qualitative reasons for these gains, and indicate where future research should focus. The remainder of the paper is organized as follows: Section 2 motivates the need for a solution which is different from the commonly used standard cell or full custom design. Section 3 describes the shuffle unit and the micro-architecture that was used for the case study. The module generator based methodology is described in Section 4. We present our analysis of the design using the standard cell and the module generator based methodologies in Section 5. Finally, we conclude in Section 6.
2 Motivation
In this section, we first motivate the need for a non-standard-cell-based flow that can give higher energy efficiency. We then motivate the choice of the datapath used for the case study.
Fig. 1. Design Time vs. Energy Efficiency Tradeoff (qualitative: energy versus design time in man-years for Standard Cell, Module Generation, and Full Custom design)
Semi Custom Design: A Case Study on SIMD Shufflers
2.1 Design Methodology Choices
While designing a low power processor, it is possible to take a pure standard cell based design approach or, at the other extreme, a full custom design approach. When a large design time is available (this depends on the expected volume of the product), as in the case of processors like IBM's Cell [8] or TI's TMS320C64x+ [9], a full custom based design is possible for most parts of the processor. But for designs with a stronger design time constraint, like NXP's processors [10,1] or SODA [3], a standard cell based methodology similar to [11] is followed. These two extremes are shown qualitatively in Figure 1. The gains of a full custom design compared to a standard cell based design have been shown in [12]; they stem from various factors, namely micro-architecture, floor-planning and placement, cell and wire sizing, logic style, etc. It was also shown in [12] that a full custom based design can increase the energy efficiency by up to 10×, depending on the design. Qualitatively, we can see that a full custom based design is often not affordable. Hence, a need clearly exists for an intermediate solution where, with somewhat more design effort than standard cell based design, the energy efficiency can be improved. Various attempts have been made to build environments for module generators [7,13], which provide an improvement in energy efficiency while slightly increasing the design effort.

2.2 Datapath Choice for Case Study
A common technique used in embedded processors to reduce power consumption as well as improve performance is SIMD. A SIMD processor usually needs a shuffle unit which aligns the data required for the operations that follow; it performs the pack/unpack operations required for this alignment of data. The shuffler is one of the most interconnect dominated parts of the datapath of the processor, as bits from one location have to be moved to another. As motivated in Section 1, interconnect energy consumption is bound to become one of the dominant parts of the design. Since the shuffler dominates the interconnect in the datapath, we choose it for this case study. Furthermore, the activation rate of the shuffler is quite high in most kernels which run on these embedded processors [14]. In addition to the above considerations, the precise micro-architecture of the shuffler also makes a difference; this choice is motivated in the next section (Section 3).
3 SIMD Shufflers
A functional unit which can perform shuffle operations, known as a shuffler or permutation unit, is usually implemented as a full crossbar, which provides full flexibility but requires a large amount of interconnect. The area of interconnect is one of the most important aspects that negatively influence power consumption and performance [4]. Therefore, it is desirable to implement a design that reduces the amount of interconnect and at the same time allows the required performance for the target application domain. These requirements can be achieved by customizing the crossbar. Figure 2 shows the external structure (inputs and output) of a shuffle unit. This shuffle unit can be implemented micro-architecturally in many ways: as a crossbar, a partial crossbar, or a Multistage Interconnection Network (MIN) (various types of Banyan networks: Baseline, Omega, Cube, Butterfly). Multistage Interconnection Networks are popular and widely used in large-scale multicomputers and network routers [15,16,17]. Many of them belong to a class of networks that consists of log₂N stages of 2×2 switching elements connecting N input ports to N output ports. These networks have the full-access property: any output is reachable from any input in a single pass through the network. This makes these networks interesting for further analysis.
Fig. 2. Architectural view of the shuffler (Input_1(#Sub-Words) and Input_2(#Sub-Words), steered by an OpCode, produce Output(#SW))
In our previous work [14] we introduced the different types of shuffle patterns commonly used in embedded systems. A set of similar patterns which correspond to a particular application were grouped into a family (F1, F2, etc.). The flexibility of a shuffler can be defined as the possibility of executing a particular family on a given shuffle network. Table 1 shows the flexibility of the different networks over the different shuffle families as described in [14]. For details on the set of permutations that are present in a given family, the reader is referred to [14].

Table 1. Different Shuffle Networks and their Flexibility

  Shuffle              Shuffle Family
  Network     F1          F2       F3    F4    F5    F6
              Interleave  Filters  FFT   GSM   DCT   Broadcast
  Full Xbar   Y           Y        Y     Y     Y     Y
  Cube        Y           Y        Y     Y     Y     Y
  Baseline    N           Y        N     N     Y     Y
  Butterfly   Y           Y        N     N     N     Y
  Omega       Y           Y        Y     N     Y     Y
Fig. 3. Switch used in a MIN network: (a) switch, (b) truth table:

  C0  C1  Out1  Out2
  0   0   In1   In2
  0   1   In1   In1
  1   0   In2   In1
  1   1   In2   In2
Fig. 4. 16 × 16 Cube network (control bits C0–C3, one per stage)
Figure 3 shows the basic unit (a switch) used for constructing a MIN network; its truth table is also shown in Figure 3(b). Each switch has two inputs and two outputs, each of which is as wide as the smallest subword size (4-bit, 8-bit, etc.). The different MIN networks mentioned in Table 1 are composed of different configurations (number of stages and interconnection between the stages) of these switches. The precise description of the different networks is beyond the scope of this paper. Given its high flexibility at a not too high complexity price, the Cube network is the most interesting of the different shuffle micro-architectures. The number of shuffle stages required for the Cube network is low, and the interconnect between each of the stages is not complex. A Cube network is illustrated in Figure 4. Furthermore, the number of bits required to steer the Cube network is low. Because the number of stages in the Cube network is lower than in other networks and the interconnection is quite localized in most stages, the energy consumption of the Cube network is expected to be lower than that of the other networks.
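As an illustration of how such a network routes data, the following Python sketch models the 2×2 switch truth table of Figure 3 and a Cube-style network in which stage k pairs the ports whose indices differ in bit k. This is an assumed wiring for illustration; the function names and control-bit layout are ours, not part of the DPG tool or the paper.

```python
def switch(in1, in2, c0, c1):
    # Truth table from Fig. 3(b): (C0, C1) -> (Out1, Out2)
    table = {(0, 0): (in1, in2),   # straight / pass
             (0, 1): (in1, in1),   # broadcast upper input
             (1, 0): (in2, in1),   # exchange / cross
             (1, 1): (in2, in2)}   # broadcast lower input
    return table[(c0, c1)]

def cube_shuffle(subwords, controls):
    """Route subwords through log2(N) switch stages. Assumed wiring: in
    stage k the two ports whose indices differ only in bit k share one
    switch. controls[k] lists one (c0, c1) pair per switch in stage k."""
    n = len(subwords)
    data = list(subwords)
    for k in range(n.bit_length() - 1):      # log2(n) stages
        nxt, s = data[:], 0
        for i in range(n):
            if (i >> k) & 1:                 # lower port: handled with its pair
                continue
            j = i | (1 << k)                 # partner index differs in bit k
            nxt[i], nxt[j] = switch(data[i], data[j], *controls[k][s])
            s += 1
        data = nxt
    return data
```

With all switches set to pass, the network is the identity; setting one stage to cross swaps the subwords whose indices differ in that stage's bit. The broadcast settings of the switch are what allow a family such as F6 (Broadcast) to be realized.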
4 Datapath Generator (DPG)
The Data-path Generator (DPG) is a software tool from RWTH Aachen that enables a design methodology for the semi-automated physical implementation of parameterizable IP design blocks in ASIC design [18]. The critical IP blocks of a platform can be optimized for power, performance and area using the DPG. Applying the DPG-assisted design methodology to typical datapath dominated IP blocks showed an improvement over standard cell based implementation of up to a factor of 10 with respect to area, power and/or throughput [19,20]. A technology independent structure of the final macro to be built is described in a VHDL syntax, by specifying the nodes and branches of the underlying signal flow graph (SFG). This SFG typically defines the full architecture of the design macro in terms of the IP blocks and the physical routing needed between those blocks. DPG makes use of specific pragmas and parameters to control the implementation of the blocks and the routing between them. For a given IP block (say a shuffler, adder, multiplier or even a bigger block), individual leaf cells are identified first and then designed manually, keeping the context of the leaf cell in mind. Leaf cells may be larger or smaller than a standard cell; their size depends very much on the design. For example, for the shuffler used in this case study, the switch introduced in Figure 3 is chosen as the leaf cell. Alternatively, a bit-sliced version of the switch could be taken as the leaf cell; this is a trade-off one can make to reduce design time while sacrificing energy/performance. The leaf cells can also be parameterized using Cadence PCell parameters [21]. The functionality of these leaf cells appears as a black box to DPG; only the input/output pins are relevant for routing.
Once the different leaf cells of the design are identified, the specification in the SFG allows routing to be performed either by an external router (Cadence IC Craftsman [21]) or by the internal DPG router. The syntax of the SFG follows VHDL, and most VHDL constructs are available for use. Note that no synthesis takes place; only the connectivity of the blocks is specified, and therefore no constructs which would synthesize to cells are allowed. For further details on the SFG specification, the reader is referred to [18]. The SFG can also specify, as pragmas, the relative placement of the leaf cells, the rotation of cells, the distances between cells, etc. This allows a good placement, on which the routing can then work efficiently. In the case of the shuffler, the leaf cell is the switch shown in Figure 3. Given the shuffle unit that needs to be built, say the Cube network, the SFG required for the corresponding network is written. This SFG gives the connectivity between the different switches as shown in Figure 4 and also the relative placement of the switches. Figure 5 shows a layout of the Cube shuffler generated using the DPG tool. On close observation, one can see the different stages of the Cube network: each stage (vertical row of switches) in Figure 4 forms a horizontal row in Figure 5, and in all there are four such rows in Figure 5.
Fig. 5. Layout of the Cube Shuffle Unit obtained using DPG
5 Experimental Setup and Results
In this section, we discuss the specifications of the Cube shuffle unit that was used and the tool flows used for computing energy and ensuring timing, and we analyze the results. The Cube shuffle unit consists of two 64-bit inputs and one 64-bit output. It has four stages, as illustrated in Figure 4. Each subword input is 8 bits wide, and therefore each switch takes in two 8-bit inputs and produces two 8-bit outputs. The target frequency of the design is 200 MHz and the technology used is UMC 130nm. The design was targeted to meet the 200 MHz frequency under the worst-case design corner, and the power estimates are done for the typical-case corner. Figure 6(a) shows the flow that was used for the standard cell based design. As can be seen, this flow ensures that a full design down to layout is made, and both the effects of parasitics (place and route and R+C parasitic extraction) and the effects of activity are taken into account (using gate-level simulation). This flow is part of the industry standard Cadence-TSMC v6 Reference Flow [11]. Figure 6(b) shows the flow that was used for the semi-custom based design. After designing the individual leaf cells and composing them into the SFG manually, the place and route was done using Cadence Virtuoso Custom Router (VCR). The internal router is not suited to perform this routing, as it is oriented towards an abutment style of connections; the connections between the stages of the SIMD shuffler are not abutment oriented, and therefore the internal router of DPG failed to find a solution. After layout, the spice netlist
Fig. 6. Tool Flows Used for the Standard Cell and Semi-Custom Design. (a) Standard cell design flow: VHDL and UMC130 technology library → Design Compiler (synthesis) → SoC Encounter (place+route) → ModelSim (gate-level simulation, with a testbench per shuffle family) → Prime Power (power estimation from gate-level activity and parasitics). (b) DPG based semi-custom flow: design specification → leaf cell creation and SFG creation → place+route (DPG/ICC) → PathMill (timing verification) → HSpice (power estimation, with a testbench per shuffle family).
was extracted with parasitics, and PathMill was used to verify that it meets timing. Finally, for each of the different shuffle families, the netlist was simulated in HSpice to obtain the power estimates. It can be seen that both flows take both activity and parasitics into account. Although the power simulations are at different abstraction levels, the accuracy is almost equal: in the standard cell flow, the parasitics of the wires are taken into account and the exact cell power is pre-characterized in the library; in the semi-custom flow, since a characterized library is not available, spice-level simulation in HSpice does the job.
Fig. 7. Power Comparison between Standard Cell based design and Semi-Custom based design for the 64-bit Cube shuffle network. All power numbers are normalized to the average power consumption in the cells using the Standard Cell based design. (Bars per shuffle family f1 Interleave through f6 Broadcast, plus the average: DPG cell cost, DPG interconnect cost, Std. Cell cell cost, Std. Cell interconnect cost.)
Figure 7 shows the normalized power consumption of the cells in the design and also the interconnect power consumption. All numbers are normalized to the average cell power consumption obtained using the Standard Cell based design. It can readily be seen from Figure 7 that the design is heavily interconnect dominated. Even though the DPG was applied to a component which exhibits poor locality, a power saving of about 30% was achieved. Based on the two design flows and the results, we make the following observations:
1. More optimal cell design: A more optimal cell design is possible because the context in which a (leaf) cell is going to be used is known. Hence the sizing, and even the logic style used for the design, can be chosen wisely. In our case, we ensured that the transistors were sized minimally, and pass-transistor logic was used. This explains the gains in cell power consumption in Figure 7.
2. Placement of cells: Some gains are also expected in the interconnect, as a more “design-aware” placement was done by specifying it in the SFG. This is harder to quantify, but intuitively it is clear that knowledge of the design enables a more efficient placement of cells.
3. Routing between cells: The router proved to be one of the weakest parts of the design flow, at least when regularity is present and a solution is apparent by human observation. In the semi-custom design flow, the deterministic internal router of DPG, which is suited for an abutment type of connection, failed to route, while a clear solution was available even by observation. This forced us to use an external stochastic router (Cadence's Virtuoso Custom Router). While this router finds a solution, the solution is very sub-optimal. The stochastic nature of the router also makes the problem more difficult: when a leaf cell is modified and the routing is done again, the solution obtained is quite different. During design iterations, it is therefore very hard to predict whether a change will alter the routing, and hence the solution itself. Thus, there is a strong need for a deterministic and efficient router to enable efficient module generator oriented design.
4. Effort: In terms of the trade-off of using a semi-custom design compared to standard cell based design, the extra design time required is quite low if there is good knowledge of the design². For the shuffler, the total gains of up to 30% in energy were obtained with about 20% extra effort, given knowledge of the design and the tools. The gains can be expected to grow much more for larger designs, as the standard cell place and route based technique becomes more sub-optimal; on the other hand, the effort required would also increase, as more and more minor details of the design would need to be made available to the designer.
6 Conclusions
In this paper, we compared a datapath generator based flow with a standard cell flow for an interconnect dominated datapath component, namely the shuffle/permutation unit of a processor. We showed that a gain of 30% can be obtained for this unit. We presented the different networks that can be used for the shuffle unit and motivated the choice of the network that was used. We analyzed the reasons for this gain and gave the advantages and weaknesses of both the module generator and the standard cell based techniques.
References

1. Van Berkel, K., Heinle, F., Meuwissen, P., Moerman, K., Weiss, M.: Vector processing as an enabler for software-defined radio in handsets from 3G+WLAN onwards. In: Proc. of Software Defined Radio Technical Conference, November 2004, pp. 125–130 (2004)
2. Rounioja, K., Puusaari, K.: Implementation of an HSDPA receiver with a customized vector processor. In: Proc. of SOC (November 2006)
² This may not always be the case for all designs.
3. Lin, Y., Lee, H., Woh, M., Harel, Y., Mahlke, S., Mudge, T., Chakrabarti, C., Flautner, K.: SODA: A low-power architecture for software radio. In: Proc. of ISCA (2006)
4. De Man, H.: Ambient intelligence: Giga-scale dreams and nano-scale realities. In: Proc. of ISSCC, Keynote Speech (February 2005)
5. Chinnery, D., Keutzer, K.: Closing the Gap Between ASIC and Custom: Tools and Techniques for High-Performance ASIC Design. Springer, Heidelberg (2006)
6. Chinnery, D., Keutzer, K.: Closing the Power Gap Between ASIC and Custom: Tools and Techniques for Low Power Design. Springer, Heidelberg (2006)
7. Wiess, O., Gansen, M., Noll, T.G.: A flexible datapath generator for physical oriented design. In: Proc. of ESSCIRC, September 2001, pp. 408–411 (2001)
8. IBM: The Cell Microprocessor (2005), http://www.research.ibm.com/cell/
9. Texas Instruments, Inc.: TMS320C64x/C64x+ DSP CPU and Instruction Set Reference Guide (May 2006), http://focus.ti.com/docs/apps/catalog/resources/appnote abstract.jhtml?abstractName=spru732b
10. Philips PDSL: CF6 CoolFlux DSP (2004), http://www.coolfluxdsp.com
11. Cadence and TSMC: Cadence-TSMC Reference Flow ver. 6.0 (2005), http://www.cadence.com/datasheets/6159 TSMC RefFlow FS FNL.pdf
12. Chinnery, D., Keutzer, K.: Closing the power gap between ASIC and custom: an ASIC perspective. In: Proc. of DAC, pp. 275–280 (2005)
13. Six, P., Claesen, L., Rabaey, J., De Man, H.: An intelligent module generator environment. In: Proc. of DAC, pp. 730–735 (1986)
14. Raghavan, P., Munaga, S., Rey Ramos, E., Lambrechts, A., Jayapala, M., Catthoor, F., Verkest, D.: On the benefits of customized cross-bar for data-shuffling operation in SIMD ASIP. In: ARCS 2007, LNCS, vol. 4415, Springer, Heidelberg (2007)
15. Padmanabhan, K.: Design and analysis of even-sized binary shuffle-exchange networks for multiprocessors. IEEE Transactions on Parallel and Distributed Systems, 385–397 (1991)
16. McGregor, J.P., Lee, R.B.: Architecture techniques for accelerating subword permutations with repetitions. Trans. on VLSI, 325–335 (2003)
17. Smith, S.D., Siegel, H.J.: An emulator network for SIMD machine interconnect networks. Computers, 232–241 (1979)
18. RWTH Aachen – University of Technology: DPG User Manual Version 2.8 (October 2005), http://www.eecs.rwth-aachen.de/dpg/info.html
19. Gemmeke, T., Gansen, M., Noll, T.G.: Implementation of scalable power and area efficient high-throughput Viterbi decoders. IEEE Journal of Solid-State Circuits 37(7) (July 2002)
20. Gemmeke, T., Gansen, M., Noll, T.G.: Design optimization of low power high performance DSP building blocks. IEEE Journal of Solid-State Circuits 39(7), 1131–1139 (2004)
21. Cadence Inc.: Cadence Virtuoso Custom Design Platform (2006), http://www.cadence.com/products/custom ic/index.aspx
Optimization for Real-Time Systems with Non-convex Power Versus Speed Models

Ani Nahapetian, Foad Dabiri, Miodrag Potkonjak, and Majid Sarrafzadeh

Computer Science Department, University of California, Los Angeles (UCLA) {ani,dabiri,miodrag,majid}@cs.ucla.edu
Abstract. Until now, the great majority of research in low-power systems has assumed a convex power model. Recently, however, due to the confluence of emerging technological and architectural trends, standard convex models have become invalid as specifications of power as a function of execution speed. For example, using a shutdown energy minimization strategy to eliminate leakage power in multiprocessor systems results in a non-convex trade-off between power and speed. Non-convexity renders the majority of previous power management schemes, algorithms, and even basic theorems invalid. For instance, the main premise that, for a constant computation requirement, one should run continuously at a single speed in order to minimize energy consumption no longer holds. We study techniques for energy minimization where the power versus speed curve is non-convex. We first identify and quantify sources of non-convexity. Minimizing energy when the power-speed model is non-convex is an NP-complete problem, even in the canonical and simple case where a single task without dependencies must execute a specified amount of computation in a given amount of time. We address this problem using an approach based on non-linear function minimization and demonstrate that on average the new solution saves at least 40% more energy on industrial processors than techniques that follow the convexity paradigm. We then address common real-time task scenarios where the power-speed model is non-convex. Specifically, we introduce a heuristic for scheduling tasks onto a multiprocessor system with a non-trivial start-up cost and compare its performance to our mixed integer linear programming (MIP) formulation. We experimentally compare our neighbors heuristic with the well-known average rate algorithm and find that it yields a 106% improvement while being only 14% worse than the optimal MIP solution.
1 Introduction

Traditionally, low power research has focused on a power model where the relationship between power consumption and processor speed is convex. Convexity has a number of profound ramifications when energy is minimized using variable voltage strategies. For example, running the processor continuously at the lowest speed possible, while still meeting the task deadline, has been the most advantageous strategy.

N. Azemard and L. Svensson (Eds.): PATMOS 2007, LNCS 4644, pp. 443–452, 2007. © Springer-Verlag Berlin Heidelberg 2007
A. Nahapetian et al.
Also, it is well known that convex objective functions are much more amenable to both heuristic and provably optimal minimization [4]. All dynamic voltage scaling research has essentially been governed by this fact. There exist rapidly emerging application and technology scenarios, however, where the relationship between power and processor speed is not convex, including the following situations: (i) multiprocessor systems with a powering-up power cost; (ii) scaled-down CMOS devices, where the increased impact of leakage power dominates the power consumption (leakage power does not have a convex relationship with processor speed [7]); (iii) systems with simultaneously adaptive Vdd and Vt: energy gains can be made by simultaneously varying Vdd, through dynamic voltage scaling, and Vt, through adaptive body biasing, where Vdd is the supply voltage and Vt is the threshold voltage [17]; lowering the threshold voltage increases the processor speed, but at the expense of increased leakage power, creating a non-convex relationship between power and speed; (iv) finally, there is an important emerging class of systems that is subject to non-convex power minimization: subthreshold ultra low power circuits. Recent work at MIT, the University of Michigan, and Purdue University has characterized the power versus execution speed curve of subthreshold ultra low power circuits as non-convex, using circuit, CAD, and architectural techniques [5][6][14][18][20][23][24][28]. There is a widely held opinion that subthreshold logic will dominate several rapidly growing segments of ICs and of computer and communication architectures. The optimization process under these new conditions will have numerous and profound consequences and will significantly differ from current variable voltage approaches. As we mentioned, given a convex power model, a single, properly chosen speed minimizes the energy consumption.
In the non-convex case, on the other hand, a selection of two or more different speeds minimizes energy consumption. Even piecewise convex curves cannot be handled using a single speed; instead, they are best handled, like other non-convex curves, using two speeds. Therefore, with pending system and technology solutions, non-convex power models will dominate the spectrum of power-constrained systems.

Table 1. Example data points

  Convex Power Model                       Non-convex Power Model
  Speed (million cycles/sec)  Power (W)    Speed (million cycles/sec)  Power (W)
  0                           0            0                           0
  1                           1            1                           0.5
  2                           4            2                           2
  3                           8            3                           2.5
Let us examine the difference between a convex and a non-convex power-speed curve with the following example. Assume we are given a task that requires 20 million cycles of computation in 10 seconds. With a convex power to processor speed curve, as given in Table 1, the optimal speed at which to run the task is 2 million cycles/second for 10 seconds at 4 W. On the other hand, with a non-convex curve, also
given in Table 1, the optimal schedule is to run the task at 3 million cycles/second for 5 seconds and at 1 million cycles/second for 5 seconds. The total energy cost is 15 J (= 12.5 J + 2.5 J), which is less than the 20 J consumed if we had run at 2 million cycles/sec for the entire interval. This example highlights the new paradigm of non-convex power minimization: (1) we can no longer simply run tasks at their slowest possible speed; (2) we cannot assume that a single speed will be used by the optimal solution. Our goal is to develop techniques that, under the general non-convex power model, address scheduling onto a multiprocessor system with a startup cost for each processor. First, we solve the problem of scheduling single tasks onto a multiprocessor system. Then we solve the more complex problem of scheduling multiple tasks, with arrival times, deadlines, and cycles of computation, onto a multiprocessor system with startup costs, using the solution of the first problem as the enabling procedure. The major contributions of this paper are the following. We introduce and discuss the new paradigm where the relation between power and processor speed is non-convex. We solve the fundamental problem of scheduling a single task onto a multiprocessor system by formulating it as non-linear function minimization. We introduce the neighbors heuristic for scheduling tasks onto a multiprocessor system with a significant startup cost. We also give a mixed integer programming (MIP) formulation of the problem. Finally, we demonstrate the significant improvement possible by experimentally comparing the neighbors heuristic, the MIP approach, and the average rate algorithm.
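To make the arithmetic of the example above concrete, the following Python sketch computes the energy of a schedule that splits a task between two discrete speeds from Table 1. The function name and the search over speed pairs are our own illustration, not a procedure from the paper.

```python
def energy_two_speed(power, s_hi, s_lo, cycles, deadline):
    """Energy (J) of running t seconds at s_hi and (deadline - t) at s_lo,
    where t is chosen so that exactly `cycles` Mcycles finish by the deadline.
    `power` maps speed (Mcycles/s) -> power draw (W)."""
    if s_hi == s_lo:
        if s_hi * deadline != cycles:
            return None                   # single speed cannot hit the demand
        return deadline * power[s_hi]
    t = (cycles - s_lo * deadline) / (s_hi - s_lo)
    if not 0 <= t <= deadline:
        return None                       # this speed pair cannot meet the demand
    return t * power[s_hi] + (deadline - t) * power[s_lo]

# Non-convex model of Table 1: 20 Mcycles due in 10 s
power_nc = {0: 0.0, 1: 0.5, 2: 2.0, 3: 2.5}
single = energy_two_speed(power_nc, 2, 2, 20, 10)   # single speed: 20 J
split = energy_two_speed(power_nc, 3, 1, 20, 10)    # two speeds: 15 J
```

With the convex model of Table 1, the same search over all speed pairs returns a single-speed schedule as optimal, consistent with the classical convexity result.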
2 Power Model and Related Work

We consider the case where a startup cost is incurred for transitioning a processor from the sleep state to the on state. We use the startup cost calculated by Jejurikar et al. [13], who estimate the cost of changing the state of the processor to be 483 μJ, based on several assumptions, including ones about the cache state. This cost dominates whenever it is incurred. Aside from the startup cost, we incur a cost for keeping a processor on, for which we use the following formula:

E_on = P_on / f

where P_on is taken to be 0.1 W, similar to [13], and f is the frequency, which we take to be 300 MHz, the minimum frequency of the Transmeta Crusoe processor. Our power model is based on real processors, specifically the Transmeta Crusoe processor Model TM5500 [22] and the AMD-K6-IIIE+500 ANZ processor [1]. Although we use actual processor values for our experimentation and for our problem abstraction, we do make a few simplifying assumptions: the cost of transitioning between different voltage values is zero, and the transition time is negligible. These assumptions are common in the recent literature [11][13][15][26]. Energy minimization techniques can be classified into two broad groups. The first, called dynamic power management (DPM), aims to shut down processors when they are idle. The second, dynamic voltage scaling (DVS), dynamically varies the voltage
supplied to the processor, to provide just-in-time execution of tasks. Benini et al. provide a survey in [2][3]. Irani et al. combine the two methods, DPM and DVS, for systems with DVS and multiple power modes [11]. A large portion of the research on power-aware scheduling algorithms has focused on uniprocessors. The scheduling and assignment of tasks onto multiprocessors has generally been solved with heuristics that take a two-phase approach: they first assign jobs to the resources, and then allocate the voltages for the processors, assuming the job assignment determined in the first phase [29][30]. Yu and Prasanna [27] examine the two problems of assigning jobs to resources and determining the voltage levels in a joint manner; they formulate the problem as an integer linear program (ILP) and use a linear relaxation heuristic (LP-relaxation) to solve it. The majority of the related work in DVS follows a convex power model [11][15][21][26][27]; in general, it assumes a quadratic relationship between power and processor speed. Jejurikar et al. [13] consider the effect of leakage power, but they do not address the problem of scheduling in the non-convex region of the power curve.
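As a quick sanity check on the power model above, a few lines of Python show the breakeven idle time beyond which shutting down beats staying on. The constant names are ours; the 483 μJ startup cost and 0.1 W on-state power are the values quoted from [13].

```python
P_ON = 0.1          # W, power to keep a processor on (value from [13])
E_STARTUP = 483e-6  # J, sleep -> on transition cost (value from [13])
F_MIN = 300e6       # Hz, minimum Transmeta Crusoe frequency

# E_on = P_on / f: energy per cycle when the processor is simply kept on
E_on_per_cycle = P_ON / F_MIN

def shutdown_saves_energy(idle_seconds):
    """Shutting down pays off once the idle-state energy exceeds the startup cost."""
    return P_ON * idle_seconds > E_STARTUP

breakeven_s = E_STARTUP / P_ON   # about 4.83 ms of idle time
```

So for this model, shutting down is only worthwhile for idle intervals longer than roughly 4.83 ms; for shorter gaps, the 483 μJ startup cost dominates.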
3 Fundamental Problem

Let us consider a constrained yet fundamental case of the scheduling problem, to gain intuition and to create a powerful procedure for the final steps of energy optimization in more complex scenarios. The problem is the following: given a single required average speed at which to run the processor for a period of time, determine the speeds at which to run each segment of the time interval such that the total energy is minimized, given a non-convex power-speed relationship. The problem is NP-complete. We prove the claim by reduction from the well-known knapsack problem [8]. The knapsack problem is reduced in polynomial time to the fundamental problem by mapping the weight of the objects to the energy of a chosen speed for a given period of time, and the value of the objects to the speed chosen. The resulting problem is a discretized instance of the fundamental problem, given below.

Instance: A finite set P, for each p ∈ P an energy e(p) ∈ Z+ and a speed s(p) ∈ Z+, and positive integers C and E.
Question: Is there a subset P′ ⊆ P such that Σ_{p∈P′} e(p) ≤ E and Σ_{p∈P′} s(p) ≥ C?
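The discretized decision problem above is exactly a 0/1 knapsack, so a small dynamic program can decide an instance. The following sketch is our own illustration, with made-up item values; it checks whether a subset with total energy at most E and total speed at least C exists.

```python
# Deciding the discretized fundamental problem via standard 0/1-knapsack
# dynamic programming.  Item values below are invented for illustration.

def feasible(items, E, C):
    """items: list of (energy, speed) pairs with integer energies.
    Returns True iff some subset has total energy <= E and total speed >= C."""
    # best[e] = maximum total speed achievable with total energy <= e
    best = [0] * (E + 1)
    for energy, speed in items:
        for e in range(E, energy - 1, -1):   # reverse scan: each item used once
            best[e] = max(best[e], best[e - energy] + speed)
    return best[E] >= C

items = [(3, 4), (2, 3), (4, 6)]   # (e(p), s(p)) pairs, illustrative
print(feasible(items, E=5, C=7))   # True: pick (3, 4) and (2, 3)
print(feasible(items, E=5, C=8))   # False: speed 8 needs more than energy 5
```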
To solve the fundamental problem for multiple processors, we start with multiple energy versus speed curves that correspond to using different numbers of processors, and we juxtapose the curves to obtain a new energy versus speed curve. Then, given an average speed and a time interval, nonlinear function minimization is used to determine the most energy-efficient selection of the number of processors and their execution speeds. To create the new function, we examine k possible energy versus
Optimization for Real-Time Systems with Non-Convex Power Versus Speed Models
[Figure 1 plots energy/C_eff (vertical axis) against total speed in MHz (horizontal axis) for one-, two-, and three-processor systems.]
Fig. 1. Energy vs. Speed Curves for 3 Types of Multiprocessor Systems
processor speed mappings, where k is the number of processors. The k curves are then combined, and the minimal assignment and schedule is found. Figure 1 graphs three different curves, where each curve represents the energy consumption (divided by the effective capacitance) of running at the given speed, and each curve assumes a certain number of processors available for use. The first curve graphs energy versus speed for the simple case of a single processor, using the data values from the slides associated with [13]. The second curve graphs the case of two processors, each running at the same speed: the total speed is twice that of the uniprocessor case, because there are two processors, and the energy is twice the energy of a single processor running at half the total speed, i.e., at the speed of each of the two processors. This analysis is carried out for k processors, or in the case of Figure 1 for three processors. To obtain the solution to the fundamental problem, a nonlinear minimization can then be used to determine the number of processors and the speeds that consume the minimum amount of energy. The NLP formulation has been omitted due to space limitations. As an illustration, we ran a nonlinear minimizer, specifically Powell's (Direction Set) method in multidimensions [19], on a four-processor system. The results, shown in Figure 2, indicate that the NLP solution consistently improves, by 45.8% on average, on the solution obtained with traditional convex scheduling techniques. Note that we assumed the processors incur no cost for starting up or changing their voltage values. The results were obtained using 1000 random restarts of the Powell method, and may improve somewhat with a larger number of restarts. Once we have the solution to the fundamental problem, we can use it to simplify our overall problem.
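The curve-combination step can be sketched as follows, with a made-up non-convex energy curve and a coarse enumeration standing in for the Powell minimizer used in the paper; all constants are illustrative, not the paper's.

```python
# Sketch of the curve-combination step: given an illustrative non-convex
# per-processor energy-per-time curve e(s), the k-processor curve is
# E_k(S) = k * e(S / k) when the total speed S is split evenly.  We then pick
# the cheapest processor count for a required total speed.  The curve shape
# and constants are invented; the paper uses Powell's method instead of this
# simple enumeration.

def energy_per_time(s):
    """Energy drawn per unit time at speed s (MHz); toy non-convex model:
    a constant leakage floor plus a super-linear dynamic term."""
    return 0.5 + (s / 1000.0) ** 3

def best_config(total_speed, max_procs=4):
    """Cheapest processor count for a required total speed."""
    best = None
    for k in range(1, max_procs + 1):
        cost = k * energy_per_time(total_speed / k)
        if best is None or cost < best[1]:
            best = (k, cost)
    return best

for total in (300, 900, 1800):
    k, cost = best_config(total)
    print(f"total {total} MHz -> {k} processor(s), energy/time {cost:.3f}")
```

Even this toy model reproduces the qualitative effect: at low required speed the leakage floor makes a single processor cheapest, while at high required speed spreading the load over more processors wins.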
As long as we can determine the total number of cycles that need to be executed in an interval, the solution given in this section can be applied to determine the most energy-efficient execution of those cycles.
[Figure 2 is a bar chart of energy/C_eff (vertical axis) against cycles in millions, from 0.5 to 3.78 (horizontal axis), comparing the traditional technique's energy with the NLP's energy.]
Fig. 2. Energy Consumption using NLP Solution
4 Scheduling for Non-convex Power Model

In this section, we highlight a provably optimal ILP-based approach for power minimization in systems subject to a non-convex power-speed relationship, as well as a new heuristic approach for the same problem. Before addressing energy minimization with non-convex power-speed models, we formally define the problem of minimizing energy for a set of tasks with known computational requirements, arrival times, and deadlines. Let us consider the problem of scheduling tasks onto multiprocessors with a startup cost. The problem requires determining the number of processors to use and the speed at which to execute the cycles at any given time, such that the energy consumption is minimized. More formally, we are given as input a set of n tasks, each characterized by the following: a_i, the arrival time before which task i cannot be executed; d_i, the deadline by which task i must be completed; and c_i, the cycles of computation associated with task i. We assume a hard real-time scenario where tasks are periodic and the deadline of each period equals its worst-case execution time (WCET), as is common in the literature [11][13][15][16]. We are also given a set of homogeneous processors whose applied voltages can be varied dynamically. We assume homogeneous processors for the sake of clarity, but the algorithms and software tools can easily be extended to handle heterogeneous processors as well. We consider processors that can be transitioned into sleep mode to save energy when they are idle, and we assume that the processors have a startup cost associated with them. We formulate the problem as a mixed integer programming (MIP) problem and solve it optimally using an MIP solver such as CPLEX [10]. Although MIP can in the worst case take an exponential amount of time, this MIP formulation is often fast for instances of practical interest.
For D total time units and N tasks, there are D·N continuous variables and 2N integer variables. There are six types of constraints, with
a total of 3N + 3D constraints. There are, on average, 3D/N + (N+5)/D variables per constraint. The MIP formulation has been omitted due to space limitations.
For the problem formulated above, we have also developed the neighbors heuristic. Unlike the MIP, the heuristic does not always produce the optimal solution; however, it is useful for large problem instances where the MIP may be prohibitive. It also allows the addition of nonlinear constraints to the problem, such as allowing the processors to go to sleep at any time instance. Intuitively, the heuristic evens out the load on neighboring intervals, reducing the number of changes in the number of processors used. It also attempts to run the active processors at the voltage values determined to be advantageous by the solution of the fundamental problem. The pseudocode for the heuristic is given below. The heuristic initially assigns tasks to execute at their average rate, given by average rate_i = c_i / (d_i − a_i). The solution is iteratively improved by evening out the load on neighboring intervals, reallocating the cycles of overlapping tasks. We iteratively choose the two intervals with the largest density difference that share a common task; if there are multiple tasks to choose from, we move the task that is less likely to be redistributed in a following iteration. Finally, we optimize each interval using the solution of the fundamental problem. The intuition behind the neighbors heuristic is that the less the required average speed changes, the better the speed assignments can be. Although the neighbors heuristic will not be useful for every speed-energy tradeoff, it is very useful for the common case in multiprocessor systems where, roughly, the larger the computation requirement, the less efficient the speed-energy tradeoff is, especially if we are dealing with a piecewise convex model.
Neighbors Heuristic(tasks)
  for all tasks
    Assign the task to its execution intervals at its average rate
  for a certain number of iterations
    for each interval i and its neighboring interval j
      Choose the task belonging to both intervals that has the earliest deadline
      Move the cycles of execution of the chosen task from the more dense
        interval to the less dense interval, until the intervals are even
        or until all of the common task's cycles have been moved
  for each interval
    Optimize the voltage settings using the fundamental problem's solution
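The pseudocode above can be sketched in Python as follows. This is a simplified illustration under our own data layout (unit time intervals, tasks as (arrival, deadline, cycles) triples), not the authors' implementation; in particular, the shared task is picked by smallest index as a stand-in for the earliest-deadline rule, and the final per-interval voltage optimization is omitted.

```python
# Simplified sketch of the neighbors heuristic.  Time is discretized into unit
# intervals; each task is (arrival, deadline, cycles).  Cycles are spread at
# the average rate c_i / (d_i - a_i), then repeatedly shifted from the denser
# of two adjacent intervals to the sparser one via a shared task.

def average_rate_schedule(tasks, horizon):
    # load[t][i] = cycles of task i placed in interval [t, t+1)
    load = [dict() for _ in range(horizon)]
    for i, (a, d, c) in enumerate(tasks):
        rate = c / (d - a)
        for t in range(a, d):
            load[t][i] = rate
    return load

def even_out(load, iterations=100):
    for _ in range(iterations):
        for t in range(len(load) - 1):
            if sum(load[t].values()) >= sum(load[t + 1].values()):
                dense, sparse = t, t + 1
            else:
                dense, sparse = t + 1, t
            shared = set(load[dense]) & set(load[sparse])
            if not shared:
                continue
            i = min(shared)  # stand-in for the earliest-deadline choice
            gap = (sum(load[dense].values()) - sum(load[sparse].values())) / 2
            move = min(gap, load[dense][i])   # cannot move more than the task has
            load[dense][i] -= move
            load[sparse][i] += move           # task i is live in both intervals
    return load

tasks = [(0, 2, 10), (1, 3, 2)]               # (arrival, deadline, cycles)
load = even_out(average_rate_schedule(tasks, horizon=3))
print([round(sum(d.values()), 2) for d in load])   # [5.0, 5.0, 2.0]
```

Because cycles only move between intervals where a task was originally live, arrival times and deadlines are respected by construction.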
5 Experimental Results

We evaluated the effectiveness of the neighbors heuristic experimentally against randomly generated task sets, which are standard in the related work [13][15][21][27]. The task values were based on the work of Kwon et al. [15]: arrival times and deadlines were chosen in the range of 1 to 20 seconds, and cycles of execution in the range of 1 to 400 million cycles. For comparison purposes, we
also implemented the following heuristics. (1) The average rate algorithm [26] assigns tasks to their execution intervals to be executed at their average rate; the number of processors is based on the number of cycles to be executed at each time instance. (2) The contention-based algorithm is an enhancement of the average rate algorithm. It moves tasks from high-contention intervals to low-contention intervals, where an interval with low contention is characterized as having a small number of live tasks. The contention-based heuristic chooses the interval with the least contention to which to move task cycles; among the tasks that are live in that interval, the task with the largest density is moved. (3) The flattening heuristic aims to flatten the scheduled execution of cycles by eliminating peaks and valleys in the schedule, to decrease the number of startups. As in the contention-based and neighbors heuristics, the tasks are initially scheduled according to the average rate algorithm. The solution is then iteratively improved by choosing the two intervals with the largest density difference that share a common task; the common task is redistributed so that the intervals are even in terms of density. Of course, we are limited by the number of cycles belonging to the chosen task, and thus cannot move more cycles than the task requires. As the process of flattening is done iteratively, the schedule will eventually flatten out if intervals have multiple tasks in common. If multiple tasks lie in both intervals, the task with the greatest density is chosen for redistribution.
[Figure 3 is a bar chart of energy normalized to optimal (vertical axis) against the number of tasks, from 10 to 100 (horizontal axis), for the neighbors (40 iterations), flattening, contention-based, and average rate heuristics.]
Fig. 3. Energy vs. Number of Tasks for Various Approaches
Figure 3 presents the results obtained from our experimentation. The bar chart displays the energy consumption, normalized to the optimal value, for 10 different task sets of varying sizes for each of the four heuristics: the average rate heuristic, the contention-based heuristic, the flattening heuristic, and the neighbors heuristic. The neighbors heuristic dramatically outperforms all of the other heuristics: it is on average 106% better than the well-known average rate heuristic and 84% better than the flattening heuristic, yet only 14% worse than the optimal solution. For the task sets with 70, 80, 90, and 100 tasks, its results equal the optimal values.
6 Conclusion

In this paper, we addressed energy minimization for non-convex power-speed models, which are poised to dominate future applications and technologies. We first addressed the fundamental problem of scheduling a computation requirement onto a multiprocessor system for a given interval. We then addressed the more complicated problem of scheduling tasks for systems with a large startup cost. First, we formulated the problem as a mixed integer programming problem and showed how it can be solved very efficiently. Second, we introduced the neighbors heuristic, which evens out the density of neighboring intervals initially produced by the average rate schedule. We carried out extensive experimentation to quantify the quality of our approaches.
References

[1] AMD PowerNow Technology Platform Design Guide for Embedded Processors. AMD Document number 24267a (December 2000)
[2] Benini, L., De Micheli, G.: System-Level Power: Optimization and Tools. In: International Symposium on Low Power Electronics and Design (ISLPED) (1999)
[3] Benini, L., Bogliolo, A., De Micheli, G.: A Survey of Design Techniques for System-Level Dynamic Power Management. IEEE Transactions on VLSI Systems 8(3) (2000)
[4] Boyd, S., Vandenberghe, L.: Convex Optimization. Cambridge University Press, Cambridge (2004)
[5] Calhoun, B.H., Chandrakasan, A.P.: Characterizing and Modeling Minimum Energy Operation for Subthreshold Circuits. In: International Symposium on Low Power Electronics and Design (ISLPED) (2004)
[6] Calhoun, B.H., Wang, A., Chandrakasan, A.P.: Device Sizing for Minimum Energy Operation in Subthreshold Circuits. In: Custom Integrated Circuits Conference (CICC) (2004)
[7] Chandrakasan, A.P., Sheng, S., Brodersen, R.W.: Low-Power CMOS Digital Design. IEEE Journal of Solid-State Circuits 27(4), 473–484 (1992)
[8] Garey, M.R., Johnson, D.S.: Computers and Intractability: A Guide to the Theory of NP-Completeness. W.H. Freeman and Company, New York (1979)
[9] Hong, I., Kirovski, D., Qu, G., Potkonjak, M., Srivastava, M.B.: Power Optimization of Variable-Voltage Core-Based Systems. In: Design Automation Conference (DAC) (1998)
[10] ILOG CPLEX. http://www.ilog.com/products/cplex/
[11] Irani, S., Gupta, R.: Algorithms for Power Savings. In: Symposium on Discrete Algorithms (2003)
[12] Ishihara, T., Yasuura, H.: Voltage Scheduling Problem for Dynamically Variable Voltage Processors. In: International Symposium on Low Power Electronics and Design (ISLPED) (1998)
[13] Jejurikar, R., Pereira, C., Gupta, R.: Leakage Aware Dynamic Voltage Scaling for Real-Time Embedded Systems. In: Design Automation Conference (DAC) (2004)
[14] Kao, J., Narendra, S., Chandrakasan, A.: Subthreshold Leakage Modeling and Reduction Techniques. In: International Conference on Computer Aided Design (ICCAD) (2002)
[15] Kwon, W.C., Kim, T.: Optimal Voltage Allocation Techniques for Dynamically Variable Voltage Processors. In: Design Automation Conference (DAC) (2003)
[16] Liu, C.L., Layland, J.W.: Scheduling Algorithms for Multiprogramming in a Hard Real-Time Environment. Journal of the ACM 20(1), 46–61 (1973)
[17] Martin, S.M., Flautner, K., Mudge, T., Blaauw, D.: Combined Dynamic Voltage Scaling and Adaptive Body Biasing for Lower Power Microprocessors under Dynamic Workloads. In: International Conference on Computer Aided Design (ICCAD) (2002)
[18] Nazhandali, L., Zhai, B., Olson, J., Reeves, A., Minuth, M.: Energy Optimization of Subthreshold-Voltage Sensor Network Processors. In: International Symposium on Computer Architecture (ISCA) (2005)
[19] Press, W.H., Teukolsky, S.A., Vetterling, W.T., Flannery, B.P.: Numerical Recipes in C: The Art of Scientific Computing. Cambridge University Press, New York, NY (1994)
[20] Soeleman, H., Roy, K., Paul, B.C.: Robust Subthreshold Logic for Ultra-Low Power Operation. IEEE Transactions on VLSI Systems 9(1) (2001)
[21] Shin, Y., Choi, K.: Power Conscious Fixed Priority Scheduling for Hard Real-Time Systems. In: Design Automation Conference (DAC) (1999)
[22] Transmeta Crusoe Data Sheet. https://www.transmeta.com
[23] Wang, A., Chandrakasan, A.P.: A 180mV FFT Processor Using Subthreshold Circuit Techniques. In: International Solid-State Circuits Conference (ISSCC), Digest of Technical Papers (2004)
[24] Wang, A., Chandrakasan, A.P., Kosonocky, S.V.: Optimal Supply and Threshold Scaling for Subthreshold CMOS Circuits. In: International Symposium on VLSI (ISVLSI) (2002)
[25] Wolf, W., Potkonjak, M.: A Methodology and Algorithms for the Design of Hard Real-Time Multi-Tasking ASICs. ACM Transactions on Design Automation of Electronic Systems (TODAES) 4(4), 430–459 (1999)
[26] Yao, F., Demers, A., Shenker, S.: A Scheduling Model for Reduced CPU Energy. In: IEEE Annual Symposium on Foundations of Computer Science (FOCS) (1995)
[27] Yu, Y., Prasanna, V.: Resource Allocation for Independent Real-Time Tasks in Heterogeneous Systems for Energy Minimization. In: International Conference on Computer Aided Design (ICCAD) (2001)
[28] Zhai, B., Blaauw, D., Sylvester, D., Flautner, K.: Theoretical and Practical Limits of Dynamic Voltage Scaling. In: Design Automation Conference (DAC) (2004)
Zhang, Y., Hu, X., Chen, D.Z.: Task Scheduling and Voltage Selection for Energy Minimization. In: Design Automation Conference (DAC) (2002)
[29] Zhu, D., Melhem, R., Childers, B.: Scheduling with Dynamic Voltage/Speed Adjustment Using Slack Reclamation in Multi-processor Real-Time Systems. In: IEEE Real-Time Systems Symposium (2001)
[30] Hong, I., Kirovski, D., Qu, G., Potkonjak, M., Srivastava, M.B.: Power Optimization of Variable-Voltage Core-Based Systems. In: Design Automation Conference (DAC) (1998)
Triple-Threshold Static Power Minimization in High-Level Synthesis of VLSI CMOS

Harry I.A. Chen, Edward K.W. Loo, James B. Kuo*, and Marek J. Syrzycki

School of Engineering Science, Simon Fraser University, Burnaby, BC, Canada V5A 1S5
[email protected]
* Department of Electrical Engineering, National Taiwan University, Taipei, Taiwan
Abstract. In this paper we present a new static power minimization technique exploiting the use of triple-threshold CMOS standard cell libraries in 90nm technology. Using static timing analysis, we determine the timing requirements of cells and place cells with low and standard threshold voltages in the critical paths. Cells with a high threshold voltage are placed in non-critical paths to minimize the static power with no overall timing degradation. From the timing and power analysis, we determine the optimal placement of high, standard and low threshold voltage cells. Using three different threshold voltages, an optimized triple-threshold 16-bit multiplier design featured 90% less static power compared to the pure low-threshold design and 54% less static power compared to the dual-threshold design.
1 Introduction

As CMOS technology advances, MOS transistor sizes continue to decrease while power dissipation continues to increase. As transistor sizes decrease, designers can lower supply voltages to reduce dynamic power dissipation; but in order to achieve high speed, the threshold voltage must be reduced accordingly, which results in increased static power. For nanometre CMOS technologies, manufacturers usually supply standard cell libraries built of MOS transistors with high, standard, and low threshold voltages (HVT, SVT, and LVT). HVT gates (or cells) have slow switching speeds but dissipate low static power; LVT gates have fast switching speeds but dissipate much larger static power. With multi-threshold CMOS (MTCMOS) technology, developing techniques to obtain the optimal placement of transistors with different threshold voltages is an active research topic.
2 Related Work

The total dissipated power of CMOS circuits consists of dynamic and static power. Static power becomes more dominant as CMOS technology progresses towards deep sub-micron nodes [1]. Typical attempts to minimize static power rely on the use of HVT transistors in non-timing-critical paths and LVT transistors in paths where a
N. Azemard and L. Svensson (Eds.): PATMOS 2007, LNCS 4644, pp. 453–462, 2007. © Springer-Verlag Berlin Heidelberg 2007
H.I.A. Chen et al.
high speed is necessary. Dual-threshold-voltage techniques have been proposed [2]-[6] to determine the optimal placement of HVT and LVT cells. The previously reported approaches [2] and [3] minimize static power during high-level synthesis, while [4]-[6] minimize static power at the transistor level. Transistor-level designs typically use SPICE modelling to determine accurate timing waveforms and current flows for each transistor. Since large designs usually consist of repetitive gates such as inverters and NAND gates, the circuit simulation time can be reduced significantly if simulations are performed at the gate level. For high-level synthesis, device manufacturers supply standard sets of cell libraries with corresponding timing and power dissipation information, and commercial tools from vendors such as Cadence™ and Synopsys™ are available to synthesize the circuits. Circuits can be designed by specifying the exact placement of gates in a gate-level RTL language such as Verilog, or by describing circuit functionality in a high-level language such as VHDL. High-level synthesis requires less simulation time than transistor-level modelling, so developing appropriate methodologies for optimizing the placement of MTCMOS gates during high-level synthesis is necessary. As proposed in [2], a circuit is first synthesized using only HVT cells. Each HVT cell in the critical path is temporarily replaced by its LVT counterpart to record the resulting reduction in delay; the cell that produces the largest delay reduction is changed permanently from HVT to LVT. This method is effective for small designs, but may be inefficient for large designs, since timing analysis has to be repeated for a large number of cells in the critical paths. In [3], only the highest-cost cells in the critical paths are changed from HVT to LVT, where the cost is defined as the number of critical paths passing through the cell.
Although this approach may not result in the lowest static power dissipation, we have found designs produced with this technique to be close to optimal. The approach described in [3] is faster to perform than [2] and is thus more suitable for large designs. Triple-threshold techniques have also been used to minimize static power [7][8]. Previous attempts use HVT transistors as sleep transistors for logic blocks; within each logic block, the circuits are optimized using the dual-threshold technique, placing SVT transistors in non-critical paths and LVT transistors in critical paths. Since SVT and LVT cells dissipate much larger static power, this method is effective only when the circuit is in sleep mode; otherwise, the static power dissipation is larger than for the same circuit optimized with the dual-threshold method using HVT and LVT gates. Moreover, the previous work requires a customized design for the sleep transistors, which cannot easily be automated in a high-level synthesis flow. To obtain a more fine-grained placement in high-level synthesis, we propose a new triple-threshold technique that places HVT cells within logic blocks, in addition to SVT and LVT cells. We presented the results of optimizing 20 ITC'99 benchmark circuits in [9]. In this paper, we describe the methodology in greater detail and present the results of optimizing a 16-bit Wallace-tree multiplier as well as 8 benchmark circuits from the 1995 High-Level Synthesis Design Repository (Verilog code adapted from http://www.ece.vt.edu/mhsiao/hlsyn.html). HVT cells are placed in non-critical paths while SVT and LVT cells are placed in critical paths, to achieve the fastest clock speed with the lowest static power dissipation.
3 Comparison of Standard Cell Libraries

For our experiments, Synopsys Design Compiler™ is used to synthesize a 16-bit Wallace-tree multiplier, which contains 1123 cells in total. Static timing analysis is performed with Synopsys PrimeTime™ to extract the longest delay paths and obtain the clock speed information; Design Compiler™ is used to report the circuit's static power dissipation. The multiplier is synthesized three times: with the HVT library only, with the SVT library only, and with the LVT library only. Based on the simulation results, we determine the relative performance and static power dissipation of the three libraries, as shown in Table 1.

Table 1. Performance comparison of a 16-bit Wallace-tree multiplier synthesized using HVT, SVT and LVT libraries
Cell Library            | HVT     | SVT     | LVT
NMOS VT [V]             | 0.32    | 0.24    | 0.18
PMOS VT [V]             | -0.36   | -0.29   | -0.24
Longest Path Delay [ns] | 3.4     | 2.6     | 2.1
Max. Clock Speed [MHz]  | 294.1   | 384.6   | 476.2
Static Power [μW]       | 0.75506 | 14.4600 | 270.7120
The 90nm cell libraries are designed such that an HVT gate occupies the same area as an SVT or LVT gate; thus all three designs have the same die area. In Table 1, the longest delay path corresponds directly to the clock period, and indicates the fastest clock speed that can be obtained with each cell library. The multiplier synthesized with SVT cells consumes approximately 20 times more static power than the HVT multiplier, and the LVT multiplier consumes 20 times more static power than the SVT multiplier. It becomes obvious that replacing all cells from the slower library with cells from the faster library leads to a significant static power increase that cannot be afforded in large SOC designs.

3.1 Static Power Calculations

The static power dissipation values for all simulated circuits in this paper are obtained from Design Compiler™. For each cell, the 90nm standard cell library contains a static power value for each possible input state. Design Compiler™ calculates a cell's static power by multiplying the static power value of each state by the percentage of the total simulation time spent in that state. The total static power of a circuit is then calculated by summing the static power of each cell in the circuit.

3.2 HSPICE Simulation

To verify the static power dissipation reported by Design Compiler™, we select a 2-input XOR gate in the multiplier design with different preceding gates, as shown in Fig. 1. The XOR gate A has the same probability of outputting either a 1 or a 0. The
Table 2. Input state probability for XOR gate C

Input A | Input B | Input State Probability
0       | 0       | 9/32
0       | 1       | 7/32
1       | 0       | 9/32
1       | 1       | 7/32
Fig. 1. An example 2-input XOR gate with different input probabilities from preceding gates
double-AND-OR gate B outputs 0 with a probability of 9/16 and 1 with a probability of 7/16. The probability of each input state of XOR gate C is calculated in Table 2. We use HSPICE simulation to obtain the static power of a 2-input XOR gate for each input combination using HVT, SVT, and LVT transistors. We calculate the weighted static power of the XOR gate and compare it with the value reported by Design Compiler™, as shown in Table 3. The standard cell libraries provided to us do not contain internal circuit information, so our HSPICE simulation is a best attempt to replicate the cell structure. Nevertheless, the simulation results confirm a 20-times difference in static power between the different threshold-voltage cell libraries. This verifies that the static power values reported by Design Compiler™ are valid for our experiments.

Table 3. Comparison of HSPICE simulated static power with the reported value from Design Compiler™
                 Static Power (nW)
Input          | HVT    | SVT   | LVT
00             | 0.6423 | 9.938 | 333.2
01             | 0.8229 | 23.27 | 324.9
10             | 1.087  | 28.89 | 479.4
11             | 0.3668 | 11.42 | 273.3
Weighted Total | 0.7466 | 18.51 | 359.4
DC Reported    | 0.8845 | 17.97 | 335.6
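The weighted-static-power bookkeeping described above is easy to reproduce. The sketch below derives the Table 2 state probabilities from gate A (P(1) = 1/2) and gate B (P(1) = 7/16) and recomputes the HVT "Weighted Total" row of Table 3; only the quoted table values come from the paper, the code itself is our illustration.

```python
# Recomputing the input-state probabilities of XOR gate C and the weighted
# static power of its HVT implementation, as described in Sections 3.1-3.2.

from fractions import Fraction

p_a1 = Fraction(1, 2)    # P(input A = 1): XOR gate A is unbiased
p_b1 = Fraction(7, 16)   # P(input B = 1): double-AND-OR gate B

# Each input state's probability is the product of the input probabilities.
state_prob = {
    (0, 0): (1 - p_a1) * (1 - p_b1),
    (0, 1): (1 - p_a1) * p_b1,
    (1, 0): p_a1 * (1 - p_b1),
    (1, 1): p_a1 * p_b1,
}
print(sorted(state_prob.values()))   # 7/32 twice, 9/32 twice, as in Table 2

# Per-state HVT static power from the HSPICE column of Table 3 (nW).
hvt_power_nw = {(0, 0): 0.6423, (0, 1): 0.8229, (1, 0): 1.087, (1, 1): 0.3668}
weighted = sum(float(state_prob[s]) * p for s, p in hvt_power_nw.items())
print(round(weighted, 4))            # 0.7466, matching the "Weighted Total" row
```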
4 Triple-Threshold Static Power Minimization Methodology

The large difference in static power between the HVT, SVT and LVT cell libraries suggests that a low-static-power design should use as many HVT cells and as few SVT and
LVT cells as possible, while satisfying the timing performance specifications. The objective is then to place HVT cells in non-timing-critical paths and SVT cells in timing-critical paths; LVT cells are used only if replacing HVT by SVT cells does not meet the target timing performance of the synthesized circuit. The proposed methodology involves the following steps (Fig. 2):

1. Initial synthesis: Synthesize the multiplier RTL code using the HVT standard cell library.

2. Timing improvement using SVT cells: Perform timing analysis to select the timing-critical paths of the synthesized HVT netlist. A timing-critical path is a path whose total timing delay from input to output exceeds the specified timing requirement; such paths are detected automatically in Synopsys PrimeTime™ after specifying the clock periods. Select the cell with the highest cost, where cost is defined as the number of critical paths passing through the cell, and replace it with an SVT cell. Repeat the timing analysis, and swap the highest-cost cell in the newly determined critical paths with its SVT counterpart. This step is repeated until either the timing requirement has been met or all the highest-cost cells in the critical paths have been replaced with SVT cells.

3. Timing improvement using LVT cells: If the design still violates timing requirements and all cells in the critical paths have been replaced by SVT cells, perform timing analysis to select the critical paths and replace each highest-cost cell with its LVT counterpart. Repeat the timing analysis and cell replacement until there is no timing violation in the design.
Fig. 2. Triple-threshold power minimization flow
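Steps 2-3 above amount to a greedy swap loop. The following toy sketch is our own model with invented delay and power numbers (not the 90nm library data): it upgrades the highest-cost cell, with cost defined as in step 2 as the number of violating paths through the cell, until every path meets the clock period.

```python
# Toy sketch of the iterative cell-swap loop (steps 2-3 above).  Each cell has
# a delay and static power per threshold class; a path is a list of cell ids.
# All numbers are invented for illustration.

DELAY = {"HVT": 1.0, "SVT": 0.75, "LVT": 0.5}    # ns per cell (toy values)
POWER = {"HVT": 1.0, "SVT": 20.0, "LVT": 400.0}  # nW per cell (toy values)
UPGRADE = {"HVT": "SVT", "SVT": "LVT", "LVT": None}

def optimize(paths, n_cells, clock_ns):
    vt = {c: "HVT" for c in range(n_cells)}      # start all-HVT (step 1)
    while True:
        violating = [p for p in paths
                     if sum(DELAY[vt[c]] for c in p) > clock_ns]
        if not violating:
            break
        # cost of a cell = number of violating paths passing through it
        cost = {}
        for p in violating:
            for c in p:
                if UPGRADE[vt[c]]:               # skip cells already at LVT
                    cost[c] = cost.get(c, 0) + 1
        if not cost:
            raise RuntimeError("timing cannot be met")
        worst = max(cost, key=cost.get)
        vt[worst] = UPGRADE[vt[worst]]           # HVT -> SVT -> LVT
    return vt, sum(POWER[v] for v in vt.values())

paths = [[0, 1, 2], [1, 2, 3], [4, 5]]           # cell ids along each path
vt, static = optimize(paths, n_cells=6, clock_ns=2.75)
print(vt, static)   # only the shared critical cell is upgraded to SVT
```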
5 Simulation

We use Synopsys PrimeTime™ to perform static timing analysis and select critical paths for the 16-bit multiplier. We set the clock period to 2.1ns (corresponding
to a 476MHz frequency), which is the fastest clock speed attainable with the LVT library. We first examine the path slacks of the HVT multiplier, which can only run at a clock period of 3.4ns. Fig. 3 shows that 4790 paths have a timing slack between 1.1ns and -1.3ns. A negative slack indicates that the path delay exceeds the specified clock period: a path with -1.3ns of slack under the 2.1ns timing constraint has a path delay of 3.4ns.
Fig. 3. Number of timing violated paths in the HVT multiplier
Using our new methodology, we generate an optimized triple-threshold 16-bit multiplier design consisting of 833 HVT cells, 196 SVT cells, and 94 LVT cells. We compare its static power with the LVT multiplier and with a dual-threshold (HVT+LVT) multiplier adapted from [3]. The LVT multiplier consists of 1123 LVT cells, and the dual-threshold multiplier consists of 897 HVT cells and 226 LVT cells. The results are shown in Table 4.

Table 4. Static power comparison of the LVT, dual-threshold, and triple-threshold multipliers
Multiplier Design | # HVT | # SVT | # LVT | % LVT | Static Power (µW)
LVT               | 0     | 0     | 1123  | 100%  | 270.71
Dual-Threshold    | 897   | 0     | 226   | 20%   | 59.89
Triple-Threshold  | 833   | 196   | 94    | 8.4%  | 27.40
The triple-threshold multiplier results in 90% less static power compared to the LVT multiplier, and 54% less static power compared to the dual-threshold multiplier. The triple-threshold multiplier uses 91.6% fewer LVT cells compared to the LVT multiplier and 58.4% fewer LVT cells compared to the dual-threshold multiplier. Since LVT cells dominantly contribute to the static power, the drastic reduction in static power corresponds well to the reduction in the use of LVT cells.
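The quoted savings follow directly from the static-power column of Table 4:

```python
# Verifying the percentage savings claimed above from Table 4's values.

lvt, dual, triple = 270.71, 59.89, 27.40   # static power in uW

saving_vs_lvt = (1 - triple / lvt) * 100
saving_vs_dual = (1 - triple / dual) * 100
print(f"{saving_vs_lvt:.0f}% less than LVT, "
      f"{saving_vs_dual:.0f}% less than dual-threshold")
# → 90% less than LVT, 54% less than dual-threshold
```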
We select one path from input IN17 to output P31 and analyze the saving in static power. Fig. 4 shows the path in (a) the dual-threshold optimized circuit and (b) the triple-threshold optimized circuit. The dual-threshold path in (a) contains 9 HVT cells (shaded with black stripes) and 23 LVT cells (shown in red outline), while the triple-threshold path in (b) contains 7 HVT cells, 8 SVT cells (shaded with blue dots), and 17 LVT cells. Comparing the two paths, nine cells have different threshold voltages; the delay and static power of each of these nine cells in both paths are presented in Table 5.
Fig. 4. Timing path from IN17 to P31 for (a) the dual-threshold multiplier and (b) the triple-threshold multiplier
H.I.A. Chen et al.

Table 5. Timing delay and static power of cells in a dual-Vt and tri-Vt path
Dual-Vt Path                                     Tri-Vt Path
Cell      Delay (ns)   Static Power (nW)         Cell      Delay (ns)   Static Power (nW)
FD2QLVT   0.10         478.8098                  FD2QSVT   0.13         24.34833
EOHVT     0.10         0.855517                  EOSVT     0.08         17.25491
AO2NHVT   0.09         0.545157                  AO2NSVT   0.07         10.10962
EOLVT     0.05         325.1559                  EOSVT     0.06         17.5569
AO2NLVT   0.06         188.8806                  AO2NSVT   0.07         10.09687
EOLVT     0.05         327.8204                  EOHVT     0.08         0.898775
AO2NLVT   0.06         189.5429                  AO2NSVT   0.07         10.13421
EOHVT     0.10         0.925426                  EOSVT     0.08         17.96629
EOLVT     0.02         335.4581                  EOSVT     0.03         18.09893
Total     0.63         1847.994                  Total     0.68         126.4648
For the 9 cells listed above, the triple-threshold path has 6 LVT cells removed and 8 SVT cells added compared to the dual-threshold path. As a result, the triple-threshold path delay has increased by 0.05ns. The total path delays for the dual-threshold and triple-threshold paths are 2.00ns and 2.05ns, respectively, which still meet the 2.1ns timing constraint. Note that the circuit requires 0.05ns of setup time; the triple-threshold path therefore has zero slack after the optimization. For comparison, the same path in the HVT multiplier has a total path delay of 2.79ns. This -0.69ns slack is why a large number of LVT cells are used in the optimized dual-threshold and triple-threshold paths. The total static power of the 9 cells has been reduced by 93.2% in the triple-threshold design, demonstrating the effectiveness of the triple-threshold technique in reducing static power.

Fig. 5 shows the static power dissipation of the multiplier designs optimized under different clock period constraints. As the clock period increases (i.e., at lower clock frequencies), the paths have more slack, so fewer LVT cells are used and more static power is saved. For the 16-bit Wallace tree multiplier, our triple-threshold methodology achieves a minimum saving of 90% in static power at the fastest clock frequency. For most applications the target clock frequency need not be the maximum possible, and the static power can be reduced almost completely by minimizing the use of LVT cells in the design.

We optimize eight circuits from the 1995 High-Level Synthesis Design Repository using the dual-threshold and our triple-threshold techniques on a Sun Ultra 45 workstation. The results are shown in Table 6 and Table 7. The optimization run time for the triple-threshold technique is up to twice as long as for the dual-threshold technique on smaller designs, but is comparable for larger designs.
For the eight circuits, the triple-threshold technique achieves an average saving of 89.3% in static power compared to the LVT designs, and 41.3% compared to the dual-threshold technique. In all cases, designs optimized with the triple-threshold technique have the lowest static power dissipation.
Fig. 5. Static power dissipation at different clock constraints

Table 6. Static Power Savings in Dual-Vt and Tri-Vt Optimized Circuits
Circuit   Clock Period (ns)   Static Power (µW)                   % Saving vs. LVT   Tri-Vt % Saving
                              LVT        Dual-Vt    Tri-Vt        Dual     Tri       vs. Dual-Vt
am2910    1.38                160.0487   14.1046    7.1088        91.2     95.6      49.6
barcode   0.81                57.2719    8.6383     4.6698        84.9     91.8      45.9
dhrc      2.82                334.9089   42.9717    25.0963       87.2     92.5      41.6
diffeq    5.00                1478.4     337.0328   181.4167      77.2     87.7      46.2
gcd       2.80                94.3967    27.0290    21.0709       71.4     77.7      22.0
kalman    2.19                485.6676   42.3875    25.9261       91.3     94.7      38.8
lru       0.99                101.4512   17.5919    10.8594       82.7     89.3      38.3
prawn     1.12                133.3684   38.0656    19.9524       71.5     85.0      47.6
Average                                                           82.2     89.3      41.3
Table 7. Number of Gates and Optimization Run Time

Circuit   Total #    # of Gates (HVT / SVT / LVT)           Run Time (h:mm:ss)
          of Gates   Dual-Vt          Tri-Vt                Dual-Vt    Tri-Vt
am2910    601        541 / 0 / 60     513 / 65 / 23         0:00:34    0:00:59
barcode   204        168 / 0 / 36     147 / 40 / 17         0:00:13    0:00:27
dhrc      1378       1214 / 0 / 164   1136 / 171 / 71       0:02:23    0:02:37
diffeq    5465       4558 / 0 / 907   4218 / 836 / 411      0:52:40    1:19:17
gcd       451        347 / 0 / 104    304 / 68 / 79         0:02:17    0:04:25
kalman    2275       2097 / 0 / 178   2035 / 139 / 101      0:03:50    0:07:47
lru       476        387 / 0 / 89     375 / 44 / 57         0:00:44    0:01:15
prawn     733        538 / 0 / 195    429 / 203 / 101       0:03:09    0:06:25
6 Conclusion

We have presented a novel triple-threshold static power reduction technique that is suitable for RTL synthesis of large digital designs. The methodology has been tested on a 16-bit Wallace tree multiplier design, showing the best static power optimization compared with other methodologies reported in the literature. The technique allows designs to operate at the same fastest clock speed as pure LVT designs, while saving approximately 90% of the static power dissipation compared to the LVT designs. The saving in static power increases as the clock frequency requirement decreases; this trade-off between speed and power consumption allows designers to choose the best circuit configuration for each set of requirements. Using eight benchmark circuits, the triple-threshold technique is also shown to achieve an average saving of 89.3% in static power compared to LVT designs and 41.3% compared to dual-threshold optimized designs. The optimization run time for the triple-threshold technique is comparable to that of the dual-threshold technique for large designs. In all circuits tested, the triple-threshold technique produced the designs with the lowest static power dissipation.
References

1. Kim, N.S., Austin, T., Baauw, D., Mudge, T., Flautner, K., Hu, J.S., Irwin, M.J., Kandemir, M., Narayanan, V.: Leakage current: Moore's Law meets static power. IEEE Computer 36, 68–75 (2003)
2. Shin, K., Kim, T.: Leakage power minimization in arithmetic circuits. Electronics Letters 40, 415–417 (2004)
3. Chung, B., Kuo, J.B.: Gate-level dual-threshold static power optimization methodology (GDSPOM) for designing high-speed low-power SOC applications using 90nm MTCMOS technology. In: ISCAS, pp. 3650–3653 (2006)
4. Wei, L., Chen, Z., Roy, K., Johnson, M.C., Ye, Y., De, V.K.: Design and optimization of dual-threshold circuits for low-voltage low-power applications. IEEE Trans. on VLSI Systems 7, 16–24 (1999)
5. Anis, M., Areibi, S., Elmasry, M.: Design and optimization of multithreshold CMOS (MTCMOS) circuits. IEEE Trans. on Computer-Aided Design of Integrated Circuits and Systems 22, 1324–1342 (2003)
6. Srivastav, M., Rao, S.S.S.P., Bhatnagar, H.: Power reduction technique using multi-Vt libraries. In: System-on-Chip for Real-Time Applications, pp. 363–367. Springer, Heidelberg (2005)
7. Fujii, K., Douseki, T., Harada, M.: A sub-1 V triple-threshold CMOS/SIMOX circuit for active power reduction. In: ISSCC Digest of Technical Papers, pp. 190–191 (1998)
8. Fujii, K., Douseki, T.: A 0.5-V, 3-mW, 54x54-b multiplier with a triple-Vth CMOS/SIMOX circuit scheme. In: IEEE International SOI Conference, pp. 73–74 (1999)
9. Chen, H.I.A., Loo, E.K.W., Kuo, J.B., Syrzycki, M.J.: Triple-Threshold Static Power Minimization Technique in High-Level Synthesis for Designing High-Speed Low-Power SOC Applications Using 90nm MTCMOS Technology. In: Canadian Conference on Electrical and Computer Engineering (2007)
A Fast and Accurate Power Estimation Methodology for QDI Asynchronous Circuits Behnam Ghavami, Mahtab Niknahad, Mehrdad Najibi, and Hossein Pedram Computer Engineering Department, Amirkabir University of Technology (Tehran Polytechnic), 424 Hafez Ave, Tehran 15785, Iran {ghavami,niknahad,najibi,pedram}@ce.aut.ac.ir
Abstract. In this paper, we present a new efficient methodology for power estimation of a well-known family of asynchronous circuits, QDI circuits, at the pre-synthesis level. Power estimation at high level is performed by simulating an intermediate format of the design, which consists of concurrent processes represented in CSP-Verilog. The numbers of Read and Write accesses on the ports of these concurrent processes are counted during simulation, with the conditional and computational portions of the processes analyzed; these counts are the basis of our estimation methodology. To verify the accuracy of the presented method, we applied it to a Reed-Solomon decoder as a benchmark. The results show at most 15% deviation from the power measured by SPICE, while the simulation is faster by a factor of 7 compared to a gate-level transition-counting methodology.
1 Introduction

Nowadays, the complexity of digital systems makes time to market more critical, and designers have to rely on early decisions that require high-precision estimation of the system at higher levels of abstraction. It is therefore necessary to raise the level of abstraction and move toward high-level evaluation. Increasing chip density and shrinking feature sizes in VLSI circuits have made power consumption one of the main challenges in digital design. Power estimation methods have thus been developed to show the designer clearly where and how power is consumed in a circuit. Power estimation can be done at various levels of abstraction, and there is a trade-off between the accuracy of the estimation and the level of detail at which the circuit is analyzed or simulated: in general, the more detailed the model, the slower the simulation. We concentrate on high-level power estimation of asynchronous QDI circuits, because low power consumption is one of the main advantages of asynchronous circuits (among the other benefits of clock-less design [1]). As Fig. 1 shows, our scheme estimates the power at high level by converting the behavioral description of a circuit to an intermediate format that captures the synthesized features of the circuit. The method used here to prepare the intermediate format is very similar to the data-driven decomposition method [6], transforming

N. Azemard and L. Svensson (Eds.): PATMOS 2007, LNCS 4644, pp. 463–473, 2007. © Springer-Verlag Berlin Heidelberg 2007
the high-level description into concurrent processes that communicate with each other through input and output ports [2]. Simulation is then performed on the intermediate format, with conditional and computational analysis, and the numbers of read and write occurrences on the ports are counted. The results indicate a correlation between the number of reads and writes observed during simulation and the actual power usage of the circuit.
Fig. 1. High level power estimation using an intermediate format
The next section reviews related work on power estimation of asynchronous systems. Section 3 describes QDI asynchronous circuits, the Caltech synthesis method in brief, and the Persia synthesis tool [6] used to generate the intermediate format. Section 4 elaborates the proposed methodology for power estimation. Section 5 describes the results of applying this method to a Reed-Solomon error detector as a benchmark, and finally Section 6 concludes the paper.
2 Related Work

In recent years, many methods have been developed to estimate the power consumption of asynchronous circuits at various levels of abstraction. Transistor-level simulation for precise calculation of power consumption is the traditional method for estimating and analyzing circuit power; it is usually done with analog simulation tools such as SPICE. Analyzing power at the transistor level is laborious and the simulation time is very long. Another major drawback of this technique is that the designer needs to manually create and apply stimulus files for the circuit inputs. This is workable for synchronous circuits, because the inputs change in relation to a periodic clock pulse. Unfortunately, in asynchronous designs, where handshaking rather than a clock pulse governs the interactions, the generation of stimulus test cases is hard. To solve this problem, an alternative method has been developed that simulates transistor-level circuits in a VHDL testbench, providing the transistor-level circuit inputs based on the circuit outputs. This method uses the Mentor Graphics Advance MS tool [7]. However, the simulation time for this method is also extremely long.
Previously, Petri-net based abstraction using reachability state traversal and discrete-time Markov chains was employed to estimate the power consumption of self-timed circuits [11]. The Markov chain model is converted to a linear equation in [10] to determine the average energy per external signal transition; this is done at both high and low levels of abstraction. Another method has used the Petrify tool [13] to synthesize signal transition graph descriptions into Petri-net models, which are analyzed using a matrix representation instead of Markov chains [14]. Power estimation at the post-synthesis level is described in [12], which uses fan-in transition counting at the standard cells. Most of the methods described above share the problem of working on the post-synthesis result of the design, so they need a lot of simulation and are time-consuming. The presented work focuses on a high level of abstraction, while preserving the detailed characteristics of the design in our intermediate format.
3 QDI Asynchronous Synthesis Method

Asynchronous circuits represent a class of circuits that are not controlled by a global clock but rely on exchanging local request and acknowledge signals for synchronization. In fact, an asynchronous circuit is composed of individual modules that communicate with each other by means of point-to-point communication channels. An asynchronous circuit is called delay-insensitive if it preserves its functionality independent of the delays of gates and wires. Quasi-delay-insensitive (QDI) circuits are like delay-insensitive ones with one weak timing constraint: isochronic forks. In an isochronic fork, the difference between the delays through the branches must be less than the minimum gate delay [3]. Among the various methods of asynchronous design, the QDI method [3] is capable of better performance and low power consumption. This method is based on the QDI timing model. It breaks the high-level description of the design into modules that are small enough to be synthesized to fine-grain predefined templates; this process is called Data Driven Decomposition [4]. At present, most QDI circuits are designed using PCHB (Pre-Charge logic Half-Buffer) and PCFB (Pre-Charge logic Full-Buffer) templates [5]. Our synthesis tool uses PCFBs as its predefined templates. The PCFB circuit includes five main sections: the output computation network, the input validity check, the output validity check, the input acknowledge logic, and the internal enable signal generator. The encoding used by the template for validity checking is dual-rail encoding [1]. In dual rail, a data channel contains a valid data item (token) when exactly one of its 2 wires is high. When both wires are low, the channel contains no valid data and is said to be neutral. The template performs 4-phase handshaking on both input and output channels.
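The dual-rail validity rule described above (exactly one of the two wires high encodes a token, both low is neutral) can be sketched as follows; the rail-to-value convention and the names are our own illustration, not the papers' notation:

```python
# Dual-rail channel states: a data bit travels on two wires. Exactly one
# wire high encodes a valid token, both low is "neutral", and both high
# never occurs in correct QDI operation. We assume wire0 carries the
# logical-0 rail and wire1 the logical-1 rail (an illustrative convention).
def dual_rail_state(wire0: int, wire1: int) -> str:
    if wire0 == 0 and wire1 == 0:
        return "neutral"            # no token on the channel
    if wire0 ^ wire1:
        return "valid-0" if wire0 else "valid-1"
    return "illegal"                # both rails high: not a QDI state

print(dual_rail_state(0, 0))  # neutral
print(dual_rail_state(1, 0))  # valid-0
print(dual_rail_state(0, 1))  # valid-1
```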
4-phase handshaking means that the sender puts (writes) the information on its output ports and then activates the acknowledge signal to inform the receiver. When the receiver gets the acknowledge signal, it picks up (reads) the information from its input ports and informs the sender. It can be inferred that the concepts of read and write are hidden in the handshaking portion of the templates. Handshaking is a major source of the energy consumption of such a system [12]. Considering this fact, the number of reads and
writes in a higher-level description of the design, such as the intermediate format described here, has a direct correlation with the power consumption of the system. Not only the number of reads and writes should be considered, but also the computational section of the design and the conditions under which a read or write may be done, because these two factors are additional sources of power consumption and ignoring them reduces the precision of the estimation. This problem is addressed in this paper by accurate analysis of the computational and conditional portions of the intermediate format.

3.2 Persia Synthesis Tool

Persia is an asynchronous synthesis toolset developed for automatic synthesis of QDI asynchronous circuits with adequate support for GALS1 systems [4][8]. The structure of Persia is based on the design flow shown in Fig. 2, which can be considered as four individual portions: QDI synthesis, GALS synthesis, layout synthesis, and simulation at various levels. As shown in Fig. 2, the QDI synthesis flow takes the high-level description of the circuit as its input. Then, using the decomposition method, the specification of the circuit is broken into a number of parts that are mappable to PCFB templates. These templates read the input values and write the output values in each working cycle. The decomposed circuit serves as our intermediate format and is used as the input of the simulation tool, which counts the number of reads and writes in the system. The high-level description language used in Persia is Verilog-CSP. Verilog-CSP refers to the CSP specification used in high-level description of asynchronous circuits [6], which is converted to Verilog using a library of macros and PLIs. The handshaking on the channels is simply implemented as blocking READ and WRITE macros
Fig. 2. The synthesis flow of Persia [8] and the simulation tool in relation to each other

1 Globally-Asynchronous Locally-Synchronous.
in Verilog, where executing a READ on a port of a channel blocks until a WRITE executes on the port at the other side of that channel [9].
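The blocking READ/WRITE behavior described above can be mimicked with a toy channel object that merely counts port accesses, which is all the estimation method needs. This is our own illustrative model, not the Persia/Verilog-CSP implementation:

```python
# A toy stand-in for the blocking READ/WRITE channel macros: every completed
# rendezvous is one write on the sender side and one read on the receiver
# side, and the estimator only needs these counts.
class Channel:
    def __init__(self, name):
        self.name = name
        self.reads = 0
        self.writes = 0
        self._pending = None

    def write(self, token):
        self.writes += 1          # sender puts a token on the channel
        self._pending = token

    def read(self):
        # In the real macros a READ blocks until a WRITE occurs; here we
        # simply require the token to be present.
        assert self._pending is not None, "READ blocks until a WRITE occurs"
        self.reads += 1           # receiver consumes the token
        token, self._pending = self._pending, None
        return token

ch = Channel("c0")
for v in range(4):
    ch.write(v)
    ch.read()
print(ch.reads + ch.writes)  # 8 port accesses counted for power estimation
```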
4 High Level Power Estimation

In CMOS circuits, the dominant factors in power dissipation are the dynamic and short-circuit power. The dynamic power is consumed in the charging and discharging of the capacitive loads and can be calculated with the following equation:
p_dyn = (1/2) · C · V_dd² · f_s      (1)
where C is the load capacitance, V_dd is the supply voltage, and f_s is the switching frequency of the circuit. High-level power estimators generally calculate f_s and take the values of V_dd and C as inputs. In asynchronous designs, the elimination of the clock pulse reduces the power consumption of the system to the local interactions of the processes, so the switching frequency can be estimated from the local handshakings. Due to the dual-rail nature of the PCFB templates, the short-circuit power consumption is removed and the data dependency of the switching frequencies is reduced.

4.1 High-Level Simulation for Power Estimation

Previously, for power estimation of synchronous designs at high levels of abstraction, intermediate formats that include the synthesis characteristics of the design have been used [15]. For the asynchronous designs used here, applying an intermediate format to high-level estimation is likewise useful: it leads to more accurate and faster estimation of the power at high level. To convert the high-level description to our intermediate format, we must separate the different independent components of the behavioral description. In the first step, as shown in Fig. 3, arithmetic (computational) expressions are extracted from the main code. The connectivity between the main module and these separated component circuits is realized by communication channels that use reads and writes. At this step, as pointed out earlier, the behavioral models of the computing circuits, including READ and WRITE on ports, are developed in Verilog-CSP. Taking advantage of these models, the circuit is isolated from its computational section to reach a faster simulation. In the second step, the high-level design description is broken into smaller modules. This breaking is done based on the decomposition method presented in [4].
First, all of the variables in the design are checked to be in single-assignment form, and then, in the projection phase, each single-assignment variable makes its own process. This yields the proposed intermediate format, in which all modules except the computational section are adaptable to PCFB templates and are thus used in estimating the synthesis characteristics of the design. Each of these small modules is described in the Verilog-CSP language in Persia. At this step, the circuit can also be
Fig. 3. Separating the arithmetic section from the main high-level design, enabling a simpler conversion to the intermediate format
Fig. 4. Process of converting the high-level design to the intermediate format
simulated. Fig. 4 shows a high-level program, including its conditions and computations, converted to the intermediate form. As noted, Reads and Writes are the main indicators of power consumption in the high-level description of our circuits because of their handshaking nature, so counting them during simulation leads to a simple power estimation method. Simulation at this level is data dependent because of the conditions in the intermediate description, so our simulation tool uses a condition-analyzing method to count the real number of Reads and Writes that occur. Condition analyzing builds a probability tree for the conditions in the intermediate format (Fig. 5). It assigns a probability label (based on data-dependency analysis) to each branch of the conditions and then, based on these probabilities during simulation, computes a weight for the Reads and Writes that exist in the branches. This process was done for 12 different test cases; the resulting counts of Reads and Writes are shown in Table 1.
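The branch-probability weighting just described can be sketched as a walk over a small tree; the tree shape and probabilities below are invented for illustration, whereas the paper derives them from data-dependency analysis:

```python
# Each node holds the number of reads/writes executed at that point plus a
# list of conditional branches, each with a probability. The expected R/W
# count is the probability-weighted sum over the whole tree.
def expected_rw(node, prob=1.0):
    """node = (rw_count_here, [(branch_prob, child_node), ...])"""
    count, branches = node
    total = prob * count
    for p_branch, child in branches:
        total += expected_rw(child, prob * p_branch)
    return total

# One unconditional R/W, then a condition taken with P=0.75 (3 R/Ws inside)
# vs. P=0.25 (1 R/W inside).
tree = (1, [(0.75, (3, [])), (0.25, (1, []))])
print(expected_rw(tree))  # 1 + 0.75*3 + 0.25*1 = 3.5
```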
Fig. 5. Probability trees resulting from conditional analyzing on the intermediate format

Table 1. Test-cases and their number of reads/writes before analyzing their computational portion

Test-Case     Num of R/W    Test-Case     Num of R/W
Buffer-4      20            DiffEq4       270
Buffer-8      32            DEMUX2        431
Hamming-Enc   37            GF-Adder      302
Buffer-16     64            DEMUX4        793
Adder8bit     37            Booth Mult    290
Hamming-Dec   208           GFInv         1247
The resulting numbers of Reads and Writes are shown in Fig. 6, compared with the power computed at the transistor level using SPICE. As shown, the relationship between the reads and writes that occur during simulation and the true power consumption of these test cases is close to a linear equation.
Fig. 6. The semi-linear relation between the number of reads and writes and power consumption before the computational analysis (GF-Inverter and Booth Multiplier derange the linear correlation)
A closer look at Fig. 6 shows that the test cases that have a computational section before their writes, such as GF-Adder and Booth Multiplier, derange the linear equation. The computational section of the design is therefore important when estimating the power. To include the computational section's power
Fig. 7. The generic NANDs network resulting from a computational component
consumption in counting the reads and writes, we use a unit model for all of the computations: we convert the computations to generic NANDs and, as shown in Fig. 7, use the number of connections between the NANDs constituting the computational section to form a weight for the writes done on ports at the end of the computations:

N_reads/writes = N_reads/writes(normal) + N_writes(computational)      (2)
In the above equation, the normal number of reads and writes ignores the computational portion of the description, while the computational term accounts for this portion and its effect (the number of connections in the generic NANDs described earlier) on the corresponding writes. With these estimates for the conditional and computational sections, as shown in Fig. 8, the numbers of reads and writes in the intermediate format approach their accurate values: after the computational section is taken into account, the plot trends to a linear equation.
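Equation (2) can be sketched as follows, assuming, as the text suggests, that each computational write is weighted by the connection count of its generic-NAND network; the concrete numbers are illustrative, not from the paper:

```python
# Applying equation (2): the plain R/W count is augmented by a weight for
# each write fed by a computational block, taken here as the number of
# connections in its generic-NAND mapping.
def weighted_rw(normal_rw, computational_writes):
    """computational_writes: list of NAND-network connection counts, one
    per write that is preceded by a computation."""
    return normal_rw + sum(computational_writes)

# e.g. 120 plain reads/writes plus two computed writes whose NAND networks
# have 15 and 25 internal connections:
print(weighted_rw(120, [15, 25]))  # 160
```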
0.06
y = 0.0043x - 0.0036
0.05
P o w er(m icroW )
Booth Multiplier
0.04 0.03 GF Adder
0.02 0.01 Num of R/Ws in Test-Cases
0 20 32.01 37 64.01 122
208 270
280
302
431
793 1247
Fig. 8. The linear relation between the num of reads and writes and power consumption after the computational analyzing using generic NANDs
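The fitted line reported in Fig. 8 can be used to turn an R/W count into a power estimate; since the displayed coefficients are rounded, this only approximates the Table 2 values:

```python
# Converting a weighted R/W count to estimated power with the linear fit
# shown in Fig. 8 (y = 0.0043x - 0.0036, power in microwatts). The printed
# coefficients are rounded, so this reproduces the Table 2 numbers only
# approximately.
def estimate_power_uW(num_rw, slope=0.0043, intercept=-0.0036):
    return slope * num_rw + intercept

# ~4.52 uW for the Syndrome block's 1051 reads/writes (Table 2 reports 4.4189)
print(round(estimate_power_uW(1051), 2))
```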
The overall flow of the high-level power estimation is shown in Fig. 9. Persia (the asynchronous design tool) prepares the intermediate format, and the simulation tool counts the number of reads and writes. As described earlier, data dependencies in the absence of conditions are reduced in dual-rail circuits, so separating the computational section and propagating the conditional probabilities keep our simulation time very short.
Fig. 9. The detailed flow of the high-level power estimation methodology
5 Experimental Results

The high-level power estimation was applied to the well-known Reed-Solomon error detector benchmark. First, we obtained the high-level description in the intermediate format from the asynchronous synthesis tool. Then we simulated the description using the proposed simulation tool, which contains both the conditional and computational analyses, to find the number of reads and writes. The simulation time was less than 2 seconds, while the transition-count method at gate level [12] needed more than 14 seconds in a similar computing environment. SPICE simulation provides the reference power measurement for the benchmark. Table 2 shows the results for the different portions of the Reed-Solomon circuit. The number of reads and writes was converted to power consumption using the proposed linear relationship. The results are within 15% accuracy, except for RiBM, which contains a large computational part that accounts for more inaccuracy. Because the relationship is so close to linear, we have been able to obtain very good power estimates, as demonstrated by the benchmark. It is noteworthy that applying the method to the Reed-Solomon circuit shows a much better match than the results obtained from running only the micro-benchmarks shown earlier.
Table 2. Number of Reads/Writes and the power estimated by our methodology and the real power usage measured by SPICE simulation

Module        Read/Write   High-Level Power Estimated (µW)   Power by SPICE (µW)
Syndrome      1051         4.4189                            3.9314
ChienForney   849          3.6525                            3.5064
RiBM          1353         5.8197                            4.9557
ReedSolomon   3395         14.4981                           12.2704
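The relative error against SPICE follows directly from the Table 2 numbers; a quick check (values copied from the table):

```python
# (high-level estimate, SPICE reference), both in microwatts, from Table 2.
results = {
    "Syndrome":    (4.4189, 3.9314),
    "ChienForney": (3.6525, 3.5064),
    "RiBM":        (5.8197, 4.9557),
    "ReedSolomon": (14.4981, 12.2704),
}

for name, (est, spice) in results.items():
    err = 100.0 * (est - spice) / spice
    print(f"{name}: {err:.1f}% overestimate")
```

This yields roughly 12.4%, 4.2%, 17.4% and 18.2% overestimates, with RiBM (and the full decoder, which contains it) as the outliers relative to the 15% figure quoted above.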
6 Conclusions

We presented a method for power estimation of QDI asynchronous circuits at the pre-synthesis level. We designed a simulation tool that accepts the circuit description in the intermediate format and counts the number of conditional and computational reads and writes. This number is transformed by a linear relationship into an estimate of the power consumption. We tested our method on the Reed-Solomon error detector benchmark, and the results showed accurate power estimation and lower estimation time compared to estimation methods at lower levels. Further, a linear correlation between the high-level and synthesized-level estimates was found. As future work, the method could be extended to estimate the static power usage of the system.
References

1. Sparso, J., Furber, S.: Principles of Asynchronous Circuit Design – A System Perspective. Kluwer Academic Publishers, Dordrecht (2002)
2. Wong, C.G., Martin, A.J.: High-Level Synthesis of Asynchronous Systems by Data Driven Decomposition. In: Proc. of 40th DAC, Anaheim, CA, USA (June 2003)
3. Martin, A.J.: Synthesis of Asynchronous VLSI Circuits. Caltech, CS-TR-93-28 (1991)
4. Wong, C.G., Martin, A.J.: Data-Driven Process Decomposition for the Synthesis of Asynchronous Circuits. In: Proc. ICECS (2001)
5. Lines, A.M.: Pipelined Asynchronous Circuits. MSc Thesis, California Institute of Technology, June 1995 (revised 1998)
6. Martin, A.J.: Programming in VLSI: From Communicating Processes to Delay-Insensitive Circuits. In: Hoare, C.A.R. (ed.) Developments in Concurrency and Communication, UT Year of Programming Series, Addison-Wesley, London (1990)
7. Singh, A., Smith, S.C.: Using a VHDL Testbench for Transistor-Level Simulation and Energy Calculation. In: The 2005 International Conference on Computer Design (2005)
8. Persia site: http://www.async.ir/persia/persia.php
9. Seifhashemi, A., Pedram, H.: Verilog HDL, Powered by PLI: a Suitable Framework for Describing and Modeling Asynchronous Circuits at All Levels of Abstraction. In: Proc. of 40th DAC, Anaheim, CA, USA (June 2003)
10. Penzes, P.I., Martin, A.J.: An Energy Estimation Method for Asynchronous Circuits with Application to an Asynchronous Microprocessor. In: DATE Conference, Le Palais des Congres, Paris, France (2002)
11. Kudva, P., Akella, V.: A Technique for Estimating Power in Asynchronous Circuits. In: Proc. 1st International Symposium on Advanced Research in Asynchronous Circuits and Systems (ASYNC), Salt Lake City, Utah, pp. 166–175. IEEE Computer Society Press, Los Alamitos (1994)
12. Salehi, M., Saleh, K., Kalantari, H., Naderi, M., Pedram, H.: High-Level Energy Estimation of Template-Based QDI Asynchronous Circuits Based on Transition Counting. In: ICM 2004, Tunis, Tunisia (2004)
13. Cortadella, J., Kondratyev, A., Kishinevsky, M., Lavagno, L., Yakovlev, A.V.: Petrify: A Tool for Manipulating Concurrent Specifications and Synthesis of Asynchronous Systems. In: DCIS'96, Barcelona, pp. 205–210 (November 1996)
14. Beerel, P.A., Hsieh, C.-T., Wadekar, S.: Estimation of Energy Consumption in Speed-Independent Control Circuits. IEEE Transactions on CAD, 672–680 (June 1996)
15. Lee, I., Kim, H., Yang, P., Yoo, S., Chung, E.Y., Choi, K.-M., Kong, J.-T., Eo, S.-K.: PowerViP: SoC Power Estimation Framework at Transaction Level. In: Design Automation, Asia and South Pacific Conference (January 2006)
Subthreshold Leakage Modeling and Estimation of General CMOS Complex Gates

Paulo F. Butzen 1, André I. Reis 2, Chris H. Kim 3, and Renato P. Ribas 1

1 Instituto de Informática – UFRGS, Porto Alegre, Brazil, {pbutzen,rpribas}@inf.ufrgs.br
2 Nangate Inc., Menlo Park, CA, USA, [email protected]
3 EECS – University of Minnesota, MN, USA, [email protected]
Abstract. A new subthreshold leakage model is proposed in order to improve static power estimation in general CMOS complex gates. Series-parallel transistor arrangements with more than two levels of logic depth, as well as non-series-parallel off-switch networks, are covered by this analytical model. The occurrence of on-switches in off-networks, also ignored by previous works, is considered in the proposed analysis. The model has been validated through electrical simulations, taking into account transistor sizing, operating temperature, supply voltage and threshold voltage variations.
1 Introduction

Power consumption represents an emerging issue nowadays due to recent mobile products and the challenges of new advanced technologies. The leakage currents responsible for static power dissipation during idle mode are increasing significantly in sub-100nm CMOS processes, where the device threshold voltage and the gate oxide thickness tend to shrink. Great effort has been devoted to understanding the leakage current mechanisms, modeling their behavior, and developing design techniques for power saving in standby operation [1]-[9]. The main contributors to the total leakage dissipation in CMOS circuitry are the subthreshold and gate oxide currents. The subthreshold current, negligible in older processes when compared to dynamic charging and short-circuit currents, starts to be taken into account below 180nm technologies. Gate leakage, on the other hand, tends to become the main factor in static consumption from sub-100nm CMOS processes onward [3]. Design techniques for leakage reduction have been proposed considering, for instance, input state dependency, multi-Vth devices and bulk biasing when subthreshold current is addressed, while future high-κ gate dielectrics and metal gate electrodes are expected to solve the static power consumption due to gate oxide leakage current [1]. The most widely considered design strategies are based on the well-known stack effect, which suggests the use of more than one off-device in series to reduce the subthreshold current. Note that this effect has a distinct influence on different leakage mechanisms, e.g. gate oxide currents [3].

N. Azemard and L. Svensson (Eds.): PATMOS 2007, LNCS 4644, pp. 474–484, 2007. © Springer-Verlag Berlin Heidelberg 2007
Subthreshold Leakage Modeling and Estimation of General CMOS Complex Gates
The modeling of the stack effect has been treated in the literature, where only basic NOR and NAND gates, i.e. purely series and parallel arrangements of transistors, have actually been targeted. This is sufficient for standard cell libraries applied to the technology mapping performed by commercial EDA tools. However, more recent EDA flows have explored library-free mapping techniques [10]. In this context, CMOS complex gates, composed of a great variety of mixed series-parallel pull-up/pull-down logic networks, i.e. arrangements with more than two levels of logic depth, are frequently applied. On the other hand, transistor networks derived from BDD graphs provide interesting improvements in device count and cell performance [11-13]. However, unlike standard CMOS gates, non-series-parallel transistor arrangements are usually obtained. Such topologies have not been treated in the literature for leakage current prediction. Finally, the occurrence of on-transistors in off-networks in subthreshold current analysis is another factor largely ignored by previous works, which consider them as ideal short-circuits or on-switches, as reviewed in Section 2. In this paper, subthreshold leakage modeling is improved to be applicable to general transistor networks, as described in Section 3. Section 4 presents experimental results demonstrating the model accuracy and validation. The conclusions are given in Section 5.
2 Related Works

Subthreshold leakage models have been recently reported in the literature. In [4], Gu and Elmasry presented a subthreshold current model for purely series and parallel off-transistor arrangements (inverter, NAND and NOR gates), assuming equivalent standby current in both NMOS and PMOS devices, and considering up to three series devices. The two XOR topologies presented by Gu [4] result, in fact, in single off-transistor configurations during the steady-state analysis, as verified in pass-transistor logic (PTL) design style. Transistor stacks composed of single transistors in a chain are also addressed in [5]-[8], while on-devices are considered as ideal short-circuits. AOI and OAI gates are mentioned as examples for analysis of series-parallel off-networks by Lee et al. in [9]. Multiple parallel transistor stacks are computed, taking into account each branch separately, and the values are then summed to obtain the total leakage in the logic gate. In the case of parallel transistors within a stack, such transistors are initially collapsed and replaced by a single equivalent device with transistor size equal to the sum of their original sizes. This strategy is also applied by other authors, such as Rosseló et al. in [8], and it appears to be quite accurate. However, series-parallel networks with more than two levels of logic depth, as illustrated in Fig. 1, cannot be treated by this kind of network reduction technique. Yang et al. [3], in turn, present a detailed analysis to estimate the total leakage, including the interaction between subthreshold and gate leakage currents, for NAND and NOR gates. Different AOI and OAI gates are also discussed, emphasizing the strong influence of the electrical topology on the leakage value. Although not explicitly mentioned, probably the same strategy of transistor width equivalence is applied to treat parallel devices, as suggested in [8] and [9].
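The width-summing reduction described above can be sketched as follows; the list-of-levels stack encoding (widths in μm) is an assumption of this sketch, not a data structure from the cited works.

```python
# Hedged sketch of the parallel-collapse reduction used in [8][9]:
# parallel transistors inside a stack are replaced by a single
# equivalent device whose width is the sum of the original widths.

def collapse_parallel(stack):
    """Replace each parallel group in a transistor stack by one
    equivalent device with the summed width (widths in um)."""
    return [sum(level) if isinstance(level, (list, tuple)) else level
            for level in stack]

# A 3-level stack whose middle level has two 0.2 um devices in parallel:
print(collapse_parallel([0.4, (0.2, 0.2), 0.4]))  # -> [0.4, 0.4, 0.4]
```

As the text points out, this reduction only handles parallel groups inside a single stack; deeper series-parallel nestings such as Fig. 1 require the node-voltage analysis of Section 3.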
P.F. Butzen et al.
The major contributions of this paper are the extension of subthreshold current modeling to general off-networks (series-parallel and non-series-parallel arrangements), and the analysis and modeling of subthreshold leakage considering the presence of on-devices in off-networks. In terms of gate oxide leakage, the proposed method can be combined with the estimation method proposed by Yang et al. [3], improving the subthreshold part of the total leakage calculation.
3 Subthreshold Leakage Model

The total leakage dissipation results from the sum of the current in each branch of off-transistors between the supply voltage (Vdd) and ground. To present the proposed method, the off-network illustrated in Fig. 1 can be considered as the entire NMOS pull-down arrangement. The same analysis is applicable to PMOS pull-up trees.
Fig. 1. NMOS series-parallel network
According to the BSIM transistor model [14], the subthreshold current of a MOSFET can be modeled as

$$I_S = I_0 W \, e^{\frac{V_{gs} - (V_{t0} - \eta V_{ds} - \gamma V_{bs})}{n V_T}} \left[ 1 - e^{-\frac{V_{ds}}{V_T}} \right] \quad (1)$$
where $I_0 = \frac{\mu_0 C_{ox} V_T^2 e^{1.8}}{L}$ and $V_T = \frac{kT}{q}$. Vgs, Vds and Vbs are the gate, drain and bulk voltages of the transistor, respectively. Vt0 is the zero-bias threshold voltage. W and L are the effective transistor width and length, respectively. γ is the body effect coefficient and η is the DIBL coefficient. Cox is the gate oxide capacitance, μ0 is the mobility and n is the subthreshold swing coefficient. A detailed resolution of the off-transistor arrangement depicted in Fig. 1 is presented in [15] and can be generalized as follows. The subthreshold current through the top devices (ISi), i.e. transistors connected to Vdd, can be expressed by equation (2). This
Subthreshold Leakage Modeling and Estimation of General CMOS Complex Gates
477
equation considers the variable Vj as the voltage over every transistor placed under the transistor being evaluated in the stack.

$$I_{Si} = I_0 W_i \, e^{\frac{-\sum V_j - \left[ V_{t0} - \eta \left( V_{dd} - \sum V_j \right) + \gamma \sum V_j \right]}{n V_T}} \quad (2)$$
The subthreshold current through the other transistors in the network is expressed by equation (3). The differences between the two equations are observed in the η term (DIBL effect), where Vi represents the voltage over the specific transistor under analysis, and also in the last term, which can be eliminated when Vi >> VT.

$$I_{Si} = I_0 W_i \, e^{\frac{-\sum V_j - \left[ V_{t0} - \eta V_i + \gamma \sum V_j \right]}{n V_T}} \left[ 1 - e^{-\frac{V_i}{V_T}} \right] \quad (3)$$
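A minimal numeric sketch of equations (2) and (3) follows; the parameter values are the ones quoted later in Section 4, while Vt0 and the node voltages passed in are illustrative assumptions of this sketch.

```python
from math import exp

# Parameters as quoted in Section 4 (130nm, 100 C); Vt0 is an assumed value.
N, VT = 1.45, 0.0321          # subthreshold swing coefficient; kT/q in V
ETA, GAMMA = 0.078, 0.17      # DIBL and body effect coefficients
VT0, I0 = 0.35, 20.56e-3      # assumed zero-bias Vt; I0 as quoted (A)

def i_top(w, vdd, v_below):
    """Eq. (2): top transistor (source at sum(Vj), drain at Vdd)."""
    s = sum(v_below)
    return I0 * w * exp((-s - (VT0 - ETA * (vdd - s) + GAMMA * s)) / (N * VT))

def i_inner(w, vi, v_below):
    """Eq. (3): non-top transistor with drain-source voltage vi."""
    s = sum(v_below)
    e = exp((-s - (VT0 - ETA * vi + GAMMA * s)) / (N * VT))
    return I0 * w * e * (1.0 - exp(-vi / VT))
```

In steady state the same current flows through every off-device of a stack, which is the condition used to derive the node voltages of equations (4)-(6).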
The transistor voltages can be evaluated in three different situations. The analysis assumes that Vdd >> Vi, which drops all the Vi terms, and it also considers the fact that Vi >> VT, so that the $[1 - e^{-V_i/V_T}]$ term can be ignored. The first situation is represented by the voltage V1 in Fig. 1. In this case, it is possible to reduce every transistor connected at that node by series-parallel association. The terms Wabove and Wbelow in equation (4) represent the effective width of the transistors over and under the node Vi, respectively. For this condition, Vi is given by

$$V_i = \frac{\eta V_{dd} + n V_T \ln\!\left( \frac{W_{above}}{W_{below}} \right)}{1 + 2\eta + \gamma} \quad (4)$$
The second situation is represented by the voltage V2 in Fig. 1. In this condition, it is not possible to make series-parallel associations between the transistors connected at the i-index node. The term Vabove denotes the voltage over the transistors above the node Vi. For this state, the voltage Vi is given by

$$V_i = \frac{n V_T \ln\!\left( \frac{\sum W_{above} \, e^{\frac{\eta V_{above}}{n V_T}}}{W_{below}} \right)}{1 + \eta + \gamma} \quad (5)$$
Finally, the third situation is illustrated by the voltage V3 in the same arrangement. This case only happens at the bottom transistors, and the analysis cannot assume Vi >> VT, so the term (Vi/VT) should not be ignored. To simplify the mathematical calculation, the term $e^{-V_i/V_T}$ in (1) can be approximated by $(1 - V_i/V_T)$. Then, Vi is obtained from equation (6), where C = 1 + η + γ.

$$\frac{C}{n} \left( \frac{V_i}{V_T} \right) + \ln\!\left( \frac{V_i}{V_T} \right) = \frac{\eta V_{above}}{n V_T} + \ln\!\left( \frac{W_{above}}{W_i} \right) + \ln\!\left( \frac{V_{above}}{V_T} \right) \quad (6)$$
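Equation (6) is transcendental in Vi; a hedged sketch of solving it numerically (here by bisection on x = Vi/VT, with illustrative widths and Vabove) is:

```python
from math import log

# Parameters as quoted in Section 4; v_above and the widths are illustrative.
ETA, GAMMA, N, VT = 0.078, 0.17, 1.45, 0.0321
C = 1 + ETA + GAMMA           # as defined for equation (6)

def solve_vi(w_above, w_i, v_above, lo=1e-6, hi=50.0):
    """Solve eq. (6) for Vi via bisection on x = Vi/VT (the LHS is
    monotone increasing in x, so bisection is safe)."""
    rhs = ETA * v_above / (N * VT) + log(w_above / w_i) + log(v_above / VT)
    f = lambda x: (C / N) * x + log(x) - rhs
    for _ in range(100):
        mid = 0.5 * (lo + hi)
        lo, hi = (mid, hi) if f(mid) < 0 else (lo, mid)
    return 0.5 * (lo + hi) * VT

# Equal widths, 0.1 V over the devices above: Vi comes out comparable
# to VT, which is why the exp(-Vi/VT) term cannot be dropped here.
vi = solve_vi(w_above=0.4, w_i=0.4, v_above=0.1)
```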
3.1 Subthreshold Leakage in Non-series-parallel Networks

Standard CMOS gates derived from logic equations are usually composed of series-parallel (SP) device arrangements. When a Wheatstone bridge configuration is observed in the transistor schematic, as in some BDD-based networks [11-13], a non-series-parallel topology is identified, as depicted in Fig. 2b. The proposed model, discussed above, can be used to calculate the voltage across each single transistor, accurately estimating the leakage current. When the model is applied to a non-series-parallel configuration, it is sometimes difficult to calculate the voltage across certain transistors, as occurs in Fig. 2b for the transistor controlled by ‘c’. In this case, the transistor receiving ‘c’ may be ignored at first, until the voltage at one of its terminals is evaluated. For evaluating the other terminal, such a device is then included. A similar procedure is suitable for any kind of non-SP network.
Fig. 2. SP (a) and non-SP (b) transistor arrangements of the same logic function
3.2 On-Transistors in Off-Networks

The previous analysis considers only off-networks composed exclusively of transistors that are turned off. In most cases, the transistors that are turned on could be treated as ideal short-circuits, because the voltage drop across such devices is some orders of magnitude smaller than the voltage drop across the off-transistors. However, in the case of NMOS transistors switched on and connected to the power supply Vdd, the voltage drop Vdrop across them should be taken into account. In the leakage current analysis, this voltage drop becomes important when the transistor stack presents only one off-device at the bottom of the stack, as depicted in Fig. 3. In the proposed model, the term $V_{dd} - \sum V_j$ in equation (2) is replaced by $V_{dd} - V_{drop} - \sum V_j$. In arrangements with more than one off-transistor in series, the on-devices can be accurately treated as having zero voltage drop (i.e. as ideal short-circuits). A similar analysis is valid for PMOS transistors in off-networks when they are connected to the ground node.
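A sketch of this correction, under the same assumed parameters as before (Vt0 is not given in the paper and is assumed here; Vdrop = 0.14 V is the value quoted in Section 4):

```python
from math import exp

# Hedged sketch of the Section 3.2 correction; Vt0 is an assumed value.
N, VT, ETA, GAMMA = 1.45, 0.0321, 0.078, 0.17
VT0, I0, VDROP = 0.35, 20.56e-3, 0.14

def i_top(w, vdd, v_below, on_device_above=False):
    """Eq. (2) for the top off-device; when an on-NMOS sits between Vdd
    and the stack, Vdd - sum(Vj) becomes Vdd - Vdrop - sum(Vj)."""
    s = sum(v_below)
    v_top = vdd - (VDROP if on_device_above else 0.0)
    return I0 * w * exp((-s - (VT0 - ETA * (v_top - s) + GAMMA * s)) / (N * VT))

# The correction weakens the DIBL term, lowering the predicted leakage:
assert i_top(0.4, 1.2, [], on_device_above=True) < i_top(0.4, 1.2, [])
```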
Fig. 3. Influence of on-transistor in off-stack current
4 Results

In order to validate the model, the results obtained analytically were compared to Hspice simulation results, considering commercial 130nm CMOS process parameters and an operating temperature of 100°C. It is known that down to 100nm processes, the subthreshold current represents the main leakage mechanism, while for sub-100nm technologies, the model presented herein should be combined with gate oxide leakage estimation, already proposed by other authors [3][9]. The parameters used in the analytical evaluation were: Vdd = 1.2V, Vdrop = 0.14V, I0 = 20.56mA, η = 0.078, γ = 0.17, n = 1.45, and W = 0.4μm. Initially, transistors with equal sizing were applied to simplify the analysis. The leakage current was calculated and correlated with Hspice results for several pull-down NMOS off-networks, depicted in Fig. 4. The results presented in Table 1 show a good agreement between the model and the simulation data, with an absolute average error of less than 10%. It is interesting to note that the static currents in networks (f), (g), and (h) from Fig. 4, not treated by previous models, are accurately predicted. The main difference is observed for structures (c) and (d), where the model assumes Vi < VT and the term $e^{-V_i/V_T}$ is replaced by $(1 - V_i/V_T)$.
Fig. 4. Pull-down NMOS networks (‘•’ and ‘ο’ represent on- and off-device, respectively)
Tables 2 and 3 present the results related to the NMOS trees in Fig. 4(f) and Fig. 4(g), respectively. In these tables, the input-dependent leakage current is evaluated for all input combinations. In some cases, different input vectors result in equivalent off-device arrangements. For these, the Hspice values are given as the minimum and maximum obtained over the set of equivalent input states. Moreover, the previous model presented in [5] was also calculated for such logic states. Note that the first input vector in both cases, which represents the entire network composed of off-switches, is not treated by the previous model proposed in [5]. Basically, different

Table 1. Subthreshold leakage current (nA) related to the off-networks depicted in Fig. 4

Network   HSPICE   Model   Diff.(%)
(a)       1.26     1.26    0.00
(b)       6.58     6.60    0.30
(c)       0.69     0.75    8.70
(d)       0.72     0.77    6.94
(e)       1.29     1.28    0.78
(f)       1.29     1.28    0.78
(g)       2.52     2.53    0.40
(h)       2.56     2.54    0.79
Table 2. Input dependence leakage estimation (nA) in the logic network from Fig. 4(f)

Input state (abcde)   HSPICE results*   Proposed model   Previous model [5]
00000                 1.29              1.28             —
00001                 9.71              9.65             9.65
00010                 1.43              1.34             1.34
00011                 25.00             25.02            25.02
00100 (a)             1.36/1.37         1.30             1.31
00101 (b)             14.91/15.14       14.94            16.69
00110 (c)             6.30/6.73         6.60             8.34

* The HSPICE value is given as min./max. currents over the equivalent vectors: (a) [01000, 01100]; (b) [01001, 01101]; (c) [01010, 01110, 10000, 10010, 10100, 10110, 11000, 11010, 11100, 11110].
Table 3. Input dependence leakage estimation (nA) in the logic network from Fig. 4(g) Input-state (abcde) 00000 00001a 00011b 00100 01000 01010c 01100d 10000
HSPICE results* 2.52 10.54/10.54 16.68/16.68 2.52 7.90 21.05/21.05 12.55/13.15 7.90
Proposed model 2.53 10.76 16.68 2.52 7.90 21.54 13.20 7.90
Previous model [5] 10.76 16.68 2.52 7.91 24.02 16.68 9.65
* The HSPICE value is given for min./max. currents related to such equivalent vectors. Equivalent vectors – a.[00010]; b.[00101, 00110, 00111]; c.[10001]; d.[10100, 11000, 11100].
a,b,c,d
Subthreshold Leakage Modeling and Estimation of General CMOS Complex Gates
481
values from both methods are obtained when on-transistors are considered in the off-networks, providing a more accurate correlation with the electrical simulation results. Fig. 5 shows the average subthreshold leakage current related to the NMOS networks illustrated in Fig. 4(f) and Fig. 4(g). As discussed before, the previous model from [5] cannot estimate the subthreshold current for the first input vector in both cases, and this vector is not considered in the average static current calculation. Unlike the previous model, the proposed one presents results close to the Hspice simulations. The main reason for this is the influence of on-transistors in off-networks.
Fig. 5. Average subthreshold leakage current for pull-down networks in Fig. 4(f) and Fig. 4(g)
In terms of combinational circuit static dissipation analysis, the technology mapping task divides the entire circuit into multiple logic gates or cells. Thus, they can be treated separately for the leakage estimation, since the input state of each cell is known according to the primary input vector of the circuit. A complex CMOS logic gate, whose transistor sizes were determined using the Logical Effort method [16], is depicted in Fig. 6. Table 4 presents the comparison between electrical simulation data and the model calculation, validating the method for sized gates. Finally, the proposed model has been verified under power supply voltage and operating temperature variations, as depicted in Fig. 7. The influence of temperature variation on the predicted current shows good agreement with Hspice results. On the
Fig. 6. CMOS gate, with different transistor sizing, according to the Logical Effort [16]
Table 4. Subthreshold leakage (nA) related to the CMOS complex gate depicted in Fig. 6

Input state (abcd)   HSPICE results   Proposed model   Diff(%)
0000                 4.01             4.13             3.0
0001                 20.67            20.68            0.0
0010                 19.93            19.99            0.3
0011                 44.52            43.27            2.8
0100                 4.44             4.29             3.4
0101                 42.34            42.37            0.1
0110                 19.93            19.99            0.3
0111                 43.38            43.27            0.3
1000                 4.40             4.26             3.2
1001                 36.81            36.50            0.8
1010                 19.93            19.99            0.3
1011                 43.38            43.27            0.3
1100                 19.50            19.99            2.5
1101                 96.67            96.92            0.3
1110                 20.43            19.99            2.2
1111                 20.48            20.21            1.3
Fig. 7. Subthreshold current in (a) supply voltage and (b) operating temperature variations
Fig. 8. Subthreshold current according to the threshold voltage variation
other hand, the difference between the subthreshold currents obtained from electrical simulation and from the analytical model under voltage variation can be justified by possible inaccuracy in the parameter extraction. Fig. 8 shows the leakage current analysis with respect to the threshold voltage variation, validating the proposed method for this factor, which is critical in the most advanced CMOS processes.
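The temperature behaviour discussed above follows directly from the model: VT = kT/q appears both in I0 (proportional to VT²) and in the exponent of equation (1). A hedged sketch, with the other parameters held fixed (which neglects, e.g., the temperature dependence of Vt0 and mobility):

```python
from math import exp

K_OVER_Q = 8.617e-5           # Boltzmann constant over electron charge (V/K)
N, ETA, VT0, VDD = 1.45, 0.078, 0.35, 1.2   # Vt0 assumed; others from Sec. 4

def i_rel(t_kelvin, w=0.4):
    """Single off-device leakage vs. temperature, in relative units."""
    vt = K_OVER_Q * t_kelvin
    i0 = vt * vt * exp(1.8)                  # I0 is proportional to VT^2
    return i0 * w * exp(-(VT0 - ETA * VDD) / (N * vt))

ratio = i_rel(373.15) / i_rel(300.0)         # 100 C versus 27 C
```

The ratio comes out as a severalfold increase, consistent with the strong temperature sensitivity seen in Fig. 7(b).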
5 Conclusions

A new subthreshold leakage current model has been presented for application to general CMOS networks, including series-parallel and non-series-parallel transistor arrangements. The occurrence of on-devices in off-networks is also taken into account. These features make the model useful for different logic styles, including BDD-derived networks, improving on previous works that are not suitable for such structures. The proposed model has been validated on a 130nm CMOS technology, in which the subthreshold current is the most relevant leakage mechanism. In the case of sub-100nm processes, where gate leakage becomes more significant, the present work should be combined with already published works addressing the interaction between the subthreshold and gate oxide leakage currents, such as that proposed by Yang et al. [3], in order to improve the accuracy of the total leakage estimation.
References

1. Roy, K., Mukhopadhyay, S., Meimand, H.M.: Leakage Current Mechanisms and Leakage Reduction Techniques in Deep-Submicrometer CMOS Circuits. Proceedings of the IEEE 91(2), 302–327 (2003)
2. Anderson, J.H., Najm, F.N.: Active Leakage Power Optimization for FPGAs. IEEE Trans. on CAD 25(3), 423–437 (2006)
3. Yang, S., Wolf, W., Vijaykrishnan, N., Xie, Y., Wang, W.: Accurate Stacking Effect Macro-modeling of Leakage Power in Sub-100nm Circuits. In: Proc. Int. Conference on VLSI Design, pp. 165–170 (January 2005)
4. Gu, R.X., Elmasry, M.I.: Power Distribution Analysis and Optimization of Deep Submicron CMOS Digital Circuits. IEEE J. Solid-State Circuits 31 (May 1996)
5. Cheng, Z., Johnson, M., Wei, L., Roy, K.: Estimation of Standby Leakage Power in CMOS Circuits Considering Accurate Modeling of Transistor Stacks. In: Proc. Int. Symposium on Low Power Electronics and Design, pp. 239–244 (August 1998)
6. Roy, K., Prasad, S.: Low-Power CMOS VLSI Circuit Design. John Wiley & Sons, Chichester (2000)
7. Narendra, S.G., Chandrakasan, A.: Leakage in Nanometer CMOS Technologies. Springer, Heidelberg (2006)
8. Rosseló, J.L., Bota, S., Segura, J.: Compact Static Power Model of Complex CMOS Gates. In: Vounckx, J., Azemard, N., Maurine, P. (eds.) PATMOS 2006. LNCS, vol. 4148, Springer, Heidelberg (2006)
9. Lee, D., Kwong, W., Blaauw, D., Sylvester, D.: Analysis and Minimization Techniques for Total Leakage Considering Gate Oxide Leakage. In: Proc. DAC, pp. 175–180 (June 2003)
10. Gavrilov, S., et al.: Library-less Synthesis for Static CMOS Combinational Logic Circuits. In: Proc. ICCAD, pp. 658–662 (November 1997)
11. Yang, C., Ciesielski, M.: BDS: A BDD-based Logic Optimization System. IEEE Trans. CAD 21(7), 866–876 (2002)
12. Lindgren, P., Kerttu, M., Thornton, M., Drechsler, R.: Low Power Optimization Technique for BDD Mapped Circuits. In: Proc. ASP-DAC 2001, pp. 615–621 (2001)
13. Shelar, R.S., Sapatnekar, S.: BDD Decomposition for Delay Oriented Pass Transistor Logic Synthesis. IEEE Trans. VLSI 13(8), 957–970 (2005)
14. Sheu, B.J., Scharfetter, D.L., Ko, P.K., Jeng, M.C.: BSIM: Berkeley Short-Channel IGFET Model for MOS Transistors. IEEE J. Solid-State Circuits SC-22 (August 1987)
15. Butzen, P.F., Reis, A.I., Kim, C.H., Ribas, R.P.: Modeling Subthreshold Leakage Current in General Transistor Networks. In: Proc. ISVLSI (May 2007)
16. Sutherland, I., Sproull, B., Harris, D.: Logical Effort: Designing Fast CMOS Circuits. Morgan Kaufmann, San Francisco, CA (1999)
A Platform for Mixed HW/SW Algorithm Specifications for the Exploration of SW and HW Partitioning

Christophe Lucarz and Marco Mattavelli

Ecole Polytechnique Fédérale de Lausanne, Switzerland
{christophe.lucarz,marco.mattavelli}@epfl.ch
Abstract. The increasing complexity of video and multimedia processing in particular has led to the practice of developing algorithm specifications as software implementations that become, in effect, generic reference implementations. Mapping such software models directly onto platforms made of processors and dedicated HW elements becomes harder and harder due to the complexity of the models and the large choice of possible partitioning options. This paper describes a new platform aiming at supporting the mapping of software specifications into mixed SW and HW implementations. The platform is supported by profiling capabilities specifically conceived to study data transfers between SW and HW modules. Such optimization capabilities can be used to achieve different objectives such as the optimization of memory architectures or low-power designs through the minimization of data transfers.
1 Introduction
Due to the ever increasing complexity of integrated processing systems, software verification models are necessary to test performance and to specify a system behaviour accurately. A reference software, whether the processing is specified by a standard body or is any other "custom" algorithm, is the starting point of every real implementation. This is the typical case for MPEG video coding, where the "reference software" is now the real specification. There is an intrinsic difference between real implementations, which can be made of HW and SW components working in parallel using a specific mapping of data onto different logical or physical memories, and the "reference software", which is usually based on a sequential model with a monolithic memory architecture; this difference is generically called the "architectural gap". In the process of transforming the reference SW into a real implementation, the possibility of exploring different architectural solutions for specific modules and studying the resulting data exchanges to define optimal memory architectures is a very attractive approach. This system exploration option is particularly important for complex systems, where conformance and validation of the new module designs need to be performed at each stage of the design, otherwise incurring the risk of

N. Azemard and L. Svensson (Eds.): PATMOS 2007, LNCS 4644, pp. 485–494, 2007. © Springer-Verlag Berlin Heidelberg 2007
developing solutions that do not respect their original specification or do not provide adequate performance. This paper presents an integrated platform, with associated profiling capabilities, that supports mixed SW and HW specifications and enables the hardware designer to seamlessly transform a reference software into software plus hardware modules mapped on an FPGA. The paper is organized as follows: Section 2 presents a brief state of the art of integrated HW/SW platforms. Section 3 provides a general view of the platform, introducing the innovative elements. Section 4 describes the details of the platform features that enable HW/SW support. Section 5 presents the capabilities of the profiling tool and explains how it can be used to study and optimize data transfers satisfying different criteria.
2 State of the Art of Platforms Supporting Mixed HW/SW Implementations
Many platforms have been designed to support mixed SW/HW implementations, but all of them suffer from the fact that there is no easy procedure capable of seamlessly plugging hardware modules described in HDL into a pure software algorithm. Either the memory management is a burdensome task, or the call of the hardware module has to be done by an embedded processor on the platform. Environments which support HW/SW implementations are generally based on a platform containing an embedded processor and some dedicated hardware logic such as an FPGA, as described in the work of Andreas Koch [2]. The control program lies in the embedded processor. Data on the host are easily available thanks to virtual serial ports, but the plugging of hardware modules inside the reference software running on the host remains the most difficult task. The work of Martyn Edwards and Benjamin Fozard [3] is interesting in the way an FPGA-based algorithm can be activated from the host processor. This platform is based on the Celoxica RC1000-PP board and communicates with the host using the PCI bus. The control program runs on the host processor, sends control information to the FPGA, and transfers data into a small shared memory which is part of the hardware platform. In this case, the designer/programmer must perform the data transfers between the host and the local memory explicitly. Many other works about coprocessors have been reported in the literature; some examples are given in [4] [5]. However, the problem of seamless plug-in of HDL modules remains, with the data transfers in the charge of the designer, which can be a very burdensome task when dealing with complex data-dominated video or multimedia algorithms. In some works on coprocessors, data transfers can be generated automatically by the host, as for instance in [6]. But data are copied into the local memory at a fixed place; thus, the HDL module must be aware of the physical addresses
of the data in the local memory. The management of the addresses can be a burdensome task when dealing with complex algorithms. The Virtual Socket concept and its associated platform were presented in [10] [9] [7] and were developed to support the mixed specification of MPEG-4 Part 2 and Part 10 (AVC/H.264) in terms of reference SW, including the plug-in of HDL modules. The platform is constituted by a standard PC, where the SW is executed, and by a PCMCIA card that contains a FPGA and a local memory. The data transfers between the host memory and the local memory on the FPGA must be explicitly specified by the designer/programmer. Specifying the data transfers explicitly would not constitute a serious burden when dealing with simple deterministic algorithms, for which the data required by the HDL module are known exactly. Unfortunately, for very complex design cases, where design trade-offs are much more convenient (and often the only viable solutions) than worst-case designs, data transfers cannot be explicitly specified in advance by the designer. Our work is based on the Virtual Socket platform, to which we add a virtual memory capability to allow automatic data transfers from the host to the local memory. The goal of our platform implementation is to provide a "direct map" of any SW portion to a corresponding HDL specification without the need of specifying any data transfer explicitly: in other words, to extend the Virtual Socket concept for plugging HDL modules into a SW partition with the concept of virtual memory. HDL modules and the software algorithm share a unified virtual memory space. Having a shared memory - enforced by a cache-coherence protocol - between the CPU running the SW sections and the platform supporting the HW avoids the need of specifying all the data transfers explicitly. The clear advantage is that the data transfer needs of the developed HDL module can be directly profiled so as to explore different memory architecture solutions.
Another advantage of such a direct map is that conformance with the original SW specification is guaranteed at any stage, and the generation of test vectors is naturally provided by the way the HDL module is plugged into the SW section.
3 Description of the Virtual Socket Platform
The Virtual Socket platform is composed of a PC and a PCMCIA card that includes a FPGA and a local memory. The Virtual Socket handles the communications between the host (the PC environment) and the HDL modules (in the FPGA on the PCMCIA card). Given that the HDL modules are implemented on the FPGA, in principle they only have access to the local memory (see figure 1). This was the case in the first implementation of the Virtual Socket platform, with the consequence that all the data transfers from the host to the local memory had to be specified in advance by the designer/programmer himself. Such an operation, besides being error-prone or transferring more data than necessary, is not straightforward and may become difficult to handle when the volume of data is comparable with the size of the (small) local
Fig. 1. The Virtual Socket platform overview
memory. Therefore, an extension has been conceived and implemented so as to handle these data transfers automatically. The Virtual Memory Extension (VME) is implemented by two components: the hardware extension to the Virtual Socket platform (Window Manager Unit) and a Virtual Memory Window (VMW) library on the host PC. The cache-coherence protocol is implemented in the Window Manager Unit (WMU) using a TLB (Translation Lookaside Buffer) and is handled by the software support (VMW). The HDL module is designed simply to generate virtual addresses relative to the user virtual memory space (on the host) to request data and execute its processing tasks. The processing of the data on the platform using the virtual memory feature proceeds as follows. The algorithm starts its execution on the PC and the associated host memory. The Virtual Socket environment allows the HDL module to have a seamless direct access to the host memory thanks to the Virtual Memory Extension, and allows the HDL module to be started easily from the software algorithm thanks to the VMW library. Figure 2 shows the relations between the host memory, the reference software algorithm, the hardware function call and the HDL module. Given an algorithm described in a mixed HW/SW form (1), some parts are executed in software on the host processor (5), while other parts are executed by hardware HDL modules (4) on the Virtual Socket platform hardware. To deal with mixed HW/SW algorithms, it is very convenient if the HDL and C functions have access to the same user memory space (6), which is part of the host hardware and where the data to process are stored. The main memory space is trivially available for the parts of the algorithm executed in software, which is much less evident for the parts executed in hardware. The section of C code the programmer intends to execute in hardware is replaced by the hardware call function (2). The latter is based on the Virtual Memory Window library.
The programmer sets the parameters to be given to the HDL module. The Start_module() function drives the Virtual Socket platform
Fig. 2. Interactions between the C function, the HDL module and the shared memory space
and the VME (3) to activate the HDL module (4). The VMW library manages all the data transfers between the main memory (6) and the local memory of the platform (3), because the HDL module, being in a FPGA, has access only to this local memory. Thanks to the VME, the HDL module has access to the host memory without intervention of the programmer. Data are sent to the HDL module and results are updated in the main memory automatically, thanks to the software library support. When the HDL module finishes its work, the hardware call function is terminated by closing the platform, and the reference software algorithm continues on the host PC.
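The call sequence just described (open the platform, set the parameters, start the HDL module, close) can be sketched as follows. Only Start_module() is named in the text; the session object and the other method names are hypothetical stand-ins for the VMW library API.

```python
# Hedged, host-side sketch of the hardware call function described above.

class VirtualSocketSession:
    """Stand-in for the VMW library platform handle (hypothetical)."""
    def __init__(self):
        self.log = []
    def open(self):                 # open/configure the PCMCIA platform
        self.log.append("open")
    def set_parameters(self, **p):  # pointers into the shared virtual space
        self.params = p
        self.log.append("set_parameters")
    def start_module(self):         # runs the HDL module; VME serves misses
        self.log.append("start_module")
    def close(self):                # results are already in host memory
        self.log.append("close")

def hw_call(**params):
    """Replaces a section of C code by a call to its HDL implementation."""
    s = VirtualSocketSession()
    s.open()
    s.set_parameters(**params)
    s.start_module()
    s.close()
    return s.log

print(hw_call(src_buffer=0x1000))
# -> ['open', 'set_parameters', 'start_module', 'close']
```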
4 Details on HW Implementation and SW Support
The following section describes in more detail how the Virtual Socket platform supporting the Virtual Memory Extension is implemented. The first part explains how virtual memory accesses are possible from the HDL modules. Then, the Virtual Memory Window library, i.e., the software support, is described in detail to show how virtual memory accesses are handled. The final part explains how HDL modules can be integrated in the platform using a well-defined protocol.
C. Lucarz and M. Mattavelli

4.1 The Heart of Simplicity: HDL Modules Virtual Memory Accesses
The HDL modules are implemented on the FPGA, so they have access only to the local memory of the Virtual Socket platform. With the implementation of the Virtual Memory Extension, the HDL modules have direct access to the software virtual memory space located on the host PC. The right part of figure 1 shows in detail how the connections between a HDL module, the Virtual Socket platform and the Virtual Memory Extension are implemented (in the hardware of the PCMCIA card). The virtual addresses generated by the HDL modules are handled by the Virtual Memory Controller (VMC) and the Window Manager Unit (WMU). The WMU is a component taken from the work of Vuletić et al. [8]. The WMU translates virtual addresses into physical addresses. The VMC is in charge of intercepting the relevant signals at the right time on the interface between the HDL module and the platform, in order to send information to the WMU, which executes the translation. Among the signals intercepted by the VMC are the address signal, the count signal (number of data requested by the HDL module) and the strobe signal. The virtual addresses refer to the unified virtual memory space and the physical addresses refer to the local memory on the card. A physical address is composed of an offset and a page number. The local memory (on the current PCMCIA card platform) is composed of 32 pages of 2 kB. The offset corresponds to the location of the data in the page. The software support library (on the host PC) fills the pages of the local memory with the requested data coming from the virtual memory. When the WMU receives an unknown virtual address, it raises an interrupt through the interrupt controller of the card. The interrupt is handled by the software support (on the host PC) and the requested data are written from the host memory to the local memory.
From the designer/programmer point of view, when using the Virtual Memory Extension the whole process of data transfers is completely transparent. The only issue the designer/programmer has to take care of is to generate the virtual addresses according to the data contained in the host memory space. The whole task of transferring data to local memory is done by the platform and its software support.

4.2 The Software Support: The Virtual Memory Window Library
The Virtual Memory Window (VMW) library is built on the FPGA card driver (Wildcard II API), the Virtual Socket API developed by Yifeng Qiu and Wael Badawy based on the works [9,10], and the WIN32 API. The Virtual Socket platform can be used with or without the Virtual Memory Extension. The designer/programmer is free to choose whether the data transfers between the main memory on the host and the local memory on the card are done automatically (virtual mode) or manually (explicit mode).
The following piece of C code shows how a HDL module can be easily called from the Reference Software by using the Virtual Memory Extension:

int main(int argc, char *argv[])
{
    /* [. . .] Reference Software Algorithm stops here */
    /* Beginning of the HDL module calling procedure */

    /******* CONFIGURING THE PLATFORM *******/
    Platform_Init();               // Virtual Socket
    VMW_Init();                    // Virtual Memory Extension

    /******* PARAMETERS SETTINGS *******/
    Module_Param.nb_param = 4;     // number of parameters
    Module_Param.Param[0] = A;     // parameter 1
    Module_Param.Param[1] = B;     // parameter 2
    Module_Param.Param[2] = C;     // parameter 3
    Module_Param.Param[3] = D;     // parameter 4

    /******* HDL MODULE START *******/
    Start_module(1, &Module_Param);

    /******* CLOSING THE PLATFORM *******/
    VMW_Stop();                    // Virtual Memory Extension
    Platform_Stop();               // Virtual Socket

    /* End of the HDL module calling procedure */
    /* [. . .] the Reference Software Algorithm continues */
}
First, the designer/programmer must configure the platform by using the Platform_Init() and VMW_Init() functions from the Virtual Socket API and the VMW API. HDL modules are activated with the function Start_module() from the VMW API. The designer/programmer must set a given number of parameters needed for the configuration of the HDL module. This can be done through the data structure Module_Param. Sixteen parameters are available for each HDL module.

4.3 The Integration of the HDL Modules in the Platform
The HDL module is linked to the Virtual Socket platform through a well-defined interface and a precise communication protocol. Figure 3 illustrates the essential elements of the communication protocol. A HDL module can issue two types of requests: read or write data (in main or local memory, depending on the operating mode: virtual or explicit).
Fig. 3. The communication protocol between a HDL module and the Virtual Socket Platform
Read and write protocols are very similar. The following is the description of the communication protocol for a read (write) request:

1. The user HDL module asks to read (write) data: it issues a request for reading (writing) the memory.
2. The platform accepts the read (write) request and, if the data are available in the local memory, it generates an acknowledgement signal to the user HDL module. Otherwise, the Virtual Memory Extension copies the requested data from the host memory into the local memory and then generates the acknowledgement.
3. Once the user HDL module receives the acknowledgement signal, it asks for reading (writing) some data directly from (to) the memory space. This request is performed by asserting a "memory read" ("output valid") signal together with setting up some other parameter signals (the identification number of the HDL module used, the virtual address, and how much data must be read (written)).
4. The platform accepts those signals and reads (writes) data from (to) the memory space. When the platform finishes each read (write), it asserts "input valid" ("output valid") and the data are ready at the input of the user HDL module (platform).
5. The user HDL module receives (sends) the data from (to) the interface.
6. When finished, the user HDL module asserts a request to release the read (write) operation.
7. The platform generates an acknowledgement signal to release the read (write) operation.
In the virtual mode, the read and write addresses contain the addresses of the data in the unified virtual memory space: it is as if the HDL modules saw the host memory directly.
5 Profiling Tools: Testing and Optimizing Data Transfers
Optimization of data transfers is a very important issue, particularly for data-dominated systems such as multimedia and video processing. Minimizing data transfers is also important for achieving the low-power implementations that are fundamental for mobile communication terminals. Data transfers contribute to the power dissipation and need to be optimized to achieve low-power designs. The profiling tools supported by the platform give the programmer feedback on the data requested by the HDL module.
Fig. 4. The optimization methodology: (1) conformance test (virtual mode, no profiling), (2) global optimization (virtual mode, with profiling), (3) final optimization and validation (explicit mode, original Virtual Socket with cache memory)
Figure 4 shows the methodology to obtain a hardware function (HDL module) optimized with respect to data transfers. The first step concerns the validation of the design. Using the Virtual Memory Extension, the equivalence of the C and HDL functions is verified. The virtual memory feature allows the designer/programmer to focus exclusively on the conformance checking of the HDL module and to ignore memory management during this phase. The second phase consists in understanding and obtaining a global overview of the data transfers between the platform and the HDL module. The way the data are accessed, and their possible re-organization, can be a source of optimization. Once the data required by the HDL module are profiled, the designer/programmer enters the last phase, in which the data transfers between the HDL module and the cache memory are optimized.
6 Conclusion
This paper describes the implementation of a platform capable of supporting the execution of algorithms described in mixed SW and HW form. The platform provides a seamless environment for migrating sections of the SW into HDL modules, including a "virtual memory space" common to the SW sections and the HW modules. On the one hand, conformance of the HDL modules with the reference SW is guaranteed at any stage of the design; on the other hand, the programmer/designer can focus on one aspect of the design at a time. Design efforts can first be focused on the module functionality without worrying about data transfers; the profiled data transfers can then be used to design appropriate memory architectures, or for any other optimization that matches the specific criteria of the design.
References
1. Annapolis Micro Systems: WILDCARD-II Reference Manual, 12968-000 Revision 2.6 (January 2004)
2. Koch, A.: A comprehensive prototyping-platform for hardware-software codesign. In: RSP 2000. Proceedings of the 11th International Workshop on Rapid System Prototyping, 21-23 June 2000, pp. 78–82 (2000)
3. Edwards, M., Fozard, B.: Rapid prototyping of mixed hardware and software systems. In: Proceedings of the Euromicro Symposium on Digital System Design, 4-6 September 2002, pp. 118–125 (2002)
4. Pradeep, R., Vinay, S., Burman, S., Kamakoti, V.: FPGA based agile algorithm-on-demand coprocessor. In: Design, Automation and Test in Europe 2005 (March 2005)
5. Plessl, C., Platzner, M.: TKDM - a reconfigurable co-processor in a PC's memory slot. In: Proceedings of the 2003 IEEE International Conference on Field-Programmable Technology (FPT), 15-17 December 2003, pp. 252–259. IEEE Computer Society Press, Los Alamitos (2003)
6. Sukhsawas, S., Benkrid, K., Crookes, D.: A reconfigurable high level FPGA-based coprocessor. In: 2003 IEEE International Workshop on Computer Architectures for Machine Perception, 12-16 May 2003, p. 4 (2003)
7. Schumacher, P., Mattavelli, M., Chirila-Rus, A., Turney, R.: A Virtual Socket Framework for Rapid Emulation of Video and Multimedia Designs. In: Proceedings of the IEEE International Conference on Multimedia and Expo (ICME 2005), 6-8 July 2005, pp. 872–875 (2005)
8. Vuletic, M., Pozzi, L., Ienne, P.: Virtual memory window for application-specific reconfigurable coprocessors. In: Proceedings of the 41st Design Automation Conference, June 2004, San Diego, Calif. (2004)
9. Amer, I., Rahman, C.A., Mohamed, T., Sayed, M., Badawy, W.: A hardware-accelerated framework with IP-blocks for application in MPEG-4. In: Proceedings of the Fifth International Workshop on System-on-Chip for Real-Time Applications, 20-24 July 2005, pp. 211–214 (2005)
10. Mohamed, T.S., Badawy, W.: Integrated hardware-software platform for image processing applications. In: Proceedings of the 4th IEEE International Workshop on System-on-Chip for Real-Time Applications. IEEE Computer Society Press, Los Alamitos (2004)
Fast Calculation of Permissible Slowdown Factors for Hard Real-Time Systems

Henrik Lipskoch¹, Karsten Albers², and Frank Slomka²

¹ Carl von Ossietzky Universität Oldenburg
[email protected]
² Universität Ulm
{karsten.albers,frank.slomka}@uni-ulm.de
Abstract. This work deals with the problem of optimising the energy consumption of an embedded system. On the system level, tasks are assumed to have a certain CPU usage they need for completion. While respecting their deadlines, slowing down the task system reduces the energy consumption. For periodically occurring tasks several works exist, but when jitter is taken into account these approaches do not suffice. The event stream model can handle this at an abstract level, and the goal of this work is to formulate and solve the optimisation problem using the event stream model. To reduce the complexity we introduce an approximation to the problem that allows a precision/performance trade-off.
1 Introduction
Reducing the energy consumption of an embedded system can be done by shutting down (zero voltage), freezing (zero frequency, e.g., clock gating) or stepping the circuits with a slower clock and lower voltage (Dynamic Voltage Scaling or Adaptive Body Biasing). On the system level, tasks, i.e., programmes or parts of programmes, are assigned to a processing unit. Here we are interested in tasks having a deadline not to miss and some form of repeated occurrence, that is, tasks executed repeatedly as long as the embedded system is up and running. The mentioned possibilities to reduce the overall energy consumption result in a delay or slowdown from the point of view of a program running on the system. The slowdown has a lower bound for tasks with deadlines to meet, with the side effect that the available processing time for other tasks running on the same processor is reduced. Thus any energy-reduction technique which influences the processing speed has to take these limits into account. The problem we focus on here is to minimise the system's total power consumption for a task set, where each task may have its own trigger and its own relative deadline, using static slowdown to guarantee hard real-time feasibility. In the next section we describe work on similar problems. In the model section we specify our assumptions and describe the event stream model along
This work is supported by the German Research Foundation (DFG), grant GRK 1076/1 and SL 47/3-1.
N. Azemard and L. Svensson (Eds.): PATMOS 2007, LNCS 4644, pp. 495–504, 2007. © Springer-Verlag Berlin Heidelberg 2007
with the demand bound function. We then reduce the number of test points and show how this is incorporated into a linear programme that calculates slowdown factors while guaranteeing hard real-time feasibility. Before concluding the paper, we present an example and demonstrate the advantages gained with the developed theory.
2 Related Work
There are several works regarding the energy optimisation of hard real-time systems. Here we are interested in optimising without granting the circuits any loss of precision and without missing any deadlines, and we want a truly optimal solution if possible; we therefore focus on linear programming solutions. Ishihara and Yasuura [1] incorporate in their integer linear program different discrete voltage/frequency pairs and an overall deadline for their task set to meet, but no intertask dependency and only dynamic power consumption. They consider mode-switching overhead negligible and assume for every task a number of needed CPU cycles, which in worst-case analysis corresponds to the worst-case execution time. Andrei et al. [2] formulate four different problems, each regarding intertask dependency, a deadline per task, and, per task, a number of clock cycles to complete it, which can again be considered as worst-case time. The four problems vary in whether they regard mode-switching overhead and whether they use integer linear programming for the number of clock cycles. The task set is considered non-preemptive. The authors prove the non-polynomial complexity of the integer linear problem with overheads. Hua and Qu [3] take a somewhat different approach. They look for the number and the values of voltages yielding the best solution with dynamic voltage scaling. However, their problem formulation only explores the slack between the execution time and the relative deadline of a task, under the assumption that the task system is schedulable even if every task uses its complete deadline as execution time. In hard real-time analysis this assumption does not hold; for example, if each task has its own period, a common choice is to set its deadline equal to its period. Rong and Pedram [4] formulate their problem with intertask dependencies, leakage and dynamic power consumption, different CPU execution modes, and external device usage, each device with different execution modes.
They state that mode-switching overhead on the CPU is negligible, especially when "normal" devices (e.g., hard disks) are involved in the calculation. They also state the non-polynomial complexity of the mixed integer linear program. Their task graph is assumed to be triggered periodically with a period they use as an overall deadline for the task set. The tasks are considered non-preemptive, and the tasks themselves do not have deadlines of their own. In [5] Jejurikar and Gupta present a work on calculating slowdown factors for the energy optimisation of earliest-deadline-first scheduled embedded systems with periodic task sets. The work does not consider static power consumption and therefore does not take switching off the CPU into account.
They first present a method to calculate a slowdown factor for a whole task set using state-of-the-art real-time feasibility tests. Then they develop a method with ellipsoids to obtain a convex optimisation problem. This test incorporates the feasibility test of Baruah [6], which turned out to be computationally very intensive, because it requires testing all intervals up to the hyper period of the task system in the case of a periodic task system, and even larger bounds in other cases (again, see [6]). To face the problem of computational intensity, Albers and Slomka developed another test [7] with a fast approximation [8], offering a performance/precision trade-off, based on the event stream methodology of Gresser [9]. In this work, we show how this approximation applies to the problem of calculating slowdown factors.
3 Model
We do not want to limit ourselves to periodically triggered task systems. Therefore, we assume that a task can be triggered not only periodically, but also periodically with a jitter, sporadically with a minimal distance between two consecutive triggers, or in other forms. We assume the triggers of a task to be given in the form of an event stream, see below. For the scheduling algorithm, we assume preemptive earliest-deadline-first scheduling. When we speak of a task system we mean tasks sharing one single processor and competing for the available processing time. Additionally, we assume the task system to be synchronous, i.e., all tasks can be triggered at the same point in time. An invocation of a task is called a job, and the point in time when a task is triggered an event. Each task of our task systems is assumed to have a relative deadline d, measured from the time when the task is triggered, a worst-case execution time c, and an event stream denoting the worst-case trigger of that task. The latter is described in the following subsection.

3.1 Event Streams
Motivated by the idea of determining bottlenecks in the available processing time, Gresser [9] introduced the event stream model, which is used in [7] to provide a fast real-time analysis for embedded systems. We understand a bottleneck as the shortest time span in which the largest amount of processing time is needed, i.e., a time span with the highest density of needed processing time. The event stream model captures this: time spans with maximal density are ordered by their length (maximal among time spans having the same length: go through all intervals on the time axis having the same length and take the one with the least available processing time within). This is achieved by calculating the minimal time span for each number of task triggers. Definition 1. Let τ be a task. An event stream E(τ) is a sequence of real numbers a1, a2, . . . , where for each i ∈ N, ai denotes the length of the shortest interval
in time in which i events of type τ can happen. (See [7] for a more detailed definition.) Event streams are sub-additive in the sense that ai + aj ≤ ai+j, for all i, j ∈ N. Albers and Slomka explain how to gather this information for periodically, periodically with jitter, and sporadically triggered tasks [7].

Example 1. Consider the following three tasks.
1. Let τ1 be triggered with a period of 100 ms. Then the shortest time span to trigger two tasks is 100 ms; for three it is 200 ms, and so on. The resulting event stream is E(τ1): a1 = 0 s, an = (n − 1) · 100 ms.
2. Let τ2 be triggered sporadically with a minimal distance of 150 ms between two events. Then the shortest time span to trigger two tasks is 150 ms, and for one task more it is 300 ms. The resulting event stream is E(τ2): a1 = 0 s, an = (n − 1) · 150 ms.
3. Let τ3 be triggered periodically every 60 ms, but up to 5 ms before or after its period. The shortest time span to trigger two tasks is then 50 ms, which corresponds to one trigger 5 ms after one period and the next trigger 5 ms before the next period. The earliest following trigger cannot occur less than 60 ms later, i.e., 5 ms before the over-next period, which corresponds to a time length of 110 ms to trigger three tasks. Following this argumentation, the resulting event stream is E(τ3): a1 = 0 s, a2 = 50 ms, an = 50 ms + (n − 2) · 60 ms.

3.2 Demand Bound
To guarantee the deadline of a task one has to consider the current workload of the resource the task runs on. The demand bound function (see [10] and [7]) provides a way to describe this, and the NP-hard feasibility test using this function can be approximated in polynomial time [7]. For the workload we calculate the maximal amount of processing time needed within an interval of length Δt. Allowing the simultaneous triggering of different tasks, which was our assumption, leads to synchronising the event streams, i.e., to assuming that all the intervals out of which we obtained the time lengths for our event streams have a common start. Thus, the sum of the worst-case execution times of all events in all event streams happening during a time of length Δt and having their deadline within that time gives us an upper bound on the execution demand for any interval of length Δt. Note that we only have to process jobs with deadlines within this time span. Formulated with the notation of definition 1, the demand bound function turns out as follows. Definition 2. The demand bound function denotes, for every time span, an upper bound on the workload on a resource to be finished within that time span. (See for ex. [10]) Lemma 1. Let τ1, . . . , τn be tasks running on the same resource, each with worst-case execution time ci and relative deadline di, i = 1, . . . , n. And let
E1, . . . , En be their event streams. Define a0 := −∞. Under the assumption that all tasks can be triggered at the same time, the demand bound function can be written as

    D(Δt) = Σ_{i=1}^{n} max{j ∈ N0 : aj ∈ Ei ∪ {a0}, aj ≤ Δt − di} · ci
(see [7]).

Example 2. Consider the task set of example 1. Let the deadlines be 30 ms for the first, 20 ms for the second, and 10 ms for the third task; let the worst-case execution times be 25 ms, 15 ms, and 5 ms, respectively. From these properties we obtain the demand bound function

    D(Δt) = ⌊(Δt + 70 ms)/100 ms⌋ · 25 ms + ⌊(Δt + 130 ms)/150 ms⌋ · 15 ms
            + max{0, (Δt − 10 ms)/|Δt − 10 ms| + ⌊Δt/60 ms⌋} · 5 ms.
The next step is to match the needed processing time against the available processing time; this is the feasibility test of the real-time system. Since one can process exactly t seconds of processing time within an interval of t seconds, provided every consumer of processing time is modelled within the task set, the feasibility test consists in proving

    D(Δt) ≤ Δt    ∀Δt > 0.    (1)
Lemma 2. Let τ1, . . . , τn be tasks and E(τ1), . . . , E(τn) their corresponding event streams. A sufficient set of test points for the demand bound function is E := ∪_{i=1}^{n} {ai + di : ai ∈ E(τi)}.

Proof. The demand bound function remains constant between two points e1, e2 ∈ E.

The values of E can be bounded above and the remaining set will still be a sufficient test set. If the event streams contain only periodic behaviour, it is feasible to use their hyper period; since this is defined as the least common multiple of all involved periods, it grows as the prime numbers contained in the periods grow (cf. p1 = 31 · 2 = 62, p2 = 87: H = p1 · p2 = 5394, whereas p1 = 60 and p2 = 90 yield H = 180). Another test bound exists [6] covering also non-periodic behaviour. It depends on the utilisation U: Δtmax = U/(1 − U) · max{Ti − di}. It cannot be used here, because in slowing down the system we increase its utilisation (more processing time due to less speed), and thus formulas similar to the one mentioned result in infinite test bounds (which is the reason that such formulas are only valid for utilisations strictly less than 1). Instead of using such test bounds, we improve the model in another way.
Definition 3. A bounded event stream with slope s from k on is an event stream E with the property

    ∀ i ≥ k, ai ∈ E :  1/(ai+1 − ai) ≤ 1/s.    (2)
The index k is called the approximation's start-index. Because of the sub-additivity, the pair (s, k) = (a2, 2) always forms a bounded event stream; this is used in [8]. By changing the index, however, we may change the precision, as the following example shows.

Example 3. Let there be a jittering task with period 100 ms and a jitter of 5 ms. Approximating with s = a2 = 90 ms will have a significant error. Starting the approximation at index 2 with s = a3 − a2 = 100 ms will end up in no error at all!

We summarise this information more formally in the following lemma.

Lemma 3. Let task τ have a bounded event stream E with slope s from k. Then an upper bound on its demand is

    Dτ(Δt) = c · max{j : aj ∈ E, aj + d ≤ Δt}      if Δt < ak + d,
    Dτ(Δt) = c · (k − 1 + (Δt − ak − d)/s)         if Δt ≥ ak + d.    (3)

Note that the growth of the function has its maximum between 0 and ak + d, because for values greater than ak + d the function grows with Δt/s, which must be less than or equal to the maximal growth according to the sub-additivity of the underlying event stream. The definition reduces our set of test-points depending on the wanted precision.

Theorem 1. Let τ1, . . . , τn be tasks, c1, . . . , cn their worst-case execution times, and let E1, . . . , En be their bounded event streams with slopes s1, . . . , sn and approximation start-indices k1, . . . , kn. Define Ẽ := ∪_{i=1}^{n} {ai,j + di : ai,j ∈ Ei, ai,j ≤ ai,ki−1}. A sufficient feasibility test is then

    ∀Δt ∈ Ẽ : D(Δt) ≤ Δt    (4)

and

    Σ_{i=1}^{n} ci/si ≤ 1.    (5)
Proof. In Lemma 2 we stated that the demand bound function is constant between the test-points of Ẽ. For values greater than A := max{a ∈ Ẽ} we approximate the demand bound function by a sum of straight lines, one for each task (cf. lemma 3):

    D(Δt) ≤ D′(Δt) := Σ_{i=1}^{n} gi(Δt)    ∀Δt > A,

with

    gi(Δt) = ci · (ki − 1 + (Δt − ai,ki − di)/si).

The growth of D′ has its maximum between 0 and A, because this is true for each element of the sum (compare the note in lemma 3). If the function D′ is below the straight line h(x) := x for values between 0 and A, then it will cut h for values greater than A if and only if its derivative there is greater than 1. That results in proving:

    1 ≥ ∂/∂Δt (D′(Δt))
      = Σ_{i=1}^{n} ∂/∂Δt gi(Δt)
      = Σ_{i=1}^{n} ∂/∂Δt [ci · (ki − 1 + (Δt − ai,ki − di)/si)]    (Δt > A)
      = Σ_{i=1}^{n} ci/si.
If a task system's demand bound function allows some "slack", that is, it does not use the full available processing time, we are interested in what happens to the calculation if we introduce another task into the system. The argumentation is clear: it has to fit into the remaining available processing time. To be more formal we state the following lemma, which basically expresses that we do not have to recalculate the demand bound as a whole, but only at the new test-points.

Lemma 4. Let Γ be a real-time feasible task system and let D be its demand bound function in the notation of the theorem. Let τ be a task with event stream E, deadline d and worst-case execution time c. Then the task system Γ ∪ {τ} is real-time feasible if

    D(Δt) + max{j ∈ N : aj ∈ E, aj + d ≤ Δt} · c ≤ Δt    ∀Δt ∈ {ai + d : ai ∈ E}.    (6)

(cf. [7])

Proof. Since the function D will by prerequisite not exceed the line h(x) = x, a violation can only occur at points where the task τ needs to be finished.

Clearly, if the introduced task has a bounded event stream with some slope s from some index k on, the set of test-points reduces to those induced by indices less than k. We summarise the gained complexity reduction along with the accuracy of the test.

Lemma 5. The complexity of the test for a periodic-only task system is linear in the number of tasks; if for all tasks the deadlines are equal to the periods, no accuracy is lost. The complexity of the test for a periodic task system with m tasks having a jitter is linear in the number of tasks plus m.
Two reasons for a loss in accuracy exist: on the one hand, there is an error due to the assumption of synchronicity; on the other hand, there is an approximation error due to the linearisation.

3.3 Linear Programme
Our optimisation problem can now be formulated as a linear programme. Since our goal is to slow down the task system as much as is allowed, the corresponding formulation (in the notation of the theorem) of the objective is

    Maximize: Σ_{i=1}^{n} αi · ci / si,    (7)
where αi is the slowdown for task i. Note that a slowdown factor < 1 would mean speeding up the task, as this shortens its execution time; we therefore have to ensure the opposite:

    αi ≥ 1.    (8)
As stated in the theorem, the long-term utilisation must not exceed 1; this gives us the constraint

    Σ_{i=1}^{n} αi · ci / si ≤ 1.    (9)

Clearly, the optimum will never exceed 1. Let ai,j denote the j-th element in the event stream belonging to task i. The constraint limiting the demand is then

    ∀ Δt ∈ Ẽ : Σ_{i=1}^{n} max{j ∈ N : ai,j + di ≤ Δt} · αi · ci ≤ Δt.    (10)
The max-term in the equation is calculated beforehand, because it does not change during the optimisation; if its value reaches the start-index ki of some task, it is replaced by equation (3) given in lemma 3.
4 Experiments
As a first experiment we chose a rather simple example given in [11], describing seven periodic tasks on a Palm-pilot, with deadlines equal to periods (see table 1). It has a utilisation of 86.1%. Calculation with the unimproved test results in slowing down task 7 by 3.075, using 45 constraints. Exactly the same slowdown was calculated by our fast approach with only 7 constraints. For demonstration purposes, we chose a periodic task system with some tasks having a deviation in their period, whose maximal value is given as jitter in table 2. The example was taken from [12]. The task set has a utilisation of about
Table 1. Task set of the Palm-pilot

Task  Exec. Time [ms]  Period [ms]
 1          5              100
 2          7               40
 3         10              100
 4          6               30
 5          6               50
 6          3               20
 7         10              150

Table 2. Task set of processor one

Task  Exec. Time [μs]  Jitter [μs]  Deadline [μs]  Period [μs]
 1          150              0            800            800
 2         2277              0           5000         200000
 3          420           8890          15000         400000
 4          552          10685          20000          20000
 5          496           9885          20000          20000
 6         1423              0          12000          25000
 7         3096              0          50000          50000
 8         7880              0          59000          59000
 9         1996          15786          10000          50000
10         3220          34358          10000         100000
11         3220          55558          10000         100000
12          520              0          10000         200000
13         1120         107210          20000         200000
14          954         141521          20000        1000000
15         1124              0          20000         200000
16         3345              0          20000         200000
17         1990              0         100000        1000000
65.2%. We first applied the slowdown calculation without test-point reduction, testing up to the hyper-period of the task periods (59,000,000); this resulted in 78,562 constraints on the demand bound function. It yielded a slowdown of about 9.7 for task 9 and a utilisation of exactly 1. The improved linear program suggested by our theory gives exactly the same result with only 20 constraints on the demand bound function. All linear programs were written in the GNU MathProg modelling language and solved with glpsol, version 4.15 [13].
5 Conclusion and Future Work
We have shown a very fast and yet accurate method for calculating static slowdown factors while guaranteeing hard real-time feasibility. In contrast to other methods, it does not rely on periodic task behaviour, and its complexity does not increase
H. Lipskoch, K. Albers, and F. Slomka
when other forms of triggering, such as sporadic events with a minimal distance between two consecutive triggers or periodic events with a certain jitter, are part of the optimisation problem. In future work we want to embed criteria for modelling different system states, such as sleep states, and to investigate in which cases a common slowdown factor is sufficient.
References

1. Ishihara, T., Yasuura, H.: Voltage scheduling problem for dynamically variable voltage processors. In: Proceedings of the International Symposium on Low Power Electronics and Design, pp. 197–202 (1998)
2. Andrei, A., Schmitz, M., Eles, P., Peng, Z., Al-Hashimi, B.M.: Overhead conscious voltage selection for dynamic and leakage energy reduction of time-constrained systems. In: Proceedings of the Design Automation and Test in Europe Conference (2004)
3. Hua, S., Qu, G.: Voltage setup problem for embedded systems with multiple voltages. IEEE Transactions on Very Large Scale Integration (VLSI) Systems (2005)
4. Rong, P., Pedram, M.: Power-aware scheduling and dynamic voltage setting for tasks running on a hard real-time system. In: Proceedings of the Asia and South Pacific Design Automation Conference (2006)
5. Jejurikar, R., Gupta, R.: Optimized slowdown in real-time task systems. In: Proceedings of the 16th Euromicro Conference on Real-Time Systems, pp. 155–164 (2004)
6. Baruah, S.K., Rosier, L.E., Howell, R.R.: Algorithms and complexity concerning the preemptive scheduling of periodic, real-time tasks on one processor. Real-Time Systems 2(4), 301–324 (1990)
7. Albers, K., Slomka, F.: An event stream driven approximation for the analysis of real-time systems. In: Proceedings of the Euromicro Conference on Real-Time Systems, pp. 187–195 (2004)
8. Albers, K., Slomka, F.: Efficient feasibility analysis for real-time systems with EDF scheduling. In: Proceedings of the Design Automation and Test in Europe Conference (2005)
9. Gresser, K.: An event model for deadline verification of hard realtime systems. In: Proceedings of the Fifth Euromicro Workshop on Real Time Systems, pp. 118–123 (1993)
10. Baruah, S., Chen, D., Gorinsky, S., Mok, A.: Generalized multiframe tasks. Real-Time Systems 17(1), 5–22 (1999)
11. Lee, T.M., Henkel, J., Wolf, W.: Dynamic runtime re-scheduling allowing multiple implementations of a task for platform-based designs. In: Proceedings of the Design, Automation and Test in Europe Conference (2002)
12. Tindell, K., Clark, J.: Holistic schedulability analysis for distributed hard real-time systems. Microprocessing and Microprogramming – Euromicro Journal (Special Issue on Parallel Embedded Real-Time Systems) 40(2–3), 117–134 (1994)
13. Makhorin, A.: GLPK linear programming/MIP solver (2005), http://www.gnu.org/software/glpk/glpk.html
Design Methodology and Software Tool for Estimation of Multi-level Instruction Cache Memory Miss Rate

N. Kroupis and D. Soudris

VLSI Design Centre, Dept. of Electrical and Computer Eng., Democritus University of Thrace, 67100 Xanthi, Greece
{nkroup,dsoudris}@ee.duth.gr
Abstract. A typical design exploration process using simulation tools for various cache parameters is a rather time-consuming process, even for low-complexity applications. The main goal of the estimation methodology introduced in this paper is to provide fast and accurate estimates of the instruction cache miss rate of data-intensive applications implemented on a programmable embedded platform with a multi-level instruction cache memory hierarchy, during the early design phases. Information is extracted from both the high-level code description (C code) of the application and its corresponding assembly code, without carrying out any kind of simulation. The proposed methodology requires only a single execution of the application on a general-purpose processor and uses only the assembly code of the targeted embedded processor. In order to automate the estimation procedure, a novel software tool named m-FICA implements the proposed methodology. The miss rate of a two-level instruction cache can be estimated with high accuracy (>95%) compared with simulation-based results, while the required time cost is smaller by orders of magnitude than that of simulation-based approaches.
1 Introduction

Cache memories have become a major factor in bridging the gap between the relatively slow access time of main memory and the faster clock rate of today's processors. Nowadays, programmable systems usually contain two levels of cache in order to reduce the main memory transfer delay. Simulation of cache memories is common practice to determine the best cache configuration during the design of computer architectures. It has also been used to evaluate compiler optimizations with respect to cache performance. Unfortunately, cache analysis can significantly increase a program's execution time, often by two orders of magnitude. Thus, cache simulation has been limited to the analysis of programs with a small or moderate execution time, and it still requires considerable experimentation time before yielding results. In reality, programs often execute for a long time, and cache simulation simply becomes infeasible with conventional methods. The large overhead of cache simulation is imposed by the necessity of tracking the execution order of instructions. In [3], an automated method for adjusting a two-level cache memory hierarchy in order to reduce energy consumption in embedded applications is presented. The

N. Azemard and L. Svensson (Eds.): PATMOS 2007, LNCS 4644, pp. 505–515, 2007. © Springer-Verlag Berlin Heidelberg 2007
proposed heuristic, Two-level Cache Exploration Heuristic considering Cycles, consists of making a small search in the space of configurations of the two-level cache hierarchy, analyzing the impact of each parameter in terms of energy and number of cycles spent for a given application. Zhang and Vahid [1] introduce a cache architecture that can find the best set of cache configurations for a given application. Such an architecture would be very useful in prototyping platforms, eliminating the need for time-consuming simulations to find the best cache configurations. Gordon-Ross et al. [4] present an automated method for tuning two-level caches to embedded applications for reduced energy consumption. The method is applicable to both a simulation-based exploration environment and a hardware-based system prototyping environment. The heuristic interlaces the exploration of the two cache levels and searches the various cache parameters in a specific order based on their impact on energy. Givargis and Vahid introduce Platune [2], which is used to automatically explore the large configuration space of such an SOC platform. The power estimation techniques for processors, caches, memories, buses, and peripherals, combined with the design space exploration algorithm deployed by Platune, form a methodology for the design of tuning frameworks for parameterized SOC platforms in general. Additionally, a number of techniques have been presented whose main goal was the reduction of the simulation time cost [5], [6], [7], [8]. A technique called inline tracing can be used to generate the trace of addresses with much less overhead than trapping or simulation. Measurement instructions are inserted in the program to record the addresses that are referenced during the execution. Borg, Kessler, and Wall [5] modified some programs at link time to write addresses to a trace buffer, and these addresses were analyzed by a separate higher-priority process.
The time required to generate the trace of addresses was reduced by reserving five of the general-purpose registers, to avoid memory references in the trace generation code. Mueller and Whalley [6] provided a method for instruction cache analysis which outperforms the conventional trace-driven methods. This method, named static cache simulation, analyzes a program for a given cache configuration and determines, prior to execution time, whether an instruction reference will always result in a cache hit or miss. In order to use this technique, the designer must modify the compiler of the processor, which is usually not possible when commercial tools and compilers are used. A simulation-based methodology, focused on an approximate model of the cache and the multi-tasking reactive software, that allows one to trade off smoothly between accuracy and simulation speed, has been proposed by Lajolo et al. [7]. The methodology reduces the simulation time by taking the intra-task conflicts into account and considering only a finite number of previous task executions. This method achieved a speedup of about 12 times over the simulation process, with an error of 2% in cache miss rate estimation. Nohl et al. [8] presented a simulation-based technique which meets the requirements for both high simulation speed and maximum flexibility. This simulation technique, called just-in-time cache-compiled simulation, can be utilized for architecture design as well as for end-user software development. However, the simulation performance increases only by about 4 times, compared with the
trace-driven techniques, which is not sufficient for exploring various cache sizes and parameters. In this paper, a novel systematic methodology is introduced, aiming at the estimation of the optimal cache memory size, i.e. the one with the smallest cache miss rate, of a multi-level instruction cache hierarchy (Figure 1). High accuracy is achieved within an affordable estimation time cost. The high-level estimation is very useful for a fast exploration among many instruction cache configurations. The basic concept of the new methodology is the straightforward relationship of specific characteristics (e.g. the number of loop iterations) between the high-level application description code and its corresponding assembly code. Using the proposed methodology, a new tool has been developed, achieving orders-of-magnitude speedup in the miss rate estimation time compared to existing methods, with an estimation accuracy higher than 95%. We estimate the miss rate of a two-level instruction cache consisting of a first-level (L1) and a second-level (L2) cache memory. The proposed approach is fully software-supported by a CAD tool, named m-FICA, which automates the whole estimation procedure. In addition, the methodology can be applied to any processor without any compiler or application modification.
Fig. 1. The instruction memory hierarchy of the system, with L1 and L2 instruction cache off-chip
2 Proposed Estimation Methodology

In order to model the number of cache misses of a nested loop, analytical formulas have been proposed in [10]. Given a nested loop with N iterations and a total size of instructions in assembly code L_s, a cache memory with size C_s (in instructions), and a block size B_s (cache line length), the number of misses, Num_misses, can be calculated using the following formulas [10]:

Loop Type 1: if L_s ≤ C_s then:

    Num_misses = L_s / B_s    (1)

Loop Type 2: if C_s < L_s < 2 × C_s then:

    Num_misses = L_s / B_s + (N − 1) × 2 × (L_s mod C_s) / B_s    (2)

Loop Type 3: if 2 × C_s ≤ L_s then:

    Num_misses = N × L_s / B_s    (3)
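The three cases can be sketched directly in code (an illustrative Python sketch, not the m-FICA implementation; the miss rate of eqs. (4) and (5) below is included for completeness):

```python
# Sketch of the loop miss model: a loop of N iterations and L_s instructions,
# run against a cache of C_s instructions with lines of B_s instructions.

def loop_misses(N, L_s, C_s, B_s):
    if L_s <= C_s:                       # Type 1: loop fits in the cache,
        return L_s / B_s                 #   compulsory misses only, eq. (1)
    if L_s < 2 * C_s:                    # Type 2: partial self-eviction
        return L_s / B_s + (N - 1) * 2 * (L_s % C_s) / B_s   # eq. (2)
    return N * (L_s / B_s)               # Type 3: every pass misses, eq. (3)

def loop_miss_rate(N, L_s, C_s, B_s):
    num_references = (L_s / B_s) * N     # eq. (5)
    return loop_misses(N, L_s, C_s, B_s) / num_references    # eq. (4)
```

For example, a loop twice as large as the cache (Type 3) yields a miss rate of 1.0, while a loop that fits entirely (Type 1) misses only on the first pass.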
The miss rate is given by the formula:

    Miss_rate = Num_misses / Num_references    (4)

where Num_references is the number of references from the processor to memory, with

    Num_references = (L_s / B_s) × N    (5)

Depending on how the loop size maps to the cache size, the assumed loops are categorized into three types (Fig. 1). This example assumes that the instruction cache block size is equal to the instruction size. In such a systematic way, the number of misses can be calculated for every loop of an application. The proposed miss rate estimation methodology is based on the correlation between the high-level description code (e.g. C) of the application and its associated assembly code. Using the compiler of the chosen processor, we can create the assembly code of the application. The crucial point of the methodology is that the number of conditional branches in the C code and in its assembly code is equal. Thus, by executing the C code we can find the number of passes through every branch. These values carry over to the assembly code, and thus we can find how many times each assembly branch instruction is executed. By creating the Control Flow Graph (CFG) of the assembly code, the number of executions of all the application's assembly instructions can be calculated. The miss rate estimation is accomplished by the assembly code processing procedure and the data extracted from the application execution. Thus, the estimation time depends on the code (assembly and C) processing time and the application execution time on a general-purpose processor. The total estimation time cost is much smaller than that of trace-driven techniques. The proposed methodology has as input the high-level description, i.e., in C, of the application code, which includes:
(i) the conditional statements (if/else, case), (ii) the application function calls, and (iii) the nested loops (for/while), with the following assumptions: (a) the loops may be perfectly or non-perfectly nested, (b) the loop indices may exhibit constant or variable lower and upper bounds, constant or variable step, and interdependences between the loop indices (e.g. the affine formula ik = ik-1 + c), and (c) a loop can contain, among others, conditional statements and function calls.

The proposed methodology consists of three stages, shown in Figure 2. The first stage contains three steps: (i) determination of the branches in the C code of the application, (ii) insertion of counters into the branches of the C code, and (iii) execution of the C code. This stage aims at calculating the number of executions (passes) of all branches of the application's C code; thus, the number of executions of every leaf of the CFG is obtained by executing the application. By determining the branches of the high-level application code and executing the instrumented code, we find the number of executions within these branches. This stage is a platform-independent process and, thus, its results can be used on any programmable platform.
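The counter-insertion step can be illustrated with a toy instrumenter. This is a deliberate simplification, not the actual m-FICA stage: it assumes braced branch bodies, treats one branch keyword per source line, and the __branch_cnt counter array name is hypothetical:

```python
import re

# Toy sketch of stage 1, step (ii): append one counter increment after each
# line containing a branch construct of a C source. A real tool must respect
# C syntax and scoping; this regex-based version is for illustration only.
BRANCH = re.compile(r'\b(if|else|for|while|case)\b')

def instrument(c_lines):
    out, n = [], 0
    for line in c_lines:
        out.append(line)
        if BRANCH.search(line):
            out.append('  __branch_cnt[%d]++;' % n)  # hypothetical counter
            n += 1
    return out, n
```

Running the instrumented program then yields, in the counter array, the pass count of every branch, which stage 2 maps onto the assembly code's CFG.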
Design Methodology and Software Tool
509
The second stage estimates the number of executions of each assembly instruction and, eventually, the total number of executed instructions. It consists of: (i) determination of the assembly code branches, (ii) creation of the Control Flow Graph, (iii) assignment of the counter values to the CFG nodes, and (iv) computation of the execution cost of the remaining CFG nodes. The basic idea, not only of this stage but of the whole methodology, is the use of behavioral-level information (i.e. C code) at the processor level. The number of branches in the assembly code is equal to the number of branches in the C code, and it remains unchanged even if compiler optimization techniques are applied. In the case of unrolled loops, the methodology detects the repeated assembly instructions and thereby the location of the unrolled branch. Moreover, the methodology can handle optimized and compressed assembly code without any change or limitation. Concluding, the proposed methodology is compiler- and processor-
Fig. 2. The proposed methodology for estimating the miss rate of a two-level instruction memory cache hierarchy
independent and can be applied to any programmable processor. The derived assembly code can be grouped into blocks of code, called basic blocks, each of which is executed in a sequential fashion. A basic block can be defined as the group of instructions between two successive labels, between a conditional instruction and a label, or vice versa. The third stage of the methodology is platform-dependent and contains two steps: (i) creation of all the unique execution paths of each loop and (ii) computation of the number of instructions and iterations associated with each unique path. Exploring all the paths of the application's CFG, we determine the loops, their size (in number of instructions), and the number of executions of each loop. Furthermore, from the remaining conditional branches (if/else), we create all the unique execution paths inside every loop, together with the number of executions of each unique path. The methodology is able to handle applications that include perfectly or non-perfectly nested loops and any type of loop structure. Considering the target embedded processor, which is MIPS IV, we count the instructions of the assembly code. Using eqs. (1)–(5) and the unique execution paths of each loop, the number of instruction cache misses and the cache miss rate can be estimated. These equations can be applied to every cache level in a similar way, estimating the cache misses at each level of the instruction cache hierarchy.
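The basic-block grouping just described can be sketched as follows; the MIPS mnemonic subset and the listing format are assumptions for illustration, not taken from the paper's tool:

```python
# Sketch: split an assembly listing into basic blocks. A block starts at a
# label and ends at the next label or after a branch/jump instruction.
BRANCHES = {'beq', 'bne', 'blez', 'bgtz', 'j', 'jal', 'jr'}  # assumed subset

def basic_blocks(asm_lines):
    blocks, current = [], []
    for raw in asm_lines:
        line = raw.strip()
        if not line:
            continue
        if line.endswith(':'):                # a label opens a new block
            if current:
                blocks.append(current)
            current = [line]
        else:
            current.append(line)
            if line.split()[0] in BRANCHES:   # a branch closes the block
                blocks.append(current)
                current = []
    if current:
        blocks.append(current)
    return blocks
```

Each resulting block executes sequentially, so one counter value per block suffices to obtain the execution count of every instruction inside it.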
3 Comparison Results

In order to evaluate the proposed estimation technique, we compare the results obtained with the developed tool against simulation-based measurements. We considered as implementation platform the 64-bit processor core MIPS IV, while the measurements were taken with the SimpleScalar tool [11], an accurate instruction set simulator of the MIPS processor. SimpleScalar includes an instruction set simulator, a fast instruction simulator and a cache simulator, and it can simulate architectures with instruction, data and mixed instruction-data caches with one or two memory hierarchy levels. In order to evaluate the proposed methodology, a set of benchmarks from various signal processing applications, such as MPEG-4, JPEG, filtering and H.263, is used. In particular, we use five Motion Estimation algorithms: (i) Full Search (FS) [12], (ii) Hierarchical Search (HS) [13], (iii) Three-Step Logarithmic Search (3SLOG) [12], (iv) Parallel Hierarchical One-Dimensional Search (PHODS) [12] and (v) Spiral Search (SS) [14]. It has been noted that their complexity ranges from 60 to 80% of the total complexity of video encoding (MPEG-4) [12]. Also, we have used the 1-D Wavelet transformation [15], the Cavity Detector [16] and the Self-Organized Feature Map Color Quantization (CQ) [17]. More specifically, the C code size of the algorithms ranges from 2 Kbytes to 22 Kbytes, while the corresponding MIPS assembly code size ranges from 9 Kbytes to 46 Kbytes. We assumed an L1 instruction cache with sizes ranging from 64 bytes to 1024 bytes, a block size of 8 bytes and a direct-mapped architecture, and an L2 instruction cache with sizes varying between 128 bytes and 4 Kbytes. We performed both simulation and estimation computations in terms of the miss rate of the L1 and L2 instruction caches. Moreover, we computed the actual time cost for running the
Design Methodology and Software Tool
511
simulation and the estimation-based approaches, as well as the average accuracy of the proposed methodology. Every cache level has its own local miss rate, which is the number of misses in this cache divided by the total number of memory accesses to this cache. The average miss rate is the number of misses in this cache divided by the total number of memory accesses generated by the processor. For example, with two levels of cache memory, the average miss rate is given by the product of the two local miss rates (Miss Rate_L1 × Miss Rate_L2). The average miss rate is what matters to overall performance, while the local miss rate is a factor in evaluating the effectiveness of each cache level. The accuracy of the proposed estimation technique is given by the average estimation error. Tables 1–8 present the average percentage error of the proposed methodology compared to the simulation results obtained with the SimpleScalar tool, considering the above-mentioned eight DSP applications. The last row of each table provides the average estimation error of the miss rate of a two-level instruction cache memory hierarchy for each application. We chose to present results only for a two-level cache hierarchy for lack of space. Also, to limit the number of tables, we present only the miss rate of an L2 cache whose size is four times greater than that of L1. Depending on the application, the corresponding average estimation error ranges from 1% to 12%, while the total average estimation error of the proposed approach is less than 4% (3.77%). The latter value implies that the proposed methodology exhibits high accuracy.

Table 1. Comparison between the estimation and the simulation results of the miss rate in L1 and L2 caches when the size of L2 is 4 times greater than L1 cache, for the Full Search application
L1 cache size (bytes)               64     128     256     512    1024   Error
L1 miss rate       Simplescalar   100.0   100.0    99.8    99.2    76.8
                   m-FICA         100.0   100.0    99.9    99.6    71.9   1.10 %
L2 cache size (bytes)              256     512    1024    2048    4096
L2 miss rate       Simplescalar    99.8    99.2    77.0     0.1     0.0
                   m-FICA          99.9    99.6    72.0     0.2     0.2   1.16 %
Average miss rate  Simplescalar    99.8    99.2    76.8     0.1     0.0
                   m-FICA          99.9    99.6    71.9     0.2     0.1   1.13 %
Table 2. Comparison between the estimation and the simulation results of the miss rate in L1 and L2 caches when the size of L2 is 4 times greater than L1 cache for Hierarchical Search application
L1 cache size (bytes)               64     128     256     512    1024   Error
L1 miss rate       Simplescalar    99.9    97.3    92.6    66.4     2.8
                   m-FICA         100.0    96.0    87.5    60.8     3.4   2.5 %
L2 cache size (bytes)              256     512    1024    2048    4096
L2 miss rate       Simplescalar    92.7    68.2     3.0     2.3    53.3
                   m-FICA          87.5    63.3     3.9     5.3    15.9   10.2 %
Average miss rate  Simplescalar    92.5    66.4     2.8     1.6     1.5
                   m-FICA          87.5    60.8     3.4     3.2     0.5   2.8 %
Table 3. Comparison between the estimation and the simulation results of the miss rate in L1 and L2 caches when the size of L2 is 4 times greater than L1 cache, for the 3SLOG application

L1 cache size (bytes)               64     128     256     512    1024   Error
L1 miss rate       Simplescalar   100.0    99.7    93.1    15.9     1.9
                   m-FICA         100.0    99.4    96.9     7.4     0.9   2.7 %
L2 cache size (bytes)              256     512    1024    2048    4096
L2 miss rate       Simplescalar    93.1    15.9     2.0    11.0     0.4
                   m-FICA          96.9     7.4     1.0     7.2     2.2   3.8 %
Average miss rate  Simplescalar    93.1    15.9     1.9     1.7     0.0
                   m-FICA          96.9     7.4     0.9     0.5     0.0   2.9 %
Table 4. Comparison between the estimation and the simulation results of the miss rate in L1 and L2 caches when the size of L2 is 4 times greater than L1 cache, for the PHODS application

L1 cache size (bytes)               64     128     256     512    1024   Error
L1 miss rate       Simplescalar   100.0   100.0    99.6    96.7    31.7
                   m-FICA         100.0   100.0    98.8    96.1    22.7   2.1 %
L2 cache size (bytes)              256     512    1024    2048    4096
L2 miss rate       Simplescalar    99.6    96.8    31.8     0.8     0.7
                   m-FICA          98.8    96.1    23.0     1.0     4.2   2.8 %
Average miss rate  Simplescalar    99.6    96.7    31.7     0.8     0.2
                   m-FICA          98.8    96.1    22.7     1.0     1.0   2.3 %
Table 5. Comparison between the estimation and the simulation results of the miss rate in L1 and L2 caches when the size of L2 is 4 times greater than L1 cache, for the SS application

L1 cache size (bytes)               64     128     256     512    1024   Error
L1 miss rate       Simplescalar    99.9    99.9    98.8    79.9     0.5
                   m-FICA         100.0    99.2    98.4    75.0     0.0   1.3 %
L2 cache size (bytes)              256     512    1024    2048    4096
L2 miss rate       Simplescalar    99.0    80.0     0.5     0.1     0.1
                   m-FICA          98.4    75.6     0.0     0.0     9.4   2.9 %
Average miss rate  Simplescalar    98.8    79.9     0.5     0.1     0.0
                   m-FICA          98.4    75.0     0.0     0.0     0.0   1.2 %
Table 6. Comparison between the estimation and the simulation results of the miss rate in L1 and L2 caches when the size of L2 is 4 times greater than L1 cache, for the 1-D Wavelet application

L1 cache size (bytes)               64     128     256     512    1024   Error
L1 miss rate       Simplescalar    98.7    89.9    50.3     1.3     1.1
                   m-FICA          99.3    92.7    43.3     0.4     1.1   2.3 %
L2 cache size (bytes)              256     512    1024    2048    4096
L2 miss rate       Simplescalar    50.9     1.5     2.1     1.5     1.4
                   m-FICA          43.6     0.4     0.2     4.4    14.3   5.2 %
Average miss rate  Simplescalar    50.3     1.3     1.1     0.0     0.0
                   m-FICA          43.3     0.4     0.1     0.0     0.2   1.8 %
Table 7. Comparison between the estimation and the simulation results of the miss rate in L1 and L2 caches when the size of L2 is 4 times greater than L1 cache, for the Cavity Detector application

L1 cache size (bytes)               64     128     256     512    1024   Error
L1 miss rate       Simplescalar   100.0   100.0    94.3    61.4    16.9
                   m-FICA         100.0   100.0    94.6    45.7     0.8   6.4 %
L2 cache size (bytes)              256     512    1024    2048    4096
L2 miss rate       Simplescalar    94.3    61.4    17.9     0.5     0.5
                   m-FICA          94.6    45.7     0.8     0.0     0.0   6.8 %
Average miss rate  Simplescalar    94.3    61.4    16.9     0.3     0.1
                   m-FICA          94.6    45.7     0.8     0.0     0.0   6.5 %
Table 8. Comparison between the estimation and the simulation results of the miss rate in L1 and L2 caches when the size of L2 is 4 times greater than L1 cache, for the CQ application

L1 cache size (bytes)               64     128     256     512    1024   Error
L1 miss rate       Simplescalar   100.0    99.4    89.1    46.5     9.6
                   m-FICA         100.0    98.7    84.2     3.5     0.0   11.2 %
L2 cache size (bytes)              256     512    1024    2048    4096
L2 miss rate       Simplescalar    89.2    46.8    10.8     0.8     0.4
                   m-FICA          84.2     3.5     0.0     0.0   100.0   17.9 %
Average miss rate  Simplescalar    89.1    46.5     9.6     0.3     0.0
                   m-FICA          84.2     3.5     0.0     0.0     0.0   11.6 %
Apart from the accuracy of an estimation methodology (and tool), a second parameter crucial for its efficiency is the time required to obtain the accurate estimates. Table 9 provides the average time cost, in seconds, of the simulation and estimation procedures for all benchmarks. Assuming an architecture with two levels of instruction cache, L1 sizes from 64 bytes to 1024 bytes and L2 sizes from 128 bytes up to 4096 bytes, there are 20 different size combinations with L2 > L1. With cache block sizes for L1 and L2 ranging from 8 bytes to 32 bytes, there are 6 combinations with L1_block_size ≤ L2_block_size. Hence, to completely explore the two-level instruction cache architecture, 20 × 6 = 120 simulation runs are needed for every application. The estimation and simulation computations were performed on a personal computer with a Pentium IV processor at 2 GHz and 1 Gbyte of RAM. It can be inferred that the proposed methodology offers a huge speedup (orders of magnitude) compared with the simulation-based approach. Consequently, the new methodology/tool is suitable for performing estimations with very high accuracy in the early design phases of an application. The exploration time cost of the simulation-based approach is proportional to the size of the trace file of the application considered (of the order of GBytes). In contrast, the corresponding time cost of the proposed methodology is (almost) proportional to the size of the assembly code (of the order of KBytes). From Table 9, it can be seen that the larger the number of loop iterations in the C code (and, of course, in the assembly code), the larger the speedup factor of the new methodology. Regarding the proposed
approach, we achieved a time cost reduction between 40 and 70,000 times (i.e. up to four orders of magnitude), depending on the application characteristics. Thus, accurate estimation within an affordable time cost allows a designer to perform design exploration over a larger search space (i.e. exploration of additional design parameters). In addition, the increasing complexity of modern applications, for instance image/video frames with higher resolution, will render the usage of simulation tools impractical. Thus, for designing such complex systems the high-level estimation tool will be the only viable solution.

Table 9. Speedup comparison between the proposed methodology and the simulation time. Host machine: Intel Pentium IV CPU, 2 GHz.
                          FS      HS   3SLOG  PHODS     SS   Wavelet   Cavity      CQ
Simulation time (sec)  73200    1920    2760   3480  77520     4320   1081080  795240
Estimation time (sec)    4.8    27.3     7.2   9.45    7.2      105      15.3   27.45
Speedup               15,250      70     383    368  10,767      41    70,659  28,970
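The size of the explored configuration space quoted above can be reproduced in a few lines (sizes taken from the text):

```python
# Enumerate the explored two-level configurations: 20 cache size pairs
# (L2 > L1) times 6 block size pairs (L1 block <= L2 block) = 120 runs.
l1_sizes = [64, 128, 256, 512, 1024]
l2_sizes = [128, 256, 512, 1024, 2048, 4096]
size_pairs = [(a, b) for a in l1_sizes for b in l2_sizes if b > a]

block_sizes = [8, 16, 32]
block_pairs = [(a, b) for a in block_sizes for b in block_sizes if a <= b]

print(len(size_pairs), len(block_pairs), len(size_pairs) * len(block_pairs))
# prints: 20 6 120
```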
4 Conclusions

A novel methodology for estimating the cache misses of multi-level instruction caches realized on an embedded programmable platform was presented. The methodology is based on the straightforward relationship between the application's high-level description code and its corresponding assembly code. Having both types of code as inputs, we extract specific features. Using the proposed methodology, we can estimate critical application parameters during the early design phases, avoiding the time-consuming simulation-based approaches. The m-FICA tool implements the proposed methodology and is an accurate instruction cache miss rate estimator. The proposed methodology achieves estimations with a time cost that is orders of magnitude smaller than that of the simulation process.
Acknowledgments. This paper is part of the 03ED593 research project, implemented within the framework of the "Reinforcement Programme of Human Research Manpower" (PENED) and co-financed by National and Community Funds (75% from the E.U. European Social Fund and 25% from the Greek Ministry of Development, General Secretariat of Research and Technology).
References

[1] Zhang, D., Vahid, F.: Cache configuration exploration on prototyping platforms. In: 14th IEEE International Workshop on Rapid System Prototyping, June 2003, vol. 00, p. 164. IEEE, Los Alamitos (2003)
[2] Givargis, T., Vahid, F.: Platune: A tuning framework for system-on-a-chip platforms. IEEE Trans. Computer-Aided Design 21, 1–11 (2002)
[3] Silva-Filho, A.G., Cordeiro, F.R., Sant'Anna, R.E., Lima, M.E.: Heuristic for Two-Level Cache Hierarchy Exploration Considering Energy Consumption and Performance. In: Vounckx, J., Azemard, N., Maurine, P. (eds.) PATMOS 2006. LNCS, vol. 4148, pp. 75–83. Springer, Heidelberg (2006)
[4] Gordon-Ross, A., Vahid, F., Dutt, N.: Automatic Tuning of Two-Level Caches to Embedded Applications. In: DATE, pp. 208–213 (February 2004)
[5] Borg, A., Kessler, R., Wall, D.: Generation and analysis of very long address traces. In: International Symposium on Computer Architecture, May 1990, pp. 270–279 (1990)
[6] Mueller, F., Whalley, D.: Fast Instruction Cache Analysis via Static Cache Simulation. In: Proc. of the 28th Annual Simulation Symposium, pp. 105–114 (1995)
[7] Lajolo, M., Lavagno, L., Sangiovanni-Vincentelli, A.: Fast instruction cache simulation strategies in a hardware/software co-design environment. In: Proc. of the Asia and South Pacific Design Automation Conference, ASP-DAC 1999 (January 1999)
[8] Nohl, A., Braun, G., Schliebusch, O., Leupers, R., Meyr, H.: A Universal Technique for Fast and Flexible Instruction-Set Architecture Simulation. In: Proc. of the 39th Conference on Design Automation, DAC 2002, New Orleans, Louisiana, USA, pp. 22–27 (2002)
[9] Edler, J., Hill, M.D.: A cache simulator for memory reference traces, http://www.neci.nj.nec.com/homepages/edler/d4
[10] Liveris, N., Zervas, N., Soudris, D., Goutis, C.: A Code Transformation-Based Methodology for Improving I-Cache Performance of DSP Applications. In: Proc. of DATE 2002, Paris, pp. 977–984 (2002)
[11] Austin, T., Larson, E., Ernst, D.: SimpleScalar: An Infrastructure for Computer System Modeling. Computer 35(2), 59–67 (2002)
[12] Kuhn, P.: Algorithms, Complexity Analysis and VLSI Architectures for MPEG-4 Motion Estimation. Kluwer Academic Publishers, Boston (1999)
[13] Nam, K., Kim, J.-S., Park, R.-H., Shim, Y.S.: A fast hierarchical motion vector estimation algorithm using mean pyramid. IEEE Transactions on Circuits and Systems for Video Technology 5(4), 344–351 (1995)
[14] Cheung, C.-K., Po, L.-M.: Normalized Partial Distortion Search Algorithm for Block Motion Estimation. IEEE Transactions on Circuits and Systems for Video Technology 10(3), 417–422 (2000)
[15] Lafruit, G., Nachtergaele, L., Vahnhoof, B., Catthoor, F.: The Local Wavelet Transform: A Memory-Efficient, High-Speed Architecture Optimized to a Region-Oriented Zero-Tree Coder. Integrated Computer-Aided Engineering 7(2), 89–103 (2000)
[16] Danckaert, K., Catthoor, F., De Man, H.: Platform independent data transfer and storage exploration illustrated on a parallel cavity detection algorithm. In: ACM Conference on Parallel and Distributed Processing Techniques and Applications III, pp. 1669–1675 (1999)
[17] Dekker, A.H.: Kohonen neural networks for optimal colour quantization. Network: Computation in Neural Systems 5, 351–367 (1994)
A Statistical Model of Logic Gates for Monte Carlo Simulation Including On-Chip Variations

Francesco Centurelli, Luca Giancane, Mauro Olivieri, Giuseppe Scotti, and Alessandro Trifiletti

Dipartimento di Ingegneria Elettronica, Università di Roma "La Sapienza", Via Eudossiana 18, 00184 Roma, Italy
{centurelli,giancane,olivieri,scotti,trifiletti}@mail.die.uniroma1.it
Abstract. Process variations are becoming a paramount design problem in nano-scale VLSI. We present a framework for a statistical model of logic gates that describes both inter-die and intra-die variations of performance parameters such as propagation delay and leakage currents. This allows fast but accurate behavioral-level Monte-Carlo simulations, which could be useful for full-custom digital design optimization and yield prediction, and enables the development of a yield-aware digital design flow. The model can incorporate correlation between mismatch parameters and their dependence on distance and position, and can be extracted by fitting Monte-Carlo transistor-level simulations. An example implementation using the Verilog-A hardware description language in the Cadence environment is presented.
1 Introduction

Fluctuations in manufacturing process parameters can cause random deviations in device parameters, which in turn can significantly impact the yield and performance of both analog and digital circuits [1]. With each technology node, process variability has become more prominent, and it is a growing concern as circuit complexity increases and feature sizes continue to shrink [2]. Since integrated circuits have to be insensitive to such fluctuations to avoid parametric yield loss, appropriate analyses are needed in the design phase to verify that circuit performance under process variations remains inside the acceptability region of the performance space, and new yield-oriented design methodologies are required [3]. Process variations can be classified into two categories:

− inter-die variations (process variations), which affect all the devices on a chip in the same way and are due to variations in the manufacturing process;
− intra-die variations (mismatch variations), which correspond to variations of the parameters of devices on the same chip, due to spatial non-uniformities in the manufacturing process.

Traditionally, inter-die variations were largely dominant, so that intra-die variations could be safely neglected in the design phase. However, in modern sub-micron

N. Azemard and L. Svensson (Eds.): PATMOS 2007, LNCS 4644, pp. 516–525, 2007. © Springer-Verlag Berlin Heidelberg 2007
technologies, intra-die variations are growing rapidly and steadily, and can significantly affect the variability of performance parameters on a chip [4]. Intra-die variations are spatially correlated, and are affected by the circuit layout and the surrounding environment. The problem of variability has usually been handled by analyzing the circuit at multiple process corners; in particular, digital designers adopt worst-case corner analysis (WCCA), assuming that a circuit that performs adequately at the extremes will also perform properly at nominal conditions [5]. However, worst-case methods do not provide adequate information about the yield and robustness of the design [6], and do not take intra-die variations into account. Moreover, WCCA is a deterministic analysis, which makes it an inadequate approach: good accuracy would require a large number of process corners, forfeiting the computational efficiency that is the main advantage of corner analysis. These issues with WCCA have led to the development of statistical techniques, which rely on a more accurate representation of uncertainty and of its impact on circuit functionality and performance [7]. Recently, concerning timing analysis in digital design, Statistical Static Timing Analysis (SSTA) has been proposed as an alternative to traditional Static Timing Analysis [8]-[10]. SSTA computes the probability distribution of the circuit delay, given the probability distributions of the delays of the logic gates and taking into account their possible correlations: intra-die variations are thus also considered. With increasing clock frequencies and transistor scaling, both dynamic and static power dissipation have become a major source of concern in recent deep sub-micron technologies.
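As a toy illustration of what the statistical view buys over a single worst-case corner (this example and all its numbers are ours, not the paper's), the distribution of a path delay can be estimated by Monte-Carlo sampling with one inter-die offset shared by every gate and an independent intra-die offset per gate:

```python
import random
import statistics

def sample_path_delay(rng, n_gates=10, nominal=15.6e-12,
                      sigma_inter=0.8e-12, sigma_intra=0.6e-12):
    """One Monte-Carlo sample of a path delay: a single inter-die offset
    shared by all gates plus an independent intra-die offset per gate."""
    inter = rng.gauss(0.0, sigma_inter)      # same for the whole die
    return sum(nominal + inter + rng.gauss(0.0, sigma_intra)
               for _ in range(n_gates))

rng = random.Random(1)
delays = [sample_path_delay(rng) for _ in range(20000)]
mu, sd = statistics.mean(delays), statistics.stdev(delays)
```

Because the inter-die term is fully correlated along the path, its contribution to the path sigma grows linearly with the number of gates, while the uncorrelated mismatch contribution grows only with the square root; ignoring the correlation structure would badly misestimate the spread, which is exactly why SSTA has to track it.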
At the nanometer scale, leakage currents make up a significant portion of the total power consumption in high-performance digital circuits [11], and show large fluctuations both between different dies and between devices on the same chip. Digital IC design can no longer consider timing performance alone; a leakage-aware design methodology is needed that also addresses power dissipation. Some authors ([12]-[13]) have underlined the need for a statistical approach to the analysis of leakage current as a key area in future high-performance IC design. However, the standard digital design flow used nowadays takes into account neither the leakage power consumption nor its statistical variability, and post-silicon tuning techniques have been developed to tighten the distributions of maximum operating frequency and maximum power consumption [14]. This scenario has to change to cope with the issues of future deep sub-micron CMOS technology steps, to allow designs that provide high performance in terms of both speed and power consumption, with an acceptable yield. Yield prediction has to be integrated into CAD tools, and floorplanning and routing have to simultaneously optimize yield, performance, power, signal integrity and area [15]. As a first step in this direction, yield prediction based on Monte-Carlo (MC) simulations is sometimes used on the complete chip or on a specific part of it, such as the critical path. However, this approach requires a very large number of transistor-level simulations for good accuracy, making it highly demanding in terms of simulation time. In this paper we propose a statistical model of logic gates implemented in the Verilog-A language. The model is extracted by fitting transistor-level MC simulations,
and describes the gate performance figures of interest (delay, leakage current, etc.) as functions of a certain number of technology-related statistical variables. Both process and mismatch variations can be considered, yielding a model well suited to the needs of present-day and future technologies. An example implementation in the Cadence analog CAD environment is considered, since it provides a MC simulation engine that is not part of the standard digital design flow. This model can be used to perform gate-level yield prediction based on MC simulation, obtaining a marked decrease in simulation time with an accuracy comparable to transistor-level simulations, and it is a first step towards the development of a yield-aware design flow. Moreover, the analog extensions of the Verilog language can be used to accurately describe the transient behavior of the output signals, thus also allowing fast but accurate simulations in a mixed-signal environment. The paper is structured as follows: Section 2 describes the structure of the logic gate model and its implementation in the Cadence environment, and Section 3 presents the extraction procedure. A case study is shown in Section 4 to assess the validity of the model and to give an example application, and conclusions on possible developments of this work are drawn in Section 5.
2 Statistical Gate Model

A logic gate in a digital environment is described by a structural VHDL model, which contains information both on the function performed (RTL-level description) and on physical characteristics, such as propagation times and leakage currents, which can be functions of the input configuration. These characteristics, which we will refer to in the following as the gate figures of merit (FOMs), are described by a set of parameters: usually several gate libraries are defined, each using a different set of parameter values, related to different process corners and environmental conditions (temperature and supply voltage). To perform gate-level statistical analyses, such as SSTA, this corner-based approach, where a set of values is defined for each parameter and the 'corners' take the correlation between parameters into account, has to be replaced with a probabilistic approach, where the parameters are defined as stochastic variables and a single statistical library is used. The correlation between the stochastic variables should be maintained, so that the resulting model can be used for simultaneous optimization of all the FOMs and for yield prediction. To guarantee that the correlation is correctly taken into account, we propose a model structure where all the FOMs are described as functions of a set of stochastic variables that represent transistor-level parameters, such as the threshold voltage, the body effect parameter and the unit-area gate-source capacitance. To minimize the number of stochastic variables while maintaining good accuracy in the description of the FOM statistics, a Principal Component Analysis (PCA) may be needed to determine the optimum set of variables to be used.
The FOMs are also defined as functions of environmental variables (temperature, supply voltage) and circuit parameters such as transistor sizes and the fan-out, thus allowing the definition of a scalable model that could be used for optimization.
The digital design flows in use nowadays do not include a Monte-Carlo simulation engine; we therefore consider implementing the proposed model structure in the Cadence analog CAD environment. In this context, the analog extensions of hardware description languages, such as Verilog-A [16] or VHDL-AMS [17], can be used to also describe the transient behavior of the input and output signals, leading to more realistic time-domain simulations. This allows fast and accurate mixed-signal statistical simulations, where the digital blocks are described using this approach, whereas the analog part can be simulated at transistor level. An accurate characterization of transient behavior can also be useful in a digital environment, since it allows a more accurate estimation of propagation times [18] and power consumption, and makes it possible to cope with issues such as reduced noise margins and metastability, which are becoming more and more important in deep-submicron technologies [15]. In particular, the following analog functions can be useful:

- transition: to describe the transient behavior of the output variables (so that they can be interpreted as analog variables, i.e. voltages and currents);
- cross: to interpret the input variables as analog variables, and thus to consider their transient behavior and introduce the noise margins.

To implement the model in the Cadence environment, the logic gates are described by Verilog-A code where the stochastic variables are defined as parameters, and a single library file (*.scs) contains the statistical description of these parameters. Their average values are used for deterministic (typical) simulations, whereas the MC engine calculates the parameter values to be used for each MC iteration, starting from the description in the library file.
This approach makes it possible to describe both process and mismatch variations of the parameters through a suitable definition of the library file: the generic transistor-level parameter pi (e.g. the threshold voltage) can be defined as

pi = pio (1 + εi)    (1)

where pio is a stochastic variable that defines the process variation and is the same for all the gates on the same chip, and εi is another stochastic variable that describes the mismatch between different gates on the same chip. A mismatch variable is needed for each gate in the circuit to be simulated; therefore the library file has to contain a set of pi parameters for each gate, all sharing the same stochastic variables pio but with different mismatch coefficients εi. Mismatch correlation and dependence on position and distance can also be included in the model through a suitable definition of the mismatch parameters.
3 Model Extraction

A key issue in achieving good accuracy for the statistical model of the logic gate is the extraction procedure used to obtain the statistical description of the transistor-level parameters pi defined in the previous section. As a first step, appropriate equations have to be selected for the FOMs to be included in the model. These equations define the FOMs as functions of environmental variables, circuit variables and technological
parameters; some of the latter have to be chosen as the random variables pi. The statistical description of the random variables could be obtained directly from the statistical technology library, allowing the model to be used for technology comparison or for performance prediction for future technology nodes [19]. However, for a more accurate fit, an extraction procedure that obtains the statistical description of the parameters pi from transistor-level simulations is needed, since the model tries to describe a much more complex statistical variability with a limited set of stochastic variables. The extraction procedure starts from Monte-Carlo transistor-level simulations of the logic gate, performed considering process variations and mismatch variations separately, as is usually possible with the statistical libraries of recent CMOS technologies. From these simulations, a statistical characterization of the FOMs of interest, in terms of average value and variance under process or mismatch variations, is obtained. The average values of the stochastic variables pio can be obtained by fitting the average values of the FOMs with the appropriate equations (the mismatch parameters εi have zero mean value). The standard deviations of the stochastic variables can then be obtained by fitting the distributions of the FOMs obtained from transistor-level MC simulations with those obtained from MC simulations at the Verilog level; this step can be performed through an optimization procedure that computes the optimal values of the standard deviations of pio and εi to fit the standard deviations of the FOMs, using the appropriate equations relating the standard deviations and minimizing a suitable error function. The resulting model can be scaled as a function of circuit parameters such as MOS channel width and the fan-out of the logic gate, providing a useful tool for fast circuit simulation and optimization.
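The fitting step can be illustrated with a deliberately simplified example: a toy FOM (not the paper's delay equation) whose spread grows monotonically with the parameter spread, so a plain bisection recovers the parameter standard deviation that reproduces a target FOM standard deviation from transistor-level runs:

```python
import random
import statistics

def fom_std(sigma_vth, n_mc=4000, seed=7):
    """Std of a toy delay FOM  t ~ 1/(VDD - Vth)  under Vth variation.
    A fixed seed makes each evaluation deterministic and monotone in sigma."""
    rng = random.Random(seed)
    vdd, vth0 = 1.2, 0.30
    k = 15.0e-12 * (vdd - vth0)             # so t = 15 ps at nominal Vth
    samples = [k / (vdd - rng.gauss(vth0, sigma_vth)) for _ in range(n_mc)]
    return statistics.stdev(samples)

def fit_sigma(target_std, lo=0.0, hi=0.1, iters=40):
    """Bisect on sigma_vth until the behavioral-level FOM std matches."""
    for _ in range(iters):
        mid = 0.5 * (lo + hi)
        if fom_std(mid) < target_std:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)

# e.g. match a hypothetical 768 fs process std of the propagation delay
sigma_vth = fit_sigma(target_std=768e-15)
```

The paper's procedure handles several stochastic variables and several FOMs at once, which turns this one-dimensional search into a general error-function minimization, but the principle is the same.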
4 Case Study

To verify the feasibility of the model framework we have presented in a real analog CAD environment, we have developed a model for a 2-input NAND gate in a 90 nm CMOS technology. The model has been implemented in the Cadence Design Framework II environment, using the Verilog-A hardware description language and exploiting its analog extensions as described in Section 2. We have considered as FOMs the propagation delay with a single active input and the leakage current, described by the following equations:

- for the propagation delay [20]:

tp = (FO + α)·C·VDD·k / (2·Wnp·Cox·vsat·(VDD − Vth)) + C/DF    (2)
where VDD is the supply voltage, Vth is the MOS threshold voltage, Cox is the oxide capacitance, vsat is the carrier saturation velocity, Wnp is the channel width, averaged between the NMOS and PMOS transistors, C is the gate input capacitance, FO is the fan-out, α is the ratio between the output and the input capacitance of the logic gate, DF is the driving factor (the number of unity-width gates driving the gate under test) and k is a fitting parameter. The equation for the propagation delay
has been adapted empirically from [20]; the driving factor models the steepness of the input ramp (a higher driving factor corresponds to a steeper ramp).

- for the leakage current, the model in [21] has been modified as follows:
IL =  IL00 = Ion·exp(−Vth/(n·VT))·exp(−η·VDD/(n·VT))   for A = B = 0
      IL01 = Ionp·exp(−Vth/(n·VT))                     for A ≠ B
      IL11 = 2·IL01·(Iop/Ionp)                         for A = B = 1    (3)
where VT = kT/q, n is the ideality coefficient in weak inversion, η is the DIBL (drain-induced barrier lowering) coefficient, and the terms Ion, Iop and Ionp are given by

Ioj = μj·Cox·(Wj/L)·VT²·exp(1.8)    (4)
where μ is the mobility, L the channel length, W the channel width, and the subscript j can be n, p or np (average between the n and p values). Eqs. (3) have been obtained empirically to fit transistor-level simulations of the leakage current. We consider as stochastic variables pi the threshold voltage Vth, the gate input capacitance C and the DIBL coefficient η, using a uniform distribution for process variations and a Gaussian distribution for the mismatch coefficients εi, similarly to what is done in the statistical model of the transistor. The distribution of the leakage current is well matched by a lognormal probability distribution [22], due to the exponential relationship between the leakage and the underlying device model parameters; therefore the logarithm of IL has been considered for histogram matching. Using minimum-size NMOS transistors (L = 90 nm, Wn = 120 nm), Wp/Wn = 2.5 and a supply voltage of 1.2 V, we obtain from process-only and mismatch-only MC transistor-level simulations the statistical characterization summarized in Tab. 1, using 100 Monte-Carlo iterations, which proved sufficient to obtain stable results. It has to be noted that the values reported in Tab. 1 for the leakage current with different inputs (IL01) are averaged over the results for the 01 and 10 input configurations.

Table 1. Results from transistor-level Monte-Carlo simulations
FOM         Average            Std (process)   Std (mismatch)
tp          15.568 ps          768 fs          635 fs
log(IL00)   −9.31 (487 pA)     0.160           0.359
log(IL01)   −8.76 (1.739 nA)   0.255           0.409
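The equations above are straightforward to turn into executable form. The sketch below uses Python rather than Verilog-A; every parameter value is illustrative rather than extracted, and the final C/DF ramp term in the delay is one possible reading of the typeset equation:

```python
import math

K_B, Q = 1.380649e-23, 1.602176634e-19   # Boltzmann constant, electron charge

def t_p(C, FO, DF, Vdd=1.2, Vth=0.30, Wnp=210e-9, Cox=0.012,
        vsat=8.0e4, alpha=1.0, k=1.0):
    """Propagation delay in the style of Eq. (2): an alpha-power-law-based
    fraction plus an empirical input-ramp term that shrinks as the driving
    factor DF grows. All parameter values here are illustrative only."""
    drive = 2.0 * Wnp * Cox * vsat * (Vdd - Vth)
    return k * (FO + alpha) * C * Vdd / drive + C / DF

def nand2_leakage(A, B, Vdd=1.2, Vth=0.30, n=1.4, eta=0.1, T=300.0,
                  mu_n=0.04, mu_p=0.015, Cox=0.012,
                  Wn=120e-9, Wp=300e-9, L=90e-9):
    """NAND2 leakage per Eqs. (3)-(4) for input configuration (A, B)."""
    VT = K_B * T / Q                                  # thermal voltage kT/q
    def Io(mu, W):                                    # Eq. (4)
        return mu * Cox * (W / L) * VT**2 * math.exp(1.8)
    Ion, Iop = Io(mu_n, Wn), Io(mu_p, Wp)
    Ionp = Io(0.5 * (mu_n + mu_p), 0.5 * (Wn + Wp))   # n/p averages
    sub = math.exp(-Vth / (n * VT))                   # subthreshold factor
    if A == 0 and B == 0:
        return Ion * sub * math.exp(-eta * Vdd / (n * VT))  # DIBL factor
    if A != B:
        return Ionp * sub
    return 2.0 * Ionp * sub * (Iop / Ionp)            # A = B = 1
```

With C around a few femtofarads these made-up values give delays of roughly 10-20 ps and leakage around a nanoamp or below, the same orders of magnitude as Tab. 1, though no calibration against the paper's data is implied.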
Technological parameters such as the saturation velocity, the mobility and the oxide capacitance have been extracted from the technology library, and the mean values (Avg) of the stochastic parameters have been obtained by matching equations (2) – (3) with the results of transistor-level simulations in typical conditions. As a next step, the
standard deviations (Std) of the parameters, for process and mismatch variations, have been extracted by matching the results of Verilog Monte-Carlo simulations to the values in Tab. 1. Tab. 2 summarizes the results, reporting the relative errors with respect to transistor-level simulations. Very good results are obtained for the delay tp, whereas the accuracy of model (3) affects the results for the leakage current IL00. The model has been checked by comparing the statistical description of the FOMs in the case of combined process and mismatch variations: Figs. 1 and 2 show some histograms for transistor-level and Verilog MC simulations, and Tab. 3 summarizes the results, showing a good agreement between the model and transistor-level simulations.

Table 2. Results from Verilog Monte-Carlo simulations
FOM         Average     ε%      Std (process)   ε%      Std (mismatch)   ε%
tp          15.593 ps   0.16%   768 fs          0.08%   635 fs           0.03%
log(IL00)   −9.34       0.28%   0.183           14%     0.375            4.3%
log(IL01)   −8.56       0.24%   0.184           0.16%   0.409            0.11%
Fig. 1. Histograms of the propagation delay under combined process and mismatch variations: a) transistor-level simulations; b) Verilog simulations
Fig. 2. Histograms of the log of the leakage current (for A≠B) under combined process and mismatch variations: a) transistor-level simulations; b) Verilog simulations
Table 3. Standard deviation of the FOMs under combined process and mismatch variations

FOM         CMOS     Verilog    ε%
tp          958 fs   1.027 ps   7.23%
log(IL00)   0.380    0.434      14.2%
log(IL01)   0.450    0.480      6.7%
As an example application, a cascade of 15 NAND gates has been considered to simulate the critical path of a microprocessor, and the distribution of the overall propagation delay under process and mismatch variations has been studied using the proposed model and compared with transistor-level simulations. Tab. 4 summarizes the results, and Fig. 3 shows the histograms under combined process and mismatch variations.
Fig. 3. Histograms of the delay of the cascade of 15 gates under combined process and mismatch variations: a) transistor-level simulations; b) Verilog simulations

Table 4. Statistics of the 15-NAND delay under combined process and mismatch variations

FOM   CMOS         Verilog      ε%
Avg   261.164 ps   255.658 ps   2.11%
Std   12.908 ps    11.995 ps    7.08%
We have also checked the scalability of the model, which would make it possible to obtain a statistical library of gates with different transistor sizes by extracting the model of a single logic gate. The statistical description of the propagation delay tp extracted for a minimum-size NAND gate has been used to predict the delays of gates with larger transistors (with a constant PMOS-to-NMOS width ratio). We have considered the scaled NAND gate driven and loaded by DF and FO minimum-size gates, respectively, and eq. (2) has been modified to take into account the scaling factor S = Wnp/Wmin. An empirical relationship has been obtained to correctly describe the dependence of tp on S and C: tp =
(FO + S·α)·C·VDD·k / (2·S·Wmin·Cox·vsat·(VDD − Vth)) + (γ·√(S − 1)·C + C)/DF    (5)
Tab. 5 compares the statistical characterization of the logic gate obtained from transistor-level and Verilog MC simulations, showing that a good agreement is maintained as the scaling factor increases.

Table 5. Statistical characterization of the propagation delay vs. channel width (process variations)
S   Avg (CMOS)   Avg (Verilog)   ε%      Std (CMOS)   Std (Verilog)   ε%
1   15.568 ps    15.593 ps       0.16%   768 fs       768 fs          0.08%
2   15.818 ps    15.949 ps       0.83%   792 fs       805 fs          1.59%
3   16.677 ps    16.863 ps       1.11%   834 fs       827 fs          0.81%
4   17.813 ps    17.818 ps       0.03%   870 fs       844 fs          3%
5 Conclusions

We have presented a framework for a statistical model of logic gates that allows Monte-Carlo behavioral-level simulations in a standard analog CAD environment (Cadence). The model fits the statistics obtained from transistor-level simulations for both process and mismatch variations, and takes into account the correlation between the performance parameters of the gate (delays, leakage currents, etc.). The framework also makes it possible to include correlations between different gates and position-dependent mismatch, allowing a more realistic description of a digital IC. The proposed model allows very fast statistical simulations for circuit optimization and yield prediction of complex digital designs. An accurate analog description of transient behavior can also be obtained by exploiting the analog extensions of the Verilog hardware description language, and can be useful for mixed-signal simulations or full-custom digital design optimization. The proposed model framework could also be used to develop a form of yield-aware design optimization that finds the best trade-off between performance parameters such as maximum delay, dynamic power consumption and leakage power consumption. A simple case study has been presented to assess the feasibility of the proposed framework inside the Cadence analog CAD environment, and to test the extraction procedure for the stochastic variables used in the model. Good agreement is obtained for both the average value and the standard deviation of the performance parameters of interest, both when process and mismatch variations are taken separately and when the two effects are considered together.
References

1. Chandrakasan, A., Bowhill, W.J., Fox, F.: Design of high-performance microprocessor circuits. Wiley, New York (2001)
2. Gneiting, T., Jalowiecki, I.P.: Influence of process parameter variations on the signal distribution behavior of wafer scale integration devices. IEEE Trans. Components, Packaging and Manufacturing Technology Part B 18(3), 424–430 (1995)
3. Chang, H., Qian, H., Sapatnekar, S.S.: The certainty of uncertainty: randomness in nanometer design. In: Macii, E., Paliouras, V., Koufopavlou, O. (eds.) PATMOS 2004. LNCS, vol. 3254, pp. 36–47. Springer, Heidelberg (2004)
4. Nassif, S.: Design for variability in DSM technologies. In: IEEE Int. Symp. Quality Electronic Design, pp. 451–454. IEEE Computer Society Press, Los Alamitos (2000)
5. Nardi, A., Neviani, A., Zanoni, E., Quarantelli, M., Guardiani, C.: Impact of unrealistic worst case modeling on the performance of VLSI circuits in deep submicron CMOS technologies. IEEE Trans. Semiconductor Manufacturing 12(4), 396–402 (1999)
6. Singhal, K., Visvanathan, V.: Statistical device models from worst case files and electrical test data. IEEE Trans. Semiconductor Manufacturing 12(4), 470–484 (1999)
7. Mutlu, A.A., Kwong, C., Mukherjee, A., Rahman, M.: Statistical circuit performance variability minimization under manufacturing variations. In: ISCAS 06 IEEE Int. Symp. on Circuits and Systems, pp. 3025–3028. IEEE Computer Society Press, Los Alamitos (2006)
8. Jyu, H.-F., Malik, S., Devadas, S., Keutzer, K.W.: Statistical timing analysis of combinatorial logic circuits. IEEE Trans. VLSI Systems 1(2), 126–137 (1993)
9. Chang, H., Sapatnekar, S.S.: Statistical timing analysis under spatial correlations. IEEE Trans. on CAD 24(9), 1467–1482 (2005)
10. Jess, J.A.G., Kalafala, K., Naidu, S.R., Otten, R.H.J.M., Visweswariah, C.: Statistical timing for parametric yield prediction of digital integrated circuits. IEEE Trans. on CAD 25(11), 2376–2392 (2006)
11. Agarwal, A., Mukhopadhyay, S., Raychowdhury, A., Roy, K., Kim, C.H.: Leakage power analysis and reduction for nanoscale circuits. IEEE Micro 26(2), 68–80 (2006)
12. Rao, R., Srivastava, A., Blaauw, D., Sylvester, D.: Statistical estimation of leakage current considering inter- and intra-die process variation. In: ISLPED 03 Int. Symp. Low-Power Electronics and Design, pp. 84–89 (2003)
13. Chang, H., Sapatnekar, S.S.: Full-chip analysis of leakage power under process variations, including spatial correlations. In: DAC 05 Proc. Design Automation Conf., pp. 523–528 (2005)
14. Chen, T., Naffziger, S.: Comparison of Adaptive Body Bias (ABB) and Adaptive Supply Voltage (ASV) for improving delay and leakage under the presence of process variation. IEEE Trans. VLSI Systems 11(5), 888–899 (2005)
15. ITRS: The International Technology Roadmap for Semiconductors, 2005 edn. (2005)
16. Verilog-A language reference manual, Version 1.0. Open Verilog International (1996)
17. Ashenden, P.J., Peterson, G.D., Teegarden, D.A.: The system designer's guide to VHDL-AMS. Morgan Kaufmann, San Francisco (2002)
18. Auvergne, A., Daga, J.M., Rezzoug, M.: Signal transition time effect on CMOS delay evaluation. IEEE Trans. Circuits and Systems I 47(9), 1362–1369 (2000)
19. Bowman, K.A., Duvall, S.G., Meindl, J.D.: Impact of die-to-die and within-die parameter fluctuations on the maximum clock frequency distribution for gigascale integration. IEEE J. Solid-State Circuits 37(2), 183–190 (2002)
20. Bowman, K.A., Austin, B.L., Eble, J.C., Tang, X., Meindl, J.D.: A physical alpha-power law MOSFET model. IEEE J. Solid-State Circuits 34(10), 1410–1414 (1999)
21. Gu, R.X., Elmasry, M.I.: Power dissipation analysis and optimization of deep submicron CMOS digital circuits. IEEE J. Solid-State Circuits 31(5), 707–713 (1996)
22. Rao, R., Devgan, A., Blaauw, D., Sylvester, D.: Parametric yield estimation considering leakage variability. In: DAC 04 Proc. Design Automation Conf., pp. 442–447 (2004)
Switching Activity Reduction of MAC-Based FIR Filters with Correlated Input Data

Oscar Gustafson¹, Saeeid Tahmasbi Oskuii², Kenny Johansson¹, and Per Gunnar Kjeldsberg²

¹ Department of Electrical Engineering, Linköping University, SE-581 83 Linköping, Sweden
{oscarg,kennyj}@isy.liu.se
² Department of Electronics and Telecommunications, Norwegian University of Science and Technology (NTNU), NO-7491 Trondheim, Norway
{saeeid.oskuii,per.gunnar.kjeldsberg}@iet.ntnu.no
Abstract. In this work we consider coefficient reordering for low power realization of FIR filters on fixed-point multiply-accumulate (MAC) based architectures, such as DSP processors. Compared to previous work we consider the input data correlation in the ordering optimization. For this we model the input data using the dual bit type approach. Results show that compared with just optimizing the number of switches between coefficients, the proposed method works better when the input data is correlated, which can be assumed for most applications. Keywords: FIR filter, MAC, dual bit type, switching activity, coefficient reordering.
1 Introduction
Energy consumption is becoming the major cost measure when implementing integrated circuits. This trend is motivated both by the benefit of increased battery life for portable products and by reduced cooling problems. Many of these systems include a digital signal processing (DSP) subsystem which performs a convolution or a sum-of-products computation. These computations are often performed using a, possibly embedded, programmable DSP processor. Probably the most common form of convolution algorithm is the finite-length impulse response (FIR) filter. The output of an N:th-order FIR filter is computed as

y(n) = Σ_{i=0}^{N} h(i)·x(n − i)    (1)
where the filter coefficients, h(n), determine the frequency response of the filter. The transfer function of the FIR filter is

H(z) = Σ_{i=0}^{N} h(i)·z^{−i}    (2)

N. Azemard and L. Svensson (Eds.): PATMOS 2007, LNCS 4644, pp. 526–535, 2007. © Springer-Verlag Berlin Heidelberg 2007
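Eq. (1) maps directly onto a loop of N + 1 multiply-accumulate operations per output sample. A minimal sketch (the moving-average coefficients are just an illustration, not from the paper):

```python
def fir_direct_form(h, x):
    """Direct-form FIR filter: y(n) = sum_{i=0}^{N} h(i) * x(n - i),
    computed as N + 1 multiply-accumulate (MAC) operations per output."""
    N = len(h) - 1
    y = []
    for n in range(len(x)):
        acc = 0.0
        for i in range(N + 1):
            if n - i >= 0:               # zero initial state
                acc += h[i] * x[n - i]   # one MAC operation
        y.append(acc)
    return y

# fifth-order moving-average filter: all six taps equal to 1/6
h = [1.0 / 6.0] * 6
y = fir_direct_form(h, [6.0] * 12)
```

The inner loop is exactly what the MAC-based architecture of Fig. 2 executes: one coefficient and one data word are fetched per cycle, which is why the switching on those two buses dominates.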
Fig. 1. (above) Direct form and (below) transposed direct form fifth-order FIR filter

Fig. 2. Multiply-accumulate (MAC) architecture suitable for realizing direct form FIR filters
The two most common filter structures for realizing the transfer function in (2) are the direct form and the transposed direct form structures depicted in Fig. 1. As can be seen from Fig. 1, the basic arithmetic operation is a multiplication followed by an addition. This is usually called a multiply-accumulate (MAC) operation and is commonly supported in programmable DSP processors [1]. If a direct form FIR filter is realized, the input data are stored in one memory, while the coefficients are stored in another memory. Each output is then computed by performing N + 1 MAC operations. A suitable abstract architecture is shown in Fig. 2. It is also possible to use a similar architecture when implementing FIR filters in application-specific integrated circuits (ASICs) and field-programmable gate arrays (FPGAs) [2, 3]. Many FPGAs have dedicated general fixed-point multipliers and some even have a complete fixed-point MAC as a dedicated building block. For integrated circuits implemented in CMOS technology, the sources of power dissipation can be classified as dynamic, short-circuit, and leakage power. Even though the impact of leakage power increases with decreasing feature size, the
O. Gustafson et al.
main source of power dissipation for many integrated circuits is still the dynamic power. The dynamic power consumption of a CMOS circuit is expressed as

P_{dynamic} = \alpha C f V_{DD}^2    (3)

where α is the switching activity, C is the capacitance, f is the frequency, and V_DD is the power supply voltage. In this work we focus on reducing the switching activity in fixed-point MAC-based realizations of direct form FIR filters. The same ideas can be applied to other convolutions and sum-of-products computations, but for clarity we only discuss FIR filters here. We focus on reducing the number of switches on the inputs of the multiplier. It should be noted that this will also decrease the number of switches on the buses connecting the memories to the multiplier. Previous work on switching activity reduction of FIR filters on MAC-based architectures can be divided into two classes. The first class assumes that the MAC operations are performed in increasing (or decreasing) order, and the approaches optimize the coefficient values such that the number of switches between adjacent coefficients is small [4, 5]. In [4] a heuristic optimization method is proposed, while in [5] an optimal method based on mixed integer linear programming is presented. The second class aims at reordering the computations such that the number of switches between succeeding coefficients is small [4, 6]. In [4] a framework that optimizes both the coefficients and the order was proposed. However, the optimization and reordering are not performed simultaneously. Furthermore, the work in [4] neglects the fact that the input data is correlated. Input data correlation is treated in [6] by determining a lookup table based on simulation. This lookup table grows rapidly with increasing filter length, and, hence, only short filters and convolutions are considered in [6].
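Equation (3) can be illustrated numerically (a sketch with made-up example values, not measurements from the paper):

```python
# Hedged numeric illustration of (3): P = alpha * C * f * VDD^2.
def dynamic_power(alpha, C, f, VDD):
    """alpha: switching activity, C: switched capacitance [F],
    f: clock frequency [Hz], VDD: supply voltage [V]."""
    return alpha * C * f * VDD ** 2

# Example values (ours): alpha = 0.25, C = 1 pF, f = 100 MHz, VDD = 1.2 V
P = dynamic_power(0.25, 1e-12, 100e6, 1.2)  # ~36 microwatts
```

The quadratic dependence on V_DD is why supply scaling is so effective, while reducing α — the topic of this paper — gives a linear saving at fixed V_DD and f.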
There are also works where several output samples are computed interleaved, typically with more than one accumulator, leading to reduced switching activity [7, 8]. In this work we characterize the input data using the dual bit type method [9] and derive equations for computing the correlation between samples more than one sample period apart. This is used to formulate a Hamiltonian path (or traveling salesman, TSP) problem that is solved to find the best ordering of the computations. While the focus of this work is on FIR filters, similar techniques can be applied to other applications based on successive MAC operations. In the next section we review the issues related to correlated input data and derive the correlation equations for the input data. Then, in Section 3 the proposed optimization approach is presented. This approach is extended to include possible negation of coefficients in Section 3.2. In Section 4 results are presented that highlight the importance of the contribution. Finally, in Section 5 some concluding remarks are given.
2 Correlated Input Data
Signals in real-world applications can in many cases be approximated as Gaussian stochastic variables. As a result, their binary representations have different switching probabilities for different bit positions. However, certain properties for different regions can be observed [9, 10, 11]. In this work we focus on two's complement representation, but a similar derivation can be performed using, e.g., sign-magnitude representation.

Fig. 3. Illustration of the dual bit type properties

The dual bit type (DBT) method [9] is based on the fact that the binary representation of most real-world signals can be divided into a few regions, where the bits of each region have a well defined switching activity. In [9] the three regions LSB, linear, and MSB were defined as illustrated in Fig. 3. Because of the linear approximation of the middle region, it was stated that two bit types are sufficient. In the LSB region the switching probability is 1/2, which corresponds to random switching. Hence, the bits are divided into a uniform white-noise (UWN) region, U, and a sign region, S, as shown in Fig. 3. The word-level statistics, i.e., mean, μ, variance, σ², and correlation, ρ, of the signal are used to determine the breakpoints and the switching activities. The correlation for a signal is computed as

\rho = \frac{\mu_\Delta - \mu^2}{\sigma^2}    (4)

where μ_Δ is the average value of the signal multiplied by the signal delayed one sample. Typically, we have μ = 0, which gives that the probability of a bit in the two's complement representation being one, p, is 1/2. In [9] the breakpoints of the regions are defined as

BP0 = \log_2 \sigma + \log_2\left(\sqrt{1 - \rho^2} + \frac{|\rho|}{8}\right)    (5)

BP1 = \log_2(|\mu| + 3\sigma)    (6)

BP = \frac{BP0 + BP1}{2}    (7)
With a data wordlength of W bits, the number of bits in the sign region S is

W_S = W - BP - 1    (8)
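Equations (5)–(8) translate directly into code (a sketch assuming the word-level statistics μ, σ, ρ are known; the function and argument names are ours, and in practice the breakpoints would be rounded to integer bit positions):

```python
import math

# Sketch of the dual bit type breakpoint equations (5)-(8).
def dbt_breakpoints(mu, sigma, rho, W):
    BP0 = math.log2(sigma) + math.log2(math.sqrt(1 - rho**2) + abs(rho) / 8)  # (5)
    BP1 = math.log2(abs(mu) + 3 * sigma)                                      # (6)
    BP = (BP0 + BP1) / 2                                                      # (7)
    WS = W - BP - 1                                                           # (8)
    return BP0, BP1, BP, WS

# e.g. a zero-mean signal with sigma = 2^10, strong correlation rho = 0.9,
# and a W = 16 bit data wordlength
BP0, BP1, BP, WS = dbt_breakpoints(0.0, 2**10, 0.9, W=16)
```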
If p_Q is the probability that a single-bit signal, Q, is one and the temporal correlation of Q is ρ_Q, then the switching activity, α_Q, of Q is defined as [10]

\alpha_Q = 2 p_Q (1 - p_Q)(1 - \rho_Q)    (9)
Fig. 4. Resulting switching activity for the MSB region, S, after reordering (curves for α_MSB = 0.1, 0.01, and 0.001)
The probability of a one is assumed to be 1/2 for all bits in the two's complement representation of a signal, as the mean value is 0. Furthermore, it was stated in [10] that the temporal correlation for bits in the MSB region is close to the word-level correlation, ρ. Hence, the switching activity in the MSB region can be computed as

\alpha_{MSB} = \frac{1 - \rho}{2}    (10)

Now, when the filter coefficients are reordered, the switching probability for adjacent data words changes. Let α_{m,D} denote the switching probability between two bits at position m with time indices i and i + D. We have

\alpha_{m,D} = \begin{cases} 0 & D = 0 \\ 1/2 \;\; (m \in U), \quad \alpha_{MSB} \;\; (m \in S) & D = 1 \\ (1 - \alpha_{m,1})\,\alpha_{m,D-1} + \alpha_{m,1}\,(1 - \alpha_{m,D-1}) & D \ge 2 \end{cases}    (11)

In Fig. 4 the effect of reordering on the switching probability is shown for some initial switching probabilities, α_MSB. From this it can be seen that the switching probability increases monotonically toward 1/2 with increasing distance between samples. Hence, while reordering may decrease the switching probability of the coefficients, it will increase the switching probability of the input data sent to the multiplier.
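The recursion in (11) can be sketched as follows for a bit in the sign region S (an illustrative sketch; the function name is ours):

```python
# Sketch of the D >= 2 case of (11): starting from alpha_{m,1} = alpha_MSB,
# build up the switching probability between samples D >= 1 apart.
def alpha_after_reorder(alpha_1, D):
    """Switching probability between two samples D apart (D >= 1)."""
    a = alpha_1
    for _ in range(D - 1):  # apply the recursion of (11) repeatedly
        a = (1 - alpha_1) * a + alpha_1 * (1 - a)
    return a

# As in Fig. 4: the probability grows monotonically toward 1/2
probs = [alpha_after_reorder(0.1, D) for D in (1, 2, 5, 20, 100)]
```

This reproduces the behavior plotted in Fig. 4: the larger the reordering distance, the closer the data switching probability gets to the random value 1/2.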
3 Proposed Approach
Let the coefficient h(i) be represented using a B-bit two's complement representation as

h(i) = -b_{i,B-1} + \sum_{k=0}^{B-2} b_{i,k}\, 2^{-(B-1-k)}    (12)
where b_{i,j} ∈ {0, 1}. Hence, the number of switches when changing the coefficient from h(i) to h(j) (or vice versa) is

c_{h,i \to j} = c_{h,j \to i} = \sum_{k=0}^{B-1} b_{i,k} \oplus b_{j,k}    (13)
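The coefficient cost (13) is an XOR-and-popcount operation on the bit patterns (a sketch; the function name and example bit patterns are ours):

```python
# Sketch of the coefficient switching cost (13): Hamming distance between
# the B-bit two's complement bit patterns of h(i) and h(j).
def coeff_cost(hi_bits, hj_bits):
    """hi_bits, hj_bits: unsigned integers holding the B-bit patterns."""
    return bin(hi_bits ^ hj_bits).count("1")

# Example with B = 8: the two patterns differ in exactly one bit position
c = coeff_cost(0b00101101, 0b00101001)  # -> 1
```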
This is the Hamming distance measure used for coefficients in [4, 5, 6]. For the input data it is not possible to explicitly compute the number of switches. Instead, the switching probability from (11) is used to obtain

c_{x,i \to j} = c_{x,j \to i} = \sum_{k=0}^{W-1} \alpha_{k,|i-j|}    (14)
The total transition cost for selecting h(j) as the input coefficient after h(i) is now

c_{tot,i \to j} = c_{tot,j \to i} = c_{h,i \to j} + c_{x,i \to j}    (15)

By forming a fully connected weighted graph with N + 1 nodes, where each node corresponds to a filter coefficient, h(i), and each edge weight is obtained from (15), the ordering problem becomes that of finding a minimum-weight path visiting all nodes in the graph exactly once. This problem is known as a symmetric traveling salesman problem (TSP). The TSP is NP-hard, but it is in general possible to solve rather large instances in reasonable time. We have used GLPK [12] to solve problems with about 100 coefficients to optimality in tens of seconds.

3.1 Multiplier and Bus Power Consumption
It should be noted that switches at different inputs of the multiplier affect the power consumption differently [13]. Hence, if it is possible to characterize the multiplier used, the c_{h,i→j} and c_{x,i→j} terms can be weighted accordingly. However, the results in [13] also indicate that the variation is not large. Our own simulations likewise show that when all inputs are randomly distributed with switching and one probabilities of 0.5, except for one input known to switch every cycle, the variation in power consumption is insignificant. Hence, this aspect is not included in the results. For the cases where a custom multiplier is implemented, rather than using an existing one in a DSP, an FPGA, or a macro library, it is worth noting that the power consumption of the multiplier can be optimized based on the expected switching probability [14]. For buses, the traditional power model has been to count the number of switches. However, for deep sub-micron technologies the interwire capacitances dominate over the wire-to-ground capacitances [15]. Hence, one should possibly include these in the cost function as well. In general it is hard to determine the exact bus structure, especially for DSPs and FPGAs, and, hence, in the results section we only consider the number of switches.
3.2 Selective Negation
In [4] it was proposed that if the MAC operation is able to conditionally subtract the output of the multiplier, it is possible to negate some of the coefficients to reduce the switching even further. However, no solution was provided for deciding which coefficients should be negated. In this work we evaluate these ideas by considering the case where all coefficients are changed to positive values and the results corresponding to negative coefficients are selectively subtracted. This could in general be solved using a modified TSP formulation known as the equality generalized TSP (E-GTSP) [16].
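To close Section 3, the overall ordering problem can be sketched end-to-end (our own illustration with hypothetical inputs: it combines the Hamming cost (13) and the data cost (14) into the edge weight (15) and orders the coefficients with a simple nearest-neighbor heuristic, whereas the paper solves the exact symmetric TSP with GLPK):

```python
# Nearest-neighbor sketch of the coefficient ordering problem of Section 3.
def order_coefficients(coeff_bits, data_cost):
    """coeff_bits: B-bit patterns of h(0..N); data_cost(d): cost for |i-j| = d."""
    n = len(coeff_bits)

    def w(i, j):  # total transition cost c_tot, eq. (15)
        ch = bin(coeff_bits[i] ^ coeff_bits[j]).count("1")  # eq. (13)
        return ch + data_cost(abs(i - j))                   # eq. (14)

    order, left = [0], set(range(1, n))
    while left:  # greedily extend the path with the cheapest next coefficient
        nxt = min(left, key=lambda j: w(order[-1], j))
        order.append(nxt)
        left.remove(nxt)
    return order

# With zero data cost the heuristic simply chains nearby bit patterns:
order = order_coefficients([0b0001, 0b0011, 0b0111, 0b0010], lambda d: 0.0)
```

A nearest-neighbor tour is generally suboptimal; the exact minimum-weight Hamiltonian path the paper computes with GLPK [12] can only be equal or better.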
4 Results
To illustrate the results of the proposed design technique we consider three FIR filters of varying lengths. These are optimized according to the methodology in Section 3 using different data distributions. All FIR filters are designed using the Remez exchange algorithm for the given specifications. For simplicity we assign the same weights to the passband and stopband ripples. The filter coefficients are scaled by a power of two such that the magnitude of the largest coefficient is represented using all available bits, i.e., 0.5 ≤ max(|h(i)|) < 1. Finally, the coefficients are rounded to the wordlength used. It should be noted that the wordlengths used are sufficient even for harder specifications [17]. Hence, it would be possible to design filters with shorter wordlengths for most designs. However, this aspect is not considered here; the rounded coefficients are used to demonstrate the properties of the proposed reordering methodology.

4.1 Design 1
For the first design, the passband and stopband edges are at 0.2π rad and 0.3π rad, respectively. For this design we target a general purpose DSP with a 24 × 24-bit multiplier. With a filter order of 65 we obtain the results shown in Table 1, where the switching activity denotes the total number of switches at the bus and multiplier inputs for a complete FIR filter computation. From the results it can be seen that savings in switching activity between 1.5% and 7.2% are obtained by taking the correlation of the input data into account.

4.2 Design 2
For the second design, we consider implementation in an FPGA which includes a general 18 × 18-bit multiplier. Again we use an FIR filter designed using the Remez algorithm with identical maximum passband and stopband ripples. For the passband and stopband edges we select 0.6π rad and 0.8π rad, respectively. The filter coefficients are scaled as in the previous design. To obtain reasonable stopband attenuation we select a filter order of 40.
Table 1. Total switching activity of the data and coefficient values for Design 1

Data characteristics   | Natural order¹ | Optimized for coefficient Hamming distance [4] | Optimized using data characteristics | Reduction
Random                 | 1472.0         | 1020.0                                         | 1020.0                               | —
WS = 4, αMSB = 0.1     | 1366.4         | 1012.0                                         |  996.6                               | 1.5%
WS = 4, αMSB = 0.01    | 1342.6         |  943.5                                         |  929.2                               | 1.5%
WS = 8, αMSB = 0.1     | 1260.8         | 1004.0                                         |  964.6                               | 3.9%
WS = 8, αMSB = 0.01    | 1213.3         |  867.0                                         |  837.7                               | 3.4%
WS = 12, αMSB = 0.1    | 1155.2         |  996.1                                         |  924.0                               | 7.2%
WS = 12, αMSB = 0.01   | 1083.9         |  790.5                                         |  744.3                               | 5.8%
Fig. 5. Resulting estimated switching activity for different coefficient orders (random data; WS = 4, 8, 12) with respect to the actual WS. αMSB = 0.1
As the statistical properties of the input signal are estimated, it is of interest to know how a change in the actual input characteristics affects the switching activity. For this we have considered the case where αMSB = 0.1 and varied WS for four different orderings: one optimized for coefficient Hamming distance [4] (corresponding to WS = 0) and three optimized for WS = 4, 8, and 12. The results are shown in Fig. 5, where it can be seen that the orderings that consider the input data correlation in most cases result in less switching activity than the one that does not (as in [4]), independent of the actual value of WS. Only for small values of WS are the orderings designed for large WS worse than the ordering designed for Hamming distance (corresponding to WS = 0). Hence, the proposed design methodology will reduce the switching activity as long as the estimated parameters are reasonably close to the actual parameters of the input data.

¹ The filter coefficients processed in order h0, h1, h2, ...
Table 2. Total switching activity of the data and coefficient values for Design 3

                       | Original sign        | Positive sign        | Reduction
Data characteristics   | Natural | Optimized  | Natural | Optimized  | Natural | Optimized
Random                 | 1890.0  | 1152.0     | 1604.0  | 1128.0     | 15.1%   | 2.1%
WS = 6, αMSB = 0.1     | 1635.6  | 1111.1     | 1349.6  | 1068.1     | 17.5%   | 3.9%
WS = 6, αMSB = 0.01    | 1578.4  |  963.8     | 1292.4  |  929.9     | 18.1%   | 3.5%
WS = 10, αMSB = 0.1    | 1470.0  | 1062.0     | 1184.0  | 1000.5     | 19.5%   | 5.8%
WS = 10, αMSB = 0.01   | 1374.9  |  829.2     | 1088.9  |  779.2     | 20.8%   | 6.0%

4.3 Design 3
In the third design, we again consider implementation in an FPGA. Now we consider the effect of making all signs positive. This can easily be realized by replacing the adder in Fig. 2 with an adder/subtracter; to control this, an additional bit is required in the coefficient memory. It should be noted that this approach is similar to using the sign-magnitude number representation. This sign bit should possibly be included in the switching activity analysis: from a bus point of view it should be included, while from a multiplier point of view it should not, as it does not affect the power consumption of the multiplier. We choose not to include it in the cost function in this design study. For this design we consider a 105th-order FIR filter with passband and stopband edges at 0.5π rad and 0.53π rad, respectively. The results are shown in Table 2 for the cases where the original signs are used and where all coefficients are transformed to positive sign. It is clear that while significant savings can be obtained by using absolute-valued coefficients when realizing the multiplications in their natural order, the advantage decreases when coefficient reordering is considered. The reason for the larger savings using natural ordering is that the coefficient values vary over the impulse response, leading to many switches on the sign bits; for the optimized version this is already accounted for by the optimization. One would expect that by using the E-GTSP formulation the switching activity can be reduced even further.
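The selective-negation scheme of this design can be sketched as follows (an illustrative sketch with hypothetical example values; names are ours):

```python
# Sketch of the Design 3 scheme: store |h(i)| plus one sign bit per
# coefficient, and let the MAC unit conditionally subtract (the
# adder/subtracter replacing the adder in Fig. 2) whenever the original
# coefficient was negative.
def mac_with_signs(data, coeffs):
    magnitudes = [abs(h) for h in coeffs]   # all-positive coefficient memory
    subtract = [h < 0 for h in coeffs]      # extra sign bit per coefficient
    acc = 0.0
    for x, m, s in zip(data, magnitudes, subtract):
        acc = acc - m * x if s else acc + m * x
    return acc

# Matches the plain sum-of-products with the original signed coefficients:
y = mac_with_signs([1.0, 2.0, 3.0], [0.5, -0.25, 0.125])  # -> 0.375
```

Storing only magnitudes removes the sign-bit switching from the coefficient bus at the cost of one extra control bit, which is exactly the trade-off discussed above.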
5 Conclusion
In this work we have proposed an approach to low-power realization of FIR filters on MAC-based architectures when the input data correlation is considered. The reordering of computations to reduce the switching activity now also depends on the input data correlation, which is represented using the dual bit type method. Furthermore, we proposed how to formulate the problem when coefficients may be negated to reduce the switching activity further. The proposed approach provides a more accurate modeling than [4], which did not consider input data correlation. The results show that as long as we have correlated input data and the dual bit type parameter estimation is reasonably correct, we obtain lower switching activity using the proposed methodology compared to [4]. Compared to [6], the proposed method can handle arbitrarily long FIR filters, while the modeling in [6] depended on simulations, making it complex to obtain results for long filters due to the large look-up tables required. With the proposed method the corresponding results are easily computed from the presented equations. However, this requires that the input data is characterized for use with the dual bit type method.
References

1. Lapsley, P., Bier, J., Shoham, A., Lee, E.A.: DSP Processor Fundamentals: Architectures and Features. Wiley-IEEE Press (1997)
2. Wanhammar, L.: DSP Integrated Circuits. Academic Press, London (1999)
3. Meyer-Baese, U.: Digital Signal Processing with Field Programmable Gate Arrays. Springer, Heidelberg (2001)
4. Mehendale, M., Sherlekar, S.D., Venkatesh, G.: Low-power realization of FIR filters on programmable DSPs. IEEE Trans. VLSI Systems 6(4), 546–553 (1998)
5. Gustafsson, O., Wanhammar, L.: Design of linear-phase FIR filters with minimum Hamming distance. In: Proc. IEEE Nordic Signal Processing Symp., October 4–7, 2002, Hurtigruten, Norway (2002)
6. Masselos, K., Merakos, P., Theoharis, S., Stouraitis, T., Goutis, C.E.: Power efficient data path synthesis of sum-of-products computations. IEEE Trans. VLSI Systems 11(3), 446–450 (2003)
7. Arslan, T., Erdogan, A.T.: Data block processing for low power implementation of direct form FIR filters on single multiplier CMOS DSPs. In: Proc. IEEE Int. Symp. Circuits Syst., 1998, Monterey, CA, vol. 5, pp. 441–444 (1998)
8. Parhi, K.K.: Approaches to low-power implementations of DSP systems. IEEE Trans. Circuits Syst.–I 48(10), 1214–1224 (2001)
9. Landman, P.E., Rabaey, J.M.: Architectural power analysis: The dual bit type method. IEEE Trans. VLSI Systems 3(2), 173–187 (1995)
10. Ramprasad, S., Shanbhag, N.R., Hajj, I.N.: Analytical estimation of transition activity from word-level signal statistics. In: Proc. Design Automat. Conf., June 1997, pp. 582–587 (1997)
11. Lundberg, M., Muhammad, K., Roy, K., Wilson, S.K.: A novel approach to high-level switching activity modeling with applications to low-power DSP system synthesis. IEEE Trans. Signal Processing 49(12), 3157–3167 (2001)
12. GNU Linear Programming Kit 4.16, http://www.gnu.org/software/glpk/
13. Hong, S., Chin, S.-S., Kim, S., Hwang, W.: Multiplier architecture power consumption characterization for low-power DSP applications. In: Proc. IEEE Int. Conf. Elec. Circuits Syst., September 15–18, 2002, Dubrovnik, Croatia (2002)
14. Oskuii, S.T., Kjeldsberg, P.G., Gustafsson, O.: Transition-activity aware design of reduction-stages for parallel multipliers. In: Proc. Great Lakes Symp. on VLSI, March 11–13, 2007, Stresa-Lago Maggiore, Italy (2007)
15. Caputa, P., Fredriksson, H., Hansson, M., Andersson, S., Alvandpour, A., Svensson, C.: An extended transition energy cost model for buses in deep submicron technologies. In: Proc. Int. Workshop on Power and Timing Modeling, Optimization and Simulation, Santorini, Greece, September 15–17, 2004, pp. 849–858 (2004)
16. Fischetti, M., Salazar González, J.J., Toth, P.: The generalized traveling salesman problem. In: The Traveling Salesman Problem and Its Variations, pp. 609–662. Kluwer Academic Publishers, Dordrecht (2002)
17. Kodek, D.M.: Performance limit of finite wordlength FIR digital filters. IEEE Trans. Signal Processing 53(7), 2462–2469 (2005)
Performance of CMOS and Floating-Gate Full-Adders Circuits at Subthreshold Power Supply

Jon Alfredsson¹ and Snorre Aunet²

¹ Department of Information Technology and Media, Mid Sweden University, SE-851 70 Sundsvall, Sweden
[email protected]
² Department of Informatics, University of Oslo, Postbox 1080 Blindern, 0316 Oslo, Norway
[email protected]
Abstract. To reduce power consumption in electronic designs, new techniques for circuit design must always be considered. Floating-gate MOS (FGMOS) is one such technique and has previously shown potentially better performance than standard static CMOS circuits for ultra-low power designs. One reason is that FGMOS requires only a few transistors per gate while still retaining a large fan-in. Another reason is that CMOS circuits become very slow in the subthreshold region and are not suitable for many applications, while FGMOS can shift its threshold voltage to increase speed. This paper investigates how the performance of an FGMOS full-adder circuit compares with two common CMOS full-adder designs. Simulations in a 120 nm process show that FGMOS can have up to 9 times better EDP performance at 250 mV. The simulations also show that the FGMOS full-adder is 32 times faster but has two orders of magnitude higher power consumption than CMOS.
1 Introduction

It has become more and more important to reduce the power consumption of circuits while still achieving as high a switching speed as possible. The increasing demands for longer lifetimes in portable and battery-driven applications are among the strongest driving forces pushing the limits of ultra-low power consumption. According to the ITRS Roadmap for Semiconductors [18], the two most important of the five “grand challenges” for future nanoscale CMOS are to reduce power consumption and to design for manufacturability. In this work we focus on reducing power consumption; the challenge of designing for manufacturability is left for future work within this topic. One way to reduce power is to explore new types of circuits in order to find better circuit techniques for energy savings. Floating-gate MOS (FGMOS) is a circuit technique that has been proposed in several previous works as a potentially good technique to reduce power consumption while still maintaining relatively high speed [1],[2],[3]. FGMOS is normally fabricated using a standard CMOS process where an extra capacitance is connected to the transistor’s gate node. This capacitance, called the floating-gate capacitance, makes it possible to shift the

N. Azemard and L. Svensson (Eds.): PATMOS 2007, LNCS 4644, pp. 536–545, 2007.
© Springer-Verlag Berlin Heidelberg 2007
threshold voltage level of the MOS transistors. The required effective threshold voltage of the gate will thereby change, and the shift is controlled by the floating-gate node’s charge voltage [1],[3]. A shift in threshold voltage will also change the static current (and power consumption), normally to a higher value, at the same time as the propagation delay of the circuit changes (normally becoming smaller). Maximum propagation delay, tp, and power consumption, P, are two figures of merit that are important in FGMOS designs and must be considered when simulating with different fan-ins. In our simulations we have used power consumption (P), power-delay product (PDP), and energy-delay product (EDP) as figures of merit to determine differences in performance [4]. The approach to reduce power consumption and increase performance in this work is to lower the circuit’s power supply voltage into the subthreshold region. Previous work in this area has shown that FGMOS circuits working in subthreshold should not have a fan-in higher than 3 in order to retain advantages over CMOS [14]. This advice has been taken into account in this work, where we use an FGMOS full-adder with a maximum fan-in of 3 and compare it to two common basic CMOS full-adders with respect to power and speed performance. The aim of this work has been to determine whether the FGMOS full-adder shows better performance than normal CMOS full-adders when the power supply is reduced into the subthreshold region. This is important knowledge since subthreshold designs have frequently been proposed as good for ultra-low power consumption [19]. In this article we show that when the power supply is reduced into the subthreshold region (250 mV), the FGMOS circuits have up to 9 times better EDP and 32 times higher speed than the CMOS circuits. However, FGMOS also pays a penalty of over two orders of magnitude higher power consumption and a worse PDP.
2 FGMOS Basics

The FGMOS technique is based on normal MOSFET transistors and CMOS process technology. The transistors are manufactured with an extra gate capacitance in series with the transistor’s gate, which allows FGMOS to shift the effective threshold potential required by the transistor. The shifts in threshold are made by charging the node between the extra gate capacitance and the normal transistor gate. If there is no charge leakage, the node is said to be floating and the circuit is called a true floating-gate circuit. The added extra capacitance is called a floating-gate capacitance (CFG). Figure 1 shows a floating-gate transistor and a majority gate with fan-in 3 designed in FGMOS. Depending on the size of the floating-gate charge voltage (VFG), the effective threshold voltage varies. VFG is determined during the design process, and the floating-gate circuits are subsequently programmed with the selected VFG once and then fixed during operation [6]. The floating-gate potential, VFG, can be implemented via a variety of methods. For true floating gates, hot-electron injection, electron tunnelling, or UV exposure are normally used [3],[10],[11]. If the CMOS process has a gate-oxide thickness of 70 Å or less [7], some kind of refresh or auto-biasing technique is also required, as gate charge leakage becomes significant [8].
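Why a threshold shift speeds up subthreshold logic can be illustrated with a rough first-order model (our own assumption, not from the paper: a simple exponential subthreshold current law with made-up parameter values I0, n, VT):

```python
import math

# Rough first-order model: in subthreshold, drain current grows
# exponentially with the effective gate overdrive, so a floating-gate
# charge VFG that lowers the effective threshold boosts current (and speed).
def subthreshold_current(VGS, Vth, VFG=0.0, I0=1e-9, n=1.5, VT=0.026):
    """I ~ I0 * exp((VGS + VFG - Vth) / (n * VT)); all parameter values
    are illustrative assumptions, not extracted device data."""
    return I0 * math.exp((VGS + VFG - Vth) / (n * VT))

# Relative current boost from a 100 mV effective threshold shift at 250 mV:
boost = subthreshold_current(0.25, 0.38, VFG=0.1) / subthreshold_current(0.25, 0.38)
```

Since gate delay scales roughly inversely with drive current, even a modest VFG gives an exponential speedup, at the price of an equally exponential increase in static current, which matches the speed/power trade-off reported in this paper.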
Fig. 1. FGMOS transistor (left) and FGMOS Majority-3 gate with fan-in 3 (right)
3 Full-Adder Designs

The full-adder is one of the most used basic circuits, since addition of binary numbers is one of the most common operations in digital electronics. Full-adders exist everywhere in electronic systems, and a large amount of research has been done in this area in order to achieve the best possible performance [12], [13], [15]. Many different full-adder designs exist; this work focuses on two of the most commonly used basic CMOS full-adders. A standard static CMOS full-adder design (Figure 4) and a mirrored-gate based full-adder (Figure 3) have been used in our simulations to determine speed and power performance compared to a floating-gate full-adder. The floating-gate full-adder is represented by a recently improved adder structure with a maximum fan-in of 3 [14]. This full-adder is shown in Figure 2 and has shown potential to be better than CMOS at subthreshold.

Fig. 2. FGMOS full-adder with fan-in 3
Fig. 3. Mirrored CMOS full-adder
Fig. 4. Standard static CMOS full-adder
4 Full-Adder Simulations

The simulations have been performed in Cadence with the Spectre simulator in a 120 nm CMOS process technology, and the transistors used are of low-leakage type. The transistors use minimum gate lengths, 120 nm (effective), with a width of 150 nm for NMOS and 380 nm for PMOS. The threshold voltage, Vth, of these low-leakage transistors is 383 mV for NMOS and −368 mV for PMOS according to the simulations. Previous research on full-adders at subthreshold power supply suggests that the EDP performance of FGMOS can be better than that of CMOS if the fan-in of the floating-gate circuit is below four [14]. For this reason, a floating-gate full-adder structure with a fan-in of three has been used. In this work, simulations have been performed for three types of full-adders, one FGMOS and two CMOS. The power supply for the simulations is chosen between 150 mV and 250 mV, since previous simulations have shown that this is the subthreshold range with the best performance. The propagation delay, tp, of a full-adder varies with every different state change on the input; because of that, the results from our simulations are based on the slowest input-to-output change [15]. EDP is calculated from the average power consumption (P) and the minimum signal propagation delay, tp, according to Eq. 1. It is the power consumed to drive the output to 90% of its final value, multiplied by the propagation delay squared.
EDP = PDP \cdot t_p = I_{avg} \cdot V_{dd} \cdot t_p \cdot t_p = P \cdot t_p^2    (1)

where Iavg is the average switching current and tp is the circuit’s minimum propagation delay [4].
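The figures of merit in Eq. 1 can be illustrated numerically (a sketch with made-up example values, not the simulated results of this paper):

```python
# Numeric illustration of Eq. 1: PDP = Iavg * Vdd * tp and EDP = PDP * tp.
def figures_of_merit(I_avg, Vdd, t_p):
    P = I_avg * Vdd    # average power while switching
    PDP = P * t_p      # power-delay product (energy per operation)
    EDP = PDP * t_p    # energy-delay product, Eq. 1
    return P, PDP, EDP

# Hypothetical subthreshold full-adder: Iavg = 10 nA, Vdd = 250 mV, tp = 1 us
P, PDP, EDP = figures_of_merit(I_avg=10e-9, Vdd=0.25, t_p=1e-6)
# P = 2.5 nW, PDP = 2.5 fJ, EDP = 2.5e-21 Js
```

Because EDP weights delay quadratically, a design that is much faster can win on EDP even with a higher power consumption — exactly the FGMOS-versus-CMOS trade-off reported in the results below.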
5 Results

The simulation results from this work should determine whether FGMOS can be used to design better full-adder circuits than static and mirrored CMOS. Figure 5 and Figure 6 show plots of EDP for the circuits at 250 mV and 150 mV power supply. As seen, the EDP can be up to 9 times better for FGMOS at 250 mV, depending on how the floating-gate voltage VFGp is chosen. In all the figures we have plotted the EDP of CMOS as straight horizontal lines to be easily comparable with FGMOS. The plots show the limit of how large the floating-gate voltage can be while the circuit’s gain is higher than one; if VFGp is set more negative than in these plots, the signal is attenuated by each gate.

Fig. 5. EDP for floating-gate and CMOS full-adders at 250 mV

Fig. 6. EDP for floating-gate and CMOS full-adders at 150 mV

Figure 7 shows the PDP (at 250 mV), which is almost constant for all applied floating-gate voltages and is approximately 4 times worse than the PDP of each of the CMOS full-adders. Similar results are obtained from simulations at 150 mV. Plots of the propagation delay can be seen in Figure 8 and Figure 9; the FGMOS full-adder has up to 33 times shorter delay than the CMOS versions at 250 mV.

Fig. 7. PDP for different full-adders at 250 mV power supply
Fig. 8. Propagation delay for the different full-adders at 250 mV. The horizontal lines are for the CMOS circuits.
Figure 10 shows the power consumption at 250 mV and it is more than two orders of magnitude higher for FGMOS (114 times).
Fig. 9. Propagation delay for the different full-adders at 150 mV. The horizontal lines are for the CMOS circuits.
Fig. 10. Power consumption for the three types of full-adders
6 Discussion

FGMOS circuits have in previous studies been shown to achieve better EDP performance in the subthreshold region than normal static CMOS, provided the fan-in is not more than three [2],[14]. While there is an advantage in EDP performance for FGMOS in subthreshold, there is also a penalty in PDP and power consumption that needs to be taken into account. The simulation results in this work show that the EDP can be up to 9 times better for the FGMOS full-adder compared to the static CMOS design. They also show an advantage in switching speed, which is 33 times higher for FGMOS than for CMOS full-adders at 250 mV. Even at 150 mV, the switching speed is more than 3 times better for FGMOS. The mirrored CMOS and static CMOS full-adder circuits in this work were chosen for comparison with FGMOS since they have been shown to give some of the best results among commonly used full-adders in terms of P, PDP, and EDP [12],[13]. Note also that the mirrored-gate full-adder has better performance than the static CMOS full-adder in all three figures of merit. Even though the simulations performed in this work show a clear advantage for FGMOS when certain design constraints are fulfilled, it must be taken into account that it might not be possible to design the FGMOS with a true floating gate. Some kind of refresh circuit could be required, either a large resistance or a switch that retains or recharges the voltage on the floating-gate node [16],[17]. This will of course have an impact on performance. Especially for state-of-the-art and future process technologies, where the gate-oxide thickness decreases with every generation, this will be an issue to look into carefully during the design process. There is still a lot of research to be done within the field of subthreshold FGMOS to find out more advantages or limitations.
Work closely related to the topic of this article could be a more detailed analysis of netlists extracted from layout, together with real
measurements. It would also be interesting to find out how statistical process variations and mismatches between components will affect the performance.
7 Conclusions

Using FGMOS circuits with a subthreshold power supply can give a several-fold improvement in EDP and more than an order of magnitude better gate propagation delay than comparable CMOS circuits. These performance advantages will hopefully enable more ultra-low-power circuits with higher switching-frequency requirements. While FGMOS circuits can be much faster and have better EDP than CMOS, they also have significantly higher power consumption, which in turn reduces the number of possible applications for FGMOS. The performance constraints of FGMOS designs in subthreshold, especially the power consumption, will be one of the major limiting factors that decide whether floating-gate circuits can be used in a specific design.
References

[1] Shibata, T., Ohmi, T.: A Functional MOS Transistor Featuring Gate-Level Weighted Sum and Threshold Operations. IEEE Transactions on Electron Devices 39 (1992)
[2] Alfredsson, J., Aunet, S., Oelmann, B.: Basic speed and power properties of digital floating-gate circuits operating in subthreshold. In: IFIP VLSI-SOC 2005, Proc. of IFIP International Conference on Very Large Scale Integration, October 2005, Australia (2005)
[3] Hasler, P., Lande, T.S.: Overview of floating-gate devices, circuits and systems. IEEE Transactions on Circuits and Systems - II: Analog and Digital Signal Processing 48(1) (January 2001)
[4] Stan, M.R.: Low-power CMOS with subvolt supply voltages. IEEE Transactions on VLSI Systems 9(2) (April 2001)
[5] Rodríguez-Villegas, E., Huertas, G., Avedillo, M.J., Quintana, J.M., Rueda, A.: A Practical Floating-Gate Muller-C Element Using vMOS Threshold Gates. IEEE Transactions on Circuits and Systems-II: Analog and Digital Signal Processing 48(1) (January 2001)
[6] Aunet, S., Berg, Y., Ytterdal, T., Næss, Ø., Sæther, T.: A method for simulation of floating-gate UV-programmable circuits with application to three new 2-MOSFET digital circuits. In: The 8th IEEE International Conference on Electronics, Circuits and Systems, 2001, vol. 2, pp. 1035–1038 (2001)
[7] Rahimi, K., Diorio, C., Hernandez, C., Brockhausen, M.D.: A simulation model for floating-gate MOS synapse transistors. In: ISCAS 2002, Proc. of the 2002 IEEE International Symposium on Circuits and Systems, May 2002, vol. 2, pp. 532–535 (2002)
[8] Ramírez-Angulo, J., López-Martín, A.J., González Carvajal, R., Muñoz Chavero, F.: Very low-voltage analog signal processing based on quasi-floating gate transistors. IEEE Journal of Solid-State Circuits 39(3), 434–442 (2004)
[9] Schrom, G., Selberherr, S.: Ultra-Low-Power CMOS Technologies (Invited paper). In: Proc. of International Semiconductor Conference, vol. 1, pp. 237–246 (1996)
[10] Aunet, S.: Real-time reconfigurable devices implemented in UV-light programmable floating-gate CMOS. Ph.D. Dissertation 2002:52, Norwegian University of Science and Technology, Trondheim, Norway (2002)
[11] Rabaey, J.M.: Digital Integrated Circuits - A Design Perspective, pp. 188–193. Prentice Hall, Englewood Cliffs (2003)
[12] Alioto, M., Palumbo, G.: Impact of Supply Voltage Variations on Full Adder Delay: Analysis and Comparison. IEEE Transactions on Very Large Scale Integration (VLSI) Systems 14(12) (December 2006)
[13] Granhaug, K., Aunet, S.: Six Subthreshold Full Adder Cells characterized in 90 nm CMOS technology. Design and Diagnostics of Electronic Circuits and Systems, 25–30 (April 2006)
[14] Alfredsson, J., Aunet, S., Oelmann, B.: Small Fan-in Floating-gate Circuits with Application to an Improved Adder Structure. In: Proc. of 20th International Conference on VLSI Design, January 2007, Bangalore, India (2007)
[15] Shams, A.M., Bayoumi, M.A.: A Framework for Fair Performance Evaluation of 1-bit Full Adder Cells. In: 42nd Midwest Symposium on Circuits and Systems, vol. 1, pp. 6–9 (1999)
[16] Seo, I., Fox, R.M.: Comparison of Quasi-/Pseudo-Floating Gate Techniques. In: Proceedings of the International Symposium on Circuits and Systems, ISCAS 2004, May 2004, vol. 1, pp. 365–368 (2004)
[17] Alfredsson, J., Oelmann, B.: Influence of Refresh Circuits Connected to Low Power Digital Quasi-Floating Gate Designs. In: Proceedings of the 13th IEEE International Conference on Electronics, Circuits and Systems (ICECS 2006), December 2006, Nice, France (2006)
[18] International Technology Roadmap for Semiconductors, Webpage documents, http://public.itrs.net
[19] Lande, T.S., Wisland, D.T., Sæther, T., Berg, Y.: Flogic – Floating Gate Logic for Low-Power Operation. In: Proceedings of the International Conference on Electronics, Circuits and Systems (ICECS'96), April 1996, vol. 2, pp. 1041–1044 (1996)
Low-Power Digital Filtering Based on the Logarithmic Number System Ch. Basetas, I. Kouretas, and V. Paliouras Electrical and Computer Engineering Department, University of Patras, Greece
Abstract. This paper investigates the use of the Logarithmic Number System (LNS) as a low-power design technique for signal processing applications. In particular, we focus on power reductions in implementations of FIR and IIR filters. It is shown that LNS requires a reduced word length compared to linear representations for cases of practical interest. Synthesis of circuits that perform basic arithmetic operations using a 0.18μm 1.8V CMOS standard-cell library reveals that power dissipation savings of more than 60% are possible in some cases.
1
Introduction
Data representation plays an important role in low-power signal processing system design, since it affects both the switching activity and the processing circuit complexity [1][2]. Over the last decades, the Logarithmic Number System (LNS) has been investigated as an efficient way to represent data in VLSI processors. Traditionally, the motivation for considering LNS as a possibly efficient data representation in VLSI is the inherent simplification of the basic arithmetic operations: multiplication, division, roots, and powers are reduced to addition, subtraction, and right and left shifts, respectively, due to the properties of the logarithm. Beyond the simplification of basic arithmetic, LNS offers interesting behavior in terms of roundoff error, resembling that of floating-point arithmetic. In fact, LNS-based systems have been proposed with characteristics similar to 32-bit single-precision floating-point representation [3]. Recently, the LNS has been proposed as a means to reduce power dissipation in signal processing-related applications, ranging from hearing-aid devices [4] and subband coding [5] to video processing [6] and error control [7]. The properties of logarithmic arithmetic have been studied [8,9], and it has been demonstrated that, under particular conditions, the choice of the parameters of the representation can reduce the switching activity while guaranteeing the quality of the output, evaluated in terms of measures such as the signal-to-noise ratio (SNR). The impact of the selection of the base b of the logarithm has been investigated as a means to explore trade-offs between precision and dynamic range for a given word length. However, these works treat the subject at a representational level only, without power estimation data based on simulations.

N. Azemard and L. Svensson (Eds.): PATMOS 2007, LNCS 4644, pp. 546–555, 2007.
© Springer-Verlag Berlin Heidelberg 2007
In this paper we demonstrate that there are practical cases where a reduced LNS representation can replace a linear representation of longer word length, without imposing any degradation on the signal quality. Furthermore, we quantify the power dissipated by synthesized LNS arithmetic circuits and equivalent fixed-point circuits, to demonstrate that substantial power savings are possible. Finally, the implementation of the LNS processing units in a 0.18μm library is discussed. It is shown that, for a practical range of word lengths, gate-level implementations of the required look-up tables provide significant benefits over the linear approach, while the use of low-power techniques such as enabled latched inputs to independent blocks significantly reduces the power dissipation. In fact, the basic structure of the LNS adder allows the easy application of the aforementioned technique. The remainder of the paper is organized as follows: Section 2 describes the basics of LNS operation. Section 3 discusses cases where a smaller LNS word length suffices when compared to a linear fixed-point representation. Section 4 discusses the power aspects of this approach, while conclusions are discussed in Section 5.
2
LNS Basics
In LNS, a number X is represented as the triplet

X = (z_x, s_x, x),   (1)

where z_x is asserted in the case that X is zero, s_x is the sign of X, and x = log_b(|X|), with b the base of the logarithm and of the representation. The choice of b plays a crucial role in the representational capabilities of the triplet in (1), as well as in the computational complexity of the processing and of the forward and inverse conversion units. Due to basic properties of the logarithm, the multiplication of X and Y is reduced to the computation of the triplet Z,

Z = (z_z, s_z, z),   (2)

where z_z = z_x or z_y, s_z = s_x xor s_y, and z = x + y. The case of division is similar. The derivation of the logarithm s of the sum S of two triplets is more involved, as it relies on the computation of

s = max{x, y} + log_b(1 + b^(−|x−y|)).   (3)

Similarly, the derivation of the difference of two numbers requires the computation of

d = max{x, y} + log_b(1 − b^(−|x−y|)).   (4)
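The operations above can be made concrete with a small behavioural model. This is our illustrative sketch, not from the paper; the base B = 1.5 is one of the values used in the FIR experiments of Section 3 and is otherwise arbitrary. In hardware, the log_b(1 ± b^−|x−y|) terms of (3) and (4) are read from the add/subtract look-up tables described in Section 4; here they are simply computed.

```python
import math

B = 1.5  # base of the logarithm; an example value

def to_lns(X):
    """Encode a real X as the triplet (z, s, x) of Eq. (1)."""
    if X == 0:
        return (True, 0, 0.0)
    return (False, 0 if X > 0 else 1, math.log(abs(X), B))

def from_lns(t):
    """Decode a triplet back to a real number."""
    z, s, x = t
    return 0.0 if z else (-1.0) ** s * B ** x

def lns_mul(tx, ty):
    """Eq. (2): zero flags OR-ed, sign bits XOR-ed, logarithms added."""
    zx, sx, x = tx
    zy, sy, y = ty
    z = zx or zy
    return (z, sx ^ sy, 0.0 if z else x + y)

def lns_add(tx, ty):
    """Sum of two triplets via Eq. (3) (same signs) or Eq. (4) (opposite)."""
    zx, sx, x = tx
    zy, sy, y = ty
    if zx:
        return ty
    if zy:
        return tx
    d = abs(x - y)
    if sx == sy:                      # add LUT: log_b(1 + b^-d)
        return (False, sx, max(x, y) + math.log(1 + B ** -d, B))
    if d == 0:                        # equal magnitudes, opposite signs
        return (True, 0, 0.0)
    s_big = sx if x > y else sy       # sign follows the larger magnitude
    # subtract LUT: log_b(1 - b^-d)
    return (False, s_big, max(x, y) + math.log(1 - B ** -d, B))
```

Multiplication is a plain addition of logarithms, while addition needs the nonlinear terms of (3)/(4); this asymmetry is exactly why the LNS adder of Section 4 is the costly unit and the LNS multiplier is a simple binary adder.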
548
C. Basetas, I. Kouretas, and V. Paliouras
Assume that a two's-complement word is used to represent the logarithm x, composed of a k-bit integral part and an l-bit fractional part. The range D_LNS spanned by x is

D_LNS = [b^(−2^(k−1)), b^(2^(k−1) − 2^(−l))] ∪ {0} ∪ [−b^(2^(k−1) − 2^(−l)), −b^(−2^(k−1))],   (5)

to be compared with the range (−2^(i−1), 2^(i−1) − 2^(−f)) of a linear two's-complement representation of i integral bits and f fractional bits. In general, LNS offers a superior range over the linear two's-complement representation. This is achieved using comparable word lengths, by departing from the strategy of equispaced representable values and by resorting to a scheme that resembles floating-point arithmetic.
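As a numeric illustration of (5) (ours; the parameters b = 1.2301, k = 4, l = 5 are taken from the second-order-structure experiment of Section 3), the positive magnitudes representable by a 9-bit LNS word can be compared with those of a 9-bit linear word:

```python
def lns_positive_range(b, k, l):
    """Smallest and largest positive magnitudes of Eq. (5): b**xmin .. b**xmax,
    where x is a two's-complement word with k integral and l fractional bits."""
    xmin = -2 ** (k - 1)
    xmax = 2 ** (k - 1) - 2 ** -l
    return b ** xmin, b ** xmax

def linear_positive_range(i, f):
    """Smallest and largest positive values of an (i, f)-bit
    two's-complement fixed-point word."""
    return 2 ** -f, 2 ** (i - 1) - 2 ** -f

lns_lo, lns_hi = lns_positive_range(1.2301, 4, 5)   # 9-bit LNS word
lin_lo, lin_hi = linear_positive_range(4, 5)        # 9-bit linear word
```

For this small base the two positive ranges are comparable (roughly 0.19–5.2 for LNS versus 1/32–7.97 for linear), but the LNS values are spaced relatively rather than absolutely, and increasing b or k extends the LNS range far faster than adding integral bits extends the linear one.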
3
LNS Representation in the Implementation of Digital Filters
By capitalizing on the representational properties of LNS, this section investigates particular filter structures and shows that a word-length reduction due to LNS is feasible in practical cases. 3.1
Feedback Filters
Fig. 1 depicts a second-order structure, the impulse response of which is shown in Fig. 2(a) for the case of a1 = 489/256 and a2 = −15/16. The impulse response for various word-length choices, assuming a two's-complement fixed-point implementation, is shown in Fig. 2(b). Similarly, the impulse response of the particular structure implemented in LNS is shown in Fig. 3(a), for different word lengths and various values of the base b. The choice of the LNS base b is important, as it affects the required word length and therefore the complexity of the underlying VLSI implementation of the LNS adder and multiplier. Furthermore, the selection of the base b greatly affects the representational precision of the filter coefficients,
Fig. 1. A second-order feedback structure
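The impulse responses of Fig. 2 can be reproduced with a short simulation of the structure in Fig. 1, y[n] = x[n] + a1·y[n−1] + a2·y[n−2]. This is our sketch; the rounding scheme used for the fixed-point curves is an assumption, as the paper does not specify one.

```python
from math import floor

def quantize(v, frac_bits):
    """Round v to the nearest multiple of 2**-frac_bits
    (assumed rounding scheme)."""
    step = 2.0 ** -frac_bits
    return floor(v / step + 0.5) * step

def impulse_response(n_samples, frac_bits=None):
    """Impulse response of the Fig. 1 structure,
    y[n] = x[n] + a1*y[n-1] + a2*y[n-2], with a1 = 489/256, a2 = -15/16.
    If frac_bits is given, each output sample is re-quantized to that many
    fractional bits, mimicking a fixed-point implementation."""
    a1, a2 = 489.0 / 256.0, -15.0 / 16.0
    y1 = y2 = 0.0
    out = []
    for n in range(n_samples):
        x = 1.0 if n == 0 else 0.0
        y = x + a1 * y1 + a2 * y2
        if frac_bits is not None:
            y = quantize(y, frac_bits)
        out.append(y)
        y1, y2 = y, y1
    return out

ideal = impulse_response(300)          # cf. Fig. 2(a)
coarse = impulse_response(300, 6)      # cf. the 4+6-bit curve of Fig. 2(b)
```

The poles of this structure have magnitude sqrt(15/16) ≈ 0.968, so the ideal response is a slowly decaying oscillation; quantizing the recursion perturbs exactly this marginal decay, which is why the fractional word length matters so much here.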
Fig. 2. Impulse response of a second-order structure: (a) ideal and (b) for various fractional word lengths (4+5, 4+6 and 4+8 bits), assuming fixed-point representation
Fig. 3. Impulse response of a second-order structure (a) for various word lengths (k, l), of k integral and l fractional bits, assuming logarithmic representations with different values of the base b (b = 1.2955, k = 3, l = 3; b = 1.6782, k = 3, l = 4; b = 1.2301, k = 5, l = 5), and (b) the response for various values of b (1.2301, 1.5700, 1.5800). Notice the difference between the impulse responses for b = 1.57 and b = 1.58.
to such an extent that even a difference of 0.01 in the value of the base b can severely alter the impulse response of the second-order structure. The behavior of the impulse response as a function of the base is depicted in Fig. 3(b). The experimental results tabulated in Tables 1(a) and 1(b) show that an LNS-based system using b = 1.2301 and a 9-bit word, which includes a 5-bit fractional part, is equivalent to a 12-bit fixed-point two's-complement system. In every case an additional sign bit is implied. 3.2
FIR Filters
The word lengths required by LNS and fixed-point implementations of practical filters have been determined. The performance of a raised-cosine filter is studied, assuming a zero-mean Gaussian input with variance 1/3, i.e., taking values in (−1, 1), and various values of the input signal correlation factor ρ have been tested, namely −0.99, −0.5, 0, 0.5 and 0.99.
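This experimental setup can be emulated as follows. This is our sketch: the paper does not describe its signal generator, so a first-order autoregressive (AR(1)) model with the stated variance of 1/3 is assumed, and the SNR is taken as the usual ratio of signal power to error power in dB.

```python
import math
import random

def correlated_gaussian(n, rho, var=1.0 / 3.0, seed=0):
    """First-order autoregressive Gaussian sequence with correlation
    factor rho and stationary variance var (assumed generation model)."""
    rng = random.Random(seed)
    sigma_e = math.sqrt(var * (1.0 - rho * rho))   # innovation std. dev.
    x, out = 0.0, []
    for _ in range(n):
        x = rho * x + rng.gauss(0.0, sigma_e)
        out.append(x)
    return out

def snr_db(reference, measured):
    """Signal-to-noise ratio of 'measured' against 'reference', in dB."""
    sig = sum(r * r for r in reference)
    err = sum((r - m) ** 2 for r, m in zip(reference, measured))
    return 10.0 * math.log10(sig / err)
```

Feeding such sequences through an ideal filter and its quantized counterpart, and applying `snr_db` to the two outputs, reproduces the kind of word-length/SNR comparison tabulated in Tables 2 and 3.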
Table 1. SNR for various word lengths of a linear (a) and a logarithmic (b) implementation of a second-order structure

(a)
word length (bits)   SNR (dB)
4+5                   9.65
4+6                  12.15
4+8                  24.5
4+9                  39.23

(b)
word length (bits)   b        SNR (dB)
3+3                  1.2955   14.29
3+4                  1.6782   18.28
4+5                  1.2301   25.54
5+5                  1.2301   31.64
The experimental results shown in Tables 2(a) and 2(b) reveal that an LNS-based implementation demonstrates behavior equivalent to the fixed-point system, using 9 bits instead of 10 bits for ρ = −0.5 and ρ = 0. Furthermore, a 9-bit LNS-based system is found to be equivalent to a linear system of more than 10 bits for the case of ρ = −0.99, while for ρ = 0.5 and ρ = 0.99 both systems exhibit identical performance for the same number of bits (9 bits). Further experiments were conducted employing filters designed using the Parks-McClellan algorithm. The filters were excited using an identical set of input signals. The relation of the achieved SNR to the employed word length for the LNS and linear fixed-point implementations is tabulated in Tables 3(a) and 3(b).

Table 2. Word lengths and SNR for various values of the input signal correlation ρ, assuming (a) fixed-point two's-complement representation and (b) LNS word organization

(a)
word length (bits)   SNR (dB)   ρ
0+10   15.62   −0.99
0+11   20.71   −0.99
0+12   26.21   −0.99
0+10   29.45   −0.5
0+11   35.14   −0.5
0+12   40.60   −0.5
0+10   32.27   0
0+11   38.76   0
0+12   44.11   0
0+9    29.18   0.5
0+10   34.08   0.5
0+11   40.25   0.5
0+9    30.08   0.99
0+10   36.40   0.99
0+11   41.45   0.99

(b)
word length (bits)   b     SNR (dB)   ρ
4+5   1.5   18.04   −0.99
4+6   1.7   21.30   −0.99
4+5   1.5   29.49   −0.5
4+6   1.5   33.54   −0.5
4+5   1.5   29.79   0
4+6   1.5   35.13   0
4+5   1.5   30.13   0.5
4+6   1.5   35.84   0.5
4+5   1.5   28.89   0.99
4+6   1.5   34.39   0.99
Table 3. (a) SNR and integral and fractional word lengths in LNS for various values of ρ. (b) Word lengths and SNR in a fixed-point filter, for various values of ρ. An additional sign bit is implied.

(a)
word length (bits)   b     SNR (dB)   ρ
4+5   1.5   29.47   −0.99
4+6   1.5   34.37   −0.99
4+5   1.5   34.34   −0.5
4+6   1.5   39.89   −0.5
4+5   1.5   34.58   0
4+6   1.5   41.10   0
4+5   1.5   28.29   0.5
4+6   1.5   34.81   0.5
4+5   1.5   22.00   0.99
4+6   1.5   28.83   0.99

(b)
word length (bits)   SNR (dB)   ρ
1+10   22.56   −0.99
1+11   28.06   −0.99
1+12   34.65   −0.99
0+10   32.73   −0.5
0+11   38.91   −0.5
0+12   44.57   −0.5
0+9    27.70   0
0+10   33.48   0
0+11   39.47   0
0+9    25.92   0.5
0+10   31.69   0.5
0+11   37.82   0.5
0+9    10.02   0.99
0+10   15.54   0.99
0+11   21.95   0.99
The results show that the LNS-based system demonstrates behavior equivalent to the linear implementation, requiring only 9 bits instead of 10 for the cases of ρ = −0.5, ρ = 0 and ρ = 0.5, while for the case of ρ = 0.99, 9 bits are required instead of 11 for the linear case. The experimental results reveal that LNS achieves acceptable behavior using a shorter word length than a fixed-point structure. The area, delay and power dissipation required for the basic LNS operations, in comparison to the corresponding fixed-point operations, are detailed in the following section.
4
Power Dissipation of LNS Addition and Multiplication
The basic organization of an LNS adder/subtractor is shown in Fig. 4. Power measurements, assuming Gaussian input distributions with zero mean and various values of the variance, yield the results of Table 4. The results refer to a 0.18μm CMOS library operating with a supply voltage of 1.8V. The particular LNS adder requires an area of 8785μm² and has a delay of 4.76ns, while the corresponding linear two's-complement multiplier, organized as a carry-save adder array, requires 11274μm² and has a delay of 4.22ns. Fixed-point addition and LNS multiplication are both implemented by binary adders. The area and delay of addition in two's-complement fixed-point (833.45μm², 2.06ns) is compared to the cost of multiplication in LNS (639μm², 1.58ns). The corresponding power dissipation figures are given in Table 5.
Fig. 4. The organization of an LNS adder/subtractor

Table 4. Power dissipation of an LNS adder/subtractor compared to power dissipation of an equivalent linear fixed-point multiplier

σ²    LNS (mW)   FXP (mW)   savings (%)
0.1   1.214      2.286      46.9
0.5   1.307      2.233      41.5
1.0   1.338      2.275      41.2
1.5   1.359      2.318      41.6
1.7   1.356      2.277      40.4
Table 5. Power dissipation of an LNS multiplier compared to power dissipation of an equivalent linear fixed-point adder

σ²    LNS (μW)   FXP (μW)   savings (%)
0.1   84.78      116.27     27.1
0.5   83.75      115.07     27.2
1.0   85.57      116.8      26.7
1.5   85.17      116.12     26.5
1.7   85.05      115.57     26.4
Table 5 shows that power savings are achieved by adopting LNS, due to the reduced word length. The organization of the LNS adder comprises two look-up tables, one of which is used when the operands have the same sign (add LUT in Fig. 4), while the other is used when the operands have different signs (subtract LUT). Significant power savings are achieved by latching the inputs to the look-up table which is not used in a particular addition. An exclusive-or operation on the signs of the operands provides an enable signal to an input latch and thus inhibits any unnecessary switching in the LUT not required in a particular operation, as shown in Fig. 5. When employing this scheme with level-sensitive latches, care should be taken to avoid timing violations related to latch set-up times. The benefits of this scheme are tabulated in Table 6; the enhanced adder requires 10290μm² and has a delay of 3.08ns.

Fig. 5. Latched inputs to the LUTs

Table 6. Comparison of power dissipated by the enhanced LNS adder to an equivalent fixed-point multiplier

σ²    LNS (mW)   FXP (mW)   savings (%)
0.1   0.619      2.286      72.9
0.5   0.833      2.233      62.6
1.0   0.889      2.275      60.3
1.5   0.896      2.318      61.3
1.7   0.903      2.277      61.0

Table 7 summarizes the area, time, and power dissipation of synthesized LNS adders, while Table 8 presents synthesized multiplier performance data for comparison purposes.

Table 7. Area, time, and power dissipation of synthesized LNS adders

                 LNS adder                             with latches
word length   Area (μm²)  Delay (ns)  Power (mW)    Area (μm²)  Delay (ns)  Power (mW)
4/7           5370.70     3.66        0.723         6118.73     2.46        0.678
5/9           8785.73     4.76        1.338         10289.98    3.08        0.889
6/10          15058.89    5.33        2.049         17091.92    3.91        1.525
7/11          24722.73    6.40        2.887         28931.14    4.50        1.966

Table 8. Synthesized linear multiplier data

word length   Area (μm²)   Delay (ns)   Power (mW)
11            9485.4       3.94         1.877
12            11274.3      4.22         2.275
13            13173.0      4.37         2.759

In this paper, the issue of conversion between LNS and a linear representation is not dealt with. It is assumed that a sufficiently large amount of computation can be performed within LNS, and that any conversions are limited to input/output; therefore any conversion overhead, if required, is compensated by the overall benefits. All power figures assume a 50 MHz word data rate and are obtained by the following procedure. Scripts developed in Matlab generate the VHDL descriptions of the LNS adders, as well as the input signals for the simulations. The VHDL models are synthesized, and simulations provide switching activity information which subsequently back-annotates the designs, to improve the accuracy of power estimation.
5
Discussion and Conclusions
It has been shown that the use of LNS in certain DSP kernels can significantly reduce the required word length. By exploiting the reduced word length, combined with the circuit simplifications afforded by the use of LNS, as well as the isolation of the look-up table which is not used in a particular addition, a significant power dissipation reduction is achieved for practical cases. It should be noted that further power reduction due to LNS can be sought by applying look-up table size reduction techniques, dependent on the particular accuracy requirements [10][7]. Therefore the use of LNS is worth investigating for possible adoption in low-power signal processing systems.
References

1. Stouraitis, T., Paliouras, V.: Considering the alternatives in low-power design. IEEE Circuits and Devices 17, 23–29 (2001)
2. Landman, P.E., Rabaey, J.M.: Architectural power analysis: The dual bit type method. IEEE Transactions on VLSI Systems 3, 173–187 (1995)
3. Arnold, M.G., Bailey, T.A., Cowles, J.R., Winkel, M.D.: Applying features of the IEEE 754 to sign/logarithm arithmetic. IEEE Transactions on Computers 41, 1040–1050 (1992)
4. Morley Jr., R.E., Engel, G.L., Sullivan, T.J., Natarajan, S.M.: VLSI based design of a battery-operated digital hearing aid. In: Proceedings of the IEEE International Conference on Acoustics, Speech and Signal Processing, 1988, pp. 2512–2515. IEEE Computer Society Press, Los Alamitos (1988)
5. Sacha, J.R., Irwin, M.J.: Number representation for reducing switched capacitance in subband coding. In: Proceedings of IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 1998, pp. 3125–3128 (1998)
6. Arnold, M.G.: Reduced power consumption for MPEG decoding with LNS. In: Proceedings of the IEEE International Conference on Application-Specific Systems, Architectures and Processors (ASAP 2002). IEEE Computer Society Press, Los Alamitos (2002)
7. Kang, B., Vijaykrishnan, N., Irwin, M.J., Theocharides, T.: Power-efficient implementation of a turbo decoder in an SDR system. In: Proceedings of the IEEE International SOC Conference, 2004, pp. 119–122 (2004)
8. Paliouras, V., Stouraitis, T.: Low-power properties of the Logarithmic Number System. In: Proceedings of the 15th Symposium on Computer Arithmetic (ARITH15), Vail, CO, June 2001, pp. 229–236 (2001)
9. Paliouras, V., Stouraitis, T.: Logarithmic number system for low-power arithmetic. In: Soudris, D.J., Pirsch, P., Barke, E. (eds.) PATMOS 2000. LNCS, vol. 1918, pp. 285–294. Springer, Heidelberg (2000)
10. Taylor, F., Gill, R., Joseph, J., Radke, J.: A 20 bit Logarithmic Number System processor. IEEE Transactions on Computers 37, 190–199 (1988)
A Power Supply Selector for Energy- and Area-Efficient Local Dynamic Voltage Scaling Sylvain Miermont1 , Pascal Vivet1 , and Marc Renaudin2 1
CEA-LETI/LIAN, MINATEC, 38054 Grenoble, France {sylvain.miermont,pascal.vivet}@cea.fr 2 TIMA Lab./CIS group, 38031 Grenoble, France [email protected]
Abstract. In systems-on-chip, dynamic voltage scaling allows energy savings. If only one global voltage is scaled down, that voltage cannot be lower than the voltage required by the most constrained functional unit to meet its timing constraints. Fine-grained dynamic voltage scaling allows better energy savings, since each functional unit has its own independent clock and voltage, making the chip globally asynchronous and locally synchronous. In this paper we propose a local dynamic voltage scaling architecture, adapted to globally asynchronous, locally synchronous systems, based on a technique called Vdd-hopping. Compared to traditional power converters, the proposed power supply selector is small and power-efficient, with no need for large passives or costly technological options. The design has been validated in a STMicroelectronics CMOS 65nm low-power technology.
1
Introduction
With the demand for more autonomy in mobile equipment, and with general-purpose microprocessors hitting the power wall, there is an unquestionable need for techniques that increase the power efficiency of computing. As CMOS technology scales down, the share of leakage energy in the total energy budget tends to increase; however, reducing dynamic power still remains a major issue. This is caused by the use of new telecommunication standards, highly compressed multimedia formats, high-quality graphics and, more generally, the greater algorithmic complexity of today's applications. For digital circuits, P_total ∼ k1·Vdd² + k2·Vdd and Fmax ∼ k·Vdd, so the energy-per-operation E_op = P_total/Fmax scales with Vdd. Hence, DVS (Dynamic Voltage Scaling) techniques allow an energy gain when computing speed can be scaled down without violating real-time constraints. On the other hand, with the increased complexity of current and future integrated circuits, global clock distribution, design reuse, intra-chip communications and whole-chip verification are becoming severe engineering challenges. Thus new

N. Azemard and L. Svensson (Eds.): PATMOS 2007, LNCS 4644, pp. 556–565, 2007.
© Springer-Verlag Berlin Heidelberg 2007
engineering methodologies are emerging to replace the 'all-synchronous' way of designing circuits, one of them being the GALS (Globally Asynchronous Locally Synchronous) technique. In a GALS chip there is no global clock, but rather a set of synchronous 'islands' communicating through an asynchronous interconnection network. An example of a GALS telecommunication SoC (System-on-Chip) is given in [1]; for GALS microprocessors, see [2]. As each synchronous island in a GALS chip has its own independent clock, each clock frequency can be individually modified and locally managed, so it is possible to change the voltage of each island: this is the principle of LDVS (Local Dynamic Voltage Scaling). The purpose of this work is to design an energy- and area-efficient way to perform LDVS on a GALS chip. See [3] for more details about the benefits of GALS and LDVS. The first section below describes our architecture proposal for LDVS, followed by the power supply selector architecture and its behaviour and, finally, the results obtained so far for this PSS and our conclusions.
2
LDVS Architecture Proposal
In a GALS-SoC, functional units, classically called 'IP blocks', form synchronous islands. In our proposal (see fig. 1), the chip is composed of many units (the functional part of a unit is called 'the core') and a GALS interconnect for inter-unit communications (e.g. an asynchronous network-on-chip, see [1]).
Fig. 1. LDVS chip architecture proposal: a GPM (Global Power Manager) per chip, and for each synchronous unit, a LCG (Local Clock Generator), a LPM (Local Power Manager) and a PSS (Power Supply Selector)
One unit of the chip acts as a global power manager. The global power manager enforces the chip energy policy by sending commands through the GALS interconnect to local power managers and receiving back some status information. Local power managers can be specialized for their units in order to reduce energy by self-adapting to local conditions, such as temperature or process, but
also by using power-optimization algorithms (e.g. Earliest Deadline First if the unit is a processor with a scheduler, as seen in [4]).

Local Clock Generation. Each unit has its own LCG (Local Clock Generator), designed to run the unit at the maximum frequency allowed by local conditions. This LCG can be a programmable ring oscillator, as suggested in [3].

Local Power Supply. Each unit has its own power converter, driven by the local power manager. In conventional DVS techniques, the converter is either inductive, capacitive or linear. Inductive and capacitive converters have good maximum efficiency but need passives (inductances and capacitances); a typical SoC can have tens of units, so using external components is not an option. Passives can be integrated, but that would take a lot of area or would require special technological options in the manufacturing process. Linear converters do not use any passives, but their efficiency decreases as the output voltage is lowered, so the energy-per-operation does not scale down. All these converters also need a regulation circuit that consumes power and lowers the total efficiency of the device. These three types of converters have been integrated for various purposes [5,6,7] but do not fit the needs of current and future SoC platforms. On the other hand, a 'discrete' DVS technique, using only two set-points instead of a continuously adjustable voltage, is presented in [8] and is better suited for integration. That paper showed that significant power is saved using two set-points, and that the additional gain of using an infinite number of set-points is low. Architectures using this principle, called 'Vdd-Hopping', have been presented in [9,10,11]. In our LDVS architecture (see fig. 1), Vdd-Hopping is implemented with two external supply voltages, called Vhigh (nominal voltage, unit running at nominal speed) and Vlow (reduced voltage, unit running at reduced speed), and a PSS (Power Supply Selector) for each unit.
Typically, the two voltages will be supplied by two fixed-voltage inductive DC/DC converters, which can be partially integrated (passives on the board) or not. The efficiency of these converters must be taken into account when calculating the total system efficiency (see section 4.1). Power-switch architectures have been proposed (see [10,12,13]) for hopping from one source to another (this is called a transition), but their simplicity comes with serious disadvantages. The first problem is that there is no control of the transition duration (i.e. the dV/dt slope on Vcore). The power lines from the external sources to the PSS form an RLC (resistive-inductive-capacitive) network, so if the current changes too rapidly, the supply voltage of a unit could fall under a safe voltage limit, leading to memory losses. Moreover, the unit power grid delays the voltage transition, so a fast transition from Vhigh to Vlow creates a large ΔV over the unit power grid. If the clock generator is a ring oscillator, a large ΔV can cause a mismatch between the generator delay element and the critical-path delay, leading to errors. So, without a careful
analysis of the power supply networks of the chip, there is no guarantee that the core will make no errors during transitions. Besides, if both power switches are off at the same time, the unit voltage can drop, while if both power switches are on, a large current will flow from the higher-voltage source to the lower-voltage source. This is a second problem to solve, because voltage regulators are usually not designed to receive current from the load.
3
The Power Supply Selector
The principle we use is the following: when no transition is needed, the system is in a 'sleep' mode and one of the sources supplies the power with minimum losses. When a transition is needed, the power element between Vcore and Vhigh is used as a linear regulator, so that Vcore ≈ Vref, and the power element between Vlow and Vcore is switched on and off only when Vref ≈ Vlow. With this principle, using the appropriate sequence, we can switch smoothly between the two power supplies. Notice that all elements of the PSS are powered by the Vhigh source. In designing the power supply selector, the system constraints were:
– the device area must be moderate with respect to the unit core area (< 20%),
– the power efficiency must be as good as possible (> 80%), and the total energy-per-operation must scale down with computing speed,
– the voltage must always be sufficient for the unit to run without error, even during transitions from one source to another,
– current must never flow from one source to another.
3.1
PSS Architecture
Power Switches and Soft-Switch. The power switches are the main elements of the power supply selector (fig. 2). They are made of a group of PMOS transistors. Between Vlow and Vcore is the Tlow transistor, sized to minimize resistive
Fig. 2. Power Supply Selector architecture
560
S. Miermont, P. Vivet, and M. Renaudin
losses when in full conduction with the unit at maximum power. The inverter driving its gate is sized so that the transistor switches on and off relatively slowly (over a dozen clock periods). Between Vhigh and Vcore are the Ntrans Thigh transistors (Ntrans = 24), connected in parallel with common drain, source and bulk, but separate gates. The sum of the Thigh transistor widths is chosen to minimize losses; the width of each transistor is chosen so that Vcore changes by about ±Vhigh/Ntrans when a transistor is switched on or off using a thermometer code. The Ntrans inverters driving their gates are sized to minimize the switch-on and switch-off times (≪ Tclk, ≈ 100 ps in our case). The soft-switch is a synchronous digital element acting as a 1-bit-input digital integrator with an Ntrans-bit thermometer-coded output. When this element is enabled, at each clock cycle, if its input is a binary '1', it switches on the previously-off output bit with the lowest index number (1 to Ntrans in this case); if its input is a binary '0', it switches off the previously-on output bit with the highest index number. Combining the Ntrans Thigh transistors and the soft-switch element gives us a 'virtual' Thigh transistor whose effective width can be increased or decreased (in predetermined non-uniform steps) according to a binary signal.

DAC and Comparator. The DAC is a 5-bit R-2R digital-to-analog converter used to generate voltage ramps between Vhigh and Vlow − Δs, with Δs ≈ Δr, the resistive losses through Tlow when Vlow is the selected power supply (see section 3.2 for the use of Δs). The ramp duration is set in the controller, which generates codes for the DAC accordingly. The DAC output impedance must be low enough to drive the comparator's input (a capacitive load) without limiting the minimum duration of the ramp. The comparator must be fast (≈ 150 ps in our case) and must be able to compare voltages close to the supply voltage.
None of these elements' characteristics depend on the unit, so no redesign is needed until the technology is changed. When the comparator's output is connected to the soft-switch input, these two elements and the Thigh power switches become a closed-loop system acting as a linear voltage regulator, maintaining Vcore = Vref. This regulation is integral, with a 1-bit error signal, so Vcore oscillates around Vref with an amplitude depending on the loop delay, the clock period Tclk, and the number of Thigh transistors Ntrans.

Other Elements. To avoid any dependence on the core LCG, our PSS has its own clock generator, a ring oscillator with a programmable delay element (so the clock frequency is digitally adjustable). The clock frequency does not need to be accurate, but must be fast enough for the closed control loop to regulate variations of the unit current (in our PSS, Fclk ≈ 1 GHz). In our implementation of the PSS, the ramp duration Tramp, the low reference point offset Δs and the clock frequency Fclk are tunable, so there are elements, not represented on the diagram, that allow the GPM to configure these parameters. Finally, the controller plays the hopping sequence and disables elements of the PSS to save power when there is no need to make a transition.
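The closed loop formed by the comparator, the soft-switch and the Thigh transistors can be sketched behaviourally. The following is a minimal Python model under assumed element values (the per-transistor conductance, load resistance and Vref used below are illustrative, not taken from the paper); it only shows the 1-bit integral behaviour, with Vcore settling into an oscillation around Vref:

```python
def simulate_loop(vref, cycles=200, v_high=1.2, n_trans=24,
                  g_on=0.5, r_load=10.0):
    """Bang-bang regulation: each clock cycle, the comparator's 1-bit output
    feeds the soft-switch, which turns one more Thigh transistor on or off.
    With k Thigh transistors on, the switch resistance is 1/(k*g_on), and
    Vcore is set by the divider formed with a purely resistive load."""
    k, trace = n_trans, []                  # start in the 'all-on' state
    for _ in range(cycles):
        if k:
            r_sw = 1.0 / (k * g_on)
            v_core = v_high * r_load / (r_load + r_sw)
        else:
            v_core = 0.0
        k += 1 if v_core < vref else -1     # soft-switch integrates the error
        k = max(0, min(n_trans, k))
        trace.append(v_core)
    return trace

trace = simulate_loop(vref=1.1)
# after settling, Vcore oscillates in a narrow band around Vref
```

The amplitude of the residual oscillation in this toy model depends on the per-transistor step, mirroring the paper's remark that it depends on Ntrans and the loop timing.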
3.2 Hopping Sequence
Falling Transition. A falling transition occurs when switching from Vhigh to Vlow. When the LPM sends a signal to the PSS requesting a hop, the PSS clock generator wakes up and the controller starts its hopping sequence.
Fig. 3. Falling transition chronogram
First, the controller sends an enable signal to all elements and waits a few clock periods for them to stabilize ('init' on fig. 3). The comparator output is connected to the soft-switch input, enabling the closed-loop regulation. In this state, Vref = Vhigh and Vcore = Vhigh − Δr, with Δr representing the resistive losses through the Thigh power transistors, so that Vref > Vcore and all Thigh transistors stay on. Then, the controller sends codes to the DAC so that Vref falls with a controlled slope; Vcore follows Vref, and the voltage across the Thigh transistors rises ('transi' on fig. 3). This phase ends when Vref reaches its lowest point, a voltage equal to Vlow − Δs. As Vcore ≈ Vref < Vlow, the Tlow transistor can be switched on without any reverse current flowing through it. In this phase ('switch' on fig. 3), the Tlow transistor slowly switches on (thanks to an undersized driver), while more Thigh transistors switch off to compensate. At this point, Vref ≈ Vlow − Δs ≈ Vlow − Δr (with Δr the resistive losses through the Tlow transistor), so Thigh might not be totally off. In the last phase ('end' on fig. 3), the controller opens the control loop, forces all Thigh transistors to switch off, disables all elements, waits a few clock periods and stops the clock generator. Core power is then supplied only by the Vlow source, and the PSS does not consume any dynamic power.

Rising Transition. When switching from Vlow to Vhigh, the controller starts by enabling all elements and closing the control loop ('init' on fig. 4). Then, Tlow is switched off and some Thigh transistors switch on, so that Vcore ≈ Vlow ('switch' on fig. 4). Then Vref slowly rises, Vcore follows, and this phase ('transi' on fig. 4) ends when Vref reaches its highest point and the soft-switch is in the 'all-on' state. Then, all elements are disabled and the clock is stopped. Power to the unit is supplied only by the Vhigh source, and the PSS does not consume any dynamic power.
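The falling sequence can be summarized as a small phase generator. This is an illustrative sketch only: the phase names follow fig. 3 and the 5-bit DAC gives 32 ramp codes, but the number of cycles spent in each non-ramp phase is an assumption:

```python
def falling_sequence():
    """Yield (phase, dac_code) pairs for one Vhigh -> Vlow hop.
    DAC code 31 corresponds to Vref = Vhigh, code 0 to Vref = Vlow - Ds."""
    yield ("init", 31)                  # enable elements, close the loop
    for code in range(30, -1, -1):      # 'transi': Vref ramps down
        yield ("transi", code)
    yield ("switch", 0)                 # Tlow switches on, Thigh compensates
    yield ("end", 0)                    # open loop, force Thigh off, sleep

steps = list(falling_sequence())
```

The rising sequence would replay the same phases with the ramp reversed and the 'switch' phase placed before the ramp, as described above.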
Fig. 4. Rising transition chronogram
4 Results
The PSS is a mixed-mode device, so its design uses VHDL for the digital elements, SPICE for the analog elements, and VHDL-AMS to bind digital and analog elements together. For co-simulation, we use ADVance MS, ModelSim and Eldo from Mentor Graphics. For implementation, we use an STMicroelectronics 65 nm low-power triple-Vt CMOS technology, with synthesis and standard cells for the digital parts and manual layout for the analog parts. The corner case is Vhigh = 1.2 V, Vlow = 0.8 V, T = 25 °C, and the process is considered nominal. The Vlow value was chosen so that the unit logic cells and memories remain functional with a reduced power consumption.

4.1 Power Efficiency
Outside transition phases, the losses in the PSS are the sum of the resistive losses in the power transistors and the leakage current of the device. Thanks to the use of a triple-Vt technology, the leakage power of the PSS is 10^-5 to 10^-4 W, a negligible value (less than 1%) compared to the unit active power. The power transistors Thigh and Tlow are sized so that the resistive loss ratio ρres is less than 3% of Punit, the useful unit power. So, when the PSS is not doing a transition, the power efficiency is around 97%.
During transitions, there are two sources of losses: the energy dissipated by the Thigh transistors when used as a voltage regulator, and the dynamic consumption of the PSS elements. During the ramp time Tramp, a linear ramp is done from Vhigh to Vlow, and the Thigh power loss ratio is ρramp = (Vhigh − Vcore)/Vhigh. In our implementation, for a whole transition, it is the average of ρramp+ = 3% and ρramp− = (Vhigh − Vlow)/Vhigh = 33.3%, so we obtain ρramp = 18%. During the transition time Ttransi = Tramp + Tinit,switch,end, the PSS is active, and its dynamic power is Pdyn.
The total PSS efficiency ηPSS = Punit/(Punit + PPSS) is given by:

PPSS = ((Thop − Ttransi)/Thop) · ρres · Punit + (Tramp/Thop) · ρramp · Punit + (Ttransi/Thop) · Pdyn
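The loss model above is easy to evaluate numerically. A sketch using the parameter values of the example that follows (note that the coefficient obtained directly from the formula differs slightly from the paper's quoted 4.18%, so the resulting efficiencies may deviate by a few tenths of a percent):

```python
def pss_power_mw(p_unit_mw, t_hop=1000.0, t_ramp=100.0, t_transi=140.0,
                 rho_res=0.03, rho_ramp=0.18, p_dyn_mw=4.0):
    """P_PSS from the formula above (times in ns, powers in mW)."""
    return ((t_hop - t_transi) / t_hop) * rho_res * p_unit_mw \
         + (t_ramp / t_hop) * rho_ramp * p_unit_mw \
         + (t_transi / t_hop) * p_dyn_mw

def efficiency(p_unit_mw):
    """Total PSS efficiency: Punit / (Punit + P_PSS)."""
    return p_unit_mw / (p_unit_mw + pss_power_mw(p_unit_mw))

# efficiency(15) and efficiency(150) land near the paper's 92.7% and 95.6%
```

The fixed Pdyn term explains why the efficiency improves with unit power: the dynamic consumption of the PSS is amortized over a larger Punit.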
As an example, for an average time between transitions of Thop = 1 μs (1 MHz hopping frequency), with Tramp = 100 ns, Ttransi = 140 ns, ρramp = 18% and Pdyn = 4 mW (post-synthesis estimation for Fclk = 1 GHz), we obtain the following PSS losses: PPSS = 4.18% · Punit + 0.56 mW. With these parameters:
– for a 15 mW unit, PPSS = 1.19 mW, so the PSS efficiency is ηPSS = 92.7%,
– for a 150 mW unit, PPSS = 6.83 mW, so the PSS efficiency is ηPSS = 95.6%.
To calculate the total system efficiency ηsystem, we must also take into account the efficiency of the external or internal voltage regulators used to supply the global voltages Vhigh and Vlow. If fixed-voltage inductive DC/DC regulators are used, with an efficiency in the 90% range, the total system efficiency is ηsystem ≈ 80%. The efficiency of our system is high compared to systems using linear regulators, and equivalent to the efficiency of systems based on integrated inductive or capacitive converters.

4.2 PSS Area
The area of the PSS is divided into a fixed part, for its digital logic and analog elements, and a variable part, for the power switches and their drivers. The fixed part is composed of the DAC, the PSS clock generator, the comparator and hundreds of logic standard cells (≈ 2500 μm² after synthesis), which gives an estimated size of ≈ 4400 μm².

Table 1. Area results of the PSS for three different units

        peak power   total area       PSS area         relative PSS area
unit A  11.4 mW      322·10³ μm²      ≈ 6.6·10³ μm²    2.06 %
unit B  23.1 mW      899·10³ μm²      ≈ 7.4·10³ μm²    0.83 %
unit C  61.1 mW      1242·10³ μm²     ≈ 10.8·10³ μm²   0.87 %
For the variable part, the power switches are sized according to the unit power consumption, by choosing an acceptable frequency penalty and measuring the unit current over an averaging time window. The detailed sizing method is beyond the scope of this paper; see [14] for more about sizing power switches. Table 1 sums up the area results obtained for some telecom signal-processing units (the area is given after place-and-route). Even for small units, the area cost of the proposed PSS is very low compared to inductive or capacitive integrated voltage converters, and is equivalent to that of linear converters.

4.3 Transition Simulation
The results of a PSS simulation, with a simplified load model, are shown in fig. 5: Vcore and Vref ramp down, and the oscillation of Vcore around Vref is clearly visible, as is the current switching slowly from one source to the other. In this simulation, Ttransi ≈ 300 ns and Fclk ≈ 500 MHz.
Fig. 5. Simulation chronogram of a falling transition
5 Conclusion
This paper presents a local dynamic voltage scaling architecture proposal and the power supply selector on which it is based. The presented power supply selector is as efficient as the best integrated inductive or capacitive regulators, but is much more compact and solves the problems of simple voltage selectors. This work will be implemented in a future technology demonstrator, allowing us to characterize the PSS and measure the power gained by using our LDVS architecture. If the presented results and numbers are confirmed in the final design, we will be able to demonstrate significant power gains in globally-asynchronous, locally-synchronous systems-on-chip.
References
1. Lattard, D., Beigne, E., Bernard, C., Bour, C., Clermidy, F., Durand, Y., Durupt, J., Varreau, D., Vivet, P., Penard, P., Bouttier, A., Berens, F.: A Telecom Baseband Circuit Based on an Asynchronous Network-on-Chip. In: Proceedings of Intl. Solid-State Circuits Conf. (ISSCC) (February 2007)
2. Iyer, A., Marculescu, D.: Power and Performance Evaluation of Globally Asynchronous Locally Synchronous Processors. In: Proceedings of Intl. Symp. on Computer Architecture (ISCA) (May 2002)
3. Njølstad, T., Tjore, O., Svarstad, K., Lundheim, L., Vedal, T.Ø., Typpö, J., Ramstad, T., Wanhammar, L., Aar, E.J., Danielsen, H.: A Socket Interface for GALS using Locally Dynamic Voltage Scaling for Rate-Adaptive Energy Saving. In: Proceedings of ASIC/SOC Conf. (September 2001)
4. Zhu, Y., Mueller, F.: Feedback EDF Scheduling Exploiting Dynamic Voltage Scaling. In: Proceedings of Real-Time and Embedded Technology and Applications Symp. (RTAS) (May 2004)
5. Ichiba, F., Suzuki, K., Mita, S., Kuroda, T., Furuyama, T.: Variable Supply-Voltage Scheme with 95%-Efficiency DC-DC Converter for MPEG-4 Codec. In: Proceedings of Intl. Symp. on Low Power Electronics and Design (ISLPED) (August 1999)
6. Li, Y.W., Patounakis, G., Jose, A., Shepard, K.L., Nowick, S.M.: Asynchronous Datapath with Software-Controlled On-Chip Adaptive Voltage Scaling for Multirate Signal Processing Applications. In: Proceedings of Intl. Symp. on Asynchronous Circuits and Systems (ASYNC) (May 2003)
7. Hammes, M., Kranz, C., Kissing, J., Seippel, D., Bonnaud, P.-H., Pelos, E.: A GSM Baseband Radio in 0.13 μm CMOS with Fully Integrated Power Management. In: Proceedings of Intl. Solid-State Circuits Conf. (ISSCC) (February 2007)
8. Lee, S., Sakurai, T.: Run-Time Voltage Hopping for Low-Power Real-Time Systems. In: Proceedings of Design Automation Conf. (DAC) (June 2000)
9. Kawaguchi, H., Zhang, G., Lee, S., Sakurai, T.: An LSI for VDD-Hopping and MPEG4 System Based on the Chip. In: Proceedings of Intl. Symp. on Circuits and Systems (ISCAS) (May 2001)
10. Xu, Y., Miyazaki, T., Kawaguchi, H., Sakurai, T.: Fast Block-Wise Vdd-Hopping Scheme. In: Proceedings of IEICE Society Conf. (September 2003)
11. Calhoun, B.H., Chandrakasan, A.P.: Ultra-Dynamic Voltage Scaling using Subthreshold Operation and Local Voltage Dithering in 90nm CMOS. In: Proceedings of Intl. Solid-State Circuits Conf. (ISSCC) (February 2005)
12. Kuemerle, M.W.: System and Method for Power Optimization in Parallel Units. US Patent 6289465 (September 2001)
13. Cohn, J.M., et al.: Power Reduction by Stage in Integrated Circuit. US Patent 6825711 (November 2004)
14. Anis, M.H., Areibi, S., Elmasry, M.I.: Design and Optimization of Multi-Threshold CMOS (MTCMOS) Circuits. IEEE Trans. on CAD 22(10) (October 2003)
Dependability Evaluation of Time-Redundancy Techniques in Integer Multipliers Henrik Eriksson SP Technical Research Institute of Sweden Box 857, SE-501 15 Borås, Sweden [email protected]
Abstract. An evaluation of the fault tolerance which can be achieved by the use of time-redundancy techniques in integer multipliers has been conducted. The evaluated techniques are: swapped inputs, an inverted reduction tree, a novel use of the half-precision mode in a twin-precision multiplier, and a combination of the first two techniques. The injected faults are single stuck-at-zero or stuck-at-one faults. Error detection coverage has been the evaluation criterion. Depending on the technique, the attained error detection coverage spans from 25% to 90%.
1 Introduction

Dependable computer systems, where a malfunction can lead to loss of human lives, economic loss, or an environmental disaster, are no longer limited to space, aerospace, and nuclear applications, but are rapidly moving toward consumer electronics and automotive applications. In a few years' time, mechanical and hydraulic automotive subsystems for braking and steering will be replaced by electronic systems, also known as by-wire systems. In these systems, faults will occur, and thus fault tolerance is paramount. Modern microprocessors have several on-chip error-detection mechanisms in memories and registers, but in general arithmetic units such as multipliers do not. In this paper, four different time-redundancy techniques as a means to achieve fault tolerance in an integer multiplier are evaluated.

1.1 Fault Tolerance

There are different means to attain a dependable system [1], e.g.:
• Fault prevention
• Fault removal
• Fault tolerance
Fault prevention aims to reduce the faults which are introduced during development. The use of a mature development process, a strongly typed programming language, and design rules for hardware development are examples of fault prevention. Fault removal is present both during development and when the system is in use. During development, faults are removed by verification and validation activities
N. Azemard and L. Svensson (Eds.): PATMOS 2007, LNCS 4644, pp. 566–575, 2007. © Springer-Verlag Berlin Heidelberg 2007
Dependability Evaluation of Time-Redundancy Techniques in Integer Multipliers
567
which consist of static analysis methods and dynamic testing methods. Preventive and/or corrective maintenance removes faults during operation. Even though fault prevention and removal are used, there might still be design faults left in the system, or faults will be caused by external sources such as radiation from space (neutrons) or EMI (electromagnetic interference). As a consequence, a certain level of fault tolerance is needed in the system. Fault tolerance is achieved by error detection and error recovery. To be able to detect and recover from errors, redundancy in time, space (hardware or software), or information is needed. A parity bit is an example of information redundancy, and mirrored hard disk drives in a RAID (Redundant Array of Inexpensive Disks) system are an example of spatial hardware redundancy. If a piece of software is executed twice on the same hardware component, this is denoted time redundancy. The effectiveness of a fault-tolerance technique is called its coverage, defined as the probability that the fault-tolerance technique is effective given that a fault has occurred. In order to determine the coverage, an assumption on the types, location, arrival rate, and persistence of the faults, i.e. a fault model, has to be made. The fault model used in this study is described in Section 3.1.

1.2 Integer Multipliers

An 8×8-bit parallel multiplier of carry-save type [2] is shown in Fig. 1.
Fig. 1. An 8×8-bit integer multiplier of carry-save type. Three parts can be distinguished: partial product bit generation (AND gates), partial product reduction tree (half-adder cells, HAs, and full-adder cells, FAs), and the final adder.
The first step in the multiplier is to generate all partial-product bits. In a simple unsigned integer multiplier this is performed by conventional AND gates. The second step is to feed all partial-product bits to the reduction tree, where the bits are reduced to a form suitable for the final adder. Different reduction trees exist, ranging from slow and simple, e.g. a carry-save tree, to fast and complex, e.g. Wallace [3] or Dadda [4] trees. To get optimal performance, the final adder which computes the
568
H. Eriksson
resulting product shall be tailored to the delay profile of the reduction tree, and could consist of both ripple-carry and carry-lookahead parts [5].

1.3 Fault-Tolerant Adders and Multipliers

Many techniques have been proposed to add fault tolerance to adders and multipliers. The techniques described in this section are a small but representative set of the possible techniques. A fault-tolerant adder was proposed [6] which uses TMR (Triple Modular Redundancy) in the carry logic of the full-adder cells as well as TMR on the ALU (Arithmetic Logic Unit) level. The two extra ALUs operate on either shifted or rotated inputs; hence a fault in a bit slice will manifest itself as an error in another bit position in the results from the redundant ALUs. Using TMR is an expensive approach. Time redundancy is used in the RETWV (recomputing with triplication with voting) technique [7]. The idea here is to divide the input(s) and the adder or multiplier into three parts; for each computation, one third of the end result is computed and decided by voting. In the adder case, both input operands are divided into three parts, but for the multiplier only one of the operands is divided. An extra redundant bit slice can be added in a Wallace multiplier to enable self-repair [8]. A test sequence is used to detect errors, and if an error is detected the multiplier is reconfigured by the use of a configuration memory. Residue number systems (RNSs) exhibit error detection and correction capabilities by means of information redundancy. An RNS has been used to design a 36-bit single-fault-tolerant multiplier [9].
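The partial-product / reduction-tree / final-adder split of Section 1.2 can be sketched behaviourally. Below is a minimal Python model using column-based carry-save reduction; it reproduces the three stages, although the cell placement does not follow Fig. 1 exactly:

```python
def fa(a, b, c):
    """Full-adder cell: returns (sum, carry-out)."""
    return a ^ b ^ c, (a & b) | (a & c) | (b & c)

def csa_multiply(x, y, n=8):
    """Unsigned n x n-bit multiply: AND-gate partial products, FA-based
    column reduction, then a ripple final adder over the last two rows."""
    cols = [[] for _ in range(2 * n + 1)]
    for i in range(n):                       # partial-product generation
        for j in range(n):
            cols[i + j].append((x >> i & 1) & (y >> j & 1))
    for w in range(2 * n):                   # reduction tree
        while len(cols[w]) > 2:
            s, co = fa(cols[w].pop(), cols[w].pop(), cols[w].pop())
            cols[w].append(s)
            cols[w + 1].append(co)
    product, carry = 0, 0                    # final (ripple) adder
    for w in range(2 * n):
        bits = (cols[w] + [carry, 0, 0])[:3]
        s, carry = fa(*bits)
        product |= s << w
    return product
```

The same `fa` cell is reused by the reduction tree and the final adder, mirroring the structure described in the text.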
2 Time-Redundancy Techniques in Multipliers

The use of time redundancy to achieve fault tolerance is not new, and many techniques have been proposed [10]. Common to most time-redundancy techniques is that during the first computation the operands are used as presented, while during the second computation the operands are encoded before the computation and decoded after (in general no extra bits are added, i.e. no information redundancy is introduced). Examples of encoding functions are inversion and arithmetic shift. Not all components are self-dual, i.e. able to use inversion without modification, but the adder, for example, is. In order to use arithmetic shift, extra bit slices are needed to accommodate the larger word length. In this paper, the multiplier is the target, and the operation we wish to perform is

P = x · y.    (1)

Here x and y are the input operands and P is the product. In the following, the techniques used during the second computation of the multiplier are explained.

2.1 Technique 1 – Swapped Inputs

Technique 1 is as simple as a technique can be. During the second computation, the input operands are swapped, and thus the product is calculated as

P = y · x.    (2)
Dependability Evaluation of Time-Redundancy Techniques in Integer Multipliers
569
Although the difference is small, all partial-product bits where i ≠ j (see Fig. 1) will enter the reduction tree at different locations. As a consequence, a fault might affect the resulting product differently, making error detection possible.

2.2 Technique 2 – Inverted Reduction Tree

The multiplier is not self-dual, but inversion can still be exploited. Technique 2 harnesses the inverting property of the full-adder cell [11], i.e.

S′(a, b, c) = S(a′, b′, c′) and CO′(a, b, c) = CO(a′, b′, c′).    (3)

Here S and CO are the sum and carry outputs, respectively. Since the reduction tree more or less consists of a network of full-adder cells, it is possible to invert the partial-product bits entering the reduction tree and then invert the output product to get the correct product. The cost for an N-bit multiplier is (N×N)−1 XOR gates at the input and 2N−1 XOR gates at the output. However, since the inverted versions of the partial-product bits are available "inside" the AND gates, these could be used together with multiplexers to obtain the inverted mode of operation, at least in a full-custom design. The half-adder cells of the reduction tree do not fulfill the inverting property, and as a consequence they have to be changed to full-adder cells. The extra input of the new full-adder cells is set according to inverted mode (1) or not (0). In Technique 2 the multiplication becomes

P = x · y = x0y0 + 2^1(x1y0 + x0y1) + … + 2^(2N−2) xN−1yN−1
  = ((x0y0)′ + 2^1((x1y0)′ + (x0y1)′) + … + 2^(2N−2) (xN−1yN−1)′)′.    (4)
2.3 Technique 3 – Inverted Reduction Tree and Swapped Inputs

Technique 3 is a combination of Techniques 1 and 2. Hence the multiplication becomes

P = y · x = y0x0 + 2^1(y1x0 + y0x1) + … + 2^(2N−2) yN−1xN−1
  = ((y0x0)′ + 2^1((y1x0)′ + (y0x1)′) + … + 2^(2N−2) (yN−1xN−1)′)′.    (5)

2.4 Technique 4 – Twin Precision

With appropriate modifications, it is possible to perform two N/2-bit multiplications in parallel in an N-bit multiplier, see Fig. 2. This is referred to as a twin-precision multiplier [12]. One N/2×N/2-bit multiplication is performed using the N/2×N/2 partial-product bits in the least significant part of the multiplier, and another N/2×N/2-bit multiplication uses the partial-product bits in the most significant part. The resulting products are the least significant and most significant halves of the output, respectively. Other implementations of a twin- or dual-precision multiplier exist [13], where three of the four base half-precision multipliers are used to obtain a fault-tolerant mode of operation using majority voting. The idea behind Technique 4 is to divide the input operands into halves, and then these halves are multiplied with each other according to

P = x · y = (xh + xl) · (yh + yl) = xhyh + xhyl + xlyh + xlyl.    (6)
570
H. Eriksson
Besides the extra (third) computation (compared with the other techniques), there is also a need for extra shifts and additions to obtain the final product. The extra computation is not that costly, since two N/2×N/2-bit multiplications in parallel do not take as long as a single N×N-bit multiplication [12].
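The decomposition of Eq. (6) can be checked directly. A small Python sketch, reading xh and yh as the upper halves of the operands at their natural bit weights so that x = xh + xl and y = yh + yl:

```python
def twin_precision_product(x, y, n=8):
    """Eq. (6): split x and y into high/low halves (the high half keeps its
    natural bit weights) and accumulate the four half-precision products."""
    mask = (1 << n // 2) - 1
    xl, xh = x & mask, x - (x & mask)    # x = xh + xl
    yl, yh = y & mask, y - (y & mask)    # y = yh + yl
    return xh * yh + xh * yl + xl * yh + xl * yl

# matches the full product for all 8-bit operands
```

Each of the four terms is the kind of product the hardware's half-precision mode can compute, which is why the extra shifts and additions mentioned above are needed to recombine them.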
Fig. 2. An 8×8-bit twin-precision integer multiplier of carry-save type. The partial-product bits represented by white squares are unused (zeroed) when the multiplier is performing two 4-bit multiplications in parallel.
3 Evaluation of the Time-Redundancy Techniques

The four different techniques are evaluated by injecting faults in a gate-level model of a multiplier. After a fault has been injected, the base computation (Eq. 1) is performed, as well as computations using all the time-redundancy techniques T1-T4. The output of the base computation is compared with a golden (fault-free) multiplication to see if the fault is effective. If the fault is effective, i.e. the result from the base computation differs from the golden computation, the error-detection capability of a technique is checked by comparing the output of the base computation with that of the presently evaluated technique. If the outputs are different, an error has been detected. The error detection coverage is obtained by dividing the number of detected errors by the number of effective faults (detected errors + wrong outputs).

3.1 Fault Model

The faults which are injected are permanent (they last at least two computations) single stuck-at-one or stuck-at-zero faults. The faults are injected at the outputs of the gates in the multiplier.

3.2 Assessment 1 – 8×8-Bit Multiplier with CSA Reduction Tree

In the first assessment, faults are injected in an 8×8 multiplier of carry-save type. Faults are injected at all possible gate outputs in the multiplier, and for each fault all
possible input combinations are evaluated, i.e. an exhaustive evaluation. The results from the different fault-injection campaigns are collected in Table 1. The campaigns are:
• pi – the outputs of the input AND gates
• p – the outputs of the inverting XOR gates before the reduction tree
• s – the sum outputs of the FA gates in the reduction tree
• co – the carry outputs of the FA gates in the reduction tree
• o – the outputs of the inverting XOR gates after the reduction tree
Some interesting observations can be made in Table 1:
• The number of effective faults is roughly 50%. Thus, on average, every second fault will have an effect on the multiplier result, i.e. cause an error.
• Using time redundancy, it is difficult to detect errors at the output of the multiplier. However, if technique T4 is used, it is possible to detect at least 70% of these errors.
• The techniques exploiting the inverting property of the reduction tree, T2 and T3, tolerate all possible stuck-at faults in the reduction tree, i.e. the coverage is 100%.
• Errors caused by faults in the input AND gates cannot be detected by Technique T2, which makes sense since the input bits are fed to the same AND gates as for the base computation.
The relations between the coverage values of the different techniques are as expected: in terms of extra hardware, less costly techniques have lower coverage than the expensive ones.

Table 1. 8×8-bit CSA reduction tree. Exhaustive. (sta0 – stuck-at-zero, sta1 – stuck-at-one)

Campaign  Injected faults  Effective faults  Coverage: T1   T2    T3    T4
pi-sta0   4194304          1048576                     66%    0%   66%   89%
pi-sta1   4194304          3145728                     22%    0%   22%   93%
p-sta0    4194304          1032192                     67%  100%  100%   90%
p-sta1    4194304          3096576                     22%  100%  100%   93%
s-sta0    3670016          1636774                     33%  100%  100%   80%
s-sta1    3670016          2033242                     27%  100%  100%    8%
co-sta0   3670016          613916                      57%  100%  100%   88%
co-sta1   3670016          3056100                     12%  100%  100%   97%
o-sta0    1048576          418276                       0%    0%    0%   70%
o-sta1    1048576          564764                       0%    0%    0%   78%
Total     33554432         16646144                    27%   69%   77%   90%
3.3 Assessment 2 – 8×8-Bit Multiplier with HPM Reduction Tree

To check whether the interconnection scheme of the reduction tree has an effect on the error-detection capabilities of the techniques, a multiplier having another reduction tree is
evaluated in this assessment. The selected tree is the HPM reduction tree [14], which has a logarithmic logic depth (the CSA tree has linear logic depth) but still a regular connectivity. The results from this assessment are collected in Table 2. As can be seen in the table, the dependence of the coverage on the reduction tree is insignificant.

Table 2. 8×8-bit HPM reduction tree. Exhaustive. (sta0 – stuck-at-zero, sta1 – stuck-at-one)

Campaign  Injected faults  Effective faults  Coverage: T1   T2    T3    T4
pi-sta0   4194304          1048576                     66%    0%   66%   89%
pi-sta1   4194304          3145728                     22%    0%   22%   93%
p-sta0    4194304          1032192                     67%  100%  100%   90%
p-sta1    4194304          3096576                     22%  100%  100%   93%
s-sta0    3670016          1590058                     29%  100%  100%   81%
s-sta1    3670016          2079958                     22%  100%  100%   84%
co-sta0   3670016          613916                      42%  100%  100%   89%
co-sta1   3670016          3056100                      9%  100%  100%   97%
o-sta0    1048576          418276                       0%    0%    0%   70%
o-sta1    1048576          564764                       0%    0%    0%   78%
Total     33554432         16646144                    25%   69%   77%   90%
3.4 Assessment 3 - 16×16-Bit Multiplier with CSA Reduction Tree Besides the interconnection scheme of the reduction tree, it is interesting to study the impact of the multiplier size on the coverage of the different techniques. In this assessment a 16×16-bit multiplier with a CSA reduction tree is used. For this size it is no longer possible to evaluate all possible input combinations since then the total number of injected faults would be more than 8.5·1012. Therefore, for each Table 3. 16×16-bit CSA reduction tree. Random. (sta0 – stuck-at-zero, sta1 – stuck-at-one)
Campaign
Injected faults
Effective faults
pi-sta0 pi-sta1 p-sta0 p-sta1 s-sta0 s-sta1 co-sta0 co-sta1 o-sta0 o-sta1 Total
25600000 25600000 25600000 25600000 24000000 24000000 24000000 24000000 3200000 3200000 204800000
6375286 19124714 6375286 19124714 11348556 12651444 4940279 19059721 1464353 1735647 102200000
Note, not all input combinations are used.
Coverage by technique T1 T2 T3 T4 71% 0% 71% 92% 24% 0% 24% 94% 71% 100% 100% 92% 24% 100% 100% 94% 43% 100% 100% 81% 39% 100% 100% 82% 68% 100% 100% 91% 18% 100% 100% 95% 0% 0% 0% 66% 0% 0% 0% 72% 34% 72% 81% 90%
Dependability Evaluation of Time-Redundancy Techniques in Integer Multipliers
573
stuck-at-fault injected, 100 000 random input combinations are used. For a specific fault, the same random input combinations are used for the evaluation of all techniques. The results from this assessment are collected in Table 3. Although there are some minor differences compared with the results from the assessments on the 8×8 multipliers, they probably originate from the fact that only a fraction of all possible input combinations is evaluated for each fault. 3.5 Delay, Power, and Area Penalties Adding redundancy to tolerate faults always comes at a cost in delay (time), power, area, or a combination of these. Time-redundancy has a small area cost but a large delay cost. For spatial redundancy, on the other hand, the situation is the opposite; a small delay cost but a large area cost. The power dissipation is at least doubled in both these cases. In Table 4, the delay, power, and area penalties for the different techniques are estimated. Area and power penalties are estimated based on added gate equivalents and the delay penalty for T4 based on the delay values presented by Sjalander et al. [8]. All values are normalized to the ones of the base computation, i.e. a conventional multiplication. The goal of this paper has been to compare the different timeredundancy techniques, therefore the cost of the detection mechanism (comparison), which is the same for all techniques, has been omitted from the values in Table 4. Table 4. Delay, power, and area penalties
Technique   T1    T2    T3    T4*
Delay        2     2     2    2.7
Power        2    2.6   2.6   2.2
Area         1    1.3   1.3    1

* Penalties from extra shifts and addition are not included.
The least costly technique, T1, is also the one with the worst coverage. The delay, power, and area budgets must therefore be considered when selecting among the other three techniques. There is, however, no reason not to select T3 whenever the budgets permit the use of T2 or T3.
4 Conclusion

Four different time-redundancy techniques have been evaluated. The techniques range from a less costly technique, in which the inputs to the multiplier are swapped, to an advanced technique that harnesses the half-precision mode of a twin-precision multiplier. The latter technique is novel with respect to time redundancy in multipliers.
574
H. Eriksson
The coverage for single stuck-at faults at the outputs of the gates was assessed for all techniques. Different reduction trees and multiplier sizes were used during the assessment, and some interesting observations were made:

• As expected, the most costly technique (in terms of delay and power), twin precision, was also the technique with the best coverage.
• The techniques using inverted inputs to the reduction tree detect all errors caused by a fault in the reduction tree.
• The only technique that can detect some of the errors caused by a fault at the multiplier output is the twin-precision technique.
• If a technique in which the partial-product bits to the reduction tree are inverted is used, the input operands should also be swapped, since this yields better coverage at no extra cost.
Acknowledgements

The author wishes to thank Professor Per Larsson-Edefors and Dr. Jonny Vinter, who have provided fruitful input to this manuscript.
References

1. Avizienis, A., Laprie, J.-C., Randell, B., Landwehr, C.: Basic Concepts and Taxonomy of Dependable and Secure Computing. IEEE Transactions on Dependable and Secure Computing 1(1), 11–33 (2004)
2. Parhami, B.: Computer Arithmetic – Algorithms and Hardware Designs, 1st edn. Oxford University Press, Oxford (2000)
3. Wallace, C.S.: A Suggestion for a Fast Multiplier. IEEE Transactions on Electronic Computers 13, 14–17 (1964)
4. Dadda, L.: Some Schemes for Parallel Multipliers. Alta Frequenza 34(5), 349–356 (1965)
5. Oklobdzija, V.G., Villeger, D., Liu, S.S.: A Method for Speed Optimized Partial Product Reduction and Generation of Fast Parallel Multipliers Using an Algorithmic Approach. IEEE Transactions on Computers 45(3), 294–306 (1996)
6. Alderighi, M., D'Angelo, S., Metra, C., Sechi, G.R.: Novel Fault-Tolerant Adder Design for FPGA-Based Systems. In: Proceedings of the 7th International On-Line Testing Workshop, pp. 54–58 (2001)
7. Hsu, Y.-M., Swartzlander Jr., E.E.: Reliability Estimation for Time Redundant Error Correcting Adders and Multipliers. In: Proceedings of the IEEE International Workshop on Defect and Fault Tolerance in VLSI Systems (DFT), pp. 159–167. IEEE Computer Society Press, Los Alamitos (1994)
8. Namba, K., Ito, H.: Design of Defect Tolerant Wallace Multiplier. In: Proceedings of the 11th IEEE Pacific Rim International Symposium on Dependable Computing (PRDC), pp. 159–167. IEEE Computer Society Press, Los Alamitos (2005)
9. Radhakrishnan, D., Preethy, A.P.: A Novel 36 Bit Single Fault-Tolerant Multiplier Using 5 Bit Moduli. In: Proceedings of the IEEE Region 10 International Conference (TENCON'98), pp. 128–130. IEEE Computer Society Press, Los Alamitos (1998)
10. Pradhan, D.K.: Fault-Tolerant Computer System Design, 1st edn. Prentice-Hall, Englewood Cliffs (1996)
11. Rabaey, J.M., Chandrakasan, A., Nikolic, B.: Digital Integrated Circuits, 2nd edn. Prentice-Hall, Englewood Cliffs (2003)
12. Sjalander, M., Eriksson, H., Larsson-Edefors, P.: An Efficient Twin-Precision Multiplier. In: Proceedings of the IEEE International Conference on Computer Design (ICCD'04), IEEE Computer Society Press, Los Alamitos (2004)
13. Mokrian, P., Ahmadi, M., Jullien, G., Miller, W.C.: A Reconfigurable Digital Multiplier Architecture. In: Proceedings of the Canadian Conference on Electrical and Computer Engineering (CCECE) (2003)
14. Eriksson, H., Larsson-Edefors, P., Sheeran, M., Sjalander, M., Johansson, D., Scholin, M.: Multiplier Reduction Tree with Logarithmic Logic Depth and Regular Connectivity. In: Proceedings of the IEEE International Symposium on Circuits and Systems (ISCAS), IEEE Computer Society Press, Los Alamitos (2006)
Design and Industrialization Challenges of Memory Dominated SOCs J.M. Daga ATMEL, France
The quest for the universal memory has attracted many talented researchers and a number of investors for years now. The objective is to develop a low-cost, high-speed, low-power, and reliable non-volatile memory. In practice, the universal memory system is more like an optimized combination of execution and storage memories, each of them having its own characteristics. Typically, execution memories manage temporary data and must be fast, with no endurance limitations. Different types of RAM are used to build an optimized hierarchy, including different levels of cache. In addition to RAM, non-volatile memories such as ROM or NOR flash used for code storage can be considered execution memories when in-place execution of the code is possible. There are several advantages in having execution memories embedded with the CPU, such as speed and power optimization and improved code security. There is a trend, confirmed by the SIA forecasts, that memories will represent more than 80% of the total area of SOCs by the end of the decade. SOCs will thus become more and more memory dominated. As a result, memory management decisions will have a major impact on system cost, performance, and reliability. Memory IP availability (including embedded SRAM, DRAM, flash, ...) will become the main differentiator, especially for fabless companies. This will be developed during the presentation. A detailed comparison of different types of embedded memories (SRAM, DRAM, ROM, EEPROM, and flash) and their related challenges will be reviewed, and practical examples of SOC implementation, for example flash-based MCUs versus ROM-based ones, will be presented.
N. Azemard and L. Svensson (Eds.): PATMOS 2007, LNCS 4644, p. 576, 2007. © Springer-Verlag Berlin Heidelberg 2007
Statistical Static Timing Analysis: A New Approach to Deal with Increased Process Variability in Advanced Nanometer Technologies D. Pandini STMicroelectronics, Italy
As process parameter dimensions continue to scale down, the gap between the designed layout and what is actually manufactured on silicon is increasing. Owing to the difficulty of process control in nanometer technologies, manufacturing-induced variations are growing both in number and as a fraction of feature size and electrical parameters. Therefore, characterizing and modeling the underlying sources of variability, along with their correlations, is becoming more and more difficult and costly. Furthermore, process parameter variations make the prediction of digital circuit performance an extremely challenging task. Traditionally, the methodology adopted to determine the performance spread of a design in the presence of variability is to run multiple static timing analyses at different process corners, where standard cells and interconnects have the worst/best combinations of delay. Unfortunately, as the number of variability sources increases, the corner-based method is becoming too computationally expensive. Moreover, with the larger parameter spread this approach results in overly conservative and suboptimal designs, leaving most of the advantages offered by the new technologies on the table. Statistical Static Timing Analysis (SSTA) is a promising approach to deal with process variations in nanometer technologies, especially the intra-die variations that cannot be handled properly by existing corner-based techniques. In this keynote, the SSTA methodology is presented, showing its potential advantages over the traditional STA approach. Moreover, the most important challenges for SSTA, such as the required additional process data, the characterization effort, and the integration into the design flow, are outlined and discussed.
Experimental results obtained from pilot projects in nanometer technologies will be presented, demonstrating the potential benefits of SSTA, along with optimization techniques based on SSTA and the parametric yield improvement that can be achieved.
N. Azemard and L. Svensson (Eds.): PATMOS 2007, LNCS 4644, p. 577, 2007. © Springer-Verlag Berlin Heidelberg 2007
Analog Power Modelling C. Svensson Linköping University, Sweden
Digital power modelling is well developed today, through many years of active research. However, analog power modelling lags behind. The aim of this paper is to discuss possible fundamentals of analog power modelling. The modelling is based on noise, precision, linearity, and process constraints. Simple elements such as samplers, amplifiers, and comparators are discussed. Analog-to-digital converters are used to compare predicted minimum power constraints with real circuits.
N. Azemard and L. Svensson (Eds.): PATMOS 2007, LNCS 4644, p. 578, 2007. © Springer-Verlag Berlin Heidelberg 2007
Technological Trends, Design Constraints and Design Implementation Challenges in Mobile Phone Platforms F. Dahlgren Ericsson Mobile Platforms, Sweden
Mobile phones have already become far more than the traditional voice-centric device. A large number of capabilities are being integrated into higher-end phones competing with dedicated devices, including camera, camcorder, music player, positioning, mobile TV, and high-speed internet access. The huge volumes push the employment of the very latest silicon and packaging technologies, with respect taken to cost and high-volume production. While, on the one hand, the technology allows for integration of more features and higher performance, issues such as low hardware-cost requirements, power dissipation, thermal constraints, and software complexity are increasingly challenging. This presentation aims at surveying current market and technological trends, including networking technology, multimedia, application software, services, and technology enablers. Furthermore, it will go through a set of design constraints and design tradeoffs, and finally cover some of the implementation challenges going forward.
N. Azemard and L. Svensson (Eds.): PATMOS 2007, LNCS 4644, p. 579, 2007. © Springer-Verlag Berlin Heidelberg 2007
System Design from Instrument Level Down to ASIC Transistors with Speed and Low Power as Driving Parameters A. Emrich Omnisys Instruments AB, Sweden
For wide-bandwidth spectrometers there are several competing technologies to consider: digital, optical, and various analog schemes. For applications demanding wide bandwidth combined with low power consumption, autocorrelation-based digital designs take advantage of Moore's law and will take a dominating position in the coming years. Omnisys implementations have shown an order of magnitude better performance with respect to bandwidth versus power consumption than what other teams have presented over the last decade. The reason for this is that concurrent engineering and optimisation have been performed at all levels in parallel, from instrument level down to transistor level. We now have a single-chip spectrometer core providing 8 GHz of bandwidth with 1024 resolution channels and a power consumption of less than 3 W. The design approach will be presented with examples of how decisions at different levels interact.
N. Azemard and L. Svensson (Eds.): PATMOS 2007, LNCS 4644, p. 580, 2007. © Springer-Verlag Berlin Heidelberg 2007
Author Index
Abdel-Hafeez, Saleh 75
Albers, Karsten 495
Alfredsson, Jon 536
Aunet, Snorre 536
Azemard, N. 138
Bacinschi, P.B. 242
Barthelemy, Hervé 413
Bartzas, Alexandros 373
Basetas, Ch. 546
Bellido, M.J. 404
Bhardwaj, Sarvesh 125
Blaauw, David 211
Bravaix, A. 191
Butzen, Paulo F. 474
Cappuccino, Gregorio 107
Catthoor, Francky 373, 433
Centurelli, Francesco 516
Chabini, Noureddine 64
Chang, Yao-Wen 148
Chen, Harry I.A. 453
Chen, Pinghua 86
Chidolue, Gabriel 288
Chou, Szu-Jui 148
Cocorullo, Giuseppe 107
Crone, Allan 288
Dabiri, Foad 255, 443
Daga, J.M. 576
Dahlgren, F. 579
Dai, Kui 320
Delorme, Julien 31
Denais, M. 191
Devos, Harald 363
Dimitroulakos, Gregory 352
Duval, Benjamin 413
Eeckhaut, Hendrik 363
Eisenstadt, William R. 75
Emrich, A. 580
Engels, S. 138
Eriksson, Henrik 566
Feautrier, Paul 10
Fournel, Nicolas 10
Fraboulet, Antoine 10
Galanis, Michalis D. 352
Ghanta, Praveen 125
Ghavami, Behnam 330, 463
Giacomotto, Christophe 181
Giancane, Luca 516
Glesner, M. 242
Goel, Amit 125
Goutis, Costas E. 352
Grumer, Matthias 268
Guérin, C. 191
Guerrero, D. 404
Guigues, Fabrice 413
Gustafson, Oscar 526
Gyimóthy, Tibor 300
Hagiwara, Shiho 222
Harb, Shadi M. 75
He, Ku 160
Helms, Domenik 171, 278
Herczeg, Zoltán 300
Hoyer, Marko 171
Hsu, Chin-Hsiung 148
Huard, V. 191
Isokawa, Teijiro 423
Jayapala, Murali 433
Jiang, Jie-Hong R. 148
JianJun, Guo 43
Johansson, Kenny 526
Juan, J. 404
Kamiura, Naotake 423
Keller, Maurice 310
Kim, Chris H. 474
Kim, Seongwoon 53
Kiss, Ákos 300
Kjeldsberg, Per Gunnar 526
Kleine, Ulrich 97
Koelmans, Albert 53
Kouretas, I. 546
Kroupis, N. 505
Kui, Dai 43
Kunitake, Yuji 384
Kuo, James B. 453
Kussener, Edith 413
Li, Shen 43
Li, Yong 320
Li, Zhenkun 86
Lipka, Björn 97
Lipskoch, Henrik 495
Liu, Yijun 86
Loo, Edward K.W. 453
Lucarz, Christophe 485
Luo, Hong 160
Luo, Rong 160
Macii, A. 232
Macii, E. 232
Mamagkakis, Stylianos 373
Manis, George 20
Marnane, William 310
Masu, Kazuya 222
Matsui, Nobuyuki 423
Mattavelli, Marco 485
Maurine, P. 138, 340, 394
Mendias, Jose M. 373
Miermont, Sylvain 556
Migairou, V. 138
Millan, A. 404
Mingche, Lai 43
Mühlberger, Andreas 268
Munaga, Satyakiran 433
Murgan, T. 242
Nahapetian, Ani 255, 443
Najibi, Mehrdad 463
Nanua, Mini 211
Nebel, Wolfgang 171, 278
Neffe, Ulrich 268
Niknahad, Mahtab 463
Oh, Myeonghoon 53
Oklobdzija, Vojin 181
Olivieri, Mauro 516
Ortiz, A. García 242
Oskuii, Saeeid Tahmasbi 526
Ostua, E. 404
Paliouras, V. 546
Panagopoulos, Ioannis 20
Pandey, S. 242
Pandini, Davide 201, 577
Papadopoulos, Lazaros 1
Papakonstantinou, George 20
Parthasarathy, CR. 191
Pavlatos, Christos 20
Pedram, Hossein 330, 463
Peon-Quiros, Miguel 373
Popa, Cosmin 117
Potkonjak, Miodrag 255, 443
Pugliese, Andrea 107
Raghavan, Praveen 433
Ramos, Estela Rey 433
Razafindraibe, A. 340, 394
Reis, André I. 474
Renaudin, Marc 556
Repetto, Guido A. 201
Ribas, Renato P. 474
Robert, M. 340
Rosinger, Sven 278
Ruan, Jian 320
Ruiz-de-Clavijo, P. 404
Sarrafzadeh, Majid 255, 443
Sato, Takashi 222
Sato, Toshinori 384
Schmidt, Daniel 300
Scotti, Giuseppe 516
Sethubalasubramanian, Nandhavel 433
Shang, Delong 53
Shin, Chihoon 53
Singh, Mandeep 181
Sinisi, Vincenzo 201
Sithambaram, P. 232
Slomka, Frank 495
Soudris, Dimitrios 1, 373, 505
Steger, Christian 268
Stroobandt, Dirk 363
Svensson, C. 578
Syrzycki, Marek J. 453
Trifiletti, Alessandro 516
Uezono, Takumi 222
Verkest, Diederik 433
Viejo, J. 404
Vivet, Pascal 556
Vrudhula, Sarma 125
Wang, Ping 53
Wang, Wenyan 86
Wang, Yu 160
Wang, Zhiying 320
Wehn, Norbert 300
Weiss, Oliver 433
Weiss, Reinhold 268
Wendt, Manuel 268
Wilson, R. 138
Wu, Z. 138
Xia, Fei 53
Xie, Yuan 160
Yakovlev, Alex 53
Yang, Huazhong 160
Zeydel, Bart 181
Zhiying, Wang 43