Timing Optimization Through Clock Skew Scheduling
Ivan S. Kourtev • Baris Taskin • Eby G. Friedman
Ivan S. Kourtev, University of Pittsburgh, Pittsburgh, PA, USA
Baris Taskin, Drexel University, Philadelphia, PA, USA
Eby G. Friedman, University of Rochester, Rochester, NY, USA
ISBN: 978-0-387-71055-6
e-ISBN: 978-0-387-71056-3
DOI: 10.1007/978-0-387-71056-3
Library of Congress Control Number: 2008937987
© Springer Science+Business Media, LLC 2009
All rights reserved. This work may not be translated or copied in whole or in part without the written permission of the publisher (Springer Science+Business Media, LLC, 233 Spring Street, New York, NY 10013, USA), except for brief excerpts in connection with reviews or scholarly analysis. Use in connection with any form of information storage and retrieval, electronic adaptation, computer software, or by similar or dissimilar methodology now known or hereafter developed is forbidden. The use in this publication of trade names, trademarks, service marks and similar terms, even if they are not identified as such, is not to be taken as an expression of opinion as to whether or not they are subject to proprietary rights. While the advice and information in this book are believed to be true and accurate at the date of going to press, neither the authors nor the editors nor the publisher can accept any legal responsibility for any errors or omissions that may be made. The publisher makes no warranty, express or implied, with respect to the material contained herein.
Printed on acid-free paper
springer.com
Preface
History of the Book
The last three decades have witnessed an explosive development in integrated circuit fabrication technologies. The complexity of current CMOS circuits is reaching beyond the 65 nanometer feature size and several hundred million transistors per integrated circuit. To fully exploit this technological potential, circuit designers use sophisticated Computer-Aided Design (CAD) tools. While supporting the talents of innumerable microelectronics engineers, these CAD tools have become the enabling factor responsible for the successful design and implementation of thousands of high performance, large scale integrated circuits.
This book (a research monograph) originated from a body of doctoral dissertation research completed by the first author at the University of Rochester from 1994 to 1999 under the supervision of Prof. Eby G. Friedman. This research focuses on issues in the design of the clock distribution network in large scale, high performance digital synchronous circuits and, particularly, on algorithms for non-zero clock skew scheduling. During the development of this research, it became clear that incorporating timing issues into the integrated circuit design process is of fundamental importance, particularly because advanced theoretical developments in this area have been slow to reach the designers’ desktops.
The second edition of the book is enhanced by the body of doctoral dissertation research completed by the second author at the University of Pittsburgh from 2000 to 2005 under the supervision of Prof. Ivan S. Kourtev. This dissertation focuses on advanced timing, synchronization and design methodologies based on non-zero clock skew scheduling. Included in this book are methods for applying clock skew scheduling to circuits with level-sensitive latches, a timing-driven circuit design methodology to attain the maximum performance from clock skew scheduling, and a solution to the non-zero clock skew scheduling problem in a parallel computing environment, specifically derived for integration into the physical design process of an emerging non-zero clock skew clocking technology.
It is the authors’ belief that the successful application of non-zero clock skew scheduling techniques to the integrated circuit design process can only follow a detailed understanding of the operation of integrated circuits at many different levels, from device physics through system architecture to packaging. While detailed coverage of all of these topics in a single text is impractical, an honest effort has been made to provide an in-depth treatment of those areas closely related to the clock skew scheduling techniques presented in this book. Tutorial chapters on the structure and design of modern integrated circuits, as well as on the fundamental principles of signal delay, are included in this text since these topics are crucial to understanding clock skew scheduling in general. The information presented in these tutorial chapters can also quickly familiarize the reader with the problems, definitions, and terminology used throughout the book.
Automated methodologies for synchronous circuit performance optimization through clock skew scheduling are the primary topic of this book. The objectives of these methodologies are to improve the performance (specifically, the operating frequency or speed) while increasing the reliability of fully synchronous digital integrated circuits. Traditionally, design wisdom has dictated the use of global zero clock skew. In the research presented here, however, non-zero clock skew scheduling is exploited. A set of algorithms to accomplish this objective is considered in detail. Specifically, this book deals in depth with the following issues:
• A methodology for simultaneous non-zero clock skew scheduling and design of the topology of the clock distribution network. This methodology is based on the pioneering works of Friedman [1] and Fishburn [2], and builds on Linear Programming (LP) solution techniques. The non-zero clock skew scheduling of circuits with level-sensitive latches and for multi-phase clock signals is formulated as an LP problem. The simultaneous clock scheduling and clock tree topology synthesis problem is formulated as a mixed-integer linear programming problem that can be solved efficiently. The proposed algorithms have been evaluated on a variety of benchmark and industrial circuits, and synchronous performance improvements of well above 60% have been demonstrated.
• For those cases where reliable circuit operation and production yield are the highest priorities, an alternative problem formulation is developed. This formulation is based on a quadratic measure, or cost function, of the tolerance of a clock schedule to parameter variations (hence QP, quadratic programming). A mathematical framework is presented for solving the constrained and bounded QP problem. A constrained version of the problem is solved iteratively using the method of Lagrange multipliers. As these research issues are topics of great practical importance for input/output (I/O) interfacing and Intellectual Property (IP) blocks, explicit clock delay and skew requirements are fully integrated into the mathematical model described here.
• The theoretical derivation of the limits on the improvement in the clock period available through clock skew scheduling. The derivation is performed by identifying the limits for three local data path topologies. A methodology to mitigate the limitation of clock skew scheduling for a reconvergent path system is presented. The methodology involves delay insertion on some data paths of the reconvergent system and is formulated as an LP problem so that it can be applied automatically.
• A practical (and necessary) implementation of clock skew scheduling for an emerging clock generation and distribution technology, resonant rotary clocking. Preliminary efforts in modeling and implementation are demonstrated. Details are included on the integration of clock skew scheduling into a complete physical design flow for the automated design of rotary clock synchronized circuits.
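The LP view of clock skew scheduling mentioned in the list above can be illustrated with a small sketch. It is not the authors’ algorithm or software. It assumes a simplified model in which each local data path has only a minimum and a maximum delay, with register setup, hold and clock-to-output times taken as already folded into those numbers. Instead of a general LP solver it uses an equivalent difference-constraint feasibility check (Bellman-Ford) with bisection on the clock period. The register names, delay values and function names are hypothetical.

```python
from typing import Dict, List, Optional, Tuple

# A local data path: (source register, destination register,
# minimum path delay, maximum path delay). Illustration values only.
Path = Tuple[str, str, float, float]


def feasible_schedule(paths: List[Path], period: float) -> Optional[Dict[str, float]]:
    """Return clock arrival times meeting setup and hold, or None.

    Assumed constraints for a path i -> j with delays (d_min, d_max):
      setup: t_i + d_max <= t_j + period, i.e. t_i - t_j <= period - d_max
      hold:  t_i + d_min >= t_j,          i.e. t_j - t_i <= d_min
    Each constraint x - y <= w becomes an edge y -> x of weight w.
    """
    registers = sorted({r for i, j, _, _ in paths for r in (i, j)})
    edges = []
    for i, j, d_min, d_max in paths:
        edges.append((j, i, period - d_max))  # setup constraint
        edges.append((i, j, d_min))           # hold constraint
    # Starting every distance at 0 acts like a virtual source node.
    dist = {r: 0.0 for r in registers}
    for _ in range(len(registers)):
        changed = False
        for u, v, w in edges:
            if dist[u] + w < dist[v] - 1e-12:
                dist[v] = dist[u] + w
                changed = True
        if not changed:
            return dist  # converged: dist satisfies every constraint
    return None  # still relaxing after n passes: negative cycle


def min_period(paths: List[Path], tol: float = 1e-9) -> Tuple[float, Dict[str, float]]:
    """Bisect on the clock period; feasibility is monotone in the period."""
    lo = 0.0
    hi = max(d_max for _, _, _, d_max in paths)  # zero-skew clock period
    schedule = feasible_schedule(paths, hi)
    assert schedule is not None, "zero skew must be feasible (needs d_min >= 0)"
    while hi - lo > tol:
        mid = (lo + hi) / 2
        trial = feasible_schedule(paths, mid)
        if trial is None:
            lo = mid
        else:
            hi, schedule = mid, trial
    return hi, schedule


# Hypothetical three-register pipeline: R1 -> R2 -> R3.
EXAMPLE_PATHS = [("R1", "R2", 2.0, 8.0), ("R2", "R3", 1.0, 4.0)]
period, schedule = min_period(EXAMPLE_PATHS)
print(f"zero-skew period: 8.0, scheduled period: {period:.3f}")
print({name: round(t, 3) for name, t in schedule.items()})
```

In this toy case the zero-skew period is set by the slowest path (8 time units), while the scheduled period converges to about 6, the delay uncertainty (maximum minus minimum delay) of that path. The clock is delivered to R1 about 2 units earlier than to R2.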
As with any project of this magnitude, mistakes are likely. To the best knowledge of the authors, proper credit has been given to everyone whose work has been mentioned here, but the authors take full responsibility for any errors or omissions.
Acknowledgments
The authors would like to thank all of those who have helped write and correct early manuscript versions of this monograph: fellow colleagues and students, as well as the anonymous reviewers who provided important comments on improving the overall quality of this book. The authors would also like to thank Dr. Bob Grafton from the National Science Foundation for supporting the early research projects that have culminated in the writing and production of this book. We would also like to warmly acknowledge the assistance and support of Alex Greene and Katelyn Stanne from Springer. Alex and Katie’s patience and encouragement have been crucial to the success of this project.
The research work described in this monograph was made possible in part by support from the National Science Foundation under Grant No. MIP-9423886 and Grant No. MIP-9610108, by a grant from the New York State Science and Technology Foundation to the Center for Advanced Technology-Electronic Imaging Systems, and by grants from the Xerox Corporation, IBM Corporation, Intel Corporation and Multigig Inc.
Pittsburgh, PA, Philadelphia, PA, Rochester, NY, July, 2008
Ivan S. Kourtev Baris Taskin Eby G. Friedman
Contents
1 Introduction
2 VLSI Systems
  2.1 Signal Representation
  2.2 Synchronous VLSI Systems
  2.3 The VLSI Design Process
  2.4 Summary
3 Signal Delay in VLSI Systems
  3.1 Delay Metrics
  3.2 Devices and Interconnections
    3.2.1 Analytical Delay Analysis
    3.2.2 Controlling the Delay
    3.2.3 Waveform Effects
    3.2.4 Short-Channel Effects
    3.2.5 The Importance of Interconnections
    3.2.6 Delay Mitigation
4 Timing Properties of Synchronous Systems
  4.1 Storage Elements
  4.2 Latches
  4.3 Parameters of Latches
    4.3.1 Width of the Clock Pulse
    4.3.2 Latch Clock-to-Output Delay
    4.3.3 Latch Data-to-Output Delay
    4.3.4 Latch Setup Time
    4.3.5 Latch Hold Time
  4.4 Flip-Flops
  4.5 Parameters of Flip-Flops
    4.5.1 Width of the Clock Pulse
    4.5.2 Flip-Flop Clock-to-Output Delay
    4.5.3 Flip-Flop Setup Time
    4.5.4 Flip-Flop Hold Time
  4.6 The Clock Signal
    4.6.1 Clock Skew
    4.6.2 Multi-Phase Clock Synchronization
  4.7 Single-Phase Path with Flip-Flops
    4.7.1 Preventing the Late Arrival of the Data Signal
    4.7.2 Preventing the Early Arrival of the Data Signal
  4.8 Single-Phase Path with Latches
    4.8.1 Preventing the Late Arrival of the Data Signal
    4.8.2 Preventing the Early Arrival of the Data Signal
  4.9 Multi-Phase Path with Latches
    4.9.1 Preventing the Late Arrival of the Data Signal
    4.9.2 Preventing the Early Arrival of the Data Signal
  4.10 A Final Note
5 Clock Skew Scheduling and Clock Tree Synthesis
  5.1 Background
  5.2 Definitions and Graphical Model
    5.2.1 Permissible Range of Clock Skew
    5.2.2 Graphical Model of a Synchronous System
  5.3 Clock Scheduling
  5.4 Timing Constraints and Design Automation
  5.5 Structure of the Clock Distribution Network
  5.6 Solution of the Clock Tree Synthesis Problem
  5.7 Software Implementation
    5.7.1 Simultaneous Clock Scheduling and Clock Tree Synthesis
    5.7.2 Clock Skew Scheduling
6 Clock Skew Scheduling of Level-Sensitive Circuits
  6.1 Clock Scheduling for Level-Sensitive Circuits
    6.1.1 Latching Constraints
    6.1.2 Synchronization Constraints
    6.1.3 Propagation Constraints
    6.1.4 Validity Constraints
    6.1.5 Initialization Constraints
  6.2 Iterative Approach to Clock Skew Scheduling
  6.3 Linearization of the Timing Analysis
    6.3.1 Modified Big M (MBM) Method
    6.3.2 Linear Programming (LP) Model
  6.4 An Example and Experimental Results
    6.4.1 Level-Sensitive Synchronous Circuit State of Operation
  6.5 Optimality of the LP Formulation
  6.6 Multi-Phase Level-Sensitive Circuits
    6.6.1 Multi-Phase Synchronization Overview
    6.6.2 Multi-Phase Level-Sensitive Circuit Timing
  6.7 Summary
7 Clock Skew Scheduling for Improved Reliability
  7.1 Problem Formulation
    7.1.1 Clock Scheduling for Maximum Performance
    7.1.2 Maximizing Safety
    7.1.3 Further Improvement
    7.1.4 Clock Scheduling as a Quadratic Programming Problem
  7.2 Derivation of the QP Algorithm
    7.2.1 The Circuit Graph
    7.2.2 Linear Dependence of Clock Skews
    7.2.3 Optimization Problem and Solution
8 Delay Insertion and Clock Skew Scheduling
  8.1 Limitations on Minimum Clock Period
    8.1.1 Uncertainty of Data Propagation Times
    8.1.2 Data Path Cycles
    8.1.3 Reconvergent Paths
  8.2 Delay Insertion Method
    8.2.1 Motivational Example with a Reconvergent Path
    8.2.2 Reconvergence in an Edge-Triggered Circuit
    8.2.3 Reconvergence in a Level-Sensitive Circuit
    8.2.4 General Reconvergent Data Path Systems
  8.3 Linear Problem Formulation
  8.4 Practical Concerns in Modeling and Application
  8.5 Summary
9 Practical Considerations
  9.1 Computational Analysis
    9.1.1 Algorithm LMCS-1
    9.1.2 Algorithm LMCS-2
    9.1.3 Algorithm CSD
    9.1.4 Summary of the Proposed Algorithms
  9.2 Unconstrained Basis Skews
  9.3 I/O Registers and Target Delays
  9.4 Summary
10 Clock Skew Scheduling in Rotary Clocking Technology
  10.1 Resonant Clocking
    10.1.1 Rotary Traveling Wave Oscillators
    10.1.2 Timing Requirements of Rotary Circuits
  10.2 Physical Design Flow
    10.2.1 Timing-Driven Partitioning
    10.2.2 Partitioning with chaco
    10.2.3 Register Insertion for Partitioning
    10.2.4 Clock Skew Scheduling of Partitions
    10.2.5 Timing-Driven Register Placement
  10.3 Parallelization of Clock Skew Scheduling
    10.3.1 Speedup of Computation
  10.4 Summary
11 Experimental Results
  11.1 Clock Skew Scheduling of Level-Sensitive Circuits
    11.1.1 Experimental Results on ISCAS’89 Benchmark Circuits
    11.1.2 Verification and Interpretation of Results
    11.1.3 Parameter Data Distributions
    11.1.4 Skew Analysis
  11.2 Multi-Phase Level-Sensitive Circuits
    11.2.1 Multi-Phase Clocking
    11.2.2 Multi-Phase Clocking Effects on Time Borrowing
    11.2.3 Multi-Phase Clocking and Clock Skew Scheduling
    11.2.4 Simultaneous Time Borrowing and Clock Skew Scheduling
  11.3 Quadratic Programming (QP) for Maximizing Safety
    11.3.1 Description of Computer Implementation
    11.3.2 Graphical Illustrations of Results
  11.4 Delay Insertion in Clock Skew Scheduling
  11.5 Physical Design of Rotary Clock Synchronized Circuits
    11.5.1 Clock Skew Scheduling of Partitions Results
    11.5.2 Overall CAD Tool Results
12 Conclusions
References
Index
List of Figures
1.1 Moore’s law: an exponential increase in circuit density.
1.2 Moore’s law: an exponential increase in circuit performance.
1.3 Example of applying localized negative clock skew.
2.1 Logic schematic view of a full adder circuit.
2.2 Circuit view of a two-input NAND gate.
2.3 Signal delay with linear ramp input and a linear ramp output.
2.4 Signal delay with linear ramp input and an exponential output.
2.5 A finite-state machine (FSM) model of a synchronous system.
2.6 A local data path.
2.7 A typical integrated circuit design flow.
3.1 A simple electronic circuit.
3.2 Signal waveforms for the circuit shown in Figure 3.1(b).
3.3 Signal waveforms for the inverter shown in Figure 3.1(b).
3.4 An N-channel enhancement mode MOS transistor.
3.5 A basic CMOS inverter logic gate.
3.6 Operating mode of a CMOS inverter.
3.7 High-to-low output transition for a step input signal.
3.8 Operating point trajectory of a CMOS inverter for different.
3.9 Low-to-high output transition for a step input signal.
3.10 Graphical illustration of the RC signal delay expressions.
4.1 A general view of a register.
4.2 Schematic representation of a level-sensitive register or latch.
4.3 Idealized operation of a level-sensitive register or latch.
4.4 Parameters of a level-sensitive register.
4.5 An edge-triggered register or flip-flop.
4.6 Idealized operation of an edge-triggered register or flip-flop.
4.7 Parameters of an edge-triggered register.
4.8 A typical clock signal.
4.9 Lead/lag relationships causing clock skew.
4.10 A sample multi-phase synchronization clock.
4.11 Multi-phase clock skew.
4.12 A single-phase local data path.
4.13 Timing diagram: violation of the setup constraint.
4.14 Timing diagram: violation of the hold constraint.
4.15 A single-phase local data path with latches.
4.16 A multi-phase local data path with latches.
5.1 A simple synchronous digital circuit with four registers and four logic gates.
5.2 The permissible range of the clock skew of a local data path. A timing violation exists if $s_k \notin [l_k, u_k]$.
5.3 A directed multi-graph representation of the synchronous system shown in Figure 5.1. The graph vertices correspond to the registers $R_1$, $R_2$, $R_3$ and $R_4$, respectively.
5.4 A graph representation of the synchronous system shown in Figure 5.1 according to Definition 5.3. The graph vertices $v_1$, $v_2$, $v_3$ and $v_4$ correspond to the registers $R_1$, $R_2$, $R_3$ and $R_4$, respectively.
5.5 Transformation rules for the circuit graph.
5.6 Application of non-zero clock skew to improve circuit performance (a lower clock period) or circuit reliability (increased safety margins within the permissible range).
5.7 Tree structure of a clock distribution network.
5.8 Buffered clock tree for the benchmark circuit s1423. The circuit s1423 has a total of $N = 74$ registers and the clock tree consists of 45 buffers with a branching factor of $f = 3$.
5.9 Buffered clock tree for the benchmark circuit s400. The circuit s400 has a total of $N = 21$ registers and the clock tree consists of 14 buffers with a branching factor of $f = 3$.
5.10 Sample input for the clock scheduling program described in Section 5.7.2.
5.11 Sample output for the clock scheduling program described in Section 5.7.2.
5.12 The application of clock skew scheduling to a commercial integrated circuit with 6,890 registers [note that the time scale is in femtoseconds, $1\,\mathrm{fs} = 10^{-15}\,\mathrm{s} = 10^{-6}\,\mathrm{ns}$].
6.1 Possible cases for the arrival and departure times of data at the initial latch.
6.2 Propagation of the data signal in a simple circuit.
6.3 The iterative algorithm for static timing analysis of level-sensitive circuits.
6.4 A simple synchronous circuit.
6.5 A single-phase synchronization clock with a 50% duty cycle.
6.6 Zero and non-zero clock skew timing schedules for the level-sensitive circuit in Figure 6.4.
6.7 The optimized timing schedule for s27 operable with $T_{CP} = 4.1$.
6.8 Run times under 1250 seconds for the LP and MIP formulations.
6.9 Propagation of the data signal in a simple multi-phase circuit.
7.1 Circuit graph of the simple example circuit $C_1$ from Section 7.1.1.
7.2 Two spanning trees and the corresponding minimal sets of linearly independent clock skews and linearly independent cycles for the circuit example $C_1$. Edges from the spanning tree are indicated with thicker lines.
8.1 Limitation on the minimum clock period $T_{CP}$ caused by the delay uncertainty of a local data path.
8.2 Limitation on the minimum clock period $T_{CP}$ caused by data path cycles.
8.3 Limitation on the minimum clock period $T_{CP}$ caused by reconvergent paths.
8.4 A simple reconvergent data path system.
8.5 Timing of the edge-sensitive reconvergent system in Figure 8.4 after CSS.
8.6 The simple reconvergent system in Figure 8.4 after delay insertion.
8.7 Two reconvergent data path systems satisfying (P1) and (P2), respectively.
8.8 Timing of the simple level-sensitive reconvergent system in Figure 8.4 after CSS.
8.9 A generalized reconvergent data path system.
8.10 Timing of the edge-triggered reconvergent system with $m = 3$ and $n = 2$.
8.11 Timing of the level-sensitive reconvergent system with $m = 3$ and $n = 2$.
9.1 Computation of the clock schedule basis $s_b$ by computing only the last $n_b$ rows of the matrix $-Z + I$.
9.2 The numerical constants (as functions of $k = p/r$) of the term $r^3$ in the runtime complexity expressions for the algorithms LMCS-1, LMCS-2 and CSD, respectively.
9.3 The numerical constants (as functions of $k = p/r$) of the term $r^2$ in the memory complexity expressions for the algorithms LMCS-1, LMCS-2 and CSD, respectively.
9.4 Modified example circuit $C_1$ to include an additional edge $e_6$. $C_1$ is originally introduced in Section 7.1.1 and illustrated in Figure 7.1.
9.5 I/O registers in a VLSI integrated circuit. Note that the I/O registers form part of the local data paths between the inside of the circuit and the outside of the circuit.
10.1 Basic rotary clock architecture.
10.2 The RTWO theory.
10.3 The cross-section of the transmission line with shunt connected inverters.
10.4 Line voltage and line current for the 3.4 GHz clock example.
10.5 The clock phase relationships on an ROA ring.
10.6 The physical design flow of VLSI circuits with RTWO clock synchronization.
10.7 Partitioning a circuit for timing analysis.
10.8 An ROA ring in a chip layout illustrated in 0.13 µm technology.
10.9 Xgrid computing cluster.
11.1 Data propagation times for s938 with 32 registers and 496 data paths.
11.2 Maximum effective path delays in data paths of s938 for zero clock skew.
11.3 Maximum effective path delays for s938 for non-zero clock skew.
11.4 Distribution of the clock skew values of the non-zero clock skew case for s938.
11.5 Distribution of the clock delay values of the non-zero clock skew case for s938.
11.6 Generation of an n-phase data path with latches.
11.7 Non-overlapping multi-phase synchronization clock.
11.8 Effects of multi-phase clocking on time borrowing.
11.9 Effects of multi-phase clocking on clock skew scheduling.
11.10 Effects of multi-phase clocking on time borrowing and clock skew scheduling.
11.11 Circuit s3271 with $r = 116$ registers and $p = 789$ local data paths. The target clock period is $T_{CP} = 40.4$ nanoseconds.
11.12 Circuit s1512 with $r = 57$ registers and $p = 405$ local data paths. The target clock period is $T_{CP} = 39.6$ nanoseconds.
11.13 Percentage improvements through delay insertion in Table 11.6.
11.14 Percentage improvements on edge-triggered circuits in Table 11.6.
11.15 Percentage improvements on level-sensitive circuits in Table 11.6.
11.16 CAD tool flow.
11.17 The run times of hpictiming with Xgrid on large circuits.
11.18 Run time breakdown of hpictiming program steps for s38584.
11.19
11.20
240 Run time breakdown of hpictiming program steps for s38417. . . 240 Run time breakdown of hpictiming program steps for industrial1. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 240
1 Introduction
The concept of data or information processing arises in a variety of fields. Understanding the principles behind this concept is fundamental to computer design, communications, manufacturing process control, biomedical engineering, and an increasingly large number of other areas in technology and science. It is impossible to imagine modern life without computers for generating, analyzing and retrieving large amounts of information, as well as communicating information regardless of location.
Technologies for designing and building microelectronics-based computational equipment have been steadily advancing ever since the first commercial discrete integrated circuits (ICs) were introduced in the late 1950’s [3].¹ As predicted by Moore’s Law in the 1960’s [4], integrated circuit density has been doubling approximately every 18 months. This scaling of circuit size has been accompanied by a similar exponential increase in circuit speed (or more precisely, clock frequency). These trends of steadily increasing circuit size and clock frequency are illustrated in Figures 1.1 and 1.2, respectively. As a result of this amazing revolution in semiconductor technology, it is not unusual for modern integrated circuits to contain over ten million switching elements (i.e., transistors) packed into a chip area as large as 500 mm² (e.g., [5, 6, 7]). This truly exceptional technological capability is due to advances in both design methodologies and physical manufacturing technologies. Research and experience demonstrate that this trend of exponentially increasing integrated circuit computational power will continue into the foreseeable future.
Integrated circuit performance is typically characterized [8] by the speed of operation, the available circuit functionality, and the power consumption, and there are multiple factors which directly affect these performance characteristics. While each of these factors is significant, on the technological side, increased circuit performance has been largely achieved by the following approaches:
• reduction in feature size (technology scaling), that is, the capability of manufacturing physically smaller and faster circuit structures,
• increase in chip area, permitting a larger number of circuits and therefore greater on-chip functionality,
• advances in packaging technology, permitting the increasing volume of data traffic between an integrated circuit and its environment as well as the efficient removal of heat created during circuit operation.
¹ Monolithic integrated circuits were first introduced in the early 1960’s.
I.S. Kourtev et al., Timing Optimization Through Clock Skew Scheduling, DOI: 10.1007/978-0-387-71056-3_1, © Springer Science+Business Media LLC 2009
[Figure 1.1: transistor count per integrated circuit (logarithmic axis, $10^4$ to $10^{10}$) versus year (1975 to 2005). Labeled devices include the i4004, i8086, i8087, μPD7720, μPD7809, i80286, V60/V70, V80, i486, i43201, i860, Pentium, Pentium II, Pentium IV, Itanium II, and dual-core Itanium.]
Fig. 1.1. Moore’s law: an exponential increase in circuit density, or number of transistors, per integrated circuit.
[Figure 1.2: clock frequency in MHz (logarithmic axis, 1 to $10^4$) versus year (1975 to 2005). Labeled devices include the i4004, i8086, i80286, V70, Pentium, DEC Alpha, Itanium, Pentium IV, and Itanium II.]
Fig. 1.2. Moore’s law: an exponential increase in circuit performance, or clock frequency.
The most complex integrated circuits are referred to as VLSI circuits, where the term VLSI stands for Very Large Scale Integration. This term describes the complexity of modern integrated circuits consisting of hundreds of thousands to many millions of active transistor elements. Presently, the leading integrated circuit manufacturers have a technological capability for the mass production of VLSI circuits with feature sizes as small as 65 nm [5, 6]. These sub-100 nanometer technologies are identified with the term deep submicrometer (DSM) since the minimum feature size is well below the one micrometer mark. As these dramatic advances in fabricating technologies take place, integrated circuit performance is often limited by effects closely related to the very reasons behind these advances, such as small geometry interconnect structures. Circuit performance has become strongly dependent on, and limited by, electrical issues that are particularly significant in deep submicrometer integrated circuits. Signal delay and related waveform effects are among those phenomena that have a great impact on high performance integrated circuit design methodologies and the resulting system implementation. In the case
of fully synchronous VLSI systems, these effects have the potential to create catastrophic failures due to the limited time available for signal propagation among the gates.
The material presented in this monograph is associated with these aforementioned delay effects from the perspective of a synchronous digital VLSI system. The research results described here can be used to improve the performance and reliability of a synchronous VLSI circuit through the design of the clock distribution network common to any synchronous digital system. Specifically, new algorithms for scheduling the arrival time of the clock signals at the individual registers (or synchronous macro blocks) of a circuit and synthesizing the overall clock tree are discussed. Operational characteristics, performance improvements and limitations to suggested improvements are presented in a cohesive manner.
To provide an intuitive perspective into the topics discussed here, consider the simple synchronous circuit shown in Figure 1.3 [9]. Two consecutively connected local data paths, consisting of the registers R1 and R2, and R2 and R3, respectively, are depicted in this figure. Consider that, by design, clock delays to R1 and R3 must be identical. That is, the clock signal C1 to the register R1 is synchronized² with the clock signal C3 to R3. The signal delays through the registers are considered identical in this example, numerically assigned to 2 ns. Under this identical register delay assumption, the path from R2 to R3 is the worst case path (since it has a larger logic signal delay). By delaying the clock signal C3 to the register R3 with respect to the clock signal to the register R2, a leading (or negative) clock skew is added to this local data path from R2 to R3. As the clock delays to R1 and R3 are designed to be identical, a certain amount of lagging (or positive) clock skew is applied to the local data path from R1 to R2. Thus, the clock signal C2 should be designed to lead the clock signal C3 by 1.5 ns, thereby forcing both paths R1 to R2 and R2 to R3 to have the same total effective local data path delay (consisting of the propagation delay $T_{PD}$ and the local data path skew $T_{Skew}$), $T_{PD} + T_{Skew} = 7.5$ ns. The delay of the critical path (R2 to R3) of the synchronous circuit is temporally refined to the precision of the clock distribution network, and the entire system (for this simple example) could operate at a maximum clock frequency of 133.3 MHz. Note that, if no localized clock skew were applied, the maximum possible frequency would be 111.1 MHz. The performance characteristics of the system, both with and without the application of localized clock skew, are summarized in Table 1.1.
² The signals C1 and C3 arrive at the same time with no delay or advance with respect to each other.
[Figure 1.3: registers R1, R2, and R3 connected in series along the data signal path, with a logic delay of 4 ns between R1 and R2 and a logic delay of 7 ns between R2 and R3. The clock signals C1, C2, and C3 reach the registers with delays of 3 ns, 1.5 ns, and 3 ns, respectively.]
Fig. 1.3. Example of applying localized negative clock skew to a synchronous circuit.

Table 1.1. Performance characteristics of the circuit shown in Figure 1.3 without and with localized clock skew (all times in ns).

Local data path | T_PD(min) with zero skew | T_Ci | T_Cf | T_Skew | T_PD(min) with non-zero skew
R1 → R2         | 4 + 2 + 0 = 6            | 3    | 1.5  | 1.5    | 4 + 2 + 1.5 = 7.5
R2 → R3         | 7 + 2 + 0 = 9            | 1.5  | 3    | −1.5   | 7 + 2 − 1.5 = 7.5
f_max           | 111.1 MHz                |      |      |        | 133.3 MHz
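The arithmetic behind Table 1.1 can be reproduced with a short script. The sketch below is only an illustration under stated assumptions: the function names `effective_delay` and `max_frequency_mhz` are hypothetical, the skew is taken as $T_{Skew} = T_{Ci} - T_{Cf}$ (the convention implied by the table entries), and the register delay is the 2 ns value assumed in the example.

```python
# Sketch reproducing the entries of Table 1.1 (all times in ns).
REGISTER_DELAY = 2.0  # register delay assumed in the example

# (name, logic delay, clock delay to initial register, clock delay to final register)
PATHS = [
    ("R1->R2", 4.0, 3.0, 1.5),
    ("R2->R3", 7.0, 1.5, 3.0),
]


def effective_delay(logic_delay, t_ci, t_cf, use_skew=True):
    """Effective path delay T_PD + T_Skew, with T_Skew = T_Ci - T_Cf."""
    skew = (t_ci - t_cf) if use_skew else 0.0
    return logic_delay + REGISTER_DELAY + skew


def max_frequency_mhz(use_skew):
    """Maximum clock frequency set by the slowest effective path."""
    worst = max(effective_delay(d, ci, cf, use_skew) for _, d, ci, cf in PATHS)
    return 1000.0 / worst  # a period of 1 ns corresponds to 1000 MHz


if __name__ == "__main__":
    for use_skew in (False, True):
        label = "non-zero skew" if use_skew else "zero skew"
        delays = {name: effective_delay(d, ci, cf, use_skew)
                  for name, d, ci, cf in PATHS}
        print(label, delays, f"f_max = {max_frequency_mhz(use_skew):.1f} MHz")
```

With zero skew the slower path (9 ns) limits the clock frequency; with the skew applied both paths present the same 7.5 ns effective delay, which is why the balanced schedule yields the higher frequency.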
Note that $|T_{Skew}| < T_{PD}$ (since $|-1.5\ \text{ns}| < 9\ \text{ns}$) for the local data path from R2 to R3. Therefore, it is ensured that the correct data signal is successfully latched into R3 and no local data path/clock skew constraint relationship is violated. This design technique of applying localized clock skew is particularly effective in sequentially-adjacent, temporally irregular local data paths; however, it is applicable to any type of synchronous sequential system. For certain architectures, a significant improvement in performance and reliability is both possible and likely.
One of the objectives of this research monograph is to provide detailed insight into the systematic application of the technique exemplified in Figure 1.3 and Table 1.1 and described above to synchronous sequential digital circuits of arbitrary structure and size. To this end, the basic properties of CMOS-based digital integrated circuits as well as the fundamental principles of synchronous VLSI system operation are reviewed in Chapter 2. In Chapters 3 and 4, the timing issues related to the implementation of synchronous VLSI circuits are discussed. A summary of the definitions and notations used in this monograph is presented. Signal delay in CMOS digital integrated circuits is presented in Chapter 3 where the sources of both device and interconnect delays are discussed. In Chapter 4, the fundamental timing relationships of synchronous digital systems are summarized as these relationships are key to understanding the algorithms presented in Chapters 5 and 7. More specifically, Chapter 4 describes in considerable detail the properties of both the various types of timed storage elements and of the data paths built with these elements.
In Chapter 5, clock skew scheduling is formally introduced. Specifically, the relationships between clock skew and the clock distribution network are analyzed in detail and a methodology for circuit performance optimization is presented. The presentation in Chapter 5 focuses on the appropriate use of timing constraints and an optimization objective to formulate a mathematical clock skew scheduling problem for a given circuit. Circuits with both edge-triggered (flip-flops) and level-sensitive (latches) registers as storage elements are analyzed. It is shown how Linear Programming (LP) formulations can be used in clock skew scheduling with the objective of minimizing the clock period of a circuit. In practice, there may be a variety of situations where a different design objective is more appropriate. For example, it may be appropriate to try to maximize the timing reliability of the circuit under various process and operating variations or to decrease the total circuit area by downsizing circuits without compromising the timing reliability of the circuit. Such design objectives can be addressed successfully via clock skew scheduling and two important applications are detailed in Chapters 7 and 8. In Chapter 6, the application of clock skew scheduling to circuits with level-sensitive registers is formulated as an LP problem and the performance results are presented. In Chapter 7, a different class of clock skew scheduling algorithms is described in detail. Based on a Quadratic Programming (QP) formulation, these algorithms can be used when it is important to maximize the timing reliability of a circuit in the presence of process and operating parameter variations. In Chapter 8, clock skew scheduling is discussed from a different perspective. It is shown that by taking advantage of clock skew scheduling, the logic delay of parts of the circuit can be increased without compromising the circuit reliability and correct operation.
Longer permissible logic delays are directly translated into reduced circuit sizes, thereby leading to savings in both circuit area and power. In Chapter 9, an efficient solution to the QP problem formulation in Chapter 7 is developed and analyzed. Also demonstrated in Chapter 9 is a process for integrating certain issues of practical importance into the mathematical model presented in Chapter 7. In Chapter 10, the application of clock skew scheduling to an emerging type of clock distribution network based on resonant oscillation and adiabatic switching is described in detail. It is shown how clock skew scheduling algorithms can be modified in order to address the particular physical design challenges of the resonant rotary clocking technology. In Chapter 11, the application of the QP based algorithms to benchmark and industrial circuits is presented. Finally, some conclusions are offered in Chapter 12.
2 VLSI Systems
High performance VLSI digital systems are composed of millions of electronic devices that exhibit switching properties. The analysis and design of these systems can be approached at different levels of abstraction, with the advantages and limitations corresponding to each such level [10, 11]. Abstract representations are used to hide the details and highlight the essential features of a system in a specific context. For example, the system architects of a VLSI integrated circuit may choose Boolean or switching algebra as the formal mathematical framework to describe a complex computational procedure [12, 13]. Circuit designers, on the other hand, may be interested in active and passive circuit elements such as transistors and interconnect, as well as in the underlying physical laws that govern the operation of these elements [10, 14, 15]. Aspects of the VLSI design issues covered in this monograph overlap several levels of abstraction and require familiarity with the terminology and phenomena at each of these levels. The information described in this chapter provides a fundamental background for motivating the use of clock distribution networks in VLSI-based synchronous digital systems. The essential characteristics of a digital VLSI system are reviewed in this chapter. First, the basic signal properties related to digital circuits are presented in Section 2.1. Following this description, the principles of operation of a synchronous digital system are discussed in Section 2.2. The VLSI circuit design process is summarized in Section 2.3 followed by some concluding remarks in Section 2.4.
2.1 Signal Representation
Data processing in the most widely available types of digital integrated circuits (e.g., CMOS, Bipolar, BiCMOS and GaAs) is based on the transport of electrical energy from one physical location to another physical location. Typically, the information being processed is encoded as a physical variable that can be stored and transmitted to other locations while functionally manipulated along the way. Such a physical variable, also called a signal, is, for example, the electrical voltage provided by a power supply (with respect to a ground potential) and developed in circuit elements in the presence of an electromagnetic field. The voltage signal or bit of information (in a digital circuit) is temporarily stored in a circuit structure capable of accumulating electric charge. This accumulating or storage property is called a capacitance, denoted by the symbol C, and, depending on the materials and the physical properties, is created by a variety of different forms of conductor-insulator-conductor structures commonly found in integrated circuits.
Furthermore, modern digital circuits utilize Boolean (binary) logic, in which information is encoded by two values of a signal. These two signal values are typically called false and true (or low and high or logic zero and logic one) and correspond to the minimum and maximum¹ allowable values of the signal voltage for a specific integrated circuit implementation.² Since the voltage V is proportional to the stored electric charge q (q = CV, where C is the storage capacitance), the logic low value corresponds to a fully discharged capacitance (q = CV = 0) while the logic high value corresponds to a capacitance storing the maximum possible charge (fully charged to a voltage V).
The largest and most complicated digital integrated circuits today contain many millions of circuit elements each processing hundreds and thousands of binary signals [8, 10, 16, 17]. Every circuit element has a number of input terminals through which data is received from other elements. In addition, a circuit element has a number of output terminals through which the results of the processing are made available to other elements. For a circuit to implement a particular function, the inputs and outputs of all of the elements must be properly connected among each other.
These connections are accomplished with wires, which are collectively referred to as an interconnect network, while the set of circuit elements processing the binary signals is often simply called the logic gates. During normal circuit operation, signals are received at the inputs of the logic gates, the gates process the signals to generate new data, and then transmit the resulting data signals to the corresponding logic elements through a network of interconnections. This process involves the transport of a voltage signal from one physical location to another physical location. In each case, this process takes a small yet finite amount of time to be completed and is often called the propagation delay of the signal.
Usually, a small number of logic gates are combined to yield modules (or standard cells) that perform frequently encountered operations; these modules can then be reused at many different places in a circuit. An example of such a module is the full adder circuit shown in Figure 2.1. This specific circuit adds two one-bit numbers $x_0$ and $y_0$ and a carry-in bit $c_0$ to produce a two-bit result $z_1 z_0$, where $z_1 = x_0 y_0 + x_0 c_0 + y_0 c_0$ and $z_0 = x_0 \oplus y_0 \oplus c_0$. A typical CMOS transistor configuration for one of the two-input NAND gates is shown in Figure 2.2 (corresponding to the gates na1 through na3 in Figure 2.1).
¹ Or the maximum and minimum (that is, vice versa) voltage levels.
² In practice, a range of values close to the minimum and maximum signal voltages, respectively, are interpreted as logic zero and one, respectively. By doing so, the noise immunity of the circuit is significantly improved.
[Figure 2.1: logic schematic with inputs x0, y0, and c0, outputs z0 and z1, and gates labeled xo1, xo2, na1, na2, na3, and na4.]
Fig. 2.1. Logic schematic view of a full adder circuit.
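The Boolean expressions given for the full adder can be checked exhaustively over all eight input combinations. The sketch below is one possible gate-level reading of Figure 2.1, and it rests on assumptions not stated in the text: that na1 through na3 are two-input NAND gates feeding a three-input NAND gate na4, and that xo1 and xo2 are cascaded XOR gates. The function names are illustrative.

```python
from itertools import product


def nand(*inputs):
    """NAND of any number of one-bit inputs."""
    return 0 if all(inputs) else 1


def xor(a, b):
    """XOR of two one-bit inputs."""
    return a ^ b


def full_adder(x0, y0, c0):
    """Return (z1, z0): the carry and sum bits of x0 + y0 + c0."""
    # Carry: z1 = x0*y0 + x0*c0 + y0*c0, realized as a NAND of NANDs.
    na1 = nand(x0, y0)
    na2 = nand(x0, c0)
    na3 = nand(y0, c0)
    z1 = nand(na1, na2, na3)  # assumed role of gate na4
    # Sum: z0 = x0 XOR y0 XOR c0, via two cascaded XOR gates.
    xo1 = xor(x0, y0)
    z0 = xor(xo1, c0)  # assumed role of gate xo2
    return z1, z0


if __name__ == "__main__":
    # Print the full truth table.
    for x0, y0, c0 in product((0, 1), repeat=3):
        z1, z0 = full_adder(x0, y0, c0)
        print(x0, y0, c0, "->", z1, z0)
```

The NAND-of-NANDs form follows from De Morgan's law: the NAND of three inverted product terms equals the OR of the product terms, which is the sum-of-products expression for $z_1$.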
[Figure 2.2: transistor-level schematic of a two-input NAND gate with inputs x0 and x1, connected to the supply VDD.]
Fig. 2.2. Circuit view of a two-input NAND gate.
The rate of data processing in a digital integrated circuit is directly related to two factors: how fast the circuit can switch between the two logic values, and how precisely a circuit element can interpret a specific signal value as the intended binary logic state. Switching the state of a circuit between two logic values requires either charging a fully discharged capacitance or discharging a fully charged capacitance, depending upon the type of state transition, low-to-high or high-to-low. This charging/discharging process is controlled by the active switching elements in the logic gates and is strongly affected by the physical properties of both the gates and the interconnections. Specifically, the signal waveform shapes change, either enhancing or degrading the signals, affecting both the ability and the time required for the logic gates to properly interpret these signals.
The concept of signal propagation delay between two different points A and B of a circuit is illustrated in Figures 2.3 and 2.4. The signals at points A and B (denoted by sA and sB, respectively) are plotted versus time for two different cases in Figures 2.3 and 2.4, respectively. Without considering the specific electronic devices and circuits required to create these waveform shapes, it is assumed that signal sA makes a high-to-low transition and triggers a computation that causes signal sB to make an opposite low-to-high transition. Several important observations can be made from the waveforms depicted in Figures 2.3 and 2.4:
• although sA is the same in each case, sB may have different shapes,
• a temporal relationship (or causality relationship) between sA and sB exists in the sense that sA ‘causes’ sB, thereby preceding the switching event by an amount of time required for the physical switching process to propagate through the circuit structure,
• regardless of shape, sB has the same logical meaning, that is, that the state of the circuit at point B changes from low to high; this low-to-high transition and the reverse high-to-low state transition of signal sA require a positive amount of time to complete.
[Figure 2.3: waveforms of sA and sB versus time with the 10%, 50%, and 90% levels marked; annotations t_fA, t_rB, and t_PDAB = t_PLHAB.]
Fig. 2.3. Signal propagation delay from point A to point B with a linear ramp input and a linear ramp output.
The temporal relationship between sA and sB as shown in Figures 2.3 and 2.4 must be evaluated quantitatively. This information permits the speed of the signals at different points in the same circuit or in different circuits built in different semiconductor technologies to be temporally characterized. By quantifying the physical speed of the logical operations, circuit designers are provided with the necessary timing information to design correctly functioning integrated circuits.
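One common way to quantify this relationship is to measure the delay between the instants at which the two signals cross 50% of the voltage swing, and to measure transition times between the 10% and 90% levels. The percentage labels in Figures 2.3 and 2.4 suggest this convention, but the thresholds, the waveform parameters, and the function names in the sketch below are assumptions made for illustration. The input is a falling linear ramp and the output a rising exponential, as in the caption of Figure 2.4.

```python
import math

VDD = 1.0  # normalized supply voltage (assumed)


def s_a(t, t_fall=1.0):
    """Input sA: linear ramp falling from VDD to 0 over t_fall (arbitrary units)."""
    return VDD * min(1.0, max(0.0, 1.0 - t / t_fall))


def s_b(t, t_start=0.5, tau=0.8):
    """Output sB: exponential rise toward VDD from t_start with time constant tau."""
    if t < t_start:
        return 0.0
    return VDD * (1.0 - math.exp(-(t - t_start) / tau))


def crossing_time(signal, level, t_end=20.0, steps=20_000):
    """First time the sampled signal crosses `level` (linear interpolation)."""
    dt = t_end / steps
    prev_t, prev_v = 0.0, signal(0.0)
    for i in range(1, steps + 1):
        t = i * dt
        v = signal(t)
        # A sign change of (value - level) between samples marks a crossing.
        if (prev_v - level) * (v - level) <= 0.0 and v != prev_v:
            return prev_t + (level - prev_v) * (t - prev_t) / (v - prev_v)
        prev_t, prev_v = t, v
    raise ValueError("level not crossed within the simulated interval")


def propagation_delay():
    """50%-to-50% delay from the falling input sA to the rising output sB."""
    return crossing_time(s_b, 0.5 * VDD) - crossing_time(s_a, 0.5 * VDD)


def rise_time_b():
    """10%-to-90% rise time of sB."""
    return crossing_time(s_b, 0.9 * VDD) - crossing_time(s_b, 0.1 * VDD)


if __name__ == "__main__":
    print(f"t_PD = {propagation_delay():.3f}, t_rB = {rise_time_b():.3f}")
```

Measuring at fixed fractions of the swing makes the result independent of the waveform shape, so delays of circuits with differently shaped outputs can be compared on the same basis.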
[Figure 2.4: waveforms of sA and sB versus time with the 10%, 50%, and 90% levels marked; annotations t_fA, t_rB, and t_PDAB = t_PLHAB.]
Fig. 2.4. Signal propagation delay from point A to point B with a linear ramp input and an exponential output.
2.2 Synchronous VLSI Systems
Typically, a digital VLSI system implements a complex computational algorithm or architecture, such as a Fast Fourier Transform or a RISC³ architecture microprocessor. Although modern VLSI systems contain a large number of components, these systems normally employ only a limited number of different kinds of logic elements or logic gates. Each logic element accepts certain input signals and computes an output signal used by other logic elements. At the logic level of abstraction, a VLSI system is a network of tens of thousands or more logic gates whose terminals are interconnected by wires in order to implement the target algorithm. As mentioned earlier in Section 2.1, the switching variables acting as inputs and outputs of a logic gate in a VLSI system are represented by tangible physical quantities,⁴ while a number of these devices are interconnected to yield the desired function of each logic gate. The specific physical characteristics are collectively summarized with the term technology, which encompasses such details as the type and behavior of the devices that can be built, the number and sequence of the manufacturing steps and the impedance of the different interconnect materials. Today, several technologies are used in the implementation of high performance VLSI systems; these are best exemplified by CMOS, Bipolar, BiCMOS, and Gallium Arsenide [10, 16]. CMOS technology, in particular, exhibits many desirable performance characteristics, such as low power consumption, high density, ease of design and moderate to high speed. Due to these excellent performance characteristics, CMOS technology has become the dominant VLSI technology used today.
The design of a digital VLSI system requires a great deal of effort when considering a broad range of architectural and logic issues, such as choosing the appropriate gates and interconnections among these gates to achieve the required circuit function. No design is complete, however, without considering the dynamic (or transient) characteristics of the signal propagation or, alternatively, the changing behavior of the signals with time. Every computation performed by a switching circuit involves multiple signal transitions between the logic states, each transition requiring a finite amount of time to complete.
³ RISC = Reduced Instruction Set Computer.
⁴ Such quantities as the electrical voltages and currents in electronic devices.
The voltage at every circuit node must reach a specific value for the computation to be completed. Therefore, state-of-the-art integrated circuit design is largely centered around the difficult task of predicting and properly interpreting signal waveform shapes at various points within a circuit. In a typical VLSI system, millions of signal transitions occur, such as those shown in Figures 2.3 and 2.4, which determine the individual gate delays and the overall speed of the system. Some of these signal transitions can be executed concurrently while others must be executed in a strict sequential order [17]. The sequential occurrence of the latter operations (or signal transition events) must be carefully coordinated in time so that logically correct system operation is guaranteed and the results are reliable (in the sense that these results can be repeated). This coordination is known as synchronization and is critical to ensuring that any pair of logical operations in a circuit with a precedence relationship proceed in the proper order. In modern digital integrated circuits, synchronization is achieved at all stages of the system design process and system operation by a variety of techniques, known as a timing discipline or timing scheme [10, 18, 19, 20]. With few exceptions, these circuits are based on a fully synchronous timing scheme, specifically developed to cope with the finite time required by the physical signals to propagate throughout a system. A fully synchronous system is most frequently modeled as a finite-state machine as shown in Figure 2.5.
[Figure 2.5: block diagram with a computation portion (input data, combinational logic, output data, and clocked storage registers) and a synchronization portion (clock signal and clock distribution network).]
Fig. 2.5. A finite-state machine (FSM) model of a synchronous system.
As illustrated in Figure 2.5, there are three recognizable components in this system. The first component (the logic gates, collectively referred to as the combinational logic) provides the range of operations that a system executes. The second component (the clocked storage elements, or simply the registers) stores the results of the logical operations. Together, the combinational logic and registers constitute the computational portion of a synchronous system and are interconnected in a way that implements the required system function. The third component of the synchronous system, known as the clock distribution network, is a highly specialized circuit structure which does not perform a computational process but rather provides an important control capability. The clock generation and distribution network controls the overall synchronization of the circuit by generating a time reference and properly distributing this time reference to every register. The normal operation of a system, such as the example shown in Figure 2.5, consists of the iterative execution of computations in the combinational logic followed by the storage of the processed results in the registers. The actual process of storage is temporally controlled by the clock signal and occurs once the signal transients in the logic gate outputs are completed and the outputs have settled to a valid state. At the beginning of each computational cycle, the inputs of the system together with the data stored in the registers initiate
a new switching process. As time proceeds, the signals propagate through the logic, generating results at the logic output. By the end of the clock period, these results are stored in the registers and are operated upon during the following clock cycle.
[Figure 2.6: registers Ri and Rf, each with data and clock terminals, connected through combinational logic; annotations mark the signal activity at the beginning and at the end of the clock period.]
Fig. 2.6. A local data path.
Therefore, the operation of a digital system can be thought of as the sequential execution of a large set of simple computations that occur concurrently in the combinational logic portion of the system. The concept of a local data path is a useful abstraction for each of these simple operations and is shown in Figure 2.6. The magnitude of the delay of the combinational logic is bound by the requirement of storing data in the registers within a clock period. The initial register Ri is the storage element at the beginning of the local data path and provides some or all of the input signals for the combinational logic at the beginning of the computational cycle (defined by the beginning of the clock period). The combinational path ends with the data successfully latching within the final register Rf where the results are stored at the end of the computational cycle. Each register acts as a source or sink for the data depending upon which phase the system is currently operating in.
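The iterative compute-then-store behavior of this model can be sketched in a few lines. The example below is a hypothetical illustration rather than anything taken from the monograph: the combinational logic is an arbitrary 4-bit accumulator, the class and function names are invented, and the single timing check only reflects the requirement that the combinational delay fit within the clock period (register setup and hold times and clock skew are ignored here).

```python
class SynchronousSystem:
    """Registers plus combinational logic, updated once per clock period."""

    def __init__(self, logic, logic_delay_ns, clock_period_ns, initial_state=0):
        # The logic outputs must settle before the storing clock edge arrives.
        if logic_delay_ns > clock_period_ns:
            raise ValueError("combinational delay exceeds the clock period")
        self.logic = logic
        self.state = initial_state  # current contents of the registers

    def clock_cycle(self, data_in):
        """One cycle: evaluate the logic, then store the result in the registers."""
        next_state, data_out = self.logic(data_in, self.state)
        self.state = next_state  # storage occurs at the end of the period
        return data_out


def accumulator_logic(data_in, state):
    """Combinational logic of a 4-bit accumulator (hypothetical example)."""
    total = (state + data_in) & 0xF
    return total, total


if __name__ == "__main__":
    system = SynchronousSystem(accumulator_logic,
                               logic_delay_ns=7.0, clock_period_ns=9.0)
    print([system.clock_cycle(x) for x in (3, 5, 9, 1)])
```

Keeping the logic as a pure function of the inputs and the stored state mirrors the separation between computation and synchronization in the finite-state machine model: the registers alone carry information from one clock period to the next.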
2.3 The VLSI Design Process

As previously mentioned, VLSI systems are composed of millions of active electronic devices (transistors) with switching properties. Groups of these devices are interconnected to yield functional parts from which the VLSI system is built. Typical functional parts include, for example, logic gates such as the two-input NAND gate shown in Figure 2.2. In this monograph,
the design process refers to the activity in which a concept and a set of specifications are converted into an actual integrated circuit. A view of the VLSI design process—also known as a design flow—is illustrated in Figure 2.7 magnifying the clock distribution network design process. This flow is typical in the design of high-volume, Application-Specific Integrated Circuits (ASICs). The sequence of steps in this design flow is from top to bottom and follows the direction of the arrows as shown in Figure 2.7. As previously mentioned, the design process often starts with loosely defined behavioral and architectural specifications, as well as with design constraints such as physical dimensions, cost, power supply voltage, operational temperature and so on. Architectural specifications are refined and coded into a Hardware Description Language (HDL) which forms the basis for the actual synthesis process. The HDL descriptions are also useful in performing simulations to verify the desired circuit function. The synthesis process is performed by software-based synthesis tools which compile the HDL descriptions into an equivalent logic schematic of a circuit— each logic gate in this schematic has been predesigned and is available to the synthesis tool as a library element. After the circuit synthesis process is completed, the resulting logic and register circuit structures are symbolically placed to form the integrated circuit. Wire routing among the circuit structures is performed next to connect the inputs and outputs of the logic gates as well as to deliver the clock signal to each of the clocked registers within the circuit. A variety of verification and simulation procedures are also performed to ensure the correct functionality and timing of the integrated circuit. Among these procedures is a timing verification step which includes the analysis of the data and clock signal delays to ensure correct temporal operation. 
The body of research presented in this monograph deals with certain aspects of the timing of VLSI-based digital circuits, particularly those topics related to the clock distribution network. The timing optimization algorithms presented in Chapters 5, 6 and 7 are integrated into the design flow at the step called Clock Planning, shown shaded in Figure 2.7. As indicated in Figure 2.7, Clock Planning includes clock scheduling and the design of both the topology and the circuit structure of the clock tree.⁵ The timing information describing the signal delays obtained from the Placement of Logic and Registers step is used in the clock planning process. Specifically, both the maximum and minimum data path delays are used in the clock skew scheduling process. The entire chip verification process is not considered complete until the timing verification is satisfied after the detailed chip routing has been completed and all physical impedance characteristics have been back annotated and analyzed with accurate timing analysis tools [8, 9, 21]. Several iterations of Clock Planning may be required in order to satisfy the entire chip verification process.
⁵ A clock tree is another term for the clock distribution network.
In this monograph, an algorithm to perform the simultaneous non-zero clock skew scheduling and the topological design of the clock tree is presented in Chapter 5, and enhanced algorithms for clock skew scheduling are presented in Chapters 6 and 7.

[Figure: design flow boxes, top to bottom: Behavioral and Architectural Specifications, Logic Synthesis, Timing Specifications; Placement of Logic and Registers (supplying Delay Information); Clock Planning (Clock Scheduling, Clock Tree Topology); Clock Tuning (Pre-Route); Clock Verification; Detailed Chip Routing, Parasitic Extraction, Circuit Simulation and Verification.]
Fig. 2.7. A typical integrated circuit design flow magnifying the clock distribution network design process.

2.4 Summary

The behavior of a fully synchronous system is well defined and controllable as long as the time window provided by the clock period is sufficiently long
to allow every signal in the circuit to propagate through the required logic gates and interconnect wires and successfully latch into the final register of each local data path. In designing the system and choosing the proper clock period, however, two contradictory requirements must be satisfied. First, the smaller the clock period, the more computational cycles can be performed by the circuit in a given amount of time. On the other hand, the time window defined by the clock period must be sufficiently long so that the slowest signals reach the destination registers before the current clock cycle is concluded and the following clock cycle is initiated. This strategy for organizing the computational process has certain clear advantages that have made a fully synchronous timing scheme the primary choice for digital VLSI systems:

• The properties and variations are well understood.
• The nondeterministic behavior of the propagation delay of the combinational logic (due to environmental and process fluctuations and the unknown input signal pattern) is eliminated such that the system as a whole has a completely deterministic behavior corresponding to the implemented algorithm. As long as the data signal is successfully captured inside the register before the arrival of the next clock signal, the timing characteristics of the system are completely known.
• The circuit design process does not need to be concerned with glitches in the combinational logic outputs. Therefore, the only relevant dynamic timing characteristic of the logic is the propagation delay.
• The state of the system is completely defined within the storage elements. This characteristic greatly simplifies certain aspects of the design, debug and test phases when developing a large synchronous digital system.

However, the synchronous paradigm also has certain limitations that make the design of a synchronous VLSI system increasingly challenging:

• The synchronous approach has a serious drawback in that it requires the overall circuit to operate as slowly as the slowest register-to-register path. Thus, the global speed of a fully synchronous system depends upon those data paths with the largest delays. These paths are also known as the worst case or critical paths. In a typical VLSI system, the propagation delays in the combinational paths are distributed unevenly, so there may be many paths with delays much smaller than the clock period. Although these paths could operate at a smaller clock period (or higher clock frequency), it is the critical paths that bound the minimum clock period, thereby imposing a limit on the overall system speed (or clock frequency). This imbalance in propagation delays is sometimes so dramatic that the system speed is dictated by only a handful of very slow paths.
• The clock signal has to be distributed to tens of thousands of storage registers scattered throughout the system. Therefore, a significant portion of the system area and dissipated power is devoted to the clock distribution network, a circuit structure that does not perform any computational function.
• The reliable operation of a synchronous digital system depends upon certain assumptions concerning the propagation delays which, if not satisfied, can lead to catastrophic timing violations which would render the system unusable.
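The first limitation above can be illustrated with a minimal sketch. The register names and path delays below are hypothetical values chosen only for illustration, and the sketch assumes zero clock skew and ignores register setup and clock-to-output times, so the minimum clock period is simply the largest register-to-register delay.

```python
# Hypothetical register-to-register path delays in nanoseconds
# (illustrative values only, not taken from the text).
path_delays_ns = {
    ("R1", "R2"): 1.2,
    ("R2", "R3"): 0.8,
    ("R1", "R3"): 3.1,
    ("R3", "R4"): 0.9,
}


def min_clock_period(delays):
    """Return (period, critical_path) under the simplifying assumptions above."""
    critical_path = max(delays, key=delays.get)
    return delays[critical_path], critical_path


period_ns, critical = min_clock_period(path_delays_ns)

# Time left unused on every path when the clock runs at the minimum period.
slack_ns = {path: period_ns - delay for path, delay in path_delays_ns.items()}

print("minimum clock period (ns):", period_ns)
print("critical path:", critical)
for path, slack in slack_ns.items():
    print(path, "unused time (ns):", round(slack, 3))
```

In this made-up example a single slow path sets the period while the other paths sit idle for most of each cycle, which is the imbalance described in the text.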
3 Signal Delay in VLSI Systems
In order to understand the timing characteristics of a synchronous digital system—specifically, the delays within the data paths and clock distribution network—a more complete understanding of the properties of signal delay in VLSI systems is necessary. The topic of signal delay in VLSI-based systems is examined in detail in this chapter. Delay metrics are first analyzed and certain definitions are introduced in Section 3.1. A more thorough analytical treatment of the subject of computing delay in CMOS integrated circuits is presented in Section 3.2.
3.1 Delay Metrics

The delay of a signal propagating from one point within a circuit to another point is caused by both the active electronic devices (transistors) in the logic elements and the various passive interconnect structures connecting the logic gates. While the physical principles behind the operation of transistors and interconnect are well understood at the current-voltage (I-V) level, it is often computationally difficult to directly apply this detailed information to the densely packed multi-million transistor DSM integrated circuits of today. A general form of a circuit with N input and M output terminals (labeled $x_1, \ldots, x_N$ and $y_1, \ldots, y_M$, respectively) is shown in Figure 3.1(a). The box labeled ‘CIRCUIT’ may represent a simple wire, a transistor, a logic gate consisting of several transistors, or an arbitrarily complex combination of these elements. The logic schematic outlined in Figure 3.1(b), for example, may correspond to a portion of the circuit between points X and Y shown in Figure 3.1(a). With the choice of logic circuit illustrated in Figure 3.1(b), a logically possible signal activity at the circuit points X, Y, and Z is shown in Figure 3.2. The dynamic characteristics and temporal relationships of the signal transitions are described and formalized in Definitions 3.1, 3.2, and 3.3.
I.S. Kourtev et al., Timing Optimization Through Clock Skew Scheduling, DOI: 10.1007/978-0-387-71056-3_3, © Springer Science+Business Media LLC 2009
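[Figure: (a) Abstract representation of a circuit: a box labeled CIRCUIT with inputs x1, ..., xN, outputs y1, ..., yM, and internal points X and Y. (b) Logic schematic of part of the circuit in Figure 3.1(a), with points X, Z, and Y.]
Fig. 3.1. A simple electronic circuit.
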
Definition 3.1. If X and Y are two points in a circuit and $s_X$ and $s_Y$ are the signals at X and Y, respectively, the signal propagation delay $t_{PD_{XY}}$ from X to Y is defined¹ as the time interval from the 50% point of the signal transition of $s_X$ to the 50% point of the signal transition of $s_Y$.

This formal definition of the propagation delay is related to the concept that, ideally, the switching point of a logic gate is at the 50% level of the output waveform. Thus, 50% of the maximum output signal level is assumed to be the boundary point where the state of the gate switches from one binary logic state to the other binary logic state. Practically, a more physically correct definition of propagation delay is the time from the switching point of the driving circuit to the switching point of the driven circuit. Currently, however, this switching point-based reference for signal delay is not widely used in practical computer-aided design applications because of the computational complexity of the algorithms and the increased amount of data required to estimate the delay of a path based on information describing the signal waveform shape. Therefore, choosing the switching point at 50% has become a generally acceptable practice for referencing the propagation delay of a switching element. Also note that the propagation delay $t_{PD}$ as defined in Definition 3.1 is mathematically additive, thereby permitting the delay between any two points X and Y to be determined by summing the delays through consecutive structures between X and Y. From Figures 3.1(b) and 3.2, for example, $t_{PD_{XY}} = t_{PD_{XZ}} + t_{PD_{ZY}}$. However, this additivity property must be applied with caution since neither of the switching points of consecutively connected gates may occur at the 50% level. In addition, passive interconnect structures along signal paths do not exhibit switching properties although physical signals propagate through these structures with finite speed (more precisely, through signal dispersion). Therefore, if the properties of a signal propagating through a series connection of logic gates and interconnections are being evaluated, an analysis of the entire signal path composed of gates and wires, rather than adding 50%-to-50% delays, is necessary to avoid accumulating significant error in the path delay.

¹ Although the delay can be defined from any point X to any other point Y, the points X and Y typically correspond to an input and an output of a logic gate, respectively. In such a case, the signal delay from X to Y is the propagation delay of the gate.

[Figure: waveforms $s_X$, $s_Z$, and $s_Y$ versus time with the 10%, 50%, and 90% levels marked; annotated with $t_{PD_{XY}} = t_{PD_{XZ}} + t_{PD_{ZY}}$, $t_{PD_{XZ}} = t_{PLH_{XZ}}$, and $t_{PD_{ZY}} = t_{PHL_{ZY}}$.]
Fig. 3.2. Signal waveforms for the circuit shown in Figure 3.1(b).

In high performance CMOS VLSI circuits, logic gates often switch before the input signal completes a transition.² This difference in switching speed may be sufficiently large such that an output signal of a gate will reach the 50% point before the input signal reaches the 50% point. If this is the case, $t_{PD}$ as defined by Definition 3.1 may have a negative value. Consider, for example, the inverter connected between nodes X (inverter input) and Z (inverter output) shown in Figure 3.1(b). The specific input and output waveforms for this inverter are shown in detail in Figure 3.3. When the input signal $s_X$ makes a high-to-low transition, the output signal $s_Z$ makes a low-to-high transition (and vice versa). In this specific example, the low-to-high transition of the signal $s_Z$ crosses the 50% signal level after the high-to-low transition of the signal $s_X$. Therefore, the signal delay $t_{PLH}$ (the signal name index is omitted for clarity) is positive as shown by the direction of the arrow in Figure 3.3, which coincides with the positive direction of the x-axis. However, when the input signal $s_X$ makes a low-to-high transition, the output signal $s_Z$ makes a faster high-to-low transition and crosses the 50% signal level before the input signal $s_X$ crosses the 50% signal level. The signal delay $t_{PHL}$ in this case is negative as shown by the direction of the arrow in Figure 3.3, which coincides with the negative direction of the x-axis. This phenomenon can occur in circuits with slow input signal transitions and fast output signal transitions, demonstrating a weakness in the 50% delay definition commonly used today throughout industry.

² Also, a gate may have asymmetric signal paths, whereby a gate would switch faster in one direction than in the other direction.

[Figure: waveforms $s_X$ and $s_Z$ versus time with the 10%, 50%, and 90% levels marked; the transitions $t_{fX}$, $t_{rZ}$, $t_{rX}$, and $t_{fZ}$ are labeled, with $t_{PLH_{XZ}} > 0$ and $t_{PHL_{XZ}} < 0$.]
Fig. 3.3. Signal waveforms for the inverter in the circuit shown in Figure 3.1(b).

The possible asymmetry of the switching characteristics of a logic gate, as illustrated by the waveforms shown in Figure 3.3, requires the ability to discriminate between the values of the propagation delay in the two different switching situations (a low-to-high or a high-to-low transition). One single value of the propagation delay $t_{PD}$, as defined in Definition 3.1, does not provide sufficient information about possible asymmetry in the switching characteristics of a logic gate. Therefore, the concept of delay is extended to include this missing information. Specifically, the direction of the output waveform (since the output of a gate is typically the evaluation node) is included in the definition of delay, thereby permitting the evaluation of the gate switching speed to account for the effects of the output signal transition:

Definition 3.2. The signal propagation delays $t_{PLH_{XY}}$ and $t_{PHL_{XY}}$, respectively, denote the signal delay from input X to output Y (as defined in Definition 3.1) where the output signal (at point Y) transitions from low to high and from high to low, respectively (the low-to-high and high-to-low transitions).

It is important to consider both $t_{PLH}$ and $t_{PHL}$ during circuit analysis and design. However, if only a single value of $t_{PD}$ is specified, $t_{PD}$ usually refers to the arithmetic average, $(t_{PLH} + t_{PHL})/2$. While Definition 3.2 specifies the time between switching events, it does not convey any information about the transition time of the events themselves. This transition time is finite and is characterized by the two parameters described in the following definition:

Definition 3.3. For a signal making a transition between two different logic states, the transition time is defined as the time interval between the 10% point and the 90% point of the signal. For a low-to-high transition, the rise transition time is $t_r = t_{90\%} - t_{10\%}$. For a high-to-low transition, the fall transition time is $t_f = t_{10\%} - t_{90\%}$.

The parameters defined in Definition 3.3 are illustrated in Figures 2.3 and 2.4 where the fall time $t_{fA}$ and the rise time $t_{rB}$ for the signals $s_A$ and $s_B$, respectively, are indicated. As $t_r$ and $t_f$ are related to the slope of the signal transitions, the transition times also affect the values of $t_{PLH}$ and $t_{PHL}$, respectively.
In Figure 3.2, for example, note that if the signal $s_Y$ had been slower (a longer fall time $t_{fY}$), $s_Y$ would have crossed the 50% level at a later time, effectively increasing the propagation delay $t_{PHL_{XY}}$. However, as illustrated in Figures 2.3 and 2.4, it is possible for the 50%-to-50% delay to remain nearly the same, although the signal slope may change significantly [note the rise time $t_{rB}$ in Figures 2.3 and 2.4].
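Definitions 3.1 through 3.3 can be applied directly to sampled waveforms. The following is a minimal sketch, assuming waveforms given as lists of (time, voltage) samples with linear interpolation between samples; the function names and the piecewise-linear test waveforms are hypothetical and serve only as an illustration.

```python
def crossing_time(times, values, level, rising):
    """Linearly interpolate the first time the sampled waveform crosses `level`."""
    for i in range(1, len(times)):
        v0, v1 = values[i - 1], values[i]
        crossed = (v0 < level <= v1) if rising else (v0 > level >= v1)
        if crossed:
            frac = (level - v0) / (v1 - v0)
            return times[i - 1] + frac * (times[i] - times[i - 1])
    raise ValueError("waveform never crosses the requested level")


def propagation_delay(times, v_in, in_rising, v_out, out_rising, vdd):
    """50%-to-50% delay of Definition 3.1 (negative if the output crosses first)."""
    t_out = crossing_time(times, v_out, 0.5 * vdd, out_rising)
    t_in = crossing_time(times, v_in, 0.5 * vdd, in_rising)
    return t_out - t_in


def transition_time(times, values, rising, vdd):
    """Rise time t90 - t10 or fall time t10 - t90 of Definition 3.3."""
    t10 = crossing_time(times, values, 0.1 * vdd, rising)
    t90 = crossing_time(times, values, 0.9 * vdd, rising)
    return t90 - t10 if rising else t10 - t90


# Hypothetical inverter waveforms (arbitrary time units, Vdd = 1).
VDD = 1.0
t = [float(k) for k in range(11)]
s_x = [1.0, 1.0, 0.75, 0.5, 0.25, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0]  # falling input
s_z = [0.0, 0.0, 0.0, 0.0, 0.5, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0]    # rising output

t_plh = propagation_delay(t, s_x, False, s_z, True, VDD)
t_f_in = transition_time(t, s_x, False, VDD)
t_r_out = transition_time(t, s_z, True, VDD)
print("t_PLH:", t_plh, "input fall time:", t_f_in, "output rise time:", t_r_out)
```

Because the delay is a difference of two crossing times, the same function returns a negative value whenever the output reaches the 50% level before the input, which is the situation discussed around Figure 3.3.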
3.2 Devices and Interconnections

The technology of choice for most modern high performance digital integrated circuits is based on the MOSFET³ transistor structure. The primary reasons for the wide application of MOSFETs are, among other things, high packing density and, in its complementary form, low power dissipation. In this section, the properties of both active devices and interconnections are discussed from the perspective of circuit performance.

³ MOSFET ≡ Metal-Oxide-Semiconductor Field Effect Transistor

An N-channel enhancement mode MOSFET transistor (NMOS) is depicted in Figure 3.4.

[Figure: NMOS transistor symbol with gate, drain, source, and base (substrate) terminals at voltages Vg, Vd, Vs, and Vb; the voltages Vgs, Vgd, and Vds and the currents Idd and Iss are marked.]
Fig. 3.4. An N-channel enhancement mode MOS transistor.

Note that in most digital applications, the substrate is usually connected to the source, i.e., $V_s = V_b$ and $V_{sb} = 0$. Therefore, the four-terminal transistor depicted in Figure 3.4 can be considered as a three-terminal device with the voltages $V_s$, $V_g$, and $V_d$ controlling the operation of the transistor. Assuming no substrate current, $I_{dd} = I_{ss}$, and both currents are usually referred to simply as $I_{ds}$. In the following discussion, the additional indices n and p are used to indicate which type of transistor is being considered, N-channel or P-channel, respectively. To first order, the drain current $I_{dsn}$ through a long-channel NMOS transistor⁴ can be modeled by the classical Shichman-Hodges set of equations [22]:

$$
I_{dsn} =
\begin{cases}
\beta_n \left[ (V_{gsn} - V_{tn}) V_{dsn} - \frac{1}{2} V_{dsn}^2 \right], & V_{gsn} \ge V_{tn} \text{ and } V_{gdn} \ge V_{tn} \quad \text{(triode or linear region)} \\
\frac{1}{2} \beta_n (V_{gsn} - V_{tn})^2, & V_{gsn} \ge V_{tn} \text{ and } V_{gdn} \le V_{tn} \quad \text{(pentode or saturation region)} \\
0, & V_{gsn} \le V_{tn} \quad \text{(cutoff region).}
\end{cases}
\tag{3.1}
$$

⁴ Derivation of the PMOS I-V equations is straightforward by accounting for the changes in voltage and current directions.
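The three regions of (3.1) map naturally onto a small piecewise function. The sketch below is one possible transcription, assuming a non-negative drain-to-source voltage; the function name and the numerical values of the gain factor and threshold voltage are hypothetical.

```python
def ids_nmos(vgs, vds, beta, vt):
    """Long-channel NMOS drain current following (3.1); assumes vds >= 0."""
    if vgs <= vt:
        return 0.0                                          # cutoff region
    vgd = vgs - vds
    if vgd >= vt:
        return beta * ((vgs - vt) * vds - 0.5 * vds ** 2)   # triode (linear) region
    return 0.5 * beta * (vgs - vt) ** 2                     # saturation region


# Hypothetical device parameters, chosen only for illustration.
BETA_N = 1e-4   # gain factor in A/V^2
VTN = 0.5       # threshold voltage in V

i_cutoff = ids_nmos(0.3, 1.0, BETA_N, VTN)
i_triode = ids_nmos(1.5, 0.2, BETA_N, VTN)
i_sat = ids_nmos(1.5, 2.0, BETA_N, VTN)
i_edge = ids_nmos(1.5, 1.0, BETA_N, VTN)   # boundary: vds = vgs - vt
print(i_cutoff, i_triode, i_sat, i_edge)
```

At the boundary between the triode and saturation regions both expressions give the same current, so the model is continuous in the drain-to-source voltage.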
In (3.1), the parameter $\beta_n$ is a device parameter commonly called the gain factor or the current gain of the transistor. The dimension of $\beta_n$ is [A/V$^2$]. The current gain $\beta_n$ is

$$\beta_n = K_n \frac{W_n}{L_n}, \tag{3.2}$$

where $K_n$ is the process transconductance parameter and $W_n$ and $L_n$ are the width and length of the transistor channel, respectively. The process transconductance $K_n$ is

$$K_n = \mu_n C_{ox} = \mu_n \frac{\epsilon_{ox}}{t_{ox}}, \tag{3.3}$$

where $\mu_n$ is the carrier mobility, $C_{ox}$ is the gate capacitance per unit area, $\epsilon_{ox}$ is the permittivity of the gate oxide material (a relative dielectric constant of 3.9 for SiO$_2$) and $t_{ox}$ is the gate oxide thickness. By substituting the index p for the index n in (3.1), (3.2), and (3.3), analogous expressions for $\beta_p$ and $K_p$ of a P-channel enhancement mode MOSFET transistor can be developed [16, 8, 10, 15]. Also note that the threshold voltage $V_{tn}$ of an enhancement-mode N-channel transistor is positive ($V_{tn} > 0$), while the threshold voltage $V_{tp}$ of an enhancement-mode P-channel transistor is negative ($V_{tp} < 0$). Equation (3.1) and the counterpart for a P-channel MOS device are fundamental to both static and dynamic circuit analysis. Static or DC analysis refers to evaluating the circuit bias conditions in which the control voltages, $V_g$, $V_d$, and $V_s$, remain constant. Dynamic analysis is attractive from a signal delay perspective since it deals with voltage and current waveforms changing with time. An important goal of dynamic analysis is to determine the timing relationships among the transistor terminals. Specifically, the voltages at these terminals are the signal representations of the data being processed. By performing a dynamic analysis, the signal delay from an input waveform to the corresponding output waveform can be evaluated at high levels of accuracy. Complementary MOS logic or CMOS logic is the most popular circuit style for most modern high performance digital integrated circuits.
An analysis of a simple CMOS logic gate is presented in Section 3.2.1 for one of the simplest CMOS gates, the CMOS inverter shown in Figure 3.5. Performing such a simple analysis illustrates the process for estimating circuit performance and provides insight into which factors affect the timing characteristics of a logic gate and how.

3.2.1 Analytical Delay Analysis

Consider the CMOS inverter circuit consisting of a PMOS device $Q_1$ and an NMOS device $Q_2$ as shown in Figure 3.5. For this analysis, assume that the capacitive load of the inverter, consisting of the device capacitances, interconnect capacitances and the load capacitance of the following stage, can be lumped into a single capacitor $C_L$. The output voltage $V_o = V_{C_L}$ is the voltage across the capacitive load and the terminal voltages of the transistors are listed in Table 3.1. The regions of operation for the devices, $Q_1$ and $Q_2$, are illustrated in Figure 3.6 depending upon the values of $V_i$ and $V_o$. Referring to Figure 3.6 may be helpful in understanding the switching process of a CMOS inverter. Methods for determining the values of the fall time $t_f$ and the propagation delay $t_{PHL}$ are described in this section. Similarly, closed form expressions are derived for the rise time $t_r$ and the propagation delay $t_{PLH}$.

[Figure: PMOS device Q1 connected to Vdd and NMOS device Q2 share the input Vi(t) and drive the output Vo(t) across the load; the drain currents Idsp and Idsn are marked.]
Fig. 3.5. A basic CMOS inverter logic gate.

Table 3.1. Terminal voltages for the P-channel and N-channel transistors in a CMOS inverter circuit.

        Q1 (PMOS)             Q2 (NMOS)
Vgs     Vgsp = Vi − Vdd       Vgsn = Vi
Vgd     Vgdp = Vi − Vo        Vgdn = Vi − Vo
Vds     Vdsp = Vo − Vdd       Vdsn = Vo

[Figure: the (Vi, Vo) plane divided into regions I to VII, each labeled with the operating modes (cutoff, linear, or sat) of Q1 and Q2; the points A, B, C, D, E, and F and the levels Vtn, Vdd + Vtp, Vdd, −Vtp, and Vdd − Vtn are marked on the axes.]
Fig. 3.6. Operating mode of a CMOS inverter depending upon the input and output voltages. (Note that the abbreviation ‘sat’ stands for the saturation region.)

Derivation of the Fall Time

The transition process used to derive $t_f$ and $t_{PHL}$ is illustrated in Figure 3.7. Assume that the input signal $V_i$ has been held at logic low ($V_i = 0$) for a sufficiently long time such that the capacitor $C_L$ is fully charged to the value of $V_{dd}$. The operating point of the inverter is then point A depicted in Figures 3.6 and 3.8. At time $t_0 = 0$, the input signal abruptly switches to a logic high. The capacitor $C_L$ cannot discharge instantaneously, thereby forcing the operating point of the circuit to point B, $(V_i, V_o) = (V_{dd}, V_{dd})$. At B, the device $Q_1$ is cut off while $Q_2$ is conducting, thereby permitting $C_L$ to begin discharging through $Q_2$. As this discharge process develops, the operating point moves down the line BD, approaching point D when $C_L$ is fully discharged, i.e., $V_o(D) = 0$. Observe that during the interval $0 \le t < t_2$, the operating point is between B and C and the device $Q_2$ operates in the saturation region. At time $t_2$, the capacitor is discharged to $V_{dd} - V_{tn}$ and $Q_2$ begins to operate in the linear region. For $t \ge t_2$, the device $Q_2$ is in the linear region. If $0.1V_{dd} < V_{tn} < 0.5V_{dd}$ (as is typical), then $t_1 < t_2 < t_3$ as shown in Figure 3.7. Therefore, the fall time is $t_f = t_4 - t_1$ and the propagation delay is $t_{PHL} = t_3 - 0 = t_3$. To determine the values of $t_f$ and $t_{PHL}$, the output waveform $V_o(t)$ must be evaluated for each of the intervals $[t_0, t_2)$ and $[t_2, \infty)$. For $t_0 \le t < t_2$, the current discharging the capacitor, $I_{dsn}$, shown in Figure 3.5, is

$$I_{dsn} = \frac{1}{2} \beta_n (V_{dd} - V_{tn})^2 = -C_L \frac{dV_o}{dt}. \tag{3.4}$$

Substituting

$$\eta = \frac{V_{tn}}{V_{dd}} \quad \text{and} \quad \gamma_n = \frac{\beta_n V_{dd} (1 - \eta)}{C_L}, \tag{3.5}$$

and solving (3.4) for $V_o$ with the initial condition $V_o(0) = V_{dd}$ yields $V_o(t)$ for $t_0 \le t < t_2$,

$$V_o(t) = V_{dd} - \frac{\beta_n}{2 C_L} (V_{dd} - V_{tn})^2 \, t = V_{dd} \left[ 1 - \frac{\gamma_n}{2} (1 - \eta) \, t \right]. \tag{3.6}$$

Fig. 3.7. High-to-low output transition for a step input signal.

[Figure: two copies of the operating-region diagram of Figure 3.6, with the points A to F and the regions I to VII marked; one shows the operating point trajectory for an ideal step input and the other for a non-ideal (non-step) input.]
Fig. 3.8. Operating point trajectory of a CMOS inverter for different input waveforms (only the rising input signal is shown).

From (3.6) it can be further shown that

$$V_o(t_2) = V_{dd} - V_{tn} \quad \text{for} \quad t_2 = \frac{2 C_L V_{tn}}{\beta_n (V_{dd} - V_{tn})^2} = \frac{2 \eta}{\gamma_n (1 - \eta)}. \tag{3.7}$$

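The step from (3.6) to (3.7) can be spelled out as a short derivation. The following sketch uses only the symbols defined in (3.5) and (3.6):

```latex
\begin{align*}
V_o(t_2) = V_{dd} - V_{tn}
  &\;\Longrightarrow\; V_{dd}\left[1 - \frac{\gamma_n}{2}(1-\eta)\,t_2\right] = V_{dd}(1-\eta) \\
  &\;\Longrightarrow\; \frac{\gamma_n}{2}(1-\eta)\,t_2 = \eta \\
  &\;\Longrightarrow\; t_2 = \frac{2\eta}{\gamma_n(1-\eta)}
     = \frac{2\eta\,C_L}{\beta_n V_{dd}(1-\eta)^2}
     = \frac{2 C_L V_{tn}}{\beta_n (V_{dd}-V_{tn})^2}.
\end{align*}
```

The last equality follows from multiplying the numerator and denominator by $V_{dd}$ and using $\eta V_{dd} = V_{tn}$.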
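The interval $t \ge t_2$ is considered next. The device $Q_2$ operates in the linear region, where $I_{dsn}$ is

$$I_{dsn} = \beta_n \left[ (V_{dd} - V_{tn}) V_o - \frac{1}{2} V_o^2 \right] = -C_L \frac{dV_o}{dt}. \tag{3.8}$$

A closed form expression for the output voltage $V_o(t)$ for time $t \ge t_2$ is obtained by solving (3.8), a Bernoulli equation, with the initial condition $V_o(t_2) = V_{dd} - V_{tn}$:

$$\text{for } t \ge t_2, \quad V_o(t) = V_{dd} \, \frac{2 (1 - \eta)}{1 + e^{\gamma_n (t - t_2)}}. \tag{3.9}$$

The values of $t_1$ from (3.6) and $t_3$ and $t_4$ from (3.9) are ([10, 15, 23])

$$
\begin{aligned}
t_1 &= \frac{1}{\gamma_n} \, \frac{0.2}{1 - \eta}, \\
t_3 &= \frac{1}{\gamma_n} \left[ \frac{2 \eta}{1 - \eta} + \ln(3 - 4 \eta) \right], \\
\text{and} \quad t_4 &= \frac{1}{\gamma_n} \left[ \frac{2 \eta}{1 - \eta} + \ln(19 - 20 \eta) \right].
\end{aligned}
\tag{3.10}
$$

The fall time $t_f$ is ([10, 15, 23])

$$t_f = t_4 - t_1 = \frac{2 C_L}{\beta_n V_{dd} (1 - \eta)} \left[ \frac{\eta - 0.1}{1 - \eta} + \frac{1}{2} \ln(19 - 20 \eta) \right], \tag{3.11}$$

and the propagation delay $t_{PHL}$ is ([10, 15, 23])

$$t_{PHL} = t_3 - 0 = t_3 = \frac{C_L}{\beta_n V_{dd} (1 - \eta)} \left[ \frac{2 \eta}{1 - \eta} + \ln(3 - 4 \eta) \right]. \tag{3.12}$$
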
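As a sanity check on (3.10) through (3.12), the closed-form times can be compared with a direct numerical integration of the discharge equation. The sketch below uses a simple forward-Euler integration of $C_L \, dV_o/dt = -I_{dsn}$ for a step input; the function names, the step size, and the device and load values are hypothetical and chosen only for illustration.

```python
import math


def fall_metrics(beta_n, c_load, vdd, vtn):
    """Closed-form t1, t3, t4, t_f and t_PHL from (3.5) and (3.10)-(3.12)."""
    eta = vtn / vdd
    gamma_n = beta_n * vdd * (1.0 - eta) / c_load
    t1 = 0.2 / (gamma_n * (1.0 - eta))
    t3 = (2.0 * eta / (1.0 - eta) + math.log(3.0 - 4.0 * eta)) / gamma_n
    t4 = (2.0 * eta / (1.0 - eta) + math.log(19.0 - 20.0 * eta)) / gamma_n
    return {"t1": t1, "t3": t3, "t4": t4, "t_f": t4 - t1, "t_PHL": t3}


def simulate_fall(beta_n, c_load, vdd, vtn, dt):
    """Forward-Euler discharge of C_L through Q2 with Vi = Vdd for t >= 0.

    Returns the times at which Vo crosses 90%, 50% and 10% of Vdd.
    """
    vo, t = vdd, 0.0
    targets = [0.9 * vdd, 0.5 * vdd, 0.1 * vdd]
    crossings = []
    while targets:
        if vo >= vdd - vtn:      # Q2 in saturation, as in (3.4)
            i_dsn = 0.5 * beta_n * (vdd - vtn) ** 2
        else:                    # Q2 in the linear region, as in (3.8)
            i_dsn = beta_n * ((vdd - vtn) * vo - 0.5 * vo ** 2)
        vo_next = vo - i_dsn * dt / c_load
        while targets and vo_next <= targets[0]:
            frac = (vo - targets[0]) / (vo - vo_next)
            crossings.append(t + frac * dt)
            targets.pop(0)
        vo, t = vo_next, t + dt
    return crossings


# Hypothetical device and load values.
BETA_N, C_LOAD, VDD, VTN = 1e-4, 1e-13, 2.5, 0.5

closed = fall_metrics(BETA_N, C_LOAD, VDD, VTN)
t1_sim, t3_sim, t4_sim = simulate_fall(BETA_N, C_LOAD, VDD, VTN, dt=1e-13)
print("closed form:", closed)
print("simulated  :", {"t1": t1_sim, "t3": t3_sim, "t4": t4_sim})
```

With a sufficiently small time step, the simulated crossing times should approach the closed-form values, which gives some confidence that the reconstructed expressions are mutually consistent.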
Derivation of the Rise Time

The rise time $t_r$ and propagation delay $t_{PLH}$ are determined from the switching process illustrated in Figure 3.9 (similarly to $t_f$ and $t_{PHL}$ derived earlier in this section). Assume that the input signal $V_i$ has been held at logic high ($V_i = V_{dd}$) for a sufficiently long time such that the capacitor $C_L$ is fully discharged to $V_o = 0$. The operating point of the inverter is point D shown in Figures 3.6 and 3.8. At time $t_0 = 0$, the input signal abruptly switches to a logic low. Since the voltage on $C_L$ cannot change instantaneously, the operating point is forced to point E. At E, the device $Q_2$ is cut off while $Q_1$ is conducting, thereby permitting $C_L$ to begin charging through $Q_1$. As this charging process develops, the operating point moves up the line EA towards point A, at which point $C_L$ is fully charged, i.e., $V_o(A) = V_{dd}$. Note that during the interval $0 \le t < t_2$, the operating point is between E and F and the device $Q_1$ operates in the saturation region. At time $t_2$, the capacitor is charged to $-V_{tp}$ (recall that $V_{tp} < 0$) and $Q_1$ begins to operate in the linear region. For $t \ge t_2$, the device $Q_1$ is in the linear region. If $0.1V_{dd} < |V_{tp}| < 0.5V_{dd}$ (as is typical), then $t_1 < t_2 < t_3$ as shown in Figure 3.9. Therefore, the rise time is $t_r = t_4 - t_1$ and the propagation delay is $t_{PLH} = t_3 - 0 = t_3$. To determine the values of $t_r$ and $t_{PLH}$, the output waveform $V_o(t)$ must be evaluated for each of the intervals $[t_0, t_2)$ and $[t_2, \infty)$.

[Figure: step input $V_i(t)$, equal to $V_{dd}$ for $t < 0$ and 0 for $t \ge 0$, and the rising output $V_o(t)$ with the levels $0.1V_{dd}$, $-V_{tp}$, $0.5V_{dd}$, and $0.9V_{dd}$ reached at $t_1$, $t_2$, $t_3$, and $t_4$, respectively; $t_r$ and $t_{PLH}$ are marked.]
Fig. 3.9. Low-to-high output transition for a step input signal.

An analysis similar to that described previously in this section for the high-to-low output transition can be performed to derive closed form expressions for $t_1$, $t_3$, and $t_4$ as shown in Figure 3.9. Substituting

$$\pi = -\frac{V_{tp}}{V_{dd}} \quad \text{and} \quad \gamma_p = \frac{\beta_p V_{dd} (1 - \pi)}{C_L}, \tag{3.13}$$

$t_1$, $t_3$, and $t_4$ are

$$
\begin{aligned}
t_1 &= \frac{1}{\gamma_p} \, \frac{0.2}{1 - \pi}, \\
t_3 &= \frac{1}{\gamma_p} \left[ \frac{2 \pi}{1 - \pi} + \ln(3 - 4 \pi) \right], \\
\text{and} \quad t_4 &= \frac{1}{\gamma_p} \left[ \frac{2 \pi}{1 - \pi} + \ln(19 - 20 \pi) \right].
\end{aligned}
\tag{3.14}
$$

Therefore, the rise time $t_r$ is ([10, 15, 23])

$$t_r = t_4 - t_1 = \frac{2 C_L}{\beta_p V_{dd} (1 - \pi)} \left[ \frac{\pi - 0.1}{1 - \pi} + \frac{1}{2} \ln(19 - 20 \pi) \right], \tag{3.15}$$

and the propagation delay $t_{PLH}$ is ([10, 15, 23])

$$t_{PLH} = t_3 - 0 = t_3 = \frac{C_L}{\beta_p V_{dd} (1 - \pi)} \left[ \frac{2 \pi}{1 - \pi} + \ln(3 - 4 \pi) \right]. \tag{3.16}$$

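Expressions (3.15) and (3.16) are straightforward to evaluate numerically. The sketch below is a minimal illustration in which the function name and all device and load values are hypothetical; it also evaluates the same expressions with the load capacitance doubled to show how the results scale.

```python
import math


def rise_metrics(beta_p, c_load, vdd, vtp):
    """Return (t_r, t_PLH) from (3.13), (3.15) and (3.16); vtp must be negative."""
    pi_ratio = -vtp / vdd                                # pi in (3.13)
    scale = c_load / (beta_p * vdd * (1.0 - pi_ratio))   # equals 1 / gamma_p
    t_r = 2.0 * scale * ((pi_ratio - 0.1) / (1.0 - pi_ratio)
                         + 0.5 * math.log(19.0 - 20.0 * pi_ratio))
    t_plh = scale * (2.0 * pi_ratio / (1.0 - pi_ratio)
                     + math.log(3.0 - 4.0 * pi_ratio))
    return t_r, t_plh


# Hypothetical device and load values, chosen only for illustration.
BETA_P = 5e-5     # A/V^2
C_LOAD = 1e-13    # F
VDD = 2.5         # V
VTP = -0.5        # V

t_r, t_plh = rise_metrics(BETA_P, C_LOAD, VDD, VTP)
t_r_2x, t_plh_2x = rise_metrics(BETA_P, 2.0 * C_LOAD, VDD, VTP)
print(f"t_r = {t_r:.3e} s, t_PLH = {t_plh:.3e} s")
print(f"doubled load: t_r = {t_r_2x:.3e} s, t_PLH = {t_plh_2x:.3e} s")
```

Both quantities are proportional to the ratio of the load capacitance to the gain factor, so doubling the load doubles the rise time and the propagation delay in this model.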
Several observations can be made by analyzing the expressions derived in this section for tr , tf , tP HL , and tP LH . These observations are provided in the following subsections. First, the factors which affect the inverter delays are analyzed in Section 3.2.2. Following this analysis, the related waveform effects are considered in Section 3.2.3 and the effects of short-channel devices in submicrometer technologies are described in Section 3.2.4. 3.2.2 Controlling the Delay Note that in (3.11) and (3.15), the fall and rise times, respectively, are the product of the term CL /β, and another process dependent term (a function composed solely of Vdd and Vt ). These relationships imply that for a given manufacturing process, improvements in the individual gate delays are possible by reducing the load impedance CL or by increasing the current gain of the transistors. Increasing the current gain (higher β) is possible either by utilizing a more advanced technology or by controlling certain physical qualities of the transistor (the specific physical layout). In the latter case, increasing β of the devices (recall that β ∝ W/L) is typically accomplished by controlling the value of W —a process known as transistor or gate sizing5 [24, 25, 26]. Transistor sizing, however, has limits—area requirements may limit the maximum channel width W, and increasing W will also increase the input load capacitance of the previous gates. 3.2.3 Waveform Effects The ideal step input waveform used in the derivation of the delay expressions presented in Section 3.2.1 is a physical abstraction. Such an ideal waveform does not practically exist, although it can be used to simplify the analysis presented in Section 3.2.1. Note that despite ideally fast input waveforms, the output signal of a CMOS logic gate has a finite slope, thereby contributing to the gate delay. 
In a practical VLSI integrated circuit, both the input and output signals have a non-zero rise and fall time caused by the impedances along any signal path. Fast input waveforms can be effectively considered as 5
Typically, device channel length is chosen to be the minimum geometry permitted by the technology and therefore cannot be decreased to further increase β.
32
3 Signal Delay in VLSI Systems
step inputs. The delay expressions derived in (3.11) and (3.15) model the delays for such cases with reasonable accuracy. Slow input waveforms, however, contribute significantly to the overall delay of the charge/discharge path in a gate [8, 10, 15, 23], making the delay expressions presented in Section 3.2.1 less accurate. Furthermore, it is considerably more difficult to derive closed form delay expressions for non-step input waveforms.

Consider, for example, the derivation of the fall time of the inverter shown in Figure 3.5 assuming a non-ideal input, such as the linear ramp signal sA depicted in Figure 2.3. Referring to Figure 3.8, the trajectory of the operating point relating Vi and Vo for a non-ideal (non-step) input is as shown in the diagram on the right. This trajectory is a curve passing through regions I, II, III, and IV (see footnote 6), and down the line C → C → D, rather than the two straight-line segments A → B and B → C → D (as shown in the diagram on the left). Therefore, calculating an exact expression for tf in this case requires separately evaluating the delay for all five portions of the output Vo, one for each region. An analysis of the CMOS inverter shown in Figure 3.5 with a non-step input signal, as well as the respective delay expressions, can be found in [23].

Consider, for example, a linear ramp input described by

$$ V_i(t) = \begin{cases} 0, & t < 0 \\ \dfrac{t}{t_{ri}} V_{dd}, & 0 \le t < t_{ri} \\ V_{dd}, & t \ge t_{ri} \end{cases} \tag{3.17} $$

where tri is the rise time of the input voltage signal Vi(t). For the case depicted in the upper diagram shown in Figure 3.8, the total propagation delay tPHLramp at the 50% level [23] is given by

$$ t_{PHL_{ramp}} = \frac{1}{6} (1 + 2\eta)\, t_{ri} + t_{PHL_{step}}, \tag{3.18} $$
where tPHLstep is the propagation delay time for a step input given by (3.12). Note that the ramp input described by (3.17) is also an idealization intended to simplify analysis. In a practical integrated circuit, the input waveform to the inverter is not a linear ramp but rather the output waveform of another gate within the circuit. For such an input signal, also known as a characteristic input [23], it is preferable to regard the propagation delay through the inverter gate shown in Figure 3.5 as a function of the CL/β ratio of the preceding gate or, equivalently, as a function of the step response delay of the preceding stage [23].

This type of direct analytical solution, in which the output waveform is broken into regions depending upon the trajectory of the operating point, is further complicated for those gates with more than one input arriving at an arbitrary time and with arbitrary waveforms. Due to the growing complexity of such an analytical solution, it is imperative that alternative methods for delay calculation be developed for practical use.

Footnote 6: I, II, III, IV, and V for slower input signals.

Non-ideal input waveforms also have implications for the power dissipation of individual logic gates, and therefore of the entire circuit. Observe that in regions II, IV, and VI, shown in Figure 3.6, both devices simultaneously conduct, thereby creating a temporary direct path for the current to flow from Vdd to ground. The short-circuit current in this direct current path is only weakly related to the output voltage of the gate and adds to the total power dissipation. This added power component is known as short-circuit power [26, 27, 28]. The short-circuit power can be a substantial fraction of the total transient power dissipation of a circuit and has become a severe obstacle to satisfying a maximum power budget. Faster waveforms throughout the circuit generally mean less time is spent switching within regions II, IV, and VI, and therefore decreased short-circuit current and power.

3.2.4 Short-Channel Effects
The active device model, (3.1), used in the analyses described in Section 3.2.1, is accurate for long-channel devices. As technology is scaled down into the deep submicrometer range, a variety of physical phenomena develop that require improved device models in order to preserve accuracy. In this section, certain important effects, known as short-channel effects, are described in terms of their effect on propagation delay.

Channel-Length Modulation
A MOSFET device modeled by (3.1) has an infinite output resistance in saturation and acts as a voltage-controlled current source. Recall the linear portion of the falling/rising output waveforms from the analyses described in Section 3.2.1. The device acts as a current source since the drain current Idsn is completely independent of the voltage Vdsn in the saturation region [see (3.1)].
This independence, however, is an idealization that does not consider the effect of the voltage Vdsn on the shape of the channel. In practice, as Vdsn increases beyond Vgsn − Vtn (such that Vgdn < Vtn or Vgdp > Vtp for a PMOS device), the channel pinch-off point moves towards the source. Therefore, due to an effect known as channel-length modulation, the effective channel length is reduced [10, 15, 22, 29]. To analytically account for channel-length modulation, the expression for the current of a MOS transistor operating in the saturation region is modified as follows:

$$ I_{dsn} = \frac{1}{2} \beta_n (V_{gsn} - V_{tn})^2 (1 + \lambda_n V_{dsn}). \tag{3.19} $$

The additional factor (1 + λn Vdsn) in (3.19) describes the finite device output resistance $\partial V_{dsn} / \partial I_{dsn} = 2 (V_{gsn} - V_{tn})^{-2} / (\lambda_n \beta_n)$ when the transistor operates in the saturation region. The output waveform deteriorates due to the degradation of the transfer characteristic of the inverter.
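As a numerical illustration of (3.19), the short Python sketch below evaluates the saturation current with channel-length modulation together with the corresponding output resistance. The function names and the parameter values (βn, Vtn, λn) are hypothetical and chosen only for illustration; they do not describe any particular technology.

```python
def saturation_current(beta_n, v_gsn, v_tn, v_dsn, lambda_n):
    """Drain current in saturation with channel-length modulation, as in (3.19)."""
    return 0.5 * beta_n * (v_gsn - v_tn) ** 2 * (1.0 + lambda_n * v_dsn)


def saturation_output_resistance(beta_n, v_gsn, v_tn, lambda_n):
    """Output resistance dVdsn/dIdsn implied by (3.19)."""
    return 2.0 / (lambda_n * beta_n * (v_gsn - v_tn) ** 2)


# Illustrative (assumed) values: beta in A/V^2, voltages in V, lambda in 1/V.
BETA_N, V_GSN, V_TN, LAMBDA_N = 1.0e-4, 2.5, 0.5, 0.05

i_low = saturation_current(BETA_N, V_GSN, V_TN, 2.0, LAMBDA_N)
i_high = saturation_current(BETA_N, V_GSN, V_TN, 2.5, LAMBDA_N)
r_out = saturation_output_resistance(BETA_N, V_GSN, V_TN, LAMBDA_N)

# (3.19) is linear in Vdsn, so the slope of the I-V curve in saturation
# is constant and should equal 1 / r_out.
slope = (i_high - i_low) / (2.5 - 2.0)
print(f"Idsn(2.0 V) = {i_low:.3e} A, Idsn(2.5 V) = {i_high:.3e} A")
print(f"r_out = {r_out:.3e} ohm, 1/slope = {1.0 / slope:.3e} ohm")
```

With λn set to zero the current no longer depends on Vdsn, which recovers the ideal current-source behavior of the long-channel model.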
Velocity Saturation
In a long-channel transistor, the drift velocity of the carriers in the channel is proportional to both the carrier mobility and the lateral electric field in the channel (parallel to the source-drain path). In short-channel devices, however, the velocity of the carriers eventually saturates to some value vsat for a specific value of the voltage Vds within the operating range of the circuit. This velocity saturation phenomenon is due to the power supply voltage not being scaled down as quickly as the device dimensions, creating a high electric field within the device.

The saturation in the carrier velocity for high electric field strengths, caused by the high voltage Vds applied over a short channel, causes a reduction in both the device transconductance [see (3.3)] and the current gain of a saturated device. This reduction in the current gain β has a direct effect on the ability of the devices to drive a specific load, resulting in increased delay times. Recall that the propagation delays described in (3.11), (3.12), (3.15) and (3.16) are inversely proportional to β.

A more realistic device model for DSM devices, known as the α-power law model, has been developed in [30] to include the carrier velocity saturation effect in the submicrometer device I-V model (see footnote 7). If $I'_{D0_n}$ and $V'_{D0_n}$ are given by

$$ I'_{D0_n} = I_{D0} \left( \frac{V_{gsn} - V_{tn}}{V_{dd} - V_{tn}} \right)^{\alpha}, \qquad V'_{D0_n} = V_{D0} \left( \frac{V_{gsn} - V_{tn}}{V_{dd} - V_{tn}} \right)^{\alpha/2}, \tag{3.20} $$

then the drain current Idsn of the MOS transistor is

$$ I_{dsn} = \begin{cases} I'_{D0_n}, & V_{gsn} \ge V_{tn} \text{ and } V_{dsn} \ge V'_{D0_n} \quad \text{(pentode or saturation region)} \\[4pt] \dfrac{I'_{D0_n}}{V'_{D0_n}}\, V_{dsn}, & V_{gsn} \ge V_{tn} \text{ and } V_{dsn} < V'_{D0_n} \quad \text{(triode or linear region)} \\[4pt] 0, & V_{gsn} \le V_{tn} \quad \text{(cutoff region).} \end{cases} \tag{3.21} $$
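The piecewise form of (3.20) and (3.21) maps directly onto a small function. The Python sketch below is a minimal illustration under stated assumptions: the function name, argument names, and numerical values are hypothetical, and the primed quantities of (3.20) are written as `i_d0_prime` and `v_d0_prime`.

```python
def alpha_power_drain_current(v_gsn, v_dsn, v_tn, v_dd, i_d0, v_d0, alpha):
    """Drain current of an NMOS device under the alpha-power law model."""
    # Cutoff region. Handling Vgsn == Vtn here also avoids a division by
    # zero below, since both primed quantities vanish at that point.
    if v_gsn <= v_tn:
        return 0.0

    # Gate-overdrive ratio shared by both expressions in (3.20).
    ratio = (v_gsn - v_tn) / (v_dd - v_tn)
    i_d0_prime = i_d0 * ratio ** alpha
    v_d0_prime = v_d0 * ratio ** (alpha / 2.0)

    # Pentode (saturation) region of (3.21).
    if v_dsn >= v_d0_prime:
        return i_d0_prime
    # Triode (linear) region of (3.21).
    return i_d0_prime * v_dsn / v_d0_prime


# Illustrative (assumed) device: Vdd = 2.5 V, Vtn = 0.5 V,
# ID0 = 1 mA, VD0 = 1 V, alpha = 1.3.
PARAMS = dict(v_tn=0.5, v_dd=2.5, i_d0=1.0e-3, v_d0=1.0, alpha=1.3)

i_full_on = alpha_power_drain_current(2.5, 2.5, **PARAMS)   # saturation
i_triode = alpha_power_drain_current(2.5, 0.5, **PARAMS)    # linear region
i_cutoff = alpha_power_drain_current(0.3, 2.5, **PARAMS)    # cutoff
print(i_full_on, i_triode, i_cutoff)
```

At Vgsn = Vdsn = Vdd the overdrive ratio is one, so the function returns ID0, which matches the role of ID0 as the drain saturation current at full gate and drain bias.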
In (3.20) and (3.21), α is the velocity saturation index, VD0 is the drain saturation voltage for Vgsn = Vdd, and ID0 is the drain saturation current for Vgsn = Vdsn = Vdd. A typical value of the velocity saturation index for a short-channel device is 1 ≤ α ≤ 2, where (3.21) is the same as (3.1) for α = 2.

Footnote 7: Short-channel MOS devices in general.

Analytical solutions for the output voltage of a CMOS inverter with a purely capacitive load CL for step, linear ramp and exponential input waveforms can be found in [31]. Closed form expressions for the delay of a CMOS inverter as shown in Figure 3.5 under the α-power law model are given in [30] and are repeated below:

$$ \eta = \frac{V_{tn}}{V_{dd}} \quad \text{and} \quad t_{PHL} = t_{PLH} = \left( \frac{1}{2} - \frac{1 - \eta}{1 + \alpha} \right) t_T + \frac{C_L V_{dd}}{2 I_{D0}}. \tag{3.22} $$

The propagation delay described by (3.22) can be applied to non-ideal input waveforms and consists of two terms. The first term reflects the effect on the gate delay of the input waveform shape and is proportional to the input waveform transition time tT. The second term reflects the dependence of the delay on the gate load, similarly to the CL/β term included in (3.12) and (3.16).

3.2.5 The Importance of Interconnections
The analysis of the CMOS gate delay as described in Section 3.2.1 is based on the assumption that the load of the inverter shown in Figure 3.5 is a purely capacitive load (C). This assumption is generally true for logic gates placed physically close to each other. In a multi-million transistor VLSI system, however, certain connected logic gates may be relatively far from each other. In this situation, the impedance of the interconnect wires cannot be considered as being purely capacitive but rather as being resistive-capacitive (RC). An important type of global circuit interconnect structure where the gates can be very far apart is the clock distribution network [9, 32].

On-chip interconnect has become a major concern due to the high resistance of the interconnect, which can limit overall circuit performance. These interconnect impedances have become significant as the minimum line dimensions have been scaled down into the deep submicrometer regime while the overall chip dimensions have increased. Perhaps the most important consequence of these trends of scaling transistor and interconnect dimensions and increasing chip sizes is that the primary source of signal propagation delay has shifted from the active transistors to the passive interconnect lines. Therefore, the nature of the load impedance has shifted from a lumped capacitance to a distributed resistance-capacitance, thereby requiring new qualitative and quantitative interpretations of the signal switching processes.
To illustrate the effects of scaling, consider ideal scaling [8] where devices are scaled down by a factor of S (S > 1) and chip sizes are scaled up by a factor of SC (SC > 1). The delay of the logic gates decreases by a factor of 1/S while the delay due to the interconnect increases by a factor of $S^2 S_C^2$ [8, 33]. Therefore, the ratio of interconnect delay to gate delay after ideal scaling increases by a factor of $S^3 S_C^2$. For example, if S = 4 (corresponding to scaling down from a 2 μm CMOS technology to a 0.5 μm CMOS technology) and SC = 1.225 (corresponding to the chip area increasing by 50%), the ratio of interconnect delay to gate delay increases by a factor of $4^3 \times 1.225^2 \approx 96$.
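The scaling argument can be checked with a few lines of Python. This is only a sketch of the arithmetic stated above: the function name is hypothetical, and the two scaling factors are simply the gate delay factor 1/S and the interconnect delay factor S²·SC² quoted from [8, 33].

```python
def interconnect_to_gate_delay_growth(s, s_c):
    """Growth of the interconnect-to-gate delay ratio under ideal scaling."""
    gate_delay_factor = 1.0 / s                  # gate delay shrinks by 1/S
    interconnect_delay_factor = s ** 2 * s_c ** 2  # interconnect delay grows
    return interconnect_delay_factor / gate_delay_factor  # equals S^3 * SC^2


# S = 4: from a 2 um to a 0.5 um technology.
# SC = 1.225: chip side grows so that the chip area increases by about 50%.
area_growth = 1.225 ** 2
growth = interconnect_to_gate_delay_growth(4.0, 1.225)
print(f"area growth = {area_growth:.3f}, delay ratio growth = {growth:.2f}")
```

Because SC is a linear dimension, its square (about 1.5) is the chip area growth, so the ratio grows by roughly 64 × 1.5 = 96.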
Delay estimation in RC interconnect
Interconnect delay can be analyzed by considering the CMOS inverter shown in Figure 3.5 with a capacitive load CL representing the accumulated capacitance of the fanout of the inverter. The interconnect connecting the drains of the devices, Q1 and Q2, to the upper terminal of the load is replaced by a distributed RC line with a resistance and capacitance, Rint and Cint, respectively [33]. Closed form expressions for the signal delay of a CMOS inverter with an RC load have been developed by Wilnai [34]. The delay values for both the distributed and lumped forms of the RC load are summarized in Table 3.2. These delay values are obtained assuming a step input driving the CMOS inverter.

Table 3.2. Closed form expressions for the signal delay of the CMOS inverter shown in Figure 3.5 driving an RC load. An ideal step input signal (Vi(t) transitioning from high to low) is assumed.

Output Voltage Range    Signal Delay (Distributed RC)    Signal Delay (Lumped RC)    Note
0 to 90%                1.0RC                            2.3RC
10% to 90%              0.9RC                            2.2RC                       rise time tr
0 to 63%                0.5RC                            1.0RC
0 to 50%                0.4RC                            0.7RC                       delay tPLH
0 to 10%                0.1RC                            0.1RC
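The lumped RC column of Table 3.2 can be cross-checked with a short calculation. The sketch below assumes the usual single-pole step response of a lumped RC load, Vout(t)/Vdd = 1 − exp(−t/RC); this model and the function name are assumptions of the example rather than statements from [34], and the distributed RC column is not reproduced here.

```python
import math


def lumped_rc_time_to_fraction(fraction, rc=1.0):
    """Time for a lumped RC step response to reach `fraction` of Vdd.

    Assumes Vout(t)/Vdd = 1 - exp(-t/RC), so t = -RC * ln(1 - fraction).
    """
    return -rc * math.log(1.0 - fraction)


# Times are expressed in units of RC (rc = 1).
t_10 = lumped_rc_time_to_fraction(0.10)
t_50 = lumped_rc_time_to_fraction(0.50)
t_63 = lumped_rc_time_to_fraction(0.63)
t_90 = lumped_rc_time_to_fraction(0.90)
rise_10_to_90 = t_90 - t_10

for label, value in [("0 to 10%", t_10), ("0 to 50%", t_50),
                     ("0 to 63%", t_63), ("0 to 90%", t_90),
                     ("10% to 90%", rise_10_to_90)]:
    print(f"{label:>10}: {value:.3f} RC")
```

Rounded to one decimal place, these values agree with the lumped RC entries of the table.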
The delay values listed in Table 3.2 are graphically illustrated in Figure 3.10 [34]. Two waveforms describing the output of a CMOS inverter (shown in Figure 3.5) for an input signal making a high-to-low transition are shown in Figure 3.10. These two waveforms are based on the assumption that the RC load of the CMOS inverter is distributed and lumped, respectively. Furthermore, assuming an on-resistance Rtr of the driving transistor [33], the interconnect delay Tintc can be characterized by the following expression [34],

$$ \begin{aligned} T_{intc} &= R_{int} C_{int} + 2.3 \left( R_{tr} C_{int} + R_{tr} C_L + R_{int} C_L \right) && (3.23) \\ &\approx \left( 2.3 R_{tr} + R_{int} \right) C_{int}. && (3.24) \end{aligned} $$

The on-resistance of the driving transistor Rtr in (3.23) and (3.24) can be approximated [33] by

$$ R_{tr} \approx \frac{1}{\beta V_{DD}}, \tag{3.25} $$

where the term β in (3.25) is the current gain of the driving transistor operating in the saturation region [see (3.2)].
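To make (3.23) through (3.25) concrete, the Python sketch below evaluates the full interconnect delay expression and its approximation for one set of component values. All names and numbers are hypothetical and serve only to show how the two expressions compare when the load capacitance CL is small relative to Cint.

```python
def on_resistance(beta, v_dd):
    """Approximate on-resistance of the driving transistor, as in (3.25)."""
    return 1.0 / (beta * v_dd)


def interconnect_delay(r_tr, r_int, c_int, c_l):
    """Interconnect delay of (3.23)."""
    return r_int * c_int + 2.3 * (r_tr * c_int + r_tr * c_l + r_int * c_l)


def interconnect_delay_approx(r_tr, r_int, c_int):
    """Approximate interconnect delay of (3.24), which drops the CL terms."""
    return (2.3 * r_tr + r_int) * c_int


# Illustrative (assumed) values: beta in A/V^2, Vdd in V,
# resistances in ohms, capacitances in farads.
r_tr = on_resistance(beta=2.0e-4, v_dd=2.5)   # 2 kilo-ohm driver
R_INT, C_INT, C_L = 500.0, 1.0e-12, 1.0e-14

t_full = interconnect_delay(r_tr, R_INT, C_INT, C_L)
t_approx = interconnect_delay_approx(r_tr, R_INT, C_INT)
relative_error = (t_full - t_approx) / t_full
print(f"full = {t_full:.4e} s, approx = {t_approx:.4e} s, "
      f"relative error = {relative_error:.2%}")
```

The difference between the two results is exactly 2.3(Rtr + Rint)CL, so the approximation degrades as the fanout capacitance becomes comparable to the interconnect capacitance.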
[Figure 3.10: output waveforms Vout(t)/Vdd versus time (in units of RC, from 0.5RC to 2.0RC) for a distributed and a lumped RC load, with the 10%, 50%, 63%, and 90% levels marked.]
Fig. 3.10. Graphical illustration of the RC signal delay expressions listed in Table 3.2 (from [34]). The output waveforms for a CMOS inverter are for both a distributed and lumped RC load.
Approximating a distributed RC line by a combination of lumped resistances (R) and capacitances (C) is a common strategy when using circuit simulation programs (such as SPICE). A lumped Π and T ladder circuit model better approximates a distributed RC model than a lumped L ladder circuit [35] by up to 30%. As described in [35], a strategy to model a distributed RC line depends upon two circuit parameters:
1. the ratio CT = CL/C of the load capacitance CL of the fanout to the capacitance C of the interconnect line,
2. the ratio RT = Rtr/R of the output resistance Rtr of the driving MOSFET device to the resistance R of the interconnect line.
The appropriate ladder circuit (from [35]) to properly model a distributed RC interconnect line within 3% error as a function of RT and CT is listed in Table 3.3. By using the proper ladder circuit recommended in [35], the computational time of the simulation can be greatly reduced while preserving the accuracy of the overall circuit simulation [21].

Table 3.3. Circuit network to model distributed RC line with maximum error of 3% (from [35]). The notations Π, T and L correspond to a Π, T and L impedance model, respectively. The notations R and C correspond to a single lumped resistance and capacitance, respectively. The notation N means that the interconnect impedance can be ignored. Rows are indexed by CT and columns by RT.

CT \ RT   0     0.01  0.1   0.2   0.5   1     2     5     10    20    50    100
0         Π3    Π3    Π2    Π2    Π1    Π1    Π1    Π1    Π1    C     C     C
0.01      Π3    Π3    Π2    Π2    Π1    Π1    Π1    Π1    Π1    C     C     C
0.1       T2    T2    Π2    Π2    Π1    Π1    Π1    Π1    Π1    C     C     C
0.2       T2    T2    Π2    Π2    Π1    Π1    Π1    Π1    Π1    C     C     C
0.5       T1    T1    T1    T1    Π1    Π1    Π1    Π1    Π1    C     C     C
1         T1    T1    T1    T1    Π1    Π1    Π1    Π1    Π1    C     C     C
2         T1    T1    T1    T1    Π1    Π1    Π1    Π1    L1    L1    C     C
5         Π1    Π1    Π1    Π1    Π1    Π1    Π1    L1    L1    L1    C     C
10        Π1    Π1    Π1    Π1    Π1    Π1    L1    L1    L1    L1    C     C
20        R     R     R     R     R     R     L1    L1    L1    L1    C     C
50        R     R     R     R     R     R     R     R     R     R     C     N
100       R     R     R     R     R     R     R     R     R     R     N     N

3.2.6 Delay Mitigation
As discussed in this chapter, signal delay in VLSI circuits is caused by the inherent switching properties and impedances of the transistors and interconnections along each signal path. Accurate methods for estimating the signal
delay are required in order to guarantee that the circuit will operate correctly. Furthermore, certain signal delays within a circuit may need to be decreased so as to meet specific performance goals. A variety of different techniques have been developed to improve the signal delay characteristics depending upon the type of load and other circuit parameters. Among the most important techniques are:
• Gate sizing to increase the output current drive capability of the transistors along a logic chain [24, 25, 26]. Gate sizing must be applied with caution, however, because of the resulting increase in area and power dissipation, and, if incorrectly applied, increase in delay.
• Tapered buffer circuit structures are often used to drive large capacitive loads (such as at the output pad of a chip) [17, 36, 37, 38, 39, 40, 41]. A series of CMOS inverters such as the circuit shown in Figure 3.5 can be cascaded where the output drive of each buffer is increased by a constant (or variable) tapering factor.
• The use of repeater circuit structures to drive resistive-capacitive (RC) loads. Unlike tapered buffers, repeaters are typically CMOS inverters of uniform size (drive capability) that are inserted at uniform intervals along an interconnect line [8, 42, 43, 44, 45, 46, 47].
• A different timing discipline such as asynchronous timing [17, 48, 49]. Unlike fully synchronous circuits, the order of execution of logic operations in an asynchronous circuit is not controlled by a global clock signal. Therefore, the temporal operation of asynchronous circuits is essentially independent of the signal delays. The logical order of the operations in an asynchronous circuit is enforced by requiring the generation of special handshaking signals which communicate the status of the computation.
Among other useful techniques to improve the signal delay characteristics are the use of dynamic CMOS logic circuits such as Domino logic [50, 51, 52, 53] and differential circuit logic styles, such as cascade voltage switch logic (CVSL) [54, 55, 56, 57].
4 Timing Properties of Synchronous Systems
The general structure and principles for operating a fully synchronous digital VLSI system are described in Chapter 2. The combinational logic and the storage elements make up the computational circuitry used to implement a specific synchronous system. The clock distribution network provides the time reference for the storage elements—or registers—thereby enforcing the required logical order of operations. This time reference consists of one or more clock signals that are delivered to each and every register within the integrated circuit. These clock signals control the order of computational events by controlling the exact times the register data input signals are sampled. As shown in Chapter 3, the data signals are inevitably delayed as these signals propagate through the logic gates and along interconnections within the local data paths. These propagation delays can be evaluated within a certain accuracy and used to derive timing relationships among the signals within a circuit. In this chapter, the properties of commonly used types of registers and their local timing relationships for different types of local data paths are described. After discussing registers in general in Section 4.1, the properties of level-sensitive registers (latches) and the significant timing parameters characterizing these registers are reviewed in Sections 4.2 and 4.3, respectively. Edge-sensitive registers (flip-flops) and the timing parameters are analyzed in Sections 4.4 and 4.5, respectively. Properties and definitions related to the clock distribution network are reviewed in Section 4.6. The mathematical foundation for analyzing timing violations in flip-flops and latches for single-phase operation, and latches for multi-phase operation are discussed in Sections 4.7, 4.8 and 4.9, respectively, followed by some final comments in Section 4.10.
I.S. Kourtev et al., Timing Optimization Through Clock Skew Scheduling, DOI: 10.1007/978-0-387-71056-3_4, © Springer Science+Business Media LLC 2009

4.1 Storage Elements
The storage elements (registers) used in VLSI systems vary in their function and temporal relationships. Independent of these differences, however, all storage elements share a common feature: the existence of two groups of signals with largely different purposes. A generalized view of a register is depicted in Figure 4.1. The I/O signals of a register can be divided into two groups as shown in Figure 4.1.

[Figure 4.1: block diagram of a REGISTER with Data (Inputs), Data (Outputs), and Control (clock, set/reset, etc.) terminals.]
Fig. 4.1. A general view of a register.

One group of signals, called the data signals, consists of input and output signals of the storage element. These input and output signals are typically connected to the terminals of ordinary logic gates and may be connected to the data signal terminals of other storage elements. Another group of signals, identified by the name control signals, are those signals that control the storage of the data signals in the registers but do not participate in the logical computation process.

Certain control signals enable the storage of a data signal in a register independently of the values of any data signals. These control signals are typically used to initialize the data in a register to a specific well known value. Other control signals, such as a clock signal, control the process of storing a data signal within a register. In a synchronous circuit, each register has at least one clock (or control) signal input.

The two major groups of storage elements (registers) are considered in the following sections based on the type of relationship that exists among the data and clock signals of these elements. In latches, it is the specific value or level of a control signal (see footnote 1) that determines the data storage process. Therefore, latches are also called level-sensitive registers. In contrast to latches, a data signal is stored in flip-flops enabled by an edge of a control signal. For that reason, flip-flops are also called edge-triggered registers. The timing properties of latches and flip-flops are described in detail in the following two sections.

Footnote 1: This signal is most frequently the clock signal.
4.2 Latches
A latch is a register whose behavior depends upon the value or level of the clock signal [10, 12, 14, 15, 29, 58, 59, 60]. Therefore, a latch is often referred to as a transparent latch, a level-sensitive register or a polarity hold latch. A simple type of latch with a clock signal C and an input signal D is depicted in Figure 4.2; the output of the latch is typically labeled Q. This type of latch is also known as a D latch and its operation is illustrated in Figure 4.3.

[Figure 4.2: latch symbol with Data Input D, Clock Input C, and Data Output Q.]
Fig. 4.2. Schematic representation of a level-sensitive register or latch.
The type of register illustrated in Figures 4.2 and 4.3 is a positive-polarity latch (see footnote 2) since it is transparent during that portion of the clock period during which C is high. The operation of this positive latch is summarized in Table 4.1.

Table 4.1. Operation of the positive-polarity D latch.

Clock   Output             State
high    passes input       transparent
low     maintains output   opaque

As described in Table 4.1 and illustrated in Figure 4.3, the output signal of the latch follows the data input signal while the clock signal remains high, i.e., C = 1 ⇒ Q = D. Thus, the latch is said to be in a transparent state during the interval t0 < t < t1 as shown in Figure 4.3. When the clock signal C changes from 1 to 0, the current value of D is stored in the register and the output Q remains fixed to that value regardless of whether the data signal D changes. The latch does not pass the input data signal to the output but rather holds onto the final value of the data signal when the clock signal made the high-to-low transition. By analogy with the term transparent introduced above, this state of the latch is called opaque and corresponds to the interval t1 < t < t2 shown in Figure 4.3 where the input data signal is isolated from the output port. As shown in Figure 4.3, the clock period is TCP = t2 − t0.

Footnote 2: Or simply a positive latch.
[Figure 4.3: timing diagram of the clock C (leading edge at t0, trailing edge at t1, clock period TCP ending at t2), the data input D, and the data output Q, showing the transparent state, the opaque state, and the stored value.]
Fig. 4.3. Idealized operation of a level-sensitive register or latch.
The edge of the clock signal that causes the latch to switch to its transparent state is identified as the leading edge of the clock pulse. In the case of the positive latch shown in Figure 4.2, the leading edge of the clock signal occurs at time t0 . The opposite edge direction of the clock signal is identified as the trailing edge—the falling edge at time t1 shown in Figure 4.3. Note that for a negative latch, the leading edge is a high-to-low transition and the trailing edge is a low-to-high transition.
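The idealized behavior of the positive-polarity D latch can be sketched as a small behavioral model in Python. The class name, method name, and the sampled clock and data sequences below are hypothetical, and the model deliberately ignores all delays as well as setup and hold requirements.

```python
class DLatch:
    """Idealized positive-polarity D latch (no delays, setup, or hold modeled)."""

    def __init__(self, initial_q=0):
        self.q = initial_q

    def update(self, c, d):
        """Apply one sample of clock C and data D, then return the output Q."""
        # Transparent state: C is high, so Q follows D.
        if c == 1:
            self.q = d
        # Opaque state: C is low, so Q keeps the previously stored value.
        return self.q


# Sampled waveforms: the clock is high for two samples, low for two samples,
# and then high again.
clock = [1, 1, 0, 0, 1]
data = [0, 1, 0, 1, 1]

latch = DLatch()
outputs = [latch.update(c, d) for c, d in zip(clock, data)]
print(outputs)
```

In the third and fourth samples the clock is low, so the changes in D are ignored and Q holds the value captured just before the trailing edge.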
4.3 Parameters of Latches
Registers such as the D latch illustrated in Figures 4.2 and 4.3 and the flip-flops described in Sections 4.4 and 4.5 are built of discrete components, such as the NMOS transistor shown in Figure 3.4. The exact relationships among signals on the terminals of a register can be presented and evaluated in analytical form [61, 62, 63]. In this research monograph, however, registers are considered at a higher level of abstraction in order to hide the unnecessary details of the specific electrical implementation. The latch delay parameters described in the following sections are therefore considered from the perspective of the earlier discussion of delay in Chapter 3. These parameters are briefly introduced next.

Note: The remaining portion of this chapter and the rest of this monograph use an extensive notation for various parameters describing the signals and storage elements.

4.3.1 Width of the Clock Pulse
The width of the clock pulse $C_W^L$ is the permissible width of this portion of the clock signal during the time when the latch is transparent. In other words, $C_W^L$ is the length of the time interval between the leading and the trailing edge of the clock signal such that the latch will operate properly. The superscript L is used optionally to represent the type of registers (latch in this case) that are synchronized by this clock signal. The subscript W is used to represent the width, which is included to distinguish between a clock signal C and the clock width $C_W$. Increasing the value of $C_W^L$ any further will not affect the values of $D_{DQ}^L$, $\delta_S^L$ and $\delta_H^L$ (defined in Sections 4.3.3, 4.3.4, and 4.3.5, respectively). The width of the clock pulse, $C_W^L = t_6 - t_1$, is illustrated in Figure 4.4. The clock period is TCP = t8 − t1.

4.3.2 Latch Clock-to-Output Delay
The clock-to-output delay $D_{CQ}^L$ (typically called the clock-to-Q delay) is the propagation delay of the latch from the clock signal terminal to the output terminal. The value of $D_{CQ}^L = t_2 - t_1$ is depicted in Figure 4.4 and is defined assuming that the data input signal has settled to a stable value sufficiently early, i.e., setting the data input signal earlier with respect to the leading clock edge will not affect the value of $D_{CQ}^L$.

4.3.3 Latch Data-to-Output Delay
The data-to-output delay $D_{DQ}^L$ (typically called the data-to-Q delay) is the propagation delay of the latch from the data signal terminal to the output terminal. The value of $D_{DQ}^L$ is defined assuming that the clock signal has set the latch to its transparent state sufficiently early, i.e., making the leading edge of the clock signal occur earlier will not change the value of $D_{DQ}^L$. The data-to-output delay $D_{DQ}^L = t_4 - t_3$ is illustrated in Figure 4.4.

4.3.4 Latch Setup Time
The latch setup time $\delta_S^L = t_6 - t_5$, shown in Figure 4.4, is the minimum time between a change in the data signal and the trailing edge of the clock signal such that the new value of D would successfully propagate to the output Q of the latch and be stored within the latch during the opaque state.
[Figure 4.4: timing diagram of the clock C, data input D, and data output Q over times t1 through t8, annotated with the clock-to-output delay, the data-to-output delay, the width of the clock pulse, the setup time, the hold time, and the clock period TCP.]
Fig. 4.4. Parameters of a level-sensitive register.
4.3.5 Latch Hold Time
The latch hold time $\delta_H^L$ is the minimum time after the trailing clock edge that the data signal must remain constant such that this value of D is successfully stored in the latch during the opaque state. This definition of $\delta_H^L$ assumes that the last change of the value of D has occurred no later than $\delta_S^L$ before the trailing edge of the clock signal. The term $\delta_H^L = t_7 - t_6$ is shown in Figure 4.4.

Note: The latch parameters introduced in Sections 4.3.1 through 4.3.5 are used to refer to any latch in general or to a specific instance of a latch when this instance can be unambiguously identified. To refer to a specific instance i of a latch explicitly, the parameters are additionally shown with a superscript. For example, $D_{CQ}^{Li}$ refers to the clock-to-output delay of latch i. Also, adding m and M to the subscript of any parameter is used to refer to the minimum and maximum values of that parameter, respectively.
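The definitions in Sections 4.3.1 through 4.3.5 express each latch parameter as a difference of two event times in Figure 4.4. As a compact summary, the Python sketch below collects these differences in one place. The class, function, and field names are hypothetical, and the event times (given in picoseconds) are invented for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class LatchParameters:
    clock_width: int    # width of the clock pulse, t6 - t1
    clock_to_q: int     # clock-to-output delay, t2 - t1
    data_to_q: int      # data-to-output delay, t4 - t3
    setup: int          # setup time, t6 - t5
    hold: int           # hold time, t7 - t6
    clock_period: int   # clock period, t8 - t1


def latch_parameters_from_events(t):
    """Build the latch parameters from a mapping {k: t_k} for k = 1..8."""
    return LatchParameters(
        clock_width=t[6] - t[1],
        clock_to_q=t[2] - t[1],
        data_to_q=t[4] - t[3],
        setup=t[6] - t[5],
        hold=t[7] - t[6],
        clock_period=t[8] - t[1],
    )


# Hypothetical event times in picoseconds, ordered as in Figure 4.4.
events_ps = {1: 0, 2: 300, 3: 1000, 4: 1250, 5: 1800, 6: 2000, 7: 2100, 8: 4000}
params = latch_parameters_from_events(events_ps)
print(params)
```

Integer picoseconds are used here only to keep the arithmetic exact; any consistent time unit would serve.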
4.4 Flip-Flops
An edge-triggered register or flip-flop is a type of register which, unlike the latches described in Sections 4.2 and 4.3, is never transparent with respect to the input data signal [10, 12, 14, 15, 29, 58, 59, 60]. The output of a flip-flop normally does not follow the input data signal at any time during the register operation but rather holds onto a previously stored data value until a new data signal is stored in the flip-flop. A simple type of flip-flop with a clock signal C and an input signal D is shown in Figure 4.5; similarly to latches, the output of a flip-flop is usually labeled Q. This specific type of register, shown in Figure 4.5, is called a D flip-flop and its operation is illustrated in Figure 4.6.

[Figure 4.5: flip-flop symbol with Data Input D, Clock Input C, and Data Output Q.]
Fig. 4.5. An edge-triggered register or flip-flop.

In typical flip-flops, data is stored either on the rising edge (the low-to-high transition) or on the falling edge (the high-to-low transition) of the clock signal. The flip-flops are known as positive-edge-triggered and negative-edge-triggered flip-flops, respectively. The term latching, storing or positive edge is used to identify the edge of the clock signal on which storage in the flip-flop occurs. For the sake of clarity, the latching edge of the clock signal for flip-flops will also be called the leading edge (compare to the discussion of latches in Sections 4.2 and 4.3). Also, note that certain flip-flops, known as double-edge-triggered (DET) flip-flops [64, 65, 66, 67, 68], can store data at either edge of the clock signal. The complexity of these flip-flops, however, is significantly higher and these registers are therefore rarely used.
As shown in the timing diagram in Figure 4.6, the output of the flip-flop remains unchanged most of the time regardless of the transitions in the data signal. Only values of the data signal in the vicinity of the storing edge of the clock signal can affect the output of the flip-flop. Therefore, changes in the output will only be observed when the currently stored data has a logic value x and the storing edge of the clock signal occurs while the input data signal has a logic value of x̄ (the complement of x).

[Figure 4.6: timing diagram of the clock C (latching edges, clock period TCP), the data input D, and the data output Q with the stored values, over times t0, t1, and t2.]
Fig. 4.6. Idealized operation of an edge-triggered register or flip-flop.
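The idealized operation of a positive-edge-triggered D flip-flop can be sketched as a small behavioral model in Python. The class name, method name, and the sampled waveforms are hypothetical, and the model ignores the clock-to-output delay as well as setup and hold requirements.

```python
class DFlipFlop:
    """Idealized positive-edge-triggered D flip-flop (no delays modeled)."""

    def __init__(self, initial_q=0):
        self.q = initial_q
        self._previous_c = 0

    def update(self, c, d):
        """Apply one sample of clock C and data D, then return the output Q."""
        # Data is stored only on the latching (low-to-high) edge of the clock.
        if self._previous_c == 0 and c == 1:
            self.q = d
        self._previous_c = c
        return self.q


# Sampled waveforms with latching edges at the second and fifth samples.
clock = [0, 1, 1, 0, 1]
data = [1, 1, 0, 0, 0]

flip_flop = DFlipFlop()
outputs = [flip_flop.update(c, d) for c, d in zip(clock, data)]
print(outputs)
```

Unlike a latch, the model ignores the change of D in the third sample even though the clock is still high, since no latching edge occurs there.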
4.5 Parameters of Flip-Flops
The significant timing parameters of edge-triggered registers are similar to those of latches (recall Section 4.3) and are presented next. These parameters are illustrated in Figure 4.7.

4.5.1 Width of the Clock Pulse
The width of the clock pulse $C_W^F$ is the permissible width of the time interval between the latching edge and non-latching edge of the clock signal. The superscript F is used optionally to represent the type of registers (flip-flops in this case) that are synchronized by this clock signal. The subscript W is used to represent the width, which is included to distinguish between a clock signal C and the clock width $C_W$. The width of the clock pulse $C_W^F = t_6 - t_3$ is shown in Figure 4.7 and is defined as the interval between the latching and non-latching edges of the clock pulse such that the flip-flop will operate correctly. Further increasing $C_W^F$ will not affect the values of the setup time $\delta_S^F$ and hold time $\delta_H^F$ (defined in Sections 4.5.3 and 4.5.4, respectively). The clock period TCP = t6 − t1 is also shown in Figure 4.7.

4.5.2 Flip-Flop Clock-to-Output Delay
As shown in Figure 4.7, the clock-to-output delay $D_{CQ}^F$ of the flip-flop is $D_{CQ}^F = t_5 - t_3$. This propagation delay parameter, typically called the clock-to-Q delay, is the propagation delay from the clock signal terminal to the output terminal. The value of $D_{CQ}^F$ is defined assuming that the data input signal has settled to a stable value sufficiently early, i.e., setting the data input signal any earlier with respect to the latching clock edge will not affect the value of $D_{CQ}^F$.

4.5.3 Flip-Flop Setup Time
The flip-flop setup time $\delta_S^F = t_3 - t_2$ is shown in Figure 4.7. The parameter $\delta_S^F$ is defined as the minimum time between a change in the data signal and the latching edge of the clock signal such that the new value of D propagates to the output Q of the flip-flop and is successfully latched within the flip-flop.

4.5.4 Flip-Flop Hold Time
The flip-flop hold time $\delta_H^F$ is the minimum time after the arrival of the latching clock edge during which the data signal must remain constant in order to successfully store the D signal within the flip-flop. The hold time $\delta_H^F = t_4 - t_3$ is illustrated in Figure 4.7. This definition of the hold time assumes that the last change of D has occurred no later than $\delta_S^F$ before the arrival of the latching edge of the clock signal.

Note: Similar to latches, the parameters of these edge-triggered registers refer to any flip-flop in general or to a specific instance of a flip-flop when this instance is uniquely identified. To explicitly refer to a specific instance i of a flip-flop, the flip-flop parameters are additionally shown with a superscript. For example, $\delta_S^{Fi}$ refers to the setup time parameter of flip-flop i. Also, adding m and M to the subscript of $D_{CQ}^F$ is used to refer to the minimum and maximum values of $D_{CQ}^F$.
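The setup and hold times together define a window around the latching edge during which the data signal should not change. The Python sketch below flags data transitions that fall inside this window. The function name and the numbers are hypothetical, and the treatment of the window boundaries (a change exactly one setup time before the edge is accepted, a change exactly at the edge is counted as a hold violation) is an assumption of this example.

```python
def find_timing_violations(data_change_times, latching_edge, setup, hold):
    """Return (time, kind) pairs for data changes inside the setup/hold window.

    All arguments share one time unit. `kind` is "setup" for a change that
    occurs less than `setup` before the latching edge and "hold" for a change
    that occurs at the edge or less than `hold` after it.
    """
    violations = []
    for t in data_change_times:
        if latching_edge - setup < t < latching_edge:
            violations.append((t, "setup"))
        elif latching_edge <= t < latching_edge + hold:
            violations.append((t, "hold"))
    return violations


# Hypothetical values in picoseconds: latching edge at 1000 ps,
# setup time 100 ps, hold time 50 ps.
changes_ps = [700, 950, 1020, 1200]
violations = find_timing_violations(changes_ps, latching_edge=1000,
                                    setup=100, hold=50)
print(violations)
```

The changes at 700 ps and 1200 ps lie outside the window and are therefore not reported.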
[Figure 4.7: waveforms of the Clock (C), Data In (D) and Data Out (Q) signals with time marks $t_1$ through $t_6$, annotated with the clock period $T_{CP}$, the setup time $\delta_S^F$, the hold time $\delta_H^F$, the clock-to-output delay $D_{CQ}^F$ and the width of the clock pulse $C_W^F$.]

Fig. 4.7. Parameters of an edge-triggered register.
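The interval definitions of Sections 4.5.1 through 4.5.4 can be collected in a few lines of Python. This is only an illustrative sketch: the class name, the function name and the numerical time marks are hypothetical and are not taken from the text; only the five differences between time marks follow the definitions above.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class FlipFlopTimeMarks:
    """Time marks t1..t6 of Figure 4.7 (hypothetical values, arbitrary units)."""
    t1: float
    t2: float
    t3: float  # latching clock edge
    t4: float
    t5: float
    t6: float


def flip_flop_parameters(m: FlipFlopTimeMarks) -> dict:
    """Apply the interval definitions of Sections 4.5.1 to 4.5.4."""
    return {
        "clock_period": m.t6 - m.t1,  # T_CP = t6 - t1
        "clock_width": m.t6 - m.t3,   # C_W^F = t6 - t3
        "clock_to_q": m.t5 - m.t3,    # D_CQ^F = t5 - t3
        "setup_time": m.t3 - m.t2,    # delta_S^F = t3 - t2
        "hold_time": m.t4 - m.t3,     # delta_H^F = t4 - t3
    }


# Hypothetical time marks chosen only to exercise the definitions.
marks = FlipFlopTimeMarks(t1=0.0, t2=3.0, t3=4.0, t4=4.5, t5=5.0, t6=8.0)
params = flip_flop_parameters(marks)
print(params)
```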
4.6 The Clock Signal The clock signal is typically delivered to each storage element within a circuit. This signal is crucial to the correct operation of a fully synchronous digital
system. As described in Section 2.2, the storage elements serve to establish the relative sequence of events within a system so that those operations that cannot be executed concurrently operate on the proper data signals.

A typical clock signal $c(t)$ in a synchronous digital system is shown in Figure 4.8. The clock period $T_{CP}$ of $c(t)$ is also indicated in Figure 4.8.

[Figure 4.8: clock waveform annotated with the clock period $T_{CP}$, the width of the clock pulse $C_W$ and the edge deviations $\Delta_L$ and $\Delta_T$.]

Fig. 4.8. A typical clock signal.

In order to provide the highest possible clock frequency, the objective is for $T_{CP}$ to be the smallest number such that

    $\forall t: \; c(t) = c(t + nT_{CP})$,    (4.1)
where $n$ is an integer. The width of the clock pulse $C_W$ is shown in Figure 4.8, where the meaning of $C_W$ is explained in Sections 4.3.1 (for a latch) and 4.5.1 (for a flip-flop), respectively. Typically, the period of the clock signal $T_{CP}$ is a constant, that is, $\partial T_{CP} / \partial t = 0$. If the clock signal $c(t)$ has a delay $\tau$ from some reference point, the leading edges of $c(t)$ occur at times

    $\tau + mT_{CP}$ for $m \in \{\ldots, -2, -1, 0, 1, 2, \ldots\}$,    (4.2)

and the trailing edges of $c(t)$ occur at times

    $\tau + C_W + mT_{CP}$ for $m \in \{\ldots, -2, -1, 0, 1, 2, \ldots\}$.    (4.3)

In practice, however, it is possible for the edges of a clock signal to fluctuate in time, that is, for the edges not to occur precisely at the times described by (4.2) and (4.3) for the leading and trailing edges, respectively. This phenomenon is known as clock jitter and may be due to various causes such as variations in the manufacturing process, ambient temperature, power supply noise and oscillator variations.
To account for this clock jitter, the following parameters are introduced:

• the maximum deviation $\Delta_L$ of the leading edge of the clock signal, i.e., the leading edge is guaranteed to occur anywhere in the interval $(\tau + kT_{CP} - \Delta_L, \; \tau + kT_{CP} + \Delta_L)$,
• the maximum deviation $\Delta_T$ of the trailing edge of the clock signal, i.e., the trailing edge is guaranteed to occur anywhere in the interval $(\tau + C_W + kT_{CP} - \Delta_T, \; \tau + C_W + kT_{CP} + \Delta_T)$.
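The edge times (4.2) and (4.3) together with the two jitter parameters can be tabulated with a short Python sketch. The function names and all numerical values below are assumptions made for illustration only; the intervals themselves follow the two definitions above.

```python
def leading_edge_window(tau, t_cp, delta_l, k):
    """Interval containing the k-th leading edge: nominal time (4.2) +/- Delta_L."""
    nominal = tau + k * t_cp
    return (nominal - delta_l, nominal + delta_l)


def trailing_edge_window(tau, t_cp, c_w, delta_t, k):
    """Interval containing the k-th trailing edge: nominal time (4.3) +/- Delta_T."""
    nominal = tau + c_w + k * t_cp
    return (nominal - delta_t, nominal + delta_t)


# Hypothetical clock: delay 1, period 10, pulse width 4 (arbitrary time units).
TAU, T_CP, C_W = 1.0, 10.0, 4.0
DELTA_L, DELTA_T = 0.25, 0.5

for k in range(3):
    lead = leading_edge_window(TAU, T_CP, DELTA_L, k)
    trail = trailing_edge_window(TAU, T_CP, C_W, DELTA_T, k)
    print(f"cycle {k}: leading edge in {lead}, trailing edge in {trail}")
```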
4.6.1 Clock Skew

Consider a local data path such as the path shown in Figure 2.6 on page 14. Without loss of generality, assume that the registers shown in Figure 2.6 are flip-flops. The clock signal with period $T_{CP}$ is delivered to each of the registers $R_i$ and $R_f$. Let the clock signal driving the register $R_i$ be denoted as $C_i$ and the clock signal driving the register $R_f$ be denoted by $C_f$. Also, let $t_{cd}^i$ and $t_{cd}^f$ be the delays of $C_i$ and $C_f$ to the registers $R_i$ and $R_f$, respectively, both measured with respect to the same reference point. As described by (4.2), the latching or leading edges of $C_i$ occur at times

    $\ldots, \; \tau + t_{cd}^i - T_{CP}, \; \tau + t_{cd}^i, \; \tau + t_{cd}^i + T_{CP}, \; \ldots$

Similarly, again following (4.2), the latching or leading edges of $C_f$ occur at times

    $\ldots, \; \tau + t_{cd}^f - T_{CP}, \; \tau + t_{cd}^f, \; \tau + t_{cd}^f + T_{CP}, \; \ldots$

The clock skew $T_{Skew}(i,f) = t_{cd}^i - t_{cd}^f$ between $C_i$ and $C_f$ is introduced next as the difference of the arrival times of $C_i$ and $C_f$ [9] (a more formal definition is provided in Chapter 5). This concept is illustrated by Figure 4.9.

[Figure 4.9: three panels comparing Clock i and Clock f. Zero skew: Delay i = Delay f. Negative skew: Delay i < Delay f. Positive skew: Delay i > Delay f.]

Fig. 4.9. Lead/lag relationships causing clock skew to be zero, negative or positive.

Note that the clock skew can be zero, negative or
positive, depending upon whether $t_{cd}^i$ is equal to, less than or greater than $t_{cd}^f$, respectively. Furthermore, note that the clock skew as defined above is only defined for sequentially-adjacent registers, that is, a local data path [such as the path shown in Figure 2.6].

4.6.2 Multi-Phase Clock Synchronization

Multi-phase (clock) synchronization is observed when different phases of the clock signal are distributed to the synchronous components of a circuit. Figure 4.10 presents a representation of a multi-phase clock signal.

[Figure 4.10: the source clock signals $C^1_{source}, C^2_{source}, \ldots, C^{(n-1)}_{source}, C^n_{source}$ over one clock period $T_{CP}$, each with pulse width $C_W$, with start times $\phi^1, \phi^2, \ldots, \phi^{(n-1)}, \phi^{(n)}$.]

Fig. 4.10. A sample multi-phase synchronization clock.

In Figure 4.10, the multi-phase synchronization scheme is generated with overlapping clock signal phases. In practical implementations, non-overlapping clock phases are used more frequently because they are simpler to implement and analyze. The duty cycles of the clock phases are considered identical, with on-times of $C_W$. It is common for duty cycles to be similar, as multiple phases of the clock are typically generated from a single oscillation source with phase shifters.

In Figure 4.10, the set of clock signals $C^{global} = \{C^1, \ldots, C^n\}$ constitutes the $n$-phase clocking scheme, where the superscripts denote the particular clock phase. The subscripts denote the location of the clock signals on the circuit. For instance, $C^1_{source}$ denotes the clock signal at the clock source "source" of the clock phase $C^1$. When this clock signal is delivered to an arbitrary register $R_k$, it is denoted by $C^1_k$. The start time $\phi^{p_i}$ of clock signal phase $C^{p_i}$ is defined with respect to a common reference clock cycle. The phase shift operator $\phi^{p_i p_f}$ [69] is used to transform variables between different clock phases. The phase shift operator is defined as the algebraic difference

    $\phi^{p_i p_f} = \phi^{p_i} - \phi^{p_f} + kT_{CP}$,

where $k$ is the number of clock cycles occurring between phases. Note that for a single-phase clocking scheme, the phase shift operator evaluates to $\phi^{if} = T_{CP}$.

A multi-phase synchronization approach can be advantageous in terms of increasing the reachability of circuit registers, creating less skew within physically neighboring local clock domains and potentially saving power. Despite these advantages, the design and analysis of such synchronization schemes are more complex.

The multi-phase clock skew is defined as $T_{Skew}^{p_i p_f}(i,f) = t_i^{p_i} - t_f^{p_f}$, where $t_i^{p_i}$ and $t_f^{p_f}$ are the delays of the clock signals $C_i^{p_i}$ and $C_f^{p_f}$ from the clock sources to the registers $R_i$ and $R_f$, respectively. The multi-phase clock skew is illustrated in Figure 4.11. The common clock period for all clock phases is denoted by $T_{CP}$ for consistency with the original formulation of the single-phase synchronized circuits.

[Figure 4.11: the source clocks $C^{p_i}_{source}$ and $C^{p_f}_{source}$ with start times $\phi^{p_i}$ and $\phi^{p_f}$, the delivered clocks $C_i^{p_i}$ and $C_f^{p_f}$ with delays $t_i^{p_i}$ and $t_f^{p_f}$, and the interval $|\phi^{p_i p_f} + T_{Skew}^{p_i p_f}(i,f)|$, all within one clock period $T_{CP}$.]

Fig. 4.11. Multi-phase clock skew.
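The phase shift operator and the multi-phase clock skew are both simple algebraic differences, which the following Python sketch evaluates. The function names, the two-phase start times and the clock delays are hypothetical values chosen for illustration; the number of cycles $k$ is treated as a given input rather than derived.

```python
def phase_shift(phi_pi, phi_pf, k, t_cp):
    """Phase shift operator: phi^{pi pf} = phi^{pi} - phi^{pf} + k * T_CP."""
    return phi_pi - phi_pf + k * t_cp


def multi_phase_skew(t_i, t_f):
    """Multi-phase clock skew: T_Skew^{pi pf}(i, f) = t_i^{pi} - t_f^{pf}."""
    return t_i - t_f


T_CP = 10.0  # hypothetical clock period, arbitrary time units

# Single-phase scheme: both registers share one phase, one cycle apart,
# so the operator reduces to the clock period.
single = phase_shift(0.0, 0.0, 1, T_CP)

# Hypothetical two-phase scheme with start times 0 and 5, taking k = 1.
two_phase = phase_shift(0.0, 5.0, 1, T_CP)

# Hypothetical clock delays to R_i and R_f.
skew = multi_phase_skew(1.5, 1.0)

# The quantity |phi^{pi pf} + T_Skew^{pi pf}(i, f)| annotated in Figure 4.11.
print(single, two_phase, skew, abs(two_phase + skew))
```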
4.7 Single-Phase Path with Flip-Flops

A local data path composed of two flip-flops and combinational logic between the flip-flops is shown in Figure 4.12.

[Figure 4.12: flip-flop $R_i$ (data input $D_i$, clock $C_i$, output $Q_i$) drives the combinational logic $L_{if}$, whose output $D_f$ feeds flip-flop $R_f$ (clock $C_f$, output $Q_f$).]

Fig. 4.12. A single-phase local data path.

The initial flip-flop $R_i$ is the origin of the data signal and the final flip-flop $R_f$ is the destination of the data signal. The combinational logic block $L_{if}$ between $R_i$ and $R_f$ accepts the input data signals supplied by $R_i$ and other registers and logic gates and transmits the operated upon data signals to $R_f$. The period of the clock signal is denoted by $T_{CP}$ and the delays of the clock signals $C_i$ and $C_f$ to the flip-flops $R_i$ and $R_f$ are denoted by $t_{cd}^i$ and $t_{cd}^f$, respectively. The input and output data signals to $R_i$ and $R_f$ are denoted by $D_i$, $Q_i$, $D_f$, and $Q_f$, respectively.

An analysis of the timing properties of the local data path shown in Figure 4.12 is offered in the following sections. First, the timing relationships to prevent the late arrival of data signals to $R_f$ are examined in Section 4.7.1. The timing relationships to prevent the early arrival of signals to the register $R_f$ are described in Section 4.7.2. The analyses presented in Sections 4.7.1 and 4.7.2 borrow some of the notation from [19] and [20]. Similar analyses of synchronous circuits from the timing perspective can be found in [69, 70, 71, 72, 73].

4.7.1 Preventing the Late Arrival of the Data Signal

The operation of the local data path $R_i \leadsto R_f$ shown in Figure 4.12 requires that any data signal that is being stored in $R_f$ arrives at the data input $D_f$ of $R_f$ no later than $\delta_S^{Ff}$ before the latching edge of the clock signal $C_f$. (As a reminder of the definitions in Section 4.5, in the notation $\delta_S^{Ff}$ the subscript $S$ denotes the setup time, the superscript $F$ denotes a flip-flop parameter and the superscript $f$ denotes the parameter defined at the final register $R_f$.) It is possible for the opposite event to occur, that is, for the data signal $D_f$ not to arrive at the register $R_f$ sufficiently early in order to be stored successfully within $R_f$. If this situation occurs, the local data path shown in Figure 4.12 fails to perform as expected and a timing failure or violation is created. This form of timing violation is typically called a setup (or long path) violation. A setup violation is depicted in Figure 4.13 and is used in the following discussion.
[Figure 4.13: waveforms of $C_i$, $D_i$, $Q_i$, $D_f$ and $C_f$ around the k-th clock period, annotated with $\Delta_L$, $D_{CQ}^{Fi}$, $D_{PM}^{i,f}$, $\delta_S^{Ff}$ and $T_{CP}$.]

Fig. 4.13. Timing diagram of a local data path with flip-flops illustrating a violation of the setup (or long path) constraint.
The coincidental cycles (k-th) of the clock signals $C_i$ and $C_f$ are shaded for identification in Figure 4.13. Also shaded in Figure 4.13 are those portions of the data signals $D_i$, $Q_i$, and $D_f$ that are relevant to the operation of the local data path shown in Figure 4.12. Specifically, the shaded portion of $D_i$ corresponds to the data to be stored in $R_i$ at the beginning of the k-th clock cycle. This data signal propagates to the output of the register $R_i$ and is illustrated by the shaded portion of $Q_i$ shown in Figure 4.13. The combinational logic operates on $Q_i$ during the k-th clock cycle. The result of this operation is illustrated by the shaded portion of the signal $D_f$ which must be stored in $R_f$ during the next (k + 1)-st clock cycle.

Observe that as illustrated in Figure 4.13, the leading edge of $C_i$ that initiates the k-th clock cycle occurs at time $t_{cd}^i + kT_{CP}$ with respect to a global time reference of zero. Similarly, the leading edge of $C_f$ that initiates the (k + 1)-st clock cycle occurs at time $t_{cd}^f + (k+1)T_{CP}$. Therefore, the latest arrival time $A_f$ of the data signal $D_f$ at the flip-flop $R_f$ must satisfy

    $A_f \le t_{cd}^f + (k+1)T_{CP} - \Delta_L^F - \delta_S^{Ff}$.    (4.4)

The term $t_{cd}^f + (k+1)T_{CP} - \Delta_L^F$ on the right hand side of (4.4) corresponds to the critical situation of the leading edge of $C_f$ arriving earlier by the maximum possible deviation $\Delta_L^F$. The $-\delta_S^{Ff}$ term on the right hand side of (4.4) accounts for the setup time of $R_f$ (recall the definition of $\delta_S^F$ from Section 4.5.3). Note that the value of $A_f$ in (4.4) consists of two components:

1. The latest arrival time $D_i$ that a valid data signal $Q_i$ appears at the output of $R_i$, i.e., the sum $D_i = t_{cd}^i + kT_{CP} + \Delta_L^F + D_{CQM}^{Fi}$ of the latest possible arrival time of the leading edge of $C_i$ and the maximum clock-to-Q delay of $R_i$,
2. The maximum propagation delay $D_{PM}^{i,f}$ of the data signals through the combinational logic block $L_{if}$ and interconnect along the path $R_i \leadsto R_f$.

Therefore, $A_f$ can be described as

    $A_f = D_i + D_{PM}^{i,f} = t_{cd}^i + kT_{CP} + \Delta_L^F + D_{CQM}^{Fi} + D_{PM}^{i,f}$.    (4.5)

By substituting (4.5) into (4.4), the timing condition guaranteeing correct signal arrival at the data input $D$ of $R_f$ is

    $t_{cd}^i + kT_{CP} + \Delta_L^F + D_{CQM}^{Fi} + D_{PM}^{i,f} \le t_{cd}^f + (k+1)T_{CP} - \Delta_L^F - \delta_S^{Ff}$.    (4.6)

The above inequality can be transformed by subtracting the $kT_{CP}$ terms from both sides of (4.6). Furthermore, certain terms in (4.6) can be grouped together. Also, by noting that $t_{cd}^i - t_{cd}^f = T_{Skew}(i,f)$ is the clock skew between the registers $R_i$ and $R_f$,

    $T_{Skew}(i,f) + 2\Delta_L^F \le T_{CP} - \left( D_{CQM}^{Fi} + D_{PM}^{i,f} + \delta_S^{Ff} \right)$.    (4.7)

Note that a violation of (4.7) is illustrated in Figure 4.13. The timing relationship (4.7) represents three important results describing the late arrival of the signal $D_f$ at the data input of the final register $R_f$ in a local data path $R_i \leadsto R_f$:

1. Given any values of $T_{Skew}(i,f)$, $\Delta_L^F$, $D_{PM}^{i,f}$, $\delta_S^{Ff}$ and $D_{CQM}^{Fi}$, the late arrival of the data signal at $R_f$ can be prevented by controlling the value of the clock period $T_{CP}$. A sufficiently large value of $T_{CP}$ can always be chosen to relax (4.7) by increasing the upper bound described by the right hand side of (4.7).
2. For correct operation, the clock period $T_{CP}$ does not necessarily have to be larger than the term $D_{CQM}^{Fi} + D_{PM}^{i,f} + \delta_S^{Ff}$. If the clock skew $T_{Skew}(i,f)$ is properly controlled, choosing a particular negative value for the clock skew will relax the left side of (4.7), thereby permitting (4.7) to be satisfied despite $T_{CP} - \left( D_{CQM}^{Fi} + D_{PM}^{i,f} + \delta_S^{Ff} \right) < 0$.
3. Both the term $2\Delta_L^F$ and the term $\left( D_{CQM}^{Fi} + D_{PM}^{i,f} + \delta_S^{Ff} \right)$ are harmful in the sense that these terms impose a lower bound on the clock period $T_{CP}$ (as expected). Although negative skew can be used to relax the inequality (4.7), these two terms work against relaxing the values of $T_{CP}$ and $T_{Skew}(i,f)$.

Note that equivalently, the inequality (4.7) can be interpreted as imposing an upper bound on the clock skew $T_{Skew}(i,f)$. Finally, the relationship (4.7) may be rewritten in a form that clarifies the upper bound imposed on the clock skew $T_{Skew}(i,f)$:

    $T_{Skew}(i,f) \le T_{CP} - \left( D_{CQM}^{Fi} + D_{PM}^{i,f} + \delta_S^{Ff} \right) - 2\Delta_L^F$.    (4.8)

4.7.2 Preventing the Early Arrival of the Data Signal

Late arrival of the signal $D_f$ at the data input of $R_f$ (see Figure 4.12) is analyzed in Section 4.7.1. In this section, an analysis of the timing relationships of the local data path $R_i \leadsto R_f$ to prevent early data arrival of $D_f$ is presented. To this end, recall from the discussion in Section 4.5.4 that any data signal $D_f$ being stored in $R_f$ must lag the arrival of the leading edge of $C_f$ by at least $\delta_H^{Ff}$. It is possible for the opposite event to occur, i.e., for a new data signal $D_f^{new}$ to overwrite the value of $D_f$ and be stored within the register $R_f$. If this situation occurs, the local data path shown in Figure 4.12 will not perform as desired because of the timing violation known as a hold time (or short path) violation.

In this section, these hold time violations caused by race conditions are analyzed. It is shown that a hold violation is more dangerous than a setup violation since a hold violation cannot be removed by simply adjusting the clock period $T_{CP}$ [unlike the case of a data signal arriving late where $T_{CP}$ can be increased to satisfy (4.7)]. A hold violation is depicted in Figure 4.14 and is used in the following discussion.

The situation depicted in Figure 4.14 is different from the situation depicted in Figure 4.13 in the following sense. In Figure 4.13, a data signal stored in $R_i$ during the k-th clock cycle arrives too late to be stored in $R_f$ during the (k + 1)-st clock cycle. In Figure 4.14, however, the data stored in $R_i$ during the k-th clock cycle arrives at $R_f$ too early and overwrites the data that had to be stored in $R_f$ during the same k-th clock cycle. To clarify this concept, certain portions of the data signals are shaded for easy identification in Figure 4.14. The data $D_i$ being stored in $R_i$ at the beginning of the k-th clock cycle is shaded. This data signal propagates to the output of the register $R_i$ and is illustrated by the shaded portion of $Q_i$ shown in Figure 4.14. The output of the logic (left unshaded in Figure 4.14) is being stored within the register $R_f$ at the beginning of the (k + 1)-st clock cycle. Finally, the shaded portion of $D_f$ corresponds to the data signal that is to be stored in $R_f$ at the beginning of the k-th clock cycle.

[Figure 4.14: waveforms of $C_i$, $D_i$, $Q_i$, $D_f$ and $C_f$ around the k-th clock period, annotated with $\Delta_L$, $D_{CQ}^{Fi}$, $D_{Pm}^{i,f}$ and $\delta_H^{Ff}$.]

Fig. 4.14. Timing diagram of a local data path with flip-flops with a violation of the hold constraint.

Note that, as illustrated in Figure 4.14, the leading (or latching) edge of $C_i$ that initiates the k-th clock cycle occurs at time $t_{cd}^i + kT_{CP}$. Similarly, the leading (or latching) edge of $C_f$ that initiates the k-th clock cycle occurs at time $t_{cd}^f + kT_{CP}$. Therefore, the earliest arrival time $a_f$ of the data signal $D_f$ at the register $R_f$ must satisfy the following condition:

    $a_f \ge t_{cd}^f + kT_{CP} + \Delta_L^F + \delta_H^{Ff}$.    (4.9)

The term $t_{cd}^f + kT_{CP} + \Delta_L^F$ on the right hand side of (4.9) corresponds to the critical situation of the leading edge of the k-th clock cycle of $C_f$ arriving late by the maximum possible deviation $\Delta_L^F$. Note that the value of $a_f$ in (4.9) has two components:

1. The earliest arrival time $d_i$ that a valid data signal $Q_i$ appears at the output of $R_i$, i.e., the sum $d_i = t_{cd}^i + kT_{CP} - \Delta_L^F + D_{CQm}^{Fi}$ of the earliest arrival time of the leading edge of $C_i$ and the minimum clock-to-Q delay of $R_i$,
2. The minimum propagation delay $D_{Pm}^{i,f}$ of the signals through the combinational logic block $L_{if}$ and interconnect wires along the path $R_i \leadsto R_f$.
Therefore, $a_f$ can be described as

    $a_f = d_i + D_{Pm}^{i,f} = t_{cd}^i + kT_{CP} - \Delta_L^F + D_{CQm}^{Fi} + D_{Pm}^{i,f}$.    (4.10)

By substituting (4.10) into (4.9), the timing condition that guarantees that $D_f$ does not arrive too early at $R_f$ is

    $t_{cd}^i + kT_{CP} - \Delta_L^F + D_{CQm}^{Fi} + D_{Pm}^{i,f} \ge t_{cd}^f + kT_{CP} + \Delta_L^F + \delta_H^{Ff}$.    (4.11)

The inequality (4.11) can be further simplified by regrouping terms and noting that $t_{cd}^i - t_{cd}^f = T_{Skew}(i,f)$ is the clock skew between the registers $R_i$ and $R_f$:

    $T_{Skew}(i,f) - 2\Delta_L^F \ge -\left( D_{CQm}^{Fi} + D_{Pm}^{i,f} \right) + \delta_H^{Ff}$.    (4.12)

Recall that a violation of (4.12) is illustrated in Figure 4.14. The timing relationship described by (4.12) provides certain important facts describing the early arrival of the signal $D_f$ at the data input of the final register $R_f$ of a local data path:

1. Unlike (4.7), the inequality (4.12) does not depend on the clock period $T_{CP}$. Therefore, a violation of (4.12) cannot be corrected by simply increasing the clock period $T_{CP}$. A synchronous digital system with hold violations is non-functional, while a system with setup violations will still operate correctly at a reduced speed (increasing the clock period $T_{CP}$ in order to satisfy (4.7) is equivalent to reducing the frequency of the clock signal).
2. Both in the general case described by (4.12) and in zero-skew systems, hold violations can be avoided by padding delay [74] into the logic. Inserting delays into the logic increases the value of $D_{Pm}^{i,f}$ on the right hand side of the inequality, making the constraint easier to satisfy for given values of $T_{Skew}(i,f)$. A more sophisticated use of delay insertion in eliminating timing violations for non-zero clock skew circuits is presented in Chapter 8.
3. The relationship (4.12) can be satisfied with a sufficiently large value of the clock skew $T_{Skew}(i,f)$. However, both the term $2\Delta_L^F$ and the term $\delta_H^{Ff}$ are harmful in the sense that these terms impose a lower bound on the clock skew $T_{Skew}(i,f)$ between the registers $R_i$ and $R_f$. Although positive skew may be used to relax (4.12), these two terms work against relaxing the values of $T_{Skew}(i,f)$ and $D_{CQm}^{Fi} + D_{Pm}^{i,f}$.

Finally, the relationship (4.12) can be rewritten to stress the lower bound imposed on the clock skew $T_{Skew}(i,f)$:

    $T_{Skew}(i,f) \ge -\left( D_{Pm}^{i,f} + D_{CQm}^{Fi} \right) + \delta_H^{Ff} + 2\Delta_L^F$.    (4.13)
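Taken together, (4.8) and (4.13) bound the clock skew of a flip-flop-based local data path from above and below. The following Python sketch evaluates both bounds, together with the smallest clock period allowed by (4.7) for a chosen skew. The class name, the function names and all numerical values are hypothetical and serve only to illustrate the arithmetic.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class FlipFlopPath:
    """Timing parameters of a local data path R_i -> R_f (hypothetical, in ns)."""
    d_cq_max: float  # D_CQM^{Fi}, maximum clock-to-Q delay of R_i
    d_cq_min: float  # D_CQm^{Fi}, minimum clock-to-Q delay of R_i
    d_p_max: float   # D_PM^{i,f}, maximum logic and interconnect delay
    d_p_min: float   # D_Pm^{i,f}, minimum logic and interconnect delay
    setup: float     # delta_S^{Ff}, setup time of R_f
    hold: float      # delta_H^{Ff}, hold time of R_f
    jitter: float    # Delta_L^F, maximum deviation of the leading edge


def skew_upper_bound(p: FlipFlopPath, t_cp: float) -> float:
    """Setup (long path) bound on T_Skew(i, f), following (4.8)."""
    return t_cp - (p.d_cq_max + p.d_p_max + p.setup) - 2 * p.jitter


def skew_lower_bound(p: FlipFlopPath) -> float:
    """Hold (short path) bound on T_Skew(i, f), following (4.13)."""
    return -(p.d_p_min + p.d_cq_min) + p.hold + 2 * p.jitter


def min_clock_period(p: FlipFlopPath, skew: float) -> float:
    """Smallest T_CP that satisfies (4.7) for a given skew."""
    return skew + 2 * p.jitter + p.d_cq_max + p.d_p_max + p.setup


path = FlipFlopPath(d_cq_max=1.0, d_cq_min=0.5, d_p_max=6.0, d_p_min=2.0,
                    setup=0.5, hold=0.25, jitter=0.125)

print("skew range at T_CP = 8:",
      (skew_lower_bound(path), skew_upper_bound(path, 8.0)))
print("minimum T_CP with zero skew:", min_clock_period(path, 0.0))
print("minimum T_CP at the hold-limited skew:",
      min_clock_period(path, skew_lower_bound(path)))
```

With these assumed values, the most negative skew permitted by the hold bound lets the clock period drop below $D_{CQM}^{Fi} + D_{PM}^{i,f} + \delta_S^{Ff} = 7.5$, which is the situation described in the second result following (4.7).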
4.8 Single-Phase Path with Latches

A local data path consisting of two level-sensitive registers (or latches) and combinational logic between these registers (or latches) is shown in Figure 4.15. Note the initial latch $R_i$ which is the origin of the data signal and the final latch $R_f$ which is the destination of the data signal.

[Figure 4.15: latch $R_i$ (data input $D_i$, clock $C_i$, output $Q_i$) drives the combinational logic $L_{if}$, whose output $D_f$ feeds latch $R_f$ (clock $C_f$, output $Q_f$).]

Fig. 4.15. A single-phase local data path with latches.

The combinational logic block $L_{if}$ between $R_i$ and $R_f$ accepts the input data signals sourced by $R_i$ and other registers and logic gates and transmits the data signals that have been operated on to $R_f$. The period of the clock signal is denoted by $T_{CP}$ and the delays of the clock signals $C_i$ and $C_f$ to the latches $R_i$ and $R_f$ are denoted by $t_{cd}^i$ and $t_{cd}^f$, respectively. The input and output data signals to $R_i$ and $R_f$ are denoted by $D_i$, $Q_i$, $D_f$, and $Q_f$, respectively.

An analysis of the timing properties of the local data path shown in Figure 4.15 is offered in the following sections. The timing relationships to prevent the late arrival of the data signal at the latch $R_f$ are examined in Section 4.8.1. The timing relationships to prevent the early arrival of the data signal at the latch $R_f$ are examined in Section 4.8.2.

The analyses presented in this section are built on the timing relationships among the signals of a latch that are similar to those used in Section 4.7. Specifically, it is guaranteed that every data signal arrives at the data input of a latch no later than $\delta_S^L$ time before the trailing clock edge. Also, this data signal must remain stable at least $\delta_H^L$ time after the trailing edge, i.e., no new data signal should arrive at a latch within $\delta_H^L$ time after the latch has become opaque.

Observe the differences between a latch and a flip-flop [70, 75]. In flip-flops, the setup and hold requirements described in the previous paragraph are relative to the leading, not to the trailing, edge of the clock signal. Similar to flip-flops, the late and early arrival of the data signal at a latch give rise to timing violations known as setup and hold violations, respectively.

4.8.1 Preventing the Late Arrival of the Data Signal

A system of signals similar to the example illustrated in Figure 4.13 is assumed in the following discussion. A data signal $D_i$ is stored in the latch $R_i$
during the k-th clock cycle. The data $Q_i$ stored in $R_i$ propagates through the combinational logic $L_{if}$ and the interconnect along the path $R_i \leadsto R_f$. In the (k + 1)-st clock cycle, the result $D_f$ of the computation in $L_{if}$ is stored within the latch $R_f$. The signal $D_f$ must arrive at least $\delta_S^L$ time before the trailing edge of $C_f$ in the (k + 1)-st clock cycle.

Similar to the discussion presented in Section 4.7.1, the latest arrival time $A_f$ of $D_f$ at the $D$ input of $R_f$ must satisfy

    $A_f \le t_{cd}^f + (k+1)T_{CP} + C_W^L - \Delta_T^L - \delta_S^{Lf}$.    (4.14)

Note the difference between (4.14) and (4.4). In (4.4), the first term on the right hand side is $[t_{cd}^f + (k+1)T_{CP} - \Delta_L^F]$, while in (4.14), the first term on the right hand side has an additional term $C_W^L$. The addition of $C_W^L$ corresponds to the concept that unlike flip-flops, a data signal is stored in the latches, shown in Figure 4.15, at the trailing edge of the clock signal (the $C_W^L$ term). Similar to the case of flip-flops in Section 4.7.1, the term $t_{cd}^f + (k+1)T_{CP} + C_W^L - \Delta_T^L$ on the right hand side of (4.14) corresponds to the critical situation of the trailing edge of the clock signal $C_f$ arriving earlier by the maximum possible deviation $\Delta_T^L$.

Observe that the value of $A_f$ in (4.14) consists of two components:

1. The latest arrival time $D_i$ when a valid data signal $Q_i$ appears at the output of the latch $R_i$,
2. The maximum signal propagation delay through the combinational logic block $L_{if}$ and the interconnect along the path $R_i \leadsto R_f$.

Therefore, $A_f$ can be described as

    $A_f = D_{PM}^{i,f} + D_i$.    (4.15)

However, unlike the situation of flip-flops as discussed in Section 4.7.1, the term $D_i$ on the right hand side of (4.15) is not the sum of the delays through the register $R_i$. The reason is that the value of $D_i$ depends upon whether the signal $D_i$ arrived before or during the transparent state of $R_i$ in the k-th clock cycle. Therefore, the value of $D_i$ in (4.15) is the greater of the following two quantities:

    $D_i = \max\left[ A_i + D_{DQM}^{Li}, \; t_{cd}^i + kT_{CP} + \Delta_L^L + D_{CQM}^{Li} \right]$.    (4.16)

There are two terms on the right hand side of (4.16):

1. The term $A_i + D_{DQM}^{Li}$ corresponds to the situation in which $D_i$ arrives at $R_i$ after the leading edge of the k-th clock period,
2. The term $t_{cd}^i + kT_{CP} + \Delta_L^L + D_{CQM}^{Li}$ corresponds to the situation in which $D_i$ arrives at $R_i$ before the arrival of the leading edge of the k-th clock pulse.
By substituting (4.16) into (4.15), the latest time of arrival $A_f$ is

    $A_f = D_{PM}^{i,f} + \max\left[ A_i + D_{DQM}^{Li}, \; t_{cd}^i + kT_{CP} + \Delta_L^L + D_{CQM}^{Li} \right]$,    (4.17)

which is in turn substituted into (4.14) to obtain

    $D_{PM}^{i,f} + \max\left[ A_i + D_{DQM}^{Li}, \; t_{cd}^i + kT_{CP} + \Delta_L^L + D_{CQM}^{Li} \right] \le t_{cd}^f + (k+1)T_{CP} + C_W^L - \Delta_T^L - \delta_S^{Lf}$.    (4.18)

Equation (4.18) is an expression of the inequality that must be satisfied in order to prevent the late arrival of a data signal at the data input $D$ of the latch $R_f$. By satisfying (4.18), any setup violation in a local data path with latches as shown in Figure 4.15 is avoided. For a circuit to operate correctly, (4.18) must be enforced for every local data path $R_i \leadsto R_f$ consisting of the latches $R_i$ and $R_f$.

The max operator in (4.18) creates a mathematically difficult situation since it is unknown which of the quantities under the max operation is greater. To overcome this obstacle, this max operation may be split into two conditions, both of which must hold:

    $D_{PM}^{i,f} + A_i + D_{DQM}^{Li} \le t_{cd}^f + (k+1)T_{CP} + C_W^L - \Delta_T^L - \delta_S^{Lf}$,    (4.19)

    $D_{PM}^{i,f} + t_{cd}^i + kT_{CP} + \Delta_L^L + D_{CQM}^{Li} \le t_{cd}^f + (k+1)T_{CP} + C_W^L - \Delta_T^L - \delta_S^{Lf}$.    (4.20)

Taking into account that the clock skew $T_{Skew}(i,f) = t_{cd}^i - t_{cd}^f$, (4.19) and (4.20) can be rewritten, respectively, as

    $D_{PM}^{i,f} + A_i + D_{DQM}^{Li} \le t_{cd}^f + (k+1)T_{CP} + C_W^L - \Delta_T^L - \delta_S^{Lf}$,    (4.21)

    $T_{Skew}(i,f) + \Delta_L^L + \Delta_T^L \le T_{CP} + C_W^L - \left( D_{CQM}^{Li} + D_{PM}^{i,f} + \delta_S^{Lf} \right)$.    (4.22)

Similar to Sections 4.7.1 and 4.7.2, (4.22) can be rewritten to emphasize the upper bound on the clock skew $T_{Skew}(i,f)$ imposed by (4.22):

    $D_{PM}^{i,f} + A_i + D_{DQM}^{Li} \le t_{cd}^f + (k+1)T_{CP} + C_W^L - \Delta_T^L - \delta_S^{Lf}$,    (4.23)

    $T_{Skew}(i,f) \le T_{CP} + C_W^L - \Delta_L^L - \Delta_T^L - \left( D_{CQM}^{Li} + D_{PM}^{i,f} + \delta_S^{Lf} \right)$.    (4.24)

4.8.2 Preventing the Early Arrival of the Data Signal

A system of signals similar to the example illustrated in Figure 4.14 is assumed in the discussion presented in this section. Recall the difference between the late arrival of a data signal at $R_f$ and the early arrival of a data signal at $R_f$ (see Section 4.7.2). In the former case, the data signal stored in the latch $R_i$ during the k-th clock cycle arrives too late to be stored in the latch $R_f$ during
the (k + 1)-st clock cycle. In the latter case, the data signal stored in the latch $R_i$ during the k-th clock cycle propagates to the latch $R_f$ too early and overwrites the data signal that is already stored in the latch $R_f$ during the same k-th clock cycle.

In order for the proper data signal to be successfully latched within $R_f$ during the k-th clock cycle, there should not be any changes in the signal $D_f$ until at least the hold time after the arrival of the storing (trailing) edge of the clock signal $C_f$. Therefore, the earliest arrival time $a_f$ of the data signal $D_f$ at the register $R_f$ must satisfy the following condition,

    $a_f \ge t_{cd}^f + kT_{CP} + C_W^L + \Delta_T^L + \delta_H^{Lf}$.    (4.25)

The term $t_{cd}^f + kT_{CP} + C_W^L + \Delta_T^L$ on the right hand side of (4.25) corresponds to the critical situation of the trailing edge of the k-th clock cycle of the clock signal $C_f$ arriving late by the maximum possible deviation $\Delta_T^L$. Note that the value of $a_f$ in (4.25) consists of two components:

1. The earliest arrival time $d_i$ that a valid data signal $Q_i$ appears at the output of the latch $R_i$, i.e., the sum $d_i = t_{cd}^i + kT_{CP} - \Delta_L^L + D_{CQm}^{Li}$ of the earliest arrival time of the leading edge of the clock signal $C_i$ and the minimum clock-to-Q delay $D_{CQm}^{Li}$ of $R_i$,
2. The minimum propagation delay $D_{Pm}^{i,f}$ of the signal through the combinational logic $L_{if}$ and the interconnect along the path $R_i \leadsto R_f$.

Therefore, $a_f$ can be described as

    $a_f = d_i + D_{Pm}^{i,f} = t_{cd}^i + kT_{CP} - \Delta_L^L + D_{CQm}^{Li} + D_{Pm}^{i,f}$.    (4.26)

By substituting (4.26) into (4.25), the timing condition guaranteeing that $D_f$ does not arrive too early at the latch $R_f$ is

    $t_{cd}^i + kT_{CP} - \Delta_L^L + D_{CQm}^{Li} + D_{Pm}^{i,f} \ge t_{cd}^f + kT_{CP} + C_W^L + \Delta_T^L + \delta_H^{Lf}$.    (4.27)

The inequality (4.27) can be further simplified by reorganizing the terms and noting that $t_{cd}^i - t_{cd}^f = T_{Skew}(i,f)$ is the clock skew between the registers $R_i$ and $R_f$:

    $T_{Skew}(i,f) - \left( \Delta_L^L + \Delta_T^L \right) \ge C_W^L - \left( D_{CQm}^{Li} + D_{Pm}^{i,f} \right) + \delta_H^{Lf}$.    (4.28)

The timing relationship described by (4.28) represents three important results describing the early arrival of the signal $D_f$ at the data input of the final latch $R_f$ of a local data path:

1. The relationship (4.28) does not depend on the value of the clock period $T_{CP}$. Therefore, if a hold time violation in a synchronous system has occurred (that is, the inequality (4.28) is not satisfied), this timing violation cannot be fixed through clock period manipulation.
2. Similar to flip-flop-based paths, the hold violation can be avoided by padding delay [74] into the logic. Inserting delays into the logic increases the value of $D_{Pm}^{i,f}$ on the right hand side of the inequality, making the constraint easier to satisfy for given values of $T_{Skew}(i,f)$.
3. The relationship (4.28) can be satisfied with a sufficiently large value of the clock skew $T_{Skew}(i,f)$. Furthermore, both the term $\Delta_L^L + \Delta_T^L$ and the term $\delta_H^{Lf}$ are harmful in the sense that these terms impose a lower bound on the clock skew $T_{Skew}(i,f)$ between the latches $R_i$ and $R_f$. Although positive skew ($T_{Skew}(i,f) > 0$) can be used to relax (4.28), these two terms make it difficult to satisfy the inequality (4.28) for specific values of $T_{Skew}(i,f)$ and $D_{CQm}^{Li} + D_{Pm}^{i,f}$.

Finally, the relationship (4.28) can be rewritten to emphasize the lower bound on the clock skew $T_{Skew}(i,f)$:

    $T_{Skew}(i,f) \ge C_W^L + \Delta_L^L + \Delta_T^L - \left( D_{CQm}^{Li} + D_{Pm}^{i,f} \right) + \delta_H^{Lf}$.    (4.29)
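The latch constraints of Section 4.8 can be checked numerically with a short Python sketch. It evaluates the upper bound (4.24), a lower bound obtained by moving all terms other than $t_{cd}^i - t_{cd}^f$ to the right hand side of (4.27), and the data-dependent condition (4.18) including its max operation. The class name, the function names and every numerical value are hypothetical; the arrival time $A_i$ in particular is simply supplied as an input.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class LatchPath:
    """Timing parameters of a latch-based path R_i -> R_f (hypothetical, in ns)."""
    t_cp: float      # T_CP, clock period
    c_w: float       # C_W^L, width of the clock pulse
    delta_l: float   # Delta_L^L, leading edge deviation
    delta_t: float   # Delta_T^L, trailing edge deviation
    d_cq_max: float  # D_CQM^{Li}
    d_cq_min: float  # D_CQm^{Li}
    d_dq_max: float  # D_DQM^{Li}
    d_p_max: float   # D_PM^{i,f}
    d_p_min: float   # D_Pm^{i,f}
    setup: float     # delta_S^{Lf}
    hold: float      # delta_H^{Lf}


def skew_upper_bound(p: LatchPath) -> float:
    """Upper bound on T_Skew(i, f), following (4.24)."""
    return (p.t_cp + p.c_w - p.delta_l - p.delta_t
            - (p.d_cq_max + p.d_p_max + p.setup))


def skew_lower_bound(p: LatchPath) -> float:
    """Lower bound on T_Skew(i, f), obtained by rearranging (4.27)."""
    return (p.c_w + p.delta_l + p.delta_t + p.hold
            - (p.d_cq_min + p.d_p_min))


def setup_ok(p: LatchPath, t_cd_i: float, t_cd_f: float,
             a_i: float, k: int = 0) -> bool:
    """Evaluate (4.18) directly, with the max of (4.16) for the departure time."""
    departure = max(a_i + p.d_dq_max,
                    t_cd_i + k * p.t_cp + p.delta_l + p.d_cq_max)
    arrival = p.d_p_max + departure
    deadline = t_cd_f + (k + 1) * p.t_cp + p.c_w - p.delta_t - p.setup
    return arrival <= deadline


path = LatchPath(t_cp=8.0, c_w=4.0, delta_l=0.125, delta_t=0.125,
                 d_cq_max=1.0, d_cq_min=0.5, d_dq_max=0.75,
                 d_p_max=6.0, d_p_min=2.0, setup=0.5, hold=0.25)

print("skew range:", (skew_lower_bound(path), skew_upper_bound(path)))
# Clock delays giving a skew of 3.0, with two hypothetical arrival times A_i.
print("A_i = 3.5:", setup_ok(path, t_cd_i=3.0, t_cd_f=0.0, a_i=3.5))
print("A_i = 5.0:", setup_ok(path, t_cd_i=3.0, t_cd_f=0.0, a_i=5.0))
```

In this assumed example the skew of 3.0 lies inside the permissible range, yet the second call fails because the data arrive at $R_i$ late in its transparent phase, which is the case captured by the $A_i + D_{DQM}^{Li}$ term of (4.16).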
4.9 Multi-Phase Path with Latches

Multi-phase clock synchronization is often used for level-sensitive synchronous circuits. A multi-phase local data path consisting of two latches and combinational logic between these latches is shown in Figure 4.16.

[Figure 4.16: latch $R_i$ (data input $D_i$, clock $C_i^{p_i}$, output $Q_i$) drives the combinational logic $L_{if}$, whose output $D_f$ feeds latch $R_f$ (clock $C_f^{p_f}$, output $Q_f$).]

Fig. 4.16. A multi-phase local data path with latches.

Similar to the single-phase counterpart in Figure 4.15, the initial latch $R_i$ is the origin of the data signal and the final latch $R_f$ is the destination of the data signal. The combinational logic block $L_{if}$ between $R_i$ and $R_f$ accepts the input data signals sourced by $R_i$ and other registers and logic gates and transmits the data signals that have been operated on to $R_f$. The period of the multi-phase clock signals is denoted by $T_{CP}$ and the latches $R_i$ and $R_f$ of a local data path shown in Figure 4.16 are synchronized by the clock signals $C_i^{p_i}$ and $C_f^{p_f}$, respectively. As defined in Section 4.6.2, the superscripts $p_i$ and $p_f$ describe the clock phases that synchronize $R_i$ and $R_f$, respectively. The subscripts $i$ and $f$
denote the clock signals of phase $C^{p_i}$ at $R_i$ and phase $C^{p_f}$ at $R_f$, respectively. The delays of the clock signals $C_i^{p_i}$ and $C_f^{p_f}$ to the latches $R_i$ and $R_f$ are denoted by $t_i^{p_i}$ and $t_f^{p_f}$, respectively. The input and output data signals to $R_i$ and $R_f$ are denoted by $D_i$, $Q_i$, $D_f$, and $Q_f$, respectively.

An analysis of the timing properties of the local data path shown in Figure 4.16 is offered in the following sections. The timing relationships to prevent the late arrival of the data signal at the latch $R_f$ are examined in Section 4.9.1. The timing relationships to prevent the early arrival of the data signal at the latch $R_f$ are examined in Section 4.9.2.

The analyses presented in this section are built on the timing relationships among the signals of a latch similar to those used in Sections 4.7 and 4.8. Specifically, it is guaranteed that every data signal arrives at the data input of a latch no later than $\delta_S^L$ time before the trailing clock edge. Also, this data signal must remain stable at least $\delta_H^L$ time after the trailing edge, i.e., no new data signal should arrive at a latch within $\delta_H^L$ time after the latch has become opaque.

4.9.1 Preventing the Late Arrival of the Data Signal

Analogous to the single-phase discussion, a system of signals similar to the example illustrated in Figure 4.10 is assumed in the following discussion. A data signal $D_i$ is stored in the latch $R_i$ during the k-th clock cycle. The data $Q_i$ stored in $R_i$ propagates through the combinational logic $L_{if}$ and the interconnect along the path $R_i \leadsto R_f$. During the (k + 1)-st clock cycle, the result $D_f$ of the computation in $L_{if}$ is stored within the latch $R_f$. The signal $D_f$ must arrive at least $\delta_S^L$ time before the trailing edge of $C_f$ in the (k + 1)-st clock cycle.

Similar to the discussions presented in Sections 4.7.1 and 4.8.1, the latest arrival time $A_f$ of $D_f$ at the $D$ input of $R_f$ must satisfy

    $A_f \le \phi^{p_f} + t_f^{p_f} + (k+1)T_{CP} + C_W^L - \Delta_T^L - \delta_S^{Lf}$.    (4.30)

Note the difference between (4.30) and (4.14). In (4.30), the right hand side has an additional term $\phi^{p_f}$ to account for the clock phase information.

Observe that the value of $A_f$ in (4.30) consists of two components:

1. The latest arrival time $D_i$ when a valid data signal $Q_i$ appears at the output of the latch $R_i$,
2. The maximum signal propagation delay through the combinational logic block $L_{if}$ and the interconnect along the path $R_i \leadsto R_f$.

Therefore, $A_f$ can be described as

    $A_f = D_{PM}^{i,f} + D_i$.    (4.31)
Similar to Section 4.8.1, the value of $D_i$ in (4.31) is the greater of the following two quantities:

\[ D_i = \max\left( A_i + D_{DQM}^{L_i},\; \phi^{p_i} + t_i^{p_i} + kT_{CP} + \Delta_L^L + D_{CQM}^{L_i} \right). \tag{4.32} \]

There are two terms on the right-hand side of (4.32):

1. The term $A_i + D_{DQM}^{L_i}$ corresponds to the situation in which $D_i$ arrives at $R_i$ after the leading edge of the $k$-th clock cycle,
2. The term $\phi^{p_i} + t_i^{p_i} + kT_{CP} + \Delta_L^L + D_{CQM}^{L_i}$ corresponds to the situation in which $D_i$ arrives at $R_i$ before the arrival of the leading edge of the $k$-th clock pulse.

By substituting (4.32) into (4.31), the latest time of arrival $A_f$ is

\[ A_f = D_{PM}^{i,f} + \max\left( A_i + D_{DQM}^{L_i},\; \phi^{p_i} + t_i^{p_i} + kT_{CP} + \Delta_L^L + D_{CQM}^{L_i} \right), \tag{4.33} \]

which is in turn substituted into (4.30) to obtain

\[ D_{PM}^{i,f} + \max\left( A_i + D_{DQM}^{L_i},\; \phi^{p_i} + t_i^{p_i} + kT_{CP} + \Delta_L^L + D_{CQM}^{L_i} \right) \le \phi^{p_f} + t_f^{p_f} + (k+1)T_{CP} + C_W^L - \Delta_T^L - \delta_S^{L_f}. \tag{4.34} \]

Equation (4.34) is an expression of the inequality that must be satisfied in order to prevent the late arrival of a data signal at the data input D of the latch $R_f$. By satisfying (4.34), any setup violation in a local data path with latches as shown in Figure 4.16 is avoided. For a circuit to operate correctly, (4.34) must be enforced for every local data path $R_i \leadsto R_f$ consisting of the latches $R_i$ and $R_f$.

Similar to single-phase operation, the max operator in (4.34) may be split into two conditions:

\[ D_{PM}^{i,f} + A_i + D_{DQM}^{L_i} \le \phi^{p_f} + t_f^{p_f} + (k+1)T_{CP} + C_W^L - \Delta_T^L - \delta_S^{L_f}, \tag{4.35} \]

\[ D_{PM}^{i,f} + \phi^{p_i} + t_i^{p_i} + kT_{CP} + \Delta_L^L + D_{CQM}^{L_i} \le \phi^{p_f} + t_f^{p_f} + (k+1)T_{CP} + C_W^L - \Delta_T^L - \delta_S^{L_f}. \tag{4.36} \]

Taking into account that the multi-phase clock skew is $T_{Skew}^{p_i p_f}(i,f) = t_i^{p_i} - t_f^{p_f}$, (4.35) and (4.36) can be rewritten, respectively, as

\[ D_{PM}^{i,f} + A_i + D_{DQM}^{L_i} \le \phi^{p_f} + t_f^{p_f} + (k+1)T_{CP} + C_W^L - \Delta_T^L - \delta_S^{L_f}, \tag{4.37} \]

\[ \phi^{p_i p_f} + T_{Skew}^{p_i p_f}(i,f) + \Delta_L^L + \Delta_T^L \le T_{CP} + C_W^L - \left( D_{CQM}^{L_i} + D_{PM}^{i,f} + \delta_S^{L_f} \right). \tag{4.38} \]
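The step from (4.36) to (4.38) is compact, so the rearrangement is spelled out below as a sketch. It assumes that $\phi^{p_i p_f}$ stands for the phase difference $\phi^{p_i} - \phi^{p_f}$; this is the reading under which (4.38) agrees term by term with (4.36), and it is an assumption here because the definition is not restated in this section.

```latex
\begin{align*}
D_{PM}^{i,f} + \phi^{p_i} + t_i^{p_i} + kT_{CP} + \Delta_L^L + D_{CQM}^{L_i}
  &\le \phi^{p_f} + t_f^{p_f} + (k+1)T_{CP} + C_W^L - \Delta_T^L - \delta_S^{L_f} \\
\left(\phi^{p_i} - \phi^{p_f}\right) + \left(t_i^{p_i} - t_f^{p_f}\right) + \Delta_L^L + \Delta_T^L
  &\le T_{CP} + C_W^L - \left( D_{CQM}^{L_i} + D_{PM}^{i,f} + \delta_S^{L_f} \right) \\
\phi^{p_i p_f} + T_{Skew}^{p_i p_f}(i,f) + \Delta_L^L + \Delta_T^L
  &\le T_{CP} + C_W^L - \left( D_{CQM}^{L_i} + D_{PM}^{i,f} + \delta_S^{L_f} \right)
\end{align*}
```

The term $kT_{CP}$ cancels against $(k+1)T_{CP}$, which is why a single $T_{CP}$ remains and the result no longer depends on the cycle index $k$.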
Similar to Sections 4.8.1 and 4.8.2, (4.38) can be rewritten to emphasize the upper bound on the clock skew $T_{Skew}^{p_i p_f}(i,f)$:

\[ D_{PM}^{i,f} + A_i + D_{DQM}^{L_i} \le \phi^{p_f} + t_f^{p_f} + (k+1)T_{CP} + C_W^L - \Delta_T^L - \delta_S^{L_f}, \tag{4.39} \]

\[ T_{Skew}^{p_i p_f}(i,f) \le -\phi^{p_i p_f} + T_{CP} + C_W^L - \Delta_L^L - \Delta_T^L - \left( D_{CQM}^{L_i} + D_{PM}^{i,f} + \delta_S^{L_f} \right). \tag{4.40} \]

4.9.2 Preventing the Early Arrival of the Data Signal

In order for the proper data signal to be successfully latched within $R_f$ during the $k$-th clock cycle, there should not be any changes in the signal $D_f$ until at least the hold time after the arrival of the storing (trailing) edge of the clock signal $C_f^{p_f}$. Therefore, the earliest arrival time $a_f$ of the data signal $D_f$ at the register $R_f$ must satisfy the following condition,

\[ a_f \ge \phi^{p_f} + t_f^{p_f} + kT_{CP} + C_W^L + \Delta_T^L + \delta_H^{L_f}. \tag{4.41} \]

The term $\phi^{p_f} + t_f^{p_f} + kT_{CP} + C_W^L + \Delta_T^L$ on the right-hand side of (4.41) corresponds to the critical situation of the trailing edge of the $k$-th clock cycle of the clock signal $C_f^{p_f}$ arriving late by the maximum possible deviation $\Delta_T^L$. Note that the value of $a_f$ in (4.41) consists of two components:

1. The earliest arrival time $d_i$ that a valid data signal $Q_i$ appears at the output of the latch $R_i$, i.e., the sum $d_i = \phi^{p_i} + t_i^{p_i} + kT_{CP} - \Delta_L^L + D_{CQm}^{L_i}$ of the earliest arrival time of the leading edge of the clock signal $C_i^{p_i}$ and the minimum clock-to-Q delay $D_{CQm}^{L_i}$ of $R_i$,
2. The minimum propagation delay $D_{Pm}^{i,f}$ of the signal through the combinational logic $L_{if}$ and the interconnect along the path $R_i \leadsto R_f$.

Therefore, $a_f$ can be described as

\[ a_f = d_i + D_{Pm}^{i,f} = \phi^{p_i} + t_i^{p_i} + kT_{CP} - \Delta_L^L + D_{CQm}^{L_i} + D_{Pm}^{i,f}. \tag{4.42} \]

By substituting (4.42) into (4.41), the timing condition guaranteeing that $D_f$ does not arrive too early at the latch $R_f$ is

\[ \phi^{p_i} + t_i^{p_i} + kT_{CP} - \Delta_L^L + D_{CQm}^{L_i} + D_{Pm}^{i,f} \ge \phi^{p_f} + t_f^{p_f} + kT_{CP} + C_W^L + \Delta_T^L + \delta_H^{L_f}. \tag{4.43} \]

The inequality (4.43) can be further simplified by reorganizing the terms and noting that $t_i^{p_i} - t_f^{p_f} = T_{Skew}^{p_i p_f}(i,f)$ is the multi-phase clock skew between the registers $R_i$ and $R_f$:
\[ \phi^{p_i p_f} + T_{Skew}^{p_i p_f}(i,f) - \left( \Delta_L^L + \Delta_T^L \right) \ge -\left( D_{CQm}^{L_i} + D_{Pm}^{i,f} \right) + \delta_H^{L_f}. \tag{4.44} \]

The timing relationship described by (4.44) represents three important results describing the early arrival of the signal $D_f$ at the data input of the final latch $R_f$ of a local data path:

1. The relationship (4.44) does not depend on the value of the clock period $T_{CP}$. Therefore, if a hold time violation in a synchronous system has occurred,⁷ this timing violation cannot be fixed by manipulating the clock period.
2. Similar to a flip-flop-based path, the hold violation can be avoided through delay padding [74] into the logic. Inserting delays into the logic increases the value of $D_{Pm}^{i,f}$, making it easier to satisfy the constraint for given values of $T_{Skew}(i,f)$.
3. The relationship (4.44) can be satisfied with a sufficiently large value of the clock skew $T_{Skew}^{p_i p_f}(i,f)$. Furthermore, both the term $\Delta_L^L + \Delta_T^L$ and the term $\delta_H^{L_f}$ are harmful in the sense that these terms impose a lower bound on the clock skew $T_{Skew}^{p_i p_f}(i,f)$ between the latches $R_i$ and $R_f$. Although positive skew can be used to relax (4.44), these two terms make it difficult to satisfy the inequality (4.44) for specific values of $T_{Skew}^{p_i p_f}(i,f)$ and $D_{CQm}^{L_i} + D_{Pm}^{i,f}$.

Finally, the relationship (4.44) can be rewritten to emphasize the lower bound on the clock skew $T_{Skew}^{p_i p_f}(i,f)$:

\[ T_{Skew}^{p_i p_f}(i,f) \ge -\phi^{p_i p_f} + \Delta_L^L + \Delta_T^L - \left( D_{CQm}^{L_i} + D_{Pm}^{i,f} \right) + \delta_H^{L_f}. \tag{4.45} \]
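The two bounds (4.40) and (4.45) together delimit the skew values that a multi-phase latch path tolerates. The following is a minimal sketch of that calculation, not a tool from the text: the class name `LatchPath`, the function `skew_bounds` and all field names are hypothetical, and the numeric values are made up for illustration.

```python
from dataclasses import dataclass
from typing import Tuple


@dataclass
class LatchPath:
    """Timing parameters of one local data path (hypothetical container)."""
    phase_shift: float  # phi^{p_i p_f}
    d_cq_max: float     # D_CQM^{L_i}, maximum clock-to-Q delay of R_i
    d_cq_min: float     # D_CQm^{L_i}, minimum clock-to-Q delay of R_i
    d_p_max: float      # D_PM^{i,f}, maximum logic and interconnect delay
    d_p_min: float      # D_Pm^{i,f}, minimum logic and interconnect delay
    setup: float        # delta_S^{L_f}
    hold: float         # delta_H^{L_f}
    dev_lead: float     # Delta_L^L, leading edge deviation
    dev_trail: float    # Delta_T^L, trailing edge deviation


def skew_bounds(p: LatchPath, t_cp: float, c_w: float) -> Tuple[float, float]:
    """Return (lower, upper) bounds on the skew per (4.45) and (4.40)."""
    dev = p.dev_lead + p.dev_trail
    lower = -p.phase_shift + dev - (p.d_cq_min + p.d_p_min) + p.hold
    upper = (-p.phase_shift + t_cp + c_w - dev
             - (p.d_cq_max + p.d_p_max + p.setup))
    return lower, upper


# Illustrative values only (arbitrary time units).
path = LatchPath(phase_shift=0.0, d_cq_max=1.0, d_cq_min=0.5,
                 d_p_max=6.0, d_p_min=2.0, setup=0.5, hold=0.25,
                 dev_lead=0.125, dev_trail=0.125)
lower, upper = skew_bounds(path, t_cp=10.0, c_w=5.0)
print(lower, upper)
```

Only the upper bound moves when `t_cp` changes, which mirrors the observation that a hold violation cannot be repaired by adjusting the clock period.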
⁷ As described by the inequality (4.44) not being satisfied.

4.10 A Final Note

The properties of registers and local data paths are described in this chapter. Specifically, the timing relationships to prevent setup and hold timing violations in a local data path consisting of two positive edge-triggered flip-flops are analyzed in Sections 4.7.1 and 4.7.2, respectively. The timing relationships to prevent setup and hold timing violations in a local data path consisting of two positive-polarity latches have also been analyzed in Sections 4.8.1 and 4.8.2, respectively. Timing relationships to prevent setup and hold timing violations in a local data path consisting of two positive-polarity latches, synchronized by a multi-phase clocking scheme, have been analyzed in Sections 4.9.1 and 4.9.2, respectively.

In a fully synchronous digital VLSI system, however, it is possible to encounter certain local data paths different from those circuits analyzed in this chapter. For example, a local data path may begin with a positive-polarity, edge-sensitive register $R_i$ and end with a negative-polarity, edge-sensitive register $R_f$. It is also possible that different types of registers are used, e.g., a register with more than one data input. In each individual case, the analyses provided in this chapter illustrate a general methodology for determining the proper timing relationships specific to that system.

Furthermore, note that for a given system, the timing relationships that must be satisfied for a system to operate correctly, such as (4.8), (4.13), (4.23), (4.24), (4.29), (4.39), (4.40) and (4.45), are collectively referred to as the overall timing constraints of the synchronous digital system [9].
5 Clock Skew Scheduling and Clock Tree Synthesis
The basic principles of operation of a synchronous digital VLSI system are described in Chapter 2. As demonstrated in Chapter 3, the propagation of signals through logic gates and interconnections requires a certain amount of time to complete. Therefore, a timing discipline is necessary to ensure that logical computations, whether executing concurrently or in sequence, operate on the proper data signals. As described in Chapter 4, this timing discipline is implemented by inserting storage elements, or registers, throughout the circuit. Also analyzed in Chapter 4 are the timing relationships among signals in local data paths based on the type of clock signal and storage element.

Recall from Chapter 4 the relationships that must be satisfied in order for a local data path to operate properly [inequalities (4.8), (4.13), (4.23), (4.24), (4.29), (4.39), (4.40) and (4.45)]. These relationships are written in the form of bounds on the clock skew $T_{Skew}$ in order to emphasize that bounds are imposed on $T_{Skew}$ by various parameters of the data paths and the clock signal. If any of the inequalities (4.8), (4.13), (4.23), (4.24), (4.29), (4.39), (4.40) and (4.45) is not satisfied, a timing violation occurs.

A methodology and software system for determining (or scheduling) the values of the clock skew $T_{Skew}$ based on the timing constraints of a fully synchronous digital VLSI system, and for synthesizing the clock distribution network so as to implement these target clock skew values, is described in this chapter. The relation of synchronization to the design of the clock distribution network is presented in Section 5.1. Some useful definitions and notations are introduced in Section 5.2. The clock skew scheduling problem for the more common register type, the edge-triggered flip-flop, is described in Section 5.3. Various formulations of the timing problem with the presented timing constraints are briefly reviewed in Section 5.4. The structure of the clock distribution network is examined from the perspective of clock skew scheduling in Section 5.5. The proposed algorithms are described in Section 5.6. Finally, the software programs developed to implement the algorithms and the demonstration of these programs on benchmark and industrial circuits are described in Section 5.7.

I.S. Kourtev et al., Timing Optimization Through Clock Skew Scheduling, DOI: 10.1007/978-0-387-71056-3_5, © Springer Science+Business Media LLC 2009
5.1 Background

As described in Chapter 2, most high performance digital integrated circuits implement data processing algorithms based on the iterative execution of basic operations. Typically, these algorithms are highly parallelized and pipelined by inserting clocked registers at specific locations throughout the circuit. The synchronization strategy for these clocked registers in the vast majority of VLSI/ULSI-based digital systems is a fully synchronous approach. It is not uncommon for the computational process in these systems to be spread over hundreds of thousands of functional logic elements and tens of thousands of registers. For such synchronous digital systems to function properly, the many thousands of switching events require a strict temporal ordering. This strict ordering is enforced by a global synchronization signal known as the clock signal. For a fully synchronous system to operate correctly, the clock signal must be delivered to every register at a precise relative time. The delivery function is accomplished by a circuit and interconnect structure commonly known as a clock distribution network [9, 32].

As described in Chapter 3, multiple factors affect the propagation delay of the data signals through the combinational logic gates and interconnect. Since the clock distribution network is composed of logic gates and interconnection wires, the signals in the clock distribution network are delayed. Moreover, the dependence of the correct operation of a system on the signal delay in the clock distribution network is far greater than on the delay of the logic gates. Recall that by delivering the clock signal to registers at precise times, the clock distribution network essentially quantizes the operational time of a synchronous system into clock periods, thereby permitting the simultaneous execution of operations.
The nature of the on-chip clock signal has become a primary factor limiting circuit performance, causing the clock distribution network to become a performance bottleneck in high speed VLSI systems. As described in Chapter 3, the primary source of the load for the clock signals has shifted from the logic gates to the interconnect, thereby changing the physical nature of the load from a lumped capacitance (C) to a distributed resistive-capacitive (RC) load [8, 76]. These interconnect impedances degrade the on-chip signal waveform shapes and increase the path delay. Furthermore, statistical variations of the parameters characterizing the circuit elements along the clock and data signal paths, caused by the imperfect control of the manufacturing process and the environment, introduce ambiguity into the signal timing that cannot be neglected. All of these changes have a profound impact on both the choice of synchronous design methodology and on the overall circuit performance. Among the most important consequences are increased power dissipated by the clock distribution network as well as increasingly challenging timing constraints that must be satisfied in order to avoid timing violations [9, 77, 78, 79, 80]. Therefore, the majority of the approaches used to design a clock distribution
network focus on simplifying the performance goals by targeting minimal or zero global clock skew [81, 82, 83], which can be achieved by different routing strategies [84, 85, 86, 87], buffered clock tree synthesis, symmetric n-ary trees [77] (most notably H-trees) or a distributed series of buffers connected as a mesh [9, 32, 80].
5.2 Definitions and Graphical Model

A synchronous digital system is a network of combinational logic and storage registers whose input and output terminals are interconnected by wires. An example of a synchronous system is shown in Figure 5.1. The sets of registers and logic gates of this specific system are outlined in Figure 5.1. The system consists of four registers, $R_1$ through $R_4$, and four logic gates, $G_1$ through $G_4$. For clarity, the clock distribution network and clock signals to the registers are not shown in Figure 5.1 and the details of the registers and logic gates are also omitted.

[Figure 5.1: schematic of the example circuit, outlining the set of registers $R = \{R_1, R_2, R_3, R_4\}$ and the set of logic gates $G = \{G_1, G_2, G_3, G_4\}$, with data inputs at $R_1$ and $R_2$ and a data output at $R_4$.]

Fig. 5.1. A simple synchronous digital circuit with four registers and four logic gates.
A sequence of connected logic gates (no registers) is called a signal path. For example, in Figure 5.1, one signal path begins at the register $R_1$ and propagates through the logic gates $G_1$ and $G_2$ before reaching the register $R_3$. Other signal paths can also be identified within the system shown in Figure 5.1. Every signal path in a synchronous system is delimited by a pair of registers, one register each for the start and the end of the path. Such a pair of registers is called a sequentially-adjacent pair and is defined next:

Definition 5.1. Sequentially-adjacent pair of registers. For an arbitrary ordered pair of registers $\langle R_i, R_f \rangle$ in a synchronous circuit, one of the following two situations can be observed. Either there exists at least one signal path that connects some output of $R_i$ to some input of $R_f$, or inputs of $R_f$ cannot be reached from outputs of $R_i$ through a signal path.¹ In the former case, denoted by $R_i \leadsto R_f$, the pair of registers $\langle R_i, R_f \rangle$ is called a sequentially-adjacent pair of registers and switching events at the output of $R_i$ can possibly affect the input of $R_f$ during the same clock period.

A sequentially-adjacent pair of registers is also referred to as a local data path [9]. Generalized examples of local data paths with flip-flops and latches are shown in Figures 4.12 and 4.15, respectively. The clock signal $C_i$ driving the initial register $R_i$ of the local data path and the clock signal $C_f$ driving the final register $R_f$ are shown in Figures 4.12 and 4.15, respectively. Returning to Figure 5.1, for example, $\langle R_1, R_3 \rangle$ is a sequentially-adjacent pair of registers connected by a signal path consisting of the combinational logic gates, $G_1$ and $G_3$. In Figure 5.1, however, $\langle R_3, R_1 \rangle$ is not a sequentially-adjacent pair of registers.

5.2.1 Permissible Range of Clock Skew

The timing constraints of a local data path have been derived in Sections 4.7.1 through 4.8.2 for paths consisting of flip-flops and latches. The concept of clock skew used in these timing constraints is formally defined next:

Definition 5.2. Clock skew. In a given digital synchronous circuit, the clock skew $T_{Skew}(i,j)$ between the registers $R_i$ and $R_j$ is defined as the algebraic difference,

\[ T_{Skew}(i,j) = t_{cd}^{i} - t_{cd}^{j}, \tag{5.1} \]

where $C_i$ and $C_j$ are the clock signals driving the registers $R_i$ and $R_j$, respectively, and $t_{cd}^{i}$ and $t_{cd}^{j}$ are the delays of the clock signals $C_i$ and $C_j$, respectively.

In Definition 5.2, the clock delays, $t_{cd}^{i}$ and $t_{cd}^{j}$, are with respect to an arbitrary, but necessarily the same, reference point. A commonly used reference point is the source of the clock distribution network on the integrated circuit. Note that the clock skew $T_{Skew}(i,j)$ as defined in Definition 5.2 obeys the antisymmetric property,

\[ T_{Skew}(i,j) = -T_{Skew}(j,i). \tag{5.2} \]

Recall that the clock skew $T_{Skew}(i,j)$ as defined in Definition 5.2 is a component in the timing constraints of a local data path [see inequalities (4.8), (4.13), (4.23), (4.24), (4.29), (4.39), (4.40) and (4.45)]. Therefore, the clock skew $T_{Skew}(i,j)$ is defined and is of primary practical use for sequentially-adjacent pairs of registers $R_i \leadsto R_j$, that is, for local data paths.²

¹ Propagating through a sequence of logic elements only.
² Note that technically, $T_{Skew}(i,j)$ can be calculated for any ordered pair of registers $\langle R_i, R_j \rangle$. However, the skew between a non-sequential pair of registers has no practical value.
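Definition 5.2 and the antisymmetric property (5.2) can be illustrated with a few lines of code. This is a minimal sketch: the function name `clock_skew`, the register labels and the clock delay values are hypothetical and chosen only for illustration.

```python
def clock_skew(t_cd, i, j):
    """Clock skew per (5.1): delay to register i minus delay to register j."""
    return t_cd[i] - t_cd[j]


# Hypothetical clock delays, all measured from the same reference point
# (for instance the source of the clock distribution network).
t_cd = {"R1": 1.5, "R2": 1.0, "R3": 2.25}

s13 = clock_skew(t_cd, "R1", "R3")
s31 = clock_skew(t_cd, "R3", "R1")
print(s13, s31)  # the two values differ only in sign, as (5.2) states
```

Because every skew is a difference of two delays from the same set, skews cannot be chosen independently of one another once the registers are connected in a network.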
For notational convenience, clock skews within a circuit are frequently denoted throughout this monograph with the lowercase letter $s$ with a single subscript. In such cases, the clock skew $s_k$ corresponds to a uniquely identified local data path $k$ within the circuit, where the local data paths have been numbered 1 through a certain number $p$. In other words, the skew $s_1$ corresponds to the local data path one, the skew $s_2$ corresponds to the local data path two and so on.

Previous research [83, 88] has indicated that tight control over the clock skews rather than the clock delays is necessary for the circuit to operate reliably. Timing relationships similar to (4.8), (4.13), (4.23), (4.24), (4.29), (4.39), (4.40) and (4.45) are used in [88] to determine a permissible range of allowable clock skew for each signal path. The concept of a permissible range for the clock skew $s_k$ of a data path $R_i \leadsto R_f$ is illustrated in Figure 5.2.

[Figure 5.2: a skew axis running from negative skew (left) to positive skew (right). The permissible range lies between $l_k$ and $u_k$ and contains $s_k$; the region below $l_k$ is marked "Race Conditions" and the region above $u_k$ is marked "Clock Period Limitations".]

Fig. 5.2. The permissible range of the clock skew of a local data path. A timing violation exists if $s_k \notin [l_k, u_k]$.
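The three regions of Figure 5.2 translate directly into a membership test. The following is a minimal sketch; the function name `classify_skew` and the returned labels are hypothetical, with the labels mirroring the annotations in the figure.

```python
def classify_skew(s_k, l_k, u_k):
    """Classify a skew value against the permissible range [l_k, u_k]."""
    if l_k > u_k:
        raise ValueError("empty permissible range: l_k exceeds u_k")
    if s_k < l_k:
        return "race condition"           # skew too negative
    if s_k > u_k:
        return "clock period limitation"  # skew too positive
    return "permissible"


# Illustrative bounds only.
for s in (-3.0, 0.0, 8.0):
    print(s, classify_skew(s, l_k=-2.0, u_k=7.25))
```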
Each signal data path has a unique permissible range associated with it.³ The permissible range is a continuous interval of valid skews for a specific path. As suggested by the inequalities (4.8), (4.13), (4.23), (4.24), (4.29), (4.39), (4.40) and (4.45) and illustrated in Figure 5.2, every permissible range is delimited by a lower and upper bound of the clock skew. These bounds, denoted by $l_k$ and $u_k$, respectively, are determined based on the timing parameters of the individual local data paths and the constraints to prevent timing violations discussed in Chapter 4. Note that the bounds $l_k$ and $u_k$ also depend on the operational clock period for the specific circuit. When $s_k \in [l_k, u_k]$, as shown in Figure 5.2, the timing constraints of this specific $k$-th local data path are satisfied. The clock skew $s_k$ is not permitted to be in either the interval $(-\infty, l_k)$ because a race condition will be created or the interval $(u_k, +\infty)$ because the minimum clock period will be limited.

Furthermore, note that the reliability of a circuit is related to the probability of a timing violation occurring for any local data path $R_i \leadsto R_f$. This observation suggests that the reliability of any local data path $R_i \leadsto R_f$ of a circuit (and therefore of the entire circuit) is increased in two ways:

1. by choosing the clock skew $s_k$ for the $k$-th local data path as far as possible from the borders of the interval $[l_k, u_k]$, that is, by (ideally) positioning the clock skew $s_k$ in the middle of the permissible range as $s_k = \frac{1}{2}(l_k + u_k)$,
2. by increasing the width $(u_k - l_k)$ of the permissible range of the local data path $R_i \leadsto R_f$.

Even if the clock signals can be delivered to the registers within a given circuit with arbitrary delays, it is generally not possible to have all clock skews in the middle of the permissible range as suggested above. The reason behind this characteristic is that inherent structural limitations of the circuit create linear dependencies among the clock skews within the circuit. These linear dependencies and the effect of these dependencies on a number of circuit optimization techniques are examined in detail in Chapter 7.

³ Later in Section 5.2.2 it is shown that it is more appropriate to refer to the permissible range of a sequentially-adjacent pair of registers. There may be more than one local data path between the same pair of registers but circuit performance is ultimately determined by the permissible ranges of the clock skew between pairs of registers.

5.2.2 Graphical Model of a Synchronous System

Many different fully synchronous digital systems exist. It is virtually impossible to describe the variety of all past, current or future such systems depending on the circuit manufacturing technology, design style, performance requirements and multiple other factors. A system model of these fully synchronous digital systems is required so that the system properties can be fully understood and analyzed from the perspective of clock skew scheduling and clock tree synthesis while permitting unnecessary details to be abstracted.⁴ In this section, a graphical model used to represent fully synchronous digital systems is introduced. The purpose of this model is twofold. First, the model provides a common abstract framework for the automated analysis of circuits by computers.
Second, it permits a significant reduction of the size of the data that needs to be stored in the computer memory when performing analysis and optimization procedures on a circuit.

This graph-based model can be arrived at in a natural way by observing what constitutes relevant system information (in terms of the clock skew scheduling problem). For example, it is sufficient to know that a pair of registers $\langle R_i, R_j \rangle$ are sequentially-adjacent whereas the specific functional information characterizing the individual logic gates along the signal paths between $R_i$ and $R_j$ is not necessary.

Consider, for instance, the system shown in Figure 5.1. This system is completely described (for the purpose of clock skew scheduling) by the timing information describing the four registers, four logic gates, ten wires (nets) and the connectivity of these wires to the registers and logic gates. Consider next the abstract representation of this system shown in Figure 5.3. Note that the registers, $R_1$ through $R_4$, are represented by the vertices of the graph shown in Figure 5.3. However, the logic gates and wires have been replaced in Figure 5.3 by arrows or arcs, representing the signal paths among the registers. The four logic gates and ten nets in the original system have been reduced to only six local data paths represented by the arcs in Figure 5.3. For clarity, each arc or edge is labeled with the logic gates⁵ along the signal path represented by this specific arc.

[Figure 5.3: directed multi-graph with vertices $R_1$ through $R_4$; each arc is labeled with the logic gates along the corresponding signal path.]

Fig. 5.3. A directed multi-graph representation of the synchronous system shown in Figure 5.1. The graph vertices correspond to the registers, $R_1$, $R_2$, $R_3$ and $R_4$, respectively.

The type of data structure shown in Figure 5.3 is known as a multigraph [89] since there may be more than one edge between a pair of vertices in the graph. In order to simplify data storage and the relevant analysis and optimization procedures, this multi-graph is reduced to a simple graph [89] model by imposing the following restrictions:⁶

• either one or zero edges can exist between any two different vertices of the graph,
• there cannot be self-loops, that is, edges that start and end at the same vertex of the graph,
• additional labels (or markings) of the edges are introduced in order to represent the timing constraints of the circuit.

⁴ As a matter of fact, the graph model described here is quite universal and can be successfully applied for a variety of other different circuit analysis and optimization purposes.
⁵ In the order in which the traveling signals pass through the gates.
⁶ Restrictions on the model itself and not on the ability of the model to represent features of the circuits.

With the above restrictions, a formal definition of the circuit graph model is as follows:

Definition 5.3. Circuit graph. A fully synchronous digital circuit $C$ is represented as the connected undirected simple graph $G_C$. The graph $G_C$ is the ordered six-tuple $G_C = \langle V^{(C)}, E^{(C)}, A^{(C)}, h_l^{(C)}, h_u^{(C)}, h_d^{(C)} \rangle$, where
• $V^{(C)} = \{v_1, \ldots, v_r\}$ is the set of vertices of the graph $G_C$,
• $E^{(C)} = \{e_1, \ldots, e_p\}$ is the set of edges of the graph $G_C$,
• $A^{(C)} = [a_{ij}^{(C)}]_{r \times r}$ is the symmetric adjacency matrix of $G_C$.

Each vertex from $V^{(C)}$ represents a register of the circuit $C$. There is exactly one edge in $E^{(C)}$ for every sequentially-adjacent pair of registers in $C$. The mappings $h_l^{(C)} : E^{(C)} \to \mathbb{R}$ and $h_u^{(C)} : E^{(C)} \to \mathbb{R}$ to the set of real numbers $\mathbb{R}$ assign the lower and upper permissible range bounds, $l_k, u_k \in \mathbb{R}$, respectively, for the sequentially-adjacent pair of registers indicated by the edge $e_k \in E^{(C)}$. The edge labeling $h_d^{(C)}$ defines a direction of signal propagation for each edge $v_x, e_z, v_y$.

Note that in a fully synchronous digital circuit there are no purely combinational signal cycles, that is, it is impossible to reach the input of any logic gate $G_k$ by starting at the output of $G_k$ and going through a sequence of combinational logic gates only [9, 90].

Naturally, all registers from the circuit $C$ are preserved when constructing the circuit graph $G_C$ as described in Definition 5.3. These registers are enumerated 1 through $r$ and a vertex $v_i$ is created in the graph for each register $R_i$. An edge between two vertices is added in the graph if there are one or more local data paths between these two vertices. The self-loops are discarded because the clock skew of these local data paths is always zero and cannot be manipulated in any way.

The graph $G_C$ for any circuit $C$ can be determined by either direct inspection of $C$ or by first building the circuit multi-graph and then modifying the multi-graph to satisfy Definition 5.3. Consider, for example, the circuit multigraph shown in Figure 5.3. The corresponding circuit graph is illustrated in Figure 5.4. Observe the labels of the graph edges in Figure 5.4. Each edge is labeled with the corresponding permissible range of the clock skew for the given pair of registers. An arrow is drawn next to each edge to indicate the order of the registers in this specific sequentially-adjacent pair (recall that the clock skew as defined in Definition 5.2 is an algebraic difference). As shown in the rest of this section, either direction of an edge can be selected as long as the proper choices of lower and upper clock skew bounds are made.

[Figure 5.4: circuit graph with vertices $v_1$ through $v_4$ and edges $e_1$, $e_2$ and $e_3$, labeled with the permissible ranges $[l_1, u_1]$, $[l_2, u_2]$ and $[l_3, u_3]$ and with direction arrows.]

Fig. 5.4. A graph representation of the synchronous system shown in Figure 5.1 according to Definition 5.3. The graph vertices $v_1$, $v_2$, $v_3$, and $v_4$ correspond to the registers, $R_1$, $R_2$, $R_3$ and $R_4$, respectively.

In most practical cases, a unique signal path (a local data path) exists between a given sequentially-adjacent pair of registers $\langle R_i, R_j \rangle$. In these cases, the labeling of the corresponding edge is straightforward. The permissible range bounds $l_k$ and $u_k$ are computed using (4.8), (4.13), (4.23), (4.24), (4.29), (4.39), (4.40) and (4.45) and the direction of the arrow is chosen so as to coincide with the direction of the signal propagation from $R_i$ to $R_j$. With these choices, the clock skew is computed as $s = t_{cd}^{i} - t_{cd}^{j}$. In Figure 5.4, for example, the direction labels of both $e_1$ and $e_2$ can be chosen from $v_1$ to $v_3$ and from $v_2$ to $v_3$, respectively.

Multiple signal paths between a pair of registers, $R_x$ and $R_y$, require a more complicated treatment. As specified before, there can be only one edge between the vertices, $v_x$ and $v_y$, in the circuit graph. Therefore, a methodology is presented for choosing the correct permissible range bounds and direction labeling for this single edge. This methodology is illustrated in Figure 5.5 and is a two-step process. First, multiple signal paths in the same direction from the register $R_x$ to the register $R_y$ are replaced by a single edge in the circuit graph according to the transformation illustrated in Figure 5.5(a). Next, two-edge cycles between $R_x$ and $R_y$ are replaced by a single edge in the circuit graph according to the transformation illustrated in Figure 5.5(b).

[Figure 5.5: (a) Elimination of multiple edges: parallel edges from $v_x$ to $v_y$ labeled $[l_z', u_z']$ through $[l_z^{(n)}, u_z^{(n)}]$ are replaced by a single edge from $v_x$ to $v_y$ labeled $\bigcap_i [l_z^{(i)}, u_z^{(i)}]$. (b) Elimination of a two-edge cycle: an edge from $v_x$ to $v_y$ labeled $[l_z', u_z']$ and an edge from $v_y$ to $v_x$ labeled $[l_z'', u_z'']$ are replaced by a single edge from $v_x$ to $v_y$ labeled $[l_z', u_z'] \cap [-u_z'', -l_z'']$.]

Fig. 5.5. Transformation rules for the circuit graph.
In the former case [Figure 5.5(a)], the edge direction labeling is preserved while the permissible range for the new single edge is chosen such that the permissible ranges of the multiple paths from $R_x$ to $R_y$ are simultaneously satisfied. As shown in Figure 5.5(a), the new permissible range $[l_z, u_z]$ is the intersection of the multiple permissible ranges $[l_z', u_z']$ through $[l_z^{(n)}, u_z^{(n)}]$ between $R_x$ and $R_y$. In other words, the new lower bound is $l_z = \max_i \{l_z^{(i)}\}$ and the new upper bound is $u_z = \min_i \{u_z^{(i)}\}$.

In the latter case [Figure 5.5(b)], an arbitrary choice for the edge direction can be made. The convention adopted here is to choose the direction towards the vertex with the higher index. With the direction chosen towards the vertex $v_y$, the new permissible range is the intersection $[l_z', u_z'] \cap [-u_z'', -l_z'']$ shown in Figure 5.5(b), that is, it has a lower bound $l_z = \max(l_z', -u_z'')$ and an upper bound $u_z = \min(u_z', -l_z'')$. It is straightforward to verify that any clock skew $s \in [l_z, u_z]$ satisfies both permissible ranges $[l_z', u_z']$ and $[l_z'', u_z'']$ as shown in Figure 5.5(b).

The process for computing the permissible ranges of a circuit graph [using (4.8), (4.13), (4.23), (4.24) and (4.29)] and the transformations illustrated in Figure 5.5 have linear complexity in the number of signal paths since each signal path is examined only once.

Note that the terms, circuit and graph, are used throughout the rest of this research monograph interchangeably to denote the same fully synchronous digital circuit. Also, note that for brevity, the superscript $(C)$ when referring to the circuit graph $G_C$ of a circuit $C$ is omitted for the rest of the monograph unless a circuit is explicitly indicated. The terms, register and vertex, are used interchangeably as are edge, local data path, arc and a sequentially-adjacent pair of registers.

On a final note, it is assumed that the graph of any circuit considered in this work is connected. If this is not the case, each of the disjoint connected portions of the graph (circuit) can be individually analyzed.
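The two transformation rules of Figure 5.5, together with the removal of self-loops, can be sketched as a single pass over the local data paths. This is an illustrative sketch under stated assumptions, not the software described in the monograph: the function name `reduce_to_simple_graph`, the tuple layout `(x, y, l, u)` and the example numbers are all hypothetical. Each path carries the permissible range `[l, u]` of the skew taken from register `x` to register `y`.

```python
from typing import Dict, List, Tuple

Range = Tuple[float, float]
Path = Tuple[int, int, float, float]  # (x, y, l, u) for a path from Rx to Ry


def reduce_to_simple_graph(paths: List[Path]) -> Dict[Tuple[int, int], Range]:
    """Merge local data paths into one edge per register pair.

    Every edge is stored with its direction towards the higher index,
    so the key is (a, b) with a < b and the range bounds t_a - t_b.
    """
    edges: Dict[Tuple[int, int], Range] = {}
    for x, y, l, u in paths:
        if x == y:
            continue  # self-loop: the skew is always zero
        if x > y:
            # Reverse the edge; by antisymmetry the range becomes [-u, -l].
            x, y, l, u = y, x, -u, -l
        lo, hi = edges.get((x, y), (float("-inf"), float("inf")))
        edges[(x, y)] = (max(lo, l), min(hi, u))  # interval intersection
    return edges


# Hypothetical example: two parallel paths R1 -> R3, one path R3 -> R1,
# and a self-loop on R2.
paths = [(1, 3, -2.0, 5.0), (1, 3, -1.0, 6.0), (3, 1, -4.0, 0.5), (2, 2, 0.0, 0.0)]
print(reduce_to_simple_graph(paths))
```

Each path is visited once, which matches the linear complexity noted in the text. A resulting range with a lower bound above its upper bound would indicate that no skew satisfies all paths between that pair of registers.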
5.3 Clock Scheduling

The process of non-zero clock skew scheduling is discussed in this section. The following substitutions are introduced for notational convenience:

Definition 5.4. Let C be a fully synchronous digital circuit and let Ri and Rf be a sequentially-adjacent pair of registers, i.e., Ri ⇝ Rf. The long path delay $\hat{D}_{PM}^{i,f}$ of a local data path Ri ⇝ Rf is defined as

\[
\hat{D}_{PM}^{i,f} =
\begin{cases}
D_{CQM}^{F_i} + D_{PM}^{i,f} + \delta_S^{F_f} + 2\Delta_L^{F}, & \text{if } R_i, R_f \text{ are flip-flops} \\
D_{CQM}^{L_i} + D_{PM}^{i,f} + \delta_S^{L_f} + \Delta_L^{L} + \Delta_T^{L}, & \text{if } R_i, R_f \text{ are latches.}
\end{cases}
\tag{5.3}
\]

Similarly, the short path delay $\hat{D}_{Pm}^{i,f}$ of a local data path Ri ⇝ Rf is defined as

\[
\hat{D}_{Pm}^{i,f} =
\begin{cases}
D_{Pm}^{i,f} + D_{CQm}^{F_i} - \delta_H^{F_f} - 2\Delta_L^{F}, & \text{if } R_i, R_f \text{ are flip-flops} \\
D_{CQm}^{L_i} + D_{Pm}^{i,f} - \delta_H^{L_f} - \Delta_L^{L} - \Delta_T^{L}, & \text{if } R_i, R_f \text{ are latches.}
\end{cases}
\tag{5.4}
\]
Table 5.1. LP model for clock skew scheduling of edge-sensitive circuits.

LP Model
min   $T_{CP}$
s.t.  $T_{Skew}(i,f) \le T_{CP} - \hat{D}_{PM}^{i,f}$
      $T_{Skew}(i,f) \ge -\hat{D}_{Pm}^{i,f}$
Based on Definition 5.4, the timing constraints of a local data path Ri ⇝ Rf with flip-flops [(4.8) and (4.13)] are used to construct the linear programming (LP) model for clock skew scheduling [2] shown in Table 5.1. The constraints in Table 5.1 are the operating conditions for an edge-sensitive circuit, with $\hat{D}_{PM}^{i,f}$ and $\hat{D}_{Pm}^{i,f}$ given by (5.3) and (5.4):

\[ T_{Skew}(i,f) \le T_{CP} - \hat{D}_{PM}^{i,f} \tag{5.5} \]
\[ T_{Skew}(i,f) \ge -\hat{D}_{Pm}^{i,f}. \tag{5.6} \]
For a local data path Ri ⇝ Rf consisting of the flip-flops Ri and Rf, the setup and hold time violations are avoided if (5.5) and (5.6), respectively, are satisfied. The clock skew TSkew(i, f) of a local data path Ri ⇝ Rf can be either positive or negative, as illustrated in Figures 4.13 and 4.14, respectively. Note that negative clock skew may be used to effectively speed up a local data path Ri ⇝ Rf by allowing an additional |TSkew(i, f)| amount of time for the signal to propagate from the register Ri to the register Rf. However, excessive negative skew may create a hold time violation, thereby creating a lower bound on TSkew(i, f) as described by (5.6) and illustrated by l in Figure 5.2. A hold time violation, as described in Chapter 4, is a clock hazard or a race condition, also known as double clocking [2, 9]. Similarly, positive clock skew effectively decreases the clock period TCP by TSkew(i, f), thereby limiting the maximum clock frequency and imposing an upper bound on the clock skew as illustrated by u in Figure 5.2.⁷ In this case, a clocking hazard known as zero clocking may be created [2, 9].

Examination of the constraints (5.5) and (5.6) reveals a procedure for preventing clock hazards. If (5.5) is not satisfied, a suitably large value of TCP can be chosen to satisfy constraint (5.5) and prevent zero clocking. Also note that unlike (5.5), (5.6) is independent of the clock period TCP (or the clock frequency). Therefore, TCP cannot be changed to correct a double clocking hazard; rather, a redesign of the entire clock distribution network [83] or a delay padding procedure applied to the logic network [74] may be required.

Both double and zero clocking hazards can be eliminated if two simple choices characterizing a fully synchronous digital circuit are made. Specifically, if equal values are chosen for all clock delays, then the clock skew TSkew(i, f) = 0 for each local data path Ri ⇝ Rf,

\[ \forall R_i, R_f: \quad t_{cd}^i = t_{cd}^f \;\Rightarrow\; T_{Skew}(i,f) = 0. \tag{5.7} \]

⁷ Positive clock skew may also be thought of as increasing the path delay. In either case, positive clock skew (TSkew > 0) increases the difficulty of satisfying (5.5).
Therefore, (5.5) and (5.6) become

\[ T_{Skew}(i,f) = t_{cd}^i - t_{cd}^f = 0 \le T_{CP} - \hat{D}_{PM}^{i,f} \tag{5.8} \]
\[ -\hat{D}_{Pm}^{i,f} \le 0 = T_{Skew}(i,f) = t_{cd}^i - t_{cd}^f. \tag{5.9} \]
Note that (5.8) can be satisfied for each local data path Ri ⇝ Rf in a circuit if a sufficiently large value (larger than the greatest value of $\hat{D}_{PM}^{i,f}$ in the circuit) is chosen for TCP. Furthermore, (5.9) can be satisfied across an entire circuit if it can be ensured that $\hat{D}_{Pm}^{i,f} \ge 0$ for each local data path Ri ⇝ Rf in the circuit. The timing constraints (5.8) and (5.9) can be satisfied since choosing a sufficiently large clock period TCP is always possible and $\hat{D}_{Pm}^{i,f}$ is positive for a properly designed local data path Ri ⇝ Rf. The application of this zero clock skew methodology [(5.7), (5.8), and (5.9)] has been central to the design of fully synchronous digital circuits for decades [9, 32, 91]. Because they require the clock signal to arrive at each register Rj with approximately the same delay $t_{cd}^j$, these design methods have become known as zero clock skew methods.⁸

As shown by previous research [9, 81, 82, 83, 88, 92, 93], both double and zero clocking hazards may be removed from a synchronous digital circuit even when the clock skew is non-zero, that is, TSkew(i, f) ≠ 0 for some (or all) local data paths Ri ⇝ Rf. As long as (5.5) and (5.6) are satisfied, a synchronous digital system can operate reliably with non-zero clock skews, permitting the system to operate at higher clock frequencies while removing all race conditions. The column vector of clock delays $T_{CD} = [t_{cd}^1, t_{cd}^2, \ldots]^T$ is called a clock schedule [2, 9]. If TCD is chosen such that (5.5) and (5.6) are satisfied for every local data path Ri ⇝ Rf, TCD is called a consistent clock schedule. A clock schedule that satisfies (5.7) is called a trivial clock schedule. Note that a trivial clock schedule TCD implies global zero clock skew since, for any i and f, $t_{cd}^i = t_{cd}^f$ and thus TSkew(i, f) = 0.

An intuitive example of non-zero clock skew being used to improve the performance and reliability of a fully synchronous digital circuit is shown in Figure 5.6.
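The definition of a consistent clock schedule lends itself to a direct check. The following Python sketch tests (5.5) and (5.6) on every local data path for a given schedule. The function names, the tuple layout of a path and the delay values (in ns) are illustrative assumptions, and the short and long path delays are assumed to already include the register terms of Definition 5.4.

```python
from typing import Dict, List, Tuple

# (initial register, final register, short path delay, long path delay)
Path = Tuple[str, str, float, float]


def is_consistent(t_cd: Dict[str, float], paths: List[Path], t_cp: float) -> bool:
    """Return True if the schedule t_cd satisfies (5.5) and (5.6) on every path."""
    for r_i, r_f, d_short, d_long in paths:
        skew = t_cd[r_i] - t_cd[r_f]      # TSkew(i, f) = t_cd^i - t_cd^f
        if skew > t_cp - d_long:          # (5.5) violated: zero clocking
            return False
        if skew < -d_short:               # (5.6) violated: double clocking
            return False
    return True


def min_period_zero_skew(paths: List[Path]) -> float:
    """Smallest TCP for which the trivial schedule satisfies (5.8)."""
    return max(d_long for _, _, _, d_long in paths)


paths = [("R1", "R2", 1.0, 2.5), ("R2", "R3", 5.0, 8.0)]  # hypothetical delays
trivial = {"R1": 0.0, "R2": 0.0, "R3": 0.0}               # zero skew everywhere
skewed = {"R1": 2.0, "R2": 1.0, "R3": 2.0}                # a nontrivial schedule
```

With these assumed numbers the trivial schedule is consistent at a clock period of 8.5 ns but not at 7.5 ns, whereas the nontrivial schedule remains consistent at 7.5 ns.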
⁸ Equivalently, it is required that the clock signal arrive at each register at approximately the same time.

[Figure 5.6: three registers R1, R2 and R3 connected in series, with a clock period of 8.5 ns. The logic delay is 1 ns to 2.5 ns between R1 and R2 (permissible range −1 ns to 6 ns) and 5 ns to 8 ns between R2 and R3 (permissible range −5 ns to 0.5 ns). (a) The circuit operating with zero clock skew: the clock delay to each register is t and Skew = 0 on both paths. (b) The circuit operating with non-zero clock skew: the clock delay to R2 is τ < t, so that Skew = t − τ ≠ 0 on R1 ⇝ R2 and Skew = τ − t ≠ 0 on R2 ⇝ R3.]

Fig. 5.6. Application of non-zero clock skew to improve circuit performance (a lower clock period) or circuit reliability (increased safety margins within the permissible range).

Two pairs of sequentially-adjacent flip-flops, R1 ⇝ R2 and R2 ⇝ R3, are shown in Figure 5.6, where the zero skew and non-zero skew situations are illustrated in Figures 5.6(a) and 5.6(b), respectively. Note that the local data paths made up of the registers R1 and R2 and of R2 and R3, respectively, are connected in series (R2 being common to both R1 ⇝ R2 and R2 ⇝ R3). In each of the Figures 5.6(a) and 5.6(b), the permissible ranges of the clock skew for both local data paths, R1 ⇝ R2 and R2 ⇝ R3, are lightly shaded under each circuit diagram. As shown in Figure 5.6, the target clock period for this circuit is TCP = 8.5 ns. The zero clock skew points (Skew = 0) are indicated in Figure 5.6(a); zero skew is achieved by delivering the clock signal to each of the registers R1, R2 and R3 with the same delay t (symbolically illustrated by the buffers connected to the clock terminals of the registers). Observe that while the zero clock skew points fall within the respective permissible ranges, these zero
clock skew points are dangerously close to the lower and upper bounds of the permissible range for R1 ⇝ R2 and R2 ⇝ R3, respectively. A situation could be foreseen where, for example, the local data path R2 ⇝ R3 has a larger than expected long delay (larger than 8 ns), thereby causing the upper bound of the permissible range for R2 ⇝ R3 to decrease below the zero clock skew point. In this scenario, a setup violation will occur on the local data path R2 ⇝ R3.

Consider next the same circuit with non-zero clock skew applied to the data paths R1 ⇝ R2 and R2 ⇝ R3, as shown in Figure 5.6(b). Non-zero skew is achieved by delivering the clock signal to the register R2 with a delay τ < t, where t is the delay of the clock signal to both R1 and R3. By applying this delay τ < t, positive (t − τ > 0) and negative (τ − t < 0) clock skews are applied to R1 ⇝ R2 and R2 ⇝ R3, respectively. The corresponding clock skew points are illustrated in the respective permissible ranges in Figure 5.6(b). Comparing Figure 5.6(a) to Figure 5.6(b), observe that a timing violation is less likely to occur in the latter case. In order for the previously described setup timing violation to occur in Figure 5.6(b), the deviations in the delay parameters of R2 ⇝ R3 would have to be much greater in the non-zero clock skew case than in the zero clock skew case. Likewise, even if the precise target value of the non-zero clock skew τ − t < 0 is not met during the circuit design process, the safety margin from the skew point to the upper bound of the permissible range would still be much greater than with zero skew.

Therefore, there are two identifiable benefits of applying non-zero clock skew. First, the safety margins of the clock skew (that is, the distances between the clock skew point and the bounds of the permissible range) within the permissible ranges of a data path can be improved. The likelihood of correct circuit operation in the presence of process parameter variations and changing operational conditions is improved with these increased margins.
In other words, the circuit reliability is improved. Second, without changing the logic and circuit structure, the performance of the circuit can be increased by permitting a higher maximum clock frequency (or lower minimum clock period). The formulation of circuit timing constraints for different timing problems and the formulation of clock skew scheduling for different objectives are presented in Section 5.4.

In 1989, Friedman first presented the concept of negative non-zero clock skew as a technique to increase the clock frequency and circuit performance across sequentially-adjacent pairs of registers [1]. Soon afterwards, in 1990, Fishburn suggested an algorithm for computing a consistent clock schedule that is nontrivial [2]. It is shown in [1, 2] that by exploiting negative and positive clock skew within a local data path Ri ⇝ Rf, a circuit can operate with a clock period TCP less than the clock period achievable by a trivial (or zero skew) clock schedule while satisfying the conditions specified by (5.5) and (5.6). In fact, [2] determined an optimal clock schedule by applying linear programming techniques to solve for TCD so as to satisfy (5.5) and (5.6) while minimizing the objective function Fobjective = min TCP.⁹

The process of determining a consistent clock schedule TCD can be considered as the mathematical problem of minimizing the clock period TCP under the constraints (5.5) and (5.6). However, there are important practical issues to consider before a clock schedule can be properly implemented. A clock distribution network must be synthesized such that the clock signal is delivered to each register with the proper delay so as to satisfy the clock skew schedule TCD. Furthermore, this clock distribution network must be constructed so as to minimize the deleterious effects of interconnect impedances and process parameter variations on the implemented clock schedule. Synthesizing the clock distribution network typically consists of determining a topology for the network, together with the circuit design and physical layout of the buffers and interconnect that make up a clock distribution network [9, 32].
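The safety margin argument made for Figure 5.6 can be reproduced numerically. The Python sketch below computes the permissible range of each path from (5.5) and (5.6) and the distance from a skew point to the nearest bound. The function names are hypothetical, and the value of t − τ (2 ns) is an assumption chosen for illustration, since the figure does not specify it.

```python
from typing import Tuple


def permissible_range(d_short: float, d_long: float, t_cp: float) -> Tuple[float, float]:
    """Bounds [l, u] on TSkew(i, f) implied by (5.6) and (5.5)."""
    return (-d_short, t_cp - d_long)


def safety_margin(skew: float, bounds: Tuple[float, float]) -> float:
    """Distance from the skew point to the nearest bound (negative if violated)."""
    lower, upper = bounds
    return min(skew - lower, upper - skew)


T_CP = 8.5                                     # ns, as in Figure 5.6
range_12 = permissible_range(1.0, 2.5, T_CP)   # R1 -> R2
range_23 = permissible_range(5.0, 8.0, T_CP)   # R2 -> R3

delta = 2.0  # assumed value of t - tau in ns; not given in the figure
zero_skew_margins = (safety_margin(0.0, range_12), safety_margin(0.0, range_23))
nonzero_margins = (safety_margin(delta, range_12), safety_margin(-delta, range_23))
print(range_12, range_23, zero_skew_margins, nonzero_margins)
```

Under these assumptions the ranges match those drawn in Figure 5.6, and the smaller of the two margins grows from 0.5 ns with zero skew to 2.5 ns with the assumed non-zero skew.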
5.4 Timing Constraints and Design Automation

Digital VLSI synchronous circuits are subject to different types of timing analyses with regard to computing or analyzing their clock schedules. Three problems are traditional among these analyses: clock period minimization [2, 69, 72, 73, 94, 95, 96, 97, 98], clock period verification [69, 99, 100] and circuit retiming [101, 102, 103, 104]. Clock period minimization is the analysis of a synchronous circuit in order to solve for the minimum clock period (the maximum operating frequency) of the circuit. Clock period verification is the analysis to ensure that a synchronous circuit is fully operational for a given clock period. Clock period verification can also be used to formulate the clock skew scheduling problem with the objective of improved tolerance to process parameter variations for operation at a predetermined clock period. Circuit retiming is the analysis of a synchronous circuit aiming to achieve higher operating frequencies by modifying the circuit network.

Even though there are different types of timing analysis problems, the operation of the synchronous circuit under scrutiny is identical in all cases (possibly except for retiming problems). Thus, in the formulation of the timing analysis problem, a framework of constraints identifying synchronous circuit operation is essential. The categorized set of constraints is verified at each local data path of a circuit subject to a specific objective function, constituting the static timing analysis. The timing relationship constraints discussed in Section 5.3 and Section 6.1 for flip-flop-based and latch-based circuits, respectively, are the building blocks for the automation of these timing analysis processes.
⁹ This LP problem model is presented in Table 5.1.
5.5 Structure of the Clock Distribution Network

A clock distribution network is typically organized as a rooted tree structure [9, 81, 105], as illustrated in Figure 5.7, and is often called a clock tree [9]. A circuit schematic of a clock distribution network is shown in Figure 5.7(a). An abstract graphical representation of the tree structure in Figure 5.7(a) is shown in Figure 5.7(b). The unique source of the clock signal is at the root of the tree. This signal is distributed from the source to every register in the circuit through a sequence of buffers and interconnect. Typically, a buffer in the network drives a combination of other buffers and registers in a VLSI circuit. A network of wires connects the output of the driving buffer to the inputs of these driven buffers and registers. An internal node of the tree corresponds to a buffer and a leaf node of the tree corresponds to a register. There are N leaves¹⁰ in the clock tree labeled F1 through FN, where leaf Fj corresponds to register Rj. A clock tree topology that implements a given clock schedule TCD must enforce a clock skew TSkew(i, f) for each local data path Ri ⇝ Rf of the circuit in order to ensure that both (5.5) and (5.6) are satisfied.

[Figure 5.7: (a) Circuit structure of the clock distribution network, showing the clock source, the buffers and the registers. (b) Clock tree structure that corresponds to the circuit shown in (a), with nodes for the clock source, the buffers and the registers.]

Fig. 5.7. Tree structure of a clock distribution network.

¹⁰ The number of registers N in the circuit.
5.6 Solution of the Clock Tree Synthesis Problem

In this section, a solution to the topological synthesis problem [93, 106, 107] is presented. The solution is based on the following assumption: the signal propagation delay through a node and all of its descendant nodes is a constant, denoted by Δb. Therefore, the propagation delay δj of the clock signal from the clock source to the register Rj at depth bj is $t_{cd}^j = \delta_j = b_j \times \Delta_b$. Note that Δb includes the delay through both a buffer and the interconnect branches connected to the buffer output. There can be considerable difficulty in practically achieving a constant Δb throughout all levels of the clock tree. Therefore, new research should focus on removing this constraint by providing variable branch delays. After substituting $\delta_j = b_j \times \Delta_b$ into (5.5) and (5.6), the necessary conditions to avoid either clock hazard can be rewritten as follows:

\[ -T_{Skew}(i,f) = (b_f - b_i)\Delta_b \ge \hat{D}_{PM}^{i,f} - T_{CP} \tag{5.10} \]
\[ T_{Skew}(i,f) = (b_i - b_f)\Delta_b \ge -\hat{D}_{Pm}^{i,f}. \tag{5.11} \]
Therefore, the problem of designing the topology of a clock distribution network can be formulated as the optimization problem of minimizing the clock period TCP subject to the constraints (5.10) and (5.11). The quantities bi and bf are integers, since these terms denote the number of branches (buffers) from the root of the clock tree to a particular leaf (register). In the general case, this optimization problem can be described as a mixed-integer linear programming problem (since TCP can be any real positive number) and is difficult to solve. However, previous research has demonstrated [108] that if a fixed value for the clock period TCP is chosen, the problem changes as follows. Given a value for TCP, find a set of integers {b1, b2, …, bi, …} such that

\[
\begin{aligned}
(b_f - b_i)\Delta_b &\ge \hat{D}_{PM}^{i,f} - T_{CP} \\
(b_i - b_f)\Delta_b &\ge -\hat{D}_{Pm}^{i,f}
\end{aligned}
\tag{5.12}
\]

for every sequentially-adjacent pair of registers Ri ⇝ Rf, or determine that no such set of integers exists. Once (5.12) has been solved for a particular circuit, a clock tree topology such as the network shown in Figure 5.7 can be implemented. Each register Ri of a circuit receives a clock signal from a leaf Fi of the clock tree at a branching depth b = bi, where bi is the integer obtained from solving (5.12).

In addition, Leiserson and Saxe describe an algorithm for efficiently solving optimization problems similar to (5.12) [90]. The run time of this algorithm is O(VE), where V and E denote the number of registers and the number of sequentially-adjacent pairs of registers, respectively. This algorithm is applied in this synthesis methodology for constructing the topology of the clock tree.
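For a fixed clock period, (5.12) is a system of difference constraints on the integers bi, which can be tested with a Bellman-Ford style relaxation in O(VE) time. The Python sketch below is one possible illustration and is not claimed to be the algorithm of [90] as used by the authors. The function name `solve_depths`, the path tuple layout and the final shift that places the shallowest leaf at depth 1 are assumptions.

```python
import math
from typing import Dict, List, Optional, Tuple

Path = Tuple[str, str, float, float]  # (Ri, Rf, short path delay, long path delay)


def solve_depths(paths: List[Path], t_cp: float,
                 delta_b: float) -> Optional[Dict[str, int]]:
    """Find integer depths b satisfying (5.12) for a fixed t_cp, or return None."""
    # Each constraint has the form b[u] - b[v] <= w, i.e. an edge v -> u of weight w.
    edges = []
    for r_i, r_f, d_short, d_long in paths:
        # (b_f - b_i) * delta_b >= d_long - t_cp  <=>  b_i - b_f <= (t_cp - d_long) / delta_b
        edges.append((r_f, r_i, math.floor((t_cp - d_long) / delta_b)))
        # (b_i - b_f) * delta_b >= -d_short       <=>  b_f - b_i <= d_short / delta_b
        edges.append((r_i, r_f, math.floor(d_short / delta_b)))
    nodes = {r for r_i, r_f, _, _ in paths for r in (r_i, r_f)}
    dist = {r: 0 for r in nodes}   # implicit source tied to every node with weight 0
    for _ in range(len(nodes)):
        changed = False
        for v, u, w in edges:
            if dist[v] + w < dist[u]:
                dist[u] = dist[v] + w
                changed = True
        if not changed:
            # Shift all depths equally so that the shallowest leaf sits at depth 1.
            shift = 1 - min(dist.values())
            return {r: d + shift for r, d in dist.items()}
    return None                    # still relaxing after |V| passes: negative cycle
```

Rounding each bound down with `math.floor` is what keeps the constraints valid for integer depths. Because only differences of depths appear in (5.12), the uniform shift at the end does not affect feasibility.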
The sequence of operations is as follows. A feasible range for the clock period [Tmin, Tmax] to be searched is determined initially; the bounds Tmin and Tmax are determined as described in [88]. A binary search for the optimal clock period Topt is then performed over the feasible range of the clock period. This sequence of operations is presented in Algorithm 1. The feasible range for the clock period [Tmin, Tmax] to be searched is determined in lines 1 and 2. A binary search of the feasible clock period range is performed next in lines 3 through 11. For each value of the clock period, (5.12) is solved in line 5 to determine the feasibility of this current target value of the clock period TCP. The binary search ends when the condition stated in line 4 is no longer satisfied.

Algorithm 1 Compute clock schedule.
 1: min ← Tmin
 2: max ← Tmax
 3: test ← (min + max)/2
 4: while max − min > δ do
 5:   if (∃ feasible solution for TCP = test) then
 6:     max ← test
 7:   else
 8:     min ← test
 9:   end if
10:   test ← (min + max)/2
11: end while
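Algorithm 1 translates almost line for line into Python. In the sketch below, `min_clock_period` follows the listing, while `feasible` is a stand-in feasibility test that checks (5.5) and (5.6) for real-valued clock delays rather than solving the integer problem (5.12). That substitution, the function names and the example delays are all assumptions made for illustration.

```python
from typing import Callable, List, Tuple

Path = Tuple[str, str, float, float]  # (Ri, Rf, short path delay, long path delay)


def feasible(paths: List[Path], t_cp: float) -> bool:
    """Bellman-Ford test: do real-valued clock delays satisfy (5.5) and (5.6)?"""
    edges = []
    for r_i, r_f, d_short, d_long in paths:
        edges.append((r_f, r_i, t_cp - d_long))   # t_i - t_f <= t_cp - d_long
        edges.append((r_i, r_f, d_short))         # t_f - t_i <= d_short
    nodes = {r for r_i, r_f, _, _ in paths for r in (r_i, r_f)}
    dist = {r: 0.0 for r in nodes}
    for _ in range(len(nodes)):
        changed = False
        for v, u, w in edges:
            if dist[v] + w < dist[u] - 1e-12:     # small guard against rounding noise
                dist[u] = dist[v] + w
                changed = True
        if not changed:
            return True
    return False                                  # negative cycle: no schedule exists


def min_clock_period(is_feasible: Callable[[float], bool],
                     t_min: float, t_max: float, tol: float = 1e-6) -> float:
    """Algorithm 1: bisect [t_min, t_max]; t_max is assumed to be feasible."""
    lo, hi = t_min, t_max
    test = (lo + hi) / 2
    while hi - lo > tol:
        if is_feasible(test):
            hi = test
        else:
            lo = test
        test = (lo + hi) / 2
    return hi
```

For the two hypothetical paths with delays of 1 ns to 2.5 ns and 5 ns to 8 ns, `min_clock_period(lambda t: feasible(paths, t), 0.0, 8.0)` converges to about 3 ns, the largest difference between the long and short delay of any single path, because this small chain contains no feedback loop.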
After computing a clock schedule, a mapping M : tcd → B is produced such that each clock delay tcd(i) is mapped to a positive integer b(i) ∈ B = {1, 2, …, bmax}. The integer b(i) is the required depth of the leaf in the clock tree driving the register Ri. Typically, bmax < NR, since there may be more than one register with the same value of the required depth b. In addition, note that the set B can be redefined as {1 + k, 2 + k, …, bmax + k} without affecting the validity of the solution (k is any integer). For example, if the solution for a circuit with 10 registers is {b(1), …, b(10)} = {3, 5, 8, 10, −2, 0, 0, 5, 5, 4}, this solution can be changed to {5, 7, 10, 12, 0, 2, 2, 7, 7, 6} by adding two branches (or buffers) to each of the numbers b(1) through b(10).

The clock distribution network is implemented recursively in the following manner. An integer value called the branching factor f is initially chosen. The branching factor determines the number of outgoing branches from each node of the clock tree. By maintaining f constant throughout the clock tree, the requirement for a constant Δb can be satisfied. A specific number of registers nj is driven at a specific depth b(j) of the clock tree. Therefore, at least ⌈nj/f⌉ buffers at depth b(j) − 1 of the clock tree are required to drive these nj registers at depth b(j). The number of buffers and branches in the
clock tree is determined by beginning at the bottom of the tree (those leaves with the greatest depth) and recursively computing the number of buffers at each preceding level.
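The bottom-up buffer count described above can be sketched as follows. This is a simplified reading of the procedure under stated assumptions: depth 0 is taken to be the buffer at the clock source, dummy loads are ignored, and the function name `count_buffers` is hypothetical. It is not expected to reproduce the buffer counts reported for the benchmark circuits.

```python
from collections import Counter
from typing import Dict, List


def count_buffers(depths: Dict[str, int], f: int) -> List[int]:
    """Return buffers[d], the number of buffers at depth d = 0 .. max_depth - 1.

    depths maps each register to the depth b of its leaf (b >= 1) and f is
    the branching factor. Depth 0 is assumed to hold the clock source buffer.
    """
    regs_at = Counter(depths.values())
    max_depth = max(regs_at)
    buffers = [0] * max_depth
    driven = regs_at[max_depth]            # the deepest level holds only leaves
    for d in range(max_depth - 1, -1, -1):
        buffers[d] = (driven + f - 1) // f  # ceil(driven / f): at most f nodes per buffer
        driven = regs_at[d] + buffers[d]    # nodes at depth d: registers plus buffers
    return buffers
```

For example, three registers at depths 2, 1 and 5 with f = 2 need one buffer on each of the five levels above the deepest leaf. If the count at depth 0 exceeds one, the chosen depths are too shallow for the branching factor and every depth would have to be increased by the same amount.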
5.7 Software Implementation

The techniques for clock skew scheduling and clock distribution network synthesis discussed in this chapter have been implemented as two separate computer programs. The first program implements the problem of simultaneous clock skew scheduling and clock tree synthesis as described by (5.12). This program is described and results are presented in Section 5.7.1. A second, more exhaustive software implementation for clock skew scheduling only is described in Section 5.7.2.

5.7.1 Simultaneous Clock Scheduling and Clock Tree Synthesis

The algorithm has been implemented in a 3,300-line program written in the C++ high-level programming language. This program has been executed on the ISCAS'89 suite of benchmark circuits. A simple delay model based on the load of a gate is used to extrapolate the gate delays since these benchmark circuits do not contain delay information. A summary of the results for the benchmark circuits is shown in Table 5.2. These results demonstrate that by applying the proposed algorithm to schedule the clock delays to each register, up to a 64% decrease¹¹ in the minimum clock period can be achieved for these benchmark circuits while removing all race conditions. Note that due to the relatively large number of buffers required in the clock tree, this approach is only practical for circuits with a large number of registers.

Two example implementations of a clock tree topology with non-zero skew are shown in Figures 5.8 and 5.9 for the benchmark circuits s1423 and s400, respectively:

1. The clock tree topology shown in Figure 5.8 corresponds to the circuit s1423, which contains N = 74 registers. The improvement of the minimum achievable clock period TCP is 14% by applying the methodology described in Section 5.6.
2. The clock tree topology shown in Figure 5.9 corresponds to the circuit s400, which contains N = 21 registers. The improvement of the minimum achievable clock period for this circuit when non-zero clock skew is applied is 37%.

¹¹ Compared to the minimum possible clock period if zero skew is used throughout a circuit.
Table 5.2. ISCAS'89 suite of circuits. The name, number of registers, bounds of the searchable clock period, optimal clock period (Topt) and performance improvement (in per cent) are shown for each circuit. Also shown in the last two columns, labeled B2 and B3, are the number of buffers in the clock tree for f = 2 and f = 3, respectively.

Circuit   Regs   Tmin    Tmax     Topt    % Imp.   B2     B3
s1196     18     7.80    20.80    13.00   17%      21     14
s13207    669    60.40   85.60    60.45   29%      681    348
s1423     74     75.80   92.20    79.00   14%      80     45
s1488     6      31.00   32.20    31.00   4%       5      4
s15850    597    83.60   116.00   83.98   28%      614    320
s208.1    8      5.20    12.40    5.48    56%      10     9
s27       3      5.40    6.60     5.40    18%      3      3
s298      14     9.40    13.00    10.48   19%      13     8
s344      15     18.40   27.00    18.65   31%      16     11
s349      15     18.40   27.00    18.65   31%      15     10
s35932    1728   34.20   34.20    34.20   0%       3457   2595
s382      21     8.00    14.20    8.88    37%      25     14
s38417    1636   42.20   69.00    42.82   38%      1647   832
s38584    1452   67.60   94.20    67.65   28%      1465   743
s386      6      17.00   17.80    17.80   0%       12     10
s400      21     8.40    14.20    8.88    37%      25     14
s420.1    16     5.20    16.40    7.45    55%      21     15
s444      21     8.40    16.80    10.17   39%      23     15
s510      6      14.80   16.80    15.20   10%      7      5
s526      21     9.40    13.00    10.48   19%      21     10
s526n     21     9.40    13.00    10.48   19%      21     10
s5378     179    20.40   28.40    22.29   22%      182    93
s641      19     71.00   88.00    71.03   19%      30     22
s713      19     79.20   89.20    72.23   19%      31     23
s820      5      19.20   19.20    19.20   0%       11     9
s832      5      19.80   19.80    19.80   0%       11     9
s838.1    32     5.20    24.40    8.76    64%      40     24
s9234.1   211    54.20   75.80    54.24   28%      220    113
s9234     228    54.20   75.80    54.24   28%      237    123
s953      29     16.40   23.20    18.96   18%      31     18
[Figure 5.8: clock tree diagram with a legend distinguishing dummy loads, internal nodes (buffers) and leaves (registers).]

Fig. 5.8. Buffered clock tree for the benchmark circuit s1423. The circuit s1423 has a total of N = 74 registers and the clock tree consists of 45 buffers with a branching factor of f = 3.

[Figure 5.9: clock tree diagram with a legend distinguishing dummy leaves (loads), internal nodes (buffers) and leaves (registers).]

Fig. 5.9. Buffered clock tree for the benchmark circuit s400. The circuit s400 has a total of N = 21 registers and the clock tree consists of 14 buffers with a branching factor of f = 3.

5.7.2 Clock Skew Scheduling

In this program implementation, only clock skew scheduling is implemented as described in Sections 5.3 and 5.6. This implementation is targeted at commercial integrated circuits for which accurate timing information can be obtained. The program is written in the C++ high-level programming language and consists of approximately 17,300 lines of code. This program has been demonstrated on a commercial integrated circuit with 6,890 registers (a video-game controller) and some characterizing data is shown in Figure 5.12. The minimum achievable clock period without clock skew scheduling is TCP = 14.8 ns (a clock frequency of 67.5 MHz). After non-zero clock skew is applied to this circuit, the minimum achievable clock period with clock skew scheduling is TCP = 11.4 ns (a clock frequency of 87.7 MHz), corresponding to a performance improvement of 23%.

Input File Format

The input to this program is a standard text file containing the timing information necessary to apply the clock scheduling algorithm to a fully synchronous digital integrated circuit. This timing information characterizes the minimum and maximum signal delay of each local data path and can be obtained from the application of simulation tools known as static timing analyzers. More accurate simulation methods, such as dynamic circuit simulation (e.g., SPICE), can be used to obtain highly accurate timing information for relatively small circuits.

A sample input file for the clock skew scheduling program is shown in Figure 5.10. As shown in Figure 5.10, the input consists of groups of information (lines 1-11 and 13-18 in Figure 5.10) enclosed in curly braces (the '{' and '}' symbols). Each line in a group describes an instance of a register. The first line in a group describes a register Ri at the beginning of a local data path Ri ⇝ Rf. Each of the remaining lines of a group describes a register Rf at the end of a local data path Ri ⇝ Rf. In the example shown in Figure 5.10, the registers Top/Block1/RegA[8]:sc and TopA/Block1/RegA[7]:sc each describe the first register of a local data path (lines 1 and 13, respectively).

 1: {Top/Block1/RegA[8]:d1
 2:    2.781105e-04 5.243128e-01 _ _
 3:    3.000000e-02 3.000000e-02 _ _
 4:
 5:    {Top/Block2/RegB[7]:d1 4.596487e-01 5.079964e-01
 6:       4.596487e-01 5.079964e-01}
 7:    {Top/Block2/RegB[6]:d1 4.116543e-01 4.677776e-01}
 8:    {Top/Block2/RegB[8]:d1 4.224569e-01 4.813909e-01}
 9:    {Top/Block2/RegB[7]:d1 4.596487e-01 5.079964e-01
10:       4.596487e-01 5.079964e-01}
11: }
12:
13: {TopA/Block1/RegA[7]:D
14:    5.195378e-01 5.195681e-01 _ _
15:    3.000000e-02 3.000000e-02 _ _
16:
17:    {Top/Block1/RegC[6]:da 4.116543e-01 4.677776e-01}
18: }

Fig. 5.10. Sample input for the clock scheduling program described in Section 5.7.2.

Each register listed in the input file of the program consists of a sequence of strings separated with slashes (the '/' character). These strings represent the hierarchical name of the register in the design hierarchy. The register on line 1, for example, is named RegA and is part of a design block named Block1, whereas the design block Block1 is part of the module called Top. Finally, a register bit index may be appended at the end of a register name for multi-bit registers¹² and the data pin name is appended after the bit index and separated with a colon ':'.

The description of the initial register of a local data path is followed by eight (8) numbers which specify the timing information characterizing this register. These numbers specify the minimum and maximum values of the setup and hold times for the register for the rising and falling edges of the clock signal. If a number is not available, an underscore '_' is substituted for this missing data. The program determines the type of register by examining both the missing and specified numbers describing the setup and hold times. Returning to line 1 in Figure 5.10, the minimum and maximum setup times for the rising edge of the clock signal are included while the minimum and maximum setup times for the falling edge of the clock signal are absent (note the underscores in line 2). Therefore, this register instance is either a positive-edge triggered flip-flop or a negative latch. A positive flip-flop has the setup and hold times defined for the rising edge of the clock signal. Similarly, a negative latch has the setup and hold times defined for the rising edge of the clock signal. Since the register instance described by line 1 in Figure 5.10 has setup and hold times defined for the rising edge of the clock signal, the register instance is either a positive flip-flop or a negative latch.

¹² If the register is not a multi-bit register, this index is omitted.
As mentioned previously, each register instance in an input file describes an initial register at the beginning of a local data path and is followed by one or more register instances describing a final register at the end of a local data path. For the example shown in Figure 5.10, there are four (4) local data paths (lines 5 through 10) with an initial register described on line 1. Each final register of a local data path (lines 5 through 10) consists of a register name and is followed by the timing information describing the local data path terminated by this specific register instance. This timing information may contain two or four delay numbers depending upon whether the starting register of the local data path is a flip-flop or a latch. The minimum ($D_{CQm}^{F}$ or $D_{CQm}^{L}$) and maximum ($D_{CQM}^{F}$ or $D_{CQM}^{L}$) clock-to-output delays are the first two numbers listed on line 5 and are present regardless of the type of register (recall the description of latches and flip-flops in Sections 4.2 and 4.4, respectively). An additional pair of delay numbers specifies the minimum and maximum delays ($D_{DQm}^{L}$ and $D_{DQM}^{L}$) if the initial storage element of the local data path is a latch (line 6 in Figure 5.10).

Output File Format

The output of the clock skew scheduling program is a standard text file. A sample output is shown in Figure 5.11. Each line in the output consists of the full hierarchical name of a register Rj and the value of the delay $t_{cd}^j$ of the clock signal to the register Rj. Recall that it is not the clock delays to the individual registers that are important but rather the difference between the clock delays (the clock skew TSkew) for each sequentially-adjacent pair of registers.

Top/Block1/Reg1[7]   3.479695
Top/Block1/Reg143    2.814349
Top/Block1/Reg26[0]  2.159099
Top/Block1/Reg33A    3.479695
Top/Block1/Reg33B    3.479695
Top/Block1/reg_2a    3.479695
Top/Block1/reg_2     3.052987
Top/Block1/Reg271    2.541613
Top/Block1/Reg12     1.871610

Fig. 5.11. Sample output for the clock scheduling program described in Section 5.7.2.
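Because the output file is a plain list of register names and clock delays, the skew of any pair of registers follows by subtraction. The Python sketch below reads text in the format of Figure 5.11 and computes TSkew(i, f). The function names are hypothetical, and the register pairs used in the example are chosen only for illustration, since the sample output does not state which registers are sequentially adjacent.

```python
from typing import Dict


def parse_schedule(text: str) -> Dict[str, float]:
    """Parse lines of the form '<hierarchical register name> <clock delay>'."""
    schedule = {}
    for line in text.splitlines():
        fields = line.split()
        if len(fields) != 2:
            continue                   # skip blank or malformed lines
        name, delay = fields
        schedule[name] = float(delay)
    return schedule


def clock_skew(schedule: Dict[str, float], r_i: str, r_f: str) -> float:
    """TSkew(i, f) = t_cd^i - t_cd^f."""
    return schedule[r_i] - schedule[r_f]


sample = """\
Top/Block1/Reg1[7] 3.479695
Top/Block1/Reg143 2.814349
Top/Block1/Reg33A 3.479695
"""
schedule = parse_schedule(sample)
print(clock_skew(schedule, "Top/Block1/Reg1[7]", "Top/Block1/Reg143"))
```

Two registers with equal clock delays, such as `Top/Block1/Reg1[7]` and `Top/Block1/Reg33A` in the sample, have zero skew between them regardless of the absolute value of the delay.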
Experimental Results Two histograms are shown in Figure 5.12 which illustrate the effects of nonzero clock skew on the circuit path delays. The distribution of the path deˆ i,f is shown in Figure 5.12(a). With clock scheduling (non-zero clock lay D PM skew) applied, the effective path delay of each path Ri ;Rf is increased or decreased13 by the amount of clock skew scheduled for that path. This effective path delay distribution is shown in Figure 5.12(b). Note that the net effect of clock skew scheduling is a ‘shift’ of the path delay distribution away from the maximum path delay [from right to left in Figure 5.12(b)]. There are two beneficial effects of that shift of delay in that either the circuit can be run at a lower clock period (or higher clock frequency) or the circuit can operate at the target clock period with a reduced probability of setup and hold time violations (improving the overall system reliability).
¹³ As described previously in this chapter, clock skew can be thought of as adding to (or subtracting from) the path delay.
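The shift of the distribution can be illustrated with a small sketch. The path delays and clock delays below are hypothetical values chosen for illustration (they are not taken from the circuit of Figure 5.12), and the helper name `effective_path_delays` is likewise an assumption. The effective delay of a path is taken as the maximum path delay plus $T_{Skew}(i,f) = t_{cd}^{i} - t_{cd}^{f}$.

```python
def effective_path_delays(paths, t_cd):
    """Return {(i, f): D_PM + T_Skew(i, f)} with T_Skew(i, f) = t_cd[i] - t_cd[f]."""
    return {(i, f): d_max + (t_cd[i] - t_cd[f])
            for (i, f), d_max in paths.items()}


# Hypothetical maximum path delays in femtoseconds.
paths = {("R1", "R2"): 10_000_000, ("R2", "R3"): 4_000_000}

zero_skew = {"R1": 0, "R2": 0, "R3": 0}
scheduled = {"R1": 0, "R2": 2_000_000, "R3": 0}  # assumed clock delays

before = effective_path_delays(paths, zero_skew)
after = effective_path_delays(paths, scheduled)

# The longest effective delay bounds the clock period from below.
worst_before = max(before.values())
worst_after = max(after.values())
print(worst_before, worst_after)
```

Delaying the clock at R2 takes time from the short path R2 to R3 and gives it to the long path R1 to R2, so the largest effective delay moves to the left, as in Figure 5.12(b).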
[Figure 5.12(a): histogram of the number of paths (#) versus the maximum path delay (fs); horizontal axis from 0 fs to 14832240 fs, vertical axis from 0 to 615. Path delay distribution with zero skew (before clock skew scheduling is applied).]
[Figure 5.12(b): histogram of the number of paths (#) versus the maximum path delay (fs); horizontal axis from 0 fs to 14832240 fs, vertical axis from 0 to 851. Path delay distribution after non-zero clock skew is applied.]
Fig. 5.12. The application of clock skew scheduling to a commercial integrated circuit with 6,890 registers [note that the time scale is in femtoseconds, 1 fs = $10^{-15}$ sec = $10^{-6}$ ns].
6 Clock Skew Scheduling of Level-Sensitive Circuits
Level-sensitive circuits are gaining popularity in state-of-the-art high-performance synchronous circuit design due to their smaller size, lower power consumption and faster operation speeds [109, 110, 111]. The timing analysis of level-sensitive circuits, however, is more difficult due to the non-linearity of the timing constraints caused by the transparent latch operation discussed in Section 4.2. Traditionally, the non-linearity of the constraints has been resolved with one of two approaches. In the first approach, analyses which aim to accurately model the effects of time borrowing are considered too optimistic, and time borrowing is fully disregarded in the analysis [69, 72]. In the second, more recent approach, the non-linear constraints of operation are relaxed using iterative solution techniques [73, 94, 99, 100]. The iterative solution techniques are practical for timing analysis where the clock skew values (zero or non-zero) are known. However, these techniques are not applicable to clock skew scheduling, where the clock skew values are themselves to be computed. In this chapter, a linear programming (LP) formulation applicable to the timing analysis of large-scale level-sensitive synchronous circuits is presented. The presented LP formulation accurately models the effects of time borrowing. This LP formulation is computationally efficient due to the linearization of non-linear constraints, and the formulation and solution processes are fully automated.
6.1 Clock Scheduling for Level-Sensitive Circuits
The process of clock skew scheduling for level-sensitive circuits is governed by the timing relationships defined for local data paths of latches. As discussed in Section 5.3, the long and short data path delays defined for local data paths composed of latches are different from those defined for the more traditional local data paths that are composed of flip-flops. Consequently, the timing relationships of local data paths composed of (level-sensitive) latches need to be integrated into the design framework to define the clock skew scheduling process of level-sensitive circuits.
The timing relationships for local data paths with latches are categorized into two sets: operational constraints and constructional constraints. The operational constraints model the operation of a level-sensitive synchronous circuit. The definitions for the operational constraints (called latching, synchronization and propagation constraints, respectively) are derived from the zero clock skew definitions in [69]. The latching, synchronization and propagation constraints for a single-phase synchronization system are described in Section 6.1.1, Section 6.1.2 and Section 6.1.3, respectively. The constructional constraints, called validity and initialization constraints, are defined to ensure the correctness and completeness of the formulation of the presented timing analysis problem. The validity constraints are presented in Section 6.1.4. The initialization constraints are presented in Section 6.1.5.
6.1.1 Latching Constraints
Latching constraints bound the arrival time of the data signal $D_f$ (recall the local data path in Figure 4.15 on page 61) in order to ensure that $D_f$ is latched during the intended clock cycle. The interval for the data arrival time is characterized by the hold time and the setup time requirements of $R_f$ as follows:
    $\delta_H^{L_f} \le a_f$    (6.1)
    $A_f \le T_{CP} - \delta_S^{L_f}$.    (6.2)
Eq. (6.1) constrains the earliest arrival of $D_f$ at $R_f$. The earliest data arrival time must be no earlier than the hold time after the trailing edge of the previous clock cycle. Suppose the $(k+1)$-th clock cycle at latch $R_f$ is illustrated in Figure 4.4 on page 46, where $t_1 = t_{cd}^{f} + kT_{CP}$ [zero in the frame of reference of the $(k+1)$-th cycle]. The hold time is defined by the difference $t_7 - t_6$. If data arrives at $R_f$ earlier than the hold time, a double-clocking hazard occurs. Similarly, (6.2) represents the setup constraint on $R_f$. As shown in Figure 4.4, the data must arrive at the final latch at least the setup time prior to the trailing edge of the clock cycle. Assuming the $(k+1)$-th clock cycle is illustrated in Figure 4.4, the trailing edge of the clock cycle occurs at $t_6 = t_{cd}^{f} + kT_{CP} + C_W^{L}$. Thus, data cannot be latched into $R_f$ during the $(k+1)$-th cycle if the data arrives later than $t_5 = t_{cd}^{f} + kT_{CP} + C_W^{L} - \delta_S^{L_f}$. Late arrival of the data signal results in a zero clocking hazard.
6.1.2 Synchronization Constraints
Synchronization constraints define the departure time of the data signal $Q_i$ from the initial latch of a local data path. The departure time from a latch depends on the state of the latch (transparent or opaque). Implementation-specific register internal delays, $D_{DQ}$ and $D_{CQ}$, affect the departure times in the transparent and opaque states of operation, respectively. The earliest departure time $d_i$ of $Q_i$ from $R_i$ is defined in (6.3). The latest departure time $D_i$ is defined by (6.4):
    $d_i = \max\left( a_i + D_{DQm}^{L_i},\; T_{CP} - C_W^{L} + D_{CQm}^{L_i} \right)$,    (6.3)
    $D_i = \max\left( A_i + D_{DQM}^{L_i},\; T_{CP} - C_W^{L} + D_{CQM}^{L_i} \right)$.    (6.4)
An exhaustive inspection of all possible cases of earliest and latest departure times during the $k$-th clock cycle is shown in Figure 6.1. The time intervals for the arrival and departure times are illustrated by the upper and lower parallel dotted lines, respectively. The left and right ends of these dotted lines in the figure correspond to earliest and latest times, respectively. The lengths of the white and black rectangular boxes correspond to the clock-to-output and data-to-output latch delays, respectively. Note that cases V through VIII may exhibit timing hazards.
[Figure 6.1: eight timing diagrams, Case I through Case VIII, each showing the $k$-th clock cycle between $t_{cd}^{i} + (k-1)T_{CP}$ and $t_{cd}^{i} + kT_{CP}$.]
Fig. 6.1. Possible cases for the arrival and departure times of data at the initial latch.
Consider (6.3), which describes the earliest departure time of the data signal $Q_i$ from latch $R_i$. The first term of the max function, $a_i + D_{DQm}^{L_i}$, describes the time instant when the input data arrival occurs at its earliest time
during the active phase of the clock signal $C_i$. The data signal immediately propagates through the latch (as illustrated in cases I and VIII of Figure 6.1). In these cases, the earliest departure time $d_i$ from $R_i$ depends on the earliest arrival time $a_i$ of the data signal and the time $D_{DQ}^{L_i}$ it takes for the data to appear at the output terminal of $R_i$.
The second term of the max function, $T_{CP} - C_W^{L} + D_{CQm}^{L_i}$, refers to the case when the earliest data arrival time occurs during the opaque phase of $R_i$. In the opaque phase of operation, the departure time of the data signal from the initial latch occurs a clock-to-output delay $D_{CQ}^{L_i}$ later than the leading edge of the clock signal. Such data propagation is illustrated in cases II through VII of Figure 6.1. The max function is used to combine these cases and to define the earliest departure time $d_i$ from the initial latch $R_i$. Similar reasoning applies to the derivation of the latest departure time $D_i$ defined by (6.4).
6.1.3 Propagation Constraints
Propagation constraints define the arrival time of the data signal $D_f$ at the final latch $R_f$ of a local data path. These constraints are as follows:
    $a_f = \min_i \left[ d_i + \hat{D}_{Pm}^{i,f} + T_{Skew}(i,f) - T_{CP} \right]$    (6.5)
    $A_f = \max_i \left[ D_i + \hat{D}_{PM}^{i,f} + T_{Skew}(i,f) - T_{CP} \right]$.    (6.6)
For each incoming path to latch $R_f$, the lower bound for $a_f$ is individually calculated using the expression $d_i + \hat{D}_{Pm}^{i,f} + T_{Skew}(i,f) - T_{CP}$. The minimum of the arrival times among the incoming data paths is assigned as the earliest arrival time at $R_f$. The latest arrival time $A_f$ for the data signal is defined similarly. In case of multiple data paths fanning into $R_f$, the maximum of the arrival times among the incoming data paths is the latest arrival time of the data signal at $R_f$. These two facts are implied in the formulation by the inclusion of the min and max functions in (6.5) and (6.6), respectively.
The propagation constraints are illustrated on a sample synchronous circuit in Figure 6.2. Note that in Figure 6.2, two local data paths starting at the latches $R_{i_1}$ and $R_{i_2}$ and ending at $R_f$ are considered. The time intervals for the arrival and departure times of the data signal are illustrated by the upper and lower parallel dotted lines, respectively. The lengths of the white and black rectangular boxes correspond to the clock-to-output and data-to-output latch delays, respectively. The earliest arrival time is illustrated on the data path $R_{i_1} \leadsto R_f$. The data signal departs from $R_{i_1}$ at time $d_{i_1}$ and propagates on the data path $R_{i_1} \leadsto R_f$ for a time period of $\hat{D}_{Pm}^{i_1,f}$. The earliest data arrival time $d_{i_1} + \hat{D}_{Pm}^{i_1,f} + T_{Skew}(i_1,f) - T_{CP}$ observed on this data path is earlier than the arrival time $d_{i_2} + \hat{D}_{Pm}^{i_2,f} + T_{Skew}(i_2,f) - T_{CP}$ observed on the only other incoming path to $R_f$, $R_{i_2} \leadsto R_f$. Hence, the earliest data arrival time $a_f$ at $R_f$ is defined by the propagation on the $R_{i_1} \leadsto R_f$ data path. Similarly, on the data path $R_{i_2} \leadsto R_f$, a maximum data propagation time of $\hat{D}_{PM}^{i_2,f}$ elapses, conferring the latest data arrival time at $R_f$, $A_f = D_{i_2} + \hat{D}_{PM}^{i_2,f} + T_{Skew}(i_2,f) - T_{CP}$.
[Figure 6.2: timing diagram of the clock signals $C_{i_1}$, $C_f$ and $C_{i_2}$ over the $k$-th and $(k+1)$-th clock cycles, annotated with the arrival and departure times $a$, $A$, $d$, $D$ at each latch and the path delays $D_{Pm}^{i_1,f}$, $D_{PM}^{i_1,f}$, $D_{Pm}^{i_2,f}$, $D_{PM}^{i_2,f}$; $T_{Skew}(i_1,i_2) < 0$, $T_{Skew}(i_1,f) > 0$, $T_{Skew}(i_2,f) > 0$.]
Fig. 6.2. Propagation of the data signal in a simple circuit.
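The operational constraints (6.1) through (6.6) can be evaluated directly once the clock period and the clock skews are known. The following Python sketch does so for one final latch with two fan-in paths. All numerical values, the function names and the tuple layout of `fan_in` are assumptions made for illustration; the time origin of each latch is the local frame used in (6.3) and (6.4), in which the leading edge of the clock occurs at $T_{CP} - C_W^{L}$.

```python
def departure_times(a, A, T_cp, C_w, D_dq=(0, 0), D_cq=(0, 0)):
    """Eqs. (6.3) and (6.4): earliest and latest departure from a latch.

    D_dq and D_cq are (minimum, maximum) data-to-output and
    clock-to-output delays of the latch.
    """
    lead = T_cp - C_w  # leading (opening) edge in the local frame
    d = max(a + D_dq[0], lead + D_cq[0])
    D = max(A + D_dq[1], lead + D_cq[1])
    return d, D


def arrival_times(fan_in, T_cp):
    """Eqs. (6.5) and (6.6) for one final latch.

    fan_in holds one tuple (d_i, D_i, D_Pm, D_PM, T_Skew(i, f)) per path.
    """
    a_f = min(d + d_pm + skew - T_cp for d, _, d_pm, _, skew in fan_in)
    A_f = max(D + d_pM + skew - T_cp for _, D, _, d_pM, skew in fan_in)
    return a_f, A_f


def latching_ok(a_f, A_f, T_cp, setup=0, hold=0):
    """Eqs. (6.1) and (6.2): hold and setup checks at the final latch."""
    return hold <= a_f and A_f <= T_cp - setup


# Hypothetical single-phase clock: period 10, transparent for the last 5 units.
T_cp, C_w = 10, 5
d1, D1 = departure_times(a=2, A=4, T_cp=T_cp, C_w=C_w)  # arrives while opaque
d2, D2 = departure_times(a=6, A=7, T_cp=T_cp, C_w=C_w)  # arrives while transparent

fan_in = [(d1, D1, 6, 8, 0), (d2, D2, 7, 9, 1)]
a_f, A_f = arrival_times(fan_in, T_cp)
print(a_f, A_f, latching_ok(a_f, A_f, T_cp, setup=1, hold=1))
```

In this instance the first path sets the earliest arrival time and the second path sets the latest arrival time, which is the situation depicted in Figure 6.2.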
The departure of $Q_i$ and the arrival of $D_f$ must occur during two consecutive clock cycles for proper circuit operation. In order to switch between the frames of reference of these two cycles, the phase shift operator $\phi_{if}$ is used. The phase shift operator evaluates to $\phi_{if} = T_{CP}$ for single-phase synchronization as discussed in Section 4.9. Thus, the clock period $T_{CP}$ is subtracted from the calculated arrival time in order to shift the point of reference of the data arrival time at $R_f$ to the beginning of the previous clock cycle.
6.1.4 Validity Constraints
The definitions of the parameters $a_f$, $A_f$, $d_f$ and $D_f$ require the value of $a_f$ ($d_f$) to be smaller than or equal to the value of $A_f$ ($D_f$):
    $A_f \ge a_f$    (6.7)
    $D_f \ge d_f$.    (6.8)
While the operational constraints introduced in the preceding sections model the timing properties of the circuit, the required sequentiality in time of the
referred variables is not explicitly enforced. Consistency in the definitions of $a_f$, $A_f$, $d_f$ and $D_f$ must be maintained through post-solution checks or by including additional constraints. A solution leading to a result where $a_f > A_f$, for instance, is incorrect and must be discarded. Introducing the validity constraints [(6.7) and (6.8)] in the problem formulation is preferred over performing post-solution checks for two primary reasons. The first reason is to gain the ability to easily detect the feasibility of the problem. The second reason is to preserve the automation of the solution procedure.
6.1.5 Initialization Constraints
The LP model clock skew scheduling problem is formulated in order to minimize the clock period of a synchronous circuit. Besides the minimum clock period, it may also prove essential to accurately calculate the nominal data arrival and departure times for each register. The initialization constraints are introduced for this purpose; they lead to a consistent timing schedule for the data signal propagation in a level-sensitive synchronous circuit. After clock skew scheduling, the feasible (or optimal) solution set for one or more variables can be a range of values rather than a specific value. For instance, suppose that the earliest arrival time of a data signal at an arbitrary latch $R_k$ can take any value in the interval $1.8 \le a_k \le 2.3$ without changing the minimum clock period of the circuit. For consistency, it is preferable to assign the smallest value to the earliest arrival time ($a_k = 1.8$). In general, it is better to assign the smallest possible values to the earliest arrival and departure time variables and the largest possible values to the latest arrival and departure time variables (where applicable). Such an assignment provides a more comprehensive representation of data propagation (and sensitivity information [112]) in the system.
Identification of the sensitivity information is useful to check for the consistency of the timing schedule generated by the LP problem (if necessary), as will be briefly discussed in Section 11.1.2. Note that the earliest and latest data arrival times at all registers, except for the input registers, are set to their lowest and highest possible values, respectively. These assignments are enforced by the propagation constraints [(6.5) and (6.6)]. The values assigned to the earliest and latest data arrival times ($a$, $A$) at the input registers do not affect the minimum clock period unless the assigned values cause the departure times to change. It may even be considered redundant to define earliest and latest arrival time variables ($a$, $A$) at the input registers, as the non-local data paths do not affect the circuit timing directly. For consistency and completeness of the generated timing schedule, the data arrival times at the input registers are defined and the following constraints are included in the LP formulation for each input register $R_l$:
    $A_l = d_l - \left( D_{CQ}^{L_l} \text{ or } D_{DQ}^{L_l} \right), \quad \forall R_l : |\text{Fan-in}(R_l)| = 0$.    (6.9)
6.2 Iterative Approach to Clock Skew Scheduling
The operational constraints provide a system of equations defining the timing operation of a level-sensitive synchronous circuit. Different versions of the constraints presented in Sections 6.1.1, 6.1.2 and 6.1.3 have been used by designers in order to develop timing analysis models for zero clock skew, level-sensitive circuits. The set of constraints initially defined for the clock period minimization problem of a conventional zero clock skew problem in [69] is known as the SMO formulation [75]. A popular timing analysis approach for level-sensitive circuits is presented in [72, 73, 75, 94] based on the SMO formulation. This timing analysis approach involves several algorithms targeting clock period verification and minimization problems, all based on the analytical framework described in Sections 6.1.1, 6.1.2 and 6.1.3. The proposed algorithms are iterative. In particular, very small values are assigned to the timing variables of a circuit and the circuit is investigated for timing violations by iteratively incrementing the values of the timing variables. It is important to note that the clock delay values $t_{cd}^{i}$ and consequently the clock skew values $T_{Skew}(i,f)$ are predetermined numerical values in these algorithms. Thus, these iteration-based algorithms do not support clock skew scheduling. The iterative algorithm proposed in [73] for the clock period minimization problem of level-sensitive circuits is presented in Figure 6.3. In the algorithm, $r$ is the number of registers in the synchronous circuit. The $a$, $d$, $A$ and $D$ vectors are the earliest arrival/departure and latest arrival/departure times, respectively, where the superscript prev identifies the value of a variable in the previous clock cycle. The variables SetupVio and HoldVio hold the timing violation information for each register.
In this algorithm, the arrival times are initialized to $a_i = A_i = -\infty$, where the algorithm simulates the start-up timing of the circuit. At each iteration step, the execution of the circuit at a clock cycle is simulated. Finally, once the arrival and departure times of the latches are determined, the algorithm checks for potential setup and hold time violations. The algorithm presented in Figure 6.3 has been shown to converge to solutions relatively quickly [73]. The algorithm complexity is reported as $O(|r||p|)$, where $|r|$ is the number of latches in a circuit and $|p|$ is the number of edges of a circuit graph (recall from Section 5.2.2 that the number of edges of a circuit graph is the number of local data paths). However, it has been proved in [72] that in case of data path loops (sequential feedback) in the synchronous circuit, the arrival and departure times might increase without bound. This leads to a setup violation and the described algorithm fails to provide reasonable run times. In [94], a correction is offered to the algorithm. This correction is based on the assumption that a data path loop in the circuit can be detected in $|r|$ iterations. Thus, the algorithm is modified to artificially limit the number of iteration steps to $|r|$. The complexity of the modified algorithm is cubic in the number of registers $|r|$, as each of the at most $|r|$ iterations involves examining up to $|p|$ edges, and $|p|$ is at most $|r|^2$. The iterative algorithm presented in Figure 6.3 is later modified to account for more advanced timing features or data models, such as crosstalk [100] and statistical timing analysis [113]. Although the iterative algorithm provides an initial and useful formulation for the timing analysis of level-sensitive circuits, it does not constitute a framework amenable to general timing analysis problems or clock skew scheduling.
    // Initialize the latch arrival times
    for i = 1 to |r| {
        A_i^prev = a_i^prev = −∞;
        // iterate the evaluation of the departure and arrival time
        // equations until convergence or a maximum of |r| iterations
        iter = 0;
        repeat
            iter = iter + 1;
            // update the latch departure times based on the latch
            // arrival times computed in the previous iteration
            for i = 1 to |r| {
                D_i = max(A_i^prev, φ_i + D_i);
                d_i = max(a_i^prev, φ_i + d_i);
            };
            // update the latch arrival times based on the just-computed
            // latch departure times
            for i = 1 to |r| {
                A_i = max_j(D_j + D_PM);
                a_i = min_j(d_j + D_Pm);
            };
        until (((A_i = A_i^prev) && (a_i = a_i^prev)) || (iter + 1 > |r|));
    };
    // check and record setup and hold violations
    for i = 1 to |r| {
        SetupVio[i] = A_i > T_CP − δ_S^{L_i} + d_i;
        HoldVio[i] = a_i < δ_H^{L_i} + D_i;
    };
Fig. 6.3. The iterative algorithm for static timing analysis of level-sensitive circuits.
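The flavor of such an iteration can be conveyed with a short Python sketch. This is not the algorithm of [73] or [94]; it is a simplified fixed-point iteration written against equations (6.3) through (6.6) of this chapter, with fixed clock delays, a single value for each latch internal delay, and the violation checks taken from (6.1) and (6.2). The circuit, its delays and the function names are hypothetical.

```python
import math


def iterate_timing(latches, paths, t_cd, T_cp, C_w, D_cq=0, D_dq=0):
    """Iterate (6.3)-(6.6) from a = A = -inf for at most |r| steps.

    paths maps (i, f) -> (D_Pm, D_PM); t_cd maps latch -> clock delay.
    """
    lead = T_cp - C_w  # leading edge in the local frame of each latch
    fan_in = {f: [(i, dm, dM) for (i, g), (dm, dM) in paths.items() if g == f]
              for f in latches}
    a = {r: -math.inf for r in latches}
    A = {r: -math.inf for r in latches}
    d, D, converged = {}, {}, False
    for _ in range(len(latches)):  # iteration cap of |r|
        # departure times from the arrival times of the previous step
        d = {r: max(a[r] + D_dq, lead + D_cq) for r in latches}
        D = {r: max(A[r] + D_dq, lead + D_cq) for r in latches}
        # arrival times from the just-computed departure times
        new_a, new_A = dict(a), dict(A)
        for f, edges in fan_in.items():
            if not edges:
                continue  # input latch: no local data path ends here
            new_a[f] = min(d[i] + dm + t_cd[i] - t_cd[f] - T_cp
                           for i, dm, _ in edges)
            new_A[f] = max(D[i] + dM + t_cd[i] - t_cd[f] - T_cp
                           for i, _, dM in edges)
        converged = (new_a == a and new_A == A)
        a, A = new_a, new_A
        if converged:
            break
    return a, A, d, D, converged


def violations(a, A, paths, T_cp, setup=0, hold=0):
    """Setup and hold checks of (6.2) and (6.1) on latches with fan-in."""
    driven = {f for (_, f) in paths}
    setup_vio = sorted(f for f in driven if A[f] > T_cp - setup)
    hold_vio = sorted(f for f in driven if a[f] < hold)
    return setup_vio, hold_vio


# Hypothetical circuit: input latch R3 feeding a two-latch loop R1 <-> R2.
latches = ["R1", "R2", "R3"]
paths = {("R3", "R1"): (6, 7), ("R1", "R2"): (7, 12), ("R2", "R1"): (5, 6)}
t_cd = {"R1": 0, "R2": 0, "R3": 0}  # zero clock skew

a, A, d, D, converged = iterate_timing(latches, paths, t_cd, T_cp=10, C_w=5)
print(converged, A["R2"], D["R2"], violations(a, A, paths, T_cp=10))
```

In this instance the long path from R1 to R2 delivers data after the leading edge at R2, so the latest departure time at R2 follows the arrival time rather than the clock edge, which is time borrowing. Because the clock delays enter only as fixed numbers in `t_cd`, an iteration of this kind verifies a given schedule but does not compute one.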
6.3 Linearization of the Timing Analysis
The non-linear max and min functions in the constraints shown in (6.3), (6.4), (6.5) and (6.6) present a major challenge in solving the clock skew scheduling
problem. A method is introduced in this chapter in order to replace the non-linear constraints with linear constraints. Although the two models are not theoretically equivalent, it is demonstrated that the same results are obtained with the original non-linear programming (NLP) model and the novel linear programming (LP) model in experimentation with ISCAS'89 benchmark circuits. The proposed linearization method is described in Section 6.3.1. The LP model for the clock period minimization problem of non-zero clock skew, level-sensitive circuits is offered in Section 6.3.2.
6.3.1 Modified Big M (MBM) Method
The linearization of constraints which exhibit non-linear behavior is a commonly applied procedure in operations research [112]. When possible, non-linear constraints are manipulated to derive linear constraints, which are inherently easier to solve. In this work, a collection of linearization procedures is applied to the non-linear constraints of the timing analysis problem. The collection of these procedures is called the Modified big M (MBM) method. The name is chosen because the research was inspired by the "big M method" [112]. The big M method is a special case of the simplex algorithm [112] and is applied to a completely different set of problems than the MBM method. The only similarity between the big M method and the MBM method is the use of the constant $M$ in both methods. The constant $M$ symbolically represents a sufficiently large positive number used to assign an overwhelmingly large penalty to a variable in the objective function in order to increase the priority of the variable in the optimization process. The collection of linearization procedures composing the MBM method is presented in Table 6.1.
For a minimization type LP problem subject to constraints that have min and max functions, the transformations listed in Table 6.1 are applied to replace non-linear constraints with linear constraints. Note that only relevant constraints and relevant terms of the objective function are included in Table 6.1. Define a finite set $N$, consisting of the variables $N = \{a, b, c, \ldots, n\}$. Consider all variables in the finite set $N$ to be elements of the set of real numbers, $N = \{a, b, c, \ldots, n\} \subset \mathbb{R}$. The objective function $Z$ is a linear function of the variables $\{a, b, c, \ldots, n\}$ and is defined $Z : \mathbb{R}^{|N|} \to \mathbb{R}$. There are no limitations on variables being inter-dependent, provided the linearity of the constraints is preserved.
Table 6.1. Modified Big M transformations.
    Original                 Linearized
    min Z              →     min (Z + Ma)
    a = max(b, c)      →     a ≥ b
                             a ≥ c

    min Z              →     min (Z − Ma)
    a = min(b, c)      →     a ≤ b
                             a ≤ c
Two different linearization scenarios are presented in Table 6.1. In the first scenario [linearization of the $a = \max(b, c)$ expression], the variable $a$ is constrained to be the greater of the variables $b$ and $c$. The constraint is replaced with two new constraints, explicitly requiring the variable $a$ to be greater than or equal to the variables $b$ and $c$. The initial constraint and the relaxed constraints are equivalent if either of the following conditions holds:
1. The equality condition is observed for at least one of the inequalities, while the other inequality holds,
2. The equality condition is observed for both inequalities.
The cost function denoted by the product $Ma$ is added to the objective function. The product $Ma$ is overwhelmingly large with respect to the other cost functions in the objective function as a result of its heavily weighted cost figure (recall the very large coefficient $M$). Thus, $Ma$ is given the highest priority in the minimization process. As a result, the greater of the variables $b$ and $c$ is assigned to variable $a$. The relaxation method in the second scenario [linearization of the $a = \min(b, c)$ expression] is also presented in Table 6.1. In this case, the cost function $Ma$ is subtracted from the objective function so that the maximum permissible value is assigned to the variable $a$. Similar to its implementation in the big M method, the constant $M$ is defined to be sufficiently large, but as small as possible. The selection of a value for the constant $M$ depends on the solution space of a specific problem (problem constraints) and the objective function $Z$. Typically, the number $M$ must be chosen significantly larger than the values of any parameter in the problem.
However, selection of an extremely large $M$ may cause the LP solver to fail drastically [114]. Thus, a sufficiently large number is desired that provides the described minimization characteristic without degrading the performance of the solution mechanism.
6.3.2 Linear Programming (LP) Model
An LP model of the clock period minimization problem is generated through the application of the MBM method. There are five sets of constraints in the LP model. These sets are the latching [(6.1) and (6.2)], synchronization [(6.3) and (6.4)], propagation [(6.5) and (6.6)], validity [(6.7) and (6.8)] and initialization [(6.9)] constraints. The finalized LP model for the clock period minimization problem is shown in Table 6.2.
Table 6.2. LP model clock skew scheduling problem of level-sensitive circuits.
LP Model
    min $T_{CP} + M \left[ \sum_{\forall R_j} (d_j + D_j) + \sum_{\forall R_k : |\text{Fan-in}(R_k)| \ge 1} (A_k - a_k) \right]$
subject to
    (i)    $a_f \ge \delta_H^{L_f}$    [Latching-Hold time]
    (ii)   $A_f \le T_{CP} - \delta_S^{L_f}$    [Latching-Setup time]
    (iii)  $d_i \ge a_i + D_{DQm}^{L_i}$
           $d_i \ge T_{CP} - C_W^{L} + D_{CQm}^{L_i}$    [Synchronization-Earliest time]
    (iv)   $D_i \ge A_i + D_{DQM}^{L_i}$
           $D_i \ge T_{CP} - C_W^{L} + D_{CQM}^{L_i}$    [Synchronization-Latest time]
    (v)    $a_f \le d_{i_1} + D_{Pm}^{i_1,f} + T_{Skew}(i_1,f) - T_{CP}$
           ...
           $a_f \le d_{i_n} + D_{Pm}^{i_n,f} + T_{Skew}(i_n,f) - T_{CP}$    [Propagation-Earliest time]
    (vi)   $A_f \ge D_{i_1} + D_{PM}^{i_1,f} + T_{Skew}(i_1,f) - T_{CP}$
           ...
           $A_f \ge D_{i_n} + D_{PM}^{i_n,f} + T_{Skew}(i_n,f) - T_{CP}$    [Propagation-Latest time]
    (vii)  $A_f \ge a_f$    [Validity-Arrival time]
    (viii) $D_f \ge d_f$    [Validity-Departure time]
    (ix)   $A_l = d_l - (D_{CQm}^{L_l} \text{ or } D_{DQm}^{L_l}), \; \forall R_l : |\text{Fan-in}(R_l)| = 0$    [Initialization]
The latching, validity and initialization constraints exhibit linear behavior. Therefore, these constraints remain unchanged in both the LP and NLP models as shown in constraints (i-ii, vii-ix) of the formulation. The synchronization constraints, however, are formed by the max function and exhibit non-linear behavior. The MBM method is used on the synchronization constraints in order to generate linear constraints for the LP model problem (constraints iii and iv). For instance, (iii) depicts the replacement of the non-linear constraint presented in (6.3) with two linear constraints, where $d_i$ is greater than or equal to both operands of the max function, $a_i + D_{DQm}^{L_i}$ and $T_{CP} - C_W^{L} + D_{CQm}^{L_i}$. Note that the cost function $M d_i$ is added to the objective function. The propagation constraint on the latest data arrival time (6.6) exhibits a non-linearity similar to that of the synchronization constraints, in that the max function is used. The linearized propagation constraints in the LP model are shown in (vi). In the LP model, the variable $A_f$ is greater than or equal to the expressions $D_i + D_{PM}^{i,f} + T_{Skew}(i,f) - T_{CP}$, evaluated for
each fan-in path of register $R_f$. In the formulation, fan-in paths of $R_f$ are indexed by the parameter $n$. Unlike the other non-linear constraints in the formulation, the propagation constraint on the earliest arrival time $a_f$ is modeled by the min function. In this type of linearization, $a_f$ is set to be less than or equal to each operand of the min function. As shown in (v), the expressions $d_i + D_{Pm}^{i,f} + T_{Skew}(i,f) - T_{CP}$ evaluated for each fan-in path of register $R_f$ are included in the finalized LP model.
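As a small worked illustration of the transformations in Table 6.1, consider the two scenarios with the operands fixed at the assumed values $b = 2$ and $c = 5$, and with $Z$ taken not to depend on $a$ (both are assumptions made only to keep the example short). The penalty term alone then determines $a$:

```latex
\begin{aligned}
&\text{Scenario 1 (max):} && \min\,(Z + M a) \quad \text{s.t.}\quad a \ge 2,\; a \ge 5,\qquad M > 0 \\
& && \Rightarrow\; a^{\ast} = 5 = \max(2, 5), \text{ since any } a > 5 \text{ adds } M(a - 5) > 0 \text{ to the objective.} \\[6pt]
&\text{Scenario 2 (min):} && \min\,(Z - M a) \quad \text{s.t.}\quad a \le 2,\; a \le 5,\qquad M > 0 \\
& && \Rightarrow\; a^{\ast} = 2 = \min(2, 5), \text{ since any } a < 2 \text{ adds } M(2 - a) > 0 \text{ to the objective.}
\end{aligned}
```

In both scenarios one of the two inequalities holds with equality at the optimum, which is the first equivalence condition listed in Section 6.3.1. When $Z$ itself depends on $a$, the same outcome requires $M$ to dominate the coefficient of $a$ in $Z$, which is why $M$ is chosen large relative to the other parameters of the problem.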
6.4 An Example and Experimental Results
The circuit network shown in Figure 6.4 is analyzed in order to illustrate the application of the proposed linearization procedure. Without affecting the generality of the solution, zero setup and hold times and zero internal delays are considered ($\delta_S^{L_i} = \delta_H^{L_i} = D_{CQ} = D_{DQ} = 0$). A single-phase synchronization scheme with a 50% duty cycle is selected as shown in Figure 6.5. Given single-phase synchronization under zero and non-zero clock skew operation, the clock period minimization problems of three different synchronous circuits with the same circuit topology are formulated. These circuits are:
1. Zero clock skew, edge-sensitive circuit,
2. Zero clock skew, level-sensitive circuit,
3. Non-zero clock skew, level-sensitive circuit.
The simplest (in terms of timing analysis) circuit is the zero clock skew, edge-sensitive circuit. This circuit is used as the basis of comparison for the other circuits. The minimum clock period of a zero clock skew, edge-sensitive circuit is defined by the maximum data propagation time in the circuit [96]. Thus, the synchronous circuit network presented in Figure 6.4 has a minimum clock period of $T_{CP} = D_{PM}^{3,2} = 7$ (time units) when used with edge-triggered flip-flops. The second synchronous circuit of interest is the zero clock skew, level-sensitive circuit. In order to design a level-sensitive synchronous circuit, each flip-flop in the given circuit topology is replaced with a level-sensitive latch. Zero clock skew, level-sensitive circuits exhibit improved circuit performance due to time borrowing. Clock skew scheduling is applied to the zero clock skew, level-sensitive circuit to generate the non-zero clock skew, level-sensitive circuit. This circuit exhibits performance improvement due to the simultaneous consideration of time borrowing and clock skew scheduling.
[Figure 6.4: circuit graph of four registers R1, R2, R3 and R4; the local data paths are labeled with [minimum, maximum] propagation delays [3, 4], [2.9, 3], [5, 7], [3, 4] and [2.5, 5].]
Fig. 6.4. A simple synchronous circuit.
[Figure 6.5: waveform of the clock signal $C_{source}$ with period $T_{CP}$, $C_W^{L} = T_{CP}/2$ and $\phi = T_{CP}/2$.]
Fig. 6.5. A single-phase synchronization clock with a 50% duty cycle.
[Figure 6.6: clock waveforms $C_1$ through $C_4$ for zero clock skew ($T_{CP} = 4.66$, left) and non-zero clock skew ($T_{CP} = 4.05$, right), with the following data arrival times on the critical paths.]
    Critical Path     Zero Skew                       Non-Zero Skew
    R1 → R3           A3 = 1.66 = D1 + 4 − 4.66       A3 = 2.025 = D1 + 4 + (0.05 − 0) − 4.05
    R3 → R2           A2 = 4.66 = D3 + 7 − 4.66       A2 = 4.05 = D3 + 7 + (0 − 0.925) − 4.05
Fig. 6.6. Zero and non-zero clock skew timing schedules for the level-sensitive circuit in Figure 6.4.
The clocking schedules and the data propagation on the critical paths of the circuit in Figure 6.4 are shown in Figure 6.6. In Figure 6.6, the clocking schedule for the zero clock skew circuit is shown on the left, with a minimum clock period of $T_{CP} = 4.66$. The non-zero clock skew schedule, with a minimum clock period of $T_{CP} = 4.05$, is shown on the right. For non-zero clock skew scheduling, the optimal clock signal delays at the registers are $t_{cd}^{1} = 0.05$, $t_{cd}^{2} = 0.925$, $t_{cd}^{3} = 0$ and $t_{cd}^{4} = 0.475$. The arrows represent data signal propagation on the respective critical paths. Note that unlike the case
presented in Figure 6.6, the critical paths for zero and non-zero clock skew scheduling need not be identical. In the analysis, the minimum clock period for the zero clock skew, level-sensitive circuit is calculated as 4.66 (time units), which is a 33% improvement over the zero clock skew, edge-sensitive synchronous circuit. Note that the percentage improvement is calculated by the expression $100(T_{old} - T_{new})/T_{old}$. As stated earlier, clock skew scheduling is applied to the level-sensitive circuit in order to generate the non-zero clock skew, level-sensitive circuit. The calculated minimum clock period of 4.05 for the non-zero clock skew, level-sensitive circuit is a 13% improvement over the zero clock skew, level-sensitive circuit and a 42% improvement over the zero clock skew, edge-sensitive circuit. Note that the 13% improvement is due only to clock skew scheduling, while the 42% improvement is due to both time borrowing and clock skew scheduling. Further analysis of the effects of time borrowing and clock skew scheduling on circuit timing is presented in Section 11.1.
6.4.1 Level-Sensitive Synchronous Circuit State of Operation
The presence of data path loops (cycles) and transient state errors are two major issues that need to be identified in the timing analysis of level-sensitive circuits. As discussed in Section 6.2, the iterative algorithm offered in [73] suffers from excessive run times and produces false negative outputs in the presence of data path loops [99]. In [99], modifications are offered for the iterative algorithm in order to detect and handle the effects of data path loops in the circuit. Also in [99], it has been shown that synchronous circuits are prone to transient state errors. The transient state errors occur due to the non-unique solution sets of the problem parameters, discussed (within a different context) in Section 6.1.5. In circuits under transient state errors, setup violations occur in certain registers after the system is initiated from a reset state.
The arrival and departure times may not be stable at start-up, in which case these times change during the initial clock cycles, constituting the transient state. As circuit operation progresses in time, the arrival and departure times converge to their steady-state values. There are two major conventions in evaluating the transient errors and determining the steady-state behavior. The first convention overlooks the transient errors and presumes that the departure times converge to the opening edge of the driving clock, which is the expected schedule for the steady state of operation. The second convention is stricter in that transient state errors are not permitted. The first convention is more common and leads to a generally acceptable solution unless the transient state operation of the level-sensitive circuit is decisive to overall circuit operation. If the second convention is adopted, the reset state is preferably extended until the steady state of operation is reached [99]. The LP model in Table 6.2 assumes the transient-state operation of a level-sensitive circuit to be negligible. The aim of the generated model is to
solve for the steady-state timing scheduling problem. The simplex algorithm-based LP solver directs the gradual advancement of parameter values as they are enforced by the LP model. Previously offered algorithms are vulnerable to potential fallacies caused by data path loops due to their iterative nature. In the LP model, complications posed by the presence of data path loops are resolved within the mechanics of the LP solver without significantly affecting the run time or quality of the solution. If the problem remains feasible, the timing parameters for the steady state operation of the circuit are calculated.

In order to illustrate the described phenomenon, the steady-state optimal timing schedule for the ISCAS’89 benchmark circuit s27 is presented in Figure 6.7. The simplifications $D_{Pm}^{i,f} = D_{PM}^{i,f}, \forall R_i \leadsto R_f$ and $\delta_S^{L_i} = \delta_H^{L_i} = D_{CQ}^{L_i} = D_{DQ}^{L_i} = 0$ are considered. The circuit s27 has one input register and a data path loop consisting of two other registers. The data signal departs from input register $R_3$ and perpetually propagates on the loop between $R_1$ and $R_2$. The minimum clock period is calculated to be 4.1, where the pre-computed data propagation times are indicated on the circuit graph.

In Figure 6.7, the data propagations occurring on all data paths of the s27 benchmark circuit are analyzed. As defined in Section 4.6.2, the subscripts to the clock signal indicate the register being synchronized by the clock signal. The clock signals most likely are not aligned in time due to the non-identical clock delays to their respective destination registers. The clock signal $C_3$ at the input register $R_3$ has no delay in time with respect to the clock signal at the clock source ($t_{cd}^3 = 0$). Hence, the origin of the clock signal at the source is aligned with the origin of $C_3$. The clock signals $C_1$ and $C_2$, however, are shifted in time by $t_{cd}^1 = 3.8$ and $t_{cd}^2 = 1.3$ relative to the origin of the clock signal at the source.
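The relation between the clock delays quoted above and the clock skews annotated in Figure 6.7 can be checked with a few lines of arithmetic. The following is only a minimal sketch: the helper names `skew` and `cycle_starts` are illustrative and not from the text, and the skew is taken as the difference of the clock delays of the initial and final registers.

```python
# Clock period and clock delays reported for the s27 example.
T_CP = 4.1
t_cd = {1: 3.8, 2: 1.3, 3: 0.0}   # clock delays to R1, R2, R3

def skew(i, f):
    """Clock skew of the path R_i -> R_f as the difference of clock delays."""
    return t_cd[i] - t_cd[f]

def cycle_starts(reg, cycles=3):
    """Start times of consecutive clock cycles of C_reg on the global time axis."""
    return [round(t_cd[reg] + n * T_CP, 2) for n in range(cycles)]

for i, f in [(3, 1), (3, 2), (1, 2)]:
    print(f"TSkew({i}, {f}) = {skew(i, f):+.1f}")
for reg in (1, 2, 3):
    print(f"C{reg} cycle starts: {cycle_starts(reg)}")
```

The printed skews and cycle start times should agree with the labels on the clock rows of Figure 6.7, where each clock signal is simply the source clock shifted by its delay.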
The horizontal axis of Figure 6.7 represents the time, where the beginning $(k - 1)T_{CP}$ of the k-th clock cycle of $C_3$ is defined as the local time reference, with an assigned value of zero. In Figure 6.7, the numbers associated with the leading (enabling) and trailing (latching) edges of the clock signals label the times with respect to the local time reference. The arrows illustrate the propagation between the registers and are drawn to scale. Illustration of the data propagation on three consecutive clock cycles is sufficient to analyze the behavior of the data path loop of the benchmark circuit s27. Arbitrary cycles labeled the k-th, (k + 1)-th and (k + 2)-th clock cycles are selected. The solid arrows represent the data propagation during the selected clock cycles. For instance, the propagation between $R_3$ and $R_1$ is represented by the arrows initiating from the $C_3$ row at times 2.05 and 6.15, and concluding at the $C_1$ row at times 8.65 and 12.75, respectively. Data propagation on the data path loop between the registers $R_1$ and $R_2$ is visible by the cross-structured arrows initiating and concluding in the corresponding clock signal rows. Note that the calculated nominal arrival and departure times are illustrated on the circuit graph, inside the boxes associated with each node. In the steady state of operation, the departure times of the registers that constitute a data path loop converge to the beginning of their respective clock
[Figure 6.7: timing diagram of the clock signals C1, C2 and C3 over the k-th, (k + 1)-th and (k + 2)-th clock cycles, drawn against the global time axis from (k − 1)TCP to (k − 1)TCP + 16.4, with TSkew(3, 1) = −3.8, TSkew(3, 2) = −1.3 and TSkew(1, 2) = 2.5, together with the circuit graph of R1, R2 and R3. Node labels: a1 = 0.75, A1 = 2.05, d1 = D1 = 2.05, clock delay 3.8 (R1); a2 = A2 = 2.05, d2 = D2 = 2.05, clock delay 1.3 (R2); a3 = A3 = 0, d3 = D3 = 2.05, clock delay 0 (R3). Edge labels: 1.6, 6.6, 6.6 and 5.4.]
Fig. 6.7. The optimized timing schedule for s27 operable with TCP = 4.1.
cycles. The circuit s27 in Figure 6.7 is analyzed in order to provide better insight into how the latest departure times converge to a certain value in the steady state. Define a variable ε, where ε is a very small period of time. Suppose that a deviation of ε occurs in the departure time of the data signal from R3. The signal departure from R3 occurs at time 2.05 + ε, delaying the arrival times at R1 and R2 by ε. The departure from R2 is gradually delayed by ε every turn, which in turn delays the arrival time at R1. The arrival and departure times cumulatively increase in each turn of the data signal around the loop. Eventually, the signal arrivals at the latches occur during the non-transparent state of the latches. At this point, the signal departure times return to their starting values, which are the leading edges of their respective clock cycles. It is evident that the arrival times will finally be restored to their initial values when the source of the deviation vanishes. Thus, the assignment of the time-varying departure times to the leading edges of the synchronizing clock signals is referred to as the steady state of operation for the synchronous circuit.
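The steady-state values annotated in Figure 6.7 can be reproduced from the propagation and synchronization rules with a short calculation. The sketch below rests on stated assumptions rather than on the authors' implementation: the mapping of the four edge labels of the figure to data paths (R3 to R1 = 6.6, R2 to R1 = 6.6, R1 to R2 = 1.6, R3 to R2 = 5.4) is inferred from the node values, the transparent-phase width `C_W = 2.05` (a 50% duty cycle) is assumed, and all function names are illustrative.

```python
# Steady-state check for s27 with zero setup, hold and latch delays.
T_CP = 4.1                         # clock period reported for s27
C_W = 2.05                         # ASSUMED width of the transparent phase
t_cd = {1: 3.8, 2: 1.3, 3: 0.0}    # clock delays to R1, R2, R3

# ASSUMED mapping of the figure's edge labels to data paths: (i, f) -> delay.
paths = {(3, 1): 6.6, (2, 1): 6.6, (1, 2): 1.6, (3, 2): 5.4}

def skew(i, f):
    return t_cd[i] - t_cd[f]

def arrivals(f, d):
    """Earliest and latest arrival at R_f for the departure times d."""
    times = [d[i] + delay + skew(i, f) - T_CP
             for (i, g), delay in paths.items() if g == f]
    return min(times), max(times)

def departure(arrival):
    """Data leaves a latch at its leading edge or on arrival, whichever is later."""
    return max(arrival, T_CP - C_W)

d = {1: 2.05, 2: 2.05, 3: 2.05}    # departures at the leading edges
for f in (1, 2):
    a_f, A_f = arrivals(f, d)
    print(f"R{f}: a={a_f:.2f} A={A_f:.2f} "
          f"d={departure(a_f):.2f} D={departure(A_f):.2f}")
```

Under these assumptions the departures computed from the arrivals equal the departures that were assumed, that is, the schedule is a fixed point of the timing relations, which is what the steady state of operation means here.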
6.5 Optimality of the LP Formulation

The operational constraints (latching [(6.1) and (6.2)], synchronization [(6.3) and (6.4)] and propagation [(6.5) and (6.6)] constraints) accurately model the timing of level-sensitive synchronous circuits. However, the synchronization and propagation constraints are non-linear, leading to a non-linear programming (NLP) problem formulation. Typical NLP problems, especially for large-scale systems, are very hard to solve efficiently. Consequently, alternative modeling and solution procedures for the timing constraints of level-sensitive circuits are of interest to researchers. A linearization procedure that generates an LP formulation is presented in Section 6.3. Neither the iterative solution methods proposed in [94, 72] nor the LP model presented in this monograph is equivalent to the original non-linear problem. These alternative solution methods are proposed in order to generate results that are as close as possible to the optimal solution in relatively shorter run times.

In this section, a Mixed-Integer (Linear) Programming (MIP) [112, 114] formulation that is equivalent to the NLP formulation of the clock skew scheduling problem for level-sensitive circuits is described. A MIP problem is a linear programming problem in which some or all of the problem variables are constrained to be integers [112, 114]. If the integer variables are further constrained to take only 0 or 1 values, these variables are called binary variables. In general, a MIP problem can be solved optimally (granted enough time) or within a close proximity of the optimal solution [114]. A typical MIP problem, although generally harder to solve than an LP problem of similar size, is
Table 6.3. MIP modeling of a constraint with a max or a min function.

\[
\begin{array}{l|l}
y_i = \max(x_i, x_j, \ldots, x_k) & y_i = \min(x_i, x_j, \ldots, x_k) \\ \hline
y_i \ge x_i & y_i \le x_i \\
y_i \ge x_j & y_i \le x_j \\
\quad\vdots & \quad\vdots \\
y_i \ge x_k & y_i \le x_k \\
y_i + (B_{x_i} - 1)M \le x_i & y_i + (1 - B_{x_i})M \ge x_i \\
y_i + (B_{x_j} - 1)M \le x_j & y_i + (1 - B_{x_j})M \ge x_j \\
\quad\vdots & \quad\vdots \\
y_i + (B_{x_k} - 1)M \le x_k & y_i + (1 - B_{x_k})M \ge x_k \\
B_{x_i} + B_{x_j} + \cdots + B_{x_k} \ge 1 & B_{x_i} + B_{x_j} + \cdots + B_{x_k} \ge 1 \\
B_{x_i}, B_{x_j}, \ldots, B_{x_k} \ \text{binary} & B_{x_i}, B_{x_j}, \ldots, B_{x_k} \ \text{binary}
\end{array}
\]
generally easier to solve than an NLP problem of similar size [112]. In experimentation, the MIP problems generated for the clock skew scheduling problem of level-sensitive ISCAS’89 benchmark circuits are solved optimally.

In order to generate the MIP formulation for the clock skew scheduling problem of level-sensitive circuits, the non-linear synchronization and propagation constraints in Table 6.2 are remodeled using binary variables. Recall from Section 6.3.1 that the non-linearity of the synchronization and propagation constraints is due to the max and min functions. The transformations in Table 6.3 can be used to model a constraint with a max function or a min function using binary variables. In Table 6.3, $y_i$, $x_i$, $x_j$ and $x_k$ are continuous variables. A binary variable $B_{x_a}$ is defined for each operand $x_a$ ($x_a \in \{x_i, x_j, \ldots, x_k\}$) of the max or min function. For operand $x_i$ of the max function shown on the left hand side of Table 6.3, for instance, the binary variable $B_{x_i}$ is defined. The parameter M is a sufficiently large constant, similar to its definition in Section 6.3.1.

For a non-linear constraint with the max function in the form $y_i = \max(x_i, x_j, \ldots, x_k)$, $y_i$ is constrained to be greater than or equal to each one of the operands. For the max function to hold, the equality condition must be true for at least one of these inequalities (multiple equalities occur when two or more identical operands are the maximal value). Binary variables are used in order to enforce the equality of at least one of these inequalities. The assignment of 1 or 0 to the binary variable $B_{x_a}$ either additionally constrains $y_i$ to be less than or equal to $x_a$, or leaves only the requirement that $y_i$ be greater than or equal to $x_a$. In particular for operand $x_i$, when $B_{x_i} = 1$, the relevant constraints become:
\begin{align}
y_i &\ge x_i \tag{6.10} \\
y_i &\le x_i \tag{6.11}
\end{align}
which simplify to the equality $y_i = x_i$, with $x_i$ being the largest of the operands $x_i, x_j, \ldots, x_k$. On the other hand, if $B_{x_i} = 0$, the relevant constraints
[Figure 6.8: plot of run time in seconds (axis from 0 to 1400) for the LP and MIP formulations over the ISCAS’89 benchmark circuits s27 through s13207.]
Fig. 6.8. Run times under 1250 seconds for the LP and MIP formulations.
become:
\begin{align}
y_i &\ge x_i \tag{6.12} \\
y_i - M &\le x_i \tag{6.13}
\end{align}
which, for a sufficiently large M, reduce to $y_i \ge x_i$, since (6.13) is then not binding. The transformation for a non-linear constraint with the min function in the form $y_i = \min(x_i, x_j, \ldots, x_k)$ is similar, as shown on the right hand side of Table 6.3.

Using the transformation procedures defined in Table 6.3 on the non-linear synchronization and propagation constraints, the MIP problem is constructed for the clock skew scheduling problem of level-sensitive circuits. The MIP formulation is shown in Table 6.4. The MIP formulations of the clock skew scheduling problem are generated for the ISCAS’89 benchmark circuits. These MIP problems are solved in order to observe the potential deviations from optimality caused by modeling the NLP problem as an LP problem as described in Section 6.3.1. It is observed that all circuits in the ISCAS’89 benchmark suite are solved optimally with the LP model. For small circuits, the MIP formulation can be preferred due to its guarantee of optimality. However, as the numbers of registers and paths grow, the MIP problems can suffer from very long run times and can become practically unsolvable.

In order to compare the run times of the MIP problems with the run times of the LP problems, experiments are performed on the ISCAS’89 benchmark circuits. In Figure 6.8, the ISCAS’89 benchmark circuits whose run times are below 1250 seconds using the CPLEX (v7.5) [115] simplex solver on a 440MHz Sun Ultra10 Workstation are shown. For smaller circuits, both LP and MIP run times are below a few seconds and thus are not visible at the scale used in Figure 6.8. For s1423 and larger benchmark circuits, whose number of paths
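Table 6.4. MIP model clock skew scheduling problem of level-sensitive circuits.

MIP Model

\[
\begin{array}{lll}
\min & T_{CP} & \\
\text{subject to} & & \\
(i) & a_f \ge \delta_H^{L_f} & \text{[Latching-Hold time]} \\
(ii) & A_f \le T_{CP} - \delta_S^{L_f} & \text{[Latching-Setup time]} \\
(iii) & d_i \ge a_i + D_{DQm}^{L_i} & \\
 & d_i \ge T_{CP} - C_W^{L} + D_{CQm}^{L_i} & \\
 & d_i + (B_{a_i} - 1)M \le a_i + D_{DQm}^{L_i} & \\
 & d_i + (B_{Ta_i} - 1)M \le T_{CP} - C_W^{L} + D_{CQm}^{L_i} & \text{[Synchronization-Earliest time]} \\
(iv) & D_i \ge A_i + D_{DQM}^{L_i} & \\
 & D_i \ge T_{CP} - C_W^{L} + D_{CQM}^{L_i} & \\
 & D_i + (B_{A_i} - 1)M \le A_i + D_{DQM}^{L_i} & \\
 & D_i + (B_{TA_i} - 1)M \le T_{CP} - C_W^{L} + D_{CQM}^{L_i} & \text{[Synchronization-Latest time]} \\
(v) & a_f \le d_{i_1} + D_{Pm}^{i_1,f} + T_{Skew}(i_1, f) - T_{CP} & \\
 & \quad\vdots & \\
 & a_f \le d_{i_n} + D_{Pm}^{i_n,f} + T_{Skew}(i_n, f) - T_{CP} & \\
 & a_f + (1 - B_{d_{i_1} f})M \ge d_{i_1} + D_{Pm}^{i_1,f} + T_{Skew}(i_1, f) - T_{CP} & \\
 & \quad\vdots & \\
 & a_f + (1 - B_{d_{i_n} f})M \ge d_{i_n} + D_{Pm}^{i_n,f} + T_{Skew}(i_n, f) - T_{CP} & \text{[Propagation-Earliest time]} \\
(vi) & A_f \ge D_{i_1} + D_{PM}^{i_1,f} + T_{Skew}(i_1, f) - T_{CP} & \\
 & \quad\vdots & \\
 & A_f \ge D_{i_n} + D_{PM}^{i_n,f} + T_{Skew}(i_n, f) - T_{CP} & \\
 & A_f + (B_{D_{i_1} f} - 1)M \le D_{i_1} + D_{PM}^{i_1,f} + T_{Skew}(i_1, f) - T_{CP} & \\
 & \quad\vdots & \\
 & A_f + (B_{D_{i_n} f} - 1)M \le D_{i_n} + D_{PM}^{i_n,f} + T_{Skew}(i_n, f) - T_{CP} & \text{[Propagation-Latest time]} \\
(vii) & A_f \ge a_f & \text{[Validity-Arrival time]} \\
(viii) & D_f \ge d_f & \text{[Validity-Departure time]} \\
(ix) & A_l = d_l - (D_{CQm}^{L_l} \ \text{or} \ D_{DQm}^{L_l}), \ \forall R_l : |\text{Fan-in}(R_l)| = 0 & \text{[Initialization]}
\end{array}
\]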
exceeds a thousand, a significant gap between the run times of the LP and MIP problems is observed. For larger circuits, the MIP run times can become far longer than the LP run times. For instance, the MIP problem run time for s38417 is 286496 seconds, while the LP problem run time is only 603 seconds, a ratio of roughly 475. The run time experiment results shown in Figure 6.8 demonstrate the advantages of using the LP formulation versus the MIP formulation. It is demonstrated that the LP formulation offers a scalable alternative to the accurate MIP model. It is expected that the run times for industry-size integrated circuits will benefit even more from the simplifications of the LP formulation.

The results of the LP formulation for the ISCAS’89 benchmark circuits are empirically shown to be equal to the optimal results.¹ These empirical results do not guarantee the optimality of results for all circuits using the LP formulation. However, these results suggest the general accuracy of the LP formulation for the clock skew scheduling problem of level-sensitive circuits in leading to optimal or close to optimal results.
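The max transformation on the left hand side of Table 6.3 can be checked by brute force on a small instance. The sketch below is only an illustration: the operand values, the constant `M`, the candidate grid and the function names are hypothetical, and the binary variables are enumerated directly instead of being handed to a MIP solver.

```python
from itertools import product

def satisfies_max_model(y, xs, B, M):
    """Check the max-side constraints of Table 6.3 for given values."""
    lower = all(y >= x for x in xs)                            # y >= x_a
    upper = all(y + (b - 1) * M <= x for x, b in zip(xs, B))   # y + (B_a - 1)M <= x_a
    return lower and upper and sum(B) >= 1                     # at least one B_a = 1

def feasible(y, xs, M):
    """True if some binary assignment makes y feasible."""
    return any(satisfies_max_model(y, xs, B, M)
               for B in product((0, 1), repeat=len(xs)))

xs = [3.0, 7.5, 7.5, -2.0]     # hypothetical operand values
M = 1000.0                     # hypothetical "sufficiently large" constant
candidates = [n / 2 for n in range(-10, 41)]   # y in {-5.0, -4.5, ..., 20.0}
feasible_ys = [y for y in candidates if feasible(y, xs, M)]
print(feasible_ys)             # only max(xs) should survive
```

Every candidate other than the largest operand is rejected either by a lower-bound inequality or by the requirement that at least one binary variable equals 1, which is the mechanism the table relies on.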
¹ The results of these experiments will be presented in detail in Chapter 11.

6.6 Multi-Phase Level-Sensitive Circuits

Single-phase synchronization has traditionally been used in the design and analysis of systems, mostly due to its simplicity. Recently, however, multi-phase clock synchronization has become a necessity for larger and relatively complex integrated circuits. Also, when a single-phase, edge-triggered system is converted to a level-sensitive system, multi-phase synchronization is applied to conserve functionality without logic redesign. Besides these conventional clock distribution networks modified for multi-phase synchronization, emerging clocking technologies such as resonant rotary clocking technology (discussed in Chapter 10) also encompass multi-phase synchronization schemes [116]. This need for multi-phase synchronization schemes requires dedicated design and analysis frameworks.

An extension to clock skew scheduling algorithms for edge-triggered circuits (Chapter 5) in order to account for multi-phase synchronization is relatively straightforward. In this section, an enhancement to the Linear Programming (LP) framework presented in Section 6.1 for non-zero clock skew, level-sensitive circuits is described. The enhanced framework is used to profile the performance improvement of level-sensitive circuits subject to clock skew scheduling under multi-phase synchronization. Time borrowing and clock skew scheduling are analyzed in single, two, three and four phase synchronization schemes. The effects of multi-phase synchronization schemes, independent of the clocking technology, on non-zero clock skew, level-sensitive circuit performance are analyzed.

6.6.1 Multi-Phase Synchronization Overview

The advantages of level-sensitive design with multi-phase synchronization have previously been investigated in different contexts. One line of research has concentrated on circuit retiming, most notably in [117] and [118]. In [117], the advantages of two-phase, level-sensitive circuits (as opposed to edge-sensitive circuits) are explored. It is concluded in [117] that the level of improvement in circuit performance is insignificant for such a circuit transformation when circuit retiming is performed. In [118], the results of [117] are examined from a wider perspective, considering the depth of pipelining within a circuit; average improvements up to 30% are shown to be possible by two-phase, level-sensitive clocking with circuit retiming.

The multi-phase, level-sensitive clock skew scheduling methodology presented here differs from [117] and [118] by expanding the multi-phase synchronization concept to three, four and potentially higher numbers of phases (the studies presented in [117] and [118] are performed only for two-phase, level-sensitive circuits). Furthermore, unlike the extensive emphasis on circuit retiming in [117] and [118], the application of clock skew scheduling is presented in this section.

In [119], the authors advocate the use of a multi-phase clocking scheme for both edge-triggered and level-sensitive synchronous circuits for increased circuit performance. In [120], the number of clock phases constituting the multi-phase synchronization scheme and the skew values are restricted to reflect the practical limitations of conventional clock distribution networks. The studies in [119] and [120] do not explore the effects of multi-phase synchronization on the level of improvement in circuit performance for non-zero clock skew, level-sensitive circuits.

6.6.2 Multi-Phase Level-Sensitive Circuit Timing

In Figure 6.9, two local data paths starting at the latches $R_{i_1}$ and $R_{i_2}$, respectively, and ending at $R_f$ are considered. This figure is the multi-phase synchronization counterpart of Figure 6.2. The clock signals driving the initial latches $R_{i_1}$ and $R_{i_2}$ are shown at the top and bottom, respectively.
The middle clock signal corresponds to the final latch $R_f$. The time intervals for the arrival and departure times of latch data are illustrated by the upper and lower parallel dotted lines, respectively. Data delays are represented by the lengths of white or black rectangular boxes. Similar to the analysis in Section 6.1, the operational and constructional timing constraints of multi-phase, level-sensitive circuits are formulated based on these data propagation rules.

The timing constraints governing the operation of a multi-phase, level-sensitive synchronous system are summarized in Table 6.5. The multi-phase clock skew definition from Section 4.6.2 is incorporated into the constraints. These constraints are valid for all varieties of overlapping and non-overlapping clocking schemes, and for any feasible selection of duty cycles per clock phase. Note the max and min functions in the synchronization and propagation constraints in Section 4.9. The non-linearities of these constraints are similar to those reported in Section 6.3 for single-phase circuits. Consequently, the
[Figure 6.9: timing diagram of the clock signals driving the latches $R_{i_1}$ (top), $R_f$ (middle) and $R_{i_2}$ (bottom) over the k-th and (k + 1)-th clock cycles, showing the arrival and departure times $a$, $A$, $d$, $D$ at each latch, the path delays $D_{Pm}^{i_1,f}$, $D_{PM}^{i_1,f}$, $D_{Pm}^{i_2,f}$ and $D_{PM}^{i_2,f}$, and the phase-shifted clock skews between the latches.]
Fig. 6.9. Propagation of the data signal in a simple multi-phase circuit.

Table 6.5. LP model clock skew scheduling problem of multi-phase level-sensitive circuits.

LP Model

\[
\min \; T_{CP} + M \Big[ \sum_{\forall R_j} (d_j + D_j) + \sum_{\forall R_k : |\text{Fan-in}(R_k)| \ge 1} (A_k - a_k) \Big]
\]

\[
\begin{array}{lll}
\text{subject to} & & \\
(i) & a_f \ge \delta_H^{L_f} & \text{[Latching-Hold time]} \\
(ii) & A_f \le T_{CP} - \delta_S^{L_f} & \text{[Latching-Setup time]} \\
(iii) & d_i \ge a_i + D_{DQm}^{L_i} & \\
 & d_i \ge T_{CP} - C_W^{L} + D_{CQm}^{L_i} & \text{[Synchronization-Earliest time]} \\
(iv) & D_i \ge A_i + D_{DQM}^{L_i} & \\
 & D_i \ge T_{CP} - C_W^{L} + D_{CQM}^{L_i} & \text{[Synchronization-Latest time]} \\
(v) & a_f \le d_{i_1} + D_{Pm}^{i_1,f} + T_{Skew}^{p_{i_1} p_f}(i_1, f) + \phi_{p_{i_1} p_f} & \\
 & \quad\vdots & \\
 & a_f \le d_{i_n} + D_{Pm}^{i_n,f} + T_{Skew}^{p_{i_n} p_f}(i_n, f) + \phi_{p_{i_n} p_f} & \text{[Propagation-Earliest time]} \\
(vi) & A_f \ge D_{i_1} + D_{PM}^{i_1,f} + T_{Skew}^{p_{i_1} p_f}(i_1, f) + \phi_{p_{i_1} p_f} & \\
 & \quad\vdots & \\
 & A_f \ge D_{i_n} + D_{PM}^{i_n,f} + T_{Skew}^{p_{i_n} p_f}(i_n, f) + \phi_{p_{i_n} p_f} & \text{[Propagation-Latest time]} \\
(vii) & A_f \ge a_f & \text{[Validity-Arrival time]} \\
(viii) & D_f \ge d_f & \text{[Validity-Departure time]} \\
(ix) & A_l = d_l - (D_{CQm}^{L_l} \ \text{or} \ D_{DQm}^{L_l}), \ \forall R_l : |\text{Fan-in}(R_l)| = 0 & \text{[Initialization]}
\end{array}
\]
multi-phase problem is solved by linearizing the non-linear constraints with the Modified big M method (Section 6.3.1).
6.7 Summary

The timing analysis and optimization of synchronous circuits are subject to non-zero clock skew (intentional or not) and other effects of process parameter variations. In this chapter, design and timing analysis procedures are presented for clock skew scheduling of level-sensitive circuits. The formulation improves the performance of level-sensitive synchronous circuits by permitting shorter clock periods. The described procedure integrates non-zero clock skew scheduling in an automated fashion into the design and analysis of level-sensitive circuits. The procedure is based on a stand-alone LP model formulation (to be solved by any standard LP solver) which constitutes a generic automated framework for the design and analysis of level-sensitive synchronous circuits. The optimality of the results generated by the LP model is empirically confirmed against the optimal results of a precise MIP model. Using the clock skew definition that is enhanced for the increasingly popular multi-phase clock systems, the LP model clock skew scheduling formulation for multi-phase level-sensitive circuits is also presented.
7 Clock Skew Scheduling for Improved Reliability
The operation of a fully synchronous digital system has been discussed in detail in Chapters 1 through 5. Briefly, in order for such systems to function properly, a strict temporal ordering of the many thousands of switching events within the circuit is required. This strict ordering is enforced by a global synchronizing clock signal delivered to every register in a circuit by a clock distribution network. Algorithms for determining a non-zero clock skew schedule that satisfies the tighter timing constraints of high speed, VLSI complexity systems have been presented in detail in Chapter 5.

In this chapter, the problem of determining an optimal clock skew schedule for a fully synchronous VLSI system is considered from the perspective of improving system reliability. An original formulation of the clock skew scheduling problem by Kourtev and Friedman is introduced as a constrained quadratic programming (QP) problem [121, 122]. In this formulation, the primary objective is to improve circuit reliability by maximizing the tolerance to process parameter variations. As the initial step of the computation process, an objective is computed for the clock skew value of each local data path. Then, a consistent clock schedule is found by applying the proposed optimization algorithm. Unlike the approach discussed in Chapter 5, the algorithm presented in this chapter minimizes the least square error between the computed and objective clock skew schedules.¹ It should also be mentioned that a secondary objective of the clock skew scheduling algorithm presented in this chapter is to increase the system-wide clock frequency.

This chapter begins with the alternative formulation of the clock skew scheduling problem as a quadratic programming problem, discussed in detail in Section 7.1. The mathematical procedures used to determine the clock skew schedule are developed and analyzed in Section 7.2.
¹ Recall that in Chapter 5, the starting point of the clock scheduling algorithms is the set of timing constraints and the objective is to determine a feasible clock schedule and a clock distribution network given these constraints.
I.S. Kourtev et al., Timing Optimization Through Clock Skew Scheduling, DOI: 10.1007/978-0-387-71056-3 7, c Springer Science+Business Media LLC 2009
7.1 Problem Formulation

Recall the short delay $\hat{D}_{Pm}^{i,j}$ and long delay $\hat{D}_{PM}^{i,j}$ of a local data path $R_i \leadsto R_j$ introduced in Definition 5.4. Using the substitutions (5.3) and (5.4), the timing constraints of a local data path $R_i \leadsto R_f$ are rewritten in (5.5) and (5.6). A pair of constraints such as (5.5) and (5.6) must be satisfied for each local data path within a circuit in order for this circuit to operate correctly. Furthermore, the local data path timing constraints lead to the concept of a permissible range introduced in Section 5.2.1 and illustrated in Figure 5.2. Formally, the lower and upper bounds of the permissible range of a local data path $R_i \leadsto R_j$ are
\begin{align}
l_{i,j} &= -\hat{D}_{Pm}^{i,j} \tag{7.1} \\
u_{i,j} &= T_{CP} - \hat{D}_{PM}^{i,j}. \tag{7.2}
\end{align}
Also defined here for notational convenience are the width $w_{i,j}$ and middle $m_{i,j}$ of the permissible range. Specifically,
\begin{align}
w_{i,j} &= u_{i,j} - l_{i,j} = T_{CP} - \left(\hat{D}_{PM}^{i,j} - \hat{D}_{Pm}^{i,j}\right) \tag{7.3} \\
m_{i,j} &= \frac{1}{2}\left(l_{i,j} + u_{i,j}\right) = \frac{1}{2}\left(T_{CP} - \hat{D}_{PM}^{i,j} - \hat{D}_{Pm}^{i,j}\right). \tag{7.4}
\end{align}
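The four quantities in (7.1) through (7.4) follow directly from the clock period and the two path delays. A minimal sketch, in which the function name and the numerical values are illustrative only:

```python
def permissible_range(T_CP, D_Pm, D_PM):
    """Bounds, width and middle of the permissible range of one local data path."""
    lower = -D_Pm                     # (7.1)
    upper = T_CP - D_PM               # (7.2)
    width = upper - lower             # (7.3): T_CP - (D_PM - D_Pm)
    middle = 0.5 * (lower + upper)    # (7.4): (T_CP - D_PM - D_Pm) / 2
    return lower, upper, width, middle

# Hypothetical path with short delay 2 and long delay 4 at a clock period of 5.
print(permissible_range(T_CP=5.0, D_Pm=2.0, D_PM=4.0))   # (-2.0, 1.0, 3.0, -0.5)
```

Note that only the upper bound, and therefore the width and the middle, depends on the clock period; the lower bound is fixed by the short path delay.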
Recall from Section 5.3 that it is frequently possible to make two simple choices (5.7) characterizing the clock skews and clock delays within a circuit, such that both zero and double clocking violations are avoided. Specifically, if equal values are chosen for all clock delays and a sufficiently large value, larger than the longest delay $\hat{D}_{PM}^{i,f}$, is chosen for $T_{CP}$, neither of these two clocking hazards will occur. Formally,
\begin{align}
\forall R_i, R_f &: \ t_{cd}^{i} = t_{cd}^{f} = \text{Const} \tag{7.5} \\
R_i \leadsto R_f &\Rightarrow T_{CP} > \hat{D}_{PM}^{i,f} \tag{7.6}
\end{align}
and, with (7.5) and (7.6), the timing constraints (5.5) and (5.6) for a hazard-free local data path $R_i \leadsto R_f$ become
\begin{align}
\hat{D}_{PM}^{i,f} &< T_{CP} \tag{7.7} \\
\hat{D}_{Pm}^{i,f} &> 0. \tag{7.8}
\end{align}
Next, recall that each clock skew $T_{Skew}(i, f)$ is the difference of the delays of the clock signals, $t_{cd}^{i}$ and $t_{cd}^{f}$. These delays are the tangible physical quantities which are implemented by the clock distribution network. The set of all clock delays within a circuit can be denoted as the column vector
\[
\mathbf{t}_{cd} = \begin{bmatrix} t_{cd}^{1} \\ t_{cd}^{2} \\ \vdots \end{bmatrix},
\]
and is called a clock skew schedule or simply a clock schedule [2, 9, 106]. If $\mathbf{t}_{cd}$ is chosen such that (5.5) and (5.6) are satisfied for every local data path $R_i \leadsto R_j$, $\mathbf{t}_{cd}$ is called a feasible clock schedule. A clock schedule that satisfies (5.7) [respectively, (7.5) and (7.6)] is called a trivial clock schedule. Again, a trivial $\mathbf{t}_{cd}$ implies global zero clock skew since for any i and f, $t_{cd}^{i} = t_{cd}^{f}$, thus $T_{Skew}(i, f) = 0$. Also, observe that if $[\,t_{cd}^{1} \ t_{cd}^{2} \ \ldots\,]^{t}$ is a feasible clock schedule (trivial or not), $[\,c + t_{cd}^{1} \ \ c + t_{cd}^{2} \ \ldots\,]^{t}$ is also a feasible clock schedule, where $c \in \mathbb{R}^{1}$ is any real constant.

An alternative way to refer to a clock skew schedule is to specify the vector of all clock skews within a circuit corresponding to a set of clock delays $\mathbf{t}_{cd}$ as specified above. Denoted by $\mathbf{s}$, the column vector of clock skews is $\mathbf{s} = [\,s_1 \ s_2 \ \ldots\,]^{t}$, where the skews $s_1, s_2, \ldots$ of all local data paths within the circuit are enumerated. Typically, the dimension of $\mathbf{s}$ is different from the dimension of $\mathbf{t}_{cd}$ for the same circuit. If a circuit consists of r registers and p local data paths, for example, then $\mathbf{s} = [\,s_1 \ \ldots \ s_p\,]^{t}$ and $\mathbf{t}_{cd} = [\,t_{cd}^{1} \ \ldots \ t_{cd}^{r}\,]^{t}$ for this circuit.

Therefore, the clock skew schedule refers to either $\mathbf{t}_{cd}$ or $\mathbf{s}$, where the precise reference is usually apparent from the context. Note that $\mathbf{t}_{cd}$ must be known to determine each clock skew within $\mathbf{s}$. The inverse situation, however, is not true, that is, the set of all clock skews within a circuit need not be known in order to determine the corresponding clock schedule $\mathbf{t}_{cd}$. As is shown in Sections 7.1 and 7.2, a small subset of clock skews (compared to the total number of local data paths, that is, clock skews) uniquely determines all the skews within a circuit as well as the different feasible clock schedules $\mathbf{t}_{cd}$. Finally, note that a given feasible clock schedule $\mathbf{s}$ allows for many possible implementations $\mathbf{t}_{cd} = [\,c + t_{cd}^{1} \ \ c + t_{cd}^{2} \ \ldots\,]^{t}$, where any specific constant c implies a different $\mathbf{t}_{cd}$ but the same $\mathbf{s}$. Thus, the term clock schedule is used to refer to $\mathbf{t}_{cd}$ where the choice of the real constant $c \in \mathbb{R}^{1}$ is arbitrary.

The classical linear programming approach for minimizing only the clock period $T_{CP}$ of a circuit is first described in Section 7.1.1. The new problem formulation approach for maximizing the safety of the non-zero clock skew circuit towards variations in clock delays is described in Section 7.1.2. A new quantitative measure to compare different clock schedules for the formulation of maximum safety against variations is introduced in Section 7.1.3. This section is concluded by sketching the clock skew scheduling problem as an efficiently solvable quadratic programming problem in Section 7.1.4.

7.1.1 Clock Scheduling for Maximum Performance

The linear programming (LP) problem of computing a feasible clock skew schedule while minimizing the clock period $T_{CP}$ of a circuit is discussed in Chapter 5. With $T_{CP}$ as the value of the objective function being minimized, this problem is formally defined as problem LCSS:
Problem LCSS (LP Clock Skew Scheduling)

\begin{equation}
\begin{aligned}
\min \quad & T_{CP} \\
\text{subject to:} \quad & t_{cd}^{i} - t_{cd}^{j} \le T_{CP} - \hat{D}_{PM}^{i,j} \\
& t_{cd}^{i} - t_{cd}^{j} \ge -\hat{D}_{Pm}^{i,j}.
\end{aligned}
\tag{7.9}
\end{equation}
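For a fixed clock period, the constraints of problem LCSS are difference constraints on the clock delays, so feasibility can also be tested with a shortest-path (Bellman-Ford) negative-cycle check, and the minimum period located by bisection. The sketch below uses this substitute technique rather than a simplex LP solver; it is not the method of the text. The delays are those of the example circuit C1 discussed in this section, while the function names and tolerances are illustrative assumptions.

```python
# (i, j): (short delay, long delay) of each local data path of circuit C1.
DELAYS = {(1, 2): (1, 3), (1, 3): (2, 4), (3, 2): (5, 7),
          (3, 4): (2.5, 5), (4, 2): (2, 4)}
REGISTERS = (1, 2, 3, 4)

def feasible_schedule(T_CP):
    """Return clock delays satisfying (7.9) for this T_CP, or None if none exist."""
    # Each constraint t[u] - t[v] <= w is relaxed as t[u] = min(t[u], t[v] + w).
    constraints = []
    for (i, j), (d_min, d_max) in DELAYS.items():
        constraints.append((i, j, T_CP - d_max))   # t_i - t_j <= T_CP - D_PM
        constraints.append((j, i, d_min))          # t_j - t_i <= D_Pm
    t = {r: 0.0 for r in REGISTERS}                # implicit source at distance 0
    for _ in range(len(REGISTERS)):
        changed = False
        for u, v, w in constraints:
            if t[v] + w < t[u] - 1e-12:
                t[u] = t[v] + w
                changed = True
        if not changed:
            return t                               # all constraints hold
    return None                                    # still relaxing: negative cycle

def min_clock_period(tol=1e-9):
    """Bisection on T_CP between 0 and the longest path delay."""
    lo, hi = 0.0, float(max(d_max for _, d_max in DELAYS.values()))
    while hi - lo > tol:
        mid = 0.5 * (lo + hi)
        if feasible_schedule(mid) is None:
            lo = mid
        else:
            hi = mid
    return hi

print(round(min_clock_period(), 6))
print(feasible_schedule(5.0))
```

The schedule returned by this sketch need not coincide with the one obtained from an LP solver, since any feasible schedule may be shifted by a constant and several feasible schedules can exist for the same clock period.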
To develop additional insight into problem LCSS, consider a circuit C1 consisting of the four registers, $R_1$, $R_2$, $R_3$, and $R_4$, and the five local data paths, $R_1 \leadsto R_2$, $R_1 \leadsto R_3$, $R_3 \leadsto R_2$, $R_3 \leadsto R_4$, and $R_4 \leadsto R_2$. Let the long and short delays for this circuit be² $\hat{D}_{Pm}^{1,2} = 1$, $\hat{D}_{PM}^{1,2} = 3$, $\hat{D}_{Pm}^{1,3} = 2$, $\hat{D}_{PM}^{1,3} = 4$, $\hat{D}_{Pm}^{3,2} = 5$, $\hat{D}_{PM}^{3,2} = 7$, $\hat{D}_{Pm}^{3,4} = 2.5$, $\hat{D}_{PM}^{3,4} = 5$, $\hat{D}_{Pm}^{4,2} = 2$, and $\hat{D}_{PM}^{4,2} = 4$. Solving problem LCSS yields a feasible clock schedule $\mathbf{t}_{cd}^{1}$ for the minimum achievable clock period $T_{CP} = 5$,
\[
\min T_{CP} = 5 \ \rightarrow \ \mathbf{t}_{cd}^{1} = \begin{bmatrix} t_{cd}^{1} \\ t_{cd}^{2} \\ t_{cd}^{3} \\ t_{cd}^{4} \end{bmatrix} = \begin{bmatrix} 1 \\ 2 \\ 0 \\ 2.5 \end{bmatrix}.
\]
These results are summarized in Table 7.1 along with the actual permissible range for each local data path for the minimum value of the clock period $T_{CP} = 5$ (recall that the permissible range depends upon the value of the clock period $T_{CP}$).

Table 7.1. Clock schedule $\mathbf{t}_{cd}^{1}$: clock skews and permissible ranges for the example circuit C1 (for the minimum clock period $T_{CP} = 5$).

\[
\begin{array}{lll}
\text{Local Data Path} & \text{Permissible Range} & \text{Clock Skew} \\ \hline
R_1 \leadsto R_3 & [-2, 1] & t_{cd}^{1} - t_{cd}^{3} = 1 - 0 = 1 \\
R_3 \leadsto R_4 & [-2.5, 0] & t_{cd}^{3} - t_{cd}^{4} = 0 - 2.5 = -2.5 \\
R_1 \leadsto R_2 & [-1, 2] & t_{cd}^{1} - t_{cd}^{2} = 1 - 2 = -1 \\
R_3 \leadsto R_2 & [-5, -2] & t_{cd}^{3} - t_{cd}^{2} = 0 - 2 = -2 \\
R_4 \leadsto R_2 & [-2, 1] & t_{cd}^{4} - t_{cd}^{2} = 2.5 - 2 = 0.5
\end{array}
\]
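The entries of Table 7.1 can be recomputed from the path delays and the clock schedule of circuit C1. The sketch below also reports, for each path, the distance from the clock skew to the nearer bound of its permissible range; the name `margin` for this distance is introduced here for illustration only and is not a quantity defined in the text.

```python
T_CP = 5.0
t_cd = {1: 1.0, 2: 2.0, 3: 0.0, 4: 2.5}     # clock schedule of Table 7.1
# (i, j): (short delay, long delay), listed in the row order of Table 7.1.
delays = {(1, 3): (2.0, 4.0), (3, 4): (2.5, 5.0), (1, 2): (1.0, 3.0),
          (3, 2): (5.0, 7.0), (4, 2): (2.0, 4.0)}

rows = []
for (i, j), (d_min, d_max) in delays.items():
    lower, upper = -d_min, T_CP - d_max      # permissible range bounds
    skew = t_cd[i] - t_cd[j]                 # clock skew of the path
    margin = min(skew - lower, upper - skew) # hypothetical slack measure
    rows.append(((i, j), lower, upper, skew, margin))
    print(f"R{i}->R{j}: range [{lower}, {upper}], skew {skew}, margin {margin}")
```

A margin of zero means that the skew sits exactly on a bound of its permissible range, so any deviation of the implemented clock delays in the wrong direction violates the corresponding timing constraint.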
Note that most of the clock skews (specifically, the first four) listed in Table 7.1 are at one end of the corresponding permissible range. This situation is due to the inherent feature of linear programming, which seeks the extrema of the objective function at the vertices of the solution space. In practice, however, this situation can be dangerous since correct circuit operation is strongly dependent on the accurate implementation of a large number of clock delays (effectively, the clock skews) across the circuit. It is quite possible that the actual values of some of these clock delays may fluctuate from the target values, due to manufacturing tolerances as well as variations in temperature and supply voltage, thereby causing a catastrophic timing failure of the circuit. Observe that while zero clocking failures can be corrected by operating the circuit at a slower speed (higher clock period $T_{CP}$), double clocking violations are race conditions that render the circuit nonfunctional unless delay padding is performed.

$^{2}$ The times used in this section are all assumed to be in the same time unit. The actual time unit (e.g., picoseconds, nanoseconds, microseconds, milliseconds, seconds) is irrelevant and is therefore omitted.

7.1.2 Maximizing Safety

Frequently in practice, a target clock period $T_{CP}$ is established for a specific circuit implementation. Making the target clock period smaller may not be a primary design objective. If this is the case, alternative optimization strategies may be sought such that the resulting circuit is more tolerant to inaccuracies in the timing parameters. Two different classes of timing parameters are considered: the local data path delays and the clock delays (respectively, the clock skews). Note first that the clock skew scheduling process depends on accurate knowledge of the short and long path delays ($\hat{D}_{Pm}^{i,j}$ and $\hat{D}_{PM}^{i,j}$) for every local data path $R_i \leadsto R_j$. Second, provided the path delay information is predictable, correct circuit operation is contingent upon the accurate implementation of the computed clock schedule $\mathbf{t}_{cd}$. Both of these factors must be considered if reliable circuit operation under various operating conditions is to be attained.
One way to achieve the specified goal of higher circuit reliability is to artificially shrink the permissible range of each local data path by an equal amount from either side of the interval and determine a feasible clock skew schedule based on these new timing constraints. This idea has been addressed in [2] as the problem of maximizing the minimum slack [over all inequalities (5.5) and (5.6)], that is, the amount by which an inequality exceeds its limit. Formally, the problem can be expressed as the LP problem LCSS-SAFE:

Problem LCSS-SAFE (LP Clock Skew Scheduling for Safety)

$$
\begin{aligned}
\max \quad & M \\
\text{subject to:} \quad & t_{cd}^{i} - t_{cd}^{j} + M \le T_{CP} - \hat{D}_{PM}^{i,j} \\
& t_{cd}^{i} - t_{cd}^{j} - M \ge -\hat{D}_{Pm}^{i,j} \\
& M \ge 0.
\end{aligned}
\tag{7.10}
$$
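For a fixed schedule, the largest margin $M$ admitted by (7.10) is simply the smallest distance of any skew from either end of its permissible range. The sketch below evaluates that margin for a given schedule; it does not solve the LP itself. The function name `min_slack` and the constant names are hypothetical, and the delay data are those of the example circuit $C_1$ from Section 7.1.1.

```python
from fractions import Fraction as F

# (D_Pm, D_PM) for each local data path R_i ~> R_j of the example circuit C1.
DELAYS = {
    (1, 3): (F(2), F(4)),
    (3, 4): (F(5, 2), F(5)),
    (1, 2): (F(1), F(3)),
    (3, 2): (F(5), F(7)),
    (4, 2): (F(2), F(4)),
}

def min_slack(t_cd, t_cp):
    """Largest M for which t_cd satisfies both inequalities of (7.10).

    A negative result means the schedule violates a timing constraint.
    """
    slacks = []
    for (i, j), (d_short, d_long) in DELAYS.items():
        skew = t_cd[i] - t_cd[j]
        slacks.append((t_cp - d_long) - skew)  # room below the upper bound
        slacks.append(skew + d_short)          # room above the lower bound
    return min(slacks)

# LP-optimal schedule for T_CP = 5: at least one constraint is tight.
T1_CD = {1: F(1), 2: F(2), 3: F(0), 4: F(5, 2)}
print(min_slack(T1_CD, 5))  # 0
```

A margin of zero for the minimum-period schedule reflects the observation made in Section 7.1.1 that several skews lie exactly on an end of the permissible range.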
To gain additional insight into problem LCSS-SAFE, consider again the circuit example used in Section 7.1.1. Two solutions of problem LCSS-SAFE, denoted by $\mathbf{t}_{cd}^{2}$ and $\mathbf{t}_{cd}^{3}$, are listed in Table 7.2 for two different values of the clock period: columns two through five for $T_{CP} = 6.5$ (clock schedule $\mathbf{t}_{cd}^{2}$) and columns six through nine for $T_{CP} = 6$ (clock schedule $\mathbf{t}_{cd}^{3}$). For the specific value of $T_{CP}$, the permissible range is listed in columns two and six, respectively, and the clock skew solution is listed in columns three and seven, respectively.

Table 7.2. Solution of problem LCSS-SAFE for the example circuit $C_1$ for clock periods $T_{CP} = 6.5$ and $T_{CP} = 6$, respectively.

                      $T_{CP} = 6.5$, $M = 1$                        $T_{CP} = 6$, $M = 2/3$
                      $\mathbf{t}_{cd}^{2} = [\,3/2 \;\; 3/2 \;\; 0 \;\; 1/2\,]^{t}$      $\mathbf{t}_{cd}^{3} = [\,4/3 \;\; 5/3 \;\; 0 \;\; 1/3\,]^{t}$
(1)                   (2)            (3)     (4)      (5)     (6)          (7)     (8)     (9)
$R_1 \leadsto R_3$    [−2, 2.5]      1.5     0.25     1.25    [−2, 2]      4/3     0       4/3
$R_3 \leadsto R_4$    [−2.5, 1.5]    −0.5    −0.5     0       [−2.5, 1]    −1/3    −3/4    5/12
$R_1 \leadsto R_2$    [−1, 3.5]      0       1.25     1.25    [−1, 3]      −1/3    1       4/3
$R_3 \leadsto R_2$    [−5, −0.5]     −1.5    −2.75    1.25    [−5, −1]     −5/3    −3      4/3
$R_4 \leadsto R_2$    [−2, 2.5]      −1      0.25     1.25    [−2, 2]      −4/3    0       4/3

1: local data path; 2, 6: permissible range; 3, 7: clock skew solution for this local data path; 4, 8: ideal clock skew value for this path (middle of permissible range); 5, 9: distance (absolute value) of the clock skew solution from the ideal clock skew.

Note that there are two additional columns of data for either value of $T_{CP}$ in Table 7.2. First, an 'ideal' objective value of the clock skew is specified for each local data path in columns four and eight, respectively. This objective value of the clock skew is chosen in this example to be the value corresponding to the middle $m_{i,j}$ [note (7.4)] of the permissible range of a local data path $R_i \leadsto R_j$ in a circuit with a clock period $T_{CP}$. The middle point of the permissible range is equally distant from either end of the permissible range, thereby providing the maximum tolerance to process parameter variations. Second, the absolute value of the distance $|T_{Skew}(i,j) - m_{i,j}|$ between the ideal and actual values of the clock skew for a local data path is listed in columns five and nine, respectively. This distance is a measure of the difference between the ideal clock skew and the scheduled clock skew. Note that in the general case, it is virtually impossible to compute a clock schedule $\mathbf{t}_{cd}$ such that the clock skew $T_{Skew}(i,j)$ for each local data path $R_i \leadsto R_j$ is exactly equal to the
middle $m_{i,j}$ of the permissible range of this path. This characteristic is due to structural limitations of the circuits, as will be highlighted in Section 7.2.

7.1.3 Further Improvement

Problem LCSS-SAFE [see (7.10)] provides a solution to the clock skew scheduling problem for the case where circuit reliability is of primary importance and clock period minimization is not the focus of the optimization process. As shown in Section 7.1.2, a certain degree of safety may be achieved by computing a feasible clock schedule subject to artificially smaller permissible ranges [as defined in (7.10)]. However, Problem LCSS-SAFE is a brute force approach since it requires that the same absolute margins of safety are observed for each permissible range regardless of the width of this range. Therefore, this approach does not consider the individual characteristics of a permissible range and does not differentiate among local data paths with wider and narrower permissible ranges.

It is possible to provide an alternative approach to clock skew scheduling that considers all permissible ranges and also provides a natural quantitative measure of the quality of a particular clock schedule. Consider, for instance, a circuit with a target clock period $T_{CP}$. Furthermore, denote an objective clock skew value for a local data path $R_i \leadsto R_j$ by $g_{i,j}$, where it is required that $l_{i,j} \le g_{i,j} \le u_{i,j}$ [recall the lower (7.1) and upper (7.2) bounds of the permissible range]. For most practical circuits, it is unlikely that a feasible clock schedule can be computed that is exactly equal to the objective clock schedule for each local data path. Multiple linear dependencies among clock skews within each circuit exist. Those linear dependencies define a solution space such that the clock schedule $\mathbf{s} = [\,g_{i_1,j_1} \;\; g_{i_2,j_2} \;\; \ldots\,]^{t}$ most likely is not within this solution space (unless the circuit is constructed of only non-recursive feed-forward paths). If $\mathbf{t}_{cd}$ is a feasible clock schedule, however, it is possible to evaluate how close a realizable clock schedule is to the objective clock schedule by computing the sum,

$$
\varepsilon = \sum_{R_i \leadsto R_j} \left( T_{Skew}(i,j) - g_{i,j} \right)^{2},
\tag{7.11}
$$

over all local data paths in the circuit. Note that $\varepsilon$, as defined in (7.11), is the total least squares error of the actual clock skew as compared to the objective clock skew. This error permits any two different clock skew schedules to be compared. Moreover, the clock skew scheduling problem can be considered as a problem of minimizing $\varepsilon$ of a clock schedule $\mathbf{t}_{cd}$ given the clock period $T_{CP}$ and an 'ideal' clock schedule $[\,g_{i_1,j_1} \;\; g_{i_2,j_2} \;\; \ldots\,]^{t}$ subject to any specific circuit design criteria. The flexibility permitted by such a formulation is far greater since the ideal schedule $[\,g_{i_1,j_1} \;\; g_{i_2,j_2} \;\; \ldots\,]^{t}$ can be any clock schedule that satisfies a specific target circuit.
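The error (7.11) is straightforward to evaluate once an objective skew is chosen for every path. The sketch below assumes, as in Table 7.2, that the objective $g_{i,j}$ is the middle of the permissible range, and applies (7.11) to the two LCSS-SAFE schedules of that table. The names `middle`, `total_error`, `T2_CD`, and `T3_CD` are hypothetical helpers, not part of the original algorithm description.

```python
from fractions import Fraction as F

# (D_Pm, D_PM) for each local data path R_i ~> R_j of the example circuit C1.
DELAYS = {
    (1, 3): (F(2), F(4)),
    (3, 4): (F(5, 2), F(5)),
    (1, 2): (F(1), F(3)),
    (3, 2): (F(5), F(7)),
    (4, 2): (F(2), F(4)),
}

def middle(path, t_cp):
    """Middle of the permissible range [-D_Pm, T_CP - D_PM] of a path."""
    d_short, d_long = DELAYS[path]
    return (-d_short + (t_cp - d_long)) / 2

def total_error(t_cd, t_cp):
    """Least squares error (7.11) with the range middles as the objective."""
    return sum((t_cd[i] - t_cd[j] - middle((i, j), t_cp)) ** 2
               for (i, j) in DELAYS)

# LCSS-SAFE schedules of Table 7.2.
T2_CD = {1: F(3, 2), 2: F(3, 2), 3: F(0), 4: F(1, 2)}  # T_CP = 6.5
T3_CD = {1: F(4, 3), 2: F(5, 3), 3: F(0), 4: F(1, 3)}  # T_CP = 6

print(total_error(T2_CD, F(13, 2)))  # 25/4
print(total_error(T3_CD, 6))         # 1049/144
```

Because the error is a single number per schedule, two schedules for the same clock period can be ranked by comparing the returned values.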
Consider, for instance, the solution of LCSS-SAFE listed in Table 7.2 for $T_{CP} = 6.5$ and $T_{CP} = 6$. Computing the total error [as defined by (7.11)] for both solutions gives $\varepsilon_{6.5} = 6.25$ and $\varepsilon_{6} = \frac{1049}{144} = 7.2847$. Next, consider an alternative clock schedule $\tilde{\mathbf{t}}_{cd}^{2}$ for $T_{CP} = 6.5$ as follows:

$$
T_{CP} = 6.5 \;\rightarrow\; \tilde{\mathbf{t}}_{cd}^{2} =
\begin{bmatrix} t_{cd}^{1} \\ t_{cd}^{2} \\ t_{cd}^{3} \\ t_{cd}^{4} \end{bmatrix} =
\begin{bmatrix} 43/32 \\ 38/32 \\ 0 \\ 31/32 \end{bmatrix}.
\tag{7.12}
$$

It can be verified that with $\tilde{\mathbf{t}}_{cd}^{2}$ as specified, $\varepsilon_{6.5}$ improves to $\frac{675}{128} = 5.2734$ from 6.25 for $\mathbf{t}_{cd}^{2}$ [columns two (2) through five (5) in Table 7.2]. Similarly, an alternative clock schedule $\tilde{\mathbf{t}}_{cd}^{3}$ for the clock period $T_{CP} = 6$ is

$$
T_{CP} = 6 \;\rightarrow\; \tilde{\mathbf{t}}_{cd}^{3} =
\begin{bmatrix} t_{cd}^{1} \\ t_{cd}^{2} \\ t_{cd}^{3} \\ t_{cd}^{4} \end{bmatrix} =
\begin{bmatrix} 35/32 \\ 54/32 \\ 0 \\ 39/32 \end{bmatrix}.
\tag{7.13}
$$

Again, using $\tilde{\mathbf{t}}_{cd}^{3}$ leads to an improvement of $\varepsilon_{6}$ to 6.1484 as compared to 7.2847 for the solution $\mathbf{t}_{cd}^{3}$ of LCSS-SAFE (see Table 7.2, columns six through nine).

7.1.4 Clock Scheduling as a Quadratic Programming Problem

As discussed in Sections 7.1.1, 7.1.2, and 7.1.3, a common design objective is ensuring reliable system operation under a target clock period. As hinted in Section 7.1.3, it is possible to redefine the problem of clock skew scheduling for this case. The input data for this redefined problem consists of:

• The clock period of the circuit $T_{CP}$,
• The circuit connectivity and delay information, i.e., all local data paths $R_i \leadsto R_j$ and the short and long delays $\hat{D}_{Pm}^{i,j}$ and $\hat{D}_{PM}^{i,j}$, respectively,
• An objective clock schedule $\mathbf{g} = [\,g_{i_1,j_1} \;\; g_{i_2,j_2} \;\; \ldots\,]^{t}$.

Given this information, the optimization goal is to compute a feasible clock schedule $\mathbf{s}^{*}$ (respectively $\mathbf{t}_{cd}^{*}$) so as to minimize the least square error between the computed clock schedule $\mathbf{s}^{*}$ and the objective clock schedule $\mathbf{g}$. Recall that the least square error $\varepsilon$ [described by (7.11)] is defined as the sum of the squares of the distances (algebraic differences) between the actual and objective clock skews over all local data paths in the circuit. This problem is described within a formal framework in the following section, where the mathematical algorithm to solve this revised problem is also explained in greater detail.
7.2 Derivation of the QP Algorithm

The formulation of clock skew scheduling as a quadratic programming problem is described in detail in this section. First, the graph model introduced in Chapter 5 is further analyzed in Section 7.2.1. The linear dependencies among the clock skews and the fundamental set of cycles are introduced and analyzed in Section 7.2.2. Finally, the quadratic programming problem is formulated and solved in Section 7.2.3.

7.2.1 The Circuit Graph

As discussed in Section 5.2.2, a circuit $C$ is represented as the simple undirected graph $G_C = \{V^{(C)}, E^{(C)}, A^{(C)}, h_l^{(C)}, h_u^{(C)}, h_d^{(C)}\}$, where $V_C = \{v_1, \ldots, v_r\}$ is the set of vertices of the graph, $E_C = \{e_1, \ldots, e_p\}$ is the set of edges of the graph, and the symmetric $r \times r$ matrix $A_C$, called the adjacency matrix, contains the graph connectivity [89]. Vertices from $G_C$ correspond to the registers of the circuit $C$ and the edges reflect the fact that pairs of registers are sequentially-adjacent. Note the cardinalities $|V_C| = r$ and $|E_C| = p$: the circuit $C$ has $r$ registers and $p$ local data paths. The adjacency matrix $A_C = [a_{ij}]_{r \times r}$ is a square matrix of order $r \times r$ where both the rows and columns of $A_C$ correspond to the vertices of $G_C$.

As previously mentioned, for notational convenience $s_j$ denotes the clock skew corresponding to the edge $e_j \in E_C$. Specifically, if the vertices $v_{i_1}$ and $v_{i_2}$ correspond to the sequentially-adjacent pair of registers $R_{i_1} \leadsto R_{i_2}$ connected by the $j$-th edge $e_j$,

$$
s_j \stackrel{\text{def}}{=} T_{Skew}(i_1, i_2).
$$

To illustrate these concepts, the graph $G_{C_1}$ of the small circuit example $C_1$ introduced in Section 7.1.1 is illustrated in Figure 7.1 (note the enumeration and labeling of the edges as specified in Definition 5.3). For this example, $r = 4$, $p = 5$, and the adjacency matrix, which will be used in the solution procedure, is

$$
A_{C_1} =
\begin{array}{c|cccc}
 & v_1 & v_2 & v_3 & v_4 \\ \hline
v_1 & 0 & 1 & 1 & 0 \\
v_2 & 1 & 0 & 1 & 1 \\
v_3 & 1 & 1 & 0 & 1 \\
v_4 & 0 & 1 & 1 & 0
\end{array}.
$$

Fig. 7.1. Circuit graph of the simple example circuit $C_1$ from Section 7.1.1. [Vertices $v_1, \ldots, v_4$; each edge $e_k$ is labeled with a direction and its permissible range $[l_k, u_k]$: $e_1$ from $v_1$ to $v_3$, $e_2$ from $v_3$ to $v_4$, $e_3$ from $v_1$ to $v_2$, $e_4$ from $v_3$ to $v_2$, and $e_5$ from $v_4$ to $v_2$.]

Observe that in general, the elements of $A_C$ are defined as

$$
a_{ij} =
\begin{cases}
1 & \text{if there is an edge } e_k \text{ connecting the vertices } v_i \text{ and } v_j \\
0 & \text{otherwise.}
\end{cases}
\tag{7.14}
$$
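Definition (7.14) translates directly into code. The following minimal sketch builds the adjacency matrix of the example circuit $C_1$ from its list of local data paths; the names `EDGES` and `adjacency_matrix` are hypothetical, and registers are numbered from 1 as in the text.

```python
# Local data paths of C1 in the edge order e1..e5, as (source, destination).
EDGES = [(1, 3), (3, 4), (1, 2), (3, 2), (4, 2)]

def adjacency_matrix(edges, r):
    """Adjacency matrix (7.14) of the undirected circuit graph."""
    a = [[0] * r for _ in range(r)]
    for i, j in edges:
        # The graph is undirected, so each edge marks both (i, j) and (j, i).
        a[i - 1][j - 1] = 1
        a[j - 1][i - 1] = 1
    return a

for row in adjacency_matrix(EDGES, 4):
    print(row)
# [0, 1, 1, 0]
# [1, 0, 1, 1]
# [1, 1, 0, 1]
# [0, 1, 1, 0]
```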
In addition, note that the adjacency matrix as defined in (7.14) is always symmetric. The edges of $G_C$ have no direction, so each edge between vertices $v_i$ and $v_j$ is shown in both of the rows corresponding to $i$ and $j$. Also, all diagonal elements of the adjacency matrix are zeroes since self-loop edges are excluded by the required circuit graph properties described in Section 5.2.2. As a final reminder and without any loss of generality, it is assumed that a circuit has a connected graph [89]. In other words, a circuit does not have isolated groups of registers. If a specific circuit has a disconnected graph, then each connected subgraph (subcircuit) can be considered separately.

7.2.2 Linear Dependence of Clock Skews

Consider the circuit graph of $C_1$ illustrated in Figure 7.1. The clock skews for the local data paths $R_3 \leadsto R_2$, $R_3 \leadsto R_4$, and $R_4 \leadsto R_2$ are $s_4 = T_{Skew}(3,2) = t_{cd}^{3} - t_{cd}^{2}$, $s_2 = T_{Skew}(3,4) = t_{cd}^{3} - t_{cd}^{4}$, and $s_5 = T_{Skew}(4,2) = t_{cd}^{4} - t_{cd}^{2}$, respectively. Note that $s_4 = s_2 + s_5$, i.e., the clock skews $s_2$, $s_4$, and $s_5$ are linearly dependent. In addition, note that other sets of linearly dependent clock skews can be identified within $C_1$, such as, for example, $s_1$, $s_3$, and $s_4$.

Generally, large circuits contain many feedback and feed-forward signal paths. Thus, many possible linear dependencies among clock skews, such as those described in the previous paragraph, are typically present in such circuits. A natural question arises as to whether there exists a minimal set$^{3}$ of linearly independent clock skews which uniquely determines all clock skews within a circuit. (The existence of any such set could lead to substantial improvements in the run time of the clock scheduling algorithms as well as permit significant savings in storage requirements when implementing these algorithms on a digital computer.) It is generally possible to identify multiple minimal sets within any circuit.
Consider $C_1$, for example: it can be verified that $\{s_3, s_4, s_5\}$, $\{s_1, s_3, s_5\}$, and $\{s_1, s_4, s_5\}$ are each sets with the property that (a) the clock skews within the set are linearly independent, and (b) every clock skew within $C_1$ can be expressed as a linear combination of the clock skews that exist in the set.

$^{3}$ Such that the removal of any element from the set destroys the property.

Let $C$ be a circuit with graph $G_C$ and let $v_{i_0}, e_{j_0}, v_{i_1}, \ldots, e_{j_{z-1}}, v_{i_z} \equiv v_{i_0}$ be an arbitrary sequence of vertices and edges. Formally, the condition for linear dependence of the clock skews $s_{j_0}, s_{j_1}, \ldots, s_{j_{z-1}}$ is

$$
\left.
\begin{array}{l}
\displaystyle\prod_{k=0}^{z-1} a_{i_k i_{k+1}} \neq 0 \\[2ex]
(i_z = i_0) \neq i_1 \neq \ldots \neq i_{z-1}
\end{array}
\right\}
\;\Rightarrow\;
\sum_{k=0}^{z-1} \pm T_{Skew}(i_k, i_{k+1}) = 0,
\tag{7.15}
$$

where the proof of (7.15) is trivial by substitution. The product on the left side of (7.15) requires that there exists an edge between every pair of vertices $v_{i_k}$ and $v_{i_{k+1}}$ ($k = 0, \ldots, z-1$). The sum in (7.15) can be interpreted$^{4}$ as traversing the vertices of the cycle $\mathcal{C} = v_{i_0}, e_{j_0}, v_{i_1}, \ldots, e_{j_{z-1}}, v_{i_z} \equiv v_{i_0}$ in the order of appearance in $\mathcal{C}$ and adding the skews along $\mathcal{C}$ with a positive or negative sign depending on whether the direction labeled on the edge coincides with the direction of traversal.

Typically, multiple cycles can be identified in a circuit graph and an equation such as (7.15) can be written for each of these cycles. Referring to Figure 7.1, three such cycles,

$$
\begin{aligned}
\mathcal{C}_1 &= v_1, e_1, v_3, e_2, v_4, e_5, v_2, e_3, v_1 \\
\mathcal{C}_2 &= v_2, e_4, v_3, e_2, v_4, e_5, v_2 \\
\mathcal{C}_3 &= v_1, e_1, v_3, e_4, v_2, e_3, v_1,
\end{aligned}
$$

can be identified and the corresponding linear dependencies written:

$$
\begin{array}{lcrrrrrl}
\text{cycle } \mathcal{C}_1 & \rightarrow & s_1 & {}+ s_2 & {}- s_3 & & {}+ s_5 & = 0 \qquad (7.16) \\
\text{cycle } \mathcal{C}_2 & \rightarrow & & s_2 & & {}- s_4 & {}+ s_5 & = 0 \qquad (7.17) \\
\text{cycle } \mathcal{C}_3 & \rightarrow & s_1 & & {}- s_3 & {}+ s_4 & & = 0. \qquad (7.18)
\end{array}
$$
Note that the order of the summations in (7.16), (7.17), and (7.18) has been intentionally modified from the order of cycle traversal so as to highlight an important characteristic. Specifically, observe that (7.16) is the sum of (7.17) and (7.18); that is, there exists a linear dependence not only among the skews within the circuit $C$, but also among the cycles (or sets of linearly dependent skews).

Note that any minimal set of linearly independent clock skews must not contain a cycle [as defined by (7.15)], for if the set contained a cycle, the skews within the set would not be linearly independent. Furthermore, any such set must span all vertices (registers) of the circuit; otherwise it is not possible to express the clock skews of any paths in and out of the vertices not spanned by the set. Given a circuit $C$ with $r$ registers and $p$ local data paths, these conclusions are formally summarized in the following two results from graph theory [89, 124]:

$^{4}$ Note the similarity with Kirchhoff's Voltage Law (KVL or loop equations) for an electrical network [123].

1. Minimal Set of Linearly Independent Clock Skews. A minimal set of clock skews can be identified such that (a) the skews within the set are linearly independent, and (b) every skew in $C$ is a linear combination of the skews from the set. Such a minimal set is any spanning tree of $G_C$ and consists of exactly $r - 1$ elements (recall that a spanning tree is a subset of edges such that all vertices are spanned by the edges in the set). These $r - 1$ skews (respectively, edges) in the spanning tree are referred to as the skew basis, while the remaining $p - (r - 1) = p - r + 1$ skews (edges) of the circuit are referred to as chords. Note that there is a unique path between any two vertices such that all edges of the path belong to the spanning tree.

2. Minimal Set of Independent Cycles. A minimal set of cycles [where a cycle is as defined by (7.15)] can be identified such that (a) the cycles are linearly independent, and (b) every cycle in $C$ is a linear combination of the cycles from the set. Each choice of a spanning tree of $G_C$ determines a unique minimal set of cycles, where each cycle consists of exactly one chord $v_{i_1}, e_j, v_{i_2}$ plus the unique path that exists within the spanning tree between the vertices $v_{i_1}$ and $v_{i_2}$. Since there are $p - (r - 1) = p - r + 1$ chords, a minimal set of independent cycles consists of $p - r + 1$ cycles. The minimal set of independent cycles of a graph is also called a fundamental set of cycles [89, 123, 124].
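The two results above suggest a direct construction: pick a spanning tree, and for every chord close a cycle through the unique tree path between the chord's end vertices. The following is a minimal sketch of that construction for the example circuit $C_1$, assuming the edge directions used for the skews in the text. The names `EDGES`, `tree_path`, and `fundamental_cycles` are hypothetical, and the recursive search is chosen for brevity rather than efficiency.

```python
# Edge k of C1 as (source, destination), so that s_k = t_source - t_destination.
EDGES = {1: (1, 3), 2: (3, 4), 3: (1, 2), 4: (3, 2), 5: (4, 2)}

def tree_path(tree, start, goal):
    """Unique path from start to goal inside the spanning tree.

    Returns a list of (edge, sign); sign is +1 when the edge is traversed
    along its labeled direction and -1 otherwise.
    """
    def dfs(v, visited):
        if v == goal:
            return []
        for e in tree:
            a, b = EDGES[e]
            if a == v and b not in visited:
                nxt, sign = b, +1
            elif b == v and a not in visited:
                nxt, sign = a, -1
            else:
                continue
            rest = dfs(nxt, visited | {nxt})
            if rest is not None:
                return [(e, sign)] + rest
        return None
    return dfs(start, {start})

def fundamental_cycles(tree):
    """One signed cycle {edge: +1 or -1} per chord of the given spanning tree."""
    cycles = {}
    for chord in sorted(set(EDGES) - set(tree)):
        a, b = EDGES[chord]
        cycle = {chord: +1}                     # traverse the chord a -> b
        for e, sign in tree_path(tree, b, a):   # then return b -> a in the tree
            cycle[e] = sign
        cycles[chord] = cycle
    return cycles

print(fundamental_cycles([3, 4, 5]))
# chord e1: s1 - s3 + s4 = 0, i.e. (7.18); chord e2: s2 - s4 + s5 = 0, i.e. (7.17)
```

Each returned dictionary lists the signed skews whose sum is zero, so with $r = 4$ and $p = 5$ exactly $p - r + 1 = 2$ cycles are produced for any spanning tree.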
To illustrate the aforementioned properties, observe the two different spanning trees of the example circuit $C_1$ outlined with the thicker edges in Figure 7.2 (the permissible ranges and direction labelings have been omitted from Figure 7.2 for simplicity). The first tree is shown in Figure 7.2(a) and consists of the edges $\{e_3, e_4, e_5\}$ and the independent cycles $\mathcal{C}_2$ [see (7.17)] and $\mathcal{C}_3$ [see (7.18)]. As previously explained, both $\mathcal{C}_2$ and $\mathcal{C}_3$ contain precisely one of the skews not included in the spanning tree: $s_2$ for $\mathcal{C}_2$ and $s_1$ for $\mathcal{C}_3$. Similarly, the second spanning tree $\{e_1, e_3, e_5\}$ is illustrated in Figure 7.2(b). The independent cycles for the second tree are $\mathcal{C}_1$ [see (7.16)] and $\mathcal{C}_3$ [see (7.18)], generated by $s_2$ and $s_4$, respectively.

Fig. 7.2. Two spanning trees and the corresponding minimal sets of linearly independent clock skews and linearly independent cycles for the circuit example $C_1$. Edges from the spanning tree are indicated with thicker lines. (a) Spanning tree $\{e_3, e_4, e_5\}$. (b) Spanning tree $\{e_1, e_3, e_5\}$.

Let a circuit $C$ with $r$ registers and $p$ local data paths be described by a graph $G$ and let a skew basis (spanning tree) for this circuit (graph) be identified. For the remainder of this discussion, it is assumed that the skews have been enumerated such that those skews from the skew basis have the highest indices.$^{5}$ Introducing the notation $\mathbf{s}^{b}$ for the basis and $\mathbf{s}^{c}$ for the chords, the clock schedule $\mathbf{s}$ can be expressed as

$$
\mathbf{s} =
\begin{bmatrix} \mathbf{s}^{c} \\ \mathbf{s}^{b} \end{bmatrix} =
[\, \underbrace{\overbrace{s_1 \;\; \ldots \;\; s_{p-r+1}}^{p-r+1}}_{\text{Chords}} \;\;
\underbrace{\overbrace{s_{p-r+2} \;\; \ldots \;\; s_p}^{r-1}}_{\text{Basis}} \,]^{t},
\tag{7.19}
$$

where

$$
\mathbf{s}^{c} = \begin{bmatrix} s_1 \\ \vdots \\ s_{p-r+1} \end{bmatrix}
\quad \text{and} \quad
\mathbf{s}^{b} = \begin{bmatrix} s_{p-r+2} \\ \vdots \\ s_p \end{bmatrix}.
\tag{7.20}
$$

$^{5}$ Such enumeration is always possible since the choice of indices for any enumeration (including this example) is arbitrary.

Note that the case illustrated in Figure 7.2(a) is precisely the type of enumeration just described by (7.19) and (7.20): $e_1$, $e_2$ ($s_1$, $s_2$) are the chords and $e_3$, $e_4$, $e_5$ ($s_3$, $s_4$, $s_5$) are the basis.
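With the notation and enumeration as specified above, let $n_b = r - 1$ be the number of skews (edges) in the basis and $n_c = p - r + 1 = p - n_b$ be the number of chords (equal to the number of cycles). The set of linearly independent cycles is $\mathcal{C}_1, \ldots, \mathcal{C}_{n_c}$ and the clock skew dependencies for these cycles are

$$
\begin{aligned}
\text{cycle } \mathcal{C}_1 = v_{i_0^{1}}, e_{j_0^{1}}, v_{i_1^{1}}, \ldots, e_{j_{z-1}^{1}}, v_{i_0^{1}}
\quad &\rightarrow \quad 0 = \sum_{k=i_0^{1}}^{i_{z-1}^{1}} \pm s_{j_k^{1}} \\
&\;\;\vdots \\
\text{cycle } \mathcal{C}_{n_c} = v_{i_0^{n_c}}, e_{j_0^{n_c}}, v_{i_1^{n_c}}, \ldots, e_{j_{z-1}^{n_c}}, v_{i_0^{n_c}}
\quad &\rightarrow \quad 0 = \sum_{k=i_0^{n_c}}^{i_{z-1}^{n_c}} \pm s_{j_k^{n_c}}.
\end{aligned}
\tag{7.21}
$$

Note that the sums in (7.21) can be written in matrix form,

$$
\mathbf{B}\mathbf{s} = \mathbf{0},
\tag{7.22}
$$

where $\mathbf{B} = [b_{ij}]_{n_c \times p}$ is a matrix of order $n_c \times p$. The matrix $\mathbf{B}$ is called the circuit connectivity matrix and each row of $\mathbf{B}$ corresponds to a cycle of the circuit graph and contains elements from the incidence matrix $A$ combined with zeroes depending on whether a skew (an edge) belongs to the cycle or not. Note that since each cycle contains exactly one chord, the cycles can always be permuted such that the cycles appear in the order of the chords, i.e., $\mathcal{C}_1$ corresponds to $e_1$, $\mathcal{C}_2$ corresponds to $e_2$, and so on. If this correspondence is applied, the matrix $\mathbf{B}$ can be represented as

$$
\mathbf{B} = \begin{bmatrix} \mathbf{I}_{n_c} & \mathbf{C}_{n_c \times n_b} \end{bmatrix},
\tag{7.23}
$$

where the submatrix $\mathbf{I}_{n_c}$ is an identity$^{6}$ matrix of dimension $n_c \times n_c$, thereby permitting (7.22) to be rewritten as

$$
\mathbf{B}\mathbf{s} = \begin{bmatrix} \mathbf{I} & \mathbf{C} \end{bmatrix}
\begin{bmatrix} \mathbf{s}^{c} \\ \mathbf{s}^{b} \end{bmatrix}
= \mathbf{s}^{c} + \mathbf{C}\mathbf{s}^{b} = \mathbf{0}.
\tag{7.24}
$$

$^{6}$ Recall that an identity matrix $\mathbf{I}_n$ is a square $n \times n$ matrix such that the only nonzero elements are on the main diagonal and are all equal to one.

Consider, for instance, the choice of spanning tree illustrated in Figure 7.2(a). There are two independent cycles denoted by $\mathcal{C}_1$ [corresponding to $\mathcal{C}_3$ in (7.18)] and $\mathcal{C}_2$ [corresponding to $\mathcal{C}_2$ in (7.17)]. The matrix relationship (7.22) for this case is

$$
\begin{array}{rrrrrl}
s_1 & & {}- s_3 & {}+ s_4 & & = 0 \quad \leftarrow \text{cycle } \mathcal{C}_1 = v_1, e_1, v_3, e_4, v_2, e_3, v_1 \\
 & s_2 & & {}- s_4 & {}+ s_5 & = 0 \quad \leftarrow \text{cycle } \mathcal{C}_2 = v_3, e_2, v_4, e_5, v_2, e_4, v_3
\end{array}
$$

and the matrices $\mathbf{B}$ and $\mathbf{C}$, respectively, are

$$
\mathbf{B} = \begin{bmatrix} \mathbf{I}_2 & \mathbf{C}_{2 \times 3} \end{bmatrix} =
\begin{bmatrix} 1 & 0 & -1 & 1 & 0 \\ 0 & 1 & 0 & -1 & 1 \end{bmatrix},
\qquad
\mathbf{C} = \begin{bmatrix} -1 & 1 & 0 \\ 0 & -1 & 1 \end{bmatrix}.
\tag{7.25}
$$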
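Relation (7.24) means that the chord skews follow from the basis skews as $\mathbf{s}^{c} = -\mathbf{C}\mathbf{s}^{b}$. The short sketch below applies this to the matrices of (7.25), using the basis skews $s_3 = 0$, $s_4 = -1.5$, $s_5 = -1$ of the $T_{CP} = 6.5$ schedule in Table 7.2. The helper names `chords_from_basis`, `connectivity_matrix`, and `mat_vec` are hypothetical.

```python
from fractions import Fraction as F

# Matrix C of (7.25) for the spanning tree {e3, e4, e5} of the circuit C1.
C = [[-1, 1, 0],
     [0, -1, 1]]

def chords_from_basis(c, s_b):
    """Chord skews s^c = -C s^b, obtained from (7.24)."""
    return [-sum(c_ij * s_j for c_ij, s_j in zip(row, s_b)) for row in c]

def connectivity_matrix(c):
    """Circuit connectivity matrix B = [I | C] of (7.23)."""
    n_c = len(c)
    return [[1 if k == i else 0 for k in range(n_c)] + list(row)
            for i, row in enumerate(c)]

def mat_vec(m, v):
    """Plain matrix-vector product."""
    return [sum(a * b for a, b in zip(row, v)) for row in m]

s_b = [F(0), F(-3, 2), F(-1)]   # s3, s4, s5
s_c = chords_from_basis(C, s_b)  # s1, s2
s = s_c + s_b                    # full schedule [s1, ..., s5]

print(s_c)                                   # [Fraction(3, 2), Fraction(-1, 2)]
print(mat_vec(connectivity_matrix(C), s))    # both entries are zero
```

The chord skews $s_1 = 1.5$ and $s_2 = -0.5$ agree with column three of Table 7.2, so only the $r - 1 = 3$ basis skews need to be stored or optimized.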
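From an algebraic standpoint [125], (7.22) requires that any clock schedule $\mathbf{s}$ must necessarily be in the kernel $\ker(\mathbf{B})$ of the linear transformation $\mathbf{B} : \mathbb{R}^{p} \rightarrow \mathbb{R}^{n_c}$, i.e., $\mathbf{s} \in \ker(\mathbf{B})$. The inverse situation, however, is not true; that is, an arbitrary element of the kernel is not necessarily a feasible clock schedule. Furthermore, note that $\mathbf{B}$ is already in reduced row echelon form [125], so the rank of $\mathbf{B}$ is $\operatorname{rank}(\mathbf{B}) = n_c$. Thus, the dimension of $\ker(\mathbf{B})$ is [125]

$$
\dim[\ker(\mathbf{B})] = (\text{columns of } \mathbf{B}) - \operatorname{rank}(\mathbf{B}) = p - \operatorname{rank}(\mathbf{B}) = p - n_c = n_b.
\tag{7.26}
$$

Therefore, (7.22) is referred to here as the circuit kernel equation. This last result expressed by (7.26) demonstrates that there are $n_b = r - 1$ linearly independent skews in a circuit. Furthermore, considering that the matrix $\mathbf{C}$ is

$$
\mathbf{C} = \begin{bmatrix} | & & | \\ \mathbf{c}_1 & \ldots & \mathbf{c}_{n_b} \\ | & & | \end{bmatrix},
$$

one possible basis for $\ker(\mathbf{B})$ can be written from inspection:

$$
\text{basis for } \ker(\mathbf{B}) =
\underbrace{
\begin{bmatrix} -\mathbf{c}_1 \\ 1 \\ 0 \\ \vdots \\ 0 \end{bmatrix},
\begin{bmatrix} -\mathbf{c}_2 \\ 0 \\ 1 \\ \vdots \\ 0 \end{bmatrix},
\ldots,
\begin{bmatrix} -\mathbf{c}_{n_b} \\ 0 \\ 0 \\ \vdots \\ 1 \end{bmatrix}
}_{n_b \text{ vectors}}.
\tag{7.27}
$$

Any feasible clock schedule $\mathbf{s} \in \ker(\mathbf{B})$ can be expressed as a linear combination of the vectors from the basis of the kernel,

$$
\mathbf{s} = \begin{bmatrix} \mathbf{s}^{c} \\ \mathbf{s}^{b} \end{bmatrix}
= s_1^{b} \begin{bmatrix} -\mathbf{c}_1 \\ 1 \\ 0 \\ \vdots \\ 0 \end{bmatrix}
+ s_2^{b} \begin{bmatrix} -\mathbf{c}_2 \\ 0 \\ 1 \\ \vdots \\ 0 \end{bmatrix}
+ \ldots
+ s_{n_b}^{b} \begin{bmatrix} -\mathbf{c}_{n_b} \\ 0 \\ 0 \\ \vdots \\ 1 \end{bmatrix}
= \begin{bmatrix} -\mathbf{C}\mathbf{s}^{b} \\ \mathbf{s}^{b} \end{bmatrix},
\tag{7.28}
$$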
where the scalars $s_1^{b}, s_2^{b}, \ldots, s_{n_b}^{b}$ in (7.28) are the elements of the vector $\mathbf{s}^{b}$ [as defined by (7.19)]:

$$
\mathbf{s}^{b} = \begin{bmatrix} s_1^{b} \\ s_2^{b} \\ \vdots \\ s_{n_b}^{b} \end{bmatrix}
= \begin{bmatrix} s_{n_c+1} \\ s_{n_c+2} \\ \vdots \\ s_p \end{bmatrix}.
\tag{7.29}
$$

Observe that either knowing or deliberately choosing $\mathbf{s}^{b}$ not only provides sufficient information to determine the corresponding $\mathbf{s}^{c}$ (respectively, the entire $\mathbf{s}$), but also permits computation of the clock delays $\mathbf{t}_{cd}$ to implement the desired clock schedule $\mathbf{s}$. Specifically, the dependencies among the clock skews in the branches (the local data paths) and the clock delays to the vertices (the registers) can be described in matrix form as follows:

$$
\mathbf{s}^{b} = \mathbf{T}_{n_b \times r}\, \mathbf{t}_{cd}
= \mathbf{T}_{n_b \times r} \begin{bmatrix} t_{cd}^{1} \\ \vdots \\ t_{cd}^{r} \end{bmatrix}.
\tag{7.30}
$$

Note that each skew is the difference of two clock delays so that each row of the matrix $\mathbf{T}$ in (7.30) contains exactly two nonzero elements. These two nonzero elements are $1$ and $-1$, respectively, depending upon which two clock delays determine the clock skew corresponding to this equation (or row in the matrix). Also note that (7.30) is a consistent linear system (the rows correspond to linearly independent skews within the circuit) with fewer equations than the $r$ unknown clock delays $\mathbf{t}_{cd}$. Therefore, (7.30) has an infinite number of solutions, all corresponding to the same clock schedule $\mathbf{s}$. Finding a solution $\mathbf{t}_{cd}$ of (7.30) is now a straightforward matter. For example, setting $t_{cd}^{r} = 0$ and rewriting (7.30) to account for this substitution,

$$
t_{cd}^{r} = 0 \quad \Rightarrow \quad
\mathbf{s}^{b} = \mathbf{T}^{*}_{n_b \times n_b}\, \mathbf{t}_{cd}
= \mathbf{T}^{*}_{n_b \times n_b} \begin{bmatrix} t_{cd}^{1} \\ \vdots \\ t_{cd}^{r-1} \end{bmatrix}
\tag{7.31}
$$

yields a consistent linear system with the same number of variables as equations, where the matrix $\mathbf{T}^{*}_{n_b \times n_b}$ is the matrix $\mathbf{T}_{n_b \times r}$ with the rightmost column deleted. The most efficient way to solve the system characterized by (7.31) with the highest accuracy is by back substitution (only addition/subtraction operations are necessary). In the software implementation of this algorithm discussed in this work, $\mathbf{t}_{cd}$ is computed in an efficient way by traversing the edges of the spanning tree.

This section concludes by illustrating the concepts discussed above on the small circuit example $C_1$ [the circuit graph $G_{C_1}$ is shown in Figure 7.1 and the respective spanning tree is shown in Figure 7.2(a)]. For this circuit, $r = 4$, the number of local data paths is $p = 5$, and $n_b = 4 - 1 = 3$. The clock schedule is

$$
\mathbf{s} = \begin{bmatrix} \mathbf{s}^{c} \\ \mathbf{s}^{b} \end{bmatrix},
\quad \text{where} \quad
\mathbf{s}^{c} = \begin{bmatrix} s_1 \\ s_2 \end{bmatrix},
\quad
\mathbf{s}^{b} = \begin{bmatrix} s_3 \\ s_4 \\ s_5 \end{bmatrix}.
\tag{7.32}
$$

The independent cycles are $\mathcal{C}_2$ [from (7.17)] and $\mathcal{C}_3$ [from (7.18)] and the matrices $\mathbf{B}$ and $\mathbf{C}$ are as defined in (7.25). A basis for the kernel of $\mathbf{B}$ has a dimension $n_b = 3$ and consists of the vectors,
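$$
\begin{bmatrix} 1 \\ 0 \\ 1 \\ 0 \\ 0 \end{bmatrix},
\quad
\begin{bmatrix} -1 \\ 1 \\ 0 \\ 1 \\ 0 \end{bmatrix},
\quad \text{and} \quad
\begin{bmatrix} 0 \\ -1 \\ 0 \\ 0 \\ 1 \end{bmatrix}.
\tag{7.33}
$$

Any clock schedule is in $\ker(\mathbf{B})$ and can be expressed as a linear combination of the vectors from the kernel basis,

$$
\mathbf{s} = s_3 \begin{bmatrix} 1 \\ 0 \\ 1 \\ 0 \\ 0 \end{bmatrix}
+ s_4 \begin{bmatrix} -1 \\ 1 \\ 0 \\ 1 \\ 0 \end{bmatrix}
+ s_5 \begin{bmatrix} 0 \\ -1 \\ 0 \\ 0 \\ 1 \end{bmatrix}.
\tag{7.34}
$$

Consider, for instance, the clock skew schedule for $T_{CP} = 6.5$ shown in Table 7.2. Substituting $s_3 = 0$, $s_4 = -1.5$, and $s_5 = -1$ into (7.34) yields the clock schedule,

$$
\mathbf{s} = 0 \begin{bmatrix} 1 \\ 0 \\ 1 \\ 0 \\ 0 \end{bmatrix}
- 1.5 \begin{bmatrix} -1 \\ 1 \\ 0 \\ 1 \\ 0 \end{bmatrix}
- 1 \begin{bmatrix} 0 \\ -1 \\ 0 \\ 0 \\ 1 \end{bmatrix}
= \begin{bmatrix} 1.5 \\ -0.5 \\ 0 \\ -1.5 \\ -1 \end{bmatrix}.
\tag{7.35}
$$

Finally, the clock delays $\mathbf{t}_{cd}$ are derived from the underdetermined linear system [as described by (7.30)],

$$
\mathbf{s}^{b} = \begin{bmatrix} 0 \\ -1.5 \\ -1 \end{bmatrix}
= \begin{bmatrix} 1 & -1 & 0 & 0 \\ 0 & -1 & 1 & 0 \\ 0 & -1 & 0 & 1 \end{bmatrix}
\begin{bmatrix} t_{cd}^{1} \\ t_{cd}^{2} \\ t_{cd}^{3} \\ t_{cd}^{4} \end{bmatrix},
\tag{7.36}
$$

where setting $t_{cd}^{4} = 0$ yields

$$
\mathbf{s}^{b} = \begin{bmatrix} 0 \\ -1.5 \\ -1 \end{bmatrix}
= \begin{bmatrix} 1 & -1 & 0 \\ 0 & -1 & 1 \\ 0 & -1 & 0 \end{bmatrix}
\begin{bmatrix} t_{cd}^{1} \\ t_{cd}^{2} \\ t_{cd}^{3} \end{bmatrix}
\quad \Rightarrow \quad
\begin{aligned}
t_{cd}^{1} &= 1 \\
t_{cd}^{2} &= 1 \\
t_{cd}^{3} &= -0.5.
\end{aligned}
\tag{7.37}
$$

Interestingly, the clock schedule $[\,1 \;\; 1 \;\; -\tfrac{1}{2} \;\; 0\,]^{t}$ differs from the solution shown in Table 7.2 by only a constant of $c = -\tfrac{1}{2}$. Namely,

$$
[\,1 \;\; 1 \;\; -\tfrac{1}{2} \;\; 0\,]^{t} = [\,c + \tfrac{3}{2} \;\;\; c + \tfrac{3}{2} \;\;\; c + 0 \;\;\; c + \tfrac{1}{2}\,]^{t}.
\tag{7.38}
$$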
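The remark that the clock delays can be obtained by traversing the spanning tree can be sketched as follows. Starting from a reference register whose delay is fixed to zero, every tree edge $s = t^{i}_{cd} - t^{j}_{cd}$ determines one unknown delay from one known delay, using additions and subtractions only. This is an illustrative sketch rather than the implementation referred to in the text; the names `delays_from_basis`, `TREE`, and `S_B` are hypothetical.

```python
from fractions import Fraction as F

def delays_from_basis(tree_edges, s_b, root):
    """Clock delays implied by the basis skews, with t_root fixed to zero.

    tree_edges maps an edge index to (i, j), meaning s_edge = t_i - t_j;
    s_b maps the same edge index to the value of that skew.
    """
    t_cd = {root: F(0)}
    pending = dict(tree_edges)
    while pending:
        progressed = False
        for edge, (i, j) in list(pending.items()):
            if i in t_cd:
                t_cd[j] = t_cd[i] - s_b[edge]
            elif j in t_cd:
                t_cd[i] = t_cd[j] + s_b[edge]
            else:
                continue  # neither end is known yet; revisit on a later pass
            del pending[edge]
            progressed = True
        if not progressed:
            raise ValueError("edges do not form a tree reachable from root")
    return t_cd

# Spanning tree {e3, e4, e5} of C1 and the basis skews used in (7.35).
TREE = {3: (1, 2), 4: (3, 2), 5: (4, 2)}
S_B = {3: F(0), 4: F(-3, 2), 5: F(-1)}

print(delays_from_basis(TREE, S_B, root=4))  # t1 = 1, t2 = 1, t3 = -1/2, t4 = 0
print(delays_from_basis(TREE, S_B, root=3))  # t1 = 3/2, t2 = 3/2, t3 = 0, t4 = 1/2
```

Choosing register 4 as the reference reproduces (7.37), while choosing register 3 reproduces the schedule of Table 7.2; the two results differ by the constant of (7.38).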
7.2.3 Optimization Problem and Solution

Recall the intuitive definition of clock skew scheduling as a Quadratic Programming (QP) problem first introduced in Section 7.1.4. In this section, the QP formulation is formalized and the solution of the problem is explained in detail.

Problem QP-1 (QP Clock Skew Scheduling)

Let $C$ be a circuit with $r$ registers, $p$ local data paths and a target clock period $T_{CP}$, and let the local data paths be enumerated as

$$
p \text{ local data paths}
\begin{cases}
\text{path}_1 \rightarrow R_{i_1} \leadsto R_{j_1} \\
\quad \vdots \\
\text{path}_p \rightarrow R_{i_p} \leadsto R_{j_p}.
\end{cases}
\tag{7.39}
$$

For each local data path $\text{path}_k$ ($R_{i_k} \leadsto R_{j_k}$) within $C$, let the lower bound $l_{i_k,j_k}$, upper bound $u_{i_k,j_k}$, width $w_{i_k,j_k}$, and middle $m_{i_k,j_k}$ of the permissible range of this path be defined as in (7.1), (7.2), (7.3), and (7.4), respectively. For simplicity, these parameters of the permissible range are denoted with a single subscript corresponding to the number of the respective local data path, that is, for $\text{path}_k \equiv R_{i_k} \leadsto R_{j_k}$, $l_{i_k,j_k} = l_k$, $u_{i_k,j_k} = u_k$, $w_{i_k,j_k} = w_k$, and $m_{i_k,j_k} = m_k$. Furthermore, let the circuit graph of $C$ be $G_C$, let the skew basis $\mathbf{s}^{b}$ and chords $\mathbf{s}^{c}$ be identified in $G_C$ [according to (7.19)], and let the corresponding independent set of cycles be described by the matrix $\mathbf{B} = [\,\mathbf{I} \;\; \mathbf{C}\,]$ [as defined in (7.23)]. Let an objective clock schedule be $\mathbf{g} = [\,g_1 \;\ldots\; g_p\,]^{t} = [\,m_1 \;\ldots\; m_p\,]^{t}$, and let $\mathbf{l} = [\,l_1 \;\ldots\; l_p\,]^{t}$ and $\mathbf{u} = [\,u_1 \;\ldots\; u_p\,]^{t}$ be the vectors of the lower and upper bounds, respectively, of the permissible ranges. Find a feasible clock schedule $\mathbf{s}$ that minimizes the least square error $\varepsilon$ between $\mathbf{s}$ and $\mathbf{g}$. Formally,

$$
\begin{aligned}
\min \quad & \varepsilon = \sum_{k=1}^{p} (s_k - g_k)^{2} \\
\text{subject to:} \quad & \mathbf{B}\mathbf{s} = \mathbf{0} \\
& \mathbf{l} \le \mathbf{s} \\
& \mathbf{s} \le \mathbf{u},
\end{aligned}
\tag{7.40}
$$

where the inequalities in (7.40) are treated componentwise, i.e., $l_1 \le s_1 \le u_1$, $l_2 \le s_2 \le u_2$, and so on.

Problem QP-1 is a constrained QP problem with bounded variables. Methods such as active constraints exist for solving such problems [126, 127, 128, 129, 130]. These methods are both analytically and numerically challenging. A two-phase solution process is suggested here in which the first phase solves a version of Problem QP-1 that retains only the equality constraints. If the result is infeasible, a rapidly converging iterative refinement of the objective $\mathbf{g}$ is performed until the feasibility of $\mathbf{s}$ is satisfied. This two-phase process is defined formally as
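$$
\begin{array}{lcl}
\text{Phase 1} & \rightarrow & \min \;\; \varepsilon = \displaystyle\sum_{k=1}^{p} (s_k - g_k)^{2} \quad \text{subject to } \mathbf{B}\mathbf{s} = \mathbf{0} \\[2ex]
\text{Phase 2} & \rightarrow & \text{Iterative refinement of } \mathbf{s},
\end{array}
\tag{7.41}
$$

where Phase 1 is an equality-constrained quadratic optimization problem expressed as the following problem QP-2:

Problem QP-2 (QP Clock Skew Scheduling)

$$
\begin{aligned}
\min \quad & \varepsilon = (\mathbf{s} - \mathbf{g})^{2} = \sum_{k=1}^{p} (s_k - g_k)^{2} \\
\text{subject to:} \quad & \mathbf{B}\mathbf{s} = \mathbf{0}.
\end{aligned}
\tag{7.42}
$$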
Problem QP-2 is representative of a broader class of optimization problems where the function that is minimized is a distance in the Euclidean space Rn . One typical problem that arises in a variety of situations, for instance, is the linear least squares problem. The objective of the linear least squares problem is to find x∗ ∈ Rn such that the Euclidean distance between Dx∗ ∈ Rm and b ∈ Rm is as small as possible. The matrix D is an m × n matrix and the system Dx = b is typically inconsistent. The function being minimized in the linear least squares problem is ⎡ ⎤ | | ⎢ ⎥ m ⎢ ⎥ t 2 ⎥ . . . d d di x − bi , where Dt = ⎢ m⎥ . ⎢ 1 ⎣ ⎦ i=1 | | It is well known [125, 130] that if the kernel of D is ker(D) = {0}, then x∗ is the solution of the consistent system Dt Dx = Dt b. The quadratic programming problem QP-2 is solved by applying the classical method of Lagrange multipliers for constrained optimization [131, 129, 130]. To start, note that minimizing the objective function ε in (7.42) is equivalent to minimizing the function, ε∗ = st s − 2gt s.
For a quick proof of this equivalence, consider expanding the value of ε,
\[
\varepsilon = \|s - g\|^2 = \|s\|^2 - 2 g^t s + \|g\|^2 = s^t s - 2 g^t s + g^t g,
\tag{7.43}
\]
where the inner product gt g in (7.43) is a numeric constant. Therefore, if a value s = s∗ exists which minimizes ε∗ in (7.43), s∗ also minimizes ε. Note that since ε∗ = ε − gt g, the two minima are related by
\[
\min(\varepsilon^*) = \min(\varepsilon) - g^t g.
\tag{7.44}
\]
Thus, problem QP-2 is transformed into the following problem QP-3:
Problem QP-3 (QP Clock Skew Scheduling)
\[
\min\ \varepsilon^* = s^t s - 2 g^t s \quad \text{subject to: } Bs = 0.
\tag{7.45}
\]
To apply the method of Lagrange multipliers to problem QP-3, the vector $\lambda = [\lambda_1 \ \ldots \ \lambda_{n_c}]^t$ is introduced, where each multiplier λi in λ corresponds to the i-th equality constraint from Bs = 0. The Lagrangian function L(s, λ) is introduced next,
\[
L(s, \lambda) = \varepsilon^* + \lambda^t B s = s^t s - 2 g^t s + \lambda^t B s,
\tag{7.46}
\]
where the term λt Bs in (7.46) is the sum over all equality constraints of the product of the i-th constraint times the multiplier λi. Any extremum of ε∗ must be a stationary point of the Lagrangian L(s, λ) [125], that is, the first derivatives of L(s, λ) with respect to si where i ∈ {1, . . . , p} and λj where j ∈ {1, . . . , nc} must be zero. Formally, if the differential operator is denoted as ∇, then any stationary point (s∗, λ∗) of L(s, λ) is a solution of the system of equations,
\[
\nabla L(s, \lambda) = 0 \ \Rightarrow\ \begin{cases} \nabla_s L(s, \lambda) = 0 \\ \nabla_\lambda L(s, \lambda) = 0. \end{cases}
\tag{7.47}
\]
In the general case of a QP problem with any type of constraints, systems such as (7.47) can be non-linear and difficult to solve. In the case of linear
constraints, however, a solution can be derived in a straightforward manner. To this end, consider the derivatives, ∇s L(s, λ) and ∇λ L(s, λ), of the Lagrangian,
\[
\nabla_s L(s, \lambda) = \nabla_s \left( s^t s - 2 g^t s + \lambda^t B s \right) = 2s - 2g + (\lambda^t B)^t = 2s - 2g + B^t \lambda,
\tag{7.48}
\]
and
\[
\nabla_\lambda L(s, \lambda) = \nabla_\lambda \left( s^t s - 2 g^t s + \lambda^t B s \right) = Bs.
\tag{7.49}
\]
Note that (7.48) and (7.49) contain p and nc equations, respectively (recall that s and λ have p and nc variables, respectively). Therefore, the solution of (7.47) requires finding exactly p + nc = 2p − nb = 2p − r + 1 variables. Substituting (7.48) and (7.49) back into (7.47) yields the linear system,
\[
\begin{aligned} 2s + B^t \lambda &= 2g \\ Bs &= 0, \end{aligned}
\tag{7.50}
\]
which can be conveniently written in matrix form,
\[
\begin{bmatrix} 2I_p & B^t \\ B & 0 \end{bmatrix} \begin{bmatrix} s \\ \lambda \end{bmatrix} = 2 \begin{bmatrix} g \\ 0 \end{bmatrix}.
\tag{7.51}
\]
Solving (7.51) by Gauss-Jordan elimination is straightforward by premultiplying with ½B the first row of the system described by (7.51) and subtracting the result from the second row, thereby yielding
\[
\begin{bmatrix} 2I_p & B^t \\ B & 0 \end{bmatrix} \begin{bmatrix} s \\ \lambda \end{bmatrix} = 2 \begin{bmatrix} g \\ 0 \end{bmatrix}
\ \Rightarrow\
\begin{bmatrix} 2I_p & B^t \\ 0 & BB^t \end{bmatrix} \begin{bmatrix} s \\ \lambda \end{bmatrix} = 2 \begin{bmatrix} g \\ Bg \end{bmatrix}.
\tag{7.52}
\]
A natural way to solve the linear system described by (7.52) is by back substitution,⁷ such that λ is initially computed, followed by the computation of s. The Lagrange multipliers λ are determined from the equation (BBt)λ = 2Bg in the second row of (7.52), where the right-hand side 2Bg is a non-zero vector, that is, Bg ≠ 0. The opposite situation, Bg = 0, is highly unlikely to occur since Bg = 0 means that g ∈ ker(B), which in turn means [recall (7.26) through (7.29)] that the objective clock schedule g is feasible and no optimization needs to be performed.⁸ Therefore, the equation (BBt)λ = 2Bg in (7.52) can have either no solutions or exactly one solution depending upon whether the matrix BBt is singular or not. In other words, the non-singularity of BBt is a necessary and sufficient condition for the existence of a unique solution $[\hat{s}^t \ \hat{\lambda}^t]^t$ of (7.51).
⁷ Since the coefficient matrix is an upper triangular matrix.
⁸ The chances of g being feasible for a large real circuit are infinitesimally small.
If the product BBt is denoted by M, note that the symmetric nc × nc matrix,
\[
M = BB^t = \begin{bmatrix} I & C \end{bmatrix} \begin{bmatrix} I \\ C^t \end{bmatrix} = I + CC^t,
\tag{7.53}
\]
is strictly positive-definite and thus nonsingular. Therefore, the system (7.51) is absolutely guaranteed to have a unique solution,
\[
\hat{\lambda} = 2 M^{-1} B g,
\tag{7.54}
\]
\[
\hat{s} = -\tfrac{1}{2} B^t \hat{\lambda} + g = -B^t M^{-1} B g + g,
\tag{7.55}
\]
where the matrix M is as introduced in (7.53). To gain further insight into the solution described by (7.51) through (7.55), consider substituting (7.23) for B into (7.51), and representing the column vector g of the objective clock skew schedule as
\[
g = \begin{bmatrix} g^c \\ g^b \end{bmatrix},
\tag{7.56}
\]
where gc and gb correspond to sc and sb, that is, g1 is the objective value of the clock skew s1, g2 is the objective value of the clock skew s2 and so on. With these substitutions, the system represented by (7.51) can be written as
\[
\begin{bmatrix} 2I_{n_c} & 0 & I_{n_c} \\ 0 & 2I_{n_b} & C^t \\ I_{n_c} & C & 0 \end{bmatrix} \begin{bmatrix} s^c \\ s^b \\ \lambda \end{bmatrix} = K \begin{bmatrix} s^c \\ s^b \\ \lambda \end{bmatrix} = 2 \begin{bmatrix} g^c \\ g^b \\ 0 \end{bmatrix},
\tag{7.57}
\]
where the coefficient matrix K on the left is symmetric. In (7.57), the Gaussian elimination step described by (7.52) is equivalent to multiplying by ½ the first row of K, premultiplying by ½C the second row of K and subtracting both of these rows from the third row:
\[
\begin{bmatrix} 2I & 0 & I \\ 0 & 2I & C^t \\ I & C & 0 \end{bmatrix} \begin{bmatrix} s^c \\ s^b \\ \lambda \end{bmatrix} = 2 \begin{bmatrix} g^c \\ g^b \\ 0 \end{bmatrix}
\ \Rightarrow\
\begin{bmatrix} 2I & 0 & I \\ 0 & 2I & C^t \\ 0 & 0 & I + CC^t \end{bmatrix} \begin{bmatrix} s^c \\ s^b \\ \lambda \end{bmatrix} = 2 \begin{bmatrix} g^c \\ g^b \\ g^c + C g^b \end{bmatrix}.
\tag{7.58}
\]
Observe that the linear system of (7.58) is simply a more detailed technique for rendering the linear system described by (7.52) where the first row of (7.52) has been expanded into the first two rows of (7.58):
\[
BB^t = \begin{bmatrix} I & C \end{bmatrix} \begin{bmatrix} I \\ C^t \end{bmatrix} = I + CC^t,
\tag{7.59}
\]
\[
Bg = \begin{bmatrix} I & C \end{bmatrix} \begin{bmatrix} g^c \\ g^b \end{bmatrix} = g^c + C g^b.
\tag{7.60}
\]
With the matrix M as defined in (7.53), the solution of (7.58) is
\[
\hat{\lambda} = 2 M^{-1} B g,
\tag{7.61}
\]
\[
\hat{s}^b = -\tfrac{1}{2} C^t \hat{\lambda} + g^b,
\tag{7.62}
\]
\[
\hat{s}^c = -\tfrac{1}{2} \hat{\lambda} + g^c.
\tag{7.63}
\]
As a final note, observe that the solution described by (7.54) and (7.55) is not only a stationary point of the Lagrangian function L(s, λ) (i.e., a potential local minimizer) but also a global minimizer of ε∗ in (7.45) [130]. As a matter of fact, problem QP-3 belongs to a broader class of optimization problems where the function being minimized is of the form f (x) = xt Zx + yt x (note that in the case of problem QP-3 the matrix Z is the positive-definite identity matrix Ip ). A proof can be found in [130] that if Z is positive-definite, a solution process similar to the process represented by (7.46) through (7.55) can be applied to obtain a unique global minimizer of f (x) = xt Zx + yt x. Reference [130] provides a most thorough treatment of this subject as well as proofs of the existence and uniqueness of the solution.
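The closed-form solution (7.54) and (7.55) can be sketched numerically as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the function names `solve_linear` and `qp_skew_schedule`, the use of exact fractions, and the small matrix C in the usage lines are all hypothetical choices made for the example.

```python
from fractions import Fraction


def solve_linear(A, b):
    """Solve the square system A x = b by Gauss-Jordan elimination (exact fractions)."""
    n = len(A)
    aug = [[Fraction(v) for v in row] + [Fraction(b[i])] for i, row in enumerate(A)]
    for col in range(n):
        # Pick any row at or below the diagonal with a non-zero pivot.
        pivot = next(r for r in range(col, n) if aug[r][col] != 0)
        aug[col], aug[pivot] = aug[pivot], aug[col]
        piv = aug[col][col]
        aug[col] = [v / piv for v in aug[col]]
        for r in range(n):
            if r != col and aug[r][col] != 0:
                f = aug[r][col]
                aug[r] = [vr - f * vc for vr, vc in zip(aug[r], aug[col])]
    return [row[n] for row in aug]


def qp_skew_schedule(C, g):
    """Return (s_hat, lam_hat) minimizing ||s - g||^2 subject to B s = 0, B = [I C]."""
    nc, nb = len(C), len(C[0])
    p = nc + nb
    B = [[Fraction(int(i == j)) for j in range(nc)] + [Fraction(c) for c in C[i]]
         for i in range(nc)]
    g = [Fraction(x) for x in g]
    # M = B B^t = I + C C^t, as in (7.53).
    M = [[sum(B[i][k] * B[j][k] for k in range(p)) for j in range(nc)]
         for i in range(nc)]
    Bg = [sum(B[i][k] * g[k] for k in range(p)) for i in range(nc)]
    lam = solve_linear(M, [2 * v for v in Bg])            # (7.54): M lam = 2 B g
    s = [g[k] - sum(B[i][k] * lam[i] for i in range(nc)) / 2
         for k in range(p)]                               # (7.55): s = g - (1/2) B^t lam
    return s, lam


# Hypothetical example: one constraint s1 - s2 - s3 = 0, i.e. C = [[-1, -1]].
s_hat, lam_hat = qp_skew_schedule([[-1, -1]], [3, 1, 1])
# s_hat == [8/3, 4/3, 4/3] and lam_hat == [2/3]; B s_hat = 0 holds exactly.
```

Solving M λ = 2Bg directly, rather than forming M⁻¹, mirrors the back substitution described for (7.52); exact fractions are used here only so that the feasibility check Bs = 0 can be verified without rounding.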
8 Delay Insertion and Clock Skew Scheduling
As briefly mentioned in Chapter 4, delay insertion into the logic network¹ can be used as a post-processing step in the mainstream digital integrated circuit design flow in order to solve the short-path (hold time) timing violations of synchronous circuits. The drawbacks of delay insertion, such as increased circuit area and power dissipation, are usually disregarded in favor of achieving a feasible timing schedule. In this chapter, an algorithm for delay insertion into the logic network that improves the efficiency and results of clock skew scheduling is presented. By systematic delay insertion, a higher operating speed or improved reliability is achieved through clock skew scheduling. It is known that the minimum clock period of a synchronous circuit achievable through clock skew scheduling is limited by the uncertainties of the data propagation times on local data paths [2] and the total data propagation times on data path loops [117].² It has been shown recently by Taskin [132] that the reconvergent local data paths also introduce an additional theoretical limit on the minimum clock period of a synchronous circuit achievable through clock skew scheduling. This limitation caused by reconvergent paths is theoretically derived and a delay insertion method is defined in order to mitigate this limitation. Overall, these limitations can be used to quickly and efficiently calculate the improvements achievable through clock skew scheduling, without having to apply clock skew scheduling. Based on the improvements achievable for a particular circuit, the design team can decide whether or not to allocate resources in the design budget to perform clock skew scheduling and non-zero clock skew clock tree synthesis.
¹ Note that clock skew scheduling also entails delay insertion, however, into the clock distribution network.
² Dependence of the minimum clock period TCP on the uncertainty of data propagation times (between $\hat{D}_{Pm}^{i,f}$ and $\hat{D}_{PM}^{i,f}$ for Ri ⇝ Rf) is visible in the Problem LCSS definition in Section 7.1.1. The linear dependency of clock skew values on data path cycles is explained in Section 7.2.2.
I.S. Kourtev et al., Timing Optimization Through Clock Skew Scheduling, DOI: 10.1007/978-0-387-71056-3_8, © Springer Science+Business Media LLC 2009
In this section, the limitations on the minimum clock period caused by all three factors are derived as applied to edge-triggered circuits. The limitations for level-sensitive circuit implementations can be derived similarly. It is shown that through systematic delay insertion, the limitation on the minimum clock period achievable through clock skew scheduling can be mitigated. In other words, the improvements achieved through clock skew scheduling can be further increased by additional delay insertion into the logic network, simultaneous with the application of clock skew scheduling. For a fully-automated application, the proposed delay insertion method is implemented as a Linear Programming (LP) problem in the tradition of the clock skew scheduling applications presented in Chapters 5 and 7. The application of the delay insertion method is demonstrated both for edge-triggered and level-sensitive circuits.
8.1 Limitations on Minimum Clock Period
Both zero clock skew and non-zero clock skew circuits are subject to limitations in the minimum clock period at which these circuits are fully operational. Remember from Section 5.3 that the limit for a zero clock skew circuit is the slowest local data path of the circuit (the path with the largest delay $\hat{D}_{PM}^{i,f}$). Consequently, a timing analysis of zero clock skew circuits is centered around identifying the N slowest local data paths of a circuit and ensuring that there are no timing hazards on any of the local data paths for a given clock period. Typically, this type of timing analysis is performed with the goal of satisfying all setup time constraints on the N selected paths. As mentioned in Sections 4.7 and 4.8, this objective can be achieved by lowering the clock frequency until all setup time constraints of the form of (4.8) [where TSkew(i, f) = 0] are satisfied. Any remaining hold time violations can then be removed by inserting delay elements, a procedure called delay padding [74].
The limitations on non-zero clock skew circuits are more complicated. These limitations are caused by various circuit topologies and, unlike zero clock skew circuits, both setup and hold time violations are hard to remove. The limitations on the minimum clock period of non-zero clock skew circuits are caused by the following three factors:
1. Uncertainty of the data propagation time along the local data paths [2],
2. The total data propagation time of data path cycles [117],
3. The difference between the total data propagation time on reconvergent paths [132].
The first of these three limitations occurs on every single local data path of a synchronous circuit while the second and third limitations only occur on those circuits where the topology of the circuit graph includes cycles and reconvergent paths, respectively. A circuit with all three limitations will ultimately be affected by the most dominant limitation.
In this section, these limitations are described for edge-triggered circuits; equivalent limitations on level-sensitive circuits can be similarly derived. In the rest of this chapter, it is assumed that reconvergent paths are the dominant limiting factor on the minimum clock period of a synchronous circuit achievable through clock skew scheduling over other limiting factors of delay uncertainty and data path cycles. This assumption does not invalidate the generality of the discussion; it is adopted in order to simplify the presentation of the delay insertion process (which is effective only for circuits where reconvergent paths are the most limiting factor).
8.1.1 Uncertainty of Data Propagation Times
The uncertainty of the data propagation times is modeled by the corner-based (best-case/worst-case or min/max) timing delay models in timing analysis. The algebraic difference between the maximum data propagation time $D_{PM}^{i,f}$ and the minimum data propagation time $D_{Pm}^{i,f}$ on a local data path Ri ⇝ Rf constitutes the delay uncertainty. For a critical local data path, the trailing edge of the previous clock cycle is the hold time before the earliest arrival of the data signal Df at register Rf. The trailing edge of the current clock cycle is the setup time after the latest arrival of the data signal Df at register Rf. This situation is depicted on an example edge-triggered local data path in Figure 8.1. Note that in Figure 8.1, the tolerances of the clock signals are ignored for the sake of simplicity.
[Figure 8.1: (a) A sample local data path Ri → Rf with data propagation times $D_{Pm}^{i,f}$, $D_{PM}^{i,f}$. (b) Delay uncertainty in timing diagram.]
Fig. 8.1. Limitation on the minimum clock period TCP caused by the delay uncertainty of a local data path.
For such a critical timing path, the setup and hold time constraints (that are modeled with inequalities) satisfy the equality conditions.³ Due to this limitation, the clock period cannot be minimized any further than
\[
\min T_{CP} = \max_{\forall R_i \leadsto R_f} \left( \hat{D}_{PM}^{i,f} - \hat{D}_{Pm}^{i,f} \right),
\tag{8.1}
\]
where $\hat{D}_{PM}^{i,f}$ collects the long-path terms $\Delta_L^{F_i}$, $D_{CQM}^{F_i}$, $D_{PM}^{i,f}$ and $\delta_S^{F_f}$, and $\hat{D}_{Pm}^{i,f}$ collects the corresponding short-path terms $\Delta_L^{F_i}$, $D_{CQm}^{F_i}$, $D_{Pm}^{i,f}$ and $\delta_H^{F_f}$. The shaded region in Figure 8.1 illustrates the timing criticality, causing the limitation on TCP.
8.1.2 Data Path Cycles
Limitations due to data path cycles arise from the linear dependency of clock skews of the local data paths on a cycle, as explained in Section 7.2.2. In a zero clock skew circuit, the circuit topology is irrelevant in the timing analysis because each local data path is analyzed independently of any neighboring paths. The timing of neighboring local data paths in a non-zero clock skew circuit, however, is interdependent. For a cycle of local data paths, this interdependency takes the form described in Section 7.2.2. In this linear dependency form, the minimum clock period is limited by the (timing) criticality of the local data paths along the cycle (in addition to the limitations caused by the delay uncertainties of each local data path along the cycle, which are the limitations explained in Section 8.1.1). The data path cycle limitation is illustrated for a sample local data path cycle in Figure 8.2. The cyclic traveling path for the data signal over a data path cycle, such as the example circuit shown in Figure 8.2, leads to stringent operating conditions under non-zero clock skew. The local data paths along the cycle operate without any slack time, because any existing slack on these local data paths is distributed over the paths through the mechanics of the clock skew scheduling process. In such circuits where a data path cycle is (timing) critical, the minimum clock period depends on two factors.
The first factor is the number of registers n along the cycle. For n registers on the cycle, n clock cycles must have passed after each completion of the cycle through a register. The second factor is the total delay of the data signal over the local data paths along the cycle. This total delay time includes the setup time $\delta_S^{F_f}$ and maximum clock-to-output time $D_{CQM}^{F_i}$ of each register along the cycle, the maximum data propagation time $D_{PM}^{i,f}$ of each local data path along the cycle, and the tolerances of the clock signal (which are ignored below for simplicity). The limitation on the minimum clock period by the data path cycles is given by the following formula:
\[
\min T_{CP} = \frac{\sum_{\forall R_i \leadsto R_f \text{ on cycle}} \left( \Delta_L^{F_i} + D_{CQM}^{F_i} + D_{PM}^{i,f} + \delta_S^{F_f} \right)}{n} = \frac{\sum_{\forall R_i \leadsto R_f \text{ on cycle}} \hat{D}_{PM}^{i,f}}{n}.
\tag{8.2}
\]
³ These constraints have no available slack for improvement.
[Figure 8.2: (a) A sample local data path cycle with n local data paths through registers R1, R2, ..., Rk, ..., R(n−1). (b) Data path cycle timing over n TCP.]
Fig. 8.2. Limitation on the minimum clock period TCP caused by data path cycles.
The shaded region in Figure 8.2 illustrates the timing criticality, causing the limitation on TCP.
8.1.3 Reconvergent Paths
A reconvergent path is composed of a series of two or more local data paths with a common source register (divergent register) and a common sink register (convergent register). A reconvergent system is composed of at least two parallel reconvergent paths. The interdependency of the timing of local data paths in a non-zero clock skew system occurs explicitly in a reconvergent system because of the reconvergent fanout. In a reconvergent system, a data signal that is initially stored in the divergent register starts propagating simultaneously through all of the reconvergent paths. The signals that are processed on the reconvergent paths arrive at the convergent register at (possibly) different times. In the case of nonidentical numbers of registers in two reconvergent paths, the data signals arrive at the convergent register during different clock cycles. The timing of all reconvergent paths is satisfied by collectively analyzing the arrival time of the data signals at the convergent register over a duration of (possibly) multiple clock cycles. In Figure 8.3, the limitation that such a reconvergent system imposes on the minimum clock period of a non-zero clock skew circuit is illustrated. In Figure 8.3, two reconvergent paths with m and n registers, respectively, are considered. The total propagation times of the data signal on the two reconvergent paths are shown. Let the propagation time on the reconvergent paths with m and n registers be the longest and shortest total propagation times, respectively. After propagating along these two paths, m and n clock cycles must have elapsed, respectively, by the time the data signals arrive at the convergent register.
When critical, the reconvergent path with n registers is matched with the trailing edge of the (n−1)-th clock cycle, while the reconvergent path with m registers is matched with the trailing edge of the m-th clock cycle. Thus, the algebraic difference between the two total data propagation times along the reconvergent paths limits the minimum clock period. Mathematically, the limitation of the reconvergent paths on the minimum clock period of non-zero clock skew circuits is given by
\[
\min T_{CP} = \frac{PD_M^{path1} - PD_m^{path2} + \delta_S^{F_{convergent}} + \delta_H^{F_{convergent}}}{|m - n + 1|},
\tag{8.3}
\]
where $PD_M^{path\,p}$ and $PD_m^{path\,p}$ represent the maximum and minimum total data propagation times between the divergent and convergent registers over path p, for paths path 1 and path 2, respectively. Unlike the limitations caused by the delay uncertainty of the local data paths and the total data propagation times along the data path cycles, the limitations caused by reconvergent paths can be mitigated. The mitigation procedure offered in [132] involves systematic delay insertion on one or more of the reconvergent paths in order to decrease the algebraic difference $PD_M^{path1} - PD_m^{path2}$ of (8.3), which consequently improves the minimum clock period TCP. Note that it is possible to increase path delay $PD_m^{path2}$ without increasing $PD_M^{path1}$ because both paths are determined by two different series of local data paths.⁴
⁴ The minimum and maximum total data propagation times along a reconvergent system may be observed on the same reconvergent path. In such a case, delay insertion is not beneficial.
[Figure 8.3: (a) A sample reconvergent path system: path 1 with m registers (Ri1, ..., Rim) and path 2 with n registers (Rj1, ..., Rjn) between the divergent register Rd and the convergent register Rc. (b) Reconvergent path system timing diagram, spanning |m − n + 1| TCP.]
Fig. 8.3. Limitation on the minimum clock period TCP caused by reconvergent paths.
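The three limits of Section 8.1 can be compared with a short sketch such as the one below. It assumes the aggregated delays $\hat{D}_{PM}$, $\hat{D}_{Pm}$ and the path delays are already known; the function names and all numbers in the usage lines are hypothetical and serve only to show how (8.1), (8.2) and (8.3) are evaluated and how the dominant limit is the largest of the three.

```python
from fractions import Fraction as F


def limit_delay_uncertainty(paths):
    """(8.1): paths holds one (D_hat_PM, D_hat_Pm) pair per local data path."""
    return max(d_max - d_min for d_max, d_min in paths)


def limit_cycle(cycle_max_delays):
    """(8.2): sum of D_hat_PM over the n local data paths of a cycle, divided by n."""
    return sum(cycle_max_delays) / len(cycle_max_delays)


def limit_reconvergent(pd_max_path1, pd_min_path2, m, n, setup, hold):
    """(8.3): reconvergent paths with m and n registers, respectively."""
    span = abs(m - n + 1)
    if span == 0:
        raise ValueError("(8.3) is undefined when |m - n + 1| = 0")
    return (pd_max_path1 - pd_min_path2 + setup + hold) / span


# Hypothetical numbers, in arbitrary time units.
t_unc = limit_delay_uncertainty([(F(5), F(3)), (F(4), F(1))])   # 3
t_cyc = limit_cycle([F(4), F(5), F(3)])                          # 4
t_rec = limit_reconvergent(F("1.2"), F("0.6"), m=0, n=0,
                           setup=F(0), hold=F(0))                # 3/5
dominant = max(t_unc, t_cyc, t_rec)                              # 4
```

In this made-up case the data path cycle dominates, so delay insertion on the reconvergent paths would not lower the achievable clock period.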
8.2 Delay Insertion Method
When clock skew scheduling is applied to a synchronous circuit, a set of optimal values that satisfy the objective function (e.g., clock period minimization) are assigned to the clock delays at each register. Certain data paths become critical timing paths because of the distribution of these optimal clock delays. In this section, the consequences of criticality to the short and long path constraints of a reconvergent path are analyzed. It is demonstrated that when the short and long path constraints of a reconvergent path are critical, the minimum clock period can be improved via delay insertion. Note that criticality of the constraints of a reconvergent path adheres to the preliminary assumption that the limitation caused by this reconvergent path system is dominant over other limitations. For circuits where limitations caused by other factors are dominant, improvement through delay insertion is not possible. In experimentation, such circuits are reported to be one of the two cases where the delay insertion method is inapplicable (i.e., the delay insertion method is not beneficial).
Let the source and sink registers in a reconvergent path system be called the divergent register Rd and the convergent register Rc, respectively. Let $p_{d\{i_1 \ldots i_n\}c}$ define a reconvergent path starting from register Rd, continuing through the intermediate registers Ri1, . . ., Rin and ending at register Rc. The number of intermediate registers $r_{d\{i_1 \ldots i_n\}c} = n$ is a non-negative integer (n ∈ Z+ ∪ {0}) and the path is acyclic [∀ in, im : Rd ≠ Rin, Rin ≠ Rim, Rd ≠ Rc and Rin ≠ Rc]. In the sample circuit modeled in Figure 7.1 (page 129), for instance, there are three reconvergent paths between v1 and v2, p12, p132 and p1342, where the numbers of intermediate registers for the three reconvergent paths of this circuit are r12 = 0, r132 = 1 and r1342 = 2, respectively.
The path delay $PD^{d\{i_1 \ldots i_n\}c}$ of a reconvergent path $p_{d\{i_1 \ldots i_n\}c}$ is defined as the total data propagation time between the divergent and convergent registers Rd and Rc, respectively, over the intermediate registers {Ri1, . . . , Rin}. The minimum and maximum path delays of this reconvergent data path are given by $PD_m^{d\{i_1 \ldots i_n\}c}$ and $PD_M^{d\{i_1 \ldots i_n\}c}$, respectively. The system delay $SD^{dc}$ of a reconvergent data path system between divergent and convergent registers Rd and Rc is defined by the conjuncture of all the (reconvergent) path delays between registers Rd and Rc. The maximum system delay $SD_M^{dc}$ of this reconvergent data path system is defined by the largest of the maximum path delays between Rd and Rc. Similarly, the minimum system delay $SD_m^{dc}$ is defined by the smallest of the minimum path delays between Rd and Rc. If there are k reconvergent paths between Rd and Rc, labeled pA, pB, . . . , pK, then:
\[
SD_m^{dc} = \min\left( PD_m^{p_A}, PD_m^{p_B}, \ldots, PD_m^{p_K} \right),
\tag{8.4}
\]
\[
SD_M^{dc} = \max\left( PD_M^{p_A}, PD_M^{p_B}, \ldots, PD_M^{p_K} \right).
\tag{8.5}
\]
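As a small illustration of (8.4) and (8.5), the system delays follow from the per-path delay intervals by taking the smallest minimum and the largest maximum. The function name `system_delay`, the dictionary layout and the three path intervals below are assumptions made for this sketch only.

```python
def system_delay(path_delays):
    """Return (SD_m, SD_M) from a mapping path name -> (PD_m, PD_M)."""
    sd_min = min(pd_min for pd_min, _ in path_delays.values())   # (8.4)
    sd_max = max(pd_max for _, pd_max in path_delays.values())   # (8.5)
    return sd_min, sd_max


# Hypothetical reconvergent system with three paths between Rd and Rc.
paths = {"pA": (2, 3), "pB": (1, 5), "pC": (4, 4.5)}
sd_m, sd_M = system_delay(paths)   # (1, 5): here both extremes come from pB
```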
[Figure 8.4: Registers R1 and R2 connected by two reconvergent local data paths: p12a with $[D_{Pm}^{12a}, D_{PM}^{12a}] = [PD_m^{12a}, PD_M^{12a}] = [1.0, 1.2]$ and p12b with $[D_{Pm}^{12b}, D_{PM}^{12b}] = [PD_m^{12b}, PD_M^{12b}] = [0.6, 0.7]$.]
Fig. 8.4. A simple reconvergent data path system.
8.2.1 Motivational Example with a Reconvergent Path
A simple reconvergent data path system formed by two reconvergent local data paths sharing the divergent and convergent registers R1 and R2, respectively, is shown in Figure 8.4. Note that as a special case, subscripts a and b are used to identify the two reconvergent local data paths p12a and p12b. Registers R1 and R2 are the divergent and convergent registers, respectively. The two reconvergent paths p12a and p12b form the reconvergent data path system. For this simple reconvergent data path system, the path delay of each reconvergent path is the data propagation delay of the respective local data paths, $PD_m^{12a} = D_{Pm}^{12a} = 1.0$, $PD_M^{12a} = D_{PM}^{12a} = 1.2$ and $PD_m^{12b} = D_{Pm}^{12b} = 0.6$, $PD_M^{12b} = D_{PM}^{12b} = 0.7$. The minimum and maximum system delays are driven by the reconvergent data paths p12b and p12a, respectively:
\[
SD_m^{12} = \min\left( PD_m^{12a}, PD_m^{12b} \right) = PD_m^{12b} = 0.6,
\tag{8.6}
\]
\[
SD_M^{12} = \max\left( PD_M^{12a}, PD_M^{12b} \right) = PD_M^{12a} = 1.2.
\tag{8.7}
\]
Two circuits with the topology presented in Figure 8.4 are analyzed in Sections 8.2.2 and 8.2.3: the edge-triggered circuit $S_{FF}$ and the level-sensitive circuit $S_L$, respectively.
8.2.2 Reconvergence in an Edge-Triggered Circuit
For edge-triggered circuits, the data signals depart the registers a clock-to-output delay ($D_{CQ}$) after the latching edge of the clock signal. Consequently in $S_{FF}$, the signal Q1 (recall Figure 4.12 on page 55) departs R1 a clock-to-output delay $D_{CQ}$ after the positive clock edge and propagates along the reconvergent paths. In order to satisfy the short path constraints, the arrival of data signals X2a and X2b at R2 must occur $\delta_H^{F2}$ later than the positive edge of the previous clock cycle at R2. Similarly, in order to satisfy the long path constraints, the arrivals must occur $\delta_S^{F2}$ earlier than the positive edge of the current clock cycle at R2:
\[
\delta_H^{F2} \le a_2 \le A_2 \le T_{CP} - \delta_S^{F2}.
\tag{8.8}
\]
[Figure 8.5: Timing diagram of the clock signals C1 and C2 showing $D_{CQ}^{1}$, the path delays $PD_m^{12b} = SD_m^{12}$, $PD_M^{12b}$, $PD_m^{12a}$ and $PD_M^{12a} = SD_M^{12}$, the arrival times a2 and A2, the hold and setup times $\delta_H^{F2}$ and $\delta_S^{F2}$, and Tmin.]
Fig. 8.5. Timing of the edge-sensitive reconvergent system in Figure 8.4 after CSS.
Next, suppose clock skew scheduling for clock period minimization is applied to an arbitrary edge-triggered circuit which involves a reconvergent data path system. After clock skew scheduling, if at least one of the reconvergent paths becomes a critical timing path, the earliest and latest arrival times of the data signal at the critical convergent node are at marginal values. Accordingly for $S_{FF}$, the arrival times a2 and A2 satisfy
\[
\delta_H^{F2} = a_2 \le A_2 = T_{min} - \delta_S^{F2},
\tag{8.9}
\]
where Tmin is the minimum clock period achievable by clock skew scheduling. The constraints in (8.9) are illustrated in Figure 8.5. C1 and C2 are the clock signals synchronizing registers R1 and R2, respectively. Also illustrated in Figure 8.5 is the separation between $A_2 + \delta_S^{F2}$ and $a_2 - \delta_H^{F2}$ defining the minimum clock period:
\[
T_{min} = A_2 + \delta_S^{F2} - \left( a_2 - \delta_H^{F2} \right).
\tag{8.10}
\]
Note that the data arrival times at R2 are given by the constraints similar to the discussion in Section 4.7:
\[
a_2 = \min\left( d_1 + D_{Pm}^{12a} - T_{min},\ d_1 + D_{Pm}^{12b} - T_{min} \right) = d_1 + \min\left( D_{Pm}^{12a}, D_{Pm}^{12b} \right) - T_{min},
\tag{8.11}
\]
\[
A_2 = \max\left( D_1 + D_{PM}^{12a} - T_{min},\ D_1 + D_{PM}^{12b} - T_{min} \right) = D_1 + \max\left( D_{PM}^{12a}, D_{PM}^{12b} \right) - T_{min}.
\tag{8.12}
\]
Substituting (8.11) and (8.12) into (8.10) yields
\[
T_{min} = D_1 + \max\left( D_{PM}^{12a}, D_{PM}^{12b} \right) - T_{min} + \delta_S^{F2} - \left[ d_1 + \min\left( D_{Pm}^{12a}, D_{Pm}^{12b} \right) - T_{min} \right] + \delta_H^{F2}.
\tag{8.13}
\]
Eq. (8.13) is simplified to
\[
T_{min} = \max\left( PD_M^{12a}, PD_M^{12b} \right) - \min\left( PD_m^{12a}, PD_m^{12b} \right) + \delta_S^{F2} + \delta_H^{F2}.
\tag{8.14}
\]
Following from (8.4) and (8.5), (8.14) is identical to
\[
T_{min} = SD_M^{12} - SD_m^{12} + \delta_S^{F2} + \delta_H^{F2}.
\tag{8.15}
\]
Substituting the numerical values and assuming zero internal register delays $D_{CQ} = D_{DQ} = \delta_S^{F} = \delta_H^{F} = 0$, the minimum clock period Tmin of $S_{FF}$ after clock skew scheduling is computed as Tmin = 0.6 time units. Consider (8.15), showing the dependence of Tmin on the algebraic difference between the maximum system delay and the minimum system delay between Rd and Rc (summed with the internal register delays $\delta_S^{F_f}$ and $\delta_H^{F_f}$). The delay insertion method is proposed to modify these maximum and minimum system delays between Rd and Rc. The modification, when applicable, decreases the algebraic difference in (8.15). In $S_{FF}$, for instance, the minimum system delay between Rd and Rc is determined by $PD_m^{12b}$ of path p12b. By inserting a delay element of 0.1 time units on p12b, the minimum and maximum path delays of this path are changed to $D_{Pm}^{12b} = 0.7$ and $D_{PM}^{12b} = 0.8$, respectively. More importantly, the minimum system delay between Rd and Rc is still determined by $PD_m^{12b}$ of path p12b, which is now 0.7 instead of the original 0.6 time units. Both before and after delay insertion, the maximum system delay between Rd and Rc is determined by $PD_M^{12a}$ of path p12a, which is a constant 1.2 time units. Therefore, the algebraic difference between the maximum and minimum system delays between Rd and Rc is improved from (1.2 − 0.6 = 0.6) to (1.2 − 0.7 = 0.5) time units. This delay insertion procedure for the circuit shown in Figure 8.4 is illustrated in Figure 8.6. The black circle in Figure 8.6 represents a delay element of [0.1, 0.2] that is inserted on the reconvergent path p12b. Note that for $S_{FF}$, inserting a delay element with a value in the range [0.4, 0.5] on p12b gives the minimum possible algebraic difference in (8.15), leading to $T^*_{min}$. For $S_{FF}$, the minimum clock period obtainable through delay insertion evaluates to $T^*_{min} = 1.2 − 1.0 = 0.2$.
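The arithmetic of this example can be reproduced with a few lines of code. The sketch below evaluates (8.15) for the two intervals of Figure 8.4 with zero register delays, before and after inserting a delay with no uncertainty on p12b; the helper names `t_min` and `insert_delay` are hypothetical, and exact fractions are used only to avoid floating-point rounding.

```python
from fractions import Fraction as F


def t_min(paths, setup=F(0), hold=F(0)):
    """(8.15): SD_M - SD_m + setup + hold, for paths given as (PD_m, PD_M) pairs."""
    sd_min = min(lo for lo, _ in paths)
    sd_max = max(hi for _, hi in paths)
    return sd_max - sd_min + setup + hold


def insert_delay(path, delay):
    """Shift a path delay interval by an inserted delay with no uncertainty."""
    lo, hi = path
    return (lo + delay, hi + delay)


p12a = (F("1.0"), F("1.2"))
p12b = (F("0.6"), F("0.7"))

before = t_min([p12a, p12b])                               # 3/5 = 0.6
after_01 = t_min([p12a, insert_delay(p12b, F("0.1"))])     # 1/2 = 0.5
after_04 = t_min([p12a, insert_delay(p12b, F("0.4"))])     # 1/5 = 0.2
after_05 = t_min([p12a, insert_delay(p12b, F("0.5"))])     # 1/5 = 0.2
after_06 = t_min([p12a, insert_delay(p12b, F("0.6"))])     # 3/10 = 0.3
```

The last line shows that inserting more than 0.5 time units makes p12b the path that sets the maximum system delay, so the difference grows again.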
It is shown that this minimum clock period obtainable through delay insertion depends on the maximum of the algebraic differences between the maximum and minimum path delays of each reconvergent path (after delay insertion).
Proposition: Let there be k reconvergent paths between Rd and Rc, labeled pA, pB, . . . , pK. The minimum possible algebraic difference between the maximum and minimum path delays of each reconvergent path between Rd and Rc after delay insertion is the minimum clock period $T^*_{min}$ obtainable through delay insertion.
Let the minimum and maximum system delays define the real number interval Λ, such that:
\[
\Lambda = \left[ SD_m^{dc}, SD_M^{dc} \right].
\tag{8.16}
\]
By definition, the minimum possible algebraic difference between the maximum and minimum path delays of each reconvergent path after delay insertion (defining the minimum possible clock period) is the minimum length of interval Λ (after delay insertion). In order to compute the minimum length |Λ| of interval Λ achievable through delay insertion, the difference [max(Λ) − min(Λ)] is computed. Recalling (8.4) and (8.5), the following is derived:
\[
\min(\Lambda) = SD_m^{dc} = \min\left( PD_m^{p_A}, PD_m^{p_B}, \ldots, PD_m^{p_K} \right),
\tag{8.17}
\]
\[
\max(\Lambda) = SD_M^{dc} = \max\left( PD_M^{p_A}, PD_M^{p_B}, \ldots, PD_M^{p_K} \right).
\tag{8.18}
\]
Let the real number delay intervals formed by the minimum and maximum delay values of the paths $p_A, p_B, \ldots, p_K$ be represented by $A, B, \ldots, K$, respectively. In other words, a delay interval $L$, associated with the path $p_L \in \{p_A, p_B, \ldots, p_K\}$, is formed by $L = [PD_m^{p_L}, PD_M^{p_L}]$. One of the following possibilities defining the expression $[|\Lambda| = \max(\Lambda) - \min(\Lambda)]$ must hold:

P1. A delay interval $M \in \{A, \ldots, K\}$ determines both the minimum $\min(\Lambda)$ and maximum $\max(\Lambda)$ values of the interval $\Lambda$. Then, $\Lambda = M$ and $|\Lambda| = |M| = \max(\Lambda) - \min(\Lambda) = \max(M) - \min(M)$,

P2. Otherwise, two non-identical delay intervals determine the minimum and maximum values of the interval $\Lambda$. Then, $\forall L \in \{A, \ldots, K\}$: $|\Lambda| = \max(\Lambda) - \min(\Lambda) > \max(L) - \min(L)$.

For systems satisfying (P1), the minimum length for $\Lambda$ is already given by $|\Lambda| = |M|$. The minimum interval length, and thus the minimum clock period, cannot be changed by delay insertion.

Fig. 8.6. The simple reconvergent system in Figure 8.4 after delay insertion. Path $p_{12a}$ from $R_1$ to $R_2$: $[PD_m^{12a}, PD_M^{12a}] = [1.0, 1.2]$. Path $p_{12b}$ from $R_1$ to $R_2$: $[PD_m^{12b}, PD_M^{12b}] = [0.6, 0.7] + [0.1, 0.2] = [0.7, 0.9]$.

Fig. 8.7. Two reconvergent data path systems satisfying (P1) and (P2), respectively. Each panel plots the delay intervals $A, B, C, D, \ldots, K$ and $\Lambda$ against delay: (i) $|\Lambda| = |D| = \max(D) - \min(D) = 9$; (ii) $|\Lambda| = \max(B) - \min(C) = 12$; (iii) $|\Lambda| = |B| = \max(B) - \min(B) = 9 < 12$; (iv) $|\Lambda| = |B| = \max(B) - \min(B) = 11 < 12$, with the accrued uncertainties $U^A$, $U^B$, $U^C$, $U^D$ marked on the intervals.

For systems satisfying (P2), the delay insertion method is used to modify one or more of the delay intervals in $\Lambda$
in order to promote one of the delay intervals to become the interval $M$. In other words, systems satisfying (P2) are converted to systems satisfying (P1) through delay insertion. Delay insertion is performed into the logic network; thus, the system delays and the interval $\Lambda$ are modified with delay insertion. Note that both the minimum and maximum system delays can be modified with delay insertion. Therefore, it is not possible to predetermine which reconvergent path will be the determining path for the interval $\Lambda$ after delay insertion.

In case (i) of Figure 8.7, a sample system satisfying (P1) is illustrated, where the delay interval $D$ (associated with path $p_D$) determines the minimum length for $\Lambda$. No modification is necessary for such systems, as the minimum possible length for $\Lambda$ is already observed. In cases (ii) and (iii) of Figure 8.7, the application of the delay insertion method to a sample system satisfying (P2) is illustrated. Note that in case (ii), the minimum value in the $\Lambda$ interval is determined identically by delay intervals $C$ and $D$ [$\min(\Lambda) = PD_m^{p_C} = PD_m^{p_D}$], while the maximum value is determined by delay interval $B$ [$\max(\Lambda) = PD_M^{p_B}$]. Delay insertion on a reconvergent path is similar to adding an offset to the interval, while preserving
the interval length. If the optimal values of delay elements are inserted on each path, the minimum possible $|\Lambda|$ is achieved by asserting that the biggest delay interval $M \in \{A, \ldots, K\}$ becomes the interval $\Lambda$. In the modification of the sample system shown in cases (ii) and (iii) of Figure 8.7, the delay interval $B$ is promoted to become this biggest delay interval $M$ such that both $\min(\Lambda)$ and $\max(\Lambda)$ are determined by delay interval $B$ (i.e., delay interval $B$ becomes $\Lambda$). The intervals before and after delay insertion on the sample system are demonstrated in cases (ii) and (iii) of Figure 8.7, respectively.

There are two important points to note here. First, the solution set of the inserted delay values is not unique (recall the similar discussions in Sections 6.1.5 and 6.4.1). For instance, the delay inserted on the path defining delay interval $C$ in case (iii) of Figure 8.7 can be any value between 6 and 12 time units ($|C| = 3$) to satisfy the computed minimum interval. Similarly, the delay values inserted on all paths can simultaneously be increased by any identical amount (e.g., $x$ time units) to generate an alternative solution. This non-unique solution set property provides a certain range of safety against any inherent uncertainty or unavailability of exact values of the delay elements.

The second important point to note is that after delay insertion, the interval lengths are preserved only if the inserted delay elements have no delay uncertainty. In demonstrating case (ii) of Figure 8.7, delay values with no uncertainties are considered in order to simplify the presentation of the delay insertion method. In reality, delay elements have delay uncertainties just like any other circuit component. These delay uncertainties of the delay elements are accrued over the associated delay intervals. Let the delay uncertainty of the delay element inserted on path $L$ be represented by $U^L$.
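The offset view of delay insertion described above can be sketched in a few lines of Python. This is a minimal illustration, not the authors' algorithm: the interval values for `A`, `B`, `C` and `D` are hypothetical (chosen only so that $|\Lambda|$ drops from 12 to 9), the helper names are invented for this sketch, and the inserted delays are assumed to have zero uncertainty. Aligning every lower bound with the largest lower bound is just one member of the non-unique solution set.

```python
from typing import Dict, Tuple

Interval = Tuple[float, float]  # (minimum path delay, maximum path delay)


def span(intervals: Dict[str, Interval]) -> float:
    """Length |Lambda| of the smallest interval covering all path delay intervals."""
    lowest = min(lo for lo, _ in intervals.values())
    highest = max(hi for _, hi in intervals.values())
    return highest - lowest


def align_offsets(intervals: Dict[str, Interval]) -> Dict[str, float]:
    """One (non-unique) choice of non-negative inserted delays.

    Every lower bound is moved up to the largest lower bound, so the
    covering interval shrinks to the widest individual interval.
    """
    target = max(lo for lo, _ in intervals.values())
    return {name: target - lo for name, (lo, _) in intervals.items()}


def insert_delays(intervals: Dict[str, Interval],
                  offsets: Dict[str, float]) -> Dict[str, Interval]:
    """Shift each interval by its offset; lengths are preserved (zero uncertainty assumed)."""
    return {name: (lo + offsets[name], hi + offsets[name])
            for name, (lo, hi) in intervals.items()}


# Hypothetical delay intervals, not taken from Figure 8.7.
before = {"A": (4.0, 9.0), "B": (5.0, 14.0), "C": (2.0, 5.0), "D": (2.0, 8.0)}
offsets = align_offsets(before)
after = insert_delays(before, offsets)

# Adding the same amount x to every inserted delay gives another valid solution.
x = 2.5
shifted = insert_delays(before, {name: d + x for name, d in offsets.items()})

print(span(before), span(after), span(shifted))
```

In this sketch the widest interval (`B`) ends up playing the role of $M$, and the uniformly shifted solution has the same $|\Lambda|$, mirroring the non-uniqueness noted in the text.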
The application of delay insertion to the sample system presented in case (ii) of Figure 8.7, where the delay uncertainties of the delay elements are accounted for, is presented in case (iv) of Figure 8.7. Note that due to the differences in the accrued delay uncertainties for each delay interval, the interval determining the minimum possible length for interval $\Lambda$ can be different compared to the ideal case presented in case (iii). Incidentally, for cases (iii) and (iv) of Figure 8.7, the delay intervals determining the minimum possible length for $\Lambda$ are $B$ and $A$, respectively. Also, in a worst case scenario, the accrued delay intervals can end up being larger compared to the minimum length for $\Lambda$ presented in case (ii). In the problem formulation presented later in Section 8.3, delay elements are realistically modeled with uncertainties.

Reflecting the proposition on a general reconvergent circuit, there are two possibilities in computing the minimum algebraic difference of (8.15):

P1*. The minimum and maximum system delays of the reconvergent data path system between $R_d$ and $R_c$ are determined by the same reconvergent path,

P2*. The minimum and maximum system delays of the reconvergent data path system between $R_d$ and $R_c$ are determined by two non-identical reconvergent paths.
For systems satisfying P1*, the minimum algebraic difference is already achieved. For systems satisfying P2*, delay insertion is used. By inserting delays in one or more of the reconvergent paths, the path with the largest difference between its maximum and minimum path delays after delay insertion becomes the determinant path for the minimum clock period $T^*_{\min}$ obtainable through delay insertion. Therefore, the minimum clock period of $S_{FF}$ with clock skew scheduling and delay insertion is

$$T^*_{\min} = \max_{\forall \alpha \in \{a,b\}} \left(PD_M^{12\alpha} - PD_m^{12\alpha} + U^{12\alpha}\right) + \delta_S^{F2} + \delta_H^{F2}. \tag{8.19}$$
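As a rough numerical check, (8.19) can be evaluated for the two-path example of Figure 8.4, with path delay intervals $[1.0, 1.2]$ and $[0.6, 0.7]$. The sketch below is only illustrative: the function names are invented here, the pre-insertion bound is written by analogy with (8.22) as the largest maximum delay minus the smallest minimum delay, and the setup time, hold time and uncertainties default to zero as assumptions.

```python
from typing import Dict, Optional, Tuple

Interval = Tuple[float, float]  # (PD_m, PD_M) of one reconvergent path


def t_min_css(paths: Dict[str, Interval], setup: float = 0.0, hold: float = 0.0) -> float:
    """Clock period bound with clock skew scheduling only (assumed form, cf. (8.22))."""
    largest_max = max(hi for _, hi in paths.values())
    smallest_min = min(lo for lo, _ in paths.values())
    return largest_max - smallest_min + setup + hold


def t_min_css_di(paths: Dict[str, Interval],
                 uncertainty: Optional[Dict[str, float]] = None,
                 setup: float = 0.0, hold: float = 0.0) -> float:
    """Clock period bound with clock skew scheduling and delay insertion, as in (8.19)."""
    u = uncertainty or {}
    widest = max(hi - lo + u.get(name, 0.0) for name, (lo, hi) in paths.items())
    return widest + setup + hold


paths = {"12a": (1.0, 1.2), "12b": (0.6, 0.7)}

t_css = t_min_css(paths)                 # about 0.6
t_di = t_min_css_di(paths)               # about 0.2 with zero uncertainty
improvement = (t_css - t_di) / t_css * 100.0

# Hypothetical uncertainty of 0.2 on the delay element inserted on path 12b.
t_di_uncertain = t_min_css_di(paths, {"12b": 0.2})

print(round(t_css, 3), round(t_di, 3), round(improvement, 1), round(t_di_uncertain, 3))
```

With a sufficiently large uncertainty on the inserted element, the padded path rather than the originally widest path determines the bound, which is the effect discussed for case (iv) of Figure 8.7.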
Assuming zero delay uncertainty and substituting the numerical values, the minimum clock period $T^*_{\min}$ of $S_{FF}$ after clock skew scheduling with the delay insertion method is $T^*_{\min} = 1.2 - 1.0 = 0.2$. The improvement achieved through delay insertion over circuits with clock skew scheduling is computed with the formula $[(T_{\min} - T^*_{\min})/T_{\min}] \times 100$. Substituting the values, the improvement is computed as $[(0.6 - 0.2)/0.6] \times 100 = 66.7\%$. The computation of the amount of delay to be inserted on each path is integrated into the clock skew scheduling algorithm. For simplicity, continuous delay models are considered here. The revised clock skew scheduling algorithm and initial insight for a general analysis using discrete delay models are presented in Sections 8.3 and 8.4.

8.2.3 Reconvergence in a Level-Sensitive Circuit

For level-sensitive circuits, results similar to an edge-triggered circuit are obtained despite the significant changes in circuit operation. The timing constraints are similar to the constraints for the edge-triggered circuit:

$$\delta_H^{F2} \le a_2 \le A_2 \le T_{\min} - \delta_S^{F2}. \tag{8.20}$$
When clock skew scheduling is applied to $S_L$, the earliest and latest arrival times at $R_2$ satisfy

$$\delta_H^{L2} = a_2 \le A_2 = T_{\min} - \delta_S^{L2}, \tag{8.21}$$

as illustrated in Figure 8.8. Using the same derivation as (8.10) and (8.13) and assuming $D_{CQ}^{L1} = D_{DQ}^{L1}$, $d_1 = D_1$ for practical reasons:

$$T_{\min} = \max\left(PD_M^{12a}, PD_M^{12b}\right) - \min\left(PD_m^{12a}, PD_m^{12b}\right) + \delta_S^{L2} + \delta_H^{L2}. \tag{8.22}$$
Substituting the numerical values into the equation and assuming zero internal register delays, the minimum clock period $T_{\min}$ of $S_L$ after clock skew scheduling is $T_{\min} = 0.6$.

Fig. 8.8. Timing of the simple level-sensitive reconvergent system in Figure 8.4 after CSS.

The delay insertion method can also be used on level-sensitive circuits in order to improve the minimum clock period. The minimum clock period of $S_L$ with clock skew scheduling and delay insertion is given by the following formula:

$$T^*_{\min} = \max_{\forall \alpha \in \{a,b\}} \left(PD_M^{12\alpha} - PD_m^{12\alpha} + U^{12\alpha}\right) + \delta_S^{L2} + \delta_H^{L2}. \tag{8.23}$$
The minimum clock period $T^*_{\min}$ of $S_L$ after clock skew scheduling and delay insertion is computed as $T^*_{\min} = 1.2 - 1.0 = 0.2$, leading to an improvement of 66.7% over the circuit with clock skew scheduling only. The revised clock skew scheduling algorithm for level-sensitive circuits is presented in Section 8.3.

Note that the earliest and latest data departure times $d_1$ and $D_1$, respectively, from a register $R_1$ can be non-identical in a level-sensitive circuit. Figure 8.8 illustrates one such case, where $d_1$ and $D_1$ occur at the leading and trailing edges of the clock signal, respectively. In such cases, the formulae in (8.22) and (8.23) do not hold true; however, the minimum clock period remains directly proportional to the algebraic difference between the maximum and minimum path delays between $R_1$ and $R_2$. The delay insertion algorithm is fully applicable to all level-sensitive circuits, as this algebraic difference can ultimately be modified with delay insertion, leading to improvements in the minimum clock period.
8.2.4 General Reconvergent Data Path Systems

The generalized case for a reconvergent data path system is presented in Figure 8.9. The edge-triggered and level-sensitive circuits are analyzed on the same circuit graph. Let there be $k$ reconvergent paths between $R_d$ and $R_c$, labeled $p_A, p_B, \ldots, p_K$. The generalized system contains $r^{d\{i_1 \ldots i_m\}c} = m$ and $r^{d\{j_1 \ldots j_n\}c} = n$ intermediate registers on two of its reconvergent paths, $p_I$ and $p_J$, respectively ($p_I, p_J \in \{p_A, p_B, \ldots, p_K\}$). Assume that the minimum and maximum system delays between $R_d$ and $R_c$ are determined by paths $p^{d\{j_1 \ldots j_n\}c} = p_J$ and $p^{d\{i_1 \ldots i_m\}c} = p_I$, respectively. Note that, if $m \ne n$, the numbers of clock cycles for data propagation along the two paths are different.

Fig. 8.9. A generalized reconvergent data path system. Path $p^{d\{i_1 \ldots i_m\}c}$ runs from $R_d$ through $R_{i_1}, \ldots, R_{i_m}$ to $R_c$ with delay interval $[PD_m^{d\{i_1 \ldots i_m\}c}, PD_M^{d\{i_1 \ldots i_m\}c}]$; path $p^{d\{j_1 \ldots j_n\}c}$ runs from $R_d$ through $R_{j_1}, \ldots, R_{j_n}$ to $R_c$ with delay interval $[PD_m^{d\{j_1 \ldots j_n\}c}, PD_M^{d\{j_1 \ldots j_n\}c}]$.

After clock skew scheduling is applied, the earliest and latest data arrival times at the convergent node with respect to the global zero time reference are

$$a^c_{global} = t^c_{cd} + n T_{\min} + \delta_H^{Fc}, \tag{8.24}$$

$$A^c_{global} = t^c_{cd} + (m + 1) T_{\min} - \delta_S^{Fc}. \tag{8.25}$$
Following from (8.10), the minimum clock period after clock skew scheduling is bounded by

$$|m - n + 1|\, T_{\min} = A^c_{global} + \delta_S^{Fc} - \left(a^c_{global} - \delta_H^{Fc}\right), \tag{8.26}$$

which leads to

$$T_{\min} = \frac{PD_M^{d\{i_1 \ldots i_m\}c} - PD_m^{d\{j_1 \ldots j_n\}c} + \delta_S^{Fc} + \delta_H^{Fc}}{|m - n + 1|} = \frac{SD_M^{dc} - SD_m^{dc} + \delta_S^{Fc} + \delta_H^{Fc}}{|m - n + 1|}. \tag{8.27}$$

The identical lower bounds of the minimum clock period stated in (8.27) for both the edge-triggered and level-sensitive circuits are demonstrated in Figure 8.10 and Figure 8.11, respectively.

Fig. 8.10. Timing of the edge-triggered reconvergent system with m=3 and n=2.

Fig. 8.11. Timing of the level-sensitive reconvergent system with m=3 and n=2.

Similar to the simple reconvergence case analyzed in Section 8.2.1, if the minimum and maximum path delays are determined by the same reconvergent path, the delay insertion method is not beneficial. If these delays are determined by different reconvergent paths, the delay insertion method is used to improve the minimum clock period. The minimum clock period achieved through clock skew scheduling and delay insertion is
$$T^*_{\min} = \max_{\forall p_R, p_S \in \{p_A, p_B, \ldots, p_K\}} \left[\frac{PD_M^{p_R} - PD_m^{p_S} + U^{p_R} - U^{p_S}}{|m - n + 1|}\right] + \frac{\delta_S^{Fc} + \delta_H^{Fc}}{|m - n + 1|}. \tag{8.28}$$

The minimum (and maximum) path delay of the reconvergent paths can be modified by inserting delays on the local data paths of the reconvergent path. The amount of delay to be inserted is determined at run time by the clock skew scheduling algorithm.
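For reference, the bound before delay insertion, (8.27), is simple to evaluate once the system delays and the register counts $m$ and $n$ are known. The sketch below is an assumption-laden illustration: the function name and the numerical system delays are hypothetical, and only the values $m = 3$, $n = 2$ are borrowed from Figures 8.10 and 8.11.

```python
def t_min_general(sd_max: float, sd_min: float, m: int, n: int,
                  setup: float = 0.0, hold: float = 0.0) -> float:
    """Evaluate the clock period bound of (8.27) for a general reconvergent system.

    sd_max, sd_min: maximum and minimum system delays between R_d and R_c.
    m, n: numbers of intermediate registers on the paths that determine
          the maximum and minimum system delays, respectively.
    """
    cycles = abs(m - n + 1)
    if cycles == 0:
        # (8.27) divides by |m - n + 1|, so the bound is undefined when n = m + 1.
        raise ValueError("bound of (8.27) is undefined for n = m + 1")
    return (sd_max - sd_min + setup + hold) / cycles


# Hypothetical system delays with m = 3 and n = 2.
example = t_min_general(sd_max=5.0, sd_min=2.0, m=3, n=2)
print(example)
```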
8.3 Linear Problem Formulation

A valid approach to computing the theoretical limitation caused by reconvergent paths in a synchronous circuit is to identify the reconvergent systems on a circuit graph and evaluate (8.28). Such an approach might not be ideal for trivial circuit topologies. As a more practical approach, two generalized LP problems are defined in order to model the delay insertion method for level-sensitive and edge-sensitive synchronous circuits. These LP problems not only model and solve the clock period minimization problems but also compute the optimal delay values to be inserted on each local data path in order to achieve the minimum possible clock period.

Table 8.1. CSS method for edge-sensitive circuits with the delay insertion method.

LP Model
  min   $T_{CP}$
  s.t.  $T_{Skew}(i,f) \le T_{CP} - D_{PM}^{i,f} - D_{CQM}^{Fi}$
        $T_{Skew}(i,f) \ge -D_{Pm}^{i,f} - D_{CQm}^{Fi} + \delta_H^{Ff}$

LP Model modified
  min   $T_{CP}$
  s.t.  $T_{Skew}(i,f) \le T_{CP} - D_{PM}^{i,f} - D_{CQM}^{Fi} - I_M^{if}$
        $T_{Skew}(i,f) \ge -D_{Pm}^{i,f} - D_{CQm}^{Fi} + \delta_H^{Ff} - I_m^{if}$
        $I_M^{if} \ge I_m^{if}$

Two clock skew scheduling algorithms presented in Table 5.1 and Table 6.2 for level-sensitive and edge-triggered circuits, respectively, are modified in order to integrate the delay insertion method. Both LP models for clock skew scheduling are highly amenable to accommodating additional design constraints. The modified clock skew scheduling algorithms using the delay insertion method, assuming continuous delay models with uncertainty, are presented in Tables 8.1 and 8.2. The amount of delay to be inserted is formulated as the minimum-amount and maximum-amount variables $I_m^{if}$ and $I_M^{if}$, respectively. The uncertainty $U^{if}$ of this delay element, defined in Section 8.2.2, is $U^{if} = I_M^{if} - I_m^{if}$. The delay variables are included in the propagation constraints on each local data path; however, pruning of the paths such that only the propagation constraints of the reconvergent paths are modified is also possible. For the former case, the clock skew scheduling algorithm simply returns zero for the delay values on the non-reconvergent paths.
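The modified constraints of Table 8.1 can be exercised without an LP solver by checking whether a candidate solution is feasible. The sketch below does only that, and is not the clock skew scheduling algorithm itself: the `LocalPath` class and `feasible` function are names invented here, the two paths are the parallel paths of the simple reconvergent example (so they share one skew value), register delays and hold time default to zero, and the inserted delay of 0.4 with zero uncertainty is a hypothetical choice.

```python
from dataclasses import dataclass
from typing import Iterable


@dataclass
class LocalPath:
    d_min: float          # D_Pm, minimum propagation delay
    d_max: float          # D_PM, maximum propagation delay
    i_min: float = 0.0    # I_m, minimum inserted delay
    i_max: float = 0.0    # I_M, maximum inserted delay


def feasible(t_cp: float, skew: float, paths: Iterable[LocalPath],
             d_cq_min: float = 0.0, d_cq_max: float = 0.0,
             hold: float = 0.0, eps: float = 1e-9) -> bool:
    """Check the 'LP Model modified' constraints of Table 8.1 for paths
    that share the same initial and final register (same skew)."""
    for p in paths:
        if p.i_max < p.i_min - eps:                               # I_M >= I_m
            return False
        if skew > t_cp - p.d_max - d_cq_max - p.i_max + eps:      # upper bound on skew
            return False
        if skew < -p.d_min - d_cq_min + hold - p.i_min - eps:     # lower bound on skew
            return False
    return True


# Two reconvergent paths with delay intervals [1.0, 1.2] and [0.6, 0.7].
no_insertion = [LocalPath(1.0, 1.2), LocalPath(0.6, 0.7)]
# Hypothetical padding of 0.4 (no uncertainty) on the shorter path.
with_insertion = [LocalPath(1.0, 1.2), LocalPath(0.6, 0.7, i_min=0.4, i_max=0.4)]

print(feasible(0.6, -0.6, no_insertion))    # period 0.6 works without insertion
print(feasible(0.2, -1.0, no_insertion))    # period 0.2 does not
print(feasible(0.2, -1.0, with_insertion))  # period 0.2 works after padding
```

A full implementation would hand the same inequalities, with $T_{CP}$, the skews and the $I_m^{if}$, $I_M^{if}$ values as variables, to an LP solver; the checker only confirms that a given point satisfies them.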
8.4 Practical Concerns in Modeling and Application

In the problem formulation, continuous delay models have been used. Practically, however, delay elements are available only in discrete values. There are two possible approaches to solving the discrete valued delay insertion problem. The naive approach is to solve the clock skew scheduling problem assuming continuous delays and to approximate the optimal values with the given set of discrete components. Although likely to produce reasonable results for simple cases, such linear approximations to integer problems do not always guarantee optimality [112]. As a more robust and ubiquitously valid approach, the problem can be formulated as a mixed integer programming (MIP) problem. The expected run times for MIP problems are typically longer than for LP problems of similar size (see Section 6.5).

Table 8.2. CSS method for level-sensitive circuits with the delay insertion method.

LP Model
  min   $T_{CP} + M\left[\sum_{\forall j} (d_j + D_j) + \sum_{\forall j: |FI(j)| \ge 1} (A_j - a_j)\right]$
  s.t.  $a_f \ge \delta_H^{Lf}$
        $A_f \le T_{CP} - \delta_S^{Lf}$
        $d_i \ge a_i + D_{DQm}^{Li}$
        $d_i \ge T_{CP} - C_W^L + D_{CQm}^{Li}$
        $D_i \ge A_i + D_{DQM}^{Li}$
        $D_i \ge T_{CP} - C_W^L + D_{CQM}^{Li}$
        $a_f \le d_{i_n} + D_{Pm}^{i_n,f} + T_{Skew}(i_n, f) - T_{CP}, \ \forall n$
        $A_f \ge D_{i_n} + D_{PM}^{i_n,f} + T_{Skew}(i_n, f) - T_{CP}, \ \forall n$
        $A_f \ge a_f$
        $D_f \ge d_f$

LP Model modified
  min   $T_{CP} + M\left[\sum_{\forall j} (d_j + D_j) + \sum_{\forall j: |FI(j)| \ge 1} (A_j - a_j)\right]$
  s.t.  $a_f \ge \delta_H^{Lf}$
        $A_f \le T_{CP} - \delta_S^{Lf}$
        $d_i \ge a_i + D_{DQm}^{Li}$
        $d_i \ge T_{CP} - C_W^L + D_{CQm}^{Li}$
        $D_i \ge A_i + D_{DQM}^{Li}$
        $D_i \ge T_{CP} - C_W^L + D_{CQM}^{Li}$
        $a_f \le d_{i_n} + D_{Pm}^{i_n,f} + I_m^{i_n f} + T_{Skew}(i_n, f) - T_{CP}, \ \forall n$
        $A_f \ge D_{i_n} + D_{PM}^{i_n,f} + I_M^{i_n f} + T_{Skew}(i_n, f) - T_{CP}, \ \forall n$
        $A_f \ge a_f$
        $D_f \ge d_f$
        $I_M^{if} \ge I_m^{if}$

Modeling and solving the problem with continuous delay models serve best to demonstrate the two main purposes of this work: identifying the limitation caused by the reconvergent paths and demonstrating how to mitigate these limitations through the delay insertion method. By adopting continuous delay models, the theoretical limitations of reconvergent paths and the level of improvement through mitigation of this limitation are analyzed independent of any cell library. For practical implementation, the MIP-based solution approaches discussed above, or similar methods, must be used.

Another practical concern for the delay insertion method is the area-aware delay insertion method proposed in [74]. In order to reduce the total area increase due to inserted delays, a delay buffer tree structure is proposed. In the
buffer tree structure, a shared delay element is placed between the fanouts (or fanins) of a register, if multiple fanouts of the same register must be padded. Note that the delay buffer-tree construction is a post-timing analysis process and is not integrated into the clock skew scheduling algorithms.

Throughout this research monograph, the local data paths are modeled abstractly at a higher hierarchy level than the gate level. Such simplification is followed in this chapter in order to improve the demonstration of the theoretical limitation of reconvergent paths and the mitigation of this limitation by the delay insertion method. In practical implementation, the location of the delay elements to be inserted into the logic must be identified at a lower level of abstraction, most suitably at the gate level of the hierarchy. The modeling of local data paths at a higher abstraction level as suggested in this work might lead to an ambiguous assignment of delays to reconvergent paths. In an extreme case, it is plausible that three or more reconvergent paths might share all of the logic paths that constitute a reconvergent system. For the simplest case of four reconvergent paths, any two reconvergent paths might differ by one logic path only, and all logic paths might be covered by the four reconvergent paths. For such a reconvergent system, including delay elements anywhere on a reconvergent path (on any logic path) would affect the path delay of more than one reconvergent path. Thus, the optimal delay insertion values computed by the presented LP problem must be post-processed for practical implementation.

The described concerns in the practical implementation of the delay insertion method are not considered in the experimentation stage of this work. Simplicity is preserved in the models used in the formulation in order to improve the presentation of the limitation caused by the reconvergent paths and the mitigation of this limitation by the delay insertion method.
Designers, however, must be wary of these practical requirements. Some researchers have already started analyzing these practical concerns [133]. In [133], the LP model shown in Table 8.2 is redefined on the gate-level netlist to pinpoint the placement of the inserted delays.
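The gap between the naive discrete approximation and a true discrete optimum, mentioned at the start of this section, can be illustrated on a toy example. Everything in this sketch is hypothetical: the four delay intervals, the continuous solution, and the three-value delay library are made up, and the exhaustive search merely stands in for the MIP formulation on a problem small enough to enumerate. Inserted delays are assumed to carry no uncertainty.

```python
from itertools import product
from typing import List, Sequence, Tuple

Interval = Tuple[float, float]  # (minimum path delay, maximum path delay)


def span(intervals: Sequence[Interval]) -> float:
    """Length of the smallest interval covering all path delay intervals."""
    return max(hi for _, hi in intervals) - min(lo for lo, _ in intervals)


def shift(intervals: Sequence[Interval], offsets: Sequence[float]) -> List[Interval]:
    """Apply one inserted delay per path (zero uncertainty assumed)."""
    return [(lo + d, hi + d) for (lo, hi), d in zip(intervals, offsets)]


def nearest(value: float, choices: Sequence[float]) -> float:
    """Naive discretization: pick the closest available delay value."""
    return min(choices, key=lambda c: abs(c - value))


intervals = [(4.0, 9.0), (5.0, 14.0), (2.0, 5.0), (2.0, 8.0)]   # hypothetical paths
continuous = [1.0, 0.0, 3.0, 3.0]                                # a continuous optimum
library = [0.0, 2.0, 4.0]                                        # hypothetical delay cells

# Naive approach: round each continuous delay to the nearest library value.
naive = [nearest(d, library) for d in continuous]

# Exhaustive search over all discrete assignments (feasible only for tiny cases).
best = min(product(library, repeat=len(intervals)),
           key=lambda offs: span(shift(intervals, offs)))

print(span(shift(intervals, continuous)))  # continuous optimum
print(span(shift(intervals, naive)))       # after naive rounding
print(span(shift(intervals, best)))        # best discrete assignment
```

In this toy case rounding loses part of the improvement even though a discrete assignment matching the continuous optimum exists, which is the kind of outcome that motivates an integer formulation.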
8.5 Summary

In this chapter, the limitations of delay uncertainty of the min-max timing models, data path cycles and reconvergent paths on the improvements achievable through clock skew scheduling are shown. The mitigation of these limitations with a delay insertion method is possible. The delay insertion method is formulated as an LP problem, proposing a highly-automated, versatile and efficient implementation. Practical concerns in modeling and implementation, such as the continuous versus discrete models of delay elements, delay element sharing between neighboring branches and underlying area costs for the delay insertion process are referenced.
9 Practical Considerations
The formulation of clock skew scheduling as a QP problem is introduced in Chapter 7. Recall that in this formulation, a feasible and consistent clock schedule is found that is close¹ to a previously chosen 'ideal' objective clock schedule. In this chapter, a computer methodology is presented for the solution of this QP clock scheduling problem. Different computer implementations are analyzed and compared in detail in Section 9.1. It is shown that the QP problem can be efficiently solved and three computer algorithmic procedures for this solution are discussed. These three algorithms are demonstrated to have $O(r^3)$ run time complexity and $O(r^2)$ storage complexity, where $r$ is the number of registers in the circuit. The numerical constants of the leading terms in these complexity expressions are derived as a function of the ratio of the number of local data paths to the number of registers in the circuit, thereby permitting a suitable algorithm to be chosen for a specific circuit.

Furthermore, the methodology presented in Chapter 7 is extended in order to account for two important details of practical interest. The circuit graph model is first discussed in Section 9.2. It is shown that certain clock skews from the basis are unconstrained² and this information is integrated into the mathematical framework described in Chapter 7. In Section 9.3, it is demonstrated how to efficiently handle the timing constraints of the I/O registers of a circuit, including the necessary modifications to the mathematical optimization procedure.
¹ Close in a Euclidean sense.
² These skews are independent from other skews within the circuit. Nevertheless, these skews must satisfy the permissible range requirement.

I.S. Kourtev et al., Timing Optimization Through Clock Skew Scheduling, DOI: 10.1007/978-0-387-71056-3_9, © Springer Science+Business Media LLC 2009

9.1 Computational Analysis

The solution to problem QP-3 is described in Section 7.2.3 in purely mathematical terms and without consideration of any computational aspects. Naturally, the solution described by (7.54) and (7.55) is determined from a program running on a computer. In this section, the time and memory requirements of three different computer implementations are analyzed in greater detail. The run time complexity $N$ of these algorithms is considered to be dependent upon the number of multiplicative (multiply and divide) floating point operations. Similarly, the memory complexity $M$ is considered to be the largest number of floating point storage units that must be stored in memory at any time during the execution of the specific algorithm.³

³ Memory transfers between main and secondary storage are, of course, always an option. For the quickest execution, however, all data should reside in the main storage.

It is shown here that the run time complexity of all three algorithms described in this section is $O(r^3)$, where $r$ is the number of registers in the circuit. Furthermore, it is shown that the numerical constant of the leading $r^3$ term in these complexity expressions is a function of the ratio

$$k = \frac{p}{r} \tag{9.1}$$

of the number of local data paths $p$ to the number of registers $r$ in a circuit. Similarly, the memory complexity of all three algorithms is $O(r^2)$, where the numerical constant of the $r^2$ term is a function of $k$ introduced in (9.1). This relationship is exploited to determine the most efficient algorithm for a specific circuit.

Note that formally the Lagrange multipliers $\lambda$ are not required for the solution of problem QP-3 since the objective of the procedures described here is to determine a feasible clock schedule $s$. Since the existence and uniqueness of a clock schedule $\hat{s}$ satisfying problem QP-3 have been established in Section 7.2.3, this clock schedule can be directly computed by evaluating the rightmost expression in (7.55),

$$\hat{s} = -B^t M^{-1} B g + g. \tag{9.2}$$

As an alternative, a sequential approach can be adopted such that the Lagrange multipliers $\hat{\lambda}$ are computed first, followed by computing $\hat{s}$ (consisting of $\hat{s}^b$ and $\hat{s}^c$) using $\hat{\lambda}$.

In the former case (a straightforward computation of $\hat{s}$), the complexity of evaluating the expression described by (9.2) determines the complexity of the overall solution. In the latter case (computing $\hat{\lambda}$ first), both $\hat{s}^b$ and $\hat{s}^c$ can be computed quickly since these computations involve only addition and subtraction operations (recall that all non-zero elements of the matrix $C$ are either 1 or −1). Therefore, in the case of computing $\hat{\lambda}$ and $\hat{s}$ in this order, the complexity of the overall solution of problem QP-3 is dominated by the computation of the Lagrange multipliers $\hat{\lambda}$.

Three computational algorithms for solving problem QP-3 are described in the following three sections. The first two algorithms, called LMCS-1 and
LMCS-2, respectively, compute $\hat{\lambda}$ and $\hat{s}$ in this order according to the dependence relationship $\hat{s} = \frac{1}{2} B^t \hat{\lambda}$ described by (7.55). The third algorithm, called CSD, computes the clock schedule $\hat{s}$ directly as described by (9.2). The algorithms LMCS-1 and LMCS-2 are described in Sections 9.1.1 and 9.1.2, respectively. Algorithm CSD is described in Section 9.1.3 and is shown to be superior to both of the other algorithms. A comparative summary of the results is offered in Section 9.1.4.

9.1.1 Algorithm LMCS-1

As mentioned previously, this algorithm for solving problem QP-3 consists of eliminating $\hat{\lambda}$ from $M\lambda = 2Bg$ [see (7.54)], then computing $\hat{s}$ according to (7.55). To determine the value of the Lagrange multipliers $\hat{\lambda}$ corresponding to the minimization of ε∗τ in problem QP-3, consider the linear system,

$$M\lambda = (BB^t)\lambda = (I + CC^t)\lambda = 2Bg = 2(g^c + Cg^b), \tag{9.3}$$
which corresponds to the last row of (7.52) and (7.58), respectively. As mentioned previously in Section 7.2.3, the symmetric matrix $M$ is always positive-definite⁴ and nonsingular, thereby permitting exactly one solution $\hat{\lambda}$ of the linear system described by (9.3).

⁴ The positive-definiteness of $M$ follows from $M = BB^t$, where $B = [\,I \;\; C\,]$ has linearly independent rows. Therefore, the kernel of $B^t$ is $\ker(B^t) = \{0\}$ and the value of the quadratic form $x^t M x = x^t B B^t x$ is positive for any value of $x \ne 0$.

The system described by (9.3) is a large square linear system of the type $Ax = b$, where $b \in \mathbb{R}^n$ is a column vector and the coefficient matrix $A \in \mathbb{R}^{n \times n}$ is dense. Typically, the most effective approach to computing the solution $\hat{x} \in \mathbb{R}^n$ of such systems consists of performing a triangular decomposition⁵ of the coefficient matrix $A$ followed by the successive solution of two relatively 'easy' to solve square linear systems of order $n \times n$.

⁵ The non-singularity of $A$, $L$ and $U$ is assumed in this discussion.

The triangular decomposition of $A$ is of the form $A = LU$, where $L$ and $U$ are a lower triangular and an upper triangular matrix, respectively [134, 135]. The solution of $Ax = LUx = b$ is obtained next by first computing the intermediate solution $\hat{y}$ of the system $Ly = b$. Finally, $\hat{x}$ is the solution of the system $Ux = \hat{y}$. Because of the triangularity of the matrices $L$ and $U$, the vectors $\hat{y}$ and $\hat{x}$ can be computed with relatively little effort. The components of the intermediate solution $\hat{y}$ are obtained by solving the system $Ly = b$, a process referred to as forward elimination [134, 135], since the first equation of $Ly = b$ involves only $y_1$, the second only $y_1$ and $y_2$, and so on. Similarly, the components of $\hat{x}$ are obtained from the system $Ux = \hat{y}$ in the reverse order $x_n, x_{n-1}, \ldots, x_1$. The process of solving $Ux = \hat{y}$ for $\hat{x}$ is also called back substitution [134, 135].

Furthermore, the symmetry and positive-definiteness of $M$ can be exploited to obtain a special form of the LU triangular decomposition of $M$ such that the lower and upper triangular matrices in the decomposition are
the transpose of each other. This alternative decomposition is known as the Cholesky decomposition of $M$ and permits $M$ to be uniquely represented [134] as the product,

$$BB^t = M = L_1 L_1^t, \tag{9.4}$$

where $L_1$ is a lower triangular matrix. The Cholesky decomposition is computationally more efficient than a general LU decomposition in that the Cholesky decomposition requires about half of the computation time of a general LU decomposition. Finally, the Cholesky decomposition has useful properties related to issues of numerical stability and accuracy. (An in-depth treatment of this subject can be found in [134, 135].)

As mentioned previously, the complexity of algorithm LMCS-1 is dominated by the complexity of computing the Lagrange multipliers $\hat{\lambda}$. This computation of $\hat{\lambda}$ consists of a total of

$$N_1(r, k) = \frac{1}{6}(k - 1)^3 r^3 + (k - 1)^2 r^2 \tag{9.5}$$

multiplications distributed among tasks as follows:

task ← number of multiplications
a. computing the Cholesky decomposition $L_1$ of $M$ ← $\frac{1}{6} n_c^3 = \frac{1}{6}(k - 1)^3 r^3$
b. forward elimination of $\xi$ from $L_1 \xi = 2Bg$ ← $\frac{1}{2} n_c^2 = \frac{1}{2}(k - 1)^2 r^2$
c. back substitution of $\hat{\lambda}$ from $L_1^t \lambda = \xi$ ← $\frac{1}{2} n_c^2 = \frac{1}{2}(k - 1)^2 r^2$.

The maximum memory usage of the algorithm LMCS-1 is

$$M_1(r, k) = \frac{1}{2}(k - 1)^2 r^2 \tag{9.6}$$

floating point elements. This memory is used during different tasks in LMCS-1 as follows:

a. matrix $M$ ← $\frac{1}{2}(p - r)^2 = \frac{1}{2}(k - 1)^2 r^2$
b. Cholesky decomposition $L_1$ of $M$ ← $L_1$ overwrites $M$ as it is computed.

9.1.2 Algorithm LMCS-2

The algorithm LMCS-2 described in this section is similar to algorithm LMCS-1 described in Section 9.1.1 in that both algorithms follow the same general course of computation. Specifically, algorithm LMCS-2 also first eliminates $\hat{\lambda}$ from $M\lambda = 2Bg$ [see (7.54)], and next computes $\hat{s}$ according to (7.55). To determine the value of the Lagrange multipliers $\hat{\lambda}$, (7.54) is solved by finding the matrix inverse $M^{-1}$ and then multiplying the right-hand side ($2Bg$) by $M^{-1}$:

$$\hat{\lambda} = M^{-1}(2Bg). \tag{9.7}$$
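The vector $\hat{\lambda}$ defined by (9.7) is the same vector that LMCS-1 obtains from (9.3) without forming $M^{-1}$. The three LMCS-1 tasks (Cholesky decomposition, forward elimination, back substitution) can be sketched in plain Python as follows. This is a small illustration rather than the implementation analyzed in the text: the matrix `C` and the vectors `g_c`, `g_b` are hypothetical, all function names are invented here, and dense list-of-lists storage is used for readability rather than efficiency.

```python
import math
from typing import List

Matrix = List[List[float]]
Vector = List[float]


def cholesky(A: Matrix) -> Matrix:
    """Return lower triangular L with A = L L^t (A symmetric positive-definite)."""
    n = len(A)
    L = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1):
            s = A[i][j] - sum(L[i][k] * L[j][k] for k in range(j))
            L[i][j] = math.sqrt(s) if i == j else s / L[j][j]
    return L


def forward_elimination(L: Matrix, b: Vector) -> Vector:
    """Solve L y = b for lower triangular L."""
    y: Vector = []
    for i, b_i in enumerate(b):
        y.append((b_i - sum(L[i][k] * y[k] for k in range(i))) / L[i][i])
    return y


def back_substitution_transposed(L: Matrix, y: Vector) -> Vector:
    """Solve L^t x = y, using only the entries of the lower triangular L."""
    n = len(y)
    x = [0.0] * n
    for i in reversed(range(n)):
        x[i] = (y[i] - sum(L[k][i] * x[k] for k in range(i + 1, n))) / L[i][i]
    return x


# Hypothetical data: C has entries in {-1, 0, 1}, as noted in the text.
C = [[1.0, -1.0], [0.0, 1.0]]
g_c = [0.5, -1.0]
g_b = [2.0, 1.0]
n_c, n_b = len(C), len(C[0])

# M = I + C C^t and right-hand side 2(g^c + C g^b), as in (9.3).
M = [[(1.0 if i == j else 0.0) + sum(C[i][k] * C[j][k] for k in range(n_b))
      for j in range(n_c)] for i in range(n_c)]
rhs = [2.0 * (g_c[i] + sum(C[i][k] * g_b[k] for k in range(n_b))) for i in range(n_c)]

L1 = cholesky(M)                               # task a
xi = forward_elimination(L1, rhs)              # task b
lam = back_substitution_transposed(L1, xi)     # task c

# Residual of M lam = 2Bg; it should vanish up to rounding.
residual = [sum(M[i][j] * lam[j] for j in range(n_c)) - rhs[i] for i in range(n_c)]
print(lam, residual)
```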
Note that the matrix inverse $M^{-1} = (I + CC^t)^{-1}$ in (9.7) can be expressed using the Sherman-Morrison-Woodbury formula [134],

$$\left(D + EF^t\right)^{-1} = D^{-1} - D^{-1} E \left(I + F^t D^{-1} E\right)^{-1} F^t D^{-1}, \tag{9.8}$$

where $D \in \mathbb{R}^{n \times n}$, $E \in \mathbb{R}^{n \times k}$, $F \in \mathbb{R}^{n \times k}$, and both $D$ and $(I + F^t D^{-1} E)$ are nonsingular. When applied to the matrix $M^{-1} = (I + CC^t)^{-1}$ with $D = I$, $E = F = C$ and $N = I + C^t C$, the Sherman-Morrison-Woodbury formula described by (9.8) yields⁶

$$M^{-1} = (I + CC^t)^{-1} = I - C\left(I + C^t C\right)^{-1} C^t = I - C N^{-1} C^t. \tag{9.9}$$

⁶ Note that $I + C^t C$ is positive-definite, thus nonsingular.

Note that in (9.9), not only can the matrix inverse $N^{-1} = (I + C^t C)^{-1}$ be computed more quickly than $M^{-1}$ (the dimension of $N$ is $n_b \times n_b$ vs. $n_c \times n_c = (k - 1)r \times (k - 1)r$ for $M$) but the computation of this inverse matrix $N^{-1}$ does not have to be explicitly performed in order to evaluate the product $C N^{-1} C^t$ in (9.9). Let the Cholesky decomposition of $N = I + C^t C$ be

$$N = L_2 L_2^t \tag{9.10}$$

and substitute (9.10) into the product $C(I + C^t C)^{-1} C^t = C N^{-1} C^t$ in (9.9), then

$$M^{-1} = I - C N^{-1} C^t = I - C \left(L_2 L_2^t\right)^{-1} C^t = I - C (L_2^t)^{-1} L_2^{-1} C^t = I - X^t X, \tag{9.11}$$

where $X$ is used to denote the product $(L_2^{-1} C^t)$. The matrix $X$ can be computed by forward elimination according to the matrix equation $L_2 X = C^t$, while the product $C N^{-1} C^t$ is equal to the product $X^t X$. Also, observe that the matrix $M^{-1}$ can be computed one row at a time, thereby drastically reducing the storage requirements of the algorithm. The $j$-th row of $M^{-1}$ is computed and used to calculate the Lagrange multiplier $\hat{\lambda}_j$ as the inner product of this $j$-th row of $M^{-1}$ and the vector $2Bg$. The memory used to store the elements of the $j$-th row of $M^{-1}$ is then overwritten with the elements of the $(j + 1)$-th row of $M^{-1}$ and so on. The rows of the matrix $M^{-1}$ can be stored on disk in order to permit the rows to be retrieved for future execution.

Just as in algorithm LMCS-1, the complexity of algorithm LMCS-2 is dominated by the complexity of computing the Lagrange multipliers $\hat{\lambda}$. This computation of $\hat{\lambda}$ consists of a total of
$$N_2(r, k) = \left[\frac{1}{6} + \frac{1}{2}(k - 1) + \frac{1}{2}(k - 1)^2\right] r^3 + (k - 1)^2 r^2 \tag{9.12}$$

multiplications distributed among the following tasks:

task ← number of multiplications
a. computing the Cholesky decomposition $L_2$ of $N$ ← $\frac{1}{6} r^3$
b. forward elimination of $X$ from $L_2 X = C^t$ ← $\frac{1}{2} r^2 (p - r) = \frac{1}{2}(k - 1) r^3$
c. evaluate $M^{-1} = I - X^t X$ ← $\frac{1}{2} r (p - r)^2 = \frac{1}{2}(k - 1)^2 r^3$
d. evaluate $\hat{\lambda} = M^{-1}(2Bg)$ ← $(p - r)^2 = (k - 1)^2 r^2$.

The maximum memory usage of algorithm LMCS-2 is

$$M_2(r, k) = \left(k - \frac{1}{2}\right) r^2 + (k - 1) r \tag{9.13}$$

floating point elements. This memory usage is distributed among different tasks in LMCS-2 as follows:

a. matrix $N$ ← requires $\frac{1}{2} r^2$ storage units
b. Cholesky decomposition $L_2$ of $N$ ← $L_2$ overwrites $N$ as it is computed
c. matrix $X$ from $L_2 X = C^t$ ← requires $r(p - r) = (k - 1) r^2$ storage units
d. matrix $M^{-1} = I - X^t X$ ← requires $(p - r) = (k - 1) r$ storage units for one row of $M^{-1}$ only.

9.1.3 Algorithm CSD

Unlike algorithms LMCS-1 and LMCS-2, the clock schedule $\hat{s}$ is computed directly in algorithm CSD, i.e., without first computing the Lagrange multipliers $\hat{\lambda}$. With this strategy, the clock schedule $\hat{s}$ is determined according to (9.2),

$$Z = B^t M^{-1} B \;\Rightarrow\; \hat{s} = -Zg + g = (-Z + I)\, g, \tag{9.14}$$

where the matrix $Z$ is introduced in (9.14) in order to simplify the notation. To evaluate $Z$, the expression described by (9.9) is substituted for $M^{-1}$ into (9.14) and the product $Z = B^t M^{-1} B$ is evaluated using the same technique as in (9.10) and (9.11):

$$Z = B^t M^{-1} B = B^t \left(I - C N^{-1} C^t\right) B = B^t B - B^t C N^{-1} C^t B = B^t B - B^t C (L_2^t)^{-1} L_2^{-1} C^t B = B^t B - Y^t Y. \tag{9.15}$$
9.1 Computational Analysis
The notation
\[
Y = L_2^{-1} C^t B \tag{9.16}
\]
is introduced in (9.15) for simplicity, where, similarly to the previously described algorithm LMCS-2, the matrix $Y$ can be eliminated according to the equation $L_2 Y = C^t B$. The clock schedule $\hat{s}$ can be computed if the operations described by (9.14), (9.15), and (9.16) are carried out literally. These expressions, however, can be manipulated to significantly reduce both the run time and memory requirements for algorithm CSD. Initially, note that computing each clock skew $s_i$ requires evaluating the inner product of two dense $p$-element-long vectors: the $i$-th row of the matrix $(-Z + I)$ and $g$. The evaluation of this inner product requires $p$ multiplications, where $p$ is the number of local data paths in the circuit. Recall, however, that the values of the clock skews from the basis $s^b$ provide sufficient information to reconstruct all clock skews $s$ in a quick fashion. Specifically, once the skews from the basis $s^b$ are known, the skews $s^c$ in the chords of the circuit may be derived through the operation described by (7.24),
\[
\begin{bmatrix} I & C \end{bmatrix} \begin{bmatrix} s^c \\ s^b \end{bmatrix} = s^c + C s^b = 0 \;\Rightarrow\; s^c = -C s^b. \tag{9.17}
\]
Since only the basis $s^b$ is evaluated, only the last $n_b$ rows of the matrix $(-Z + I)$ are computed, thereby yielding significant savings of computation time. (Note that computing one row of $Z$ requires the evaluation of $p$ row elements, each row requiring $r$ multiplications in the product $Y^t Y$.) These concepts are illustrated graphically in Figure 9.1.

[Figure 9.1: the product $s = (-Z + I)\, g$ drawn as a $p \times p$ matrix times the $p \times 1$ vector $g$; the last $n_b$ rows of $(-Z + I)$, which produce $s^b$, are marked.]

Fig. 9.1. Computation of the clock schedule basis $s^b$ by computing only the last $n_b$ rows of the matrix $-Z + I$.
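The idea of evaluating only the basis rows and then recovering the chords from (9.17) can be checked numerically. The sketch below is illustrative only: `schedule_from_basis_rows` is a hypothetical name, the skew vector is assumed to be ordered with chords first and basis skews last, and the full matrix $(-Z + I)$ is formed here purely for clarity, which a real implementation would avoid.

```python
import numpy as np

def schedule_from_basis_rows(C, g):
    """Sketch: use only the last nb rows of (-Z + I), then rebuild the
    chord skews from s^c = -C s^b (hypothetical helper)."""
    nc, nb = C.shape
    B = np.hstack([np.eye(nc), C])        # B = [I C]
    M = B @ B.T
    Z = B.T @ np.linalg.solve(M, B)       # Z = B^t M^{-1} B
    W = np.eye(nc + nb) - Z               # the matrix (-Z + I)
    s_b = W[nc:, :] @ g                   # only the last nb rows are applied
    s_c = -C @ s_b                        # chords recovered via (9.17)
    return np.concatenate([s_c, s_b])
```

The result should coincide with applying every row of $(-Z + I)$ to $g$, since the full schedule satisfies $B\hat{s} = 0$.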
The complexity of the evaluation of $(-Z + I) = (-B^t B + Y^t Y + I)$ can be reduced further by examining the computation of $Y$. Typically, the direct evaluation of $Y$ (by forward elimination from $L_2 Y = C^t B$) requires $\frac{1}{2} p r^2 = \frac{1}{2} k r^3$ multiplications. This number can be reduced by noting that
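\[
C^t B = C^t \begin{bmatrix} I & C \end{bmatrix} = \begin{bmatrix} C^t & C^t C \end{bmatrix} = \begin{bmatrix} C^t & N - I \end{bmatrix} = \begin{bmatrix} C^t & L_2 L_2^t - I \end{bmatrix} \tag{9.18}
\]
and
\[
Y = \begin{bmatrix} Y_1 & Y_2 - Y_3 \end{bmatrix}, \tag{9.19}
\]
where the matrices $Y_1$, $Y_2$, and $Y_3$ can be eliminated from the following dependencies, respectively:
\[
L_2 Y_1 = C^t \;\leftarrow\; \text{compute } Y_1 \;\rightarrow\; \text{requires } \tfrac{1}{2}(k-1) r^3 \text{ multiplications} \tag{9.20}
\]
\[
L_2 Y_2 = N \;\leftarrow\; Y_2 = L_2^t \;\rightarrow\; \text{already computed} \tag{9.21}
\]
\[
L_2 Y_3 = I \;\leftarrow\; \text{compute } Y_3 \;\rightarrow\; \text{requires } \tfrac{1}{6} r^3 + \tfrac{1}{6}(3r^2 + 2r) \text{ multiplications.} \tag{9.22}
\]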
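The block decomposition in (9.18) through (9.22) can be sanity-checked numerically. The following sketch is an assumption-laden illustration: `y_blocks` is a hypothetical helper and general solves replace forward elimination.

```python
import numpy as np

def y_blocks(C):
    """Sketch: the three blocks of (9.20)-(9.22) for a given matrix C."""
    nc, nb = C.shape
    N = np.eye(nb) + C.T @ C
    L2 = np.linalg.cholesky(N)             # N = L2 L2^t
    Y1 = np.linalg.solve(L2, C.T)          # L2 Y1 = C^t
    Y2 = L2.T                              # L2 Y2 = N  =>  Y2 = L2^t
    Y3 = np.linalg.solve(L2, np.eye(nb))   # L2 Y3 = I  =>  Y3 = L2^{-1}
    return L2, Y1, Y2, Y3
```

Stacking `Y1` next to `Y2 - Y3` should reproduce the matrix obtained by solving $L_2 Y = C^t B$ directly.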
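Finally, the following transformations (9.23) through (9.25) are used to evaluate the matrix $(-Z + I)$:
\[
B^t B = \begin{bmatrix} I \\ C^t \end{bmatrix} \begin{bmatrix} I & C \end{bmatrix}
= \begin{bmatrix} I & C \\ C^t & C^t C \end{bmatrix}
= \begin{bmatrix} I & C \\ C^t & N - I \end{bmatrix}
= I + \begin{bmatrix} O & C \\ C^t & N - 2I \end{bmatrix}, \tag{9.23}
\]
\[
\begin{aligned}
V = Y^t Y &= \begin{bmatrix} Y_1^t \\ Y_2^t - Y_3^t \end{bmatrix} \begin{bmatrix} Y_1 & Y_2 - Y_3 \end{bmatrix} \\
&= \begin{bmatrix} V_{11} & V_{12} \\ \left(L_2 L_2^{-1} C^t - L_2^{-t} L_2^{-1} C^t\right) & \left(L_2 L_2^t + L_2^{-t} L_2^{-1} - L_2 L_2^{-1} - L_2^{-t} L_2^t\right) \end{bmatrix} \\
&= \begin{bmatrix} V_{11} & V_{12} \\ C^t - (L_2^{-t} L_2^{-1}) C^t & N - 2I + L_2^{-t} L_2^{-1} \end{bmatrix},
\end{aligned} \tag{9.24}
\]
and
\[
\begin{aligned}
-Z + I &= -B^t B + Y^t Y + I \\
&= -I - \begin{bmatrix} O & C \\ C^t & N - 2I \end{bmatrix} + \begin{bmatrix} V_{11} & V_{12} \\ C^t - (L_2^{-t} L_2^{-1}) C^t & N - 2I + L_2^{-t} L_2^{-1} \end{bmatrix} + I \\
&= \begin{bmatrix} \ldots & \ldots \\ -(L_2^{-t} L_2^{-1}) C^t & L_2^{-t} L_2^{-1} \end{bmatrix}.
\end{aligned} \tag{9.25}
\]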
Note that only the last $r$ rows of $(-Z + I)$ are shown in (9.25) since only these $r$ rows are required to compute $s^b$. Also, note that the matrix $Y_1 = L_2^{-1} C^t$ does not require evaluation. Only $Y_3 = L_2^{-1}$ must be determined (from $L_2 Y_3 = I$) since $L_2^{-t} = (L_2^{-1})^t$. The computation of the clock schedule $\hat{s}$ in algorithm CSD consists of a total of
\[
N_3(r,k) = \frac{1}{2} r^3 + \frac{1}{3}(3k + 4) r^2 + \frac{1}{2} r - \frac{1}{6} \tag{9.26}
\]
multiplications distributed among the following tasks:

task ← number of multiplications
a. computing the Cholesky decomposition $L_2$ of $N$ ← $\frac{1}{6} r^3$
b. forward elimination of $Y_3 = L_2^{-1}$ from $L_2 Y_3 = I$ ← $\frac{1}{6} r^3 + \frac{1}{2} r^2 + \frac{1}{3} r$
c. evaluate the product $L_2^{-t} L_2^{-1}$ ← $\frac{1}{6} r^3 + \frac{1}{6}(5r^2 + r - 1)$
d. evaluate $s^b$ ← $rp = k r^2$.

The maximum memory usage of algorithm CSD is
\[
M_3(r,k) = r^2 \tag{9.27}
\]
floating point elements. This memory usage is distributed among different tasks in CSD as follows:

a. matrix $N$ ← requires $\frac{1}{2} r^2$ storage units
b. Cholesky decomposition $L_2$ of $N$ ← $L_2$ overwrites $N$ as it is computed
c. matrix $L_2^{-1} = Y_3$ ← $L_2^{-1}$ overwrites $L_2$ as it is computed
d. product $L_2^{-t} L_2^{-1}$ ← requires $\frac{1}{2} r^2$ storage units
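Applying the last block row of (9.25) to $g = [g^c;\ g^b]$ gives $s^b = L_2^{-t} L_2^{-1}\,(g^b - C^t g^c)$, after which the chords follow from (9.17). A minimal sketch of this flow is shown below; the name `csd_schedule` and the chords-first ordering of $g$ are assumptions made for the example, and no in-place overwriting of $N$ is attempted.

```python
import numpy as np

def csd_schedule(C, g):
    """Sketch of algorithm CSD: basis skews from the last rows of (9.25),
    chord skews from s^c = -C s^b (hypothetical helper)."""
    nc, nb = C.shape
    g_c, g_b = g[:nc], g[nc:]              # objective skews: chords, basis
    N = np.eye(nb) + C.T @ C
    L2 = np.linalg.cholesky(N)             # task a
    Y3 = np.linalg.solve(L2, np.eye(nb))   # task b: Y3 = L2^{-1}
    W = Y3.T @ Y3                          # task c: L2^{-t} L2^{-1}
    s_b = W @ (g_b - C.T @ g_c)            # task d: basis skews
    s_c = -C @ s_b                         # chords via (9.17)
    return np.concatenate([s_c, s_b])
```

No Lagrange multipliers and no $p \times p$ matrix appear anywhere in this computation, which is consistent with the lower operation and storage counts quoted for CSD.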
9.1.4 Summary of the Proposed Algorithms

This section concludes with a brief synopsis of the run time and memory requirements of the three algorithms for solving problem QP-3 described in Sections 9.1.1, 9.1.2, and 9.1.3, respectively. To summarize the results, each of the three algorithms, LMCS-1, LMCS-2, and CSD, requires $O(r^3)$ floating point multiplicative operations and $O(r^2)$ floating-point storage units. The numerical constant of the leading terms in the polynomial expressions for both the run time and memory complexity is a function of the ratio $k = p/r$, which is the ratio of the number of local data paths to the number of registers in a circuit. To gain further insight into the proposed algorithms, the numerical constants of the leading terms in the polynomial runtime complexity expressions are plotted versus $k$ in Figure 9.2. Similarly, the numerical constants of the leading terms in the polynomial memory complexity expressions are plotted versus $k$ in Figure 9.3. Note that algorithm CSD outperforms both of the other two LMCS algorithms where the superiority of algorithm CSD is particularly evident with respect to the speed of execution. Thus, algorithm CSD is the algorithm of choice for solving problem QP-3 as introduced in Section 7.2.3.

[Figure 9.2: plot of runtime complexity (vertical axis) versus $k$ (horizontal axis), with curves for LMCS-1, LMCS-2, and CSD.]

Fig. 9.2. The numerical constants (as functions of $k = p/r$) of the term $r^3$ in the runtime complexity expressions for the algorithms LMCS-1, LMCS-2 and CSD, respectively.

[Figure 9.3: plot of memory complexity (vertical axis) versus $k$ (horizontal axis), with curves for LMCS-1, LMCS-2, and CSD.]

Fig. 9.3. The numerical constants (as functions of $k = p/r$) of the term $r^2$ in the memory complexity expressions for the algorithms LMCS-1, LMCS-2 and CSD, respectively.
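The comparison can be reproduced for LMCS-2 and CSD directly from (9.12), (9.13), (9.26), and (9.27). The sketch below uses exact rational arithmetic; the function names are hypothetical, and LMCS-1 is omitted because its expressions are given in Section 9.1.1 rather than here.

```python
from fractions import Fraction as F

def n2(r, k):
    """Multiplications of LMCS-2, per (9.12)."""
    lead = F(1, 6) + F(1, 2) * (k - 1) + F(1, 2) * (k - 1) ** 2
    return lead * r**3 + (k - 1) ** 2 * r**2

def n3(r, k):
    """Multiplications of CSD, per (9.26)."""
    return F(1, 2) * r**3 + F(1, 3) * (3 * k + 4) * r**2 + F(1, 2) * r - F(1, 6)

def m2(r, k):
    """Storage units of LMCS-2, per (9.13)."""
    return (k - F(1, 2)) * r**2 + (k - 1) * r

def m3(r, k):
    """Storage units of CSD, per (9.27); independent of k."""
    return F(r**2)

if __name__ == "__main__":
    r = 1000                               # illustrative register count
    for k in (2, 4, 6, 8, 10):
        print(k, float(n2(r, k) / n3(r, k)), float(m2(r, k) / m3(r, k)))
```

For fixed $r$, the $r^3$ coefficient of LMCS-2 grows quadratically in $k$ while that of CSD stays at $1/2$, which is the trend the figures describe.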
9.2 Unconstrained Basis Skews Consider again the example circuit C1 introduced in 7.1.1 (the graph of C1 is shown in Figure 7.1). A modified version of C1 with one additional edge—the edge e6 —is shown in Figure 9.4. Also shown with thicker edges in Figure 9.4
is a spanning tree for the modified circuit $C_1$. Note that the basis edge $e_6$ does not belong to any of the fundamental cycles of the circuit depicted in Figure 9.4. In fact, the edge $e_6$ does not belong to any cycle of the circuit in Figure 9.4 at all. Such basis edges which do not belong to any cycles are called isolated, while the rest of the basis edges are called main. Note that any isolated edge must, by definition, be a basis edge (see footnote 7).

[Figure 9.4: circuit graph with vertices $v_1$ through $v_5$ and edges $e_1$ through $e_6$; each edge $e_i$ is annotated with its permissible range $[l_i, u_i]$.]

Fig. 9.4. Modified example circuit $C_1$ to include an additional edge $e_6$. $C_1$ is originally introduced in Section 7.1.1 and illustrated in Figure 7.1.
Theoretically, a circuit with $r$ registers (the vertices in the circuit graph) may have any number $n_i$ of isolated basis edges where $n_i$ ranges from zero to $r - 1 = n_b$. A circuit with $n_i = n_b = r - 1$ isolated basis edges does not have any cycles whatsoever: all edges of such circuits are basis edges and there are no chord edges to complete a cycle. A simple example of such a circuit is a shift register. Note that since isolated edges do not belong to a cycle, the clock skews on these edges are linearly independent of any other clock skews in the circuit. Intuitively, the clock skew of an isolated edge can be assigned to be any value without contradicting the linear dependencies among the skews in a circuit. Observe, for example, (7.22) written for the modified circuit $C_1$ shown in Figure 9.4:
\[
Bs = \begin{bmatrix} 1 & 0 & -1 & 1 & 0 & 0 \\ 0 & 1 & 0 & -1 & 1 & 0 \end{bmatrix}
\begin{bmatrix} s_1 \\ s_2 \\ s_3 \\ s_4 \\ s_5 \\ s_6 \end{bmatrix} = 0. \tag{9.28}
\]
All of the elements in the sixth column of $B$ are zeroes. Therefore, if $s_1$ through $s_5$ are such that (9.28) is satisfied, the choice of $s_6$ does not invalidate (9.28). This fact can be exploited in the mathematical solution of problem QP-1 to decrease the number of variables, thereby decreasing the runtime and memory requirements. The only requirement is that the basis skews (the edges) must be enumerated such that the isolated skews are last. In other words, the clock skew vector (7.19) becomes
\[
s = \begin{bmatrix} s^c \\ s^b \\ s^i \end{bmatrix}
= \Big[\, \underbrace{s_1 \,\ldots\, s_{n_c}}_{\text{Chords } (n_c)} \;
\overbrace{\underbrace{s_{n_c+1} \,\ldots\, s_{p-n_i}}_{\text{Main Basis } (n_b - n_i)} \;
\underbrace{s_{p-n_i+1} \,\ldots\, s_p}_{\text{Isolated Basis } (n_i)}}^{\text{Basis with } n_b \text{ elements}} \,\Big]^t, \tag{9.29}
\]

Footnote 7: A chord edge is already a part of a cycle and cannot be isolated.
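where $s^b$ stands for the main basis and the isolated basis is denoted by $s^i$. With this specific choice of clock skew enumeration, the $B$ matrix in (7.22) becomes
\[
B = \begin{bmatrix} B_1 & 0 \end{bmatrix}, \tag{9.30}
\]
where $0$ in (9.30) is a zero matrix of dimension $n_c \times n_i$. With this notation, it is straightforward to show that the matrix $M$ in (7.53) becomes
\[
M = B B^t = B_1 B_1^t \tag{9.31}
\]
and the solution to problem QP-1 (7.54) and (7.55) is
\[
\hat{\lambda} = 2M^{-1} B g = 2M^{-1} \begin{bmatrix} B_1 & 0 \end{bmatrix} \begin{bmatrix} g^c \\ g^b \\ g^i \end{bmatrix} = 2M^{-1} B_1 \begin{bmatrix} g^c \\ g^b \end{bmatrix}, \tag{9.32}
\]
\[
\hat{s} = g - B^t M^{-1} B g = \begin{bmatrix} \left(I - B_1^t M^{-1} B_1\right) \begin{bmatrix} g^c \\ g^b \end{bmatrix} \\ g^i \end{bmatrix}. \tag{9.33}
\]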
As can be observed in (9.32) and (9.33), 1. the choice of the objective isolated basis skews gi has no effect on either the Lagrange multipliers (9.32) or the chords and main skew basis (9.33) solution, and, 2. the final solution for the clock skews si in the isolated basis edges corresponds precisely to the objective skew values gi for these edges. Therefore, the isolated basis edges can be completely excluded from consideration when solving problem QP-1. Equations (9.32) and (9.33) demonstrate that the final clock skew values of these edges can be chosen arbitrarily provided these values satisfy the permissible range requirements.
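Both observations can be verified on the matrix $B$ of (9.28). The snippet below is a small illustrative check, with `qp1_schedule` a hypothetical name for the closed-form solution $\hat{s} = g - B^t M^{-1} B g$ and with arbitrary objective values chosen for the example.

```python
import numpy as np

# B from (9.28): two chords (s1, s2), main basis (s3..s5), isolated basis (s6).
B = np.array([[1, 0, -1, 1, 0, 0],
              [0, 1, 0, -1, 1, 0]], dtype=float)

def qp1_schedule(B, g):
    """Closed-form schedule s = g - B^t M^{-1} B g with M = B B^t."""
    M = B @ B.T
    return g - B.T @ np.linalg.solve(M, B @ g)

g = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0])   # illustrative objective skews
s = qp1_schedule(B, g)

g_alt = g.copy()
g_alt[5] = -10.0                                # change only the isolated objective
s_alt = qp1_schedule(B, g_alt)
```

Here `s[5]` equals `g[5]`, and `s[:5]` is identical to `s_alt[:5]`, so the isolated edge could have been dropped from the problem before solving.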
9.3 I/O Registers and Target Delays The clock skew scheduling methodology discussed in Chapter 7 is based on the assumption that complete connectivity and timing information is available for all local data paths within a circuit. This condition may, however, not be
realistic. Consider, for example, the input and output registers (also called the I/O registers) in a VLSI system. Some I/O registers are illustrated in Figure 9.5 where the registers $R_1$ and $R_5$ are an input and an output register, respectively, of the circuit C. The register $R_3$ shown in Figure 9.5 is an internal register since all of the other registers to which $R_3$ is connected (via local data paths) are inside the circuit C.

The timing of the I/O registers is less flexible than the timing of the internal registers. Consider, for example, the local data path $R_6 \leadsto R_1$ shown in Figure 9.5. The register $R_6$ is outside the circuit C which contains the registers $R_1$ through $R_5$. It is possible to apply a clock schedule to C that specifies a clock delay $t_{cd}^1$ to the register $R_1$. However, the timing information for the local data path $R_6 \leadsto R_1$ is not considered when scheduling the clock signal delays to the registers within C (including $t_{cd}^1$). Therefore, a timing violation may occur on the local data path $R_6 \leadsto R_1$ illustrated in Figure 9.5.

One strategy to overcome this difficulty is to include in the clock scheduling process the timing information of those local data paths which cross the boundaries of the circuit C. This approach does not change the nature of the clock scheduling algorithm but rather only the number of timing constraints. However, such an optimization scenario is difficult to conceive due to the many instances where C may be used. Therefore, a preferable approach is to set the clock signal delay to the I/O registers (such as $t_{cd}^1$ to $R_1$) to a specific value with respect to the clock source (shown as the clock pin in Figure 9.5). If this value is specified, all of the necessary timing information is available to avoid any timing violations of the local data paths such as the path $R_6 \leadsto R_1$ shown in Figure 9.5. Equivalently, a group of registers (the I/O registers, for example) may be defined which require that the clock signal be delivered to all of the registers within such a group with the same delay.

[Figure 9.5: a system (board) level clock source drives the clock pin of circuit C, which contains the registers $R_1$ through $R_5$; the register $R_6$ lies outside C.]

Fig. 9.5. I/O registers in a VLSI integrated circuit. Note that the I/O registers form part of the local data paths between the inside of the circuit and the outside of the circuit.

Application-specific integrated circuits (ASICs) and Intellectual Property (IP) blocks are good examples of circuits where the aforementioned strategy may be useful. Given the difficulty in knowing a priori all timing contexts of an integrated circuit, a preferred solution may be to require that all I/O registers are clocked at the same time (zero skew). More specifically, all possible explicit clock delay requirements for registers within the circuit fall into one of the following categories:

1. zero skew island, that is, a group of registers with equal delay,
2. target delays, that is, $t_{cd}^{k_1} = \delta_{k_1}, \ldots, t_{cd}^{k_\alpha} = \delta_{k_\alpha}$, where $k_\alpha \le r$ and $\delta_{k_1} \ldots \delta_{k_\alpha}$ are explicitly specified clock signal delay constants,
3. target skews, that is, $s_{j_1} = \sigma_{j_1}, \ldots, s_{j_\beta} = \sigma_{j_\beta}$, where $j_\beta < n_b$ and $\sigma_{j_1} \ldots \sigma_{j_\beta}$ are explicitly specified clock skew constants.

Zero skew islands can be satisfied by collapsing the corresponding graph vertices into a single vertex while eliminating all edges among vertices within the island. Note that in this case, it must be verified that zero skew is within the permissible range of each in-island path (see footnote 8). Alternatively, the target delays are converted to target skews (category 3 above) for sequentially-adjacent pairs or by adding a 'fake' edge. Thus, an algorithm to handle only target skews is necessary. Note first that target values for only $n_f \le n_b$ skews can be independently specified. As $n_f$ approaches $n_b$, the freedom to vary all skews decreases and it may become impossible to determine any feasible $s$.
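The vertex-collapsing step for a zero skew island can be sketched as follows. This is a hypothetical illustration: the edge representation (a dictionary mapping an edge name to source, destination, and permissible range bounds) and the function name `collapse_island` are assumptions made for the example, not structures defined in the text.

```python
def collapse_island(edges, island, name="island"):
    """Sketch: merge the registers in `island` into one vertex.

    edges:  dict edge_id -> (src, dst, lower, upper), with [lower, upper]
            the permissible skew range of the local data path.
    Edges with both endpoints in the island are removed after checking
    that zero skew lies inside their permissible range.
    """
    island = set(island)
    collapsed = {}
    for eid, (src, dst, lo, hi) in edges.items():
        if src in island and dst in island:
            if not (lo <= 0.0 <= hi):
                raise ValueError(f"zero skew not permissible on {eid}")
            continue                        # in-island edge is eliminated
        new_src = name if src in island else src
        new_dst = name if dst in island else dst
        collapsed[eid] = (new_src, new_dst, lo, hi)
    return collapsed
```

The permissible-range check corresponds to the verification step mentioned above; a path that fails it means the island cannot be clocked with zero skew at the chosen clock period.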
Given $n_f \le n_b$, (a) the basis can always be chosen to contain all target skews by using a spanning tree algorithm with edge swapping, and (b) the edge enumeration can be accomplished such that the target skews appear last in the basis. The problem is now similar to (7.42) except for the change of the circuit kernel equation,
\[
C = \begin{bmatrix} C_1 & C_2 \end{bmatrix} \;\Rightarrow\; Bs = \begin{bmatrix} I & C_1 & C_2 \end{bmatrix} \begin{bmatrix} \hat{s}^c \\ \hat{s}^b \\ \sigma \end{bmatrix} \;\Rightarrow\; \hat{B}\hat{s} + C_2 \sigma = 0, \tag{9.34}
\]
where $\hat{B} = \begin{bmatrix} I & C_1 \end{bmatrix}$, $\hat{s} = \begin{bmatrix} \hat{s}^c \\ \hat{s}^b \end{bmatrix}$, $\hat{s}^c = s^c$, and $\hat{s}^b$ is $s^b$ with the last $n_f$ elements removed. The matrix $C_2$ in (9.34) consists of the last $n_f$ columns of $C$, while the target skew vector $\sigma$ is an $n_f$-element vector of target skews whose elements are ordered in the order of the target edges. The linear system (7.51) becomes
\[
\begin{aligned}
2\hat{s} + \hat{B}^t \hat{m} &= 2\hat{g} \\
\hat{B}\hat{s} + C_2 \sigma &= 0
\end{aligned}
\;\Rightarrow\;
\begin{bmatrix} 2I & \hat{B}^t \\ \hat{B} & 0 \end{bmatrix}
\begin{bmatrix} \hat{s} \\ \hat{m} \end{bmatrix}
= \begin{bmatrix} 2\hat{g} \\ -C_2 \sigma \end{bmatrix}, \tag{9.35}
\]
with solution
\[
\begin{aligned}
\hat{m}^* &= 2\hat{M}^{-1}\left(\hat{B}\hat{g} + C_2 \sigma\right) \\
\hat{s}^* &= \left(I - \hat{B}^t \hat{M}^{-1} \hat{B}\right)\hat{g} - \hat{B}^t \hat{M}^{-1} C_2 \sigma.
\end{aligned} \tag{9.36}
\]

Footnote 8: Normally, this would be the case. However [recall (4.8), (4.13), (4.23), (4.24), and (4.29)], in an aggressive circuit design with a short clock period it may so happen that zero skew is designed to be out of the permissible range, most likely creating a setup time violation. In these circuits, negative skew is used to increase the overall system-wide clock frequency, thereby removing the setup violation.
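The closed form (9.36) can be cross-checked against a direct solve of the block system (9.35). In the sketch below, `target_skew_schedule` is a hypothetical name, and $\hat{M}$ is assumed to denote $\hat{B}\hat{B}^t$ by analogy with $M = BB^t$.

```python
import numpy as np

def target_skew_schedule(C1, C2, g_hat, sigma):
    """Sketch of (9.36): schedule with the last n_f basis skews fixed to
    sigma, assuming M_hat = B_hat B_hat^t (hypothetical helper)."""
    nc = C1.shape[0]
    B_hat = np.hstack([np.eye(nc), C1])            # B_hat = [I C1]
    M_hat = B_hat @ B_hat.T
    m = 2.0 * np.linalg.solve(M_hat, B_hat @ g_hat + C2 @ sigma)
    s_hat = g_hat - 0.5 * (B_hat.T @ m)            # equals the s* of (9.36)
    return s_hat, m
```

The returned `s_hat` satisfies $\hat{B}\hat{s} + C_2\sigma = 0$, so appending $\sigma$ to it gives a full skew vector that obeys the original kernel equation $Bs = 0$.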
9.4 Summary

This chapter describes practical concerns in implementing the QP-based formulation from Chapter 7 and in applying clock skew scheduling in general to a system-on-chip (SoC). First, the details of a computer implementation of the QP-based clock skew scheduling program described in Chapter 7 are presented. Three alternative implementations are discussed, and the memory requirements and computational complexity of each are analyzed in terms of the number of variables and operations. The efficacy and accuracy of each algorithm are examined through theoretical discussion. A comparative analysis is also presented, which demonstrates the superiority of the CSD algorithm (out of the three proposed computer implementation alternatives) over the LMCS-1 and LMCS-2 algorithms. Later in the chapter, the timing isolation of the intellectual property blocks in a system-on-chip implementation is presented, which enables clock skew scheduling to be applied to individual IP blocks.
10 Clock Skew Scheduling in Rotary Clocking Technology
Development of a low-jitter, low-skew (or controllable skew for clock skew scheduling) clocking technology that has low power dissipation is one of the major research topics in the development of next-generation synchronous integrated systems. Among the proposed clocking technologies are wireless [136, 137, 138] and transmission line-based [116, 139, 140, 141, 142, 143, 144, 145, 146, 147, 148] approaches. These technologies must be supported by specific design flows and CAD suites in order to be viable in semiconductor implementation. In this chapter, the adaptability of the majority of the non-zero clock skew circuit design and analysis methods presented earlier in this monograph to the physical design flow of circuits synchronized by a transmission line-based clocking technology is described. The particular transmission line-based technology of interest, the rotary clocking technology, is described in detail. Three main types of resonant clocking technologies are described in Section 10.1. In particular, the operation of resonant rotary clocking technology is summarized in Section 10.1.1 and the timing of circuits synchronized with the resonant rotary clocking technology is discussed in Section 10.1.2. The physical design flow proposed for integrated circuits synchronized with the rotary clocking technology, which requires non-zero clock skew operation and scheduling, is presented in Section 10.2. A heuristic methodology for the parallelization of clock skew scheduling is presented in Section 10.3. The chapter is summarized in Section 10.4.
10.1 Resonant Clocking

In the last decade, clock frequencies of digital integrated circuits have surpassed the GHz milestone [149]. Historically, systems that operate at clock frequencies in low MHz ranges have utilized off-chip quartz crystal oscillators [150, 151]. The oscillatory signal generated off-chip with the quartz crystal is input to the on-chip PLL, where it is multiplied to the desired frequency
on chip. The generated signal is distributed to the synchronous components throughout the chip, typically using a tree topology, called a clock tree network [152]. Especially in nano-scale CMOS, where signal integrity has become a dominating problem, the distribution of the clock signal from a single clock source over a clock tree network has become quite error-prone. The discrepancies in the arrival time of the clock signal at the destination registers increase with scaling technology. The prevailing methodology to generate such highfrequency clock signals is to use on-chip frequency multiplication by using phase-locked-loop (PLL) components [153]. The on-chip PLL components occupy chip area and lead to problems with signal reflections, capacitive loading and power dissipation that effectively limit the maximum operating frequency. The resonant clocking technologies [116, 139, 140, 141, 142, 143, 144, 145, 146, 147, 148, 154, 155, 156] present an alternative to generating the synchronizing clock signal. Based on energy recovering adiabatic switching principles [142], resonant clocking technologies permit significant power savings. As such, the resonant clocking technologies eliminate the necessity to use a complicated on-chip PLL component. Currently, there are three major types of resonant clocking technologies. These resonant clocking technologies are categorized with respect to their oscillator types and the phase and voltage characteristics of the generated clock signals: 1. Coupled LC oscillator based resonant clocking technology [140, 141, 147], 2. Standing wave oscillator based resonant clocking technology [139, 142, 145, 146], 3. Traveling wave oscillator based resonant clocking technology [116, 148, 157, 158]. Coupled LC oscillator based resonant clocking technology provides a constant magnitude clock signal with constant phase. 
A clock signal with constant magnitude and constant phase is similar to the conventional clock signals that are delivered using conventional clock tree networks. The main advantage of coupled LC oscillator based resonant clocking technology over other resonant clocking technologies is that coupled LC oscillator based clocking provides the desired clock signal without any change to the conventional design flows. Higher circuit performances are achievable solely by replacing the clock distribution network with the coupled LC oscillator based resonant clocking technology distribution network. H-tree network based implementations are introduced in [141] and extensively analyzed, including tests on silicon [147, 154, 155, 156, 159, 160, 161, 162, 163]. Standing wave oscillator based resonant clocking technology provides a varying amplitude clock signal with a constant phase [146, 164]. Similar to coupled LC oscillator based resonant clocking technology, clock phase is constant. Thus, this technology does not require drastic modifications to the conventional design flows. The varying clock signal magnitude, however, makes standing wave oscillators unattractive in integrated circuit design.
Table 10.1. Categorization of the resonant clocking technologies.

Oscillator Type    Phase       Voltage
Coupled LC         Constant    Constant
Standing Wave      Constant    Variable
Traveling Wave     Variable    Constant
Traveling wave oscillator based resonant clocking technology is the resonant clocking technology of interest in this discussion. Traveling wave oscillator based resonant clocking technology, also called rotary clocking technology, provides a clock signal which has a constant magnitude and varying phase. The varying phase (delay) of the clock signal permits easy implementation of non-zero clock skew systems. The design and analysis methods proposed for non-zero clock skew systems in earlier chapters can be used to design circuits synchronized with the rotary clocking technology. Table 10.1 [165] summarizes the categorization of the presented resonant clocking technologies, based on the magnitude and phase properties of the generated clock signals.

10.1.1 Rotary Traveling Wave Oscillators

Rotary traveling-wave oscillators (RTWO's) comprise a novel clock network implementation technology providing controllable-skew, low-jitter, gigahertz range clocking with fast transition times and low power consumption [116]. RTWO's are generated on cross-connected transmission lines, constructing a differential LC transmission line oscillator. These oscillators generate multi-phase (360 degrees) square waves with low jitter, which switch adiabatically to limit power dissipation. Multiple RTWO's can be connected together forming the rotary oscillator arrays (ROA) which distribute the synchronized square wave over the whole chip. The basic ROA structure [116] is shown in Figure 10.1. This 7x7 ROA grid topology yields 25 interconnected RTWO rings. Around each RTWO ring, a clock signal is produced that travels around the ring at a frequency dependent on the physical parameters of the ring. Pulses on each ring are phase-locked via the shared transmission line wires between the rings. When the transmission line is excited from one or more points, the traveling wave is established on the cross-connected line. 
Figure 10.2(a) shows the open loop that conceptually occurs when the circuit is being excited for the first time. Figure 10.2(b) shows the closed loop in steady state of operation where overlap of the traveling waves causes signal negation. The traveling wave is inverted on the crossover points, generating different phases of the square wave. Any number of crossovers are allowed on the transmission line. The relative phase and skew of any point on the ring is well known due to the homogeneity of the traveling pattern around the ring. Note that anti-parallel
[Figure 10.1: rotary oscillator array with panels (a), (b), and (c); the tap points on a ring are labeled with the clock phases 0°, 45°, 90°, 135°, 180°, 225°, 270°, and 315°.]

Fig. 10.1. Basic rotary clock architecture.
(shunt connected) inverter pairs are used between the cross-connected lines to save power and to initiate and maintain the traveling wave. After excitation, the anti-parallel inverters feed the traveling wave in the stronger direction, up to a stable oscillation frequency. The transmission line with anti-parallel connected inverters is shown in Figure 10.3 [116]. In Figure 10.3, the traveling wave is traveling from left to right. Each pair of anti-parallel inverters on the path of the traveling signal turns on after some time, stimulating the same process at the neighboring pair of anti-parallel inverters in the direction of the wave. The transmission line impedance is on the order of 10 Ω and the differential on-resistance of the anti-parallel connected inverters is in the 100 Ω to 1 kΩ range for a 0.25 μm technology [116]. Once a wave is established, it takes little power to sustain it. The dissipated power on the ring is given by the $I^2 R$ dissipation instead of the conventional $CV^2 f$ expression. Such consideration of power is possible because
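[Figure 10.2: (a) open loop with the wave launched on the differential line; (b) closed, cross-connected loop in steady state.]

Fig. 10.2. The RTWO theory.

[Figure 10.3: differential line with shunt connected inverter pairs at times $t_0$, $t_1$, $t_2$; the pairs behind the wavefront are already latched (+2.5 V / 0 V, reinforcing the latch) and the pairs ahead of it are yet to switch (0 V / +2.5 V).]

Fig. 10.3. The cross-section of the transmission line with shunt connected inverters.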
the energy that goes into charging and discharging MOS gate capacitance (of the inverters) becomes transmission line energy, which in turn is circulated in the closed electromagnetic path. Such conservation of energy is enabled by adiabatic switching [166, 167], in terminating the current path to the transmission line, instead of ground. The coherent switching occurs only in the direction of the traveling path. An equal amount of energy is launched in the reverse direction, however the latches in this direction are already switched, thus this energy simply serves to reinforce the previous switching events on these registers. The frequency of the clock signal generated by the rotary clocking technology depends on total capacitance and inductance in the system, which are defined by the physical implementation of the rotary wires and the
anti-parallel inverters [116]. On a typical RTWO loop, the oscillation frequency of the signal is given by the equation:
\[
f_{osc} \approx \frac{1}{2\sqrt{L_{total}\, C_{total}}} \tag{10.1}
\]
\[
\approx \frac{1}{2\sqrt{\dfrac{P\mu_0}{\pi} \log\!\left(\dfrac{\pi s}{w + t} + 1\right) \left(\sum C_{inv} + \sum C_{reg} + \sum C_{wire}\right)}}. \tag{10.2}
\]
$L_{total}$ is the total loop inductance and depends on the ring perimeter $P$, interconnect separation $s$, wire width $w$, thickness of the strip $t$ and permeability in vacuum $\mu_0$. $C_{total}$ is the total capacitance that is driven by the RTWO ring. The total capacitance is defined by gate-oxide capacitances of inverter pairs $C_{inv}$ and registers $C_{reg}$, and the tapping wire (from register to ring) capacitance $C_{wire}$. These factors affecting the total inductance $L_{total}$ and the total capacitance $C_{total}$ are the design parameters for an RTWO ring that provide the design flexibility to generate the desired frequency. Inductance variation on a typical silicon implementation is expected to be small because of the high quality of lithographic reproduction. Overall, the projected post-production variation in the targeted operating frequency is 5%, accounting for the sources of variation and the dependence of the operating frequency on $\sqrt{C}$ and $\sqrt{L}$ [116]. The operation of the ROA structure in providing a gigahertz frequency, low jitter, low power clock signal with fast transition times is confirmed by simulating the ring shown in Figure 10.1 at 965 MHz and 3.4 GHz. The ring designed at 2.5 V in 0.25 μm CMOS technology has 25 interconnected RTWO rings on a 7x7 array grid. The simulation result presented in [116] for the 3.4 GHz ring is shown in Figure 10.4. Promising results of a clock jitter of 5.5 ps and 34-dB power supply rejection ratio (PSRR) are measured for 3.4 GHz [116] and with 117-dB noise on an 18 GHz implementation [168]. Two other important metrics for an oscillator are the sensitivity to changes in temperature and supply voltage. It has been shown that the frequency deviation with temperature change between −50°C and 150°C is only 1% while the change with $V_{DD}$ deviation between 1.5 and 3.5 V is around 2% [116]. 
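The relation between ring geometry, loading, and oscillation frequency in (10.1) and (10.2) can be explored with a short script. This is a sketch under stated assumptions: the logarithm is taken to be the natural logarithm, all quantities are in SI units, the function names are hypothetical, and the geometry and capacitance values in the usage lines are made-up illustrations rather than figures from [116].

```python
import math

MU_0 = 4e-7 * math.pi  # permeability of vacuum, H/m

def rtwo_loop_inductance(perimeter, separation, width, thickness):
    """Total loop inductance as it appears under the root in (10.2),
    assuming a natural logarithm and SI units."""
    return (perimeter * MU_0 / math.pi) * math.log(
        math.pi * separation / (width + thickness) + 1.0
    )

def rtwo_frequency(l_total, c_total):
    """Oscillation frequency per (10.1): 1 / (2 sqrt(L_total C_total))."""
    return 1.0 / (2.0 * math.sqrt(l_total * c_total))

if __name__ == "__main__":
    # Hypothetical ring: 3.2 mm perimeter, 20 um wires 20 um apart, 2 um thick.
    l_tot = rtwo_loop_inductance(3.2e-3, 20e-6, 20e-6, 2e-6)
    c_tot = 10e-12 + 15e-12 + 5e-12   # inverters + registers + tapping wires
    print(f"f_osc ~ {rtwo_frequency(l_tot, c_tot) / 1e9:.2f} GHz")
```

Because the frequency scales with $1/\sqrt{C_{total}}$, doubling the capacitive load lowers the frequency by a factor of $\sqrt{2}$, which is one reason the register loading of a ring matters for the achievable clock rate.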
The immunity of the RTWO signals to process variations while allowing full skew control over 360 degree phases on the ring proves very valuable for deep sub micrometer applications. A detailed analysis of the rotary clocking technology and the RTWO loops can be found in [116, 148, 157, 158]. Research on rotary clocking can be categorized into characterization and physical design. In characterization research presented in [168, 169, 170, 171], test chips and spice models are used to analyze the power and frequency characteristics of homogeneous rotary rings. Power savings around 60-80% are reported for a single, square rotary ring [169, 170, 171]. In physical design research in [172, 173] and recently in [174], skew computation and logic placement for a given rotary clock ring are
Fig. 10.4. Line voltage and line current for the 3.4GHz clock example.
discussed. Both methods adopt iterative principles for integrated skew computation and logic placement. The point of interest for the clock skew scheduling discussion presented in this monograph is primarily the timing requirements of the rotary clocking technology, which are presented in Section 10.1.2.

10.1.2 Timing Requirements of Rotary Circuits

As described in Section 10.1.1, rotary clocking technology provides a constant magnitude clock signal with varying phase (clock skew and clock phase). The constant magnitude of the clock signal is customary; the varying clock phase, however, is not common in the mainstream circuit design flow. It is more common to use a zero clock skew, single-phase clock signal in synchronous circuit design due to its simplicity in design and analysis. The majority of the design automation tools for clock tree synthesis produce better results in generating a clock distribution network that provides a zero clock skew, single-phase clock distribution as opposed to a non-zero clock skew tree. Non-zero clock skew and multi-phase synchronization, although shown in Section 6.6 to be consistently superior over traditional zero clock skew, single-phase design, is not very popular due to the lack of automation. For traditional PLL-based clock sources and clock tree networks, an excessive amount of buffering can be necessary in order to deliver the clock signals to the synchronous components with the desired delays. Recall from Section 8.4 that buffer elements are available only in discrete values. In traditional clock
[Figure 10.5: two ring diagrams, labeled C.W. and C.W.W. in the source, each with a null point at 0/360° and tap points marked 45°, 90°, 135°, 180°, 225°, 270°, and 315°; the legend distinguishes voltage (V) and current (I).]

Fig. 10.5. The clock phase relationships on an ROA ring.
tree networks, clock delays are generated with buffering; thus, clock delays are available only in discrete values for such systems. For rotary clocking technology, however, buffer elements are not necessary, as clock delays are provided with the propagation of the clock signal on RTWO rings. The clock phase driving a synchronous component is determined by the location of the connection point of the clock signal wire on the RTWO ring as shown in Figure 10.1(b). Figure 10.5 also presents the different phases of the clock signal available for a sample rotary implementation with one crossover point. Note that with this implementation, two corresponding points on the differential line provide clock signals which are shifted by 180 degrees. Unlike traditional PLL-based clock sources, the generation of a multi-phase clock signal is highly practical with rotary clocking. The number of phases in the clock signal generated by the rotary clocking technology is determined by the number and placement of crossovers on the RTWO rings. The common two-phase synchronization scheme, as well as schemes with any other number of phases, can be implemented with rotary clocking technology without loss of quality. Two (or more) crossovers can be used to generate any desired number of overlapping or non-overlapping clock phases for multi-phase synchronization. The length and respective placement of the duty cycles of the multiple clock phases are determined by the location of the crossovers on the ring. Rotary clocking technology readily supplies a fine grain of clock delays and, potentially, phases. From a CAD perspective, continuous delay models can be used to model clock delays available in the network. From a circuit design perspective, the assignment of different clock delays to the synchronous components of a rotary-clock synchronized circuit is essential for the proper operation of the circuit. 
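Since a scheduled clock delay maps to a phase on the ring, a simple conversion helps when reading a schedule against Figure 10.5. The sketch below assumes an idealized continuous delay model in which phase advances linearly with delay over one clock period; the function names are hypothetical, and the mapping from phase to a physical tap location is deliberately left out because it depends on the ring layout.

```python
def tap_phase_degrees(delay, period):
    """Phase (in degrees, 0 <= phase < 360) that corresponds to a scheduled
    clock delay, assuming phase grows linearly with delay over one period."""
    return 360.0 * ((delay % period) / period)

def complementary_phase(phase):
    """Phase seen at the corresponding point on the other conductor of the
    differential line, which is shifted by 180 degrees."""
    return (phase + 180.0) % 360.0
```

For example, a register scheduled a quarter period after the reference would tap the ring at the 90° point, and the opposite conductor at that location carries the 270° phase.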
Towards this end, the most common problem is the unbalanced capacitive loading of the rotary network. The lack of a relatively uniform load distribution (within one ring or between multiple rings) may affect the rotation of the oscillatory signal on the ring(s), thereby degrading the quality of synchronization. In the optimal scheduling scenario, the clock delays at the synchronous components are distributed relatively evenly in time, leading to a relatively balanced distribution of the latching points on the rotary ring. The required balanced loading of the ROA rings can be provided by clock skew scheduling (see the distribution of clock delays for a sample circuit in Figure 11.5 on page 213). The advanced timing methodology of using non-zero clock skew circuits with multi-phase synchronization can easily be realized in circuits synchronized with rotary clocking technology. Conversely, the implementation of circuits synchronized with the rotary clocking technology requires automated design and analysis methodologies for multi-phase, non-zero clock skew synchronization schemes. Integrating these design and analysis methodologies into the physical design flow leads to circuits which benefit both from the presented advanced timing methodologies and from the rotary clocking technology.
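As a rough illustration of how a scheduled clock delay relates to a tapping point on a ring, and of how evenly a set of delays loads the ring, consider the following minimal sketch. It assumes that the wave travels at uniform speed and needs two laps of the ring to complete one clock period, so that the two conductors at the same location are 180 degrees apart. The function names, the uniform-speed assumption and the sector-count measure of balance are illustrative assumptions and are not part of the presented flow.

```python
def clock_phase_deg(clock_delay, period):
    """Phase in degrees, in [0, 360), corresponding to a clock delay."""
    return (clock_delay % period) / period * 360.0


def tap_point(clock_delay, period, ring_perimeter):
    """Return (conductor, distance along the ring) for a clock delay.

    Assumption: conductor 0 carries phases [0, 180) and conductor 1
    carries phases [180, 360), each spread uniformly over one lap.
    """
    phase = clock_phase_deg(clock_delay, period)
    conductor = 0 if phase < 180.0 else 1
    distance = (phase % 180.0) / 180.0 * ring_perimeter
    return conductor, distance


def sector_loads(clock_delays, period, sectors=8):
    """Count the registers whose phase falls in each equal phase window."""
    counts = [0] * sectors
    for delay in clock_delays:
        index = int(clock_phase_deg(delay, period) / 360.0 * sectors) % sectors
        counts[index] += 1
    return counts


# Evenly spread delays load every sector equally; clustered delays do not.
print(sector_loads([0, 2.5, 5, 7.5], period=10, sectors=4))    # [1, 1, 1, 1]
print(sector_loads([0, 0.1, 0.2, 0.3], period=10, sectors=4))  # [4, 0, 0, 0]
```

A schedule whose sector counts are close to uniform corresponds to the relatively balanced distribution of latching points discussed above.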
10.2 Physical Design Flow
The physical design flow for integrated circuits synchronized with the rotary clocking technology necessitates a clock network stage for the implementation of the ROA grid topology and multiple-phase, non-zero clock skew clocks. In this chapter, the physical design flow is examined from the perspective of the requirements for a non-zero clock skew circuit implementation. The design flow is illustrated with the flow chart shown in Figure 10.6. The flow includes processing the design entry to investigate the complexity and requirements of the circuit, partitioning the netlist, performing clock skew scheduling and performing register and logic placement.
The three major steps of the presented physical design methodology are the partitioning, clock skew scheduling and placement steps. The partitioning step is proposed in order to generate logic partitions that are implementable within the ROA ring regions of a rotary clocking network. The clock skew scheduling step is required to improve the scalability of conventional clock skew scheduling techniques through partitioning. The placement step is proposed in order to provide a practical implementation alternative for the mapping of the circuit logic and registers to the ROA rings. These steps are required to increase the feasibility of the rotary clocking technology as the infrastructure of choice for the advanced timing methodologies discussed in previous chapters (such as clock skew scheduling and multi-phase synchronization).
The design entry is provided in industry standard file formats, such as DEF, LEF and SDF. Initial timing information for the circuit is necessary for the application of clock skew scheduling. This information must be obtained prior to tape-out, preferably from a preliminary placement and routing of the circuit.
[Flow chart: DESIGN ENTRY; PARTITIONING (ROA size, partitioning, register insertion), followed by the decision "ROA feasible?"; CLOCK SKEW SCHEDULING (CSS on partitions 1 through N, CSS on top block), followed by the decision "CSS feasible?"; PLACEMENT (register mapping, logic placement).]
Fig. 10.6. The physical design flow of VLSI circuits with RTWO clock synchronization.
The implementation of the ROA rings and netlist partitioning are dependent on each other as illustrated in the Partitioning step in the flow chart. The size and number of rings in the ROA structures depend on several factors such as the complexity of the design, the availability of clock network design resources, the computational resources for timing analysis and the silicon area. Despite these dependencies, the number and physical dimensions of ROA rings in a circuit are quite flexible. The number of ROA rings is usually held sufficiently high in order to limit the total wirelength. The shapes of ROA rings are not necessarily regular (e.g., rectangles) as implied by the mesh structure presented in Section 10.1.1. Such flexibility in the physical
implementation of the ROA rings enables reconciliation of the non-routable blocks of the chip area. Partitioning is performed on a gate-level or a register-transfer level netlist. For the former case, it is often necessary to insert extra registers in the logic network as part of the timing-driven partitioning process. This process is represented by the “Register Insertion” block in the flow chart. These inserted registers are level-sensitive latches operating in the transparent phases of operation (Section 4.2) in order to preserve the functionality of the original circuit. The feasibility of the partitioning result is checked at the next validation step. If the current result is not feasible, the partitioning step of the design flow is repeated until feasibility is satisfied. In the clock skew scheduling step, the rotary clock network is constructed. Data paths that are local to each partition are identified and the corresponding timing constraints are included in the clock skew scheduling problem for that partition. Similarly, the timing constraints of local data paths which span different partitions are included in the clock skew scheduling problem of the top block. A heuristic method is proposed to solve the partition and top block LP problems. The clock skew scheduling problems of each partition are independent of each other, so these analyses can be parallelized. After the clock skew scheduling block is completed, the optimal clock signal delays required at each synchronous component are known. Depending on the number of clock phases and the number of registers for a given clock phase, the mapping of synchronous components to the registers within an ROA ring is performed. This is an automated design step called “Register Mapping” in the flow chart. Following register mapping, the rest of the logic within a partition is placed in the area available within the ROA rings for this partition. 
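The grouping of timing constraints described above (paths local to a partition versus paths spanning partitions) can be sketched as follows. This is a minimal illustration only: the function name `split_constraints`, the register names and the dictionary-based partition assignment are hypothetical and do not reflect the data structures of the actual CAD tool.

```python
from collections import defaultdict


def split_constraints(paths, partition_of):
    """Group local data paths into partition problems and a top block problem.

    paths: iterable of (initial_register, final_register) pairs.
    partition_of: dict mapping each register name to its partition id.
    """
    per_partition = defaultdict(list)
    top_block = []
    for initial, final in paths:
        if partition_of[initial] == partition_of[final]:
            # Both registers in one partition: constraint of that partition.
            per_partition[partition_of[initial]].append((initial, final))
        else:
            # The path spans two partitions: constraint of the top block.
            top_block.append((initial, final))
    return dict(per_partition), top_block


paths = [("R1", "R2"), ("R2", "R3"), ("R3", "R4")]
partition_of = {"R1": 0, "R2": 0, "R3": 1, "R4": 1}
local, top = split_constraints(paths, partition_of)
print(local)  # {0: [('R1', 'R2')], 1: [('R3', 'R4')]}
print(top)    # [('R2', 'R3')]
```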
The placement is performed using conventional logic placement techniques. The partitioning and clock skew scheduling steps of the physical design flow are presented in detail in Sections 10.2.1 through 10.2.4. The placement step is discussed in Section 10.2.5.
10.2.1 Timing-Driven Partitioning
The objective of the conventional timing-driven partitioning process is to generate circuit placements that are more likely to meet a particular timing budget. Path-based and net-based partitioners [175] are the two most widely used kinds of partitioners in current state-of-the-art physical design. Both path-based and net-based partitioners are used to limit the lengths of selected critical paths in a circuit. Limiting the number of analyzed paths in this way significantly reduces the processing time for partitioning (and static timing analysis) while generally preserving the accuracy of the analysis. In clock skew scheduling, the local data paths in an entire circuit (or circuit partition) are equally important and are analyzed together. Thus, traditional path-based and net-based timing-driven partitioning methods do not provide ideal cuts for the application of clock skew scheduling. An alternative partitioning approach is therefore proposed in this work, using selection criteria that lead to partitions which are amenable to clock skew scheduling. Towards this end, a hypergraph partitioning tool is used with fine-tuned partitioning criteria to generate partitions that are easily implementable with the rotary clocking technology. Principally, timing-driven partitioning is performed within the proposed design methodology subject to the following considerations:
1. To construct the logic network partitions that will be synchronized by individual ROA rings of the rotary clocking technology,
2. To enable the completion of path enumeration on large scale circuits,
3. To enable the completion of clock skew scheduling algorithms on large scale circuits.
The first of the three factors listed above is directly related to the implementation of the rotary clocking technology. If clock tree synthesis is performed completely independently from logic synthesis, the assignment of synchronous components to individual ROA rings can be inefficient for physical implementation. As discussed in Section 10.1.2, a relatively balanced distribution of clock phases is necessary for the quality of synchronization with a rotary clock signal. An unbalanced loading of synchronous components on the ROA rings may also cause hot spots in the circuit or significantly increase the clock load on one side of the chip compared to another (thereby causing performance degradation). To prevent such undesired operation, logic and clock tree synthesis need to be performed interdependently. The partitioning procedure presented here achieves this goal by generating balanced logic partitions to be synchronized by each ROA ring. Advantageously, the clock phases at the synchronous components within each partition are well distributed after the application of clock skew scheduling (see Figure 11.5 on page 213) to the logic partitions.
Thus, the requirement of synchronization by non-zero clock skew is satisfied, as well as the capacitive load balancing requirement for robust rotary oscillation.
The second and third factors that drive the timing-driven partitioning process are related to the design and analysis methodologies of large-scale circuits. Although discussed here within the context of rotary clock synchronization, the partitioning procedures presented in this chapter can also be applied to circuits synchronized with traditional clocking technologies. From a CAD perspective, the general applicability of the partitioning procedure to improving the scalability of clock skew scheduling (independent of the particular clocking technology) is discussed next.
As reported earlier in Chapter 5, the limited scalability of clock skew scheduling is an important drawback hindering its widespread acceptance in mainstream design. Most industrial-strength timing tools and circuit designers that implement variations of clock skew scheduling perform these tasks only on certain portions of the circuit, without analyzing the circuit in its entirety. Analysis of the entire circuit in order to implement a full-scale application of clock skew scheduling can be computationally intensive for very large-scale circuits. The main obstacle for the application of clock skew scheduling to the entire circuit is the run time of the LP problems. The LP problem for the application of clock skew scheduling is formulated as described in Chapter 5. The LP problems generated for an integrated circuit with millions of paths and hundreds of thousands or more synchronous components can be very large. The run times of such large LP problems are usually reasonable within the typically long IC design cycles (up to a few days with industrial strength LP solvers and common computing resources). However, very large models might not be solvable at all within the memory limits of common computing resources. In several industry applications, for instance, LP problems for the clock skew scheduling of large-scale circuits are observed to exceed the practical limits of desktop computing resources (e.g., 4 gigabytes of memory for 32-bit systems) [176]. Partitioning, as discussed here, remedies this shortcoming. By partitioning the circuit into small partitions, small linear programming models can be developed and solved for each partition. In practice, the LP problems can be solved in parallel, further improving run times.
10.2.2 Partitioning with Chaco
In the development of the partitioning step of the physical design flow, the partitioning tool Chaco [177] from Sandia National Laboratories is used. Chaco is a hypergraph partitioning tool that is primarily developed for the parallelization of tasks on special architectures. Nevertheless, Chaco has proven to be applicable to a wide range of areas. Chaco offers various methods for partitioning (spectral bisection [178], the inertial method [179], the Kernighan-Lin [180] and Fiduccia-Mattheyses [181] algorithms, and multilevel partitioners [182]), each fine tuned for a specific purpose.
Among the multiple criteria for partitioning a synchronous circuit for clock skew scheduling are the weight, number and location of the cuts amongst partitions, the weight of each partition, the relative mapping of sequentially-adjacent registers to partitions and the number of internal vertices per partition. Chaco tracks the quality of these partitioning performance metrics with user-defined priorities. In order to generate partitions amenable to clock skew scheduling, the number of cuts between partitions must be minimal and the number of internal vertices (vertices that do not have edges between partitions) must be maximal. Depending on particular design budgets and the priority of the performance metrics, the weights of particular nets or vertices can be fine tuned.
In the computer-aided design (CAD) tool implementation, the application of partitioning to two types of netlists is supported. These netlists, categorized by the hierarchical level of input data, are:
1. Gate level netlists,
2. Register-transfer level netlists.
If the input to the CAD tool is a register-transfer level netlist, identifying local data paths (register-to-register timing paths) is inherently simple. The local data paths in the register-transfer level netlist already form a circuit graph such as the one shown in Figure 7.1 (page 129), where each vertex is a register or a synchronous component and each edge is a local data path.
If the input to the CAD tool is a gate-level netlist, some paths can be too long (high logic depth), which practically limits the quality of partitions. For such long paths, the partitioner is tuned such that registered-input, registered-output partitions are generated. To encourage the generation of such partitions, the following rules are applied in weight assignment to edges:
1. If the edge is between two registers, assign low edge weight.
2. If the edge is a fanout from the data output terminal of a synchronous component to a combinational component, assign high edge weight.
3. If the edge is a fanout from a combinational component to the data input terminal of a synchronous component, assign low edge weight.
4. If the edge is between two combinational components, assign high edge weight.
With these edge weights, the Chaco partitioning tool minimizes the total weight of the cuts, steering the cuts through the data input terminals of synchronous components. In the case of single-input synchronous components, such as flip-flops, a data input net carries a single connection, while a data output net can have multiple fanouts. Hence, the cuts are directed to occur at the data input terminal of a synchronous component as opposed to a data output terminal. A synchronous component on the boundary of two partitions is shared between the two partitions, forming the registered-input and registered-output partitions.
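The four weighting rules, together with the cut and internal-vertex metrics mentioned earlier, can be illustrated with a minimal sketch. The numeric values of `LOW` and `HIGH`, the function names and the small example netlist are assumptions chosen for illustration; the sketch does not call Chaco and does not reproduce its input format.

```python
LOW, HIGH = 1, 10  # illustrative values; the actual weights are a tuning choice


def edge_weight(src_is_register, dst_is_register):
    """Weight of a data edge according to the four rules in the text."""
    if src_is_register and dst_is_register:
        return LOW   # rule 1: register to register
    if src_is_register:
        return HIGH  # rule 2: register data output to combinational logic
    if dst_is_register:
        return LOW   # rule 3: combinational logic to register data input
    return HIGH      # rule 4: combinational to combinational


def cut_weight(edges, is_register, part):
    """Total weight of the edges whose endpoints lie in different partitions."""
    return sum(edge_weight(is_register[u], is_register[v])
               for u, v in edges if part[u] != part[v])


def internal_vertices(edges, part):
    """Vertices that have no edge crossing a partition boundary."""
    boundary = set()
    for u, v in edges:
        if part[u] != part[v]:
            boundary.update((u, v))
    return set(part) - boundary


# Chain R1 -> g1 -> g2 -> R2 -> g3 -> R3 (R* registers, g* gates).
edges = [("R1", "g1"), ("g1", "g2"), ("g2", "R2"), ("R2", "g3"), ("g3", "R3")]
is_register = {n: n.startswith("R") for n in ("R1", "g1", "g2", "R2", "g3", "R3")}
cut_at_input = {"R1": 0, "g1": 0, "g2": 0, "R2": 1, "g3": 1, "R3": 1}
cut_in_logic = {"R1": 0, "g1": 0, "g2": 1, "R2": 1, "g3": 1, "R3": 1}
print(cut_weight(edges, is_register, cut_at_input))  # 1
print(cut_weight(edges, is_register, cut_in_logic))  # 10
```

In this example the cut at the data input of R2 is ten times cheaper than the cut between the two gates, so a partitioner that minimizes cut weight prefers the former.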
The edge weights are enforced only on data I/O terminals (as opposed to all terminals) in order to avoid forcing artificial constraints on irrelevant I/O terminals, such as synchronization and scan-path I/O terminals. The Chaco partitioning tool is operated with different priorities assigned to the partitioning objectives. Experimentally, a balanced priority assignment between minimizing the total cut weight and maximizing the number of internal vertices is found to be sufficiently effective.
10.2.3 Register Insertion for Partitioning
As discussed in Section 10.2.2, partitioning can be performed on netlists at two different hierarchy levels. The application of partitioning on a register-transfer level netlist is simpler compared to its application on a gate-level netlist. On a partitioned register-transfer level netlist, a cut is assumed to pass through an arbitrary location on the cut local data path. The final register of the cut local data path is called a boundary register. Timing constraints for the local data paths where the boundary register is either the initial or the final register of the path, and the other register is within the same partition as the boundary register, are grouped into a partition LP problem. Timing
constraints of the local data paths between the boundary register and registers in other partitions are grouped into the top block LP problem. These LP problems constitute an integral part of the physical design flow depicted in Figure 10.6 on page 192.
When a gate-level netlist is used, the heuristic described in Section 10.2.2 is used to encourage cuts on the inputs of synchronous components. Unlike the treatment for a register-transfer level netlist, the final register of the local data path must be in the same partition as the cut local data path. This objective leads to registered-input, registered-output partitions, simplifying the timing analysis. The slight variation in the weight (or load) balance of the partitions is insignificant and eventually balances out, as the transfer of registers between partitions occurs in all directions. For instances where the partitioner validates a cut on a net that is between two combinational components, register insertion is used to satisfy the registered-input, registered-output scheme.
The number of inserted registers depends on the quality of the partitioner and the complexity of the design. In the performed experiments, the number of inserted registers has been observed to be directly proportional to the number of partitions. For a high number of partitions, the number of inserted registers can even exceed the number of original registers. This requires the partitioning step to be applied with caution in designs where die area is a scarce resource.
The registers inserted into the logic network in the register insertion step of the physical design flow can affect the functionality of the circuit. In order to preserve the functionality of the circuit, the inserted registers are selected as level-sensitive latches operating in their transparent phases of operation.
The propagation of the data signals on the inter-partition paths is not disrupted, as these signals are immediately propagated through the level-sensitive latches during the transparent phases. Constraints similar to the linearized timing constraints presented in Chapter 5 are used in this step in order to drive the inserted registers with proper clock delays and phases.
The general partitioning process is illustrated in Figure 10.7. In this figure, the dots represent registers and the lines represent data paths. The paths from partition (4,1) are shown. Note that only some of the registers and paths are shown. The data paths which are on a cut are identified and the timing constraints of these paths are included within the top block LP.
Fig. 10.7. Partitioning a circuit for timing analysis.
10.2.4 Clock Skew Scheduling of Partitions
In this section, the application of clock skew scheduling on the partitions generated by the timing-driven partitioner is described. A heuristic method is presented for this purpose. It is shown that this heuristic method, despite significantly simplifying the clock skew scheduling process, does not guarantee an optimal solution. The heuristic method is described explicitly for circuits synchronized by the rotary clock technology in this chapter; however, it can be applied generally to any synchronous circuit.
The heuristic method to solve the clock skew scheduling of partitions is as follows. Assume that there are $n$ partitions. The partition LP problems ($LP_1, LP_2, \ldots, LP_n$) are generated for these $n$ circuit partitions. Each partition $LP_i$ is solved (sequentially or in parallel) in order to compute the minimum clock period permitted by that partition. Note that the minimum clock periods of the partitions can be different. For proper operation of the circuit, all partitions must operate at the same clock period. A simple resolution to this issue follows from the fact that each partition can freely operate at any clock period higher than the minimum clock period computed for that particular partition LP problem. Consequently, the maximum of the minimum clock periods reported from each partition LP is selected as the principal clock period at which all the partitions are operable. This maximum value corresponds to the frequency at which at least one of the partitions is operating at its maximum frequency, while the rest of the partitions are operating at frequencies lower than their capacities.
After solving the partition LP problems, the maximum of the minimum clock periods computed for the partitions is used to further constrain the top block LP. Consequently, a constraint of the form
\[ T \geq \max(T_1, T_2, \ldots, T_n) \tag{10.3} \]
is added to the top block LP, where $T_1, T_2, \ldots, T_n$ denote the minimum clock periods computed for partitions $LP_1, LP_2, \ldots, LP_n$, respectively. If the top block LP problem is less constraining on the minimum clock period compared
to the partition LP problems (smaller minimum clock period), then the maximum of the minimum clock periods of the partition LP problems is assigned as the clock period of the top block. Otherwise, the top block LP problem determines the actual minimum operating clock period of the circuit (partitions and top block).
The top block LP problem is solved after the partition LP problems are solved, because the top block has the largest number of boundary vertices involved in its constraints. In fact, all boundary vertices appear in the constraints that make up the top block LP problem, whereas each partition LP problem has only a fraction of the boundary vertices in its constraints. The clock delays of all boundary vertices, as computed by each partition LP and the top block LP problems, must match in order to verify the validity of the computed minimum clock period. In order to match these clock delays of boundary vertices, the solutions computed for the top block LP problem are enforced on the partition LP problems with equalities such as
\[ t_{cd}^{i} = x_i, \tag{10.4} \]
where the clock delay computed for register $R_i$ in the top block LP problem is $x_i$ time units. If the partition LP problems return feasibility, the computation is complete.
There are two points to note here. First, note that the minimum clock periods computed for partition LP problems are lower limits on the minimum clock period of the complete circuit, as each partition LP problem is a subproblem of the original LP problem. The constraints that make up the subproblems are subsets of the constraints of the LP problem of the complete (original) circuit. As the solution of one of the subproblems (the top block LP problem in this case) is enforced on the remaining LP problems, the convex solution space of the original problem is not violated. Intuitively, therefore, if the presented heuristic method produces a feasible result, this result is optimal.
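The first steps of the heuristic (computing a minimum clock period for each partition and combining the results as in (10.3)) can be sketched as follows. The sketch makes several assumptions that are not taken from the text: it uses simplified edge-triggered constraints without setup, hold or clock-to-output times, and instead of an LP solver it substitutes a binary search on the clock period with a Bellman-Ford feasibility check on the resulting difference constraints. All function names are hypothetical, and the matching of boundary clock delays through (10.4) is not included.

```python
def feasible(registers, paths, period):
    """Return clock delays meeting all constraints at `period`, or None.

    paths: list of (initial, final, d_min, d_max) local data paths.
    Setup (simplified): t[initial] - t[final] <= period - d_max
    Hold  (simplified): t[final] - t[initial] <= d_min
    """
    # A constraint t[a] - t[b] <= w is an edge b -> a of weight w; the system
    # is feasible exactly when this graph has no negative cycle.
    edges = []
    for initial, final, d_min, d_max in paths:
        edges.append((final, initial, period - d_max))
        edges.append((initial, final, d_min))
    delay = {r: 0.0 for r in registers}  # a virtual source reaches all at 0
    for _ in range(len(delay)):
        changed = False
        for u, v, w in edges:
            if delay[u] + w < delay[v] - 1e-9:
                delay[v] = delay[u] + w
                changed = True
        if not changed:
            return delay
    return None  # still relaxing after |V| passes: negative cycle


def min_period(registers, paths, tol=1e-6):
    """Smallest feasible clock period (within tol), found by binary search."""
    # Zero skew is always feasible at the largest maximum path delay.
    lo, hi = 0.0, max(d_max for _, _, _, d_max in paths)
    while hi - lo > tol:
        mid = (lo + hi) / 2
        if feasible(registers, paths, mid) is None:
            lo = mid
        else:
            hi = mid
    return hi


def common_period(partitions, top_block):
    """Apply (10.3): the top block period is bounded below by max(T_1..T_n)."""
    t_parts = [min_period(regs, paths) for regs, paths in partitions]
    t_top = min_period(*top_block)
    return max(max(t_parts), t_top), t_parts


partitions = [
    (["A", "B"], [("A", "B", 2, 6), ("B", "A", 1, 2)]),
    (["C", "D"], [("C", "D", 1, 3)]),
]
top_block = (["B", "C"], [("B", "C", 2, 5)])
period, per_partition = common_period(partitions, top_block)
print(round(period, 3))  # 4.0
```

In this example the first partition limits the circuit (its two paths form a loop with a total maximum delay of 8 over two registers), while the second partition and the top block could run faster on their own.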
The second point to note is the fact that the presented heuristic method does not guarantee a feasible solution. The percentage (65%) of ISCAS’89 benchmark circuits for which the presented heuristic method is feasible is shown in Section 11.5. The following alternative approaches are proposed for cases where the presented heuristic method is not feasible:
• Reiteration: The infeasibility diagnostics of an LP solver can be used to resolve the infeasibility problem by changing one or more clock delays that appear in a contradictory constraint. Even if no infeasibility information is available, iterations can be performed on the infeasible subproblems to search for a feasible answer. The clock delays whose values are changed from the optimal solution of the top block LP are tracked such that the feasibility of the remaining LP problems is not violated. Iterations are performed either until a feasible solution is found or until a time limit is reached.
• Constraining boundary vertices: As an alternative procedure, the clock delays of all boundary registers can be fixed to a particular value. Synchronous circuits are typically built to operate at zero clock skew; therefore, constraining the clock delays of the boundary vertices to a particular value will guarantee proper circuit operation. The minimum operational clock period for the restricted circuit will be larger than or equal to the minimum clock period of the original circuit due to the additional constraints on the convex solution space. As discussed in Section 9.3, a similar clock delay restriction procedure is applied to the timing of Intellectual Property (IP) blocks. In the experiments performed on ISCAS’89 benchmark circuits for IP blocks, restricting the clock delays of boundary vertices reduces the 27% improvement of conventional clock skew scheduling to 24%.
• Delay padding: Implementation of clock skew scheduling requires modification of the clock distribution network. If the designers can modify the logic network as well as the clock distribution network, the infeasibility of one or more partition LP problems can be mitigated by delay padding. In this alternative procedure, the data propagation delays of all paths of the infeasible LP problems are formulated as variables. To this end, the minimum $D_{Pm}^{i,f}$ and the maximum $D_{PM}^{i,f}$ data propagation delays of a local data path can be formulated with additional slack variables $S_{m}^{if}$ and $S_{M}^{if}$, respectively, specific to each local data path. The summation of all the slack variables is added to the minimization type objective function ($\min T_{CP}$). The coefficient of the minimum clock period $T_{CP}$ is increased as appropriate in the LP formulation in order to increase the priority of minimizing $T_{CP}$ in optimization. In the LP problem solution, the non-zero slack values reported on each local data path are the amounts of delay that must be inserted into the logic paths. The clock delays of the boundary registers must be fixed prior to the solution, such that the solutions of the remaining LP problems are not violated.
The practical concerns of delay padding discussed in Chapter 8 are also valid for this alternative procedure.
10.2.5 Timing-Driven Register Placement
A register placement methodology is presented for the physical design of circuits synchronized with the rotary clocking technology. In this methodology, designated areas for register placement are reserved underneath the ROA rings. Highly populated register banks are stacked inside these designated regions, available for use with the full spectrum of clock phases. Upon synthesis of the circuit and the computation of optimal clock phases, each register in the synthesized netlist is physically mapped to a register underneath the ROA ring. To complete the placement step, the synthesized blocks of combinational circuitry are distributed in the free space inside the region, outside the designated areas.
In Figure 10.8, an ROA ring of a typical circuit designed with a 0.13 μm technology on a 2 mm × 2 mm circuit die is illustrated. Note that the figure is not drawn to scale. The die area is evenly divided into 16 regions in a four by four setting, each of which is synchronized with an ROA ring. The dimensions of each ROA ring are 500 μm by 500 μm. Assuming a single row of registers is placed underneath each ring, the maximum number of registers that are realizable on this die can easily be obtained using the dimensions of a typical register. In the 0.13 μm technology, the size of a register is considered to be 4 μm by 4 μm, with a minimal spacing of 2 μm between two instances. Therefore, there is enough space to place approximately 80 registers on each ROA ring edge [(500+2)/(4+2) ≈ 83, rounded down here to 80]. For 4 sides of an ROA ring and 16 rings, a total of 5120 registers are available for mapping against the synthesized logic. This number is adequate for most state-of-the-art digital circuit designs of similar die size. The dimensions of the designated area for register placement and the number of register bank rows are the determining factors for the number of registers in a design, which can be altered for particular design budget requirements. Availability of registers in the register bank enables a good distribution and mapping of clock phases.
[Figure: ROA ring layout with dimension labels 500 μm, 500 μm, 4 μm and 2 μm.]
Fig. 10.8. An ROA ring in a chip layout illustrated in 0.13 μm technology.
The register placement methodology is discussed to demonstrate a viable mechanism to deliver the required clock delays to registers. The described heuristic implementation of the register placement methodology is presented only as a proof of concept, and does not negatively or positively impact the synchronization principles with non-zero clock skew. Alternative methods of placement and routing for rotary timing synchronization have been offered in [173, 174] and can also be followed in the physical design flow.
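The register capacity estimate can be reproduced with a few lines of arithmetic. In the sketch below, the function names and default arguments are illustrative assumptions. Note that the expression (500+2)/(4+2) yields 83 registers per edge when rounded down, whereas the total of 5120 registers quoted in the text follows from the rounder figure of 80 registers per edge.

```python
def registers_per_edge(edge_um, register_um, spacing_um):
    """Registers fitting in one row along a ring edge (pitch = size + spacing)."""
    return (edge_um + spacing_um) // (register_um + spacing_um)


def register_capacity(per_edge, edges_per_ring=4, rings=16, rows=1):
    """Total number of register sites on the die."""
    return per_edge * edges_per_ring * rings * rows


print(registers_per_edge(500, 4, 2))  # 83
print(register_capacity(80))          # 5120
print(register_capacity(83))          # 5312
```

Increasing `rows` models additional register bank rows underneath each ring, which scales the capacity proportionally.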
10.3 Parallelization of Clock Skew Scheduling
The popularity of personal computers in the consumer market over the last few decades has significantly lowered the costs of computing systems. Consequently, the costs associated with setting up a distributed computing system have become relatively affordable. Processes that were previously intractable or considered costly can be executed on a cluster of standard computing systems.
Xgrid [183] is distributed computing software provided by Apple Computers Inc. that permits the operation of a cluster of popular desktop machines as a supercomputer. The Xgrid system aggregates an ad hoc network of Macintosh desktop computers into a multi-agent computing cluster, where each agent is called a computation grid. Xgrid is typically beneficial for highly parallelizable problems that can be broken up into smaller pieces, with each piece executed separately and relatively independently of the others. One of the computers in the cluster is set up as the client for Xgrid and other computers are used as distributed agents. The Xgrid software is installed on all computers, enabling the agents to perform grid calculation. Agents can be configured to process Xgrid computations either only when idle or as the master task. The Xgrid software is run with a controller, which regulates the assignment of computing processes to grids and manages the outputs as they are returned to the server. Xgrid software serves as a simple distributed computing infrastructure and, unlike typical Message Passing Interface (MPI) [184] systems, does not support message passing between independent agents.
The parallelization of the application of clock skew scheduling is implemented for the Xgrid distributed computing system. The LP problems for each partition are submitted as individual tasks to the Xgrid computing cluster and solved simultaneously on specific agents.
The resulting system not only exhibits the previously described advantages of a parallel execution scheme for clock skew scheduling, it also exemplifies the implementation of a complex VLSI design application on the Xgrid software architecture.
The computing cluster is constructed with eight PowerMac computers with dual G5 1.8 GHz microprocessors and 3 GB RAM running Mac OS X 10.3.8. The cluster has one dedicated client, one dedicated controller and six distributed computing agents. The agents are configured to process Xgrid tasks as the master task. Only one of the processors on each computer is used in experimentation. This grid computing cluster setup is illustrated in Figure 10.9.
In order to effectively harness the distributed computing potential, the benchmark circuits are partitioned into four partitions. Note that four partitions emulate a 2x2 grid clock distribution for the rotary clocking technology. The analysis of a 3x3 or a larger grid size is possible; however, perfect parallelization for such grid sizes cannot be achieved with six distributed agents.
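The dispatch of independent partition problems to a fixed pool of agents can be imitated on a single machine. The sketch below is an analogy only: it uses a Python thread pool as a stand-in for the Xgrid agents, `solve` is a hypothetical placeholder for a partition LP solver, and `rounds_needed` assumes that all tasks take equal time and that each agent handles one task at a time.

```python
import math
from concurrent.futures import ThreadPoolExecutor


def solve_partitions(partition_problems, solve, agents=6):
    """Solve independent partition problems concurrently, one task each."""
    with ThreadPoolExecutor(max_workers=agents) as pool:
        # map() returns the results in the order the problems were given.
        return list(pool.map(solve, partition_problems))


def rounds_needed(tasks, agents=6):
    """Scheduling rounds needed when each agent runs one task per round."""
    return math.ceil(tasks / agents)


print(rounds_needed(4))  # 1: a 2x2 grid fits on six agents in one round
print(rounds_needed(9))  # 2: a 3x3 grid leaves three agents idle in round two
print(solve_partitions([1, 2, 3, 4], lambda problem: problem * problem))
```

Under these assumptions, four partitions complete in a single round on six agents, whereas nine partitions require a second round in which half of the agents are idle.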
Fig. 10.9. Xgrid computing cluster. (Diagram: the client is linked to the controller, which is linked to Agents 1 through 6.)
10.3.1 Speedup of Computation

The primary advantage gained from the parallelization of the application of clock skew scheduling is the speedup in computation time. The speedup is gained not only from parallelization but also from partitioning the LP problems (smaller size LPs are generated). The following simple and intuitive formula is used to compute the speedups achieved through partitioning and parallelization:

$$\text{Speedup} = \frac{\text{Run time of physical design (PD) flow without partitioning}}{\text{Run time of PD flow with partitioning and parallelization}}. \quad (10.5)$$

In a distributed computing environment, communication overhead is often of concern. In the Xgrid environment, because of the simple (practically nonexistent) interface between independent computing agents, the communication overhead is reduced to a minimum. The bulk of the required communication occurs in distributing the tasks to agents.
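As a small illustration of (10.5), the ratio can be computed as follows. The run times used here are hypothetical placeholders chosen for the example, not measurements from the experiments.

```python
def speedup(runtime_unpartitioned: float, runtime_partitioned_parallel: float) -> float:
    """Speedup as defined in (10.5): unpartitioned PD flow run time over the
    run time of the PD flow with partitioning and parallelization."""
    if runtime_unpartitioned < 0 or runtime_partitioned_parallel <= 0:
        raise ValueError("run times must be positive")
    return runtime_unpartitioned / runtime_partitioned_parallel


# Hypothetical run times in seconds (illustrative only).
example_speedup = speedup(120.0, 30.0)
```

Because both run times are measured end to end, the ratio already includes any task distribution overhead incurred by the parallel flow.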
10.4 Summary

Non-zero clock skew scheduling is proposed as a clock distribution network design and improvement methodology in the conventional VLSI design flow. The conventional design flow, optimized to generate a zero clock skew synchronous system, is used to implement a system, which is then improved for a shorter clock period or maximized safety as explained in earlier chapters. This chapter demonstrates how a next-generation clocking technology, resonant rotary clocking, not only inherently provides but also necessitates the use of non-zero clock skew design principles. The resonant rotary clocking technology is reviewed, and the timing requirements that provide and necessitate the use of non-zero clock skew principles are described. A cluster-based parallel clock scheduling methodology is described within the context of rotary clocking. Possible extensions to a general approach for the parallelization of clock skew scheduling are outlined. A physical design flow that incorporates the proposed parallel clock skew scheduling step with placement and routing steps is presented, which constitutes a proof of concept for the advocated design methodologies. The physical design flow for rotary-synchronized circuits is an ongoing research field and more recent approaches are available in the literature [173, 174].
11 Experimental Results
The results of the various clock skew scheduling methodologies described in this research monograph are presented in this chapter. Results of each application are presented in dedicated sections for a thorough analysis and simplicity in presentation. For comparison of results, identical experimental setups are used where possible and publicly available ISCAS’89 benchmark circuits are used as test subjects. The presented results include explanations of the experimental setups for replicability, detailed reports (in tabular form) of experiment runs including circuit statistics, reported improvements and runtimes, and an interpretation of the results for observed trends or deviations from norms. Specifically, experimental results for the following methodologies are presented. In Section 11.1, the results for the clock skew scheduling of level-sensitive circuits are shown. These results also include edge-triggered circuit implementation results (presented earlier in Chapter 5) for side-by-side comparison. In Section 11.2, the level-sensitive circuit results are expanded to multi-phase clock synchronization. The effects of multi-phase clocking are interpreted for best synchronization practices, which is particularly useful for rotary clock synchronized circuits. In Section 11.3, the performance of the quadratic programming (QP) formulation proposed for maximizing safety against variations is shown. In Section 11.4, the improvement of clock skew scheduling results for edge-triggered and level-sensitive circuits by applying the delay insertion method is shown. In Section 11.5, preliminary results for the proof-of-concept physical design methodology for rotary-clock-synchronized circuits are shown.
11.1 Clock Skew Scheduling of Level-Sensitive Circuits

The generic LP model shown in Table 6.2 (page 107) is used in the problem formulation of clock skew scheduling application on level-sensitive circuits. The commercial optimization package CPLEX (v7.5) [115] is used to solve for these clock period minimization problems of the generated level-sensitive (ISCAS’89
benchmark) circuits. In experiments, the primal and dual simplex optimizers of CPLEX are used. Worst-case analysis shows that the simplex method and its variants may require an exponential number of steps to reach an optimal solution. However, extensive practice has confirmed that in most cases, the number of iterations to reach an optimal solution is polynomial [114]. In the presented experiments, the LP clock skew scheduling formulations of all benchmark circuits are solved in reasonable runtimes with CPLEX. Consider the LP formulation in Table 6.2. The number of problem constraints $m$ is proportional to the number of registers $r$ and the number of local data paths $p$ in the circuit. Let $s$ denote the number of input registers for which the initialization constraints are defined. In Table 6.2, there are eight (8) constraints for each register, two (2) constraints for each local data path and one (1) constraint for each input register. Thus, the number of constraints in the problem formulation is $m = 8r + 2p + s$. The minimum clock period $T_{CP}$ is a problem variable. Also, there are five (5) problem variables defined for each register, leading to a total of $n = 5r + 1$ variables in the problem formulation.

11.1.1 Experimental Results on ISCAS’89 Benchmark Circuits

The original ISCAS’89 benchmark circuits are edge-sensitive synchronous circuits without any timing information. The timing information for the benchmark circuits is generated explicitly with an algorithm, where the type, size and fan-out of a gate are included in the computed combinational gate delay. Level-sensitive implementations of the ISCAS’89 benchmark circuits are generated by replacing each flip-flop in the original benchmark circuit with a level-sensitive latch. In experimentation, a single-phase clock signal with a duty cycle of 50% (Figure 6.5 on page 109) is selected.
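The size of the LP formulation of Table 6.2, $m = 8r + 2p + s$ constraints and $n = 5r + 1$ variables, can be tabulated with a short sketch. The function name is hypothetical. The example uses the register and path counts quoted for s938 in the caption of Figure 11.1 (32 registers, 496 data paths); the number of input registers $s$ is not given in this section, so the value below is an assumed placeholder.

```python
def lp_size(r: int, p: int, s: int) -> tuple[int, int]:
    """Return (m, n): constraint and variable counts of the LP in Table 6.2.

    r: number of registers, p: number of local data paths,
    s: number of input registers with initialization constraints.
    """
    m = 8 * r + 2 * p + s   # 8 per register, 2 per data path, 1 per input register
    n = 5 * r + 1           # 5 variables per register plus the clock period T_CP
    return m, n


# s938: r and p from the caption of Figure 11.1; s = 2 is an assumed value.
m_s938, n_s938 = lp_size(r=32, p=496, s=2)
```

The counts grow linearly in the number of registers and data paths, which is consistent with the runtimes in Table 11.1 rising for the largest benchmark circuits.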
Without affecting the generality of the solution, the setup and hold times and the internal delays are assumed to be zero ($\delta_S^{L_i} = \delta_H^{L_i} = D_{CQ}^{L_i} = D_{DQ}^{L_i} = 0$). The consideration of these numeric constants in an actual problem is straightforward. In experimentation, edge-sensitive and level-sensitive synchronous circuit implementations are analyzed for zero and non-zero clock skew scheduling applications. The effects of time borrowing and clock skew scheduling in circuit implementation are investigated. The results of the analyses, computed on a 440MHz Sun Ultra-10 workstation, are presented in Table 11.1. For each circuit, the following data are listed: the circuit name, the clock periods $T_{FF}^{noskew}$ for a zero skew circuit with flip-flops, $T_L^{noskew}$ for a zero skew circuit with latches, $T_{FF}^{CSS}$ for a non-zero skew circuit with flip-flops, $T_L^{CSS}$ for a non-zero skew circuit with latches, and $T_L^r$ for a non-zero skew circuit where the clock delays to I/O registers are restricted to be equal. The subscripts $FF$, $L$ represent circuit topologies for flip-flop based and latch-based circuits, respectively. The superscripts $noskew$, $CSS$ indicate zero or non-zero clock skew scheduling. Also listed are the calculation time $t_L^{CSS}$ of $T_L^{CSS}$, and the clock period improvements $I_L^{TB}$, $I_{FF}^{CSS}$ and $I_L^{TBCS}$, where the superscripts $TB$, $CSS$, $TBCS$ stand for time borrowing, clock skew scheduling and both, respectively.

Table 11.1. Clock skew scheduling results for level-sensitive ISCAS’89 circuits.

| Circuit | $T_{FF}^{noskew}$ | $T_L^{noskew}$ | $I_L^{TB}$ (%) | $T_{FF}^{CSS}$ | $T_L^{CSS}$ | $I_{FF}^{CSS}$ (%) | $I_L^{TBCS}$ (%) | $I_L^{CSS}$ (%) | $t_L^{CSS}$ (sec) | $T_L^r$ | $I_L^r$ (%) |
|---|---|---|---|---|---|---|---|---|---|---|---|
| s27 | 6.6 | 5.4 | 18 | 4.1 | 4.1 | 38 | 38 | 24 | 0.02 | 4.1 | 38 |
| s208.1 | 12.4 | 8.6 | 31 | 4.9 | 5.2 | 60 | 58 | 40 | 0.01 | 7.6 | 39 |
| s298 | 13.0 | 10.6 | 18 | 9.4 | 9.4 | 28 | 28 | 11 | 0.02 | 10.6 | 18 |
| s344 | 27.0 | 18.4 | 32 | 18.4 | 18.4 | 32 | 32 | 0 | 0.03 | 18.4 | 32 |
| s349 | 27.0 | 18.4 | 32 | 18.4 | 18.4 | 32 | 32 | 0 | 0.03 | 18.4 | 32 |
| s382 | 14.2 | 10.3 | 27 | 8.5 | 8.5 | 40 | 40 | 17 | 0.04 | 8.7 | 39 |
| s386 | 17.8 | 17.3 | 3 | 17.3 | 17.3 | 3 | 3 | 0 | 0.03 | 17.3 | 3 |
| s400 | 14.2 | 10.4 | 27 | 8.6 | 8.6 | 39 | 39 | 17 | 0.05 | 8.8 | 38 |
| s420.1 | 16.4 | 12.6 | 23 | 6.8 | 7.2 | 59 | 56 | 43 | 0.04 | 10.3 | 37 |
| s444 | 16.8 | 12.4 | 26 | 9.9 | 9.9 | 41 | 41 | 20 | 0.07 | 9.9 | 41 |
| s510 | 16.8 | 14.8 | 12 | 14.8 | 14.3 | 12 | 15 | 3 | 0.02 | 14.8 | 12 |
| s526 | 13.0 | 10.6 | 18 | 9.4 | 9.4 | 28 | 28 | 11 | 0.05 | 10.6 | 18 |
| s526n | 13.0 | 10.6 | 18 | 9.4 | 9.4 | 28 | 28 | 11 | 0.05 | 10.6 | 18 |
| s641 | 83.6 | 66.2 | 21 | 61.9 | 61.9 | 26 | 26 | 6 | 0.05 | 63.1 | 25 |
| s713 | 89.2 | 71.2 | 20 | 63.8 | 63.8 | 28 | 28 | 10 | 0.05 | 65.0 | 27 |
| s820 | 18.6 | 18.3 | 2 | 18.3 | 18.3 | 2 | 2 | 0 | 0.01 | 18.3 | 2 |
| s832 | 19.0 | 18.8 | 1 | 18.8 | 18.8 | 1 | 1 | 0 | 0.01 | 18.8 | 1 |
| s838.1 | 24.4 | 20.6 | 16 | 8.3 | 9.1 | 66 | 63 | 56 | 0.28 | 15.6 | 36 |
| s938 | 24.4 | 20.6 | 16 | 8.3 | 9.1 | 66 | 63 | 56 | 0.31 | 15.6 | 36 |
| s953 | 23.2 | 21.2 | 9 | 18.3 | 18.3 | 21 | 21 | 14 | 0.10 | 21.2 | 9 |
| s967 | 20.6 | 17.9 | 13 | 16.2 | 16.6 | 21 | 19 | 7 | 0.08 | 17.9 | 13 |
| s991 | 96.4 | 91.6 | 5 | 79.4 | 79.4 | 18 | 18 | 13 | 0.02 | 79.4 | 18 |
| s1196 | 20.8 | 16.0 | 23 | 10.8 | 7.8 | 48 | 63 | 51 | 0.03 | 16.0 | 23 |
| s1238 | 20.8 | 16.0 | 23 | 10.8 | 7.8 | 48 | 63 | 51 | 0.01 | 16.0 | 23 |
| s1423 | 92.2 | 86.4 | 6 | 77.4 | 75.8 | 16 | 18 | 12 | 1.10 | 75.8 | 18 |
| s1488 | 32.2 | 29.0 | 10 | 29.0 | 29.0 | 10 | 10 | 0 | 0.02 | 29.0 | 10 |
| s1494 | 32.8 | 29.6 | 10 | 29.6 | 29.6 | 10 | 10 | 0 | 0.01 | 29.6 | 10 |
| s1512 | 39.6 | 34.8 | 12 | 34.8 | 34.8 | 12 | 12 | 0 | 0.28 | 34.8 | 12 |
| s3271 | 40.3 | 29.8 | 26 | 28.6 | 28.6 | 29 | 29 | 4 | 0.69 | 29.0 | 28 |
| s3330 | 34.8 | 23.4 | 33 | 17.8 | 17.8 | 49 | 49 | 24 | 0.49 | 23.2 | 33 |
| s3384 | 85.2 | 77.4 | 9 | 67.4 | 67.4 | 21 | 21 | 13 | 1.88 | 76.2 | 11 |
| s4863 | 81.2 | 75.4 | 7 | 69.0 | 69.0 | 15 | 15 | 8 | 0.64 | 69.0 | 15 |
| s5378 | 28.4 | 23.2 | 18 | 22.0 | 22.0 | 23 | 23 | 5 | 1.66 | 22.0 | 23 |
| s6669 | 128.6 | 124.6 | 3 | 109.8 | 109.8 | 15 | 15 | 12 | 3.62 | 109.8 | 15 |
| s9234 | 75.8 | 64.8 | 15 | 54.2 | 54.2 | 28 | 28 | 16 | 4.59 | 59.2 | 22 |
| s9234.1 | 75.8 | 64.8 | 15 | 54.2 | 54.2 | 28 | 28 | 16 | 3.88 | 59.2 | 22 |
| s13207 | 85.6 | 67.4 | 21 | 57.1 | 57.1 | 33 | 33 | 15 | 14.86 | 57.1 | 33 |
| s15850 | 116.0 | 92.8 | 20 | 83.6 | 83.6 | 28 | 28 | 10 | 76.96 | 83.6 | 28 |
| s15850.1 | 81.2 | 71.4 | 12 | 57.4 | 57.4 | 29 | 29 | 20 | 58.89 | 57.4 | 29 |
| s35932 | 34.2 | 34.1 | 0 | 20.4 | 20.4 | 40 | 40 | 40 | 80.03 | 20.4 | 40 |
| s38417 | 69.0 | 54.8 | 21 | 42.2 | 42.2 | 39 | 39 | 23 | 603.49 | 43.0 | 39 |
| s38584 | 94.2 | 76.4 | 19 | 65.2 | 65.2 | 31 | 31 | 16 | 321.74 | 64.8 | 31 |
| Average | | | 15 | | | 30 | 27 | 14 | | | 24 |

The minimum clock periods calculated for the edge-sensitive synchronous circuits under zero and non-zero clock skew scheduling ($T_{FF}^{noskew}$ and $T_{FF}^{CSS}$, respectively) suggest an average improvement of 30% in the minimum clock period for the ISCAS’89 benchmark circuits. The minimum clock periods calculated for the level-sensitive synchronous circuits ($T_L^{noskew}$ and $T_L^{CSS}$) suggest an average improvement of 27% in the minimum clock period. Below,
the clock period improvements for the level-sensitive latches are examined in detail.

The experimental results shown in Table 11.1 demonstrate that utilizing latches as storage elements instead of flip-flops may result in up to 30% improvement of the minimum clock period under zero clock skew (for single-phase, 50% duty cycle clock synchronization). On the ISCAS’89 benchmark circuits, an average improvement of 15% is observed when the flip-flops are replaced by latches (under zero clock skew). This level of improvement is solely due to time borrowing. Utilizing non-zero clock skew, an even higher improvement is possible. Improvements of up to 63% over flip-flop based synchronous circuits with zero clock skew are observed. The average improvement in the minimum clock period for the ISCAS’89 benchmark circuits is 27%. This level of improvement is due to the simultaneous application of clock skew scheduling and consideration of time borrowing. Out of this 27% improvement for non-zero clock skew, level-sensitive circuits, the improvement due to time borrowing is 15% and the improvement due to clock skew scheduling is 14%.

It is interesting to note that the improvements achieved through time borrowing and clock skew scheduling are not additive. Time borrowing and clock skew scheduling target the same resource in performance improvement, the slack propagation time on local data paths. There is a limited amount of slack propagation time on the critical paths, and a circuit where time borrowing is abundantly realized cannot benefit as much from clock skew scheduling. It has been shown, however, that even though time borrowing and clock skew scheduling are competing effects (competing for the same resource), dramatically shorter clock periods are achievable through the collaboration of both effects.
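The non-additivity can be checked numerically on a single row of Table 11.1. The sketch below assumes that each improvement column is the percent reduction of the clock period relative to its reference design, a definition inferred from the tabulated values rather than stated explicitly in the text. The helper names are hypothetical.

```python
def improvement(t_ref: float, t_new: float) -> float:
    """Percent reduction of the minimum clock period from t_ref to t_new."""
    return 100.0 * (t_ref - t_new) / t_ref


def compose(i_first: float, i_second: float) -> float:
    """Overall percent improvement of two successive relative improvements."""
    return 100.0 * (1.0 - (1.0 - i_first / 100.0) * (1.0 - i_second / 100.0))


# s938 row of Table 11.1: flip-flops with zero skew, latches with zero skew,
# and latches with non-zero skew.
T_FF_NOSKEW, T_L_NOSKEW, T_L_CSS = 24.4, 20.6, 9.1

i_tb = improvement(T_FF_NOSKEW, T_L_NOSKEW)    # time borrowing alone
i_css = improvement(T_L_NOSKEW, T_L_CSS)       # skew scheduling on top of latches
i_tbcs = improvement(T_FF_NOSKEW, T_L_CSS)     # both effects together
```

For one circuit the two improvements compose multiplicatively, so `compose(i_tb, i_css)` reproduces `i_tbcs` while the plain sum overshoots it. Applying the same composition to the reported averages (15% and 14%) gives roughly 27%, although averages over circuits need not compose exactly.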
It is also important to note that, although non-zero clock skew, edge-triggered circuits benefit more from improvements (30%) on average, non-zero clock skew, level-sensitive circuits lead to superior improvements for some of the circuits. Furthermore, the smaller size of level-sensitive latches compared to edge-triggered flip-flops is often highly desirable. Thus, the use of level-sensitive latches as register elements in synchronous circuits where clock skew scheduling is applied is advantageous for area savings (with a minor sacrifice in operating speed), and sometimes superior to edge-triggered circuits in both area and operating speed.

11.1.2 Verification and Interpretation of Results

Some edge-sensitive synchronous circuits are inoperable with level-sensitive latches due to design. For such circuits, the clock skew scheduling problem is infeasible. The presented timing analysis procedure detects the infeasibility of such a problem and provides diagnostic messages. The slack and excess values associated with each constraint can be examined in the sensitivity analysis output provided by an LP solver. Careful interpretation of the sensitivity output leads to the identification of the necessary modifications on the circuit topology to achieve the desired operating frequency. The interpretation of the timing schedule for a synchronous circuit presents a model to investigate the effects of zero and non-zero clock skew scheduling on synchronous circuit operation. In the rest of this section, the timing schedules generated for the synchronization of the ISCAS’89 benchmark circuit s938 with zero and non-zero clock skew scheduling are analyzed. The analyses include the data distributions for various parameters, which are presented in Section 11.1.3. The verification of clock skew values is discussed in Section 11.1.4. Also in Section 11.1.4, lower and upper bounds on clock skew are derived.

11.1.3 Parameter Data Distributions

In Section 4.7.1, data propagation time $D_P^{if}$ is defined as the period of time the data is processed in the combinational logic block of a local data path $R_i \leadsto R_f$. Without loss of generality, an empirical calculation method is used to calculate the data propagation times of each local data path of a circuit. The distribution of the calculated data propagation times for the ISCAS’89 benchmark circuit s938 is illustrated in Figure 11.1. In this figure, the height of each bar corresponds to the number of paths within a given delay range. For example, there are nine (9) paths with delays between 4 and 5 time units.

Fig. 11.1. Data propagation times for s938 with 32 registers and 496 data paths. (Histogram: number of paths versus propagation delay $D_P$ in time units.)

Effective path delay $\hat{D}_P^{i,f}$ [96] is defined as the time period between the departure of the data signal from the initial register $R_i$ and the arrival of the same data signal at the final register $R_f$. The effective path delay of a local data path differs from data propagation delay because of the additional propagation time provided by clock skew and the time borrowing property of level-sensitive synchronous circuits.
Note that in level-sensitive synchronous circuits, the effective path delay is defined within a permissible range instead of a fixed value, as the arrival and departure times are indeterminate.
Fig. 11.2. Maximum effective path delays in data paths of s938 for zero clock skew. (Histogram: number of paths versus maximum effective path delay in time units.)
The nominal effective path delay is determined when the arrival and departure times are realized in run-time as certain values in the permissible ranges $[a_f, A_f]$ and $[d_i, D_i]$, respectively. Specifically, the shortest effective path delay occurs when the data signal departs at its latest time $D_i$ from the initial register $R_i$ and arrives at its earliest arrival time $a_f$ at the final register $R_f$. The longest effective path delay is realized by the earliest departure $d_i$ of the data signal from $R_i$ and latest arrival $A_f$ at $R_f$. Hence, the interval for the effective path delay of level-sensitive synchronous circuits can be defined as:

$$a_f - D_i - T_{Skew}(i,f) + T_{CP} \le \hat{D}_P^{i,f} \le A_f - d_i - T_{Skew}(i,f) + T_{CP}. \quad (11.1)$$
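The interval in (11.1) can be evaluated directly. The sketch below is a plain transcription of the two bounds; the function name and the numeric values in the example are hypothetical and are not taken from the s938 schedule.

```python
def effective_delay_interval(a_f: float, A_f: float, d_i: float, D_i: float,
                             t_skew: float, t_cp: float) -> tuple[float, float]:
    """Return (shortest, longest) effective path delay according to (11.1).

    [a_f, A_f]: earliest/latest arrival at the final register R_f.
    [d_i, D_i]: earliest/latest departure from the initial register R_i.
    """
    shortest = a_f - D_i - t_skew + t_cp   # latest departure, earliest arrival
    longest = A_f - d_i - t_skew + t_cp    # earliest departure, latest arrival
    return shortest, longest


# Hypothetical timing values in time units (illustrative only).
shortest, longest = effective_delay_interval(
    a_f=2.0, A_f=6.0, d_i=1.0, D_i=3.0, t_skew=-1.5, t_cp=10.0
)
```

With a negative skew, as in this example, both ends of the interval move up, which is the extra propagation time that non-zero clock skew provides.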
In this work, the longest effective path delay is investigated in order to illustrate the effects of clock skew and time borrowing on data propagation. The aim is to observe the increase in the effective path delay of a circuit, which in turn leads to a higher operating frequency. This increase in operating frequency is obtained by replacing flip-flops with latches and introducing non-zero clock skew. Observe that the distribution of the propagation delays for the s938 benchmark circuit presented in Figure 11.1 is exactly the same as the distribution of the effective path delay of the same benchmark circuit s938, when operational with flip-flops (under zero clock skew). In circuits with flip-flops, the effective path delays are determinate ($D_P^{if} - T_{Skew}(i,f)$), as the data departures occur at the active transition of the clock signal. The distribution of the maximum effective path delays of the level-sensitive s938 circuit with zero clock skew scheduling is shown in Figure 11.2. Note that the maximum effective path delay is calculated by the expression $[A_f - d_i - T_{Skew}(i,f) + T_{CP}]$. The target clock period is $T_{CP} = 20.6$. The height of each bar corresponds to the number of paths with an effective path delay within a given range. It is observed by comparing Figures 11.1 and 11.2 that the maximum effective path delays are increased in the level-sensitive circuit, which also provides a smaller minimum clock period ($T_{FF}^{noskew} = 24.4$ vs. $T_L^{noskew} = 20.6$). The increase in the effective path delays is due to time borrowing. Accumulation of effective path delay values slightly below or above the minimum operating clock period $T_{CP} = 20.6$ is visible. Note that the effective path delay having larger values than the minimum clock period is a sufficient but not a necessary condition for time borrowing. Thus, local data paths where the effective path delay is calculated to be smaller than $T_{CP} = 20.6$ may still benefit from time borrowing. Furthermore, it can be observed that certain data paths in the circuit benefit more from time borrowing, realizing an effective path delay close to the theoretical limit of $T_{CP} + C_W^L - T_{Skew}(i,f)$.

11.1.4 Skew Analysis

As discussed throughout this monograph, non-zero clock skew scheduling in synchronous circuits permits smaller clock periods. Note that in the presence of non-zero clock skew, the effective path delay for the data signal over a data path most likely gets smaller compared to its value observed in zero clock skew scheduling. This fact follows from (11.1) ($T_{CP}$ gets smaller). However, as the minimum clock period $T_{CP}$ gets smaller, the percentage of the data paths on which the effective path delay exceeds the minimum clock period significantly increases (see Figure 11.3). The target clock period is $T_{CP} = 9.09$. The height of each bar corresponds to the number of paths with an effective path delay within a given range. The effect of clock skew on improving the minimum clock period is visible by comparing the histograms presented in Figures 11.2 and 11.3.

Fig. 11.3. Maximum effective path delays for s938 for non-zero clock skew. (Histogram: number of paths versus maximum effective path delay in time units.)

In order to generate an expression for the upper bound on clock skew, express the following condition:

$$D_i + D_{PM}^{i,f} - T_{CP} + T_{Skew}(i,f) \le T_{CP} - \delta_S^{L_f}. \quad (11.2)$$
In (11.2), the earliest possible time is assigned to $D_i$ in order to realize the upper bound on clock skew. The earliest possible time that a data signal departs from a latch is $D_{CQ}^L$ later than the leading edge of the clock signal, $T_{CP} - C_W^L + D_{CQ}^L$. Reordering the expression gives the upper bound on clock skew:

$$T_{Skew}(i,f) \le T_{CP} + C_W^L - D_{PM}^{i,f} - D_{CQ}^L - \delta_S^{L_f}. \quad (11.3)$$

The lower bound on the clock skew is derived similarly, which leads to:

$$a_f + D_{Pm}^{i,f} \ge T_{CP} - T_{Skew}(i,f) + \delta_H^{L_f}. \quad (11.4)$$

In order to derive the lower bound, the data arrival time at $R_f$ must be considered to occur at its latest possible time. The latest data arrival time is the setup time $\delta_S^{L_f}$ earlier than the trailing edge of the clock signal, $T_{CP} - \delta_S^{L_f}$. Thus, the lower bound on the clock skew is:

$$T_{Skew}(i,f) \ge T_{CP} - T_{CP} - D_{Pm}^{i,f} + \delta_S^{L_f} + \delta_H^{L_f}. \quad (11.5)$$

Combining (11.3) and (11.5), the theoretical limits on clock skew are expressed as follows:

$$-D_{Pm}^{i,f} + \delta_S^{L_f} + \delta_H^{L_f} \le T_{Skew}(i,f) \le T_{CP} + C_W^L - D_{PM}^{i,f} - D_{CQ}^L - \delta_S^{L_f}. \quad (11.6)$$

Recall that in experimentation, the parameters $D_{DQ}^L$, $D_{CQ}^L$, $\delta_S^{L_f}$, $\delta_H^{L_f}$ are considered zero and a 50% duty cycle is selected for the single-phase synchronization clock signal. In order to evaluate the upper and lower bounds on clock skew in this simplified case, the parameters are substituted in (11.6):

$$-D_{Pm}^{i,f} \le T_{Skew}(i,f) \le 1.5\,T_{CP} - D_{PM}^{i,f}. \quad (11.7)$$
Specifically on the ISCAS’89 benchmark circuit s938, the clock skew bounds are verified using the experimental values shown in Figure 11.1. For the benchmark circuit s938 with a minimum clock period of 9.09, the minimum and maximum propagation delays are calculated to be 5 and 24.4, respectively. Thus, the values for the clock skew variable on the data paths of s938 are bounded by $-24.4 \le T_{Skew}(i,f) \le 8.64$. These are the loosest limits over all data paths: the lower limit uses the largest propagation delay, and the upper limit, $1.5 \times 9.09 - 5$, uses the smallest. The distribution of the clock skew values of s938, when operable with a minimum clock period of 9.09, is presented in Figure 11.4. The target clock period is $T_{CP} = 9.09$. The height of each bar corresponds to the number of paths formed by sequentially adjacent pairs of registers which have a clock skew within the given range. The calculated clock skew values are within the derived limits, and most of them are negative. Negative clock skew between registers helps improve the minimum clock period of the synchronous circuit due to the additional time it provides for data signal propagation. The data paths on which positive skew is recorded most likely occur for two reasons. The first reason is the presence of data path cycles and reconvergent systems within the circuit, which have constraining timing properties as explained in Chapter 8. The second reason is the presence of faster paths, which provide extra time for neighboring critical paths.
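The bounds quoted for s938 can be reproduced with a short sketch of (11.6) and (11.7). Two simplifications are assumed here and are not stated in the text: each path is represented by one delay used as both its minimum and maximum propagation delay, and the circuit-wide limits are taken as the loosest per-path limits. The two example paths simply carry the extreme delays quoted above (5 and 24.4); the function names are hypothetical.

```python
def skew_bounds(d_min: float, d_max: float, t_cp: float, duty: float = 0.5,
                d_cq: float = 0.0, setup: float = 0.0, hold: float = 0.0) -> tuple[float, float]:
    """Permissible clock skew range of one data path according to (11.6).

    With zero setup/hold/clock-to-output delays and a 50% duty cycle this
    reduces to (11.7): -d_min <= T_skew <= 1.5 * T_CP - d_max.
    """
    c_w = duty * t_cp                      # transparency width C_W^L
    lower = -d_min + setup + hold
    upper = t_cp + c_w - d_max - d_cq - setup
    return lower, upper


def skew_envelope(path_delays: list[tuple[float, float]], t_cp: float) -> tuple[float, float]:
    """Loosest lower and upper skew limits over all paths (assumed interpretation)."""
    bounds = [skew_bounds(d_min, d_max, t_cp) for d_min, d_max in path_delays]
    return min(lo for lo, _ in bounds), max(hi for _, hi in bounds)


# Two hypothetical paths carrying the extreme delays quoted for s938.
envelope = skew_envelope([(5.0, 5.0), (24.4, 24.4)], t_cp=9.09)
```

Under these assumptions the envelope matches the range $-24.4 \le T_{Skew}(i,f) \le 8.64$ given in the text, while each individual path remains subject to its own, tighter pair of limits.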
Fig. 11.4. Distribution of the clock skew values of the non-zero clock skew case for s938. (Histogram: number of paths versus clock skew $T_{skew}(i,f)$ in time units.)

Fig. 11.5. Distribution of the clock delay values of the non-zero clock skew case for s938. (Histogram: number of latches versus clock delay $t_i$ in time units.)
The distribution of the clock delays to each register is presented in Figure 11.5. The target clock period is $T_{CP} = 9.09$. The height of each bar corresponds to the number of latches being driven by a clock signal with a time delay within the given range. The distribution is significantly widespread, ranging from 0 to 19 (time units), where the minimum clock period is $T_{CP} = 9.09$. If the clock tree network of the synchronous circuit is implemented to accommodate these nominal clock delays, operation at the target minimum clock period is achieved.
11.2 Multi-Phase Level-Sensitive Circuits

Multi-phase synchronization of ISCAS’89 benchmark circuits is performed using the transformation shown in Figure 11.6. This transformation is similar to the procedure used in the literature, particularly in [72, 94, 99, 185]. In particular, to synchronize the circuit with an n-phase scheme, the combinational logic block is divided into n equal-length (delay) blocks along the logic depth and n latches are inserted between these blocks. Each latch is synchronized with one phase of the n-phase synchronization scheme, where the phases are selected in ascending order based on the location of the duty cycle.

Fig. 11.6. Generation of an n-phase data path with latches. (Diagram: a flip-flop based data path with combinational delay $D_P$ is transformed into a chain of $n$ latches clocked by phases $\phi_1, \ldots, \phi_n$, separated by logic blocks of delay $\frac{1}{n} D_P$.)

The timing information of the benchmark circuits is generated with an algorithm similar to the one used in Section 11.1, where the type, size and fan-out of a gate are considered in the computed delays. The experiments are performed using two-, three- and four-phase clocking schemes representing various degrees of multi-phase synchronization. For simplicity, non-overlapping multi-phase clock signals with identical duty cycles, shown in Figure 11.7, are used in experimentation. Due to the transformation shown in Figure 11.6, a new level of latches is required for each additional clock phase. The latches are modeled with inherent delays in order to capture these effects in the formulation. Latches are modeled with a delay pair of [0.9, 1.1] time units, corresponding to the minimum and maximum delays for a latch. For reference, a unity delay (a delay of 1 time unit) is close to the delay value of an FO4 inverter in the proposed delay generation algorithm.

For multi-phase synchronization, each additional clock phase requires a new level of latches to be inserted into data paths, effectively increasing path delays. Thus, as the number of clock phases increases, the performance of a zero clock skew system degrades. For non-zero clock skew systems, however, this is not necessarily the case, as shown by these experiments. The solutions of the clock period minimization problems computed with the CPLEX (v7.5) barrier optimizer [115] on a 440MHz Sun Ultra-10 workstation are presented in Tables 11.2, 11.3 and 11.4, and Figures 11.8, 11.9 and 11.10.
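The effect of the transformation on path delays can be sketched as follows. This is an illustrative model only: it assumes that the original combinational delay $D_P$ is split into $n$ equal segments and that each inserted latch adds its own delay, between 0.9 and 1.1 time units, to the segment it drives. The function names and the example delay are hypothetical.

```python
LATCH_DELAY = (0.9, 1.1)  # minimum and maximum latch delay in time units


def n_phase_segments(d_p: float, n: int,
                     latch_delay: tuple[float, float] = LATCH_DELAY) -> list[tuple[float, float]]:
    """Split a path of combinational delay d_p into n latch-to-latch segments.

    Each segment is returned as a (min, max) delay pair that includes the
    delay of the latch driving it (an assumption of this sketch).
    """
    if n < 1:
        raise ValueError("at least one phase is required")
    segment = d_p / n
    return [(segment + latch_delay[0], segment + latch_delay[1]) for _ in range(n)]


def total_delay_bounds(d_p: float, n: int) -> tuple[float, float]:
    """Sum of the segment delays: the total grows with the number of phases."""
    segments = n_phase_segments(d_p, n)
    return sum(lo for lo, _ in segments), sum(hi for _, hi in segments)


# Hypothetical path with 12 time units of logic, for two to four phases.
totals = {n: total_delay_bounds(12.0, n) for n in (2, 3, 4)}
```

In this model every additional phase adds one latch delay to the end-to-end path, which is the mechanism behind the degradation of zero clock skew performance noted above.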
The number of registers $r$ and paths $p$ (before modification) of the ISCAS’89 benchmark circuits are shown in Tables 11.2, 11.3 and 11.4. Minimum clock periods, improvements and calculation time are denoted by $T$, $I$ and $t$, respectively. Subscripts $FF$, $n\phi$ represent circuit topologies for flip-flop based and n-phase level-sensitive circuits, respectively. Superscripts and titles $TB$, $CSS$, $TBCS$ stand for time borrowing, clock skew scheduling and both, respectively. Minimum clock periods ($T$) are measured in time units. In the rest of this section, the experimental results and the factors contributing to the improvements in these results are discussed in greater detail. In particular, the properties of multi-phase synchronization which affect level-sensitive circuit performance are discussed in Section 11.2.1. The effects of multi-phase synchronization on time borrowing are addressed in Section 11.2.2. The effects of multi-phase synchronization on clock skew scheduling are addressed in Section 11.2.3. Finally, the effects of multi-phase synchronization on the simultaneous application of time borrowing and clock skew scheduling are addressed in Section 11.2.4.

Fig. 11.7. Non-overlapping multi-phase synchronization clock. (Diagram: $n$ clock phases $C^1_{source}, \ldots, C^n_{source}$ over one clock period $T_{CP}$, each with transparency width $C_W^L = T_{CP}/n$ and phase shifts $\phi_1 = 0$, $\phi_2 = T_{CP}/n$, ..., $\phi_{(n-1)} = T_{CP}(n-2)/n$, $\phi_{(n)} = T_{CP}(n-1)/n$.)
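The non-overlapping clocking scheme of Figure 11.7 can be written out as a small sketch. It uses the relations visible in the figure, a transparency width of $T_{CP}/n$ for every phase and a phase shift of $T_{CP}(k-1)/n$ for phase $k$; the function name and the example clock period are hypothetical.

```python
def phase_schedule(t_cp: float, n: int) -> list[tuple[float, float]]:
    """Return (phase_shift, width) for phases 1..n of a non-overlapping clock.

    Phase k starts at T_CP * (k - 1) / n and stays transparent for T_CP / n.
    """
    if n < 1 or t_cp <= 0:
        raise ValueError("need a positive clock period and at least one phase")
    width = t_cp / n
    return [(t_cp * (k - 1) / n, width) for k in range(1, n + 1)]


# Hypothetical four-phase clock with a period of 12 time units.
schedule = phase_schedule(12.0, 4)
```

Because each transparency window ends exactly where the next one begins, the windows tile the clock period without overlapping.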
Table 11.2. Minimum clock periods of multi-phase ISCAS’89 benchmark circuits. Columns $T_{FF}$ through $T_{4\phi}^{TB}$ are for zero clock skew; columns $T_{FF}^{CSS}$ through $T_{4\phi}^{TBCS}$ are for non-zero clock skew.

| Circuit | $T_{FF}$ | $T_{1\phi}^{TB}$ | $T_{2\phi}^{TB}$ | $T_{3\phi}^{TB}$ | $T_{4\phi}^{TB}$ | $T_{FF}^{CSS}$ | $T_{1\phi}^{TBCS}$ | $T_{2\phi}^{TBCS}$ | $T_{3\phi}^{TBCS}$ | $T_{4\phi}^{TBCS}$ |
|---|---|---|---|---|---|---|---|---|---|---|
| s27 | 7.7 | 6.5 | 8.2 | 9.5 | 10.3 | 5.2 | 5.2 | 6.3 | 7.4 | 8.5 |
| s208.1 | 13.5 | 9.0 | 11.8 | 13.9 | 15.4 | 5.8 | 6.3 | 9.8 | 11.1 | 11.4 |
| s298 | 14.1 | 10.8 | 11.0 | 13.1 | 14.7 | 9.6 | 9.6 | 10.3 | 12.8 | 14.6 |
| s344 | 28.1 | 19.5 | 22.4 | 24.5 | 26.1 | 19.5 | 19.5 | 22.4 | 24.5 | 26.1 |
| s349 | 28.1 | 19.5 | 22.4 | 24.5 | 26.1 | 19.5 | 19.5 | 22.4 | 24.5 | 26.1 |
| s382 | 15.3 | 11.0 | 14.4 | 15.0 | 16.7 | 9.2 | 9.2 | 12.2 | 14.8 | 16.5 |
| s386 | 18.9 | 18.4 | 19.5 | 20.6 | 21.7 | 18.4 | 18.4 | 19.5 | 20.6 | 21.7 |
| s400 | 15.3 | 11.1 | 14.4 | 15.0 | 16.7 | 9.3 | 9.3 | 12.3 | 14.8 | 16.5 |
| s420.1 | 17.5 | 12.8 | 16.1 | 17.2 | 18.5 | 7.7 | 8.2 | 12.9 | 14.3 | 15.6 |
| s444 | 17.9 | 12.6 | 16.1 | 17.6 | 19.3 | 10.6 | 10.6 | 14.5 | 17.2 | 19.0 |
| s499 | 17.5 | 16.3 | 18.6 | 19.3 | 20.5 | 16.3 | 16.3 | 18.0 | 19.3 | 20.5 |
| s510 | 17.9 | 15.9 | 18.2 | 19.2 | 20.1 | 15.9 | 15.9 | 17.4 | 18.8 | 20.0 |
| s526 | 14.1 | 11.0 | 12.1 | 13.5 | 15.1 | 9.6 | 9.6 | 11.4 | 13.5 | 15.1 |
| s526n | 14.1 | 11.0 | 12.1 | 13.5 | 15.1 | 9.6 | 9.6 | 11.4 | 13.5 | 15.1 |
| s635 | 165.9 | 157.4 | 113.3 | 127.9 | 135.4 | 4.8 | 4.9 | 78.8 | 97.1 | 102.6 |
| s641 | 89.1 | 66.9 | 74.9 | 76.6 | 78.0 | 62.6 | 62.6 | 66.6 | 73.3 | 77.2 |
| s713 | 90.3 | 71.9 | 75.8 | 77.4 | 78.7 | 64.5 | 64.5 | 67.5 | 74.2 | 78.1 |
| s820 | 19.7 | 19.4 | 20.6 | 21.6 | 22.7 | 19.4 | 19.4 | 20.5 | 21.6 | 22.7 |
| s832 | 20.1 | 19.9 | 21.1 | 22.1 | 23.2 | 19.9 | 19.9 | 21.0 | 22.1 | 23.2 |
| s838 | 25.5 | 20.8 | 21.5 | 23.4 | 25.0 | 9.3 | 10.2 | 17.0 | 20.1 | 22.1 |
| s938 | 25.5 | 20.8 | 21.5 | 23.4 | 25.0 | 9.3 | 10.2 | 17.0 | 20.1 | 22.1 |
| s953 | 24.3 | 21.4 | 21.7 | 23.6 | 25.3 | 19.4 | 19.4 | 21.7 | 23.6 | 25.3 |
| s967 | 21.7 | 18.6 | 19.6 | 21.4 | 22.8 | 17.3 | 17.7 | 19.6 | 21.4 | 22.8 |
| s991 | 97.5 | 91.8 | 65.7 | 74.2 | 80.6 | 79.6 | 79.6 | 52.0 | 62.4 | 68.1 |
| s1196 | 21.9 | 16.2 | 15.3 | 18.0 | 20.2 | 10.1 | 8.0 | 8.2 | 7.1 | 8.0 |
| s1238 | 21.9 | 16.2 | 15.3 | 18.0 | 20.2 | 10.1 | 8.0 | 8.2 | 7.1 | 8.0 |
| s1269 | 52.3 | 48.0 | 35.6 | 40.6 | 44.5 | 43.0 | 43.0 | 30.5 | 39.1 | 44.0 |
| s1423 | 93.3 | 86.6 | 69.4 | 73.5 | 77.3 | 76.7 | 76.0 | 69.2 | 73.1 | 75.6 |
| s1488 | 33.3 | 30.1 | 32.6 | 33.8 | 35.0 | 30.1 | 30.1 | 32.3 | 33.7 | 35.0 |
| s1494 | 33.9 | 30.7 | 33.2 | 34.3 | 35.6 | 30.7 | 30.7 | 32.9 | 34.3 | 35.6 |
| s1512 | 40.7 | 35.9 | 38.4 | 40.2 | 41.6 | 35.9 | 35.9 | 38.4 | 40.2 | 41.6 |
| s3271 | 41.5 | 30.0 | 28.4 | 32.6 | 35.8 | 28.8 | 28.8 | 15.1 | 10.0 | 7.7 |
| s3330 | 35.9 | 23.9 | 26.3 | 28.9 | 31.8 | 18.7 | 18.7 | 23.6 | 25.9 | 27.6 |
| s3384 | 86.3 | 77.6 | 58.3 | 65.9 | 71.7 | 67.6 | 67.6 | 41.6 | 48.5 | 52.8 |
| s4863 | 82.3 | 75.6 | 55.6 | 62.9 | 68.5 | 69.2 | 69.2 | 44.2 | 45.4 | 46.5 |
| s5378 | 29.5 | 24.1 | 25.3 | 26.0 | 27.2 | 23.1 | 23.1 | 24.4 | 25.6 | 26.7 |
| s6669 | 129.7 | 124.8 | 87.2 | 98.2 | 106.4 | 110.0 | 110.0 | 62.5 | 44.5 | 49.4 |
| s9234.1 | 76.9 | 65.0 | 65.6 | 69.8 | 72.4 | 55.3 | 55.3 | 65.6 | 69.8 | 72.4 |
| s9234 | 76.9 | 65.0 | 65.6 | 69.8 | 72.4 | 55.3 | 55.3 | 65.6 | 69.8 | 72.4 |
| s13207.1 | 77.1 | 64.8 | 63.2 | 68.9 | 72.3 | 54.8 | 54.8 | 63.2 | 68.9 | 72.3 |
| s13207 | 86.7 | 67.6 | 63.3 | 69.6 | 74.0 | 57.8 | 57.8 | 63.2 | 68.9 | 72.3 |
| s15850.1 | 82.3 | 71.6 | 67.8 | 71.9 | 74.8 | 58.5 | 58.5 | 67.8 | 71.9 | 74.8 |
| s15850 | 117.1 | 93.0 | 78.8 | 88.8 | 96.3 | 83.8 | 83.8 | 67.8 | 71.9 | 74.8 |
| s35932 | 35.3 | 35.0 | 36.4 | 37.5 | 37.6 | 21.1 | 21.1 | 26.2 | 30.0 | 32.5 |
11.2.1 Multi-Phase Clocking Multi-phase clocking is superior to single phase clocking by better accommodating the transparency periods of latches. Depending on the particular synchronization scheme, however, the duration of the transparency periods can be short for each phase, thereby reducing the advantages of multi-phase clocking. In single-phase clocking, the transparency periods of latches have identical positions within their respective clock cycles. In multi-phase clocking, the transparency periods of different clock phases are distributed over the clock cycle. In a multi-phase circuit synchronized with the clock signal shown
11.2 Multi-Phase Level-Sensitive Circuits
217
Table 11.3. Clock period improvements of multi-phase ISCAS’89 circuits.
                Improvement TB (%)     Improvement CSS (%)          Improvement TBCS (%)
Circuit      I1φ   I2φ   I3φ   I4φ   I1φ   I2φ   I3φ   I4φ   IFF   I1φ   I2φ   I3φ   I4φ
s27           16    -6   -24   -34    20    23    22    18    32    32    18     3   -10
s208.1        33    13    -3   -14    30    17    20    26    57    53    27    18    16
s298          23    22     7    -4    11     6     3     1    32    32    27     9    -3
s344          31    20    13     7     0     0     0     0    31    31    20    13     7
s349          31    20    13     7     0     0     0     0    31    31    20    13     7
s382          28     6     2    -9    16    15     2     1    40    40    20     4    -8
s386           3    -3    -9   -15     0     0     0     0     3     3    -3    -9   -15
s400          28     6     2    -9    16    15     2     1    39    39    20     3    -8
s420.1        27     8     2    -6    36    20    17    15    56    53    26    18    11
s444          30    10     2    -8    16    10     3     1    41    41    19     4    -6
s499           7    -6   -10   -17     0     3     0     0     7     7    -3   -10   -17
s510          11    -2    -7   -12     0     5     2     1    11    11     3    -5   -12
s526          22    14     4    -7    12     6     0     0    32    32    20     4    -7
s526n         22    14     4    -7    12     6     0     0    32    32    20     4    -7
s635           5    32    23    18    97    30    24    24    97    97    53    41    38
s641          25    16    14    13     6    11     4     1    30    30    25    18    13
s713          20    16    14    13    10    11     4     1    29    29    25    18    14
s820           2    -5   -10   -15     0     0     0     0     2     2    -4   -10   -15
s832           1    -5   -10   -15     0     0     0     0     1     1    -4   -10   -15
s838          18    16     8     2    51    21    14    12    64    60    33    21    14
s938          18    16     8     2    51    21    14    12    64    60    33    21    14
s953          12    11     3    -4     9     0     0     0    20    20    11     3    -4
s967          15    10     2    -5     5     0     0     0    20    18    10     2    -5
s991           6    33    24    17    13    21    16    16    18    18    47    36    30
s1196         26    30    18     8    51    47    61    60    54    63    63    68    64
s1238         26    30    18     8    51    47    61    60    54    63    63    68    64
s1269          8    32    22    15    10    14     4     1    18    18    42    25    16
s1423          7    26    21    17    12     0     0     2    18    19    26    22    19
s1488         10     2    -1    -5     0     1     0     0    10    10     3    -1    -5
s1494          9     2    -1    -5     0     1     0     0     9     9     3    -1    -5
s1512         12     6     1    -2     0     0     0     0    12    12     6     1    -2
s3271         28    32    22    14     4    47    69    79    31    31    64    76    82
s3330         33    27    19    11    22    10    10    13    48    48    34    28    23
s3384         10    32    24    17    13    29    26    26    22    22    52    44    39
s4863          8    32    24    17     8    21    28    32    16    16    46    45    43
s5378         18    14    12     8     4     3     2     2    22    22    17    13     9
s6669          4    33    24    18    12    28    55    54    15    15    52    66    62
s9234.1       15    15     9     6    15     0     0     0    28    28    15     9     6
s9234         15    15     9     6    15     0     0     0    28    28    15     9     6
s13207.1      16    18    11     6    15     0     0     0    29    29    18    11     6
s13207        22    27    20    15    15     0     1     2    33    33    27    21    17
s15850.1      13    18    13     9    18     0     0     0    29    29    18    13     9
s15850        21    33    24    18    10    14    19    22    28    28    42    39    36
s35932         1    -3    -6    -6    40    28    20    14    40    40    26    15     8
Average     16.7  15.3   8.0   1.6  16.5  12.1  11.4  11.3  30.3  30.3  24.8  17.7  12.0
in Figure 11.7, for instance, the transparency periods are located at different times within the clock cycle (e.g., clock phases C_1 and C_n occupy the first and last sections, respectively). This variety in the locations of the transparency periods provides flexibility in the permissible data propagation times of a local data path. Assigning different clock phases to registers, whether through clock skew scheduling or any other method, leads to improvements in circuit performance.
11 Experimental Results Table 11.4. Circuit info and run times for multi-phase ISCAS’89 circuits. Circuit Info Circuit r p s27 3 4 s208.1 8 28 s298 14 54 s344 15 68 s349 15 68 s382 21 113 s386 6 15 s400 21 113 s420.1 16 120 s444 16 113 s499 22 462 s510 6 15 s526 21 117 s526n 21 117 s635 32 496 s641 19 81 s713 19 81 s820 5 10 s832 5 10 s838 32 496 s938 32 496 s953 29 135 s967 29 135 s991 19 51 s1196 18 20 s1238 18 20 s1269 37 1260 s1423 74 1471 s1488 6 15 s1494 6 15 s1512 57 415 s3271 116 789 s3330 132 514 s3384 183 1759 s4863 104 620 s5378 179 1147 s6669 239 2138 s9234.1 228 247 s9234 211 2342 s13207.1 669 3068 s13207 669 3068 s15850.1 534 10830 s15850 597 14257 s35932 1728 4187
B tT 1φ 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 1 0 1 0 1 1 2 2 3 3 10 15 6
tBCS 1φ 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 1 0 1 1 1 2 3 3 4 5 25 26 8
B tT 2φ 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 1 1 2 1 1 2 2 3 6 6 13 19 16
Time BCS tT 2φ 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 2 0 0 1 1 1 2 1 2 3 3 4 7 8 30 32 17
(sec) B BCS tT tT 3φ 3φ 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 1 0 1 1 0 1 0 0 0 0 0 0 0 0 0 2 2 0 0 0 0 1 1 1 1 1 1 3 3 1 1 2 2 3 4 4 5 4 5 9 11 9 13 19 37 24 42 21 28
B tT 4φ 0 0 0 0 0 0 0 0 0 0 1 0 0 0 1 0 0 0 0 1 1 0 0 0 0 0 1 2 0 0 1 2 2 4 1 3 4 5 5 13 13 25 30 27
BCS tT 4φ 0 0 0 0 0 0 0 0 0 0 1 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 1 3 0 0 1 2 2 4 2 3 6 6 7 15 19 47 46 39
As illustrated in the transformation procedure shown in Fig. 11.6, an extra level of latches is inserted onto a logic data path for each clock phase. The delays of these inserted latches can become significant for a higher number of clock phases, which degrades the minimum clock period when clock skew scheduling is not applied. For non-zero clock skew circuits, however, the negative effects of latch insertion can be compensated, potentially leading to equivalent or improved circuit performance. Note that the complexity of the design process increases due to clock skew scheduling, and the complexity of the timing analysis increases due to the multiplicity of clock phases.
[Figure 11.8: plot titled "Performance Improvement via Time Borrowing per Clock Phase". Horizontal axis: ISCAS'89 modified circuits (s27 through s35932, and Average); vertical axis ticks from -40 to 100. Series: 1-Phase, 2-Phase, 3-Phase, 4-Phase.]
Fig. 11.8. Effects of multi-phase clocking on time borrowing.
As an added note, the transformation procedure in Fig. 11.6 introduces a certain bias in circuit operation: each sequentially adjacent latch pair is synchronized by two consecutive clock phases (i.e., φ_k and φ_{k+1}, k ∈ {1, 2, ..., n − 1}). Furthermore, the propagation delays of the combinational blocks are distributed evenly between clock phases by the transformation procedure. In typical circuit implementations, these regularities do not always occur, which leaves more room for improvement.
11.2.2 Multi-Phase Clocking Effects on Time Borrowing
In order to observe the effects of multi-phase clocking on time borrowing (without clock skew scheduling), the transformation procedure of Figure 11.6 is applied under a conventional zero clock skew synchronization regime (i.e., φ_1 = φ_2 = ... = φ_n in Fig. 11.6). As shown in Table 11.3, an average improvement of 16.7% is achieved for single-phase circuits. Average improvements of 15.3%, 8.0% and 1.6% are achieved for dual, three and four-phase clocking schemes, respectively. The percentage improvements presented in Table 11.3 for each benchmark circuit are illustrated in Figure 11.8, where the four data points shown per benchmark circuit, from left to right, are the percentage improvements observed for the single-phase, dual-phase, three-phase and four-phase synchronization schemes, respectively.
It is observed that the improvement achieved through time borrowing decreases on average as the number of clock phases increases. This average degradation is expected because, by definition (C_W = T/n, n ≥ 2), the transparency period of the latches shortens for a higher number of clock phases. The degradation is worsened by the increasing delays of the latches inserted in accordance with the transformation procedure presented in Fig. 11.6. Nevertheless, 34% of the benchmark circuits in Table 11.2 (15 out of 44 total) benefit more from time borrowing under multi-phase clocking. These circuits demonstrate that the average degradation in improvement through time borrowing with multi-phase clocking is not observed for all circuits. For these circuits, the multi-phase distribution of latch transparency periods provides additional slack where necessary, leading to improvements in these specific cases.
11.2.3 Multi-Phase Clocking and Clock Skew Scheduling
In order to observe the effects of multi-phase clocking on clock skew scheduling (without time borrowing), comparisons have been performed between level-sensitive, non-zero clock skew circuits and level-sensitive, zero clock skew implementations. The improvement attributed to clock skew scheduling for a dual-phase, level-sensitive circuit implementation is computed using the formula (T_old − T_new)/T_old × 100%, where T_new is the clock period of the dual-phase, clock-skew-scheduled, level-sensitive implementation and T_old is the clock period of the dual-phase, zero clock skew, level-sensitive implementation of the same circuit. The results for edge-sensitive circuits display improvements of 30.3% on average due to clock skew scheduling alone, consistent with earlier results in the literature [2, 96]. For single, dual, three and four-phase level-sensitive implementations, clock skew scheduling results in improvements of 16.5%, 12.1%, 11.4% and 11.3% on average, respectively.
The percentage improvements for each benchmark circuit are illustrated in Fig. 11.9, where the five data points shown per benchmark circuit, from left to right, are the percentage improvements observed for the edge-sensitive, single-phase, dual-phase, three-phase and four-phase synchronization schemes, respectively. The average degradation in performance (compared to non-zero clock skew, edge-sensitive circuits) is expected, as the even distribution of the transparency periods potentially negates the effectiveness of clock skew scheduling. Nevertheless, 34% of the benchmark circuits in Table 11.2 (15 out of 44, where 5 circuits are not improved by clock skew scheduling for any synchronization scheme) benefit more from clock skew scheduling under multi-phase clocking. These circuits demonstrate that the average degradation in the improvements of clock skew scheduling with multi-phase clocking is not observed for all circuits. In this important special case, the multi-phase transformation procedure changes the delay paths such that the resulting circuits are better suited to the optimization provided by clock skew scheduling.
[Figure 11.9: plot titled "Performance Improvement via CSS per Clock Phase". Horizontal axis: ISCAS'89 modified circuits (s27 through s35932, and Average); vertical axis ticks from 0 to 100. Series: FF, 1-Phase, 2-Phase, 3-Phase, 4-Phase.]
Fig. 11.9. Effects of multi-phase clocking on clock skew scheduling.
As an added note, clock skew scheduling is more effective on circuits with certain characteristics. In particular, if the data propagation delays on the local data paths of a circuit are irregular, higher improvements in circuit performance are achievable through clock skew scheduling. The transformation method shown in Fig. 11.6 distributes the data propagation delays evenly among adjacent local data paths, increasing the regularity of the circuit as the number of clock phases increases. Therefore, the high regularity of the multi-phase circuits, caused by the bias in the transformation procedure in Fig. 11.6, also contributes to the degradation.
11.2.4 Simultaneous Time Borrowing and Clock Skew Scheduling
In non-zero clock skew circuits, final improvements of 30.3%, 24.8%, 17.7% and 12.0% on average are observed for single, dual, three and four-phase clocking schemes, respectively. These final improvements are due to the simultaneous application of clock skew scheduling and time borrowing, i.e., the application of clock skew scheduling to the level-sensitive version of the circuit. For a dual-phase clocking regime, the improvement is computed using the formula (T_old − T_new)/T_old × 100%, where T_new is the clock period of the dual-phase, clock-skew-scheduled, level-sensitive implementation and T_old is the clock period of the single-phase, zero clock skew, edge-sensitive implementation (not shown in Table 11.2) of the same circuit. The percentage improvements for
[Figure 11.10: plot titled "Performance Improvement via TB and CSS per Clock Phase". Horizontal axis: ISCAS'89 modified circuits (s27 through s35932, and Average); vertical axis ticks from -40 to 100. Series: 1-Phase, 2-Phase, 3-Phase, 4-Phase.]
Fig. 11.10. Effects of multi-phase clocking on time borrowing and clock skew scheduling.
each benchmark circuit are illustrated in Fig. 11.10, where the four data points shown per benchmark circuit, from left to right, are the percentage improvements observed for the single-phase, dual-phase, three-phase and four-phase synchronization schemes, respectively. In general, multi-phase synchronized circuits outperform zero-skew, edge-sensitive circuits. The benchmark circuit s1196 exemplifies this positive trend: an improvement of 68% is observed for three-phase clocking through time borrowing and clock skew scheduling. For the same circuit, the improvements are 63%, 63% and 64% for single, dual and four-phase clocking, respectively. As discussed in Sections 11.2.2 and 11.2.3, the improvements achieved through time borrowing and through clock skew scheduling each decrease on average as the number of clock phases increases. The improvements through the simultaneous application of time borrowing and clock skew scheduling also decrease on average as the number of clock phases increases. Some negative improvements are also recorded for circuits with a significant delay increase due to latch insertion. Nevertheless, 23% of the level-sensitive benchmark circuits in Table 11.2 (10 out of 44) benefit more from clock skew scheduling under multi-phase clocking. These circuits demonstrate that the average degradation in the improvements of simultaneous time borrowing and clock skew scheduling with multi-phase clocking is not observed for all circuits.
It is observed from the experiments that no particular multi-phase approach is superior to the others in all cases. For some circuits, conventional one-phase edge-triggered or dual-phase level-sensitive implementations are best; for others, unconventional (but highly feasible with rotary clocking) three and four-phase synchronization of level-sensitive circuits is best. All schemes must be investigated using the presented analysis framework in order to identify the optimal synchronization scheme for any rotary-clock synchronized circuit.
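The per-circuit selection described above can be sketched in a few lines. The snippet below is only an illustration: the three rows are copied from the Improvement TBCS columns of Table 11.3, the labelling of the first column as the edge-triggered (FF) scheme is an assumption about the table layout, and the names `SCHEMES`, `TBCS_IMPROVEMENT` and `best_scheme` are hypothetical.

```python
# Improvement TBCS (%) values copied from Table 11.3 for three circuits.
# Assumed column order: FF (edge-triggered), then 1- to 4-phase level-sensitive.
SCHEMES = ("FF", "1-phase", "2-phase", "3-phase", "4-phase")
TBCS_IMPROVEMENT = {
    "s208.1": (57, 53, 27, 18, 16),
    "s1196": (54, 63, 63, 68, 64),
    "s3271": (31, 31, 64, 76, 82),
}


def best_scheme(improvements):
    """Return (scheme, improvement) for the largest entry; ties go to the earlier scheme."""
    best_index = max(range(len(improvements)), key=lambda i: improvements[i])
    return SCHEMES[best_index], improvements[best_index]


for circuit, row in TBCS_IMPROVEMENT.items():
    scheme, value = best_scheme(row)
    print(f"{circuit}: best scheme {scheme} ({value}%)")
```

For these three circuits the best scheme differs each time, which is the point of the paragraph above: every scheme has to be evaluated per circuit.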
11.3 Quadratic Programming (QP) for Maximizing Safety
A quadratic programming formulation of the clock skew scheduling problem is developed in Chapter 7. This QP problem can be efficiently solved by applying the mathematical procedures developed in Chapter 9. The algorithm described in Section 7.2.3 has been implemented as a C++ program and applied to ISCAS’89 and ISCAS’93 benchmark circuits, as well as to industrial circuits (IC1, IC2, and IC3). Results from the application of this computer program are described in this chapter. Certain characteristics of the implementation are initially described in Section 11.3.1. Graphical illustrations of representative results are shown in Section 11.3.2.
11.3.1 Description of Computer Implementation
The results described in this section are obtained from the execution of a computer implementation of Algorithm CSD introduced in Section 9.1.3. This computer implementation shares code with the computer implementation described in Section 5.7. In particular, the input data file format and the input/output routines are exactly the same. Without unnecessary details, this computer implementation consists of the sequential execution of the following major steps:
Step 1. The input data file format and the input/output routines are shared with the LP computer implementation described in 5.7. The circuit timing and connectivity data are read in, compressed, and stored in a binary database. The database can be used for fast data access in subsequent algorithmic applications to the same circuit. Furthermore, the compact size of the database permits significant space and time savings if the circuit data is exchanged.
Step 2. The circuit data is examined and the circuit graph is built according to the graph model described in 5.2.2. An adjacency-list data structure [105] stored in memory is used for fast access to the circuit graph data.
Step 3. The circuit graph is transformed according to the transformation rules described in 5.7 and illustrated in Figure 5.5. Within this step, the permissible range bounds are calculated and the directions of the graph edges are determined.
Step 4. The circuit graph is traversed in order to determine the edges in the skew basis sb and in the skew chords sc. This graph traversal is accomplished using a depth-first search [89, 105, 124] algorithm, the classical traversal algorithm of choice for building a spanning tree. Three additional important tasks are accomplished during the traversal step:
1. For circuits with more than one disjoint connected subcircuit, these disjoint connected parts are identified and marked. This task does not incur any computational overhead, since separating a graph into disjoint pieces (if any) is an inherent feature of the depth-first search graph traversal algorithm.
2. The skew basis and chords of each disjoint connected circuit subgraph are identified and enumerated.
3. The circuit connectivity matrix B (actually, only the non-identity-matrix portion C of B) is derived for each disjoint connected circuit subgraph. Recall that C contains only elements from the set {−1, 0, 1}, thus permitting an efficient bit compression scheme to be used to store C in a small amount of memory.
Step 5. Using C, the matrix N is computed as described by (9.9).
Step 6. The Cholesky factorization L2 of N is calculated as described by (9.10). Simple, yet efficient algorithms for computing the Cholesky factorization have long been known and can be found in multiple sources [127, 131, 134, 135]. Recall that the matrix N is guaranteed to be positive-definite by construction. Therefore, the real (no complex numbers) Cholesky decomposition is guaranteed to exist.
Step 7. The objective clock skews are chosen at the center of the permissible range for all local data paths. The actual clock skews (a consistent clock schedule) are calculated as described by (9.25) and as illustrated in Figure 9.1. At this point, each clock skew is verified against the respective permissible range. If all skews are within the respective permissible range bounds, the algorithm concludes. Otherwise, the objective clock skews are modified and the calculation is repeated. Only the calculation described in this step must be repeated, since all matrices have already been computed. Different strategies can be used to modify the objective clock schedule. The most effective strategy, resulting in the fastest convergence towards a feasible schedule, is as follows: each objective clock skew is slightly increased or decreased depending upon whether the respective calculated clock skew is larger or smaller than the objective one. Using this strategy, a feasible solution is typically reached within a few iterations.
Step 8. The actual clock delays to the individual registers are calculated by traversing the spanning tree (basis) of the circuit graph. The clock delay of the first register is arbitrarily chosen (zero in this implementation). As the spanning tree is traversed, additional vertices adjacent to the current vertex are visited. The clock delay of the visited vertex is determined trivially since both the clock delay of the current vertex and the clock skew of the edge between the current and visited vertex are known.
The results of the application of the algorithm to these circuits are summarized in Table 11.5. For each circuit, the following data is listed: the circuit name in column 1, the number of disjoint subgraphs in column 2, and the number of vertices, edges, chords (cycles), main and isolated basis, and the target clock period in nanoseconds in columns 3 through 8, respectively. The number of iterations to reach a solution is listed in column 9. The average value of ε in (7.42), that is, ε/p, is listed in column 10. The run time in minutes for the mathematical portion of the program is shown in column 11 for a 170 MHz Sun Ultra 1 workstation.
11.3.2 Graphical Illustrations of Results
The application of the computer implementation described in Section 11.3.1 to many of the circuits listed in Table 11.5 is graphically illustrated in this section. Immediately following are illustrations of two circuits, shown in Figures 11.11 and 11.12, respectively. Three histograms for a circuit are shown in each graphical illustration. These histograms are as follows:
(a) The distribution of the zero clock skews in the permissible range for the clock period listed in Table 11.5 is illustrated in subfigure (a).
(b) The distribution of the non-zero clock skews in the permissible range after one iteration of problem QP-2, as described in Step 7 in Section 11.3.1, is shown in subfigure (b). Note that there are frequent lower bound and upper bound violations of the permissible range. These violations are represented by the dark leftmost and rightmost regions, respectively, where the number of violations is also indicated.
(c) The final distribution of the non-zero clock skews within the permissible range, with no timing violations, is illustrated in subfigure (c). There is a noticeable improvement since most clock skews are concentrated around the center of the permissible range. The majority of the clock skews are within 10% of the safest clock skew value at the center of the respective permissible range of each local data path.
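Steps 4 and 8 of the implementation described in Section 11.3.1 can be sketched as follows. This is a minimal illustration, not the C++ program used for the experiments: a depth-first search builds a spanning forest, tree edges are taken as the skew basis, the remaining edges as chords, each new root starts a disjoint subcircuit, and clock delays are propagated along tree edges from a root delay of zero. The function name `schedule_from_skews`, the example data, and the sign convention s_ij = t_i − t_j are assumptions made for this sketch.

```python
from collections import defaultdict


def schedule_from_skews(num_registers, skews):
    """Build a DFS spanning forest and derive clock delays from edge skews.

    skews maps a local data path (i, j) to its clock skew, assumed here to be
    s_ij = t_i - t_j (a hypothetical convention chosen for this sketch).
    """
    adjacency = defaultdict(list)
    for (i, j), s in skews.items():
        adjacency[i].append((j, -s, (i, j)))  # moving i -> j: t_j = t_i - s_ij
        adjacency[j].append((i, +s, (i, j)))  # moving j -> i: t_i = t_j + s_ij

    delay = {}      # register -> clock delay
    basis = set()   # spanning-tree edges (skew basis)
    components = 0  # number of disjoint connected subcircuits

    def visit(u):
        # Recursive depth-first search; very large graphs would need an explicit stack.
        for v, offset, edge in adjacency[u]:
            if v not in delay:
                delay[v] = delay[u] + offset
                basis.add(edge)
                visit(v)

    for root in range(num_registers):
        if root not in delay:
            components += 1
            delay[root] = 0.0  # the delay of the first register is chosen arbitrarily
            visit(root)

    chords = set(skews) - basis  # every non-tree edge closes a cycle
    return delay, basis, chords, components


# Three registers in a cycle with a consistent schedule, plus one isolated register.
example_skews = {(0, 1): 1.0, (1, 2): 0.5, (0, 2): 1.5}
delay, basis, chords, components = schedule_from_skews(4, example_skews)
print(delay, basis, chords, components)
```

In this example the number of chords equals p − r + (number of subcircuits) = 3 − 4 + 2 = 1, and the skew of the chord is reproduced by the delays derived from the basis because the example schedule is consistent.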
11.4 Delay Insertion in Clock Skew Scheduling For experimentation, the clock skew scheduling algorithms with the delay insertion method proposed for edge-triggered and level-sensitive circuits (Ta-
# subcircuits
TCP (nanoseconds)
# iterations
Run time (min)
Circuit
# subcircuits
TCP (nanoseconds)
# iterations
5
6
7
8
9
10 11
1
2
3
4
5
6
7
8
9
20
9
8
3 20.8
5
3.19
1
s526n
1
21
117
97
20
0
13
2
1.26
s1238
7
18
20
9
8
3 20.8
5
3.19
1
s5378
1
179
1147
969
28.4 20
8.79
3
s13207
49
669
3068
2448
581
39 85.6 20 18.92
5
s641
1
19
81
63
18
5 11.67
1
s1423
2
74
1471
1399
72
s1488
1
6
15
10
5
6
15
ni
(
ε p
r
p
nc
nm ni
158 20 0
83.6
(
ε p
Run time (min)
Circuit
4
18
nm
10 11 2
0 92.2 20
60.9
3
s713
1
19
81
63
18
0
89.2
6 12.74
1
0 32.2
1
0.87
1
s820
1
5
10
6
4
0
18.6
1
0.71
1
0 32.8
1
0.88
1
s832
1
5
10
6
4
0
19
2
0.66
1
s838.1
1
32
496
465
31
0
24.4
3
3.68
3
9 31.44 19
s9234
3
228
2476
2251
222
3
75.8 20 16.67
4
s1494
1
10
5
s15850
15
597 14257 13675
546
36
s15850.1 22
534 10830 10318
478
34 81.2
116 10
70.6 21
s208.1
1
8
28
21
7
0 12.4
1
1.22
1
s9234.1
2
211
2342
2133
205
4
75.8 20
18.6
4
s27
1
3
3
1
2
0
6.6
1
0.71
1
s953
4
29
135
110
25
0
23.2
1.93
2
s298
1
14
54
41
12
1
13
1
1.16
1
s1269
1
37
251
215
36
0
51.2 20 12.73
2
s344
1
15
68
54
14
0
27
4
4.91
1
s1512
1
57
405
349
56
0
39.6
4
4.43
3
s349
1
15
68
54
14
0
27
4
4.91
107
8
40.4
3
3.64
5
s35932
1 1728
4187
61 70
34.8
4
3.4
5
s382 s38417
1
21
113
2460 1727 93
20
1
s3271
1
116
789
674
0 34.2 20
60.4 27
s3330
1
132
514
383
0 14.2
1.59
2
s3384
25
183
1759
1601
151
7
85.2
5
15.5
7
69 20 32.35 31
s4863
1
104
620
517
103
0
81.2
8 39.85
3
11 1636 28082 26457 1443 182
s38584
2 1452 15545 14095 1400
s386
1
6
15
s400
1
21
s420.1
1
16
s444
1
21
s510
1
6
s526
1
21
3
6
50 94.2 11
29.1 29
s6669
20
239
2138
1919
218
0.82
1
s938
1
32
496
465
31
8
1.6
1
s967
4
29
135
110
25
0
20.6
2
1.76
2
0 16.4 20
1.95
1
s991
1
19
51
33
18
0
96.4
3
8.58
1
1.05
1
IC1
1
500 124750 124251
499
0
8.2
2
1.51 30
0.85
1
IC2
1
58
0
10.3
3
1.82
4
1
IC3
3108 1155 59
5.6
2
1.43
2
10
5
0 17.8
1
113
93
20
0 14.2
120
105
15
113
93
20
0 16.8
2
15
10
5
0 16.8
1
117
97
20
0
2
1.26
13
59
493
34 1248
4322
435
1 128.6
3 20.67
6
0
2
2
24.4
3.41
11 Experimental Results
3
7
nc
226
2
s1196
p
Table 11.5. Experimental results of the application of the QP based clock scheduling algorithm to both benchmark and industrial circuits.
1
r
[Figure 11.11: three histograms of clock skews within the permissible range. The numeric markers extracted from each panel are listed with the panel.]
(a) Zero skew in permissible range. Markers: 53 → ← 53; 0 → ← 0.
(b) Non-zero clock skew in permissible range after iteration #1. Markers: 112 → ← 112; 82 →.
(c) Non-zero clock skew in permissible range after all iterations. Markers: 68 → ← 68; 0 → ← 0.
Fig. 11.11. Circuit s3271 with r = 116 registers and p = 789 local data paths. The target clock period is T_CP = 40.4 nanoseconds.
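[Figure 11.12: three histograms of clock skews within the permissible range. The numeric markers extracted from each panel are listed with the panel.]
(a) Zero skew in permissible range. Markers: 45 → ← 45; 0 → ← 0.
(b) Non-zero clock skew in permissible range after iteration #1. Markers: 76 → ← 76; 15 →.
(c) Non-zero clock skew in permissible range after all iterations. Markers: 37 → ← 37; 0 → ← 0.
Fig. 11.12. Circuit s1512 with r = 57 registers and p = 405 local data paths. The target clock period is T_CP = 39.6 nanoseconds.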
bles 8.1 and 8.2) are applied to the ISCAS’89 benchmark circuits. Continuous delay models have been used in the experimentation. The experimental setup in Section 11.1 (circuit delay information, clock signal duty cycle, internal register delays, computing platform, LP solver) is replicated for the proposed timing analyses. Experimental results are presented in Table 11.6. In Table 11.6, the data shown are the number of registers r and paths p, the clock period T_FF for the zero skew circuit with flip-flops, T_FF^CSS for the non-zero skew circuit with flip-flops, and T_FF^DICSS for the non-zero skew circuit using delay insertion with flip-flops. Also listed are the calculation times t_FF^CSS and t_FF^DICSS of T_FF^CSS and T_FF^DICSS, respectively, and the percentage clock period improvements I_FF^CSS, I_FF^DICSS, and I_FF^DI for improvements from T_FF to T_FF^CSS, T_FF to T_FF^DICSS, and T_FF^CSS to T_FF^DICSS, respectively. The clock skew scheduling algorithms used in the experimentation target the clock period minimization problem. Therefore, the improvements achieved in the minimum clock period through the application of clock skew scheduling and delay insertion methods are reported in Table 11.6. These improvements are computed with the formula (T_old − T_new)/T_old × 100. The zero clock skew, edge-sensitive synchronous circuit is selected as the common baseline for comparison due to its simplicity and popularity in digital circuit design. For both edge-triggered and level-sensitive circuits, the improvements through conventional clock skew scheduling (I_FF^CSS and I_L^CSS, respectively) and through clock skew scheduling with delay insertion (I_FF^DICSS and I_L^DICSS, respectively) are computed. Also shown in Table 11.6 are the comparisons of the non-zero clock skew circuits scheduled with conventional clock skew scheduling methods with non-zero clock skew circuits with delay insertion. These comparisons (I_FF^DI and I_L^DI, respectively, for edge-triggered and level-sensitive circuits) demonstrate the effectiveness of the delay insertion method in further improving the performance of a conventional clock skew scheduled circuit. For the ISCAS’89 benchmark circuits, the delay insertion method leads to 10% and 9% improvements on average over the conventional clock skew scheduling algorithms for edge-triggered and level-sensitive circuits, respectively. For better visualization, the improvements in the minimum clock period of edge-triggered and level-sensitive circuits, achieved over the corresponding non-zero clock skew edge-triggered and level-sensitive circuits, are presented in Figure 11.13. Shown in Figure 11.13 are the percentage improvements I_FF^DI and I_L^DI that are also presented in Table 11.6. The two data points shown per benchmark circuit, from left to right, are the improvements observed for edge-triggered and level-sensitive circuits, respectively. Note that these improvements are due to delay insertion applied simultaneously with clock skew scheduling. The delay insertion method cannot be applied (is not beneficial) to some circuits for the two reasons discussed in Sections 8.2 and 8.2.2. The first reason, discussed in Section 8.2, is the fact that the minimum clock period of the circuit can be determined by a limitation other than reconvergent paths,
Table 11.6. Delay insertion results for edge-sensitive ISCAS’89 benchmark circuits.

(a) Circuit info and edge-triggered circuits. All T, t and I quantities in this part carry the subscript FF.
Circuit Info              Clock Periods (tu)      Run Times (s)     Improvements (%)
Circuit        r      p       T   T^CSS T^DICSS   t^CSS t^DICSS   I^CSS I^DICSS    I^DI
s27            3      4     6.6     4.1     4.1       0       0      38      38       0
s208.1         8     28    12.4     4.9     1.6       0       0      60      87      67
s298          14     54    13.0     9.4     9.4       0       0      28      28       0
s344          15     68    27.0    18.4    18.4       0       0      32      32       0
s349          15     68    27.0    18.4    18.4       0       0      32      32       0
s382          21    113    14.2     8.5     6.0       0       0      40      58      29
s386           6     15    17.8    17.3    17.3       0       0       3       3       0
s400          21    113    14.2     8.6     6.0       0       0      39      58      30
s420.1        16    120    16.4     6.8     1.6       0       0      59      90      76
s444          16    113    16.8     9.9     7.9       0       0      41      53      20
s510           6     15    16.8    14.8    14.8       0       0      12      12       0
s526n         21    117    13.0     9.4     9.4       0       0      28      28       0
s641          19     81    83.6    61.9    57.8       0       0      26      31       7
s713          19     81    89.2    63.8    59.4       0       0      28      33       7
s820           5     10    18.6    18.3    18.3       0       0       2       2       0
s832           5     10    19.0    18.8    18.8       0       0       1       1       0
s953          29    135    23.2    18.3    18.3       0       0      21      21       0
s1196         18     20    20.8    10.8     7.8       0       0      48      63      28
s1423         74   1471    92.2    77.4    75.8       0       0      16      18       2
s1488          6     15    32.2    29.0    29.0       0       0      10      10       0
s1494          6     15    32.8    29.6    29.6       0       0      10      10       0
s5378        179   1147    28.4    22.0    22.0       0       0      23      23       0
s9234        228    247    75.8    54.2    54.2       1       1      28      28       0
s13207       669   3068    85.6    57.1    53.8       1       2      33      37       6
s15850       597  14257   116.0    83.6    83.6       5      19      28      28       0
s15850.1     534  10830    81.2    57.4    57.4       5      10      29      29       0
s35932      1728   4187    34.2    20.4    15.7       1       6      40      54      23
s38417      1636  28082    69.0    42.2    42.2      15      37      39      39       0
s38584      1452  15545    94.2    65.2    62.8       5      15      31      33       4
Average                                                              28      34      10

(b) Level-sensitive circuits. All T, t and I quantities in this part carry the subscript L.
            Clock Periods (tu)      Run Times (s)         Improvements (%)
Circuit          T   T^CSS T^DICSS   t^CSS t^DICSS       I   I^CSS I^DICSS    I^DI
s27            5.4     4.1     4.1       0       0      18      38      38       0
s208.1         8.6     5.2     1.6       0       0      31      58      87      69
s298          10.6     9.4     9.4       0       0      18      28      28       0
s344          18.4    18.4    18.4       0       0      32      32      32       0
s349          18.4    18.4    18.4       0       0      32      32      32       0
s382          10.3     8.5     6.0       0       0      27      40      58      29
s386          17.3    17.3    17.3       0       0       3       3       3       0
s400          10.4     8.6     6.0       0       0      27      39      58      30
s420.1        12.6     7.2     1.6       0       0      23      56      90      78
s444          12.4     9.9     8.0       0       0      26      41      53      20
s510          14.8    14.3    14.3       0       0      12      15      15       0
s526n         10.6     9.4     9.4       0       0      18      28      28       0
s641          66.2    61.9    57.8       0       0      21      26      31       7
s713          71.2    63.8    59.4       0       0      20      28      33       7
s820          18.3    18.3    18.3       0       0       2       2       2       0
s832          21.2    18.3    18.3       0       0       9      21      21       0
s953          16.0     7.8     7.8       0       0      23      63      63       0
s1196         16.0     7.8     7.8       0       0      23      63      63       0
s1423         86.4    75.8    75.8       1       2       6      18      18       0
s1488         29.0    29.0    29.0       0       0      10      10      10       0
s1494         23.2    22.0    22.0       1       2      18      23      23       0
s5378         64.8    54.2    54.2       2       4      15      28      28       0
s9234         67.4    57.1    53.8       4       7      21      33      37       6
s13207        92.8    83.6    83.6      23      44      20      28      28       0
s15850        71.4    57.4    57.4      23      34      12      29      29       0
s15850.1      34.1    20.4    15.7       7      16       0      40      54      23
s35932        54.8    42.2    42.2      41     101      21      39      39       0
s38417        76.4    65.2    62.8      31      51      19      31      33       4
s38584        76.4    65.2    62.8      31      51      19      31      33       4
Average                                                 17      29      34       9
which cannot be mitigated by the delay insertion method. The second reason, discussed in Section 8.2.2, is the fact that due to the uncertainty of the delay elements inserted into the logic, the delay insertion might be ineffective in improving the minimum clock period. In the LP formulations presented in Tables 8.1 and 8.2, the uncertainties of the delay elements are modeled without lower (and upper) bounds (delay elements can have zero uncertainty with Im = IM ). Thus, the second reason for inapplicability is not observed in the experimentation. Among the selected ISCAS’89 circuits, the delay insertion method for edge-triggered circuits is applicable to 41% (12 circuits) of the total 29 circuits. By excluding the circuits for which zero improvements are observed (for which the method is not applicable due to the first reason stated above), the average improvement of the delay insertion method for edge-triggered circuits is observed to be 26% over the conventional clock skew scheduling algorithm of [2] (Table 5.1). The delay insertion method on levelsensitive circuits was applicable to 34% (10 circuits) of the total 29 circuits. By excluding the circuits for which zero improvements are observed, the average improvement of the delay insertion method for level-sensitive circuits is observed to be 27% on average over the conventional clock skew scheduling algorithm presented in Chapter 5. The experimental results in Figure 11.13 show that reconvergent paths— with a significant probability (41% and 34% as observed on the ISCAS’89 circuits)—are the dominant limiting factor on the minimum clock period after clock skew scheduling for a synchronous circuit. The delay insertion method can effectively be used to mitigate these limitations, as shown by 26% and 27% improvements in the minimum clock period. 
The proposed clock skew scheduling method with delay insertion takes about twice as much time as the conventional application of clock skew scheduling. However, the method is highly practical, with total run times below a few minutes on commonly available computing resources. The improvements in minimum clock period achieved through conventional clock skew scheduling ($I^{CSS}_{FF}$ and $I^{CSS}_{L}$), and through clock skew scheduling with delay insertion ($I^{DICSS}_{FF}$ and $I^{DICSS}_{L}$), for edge-triggered and level-sensitive circuits are visually presented for each benchmark circuit in Figures 11.14 and 11.15, respectively. Shown in Figure 11.14 are the percentage improvements ($I^{CSS}_{FF}$ and $I^{DICSS}_{FF}$ in Table 11.6, respectively) in minimum clock period via clock skew scheduling and delay insertion for edge-triggered ISCAS’89 benchmark circuits. The two data points shown per benchmark circuit, from left to right, are the improvements observed for clock skew scheduling alone and for delay insertion with clock skew scheduling, respectively. Shown in Figure 11.15 are the percentage improvements ($I^{CSS}_{L}$ and $I^{DICSS}_{L}$ in Table 11.6, respectively) in minimum clock period via clock skew scheduling and delay insertion for level-sensitive ISCAS’89 benchmark circuits. The two data points shown per benchmark circuit, from left to right, are again the improvements observed for clock skew scheduling alone and for delay insertion with clock skew scheduling, respectively.
[Figure: "Improvements via Delay Insertion". Bar chart of improvement (%) for each ISCAS'89 benchmark circuit and the average, with one series for edge-triggered circuits and one for level-sensitive circuits.]
Fig. 11.13. Percentage improvements through delay insertion in Table 11.6.

[Figure: "Edge-Triggered Circuits". Bar chart of improvement (%) for each ISCAS'89 benchmark circuit and the average, with one series for CSS and one for DICSS.]
Fig. 11.14. Percentage improvements on edge-triggered circuits in Table 11.6.
The average total improvement of non-zero clock skew, edge-triggered circuits with delay insertion with respect to the zero clock skew, edge-triggered circuits is 34%. The average total improvement of non-zero clock skew, level-sensitive circuits with delay insertion with respect to the zero clock skew, edge-triggered circuits is also 34%. Note that the total improvements are due to the simultaneous effects of delay insertion, clock skew scheduling and consideration of time borrowing (for level-sensitive circuits only) in the timing analysis. The improvement with delay insertion is equal to or greater than the improvement with clock skew scheduling only, as delay insertion is only applied when it can be used to mitigate the limitation of the reconvergent paths.

[Figure: "Level-Sensitive Circuits". Bar chart of improvement (%) for each ISCAS'89 benchmark circuit and the average, with one series for CSS and one for DICSS.]
Fig. 11.15. Percentage improvements on level-sensitive circuits in Table 11.6.
11.5 Physical Design of Rotary Clock Synchronized Circuits
A computer-aided design tool called hpictiming, which follows the design methodology presented in Chapter 10, is developed in an open source environment [186]. In this section, the timing portion of the hpictiming tool is discussed. The details of the partitioning step implemented with chaco [177] and of the clock skew scheduling step implemented on the Xgrid parallel computing system are presented in Section 10.2. The logic flow of the hpictiming program is presented in Figure 11.16. This flow is similar to the physical design flow shown in Figure 10.6 (note that the specific design decisions made in various stages of the implementation of the physical design flow are indicated on the figure). In Figure 11.16, the grid size for partitioning is set to 2x2 for simplicity. The focus in experimentation is on the effectiveness of the application of clock skew scheduling with partitioning (sequentially and particularly in parallel). Towards this goal, the run times for the parallelized application of clock skew scheduling on the parallel computing clusters and the final circuit performances are reported.
Hpictiming is mainly written in C++ using the standard template library (STL). The total code is approximately 250,000 lines. Some of the parsers are written in lex/yacc and the partitioning tool (chaco software used to implement timing-aware partitioning) is written in ANSI C. The program is developed on GNU/Linux, Solaris Unix (Sun OS 9) and Mac OS X 10.3.8 operating systems using the gcc 3.0 compiler.

11.5.1 Clock Skew Scheduling of Partitions Results
Clock skew scheduling is applied in parallel to the partitions of ISCAS’89 benchmark circuits and an industrial circuit called industrial1. The partitioning results from chaco are utilized within hpictiming in generating the top block and the partition LP problems. The LP problem [2] shown in Table 5.1 (page 81) is used for clock skew scheduling of edge-sensitive synchronous circuits. For industrial1, where register insertion is performed, linear constraints described in Chapter 5 for level-sensitive local data paths are added to the LP problem constraints. The feasibility of the parallel application of clock skew scheduling is analyzed. The speedups achievable through parallel clock skew scheduling are computed. The experimental setup of Section 11.1 is replicated for experimentation. The experiments are performed on an Xgrid cluster built with eight (8) PowerMac computers with dual G5 1.8 GHz microprocessors (only one processor is used on each client computer) and 3 GB RAM running Mac OS X 10.3.8 (Section 10.3). The simplex optimizer of the GNU LP solver GLPK (version 4.8) [187] is used to solve the LP problems. The results are presented in Table 11.7. In Table 11.7, the number of registers r and the number of paths p are shown for each analyzed circuit. Run times of various clock skew scheduling methods are shown. Run times of the conventional method of Table 5.1 are denoted by $t_{conven}$.
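The LP of Table 5.1 is not reproduced in this section. As a rough illustration of the kind of constraint system being solved, the sketch below assumes the commonly stated setup and hold constraints for edge-triggered circuits (t_i - t_f <= T - d_max - setup and t_f - t_i <= d_min - hold) and ignores delay uncertainty terms. It does not use the simplex optimizer of GLPK. Instead, it swaps in a binary search over the clock period T with a Bellman-Ford feasibility check on the difference constraints. All names, including `feasible_schedule`, `min_clock_period` and the two-register example, are hypothetical.

```python
from typing import Dict, List, Optional, Tuple

# One local data path: (initial register, final register, min delay, max delay).
Path = Tuple[str, str, float, float]


def feasible_schedule(paths: List[Path], period: float,
                      setup: float = 0.0, hold: float = 0.0) -> Optional[Dict[str, float]]:
    """Return clock delays t_k satisfying all constraints at `period`, or None."""
    # A constraint t_a - t_b <= c is stored as an edge b -> a with weight c.
    edges = []
    for i, f, d_min, d_max in paths:
        edges.append((f, i, period - d_max - setup))  # setup: t_i - t_f <= T - d_max - setup
        edges.append((i, f, d_min - hold))            # hold:  t_f - t_i <= d_min - hold
    registers = {r for i, f, _, _ in paths for r in (i, f)}
    # Starting every value at 0 acts like a virtual source tied to all registers.
    t = {r: 0.0 for r in registers}
    for _ in range(len(registers)):
        changed = False
        for u, v, w in edges:
            if t[u] + w < t[v] - 1e-12:
                t[v] = t[u] + w
                changed = True
        if not changed:
            return t
    return None  # still relaxing: a negative cycle exists, so no schedule at this period


def min_clock_period(paths: List[Path], setup: float = 0.0, hold: float = 0.0,
                     tol: float = 1e-6) -> Optional[Tuple[float, Dict[str, float]]]:
    """Binary search for the smallest feasible period; returns (T, schedule) or None."""
    # Upper bound loose enough that any schedule allowed by the hold constraints meets setup.
    hi = max(p[3] for p in paths) + setup + sum(abs(p[2] - hold) for p in paths)
    if feasible_schedule(paths, hi, setup, hold) is None:
        return None  # the hold constraints alone are contradictory
    lo = 0.0
    while hi - lo > tol:
        mid = (lo + hi) / 2.0
        if feasible_schedule(paths, mid, setup, hold) is None:
            lo = mid
        else:
            hi = mid
    return hi, feasible_schedule(paths, hi, setup, hold)


# Hypothetical two-register loop: R1 -> R2 with delays [6, 10], R2 -> R1 with delays [2, 4].
example_paths = [("R1", "R2", 6.0, 10.0), ("R2", "R1", 2.0, 4.0)]
zero_skew_period = max(d_max for _, _, _, d_max in example_paths)
period, schedule = min_clock_period(example_paths)
print(zero_skew_period, round(period, 3), schedule)
```

For this hypothetical loop the zero clock skew period is 10, while the search converges to a period of about 7, which is the average of the two maximum path delays around the loop. Each partition LP and the top block LP pose constraints of the same kind on a smaller set of registers.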
The run times of the sequential solution of partitions method are denoted by $t_{sequen}$ and the run times of the parallel solution of partitions method are denoted by $t_{paral}$. The feasibility of each circuit when solved with the presented heuristic method is shown in the column labeled “Feasibility”. The minimum clock periods computed via each of the three methods (when feasible) are identical and equal to the values reported in Tables 11.1 and 11.6 under the columns $T^{CSS}_{FF}$. These minimum clock periods (presented in Tables 11.1 and 11.6) provide an average of 30% improvement over conventional zero clock skew, edge-triggered circuits (28% is reported in Table 11.1 due to the selected accuracy of the tabular environment). In the presented methodology, the target is to improve the run times of clock skew scheduling without degrading these clock period improvements. Accordingly, the run times in Table 11.7 are reported in order to demonstrate the speedups achievable through partitioning and parallel application of clock skew scheduling.

The selected suite of ISCAS’89 benchmark circuits and industrial1 are partitioned into a 2x2 partition using chaco. The partition and top block LP problems are generated. First, the generated LP problems are solved on a single workstation in sequential order. The observed run times $t_{sequen}$ record speedups over the conventional clock skew scheduling application due to partitioning. Second, the generated LP problems are solved on the Xgrid computing cluster in parallel as described in Section 10.3. The observed run times $t_{paral}$ record speedups over the conventional clock skew scheduling application due to partitioning and parallelization of the application. Note that the application of clock skew scheduling to industrial1 using the conventional clock skew scheduling method is not possible, thus run times are not reported.

[Figure: flow chart. Inputs: DEF, LEF, SDF, BENCH. Partitioning on a 2x2 grid (chaco), then register insertion, then clock skew scheduling. The partition problems LP1 to LP4 are solved with GLPK on Xgrid, giving T1 to T4. Choose max(T1, T2, T3, T4). The top block LP is solved with GLPK under T >= max(T1, T2, T3, T4), giving min T and the optimal t_i. LP1 to LP4 are solved again with GLPK on Xgrid, with T = min T and t_i = optimal t_i. CSS feasible? If NO: 1) re-iteration, 2) constraining boundary vertices, 3) delay padding. If YES: placement (register mapping, logic placement).]
Fig. 11.16. CAD tool flow.

Table 11.7. Clock skew scheduling results on 2x2 partitioned ISCAS’89 circuits. Run times of the clock skew scheduling (CSS) step are in seconds; run time improvements (RTI) are in percent.

Circuit      r      p        t_conven  t_sequen  t_paral  RTI_sequen  RTI_paral  Feasibility
s27          3      4        0         0         0        0           0          yes
s208.1       8      28       0         0         0        0           0          yes
s298         14     54       0         0         0        0           0          yes
s344         15     68       0         0         0        0           0          yes
s349         15     68       0         0         0        0           0          yes
s382         21     113      0         0         0        0           0          yes
s386         6      15       0         0         0        0           0          yes
s400         21     113      0         0         0        0           0          yes
s420.1       16     120      0         0         0        0           0          no
s444         16     113      0         0         0        0           0          yes
s510         6      15       0         0         0        0           0          yes
s526         21     117      0         0         0        0           0          yes
s526n        21     117      0         0         0        0           0          yes
s641         19     81       0         0         0        0           0          no
s713         19     81       0         0         0        0           0          no
s820         5      10       1         1         1        0           0          yes
s832         5      10       0         0         0        0           0          yes
s838.1       32     496      2         0         0        0           100        no
s938         32     496      1         1         1        0           0          no
s953         29     135      0         0         0        0           0          yes
s967         29     135      0         0         0        0           0          yes
s991         19     51       0         0         0        0           0          yes
s1196        18     20       0         0         0        0           0          no
s1238        18     20       0         0         0        0           0          no
s1423        74     1471     21        6         3        71          86         yes
s1488        6      15       0         0         0        0           0          yes
s1494        6      15       0         0         0        0           0          yes
s1512        57     415      1         0         0        100         100        yes
s3271        116    789      4         2         1        50          75         no
s3330        132    514      2         2         1        0           50         no
s3384        183    1759     22        4         3        82          86         yes
s4863        104    620      2         0         0        100         100        yes
s5378        179    1147     9         5         2        44          78         no
s6669        239    2138     33        10        7        30          79         no
s9234        228    247      52        15        8        71          85         no
s9234.1      211    2342     47        12        5        74          89         yes
s13207       669    3068     86        17        10       80          88         yes
s15850       597    14257    3545      735       447      79          87         no
s15850.1     534    10830    1358      156       110      89          92         yes
s35932       1728   4187     101       38        13       62          87         no
s38417       1636   28082    7707      3780      1845     51          76         yes
s38584       1452   15545    1394      749       339      46          76         yes
industrial1  14031  3692878  n/a       34680     25680    n/a         n/a        no
Average                                                   25          28

It is observed from Table 11.7 that $t_{paral}$ is consistently and significantly (especially for large scale circuits) superior to $t_{sequen}$ and $t_{conven}$. Similarly, $t_{sequen}$ is consistently superior to $t_{conven}$. The run time improvements from $t_{conven}$ to $t_{sequen}$ and from $t_{conven}$ to $t_{paral}$ are listed under $RTI_{sequen}$ and $RTI_{paral}$, respectively. The improvements are computed with the formula $[(t_{old} - t_{new})/t_{old} \times 100]$. On the ISCAS’89 benchmark circuits, the average run time improvement via partitioning ($RTI_{sequen}$) is 25%. The average run time improvement via partitioning and parallel application of clock skew scheduling ($RTI_{paral}$) is 28%. The circuits for which the method is infeasible are not considered in the computations of the average improvement. Overall, the application of clock skew scheduling to partitions is feasible for 28 (65%) of the total 43 circuits, whereas this method is not applicable to the remaining 15 circuits (35%). For these 15 circuits, the alternative methods described in Section 10.2.4 can be used.

11.5.2 Overall CAD Tool Results
In this section, the run times of hpictiming on the benchmark circuits are analyzed to profile the speedups gained in overall program execution due to partitioning and parallelization. In particular, the speedups available through solving the partition problems sequentially and in parallel are computed using the speedup formula presented in (10.5). Table 11.8 presents the speedup results of the hpictiming tool on the ISCAS’89 benchmark and industrial1 circuits. In Table 11.8, the number of registers and paths of each circuit are shown with r and p, respectively. Run times of the hpictiming tool operated with various clock skew scheduling methods on the ISCAS’89 benchmark circuits are shown. Run times of hpictiming with the conventional clock skew scheduling method of Table 5.1 are denoted by $t^{hpictiming}_{conven}$. The run times with the sequential solution of partitions method are denoted by $t^{hpictiming}_{sequen}$ and the run times with the parallel solution of partitions method are denoted by $t^{hpictiming}_{paral}$. In Table 11.8, the speedups due to partitioning and sequential application of clock skew scheduling to 2x2 partitions of the circuits are denoted by $speedup_{sequen}$.
The speedup $speedup_{sequen}$ is computed with the following formula:

$$speedup_{sequen} = \frac{t^{hpictiming}_{conven}}{t^{hpictiming}_{sequen}}. \qquad (11.8)$$

The speedups due to partitioning and application of clock skew scheduling in parallel are denoted by $speedup_{paral}$. The speedup $speedup_{paral}$ is computed with the following formula:

$$speedup_{paral} = \frac{t^{hpictiming}_{conven}}{t^{hpictiming}_{paral}}. \qquad (11.9)$$
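As a quick check of the two metrics, the sketch below applies the run time improvement formula of Section 11.5.1 and the speedup ratios (11.8) and (11.9) to three rows copied from Tables 11.7 and 11.8. The function and variable names are hypothetical. Rows whose reported run time rounds to 0 seconds are left out, since the ratios are undefined for them.

```python
def run_time_improvement(t_old: float, t_new: float) -> float:
    """Percentage run time improvement, (t_old - t_new) / t_old * 100."""
    return (t_old - t_new) / t_old * 100.0


def speedup(t_conven: float, t_new: float) -> float:
    """Speedup ratio of (11.8) and (11.9): conventional run time over new run time."""
    return t_conven / t_new


# (t_conven, t_sequen, t_paral) in seconds for the scheduling step, from Table 11.7.
css_step_times = {
    "s1423": (21, 6, 3),
    "s15850.1": (1358, 156, 110),
    "s38417": (7707, 3780, 1845),
}

# (t_conven, t_sequen, t_paral) in seconds for the whole hpictiming run, from Table 11.8.
hpictiming_times = {
    "s1423": (22, 7, 4),
    "s15850.1": (1385, 185, 138),
    "s38417": (7881, 3958, 2021),
}

for name, (t_conven, t_sequen, t_paral) in css_step_times.items():
    rti_sequen = round(run_time_improvement(t_conven, t_sequen))
    rti_paral = round(run_time_improvement(t_conven, t_paral))
    print(f"{name}: RTI_sequen = {rti_sequen}%, RTI_paral = {rti_paral}%")

for name, (t_conven, t_sequen, t_paral) in hpictiming_times.items():
    s_sequen = round(speedup(t_conven, t_sequen), 1)
    s_paral = round(speedup(t_conven, t_paral), 1)
    print(f"{name}: speedup_sequen = {s_sequen}x, speedup_paral = {s_paral}x")
```

The rounded values agree with the corresponding entries of the two tables for these three circuits.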
Recall from Section 11.5.1 that the application of clock skew scheduling with partitioning is not feasible for some of the ISCAS’89 benchmark circuits and the industrial circuit industrial1. The circuits for which the method is not applicable are not considered in the computation of average speedups. Still, the speedup numbers are presented individually for all the ISCAS’89 benchmark circuits and the industrial circuit industrial1 in Table 11.8.

Table 11.8. Speedup of hpictiming on 2x2 partitioned ISCAS’89 circuits. Run times of hpictiming are in seconds.

Circuit      r      p        t_conven  t_sequen  t_paral  speedup_sequen  speedup_paral
s27          3      4        0         0         0        n/a             n/a
s208.1       8      28       0         0         0        n/a             n/a
s298         14     54       0         0         0        n/a             n/a
s344         15     68       0         0         0        n/a             n/a
s349         15     68       1         1         1        1.0x            1.0x
s382         21     113      0         0         0        n/a             n/a
s386         6      15       0         0         0        n/a             n/a
s400         21     113      0         0         0        n/a             n/a
s420.1       16     120      0         0         0        n/a             n/a
s444         16     113      1         1         1        1.0x            1.0x
s510         6      15       1         1         1        1.0x            1.0x
s526         21     117      0         0         0        n/a             n/a
s526n        21     117      1         1         1        1.0x            1.0x
s641         19     81       1         1         1        1.0x            1.0x
s713         19     81       0         0         0        n/a             n/a
s820         5      10       1         1         1        1.0x            1.0x
s832         5      10       0         0         0        n/a             n/a
s838.1       32     496      3         1         1        3.0x            3.0x
s938         32     496      1         1         1        1.0x            1.0x
s953         29     135      1         1         1        1.0x            1.0x
s967         29     135      1         1         1        1.0x            1.0x
s991         19     51       1         1         1        1.0x            1.0x
s1196        18     20       0         0         0        n/a             n/a
s1238        18     20       1         1         1        1.0x            1.0x
s1423        74     1471     22        7         4        3.1x            5.5x
s1488        6      15       1         1         1        1.0x            1.0x
s1494        6      15       1         1         1        1.0x            1.0x
s1512        57     415      2         1         1        2.0x            2.0x
s3271        116    789      6         4         3        2.5x            2.0x
s3330        132    514      2         2         1        1.0x            2.0x
s3384        183    1759     25        7         6        3.6x            4.2x
s4863        104    620      6         4         4        2.5x            2.5x
s5378        179    1147     15        11        8        1.4x            1.9x
s6669        239    2138     40        17        14       2.4x            2.9x
s9234        228    247      60        23        16       2.6x            3.8x
s9234.1      211    2342     53        18        11       2.9x            4.8x
s13207       669    3068     105       36        29       2.9x            3.6x
s15850       597    14257    3757      947       659      4.0x            5.7x
s15850.1     534    10830    1385      185       138      7.5x            10.0x
s35932       1728   4187     313       250       225      1.3x            1.4x
s38417       1636   28082    7881      3958      2021     2.0x            3.9x
s38584       1452   15545    1615      1022      611      1.6x            2.6x
industrial1  14031  3692878  n/a       36062     27046    n/a             n/a
Average                                                   2.1x            2.6x

It is observed from Table 11.8 that the average speedup in hpictiming run time due to partitioning is 2.1x. If the partitioned LP problems are solved in parallel, the average speedup is 2.6x. It is intuitive that as the size of a circuit increases, the clock skew scheduling step of hpictiming, which is the fraction of the task that is enhanced with partitioning and parallelization, increases as well. So, for larger circuits, higher values of speedup are expected through partitioning and parallelization. Indeed, such a trend is observed in Table 11.8.

Speedup (10.5) is further investigated on several of the benchmark circuits. The execution of hpictiming is divided into three main steps: Read-in, Partitioning and Scheduling. The Read-in step consists of reading the input data and identifying the local data paths. The Partitioning step consists of the timing-driven partitioning procedure implemented with chaco, discussed in Section 10.2.2. The Scheduling step consists of the application of clock skew scheduling to the generated partitions. Figure 11.17 illustrates the relative run time lengths of each step for several ISCAS’89 benchmark circuits and the industrial circuit industrial1 for the parallel application of clock skew scheduling. The ISCAS’89 benchmark circuits whose total run times are below a certain limit are not included in the analysis. This selection eliminates the inaccuracies due to rounding errors in run times, which are most prominent for circuits with a run time below a few seconds. Although the solution for industrial1 is infeasible, the reported run times are believed to be a good approximation of what they would have been if all the subpartitions had been feasible. The total run time of the hpictiming program (with parallel application of clock skew scheduling) is reported in Table 11.8 under the column $t^{hpictiming}_{paral}$.

[Figure: stacked bars of run time in seconds (axis from 0 to 30000), split into Read-In, Partitioning and Scheduling, for s15850.1, s38417, s38584 and Industrial1.]
Fig. 11.17. The run times of hpictiming with Xgrid on large circuits.

The breakdown of run times into the three steps of hpictiming is shown for the three largest circuits, s38584, s38417 and industrial1. The run times are shown in Figures 11.18, 11.19 and 11.20 for s38584, s38417 and industrial1, respectively. The run times for three application methods (conventional, sequential and parallel application of clock skew scheduling) are shown for each circuit. The run times for each step of hpictiming are shown with color codes, listed as read-in, partitioning and scheduling steps from bottom to top for each data bar. The partitioning step is not required in the conventional application method, and thus is not shown on the run time bar in the figures for the conventional application cases. Even for methods where partitioning is necessary, the partitioning stage of the run time bar is not visible, because the run times for the partitioning process with chaco are very small compared to the rest of the execution time. Note that the run times of the read-in and partitioning (where applied) steps are identical in all three application methods. Through partitioning and application of clock skew scheduling in parallel, the run time of the clock skew scheduling step of the hpictiming program is improved. This improvement speeds up the hpictiming program, the results of which are presented in Table 11.8.

[Figure: stacked bars of run time in seconds (axis from 0 to 2000), split into Read-In, Partitioning and Scheduling, for the Conventional, Sequential and Parallel methods.]
Fig. 11.18. Run time breakdown of hpictiming program steps for s38584.

[Figure: stacked bars of run time in seconds (axis from 0 to 10000), split into Read-In, Partitioning and Scheduling, for the Conventional, Sequential and Parallel methods.]
Fig. 11.19. Run time breakdown of hpictiming program steps for s38417.

[Figure: stacked bars of run time in seconds (axis from 0 to 40000), split into Read-In, Partitioning and Scheduling, for the Sequential and Parallel methods.]
Fig. 11.20. Run time breakdown of hpictiming program steps for industrial1.
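The observation that larger circuits gain more, because the scheduling step accounts for a larger share of the total run time, can be illustrated with an Amdahl-type relation between the enhanced fraction of a task and the overall speedup. The sketch below assumes that the speedup expression (10.5), which is not reproduced in this section, has this usual form. The fractions and the 4x scheduling speedup are hypothetical values chosen only for illustration and are not measurements from Figures 11.17 through 11.20.

```python
def overall_speedup(fraction_enhanced: float, speedup_enhanced: float) -> float:
    """Amdahl-type overall speedup when only part of the run time is accelerated.

    fraction_enhanced: share of the original run time spent in the accelerated step
                       (here, the clock skew scheduling step), between 0 and 1.
    speedup_enhanced:  speedup of that step alone (must be positive).
    """
    if not 0.0 <= fraction_enhanced <= 1.0:
        raise ValueError("fraction_enhanced must lie in [0, 1]")
    if speedup_enhanced <= 0.0:
        raise ValueError("speedup_enhanced must be positive")
    return 1.0 / ((1.0 - fraction_enhanced) + fraction_enhanced / speedup_enhanced)


# Hypothetical shares of run time spent in scheduling, for a small and a large circuit.
for fraction in (0.5, 0.9):
    print(f"scheduling share {fraction:.0%}: overall speedup "
          f"{overall_speedup(fraction, 4.0):.2f}x for a 4x faster scheduling step")
```

With the same 4x gain in the scheduling step, the overall speedup under this assumption grows from 1.6x to about 3.1x as the scheduling share rises from 50% to 90%. The read-in and partitioning steps are not accelerated, so they bound the overall speedup from above.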
12 Conclusions
The physical limitations of materials and lithography in nanoscale CMOS design have significantly affected the IC design flow. Some previously ignored material behavior leads to design-altering phenomena at such small dimensions. Non-zero clock skew is a prime example of such a change; previously negligible, clock skew and jitter have combined to occupy up to 10% of useful computation time within clock cycles. Clock design methodologies are being improved to limit skew and jitter as technology scales. Within such a dynamic and resourceful environment, where design techniques are evolving to adapt to nanoscale silicon implementations, Computer-Aided Design (CAD) tools are crucial to the successful development of novel design techniques and the sustainability of techniques currently in use. In this monograph, problem formulation and design automation of non-zero clock skew scheduling are described from a CAD perspective. The focus is on the algorithmic advances and automation of the application of non-zero clock skew scheduling at the large scale. Based on the efficacy and scalability of these automation principles, the non-zero clock skew system design concept can be adapted to mainstream physical design flow, permitting significant performance improvements. Two major items of importance within the adaptation are the scalability of clock skew scheduling methodologies and the practical application of these methodologies amid increasing circuit complexities. The parallel clock skew scheduling research described in this monograph is an important first step toward solving the scalability problem. Solving the issues in practical application for efficient nanoscale implementation requires a comprehensive treatment from the logic synthesis stage (e.g. resource allocation for non-zero clock skew systems) to the testing stage (e.g. scan clock insertion for non-zero clock skew systems) of the IC design flow.
In this monograph, the current state of knowledge in the formulation, automation and application of non-zero clock skew scheduling is presented. The topics are discussed not in an attempt to address all of the challenges listed above, but to present the existing knowledge and automation principles in application. Equipped with the knowledge presented in this monograph,
the application of clock skew scheduling to application-specific integrated circuits (ASICs) is possible. Furthermore, formulation characteristics (inclusive of known bottlenecks) are presented to establish a roadmap for non-zero clock skew scheduling researchers. In overview, the following topics are presented in this research monograph. First, in Chapters 1 through 4, preliminary information on VLSI circuit design and VLSI circuit timing is reviewed. In Chapter 5, the original linear programming (LP) formulation for clock skew scheduling of edge-triggered circuits is revisited. The implications of non-zero clock skew timing constraints on the clock tree synthesis step are reiterated. In Chapter 6, a linear programming formulation for the static timing analysis of level-sensitive circuits is described. This LP formulation is the first stand-alone formulation offered for the timing analysis of non-zero clock skew, level-sensitive circuits. The majority of the current static timing analyzers utilize iteration-based approaches to analyze the timing behavior of systems with latches. These iteration-based approaches are shown to converge to solutions relatively quickly for most circuits; however, they require algorithmic extensions for complex circuit topologies. The LP formulation presented in this monograph is topologically independent and is shown to operate with reasonable run times. Clock periods that are 27% shorter on average are obtained for non-zero clock skew, level-sensitive circuits over traditionally used zero clock skew, edge-sensitive circuits. Although level-sensitive circuits do not provide additional improvements in clock period over edge-sensitive circuits (approximately 28% to 30% shorter clock periods over zero clock skew, edge-sensitive circuits) for non-zero clock skew scheduling, they provide improvements in area savings.
Also in Chapter 6, an LP automation framework is presented in order to analyze advanced multi-phase synchronization methodologies with non-zero clock skew. These multi-phase synchronization analyses are performed in order to provide design and analysis methods to address synchronous circuit design with emerging clocking technologies, some of which entail multi-phase synchronization schemes. For instance, the resonant rotary clocking technology provides an improved clock distribution network which satisfies the complex synchronization requirements of high-performance synchronous circuits by using multi-phase, non-zero clock skew clocking. The presented timing analysis method efficiently captures the behavior of multi-phase, non-zero clock skew circuits in a fully-automated fashion. The experiments performed on ISCAS’89 benchmark circuits demonstrate that multi-phase synchronization can actually be advantageous in terms of circuit speed, despite the increased path delays due to latch insertion per each clock phase. Such a fact is contrary to common wisdom, which has over the years been suggested for zero clock skew systems. Approximately 17.7% and 12.0% shorter clock periods are obtained on average over zero clock skew, edge-sensitive circuits for three-phase and four-phase synchronization schemes, respectively. In Chapter 7, an effective quadratic programming formulation to improve the tolerance of circuits to process parameter variations is presented. The
variations (manufacturing and environmental) are becoming increasingly dominant with semiconductor technology scaling. As researchers are working on design-for-manufacturing (DFM) techniques, regular design fabrics (semiconductor and nanoarchitecture levels) and timing tools to accurately model the statistical behavior, multi-phase operation and clock skew scheduling can also be used effectively to circumvent the hazards caused by such variations. In experiments, safety factors are maximized for ISCAS’89 circuits with run times of less than 30 minutes. The scalability of the QP formulation is a concern for increasing circuit sizes, similar to, but more so than, the LP formulations. In Chapter 8, the optimal clock schedules and data propagation times of a circuit are analyzed after clock skew scheduling. With these analyses, the theoretical limits of improvement in the minimum clock period achievable through clock skew scheduling are identified. Traditionally, it has been considered that the data path cycles and delay uncertainties are the only limiting factors on the minimum clock period achievable through clock skew scheduling. As shown recently, the reconvergent data paths also introduce theoretical limits on the minimum achievable clock period through clock skew scheduling. This limitation is mitigated by the delay insertion method, leading to clock periods that are 10% and 9% shorter on average than those of conventional clock skew scheduling techniques for edge-sensitive and level-sensitive circuits, respectively. In mainstream digital circuit design flow, delay insertion is commonly used as a post-processing step in order to solve the short-path (hold time) violations. The drawbacks of delay insertion, such as increased circuit area and power consumption, are mainly disregarded in favor of the feasibility of the timing schedules.
Similarly, for the presented design principles, the drawbacks of delay insertion are considered tolerable in favor of the improvement in circuit performance. In Chapter 9, the practical considerations in the implementation of the design automation algorithms as well as the clock skew scheduling methodologies are discussed. Three different implementations of the QP-based algorithm are demonstrated for varying objectives, which might be selected based on practical limitations or necessities. The implementation of non-zero clock skew scheduling on intellectual-property (IP) blocks within an ASIC is also shown, in order to address a potential practical limitation in timing. The proposed design strategy suggests a zero clock skew implementation at the I/O registers for easy synchronization of the IP within the ASIC. The additional constraint on the clock skew at the I/O registers limits the level of improvement through clock skew scheduling, but simplifies the timing relationship between communicating IPs. Alternative implementations and strategies in synchronizing multiple non-zero clock skew IPs are indeed feasible. In Chapter 10, the integration of the presented timing and synchronization methodologies into the physical design flow of circuits synchronized with rotary clocking technology is described. Rotary clocking technology is a type of resonant clocking technology, which provides controllable skew, low-jitter, gigahertz range clocking with fast transition times and low power consumption. Rotary clocking technology also permits non-zero clock skew operation and multi-phase synchronization of systems. In the presented discussion, the development of the physical design flow for rotary clock synchronized circuits is described. The physical design flow consists of a novel partitioning step in order to generate partitions of the circuit netlist on which clock skew scheduling can be applied individually. The potential to parallelize the application of clock skew scheduling is explored. Partitioning and the parallelization of the application of clock skew scheduling are shown to provide significant speedups in run times of the timing analysis. Over the ISCAS’89 benchmark circuits, an average speedup of 2.6x is observed over four (4) processors. When applicable, clock skew scheduling of partitions significantly improves the scalability of clock skew scheduling. In summary, this monograph presents valuable timing and synchronization methodologies for non-zero clock skew scheduling and their automation methods. The timing and synchronization methodologies are proposed particularly for the non-zero clock skew operation of high-performance digital VLSI integrated circuits. Various algorithms and blueprints for methodology development are presented, which include algorithms for circuits with edge-sensitive registers vs. level-sensitive registers, circuits synchronized by a single clock phase scheme vs. multi-phase clocking schemes, and algorithms modifying the clock distribution network only vs. modifying the clock distribution network simultaneously with the logic network. Theoretical limitations of improvement achievable through non-zero clock skew scheduling are presented for the proposed algorithms and methodologies.
References
1. E. G. Friedman, Performance Limitations in Synchronous Digital Systems. Ph.D. thesis, University of California, Irvine, California, 1989. Abstract published in Dissertations Abstracts International, Volume 50, Number 7, p. 3067B, January 1990.
2. J. P. Fishburn, “Clock Skew Optimization,” IEEE Transactions on Computers, Vol. C-39, pp. 945–951, July 1990.
3. J. S. Kilby, “Invention of the Integrated Circuit,” IEEE Transactions on Electron Devices, Vol. ED-23, pp. 648–654, July 1976.
4. G. E. Moore, “Cramming more Components onto Integrated Circuits,” Proceedings of the IEEE, Vol. 86, pp. 82–85, January 1998.
5. D. C. Pham, D. Boerstler, M. Bolliger, R. Chaudhry, D. Cox, P. Harvey, P. Hofstee, C. Johns, J. Kahle, A. Kameyama, J. Keaty, Y. Masubuchi, M. Pham, J. Pille, S. Posluzsny, M. Riley, D. Stasiak, M. Suzuoki, O. Takahashi, J. Warnock, S. Weitzel, D. Wendel, and K. Yazawa, “Overview of the architecture, circuit design, and physical implementation of a first-generation cell processor,” IEEE Journal of Solid-State Circuits, Vol. 41, pp. 179–196, January 2006.
6. S. Rusu, S. Tam, H. Muljono, D. Ayers, J. Chang, B. Cherkauer, J. Stinson, J. Benoit, R. Varada, J. Leung, R. Limaye, and S. Vora, “A 65 nm Dual-Core Multithreaded Xeon Processor with 16-MB L3 Cache,” IEEE Journal of Solid-State Circuits, Vol. 42, pp. 17–25, January 2007.
7. A. S. Leon, K. W. Tam, J. L. Shin, D. Weisner, and F. Schumacher, “A Power efficient High-Throughput 32-Thread SPARC processor,” IEEE Journal of Solid-State Circuits, Vol. 42, pp. 7–16, January 2007.
8. H. B. Bakoglu, Circuits, Interconnections, and Packaging for VLSI. Addison-Wesley Publishing Company, 1990.
9. E. G. Friedman, Clock Distribution Networks in VLSI Circuits and Systems. IEEE Press, 1995.
10. N. W. Weste and D. Harris, Principles of CMOS VLSI Design: A Systems Perspective. Addison-Wesley Publishing Company, Reading, MA, 3rd ed., 2004.
11. D. D. Gajski, Silicon Compilation. Addison-Wesley Publishing Company, Reading, MA, 1988.
12. Z. Kohavi, Switching and Finite Automata Theory. McGraw-Hill Book Company, New York, NY, 2nd ed., 1978.
13. F. J. Hill and G. R. Peterson, Computer Aided Logical Design (with emphasis on VLSI). John Wiley & Sons, Inc., 4th ed., 1993.
14. J. P. Uyemura, Introduction to VLSI Circuits and Systems. Wiley Publishing, 2001.
15. S.-M. Kang and Y. Leblebici, CMOS Digital Integrated Circuits: Analysis and Design. The McGraw-Hill Companies, Inc., 3rd ed., 2002.
16. J. M. Rabaey, A. Chandrakasan, and B. Nikolic, Digital Integrated Circuits: A Design Perspective. Prentice-Hall, Inc., Upper Saddle River, NJ, 2nd ed., 2002.
17. C. Mead and L. Conway, Introduction to VLSI Systems. Addison-Wesley Publishing Company, Reading, MA, 1980.
18. F. Anceau, “A Synchronous Approach for Clocking VLSI Systems,” IEEE Journal of Solid-State Circuits, Vol. SC-17, pp. 51–56, February 1982.
19. M. Afghani and C. Svensson, “A Unified Clocking Scheme for VLSI Systems,” IEEE Journal of Solid-State Circuits, Vol. SC-25, pp. 225–233, February 1990.
20. S. H. Unger and C.-J. Tan, “Clocking Schemes for High-Speed Digital Systems,” IEEE Transactions on Computers, Vol. C-35, pp. 880–895, October 1986.
21. G. Y. Yacoub, H. Pham, M. Ma, and E. G. Friedman, “A System for Critical Path Analysis Based on Back Annotation and Distributed Interconnect Impedance Models,” Microelectronics Journal, Vol. 19, pp. 21–30, May/June 1988.
22. H. Shichman and D. A. Hodges, “Modeling and Simulation of Insulated-Gate Field-Effect Transistor Switching Circuits,” IEEE Journal of Solid-State Circuits, Vol. SC-3, pp. 285–289, September 1968.
23. N. Hedenstierna and K. O. Jeppson, “CMOS Circuit Speed and Buffer Optimization,” IEEE Transactions on Computer-Aided Design, Vol. CAD-6, pp. 270–281, March 1987.
24. M. R. C. M. Berkelaar and J. A. G. Jess, “Gate Sizing in MOS Digital Circuits with Linear Programming,” Proceedings of the European Design Automation Conference, pp. 217–221, March 1990.
25. O. Coudert, “Gate Sizing for Constrained Delay/Power/Area Optimization,” IEEE Transactions on Very Large Scale Integration (VLSI) Systems, Vol. VLSI-5, pp. 465–472, December 1997.
26. U. Ko and P. T. Balsara, “Short-Circuit Power Driven Gate Sizing Technique for Reducing Power Dissipation,” IEEE Transactions on Very Large Scale Integration (VLSI) Systems, Vol. VLSI-3, pp. 450–455, September 1995.
27. S. R. Vemuru and N. Scheinberg, “Short-Circuit Power Dissipation Estimation for CMOS Logic Gates,” IEEE Transactions on Circuits and Systems I: Fundamental Theory and Applications, Vol. 41, pp. 762–765, November 1994.
28. H. J. Veendrick, “Short-Circuit Dissipation of Static CMOS Circuitry and its Impact on the Design of Buffer Circuits,” IEEE Journal of Solid-State Circuits, Vol. SC-19, pp. 468–473, August 1984.
29. A. S. Sedra and K. C. Smith, Microelectronic Circuits. Oxford University Press, 4th ed., 1997.
30. T. Sakurai and A. R. Newton, “Alpha-power Law MOSFET Model and its Applications to CMOS Inverter Delay and Other Formulas,” IEEE Journal of Solid-State Circuits, Vol. SC-25, pp. 584–594, April 1990.
31. A. I. Kayssi, K. A. Sakallah, and T. M. Burks, “Analytical Transient Response of CMOS Inverters,” IEEE Transactions on Circuits and Systems I: Fundamental Theory and Applications, Vol. CAS I-39, pp. 42–45, January 1992.
32. E. G. Friedman, ed., High Performance Clock Distribution Networks. Kluwer Academic Publishers, Norwell, Massachusetts, 1997.
33. H. B. Bakoglu and J. D. Meindl, “Optimal Interconnection Circuits for VLSI,” IEEE Transactions on Electron Devices, Vol. ED-32, pp. 903–909, May 1985.
34. A. Wilnai, “Open-Ended RC Line Model Predicts MOSFET IC Response,” Electronic Design News, pp. 53–54, December 1971.
35. T. Sakurai, “Approximation of Wiring Delay in MOSFET LSI,” IEEE Journal of Solid-State Circuits, Vol. SC-18, pp. 418–426, August 1983.
36. S. R. Vemuru and A. R. Thorbjornsen, “Variable-Taper CMOS Buffer,” IEEE Journal of Solid-State Circuits, Vol. SC-26, pp. 1265–1269, September 1991.
37. C. Prunty and L. Gal, “Optimum Tapered Buffer,” IEEE Journal of Solid-State Circuits, Vol. SC-27, pp. 118–119, January 1992.
38. N. Hedenstierna and K. O. Jeppson, “Comments on the Optimum CMOS Tapered Buffer Problem,” IEEE Journal of Solid-State Circuits, Vol. SC-29, pp. 155–158, February 1994.
39. B. S. Cherkauer and E. G. Friedman, “Channel Width Tapering of Serially Connected MOSFET’s with Emphasis on Power Dissipation,” IEEE Transactions on Very Large Scale Integration (VLSI) Systems, Vol. VLSI-2, pp. 100–114, March 1994.
40. B. S. Cherkauer and E. G. Friedman, “Design of Tapered Buffers with Local Interconnect Capacitance,” IEEE Journal of Solid-State Circuits, Vol. SC-30, pp. 151–155, February 1995.
41. B. S. Cherkauer and E. G. Friedman, “A Unified Design Methodology for CMOS Tapered Buffers,” IEEE Transactions on Very Large Scale Integration (VLSI) Systems, Vol. VLSI-3, pp. 99–111, March 1995.
42. V. Adler and E. G. Friedman, “Repeater Insertion to Reduce Delay and Power in RC Tree Structures,” Proceedings of the Asilomar Conference on Signals, Systems, and Computers, pp. 749–752, November 1997.
43. V. Adler and E. G. Friedman, “Delay and Power Expressions for a CMOS Inverter Driving a Resistive-Capacitive Load,” Proceedings of the IEEE International Symposium on Circuits and Systems, pp. 4.101–4.104, May 1996.
44. V. Adler and E. G. Friedman, “Repeater Design to Reduce Delay and Power in Resistive Interconnect,” Proceedings of the IEEE International Symposium on Circuits and Systems, pp. 2148–2151, June 1997.
45. V. Adler and E. G. Friedman, “Timing and Power Models for CMOS Repeaters Driving Resistive Interconnect,” Proceedings of the IEEE ASIC Conference, pp. 201–204, September 1996.
46. C. J. Alpert and A. Devgan, “Wire Segmenting for Improved Buffer Insertion,” Proceedings of the IEEE/ACM Design Automation Conference, pp. 588–593, June 1997.
47. V. E. Adler and E. G. Friedman, “Repeater Design to Reduce Delay and Power in Resistive Interconnect,” IEEE Transactions on Circuits and Systems II: Analog and Digital Signal Processing, Vol. CAS II-45, pp. 607–616, May 1998.
48. I. E. Sutherland, “Micropipelines,” Communications of the ACM, Vol. 32, pp. 720–738, June 1989.
49. J. M. Rabaey, Digital Integrated Circuits: A Design Perspective. Prentice Hall, Inc., 1996.
50. R. H. Krambeck, C. M. Lee, and H.-F. S. Law, “High Speed Compact Circuits with CMOS,” IEEE Journal of Solid-State Circuits, Vol. SC-17, pp. 614–619, June 1982.
51. V. Friedman and S. Liu, “Dynamic Logic CMOS Circuits,” IEEE Journal of Solid-State Circuits, Vol. SC-19, pp. 263–266, April 1984.
52. N. F. Goncalves and H. J. DeMan, “NORA: A Racefree Dynamic CMOS Technique for Pipelined Logic Structures,” IEEE Journal of Solid-State Circuits, Vol. SC-18, pp. 261–266, June 1983.
53. C. M. Lee and E. W. Szeto, “Zipper CMOS,” IEEE Circuits and Systems Magazine, pp. 10–16, May 1986.
54. L. G. Heller, W. R. Griffin, J. W. Davis, and N. G. Thoma, “Cascade Voltage Switch Logic: A Differential CMOS Logic Family,” Proceedings of the IEEE International Solid State Circuits Conference, pp. 16–17, February 1984.
55. T. A. Grotjohn and B. Hoefflinger, “Sample-Set Differential Logic (SSDL) for Complex High-Speed VLSI,” IEEE Journal of Solid-State Circuits, Vol. SC-21, pp. 367–369, April 1986.
56. L. C. M. Pfennings, W. G. J. Mol, J. J. J. Bastiens, and J. M. F. V. Dijk, “Differential Split-Level CMOS Logic for Subnanosecond Speed,” IEEE Journal of Solid-State Circuits, Vol. SC-20, pp. 1050–1055, October 1985.
57. K. M. Chu and D. I. Pulfrey, “Design Procedures for Differential Cascode Voltage Switch Circuits,” IEEE Journal of Solid-State Circuits, Vol. SC-21, pp. 1082–1087, December 1986.
58. L. A. Glasser and D. W. Dobberpuhl, The Design and Analysis of VLSI Circuits. Addison-Wesley Publishing Company, 1985.
59. M. M. Mano and C. R. Kime, Logic and Computer Design Fundamentals. Prentice Hall, Inc., 1997.
60. W. Wolf, Modern VLSI Design: A Systems Approach. Prentice-Hall, Inc., 1994.
61. T. Kacprzak and A. Albicki, “Analysis of Metastable Operation in RS CMOS Flip-Flops,” IEEE Journal of Solid-State Circuits, Vol. SC-22, pp. 57–64, February 1987.
62. T. A. Jackson and A. Albicki, “Analysis of Metastable Operation in D Latches,” IEEE Transactions on Circuits and Systems I: Fundamental Theory and Applications, Vol. CAS I-36, pp. 1392–1404, November 1989.
63. E. G. Friedman, “Latching Characteristics of a CMOS Bistable Register,” IEEE Transactions on Circuits and Systems I: Fundamental Theory and Applications, Vol. CAS I-40, pp. 902–908, December 1993.
64. S. H. Unger, “Double-Edge-Triggered Flip-Flops,” IEEE Transactions on Computers, Vol. C-30, pp. 447–451, June 1981.
65. S.-L. Lu, “A Novel CMOS Implementation of Double-Edge-Triggered D-Flip-Flops,” IEEE Journal of Solid-State Circuits, Vol. SC-25, pp. 1008–1010, August 1990.
66. M. Afghani and J. Yuan, “Double-Edge-Triggered D-Flip-Flops for High-Speed CMOS Circuits,” IEEE Journal of Solid-State Circuits, Vol. SC-26, pp. 1168–1170, August 1991.
67. R. Hossain, L. Wronski, and A. Albicki, “Low Power Design Using Double Edge Triggered Flip-Flops,” IEEE Transactions on Very Large Scale Integration (VLSI) Systems, Vol. VLSI-2, pp. 261–265, June 1994.
68. G. M. Blair, “Low-Power Double-Edge Triggered Flip-Flop,” Electronics Letters, Vol. 33, pp. 845–847, May 1997.
69. M. R. Dagenais and N. C. Rumin, “On the Calculation of Optimal Clocking Parameters in Synchronous Circuits with Level-Sensitive Latches,” IEEE Transactions on Computer-Aided Design, Vol. CAD-8, pp. 268–278, March 1989.
70. I. Lin, J. A. Ludwig, and K. Eng, “Analyzing Cycle Stealing on Synchronous Circuits with Level-Sensitive Latches,” Proceedings of the ACM/IEEE Design Automation Conference, pp. 393–398, June 1992.
71. J. Lee, D. T. Tang, and C. K. Wong, “A Timing Analysis Algorithm for Circuits with Level-Sensitive Latches,” IEEE Transactions on Computer-Aided Design, Vol. CAD-15, pp. 535–543, May 1996.
72. T. G. Szymanski, “Computing Optimal Clock Schedules,” Proceedings of the ACM/IEEE Design Automation Conference, pp. 399–404, June 1992.
73. K. A. Sakallah, T. N. Mudge, and O. A. Olukotun, “checkTc and minTc: Timing Verification and Optimal Clocking of Synchronous Digital Circuits,” Proceedings of the IEEE/ACM International Conference on Computer-Aided Design, pp. 552–555, November 1990.
74. N. Shenoy, R. K. Brayton, and A. L. Sangiovanni-Vincentelli, “Minimum Padding to Satisfy Short Path Constraints,” Proceedings of the IEEE/ACM International Conference on Computer-Aided Design, pp. 156–161, November 1993.
75. K. A. Sakallah, T. N. Mudge, and O. A. Olukotun, “Analysis and Design of Latch-Controlled Synchronous Digital Circuits,” IEEE Transactions on Computer-Aided Design, Vol. CAD-11, pp. 322–333, March 1992.
76. S. Bothra, B. Rogers, M. Kellam, and C. M. Osburn, “Analysis of the Effects of Scaling on Interconnect Delay in ULSI Circuits,” IEEE Transactions on Electron Devices, Vol. ED-40, pp. 591–597, March 1993.
77. N. Gaddis and J. Lotz, “A 64-b Quad-Issue CMOS RISC Microprocessor,” IEEE Journal of Solid-State Circuits, Vol. SC-31, pp. 1697–1702, November 1996.
78. P. E. Gronowski et al., “A 433-MHz 64-bit Quad-Issue RISC Microprocessor,” IEEE Journal of Solid-State Circuits, Vol. SC-31, pp. 1687–1696, November 1996.
79. N. Vasseghi, K. Yeager, E. Sarto, and M. Seddighnezhad, “200-MHz Superscalar RISC Microprocessor,” IEEE Journal of Solid-State Circuits, Vol. SC-31, pp. 1675–1686, November 1996.
80. W. J. Bowhill et al., “Circuit Implementation of a 300-MHz 64-bit Second-generation CMOS Alpha CPU,” Digital Technical Journal, Vol. 7, No. 1, pp. 100–118, 1995.
81. J. L. Neves and E. G. Friedman, “Topological Design of Clock Distribution Networks Based on Non-Zero Clock Skew Specification,” Proceedings of the IEEE Midwest Symposium on Circuits and Systems, pp. 468–471, August 1993.
82. J. G. Xi and W. W.-M. Dai, “Useful-Skew Clock Routing With Gate Sizing for Low Power Design,” Proceedings of the ACM/IEEE Design Automation Conference, pp. 383–388, June 1996.
83. J. L. Neves and E. G. Friedman, “Design Methodology for Synthesizing Clock Distribution Networks Exploiting Non-Zero Localized Clock Skew,” IEEE Transactions on Very Large Scale Integration (VLSI) Systems, Vol. VLSI-4, pp. 286–291, June 1996.
84. M. A. B. Jackson, A. Srinivasan, and E. S. Kuh, “Clock Routing for High-Performance ICs,” Proceedings of the ACM/IEEE Design Automation Conference, pp. 573–579, June 1990.
85. R.-S. Tsay, “An Exact Zero-Skew Clock Routing Algorithm,” IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems, Vol. CAD-12, pp. 242–249, February 1993.
86. N.-C. Chou and C.-K. Cheng, “On General Zero-Skew Clock Net Construction,” IEEE Transactions on Very Large Scale Integration (VLSI) Systems, Vol. VLSI-3, pp. 141–146, March 1995.
87. N. Ito, H. Sugiyama, and T. Konno, “ChipPRISM: Clock Routing and Timing Analysis for High-Performance CMOS VLSI Chips,” Fujitsu Scientific and Technical Journal, Vol. 31, pp. 180–187, December 1995.
88. J. L. Neves and E. G. Friedman, “Optimal Clock Skew Scheduling Tolerant to Process Variations,” Proceedings of the ACM/IEEE Design Automation Conference, pp. 623–628, June 1996.
89. D. B. West, Introduction to Graph Theory. Prentice-Hall, 1996.
90. C. E. Leiserson and J. B. Saxe, “A Mixed-Integer Linear Programming Problem Which is Efficiently Solvable,” Journal of Algorithms, Vol. 9, pp. 114–128, March 1988.
91. T.-C. Lee and J. Kong, “The New Line in IC Design,” IEEE Spectrum, pp. 52–58, March 1997.
92. E. G. Friedman, “The Application of Localized Clock Distribution Design to Improving the Performance of Retimed Sequential Circuits,” Proceedings of the IEEE Asia-Pacific Conference on Circuits and Systems, pp. 12–17, December 1992.
93. I. S. Kourtev and E. G. Friedman, “Simultaneous Clock Scheduling and Buffered Clock Tree Synthesis,” Proceedings of the IEEE International Symposium on Circuits and Systems, pp. 1812–1815, June 1997.
94. T. M. Burks, K. A. Sakallah, and T. N. Mudge, “Critical Paths in Circuits with Level-Sensitive Latches,” IEEE Transactions on Very Large Scale Integration (VLSI) Systems, Vol. 3, pp. 273–291, June 1995.
95. I. S. Kourtev and E. G. Friedman, “A Quadratic Programming Approach to Clock Skew Scheduling for Reduced Sensitivity to Process Parameter Variations,” Proceedings of the IEEE ASIC/SOC Conference, 1999.
96. I. S. Kourtev and E. G. Friedman, Timing Optimization Through Clock Skew Scheduling. Kluwer Academic Publishers, 2000.
97. B. Taskin and I. S. Kourtev, “Linear Timing Analysis of SOC Synchronous Circuits with Level-Sensitive Latches,” Proceedings of the IEEE ASIC/SOC Conference, pp. 358–362, September 2002.
98. B. Taskin and I. S. Kourtev, “Performance Optimization of Single-Phase Level-Sensitive Circuits Using Time Borrowing and Clock Skew Scheduling,” ACM/IEEE International Workshop on Timing Issues in the Specification and Synthesis of Digital Systems, pp. 111–118, 2002.
99. T. G. Szymanski and N. Shenoy, “Verifying Clock Schedules,” Proceedings of the IEEE/ACM International Conference on Computer-Aided Design, pp. 124–131, November 1992.
100. H. Zhou, “Clock Schedule Verification Crosstalk,” ACM/IEEE International Workshop on Timing Issues in the Specification and Synthesis of Digital Systems, pp. 78–83, 2002.
101. C. Leiserson and J. Saxe, “Retiming Synchronous Circuitry,” Algorithmica, Vol. 6, No. 1, 1991.
102. B. Lockyear and C. Ebeling, “Optimal Retiming of Level-Clocked Circuits Using Symmetric Clock Schedules,” IEEE Transactions on Computer-Aided Design, Vol. CAD-13, pp. 1097–1109, September 1994.
103. N. Maheshwari and S. Sapatnekar, “A Practical Algorithm for Retiming Level-Clocked Circuits,” Proceedings of International Conference on VLSI in Computers and Processors, pp. 440–445, October 1996.
104. N. Shenoy and R. Rudell, “Efficient Implementation of Retiming,” Proceedings of IEEE/ACM International Conference on Computer-Aided Design, pp. 226–233, 1994.
105. T. H. Cormen, C. E. Leiserson, and R. L. Rivest, Introduction to Algorithms. MIT Press, 1989.
106. I. S. Kourtev and E. G. Friedman, “Topological Synthesis of Clock Trees for VLSI-Based DSP Systems,” Proceedings of the IEEE Workshop on Signal Processing Systems, pp. 151–162, November 1997.
107. I. S. Kourtev and E. G. Friedman, “Topological Synthesis of Clock Trees with Non-Zero Clock Skew,” Proceedings of the ACM/IEEE International Workshop on Timing Issues in the Specification and Design of Digital Systems, pp. 158–163, December 1997.
108. R. B. Deokar and S. S. Sapatnekar, “A Graph-Theoretic Approach to Clock Skew Optimization,” Proceedings of the IEEE International Symposium on Circuits and Systems, pp. 407–410, May 1995.
109. S. Naffziger, G. Colon-Bonet, T. Fischer, R. Riedlinger, T. Sullivan, and T. Grutkowski, “The Implementation of the Itanium 2 Microprocessor,” IEEE Journal of Solid-State Circuits, Vol. 37, pp. 1448–1460, November 2002.
110. J. Warnock, “Circuit Design Issues for the POWER4 Chip,” Proceedings of the 2003 International Symposium on VLSI Technology, Systems, and Applications, pp. 125–128, October 2003.
111. C. Webb, C. Anderson, L. Sigal, K. Shepard, J. Liptay, J. D. Warnock, B. Curran, B. Krumm, M. Mayo, P. Camporese, E. Schwarz, M. Farrell, P. Restle, R. A. III, T. Slegel, W. Houtt, Y. Chan, B. Wile, T. Nguyen, P. Emma, D. Beece, C. Ching-Te, and C. Price, “A 400-MHz S/390 Microprocessor,” IEEE Journal of Solid-State Circuits, Vol. 32, pp. 1665–1675, November 1997.
112. W. L. Winston, Operations Research Application and Algorithms. PWS-Kent Publishing Company, second ed., 1991.
113. R. Chen and H. Zhou, “Clock Schedule Verification Under Process Variations,” Proceedings of the IEEE Conference on Computer-Aided Design, pp. 619–625, November 2004.
114. S.-C. Fang and S. Puthenpura, Linear Optimization and Extensions: Theory and Algorithms. AT&T, Prentice Hall, 1993.
115. ILOG, France, ILOG CPLEX 7.1 User’s Manual, 2001.
116. J. Wood, T. Edwards, and S. Lipa, “Rotary Traveling-Wave Oscillator Arrays: A New Clock Technology,” IEEE Journal of Solid-State Circuits, Vol. 36, pp. 1654–1665, November 2001.
117. M. C. Papaefthymiou and K. Randall, “Edge-Triggering vs. Two-Phase Level-Clocking,” Proceedings of the 1993 in Research in Integrated Systems, March 1993.
118. C. Ebeling and B. Lockyear, “On the Performance of Level-Clocked Circuits,” Proceedings of the Sixteenth Conference on Advanced Research in VLSI, pp. 342–356, March 1995.
119. Y. C. Hsu, S. Sun, D. Du, and X. Chu, “Enhancing Circuit Performance Under a Multiple-Phase Clocking Scheme,” Proceedings of the 1998 IEEE International Symposium on Circuits and Systems, pp. 219–222, June 1998.
120. K. Ravindran, A. Kuehlmann, and E. Sentovich, “Multi-Domain Clock Skew Scheduling,” Proceedings of the International Conference on Computer Aided Design, pp. 801–808, November 2003.
121. I. S. Kourtev and E. G. Friedman, “A Quadratic Programming Approach to Clock Skew Scheduling for Reduced Sensitivity to Process Parameter Variations,” Proceedings of the IEEE International ASIC/SOC Conference, pp. 210–215, November 1999.
122. I. S. Kourtev and E. G. Friedman, “Clock Skew Scheduling for Improved Reliability via Quadratic Programming,” Proceedings of the IEEE/ACM International Conference on Computer-Aided Design, pp. 239–243, November 1999.
123. S.-P. Chan, S.-Y. Chan, and S.-G. Chan, Analysis of Linear Networks and Systems: A Matrix-Oriented Approach with Computer Applications. Addison-Wesley Publishing Company, 1972.
124. E. M. Reingold, J. Nievergelt, and N. Deo, Combinatorial Algorithms: Theory and Practice. Prentice-Hall, 1977.
125. O. Bretscher, Linear Algebra with Applications. Prentice-Hall, 1996.
126. P. G. Ciarlet and J. L. Lions, eds., Handbook of Numerical Analysis, Vol. I. North-Holland, 1990.
127. R. W. Farebrother, Linear Least Square Computations. Marcel Dekker, 1988.
128. M. R. Osborne, Finite Algorithms in Optimization and Data Analysis. John Wiley & Sons, 1985.
129. R. Fletcher, Practical Methods of Optimization. John Wiley & Sons, 1987.
130. Å. Björck, Numerical Methods for Least Squares Problems. North-Holland, 1996.
131. C. L. Lawson and R. J. Hanson, Solving Least Squares Problems. Prentice-Hall, 1974.
132. B. Taskin and I. S. Kourtev, “Delay Insertion in Clock Skew Scheduling,” ACM International Symposium on Physical Design, (San Francisco, CA), April 2005.
133. S.-H. Huang, C.-H. Cheng, C.-M. Chang, and Y.-T. Nieh, “Clock Period Minimization with Minimum Delay Insertion,” Proceedings of the IEEE/ACM Design Automation Conference, pp. 970–975, June 2007.
134. G. H. Golub and C. F. V. Loan, Matrix Computations. Johns Hopkins University Press, 1996.
135. G. Forsythe and C. B. Moler, Computer Solution of Linear Algebraic Systems. Prentice-Hall, 1967.
136. B. Floyd, X. Guo, J. Caserta, T. Dickson, C.-M. Hung, K. Kim, and K. O, “Wireless Interconnects for Clock Distribution,” Proceedings of the 8th ACM/IEEE Intl. Workshop on Timing Issues in the Specification and Synthesis of Digital Systems, December 2002.
137. B. Floyd, C. Hung, and K. K. O, “Intra-chip Wireless Interconnect for Clock Distribution Implemented with Integrated Antennas, Receivers, and Transmitters,” IEEE Journal of Solid-State Circuits, Vol. 37, pp. 522–543, May 2002.
138. R. Li, X. Guo, and K. O, “A Technique for Incorporation of a Heatsink for a System Utilizing Integrated Circuits with Wireless Connections to an Off-chip Antenna,” Proceedings of the IEEE International Interconnect Technology Conference, pp. 160–162, June 2004.
139. W. Andress and D. Ham, “Standing Wave Oscillators Utilizing Wave-adaptive Tapered Transmission Lines,” Digest of Technical Papers, 2004 Symposium on VLSI Circuits, pp. 50–53, June 2004.
140. S. C. Chan, P. J. Restle, N. K. James, and R. L. Franch, “A 4.6 GHz Resonant Global Clock Distribution Network,” IEEE ISSCC Digest of Technical Papers, pp. 341–343, February 2004.
141. S. C. Chan, K. L. Shepard, and P. J. Restle, “Design of Resonant Global Clock Distributions,” Proceedings of the International Conference on Computer Design, pp. 238–243, 2003.
142. V. L. Chi, “Salphasic Distribution of Clock Signals for Synchronous Systems,” IEEE Transactions on Computers, Vol. 43, pp. 597–602, May 1994.
143. A. Drake, K. Nowka, T. Nguyen, J. Burns, and R. Brown, “Resonant Clocking Using Distributed Parasitic Capacitance,” IEEE Journal of Solid-State Circuits, Vol. 39, pp. 1520–1528, September 2004.
144. L. Hall, M. Clemens, W. Liu, and G. Bilbro, “Clock Distribution Using Cooperative Ring Oscillators,” Proceedings of the Conference on Advanced Research in VLSI, pp. 15–16, September 1997.
145. F. O’Mahony, C. Yue, M. Horowitz, and S. Wong, “A 10-GHz Global Clock Distribution Using Coupled Standing-wave Oscillators,” IEEE Journal of Solid-State Circuits, Vol. 38, pp. 1813–1820, November 2003.
146. F. O’Mahony, C. P. Yue, M. Horowitz, and S. Wong, “Design of a 10GHz Clock Distribution Network Using Coupled Standing Wave Oscillators,” Proceedings of IEEE/ACM Design Automation Conference, (Anaheim, CA), pp. 682–687, June 2003.
147. P. J. Restle, T. G. McNamara, P. J. Camporese, K. F. Eng, K. A. Jenkins, D. H. Allen, M. J. Rohn, M. P. Quaranta, D. W. Boerstler, C. J. Alpert, C. A. Carter, R. N. Bailey, J. G. Petrovik, B. L. Krauter, and B. D. McCredie, “A Clock Distribution Network for Microprocessors,” IEEE Journal of Solid-State Circuits, Vol. 36, pp. 792–799, May 2001.
148. J. Wood, S. Lipa, P. Franzon, and M. Steer, “Multi-Gigahertz Low-Power Low-Skew Rotary Clock Scheme,” Proceedings of the IEEE International Solid-State Circuits Conference, pp. 400–401, February 2001.
149. M. Saint-Laurent, M. Swaminathan, and J. Meindl, “On the Microarchitectural Impact of Clock Distribution Using Multiple PLLs,” Proceedings of IEEE International Conference on Computer Design, pp. 214–220, September 2001.
150. S.-M. Kang and Y. Leblebici, CMOS Digital Integrated Circuits: Analysis and Design. The McGraw-Hill Companies, Inc., 1996.
151. J. M. Rabaey, Digital Integrated Circuits: A Design Perspective. Prentice-Hall, Inc., Upper Saddle River, NJ, 1995.
152. E. G. Friedman, Clock Distribution Networks in VLSI Circuits and Systems. IEEE Press, 1995.
153. H. G. Chyun and J. Hung, “Phase-Locked Loop Techniques. A survey,” IEEE Transactions on Industrial Electronics, Vol. 43, pp. 609–615, December 1996.
154. A. J. Drake, K. J. Nowka, T. Y. Nguyen, J. L. Burns, and R. B. Brown, “Resonant Clocking Using Distributed Parasitic Capacitance,” IEEE Journal of Solid-State Circuits, Vol. 39, pp. 1520–1528, September 2004.
155. J.-Y. Chueh, M. C. Papaefthymiou, and C. H. Ziesler, “Two-phase Resonant Clock Distribution,” Proceedings of the IEEE Computer Society Annual Symposium on VLSI, pp. 65–70, May 2005.
156. S. C. Chan, K. L. Shepard, and P. J. Restle, “Distributed Differential Oscillators for Global Clock Networks,” IEEE Journal of Solid-State Circuits, Vol. 41, pp. 2083–2094, September 2006.
157. J. Wood, “Electronic circuitry.” United States Patent Application Number 20030128075, July 2003.
158. J. Wood, “Electronic circuitry.” United States Patent Number 6,816,020, November 2004.
159. J.-Y. Chueh, C. H. Ziesler, and M. C. Papaefthymiou, “Experimental Evaluation of Resonant Clock Distribution,” Proceedings of the IEEE Computer Society Annual Symposium on VLSI Emerging Trends in VLSI System Design, pp. 135–140, February 2004.
160. J.-Y. Chueh, C. H. Ziesler, and M. C. Papaefthymiou, “Empirical Evaluation of Timing and Power in Resonant Clock Distribution,” Proceedings of the International Symposium on Circuits and Systems, pp. 249–252, May 2004.
161. J. Rosenfeld and E. G. Friedman, “Sensitivity Evaluation of Global Resonant H-tree Clock Distribution Networks,” Proceedings of the ACM Great Lakes Symposium on VLSI, pp. 192–197, April-May 2006.
162. J. Rosenfeld and E. G. Friedman, “Design Methodologies for Global Resonant H-tree Clock Distribution Networks,” Proceedings of the IEEE International Symposium on Circuits and Systems, pp. 2073–2076, May 2006.
163. J. Rosenfeld and E. G. Friedman, “Design Methodology for Global Resonant H-tree Clock Distribution Networks,” IEEE Transactions on Very Large Scale Integration (VLSI) Systems, Vol. 15, pp. 135–148, February 2007.
164. F. O’Mahony, 10 GHz Global Clock Distribution Using Coupled Standing-Wave Oscillators. PhD thesis, Stanford University, August 2003.
165. P. Restle, “Resonant Clock Networks.” http://www.research.ibm.com/, 2005. IBM Research, Computer Science, Innovative Matters, VLSI Design.
166. J. Denker, “A review of Adiabatic Computing,” Proceedings of the 1994 Symposium on Low Power Electronics, pp. 94–97, October 1994.
167. K. S. Kim and M. Papaefthymiou, “Single-Phase Source-Coupled Adiabatic Logic,” Proceedings of the International Symposium on Low Power Electronics and Design, pp. 97–99, 1999.
168. G. D. Mercey, “A 18GHz Rotary Traveling Wave VCO in CMOS with I/Q Outputs,” Proceedings of the European Solid-State Circuits Conference, pp. 489–492, September 2003.
169. Z. Yu and X. Liu, “Power Analysis of Rotary Clock,” Proceedings of the IEEE Computer Society Annual Symposium on VLSI, pp. 150–155, May 2005.
170. Z. Yu and X. Liu, “Power Minimization of Rotary Clock Design,” Proceedings of the IEEE International SOC Conference, pp. 19–24, September 2005.
171. Z. Yu and X. Liu, “Low-Power Rotary Clock Array Design,” IEEE Transactions on Very Large Scale Integration (VLSI) Systems, Vol. 15, pp. 5–12, January 2007.
172. G. Venkataraman, J. Hu, F. Liu, and C.-N. Sze, “Integrated Placement and Skew Optimization for Rotary Clocking,” Proceedings of the IEEE Design, Automation and Test in Europe, pp. 1–6, March 2006.
173. G. Venkataraman, J. Hu, and F. Liu, “Integrated Placement and Skew Optimization for Rotary Clocking,” IEEE Transactions on Very Large Scale Integration (VLSI) Systems, pp. 149–158, February 2007.
174. Z. Yu and X. Liu, “Design of Rotary Clock Based Circuits,” Proceedings of the ACM/IEEE Design Automation Conference, pp. 43–48, June 2007.
175. C. Ababei, S. Navaratnasothie, K. Bazargan, and G. Karypis, “Multi-Objective Circuit Partitioning for Cutsize and Path-based Delay Minimization,” Proceedings of the IEEE/ACM International Conference on Computer Aided Design, pp. 181–185, November 2002.
176. I. Lustig, “Private Communication,” 2004. ILOG Inc.
177. B. Hendrickson and R. Leland, “The Chaco User’s Guide: Version 2.0,” Tech. Rep., Sandia National Laboratories, Albuquerque, NM, July 1995.
178. A. Pothen, H. Simon, and K. Liou, “Partitioning Sparse Matrices Eigenvectors of Graphs,” SIAM Journal of Matrix Analysis, Vol. 11, pp. 430–452, 1990.
179. R. Williams, “Performance of Dynamic Load Balancing Algorithms for Unstructured Mesh Calculations,” Concurrency, Vol. 3, pp. 457–481, 1991.
180. B. Kernighan and S. Lin, “An Efficient Heuristic Procedure for Partitioning Graphs,” Bell System Technical Journal, Vol. 29, pp. 291–307, 1970.
181. C. M. Fiduccia and R. Mattheyses, “A Linear Heuristic for Improving Network Partitions,” Proceedings of the IEEE/ACM Design Automation Conference, pp. 175–181, 1982.
182. B. Hendrickson and R. W. Leland, “A Multi-Level Algorithm For Partitioning Graphs,” Supercomputing, 1995.
183. Apple Inc., Advanced Computing Group, Xgrid Guide, 2004.
184. MPI Standard Forum, http://www-unix.mcs.anl.gov/mpi/standard.html, Message Passing Interface Standard v 2.0, 1997.
185. N. Shenoy, R. K. Brayton, and A. L. Sangiovanni-Vincentelli, “Graph Algorithms for Clock Schedule Optimization,” Proceedings of the IEEE/ACM International Conference on Computer-Aided Design, pp. 132–136, November 1992.
186. B. Taskin, “High Performance Integrated Circuit (hpic) Timing Software Package v1.9.” http://sourceforge.net/projects/hpictiming/, 2004.
187. Free Software Foundation (FSF), http://www.gnu.org/software/glpk/glpk.html, GLPK (GNU Linear Programming Kit), 2005. version 4.8.
Index
A Application-specific integrated circuits (ASICs), 15, 180, 244, 245 B Bernoulli equations, 29 C CAD. See Computer-aided design Cascade voltage switch logic (CVSL), 39 Clock distribution network branching factor, 88 circuit and interconnect structure, 72 design process for, 16 resistive-capacitive (RC), 35 scheduling algorithms for, 4 signals, 4 tree structure of, 86 Clock signal clock pulse, 51 clock skew lead/lag relationship, 52 sequentially-adjacent registers, 53 coincidental cycles of, 56 data in, 49 latching and non-latching edges of, 48–49 leading and trailing edge of, 45 multi-phase clock synchronization reference clock cycle, 54 sample of, 53 storage elements in, 50–51
Clock skew scheduling applications of, 96 basis skews clock skew vector and enumeration, 178 thicker edges and basis edges, 176–177 circuit design process and safety margin in, 84 clocking technology, 183 definitions and graphical model clock delays in, 74 graph-based models in, 76 inherent structural limitations of, 76 permissible range of, 74–76 synchronous digital system, 73, 75–80 timing parameters of, 75 delay insertion method, 153–162 edge-triggered circuits, 232 ISCAS’89 benchmark circuits, 229–230 level-sensitive circuits, 231, 233 QP-based clock scheduling algorithm, 226–229 double and zero clocking hazards in, 82 double clocking, 81 input file format samples of, 93, 94 static timing analyzers, 91 259
260
Index
Clock skew scheduling (Continued) I/O registers and target delays delay requirements for, 180 local data paths, 178 timing information and violations, 179–180 VLSI integrated circuit, 179 level-sensitive circuits circuit networks, 108–113 initialization constraints, 102 interpretation results, 209 on ISCAS’89 Benchmark circuits, 206–208 iterative approach, 103–104 latching constraints, 98 linearization, 105–108 LP formulation, 113–117 multi-phase, 117–120 parameter data distributions, 209–211 propagation constraints, 100–101 skew analysis, 211–213 synchronization constraints, 98–100 timing relationships, 97–98 validity constraints, 101–102 verification, 208 linear problem formulation delay models, 163 reconvergent paths, theoretical limitation of, 162 linear programming (LP) formulation for, 81, 244 localized negative synchronous circuit, 4 LP models, 163, 195, 197–198 minimum clock period circuits and limitations on, 146, 149 data path cycles, 148–150 data propagation times for, 147–148 dominant limiting factor in, 147 reconvergent paths, 150–151 register delays in, 155 min-max timing models, 165 modeling and applications delay buffer tree structure, 164–165 delay insertion problems, 163 post-timing analysis process, 165 output file format path delay distribution in, 95 sample of, 94
parallelization computation time speedup, 203 Xgrid computing cluster, 202–203 Xgrid software architecture, 202 partitioning process alternative approaches, 199–200 clock periods, 198–199 delay padding, 200 heuristic method, 197–199 performance characteristics in, 5 problem formulation LCSS-SAFE, 127–128 linear programming approach, 123 local data path timing constraints, 122 maximum performance, 123–125 quadratic programming problem, 128 safety, 125–127 quadratic programming algorithm derivation circuit graph, 129–130 linear dependence, 130–137 optimization problem and solution, 137–143 quadratic programming formulation computer implementation, 223–225 graphical illustrations, 225 registers, 83 ROA rings and application of, 191 rotary clock synchronized circuits industrial1, 234 minimum clock periods, 234–235 scalability of, 194, 243 software implementation benchmark circuit s400, 92 benchmark circuit s1423, 91 in clock tree synthesis, 89 data paths in, 94 ISCAS’89 suite of circuits, 90 timing constraints and design automation, 85 topological design of, 16 tree topology implementations, 89 Clock tree synthesis, 87 Complementary metal-oxide-semiconductor (CMOS) input waveforms in, 28 inverter logic gate in, 26
logic circuits and styles for, 39 operating mode of, 27 P-channel and N-channel transistor, terminal voltages for, 26 PMOS and NMOS device in, 25 transistor configuration of, 9 Computational algorithms CSD clock schedule, 172–173 computation time, 173 memory usage, 175 LMCS-1 memory usage of, 170 triangular, Cholesky decomposition, 169–170 LMCS-2 Lagrange multipliers, 170–171 memory usage of, 172 Sherman-Morrison-Woodbury formula, 171 run time and memory complexity expressions, 176 requirements, 175 Computer-aided design (CAD), 195 tools, 196, 243 CVSL. See Cascade voltage switch logic D Data path cycles clock skews circuit of, 148 minimum clock period, limitation on, 149 reconvergent system and paths, timing diagrams of, 150–151 Data propagation times setup and hold time constraints for, 148 timing delay models in, 147 Deep submicrometer (DSM), 3 Delay analytical analysis fall time derivation input waveforms for, 28 transition process, 26 rise time derivation, 30 short-channel effects channel-length modulation in, 33 velocity saturation, 34–35 waveform effects delay expressions in, 31–32
propagation delay time in, 32 short-circuit power for, 33 Delay controlling, 31 Delay insertion method divergent and convergent registers in, 152–153 drawbacks of, 245 edge-triggered circuit reconvergent system CSS method for, 163 data signals for, 153 minimum clock period, 159 path delays, algebraic difference in, 156 timing of, 154, 162 values, interval and elements in, 157–158 level-sensitive circuit reconvergent system clock skew scheduling algorithm for, 160 CSS method for, 164 timing of, 160, 162 zero internal register delays, 159 minimum clock period obtainable in, 155 reconvergent data path systems delays in, 153 edge-triggered and level-sensitive circuits in, 160 Design-for-manufacturing (DFM) techniques, 245 Digital integrated circuits, 8 Double-edge-triggered (DET) flip-flops, 47 DSM. See Deep submicrometer F Finite-state machine (FSM) model, 13 Flip-flops positive and negative-edge-triggered, 47 single-phase path data signal, early arrival of, 58–60 data signal, late arrival of, 55–58 delay padding, 60 logic gates and, 55 timing parameters clock pulse width of, 48–49
clock-to-output delay, 49 hold time and setup time, 49 violation setup, timing diagram of, 56 Full adder circuit, 8–9 H Hardware description language (HDL), 15 hpictiming tool, 233–234 I industrial1, 234 Integrated circuits (ICs) characteristics and factors of, 1 circuit structures and chip area in, 2 data traffic in, 2 performance of, 3 Intellectual property (IP) blocks, 180 benchmark circuits for, 200 ISCAS’89 benchmark circuits average speedup of, 245 CAD tool parallel speedup, 237–238 sequential speedup, 237–238 tool flow, 235 clock skew scheduling industrial1, 234 minimum clock periods, 234–235 hpictiming tool, 233 multi-phase synchronization, 244 run time breakdown process, 239–241 L Latches clock signal levels in, 43 idealized operation of, 44 multi-phase path combinational logic blocks in, 65 data signal, early arrival of, 68–69 data signal, late arrival of, 66–68 timing properties of, 66 parameters clock pulse width of, 45 clock-to-output delay, 45 data-to-output delay and setup time, 45–46 hold time, 46–47 minimum and maximum values of, 47 signal relationships on, 44
schematic representation of, 43 single-phase path data signal, early arrival of, 63–65 data signal, late arrival of, 61–63 max operation, 63 registers and logic gates in, 61 Level-sensitive circuits circuit networks edge-sensitive circuit, 108–109 level-sensitive circuit, 109 non-zero clock, 110 synchronous circuit state, 110–113 topology, 108 clock skew scheduling interpretation results, 209 on ISCAS’89 Benchmark circuits, 206–208 parameter data distributions, 209–211 skew analysis, 211–213 verification, 208 initialization constraints, 102 iterative approach algorithm, 103–104 framework lenient, 104 SMO formulation, 103 latching constraints, 98 limitations, 146–147 linearization linear programming (LP) model, 106–108 modified big M (MBM) method, 105–106 timing analysis, 104–105 LP formulation benchmark circuits, 115–116 MIP problems and model, 113–114, 116–117 NLP problems, 113 non-linear constraint, 114–115 optimized timing schedule, 112 multi-phase clocking clock skew scheduling, 220–221 simultaneous time borrowing and clock skew scheduling, 221–223 time borrowing, 219–220 multi-phase synchronization non-overlapping process, 215–216 transformation process, 213–214
multi-phase system edge-triggered system, 117 synchronization overview, 117–118 timing circuits, 118–120 propagation constraints, 100–101 synchronization constraints, 98–100 timing relationships, 97–98 validity constraints, 101–102 Linear programming (LP) problems, 146 models, 164–165 naive approach, 163 Logic gates and registers sequentially-adjacent pair of, 74 in synchronous digital circuit, 73 switching characteristics of, 22–23 values and properties of, 9 M Message passing interface (MPI), 202 Metal-oxide-semiconductor field-effect transistors (MOSFETs), 23 Mixed-integer linear programming (MIP), 87, 113, 164 Modified big M method (MBM), 105–106 Moore’s law, 3 Multi-phase synchronization approach, 54 N NAND gate, 9–10 N-channel enhancement mode MOSFET transistor (NMOS), 24 NMOS transistor, 44 Non-linear programming (NLP), 113 Non-zero clock skew scheduling applications of, 243 automation and application of, 243 circuit operating and applications of, 83 clock signals and benefits in, 84 researchers, 244 synchronization methodologies for, 245
P Phase-locked-loop (PLL), 183 clock sources, 190 components, 184 reflections, capacitive loading, 184 Power supply rejection ratio (PSRR), 188 Q QP algorithm derivation circuit graph, 129–130 linear dependence circuit connectivity matrix, 134 circuit graph cycles, 131 clock scheduling algorithms, 130 graph theory, 132 independent cycle matrix, 136–137 kernel equation, 135–136 local data paths, 132–133 matrix relationship, 134–135 spanning trees, 133 optimization problem and solution active constraints, 138–139 clock skew definition, 137–138 Gauss-Jordan elimination, 141 global minimizer, 143 Lagrange multipliers, 139–140 linear system technique, 142–143 local data paths, 138 non-linear equation, 140 objective clock skew schedule, 142 QP-based clock skew scheduling, computational analysis CSD, 172–175 LMCS-1, 169–170 LMCS-2, 170–172 run time and memory requirements, 168 Quadratic programming (QP) formulation, 244–245 computer implementation, 223–225 graphical illustrations, 225 R Resistive-capacitive (RC) loads, 72 circuit network for, 38 signal delay expressions in, 37 Resonant clocking technology clock tree network, 184 digital integrated circuits, 183
oscillators coupled LC and standing wave, 185 traveling wave, 186 partitioning process balanced priority assignment, 196 with chaco, 195–196 clock skew scheduling, 197–200 path-based and net-based, 193 registered-input and registered-output, 197 register insertion, 196 register placement, 200–201 timing constraints and data path, 196–197 timing-driven, 193–195 tools and factors for, 194 VLSI circuits, 192 rotary circuits, timing requirements clock skew and signals, 189 oscillatory signals, 191 synchronization schemes, 190–191 rotary traveling-wave oscillators (RTWO’s), 185–189 ROA. See Rotary oscillator arrays Rotary clock synchronized circuits CAD tool run time breakdown process, 239–241 speedup process, 237–238 clock skew scheduling industrial1, 234 minimum clock periods, 234–235 Rotary oscillator arrays (ROA), 185 clock architecture, 186 grid topology, 185, 191 ring, clock phase relationships of, 190, 191 structures, 192 Rotary traveling-wave oscillators (RTWO’s) anti-parallel inverters, 186 integrated skew computation and logic placement, 189 loop inductance, 188 novel clock network, 185 rings, 185, 187, 190 shunt connected inverters, transmission line, 187 theory, 187
S Shichman-Hodges equations, 24 Spanning tree algorithm, edge swapping and enumeration, 180 Standard template library (STL), 234 Synchronous digital system logic gates and storage registers, 73 signal cycles and graph representation of, 78 Synchronous systems clock signals, 50–54 finite-state machine (FSM) model of, 13 flip-flops, 47–50 single-phase path with, 55–60 latches, 43–47 multi-phase path with, 65–69 single-phase path with, 61–65 storage elements, 42 timing properties of, 41 System-on-chip (SoC), 181 V Very large scale integration (VLSI) systems buffers and registers, 86 circuit design and timing, 244 circuits production in, 3 delay metrics circuit analysis and design for, 23 computer-aided design applications, 20 logic gates and elements in, 19, 21 signal propagation and making in, 23 signal transitions in, 22 signal waveforms circuit in, 21–22 delay mitigation, 37 design process electronic devices, switching properties in, 14 synthesis tools in, 15 devices and interconnections analytical delay analysis, 26–31 delay controlling, 31 delay mitigation, 37–39 gain factor in, 25 importance of, 35–37 RC estimation in, 36
short-channel effects, 33–35 signal delay in, 19 static and dynamic circuit analysis for, 25 waveform effects, 31–33 digital systems in, 7 integrated circuits, 245 networks and logic gates in, 12 signal representation data processing in, 7 electronic devices and circuits, 8 storage elements, 42 synchronous circuits, 85 synchronous systems
computational algorithm in, 11 data path in, 14 finite-state machine (FSM) model, 13 signal propagation delay, 11 systems and circuits of, 4 technologies and systems in, 12 transistor elements, 2 and ULSI-based digital systems, 72 X Xgrid computing cluster, 202–203 software architecture, 202