Fifth ACM/SIGDA Physical Design Workshop Proceedings
April 15-17, 1996 Reston, Virginia, USA
Foreword

Welcome to the Fifth ACM SIGDA Physical Design Workshop (PDW-96)! Our workshop is being held in the midst of a watershed era of growth and innovation in the EDA industry, most notably within the "back-end" domains of physical design and physical verification. Timing-, signal integrity- and power-consciousness have rapidly become mainstream design requirements, with reliability and yield just around the corner. Many classic techniques have broken down, from the design of cell libraries to place-and-route to physical verification. That this sea change is partly due to the continued scaling of process technology (weaker drivers driving more resistive interconnects, lower supply voltages reducing noise margins, slew rates affecting signal integrity and timing, neighboring interconnects coupling more noticeably, ...) is well recognized. But the changing nature of the semiconductor business itself - the changing nature of design teams and design projects (more design starts, shorter (synthesis-driven) design cycles, ...) - has had an equally profound impact on our context.

We are here because all those problems in physical design which were declared "solved" four years ago (!) are now - without question - very much unsolved. Interconnect design, block placement, estimation and cell layout are just a few of the wide-open topics addressed in this workshop. We are also here because few if any of us can claim to know "the real killer issues" in physical design. (What kind of problem will cause the XYZ chip to miss its market window in 1999? Timing? Thermal? Noise? Algorithms or flows?) The two evening panels bring together a number of luminaries in the field to discuss future needs and directions for deep-submicron design, as well as "disconnects" in back-end (data management and physical verification) flows. The Tuesday afternoon panel highlights the emerging need for yield optimization in physical design.

Finally, we are here to make new friends and expand our horizons. There will be plenty of opportunities to chat, eat, and imbibe (which we hope will make up for the lack of opportunity to sleep!). The Monday afternoon sessions will highlight what is arguably an extremely rich opportunity in physical design, CAD for microelectromechanical systems (MEMS), and the workshop will close with an open problems session.

We are grateful to the U.S. National Science Foundation for its generous support of this workshop (NSF grant MIP-9531666). Additional sponsorship has been provided by ACM SIGDA and Avant! Corporation. On behalf of the technical program and organizing committees, we hope that PDW-96 will be a rewarding and enjoyable experience for you.
Andrew B. Kahng Technical Program Chair
Gabriel Robins General Chair
Workshop Organization
Steering Committee:
M. Lorenzetti (Mentor Graphics)
B. T. Preas (Xerox PARC)

General Chair:
G. Robins (U. of Virginia)

Technical Program Committee:
C. K. Cheng (UC San Diego)
J. P. Cohoon (U. of Virginia)
J. Cong (UC Los Angeles)
A. Domic (Cadence)
J. Frankle (Xilinx)
E. G. Friedman (Rochester)
D. D. Hill (Synopsys)
L. G. Jones (Motorola)
A. B. Kahng (UC Los Angeles, Chair)
Y.-L. Lin (Tsing Hua)
K. S. J. Pister (UC Los Angeles)
M. Marek-Sadowska (UC Santa Barbara)
C. Sechen (Washington)
R.-S. Tsay (Avant! Corporation)
G. Zimmermann (Kaiserslautern)

Keynote Address:
C. L. Liu (U. Illinois Urbana-Champaign)

Benchmarks Co-Chairs:
F. Brglez (NCSU)
W. Swartz (TimberWolf Systems)

Local Arrangements Chair:
M. J. Alexander (U. of Virginia)

Treasurer:
S. B. Souvannavong

Publicity Chair:
J. L. Ganley (Cadence)

Sponsors:
ACM / SIGDA
U.S. National Science Foundation
Avant! Corporation

http://www.cs.virginia.edu/~pdw96/
Contact Information

Gabriel Robins, Department of Computer Science, Thornton Hall, University of Virginia, Charlottesville, VA 22903-2442, (804) 982-2207, (804) 982-2214 (fax), [email protected]

Jason Cong, UCLA Computer Science Department, 4711 Boelter Hall, Los Angeles, CA 90095-1596, (310) 206-2775, (310) 825-2273 (fax), [email protected]

Antun Domic, Cadence Design Systems, Inc., 2655 Seely Road, Building 6, MS 6B1, San Jose, CA 95134, (408) 428-5837, (408) 428-5828 (fax), [email protected]

C. L. Liu, University of Illinois, Urbana-Champaign, 1304 West Springfield Avenue, Urbana, IL 61801, (217) 333-6769, (217) 333-3501 (fax), [email protected]

Jon Frankle, Xilinx, Inc., 2100 Logic Drive, San Jose, CA 95124, (408) 879-5348, (408) 559-7114 (fax), [email protected]

Michael Lorenzetti, Mentor Graphics Corporation, 8005 SW Boeckman Road, Wilsonville, OR 97070-7777, (503) 685-1258, (503) 685-4790 (fax), [email protected]

Bryan T. Preas, Xerox PARC, 3333 Coyote Hill Road, Palo Alto, CA 94304, (415) 812-4845, (415) 812-4471 (fax), [email protected]

Eby G. Friedman, Department of Electrical Engineering, Computer Studies Building 420, University of Rochester, Rochester, New York 14627, (716) 275-1022, (716) 275-2073 (fax), [email protected]

Chung-Kuan Cheng, CSE Department, University of California, San Diego, La Jolla, CA 92093-0114, (619) 534-6184, (619) 534-7029 (fax), [email protected]

Dwight D. Hill, Synopsys, Inc., 700 East Middlefield Road, Mountain View, CA 94043, (415) 694-4421, (415) 965-8637 (fax), [email protected]

James P. Cohoon, Department of Computer Science, Thornton Hall, University of Virginia, Charlottesville, VA 22903-2442, (804) 982-2210, (804) 982-2214 (fax), [email protected]

Larry G. Jones, Motorola, Inc., MD OE321, 6501 William Cannon Drive West, Austin, TX 78735-8598, (512) 891-8867, (512) 891-3161 (fax), [email protected]

Andrew B. Kahng, UCLA Computer Science Department, 3713 Boelter Hall, Los Angeles, CA 90095-1596, (310) 206-7073, (310) 825-7578 (fax), [email protected]

Gerhard Zimmermann, Department of Computer Science, University of Kaiserslautern, Erwin-Schroedinger-Strasse, P.O. Box 3049, D-67653 Kaiserslautern, Germany, +49 631 205-2628, +49 631 205-3558 (fax), [email protected]

Youn-Long Lin, Department of Computer Science, Tsing Hua University, Hsin-Chu, Taiwan 30043, Republic of China, 886-35-731070, 886-35-723694 (fax), [email protected]

Franc Brglez, Dept. of Electrical and Computer Engineering, North Carolina State University, Box 7911, Raleigh, NC 27695-7911, USA, (919) 248-1925, (919) 248-9245, [email protected]

Kristofer S. J. Pister, UCLA Department of Electrical Engineering, Los Angeles, CA 90095-1594, (310) 206-4420, (310) 206-8495 (fax), [email protected]

William P. Swartz, TimberWolf Systems, Inc., 10880 Cassandra Way, Dallas, TX 75228-2493, (214) 613-6772, (214) 682-1478 (fax), [email protected]

Malgorzata Marek-Sadowska, Room 4157, Engineering I, Dept. of Electrical and Computer Engineering, University of California, Santa Barbara, CA 93106-9560, (805) 893-2721, (805) 893-3262 (fax), [email protected]

Michael J. Alexander, Department of Computer Science, Thornton Hall, University of Virginia, Charlottesville, VA 22903-2442, (804) 982-2291, (804) 982-2214 (fax), [email protected]

Carl Sechen, Department of Electrical Engineering, Box 352500, University of Washington, Seattle, WA 98195-2500, (206) 685-8756, (206) 543-3842 (fax), [email protected]

Sally B. Souvannavong, Department of Computer Science, Thornton Hall, University of Virginia, Charlottesville, VA 22903-2442, (804) 982-2291, (804) 982-2214 (fax), [email protected]

Ren-Song Tsay, Avant! Corporation, 1208 East Arques Avenue, Sunnyvale, CA 94086, (408) 738-8814, (408) 738-8508 (fax), [email protected]

Joseph L. Ganley, Cadence Design Systems, Inc., 555 River Oaks Parkway, MS 2A2, San Jose, CA 95134-1937, (408) 944-7232, (408) 894-2700 (fax), [email protected]
Program
Sunday, April 14 6:00pm-8:30pm: Registration (the registration desk will also be open 8:00am-5:00pm on Monday and 8:00am-12:00pm on Tuesday)
7:00pm-8:30pm: Reception (refreshments provided)
Monday, April 15 8:30am-8:40am: Welcome 8:40am-10:00am: Session 1: Timing-Driven Interconnect Resynthesis Session Chair: E. S. Kuh (UC Berkeley) * Interconnect Layout Optimization by Simultaneous Steiner Tree Construction and Buffer Insertion T. Okamoto and J. Cong, UC Los Angeles * Simultaneous Routing and Buffer Insertion for High Performance Interconnect J. Lillis, C.-K. Cheng and T.-T. Y. Lin, UC San Diego * Timing Optimization by Redundancy Addition and Removal L. A. Entrena, E. Olias and J. Uceda, U. Carlos III of Madrid and U. Politecnica of Madrid
1 7 13
* Open Commentary - Moderators: D. D. Hill (Synopsys) and P. Suaris (Interconnectix) 10:00am-10:20am: Break (refreshments provided) 10:20am-12:00pm: Session 2: Interconnect Optimization Session Chair: C. L. Liu (U. Illinois Urbana-Champaign) * Optimal Wire-Sizing Formula Under the Elmore Delay Model C.-P. Chen, Y.-P. Chen and D. F. Wong, U. Texas Austin
21
* Reducing Coupled Noise During Routing A. Vittal and M. Marek-Sadowska, UC Santa Barbara
27
* Simultaneous Transistor and Interconnect Sizing Using General Dominance Property J. Cong and L. He, UC Los Angeles
34
* Hierarchical Clock-Network Optimization D. Lehther, S. Pullela, D. Blaauw and S. Ganguly, Somerset Design Center, Motorola * Open Commentary - Moderators: D. D. Hill (Synopsys) and M. Lorenzetti (Mentor)
40
12:00pm-2:00pm: Lunch
Workshop Keynote Address: Prof. C. L. Liu, U. Illinois Urbana-Champaign Algorithmic Aspects of Physical Design of VLSI Circuits
2:00pm-2:45pm: Session 3: Tutorial: Making MEMS Speaker: K. J. Gabriel (ARPA)
45
2:45pm-3:00pm: Break (refreshments provided)
3:00pm-4:15pm: Session 4: Physical Design for MEMS Session Chair: K. J. Gabriel (ARPA) * Physical Design for Surface-Micromachined MEMS
53
G. K. Fedder and T. Mukherjee, Carnegie-Mellon U. * Consolidated Micromechanical Element Library
61
R. Mahadevan and A. Cowen, MCNC * Synthesis and Simulation for MEMS Design E. C. Berg, N. R. Lo, J. N. Simon, H. J. Lee and K. S. J. Pister, UC Los Angeles
67
4:15pm-4:30pm: Break (refreshments provided) 4:30pm-6:00pm: Session 5: Panel: Physical Design Needs for MEMS Moderator: K. S. J. Pister (UC Los Angeles) Panelists include: * S. F. Bart (Analog Devices)
71
* G. K. Fedder (Carnegie-Mellon U.)
76
* K. J. Gabriel (ARPA) * I. Getreu (Analogy) * R. Grafton (NSF) * R. Harr (ARPA)
81
* R. Mahadevan (MCNC)
83
* J. E. Tanner (Tanner Research)
86
6:00pm-8:00pm: Dinner 8:00pm-9:30pm: Session 6: Panel: Deep-Submicron Physical Design:
Future Needs and Directions Moderator: N. Mokhoff (Managing Editor, EE Times) Panelists include: * T. C. Lee (President/CEO, Neo Paradigm Labs) * L. Scheffer (Architect, Cadence)
89
* W. Vercruysse (UltraSPARC III CAD Manager, Sun) * M. Wiesel (Design Manager, Intel) * T. Yin (VP R&D, Avant! Corporation)
Tuesday, April 16
8:30am-9:50am: Session 7: Partitioning Session Chair: D. F. Wong (U. Texas Austin) * VLSI Circuit Partitioning by Cluster-Removal Using Iterative Improvement Techniques S. Dutt and W. Deng, U. Minnesota and LSI Logic * A Hybrid Multilevel/Genetic Approach for Circuit Partitioning C. J. Alpert, L. Hagen and A. B. Kahng, UC Los Angeles and Cadence * Min-Cut Replication for Delay Reduction J. Hwang and A. El Gamal, Xilinx and Stanford U.
92 100 106
* Open Commentary - Moderators: J. Frankle (Xilinx) and G. Zimmermann (U. Kaiserslautern) 9:50am-10:10am: Break (refreshments provided) 10:10am-11:50am: Session 8: Topics in Hierarchical Design Session Chair: M. Sarrafzadeh (Northwestern U.) * Two-Dimensional Datapath Regularity Extraction R. X. T. Nijssen and J. A. G. Jess, TU Eindhoven * Hierarchical Netlength Estimation for Timing Prediction W. Hebgen and G. Zimmermann, U. Kaiserslautern * Exploring the Design Space for Building-Block Placements Considering Area, Aspect Ratio, Path Delay and Routing Congestion H. Esbensen and E. S. Kuh, UC Berkeley * Genetic Simulated Annealing and Application to Non-Slicing Floorplan Design S. Koakutsu, M. Kang and W. W.-M. Dai, Chiba U. and UC Santa Cruz * Open Commentary - Moderators: L. Scheffer (Cadence) and T. Yin (Avant! Corporation)
111 118
126 134
11:50am-1:30pm: Lunch
1:30pm-3:00pm: Session 9: Poster Session * Physical Layout for Three-Dimensional FPGAs M. J. Alexander, J. P. Cohoon, J. L. Colflesh, J. Karro, E. L. Peters and G. Robins, U. Virginia
142
* Efficient Area Minimization for Dynamic CMOS Circuits B. Basaran and R. A. Rutenbar, Carnegie-Mellon U.
150
* A Fast Technique for Timing-Driven Placement Re-engineering M. Hossain, B. Thumma and S. Ashtaputre, Compass Design Automation
154
* Over-the-Cell Routing with Vertical Floating Pins I. Peters, P. Molitor and M. Weber, U. Halle and Deuretzbacher Research GmbH
158
* Congestion-Balanced Placement for FPGAs Y. Sun, R. Gupta and C. L. Liu, Altera and U. Illinois Urbana-Champaign
163
* Fanout Problems in FPGA K.-H. Tsai, M. Marek-Sadowska and S. Kaptanoglu, UC Santa Barbara and Actel
169
* Performance-Driven Layout Synthesis: Optimal Pairing & Chaining A. J. Velasco, X. Marin, J. Riera, R. Peset and J. Carrabina, U. Autonoma de Barcelona and Philips Research Labs Eindhoven
176
* Clock-Delayed Domino for Adder and Combinatorial Logic Design G. Yee and C. Sechen, U. Washington
183
3:00pm-4:00pm: Session 10: Manufacturing/Yield Issues I Session Chair: Eby G. Friedman (U. Rochester) * Layout Design for Yield and Reliability K. P. Wang, M. Marek-Sadowska and W. Maly, UC Santa Barbara and Carnegie-Mellon U.
190
* Yield Optimization in Physical Design (invited survey paper) V. K. R. Chiluvuri, Motorola
198
4:00pm-4:15pm: Break (refreshments provided) 4:15pm-5:45pm: Session 11: Panel: Manufacturing/Yield Issues II Moderator: L. G. Jones (Motorola) Panelists include: * V. K. R. Chiluvuri (Motorola) * I. Koren (U. Massachusetts Amherst)
207
* J. Burns (IBM Watson Research Center) * W. Maly (Carnegie-Mellon U.)
5:45pm-7:30pm: Dinner 7:30pm-8:00pm: Session 12a: Design Views in Routing Session Chair: B. T. Preas (Xerox PARC) * A Gridless Multi-Layer Channel Router Based on Combined Constraint Graph and Tile Expansion Approach H.-P. Tseng and C. Sechen, U. Washington
210 218
* A Multi-Layer Chip-Level Global Router L.-C. E. Liu and C. Sechen, U. Washington
8:00pm-9:30pm: Session 12b: Design Views, Data Modeling and Flows:
Critical Disconnects Moderator: A. B. Kahng (UC Los Angeles) Panelists include: * W. W.-M. Dai (UC Santa Cruz and Ultima Interconnect Technology, Inc.)
226
* L. G. Jones (Motorola) * D. Lapotin (IBM Austin Research Center) * E. Nequist (VP R&D, Cooper & Chyan) * R. Rohrer (Fellow, Avant! Corporation) * C. Palesko (VP, Savantage)
228
Wednesday, April 17 8:30am-9:50am: Session 13: Performance-Driven Design Session Chair: M. Marek-Sadowska (UC Santa Barbara) * A Graph-Based Delay Budgeting Algorithm for Large Scale Timing-Driven Placement Problems G. E. Tellez, D. A. Knol and M. Sarrafzadeh, Northwestern U.
234
* Reduced Sensitivity of Clock Skew Scheduling to Technology Variations J. L. Neves and E. G. Friedman, U. Rochester
241
* Multi-Layer Pin Assignment for Macro Cell Circuits L.-C. E. Liu and C. Sechen, U. Washington
249
* Open Commentary - Moderator: J. Cong (UC Los Angeles) 9:50am-10:10am: Break (refreshments provided) 10:10am-11:30am: Session 14: Topics in Layout Session Chair: D. D. Hill (Synopsys) * Constraint Relaxation in Graph-Based Compaction S.-K. Dong, P. Pan, C. Y. Lo and C. L. Liu, Silicon Graphics, Clarkson U., Lucent Technologies and U. Illinois Urbana-Champaign
256
* An O(n) Algorithm for Transistor Stacking with Performance Constraints B. Basaran and R. A. Rutenbar, Carnegie-Mellon U.
262
* Efficient Standard Cell Generation When Diffusion Strapping is Required B. Guan and C. Sechen, U. Washington
268
* Open Commentary - Moderators: D. D. Hill (Synopsys) and E. G. Friedman (U. Rochester) 11:30am-12:00pm: Session 15: Open Problems Moderators: A. B. Kahng (UC Los Angeles) and B. T. Preas (Xerox PARC) 12:00pm-2:00pm: Lunch (and benchmark competition results) 2:00pm: Workshop adjourns
Interconnect Layout Optimization by Simultaneous Steiner Tree Construction and Buffer Insertion *

Takumi Okamoto 1,2 ([email protected])
Jason Cong 1 ([email protected])

1 Dept. of Computer Science, University of California, Los Angeles, CA 90095
2 C&C Research Laboratories, NEC Corp., Miyamae, Kawasaki 216, Japan
Abstract
This paper presents an algorithm for interconnect layout optimization with buffer insertion. Given a source and n sinks of a signal net, with given positions and a required arrival time associated with each sink, the algorithm finds a buffered Steiner tree so that the required arrival time (or timing slack) at the source is maximized. In the algorithm, Steiner routing tree construction and buffer insertion are achieved simultaneously by combining A-tree construction and dynamic programming based buffer insertion algorithms, while these two steps were carried out independently in the past. Extensive experimental results indicate that our approach outperforms conventional two-step approaches. Our buffered Steiner trees increase the timing slack at the source by up to 75% compared with those by the conventional approaches.

1. Introduction
For timing optimization of VLSI circuits, buffer insertion (or fanout optimization) and interconnect topology optimization play important roles, and a number of algorithms have been proposed for these problems over the past few years. On the fanout optimization problem, most previous work focused on the construction of buffered trees in logic synthesis with consideration of user-defined timing and area constraints[1, 2, 3]. The timing measures used during this stage mainly consist of gate delays and a rough approximation for interconnect delay, which is assumed to be piecewise linear with the number of fanouts. When the wiring effect is dominant, traditional synthesis tools that use such a fanout-based model may be optimizing a timing value which is significantly different from the actual post-layout value. Another problem with traditional synthesis is in area estimation. Typically, the tools try to optimize only the total gate area, and the interconnect area and the routability of the chip are not taken into account. As a result, although the total gate area of the synthesized netlist is quite small, it may not fit into the target die area after layout. In recent years, [4, 6, 5, 7, 8] attack the fanout optimization problem after layout information is available. In [4], a fanout optimization algorithm based on alphabetic trees is presented that generates fanout trees free of internal edge crossings, thus improving routing area. In [5], buffer insertion based on a minimum spanning tree is proposed. In [6], a polynomial time algorithm using dynamic programming is proposed for the delay-optimal buffer insertion problem on a given tree topology. [7, 8] have integrated wire sizing and power minimization with the algorithm in [6] under a more accurate delay model taking signal slew into account.

On the interconnect topology optimization problem, the analysis in [9] and [10] showed that as we reduce the device dimension, the resistance ratio, which is defined as the ratio of the driver resistance versus the unit wire resistance, decreases. As a result the distributed nature of the interconnect structure must be considered, and the conventional algorithm for total wire capacitance minimization does not necessarily lead to the minimum interconnect delay. For interconnect optimization in deep submicron VLSI design, recently a number of interconnect topology optimization algorithms have been proposed, including bounded-radius bounded-cost trees[11], AHHK trees[12], maximum performance trees[13], A-trees[10], low-delay trees[14, 15], and IDW/CFD trees[16]. Although steady progress has been made in buffer insertion and Steiner tree construction for delay minimization, and encouraging experimental results were reported, we believe that these two steps need to be carried out simultaneously in order to construct even higher performance buffered Steiner trees directly. The independent two-step approach often leads to sub-optimal designs due to the following reasons: in the case of buffer insertion followed by Steiner tree construction, there is the problem that wiring delay and routability cannot be estimated accurately in buffer insertion, as mentioned above; in the case of Steiner tree construction followed by buffer insertion, there is the problem that a Steiner tree optimized for delay does not necessarily result in a minimum-delay buffered Steiner tree. Figure 1 shows an example, where sink s_1 is the most critical among all the sinks. In certain cases, which depend on the technology and criticality of sinks, the buffered Steiner tree in Figure 1(a) is desired, while Figure 1(b) shows a minimum-delay Steiner tree followed by buffer insertion.
*This work is partially supported by National Science Foundation Young Investigator Award MIP-9357582 and a matching grant from Intel Corporation.
Figure 1. Example of Buffered Steiner Tree: (a) Minimum-Delay Buffered Tree; (b) Minimum-Delay Tree Followed by Buffer Insertion
In this paper, we present an algorithm for interconnect layout optimization with buffer insertion. Given a source and n sinks of a signal net, with given positions and a required arrival time associated with each sink, the algorithm finds a buffered Steiner tree so that the required arrival time (or timing slack) at the source is maximized. In the algorithm, Steiner tree construction and buffer insertion are achieved simultaneously by combining the A-tree algorithm[10] and the dynamic programming based buffer insertion algorithm[6]. Extensive experimental results indicate that our approach outperforms conventional two-step approaches. Our buffered Steiner trees increase the timing slack at the source by up to 75% compared with those by the conventional approaches.

2. Delay Models and Problem Formulation

2.1. Delay Models
As in most previous works on interconnect layout optimization, we adopt the Elmore delay model[17] for interconnects and standard RC models for buffers. For wire e, let l_e, c_e and r_e be its length, capacitance and resistance, respectively. Further, let e_v denote the wire entering node v from its parent. We use the following basic models for interconnect delay D_wire and buffer delay D_buff:

  c_e = c_0 * l_e
  r_e = r_0 * l_e
  D_wire(e_v) = r_{e_v} * (c_{e_v}/2 + c(T_v))
  D_buff(b, c_l) = d_b + r_b * c_l,

where c_0 and r_0 are the capacitance and resistance of a unit length of wire, respectively, c(T_v) is the lumped capacitance of subtree T_v rooted at v, d_b and r_b are buffer b's intrinsic delay and output resistance, respectively, and c_l is the load on buffer b. When wire e is very long, we can divide e into a sequence of wires connected by degree-2 nodes to capture the distributed nature of the interconnect delay. Note that we assume wires are of a uniform width. Wiresize optimization can be carried out in a separate step after the buffered tree construction using the algorithms in [18, 19] or during the buffered tree construction as mentioned in Section 6.

2.2. Problem Formulation
We use required arrival time as our optimization objective. The required arrival time at the root of tree T_v, denoted q(T_v), is defined as follows:

  q(T_v) = min_{u in sinks(T_v)} (q_u - delay(v, u)),

where q_u is the required arrival time of sink u, sinks(T_v) is the set of sinks of tree T_v, and delay(v, u) is the delay from v to u defined by our delay models. This measure is useful since it is a typical objective when optimizing the performance of combinational networks. If we assume the signal arrives at the root of T at t = 0, the timing requirements are met when q(T) is non-negative. The buffered Steiner tree problem for delay minimization is stated as follows:

Given: A source s_0 and sinks s_1, s_2, ..., s_n of a signal S, with given positions and a required arrival time associated with s_i (1 <= i <= n).
Find: A Steiner tree T_s that spans S and has buffers inserted.
Objective: Maximize q(T_s0).

Hereafter, it is assumed that only one type of buffer is considered for the buffer insertion, and signal polarity is neglected. Our algorithm, however, is easily extended to the general case, where more than one type of buffer can be used and signal polarity must be considered, by using methods similar to those in [7, 8].

3. Related Work
We briefly review the A-tree algorithm in [10] and the buffer insertion algorithm in [6], which are the basis of our proposed algorithm.

3.1. A-tree Algorithm
In [10], it was shown that a routing tree which minimizes the Elmore delay upper bound in [20] can be achieved by minimizing a weighted combination of the objectives of the minimum Steiner tree, the shortest path tree, and the "quadratic minimum Steiner tree" (a tree that minimizes the summation of source-node path lengths, taken over all possible node locations). Therefore, a minimum-cost rectilinear arborescence (A-tree) formulated in [21] is of interest since it heuristically addresses all of these terms in the decomposed upper bound at once.

Definition 1: A rectilinear Steiner tree T is called an A-tree if every path connecting the source s_0 and any node p on the tree is a shortest path.

In [10], an efficient algorithm based on bottom-up tree construction from the sinks was proposed for the minimum-cost A-tree, which extends the algorithm in [21]. The algorithm starts with a set of subtrees, each consisting of a sink, and iteratively performs a "merging" of two subtrees or a "growing" of a subtree until all subtrees are merged into one tree. Two types of moves are used for the bottom-up construction: safe moves, which cannot worsen the sub-optimality of an existing set of subtrees, and heuristic moves, which may not lead to an optimal solution. According to their experimental results, A-trees constructed by the algorithm are at most 4% within the optimal, and achieve interconnect delay reduction by as much as 66% when compared to the best-known Steiner routing topology. In our approach here, we use only heuristic moves in the A-tree algorithm for simplicity (essentially the algorithm in [21]). Despite using only heuristic moves, [21] has similar performance as the A-tree algorithm in [10]. The algorithm in [21] works as follows: A set called ROOT, consisting of the roots of the current subtrees which will eventually be merged to form the solution, is maintained. Initially, ROOT contains the roots of n trivial trees, each consisting of a single sink. The algorithm then iteratively merges a pair of roots such that the "merged" root is as far from the source as possible, and terminates when |ROOT| = 1.

A more formal description of the algorithm is shown in Figure 2, where all sinks are assumed to lie in the first quadrant with s_0 at the origin for simplicity. But it is easy to extend the algorithm to the general case.

3.2. Buffer Insertion
For given required arrival times at the sinks of a given Steiner tree, the buffer insertion algorithm in [6] chooses the buffering positions on the tree such that the required arrival time at the source is as late as possible, where the delay is calculated based on the definitions in Section 2. The algorithm assumes that the topology of the routing tree (or Steiner tree) is given, as well as the possible (legal) positions of the buffers.
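As a concrete illustration of the delay model in Section 2.1 and the required-time recursion of Section 2.2, the following minimal sketch (our own illustration; the class, function and variable names and the small example net are assumptions, not the paper's code) evaluates a fixed buffered tree using the parameter values listed later in Table 1.

# Minimal sketch of the delay model in Section 2 (names are ours, not the paper's).
# Wires: c_e = c0*l, r_e = r0*l; wire delay = r_e*(c_e/2 + downstream load);
# buffer delay = d_b + r_b*load; q(T_v) = min over sinks u of (q_u - delay(v, u)).

C0, R0 = 0.15e-15, 0.12                  # wire cap (F/um) and resistance (ohm/um), Table 1
D_B, R_B, C_B = 0.1e-9, 800.0, 0.05e-12  # buffer intrinsic delay, output resistance, input cap

class Node:
    def __init__(self, wire_len=0.0, sink_cap=None, sink_q=None, buffered=False, children=()):
        self.wire_len = wire_len      # length of the wire entering this node (microns)
        self.sink_cap = sink_cap      # loading capacitance if this node is a sink
        self.sink_q = sink_q          # required arrival time if this node is a sink
        self.buffered = buffered      # True if a buffer is inserted at this node
        self.children = list(children)

def evaluate(node):
    """Return (load seen from above, required time at the top of node's entering wire)."""
    if node.sink_cap is not None:                # sink
        load, q = node.sink_cap, node.sink_q
    else:                                        # internal node: combine children
        results = [evaluate(ch) for ch in node.children]
        load = sum(c for c, _ in results)
        q = min(q for _, q in results)
    if node.buffered:                            # a buffer decouples the downstream load
        q -= D_B + R_B * load
        load = C_B
    ce, re = C0 * node.wire_len, R0 * node.wire_len
    q -= re * (ce / 2.0 + load)                  # Elmore delay of the entering wire
    return load + ce, q

# Example: two sinks joined at a Steiner node 500 um below the source.
root = Node(wire_len=500, children=[
    Node(wire_len=300, sink_cap=0.1e-12, sink_q=8e-9),
    Node(wire_len=800, sink_cap=0.1e-12, sink_q=6e-9),
])
load, q = evaluate(root)
print(load, q)   # q is q(T_s0); the slack at the driver is q - Rgate*load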
Procedure HeuristicAtree()
  ROOT <- {s_i | 1 <= i <= n};
  while |ROOT| > 1 do
    Find v, w in ROOT such that the sum min(v_x, w_x) + min(v_y, w_y) is maximum;
    ROOT <- ROOT + {r} - {v} - {w}, where r is a node with coordinates (min(v_x, w_x), min(v_y, w_y));
    Merge T_v and T_w into T_r, adding edges from r to v and w, respectively;
  end while;
end procedure

Figure 2. A-tree Algorithm Using Heuristic Move

In the algorithm, which is based on a dynamic programming technique, a set of (q_i, c_i) pairs is maintained for the possible buffer assignments at each legal position of the buffers, where q_i and c_i are the required arrival time and the capacitance of the dc-connected subtree(1) rooted at i corresponding to q_i, respectively (Figure 3). Note that c_i is not the total capacitance of the entire subtree rooted at i. Each pair is called an option. The algorithm consists of two phases as follows. During the first phase, the function "bottom_up()" in Figure 4 computes the irredundant set(2) of all possible options Z_v for each node v (or legal position of the buffers) in the tree in a bottom-up manner (Figure 5(a))(3). For the options at the root of the entire tree, the actual delay is calculated, using the output resistance R_gate of the gate which produces the signal: q_source = max_{z in Z} (q_z - R_gate * c_z), and the option which gives the maximum q_source is chosen. The second phase traces back the computations of the first phase that led to this option, and determines the computed buffer positions on the way (Figure 5(b)).

Procedure bottom_up(T)
  foreach v in T in topological order from sinks to source do
    if v is a sink then
      Z_v <- {(q_v - D_wire(e_v), c_v + c_{e_v})};
    else
      Z_l <- the set of options for v's left child;
      Z_r <- the set of options for v's right child;
      Z_v <- {};
      for (z_i in Z_l, z_j in Z_r) do   /* irredundant merge of Z_l and Z_r */
        if q_i <= q_j then
          Z_v <- Z_v U {(q_i, c_i + min{c_j | q_i <= q_j})};
        end if;
      end for;
      Z_v <- Z_v U {(max_{z in Z_v}(q_z - D_buff(b, c_z)), c_b)};
      for z in Z_v do
        q_z <- q_z - D_wire(e_v);
        c_z <- c_z + c_{e_v};
      end for;
    end if;
  end for;
end Procedure;

Figure 4. Algorithm Finding a Set of Options
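The option bookkeeping of Figure 4 can also be written as a few small set-manipulation routines. The sketch below is our own illustration (single buffer type, list-based option sets, names are ours), not the authors' implementation; the all-pairs merge followed by pruning yields the same irredundant set as the merge in Figure 4.

# Sketch of the option computation in Figure 4 (illustrative names; single buffer type).
# An option is a pair (q, c): required arrival time and dc-connected capacitance.

C0, R0 = 0.15e-15, 0.12                   # wire cap/res per micron (Table 1)
D_B, R_B, C_B = 0.1e-9, 800.0, 0.05e-12   # buffer intrinsic delay, output resistance, input cap

def prune(options):
    """Keep only irredundant options: no option is dominated in both q and c."""
    options.sort(key=lambda qc: qc[1])            # increasing capacitance
    kept, best_q = [], float("-inf")
    for q, c in options:
        if q > best_q:                            # larger load must buy strictly more slack
            kept.append((q, c))
            best_q = q
    return kept

def merge(z_left, z_right):
    """Combine the option sets of two children meeting at a node."""
    return prune([(min(ql, qr), cl + cr) for ql, cl in z_left for qr, cr in z_right])

def add_buffer_option(z):
    """Optionally place a buffer at the node: it isolates the load at the cost of its delay."""
    q_buf = max(q - (D_B + R_B * c) for q, c in z)
    return prune(z + [(q_buf, C_B)])

def propagate_wire(z, length_um):
    """Charge the entering wire's Elmore delay and capacitance to every option."""
    ce, re = C0 * length_um, R0 * length_um
    return [(q - re * (ce / 2.0 + c), c + ce) for q, c in z]

# Example: two sinks, options propagated up their wires, merged, buffered, then wired up.
z1 = propagate_wire([(8e-9, 0.1e-12)], 300)
z2 = propagate_wire([(6e-9, 0.1e-12)], 800)
z = propagate_wire(add_buffer_option(merge(z1, z2)), 500)
print(z)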
Figure 5. Illustration of buffer insertion: (a) Option Calculation in Bottom-up Phase; (b) Buffer Insertion in Top-Down Phase
Figure 3. An Option at the Root of Subtree T_v

In the algorithm in [6], candidate points for the buffer insertion are right after the Steiner points in the tree, which makes it possible to unload the critical path as much as possible (Figure 6(a)). In our implementation, we also make each Steiner point itself a candidate (Figure 6(b)), in addition to the points right after the Steiner points, in order to reduce the number of buffers inserted. Moreover, an edge whose length is longer than a certain threshold given by the user is divided in order to make it possible to insert a buffer in the middle of the wire (Figure 6(b)).

4. Simultaneous Steiner Tree Construction and Buffer Insertion

4.1. Basic Idea of the Proposed Approach
We develop an algorithm for simultaneous Steiner tree construction and buffer insertion, called the buffered A-tree (BA-tree) algorithm, by combining the A-tree and buffer insertion algorithms in Section 3. In this combination, the concepts of critical path isolation (Figure 7(a)) and balanced load decomposition (Figure 7(b)) are also applied, which are techniques used for fanout optimization (or buffer insertion) in logic synthesis[1, 2, 3].
l "dc-connected" means "directly connected by wires'. 2 Irredundant set has no two options (q, c) and (q', c') such that q > q' and c < c' [6]. 3 For simplicity, a binary tree is assumed here, but the algorithm is easily applied to general trees by addition of dummy nodes and 0 length wires[8]. Node which has only one child, where Z, or Zr are NULL in Figure 4, can be also treated by a simple extension.
Figure 6. Candidate Points for Buffer Insertion: (a) Candidates for Buffer Insertion Points in [6]; (b) Candidates for Buffer Insertion Points in Our Implementation
In logic synthesis, when one or several sinks are timing-critical, the critical path isolation technique generates a fanout tree so that the root gate drives the critical sinks and a smaller additional load due to buffered non-critical paths. On the other hand, if required times at sinks are within a small range, balanced load decomposition is applied in order to decrease the load at the output of the root gate. These transformations are applied recursively in a bottom-up process from the sinks, in the same manner as the A-tree and buffer insertion algorithms. Therefore, it is natural for us to apply these techniques in combination with the A-tree and buffer insertion algorithms.
Figure 7. Fanout Optimization in Logic Synthesis: (a) Critical Signal Isolation; (b) Balanced Load Decomposition

In our approach, the concepts of critical path isolation and balanced load decomposition are used when choosing the two subtrees (T_v and T_w) to be merged in the A-tree algorithm. Every pair of subtree roots v and w is evaluated by computing the required time at the root of subtree T_r, which results from the merging of T_v and T_w. Then, the best pair for merging is chosen so that critical path isolation and balanced load decomposition are achieved (see Section 4.2). The required time at the root of T_r is calculated based on the options at v and w, dist(r, v) and dist(r, w)(4) for the interconnect delay, and the effect of buffer insertion at r. For the evaluation, we keep a set of options at each subtree's root by using bottom_up() during the construction of the A-tree. Basically, the following two steps are iterated in the BA-tree algorithm.

1. Select v and w, taking critical path isolation and balanced load decomposition into account.
2. Merge T_v and T_w to T_r, and compute a set of options at r by bottom_up(T_r).

4.2. Selection of Roots to be Merged in BA-tree
In our algorithm, the computation of options and tree construction are performed simultaneously. Suppose that subtrees T_v and T_w are merged into T_r as shown in Figure 8. Let Z_v and Z_w be the sets of options at v and w, respectively, computed in the previous steps. Based on Z_v, Z_w, dist(r, v), dist(r, w), and buffer b's characteristics, a set of options Z_r at r is temporarily computed for evaluation of the best merge. Since the parent nodes of the current subtrees' roots, v and w, are not determined yet at this stage, l_{e_v} and l_{e_w} were assumed to be 0 in the computation of Z_v and Z_w, respectively. In the temporary computation of Z_r, we update Z_v and Z_w using l_{e_v} = dist(r, v) and l_{e_w} = dist(r, w) when computing the arrival time at r, with the assumption that l_{e_r} = dist(s_0, r). Note that dist(s_0, r) is an upper bound of l_{e_r}.

We introduce the following definitions before describing how to select two subtrees to be merged in BA-tree construction.

Figure 8. Evaluation of a Merged Subtree

Definition 2: The maximum possible required time at the root r of subtree T_r generated by the merging of T_v and T_w, denoted R_vw, is defined as follows:

  R_vw = max_{z in Z_r} q_z,

where r is the merging point of T_v and T_w, and Z_r is the set of options at r.

Definition 3: The maximum R_vw among all possible merging pairs v and w in the set of roots ROOT of the current subtrees, denoted Rmax(ROOT), is defined as follows:

  Rmax(ROOT) = max_{v,w in ROOT} R_vw.

Definition 4: The distance between the source and the merging point for v and w, denoted D_vw, is defined as follows: D_vw = min(v_x, w_x) + min(v_y, w_y). This definition is for the case that v and w are in the first quadrant with s_0 at the origin. Other cases can be defined in a similar way.

Definition 5: The maximum D_vw among all possible merging pairs v and w in the set of roots ROOT of the current subtrees, denoted Dmax(ROOT), is defined as follows:

  Dmax(ROOT) = max_{v,w in ROOT} D_vw.

Now, we can define the merging cost for v and w.

Definition 6: The merging cost for v and w, denoted mcost(v, w, ROOT), is defined as follows:

  mcost(v, w, ROOT) = a * R_vw / Rmax(ROOT) + (1 - a) * D_vw / Dmax(ROOT),

where a is a fixed constant with 0.0 <= a <= 1.0. Note that instead of using a * R_vw + b * D_vw for the cost, we use the scaled objective above. The use of the scaled objective avoids the problem of choosing a different pair of coefficients for each instance.

For the subtree merging in the BA-tree algorithm, we select T_v and T_w whose mcost(v, w, ROOT) is maximum among all possible pairs of subtrees. Clearly, if we set a = 0.0, our root selection criterion for merging is the same as that in the A-tree algorithm presented in Section 3.1. By using mcost in the A-tree construction, required time maximization with buffer insertion (critical path isolation and balanced load decomposition) and wire length minimization can be achieved simultaneously. The second term in mcost contributes to wire length minimization, as in the original A-tree algorithm. The first term contributes to critical path isolation and balanced load decomposition, as in the fanout optimization in logic synthesis. When one or several sinks are timing-critical, those sinks are isolated, since the merging for those sinks, whose R values are smaller than the others, will be applied at a later stage. Figure 9(a) shows an example for this case, where sink s_1 is the most critical among all the sinks. Sink s_1 will be isolated since the merging for s_1, whose R is smaller than the others, will be applied after s_4, s_3, and s_2 are merged. On the other hand, if required times at sinks are within a small range, the merging will be performed so that the load is balanced, since the R values of the mergings for those sinks are also within a small range. Figure 9(b) shows an example for this case, where required times at sinks s_1, s_2, s_3, and s_4 are within a small range. The load will be balanced, since the R values of the mergings for the sinks are within a small range.

(4) dist(v, w) denotes the Manhattan distance between v and w.
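To illustrate the merge selection of Definitions 2-6, the following sketch (our own illustration; the data layout and the options_at_merge callable, which stands in for the temporary option computation at the merge point, are assumptions rather than the paper's code) scores every candidate pair and returns the one with maximum mcost.

# Sketch of the BA-tree merge selection (Definitions 2-6); names and structure are ours.
from itertools import combinations

def d_vw(v, w):
    """Definition 4: distance from the source (origin) to the merging point of v and w."""
    return min(v["x"], w["x"]) + min(v["y"], w["y"])

def r_vw(v, w, options_at_merge):
    """Definition 2: best required time over the temporary option set at the merge point."""
    return max(q for q, _ in options_at_merge(v, w))

def select_merge_pair(roots, options_at_merge, a=0.4):
    """Definition 6: pick the pair with maximum scaled merging cost mcost(v, w, ROOT)."""
    pairs = list(combinations(roots, 2))
    r_max = max(r_vw(v, w, options_at_merge) for v, w in pairs)
    d_max = max(d_vw(v, w) for v, w in pairs)
    def mcost(v, w):
        return a * r_vw(v, w, options_at_merge) / r_max + (1 - a) * d_vw(v, w) / d_max
    return max(pairs, key=lambda vw: mcost(*vw))

With a = 0.0 this reduces to the purely geometric selection of the heuristic A-tree algorithm, which matches the remark above.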
Figure 9. Example of BA-tree: (a) Tree with Critical Sink Isolation; (b) Tree with Balanced Load Decomposition

4.3. Overall Algorithm
The algorithm consists of two phases, in the same way as the buffer insertion[6]: bottom-up tree construction with option computation, and top-down buffer insertion. A formal description of the first phase, bottom-up tree construction with option computation, is shown in Figure 10. Option computation at each subtree's root by bottom_up() and mcost(v, w, ROOT) evaluation at the merging are integrated into the A-tree algorithm. The second phase, top-down buffer insertion, is the same as the one in the buffer insertion[6]. The option which gives the maximum required time at the root is chosen, and the algorithm then traces back the computations of the first phase that led to this option. During the backtrace, the buffer positions are determined.

Procedure BA-tree-bottomup()
  ROOT <- {s_i | 1 <= i <= n};
  foreach v in ROOT do
    bottom_up(v);  /* Z_v is computed for each sink */
  end for;
  while |ROOT| > 1 do
    Find v, w in ROOT with max_{v,w in ROOT} mcost(v, w, ROOT);
      /* Z_r is temporarily computed for its evaluation */
    ROOT <- ROOT + {r} - {v} - {w}, where r is a node with coordinates (min(v_x, w_x), min(v_y, w_y));
    Merge T_v and T_w into T_r, adding edges from r to v and w, respectively;
    bottom_up(T_r);  /* Z_r, Z_v, Z_w are re-computed here with pruning */
  end while;
end procedure

Figure 10. Algorithm for Simultaneous A-Tree Construction and Option Computation

5. Experimental Results
We implemented BA-tree on a Sun SPARC 5 workstation under the C/UNIX environment, and tested it on signal nets with 10, 25, 50, and 100 sinks(5). For each net size, 100 nets were randomly generated on a 10mm x 10mm routing region, and we evaluated the average results. The loading capacitances and required times at the sinks are randomly chosen from the intervals [0.05pF, 0.15pF] and [5.0ns, 10.0ns], respectively. The parameters used in the experiments are summarized in Table 1.

(5) The number of sinks in signal nets before buffer insertion is usually up to 100, and most of them are less than 25.

Table 1. Parameters for Experiments
  Output Resistance of Gate      Rgate   1000 ohm
  Output Resistance of Buffer    rb      800 ohm
  Intrinsic Delay of Buffer      db      0.1 ns
  Wire Resistance                r0      0.12 ohm/um
  Wire Capacitance               c0      0.15 fF/um
  Loading Capacitance of Sink    cs      0.05 pF - 0.15 pF
  Loading Capacitance of Buffer  cb      0.05 pF
  Required Time at Sink          qs      5.0 ns - 10.0 ns

We compared results obtained by the following two methods:

M1: A-tree[21] followed by buffer insertion[6].
M2: BA-tree construction (a in mcost: 0.2, 0.4).

Table 2 shows the average required times at the sources of the buffered Steiner trees generated by the two methods. The difference in required time increases as the number of sinks is increased. Although the difference is not so large for nets with 10 sinks, the required time at the source of the BA-tree is larger than that of M1 by 49% with a = 0.2 and 75% with a = 0.4 for nets with 100 sinks.

Table 2. Required Time at Source (ns)
  #sinks   M1            M2 (a: 0.2)    M2 (a: 0.4)
  10       3.05 (1.00)   3.07 (1.01)    3.10 (1.02)
  25       2.22 (1.00)   2.29 (1.03)    2.37 (1.07)
  50       1.65 (1.00)   1.80 (1.09)    1.94 (1.18)
  100      0.88 (1.00)   1.31 (1.49)    1.54 (1.75)

Table 3 shows the average runtimes, which are increased due to the merging pair evaluation by 1.5 times (10 sinks) to 40 times (100 sinks) in BA-tree.
Table 3. Run Time (s)
  #sinks   M1            M2 (a: 0.2)    M2 (a: 0.4)
  10       0.02 (1.00)   0.03 (1.5)     0.03 (1.5)
  25       0.03 (1.00)   0.16 (5.3)     0.15 (5.0)
  50       0.06 (1.00)   1.02 (17.0)    1.01 (16.8)
  100      0.18 (1.00)   7.18 (42.2)    6.95 (38.6)

Table 4 shows the wire length, which is increased by 0% (10 sinks) to 7% (100 sinks) with a = 0.2 and by 5% (10 sinks) to 28% (100 sinks) with a = 0.4.

Table 4. Wire Length (mm)
  #sinks   M1            M2 (a: 0.2)    M2 (a: 0.4)
  10       2.57 (1.00)   2.58 (1.00)    2.69 (1.05)
  25       4.25 (1.00)   4.32 (1.02)    4.70 (1.11)
  50       6.08 (1.00)   6.29 (1.03)    7.17 (1.18)
  100      8.64 (1.00)   9.23 (1.07)    11.00 (1.28)

Table 5 shows the number of buffers inserted, which is also increased by 0% (10 sinks) to 8% (100 sinks) with a = 0.2 and by 0% (10 sinks) to 15% (100 sinks) with a = 0.4. Note that minimization of the number of buffers as in [6] is not considered here. Therefore, redundant buffers might be included in the results.

Table 5. #Buffers Inserted
  #sinks   M1          M2 (a: 0.2)   M2 (a: 0.4)
  10       8 (1.00)    8 (1.00)      8 (1.00)
  25       18 (1.00)   18 (1.00)     18 (1.00)
  50       31 (1.00)   32 (1.03)     33 (1.06)
  100      53 (1.00)   57 (1.08)     61 (1.15)

Through Tables 2 to 5, the tradeoff between the required time, wire length and number of buffers can be seen for the different values of the parameter a.

6. Conclusions
In this paper, we have presented an algorithm, BA-tree, which derives a buffered Steiner tree so that the required arrival time at the source is maximized. The algorithm achieves Steiner tree construction and buffer insertion simultaneously, while these two steps were carried out independently in the past. We have shown its efficiency and effectiveness experimentally. Future work will include total capacitance minimization and its trade-off with the required time at the source. We also plan to incorporate optimal wiresizing for further delay optimization. Our preliminary study shows that all these optimization techniques can be combined in the bottom-up dynamic programming paradigm employed in this paper. Interested readers may contact the authors for the follow-up work.

References
[1] C. L. Berman, J. L. Carter, and K. F. Day, "The fanout problem: From theory to practice," Advanced Research in VLSI: Proc. 1989 Decennial Caltech Conf., pp. 69-99, 1989.
[2] H. J. Touati, C. W. Moon, R. K. Brayton, and A. Wang, "Performance Oriented Technology Mapping," Proc. Sixth MIT VLSI Conf., pp. 79-97, 1990.
[3] K. J. Singh and A. Sangiovanni-Vincentelli, "A Heuristic Algorithm for the Fanout Problem," Proc. ACM/IEEE Design Automation Conf., 1990, pp. 357-360.
[4] H. Vaishnav and M. Pedram, "Routability-Driven Fanout Optimization," Proc. ACM/IEEE Design Automation Conf., 1993, pp. 230-235.
[5] L. N. Kannan, P. R. Suaris, and H. G. Fang, "A Methodology and Algorithms for Post-Placement Delay Optimization," Proc. ACM/IEEE Design Automation Conf., 1994, pp. 327-332.
[6] L. P. P. P. van Ginneken, "Buffer Placement in Distributed RC-tree Networks for Minimal Elmore Delay," Proc. IEEE Int. Symp. Circuits Syst., 1990, pp. 865-868.
[7] J. Lillis, C. K. Cheng, and T. T. Lin, "Optimal and Efficient Buffer Insertion and Wire Sizing," Proc. IEEE Custom Integrated Circuits Conf., 1995, pp. 259-262.
[8] J. Lillis, C. K. Cheng, and T. T. Lin, "Optimal Wire Sizing and Buffer Insertion for Low Power and a Generalized Delay Model," Proc. IEEE Int. Conf. Computer-Aided Design, 1995, pp. 138-143.
[9] D. Zhou, F. P. Preparata, and S. M. Kang, "Interconnection Delay in Very High-Speed VLSI," IEEE Trans. Circuits Syst., 38(7), pp. 779-790, July 1991.
[10] J. Cong, K. S. Leung, and D. Zhou, "Performance-Driven Interconnect Design Based on Distributed RC Delay Model," Proc. ACM/IEEE Design Automation Conf., 1993, pp. 606-611.
[11] J. Cong, A. B. Kahng, G. Robins, M. Sarrafzadeh, and C. K. Wong, "Provably Good Performance-Driven Global Routing," IEEE Trans. Computer-Aided Design, 11(6), pp. 739-752, June 1992.
[12] C. J. Alpert, T. C. Hu, H. Huang, and A. B. Kahng, "A Direct Combination of the Prim and Dijkstra Constructions for Improved Performance-Driven Routing," Proc. IEEE Int. Symp. Circuits Syst., 1993, pp. 1869-1872.
[13] J. P. Cohoon and L. J. Randall, "Critical Net Routing," Proc. IEEE Int. Conf. Computer Design, 1991, pp. 174-177.
[14] K. D. Boese, A. B. Kahng, B. A. McCoy, and G. Robins, "Rectilinear Steiner Trees with Minimum Elmore Delay," Proc. ACM/IEEE Design Automation Conf., 1994, pp. 381-386.
[15] K. D. Boese, A. B. Kahng, B. A. McCoy, and G. Robins, "Near-Optimal Critical Sink Routing Tree Constructions," IEEE Trans. Computer-Aided Design, 14(12), pp. 1417-1436, Dec. 1995.
[16] X. Hong, T. Xue, E. S. Kuh, C. K. Cheng, and J. Huang, "Performance-Driven Steiner Tree Algorithms for Global Routing," Proc. ACM/IEEE Design Automation Conf., 1993, pp. 177-181.
[17] W. C. Elmore, "The Transient Response of Damped Linear Networks with Particular Regard to Wideband Amplifiers," J. Applied Physics, 19, pp. 55-63, 1948.
[18] J. Cong and C.-K. Koh, "Simultaneous Driver and Wire Sizing for Performance and Power Optimization," IEEE Trans. VLSI, 2(4), pp. 408-423, Dec. 1994.
[19] J. Cong and K. S. Leung, "Optimal Wiresizing Under the Distributed Elmore Delay Model," IEEE Trans. Computer-Aided Design, 14(3), pp. 321-336, Mar. 1995.
[20] J. Rubinstein, P. Penfield, and M. A. Horowitz, "Signal Delay in RC Tree Networks," IEEE Trans. Computer-Aided Design, 2(3), pp. 202-211, 1983.
[21] S. K. Rao, P. Sadayappan, F. K. Hwang, and P. W. Shor, "The Rectilinear Steiner Arborescence Problem," Algorithmica, 7, pp. 277-288, 1992.
Simultaneous Routing and Buffer Insertion for High Performance Interconnect

John Lillis, Chung-Kuan Cheng
Dept. of Computer Sci. & Engr.
Ting-Ting Y. Lin
Dept. of Elect. & Computer Engr.
University of California, San Diego
La Jolla, CA 92093-0114
Abstract We present an algorithm for simultaneously finding a Rectilinear Steiner Tree T and buffer insertion points into T. The objective of the algorithm is to minimize a cost function (e.g., total area or power) subject to given timing constraints on the sinks of the net. An interesting side-effect of our approach is that we are able to derive an entire cost/delay tradeoff curve for added flexibility. The solutions produced by the algorithm are optimal subject to the constraint that the routing topology be induced by a permutation on the sinks of the net. We show that high quality sink permutations can be derived from a given routing structure such as the Minimum Spanning Tree. This derivation provides an error bound on the minimum area solution induced by the permutation. The effectiveness of our algorithm is demonstrated experimentally.
1. Introduction
In recent years, interconnect delay has become an increasingly critical factor in VLSI systems, in some cases accounting for over 50% of overall delay. This trend is a result of the increased resistance of interconnect when feature sizes enter the sub-micron range and will become more dramatic in the future. Two promising techniques for improving interconnect performance are the topic of this paper: performance driven routing and buffer insertion. Previous work in performance driven routing includes [1], [3], [4] and [2]. Recently, in [9], the P-Tree performance driven routing algorithm was proposed. The key idea behind the P-Tree algorithm is restricting the solution space by a permutation constraint on the sinks of the given net; in other words, given such a sink permutation, only routing topologies which can be induced by that permutation are considered. As a result, the problem is sufficiently constrained to allow a pseudo-polynomial dynamic programming algorithm solving the problem of minimizing area subject to timing constraints on the sinks of the net under the Elmore delay model [10]. The algorithm also computed an entire cost/delay tradeoff curve. This approach has yielded impressive results in both area overhead and delay versus previously proposed methods. In the area of buffer insertion, van Ginneken [13] presented an efficient algorithm for inserting buffers into a given static routing topology so as to maximize the required arrival time at the root of the tree and sketched the generalization of area minimization subject to timing constraints. Also of interest is the approach in [12] where the authors derive fanout trees from a sink permutation. However, [12] did not consider area overhead or optimize the topology embedding. Recently, [7] gave an efficient implementation of this generalization (and minimization of power) and also incorporated a generalized buffer delay model taking signal slew into account. Like the P-Tree algorithm, this work also computes a cost/delay tradeoff curve. Since both the P-Tree algorithm and typical buffer insertion algorithms adopt a bottom-up dynamic programming approach, it is natural to attempt simultaneous optimization by both techniques. This is the topic of this paper. The remainder of the paper is organized as follows. Section 2 introduces necessary concepts and definitions; Section 3 gives details of our algorithm; Section 4 discusses experimental results; and we conclude in Section 5.
2. Preliminaries

2.1 Delay Models
Throughout this paper we use the Elmore delay model to model interconnect delay and a simple RC delay model for buffers and drivers. In the Elmore model, the delay of a wire segment e = (u, v) transmitting a signal from node u to node v is defined as follows. Let r_e and c_e be the resistance and capacitance of e respectively. Further, let c(T_v) be the capacitive load at node v. The delay of the segment is expressed as r_e (c_e/2 + c(T_v)). Similarly, the delay of a buffer b at node v is determined by c(T_v), the capacitive load at v, and b's intrinsic (load independent) delay d_b and output resistance r_b. The delay through the buffer with load c_l on its output is d_b + r_b c_l.
We note that a more accurate buffer delay model taking signal slew into account can be incorporated by the techniques presented in [7] and [8]. For simplicity, we do not present those techniques here.
2.2 Definitions and Problem Formulations
Central to the algorithm we present are the concepts of the Grid Graph of a terminal set and the notion of a topology being induced by a pin permutation.

Definition 2.1 Grid Graph [5]: given a set of terminals N, N's Grid Graph GG(N) = (V, E) is defined by the following process: Construct vertical and horizontal lines through each terminal. Let V be identified with the set of intersection points of these lines. There is an edge in E between two vertices iff the corresponding intersection points are connected by a single horizontal or vertical segment.

Definition 2.2 Permutation Induced Abstract Topology: Consider a terminal set N and a permutation pi on the sinks of N. A binary tree T is an abstract topology induced by pi if its leaves are identified with the terminals in N and it obeys the ordering imposed by pi when T is interpreted as a binary search tree (the driver being implicitly attached to the root).

Definition 2.3 Abstract Topology Embedding: Given an abstract topology T (as a binary tree), leaves identified with the terminal set N, and a target graph G = (V, E) (e.g., G = GG(N)), an embedding of T into G is a mapping of the internal nodes of T to V.

We use the term "abstract topology" to emphasize that such a topology is not a true routing topology in that its internal nodes are not mapped. Only when such a topology is embedded in the plane or a routing graph such as GG(N) does a physical routing topology result. Given this framework, we solve the following problem:

Problem 2.1 Given terminal set N, required arrival times for each sink, driver parameters, buffer parameters and sink permutation pi, find the minimum cost routing topology induced by pi and embedded in GG(N), possibly with buffers inserted at prescribed points, where all timing requirements are satisfied. Cost may be a combination of the contribution of inserted buffers and consumed routing resources. For instance, one may wish to minimize the dynamic power dissipation of the net, in which case cost is the total capacitance associated with the solution.

3. Algorithm
In describing our algorithm for simultaneous routing and buffer insertion, we first describe the somewhat independent phase of finding a good sink permutation. Next we describe the algorithm solving Formulation 2.1 by (1) describing the nature and properties of the solution sets we compute by dynamic programming, (2) giving some important primitives for manipulating those solution sets and (3) sketching the entire algorithm in terms of the primitives.

3.1 Finding High Quality Permutations
We adopt the technique proposed in [9] to construct a sink permutation, which we summarize here (the reader is referred to [9] for further details). The method is broken into three phases: (1) construction of a hierarchical decomposition of the terminals consistent with the Minimum Spanning Tree, (2) reorienting the hierarchical structure such that the driver is attached to the root while maintaining consistency and (3) application of a dynamic programming algorithm to optimize the tour length of the induced permutation. The first phase is illustrated in pseudo-code in Figure 1. It was shown in [9] that the hierarchy produced by this algorithm (and any reorientation of it) is consistent with the MST. By consistency, we mean that any permutation derived by a depth first traversal of the hierarchy can induce the MST itself. This is a useful property since it ensures that the minimum area routing topology induced by the permutation is no worse than 50% larger than the optimal rectilinear Steiner tree spanning the points [6]. Among the possible permutations produced by a depth first traversal, we select the one with minimal tour length. Finding this permutation can be done in O(n^4) time by dynamic programming [9]. The intuition is that tour length minimization should lead to good clustering characteristics in the permutation. We illustrate this process in Figure 2. In Fig. 2(a) we see an MST for the point set with node e designated as the driver; in 2(b), we see a hierarchical decomposition consistent with the MST; in 2(c), we see the reoriented hierarchy and the possible permutations derived from the hierarchy along with their tour lengths. The minimum tour-length permutation in Figure 2 is "d c b a". To give some insight into the next phase of the algorithm and Formulation 2.1, we show in Figure 3(a) an abstract topology induced by the permutation and in Fig. 3(b) an embedding of that topology into the routing graph (which gives the minimum area Rectilinear Steiner Tree for this case).

Algorithm: MST-to-hierarchy
  Let T_m be a rectilinear Minimum Spanning Tree
  For all v in T_m, let h(v) be a single node tree labeled v
  Repeat n - 1 times
    1. select a leaf node v and its parent u in T_m
    2. delete edge (u, v) from T_m
    3. replace h(u) with a new tree t where t.left = h(u) and t.right = h(v)
  return last tree formed

Figure 1: Constructing Hierarchy Consistent with MST
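As a concrete illustration of the MST-to-hierarchy phase in Figure 1, the following sketch (our own illustration; the parent-map encoding, function names and example MST are assumptions, not the authors' code) repeatedly collapses a leaf of the MST into its parent and then reads a sink permutation off the resulting hierarchy by a depth-first traversal.

# Sketch of the MST-to-hierarchy construction of Figure 1 (illustrative only).
# The MST is given as a parent map; h(v) starts as a single-node tree labeled v.

def mst_to_hierarchy(parent):
    """parent: dict mapping each non-root node to its MST parent."""
    h = {v: v for v in set(parent) | set(parent.values())}   # h(v) = leaf tree "v"
    remaining = dict(parent)
    last = None
    while remaining:
        # pick a leaf: a node that is nobody's parent among the remaining edges
        leaf = next(v for v in remaining if v not in remaining.values())
        u = remaining.pop(leaf)                               # delete edge (u, leaf)
        h[u] = (h[u], h[leaf])                                # t.left = h(u), t.right = h(leaf)
        last = h[u]
    return last

def leaf_order(tree):
    """A depth-first traversal of the hierarchy yields a sink permutation."""
    return [tree] if not isinstance(tree, tuple) else leaf_order(tree[0]) + leaf_order(tree[1])

# Example: a path MST e - d - c - b - a rooted at driver e.
hierarchy = mst_to_hierarchy({"a": "b", "b": "c", "c": "d", "d": "e"})
print(hierarchy, leaf_order(hierarchy))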
3.2 Nature of Solution Sets

In the following we refer to the i'th sink in the given permutation π as sink i. Further, we will assume that our cost metric is total capacitance, which corresponds to dynamic power dissipation [11]. The algorithm can be adapted for area minimization by appropriately weighting the cost of routing and logic resources. For simplicity, we present the algorithm under two assumptions: (1) the buffer library does not contain inverters and (2) buffer insertion points are only at branching points. The algorithm can be easily adapted to eliminate these assumptions.

Consider a routing tree rooted at a vertex v of GG(N) and spanning sinks i..j (possibly containing buffers). Such a tree is characterized by three parameters:

p: the total capacitance of the tree.
c: the capacitive load of the tree at v.
q: the required arrival time at v.

Intuitively, p is the cost of the tree, c is the load presented upward by the tree and q gives the timing characteristics of the tree. In a situation where no buffers have been inserted into the tree, p = c. Thus, it is by these three parameters that we characterize a routing sub-solution. Our approach is to inductively compute, in bottom-up fashion, sets of (p, c, q) triples. We compute the sets S(v, i, j) and Sb(v, i, j), which have the following intuitive meaning:

(p, c, q) ∈ S(v, i, j) ⟺ there exists a routing tree rooted at vertex v, spanning sinks i..j, with cost p, load c and required time q.
(p, c, q) ∈ Sb(v, i, j) ⟺ there exists a routing tree rooted at vertex v, spanning sinks i..j, with cost p, load c and required time q, and v is a branching point.

Figure 2: Finding a sink permutation
Figure 3: Abstract Topology and an Embedding

In [13] and [7], the following pruning property was used to keep solution sets as small as possible.

Property 3.1 Solution (p, c, q) in solution set S is sub-optimal if there exists a solution (p', c', q') ∈ S where p' ≤ p, c' ≤ c and q' ≥ q (in the case where all parameters are equal, either may be discarded).

An interesting special case of this property is when all p's are equal (or we are optimizing performance independent of cost). In such a case, a solution set with all sub-optimal solutions discarded can be arranged in strictly increasing order of both c and q. We refer to such sets as cq-sets. In [7] it was proposed that solution sets be organized first by p; i.e., one partitions the sets into cq-sets with identical p values. For such sets we maintain the invariant that they are arranged in increasing order of c and q. For a solution set S, it will be useful to refer to the cq-set associated with p; we denote it by S[p]. As will be discussed in the next section, organizing the solutions in this manner allows efficient detection of Property 3.1.

3.3 Primitives

We now describe three useful primitives for manipulating cq-sets and solution sets: join-cq-sets(), augment-soln-set() and join-soln-sets(). The routine join-cq-sets(Cl, Cr) takes the left and right cq-sets of a vertex and produces the optimal cq-set at that vertex. This is done in linear time by a process similar to merging sorted lists: we start with the minimum-load pairs and repeatedly advance the critical one (this technique was used extensively in [13] and [7]). Pseudo-code appears in Figure 4.

Figure 4: Primitive join-cq-sets(Cl, Cr)
  i ← 1; j ← 1; C ← ∅
  While (i ≤ |Cl| and j ≤ |Cr|)
    Let (c1, q1) = Cl[i]
    Let (c2, q2) = Cr[j]
    C ← C ∪ {(c1 + c2, min(q1, q2))}
    If (q1 ≤ q2)   /* Cl critical */
      i ← i + 1
    If (q2 ≤ q1)   /* Cr critical */
      j ← j + 1
  return C
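For concreteness, the merge of Figure 4 can be sketched in a few lines of Python. This is an illustrative sketch of ours (not the authors' implementation), with cq-sets represented as lists of (load, required-time) pairs sorted in increasing order of both fields.

```python
def join_cq_sets(cl, cr):
    """Merge two cq-sets (lists of (load, req_time) pairs, both increasing)
    into the cq-set of their parent, advancing the critical side each step."""
    i, j, out = 0, 0, []
    while i < len(cl) and j < len(cr):
        (c1, q1), (c2, q2) = cl[i], cr[j]
        out.append((c1 + c2, min(q1, q2)))
        if q1 <= q2:          # left child is (weakly) critical
            i += 1
        if q2 <= q1:          # right child is (weakly) critical
            j += 1
    return out

# Example: two child cq-sets; larger loads buy later required times.
left  = [(1.0, 3.0), (2.0, 5.0)]
right = [(1.5, 4.0), (3.0, 6.0)]
print(join_cq_sets(left, right))   # [(2.5, 3.0), (3.5, 4.0), (5.0, 5.0)]
```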
The routine augment-soln-set(S, l) is used to extend single-stem solutions by an incoming wire of length l. We examine each (p, c, q) ∈ S, add to c the capacitance of a wire of length l, and subtract the delay of such a wire from q to give (p', c', q'):

  p' = p + c_l,  c' = c + c_l,  q' = q − d_l,

where c_l is the capacitance of a wire of length l and d_l indicates the Elmore delay of a wire of length l with load c.

A fundamental operation in [7] was to take solution sets Sl and Sr from a node v's left and right children and produce a new set S of solutions at v, where a solution in S may also be buffered at v. Additionally, we must apply Property 3.1 to keep S as small as possible. This operation will also be fundamental in the computation of Sb. Letting B represent a buffer library, we use the primitive join-soln-sets(Sl, Sr, B) to represent this operation. Its implementation follows that of [7] and we sketch the key points here. An unbuffered solution (p, c, q) ∈ S is created from a pair of solutions (pl, cl, ql) ∈ Sl and (pr, cr, qr) ∈ Sr where

  p = pl + pr,  c = cl + cr  and  q = min(ql, qr).

Similarly, letting db, cb and rb be the intrinsic delay, input capacitance and output resistance of buffer b ∈ B, a solution (p, c, q) with buffer b at the root is created from a pair of solutions (pl, cl, ql) ∈ Sl and (pr, cr, qr) ∈ Sr where

  p = pl + pr + cb,  c = cb  and  q = min(ql, qr) − db − rb(cl + cr).

Thus, letting S be the solution set we wish to compute, a cq-set S[p] is derived from either (1) cq-sets Sl[pl] and Sr[pr] where pl + pr = p, or (2) cq-sets Sl[pl] and Sr[pr] and a buffer b where pl + pr + cb = p. The strategy taken in [7] and taken here is to visit these configurations in order of p. By building the cq-sets in increasing order of p, we have already satisfied one of the pruning conditions in Property 3.1. When building a cq-set C with cost p, we know that all lower-cost solutions have already been visited and that only lower-cost solutions have been visited. Thus, to determine if (c, q) ∈ C is sub-optimal, we need to determine if there exists a previous (c', q') where c' ≤ c and q' ≥ q. Such a problem is a special case of a two-dimensional orthogonal range query [14] and can be solved in logarithmic time (see [7], [8] for details). This approach is summarized in Figure 5.

Figure 5: Pseudo-code for join-soln-sets(Sl, Sr, B)
  Let B' = B ∪ {∅}, where ∅ is a "non-buffer"
  Sort the set F = {pl + pr + cb : pl ∈ Sl, pr ∈ Sr, b ∈ B'}
  For each pl + pr + cb ∈ F in order:
    S' ← join-cq-sets(Sl[pl], Sr[pr])
    If b ≠ ∅
      Find (c, q) ∈ S' s.t. q' = q − rb·c − db is maximized
      S' ← {(cb, q')}
    Prune S' versus solutions in S
    S ← S ∪ S'

3.4 Overall Algorithm

The overall structure of the algorithm appears in Figure 6. The keys are in steps 5 and 6, where we compute Sb and S. The base case of computing S(v, i, i) is simply a matter of computing the distance between vertex v and sink i and the associated Elmore delay and subtracting this from qi, the given required arrival time at sink i. Since S is a generalization of Sb, we first compute Sb(v, i, j) for all v and from this we compute S(v, i, j). When the algorithm terminates we have stored in S(vd, 1, n − 1) the optimal solution set at the output of the driver (where the driver is located at vertex vd). For each (p, c, q) ∈ S(vd, 1, n − 1) we have an overall solution with cost p and required arrival time q − rd·c, where rd is the output resistance of the driver. The result is an entire cost/delay tradeoff curve. We propose that such a curve is useful since such net optimizations must be considered in the context of the entire circuit. For instance, a higher-level tool can use such a curve to determine the marginal benefit of optimizing one net versus another, or how aggressively to optimize one net before moving to another.

Figure 6: High-Level Algorithm
  Given: GG(N) = (V, E)
  1. Compute S(v, i, i) ∀v ∈ V, 1 ≤ i ≤ n − 1
  2. For I = 1..n − 2
  3.   For i = 1..n − 1 − I
  4.     j = i + I
  5.     Compute Sb(v, i, j) ∀v ∈ V
  6.     Compute S(v, i, j) ∀v ∈ V using the results of step 5
  7. S(vd, 1, n − 1) now gives the sub-solutions to be paired with the driver.

We first sketch how to compute S(v, i, j) ∀v once Sb(v, i, j) has been computed ∀v. We proceed in four phases, computing the following intermediate solution sets as we go (recall this is for a fixed i, j):

LS(v): the solution set where trees are rooted at v with branching point v' constrained to be at v or to v's left in v's row.
RS(v): the solution set where trees are rooted at v with branching point v' constrained to be in the same row as v.
US(v): the solution set where trees are rooted at v with branching point v' constrained to be in v's row or a row above v (up from v).

Once US(v) has been computed, we compute S(v, i, j) in a final phase. Each of these sets is computed by making a linear pass through the grid graph and using the previously computed intermediate solution set to bootstrap. To compute LS(v) ∀v, we make a left-to-right traversal of each row, inductively computing the solutions from those of our neighbor. Similarly, we compute RS(v) by right-to-left passes using LS(v); US(v) is computed by traversing each column top-to-bottom; and finally we compute S(v, i, j) with a bottom-to-top sweep. We illustrate the process with pseudo-code for computing US(v) in Figure 7. In the figure we use v{r,c} to indicate the vertex in the r'th row and c'th column so that we may easily identify a vertex's neighbors. We also use d(u, v) to denote the physical distance between vertices u and v.

Figure 7: Computation of US(v) ∀v ∈ V
  US(v{1,c}) ← RS(v{1,c}) ∀c ∈ {1..n}
  for c = 1 to n
    for r = 2 to n
      A ← augment-soln-set(US(v{r−1,c}), d(v{r−1,c}, v{r,c}))
      US(v{r,c}) ← RS(v{r,c}) ∪ A
      Prune sub-optimal solutions from US(v{r,c})
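The pruning of Property 3.1 that all of these sweeps rely on can be illustrated with a small sketch of ours (not the authors' implementation); it keeps a solution only if no other solution has cost and load no larger and required time no smaller.

```python
def prune(solutions):
    """Drop (p, c, q) triples that are sub-optimal per Property 3.1.
    Quadratic for clarity; the paper organizes solutions by cost and uses an
    orthogonal range query to do this check in logarithmic time."""
    kept = []
    for i, (p, c, q) in enumerate(solutions):
        dominated = False
        for j, (p2, c2, q2) in enumerate(solutions):
            if j == i:
                continue
            if (p2, c2, q2) == (p, c, q):
                if j < i:          # exact duplicate: keep only the first copy
                    dominated = True
                    break
                continue
            if p2 <= p and c2 <= c and q2 >= q:
                dominated = True
                break
        if not dominated:
            kept.append((p, c, q))
    return kept

# Example: the second triple is dominated (higher cost and load, earlier required time).
print(prune([(3.0, 2.0, 7.0), (4.0, 2.5, 6.0), (5.0, 1.0, 6.5)]))
```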
To compute Sb(v, i, j), we consider the partition points k ∈ {i..j − 1}. Because of the bottom-up nature of the algorithm, the optimal solution sets S(v, i, k) and S(v, k + 1, j) have been previously computed. Pseudo-code for the computation of Sb(v, i, j) appears in Figure 8.

Figure 8: Computation of Sb(v, i, j)
  S ← ∅
  for k = i..j − 1
    S ← S ∪ join(S(v, i, k), S(v, k + 1, j))
  Prune sub-optimal solutions from S

3.5 Complexity

Evaluating the algorithm in terms of the number of primitives it executes, we have an O(n^5) algorithm. This can be seen in the computation of Sb: there are O(n^4) sets Sb(v, i, j) (note that there are n^2 vertices in GG(N)) and for each such set we execute O(n) primitives. Of course, the primitives are not constant-time operations. However, it can be argued that the size of these sets is polynomially bounded in the parameters of the problem instance (e.g., grid size) and thus the algorithm is pseudo-polynomial overall. Additionally, this leads to a natural way of trading precision for running time by "coarsening" the problem instance (e.g., by using a coarser grid and taking advantage of the resultant discretization).

4 Experiments

The main focus of our experiments was to determine the benefit afforded by taking the proposed approach of simultaneous routing and buffering versus a two-phase approach. Since the quality of a routing solution depends on both its timing and its cost, we adopt the Cost*Delay product as our metric of comparison (where Delay is the maximum source-to-sink delay; i.e., for simplicity, we assume identical required arrival times). In our experiments we use the 0.5μm technology parameters used in [9] and [4]. We also introduced a single non-inverting buffer. These parameters are given in Table 1.

Table 1: Technology Parameters
  Wire Resistance            0.112 Ω/μm
  Wire Capacitance           0.039 fF/μm
  Sink Capacitance           1.0 fF
  Driver Resistance          270 Ω
  Buffer Resistance          500 Ω
  Buffer Intrinsic Delay     0.1 ns
  Buffer Input Capacitance   0.03 pF

We compared two different two-phase approaches to the proposed simultaneous approach under the Cost*Delay metric; the results are reported in Table 2. The three approaches are as follows.

1. Use the P-Tree algorithm to derive the minimum-area routing topology T induced by the sink permutation. Apply a static buffer insertion algorithm to T.
2. Use the P-Tree algorithm to derive the routing topology T induced by the sink permutation which minimizes the cost metric. Apply a static buffer insertion algorithm to this topology.
3. The proposed simultaneous routing and buffer insertion algorithm.

In each case, the result is a tradeoff curve from which we select the solution minimizing the metric. For each net size we ran the algorithms on 25 randomly generated point sets. Results were normalized to the result of Alg 3. As can be seen from the table, the proposed approach does in fact improve over the two-phase approaches. With respect to running time, the two-phase approaches have an advantage in that solutions in the basic P-Tree algorithm without buffering are characterized by load and required time only; since there is no buffering, load also captures the notion of cost. Thus, the added complexity of the simultaneous approach must be weighed against the solution improvement yielded. Nevertheless, when high-quality solutions are required, the simultaneous method appears quite practical for most typical net sizes. For instance, our relatively unoptimized implementation is able to route 12-pin nets in about 30 CPU seconds on a Sun Sparc 20. A topic of our current research is the development of more powerful bounding techniques and, more generally, methods for improving performance by search-space limitation. A very simple technique of this sort
with which we have had some promising initial results is the following. Initially compute the min-area solution in O(n^5) time and let qmin be the resulting required time (this is done by the P-Tree algorithm and is strongly polynomial). When running the timing optimization algorithm, consider a solution (p, c, q) ∈ S(v, i, j). Let d be a lower bound on the delay from the driver to v (e.g., the delay of the driver with minimum possible load plus a lower bound on interconnect delay to vertex v). If q − d < qmin then solution (p, c, q) can be discarded, since the min-area solution is better. Surprisingly, this has yielded an order of magnitude speed-up in some cases. We are exploring generalizations of this concept.

Table 2: Relative Cost*Delay Results
          Alg 1   Alg 2   Alg 3
  n = 6   1.03    1.04    1.0
  n = 9   1.07    1.06    1.0
  n = 12  1.11    1.12    1.0
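The bounding test described above amounts to a one-line check per candidate solution. A minimal sketch of ours (the solution representation and names are assumptions, not the authors' code):

```python
def bound_prune(solutions, q_min, d_lower):
    """Keep only (p, c, q) triples whose best-case required time at the driver,
    q - d_lower (d_lower: a lower bound on driver-to-v delay), still beats the
    required time q_min of the precomputed minimum-area solution."""
    return [(p, c, q) for (p, c, q) in solutions if q - d_lower >= q_min]
```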
5 Conclusions

We have presented an algorithm for optimization of timing and cost of a net by simultaneous routing and buffer insertion. Preliminary results are promising and show improvement over two-phase approaches.

References
[1] K. D. Boese, A. B. Kahng, G. Robins, "High-Performance Routing Trees With Identified Critical Sinks," Proc. ACM/IEEE Design Automation Conf., 1993, pp. 182-187.
[2] M. Borah, R. M. Owens, M. J. Irwin, "Fast Algorithm for Performance-Oriented Steiner Routing," Proc. Fifth Great Lakes Symposium on VLSI, 1995, pp. 198-203.
[3] J. Cong, K. S. Leung, D. Zhou, "Performance-Driven Interconnect Design Based on Distributed RC Delay Model," Proc. ACM/IEEE Design Automation Conf., 1993, pp. 606-611.
[4] T. D. Hodes, B. A. McCoy, G. Robins, "Dynamically-Wiresized Elmore-Based Routing Constructions," Proc. IEEE Intl. Symp. Circuits and Systems, 1994.
[5] F. K. Hwang, D. S. Richards, P. Winter, The Steiner Tree Problem, Elsevier Science Publishers, 1992, pp. 213-214.
[6] F. K. Hwang, "On Steiner Minimal Trees with Rectilinear Distance," SIAM J. Applied Math. 30 (1976), pp. 104-114.
[7] J. Lillis, C. K. Cheng, T. T. Lin, "Optimal Wire Sizing for Low Power and a Generalized Delay Model," Proc. IEEE Intl. Conf. Computer-Aided Design, 1995, pp. 138-143.
[8] J. Lillis, C. K. Cheng, T. T. Lin, "Optimal Wire Sizing for Low Power and a Generalized Delay Model," Technical Report #CS96-468, CSE Dept., UCSD.
[9] J. Lillis, C. K. Cheng, T. T. Lin, C.-Y. Ho, "New Techniques for Performance Driven Routing with Explicit Area/Delay Tradeoff and Simultaneous Wire Sizing," Technical Report #CS96-469, CSE Dept., UCSD.
[10] W. C. Elmore, "The Transient Response of Damped Linear Networks with Particular Regard to Wideband Amplifiers," J. Applied Physics 19 (1948), pp. 55-63.
[11] N. H. E. Weste, K. Eshraghian, Principles of CMOS VLSI Design, Addison-Wesley, 1993, pp. 231-237.
[12] H. Vaishnav, M. Pedram, "Routability-Driven Fanout Optimization," Proc. ACM/IEEE Design Automation Conf., 1993, pp. 230-235.
[13] L. P. P. P. van Ginneken, "Buffer Placement in Distributed RC-Tree Networks for Minimal Elmore Delay," Proc. International Symposium on Circuits and Systems, 1990, pp. 865-868.
[14] F. F. Yao, "Computational Geometry," Ch. 7 in Handbook of Theoretical Computer Science, Vol. A, Elsevier Science Publishers, 1990.
TIMING OPTIMIZATION BY REDUNDANCY ADDITION AND REMOVAL

Luis A. Entrena*, Emilio Olias*, Javier Uceda**
* Universidad Carlos III of Madrid, Spain, {entrena, olias}@ing.uc3m.es
** Universidad Politecnica of Madrid, Spain, [email protected]
ABSTRACT

Redundancy Addition and Removal is a recently proposed logic optimization method. This method uses Automatic Test Pattern Generation (ATPG) techniques to identify optimization transforms. It has been applied successfully to combinational and sequential logic optimization and to layout-driven logic synthesis for Field Programmable Gate Arrays (FPGAs). In this paper we present an improved Redundancy Addition and Removal technique that identifies new types of optimization transforms and is more efficient because it reduces the number of ATPG runs required. We also apply the Redundancy Addition and Removal method to timing optimization. The experimental results show that this improved Redundancy Addition and Removal technique produces significant timing optimization with very little area cost.
Figure 1. Example of redundancy addition and removal (added redundancy; created redundancy)

The basic idea underlying the previous approaches in [1-7] can be summarized as follows. A wire is selected and tested for a stuck-at fault. If no test is possible, then the wire is redundant and can be removed. Otherwise, the mandatory assignments (those assignments that are required for a test to exist) obtained during test generation suggest the additions that will force the tested wire to become redundant. However, it is not known whether these additions can be performed without changing the circuit functionality. This must be further verified by performing additional tests. In this paper we propose an efficient technique that identifies, with a single test run, which connections/gates can certainly be added to the input of an existing gate in the circuit. This technique is also extended to multiple wire addition, allowing a larger set of alternatives to be identified than with previous approaches. The Redundancy Addition and Removal technique has been applied in the past to area optimization of combinational and sequential circuits. In this paper, we apply this technique to timing optimization of combinational logic networks. Since this technique can take into account accurate area and delay estimations, it can be applied in a similar manner to technology-dependent timing optimization. Logic restructuring techniques for timing optimization have been proposed based on other optimization methods [10], [11]. These techniques are commonly used in combination with other timing synthesis techniques [9]. We show that the improved Redundancy Addition and Removal technique proposed may produce significant timing optimization with very little area cost.
1. INTRODUCTION

Redundancy Addition and Removal has been shown to be a powerful logic optimization method by several authors [1-7]. With this method, a logic network is optimized by iteratively adding and removing redundancies that are efficiently identified using Automatic Test Pattern Generation (ATPG) techniques. Due to the nature of redundancy, the addition of a redundant wire/gate may make some wires/gates become redundant elsewhere in the circuit. After the removal of the created redundancies, the resulting circuit may have less area or a smaller delay. The Redundancy Addition and Removal approach is illustrated with the example in Fig. 2 (taken from [1]). This is an irredundant circuit. In this circuit a connection can be added from the output of g5 as a new input to g9 without changing the logic functionality of the network. In other words, the added connection is redundant. By adding this connection, two connections, g1-g4 and g6-g7, become redundant and can be removed. The resulting network contains fewer gates and a shorter critical path. The added wire, g5-g9, is called an alternative wire of g1-g4 and g6-g7 [3]. The Redundancy Addition and Removal technique can be successfully applied to some physical design related problems, such as routing [3]. If routing cannot be completed, the unroutable wires may be substituted by alternative wires in order to complete the routing. Also, this technique is particularly well suited for resynthesis, because it allows optimal functional alternatives to a wire/gate to be identified based on accurate area and delay estimations.
Figure 2. An irredundant circuit

This paper is organized as follows. Section 2 describes the improved transform identification technique, which we call "two-way" transform identification. Section 3 describes how this technique is extended with the addition of redundant gates. Section 4 describes the timing optimization algorithm based on this approach. Section 5 presents the experimental results. Finally, Section 6 presents the conclusions of this work.

2. TWO-WAY TRANSFORM IDENTIFICATION

The techniques developed so far to identify logic optimization transforms by redundancy addition and removal [1-7] can be called "one-way", because the redundancy test of a fault allows candidate connections for addition to be identified, but it is not known whether they can certainly be added without changing the circuit functionality. Consider for example the circuit shown in Fig. 2 and the target fault g6 stuck-at 1. When this fault is tested, a mandatory assignment g5 = 0 is obtained; in other words, all input vectors that are able to test this fault put a logic value 0 at the output of g5. The implications performed to obtain this result are shown below:

(1) Mandatory control assignment: g6 = 0
(2) Mandatory observation assignments: g3 = 1, g4 = 0, f = 1
(3) Implications:
    g6 = 0 => d = 0, g2 = 0
    g3 = 1 => a = 1, b = 1
    d = 0 => g1 = 0
    g1 = 0, g2 = 0 => g5 = 0

This result suggests the addition of a new connection from g5 as a new input of g9 (dotted line in Fig. 2), because when this wire is added, it blocks the propagation of the fault g6 stuck-at 1, which thus becomes redundant. The same result is obtained by the addition of wires from g6, g4, d, g2, g1, and, through an inverter, g3, a and b. However, we do not know which of these new connections can be added without changing the circuit functionality, i.e., whether they are redundant connections. In order to check this, an additional redundancy test is required for each alternative wire candidate. In this section, we describe how the connections that can certainly be added to a destination node can be identified with a single redundancy test. The optimization transforms are obtained by comparing the results of this redundancy test with the set of mandatory assignments (SMA) of the removal candidate. We call this technique "two-way" transform identification.

In the sequel, we denote a connection as a triple (S, D, P), where S is the source node, D is the destination node and P is the polarity (1 for inverted and 0 for non-inverted). Also, an input to a gate G has a controlling value Cont(G) if this value determines the output of the gate G regardless of the other inputs. The controlling value of an AND (OR) gate is 0 (1). The inverse of the controlling value is called the sensitizing value Sens(G). The sensitizing value of an AND (OR) gate is 1 (0). Note that all candidate connection faults that have the same destination node have exactly the same static observability conditions. They only differ in the controllability condition. Therefore, connection faults that have the same destination node have the same mandatory observation assignments and only differ in the mandatory control assignment. Let D be a node and c be the controlling value of D. Suppose that we perform the implication of the mandatory observation assignments that are common to all connection faults that have the same destination node D. A mandatory assignment v obtained this way in a node N, such that v is the complement of c, indicates that the connection C = (N, D, 0) is redundant and therefore it can be added without changing the circuit's functionality. The demonstration of this statement is simple: the fault associated with the added connection is C stuck-at the complement of c, and the necessary control assignment for this fault is N = c, which is incompatible with the previous assignment N = v. Analogously, if a mandatory assignment v is obtained in a node N such that v = c, then the connection C = (N, D, 1) is redundant and therefore it can be added without changing the circuit's functionality.

Example. Consider again the example in Fig. 2. The mandatory observation assignments for all connection faults whose destination node is g9 are: g8 = 1, f = 1. By implication, using recursive learning [8], we get the mandatory assignments shown below. The recursivity level is indicated by the indentation.

g8 = 1 => Recursivity level = 1
    Justification of g4 = 1:
        g4 = 1 => c = 1, g1 = 1
        g1 = 1 => b = 1, d = 1
        g1 = 1 => g5 = 1
        c = 1 => g2 = 0
        d = 1 => g6 = 1
    Justification of g7 = 1:
        g7 = 1 => g6 = 1, g3 = 1
        g3 = 1 => a = 1, b = 1
        g6 = 1 => Recursivity level = 2
            Justification of g2 = 1:
                g2 = 1 => e = 1, c = 0
                g2 = 1 => g5 = 1
                c = 0 => g4 = 0
            Justification of d = 1:
                d = 1, b = 1 => g1 = 1
                g1 = 1 => g5 = 1
Learned assignments (common to all branches): g5 = 1, g6 = 1, b = 1.

Since the controlling value of the destination node g9 is Cont(g9) = 0, we obtain that the connections (b, g9, 0), (g6, g9, 0) and (g5, g9, 0) can be added without changing the circuit's functionality. It can be verified that their associated stuck-at faults are not testable, because the mandatory control assignment is inconsistent with the mandatory assignments obtained previously for these connections.

Fig. 3 shows several possible transforms described in [4]. The subset of mandatory assignments that are common to all candidate connection faults that have the same destination node D under a Type 0 transform coincides with the SMA for the fault D stuck-at Cont(D). Similarly, the set of mandatory assignments that are common to all candidate connection faults that have the same destination node D under a Type 2 transform coincides with the SMA for the fault D stuck-at Sens(D). We will call the tests for the faults D stuck-at Cont(D) and D stuck-at Sens(D) the redundancy tests of the destination node. We consider here the Type 2 transform rather than the Type 1 transform because the Type 2 transform covers the Type 1 transform, i.e., if a Type 2 transform is possible then a Type 1 transform is also possible for the same source and destination nodes.

Figure 3. Several types of logic transforms (Type 0, Type 1, Type 2)

The optimization transformations are easily identified by comparing the mandatory assignments that are obtained in each node of the network by the redundancy test of the destination node and the redundancy test for each fault dominated by the destination node. If a node N has mandatory assignments of different value for each of these tests, then it is possible to add one connection from N to the destination node to eliminate at least one connection. Thus, for the previous example, we have the following mandatory assignments: (i) g5 = 1, obtained from the redundancy test of the destination node (fault g9 stuck-at 0); (ii) g5 = 0, obtained from the redundancy test of the target wire (fault g6 stuck-at 1). Then, the addition of connection (g5, g9, 0) allows connection (g6, g7, 0) to be eliminated, i.e., (g5, g9, 0) is a valid alternative connection of (g6, g7, 0). The two-way identification technique is complete in the sense that we can identify all possible connections/nodes that can be added to the destination node, as long as all mandatory assignments can be identified. However, this is only possible if recursive learning is used. Generally, it is not possible to know beforehand the maximum recursion depth needed to obtain results equivalent to one-way identification. For instance, the transformation of the previous example can be identified using a one-way algorithm with a maximum recursion depth equal to 0, while a two-way algorithm requires at least a recursion depth equal to 2.

3. ADDING GATES

In this section, the two-way identification technique is generalized to the addition of gates. Note that a node may have a mandatory assignment even though its predecessors and successors do not. For instance, in the previous example, g6 has a mandatory assignment even though neither its predecessors (g7) nor its successors (d, g2) do. This is because the higher-level mandatory assignments only propagate to the next lower level when they all coincide. Therefore, the output node of a network may hold a level-0 mandatory assignment even though neither its predecessors nor its successors do. The key to identifying more general transformations is the information contained in mandatory assignments of a level higher than 0. We call an Extended Set of Mandatory Assignments (ESMA) a set of mandatory assignments that is not restricted to recursion depth 0.

Example. Consider the example in Fig. 4. This circuit is the same as that of Fig. 2, except for the absence of node g5. The redundancy test of the destination node stuck-at 0 is the following:

g9 = 1 => g8 = 1, f = 1
g8 = 1 => Recursivity level = 1
    Justification of g4 = 1:
        g4 = 1 => c = 1, g1 = 1
        g1 = 1 => b = 1, d = 1
        c = 1 => g2 = 0
        d = 1 => g6 = 1
    Justification of g7 = 1:
        g7 = 1 => g6 = 1, g3 = 1
        g3 = 1 => a = 1, b = 1
        g6 = 1 => Recursivity level = 2
            Justification of g2 = 1:
                g2 = 1 => e = 1, c = 0
                c = 0 => g4 = 0
            Justification of d = 1:
                d = 1, b = 1 => g1 = 1

Figure 4. Addition of a gate
Figure 5. Example circuit

As can be observed, there is no level-0 mandatory assignment in either g1 or g2. However, all branches in the recursivity tree have a mandatory assignment g1 = 1 or g2 = 1. Hence, consider a subnetwork such that its output is 1 when g1 = 1 or g2 = 1. Such a subnetwork is an OR gate whose inputs are g1 and g2. This subnetwork is redundant and can be added to the circuit without changing its functionality, since it shows a level-0 mandatory assignment in the redundancy test of the destination node.
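The level-0 test of Section 2 can be sketched as follows. This is a simplified illustration under our own data representation, not the authors' code: given the mandatory assignments implied by the redundancy test of destination node D and D's controlling value, every node whose mandatory value is the complement of Cont(D) yields an addable non-inverted connection, and every node whose mandatory value equals Cont(D) yields an addable inverted one.

```python
def addable_connections(dest, cont_value, mandatory):
    """mandatory: dict node -> mandatory value (0/1) implied by the redundancy
    test of destination node `dest`.  Returns (source, dest, polarity) triples
    that can be added without changing the circuit function
    (polarity 0 = non-inverted, 1 = inverted)."""
    conns = []
    for node, val in mandatory.items():
        if node == dest:
            continue
        if val == 1 - cont_value:      # v = complement of Cont(D)
            conns.append((node, dest, 0))
        else:                          # v = Cont(D): add through an inverter
            conns.append((node, dest, 1))
    return conns

# Example from the Fig. 2 discussion: Cont(g9) = 0 and the test of g9 implies
# g5 = 1, g6 = 1, b = 1, so (g5, g9, 0), (g6, g9, 0) and (b, g9, 0) are addable.
print(addable_connections("g9", 0, {"g5": 1, "g6": 1, "b": 1}))
```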
The following theorem characterizes the addition of a two-input gate.

Theorem. Let A, B, D be nodes in a network. Let T be a valid transformation that involves the elimination of a fault f by the addition of an elementary two-input gate with one output. The inputs are connected to nodes A and B. The output is connected to the destination node D. Let ESMAf and ESMAD* be the extended sets of mandatory assignments corresponding to fault f and to the redundancy test of the destination node, respectively. Let val(A, 0) and val(B, 0) be the level-0 mandatory assignments at A and B in one of these extended sets, and val'(A, p1), ..., val'(A, pn), val'(B, p1), ..., val'(B, pn) the mandatory assignments at A and B for a set of supplementary branches p1, ..., pn of the other set, i.e., either

    val(A, 0), val(B, 0) ∈ SMAf
    val'(A, pi) ∈ ESMAD*(pi), val'(B, pi) ∈ ESMAD*(pi), i = 1, ..., n
    SMAD* = ESMAD*(p1) ∩ ... ∩ ESMAD*(pn)

or

    val(A, 0), val(B, 0) ∈ SMAD*
    val'(A, pi) ∈ ESMAf(pi), val'(B, pi) ∈ ESMAf(pi), i = 1, ..., n
    SMAf = ESMAf(p1) ∩ ... ∩ ESMAf(pn).

A sufficient condition to guarantee that this transformation does not change the circuit's functionality is

    val(A, 0) ≠ val'(A, pi) or val(B, 0) ≠ val'(B, pi), for all pi, i = 1, ..., n.

Proof. The theorem will be demonstrated for the first set of premises. The demonstration is analogous for the second set. The demonstration is constructive, i.e., it determines exactly the types of the gate and the connections of the transformation T. Let G be the gate in transformation T. For the sake of simplicity, we take G ∈ {AND, OR}. The connections between A, B and G are of the following types:

    (A, G, 0) if val(A, 0) = Sens(G)
    (A, G, 1) if val(A, 0) = Cont(G)
    (B, G, 0) if val(B, 0) = Sens(G)
    (B, G, 1) if val(B, 0) = Cont(G)

With these connections, we have val(G, 0) = Sens(G). Since val(A, 0) ≠ val'(A, pi) or val(B, 0) ≠ val'(B, pi) for all pi, i = 1, ..., n, at least one of the G inputs has a controlling value for all pi, i = 1, ..., n. Therefore

    val'(G, pi) = Cont(G), for all pi, i = 1, ..., n.

Since the value of G is the same at all supplementary branches of the recursivity tree, we have

    val'(G, 0) = val'(G, p1) ∩ ... ∩ val'(G, pn) = Cont(G) ≠ Sens(G).

This demonstrates that the value of G is different for each of the sets of mandatory assignments SMAf and SMAD*. The polarity of the connection from G to the destination node D depends on the controlling value of D. The polarity is not inverted if valf(G) = Cont(D), and inverted otherwise.

Example. Let us take the previous example shown in Fig. 4. We have the mandatory assignments g1 = 0 and g2 = 0 for the fault g6 stuck-at 1. Selecting G as an OR gate, we have

    val(g1, 0) = Sens(G) = 0
    val(g2, 0) = Sens(G) = 0

and the connections from g1 to G and from g2 to G are non-inverted. If we perform such connections, we have

    val(G) = val(g1, 0) + val(g2, 0) = 0.

The redundancy test of the destination node was shown in a previous example. It was demonstrated that there is a mandatory assignment g1 = 1 or g2 = 1 in all branches of the recursivity tree for this test. Therefore, val'(G) = 1. Since val(G) ≠ val'(G), the addition of G allows fault f to be eliminated. To complete the transformation it is necessary to determine the polarity of the connection from G to g9. This connection is non-inverted, since valf(G) = Cont(g9).

In some particular cases it is possible to identify multiple-wire addition transformations without considering recursive learning [4]. However, this cannot be generalized. For instance, consider the interesting example shown in Fig. 5. The redundancy test of the target fault g5 stuck-at 1 does not suggest any interesting candidate connection. However, by performing the redundancy test of the destination node g6 stuck-at 0, we find that the connections (g2, g6, 0) and (g7, g6, 0) can be added without changing the network functionality. If we compare the SMA of the destination node test with the SMA of the target fault, the two-way transform condition is not met because there is no level-0 mandatory assignment in either g2 or g7. However, all branches in the recursivity tree have a mandatory assignment g2 = 0 or g7 = 0 for the fault shown. Hence, consider a subnetwork such that its output is 0 when g2 = 0 or g7 = 0. Such a subnetwork is an AND gate whose inputs are g2 and g7. In other words, the addition of (g2, g6, 0) or (g7, g6, 0) alone does not cause the target fault to become redundant, but the addition of both of them does. Note that this transform cannot be identified with previous approaches.

Figure 6. Timing optimization approach
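The branch-covering argument behind these gate additions can be sketched as follows (our illustration only, with branches represented simply as dictionaries of mandatory assignments; this is not the authors' implementation):

```python
def find_two_input_gate(branches, value):
    """branches: list of dicts node -> mandatory value, one per supplementary
    branch of the recursivity tree.  Returns a pair of nodes (A, B) such that
    every branch assigns `value` to A or to B, i.e. a two-input gate on A and B
    would carry a level-0 mandatory assignment; None if no such pair exists."""
    nodes = sorted({n for br in branches for n in br})
    for i, a in enumerate(nodes):
        for b in nodes[i + 1:]:
            if all(br.get(a) == value or br.get(b) == value for br in branches):
                return a, b
    return None

# Fig. 5 example: every branch of the target-fault test assigns g2 = 0 or g7 = 0,
# so an AND gate on (g2, g7) is a candidate addition.
branches = [{"g2": 0}, {"g7": 0}, {"g2": 0, "g7": 0}]
print(find_two_input_gate(branches, 0))   # ('g2', 'g7')
```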
4. TIMING OPTIMIZATION

Redundancy addition and removal techniques can be applied to timing optimization [1]. If the addition of redundant wires/gates to a circuit causes some connection(s) in the critical path to become redundant, then the transformation will result in a faster circuit. For timing optimization, the search for alternative wires/gates is performed according to the following criteria (see Fig. 6):
* The destination node of the alternative connection/gate belongs to the critical path. Although this is not strictly required to generate redundancies in the critical path, it is very unlikely that redundancies may be created in the critical path otherwise.
* The arrival time of the output of the alternative connection/gate (tN) must not be greater than the arrival times of the inputs of the destination node (ti). In this way, the timing of the network is not degraded by the addition.

Timing optimization can be applied on a local basis, to replace a critical wire by a better alternative wire/gate, or on a general basis. On a local basis, the alternatives of the target wire are identified with the techniques described in the previous section. If a better alternative is found, the target wire is substituted by the alternative wire/gate. Note that this method can take into account accurate delay estimations at the time of searching for the alternatives. This is an advantage over other timing optimization techniques based on logic restructuring [10]. On a general basis, the timing optimization algorithm focuses on critical path segments. A critical path segment (CPS) is the portion of the critical path between two multiple-fanout nodes. For each critical path segment we try to add a connection/gate as a new input of a node in the CPS to eliminate at least one connection in the critical path. If a transformation is found, the algorithm starts again by identifying the new critical path. The algorithm is shown in Fig. 7.

Figure 7. Timing optimization algorithm
  Timing-opt() {
    START:
    Compute the arrival times and identify the critical path
    Foreach CPS in CP {
      Foreach N in CPS {
        Identify-transform(N);
        If (transform is found) {
          Optimize();
          goto START;
        }
      }
    }
  }

The selection of a transform among the possible transform candidates is based on a cost function. The cost of a connection/gate is defined as the circuit delay reduction that can be obtained by removing/adding this connection/gate. The cost of removing a connection is computed as the delay difference between the fault-free and faulty circuit. From this definition, the cost of removing a non-critical connection is 0. The cost of adding a connection following the above-mentioned criteria is also 0. The cost of each segment and each connection is computed at the time the critical path is identified. The timing optimization algorithm iterates over the critical path segments in decreasing order of their cost function.
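A high-level sketch of the loop of Fig. 7 is given below (ours, not the authors' code; the arrival-time, critical-path and transform routines are passed in as assumed stubs):

```python
def timing_opt(network, compute_arrival_times, critical_path_segments,
               identify_transform, apply_transform):
    """Repeatedly re-derive the critical path, walk its segments in decreasing
    cost order, and restart whenever a transform (a redundant addition that
    enables a critical-wire removal) is applied."""
    while True:
        arrival = compute_arrival_times(network)
        segments = sorted(critical_path_segments(network, arrival),
                          key=lambda s: s.cost, reverse=True)
        applied = False
        for seg in segments:
            for node in seg.nodes:
                transform = identify_transform(network, node)
                if transform is not None:
                    apply_transform(network, transform)
                    applied = True
                    break
            if applied:
                break
        if not applied:          # no further improving transform: done
            return network
```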
5. EXPERIMENTAL RESULTS

In this section we present preliminary experimental results of timing optimization with the improved Redundancy Addition and Removal technique. The experiments were made at the technology-independent level for simplicity. However, it should be noted that the redundancy addition and removal technique can be similarly applied at the technology-dependent level, where more precise estimations of area and delay are available. In the experiments carried out we used a unit delay model with a fanout factor of 0.2. We have considered three types of transforms: (1) adding one connection to eliminate at least 1 connection/gate in the critical path (area cost 0); (2) adding 2 connections to eliminate at least 1 connection/gate in the critical path; and (3) adding a 2-input gate to eliminate at least 1 connection/gate in the critical path. These transforms were applied successively in order to evaluate the improvement obtained in each case. Static timing analysis was used to determine the critical path for practical reasons. With this simple delay model, Type 1 and Type 2 transforms are not considered because they increase the length of the critical path. Also note that Type 0 transforms do not augment the circuit area, because a wire is added to remove at least another wire. Therefore, with these criteria, Type 0 transforms allow the length of the critical path to be reduced without augmenting the circuit area. If the subnetwork added contains more than one connection, then the area will be augmented.

The experimental results are shown in Tables 2 and 3. Table 1 shows the initial parameters of the benchmark circuits. The initial circuits were obtained after strong area optimization with RAMBO [2] and MISII. Table 2 shows the timing optimization results obtained by adding only one connection, first without recursive learning and then with a maximum recursion depth of 3. Table 3 shows the results obtained with multiple connection addition. For each example the number of connections (#C), the number of nodes (#N), the estimated delay (D) and the CPU time consumption (T) on a Sun Sparcstation 2 are presented. The timing improvement is significant in most of the examples. In the case of frg2, a speed-up factor of 3 was obtained. Other outstanding results were obtained for vda (2.5), k2 (2.4) and x4 (2.3). The average speed-up factor was 1.25 using only one connection addition and 1.48 in total. Note that the area (estimated by the number of nodes and connections) is not degraded by applying one-connection-addition transforms. With more complex transforms, the area may be degraded as we add more connections/gates than we remove. However, except in one case, the area increase is less than 1.5%. This result shows that the proposed technique is well suited for timing optimization when not much area is allowed to be traded off.

There is heavy CPU time consumption in some cases. This is because when a transform is found, the critical path must be recomputed and the search for new transforms must start all over again. The CPU time consumption will be greatly reduced in a future implementation by reusing previous computations and eliminating repeated runs over already optimized critical path segments.

6. CONCLUSIONS AND FUTURE WORK

We have proposed an improved transform identification technique for Redundancy Addition and Removal. This technique efficiently identifies the set of redundant connections that can be added to a logic network without changing its functionality, dramatically reducing the number of test generation runs required. This technique has also been extended to the addition of gates, obtaining new transformations that could not be identified with previous approaches. Previous work in Redundancy Addition and Removal only considered area optimization. In this paper, we have extended this technique to timing optimization and obtained promising results. These results show that this technique allows circuit delay to be reduced significantly with very little area increase and therefore is well suited for timing optimization when there is not much room to trade area for speed. In the future, we plan to introduce more accurate delay models and to extend the improved Redundancy Addition and Removal technique with the addition of several gates at a time, in order to obtain additional timing optimization, although at a bigger area cost.

REFERENCES
[1] K.-T. Cheng, L. A. Entrena, "Multi-Level Logic Optimization by Redundancy Addition and Removal," Proc. European Design Automation Conference (EDAC-93), pp. 373-377, February 1993.
[2] L. A. Entrena, K.-T. Cheng, "Sequential Logic Optimization by Redundancy Addition and Removal," Proc. ICCAD-93, pp. 310-315, November 1993.
[3] S. C. Chang, K.-T. Cheng, N.-S. Woo, M. Marek-Sadowska, "Layout Driven Logic Synthesis for FPGAs," Proc. DAC-94, pp. 308-313, June 1994.
[4] S. C. Chang, M. Marek-Sadowska, "Perturb and Simplify: Multi-level Boolean Network Optimizer," Proc. ICCAD-94, pp. 2-5, November 1994.
[5] W. Kunz, P. R. Menon, "Multi-level Logic Optimization by Implication Analysis," Proc. ICCAD-94, pp. 6-13, November 1994.
[6] L. A. Entrena, K.-T. Cheng, "Combinational and Sequential Logic Optimization by Redundancy Addition and Removal," IEEE Transactions on CAD, vol. 14, no. 7, pp. 909-916, July 1995.
[7] U. Glaser, K.-T. Cheng, "Logic Optimization by an Improved Sequential Redundancy Addition and Removal Technique," Proc. ASP-DAC, September 1995.
[8] W. Kunz, D. K. Pradhan, "Recursive Learning: An Attractive Alternative to the Decision Tree for Test Generation in Digital Circuits," Proc. International Test Conference, pp. 816-825, October 1992.
[9] J. P. Fishburn, "LATTIS: An Iterative Speedup Heuristic for Mapped Logic," Proc. 29th Design Automation Conference, pp. 488-491, June 1992.
[10] K. J. Singh, A. R. Wang, R. K. Brayton, A. Sangiovanni-Vincentelli, "Timing Optimization of Combinational Logic," Proc. International Conference on Computer-Aided Design, pp. 282-285, November 1988.
[11] K. C. Chen, S. Muroga, "Timing Optimization for Multi-Level Combinational Networks," Proc. 27th Design Automation Conference, pp. 339-344, June 1990.
Table 1: Initial benchmark circuits
  Name    #N    #C    D
  C1908   316   712   44.8
  C2670   599   1316  46.2
  C432    84    221   33.8
  C499    376   798   31.0
  C5315   1217  2671  62.6
  C6288   1871  3787  131.0
  C7552   1433  2992  217.0
  C880    246   603   49.4
  apex6   436   1096  29.8
  apex7   141   354   24.4
  dalu    439   1073  45.4
  frg2    460   1145  62.0
  k2      381   1162  65.0
  pair    936   2335  70.4
  rot     385   953   33.8
  term1   85    206   16.0
  ttt2    100   254   25.6
  vda     186   645   52.2
  x3      455   1122  20.0
  x4      221   571   28.0
Table 2: Experimental results with one connection addition
             Adding one connection (K=0)     Adding one connection (K=3)
  Name       #N    #C    D      T            #N    #C    D      T
  C1908      316   711   40.2   3.2          316   711   40.2   109.5
  C2670      599   1316  43.8   4.5          599   1316  43.8   648.0
  C432       83    220   27.8   3.9          83    220   27.8   12.8
  C499       376   798   31.0   0.7          376   798   31.0   36.2
  C5315      1217  2671  52.8   68.4         1217  2671  52.8   478.8
  C6288      1871  3787  131.0  5.9          1871  3787  126.8  1003.5
  C7552      1424  2975  162.2  123.4        1424  2975  159.4  2703.1
  C880       246   603   49.4   0.9          246   603   49.4   35.1
  apex6      436   1096  26.2   1.6          436   1096  26.2   4.5
  apex7      140   353   20.2   1.1          140   353   19.2   3.7
  dalu       439   1073  33.0   32.3         439   1073  33.0   272.0
  frg2       458   1143  40.4   14.4         458   1142  27.8   38.3
  k2         381   1161  43.6   22.7         381   1161  26.6   4880.8
  pair       936   2335  53.8   21.4         936   2335  53.8   60.7
  rot        385   953   32.6   1.5          385   953   32.6   6.1
  term1      85    206   16.0   0.2          85    206   16.0   3.1
  ttt2       99    253   20.6   0.8          99    253   20.6   8.9
  vda        186   645   25.2   7.5          186   645   25.2   505.6
  x3         455   1122  19.2   1.0          455   1122  19.2   5.5
  x4         220   570   14.2   1.0          220   569   12.2   4.7
Table 3: Experimental results with multiple connection addition
             Adding two connections          Adding a gate
  Name       #N    #C    D     T             #N    #C    D     T
  C1908      316   718   37.0  487.6         316   718   37.0  95.2
  C2670      599   1317  42.6  190.6         596   1320  33.0  633.8
  C432       83    221   27.6  28.9          83    221   26.8  20.3
  C499       376   802   29.6  107.9         376   802   29.6  38.0
  C5315      1217  2674  50.8  245.4         1208  2673  44.6  1817.0
  C880       246   610   42.0  186.8         247   613   39.8  39.0
  apex6      436   1100  22.4  11.9          437   1103  19.0  17.9
  apex7      140   354   19.2  5.8           141   358   18.2  10.0
  dalu       439   1079  25.0  218.2         443   1087  24.8  379.0
  frg2       457   1154  21.8  40.0          458   1156  21.0  18.1
  pair       936   2339  48.6  149.2         929   2335  39.4  425.2
  rot        385   958   30.4  35.4          385   959   29.0  7.4
  term1      85    206   16.0  3.4           83    208   13.0  20.9
  ttt2       99    253   20.6  4.4           101   257   19.4  13.9
  vda        186   682   22.2  3618.5        187   684   21.2  409.4
  x3         455   1129  15.8  37.5          456   1134  14.2  25.0
  x4         220   570   12.2  0.3           220   570   12.2  0.7
Optimal Wire-Sizing Formula Under the Elmore Delay Model*

Chung-Ping Chen, Yao-Ping Chen, and D. F. Wong
Department of Computer Sciences, University of Texas, Austin, Texas 78712

*This work was partially supported by the Texas Advanced Research Program under Grant No. 003658459.
Abstract

In this paper, we consider non-uniform wire-sizing. Given a wire segment of length L, let f(x) be the width of the wire at position x, 0 ≤ x ≤ L. We show that the optimal wire-sizing function that minimizes the Elmore delay through the wire is f(x) = ae^{-bx}, where a > 0 and b > 0 are constants that can be computed in O(1) time. In the case where a lower bound (L > 0) and an upper bound (U > 0) on the wire widths are given, we show that the optimal wire-sizing function f(x) is a truncated version of ae^{-bx} that can also be determined in O(1) time. Our wire-sizing formula can be iteratively applied to optimally size the wire segments in a routing tree.

1 Introduction

As VLSI technology continues to scale down, interconnect delay has become the dominant factor in deep submicron designs. As a result, wire-sizing plays an important role in achieving desirable circuit performance. Recently, many wire-sizing algorithms have been reported in the literature [1, 2, 4, 5, 7]. All these algorithms size each wire segment uniformly, i.e., with identical width at every position on the wire. In order to achieve non-uniform wire-sizing, existing algorithms have to chop wire segments into a large number of small segments. Consequently, the number of variables in the optimization problem increases substantially, resulting in long runtime and large storage. In this paper, we consider non-uniform wire-sizing. We are given a wire segment W of length L, a source with driver resistance Rd, and a sink with load capacitance CL. For each x ∈ [0, L], let f(x) be the wire width of W at position x. Figure 1 shows an example. Let r0 and c0 be the respective wire resistance and wire capacitance per unit square. Let D be the Elmore delay from the source to the sink of W. We show that the optimal wire-sizing function f that minimizes D satisfies a differential equation which can be analytically solved. We have f(x) = ae^{-bx}, where a > 0 and b > 0 are constants that can be computed in O(1) time. These constants depend on Rd, CL, L, r0, and c0. In the case where a lower bound (L > 0) and an upper bound (U > 0) on the wire widths are given, i.e., L ≤ f(x) ≤ U, 0 ≤ x ≤ L, we show that the optimal wire-sizing function f(x) is a truncated version of ae^{-bx} which can also be determined in O(1) time. Our wire-sizing formula can be iteratively applied to optimally size the wire segments in a routing tree.

Figure 1: Non-uniform wire-sizing (driver, wire, load).

The remainder of this paper is organized as follows. In Section 2, we show how to compute the Elmore delay for non-uniformly sized wire segments. In Section 3.1, we derive the optimal wire-sizing function when the wire widths are not constrained to be within any bounds. In Section 3.2, we consider the case where lower and upper bounds for the wire widths are given. We discuss in Section 4 the importance of our wire-sizing formula in sizing the wire segments in a routing tree. Finally, we present some experimental results and concluding remarks in Section 5.
2 Elmore Delay Model
We use the Elmore delay model [3]. Suppose W is partitioned into n equal-length wire segments, each of length Δx = L/n. Let xi be iΔx, 1 ≤ i ≤ n. The capacitance and resistance of wire segment i can be approximated by c0Δxf(xi) and r0Δx/f(xi), respectively. Thus the Elmore delay through W can be approximated by

    Dn = Rd (CL + Σ_{i=1}^{n} c0 f(xi) Δx) + Σ_{i=1}^{n} (r0 Δx / f(xi)) (Σ_{j>i} c0 f(xj) Δx + CL).

The first term is the delay of the driver, which is given by the driver resistance Rd multiplied by the total capacitance of W and CL. The second term is the sum of the delays in the wire segments; the delay of segment i is given by its own resistance r0Δx/f(xi) multiplied by its downstream capacitance Σ_{j>i} c0 f(xj) Δx + CL. (See Figure 2.) As n → ∞, Dn → D, where

    D = Rd (CL + ∫_0^L c0 f(x) dx) + ∫_0^L (r0 / f(x)) (∫_x^L c0 f(t) dt + CL) dx

is the Elmore delay through the driver and W.
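The discretized expression for Dn above translates directly into a short numerical routine. The following sketch (ours, with purely illustrative parameter values) evaluates it for an arbitrary width function and can be used to compare candidate wire shapes.

```python
import math

def elmore_delay(f, L, Rd, CL, r0, c0, n=1000):
    """Approximate the Elmore delay D_n through a wire of length L with width
    profile f(x), driver resistance Rd and load CL, using n equal segments."""
    dx = L / n
    xs = [i * dx for i in range(1, n + 1)]      # x_i = i * dx
    caps = [c0 * dx * f(x) for x in xs]         # segment capacitances c0*dx*f(x_i)
    delay = Rd * (CL + sum(caps))               # driver term
    downstream = sum(caps) + CL
    for x, c_i in zip(xs, caps):
        downstream -= c_i                       # capacitance of segments j > i, plus CL
        delay += (r0 * dx / f(x)) * downstream  # segment resistance times downstream load
    return delay

# Example (illustrative units): a uniform wire versus an exponentially tapered one.
print(elmore_delay(lambda x: 1.0, 1000.0, 100.0, 0.05, 0.1, 0.0002))
print(elmore_delay(lambda x: 2.0 * math.exp(-0.001 * x), 1000.0, 100.0, 0.05, 0.1, 0.0002))
```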
Figure 2: Elmore Delay Model.

3 Optimal Wire-Sizing Function

In this section, we derive a closed-form formula for the optimal wire-sizing function. It is reasonable to assume that wire-sizing functions are bounded and piecewise smooth with at most a finite number of discontinuity points. We consider two cases: unconstrained and constrained wire-sizing. In unconstrained wire-sizing, there is no bound on the value of f(x); i.e., we determine f : [0, L] → (0, ∞) that minimizes D. In constrained wire-sizing, we are given L > 0 and U < ∞, and require that L ≤ f(x) ≤ U, 0 ≤ x ≤ L; i.e., we determine f : [0, L] → [L, U] that minimizes D.

3.1 Unconstrained Wire-Sizing

We now consider unconstrained wire-sizing. We show that the optimal wire-sizing function satisfies a second-order ordinary differential equation which can be analytically solved.

Theorem 1 Let f be an optimal wire-sizing function. We have

    f²(x) = r0 (CL + c0 ∫_x^L f(t) dt) / (c0 (Rd + r0 ∫_0^x dt/f(t))).     (1)

Proof: Let x ∈ [0, L]. Assume f is continuous at x. We consider f̃, a local modification of f that replaces f by a constant width y on a small region of length δ around x, as shown in Figure 3. The wire W can be divided into three regions Ω1, Ω2 and Ω3, where Ω2 is the modified region. We denote the signal delay through Ωi by Di, so the total signal delay is D = Σ_{i=1}^{3} Di, and we represent the wire resistance (capacitance) of Ωi by Ri (Ci). We have R2 = r0δ/y and C2 = c0δy. Writing out the signal delay through the driver and the three regions and differentiating with respect to y gives

    dD/dy = c0 δ (Rd + r0 ∫_0^x dt/f(t)) − r0 δ (CL + c0 ∫_x^L f(t) dt) / y².

By setting dD/dy = 0, we get

    y_min² = r0 (CL + c0 ∫_x^L f(t) dt) / (c0 (Rd + r0 ∫_0^x dt/f(t))),

and y_min gives the minimum delay. Since f is an optimal wire-sizing function, we have y_min = f(x), and hence Equation (1) follows. For the case where f is not continuous at x, f is either left-continuous or right-continuous at x; all we need to do is to start with the interval [x − δ, x] or [x, x + δ], respectively.

Figure 3: Local modification of an optimal wire-sizing function.

Note that CL + c0 ∫_x^L f(t) dt is equal to the downstream capacitance at point x (denoted by F(x)) and Rd + r0 ∫_0^x dt/f(t) is equal to the upstream resistance at point x (denoted by R(x)). Hence we can rewrite Equation (1) as follows:

    f²(x) = r0 F(x) / (c0 R(x)).     (2)
Since F is strictly decreasing and R is strictly increasing, f is strictly decreasing. By rearranging the terms in Equation (1) and differentiating it with respect to x twice, we get the following theorem.

Theorem 2 Let f(x) be an optimal wire-sizing function. We have

    f''(x) f(x) = f'(x)².     (3)

Proof: We first multiply Equation (1) by the denominator of its right-hand side and then differentiate both sides with respect to x. We get

    2 f(x) f'(x) (Rd + r0 ∫_0^x dt/f(t)) = −2 r0 f(x).

Since f(x) ≠ 0, we can divide both sides by f(x) and get

    f'(x) (Rd + r0 ∫_0^x dt/f(t)) = −r0.

Since f is strictly decreasing, f'(x) < 0. Dividing the above equation by f'(x) and then differentiating both sides with respect to x, we obtain f''(x) f(x) = f'(x)².

We can analytically solve the differential equation in (3) and obtain a closed-form solution. We have the following theorem.

Theorem 3 Let f(x) = ae^{-bx}, where a = r0/(b Rd) and b is the unique positive root of

    b² e^{bL} Rd CL − r0 c0 = 0.     (4)

Then f is an optimal wire-sizing function.

Proof: Let y = f(x) and P = y'. We have y'' = P (dP/dy). The differential equation (3) can be rewritten as

    y P (dP/dy) − P² = 0.

Since P = f'(x) < 0, we have y (dP/dy) − P = 0. Separating P and y, we get dP/P = dy/y. Integrating both sides, we get P = c1 y, where c1 is a constant. Since P = y', we have dy/dx = c1 y. Separating the variables and integrating both sides, we get ln y = c1 x + c2, where c2 is a constant. It follows that

    y = e^{c1 x + c2} = ae^{-bx},

where a > 0 and b > 0. (Note that b > 0 follows from the fact that f is decreasing.) In order to determine a and b, we substitute f(x) = ae^{-bx} into Equation (1) and check the two boundary points x = 0 and x = L. We obtain the following two equations:

    c0 b Rd a² + r0 c0 (e^{-bL} − 1) a − r0 b CL = 0,
    c0 b Rd a² + r0 c0 (e^{bL} − 1) a − r0 b CL e^{2bL} = 0.

We can simplify these two equations and get

    ab = r0 / Rd   and   b² e^{bL} Rd CL − r0 c0 = 0.

Note that the function g(z) = z² e^{zL} Rd CL − r0 c0 is strictly increasing in z for z > 0, g(0) < 0, and lim_{z→∞} g(z) > 0. Thus g(z) has a unique root b > 0. We can use the Newton-Raphson method [9] to determine b and, in practice, five to seven iterations are sufficient. Since a = r0/(b Rd) and b > 0, we have a > 0. Figure 4 shows the exponentially decreasing nature of the optimal wire-sizing function.

Figure 4: Optimal unconstrained wire-sizing (f(x) = ae^{-bx}).
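Following Theorem 3, a and b can be computed numerically as just described. A minimal sketch of ours (using the characteristic equation as reconstructed above and a few Newton-Raphson iterations):

```python
import math

def optimal_wire_sizing(L, Rd, CL, r0, c0, iters=20):
    """Compute a, b for f(x) = a*exp(-b*x): b is the unique positive root of
    g(z) = z**2 * exp(z*L) * Rd * CL - r0*c0, found by Newton-Raphson,
    and a = r0 / (b * Rd)."""
    g  = lambda z: z * z * math.exp(z * L) * Rd * CL - r0 * c0
    dg = lambda z: (2 * z + z * z * L) * math.exp(z * L) * Rd * CL
    b = math.sqrt(r0 * c0 / (Rd * CL))    # initial guess: the root of g when L = 0
    for _ in range(iters):
        b = b - g(b) / dg(b)              # g is increasing and convex, so this converges
    a = r0 / (b * Rd)
    return a, b

# Example (illustrative units): a 1000-unit-long wire.
a, b = optimal_wire_sizing(L=1000.0, Rd=100.0, CL=0.05, r0=0.1, c0=0.0002)
print(a, b)
```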
3.2 Constrained Wire-Sizing

We now consider constrained wire-sizing. It is clear that if the wire-sizing function f obtained for the unconstrained case lies within the bounds L and U, then f is also optimal for constrained wire-sizing. On the other hand, if for some x, f(x) is not in [L, U], a simple approach is to round f(x) to either L or U; i.e., the new function is obtained by a direct truncation of f by y = L and y = U. (See Figure 5.) Unfortunately, the resulting function is not optimal. The reason is as follows. Suppose the curves f(x) = ae^{-bx} and y = U intersect at x = v. From Equation (2), v must satisfy

    f²(v) = r0 F(v) / (c0 R(v))     (5)

for v to be on the optimal curve. However, from Figure 5, it is clear that v does not satisfy Equation (5), because both its upstream resistance and its downstream capacitance should be recalculated according to the new function, and the two values associated with v are reduced because of the truncation. Thus this simple approach is not optimal.

Figure 5: Direct truncation is not optimal.

Recall that the optimal unconstrained wire-sizing function is a decreasing function. We can show that the optimal constrained wire-sizing function must also be decreasing.

Theorem 4 Let f be an optimal constrained wire-sizing function. We have, f is decreasing on [0, L].

According to Theorem 4, the optimal wire-sizing function f, similar to the one shown in Figure 5, consists of (at most) three parts. The first part is f(x) = U, the middle part is a decreasing function, and the last part is f(x) = L. The three parts of f(x) partition W into three wire segments, A, B, and C, where A has width U, C has width L, and B is defined by the middle part of f(x). It is easy to see that the middle part of f(x) must be of the form f(x) = ae^{-bx} for some a > 0 and b > 0. To see this, we can consider the wire segment A to be a part of the driver and its resistance to be a part of Rd. Similarly, the wire segment C can be considered as a part of the load and its capacitance as a part of CL. According to Equation (4), we can re-calculate a and b using the new values of Rd and CL, as long as we know the lengths of the wire segments A and B. As mentioned before, not all three parts of f(x) need be present. In fact, an optimal constrained wire-sizing function f(x) can be any one of six types of functions (type-A, type-B, type-C, type-AB, type-BC, and type-ABC) as shown in Figure 6. The six function types are named after the wire-segment types which are present in W. For example, in a type-AB function, W consists only of wire segments A and B. As shown in Figure 6, l1, l2, and l3 are the lengths of wire segments A, B, and C, respectively.

Figure 6: Six types of optimal wire-sizing functions.

We now define six wire-sizing functions fA, fB, fC, fAB, fBC, and fABC as follows. All six functions are of the form

    f(x) = U            for 0 ≤ x < l1,
    f(x) = ae^{-bx}     for l1 ≤ x < l1 + l2,
    f(x) = L            for l1 + l2 ≤ x ≤ l1 + l2 + l3 = L,

where the parameters a, b, l1, l2, and l3 for the six functions are given in Table 1. Typically, the names of the functions correspond to their types, i.e., fA is of type-A, fB is of type-B, and so on, but this is not always true. For example, it is possible that after we compute the parameters for fAB we get l1 > L and hence it is of type-A; it is also possible that fAB degenerates into a type-B function. In this case, we say that fAB is degenerated. We also note that sometimes the functions may be illegal in the sense that they violate the wire-width constraints. Nevertheless, we can show that these six functions are candidates for an optimal constrained wire-sizing function f(x). In fact, if we eliminate the functions that are either illegal or degenerated, an optimal wire-sizing function can be chosen as the best one (in terms of delay) among the remaining ones. We have the following theorem.
Theorem 5 Let G ⊆ F = {fA, fB, fC, fAB, fBC, fABC} be the set of functions that are either illegal or degenerate. Let f ∈ F − G be a function which has minimum delay. We have, f is an optimal constrained wire-sizing function.
The above method always requires the computation of all six functions in F. With the help of additional analysis, we can speed up the procedure. Table 2 shows a set of six feasibility conditions {φA, φB, φC, φAB, φBC, φABC} on L. Let Γ = {A, B, C, AB, BC, ABC}.
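To illustrate the selection step of Theorem 5, the following sketch (in Python) filters out illegal or degenerate candidates and returns the minimum-delay one. The callables compute_candidate and delay_of are hypothetical placeholders for the parameter formulas of Table 1 and for the delay evaluation; they are not taken from the paper.

    # Sketch of choosing the optimal constrained wire-sizing function per Theorem 5.
    # compute_candidate(t) and delay_of(f) are hypothetical helpers standing in
    # for the Table 1 parameter formulas and the delay evaluation.
    TYPES = ["A", "B", "C", "AB", "BC", "ABC"]

    def best_sizing_function(compute_candidate, delay_of):
        best = None
        for t in TYPES:
            f = compute_candidate(t)          # returns a candidate, or None if illegal/degenerate
            if f is None:
                continue                      # eliminate illegal or degenerate candidates
            if best is None or delay_of(f) < delay_of(best):
                best = f                      # keep the minimum-delay legal candidate
        return best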
Table 1: Definitions of the wire-sizing functions fA, fB, fC, fAB, fBC, and fABC.
Figure 7: Relationships among the six types of functions with respect to the wire length L and the ratio Rd/CL.
Table 2: Feasibility Conditions.

Lemma 1 The six feasibility conditions {φA, φB, φC, φAB, φBC, φABC} cover all possible L > 0. Moreover, if L satisfies φz, where z ∈ Γ, then fz is legal and is of type-z.

Theorem 6 Let H = {fz | z ∈ Γ and L satisfies φz}. Let f ∈ H be a function which has minimum delay. We have, f is an optimal constrained wire-sizing function.

According to Theorem 6, we only need to check the six feasibility conditions; only the functions in H need to be computed. In general, |H| ≤ 6, and we have never encountered a case where |H| > 1.

We also have the following interesting observations. In Figure 7, we show the relationships among the six types of optimal wire-sizing functions with respect to the three parameters: wire length L, driver resistance Rd, and load capacitance CL. The horizontal axis represents the ratio of the driver resistance to the load capacitance, and the vertical axis represents the wire length L. Suppose we keep Rd/CL fixed and vary L. When L is small, the optimal wire-sizing function tends to be of type-A, type-B, or type-C. As we increase L, the function type changes to type-AB or type-BC when Rd/CL is of moderate size, and to type-ABC when L is large. Suppose instead we keep L fixed and vary Rd/CL. When L is small, as we increase Rd/CL the optimal wire-sizing function changes from type-A to type-B and then to type-C. When L is of moderate size, the optimal wire-sizing function changes from type-AB to type-BC as we increase Rd/CL. Roughly speaking, the larger the ratio Rd/CL, the smaller the wire sizes. When the wire length L is very large, the optimal wire-sizing function is most likely to be of type-ABC.

4 Application to Routing Trees
Our wire-sizing formula can be applied to size a general routing tree. Recently, [2] presents a wire-sizing algorithm GWSA-C for sizing the wire segments in routing trees to minimize weighted delay. Each segment in the tree is sized uniformly, i.e. uniform wire width per segment. Basically, GWSA-C is an iterative algorithm with guaranteed convergence to a global optimal solution. In each iteration of GWSA-C, the wire segments are examined one at a time; each time a wire segment is uniformly re-sized optimally while keeping the widths of the other segments fixed. We can incorporate our wire-sizing formula into GWSA-C to size each wire segment non-uniformly. When we apply our wiresizing formula to size a wire segment in a tree, Rd should be set to be the total (weighted) upstream resistance including the driving resistance, and the CL should be set to be the total (weighted) downstream capacitance, including the load capacitances of the sinks in the subtree. (See Figure 8.) It can be shown that this modified algorithm is extremely fast and always converges to a global optimal solution.
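A minimal sketch (in Python) of the iterative structure just described, under stated assumptions: the callables upstream_resistance, downstream_capacitance, and size_segment are hypothetical placeholders for the weighted upstream/downstream computations and for the non-uniform wire-sizing formula, and the convergence test is illustrative rather than the algorithm's actual stopping rule.

    # Sketch of GWSA-C-style iterative re-sizing of a routing tree (illustrative only).
    # Each segment is assumed to carry its uni-segment widths in seg.widths (hypothetical data model).
    def size_routing_tree(segments, upstream_resistance, downstream_capacitance,
                          size_segment, tol=1e-6, max_passes=100):
        """Re-size one segment at a time, holding the others fixed, until the widths settle."""
        for _ in range(max_passes):
            max_change = 0.0
            for seg in segments:
                Rd = upstream_resistance(seg)       # total (weighted) upstream resistance, incl. the driver
                CL = downstream_capacitance(seg)    # total (weighted) downstream capacitance, incl. sink loads
                new_widths = size_segment(seg, Rd, CL)   # non-uniform widths from the wire-sizing formula
                max_change = max(max_change,
                                 max(abs(a - b) for a, b in zip(new_widths, seg.widths)))
                seg.widths = new_widths
            if max_change < tol:
                break
        return segments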
Rd: the upstream resistance of w3; CL: the downstream capacitance of w3 (while processing wire segment w3).
Figure 8: Sizing a segment of a routing tree.

5 Experimental Results and Concluding Remarks

We implemented and tested our algorithm in C on a Sun Sparc 5 workstation with 16 MB of memory. The parameters used are shown in Table 3. The results are given in Table 4. The first column, labeled "Precision Requirement", specifies the required accuracy of the wire-width values. The second column shows the number of Newton-Raphson iterations. Our results show that even under a very strict precision requirement the number of iterations is at most 7. Thus, in practice, the optimal wire-sizing functions can be computed in O(1) time and hence our method is extremely fast. We also performed experiments to compare the non-uniform wire-sizing solutions with the uniform ones in which wires are chopped into different numbers of segments. The results are drawn in Figure 9. Wire widths are plotted as functions of position on the wire segments. It shows that the more segments a wire is chopped into, the closer the solution is to our formula. When the wire is chopped into 1000 segments, the corresponding curve and the non-uniform wire-sizing curve are almost identical. Finally, we note that the authors in [8] independently solved the unconstrained wire-sizing problem using a different approach.

Table 4: The number of Newton-Raphson iterations.
Precision Requirement (µm):  0.1   0.01   0.001   0.0001   0.00001   0.000001
# of iterations:             5     5      5       6        6         7

Figure 9: Approximating non-uniform wire-sizing by uniform wire-sizing.
Table 3: RC Parameters
Unit Resistance: 0.008 Ω/µm
Unit Capacitance: 6.0 × 10^-17 F/µm
Minimum Wire Width: 1.0 µm
Maximum Wire Width: 3.5 µm
Driver Resistance: 25 Ω
Load Capacitance: 1.0 × 10^-12 F
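For reference, a generic one-dimensional Newton-Raphson iteration of the kind cited above (see [9]) looks as follows; this is a textbook sketch, not the paper's implementation, and the example equation is arbitrary.

    import math

    # Textbook Newton-Raphson iteration (cf. [9]); illustrative only.
    def newton_raphson(f, df, x0, tol=1e-6, max_iters=50):
        x = x0
        for i in range(max_iters):
            step = f(x) / df(x)
            x -= step
            if abs(step) < tol:       # stop once the update is below the precision requirement
                return x, i + 1
        return x, max_iters

    # Example: solve exp(-x) = 0.5 (an arbitrary test equation, not from the paper).
    root, iters = newton_raphson(lambda x: math.exp(-x) - 0.5,
                                 lambda x: -math.exp(-x), x0=1.0)
    print(root, iters)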
References [1] J. Cong and K. Leung, "Optimal wiresizing under Elmore
delay model," IEEE Trans. on Computer-Aided Design of Integrated Circuits and Systems 14(3), pp. 321-336, 1995.
[2] Chung-Ping Chen, D. F. Wong. "A fast algorithm for optimal wire-sizing under Elmore delay model" Proc. IEEE ISCAS, 1996. [3] W. C. Elmore, "The transient response of damped linear networks with particular regard to wide band amplifiers", J. Applied Physics, 19(1), 1948. [4] N. Menezes, S. Pullela, F. Dartu, and L. T. Pillage, "RC interconnect syntheses-a moment fitting approach", Proc. ACM/IEEE Intl, Conf. Computer-Aided Design, November 1994. [5] N. Menezes, R. Baldick, and L. T. Pillage, "A sequential quadratic programming approach to concurrent gate and wire sizing", Proc. ACM/IEEE Intl, Conf. ComputerAided Design, 1995. [6] R. S. Tasy, "Exact zero skew," IEEE Trans. on Computer-Aided Design of Integrated Circuits and Systems, Feb. 1993. [7] Q. Zhu, W. M. Dai, and J. G. Xi, "Optimal sizing of high-speed clock networks based on distributed RC and lossy transmission line models," Proc. IEEE Intl. Conf. on Computer-Aided Design, pp. 628-633, 1993. [8] J.P. Fishburn and C.A. Schevon, "Shaping a distributedRC line to minimize Elmore delay", IEEE Transactions on Circuits and Systems-I: Fundamental Theory and Applications, Vol. 42, No. 12, pp. 1020-1022, December 1995. [9] A. Ralston and P. Rabinowitz, A First Course in Numerical Analysis, McGraw-Hill, 1978.
Reducing Coupled Noise During Routing Ashok Vittal and Malgorzata Marek-Sadowska Department of Electrical and Computer Engineering, University of California Santa Barbara, CA 93106 Abstract - The performance of high-speed electronic systems is limited by interconnect-related failure modes such as coupled noise. We propose new techniques for alleviating the problems caused by coupling between signal lines. We show that models used by previous work on coupled noise-constrained layout synthesis do not allow the use of several important degrees of freedom. These degrees of freedom include the ability to utilize dynamic noise margins rather than static noise margins, the dependence of tolerable coupled noise on drive strength, the possibility of using overlaps to reduce susceptibility to noise and the use of jogs. We derive an expression for the coupled noise integral and a bound for the peak coupled noise voltage which is more accurate than the charge sharing model used in previous work. These results lead to a new problem formulation for interconnect synthesis constrained by coupled noise. We use the new bounds to guide a greedy channel router, which manipulates exact adjacency information at every stage, allowing it to make better informed decisions. Experimental results indicate that our algorithm compares favorably to previous work. The coupled noise is significantly reduced on benchmark instances.
smaller set of situations where simulation needs to be performed. This verification issue is addressed in Section II. As coupled noise pulse widths are typically much smaller than the clock period, the dynamic noise margin [2], [3], [6] rather than the static noise margin, applies. The noise susceptibility of digital logic gates is usually specified by a noise amplitude - pulse width plot, defining the allowed region of operation. Larger noise pulse amplitudes can be tolerated as the pulse width gets smaller. Therefore, given a layout, we need to be able to calculate the pulse width and amplitude on a victim net. We derive a new expression for the noise pulse integral and a simple bound on the noise pulse amplitude in Section II. Our expression does not involve the solution of the differential equations characterizing the system as in [7] and is simple enough to be used during design and verification. Our noise amplitude bound could be considered to be analogous to the signal delay bounds in [8] which have been the foundation for many interconnect design algorithms. We have developed a router guided by our new measure and use our analysis results to show that the charge sharing model used in previous work on coupled noise-driven layout ([9], [10], [11]) is extremely pessimistic. Coupled noise control during layout is critical as the noise pulse amplitude and width induced on a quiet (nonswitching) line by an adjacent switching line is primarily a function of the layout. Other researchers have also identified coupled noise as a major problem during layout synthesis. As parasitic control in analog circuits is critical to correct operation, there has been work on the layout of analog systems considering coupled noise. In [12], a maze router with limited analysis capability is proposed to avoid bad routes. Maze routing does not scale well to digital complexities, so that these methods cannot be used. Besides, the use of a simulator in an optimization loop is computationally too expensive. The post-route spacing algorithm proposed in [13] considers crosstalk while compacting a given layout. We address the issue of routing considering coupled noise in this paper. Our channel routes could use this post-processing for further improvement. Other approaches include track permutation of a given channel route [9] to maximize the minimum coupled noise slack among all nets using integer linear programming, the switchbox version [10] of the ILP formulation and the modification of a graph-based approach (Glitter [14]), to include coupled noise costs[1l]. These two approaches are reviewed next. The elegant integer linear programming formulation [9] finds an optimal track permutation, given an initial channel
I. Introduction* Coupled noise between signal lines is a potential cause of failure in high speed electronic systems [1], [2], [3], [4]. Aggressive scaling in lateral dimensions with relatively unchanged vertical dimensions in sub-micron CMOS VLSI causes the coupling capacitance between adjacent lines to become a significant fraction of the capacitance to the substrate [5]. This leads to increased coupled noise.Further, increasing system speeds cause signal spectra to exhibit significant energy at higher frequencies leading to increased coupled noise. When high-speed ECL circuits and lowpower CMOS circuits are integrated on a chip, the small noise margins of ECL, the large logic swings of CMOS circuits and the high current drive of ECL drivers create coupled noise problems. Besides, increasing system complexities lead to increased coupled energy, again increasing the need for techniques aimed at reducing the deleterious effects of coupled noise. It is important to be able to quickly check if a given routing solution will not lead to logic failures caused by coupled noise. A layout can be certified correct either using a simple capacitive charge sharing model to compute noise or using exhaustive simulation. The simulation method is timeconsuming and the charge sharing model is very pessimistic, so there is a need for simple methods which provide better accuracy. Such a method could at least identify a much *. This work was supported in part by the Defence Advanced Research Projects Agency under contract DABT63-93-C-0039, the National Science Foundation under Grant MIP 9419119 and in part by LSI Logic & Silicon Valley Research through the California MICRO program.
be assumed. The total coupled noise voltage can then be computed as the sum of the individual coupled noise contributions of each of the aggressors, rather than having to look at all possible switching combinations (there are exponentially many such scenarios which would otherwise need to be analyzed). As lines immediately adjacent to the victim net act as shields, we need to consider only the immediately adjacent lines. In Figure 1, therefore, our analysis omits the coupled noise contribution from net Z. We now analyze the coupled noise contribution from one aggressor net, with all other aggressor nets grounded. The total peak noise voltage and pulse width will later be obtained using superposition. The equivalent circuit is as shown in Figure 2. The attenuation along the interconnect is neglected (line lengths of a few mm or less). The resistance RI is the driver resistance of the aggressor (typically a few tens of kiloohms for CMOS and a few hundreds of ohms in ECL) and Cl is the total line capacitance of the aggressor net. X is the coupling capacitance and is proportional to the overlap length. R2 is the output resistance (typically several tens of kiloohms in CMOS and a few hundreds of ohms in ECL) of the victim net and C2 is the capacitance to ground of the victim net. RI and R2 are obtained from the transistor characteristics and the capacitances per unit length are obtained either from tabulated experimental results or from a commercial field solver. While RI is the familiar driver resistance (of the aggressor net), R2 is the output resistance (of the victim net). The ratio RI/R 2 can vary over a wide range (especially when transistor sizing optimizations are resorted to) as these are resistances corresponding to two different logic gates. We are interested in computing the peak voltage and the noise pulse width at node 0 for a unit step input.
route. The track permutation maximizes the minimum crosstalk slack among all nets. However, this is sub-optimal as finer grain permutations - track segment permutation rather than track permutation may lead to better results. Section III will show an example for which no track permutations are possible, while track segment permutations lead to optimal results. In [10], track segment permutations in a switchbox are considered. In Section III, we show that even this method is sub-optimal because the initial route is not driven by coupled noise considerations. Besides, drive strengths are not considered and Section II will show that the coupled noise model is simplistic. The channel routing heuristic presented in [11] modifies a well-known channel router [14], [15] to reflect coupled noise issues. The important notion of digital sensitivity is also introduced - logic and temporal information may make coupling between certain pairs of nets tolerable, i.e., noise coupled to a net may not lead to system malfunction if, for instance, the signal isn't being latched at that time. A capacitive charge sharing model is used to find the amplitude of the noise pulse: we show that this model is very pessimistic. The expression which we derive cannot be broken down into pairwise potential coupled noise contributions as required by their formulation. This is intuitive because the coupled noise contribution of a net adjacent to a victim net depends on other adjacent nets as well. This makes pairwise decomposition of noise required by the graph-based approach impossible. Coupled noise is considered only between adjacent horizontal segments. As the worst case has to be considered for coupled noise, the router may end up with poor routes as shown in Section HI. Besides, shields are assumed to be available wherever necessary and the issue of hooking up the shields to some reference supply voltage is not considered. We derive new results for the analysis of coupled noise pulse height and width in Section II, enabling the verification of dynamic noise margins. Section III uses these expressions to show that routing driven by the new model could be significantly better than routing driven by the charge sharing model. A new algorithm is also proposed and experimental results are presented. Section IV concludes with a discussion on the likely impact of this paper.
II. Coupled noise calculation
2.1 Coupled noise bounds
Consider a quiet victim net coupled to several other "aggressor" switching nets, as shown in Figure 1. Net V is coupled to nets A1 through A4. Let the rise times of the signals be much larger than the propagation delays, so that a lumped RC model is adequate [3]. As the noise pulse is essentially a small signal, linearity and time-invariance can be assumed.
Figure 1. A victim net V coupled to aggressors A1-A4.
Figure 2. Equivalent circuit for calculating coupled noise.
The following two theorems lead to expressions for the coupled noise pulse amplitude and width. Both expressions are derived from the differential equations characterizing the system, and no explicit time-domain solution is necessary.
Theorem 1 The integral of the coupled noise over time is given by

    ∫0∞ vo dt = R2 X Vm.    (1)

Proof: The nodal equation for the output node O is

    (X + C2) dvo/dt + vo/R2 − X dvm/dt = 0.

Integrating this over positive time, using the initial conditions (all voltages at 0) and the final values (vo at 0 and vm at Vm), and simplifying, we get the result. Clearly, the proof holds even if the input has non-zero rise time.
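As a numerical sanity check of the integral identity as reconstructed above, the short script below simulates the two-RC equivalent circuit described in the text (aggressor driver R1 into node M with capacitance C1, coupling capacitance X to the victim node O, which has C2 to ground and output resistance R2) with a simple forward-Euler loop and compares the integral of the victim voltage with R2·X·Vm. The element values are arbitrary illustrative numbers.

    # Forward-Euler check of Theorem 1 on the two-RC equivalent circuit described in the text.
    # Element values are arbitrary illustrative numbers, not taken from the paper.
    R1, C1, X, R2, C2, V = 1e3, 100e-15, 50e-15, 2e3, 100e-15, 1.0

    vm = vo = 0.0
    integral = 0.0
    dt, T = 1e-13, 3e-9
    for _ in range(int(T / dt)):
        # Solve the 2x2 nodal system for dvm/dt and dvo/dt at this time step.
        a11, a12, b1 = C1 + X, -X, (V - vm) / R1
        a21, a22, b2 = -X, C2 + X, -vo / R2
        det = a11 * a22 - a12 * a21
        dvm = (b1 * a22 - a12 * b2) / det
        dvo = (a11 * b2 - a21 * b1) / det
        vm += dvm * dt
        vo += dvo * dt
        integral += vo * dt

    print(integral, R2 * X * V)   # the two values should agree closely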
Theorem 2 The peak noise pulse height is bounded above by

    vP <= 1 / (1 + C2/X + (R1/R2)(1 + C1/X)).    (2)

Proof: When the coupled noise pulse is at its maximum, the derivative of vo with respect to time is zero. Using this in the nodal equation at O, we get

    dvm/dt |vo=vP = vP / (R2 X).

From the nodal equation at M, we get

    dvm/dt |vo=vP = (1 − vm) / (R1 (X + C1)).

Using the two equations above, we get

    vm = 1 − (R1/R2) ((X + C1)/X) vP.

The equation at O can also be written as

    (X + C2) vo + (1/R2) ∫0^t vo dt = X vm.

Using this equation with t equal to the time when the coupled noise pulse reaches its maximum, we get

    (X + C2) vP + Δ = X vm.

The term Δ is the integral of the noise pulse up to the time when the maximum is reached, divided by R2, and is strictly positive. Neglecting Δ and substituting for the voltage at M when the noise pulse reaches its maximum and rearranging, we get the required upper bound.
Note that in Theorem 2, when R1/R2 = 0, the expression for the peak coupled noise voltage reduces to

    vP = X / (X + C2),

which is similar in form to the capacitive charge sharing expression. However, in this expression, C2 is the sum of the line-to-ground capacitance and all the line-to-line couplings from lines adjacent to line 2 (other than line 1). The pairwise capacitance ratio used in [11] is, therefore, pessimistic even in the limiting case of a coupled node with infinite output resistance. As the expression depends on other line-to-line couplings, the decoupling of total noise into independent pairwise contributions is not possible. The noise contribution depends on the other lines adjacent to the aggressor net, and this cannot be handled by any graph-based approach; the formulation would need to be based on hypergraphs. With line-to-line couplings becoming a significant fraction of line-to-ground capacitances in sub-micron ICs, the graph-based approach is less applicable.
2.2 Design implications
Theorems 1 and 2 lead to several interesting design implications. These include:
* The possibility of using overlaps with nets switching slowly to decrease the overall crosstalk on a victim net. Note that overlap is never used to advantage by any of the previous routing methods. Our expression shows that increasing overlap length increases the crosstalk contribution of that particular aggressor. However, it also increases the line capacitance of the victim net and could make the sum of crosstalk contributions of other nets smaller! Thus, not all overlap is bad. Overlap with slow switching nets reduces the noise susceptibility to coupling from other nets. The limiting case of a slowly switching net is, of course, a ground net, which serves to increase the line-to-ground capacitance of the nets adjacent to it. Nets which switch slowly are therefore a resource which needs to be shared among the nets which are coupled noise-critical. Figure 3 shows SPICE results which verify this. Overlap with S reduces the peak noise coupled to V from 1.22 V to 0.93 V - a significant reduction.
Figure 3. Increased overlap leading to smaller total coupled noise. The victim net V overlaps with the slow S net and the fast F net in (a), while in (b) it overlaps with only F. Noise in (a) is smaller.
In order to utilize this degree of freedom, jogs or doglegs may be essential. For instance, if a net with a stringent noise margin is being routed through a channel which has only two slowly switching nets, one on the top left and the other on the bottom right of the channel, the net would have to be routed with a jog if its coupled noise constraint is to be satisfied. Note that the graph-based solution does not introduce jogs and would therefore be sub-optimal.
* The use of a purely capacitive charge sharing model may lead to incorrect conclusions about the coupled noise. For instance, the charge sharing model may predict that adjacency to net a is better than adjacency to net b, but the actual situation might be the opposite due to a larger drive resistance or line capacitance of net b. The noise calculations based on a capacitive charge sharing model may, therefore, show poor fidelity to the actual coupled noise. Figure 4 shows SPICE simulations to support this.
Figure 4. Pure charge sharing model exhibits poor fidelity when driver sizes vary. (a) shows twice the overlap as (b), but with a weaker driver. The coupled noise is larger for (b).
* The drive strength of ECL gates is equal to their output resistance. In this case, the expression shows that it is use-
ful to route nets of equal drive strengths adjacent to each other. In this way the ratio RI/R 2 is not too large when either net is considered as the victim net. Thus the optimal track assignment of a set of equal length nets entering and leaving a channel (with no other connection to be made) is obtained by sorting in order of drive strength. * Coupled noise is less of a problem for strongly driven nets. These nets have small R2, so that the peak coupled noise voltage is smaller and the noise integral is proportionately smaller. * Longer tolerable overlap lengths are obtained than using the charge sharing model. This could lead to more flexible routing. * Coupled noise could be a problem for all net lengths as it depends on the ratio of the line-to-line capacitance to the line-to-ground capacitance. The fidelity of the new expressions are evaluated next using comparisons with exact analytical solutions and SPICE simulations for 0.5 micron CMOS. Figures 5a, 5b and 5c compare the peak noise voltage computed with the capacitive charge sharing model and the new expression for the practical range of parameter ratios. On average, the upper bound is a factor of 2.5 smaller than the pessimistic prediction of the charge sharing model.
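To make the comparison concrete, the snippet below evaluates the charge-sharing estimate X/(X+C2) against the upper bound of Theorem 2 in the form reconstructed above; the capacitance and resistance values are arbitrary illustrative numbers, not the ones used for Figures 5a-5c.

    # Charge-sharing estimate vs. the Theorem 2 bound (as reconstructed above).
    # All element values are arbitrary illustrative numbers.
    def charge_sharing(X, C2):
        return X / (X + C2)

    def peak_bound(X, C1, C2, R1, R2):
        return 1.0 / (1.0 + C2 / X + (R1 / R2) * (1.0 + C1 / X))

    X, C1, C2 = 50e-15, 100e-15, 100e-15     # coupling and line capacitances (F)
    R1, R2 = 2e3, 2e3                        # aggressor driver and victim output resistances (ohms)
    print(charge_sharing(X, C2))             # pessimistic estimate
    print(peak_bound(X, C1, C2, R1, R2))     # tighter bound; equals the estimate when R1/R2 = 0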
Figure 5a. Coupled noise variation with R1/R2. Figure 5b. Coupled noise variation with C1/C2.

Consider the case of two minimum-size driver-receiver pairs in 0.5 micron CMOS coupled capacitively as shown in Figure 6. The lines are 1 mm long and run parallel along the entire length. The total line-to-substrate capacitances are 70 fF and the line-to-line coupling capacitance is 50 fF. The MOSFET models are from MOSIS. The SPICE simulation results for the circuit are shown in Figure 7a and the simulation results for the RC network are shown in Figure 7b. The output voltage and the noise pulse are shown. The interconnect is modelled by 100 distributed LC sections for the SPICE simulation. The peak noise pulse height is 0.7 V and the equivalent pulse width is 4 ns. The peak pulse height is calculated by our formula to be 1 V. The capacitive charge sharing model predicts a 1.65 V peak pulse height and a pulse width equal to the input pulse width (10 ns). Note that the SPICE results are consistent with a finite output resistance: the noise voltage discharges before the end of the clock period.

Figure 6. Coupled noise in 0.5 micron CMOS. a) Circuit configuration. b) Simulated (SPICE) output and noise voltage.

Figure 7. Model fidelity. a) RC equivalent circuit. b) Simulated output and noise voltage.

III. Routing to minimize coupled noise
Equations (1) and (2) in Section II could be used by coupled noise-driven routers. As we are interested in large complexities, coupled noise should be considered when an entire set of nets is being routed. We, therefore, demonstrate the use of the new expressions in a channel router. We must emphasize, however, that other routers could also utilize the formulas. We begin by showing that the two coupled noiseconstrained channel routers proposed so far are incapable of using these new expressions. 3.1 A pathological instancefor the ILPformulation The integer linear formulation for finding track permutations is not amenable to change using the new expressions as the noise coupled to a net is no longer a sum of weighted overlap lengths and an intractable non-linear programming formulation becomes necessary. Even for the ILP case, the
channel routing instance in Figure 8 shows an example where track segment permutation rather than track permutation is necessary. The C nets are critical but have been routed
adjacent to each other by a coupled noise-oblivious router. The S nets, had they been interposed between the C nets, could have served to reduce the coupled noise. Track permutations to alleviate coupled noise on the C nets are not possible as there are vertical constraints imposed on every pair of upper and lower tracks by the nets on the right. If track segment permutations had been allowed, the routing shown in Figure 8b becomes possible. Note that coupled noise is particularly a problem when the layout is dense, as in the Deutsch difficult example. On such benchmarks, the ILP formulation performs poorly as its solution space is limited by the stringent vertical constraints imposed, much like the instance shown in Figure 8a.
Figure 8. Track segment permutation vs. track permutation. a) Crosstalk bounds violated on the C nets. b) Better routing satisfies crosstalk bounds. In (a), the tracks on the top half cannot be swapped with the tracks on the bottom half due to vertical constraints imposed by the U and D nets. Track segment permutations are used to get the solution in (b), which satisfies crosstalk bounds.
3.2 Pathological instance for the graph-based approach
Graph-based approaches cannot handle interactions over sets of nets when the set size is larger than 2, so that the work presented in [14], [15], [11] is less powerful. A pathological instance for the algorithm is shown in Figure 9.
(a) (b) Figure 9. The importance of using jogs or doglegs when routing is constrained by coupled noise. a) No jogs or doglegs are used leading to large coupled noise. b) A dogleg in net Z reduces coupled noise. The primary reason for the sub-optimality shown in Figure 9 is that the algorithm constrains the entire net to run on a single track, which makes the combination of coupling constraints and routing constraints difficult to handle. We therefore propose the use of a new channel router which maintains exact adjacency information at each stage and can introduce jogs when necessary, either motivated by coupled noise issues or for routing purposes. It is based on the "greedy" channel router described in [ 16]. We briefly review the work presented in [16] and outline a new technique to find the optimal "greedy move". We begin with the problem of wire ordering at the ends of the channel - we show that
the maximum amplitude can be similarly defined as the worst case scenario - the smallest allowed amplitude for that particular coupled capacitance for every possible driverreceiver combination. In short, the function Ei serves to formalize the dynamic noise margin concept and is defined over coupling capacitance space as this is easily derivable from the layout.
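Returning to the wire-ordering problem of Section 3.3, the reduction in Theorem 3 suggests using travelling-salesperson heuristics; the sketch below is a simple nearest-neighbour path heuristic over a given crosstalk matrix, shown only as an illustration of that idea and not as the heuristic actually used in Xtinguish.

    # Nearest-neighbour TSP-path heuristic for minimum-crosstalk wire ordering
    # (one of many possible TSP heuristics; illustrative only).
    def order_wires(C):
        """C[i][j] is the crosstalk on net i due to an adjacent net j (NxN matrix)."""
        n = len(C)
        remaining = set(range(n))
        order = [0]
        remaining.remove(0)
        while remaining:
            last = order[-1]
            # pick the net whose adjacency to `last` adds the least crosstalk (both directions)
            nxt = min(remaining, key=lambda j: C[last][j] + C[j][last])
            order.append(nxt)
            remaining.remove(nxt)
        return order

    print(order_wires([[0, 3, 1], [3, 0, 2], [1, 2, 0]]))   # prints [0, 2, 1]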
Figure 10. Transforming the amplitude-pulse width (A-W) plot to an Ei vs. X plot.
Theorem 4 A routing solution will not fail due to coupled noise if, for all nets i, the sum over all aggressor nets j ∈ A(i) of the Theorem 2 upper bound on the noise coupled from net j onto net i (evaluated with X = Xij, C1 = Cj, C2 = Ci, R1 = Rj, and R2 = Rio) is at most Ei(Σ_{j∈A(i)} Xij), where Ci = Cg·li + Σ_{j∈A(i)} Xij.
3.5. Channel routing results
In this sub-section we describe the experimental results obtained using our channel router. We compare the greedy channel router oblivious to coupled noise with the algorithm outlined in the previous sub-section, both implemented in C. The implementation in [11] was not available[19] and therefore could not be compared. The average normalized (to the supply voltage) coupled noise results are presented in Table 1. On average, the algorithm leads to about 22% lower cou-
Proof: The sum of the left hand side of the inequality is just the sum of the upper bounds on noise contributions from all aggressor nets, from Theorem 1. This amplitude must be smaller than the maximum allowed amplitude for this amplitude-pulse width product. The expression for the amplitudepulse width product is substituted from Theorem 2 to obtain the result.C Theorem 4 directly suggests a linear time verification procedure which scans the layout and updates the corresponding maximum peak noise (the right hand side of the inequality in Theorem 4) and noise sums (the left hand side of the inequality) for each net. This verification procedure is optimal in time complexity as any verification procedure has to scan the entire layout before pronouncing it correct. The layout determines the sets A(i), the Cis and the Xijs. Note that during any stage of a greedy routing process, we can determine upper and lower bound on all relevant capacitances to find all nets which are potentially coupled-noise critical. An optimal jog is inserted to improve the coupled noise situation. The algorithm template is defined below. Algorithm Xtinguish
Input: Channel routing instance; technology parameters Cg, Ri, Xij, and the Ei functions.
Output: Channel routing solution satisfying crosstalk bounds.
begin {
  initial wire ordering using a travelling salesperson heuristic
  route channel from left to right {
    make top and bottom connections
    collapse split nets
    add jogs to reduce coupled noise              // this is new
    add jogs to reduce split net range
    add jogs to raise rising nets and lower falling nets
    widen channel if necessary
    proceed to next column
    update coupled noise information              // this is also new
  }
}

This is the standard greedy routing algorithm with two novel features.
* The initial wire ordering is not simply clustering in the middle and uses the heuristic motivated by Theorem 3.
* The jog insertion phase uses the dynamically updated coupled noise information to find an optimal jog, if necessary, to reduce coupled noise.
Upper and lower bounds on the coupled noise expression are determined along with upper and lower bounds for the right hand side of the inequality in Theorem 4. This allows quick and accurate identification of coupled noise-critical nets. In order to keep the number of vias required from becoming excessive, only two jogs per column for coupled noise reduction are allowed. Tradeoff exploration of jogs for coupled noise reduction is possible by modifying the jog insertion heuristic. Almost every stage of the greedy channel router can be modified to reflect coupled noise concerns. Our implementation changes only two of the phases.
TABLE 1. Channel routing results (normalized average crosstalk).

Benchmark   #nets   #columns   CMOS Xtinguish   CMOS Greedy   ECL Xtinguish   ECL Greedy
3a          30      45         .136             .166          .072            .097
3b          47      62         .148             .155          .085            .087
3c          54      103        .154             .239          .088            .138
3cr         54      103        .146             .227          .091            .131
ddr         72      169        .217             .242          .131            .137
ex1         235     417        .083             .091          .051            .052
ex2         282     421        .063             .066          .033            .037
ex3         291     421        .043             .067          .028            .039
ex4         270     421        .060             .072          .036            .042
n1          77      139        .185             .262          .099            .150
n2          77      117        .115             .155          .071            .090
n3          78      123        .169             .157          .096            .091
n4          74      150        .197             .257          .119            .147
r1          77      139        .187             .252          .109            .143
r2          77      117        .124             .152          .070            .090
r3          78      123        .153             .158          .091            .092
r4          74      150        .218             .244          .121            .140
r5          112     150        .131             .141          .076            .082
Deutsch     52      157        .229             .232          .131            .132

pled noise for CMOS and ECL. Drive strengths are assumed to vary from 1x to 10x. The number of tracks is typically the same for both solutions, i.e., there is no area penalty. The time required for each of these is at most a few seconds on a DEC 3100 workstation. The parameters used are representative of 0.5 micron CMOS and ECL.
References
[1] 1. Catt, "Crosstalk (noise) in digital systems", IEEE Transactions on Electronic Computers, Vol. 16, No. 6, pp.743-763, 1967. [2] P. Larsson and C. Svensson, "Noise in digital dynamic CMOS circuits", IEEE Journal of Solid-State Circuits, June 1994, pp. 655-662. [3] S.I. Long and S.E. Butner, Gallium Arsenide Digital IC Design, McGraw-Hill, 1990, Ch. 5. [4] R.R. Tummala and E.J. Rymaszewski, Microelectronics Packaging Handbook, Van Nostrand Reinhold, 1989. [5] L. Gal, "On-chip crosstalk - the new signal integrity challenge", Proceedings of the Custom Integrated Circuits Conference, 1995, pp. 12.1.1 - 12.1.4. [6] G.A. Katopis, "Delta-I noise specification for a high-performance computing machine", Proceedings of the IEEE, September 1985, pp. 1405-1415. [71 T. Sakurai, "Closed form expressions for interconnection delay, coupling and crosstalk in VLSIs", IEEE Transactions on Electron Devices, Vol. 40, No. 1, 1993, pp. 118-123. [8] J. Rubinstein, P. Penfield and M.A. Horowitz, "Signal delay in RC tree networks", IEEE Transactions on Computer-Aided Design of ICs and Systems, Vol. 2. No.3, 1983, pp. 202-210. [9] T. Gao and C.L. Liu, "Minimum crosstalk channel routing", Digest of Technical Papers of the International Conference on Computer-Aided Design, 1993, pp. 692-696. [10] T. Gao and C.L. Liu, "Minimum crosstalk switchbox routing", Digest of Technical Papers of the International Conference on Computer-Aided Design. 1994, pp. 610-615. [11] D. Kirkpatrick and A. Sangiovanni-Vincentelli, "Techniques for crosstalk avoidance in the design of high-performance digital systems", Digest of Technical Papers of the International Conference on Computer-Aided Design. 1994, pp. 616-619. [12] S. Mitra, S.K. Nag, R.A. Rutenbar and L.R. Carley, "Systemlevel routing of mixed-signal ASICs in WREN", Digest of Technical Papers of the International Conference on Computer-Aided Design, 1992, pp. 394-399. [13] A. Onozawa, K. Chaudhary and E.S. Kuh, "Performance driven spacing algorithms using attractive and repulsive constraints for submicron LSIs", IEEE Transactions on CAD of ICs and Systems, Vol. 14, No. 6, 1995, pp. 707-719. [141 H.H. Chen and E.S. Kuh, "Glitter: a gridless variable width channel router", IEEE Transactions on Computer-Aided Design of ICs and Systems, Vol. 5, No. 4, 1986, pp. 459-465. [15] U. Choudhury and A. Sangiovanni-Vincentelli, "Constraintbased channel routing for analog and mixed analog/digital circuits", IEEE Transactions on Computer-Aided Design of ICs and Systems, Vol. 12, No. 4, 1990, pp. 497-5 10. [16] R. L. Rivest and C.M. Fiduccia, "A greedy channel router", Proceedings of the DAC, 1982, pp. 418-424. [17] P. Groenveld, "Wire ordering for detailed routing", IEEE Design and Test of Computers, December 1989, pp. 6-17. [18] M. Marek-Sadowska and M. Sarrafzadeh, "The crossing distribution problem", IEEE Transactions on Computer-Aided Design of ICs and Systems, Vol. 14, No. 4, 1995, pp. 423-433. [19] D. Kirkpatrick, private communication, May 1995. [20] D. Theune, R. Thiele, W. John and T. Lengauer, "Robust methods for EMC-driven routing", IEEE Transactions on Computer-Aided Design of ICs and Systems, Vol. 13, No. 11, 1994, pp. 1366-1378. [21] G.L. Matthaei, J.C.-H. Shu and S.I. Long, "Simplified calculation of wave coupling between lines in high-speed integrated circuits", IEEE Transactions on Circuits and Systems, Vol. 37, No. 10, 1990, pp. 1201-1208.
IV. Discussion and conclusions The analysis results of Section II enable quick computation of bounds for coupled noise which are intuitive and enable better design. We believe that these bounds will spawn a new class of coupled noise-constrained layout systems. Section III has shown that routing algorithms could utilize these bounds to improve layouts. While the focus has been on channel routing, this is by no means the only routing formulation which could be modified to include coupled noise. Area routers, switchbox routers, etc. could also use our work. We believe that our channel routing results of Section IV will lead to a paradigm shift away from graph-based approaches to algorithms which maintain exact adjacency information, especially with coupled noise becoming a major issue. We have considered only channels where a lumped capacitance model is appropriate and the question of long channels remains open. We are currently addressing the issue of bounds on crosstalk for layout where transmission line analysis is necessary. This is particularly essential for PCBs and MCMs. It is usually handled by grouping nets into net classes and conducting simulations to determine allowable overlap lengths [20]. It appears to be difficult to obtain simple, elegant bounds for these cases which capture relevant parameters and are accurate. The RC model is considerably in error for these cases[21] and could be optimistic. The wide variety of termination possibilities of these lines also make good abstractions difficult. For sub-micron VLSI, however, our methods appear to be adequate. Timing analysis can be used by routers [5], [11] to identify pairs of nets between which interaction is allowed. Our algorithm can be extended easily to handle this information too. Post-layout optimization for crosstalk such as the work in [9] and [13] could be used to further improve our results. Note that jog insertion is not attempted by these techniques. Our algorithm addresses this issue and therefore complements these methods well. Besides, our models (after suitable linearization) should improve the results [9] and [13] obtain. In conclusion, we have presented new results for coupled noise computation in VLSI circuits. We have used these to show several novel design implications and have presented innovative routing techniques for coupled noise reduction.
Acknowledgments We would like to thank D. Kirkpatrick of UC, Berkeley for providing the benchmarks and for many discussions. We would also like to thank C.-C. Lin of UCSB for providing us his greedy channel router implementation.
SIMULTANEOUS TRANSISTOR AND INTERCONNECT SIZING USING GENERAL DOMINANCE PROPERTY Jason Cong and Lei He Department of Computer Science University of California, Los Angeles, CA 90095 *
given cell library for performance optimization. Recent gate sizing works can be found in [1, 2, 3]. However, all these works ignored the sizing of interconnects. The interconnect sizing problem, often called the wiresizing problem, was first introduced in [6, 7] where the authors developed the first polynomial-time optimal wiresizing algorithm to minimize the weighted Elmore delay between the unique source and a set of critical sinks. Later on, the wiresizing problem to minimize the maximum delay along a RC tree was formulated as a posynomial programming problem and solved in [12] by the sensitivity-based algorithm similar to that in [9]. In addition, the optimal wiresizing problem for interconnects with multiple sources was formulated and solved optimally in [4]. All these works, however, did not consider the need to size the driving transistors again after interconnects have been changed. Several recent studies consider both gate and interconnect sizing. The work in [5] formulated the simultaneous driver and wire sizing problem to size a single routing tree driven by a chain of cascaded drivers and developed an efficient optimal algorithm for both delay and power optimization. The authors of [11] studied the simultaneous wire sizing and buffer insertion problem for a single RC tree and solved it by a dynamic programming approach. Finally, the concurrent gate and wire sizing problem was studied in [15] using a sequential quadratic programming technique with the assumption that all transistor sizes in a gate are scaled isotropically. However, if we assume that all transistor in a gate are sized isotropically, it often leads to suboptimal designs, especially in the full-custom layout. In this paper, we shall study the simultaneous transistor and interconnect sizing (STIS) problem, where the optimal size for each transistor (instead of a gate or cell) and the optimal wire width for each wire segment are computed independently in order to achieve more optimized designs. We show that the STIS problem under a number of transistor and interconnect models has the objective functions of the following form:
ABSTRACT We study the simultaneous transistor and interconnect sizing (STIS) problem in this paper. We define a class of optimization problems named CH-posynomial programs and show a general dominance property for all CH-posynomial programs (Theorem 1). Based on this property, a set of lower and upper bounds for the optimal solution of a CH-posynomial program can be computed through local refinement operations very efficiently (in polynomial time). We show that the STIS problem under a number of transistor and interconnect models is CH-posynomial programs, for example, when we use the RC tree model for interconnects and the step or the slope model for transistors. Preliminary experimental results show that in nearly all cases, the optimal solution is achieved because the recursive application of local refinement operations using the dominance property leads to the identical lower and upper bounds. Moreover, the STIS algorithm produces the more optimized solution when compared with optimal transistor sizing only. 1. INTRODUCTION It is well recognized that interconnect delay has become the dominating factor in determining the circuit performance in deep submicron designs. We believe that the most effective way for performance optimization in deep submicron design is to consider both logic and interconnect designs throughout the entire design process. As part of our effort to develop a unified methodology and platform for simultaneous logic and interconnect design and optimization, we study the simultaneous transistor and interconnect sizing problem (STIS) for delay minimization in this paper. Most previous works consider the device sizing and the interconnect sizing problems separately. The device sizing problem includes both gate sizing and transistorsizing problems. In [9], the transistor sizing problem was formulated as a posynomial programming problem under the Elmore delay formulation for the RC tree model and solved by a sensitivity-based heuristics. Later, the authors of [13] applied the delay model developed in [10] for their transistor sizing formulation and solved it by a convex programming technique. The gate sizing problem usually assumes that all transistor sizes within each gate scale isotropically [14] (i.e., all transistor sizes increase or decrease by an uniform factor), or each gate has a discrete set of implementations (cells) with different driving capabilities pre-designed as a
    f(X) = Σ_i Σ_p Σ_j Σ_q a_{pq,ij} · x_j^q / x_i^p  +  Σ_i Σ_p b_{p,i} · x_i^p  +  Σ_i Σ_p c_{p,i} / x_i^p,    (1)

where x_i, x_j ∈ X = {x_1, ..., x_n}, p, q ∈ {1, 2, ..., m}, and a_{pq,ij}, b_{p,i}, c_{p,i} >= 0,
and its coefficients have the following symmetry:
* This work is partially supported by ARPA/CSTO under contract J-FBI-93-112, the NSF Young Investigator Award MIP9357582 and a grant from Intel Corporation under the NYI matching award program.
1. a_{pq,ij} > 0 if and only if a_{qp,ji} > 0; 2. b_{p,i} > 0 only if c_{p,i} > 0 or a_{pq,ij} > 0 for any q and j;
3. c_{p,i} > 0 only if b_{p,i} > 0 or a_{pq,ij} > 0 for any q and j.
Following the notations in [8], we call this class of functions CH-posynomials.
Definition 1 Eqn. (1) is a simple CH-posynomial if all coefficients are constants independent of X.
Definition 2 Eqn. (1) is a general CH-posynomial if the coefficients are functions of x_i and x_j satisfying the following conditions: a_{pq,ij} monotonically increases with respect to x_i while monotonically decreasing with respect to x_j, and b_{p,i} monotonically increases with respect to x_i while c_{p,i} monotonically decreases with respect to x_i.
Figure 1. (a) The gate-level circuit diagram, where gate2 and gate3 are fanouts of gate1. (b) The transistor-level circuit diagram, where gate1 and the routing tree linked to its output comprise a DC-connected component DCC1. The output of gate1, denoted N0, is the root of the routing tree. The inputs of gate2 and gate3 are sinks of the routing tree.
Furthermore, we call an optimization problem to minimize a simple CH-posynomial as a simple CH-posynomialprogram, and to minimize a general CH-posynomial a general CHposynomial program. We study the following dominance property for CH-posynomial programs. Definition 3 (Dominance Relation) For two vectors X and X', we say that X dominates X' (denoted as X > X') if x, > z$ for all i.
2. FORMULATIONS We consider the CMOS static logic in this paper. Since the transistor channel length is usually fixed to the minimum feature size, we only allow the channel width to change during transistor sizing and shall use width to refer both the channel width of a transistor and the wire width of an interconnect. We will first formulate the delay as a function of the transistor/wire widths, then show the STIS problem for delay minimization is a CH-posynomial program.
Definition 4 (Local Refinement Operation) The local refinement operation of a solution vector X, with respect to any particularvariable zi and function f (X), is to minimize f (X) subject to only evaluating xi while keeping the value of any xj(j - i). We say that the result solution vector is the local refinement of X (with respect to zi). For simplicity of presentation. we shall use solution instead of solution vector.
2.1. Delay Formulation Because an MOS transistor is a voltage-controlled device, an MOS circuit can be partitioned into a number of DCConnected Components (DCCs) [18]. A DCC is a set of transistors and wires that are connected by DC-current paths containing only transistor channels or wires. The DC current can not cross the boundary of a DCC. In most cases, a DCC consists of a gate G and a routing tree connecting the output (denoted as N.) of G to the inputs of all gates driven by G (see Figure 1). Our delay computation is similar to that in the switch level timing analysis tool Crystal [16]. The delay will be computed based on a stage, which is a DC-current path from a signal source (either the Vdd or the ground) to the gate of a transistor, including both transistors and wires (see Figure 2). The delay of a stage is the summary of delays through all transistors and wires in the stage.
Definition 5 (Dominance Property) Let X* be the optimal solution to minimize f (X). If X dominates X*, a local refinement of X still dominates X*; If X is dominated by X*, a local refinement of X is still dominated by X* We proved the following important Theorem 1 that will lead to our STIS algorithm later on. Theorem 1 The dominance property holds for both simple and general CH-posynomial programs. The authors of [7] first proposed the dominance property for their single-source optimal wiresizing formulation. Our results greatly generalize the concept of the dominance property and reveal that the dominance property holds for a large class of optimization problems instead of the single optimization problem in [7], which is an instance of the simple CH-posynomial program. The dominance property leads to an efficient algorithm to compute a set of lower and upper bounds of the optimal solution to a CH-posynomial program by local refinement operations very efficiently (in polynomial time). The algorithm has guaranteed convergence and optimality and can be applied to many optimization problems in VLSI CAD and other domains. We will show that the STIS problem is CH-posynomial programs when we use the RC tree model for interconnects and the step or the slope model for transistors. Preliminary experimental results show that in nearly all cases, the optimal solution is achieved because the recursive application of local refinement operations using the dominance property leads to the identical lower and upper bounds. So the algorithm is optimal in the practical sense. Besides, the STIS algorithm produces the more optimized solution when compared with optimal transistor sizing only.
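A minimal sketch (in Python) of the bound-computation scheme just described, under stated assumptions: local_refine(x, i) is a hypothetical helper that minimizes the objective over variable i alone (Definition 4), and the loop repeatedly applies it to the all-lower-bound and all-upper-bound starting solutions, which by the dominance property remain below and above the optimum, respectively.

    # Sketch of computing lower/upper bounds on the optimum of a CH-posynomial program
    # by repeated local refinement (illustrative only).
    # local_refine(x, i) is a hypothetical single-variable minimizer (Definition 4).
    def refine_to_fixpoint(x, local_refine, max_passes=100, tol=1e-9):
        for _ in range(max_passes):
            changed = False
            for i in range(len(x)):
                xi_new = local_refine(x, i)      # minimize the objective over x[i] only
                if abs(xi_new - x[i]) > tol:
                    x[i] = xi_new
                    changed = True
            if not changed:
                break
        return x

    def bound_optimal_solution(lower, upper, local_refine):
        lo = refine_to_fixpoint(list(lower), local_refine)   # stays dominated by the optimum
        hi = refine_to_fixpoint(list(upper), local_refine)   # stays dominating the optimum
        return lo, hi    # when lo == hi, the optimal solution is determined exactly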
A. Delay through transistors
A transistor can be modeled by the source-drain effective resistance r and the gate, source and drain capacitances cg, cs and cd. Let x be the transistor width; then r, cg, cs and cd can be written as follows:

    r = r0 / x,    cg = cg0 · x,    cs = cs0 · x,    cd = cd0 · x,
where cg0, cs0 and cd0 are the gate, source and drain capacitances for a unit-width transistor and can be viewed as constants without losing much accuracy. In addition, r0 is called the unit effective resistance, defined in the following: Let
width x and length l are

    r = r0 · l / x    and    c = c0 · l · x + c1 · l.

Let rk be the resistance of the uni-segment Ek with downstream node at Nk; the delay t(P'(N0, Nt)) along the unique path P'(N0, Nt) between node N0 and sink Nt is

    t(P'(N0, Nt)) = Σ_{Nk ∈ P'(N0, Nt)} rk · Ck.    (3)
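A small, self-contained illustration of Equation (3): given the per-uni-segment resistances and the total downstream capacitances along a source-to-sink path, the path delay is the sum of each segment resistance times its downstream capacitance. The numbers below are arbitrary.

    # Elmore delay along a source-to-sink path, per Equation (3); arbitrary example values.
    def path_delay(seg_resistances, downstream_caps):
        """seg_resistances[k] and downstream_caps[k] belong to the k-th uni-segment on the path."""
        return sum(r * c for r, c in zip(seg_resistances, downstream_caps))

    r_k = [10.0, 12.0, 15.0]              # ohms, one per uni-segment (r0 * l / x)
    C_k = [300e-15, 180e-15, 90e-15]      # farads, total capacitance downstream of each segment
    print(path_delay(r_k, C_k))           # delay in seconds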
C. Delay through a stage
With respect to Eqn. (2) and Eqn. (3), the delay t(P(Ns, Nt), X) of stage P(Ns, Nt) can be written as, according to [19]:
a rising input drive an inverter with total capacitance loading CL. If the 50% delay is τ, we say that r = τ/CL is the effective resistance of the n-type transistor, and the unit effective resistance r0 is given by r · x, with x being the size of the n-type transistor. The unit effective resistance of the p-type transistor can be defined similarly by the delay under the falling input waveform. If X = {x1, ..., xn} is the sizing solution for all transistors and wires, in general r0 shall be a function of X. For simplicity of presentation, we use r0 instead of r0(X). However, it is worthwhile to mention that the unit effective resistance can consider the nonlinear characteristic of the transistor and the waveform slope effect. High accuracy can be achieved when using a table-based method; more discussion will be given in Section 2.3. Under this transistor model, the gate part of a DCC becomes an RC network. Let PM(Ns, N0) denote the path corresponding to all transistors in a stage. When computing the delay through these transistors, we only consider the effective resistances in path PM(Ns, N0) and the capacitances connected to nodes in path PM(Ns, N0). Let ck denote the total capacitance at node Nk due to source/drain capacitances of all transistors linked to Nk, and C0 the total capacitance due to the routing tree and its sinks. If R(PM(Ns, Nk)) is the total resistance of the partial path from the source Ns to node Nk in path PM(Ns, Nk), the Elmore delay through path PM(Ns, N0), denoted as t(PM(Ns, N0)), is
    t(PM(Ns, N0)) = Σ_{Nk ∈ PM(Ns, N0)} R(PM(Ns, Nk)) · ck  +  R(PM(Ns, N0)) · C0.    (2)
We treat node N0 as the root of the routing tree. Consider node Nk in the routing tree and let Tk denote the subtree rooted at node Nk (including node Nk). We denote the total capacitance within Tk as Ck, including the total wire capacitances and the total sink capacitances within Tk.
Figure 2. For DCC1 in Figure 1: (a) a stage from Vdd to the gate of transistor M9; (b) a stage from Vdd to the gate of transistor M9; (c) a stage from the ground to the gate of transistor M6. Clearly, a transistor and a wire may belong to multiple stages.
    t(P(Ns, Nt), X) = t(PM(Ns, N0)) + t(P'(N0, Nt))    (4)
                    = Σ_{i,j} f0^{st}(i,j) · li · lj · (xj/xi) + Σ_{i,j} f1^{st}(i,j) · li · (xj/xi)
                      + Σ_i g^{st}(i) · li · xi + Σ_i h0^{st}(i) · li + Σ_i h1^{st}(i) · (li/xi),
where xi is the width of a transistor Mi or a wire uni-segment Ei, and li is the length of a wire uni-segment Ei, with li = 1 for a transistor Mi. The coefficients f0^{st}, f1^{st}, g^{st}, h0^{st} and h1^{st} can be determined for stage P(Ns, Nt) with respect to the given r0, cg0, cs0, cd0, c0 and c1 for either transistors or wires.
2.2.
STIS Formulation to Minimize Delay for Multiple Critical Paths In order to simultaneously minimize delay along multiple critical paths in a circuit, we propose to minimize the weighted delay t(X) of all stages in these critical paths:
t(X)
At t(P(N, Nt), X)(5)
= P(Ns,Nt)Ecr~itic. poth,
R(PY (N,, N,)) *c
where the penalty weight A"t indicates the criticality of stage P(Ns, Nt). A simplified weight assignment scheme is used in this paper. The weight of a stage is 1 if the stage is in a critical path, otherwise, it is 0. Let
N;,EPM(N-,N,)
R(PM(N,, No)) - Co
(2)
B. Delay through wires A routing tree is modeled as a distribute RC tree, similarly to [7, 4]. Each sink has an extra capacitance due to the gate capacitance of a transistor in a gate driven by the routing tree. Each wire segment is divided into a sequence of sub-segments. Each sub-segment is treated as a 7r-type RC circuit. Since we assume that the wire width is uniform within a sub-segment, a sub-segment is called a uni-segment. Note that the segment division controls how aggressively we perform wiresizing optimization. Clearly, if the unit-width unit-length wire has wire resistance ro, wire area capacitance co and wire fringing capacitance cl, the resistance r and the capacitance c for a uni-segment with
Let

    F0(i,j) = Σ_{P(Ns,Nt) ∈ critical paths} λ^st · f0^st(i,j)
    F1(i,j) = Σ_{P(Ns,Nt) ∈ critical paths} λ^st · f1^st(i,j)
    G(i)    = Σ_{P(Ns,Nt) ∈ critical paths} λ^st · g^st(i)
    H1(i)   = Σ_{P(Ns,Nt) ∈ critical paths} λ^st · h1^st(i)
and eliminate those terms independent of X; Eqn. (5) then becomes

    t(X) = Σ_{i,j} F0(i,j)·xj/xi + Σ_{i,j} F1(i,j)·li·lj·xj/xi + Σ_i G(i)·li/xi + Σ_i H1(i)·li·xi        (6)

Theorem 3 The STIS problem under a DP-slope model is a general CH-posynomial program with the dominance property.
An example of a DP-slope model is the model developed in [10]. For an inverter, let τ0 be the delay under a step input and τs the delay under an input with transition time s; [10] derived the following relation:

    τs = α·s + τ0        (7)
where α is a constant determined by the technology. According to Eqn. (7), the unit effective resistance of a transistor is an increasing function of its input transition time. Because increasing the size of a transistor always increases its gate capacitance, the input waveform becomes slower due to the larger capacitive loading; thus, the unit effective resistance of a transistor is an increasing function of its size. Since Eqn. (7) is an accurate solution for the inverter delay, we believe that most models that consider the waveform slope effect are DP-slope models. It is worth mentioning that if Eqn. (7) is used to compute a gate delay by assuming that the input switching time is twice the Elmore delay of the previous stage, as in the transistor sizing formulation in [13], the path delay is a simple CH-posynomial and the STIS problem to minimize the weighted delay of all stages in multiple critical paths is also a simple CH-posynomial program with the dominance property. However, Theorem 3 is much more general in the sense that it applies to a DP-slope model of any form, even one without a closed form, as long as the unit effective resistance is an increasing function of the transistor size. A slope model without a closed form is discussed in Section 3.2. We would like to emphasize that the STIS algorithm will be developed for any DP-slope model that satisfies Theorem 3.
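As a small numerical illustration of the feedback captured by Eqn. (7), the following sketch (not from [10]; all constants are arbitrary) shows why r0 grows with transistor size: a wider transistor presents a larger gate capacitance to the previous stage, so the input transition s slows down, and r0 = τs·x/CL increases even though the step-input component shrinks.

```python
# Toy illustration of Eqn. (7): tau_s = alpha * s + tau_0 (arbitrary constants).
alpha = 0.4                   # technology constant of Eqn. (7)
r_unit0 = 6.0                 # unit effective resistance under a step input
c_g0 = 0.5                    # gate capacitance per unit transistor width
r_prev, c_fixed = 3.0, 1.0    # previous stage's resistance and its other loading
c_load = 4.0                  # load driven by the transistor being sized

for x in (1.0, 2.0, 4.0, 8.0):                 # candidate widths
    s = 2.0 * r_prev * (c_fixed + c_g0 * x)    # input transition ~ previous-stage RC delay
    tau_step = (r_unit0 / x) * c_load          # 50% delay under a step input
    tau_s = alpha * s + tau_step               # Eqn. (7)
    r0 = (tau_s / c_load) * x                  # unit effective resistance r0 = r * x
    print(f"x={x:4.1f}  s={s:5.2f}  r0={r0:5.2f}")   # r0 grows monotonically with x
```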
With respect to Eqn. (6), we define the following STIS problem to minimize delay through multiple critical paths:

Formulation 1 Given a circuit and the lower and upper bounds for the width of every transistor/wire, the STIS problem is to determine the width of every transistor/wire (or, equivalently, a sizing solution X) such that the weighted delay through multiple critical paths given by Eqn. (6) is minimized.

In practice, it is often the case that we want to size the transistors and wires without any increase in the layout area (using the free space in the current layout) or with a bounded increase in the layout area. Therefore, an upper bound is associated with each transistor and wire during the optimization. On the other hand, a lower bound is associated with each transistor and wire due to the technology feature sizes and reliability concerns (such as electromigration). Thus, the lower and upper bounds are used to handle these constraints. It will be seen later that the lower and upper bounds are also the starting point of our STIS algorithm.

2.3. Dominance Property for STIS Problems
The coefficient functions F0, F1, G and H1 in Eqn. (6) are determined by the parameters r0, cg0, cs0, cd0, c0 and c1. Since we assume that the unit resistance r0 for wires and all capacitance parameters are constants independent of X, the behavior of these coefficients is determined by the unit effective resistance r0 of the transistors. The unit effective resistance r0 of a transistor, defined in Section 2.1.A, is a function of the input waveform slope. The step model assumes that the input waveform is always a step, so that r0 is a constant independent of the sizing solution X. As a result, all coefficients of Eqn. (6) are positive constants independent of X. Thus, we have the following theorem.
3. ALGORITHMS

3.1. Generic Algorithms to Exploit the Dominance Property
For a CH-posynomial f(X), let X_L be the lower bound of its definition domain and X_U the upper bound. Since the optimal solution X* must be bounded by X_L and X_U, a simple algorithm scheme, the Local Refinement Algorithm (LRA) scheme, can be used to compute a set of tighter lower and upper bounds for X* by applying local refinement operations beginning with either X_L or X_U. We introduce the concept of an LR-tight bound in the following.
Theorem 2 The STIS problem under the step model is a simple CH-posynomial program with the dominance property.
Definition 7 A lower or upper bound is LR-tight if it cannot be tightened any more by a local refinement.
The step model has been used in [9] for transistor sizing and in wiresizing works [7, 12, 4] to model the driver. It was also used in [5] for simultaneous driver and wire sizing. However, the step input is just an ideal assumption and it is well known that the delay under real waveforms will be larger than that under the step input. In order to consider the waveform slope effect on the unit effective resistance for the transistor, we define the following DP-slope model for the transistor.
The LRA scheme is a greedy algorithm based on iterative local refinement operations. Beginning with a solution X = X_L, the LRA scheme traverses every xi in a certain order and performs a local refinement operation on it. Because X is dominated by X*, its local refinement is still dominated by X*. This process is repeated, and X moves increasingly closer to X* while remaining dominated by it. The process stops when no improvement is achieved on any xi in the last round of traversal, and we obtain an LR-tight lower bound. An LR-tight upper bound can be obtained in a similar way by performing local refinement operations beginning with X = X_U. In essence, the LRA scheme generalizes the greedy wiresizing algorithm GWSA first developed in [7]. If r is
Definition 6 The DP-slope model is a transistor model in which the unit effective resistance of the transistor is an increasing function of its size. We proved that the STIS problem under a DP-slope model has the dominance property.
the maximum number of possible evaluations of any xi (i = 1, ..., n), the LRA scheme converges in polynomial time O(r·n³).

The allowed transistor widths are {0.5 µm, 0.6 µm, ..., 200.0 µm}, in steps of 0.1 µm, as determined by the design rules. Note that the local refinement operation does not depend on how the feasible widths are defined; thus our algorithms are applicable to any width scheme for transistors and wires.
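To make the LRA scheme concrete, the following is a minimal Python sketch of the bound-tightening loop, not the authors' implementation: the objective and the single-variable refinement are illustrative placeholders, whereas a real STIS objective would use the coefficient functions F0, F1, G and H1 of Eqn. (6).

```python
def local_refinement_pass(x, refine_one, widths):
    """One traversal: optimally re-set each x_i with all other entries held fixed."""
    changed = False
    for i in range(len(x)):
        best = refine_one(x, i, widths)   # best width for x_i given the rest of x
        if best != x[i]:
            x[i] = best
            changed = True
    return changed

def lr_tight_bound(start, refine_one, widths):
    """Repeat local-refinement passes until no entry improves (an LR-tight bound)."""
    x = list(start)
    while local_refinement_pass(x, refine_one, widths):
        pass
    return x

def make_refine_one(f):
    """Single-variable refinement: pick the grid width minimizing f with the rest fixed."""
    def refine_one(x, i, widths):
        trial = list(x)
        best_w, best_val = x[i], None
        for w in widths:
            trial[i] = w
            val = f(trial)
            if best_val is None or val < best_val:
                best_w, best_val = w, val
        return best_w
    return refine_one

if __name__ == "__main__":
    widths = [1.0, 2.0, 3.0, 4.0, 5.0]                        # allowed discrete widths
    f = lambda x: sum(10.0 / w for w in x) + 0.5 * sum(x)     # toy posynomial-like cost
    lower = lr_tight_bound([min(widths)] * 4, make_refine_one(f), widths)
    upper = lr_tight_bound([max(widths)] * 4, make_refine_one(f), widths)
    print(lower, upper)   # identical bounds mean the optimum has been pinned down
```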
3.2. Implementation of the DP-Slope Model
We assume a DP-slope model of the most general form, with the only requirement being that it gives the unit effective resistance r0 for each discrete transistor size, and we apply a table-based method in order to achieve a satisfactory trade-off between accuracy and complexity. In general, the unit effective resistance r0 of a transistor is a function of its size, the input waveform slope and the capacitive loading. However, [17] proposed that all three parameters can be combined into one factor, called the slope ratio, which solely determines the unit effective resistance r0 of the transistor. Thus, a one-dimensional table can be used instead of a three-dimensional table. We built a one-dimensional table for every type of transistor based on SPICE simulation results. Our implementation is similar to that in [16].
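As a rough illustration of such a table-based model (not the authors' code), the sketch below stores unit effective resistances indexed by slope ratio and interpolates linearly between entries; the sample numbers are made up, not SPICE-characterized data.

```python
import bisect

class UnitResistanceTable:
    """1-D lookup table: slope ratio -> unit effective resistance (ohm * um)."""
    def __init__(self, slope_ratios, r0_values):
        assert len(slope_ratios) == len(r0_values) and len(slope_ratios) >= 2
        self.s = list(slope_ratios)   # must be sorted ascending
        self.r = list(r0_values)

    def r0(self, slope_ratio):
        # Clamp outside the characterized range, interpolate linearly inside it.
        if slope_ratio <= self.s[0]:
            return self.r[0]
        if slope_ratio >= self.s[-1]:
            return self.r[-1]
        k = bisect.bisect_right(self.s, slope_ratio)
        t = (slope_ratio - self.s[k - 1]) / (self.s[k] - self.s[k - 1])
        return self.r[k - 1] + t * (self.r[k] - self.r[k - 1])

# Hypothetical characterization data for one transistor type (illustrative values only).
nmos_table = UnitResistanceTable(
    slope_ratios=[0.0, 0.5, 1.0, 2.0, 4.0],
    r0_values=[5200.0, 5600.0, 6100.0, 7000.0, 8400.0],
)
print(nmos_table.r0(1.5))   # interpolated unit effective resistance
```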
4.1. Transistor Sizing
The dominance property was previously used only for the optimal wiresizing problem. In this experiment, the STIS algorithm based on the dominance property is applied to the transistor sizing problem. Full adders from 2-bit to 16-bit, implemented with complex gates, are sized under the assumption that all transistors are critical. Although we use an extreme upper bound (200.0 µm) for each transistor, the performance of the STIS algorithm is still quite good. In Table 1, nFET is the total number of transistors in a circuit, average nLoc is the average number of local refinement operations per transistor, time is the total CPU time to size a circuit, and average width is the average channel width over all transistors after transistor sizing. A transistor on average reaches its optimal width after 6-8 local refinement operations, and the time to size 514 transistors is just 19.78 seconds on a SPARC-10 workstation. The critical delay reduction achieved by optimal transistor sizing is up to 23.7%.
3.3. Overview of the Near-Optimal STIS Algorithm
The overall STIS algorithm consists of three steps: initialize the coefficient functions, tighten the lower and upper bounds of the optimal solution, and search for the optimal solution between the LR-tight lower and upper bounds. In addition, the coefficient functions are updated during the bound-tightening procedure, because the unit effective resistance r0 of a transistor under a DP-slope model is a function of the current solution X. In order to initialize and update the coefficient functions efficiently, a circuit containing both transistors and interconnects is pre-partitioned into DCCs. Although the coefficient functions F0 and F1 have size n x n, the cost of operating on them is reduced greatly by DCC partitioning, because these coefficients need to be computed only for transistors and wires within a DCC. It is worth mentioning that the LR-tight lower and upper bounds have zero sensitivity; in other words, a sensitivity-based method cannot obtain a solution more optimized than the LR-tight lower or upper bound. If the LR-tight lower and upper bounds are identical for every transistor/wire, the optimal solution is achieved immediately, which happens in almost all cases in practice. Because both the coefficient operations and the bound operations are completed in polynomial time, the optimal STIS solution is achieved in polynomial time in practice. When the LR-tight lower and upper bounds do not meet, we observe that the gap between them is very small, often just one width step in our experiments, and the percentage of divergent transistors and wires is also very small; thus enumeration can be carried out in reasonable time.
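A compact sketch of this three-step flow is given below. It is a simplified reading of the algorithm, not the authors' code: the bound-tightening routine and the cost function are supplied by the caller (for example, the lr_tight_bound helper sketched earlier and an evaluation of Eqn. (6)).

```python
from itertools import product

def stis_optimize(lower, upper, widths, tighten, cost):
    """tighten(start) must return an LR-tight bound; cost(x) evaluates the weighted
    delay objective. Divergent variables (if any) are resolved by enumeration."""
    lo = tighten(lower)                    # LR-tight lower bound
    hi = tighten(upper)                    # LR-tight upper bound
    free = [i for i in range(len(lo)) if lo[i] != hi[i]]
    if not free:
        return lo                          # bounds coincide: solution is optimal
    # Enumerate only the (usually tiny) set of variables whose bounds still differ.
    choices = [[w for w in widths if lo[i] <= w <= hi[i]] for i in free]
    best, best_cost = None, None
    for combo in product(*choices):
        x = list(lo)
        for i, w in zip(free, combo):
            x[i] = w
        c = cost(x)
        if best_cost is None or c < best_cost:
            best, best_cost = x, c
    return best
```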
4.2. Comparison between STIS and Other Sizing Schemes
A 4-bit adder is used to drive a 1 cm wire in IC technology. Three sizing schemes, i.e., minimum transistor/interconnect width, transistor sizing only, and STIS, are compared (see Table 2). When compared with the minimum transistor and wire sizing solution, transistor sizing only reduces the maximum delay by 13.58%, while STIS reduces the maximum delay by 27.56%. Clearly, a more optimized solution is achieved by simultaneous transistor and wire sizing.
5. CONCLUSIONS
We formulated the STIS problem using a distributed RC tree model that accounts for the waveform slope effect of transistors, and developed an efficient STIS algorithm based on recursive local refinement computations. Preliminary experiments show that the STIS algorithm produces solutions more optimized than those obtained by optimal transistor sizing alone. We plan to integrate our STIS algorithm with a timing analysis tool like Crystal [16] and to test more circuits in the future.

ACKNOWLEDGMENTS
The authors would like to thank Professor C.-J. Richard Shi at the University of Iowa, and Kei-Yong Khoo and Cheng-Kok Koh at UCLA, for their helpful discussions and assistance.
4. EXPERIMENTAL RESULTS
We have implemented the STIS algorithm in ANSI C for the Sun SPARC workstation environment. Preliminary experiments are presented in this section. The delays reported are computed using HSPICE. The use of HSPICE simulation results, instead of calculated Elmore delay values, not only shows the quality of our STIS solutions, but also verifies the validity of our transistor/interconnect modeling and the correctness of our STIS problem formulation. The MCNC 0.5 µm CMOS process technology is used. The wire width choices are {W1, 2W1, 3W1, 4W1, 5W1}, where W1 is the minimum wire width (0.95 µm) in the technology.
REFERENCES
[1] M. Berkelaar, P. Buurman and J. Jess, "Computing the Entire Active Area/Power Consumption versus Delay Trade-off Curve for Gate Sizing with a Piecewise Linear Simulator", Proc. IEEE Int'l. Conf. on Computer-Aided Design, 1994, pp. 474-480.
[2] W. Chuang, S. S. Sapatnekar and I. N. Hajj, "Timing and Area Optimization for Standard-Cell VLSI Circuit Design", IEEE Trans. on Computer-Aided Design, March 1995, pp. 308-320.
[3] G. Chen, H. Onodera and K. Tamaru, "An Iterative Gate Sizing Approach with Accurate Delay Evaluation", Proc. IEEE Int'l. Conf. on Computer-Aided Design, 1995, pp. 422-427.
[4] J. Cong and L. He, "Optimal Wiresizing for Interconnects with Multiple Sources", Proc. IEEE Int'l. Conf. on Computer Design, Nov. 1995 (full version as UCLA Computer Science Dept. Tech. Report CSD-00031, 1995).
[5] J. Cong and C.-K. Koh, "Simultaneous Driver and Wire Sizing for Performance and Power Optimization", IEEE Trans. on VLSI, 2(4), December 1994, pp. 408-423.
[6] J. Cong, K. S. Leung, and D. Zhou, "Performance-Driven Interconnect Design Based on Distributed RC Delay Model", Proc. ACM/IEEE Design Automation Conf., 1993, pp. 606-611.
[7] J. Cong and K. S. Leung, "Optimal Wiresizing Under the Distributed Elmore Delay Model", IEEE Trans. on CAD, 14(3), March 1995, pp. 321-336 (extended abstract in Proc. IEEE Int'l. Conf. on Computer-Aided Design, 1993, pp. 634-639).
[8] J. G. Ecker, "Geometric Programming: Methods, Computations and Applications", SIAM Review, Vol. 22, No. 3, July 1980, pp. 338-362.
[9] J. P. Fishburn and A. E. Dunlop, "TILOS: A Posynomial Programming Approach to Transistor Sizing", Proc. IEEE Int'l. Conf. on Computer-Aided Design, 1985, pp. 326-328.
[10] N. Hedenstierna and K. O. Jeppson, "CMOS Circuit Speed and Buffer Optimization", IEEE Trans. on CAD, 1987, pp. 270-281.
[11] J. Lillis, C. K. Cheng and T. T. Y. Lin, "Optimal Wire Sizing and Buffer Insertion for Low Power and a Generalized Delay Model", Proc. IEEE Int'l. Conf. on Computer-Aided Design, Nov. 1995, pp. 138-143.
[12] S. S. Sapatnekar, "RC Interconnect Optimization Under the Elmore Delay Model", Proc. ACM/IEEE Design Automation Conf., 1994, pp. 387-391.
[13] S. S. Sapatnekar, V. B. Rao, P. M. Vaidya, and S. M. Kang, "An Exact Solution to the Transistor Sizing Problem for CMOS Circuits Using Convex Optimization", IEEE Trans. on CAD, November 1993, pp. 1621-1634.
[14] N. Menezes, S. Pullela, and L. T. Pileggi, "Simultaneous Gate and Interconnect Sizing for Circuit-Level Delay Optimization", Proc. ACM/IEEE Design Automation Conf., 1995, pp. 690-695.
[15] N. Menezes, R. Baldick, and L. T. Pileggi, "A Sequential Quadratic Programming Approach to Concurrent Gate and Wire Sizing", Proc. IEEE Int'l. Conf. on Computer-Aided Design, 1995, pp. 144-151.
[16] J. K. Ousterhout, "A Switch-Level Timing Verifier for Digital MOS VLSI", IEEE Trans. on CAD, 4(3), 1985, pp. 336-349.
[17] D. J. Pilling and J. G. Skalnik, "A Circuit Model for Predicting Transient Delays in LSI Logic Systems", Proc. 6th Asilomar Conf. on Circuits and Systems, 1972, pp. 424-428.
[18] V. B. Rao and I. Hajj, "Switch-Level Timing Simulation of MOS VLSI Circuits", Proc. IEEE ISCAS, 1985, pp. 229-232.
[19] J. Rubinstein, P. Penfield, and M. A. Horowitz, "Signal Delay in RC Tree Networks", IEEE Trans. on CAD, 2(3), 1983, pp. 202-211.

Adder    nFET   Average nLoc   Time (s)   Average Width   Critical Delay (ns)
                                                          min-width    opt-width
2-bit      66       6.82          0.22        2.030        0.6021      0.4597 (-23.7%)
4-bit     130       7.55          1.23        1.923        1.3511      1.0592 (-21.6%)
8-bit     258       7.90          4.40        1.868        2.8681      2.4440 (-14.8%)
16-bit    514       8.07         19.78        1.840        5.8878      4.6227 (-21.5%)

Table 1. Critical delay comparison between the minimum-width solution and the optimal-width solution.

                       Total FET Area   Total Wire Area   Critical Delay
min FET + min Wire           828             1000         2.3159 ns (0.00%)
opt FET + min Wire          1119             1000         2.0012 ns (-13.58%)
STIS                        1268             4493         1.6776 ns (-27.56%)

Table 2. Critical delay comparison among the minimal sizing scheme, transistor sizing only, and the STIS solution.
HIERARCHICAL CLOCK-NETWORK OPTIMIZATION
Daksh Lehther(1), Satyamurthy Pullela(2), David Blaauw(3), Shantanu Ganguly(4)
(1,4) Somerset Design Center, Motorola Inc., 9737 Great Hills Trail, Austin, TX 78759
(2,3) Motorola Inc., 6501 William Cannon Drive West, Austin, TX 78735
(1) [email protected], (4) [email protected], (2) [email protected], (3) [email protected]
ABSTRACT
Clock-distribution network design for high-performance microprocessors has become increasingly challenging in recent years due to lower skew tolerances, larger networks, and low-power requirements. The design of a typical clock-network entails a substantial amount of designer time and effort to meet these interrelated objectives. The work presented in this paper aims at automating the clock-network design process while achieving near-optimal design solutions. Skew optimization is performed by partitioning clock-networks and then hierarchically sizing wires, to reduce overall design time.

1. INTRODUCTION
The design of clock-networks for microprocessors has become a formidable task due to contemporary clock speeds, large die sizes and the complexity of clocked logic. Optimality of clock-network design is crucial to meeting the performance criteria of high-speed designs. In addition, considerations of low power and stability of power supplies dictate that the total network capacitance be bounded. In addition to meeting performance criteria, clock-routing and optimization tools have to account for complex placement configurations. In view of these facts it is becoming increasingly important to develop tools that can optimize general clock trees as well as meshes. This paper reports on the design methodology being developed at the Somerset Design Center for the automated design of clock-networks for the PowerPC(TM) family of microprocessors. A wire-width based skew-optimization tool, CLOSYT (CLOck-network SYnThesis), forms the core of the clock-network synthesis. This tool aims at providing designers with the ability to synthesize near-optimal clock-networks in a reasonable amount of time. Desired slope and delay targets can be specified as parameters to the optimizer. The tool sizes the wires in the clock-net to achieve these delay/slope requirements while meeting specific constraints on the wire widths. An AWE-based second-order delay model is used to evaluate delays and slopes at the clock-pins. Wire sizing is based on a sensitivity-based technique. General network topologies such as clock-meshes and non-binary trees are handled by this tool. The following sections describe the use of this wire-width optimization tool in a hierarchical fashion to achieve quicker turnaround time for large networks. The efficacy of the approach is seen in terms of low run times on realistic examples.

2. BACKGROUND
The problem of designing clock-distribution networks entails the design of the routing from a clock-source to all clocked elements in a manner that minimizes skew between the various clocked elements. This problem has been addressed in the past with a wide range of techniques [1, 2, 3]. The focus of recent approaches has been to reduce skew by balancing the Elmore delay from the source to the various clock-targets. Notable among these is Tsay's algorithm [4]. This approach aims at achieving "zero skew" by recursively constructing balanced binary clock-trees. The delay is balanced by selecting a tapping point on the route between two Elmore-delay-balanced sub-trees. The design of clock-trees by the selection of tapping points in this manner may not always be possible in practice, due to constraints on available routing area. An alternative approach, presented in [5], is based on ensuring Elmore-delay balance by varying wire widths. This approach has the advantage that the widths of the wires of a routed clock-network can be optimized for skew, and hence requires minimal changes to the existing floor-plan and placement. In order to consider higher-order delay models, the current tool uses the techniques described in [11] to generate the target waveform at the clock-nodes.
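For reference, the Elmore delays that these zero-skew and wire-sizing formulations balance can be computed for an RC tree with one bottom-up and one top-down pass. The sketch below is illustrative only (it is not part of CLOSYT, which uses AWE/second-order models) and uses a simple dictionary-based tree.

```python
def elmore_delays(children, res, cap, root):
    """Elmore delay from `root` to every node of an RC tree.
    children[n] -> list of child nodes, res[n] -> resistance of the edge into n,
    cap[n] -> capacitance at n. The root's edge resistance is taken as 0."""
    # Bottom-up: total capacitance in the subtree rooted at each node.
    subtree_cap = {}
    def total_cap(n):
        c = cap[n] + sum(total_cap(k) for k in children.get(n, []))
        subtree_cap[n] = c
        return c
    total_cap(root)
    # Top-down: accumulate edge resistance * downstream capacitance along each path.
    delay = {root: 0.0}
    stack = [root]
    while stack:
        n = stack.pop()
        for k in children.get(n, []):
            delay[k] = delay[n] + res[k] * subtree_cap[k]
            stack.append(k)
    return delay

# Tiny example: source s drives two sinks a and b through node n1.
children = {"s": ["n1"], "n1": ["a", "b"]}
res = {"s": 0.0, "n1": 1.0, "a": 2.0, "b": 3.0}
cap = {"s": 0.0, "n1": 0.1, "a": 0.05, "b": 0.05}
print(elmore_delays(children, res, cap, "s"))   # skew = difference of the sink delays
```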
3. DESIGN METHODOLOGY
The clock signal is distributed to latches through several layers of buffers or regenerators. Typical design methodologies use tree or mesh type structures to distribute the signal from the central signal source to these regenerators. This network forms the primary clock-network and is the main cause of signal skew. The regenerators then distribute the signal through secondary networks to the cells or latches that are geometrically closest to each regenerator. Although skew is affected both by the assignment of cells to a regenerator and by the relative locations of the cells, these auxiliary nets are relatively small, so their impact on the overall skew is negligible.
For state-of-the-art microprocessors, given that the size of the primary clock-network is excessively large, it is efficient for design purposes to further consider it as comprising two levels (with no intermediate buffers): a central network, and a number of auxiliary networks that are fed by the central network. The auxiliary clock-networks consist of the routing from specific branches of the central network to the input pins of the clock-regenerators. The design methodology presented here focuses on the physical design and skew-optimization of the primary clock-network, which includes both the central and auxiliary networks. The following steps outline the process that aims at achieving an initial clock-network for subsequent wire-width optimization.

4. INITIAL TOPOLOGY GENERATION
The approach described here is primarily used as a post-routing optimization. Therefore it is essential to generate the initial topology of both the central and auxiliary networks. Once the topologies are known, both the central and auxiliary networks are considered for skew minimization. The following steps outline the topology generation of an initial clock-network.

4.1. Central Network
The initial design of the central network is currently performed manually. The intent here is to give designers the freedom to lay out wires optimally with respect to the existing floor-plan and cell placement. As we do not expect the central structure to be very complicated, this manual step requires very little effort. In addition, it helps to fully utilize the designers' knowledge of the current design and of potential changes that the design may undergo. As the tool is capable of handling meshes and non-binary structures, designers have the freedom to choose from a wide variety of design practices.

4.2. Auxiliary Networks
The initial design of the auxiliary networks consists of several steps, each of which is described below.

4.2.1. Clustering and Load Balancing
It is desirable to ensure that the auxiliary networks that connect to the leaf branches of the central network do not cause the initial network to be excessively unbalanced. This is important because the final wire-width optimization solution has to meet width constraints and a limit on the network capacitance, and hence excessive metal cannot be added during the optimization phase. In order to achieve a reasonably balanced load assignment, clock-regenerators are grouped into what will be referred to here as "clusters". The assignment of a regenerator to a cluster depends on the physical location of the regenerator and the estimated delay from the cluster source to the regenerator. The locations of the clock-regenerators are known from the placement of the cells. The delay from a particular branch of the central network to a clock-regenerator depends on the input capacitance of the regenerator and the Manhattan distance between the closest central-network branch and the regenerator location. The distance to the regenerators is approximated by the Manhattan distance. The delay to every target clock-regenerator is then estimated using RICE [6]. A clock-regenerator is then assigned to the branch of the central network that has the shortest delay. Detailed routing of these clusters is performed later, based on this assignment. These regenerators are now deemed to belong to their respective auxiliary networks, which will henceforth be referred to as "cluster-networks". This entire procedure is automated.

4.2.2. Subclustering
The regenerators belonging to the various clusters are divided into smaller groups termed "subclusters". This subclustering is performed such that the estimated skew per subcluster is limited to a specific value. Each subcluster is then routed as an individual net, rooted at its assigned branch. The process of subclustering aims at minimizing the skew due to daisy-chaining of clock-pins, which occurs quite frequently with conventional routers that tend to reduce the overall path length. Hence subclustering results in multiple branches from the central network feeding groups of clock-regenerators within a cluster.

4.2.3. Routing of Cluster Networks
The routing from every branch of the central network to its corresponding targets is performed by a maze routing tool. These wires are routed with the maximum permissible widths, to ensure that the router allocates sufficient blockage area for the wires. The wire-width optimization tool is permitted to size the wires with this value as the maximum width constraint. This ensures that the width optimization will not violate any design rules. Figure 1 depicts a typical primary clock-network after the clusters have been routed.

[Figure 1. A Typical Primary Clock-Network]
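A minimal sketch of the regenerator-to-branch assignment of Section 4.2.1 is given below. It is illustrative only: the real flow estimates delays with RICE [6], whereas this sketch scores each branch with a crude Elmore-style estimate built from the Manhattan distance and the regenerator's input capacitance, and all names and numbers are hypothetical.

```python
def manhattan(p, q):
    return abs(p[0] - q[0]) + abs(p[1] - q[1])

def assign_regenerators(branches, regens, r_per_um, c_per_um):
    """branches: {name: ((x, y) of the branch tip, delay to the tip)}
       regens:   {name: ((x, y), input capacitance)}
       Returns {regenerator: branch} minimizing the estimated source-to-pin delay."""
    assignment = {}
    for rname, (rpos, cin) in regens.items():
        best, best_delay = None, None
        for bname, (bpos, t_branch) in branches.items():
            d = manhattan(rpos, bpos)
            wire_r, wire_c = r_per_um * d, c_per_um * d
            # Crude Elmore-style estimate: branch delay + stub delay into cin.
            est = t_branch + wire_r * (wire_c / 2.0 + cin)
            if best_delay is None or est < best_delay:
                best, best_delay = bname, est
        assignment[rname] = best
    return assignment

branches = {"north": ((0.0, 500.0), 0.10), "south": ((0.0, -500.0), 0.12)}
regens = {"fpu": ((50.0, 420.0), 0.05), "biu": ((30.0, -600.0), 0.08)}
print(assign_regenerators(branches, regens, r_per_um=0.05, c_per_um=0.0002))
```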
5. CLOSYT OPTIMIZATION

5.1. Optimization Parameters
The initial clock-network resulting from the above-mentioned steps is described to CLOSYT in terms of a set of wires with their lengths, initial widths and connectivity information. CLOSYT optimizes the widths of the wires for skew subject to the following specifications:
* the slew-rate/transition-time at the output of the clock-driver;
* the delay and slope requirements at the clock-pins;
* the maximum and minimum width constraints on wire segments;
* upper and lower bounds on the phase delay of the clock-network;
* the maximum capacitance allowable for the clock-network;
* the maximum acceptable skew.

5.2. Skew Optimization Approach
The Levenberg-Marquardt algorithm [8, 10, 12] is utilized to minimize the mean-square error between the desired and actual delays and slopes. Given a set of functions yi = fi(x, β), the Levenberg-Marquardt algorithm addresses the problem of finding the vector β such that the mean-square error between a specified set of values yi and the functions yi evaluated at β is minimized. The solution is obtained by iteratively finding a vector of increments Δβ, incrementing the parameters by Δβ, and recomputing the Jacobian, repeatedly, until satisfactory convergence is obtained. More specifically, at each iteration we solve the equation

    (A + λI)·Δβ = Sᵀ·(y - y_c)        (1)

where A = SᵀS and S is the n x v Jacobian matrix (i.e., S_ij = ∂y_i/∂β_j), y_c = f(β) is the current solution, and y is the target solution. λ in (1) is the Lagrangian multiplier, determined dynamically to achieve rapid convergence to the final solution. This method combines the convergence properties of steepest-descent methods during the initial stages with the convergence properties of methods based on Taylor-series truncation. The method of setting λ dynamically is shown in the flow chart of Figure 1; λ(i) is the value of λ at the i-th iteration.

For clock-network synthesis, the set of wire widths forms the vector of parameters (Δβ in (1)) to be determined. The functions fi correspond to the evaluation of slope and delay as functions of the widths of the wires in the network.
As one can observe, the Levenberg-Marquardt method is an unconstrained optimization approach, since the parameters are allowed to take any value. In order to account for the maximum and minimum width constraints, we cast clock-network synthesis as a constrained optimization problem: determine the vector of widths W such that the sum-of-squares error between the 2N target delays and slopes and the current delays and slopes at the clock-pins is minimized. The wire widths are incremented by ΔW, obtained by solving (1), after each iteration. If, after incrementing, the width of any wire falls outside the permissible range, the width is set to the corresponding limit:

    w_i(k) = w_i,min    if  w_i(k-1) + Δw_i < w_i,min
    w_i(k) = w_i,max    if  w_i(k-1) + Δw_i > w_i,max

where w_i(k) is the width of wire i at the k-th iteration. The sensitivity of the delay and slope at all targets with respect to every wire in the network is computed by the adjoint-sensitivity technique [7]. The wires with the highest sensitivity are selected for width changes.
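A compact sketch of one such width update, assuming a numpy implementation, is shown below. It is a simplification: the real tool obtains S from adjoint sensitivities, restricts the update to the most sensitive wires, and adjusts λ dynamically, none of which is modeled here.

```python
import numpy as np

def lm_width_update(w, S, y_target, y_current, lam, w_min, w_max):
    """One iteration of (A + lam*I) dw = S^T (y_target - y_current), then clamp.
    w: current wire widths (v,), S: Jacobian dy/dw (n x v)."""
    A = S.T @ S
    rhs = S.T @ (y_target - y_current)
    dw = np.linalg.solve(A + lam * np.eye(A.shape[0]), rhs)
    return np.clip(w + dw, w_min, w_max)

# Toy usage with a made-up 3-target, 2-wire sensitivity matrix.
S = np.array([[0.8, 0.1], [0.2, 0.9], [0.5, 0.5]])
w = np.array([1.0, 1.0])
y_t = np.array([0.10, 0.10, 0.10])
y_c = np.array([0.14, 0.09, 0.12])
print(lm_width_update(w, S, y_t, y_c, lam=0.5, w_min=0.6, w_max=4.0))
```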
5.3. Sensitivity Computation
From (1) it is clear that we need the sensitivities of the delays and slopes with respect to the widths of the wires in the network. We obtain these sensitivities in three steps:
1). Form the equivalent circuit of the network using a π-model for each segment of the route. We then compute the moment sensitivities of the delay and slope at a target node (a clock-pin or a central-network branch tip) with respect to the resistances and capacitances in the circuit.
2). These sensitivities are then translated into delay/slope sensitivities [11].
3). Sensitivities with respect to wire widths are then derived from the resistive and capacitive sensitivities as described in [9]. The resulting sensitivity matrix is used to iteratively solve (1).
As solving the matrix in (1) is an O(n³) operation (where n is the number of wires), this solution is computationally intensive and therefore time consuming. We use the heuristics proposed in [10] to identify the most sensitive wires and then discard the insensitive wires. This results in a dramatic reduction in the size of the matrix, and therefore quick convergence is achieved, as evidenced by our results.

6. HIERARCHICAL OPTIMIZATION
Due to the fact that the sensitivity computation has O(n³) complexity, it is desirable to optimize large clock-networks by partitioning the problem into two or more hierarchical levels. This optimization is performed bottom-up and improves the total run-time of the tool substantially, without significant loss in accuracy. For example, if a clock-network is divided into a central network with p wires and m wires at the auxiliary-network level partitioned into k clusters, the time complexity is proportional to (p³ + m³/k²), in comparison to (p + m)³ for the entire network. The following steps outline the hierarchical optimization:
1). As described in Section 4, the networks corresponding to each cluster form the first level of the hierarchy. The clock-regenerators at the leaves of the cluster-networks are modeled as capacitances. All cluster-networks are optimized individually for skew and for a specific delay/slope target. The widths of the wires at this level are constrained to be between the initial routed width and a minimum width.
2). The second stage of optimization involves sizing the wires of the central network to minimize the global skew. Each cluster is replaced by its equivalent driving-point model, and the average delay from the root of the cluster to all its targets is estimated. The wires of the central network are then sized taking into account the average delays of the clusters. As we assume that the clusters are small, we expect that the signal slope integrity is preserved through the clusters; therefore the user-specified slope target can be fed directly as the slope target for the central network. The delay targets for the central network are generated as follows. Considering a central network feeding n cluster-networks, each with average delay ti, if the delay up to the root of the cluster-networks is required to be Td, then the central network is optimized by setting y in (1) as

    yi = Td - ti,    where y = (y1, y2, ..., yn).

[Figure 2. Optimization with Cluster Estimates: clusters represented by their estimated delays and capacitances (t1, C1) through (t5, C5).]
7. EXPERIMENTAL RESULTS
The design methodology and the wire-width optimization were tested on a clock-network design problem for a PowerPC(TM) microprocessor. The optimization was performed both on the complete network and hierarchically. The estimated capacitance, average delay and skew for the individual clusters of this network are summarized in Table 1. The capacitance and delay estimates were used to optimize the wire widths of the central network. A global skew of less than 50 ps was achieved with the given wire-width constraints. The total run time for the entire design process described above, when performed in a hierarchical fashion, was a little more than 3 hours on an IBM RISC System 6000/Model 560. The run time for the width optimization was less than 5 minutes on average for the cluster networks and approximately 15 minutes for the optimization of the central network with the estimated capacitances and delays. In comparison, the time taken to optimize the wires for the clock-network as a whole was approximately 7 hours.
Cluster Name   Number of Clock-regenerators   Net Cap (pF)   Delay (nS)   Skew (pS)
fxu-sw                  35                        6.062         0.1085       19.509
fpu-se                  18                        3.588         0.1068       17.588
fpu-sw                   6                        1.273         0.1014        2.3210
biu-lmw                 20                        3.381         0.1028        7.101
biu-umw                 16                        1.501         0.1027        5.796
biu-sw                  16                        2.090         0.1019        4.430
fpu-nw                   1                        0.230         0.1003        0.000
fpu-ne                   9                        1.117         0.1014        4.407
fxu-nw                  12                        1.835         0.1069        9.794

Table 1: Summary of Clusters

8. CONCLUSION
A clock-network design flow has been developed to facilitate the design of large clock-networks. An overview of this wire-width-optimization-based tool was presented. The efficacy of the approach has been demonstrated by testing it on a typical clock-network design scenario. The hierarchical optimization was found to reduce run times significantly for large networks.

9. FUTURE WORK
To ensure low skew values in reality, it is desirable to take process variations into account as part of the clock-tree design. From the reliability perspective, it is necessary to be able to take certain electromigration constraints into account while sizing the wires for skew. Future enhancements to the tool will aim at including these criteria as part of the optimization.

REFERENCES
[1] H. B. Bakoglu, J. T. Walker, and J. D. Meindl, "A Symmetric Clock-Distribution Tree and Optimized High-Speed Interconnection for Reduced Clock Skew in ULSI and WSI Circuits," Proc. IEEE International Conference on Computer Design, pp. 118-122, 1986.
[2] J. Cong, A. B. Kahng, and G. Robins, "Matching-Based Methods for High-Performance Clock Routing," IEEE Trans. on Computer-Aided Design of Circuits and Systems, vol. 12, no. 8, pp. 1157-1169, Aug. 1993.
[3] M. A. B. Jackson, A. Srinivasan and E. S. Kuh, "Clock Routing for High Performance ICs," Proc. 27th ACM/IEEE Design Automation Conference, pp. 573-579, 1990.
[4] R. S. Tsay, "Exact Zero Skew," Proc. IEEE International Conference on Computer-Aided Design, pp. 336-339, 1991.
[5] S. Pullela, N. Menezes, and L. T. Pillage, "Reliable Non-Zero Skew Clock Trees Using Wire Width Optimization," Proc. 30th ACM/IEEE Design Automation Conference, pp. 165-170, 1993.
[6] C. L. Ratzlaff, N. Gopal, and L. T. Pillage, "RICE: Rapid Interconnect Circuit Evaluator," Proc. 28th ACM/IEEE Design Automation Conference, 1991.
[7] S. W. Director and R. A. Rohrer, "The Generalized Adjoint Sensitivities," IEEE Transactions on Circuit Theory, vol. CT-16, no. 3, Aug. 1969.
[8] Q. Zhu, W.-M. Dai, and J. G. Xi, "Optimal Sizing of High-Speed Clock Networks Based on Distributed RC and Lossy Transmission Line Models," Proc. International Conference on Computer-Aided Design, pp. 628-633, 1993.
[9] N. Menezes, S. Pullela, and L. T. Pillage, "RC Interconnect Synthesis - A Moments Approach," Proc. International Conference on Computer-Aided Design, Nov. 1994.
[10] S. Pullela, N. Menezes, and L. T. Pileggi, "Skew Minimization via Wire-Width Optimization," submitted to IEEE Transactions on CAD.
[11] N. Menezes, R. Baldick, and L. T. Pileggi, "A Sequential Quadratic Programming Approach to Concurrent Gate and Wire Sizing," Proc. International Conference on Computer-Aided Design, 1995.
[12] D. W. Marquardt, "An Algorithm for Least-Squares Estimation of Nonlinear Parameters," J. Soc. Indust. Appl. Math., vol. 11, no. 2, pp. 431-441, June 1963.
Making MEMS Kaigham J. Gabriel Deputy Director Electronics Technology Office Defense Advanced Research Projects Agency 3701 N. Fairfax Drive Arlington, VA 22203-1714 [email protected] http://www.darpa.mil
Introduction As information systems increasingly leave fixed locations and appear in vehicles and in our pockets and palms, they are getting closer to the physical world, creating new opportunities for perceiving and controlling our environment. To exploit these opportunities, information systems will need to sense and act as well as compute. Filling this need is the driving force for the development of microelectromechanical systems (MEMS). Using the fabrication processes and materials of microelectronics as a basis, MEMS processes construct both mechanical and electrical components. Mechanical components in MEMS, like transistors in microelectronics, have dimensions that are measured in microns and numbers measured from a few to millions. MEMS is not about any one single application or device, nor is it defined by a single fabrication process or limited to a few materials. More than anything else, MEMS is a fabrication approach that conveys the advantages of miniaturization, multiple components and microelectronics to the design and construction of integrated electromechanical systems. MEMS devices are and will be used widely, with applications ranging from automobiles and fighter aircraft to printers and munitions. While MEMS devices will be a relatively small fraction of the cost, size and weight of these systems, MEMS will be critical to their operation, reliability and affordability. MEMS devices, and the smart products they enable, will increasingly be the performance differentiator for both defense and commercial systems.
MEMS Market and Industry Structure

MEMS Market
Forecasts for MEMS products throughout the world show rapid growth for the foreseeable future. Early market studies projected an eight-fold growth in the nearly $1 billion 1994 MEMS market by the turn of the century. More recent estimates are forecasting growth of nearly twelve to fourteen times today's market, reaching $12-14 billion by the year 2000 (Figure 1). While sensors (primarily pressure and acceleration) are the principal MEMS products today, no one product or application area is set to dominate the MEMS industry for the foreseeable future, with the MEMS market growing both in the currently dominant sensor sector and in the actuator-enabled sectors. Furthermore, because MEMS products will be embedded in larger, non-MEMS systems (e.g., automobiles, printers, displays, instruments, and controllers), they will enable new and improved systems with a projected market worth approaching $100 billion in the year 2000.
[FIGURE 1: bar chart, "Projected Growth of Worldwide MEMS Market," market size ($B) by year, 1993-2000, with an inset of market segments in 2000.]

FIGURE 1. Projected worldwide MEMS market. Note the inset pie chart, which shows that the non-sensor market segments in fluid regulation and control, optical systems and mass data storage are projected to be about half of the total market by the year 2000.
Present MEMS markets and demand are overwhelmingly in the commercial sector, with the automobile industry being the major driver for most micromachined sensors (pressure, acceleration and oxygen). In 1994 model year cars manufactured in the US, there were on average 14 sensors per car, one-fourth of which were micromachined sensors, with the number increasing at a rate of 20% per year. As one example, a manifold pressure sensor is currently installed in vehicles by all three major US automakers. This amounts to more than 20 million micromachined manifold pressure sensors being manufactured per year.
[FIGURE 2: bar chart of the worldwide annual pressure and acceleration sensor markets, 1993-1995 (projected), with US, Asia and Europe production/revenue shares shown above the bars.]

FIGURE 2.
Worldwide annual pressure and acceleration sensor markets with associated (on top) regional production and revenue percentages for the combined sensor markets.

More recently, the market for accelerometers used in airbag deployment systems has also grown. Nearly 5 million micromachined accelerometers for airbag systems were manufactured and installed in 1994 vehicles. Increasingly, biomedical sensors, particularly disposable blood pressure and blood chemistry sensors, are fast approaching the automobile industry in both sensor unit numbers and market size. Over 17 million micromachined pressure
sensors, with a market value of nearly $200 million, were manufactured, used and disposed of in 1994. While the MEMS sensors market will continue to grow, particularly sensors with integrated signal processing, self-calibration and self-test (pressure sensors, accelerometers, gyroscopes, and chemical sensors), a substantial portion of the growth in the next few years (and of the MEMS market by the year 2000) will be in non-sensing, actuator-enabled applications. These applications include microoptomechanical systems, principally in displays, scanners and fiber-optic switches; integrated fluidic systems, primarily in fuel-injections systems, ink-jet printheads, and flow regulators; and mass data storage devices for both magnetic and non-magnetic recording techniques. Two non-sensor markets alone, printing and telecommunications, are projected to match the present sensor market size by the year 2000. MEMS Industry Structure Those companies which have so far been directly involved in producing MEMS devices and systems are manufacturers of sensors, industrial and residential control systems, electronic components, computer peripherals, automotive and aerospace electronics, analytical instruments, biomedical products, and office equipment. Examples of companies manufacturing MEMS products worldwide include Honeywell, Motorola, Hewlett-Packard, Analog Devices, Siemens, Hitachi, Vaisala, Texas Instruments, Lucas NovaSensor, EG&G-IC Sensors, Nippon Denso, Xerox, Delco, and Rockwell. Of the roughly 80 US firms currently identified as being involved in MEMS (Figure 2), more than 60 are small businesses with less than ten million dollars in annual sales. The remaining 20 firms are large corporations distributed across different industry sectors with varying degrees of research activities and products in MEMS (the front cover of the 1993 annual shareholders' report for Hewlett-Packard featured a MEMS flow-valve developed for use in their analytical instruments division). Of the nearly $300 million worldwide market in pressure sensors, US manufacturers account for nearly 45% of production and revenue. In the growing accelerometer market, the US position is very similar. Of the nearly 5 million accelerometers made in 1994, US manufacturers accounted for nearly 50% of the market. Because of the combination of an advanced technology base and a strong manufacturing capability in these two key sensor areas, US manufacturers are poised to expand their MEMS market share and are already beginning to penetrate both the European and Japanese automotive sensors market. Accounting for slightly more than half of the worldwide MEMS manufactured products and revenue, the US MEMS industry is a major player in all key segments of the world MEMS market. One notable distinction in industry structure is that few small businesses in Europe or Japan are involved in MEMS. In the US, nearly 60 of the 80 identified firms with MEMS activities are small businesses, each typically generating on average less than five million dollars in annual revenues. Most of
these businesses do not have or need their own dedicated fabrication resources. New approaches to the development of manufacturing resources can both exploit this distinctive structure for DoD-specific needs and accelerate the innovation and commercialization of MEMS products. Given the varied applications of MEMS devices and the most likely evolution of their associated fabrication processes, the development of support and access technologies will be even more important and challenging in MEMS manufacturing than in microelectronics manufacturing. Unlike microelectronics, where a few types of fabrication processes satisfy most microelectronics manufacturing requirements, MEMS, given their intimate and varied interaction with the physical world, will have a greater variety of device designs and a greater variety of associated manufacturing resources. For example, the thin-film structures created using surface micromachining techniques, while well-suited for the relatively small forces encountered in inertial measurement devices, are not adequate for MEMS fluid valves and regulators. Similarly, the thicker structures created using a combination of wafer etching and bonding while well-suited to the higher forces and motions in fluid valves and regulators consume too much power to be used for the fabrication of microoptomechanical aligners and displays. There is not likely to be a MEMS equivalent of a CMOS process like that in microelectronics that will satisfy the majority of MEMS device fabrication needs. These different MEMS fabrication processes will often be developed by larger firms with a particular and large commercial market as the target. Typically the firm developing the manufacturing resources needs to be focused on the production of products for those one or two driving applications. But in most cases, once the manufacturing resource is developed, numerous (hundreds) of products for smaller (<$10 million per year) markets could be addressed with the same manufacturing resources. No single one of these smaller market would have justified the development of the fabrication process. For the firms that have developed the manufacturing resource, addressing small and fragmented markets is not presently economically justifiable given the market diversity and the current state of electronic design aids. Most of these specialized markets will only be attractive and economically justifiable to smaller businesses who, however, do not have (nor would they want to duplicate) the manufacturing resources. By gaining access to manufacturing resources through a domestic MEMS infrastructure, businesses would be in a better position to field competitive MEMS products and also be able to use existing resources at higher capacities, speeding return on investments for those companies with the resources. Furthermore, since most MEMS defense applications and products are some of these smaller and fragmented markets, access to MEMS manufacturing resources would also support rapid and affordable fielding of MEMS defense products.
One step towards acquiring such a MEMS manufacturing infrastructure is to make product-neutral investments in the development of support and access technologies that include:
* electronic design aids for the free-form MEMS device designs and coupling of simulation tools for the variety of physical effects and properties encountered in MEMS applications;
* better understanding and control of processes to assure repeatable and predictable mechanical and other non-electrical properties of materials;
* manufacturing equipment optimized for MEMS requirements (thicker film deposition, deeper etching, handling and packaging techniques which selectively contact and seal portions of the MEMS device); and
* measurement tools and techniques that characterize electrical and other performance parameters (e.g., motion, fluid flow) for operational testing and device qualification.
Since most DoD applications will be early drivers of advanced MEMS devices or will require the adaptation and qualification of commercial devices, DoD investment in manufacturing resources serves not only its own needs for rapid, flexible and affordable access to MEMS technology but does so in a way that complements and enhances the US industrial capability in MEMS.
The DoD Investment Strategy for MEMS
A strong US MEMS technology and manufacturing base is essential to assure early, affordable, and responsive access to MEMS technology for DoD needs. Relatively small investments in MEMS will leverage the vast and historic national investments made in capital equipment, materials, processes and expertise for the microelectronics industry to create a superior, national MEMS capability. While ongoing industry investments in MEMS will continue to grow, the bulk of these investments are by individual companies focused on gaining incremental improvements in performance and manufacturing costs for their one or two major products. Because DoD will be the early customer for advanced and integrated MEMS devices and systems (ranging from inertial navigation on a chip to advanced maneuverability aircraft), DoD investments will focus on the development of advanced MEMS materials, devices, systems and manufacturing resources and will target the development of supporting capabilities that enable rapid and flexible access to those resources. The DoD MEMS research and development strategy is to:
invest in advanced MEMS device and systems developments leading towards MEMS with higher levels of functional capability, higher levels of integrated electronics, and larger numbers of mechanical
components. Activities in this area will accelerate the development of actuator-enabled applications and the shift from discrete MEMS component manufacturing to the manufacturing of integrated MEMS devices. Focused thrusts include the development of new materials, devices, systems, fabrication processes, and interfacing/packaging techniques. Example target devices and applications include navigation-grade inertial guidance systems on a chip, complete hand-held analytical instruments, and distributed aerodynamic control of aircraft;
invest in the development of a MEMS infrastructure by developing support and access technologies including electronic design aids and data bases, shared fabrication services, and test/evaluation capabilities. Infrastructure activities will increase and broaden the pool of MEMS designers, enable rapid, timely and affordable access to MEMS technologies for evolving DoD needs, and create a national mechanism for cost-effective MEMS prototyping and low-volume production. An ongoing project supported by the Defense Advanced Research Projects Agency (DARPA) offering regular, shared access to a single, common MEMS fabrication process has already been used by over three hundred users at service/federal laboratories, domestic companies, and universities. More than half the users (and all the small businesses) are getting their first and only access to MEMS technology through the shared fabrication service.
invest in activities to accelerate insertion of presently available or near-term commercial MEMS products into military systems and operations. Examples include munitions safing & arming and condition-based maintenance. Investments in this area are focussed on improved, affordable manufacturing resources and methods of assessing and qualifying device performance and reliability for DoD applications. Activities in this area encourage and are aligned with industry-formed teams that speed the introduction and use of MEMS fabrication processes and products;
coordinate and complement federal programs within DoD and at other agencies by establishing a DoD and interagency MEMS specialists group, chaired by a representative of DARPA. Examples of ongoing activities in this area include coordinated projects in fluid dynamics and integrated MEMS fluidic devices (AFOSR and DARPA), materials standards and databases (NIST and DARPA), and a project to broaden education and training programs in MEMS, increase the number of qualified MEMS instructors, and couple them to shared fabrication services (NSF and DARPA). DoD funding of MEMS research and development projects over the last three years, primarily at DARPA, has created a strong US MEMS science and technology base which has demonstrated multiple and varied DoD applications, made commercially-based MEMS manufacturing resources and devices accessible for DoD systems, and sparked rapid growth in domestic MEMS commercialization activities. Continued DoD MEMS R&D funding, growing to and sustained at a projected $75 million per year in fiscal 1998, builds on existing accomplishments and capabilities to produce future MEMS devices and processes with
the higher functionalities and flexibilities required to meet present and future DoD needs.
                        1995   1996   1997   1998   1999
OSD (DARPA)               22     31     38     50     50
Service Labs & XORs        3     15     21     25     25
Total                     25     46     59     75     75

TABLE 1. Past and projected DoD funding profile in MEMS R&D ($M).
Starting in fiscal year 1996, service investments of approximately $3 million for each service, growing to roughly $7 million for each service by 1997, will harvest device and systems developments to qualify and adapt MEMS technology for service needs, positioning MEMS devices for procurement and insertion into weapons systems starting in 1998-1999. DoD will also be the early beneficiary and user of an accessible MEMS infrastructure that opens maturing manufacturing resources for high-volume products and makes them available for the production of related, but low-volume, DoD products. Continuing cross-company, product-neutral investments in manufacturing resources will increase the level of integrated mechanical components and electronics, and expand the range and types of MEMS-specific electronic design aids, manufacturing equipment, generic packaging & interfacing techniques and characterization tools. Finally, focused DoD investments for the assessment, qualification and adaptation of commercially available MEMS technology are accelerating the incorporation of existing and near-term MEMS device capabilities into existing and planned weapons systems. With a strong MEMS technology base, a growing MEMS manufacturing capability, and coordinated Federal and industry investments, the US can cost-effectively leverage its semiconductor industry leadership into industry leadership in MEMS.
References
Microelectromechanical Systems: A DoD Dual Use Technology Industrial Assessment Final Report. December 1995. (50 references) Gabriel, Kaigham J., "Engineering Microscopic Machines" Scientific American, September 1995.
PHYSICAL DESIGN FOR SURFACE-MICROMACHINED MEMS
Gary K. Fedder†* and Tamal Mukherjee†
†Department of Electrical and Computer Engineering and *The Robotics Institute
Carnegie Mellon University, Pittsburgh, PA 15213-3890
E-mail: [email protected], [email protected]
ABSTRACT
We are developing physical design tools for surface-micromachined MEMS, such as polysilicon microstructures built using MCNC's Multi-User MEMS Process service. Our initial efforts include automation of layout synthesis and behavioral simulation from a MEMS schematic representation. As an example, layout synthesis of a folded-flexure electrostatic comb-drive microresonator is demonstrated. Lumped-parameter electromechanical models with two mechanical degrees-of-freedom link the physical and behavioral parameters of the microresonator. Simulated annealing is used to generate globally optimized layouts of five different resonators from 3 kHz to 300 kHz, starting with mixed-domain behavioral specifications and constraints. Development of the synthesis tool enforces codification of all relevant MEMS design variables and constraints. The synthesis approach allows a rapid exploration of MEMS design issues.

1. INTRODUCTION
Microelectromechanical systems (MEMS) are commonly defined as sensor and actuator systems that are made using integrated-circuit fabrication processes. The main advantage of MEMS-based systems when compared with conventional electromechanical systems is the miniaturization and integration of multiple sensors, actuators, and electronics at a low cost. These attributes are obtained by leveraging standard integrated-circuit batch-fabrication processes. In the past, MEMS technology has primarily been driven by the development of new processes to meet specific needs. However, over the past decade there has been a shift in emphasis from process design to device design. There is a growing demand for CAD tools to support rapid design of complex MEMS involving physical interactions between mechanical, electrostatic, magnetic, thermal, fluidic, and optical domains. Several groups are addressing the deficiency in MEMS design tools, including MEMCAD [1] (Microcosm/M.I.T.), IntelliCAD [2] (IntelliSense Corp.), and CAEMEMS [3] (Univ. of Michigan). These tools focus on analysis using 3D modeling from layout and process integrated with self-consistent electromechanical numerical simulation. In a complementary approach to existing analysis tools, we are developing circuit-level design tools and layout synthesis tools tailored to general surface-micromachined systems. The planar '2 1/2-D' topology of surface-micromachined MEMS lends itself to abstraction at a higher level than numerical simulation. As in circuit design, a schematic representation of MEMS provides a critical link between layout and behavioral simulation that enables high-level design automation. Many existing surface-micromechanical designs can be partitioned into discrete components, such as beam springs, plate masses, and electrostatic actuators, that are modeled as lumped-parameter elements. Conversely, new MEMS devices can be created by connecting together these lumped elements.
In this paper, we will describe initial work involving the rapid layout synthesis of a surface-micromachined resonator from high-level design specifications and constraints. Our approach to layout synthesis involves modeling the design problem as a formal numerical synthesis problem, and then solving it with powerful optimization techniques. This synthesis philosophy has been successful in a variety of fields such as analog circuit synthesis [4] and chemical plant synthesis [5]. The process of modeling the design problem involves determining the design variables, the numerical design constraints and the quantitative design objective. The microresonator architecture used in this study has been thoroughly analyzed in the literature. It represents a good starting point for synthesis work since proper operation can
be easily verified using existing analysis tools and experimental measurements. We begin with an overview of surface micromachining and the microresonator architecture, followed by a description of design variables and lumped-parameter models. Next, the design constraints are specified along with the synthesis algorithm and a discussion of the layout synthesis results.

2. MICRORESONATOR DESCRIPTION

2.1. Surface Micromachining

One set of events that has fueled the trend toward device design in fixed processes is the establishment of MEMS process services by MCNC [6], Analog Devices [7], CMP [8], and MOSIS [9] (currently, post-foundry processing is required for MOSIS designs). These services produce micromechanical structures made out of thin films on the surface of the substrate. We refer to these thin-film microstructures as 'surface-micromachined MEMS' although, in some cases, bulk silicon micromachining is required to release the microstructures from the substrate. MCNC's Multi-User MEMS Process service (MUMPs) is the technology chosen for the synthesis work presented in this paper. A simplified and truncated process flow is shown in Fig. 1. Low-stress silicon nitride is first deposited on the silicon substrate to provide electrical isolation between microstructures. An electrical interconnect layer of polycrystalline silicon (polysilicon) is then deposited and patterned. Next, a 2 µm-thick layer of phosphosilicate glass (PSG) is deposited. The PSG acts as a sacrificial spacer layer for the microstructures. After contact cuts are made in the PSG, a 2 µm-thick layer of polysilicon is deposited and patterned to form the microstructures. Further process steps in MUMPs are not necessary for microresonator fabrication and will not be shown. A final wet etch in hydrofluoric acid (HF) dissolves the PSG and releases the microstructure so it is free to move. The PSG contact cuts act as mechanical anchor points that fix the microstructure to the substrate surface.

Fig. 1. Abbreviated process flow for MCNC's Multi-User MEMS Process service: 1) isolation and interconnect definition; 2) contact cut for mechanical anchor; 3) structural definition; 4) structural release from substrate. (a) Cross-sectional view. (b) Top view (layout).

Fig. 2. Scanning electron micrograph of a released folded-flexure comb-drive microresonator fabricated in the MUMPs process.

2.2. Microresonator Design

A microresonator fabricated in MUMPs is shown in Fig. 2. The specific microresonator design was first described and analyzed by Tang [10] and is commonly used for MEMS process characterization. In one application, the resonator has been integrated with CMOS circuitry to form a micromechanical oscillator [11]. The resonator components are made entirely from the homogeneous, conducting, 2 µm-thick polysilicon film. The spacer gap of 2 µm above the substrate is set by the PSG thickness. The movable microstructure (in the center of the micrograph) is fixed to the substrate at only two points. Simplified layout and schematic views of the device are shown in Fig. 3. The resonator is a
mechanical mass-spring-damper system consisting of a central shuttle mass that is suspended by two folded-beam flexures. The folded flexure is a popular design choice because it is insensitive to buckling arising from residual stress in the polysilicon film. Instead of buckling, the beams expand outward to relieve the stress in the film. The resonator is driven in the preferred (x) direction by electrostatic actuators that are symmetrically placed on the sides of the shuttle. Each actuator, commonly called a 'comb drive,' is made from a set of interdigitated comb fingers. When a voltage is applied across the comb fingers, an electrostatic force is generated which, to first order, does not depend on x. The suspension is designed to be compliant in the x direction of motion and to be stiff in the orthogonal direction (y) to keep the comb fingers aligned. In the schematic view of Fig. 3(b), the resonator is represented as an interconnected set of lumped-parameter elements: the shuttle mass, two folded-flexure springs, and two comb-finger actuators, which are displayed as time-varying capacitors. A voltage source that drives one actuator is included in the schematic. Lateral translational modes of the mass-spring-damper system are modeled by second-order equations of motion:
F_{e,x} = m_x \ddot{x} + B_x \dot{x} + k_x x    (1)

F_{e,y} = m_y \ddot{y} + B_y \dot{y} + k_y y    (2)
where F_{e,x} and F_{e,y} are lateral components of the external electrostatic force generated by the comb drives. The effective masses (m_x and m_y), damping coefficients (B_x and B_y), and spring constants (k_x and k_y) for these modes are calculated from the geometry and material parameters of the lumped elements. The vertical (z) mode, rotational modes, and other higher-order modes are left out of the present study because the equations unnecessarily complicate the analysis without providing significantly greater insight into the design and synthesis issues.
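To make the lumped model concrete, the short sketch below (not part of the original paper) evaluates the x-directed behavior implied by Eq. (1): the undamped natural frequency and the quality factor Q = sqrt(m_x k_x)/B_x for assumed values of effective mass, damping, and stiffness. The numerical values are illustrative placeholders only.

import math

def lateral_mode(m, b, k):
    """Natural frequency [Hz] and quality factor of m*x'' + B*x' + k*x = F_e."""
    f0 = math.sqrt(k / m) / (2.0 * math.pi)
    q = math.sqrt(m * k) / b
    return f0, q

# Illustrative values only (mass and stiffness near the scale of a ~30 kHz design;
# the damping coefficient is an assumed placeholder).
f_x, q_x = lateral_mode(m=25.7e-12, b=1.0e-7, k=1.10)
print(f"f_x = {f_x/1e3:.1f} kHz, Q = {q_x:.1f}")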
Fig. 3. Circuit-level views of the lateral folded-flexure comb-drive microresonator. (a) Physical view (layout), with the comb drives, shuttle mass, folded flexures, and anchor points labeled. (b) Mixed-domain structural view (schematic), including a voltage source, V, for comb-drive actuation.
2.3. Design Variables

All design variables of the microresonator are structural parameters of the folded flexure and comb drive elements, with the exception of the comb-drive voltage. The 14 variables used in the synthesis algorithm, along with upper and lower bounds, are listed in Table 1. Most of the layout parameters are detailed in Fig. 4. The choice of upper and lower bounds is discussed in the section on design constraints. Several geometric variables, such as the widths of the anchor supports, w_a and w_as, are not included in Table 1. These variables are necessary to completely define the layout, but do not affect the resonator behavior. We refer to these variables as 'style' parameters, because they primarily affect the stylistic look of the finished device. Redundant state variables can be defined that depend on the design variables. For example, the shuttle axle length, L_sa, is a state variable which is dependent on the truss beam length and the gap between the beam anchor and the shuttle yoke. In our formulation of the problem, this gap is a style parameter.
Table 1. Design variables for the microresonator. Upper and lower bounds are in units of µm except N and V.

var    description                   min.    max.
L_b    length of flexure beam          2     400
w_b    width of flexure beam           2     200
L_t    length of truss beam            2     200
w_t    width of truss beam             4     400
w_sa   width of shuttle axle          11     400
w_sy   width of shuttle yoke          11     400
L_y    length of shuttle yoke         84     400
w_cy   width of comb yoke             11     400
L_c    length of comb fingers          8     200
w_c    width of comb fingers           2     400
g      gap between comb fingers        2     100
x_0    comb finger overlap             4     100
N      number of comb fingers         15     100
V      voltage amplitude             5 V    50 V

Fig. 4. Parameterized elements of the microresonator: (a) shuttle mass, (b) folded flexure, (c) comb drive with N movable 'rotor' fingers, (d) close-up view of comb fingers.
2.4. Lumped-Element Modeling

A very brief description of the models used in the synthesis algorithm is given in this section. The models for the spring constants, damping, and comb-drive force are derived in detail in the references listed with each equation. The effect of spring mass on resonance is incorporated in an effective mass for the entire structure in each lateral direction.

m_x = m_s + \frac{1}{4} m_t + \frac{12}{35} m_b    (3)
m_y = m_s + \frac{8}{35} m_t + m_b    (4)
where m_s is the shuttle mass, m_t is the total mass of all truss sections, and m_b is the total mass of all long beams. For operation at atmospheric pressure, damping is dominated by viscous air forces generated by the moving shuttle. Viscous air damping is proportional to velocity with a damping factor given by [12]

B_x = \mu \left[ (A_s + 0.5 A_t + 0.5 A_b) \left( \frac{1}{d} + \frac{1}{\delta} \right) + \frac{A_c}{g} \right]    (5)

where \mu is the viscosity of air, d is the spacer gap, \delta is the penetration depth of airflow above the structure, g is the gap between comb fingers, and A_s, A_t, A_b, and A_c are bloated layout areas for the shuttle, truss beams, flexure beams, and comb-finger sidewalls, respectively. Damping factors of the other lateral modes do not enter into the design constraints and are not calculated. Linear equations for the folded-flexure spring constants in the lateral directions are [13]
k_x = \frac{2 E t w_b^3}{L_b^3} \cdot \frac{L_t^2 + 14 \alpha L_t L_b + 36 \alpha^2 L_b^2}{4 L_t^2 + 41 \alpha L_t L_b + 36 \alpha^2 L_b^2}    (6)

k_y = \frac{2 E t w_t^3}{L_t^3} \cdot \frac{8 L_b^2 + 8 \alpha L_t L_b + \alpha^2 L_t^2}{4 L_b^2 + 10 \alpha L_t L_b + 5 \alpha^2 L_t^2}    (7)

where E is the Young's modulus of polysilicon, t is the polysilicon thickness, and \alpha = (w_t / w_b)^3. General analytic equations for the lateral comb-drive force, F_x, as a function of w_c, g, t, and d are derived in [14], but are too lengthy to repeat here. For the special case of equal comb-finger width, gap, thickness, and spacing above the substrate (w_c = g = t = d), each comb drive generates a force that is proportional to the square of the voltage, V, applied across the comb fingers:
F_{e,x} = 1.12 \varepsilon_0 N \frac{t}{g} V^2    (8)
where \varepsilon_0 is the permittivity of air. If the comb fingers are not perfectly centered, a y-directed electrostatic force is also present. Assuming a small perturbation y and that only one comb actuator is activated, the destabilizing force is found to be

F_{e,y} = 1.12 \varepsilon_0 N V^2 (x_0 + x) t y / g^3    (9)
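As a rough, self-contained illustration (not the authors' code), the sketch below evaluates the folded-flexure stiffnesses of Eqs. (6)-(7) as reconstructed above, the comb-drive force of Eq. (8), and the side-stability requirement that the y-restoring spring force exceed the destabilizing force of Eq. (9). The Young's modulus and all geometry values are assumed placeholders, not a synthesized design.

import math

EPS0 = 8.854e-12  # permittivity of free space, F/m

def flexure_stiffness(E, t, wb, wt, Lb, Lt):
    """k_x, k_y of the folded flexure, per Eqs. (6)-(7) as reconstructed above."""
    a = (wt / wb) ** 3
    kx = (2 * E * t * wb**3 / Lb**3) * \
         (Lt**2 + 14*a*Lt*Lb + 36*a**2*Lb**2) / (4*Lt**2 + 41*a*Lt*Lb + 36*a**2*Lb**2)
    ky = (2 * E * t * wt**3 / Lt**3) * \
         (8*Lb**2 + 8*a*Lt*Lb + a**2*Lt**2) / (4*Lb**2 + 10*a*Lt*Lb + 5*a**2*Lt**2)
    return kx, ky

def comb_force_x(N, t, g, V):
    """Lateral comb-drive force, Eq. (8), for the special case w_c = g = t = d."""
    return 1.12 * EPS0 * N * (t / g) * V ** 2

def side_stable(N, V, x0, x, t, g, ky):
    """True if the y-restoring spring force exceeds the destabilizing force of Eq. (9);
    both sides are linear in the small perturbation y, so y cancels out."""
    dFey_dy = 1.12 * EPS0 * N * V ** 2 * (x0 + x) * t / g ** 3
    return dFey_dy < ky

# Placeholder geometry in metres; E is an assumed value for polysilicon.
E_POLY = 165e9
kx, ky = flexure_stiffness(E_POLY, 2e-6, 2e-6, 4e-6, 150e-6, 30e-6)
print(kx, ky, comb_force_x(N=20, t=2e-6, g=2e-6, V=30.0),
      side_stable(N=20, V=30.0, x0=10e-6, x=2e-6, t=2e-6, g=2e-6, ky=ky))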
3. LAYOUT SYNTHESIS

3.1. Design Constraints

Several design specifications must be defined to constrain the layout synthesis of the microresonator. An essential specification is the resonant frequency of the lowest mode (x), which is determined by the spring constant and effective mass of the resonator. A valid layout must have a resonant frequency within 10% of the desired value. The resonant frequency of the orthogonal mode, f_y, is set at least 10 times greater than f_x to decouple the modes. Specifications for minimum displacement and maximum applied voltage act as additional constraints. The displacement amplitude at resonance is related to the spring constant, damping, and comb-drive force. Assuming the system is underdamped, the displacement amplitude is

x_{max} = Q F_x / k_x    (10)

where

Q = \sqrt{m_x k_x} / B_x    (11)

is the quality factor. We have constrained 2 µm < x_{max} < 5 µm at a drive voltage of V < 50 V to enable easy visual confirmation of resonance, and Q >= 5 to ensure underdamped resonant operation. For stability, the restoring force of the spring in the y direction must be greater than the destabilizing electrostatic force from the comb drive:

F_{e,y} < k_y y    (12)

The layout is constrained by the MUMPs design rules. Several design rules set minimum beam widths and minimum spaces between structures. Maximum structure size is limited by undesirable curling due to stress gradients in the structural film and by possible sticking and breakage during the wet release etch. We have constrained beam lengths to less than 400 µm to avoid these problems. If all the above constraints are met, then a design with minimum area can be found, which is considered optimal.

3.2. Synthesis Algorithm

In our approach, the synthesis problem is mapped onto a constrained optimization formulation that is solved in an unconstrained fashion. As in [15], the circuit design problem is mapped to the non-linear constrained optimization problem (NLP):

\min_u \sum_{i=1}^{k} w_i f_i(u, x)
s.t.  h(u, x) = 0
      g(u, x) \le 0
      u \in U^p    (13)

where u is the vector of independent variables, such as geometries of micromachined devices that we wish to change to determine the MEMS structure performance; x is the vector of state variables; f(u, x) is a set of objective functions that codify performance specifications the designer wishes to optimize, e.g., area; and h(u, x) = 0 and g(u, x) \le 0 are each a set of constraint functions that codify specifications that must satisfy a specific goal. For example, resonant frequency is constrained to greater than 20 kHz by the function 20000 - f_x(u, x) \le 0. Scalar weights, w_i, balance competing objectives. The decision variables can be described as a set u \in U^p, where U^p is the set of allowable values for u (described by the bounds in Table 1).

To allow the use of simulated annealing, we convert this constrained optimization problem to an unconstrained optimization problem with the use of additional scalar weights. As a result, the goal becomes minimization of a scalar cost function, C(u), defined by

C(u) = \sum_{i=1}^{k} w_i f_i(u, x) + \sum_{i=1}^{l} w_i' g_i(u, x) + \sum_{i=1}^{m} w_i'' h_i(u, x)    (14)
The key to this formulation is that the minimum of C(u) corresponds to the circuit design that best matches the given specifications. Thus, the synthesis task is divided into two sub-tasks: evaluating C(u) and searching for its minimum. Evaluating the cost function involves firing the lumped-element macro-models to determine the extent to which the design constraints are met for the current values of the design variables, u. This cost function has multiple minima due to the complex non-linear characteristics of the individual equations in the lumped-element macro-model. We use simulated annealing [16] as the optimization engine to drive the search for the minimum; it provides robustness and the potential for global optimization in the face of many local minima. Because annealing incorporates controlled hill-climbing, it can escape local minima and is essentially starting-point independent.
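The toy sketch below (not the authors' implementation) illustrates the idea behind Eq. (14) and the annealing search: a scalar cost combining a weighted objective with constraint penalties is minimized by a simple simulated-annealing loop that occasionally accepts uphill moves. The cost terms, weights, surrogate frequency model, and bounds are invented stand-ins, not the paper's models.

import math
import random

BOUNDS = {"Lb": (2.0, 400.0), "wb": (2.0, 200.0)}   # micron box bounds, as in Table 1

def cost(u):
    """Placeholder C(u): weighted area objective plus a one-sided constraint penalty."""
    area = u["Lb"] * u["wb"]                      # stand-in objective term f(u)
    f_est = 5.0e5 / u["Lb"]                       # invented frequency surrogate [Hz]
    penalty = max(0.0, 20000.0 - f_est) ** 2      # e.g. require f >= 20 kHz
    return 1.0e-3 * area + 1.0e-6 * penalty

def anneal(steps=20000, t0=10.0, alpha=0.9995):
    u = {k: random.uniform(*b) for k, b in BOUNDS.items()}
    best, best_c, temp = dict(u), cost(u), t0
    for _ in range(steps):
        v = {k: min(max(u[k] + random.gauss(0.0, 0.05 * (b[1] - b[0])), b[0]), b[1])
             for k, b in BOUNDS.items()}          # perturb within the box bounds
        dc = cost(v) - cost(u)
        if dc < 0 or random.random() < math.exp(-dc / temp):  # controlled hill-climbing
            u = v
            if cost(u) < best_c:
                best, best_c = dict(u), cost(u)
        temp *= alpha                             # cool the annealing temperature
    return best, best_c

print(anneal())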
Fig. 5. Layout of five resonators synthesized from specifications. (a) f_r = 3 kHz, (b) 10 kHz, (c) 30 kHz, (d) 100 kHz, (e) 300 kHz.
4. RESULTS AND DISCUSSION

Five lateral comb-drive resonator structures synthesized from specifications are shown in Fig. 5. For visualization of the synthesis results, we used the Consolidated Micromechanical Element Library (CaMEL) parameterized module generation software [17] from MCNC, which provides CIF output when given resonator layout parameters. The CaMEL generators automatically place holes in large plates that are over 30 µm in size. Several iterations of the design variables and constraints were necessary to produce synthesized designs that followed manual design convention and common sense. The layout visualization was instrumental in debugging the equations. Feedback from the synthesis iterations directed our efforts to codify the design variables and constraints. In many cases, a quick inspection of a synthesized layout was all that was needed to determine errant or missing equations. As expected, the devices shown in Fig. 5 become smaller with increasing values of resonant frequency.
Smaller devices have less mass, and smaller flexures are stiffer. Both effects increase the resonant frequency. Values of selected design variables and behavioral parameters for the five resonator examples are given in Table 2. For high-frequency resonators, the mass becomes limited by the lower bounds on the shuttle mass and comb drive dimensions. The algorithm always chooses a device at the high end of the frequency specification, since that yields a minimum-size device. Quality factor is smaller for the larger resonators, which have more air drag. The maximum beam length is 400 µm; however, the 3 kHz design is not able to take full advantage of that because of the stylistic design constraint that the structure should be less than 700 µm in the y-direction. In our formulation, the number of comb fingers, N, is tied directly to the shuttle size. Therefore, N is large for
the low-frequency (large mass) resonators. Very low frequency resonators are limited by the upper bounds imposed on geometry. The upper bound on displacement amplitude of 5 µm is achieved for all of the devices in Fig. 5 except the 300 kHz resonator. The optimization does not directly evaluate all the equations in each iteration. Instead it builds a local low-order macro-model of the equations which is regularly updated. This can sometimes lead to the optimizer believing it has met a constraint when post-synthesis simulation (used to generate the performance numbers in Table 2) indicates that the optimizer actually overshot the constraint. This occurred in the resonant frequency and maximum displacement specifications of the 300 kHz design. Note that although N is more than doubled from the 10 kHz to the 3 kHz device, Q is reduced by about one half. Therefore, the applied voltage stays nearly constant for these devices. The lower limit for N (= 15) is reached for intermediate-frequency resonators. For high-frequency resonators, the voltage limit is reached and N must be increased to provide adequate electrostatic force to achieve the 2 µm minimum displacement. Very high frequency resonators cannot be synthesized because the necessary increase in comb fingers generates a more massive device, which drives the resonant frequency down. A critical voltage, V_crit, is defined as the applied voltage necessary to crash the movable comb fingers into the stator comb fingers. The values for V_crit in Table 2 are much larger than any voltages experienced by the resonators. However, the first-order spring model in our analysis does not account for spring softening at non-zero values of x deflection. More realistic models may raise the significance of this 2-D constraint.

Table 2. Selected behavioral and physical parameters for the five synthesized resonators.

Spec. [kHz]     3±0.3    10±1    30±3    100±10    300±30
m_x [ng]        435.     39.9    25.7    23.4      26.7
L_b [µm]        304      302     168     77        36
k_x [N/m]       0.187    0.191   1.10    11.2      122.
f_x [kHz]       3.30     11.0    33.0    110       340
Q               9.35     20.0    56.2    181       503
N               50       21      15      15        19
V [V]           14.0     15.0    25.5    45.1      50.0
x_max [µm]      4.97     5.00    5.00    4.99      1.93
V_crit [V]      9190     14200   16700   16500     15300

5. CONCLUSIONS

Synthesis algorithms have been successfully applied to automatic layout of surface-micromachined resonators. A prerequisite for synthesis is a set of lumped-parameter models that adequately link device behavior with physical design variables. One great benefit of creating synthesis tools is that it forces the CAD developer to codify every design variable and design constraint.

Once a structured design methodology is established for surface-micromachined MEMS, the synthesis techniques may be extended to general parameterized designs. Then we can integrate structural synthesis with electronic design using a formulation similar to that used for circuits in [4]. Our long-term goal is to enable rapid, intuitive exploration and analysis of the design space for MEMS.
ACKNOWLEDGEMENT
The authors thank Karen Markus and Ramaswamy Mahadevan of MCNC for use of the CaMEL tool.
REFERENCES
[1] S. D. Senturia, R. Harris, B. Johnson, S. Kim, K. Nabors, M. Shulman, and J. White, "A Computer-Aided Design System for Microelectromechanical Systems," J. of Microelectromechanical Systems, vol. 1, no. 1, pp. 3-13, Mar. 1992.
[2] IntelliCAD, IntelliSense Corporation, 16 Upton Dr., Wilmington, MA 01887.
[3] S. B. Crary, O. Juma, and Y. Zhang, "Software Tools for Designers of Sensor and Actuator CAE Systems," Technical Digest of the IEEE Int. Conference on Solid-State Sensors and Actuators (Transducers '91), San Francisco, CA, pp. 498-501, June 1991.
[4] T. Mukherjee, L. R. Carley, and R. A. Rutenbar, "Synthesis of Manufacturable Analog Circuits," Proceedings of ACM/IEEE ICCAD, pp. 586-593, Nov. 1994.
[5] I. E. Grossmann and D. A. Straub, "Recent Developments in the Evaluation and Optimization of Flexible Chemical Processes," Proceedings of COPE-91, pp. 49-59, 1991.
[6] D. A. Koester, R. Mahadevan, and K. W. Markus, Multi-User MEMS Processes (MUMPs) Introduction and Design Rules, available from MCNC MEMS Technology Applications Center, 3021 Cornwallis Road, Research Triangle Park, NC 27709, rev. 3, Oct. 1994, 39 pages.
[7] J. T. Kung and S. Lewis, BiMOS Technology Stage One Design Rules, Analog Devices, 831 Woburn St., Wilmington, MA 01887, 14 pages (available from http://www-mtl.mit.edu:8001/htdocs/starting.html).
[8] Micromachines Program, available from Multi-Project Circuits (CMP) Service, 46, avenue Felix Viallet, 38031 Grenoble Cedex, Oct. 1994, 29 pages.
[9] G. K. Fedder, S. Santhanam, M. L. Reed, S. Eagle, D. Guillou, M. S.-C. Lu, and L. R. Carley, "Laminated High-Aspect-Ratio Microstructures in a Conventional CMOS Process," Proceedings of the IEEE Micro Electro Mechanical Systems Workshop, San Diego, CA, pp. 13-18, Feb. 1996.
[10] W. C. Tang, T.-C. H. Nguyen, M. W. Judy, and R. T. Howe, "Electrostatic Comb Drive of Lateral Polysilicon Resonators," Sensors and Actuators A, vol. 21, no. 1-3, pp. 328-331, Feb. 1990.
[11] C. T.-C. Nguyen and R. T. Howe, "CMOS Micromechanical Resonator Oscillator," Technical Digest of the IEEE Int. Electron Devices Meeting, Washington, D.C., pp. 199-202, 1993.
[12] X. Zhang and W. C. Tang, "Viscous Air Damping in Laterally Driven Microresonators," Proceedings of the IEEE Micro Electro Mechanical Systems Workshop, Oiso, Japan, pp. 199-204, Jan. 1994.
[13] G. K. Fedder, Simulation of Microelectromechanical Systems, Ph.D. Thesis, Dept. of Electrical Engineering and Computer Science, University of California at Berkeley, Sept. 1994.
[14] W. A. Johnson and L. K. Warne, "Electrophysics of Micromechanical Comb Actuators," J. of Microelectromechanical Systems, vol. 4, no. 1, pp. 49-59, Mar. 1995.
[15] E. S. Ochotta, R. A. Rutenbar, and L. R. Carley, "ASTRX/OBLX: Tools for Rapid Synthesis of High Performance Analog Circuits," Proc. 31st ACM/IEEE DAC, pp. 24-30, June 1994.
[16] S. Kirkpatrick, C. D. Gelatt, and M. P. Vecchi, "Optimization by Simulated Annealing," Science, vol. 220, no. 4598, May 1983.
[17] CaMEL Web Page, http://www.mcnc.org/camel.org, MCNC MEMS Technology Applications Center, 3021 Cornwallis Road, Research Triangle Park, NC 27709.
CONSOLIDATED MICROMECHANICAL ELEMENT LIBRARY Ramaswamy Mahadevan
Allen Cowen
MCNC, MEMS Technology Applications Center, PO Box 12889, Research Triangle Park, NC 27709 USA
[email protected], http://mems.mcnc.org

ABSTRACT
The Consolidated Micromechanical Element Library, CaMEL, consists of two independent parts: a nonparameterized cell database and a parameterized microelectromechanical element library, PME. The MEMS cell libraries are intended to be a design aid for both novice and experienced MEMS designers. Both libraries are intended to assist the user in the design and layout of MEMS devices by providing an initial layout for components of a MEMS system. The parameterized library also allows users to write their own customized generators using the primitives available in the library. The user can modify these elements and customize them as desired and assemble designs using a suitable mask layout editor.

1. INTRODUCTION
The nonparameterized cell library is a database of MEMS designs in various process technologies contributed by different sources. Further design contributions from the MEMS community for addition to the library are actively solicited! The parameterized library, PME, includes a set of generators that can be used to create customized layout for many common MEMS components that use a surface-micromachined process technology such as the MUMPs process. Technology-specific design rules are used to ensure that the layout generated will conform to process requirements. The generators support CIF (Caltech Intermediate Form) [1] and GDS II¹ [2] mask formats, as well as PostScript² [3] for display and printing purposes. The library framework can also be used to write generators for user-customized elements as well as other technologies. The PME library is currently available on Sun SPARC and HP 9000 workstation platforms as well as on Macintosh and IBM personal computers. The MEMS libraries and the MUMPs process are described in greater detail in the rest of the paper. Section 2 provides a brief overview of the MUMPs process while Sections 3 and 4 describe the NPME and PME libraries.

2. MUMPS
The Multi-User MEMS Processes or MUMPs is an ARPA-supported program that provides the domestic (USA and Canada) industry, government and academic communities with cost-effective, proof-of-concept surface micromachining fabrication. MUMPs is designed for general purpose micromachining by outside users who would like to fabricate MEMS devices. The front-end support activities and access to the technologies are provided by the MCNC MEMS Technology Applications Center. The following is a general description of the MUMPs process. For more details and design rules please refer to [4,5]. The three-layer polysilicon micromachining process is derived from work performed at the Berkeley Sensors and Actuators Center at the University of California. Several modifications and enhancements have been added to increase the flexibility and versatility of the process for the multi-user environment. Figure 1 is a cross section of the three-layer polysilicon surface micromachining MUMPs process. This process has the general features of a standard surface micromachining process: (1) polysilicon is used as the structural material; (2) deposited oxide (PSG) is used as the sacrificial layer, and silicon nitride is used as electrical isolation between the polysilicon and the substrate. The polysilicon and metal layers are also used for electrical interconnects. The process is different from most customized surface micromachining processes in that it is designed to be as general as possible, and to be capable of supporting many different designs on a single silicon wafer. Since the process was not optimized with the purpose of fabricating any one specific device, the thicknesses of the structural and sacrificial layers were chosen to suit most users, and the layout design rules were chosen conservatively to guarantee the highest yield possible.

Fig. 1. Cross-sectional view of the three-layer polysilicon MUMPs process (not to scale). [Layer legend: Nitride, Poly0, 1st Oxide, Poly1, 2nd Oxide, Poly2, Metal.]

1. GDS II is a trademark of Calma.
2. PostScript is a registered trademark of Adobe Systems Incorporated.

3. NONPARAMETERIZED MICROMECHANICAL ELEMENT LIBRARY

The intent of the nonparameterized element library is to provide the novice MEMS designer easy access to MEMS designs without having to extract information from various sources. Cells may be retrieved through the use of a simple perl script or electronically via the World Wide Web. In either case the cells are requested by sending mail to MCNC and an automated system electronically retrieves and mails back the requested cell in the desired format. Cells in stream format are uuencoded before mailing to the user. Cell categories available include test structures, motors, linear actuators, and out-of-plane structures. The majority of cells available were designed for the MUMPs process, but some designs from other processes such as LIGA are also available.

4. PARAMETERIZED MICROMECHANICAL ELEMENT LIBRARY

The Parameterized Micromechanical Element (PME) library is a set of generators that allow the user to create customized micromechanical elements in a quick and easy manner. It also provides a framework for writing MEMS cell generators. It enables generators to be relatively process independent and allows limited cell hierarchy. Designs can be generated in CIF, GDS II, or PostScript output formats. Technology-dependent design rules are read in from an environment-specified technology file. The framework provides various geometric primitives and a library of available generators. The library framework and its usage are described in this section. More complete information on the generators available can be found in [6].

4.1 PME Framework

4.1.1 Geometry Primitives

The PME library provides various geometry primitives for use in the creation of generators. Rectangle, circle, sector, polygon, and wire primitives are available. Primitives on structural layers can be generated with etch holes if they are parts of structures that will be released from the substrate. These primitives generate the appropriate layout depending on the output format selected. Currently, these primitives are accessible to the user in a limited capacity for writing user-customized generators. The main purpose for incorporating such primitives is to ease the generation of all-angle geometry, including arcs and sectors, and to enable generation of etch holes. Some of these features are often not available, or tedious to use, in many existing VLSI layout editors.

4.1.2 Process Dependence

The generators available in the PME library assume a two-layer surface-micromachined process. The process is assumed to have two structural layers, two sacrificial layers, and two electrical connect layers. Process-specific mask levels and layer names are specified through a process file and are compiled into the library. Technology-specific design rules are specified through a separate technology file that is accessed at execution time. The technology design rule file to use can be specified by the user through
the environment variable PMETECH. The generators have been written to conform to the technology-specific design rules specified via the PMETECH file. However, a design rule checker is not a part of the library and all layout generated should be checked using an external design rule checker. For PC and Macintosh versions of the library the file dsgnruls is used as the technology design rule file. It must be located in the directory where the library is located.

4.1.3 Cell Hierarchy

The library allows a limited cell hierarchy in both the generators and the output layout. The layout generated for any cell is not stored in program memory but is written out in the sequence generated. Hence any leaf cells used inside the top-level cell must be defined before being called. The same applies to instances within generators, allowing a generator to be called inside another generator.

4.2 Library Usage

4.2.1 PME Input File

The PME input is specified in an ASCII text file and defines cells and places instances of defined cells at desired locations within the top-level cell MAINCELL. The syntax used is similar to C language syntax. The input file should declare PME objects, call desired generators with appropriate parameter lists and assign them to PME objects, and place instances of these objects within MAINCELL. PME objects are declared first using the statement:

PME cell1, cell2, etc.;

Cells are defined using generators in the library:

pme-object = generator(parameter-list, cell-name);
cell1 = gen-name1(p1val, p2val, ..., cellname1);
cell2 = gen-name2(p1val, p2val, ..., cellname2);

Finally, instances are placed at desired locations in the MAINCELL:

instance(pme-object-name, reflection, rotation, origin-x, origin-y);

Reflection and rotation are performed about the instantiated cell's local coordinate system. Reflection is performed first, rotation is performed second, and the cell is finally translated to place the local origin of the cell instance at the specified location within the MAINCELL. Rotation is specified by passing the desired angular rotation value in degrees. All geometrical length parameters for the generators and locations of instances are specified in microns. Permitted values for reflection are:
'*': no reflection
'x': changes sign of x values, i.e., reflects about the y-axis
'y': changes sign of y values, i.e., reflects about the x-axis

4.2.2 Generating Layout

After the PME input file has been created, layout can be generated in the desired format using the command pmegen -[cif | gds | ps]. To generate layout in CIF format on Unix and PC DOS platforms:

pmegen -cif pme-input-file output-file.cif

Similarly, to generate GDS or PostScript versions of the layout use:

pmegen -gds pme-input-file output-file.gds
pmegen -ps pme-input-file output-file.ps

On Macintosh systems the program prompts for the necessary arguments. The top-level cell in the layout generated will be named MAINCELL. The CIF or GDS layout can then be read into the mask layout editor of choice and modified or assembled with other designs.

4.3 Parameterized Micromechanical Elements

A variety of active, passive, and test MEMS structures have been parameterized. In most cases the geometric parameters have been determined by their importance in the overall design of the element. In some cases, physical quantities such as residual strain are also used. The elements in the library are designed to work closely with each other. One or more active elements would typically be used along with one or more passive elements to form a micromechanical system. For instance, a rotary side drive
actuator could be combined with a journal bearing to create a rotary variable-capacitance motor. Most generators have two versions, one using structural layer 1 and the other using structural layer 2; they are identified by the use of 1 or 2 at the end of the name of the generator. Details about available generators are listed in [6].
4.3.1 Active Micromechanical Elements

The following active elements are currently available in the CaMEL PME library:
* Harmonic Side Drives - hsdm1, hsdm2
* Linear Comb Drives - lcomb1, lcomb2, adcomb1, adcomb2, adcomb3
* Linear Side Drives - lsdm1, lsdm2
* Rotary Comb Drives - rcdm1, rcdm2, rcombd1, rcombd2, rcombu1, rcombu2, rcombu1a, rcombu2a
* Rotary Side Drives - rsdm1, rsdm2

4.3.2 Passive Micromechanical Elements

The following passive elements are currently available in the CaMEL PME library:
* Journal Bearings - bearing1, bearing2
* Linear Crab Leg Suspensions - lcls1, lcls2, lcls1b, lcls2b, lcls3b
* Linear Folded Beam Suspensions - lfbs1, lfbs2
* Spiral Springs - spiral1, spiral2
* Wire - wire1, wire2

4.3.3 Test and Electrical Elements

The following test/measurement elements are currently available and are intended to measure mechanical or electrical characteristics of the process:
* Area Perimeter Test Structure - aptest
* Crossover Test Structures - cotest, coatest
* Euler Columns - dsbeam1, dsbeam2
* Guckel Rings - gring1, gring2, grings1, grings2
* Pad - pad, padframe

4.4 Layout Examples

The following examples illustrate how elements of the library are used in conjunction with each other to generate typical MEMS devices.

4.4.1 Rotary Side Drive Motor Example

A rotary side drive motor on the first structural layer is created using the active element rsdm1 and the passive element bearing1. The rotary side drive is generated with an inner ring radius of 21 µm, inner rotor tooth radius of 27 µm, outer rotor tooth radius of 51 µm, stator inner radius of 53 µm, stator outer radius of 80 µm, rotor pole angle of 18°, rotor gap angle of 27°, stator pole angle of 18°, and stator gap angle of 12°. The rotor is aligned with the first stator pole and the cell is named rsdma. The bearing is created with a cap radius of 14 µm, journal rotor inner radius of 10 µm, and journal rotor outer radius of 27 µm.

/* Rotary Side Drive Example */
/* Declare PME Objects */
PME p1, p2;
/* Create rotary side drive active element */
p1 = rsdm1(21.0,27.0,51.0,53.0,80.0,18.0,27.0,18.0,12.0,0.0,rsdma);
/* Create central journal bearing */
p2 = bearing1(14.0,10.0,27.0,jnla);
/* Create motor using instances of cells generated inside MAINCELL */
instance(p1,'*',0.0,0.0,0.0);
instance(p2,'*',0.0,0.0,0.0);

Fig. 2. Layout of rotary side drive motor generated by the PME input file shown above. Both cells rsdma and jnla are located at the origin of the MAINCELL in this case. The radii of the rotor ring and journal have been chosen to overlap and provide mechanical interconnection of the bearing and rotor ring.

4.4.2 Linear Comb Resonator Example

The following example shows a linear comb resonator on the first structural layer created using the lcomb1 linear comb drive and lfbs1 linear folded beam suspension. The fingers in the comb are 4 µm wide with an air gap of 3 µm, and the beams in the suspension are 150 µm long and 4 µm wide.

/* Linear Comb Resonator with Folded Beam Suspension */
PME p1, p2;
/* Create comb drive and folded beam suspension */
p1 = lcomb1(98,12,14,60,4,3,30,lcomb);
p2 = lfbs1(150,4,50,12,30,25,12,98,lfbs1);
/* Instantiate cells to create linear resonator */
instance(p1,'*',0.0,0.0,75.0);
instance(p1,'y',0.0,0.0,-75.0);
instance(p2,'*',0.0,0.0,0.0);

Fig. 3. MAINCELL layout generated by the linear comb resonator example PME input file.

4.4.3 Example of Guckel Ring Test Structures

A set of Guckel ring test structures [7] to measure residual tensile stress in the first structural layer is generated using the grings1 test/measurement element. Guckel rings are generated to measure strains from 0.05% to 0.25% in steps of 0.025% using a ring width of 20 µm and a beam width of 10 µm. The film thickness is specified as 2 µm and Poisson's ratio for the film is 0.23. A mechanical model of the ring structure incorporated within the generator is used to calculate the ring radii required for the given critical strain values and generate the layout accordingly in a cell named rings1.

/* Guckel Rings Example */
PME p1;
/* Generate rings for critical strains from 0.05% to 0.25% in steps of 0.025%,
   with a ring width of 20 and beam width of 10, for a film thickness of 2 and
   Poisson's ratio of 0.23. */
p1 = grings1(0.0005,0.0025,0.00025,20,10,40,2,0.23,rings1);
/* Place instance in MAINCELL */
instance(p1,'*',0.0,0.0,0.0);

Fig. 4. Guckel ring test structures on the first structural layer generated by the PME input file shown above.
4.5 PME Generator Development

CaMEL V2.0 allows users to design their own generators, but in a limited capacity, as only one generator may be added to the code at a time. Many library functions and geometry primitives are available to the user for writing their own generators. Functions are available to initialize and call generators, set the current drawing level, look up design rules for levels, create or parse argument lists for a generator, verify values passed to a generator, and create layout using geometry primitives such as rectangles, circles, sectors, polygons, and wires. Geometry primitives on structural layers can be drawn with or without release etch holes. Detailed information on how to write generators is described in [6].

REFERENCES
[1] Mead, C. and Conway, L., Introduction to VLSI Systems, Addison-Wesley Publishing, 1980.
[2] Calma Company, Stream Format, GDS II Release 5.2, Calma Company, 1985.
[3] Adobe Systems Incorporated, PostScript Language Reference Manual, Addison-Wesley Publishing, 1990.
[4] Koester, D. A., Mahadevan, R., and Markus, K. W., "Multi-User MEMS Processes - Introduction and Design Rules," October 1994.
[5] Koester, D. A., Mahadevan, R., Shishkoff, A., and Markus, K. W., SmartMUMPs Design Handbook, MCNC, 1996.
[6] Mahadevan, R., and Cowen, A., CaMEL User's Guide, MCNC, 1994, 1996.
[7] Guckel, H., Burns, D., and Choi, B., "Diagnostic Microstructures for the Measurement of Intrinsic Strain in Thin Films," Journal of Micromechanics and Microengineering, Vol. 2, pp. 86-95, 1992.
SYNTHESIS AND SIMULATION FOR MEMS DESIGN
Erik C. Berg¹, Nanping R. Lo¹, Jonathan N. Simon², Hee Jung Lee¹, and Kristofer S. J. Pister¹,²
¹Department of Electrical Engineering, ²Department of Mechanical and Aerospace Engineering
University of California at Los Angeles, 405 Hilgard Avenue, Los Angeles, CA 90095-1594
[email protected]

ABSTRACT
To facilitate growth and innovation in micro-electro-mechanical systems (MEMS) research and development, techniques for high-level design of these systems are needed. Using these methods and commercially available standard manufacturing processes, non-experts will be able to design, simulate, and fabricate complicated devices and systems. This paper presents methodology and tools developed to enable designers, starting from a high-level abstraction of their system, to extract parameterized coupled electromechanical models, perform full time-domain or frequency-domain simulation, optimize the design, and finally synthesize layout to produce lithography masks. A surface-micromachined, electronically controlled micromechanical resonator fabricated using standard CMOS is demonstrated to illustrate the tools and approach.

1. INTRODUCTION
To facilitate growth and innovation in micro-electro-mechanical systems (MEMS) research and development, techniques for high-level design of these systems are needed. Using these methods and commercially available standard manufacturing processes, non-experts will be able to design, simulate, and fabricate complicated devices and systems. Modern VLSI CAD tools allow synthesis of lower-level circuits from high-level abstractions, extraction of relevant features, and rapid simulation of circuit performance. In stark contrast, the design of MEMS systems is an art, not a science. A typical MEMS system design uses two separate processes: hand layout of circuit geometry, and manual modelling and simulation of the system. Both of these require time-consuming manual changes for iterative design. Taking inspiration from VLSI CAD techniques, the authors advocate MEMS systems-level synthesis, extraction, and simulation (Fig. 1). We have developed software tools which enable rapid generation and simulation of a broad class of surface-micromachined and LIGA systems. These tools are intended to complement MEMS tools for process and device design [1,2].

2. HIGH-LEVEL SYSTEM SYNTHESIS
To demonstrate this approach, a micromechanical resonator with active control using velocity feedback was designed. This device was based on work done by Clark Nguyen and his colleagues [3-5], and it includes both mechanical and MOS electronic devices on a surface-micromachining-compatible substrate. Figure 2 shows a "schematic" layout of the system that uses symbols to represent the electronic and mechanical "leaf cells" of the system [6]. Each leaf cell contains specific parameters that describe it, and these can be edited in the schematic editor. For example, a "spring" leaf cell will have parameters that specify either the spring constant or the geometry of the spring. The leaf cells also have terminals that are used to connect the elements together, forming the electrical and mechanical networks of the design. In Figure 3, the symbols have been replaced by machine-synthesized layout, but each leaf cell retains its parametric data and terminal information. This representation can easily be converted to a mask definition for fabrication.
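As a rough illustration (not the authors' software), a leaf cell can be thought of as a small parameterized object with named terminals; the sketch below shows one way a "spring" leaf cell might be represented before synthesis or extraction. All class, field, and net names here are invented for illustration.

from dataclasses import dataclass, field

@dataclass
class LeafCell:
    """A schematic-level MEMS element: editable parameters plus connection terminals."""
    kind: str
    params: dict = field(default_factory=dict)
    terminals: dict = field(default_factory=dict)   # terminal name -> net name

# A 'spring' leaf cell specified either by stiffness or by geometry, as in the text.
spring = LeafCell(
    kind="spring",
    params={"k_N_per_m": 1.5},                      # or e.g. {"L_um": 134, "w_um": 2}
    terminals={"m1": "shuttle", "m2": "anchor"},
)
print(spring)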
[Figure 1 design-flow blocks: text description / graphical input, symbols + interconnect, block generation / synthesis, layout + interconnect, extraction, parameterized SPICE netlist, place and route / hand layout, mask pattern.]

Figure 1: VLSI-Based MEMS Design Process
[Figure 2/3 labels: CMOS transimpedance amplifiers, drive comb, sense combs, feedback comb, folded spring, mass, capacitor, resistor.]

Figure 2: Schematic Representation of Q-Controlled Resonator

Figure 3: Physical Representation of Figure 2 with Symbols Replaced by Machine-Synthesized Layout
3. EXTRACTION AND SIMULATION
Extraction is performed on either of the representations, producing a SPICE netlist (Fig. 4). Each leaf cell produces an element in the netlist with a set of attached parameters. Each type of leaf cell has an associated macromodel used by SPICE that models its performance. The mechanical models use electrical analogs to simulate the differential equations of the network, and both the mechanical and electrical networks created by each mechanical leaf cell are represented in the simulation model. This allows, for example, explicit coupling between the varying overlap of a comb drive and its capacitance, allowing capacitive velocity sensing to be simulated. The micromechanical resonator is designed to have an electrical voltage input and output and a bandpass frequency characteristic. The relative bandwidth of the device's output is described by the quality factor, or "Q," and applications such as filters usually require precise control of this value. In this design, the Q is controlled by the value of the feedback resistance in the amplifier.

Figure 5 shows an output frequency-domain plot near resonance for several values of the feedback resistor, and the variation of the Q factor that occurs with changing this resistance value is shown in Figure 6. Figure 7 demonstrates the invariance of the quality factor for a wide range of system pressure; the rolloff at high pressure is due to the viscous damping force becoming larger than the electronically controlled damping. Physical variables such as displacements and forces can be easily monitored, and Figure 8 shows the magnitude of the displacement of the moving parts of the resonator for the same set of resistance values as Figure 5. Note that the displacement, as expected for a second-order spring-mass-damper mechanical system, is a lowpass function that has a two-pole rolloff. Tools were also developed that synthesize layout directly from user performance specifications. Figure 9 shows a micromechanical resonator that was automatically generated and fabricated at MCNC. Layout synthesizers such as this enable rapid, error-resistant MEMS design.
* spice deck conversion of qcontrol:schematic
* generated by talos4 on Wed Apr 3 12:41:20 1996
* Include Subcircuit, CMOS, and Process Information:
.INCLUDE 'MEMSmodels.spice'       $ MEMS Models
.INCLUDE 'ELECmodels.spice'       $ Electrical Models
.INCLUDE 'TRANSmodels.spice'      $ BSIM Level 13 Models
.INCLUDE 'PROCESS-PARAM.spice'    $ Sample Process Parameters
*
Xcomb000   inm leftm ine lefte comb-geo NG=113 GAP=2e-6 W=2e-6 XO=10e-6 DIR=-1
Xcomb001   sinm leftm sine lefte comb-geo NG=226 GAP=2e-6 W=2e-6 XO=10e-6 DIR=-1
Xcomb002   outm rightm oute righte comb-geo NG=339 GAP=2e-6 W=2e-6 XO=10e-6 DIR=1
Xspring003 biasm topm biase tope spring-geo L=134e-6 W=2e-6 NSP=4 RF=4
Xmass004   leftm topm rightm lefte tope righte mass3-geo W=50e-6 L=50e-6
Xpmos005   nn12e nn12e Vdde Vdde pmos W=100e-6 L=2e-6
Xpmos006   nn12e stage1e Vdde Vdde pmos W=100e-6 L=2e-6
Xnmos007   Outbiase stage1e nn10e nn10e nmos W=500e-6 L=2e-6
Xnmos008   Vbiase oute GNDe GNDe nmos W=400e-6 L=2e-6
Xpmos009   stage1e oute Vdde Vdde pmos W=800e-6 L=2e-6
C0000010   stage1e oute 10p
Xnmos011   Vbiase nn10e GNDe GNDe nmos W=100e-6 L=2e-6
Xnmos012   sine nn12e nn10e nn10e nmos W=500e-6 L=2e-6
R0000013   oute sine R=rfeedb
Figure 4: Extracted Parameterized SPICE Netlist of Q-Controlled Micromechanical Resonator
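For illustration only, the following sketch emits a single element card in the style of the extracted netlist of Fig. 4 from a leaf cell's parameters. The emitter itself is invented; only the card format is modeled on the figure.

def emit_card(name, nodes, model, **params):
    """Emit one subcircuit-call card in the style of the netlist shown in Fig. 4."""
    plist = " ".join(f"{k}={v}" for k, v in params.items())
    return f"X{name} {' '.join(nodes)} {model} {plist}"

# Hypothetical comb-drive leaf cell flattened to a card (values copied from Fig. 4).
print(emit_card("comb000", ["inm", "leftm", "ine", "lefte"], "comb-geo",
                NG=113, GAP="2e-6", W="2e-6", XO="10e-6", DIR=-1))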
Figure 5: SPICE Simulation of Q-Control Transimpedance Amplifier Output
Figure 6: SPICE Simulation Showing Variation of Q with Feedback Resistance
4. SIMULATION MODELS
System simulation is performed using the widely available SPICE program, which is designed to solve systems of ordinary differential equations (ODEs) such as those that result from the coupled electrical and mechanical element models of these systems. The electronic elements use the standard models provided in SPICE such as resistors, capacitors, and transistors, but models for the mechanical devices had to be developed to simulate their performance. The ODEs of the mechanical system are simulated using electrical analogs. First-order electromechanical models relating inertial, damping, spring, and electrostatic forces to the time derivatives of the displacement are used, but more sophisticated models or empirical data could be used for more precision. The modeling technique uses a parallel-element approach in which the parallel forces exerted on a node are modeled as currents entering the node and the displacement of a node is represented by its voltage variation. Just as the currents entering a node must sum to zero, the sum of the forces (including inertial terms) on a node must also have zero sum. Similarly, adding either displacements or voltage drops around a closed loop gives a zero net result. The first-order analogs used for this example are shown in Table 1. Note that many mechanical devices can be directly simulated using electrical elements, while some must use controlled sources as their analogs.

Table 1: Electromechanical Analogs

Mechanical Model                                 Electrical Analog              Element Value
F_spring = k x                                   resistor: I = V/R              R = 1/k
F_damp = B dx/dt                                 capacitor: I = C dV/dt         C = B
F_mass = m d²x/dt²                               capacitor / controlled source  C = m
F_electrostatic = (V_cap²/2) ∂C_cap/∂x           controlled current source      —
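To make the mapping concrete, here is a minimal sketch built on the force-as-current, displacement-as-voltage convention described above (the element values follow the legible entries of Table 1 and should be read as a sketch, not the paper's definitive mapping): the spring becomes a resistor of value 1/k, the damper a capacitor of value B, while the inertial and electrostatic terms are driven by controlled sources computed from the model equations.

def analog_elements(k, b, m):
    """Element values under the force-as-current / displacement-as-voltage convention:
    F = k*x -> I = V/R with R = 1/k; F = B*dx/dt -> I = C*dV/dt with C = B.
    The inertial term (value m) is realized with a controlled source in SPICE."""
    return {"R_spring": 1.0 / k, "C_damper": b, "inertial_value": m}

def electrostatic_force(v_cap, dC_dx):
    """F_electrostatic = (1/2) * V_cap^2 * dC_cap/dx, driven as a current source."""
    return 0.5 * v_cap ** 2 * dC_dx

# Placeholder lumped values, not extracted from the resonator in this paper.
print(analog_elements(k=1.5, b=1.0e-7, m=2.5e-11))
print(electrostatic_force(v_cap=5.0, dC_dx=2.0e-9))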
5. SUMMARY
The goal of this work is to develop tools which allow MEMS designers to easily produce fully simulated complex designs with reasonable confidence in the system's predicted performance. For designers, this high-level approach has the advantage that parameterized models enable quick and efficient exploration of the design space, analysis of sensitivity to process variations, and optimization of performance.

6. ACKNOWLEDGEMENT
This work was supported in part by a grant from NSF and ARPA under award 1R19321718 and ARPA award DABT63-95-C-0050. Software is available by http and ftp from synergy.icsl.ucla.edu.
Figure 7: SPICE Simulation Showing Variation of Q with Pressure

Figure 8: SPICE Small-Signal Simulation of Resonator Displacement
Figure 9: SEM of Machine-Synthesized Resonator Fabricated at MCNC

REFERENCES
[1] Crary, S. and Zhang, Y., "CAEMEMS: An Integrated Computer-Aided Engineering Workbench for Micro-Electrical-Mechanical Systems," Proc. IEEE Workshop on Micro Electro Mechanical Systems, Napa, California, pp. 113-114, February 11-14, 1990.
[2] Senturia, S., et al., "A Computer-Aided Design System for MicroElectroMechanical Systems (MEMCAD)," IEEE J. Microelectromechanical Systems, vol. 1, pp. 3-13, March 1992.
[3] Nguyen, C. T.-C. and Howe, R. T., "Quality Factor Control for Micromechanical Resonators," Technical Digest, IEEE International Electron Devices Meeting, San Francisco, California, pp. 505-508, Dec. 13-16, 1992.
[4] Nguyen, C. T.-C., Micromechanical Signal Processors, Ph.D. Dissertation, University of California, Berkeley, 1994.
[5] Nguyen, C. T.-C. and Howe, R. T., "Design and Performance of CMOS Micromechanical Resonator Oscillators," IEEE International Frequency Control Symposium, 1994.
[6] "Octtools/vem Integrated IC Design System," Industrial Liaison Program, 479 Cory Hall, University of California, Berkeley, CA 94720.
CAD FOR INTEGRATED SURFACE-MICROMACHINED SENSORS: PRESENT AND FUTURE
Stephen F. Bart
Analog Devices, Inc., Wilmington, MA, USA
[email protected]
ABSTRACT
The present state of the art in micro-electro-mechanical systems (MEMS) design methods can be characterized by the use of a disjoint set of tools and techniques which have been gathered from the many disciplines that MEMS encompass. This requires the MEMS designer to master a broad spectrum of unrelated tools and often requires translation of relevant data from one simulation to another by hand. Further, the designer must integrate the results to form an overall prediction of the system's behavior. This type of piecemeal design environment is inefficient and prone to human error. As MEMS progress towards ever more complicated systems, the present piecemeal approach will become less and less satisfactory. In particular, system-level simulations will become more and more difficult and their trustworthiness will decrease just as they become more and more necessary. This paper will discuss the simulation tools used to design a typical surface-micromachined lateral accelerometer such as the Analog Devices, Inc. ADXL50. This will highlight the types of analyses which are required and typical tools that are presently used. Then it will examine the shortcomings of these tools, as well as analyses that are not practically possible at this time, as a way to explore the MEMS design tools that will be required in the future.

1. INTRODUCTION

Integrated micro-electro-mechanical systems (MEMS) have become commercially viable products which are increasingly being used as high-reliability, low-cost replacements for standard electromechanical devices. For example, integrated surface-micromachined accelerometers such as the Analog Devices, Inc. ADXL50 are seeing increased use in high-volume, low-cost applications such as crash sensors for airbag deployment modules in automobiles [5]. Such applications make it very important that the sensor be produced at the highest quality for the minimum cost. This requires efficiency at all stages of the device's production, particularly in its design. The present state of the art in MEMS design methods can be characterized by the use of a disjoint set of tools, often borrowed from other disciplines. This requires the MEMS designer to master a broad spectrum of unrelated tools, and often requires translation of relevant data from one simulation type to another by hand. Further, the designer must integrate the results to form an overall prediction of the system's behavior. This type of piecemeal design environment is inefficient and prone to human error. As MEMS progress towards ever more complicated systems, the present piecemeal approach will become less and less satisfactory. In particular, system-level simulations will become more and more difficult and their trustworthiness will decrease just as they become more and more necessary. This evolution will place increasing demands on the required design tools, which argues for a careful examination of the design framework into which these tools will be placed. This paper will examine the simulation tools used to design a typical surface-micromachined lateral accelerometer such as the Analog Devices, Inc. ADXL50. This will highlight the types of analyses which are required and typical tools that are presently used. Then we will examine the shortcomings of these tools, as well as analyses that are not practically possible, as a way to explore the MEMS design tool requirements of the future.

Figure 1. Schematic diagram of the accelerometer sensor displaced by an applied acceleration.
1.1. A MEMS Accelerometer

Figure 1 is a schematic diagram of the electromechanical sensor portion of a typical surface-micromachined accelerometer. The fixed fingers are attached to the substrate and do not move. The moving fingers are attached to the moving mass which is suspended by four tethers. The moving mass and the tethers make up a mechanical spring-mass system in which the damping is dominated by the viscous drag of the surrounding air. Each moving finger with its two adjacent fixed fingers forms a pair of lateral series capacitors called a "unit cell" (see Fig. 1). The ADXL50 accelerometer structure consists of 42 such unit cells. An acceleration causes the moving mass and its fingers to move, creating imbalances in the series capacitor pairs. During normal operation, complementary clocks (antiphase square waves) are applied to the
fixed fingers in each unit cell. When the moving fingers are centered, the series capacitors are equal and the capacitive coupling of the complementary clocks cancels, yielding a zero voltage on the moving-mass electrical node. In the case of a non-centered beam, the clock signals are capacitively coupled to the moving mass in proportion to its distance from its neutral position. Thus, the amplitude of the ac signal at the moving-mass node is monotonically related to the magnitude of the deflection. This signal is then amplified and demodulated to yield the output signal. To ensure linearity and long-term stability, micromachined accelerometers are often closed-loop, force-rebalance devices. In this case, the output signal is also used as a feedback signal which alters the moving-beam voltage and thus applies an electrostatic force which balances the inertial force due to the applied acceleration [5, 8, 3]. However, surface-micromachined inertial sensors tend to be relatively low-mass, high-stiffness devices [1]. Thus, their deflections are quite small. This means that they tend to be quite linear devices and often do not require force-rebalancing for linearization.
Figure 2. Schematic diagram of the ADXL50 moving mass.
2. BASIC DESIGN SIMULATION METHODS
The basic design iterations for the ADXL50 and similar sensors were performed using simple closed-form models of the appropriate physics. Since these sensors are inertial sensors, the fundamental parameters are the mass and the spring constant. The mass can be accurately calculated from the area of the device. A simple model for the spring constant of the tethers can be derived from simple beam theory [4]. Accelerometers have been designed with both straight, non-stress-relieved tethers as shown in Fig. 2 [1] and folded, stress-relieved tethers as shown in Fig. 1 [3]. In the case of non-stress-relieved tethers, the stress must be accounted for as well. Here the difficulty is knowledge of the stress value and not the accuracy of the model. Simple models for the electrostatic domain are more problematic. The basic model is that of an infinite parallel-plate capacitor. The difficulty here is that the aspect ratio of typical surface-micromachined capacitor gaps is not much better than 2-to-1. Hence there is a large fringing-field component to the electric field, and the capacitance will be significantly different from the simple parallel-plate value. At present we overcome this problem by using constant correction factors to modify the parallel-plate solutions for capacitance and its spatial derivative:

C = K_f ε0 A / (g - x)                (1)
dC/dx = K_f ε0 A / (g - x)^2          (2)

where K_f is the fringing-field correction factor, A is the plate area, g is the nominal gap, and x is the displacement.
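For illustration only, the following Python sketch evaluates the corrected parallel-plate expressions of Eqs. (1) and (2); the gap, area, and correction-factor values are invented placeholders rather than ADXL50 design data.

# Sketch of the fringing-corrected parallel-plate model of Eqs. (1)-(2).
# All numerical values are hypothetical placeholders, not ADXL50 design data.

EPS0 = 8.854e-12  # vacuum permittivity, F/m

def gap_capacitance(x, area, gap, k_fringe):
    """Corrected parallel-plate capacitance C = K_f * eps0 * A / (g - x)."""
    return k_fringe * EPS0 * area / (gap - x)

def gap_capacitance_gradient(x, area, gap, k_fringe):
    """Spatial derivative dC/dx = K_f * eps0 * A / (g - x)^2."""
    return k_fringe * EPS0 * area / (gap - x) ** 2

if __name__ == "__main__":
    area = 100e-6 * 2e-6      # placeholder finger overlap area, m^2
    gap = 1.3e-6              # placeholder nominal gap, m
    k_fringe = 1.2            # placeholder fringing correction factor
    for x in (0.0, 5e-9, 20e-9):   # angstrom-scale displacements
        c = gap_capacitance(x, area, gap, k_fringe)
        dcdx = gap_capacitance_gradient(x, area, gap, k_fringe)
        print(f"x = {x:.1e} m: C = {c:.3e} F, dC/dx = {dcdx:.3e} F/m")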
These correction factors are generated with either a 2-D finite-element solver or with a 3-D boundary-element solver (FastCap [7]). These solutions are not self-consistent with the mechanics. For the very small displacements typical of the accelerometer devices (tens to hundreds of angstroms), this can be tolerated. For structures that undergo relatively large deflections, such as resonant structures, this is more of a problem. In particular, electrostatic instability (pull-in) can be a problem for such devices. A self-consistent solution is required if pull-in is to be accurately modeled.
2.1. Mechanical Simulation
Static finite-element solutions are used to verify several aspects of the design. The first is the structural warpage. The ADI polysilicon is deposited with tensile residual stress, which causes the structures to shrink slightly. In addition, the poly has a stress gradient through its thickness, which causes the structures to warp. This can cause large structures to come into contact with the substrate. Since this would be catastrophic, a simulation is required to check for contact. We model the fabrication-induced residual stress gradient as a thermal gradient of 26.5 deg. with a thermal expansion coefficient of 2.0 x 10^-6 deg^-1. An average temperature is also applied to model the average tensile stress. These thermal parameters were obtained by matching the maximum simulated tip deflection of a fixed finger and the maximum simulated height of the center of the moving mass to average measured values for many hundreds of ADXL50 structures. Fig. 3 plots the simulated warpage of a typical cantilever finger of length 135 um. Fig. 4 plots the simulated warpage along the center line of the ADXL50 accelerometer's moving mass.
Figure 3. Simulated warpage of a fixed finger cantilever.
Static solutions are also used to apply static acceleration loads to the structure. This allows the sensitive and
cross-axis spring constants to be confirmed. Also, it allows us to determine the maximum acceleration that will cause the structure to come into contact with the substrate or surrounding structures. This is important for determining maximum acceleration specifications.
Figure 4. Simulated warpage of the moving mass along the cross section.
2.1.1. Modal Analysis
For the accelerometer structures, modal analysis allows you to examine the relative compliance of various structural motions to be sure that any such motions will not interfere with the primary sensing. Also, for systems with switched voltages, the excitation of unanticipated resonant motions can be examined. Modal analysis becomes critical for resonant structures. In such structures you generally wish to excite one resonant mode and avoid all others. Thus a knowledge of the spacing and Q of the structure's modes is required.
2.1.2. Harmonic Analysis
Harmonic analysis is primarily useful for resonant structures. It allows you to apply realistic forces and examine the amplitude and Q of the motions that are excited. Clearly, to examine the effects on all pertinent modes, a reasonably accurate solution for the applied forces must be obtained. Since a simple parallel-plate capacitance model for the electrostatic force, even if corrected for fringing fields as described above, only yields an estimate of the on-axis force, a simulation for the cross-axis forces is required.
3. SPICE MODELS
The sensors used as examples here are integrated sensors. As a consequence, the designer of the interfacing electronics needs to have a model of the sensor which can be run within the context of a circuit simulation tool. At ADI we use SPICE-based circuit simulation tools. We have developed a model which uses voltages to represent force and position variables and uses SPICE expressions to calculate the behavior of the sensor electromechanics. The mechanics are modeled as a set of independent, single-axis, second-order mechanical systems. The capacitances and electrostatic forces are modeled as parallel-plate capacitors with fringing-field corrections. Parasitic capacitance can dominate the response of these tiny sensors, and therefore models for parasitic capacitance must be included. Although these models are functional, their inner workings are obscured by the use of SPICE constructs to perform electromechanical or mathematical functions. As a consequence, these models are very hard for the designer to modify and debug.
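As a rough sketch of this kind of behavioral model (not the actual ADI SPICE macromodel), the following Python fragment time-steps a single-axis, second-order spring-mass-damper driven by a fringing-corrected electrostatic force; every numerical parameter is a hypothetical placeholder.

# Minimal sketch of a single-axis, second-order behavioral sensor model of the
# kind described above; parameter values are hypothetical placeholders.

EPS0 = 8.854e-12

def electrostatic_force(x, v_bias, area, gap, k_fringe):
    # F = 0.5 * dC/dx * V^2 for one fringing-corrected parallel-plate gap
    dcdx = k_fringe * EPS0 * area / (gap - x) ** 2
    return 0.5 * dcdx * v_bias ** 2

def simulate(mass, k_spring, damping, v_bias, accel_ext,
             area, gap, k_fringe, dt=1e-7, steps=20000):
    """Explicit time stepping of m*x'' + b*x' + k*x = m*a_ext + F_elec(x)."""
    x, v = 0.0, 0.0
    for _ in range(steps):
        f = mass * accel_ext + electrostatic_force(x, v_bias, area, gap, k_fringe)
        a = (f - damping * v - k_spring * x) / mass
        v += a * dt
        x += v * dt
    return x

if __name__ == "__main__":
    x_final = simulate(mass=2e-10, k_spring=5.0, damping=2e-6, v_bias=1.0,
                       accel_ext=50 * 9.8, area=2e-10, gap=1.3e-6, k_fringe=1.2)
    print(f"quasi-static deflection ~ {x_final:.3e} m")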
4. COUPLED ELECTROMECHANICS
Several tools exist which can perform self-consistent electromechanical simulation. At present, the problem with these systems is the size of the structure that they can practically solve, due to both time and memory limitations. To simulate an entire interdigitated accelerometer structure has not been possible. One option is to use the symmetry of the structure to reduce the size of the problem. However, because these devices owe much of their sensitivity to their symmetric structure, it is important to have a simulation tool that allows you to see the effects of manufacturing variations that can break the symmetry; for example, the effect of nonsymmetric tethers. Another option is to simulate one set of capacitive finger cells and use the solution for all of the equivalent cells. However, in the case of the ADI structures, this redundancy is broken because the structures are warped. Due to this fabrication-induced warpage, each unit cell has a different force contribution to the displacement of the total moving structure. A modeling method, called the lumped-model self-consistent analysis method, was developed to overcome this problem [9]. The method is based on the calculation of an intermediate look-up table which relates the electrostatic forces on a finger cell to the position of the moving finger in both the sensitive axis and the warpage axis. Since the fingers are stiff relative to the tether, we can apply the cell electrostatic force vector at the intersection of the moving finger and the moving mass. This use of a precomputed look-up table greatly reduces the computation time and memory requirements in comparison to a standard self-consistent electromechanical analysis scheme. The lumped-model self-consistent analysis method is computed as follows. First, the structure's warped static shape is modeled as described above. The second step requires calculating the cell electrostatic energy by using FastCap to compute the capacitance matrix in the relation

E = (1/2) Σ_ij C_ij V_ij^2.            (3)
Figure 5. 3-D FastCap mesh of a unit cell.
Figure 7. Electrostatic energy distribution.
Figure 8. Resonant frequency versus DC bias.
Fig. 5 shows the FastCap mesh for a finger cell. Since these accelerometer structures are designed to be much stiffer in the non-sensitive in-plane axis (y axis), we assume that the moving mass can move in the acceleration-sensitive axis (x axis) and the out-of-plane, or levitation, axis (z axis). The electrostatic energy is thus differentiated with respect to the x and z displacements to build a table of cell electrostatic force versus finger position. Fig. 7 shows the electrostatic energy distribution which was differentiated to build the force-versus-position table. Next, for each finger's position, the x- and z-axis electrostatic force components from the lumped-model table are attached to the moving-mass structural model as indicated in Fig. 6. Finally, we self-consistently analyze the structural model with the applied finger forces using ABAQUS [6]. The final two steps are repeated until the displacement stops changing. In order to test the accuracy of the simulations obtained with this method, the change in resonant frequency with applied DC bias voltage was simulated and compared to measured values for several open-loop ADXL50 accelerometers. Figure 8 shows the resonant frequency data versus the DC bias voltage difference on the fixed fingers. The DC bias voltage on the horizontal axis of Fig. 8 was applied to the right-hand fixed finger of each finger cell (see Fig. 1). The left-hand fixed finger was grounded. The DC potential of the moving mass and its fingers (as well as the ground plane under the structure) was centered between these values (i.e., one-half the horizontal-axis potential). The change in resonant frequency with applied DC bias voltage in Fig. 8 is caused by the reduction in the effective spring constant due to the electrostatic forces. The code used to implement this method is hand-crafted and would require considerable effort to use on a regular basis. However, there is no fundamental reason why it could not be automated. A more general problem with this method is that it relies on a repeated finger-cell geometry like that found in the ADI accelerometer products. It is easy to imagine structures where this approach would not be particularly helpful.
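The fixed-point character of the lumped-model iteration can be illustrated with the toy Python sketch below; the force table, the scalar spring constant standing in for the ABAQUS structural solve, and the tolerance are all invented placeholders.

# Toy sketch of the lumped-model self-consistent iteration described above.
# The force table and mechanical stiffness are invented placeholders.

import bisect

# Hypothetical precomputed table: displacement (m) -> cell electrostatic force (N)
X_TABLE = [0.0, 5e-9, 10e-9, 20e-9, 40e-9]
F_TABLE = [1.0e-9, 1.1e-9, 1.2e-9, 1.45e-9, 2.0e-9]

def lookup_force(x):
    """Piecewise-linear interpolation in the precomputed force table."""
    if x <= X_TABLE[0]:
        return F_TABLE[0]
    if x >= X_TABLE[-1]:
        return F_TABLE[-1]
    i = bisect.bisect_right(X_TABLE, x)
    x0, x1 = X_TABLE[i - 1], X_TABLE[i]
    f0, f1 = F_TABLE[i - 1], F_TABLE[i]
    return f0 + (f1 - f0) * (x - x0) / (x1 - x0)

def self_consistent_displacement(n_cells, k_spring, tol=1e-12, max_iter=100):
    """Repeat force lookup and structural solve until x stops changing."""
    x = 0.0
    for _ in range(max_iter):
        total_force = n_cells * lookup_force(x)   # apply per-cell forces
        x_new = total_force / k_spring            # stand-in for the FE solve
        if abs(x_new - x) < tol:
            return x_new
        x = x_new
    return x

if __name__ == "__main__":
    print(f"converged displacement ~ {self_consistent_displacement(42, 5.0):.3e} m")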
Figure 6. Computed x and z axis forces are applied to the moving mass.
5. SIMULATION REQUIREMENTS FOR MEMS SYSTEMS
There are several basic simulation functions that are needed to allow MEMS structure designers to obtain accurate and functionally useful models. The first is a self-consistent electromechanical modeler. The system must be able to handle structures with a level of geometric complexity at least as
great as present commercial accelerometer structures. And it must be able to solve them accurately in a reasonable amount of time. Among other things, efficient mesh generation will be important here. Since there are different requirements for optimization in the mechanical and electrical domains, an adaptive meshing capability in both domains will probably be required. The availability of dielectric materials is also desirable. Since these MEMS structures will increasingly be connected to more and more sophisticated circuits, the circuit designer requires a model for the MEMS device in the circuit simulation environment. Thus, the automatic generation of a SPICE-type model is required. Since computing power will, in general, not allow a full electromechanical simulation to be performed at each SPICE time step, a flexible structure for encapsulating a behavioral model of the MEMS structure will be required. A set of electromechanical simulations would generate the parameters required for this SPICE framework. An example would be a model which linearizes the desired behaviors or degrees of freedom around a typical operating point. The self-consistent electromechanical modeler would provide the parameters for this linearized model. The circuit simulation could monitor how close the system is to the modeled operating point, calling for a new computation about a new operating point when necessary. Finally, it should be noted that some of the most important circuit parameters are the various parasitic capacitances. The ability to extract not only air-gap parasitics, but also parasitics which have semiconductor behavior, such as those created by a diffused interconnect, is important.
5.1. Damping and Other Energy Domains
One domain that has been somewhat neglected is simulation of the structure's damping. Accurate damping calculations will become increasingly important for sensitive resonant structures such as gyros. At this point it is not clear whether a fluid-mechanical analysis is required or just an appropriate application of known closed-form solutions for low-Reynolds-number flows [2]. In either case, appropriate parameters should be extractable from the structure geometry and provided to the SPICE-level model. The modeling of the thermal domain can also be important in some devices. Since thermal modeling has well-known duals in mechanical or electrostatic analysis, it is easily added to the simulation environment.
5.2. System Confirmation
Circuit and layout environments provide extensive tools for design rule checking and circuit-to-layout verification. As MEMS systems become more complicated, verification through the simulation chain becomes more and more important. The modeling tools must be capable of using appropriate layers from the layout to construct the structure model. The results must pass, without human manipulation of the data, to the SPICE-type circuit model.
5.3. Parametric Sensitivity Analysis
Although micromachining techniques allow the fabrication of very small structures, the relative accuracy can be quite poor. In a typical manufacturing fabrication line it is not unusual to see lot-to-lot variations of 10%-30% for small-dimension lateral features such as capacitive gaps. As a consequence, it is very important that the MEMS designer be able to predict the effect of such variations on the device's operation. Thus the ability to easily run a set of simulations using known manufacturing variations and generate a sensitivity analysis is important. Such simulations will require the ability to build a parametric simulation "tree" and to call upon an entire network's resources to compute efficiently.
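A parametric simulation "tree" of the kind called for in Section 5.3 might be driven by a small script such as the following sketch; the closed-form frequency model and the variation ranges are placeholders standing in for real process data and full electromechanical simulations.

# Sketch of a parametric sensitivity sweep over assumed manufacturing variations.
# The closed-form "model" and the variation ranges are placeholders.

from itertools import product
from math import sqrt, pi

def resonant_frequency(width, thickness, length, mass):
    """Toy model: fixed-guided beam stiffness k = E*t*(w/L)^3 (assumed constants)."""
    E = 160e9                      # assumed polysilicon Young's modulus, Pa
    k = E * thickness * (width / length) ** 3
    return sqrt(k / mass) / (2 * pi)

nominal = dict(width=2e-6, thickness=2e-6, length=200e-6, mass=2e-10)
variations = {"width": (0.8, 1.0, 1.2), "length": (0.9, 1.0, 1.1)}

results = []
for w_scale, l_scale in product(variations["width"], variations["length"]):
    p = dict(nominal, width=nominal["width"] * w_scale,
             length=nominal["length"] * l_scale)
    results.append((w_scale, l_scale, resonant_frequency(**p)))

f_nom = resonant_frequency(**nominal)
for w_scale, l_scale, f in results:
    print(f"w x{w_scale:.1f}, L x{l_scale:.1f}: f = {f/1e3:7.2f} kHz "
          f"({100 * (f - f_nom) / f_nom:+.1f}%)")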
6. CONCLUSION
The state of computer-aided design tools for MEMS is, at the present time, inadequate. Although a fundamental strength of MEMS systems is the transduction of multiple energy domains, most modeling is still performed with tools which only handle a single domain. The tools which do couple the two most important domains, electric and mechanical, are too computationally inefficient to handle problems of a practical size. Further, they cannot generate models which can couple to the circuit simulation environment. This slows the ability of the designer to evaluate novel designs to a crawl and greatly increases the chance for errors to invade the simulation path from the sensor's layout to the modeling or circuit simulation results. Finally, the lack of a unified simulation environment makes performing a parametric sensitivity analysis to examine the effects of dimensional variations very difficult.
REFERENCES
[1] S. Bart, J. Chang, T. Core, L. Foster, A. Olney, S. Sherman, and W. Tsang, "Design Rules for a Reliable Surface Micromachined IC Sensor," Proc. of the 33rd Annual International Reliability Physics Symposium, Las Vegas, NV, April 4-6, 1995, pp. 311-317.
[2] S. F. Bart, M. Mehregany, L. S. Tavrow, J. H. Lang, and S. D. Senturia, "Electric Micromotor Dynamics," IEEE Transactions on Electron Devices, March 1992, pp. 566-575.
[3] K. H.-L. Chau, S. R. Lewis, Y. Zhao, R. T. Howe, S. F. Bart, and R. G. Marcheselli, "An Integrated Force-balanced Capacitive Accelerometer for Low-G Applications," Proc. of the 8th International Conference on Solid-State Sensors and Actuators, Stockholm, Sweden, June 25-29, 1995, pp. 593-596.
[4] G. K. Fedder, Simulation of Microelectromechanical Systems, PhD Thesis, University of California at Berkeley, CA, 1994.
[5] F. Goodenough, "Combining Micromachining with Analog IC Technology," Electronic Design, August 8, 1991.
[6] Hibbitt, Karlsson, and Sorensen, Inc., Providence, R.I.
[7] K. Nabors and J. White, "FastCap: A multipole-accelerated 3-D capacitance extraction program," IEEE Transactions on Computer-Aided Design, vol. 10, Nov. 1991, pp. 1447-1459.
[8] S. Sherman, W. Tsang, T. Core, R. Payne, D. Quinn, K. Chau, J. Farash, and S. Baum, "A Low Cost Monolithic Accelerometer; Product/Technology Update," Proc. IEEE 1992 Int. Electron Devices Meeting, San Francisco, CA, Dec. 13-16, 1992, pp. 19.1.1-19.1.4.
[9] H. Yie, S. F. Bart, J. White, and S. D. Senturia, "A Computationally Practical Approach to Simulating Complex Surface-micromachined Structures with Fabrication Non-idealities," Proceedings of the IEEE Micro Electro Mechanical Systems Workshop, Amsterdam, the Netherlands, Jan. 29-Feb. 2, 1995, pp. 128-132.
A VISION OF STRUCTURED CAD FOR MEMS
Gary K. Fedder
Department of Electrical and Computer Engineering and The Robotics Institute Carnegie Mellon University Pittsburgh, PA, 15213-3890 E-mail: [email protected]
ABSTRACT Computer-aided design tools tailored for microelectromechanical systems (MEMS) are needed to enable design of complex systems with multiple energy domains. In an analogy to the VLSI design methodology, physical, structural, and behavioral views of MEMS can be formed and coupled together in an integrated toolset. Of key importance is the formation of parameterized MEMS component libraries to support these views. Fast coupled-domain numerical (physical) simulation and behavioral simulation are required to move freely between the views.
1. INTRODUCTION
One prevailing trend in MEMS is toward monolithic systems where multiple micromechanical devices are integrated with sense electronics, digital I/O, self-test, auto-calibration, digital compensation, and other signal processing functions. There is a growing demand for rapid design, analysis, and synthesis tools for MEMS that include coupling between multiple energy domains, including mechanical, electrostatic, magnetic, thermal, fluidic, and optical domains. Presently, most development of MEMS involves concurrent design of devices and their associated fabrication processes. More rapid development of MEMS requires a structured design methodology, supporting tools, and supporting model libraries for common MEMS processes. Eventually, MEMS processes that lend themselves to structured design will evolve into de facto standards, because it will become easier to design new devices in these processes. One key to rapid design will be the ability to reuse design knowledge through hierarchical component and model libraries. Structured design methods for MEMS are not completely new; they are borrowed from the VLSI design methodology. CAD for VLSI spans many levels of abstraction, from materials, device, circuit, logic, and register to system level. At each of these levels, a design can be viewed in physical, structural (schematic), or behavioral form. Similar design views and hierarchical levels for MEMS are feasible and sorely needed. Analogous hierarchical levels for MEMS up to the VLSI circuit level are easily made. Even higher levels of abstraction, which may be different from the VLSI paradigm, will evolve for MEMS. Complications with implementing a MEMS design methodology stem from the difficulty in handling the coupled energy domains. A first task in development of structured MEMS design tools is the formation of standard data representations and standard cell libraries. An enormous effort is necessary to identify and to model reusable MEMS processes, elements, devices, and architectures. MEMS CAD tools must be integrated, with appropriate links available to the designer to switch between different lateral views and hierarchical levels. In this paper, a brief overview of current MEMS CAD tools, design practices, and process services is presented. A vision of structured design and synthesis for MEMS is then given.
2. CURRENT MEMS CAD TOOLS
Several groups have existing research programs to address the deficiency in MEMS design tools. Examples from the U.S. include MEMCAD (M.I.T. [1][2], Microcosm [3]) and CAEMEMS (Univ. of Michigan) [4]; examples from Europe include CAPSIM (Catholic Univ. of Leuven, Belgium) [5], SENSOR (Fraunhofer Institute, Germany) [6], and SESES (ETH, Zurich) [7]. These tools involve general numerical analysis of layout and generation of macro-models for simulation. MEMCAD has evolved into a MEMS modeling framework with rapid self-consistent electromechanical 3D numerical simulation. Recent advances have
been made in simplifying the input and visualization of 3D models of micromechanical structures from layout using the MEMBuilder tool [8]. CAEMEMS is a framework in which the user chooses among modules that address specific design domains. CAEMEMS automatically generates a set of parameterized response surfaces by launching multiple finite-element analyses. IntelliCAD [9], available from IntelliSense Corp., is a commercial MEMS CAD tool with automated 3D modeling from layout and process, integrated with numerical analysis. Other commercial tools by Tanner Research [10] cater to the MEMS community by allowing layout of non-Manhattan geometry and supplying MEMS technology files with design rule checking. These tools are definite improvements over the use of Magic or KIC for layout and stand-alone numerical analysis tools (e.g. ABAQUS, ANSYS, Ansoft Maxwell). For general MEMS devices, concurrent design of process and structure is important, but there have been only a few efforts to improve CAD in this area. One promising activity at the University of Michigan involves MEMS process synthesis from design specifications [11].
Fig. 1. Flowchart of current MEMS design practice.
3. CURRENT MEMS DESIGN PRACTICES
Current MEMS design practice, shown in Fig. 1, focuses on iterative device and process development. Design concepts are implemented in a manual layout. The performance is then analyzed using numerical analysis tools, usually resulting in iterations on both the layout and the underlying process. The present state of the art in MEMS CAD relies on device-level extraction of macro-models in a limited set of energy domains for behavioral simulation. Current commercial design tools cannot deal with the complex multi-domain architectures that will be necessary to create the next generation of commercial MEMS. Much future work should focus on creating very fast multi-domain numerical simulation tools to ease both process development and device macro-modeling. However, these numerical tools by themselves may not be practical for rapid iterative design, since the physical layout (and perhaps the process) must be changed for each iteration without necessarily knowing what to change to best improve the device performance. Currently, a self-consistent electromechanical analysis of even a simple device requires many man-hours to create the 3-D geometry and perform the numerical simulation. In addition, the time to complete a manual design cycle in MEMS has not decreased significantly over the past few years, since knowledge from previous development efforts cannot be easily reused by future developers.
4. MEMS PROCESS SERVICES
MEMS covers a broad, evolving spectrum of fabrication processes. In the U.S., two process services are available that produce surface-micromachined MEMS, where the microstructures are patterned out of thin films on the substrate surface. Polysilicon MEMS can be built in MCNC's MUMPs process [12], and laminated oxide/aluminum MEMS can be built using MOSIS followed by an in-house dry-etch release step [13]. There are several important benefits of making microstructures with stable fabrication services such as MUMPs and MOSIS: 1) sensor fabrication is fast and reliable, so prototypes can be reproduced anytime, 2) all, or most, fabrication steps are done externally, so resources can be invested in design, not standard processing, 3) the process is repeatable and characterized, so circuit and microstructure designs can be re-used, 4) devices improve as the process technology improves (e.g. scaling).
5. STRUCTURED DESIGN AND SYNTHESIS
In a complementary approach to existing analysis tools, we are developing circuit-level design and synthesis tools tailored to general surface-micromachined systems. The planar '2 1/2-D' topology of surface-micromachined MEMS lends itself to abstraction at this higher level than that offered by numerical device simulation alone. These tools will increase the performance, complexity, and level of integration designed into such MEMS as inertial sensors (accelerometers, gyroscopes), x-y-z servomechanisms for probe-based data storage, communications circuits (filters/resonators/mixers), displays (micromirrors, light valves), infrared imagers, pressure and tactile sensors, and thermal flow sensors. Once a working structured design methodology is established for surface-micromachined MEMS, the techniques may be extended to other processes, such as bulk-machined Si or a dissolved-wafer process. The existing IC CAD infrastructure can partially support circuit-level MEMS design. In contrast to VLSI design, however, MEMS design has a tight coupling between form and function. The inclusion of mixed-signal, multi-domain functionality requires a re-tooling of the IC CAD software to handle the job. The three parallel views - physical, structural (schematic), and behavioral - for MEMS design are shown in Fig. 2. Design starts with the engineering specifications and concepts. Implementation and evaluation of the concepts are accomplished by moving back and forth between views until a suitable design is found and sent to the process service for fabrication. The physical view can take the form of either a 3D solid model or a 2D layout. The structural view specifies connectivity between micromechanical elements in a schematic. The behavioral view specifies the governing equations of the system, such as electrical node equations, equations of motion, and heat transfer equations. It is important to have the flexibility to move freely between these views.
Fig. 2. Parallel views and supporting libraries in the MEMS design and synthesis methodology.
Fig. 3. CAD tools coupling the MEMS schematic, behavioral, and physical views.
Most existing surface-micromechanical designs can be partitioned into discrete components which are modeled as lumped-parameter elements, such as beam springs, plate masses, and electrostatic actuators. Conversely, new MEMS devices can be created by connecting together these lumped elements in different configurations, much like analog circuit design is accomplished. Simplified physical and behavioral views of MEMS are currently used by designers, while the schematic view has been missing. The MEMS schematic provides a critical link between the physical and behavioral views and may often be the starting point in design. Additional key features of the design methodology lie in the integration and automation of coupling between the views, as shown in Fig. 3. These coupling mechanisms include behavioral equation extraction and simulation, structural synthesis, and layout synthesis, layout extraction from schematic, and parameter extraction from layout. An example design session might start with construction of a MEMS schematic, then synthesize layout, extract parasitics, and simulate the mixed-signal response.
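To make the idea of parameterized lumped-element library entries concrete, here is a hypothetical Python sketch in which a beam-spring element and a plate-mass element are composed into a resonator estimate; the beam-theory formula is a textbook approximation and all dimensions and material constants are assumed values, not entries from any real MEMS library.

# Sketch of parameterized lumped-element library entries composed at the
# schematic level; dimensions and material constants are placeholders.

from dataclasses import dataclass
from math import sqrt, pi

E_POLY = 160e9       # assumed polysilicon Young's modulus, Pa
RHO_POLY = 2330.0    # assumed polysilicon density, kg/m^3

@dataclass
class BeamSpring:
    length: float
    width: float
    thickness: float
    def stiffness(self):
        # fixed-guided beam bending stiffness, k = E * t * w^3 / L^3
        return E_POLY * self.thickness * self.width ** 3 / self.length ** 3

@dataclass
class PlateMass:
    area: float
    thickness: float
    def mass(self):
        return RHO_POLY * self.area * self.thickness

def resonator(springs, plate):
    """Parallel combination of spring elements suspending one plate mass."""
    k_total = sum(s.stiffness() for s in springs)
    f0 = sqrt(k_total / plate.mass()) / (2 * pi)
    return k_total, f0

if __name__ == "__main__":
    tethers = [BeamSpring(length=150e-6, width=2e-6, thickness=2e-6)] * 4
    proof_mass = PlateMass(area=400e-6 * 300e-6, thickness=2e-6)
    k, f0 = resonator(tethers, proof_mass)
    print(f"k = {k:.3f} N/m, f0 = {f0/1e3:.1f} kHz")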
Depending on the simulation results, an inadequate design may be iterated by modifying the schematic, whereas a suitable design may be verified with 3D numerical simulation. Bi-directional coupling between views must be implemented to support this kind of design flexibility. Efficient mixed-signal mixed-domain simulation is critical to the iterative design process. Fast behavioral simulation will enable rapid exploration of different design concepts. During the initial iterative design phase, the designer will be freed from doing detailed layout and 3D numerical simulation. Many groups are already exploring behavioral MEMS simulation using device macro-models extracted from numerical analysis. An enabling feature of future MEMS design systems will be the automation of behavioral macromodel generation from layout. Physically correct and extensible models of microelectromechanical components are essential if one is to simulate, analyze, and synthesize MEMS. If the models are inadequate, the design will fail, regardless of how good the tools are. Efficient multi-domain numerical analysis is critically important for both generating lumped-element macro-models and for verification of final designs. In order to demonstrate the design tools and drive the effort to automate modeling, a core library of elements and models must be developed to support integrated MEMS design. When appropriate, parameterized models should be generalized with links to manufacturing variables to ease the extension to new MEMS process technologies.
6. CONCLUSIONS
An initial wish-list for the MEMS CAD toolset includes:
* standard MEMS data representations and interchange formats
* standard MEMS cell libraries supporting behavioral, schematic, and physical views (e.g. materials database, technology files, layout cells, schematic component library, and macro-model library)
* standard MEMS process-module libraries and standard process flows
* process visualization, simulation, synthesis, and technology file generation
* 3D solid model generation, visualization, and animation
* fast modeling and verification tools; coupled-domain numerical analysis (e.g. finite-element method, boundary-element method)
* MEMS schematic capture
* mixed-signal, multi-level, multi-domain behavioral simulation
* layout of arbitrarily shaped objects with design rule checking
* parameter extraction from layout
* layout synthesis and verification
* cross-linking of manufacturing, physical, and behavioral variables
By providing circuit-level representations, MEMS design tools will enable rapid, intuitive exploration and analysis of the design space for large mixed-signal multi-domain systems. The high-level tools will speed up the design cycle from the order of several months to days.
REFERENCES
[1] S. D. Senturia, R. Harris, B. Johnson, S. Kim, K. Nabors, M. Shulman, and J. White, "A Computer-Aided Design System for Microelectromechanical Systems," Journal of Microelectromechanical Systems, v.1, no.1, pp. 3-13, 1992.
[2] J. R. Gilbert, R. Legtenberg, and S. D. Senturia, "3D Coupled Electro-mechanics for MEMS: Applications of CoSolve-EM," Proceedings of the IEEE Micro Electro Mechanical Systems Workshop, Amsterdam, The Netherlands, pp. 122-127, Jan. 1995.
[3] MEMCAD 2.0, Microcosm Technologies, Inc., 201 Willesden Dr., Cary, NC 27513.
[4] S. B. Crary, O. Juma, and Y. Zhang, "Software Tools for Designers of Sensor and Actuator CAE Systems," Technical Digest of the IEEE Int. Conference on Solid-State Sensors and Actuators (Transducers '91), San Francisco, CA, pp. 498-501, June 1991.
[5] B. Puers, E. Petersen, and W. Sansen, "CAD Tools in Mechanical Sensor Design," Sensors and Actuators A, v.A17, pp. 423-429, 1989.
[6] B. Folkmer, H.-L. Offereins, H. Sandmaier, W. Lang, P. Groth, and R. Pressmar, "A Simulation Tool for Mechanical Sensor Design," Sensors and Actuators A, v.A32, pp. 521-524, 1992.
[7] J. G. Korvink, An Implementation of the Finite Element Method for Semiconductor Sensor Simulation, Ph.D. Thesis, Swiss Federal Institute of Technology (ETH), Zurich, Switzerland, Nov. 1993.
[8] P. M. Osterberg and S. D. Senturia, "MEMBUILDER: An Automated 3D Solid Model Construction Program for Microelectromechanical Structures," Technical Digest of the 8th Int. Conf. on Solid-State Sensors and Actuators (Transducers '95), Stockholm, Sweden, v.2, pp. 21-24, June 1995.
[9] IntelliCAD, IntelliSense Corporation, 16 Upton Dr., Wilmington, MA 01887.
[10] L-Edit, Tanner Research, 180 North Vinedo Ave., Pasadena, CA 91107.
[11] B. Gogoi, R. Yeun, and C. H. Mastrangelo, "The Automatic Synthesis of Planar Fabrication Process Flows for Surface Micromachined Devices," Proceedings of the IEEE Micro Electro Mechanical Systems Workshop, Oiso, Japan, pp. 153-157, Jan. 1994.
[12] D. A. Koester, R. Mahadevan, and K. W. Markus, Multi-User MEMS Processes (MUMPs) Introduction and Design Rules, available from MCNC MEMS Technology Applications Center, 3021 Cornwallis Road, Research Triangle Park, NC 27709, rev. 3, Oct. 1994, 39 pages.
[13] G. K. Fedder, S. Santhanam, M. L. Reed, S. C. Eagle, D. F. Guillou, M. S.-C. Lu, and L. R. Carley, "Laminated High-Aspect-Ratio Microstructures In A Conventional CMOS Process," Proceedings of the IEEE Micro Electro Mechanical Systems Workshop, San Diego, CA, pp. 13-18, Feb. 1996.
[14] Micromachines Program, available from Multi-Project Circuits (CMP) Service, 46, avenue Felix Viallet, 38031 Grenoble Cedex, Oct. 1994, 29 pages.
The Need for Composable CAD for Composite Systems
Randolph E. (Randy) Harr
Electronics Technology Office, Defense Advanced Research Projects Agency
3701 N. Fairfax Drive, Arlington, VA 22203-1714
rharr@darpa.mil
In the near future, it will be practical to fabricate microelectromechanical systems (MEMS), electronic, electro-optical, and micro-fluidic devices on a monolithic substrate or highly integrated module. In order to take advantage of these emerging micro-devices and manufacturing processes, we need to develop the design technology (tools, methodology, and architectures) to support device and systems design of mixed-technology, highly integrated systems. Of particular interest is design technology to support composite systems composed of hundreds to millions of tightly coupled, mixed-technology devices. Such systems are envisioned to be completely self-contained with integrated control circuitry, sensors, and interface logic. Actuation, displays and energy storage devices might also be included. The mixed-domain (kinematic, electric, electrostatic, and fluidic) analysis of micromachined, fabricated and possibly assembled devices, systems of devices and corresponding electronic circuits needs to be developed to support the design of these composite systems. Technical challenges that need to be addressed by such a design environment are:
* Physical 3D distributed-element modeling and design rule checking (especially against manufacturing process rules) tools and composable interface standards for mixed-technology devices and interconnect;
* Parameter-based, primitive device models and material support libraries and standards to allow distributed and lumped element modeling and development of composite systems;
* Development of new, usable languages (textual and/or graphical) for system specification, analysis, and design refinement, allowing partitioning of a design down to realizable physical devices and interconnect from implementation-independent specifications;
* Development of new algorithms and "backplanes" to allow the simultaneous manipulation (modeling, analysis and results feedback: tuning) of the physical (distributed) and functional (lumped element) views of the system; and
* Robust physical design in light of expected sensitivity to processing tolerances.
The overall need is to develop composable Computer-Aided Design (CAD) systems which allow the integration of all the mixed domains of analysis into a common architecture of tools. That is, an architecture which would allow a CAD system to be constructed from tightly integrated, interoperable tools from each domain of interest in a particular design situation. In this way, tools are more easily composed or integrated to allow multi-domain analysis of heterogeneous element systems. Designers who are not experts in any particular domain of technology can then compose highly integrated, mixed-technology composite systems.
The design of micro-fluidic components and systems for controlled on-chip processing of biological and chemical fluids is an example system focus to consider. One can envision needing models which support the analysis of flow for homogeneous chemicals and complex biological fluids. A challenge for this area is the modeling and analysis of mixed electronic control, MEMS sensing and actuation, and micro-fluidic systems where a) the design of interconnect is comprised of electrical, fluid, and gaseous conductors, b) the composition of devices consists not only of switching electronic circuits but valves, pumps and storage reservoirs, and c) the design, analysis and inclusion as primitive components of testing chambers with embedded sensors is considered.
The need to design composite systems poses many new challenges for physical CAD design systems, where the domains are not simply the electrical and abstract physical but also electrostatics, heat transfer, fluid flow, and mechanical coupling. There are many more views, abstractions, design parameters and couplings to consider to create an optimized, manufacturable system.
STRUCTURED DESIGN METHODS FOR MEMS Ramaswamy Mahadevan
MCNC, MEMS Technology Applications Center
PO Box 12889, Research Triangle Park, NC 27709 USA
[email protected]
http://mems.mcnc.org/
ABSTRACT Is there a sufficient similarity between MEMS and VLSI systems to borrow from VLSI design techniques? Current MEMS CAD tools predominantly concentrate on the analysis domain and generally lack any synthesis or top-down system design and optimization capabilities. Some of the unique characteristics, constraints, and difficulties associated with the design of MEMS systems are discussed, and certain key areas that should be addressed are highlighted.
1. MEMS DESIGN CONSIDERATIONS
The VLSI revolution was enabled by the separation of the design and fabrication phases in the development of an integrated circuit. This was achieved through the standardization of a simplified interface between designers and fabricators [1] and the fact that a two-dimensional shape representation adequately modeled the 2.5-dimensional IC fabrication technology. The choice of geometrical shapes that represent the final view of the IC chip made the fabrication-line-dependent features and variations invisible to the IC designers and resulted in portable designs [2]. Unlike VLSI, in MEMS the 2-D shape representation alone is not sufficient to model the electromechanical behavior of the system. It is necessary to incorporate process cross-sectional information as well to generate the 3-D structures that will result when using a given mask layout and process combination. This requires the use of suitable device and technology models and representations. A few existing MEMS CAD packages use mask and process descriptions to generate the 3-D structure that would result [3], [4]. However, there is a need to standardize the design and process description and database formats. Selected formats/representations and mask pattern generation software must support arcs and freeform structures in an efficient and accurate manner. Separation of form and function is not always present in MEMS fabrication techniques [5]. For surface-micromachined devices, the function of a device may be considered to be relatively independent of its position and orientation. However, the relative positions and orientations of devices in a bulk micromachining [6] process are critical to determining the structural shapes and functional behavior of the device. MEMS wafer fabrication and design may be considered to be relatively cleanly separated where surface-micromachined processes are concerned. In fact, independence of the fabrication process from the mask pattern is generally the ideal sought. Such a separation is also possible for MEMS processes such as LIGA that incorporate electroplating, even though the fabricated device structure can depend on the pattern and fill factor of the mask. However, in bulk-micromachined MEMS devices the fabricated structure depends on the orientation, location, and size of the layout pattern, making the separation of design and fabrication difficult. VLSI systems use a small number of elements such as transistors, capacitors and resistors to build all other functional blocks and systems. Unfortunately MEMS does not appear to have a small basis set of elements that can be utilized to synthesize any generic system functionality with a similar degree of flexibility. However, it should be possible to cover a moderately large functional design space using multiple instances and interconnection of elements from a set of basic electromechanical elements. Examples of such approaches are available in mechanical design and are used for synthesis of hydraulic systems, articulated mechanisms, and compliant mechanisms [7]. Currently, MEMS design is akin to analog VLSI, printed circuit board, and multi-chip module
design. That is, functional blocks like actuators, sensors, mechanical suspensions, springs, bearings, etc. are used in lieu of operational amplifiers, capacitors, resistors, and switches, and interaction between blocks must be taken into account. In addition, packaging is integral to the design of a MEMS system.
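One concrete way to reuse circuit-level machinery for such functional blocks is the classical force-voltage analogy (mass to inductance, damping to resistance, compliance to capacitance); the Python sketch below applies it to emit a generic SPICE-style branch for a spring-mass-damper block. The element values are placeholders and the netlist card syntax is illustrative rather than tied to any particular simulator.

# Sketch: map a lumped spring-mass-damper block onto an equivalent electrical
# branch via the force-voltage analogy, then emit a generic SPICE-like netlist.
# Values are placeholders; the card syntax is illustrative, not tool-specific.

def mechanical_to_electrical(mass, damping, stiffness):
    """Force-voltage analogy: L = m, R = b, C = 1/k (current plays velocity)."""
    return {"L": mass, "R": damping, "C": 1.0 / stiffness}

def emit_branch(name, elems, node_in, node_out):
    """Series R-L-C branch between two nodes, one card per element."""
    mid1, mid2 = f"{name}_a", f"{name}_b"
    return [
        f"R{name} {node_in} {mid1} {elems['R']:.4g}",
        f"L{name} {mid1} {mid2} {elems['L']:.4g}",
        f"C{name} {mid2} {node_out} {elems['C']:.4g}",
    ]

if __name__ == "__main__":
    # Placeholder accelerometer-like block: m = 0.5 ug, b = 2e-6 N*s/m, k = 5 N/m
    elems = mechanical_to_electrical(mass=5e-10, damping=2e-6, stiffness=5.0)
    for card in emit_branch("sense", elems, node_in="force", node_out="0"):
        print(card)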
2. CAD FOR MEMS Despite these differences, MEMS can borrow from many of the VLSI design methods and CAD tools. Top-down and bottom-up design, synthesis, device and/or block level detailed simulation, functional or behavioral models and system level simulations are all equally applicable to MEMS. Even a version of extraction and layout-schematic comparison is possible for MEMS. The main obstacle hindering the implementation of such an approach is the multiplicity of physical phenomena that may be utilized in a MEMS system requiring (possibly coupled) mechanical, electromagnetic, fluid, and thermal analysis. For detailed simulations (equivalent to circuit or device level simulations in VLSI) this can be done using existing finite element or boundary element solvers. This requires an open CAD framework with a database format that includes mask layout and process technology information (process cross section and related mechanical properties) that can be easily exported/imported to/from the appropriate solver. For system level design, functional representations and macro-models would be necessary in association with a bond graph, signal flow graph or differential equation based simulator to model all the phenomena. Existing simulators such as SPICE may be used for functional or behavioral simulations of MEMS at the system level. There is also a need for simulators that are faster than FEA (and perhaps less accurate) analogous to switch level and event driven simulators in VLSI design and techniques for simulating the effects of, and sensitivity to process variations in order to assess the manufacturability of a design. The following would help facilitate structured design and virtual prototyping for MEMS.
(a) A layout and process representation that allows efficient representation of arcs and freeform shapes along with process cross-sectional information, material properties, and perhaps even mesh information.
(b) An integral layout and solid modeling/viewing tool to help designers visualize the 3-D structures that would result from the combination of layout and process description. (Visualization of structures generated from a given layout and process appears to be the biggest obstacle for new MEMS designers.)
(c) An open CAD framework and data translators for sharing design information between existing and new MEMS-specific CAD tools.
(d) MEMS-specific solid model, mesh generation, and analysis tools that are optimized for structures with thin layers, and tools to extract macro-models from device-level simulations.
(e) High-level representations and simulators for computationally efficient modeling and optimization of MEMS at the systems level. (These should handle multiple, coupled energy domains and allow the use of sensing, actuation, and electronic subsystems.)
(f) Tools to model the dependence of system performance on process variability.
(g) Tools for the synthesis of mechanisms, structural elements, or a class of MEMS systems using libraries of elements with macro-models.
REFERENCES
[1] Mead, C. and Conway, L., Introduction to VLSI Systems, Addison-Wesley Publishing, 1980.
[2] Sequin, C. H. and McMains, S., "What Can We Learn from the VLSI Revolution," Report of NSF Workshop on Structured Design Methods for MEMS, Nov. 1995.
[3] Gilbert, J. R. et al., "Implementation of a MEMCAD System for Electrostatic and Mechanical Analysis of Complex Structures from Mask Descriptions," Proceedings, IEEE Micro Electro Mechanical Systems Workshop, Fort Lauderdale, pp. 207-212, Feb. 1993.
[4] Hubbard, T. J. and Antonsson, E. K., "Emergent Faces in Silicon Micromachining," Journal of MEMS, vol. 3-1, pp. 19-28, Mar. 1993.
[5] Bryzek, J., Petersen, K. and McCulley, W., "Micromachines on the March," IEEE Spectrum, vol. 31, no. 5, pp. 20-31, May 1994.
[6] Petersen, K. E., "Silicon as a Mechanical Material," Proceedings of the IEEE, vol. 70, no. 5, pp. 420-457, May 1982.
[7] Ananthasuresh, G. K., Kota, S. and Gianchandani, Y., "A Methodical Approach to the Design of Compliant Mechanisms," Technical Digest, IEEE Solid-State Sensor and Actuator Workshop, Hilton Head, pp. 189-192, June 1994.
AFFORDABLE MEMS CAD TOOLS
John E. Tanner, Peter T. Parrish, Mary Ann C. Maher, Rodney Morison
Tanner Research, Inc., 180 N. Vinedo Avenue, Pasadena, CA 91104 USA
John.Tanner@tanner.com, Peter.Parrish@tanner.com, MaryAnn.Maher@tanner.com, Rod.Morison@tanner.com
ABSTRACT The large number of integrated circuit designers supports high-performance yet affordable design tools. The advent of MEMS foundries has enabled an increase in the number of MEMS designers. Tanner Research proposes to develop a comprehensive MEMS tool suite by extending our existing integrated circuit tools and creating new tools.
1. THE TANNER RESEARCH BUSINESS MODEL
Tanner Research is dedicated to providing low-cost, high-productivity software tools and libraries to the engineering community. In order to keep prices low, we must concentrate on high-volume markets in order to generate enough revenue to support continued rapid product development. To keep volumes up we provide software for all the most popular platforms, including PC, Macintosh, and Unix workstations. Based on this low-cost, high-volume strategy, Tanner Research has recently taken the lead position with an installed base of 6,500 licenses.
1.1. Traditional MEMS Design
Until recently, the market for MEMS CAD tools was too limited to fit our company business model. Most MEMS design was done by large companies that owned their own fabrication line and could afford to develop a custom fabrication process for each MEMS device produced. The need for both process knowledge and capital equipment resulted in such a barrier to entry that there were very few MEMS designers. With a small customer base over which to amortize tool development costs, tool vendors by necessity had to charge high prices.
1.2. New MEMS Design
The emergence of standard MEMS processes, offered to the community by way of foundry services, has changed the outlook of MEMS design dramatically. A growing number of MEMS designers have successfully used the MUMPS process from MCNC, the ADXL50 process from Analog Devices, and CMOS from MOSIS with post-processing (e.g. NIST). Foundries eliminate the requirement for designers to own expensive equipment. Foundry interfaces decouple the designer from fabrication details. The reduction of these barriers increases dramatically the number of potential MEMS designers. The high price of traditional CAD tools also represents a barrier to the growth of the MEMS design community, but fortunately the potential for volume sales encourages low-cost CAD vendors like Tanner Research to enter the market.
1.3. Price and Quality
Low-cost tools appeal to all segments of the market, including those customers who could afford to pay much more for them. Everyone benefits from ease of use, intuitiveness of operation, freedom from bugs, and integration with other tools - features that are required in the low-cost/high-volume market. At Tanner Research these values are driven by the economic necessity of controlling our customer support costs and maintaining a solid reputation across a large, demanding user base. Even high-end customers that require certain specialized niche tools often mix them with moderately priced tools to reap the benefits of reduced overall tool costs, reduced training and maintenance costs, and higher productivity.
2. PROPOSED MEMS CAD TOOL SUITE
Tanner Research proposes to develop a suite of low-cost MEMS design tools and libraries by extending our existing integrated circuit and multi-chip module design tools. Figure 1 shows a block diagram of the proposed tool suite and the major data paths between tools.
2.1. Existing IC Tools
Our existing integrated circuit tools include many modules that are of direct value to MEMS designers, such as schematic and layout editing. Our automatic place-and-route tool includes standard cell placeand-route, pad frame generation, and pad routing.
Our verification suite includes design rule checking, extraction of circuit topology from layout geometry, and comparison of circuit topologies from schematic and layout. We presently support a C-language compiled interface to our layout tool, with an interpreted version planned for the next major release. This interface allows users to add custom layout generators to the tool. Cross sections provide a valuable view of MEMS designs. Our layout editor can automatically construct and display cross sections. The engineer indicates a cut line on their layout and a window pops up displaying the cross section view. The software uses a simplified process definition file to build up the view, step by step. We presently have MEMS process definition files for the MUMPS, ADXL50, and MOSIS/NIST processes. While not a true three-dimensional tool, our cross-section viewer represents a significant and useful step along the path to true 2D to 3D conversion and visualization. Our schematic editor allows the engineer to easily create custom symbols and connect them together. The schematic editor can produce netlists for input to the standard cell place-and-route, the layout verification tools and for simulation in our circuit simulator. Our circuit simulator allows the user to add custom models that could represent MEMS components via a C-language interface. We are presently working with UCLA to develop design examples where precharacterized MEMS cells are connected on a schematic and the netlist that is passed to the circuit simulator contains both electrical and mechanical nodes and behavior models for a combined electrical-mechanical simulation.
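The step-by-step construction of a cross section from a simplified process description can be sketched in Python as follows; the step format, layer names, and mask windows are invented for illustration and do not reflect the actual Tanner process definition files.

# Sketch of building a 1-D layer stack along a cut line from a simplified,
# hypothetical process description (deposit/etch steps). Not the real format.

def build_stack(process, masks, x):
    """Return the layer stack at position x along the cut line (bottom-up)."""
    stack = [("substrate", 0.0)]
    for step in process:
        if step["op"] == "deposit":
            stack.append((step["layer"], step["thickness"]))
        elif step["op"] == "etch":
            # remove the named layer where the mask is open at x
            if any(lo <= x <= hi for lo, hi in masks.get(step["mask"], [])):
                stack = [(name, t) for name, t in stack if name != step["layer"]]
    return stack

process = [
    {"op": "deposit", "layer": "oxide", "thickness": 2.0},
    {"op": "deposit", "layer": "poly1", "thickness": 2.0},
    {"op": "etch", "layer": "poly1", "mask": "POLY1_HOLE"},
    {"op": "etch", "layer": "oxide", "mask": "RELEASE"},
]
masks = {"POLY1_HOLE": [(10.0, 12.0)], "RELEASE": [(0.0, 30.0)]}

for x in (5.0, 11.0):
    print(f"x = {x:4.1f} um:", build_stack(process, masks, x))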
2.2. Existing MCM Tools
Our existing multi-chip module design tools include a finite-element thermal analysis tool and a boundary-element 2D electromagnetic field solver. We have designed these two independent point tools to both operate from a common physical representation and meshing data structure to allow for future co-simulation.
2.3. Extensions and New Tools
To develop a complete MEMS tool suite, we need to extend some of our existing tools and develop new tools. In particular, we hope to work with MCNC to port their CaMEL parameterized MEMS library to work with our language interface. We need to extend our cross-section viewer to full 3D capability and enhance the process modeling and libraries to include anisotropic and isotropic etch models. Our visualization tools must be enhanced to allow for arbitrary cross-section views and full 3D perspective views, both of the raw geometry and of the results of three-dimensional simulation. The core of the complete suite will include tightly coupled mechanical, electrical, and thermal simulation based on a single data structure shared by multiple finite-element and boundary-element solvers. This multi-domain co-simulator represents the largest development cost in our proposed suite. Our circuit simulator, with a custom model capability, will be closely coupled to the finite-element/boundary-element co-simulator to allow portions of a design to be simulated at a high level of detail and other portions to be simulated with less computationally intensive models. Fluid flow analysis and collision detection are each critical simulation components for a segment of MEMS designs and could be included in our suite. Our simulation control capability needs to be developed into a central design manager tool to coordinate the automatic generation of layout, schematics, and netlists for predefined MEMS devices and to coordinate the simulation and display tools. We will include a flexible capability for optimization of MEMS system performance.
2.4. Tool Flow
We expect the source representation within our tools to be in the form of 2D geometry - a combination of hand-drawn polygons, cells from custom libraries or parameterized libraries, and interconnect from automatic placement and routing, with connectivity information coming from schematics. In this tool suite, 3D representations are created by the tools automatically from the 2D geometry by the process simulator. The resulting 3D solid model is passed to the multi-domain co-simulator for analysis. The complete suite will allow the designer to easily traverse from schematic and polygon input to both high-level and detailed simulations and to produce foundry-ready output.
3. CONCLUSION
The existing Tanner Tools are being productively used by a growing number of MEMS designers worldwide. We hope that our implementation of new affordable MEMS-specific capabilities will continue to expand access to MEMS design.
Figure 1: Proposed MEMS tool suite (Layout Editing, Automatic Place-and-Route, and Verification; Process Simulation (2D to 3D Conversion); Mechanical Simulation (FEA); Thermal Simulation (FEA); Electromagnetic Simulation (BEM); Fluid Simulation; 2D and 3D Visualization; Software Interface). Thick borders indicate existing Tanner Tools. Medium borders indicate tools with partial functionality or tools under development. Thin borders indicate proposed Tanner Tools.
WE'RE SOLVING THE WRONG PROBLEMS
Lou Scheffer
Cadence Design Systems, San Jose, California, USA. Email: lou@cadence.com
ABSTRACT
We are working on the wrong problems. We are wasting a lot of time improving things that are good enough while ignoring the things that will determine how we design in a few years. The errors of commission are improving tools that are already good enough, improving tools where improvements are likely to be marginal, and building tools that won't be used in tomorrow's design flows. The errors of omission include not building tools for real problems, not building tools that will work with almost inevitable process changes, and not considering methodology, tools, and process as a whole.
1. INTRODUCTION
A traditional error of armed forces is being perfectly prepared to fight the last war. We are doing exactly the same thing in CAD - we are solving the problems that were important yesterday, and ignoring those that will be important in the future. We are making marginal improvements to algorithms that are already good enough, or building tools that were useful once, but won't be useful when the main task is getting a 50 million gate chip out the door. On the other hand, there are problems either here today, or coming down the road shortly, that the field as a whole is not addressing.
1.1. An analogy with software
Chips are getting bigger at an exponential rate (Moore's law) while being designed by people who are getting better at a much slower rate, if at all. This very simple fact will drastically change the way we design chips, but we are not preparing for it. The software industry has already gone through a similar transition. When computers were new, often the kind of program you could write was limited by the computer or compiler, not by the creator's brainpower. So people in the field spent a lot of time and effort improving compilers, building machines with bigger address spaces, creating subroutine libraries, and so on. They have done an excellent job of this. Just ask yourself when was the last time you could not write the program you wanted because:
* The compiler wasn't fast enough.
* The compiler generated code that didn't quite work on the machine you were using.
* The math functions you wanted to use were not available in your language.
All of these were very real problems when compilers and computers were new, and a lot of very smart people put a lot of hard work into solving them. Now, however, these problems have very little, if any, impact on the code you write. A brilliant person could spend several years improving (for example) register allocation. Although the result would be a welcome improvement, it probably would not change your day-to-day work in the least. The solutions that have been found to these problems are, in a very practical sense, 'good enough'. In fact, you probably compile with '-g', giving up a factor of 2 in both speed and code density, in order to make your programs easier to debug. Nowadays, the problems that worry programmers are of an entirely different nature - design of algorithms, source code control, portability, memory allocation, regression testing, and other high-level concerns. If you spend as much as one week a year on compiler problems you are probably wasting your time. The CAD field (that's us) is at the same point that programmers were years ago. In the past, and maybe the present, the tools limited how much functionality could go on a chip. In the future, between bigger chips, shorter design cycles, and better tools, the limit will shift to brainpower instead of tools. The CAD field as a whole has not acknowledged this change, however - people are still more concerned with finding a slightly better algorithm for some old problem when they should be worrying about how in the world they are going to use 50 million transistors in any reasonable amount of time.
2.
PROBLEMS WE SHOULD NOT BE
WORKING ON There are at least three categories of problems that people are working on that are unlikely to pay high dividends. These are tools that will not be needed in tomorrow's flows, tools where the existing solutions are 'good enough', and tools where additional improvements are likely to be small.
* The compiler wasn't fast enough.
2.1. An example of solving last year's problem Sematech recently asked for bids on portions of a new CAD system. Among the things they wanted was a program that
* The compiler could not compile a program as big as the one you wanted to write.
could analyze all the interconnect on a 28 million gate chip and report any crosstalk problems, using at most 30 hours of CPU time. An analogy to computer science shows how useless this is. What would you say if a salesman came to your office and said, "Here's our great new program. In only 30 hours it examines all your machine code and reports cases where the compiler generated code that won't work due to pipeline interlocks." Your reaction is entirely predictable - you'd say "If I needed this tool at all, what I'd really need is a new compiler."
Another way of looking at this is to ask how you would use it in your design flow. First, you design your 28 million gate chip. Next you synthesize, place, and route it. Now you run this checker. How many nets will it report as having problems? If your methodology knows nothing of crosstalk, probably hundreds of thousands (all the long nets). If your methodology does know about crosstalk, why not tighten up the bounds a little and eliminate the need for the checking step entirely? That's exactly what compiler writers have done. No one will put up with a compiler that generates very fast, but sometimes wrong, code. The key idea is that you only need to back off a little on the performance to get a design that works every time, and most of the analysis can be done up front.
Another example where this is done is design rules. Everyone knows that the yield vs. spacing function is not a step function, that the focus is better in the middle of the chip than in the corners, and so on. In the interests of expediency, however, we make and enforce a simple rule - any spacing less than X is an error; any greater spacing is OK. Of course this gives up some performance, but the simple rule is easy to enforce and good enough.
In computer science, everyone realizes that it is the combination of compiler and machine architecture that determines performance, not either one alone. The chip analogy is a design methodology in combination with tools and a particular process. If done right, the combination of the methodology, tools, and process will generate a working design. The same input, fed to another such combination, will produce another working design that varies only in performance. Since the tools only make sense in combination, for each problem (such as crosstalk) we should ask ourselves whether the best solution is changing the methodology, the tools, or the process. This is seldom discussed, much less analyzed.
2.2. O(N log N) problems
In this field we have a very interesting phenomenon - an O(N) algorithm takes constant time as N increases! This seeming contradiction is easily explained: computer scientists measure execution time as the number of machine instructions, and in fact an O(N) CAD algorithm does take more instructions as N increases. In the CAD field, however, the speed of the available computers goes up as the problems become bigger. If today we can design a 1 million gate chip using a 100 MIP machine with 100 MBytes of main memory, then when the time comes to build a 4 million gate chip, we can expect 400 MIP machines with 400 MB of memory. So the same O(N) algorithm, running on each case, would take the same elapsed time. The fact that workstation size and speed track the size of the design task is of course not coincidence - workstations are built from a small number of current-generation chips. Thus when bigger and faster chips are designed, they are immediately used to build bigger and faster workstations, and the ratio of chip size to computer speed stays roughly constant. Therefore, if a design problem has an O(N) solution, we can consider it solved. In practice, we can consider an O(N log N) problem solved as well, since log(N) goes up by only 50 percent as N goes from 1 million to 1 billion.
An algorithm can achieve O(N) performance in several ways - the basic algorithm can be O(N), or the design problem can be split into many independent problems. If, as the chip gets larger, you just get more of these problems, and not bigger instances of them, then O(N) overall performance can be achieved even if the algorithm itself has worse characteristics. An example of an algorithm that is O(N) in time and space is timing analysis by means of a longest-path algorithm. Examples of algorithms that are intrinsically O(N log N) are minimum spanning tree generation and DRC checking by means of scan-line operations. These operations will continue to be practical for any size of chip. Furthermore, many computations that are difficult have approximate solutions that are O(N log N) or better. For example, travelling salesman and Steiner tree (two NP-complete problems) can be approximated by O(N log N) algorithms for at most a small constant penalty. Examples of algorithms that are decomposable are logic synthesis and detailed routing. In each case the basic algorithm is performed on small chunks of the design, and the results are stitched together to form the complete solution. A further advantage of this type of algorithm is that it can easily take advantage of parallel processing.

2.3. Problems where big improvements are unlikely
There are a number of CAD problems where the solution space is smooth. In these cases you would expect a heuristic to get close to the global minimum, although it may not achieve it exactly. Buffer sizing is an example of such a problem. If the buffer sizes are not quantized, a small change in any buffer size can be made independently of any other buffer, and the system timing will only vary a small amount as a result. Compaction is another example of such a smooth problem. For these problems, we would expect heuristics to do well, and they do. So fancier optimization methods will not buy us much.
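As a small illustration of why simple heuristics suffice on such smooth problems, consider a toy buffer-sizing example. The delay model, the constants, and the helper names (chain_delay, size_buffers) below are all illustrative assumptions, not anything from this paper: a greedy coordinate-descent sweep over the buffer sizes gets essentially to the minimum because a small change in any one size changes the delay only slightly.

# Toy model: Elmore-style delay of a buffer chain driven by a unit-size driver.
# All constants are illustrative.
def chain_delay(sizes, c_load=64.0):
    delay, prev = 0.0, 1.0
    for s in sizes:
        delay += s / prev          # stage resistance ~ 1/prev, input cap ~ s
        prev = s
    return delay + c_load / prev   # the last stage drives the output load

# Greedy heuristic: sweep the buffers one at a time, nudging each size up or
# down by a small factor whenever that reduces the delay.
def size_buffers(sizes, sweeps=200, step=1.05):
    sizes = list(sizes)
    for _ in range(sweeps):
        for i in range(len(sizes)):
            best = chain_delay(sizes)
            for factor in (step, 1.0 / step):
                trial = sizes[:]
                trial[i] = max(1.0, trial[i] * factor)
                if chain_delay(trial) < best:
                    sizes, best = trial, chain_delay(trial)
    return sizes

print(size_buffers([2.0, 2.0, 2.0]))   # settles near geometrically tapered sizes

With these toy constants the sweep converges toward the familiar geometrically tapered sizing; no sophisticated optimizer is needed, which is exactly the point being made.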
3. PROBLEMS WE ARE NOT WORKING ON, BUT SHOULD BE
The flip side of the coin is important problems that are not getting addressed at all, or not with the vigor they deserve.

3.1. Looking under the lamp post
There is an old joke about a drunk looking under a lamppost for his car keys. Someone asks him if he thinks he lost them there, and he replies, "No, but the light is better here." The same thinking applies to CAD problems, but with less excuse. For example, people are spending huge amounts
of time and energy trying to get accurate cross coupling capacitances. If you ask users what they do with these capacitances, they will tell you they calculate delays. When you point out that effective coupling capacitance for delay calculation can vary by ±100% depending on the behavior of the neighboring signal, they shrug their shoulders and say that it's too complicated, or someone else's problem. So CAD tool developers continue to work on this subset of the real problem because it is well defined and has a clean analytical solution. The real users, who have to produce working chips, then must take their carefully derived estimates, and multiply them by some random number ranging from 1 to 2, derived from a combination of hope and guesses. If they make it 2, their circuit will be foolproof, but they will leave half the potential performance on the table. If they make it 1, they will be right on the average, but a worst case may cause their chip to malfunction.
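To make the plus-or-minus 100% figure concrete (this is the standard switching-window argument, added here as an illustration rather than taken from the paper): if a victim net has ground capacitance C_g and coupling capacitance C_c to a neighbor, the capacitance that matters for delay is roughly C_eff = C_g + k * C_c, where k is about 0 if the neighbor switches the same way at the same time, about 1 if the neighbor is quiet, and about 2 if it switches the opposite way. The coupling term really can swing by a factor of two around its nominal value, which is exactly the uncertainty that the multiply-by-a-guess practice described above papers over.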
3.2. Preparing for process improvements
There are some process improvements that are not too far-fetched, for which the CAD community is unprepared.

3.2.1. Multiple active layers
Someday soon, a process wizard will figure out how to deposit a decent layer of crystalline silicon on top of a chip that already has a few interconnect layers on it. This will allow multiple active layers, drastically increasing the achievable density of chips. However, at the very least, placement, routing, and compaction will need to be rewritten to accommodate this.

3.2.2. Superconducting interconnect
At some point, at least for stationary applications, it will make sense to build systems working at liquid nitrogen temperatures. This is mostly waiting for improvements in refrigerator technology. Working at 77 K has a number of advantages - CMOS is very fast, subthreshold effects are reduced, lower supply voltages are practical, and you can use superconducting interconnect. Unfortunately, we have no idea how to analyze superconducting connections. If R goes to 0, what is left is the self and cross coupling inductances. These are much harder to compute than resistances or capacitances, and we don't even have decent approximations for them.

3.3. Methodology
Instead of building (for example) a new signal integrity checker, we should be looking at methodologies designed to prevent such problems in the first place. For example, three ways to eliminate crosstalk errors might be:
* Every Nth signal on every Mth layer must be a power or a ground.
* Every gate has a limit to the length of connection it can drive. If this is exceeded, you must insert a repeater.
* Every line over a certain length is routed as a differential pair.
It would be an excellent research project to analyze under which conditions each of these solutions will work, what the overhead in size and performance of each solution is, how easy the condition is to meet and verify, and so on.

3.4. Process portability
To use 50 or 100 million transistors in any reasonable amount of time, we'll need to build chips that include large chunks of existing designs (cores). We could map at the process level, the symbolic layout level, the relative placement level, the structural gate level, the RTL level, or any combination of these. What are the tradeoffs between these methods? Which ones work when? These are the kinds of questions that will be crucial in the future, but are not being seriously addressed now.

4. CONCLUSIONS
As a whole, the CAD research community is solving yesterday's problems rather than tomorrow's. We need to allocate a greater proportion of our resources towards solving the design problems of chips that are much bigger than those of today. The incentive to do so is clear - the people who solve these problems will be the heroes of the academic world and the successes of the commercial arena.
VLSI CIRCUIT PARTITIONING BY CLUSTER-REMOVAL USING ITERATIVE IMPROVEMENT TECHNIQUES

Shantanu Dutt(1) and Wenyong Deng(2)
(1) Department of Electrical Engineering, University of Minnesota, Minneapolis, MN 55455, USA; [email protected]
(2) LSI Logic Corporation, Milpitas, CA 95035, USA; [email protected]
ABSTRACT
Move-based iterative improvement partitioning methods such as the Fiduccia-Mattheyses (FM) algorithm [3] and Krishnamurthy's Look-Ahead (LA) algorithm [4] are widely used in VLSI CAD applications, largely due to their time efficiency and ease of implementation. This class of algorithms is of the "local improvement" type. They generate relatively high quality results for small and medium size circuits. However, as VLSI circuits become larger, these algorithms are not as effective on them as direct partitioning tools. We propose new iterative-improvement methods that select cells to move with a view to moving clusters that straddle the two subsets of a partition into one of the subsets. The new algorithms significantly improve partition quality while preserving the advantage of time efficiency. Experimental results on 25 medium to large size ACM/SIGDA benchmark circuits show up to 70% improvement over FM in cutsize, with an average per-circuit percent improvement of about 25% and a total cut improvement of about 35%. They also outperform the recent placement-based partitioning tool Paraboli [11] and the spectral partitioner MELO [12], by about 17% and 23% respectively, with less CPU time. This demonstrates the potential of iterative improvement algorithms in dealing with the increasing complexity of modern VLSI circuitry.

1. INTRODUCTION
The essence of VLSI circuit partitioning is to divide a circuit into a number of subcircuits with minimum interconnections between them. This can be accomplished by recursively partitioning a circuit into two parts until we reach the desired level of complexity. Thus two-way partitioning is a basic problem in circuit partitioning and placement. Kernighan and Lin [1] proposed the well-known KL heuristic for graph partitioning. The KL algorithm starts with a random initial two-way partition and proceeds by swapping pairs of cells iteratively. Schweikert and Kernighan [2] extended KL to hypergraphs so that it can partition actual circuits. Fiduccia and Mattheyses [3] reduced the complexity of the algorithm to linear time with respect to the number of pins in the circuit; this is done by moving one cell at a time and using an efficient bucket data structure. Krishnamurthy [4] enhanced FM by adding higher-level lookahead gains and improved the results for small circuits. Recently, a number of clustering algorithms [9, 10, 11, 12, 15] have been proposed and excellent results have been obtained.
FM and LA are the most commonly used two-way partitioning algorithms, largely due to their excellent run times, simple implementations, and flexibility. However, this class of iterative improvement algorithms has a common weakness, viz., they only find solutions corresponding to local minima. Because of their iterative improvement nature, they can only evolve from an initial partition through very shortsighted moves. Thus the results strongly depend on the initial partition. In order to get a local minimum that is close to the optimum partition, multiple runs on randomly generated initial partitions are needed. As the circuit size becomes large, the probability of finding a good local minimum in one run drops significantly. This makes FM an unattractive choice for partitioning large circuits. As will become clear later, FM gives good results on small to medium size circuits, but performs very poorly on large circuits. Some clustering [8] or compaction [15] techniques have been proposed to remedy the above weakness; very good results have been obtained at the cost of considerable CPU time increases and implementation complexities.
In this paper, we propose a technique that significantly improves the ability of iterative improvement methods like FM and LA to find good local minima. The new technique pays more attention to the neighbors of moved cells and encourages the successive moves of closely connected cells. This implicitly promotes the move of an entire densely connected group (a cluster) into one subset of the partition. The large reduction in both the minimum and average cutsize of multiple runs indicates that the new technique is a more robust and stable approach. The increase in implementation complexity and run time is minimal. We also propose a sophisticated extension of this basic technique, which explicitly identifies clusters during the move sequence and dynamically adapts the move sequence to facilitate the move of clusters into one subset of the partition. Very good results have been obtained at reasonable CPU times.
In the next section, we briefly describe the FM and LA algorithms and point out their shortcomings. In Section 3 we present our rationale for the new technique mentioned above, and also propose a cluster-detecting algorithm based on iterative improvement methods. Extensive experimental results are presented in Section 4, along with discussions. Conclusions are in Section 5.
2. PREVIOUS ITERATIVE IMPROVEMENT ALGORITHMS
A circuit netlist is usually modeled by a hypergraph G = (V, E), where V is the set of cells (also called nodes) in the circuit, and E is the set of nets (also called hyperedges). We will represent a net n_i as the set of cells that it connects. A two-way partition of G is two disjoint subsets V1 and V2 such that each cell v in V belongs to either V1 or V2. A net
is said to be cut if it has at least one cell in each subset and uncut otherwise. We call this the cutstate of the net. All the nets that are cut form a set called the cutset. The objective of two-way partitioning is to find a partition that minimizes the size of the cutset (called the cutsize). Usually there is a predetermined balance criterion on the sizes of the subsets V1, V2, for example 0.45 <= |V_i|/|V| <= 0.55, where i = 1, 2.
The FM algorithm [3] starts with a random initial partition. Each cell u is assigned a gain g(u), which is the immediate reduction in cutsize if the cell is moved to the other subset of the partition:

    g(u) = \sum_{n_i \in E(u)} c(n_i) - \sum_{n_j \in I(u)} c(n_j)                (1)

where E(u) is the set of nets that will be immediately moved out of the cutset on moving cell u, and I(u) is the set of nets that will be newly introduced into the cutset. Put another way, a net in E(u) has only u in u's subset, and a net in I(u) has all its cells in u's subset. c(n_i) is the weight (cost) of net n_i, which is assumed to be unity in this paper unless otherwise specified.
The goal of FM is to move one cell at a time from one subset to the other in an attempt to minimize the cutsize of the final partition. The cell being selected for the current move is called the base cell. At the start of the process, the cell with the maximum gain value in both subsets is checked first to see if its move will violate the balance criterion. If not, it is chosen as the base cell; otherwise, the cell with maximum gain in the other subset is chosen as the base cell. The base cell, say u_1, is then moved to the other subset and "locked" - the locking of a moved cell is necessary to prevent thrashing (a cell being moved back and forth) and being trapped in a bad local minimum. The reduction in cutsize (in this case, the gain g(u_1)) is inserted in an ordered set S. The gains of all the affected neighbors are updated - a cell v is said to be a neighbor of another cell u if v and u are connected by a common net. The next base cell is chosen in the same way from the remaining "free" (unlocked) cells, and the move process proceeds until all the cells are moved and locked. Then all the partial sums S_j = \sum_{t=1}^{j} g(u_t), 1 <= j <= n, are computed, and p is chosen so that the partial sum S_p is the maximum. This corresponds to the point of minimum cutsize in the entire move sequence. All the cells moved after u_p are reversed to their previous subset, so that the actually moved cells are {u_1, ..., u_p}. This whole process is called a pass. A number of passes are made until the maximum partial sum S_p is no longer positive. This is a local minimum with respect to the initial partition [V1, V2].
The FM algorithm has been criticized for its well-known shortsightedness [4, 13] - it moves a cell based on the immediate decrease in cutsize. Thus it tends to be trapped in local minima that strongly depend on the initial random partition. We will later point out some other consequences of this shortsightedness that are related to the removal of natural clusters, which occur in real circuits, from the cutset. The FM gain calculation only considers critical nets whose cutstate will change immediately after the move of the cell. It is conceivable that there will be many cells having the same gain value, since the gain is bounded above by p_max and below by -p_max, where p_max is the maximum degree of a cell. Krishnamurthy has proposed a lookahead gain calculation scheme which includes less critical nets [4]. In addition to the FM gain, higher order gains are used to break ties.

3. CLUSTERING-BASED ITERATIVE IMPROVEMENT METHODS

3.1. Case for a Cluster-Oriented Approach
A real VLSI circuit netlist can be visualized as an aggregation of a number of highly connected subcircuits or clusters. This fact has motivated the proposition of many clustering-based algorithms. It is conceivable that there are many levels of clusters with different degrees in the density of their connectivities; a small group of densely interconnected cells may be part of a larger but less densely connected cluster. The goal of two-way partitioning is to determine a cut that goes through the most weakly connected groups.
Iterative improvement algorithms like FM and LA start with a randomly assigned partition that results in a binomial distribution of cells in V1 and V2. In such a distribution, the probability of finding r (r <= m) cells in V1 and m - r cells in V2 in some group of m cells is:

    P(m, r) = \binom{m}{r} p^r q^{m-r}                (2)
where p (q) is the probability of a cell being assigned to V1 (V2). For a random partition, p = q = 0.5. The probability distribution maximizes at r = m/2 with standard deviation sigma = sqrt(m)/2. For example, a group of 100 cells will have the expected distribution of 50 cells in each subset with a standard deviation of 5 cells. Therefore, for a cluster with a fair number of cells, there is a very high probability that it will initially be cut. Hence in an initial random partition, most clusters will straddle the cutline, which is an imaginary line that divides the cells into the two subsets of the partition. This situation is illustrated in Figure 1(a).
For an iterative improvement algorithm to succeed, it must be able to pull clusters straddling the cutline into one subset. It is easy to visualize that there will be many cells in different clusters with similar situations and therefore the same gain values. Since there is no distinction between cells with the same gain values but belonging to different clusters, the FM algorithm may well start to work on many clusters simultaneously, trying to pull them out of unfavorable situations. However, cell movement is a two-way process, and while some cells in a cluster are moved from V2 to V1, other cells in the same cluster might be moved from V1 to V2. Thus the cluster can be locked in the cutset at an early stage - a cluster is said to be locked in the cutset if it has locked cells in both subsets of the partition. This is the situation of clusters C1 and C2 in Figure 1(b). Unfortunately, the moves made at an early stage (before the maximum partial sum point) are the actual moves. Hence these clusters will not be pulled out from the cutset in the current pass, and in later passes the same scenario may reappear.
Undoubtedly, FM does a good job of pulling small and highly connected clusters to one side. This is because many nets of a small cluster will be incident on each cell of the cluster, and thus their gains capture significant information about the cluster. Once a cell is moved from V1 to V2, its neighbors in V1 might have their gains raised and hence have a better chance of being moved to V2, while its neighbors in V2 will get a negative gain contribution from nets connected to the moved cell and hence tend to stay in V2. Therefore a small cluster has a good likelihood of being moved entirely to V2. However, when the size of a cluster is large, its structural properties cannot be simply represented in the immediate connectivities of cells such as in the FM and LA gain calculations; for example, the cluster may consist of subclusters. Although it might be easy for FM to move out subclusters, it is also very likely that it will lock the bigger clusters in the cutset.

Figure 1. (a) In the initial partition, clusters straddle the cutline as a result of random cell assignment. (b) FM locks clusters on the cutline by moving cells within one cluster in both directions. (c) Better approaches pull clusters out from the cutline by moving cells within one cluster in a single direction.

To demonstrate the validity of the above observations, let us consider the simple graph shown in Figure 2(a), which has a two-level clustered structure. Cells u11, u21, u31, and u41 form a strongly connected subcluster C1'; the other subclusters C1'', C2', and C2'' are similarly formed. Subclusters C1' and C1'' constitute a higher-level but less densely interconnected cluster C1; similarly, C2 is composed of subclusters C2' and C2''. After a random partition, all the clusters as well as the subclusters straddle the cutline. The initial FM gain calculation gives the gain values indicated beside each cell in the figure. Cells u11 and u21 belong to different clusters but are in similar situations, and hence have the same gain of 4. We assume a 50%-50% balance criterion in which cells move alternately between the two subsets V1 and V2. The first four moves are shown by the numbered dashed arrows in Figure 2(a). FM quickly reaches the local minimum of cutsize 4 (further moves will finally be reversed, since the current point is the maximum partial sum point). While FM succeeded in moving out subclusters, it locked the higher-level clusters C1 and C2 on the cutline. It therefore missed the optimal cut of one that can easily be identified in the figure.
It is obvious now that a mechanism is needed to aid iterative improvement algorithms in pulling clusters out of the cutset. We propose a cluster-oriented framework for gain calculation and base-cell selection that focuses on nets connected to moved cells. It can be overlaid on any iterative-improvement algorithm with any cell-gain calculation scheme. It implicitly promotes the move of an entire cluster by dynamically assigning higher weights to nets connected to recently moved cells. This greatly enhances the probability of finding a close-to-optimum cut in a circuit. We also propose an extended version of this algorithm that tries to identify clusters explicitly and then move them out from the cutset.

Figure 2. (a) FM only pulls out subclusters and finds a local minimum in cutsize. (b) The new approach pulls out clusters and finds the optimum cut.

3.2. Considering Clusters in Iterative Improvement Methods
We first re-examine the cell gain calculation of FM. Initially, cell gains are calculated based on the immediate benefits of moving cells. After a cell is moved, the gains of its neighbors are updated. At any stage in the move process, the total gain of a cell can be broken down as the sum of the initial gain component and the updated gain component. The total gain indicates the overall situation of a cell, while the updated gain component reflects the change in the cell's status due to the movements of its neighbors.
An intuitive solution to the problem of an iterative-improvement scheme "jumping around" and working on different clusters simultaneously, as illustrated in Figs. 1(b) and 2(a), thus locking them in the cutset, is to make cell movement decisions based primarily on their updated gain components. This minimizes distractions during the cluster-pulling effort caused by cells not in the cluster currently being moved but with high total gains. In other words, it allows the algorithm to concentrate on a single cluster at a time for moves in one direction; note that the updated gain component of a cell reflects its goodness for moving with regard to the cluster currently being pulled from the cutset. The initial gain of a cell, however, provides useful information for choosing the starting seed for removal of a cutset-straddling cluster - the cell with the highest gain is most likely in such a cluster, and thus a very good starting point. Once the move process has begun, nets connected to moved cells should be given more weight, so that the updated gain components of cells become more important than their initial gains. The utility of giving more weight to nets connected to moved cells (and hence to the updated gain components of cells) in facilitating the movement of clusters from the cutset is established in the following set of results [14].
We consider an iterative improvement partitioning process like FM and assume that the probability (f_int) that an edge connects a pair of cells inside the cluster is uniformly distributed, the probability (f_ext) that an edge connects a cell in C to a cell outside C is also uniformly distributed, and f_int > f_ext. This is similar to the uniformly distributed random graph model used by Wei and Cheng in [6].

Theorem 1  If a cluster C is divided by the cutline into subsets C1 in V1 and C2 in V2, and C1 is moved to V2 by a sequence of moves of its cells, then the cutsize of the partition will decrease, if initially |C1| <= |C2|, or first increase and then decrease, if initially |C1| > |C2|.

Figure 4 illustrates Theorem 1. Assume originally all net weights are one and, when updating the gain of a cell, each net connected to moved cells
is assigned a weight of at least p_max, where (recall) p_max is the maximum cell degree in the circuit.

Theorem 2  Once a cluster starts to move from V1 to V2, there is a high probability that the whole cluster will be removed from the cutset.

Figure 4. The cutsize change with the move of a cluster, as indicated in Theorem 1. When the cutsize does not improve, it indicates the end of a cluster.

The example of Fig. 2 can be used to demonstrate the advantage of the above approach. The cell gains are first computed and the base cell is again u31. The first two moves bring large negative gains, through the weighted edges, to the cells that FM would have selected next. Therefore, in the subsequent two moves those cells are not selected, as would be the case in FM; instead, u41 becomes the top cell in V1 and is selected as the next cell to move. The sequence of the first few moves is indicated by numbered arrows in the figure (for details see [14]). After eight moves, C1 and C2 are moved out from the cutset, and we obtain the optimal cut of one. Thus this process escapes the local minimum of four in which the original FM algorithm was trapped.

3.3. A Cluster-Oriented Iterative-Improvement Partitioner
From the above discussion, we propose a general gain calculation and base-cell selection framework, CLIP (for CLuster-oriented Iterative-improvement Partitioner), presented in Figure 3, that can be applied to any FM-type iterative improvement algorithm. For implementation convenience, we set the cell gains to zero after the initial gain calculation. Cell gains are updated as in the original algorithm over which CLIP is overlaid. Zeroing of initial gains followed by gain updating is equivalent to giving nets connected to moved cells a weight of infinity over nets with no moved cells. The initial gain information is only reflected in the initial ordering of cells.
After the first pass, most strongly connected clusters will probably have been removed from the cutset. The few clusters left in the cutset can be removed in subsequent passes. In later passes, another advantage of the above cluster-oriented scheme is that clusters lying entirely in one subset can easily be moved to the other subset. This is because cell gains being cleared to zero in the initial stage causes cells in a cluster to have less inertia to stay inside their original subset. The benefit of cluster movement between subsets is that larger but less densely connected clusters (we call them superclusters) can be removed from the cutset by moving their densely connected constituent clusters from one subset to the other. By rearranging clusters between the two subsets in this way, subsets V1 and V2 of the final partition will become the two largest but most weakly connected superclusters, which implies that the cutsize will be small. As opposed to FM, which tends to do only local improvement within large clusters, the above new scheme can explore a wider solution space, and hence has less dependence on the initial partition. Compared to other clustering-based approaches, such as bottom-up compaction [15], top-down clustering [8], and vertex ordering [10], CLIP does not explicitly bind cells together as inseparable clusters. Instead, cells can be implicitly regrouped into different clusters in subsequent passes. Both the moving and the possible regrouping of clusters are guided directly by the ultimate objective, cutsize reduction. A possible advantage of this approach is that CLIP has more freedom in searching for the optimum cut.

Algorithm CLIP
1. Calculate the initial gain of all cells according to the iterative improvement algorithm of choice (e.g., FM, LA, PROP [13]).
2. Insert the cells into sorted data structures (free-cell stores) T1 and T2 for subsets V1 and V2, respectively. Select the maximum gain cell u in V as the first base cell to move.
3. Clear the gain of all cells to zero while keeping their original ordering in the data structure.
4. Move u and update the gains of its neighbors and their ranks in the data structure as done in the chosen iterative improvement algorithm. The gain of a cell now only contains the updated part.
5. Choose the base cell based on the cell's updated gain and the balance criterion. Move the cell, and update its neighbors.
6. Repeat Step 5 until all cells are moved.
7. Find the point in the move sequence which corresponds to the minimum cutsize, and reverse all the moves after this point.
Figure 3. One pass of CLIP (CLuster-oriented Iterative-improvement Partitioner).

3.4. A Cluster Detection Method
Although the new framework CLIP has significant advantages over the traditional iterative improvement approach, it is possible to do even better. We start by asking the following questions. First, how do we know when a cluster has been pulled out? Furthermore, once we finish pulling out the first cluster, how do we select the next starting point? We address the former question first by determining what happens when a cutline sweeps across a cluster - this is equivalent to a cluster being pulled out from the cutset. From Theorem 1, as we move a cluster from the cutset, the overall cutsize will decrease until the cluster is entirely removed from the cutset. In a practical partitioner, some external cells may be moved across the cutline due to their connections to moved cells in the cluster. However, since they are randomly distributed across many clusters (and thus do not belong to a specific cluster), their contribution to the overall cutsize change will not be significant. As a result, we obtain a cutsize change similar to that illustrated in Figure 4, which is a pictorial depiction of Theorem 1. Referring to this figure, the movement of a cluster starts from either point L1 or L2. After it is removed from the cutset, there is an overall improvement in cutsize. If in subsequent moves no other cluster starts getting pulled out from the cutset, the cutsize won't improve anymore.
From this reasoning we propose the following cluster detecting criterion: after the move process reaches a positive maximum improvement point, and there is no further improvement in the following δ moves, we say that a cluster has been moved out at the maximum improvement point; δ is a parameter of the algorithm. Referring to the cutsize curve in Figure 4, the δ cells moved after the minimum cutsize is reached do not belong to the previous cluster. This is in contrast to the two cluster detection criteria that Saab proposed in his compaction algorithm [15], viz., (1) the cutsize decreases for the first time after a sequence of moves; (2) the last moved cell in the sequence has a positive gain. We derived our detection criterion from the analytical results in Section 3.2, and we use the partial sum of gains of the sequence of node moves in our criterion, instead of individual cell gains as in [15].
We now address the second question raised at the beginning of this subsection. After reversing the δ moves, we come back to the end point of the current cluster. Subsequently, we need to select the next seed to start the move of another cluster. For this purpose, the updated gain components of cells should not be the determining factor; rather, the total gains of cells, which reflect their overall situation, are more useful. Just as at the beginning of the pass, the cell with the maximum total gain is a good seed to start with.
From the above discussion, a more sophisticated algorithm, CDIP (Cluster-Detecting Iterative-improvement Partitioner), is presented in Figure 5. This is an extension of CLIP and can also be applied to any FM-type iterative improvement algorithm. The detection of the end of the kth cluster is done by monitoring the partial sum S_i^k (i = 1, 2) for each cluster and each subset separately. After S_i^k becomes positive and does not increase for δ moves, we infer that the move of the current cluster ended at the point of maximum partial sum S_i^k. We then reverse the moves to this point, and select the cell with the maximum total gain, which very likely belongs to an unmoved cluster, as the base cell. After selecting this cell, the free cells are reordered in a manner that reflects their relevant connectivities to previously moved clusters. Since some cluster C was previously moved from the cutset, say into V2, it is not desirable to move the free cells of C in V2 to V1 - doing so would lock the cluster in the cutset. Therefore, as at the beginning of the pass, all gain components of cells are cleared to zero, with the exception of those components that correspond to negative gains from connections to locked neighbors in the same subset. These retained high-weight negative gains reflect the fact that some clusters were moved previously in the pass, and that free cells strongly connected to moved cells in these clusters likely belong to them and thus should be given lower priorities for movement.

Algorithm CDIP
1. Calculate the initial gain of all cells according to the iterative improvement algorithm of choice.
2. Insert the cells into sorted data structures T1 and T2 for subsets V1 and V2, respectively. Select the maximum gain cell u in V as the first base cell to move.
3. Clear the gain of all cells to zero while keeping their original ordering in Ti (i = 1, 2).
4. Move u and update the gains of its neighbors and their ranks in Ti (i = 1, 2) as done in the chosen iterative improvement algorithm. Start to count the move index j and to calculate the partial sum S_i^1 for this first cluster, where u is in Vi. The gain of a cell now only contains the updated part.
5. Repeat Steps 6 to 7 until all cells are moved and locked.
6. Choose the base cell based on the cell gain and balance criterion. Move the cell, and update the neighbors as done before.
7. If the current maximum partial sum S_p^k > 0, and the current move index q > p + δ, then
   (a) Reverse the moves from q to p + 1.
   (b) Choose the free cell v in Vi with the maximum total gain as the next base cell.
   (c) For each free cell in Vi, clear the cell gain except the gain component from nets connected to the locked cells in the same subset Vi. Reorder the cells in Ti according to the modified gain.
   (d) Move the base cell v, update its neighbors, and start the count of a new move index j and the calculation of a new partial sum S_i^{k+1} from this move.
8. Find the point in the move sequence which corresponds to the minimum cutsize, and reverse all the moves after this point.
Figure 5. One pass of CDIP (Cluster-Detecting Iterative-improvement Partitioner).

Table 1. Benchmark circuits' characteristics (numbers of cells, nets, and pins for the 25 ACM/SIGDA benchmark circuits used in Section 4).

By using either of the two new partitioning algorithms, CLIP or CDIP, we can find a good cut through weakly connected clusters. In order to obtain even better results, we can apply the original iterative improvement algorithm as a post-processing phase to fine-tune the partition.

3.5. Complexity Analysis
In an implementation, both CLIP and CDIP have to be overlaid on some chosen iterative improvement algorithm. Let n be the number of cells, e the number of nets, p the number of pins, and d the average number of neighbors of a cell. It is easy to show that CLIP does not increase the order of time complexity; CLIP-FM and CLIP-LA (CLIP applied to FM and LA) can still be implemented in O(p) if a bucket list structure is used. CDIP has two major additional operations beyond those in the CLIP algorithm. After each detection of a cluster, (1) the δ moves are reversed, and (2) the free cells are reordered. Assuming c clusters are detected in a pass, the first operation implies cδ additional moves over the entire pass, and the second causes c reorderings of all free cells. For a bucket list structure, where the insertion of a cell into a bucket is a constant-time operation, the complexity of the first operation is O(cδd) over the entire pass (each extra move requires O(d) time for updating d neighbors and reinserting them in the bucket data structure), and the complexity of the second operation is O(cn). Thus the complexity of CDIP is O(max(cδd, cn, p)) per pass for CDIP-FM and CDIP-LA. Empirical results presented later show that CDIP is quite fast.
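To make the CLIP pass of Figure 3 concrete, the following Python sketch is one possible reading of it under simplifying assumptions: FM gains as in equation (1) with unit net weights, cells assigned to sides 0 and 1, a crude 45-55% balance test, and full gain recomputation in place of the O(p) bucket structure. The function and variable names (fm_gain, clip_pass, and so on) are ours, and the code is an illustration of the idea rather than the authors' implementation.

def fm_gain(u, nets_of, side):
    # Immediate cutsize reduction if u moves to the other subset (equation (1)).
    g = 0
    for net in nets_of[u]:
        same = sum(1 for v in net if side[v] == side[u])
        if same == 1:
            g += 1            # u is alone on its side: the net leaves the cutset
        if same == len(net):
            g -= 1            # net entirely on u's side: the move would cut it
    return g

def cutsize(nets, side):
    return sum(1 for net in nets if len({side[v] for v in net}) == 2)

def clip_pass(cells, nets, side):
    # side maps each cell to 0 or 1 and is modified in place.
    nets_of = {u: [] for u in cells}
    for net in nets:
        for u in net:
            nets_of[u].append(net)
    initial = {u: fm_gain(u, nets_of, side) for u in cells}    # Step 1
    order = sorted(cells, key=lambda u: -initial[u])           # Step 2: seed ordering
    gain = {u: 0 for u in cells}                               # Step 3: zero the gains
    free, moves = set(cells), []
    cut = best_cut = cutsize(nets, side)
    best_idx = -1
    count = {0: sum(1 for u in cells if side[u] == 0)}
    count[1] = len(cells) - count[0]
    for step in range(len(cells)):                             # Steps 4-6
        cand = [u for u in order if u in free
                and count[side[u]] - 1 >= 0.45 * len(cells)]   # crude balance check
        if not cand:
            break
        # First move is seeded by the initial gain; later moves use updated gain only.
        u = cand[0] if step == 0 else max(cand, key=lambda v: gain[v])
        cut -= fm_gain(u, nets_of, side)
        count[side[u]] -= 1
        side[u] ^= 1
        count[side[u]] += 1
        free.discard(u)
        moves.append(u)
        for net in nets_of[u]:                                 # update the neighbors
            for v in net:
                if v in free:
                    # updated component = accumulated gain change since the zeroing
                    gain[v] = fm_gain(v, nets_of, side) - initial[v]
        if cut < best_cut:
            best_cut, best_idx = cut, step
    for u in moves[best_idx + 1:]:                             # Step 7: roll back
        side[u] ^= 1
    return best_cut

A real implementation would keep per-subset bucket lists indexed by gain so that each neighbor update is constant time; the sketch recomputes the gain of each affected neighbor from scratch purely for clarity. Calling clip_pass repeatedly until the cutsize stops improving corresponds to the multi-pass use of CLIP described above.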
Table 2. Comparisons of CLIP and CDIP (applied to FM and LA3) to FM: minimum and average cutsizes of 20 runs and the corresponding percent improvements for each test circuit. CDIP-LA3 results are for δ = 50. Subtotals shown correspond first to medium-size circuits and then to large circuits.
Table 3. Comparisons of various iterative improvement algorithms to Paraboli [11]. The results for the clustering-based iterative-improvement algorithms in the table have been further improved by their corresponding original schemes (indicated by the subscript f). FM and CLIP-FMf results are the best cutsizes from 100 runs, while results for CLIP-LA3f, CDIP-LA3f and CLIP-PROPf are the best cutsizes from 20 runs. CDIP-LA3f results are for δ = 50. Subtotals shown correspond first to medium-size circuits and then to large circuits.
Table 4. Comparisons of various iterative improvement algorithms to MELO [12]. The results in the table have been further improved by the original schemes (indicated by the subscript f). FM and CLIP-FMf results are the best cutsizes from 100 runs, while results for CLIP-LA3f, CDIP-LA3f and CLIP-PROPf are the best cutsizes from 20 runs. CDIP-LA3f results are for δ = 50.
4. EXPERIMENTAL RESULTS
All experiments have been done on ACM/SIGDA benchmark circuits whose characteristics are listed in Table 1. The circuit netlists were acquired from the authors of [10] and [11]. All cutset results are for the 45-55% balance criterion.

4.1. Comparisons to FM
Table 2 presents the results of applying CLIP to FM and LA3 (the LA algorithm with a lookahead level of 3). Both the minimum and average cutsizes over 20 runs are greatly improved compared to the corresponding original schemes. The overall minimum-cutsize improvements are 24.5% for CLIP-FM over FM and 45.4% for CLIP-LA3 over LA3. Note also from the table that while LA3 performs slightly better than FM for small to medium-size circuits, it performs much worse for large circuits - for an explanation of this phenomenon the interested reader is referred to [14]. However, CLIP-LA3 now performs much better than FM (by 24.5% for medium-size benchmarks and 46.5% for large-size benchmarks, for an overall improvement of 34.3%), as does CLIP-FM (by 8.6% and 19.7% for medium and large size circuits, respectively). As is clearly evident, the largest improvements of the new schemes over FM are obtained for large circuits. The largest improvement of CLIP-LA3 over FM is about 70%, for the circuit s38417. This clearly demonstrates the ability of the new clustering-based schemes to tackle large circuits.
The cluster detection method CDIP-LA3, obtained by overlaying the CDIP scheme of Figure 5 on LA3, performs even better. The minimum and average cutsizes are improved over those of FM by 35.3% and 45.5%, respectively. Both the minimum and average cutsizes are also superior to those of CLIP-LA3. This indicates that CDIP is a better and more stable partitioner than CLIP.

4.2. Comparisons to Paraboli and MELO
Finally, in Tables 3 and 4, we compare the original iterative improvement algorithms and the new cluster-oriented iterative improvement algorithms to two state-of-the-art partitioning methods, the placement-based algorithm Paraboli [11] and the spectral partitioner MELO [12]. Here, all the new clustering-based iterative improvement algorithms are further improved by the corresponding original schemes, as indicated at the end of Section 3.4 (e.g., CLIP-FMf is CLIP-FM followed by FM improvement, CLIP-PROPf is CLIP-PROP followed by PROP improvement, etc.). Since FM is very fast, we perform 100 runs of both FM and CLIP-FMf; all other iterative improvement algorithms' results are for 20 runs.
First, it is clear from the tables that the original FM algorithm can obtain good results for medium-size circuits. For this set of benchmarks, it is about 4% better than MELO, and 1% better than Paraboli, in total cut. However, for large circuits it falls far behind Paraboli (by -22.5% in total cut). This confirms our earlier discussion of the shortcomings of previous iterative methods. After using the clustering-based techniques, CLIP and CDIP, all of the four new algorithms, CLIP-FMf, CLIP-LA3f, CLIP-PROPf and CDIP-LA3f, are able to obtain cutsizes that are overall better than Paraboli's. Total cut improvements range from 8.8% to 17.5%. This demonstrates that iterative algorithms can also partition large circuits very effectively.
The best results are obtained by applying CLIP to PROP, a probability-based iterative improvement partitioner [13]. PROP calculates cell gains based on the probability of a cell actually being moved in a pass; thus the cell gain calculation is more accurate than that of either FM or LA. Further, when CLIP is applied to FM and LA, neighbors of moved nodes get updated only if they are connected to them by critical or (for LA) somewhat critical nets. In PROP, on the other hand, all nets are probabilistically critical, and thus all neighbors get updated, leading to a more accurate reflection of a cell's move on them. The PROP algorithm combined with CLIP overcomes the two fundamental shortcomings of traditional iterative improvement methods - inaccurate gain calculation and the lack of a global view of the cluster-oriented structure of circuits. Thus it emerges as a very powerful partitioning tool and performs about 17% better than Paraboli. When compared to MELO [12], for which only results on medium-size circuits are given, the differences between the new algorithms are small, all showing about 23% better results in total cutsize
than MELO.

4.3. Run Time Comparisons
The run times of the iterative improvement algorithms are very favorable compared to those of other, purely clustering-based algorithms like Paraboli and MELO (Table 5). Run times of Paraboli and MELO are reported for the DEC3000 Model 500 AXP and SUN SPARC10 workstations, respectively, while we have executed all CLIP- and CDIP-based algorithms, as well as FM and LA, on a SUN SPARC5 Model 85 workstation. The data structures used to store free cells for FM and LA are bucket structures as proposed in [3] and [4], respectively. Despite the O(max(cδd, cn, p)) worst-case time complexity of CDIP-LA3f, in practice it uses less than twice the CPU time of CLIP-LA3f, which has a linear run time. PROP uses a tree structure, which makes it easy to accommodate arbitrary net weights as required in performance-driven partitioning [7, 5]; yet the run time of CLIP-PROPf is quite reasonable. The total CPU times of all four new algorithms are less than that of Paraboli. Assuming the same speed for the three different workstations, CLIP-FMf is 1.9 times faster, CLIP-LA3f is 3.3 times faster, and both CDIP-LA3f and CLIP-PROPf are 1.8 times faster than Paraboli. CLIP-PROPf is comparable with MELO in run time, while CLIP-FMf, CLIP-LA3f and CDIP-LA3f are faster than MELO by factors of 1.2, 1.7 and 1.2, respectively. This also means that if equal CPU times are allocated, the new algorithms can perform more runs and generate even better cutsizes than those presented in Tables 3 and 4.

Table 5. Comparisons of CPU times of various algorithms in seconds per run, and total times over all circuits and all runs made. MELO was run on a SUN SPARC10, Paraboli on a DEC3000 Model 500 AXP, and all others on a SUN SPARC5 Model 85.

5. CONCLUSIONS
We proposed a new clustering-based approach that greatly enhances the performance of iterative improvement methods. The new approach incorporates clustering mechanisms naturally into traditional FM-type algorithms. This revives the power of fast move-based iterative improvement algorithms, making them capable of dealing with large circuits. They are significantly better, in terms of both cutset quality and speed, than current state-of-the-art algorithms like Paraboli [11] and MELO [12], and they deserve new attention in the development of VLSI CAD tools.
REFERENCES
[1] B. W. Kernighan and S. Lin, "An Efficient Heuristic Procedure for Partitioning Graphs", Bell System Tech. Journal, vol. 49, Feb. 1970, pp. 291-307.
[2] D. G. Schweikert and B. W. Kernighan, "A Proper Model for the Partitioning of Electrical Circuits", Proc. 9th Design Automation Workshop, 1972, pp. 57-62.
[3] C. M. Fiduccia and R. M. Mattheyses, "A Linear-Time Heuristic for Improving Network Partitions", Proc. ACM/IEEE Design Automation Conf., 1982, pp. 175-181.
[4] B. Krishnamurthy, "An Improved Min-Cut Algorithm for Partitioning VLSI Networks", IEEE Trans. on Computers, vol. C-33, May 1984, pp. 438-446.
[5] M. Marek-Sadowska and S. P. Lin, "Timing Driven Placement", Proc. IEEE/ACM International Conference on CAD, 1989, pp. 94-97.
[6] Y. C. Wei and C. K. Cheng, "Towards Efficient Hierarchical Designs by Ratio Cut Partitioning", Proc. Intl. Conf. Computer-Aided Design, 1989, pp. 298-301.
[7] M. A. B. Jackson, A. Srinivasan and E. S. Kuh, "A Fast Algorithm for Performance Driven Placement", Proc. IEEE/ACM International Conference on CAD, 1990, pp. 328-331.
[8] Y. C. Wei and C. K. Cheng, "An Improved Two-Way Partitioning Algorithm with Stable Performance", IEEE Trans. on Computer-Aided Design, 1990, pp. 1502-1511.
[9] L. Hagen and A. B. Kahng, "Fast Spectral Methods for Ratio Cut Partitioning and Clustering", Proc. IEEE Intl. Conf. Computer-Aided Design, 1991, pp. 10-13.
[10] C. J. Alpert and A. B. Kahng, "A General Framework for Vertex Orderings, with Applications to Netlist Clustering", Proc. IEEE Intl. Conf. Computer-Aided Design, 1994, pp. 63-67.
[11] B. M. Riess, K. Doll and F. M. Johannes, "Partitioning Very Large Circuits Using Analytical Placement Techniques", Proc. ACM/IEEE Design Automation Conf., 1994, pp. 646-651.
[12] C. J. Alpert and S.-Z. Yao, "Spectral Partitioning: The More Eigenvectors, the Better", Proc. ACM/IEEE Design Automation Conf., 1995.
[13] S. Dutt and W. Deng, "A Probability-Based Approach to VLSI Circuit Partitioning", submitted to DAC'96.
[14] S. Dutt and W. Deng, "VLSI Circuit Partitioning by Cluster-Removal Using Iterative Improvement Techniques", Technical Report, Department of Electrical Engineering, University of Minnesota, Nov. 1995. Available at ftp site ftp-mount.ee.umn.edu, file pap-pdw96-ext.ps, directory /pub/faculty/dutt/vlsicad/papers.
[15] Y. G. Saab, "A Fast and Robust Network Bisection Algorithm", IEEE Trans. Computers, 1995, pp. 903-913.
A HYBRID MULTILEVEL/GENETIC APPROACH FOR CIRCUIT PARTITIONING
Charles J. Alpert¹, Lars W. Hagen², Andrew B. Kahng¹
¹UCLA Computer Science Department, Los Angeles, CA 90095-1596
²Cadence Design Systems, San Jose, CA 94135
ABSTRACT
We present a genetic circuit partitioning algorithm that integrates the Metis graph partitioning package [15], originally designed for sparse matrix computations. Metis is an extremely fast iterative partitioner that uses multilevel clustering. We have adapted Metis to partition circuit netlists, and have applied a genetic technique that uses previous Metis solutions to help construct new Metis solutions. Our hybrid technique produces better results than Metis alone, and also produces bipartitionings that are competitive with previous methods [20] [18] [6] while using less CPU time.

1. INTRODUCTION
A netlist hypergraph H(V, E) has n modules V = {v_1, v_2, ..., v_n}; a hyperedge (or net) e ∈ E is defined to be a subset of V with size greater than one. A bipartitioning P = {X, Y} is a pair of disjoint clusters (i.e., subsets of V) X and Y such that X ∪ Y = V. The cut of a bipartitioning P = {X, Y} is the number of nets which contain modules in both X and Y, i.e., cut(P) = |{e | e ∩ X ≠ ∅, e ∩ Y ≠ ∅}|. Given a balance tolerance r, the min-cut bipartitioning problem seeks a solution P = {X, Y} that minimizes cut(P) such that n(1−r)/2 ≤ |X|, |Y| ≤ n(1+r)/2.
The standard bipartitioning approach is iterative improvement based on the Kernighan-Lin (KL) [16] algorithm, which was later improved by Fiduccia and Mattheyses (FM) [8]. The FM algorithm begins with some initial solution {X, Y} and proceeds in a series of passes. During a pass, modules are successively moved between X and Y until each module has been moved exactly once. Given a current solution {X', Y'}, the module v ∈ X' (or Y') with highest gain (= cut({X', Y'}) − cut({X' − v, Y' + v})) that has not yet been moved is moved from X' to Y'. After each pass, the best solution {X', Y'} observed during the pass becomes the initial solution for a new pass, and the passes terminate when a pass does not improve the initial solution. FM has been widely adopted due to its short runtimes and ease of implementation.
One significant improvement to FM addresses the tie-breaking used to choose among alternate moves that have the same gain. Krishnamurthy [17] proposed a lookahead tie-breaking mechanism, and Sanchis [22] extended this approach to multi-way partitioning. Hagen, Huang, and Kahng [9] have shown that a "last-in-first-out" scheme based on the order in which modules are moved in FM is
significantly better than random or "first-in-first-out" tie-breaking schemes. More recently, Dutt and Deng [7] independently reached the same conclusion. Finally, Saab [21] has also exploited the order in which modules are moved to produce an improved FM variant.
A second significant improvement to FM integrates clustering into a "two-phase" methodology. A k-way clustering of H(V, E) is a set of disjoint clusters P^k = {C_1, C_2, ..., C_k} such that C_1 ∪ C_2 ∪ ... ∪ C_k = V, where k is sufficiently large.¹ We denote the input netlist as H_0(V_0, E_0). A clustering P^k = {C_1, C_2, ..., C_k} of H_0 induces the coarser netlist H_1(V_1, E_1), where V_1 = {C_1, C_2, ..., C_k} and, for every e ∈ E_0, the net e' = {C_i | ∃v ∈ e with v ∈ C_i} is a member of E_1 unless |e'| = 1 (i.e., each cluster in e' contains some module that is in e). In two-phase FM, a clustering of H_0 induces the coarser netlist H_1, and FM is run on H_1(V_1, E_1) to yield the bipartitioning P_1 = {X_1, Y_1}. This solution then projects to the bipartitioning P_0 = {X_0, Y_0} of H_0, where v ∈ X_0 (Y_0) if and only if for some C_h ∈ V_1, v ∈ C_h and C_h ∈ X_1 (Y_1). FM is then run a second time on H_0(V_0, E_0) using P_0 as the initial solution. Many clustering algorithms for two-phase FM have appeared in the literature (see [2] for an overview of clustering methods and for a general netlist partitioning survey). Bui et al. [5] find a random maximal matching in the netlist and compact the matched pairs of modules into n/2 clusters; the matching can then be repeated to generate clusterings of size n/4, n/8, etc. Often, two-phase FM (not including the time needed to cluster) is faster than a single FM run, because the first FM run is on a smaller netlist and the second FM run starts with a good initial solution, allowing fast convergence to a local minimum.
The "two-phase" approach can be extended to include more phases; such a multilevel approach is illustrated in Figure 1 (following [15]). In a multilevel algorithm, a clustering of the initial netlist H_0 induces the coarser netlist H_1, then a clustering of H_1 induces H_2, etc., until the coarsest netlist H_m is constructed (m = 4 in the figure). A partitioning solution P_m = {X_m, Y_m} is found for H_m (e.g., via FM), and this solution is projected to P_{m-1} = {X_{m-1}, Y_{m-1}}.
¹A partitioning and a clustering are identical by definition, but the term partitioning is generally used when k is small (e.g., k ≤ 10), and the term clustering is generally used when k is large (e.g., k = Θ(n) with constant average cluster size). Although a bipartitioning can also be written as P² = {C_1, C_2}, we use the notation P = {X, Y} to better distinguish between partitioning and clustering.
P_{m-1} is then refined, e.g., by using it as an initial solution for FM. In the figure, each projected solution is indicated by a dotted line and each refined solution by a solid dividing line. This uncoarsening process continues until a partitioning of the original netlist H_0 is derived.

Figure 1. The multilevel bipartitioning paradigm.

Multilevel clustering methods have been virtually unexplored in the physical design literature; the work of Hauck and Borriello [10] is the notable exception. [10] performed a detailed study of multilevel partitioning for FPGAs and found that simple connectivity-based clustering combined with a KL and FM multilevel approach produced excellent solutions. However, multilevel partitioning has been well-studied in the scientific computing community; e.g., Hendrickson and Leland [11] [12] and Karypis and Kumar [13] [14] [15] have respectively developed the Chaco and Metis partitioning packages. The Metis package of [15] has produced very good partitioning results for finite-element graphs and is extremely efficient, requiring only 2.8 seconds of CPU time on a Sun Sparc 5 to bipartition a graph with more than 15,000 vertices and 91,000 edges. Our initial hypothesis, which our work has verified, was that Metis adapted for circuit netlists is both better and faster than FM. Metis runtimes are so low that we can easily afford to run it over 100 times to generate a partitioning. However, instead of simply calling Metis 100 times, we propose to integrate Metis into a genetic algorithm: our experiments show that this approach produces better average and minimum cuts than Metis alone. Overall, our approach generates bipartitioning solutions that are competitive with the recent approaches of [20] [18] [6] while requiring much less CPU time.
The rest of the paper is organized as follows. Section 2 reviews the Metis partitioning package and presents our modifications for circuit netlists. Section 3 presents our Metis-based genetic algorithm. Section 4 presents experimental results for 23 ACM/SIGDA benchmarks, and Section 5 concludes with directions for future work.

2. GRAPH PARTITIONING USING METIS
The Metis package of Karypis and Kumar has multiple algorithm options for coarsening, for the initial partitioning step, and for refinement. For example, one can choose among eight different matching-based clustering schemes, including random, heavy-edge, light-edge, and heavy-clique matching. The methodology we use follows the general recommendations of [13], even though their algorithm choices are based on extensive empirical studies of finite-element graphs and not circuit netlists. Before multilevel partitioning is performed, the adjacency lists for each module are randomly permuted. The following discussion applies our previous notation to weighted graphs: a weighted graph is simply a hypergraph H_i with |e| = 2 for each e ∈ E_i and a nonnegative weight function w on the edges.
To cluster, Karypis and Kumar suggest Heavy-Edge Matching (HEM), which is a variant of the random matching algorithm of [5]. A matching M of H_i is a subset of E_i such that no module is incident to more than one edge in M. Each edge in the matching will be contracted to form a cluster, and the contracted edges should have the highest possible weights, since they will not be cut in the graph H_{i+1}. HEM visits the modules in random order: if a module u is unmatched, the edge (u, v) is added to M where w(u, v) is maximum over all unmatched modules v; if u has no unmatched neighbors, it remains unmatched. This greedy algorithm by no means guarantees a maximum sum of edge weights in M, but it runs in O(|E_i|) time. Following [15], our methodology iteratively coarsens until |V_m| < 100.
An initial bipartitioning for H_m is formed by the Greedy Graph Growing Partitioning (GGGP) algorithm. Initially, one "fixed" module v is in its own cluster X_m and the rest of the modules are in Y_m. Modules with highest gains are greedily moved from Y_m to X_m until P_m = {X_m, Y_m} satisfies the cluster size constraints. Since the solution is extremely sensitive to the initial choice of v, the algorithm is run four times with different initial modules, and the best solution observed is retained for the next step. Despite its simplicity, the GGGP heuristic proved at least as effective as other heuristics for partitioning finite element graphs [13].
The refinement steps use the Boundary Kernighan-Lin Greedy Refinement (BGKLR) scheme. Despite its name, the heuristic actually uses the FM single-module neighborhood structure. Kumar and Karypis label the KL algorithm "greedy" when only a single pass is performed, and propose a hybrid algorithm which performs "complete" KL when the graph is small (i.e., fewer than 2000 modules) and greedy KL for larger graphs. They show that greedy KL is only slightly inferior to complete KL, but saves substantial CPU time. A "boundary" scheme is also used for updating gains: initially, only modules that are incident to cut edges (i.e., boundary modules) are stored in the FM bucket data structure and are eligible to be moved; when a module that is not in the data structure becomes incident to a moved module, it is inserted into the bucket data structure only if it has
positive gain. The cost of performing the boundary version of KL is small, since only the boundary modules are considered. The overall Metis methodology is presented in Figure 2.
The Metis Algorithm
Input: Graph H_0(V_0, E_0)
Output: Bipartitioning P_0 = {X_0, Y_0}
1. i = 0; randomly permute the adjacency lists of H_0.
2. while |V_i| ≥ 100 do
3.    Use HEM to find a matching M of H_i.
4.    Contract each edge in M to form a clustering. Construct the coarser graph H_{i+1}(V_{i+1}, E_{i+1}). Set i to i + 1.
5. Let m = i. Apply GGGP to H_m to derive P_m. Refine P_m using BGKLR.
6. for i = m−1 downto 0 do
7.    Project solution P_{i+1} to the new solution P_i. Refine P_i using BGKLR.
8. return P_0.
Figure 2. The Metis algorithm.

To run Metis on circuit netlists, we use an efficient hypergraph-to-graph converter that constructs sparse graphs. The traditional clique net model (which adds an edge to the graph for every pair of modules in a given net) is not a good choice, since large nets destroy sparsity. Since we observed that keeping large nets generally increases the cut size regardless of the net model, we removed all nets with more than T modules (we use T = 50). For each net e, our converter picks F·|e| random pairs of modules in e and adds an edge with cost one into the graph for each pair. Here, F is a constant; our experiments show that the value of F is not too significant as long as it is large enough (we use F = 5). Table 1 shows how the Metis cuts vary with various values of F and T. Our converter retains the sparsity of the circuit, introduces randomness to allow multiple Metis runs, and is fairly efficient.
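The converter just described can be summarized in a short sketch. This is an illustrative reading of the description above, not the code used in the experiments; the function name, the accumulation of parallel picks into edge weights, and the skipping of single-module nets are our own choices.

```python
# A minimal sketch of the random-pair hypergraph-to-graph conversion (T = 50, F = 5).
import random
from collections import defaultdict

def netlist_to_graph(nets, T=50, F=5, rng=random):
    """nets: list of nets, each a list of modules. Drop nets larger than T,
    then for each remaining net add F*|e| random module pairs as unit-cost
    edges; repeated picks accumulate as edge weight."""
    weight = defaultdict(int)
    for e in nets:
        if len(e) > T or len(e) < 2:
            continue                      # large nets destroy sparsity; singletons add nothing
        for _ in range(F * len(e)):
            u, v = rng.sample(e, 2)       # a random pair of distinct modules in the net
            key = (u, v) if u < v else (v, u)
            weight[key] += 1
    return weight                          # {(u, v): w}
```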
3. A GENETIC VERSION OF METIS
Our experiments in Section 4 show that over the course of 100 independent runs Metis generates at least one very good solution, but that its performance is not particularly stable, generating average cuts much higher than minimum cuts. To try to stabilize solution quality and generate superior solutions, we have integrated Metis into a genetic framework. An indicator vector p = {p_1, p_2, ..., p_n} for a bipartitioning P = {X, Y} has entry p_i = 0 if v_i ∈ X and entry p_i = 1 if v_i ∈ Y, for all i = 1, 2, ..., n. The distance between two bipartitionings P and Q with corresponding indicator vectors p and q is given by Σ_{i=1}^{n} |p_i − q_i|, i.e., by the number of module moves needed to derive solution Q from the initial solution P. Boese et al. [4] showed that the set of local minima generated by multiple FM runs exhibits a "big valley" structure: solutions with smallest distance to the lowest-cost local minima also have low cost, and the best local minima are "central" with respect to the other local minima. Thus, we seek to combine several local-minimum solutions generated by Metis into a more "central" solution.
Given a set S of s solutions, the s-digit binary code C(i) for module v_i is generated by concatenating the ith entries of the indicator vectors of the s solutions. We construct a clustering by assigning modules v_i and v_j to the same cluster if C(i) and C(j) are the same code. Our strategy integrates this code-generated clustering into Metis, in that we use HEM clustering and force every clustering generated during coarsening to be a refinement of the code-based clustering.² Our Genetic Metis (GMetis) algorithm is shown in Figure 3.
²A clustering P^k is a refinement of a clustering Q^l (k ≥ l) if some division of the clusters in Q^l yields P^k.
The Genetic Metis (GMetis) Algorithm
Input: Hypergraph H(V, E) with n modules
Output: Bipartitioning P = {X, Y}
Variables:
  s: number of solutions
  numgen: number of generations
  C(i): s-digit code for module v_i
  S: set of the s best solutions seen
  G: graph with n modules
1. Set C(i) = 00...0 for 1 ≤ i ≤ n.
2. for i = 0 to numgen − 1 do
3.    for j = 0 to s − 1 do
4.       if ((i·s) + j) modulo 10 = 0 then convert H to graph G.
5.       P = Metis(G)  (HEM based on codes C(i)).
6.       if ∃Q ∈ S such that Q has larger cut than P then S = S + P − Q.
7.       if i > 0 and (s(i−1) + j) modulo 5 = 0 then recompute C(i) for 1 ≤ i ≤ n using S.
8. return P ∈ S with lowest cut.
Figure 3. The Genetic Metis algorithm.

Step 1 initially sets all codes to 00...0, which causes GMetis to behave just like Metis until s solutions have been generated. Steps 2 and 3 are loops which cause numgen generations of s solutions to be computed. Next, Step 4 converts the circuit hypergraph into a graph, but this step is performed only once out of every 10 times Metis is called. We perform the conversion with this frequency to reduce runtimes while still allowing a variety of different graph representations; the constant 10 is fairly arbitrary. In Step 5, Metis is called using our version of HEM described above. Step 6 maintains the set of solutions S; our replacement scheme replaces solution Q ∈ S with solution P if P has smaller cut size than Q; other replacement schemes may work just as well and need to be investigated. Step 7 computes the binary code for each module based on the current solution set, but only after the first generation has completed and five solutions with the previous code-based clustering have been generated. As in Step 4, the constant 5 is fairly arbitrary. Finally, the solution with lowest cut is returned in Step 8.
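As a concrete illustration of the code computation in Step 7, the following sketch (our own helper functions, assuming the indicator-vector representation defined above) builds the s-digit codes and the induced code-based clustering that constrains HEM during coarsening.

```python
# Hypothetical helpers; not part of the Metis package.
def module_codes(solutions, n):
    """solutions: list of s indicator vectors (each a length-n list of 0/1).
    Returns the s-digit code C(i) for each module as a string; equal strings
    denote modules in the same code-based cluster."""
    return ["".join(str(p[i]) for p in solutions) for i in range(n)]

def code_clusters(codes):
    """Group module indices by identical code, i.e. the code-based clustering
    that every coarsening level is forced to refine."""
    groups = {}
    for i, c in enumerate(codes):
        groups.setdefault(c, []).append(i)
    return list(groups.values())
```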
4. EXPERIMENTAL RESULTS
All of our experiments use a subset of the benchmarks from the ACM/SIGDA suite given in Table 2; hypergraph formats of these circuits are available on the world wide web at http://ballade.cs.ucla.edu/~cheese.
F \ T      10        15        20        25        35        50        100       200       500
 1      289(238)  276(231)  320(261)  321(243)  309(243)  316(233)  310(231)  325(252)  471(366)
 2      241(184)  224(188)  252(202)  251(189)  248(172)  258(190)  274(184)  265(182)  427(333)
 3      239(185)  239(185)  259(180)  250(174)  239(170)  250(177)  256(173)  266(170)  418(318)
 4      238(180)  228(184)  258(189)  254(170)  227(168)  251(173)  260(172)  257(174)  412(327)
 5      225(176)  222(178)  252(173)  243(169)  249(173)  245(169)  256(173)  288(184)  414(294)
 6      228(184)  228(175)  253(187)  238(162)  245(171)  240(167)  254(175)  258(182)  429(295)
 8      230(174)  228(165)  261(165)  255(173)  240(166)  255(159)  248(166)  261(187)  399(311)
10      220(169)  215(176)  269(190)  232(162)  247(164)  255(178)  237(170)  260(192)  408(321)
12      225(178)  241(181)  265(176)  245(176)  254(176)  232(162)  245(180)  271(181)  414(296)
15      227(169)  228(174)  253(178)  266(174)  239(169)  240(175)  252(176)  266(186)  411(270)

Table 1. Average (minimum) cuts for the avqlarge test case over 50 runs of Metis, shown for various values of F and T.
Our experiments assume unit module areas, and our code was written in C++ and compiled with g++ v2.4 on a Unix platform. Our experiments were run on an 85 MHz Sun Sparc 5, and all reported runtimes are for this machine (in seconds) unless otherwise specified. We performed the following studies:
* We compare Metis against standard and two-phase FM, to show the effectiveness of the multilevel approach.
* We show that the GMetis algorithm is more effective than running Metis multiple times.
* Finally, we show that GMetis is competitive with previous approaches while using a fraction of the runtime.

Test Case    # Modules   # Nets    # Pins
balu              801       735      2697
bml               882       903      2910
primary1          833       902      2908
test04           1515      1658      5975
test03           1607      1618      5807
test02           1663      1720      6134
test06           1752      1541      6638
struct           1952      1920      5471
test05           2595      2750     10076
19ks             2844      3282     10547
primary2         3014      3029     11219
s9234            5866      5844     14065
biomed           6514      5742     21040
s13207           8772      8651     20606
s15850          10470     10383     24712
industry2       12637     13419     48404
industry3       15406     21923     65792
s35932          18148     17828     48145
s38584          20995     20717     55203
avqsmall        21918     22124     76231
s38417          23849     23843     57613
avqlarge        25178     25384     82751
golem3         103048    144949    338419

Table 2. Benchmark circuit characteristics.

4.1. Metis vs. FM and Two-phase FM
Our first set of experiments compares Metis against both FM and two-phase FM. We ran Metis 100 times with balance parameter r = 0 (exact bisection) and recorded the minimum cut observed in the second column of Table 3. Since there are many implementations of FM (some of which are better than others), we compare to the best FM results found in the literature. Dutt and Deng [6] have implemented very efficient FM code; their exact bisection results for the best of 100 FM runs are given in the third column of Table 3, and the corresponding Sparc 5 run times are given in the last column. Hagen et al. [9] have run FM with an efficient LIFO tie-breaking strategy and a new lookahead function that outperforms [17]; their bisection results are reported in the fourth column. Finally, we compare to various two-phase FM strategies: in the fifth column, we give the best two-phase FM results observed for various clustering algorithms, as reported in [1] and [9].
Metis does not appear to be faster than FM for circuits with fewer than two thousand modules, but for larger circuits with five to twelve thousand modules, Metis is 2-3 times faster. In terms of cut sizes, Metis is again indistinguishable from FM for the smaller benchmarks, but Metis cuts are significantly lower for the larger benchmarks. We conclude that multilevel approaches are unnecessary for small circuits, but greatly enhance solution quality for larger circuits. For these circuits, more than two levels of clustering are required for such an iterative approach to be effective.

Table 3. Comparison of Metis with FM (minimum cuts over 100 runs and CPU times in seconds).
4.2. Genetic Metis vs. Metis
The next set of experiments compares Metis with GMetis. We ran GMetis for 10 generations while maintaining s = 10 solutions, so that both Metis and GMetis considered 100 total solutions. The minimum and average cuts observed, as well as total CPU times, are reported for both algorithms in Table 4.

Test Case     Metis min   Metis avg   Metis CPU   GMetis min   GMetis avg   GMetis CPU
balu               34          47          26          32           38           24
bml                53          65          23          54           59           22
primary1           55          66          23          55           59           21
test04             53          68          37          52           58           37
test03             61          76          42          65           74           39
test02             99         113          44          96          101           42
test06             94         117          59          97          121           55
struct             36          52          41          34           40           39
test05            107         125          70         109          117           69
19ks              116         132          59         112          116           59
primary2          158         195          95         165          174           91
s9234              49          66          71          45           52           68
biomed             83         149         145          83          134          143
s13207             84          90         106          78           89          112
s15850             62          84         126          59           74          125
industry2         218         280         336         204          230          339
industry3         292         408         384         291          313          423
s35932             55          71         257          56           62          265
s38584             55         101         310          53           67          368
avqsmall          175         241         289         148          174          322
s38417             73         110         294          74          104          301
avqlarge          171         248         318         144          181          355
golem3           2196        2520        1592        2196         2648         1928

Table 4. Comparison of Metis with Genetic Metis. The minimum cut, average cut and CPU time for 100 runs of each algorithm are given.

On average, GMetis yields minimum cuts that are 2.7% lower than Metis, and significantly lower average cuts (except for golem3). For the larger benchmarks (seven to twenty-six thousand modules) GMetis cuts are 5.7% lower, with significant improvements for industry2, avqsmall and avqlarge. We believe that GMetis can have the greatest impact for larger circuits. Note that golem3 is the one benchmark on which Metis outperforms GMetis in terms of the average case; the quality of GMetis solutions gradually became worse in subsequent generations instead of converging to a single good solution.

4.3. Genetic Metis vs. Other Approaches
Finally, we compare GMetis to other recent partitioning works in the literature, namely PROP [6], Paraboli [20], and GFM [19], the results of which are quoted from the original sources and presented in Table 5. All these works use r = 0.1, i.e., each cluster contains between 45% and 55% of the total number of modules. The CPU times in seconds for PROP, Paraboli, and GFM are respectively reported for a Sun Sparc 5, a DEC 3000 Model 500 AXP, and a Sun Sparc 10. We modified GMetis to handle varying size constraints by allowing the BGKLR algorithm to move modules while satisfying the cluster size constraints. We found that with this implementation, GMetis with r = 0.1 was sometimes outperformed by GMetis with r = 0 (exact bisection). Hence, in Table 5, we present results for GMetis with r = 0.1 and with r = 0 (given in parentheses).³ Since for r = 0.1 the GMetis runtimes sometimes increase by 20-50%, we report runtimes for r = 0 in the last column. These experiments used the somewhat arbitrary parameter values of s = log₂ n (|V| = n) solutions and 12 generations. Observe that GMetis cuts are competitive with the other methods, especially for the larger benchmarks s15850, industry2, and avqsmall. However, the big win for GMetis is the short runtimes: generating a single solution for avqlarge and golem3 respectively takes 417/(12 log₂ 25178) = 2.5 and 450/(2 log₂ 103048) = 15 seconds on average. For golem3, we ran only 2 generations, since the results do not improve with subsequent generations; the solution with cost 2144 was achieved after only 210 seconds of CPU time.
³We attribute this undesirable behavior to improper modification of the Metis code. We believe that a better implementation should yield r = 0.1 results at least as good as those for r = 0 without increasing runtimes.

Table 5. Comparison of GMetis with PROP, Paraboli, and GFM for min-cut bipartitioning allowing 10% deviation from bisection. Exact bisection results for GMetis are given in parentheses in the fifth column.

5. CONCLUSIONS
This work integrates the Metis multilevel partitioning algorithm of [15] into a genetic algorithm. We showed (i) that Metis outperforms previous FM-based approaches, (ii) that GMetis improves upon Metis alone for large benchmarks, and (iii) that GMetis is competitive with previous approaches while using less CPU time. There are many improvements which we are pursuing:
* Find sparser and more reliable hypergraph conversion algorithms.
* Try alternative genetic replacement schemes, instead of simply inserting the current solution into S if it is a better solution.
* Tweak parameters such as F, T, s, and the number of generations in order to generate more stable solutions in fewer iterations.
* Experiment with various schemes to control cluster sizes in the bipartitioning solution. That GMetis frequently finds better 50-50 solutions than 45-55 solutions is not acceptable.
* Finally, we are integrating our own separate multilevel circuit partitioning code into a new cell placement algorithm.

REFERENCES
[1] C. J. Alpert and A. B. Kahng, "A General Framework for Vertex Orderings, with Applications to Netlist Clustering", to appear in IEEE Trans. on VLSI.
[2] C. J. Alpert and A. B. Kahng, "Recent Directions in Netlist Partitioning: A Survey", Integration, the VLSI Journal, 19(1-2), pp. 1-81, 1995.
[3] C. J. Alpert and S.-Z. Yao, "Spectral Partitioning: The More Eigenvectors, the Better", Proc. ACM/IEEE Design Automation Conf., 1995, pp. 195-200.
[4] K. D. Boese, A. B. Kahng, and S. Muddu, "A New Adaptive Multi-Start Technique for Combinatorial Global Optimizations", Operations Research Letters, 16(2), pp. 101-113, 1994.
[5] T. Bui, C. Heigham, C. Jones, and T. Leighton, "Improving the Performance of the Kernighan-Lin and Simulated Annealing Graph Bisection Algorithms", Proc. ACM/IEEE Design Automation Conf., pp. 775-778, 1989.
[6] S. Dutt and W. Deng, "A Probability-Based Approach to VLSI Circuit Partitioning", to appear in Proc. ACM/IEEE Design Automation Conf., 1996.
[7] S. Dutt and W. Deng, "VLSI Circuit Partitioning by Cluster-Removal Using Iterative Improvement Techniques", Technical Report, Department of Electrical Engineering, University of Minnesota, Nov. 1995.
[8] C. M. Fiduccia and R. M. Mattheyses, "A Linear Time Heuristic for Improving Network Partitions", Proc. ACM/IEEE Design Automation Conf., pp. 175-181, 1982.
[9] L. W. Hagen, D. J.-H. Huang, and A. B. Kahng, "On Implementation Choices for Iterative Improvement Partitioning Algorithms", to appear in IEEE Trans. Computer-Aided Design (see also Proc. European Design Automation Conf., Sept. 1995, pp. 144-149).
[10] S. Hauck and G. Borriello, "An Evaluation of Bipartitioning Techniques", Proc. Chapel Hill Conf. on Adv. Research in VLSI, 1995.
[11] B. Hendrickson and R. Leland, "A Multilevel Algorithm for Partitioning Graphs", Technical Report SAND93-1301, Sandia National Laboratories, 1993.
[12] B. Hendrickson and R. Leland, "The Chaco User's Guide", Technical Report SAND93-2339, Sandia National Laboratories, 1993.
[13] G. Karypis and V. Kumar, "A Fast and High Quality Multilevel Scheme for Partitioning Irregular Graphs",
Technical Report #95-035, Department of Computer Science, University of Minnesota, 1995.
[14] G. Karypis and V. Kumar, "Multilevel k-Way Partitioning Scheme for Irregular Graphs", Technical Report #95-035, Department of Computer Science, University of Minnesota, 1995.
[15] G. Karypis and V. Kumar, "Unstructured Graph Partitioning and Sparse Matrix Ordering", Technical Report, Department of Computer Science, University of Minnesota, 1995 (see http://www.cs.umn.edu/~kumar for postscript and code).
[16] B. W. Kernighan and S. Lin, "An Efficient Heuristic Procedure for Partitioning Graphs", Bell Systems Tech. J., 49(2), pp. 291-307, 1970.
[17] B. Krishnamurthy, "An Improved Min-Cut Algorithm for Partitioning VLSI Networks", IEEE Trans. Computers, 33(5), pp. 438-446, 1984.
[18] J. Li, J. Lillis, and C.-K. Cheng, "Linear Decomposition Algorithm for VLSI Design Applications", Proc. IEEE Intl. Conf. Computer-Aided Design, pp. 223-228, 1995.
[19] L.-T. Liu, M.-T. Kuo, S.-C. Huang, and C.-K. Cheng, "A Gradient Method on the Initial Partition of Fiduccia-Mattheyses Algorithm", Proc. IEEE Intl. Conf. Computer-Aided Design, pp. 229-234, 1995.
[20] B. M. Riess, K. Doll, and F. M. Johannes, "Partitioning Very Large Circuits Using Analytical Placement Techniques", Proc. ACM/IEEE Design Automation Conf., pp. 646-651, 1994.
[21] Y. Saab, "A Fast and Robust Network Bisection Algorithm", IEEE Trans. Computers, 44(7), pp. 903-913, 1995.
[22] L. A. Sanchis, "Multiple-Way Network Partitioning", IEEE Trans. Computers, 38(1), pp. 62-81, 1989.
MIN-CUT REPLICATION FOR DELAY REDUCTION
James Hwang¹, Abbas El Gamal²
¹Xilinx, Inc., USA, jim.hwang@xilinx.com
²Stanford University, USA, abbas@isl.stanford.edu
ABSTRACT
Min-cut replication has been shown to substantially improve min-cut partitions without substantial increase in network size. This paper addresses the use of min-cut replication to reduce delay in partitioned networks. It is shown that min-cut replication never increases delay in combinational networks or in sequential networks when registers are not replicated, and demonstrated that in practice, min-cut replication reduces delay significantly. Experimental results comparing a min-cut approach with the results of delay-optimal clustering algorithms indicate that min-cut replication can reduce partition size with modest increases in delay.
Figure 1. Delay minimization example from [8] (node delays subscripted).
1 INTRODUCTION
All known network partitioning algorithms that efficiently minimize delay subject to capacity constraints require replication. The problem has been solved for acyclic networks when replication is allowed, by Lawler et al. for the unit delay model [6], and by Rajaraman and Wong [11] for a more general delay model proposed in [9]. These approaches minimize delay subject to any monotone component constraint. Furthermore, it has been observed that the min-delay partitioning problem with the non-monotone pin constraint and no capacity constraint can be solved by the min-depth technology mapping algorithm for lookup-table-based FPGAs proposed in [2]. A heuristic that combines elements of [2, 11] was proposed in [12] for the min-delay clustering problem under simultaneous gate and pin capacity constraints, with reported optimal or near-optimal min-delay partitions. However, the min-delay clustering algorithms described in [6, 11], despite their elegance, have serious practical limitations. They involve massive amounts of replication, and in practice increase network size by roughly a factor of twenty. Merging heuristics have been proposed to reduce the amount of replication [9, 12], but these heuristics increase the running time substantially, and the resulting partition size is still at least doubled. Such substantial size increases can lead to additional delay, for example when a multi-chip partition becomes too large to fit on a single PCB. In this paper we show that, in contrast, the min-cut replication approaches described in [5, 13] can be effective in reducing delay in a partitioned network without the excessive size increase of min-delay solutions. As a simple example, consider the ISCAS85 benchmark design c17, shown in Fig. 1a. A delay-optimal clustering, obtained using the algorithm in [11], is shown in Fig. 1b, where the component size limit is three nodes and the cut-edge delay is equal to
three. The partition was obtained by merging components in the original solution. By beginning instead with the initial partition without replication shown in Fig. 2a and applying replication to minimize cut size, we obtain the partition in Fig. 2b. This partition is delay-optimal, even though the initial partition was not. Moreover, it contains one fewer component than the partition in Fig. 1b.

Figure 2. A min-cut approach can minimize delay with fewer components.

The paper is organized as follows. Section 2 contains definitions and a brief review of min-cut replication. In Section 3, we prove that min-cut replication never increases delay in combinational networks and in sequential networks without replicated registers, and show that when registers can be replicated, delay will never increase by more than one cut-edge delay. In Section 4, we show that in practice, min-cut replication often reduces delay substantially. In Section 5, we compare the results of a min-cut approach with partitions obtained using the algorithm in [12], demonstrating that min-cut replication can reduce partition size with modest increases in delay. We conclude in Section 6 by discussing future work.

2 PRELIMINARIES
2.1. Min-Cut Replication
Let V = {V_1, V_2, ..., V_k} be a partition of a graph G = (V, E), where the V_i are not required to be disjoint as long as vertices occurring in more than one component
are assumed replicated. The min-cut replication problem is to determine a collection of vertex sets to replicate from each component to every other component which minimizes the cut size in the resulting network. Efficient solutions to this problem, called min-cut replication algorithms, are based on the following three ideas [5, 13] (a schematic sketch follows the list):
* The replication problem can be solved by solving the k independent problems of determining the min-cut-optimal set to replicate into (or out of) a component V_i from (to) its complement.
* Determining the min-cut vertex set to replicate into (or out of) a single component V_i from (to) its complement can be done efficiently using flow techniques.
* If a replication set found by flow techniques is infeasible because of capacity constraints, the solution can be approximated by using a partitioning heuristic.
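The overall structure implied by these three ideas can be sketched as follows. This is only a schematic outline under our reading of [5, 13]; the flow-based computation of the replication sets and the capacity heuristic are left as abstract parameters, and all names are hypothetical.

```python
# Skeleton only: one independent replication subproblem per component.
def min_cut_replication(components, find_replication_set, capacity_ok, heuristic_trim):
    """components: list of vertex sets V_1..V_k (mutable sets).
    find_replication_set(i): min-cut-optimal vertex set to copy into V_i from
        its complement (assumed to be computed with flow techniques).
    capacity_ok(V): capacity-constraint check for a component.
    heuristic_trim(R, V): shrink an infeasible replication set R for V."""
    for i, comp in enumerate(components):
        repl = find_replication_set(i)
        if not capacity_ok(comp | repl):
            repl = heuristic_trim(repl, comp)   # approximate when the flow answer is infeasible
        comp |= repl                            # replicated vertices now occur in >1 component
    return components
```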
2.2. A Delay Model for Partitioned Networks
The general delay model proposed by Murgai et al. [9] is characterized by two parameters: a function δ defined on the vertex set, and a constant delay D associated with every cut edge in a partition.

Definition 2.1 Given a network G = (V, E), the function δ : V → N assigns a gate delay δ(v) to every vertex v ∈ V. The external edge delay D ∈ N is a constant delay that is assigned to every cut edge in a partition; non-cut edges have zero delay. Choosing the external delay D to be unity and gate delay δ(v) = 0 for all v ∈ V results in the unit delay model of [6].

We wish to consider sequential as well as combinational networks, assuming only that any cycle in a network contains at least one register node.

Definition 2.2 Let R ⊆ V be the set of all register nodes in G, where R is empty for a combinational network. A path p = (v_0, v_1, ..., v_m) is a combinational path if v_i ∉ R for i = 1, ..., m−1. The gate delay of combinational path p, δ(p) = Σ_{i=1}^{m} δ(v_i), is the delay from the output of v_0 to the output of v_m.

Definition 2.3 The maximum gate delay along any combinational path p from u to v is denoted by
    Δ(u, v) = max over combinational paths p = (u, ..., v) of δ(p).

Definition 2.4 A cut V = {V_1, V_2, ..., V_k} induces a delay function on the vertex set V under the general delay model. Let c(p, V) be the number of cut edges in a combinational path p. The combinational delay, or more simply, the delay of a node v,
    d_V(v) = max over combinational paths p = (..., v) of { δ(p) + D·c(p, V) },
is defined to be the maximum combinational path delay (gate plus external) to the output of v. Letting c_r(p, V) be the number of cut edges in p after applying min-cut replication to the partition V, we define
    d_V^r(v) = max over combinational paths p = (..., v) of { δ(p) + D·c_r(p, V) }
to be the delay induced by applying min-cut replication to the cut V.

Definition 2.5 A combinational path p is a critical path if it has maximal combinational delay.

Definition 2.6 The cycle time associated with cut V,
    Φ_V = max over v ∈ V of d_V(v),
is defined to be the maximal combinational delay in a partitioned network. Because every cycle in G contains at least one register node, the cycle time is always finite. We denote by
    Φ_V^r = max over v ∈ V of d_V^r(v)
the cycle time associated with the cut after applying min-cut replication.

3 MIN-CUT REPLICATION DOES NOT INCREASE DELAY
If implemented properly, min-cut replication will never increase delay in combinational networks, or in sequential networks if registers are not replicated. Furthermore, when registers can be replicated, min-cut replication will never increase delay by more than the external delay D. It might first appear that the difference in the delay for two clones can be unbounded, for example by considering the example in Fig. 3a. As can be seen in Fig. 3b, after min-cut replication the delays from w to u_j and u_k differ:
    d_V^r(u_j) = Δ(w, u),    d_V^r(u_k) = 2D + Δ(w, u).
However, as shown in Fig. 3c, a simple refinement of min-cut replication ensures that the difference in delay between two clones is bounded above by the external delay D. The refinement is: whenever (u_i, v_j) is a cut edge with d_V^r(u_i) > d_V^r(u_k) for some clones u_i, u_k, simply replace (u_i, v_j) by the edge (u_k, v_j). It then follows that an edge (u_i, v_j) is a cut edge only if u_i has the minimal delay of any of its clones.

Figure 3. Choosing the clone to replicate is essential for minimizing delay.

Proposition 3.1 If, after min-cut replication, a network with partition V has the property that for every cut edge (u_i, v_j), vertex u_i has the minimal delay of all its clones, then for any clones v, v* we have |d_V^r(v) − d_V^r(v*)| ≤ D. Furthermore, min-cut replication never increases cycle time, i.e., maximal delay, in a combinational network or in a sequential network when registers are not replicated.
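For concreteness, the delay quantities of Definitions 2.4 and 2.6, which the propositions bound, can be evaluated by a single topological traversal when the combinational part of the network is acyclic. The sketch below is our own illustration (not code from the paper); it assumes an explicit topological order and a component label per vertex.

```python
# Minimal sketch of d_V(v) and Phi_V under the general delay model.
def delays(order, preds, delta, comp, D):
    """order: vertices in topological order; preds[v]: fan-in vertices of v;
    delta[v]: gate delay; comp[v]: partition component of v;
    D: external delay added per cut edge (Definition 2.1)."""
    d = {}
    for v in order:
        if not preds[v]:
            d[v] = 0            # delay is measured from the output of a source node
        else:
            d[v] = delta[v] + max(d[u] + (D if comp[u] != comp[v] else 0) for u in preds[v])
    return d

def cycle_time(d):
    """Phi_V of Definition 2.6: the maximum delay over all vertices."""
    return max(d.values()) if d else 0
```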
Proposition 3.2
i. Min-cut replication never increases delay in combinational networks or in sequential networks if no registers are replicated.
ii. If registers can be replicated, min-cut replication can increase the delay by at most D. Hence, Φ_V^r ≤ Φ_V for combinational networks and Φ_V^r ≤ Φ_V + D for sequential networks.

4 RESULTS: CYCLE TIME REDUCTION
By Proposition 3.2, it is possible for min-cut replication to increase cycle time by an external delay D, but we have found that in practice it never increases cycle time, and in fact often decreases it. This is consistent with intuition, since by removing cut edges, min-cut replication removes cut edges along many paths, possibly critical paths. This can be demonstrated empirically under the unit delay model. Although simplistic, the unit delay model suffices to show that the maximum number of cut edges along any path in the network can be reduced by min-cut replication.
Using TAPIR [5], we partitioned the fourteen largest designs in the MCNC Partitioning93 benchmark suite as well as eight designs obtained from Actel. The partitions were generated in a sequence of rounds that maintain a set of components, terminating when all components satisfy the capacity constraints. Initially the entire design is a single partition component. In each round, the largest infeasible component is selected and bipartitioned using the Fiduccia-Mattheyses algorithm [3]. Then the bipartition defined by every pair of components in the partition is also refined using FM. If there is an infeasible component, the next round begins; otherwise the algorithm terminates. After partitioning, the cycle time Φpart is calculated as the maximum number of cut edges along a path from a primary input or register to a primary output or register; Φrep is the cycle time after applying min-cut replication.
As can be seen from the data in Tab. 1, min-cut replication is effective in reducing unit delay. For 16 of the 24 input designs, replication reduces the cycle time, with an average reduction of 26% taken over all the designs. In no case did replication increase the cycle time. We note that TAPIR incorporates a delay-oblivious implementation of min-cut replication; if the underlying partitioning routines were timing-driven, min-cut replication would expressly aim to reduce delay in addition to cut size.

Design         FPGAs   Φpart   Φrep   Reduction
MCNC
c1355              6       6      5      0.17
c1908              5       6      6      0.00
c3540             12       8      6      0.25
s1238              8       8      0      1.00
c2670              4       5      4      0.20
c5315              8       5      4      0.20
c6288              4      13     10      0.23
c7552              5       5      4      0.20
s13207            12       4      4      0.00
s15850            13      10      6      0.40
s35932            24       4      4      0.00
s38584            28       5      5      0.00
s5378              8       4      3      0.25
s9234              7       5      5      0.00
Industrial
addresspart        6      22      7      0.68
cat                3       2      2      0.00
control            6       4      3      0.25
look1240           6       3      1      0.67
pdt                4       2      2      0.00
vrc1               3       3      3      0.00
biga              10       5      3      0.40
entmisc           10      62     24      0.61
gme-a              7      19     12      0.37
seq                6       4      3      0.25
avg                                      0.26

Table 1. Unit delay reduction from min-cut replication.

5 RESULTS: MIN-CUT REPLICATION VS. CLUSTERING
The problem of partitioning for minimum delay has been solved if replication is allowed, although interestingly, it remains open if replication is not permitted. In this section we compare a straightforward min-cut approach implemented in TAPIR [5, 4] with an 'optimal' min-delay clustering algorithm implemented in the program sis-cluster [12, 11]. The min-delay clustering algorithm, valid only for acyclic networks, consists of two phases. In the first phase, each vertex is labelled with the minimum delay over all clusterings of the subgraph consisting of the vertex and its predecessors. In the second phase, the labelled nodes are grouped together greedily, in reverse topological order. If a node is an ancestor of nodes in distinct clusters, it is replicated. Essentially, the algorithm unrolls a directed acyclic graph and replicates all overlapping fanin cones. This replication increases the network size by over an order of magnitude, so
sis-cluster applies a heuristic procedure to merge components and reduce the network size. We used the MCNC ISCAS85 Logic Synthesis benchmarks for the comparison [1], since the available version of sis-cluster did not parse the Partitioning93 benchmark suite, which is specified in Xilinx's xnf netlist format. The sis-cluster partitioner has input parameters including component size, pins, and the external delay, and it currently assumes unit delay for every node in the network. Each of the designs was partitioned with size 200 and 50 pins, since otherwise the relatively small designs in the ISCAS85 suite led to trivial partitions (similar results were obtained using gate capacity 100 and pin capacity 50). TAPIR partitions were generated by combining min-cut partitioning and min-cut replication as described in [5] to reduce partition size. The external delay D was varied from 2 to 10, a representative range for many multiple-FPGA partitions (every internal connection has unit delay). For instance, the ratio of external delay to internal delay typically lies between 3 and 7 for an emulated circuit implemented in the Xilinx family of FPGAs [10].
Tab. 2 contains the number of partition components for each design, and Fig. 4 contains a plot of the average ratio
    A_F = (#FPGAs with TAPIR) / (#FPGAs with sis-cluster)
of partition components for TAPIR and sis-cluster, as a function of the ratio of external delay to internal delay. Over a realistic range of delay ratios for FPGA partitioning, TAPIR generates partitions with about one half the number of components as sis-cluster, which performs its merging after clustering. Tab. 3 shows the cycle times for each partitioned design, and Fig. 4 also contains a plot of the average ratio of cycle times as a function of the ratio of external to internal delay. As expected, the delay in the TAPIR partitions is larger than for sis-cluster, which produces either optimal or close-to-optimal partitions. However, when the external delay D is less than four times the internal delay, TAPIR's partitions are on average only about 20% slower than those of sis-cluster, and when D is seven times the internal delay, the cycle time is still only about 40% slower for TAPIR partitions. In the TAPIR-partitioned designs, delay increases essentially linearly with the external/internal delay ratio, since increasing the ratio simply adds a delay for each interchip signal: the partition itself does not change (although the critical path may). From the data, we observe that in the range of external-to-internal delay ratios that arise in practice for FPGA-based logic emulation systems, the time/area tradeoff may in fact sometimes favor a min-cut partitioning/replication approach over delay-optimal solutions such as [11]. Certainly the results are sufficiently encouraging to suggest that, by incorporating timing information into the partitioning and replication algorithms, one can expect a min-cut replication approach to produce substantially smaller partitions with competitive delay characteristics.

Design    TAPIR    sis-cluster (D = 2, 4, 6, 8, 10)
c1355       5        7   12   12   10   10
c1908       6       11   22   23   23   23
c3540      11       16   15   12   13   15
c2670      15       22   29   29   34   35
c5315      22       39   41   40   40   41
c6288      15       46   71   92   99   97
c7552      26       48   49   46   44   44
c880        5        7    7    6    5    5

Table 2. TAPIR vs. sis-cluster partition components.

Design    TAPIR (D = 2, 4, 6, 8, 10)      sis-cluster (D = 2, 4, 6, 8, 10)
c1355      60   70   80   90  100          50   50   52   50   54
c1908      92  102  112  122  132          82   84   86   88   90
c2670      78   90  102  114  126          66   68   70   72   76
c3540     104  116  128  140  154          96  100  102  104  106
c5315     110  120  130  140  150         100  103  104  106  108
c6288     274  300  326  352  378         252  254  256  258  260
c7552      94  100  106  112  118          88   88   88   88   88
c880       56   64   72   80   88          50   50   50   50   50

Table 3. TAPIR vs. sis-cluster cycle time.

Figure 4. TAPIR vs. sis-cluster components and cycle time.

6 FUTURE IMPROVEMENTS
Despite TAPIR's delay obliviousness, we observed that a min-cut approach to partitioning with replication produces substantially smaller partitions with modest delay penalties. Clearly the effect of min-cut replication depends heavily upon the initial partition, and some work has been done to combine replication with determining the partition [8, 7]. We have tried several strategies for deriving partitions without replication from the delay-optimal clusterings produced by sis-cluster, including merging components to minimize cut size, merging components along critical paths, unreplicating nodes into the clones that minimize cut size, and unreplicating into min-delay and max-delay clones. No combination of these approaches we tried produced uniformly better results than TAPIR's delay-oblivious min-cut partitioner, and in many cases, for most values of external delay D, TAPIR produced partitions with less delay after min-cut replication. However, we expect that combining performance-driven partitioning with min-cut replication will yield smaller partitions than the algorithms in [11, 12], with competitive delay. A combined min-delay/min-cut partitioning heuristic can be incorporated in the min-cut replication algorithm as described in [5], adding a performance-driven capability to min-cut replication. This paper presented a first step in assessing this approach, by comparing a simple implementation of partitioning with replication (both delay-oblivious) with the min-delay clustering approach.

ACKNOWLEDGEMENTS
The authors would like to thank Hannah Yang for providing her program sis-cluster, as well as answering questions regarding its algorithms and use. We also thank an anonymous reviewer for correcting an error in the original version of the paper.

REFERENCES
[1] Franc Brglez. MCNC Partitioning93 benchmark suite. Microelectronics Center of North Carolina, 1993.
[2] Jason Cong and Yuzheng Ding. An optimal technology mapping algorithm for delay optimization in lookup-table based FPGA designs. In Digest of Technical Papers, ICCAD-92, pages 48-53, 1992.
[3] C. M. Fiduccia and R. M. Mattheyses. A linear-time heuristic for improving network partitions. In Proceedings of the 19th Design Automation Conference, pages 175-181, 1982.
[4] James Hwang. Replication in Partitioned Networks. PhD thesis, Stanford University, 1995.
[5] James Hwang and Abbas El Gamal. Min-cut replication in partitioned networks. IEEE Transactions on CAD, 14(1):96-106, January 1995.
[6] E. L. Lawler, K. N. Levitt, and J. Turner. Module clustering to minimize delay in digital networks. IEEE Transactions on Computers, C-18(1):47-57, January 1969.
[7] L.-T. Liu, M.-T. Kuo, and C.-K. Cheng. A replication cut for two-way partitioning. IEEE Transactions on CAD, 14(5):623-630, May 1995.
[8] Chuck Kring and A. Richard Newton. A cell-replicating approach to mincut-based circuit partitioning. In Digest of Technical Papers, ICCAD-91, pages 2-5, 1991.
[9] R. Murgai, R. K. Brayton, and A. Sangiovanni-Vincentelli. On clustering for minimum delay/area. In Digest of Technical Papers, ICCAD-91, pages 6-9, 1991.
[10] Quickturn Design Systems, Inc. Private communication.
[11] Rajmohan Rajaraman and Martin D. F. Wong. Optimal clustering for delay minimization. In Proceedings of the 30th Design Automation Conference, pages 309-314, 1993.
[12] Honghua Yang and Martin D. F. Wong. Area/pin-constrained circuit clustering for delay minimization. In FPGA '94 Workshop, IEEE and ACM, 1994.
[13] Honghua Yang and Martin D. F. Wong. New algorithms for min-cut replication in partitioned circuits. In Digest of Technical Papers, ICCAD-95, pages 216-222, 1995.
TWO-DIMENSIONAL DATAPATH REGULARITY EXTRACTION
Raymond X. T. Nijssen and Jochen A. G. Jess
Design Automation Section, Eindhoven University of Technology, The Netherlands
[email protected]
*This research was supported by ESPRIT BRA 6855 LINK.
ABSTRACT
This paper presents a new method to automatically extract regular structures from logic netlists containing datapath circuitry. The goal of datapath extraction is the exploitation of structural regularity to efficiently obtain regular placements, which are typically more compact. Datapaths constitute increasingly sizeable parts of ever more and larger circuits; hence flexible, technology-independent layout tools, unlike state-of-the-art datapath compilers, will become critical in the design flow. Our method transforms a circuit's existing functional hierarchy, if any, into a two-dimensional hierarchy that is more suitable for subsequent cell placement, thereby also automating the currently mostly handcrafted task of selective partial hierarchy flattening. Once the two-dimensional structure is known, the remaining placement task is greatly reduced to arranging just one row and one column of the discovered matrix-like structure, allowing much larger circuits to be placed in one go. Experiments show superior extraction results compared to existing approaches.

1 INTRODUCTION
Bitwise parallelism has become the predominant technique in the design of datapaths in high-performance data processing circuits. Due to the repetition of per-bit operators across the width of the data representation, both the interconnect structure and the component geometries of datapath circuitry are inherently regular. These effects can be exploited to obtain high-density layouts, as reported in [5] and in other publications. However, the current two mainstream placement methods, Gordian/Domino [9] and TimberWolfSC [13], are fundamentally unable to take advantage of this structural regularity, because these methodologies are based on optimizing objective functions in which regularity cannot be expressed. Several fully tailored datapath synthesis environments, called datapath compilers, using specialized standard cell libraries like [4] have been developed to answer this need. These systems explicitly put in regularity at the logical level and deal with this information in a very explicit manner throughout the entire design flow, down to the layout phase, using specialized tools and dedicated cells. An important drawback is that such separate, technology-dependent design flows for certain parts of the circuit not only add cost and integration overhead, but also seriously decrease generality and flexibility. Furthermore, while such dedicated tools yield dense layouts of fully regular circuitry, they rapidly perform worse as the circuit becomes less regular, causing considerable area waste due to their limited flexibility [12]. This leaves a large class of circuits which would benefit from a regular placement with the same cell library as in the rest of the circuit, but for which neither dedicated systems nor general layout systems produce satisfying solutions.
Among the first addressing this open field, Odawara [11] proposed a methodology which was later refined [8][2][3]. This method is based on improving placement by searching logical designs for a structural characteristic feature typical of datapaths, namely bit-latches repeated over all bits, connected via the same terminal type at each latch to one common net. A cell cluster, called a location macro, is then grown around all such groups so as to serve as placement initializers for subsequent conventional standard-cell layout generation. A similar method [14] uses primary outputs instead of latch chains, attempting to find strongly connected subcircuits called cones in which all cells have a path to the same primary output. While both approaches yield some gain in terms of density and run-time over general placement, they are fundamentally unable to fully extract datapath regularity, because they disregard the essentially two-dimensional nature of this feature. Consequently, the resulting placements are still not nearly as regular as those produced by dedicated datapath synthesis tools, and hence the potential benefits of datapath regularity are only partially exploited. So far, no method is known to us from the literature that is capable of extracting this two-dimensional structure, which is needed for generating truly regular placements similar to those made by dedicated design flows.
2 CONTRIBUTIONS OF THIS PAPER
We have developed a fast and efficient technique to automatically extract the two-dimensional datapath regularity from circuit netlists, enabling explicit placement of the extracted circuitry that is as regular as that produced by fully dedicated systems. While performing the extraction, the netlist is decomposed to form a new hierarchy that matches placement criteria much more closely than the usual functional hierarchy (if available), whose implied locality is used by most other placement methods. The target hierarchy is based on both the interconnect
*This research was supported by ESPRIT BRA 6855 LINK
structure and the physical geometry of the cells, namely blocks with discovered regularity, a glue logic part, and any number of large hard macros. Figure 1 illustrates the effect of this transformation on the floorplan.

Figure 1. Circuit hierarchy transformation

Note that this transformation, selective hierarchy flattening so as to obtain macros containing a suitable number of more or less related cells, is still mostly carried out by hand. The method described in this paper automates this task. Unlike the other approaches mentioned, the regular parts of circuitry generated by conventional non-dedicated synthesis tools are placed regularly, hence densely. Moreover, the transformed hierarchy reduces the solution space of the placement, allowing much larger circuits to be placed in one go than without extraction. In addition, regular placement of regular structures is known to facilitate accurate clock and data skew control. Note that our approach is not technology dependent like dedicated datapath compilers, which allows regular and non-regular circuitry to be integrated seamlessly, thus helping to prevent wasted area. Furthermore, we believe that datapath placement using standard cells as opposed to specialized cells will pay off even more as more routing layers become available. As depicted in figure 2, the proposed regularity extraction and hierarchy transformation method is an add-on which is plugged into a conventional design flow after logical netlist generation, before the layout phase. The regularity extraction effectively performs a multi-decomposition of the circuit, yielding a restructured netlist as well as the discovered 2-dimensional structure, if any. The remainder of this paper is organized as follows: The next section provides some necessary terms and preliminaries on datapath regularity. The modeling of datapath regularity we use is presented in section 5. Section 6 introduces a metric quantifying the extent of regularity of the circuit surrounding a partially reconstructed datapath. This metric guides the search-wave used by the regularity extraction algorithm presented in section 7 to expand into the most regular extension. Experimental results are presented in section 8. Finally, in section 9 conclusions and remarks are given.
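The design-flow integration just described amounts to a netlist-in, netlist-out transformation. As a rough illustration only, the C++ sketch below shows what the interface of such an extraction pass could look like; the type and function names (Netlist, RegularityResult, extractRegularity) are hypothetical and are not the authors' implementation.

    #include <string>
    #include <vector>

    // Hypothetical netlist container; the paper's tool reads and writes EDIF.
    struct Netlist { /* modules, nets, pins ... */ };

    // The extraction yields a restructured (two-dimensional) hierarchy plus the
    // discovered stage/slice decomposition of the datapath part, if any.
    struct RegularityResult {
        Netlist restructured;                                // new hierarchy
        std::vector<std::vector<std::string>> stages;        // cell names per stage (column)
        std::vector<std::vector<std::string>> slices;        // cell names per slice (row)
        std::vector<std::string> glueLogic;                  // insufficiently regular cells
    };

    // Plug-in step between logic synthesis and layout (cf. figure 2).
    RegularityResult extractRegularity(const Netlist& input) {
        RegularityResult result;
        result.restructured = input;   // placeholder: the real extraction is described in sections 5-7
        return result;
    }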
Figure 2. Extended Design Flow

3 PRELIMINARIES
Datapath cell-placement essentially maps the structural regularity onto topological regularity by cell alignment in 2 directions. Figure 3 shows a part of a fully regular 4-bit wide datapath. Cells associated with the same bit-slice are lined up horizontally. At the same time, the highly similar if not identical bit slices of the datapath are stacked alongside each other. Perpendicular to the slices, cells of the same type occurring at similar places in all slices form a datapath stage. The circuit is thus fitted onto a matrix of rectangular buckets containing the cells, where each slice coincides with a row and each stage coincides with a column. The fact that all cells in a stage have the same type, and hence the same form, guarantees zero cell width variation per column. At the same time, as standard cells occupy only one row of transistors, the height variation within the rows is also negligible. Together, both properties establish a high degree of geometrical regularity, yielding maximum-density cell placement. Note that the above properties are also found in many other popular layout styles.

Crucial to this approach, in addition to geometrical regularity, the interconnect regularity of datapaths has the following property: (almost) all nets running through the matrix are fully contained either within one slice or within one stage. This is caused by the perpendicularity of data and control flows. Because of this orthogonality of the interconnect structure, the composition of each column is not affected by swapping rows in the matrix, and likewise for swapping columns. Under this interconnect orthogonality condition, ordering optimizations of rows and columns are mutually independent tasks that can be carried out in separate steps. The more nets violate this condition, the less valid it becomes to treat column and row orderings independently. Considering that ordering only one row and one column directly yields the complete relative placement of the entire matrix, the complexity of placing datapath circuitry is thus reduced very significantly: from one general 2-dimensional placement problem over all datapath cells at the same time, to two independent, much smaller linear arrangement problems over just one single row and one single column. While the problem of regular placement remains NP-complete, the problem size is drastically reduced, by many orders of magnitude, compared to the general placement task that would otherwise have to be carried out. Even for the small circuit in figure 3, the reduction ratio amounts to (8 x 4)!/(8! x 4!). This vast reduction in placement problem size already allows much larger circuits to be placed in one go. Simplifying the task even further, the typical interconnect between the slices is such that ordering the slices with respect to each other is mostly straightforward. The linear arrangement problem under various constraints is well known from the literature, e.g. [7], [10], [1] or [6], and is therefore not elaborated on in this paper.

Figure 3. An ideally aligned datapath

4 PROBLEM FORMULATION
To be able to perform datapath placement in the above way, the membership to both one slice and one stage must be known for each cell belonging to the datapath. This information is generally completely or largely unavailable, inconsistent, or even, placement-wise, unreliable in most design flows. It must therefore be extracted automatically from the netlist before placement. This task results in a decomposition of the circuit into
* a number of datapath chunks containing regular circuitry
* glue logic including insufficiently regular circuitry
* large and/or prefabricated macros like memory blocks
In addition, the extracted regularity is described by two decompositions of the datapath part, namely a set of stages and a set of slices. Figure 4 shows the result of this process. The main objective of this process is to maximally satisfy the orthogonality condition and the regularity goals. This objective implies that all extracted bit slices will be identical or highly similar.

Figure 4. Regularity induced circuit hierarchy transformation

Figure 5. Part of a datapath

5 REGULARITY MODELING
A circuit is given as a set of modules M, nets N and pins P ⊆ M x N. Each module m ∈ M is an instantiation of a module-type ti. Each pin p = (m, n) ∈ P instantiates a terminal-type T(p) = τi of m. Importantly, any terminal-type uniquely belongs to one single function-type. For convenience, let E = M ∪ N be the set of circuit entities. The desired datapath regularity information of the circuit can be fully described by two separate decompositions of M and N into a number of stage sets si of entities and a number of slice sets bj. Any entity occurs in exactly one slice set and exactly one stage set at the same time. Consider figure 5 showing a part of
an example datapath circuit and a reference stage sr = s1. The pairs of nets {n1, n2}, {n3, n4} and {n5, n6} are each in a distinct slice, b1, b2 and b3, respectively. Together they form stage s1 = {{n1, n2}, {n3, n4}, {n5, n6}}, outlined in the figure. Note that a slice within a stage is actually a set of entities. The function S: E → N returns the unique stage index i of an entity which is in stage si. All entities that are not (yet) part of any stage are in the complement stage set s?. Recall that for alignment, all modules in the same stage si have the same type, but note also that there may be many other entities with that type outside si. Likewise, entities in b? are not (yet) members of any slice, and the function B: E → N returns the slice index of an entity. All entities are initialized with an undefined stage/slice membership: s? = b? = E. Our modeling is founded on the observation that datapath regularity in a circuit is an essentially relative notion, in the sense that it is expressed in terms of certain attributes of the interconnect structure between the entities in a current reference stage that is known to be regular, referred to as sr, and entities outside sr. Many different sorts of netlist attributes that to some extent characterize datapath regularity can be distinguished. The most obvious attribute is the terminal-type associated with a pin between two entities. In addition, the degree of the adjacencies, their use (e.g. signal flow direction), and possibly even explicit annotations concerning buses may be used if available. The set of attributes used for characterizing the datapath regularity of a connection p = (e1, e2) between entities e1 and e2 is generally called the regularity signature or RS of p, denoted R(p). Using more characteristics may help reduce certain ambiguities, thereby possibly increasing the amount of regularity found. In practice, using only terminal-type attributes already provides sufficient information to extract almost all regularity present in a netlist, hence R(p) = T(p). This is because some of the other attributes are partially implied by T(p). Nonetheless, the description in the next section is kept general to be able to accommodate more comprehensive signatures in the model. The extent of regularity between sr and its adjacent entities that are connected via incidences of the same RS is now determined by
the nature of the statistical distribution of the frequencies of that RS over the slices of sr. A uniform distribution corresponds to maximum regularity. For example, in figure 5, the incidences via the same terminal-type τ2 to modules adjacent to those in s1 are clearly uniformly distributed over its slices b1, b2 and b3. Hence, the modules m2, m3 and m4, which are of the same type t4 associated with τ2, have a regular interconnect structure towards sr. The same obviously does not apply to the module of type t5. Entity types like t4 are called multi-slice types. These essentially form datapaths if the associated entities m2, m3 and m4 are repeated over the width of the datapath. Alternatively, this may be due to a single multi-slice entity, incident to multiple slices, like net n7 or module m1. In that case, it may be a non-expandable block such as a hard macro, so its slice membership will be left undetermined, e ∈ b?, while it does belong to some stage. Otherwise, the multi-slice entity may even contain another datapath block, which may also be considered by expanding it into the current circuit level. Finally, the two decompositions S and B of E can thus be inferred by repeatedly composing new reference stages if they are sufficiently regular.
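To make the bookkeeping of this section concrete, here is a small C++ sketch of the model: pins carry terminal-types (used as regularity signatures, since R(p) = T(p) in practice), and the partial functions S and B are represented as maps that start out empty, corresponding to s? = b? = E. The names are illustrative and are not taken from the authors' implementation.

    #include <map>
    #include <string>
    #include <vector>

    using EntityId = int;              // index of a module or net in the entity set E
    using TerminalType = std::string;  // T(p); here the regularity signature R(p) = T(p)

    struct Pin {
        EntityId module;
        EntityId net;
        TerminalType type;
    };

    struct StageSliceTags {
        std::map<EntityId, int> stage;   // S : E -> N ; missing key means "still in s?"
        std::map<EntityId, int> slice;   // B : E -> N ; missing key means "still in b?"

        // An entity becomes "known" once at least one of the two memberships is assigned.
        bool known(EntityId e) const { return stage.count(e) > 0 || slice.count(e) > 0; }
    };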
6 LOCAL REGULARITY METRIC
In order to quantify the extent of regularity between the bit-slices in the current reference stage sr and the RSes of the connections to the neighborhoods of all members of sr, a numerical relation between the occurrence of each RS in the neighborhood and each bit-slice in this reference stage is formulated. Only not-yet-known entities in these neighborhoods are considered. An entity e is said to be known if at least its membership to a slice or to a stage is known, i.e. e ∉ s? or e ∉ b?. The analysis proceeds as follows: Suppose entity e is known to be in stage sr. Let P(e) denote the set of pins of e; then the set of RSes adjacent to sr is given by

R(sr) = ∪ { R(p) | e ∈ sr, p ∈ P(e) }

Next, we use the function Γ(sr) to express the connectivity structure between the slices in sr and the RSes:

Γ(sr) : B(sr) x R(sr) → 2^P

where B(sr) = ∪ { B(e) | e ∈ sr } is the set of slices that are found in the reference stage. Γ(sr) is formally defined as

Γ(sr, bi, ri) = { p ∈ P(e) | e ∈ sr ∩ bi ∧ R(p) = ri }

Thus Γ(sr, bi, ri) returns the set of pins attached to entities in slice bi of stage sr that have RS ri. The complete bipartite graph in fig-
ure 6 depicts Γ(s1). The pin sets implementing the adjacency of the elements in the respective classes are shown on the edges. For example, if ri = τ1, then Γ(s1, b2, ri) = {p1, p2}.

Figure 6. Signature to bit-relation graph Γ(s1)

We can now quantify the extent of regularity of the neighborhood of sr with regard to RS ri by interpreting the distribution of the number of pins to the individual slices of sr. We therefore construct a score vector X(sr, ri) in which entry x[i] holds the connection count between bit-slice bi of sr and signature class ri, hence

x[i] = |Γ(sr, bi, ri)|

These counts may vary from many times in all slices to zero in all slices except one. A uniform distribution of all elements of X(sr, ri) corresponds to maximum datapath regularity at stage sr with respect to the adjacent entities with signature ri. For example, the perfect regularity between stage s1 and the modules of type t1 in the circuit of figure 5 is reflected by

X(s1, τ1) = [ |{p1, p2}|, |{p6, p7}|, |{p11, p12}| ] = [2, 2, 2]

Conversely, regularity deviations are manifested by non-uniform distributions of X, like the incidences with r3 = τ3, since X(s1, r3) = [1, 1, 0].

In general, the uniformity of the distribution of X decreases as the corresponding regularity decreases. The simplest uniformity measure of the number series X is its range L(X) = maxi(x[i]) - mini(x[i]), where i ranges over the w entries of X. Considering only the range would already suffice if only fully regular relations need to be detected. For instance, L(X(s1, r1)) = 0 implies complete regularity, while L(X(s1, r3)) = 1 indicates non-regularity. The range does not, however, provide much information about the extent of irregularity. For instance, L([1,1,1,1,1,0]) = L([1,0,1,1,0,0]) = 1, whereas the former vector is clearly preferable over the latter. To be able to distinguish between large and smaller irregularities, we also consider the number Z(X) of zero entries in X and the average avg(X) of the vector in a sum of weighted terms. We propose the following local relative regularity metric ρ(sr, ri) between stage sr and signature ri, defined as follows:

ρ(sr, ri) = cZ Z(X(sr, ri)) + cL L(X(sr, ri)) + cavg (avg(X(sr, ri)) - 1)

where cZ, cL and cavg are weight factors, chosen such that aliasing between the terms cannot occur. In our experiments, we used 10000, 100 and 1, respectively. The metric has the following important properties:
* ρ(sr, ri) is inversely proportional to the extent of regularity from sr with respect to its adjacent cells with signature ri.
* Maximally regular: ρ(sr, ri) = 0
* Maximally irregular: ρ(sr, ri) = ∞
* The value of ρ(sr, ri) increases monotonically as the regularity decreases.
These properties enable comparisons of the extent of regularity between different stages of the datapath and between different signatures, making the metric suitable for the regularity extraction algorithm described in the next section.
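As a small worked sketch (ours, not the authors' code), the metric can be computed directly from a score vector X; the weights below are the values quoted in the text, and the example vectors reuse the ones discussed above.

    #include <algorithm>
    #include <cstdio>
    #include <numeric>
    #include <vector>

    // rho(sr, ri) = cZ*Z(X) + cL*L(X) + cavg*(avg(X) - 1), computed on the
    // score vector X(sr, ri). Weights follow the paper: 10000, 100 and 1.
    double regularityMetric(const std::vector<int>& X,
                            double cZ = 10000.0, double cL = 100.0, double cavg = 1.0) {
        if (X.empty()) return 0.0;
        int zeros = static_cast<int>(std::count(X.begin(), X.end(), 0));   // Z(X)
        auto mm = std::minmax_element(X.begin(), X.end());
        int range = *mm.second - *mm.first;                                // L(X)
        double avg = std::accumulate(X.begin(), X.end(), 0.0) / X.size();  // avg(X)
        return cZ * zeros + cL * range + cavg * (avg - 1.0);
    }

    int main() {
        std::printf("%g\n", regularityMetric({1, 1, 1}));  // 0: maximally regular
        std::printf("%g\n", regularityMetric({2, 2, 2}));  // 1: uniform but doubled incidences
        std::printf("%g\n", regularityMetric({1, 1, 0}));  // ~10100: a zero entry dominates
    }

Because the weights differ by two orders of magnitude, a single zero entry always outweighs any range or average contribution, which is exactly the "no aliasing between the terms" condition mentioned in the text.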
7 REGULARITY EXTRACTION ALGORITHM
Our regularity extraction algorithm works by expanding search-waves through the network, stage by stage. It uses the relative regularity metric introduced in the previous section to determine how to expand the wave such that every expansion is as regular as possible, while remaining able to deal with a certain amount of non-regularity. Suitable initializing reference stages can be found using another characteristic of datapath circuitry present in a subset of candidate stages, namely the occurrence of many nets that satisfy the following condition: the net is connected once to one terminal-type, and multiple times (typically 4 or more) to one other terminal-type. For example, see net n7 in figure 5. This datapath property is due to the fact that datapaths are formed by repeating bitwise operators that are operated in parallel. Such nets typically carry the control signals that apply to all bits of the datapath, like clock lines, enable lines, multiplexer address selectors, etc. Any such net with a high number of pins may induce a suitable first reference stage. Alternatively, the user may explicitly specify initial reference stages.
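A sketch (ours, not the authors') of the seed-net condition just described: a net qualifies if its pins fall on exactly two terminal-types, one hit exactly once and the other at least four times.

    #include <map>
    #include <string>
    #include <vector>

    struct SeedPin { int module; std::string terminalType; };

    // Condition from section 7 for nets that may induce a first reference stage,
    // e.g. clock, enable or multiplexer-select lines fanning out over all bits.
    // The width threshold of 4 follows the "typically 4 or more" remark.
    bool isSeedNet(const std::vector<SeedPin>& pins, int minWidth = 4) {
        std::map<std::string, int> perType;             // pin count per terminal-type
        for (const SeedPin& p : pins) ++perType[p.terminalType];
        if (perType.size() != 2) return false;          // one driver type, one receiver type
        bool once = false, wide = false;
        for (const auto& kv : perType) {
            if (kv.second == 1) once = true;
            if (kv.second >= minWidth) wide = true;
        }
        return once && wide;
    }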
Some particular non-regularities occur in most circuits, for example some bits, notably the MSB and LSB, differ at some spots. The extraction algorithm will usually still be able to expand the wave, and since the wave often encloses irregular spots from several directions, entities in non-regular parts of the datapath can still be fit into a suitable position later. The algorithm is outlined below:
1. Find as many seed-stages as possible.
2. If there are no (more) seeds, exit.
3. The seed-stage with the highest number of slices is selected as the first sr to start a search-wave.
4. Build Γ(sr).
5. Compute ρ for every RS in Γ(sr).
6. If the threshold is not satisfied, go to 8.
7. Enter the pin set returned by Γ(sr) into queue Q, keyed by ρ.
8. If Q is empty, go to 2.
9. Extract the pin set from Q with the lowest ρ.
10. Create a new stage.
11. For each pin p in the pin set: add the entity connected via p to the new stage, in the slice inherited via p from the reference stage.
12. The new stage becomes the reference stage sr.
13. Go to 4.
A number of threshold values are used to control the expansion process. The algorithm does not work well if there are too few slices; a minimum datapath width of 4 slices seems sufficient. Waves that die before a configurable minimum number of stages is reached are discarded, etc. Also, if the extent of irregularity exceeds a tunable value, the current candidate stage is dropped. The thresholds and weight factors can be set such that the algorithm will only find fully identical slices, if present. A post-processing phase resolves undefined stage and slice tags for entities for which, a posteriori, a clear choice can now be made. This means that if a slice and stage tag is induced for a module by a majority of its environment, and if the conditions for it being part of a datapath are met, it will still be added to the datapath. Otherwise, it is identified as being part of insufficiently regular logic. The run-time complexity of our algorithm is only O(|P|), since every pin is analyzed at most once from a net and once from a module. Actual run times may be significantly smaller still, since the search-wave can never expand into non-regular areas of the circuit, hence no time is wasted in non-regular circuitry. The space complexity is basically determined by the number of candidate extensions in the wavefront, which requires only a very small amount of storage.
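The thirteen steps above translate almost directly into a wavefront loop. The following C++ skeleton is only a sketch of that control flow: Stage, PinSet and the helper functions stand in for the structures of sections 5 and 6 and are stubbed out here.

    #include <queue>
    #include <vector>

    struct Stage { int slices = 0; };
    struct PinSet { double rho = 0.0; };                       // pins sharing one RS, with its metric

    struct ByRho {                                             // queue keyed by rho, lowest first
        bool operator()(const PinSet& a, const PinSet& b) const { return a.rho > b.rho; }
    };

    std::vector<Stage> findSeedStages() { return {}; }              // step 1 (stub)
    std::vector<PinSet> buildGamma(const Stage&) { return {}; }     // steps 4-5 (stub)
    Stage growStage(const Stage& sr, const PinSet&) { return sr; }  // steps 10-11 (stub)

    void extract(double rhoThreshold) {
        std::vector<Stage> seeds = findSeedStages();                         // 1
        for (Stage sr : seeds) {                                             // 2-3 (widest seed first, simplified)
            std::priority_queue<PinSet, std::vector<PinSet>, ByRho> q;
            while (true) {
                for (const PinSet& ps : buildGamma(sr))                      // 4-5
                    if (ps.rho <= rhoThreshold) q.push(ps);                  // 6-7
                if (q.empty()) break;                                        // 8: wave died, next seed
                PinSet best = q.top(); q.pop();                              // 9: most regular extension
                sr = growStage(sr, best);                                    // 10-12: new reference stage
            }                                                                // 13: rebuild Gamma for new sr
        }
    }

Each pin would enter the queue at most once from its net side and once from its module side, which is the argument behind the O(|P|) bound quoted above.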
8 RESULTS
We implemented our algorithm in C++. The regularity extraction and hierarchy transformation tool is part of a larger framework under development, aimed at performing fully automatic placement and routing in an environment where the designs are automatically generated from an abstract specification and where human intervention in the synthesis backend will no longer be a viable option. Input to the program is an EDIF netlist file, which may or may not be hierarchical. Figure 2 shows the relevant part of the system. Output is the restructured netlist in EDIF, plus a file describing the regularity found. The latter file can then be read by the programs that perform the two linear orderings. The partial placement is then supplied to a standard placement and routing backend to complete the placement. Since the emphasis of this paper pertains to regularity extraction, we do not elaborate on the two subsequent linear orderings. Lacking a general way to quantify the success of regularity extraction for circuits that are not completely regular, we used indirect metrics indicating the usefulness of the results, namely the percentage of regular circuitry. Table 1 presents some results of the extraction algorithm on a number of examples. The run times were measured on an HP9000/735 workstation. In all examples, we used the default values of all tunable parameters. The first 3 circuits are automatically generated by an HLS system. TTA and 8048 are microprocessor cores. Circuit CDFilter is a signal processor for CD audio. The remaining circuits come from the standard cell benchmark set. The percentage of the total number of cells assigned to the datapath by the extraction process should be interpreted with care, since some circuits include controllers and other non-regular circuitry. In most benchmark netlists the identifiers are stripped, so that they do not provide information regarding the circuit's structure which could have served as a reference. One notable advantage of a non-flattened circuit description is that it contains more multi-slice modules, which can be automatically and selectively opened to keep memory requirements low. Note that not all extracted slices have to be identical.
9 CONCLUSIONS AND REMARKS
This paper presents a very fast new technique for datapath extraction. It is based on a new metric quantifying the extent of regularity between a known regular part of a datapath and its neighborhood. A search-wave guided by this metric constructs the rest of the datapath, regaining its 2-dimensional regular structure and allowing fast, dense and technology-independent placement of the datapath's cells. The algorithm can deal with large circuits that need not be fully regular. Note that minor irregularities in datapath circuits, such as carry-lookahead logic per 4 bits in a 16-bit wide datapath, have only a minor effect on the usefulness of the results for subsequent placement, for two main reasons. The first reason is that the post-processing takes care of the larger part of these cases. Secondly, the remaining cells which might qualify for being placed in the datapath will, if sufficiently strongly connected to other cells which
Circuit name            total cells   regular cells found   regular %   DP chunks   MAX width   time (sec)
wave digital filter            9180                  8041         87%           5          32         6.7
diffeq8                        1273                   855         67%           1           8         0.5
elliptic8                      1857                  1370         74%           2           8         1.
register file                   730                   694         96%           2           8         0.4
i8048 w/ ctrl w/o mem           948                   333         35%           2           8         0.7
TTA-16                         6720                  5418         80%           9          16         5.6
CDFilter                       7218                  5088         70%          12          42         4.7
struct                         1888                  1879         99%           2          16         1.3
fract                           125                    72         58%           1           9         0.1
biomed                         6417                  5458         85%           1          20         1.9
avq-large                     25114                 18928         75%           1          16         8.0
avq-small                     21854                 16451         75%           1          16         7.2

Table 1. Datapath extraction results
were already explicitly preplaced regularly, be pulled into the regularly placed area because of the wire-length reduction performed by the placement tool. Finally, in the simplified layout model we used, subsequent placement is assumed to be row-based, with only one single row of standard cells per bit slice. Clearly, in the case of a very large number of datapath stages compared to the number of slices, the aspect ratio of the datapath matrix may become unfavorable with respect to the global floorplan of the chip. Allowing 2 or 3 rows per slice can greatly alleviate this effect while hardly affecting the advantages of regular placement generation. Alternatively, the datapath can be folded.
REFERENCES
[1] T. Asano. An optimum gate placement algorithm for MOS one-dimensional arrays. Journal of Design Systems, 6(1):127, 1982.
[2] H. Cai, S. Note, P. Six, and H. De Man. A data path layout assembler for high performance DSP circuits. In Proceedings of the Design Automation Conference, pages 306-311. ACM/IEEE, 1990. Paper 18.1.
[3] C.E. Cheng and C.-Y. Ho. Sefop: A novel approach to data path module placement. In Proceedings of the International Conference on Computer-Aided Design, pages 178-181. IEEE, Nov 1993.
[4] Compass Design Automation. Compass Datapath Compiler, v8r3 edition, 1991.
[5] Marshburn et al. Datapath: a CMOS datapath silicon assembler. In Proceedings of the Design Automation Conference, pages 722-12. IEEE, 1986.
[6] C.M. Fiduccia and R.M. Mattheyses. A linear-time heuristic for improving network partitions. In Proceedings of the Design Automation Conference, pages 175-181, 1982.
[7] S. Goto, I. Cederbaum, and B.S. Ting. Suboptimum solution of the back-board ordering with channel capacity constraint. IEEE Transactions on Circuits and Systems, 24(11):645-652, Nov 1977.
[8] M. Hirsch and D. Siewiorek. Automatically extracting structure from a logical design. In Proceedings of the International Conference on Computer-Aided Design, pages 456-459. IEEE, 1988.
[9] J.M. Kleinhans, G. Sigl, F.M. Johannes, and K.J. Antreich. Gordian: VLSI placement by quadratic programming and slicing optimization. IEEE Transactions on Computer-Aided Design, 10(3):356-365, 1991.
[10] H. Nakao, O. Kitada, M. Hayashikoshi, K. Okazaki, and Y. Tsujihashi. A high density datapath layout generation method under path delay constraints. In Proceedings of the Custom Integrated Circuits Conference, pages 9.5.1-9.5.5. IEEE, 1993.
[11] G. Odawara, T. Hiraide, and O. Nishina. Partitioning and placement technique for CMOS gate arrays. IEEE Transactions on Computer-Aided Design, CAD-6(3):355-363, May 1987.
[12] R. Leveugle and C. Safinia. Generation of optimized datapaths: bit-slice versus standard cells. IFIP Transactions A, A22:153-66, Sept. 1992.
[13] C. Sechen and K.W. Lee. An improved simulated annealing algorithm for row-based placement. In Proceedings of the International Conference on Computer-Aided Design, pages 478-481, 1987.
[14] Y.-W. Tsay and Y.-L. Lin. A row-based cell placement method that utilizes circuit structural properties. IEEE Transactions on Computer-Aided Design, 14(3):393-397, Mar 1995.
HIERARCHICAL NETLENGTH ESTIMATION FOR TIMING PREDICTION
Wolfgang Hebgen (1)   Gerhard Zimmermann (2)
(1) Rost+Partner, Kaiserstr. 42, D-60329 Frankfurt, Germany, e-mail: [email protected]
(2) Computer Science Department, University of Kaiserslautern, D-67653 Kaiserslautern, Germany, [email protected]
ABSTRACT
With decreasing feature sizes of VLSI chips, the importance of wiring capacitances and resistances increases. Timing prediction therefore has to include wire length information. Our goal is the estimation of the length of individual nets, or even net segments, based on hierarchical netlists and on properties of the layout primitives, e.g. standard cells. These data can be fed into a normal timing analyzer. Our basic assumption is that layout properties, regardless of the placement and routing methods used, can be modeled by slicing trees, which can be deduced from netlists by good partitioning algorithms. This assumption was successful for area estimation. This paper shows that the length of individual nets can be estimated with good confidence based on the same assumption. Individual netlengths are necessary if logic path delays and circuit performance have to be estimated. Prior work has not made as much use of the structural information in the netlists and therefore only made statements about statistical properties of the ensemble of all nets. The paper also shows experimental data and theoretical foundations to support these claims. However, the properties of layout synthesis algorithms cannot be modeled analytically today, and empirical knowledge has to be included in the models. A procedure using the theoretical model and empirical data is explained, together with first experimental results. The paper has practical applications in VLSI chip design and also provides further insight into the interrelation between layout properties and timing. The paper shows a snapshot of our current research activity and points out areas where further research is necessary.
1 INTRODUCTION
Layout synthesis of VLSI chips has reached a very high standard in regard to chip area utilization. This is due to the quality of placement and routing algorithms for sea-of-gates and standard cell design styles. Non-hierarchical and hierarchical methods are used. Top-down chip planning has been made possible by reliable area estimation. With decreasing feature sizes the influence of net lengths on overall path delays increases and can account for more than 50%. Therefore, timing issues have gained greater attention in layout synthesis research. The original assumption that a small chip area guarantees short wirelengths, and therefore a short critical path length, has failed in many cases, as had to
be expected for statistical reasons. Therefore, layout synthesis systems have been extended by either putting higher weights on nets on possibly critical paths or by trying to maintain upper delay limits on these paths. Both methods have shown good results in non-hierarchical approaches. Both methods require a prediction of possible critical paths. This, in turn, requires netlength estimations to be realistic in submicron technologies. A by-product of predicting the critical path length is the prediction of performance. This is very important in view of increasing clock rate requirements. Performance predictions should therefore be available as early in the design process as possible to reduce the number of expensive design iterations. In hierarchical layout styles with chip planning, nets or logic paths can traverse several cell blocks at several levels of the hierarchy. It is therefore necessary to plan the distribution of net delays or the slack between blocks in such a way that optimal overall timing properties can be achieved. Normally, in synchronous automata, this optimum means the highest possible clock rate, which, besides good clocking schemes and clock distribution, depends on the delay of the longest path between storage elements. Such planning of the distribution is only possible if we have delay estimates for the unfinished parts of the layout, either unplanned or planned but not laid out. Regardless of the discussion - do we need hierarchical layout methods or not - they will be used at the upper limit of design complexity and in mixed cell type layouts. It is therefore desirable to obtain estimates with all the knowledge we have at each phase of a design and with as little effort as possible. Finishing the layout in a prototype fashion to get estimates is one method. The achieved results may be reliable, especially if the prototype also becomes the final layout. But this method provides no insight into the interrelation between layout properties and timing characteristics, as our approach does. Our approach seeks to estimate the lengths of logic paths, starting at the moment when the structure of the chip is finalized (all schematics or netlists at all levels of the hierarchy are known), and to improve the estimates along with design decisions as they are made. Such decisions are made when the physical cell hierarchy is established (repartitioning), during chip planning phases, cell layout and chip assembly.
In this paper we concentrate on the first phase, before or after a cell hierarchy has been established. Even for this phase we are in the middle of research and can only show results for standard cell blocks. Therefore, this paper is meant to trigger a discussion instead of showing final results. The problem is academically and practically very difficult and we would like to see more research done in this area.
Fig. 1. Synchronous automaton with critical path.
Fig. 2. Logic path example.

2 PREVIOUS WORK
Wire length prediction has a relatively long history, although more with the goal of predicting wiring space and routability, and not many contributions. A good overview is given in /Han88/ and we will only mention a few papers here. Donath /Don79/ derived upper bounds for average wire lengths based on Rent's rule /LaR71/ and also considered wire length distributions on the same basis /Don81/. These models have been improved by Feuer /Feu82/ and by Sastry and Parker. In /SaP86/ it was shown that the theoretical model predicts a Weibull distribution for gate arrays, which fits measured wire length distributions very well. Statistical models have also been used by Kurdahi and Parker to predict channel widths in standard cell blocks, using an average wiring length factor /KuP86/. Pedram and Preas could show that this factor is not necessary /PeP89/. All these methods used the wire length to estimate area. They make little use of the knowledge of the circuit structure and only use the number of nets. Keller describes hierarchical models for logic paths in a hierarchical design environment /Kel89/. He tries to abstract from the large number of nets in a block by characterizing the paths between its pins. But the problem of the high complexity of timing property analysis and description remains. This paper is based on Hebgen's work /Heb95/ and adds stochastic models.

3 THE NETLENGTH ESTIMATION MODEL
Let us assume that the digital system is a synchronous automaton as in Fig. 1 and that the maximal clock rate is determined by the total delay on the critical path through the network between outputs and inputs of the storage elements. Since we do not know beforehand which path is critical, essentially all logic paths between storage elements have to be considered. In a hierarchical design environment, the system is partitioned into subsystems that can be assigned to printed circuit boards, MCMs, chips or blocks on a chip. We will only consider a hierarchy of blocks on a chip and further assume for this paper that the blocks are composed of standard cells, because this is easy to imagine and currently we only have experimental results for this case. The presented model is not restricted to this case and does not make use of this assumption. It is clear that a block will contain complete logic paths as well as fragments. Thus we have to extend the notion of end points of logic paths to I/O pins of blocks as well. Therefore, all nets are included in the estimation process. In Fig. 2 a logic path between the end points A and B is depicted. Nets n1 ... n5 contribute to its delay. Here we assume that a net only contains one wire. In the implementation, nets can be vectors of wires if they do not have to be distinguished in length. The individual contributions of nets to the path delay are assumed to be independent of each other. Nets with np > 2 pins can be split into np - 1 net segments. Independent of the timing model used, the geometric length of these segments is the best we can estimate before layout, and that is our goal. The only meaningful distinction we can make is to split this length into its horizontal and vertical components. In principle these segments can be used in RC-tree timing models. Currently we doubt that this is meaningful, taking the variance of the results into account. Further research is necessary for this purpose.

3.1 The Slicing Model
In order to estimate the length of nets we have to predict the result of layout synthesis. This seems to be impossible because of the many different algorithms and because of the indeterministic results of some of the algorithms. Also, in hierarchical design environments, neither the shape of a block nor the positions of its pins are known before chip planning. Fortunately, the layout results of different placement algorithms are at least similar in regard to achieved area utilization, although the layouts may look totally different. This result has been established in many benchmark experiments. We have made use of this property in our shape function estimation method /Zim88/. We have chosen partitioning-based placement as a representative for all other
Fig. 3. Oriented slicing tree (a) and a corresponding geometry (b).
Fig. 4. Example of a shape function with orientations and cell shape example.
methods, even if it may not deliver the best results today. This could be accounted for by parameter settings. Bipartitioning the circuit of a block, if applied recursively, results in a binary tree. We distinguish between two types of vertices in the tree: The leaves represent the primitives. If we assume the circuit of a block with standard cells as primitives, the leaves are the standard cells. The leaves can also be sea-of-gates cells, macro cells, or, in a hierarchical design environment, blocks at a lower level. We simply call the leaves cells. All other vertices are called nodes, with the root node representing the block. The tree can be interpreted as a slicing topology or slicing tree. With orientation, ordering, and sizing, the slicing tree can be transformed into a slicing geometry. Orientation decides on the direction of the "slicing line" that separates two sibling vertices in the tree. It can be horizontal (h) or vertical (v) and can be assigned to the parent node of the siblings. Ordering decides on the direction of the siblings relative to the slicing line and can be either top/left (tl) or bottom/right (br), depending on the orientation. Ordering can be assigned to the edges of the slicing tree. Sizing assigns dimensions to the nodes such that the siblings plus wiring space fit into a node rectangle and the rectangles partition the rectangular block area. Sizing is the process of first calculating the shape functions for all tree nodes bottom-up and then selecting the proper shapes top-down. Fig. 3 shows an example of a slicing tree (a) and a corresponding geometry (b). One node has been highlighted to show that it represents the horizontal slicing line between (A, B) and (E, F) as well as the rectangle containing all four cells. On purpose, the slicing tree in Fig. 3a is not ordered. The reason is that, from the netlist, we can derive optimal orientations based on the knowledge of the shape functions of all leaf cells and the expected shape of the block. If the latter is not known, we can even find the orientations for all possible shapes of the block during the shape function generation process /Zim88/. Fig. 4 shows an example of an estimated shape function of a node. All shapes on or above the solid and dotted line are possible. The lower left-hand corners are the shapes with minimal area. Orientations are
assigned to the corners. But we cannot decide on the ordering without further knowledge about the layout, especially the positions of pins on the block's perimeter. Thus, without ordering, the geometry in Fig. 3b is only one of many possible geometries of the slicing tree. For the shape function estimation, the results are independent of ordering. The questions to be answered for timing prediction are: Is a slicing topology also representative for the length of individual wires? Since even different executions of the same partitioning algorithms may result in obviously different partitions: What are the properties of the partitions that are the same for all different partitions of the same circuit and also for different layouts? How important is ordering? How reliable are wire length estimates? Before we try to answer these questions, we further extend the estimation model. Let us assume that a good partitioning is representative for a good layout in the respect that strongly connected cells will be at a short distance in the slicing tree as well as in the layout. The distance in the tree can be measured by the number of nodes on the shortest path between the cells, the distance in the layout by the Manhattan distance of the centers of the cells. Since the Manhattan distance is a good measure for the length of the net segments connecting the cells, we have to relate the distance in the tree to the length accordingly. For this purpose we will define a new distance metric between leaf cells. Let V be the set of vertices of the slicing tree of block Bl, with subset C being the set of cells c(i) and subset NO being the set of nodes no(i) of the tree, with b representing the node of the area of the block. Let bio be the I/O pin frame of Bl. Then bio and b compose Bl as shown in Fig. 5. Let P be the set of all pins, with subset Pi the set of all internal pins pi(c(i), j) and subset Pio the set of I/O pins pio(j) of Bl. Here j is the index of a net n(j) and P = Pi ∪ Pio. Let N be the set of nets n(j) of Bl. A net is defined as the set of pins it connects: n(j) = { pi(c(i), j), pio(j) | pi, pio ∈ P }. If n(j) contains no I/O pins we call it an internal net. If n(j) connects only two pins and one is a pio, we call it an I/O net. If n(j) connects at least two internal pins and one I/O pin, we call it a mixed net.
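The definitions above map naturally onto a small data structure; the C++ sketch below (illustrative names, not the authors' code) shows a slicing-tree vertex and the three-way net classification.

    #include <memory>
    #include <vector>

    enum class Orientation { Horizontal, Vertical };   // direction of the slicing line

    struct SliceVertex {                                // leaf = cell, inner vertex = node
        bool isCell = false;
        Orientation cut = Orientation::Horizontal;      // only meaningful for inner nodes
        double x = 0.0, y = 0.0;                         // dimensions after sizing
        std::unique_ptr<SliceVertex> left, right;        // the two siblings of the bipartition
    };

    struct NetPin { bool isIO = false; };                // internal pin or I/O pin of the block

    enum class NetKind { Internal, IO, Mixed };

    // Classification from the text; other pin mixes are treated as mixed in this sketch.
    NetKind classify(const std::vector<NetPin>& net) {
        int io = 0;
        for (const NetPin& p : net) if (p.isIO) ++io;
        if (io == 0) return NetKind::Internal;                  // no I/O pins
        if (net.size() == 2 && io == 1) return NetKind::IO;     // two pins, one of them I/O
        return NetKind::Mixed;                                  // at least two internal pins plus I/O
    }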
Fig. 5. Assignment of nodes to net segments and possible routes in the geometry.

Let np(j) be the cardinality of n(j), that is, the number of pins it connects. If np(j) > 2, the net is split into np(j) - 1 segments s(i, j), each connecting two points of the net. Points can be pins or Steiner points. This process is made unique using the slicing tree. Each segment is assigned to the vertex i of the tree which fully contains the segment. This means that the two points either emerge from the two subtrees of a node or belong to the same cell. Fig. 5 shows an example. A point in the tree is either a cell or another segment of the net. This other segment is represented by its node. Because of the slicing tree, the segments of a net build a segment tree that is a subtree of the slicing tree. We distinguish two types of segments: Primary segments connect two cells, secondary segments connect to one or two nodes. If we assume that a net or segment is wired within the bounding box of its pins, then a node or rectangle in the slicing geometry is the smallest node that geometrically contains a segment with all its subsegments. From Fig. 5 it is also clear that, because all I/O pins belong to bio, all I/O nets are assigned to node Bl, and in mixed nets all segments to I/O pins are also assigned to Bl. Since we proceed bottom-up from the leaves to the root, the I/O segments are handled last. In this paper we only handle internal segments. Now we will try to answer the question: Is it possible to estimate netlengths with all the uncertainties of layout? For this purpose we conducted an experiment. We took the circuit "alu", a 32 bit ALU with 1015 standard cells and 1059 nets, and did the layout with a simulated annealing tool 50 times, with different start solutions, different I/O pin positions, and different aspect ratios. Then we extracted the lengths of individual nets for all cases and looked at the frequency distribution. Fig. 6 shows two results. Net_123 seems to support our assumption perfectly. Its length distribution is very narrow, which means that its pins have about the same distance in all layouts. Net_42 supports the assumption not quite as well, but still well enough if we consider that the half perimeter length of the "alu" is 1000 grid units. We looked at many nets from this design and the samples are typical. These experiments show the limits of netlength estimation. A predicted netlength cannot be more precise than the inherent variance in the layout method.
Fig. 6. Measured netlength frequencies for different layouts of the "alu". The abscissa is divided in grid units. Half perimeter of the block is 1000 grid units.
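The assignment of a segment to the smallest tree vertex containing both of its end points, as described before the experiment, is essentially a lowest-common-ancestor lookup. A minimal C++ sketch (ours), assuming parent pointers and depths have already been computed for the slicing tree:

    #include <vector>

    struct TreeVertex { int parent; int depth; };   // slicing-tree vertex (cells are leaves)

    // Returns the lowest vertex whose two subtrees contain the end points a and b;
    // if a == b the segment lies within a single cell and is assigned to that cell.
    int segmentVertex(const std::vector<TreeVertex>& tree, int a, int b) {
        while (a != b) {
            if (tree[a].depth >= tree[b].depth) a = tree[a].parent;   // climb the deeper end point
            else                                b = tree[b].parent;
        }
        return a;
    }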
Fig. 7. Section length definition for different orientations

3.2 Primary Segment Length Model
With these definitions we now come back to the length of segments, which is our primary goal. Netlengths and path lengths can be derived from the segment lengths. Let a vertex i have the horizontal and vertical dimensions x(i) and y(i) in its geometrical representation. Let each segment s(i, j) be composed of two orthogonal sections with the lengths lx(i, j) and ly(i, j). Since we may not know x and y, we also introduce relative lengths lxr(i, j) = lx(i, j)/x(i) and lyr respectively. Our first goal is the estimation of all relative section lengths. The second goal is the estimation of node dimensions, and with these two we can calculate the absolute section and segment lengths. The relative length of a section depends on the position of its node in the slicing tree, in the segment tree, and on its orientation. Let us start with primary segments. Fig. 7 shows a simple example of a segment connecting cells c1 and c2. It is not important what the exact routing is as long as it takes the shortest Manhattan path. There are obviously different rules for sections orthogonal (o) and parallel (p) to the slicing line. Let us therefore rather determine lo, lor, lp, and lpr accordingly, as long as the orientation is not known. We also cannot know where in nodes v2 and v3 the pins of cells c1 and c2 are located, as long as the ordering is not known. We can only assume that they will, on average, be closer to the slicing line than to the opposite sides. Experiments suggest that this tendency increases with the number of cells in the vertex to which the segment is assigned. We do not have sufficient experimental evidence to prove a relation. We therefore assume a lin-
ear probability distribution (by distribution we always mean densities) of pin positions perpendicular to the slicing line. Fig. 8 shows examples with different slopes. Because the pin positions on both sides are assumed not to be correlated and

lo = (x1 - x'1) + x'2    (EQ 1)

the length distribution is the convolution of the pin position distributions:

f(lo) = conv( f1(x1 - x'1), f2(x'2) )    (EQ 2)

The results are depicted in Fig. 8.

Fig. 8. Different pin position probability functions and resulting orthogonal section length probability distributions

Fig. 9 shows some experimental results. They have been achieved by measuring the length of 2-point nets parallel to the rows in a standard cell layout that was placed with simulated annealing. In order to simulate a certain node in a slicing tree, only nets within a window of m rows and 2n cells in a row have been considered that cross a vertical slicing line in the center of the window (n cells in a row on either side). The window was moved across part of the layout to gather statistics. The result seems to fit the bottom row of Fig. 8. Similar rules apply for the sections parallel to the slicing line. Fig. 10 shows the length distribution in the case that the locations of the pins parallel to the slicing line are independent of each other and equally distributed. This is a worst case assumption.

Fig. 9. Definition of a 2x8 standard cells window and distribution of lo. Average length of 8 cells is 160 grid units.

Fig. 10. Parallel section length probability distribution

From these distributions we can gain two results. First, since we cannot know where in the distribution a certain net is, we use the averages of the distributions as relative lengths lor and lpr. If the pins are equally distributed in the two sibling areas, loreq = 0.5 and lpreq = 0.33. Because of routing "detours" the factors can be larger. With increasing numbers of cells in the corresponding nodes, the factors decrease. First experiments indicate the following relation. If nco is the average number of cells orthogonal to the slicing line, we found:

lor = loreq (a + b / nco)    (EQ 3)

with a = 0.5 and b = 2. For lpr we have no such relation yet. Thus we only define, with the average number of cells parallel to the slicing line (ncp):

lpr = g(lpreq, ncp)    (EQ 4)

If x and y are the horizontal and vertical dimensions of a node, we can calculate the primary section lengths for this node:

lx = lor x;  ly = lpr y   for orientation v
lx = lpr x;  ly = lor y   for orientation h    (EQ 5)
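A small numerical sketch of EQ 3 to EQ 5 (our illustration; note that the algebraic form of EQ 3 is reconstructed from the garbled original and the quoted constants a = 0.5, b = 2, so it should be treated as an assumption):

    #include <cstdio>

    struct SectionLengths { double lx, ly; };

    // Primary section lengths for one node of dimensions x, y.
    // lor follows EQ 3 with loreq = 0.5; lpr is kept at its equal-distribution
    // value lpreq = 0.33 because EQ 4 (g(lpreq, ncp)) is left open in the text.
    SectionLengths primarySection(double x, double y, double nco, bool verticalCut,
                                  double loreq = 0.5, double lpreq = 0.33,
                                  double a = 0.5, double b = 2.0) {
        double lor = loreq * (a + b / nco);            // EQ 3: shrinks toward a*loreq as nco grows
        double lpr = lpreq;                            // EQ 4 placeholder
        if (verticalCut) return {lor * x, lpr * y};    // EQ 5, orientation v
        return {lpr * x, lor * y};                     // EQ 5, orientation h
    }

    int main() {
        SectionLengths s = primarySection(100.0, 60.0, 8.0, true);
        std::printf("lx = %.1f, ly = %.1f\n", s.lx, s.ly);   // 37.5 and 19.8 for this example
    }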
A second conclusion we can draw from the distributions is their width. It is a measure for the variance of lor and lpr and thus for the variance of the segment length. But it is only one component of the variance, because the uncertainties of the partitioning process have to be added. This is the subject of further research.

3.3 Secondary Segment Netlength Model
Secondary segments connect to subnets that belong to the subtree of the corresponding node. This will reduce the length, because the secondary segment connects to the nearest point of the subnet. We can either account for this by
another reduction factor or try to model it more precisely. Let us explore the latter possibility. What we need is a geometric representation of the subnet. A bounding box would be a good abstraction. But we did not find a means to calculate it during the estimation process because of the unknown orientations. The bounding box dimensions are related to the sum of lx and ly of the segments of the subnet. The relation will depend on the number of segments in the subnet. We do not know how yet. Instead we use the maximum of the dimensions as the dimensions bx and by of the bounding box:

bx = max(lx(i));  by = max(ly(i));  i is the index of the subnet segments    (EQ 6)

Fig. 11. Secondary segment length reduction by bounding boxes and reduced pin position space.

It is not known where the bounding box is located in the subnode area. This is the same situation as in the case of the primary segments. The difference is the extension of the "pin". This results in a virtual reduction of the node sizes by the same amount; Fig. 11 tries to illustrate this effect with a segment that connects to two subnets. The white area in Fig. 11b is the worst case area in which the ends of the segment can be located, if we assume that it can connect to any point on the bounding box of the subnets. There are two possibilities to take this reduction into account. In the relative length world we can calculate relative bxr and byr by using the quotient of the square roots of the sums of the cell areas of the corresponding nodes as divisor. The better possibility is to apply the reduction during the bottom-up calculation of absolute section lengths. We use this method because it is more precise. This has the following result on the secondary section lengths lxs and lys:

lxs = lor (x - bx1 - bx2);  lys = lpr (y - by1 - by2)   for orientation v
lxs = lpr (x - bx1 - bx2);  lys = lor (y - by1 - by2)   for orientation h    (EQ 7)

bx1, by1 are the dimensions of one subnet, bx2, by2 of the other. If only one subnet exists, the dimensions of the other are set to zero.

Fig. 12. Netlength frequency distribution of the "alu". Abscissa in grid units.

3.4 Netlength Distribution Test
Proving the model correct is impossible. We have to
make as many tests as possible to convince ourselves that the model is meaningful. One such test is a comparison of a measured netlength distribution with a distribution calculated with the model. Fig. 12 shows the distribution for the "alu" and a 1:1 aspect ratio. The layout was produced with our own simulated annealing placement tool and fairly good global and detailed routers. In order to calculate the distributions of all net segments, we need a slicing tree and the number of net segments assigned to each node of the tree. This is a typical result of a bipartitioning tool. We therefore asked a colleague to bipartition the alu with his tool, which gives excellent results /Mal96/, down to the standard cells. This procedure guarantees that the compared results are really independent. Next we needed cell and node dimensions to calculate absolute lengths. For this experiment we used a simplified assumption: All cells have the same size, all vertices have a 1/sqrt(2) aspect ratio, and orientations alternate. This results in a very regular floorplan as shown in Fig. 13.

Fig. 13. Floorplan grid for experiment.

As a further assumption, we assumed equal distributions of the pins in all vertices and no correction factors. Fig. 13 highlights one node with two subnodes and a net segment. The dimension of each node is proportional to the square root of the number of cells in the node. Since the length l of the segment is l = lo + lp, the probability distribution of l is the convolution of the distributions of lo and lp. The aspect ratio has to be taken into account when defining the two distributions, which are shown in the top rows of Fig. 8. The orientation does not matter because the
aspect ratio does not change if we rotate a node. The resulting relative segment length distribution is shown in Fig. 14. The sum of all these distributions, stretched according to the dimensions of the nodes, for all segments of the alu is the total segment length frequency distribution shown in Fig. 15. It is scaled so that the root node has the same dimensions as the layout. It cannot be directly compared to the measured netlength distribution because nearly 50% of the nets of the "alu" have more than two pins. But the experiment shows that in principle the model produces the right kind of distribution. It has to be pointed out, though, that this distribution is not based on Rent's rule, but on a realistic slicing tree of the circuit and the number of cuts in each node.

Fig. 14. Relative segment length probability distribution under the assumptions made in the text.

Fig. 15. Theoretical segment length frequency distribution for the alu.

4 LENGTH ESTIMATION METHOD
From the above we now arrive at a relatively simple segment length estimation method. We will describe the individual steps in prose. There exists a slightly different implementation which we used to develop the stochastic models in chapter 3. Here we explain how we would do it today.

Step 1. Enter the parameters loreq, lpreq, a, and b for EQ 3 and EQ 4.

Step 2. Partition the circuit of the block recursively until only two cells are left in each partition. Create a node no(i) for each partition and build the slicing tree during the recursion. Enter the names of the nets that are cut into each node's segment list. Enter the number of cells nc(i) represented by the node.

Step 3. Generate the shape functions for all nodes (in post order) with the algorithm in /Zim88/. Fig. 4 shows an example. The shape function of each node contains routing space for all net segments of the node. Transparency has been subtracted. Save the added routing space values. Calculate the average number of cells parallel ncp(i, k) and orthogonal nco(i, k) to the slicing line for every corner k of the shape functions.

Step 4.
Distribute the routing space of all nodes to its siblings' shape functions, starting at the root (in pre order). This is necessary to estimate the dimensions of all nodes with all wires ending in or passing through the node as precisely as possible, in respect to the final layout. This step is under development.
Step 5. Traverse the slicing tree in post order to calculate section lengths lx(i, k), lxs(i, k), ly(i, k), and lys(i, k) for all corners k of the shape functions for all nodes, using EQ 3 to EQ 7. Step 6. Calculate individual segment lengths and netlengths, separately for horizontal and vertical components, if necessary, as input for a timing analysis.
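Steps 2 to 6 amount to one bottom-up pass over the slicing tree. The C++ sketch below (hypothetical names, simplified to a single shape-function corner and with the subnet bounding boxes of EQ 6 passed in as zeros) illustrates how the absolute section lengths of EQ 3 to EQ 7 would be collected during a post-order traversal.

    #include <memory>
    #include <vector>

    struct EstNode {
        double x = 0.0, y = 0.0;            // node dimensions from the selected shape
        bool verticalCut = true;
        double nco = 4.0;                   // avg. number of cells orthogonal to the cut
        std::vector<int> segments;          // indices of net segments assigned to this node
        std::unique_ptr<EstNode> left, right;
    };

    struct SegLen { double lx = 0.0, ly = 0.0; };

    // EQ 3/EQ 4 factors and the EQ 7 reduction by subnet bounding boxes bx1/by1, bx2/by2.
    SegLen sectionLengths(const EstNode& n, double bx1, double by1, double bx2, double by2) {
        double lor = 0.5 * (0.5 + 2.0 / n.nco), lpr = 0.33;
        if (n.verticalCut) return {lor * (n.x - bx1 - bx2), lpr * (n.y - by1 - by2)};
        return {lpr * (n.x - bx1 - bx2), lor * (n.y - by1 - by2)};
    }

    void estimate(const EstNode* n, std::vector<SegLen>& out) {
        if (!n || (!n->left && !n->right)) return;      // leaves are cells; no routed length added here
        estimate(n->left.get(), out);                   // post order: handle children first
        estimate(n->right.get(), out);
        for (int seg : n->segments) {
            (void)seg;                                  // the subnet bounding boxes would be looked up here
            out.push_back(sectionLengths(*n, 0.0, 0.0, 0.0, 0.0));
        }
    }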
5 EXPERIMENTAL RESULTS
We experimented with the prototype implementation of the estimator, which is built into our toolbox PLAYOUT, and used our standard cell place and route tools and our timing analysis tool. We did estimations for the circuit "alu" for 11 different aspect ratios, ranging from 0.23 to 8.5, and did layouts for these shapes. We compared the estimated netlengths, which are the sums of the corresponding segments, with the extracted netlengths individually. We got many results that show that the prototype method and the partitioning had to be improved. Since we have no other published estimations to compare with, it is difficult to say how good the results are, when we reach the stochastic limits, and what a confidence measure would be. Despite all this, we show here at least one result, in Fig. 16, that should be compared with Fig. 12. Although the average netlength is about 100 grid units, the average error is about
5 units. We purposely show absolute errors because the path lengths depend on absolute values.
/Feu82/ M. Feuer, "Connectivity of random logic," IEEE Trans. on Computers, Vol. C-31, No. 1, 1982, pp. 29-33. /Han88/ D. Hanson, "Interconnection Analysis," in: Physical Design Automation of VLSI Systems, Bryan T. Preas and Michael J. Lorenzetti (eds.), The Benjamin/Cummings Publishing Company, Inc., CA, 1988, pp. 31-64.
10 !O-
'10
1025
Report 4610), 1979, pp. 272-277.
-20
ln.rnrrlrrn nfllrnHllflEMf -15
-10
-5
0
5
10
15
20
Fig. 16. Distribution of absolute netlength estimation errors for the "alu". Abscissa in grid units.
But there are also some large errors in the shown distribution and we have to keep in mind that the longest path, not the average, decides on the circuit performance. Our goal therefore has to be to try to limit the maximum errors and be able to predict the probability of these.
6 CONCLUSIONS
We have shown that the length range of individual nets in many different layouts is relatively small. It should therefore be possible to estimate netlengths based on knowledge of the circuit structure. We have shown that, with partitioning, important structure information can be extracted and an estimation model can be built with some stochastic assumptions. With our current knowledge, the answer to the questions raised in chapter 3.1 is: the slicing topology is representative for the estimation of individual netlengths, even without ordering information. This seems true for internal nets; for I/O nets we cannot answer this question currently. To answer the reliability question we need more experiments; so far the models could only be justified with a few experiments. A prototype of the estimator has been implemented and already shows encouraging results. We are confident that the implementation of the improved models as shown in this paper will also improve the estimation quality. We need benchmarks in the submicron range with the necessary timing parameters to prove our results and to be able to compare with other estimation methods, for example prototype layouts.

References
/Don79/ W. Donath, "Placement and average interconnection lengths of computer logic," IEEE Trans. on Circuits and Systems, Vol. CAS-26, No. 4 (IBM Report 4610), 1979, pp. 272-277.
/Feu82/ M. Feuer, "Connectivity of random logic," IEEE Trans. on Computers, Vol. C-31, No. 1, 1982, pp. 29-33.
/Han88/ D. Hanson, "Interconnection Analysis," in: Physical Design Automation of VLSI Systems, Bryan T. Preas and Michael J. Lorenzetti (eds.), The Benjamin/Cummings Publishing Company, Inc., CA, 1988, pp. 31-64.
/Heb95/ W. Hebgen, "Netzlaengenbasierte Abschaetzung des Zeitverhaltens in einem top-down VLSI-Entwurfssystem," Ph.D. thesis, University of Kaiserslautern, FRG, 1995.
/Kel89/ W. Keller, "Ein Modell zur entwurfsbegleitenden hierarchischen Behandlung des Zeitverhaltens beim physikalischen VLSI-Entwurf," Ph.D. thesis, University of Kaiserslautern, FRG, 1989.
/KuP86/ F. Kurdahi and A. Parker, "PLEST: A Program for Area Estimation of VLSI Integrated Circuits," Proc. of the 23rd Design Automation Conference, 1986, pp. 467-473.
/LaR71/ B. Landman and R. Russo, "On a pin versus block relationship for partitions of logic graphs," IEEE Trans. on Computers, Vol. C-20, No. 12, 1971, pp. 1469-1479.
/Mal96/ F.-O. Malisch, private communication, 1996.
/PeP89/ M. Pedram and B. Preas, "Interconnection Length Estimation for Optimized Standard Cell Layouts," Proc. IEEE Intl. Conf. on Computer-Aided Design, 1989, pp. 390-393.
/SaP86/ S. Sastry and A. Parker, "Stochastic Models for Wireability Analysis of Gate Arrays," IEEE Trans. on Computer-Aided Design, Vol. CAD-5, No. 1, 1986, pp. 52-65.
/Zim88/ G. Zimmermann, "A New Area Shape Function Estimation Technique for VLSI Layouts," Proc. 25th Design Automation Conference (DAC), Anaheim, 1988, pp. 60-65.
EXPLORING THE DESIGN SPACE FOR BUILDING-BLOCK PLACEMENTS CONSIDERING AREA, ASPECT RATIO, PATH DELAY AND ROUTING CONGESTION
Henrik Esbensen and Ernest S. Kuh
Department of Electrical Engineering and Computer Sciences, University of California, Berkeley, CA 94720, USA
[email protected]
ABSTRACT
A genetic algorithm for IC/MCM building-block placement is presented. Optimization criteria considered are area, aspect ratio, routing congestion and maximum path delay. Designers can choose from an output set of feasible solutions. In contrast to existing approaches such as simulated annealing, no weights or bounds are needed. Experimental results illustrate the special features of the approach.
1. INTRODUCTION
During placement of an integrated circuit (IC) or a multi-chip module (MCM) the objective is to find a solution which is satisfactory with respect to a number of competing criteria. Most often specific constraints have to be met for some criteria, while for others a good tradeoff is wanted. However, at this point in the design process, the available information as to which values are obtainable for each criterion is based on relatively rough estimates only. Consequently, the designer's notion of the overall design objective is rarely clearly definable. Virtually all existing placement tools minimize a weighted sum of some criteria subject to constraints on others. I.e., if k criteria are considered, the objective is to minimize the single-valued cost function

    c = Σ_{i=1..j} w_i c_i    s.t.    ∀ i = j+1, ..., k : c_i ≤ C_i        (1)
for some j, 1 ≤ j ≤ k. Here c_i measures the cost of the solution with respect to the i'th criterion, and the w_i's and C_i's are user-defined weights and bounds, respectively. However, in practice it may be very difficult for the designer to specify a set of bounds and weights which makes the placement tool find a satisfactory solution. If the bounds are too loose, perhaps a better solution could have been found, while if they are too tight, a solution may not be found at all. It is also far from clear how to derive a suitable set of weight values from the vaguely defined design objectives, and constant weights may not be sufficient to keep the terms of the cost function properly balanced throughout the optimization process. Furthermore, the minimum of a weighted sum can never correspond to a non-convex point of the cost tradeoff surface, regardless of the weights [11]. In other words, if the designer's notion of the best solution corresponds to a non-convex point, it can never be found by minimizing c in (1), even though the solution is non-dominated. Our work is motivated by the need to overcome these fundamental problems. A building-block placement algorithm for both ICs and MCMs is presented, which supports explicit design space exploration in the sense that 1) a set of alternative solutions rather than a single solution is generated by a single program execution, and 2) solutions are characterized explicitly by a cost value for each criterion instead of a single, aggregated cost value. The algorithm simultaneously minimizes layout area, routing congestion, maximum path delay and the deviation from a target aspect ratio. It searches for a set of alternative, good solutions where "good" is defined by the user in a simple manner. From the output solution set, the designer chooses a specific solution representing the preferred tradeoff. The approach avoids the use of both the weights and the bounds of (1) and consequently eliminates the above-mentioned problems concerning weight and bound specification. The approach has three additional significant characteristics:

* Despite the fact that delay is inherently path-oriented, most existing timing-driven placement approaches are net-based. While simple, these approaches usually over-constrain the problem, thereby potentially excluding good solutions from being found. The few existing path-based approaches include [13, 15, 17], all of which, however, rely on very simple net models (stars and bounding boxes). The approach presented here obtains a more accurate path delay estimate by approximating each net by an Elmore-optimized Steiner tree.

* The maximum routing congestion is minimized, thereby improving the likelihood that the placement is routable without further modification. Consequently, the traditional need for multiple iterations of the placement and global routing phases is significantly reduced.
* The approach is based on the genetic algorithm (GA), since it is particularly well suited for design space exploration [12]. We are only aware of three previous GA approaches to building-block placement [4, 5, 10], none of which considers delay or routing congestion or performs explicit design space exploration.

Previous work on design space exploration in CAD is still very limited. However, approaches for scheduling and channel routing are presented in [7], and an approach for FPGA technology mapping is described in [6]. Very recently, a combined wire sizing and buffer insertion method, also supporting design space exploration, was presented in [16]. The work presented here is based on significant extensions and improvements of our earlier placement approach described in [8].
2. PROBLEM DEFINITION The placement model described in Section 2.1. is relevant for both MCMs and for IC technologies with at least two metal layers available for routing. Section 2.2. characterizes the solution set searched for by the algorithm.
2.1. Placement Model
A placement problem is specified by the following input:

* A set of rectangular building-blocks of arbitrary sizes and aspect ratios with a set of pins located anywhere within each block.

* A set of IO pins/pads. Constraints on relative IO pin positions are expressed using a two-dimensional s × t array A as illustrated in Fig. 1. Each IO pin can be assigned to an entry of A, and the physical location corresponding to entry (i, j) will be (ix/(s − 1), jy/(t − 1)), where x and y are the horizontal and vertical dimensions of the layout, respectively.¹ An IO pin assigned to A by the user is called a fixed IO pin, while the remaining IO pins are flexible. Each flexible IO pin will automatically be assigned to a vacant entry of A not specified as illegal. Since any subset of the entries of A can be specified as illegal, pins can be restricted to placement along the periphery of the layout, they can be uniformly distributed over the entire layout, etc.

* Technology information: the number of metal layers available for routing on top of blocks and between blocks, denoted by l_block and l_space, respectively; the routing wire resistance r and capacitance c per unit wire length; and the wire pitch w_pitch. For simplicity, these values are all assumed to be constants.

* A specification of all nets, including for each net 1) the capacitance of each sink pin, and 2) a designated source pin p, its driver resistance and an associated internal delay t(p) in the block m(p) to which p belongs. t(p) is the time it takes a signal to travel through m(p) to p.²

* A specification of a set of paths P. A path connects either two registers of distinct blocks or an IO pin and a register, i.e., it is an alternating sequence of wires passing through blocks and net segments. For a sink pin p, denote by s(p) the source pin of the net to which p belongs. Each path P ∈ P is then uniquely specified by an ordered set of sink pins P = {p0, p1, ..., p_{l−1}} of distinct nets, such that m(p_i) = m(s(p_{i+1})), i = 0, 1, ..., l − 2. s(p0) or p_{l−1} may be an IO pin. Each path in an MCM will have length l = 1, assuming that all signals are latched at the inputs of the components.

Figure 1. Specification of constraints on placement of IO pins. Here A has dimensions 8 × 10; 11 fixed IO pins (white circles) are assigned to specific entries of A while 14 entries (black circles) are illegal. The remaining entries are available for flexible IO pins. A will be oriented and/or reflected and subsequently scaled so that it exactly covers the layout area of the placement.

Each output solution is a specification of:

* An absolute position of each block so that no pair of blocks, or a block and an IO pin, are closer than a specified minimum distance λ ≥ 0. This parameter allows physical constraints (design rules) to be met and is not intended for routing area allocation. Since multi-layer designs are considered, it is assumed that a significant part of the routing is performed on top of the blocks.

* An orientation and reflection of each block. Throughout this paper, the term orientation of a block refers to a possible 90 degree rotation, while reflection of a block refers to the possibility of mirroring the block around a horizontal and/or a vertical axis. Changing the orientation of a block generally alters its contour, while reflecting it does not. In an IC, each block can be oriented and/or reflected in a total of eight distinct ways. For MCMs, only two distinct reflections exist, since the direction of the pins is fixed, giving a total of four distinct orientations/reflections.

* An absolute position of each IO pin, satisfying the specified constraints on their relative positions.

¹ Relative to the building-blocks, the entire set of IO pins can be oriented and/or reflected in eight distinct ways, while still satisfying the constraints on relative positions specified by A. The given absolute position of entry (i, j) assumes that the IO pin set is positioned on top of the blocks without changing either the orientation or the reflection relative to the blocks.
² If p is an IO pin, m(p) is p itself, and if it is also a source, t(p) = 0, i.e., input IO pins have no internal delay.

2.2. What is a "Good" Tradeoff?
Let Π be the set of all placements and R+ = [0, ∞[. The cost of a solution is defined by the vector-valued function c : Π → R+^4, which will be described in Section 3.2.. This section describes how to specify what a "good" cost tradeoff is, and how to compare the cost of two solutions without resorting to a single-valued cost measure. The user specifies a goal vector g = (g1, g2, g3, g4) and a feasibility vector f = (f1, f2, f3, f4) such that g, f ∈ (R+ ∪ {∞})^4 and 0 ≤ gi ≤ fi ≤ ∞ for i = 1, 2, 3, 4. For the i'th criterion, gi is the maximum value wanted, if obtainable, while fi specifies a limit beyond which the solution is unconditionally of no interest. For example, if the i'th criterion is layout area, gi = 20 and fi = 100 state that an area of 20 or less is wanted if it can be obtained, while an area larger than 100 is unacceptable. Areas between 20 and 100 are acceptable, although not as good as hoped for. The vectors g and f define a set of satisfactory solutions Sg = {x ∈ Π | ∀i : xi ≤ gi} and a set of acceptable solutions Af = {x ∈ Π | ∀i : xi ≤ fi}, where xi is the cost of x wrt. the i'th dimension, i.e., c(x) = (x1, x2, x3, x4). The values specified by g and f are merely used to guide the search process and, in contrast to traditional, user-specified bounds, need not be obtainable. Therefore, they are significantly easier to specify than traditional bounds. In order for the algorithm to compare solutions, a notion of relative solution quality is needed, which takes the goal and feasibility vectors into account. Let x, y ∈ Π. The relation x dominates y, written x ≺d y, is defined by

    x ≺d y  ⟺  (∀i : xi ≤ yi) ∧ (∃i : xi < yi)        (2)
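As an illustration, a direct transcription of the dominance relation (2) and of membership in Sg and Af might look as follows in Python; the four-element cost vectors and the example goal and feasibility values are hypothetical and not taken from the paper.

```python
def dominates(x, y):
    """x dominates y (relation (2)): no worse in every dimension, better in at least one."""
    return all(xi <= yi for xi, yi in zip(x, y)) and any(xi < yi for xi, yi in zip(x, y))

def satisfactory(x, g):
    """x is in Sg: every cost component is within the goal vector g."""
    return all(xi <= gi for xi, gi in zip(x, g))

def acceptable(x, f):
    """x is in Af: every cost component is within the feasibility vector f."""
    return all(xi <= fi for xi, fi in zip(x, f))

if __name__ == "__main__":
    g = (20.0, 0.2, 10.0, 30.0)               # hypothetical goal vector
    f = (100.0, 0.5, float("inf"), 200.0)     # hypothetical feasibility vector
    x = (18.0, 0.1, 9.0, 25.0)
    y = (19.0, 0.3, 9.0, 40.0)
    print(dominates(x, y), satisfactory(x, g), acceptable(y, f))
```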
Figure 2. The sets of satisfactory and acceptable solutions, illustrated in two dimensions.

Using ≺d, the relation x is preferable to y, written x ≺ y, is then defined as follows, depending on how c(x) compares to g and f. If x satisfies all goals, i.e., x ∈ Sg, then
    x ≺ y  ⟺  (x ≺d y) ∨ (y ∉ Sg)        (3)
If x satisfies none of the goals, i.e., ∀i : xi > gi, then
    x ≺ y  ⟺  (x ≺d y) ∨ [(x ∈ Af) ∧ (y ∉ Af)]        (4)
Finally, x may satisfy some but not all goals; that is, ∃k ∈ {2, 3, 4} : (∀i < k : xi ≤ gi) ∧ (∀i ≥ k : xi > gi), assuming a convenient ordering of the optimization criteria. Then
    x ≺ y  ⟺  [(∀i ≥ k : xi ≤ yi) ∧ (∃i ≥ k : xi < yi)]        (5)
              ∨ [(x ∈ Af) ∧ (y ∉ Af)]                           (6)
              ∨ [(∀i ≥ k : xi = yi) ∧                           (7)
                 {((∀i < k : xi ≤ yi) ∧ (∃i < k : xi < yi))     (8)
                  ∨ (∃i < k : yi > gi)}]                        (9)
Due to (3), (4) and (6), an acceptable solution is always preferable to an unacceptable solution. The right-hand side of (5) states that x dominates y wrt. the dimensions for which x does not satisfy the goals. (7), (8) and (9) state that in the special case when x equals y wrt. the non-satisfied dimensions, x is still preferable to y if it either dominates y wrt. the satisfactory dimensions or if y does not satisfy a goal satisfied by x. Notice from (5) that when two solutions satisfy the same subset of goals, they are considered equal with respect to these goals, regardless of their specific values in these dimensions. Hence, when goals are satisfied, they are "factored out", focusing the search on the remaining, unsatisfactory dimensions. The algorithm outputs a set of distinct solutions which are the best found in the sense defined by ≺. As a special case, if g = (0, 0, 0, 0) the algorithm searches for (a sample of) the Pareto-optimal set, i.e., the set of solutions in which no solution can be improved in any dimension without being deteriorated in another. The above definition of ≺ is an extension of the definition introduced in [12], adding the feasibility vector f. This extension prevents the algorithm from wasting time investigating solutions which are non-dominated but of no practical interest.
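The following Python sketch spells out the preference relation according to the case analysis of (3)-(9) as reconstructed above; it uses index sets of satisfied and unsatisfied goals instead of the ordering by k, which is equivalent under the "convenient ordering" assumption. It is an illustration only, not the authors' implementation.

```python
def dominates(x, y):
    return all(a <= b for a, b in zip(x, y)) and any(a < b for a, b in zip(x, y))

def preferable(x, y, g, f):
    """x is preferable to y under goal vector g and feasibility vector f."""
    in_sg = lambda z: all(zi <= gi for zi, gi in zip(z, g))
    in_af = lambda z: all(zi <= fi for zi, fi in zip(z, f))

    sat = [i for i in range(len(x)) if x[i] <= g[i]]     # goals satisfied by x
    unsat = [i for i in range(len(x)) if x[i] > g[i]]    # goals not satisfied by x

    if not unsat:                                        # x satisfies all goals, (3)
        return dominates(x, y) or not in_sg(y)
    if not sat:                                          # x satisfies no goal, (4)
        return dominates(x, y) or (in_af(x) and not in_af(y))

    # x satisfies some but not all goals, (5)-(9)
    dom_unsat = all(x[i] <= y[i] for i in unsat) and any(x[i] < y[i] for i in unsat)
    equal_unsat = all(x[i] == y[i] for i in unsat)
    dom_sat = all(x[i] <= y[i] for i in sat) and any(x[i] < y[i] for i in sat)
    y_misses_sat_goal = any(y[i] > g[i] for i in sat)
    return (dom_unsat
            or (in_af(x) and not in_af(y))
            or (equal_unsat and (dom_sat or y_misses_sat_goal)))

if __name__ == "__main__":
    g, f = (20.0, 0.2, 10.0, 30.0), (100.0, 0.5, float("inf"), 200.0)  # hypothetical
    print(preferable((25.0, 0.1, 9.0, 25.0), (25.0, 0.4, 9.0, 25.0), g, f))
```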
3. DESCRIPTION OF THE ALGORITHM
The concept of genetic algorithms is based on natural evolution. In nature, the individuals constituting a population adapt to the environment in which they live. The fittest individuals have the highest probability of survival and tend to increase in numbers, while the less fit individuals tend to die out. This survival-of-the-fittest Darwinian principle is the basic idea behind the GA. The algorithm maintains a population of individuals, each of which corresponds to a specific solution to the optimization problem considered. Based on a given cost function, a measure of fitness defines the relative quality of individuals. An evolution process is simulated, starting from a set of random individuals. The main components of this process are crossover, which mimics propagation, and mutation, which mimics the random changes occurring in nature. After a number of generations, highly fit individuals will emerge, corresponding to good solutions to the optimization problem. Rather than altering given solutions directly, the crossover and mutation operators process internal representations of solutions. The solution corresponding to a given representation is computed by a function known as the decoder. Section 3.1. outlines our specific GA, Section 3.2. presents the placement representation and its interpretation, and Section 3.3. briefly discusses the selection strategy and the genetic operators.

3.1. Overview
Fig. 3 outlines our GA. Let Φ = {φ0, φ1, ..., φN−1} denote the current population. The rank r(φ) of φ ∈ Φ is the number of currently existing individuals which are preferable to φ, i.e., r(φ) = |{ψ ∈ Φ | ψ ≺ φ}|. Furthermore, let Φ0 = {φ ∈ Φ | r(φ) = 0} ⊆ Φ, i.e., Φ0 is the current set of best solutions. Initially, Φ is constructed by routine generate (line 1) from random individuals. One iteration of the repeat loop (lines 2-12) corresponds to the simulation of one generation. Throughout the optimization process N = |Φ| is kept constant.

01  generate(Φ);
02  repeat:
03      select(φ1, φ2, Φ);
04      defψ2 := crossover(φ1, φ2, ψ1, ψ2);
05      mutate(ψ1);
06      insert(Φ, ψ1);
07      if defψ2:
08          mutate(ψ2);
09          insert(Φ, ψ2);
10      if converged():
11          optimize(Φ0);
12  until goals() or converged() or cpuLimit();
13  output Φ0;
Figure 3. Outline of the algorithm.

In each generation, two parent individuals φ1 and φ2 are selected (line 3) and crossover is performed (line 4), generating offspring ψ1 and possibly ψ2. defψ2 is true if and only if ψ2 was also generated. Routine mutate subjects the generated offspring to random changes (lines 5, 8), and the resulting individuals are inserted in Φ by routine insert (lines 6, 9). An inserted individual ψ replaces a maximum rank solution
to which it is preferable or, if ψ is not preferable to any solution, it replaces a maximum rank solution which is not preferable to ψ. Routine converged (line 10) detects if no improvement has occurred in S consecutive generations, that is, if Φ0 has not changed in this period. In that case routine optimize (line 11) attempts to optimize all rank zero individuals by simple hillclimbing. On each individual φ ∈ Φ0 a sequence of H mutations is tried. Each mutation yielding φ' from φ is only executed if φ' ≺ φ. The algorithm terminates (line 12) when either a) Φ0 contains a solution satisfying all goals (detected by routine goals()), b) the process has converged, or c) a CPU-time limit T has been reached (detected by cpuLimit()). Φ0 is then the output set of solutions (line 13). Routines insert and optimize assure that a solution φ' can never replace φ if φ ≺ φ'. Hence, the number of satisfactory solutions |Sg| and acceptable solutions |Af| are non-decreasing functions of time, and so is Φ0, in the sense inferred by ≺, while |Φ0| is not.

3.2. Solution Representation and Decoder
The representation of a placement consists of five components a) through e):

a) An inverse Polish expression of length 2b − 1 over the alphabet {0, 1, ..., b − 1, +, *}, where b is the number of blocks. The operands 0, 1, ..., b − 1 denote block identities and +, * are operators. The expression uniquely specifies a slicing tree for the placement, as first introduced in [21], with + and * denoting a horizontal and a vertical slice, respectively.
b) A bitstring of length 2b for ICs and b for MCMs, representing the reflection of each block. For ICs, the reflection of the i'th block is specified by bits 2i and 2i + 1, and for MCMs, it is specified by bit i.

c) An integer in the interval [0; 7] selecting one of the eight possible orientations/reflections of the IO-pin array A relative to the placed blocks, cf. Fig. 1.

d) A string of nIO integers, where nIO is the number of nets having at least one flexible IO-pin. The string is a permutation of the numbers 0, 1, ..., nIO − 1 which specifies an ordering of these nets to be used when placing flexible IO-pins.

e) A string of nmulti integers, where nmulti is the number of multi-pin nets, i.e., nets having at least 3 pins. If the i'th multi-pin net has si sinks, the i'th integer ci satisfies 0 ≤ ci ≤ si − 1 and specifies ci as being the critical sink of net i, to be used when routing the net.

Given a representation of this form, the corresponding placement and its cost c = (c_area, c_ratio, c_delay, c_cong) is computed by the decoder in eight steps as follows:

1. From the slicing tree specified by the Polish expression, the orientation of each block is determined such that layout area is minimized. The orientations are computed using an exact algorithm by Stockmeyer [18] which guarantees a minimum area layout for the given slicing structure. The reflection of each block is as specified by component b) of the representation.

2. Absolute coordinates are determined for all blocks by a top-down traversal of the slicing tree. At each operator node, if relative movement of the two subtrees along the slicing axis is possible, the centerpoints of the subtrees are aligned.

3. The layout is compacted, first vertically and then horizontally, using a simplified version of the one-dimensional channel compaction algorithm presented in [22]. Fig. 4 illustrates the first three steps of decoding.

Figure 4. Given 10 blocks and the Polish expression 1 2 + 6 * 9 0 + + 3 4 + + 5 * 7 8 + *, the placement on the left is the result of step 2 of the decoding. Subtrees are recursively centered and oriented optimally. Since the height of the layout is determined by the blocks 1, 2, 9, 0, 3, 4, no blocks are moved when attempting vertical compaction. Subsequent horizontal compaction moves blocks 8, 7, 5, 9 and 0 towards the left, so that blocks 2, 6, 7 now determine the width of the layout. The placement to the right is the result of compaction (step 3), i.e., the final placement.

4. Given the orientation/reflection of the IO array A specified by component c), each flexible IO-pin is assigned to an unused entry of A. The nIO nets having flexible IO-pins are treated one at a time, in the order defined by component d). For each net, each flexible pin is assigned to the unused entry of A which is closest to the center of gravity of the remaining pins of the net. Then the area and aspect ratio of the layout can be computed. c_area is the area of the smallest rectangle enclosing all blocks and IO-pins, and c_ratio = |r_actual − r_target|, where r_actual is the actual aspect ratio of the layout (height divided by width) and r_target is a user-defined target ratio.

5. A global routing graph G = (V, E) is constructed which forms a two-dimensional, uniformly spaced lattice exactly covering the entire layout. The pitch of the lattice is determined by the user-defined parameter g_pitch. Each pin is then assigned to the closest vertex in V.

6. The topology of each net is approximated by a Steiner tree embedded in G. Each Steiner tree is computed independently by the SERT-C algorithm ("Steiner Elmore Routing Tree with identified Critical sink") introduced in [3]. SERT-C explicitly minimizes the Elmore delay from the source to a designated critical sink. For nets with more than one sink, the critical sink is specified by component e) of the representation.

7. The delay D(P) of each path P = {p0, p1, ..., p_{l−1}} is estimated as

    D(P) = Σ_{i=0}^{l−1} [E(pi) + t(s(pi))]
where E(pi) is the Elmore delay from s(pi) to pi computed in the Elmore-optimized Steiner tree. The delay cost c_delay is the maximum path delay, i.e., c_delay = max_{P∈P} D(P).
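A literal transcription of this path delay estimate could look as follows; the Elmore delays E(p) and internal delays t(s(p)) are assumed to be given per sink pin (here as plain dictionaries with hypothetical values), since computing them requires the routing trees of step 6.

```python
def path_delay(path, elmore, internal, source_of):
    """D(P) = sum over sinks p in P of E(p) + t(s(p))."""
    return sum(elmore[p] + internal[source_of[p]] for p in path)

def delay_cost(paths, elmore, internal, source_of):
    """c_delay is the maximum path delay over all paths."""
    return max(path_delay(P, elmore, internal, source_of) for P in paths)

if __name__ == "__main__":
    # Hypothetical example: one path through sinks 'a' then 'b'.
    elmore = {"a": 0.8, "b": 1.1}              # Elmore delay to each sink (ns)
    source_of = {"a": "src1", "b": "src2"}     # source pin of the net of each sink
    internal = {"src1": 0.3, "src2": 0.2}      # internal block delay t(p) at each source
    print(delay_cost([["a", "b"]], elmore, internal, source_of))
```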
8. Finally, the maximum routing congestion is estimated. For each edge e ∈ E, cap(e) denotes the capacity of e and equals g_pitch × l_block/(2 × w_pitch) if (a part of) e is on top of a block, and g_pitch × l_space/(2 × w_pitch) otherwise. The division by 2 reflects the assumption that in each layer the routing will mainly be either horizontal or vertical. usage(e) is the number of nets using e. The congestion cost c_cong is computed as

    c_cong = 100 × max_{e∈E} max{ (usage(e) − cap(e)) / cap(e), 0 }
i.e., c_cong is the maximum percentage by which an edge capacity has been exceeded. Although the nets are routed independently, the resulting Steiner trees represent a very accurate estimation of the net topologies compared to the estimations of previous approaches, cf. Section 1., which in turn indicates that c_delay is an accurate estimate. Furthermore, for ICs the topology-dependent Elmore delay estimate has high fidelity in the sense that a solution which is near-optimal according to the estimate will also be near-optimal wrt. actual delay [2]. The smaller the congestion estimate c_cong is, the fewer nets need to be rerouted to obtain 100% global routing completion and, by assumption, the easier the global routing task.
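The congestion measure of step 8 reduces to a maximum relative overflow over the routing graph edges. A minimal sketch, with hypothetical edge capacities and usages, is:

```python
def congestion_cost(usage, cap):
    """c_cong = 100 * max over edges of max((usage - cap) / cap, 0)."""
    return 100.0 * max(max((usage[e] - cap[e]) / cap[e], 0.0) for e in cap)

if __name__ == "__main__":
    cap = {"e1": 10.0, "e2": 8.0, "e3": 12.0}     # hypothetical edge capacities
    usage = {"e1": 9.0, "e2": 10.0, "e3": 12.0}   # hypothetical edge usages
    print(congestion_cost(usage, cap))             # edge e2 exceeded by 25% -> 25.0
```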
Figure 5. The Elmore-optimized Steiner tree computed by SERT-C for a 44-pin net. s is the source and c the selected critical sink.

3.3. Selection and Genetic Operators
The scheme for selection of parents for crossover (line 3 of Fig. 3) should enforce the principle of survival-of-the-fittest. Assume that the population Φ = {φ0, φ1, ..., φN−1} is sorted in ascending order according to rank, i.e., r(φ0) ≤ r(φ1) ≤ ... ≤ r(φN−1). Each parent φ is selected independently at random, using a scheme presented in [12, 19] which has two properties: 1) The probability that r(φ) equals r(φk), written P[r(φ) = r(φk)], decreases linearly with k, and P[r(φ) = r(φ0)] = β P[r(φ) = r(φN/2)], where 1 ≤ β ≤ 2 is a user-defined parameter controlling the selection pressure. 2) All individuals having the same rank have the same probability of being selected. This rank-based selection scheme eliminates the need for a single-valued fitness function as traditionally used in GAs, which is essential in order to avoid aggregating the cost values in any way.

The crossover operator (line 4 of Fig. 3) as well as the mutation operator (lines 5, 8) operates on each of the five components of the representation independently. Polish expressions, component a), are handled by highly specialized operators introduced in [5].³ For components b), c) and e), standard operators extensively studied in the GA literature are applied: so-called uniform crossover and pointwise mutation, as described in e.g. [1]. Finally, component d) is handled using the operators of [20], which are applicable to any permutation problem. Detailed descriptions of the genetic operators can be found in the references given. A crucial property of both the crossover operator and the mutation operator is that they preserve feasibility, i.e., only feasible representations, which can be interpreted by the decoder, are ever generated. If feasibility were not preserved by the operators, either a repair algorithm would be required in the decoder, slowing down the GA, or a cost penalty method would be needed, jeopardizing a main objective of our approach, the elimination of weight factors.

4. EXPERIMENTAL RESULTS
Evaluating performance by comparison to an existing approach is complicated by a number of factors. Firstly, the placement model assumption of routing on top of the blocks is not compatible with earlier channel-based IC models applied by previous approaches. Secondly, there are no building-block benchmarks which include appropriate timing information. Thirdly, and most importantly, it is inherently difficult to fairly compare the 4-dimensional optimization approach to existing 1-dimensional approaches. However, using the examples described in Section 4.1., comparisons to simulated annealing and random search have been established as described in Section 4.2.. Results are presented in Sections 4.3. and 4.4..

4.1. Test Examples
The characteristics of four of the circuits used for testing are given in Table 1. xeroxT, ami33T and ami49T are constructed from the CBL/NCSU building-block benchmarks xerox, ami33 and ami49, respectively, by adding the required timing information. Paths are generated in a random fashion, and internal block delays, output driver resistances and input capacities are assigned randomly assuming normal distributions and using mean values from [2, 17], representative of a 0.8 µm CMOS process. SPERT is an MCM consisting of a vector processor (ASIC), 16 SRAMs and 3 buffer components. The design was kindly made available to us by the International Computer Science Institute in Berkeley, California, where a dedicated hardware system for neural net based speech recognition based on SPERT is currently being developed.

Circuit | Type | Blocks | Pins  | IO | Nets | Paths
xeroxT  | IC   | 10     | 698   | 2  | 203  | 86
ami33T  | IC   | 33     | 522   | 42 | 123  | 230
ami49T  | IC   | 49     | 953   | 22 | 408  | 116
SPERT   | MCM  | 20     | 1,168 | 36 | 248  | 574
Table 1. Main characteristics of test examples. The columns are: type, no. of blocks, total no. of pins, no. of IO-pins, no. of nets and no. of paths.

³ While operations on Polish expressions are as in [5], it should be noted that the interpretation of an expression, i.e., the decoder, is significantly different. This is a consequence of the fundamental differences of the two approaches, e.g., the distinct optimization criteria, multi-dimensional versus one-dimensional optimization, etc.
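Footnote 3 distinguishes the operators on Polish expressions from their interpretation by the decoder. For readers unfamiliar with the encoding of component a) in Section 3.2., the following sketch evaluates an inverse Polish expression with a stack to obtain overall layout dimensions, assuming fixed block sizes and the slice convention stated above (+ = horizontal slice, * = vertical slice); the Stockmeyer orientation step and the compaction of the actual decoder are deliberately omitted, and the example sizes are hypothetical.

```python
def slicing_dimensions(expression, sizes):
    """Evaluate an inverse Polish slicing expression.

    expression: list of tokens, block ids (int) or '+' / '*'.
    sizes: dict mapping block id -> (width, height).
    '+' (horizontal slice) stacks operands: width = max, height = sum.
    '*' (vertical slice) places operands side by side: width = sum, height = max.
    """
    stack = []
    for tok in expression:
        if tok == '+':
            (w2, h2), (w1, h1) = stack.pop(), stack.pop()
            stack.append((max(w1, w2), h1 + h2))
        elif tok == '*':
            (w2, h2), (w1, h1) = stack.pop(), stack.pop()
            stack.append((w1 + w2, max(h1, h2)))
        else:
            stack.append(sizes[tok])
    assert len(stack) == 1, "malformed expression"
    return stack[0]

if __name__ == "__main__":
    sizes = {0: (2.0, 3.0), 1: (4.0, 1.0), 2: (2.0, 2.0)}   # hypothetical block sizes
    print(slicing_dimensions([0, 1, '+', 2, '*'], sizes))    # -> (6.0, 4.0)
```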
For all examples, r_target = 1.0, l_block = 2 and l_space = 3. For the ICs, all IO pins are fixed, w_pitch = 4 µm, g_pitch = 40 µm, and the minimum block spacing λ is 0, i.e., blocks can be abutted. For the MCM, all IO pins are flexible, w_pitch = 40 µm, g_pitch = 400 µm, and λ = 5.0 mm.

4.2. Comparison Method
The GA is implemented in 9,000 lines of C and runs on a DECstation 5000/125. Performance is compared to that of a simulated annealing algorithm, denoted SA, and a random walk, denoted RW. Both algorithms use the same placement representation and decoder as the GA. The RW simply generates representations at random, evaluates them and stores the best (rank 0) solutions ever found. The SA generates moves using the mutation operator of the GA, and the cooling schedule is implemented following [14]. The use of the same representation and decoder by all three algorithms means that the effects of the representation and the decoder algorithms, e.g., Stockmeyer's algorithm and the compactor, do not a priori give an advantage to any of the algorithms. Furthermore, all algorithms explore exactly the same search space. Consequently, any performance differences observed can be attributed to the different search strategies themselves, rather than e.g. the decoder algorithms. Since RW does not rely on cost comparisons, it can use the same 4-dimensional cost function as the GA, allowing the two approaches to be directly compared. In contrast, the traditional SA algorithm relies on absolute quantification of the change of cost when determining if a move should be accepted, and consequently, cost has to be single-valued. Using an SA cost function of the form (1), it is far from clear how to fairly compare the single solution output by the SA algorithm to the set of solutions output by the GA. Therefore, comparisons of the GA with SA have to be based on optimizing one criterion only, such that the GA output is also a single solution, which is comparable to the SA solution. On the other hand, such comparisons are still very informative since they reveal whether the GA is competitive in this special case. A fixed parameter setting is used for each algorithm, disallowing problem-specific tuning. The GA parameters are: population size N = 40, selection bias β = 2.0 and mutation rate p_mut = 0.0005. The hillclimber attempts H = 1,000 mutations on a given individual, and the search is considered converged if no improvement has been observed for S = 10,000 generations.

4.3. One-Dimensional Optimization
Neither aspect ratio deviation nor routing congestion are suitable dimensions for one-dimensional comparisons since optimal results can easily be obtained by distributing the blocks over a large area. Instead, one-dimensional optimization for area and delay is performed, for which the GA uses the goal vectors g = (0, ∞, ∞, ∞) and g = (∞, ∞, 0, ∞), respectively. Fig. 6 illustrates the results. For each circuit and each of the two criteria, the three algorithms were executed 10 times each and the result indicated by a bar. The center point of each bar indicates the average result obtained in the 10 runs and the height of each bar is two times the standard deviation. For each circuit and criterion, the average result of RW is normalized to 1.00.

Figure 6. Relative performance of the GA, SA and RW for one-dimensional optimization.

The SA was executed first, and the average consumed CPU-time was enforced on each of the GA and RW as a CPU-time limit, thereby obtaining the exact same average runtime for all three algorithms. However, the SA approach has the advantage of defining the CPU-time requirement. Average CPU-time per run varied from 183 seconds for area optimization of SPERT to 3,903 seconds for delay optimization of ami49T. For area optimization of xeroxT and SPERT the SA averages are slightly worse than those of RW, indicating that the SA got trapped at local minima. However, in all other cases, both the GA and SA perform significantly better than RW. Overall, the GA and SA performance is very similar, indicating that the efficiency of the GA search is comparable to that of SA in the special case of one-dimensional optimization. Fig. 7 shows the smallest layouts obtained by the GA.
Figure 7. The minimum area results obtained by the GA in 10 runs for ami33T (top) and ami49T (bottom). The empty space constitutes 9.3% and 7.5% of the layout area, respectively.
4.4. Four-Dimensional Optimization
For 4-dimensional optimization, Figures 8 and 9 compare the solution sets Φ0 found by a sample execution of the GA using a 1 CPU-hour time limit to those found by RW using a 10 CPU-hour limit. For all examples, the GA uses the goal vector g = (0, 0.2, 0, 30) and the feasibility vector f = (1.5B, 0.5, ∞, 200), where B is the sum of the areas of all blocks of the circuit in question. The GA results are always significantly better than the RW results in all dimensions.⁴
Figure 10. The solution set Φ0 found by the GA for SPERT. Only solutions satisfying the aspect ratio goal are shown.
The solution for SPERT indicated by a black circle in Fig. 10 is shown in Fig. 11 and illustrates the effect of minimizing routing congestion. The processor block T0 has 416 connected pins and is the cause of potential congestion problems. However, by moving T0 close to the center of the layout, cf. Fig. 11, a congestion cost c_cong of only 15% is obtained. The other solutions in
Figure 8. Comparison of the solution sets Φ0 found by RW and the GA for ami33T. The o's are RW solutions and the x's are GA solutions. Only solutions satisfying both the aspect ratio goal and the congestion goal are shown.
Figure 11. A placement of SPERT with low routing congestion, obtained by moving the processor T0 towards the center of the layout, at the cost of increased area and delay.
Figure 9. Comparison of the solution sets Φ0 found by RW and the GA for ami49T. All solutions shown satisfy the routing congestion goal, while all GA solutions also satisfy the aspect ratio goal.

⁴ A measure of solution set quality is proposed in [9], which allows the relative quality of solution sets to be quantified numerically.
5. CONCLUSIONS
A genetic algorithm for building-block placement of ICs and MCMs has been presented, which minimizes area, path delay and routing congestion while attempting to meet a target aspect ratio. The key feature is the explicit design space exploration performed, which results in the generation of a solution set representing good, alternative cost tradeoffs. The inherent problem of existing approaches wrt. specification of suitable weights and bounds is solved by eliminating these quantities, and another practical problem, the traditionally required placement-routing iterations, is significantly reduced by explicitly minimizing routing congestion. The experimental work includes results for a real-world design and shows that the solution sets found represent good, balanced tradeoffs. The usefulness of minimizing routing congestion is also illustrated, and it is shown that the efficiency of the search process is comparable to that of simulated annealing in the special case of one-dimensional optimization. Furthermore, the required runtime, ranging from about 3 CPU-minutes to 1 hour, is very reasonable from a practical point of view. It is concluded that the presented algorithm is a promising approach for handling the discussed problems.

Acknowledgments
This research was supported by SRC grant no. 95-DC-324, NSF grant no. MIP 91-17328 and the Danish Technical Research Council.

REFERENCES
[1] D. Beasley, D. R. Bull, R. R. Martin, "An Overview of Genetic Algorithms: Part 2, Research Topics," University Computing, Vol. 15, No. 4, pp. 170-181, 1993.
[2] K. D. Boese, A. B. Kahng, B. A. McCoy, G. Robins, "Fidelity and Near-Optimality of Elmore-Based Routing Constructions," Proc. of the Intl. Conf. on Computer Design, pp. 81-84, 1993.
[3] K. D. Boese, A. B. Kahng, G. Robins, "High-Performance Routing Trees With Identified Critical Sinks," Proc. of the 30th Design Automation Conference, pp. 182-187, 1993.
[4] H. Chan, P. Mazumder, K. Shahookar, "Macro-cell and module placement by genetic adaptive search with bitmap-represented chromosome," Integration, the VLSI Journal, Vol. 12, No. 1, pp. 49-77, Nov. 1991.
[5] J. P. Cohoon, S. U. Hedge, W. N. Martin, D. Richards, "Distributed Genetic Algorithms for the Floorplan Design Problem," IEEE Transactions on Computer-Aided Design, Vol. 10, pp. 484-492, April 1991.
[6] J. Cong, Y. Ding, "On Area/Depth Trade-Off in LUT-Based FPGA Technology Mapping," IEEE Transactions on Very Large Scale Integration (VLSI) Systems, Vol. 2, No. 2, June 1994.
[7] P. Dasgupta, P. Mitra, P. P. Chakrabarti, S. C. DeSarkar, "Multiobjective Search in VLSI Design," Proc. of The 7th International Conference on VLSI Design, pp. 395-400, 1994.
[8] H. Esbensen, E. S. Kuh, "An MCM/IC Timing-Driven Placement Algorithm Featuring Explicit Design Space Exploration," Proc. of the 1996 IEEE Multi-Chip Module Conference, pp. 170-175, 1996.
[9] H. Esbensen, E. S. Kuh, "Design Space Exploration Using the Genetic Algorithm," Proc. of the IEEE International Symposium on Circuits and Systems, 1996 (to appear).
[10] H. Esbensen, P. Mazumder, "SAGA: A Unification of the Genetic Algorithm with Simulated Annealing and its Application to Macro-Cell Placement," Proc. of The 7th International Conference on VLSI Design, pp. 211-214, 1994.
[11] P. J. Fleming, A. P. Pashkevich, "Computer Aided Control System Design Using a Multiobjective Optimization Approach," Proc. of the IEE Control '85 Conference, pp. 174-179, 1985.
[12] C. M. Fonseca, P. J. Fleming, "Multiobjective Optimization and Multiple Constraint Handling with Evolutionary Algorithms I: A Unified Formulation," Research Report 564, Dept. of Automatic Control and Systems Eng., University of Sheffield, U.K., January 1995.
[13] T. Hamada, C.-K. Cheng, P. M. Chau, "Prime: A Timing-Driven Placement Tool using A Piecewise Linear Resistive Network Approach," Proc. of the 30th Design Automation Conference, pp. 531-536, 1993.
[14] M. D. Huang, F. Romeo, A. Sangiovanni-Vincentelli, "An Efficient General Cooling Schedule for Simulated Annealing," Proc. of the 1986 International Conference on Computer-Aided Design, pp. 381-384, 1986.
[15] M. A. B. Jackson, E. S. Kuh, "Performance-Driven Placement of Cell Based IC's," Proc. of the 26th Design Automation Conference, pp. 370-375, 1989.
[16] J. Lillis, C.-K. Cheng, T.-T. Y. Lin, "Optimal Wire Sizing and Buffer Insertion for Low Power and a Generalized Delay Model," Proc. of the International Conference on Computer Aided Design, pp. 138-143, 1995.
[17] A. Srinivasan, K. Chaudhary, E. S. Kuh, "RITUAL: A Performance-Driven Placement Algorithm," IEEE Transactions on Circuits and Systems-II: Analog and Digital Signal Processing, Vol. 39, pp. 825-840, Nov. 1992.
[18] L. Stockmeyer, "Optimal Orientations of Cells in Slicing Floorplan Designs," Information and Control, Vol. 57, pp. 91-101, 1983.
[19] D. Whitley, "The Genitor Algorithm and Selection Pressure: Why Rank-Based Allocation of Reproductive Trials is Best," Proc. of the Third International Conference on Genetic Algorithms, pp. 116-121, 1989.
[20] D. Whitley, T. Starkweather, D. Fuquay, "Scheduling Problems and Traveling Salesmen: The Genetic Edge Recombination Operator," Proc. of the Third International Conference on Genetic Algorithms, pp. 133-140, 1989.
[21] D. F. Wong, C. L. Liu, "A new algorithm for floorplan design," Proc. of the 23rd Design Automation Conference, pp. 101-107, 1986.
[22] X.-M. Xiong, E. S. Kuh, "Geometric Approach to VLSI Layout Compaction," International Journal of Circuit Theory and Applications, Vol. 18, pp. 411-430, 1990.
GENETIC SIMULATED ANNEALING AND APPLICATION TO NON-SLICING FLOORPLAN DESIGN
Seiichi Koakutsu, Maggie Kang and Wayne Wei-Ming Dai
Dept. of E.E., Chiba University, [email protected]
Dept. of CE, UCSC, [email protected]
Dept. of CE, UCSC, [email protected]
ABSTRACT
We propose a new optimization method, named genetic simulated annealing (GSA), which combines the local stochastic hill climbing features of simulated annealing (SA) and the global crossover operations of the genetic algorithm (GA). We demonstrate the advantages of GSA by solving one of the most difficult problems in layout, the non-slicing floorplan design problem. Given the same amount of computing resources, our experimental results show that GSA consistently obtains better results than SA, in terms of both the chip area and the total wire length. We also applied GSA to timing-driven floorplan design, and experimental results indicate that it achieves the specified wire length bounds for the critical nets with a small penalty on the chip area and the total wire length.

1. INTRODUCTION
Most VLSI layout problems can be formulated as combinatorial optimization problems and are proven to be NP-hard or NP-complete. Simulated annealing (SA) [1, 2] and the genetic algorithm (GA) [3, 4] are heuristics for combinatorial optimization problems and have been successfully used for various problems in the CAD area
[5, 6, 7, 8, 9]. While SA is very powerful for searching local regions of the solution space exhaustively via stochastic hill climbing, GA is very powerful for searching large regions of the solution space roughly and globally using crossover operations. Combining the local hill climbing features of SA and the global crossover operations of GA, we propose a new optimization method, named Genetic Simulated Annealing (GSA). We apply GSA to non-slicing floorplan design problems to demonstrate the advantages of GSA over SA. The rest of the paper is organized as follows. We discuss the characteristics of SA and GA in Section 2 and propose the new optimization technique GSA in Section 3. A new representation for non-slicing floorplans, called the Bounded Slicing Grid (BSG), is described in Section 4. Two key search operations for the BSG, mutation and crossover, are described in Section 5. The experimental results are reported in Section 6. Timing-driven floorplanning using GSA is discussed in Section 7, followed by the conclusions.
2. SIMULATED ANNEALING AND GENETIC ALGORITHM
SA is a stochastic iterative improvement method for solving combinatorial optimization problems. SA generates a single sequence of solutions and searches for an optimum solution along this search path. SA starts with a given initial solution x0. At each step, SA generates a candidate solution x' by changing a small fraction of the current solution x. SA accepts the candidate solution as a new solution with probability min{1, e^(−Δf/T)}, where Δf = f(x') − f(x) is the cost change from the current solution x to the candidate solution x', and T is a control parameter called temperature. A key point of SA is that SA accepts up-hill moves with the probability e^(−Δf/T). This allows SA to escape from local minima. But SA cannot cover a large region of the solution space within a limited computation time because SA is based on small moves. Fig. 1 shows the pseudo-code of SA.

SA-algorithm(Ns, T0, α)
{
    x = x0;  T = T0;
    while (system is not frozen) {
        for (i = 1; i <= Ns; i++) {
            x' = Mutate(x);
            Δf = f(x') − f(x);
            r = random number ∈ (0, 1);
            if (Δf < 0 or r < exp(−Δf/T))
                x = x';
        }
        T = T × α;    /* lower temperature */
    }
    return x;
}

Figure 1. SA algorithm.

GA is another approach for solving combinatorial optimization problems. GA applies an evolutionary mechanism to optimization problems. It starts with a population of initial solutions. Each solution has a fitness value which is a measure of the quality of a solution. At each step, called a generation, GA produces a set of candidate solutions, called child solutions, using two types of genetic operators: mutation and crossover. It selects good solutions as survivors to the next generation according to the fitness value. The mutation operator takes a single parent and modifies it randomly in a localized manner, so that it makes a small jump in the solution space. On the other hand, the crossover operator takes two solutions as parents and creates their child solutions by combining the partial solutions of the parents. Crossover tends to create child solutions which differ from both parent solutions. It results in a large jump in the solution space. There are two key differences between GA and
SA. One is that GA maintains a population of solutions and uses them to search the solution space. Another is that GA uses the crossover operator, which causes a large jump in the solution space. These features allow GA to globally search large regions of the solution space. But GA has no explicit way to produce a sequence of small moves in the solution space. Mutation creates a single small move at a time instead of a sequence of small moves. As a result, GA cannot search a local region of the solution space exhaustively. Fig. 2 shows the pseudo-code of GA.
GA-algorithm(L, Rc, Rm)
{
    X = {x1, ..., xL};    /* initial population */
    while (stop criterion is not met) {
        X' = ∅;
        while (|children| < L × Rc) {
            select two solutions xi, xj from X;
            x' = Crossover(xi, xj);
            X' = X' + {x'};
        }
        reconstruct population from X ∪ X';
        while (|solutions mutated| < L × Rm) {
            select one solution xk from X;
            xk = Mutate(xk);
        }
    }
    return the best solution in X;
}

Figure 2. GA algorithm.
3. GENETIC SIMULATED ANNEALING
In order to improve the performance of GA and SA, several hybrid algorithms have been proposed. Mutation used in GA tends to destroy some good features of solutions at the final stages of the optimization process. While Sirag and Weisser [10] proposed a thermodynamic genetic operator, which incorporates an annealing schedule to control the probability of applying the mutation, Adler [11] used an SA-based acceptance function to control the probability of accepting a new solution produced by the mutation. More recent works on GA-oriented hybrids are the Simulated Annealing Genetic Algorithm (SAGA) method proposed by Brown et al. [12] and the Annealing Genetic (AG) method proposed by Lin et al. [13]. Both methods divide each "generation" into two phases: a GA phase and an SA phase. GA generates a set of new solutions using the crossover operator and then SA further refines each solution in the population. While SAGA uses the same annealing schedule for each SA phase, AG tries to optimize different schedules for different SA phases. The above GA-oriented hybrid methods try to incorporate the local stochastic hill climbing features of SA into GA. Since they incorporate a full SA into each generation and the number of generations is usually very large, GA-oriented hybrid methods are very time-consuming. SA-oriented hybrid approaches, on the other hand, attempt to adopt the global crossover operations of GA into SA. Parallel Genetic Simulated Annealing (PGSA) [14, 15] is a parallel version of SA incorporating GA features. During parallel SA-based search, crossover is used to generate new solutions in order to enlarge the search region of SA. We propose a new optimization method called Genetic Simulated Annealing (GSA). While PGSA generates the seeds of SA local search in parallel, that is, the order of applying each SA local search is independent, our GSA generates the seeds of SA sequentially, that is, the seed of an SA local search depends on the best-so-far solutions of all previous SA local searches. This sequential approach seems to generate better child solutions. In addition, compared to PGSA, GSA uses fewer crossover operations since it only uses crossover operations when the SA local search reaches a flat surface and it is time to jump in the solution space. Fig. 3 shows the optimization process of GSA and SA.

Figure 3. Optimization process of GSA and SA.

GSA starts with a population X = {x1, ..., xN} and repeatedly applies three operations: SA-based local search, GA-based crossover operation, and population update. SA-based local search produces a candidate solution x' by changing a small fraction of the state of x. The candidate solution is accepted as the new solution with probability min{1, e^(−Δf/T)}. GSA preserves the local best-so-far solution x*L during the SA-based local search. When the search reaches a flat surface or the system is frozen, GSA produces a large jump in the solution space by using GA-based crossover. GSA picks up a pair of parent solutions xj and xk at random from the population X such that f(xj) ≠ f(xk), applies the crossover operator, and then replaces the worst solution xi by the new solution produced by the crossover operator. At the end of each SA-based local search, GSA updates the population by replacing the current solution xi by the local best-so-far solution x*L. GSA terminates when the CPU time reaches a given limit, and reports the global best-so-far solution x*G. Fig. 4 shows the pseudo-code of GSA.

4. FLOORPLAN PROBLEM
We formulate the building block placement problem as follows: given a set of arbitrary shaped and fixed sized modules and connection information among modules, find a minimum area placement with the shortest wire length.
GSA-algorithm(N, Ns, T0, α)
{
    X = {x1, ..., xN};
    x*G = the best solution among X;
    while (not reach CPU time limit) {
        T = T0;
        /* jump */
        select the worst solution xi from X;
        select two solutions xj, xk from X;
        xi = Crossover(xj, xk);
        x*L = xi;
        /* SA-based local search */
        while (not frozen) {
            for (i = 1; i <= Ns; i++) {
                x' = Mutate(xi);
                Δf = f(x') − f(xi);
                r = random number ∈ (0, 1);
                if (Δf < 0 or r < exp(−Δf/T))
                    xi = x';
                if (f(xi) < f(x*L))
                    x*L = xi;
            }
            T = T × α;
        }
        /* population update */
        xi = x*L;
        if (f(x*L) < f(x*G))
            x*G = x*L;
    }
    return x*G;
}

Figure 4. GSA algorithm.
Many different floorplanning methods have been proposed, for example, rectangular dualization based methods [17, 18], integer programming based methods [19, 20], constructive methods [21, 22], and hierarchical methods [23, 24, 25]. In order to apply stochastic optimization to a combinatorial problem, we must represent the solution space completely and efficiently. That is, the global optimal solution must be reachable by a sequence of moves and each move must be quick to evaluate. Wong and Liu represented a slicing floorplan by a normalized Polish expression which enables efficient neighborhood search [6]. Cohoon et al. applied a distributed GA to the same problem and obtained better floorplan results with a smaller number of cost calculations [8]. For building block layout with multi-layer technology, most channel routing will be replaced by area routing, and blocks are packed together. The technology shift makes non-slicing floorplans more important. Fig. 5 shows an example of non-slicing and slicing floorplans. Recently Nakatake et al. [16] proposed a new representation for non-slicing floorplans, called the Bounded Slicing Grid (BSG). BSG provides a large solution space which includes optimal solutions and allows quick solution evaluation of measures such as chip area and total wire length. Therefore, BSG is a good choice for floorplan design using SA, GA, or GSA.
Figure 5. Non-slicing (a) and slicing (b) floorplans.
BSG is a structure which consists of regularly placed, non-intersecting horizontal and vertical line segments (Fig. 6). Each horizontal or vertical line segment is called a Bounded Slicing line (BS-line). A rectangular area enclosed by four BS-lines is called a room.
Figure 6. BSG.
In the BSG model, a floorplan amounts to an assignment of modules to rooms. This assignment is called a BSG seed. Each room can contain at most one module. There are two types of rooms: an actual room and an empty room. An actual room is a room which contains a module. An empty room is a room which contains no module. Given a BSG seed, a minimum area floorplan can be obtained by a sizing operation on the BSG. The sizing operation determines the size of each room in the BSG, and thus the (x, y)-coordinate of each module, by stretching or shrinking the BS-lines. The sizing operation is based on two directed acyclic graphs: the horizontal sizing graph Gh and the vertical sizing graph Gv. A vertex in Gh represents a horizontal BS-line. There is a directed edge from v1 to v2 in Gh if the BS-line corresponding to v1 is above the BS-line corresponding to v2 and they share the same room. The edge weight is the height of the corresponding room. There is a directed edge from the source vertex to each of the vertices of the uppermost bounded BS-lines. Similarly, there is a directed edge from each of the vertices of the lowermost bounded BS-lines to the sink (see Fig. 7). Gv can be defined similarly.
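As the following text explains, the sizing operation amounts to longest-path computations in these acyclic sizing graphs. A minimal Python sketch of such a computation on a hand-built horizontal sizing graph is shown below; the vertices, edge weights and topological order are hypothetical, and building Gh and Gv from an actual BSG seed is omitted.

```python
def longest_paths(edges, order, source):
    """Longest path length from `source` to every vertex of a DAG.

    edges: dict vertex -> list of (successor, weight); `order` is a topological order.
    """
    dist = {v: float("-inf") for v in order}
    dist[source] = 0.0
    for v in order:
        if dist[v] == float("-inf"):
            continue
        for w, wt in edges.get(v, []):
            dist[w] = max(dist[w], dist[v] + wt)
    return dist

if __name__ == "__main__":
    # Hypothetical Gh: source -> a -> b -> sink, with room heights as edge weights.
    edges = {"source": [("a", 0.0)], "a": [("b", 3.0)], "b": [("sink", 2.0)]}
    order = ["source", "a", "b", "sink"]
    dist = longest_paths(edges, order, "source")
    print(dist["sink"])   # overall layout height for this toy graph -> 5.0
```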
y
R-1
E-=
im3| (a)
(b)
T
y
b X
y
.
LMfl1
Fm2]
Fm 1] Fm2]
Em-37
(c) (d) Figure 8. (ne dimensional compaction. (a) before compaction. (b) x-y compaction. (c) y-X compaction. (d) Ideal compaction. any new partial solutions which are inconsistent with the parents. It results large jumps in the solution space. 5.1.
Figure 7. Horizontal sizing graph Gh. The length of the longest path between the source and each vertex in Gh gives y-coordinate of the upper side edge of the corresponding module. The longest path length between the source and the sink in Gh (or G.) gives the height (or the width) of the overall layout. There are two key features of the BSG. First, unlike slicing floorplans, each BSG unit allows 45 degree-direction compaction. An actual room surrounded by empty rooms moves freely in all directions. Second unlike one dimensional compaction (Fig.8), the sizing operation does not depend on the order of compaction directions (Fig.9). These features enable BSG to produce non-slicing floorplans. 5.
5. GSA SEARCH OPERATIONS

In order to apply GSA to the floorplan problem, we have to define the mutation operator and the crossover operator. These operators are very important because the performance of GSA depends highly on their implementation. Recall that mutation takes a single parent and modifies it at random in a localized manner; it makes a small jump in the solution space. On the other hand, crossover takes two parent solutions and creates new solutions by combining the partial solutions of the parents. Crossover does not create any new partial solutions which are inconsistent with the parents. It results in large jumps in the solution space.

5.1. Mutation Operation

The mutation operator aims to create small changes of the solution states. For the BSG model, we use three types of mutation operators: exchange, move, and rotate. The Exchange operator exchanges two modules. The Move operator moves a single module to an empty room. The Rotate operator rotates a single module by 90 degrees. GSA selects one of these operators at random and applies it in a randomized manner at each optimization step of the SA-based local search.

5.2. Crossover Operation

The crossover operator aims to create new solutions by combining partial solutions of the parents. There are three requirements for a desirable crossover operator. First, crossover should not produce any new partial solutions which belong to neither parent: all of the features of a child solution should be inherited from the parent solutions. Second, crossover should create a child in such a way that the more the parents have in common, the more the child resembles the parents; in the extreme case where both parents are identical, the child should be identical to the parents. Finally, crossover should produce a feasible child solution. In the case of the BSG floorplan model, a feasible solution means that every module has been assigned to exactly one room. To satisfy the above requirements, we impose the following constraint on the BSG crossover: in a child solution, each module is placed in the same room as in one of the parent solutions. For example, in Fig. 10, module A of child O takes the position it has in P1 and module C of O takes the position it has in P2. Before describing in detail the crossover procedure for the BSG floorplan model, we define some terms. A module in
one parent is called conflicting if the corresponding room in the other parent contains a different module. Otherwise it is called non-conflicting. For example, in Fig. 10, modules A, C, D, and E in parent P1 conflict with modules B, A, E, and D in the other parent P2, respectively, while modules B and F in parent P1 are non-conflicting.
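As a rough illustration of the mutation operators and of the conflicting/non-conflicting test, the sketch below represents a BSG seed as a mapping from room index to (module, orientation); the room count, module names, and data structures are hypothetical stand-ins for the paper's BSG seed.

import random

def mutate(seed, num_rooms, rng=random):
    # Apply one of the exchange / move / rotate operators to a copy of the seed.
    seed = dict(seed)
    occupied = list(seed)
    op = rng.choice(["exchange", "move", "rotate"])
    if op == "exchange" and len(occupied) >= 2:
        r1, r2 = rng.sample(occupied, 2)
        seed[r1], seed[r2] = seed[r2], seed[r1]
    elif op == "move":
        empty = [r for r in range(num_rooms) if r not in seed]
        if empty:
            seed[rng.choice(empty)] = seed.pop(rng.choice(occupied))
    else:  # rotate a single module by 90 degrees
        r = rng.choice(occupied)
        name, orient = seed[r]
        seed[r] = (name, (orient + 90) % 360)
    return seed

def conflicting(room, parent1, parent2):
    # A module conflicts when the other parent holds a different module in the same room.
    a, b = parent1.get(room), parent2.get(room)
    return a is not None and b is not None and a[0] != b[0]

p1 = {0: ("A", 0), 1: ("B", 0), 2: ("C", 90)}
p2 = {0: ("B", 0), 1: ("A", 0), 2: ("C", 0)}
print(mutate(p1, num_rooms=6))
print(conflicting(0, p1, p2))   # True: room 0 holds A in p1 but B in p2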
Figure 10. BSG crossover creates child O (c) by combining parents P1 (a) and P2 (b).

The BSG crossover copies modules from the parents to the child while obeying the following two rules.
* Rule 1. If a pair of modules conflict with each other, the BSG crossover copies both modules from the same parent.
* Rule 2. The BSG crossover copies modules from both parents alternately, in order to inherit features fairly from both parents.
Fig. 10 shows an example of BSG crossover. First a certain module is selected from either parent, in this example module A in parent P1. Modules A and B in the same parent P1 are copied to the corresponding rooms in child O by Rule 1, because A in P1 conflicts with B in P2. Now module B in parent P1 is non-conflicting. Next, module C is selected from the other parent P2 by Rule 2, and is copied to the corresponding room in child O. This module C in parent P2 is non-conflicting. Next, module D is selected from parent P1 again by Rule 2. Modules D and E in the same parent P1 are copied to the corresponding rooms in child O by Rule 1, because they conflict with each other. At this point, although module E in parent P1 conflicts with module D in parent P2, we regard E as a non-conflicting module, because the
Figure 9. BSG compaction. (a) Before compaction. (b) After sizing.
module D has already been copied. Finally, module F in parent P2 is copied into the corresponding room in child O. Now we are ready to describe the precise procedure of BSG crossover. First, BSG crossover initializes a module list M to contain all modules and selects one module mi from M at random. Module mi is copied from the current parent P and deleted from the module list M. If module mi conflicts with a module mj in M, module mj is also copied from the same parent P. Otherwise, the current parent P is flipped to the other parent. These steps are repeated until all the modules have been copied. Fig. 11 shows the pseudo-code of BSG crossover.
BSG_crossover(P1, P2, O)
{
    M <- {all modules};      /* initialize module list M */
    P <- P1;                 /* initialize current parent P */
    while (M != empty) {
        select mi in M at random;
        copy mi from P to O;
        M <- M \ {mi};
        while (mi conflicts with some mj in M) {
            copy mj from P to O;
            M <- M \ {mj};
        }
        if (P = P1)
            P <- P2;
        else
            P <- P1;
    }
    report O;
}

Table 1. SA and GSA experimental results (5 hours).
    area        SA           GSA          Reduc.(%)
    1           51,254,000   41,776,224     18.49
    2           54,848,836   43,218,000     21.21
    3           49,635,040   44,029,440     11.29
    4           45,243,072   50,413,944    -11.43
    5           57,666,336   47,154,464     18.23
    average     51,729,457   45,318,414     12.39

Table 2. SA and GSA experimental results (5 hours).
    wire        SA           GSA          Reduc.(%)
    1           1,159,424    1,161,972      -2.20
    2           1,221,010    1,183,658       3.06
    3           1,188,348    1,163,596       2.08
    4           1,218,784    1,138,802       6.56
    5           1,197,504    1,160,328       3.10
    average     1,197,014    1,161,736       2.95
Figure 11. BSG crossover procedure.

The current implementation of the BSG crossover satisfies two of the three requirements of crossover described in Section 5.2. It always creates feasible solutions in which all partial solutions are consistent with one of the parents. But it may create a child which is identical to one of the parents, even though the two parents are different.

6. EXPERIMENTAL RESULTS

The goal of the floorplan problem is to place all modules on the BSG so as to minimize the total chip area and the total wire length. We use the following cost function:

f = A + β·W^2        (1)
where A is the chip area, W is the total wire length, and β is a constant controlling the relative importance of A and W. We use β = 0.5 in our experiments. In order to match the dimensions of the two terms, the square of the wire length is employed. The length of a net which has more than two terminals is estimated as the half perimeter of the minimum bounding box which includes all the terminals. We applied GSA and SA to several MCNC benchmarks. Tables 1 and 2 show the results for AMI49. We set the time limit for the experiments to five hours, running on a Sun Sparc 20 workstation. Both GSA and SA were run five times with different initial floorplans generated at random. To be fair in our comparisons, the total number of generated new solutions was the same for both GSA and SA. All the data and the average values are shown in Tables 1 and 2. The results show that GSA improves the average chip area by 12.4% and the average wire length by 2.95% over SA. Fig. 12 shows an example of the floorplan results.
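A small worked example of this cost, assuming the reconstructed form of Eq. (1) and hypothetical terminal coordinates (the numbers are illustrative only):

def hpwl(pins):
    # Half-perimeter of the bounding box of a net's terminal coordinates.
    xs, ys = zip(*pins)
    return (max(xs) - min(xs)) + (max(ys) - min(ys))

def cost(area, nets, beta=0.5):
    # f = A + beta * W^2, with W the total (half-perimeter) wire length.
    w = sum(hpwl(pins) for pins in nets)
    return area + beta * w * w

nets = [[(0, 0), (30, 40)], [(10, 5), (25, 5), (25, 35)]]   # two hypothetical nets
print(cost(area=50 * 60, nets=nets))   # 3000 + 0.5 * 115^2 = 9612.5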
Figure 12. An example of floorplan results.

7. TIMING DRIVEN GSA FLOORPLANNING

During floorplanning, some nets are timing critical. In addition to minimizing the total chip area and the total wire length, we need to meet the timing constraints on those critical nets. For simplicity, we assume the timing constraints
are specified in terms of bounds on wire lengths. For the set of critical nets Nc, we define the total excess wire length as follows:

E = Σ_{n ∈ Nc} (l_n − b_n) · δ(l_n, b_n)        (2)

δ(l_n, b_n) = 1 if l_n > b_n, and 0 otherwise        (3)

where l_n is the wire length of net n and b_n is the wire length bound of net n. We add the total excess wire length into the cost function:

f = A + β·W^2 + γ·E^2        (4)
where γ, like β, is a constant controlling the relative importance of the optimization objectives and the timing constraints. Similar to the total wire length term, the total excess wire length is squared to adjust the order of magnitude of the three terms.
At each optimization step of the GSA algorithm, we accept a candidate solution with probability min(1, exp(−ΔE/T)), where ΔE = E_x' − E_x is the difference in total excess wire length between the candidate solution x' and the current solution x, and T is the annealing temperature. At high temperatures, GSA accepts some infeasible solutions, which may generate better offspring. At low temperatures, it accepts mainly feasible solutions. During the optimization, we keep track of a global best-so-far solution which meets the wire length bounds of the critical nets. In our experiments, we applied GSA to the MCNC benchmark AMI49. First we obtained the best solution without considering the timing bounds. Then we selected the top 4% and 5% of the longest nets and set their wire length bounds equal to 95% and 90% of their current lengths. Using the timing-driven GSA floorplanning algorithm, we obtained final solutions which meet all the length bounds. Tables 3, 4 and 5 show the results of five-hour experiments running on a Sun Sparc 20 workstation. The results indicate that timing-driven GSA placement can achieve the wire length bounds for the critical nets with a small penalty on chip area and total wire length.

Table 3. Timing Driven GSA experimental results (5 hours): total wire length of the critical nets.
            w/o Timing   with Timing   Length Red.(%)
Exp 4-5       42448        30884           27.24
Exp 4-10      42448        30380           28.43
Exp 5-5       50428        39312           22.04
Exp 5-10      50428        34776           31.03

Table 4. Timing Driven GSA experimental results (5 hours): total wire length.
            w/o Timing   with Timing   Length Inc.(%)
Exp 4-5       1157282      1215340          5.01
Exp 4-10      1157282      1202082          3.87
Exp 5-5       1157282      1222774          5.66
Exp 5-10      1157282      1292004         11.64

Table 5. Timing Driven GSA experimental results (5 hours): total area.
            w/o Timing   with Timing   Area Inc.(%)
Exp 4-5       55204380     52563280        -4.78
Exp 4-10      55204380     51283008        -7.1
Exp 5-5       55204380     50207752        -9.05
Exp 5-10      55204380     57269240         3.74
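A minimal sketch of the excess-wire-length penalty and the acceptance rule described above; the net lengths, bounds, and temperature are hypothetical.

import math
import random

def excess(lengths, bounds):
    # Total excess wire length E over the critical nets (Eqs. (2)-(3)).
    return sum(l - b for l, b in zip(lengths, bounds) if l > b)

def accept(candidate_e, current_e, temperature, rng=random):
    # Accept with probability min(1, exp(-dE/T)).
    d_e = candidate_e - current_e
    return d_e <= 0 or rng.random() < math.exp(-d_e / temperature)

bounds = [100, 80]
current_e = excess([110, 75], bounds)     # E = 10
candidate_e = excess([104, 83], bounds)   # E = 7
print(accept(candidate_e, current_e, temperature=5.0))   # improvement, so always accepted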
8. CONCLUSIONS

Genetic Simulated Annealing (GSA) searches large regions of the solution space effectively by combining SA-based local search features with GA-based global search capability. We applied GSA to non-slicing floorplan design problems and compared it with SA. Given the same computing resources, the experiments showed that GSA improved the average chip area by 12.4% and the average wire length by 2.95% over SA. We have also incorporated timing-driven features into the GSA floorplan design. There are two main improvements left for future work. First, more elegant crossover operators which can produce a wider variety of child solutions are under investigation. Second, the handling of flexible modules which can change their aspect ratio should be incorporated into the BSG floorplan model.
REFERENCES

[1] S. Kirkpatrick, C. D. Gelatt, Jr. and M. P. Vecchi, "Optimization by Simulated Annealing," Science, vol. 220, pp. 671-680, May 1983.
[2] N. Metropolis, A. W. Rosenbluth, M. N. Rosenbluth, A. H. Teller and E. Teller, "Equation of State Calculations by Fast Computing Machines," J. of Chemical Physics, vol. 21, no. 6, pp. 1087-1092, 1953.
[3] J. H. Holland, Adaptation in Natural and Artificial Systems, Ann Arbor, MI: University of Michigan Press, 1975.
[4] D. E. Goldberg, Genetic Algorithms in Search, Optimization and Machine Learning, Reading, MA: Addison-Wesley, 1989.
[5] C. Sechen and A. Sangiovanni-Vincentelli, "TimberWolf 3.2: A new standard cell placement and global routing package," Proc. 23rd Design Automation Conf., pp. 432-439, June 1986.
[6] D. F. Wong and C. L. Liu, "A new algorithm for floorplan design," Proc. 23rd Design Automation Conf., pp. 101-107, June 1986.
[7] J. P. Cohoon and W. D. Paris, "Genetic placement," IEEE Trans. Computer-Aided Design, vol. CAD-6, no. 6, pp. 956-964, November 1987.
[8] J. P. Cohoon, S. U. Hegde, W. N. Martin and D. S. Richards, "Distributed Genetic Algorithms for the Floorplan Design Problem," IEEE Trans. Computer-Aided Design, vol. CAD-10, no. 4, pp. 483-492, 1991.
[9] K. Shahookar and P. Mazumder, "A genetic approach to standard cell placement using meta-genetic parameter optimization," IEEE Trans. Computer-Aided Design, vol. CAD-9, no. 5, pp. 500-511, 1990.
[10] D. Sirag and P. Weisser, "Toward a Unified Thermodynamic Genetic Operator," Proc. 2nd Int. Conf. Genetic Algorithms, pp. 116-122, 1987.
[11] D. Adler, "Genetic Algorithms and Simulated Annealing: A Marriage Proposal," Proc. Int. Conf. Neural Networks, pp. 1104-1109, 1993.
[12] D. Brown, C. Huntley, and A. Spillane, "A Parallel Genetic Heuristic for the Quadratic Assignment Problem," Proc. 3rd Int. Conf. Genetic Algorithms, pp. 406-415, 1989.
[13] F.-T. Lin, C.-Y. Kao, and C.-C. Hsu, "Applying the Genetic Approach to Simulated Annealing in Solving Some NP-Hard Problems," IEEE Trans. Systems, Man, and Cybernetics, vol. 23, no. 6, pp. 1752-1767, 1993.
[14] S. Koakutsu, Y. Sugai and H. Hirata, "Block Placement by Improved Simulated Annealing Based on Genetic Algorithm," Trans. of the Institute of Electronics, Information and Communication Engineers of Japan, vol. J73-A, no. 1, pp. 87-94, 1990.
[15] S. Koakutsu, Y. Sugai and H. Hirata, "Floorplanning by Improved Simulated Annealing Based on Genetic Algorithm," Trans. of the Institute of Electrical Engineers of Japan, vol. 112-C, no. 7, pp. 411-416, 1992.
[16] S. Nakatake, H. Murata, K. Fujiyoshi, and Y. Kajitani, "Bounded-Slicing Structure for Module Placement," Technical Report of the Institute of Electronics, Information and Communication Engineers of Japan, vol. VLD94, no. 313, pp. 19-24, 1994.
[17] Y.-T. Lai and S. M. Leinwand, "Algorithms for Floorplan Design Via Rectangular Dualization," IEEE Trans. Computer-Aided Design, vol. CAD-7, no. 12, pp. 1278-1289, 1988.
[18] M. J. Ciesielski and E. Kinnen, "Digraph Relaxation for 2-Dimensional Placement of IC Blocks," IEEE Trans. Computer-Aided Design, vol. CAD-6, no. 1, pp. 55-66, 1987.
[19] S. Sutanthavibul, E. Shragowitz, and J. B. Rosen, "An Analytical Approach to Floorplan Design and Optimization," IEEE Trans. Computer-Aided Design, vol. CAD-10, no. 6, pp. 761-769, 1991.
[20] C.-S. Ying and J. S.-L. Wong, "An Analytical Approach to Floorplanning for Hierarchical Building Block Layout," IEEE Trans. Computer-Aided Design, vol. CAD-8, no. 4, pp. 403-412, 1989.
[21] S. Wimer and I. Koren, "Analysis of Strategies for Constructive General Block Placement," IEEE Trans. Computer-Aided Design, vol. CAD-7, no. 3, pp. 371-377, 1988.
[22] D. P. La Potin and S. W. Director, "Mason: A Global Floorplanning Approach for VLSI Design," IEEE Trans. Computer-Aided Design, vol. CAD-5, no. 4, pp. 477-489, 1985.
[23] T. Lengauer and R. Muller, "Robust and Accurate Hierarchical Floorplanning with Integrated Global Wiring," IEEE Trans. Computer-Aided Design, vol. CAD-12, no. 6, pp. 802-809, 1993.
[24] W. W.-M. Dai, B. Eschermann, E. S. Kuh, and M. Pedram, "Hierarchical Placement and Floorplanning in BEAR," IEEE Trans. Computer-Aided Design, vol. CAD-8, no. 12, pp. 1335-1349, 1989.
[25] W.-M. Dai and E. S. Kuh, "Simultaneous Floor Planning and Global Routing for Hierarchical Building-Block Layout," IEEE Trans. Computer-Aided Design, vol. CAD-6, no. 5, pp. 828-837, 1987.
[26] O. C. Martin and S. W. Otto, "Combining Simulated Annealing with Local Search Heuristics," in Meta-heuristics in Combinatorial Optimization, G. Laporte and I. Osman, eds., 1994.
[27] N. L. J. Ulder, E. H. L. Aarts, H.-J. Bandelt, P. J. M. van Laarhoven, and E. Pesch, "Genetic Local Search Algorithms for the Traveling Salesman Problem," Parallel Problem Solving from Nature, pp. 109-116, 1991.
[28] O. Martin, S. W. Otto and E. W. Felten, "Large-Step Markov Chains for the Traveling Salesman Problem," Complex Systems, pp. 299-326, 1991.
[29] K. D. Boese, A. B. Kahng and S. Muddu, "A new adaptive multi-start technique for combinatorial global optimizations," Operations Research Letters, vol. 16, pp. 101-113, 1994.
Physical Layout for Three-Dimensional FPGAs *

Michael J. Alexander, James P. Cohoon, Jared L. Colflesh, John Karro, Edward L. Peters and Gabriel Robins
Department of Computer Science, University of Virginia, Charlottesville, VA 22903

Abstract

We explore physical layout for a three-dimensional (3D) FPGA architecture. For placement, we introduce a top-down partitioning technique based on rectilinear Steiner trees; we then employ a one-step router to produce the final layout. Experimental results indicate that our approach produces effective 3D layouts, using considerably shorter average interconnect distance than is achievable with conventional 2D FPGAs of comparable size.
1  Introduction
A field-programmable gate array (FPGA) is a flexible and reusable design alternative to custom integrated circuits. Using FPGAs, digital designs can be quickly implemented and emulated in hardware, which enables a faster, more economical design cycle [8]. The flexible logic and connection resources of FPGAs allow different designs to be implemented on the same hardware. However, this versatility comes at the expense of a substantial performance penalty due primarily to signal delay through the programmable routing switches. This delay can account for over 70% of the clock cycle period [19, 25]. This paper explores physical-design issues for a new three-dimensional (3D) FPGA architecture, with a focus on reducing the interconnect delay. The increased number of switch block neighbors in our new architecture (i.e., 6 in 3D vs. 4 in 2D) affords greater flexibility during placement and routing, while shorter average interconnect dis-
tance (i.e., O(n^(1/3)) for an n-block 3D FPGA vs. O(n^(1/2)) in
2D) implies shorter signal propagation delays. In order to realize such potential benefits, we must develop effective physical-design techniques which exploit these properties. The physical design tools we developed for 3D FPGAs build on those proposed in [4], and adapt these techniques to the new architecture. Our placement algorithm employs a top-down approach: the FPGA is partitioned into several cubic sections, logic blocks are arranged to minimize the number of nets crossing partition boundaries, and each partition is similarly processed in a recursive manner. We use a one-step graph-based router which employs a Steiner-based heuristic to construct the overall routing solution. Experimental results indicate that our three-dimensional layout algorithms generate 23% average savings in total interconnect length over traditional 2-OPT-based placements. In addition, our three-dimensional layout reduces average interconnect distance by 14% and average maximum source-sink path length by 26%, with respect to 2-OPT-based placements. The remainder of the paper is organized as follows: In Section 2 we discuss our 2D and 3D FPGA architecture models, and address some of the physical constraints in constructing our 3D variant. In Section 3 we discuss performance gains expected for the 3D architecture. Sections 4 and 5 present our physical layout techniques for 3D placement and routing, respectively. Section 6 contains experimental results for 3D FPGAs and compares these results with those achievable for 2D FPGAs, and we conclude in Section 7.

*Professor Cohoon is supported by NSF grants CCR-9224789 and MIP-9107717. Professor Robins is supported by NSF Young Investigator Award MIP-9457412 and by a Packard Foundation Fellowship. Our benchmarks and additional related papers may be found at WWW URL http://www.cs.virginia.edu/~vlsicad/
2  The FPGA Architecture
Field-programmable gate arrays are reprogrammable chips capable of implementing arbitrary design logic. The basic 2D FPGA architecture consists of a symmetrical array of user-configurable logic blocks. Running through the channels between the rows and columns of logic blocks is a network of channel segments, linked together by programmable switch blocks [8] (Figure 1). During customization, configuration data is downloaded onto an FPGA, specifying which programmable switches are to be made active (i.e., programmed "ON"). This process causes each net (i.e., a set of logic-block input/output pins) to be properly interconnected. Our 3D FPGA architecture [2] is a generalization of the basic 2D model, in which each switch block has six immediate neighbors (Figure 2 (a)), as opposed to four in 2D. Three-dimensional switch blocks are analogous to their two-dimensional counterparts; they enable each channel segment to connect to some subset of the channel segments incident on the other five faces of the 3D switch block (Figure 2(b)). Although the usable-gate count of FPGAs has been steadily increasing, the number of usable gates in FP-
GAs is significantly less than in other design styles (mask-programmed gate arrays, semi- and full-custom ICs). This reduction in gate count is an inevitable consequence of the additional chip area dedicated to programmable routing switches and SRAM-based configuration data for these switches [24]. More recent FPGA architectures [27] strive to increase usable-gate counts by reducing the number of programmable switches. Three-dimensional VLSI has recently begun to attract increased attention. The implications of three-dimensional abutment of individual transistors have been studied in [16]. Issues in vertical integration have been investigated in [1], with a focus on "silicon-on-insulator" techniques. In order to reduce delay between individual FPGAs, [11] proposed adding a surrounding programmable interconnection frame to individual die, using an MCM substrate for interconnections between frames. An alternative approach investigates the use of optical interconnect to construct multi-layered FPGAs [10]. I/O for individual FPGA chips was recently enhanced by using solder bumps distributed across the FPGA die [26]. The three-dimensional FPGA model poses a number of new challenges in the area of heat dissipation. Higher operating temperatures can lead to less reliable operation (e.g., heat stress can induce faults). A number of MCM thermal-dissipation techniques (e.g., thermal bumps and pillars [9], thermal gels [7], etc.) may also be applicable to 3D FPGAs. Chip-size packages [23, 28] are a relatively new packaging technology which results in a package typically only 10% larger than the bare die itself. One such package consists of a thin epoxy coating, which allows bonding pads to be placed directly over the die surface. An important benefit of chip-sized packages is the reduction of thermal stress by physical buffering of the die. A second motivation for using chip-sized packages rather than bare dies is the known-good-die problem: compared to packaged die, testing of bare dies is costly as well as time-consuming. Chip-size packages, which are essentially testable dies, can dramatically reduce this problem [14].

Figure 1: Two-dimensional FPGA architecture.

Figure 2: (a) Arrangement of switch blocks in 3D FPGA, and (b) detail of 3D switch block. Dark lines in (a) denote channels containing parallel channel segments.

3  Reducing Interconnect Length
In this section, we explore analytically both the expected interconnect length and the required number of programmable switches. Channel segments on an FPGA are linked together with programmable switches. Under our 2D and 3D architecture models, reducing the interconnect length also reduces the number of programmable switches encountered along the traversed interconnect (which further lowers the interconnect delay). Consider two FPGA architectures, each having n switch blocks, arranged in a 2D grid and in a 3D grid (Figure 2(a)). The expected distance between two uniformly distributed points in the unit interval is 1/3, and therefore the expected Manhattan distances between two uniformly distributed points in the unit square and unit cube are 2/3 and 3 x 1/3 = 1, respectively. Thus, the expected interconnect length for a two-pin net in an FPGA of n switch blocks is (2/3)·n^(1/2) in 2D and n^(1/3) in 3D. Equating the average interconnect length values in the different architectures and solving for a break-even value of n indicates that the average interconnect length is expected to be shorter in 3D than in 2D when n > 12. Because typical values of n in existing FPGA parts already far surpass this value, we can expect 3D interconnect lengths to be shorter than their 2D counterparts. Currently available parts contain over 500 switch blocks [27]; for n = 525, the expected average interconnect length in three dimensions is less than half its value in two dimensions.
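The break-even value quoted above can be checked numerically; the following sketch is purely illustrative and simply evaluates the two expected-length expressions derived in this paragraph.

def expected_len_2d(n):
    return (2.0 / 3.0) * n ** 0.5      # expected Manhattan distance 2/3, scaled by the 2D grid side

def expected_len_3d(n):
    return n ** (1.0 / 3.0)            # expected Manhattan distance 1, scaled by the 3D grid side

n = 2
while expected_len_3d(n) >= expected_len_2d(n):
    n += 1
print(n)                                           # 12, matching the break-even above
print(expected_len_2d(525), expected_len_3d(525))  # about 15.3 vs. 8.1 for n = 525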
A popular 2D switch-block architecture [27] allows each incoming channel segment to connect to a single segment on the other three sides, providing a fanout of Fs,2D = 3. Our 3D architecture generalizes this switch-block scheme to three dimensions, resulting in a fanout of Fs,3D = 5. When a route passes through a switch block, all of the fanout channel segments (either three or five) become unavailable to subsequent nets; one or more of the fanout segments is committed to the route by being programmed "ON" to make the connection, while the remaining fanout segments become unusable (they must be programmed "OFF" to ensure that subsequent nets are electrically disjoint from the current net). Thus, while interconnections in the 3D configuration are shorter, they commit more switching hardware per unit length. This motivates a more refined analysis as follows:

(cost per unit len 2D) x (expected len 2D) = (cost per unit len 3D) x (expected len 3D)
Fs,2D · (2/3)·n^(1/2) = Fs,3D · n^(1/3)
2·n^(1/2) = 5·n^(1/3)
n = (5/2)^6 ≈ 244

Figure 3: (a) 2D switch block with fanout 3, and (b) 3D switch block with fanout 5.

Thus, an average net is expected to use fewer switching resources when routed in three dimensions for n > 244 switch blocks (this size range applies to most currently manufactured parts). Hardware requirements remain greater in three dimensions; however, if channel widths can be reduced sufficiently by going to three dimensions, then identical circuits may be implemented with less overall hardware using a three-dimensional FPGA part.
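Weighting the same expected lengths by the two switch-block fanouts reproduces the refined break-even; again a purely illustrative check.

def switch_cost_2d(n, fanout=3):
    return fanout * (2.0 / 3.0) * n ** 0.5

def switch_cost_3d(n, fanout=5):
    return fanout * n ** (1.0 / 3.0)

n = 2
while switch_cost_3d(n) >= switch_cost_2d(n):
    n += 1
print(n)   # 245, i.e., for n > 244 switch blocks a 3D route commits fewer switching resources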
4  Three-Dimensional FPGA Placement

FPGA layout is generally performed in three phases: (1) technology mapping, (2) placement, and (3) routing. During technology mapping, the design logic is partitioned into small "units", each capable of being implemented by an individual logic block. During placement, these units are each assigned to specific logic blocks on the FPGA. Finally, during routing, nets are electrically interconnected using the available routing resources. Minimizing signal delay is an important consideration during these physical layout phases. In order to minimize delay for each net, source-sink path lengths should be minimized. To minimize the FPGA size required to implement a design (or equivalently, to ensure that a given piece of the design fits onto a single FPGA), we must also minimize the maximum number of channel segments assigned to nets over all channels (this is commonly referred to as reducing congestion). Traditionally, these phases of layout have been considered independently and performed sequentially. Since the quality of the results of any one phase depends on results from earlier phases, it is beneficial for each phase to consider its effects on subsequent phases. A placement technique called Mondrian was designed to specifically consider the effects of placement on routing for 2D FPGAs [4][13]. Here we generalize this method to three dimensions. Our 3D FPGA placement algorithm uses a recursive geometric technique called thumbnail partitioning, which decomposes the FPGA area into an m x n x r grid. In our case, m = n = 3 (and r = 1) for a standard two-dimensional FPGA, and m = n = r = 2 for the three-dimensional FPGA shown in Figure 4.

Figure 4: (a) Partition template with m = n = r = 2, and (b) partitioning graph where partition-template regions correspond to nodes.
The placement phase overlays the FPGA with a partitioning template, partitioning the overall design logic into m x n x r regions. Cut-lines of the template pass through switch blocks, so each logic block lies entirely within a single region of the partitioning template. The distribution of logic blocks among regions of the partitioning template is then improved using simulated annealing [20], where a move consists of swapping pairs of logic blocks from different regions of the partitioning template, with the objective being minimizing total interconnect length over all nets. For each net, we consider the set of partitions in which that net has been placed, and we calculate the minimum rectilinear Steiner tree connecting these points (i.e.,
the thumbnail). An important measure of the quality of a placement and global routing is maximum congestion, which in our case is the number of thumbnail edges crossing any given cut-line. Once logic blocks have been assigned to regions in the partitioning template, a congestion-balancing step is undertaken. Note that a typical set of partitions can have many thumbnails (see Figure 5). The objective of the congestion-balancing step is to assign one of the possible thumbnail alternatives to each net in a manner that minimizes the maximum number of thumbnails having edges that cross the same cut-line of the partitioning template. Since this problem is NP-complete [13], it is accomplished using a greedy heuristic.
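The paper does not spell out the greedy heuristic itself; the sketch below shows one plausible greedy pass in that spirit, in which each net in turn picks the thumbnail alternative that keeps the largest cut-line load smallest. The nets, thumbnail alternatives, and cut-line labels are hypothetical.

from collections import Counter

def balance_congestion(candidates):
    # candidates[i]: thumbnail alternatives for net i, each given as the set of cut-lines it crosses.
    load, chosen = Counter(), []
    for alts in candidates:
        best = min(alts, key=lambda alt: max((load[c] + 1 for c in alt), default=0))
        for c in best:
            load[c] += 1
        chosen.append(best)
    return chosen, max(load.values(), default=0)

nets = [
    [{"x0", "y0"}, {"x0", "z0"}],
    [{"x0"}, {"z0"}],
    [{"y0"}, {"x0", "y0"}],
]
chosen, worst = balance_congestion(nets)
print(worst)   # 2 for this toy instance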
Figure 6: A section of the routing graph showing (a) a 3D FPGA switch box, and (b) the corresponding portion of the routing graph.
Figure 5: Three possible thumbnails for interconnecting the four dark-shaded nodes shown. Unshaded nodes represent possible Steiner nodes.
Next, every edge in each thumbnail must be assigned to a specific switch block along the crossed cut-line of the partitioning template. Each such switch block is conceptually added as a new "virtual" pin in the net. The portion of each net within each region of the partitioning template is then passed on to a lower level of the recursion (this is akin to the virtual terminal [6] and terminal propagation [12] techniques). Assignments are made using a minimum-bipartite-matching heuristic, where one set of nodes represents all nets crossing a cut-line, and the other node set represents the switch boxes on that cut-line. The recursion terminates when a region contains at most one logic block. Minimizing thumbnails for each net while balancing overall congestion produces routable placements.

5  Three-Dimensional FPGA Routing

After the placement phase, all portions of the design logic have been assigned to specific logic blocks on the FPGA. It is now the task of the router to interconnect the pins of each net. The routing phase models the FPGA as a graph, where the overall graph topology mirrors the complete FPGA architecture. Paths in this graph correspond to feasible routes on the FPGA, and conversely (see Figure 6). Nets are routed one at a time, using a move-to-front heuristic when infeasibility is encountered. Routing each net on the FPGA corresponds to the graph version of the Steiner problem: given a graph G = (V, E), where V is the node set and E ⊆ V x V is a set of weighted edges, we must span a subset of the nodes N ⊆ V (i.e., the net), while the remaining nodes may be used as Steiner points. Each edge e_ij ∈ E has a weight w_ij corresponding to interconnect distance, and our goal is to minimize the total spanning cost. The graph Steiner problem is formally defined as follows:

The Graph Steiner Minimum Tree (GSMT) problem: Given a weighted graph G = (V, E) and a net of terminals N ⊆ V to interconnect, find a tree T' = (V', E') with N ⊆ V' ⊆ V and E' ⊆ E such that the sum of w_ij over all e_ij ∈ E' is minimized.

The GSMT problem is NP-complete [17]; we therefore employ the Iterated-KMB (IKMB) heuristic of [3] and adapt it to routing in three dimensions. The IKMB algorithm in turn uses the heuristic of [21] (which we call "KMB") to efficiently search for solutions of increasing quality. Given a graph G = (V, E), the net N ⊆ V, and a set S of potential Steiner points, we define the following:

ΔKMB_G(N, S) = KMB_G(N) − KMB_G(N ∪ S)

The IKMB algorithm starts by computing the KMB tree. Then, at each iteration the IKMB method repeatedly finds additional Steiner node candidates that reduce the overall KMB cost and includes them into the growing set of Steiner nodes S (see Figure 8). The cost of the KMB tree over N ∪ S will decrease with each added node, and the construction terminates when there is no x ∈ V with ΔKMB(N ∪ S, {x}) > 0. The IKMB method is formally described in Figure 7. The IKMB algorithm mirrors the Iterated 1-Steiner method [15][18] and is quite effective in reducing total interconnect length for 2D FPGAs [5]. Our experimental results indicate that IKMB is also effective for 3D FPGAs (see Section 6). Figure 8 shows how introducing Steiner points into the routing solution can reduce total interconnect length.
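The KMB construction itself is only cited above; as a concrete reference point, the sketch below implements its first two steps (shortest-path metric closure over the net's terminals, then a spanning tree over that closure expanded back into graph edges), omitting KMB's final re-MST and leaf-pruning refinements. The routing graph, node names, and weights are hypothetical.

import heapq

def dijkstra(adj, src):
    # Shortest-path lengths and predecessors from src in an undirected weighted graph.
    dist, prev, pq = {src: 0}, {}, [(0, src)]
    while pq:
        d, u = heapq.heappop(pq)
        if d > dist.get(u, float("inf")):
            continue
        for v, w in adj[u]:
            nd = d + w
            if nd < dist.get(v, float("inf")):
                dist[v], prev[v] = nd, u
                heapq.heappush(pq, (nd, v))
    return dist, prev

def kmb_like(adj, terminals):
    # MST of the terminals' metric closure (Prim), with each MST edge unrolled into graph edges.
    sp = {t: dijkstra(adj, t) for t in terminals}
    in_tree, tree_edges = {terminals[0]}, set()
    while len(in_tree) < len(terminals):
        u, v = min(((a, b) for a in in_tree for b in terminals if b not in in_tree),
                   key=lambda e: sp[e[0]][0][e[1]])
        in_tree.add(v)
        node = v
        while node != u:
            p = sp[u][1][node]
            tree_edges.add(tuple(sorted((p, node))))
            node = p
    return sorted(tree_edges)

adj = {
    "a": [("s", 1), ("b", 4)],
    "b": [("s", 1), ("a", 4), ("c", 4)],
    "c": [("s", 1), ("b", 4)],
    "s": [("a", 1), ("b", 1), ("c", 1)],
}
print(kmb_like(adj, ["a", "b", "c"]))   # [('a', 's'), ('b', 's'), ('c', 's')]: s acts as a Steiner point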
[Table 1 body: per-circuit total interconnect length and minimum channel width for the benchmark circuits (9symml, alu2, apex7, example2, terml, and one further circuit) under Random, Mondrian, and Mondrian + 2-OPT placements; the individual entries are not legibly recoverable from the scan. The recoverable overall totals are: total interconnect 31591 (random) vs. 24303 (Mondrian-based), a 23.1% reduction; channel width 167 vs. 158, a 5.4% reduction.]
Table 1: The effects of three-dimensional Mondrian placement on total interconnect length and minimum channel width of several benchmark circuits. The percentages indicate the performance of Mondrian with respect to random placements (positive values indicate improvement, while negative percentages indicate disimprovement).
The Iterated-KMB (IKMB) Algorithm
Input: A weighted graph G = (V, E) and net N ⊆ V
Output: A low-cost tree spanning N
    S = ∅
    While C = {x ∈ V − N | ΔKMB_G(N ∪ S, {x}) > 0} ≠ ∅ Do
        Find x ∈ C with maximum ΔKMB_G(N ∪ S, {x})
        S = S ∪ {x}
    Return KMB_G(N ∪ S)

Figure 7: Iterated-KMB algorithm (IKMB).
6  Experimental Results

Our placement and routing algorithms were implemented using C++ in the SUN Unix environment. We considered 6 large industrial benchmark circuits frequently used to evaluate FPGA routers [22]. Each circuit was first placed using the Mondrian method, and then the placement was further refined using a 2-OPT post-processing step. The various 2-OPT methods we used were as follows: (i) 2-OPT-PAIR, which starts by making the best swap possible, and then "follows" those two blocks, making any beneficial swap involving one of them; (ii) 2-OPT-FIRST-a, which makes the first beneficial swap found, and continues to do so until the number of swaps made exceeds the square of the number of blocks; and (iii) 2-OPT-FIRST-b, which is similar to 2-OPT-FIRST-a but runs until no possible swap would result in a savings of more than 1%. For comparison purposes, we also tested our tool on random placements (post-processed by greedily swapping pairs of blocks). Thus, we benchmarked eight types of placements: (i) random, (ii) random followed by (three distinct kinds of) 2-OPT post-processing, (iii) Mondrian, and (iv) Mondrian followed by (three distinct kinds of) 2-OPT post-processing.
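A toy first-improvement swap pass in the spirit of the 2-OPT variants described above; the cost model (total half-perimeter length per net) and the block coordinates are hypothetical and far simpler than the actual tool's.

from itertools import combinations

def total_hpwl(placement, nets):
    total = 0
    for net in nets:
        xs = [placement[b][0] for b in net]
        ys = [placement[b][1] for b in net]
        total += (max(xs) - min(xs)) + (max(ys) - min(ys))
    return total

def two_opt_first(placement, nets, max_passes=10):
    # Repeatedly apply the first beneficial pairwise block swap found.
    placement = dict(placement)
    for _ in range(max_passes):
        improved = False
        for a, b in combinations(placement, 2):
            before = total_hpwl(placement, nets)
            placement[a], placement[b] = placement[b], placement[a]
            if total_hpwl(placement, nets) < before:
                improved = True
            else:
                placement[a], placement[b] = placement[b], placement[a]   # undo
        if not improved:
            break
    return placement

blocks = {"u": (0, 0), "v": (3, 0), "w": (0, 3), "x": (3, 3)}
nets = [["u", "x"], ["v", "w"]]
print(total_hpwl(blocks, nets), total_hpwl(two_opt_first(blocks, nets), nets))   # 12 6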
Table 1 shows the total interconnect length and the minimum channel width for each benchmark circuit. The results indicate that our placement algorithm generated 23.1% savings in total interconnect length over random placements. The overall improvement in minimum channel widths is 5.4%. Our Mondrian placement algorithm tends to draw logic blocks closer together, thus minimizing total interconnect length (in certain cases, this can increase the minimum channel width required to route the circuit).

Table 2 shows the average interconnect length per net, the average number of programmable switches required to route a net, and the average maximum source-sink path-length (i.e., radius). As expected from the analysis in Section 3, the increased switching flexibility in three dimensions decreases all of these values. Our results indicate a reduction of 13.8% in average interconnect length, 23.0% savings in average switch block usage, and 26.4% reduction in average source-sink path-length. These results are encouraging with respect to interconnect delay minimization in three-dimensional FPGAs (as compared to their two-dimensional counterparts). Figure 9(a) illustrates a 2D layout of one of the benchmark circuits (ALU2), while Figure 9(b) shows the layout of the same circuit on a 3D FPGA.

Figure 8: Example of routing solutions for a single 3-pin net (dark blocks represent pins), produced by (a) the KMB algorithm, and (b) the IKMB algorithm. The shaded switch block in (b) serves as a Steiner point to reduce overall interconnect length by about 16% in this case.

7  Conclusions

We have presented an effective physical layout framework for a new three-dimensional FPGA architecture. For placement, we have used a top-down partitioning technique based on optimal rectilinear Steiner trees, and for routing we developed a graph-based one-step router to produce a final layout. Our experimental results indicate that our approach produces effective 3D layouts, using considerably less interconnect than is required with conventional 2D FPGAs of comparable size.

References

[1] Y. AKASAKA, Three-Dimensional IC Trends, Proc. IEEE, 74 (1986), pp. 1703-1714.
[2] M. J. ALEXANDER, J. P. COHOON, J. L. COLFLESH, J. KARRO, AND G. ROBINS, Three-Dimensional Field-Programmable Gate Arrays, in Proc. IEEE Intl. ASIC Conf., Austin, TX, September 1995, pp. 253-256.
[3] M. J. ALEXANDER, J. P. COHOON, J. L. GANLEY, AND G. ROBINS, An Architecture-Independent Approach to FPGA Routing Based on Multi-Weighted Graphs, in Proc. European Design Automation Conf., Grenoble, France, September 1994, pp. 259-264.
[4] M. J. ALEXANDER, J. P. COHOON, J. L. GANLEY, AND G. ROBINS, Performance-Oriented Placement and Routing for Field-Programmable Gate Arrays, in Proc. European Design Automation Conf., Brighton, England, September 1995, pp. 80-85.
[5] M. J. ALEXANDER AND G. ROBINS, New Performance-Driven FPGA Routing Algorithms, in Proc. ACM/IEEE Design Automation Conf., San Francisco, CA, June 1995, pp. 562-567.
[6] S. BAPAT AND J. P. COHOON, A Parallel VLSI Circuit Layout Methodology, in Proc. IEEE Intl. Conf. VLSI Design, January 1993, pp. 236-241.
[7] D. M. BREWER AND L. P. BURNETT, MCM Designs Require Exhaustive Thermal Analysis, Electronic Design News, (1992), pp. 96-103.
[8] S. D. BROWN, R. J. FRANCIS, J. ROSE, AND Z. G. VRANESIC, Field-Programmable Gate Arrays, Kluwer Academic Publishers, Boston, MA, 1992.
[9] P. C. CHAN, Design Automation for Multichip Modules - Issues and Status, Intl. J. of High Speed Electronics, 2 (1991), pp. 236-285.
[10] J. DEPREITERE, H. NEEFS, H. V. MARCK, J. V. CAMPENHOUT, B. D. R. BAETS, H. THIENPONT, AND I. VERETENNICOFF, An Optoelectronic 3-D Field Programmable Gate Array, in Proc. 4th Intl. Workshop on Field-Programmable Logic and Applications, Prague, September 1994.
[11] I. DOBBELAERE, A. E. GAMAL, D. HOW, AND B. KLEVELAND, Field Programmable MCM Systems - Design of an Interconnection Frame, in Custom Integrated Circuits Conf., Boston, MA, 1992, pp. 4.6.1-4.6.4.
[12] A. E. DUNLOP AND B. W. KERNIGHAN, A Procedure for Placement of Standard-Cell VLSI Circuits, IEEE Trans. Computer-Aided Design, 4 (1985), pp. 92-98.
[13] J. L. GANLEY, Geometric Interconnection and Placement Algorithms, PhD thesis, Department of Computer Science, University of Virginia, Charlottesville, Virginia, 1995.
[14] L. GILG, Known Good Die Meets Chip Size Package, IEEE Circuits and Devices Magazine, (1995), pp. 32-37.
[15] J. GRIFFITH, G. ROBINS, J. S. SALOWE, AND T. ZHANG, Closing the Gap: Near-Optimal Steiner Trees in Polynomial Time, IEEE Trans. Computer-Aided Design, 13 (1994), pp. 1351-1365.
[16] A. C. HARTER, Three-Dimensional Integrated Circuit Layout, Cambridge University Press, New York, 1991.
[17] F. K. HWANG, D. S. RICHARDS, AND P. WINTER, The Steiner Tree Problem, North-Holland, 1992.
[18] A. B. KAHNG AND G. ROBINS, A New Class of Iterative Steiner Tree Heuristics With Good Performance, IEEE Trans. Computer-Aided Design, 11 (1992), pp. 893-902.
[19] A. B. KAHNG AND G. ROBINS, On Optimal Interconnections for VLSI, Kluwer Academic Publishers, Boston, MA, 1995.
[20] S. KIRKPATRICK, C. D. GELATT, AND M. P. VECCHI, Optimization by Simulated Annealing: An Experimental Evaluation (part 1), Science, 220 (1983), pp. 671-680.
[21] L. KOU, G. MARKOWSKY, AND L. BERMAN, A Fast Algorithm for Steiner Trees, Acta Informatica, 15 (1981), pp. 141-145.
[22] G. G. LEMIEUX AND S. D. BROWN, A Detailed Routing Algorithm for Allocating Wire Segments in Field-Programmable Gate Arrays, in Proc. ACM/SIGDA Physical Design Workshop, Lake Arrowhead, CA, April 1993.
[23] D. MALINIAK, Chip-Scale Packages Bridge the Gap Between Bare Die and BGAs, Electronic Design, (1995), pp. 65-73.
[24] J. ROSE, R. J. FRANCIS, D. LEWIS, AND P. CHOW, Architecture of Field-Programmable Gate Arrays: The Effect of Logic Block Functionality on Area Efficiency, IEEE J. Solid State Circuits, 25 (1990), pp. 1217-1225.
[25] S. M. TRIMBERGER, Field-Programmable Gate Array Technology, S. M. Trimberger, editor, Kluwer Academic Publishers, Boston, MA, 1994.
[26] J. R. V. MAHESHWARI, J. DARNAUER, AND W. W.-M. DAI, Design of FPGAs with Area I/O for Field Programmable MCM, in Proc. ACM/SIGDA Intl. Symposium on Field-Programmable Gate Arrays, 1995.
[27] XILINX, The Programmable Gate Array Data Book, Xilinx, Inc., San Jose, California, 1994.
[28] M. YASUNAGA, S. BABA, M. MATSUO, H. MATSUSHIMA, S. NAKAO, AND T. TACHIKAWA, Chip Scale Packages: "A Lightly Dressed LSI Chip", IEEE Trans. on Components, Packaging and Manufacturing Tech., 18 (1995), pp. 451-457.
Figure 9: Final layout of circuit ALU2 on (a) a 16 x 16 2D FPGA, and on (b) a 4 x 8 x 8 3D FPGA. Placement was done using Mondrian without 2-OPT post-processing in both cases.
[Table 2 body: per-circuit average net length, average number of active switches per net, and average source-sink radius in 2D and 3D, with percent reductions, for the same benchmark circuits under Random, Random + 2-OPT, Mondrian, and Mondrian + 2-OPT placements; the individual entries are not legibly recoverable from the scan. As summarized in the text, the overall averages show roughly 13.8% shorter interconnect per net, 23.0% fewer switches per net, and 26.4% smaller radius in 3D than in 2D.]
Table 2: Average interconnect lengths, number of active switches used in routing a single net, and the maximum source-sink path-length (radius) for the 2D and 3D cases.
Efficient Area Minimization for Dynamic CMOS Circuits

Bulent Basaran and Rob A. Rutenbar
Department of Electrical and Computer Engineering
Carnegie Mellon University, Pittsburgh, PA 15213, USA
{basaran,rutenbar}@ece.cmu.edu

Abstract

We present a new transistor ordering technique for the layout of dynamic CMOS leaf-cells which minimizes the cell area. The technique employs an Eulerian trail formulation which guarantees that the result is always of minimum width, i.e., the total diffusion area is optimum. A novel iterative improvement strategy minimizes the cell height as defined by the number of wiring tracks required to connect source and drain terminals. We were able to find a transistor ordering with the theoretical optimum area for many industrial circuits and standard test cases from the literature.
1 Introduction

In the layout of custom CMOS circuits, transistor ordering is the problem of placing transistors in a row and orienting them to maximize the sharing of source and drain regions [1]. This is crucial for reducing the total diffusion area and the cell width. Transistor ordering also has an impact on the number of tracks required to complete the routing, which determines the cell height. The focus of this paper is on dynamic CMOS circuits, where transistor ordering is applied to the transistors in the discharge circuitry (the nFET logic block). For the row-based layout of dynamic CMOS circuits, area minimization is known to be NP-hard [2]. There are some techniques that minimize the cell width [3][4][5][6], but none of these techniques explicitly minimizes the total area. On the other hand, for static CMOS (sCMOS) there are some exhaustive techniques that minimize the area, but they have exponential time complexity [7][8]. The relaxation approach of [9] is an efficient technique but it does not guarantee minimum cell width. The chaining technique of [10] does not consider the cell height during transistor placement. In this paper, we present the first technique that minimizes both the width and the height for dynamic CMOS circuits. Our technique is efficient: it takes less than 2 seconds for typical cells. It guarantees optimum cell width, and is also very effective in reducing the cell height. The paper is organized as follows: Section 2 describes the transistor ordering problem. Section 3 introduces a new circuit representation based on Eulerian trails. Section 4 presents the iterative improvement technique. Results on circuits from industry and literature are given in Section 5. Finally, Section 6 offers some concluding remarks.
2 Transistor Ordering

Two MOSFET transistors that have a common drain or source node (e.g., two transistors in series have one common node, two transistors in parallel have two) can share a diffusion region¹. Sharing a diffusion is advantageous both for saving die area and for reducing parasitic diffusion capacitances. Hence, circuit designers pay considerable attention to how best to stack transistors for compact layouts [11]. In row-based designs, transistors are located in a row to maximize possible diffusion sharing. Two parameters impact the quality of diffusion sharing: 1) the order of the transistors, and 2) the orientation of source and drain terminals. In what follows, we use the term transistor ordering to denote an order of transistors in a row with specified orientations for each transistor. In a given transistor ordering, if two adjacent transistors cannot share a diffusion, then we have to leave a diffusion break (also called a gap) for correct DRC (see Figure 1). The number of gaps determines the width of the row, and in turn the cell width. Fewer gaps lead to smaller diffusion area and smaller cell width.
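A small sketch of the gap count that drives the cell width, assuming each transistor is represented as an oriented (left-diffusion, right-diffusion) pair; the example row is hypothetical.

def count_gaps(ordering):
    # Adjacent transistors share a diffusion only when the right node of one equals
    # the left node of the next; every mismatch costs one diffusion break (gap).
    return sum(1 for a, b in zip(ordering, ordering[1:]) if a[1] != b[0])

print(count_gaps([("a", "c"), ("b", "d"), ("d", "e")]))   # 1 gap (c does not match b)
print(count_gaps([("a", "c"), ("c", "d"), ("d", "e")]))   # 0 gaps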
Figure 1. Effect of transistor ordering on diffusion sharing and the width of the cell.
Transistor ordering also has an impact on the number of wiring tracks required to route the cell, which in turn determines the cell height (Figure 2). In what follows, we consider only the diffusion connections. For most practical dynamic circuits, poly connections can be done afterwards without increasing the cell height².

3 Circuit Representation

As introduced in [12], an Eulerian trail in a diffusion graph, if it exists, minimizes the diffusion area of sCMOS circuits.

1. With the assumption that they are of the same type (e.g., both nFET) and also have the same body potential.
2. The corresponding problem in static CMOS is considerably harder.
Later [6] proposed an Eulerian trail method for dynamic CMOS circuits consisting of only one type of network (e.g., nFET logic and pFET pre-charge circuitry) which is not restricted to series-parallel circuits. We use a similar method to represent a dynamic circuit. Given a circuit, we first generate a so-called diffusion graph, G (undirected and possibly having parallel edges) corresponding to the nFET discharge circuitry. Each vertex in G corresponds to a diffusion node (source or drain) in the circuit, and each edge in G corresponds to a transistor (solid edges in Figure 3).
A trail on G is a set of adjacent edges which does not touch any edge in the graph twice [13]. We denote it as T = (v0, e0, v1, e1, v2, ..., v_{k-1}, e_{k-1}, v_k), where e_i = (v_i, v_{i+1}) is an edge in the graph and e_i ≠ e_j for all i ≠ j. If v_k = v_0, the trail is a closed trail. A closed trail which touches every edge of G is an Eulerian trail (e-trail). We may use the shorthand (v0, v1, v2, ..., v_k) for a trail if we do not discriminate parallel edges. Note that a vertex v_i can appear at more than one position in the trail. Each such position is called a terminal. v_k and v_0 are called the end terminals of the trail. The degree of a vertex v, denoted d(v), is the number of edges adjacent to it. A graph is Eulerian if an e-trail exists on it. It is well known that a graph is Eulerian if and only if all vertices in the graph have even degree. An e-trail for the graph in Figure 3 is T1 = (s, 2, 3, 4, 5, 6, 7, 4, s, 6, 2, 1, s).

After G is created, we determine whether it is Eulerian or not. If it is not, we add a vertex, called the super-vertex, vs, to G and make it Eulerian by adding a new edge (vs, vi), called a super-edge, for each odd-degreed vi (dashed lines in Figure 3). We call the resulting graph the modified diffusion graph. On the modified diffusion graph with m edges, we can find an e-trail using a well known algorithm with O(m^2) time complexity [13]. There is also a linear time recursive algorithm for finding an e-trail which can be modified to handle various constraints [14].

There are exponentially many e-trails in an Eulerian graph [15]. Another e-trail for the modified diffusion graph in Figure 3 is T2 = (s, 1, 2, 6, s, 2, 3, 4, 5, 6, 7, 4, s). There are 176 e-trails in this graph. The corresponding transistor orderings for T1 and T2 are shown in Figure 2 b) and c).

Figure 2. Effect of transistor ordering on routing and the height of the cell. The layout in b) has a height of 3, but c) has a height of 2.

Figure 3. The diffusion graph, G, of the discharge circuitry in Figure 2. s is the super-vertex that makes G Eulerian.

The main goal of transistor ordering is to minimize the total diffusion area of the layout. It is easy to see that any e-trail that starts with a super-vertex minimizes the diffusion area. Conversely, any transistor ordering that minimizes the diffusion area is an e-trail in the modified diffusion graph. This property is essential for the efficacy of our modified diffusion graph representation. On the other hand, some e-trails are better than others with respect to the cell height (see Figure 2), which is the subject of the next section.

4 Minimum-width Area Minimization

We propose an efficient technique that reduces the cell height significantly by searching only among the minimum-width transistor orderings. The modified diffusion graph (described in the previous section) is employed to model the circuit and the layout. As mentioned above, this representation ensures that the cell width is kept at a minimum and the total diffusion area is minimized. This is critical for circuit performance and for fixed-width cell methodologies (e.g., in data-path design [16]). The second goal, which is also very important for area minimization, is to find a transistor ordering that requires a minimum number of routing tracks. The following sub-sections explain the basic ingredients of our iterative improvement technique to perform this task.

A. Move generation

A move is an operation on an e-trail which converts it into another e-trail. In a good iterative improvement technique it is important that the move set, i.e., the set of moves employed, corresponds to a neighborhood structure which spans the whole search space. Otherwise, there will be some configurations (e-trails or transistor orderings) that will not be reachable, resulting in a sub-optimal solution. In the previous section we have shown that any minimum-width ordering can be represented with an e-trail in the modified diffusion graph. Here, we introduce two moves which are sufficient to ensure that it is possible to transform any e-trail into any other via a sequence of moves: sub-trail modification and rotation.

A sub-trail is a set of consecutive edges in an e-trail. The sub-trail modification move changes the order of a set of adjacent transistors. We first pick an arbitrary sub-trail of the e-trail. Next, we generate a sub-graph, G', that has only the vertices
and the edges in that sub-trail (including the super-vertex and super-edges, if any). Then, we find a different sub-trail in G'. Figure 4 shows an example. Note that G' is either Eulerian or it has at most 2 odd-degreed vertices. In either case, the algorithm given in [14] can find an e-trail in O(m), where m is the number of edges in the sub-trail.
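The construction above can be prototyped directly: add a super-vertex joined to every odd-degree diffusion node and extract an e-trail. The sketch below uses the standard Hierholzer construction rather than the algorithms of [13] and [14], and the transistor list is hypothetical.

from collections import defaultdict

def eulerian_trail(transistors, super_vertex="s"):
    # Build the modified diffusion graph and return an Eulerian trail starting at the super-vertex.
    adj = defaultdict(list)
    def add(u, v):
        adj[u].append(v)
        adj[v].append(u)
    for u, v in transistors:
        add(u, v)
    for v in [v for v in list(adj) if len(adj[v]) % 2 == 1]:
        add(super_vertex, v)           # one super-edge per odd-degree vertex
    stack, trail = [super_vertex], []  # Hierholzer's algorithm
    while stack:
        u = stack[-1]
        if adj[u]:
            v = adj[u].pop()
            adj[v].remove(u)
            stack.append(v)
        else:
            trail.append(stack.pop())
    return trail[::-1]

# Hypothetical discharge circuitry: each pair is a transistor between two diffusion nodes.
print(eulerian_trail([(1, 2), (2, 3), (3, 4), (4, 5), (2, 4)]))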
5 Results
The transistor ordering technique presented in this paper has been implemented in the C language on an IBM PowerPC 604 workstation running AIX 4.1. We have tested the algorithm on various dynamic CMOS circuits from the industry and literature. Table I lists some of these circuits and shows the results that were obtained.
Figure 4. Sub-trail modification move on the graph of Figure 3. Sub-trail (6, s, 4, 7, 6, 2) is replaced with (6, 7,4, s, 6, 2), where s corresponds to the gap.
a)
Table 1. m is the number of edges in the modified diffusion graph. m' is the number of transistors in the circuit and s is the number of super-edges added. H is the number of routing tracks for the worst ordering. h is the result of our technique.
El~~l bc d e
circuit
h g f a
b c d e g I
f a
The rotation move rotates the transistors in the ordering cyclically without breaking the shared diffusions. Figure 5 shows an example. Figure 5. Rotation move. The e-trail of Figure 4 b), (s,2,3,4,5,6,7,4,s,6,2,1,s), is rotated to obtain a new e-trail: (s,6,2, 1,s,2,3,4,5,6,7,4,s).
f a
m=m +s
H
h
cpu (sec.)
chak I chak2 chak3 ckl ck2 dI d2 d3 dorn3 uvl uvId
8+4 18+8 8+2 9+6 8+4 11+2 14+2 34+2 7+2 16+8 16+8
4
2
0.6
10 3 5 4 5 3 3 3 9 9
3 2 2 2 2 2 2 2 3 3
1.5 0.5 0.8 0.6 0.7 0.9 2.1 0.5 1.4 1.4
uv2
24+12
13
3
13.0
uv2d
24+12
13
3
14.0
Circuits chakl-3 are from [6]; ckl,2 are from [5]; dJ-3, dom3
b cd e g h
are domino CMOS circuits from industry. The last four circuits are from [21]. For all of the circuits, the optimum area is obtained: the number of diffusion breaks (cell width) and the number of routing tracks (cell height) are minimum Figure 6 shows the minimum area result for ckl. Both the number of gaps (2) and the number of routing tracks (2) is less than that reported in [5] (3 gaps, 4 tracks) 2
Note that neither of these two moves violates the Eulerian property of the trail: the trail generated by the move is also an e-trail. Therefore, the corresponding transistor ordering has the minimum number of gaps, i.e., minimum diffusion area is guaranteed.
B. Move evaluation
A move is evaluated with respect to the change in the cost of the transistor ordering as a result of the move. The cost of the transistor ordering is the final layout area. Since the moves in our move set do not change the cell width, we only need to look at the cell height to evaluate the new transistor ordering. The cell height is determined by the number of horizontal wiring tracks required to route the diffusion terminals. We denote it as the routing height. To find the routing height, we use the left-edge algorithm [17]. For a one-sided channel with n nets, the left-edge algorithm finds the optimal routing in O(n). Figure 2 b) and c) show two routing examples generated by the left-edge algorithm.
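A minimal sketch of a left-edge style track assignment is given below, assuming each net reduces to a single interval of columns on one side of the channel; the data layout and names are ours, not the paper's implementation.

#include <algorithm>
#include <cstdio>
#include <vector>

// Left-edge style track assignment for a one-sided channel: each net is an
// interval [left, right] of column positions, and the routing height is the
// number of horizontal tracks used.
struct Net { int left, right; };

int routing_height(std::vector<Net> nets) {
    std::sort(nets.begin(), nets.end(),
              [](const Net& a, const Net& b) { return a.left < b.left; });
    std::vector<int> track_end;                     // rightmost column per track
    for (const Net& n : nets) {
        bool placed = false;
        for (int& end : track_end)                  // first track that is free
            if (end < n.left) { end = n.right; placed = true; break; }
        if (!placed) track_end.push_back(n.right);  // open a new track
    }
    return (int)track_end.size();
}

int main() {
    std::vector<Net> nets = {{0, 3}, {1, 5}, {4, 7}, {6, 8}};
    std::printf("tracks = %d\n", routing_height(nets));   // prints 2
}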
It is asserted in [21] that uv1 (uv1d is the dual) is a "difficult" circuit for row-based layouts. Indeed, when there is no height minimization, the area can be proportional to O(m²) for m transistors, whereas the gate-matrix style can produce a layout with O(m log m) area. However, with our area minimization technique the transistor ordering has an area of O(m). This is a positive result for row-based layouts as compared to gate-matrix. It is also worth noting that the enumeration technique of [8] for sCMOS (it was recently used in [22]) could also be applied to these circuits³. Unfortunately, for the last two circuits that exhaustive algorithm would require time proportional to 13³·13¹³. However, our iterative improvement technique was able to find the theoretical optimum in less than 15 seconds.
C. Iteration control
It is important that an iterative improvement scheme does not get stuck in local minima. We employed simulated annealing, which has been found to be very effective in optimizing many difficult layout problems [18][19]. In our implementation, we use a generic simulated annealing library that provides various effective iteration control mechanisms [20].
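A generic annealing loop of the kind described here might look as follows. The move and cost callbacks are placeholders standing in for the sub-trail modification / rotation moves and the left-edge routing height; the cooling schedule, seed and the stand-ins used in main are arbitrary assumptions for the sketch, not the library of [20].

#include <cmath>
#include <cstdio>
#include <functional>
#include <random>
#include <vector>

using ETrail = std::vector<int>;

// Generic simulated-annealing loop over e-trail moves with a Metropolis
// acceptance test and geometric cooling.
ETrail anneal(ETrail current,
              const std::function<ETrail(const ETrail&, std::mt19937&)>& move,
              const std::function<double(const ETrail&)>& cost,
              double t0 = 10.0, double alpha = 0.95, int steps_per_t = 100) {
    std::mt19937 rng(12345);
    std::uniform_real_distribution<double> uni(0.0, 1.0);
    ETrail best = current;
    double cur_cost = cost(current), best_cost = cur_cost;
    for (double t = t0; t > 0.01; t *= alpha) {
        for (int i = 0; i < steps_per_t; ++i) {
            ETrail cand = move(current, rng);
            double c = cost(cand);
            if (c <= cur_cost || uni(rng) < std::exp((cur_cost - c) / t)) {
                current = cand; cur_cost = c;            // accept move
                if (c < best_cost) { best = cand; best_cost = c; }
            }
        }
    }
    return best;
}

int main() {
    // Stand-in move (swap two random positions) and cost (number of descents);
    // a real run would plug in the e-trail moves and the routing height.
    auto move = [](const ETrail& t, std::mt19937& rng) {
        ETrail c = t;
        std::uniform_int_distribution<std::size_t> d(0, c.size() - 1);
        std::swap(c[d(rng)], c[d(rng)]);
        return c;
    };
    auto cost = [](const ETrail& t) {
        double c = 0;
        for (std::size_t i = 1; i < t.size(); ++i) c += (t[i] < t[i - 1]);
        return c;
    };
    ETrail result = anneal({5, 3, 1, 4, 2, 6}, move, cost);
    std::printf("final cost = %.0f\n", cost(result));
}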
1. For verification, the optimum height as well as the worst-case height for the last four circuits were found by exploiting the special graph structure. For the other circuits we used a backtrack search technique for enumeration [15]. The full search takes around 16 hours for chak2.
2. We should also mention that this is not a complete comparison, since [5] uses a different layout style, called oilding layout.
3. The time complexity of that algorithm is super-exponential, O(h³h^h), where h is the number of transistors in the longest discharge path [8].
Figure 6. Schematic of a dynamic CMOS circuit, ck1, and the optimum transistor ordering obtained for the nFET circuitry. For illustration purposes the routing is done in the channel, but it can also be done over-the-transistors to eliminate the routing footprint and save area.

6 Conclusions
A new transistor ordering technique was presented for area minimization of dynamic CMOS circuits. It is able to find a minimum area transistor ordering for all the practical circuits we have tested it on. It is very efficient, finding the optimum layout in less than 2 seconds for realistic dynamic CMOS leaf-cells. We are currently working to reduce routing blockages and handle gate connections effectively. We are also extending the current one-row technique to multi-row and 2-D styles by integrating it into a macro-cell style placer for overall area efficiency.
We are also working on a performance-driven transistor ordering mechanism for parasitic-sensitive dynamic CMOS circuits. We hope to extend analog leaf-cell layout techniques that control circuit parasitics in order to meet constraints on the performance of the circuit [14][23][24].

Acknowledgments
The authors would like to thank Prof. Ron Bianchini and Aykut Dengi (CMU) for helpful discussions and Dr. John Cohn (IBM) for providing some of the circuits. This work is supported in part by the Intel Corporation and the Semiconductor Research Corporation.

References
[1] A. Domic, "Layout synthesis of MOS digital cells", Proceedings of the IEEE/ACM Design Automation Conference, June 1990, pp. 241-5.
[2] S. Chakravarty, X. He, S.S. Ravi, "Minimum Area Layout of Series-Parallel Transistor Networks is NP-Hard", IEEE Transactions on Computer-Aided Design, Vol. 10, No. 7, July 1991, pp. 943-949.
[3] T. Lengauer and R. Muller, "Linear Algorithms for Optimizing the Layout of Dynamic CMOS Cells", IEEE Transactions on Circuits and Systems, Vol. 35, No. 3, March 1988, pp. 279-285.
[4] C.T. McMullen and R.H.J.M. Otten, "Minimum Length Linear Transistor Arrays in MOS", IEEE International Symposium on Circuits and Systems, August 1988, pp. 1783-1786.
[5] H.Y. Chen and S.M. Kang, "iCOACH: A Circuit Optimization Aid for CMOS High-Performance Circuits", Proceedings of the IEEE/ACM International Conference on Computer-Aided Design, November 1988, pp. 372-375.
[6] S. Chakravarty, X. He, S.S. Ravi, "On Optimizing nMOS and Dynamic CMOS Functional Cells", IEEE International Symposium on Circuits and Systems, Vol. 3, May 1990, pp. 1701-4.
[7] S. Wimer, R.Y. Pinter, J.A. Feldman, "Optimal Chaining of CMOS Transistors in a Functional Cell", IEEE Transactions on Computer-Aided Design, Vol. CAD-6, September 1987, pp. 795-801.
[8] R.L. Maziasz and J.P. Hayes, Layout Minimization of CMOS Cells, Kluwer Academic Publishers, Boston, 1992.
[9] A. Stauffer and R. Nair, "Optimal CMOS cell transistor placement: a relaxation approach", Proceedings of the IEEE/ACM International Conference on Computer-Aided Design, November 1988, pp. 364-7.
[10] C.Y. Hwang, Y.C. Hsieh, Y.-L. Lin, Y.-C. Hsu, "An Efficient Layout Style for Two-Metal CMOS Leaf Cells and its Automatic Synthesis", IEEE Transactions on Computer-Aided Design, Vol. 12, No. 3, March 1993, pp. 410-424.
[11] N.H.E. Weste and K. Eshraghian, Principles of CMOS VLSI Design, Addison-Wesley Publishing Company, 1993.
[12] T. Uehara and W.M. vanCleemput, "Optimal Layout of CMOS Functional Arrays", IEEE Transactions on Computers, Vol. C-30, No. 5, May 1981, pp. 305-312.
[13] J.A. Bondy and U.S.R. Murty, Graph Theory with Applications, Elsevier Science Publishing, New York, 1976.
[14] B. Basaran and R.A. Rutenbar, "An O(n) algorithm for transistor stacking with performance constraints", Proceedings of the IEEE/ACM Design Automation Conference, June 1996.
[15] A. Nijenhuis and H.S. Wilf, Combinatorial Algorithms, Academic Press, New York, 1975.
[16] Y. Tsujihashi, et al., "A High-Density Data-Path Generator with Stretchable Cells", IEEE Journal of Solid-State Circuits, Vol. 29, No. 1, Jan. 1994, pp. 1-8.
[17] A. Hashimoto and J. Stevens, "Wire routing by optimizing channel assignment within large apertures", Proceedings of the IEEE/ACM Design Automation Conference, June 1971, pp. 155-69.
[18] S. Kirkpatrick, C.D. Gelatt, and M.P. Vecchi, "Optimization by Simulated Annealing", Science, November 1983, pp. 671-680.
[19] C. Sechen and K. Lee, "An Improved Simulated Annealing Algorithm for Row-Based Placement", Proceedings of the IEEE/ACM International Conference on Computer-Aided Design, 1987, pp. 478-481.
[20] E.S. Ochotta, L.R. Carley, and R.A. Rutenbar, "Analog Circuit Synthesis for Large, Realistic Cells: Designing a Pipelined A/D Converter with ASTRX/OBLX", IEEE Custom Integrated Circuits Conference, May 1994, pp. 15.4/1-4.
[21] C.C. Su and M. Sarrafzadeh, "Optimal gate-matrix layout of CMOS functional cells", Integration, the VLSI Journal, Vol. 9, 1990, pp. 3-23.
[22] B. Guan and C. Sechen, "An area minimizing layout generator for random logic blocks", Proceedings of the IEEE Custom Integrated Circuits Conference, May 1995, pp. 23.1/1-4.
[23] E. Charbon, E. Malavasi, D. Pandini, A. Sangiovanni-Vincentelli, "Imposing Tight Specifications On Analog ICs through Simultaneous Placement and Module Optimization", Proceedings of the IEEE Custom Integrated Circuits Conference, May 1994, pp. 525-28.
[24] B. Basaran, R.A. Rutenbar and L.R. Carley, "Latchup-Aware Placement and Parasitic-Bounded Routing of Custom Analog Cells", Proceedings of the IEEE/ACM International Conference on Computer-Aided Design, November 1993, pp. 415-421.
A FAST TECHNIQUE FOR TIMING-DRIVEN PLACEMENT RE-ENGINEERING
Moazzem Hossain, Bala Thumma, and Sunil Ashtaputre
Compass Design Automation, San Jose, CA 95131
moazzem@compass-da.com, balat@compass-da.com, sunila@compass-da.com
ABSTRACT
As VLSI technology advances rapidly, the complexity and size of layout problems are also increasing. As a result, after the initial design cycle, it is becoming almost impossible to get a good quality layout satisfying all the design constraints. Therefore, in a typical design flow, if the design has some problem or some of the initial constraints are not met after the initial design cycle, the designer may alter the design specification by changing some of the constraints and/or by adding or deleting some of the modules and/or nets from the design. These changes are known as engineering changes. However, due to the inherently long turnaround time of the placement and routing phase, it is not desirable to redo the placement and routing from scratch on the altered design. This calls for a fast layout tool, i.e., a layout re-engineering tool, to handle the engineering changes. In this paper, we study the problem of timing-driven placement re-engineering. We develop a design-style independent fast algorithm for placement re-engineering. The algorithm has been implemented in C++ and initial experimental results on industrial designs show the effectiveness of our algorithm.
1. INTRODUCTION
A circuit consists of a set of modules and a netlist specifying the interconnections between the modules. The VLSI layout problem is to realize the circuit on a minimum layout surface while meeting the circuit specifications. High performance and minimum layout area are the two most important objectives for a VLSI design layout problem. Traditionally, due to the inherent complexity of the problem, the VLSI layout problem of a circuit has been divided into two sequential phases: placement and routing. The objective of the placement problem is to map the modules of a circuit onto a layout surface while optimizing certain objective functions. Once the placement is complete, the routing problem is to realize the interconnections between the modules according to the specifications. The region of the layout area which is not occupied by modules can be used for routing.
In a typical design flow, if the circuit has some problem or some of the initial constraints are not met after the initial design cycle, the circuit designer may alter the design specification. These changes are referred to as engineering changes. The process of generating a new design that incorporates the engineering changes is referred to as design re-engineering [4]. There are two forms of engineering changes: (1) structural changes, such as adding and deleting modules and nets from the circuit; (2) specification changes, such as changes in the timing constraints. In this paper, we address
the problem of placement re-engineering. The input to the placement re-engineering is a completely placed circuit and the engineering changes. The output is a placed circuit that incorporates all the structural changes and the maximum number of specification changes. In an ideal situation, all the specification changes will be met in the final placement. However, if some of the specification changes are not met, the designer may make further engineering changes to meet all the design specifications.
In [4], the timing-driven placement re-engineering problem for regular architectures, such as FPGAs, has been addressed. In [2], an algorithm for minimizing the disturbance on the original placement while handling placement engineering changes is presented. Engineering changes during circuit simulation have been presented in [3]. In this paper, we address the problem of timing-driven placement re-engineering. The algorithm is design-style independent and works equally well on both standard cell and gate array designs. The nets are assigned weights based on the timing requirements. For each added module in the circuit, the algorithm first computes its target window and places the module within its target window. After all the added modules are placed, the algorithm iteratively moves modules based on the target window of each module to satisfy all the timing constraints. The algorithm has been implemented in C++ and preliminary experimental results show good results for timing-driven layout on both gate-array and standard cell designs.
The remainder of the paper is organized as follows: Section 2 presents the formal definition of the re-engineering problem and some related definitions. Section 3 presents our algorithm. Some preliminary experimental results are presented in Section 4. The conclusion is presented in Section 5.
2. PRELIMINARIES
Engineering changes can be classified into two categories:
1. Structural changes: These changes modify the structure of the circuit. Addition and deletion of modules and nets fall into this category.
2. Specification changes: Changing the timing constraints, adding new timing constraints, or changing the delay of a module are considered specification changes of a circuit.
When a module or a net is deleted from a circuit, it is easy to see that it will not cause an increase in the delay on a constrained signal. Therefore, it is easy to implement the deletion of a module or a net. However, adding new modules or nets to the netlist may cause the violation of
the existing timing constraints. As a result, addition of new modules or nets is more difficult and important in the placement re-engineering problem. We therefore address the problem of adding new modules, nets, and constraints to the circuit.
A circuit consists of a set of modules M = {μ1, μ2, ..., μn}, a set of nets N = {ν1, ν2, ..., νm}, and a set of timing constraints T. The timing constraints are specified by the maximum allowed time from a primary input to a primary output. Associated with each net νi, we define a weight w(νi), a positive integer value. The netlist can be described as a binary relation R ⊆ N × M. A pin of a net ν ∈ N on a module μ ∈ M is represented by (ν, μ) ∈ R. The set of modules connected by a net ν is defined by Mν = {μ ∈ M | (ν, μ) ∈ R}. Similarly, the set of nets with a pin on module μ is defined by Nμ = {ν ∈ N | (ν, μ) ∈ R}.
For a net ν ∈ N, the bounding box Bν of ν is defined as the minimum rectangle enclosing all the pins of ν. The weighted net length of ν, Lν, is estimated by the half-perimeter length of Bν times the weight w(ν). In the remainder of the paper, we use net length to mean weighted net length. Since in most cases the interconnections between the modules are realized by rectilinear wires, the weighted net length of a net can be considered as the sum of the weighted lengths in the x- and y-directions. The weighted lengths in the x- and y-directions are also called x-length and y-length, respectively. For a net ν, if Lν(x) and Lν(y) are the x-length and y-length, respectively, then Lν = Lν(x) + Lν(y). For a net ν and a module μ, if (ν, μ) ∈ R, then we define Bν^μ to be the bounding box of ν without the pin (ν, μ). The total net length is the sum of the lengths of all nets in N.
Let f(x) be a piece-wise linear unimodal function. We define the domain of minimum, [xl, xr], of f(x) as the subset of the domain of f(x) such that f(x) reaches its minimum if and only if xl ≤ x ≤ xr.
In our algorithm, we use the notion of a multiset. Just like a set, a multiset is a collection of elements; however, unlike a set, a multiset can have multiple occurrences of the same element. Let ae and be be the numbers of occurrences of element e in multisets A and B, respectively. The union of two multisets A and B is a multiset C in which the number of occurrences of e is ae + be. The intersection of two multisets A and B is a multiset C in which the number of occurrences of e is min{ae, be}. Let C = A ∩ B and ce be the number of occurrences of element e in C. The operation A − B results in a multiset D in which the number of occurrences of e is ae − ce.
Given a placement of all the modules in M, we are to find an optimal location for a module μ ∈ M. The optimal location for μ is a location on the layout area such that placing μ in that location results in a minimum total net length for all the nets with a pin on μ. A target window Wμ for a module μ is a rectangular region (xl, yb, xr, yt) such that placing μ at (x, y) results in a minimum Lμ if and only if xl ≤ x ≤ xr and yb ≤ y ≤ yt. For μ, if Wμ can be computed, then μ can be placed inside Wμ to minimize the total net length.
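For concreteness, the weighted half-perimeter estimate defined above can be computed as in the following sketch; the struct and function names are illustrative and not taken from the paper.

#include <algorithm>
#include <cstdio>
#include <vector>

// Weighted half-perimeter wire length of a net: bounding box of its pins,
// half-perimeter split into an x-length and a y-length, scaled by the weight.
struct Pin { int x, y; };
struct NetLength { int x_len, y_len; };

NetLength weighted_length(const std::vector<Pin>& pins, int weight) {
    int xmin = pins[0].x, xmax = pins[0].x, ymin = pins[0].y, ymax = pins[0].y;
    for (const Pin& p : pins) {
        xmin = std::min(xmin, p.x); xmax = std::max(xmax, p.x);
        ymin = std::min(ymin, p.y); ymax = std::max(ymax, p.y);
    }
    return { weight * (xmax - xmin), weight * (ymax - ymin) };
}

int main() {
    NetLength l = weighted_length({{0, 0}, {4, 1}, {2, 5}}, 3);
    std::printf("Lx=%d Ly=%d L=%d\n", l.x_len, l.y_len, l.x_len + l.y_len);
}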
3. OUR ALGORITHM
The input to our algorithm is the modified netlist for a circuit along with the placement of the modules that are not replaced or modified. The output is the final placement with all the engineering changes incorporated in the design. Based on the timing constraints and timing analysis, we assign weights to nets to meet the timing constraints. Weights are assigned based on the criticality of the nets and how far off the signals are from their timing constraints.
The overall flow of our algorithm is similar to the one presented in [4]; however, we use a target window to place each added module. We develop a very efficient linear time complexity algorithm to compute the target window for each module to be placed. The target window gives a region in which placing the module will result in a minimum total weighted net length for all the nets with a pin on the module. The target window is also used to iteratively improve the placement to meet the timing requirements. Figure 1 gives an overall flow of our algorithm.
The target window based initial placement of new modules places the modules within the target window while minimizing the number of overlaps. However, in a very high-occupancy design, there may be some overlaps of modules after the initial placement. The second phase of our algorithm iteratively moves the overlapping modules, based on the target window of each module to be moved, to remove the overlaps.
Figure 1. The main flow of our algorithm.
Once the overlaps are removed, we iteratively do the timing analysis and improve the placement by moving modules based on the target window. At each step we assign weights to nets based on the timing analysis. As can be noted from the above discussion, target window computation is the key to our algorithm. For each added module we compute its target window to place the module. The target window is also used to iteratively improve the placement to remove overlaps and meet the timing constraints. We now present the detailed algorithm to compute the target window for a module.
3.1. Target Window Computation
For a given module μ ∈ M, we are interested in computing its target window Wμ = (xl, yb, xr, yt). The target window is computed using the nets with a pin on μ. The x-span and y-span of the window are (xl, xr) and (yb, yt), respectively. The x-span is computed using the x-lengths; similarly, the y-span is computed using the y-lengths. It is easy to see that the two can be computed independently. We only present the computation of the x-span of the target window; the computation of the y-span is similar. Let k (k ≥ 1) be the number of nets with a pin on μ. Then Nμ = {ν1^μ, ..., νk^μ}. Also let w(νi^μ) ≥ 1 be the weight and Bνi^μ = (li, bi, ri, ti) be the bounding box of νi^μ without
the pin (νi^μ, μ). Let the placement of all the modules in M − {μ} be fixed; we need to find a placement location for μ. Let Lx(νi^μ, μ) be the weighted length function for net νi^μ in the x-direction with respect to the placement location of μ. It is easy to see that Lx(νi^μ, μ) is a piece-wise linear unimodal function. Then the domain of minimum of Lx(νi^μ, μ) is [li, ri]. If Lx(μ) is the x-length function for all the nets with a pin on μ, then Lx(μ) = Σi=1..k Lx(νi^μ, μ). Since the sum of piece-wise linear unimodal functions is also a piece-wise linear unimodal function, Lx(μ) is a piece-wise linear unimodal function. Therefore, the domain of minimum of Lx(μ) gives the x-span of the target window of μ. Figure 2 gives an example of the length functions Lx(ν, μ1) and Lx(μ1) for a module μ1 and one net: three modules μ1, μ2, and μ3 are connected by a single net. Also note that W1 is the target window for μ1.
Based on the above discussion, the x-span of the target window of a module μ ∈ M can be computed by a straight-forward algorithm: first compute the functions Lx(νi^μ, μ) for 1 ≤ i ≤ k, then algebraically add the Lx(νi^μ, μ) and compute the domain of minimum of the resulting function. However, the time complexity of such an algorithm is O(k³), where k is the total number of nets with a pin on μ. We make use of the piece-wise linear unimodality of Lx(νi^μ, μ) to develop our linear time complexity algorithm.
For Lx(νi^μ, μ) with domain of minimum [li, ri] and for a point (x, y), we say that the domain of minimum of Lx(νi^μ, μ) is on the left side of x if ri < x, and the domain of minimum of Lx(νi^μ, μ) is on the right side of x if x < li. Otherwise, we say that x is in the domain of minimum of Lx(νi^μ, μ). We now prove the following important lemma:
Figure 2. Example of length functions and target window.
Lemma 1 Given a module μ ∈ M, let Nμ = {ν1^μ, ..., νk^μ}. Let w(νi^μ) ≥ 1 be the weight of net νi^μ. If, for 1 ≤ i ≤ k, Lx(νi^μ, μ) is the function of the x-length of νi^μ with respect to the placement of μ, with domain of minimum [li, ri], let S = ∪i=1..k ∪j=1..w(νi^μ) {li, ri} be a multiset. Also let Lx(μ) = Σi=1..k Lx(νi^μ, μ) with domain of minimum [l, r]. Then l ∈ S and r ∈ S.
Proof: We only show that if l ∉ S then [l, r] is not the domain of minimum of Lx(μ). Analogously, it can be shown that r ∈ S. Let us assume that l ∉ S. There must exist a net νi ∈ Nμ and a net νj ∈ Nμ such that li < l and l < rj. If no such nets exist in Nμ, then the domain of minimum of every Lx(νi^μ, μ) is either to the right side or to the left side of l. In that case, moving l either to the right or to the left, towards the domain of minimum of Lx(νi^μ, μ), will result in an overall net length reduction. That contradicts the fact that l is the left boundary of the domain of minimum of Lx(μ).
Let k1 be the total number of nets of total weight w1 with domain of minimum to the left of l and k2 be the total number of nets of total weight w2 with domain of minimum to the right of l. Also let k3 be the number of nets with total weight w3 for which l is within the domain of minimum. Also let xi ∈ S be the largest value in S such that xi < l and xr ∈ S be the smallest value in S such that xr > l. If w1 > w2, then for any value l' with xi < l' < l, Lx(μ) will decrease, so the assumption that l is the boundary of the domain of minimum of Lx(μ) is not true. Similarly, if w1 < w2 then for any value l' with l < l' < xr, Lx(μ) will decrease, so l is not in the domain of minimum of Lx(μ). Therefore, w1 must be equal to w2. Since w1 = w2, the value of Lx(μ) is the same at l and at xi. Now for any value l' < xi, the net length of the k2 nets will increase by w2, the net length of the k1 − 1 nets will decrease by w1 − 1, and the net length of the k3 + 1 nets will remain the same. Therefore, Lx(μ) will increase. Thus xi is the boundary of the domain of minimum of Lx(μ). Therefore, l must be equal to xi; as a result l ∈ S. Similarly, we can show that r ∈ S.
The algorithm, TARGET-WINDOW, presented below is based on the above lemma. The input to the algorithm is the module μ for which the target window should
be computed. For the k nets (k ≥ 1) connected to μ, their domains of minimum with respect to the placement of μ are computed first. Multisets Sx and Sy are constructed from the boundaries of the domains of minimum of Lx(νi^μ, μ) and Ly(νi^μ, μ), respectively, for all the nets with a pin on μ. The two median values in Sx and Sy give the domains of minimum of Lx(μ) and Ly(μ). We now formally present the algorithm.
Table 1. Specifications of Benchmark Examples

circuit   #modules   #nets   structure changes   constraint changes
c1        1204       1309    17                  5
c2        6372       7268    15                  10
c3        14115      14930   25                  14
c4        14332      15339   57                  29
c5        34965      36643   151                 12
procedure TARGET-WINDOW(μ)
1. compute Nμ.
2. Let Sx = ∅; Sy = ∅.
3. Let m = 0.
4. for each net νi^μ ∈ Nμ do
   4.1. Let w(νi^μ) be the weight of νi^μ.
   4.2. compute Bνi^μ = (li, bi, ri, ti).
   4.3. for j = 1 to w(νi^μ) do
        4.3.1. Sx = Sx ∪ {li, ri}.
        4.3.2. Sy = Sy ∪ {bi, ti}.
   4.4. m = m + 2·w(νi^μ).
5. Let l = (m/2)th smallest element in Sx.
   Let b = (m/2)th smallest element in Sy.
   Let r = (m/2 + 1)th smallest element in Sx.
   Let t = (m/2 + 1)th smallest element in Sy.
6. return (l, b, r, t);
Table 2. Comparison with mincut based placement

            mincut placement         our algorithm
circuit     time (sec)   delay (ns)   time (sec)   delay (ns)
c1          20           48.4         3            49.3
c2          112          75.0         9            74.7
c3          815          152.5        22           155
c4          718          143.9        18           145.1
c5          8631         284.0        105          285.3

As can be seen from the experimental results, our algorithm is extremely fast and produces very good timing-constrained placements.
According to [1], the mth smallest element in Sx (Sy) can be computed in O(m) time. Since the bounding boxes of nets are computed incrementally, we can assume that the bounding box is already computed and stored in the data structure. Thus the overall time complexity of the algorithm TARGET-WINDOW is dominated by the computation of the (m/2)th and (m/2 + 1)th smallest elements in Sx and in Sy. Therefore the time complexity of the algorithm TARGET-WINDOW is O(m) = O(k), where k is the number of nets connected to μ. The following theorem shows that the bounding box (l, b, r, t) computed by the above algorithm TARGET-WINDOW is the target window for the given module μ.
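A direct C++ rendering of this computation might look as follows; it follows the median-selection idea of TARGET-WINDOW, but the data structures, names, and the use of nth_element are our own assumptions rather than the authors' code.

#include <algorithm>
#include <cstdio>
#include <vector>

// Target-window sketch: for each net with a pin on the module, take the
// bounding box of its other pins, repeat each box boundary according to the
// net weight, and select the two middle values in x and in y.
struct Box { int l, b, r, t; };     // bounding box without the module's pin
struct Window { int xl, yb, xr, yt; };

Window target_window(const std::vector<Box>& boxes,
                     const std::vector<int>& weight) {
    std::vector<int> sx, sy;                        // multisets S_x and S_y
    for (std::size_t i = 0; i < boxes.size(); ++i)
        for (int j = 0; j < weight[i]; ++j) {
            sx.push_back(boxes[i].l); sx.push_back(boxes[i].r);
            sy.push_back(boxes[i].b); sy.push_back(boxes[i].t);
        }
    std::size_t m = sx.size();                      // m = 2 * (sum of weights)
    auto kth = [](std::vector<int>& v, std::size_t k) {   // k-th smallest, O(n)
        std::nth_element(v.begin(), v.begin() + k, v.end());
        return v[k];
    };
    Window w;
    w.xl = kth(sx, m / 2 - 1);  w.xr = kth(sx, m / 2);    // median pair in x
    w.yb = kth(sy, m / 2 - 1);  w.yt = kth(sy, m / 2);    // median pair in y
    return w;
}

int main() {
    std::vector<Box> boxes = {{0, 0, 2, 3}, {4, 1, 6, 5}, {1, 2, 5, 4}};
    std::vector<int> weight = {1, 2, 1};
    Window w = target_window(boxes, weight);
    std::printf("x-span [%d,%d], y-span [%d,%d]\n", w.xl, w.xr, w.yb, w.yt);
}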
Theorem 1 The bounding box (l, b, r, t) computed by the algorithm TARGET-WINDOW is the target window for a given module μ. Furthermore, the time complexity of the algorithm TARGET-WINDOW is O(k), where k is the total number of nets with a pin on μ.
Proof: Directly follows from the proof of Lemma 1.

4. EXPERIMENTAL RESULTS
We have implemented our algorithm in C++. We have tested the performance of our algorithm on some industry circuits and compared our results with a constructive placement algorithm that handles the engineering changes. As mentioned earlier, placement re-engineering should be very fast compared to redoing the actual placement from scratch. Our experimental results demonstrate that our algorithm is very fast. It is important to note that the algorithm takes longer on very high-density designs. This is due to the fact that the algorithm has to perform more iterations to unoverlap the modules during the iterative improvement phase. Table 1 gives the specifications of the circuits on which the experiments were conducted. Table 2 gives the comparison between a mincut based placement that incorporates all the engineering changes and our algorithm. The time columns specify the run time for the mincut placement and for our algorithm, and the delay columns specify the delay on the critical path after the final placement.

5. CONCLUSION
In this paper, we have presented a design-style independent algorithm to solve the timing-driven placement re-engineering problem. The algorithm is based on a linear time complexity target window computation for each module to be placed. Initial experimental results on both standard cell and gate array designs show the effectiveness of our algorithm.

REFERENCES
[1] M. Blum, R. Floyd, V. Pratt, R. Rivest, and R. Tarjan, "Time Bounds for Selection," Journal of Computer and System Sciences, vol. 7, pp. 448-461, 1972.
[2] C. Choy and T. Cheung, "An Algorithm to Deal with Incremental Layout Alteration," Proc. 34th Midwest Symposium on Circuits and Systems, vol. 2, pp. 850-853, 1991.
[3] Y. Ju, "Incremental Circuit Simulation and Timing Analysis Techniques," Ph.D. Thesis, University of Illinois at Urbana-Champaign, 1993.
[4] A. Mathur, K. C. Chen, and C. L. Liu, "Re-engineering of Timing Constrained Placements for Regular Architectures," Proc. IEEE Intl. Conf. Computer-Aided Design, pp. 485-490, 1995.
OVER-THE-CELL ROUTING WITH VERTICAL FLOATING PINS
Ines Peters, Paul Molitor¹, Michael Weber²
¹ Computer Science, University Halle, Germany
petersginformatik.uni-halle.de, [email protected]
² Deuretzbacher Research GmbH, Berlin, Germany
[email protected]
ABSTRACT
We present a generalized center terminal cell model and propose over-the-cell routing algorithms utilizing the new feature of the model, i.e., vertical floating pins. First we approach the problem by linear integer programming, and second by dynamic programming. For further channel density reduction, both approaches have been combined with linear channel pin assignment, which has been studied by the authors before. Experimental results show a resulting channel density reduction by about 35% compared to the over-the-cell routing algorithms for center terminal cells proposed in [19] and [26].
1. INTRODUCTION
The channel and over-the-cell routing problems play an important role in the physical design of VLSI circuits. They have been extensively studied in the past ([1], [4], [5], [7], [8], [11], [12], [15], [16], [17], [19], [21], [26], [27], [28]). The routing strategies depend on the underlying cell model. Many cell models have been proposed in the past ([1], [13], [14], [16], [18]¹, [24], [25], [26]).
We distinguish between two major classes of cell models: 1. margin terminal cell models (MTCM), where the cell terminals are located along the top and bottom cell boundaries ([6], [13]), and 2. middle or center terminal cell models (CTCM), where the cell terminals are located within the cell area ([14], [24], [25], [26]). In this paper we deal with CTCM cells. The main problem of CTCM cell design is where exactly to place the terminals. Practical observation in industrial cell design shows that very often the terminal placement allowed in the cell models is more restrictive than necessary. So we present in this paper the routing features - vertical floating pins - of a new generalized CTC model, the PARD cell model ([20]). Note that the PARD cell model has already been applied in two industrial chip projects.
Next we adapt the Over-the-Cell (OTC for short) routing problem to the PARD model and present algorithms that utilize the new features in order to minimize final layout area. We propose algorithms for two different goals: the first is to maximize the number of segments routed over the cell and the second is to minimize the resulting channel density. We approach the problem by linear integer programming and by dynamic programming. Experimental results on benchmark test sets are given at the end of the paper. They show a resulting channel density reduction by about 35% compared to the over-the-cell routing algorithms for center terminal cells proposed by [19] and [26].
¹ There a transparent cell model - TRANCA - was presented. It provides terminal access points in spite of fixed terminal positions. Unfortunately this model is hard to manage by discrete optimization techniques.

2. THE PARD CELL MODEL
When designing a new cell catalogue in a double metal 0.8 micron CMOS technology², we observed that terminals can be looked at as floating in the vertical direction (see Figure 1). Power-ground lines are done in MET1 along the cell borders. POLY and MET1 are used for internal connections. MET2 is preserved for OTC routing. The new routing features of the PARD cell model are the vertical floating pins. They are a sort of prerouting we get immediately from cell design. Their basis are the internal connections in MET1. Where in other models the cell designer decides where to place vias from MET1 to MET2 for OTC routing, we leave this decision to the OTC router itself. Thus MET2 is completely free for OTC routing. Vertical floating pins give the router more flexibility. Figure 2 shows an interesting routing pattern. We have a pin-net ordering of (1, 2, 3, 1, 2, 3). This ordering cannot be routed over the cell in other (two metal layer) cell models for planarity reasons. The "implicit tunneling" of the vertical floating pins obviously enables an OTC routing.
Figure 1. Vertical floating pins of PARD-cells
² Note that we focus on double metal because more metal technologies are still too expensive for smaller ASIC manufacturers. The model presented in [1] uses a three metal layer technology where the third layer is essential for the algorithm proposed.
Figure 2. An interesting routing pattern

3. OTC ROUTING PROBLEM
The global routing phase gives us, for every (vertical floating) pin, the information whether it has to be connected to the top, to the bottom, or to the top and to the bottom of its cell. To ensure routability we introduce "forbidden zones". One zone runs from the desired cell boundary directly to the nearest point of the vertical floating pin, where a via can be placed for connection. This zone is forbidden for all other nets. The OTC routing problem can now be described as follows: find for every net an OTC track which realizes the net such that:
1. the net segment does not cross any forbidden zone of other nets, and
2. the horizontal constraints are fulfilled, i.e., the net segments of different nets do not overlap.
Note that, unlike in [16], the problem cannot be solved by simply applying maximum k-independent set algorithms, because the tracks are not necessarily contiguous and the sets of nets that can be assigned to these tracks are different. In the following we assume two-pin nets. All operations can easily be adapted to multi-pin nets (just allow overlap of segments that belong to the same original net).
Figure 3. Forbidden zones in a cell row

4. LINEAR INTEGER PROGRAMMING
In this section we describe how to use (0,1)-linear integer programming for solving the OTC-routing problem. In the first part we give an integer program in order to maximize the number of nets routed over the cell. In the second part we expand this program in order to minimize the resulting channel density.

4.1. Assigned Nets Maximization
Given a set NET = {n1, ..., np} of nets and a set TRACK = {t1, ..., tq} of (OTC-) tracks. For the following definitions let 1 ≤ i, k ≤ p be net indices and let 1 ≤ j ≤ q be a track index. Let f: NET → 2^TRACK be the function which assigns to each net the set of non-forbidden track numbers; in these tracks the net can be connected to the corresponding vertical floating pins without crossing forbidden zones. We will use the notation
    fij = 1 if tj ∈ f(ni), and fij = 0 otherwise
in the following. From a given OTC-routing problem we can derive the horizontal constraints defined by o: NET × NET → {0, 1} with
    o(ni, nk) = 1 if ni "overlaps" nk and ni ≠ nk, and o(ni, nk) = 0 otherwise.
Once again we will use the notation oik = o(ni, nk). In our integer program we have the following variables:
    vij = 1 if ni is "assigned to" tj, and vij = 0 otherwise,
i.e., vij denotes whether net ni is assigned to track tj. Next we define the constraints of the integer program:
1. Nets can only be assigned to allowed tracks, i.e., vij ≤ fij has to hold for all i, j.
2. Every net can be assigned to at most one track. Thus Σj=1..q vij ≤ 1 has to hold for all i.
3. No two overlapping nets can be assigned to the same track, i.e., vij + oik + vkj ≤ 2 for all i, j, k.
Finally we define the optimization goal, namely
    max Σi=1..p Σj=1..q vij.
Thus the integer program assigns as many nets as possible to OTC-routing tracks. The complexity of a (0,1)-linear integer program is determined by the number of variables and constraints. The presented program consists of pq variables and pq + p + p²q constraints. We can reduce the number of allowed-track variables by setting the variables vij to 0 for fij = 0 and by eliminating vij + oik + vkj ≤ 2 for oik = 0.

4.2. Channel Density Minimization
Given a set NET = {n1, ..., np} of nets and a set TRACK = TRACKout ∪ TRACKotc of tracks with TRACKout = {t1, ..., tm} and TRACKotc = {tm+1, ..., tm+q}. According to the actual OTC-routing problem we can decide to minimize the channel density locally or globally. In the local case the OTC tracks (TRACKotc) are the tracks of one certain cell row and the outer tracks (TRACKout) are the tracks of the two channels below and above this cell row. In the global case we extend the OTC tracks to the tracks of all cell rows of a given standard cell placement; the outer tracks are the tracks of all channels in between these cell rows. Because the number of channel routing tracks needed is not fixed at the beginning of the routing phase, we just preserve "enough" tracks to complete the routing. Our integer program has the goal to "clear" as many outer tracks as possible; in this way we reduce the channel density of the nets not routed over the cells. Let f: NET → 2^TRACK and
    fij = 1 if tj ∈ f(ni), and fij = 0 otherwise,
for all i ∈ {1, ..., p}, j ∈ {1, ..., m+q}. This function assigns to a net the OTC and outer tracks allowed. The overlap function is defined by o: NET × NET → {0, 1} and
    oik = 1 if ni "overlaps" nk and ni ≠ nk, and oik = 0 otherwise,
for all i, k ∈ {1, ..., p}. The variables are given by
    vij = 1 if ni is "assigned to" tj, and vij = 0 otherwise,
for all i ∈ {1, ..., p}, j ∈ {1, ..., m+q}, and
    uj = 1 if tj "is free", and uj = 0 otherwise,
for all j ∈ {1, ..., m}. The v-variables are similar to those of the first integer program. The u-variables indicate whether a certain outer track is free or not. Next we define the constraints:
1. All nets can only be assigned to allowed tracks: for all i ∈ {1, ..., p}, j ∈ {1, ..., m+q}: vij ≤ fij.
2. Every net has to be assigned to exactly one track (OTC or outer): for all i ∈ {1, ..., p}: Σj=1..m+q vij = 1.
3. No two overlapping nets can be assigned to the same track: for all i, k ∈ {1, ..., p}, j ∈ {1, ..., m+q}: vij + oik + vkj ≤ 2.
4. Free outer track condition: the u-variables have to be 0 when at least one net is assigned to the corresponding outer track and they have to be 1 when no net is assigned to this track.
   * Force u-variables to 0: for all i ∈ {1, ..., p} and j ∈ {1, ..., m}: vij + uj ≤ 1.
   * Force u-variables to 1: for all j ∈ {1, ..., m}: Σi=1..p vij + uj ≥ 1.
Finally we define the optimization goal:
    max Σj=1..m uj.
Thus the integer program clears as many outer tracks as possible. The extended program consists of p(m+q) + m variables and p(m+q) + p + p²(m+q) + pm + m constraints. The allowed-track and overlap constraints can be reduced in a similar way as above.
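As an illustration of how the 0-1 program of Section 4.1 can be assembled, the sketch below only builds the objective and constraint rows as plain text, in the already reduced form (disallowed variables are dropped and overlap constraints are emitted only for oik = 1). It does not call an actual ILP solver, and all data structures and names are our own assumptions.

#include <cstdio>
#include <string>
#include <vector>

// Assemble the rows of the assigned-nets-maximization program:
// objective max sum v_ij, at-most-one-track rows, and the reduced
// overlap rows v_ij + v_kj <= 1 (i.e., v_ij + o_ik + v_kj <= 2 with o_ik = 1).
struct OtcInstance {
    int p, q;                              // number of nets, number of OTC tracks
    std::vector<std::vector<int>> f;       // f[i][j] = 1 if track j allowed for net i
    std::vector<std::vector<int>> o;       // o[i][k] = 1 if nets i and k overlap
};

std::vector<std::string> build_rows(const OtcInstance& in) {
    std::vector<std::string> rows;
    std::string objective = "maximize:";
    auto v = [](int i, int j) {
        return "v_" + std::to_string(i) + "_" + std::to_string(j);
    };
    for (int i = 0; i < in.p; ++i) {
        std::string at_most_one;
        for (int j = 0; j < in.q; ++j) {
            if (!in.f[i][j]) continue;                 // v_ij fixed to 0 (reduction)
            objective += " + " + v(i, j);
            at_most_one += (at_most_one.empty() ? "" : " + ") + v(i, j);
            for (int k = i + 1; k < in.p; ++k)         // reduced overlap constraints
                if (in.o[i][k] && in.f[k][j])
                    rows.push_back(v(i, j) + " + " + v(k, j) + " <= 1");
        }
        if (!at_most_one.empty()) rows.push_back(at_most_one + " <= 1");
    }
    rows.insert(rows.begin(), objective);
    return rows;
}

int main() {
    // Two nets, two OTC tracks; net 0 may use both tracks, net 1 only track 1,
    // and the two nets overlap horizontally.
    OtcInstance in{2, 2, {{1, 1}, {0, 1}}, {{0, 1}, {1, 0}}};
    for (const std::string& r : build_rows(in)) std::printf("%s\n", r.c_str());
}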
5. DYNAMIC PROGRAM
In order to speed up the OTC routing we present a dynamic programming approach. Let NET = {n1, ..., np} be a set of nets and TRACK = {t1, ..., tq} a set of (OTC-) tracks. Let f: NET → 2^TRACK and oij be defined as above and let g: TRACK → GRAPHS be a function defined by g(j) = (Vj, Ej) with
    Vj = {nk | j ∈ f(nk)} and Ej = {(ni, nl) | ni, nl ∈ Vj and oil = 1}.
We assign to an OTC track a graph whose nodes represent the set of allowed nets of this track and whose edges represent the horizontal constraints of these nets. As the maximal independent sets are of interest for our routing problem, we consider the sets
    Lj = {Nj | Nj ⊆ Vj, Nj maximal independent set of g(j)}, for all j ∈ {1, ..., q},
in the following. The computation of all maximum independent sets of a graph is known to be NP-hard. Depending on the actual OTC-routing instance, we therefore decide for an exact algorithm or a fast (for instance left-edge based, [21], [28]) heuristic in order to compute the maximum independent sets. Now we can start with our dynamic program: for every OTC track l, starting with the lowest, we compute a set of net sets which can be routed over the cell using the tracks 1 to l. This can be done by combining the sets of track l − 1 with the maximum independent net sets of track l.
M1 := L1
for l := 2 to q do
    Ml := { m | ∃ m1 ∈ Ml−1, ∃ m2 ∈ Ll : m := m1 ∪ m2 }
od

Let S1 ∈ Mq with |S1| of maximum size. Obviously S1 is a solution which assigns the most nets to over-cell tracks. Let S2 ∈ Mq with minimum resulting channel density. So S2 is obviously a solution which assigns nets to over-cell tracks such that the remaining nets have minimum density.
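The recurrence can be written down almost literally; the sketch below encodes each net set as a bitmask and assumes the per-track lists Ll of maximal independent sets are given as input (computing them is the hard part discussed above). The encoding and names are ours, not the authors' Modula-2 implementation.

#include <bitset>
#include <cstdint>
#include <cstdio>
#include <vector>

using NetSet = std::uint64_t;   // bit i set  <=>  net n_i is in the set

// Dynamic program of Section 5: M_1 := L_1 and, for l = 2..q,
// M_l := { m1 union m2 : m1 in M_{l-1}, m2 in L_l }.
std::vector<NetSet> combine_tracks(const std::vector<std::vector<NetSet>>& L) {
    std::vector<NetSet> M = L.front();         // M_1 := L_1
    for (int l = 1; l < (int)L.size(); ++l) {
        std::vector<NetSet> next;
        for (NetSet m1 : M)
            for (NetSet m2 : L[l])
                next.push_back(m1 | m2);       // m := m1 union m2
        M.swap(next);                          // M_l computed
    }
    return M;                                  // = M_q
}

int main() {
    // Two tracks, nets encoded as bits 0..3; pick the combined set that
    // covers the most nets (the solution S1 of the text).
    std::vector<std::vector<NetSet>> L = {{0x3, 0x5}, {0x8, 0xC}};
    NetSet best = 0;
    for (NetSet s : combine_tracks(L))
        if (std::bitset<64>(s).count() > std::bitset<64>(best).count()) best = s;
    std::printf("best set covers %zu nets\n", std::bitset<64>(best).count());
}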
6. EXPERIMENTAL RESULTS
We implemented the proposed algorithms in the Modula-2 language on a PC and combined them with the PDCPA algorithm proposed in [19] as a postprocess. We used the public domain integer linear program solver of Michel Berkelaar (see [22]). The data include the PRIMARY1 and ISCAS benchmarks mapped to PARD cells. Note that PARD cell areas compare favourably to other (industrial) cell models. Layout data have been generated with HULDA ([9], [10]). The lengths of the vertical floating pins vary from 4 to 10 tracks; most of them have a length of 6 or 7 tracks. The experimental results are summarized in Table 1. The circuit names are given in the first column and the channel names in the second. (A number of n channels for a circuit indicates an n + 1 cell row placement.) Nets not routed over the cell have been routed in channels between the cell rows. We achieved densities D1, D2 and D3 for these channels. D1 results from the PDCPA OTC routing, D2 from linear integer programming in order to maximize the number of segments routed over-the-cell combined with PDCPA, and D3 from dynamic programming in order to minimize the resulting channel density combined with PDCPA. Note that the D2 results are in some cases better than the D3 results; in these cases the "preconditions" for PDCPA channel density minimization after net segment assignment seem to be better. It will be part of future work to investigate this relationship in more detail. The improvements are given in percent. The D3 results for PRIMARY1 were achieved by a heuristic (left-edge based) approach; the other examples could be computed with an exact algorithm. We achieved an average improvement by about 35% in both assigned track maximization and in remaining nets density minimization. The density minimization led to slightly better results. The linear integer program needed a run time of some seconds. The dynamic program took a bit longer, but less than ten minutes in all cases. We also tried the global minimum density integer programming approach (for the smaller examples). The results were in fact the same as for the other approaches, but the running time was much longer, up to one hour. So we would suggest the dynamic programming approach to minimize the resulting density. In future work we will examine the influence of the selection of nets to be routed over the cell on the succeeding channel pin assignment step for further density improvement.

Table 1. Experimental results. For each circuit and each of its routing channels, the table lists the channel densities D1 (PDCPA alone), D2 (linear integer programming combined with PDCPA) and D3 (dynamic programming combined with PDCPA), together with the improvements of D2 and D3 over D1 in percent.
REFERENCES
[1] S. Bhingarde, R. Khawaja, A. Panyam, N. Sherwani, "Over-the-Cell Routing Algorithms for Industrial Cell Models," Proc. 7th International Conference on VLSI Design, pp. 143-148, Jan. 1994.
[2] Y. Cai, D.F. Wong, "Minimizing Channel Density by Shifting Blocks and Terminals," Proc. ICCAD, pp. 524-527, 1991.
[3] Y. Cai, D.F. Wong, "On Shifting Blocks and Terminals to Minimize Channel Density," to appear in IEEE Trans. on CAD, 1992.
[4] Y. Cai, D.F. Wong, "Optimal Channel Pin Assignment," ICCAD, to appear in IEEE Trans. on CAD, 1990.
[5] H.H. Chen, E.S. Kuh, "Glitter: A Gridless Variable-Width Channel Router," IEEE Trans. on CAD, Vol. CAD-5, No. 4, pp. 459-465, Oct. 1986.
[6] J. Cong, B. Preas, C.L. Liu, "General Models and Algorithms for Over-the-Cell Routing in Standard Cell Design," Proc. 27th DAC, pp. 709-715, 1990.
[7] S.C. Fang, W.S. Feng, S.L. Lee, "A New Efficient Approach to Multilayer Channel Routing Problem," Proc. 29th DAC, pp. 579-584, 1992.
[8] T. Fujii, Y. Mima, T. Matsuda, T. Yoshimura, "A Multi-Layer Channel Router with New Style of Over-the-Cell Routing," Proc. 29th DAC, pp. 585-588, 1992.
[9] T. Hecker, I. Peters, B. Wartenberg, M. Weber, "An Integrated Synthesis Tool for the Generation of Space Efficient Standard Cell Layouts," Proc. 37th IWK, Ilmenau, Sept. 1992.
[10] T. Hecker, I. Peters, M. Weber, "A New Integrated Approach to Global and Detailed Routing of Standard Cell Layouts," Dagstuhl Seminar 9343: Combinatorial Methods for Integrated Circuit Design, Oct. 1993.
[11] T.T. Ho, "New Models for Four- and Five-Layer Channel Routing," Proc. 29th DAC, pp. 589-593, 1992.
[12] C.Y. Hou, C.Y.R. Chen, "A Pin Permutation Algorithm for Improving Over-the-Cell Channel Routing," Proc. 29th DAC, pp. 594-599, 1992.
[13] N. Holmes, N. Sherwani, M. Sarrafzadeh, "Algorithms for Three-Layer Over-the-Cell Channel Routing," Proc. ICCAD, pp. 428-431, 1991.
[14] E. Katsadas, E. Kinnen, "A Multi-Layer Router Utilizing Over-Cell Areas," Proc. 27th DAC, pp. 704-707, 1990.
[15] M.S. Lin, H.W. Perng, C.Y. Hwang, Y.L. Lin, "Channel Density Reduction by Routing Over the Cells," IEEE Trans. on CAD, Vol. 10, No. 10, Aug. 1991.
[16] S. Madhwapathy, N. Sherwani, S. Bhingarde, A. Panyam, "A Unified Approach to Multilayer Over-the-Cell Routing," Proc. 31st DAC, pp. 182-187, 1994.
[17] S. Natarajan, N. Sherwani, N.D. Holmes, M. Sarrafzadeh, "Over-the-Cell Channel Routing For High Performance Circuits," Proc. 29th DAC, pp. 600-603, 1992.
[18] M. de Oliveira Johann, R.A. da Luz Reis, "A Full Over-the-Cell Routing Model," Proc. VLSI'95, pp. 845-850, 1995.
[19] I. Peters, "Priority Driven Channel Pin Assignment," Proc. 5th GLS VLSI, pp. 132-135, 1995.
[20] I. Peters, M. Weber, "The PARD-Standard Cell Model," to appear.
[21] B. Preas, M. Lorenzetti, ed., "Physical Design Automation of VLSI Systems," The Benjamin/Cummings Publishing Company, Inc., Menlo Park, CA, 1988.
[22] M.J. Saltzman, "Mixed Integer Programming Survey," OR/MS Today, pp. 42-51, April 1991.
[23] C. Sechen, A. Sangiovanni-Vincentelli, "The TimberWolf Placement and Routing Package," IEEE Journal of Solid-State Circuits, vol. sc-20, no. 2, April 1985.
[24] M. Terai, K. Takahashi, K. Nakajima, K. Sato, "A New Model for Over-The-Cell Channel Routing with Three Layers," Proc. ICCAD, pp. 432-435, 1991.
[25] M. Tsuchiya, T. Koide, S. Wakabayashi, N. Yoshida, "A Three-Layer Over-the-Cell Channel Routing Method for a New Cell Model," Proc. ICCAD, pp. 432-435, 1991.
[26] T. Wang, D.F. Wong, Y. Sun, C.K. Wong, "On Over-the-Cell Channel Routing," Proc. EuroDAC, pp. 110-115, 1993.
[27] B. Wu, N. Sherwani, N.D. Holmes, M. Sarrafzadeh, "Over-the-Cell Routers for New Cell Model," Proc. 29th DAC, pp. 604-607, 1992.
[28] T. Yoshimura, E.S. Kuh, "Efficient Algorithms for Channel Routing," IEEE Trans. on CAD, Vol. CAD-1, No. 1, pp. 25-35, Jan. 1982.
Congestion-Balanced Placement for FPGAs*
Yachyang Sun
Altera Corporation
Rajesh Gupta, C. L. Liu
Department of Computer Science, University of Illinois at Urbana-Champaign

Abstract
In this paper, we propose to use routing congestion as a criterion in solving the FPGA placement problem. Based on the notion of congestion balance, we propose a placement algorithm that spreads out routing congestion evenly while minimizing the total wire length. This algorithm is based on the min-cut strategy and balances the numbers of interconnections in the two portions during each application of the min-cut partitioning procedure. The algorithm also minimizes the actual number of nets crossing horizontal and vertical cut-lines instead of using an estimated number of interconnects, as was commonly done. We also propose an alternative way of employing the cut-line that allows the examination of a larger number of solutions to enhance the quality of the final solution. It also obviates the need for conventional terminal propagation techniques that are inherently less accurate in estimating net congestion. Based on our experiments, we propose a novel ordering according to which the cut-lines are applied. The ordering is related to the distance of a cut-line from the center of the chip: the cut-line closer to the center is applied earlier. This avoids the occurrence of a cut-line that has a large cut size and is close to the center of the layout area (as often happens in a traditional top-down min-cut based placement). Experimental results show a reduction of up to 23% in maximum cut size and wire congestion, compared with the traditional min-cut placement algorithm.
1 Introduction
In most CAD tools for FPGAs the placement stage has traditionally employed algorithms originally developed for other technologies such as printed circuit boards and custom integrated circuits. Most placement tools use a min-cut algorithm [1, 9, 13] that employs a recursive top-down bipartitioning procedure to minimize the number of interconnections crossing the cut in each recursive step [5, 8]. This algorithm runs fast and is easy to implement. The objective of a min-cut based placement algorithm is to optimize the placement by putting communicating blocks close together. Therefore, the total wire length is the dominant component of the cost function in traditional placement algorithms. However, total wire length alone is not a good metric for architectures with limited routing resources such as FPGAs and Complex Programmable Logic Devices (CPLDs) with a partially connected interconnect matrix. Since the algorithm tries to place connected blocks close together, it is likely to generate a placement with congested areas where a feasible routing is difficult to find, if not impossible at all. We propose an improved min-cut based placement algorithm which not only minimizes the total wire length, but
* This work was done while the first author was with the Department of Computer Science, University of Illinois at Urbana-Champaign.
also spreads out the congestion uniformly across the channels, thus maximizing the possibility of finding a feasible routing for FPGA implementations. Further, most traditional placement algorithms use an approximation scheme to estimate the number of interconnections crossing a cut line. An example is to use the bounding box for a multiterminal net to estimate the probability that the route of the net will cross a cut-line. In contrast, our proposed algorithm minimizes the actual number of interconnections crossing a cut-line, as will be explained in Section 3.2. The proposed algorithm also obviates the need for a terminal propagation step [4] by using a novel procedure to implement the cut lines. Based on the results of our placement and routing experiments, we also propose a sequential order according to which the cut-lines are applied. Such a sequential order significantly reduces the possibility of net congestion near the center of the chip (as it often happens in the traditional top-down min-cut based placement). Finally, the proposed algorithm treats multi-terminal nets directly without dividing them into equivalent two-terminal nets.
2 FPGA Placement Problem
The two-dimensional FPGA layout model proposed in [3, 11, 14] contains an array of logic blocks as shown in Figure 1. Each logic block represented by a square implements logic functions. The terminals of a block are located on the boundaries of the block and are connected to wire segments called terminal segments. Terminal segments are connected to wire segments in the routing channel through switches at their intersections, which are shown as black circles in Figure 1. The switch matrix makes connections between wire segments in horizontal and vertical channels. In the switch matrix, each wire segment can be connected to a subset of the wire segments on the other sides of the matrix. Suppose there is a two-terminal net connecting two logic blocks, one at the lower left corner and the other at the upper right corner. A routing example of this two-terminal
net is shown in bold lines in Figure 1. In commercial FPGA products, routing resource is fixed and fairly limited. The placement problem is especially important in designs using such devices, because a placement is not routable if the number of nets in a channel exceeds the channel capacity. In order to develop an effective placement procedure a metric is needed to measure channel congestion. One way of measuring net congestion is the cut size, that is, the number of nets crossing a cutline in the chip. There are a horizontal cut-lines and b vertical cut-lines in an FPGA with (a + 1) x (b + 1) logic blocks, with each cut-line corresponding to a routing channel. We define the cut size of a cut-line to be the number of nets that have terminals on both sides of the cut-line. The cut size is a lower bound on the number of tracks needed for a complete routing solution. Thus, if the cut size of a
horizontal (vertical) cut-line is larger than the number of vertical (horizontal) tracks, then it is impossible to have a routable design. Moreover, a cut-line cutting through a connection-congested area usually has a large cut size. If the maximum cut size, the cut size that is the largest among all horizontal cut-lines and vertical cut-lines, decreases, the likelihood of connection-congested areas decreases accordingly. Therefore, to avoid the occurrence of connection-congested areas, we seek to minimize the maximum cut size. Note that, as was indicated in [10], the sum of the cut sizes over all a horizontal cut-lines and b vertical cut-lines in an (a + 1) x (b + 1) layout area gives the exact total wire length when the wire length of a multi-terminal net is estimated by half the perimeter of the bounding box. Thus, there is a close correlation between cut sizes and total wire lengths. In fact, the experimental results in Section 4 show that minimizing the maximum cut size produces a placement with a smaller total wire length in most cases, compared with the traditional min-cut based placement algorithm. Also note that the placement problem of minimizing the maximum cut size is NP-complete, since the linear arrangement problem [6] can be reduced to the placement problem when there is only one row or one column of logic blocks.
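Computing these cut sizes for a given placement is straightforward. The sketch below counts, for every vertical cut-line, the nets whose terminals lie on both sides of the cut (horizontal cut-lines are handled symmetrically); the data layout is an assumption for illustration, not part of the paper.

#include <algorithm>
#include <cstdio>
#include <vector>

// For an (a+1) x (b+1) array of logic blocks there are b vertical cut-lines;
// a net crosses cut-line c (between columns c-1 and c) if it has terminals
// in columns < c and in columns >= c.  The maximum over all cut-lines is the
// quantity the placement algorithm tries to minimize.
struct Net { std::vector<int> cols; };   // column of each terminal's block

std::vector<int> vertical_cut_sizes(const std::vector<Net>& nets, int columns) {
    std::vector<int> cut(columns - 1, 0);   // cut[c-1] = size of cut-line c
    for (const Net& n : nets) {
        int lo = *std::min_element(n.cols.begin(), n.cols.end());
        int hi = *std::max_element(n.cols.begin(), n.cols.end());
        for (int c = lo + 1; c <= hi; ++c) ++cut[c - 1];
    }
    return cut;
}

int main() {
    std::vector<Net> nets = {{{0, 3}}, {{1, 2}}, {{2, 3}}};
    std::vector<int> cut = vertical_cut_sizes(nets, 4);
    int max_cut = *std::max_element(cut.begin(), cut.end());
    std::printf("max vertical cut size = %d\n", max_cut);   // prints 2
}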
3 Congestion-Balanced Placement
Placement algorithms based on hierarchical min-cut partitioning [10, 12] have been studied and used extensively in both academia and industry. Given a layout region and a circuit represented by a graph (or hyper-graph), each node representing a logic block and each edge representing a net, the algorithm recursively bi-partitions both the circuit and the region until the graph is simple enough (for example, a graph with one node) to be placed in the region. The objective of each application of the bi-partitioning step is to minimize the cut size subject to the constraint that the sizes (the numbers of nodes) of the two resultant portions are roughly equal. This approach has the advantages of producing placements of good quality, a short running time, and easy implementation. However, the bi-partitioning algorithms based on the Kernighan-Lin algorithm [5, 8] suffer from a lack of control on wire congestion because cut size is the only metric in the cost function to be minimized. Consequently, it is possible to obtain a partition with a small cut size, with one portion being heavily connected and the other being very sparse. In other words, although these two portions have approximately the same number of nodes, the numbers of connections within them can differ significantly. Therefore, such a placement would likely contain unroutable channels. Clearly, for FPGA applications, min-cut based placement algorithms must be modified to take into account not only the sizes of the two portions and the size of the network crossing the cut-line, but also the distribution of interconnections within the two portions. Figure 2 shows an example to illustrate the need for such a modification. A placement obtained by the traditional min-cut based placement algorithm is shown in Figure 2(a). There are sixteen nodes in the network to be placed in a 4 x 4 array. Note that each of the two groups of four nodes connected by six bold dotted edges represents a completely connected 4-node clique. That is, if we label these four nodes by a, b, c, and d, then there are eleven hyper-edges connecting them, which are {a, b}, {b, c}, {c, d}, {a, d}, {a, c}, {b, d}, {a, b, c}, {a, b, d}, {b, c, d}, {a, c, d}, {a, b, c, d}. The traditional min-cut based placement algorithm first divides the network into two portions of equal size, one placed in the left half and the other placed in the right half of the array. Since minimizing the cut size is the only objective in the bi-partitioning process, the first bi-partition obtained is shown in Figure 2(a), where the cut size is one. Then, the two portions obtained in the first bi-partitioning step are bi-partitioned recursively and the placement obtained is that shown in Figure 2(a). The maximum cut size of this placement is eighteen. However, if we consider the distribution of the interconnections, a better placement with a maximum cut size of eleven can be obtained, as shown in Figure 2(b). In this section, we propose a modified min-cut bi-partitioning algorithm that not only balances the size of the two portions, but also evenly distributes the connections among them.

3.1 Congestion-balanced Bi-partitioning
In order to quantify the effect of congestion unbalance between the two portions of a given bi-partition, we define the following terms. Multi-terminal nets are represented by a hyper-graph model. For instance, consider a seven-terminal net shown in Figure 3. The cut-line currently used is shown as the dotted line. Suppose the bi-partitioning result is such that there are three terminals in the left portion and four terminals in the right portion. This net will contribute a count of one to the cut set. Moreover, we know that two connecting paths are needed to connect those three terminals in the left portion and three connecting paths are needed in the right portion. In general, max{k − 1, 0} connecting paths are needed to connect k terminals. We define the unbalancing number of a net to be the number of connecting paths needed to connect all the terminals in the left portion minus the number of connecting paths needed in the right portion. The unbalancing number of a bi-partition is defined to be the sum of the unbalancing numbers of all nets. The absolute value of the unbalancing number of a bi-partition counts the difference between the numbers of connecting paths needed in the left portion and the right portion. Therefore, it is a good measure of the unbalancing situation in a bi-partition. Given an initial bi-partition, we can compute its unbalancing number in O(|T|) time by examining all the nets, where T is the set of all terminals. Without loss of generality, we assume that there are more interconnecting paths in the left portion than in the right portion, that is, the unbalancing number of this bi-partition is positive. If node v is moved from the left portion to the right portion, then f(v, e), the amount by which the unbalancing number of the net e decreases, is equal to 1 if either the number of terminals belonging to net e in the left portion is equal to 1 (shown in Figure 4(a)) or there is no terminal of net e in the right portion (shown in Figure 4(b)), and 2 otherwise (shown in Figure 5); and the gain in reducing the unbalancing number of the bi-partition is defined as
F(v) = Σ_{e ∋ v} f(v, e).    (1)
If node v is moved from one portion to the other, it takes only O(|N(v)|) time to update the values of the gain function that change, by examining N(v), the set of all neighbors of v. When a node v in one portion is swapped with a node u in the other portion, if there is no edge connecting these two nodes, the gain in reducing the unbalancing number of the bi-partition is the sum of F(v) and F(u). If there are edges connecting both nodes, then for each such edge e we subtract the amount f(u, e) + f(v, e) from the sum of F(v) and F(u), since the swap in fact contributes no reduction in the unbalancing number of the net e. The following cost function is used to incorporate the effect of congestion distribution in a given partition:
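As a concrete illustration, the following minimal C sketch (hypothetical data layout, not the authors' implementation) computes the unbalancing number of a bi-partition and the per-net decrease f(v, e) according to the rule stated above; left[e] and right[e] hold the number of terminals of net e currently in each portion.

```c
#include <stdio.h>

static int paths(int k) { return k > 1 ? k - 1 : 0; }   /* max{k-1, 0} */

/* unbalancing number of the bi-partition: sum over all nets */
static int unbalance(const int *left, const int *right, int nets)
{
    int u = 0;
    for (int e = 0; e < nets; e++)
        u += paths(left[e]) - paths(right[e]);
    return u;
}

/* f(v,e): decrease of net e's unbalancing number when one terminal of e
 * (node v) is moved from the left (heavier) portion to the right portion,
 * following the rule stated in the text */
static int f_move(int left_e, int right_e)
{
    if (left_e == 1 || right_e == 0) return 1;
    return 2;
}

int main(void)
{
    int left[]  = {4, 2};                /* toy bi-partition, two nets */
    int right[] = {3, 0};
    printf("unbalance = %d\n", unbalance(left, right, 2));
    printf("f(v,e) for net 0 = %d\n", f_move(left[0], right[0]));
    return 0;
}
```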
CUT_SIZE + WEIGHT × UNBALANCING_NUMBER,

where WEIGHT is a constant. If WEIGHT is set to zero, the algorithm is the same as the conventional min-cut bi-partitioning algorithm. By setting the value of WEIGHT appropriately, we can control the importance of balancing the congestion. For the example in Figure 2, if WEIGHT is set greater than a certain threshold, the better placement in (b) is obtained. Note that a linear combination of cut size and unbalancing number is chosen to allow incremental computation of the cost function as new bi-partitioning solutions are constructed. It should be noted that a new definition of the unbalancing number of a bi-partition is needed if the ratio of the sizes of the two resultant portions is required to be r, where r is not equal to one. Suppose we require that the left portion have x nodes and the right portion have y nodes such that x/y = r. Given a bi-partition and a net e, if the number of connecting paths needed to connect all the terminals of net e in the left (right) portion is a (b), then the normalized unbalancing number of the net e is defined to be
a · (x + y)/(2x) − b · (x + y)/(2y).

That is, we normalize the number of connecting paths needed to connect all the terminals in the left (right) portion by multiplying it by the ratio of half of the number of nodes to the number of nodes in the left (right) portion. The unbalancing number of the net is defined to be the resultant value for the left portion minus the resultant value for the right portion.
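The following small sketch (names and the use of double precision are illustrative assumptions, not from the paper) shows the cost function and the normalized per-net unbalancing number described above.

```c
#include <stdio.h>

/* cost = CUT_SIZE + WEIGHT * UNBALANCING_NUMBER */
static double partition_cost(int cut_size, int unbalancing, double weight)
{
    return (double)cut_size + weight * (double)unbalancing;
}

/* a = connecting paths needed in the left portion, b = in the right portion;
 * x, y = required numbers of nodes in the left and right portions (x/y = r) */
static double normalized_unbalance(int a, int b, int x, int y)
{
    double half = 0.5 * (x + y);            /* half of the number of nodes   */
    return a * half / x - b * half / y;     /* normalize each side, subtract */
}

int main(void)
{
    printf("cost = %.1f\n", partition_cost(12, 4, 0.5));
    printf("normalized unbalance = %.2f\n", normalized_unbalance(3, 2, 6, 4));
    return 0;
}
```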
3.2 Slicing Line

In conventional min-cut based placement, the layout region is bi-partitioned in a top-down fashion in accordance with a recursive application of min-cut bi-partitioning. A frequently used strategy is to bi-partition the region into two halves of equal size in alternating directions, as shown in Figure 6 [10], where the cut-lines depict the sequence in which they are applied. In this figure, dashed cut-lines are vertical cut-lines and solid cut-lines are horizontal cut-lines. Such a hierarchical bi-partitioning process suffers from the problem of external pin connections [4], as explained next. Suppose the layout region is first partitioned vertically into two equal parts such that each part will accommodate one portion of the bi-partitioned circuit. Then, for these two regions, the bi-partitioning process is repeated, except that horizontal cut-lines are now applied. Suppose that bi-partitioning is first carried out for the right region. The bi-partitioning result for the right region should be taken into account during the bi-partitioning process for the left region, because the signals entering the logic blocks in the left region from the logic blocks in the right region might affect where each logic block in the left region should be placed just as much as the internal connections among the logic blocks in the left region. In [4], a technique called terminal propagation was proposed. In this approach, all nodes to the right of the vertical cut-line are projected horizontally onto the cut-line, as shown in Figure 7. We then process the left region, regarding these projected nodes as fixed nodes, i.e., nodes that are not allowed to be moved in the bi-partitioning process. This terminal propagation technique has the advantage of taking the bi-partitioning result of the right region into account. However, it suffers from two disadvantages. First, the left and right regions are processed in sequence, and there is no clear reason as to which region should be processed first. Second, a net in the cut set computed using the terminal propagation technique does not necessarily cross the cut-line, as explained below. Consider a two-terminal net with
one terminal connected to a logic block placed in the lower right region shown in Figure 8(a). Assume that the other terminal is connected to a logic block placed in the upper left region. After processing the right region and projecting the terminal, this net contributes a count of one to the cut size of the left horizontal cut-line. However, there are two possible routes for this net, and it does not cross that cut-line if the routing in Figure 8(c) is chosen. Therefore, counting this net in the cut set of the left horizontal cut-line is not accurate. One common way of addressing this problem is to assume even probabilities for all cut-lines that overlap with the bounding box of the net [7]. That is, in our example of Figure 8, the probability of the net crossing each horizontal cut-line is one half, so one half of a net is contributed to the cut size of each cut-line. We solve this problem by considering only cut-lines that span the width or the height of the layout region. To distinguish cut-lines that cut through the entire layout area from the cut-lines used in the traditional min-cut placement algorithm, they are referred to as slicing lines. After the first vertical slicing line is applied, we apply one horizontal slicing line, instead of two half horizontal cut-lines in sequence. During the bi-partitioning process, we consider movements of nodes in both the left and the right regions at the same time. We honor the bi-partitioning result for the first vertical slicing line by prohibiting any movement of nodes across it. It is not hard to see that the cut set obtained in this way consists of exactly those nets that have terminals on both sides of the slicing line.

3.3 Slicing sequence

Based on the observation mentioned previously, slicing lines which cut through the entire layout region are used in our algorithm. There are a horizontal slicing lines and b vertical slicing lines in an FPGA with (a + 1) x (b + 1) logic blocks. The sequence in which the slicing lines are applied plays an important role. The conventional hierarchical min-cut placement always chooses the cut-line at the center of the region, and such a choice is carried out recursively. If we order the slicing lines in this sequence, then Figure 9 (a)-(d) depicts the sequence of slicing lines to be applied, where the slicing lines currently applied are shown as dashed lines and the slicing lines applied earlier are shown as solid lines. Assume that the current slicing line l, shown as the dashed line in Figure 10, is immediately next to the center slicing line, where the remaining solid lines are the slicing lines applied earlier. Slicing line l will cut four regions and their corresponding sub-networks into halves. Note that in each move of this min-cut bi-partitioning, we can swap a pair of nodes such that one is in A and the other is in E, one is in B and the other is in F, one is in C and the other is in G, or one is in D and the other is in H. However, since we honor the bi-partitioning results for the slicing lines applied earlier, any swap of nodes that crosses a slicing line applied earlier is not allowed. The number of nodes that reside in a region is proportional to the area of the region. Since the slicing line l is immediately next to the center slicing line, the four regions that l cuts through are fairly small. Therefore, the numbers of nodes placed in regions A-H are small.
Consequently, the number of possible pairs of nodes that we can choose from for a move in this bi-partitioning process is limited. This usually results in a relatively large cut size for the slicing lines close to the center, due to the small number of possible moves. In particular, for the two horizontal and vertical slicing lines immediately next to the center, a larger cut size is usually observed. Based on this observation, we propose another sequence in which the slicing lines are applied, to reduce the chance of congestion near the center. The horizontal slicing lines and vertical slicing lines are applied alternately as before. However, for slicing lines of the same orientation, those that are closer to the center are applied earlier, as shown in Figure 11, where horizontal slicing lines are not shown for clarity. We compare the vertical slicing lines, denoted by l, which are immediately next to the center in Figure 12 (a) and (b). Since the area of a region is proportional to the number of nodes that reside in it, we can compute the ratio of the number of possible pairs that we can choose from for a move in (a) to that in (b). By computing the areas of the regions, we obtain the ratio 2·(9×3 + 3×1) : 4·(2×2) = 60 : 16. There is a substantially larger number of possible pairs of nodes to choose from for a move in (a) than in (b). Therefore, the cut size for a slicing line close to the center is expected to be smaller if the sequence in (a) is adopted instead of that in (b).

4 Experimental Results

Our congestion-balanced min-cut placement algorithm was implemented in the C language and run on a SPARC-10 Sun workstation. We used eight MCNC circuits to test the efficiency of our algorithm and compared the results with those obtained by the traditional min-cut algorithm, in which congestion balance and the other features proposed in this paper are not taken into consideration. We used Xilinx 3000-series chips as the target FPGAs. The circuits were first synthesized and transformed into the XNF (Xilinx Netlist Format) format. Then, the circuits were mapped into the 3000-series logic blocks and the results were used as inputs to the placement program. The circuits are listed in order of increasing number of logic blocks in Table 1. The first column of Table 1 shows the dimensions of the FPGA chip on which each circuit is implemented. The second, third, and fourth columns show the number of blocks, nets, and terminals in each circuit, respectively. Note that the number of blocks shown in the second column is the total number of I/O blocks and logic blocks; the number of logic blocks is shown in parentheses. The placement results are summarized in Table 2. We observe that our algorithm yields a reduction in the maximum cut size of up to 23%, compared with the traditional min-cut placement algorithm. We also computed the sum of the cut sizes over all slicing lines. As was mentioned earlier, this sum gives the total wire length according to the half-perimeter bounding-box estimate of wire length. In six of the eight circuits, we obtained a placement with a smaller sum of cut sizes. For the other two circuits, the sum of cut sizes in the placement we obtained is roughly the same as that in the placement obtained by the traditional min-cut algorithm. Therefore, we conclude that reducing the maximum cut size in our algorithm also gives the benefit of reducing the total wire length. The running time of our algorithm is comparable with that of the traditional min-cut placement algorithm. For the largest circuit, alu2, there are 14 x 16 logic blocks in the array, and thus 13 vertical and 15 horizontal slicing lines. For the smallest circuit, f5lm, there are 8 x 8 logic blocks in the array, and thus 9 vertical and 9 horizontal slicing lines. Figure 13 (a) and (b) show the distribution of the cut sizes of the slicing lines in alu2. Figure 13 (c) and (d) show the distribution of the cut sizes of the slicing lines in f5lm. The distribution of cut sizes in the placement obtained by our algorithm is shown as a solid curve, and the distribution in the placement obtained by the traditional min-cut algorithm is shown as a dashed curve. In addition to a reduction in the maximum cut size, we also observe that the cut sizes in our placement are distributed more uniformly than those in the placement obtained by the traditional min-cut algorithm. Also, the large cut sizes next to the central horizontal slicing lines illustrate the need for a new slicing sequence in the traditional min-cut placement algorithm. The slicing sequence adopted in our algorithm solves this problem, and the cut sizes near the center of the chip are not particularly large. We used the Xilinx Automated Placement and Routing (APR) program for the XC3000 series to obtain the routing results. In each routing experiment, we count the number of wire segments passing through each channel and refer to it as the wire density of the channel. According to Figure 13, there is an almost linear relationship between the cut size and the wire density. In fact, the ratio between wire density and cut size is approximately uniform in our experiments. Table 3 lists the average ratio of wire density to cut size over all slicing lines in each circuit and the corresponding standard deviation. The standard deviation is at most between 10% and 20%. Consequently, we conclude that cut size is a good measure of wire congestion. The curves for the wire density in Figure 13 also show that the placements obtained by the traditional min-cut algorithm have higher wire congestion than those obtained by our algorithm, which is consistent with the fact that there are 19 unrouted nets in the placement of alu2 obtained by the traditional min-cut algorithm, whereas there are only 3 unrouted nets in the placement obtained by our algorithm.

TABLE 1: SUMMARY OF THE EIGHT MCNC CIRCUITS

Circuit | Chip  | Blocks    | Nets | Terminals
f5lm    | 8x8   | 58 (42)   | 50   | 224
misex2  | 8x8   | 101 (40)  | 83   | 309
comp    | 10x10 | 101 (66)  | 98   | 340
9sym    | 10x10 | 105 (95)  | 104  | 518
c499    | 10x10 | 139 (66)  | 107  | 427
term    | 12x12 | 160 (116) | 150  | 621
c880    | 12x12 | 209 (123) | 183  | 677
alu2    | 16x14 | 229 (213) | 223  | 833
5 Conclusions
In this paper, we propose an important criterion for the FPGA placement problem, balancing the congestion, which has largely been overlooked in the past. We propose an improved min-cut based placement algorithm which spreads out the congestion while minimizing the total wire length. The placement obtained therefore maximizes the possibility of finding a complete routing in the FPGA, if one exists. The algorithm balances the numbers of interconnections in the two portions during each application of the min-cut bi-partitioning algorithm. At the same time, the algorithm minimizes the actual number of interconnections crossing each vertical cut-line and each horizontal cut-line, instead of a probabilistically estimated number of interconnections. We introduce the concept of slicing lines, which span the entire chip and are useful for the bi-partitioning process because a larger solution space can be examined, compared with the terminal propagation technique. The sequence of the cut-lines is determined according to the distance from each cut-line to the center of the chip: a cut-line closer to the center is applied earlier. This avoids the occurrence of a cut-line with a large cut size close to the center of the layout area (which happens in the traditional top-down min-cut based placement). The experimental results show a reduction of up to 23% in the maximum cut size and the wire congestion, compared with the traditional min-cut placement algorithm. In the largest circuit, 19 unrouted nets in the traditional min-cut placement are reduced to only 3 unrouted nets in our placement, which can easily be completed manually.
TABLE 2: SUMMARY OF THE PLACEMENT RESULTS

        | Max. Cut-size               | Total Cut-size      | CPU Time (seconds)
Circuit | Our Alg. | Min-cut | Reduc. | Our Alg. | Min-cut  | Our Alg. | Min-cut
f5lm    | 19       | 24      | 20.8%  | 198      | 234      | 0.73     | 0.76
misex2  | 24       | 31      | 22.6%  | 284      | 313      | 1.72     | 2.63
comp    | 31       | 32      | 3.1%   | 377      | 372      | 1.48     | 1.25
9sym    | 23       | 27      | 14.8%  | 370      | 352      | 18.6     | 16.4
c499    | 40       | 43      | 7.0%   | 540      | 543      | 2.42     | 2.38
term    | 38       | 47      | 19.1%  | 692      | 730      | 14.2     | 24.2
c880    | 48       | 55      | 12.7%  | 803      | 860      | 14.6     | 12.5
alu2    | 48       | 59      | 18.6%  | 1167     | 1176     | 213.3    | 311.6
TABLE 3: SUMMARY OF THE AVERAGE RATIO AND ITS STANDARD DEVIATION

        | Our Algorithm          | Traditional Min-cut
Circuit | Avg. Ratio | Std. Dev. | Avg. Ratio | Std. Dev.
f5lm    | 2.17       | 0.30      | 2.00       | 0.20
misex2  | 2.58       | 0.45      | 2.71       | 0.51
comp    | 1.81       | 0.28      | 1.80       | 0.39
9sym    | 2.88       | 0.29      | 2.81       | 0.40
c499    | 2.62       | 0.30      | 2.25       | 0.27
term    | 2.17       | 0.36      | 2.31       | 0.32
c880    | 1.95       | 0.25      | 1.84       | 0.18
alu2    | 2.46       | 0.22      | 2.67       | 0.56

References

[1] M. A. Breuer. Min-cut placement. Journal of Design Automation and Fault Tolerant Computing, 1(4):343-362, October 1977.
[2] S. Brown, R. Francis, J. Rose, and Z. Vranesic. Field-Programmable Gate Arrays. Kluwer Academic Publishers, 1992.
[3] S. Brown, J. Rose, and Z. Vranesic. A detailed router for field-programmable gate arrays. IEEE Transactions on Computer-Aided Design, 11(5):620-628, May 1992.
[4] A. E. Dunlop and B. W. Kernighan. A procedure for placement of standard-cell VLSI circuits. IEEE Transactions on Computer-Aided Design, 4(1):92-98, January 1985.
[5] C. M. Fiduccia and R. M. Mattheyses. A linear-time heuristic for improving network partitions. In Design Automation Conference, pages 175-181, 1982.
[6] M. R. Garey and D. S. Johnson. Computers and Intractability: A Guide to the Theory of NP-Completeness. W. H. Freeman, 1979.
[7] S. Goto and T. Matsuda. Partitioning, assignment and placement. In T. Ohtsuki, editor, Layout Design and Verification, chapter 2, pages 55-97. Elsevier Science Publishers B. V., North-Holland, 1986.
[8] B. W. Kernighan and S. Lin. An efficient heuristic procedure for partitioning graphs. Bell System Technical Journal, 49(2):291-307, February 1970.
[9] U. Lauther. A min-cut placement algorithm for general cell assemblies based on a graph representation. In Design Automation Conference, pages 1-10, June 1979.
[10] T. Lengauer. Combinatorial Algorithms for Integrated Circuit Layout. John Wiley & Sons, 1990.
[11] M. Palczewski. Plane parallel A* maze router and its application to FPGAs. In Design Automation Conference, pages 691-697, 1992.
[12] B. Preas and P. G. Karger. Placement, assignment and floorplanning. In B. Preas and M. Lorenzetti, editors, Physical Design Automation of VLSI Systems, chapter 4. The Benjamin/Cummings Publishing Co., 1988.
[13] N. Togawa, M. Sato, and T. Ohtsuki. A simultaneous placement and global routing algorithm for symmetric FPGAs. In Second International Workshop on FPGAs, Berkeley, CA, 1994.
[14] Xilinx Inc. XC4000 Logic Cell Array Family, Technical Data, 1990.
Figure 1: Two-dimensional FPGA layout model
Figure 2: The shortcoming of the traditional min-cut based placement algorithm ((a) maximum cut size = 18; (b) maximum cut size = 11)
Figure 3: A seven-terminal net
Figure 4: Case 1 in updating f(v, e)
Figure 5: Case 2 in updating f(v, e)
Figure 6
Figure 7
Figure 8: Two possible routes in (b) and (c) for the two-terminal net in (a)
Figure 9: Choosing the slicing line at the center
Figure 10: A scenario in the traditional min-cut placement
Figure 11: Proposed sequence to apply slicing lines
Figure 12: (a) The proposed sequence (b) The traditional sequence
Figure 13: Distributions of cut size and wire density over the slicing lines for alu2 ((a), (b)) and f5lm ((c), (d))
Fanout Problems in FPGA

Kum-Han Tsai†, Malgorzata Marek-Sadowska†, Sinan Kaptanoglu‡

† Department of Electrical Engineering and Computer Sciences, University of California, Santa Barbara, CA 93106, USA
‡ Actel Corporation, Sunnyvale, CA 94086, USA
Abstract
This paper proposes a heuristic algorithm to improve the performance of FPGA circuits by inserting buffers and duplicating nodes. Since FPGA chips come in discrete sizes, almost all circuits are mapped with some logic blocks unused. By configuring these free logic blocks as buffers and inserting them properly into the mapped circuit, the maximum delay of the circuit can be reduced significantly without any extra hardware cost. We show the experimental results and compare the improvement in both the logic level estimation and the physical level.

1. Introduction

Buffer insertion is one of the well-known methods to minimize network delay. The basic idea is to spread the fanout load and balance the fanout tree by inserting buffers of the proper size at the right positions such that the maximum delay is minimized. The fanout tree buffering problem has been studied by several authors. C. L. Berman et al. [1] first showed that determining whether there is a fanout circuit meeting given timing and area constraints is an NP-complete problem in the unit delay fanout model. They also proposed a heuristic called the Two-Group Algorithm which runs in quadratic time. Touati [6] introduced a dynamic programming approach to construct a balanced-criticality buffered fanout tree. Singh et al. [5] and Lin et al. [4] proposed algorithms inserting a single buffer per iteration. Recently, Carragher et al. [2] combined tree construction and single buffer operations to obtain more improvement. All these methods can be categorized as pre-layout buffer insertion. The other methods are post-layout approaches (e.g. [3]), which insert buffers based on physical level information. They allow the circuit element delays and the wiring delays to be calculated accurately.

Traditionally, the free logic blocks in an FPGA are of no use after the circuit has been placed. If we configure these free logic blocks as buffers and use the buffer insertion technique to speed up the network, we can improve the performance at no cost to the user. Two properties make buffer insertion in FPGAs different from the previous works. First, there is only one buffer size available, which limits the possible insertion positions for reducing the delay, and second, the number of buffers is limited. Ideally, the number of free logic blocks (i.e. the difference between the total number of logic blocks on the chip and the number of mapped blocks occupied by the target circuit) is the maximum number of buffers which can be used. However, increasing the percentage of mapped blocks may degrade the circuit's routability and/or increase the routing delay, so sometimes not all free logic blocks can be used.

If the post-layout strategy is used, not only the number but also the positions of the free logic blocks are fixed, and additionally the routing resources are less flexible. In this case, it is very unlikely to find a free buffer as well as the required routing resources (wires and switches) to insert the buffer appropriately. On the other hand, the gap between the technology independent (logic synthesis) and technology dependent (place and route) processes makes pre-layout insertion inaccurate. In other words, because of the unpredictable wiring delay, the buffers inserted before the layout is completed may, after all, not be beneficial for the circuit's performance. A possible solution to overcome these difficulties is to have a two-phase process. The first phase performs the pre-layout insertion according to a unit delay fanout model. This delay model is based on the observation that routing delay is roughly proportional to the number of fanouts. After the first phase of insertion, the modified circuit is placed and routed. The second phase modifies the first phase based on the timing analysis of the routed circuit. Since the timing estimation before place and route may not reflect the real delay, the buffers inserted previously may not improve the performance as much as predicted; they may even degrade the performance. The second phase examines the inserted buffers and may either relocate or delete them. These incremental adjustments need to take the currently available routing resources into account and may involve a partial replacement and rerouting process.
The other possibility to utilize the free logic blocks to improve the performance is node duplication, which splits the fanouts of the duplicated node into two groups, one driven by the original node and the other driven by the new (duplicated) node. In the standard cell design style, buffer insertion is typically more effective than node duplication since buffers are usually better designed for driving signals. This is no longer true for FPGAs, since buffers are also implemented by logic blocks. By adding node duplication to the fanout problem, we have a better chance of finding a higher-performance implementation of the target circuit. This paper proposes a buffer insertion/node duplication technique in FPGAs which involves the first phase (pre-layout) process. We also show the delay after routing to see how close the delay prediction at the pre-layout level is to the real physical delay.
2. Problem Formulation

We assume the circuit is combinational and can be represented as a DAG (directed acyclic graph). Each vertex in the graph represents a gate in the circuit. The gates in the circuit are assumed to be single-output. There is a directed arc e from g to h if gate g directly fans out to gate h. In this case g is the source vertex of arc e and h is the sink vertex of e. The directed arcs which share the same source vertex g form a multi-net net(g). The delay of the circuit depends on its longest delay path(s). The delay of a path is calculated by summing up the delays of its vertices and arcs. Each vertex has a constant intrinsic delay D. The directed arcs in the same multi-pin net net(g) have the same delay, which is a linear function of the number of fanouts of g and can be expressed by the following equation:

delay(arc) = A + B * (#fanout(g) - 1),

where A and B are factors dependent on the mapped technology. For example, the Actel ACT 3-3 family, which is used in our experiments, has values of 2.0ns, 1.3ns and 0.3ns for D, A, and B respectively. According to these definitions, the delay from the input pin of gate g to the input pin of any of g's fanouts is D + A + B * (#fanout(g) - 1). This arc delay model is the same as the unit delay fanout model if A is equal to B. The problem we are considering is formulated as follows. Given a DAG which represents a circuit, and the number of free vertices, the objective is to insert the free vertices (either by buffer insertion or node duplication) to reduce the fanout load of the critical nodes such that the delay of the DAG is minimized without changing the circuit functionality.

3. Delay Analysis of the Buffer Insertion and Node Duplication

3.1 Single Buffer Insertion

Let us first look at a single net to see how buffer insertion can improve the performance. Let mpo(g) indicate the maximum delay to a primary output of g, which is the local goal we would like to minimize. Given a net net(g), with source g and sinks h1, h2, ..., hk, mpo(g) can be calculated as follows:

mpo(g) = max {mpo(h1), mpo(h2), ..., mpo(hk)} + A + B(k-1) + D.

If we assign a weight to each sink W(hi) as follows:

W(hi) = mpo(hi) + delay from the input pin of g to the input pin of hi,

then mpo(g) = max {W(h1), W(h2), ..., W(hk)}. We sort the sinks in non-increasing weight order, so that mpo(g) is W(h1). Now, suppose a vertex is inserted into net(g) at the y-th fanout position such that the sinks hy, hy+1, ..., hk are buffered (see Fig 1). The weight of a sink hi is changed to W'(hi) as follows:

W'(hi) = W(hi) - B(k - y) for 0 < i < y, and
W'(hi) = W(hi) + D + A for y <= i <= k.

Our objective is to minimize mpo'(g) = max {W'(h1), W'(hy)}, and the gain of this operation is mpo(g) - mpo'(g), i.e. W(h1) - max {W'(h1), W'(hy)}.

Fig 1. Single buffer insertion to a multi-pin net: (a) net(g) before inserting a buffer; (b) after inserting a buffer at the y-th fanout

After the buffer is inserted, the maximum delay from g to the primary outputs is minimized. The paths which go through g and the inserted buffer become longer, but the analysis above indicates that this penalty does not generate any new critical path. So, whenever the gain of inserting a buffer is positive, it always makes a positive contribution to the circuit's performance. One interesting fact is that no matter where the buffer is inserted, the weight of each buffered sink increases by the constant (D+A). If the sinks of net(g) are sorted in non-increasing weight order, mpo(g) as a function of the buffer position y is concave, as shown in Fig 2. The position y' minimizing mpo(g) is the best position for inserting a single buffer. This value can be found in linear time with respect to the number of sinks in net(g). In [1], C. L. Berman and J. L. Carter have shown that, for sorted sinks, the optimal solution can be found for single buffer insertion as well as for the k-buffer tree construction. After the buffer has been inserted, the net is decomposed into two multi-pin nets which can be further improved by the same approach.
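The following is a minimal C sketch (not the authors' code) of the linear-time search for the best single-buffer position on one net, following the weight update above. W[] holds the sink weights sorted in non-increasing order (W[0] = W(h1)); inserting at position y buffers sinks hy..hk; D, A, B follow the arc-delay model of Section 2.

```c
#include <stdio.h>

static double max2(double p, double q) { return p > q ? p : q; }

/* returns the best position y in [1..k] and writes the resulting mpo'(g) */
static int best_single_buffer(const double *W, int k,
                              double D, double A, double B, double *best_mpo)
{
    int best_y = 1;
    *best_mpo = 1e30;
    for (int y = 1; y <= k; y++) {
        /* unbuffered sinks h1..h_{y-1} improve by B(k-y); buffered sinks
         * h_y..h_k get the constant penalty D+A */
        double unbuffered = (y > 1) ? W[0] - B * (k - y) : -1e30;
        double buffered   = W[y - 1] + D + A;
        double mpo = max2(unbuffered, buffered);
        if (mpo < *best_mpo) { *best_mpo = mpo; best_y = y; }
    }
    return best_y;
}

int main(void)
{
    double W[] = {20.0, 19.0, 12.0, 11.0, 10.0, 9.0, 8.0, 7.0};  /* toy net, k = 8 */
    double mpo;
    int y = best_single_buffer(W, 8, 2.0, 1.3, 0.3, &mpo);
    printf("insert buffer at y = %d, new mpo(g) = %.1f\n", y, mpo);
    return 0;
}
```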
Fig. 2 The maximum-delay-to-outputs versus inserted-position relationship for single buffer insertion

3.2 Multiple Buffer Insertion

If we insert more than one buffer into the same net, the previous approach does not necessarily yield an optimal solution. Consider a two-buffer insertion I(y1, y2), as shown in Fig 3. The weight of a sink hi, W(hi), is changed to W'(hi) as follows:

W'(hi) = W(hi) - B(k - y1 - 1) for 0 < i < y1,
W'(hi) = W(hi) + D + A - B(k - y2) for y1 <= i < y2, and
W'(hi) = W(hi) + D + A - B(y2 - y1 - 1) for y2 <= i <= k.

If the sinks are sorted in non-increasing weight order, mpo(g) is equal to the maximum of W'(h1), W'(hy1) and W'(hy2). The smallest value of mpo(g) can be found by comparing all combinations of (y1, y2), which gives O(k^2) time complexity. In general, if m buffers I(y1, y2, y3, ..., ym) are inserted at the same level simultaneously, where 0 < y1 < y2 < ... < ym <= k, then

W'(hi) = W(hi) - B(k - m - y1 + 1) for 1 <= i < y1,
W'(hi) = W(hi) + D + A + B(y2 + m - k - 2) for y1 <= i < y2,
W'(hi) = W(hi) + D + A + B(y3 - y2 + y1 + m - k - 2) for y2 <= i < y3,
...
W'(hi) = W(hi) + D + A + B(yj+1 - yj + y1 + m - k - 2) for yj <= i < yj+1,
...
W'(hi) = W(hi) + D + A + B(y1 - ym + m - 2) for ym <= i <= k.

Let ni be the number of sinks driven by the same buffer (or by g for i = 1), i.e. n1 = y1 - 1, nj = yj - yj-1 for 1 < j <= m, and nm+1 = k - ym. Then the new weights can be expressed as

W'(hi) = W(hi) - B(k - m - n1) for the sinks driven directly by g, and
W'(hi) = W(hi) + D + A + B(nj + n1 + m - k - 1) for the sinks hi in the j-th group (j > 1),

and the objective is to minimize MAX{W'(h1), W'(hy1), W'(hy2), ..., W'(hym)}; the gain of this operation is W(h1) - MAX{W'(h1), W'(hy1), W'(hy2), ..., W'(hym)}. The other possible approach to multiple buffer insertion is to create a hierarchical buffer tree as in [2][6]. This strategy may be more beneficial than inserting buffers at the same level only when the number of sinks is extremely large. Since the maximum number of fanouts in FPGA technology mapping is usually bounded (for example, Actel's place and route tool can only accept a mapped circuit with up to 24 fanouts on any node), hierarchical buffer insertion is less practical and will not be discussed in this paper.

Fig 3. Two buffers inserted at the same level: (a) net(g) before inserting any buffer; (b) two-buffer insertion I(y1, y2)

There exist situations in which two-buffer insertion can reduce the maximum delay to primary output of all sinks in a way which is not possible for single buffer insertion. This happens when the first insertion position y1 is equal to 1 and the number of sinks (i.e. the fanout count of gate g) is large enough.

Example 1). If we buffer a net into two equal-size sub-nets (i.e. set (y1, y2) to (1, [k/2])), the minimum number of sinks such that the weights of all sinks will be reduced is calculated from the following formulas:

W(hi) = D + A + B*(k-1) + mpo(hi),
W(hi)' = D + A + B*(2-1) + D + A + B*(k/2 - 1) + mpo(hi)
       = W(hi) + (D + A + B*(1 - k/2)).

To reduce the new weight, we need the term (D + A + B*(1 - k/2)) to be less than 0, which requires k to be at least 22 in the Actel ACT 3-3 FPGA whose values of D, A and B are 2.0, 1.3, and 0.3 respectively. This example demonstrates that multiple buffer insertion benefits all fanouts only when the number of fanouts is relatively large. However, in practical applications, to balance the mpo of the fanouts, multiple buffers may still give more gain than a single one. This is the main reason why we ignore buffer tree construction but still take multiple buffer insertion at the same level into account.

3.3 Node Duplication

The other possible operation to reduce the delay of a large fanout is node duplication. This operation splits the original fanouts of a node g into two groups, one driven by the node g and the other driven by a new duplicated node. In addition to the node overhead, the duplication operation introduces a wire overhead, since the inputs of node g are also duplicated. This wire overhead may increase the delay of paths which pass through the fanins of the node g and hence generate a new longest path. To overcome this problem, we check the following condition when node duplication is considered; only if a node g satisfies this constraint will it be duplicated:

for h in fanin(g)
  for k in fanout(h), k != g
    (arrival_time(h) - wire_delay(h) + B) - Max{arrival_time(x) + wire_delay(x) | x in fanin(k)} < slack_time(k)
  endfor
endfor

The method of splitting the fanouts is similar to buffer insertion. The fanouts of node g are first sorted in non-increasing weight order. Then the best splitting position y, which separates the fanouts into the two groups {h1, h2, ..., hy} and {hy+1, ..., hk}, is chosen. The new weights of the fanouts are:

W'(hi) = W(hi) - B*(k - y - 1) for 1 <= i <= y, and
W'(hi) = W(hi) - B*(y - 1) for y < i <= k,

and the gain is W(h1) - Max{W'(h1), W'(hy+1)}. So the objective here is to find a y in [1..k] that minimizes Max{W'(h1), W'(hy+1)}. Similar to buffer insertion, single node duplication can be extended to multiple duplication, which splits the fanout into m groups. However, to do this, the overhead on the fanin nets of the duplicated node increases m times when the node is duplicated m-1 times. This overhead significantly increases the probability of generating new longest paths and makes the constraint for multiple duplication relatively strict. Also, only when the number of fanouts is large enough can multiple duplication of a node achieve more gain than single duplication. As we mentioned before, the maximum number of fanouts is limited in FPGA technology mapping, so we never duplicate a node more than once.

4. The Algorithm

The basic idea of our algorithm is to select the best positions to insert buffers into the critical subcircuit which contains all the most critical paths. The critical network corresponds to the DAG of the critical subcircuit. Only if the delays of all the critical paths are reduced can the circuit performance be improved. So, the source nodes of the multi-output nets which are selected to be buffered (duplicated) need to form a node cut of the current network. Our algorithm uses a greedy strategy which picks the largest gain from all possible insertions until the selected nodes form a cut. Then the selected nodes are buffered (duplicated) and the critical network is updated. Whether the selected operation is a single buffer insertion, a multiple buffer insertion or a node duplication is decided by the gain of the operation. The analysis shown in Section 3 makes sure that neither buffer insertion nor node duplication will generate a new path longer than the original critical path.

The Algorithm

1) Set the arrival times of the PIs to 0 and calculate the actual arrival times of each node's input. Set the required arrival time of the POs to the current longest delay and calculate the required time of each node's input, and the slack and mpo of each node in the circuit.
2) Create the critical network.
3) For every critical node, find the best position to insert the buffer and store its gain as well as the position. (Compute the single buffer insertion and multiple buffer insertion gains and keep the better one.)
4) Pick the maximum gain insertion from step 3).
5) Insert the buffers selected in step 4) and modify the critical network.
6) Recalculate the arrival time, required time, slack and mpo of each node.
7) Update the critical network and go to step 3).
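As a small companion to the weight updates of Section 3.3, the following C sketch (illustrative names, not the authors' implementation) evaluates the fanout-splitting position for node duplication exactly as the formulas above prescribe, assuming the sink weights are already sorted in non-increasing order.

```c
#include <stdio.h>

/* W[] = sink weights of net(g), sorted non-increasing; splitting at y keeps
 * h1..h_y on g and moves h_{y+1}..h_k to the duplicated node g' */
static int best_duplication_split(const double *W, int k, double B,
                                  double *best_obj)
{
    int best_y = 1;
    *best_obj = 1e30;
    for (int y = 1; y < k; y++) {            /* both groups must be non-empty */
        double kept  = W[0] - B * (k - y - 1);   /* worst sink driven by g  */
        double moved = W[y] - B * (y - 1);       /* worst sink driven by g' */
        double obj = kept > moved ? kept : moved;
        if (obj < *best_obj) { *best_obj = obj; best_y = y; }
    }
    return best_y;
}

int main(void)
{
    double W[] = {20.0, 19.0, 12.0, 11.0, 10.0, 9.0};   /* toy net, k = 6 */
    double obj;
    int y = best_duplication_split(W, 6, 0.3, &obj);
    printf("split after h_%d, max weight becomes %.1f (gain %.1f)\n",
           y, obj, W[0] - obj);
    return 0;
}
```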
5. Experimental Results
Table 1 shows the results on the larger MCNC benchmarks, those with more than 300 nodes and with a number of I/O pins that can be mapped into an ACT 3 chip. These circuits are mapped to the ACT 3-3 technology before running our program. We use the technology mapping tool ACTMAP supplied by Actel and run the timing optimization with the maximum number of fanouts set to 24. The delay factors are based on the Actel ACT 3-3 delay estimation (from the Actel databook 95). The number of buffers is unlimited, so that we know the maximum possible reduction achievable by buffer insertion and node duplication. Table 2 compares the delay estimation at the logic level and the physical level.
6. The Number of Buffers to be Inserted

The results shown in Section 5 were obtained with no limit on the number of buffers. In practical applications, the number of buffers inserted at the pre-layout level should be bounded, not only because free buffers are available in limited quantities in the FPGA architecture, but also because the more buffers are inserted, the harder it is to route the circuit. In general, the more buffers are added, the more unpredictable the results of the layout stage. Our algorithm can easily be modified so that the program terminates whenever the number of added buffers reaches a given bound. However, how to set this bound is not trivial or obvious. Our experiments suggest possible bounds for each circuit. We run the unlimited buffer insertion version of the program and observe the relationship between the circuit delay and the number of inserted buffers. The curve usually becomes smooth after the first few iterations. Good results are achieved if the insertion process terminates when the absolute value of the slope becomes small. For example, the result of Section 5 shows a 66.5% improvement obtained by inserting 38 buffers into circuit C7552. The delay-versus-the-number-of-inserted-buffers curve for C7552 is shown in Fig 4. The first three iterations reduce the delay by 61.4% with only 7 buffers inserted. After the third iteration, 31 more buffers are inserted but achieve only 5.1% more reduction. It is clearly reasonable to terminate the process after the third iteration for this example if we try to reduce the overhead at the physical level.
Fig 4. The delay-versus-the-number-of-inserted-buffer diagram of circuit C7552

However, if the program is terminated based on the improvement of the current iteration, it may overlook a big gain. For example, the result of Section 5 shows that 91 buffers are added to the circuit apex4 and the performance is improved by 40.6% in the unlimited buffer program. Fig. 5 shows the delay-versus-the-number-of-inserted-buffers function for the circuit apex4. The slope of this curve changes smoothly over the first few iterations and then drops significantly. After the fifteenth iteration the slope becomes smooth again. The first fifteen iterations insert 28 buffers and improve the performance by 32.5%, while the later 70 buffers achieve only a 7.2% improvement according to the logic estimation. If the program sets the termination condition by checking whether the improvement of the current iteration is less than a certain value, it may terminate at the first iteration, since only a 0.4 ns reduction is achieved by inserting the first two buffers. The latter example indicates that without running through all the iterations, we may not find the best terminating point.
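One hypothetical way to turn this observation into a rule (an assumption for illustration, not the authors' criterion) is to run the unlimited insertion once, record the delay after each iteration, and then keep only the shortest prefix of iterations that already captures a chosen fraction of the total improvement.

```c
#include <stdio.h>

/* delay[i] = circuit delay after iteration i (delay[0] = original delay);
 * returns the smallest iteration count whose cumulative improvement reaches
 * the requested fraction of the total improvement */
static int pick_cutoff(const double *delay, int iters, double fraction)
{
    double total = delay[0] - delay[iters];
    for (int i = 1; i <= iters; i++)
        if (delay[0] - delay[i] >= fraction * total)
            return i;
    return iters;
}

int main(void)
{
    /* toy delay curve resembling the C7552 behaviour described above */
    double delay[] = {150.0, 110.0, 75.0, 58.0, 56.0, 54.5, 53.0, 50.3};
    int stop = pick_cutoff(delay, 7, 0.9);
    printf("stop after iteration %d\n", stop);
    return 0;
}
```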
Fig 5. The delay-versus-the-number-of-inserted-buffer diagram of circuit apex4
On the other hand, the logic level delay estimation may not be sufficient to decide correctly how many nodes should be inserted. The results shown in Table 2 indicate that there are a few cases where the difference between the physical delay and the logic level estimation is quite large (e.g. dalu). To see the relationship between the number of inserted nodes and the physical delay, we picked several benchmarks which have a large number of inserted nodes and tried to reduce that number without degrading the logic level estimated performance too much. Table 3 shows the results for the same benchmarks with different numbers of inserted nodes. In Table 3, each circuit appears with two different insertion (duplication) strategies: one is the same as in Table 2 and the other tries to limit the number of added nodes. The second column shows three values: the number of nodes of the original circuit, the number of inserted buffers and the number of duplicated nodes. The third column shows the percentage of mapped logic blocks. It is interesting to note that even when the difference in the logic level estimated delay is small (within 3%), the real delay difference after routing can be quite large. In most of the cases shown in Table 3, except alu4, when the number of inserted (duplicated) nodes is reduced, the logic level estimated delay increases only a little but the physical level delay increases quite a lot. The possible reason is that there are some long paths which are not critical under the logic level estimation but become critical at the physical level. This observation suggests that the nodes inserted in the later iterations are beneficial even if they do not reduce the delay of the critical paths. Since our algorithm only accepts inserted nodes with positive gain, some of the fanout delays become more balanced after the insertion and the chance of a layout delay violation can be reduced. Of course, this applies only when there is still a large number of free logic blocks left.
7. Conclusions

We have developed a heuristic algorithm which applies both buffer insertion and node duplication to improve the performance of FPGA designs. The experimental results demonstrate its feasibility. The average improvement in performance is 11.7% after layout. The experiments also indicate that the delay model used by our algorithm at the logic level is close to the physical level delay estimation in most cases, so that the extra nodes are properly inserted into the circuit to improve the performance. The difference in average improvement between the logic level and physical level estimations is 2%. We also note that the number of nodes to be added will influence the accuracy of the logic level delay estimation and hence the actual improvement which can be achieved after layout. This gives a trade-off on the number of nodes which should be added to the mapped circuit, and it is especially important when the number of available blocks is very limited. The post-layout process will be the next step, which may involve incremental placement and local rerouting. How much improvement can we expect from the post-layout phase? From Table 2 we observe that the estimated improvement at the logic level is more optimistic than the real delay improvement and can be treated as the objective which the post-layout process may achieve.
REFERENCES

[1] C. L. Berman, J. L. Carter, and K. F. Day, "The Fanout Problem: From Theory to Practice," in Advanced Research in VLSI: Proceedings of the 1989 Decennial Caltech Conference, C. L. Seitz, editor, MIT Press, March 1989, pp. 69-99.
[2] R. J. Carragher, M. Fujita, and C. K. Cheng, "Simple Tree-Construction Heuristics for the Fanout Problem," Proc. of the 1995 ACM/IEEE International Workshop on Logic Synthesis, 1995, pp. 1-11 -- 1-22; International Conference on Computer Design, October 1995.
[3] J. Lillis, C. K. Cheng, and T. T. Y. Lin, "Optimal and Efficient Buffer Insertion and Wire Sizing," Proc. of the IEEE 1995 Custom Integrated Circuits Conference, pp. 259-262.
[4] S. Lin and M. Marek-Sadowska, "A Fast and Efficient Algorithm for Determining Fanout Trees in Large Networks," Proc. of the 1991 European DAC, pp. 539-544.
[5] K. J. Singh and A. Sangiovanni-Vincentelli, "A Heuristic Algorithm for the Fanout Problem," Proc. of the 27th ACM/IEEE Design Automation Conference, 1990, pp. 357-360.
[6] H. J. Touati, "Performance-Oriented Technology Mapping," Ph.D. dissertation, Memorandum No. UCB/ERL M90/109, Department of Electrical Engineering and Computer Science, University of California at Berkeley, 28 November 1990.
[7] Actel 1995 FPGA Data Book and Design Guide.
Table 1: Performance improvement by buffer insertion/node duplication, without limiting the number of buffers, (D, A, B) = (2.0ns, 0.9ns, 0.3ns)

Circuit   | Nodes | Delay before (ns) | Delay after (ns) | Nodes inserted | Improvement
C1355     | 320   | 77.5 | 74.2 | 56 | 4.3%
C1908     | 270   | 70.6 | 63.4 | 39 | 10.2%
C432      | 108   | 64.2 | 59.7 | 29 | 7.0%
C880      | 156   | 55.4 | 50.9 | 21 | 8.1%
alu4      | 350   | 98.0 | 78.8 | 41 | 19.6%
dalu      | 834   | 85.2 | 72.3 | 23 | 15.1%
duke2     | 366   | 38.4 | 35.4 | 7  | 7.8%
misex3    | 456   | 45.6 | 38.4 | 38 | 15.8%
misex3c   | 344   | 57.8 | 38.9 | 7  | 32.7%
too_large | 467   | 57.0 | 47.6 | 21 | 16.5%
k2        | 1051  | 48.3 | 40.8 | 48 | 15.5%
vda       | 528   | 38.6 | 33.9 | 59 | 12.2%
Average   |       |      |      |    | 13.7%

TABLE 2: Physical vs. Logic Level Estimation

Circuit   | Nodes/buf/dup | Mapped blocks | Logic delay before/after (ns) | Imp.  | Physical delay before/after (ns) | Imp.
C1355     | 320/54/2      | 66.7% | 77.5 / 74.2 | 4.3%  | 72.1 / 69.6  | 3.5%
C1908     | 270/12/23     | 54.1% | 70.6 / 63.4 | 10.2% | 77.3 / 67.0  | 13.3%
C432      | 108/24/5      | 68.5% | 64.2 / 59.7 | 7.0%  | 70.1 / 66.1  | 5.7%
C880      | 156/ - / -    | 55.8% | 55.4 / 50.9 | 8.1%  | 61.2 / 44.6  | 27.1%
alu4      | 350/21/22     | 69.7% | 98.0 / 78.8 | 19.6% | 106.8 / 88.3 | 17.3%
dalu      | 834/11/17     | 62.6% | 85.2 / 72.3 | 15.1% | 95.0 / 92.9  | 2.2%
duke2     | 366/4/3       | 66.1% | 38.4 / 35.4 | 7.8%  | 44.4 / 42.7  | 3.4%
misex3    | 456/13/24     | 87.4% | 45.6 / 38.4 | 15.8% | 54.2 / 48.7  | 10.1%
misex3c   | 344/ - / -    | 63.3% | 57.8 / 38.9 | 32.7% | 61.8 / 46.9  | 24.1%
too_large | 467/13/9      | 86.7% | 57.0 / 47.6 | 16.5% | 66.6 / 57.1  | 14.3%
k2        | 1051/55/13    | 81.3% | 48.3 / 40.8 | 15.5% | 82.2 / 74.1  | 9.9%
vda       | 528/36/23     | 69.1% | 38.6 / 33.9 | 12.2% | 52.5 / 51.0  | 2.9%
Average   |               |       |             | 13.7% |              | 11.7%

Table 3: The delays for different numbers of inserted nodes

Circuit | Nodes/buf/dup | Resource utilization | Logic delay before/after (ns) | Imp.  | Physical delay before/after (ns) | Imp.
C1908   | 270/12/23     | 54.1% | 70.6 / 63.4 | 10.2% | 77.3 / 67.0  | 13.3%
C1908   | 270/10/16     | 52.5% | 70.6 / 63.8 | 9.6%  | 77.3 / 70.8  | 8.4%
alu4    | 350/21/22     | 69.7% | 98.0 / 78.8 | 19.6% | 106.8 / 88.3 | 17.3%
alu4    | 350/18/19     | 68.6% | 98.0 / 79.1 | 19.3% | 106.8 / 85.3 | 20.1%
dalu    | 834/11/17     | 62.6% | 85.2 / 72.3 | 15.1% | 95.0 / 92.9  | 2.2%
dalu    | 834/6/6       | 61.4% | 85.2 / 75.0 | 12.0% | 95.0 / 93.4  | 1.7%
k2      | 1051/55/13    | 81.3% | 48.3 / 40.8 | 15.5% | 82.2 / 74.1  | 9.9%
k2      | 1051/17/9     | 78.2% | 48.3 / 41.8 | 13.5% | 82.2 / 82.5  | -0.4%
Performance Driven Layout Synthesis: Optimal Pairing & Chaining

A. J. Velasco, X. Marin, J. Riera, R. Peset*, J. Carrabina
Universitat Autònoma de Barcelona. Campus UAB, 08193 Bellaterra, Spain. Tel: +34.3.581.10.78 Fax: +34.3.581.30.33 e-Mail: [email protected]
* Philips Research Laboratories, Eindhoven, The Netherlands.

ABSTRACT

A complete CAD system for layout synthesis is outlined. The performance driven logic synthesis is targeted at transistor level logic to avoid library dependencies. Layout is generated from the transistor level netlist by a novel performance driven module generator. In this approach, performance optimization is integrated with the layout generation, and constraints are taken into account throughout the entire process. An optimal pairing and chaining algorithm embedded in this module generator is presented. Although dual pairing is usually used, a different pairing can lead to better chaining solutions. A new and optimal algorithm is presented that combines pairing and chaining into one single step. This approach is based on Chi-Yi Hwang et al.'s algorithm [1]. Extended definitions are given to introduce the new features. Optimal results are always reached.

1. INTRODUCTION

As circuit requirements grow in complexity, manual performance optimization methods become clearly insufficient and should be replaced by automatic optimization CAD systems. During the design process, different CAD tools transform a circuit description from a given (high) abstraction level into an optimized representation at the target (lower) abstraction level. Our work is concerned with the last two abstraction levels of the design process (logic and physical), which are covered by logic and layout synthesis tools respectively. The paper is organized as follows. In sections 2 and 3, overviews of the logic synthesis and the layout synthesis tools are given, respectively. Section 4 shows in detail the pairing and chaining algorithm embedded in the layout synthesis tool; its subsections give a short review of previous work, the definitions of the concepts, Hwang et al.'s algorithm [1] and the new features introduced in the proposed algorithm. Results are presented in section 5. Finally, conclusions are discussed in section 6.

2. LOGIC SYNTHESIS

Combinational logic synthesis has traditionally been divided into two phases [2]. First, a technology independent optimization is performed to obtain a minimum Boolean Network (BN) representation of the circuit. Then, this network is mapped to a gate library in a given technology,
while trying to minimize critical circuit parameters like area, delay and power dissipation. This approach presents several limitations. First, the finite (even small) size of the target gate library leads to a restricted solution space. Second, the network optimization is performed targeting area minimization only; performance is introduced during the technology mapping step. So, even if the technology mapping is optimally performed, optimality of the final circuit is not guaranteed, since the starting point for the technology mapping may not be the best one. Finally, the layout parasitics introduced at the layout level may invalidate the performance evaluation done by the logic synthesis, and may move the final circuit out of specifications. This introduces a costly (and possibly non-convergent) resynthesis process to take parasitics into account during synthesis. Many solutions have been proposed in the literature to avoid the mentioned drawbacks. These include introducing technology independent delay, power and routing evaluation during BN optimization [3,4,5]. Also, the size of the target library should be enlarged to provide a finer granularity of the solution space. Our approach introduces performance-driven transistor level logic synthesis closely tied to physical synthesis. BN optimization is performed taking into account technology independent evaluations for area, delay, power dissipation and routing. Then, the BN is directly mapped into a transistor-level netlist which is synthesized using a performance-driven module generator. Since transistor level technology mapping is not limited by the size of a finite library, it offers finer granularity than traditional gate level logic synthesis, giving more room for the trade-off among area, delay and power dissipation. Of course, increasing the granularity of logic synthesis to the transistor level requires increasing the granularity of the layout synthesis approach. Transistor level synthesis must be supported by automatic layout synthesis tools which can deal with performance-driven placement, sizing and routing at the transistor level.

3. LAYOUT SYNTHESIS

Traditionally, performance-driven module generators described in the literature [6, 7, 8, 9] are based on the structure shown in figure 1.
Figure 2: Proposed approach.
Figure 1: Conventional approach.

As shown in the figure, from a netlist description, a layout meeting the imposed constraints is obtained by an iterative process. Such a process involves three main tasks: layout generation, layout extraction and performance optimization. As the first step of this iterative process, a layout is generated without taking the constraints into account. The parasitic capacitances of the generated layout are extracted during the second step. This information, along with the imposed constraints, is used in the optimization step. This optimization results in changes of the parasitic capacitances and can lead to transistors larger than the available space. These are the sources of the iteration. After each optimization the layout must be checked both for constraint and design rule violations. Should any of these not be correct, the process must be repeated with the updated data. It is clear that generating the layout without taking the performance constraints into account cannot lead to optimal results. There are two main differences between our approach and conventional performance-driven module generators. Firstly, performance constraints are taken into account all along the layout generation process. And secondly, performance optimization is integrated with the layout generation, as can be seen in Figure 2. Global iteration is avoided since parasitic capacitances are precisely estimated before optimization, and space can be provided if needed by a break-line approach, as explained in section 3.5. This module generator takes as input a transistor level netlist description, the performance constraints, information about the target process and data from the logic synthesis. The data from the logic synthesis consist mainly of the transistor netlist and the switching activity of each net. The performance constraints consist of a minimum clock period and maximum values for the average power dissipation, the area and the number of required layout rows. Weight factors can be used to indicate the importance of the different performance
The required process information consists of the transistor model parameters and the layout design rules. The performance-driven module generator produces the layout masks of the generated module, together with the achieved performance results. 3.1. Layout Style The performance-driven module generator is targeted at a row-oriented layout style, like that proposed in [10]. The diffusion chains are horizontally oriented, and cross the vertical polysilicon transistor gates. This layout style allows each transistor to grow without affecting the disposition of the Vdd and Gnd lines. 3.2. Clustering Clustering is responsible for dividing the circuit into stages, reordering transistors, and pairing & chaining the transistors in each stage. A stage can be defined as a channel-connected block. The cost function for reordering is a weighted average of delay [11] and power [12]. Optimally reordered stages are passed through the pairing & chaining step, which is able to find the best possible chaining solution among all possible pairings. Further, since the chaining solution is not unique, it chooses the one that fits best. The choice is made based on a cost function including total wire length and cell height (number of parallel tracks). Delay and power are implicitly reduced. The pairing and chaining algorithm is described in detail in section 4. 3.3. Placement The output of the previous step defines the length of each stage, since all chains in a stage are placed in a row, separated by the minimum gap allowed by the technology. This constitutes the unit to be placed by the performance-driven placement, which searches for the appropriate relative location of each unit. The placement is based on a stochastic evolution method, which has been shown to achieve better results than simulated annealing [13].
As mentioned before, the units to be placed are stages. Overlap is allowed and taken into account in the cost function. The cost function also depends on the total wire length, the switching activity of each net and the overall power dissipation. The cost function is recalculated incrementally with each change. 3.4. Performance Optimization Classical approaches perform the sizing step before placement. However, as technologies reduce feature size, wire delay becomes more important and a very accurate estimation is required. Since in our approach sizing is performed after placement, the estimation of parasitics can be easily achieved. Furthermore, the sizing procedure guarantees correct timing without resynthesis. In the layout style employed, the Vdd and Gnd tracks are placed in parallel close to the center of the cell. Therefore, transistors can grow freely, since they are not bounded by the power tracks. Different transistor sizes lead to different cell heights, which in turn generate some extra space. This space is used by the routing process. 3.5. Routing The final step of the layout synthesis is concerned with routing. For this process, a maze routing approach has been chosen. The main reason for this choice is that the algorithm is easily extended to a technology with n routing layers. Two important issues are introduced in this routing step. Firstly, if there is not enough space for a net to be routed, the layout is enlarged following a break-line approach. As shown in figure 3, the layout is stretched to make room for a new track. Despite the simplicity of the routing technique used, the break-line approach guarantees that all nets will be routed. Secondly, as the routing phase proceeds, more accurate parasitic estimations become available. Any performance violations caused by replacing parasitic estimates with accurate results, or by the increase of capacitance due to the break-lines, are corrected by an incremental performance optimization integrated with this routing phase. Therefore, the constraints will always be met.
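The wave-expansion core of such a maze router can be sketched as follows. This is a minimal C illustration, assuming a simple rectangular grid with blocked cells; the array sizes, names and (unit) cost model are illustrative and are not those of the actual tool.

    #include <stdio.h>
    #include <string.h>

    #define W 64
    #define H 64

    /* 0 = free cell, 1 = blocked by an already routed net */
    static int blocked[H][W];
    static int dist[H][W];

    /* Breadth-first (Lee) wave expansion from (sx,sy) to (tx,ty).
       Returns the path length in grid cells, or -1 if the net cannot
       be routed; in the tool described above, such a failure would
       trigger the break-line stretching and a retry. */
    int maze_route(int sx, int sy, int tx, int ty)
    {
        static int qx[W * H], qy[W * H];
        int head = 0, tail = 0, x, y, d;
        const int dx[4] = { 1, -1, 0, 0 };
        const int dy[4] = { 0, 0, 1, -1 };

        memset(dist, -1, sizeof dist);
        dist[sy][sx] = 0;
        qx[tail] = sx; qy[tail] = sy; tail++;

        while (head < tail) {
            x = qx[head]; y = qy[head]; head++;
            if (x == tx && y == ty)
                return dist[y][x];
            for (d = 0; d < 4; d++) {
                int nx = x + dx[d], ny = y + dy[d];
                if (nx < 0 || nx >= W || ny < 0 || ny >= H) continue;
                if (blocked[ny][nx] || dist[ny][nx] >= 0) continue;
                dist[ny][nx] = dist[y][x] + 1;
                qx[tail] = nx; qy[tail] = ny; tail++;
            }
        }
        return -1; /* no free path: enlarge the layout (break line) and retry */
    }

    int main(void)
    {
        blocked[3][2] = 1;                      /* an already routed wire */
        printf("path length: %d\n", maze_route(0, 0, 10, 5));
        return 0;
    }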
4. PAIRING & CHAINING As mentioned before, one of the steps in layout generation is diffusion chain formation. Following the Uehara and vanCleemput's layout style [14], transistors are placed in two parallel rows, one for the p- and one for the ntransistors. Transistors with the same gate connection are aligned vertically to reduce internal routing. Maintaining this constraint, a chaining algorithm should be able to find a minimum set of transistor chains in order to reduce the number of diffusion gaps, and hence, the overall area. Our main new feature is to combine the paring and chaining steps. The pairing of the transistors is not fixed until it is involved in a selected abutment. This approach finds the best chaining solution without changing the circuit performance. 4.1. Previous Work Since in 1981 Uehara and vanCleemput presented their layout style [14] many approaches aiming at optimal diffusion chain generation have been proposed. Wimer et al. [15] presented a chaining algorithm consisting of pairing, chain formation a chain covering. Starting from a chain of length one, pairs are added to the ends of the chains until no more abutments can be realized. Then a chain covering algorithm selects a minimum set of chains that includes all transistors. An approach based on finite state machines was developed by Nair et al [16]. Other proposed algorithms are based on heuristics and optimal results are generally not reached [171. The best approaches presented so far are found in [1] and [18]. Both claim to obtain optimal solutions. The algorithm presented in [18] is limited to fully complementary CMOS circuits, although the theory of dual trail cover facilitates transistor reordering. Reordering can be defined as the optimization problem of finding the permutation of transistors that minimizes the number of diffusion chains produced by the chaining algorithm. T. Nakagaki [19] introduced some changes and a new concept to speed up the algorithm. However, reordering changes the performance of the circuit, and this may not be desired for optimally reordered circuits. Approach [1] can handle non-dual circuits as long as the number of p-transistors equals the number of ntransistors, and pairing is fixed in a previous step. This algorithm reveals faster than other approaches. Generally, a chaining algorithm tries to find a minimal number of diffusion chains, based on a given pairing. However, different a pairing can lead to a lower number of diffusion chains and to a better quality of the chains. The quality of a chain can be evaluated according to several criteria, such as internal routing complexity. 4.2. Definitions A transistor t is defined by the quadruple (T, D, G, S), where T is the transistor type ( p or n ) and D, G and S
Figure 3: Interconnect routing and break-lines.
correspond to the drain, gate and source nodes, respectively. Two transistors t_i and t_j are abuttable if {D(t_i), S(t_i)} ∩ {D(t_j), S(t_j)} ≠ ∅. A transistor pair p = (t_p, t_n) consists of two transistors, a p- and an n-transistor. Let t^p(p) and t^n(p) denote the p- and the n-transistor of pair p, respectively. Let P be the set of available pairs. Two pairs p_i, p_j ∈ P are abuttable if both the p- and the n-transistors are abuttable. A chain is defined recursively as follows: a pair is a chain of length one; a chain c of length m and a pair p can form a new chain c' of length m+1 if p ∉ c and p is abuttable to at least one end of c.
Possible abutments between pairs in P are represented as edges in a bipartite graph. The bipartite graph can be formally described as G = (V^p ∪ V^n, E), where E is the set of possible abutments between pairs in P, and V^p and V^n are sets of vertices. A p-vertex contains all p-transistors attached, through drain or source, to the diffusion node represented by the vertex; the same holds for the n-vertices in V^n. This can be formally expressed as:

    V^p = { v^p_i | v^p_i = { t_p | T(t_p) = 'P' and (D(t_p) = i or S(t_p) = i) } }
    V^n = { v^n_i | v^n_i = { t_n | T(t_n) = 'N' and (D(t_n) = i or S(t_n) = i) } }

An edge e_{pk pl} ∈ E connects a p-vertex v^p_i and an n-vertex v^n_j if {p_k, p_l} ⊆ P, {t^p(p_k), t^p(p_l)} ⊆ v^p_i, {t^n(p_k), t^n(p_l)} ⊆ v^n_j, and the two pairs are abuttable.
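As an illustration of these definitions, the abuttability tests and the vertex membership test can be written directly from the quadruple representation. The following C fragment is a sketch only; the type and function names are hypothetical and are not taken from the tool itself.

    #include <stdio.h>

    typedef struct { char type; int d, g, s; } Transistor;   /* type: 'P' or 'N' */
    typedef struct { int tp, tn; } Pair;                     /* indices of a p- and an n-transistor */

    /* Two transistors are abuttable if their drain/source node sets intersect. */
    int trans_abuttable(const Transistor *a, const Transistor *b)
    {
        return a->d == b->d || a->d == b->s || a->s == b->d || a->s == b->s;
    }

    /* Two pairs are abuttable if both their p-transistors and their n-transistors
       are abuttable; each such pair of pairs becomes an edge e_{pk,pl}. */
    int pair_abuttable(const Transistor *t, const Pair *pk, const Pair *pl)
    {
        return trans_abuttable(&t[pk->tp], &t[pl->tp]) &&
               trans_abuttable(&t[pk->tn], &t[pl->tn]);
    }

    /* Membership test for v^p_i: a p-transistor belongs to the vertex of
       diffusion node i if its drain or source is node i. */
    int in_p_vertex(const Transistor *tp, int node)
    {
        return tp->type == 'P' && (tp->d == node || tp->s == node);
    }

    int main(void)
    {
        Transistor t[2] = { { 'P', 1, 7, 2 }, { 'P', 2, 8, 3 } };
        printf("abuttable: %d\n", trans_abuttable(&t[0], &t[1]));  /* share node 2 */
        return 0;
    }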
A CMOS complementary circuit and its bipartite graph representation can be seen in figure 4; P constitutes a typical pairing for the circuit and E the set of possible abutments between those pairs. Some additional definitions are now described. An essential abutment is an abutment which has to appear in any solution. Two possible abutments are mutually exclusive if at most one of them can be in any solution; otherwise, they are compatible. The set of edges which are mutually exclusive with edge e_{pk pl} is denoted as XOR(e_{pk pl}).
4.3. Hwang's Transistor Chaining Algorithm The transistor chaining process presented in [1] is carried out in three separate steps: pairing, bipartite graph construction and chain formation. Pairing is responsible for grouping transistors into fixed pairs. A p- and an n-transistor are good candidates to be paired if both share the same gate. The transistors of a CMOS transmission gate can also be selected to be vertically aligned. Following these criteria, the pairing process generates P, the set of all available pairs. All pairs in P will be found in the final solution of the chaining. Pairs are fixed before the chains are generated, and the chain formation process cannot change those pairs. The bipartite graph is constructed following the given definitions. The set of abutments E includes all possible abutments between pairs in P. To produce the chains, a tree is generated in a DFS (Depth First Search) fashion. First, all essential abutments are selected, confirmed for the final solution and removed from the graph. Then, at each level, an edge and its XOR set are selected and used as the expansion nodes for that
Figure 4: (a) A CMOS circuit. (b) Its bipartite graph representation for the pairing P = {(t1,t6), (t2,t9), (t3,t7), (t4,t8), (t5,t10)}.
level. The edges in that level are queued in descending order according to the upper bound on the number of possible abutments. Once an edge is selected, the graph is reduced by eliminating that edge and its XOR set, and is passed down to the next level. At each level the queue is pruned when its head has an upper bound on the number of realizable abutments no greater than the number of abutments in the returning solution. The correctness of expanding just a limited set, an edge and its XOR, is shown in [1]. To find the essential abutments, the XOR set and a good upper bound, some theoretical development is needed. The following lemmas are the key to obtaining the first two sets. Then, the computation of the upper bound is also described.

Lemma 1: Two edges e_{pk pl} ∈ E(G) and e_{pr ps} ∈ E(G) are mutually exclusive if {p_k, p_l} ∩ {p_r, p_s} ≠ ∅ and the two edges are incident to the same p-vertex or to the same n-vertex.

Lemma 2: An edge e_{pk pl} ∈ E(G) is essential if XOR(e_{pk pl}) = ∅.

Proofs for these two lemmas can be found in [1].

ALGORITHM Chain_Formation()
begin  pairing();  G = build_bipartite_graph();  B = ∅;  chaining(G, B);  end;

PROCEDURE chaining(G, B);
begin
  if (E(G) = ∅) begin  output(B);  return();  end;
  for e ∈ E(G): if e is essential begin  B := B + pair(e);  E(G) := E(G) - {e};  end;
  pick e ∈ E(G) with minimum |XOR(e)|;
  Ψ := {e} ∪ XOR(e);  sort Ψ according to the upper bounds on # abutments;
  while (Ψ ≠ ∅ and # abutments found < upper bound of the head of Ψ) begin
    e := head of Ψ;  B' := B + pair(e);
    G' := G;  E(G') := E(G') - {e} - XOR(e);
    chaining(G', B');  Ψ := Ψ - e;
  end;
end;

Figure 5: Pseudo code for Hwang's chaining algorithm [1].

The upper bound on the number of realizable abutments is taken as min(Pabut, Nabut), where Pabut and Nabut sum, over the vertices of V^p and V^n respectively, the maximum number of abutments each vertex can contribute (a function of the number of transistors attached to the vertex and of the vertex degree). That is, the number of possible abutments between p/n-transistors is the sum of the maximum possible contributions of each vertex in V^p/V^n.

4.4. New Features

4.4.1. Integrated Pairing The pairing process is needed to minimize internal cell routing. As explained before, this process follows some standard heuristic criteria: in complementary CMOS circuits, each transistor is paired with its dual. This is a good choice, since transistors with the same gate will be vertically aligned. However, when there are several p- and/or n-transistors sharing the same gate, pairing of dual transistors can lead to suboptimal solutions, as shown in figure 6a. Based on the circuit of figure 4, figure 6b shows a better chaining solution, obtained by changing the initial pairing.

Figure 6: (a) Dual pairing (P = {(t1,t6), (t2,t9), (t3,t7), (t4,t8), (t5,t10)}) and best possible chaining. (b) Optimal pairing (P = {(t1,t8), (t2,t9), (t3,t7), (t4,t6), (t5,t10)}) and chaining. (c) Suboptimal solution: not all abutments can be realized.

There are no direct criteria to decide which pairing can lead to the best chaining results. The number of final chains can only be determined exactly through a chaining process. Therefore, the pairing has to disappear as a separate step
and must be dynamically generated during the chain formation process. This implies that more information is handled during the chaining process, and therefore requires some changes to manage it. The bipartite graph must now include all possible pairs, and hence edges representing all new possible abutments must be generated. To introduce the new features in our approach, some concepts must be extended. Several new definitions are also needed.

4.4.2. Extended and New Definitions A pair has been defined as a group of a p- and an n-transistor, without any other restriction. Therefore, given a circuit with a p-transistors and b n-transistors, the number of possible pairs is a*b. The set of all possible pairs will be denoted as Ω. However, it is not useful to take into account all pairs in Ω. Two kinds of pairs are adequate for a chaining process: a) gate connected pairs, where both transistors share the gate signal; b) transmission gate pairs, where the transistors share drain and source signals. The set of all gate connected pairs will be denoted as GCP and the set of all transmission gate pairs as TGP. The formal definitions of these two sets are:

    GCP = { (t_p, t_n) ∈ Ω | G(t_p) = G(t_n) }
    TGP = { (t_p, t_n) ∈ Ω | D(t_p) = D(t_n) and S(t_p) = S(t_n) }

The set of available pairs for the bipartite graph, P, must now be redefined as P = GCP ∪ TGP. Given a transistor t, all possible pairings for t are included in P. Since a transistor can only appear in one pair, not all pairs in P are compatible. Two pairs p_i, p_j ∈ P, i ≠ j, are compatible as long as t^p(p_i) ≠ t^p(p_j) and t^n(p_i) ≠ t^n(p_j). Otherwise, they are incompatible.

In our approach, pairing disappears as a separate step. The bipartite graph is built in the same way, but more abutments are reflected, since new pairs have been added to P. This requires a new definition of XOR(e). Some other changes are made to improve the chaining results.

4.4.3. Modified XOR Let C(p) be the set of pairs compatible with pair p:

    C(p) = { p_i ∈ P | t^p(p_i) ≠ t^p(p) and t^n(p_i) ≠ t^n(p) }

The set of all pairs incompatible with pair p can then be defined as I(p) = P - C(p). The concept involved in XOR(e) is the set of all edges incompatible with edge e. Therefore, the definition of XOR(e) must be extended to include all abutments that involve a different pairing for any of the transistors included in the selected abutment:

    XOR(e_{pk pl}) = { e_{pr ps} | ({p_k, p_l} ∩ {p_r, p_s} ≠ ∅ and (i = i' or j = j'))
                                   or {p_r, p_s} ∩ (I(p_k) ∪ I(p_l)) ≠ ∅ }

where i, j and i', j' denote the vertex indices of e_{pk pl} and e_{pr ps}, respectively. When an edge (abutment) is selected, the pairing for the transistors involved is fixed, and henceforth abutments that imply a different pairing are obviously impossible.

4.4.4. Modified Bound The execution time required to process a circuit is directly related to the size of the tree generated during chain formation. As mentioned before, this tree is built in a DFS fashion and an upper bound on the number of possible abutments is used to prune the tree. Obviously, the tighter the upper bound, the smaller the tree, and the shorter the execution time will be.

Theorem 1: A p-vertex v^p_i can contribute at most ⌊|A(v^p_i)|/2⌋ abutments to the final chaining, where A(v^p_i) = { t ∈ v^p_i | ∃ e_{pk pl} ∈ E such that t^p(p_k) = t or t^p(p_l) = t }. The same holds for any n-vertex.

Proof: A transistor t ∈ v^p_i involved in an abutment is, by definition, automatically a member of A(v^p_i). A transistor can be in only one pair at a time, and therefore can contribute to at most one abutment in v^p_i. Hence, at most every two of the transistors included in A(v^p_i) can contribute an abutment. Q.E.D.
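A sketch of how the bound of Theorem 1 could be evaluated is given below. The per-vertex data layout (membership and involvement flags) is an assumption made for illustration and is not the data structure of the actual implementation.

    #include <stdio.h>

    #define MAX_T 128

    /* For one vertex v (a diffusion node): member[k] = 1 if transistor k is
       attached to the node, involved[k] = 1 if transistor k appears in at
       least one remaining abutment edge.  Theorem 1 bounds the vertex's
       contribution by floor(|A(v)| / 2). */
    int vertex_bound(const int member[MAX_T], const int involved[MAX_T], int n_trans)
    {
        int k, a = 0;
        for (k = 0; k < n_trans; k++)
            if (member[k] && involved[k])
                a++;                      /* |A(v)| */
        return a / 2;                     /* floor(|A(v)|/2) */
    }

    /* The pruning bound is min(Pabut, Nabut), where each term is the sum of
       vertex_bound() over all p-vertices or all n-vertices, respectively. */
    int prune_bound(int pabut, int nabut)
    {
        return pabut < nabut ? pabut : nabut;
    }

    int main(void)
    {
        int member[MAX_T]   = { 1, 1, 1, 0 };
        int involved[MAX_T] = { 1, 0, 1, 0 };
        printf("vertex bound: %d\n", vertex_bound(member, involved, 4)); /* |A(v)|=2 -> 1 */
        return 0;
    }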
4.4.5. Cyclic Chains The algorithm presented in [1] claims to obtain the optimal solution for the given pairing. However, suboptimal results can be generated when the solution contains a cyclic chain. A cyclic chain can be defined as a chain whose ends can be abutted. Figure 6c shows a solution with a cyclic chain for the circuit of figure 4. The number of abutments in this solution is the same as that in 6b, since the ends of the first chain in 6c can be abutted. If the algorithm finds this solution first, the result will not be optimal, since an abutment between the two ends of a chain cannot be realized. These situations must be controlled by the algorithm. It is necessary to subtract the number of cyclic chains from the number of possible abutments in the solution in order to obtain the real number of realizable abutments. A cyclic chain can only be detected when the final chains are formed; therefore, once a possible solution is found, its chains must be generated.
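The cyclic-chain check itself is straightforward once a candidate solution has been expanded into chains. The following C sketch assumes a chain is stored as an ordered array of pairs; the types and names are illustrative only.

    #include <stdio.h>

    typedef struct { char type; int d, g, s; } Transistor;
    typedef struct { int tp, tn; } Pair;

    static int trans_abuttable(const Transistor *a, const Transistor *b)
    {
        return a->d == b->d || a->d == b->s || a->s == b->d || a->s == b->s;
    }

    /* A chain is cyclic if its two end pairs could still be abutted; such an
       abutment cannot be realized in a row, so one possible abutment must be
       subtracted from the solution count for every cyclic chain. */
    int chain_is_cyclic(const Transistor *t, const Pair *chain, int len)
    {
        const Pair *first = &chain[0], *last = &chain[len - 1];
        if (len < 2)
            return 0;
        return trans_abuttable(&t[first->tp], &t[last->tp]) &&
               trans_abuttable(&t[first->tn], &t[last->tn]);
    }

    int main(void)
    {
        Transistor t[4] = { {'P',1,5,2}, {'N',1,5,2}, {'P',2,6,1}, {'N',2,6,1} };
        Pair chain[2] = { {0,1}, {2,3} };
        printf("cyclic: %d\n", chain_is_cyclic(t, chain, 2));
        return 0;
    }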
5. RESULTS The CAD system for layout synthesis is being implemented in the C language. Although complete final results are not yet available, preliminary tests show that the characteristics of this tool can lead to high quality layouts. The pairing & chaining algorithm has been executed on a Sparc workstation over a set of circuits. Tables 1 and 2 summarize the results obtained. In both tables, the first column references the circuit. #trt, #abut, #ch and #nodes correspond to the number of transistors, possible abutments, obtained chains and generated nodes of the search tree, respectively. Execution times (in seconds) are given in the columns labelled time.

Table 1
  Circuit       #trt | Pairing & Chaining: #abut  #ch  time(s) | Pairing + Chaining: #abut  #ch  time(s)
  [1, fig.1]     10  |    8   1   0.01    |    8   1   0.01
  [20, p.198]    12  |   24   2  <0.01    |    9   2  <0.01
  [14, fig.9]    18  |   12   3  <0.01    |   12   3  <0.01
  Iscas c17      24  |   54   3   0.1     |   27   3   0.06
  [20, p.334]    24  |   58   2   0.11    |   18   2   0.05
  [15, fig.1]    24  |   29   1   0.03    |   12   1   0.01
  [20, p.318]    28  |  270   2   4.96    |   44   3   0.06
  Mietec-FD2     30  |   82   3   0.26    |   36   4   0.08
  Mietec-FDIS    36  |  101   5   372     |   58   5   81

In Table 1, our approach (Pairing & Chaining) is compared to the best heuristic pairing followed by Hwang's chaining algorithm (Pairing + Chaining). The number of chains is reduced in some cases. CPU time increases as more pairing possibilities are examined. When a circuit allows only one pairing possibility, the number of possible abutments, chains and execution times are the same in both cases.

Table 2
  Circuit       #ch | Upper bound of [1]: #nodes  time(s) | New upper bound: #nodes  time(s)
  [1, fig.1]     1  |      8   <0.01   |      8   <0.01
  [20, p.198]    3  |     53    0.01   |     36   <0.01
  [14, fig.9]    3  |     27   <0.01   |     25   <0.01
  Iscas c17      3  |    921    0.45   |    234    0.21
  [20, p.334]    2  |   1794    1.56   |   1440    1.49
  [15, fig.1]    1  |     76    0.03   |     76    0.03
  [20, p.318]    2  | 150485    211    |  19666    37.3
  Mietec-FD2     3  |    944    1.30   |    611    1.10
  Mietec-FDIS    5  | 316027    372    |  24438    81

Table 2 shows the effect of the new upper bound. In this case only branches with an upper bound lower than the number of abutments of the best solution found so far have been pruned, in order to obtain all optimal solutions. In some cases the number of nodes is reduced drastically. Execution time is proportional to the size of the search tree.

6. CONCLUSIONS

A complete CAD system for performance-driven layout synthesis has been outlined. This novel approach takes constraints into account all along the synthesis process. A new algorithm that combines pairing and chaining in one single step has been presented. Although a heuristic pairing before the chaining process can achieve very good results, optimal solutions are not always reached. Benchmark circuits confirm the superiority of our approach in these cases. Our pairing & chaining algorithm is based on that of Hwang [1]. To introduce the new features, some definitions have been extended and new ones have been presented. A new upper bound on the number of possible abutments has also been presented. This new upper bound reduces the search tree, and therefore the execution times and memory requirements. Important differences in tree size have been reported, especially when searching for all possible solutions.

REFERENCES
[1] C. Y. Hwang et al., "A Fast Transistor-Chaining Algorithm for CMOS Cell Layout," IEEE Trans. CAD, vol. 9, no. 7, pp. 781-786, July 1990.
[2] R. K. Brayton, G. D. Hachtel, A. L. Sangiovanni-Vincentelli, "Multilevel Logic Synthesis," Proc. of the IEEE, vol. 78, Feb. 1990.
[3] H. Vaishnav, M. Pedram, "Logic Extraction Using Fanin Ranges," Proc. of PATMOS, 1995.
[4] S. Iman, M. Pedram, "Logic Extraction and Factorization for Low Power," Proc. of the 32nd DAC, 1995.
[5] H. Vaishnav, M. Pedram, "Minimizing the Routing Cost During Logic Extraction," Proc. of the 32nd DAC, 1995.
[6] H. Y. Chen et al., "iCOACH: A Circuit Optimization Tool for CMOS High-Performance Circuits," ICCAD 88, pp. 372-375.
[7] H. R. Lin et al., "Cell Height Driven Transistor Sizing in a Cell Based Module Design," Proc. EDAC 94, pp. 425-429.
[8] F. Moraes et al., "A Transparent Macrocell Layout Methodology," Proc. PATMOS 1993, pp. 42-54.
[9] C. Tsareff et al., "An Expert System Approach to Parametrized Module Synthesis," IEEE Circuits and Devices Magazine, Jan. 1988, pp. 28-35.
[10] C. Y. Hwang et al., "An Efficient Layout Style for Two-Metal CMOS Leaf Cells and Its Automatic Synthesis," IEEE Trans. CAD, March 1993, pp. 410-424.
[11] B. S. Carlson, S. Lee, "Delay Optimization of Digital CMOS VLSI Circuits by Transistor Reordering," IEEE Trans. on CAD, Oct. 1995.
[12] E. Musoll and J. Cortadella, "Optimizing CMOS Circuits for Low Power Using Transistor Reordering," ED&TC 96, pp. 219-223.
[13] Youssef G. Saab and Vasant B. Rao, "Combinatorial Optimization by Stochastic Evolution," IEEE Trans. on CAD, April 1991.
[14] T. Uehara, W. vanCleemput, "Optimal Layout of CMOS Functional Arrays," IEEE Trans. Comput., vol. C-30, pp. 305-312, May 1981.
[15] S. Wimer et al., "Optimal Chaining of CMOS Transistors in a Functional Cell," IEEE Trans. CAD, vol. 6, September 1987.
[16] R. Nair et al., "Linear-time algorithms for optimal CMOS layout," Proc. Int. Workshop on Parallel Computing & VLSI, pp. 327-338, May 1984.
[17] D. Hill, "Sc2: A Hybrid Automatic Layout System," in Proc. Int. Conf. on Computer-Aided Design, Nov. 1985, pp. 172-174.
[18] R. L. Maziasz and J. P. Hayes, "Layout Optimization of Static CMOS Functional Cells," IEEE Trans. CAD, vol. 9, pp. 708-719, July 1990.
[19] Nakagaki et al., "Fast Optimal Algorithm for the CMOS Functional Cell Layout Based on Transistor Reordering," ISCAS 92, pp. 2116-2119.
[20] N. Weste and K. Eshraghian, Principles of CMOS VLSI Design, Reading, MA: Addison-Wesley, 1985.
CLOCK-DELAYED DOMINO FOR ADDER AND COMBINATIONAL LOGIC DESIGN Gin Yee and Carl Sechen Department of Electrical Engineering, Box 352500 University of Washington, Seattle, WA 98195-2500 gsyee@twolf4.ee.washington.edu, sechen@twolf4.ee.washington.edu
ABSTRACT
An innovative dynamic logic family, clock-delayed (CD) domino, was developed to provide gates with either inverting or non-inverting outputs, together with the high speed and compactness of dynamic logic. The characteristics of CD domino logic are demonstrated in two carry lookahead adder designs and two MCNC combinational logic benchmark circuits. The CD domino designs are compared to designs using static CMOS and standard domino logic. A circuit design tool was developed to automate the design of CD domino circuits. Simulations show a 32-bit CD domino adder comprised of four 8-bit full adders to be 30% faster than a 32-bit standard domino adder, and a 32-bit CD domino adder comprised of a single 32-bit block full adder to be 45% faster. In the combinational logic benchmark comparisons, complex inverting and non-inverting gates were used to implement C1355 and C3540. The CD domino circuits were 22% and 43% faster than their static CMOS counterparts for C1355 and C3540, respectively.
I. INTRODUCTION
Domino logic has become the logic family of choice for high speed and compact circuits, as demonstrated by its use in Intel's Pentium Pro [1], Digital's Alpha [2], and a host of other state-of-the-art processors. The reduced input capacitance and use of nMOS logic transistors make domino circuits faster and smaller than their static counterparts [3]. However, one of domino logic's major shortcomings is its monotonic nature, which provides only non-inverting logic [3]. This inflexibility of standard domino makes synthesis and general circuit design more complicated than with slower and larger static logic gates. While dynamic logic families with inverting and non-inverting outputs are known, most require generating both polarities of the output [4][5][6] or using latches [7]. A previous attempt at providing non-dual-rail inverting or non-inverting domino logic appears unsuccessful [8]. The method of delaying the precharge-evaluation clock is currently used by Intel [1] and in wave-domino logic [9], but is not used to provide either
inverting or non-inverting gates. CD domino logic has the design flexibility of static logic, providing either inverting or non-inverting logic functions, but without using dual-rail logic, latches, or pipelining. This allows CD domino gates to retain the high speed, compactness, and high fanin gates common in dynamic logic families. CD domino was used in the design and simulation of two 32-bit carry lookahead (CLA) adders, and in two MCNC combinational logic benchmark circuits. The CLA adder was chosen to illustrate the use of fast, high fanin CD domino gates versus standard domino and static logic in a datapath design. Likewise, benchmark circuits were implemented with CD domino to demonstrate its ease of use in the synthesis and design of combinational logic circuits. The next two sections discuss adder and domino logic design issues. This is followed by a description of CD domino, a design methodology for combinational logic synthesis using CD domino, and a CAD tool developed for automating the design of CD domino circuits. Next is the design of two fast 32-bit CLA adders which take advantage of the flexibility and high fanin gates of CD domino. In section VII, the CD domino adders are compared to static and standard domino implementations of the 32-bit CLA adder. In section VIII, the MCNC benchmark circuits and their simulation results are presented. The conclusion is the subject of section IX. II. ADDER DESIGN ISSUES
The theme of this paper is to demonstrate the speed improvements that are possible for a CLA adder design when using high fanin and inverting CD domino gates. The intent is not to design the fastest known adder, which is why techniques such as the carry-select method [10] were not used in conjunction with the CLA technique. This is also why the focus is not on comparing this approach to other technologies for adder designs, such as pass transistor logic or carry save techniques [10]. When using static or standard domino logic, a common 32-bit CLA adder design uses eight 4-bit full adder (FA) blocks with two CLA logic levels, as shown in Fig. 1. In the equations for the adder designs, a + b represents a OR b, a ⊙ b represents a XNOR b, a ⊕ b represents a XOR b, the AND function is represented by juxtaposition, and a̅ is the NOT of a. The FA logic equations for the first level are given by (1), while (2) and (3) give the CLA logic equations for the second and third levels [11].

Fig. 1. Block diagram of 32-bit CLA adder.

    gi = ai bi,   pi = ai ⊕ bi,   si = pi ⊕ ci,
    c0 = cin,   c1 = g0 + p0 cin,
    c2 = g1 + p1 g0 + p1 p0 cin,
    c3 = g2 + p2 g1 + p2 p1 g0 + p2 p1 p0 cin                               (1)

    P0 = p3 p2 p1 p0,  ...,  P3 = p15 p14 p13 p12,
    G0 = g3 + p3 g2 + p3 p2 g1 + p3 p2 p1 g0,  ...,
    G3 = g15 + p15 g14 + p15 p14 g13 + p15 p14 p13 g12,
    C1 = G0 + P0 cin,   C2 = G1 + P1 G0 + P1 P0 cin,
    C3 = G2 + P2 G1 + P2 P1 G0 + P2 P1 P0 cin                               (2)

    P0^x = P3 P2 P1 P0,   P1^x = P7 P6 P5 P4,
    G0^x = G3 + P3 G2 + P3 P2 G1 + P3 P2 P1 G0,
    G1^x = G7 + P7 G6 + P7 P6 G5 + P7 P6 P5 G4,
    C1^x = G0^x + P0^x cin,   C2^x = G1^x + P1^x G0^x + P1^x P0^x cin       (3)

The above design uses 4-bit FA blocks due to the speed limitations imposed by the number of series transistors required for higher carry bit logic in standard domino and static logic. A CLA design using three levels of hierarchy (the bottom FA level and two CLA logic levels) has a longest path of 12 simple gates, four for each level, as indicated in (1), (2) and (3). For example, the first CLA logic block, given by (2), has a longest path of two gates for G3 and two more for C3.
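A behavioral C model of equations (1) and (2) for one 4-bit group is shown below. It illustrates the logic only and says nothing about the gate-level (static, domino or CD domino) realization; the type and function names are illustrative.

    #include <stdio.h>

    /* Per-bit generate/propagate and the lookahead carries of one 4-bit group,
       following the form of (1) and (2); G and P feed the next CLA level. */
    typedef struct { int G, P, c[5]; } Group4;

    Group4 cla_group4(const int a[4], const int b[4], int cin)
    {
        Group4 r;
        int g[4], p[4], i;

        for (i = 0; i < 4; i++) {
            g[i] = a[i] & b[i];
            p[i] = a[i] ^ b[i];
        }
        r.c[0] = cin;
        r.c[1] = g[0] | (p[0] & cin);
        r.c[2] = g[1] | (p[1] & g[0]) | (p[1] & p[0] & cin);
        r.c[3] = g[2] | (p[2] & g[1]) | (p[2] & p[1] & g[0]) | (p[2] & p[1] & p[0] & cin);
        r.c[4] = g[3] | (p[3] & g[2]) | (p[3] & p[2] & g[1]) | (p[3] & p[2] & p[1] & g[0])
                      | (p[3] & p[2] & p[1] & p[0] & cin);
        r.G = g[3] | (p[3] & g[2]) | (p[3] & p[2] & g[1]) | (p[3] & p[2] & p[1] & g[0]);
        r.P = p[3] & p[2] & p[1] & p[0];
        return r;
    }

    /* Sum bits: s_i = p_i XOR c_i. */
    void cla_sums(const int a[4], const int b[4], const int c[5], int s[4])
    {
        int i;
        for (i = 0; i < 4; i++)
            s[i] = (a[i] ^ b[i]) ^ c[i];
    }

    int main(void)
    {
        int a[4] = { 1, 1, 1, 0 };   /* a = 7 (a[0] is the LSB) */
        int b[4] = { 1, 0, 1, 0 };   /* b = 5                    */
        int s[4];
        Group4 r = cla_group4(a, b, 0);
        cla_sums(a, b, r.c, s);
        printf("sum = %d%d%d%d, carry out = %d\n", s[3], s[2], s[1], s[0], r.c[4]);
        return 0;
    }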
III. DOMINO LOGIC DESIGN ISSUES

Fast and robust gates can be designed by using extra precharge transistors and a keeper, as Fig. 2 shows [7]. While extra precharge transistors and the keeper increase the gate delay, they also reduce the effects of charge sharing, noise and coupling parasitics, and they can be minimal in size and take up little area in the layout.

Fig. 2. Transistor level domino 3-input AND gate.

Standard domino gate outputs precharge low and, during the evaluation phase, can either stay low or switch from low to high [3]. The inverter required at the output of cascaded standard domino logic circuits thus allows only non-inverting gates. This constraint makes designing datapath circuits more complicated, and synthesizing combinational logic circuits very difficult, compared to using static gates. For example, the XOR gate common in adder designs is more difficult to implement in standard domino because both polarities of the inputs are required by the XOR. This can be dealt with by using dual-rail domino, latches in a pipeline, clocked cascade voltage switch logic (CVSL) [5], latched domino [6], NORA logic [4], or zipper domino [12]. While dual-rail, CVSL and latched domino provide inverting and non-inverting outputs, the area cost of a gate is about twice that of a standard domino gate, since both the inverting and non-inverting outputs are generated. Also, some of these types of dynamic logic do not have the full range of flexibility in logic design and synthesis that static gates provide. For example, zipper and NORA circuits generate inverting logic by alternating nMOS and pMOS dynamic gates.

IV. CLOCK-DELAYED DOMINO

Each clock-delayed (CD) domino gate is composed of a dynamic gate and a clock-delay logic device (CDLD), if necessary, as shown in Fig. 3. The dynamic gate can be either non-inverting (domino type), as in Fig. 2, or inverting, as shown in Fig. 4. For OR/NOR gates, extra precharge transistors are not needed since charge sharing within the gates does not occur.

Fig. 3. A CD domino gate with dynamic gate and clock-delay logic.
A CD domino gate may also have a CDLD. The CDLD, shown in Fig. 5, is a simple circuit for delaying the clock input signal. The delay is set equal to the worst-case pull-down delay of the corresponding dynamic gate, plus a margin for variations in the fabrication process and for signal delay due to wire routing. Because the data output must arrive at the next gate level before the clock output signal, the margin allowed must take into account variations in wire length, parallel and cross-coupled capacitance with other wires, and the fabrication variations of the wires and devices. The margin can be minimized by routing the data and clock output wires in parallel, since they go to the same gates, as discussed later in this section.

Fig. 4. Transistor level CD domino 8-input NOR gate.

Fig. 5. CDLD circuit; the number of transmission gates and their sizes set the delay value.

In simulations of the CD domino designs discussed in later sections, a margin between 10% and 20% was chosen for the CDLD's. Fig. 6 shows the clock input signal (clk) to a complex gate, the clock output from the CDLD of the gate (clkout1), the data output of the complex gate (out1), and the data output of the gate driven by out1 (out2). The margin was set at 10%, measured from the vdd/2 to vdd/2 points of out1 and clkout1, and 1µm MOSIS process parameters were used.

Fig. 6. Simulation waveforms of CD domino gates; out2 briefly loses a little charge due to switching at the gate driven by out2, but the node is kept high by the keeper.

As shown in Fig. 5, the CD domino gate takes the clock and data inputs, and provides the data output and clock output to the next gate level. Each gate gets its clock from the clock output of its slowest input. This guarantees that all inputs are stable when a CD domino gate goes into the evaluate phase. Thus, the data hazard in standard domino caused by high-to-low input transitions during the evaluate phase is avoided, and inverting dynamic gates can be used, as shown in Fig. 7. If a gate's output is not the slowest input to any gate at the next gate level, then a CDLD is not needed. Otherwise, a CDLD is used and the clock and data outputs are routed to the same gate(s) driven by the data output.

Fig. 7. Use of CD domino gates in logic designs.

In the most basic clocking scheme, only the clock from the slowest gate at each gate level needs to have a CDLD. The delayed clock signals from the previous gate level would be used by the gates at the next level. This basic CD domino clocking scheme is demonstrated in a wave-pipelined method, as was done by Lien and Burleson [9], and shown in Fig. 8. The gate levels of a logic design can be divided into pipelined stages, and all gates at each stage share the same delayed clock. However, the domino gates in CD domino can be inverting or non-inverting, which was not considered by Lien and Burleson. Also, a more general clock tree can be designed with each gate using the clock from its slowest input, rather than the same clock for the entire gate level.
This approach shifts away from the timing constraints of pipelining and allows logic blocks to be synthesized more easily. At the cost of a few extra CDLD's, the outputs of the logic block are evaluated faster, since each gate does not have to wait for the slowest gate's clock from the previous gate level, just the clock from its slowest input gate.
Fig. 8. Simplest clocking scheme for CD domino, as used in wave-domino [9].

The power consumption of CD domino circuits would be the same as that of wave-domino if the minimum number of CDLD's is used, as in the pipelined scheme in Fig. 8. However, the general clock tree for CD domino would have improved power dissipation over wave-domino, since delaying the clocks gives a more even power distribution and less peak power dissipation [9].

Since high fanin and high speed OR gates are possible with standard nMOS domino, CD domino also provides high fanin NOR gates, as shown in Fig. 4. Likewise, high fanin AND and NAND gates can be implemented using CD pMOS domino. The simulated gate delays of high fanin NOR gates are shown in Fig. 9, which plots fanin vs. gate delay for CD domino NOR gates, each with a load of 3 dynamic gates. As the plot indicates, the gate delay increases linearly with fanin by a factor of 0.019 ns/fanin using 1µm MOSIS simulation parameters. The usefulness of high fanin gates can be seen not only in programmable logic arrays (PLA's) and decoders, but also in combinational and datapath logic, because of the high speed and compact layouts obtained with high fanin dynamic gates.

Fig. 9. Plot showing the increase in gate delay with fanin for CD domino NOR gates (1µm MOSIS).

With high fanin AND/NAND/OR/NOR gates, any combinational logic function can be implemented with a two-gate-level circuit to optimize for speed, or with multiple gate levels to minimize area, by using a standard logic synthesis tool [13]. A design methodology for CD domino using standard synthesis tools is presented in the next section.

V. COMBINATIONAL LOGIC SYNTHESIS FOR CLOCK-DELAYED DOMINO
In designing combinational logic blocks, an efficient method is to input the logic netlist into an industry-standard synthesis tool and have it map the reduced netlist to a library of gates. This approach does not work for standard domino logic and some other dynamic logic families with most synthesis tools currently available, because the tools require both inverting and non-inverting gate outputs. While synthesis tools have been developed for domino, they operate with the constraint of using only non-inverting logic [5]. By providing either polarity of any logic function, CD domino gates can be used by synthesis tools developed for static logic gates by simply replacing the static gate library with a CD domino gate library. Thus, combinational logic blocks can be implemented with the speed of domino logic, but without its functional limitations. Once a combinational logic netlist has been generated, CDLD's have to be inserted between the clocks of cascaded dynamic gates. A circuit timing analysis and CDLD insertion tool was developed to automatically insert clock-delay devices between the clocks of cascaded dynamic gates and route the clock signals accordingly. The tool determines the slowest input of each gate and uses lookup tables to determine the worst-case gate delay plus a margin for variations due to chip fabrication and
wire routing. A CDLD with the desired delay is then inserted. One lookup table contains the simulated worst-case gate delays for all of the gates, simple or complex, used in the adder and combinational logic circuit designs, for fanouts ranging from one to 20. The other lookup tables consist of delays for the CDLD's, with the number of transmission gates in the devices ranging from zero to four, the device sizes ranging from 1µm to 40µm, and the fanout ranging from one to 40. For example, the time delay achievable with a CDLD ranges from 0.24ns to 23.7ns for the 1µm MOSIS process.
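A minimal sketch of the CDLD insertion step is given below, assuming gates are processed in topological order and that each gate's worst-case delay has already been read from the lookup tables. The record layout and the names are assumptions made for illustration, not the actual tool's data structures.

    #include <stdio.h>

    #define MAX_FANIN 20

    typedef struct {
        int    n_in;
        int    in[MAX_FANIN];   /* indices of driving gates (primary inputs not listed) */
        double delay;           /* worst-case gate delay from the lookup table          */
        double arrival;         /* computed data-output arrival time                    */
        int    needs_cdld;      /* set if this gate's clock must go through a CDLD      */
        double cdld_delay;
    } Gate;

    /* A gate receives its clock from the clock output of its slowest driver,
       so that driver needs a CDLD; the CDLD delay is the driver's worst-case
       delay plus a 10-20% margin. */
    void insert_cdlds(Gate *g, int n, double margin)
    {
        int i, j;
        for (i = 0; i < n; i++) {
            int slowest = -1;
            double latest = 0.0;
            for (j = 0; j < g[i].n_in; j++) {
                int d = g[i].in[j];
                if (g[d].arrival > latest) { latest = g[d].arrival; slowest = d; }
            }
            g[i].arrival = latest + g[i].delay;
            if (slowest >= 0) {
                g[slowest].needs_cdld = 1;
                g[slowest].cdld_delay = g[slowest].delay * (1.0 + margin);
            }
        }
    }

    int main(void)
    {
        Gate g[3] = {{0}};
        g[0].delay = 0.5;                                   /* fed by primary inputs */
        g[1].delay = 0.9;
        g[2].n_in = 2; g[2].in[0] = 0; g[2].in[1] = 1; g[2].delay = 0.7;
        insert_cdlds(g, 3, 0.15);
        printf("gate 1 needs CDLD: %d, CDLD delay %.2f ns\n",
               g[1].needs_cdld, g[1].cdld_delay);
        return 0;
    }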
VI. TWO 32-BIT CLA ADDER DESIGNS

Using CD domino gates, two fast CLA adders were designed to demonstrate CD domino's use of inverting, high fanin and high speed dynamic gates. The first adder, ADD4x8, uses four 8-bit FA blocks with a single level of CLA logic, as shown in Fig. 10. The CLA logic block calculates the lookahead carry values for the four 8-bit FA blocks, which gives a longest path of eight basic gates for the entire design. The FA and CLA logic equations are the same for the generate, propagate, and sum values as in (1), and the carry values (ci) and CLA logic are given by (4) and (5).
Fig. 10. Block diagram of ADD4x8 (four 8-bit FA blocks and a 4-bit CLA block whose C4 output is the carry out of the adder).

    c0 = cin,   c1 = g0 + p0 cin,   ...,
    c6 = g5 + p5 g4 + ... + p5 p4 p3 p2 p1 p0 cin,
    c7 = g6 + p6 g5 + ... + p6 p5 p4 p3 p2 p1 p0 cin                                  (4)

    P0 = p7 p6 p5 p4 p3 p2 p1 p0,   ...,   P3 = p31 p30 p29 p28 p27 p26 p25 p24,
    G0 = g7 + p7 g6 + p7 p6 g5 + ... + p7 p6 p5 p4 p3 p2 p1 g0,   ...,
    G3 = g31 + p31 g30 + ... + p31 p30 p29 p28 p27 p26 p25 g24,
    C1 = G0 + P0 cin,   C2 = G1 + P1 G0 + P1 P0 cin,   ...,
    C4 = G3 + P3 G2 + P3 P2 G1 + P3 P2 P1 G0 + P3 P2 P1 P0 cin                        (5)

As (4) and (5) show, high fanin OR and AND gates would be used in the adder design. However, with the availability of NOR gates, ADD4x8 can be optimized with the faster NOR gates, as indicated in (6) and (7). The only modification necessary with NOR gates is that all inputs and intermediate logic (e.g., carry bits) are inverted; however, the sum outputs remain unaffected. Also, all of the g and G values are available in both polarities. (In (6)-(8), x' denotes the complement of x.)

    gi' = ai' + bi',   pi = ai ⊕ bi,   si = pi ⊕ ci,
    c0' = cin',   c1 = g0 + (p0' + cin')',   ...,
    c7 = g6 + (p6' + g5')' + ... + (p6' + p5' + p4' + p3' + p2' + p1' + p0' + cin')'  (6)

    P0' = p7' + p6' + p5' + p4' + p3' + p2' + p1' + p0',   ...,
    P3' = p31' + p30' + p29' + p28' + p27' + p26' + p25' + p24',
    G0 = g7 + (p7' + g6')' + ... + (p7' + p6' + p5' + p4' + p3' + p2' + p1' + g0')',   ...,
    G3 = g31 + (p31' + g30')' + ... + (p31' + p30' + p29' + p28' + p27' + p26' + p25' + g24')',
    C1 = G0 + (P0' + cin')',   ...,
    C4 = G3 + (P3' + G2')' + ... + (P3' + P2' + P1' + P0' + cin')'                    (7)

The high bit count of the FA and CLA logic blocks is possible due to the high speed, high fanin gates provided by CD domino. Because inverting and non-inverting gates can be used together, compact dynamic XOR and XNOR gates [15] are also used in the design, as indicated in (6). The second adder design, ADD1x32, takes yet further advantage of the high fanin capabilities of dynamic logic by implementing the carry logic directly in two gate delays. ADD1x32 uses the same propagate, generate and sum bits as in (6), and the carry logic is implemented as in (8).

    c0' = cin',   c1 = g0 + (p0' + cin')',
    c2 = g1 + (p1' + g0')' + (p1' + p0' + cin')',   ...,
    c31 = g30 + (p30' + g29')' + ... + (p30' + ... + p0' + cin')',
    c32 = g31 + (p31' + g30')' + ... + (p31' + ... + p0' + cin')'                     (8)

As (6) and (8) show, ADD1x32's longest path is now four simple gate delays: one for the propagate/generate values, two for the carry values, and another for the sum calculation. Thus, extra CLA logic blocks at the next level are not needed. The fanout of all gates in the 32-bit FA block design was limited to 32 gates, which was necessary since p16 would otherwise have had to drive 257 gates.

VII. COMPARISON OF ADDER DESIGNS
Versions of the 32-bit CLA adder, described in section II, were designed with static and standard domino
logic gates, and compared to ADD4x8 and ADD1x32. Table 1 compares the simulation results for the worst-case 32-bit addition, and Table 2 compares various transistor count statistics of the four designs as an area comparison.

Table 1: Worst case 32-bit addition delay comparisons
  32-bit adder type  | Delay [ns], 1µm process | Delay [ns], 0.25µm process
  Static             |          7.9            |          2.5
  Standard domino*   |          6.7            |          2.0
  ADD4x8             |          4.6            |          1.4
  ADD1x32            |          3.5            |          1.1
Table 2: 32-bit adder transistor count comparisons
  32-bit adder type  | Logic transistors | Total gate count
  Static             |       2594        |       291
  Standard domino*   |       1844*       |       291
  ADD4x8             |       1620        |       423
  ADD1x32            |       8737        |       897
*static XOR gates used to simplify the design; otherwise the transistor count is doubled using dual-rail gates

In the adder simulations, parameters for a 1µm MOSIS process and a state-of-the-art 0.25µm process from a major semiconductor manufacturer were used. Device widths of 8µm for nMOS and 20µm for pMOS transistors were used for the gate logic, and widths of 8µm and 4µm were used for the pMOS precharge and keeper devices, respectively. The extra precharge devices were also 4µm, and the nMOS evaluate transistor was 8µm. The device dimensions were chosen to compare the speed and functionality of the adders, rather than to optimize any design for speed. Also, the design of the standard domino adder used static XOR gates for the sum calculations, which simplified the design by requiring half of the transistors needed for a standard domino design using dual-rail gates. Thus the total transistor count for a dual-rail standard domino adder would be much greater than the count for ADD4x8. The worst-case addition used for the simulations of the CLA designs is adding FFFFFFFF(hex) to 00000000(hex) with a carry-in of 1. The rise and fall times of the inputs for the 1µm MOSIS design simulations were 0.25ns, with a power supply of 5V, and 0.10ns for the 0.25µm design simulations, with a 2.3V power
supply. All adder outputs were loaded with the equivalent of three gates of the same circuit family.

VIII. BENCHMARK COMPARISONS

Two MCNC combinational logic benchmark circuits, C1355 and C3540, were implemented with complex static and CD domino gates. Because standard domino does not provide inverting functions, it was not used in the comparison. The device sizes for the static and CD domino gates are the same as those used in the adder simulations. Simulations of the benchmark circuits were done using 1µm MOSIS parameters, and measurements were taken for the longest path delay. The static circuits were obtained by synthesizing the benchmarks using SIS [13] and Catamount [16]. The CD domino circuits were obtained by replacing the static gates with their CD domino gate counterparts (i.e., the nMOS logic of the CD domino gates is the same as the nMOS logic in the static gates). Thus the same logic netlist was used for both the static and CD domino circuits, and the CD domino circuits have half the number of logic devices used in their static counterparts. The CAD tool described in section V was used to automatically insert CDLD's and connect the delayed clock signals within the CD domino netlist. In both benchmarks, the maximum number of series transistors allowed for the pMOS and nMOS logic blocks was seven; however, very few gates actually reached this limit, and none in the longest path. The C1355 benchmark's longest path has 17 gates, the entire circuit has 342 gates, and Table 3 shows the simulation results for the longest path delay.

Table 3: C1355 Benchmark Comparisons
  Logic family | Longest path delay [ns] | Percent faster vs. static
  Static       |          5.69           |           0
  CD domino    |          4.46           |          21.6
In the C3540 benchmark circuit, the longest path has 25 gates, the entire circuit has 465 gates, and Table 4 shows the simulation results for the longest path delay.

Table 4: C3540 Benchmark Comparisons
  Logic family | Longest path delay [ns] | Percent faster vs. static
  Static       |         14.71           |           0
  CD domino    |          8.38           |          43.0
IX. CONCLUSION
As the comparisons in sections VII and VIII show, the CD domino circuits offer significant speed improvements over their standard domino and static CMOS counterparts. The CD domino ADD4x8 adder even required fewer logic transistors than the standard domino adder. CD domino logic provides the high speed, high fanin, and compact gates common in dynamic logic, but with the flexibility in design and synthesis of static logic. This flexibility is especially demonstrated in the combinational logic benchmark circuits, which maintained the speed advantage seen in the adder designs. Robust and compact CD domino circuits can be designed with careful CDLD margins, and extra precharge and keeper transistors. Peak power consumption of CD domino circuits is less than in domino logic, and the power dissipation is more evenly distributed due to the delaying of the clock signals. ACKNOWLEDGMENTS
The authors wish to thank the Semiconductor Research Corporation, the Washington Technology Center, the Center for the Design of Analog/Digital ICs (CDADIC), Intel Corporation, Digital Equipment Corporation, and LSI Logic for graciously providing financial support. The authors also wish to thank David Guan for providing the benchmark netlists and Tao Yao for his work on the circuit timing analysis tool.
REFERENCES
[1] R. Colwell and R. Steck, "A 0.6µm BiCMOS processor with dynamic execution," 1995 IEEE Int. Solid-State Circuits Conf., 1995, pp. 176-177. A more detailed description can be found on the World Wide Web at http://pentium.intel.com/procs/p6/isscc.
[2] B. Benschneider, A. Black, et al., "A 300-MHz 64-b quad-issue CMOS RISC microprocessor," IEEE J. Solid-State Circuits, vol. 30, no. 11, pp. 1203-1214, November 1995.
[3] R. H. Krambeck, C. M. Lee, H. S. Law, "High-speed compact circuits with CMOS," IEEE J. Solid-State Circuits, vol. SC-17, no. 3, pp. 614-619, June 1982.
[4] N. Goncalves and H. De Man, "NORA: a racefree dynamic CMOS technique for pipelined logic structures," IEEE J. Solid-State Circuits, vol. SC-18, no. 3, pp. 614-619, June 1983.
[5] L. Heller and W. Griffin, "Cascade voltage switch logic: a differential CMOS logic family," IEEE Int. Solid-State Circuits Conf., 1984, pp. 16-17.
[6] J. Pretorius, A. Shubat, and C. Salama, "Latched domino CMOS logic," IEEE J. Solid-State Circuits, vol. SC-21, no. 4, pp. 514-522, August 1986.
[7] V. Friedman and S. Liu, "Dynamic logic CMOS circuits," IEEE J. Solid-State Circuits, vol. SC-19, no. 2, pp. 263-266, April 1984.
[8] C. Zhang, "An improvement for domino CMOS logic," Computer and Electrical Engineering, vol. 13, no. 1, pp. 53-59, 1987.
[9] W. Lien and W. P. Burleson, "Wave-domino logic: theory and applications," IEEE Trans. Circuits and Systems, vol. 42, no. 2, pp. 78-91, February 1995.
[10] N. Weste and K. Eshraghian, Principles of CMOS VLSI Design, Reading, MA: Addison-Wesley, 1993.
[11] J. Hennessy and D. Patterson, Computer Organization and Design, San Francisco: Morgan Kaufmann, 1993.
[12] C. Lee and E. Szeto, "Zipper CMOS," IEEE Circuits and Devices Magazine, pp. 10-16, May 1986.
[13] E. Sentovich, K. Singh, L. Lavagno, C. Moon, R. Murgai, A. Saldanha, H. Savoj, P. Stephan, R. Brayton, and A. Sangiovanni-Vincentelli, "SIS: a system for sequential circuit synthesis," Technical Report UCB/ERL M92/41, University of California, Berkeley, CA, May 1992.
[14] G. De Micheli, "Performance-oriented synthesis of large-scale domino CMOS circuits," IEEE Trans. Computer-Aided Design, vol. CAD-6, no. 5, pp. 751-765, September 1987.
[15] J. Uyemura, Circuit Design for CMOS VLSI, Boston: Kluwer Academic, 1993.
[16] T. Stanion, "Boolean algorithms for combinational synthesis and test generation," Ph.D. Dissertation, Yale University, May 1994.
Layout Design for Yield and Reliability K.P. Wang (1), M. Marek-Sadowska (1), W. Maly (2) 1. Department of Electrical and Computer Engineering, University of California, Santa Barbara 2. Department of Electrical and Computer Engineering, Carnegie Mellon University
ABSTRACT Charge accumulation occurring on long segments of the interconnect during IC manufacturing may cause damage to transistors' gate oxide. The chance of such accumulation (called the "antenna effect") is a strong function of interconnect geometry. This paper identifies the antenna effect produced by HVH (horizontal-vertical-horizontal) three-layer channel routing and proposes two methods to eliminate or reduce this effect. The results of numerical experiments illustrating the antenna effect produced by original and improved HVH three-layer channel routing are presented as well. 1 Introduction Modern IC technologies can nowadays deliver spectacular performance, very high packing density and many more desirable attributes of fabricated devices. There is, however, a price which must be paid for such achievements. Its most important components are the growing costs of manufacturing and design [4]. Both of them are consequences of the rapidly growing complexity of IC technologies and must be contained with all possible means. One of the avenues for containment of manufacturing costs is design for manufacturability (DFM). Properly executed DFM has the potential to minimize design-manufacturing mismatch, resulting in yield maximization and increased product reliability. This paper is focused on one such mismatch, resulting from the recently intensely investigated [10,11] so-called "antenna effect". The antenna effect is the process of charge accumulation occurring on long, floating segments of an incomplete interconnect network (called antennas) which are connected to transistors' gates only. Such accumulation occurs during various plasma based manufacturing operations (etching, ashing, etc.). Accumulated charges may result in oxide failures which degrade manufacturing yield and product reliability. In this paper we address the relationship between layout design and the antenna effect. In particular we study a 3-layer
channel routing wiring style. There are two common layering conventions for such channels: VHV (vertical-horizontal-vertical) and HVH (horizontal-vertical-horizontal). If no precautions are taken to avoid the formation of antennas, it may happen that up to 40% of the total wire length in an HVH routed channel contributes to the antenna effect. We will discuss a modification of a standard channel routing algorithm minimizing the antenna effect. In addition, we prove that the area upper bound for total elimination of the antennas in a channel routing problem (CRP) is two extra tracks over any HVH three-layer routing result. This paper is organized as follows. First a brief explanation of the antenna effect is given. Next a new objective function for a routing algorithm is formulated. After a brief discussion of general properties of the CRP, the antennas in the HVH three-layer channel are considered. Following this we discuss the area upper bound penalty for eliminating the antennas and the methods for eliminating or reducing the antennas in the HVH three-layer CRP. The CRP routing implementations and experimental results are shown in the later sections. 2 Plasma Induced Gate Oxide Damage Fine feature sizes of modern IC technologies can be achieved if, and only if, dry plasma based processes are applied. Plasma based processes have a tendency to charge conducting components of the fabricated structure. The mechanism of charging is not fully understood [9,10,11], but there exists experimental evidence indicating when such charging occurs and how it may affect the quality of thin SiO2. The antenna effects were investigated by Shone et al. [11] on specially designed test structures of oxide capacitors and transistors with different metal antenna area ratios. Using a different approach, Shin et al. [10] proposed a CV measurement to characterize plasma-etching induced damage. A model of oxide damage due to plasma etching was proposed by Shin et al. [9]. In summary, it was determined that charging appears to be a problem when poly and/or metal layers, not covered by a shielding layer of oxide, are exposed to plasma and are not connected to the substrate by previously formed p-n junctions. It was also found that charging becomes critical
only at such a stage of the process when the individual conducting connections no longer form one equipotential layer, i.e. at the end of the etching process. Careful analysis of a CMOS process flow reveals that the above conditions take place during: 1. the late stages of polysilicon etching; 2. the late stages of metal etching, but only for those segments of the formed network which are not connected to the source or drain regions; 3. the contact etching, but again only for contacts to "floating" (i.e. not yet connected to sources or drains) segments of the formed interconnect. The mechanism of the gate oxide damage is not very well understood either, but a first approximation explanation of its essence is relatively straightforward. The conducting elements of the partially formed interconnect collect charges from the plasma, causing current in the oxide - especially in the proximity of gate defects. Such a current may introduce more trap states, which in turn can amplify gate oxide currents. In the extreme case the above mechanism may cause early gate oxide breakdown. It can also affect the transistor threshold voltage. It was experimentally determined, for instance, that the charges trapped in the gate oxide decrease the breakdown voltage. The surface states also degrade such transistor characteristics as subthreshold slope, transconductance and device lifetime under hot electron stress. Shone et al. [11] indicated that the metal antenna amplifies the charging effects by a factor of the metal to thin gate oxide area ratio. 3 Design Rules for Antenna Effect Avoidance During the manufacturing process the connections are made in the following sequence:
- diffusion,
- poly,
- contacts (after oxide deposition) to diffusion and poly,
- consecutive metal layers and vias.

Each network to be connected is composed of a set of terminals, one of which is a driver and the others are receivers. At the beginning of the interconnect fabrication process the receiver type terminals are in poly and the driver type terminals are in diffusion. During the subsequent fabrication steps partial connections are made until each net becomes fully connected. Each incrementally added segment and contact of the interconnect may be connected to: drivers only, receivers only, both a driver and some/all of the desired receivers, a group of receivers, or be floating, i.e., not connected to any driver or receiver yet. Any set of interconnect segments connected to one or more receivers only forms an undesirable antenna. The risk of gate oxide damage is proportional to the amount of charge collected by the antenna and inversely proportional to the area of the gate oxide. One can conclude therefore that in order to reduce the negative impact of antenna effects the total antenna area has to be minimized. In the channel routing style, the areas of antennas are related to interconnect length. Therefore, it is natural to choose the total antenna length in the channel as the key component of a routing objective function that accounts for the antenna effect. Charging also occurs during the finishing stages of contact and via etching. Therefore the routing objective function should also account for the total number of vias and contacts in each antenna. Consequently, it is proposed in this paper that the objective function to be minimized during placement and routing takes the following form:

min Σ_i (w_r·A_i^r + w_c·N_i^c)    (1)

where A_i^r is the area of the wires of an unfinished partial connection i connecting only receivers, N_i^c is the number of contacts on such an unfinished subnet, and w_r and w_c are weights which allow their relative contributions to be included. Reduction of the total antenna length in the CRP may not be enough to prevent antenna related yield/reliability problems. The reason is that the minimum of (1) does not prevent the generation of antennas long enough to cause gate oxide damage. (Even a single very long antenna may be enough to cause damage to the device on the wafer.) Therefore, the second objective of the CRP discussed in this paper is minimization of the variance of the antenna length distribution.

In the subsequent sections of this paper we will show how to modify a three layer channel router to minimize both of the above objective functions.
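To make the objective concrete, the sketch below (our illustration, not code from the paper; the node/segment data model and the weights are assumptions) identifies the receiver-only subnets of a partially fabricated interconnect with a union-find pass and evaluates the cost of Eq. (1) as the weighted sum of their wire areas and contact counts.

#include <iostream>
#include <numeric>
#include <vector>

struct Segment { int u, v; double area; int contacts; };   // a fabricated piece of interconnect

struct DSU {                                                // union-find over interconnect nodes
    std::vector<int> parent;
    explicit DSU(int n) : parent(n) { std::iota(parent.begin(), parent.end(), 0); }
    int find(int x) { return parent[x] == x ? x : parent[x] = find(parent[x]); }
    void unite(int a, int b) { parent[find(a)] = find(b); }
};

// Cost of Eq. (1): wr and wc weight the antenna wire area and contact count.
double antennaCost(int nNodes, const std::vector<bool>& isDriver,
                   const std::vector<bool>& isReceiver,
                   const std::vector<Segment>& fabricated, double wr, double wc) {
    DSU dsu(nNodes);
    for (const Segment& s : fabricated) dsu.unite(s.u, s.v);

    std::vector<double> area(nNodes, 0.0);
    std::vector<int> contacts(nNodes, 0);
    std::vector<bool> hasDrv(nNodes, false), hasRcv(nNodes, false);
    for (const Segment& s : fabricated) {
        int r = dsu.find(s.u);
        area[r] += s.area;
        contacts[r] += s.contacts;
    }
    for (int v = 0; v < nNodes; ++v) {
        if (isDriver[v])   hasDrv[dsu.find(v)] = true;
        if (isReceiver[v]) hasRcv[dsu.find(v)] = true;
    }

    // A subnet touching one or more receivers but no driver is an antenna.
    double cost = 0.0;
    for (int v = 0; v < nNodes; ++v)
        if (dsu.find(v) == v && hasRcv[v] && !hasDrv[v])
            cost += wr * area[v] + wc * contacts[v];
    return cost;
}

int main() {
    // Toy net: node 0 is the driver, nodes 1-3 are receivers; after metal 2 only
    // receivers 1 and 2 are tied together, so that subnet is an antenna.
    std::vector<bool> drv = {true, false, false, false}, rcv = {false, true, true, true};
    std::vector<Segment> partial = {{1, 2, 12.0, 2}};
    std::cout << antennaCost(4, drv, rcv, partial, 1.0, 0.5) << "\n";   // prints 13
}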
4 Antenna Effect in Channel Routing Environment

4.1 Channel Routing Problem

The basic channel routing problem is formulated as follows. We are given a rectangular channel area that has vertical grid columns and horizontal tracks. Terminals of the nets to be connected are located in the columns, on the top and bottom sides of the channel. The terminals are assigned labels, and a group of terminals with the same label forms a net and is to be connected. One among the terminals of each net is the source of the signal, or driver. Drivers will be denoted by a subscript d. The remaining terminals of a net are receivers, denoted by a subscript r.
The connections between the terminals are to be fabricated using wire segments arranged on tracks on the given number of metal layers. Vias link wire segments of the same net which reside on different layers. Wires of different nets cannot overlap or intersect in the same layer. The objective of a channel router is to connect all the nets such that the height of the channel is minimized, i.e., the number of horizontal tracks is minimum. Figure 1 shows an example of channel routing.
Figure 1. An example of a CRP with one net (d: driver, r: receiver, *: terminal).
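As a small illustration of the formulation (the input encoding is assumed, not taken from the paper), a CRP instance can be captured by the top and bottom label rows alone. The sketch below computes the channel density referred to in the overview that follows, which lower-bounds the channel height for 2-layer/VHV routing and, divided by two, for HVH routing.

#include <algorithm>
#include <iostream>
#include <map>
#include <vector>

// Channel density: the maximum number of nets whose horizontal spans cross a column.
int channelDensity(const std::vector<int>& top, const std::vector<int>& bottom) {
    std::map<int, std::pair<int, int>> span;   // net label -> (leftmost, rightmost column)
    auto note = [&](int net, int col) {
        if (net == 0) return;                  // 0 marks an empty terminal position
        auto it = span.find(net);
        if (it == span.end()) span[net] = {col, col};
        else {
            it->second.first = std::min(it->second.first, col);
            it->second.second = std::max(it->second.second, col);
        }
    };
    for (int c = 0; c < (int)top.size(); ++c) { note(top[c], c); note(bottom[c], c); }

    int density = 0;
    for (int c = 0; c < (int)top.size(); ++c) {
        int crossing = 0;
        for (const auto& kv : span)
            if (kv.second.first <= c && c <= kv.second.second) ++crossing;
        density = std::max(density, crossing);
    }
    return density;
}

int main() {
    // Hypothetical 8-column channel; labels identify nets, 0 marks an empty slot.
    std::vector<int> top    = {1, 2, 0, 3, 2, 0, 3, 0};
    std::vector<int> bottom = {0, 1, 2, 0, 3, 1, 0, 2};
    int d = channelDensity(top, bottom);
    std::cout << "density = " << d                     // lower bound for 2-layer/VHV height
              << ", HVH lower bound = " << (d + 1) / 2 << "\n";
}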
The CRP is an NP-complete problem. Many heuristic algorithms have been proposed: left-edge and maze based (Yoeli [12] and Braun et al. [1]), greedy (Rivest and Fiduccia [7], and Ho et al. [3]), and graph based (Yoshimura [13], Chen and Kuh [2], Pitchumani and Zhang [6]). Many other algorithms are described in the book by Sherwani [8]. Suppose two layers of metal are available for interconnects. If we adhere to the conventional layering and assume that vertical wire segments are on one layer and horizontal segments are on the other, then placing the horizontal segments on Metal 1 and the vertical segments connecting to the terminals on Metal 2 eliminates the antennas almost completely. Metal 1 will not introduce any antennas since all the wire segments placed there will be floating. The only antennas will be introduced by the contacts to Metal 2. Such contacts are unavoidable in general, and we introduce the minimal number of them, just enough to bring the terminals up to Metal 2.
However, if in two layer channel routing Metal 1 is used for the vertical segments, then the partial connections to drivers on Metal 1 would create antennas. Since the channel height is typically much smaller than the channel length, the natural objective of channel height minimization would indirectly decrease the longest antenna.

The above discussion suggests that there is no problem with antennas in the 2-layer CRP. We will therefore focus on 3-layer routing. Before we address the antenna problem, we give a brief overview of channel routing related concepts. In a channel, the terminal positions influence the relative positions of horizontal wire segments of different nets. The non-overlap requirements for wires of different nets on the same layer are modeled by vertical and horizontal constraints. Two wires with overlapping horizontal spans cannot be placed on the same track, which is expressed by a horizontal constraint. If in a particular column some net n1 connects to the top terminal and net n2 to the bottom one, then the horizontal track of net n1 has to be above the track of net n2. This requirement is expressed by a vertical constraint between n1 and n2. The vertical constraints are captured by the vertical constraint graph (VCG), defined as follows: the vertices of the VCG correspond to the nets, and a directed edge from vertex u to vertex v exists if wire u has to be on a track above wire v. The maximum number of nets crossing a column in a channel is referred to as the channel density. In the case of 2-layer or VHV routing, the channel height is lower bounded by the channel density. In the HVH three layer CRP, the lower bound on channel height is the density divided by two. In the VHV channel routing style Metal 1 and Metal 3 carry the vertical wires and Metal 2 contains the horizontal wires. In the HVH case the horizontal wires are in Metal 1 and Metal 3 and the vertical wires are in Metal 2. We will analyze both cases from the point of view of antenna effect avoidance.

4.2 Antenna Effect in the VHV Model

We will now consider a multiterminal net to be routed in a VHV layered channel. Figure 2 shows such a net routed in the assumed model. Contacts A through F form a net; contact B is the driver and the other contacts are receivers. Vertical segments connecting to terminals B, E and F are in Metal 3; vertical segments connecting to terminals A, C and D are in Metal 1. The antenna is formed by the wire segments {AX, DY, CW, XZ}. This example illustrates that in the VHV model, if the driver of a net Na is connected to a vertical wire segment in Metal 3, then the subnet of Na composed of terminals connected to vertical segments in Metal 1 forms an antenna. Additional short antennas occur when terminals B, E and F are brought up to Metal 3. If the driver's connection is placed in Metal 1, then the only antennas which occur are the minimal unavoidable antennas caused by bringing up the Metal 3 terminals.
Fig. 2. A net routed in the VHV model.

In the VHV model one can minimize the antenna effect by insisting that for each net its driver is connected to vertical segments in Metal 1, or that all the vertical connections are made in Metal 3. If antennas as long as the channel height are tolerable, and the vertical connection to a driver of a net is in Metal 1, then any combination of receivers of this net can also be placed in Metal 1. These constraints are not very restrictive and typically do not cause any degradation in the channel height. This is because in the VHV model the savings over 2-layer routing are relatively small, since the channel density bound cannot be avoided and 2-layer routers typically achieve channel heights close to the density. The problem of routing the Metal 2 and Metal 3 connections can be treated here as a two layer channel.

4.3 Antenna Effect in the HVH Model
Fig. 3. Antennas in HVH three layer routing. Antennas: A1 = {DY, YU, CW, WV, ER}; A2 = {FZ}.

In the HVH model the horizontal wire segments occupy the Metal 1 and Metal 3 layers. When the Metal 1 connections are manufactured, all the horizontal segments are floating and do not yet contribute to any antennas. After Metal 2 and the contacts to Metal 1 are formed, all connected subnets composed of receivers only form antennas.
In particular, all the vertical segments connected to receivers and not connected to any horizontal segments are antennas. Fig. 3 illustrates an example of the same net considered in the earlier figures, now routed in the HVH style. After layer 2 has been fabricated, the net is not fully connected and is composed of 3 disjoint subnets. The subnet connecting the terminals A and B does not form an antenna, since B is a driver. The subnet connecting C, D, and E forms an antenna, as does the wire segment connected to F. The weighted area of their wires and the weighted number of vias on them will contribute to the antenna cost. We have implemented several HVH three layer channel routers. The first router (referred to as R1) optimizes only the channel height and has no explicit consideration for antennas. For the results obtained by R1 we calculate the antenna costs and their distribution. The routing results obtained by R1 serve as a starting point for the antenna elimination method. Again, we determine the antenna mitigation cost imposed by this approach. We also developed a second router, R2, which uses a layering strategy resulting in short and evenly distributed antennas. The overall results obtained by R2 are better than those obtained by R1 followed by antenna elimination.

5 Methods to Eliminate or Reduce the Antenna Effects

Herein, two methods to eliminate or reduce the antenna effect are proposed. The first method eliminates all the antennas in an already routed channel. The second method imposes a constraint resulting in a routing with antenna lengths bounded by the final channel height. The first method is based on the following theorem:

Theorem 1: All antennas in any given HVH 3 metal layer channel routing result can be eliminated by the addition of at most two extra tracks of metal 2 - metal 3 vias.

Proof: For any given HVH 3 layer channel routing result, all antennas' connections to the receiver terminals must be made on the 2nd layer through the vertical wires. Consequently, by adding to the channel two extra tracks of metal 2 - metal 3 vias, located at the top and bottom of the channel, such that each metal 2 vertical connection is terminated by a via to metal 3, one can achieve an "antenna-free" design. Observe that with such an arrangement the connection of that vertical wire segment to the receiver terminal must be made on the 3rd layer. Therefore, there is no connection to any receiver terminal when the second layer is manufactured. The only antennas will be very short connections from receivers to the third metal. Figure 4 illustrates the above idea. One drawback of the method described in Theorem 1 is that excessive vias might be added to connect drain terminals to the vertical wire
segments on the second layer. While adding two more tracks already adds extra manufacturing cost, the excessive vias may cause performance degradation of the circuit. It is to be expected that extra channel area may be necessary to eliminate the antenna effect. From Theorem 1, we know the upper bound for eliminating antennas is 2 tracks.

Fig. 4. Eliminating antennas by adding two extra tracks: (a) antennas; (b) modified routing (* marks an extra via).

The second method we propose is to reduce the antenna effect without introducing excessive vias while the area cost remains within the upper bound. It is based on Theorem 2, which gives the routing constraint that accomplishes the reduction of the antenna effect.

Theorem 2: For an HVH three layer CRP, if all horizontal wire segments of each multiple-pin net are routed on the same layer, then the length of any antenna is not longer than the final channel height produced by the channel router.

Proof: When all horizontal wire segments of a multiple-pin net are routed on the same layer, no wire segments on layer 1 contribute to antennas, since when layer 1 is manufactured all the horizontal wires are floating. Then, on the 2nd layer, all the receiver terminals and the driver terminal are connected to vertical segments. The only antennas in the channel will be contributed by the nets that have horizontal wire segments on the third layer. Their vertical wire segments connected to the receiver terminals on the second layer will be antennas. The longest such vertical wire segment cannot be longer than the channel height produced by the router. Fig. 5 illustrates an example.

Fig. 5. Force all horizontal segments on the same layer: (a) a routed net with all horizontal segments on one layer; (b) antennas.

For routers that minimize the channel height, the routing obtained when the above layering constraint holds is within the bound set by Theorem 1 and without the penalty of introducing excessive vias. We have implemented a simple HVH three layer channel router R2. The router can indeed complete the routing within the bound of Theorem 1 and without the excessive vias.
The distribution of the antenna lengths produced by the second method is uniform because the longest antenna and the shortest antenna will differ only by the channel height minus 1, as stated in Theorem 3.

Theorem 3: In an HVH three layer channel routing using Theorem 2 to reduce the antenna effect, the difference between the longest antenna and the shortest antenna is at most D-1, where D is the channel height.

Proof: The length of an antenna is at most D in a routed channel if the layering suggested by Theorem 2 is used. The shortest antenna has a length of 1. So the difference is at most D-1.
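A minimal sketch of the Theorem 2 discipline follows (the data model is ours, not the authors'): each multi-pin net keeps all of its horizontal wires on a single layer, and only the nets assigned to the third layer leave receiver stubs on metal 2 as antennas, whose lengths never exceed the channel height D.

#include <cassert>
#include <iostream>
#include <vector>

struct Receiver { bool onTop; int track; };       // terminal side and the track it reaches
struct Net { int horizontalLayer;                 // 1 or 3 under the Theorem 2 constraint
             std::vector<Receiver> receivers; };

// Antenna lengths for a routed channel of height D (tracks 1..D, track 1 next to
// the top edge), assuming the Theorem 2 layering of each net's horizontal wires.
std::vector<int> antennaLengths(const std::vector<Net>& nets, int D) {
    std::vector<int> lengths;
    for (const Net& n : nets) {
        if (n.horizontalLayer == 1) continue;     // reaches its driver on metal 2: no antenna
        for (const Receiver& r : n.receivers) {
            int len = r.onTop ? r.track : (D + 1 - r.track);  // metal-2 stub length in tracks
            assert(len <= D);                     // Theorem 2: never longer than the channel height
            lengths.push_back(len);
        }
    }
    return lengths;
}

int main() {
    // Hypothetical channel of height 4 with two nets.
    std::vector<Net> nets = {
        {1, {{true, 2}, {false, 3}}},             // horizontals on layer 1: no antennas
        {3, {{true, 3}, {false, 1}}},             // horizontals on layer 3: two antennas
    };
    for (int len : antennaLengths(nets, 4)) std::cout << len << " ";   // prints: 3 4
    std::cout << "\n";
}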
Experimental results suggest that the second method reduces the total antenna length as well as yields a more uniform antenna distribution.

6 Results and Discussion

We have implemented simple graph based HVH three layer channel routers. The nets are routed one by one according to their leftmost coordinate and their weight in the vertical constraint graph (VCG). The routers are implemented in C++ using the LEDA [5] library. Routines were developed to identify the antennas and their lengths. The router R1 routes the given netlist and after completion identifies the antennas and their lengths. The router R2 routes the horizontal wire segments of each multi-pin net on the same layer. In this study we used as examples the standard benchmark Deutsch difficult problem, and we created our own "large Deutsch difficult problem". The large Deutsch example was obtained by reflecting the Deutsch difficult example and merging the original and the new one, assigning every other column to each netlist. Since the channel routing benchmarks have no information about driver and receiver terminals, we assigned them randomly. We ran each netlist with 30 different assignments of drivers and receivers.

TABLE 1. Results for the Deutsch difficult problem

        std R1    mean R1   std R2    mean R2
max     12.79     15.84     4.29      7.90
min     12.28     10.51     4.07      7.22
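The per-run statistics reported in TABLE 1 could be collected with a loop of the following shape (a sketch only; routeAndMeasure() is a hypothetical stand-in for the router plus the antenna-identification routines, stubbed here with placeholder data so the sketch runs):

#include <algorithm>
#include <cmath>
#include <iostream>
#include <random>
#include <vector>

// Hypothetical stand-in for "route the netlist with this driver assignment and
// return the antenna lengths"; replaced by placeholder data in this sketch.
std::vector<double> routeAndMeasure(const std::vector<int>& driverOfNet, std::mt19937& rng) {
    std::uniform_real_distribution<double> len(1.0, 12.0);
    std::vector<double> antennas(driverOfNet.size());
    for (double& a : antennas) a = len(rng);
    return antennas;
}

int main() {
    const int numNets = 72;                    // roughly the size of the Deutsch difficult example
    const int runs = 30;                       // 30 random driver/receiver assignments
    std::mt19937 rng(7);
    double maxMean = 0, minMean = 1e9, maxStd = 0, minStd = 1e9;
    for (int r = 0; r < runs; ++r) {
        std::vector<int> driver(numNets);
        for (int& d : driver)                  // pick one of the (assumed 4) terminals as the driver
            d = std::uniform_int_distribution<int>(0, 3)(rng);
        std::vector<double> lens = routeAndMeasure(driver, rng);
        double mean = 0, var = 0;
        for (double x : lens) mean += x;
        mean /= lens.size();
        for (double x : lens) var += (x - mean) * (x - mean);
        double sd = std::sqrt(var / lens.size());
        maxMean = std::max(maxMean, mean); minMean = std::min(minMean, mean);
        maxStd  = std::max(maxStd, sd);    minStd  = std::min(minStd, sd);
    }
    std::cout << "mean in [" << minMean << ", " << maxMean << "], "
              << "std in ["  << minStd  << ", " << maxStd  << "]\n";
}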
Four sets of experiments were performed. Each experiment has 30 channel routing results. The first set of experiments routed the Deutsch difficult problem using router R1. The second set of experiments routed the Deutsch difficult problem with router R2. The third and fourth sets of experiments routed the large Deutsch difficult problem with R1 and R2, respectively. TABLE 1 shows the mean and standard deviation (std) of the antenna lengths for the first experiment. The mean antenna lengths are between 10.51 and 15.84. However, the standard deviations are larger than the minimum mean length. This shows an uneven distribution of antenna lengths. TABLE 1 also shows the results of the second experiment. The mean antenna lengths are between 7.22 and 7.90. The standard deviations are between 4.07 and 4.29. Shorter antenna lengths and a more even distribution are obtained. Similar to TABLE 1, TABLE 2 shows results of R1 on the large Deutsch difficult problem. The antennas are long and are unevenly distributed. TABLE 2 also shows the results obtained by routing the large Deutsch difficult problem using router R2. The antennas are shorter there and are evenly distributed. Note that for the Deutsch difficult problem the channel height was 12 when R1 was used and 14 when R2 was used. For the large Deutsch difficult problem, the channel heights were 22 and 24 for R1 and R2, respectively. Results from TABLE 1 and TABLE 2 show that the antenna length distributions using R1 are not even. Therefore such routing produces not only very long total antenna lengths but also very long antennas. This implies that these wirings have a substantial probability of causing failures.

TABLE 2. Results for the large Deutsch difficult problem

        std R1    mean R1   std R2    mean R2
max     35.41     27.15     6.94      13.16
min     24.61     20.97     6.67      12.57

TABLE 3 shows the details of the first experiment on 4 test cases. The R1 channel router yields 12 tracks for HVH three layer routing. For a three layer channel routing result, either layer one or layer three can be implemented on Metal 1. Therefore, different results may be obtained.

TABLE 3. R1 results for 4 cases of the first experiment

                              case1    case2    case3    case4
no. tracks                    12       12       12       12
no. vias                      316      316      316      316
total wire length             3547     3547     3547     3547
total antenna length (%)      1447     1350     1501     1351
                              (40)     (38)     (42)     (38)
extra vias needed             183      183      183      183
extra wire length needed      235      235      235      235
As shown in TABLE 3, the antennas can be as much as 42% of the total wire length in conventional routing. TABLE 3 also shows the costs of using the first method to eliminate the antennas in routed channels. The costs are the same for all cases. In addition to the area of two more tracks, many extra vias are required. The cost is too high even if we consider the extra vias alone. As shown in TABLE 3, 74% more vias are needed to eliminate the antennas.
Excessive vias may increase the capacitance and the resistance of the interconnect, which will degrade the performance. This may be as bad as having the antennas.
The cost of reducing the antennas in a channel using the second method is not as high as that of eliminating the antennas with the first method. TABLE 4 shows detailed results using the second method. The area of two more tracks is still needed. The number of vias is even smaller than in the case of conventional routing (see TABLE 3).

TABLE 4. R2 results for 4 cases of the second experiment

                              case1    case2    case3    case4
no. tracks                    14       14       14       14
no. vias                      303      303      303      303
total wire length             3756     3756     3756     3756
total antenna length (%)      668      654      645      653
                              (18)     (17)     (17)     (17)
extra vias needed             -13      -13      -13      -13
extra wire length needed      191      191      191      191

With the second method, the total antenna lengths are only about 18% of the total wire length. This is approximately a 45-50% reduction of the total antenna length with respect to conventional routing. In addition, the extra wire length needed is less than in the first method. The second method also produces antennas of even lengths. As shown in TABLE 4, the antenna lengths are limited by the channel height.

The experiments clearly indicate the presence of antennas in channel routed wirings. Both methods effectively eliminated or reduced the antennas in the channels. The first method demonstrated the cost upper bound for removing antennas in the channel. The second method is a first attempt to reduce antennas in channels. We have observed that its cost in terms of area is two extra tracks. The second method restricts all the horizontal wire segments of each multiple-pin net to be on the same layer.

7 Conclusion and Future Work

We have described and analyzed the antenna effect in the three layer CRP. We have shown that an area upper bound to eliminate the antennas is two extra tracks. Two methods have been proposed to eliminate or reduce the antennas in a CRP. The first method eliminates all the antennas completely. The cost of this method is extra vias, tracks and wires. The second method reduces the total antenna length and yields an even antenna length distribution. With this method, the length of antennas is limited to be less than or equal to the final channel height. This method still needs two extra tracks to complete the routing, but the number of vias does not increase as in the first method.

There are situations not yet considered, namely limiting the antenna length to the channel height without restricting all horizontal wire segments to the same layer. For example, in Fig. 2, segments AX, DY, and CW are antennas with length less than the channel height, but the horizontal wire segments are not on the same layer. If a subnet of a multiple-pin net has horizontal wire segments routed on the first layer, the subnet has to include the driver terminal, i.e., be connected to the driver terminal on the second layer. Such a subnet will have no antenna. The antennas left would be the vertical wire segments of the subnets that have their horizontal segments on the third layer.

The area of two more tracks may still be too high a cost. A further development of this study is to include new constraints yielding channel height minimization and at the same time decreasing the antenna lengths.

Acknowledgments

The authors would like to thank Mr. S. Maturi from National Semiconductor for valuable discussions of antenna effect related issues. The first two authors were supported in part by the National Science Foundation grant MIP 9419119 and in part by the California MICRO program through LSI Logic and SVR.

8 References

[1]. D. Braun et al., "Techniques for Multilayer Channel Routing", IEEE Trans. Computer-Aided Design, vol. 7, no. 6, pp. 698-711.
[2]. H. Chen and E. Kuh, "Glitter: A Gridless Variable-Width Channel Router", IEEE Trans. Computer-Aided Design, CAD-5(4), pp. 459-465, 1986.
[3]. T. T. Ho, S. S. Iyengar, and S. Q. Zheng, "A General Greedy Channel Routing Algorithm", IEEE Trans. Computer-Aided Design, vol. 10, no. 2, pp. 204-211, 1991.
[4]. W. Maly, "Cost of Silicon Viewed from VLSI Design Perspective", Proc. of 31st Design Automation Conference, June 1994, pp. 135-142.
[5]. S. Naher, "The LEDA User Manual", Max-Planck-Institut fur Informatik, Germany.
[6]. V. Pitchumani and Q. Zhang, "A Mixed HVH-VHV Algorithm for Three-Layer Channel Routing", IEEE Trans. Computer-Aided Design, vol. CAD-6, no. 4, pp. 497-501, 1987.
[7]. R. L. Rivest and C. M. Fiduccia, "A Greedy Channel Router", Proc. 19th DA Conf., pp. 418-424, 1982.
[8]. N. Sherwani, "Algorithms for VLSI Physical Design Automation", Kluwer Academic Publishers, 1993.
[9]. H. Shin et al., "Thin Oxide Charging Current During Plasma Etching of Aluminum", IEEE Electron Device Letters, vol. 12, no. 8, pp. 37-41, August 1991.
[10]. H. Shin, Chih-chieh King, and Chenming Hu, "Thin Oxide Damage by Plasma Etching and Ashing Processes", IEEE/IRPS, 1992.
[11]. Shone et al., "Gate Oxide Charging and Its Elimination for Metal Antenna Capacitor and Transistor in VLSI CMOS Double Layer Metal Technology".
[12]. U. Yoeli, "A Robust Channel Router", IEEE Trans. Computer-Aided Design, vol. 10, no. 2, pp. 212-219.
[13]. T. Yoshimura, "An Efficient Channel Router", Proc. 21st DAC, pp. 38-44, ACM/IEEE, 1984.
Yield Optimization in Physical Design: A Review

Venkat K. R. Chiluvuri
Advanced Design Technologies, Motorola, Austin, TX 78735

ABSTRACT

In order to achieve yield improvements in a cost-effective way in large-area chips, new design techniques are necessary. Only recently have researchers started reporting results in the area of layout design for yield enhancement. Several design techniques have been developed for many stages of layout design for yield enhancement. This paper reviews the major developments in the area of layout design techniques for yield enhancement. Yield analysis tools, and yield enhancement methods in layout compaction and channel routing, are covered in this review.

I. INTRODUCTION

During the last two decades, feature sizes have diminished drastically from a few microns to submicrons, allowing integration of millions of transistors on a single chip. At the same time manufacturing process complexity has increased significantly. Die sizes of high-performance general-purpose microprocessors have already crossed 3 cm². Such large-area chips became a reality partly because defect densities dropped almost one order of magnitude during this period. It seems unlikely that similar substantial improvements in manufacturing facilities will be achieved in the near future to improve the yield. Therefore, with further increase in the level of integration, higher yields cannot be expected by achieving factory performance goals alone. For future chips of 5-10 cm² size, meeting the cost goal of $1.00/cm² is an onerous task [19]. Thus, new design techniques must be applied in order to achieve further improvements in the yield of large-area chips in a cost-effective way. The design process of VLSI circuits starts with system specifications. These specifications are then translated, systematically, in a series of design steps into a set of masks. Several design constraints such as performance, area and design rules are considered during this design process. These masks are used to fabricate the circuit in a manufacturing facility. The design and manufacturing process of VLSI circuits in the context of yield is illustrated in Figure 1. For the physical layout design stage, the concept of design for yield has been applied very successfully through global design rules and area minimization [46] from the very beginning. The design rules are formulated in such a way that global disturbances, such as misalignment of
the masks, and line width variations may have minimal effects, and the amount of logic per chip is maximized. Therefore, these design rules are optimized for minimizing the yield losses due to global process variations. However, in a mature manufacturing line, random point defects are the major source of yield losses. To maintain the yield of future chips with complexities exceeding 100 million transistors, the distribution of point defects and the sensitivity of the design to these defects must be taken into account during layout synthesis. Only recently have researchers started reporting results in the area of layout design for yield enhancement. Several design techniques were proposed for many stages of design synthesis for yield enhancement. These yield improvement techniques are reviewed in the following sections. Several statistical design centering techniques have been developed for parametric yield optimization [18, 38, 50]. The objective of statistical design centering is to maximize the parametric yield of a circuit with respect to manufacturing process parameters. A variety of fault-tolerant techniques have been proposed for memory ICs, PLA-based designs and Wafer Scale Integrated Systems [29, 41, 51]. Several review papers and books have already been published in the areas of statistical design centering and fault tolerance. This review is restricted to yield optimization techniques in the physical design of VLSI systems.
II. DEFECT SENSITIVITY OF A LAYOUT

Researchers have proposed several yield models [17, 26] to predict the manufacturing yield. The three-parameter generalized negative binomial yield model given in equation (1) was found to match empirical results better than other yield models [17]:

Y = Y_0 (1 + d·A·θ/α)^(-α)    (1)
where Y is the yield of the die, Y_0 is the gross yield factor, d is the average number of defects per unit area, A is the area of the die, θ is the defect sensitivity, i.e., the probability that a defect will result in a circuit fault, and α is the clustering parameter. In this model, A represents the total area of the die, while the product A·θ (also called the critical area) represents the portion of the chip area that is sensitive to defects. In other words, not every defect results in a circuit failure. The effect of a defect on the chip is strongly dependent on the defect location. Therefore,
the defect sensitivity of a chip depends on the layout density, where a denser layout is more susceptible to defects. Thus, design rules, the layout design process and the design style have a strong impact on the yield. Since the layout design rules are optimized for minimizing the yield losses due to global process variations, they are targeted at maximizing the gross yield, Y_0. So far, only limited attention has been given to point defects while formulating the design rules. The contribution of point defects to yield losses will be relatively very high in a mature manufacturing process of submicron technologies. The defect sensitivity (θ) depends on the size of the defect relative to the dimensions of the layout patterns. Several analytical models have been proposed to calculate the critical area from layout details [20, 27, 37, 49]. The critical area for a circular defect of diameter x is defined as the area in which its center must fall in order to cause a circuit failure. The expected value of the critical area, A_c, is computed using
A_c = ∫ A(x)·f(x) dx    (2)

where A(x) is the critical area for defects of size x and f(x) is the defect size probability density function [49]. Yield improvements can be achieved by minimizing the total critical area of open-circuit, short-circuit and other faults in the layout. The effect of a reduction in critical area on the overall yield of a chip depends on its size and the process complexity [15].
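As a worked illustration of Eqs. (1) and (2) (the geometry and the defect size density are assumptions, not taken from this review): for two parallel wires of length L at spacing s, a circular defect of diameter x > s shorts them when its center falls in a band of width roughly x - s, so A(x) ≈ L·(x - s); with the commonly assumed density f(x) = 2·x0²/x³ for x ≥ x0, Eq. (2) can be integrated numerically and the result fed into the yield model of Eq. (1).

#include <cmath>
#include <iostream>

// Numerical evaluation of Eq. (2): Ac = integral of A(x) * f(x) dx.
double expectedCriticalArea(double L, double s, double x0, double xMax, int steps) {
    double sum = 0.0, dx = (xMax - s) / steps;
    for (int i = 0; i < steps; ++i) {
        double x = s + (i + 0.5) * dx;
        double A = L * (x - s);                  // critical area for defects of size x (x > s)
        double f = 2.0 * x0 * x0 / (x * x * x);  // assumed defect size density, x >= x0
        sum += A * f * dx;
    }
    return sum;
}

// Eq. (1): the negative binomial yield model.
double yield(double Y0, double d, double Ac, double alpha) {
    return Y0 * std::pow(1.0 + d * Ac / alpha, -alpha);
}

int main() {
    double L = 1000.0, s = 1.0, x0 = 0.5;                 // um; hypothetical geometry
    double Ac = expectedCriticalArea(L, s, x0, 200.0, 200000);
    std::cout << "Ac ~= " << Ac << " um^2\n";              // approaches L*x0*x0/s for large xMax
    std::cout << "Y  ~= " << yield(0.95, 1e-6, Ac, 2.0) << "\n";  // d in defects per um^2
}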
III. CAD TOOLS FOR YIELD PREDICTION
Before attempting any optimization techniques for yield improvement, designers should be able to analyze the yield characteristics, i.e., the defect sensitivity, of their designs quickly. Several mathematical models have been proposed for very accurate yield predictions. This high degree of accuracy is achieved, to a large extent, by replacing the chip area with the critical area in the yield models [20, 26, 28, 37, 49]. Recently several methods have been proposed and CAD tools have been developed for estimating the yield of a chip from its final layout. The main differences among these tools are the types of faults, the ability to handle hierarchical layouts, the defect size distributions and yield models used, and the speed. A brief review of these tools is presented in this section. A Monte Carlo simulation based yield prediction tool, VLASIC, was first developed at CMU [55] in 1985. In this tool, random defect generators produce a list of defect types and their locations. These defects are then placed on the layout and analyzed for faults. The defect types include shorts, opens, new devices, oxide pinholes and junction leakage defects. Users specify the chip size, wafer size and the defect densities of the target fabrication facilities. McYIELD [36] and XLASER [22] are two early tools developed based on deterministic methods, and they work on flat layouts. In XLASER, the computation of multilayer critical area is divided into four stages. The layout partition stage extracts the soft structures from the layout and determines which defect mechanisms affect these structures. The susceptible stage locates regions where defect mechanisms can introduce defects. The critical region stage identifies the regions where spot defects of a given size affect the soft structures. Finally, the area stage computes the total critical area per intersection of critical regions with different fault types. Wagner developed another tool, YMAP [54], for critical area estimation. It is shown that YMAP runs faster than VLASIC and XLASER. YMAP also provides a bound on the estimation error, which is useful to the users. Most of these tools can handle only a flat representation of the design; therefore, the hierarchy available in the design cannot be exploited to speed up the critical area or yield estimations. These tools may take a long time for yield estimations on layouts of practical size. Recently, critical area estimation tools have been developed for fast extraction using the hierarchy available in the layout description [2, 43]. The EYE yield analysis tool [2] provides the capability to efficiently extract the critical areas of a hierarchical layout from the mask data. It supports non-orthogonal layout styles as well. Tools such as DEFAM [21] have been developed to handle designs with redundancy.
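In the same spirit as the Monte Carlo tools above (a toy illustration, not VLASIC's actual algorithm), one can estimate a short-circuit critical area by throwing random defects at a simple two-wire pattern and counting the faulting fraction:

#include <cmath>
#include <iostream>
#include <random>

int main() {
    // Hypothetical region (W x H) containing two horizontal wires of width w whose
    // facing edges are separated by spacing s, centered vertically in the region.
    const double W = 1000.0, H = 50.0, s = 1.0, x0 = 0.5;
    const int samples = 2'000'000;

    std::mt19937 rng(1);
    std::uniform_real_distribution<double> uy(0.0, H), u01(0.0, 1.0);

    long long shorts = 0;
    for (int i = 0; i < samples; ++i) {
        double y = uy(rng);
        // Sample the defect diameter from the assumed f(x) = 2*x0^2/x^3 via its inverse CDF.
        double x = x0 / std::sqrt(1.0 - u01(rng));
        // The gap between the wires spans [H/2 - s/2, H/2 + s/2]; a defect shorts
        // the wires exactly when its vertical extent covers the whole gap.
        double lo = H / 2 - s / 2, hi = H / 2 + s / 2;
        if (y - x / 2 <= lo && y + x / 2 >= hi) ++shorts;
    }
    double criticalArea = (double(shorts) / samples) * (W * H);
    std::cout << "estimated short-circuit critical area ~= " << criticalArea << " um^2\n";
}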
Although these tools are useful for yield analysis, they cannot be used to modify layouts for yield enhancement. At best they can highlight the parts of the layout where improvements are possible. The first significant work in the area of layout modifications for yield improvement was reported by Allan [1]. A set of local rules has been proposed for contacts, metal and poly wires for yield enhancement. However, these techniques are not general enough to be applied in the regular physical layout synthesis stages such as routing and compaction.

IV. LAYOUT COMPACTION TECHNIQUES FOR YIELD ENHANCEMENT

Compactors generate design-rule-correct physical layouts which occupy minimum area, either from symbolic layouts or from actual layouts generated by other layout synthesis tools [5, 48]. These compacted layouts produce the various masks required for the fabrication of the chip. While the primary goal of all compactors is to minimize the area, they include some secondary objectives like minimizing the total wire length, minimizing the number of jogs, minimizing the defect sensitivity, etc. [4, 11, 33]. In one-dimensional constraint-graph based compaction, the layout is compacted iteratively in alternating directions [4, 5]. The minimum achievable size of the layout, in the direction of compaction, is determined by the longest path (critical path) of the constraint graph. The elements on the critical path are placed at the minimum possible distances from one edge of the layout in order to minimize the area.
Therefore, elements that lie on the critical path do not have freedom to move. However, elements that do not lie on the critical path can be placed in a variety of ways. Therefore, after minimizing the layout area, there is freedom available to further optimize the layout for improved performance, yield, and manufacturability. Performance improvement methods, such as wire length minimization (WLM), are usually given priority over other improvements in commercial CAD tools.

A. Impact on Yield
Since compactors place the elements as close as the design rules permit in order to minimize the area, many non-critical elements are unnecessarily packed very close, resulting in layers with a large critical area for short-circuit faults. When relocating the wire segments, the compactor may stretch them in order to maintain the original topology and connectivity, which results in longer nets and layers with a large critical area for open-circuit faults. In SPARCS [5] this situation was improved by uniformly distributing the unused space among the non-critical elements. The manufacturing yield of the final design can be improved by distributing the spacing between non-critical elements so as to minimize the total number of faults that can occur in the design under particular manufacturing conditions, i.e., the defect size distribution and defect densities for different failure types. The defect sensitivity (critical area) of a layout element for short-circuit type faults depends on the proximity (spacing) of the neighboring elements. The defect sensitivity for open-circuit type faults is determined by its width. When changes are made in the layout to minimize the sensitivity of the design to one type of defect, the sensitivity to other defect types may increase. For example, when the width of the metal/active lines is increased to minimize the sensitivity of the design to open-circuit faults, its sensitivity to short-circuit faults and pinhole faults may increase. Therefore, the critical area of open-circuit as well as short-circuit faults should be considered while searching for an optimal location for non-critical elements. A new constraint-graph-based compaction algorithm is proposed in [11] to improve the yield without increasing the layout area. This new compaction algorithm improves the yield of the final design by distributing the spacing between non-critical elements so as to minimize the total defect sensitivity for given manufacturing conditions. The defect sensitivity of the short-circuit faults is minimized by distributing the space among non-critical elements. The defect sensitivity of the open-circuit type faults is minimized by increasing the width of several non-critical elements in the layout. The input to the algorithm is the directed-graph representation of the compacted layout. The defect size distribution and the defect densities for different layers of the layout are the other inputs to the algorithm.
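The following sketch (a deliberately simplified cost model, not the algorithm of [11]) shows the kind of trade-off such a compactor evaluates for one non-critical segment: it scans the available slack and chooses the offset that minimizes an estimated short-circuit term, which grows as the segment approaches its neighbors, plus an open-circuit term proportional to the jog length that remains. The geometry and the defect-density ratio are hypothetical; the critical-area proxies L·x0²/spacing and jog·x0²/width follow the parallel-wire model used earlier.

#include <iostream>
#include <limits>

int main() {
    // Hypothetical geometry, loosely in the spirit of the Figure 2 narrative:
    const double L = 200.0, width = 1.0, x0 = 0.5;
    const double slack = 30.0;           // the segment may move up by 0..30 units
    const double gapUp0 = 40.0;          // spacing to the neighbor above at offset 0
    const double gapDown0 = 4.0;         // spacing to the neighbor below at offset 0
    const double jog0 = 30.0;            // jog length removed completely at offset 30
    const double dShort = 3.0, dOpen = 1.0;   // assumed relative defect densities [16]

    double bestY = 0.0, bestCost = std::numeric_limits<double>::max();
    for (double y = 0.0; y <= slack; y += 1.0) {
        double shortCost = dShort * L * x0 * x0 *
                           (1.0 / (gapUp0 - y) + 1.0 / (gapDown0 + y));
        double openCost = dOpen * (jog0 - y) * x0 * x0 / width;
        double cost = shortCost + openCost;
        if (cost < bestCost) { bestCost = cost; bestY = y; }
    }
    // The optimum typically lies strictly inside the slack range: moving all the
    // way up removes the jog but packs the segment too close to its neighbor.
    std::cout << "best offset = " << bestY << " of " << slack << " units of slack\n";
}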
The optimal location for a layout element depends on (a) its length, and (b) the spacing between the element and the elements above and below. In addition, elements connected on both sides and their width also influence the optimal location of an element. The optimization process used in [11] is illustrated in Figure 2. Wire segment A1 in Figure 2(a) has 30 units of slack. If it is moved 30 units upward, jog 31 can be completely eliminated. However, this is not an optimal location when yield enhancement is also a consideration. As shown in Figure 2(c), the optimal location is 10 units below. If segment A1 is moved further upward, the increase in the probability of short-circuit faults is higher than the decrease in the probability of open-circuit faults due to jog length reduction, as shown in Figure 2(b). In typical VLSI technologies, the defect densities for short-circuit type faults are much higher than those of open-circuit type faults [16]. Therefore, in order to achieve better yield characteristics, the proper distribution of free space among the layout elements is very critical. The results of this new algorithm show that up to 10% improvement can be achieved when these defect sensitivity reduction techniques are applied during layout compaction. These two yield enhancement techniques have been prototyped in the IBM CircuitBench Compactor [7]. The most significant aspect of the yield enhancement techniques proposed in [11] is that no additional area is required. The yield enhancement can be realized at almost no cost except for a marginal increase in the computational time of the CAD tools. Another important aspect is that a layout can be optimized for given manufacturing conditions. Further improvements are possible by incorporating the yield enhancement criteria for other defect types.

B. Wire Length Minimization and Yield Enhancement during Layout Compaction

Wire-length minimization (WLM) is a commonly-used secondary optimization performed in the compaction stage of VLSI layout synthesis. Several algorithms have been proposed for WLM [32, 34, 39] and they have been implemented in commercial CAD systems. In compactors, WLM is performed by moving the non-critical (slack) elements after solving for minimum area. It is well known that wire length reduction can result in better electrical performance due to improvements in RC characteristics. Wire length minimization can sometimes even lead to smaller area if layouts are compacted iteratively in both directions. In most cases, reduction in wire length also results in better circuit yield [15]. However, a trade-off generally exists between reducing wire length and increasing yield. That is, large increases in yield can be achieved with modest increases in wire length. In wire length reduction only the area/length of the layout patterns is considered. For yield enhancement, both the area of the layout patterns and the spacings among them must be considered. The trade-off between area and spacing depends on the defect
densities of the open- and short-circuit type faults of the manufacturing process. The trade-off between WLM and critical area was analyzed in [12]. The results of this analysis are summarized below. In WLM algorithms, the primary objective is to minimize the area of the layout patterns so that the electrical performance of the final layout is improved. Compactors, in the absence of WLM, place the layout elements as close as possible to one edge of the layout. This generally results in unnecessarily long wire segments as shown in Figure 3(b). When WLM is performed, the unnecessary jog segments are removed from the layout as shown in Figure 3(c). When the jogs are completely eliminated, most of the elements in the layout tend to be as close together as the design rules permit. Minimum spacings adversely affect yield, however, because short-circuit faults are more likely among tightly-packed elements. For instance, if the layout of Figure 3(b) is optimized for yield, the layout shown in Figure 3(d) results. The amount of jog length justified for a wire segment depends on its length and on the spacing from the wire segments above and below. If the segment length is longer, the increase in open-circuit fault probability due to the additional jog length is easily offset by the reduction in short-circuit fault probability due to the increase in the spacing to its adjacent elements [12]. The results presented in [12] show that layout modifications for yield enhancement also reduce wire length, which benefits performance. In the absence of criticality information, WLM (which is performed at the expense of yield enhancement) may not result in better circuits. On the other hand, yield enhancement is always beneficial if the defect information is accurate, and the wire-length increase that occurs is minor. In practice these two optimizations can be selectively applied to various parts of the chip to result in designs that, overall, have higher yield and improved performance over those designed with standard methods.

C. Performance and Yield
In highly integrated systems, interconnect delay can be a limiting factor for achieving high performance. Therefore, during circuit and layout design, interconnect wire length is minimized in order to achieve better performance and lower power. However, in submicron technologies interconnect delay can be dominated by cross-coupling capacitance between adjacent signal lines [3, 47]. During wire length minimization, often, the effect of cross-coupling capacitance is ignored. In [44] it has been shown that delay can be reduced by 5% when the layouts of 0.5 micron technology are optimized based on coupling capacitance. The crosstalk also has been reduced by about 15-20% due to reduction in the coupling capacitance. They found that by increasing the layout area further improvements in delay can be achieved. Therefore, there is a need to consider the cross-coupling
capacitance during the layout optimization. Both layout pattern dimensions and spacings, as in the case of yield optimization, must be considered for minimizing the total interconnect capacitance [44]. If the layout shown in Figure 2(a) is optimized for parasitic capacitance reduction, the optimal location for segment A is 8 units below segment B, as shown in Figure 2(c). In the case of WLM, since only the parallel plate capacitance is considered, its minimum is achieved when segment A is moved all the way up. It is to be noted that for these technologies the yield optimization solution may result in better performance when compared with that of the WLM solution. The trade-off between wire length minimization and yield enhancement is analyzed in [13]. During wire length reduction only the area/length of the layout patterns is considered, whereas for yield enhancement both the area of the layout patterns and the spacing among them must be considered. In [12], it has been shown that layout modifications for yield enhancement also reduce wire length, which benefits performance. Through some examples it has been demonstrated in [13] that in submicron VLSI technologies, layout modifications for yield enhancement will not degrade the circuit performance. If the cross-coupling capacitance is also taken into account, layouts synthesized for improved performance are very close to the yield solutions.

V. CHANNEL ROUTING
Since compaction is the last stage of layout synthesis, its effectiveness is highly dependent on the quality of the layout synthesis of the previous stages. For example, the quality of the routers has a major impact on compaction. During the compaction stage the topological order of the layout elements is not altered. Therefore, further yield improvements can be achieved through new strategies for routing, wire length and via minimization, layer assignment and so on. The primary objective of channel routing is to complete the routing in the smallest possible area. Via minimization is the most important secondary objective in several two- and multi-layer channel routers. Several constrained and unconstrained algorithms have been proposed [23, 42, 52] for via minimization. Via reduction is achieved by re-assigning the wire segments to different layers. In the early routers, the main objective of layer assignment was reducing the number of vias. As the number of layers available for routing has increased, layer re-assignment for performance improvement has assumed significance. Sometimes, just to avoid a via, routers may introduce very long wire segments [23]. If manufacturing yield is also taken into account [12], wire length cannot be ignored. Wire segments are susceptible to open- and short-circuit type faults. A trade-off therefore exists between vias and wire length for performance, reliability and yield criteria. In this section the yield improvement techniques proposed for channel routing are reviewed.
A. Layer Assignment

In multilayer channel routing, the majority of wire segments in a particular layer are either vertical or horizontal (HV, HVH, etc.) [8, 24, 53]. It is possible to minimize the critical area of short-circuit type faults by reassigning several horizontal wire segments to an otherwise vertical layer, and vice versa, whenever possible, which is similar to maximizing the utilization of a preferred routing layer with better electrical characteristics [24]. This optimization is achieved in routers by reassigning the wire segments and shifting and/or adding vias. All these strategies are helpful for minimizing the critical area of the short-circuit faults as well [31]. However, the criterion for layer
re-assignment for yield enhancement is different from that of preferred layer maximization. When a wire segment from a horizontal layer is moved to a vertical layer, a reduction in the critical area of short-circuit faults is very likely because the wire segment is moved from a dense region to a sparse region. There will be a change in the critical area of the open-circuit faults only if there is a change in the width of the wire segments or a change in the wire length. It is to be noted that layer reassignment for yield enhancement and maximization of the preferred layer may work against each other depending on the technology. In such design situations the trade-off between these two parameters must be chosen carefully.

B. Routers for Yield Optimization

The very first work on yield in channel routing was reported by Lorenzetti et al. in [35]. They attempted to answer the question of whether the size reduction resulting from channel compaction offsets the increased susceptibility to short-circuit defects resulting from the tighter wire spacing created by channel compaction. Channels from the same design were routed using several different algorithms and the resulting layouts were analyzed for their defect sensitivity. They selected a representative design employing a standard cell layout style. After placement and global routing, the layout had 40 channels to be routed. These channels were routed using four different channel routing algorithms and the layouts were analyzed for their susceptibility to faults resulting from certain defect types. They found that, although compaction of the channels results in greater susceptibility to short-circuit type faults, the yield was higher than for the uncompacted ones, in addition to allowing more chips per wafer. It is to be noted that these differences are dependent on the defect densities of a given fabrication line. Their results showed that the yield differences between algorithms were minor. The algorithm proposed in [45] reduces the overlaps among adjacent wires so as to minimize the critical area. Their results show that though the routing area is the same as that produced by Yoshimura & Kuh's algorithm [53], the critical area is smaller in the routing generated by DTR. In this router only the adjacency information of
horizontal tracks is considered as a criterion for defect sensitivity. Neither defect size distributions nor analytical models were used to characterize yield. Since the vertical layer is not considered, the overall defect sensitivity might be higher in some cases. A new algorithm for via and wire length reduction was proposed in [15]. Wire length reduction was achieved by this algorithm in all the benchmark examples. The result of the wire length reduction on a benchmark example is shown in Figure 4. In some examples the wire length reduction is as high as 25%. This reduction in wire length results in better performance due to improvements in RC characteristics. This wire length reduction is achieved without increasing the number of tracks. The reduction in the number of vias is better than or equal to that obtained by most of the via minimization algorithms. It was observed that, up to a point, reduction of the wire length also contributes to the reduction of the number of vias. After achieving near-minimum wire length, it is very simple to attempt further reduction in the number of vias. Kuo proposed a new channel routing algorithm for reducing the critical area of the routing [31]. In this router both layer reassignment (net burying and net floating) and via shifting were used to reduce the defect sensitivity. The critical area is calculated using the negative binomial model. For Deutsch's difficult example, a 20% improvement was reported. Though a reduction in vias was claimed, they were not included in the yield analysis. Chen proposed two channel routing algorithms in [9] for yield enhancement. In the first method, vias in an existing layout are moved in order to decrease its sensitivity to defects. In the second, the layer assignment was formulated as a network bipartitioning problem. It was shown that by applying these algorithms, the critical area can be reduced by about 10% in the channel routing. Karri proposed a simulated annealing based routing algorithm [25] for yield optimization. In this router the defect sensitivity of the second layer was not considered while minimizing the defect sensitivity of the first layer. Moreover, the criterion chosen for defect sensitivity was not based on any analytical model reported in the yield literature. Via and wire length minimization for yield enhancement can be achieved very efficiently in the routing stage of VLSI layout synthesis. Though it is desirable to implement the layer reassignment technique in the routing stage, the exact critical area calculations may not be possible on a symbolic layout. Moreover, certain layer reassignments on a symbolic layout for minimizing the critical area may increase the height of the compacted channel. In certain design situations, compaction may not be carried out on the routing area. Then these yield enhancement techniques have to be implemented in the router itself. In general, layer reassignment criteria in the routing and compaction stages are useful for minimizing the defect sensitivity of the layout.
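A sketch of the acceptance test implied by the discussion above (the spacing-based critical-area proxy is our assumption, not any published router's rule): a horizontal segment is re-assigned to the other layer only when the estimated short-circuit critical area at its new spacings is smaller than at its current ones.

#include <iostream>

// Estimated short-circuit critical area of a segment of length L whose nearest
// neighbors on its layer are at spacings sAbove and sBelow (parallel-wire proxy).
double shortCA(double L, double sAbove, double sBelow, double x0) {
    return L * x0 * x0 * (1.0 / sAbove + 1.0 / sBelow);
}

int main() {
    const double L = 120.0, x0 = 0.5;
    // Hypothetical spacings before and after re-assigning the segment.
    double before = shortCA(L, 2.0, 2.0, x0);    // dense region on the current layer
    double after  = shortCA(L, 6.0, 8.0, x0);    // sparser region on the target layer
    // Accept the move only if it reduces the estimated critical area; a real router
    // would also weigh vias, wire length and the preferred-layer rule.
    if (after < before)
        std::cout << "re-assign: critical area " << before << " -> " << after << "\n";
    else
        std::cout << "keep current layer\n";
}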
Some results have been reported for the placement and floorplanning stages of layout synthesis as well [6, 30]. In [30] Koren investigated the relationship between floorplanning and yield by analyzing DEC's Alpha chip and Hitachi's SLSI chip which incorporate some defect tolerance. They showed that different floorplans may result in different chip yields.
VI. TOPOLOGICAL/STRUCTURAL OPTIMIZATION

The yield enhancement techniques reviewed in Section IV are applied to the actual physical design, where the geometries, i.e., widths and spacings, of the layout patterns are known. Layouts do not undergo further modifications after compaction. Therefore, the defect sensitivity of the design can be estimated very accurately. However, pattern dimensions may not be known exactly during the routing stage of layout design. Moreover, layout patterns may undergo further modifications if the layout is compacted. Therefore, at the routing stage of layout synthesis, yield enhancement is attempted by reducing the wire length and vias. These improvements cannot be achieved during the subsequent stages of layout synthesis. In a similar way, yield improvement techniques can be applied even at higher levels of design abstraction. In this section, the yield enhancement techniques for the structured logic design stages of VLSI design are summarized.

A. Topological Optimization of PLAs

During the last decade, many structured design techniques have been developed to minimize the design cycle time of VLSI systems. PLAs, gate arrays and standard cells are some of the popular design styles. Two-level PLA-based logic synthesis is well developed and commercial automatic synthesis tools/silicon compilers are now available. One of the major drawbacks of these design styles is the large area overhead and the attendant yield and performance degradation. Several optimization techniques have been proposed to minimize the area of PLAs at various stages of the design, starting from functional design to physical design. PLA folding techniques and the corresponding software tools have been developed to optimize the topological representation of PLAs [40]. The primary objective of all these techniques is to reduce the area of the PLA. Significant yield enhancement can also be achieved by minimizing the defect sensitivity of a design that is already optimized for area. In [14], a yield enhancement technique is presented through which the defect sensitivity of the design is minimized without increasing the area. The topological representation of the PLA is altered so that the critical area of the generated layout is minimized. This reduction in critical area is achieved primarily by minimizing the wire lengths in one or more layers of the layout. It was shown that about 15% critical area reduction can be achieved by applying these techniques in PLA based designs. The new approach developed in [14] for yield enhancement has many attractive features compared to conventional methods. In conventional approaches, additional resources such as spares, testing aids, reconfiguration circuitry, etc. require up to 25% extra area. In the new approach, no additional resources are required to achieve the yield improvement. In a majority of the cases it may even result in better performance, due to reduction in wire length and overall area. Wire length reduction only in the polysilicon layer is considered. Similar yield optimization techniques can be applied for metal and diffusion layers as well. The topological optimization for PLA-based designs demonstrates that yield enhancement techniques can be applied even at a higher level of design abstraction. Designers do not have to wait until the final layout is generated to apply yield enhancement techniques. Similar yield enhancement techniques can be developed for other popular structured design styles.

B. Yield Comparisons of VLSI Adders
In [10], Chen examined the impact of algorithms on yield by comparing different designs of VLSI adders. They compared three different 64-bit adders: a carry-look-ahead adder, a carry-skip adder and the hybrid adder used in DEC's Alpha chip. They applied the yield enhancement techniques of layer reassignment and redistribution of wires to these designs. The performance of the adders was analyzed using SPICE, and the yield was estimated using VLASIC. The results show that in most cases, area and yield have the same order. However, sometimes, a larger area adder can have a higher yield. They observed that yield enhancement due to layout modification, if limited to the sub-cell level, is insufficient, and the yield improvement rate can nearly double if a system-level yield enhancement effort is made. Therefore, in order to get a better result, system-level yield enhancement should be performed, even if every building block and standard cell has been designed for maximum yield.

VII. CONCLUSIONS

In most of the yield enhancement techniques reviewed in this paper, only two types of faults due to random point defects, i.e., open- and short-circuit type faults, are considered. Other defect mechanisms like contact failures must be considered during layout optimization for achieving better results. For some of the defect mechanisms, empirical models are not yet available. This is an important area for future research. The complexity of future products will be too high to achieve the yield targets with improvements in the fabrication process alone. These yield enhancement techniques need to be incorporated in the regular design flows, so that designers can make use of these techniques on a routine basis without additional time and effort.
REFERENCES

[1] G. A. Allan et al., "A Yield Improvement Technique for IC Layout Using Local Design Rules," IEEE Trans. Computer-Aided Design, Vol. 11, No. 11, pp. 1355-1362, Nov. 1992.
[2] G. A. Allan and A. J. Walton, "Hierarchical Critical Area Extraction with the EYE Tool," Proc. IEEE Int. Workshop on Defect and Fault Tolerance in VLSI Systems, pp. 28-36, Nov. 1995.
[3] H. B. Bakoglu, Circuits, Interconnections, and Packaging for VLSI, Addison-Wesley, 1990.
[4] D. G. Boyer, "Symbolic Layout Compaction Review," Proc. IEEE DAC, pp. 383-389, 1988.
[5] J. L. Burns and R. Newton, "SPARCS: A New Constraint-Based IC Symbolic Layout Spacer," Proc. IEEE Custom Integrated Circuits Conf., pp. 534-539, 1986.
[6] I. Cederbaum, I. Koren and S. Wimer, "Balanced Block Spacing for VLSI Layout," Discrete Applied Mathematics, Vol. 40, pp. 302-318, 1992.
[7] EDA CircuitBench User's Guide, IBM, NY, 1994.
[8] K. C. Chang and H. C. Du, "Layer Assignment Problem for Three-Layer Routing," IEEE Trans. on Computer-Aided Design, pp. 625-632, May 1988.
[9] Z. Chen and I. Koren, "Layer Assignment for Yield Enhancement," Proc. IEEE Int. Workshop on Defect and Fault Tolerance in VLSI Systems, pp. 173-180, 1995.
[10] Z. Chen and I. Koren, "A Yield Study of VLSI Adders," Proc. IEEE Int. Workshop on Defect and Fault Tolerance in VLSI Systems, pp. 239-245, Nov. 1994.
[11] V. K. R. Chiluvuri and I. Koren, "Layout Synthesis Techniques for Yield Enhancement," IEEE Trans. Semiconductor Manufacturing, pp. 178-187, May 1995.
[12] V. K. R. Chiluvuri, I. Koren and J. L. Burns, "The Effect of Wire Length Minimization on Yield," IEEE Int. Workshop on Defect and Fault Tolerance in VLSI Systems, pp. 97-105, 1994.
[13] V. K. R. Chiluvuri and I. Koren, "Yield Enhancement vs. Performance Improvement in VLSI Circuits," Proc. IEEE Int. Symp. on Semiconductor Manufacturing, pp. 28-31, Sep. 1995.
[14] V. K. R. Chiluvuri and I. Koren, "Topological Optimization of PLAs for Yield Enhancement," IEEE Int. Workshop on Defect and Fault Tolerance in VLSI Systems, pp. 175-182, 1993.
[15] V. K. R. Chiluvuri, "Layout Synthesis Techniques for Yield Enhancement," Ph.D. Thesis, ECE Department, University of Massachusetts, Amherst, 1995.
[16] R. S. Collica et al., "A Yield Enhancement Methodology for Custom VLSI Manufacturing," Digital Technical Journal, Vol. 4, No. 2, pp. 83-99, Spring 1992.
[17] J. A. Cunningham, "The Use and Evaluation of Yield Models in Integrated Circuit Manufacturing," IEEE Trans. on Semiconductor Manufacturing, Vol. 3, No. 2, pp. 60-71, May 1990.
[18] S. W. Director, W. Maly and A. J. Strojwas, VLSI Design for Manufacturing: Yield Enhancement, Kluwer Academic Publishers, Boston, 1990.
[19] R. B. Fair, "Challenges to Manufacturing Submicron, Ultra-Large Scale Integrated Circuits," Proceedings of the IEEE, Vol. 78, No. 11, pp. 1687-1705, Nov. 1990.
[20] A. V. Ferris-Prabhu, "Role of Defect Size Distribution in Yield Modeling," IEEE Trans. Electron Devices, Vol. ED-32, No. 9, pp. 1727-1736, Sep. 1985.
[21] D. D. Gaitonde, "Accurate Yield Estimation of Circuits with Redundancy," Proc. IEEE Int. Workshop on Defect and Fault Tolerance in VLSI Systems, pp. 155-163, Nov. 1995.
[22] J. P. Gyvez and Chennian Di, "IC Defect Sensitivity for Footprint-Type Spot Defects," IEEE Trans. CAD, Vol. 11, No. 5, pp. 638-658, May 1992.
[23] S. Haruyama, D. F. Wong and D. S. Fussell, "Topological Channel Routing," IEEE Trans. CAD, Vol. 11, No. 10, pp. 1177-1197, Oct. 1992.
[24] D. A. Joy and M. J. Ciesielski, "Layer Assignment for Printed Circuit Boards and Integrated Circuits," Proceedings of the IEEE, Vol. 80, pp. 311-331, Feb. 1992.
[25] R. Karri, "Automatic Synthesis of Fault Tolerant VLSI Systems," Ph.D. Thesis, CSE Department, University of California, San Diego, 1993.
[26] I. Koren and C. H. Stapper, "Yield Models for Defect Tolerant VLSI Circuits: A Review," Defect and Fault Tolerance in VLSI Systems, Vol. 1, I. Koren (ed.), pp. 1-21, Plenum, 1989.
[27] I. Koren, "The Effect of Scaling on the Yield of VLSI Circuits," Yield Modelling and Defect Tolerance in VLSI, W. R. Moore, W. Maly and A. Strojwas (eds.), pp. 91-99, Adam Hilger Ltd., 1988.
[28] I. Koren, Z. Koren and C. H. Stapper, "A Unified Negative Binomial Distribution for Yield Analysis of Defect Tolerant Circuits," IEEE Trans. Computers, Vol. 42, No. 6, pp. 724-734, June 1993.
[29] I. Koren and A. D. Singh, "Fault Tolerance in VLSI Circuits," Computer, Special Issue on Fault-Tolerant Systems, Vol. 23, pp. 73-83, July 1990.
[30] Z. Koren and I. Koren, "Does the Floorplan of a Chip Affect its Yield?," Proc. IEEE Int. Workshop on Defect and Fault Tolerance in VLSI Systems, pp. 159-166, 1993.
[31] S. Y. Kuo, "YOR: A Yield-Optimizing Routing Algorithm by Minimizing Critical Areas and Vias," IEEE Trans. CAD, Vol. 12, No. 9, pp. 1303-1311, Sep. 1993.
[32] G. Lakhani and R. Varadarajan, "A Wire-Length Minimization Algorithm for Circuit Layout Compaction," Proc. IEEE Int. Symp. on Circuits and Systems, pp. 276-279, 1987.
[33] T. Lengauer, Combinatorial Algorithms for Integrated Circuit Layout, John Wiley, England, 1990.
[34] S. L. Lin and J. Allen, "Minplex - A Compactor that Minimizes the Bounding Rectangle and Individual Rectangles in a Layout," Proc. 23rd IEEE DAC, pp. 123-130, 1986.
[35] M. Lorenzetti, "The Effect of Channel Router Algorithms on Chip Yield," Proc. Int. Workshop on Layout Synthesis, May 1990.
[36] M. Lorenzetti et al., "McYield: A CAD Tool for Functional Yield Projections," IEEE Int. Workshop on Defect and Fault Tolerance in VLSI Systems, pp. 100-110, Nov. 1990.
[37] W. Maly, "Modeling of Lithography Related Yield Losses for CAD of VLSI Circuits," IEEE Trans. on CAD, Vol. CAD-4, No. 3, pp. 166-177, July 1985.
[38] W. Maly, "CAD for VLSI Circuit Manufacturability," Proceedings of the IEEE, Vol. 78, pp. 378-392, Feb. 1990.
[39] D. Marple et al., "An Efficient Compactor for 45° Layout," Proc. 25th IEEE DAC, pp. 396-402, 1988.
[40] G. D. Micheli and A. Sangiovanni-Vincentelli, "Multiple Constrained Folding of Programmable Logic Arrays: Theory and Applications," IEEE Trans. on CAD, Vol. CAD-2, pp. 151-167, July 1983.
[41] W. R. Moore, "A Review of Fault-Tolerant Techniques for the Enhancement of Integrated Circuit Yield," Proceedings of the IEEE, Vol. 74, pp. 684-698, May 1986.
[42] N. J. Naclerio, S. Masuda and K. Nakajima, "Via Minimization for Gridless Layouts," Proc. 24th IEEE DAC, pp. 159-165, 1987.
[43] P. K. Nag and W. Maly, "Hierarchical Extraction of Critical Area for Shorts in Very Large ICs," Proc. IEEE Int. Workshop on Defect and Fault Tolerance in VLSI Systems, pp. 19-27, Nov. 1995.
[44] A. Onozawa, K. Chaudhary and E. S. Kuh, "Performance Driven Spacing Algorithms Using Attractive and Repulsive Constraints for Submicron LSI's," IEEE Trans. on CAD, pp. 707-718, June 1995.
[45] A. Pitaksanonkul et al., "DTR: A Defect-Tolerant Routing Algorithm," Proc. 26th IEEE DAC, pp. 795-798, 1989.
[46] R. D. Rung, "Determining IC Layout Rules for Cost Minimization," IEEE J. Solid-State Circuits, Vol. SC-16, No. 1, pp. 35-42, Feb. 1981.
[47] T. Sakurai, S. Kobayashi and M. Noda, "Simple Expressions for Interconnection Delay, Coupling and Crosstalk in VLSI's," Proc. IEEE Int. Symp. on Circuits and Systems, pp. 2375-2378, 1991.
[48] W. S. Scott and J. K. Ousterhout, "Plowing: Interactive Stretching and Compaction in Magic," Proc. 21st ACM/IEEE Design Automation Conf., pp. 166-172, 1984.
[49] C. H. Stapper, "Modeling of Defects in Integrated Circuit Photolithographic Patterns," IBM J. Res. Develop., Vol. 28, No. 4, pp. 461-474, July 1984.
[50] M. A. Styblinski and L. J. Opalski, "Algorithms and Software Tools for IC Yield Optimization Based on Fundamental Fabrication Parameters," IEEE Trans. on CAD, Vol. 5, pp. 79-89, Jan. 1986.
[51] E. E. Swartzlander, Jr., Wafer Scale Integration, Kluwer Academic Publishers, Boston, 1989.
[52] K. The, D. F. Wong and J. Cong, "Via Minimization by Layout Modification," Proc. 26th IEEE DAC, pp. 799-802, 1989.
[53] T. Yoshimura and E. S. Kuh, "Efficient Algorithms for Channel Routing," IEEE Trans. CAD, Vol. 1, No. 1, pp. 22-35, Jan. 1982.
[54] I. A. Wagner and I. Koren, "An Interactive Yield Estimator as a VLSI CAD Tool," IEEE Trans. on Semiconductor Manufacturing, Vol. 8, No. 2, pp. 130-138, May 1995.
[55] D. M. H. Walker and S. W. Director, "VLASIC: A Catastrophic Fault Yield Simulator for Integrated Circuits," IEEE Trans. CAD, Vol. CAD-5, pp. 541-556, Oct. 1986.
Figure 1: Design and manufacturing process (Ref. [15]).

Figure 2: Layout optimization for yield enhancement. (a) Layout before relocating segment A1. (b) POF vs. slack; critical area (sq. µm) for short-circuit faults, open-circuit faults and total critical area plotted against slack (microns). (c) Layout after relocating segment A1. Layout annotations mark a location for yield optimization and a location for capacitance reduction. (Ref. [13]).

Figure 3: (a) Uncompacted layout. (b) Compacted layout without wire length minimization. (c) Compacted layout with wire length minimization. (d) Compacted layout with yield enhancement. (Ref. [12]).

Figure 4: Example 1 (Ref. [53]). (a) Original routing (wire length 310, 57 vias). (b) Routing optimized for wire length (wire length 222, 36 vias). (Ref. [15]).
CATASTROPHIC YIELD, PARAMETRIC YIELD AND RELIABILITY: CAN WE STILL VIEW THEM AS DISJOINT ISSUES? (Invited Position Paper)

Israel Koren*
Department of Electrical and Computer Engineering, University of Massachusetts, Amherst, MA 01003

*Supported in part by NSF under contract MIP-9305912.

A manufactured IC is operational at the desired levels of performance and reliability if:

1. It does not have catastrophic defects resulting in open- or short-circuit type faults;
2. It operates correctly at or above the desired frequency; and
3. Its reliability is above a certain threshold, providing constrained sensitivity to phenomena like electromigration and hot-carrier effects.

The percentage of ICs which have no catastrophic defects is called the catastrophic yield. The percentage of ICs which do not have catastrophic defects and operate at or above the desired frequency is called the parametric yield. Catastrophic yield loss is mainly due to random spot defects, most of which are the result of unwanted dust or chemical particles deposited on the wafer during the many steps of manufacturing [6]. Parametric yield loss has been mainly due to global disturbances, such as mask misalignment and line width variations [3]. Unfortunately, this important distinction between the two kinds of yield is rarely discussed. The majority of technical publications concerned with yield and manufacturing issues use the term "yield" to refer to either parametric yield or catastrophic yield, but not to both. In almost all such publications the existence of the "other" kind of yield is completely ignored. In most cases, reading the title, or even the abstract, is insufficient to decide which kind of yield is discussed, and one gets the impression that the author is unaware of the double meaning of the term. This situation has been tolerable because the physical phenomena underlying the two kinds of yield loss were distinct, and as a result, the mathematical models and the techniques employed for improving the two kinds of yield were completely different. Reliability issues have also historically been treated separately from either kind of yield, and justifiably so, because reliability-reducing factors were unrelated to yield. This will not necessarily be true in the near future. If the current trend of increasing chip size and further reducing the already submicron feature size continues as expected, designers will have to consider catastrophic yield, parametric yield and reliability simultaneously, and, in some cases, will have to make trade-offs. We illustrate this point through two examples, both concerning long on-chip interconnection lines.

The first example illustrates the contribution to parametric yield loss which random spot defects may have in deep submicron technologies. We analyzed in [11] the effect of spot defects on the propagation delay of signals through two adjacent metal interconnection lines, depicted in Figure 1. For high clock frequencies (200 MHz and above) the effects of reflections in these transmission lines are not negligible, and any discontinuities in the lines due to spot defects may result in an increase in the propagation delay of the signals.
Figure 1. The effect of an extra metal defect on a line. The defect is modeled as a square of side 2r.
Let f0 denote the maximum possible operating frequency of the defect-free circuit (which in our simple example consists of two adjacent interconnection lines and the corresponding drivers). Let σ denote the delay-increase factor due to spot defects, so that the propagation delay of a signal on the interconnection line increases by a factor of (1 + σ). This factor depends on the probability distribution function of the size of the defects, which are assumed to be squares of size 2r x 2r (see Figure 1). The operating frequency of the circuit is reduced from f0 to f(σ) = f0 / (1 + σ).
An expression for σ has been derived in [11]; it is proportional to 2(x/L)^2 log 2 and depends logarithmically on the defect size 2r and the line geometry,
where L and W are the length and the width of the line, respectively (see Figure 1), S is the line spacing, x is the distance between the driver and the center of the square defect, and y is the distance between the bottom edge of the line and the center of the defect. For the circuit to be operational at (or above) a given clock frequency fm, the delay-increase factor must satisfy σ < σm, where σm is equal to f0/fm - 1. The value σm corresponds to a certain defect size rm, and consequently we can define and calculate the delay-dependent critical area, denoted by Ac(σm), which is a generalization of the well-known critical area term for catastrophic defects; the latter is equal to Ac(∞). We can then write an expression for Ac(fm) and define and calculate the frequency-dependent yield, denoted by Y(fm), using any existing yield model [5], [7]. If we select the simple Poisson yield model we obtain the results depicted in Figure 2, which shows the projected yield as a function of the frequency (assuming that the maximum working frequency of the defect-free interconnection line is f0 = 500 MHz) for three different values of the line spacing. For very low values of the frequency fm the projected yield is the catastrophic yield, while for high frequencies it is the (multiplicative) contribution to the overall parametric yield. From this figure we conclude that the separation between long lines should be much larger for frequencies above 0.4 f0, while at lower frequencies one may stick to the minimum spacing allowed by the technology ground rules.
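To make the preceding calculation concrete, the short C++ sketch below evaluates a frequency-dependent yield under the simple Poisson model Y(fm) = exp(-D0 * Ac(σm)) with σm = f0/fm - 1. The defect density, the shape of the critical-area placeholder function and all names are illustrative assumptions, not taken from [11].

#include <cmath>
#include <cstdio>

// Placeholder for the delay-dependent critical area Ac(sigma_m) in cm^2.
// It must decrease toward the catastrophic critical area Ac(infinity) as the
// tolerable delay increase sigma_m grows; the exact shape would come from the
// defect size distribution and the line geometry analyzed in [11].
double delay_dependent_critical_area(double sigma_m, double ac_catastrophic) {
    return ac_catastrophic * (1.0 + 1.0 / (1.0 + sigma_m));
}

// Frequency-dependent yield Y(fm) under the simple Poisson model
// Y = exp(-d0 * Ac(sigma_m)), with sigma_m = f0/fm - 1.
double frequency_dependent_yield(double fm, double f0,
                                 double d0 /* defects per cm^2 */,
                                 double ac_catastrophic /* cm^2 */) {
    if (fm >= f0) return 0.0;            // cannot exceed the defect-free maximum
    double sigma_m = f0 / fm - 1.0;      // largest tolerable delay-increase factor
    double ac = delay_dependent_critical_area(sigma_m, ac_catastrophic);
    return std::exp(-d0 * ac);
}

int main() {
    const double f0 = 500e6;             // defect-free maximum frequency (Hz)
    for (double fm = 50e6; fm < f0; fm += 100e6)
        std::printf("fm = %3.0f MHz  Y = %.4f\n", fm / 1e6,
                    frequency_dependent_yield(fm, f0, 1.0, 0.02));
    return 0;
}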
Figure 2. Yield vs. frequency for various values of the inter-line spacing S, assuming that f0 (the maximum working frequency of the line) is 500 MHz.

Our second example illustrates the fact that yield and reliability are not disjoint issues and that a trade-off between the two is sometimes required. Various wear-out and internal noise mechanisms affect the reliability of VLSI circuits. These include electromigration, crosstalk, thin oxide breakdown and leakage current phenomena. In this example we focus on crosstalk noise. Crosstalk between two wires is proportional to the coupling capacitance between the wires, which in turn is proportional to their coupling length (the total length of their overlapping segments) and inversely proportional to their separating distance [9]. A similar relationship exists between the sensitivity to short-circuit type defects and the layout parameters. Therefore, techniques similar to those for short-circuit critical area reduction can be used to minimize crosstalk faults [2]. For example, in order to reduce the catastrophic yield losses due to short-circuit type faults, spacing between some of the interconnect lines is increased. This redistribution of spacing will also help to minimize crosstalk faults [8]. Our results [1] show that layout modifications for yield enhancement will also improve the circuit's crosstalk reliability due to the reduced coupling capacitance. However, the optimal solutions for yield and coupling are not always the same, mostly because they react differently to a change in the separating distance. The coupling capacitance between two lines is proportional to s^-1.45, where s is the distance between the two lines [9], while the short-circuit critical area is proportional to s^-3. A simple example in Figure 3 illustrates the difference between the optimal results for the two objectives. We should therefore conclude that in the future, trade-offs between the yield and reliability objectives will often be required, and that yield and reliability are in fact intimately related.
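The differing exponents quoted above suggest a simple way to compare the two objectives numerically. The sketch below, a minimal illustration rather than the authors' procedure, evaluates the relative short-circuit critical area (proportional to s^-3) and the relative coupling capacitance (proportional to s^-1.45) for a range of spacings; the normalization constants are arbitrary.

#include <cmath>
#include <cstdio>

int main() {
    const double s0 = 1.0;   // reference spacing (arbitrary units)
    for (double s = 1.0; s <= 3.0; s += 0.5) {
        // Both metrics are normalized to 1.0 at the reference spacing s0.
        double critical_area = std::pow(s / s0, -3.0);    // short-circuit sensitivity
        double coupling_cap  = std::pow(s / s0, -1.45);   // crosstalk (coupling) sensitivity
        std::printf("s = %.2f   rel. critical area = %.3f   rel. coupling = %.3f\n",
                    s, critical_area, coupling_cap);
    }
    // The critical area falls off much faster with spacing than the coupling
    // capacitance, which is why spacing allocations optimized for yield and
    // for crosstalk generally differ (cf. Figure 3).
    return 0;
}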
Figure 3. Difference between yield and crosstalk optimization. (a) Original layout. (b) Layout for yield optimization. (c) Layout for crosstalk minimization.

REFERENCES

[1] V. K. R. Chiluvuri and I. Koren, "Yield Enhancement vs. Performance Improvement in VLSI Circuits," Proc. of ISSM-95, The Intern. Sympos. on Semiconductor Manufacturing, pp. 28-31, Austin, Sept. 1995.
[2] V. K. R. Chiluvuri and I. Koren, "Layout Synthesis Techniques for Yield Enhancement," IEEE Transactions on Semiconductor Manufacturing, Vol. 8, No. 2, pp. 178-187, May 1995.
[3] S. W. Director, "Optimization of Parametric Yield," Proc. of the 1991 IEEE Internl. Workshop on Defect and Fault Tolerance in VLSI Systems, pp. 1-18, Nov. 1991.
[4] C. Hu, "Future CMOS Scaling and Reliability," Proceedings of the IEEE, Vol. 81, No. 5, pp. 682-689, May 1993.
[5] I. Koren, Z. Koren and C. H. Stapper, "A Unified Negative Binomial Distribution for Yield Analysis of Defect Tolerant Circuits," IEEE Trans. on Computers, Vol. 42, pp. 724-734, June 1993.
[6] I. Koren and A. D. Singh, "Fault Tolerance in VLSI Circuits," Computer, Special Issue on Fault-Tolerant Systems, Vol. 23, pp. 73-83, July 1990.
[7] I. Koren and C. H. Stapper, "Yield Models for Defect Tolerant VLSI Circuits: A Review," Defect and Fault Tolerance in VLSI Systems, Vol. 1, I. Koren (ed.), pp. 1-21, Plenum, 1989.
[8] A. Onozawa, K. Chaudhary and E. S. Kuh, "Performance Driven Spacing Algorithms Using Attractive and Repulsive Constraints for Submicron LSI's," IEEE Transactions on CAD, Vol. 14, No. 6, pp. 707-718, June 1995.
[9] T. Sakurai, S. Kobayashi and M. Noda, "Simple Expressions for Interconnection Delay, Coupling and Crosstalk in VLSI," Proc. of International Symposium on Circuits and Systems, pp. 2375-2378, 1991.
[10] E. Takeda et al., "VLSI Reliability Challenges: From Device Physics to Wafer Scale Systems," Proceedings of the IEEE, Vol. 81, No. 5, pp. 653-674, May 1993.
[11] I. A. Wagner and I. Koren, "The Effect of Spot Defects on the Parametric Yield of Long Interconnection Lines," Proc. of the 1995 IEEE Internl. Workshop on Defect and Fault Tolerance in VLSI Systems, pp. 46-54, Nov. 1995.
A Gridless Multi-layer Channel Router Based on a Combined Constraint Graph and Tile Expansion Approach

Hsiao-Ping Tseng and Carl Sechen
Department of Electrical Engineering, Box 352500, University of Washington, Seattle, WA 98195

Abstract
We present a multi-layer gridless channel router based on the combination of a constraint graph and a tile expansion approach. The constraint graph based approach is used for fast first-pass routing. The maze routing tile expansion approach is used for the second-pass optimization. A unified multi-layer combined constraint graph (MCCG) is created to store the vertical constraints from the vertical layers and the horizontal constraints from the horizontal layers. A gridless multi-layer layout is achieved by processing the horizontal constraints. An iterative graph-based rip-up and reroute algorithm is developed to minimize the longest path in the MCCG. However, the graph-based router is not capable of generating the necessary overshoot wires and unrestricted-layer wires for dense channels. A multi-layer tile expansion algorithm is therefore implemented, based on the A* maze routing algorithm, to optimize the result from the graph-based approach. It allows fast and symmetric horizontal and vertical expansion on the same layer or on alternating layers by using a dual tile plane. We propose an overlapped expansion model to implement the multi-level rip-up and reroute algorithm. The tile expansion router iteratively moves nets from uncongested regions to elsewhere such that the layout area can be compacted. Our gridless router achieves solutions at density for Deutsch's difficult example and other benchmarks in two, three, four and five metal layers.

I. Introduction

The channel routing problem is an important issue in VLSI physical design automation. As multi-layer metal technology evolves, a channel router has to handle the fact that the higher metal layers have significantly larger minimum feature sizes than the lower layers. For a grid-based router, different grid sizes can cause a via alignment problem between different layers. Some critical nets (e.g., clock and power nets) need to be routed with a larger wire width than other signals. Our router takes two gridless approaches: a graph-based algorithm and a tile expansion algorithm.

Constraint graph based algorithms for two [1] and three [2] metal layers have been developed. We present an efficient combined constraint graph based algorithm for multiple metal layers and also a robust rip-up and reroute (RR for short) algorithm to improve the routing results. The graph-based algorithm takes a simple representation of a constraint graph for spacings between components of the layout, but it does not have effective information about spare space in the horizontal or vertical layers. However, that information is required in order to generate useful doglegs and overshoots. For very congested channels, we observe that overshoot doglegs and an unrestricted layer wiring scheme are helpful to reach the density solution. There are many routing algorithms that can handle doglegs efficiently, such as the symbolic router YACR2 [3], the rip-up and reroute maze router MIGHTY [4], Mulch [5], Chameleon [6], the three-layer router [7], the four-layer router [8] and greedy routers [9][10].¹ However, these routers either cannot handle an arbitrary number of layers or they cannot handle variable-width wires. We take the tile expansion approach to generate the necessary doglegs in the second-pass routing stage.

¹The results in [9][10] on a density-based greedy router could not be verified and appear to be incorrect. Implementation of the method yielded very poor results and the authors would not provide authentication of their results.
The tile expansion routing model has been developed for area routers [11][12]. The two-layer area router [11] uses a single tile plane to represent the two-layer layout. A breadth-first maze routing algorithm is implemented as its tile expansion algorithm. It adopts a restricted layer wiring scheme and only allows orthogonal-direction expansion on alternating layers. The multi-layer area router [12] also adopts the restricted layer wiring scheme and only allows H-V alternating tile expansion on alternating layers. However, it implements the A* algorithm [13] as the maze searching algorithm. The A* algorithm is a time-efficient maze searching algorithm and is guaranteed to find an optimal solution according to the given weight function [14]. We implement a multi-layer tile expansion channel router based on the A* algorithm. For the channel routing problem, an unrestricted layer wiring scheme is helpful and necessary in order to approach the optimal solution. Our router allows expansions along the orthogonal direction or the straight direction on the same layer or on alternating layers. This flexible routing style is achieved by a special technique called the dual tile plane. The dual tile plane, which is constructed by simply transforming the coordinates of corner-stitched horizontal strips, has the advantages of fast and symmetrical horizontal and vertical expansions. We also use the overlapped tile expansion model to allow expansion paths to overlap with existing nets. An efficient multi-level rip-up and reroute algorithm is implemented based on this expansion model. The preliminary layout is taken from the graph-based algorithm and is passed to the tile expansion rip-up and reroute router for optimization.

The tile expansion approach also has the advantage of effectively implementing a one-dimensional compactor by minimizing the number of vias and shifting wires [15] during the routing or post-processing stage. By penalizing via creations in the maze router, the number of vias can be significantly reduced. The corner-stitching tile representation has efficient searching algorithms to reference and manipulate fragmented components in the layout. Thus a fast wire-shifting algorithm can be implemented with the tile expansion router.

Our router is partitioned into several stages. First, the graph-based approach handles the layer assignment of pins and horizontal wires. Second, the edge assignment algorithm is applied to minimize the longest path in the constraint graph. Third, a graph-based rip-up and reroute (remove and re-insert) algorithm is used to reduce the longest path of the graph further. Our graph router obtains good results for most channels. However, for very dense channels, the maze routing tile expansion router is needed to optimize the result. In this last stage, a robust corner-stitching rip-up and reroute algorithm is applied.
The remainder of the paper is organized as follows. Section II describes the routing model of the constraint graph. Section III contains a description of the algorithms for layer assignment, edge assignment and rip-up and reroute. Section IV describes the routing model of the corner-stitching data structure and the dual tile plane. Section V presents the A* algorithm, the metal expansion model, the via expansion model, and the corner-stitching rip-up and reroute algorithm. Section VI shows our experimental results. The conclusion is in Section VII.

II. The Constraint Graph Routing Model

The channel routing problem can be considered as a one-dimensional compaction problem. The goal is to reduce the channel height. We transform the geometrical information of the nets into a constraint graph. A directed edge means a strict order of two conjugated nets in the y dimension. An undirected edge means that the two conjugated nets can be flipped. Two nets having a horizontal constraint implies that one net will have to be placed above the other, which further implies that a direction must be assigned to this edge. Changing the direction of the edge is a process we call flipping. The longest path of the constraint graph is equivalent to the channel height. A constraint graph with undirected edges is called incomplete, because by flipping the undirected edges the longest path may be changed. Thus, a gridless channel routing problem is transformed into the graph problem of minimizing the critical path of the constraint graph by assigning a proper direction to each undirected edge (see Figure 2(e)(f)). A valid layout solution can thus be obtained from the complete graph, which has all of its undirected edges processed.

Horizontal Constraints and Vertical Constraints - We adopt the routing model for graph-based algorithms described in earlier papers [1][2]. Each layer is chosen to route only either horizontal (H) wires or vertical (V) wires. A multi-terminal net is broken into two-pin nets. A vertical constraint (VC) is introduced when two vias (1) are on the same layer, (2) have different signals, and (3) overlap in the vertical direction (see Figure 1). A horizontal constraint (HC) is introduced when two horizontal wires (1) are on the same metal layer, (2) have different signals, and (3) overlap in the vertical direction (see the horizontal wires in Figure 1).

Figure 1: Vertical constraint and horizontal constraint.

Problem Formulation - In the multi-layer problem, each vertical (V) layer has at least one adjacent horizontal (H) layer and each horizontal layer has at least one adjacent vertical layer. For example, a four-layer partition model can be one of the HVHV, VHVH, HVVH and VHHV schemes. The pins can only reside on V layers. The horizontal metal segments can only reside on H layers. The vertical segments are constructed from the pins to the horizontal segments.

Nodes and Edges in the Constraint Graph - A node represents a two-pin net, which consists of a left pin, a right pin, a straight horizontal wire, two straight vertical wires from the pins to the horizontal wire, and two vias (see nets 1 and 2 in Figure 1). An edge between two nodes is either directed or undirected. A directed edge between two nodes is equivalent to a vertical constraint between vias of the two nodes. An undirected edge between two nodes is equivalent to a horizontal constraint between the horizontal wires of the two nodes. The weight (EDGEW) of the directed and undirected edges between nodes i and j is the minimum center-to-center spacing between the two components.

A top node and a bottom node are created to represent the top boundary and the bottom boundary, respectively. The channel height is equal to the length of the longest path of the graph. Vertical constraints always appear on vertical layers and horizontal constraints always appear on horizontal layers. A vertical constraint graph (VCG, see Figure 2(b)) is composed of nodes with directed edges. A horizontal constraint graph (HCG, see Figure 2(c)) is composed of nodes with undirected edges. A combined constraint graph (CCG, see Figure 2(d)) is composed of nodes with undirected and directed edges. The VCG for the vertical layers and the HCG for the horizontal layers are merged to form the unique multi-layer combined constraint graph (MCCG). Based on this constraint graph model, we developed a multi-layer router with a rip-up and reroute capability. A detailed description of our graph-based algorithm is presented in the following section.

Figure 2: Examples of combined constraint graphs. (b) VCG. (c) HCG. (d) CCG. (f) Processed CCG.

III. The Combined Constraint Graph Algorithm

The algorithm is divided into two phases. First, pins and horizontal wires of nodes are assigned to proper layers with minimum weight such that the length of the critical path of the constraint graph is minimized. Second, the edge assignment algorithm processes undirected edges in the MCCG to minimize the length of the critical path. The layout is obtained from the complete MCCG. If needed, the rip-up and reroute algorithm removes the nodes on the critical path and re-inserts them into space intervals. These three algorithms are described in the following three subsections.
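As a rough illustration of the data structures implied by the constraint-graph model of Section II, the C++ sketch below declares a two-pin-net node and the directed/undirected edges of an MCCG, together with a longest-path (channel height) computation over the directed edges. The field and function names are our own assumptions, not taken from the paper.

#include <vector>
#include <functional>
#include <utility>
#include <algorithm>

// One two-pin net: the layers holding its pins and its horizontal wire.
struct Node {
    int leftPinLayer, rightPinLayer;   // V layers holding the pins
    int wireLayer;                     // H layer holding the horizontal wire
    // geometry omitted for brevity
};

struct Edge {
    int from, to;     // node indices (top/bottom boundaries are ordinary nodes)
    double weight;    // minimum center-to-center spacing
    bool directed;    // true: vertical constraint; false: horizontal constraint
};

// Longest path from the top-boundary node over the *directed* edges; once all
// undirected edges have been assigned a direction this equals the channel
// height. Assumes the directed edges form a DAG.
double longestPath(int n, const std::vector<Edge>& edges, int top) {
    std::vector<std::vector<std::pair<int, double>>> succ(n);
    for (const Edge& e : edges)
        if (e.directed) succ[e.from].push_back({e.to, e.weight});
    std::vector<double> dist(n, -1.0);           // -1 marks "not yet computed"
    std::function<double(int)> go = [&](int u) -> double {
        if (dist[u] >= 0.0) return dist[u];
        double best = 0.0;
        for (auto [v, w] : succ[u]) best = std::max(best, w + go(v));
        dist[u] = best;
        return best;
    };
    return go(top);
}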
3.1 Layer Assignment of Pins and Nodes

In a hierarchical design style, the layer assignment of pins can be done in the global routing stage or in the detail routing stage. Our router allows the layer assignment of pins to be processed in the detail routing stage. If the pins have already been assigned in the global routing stage, only the layer assignment of horizontal wires (nodes) is invoked. A pin which is assigned to a metal layer in either the global routing stage or in the detail routing stage is called processed. A horizontal wire which is assigned to a metal layer is also called processed. Finally, a node with processed pins and a processed horizontal wire is also called processed. The layer assignment for horizontal wires and pins has four different schemes (see Figure 3).

Figure 3: Layer assignment of horizontal wire and pin. (a) Left pin on lower layer, right pin on upper layer. (b) Left pin on upper layer, right pin on lower layer. (c) Left pin and right pin on upper layer. (d) Left pin and right pin on lower layer.

All the possible H and V layer combinations for the pins and horizontal wires of the two-pin nets are tested iteratively. The combination with minimum weight is chosen. If a pin was processed by a previous two-pin net assignment or by the global router, only the layer combinations for the unprocessed pin and the unprocessed horizontal wire are tested. This layer assignment process efficiently assigns the pins and horizontal wires to multiple layers. The number of vertical constraints and the length of the longest path in the VCG for the V layers are minimized. The number of horizontal constraints is minimized and evenly distributed over the H layers. Since the length of the longest path of the HCG for H layer i is a lower bound for the density in layer i, we want each H layer to have the same density. The three-layer Trigger algorithm [2] takes a different approach. It assigns all horizontal wires to horizontal layers H1 and H2 initially and generates weighted constraint graphs WCG1 and WCG3 for H1V2 and V2H3. The nodes are selected and assigned to a proper layer according to their priority. This insert-all-and-remove approach cannot handle the multi-layer pin assignment problem because, in the general multi-layer problem, the pins can be on layers other than V2. Our approach is an insert-one-after-another style. According to the given prefixed routing scheme, our layer assignment process tries all the possible layer combinations and chooses the best. The primary MCCG constructed in this process is further processed by the edge assignment algorithm, which is described in the following subsection.

3.2 Edge Assignment

An undirected edge is called a critical edge if it causes a cycle under an improper direction assignment. The algorithm always processes the critical edges first and then processes the remaining edges according to the ordering described next. The label of an edge (i, j) between node i and node j is defined as the length of the longest path through the edge if it is assigned the improper direction. The label of an edge is considered to be the estimated length of the longest path through the edge in the final complete graph. We observe that the label value of edge (i, j) is closer to the final label value in the complete graph if the connected nodes i and j have been processed. An edge whose connected nodes are unprocessed tends to have a label value that may fluctuate considerably relative to the final label value in the complete graph. Such an edge with unprocessed connected nodes is less demanding than an edge with processed connected nodes. The undirected edges are divided into three groups in decreasing order of priority: (1) edges whose connected nodes have been processed and whose labels are larger than the channel density, (2) edges whose connected nodes have been processed, and (3) all other edges. The unprocessed undirected edge with the maximum label value in the highest priority group is always selected for processing in each iteration. After all the edges are processed, a preliminary result from the MCCG is obtained. However, the preliminary result is not optimal for some dense channels; it needs to be optimized by rip-up and reroute. We present a graph-based rip-up and reroute algorithm in the following subsection.

3.3 Rip-up and Reroute

An efficient rip-up and reroute process is necessary for some difficult channels to optimize the preliminary result from the edge assignment stage. Our algorithm uses a strategy that removes a node on the critical path and re-inserts it into a space interval with minimum weight. The rip-up and reroute (RR) algorithm is partitioned into two levels: the node RR sequence, and node deletion and re-insertion.

Ripping up a two-pin net is equivalent to deleting a node from the MCCG. Rerouting a two-pin net is equivalent to re-inserting a node into the MCCG. The rip-up and reroute procedure is thus transformed into a node deletion and re-insertion procedure. The nodes which have directed edges to node n in the original MCCG are called rigid neighbors of node n. The nodes which have undirected edges to node n in the original MCCG are called flexible neighbors of node n. A node is processed in the RR stage if it has been ripped up and rerouted.

Node deletion and re-insertion - By flipping the undirected edges to its flexible neighbors, a node can be re-inserted at a different location in the MCCG. Instead of flipping one edge at a time, we sort the flexible neighbors of the RR node by their y value and re-insert the RR node into space intervals between flexible neighbors (see Figure 4). For example, in Figure 4(a), the re-insertion of node 2 into the interval between node 4 and node 5 reduces the channel height by one track. The space intervals above the nearest rigid ancestor or below the nearest rigid descendent are not qualified for node re-insertion, because the RR node cannot be inserted above the rigid ancestor nor below the rigid descendent in the MCCG. The first space interval from the top boundary accommodating a re-insertion of less than or equal weight is chosen. This technique helps push the RR node closer to the top boundary and reduces the length of the critical path.

Figure 4: Re-inserting a node. (a) Channel layout view before rip-up and reroute on node 2. (b) Constraint graph view: (i) before node 2 is processed; inserting node 2 between (ii) nodes 3 and 4, (iii) nodes 4 and 5, and (iv) nodes 5 and 6.

Node Rip-up and Reroute Sequence - All the nodes on the critical path are selected in ascending order of their y value. To effectively push the RR node toward the top boundary, the ancestor nodes of the RR node are processed in advance in ascending order of y value. The RR node is processed last. A node can only be processed once per iteration. A minimized graph is achieved by iteratively running the node rip-up and reroute routines. Our implementation of the graph-based rip-up and reroute algorithm performs well for multiple-vertical-layer problems (typically more than three layers of metal). However, it shows disadvantages for two- and three-layer problems with very congested channels. Our experimental results show that non-terminal doglegs are necessary to achieve solutions at density. Thus, we propose another gridless approach, tile expansion routing, to do the job. This tile expansion maze router is presented in the following section.
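Before moving on, a compact way to picture the node deletion and re-insertion step described above is sketched in C++ below: the RR node's flexible neighbors are sorted by y, and candidate space intervals between consecutive neighbors are scanned from the top boundary, skipping intervals above the nearest rigid ancestor or below the nearest rigid descendent. The structure and names are illustrative assumptions, not the paper's implementation.

#include <vector>
#include <algorithm>

struct Neighbor {
    int    id;
    double y;      // current track position of the neighbor's horizontal wire
};

// Returns the index k such that the RR node is re-inserted between flex[k] and
// flex[k+1], or -1 if no qualifying interval is found. gapNeeded is the
// vertical span the RR node requires (wire width plus spacing on both sides).
int chooseReinsertionInterval(std::vector<Neighbor> flex,
                              double yRigidAncestor,    // nearest rigid ancestor (toward the top)
                              double yRigidDescendent,  // nearest rigid descendent (toward the bottom)
                              double gapNeeded) {
    std::sort(flex.begin(), flex.end(),
              [](const Neighbor& a, const Neighbor& b) { return a.y < b.y; });
    // Scan intervals between consecutive flexible neighbors, top (small y) first.
    for (size_t k = 0; k + 1 < flex.size(); ++k) {
        double lo = flex[k].y, hi = flex[k + 1].y;
        if (hi <= yRigidAncestor)   continue;  // interval lies entirely above the rigid ancestor
        if (lo >= yRigidDescendent) break;     // remaining intervals lie below the rigid descendent
        if (hi - lo >= gapNeeded)   return static_cast<int>(k);
    }
    return -1;   // fall back: leave the node where it is
}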
IV. The Tile Expansion Routing Model

4.1 Corner Stitching Data Structure

Figure 5: Corner stitching data structure. (a) A corner-stitching tile. (b) A corner-stitching tile on 45°-mirrored coordinates.

The corner stitching data structure was introduced by Ousterhout [16]. A rectangular tile has four stitches connecting it to its neighbors (see Figure 5(a)). A space tile represents an empty space block in the layout. A solid tile represents a component in the layout. A tile plane is composed of stitched solid tiles and space tiles. Tiles on the plane are stitched and combined either into strips of maximal horizontal extent, called a horizontal (H) tile plane, or strips of maximal vertical extent, called a vertical (V) tile plane. A tile plane can represent either one single mask layer or multiple mask layers. We chose to have one tile plane represent a metal layer and the vias connected to that metal layer. There are five types of solid tiles in the tile plane: pin, metal, via-up, via-down and via-updown. The via-up and via-down tiles represent the via to the upper metal layer and the via to the lower metal layer, respectively. The via-updown tile exists only when stacked vias are allowed.

An extended data representation built on the corner-stitching tile plane is proposed to perform fast H and V tile expansion. It is described in the following subsection.

4.2 Dual Tile Plane Representation

Figure 6: Dual tile plane representation. (a) Layout of a two-layer example. (b) Horizontal tile plane. (c) Vertical tile plane.

A special technique to mirror the coordinates of an H tile plane at 45 degrees can produce a dual V tile plane (see Figure 5(b)). We define this type of coupled H and V tile planes as the Dual Tile Plane (see Figure 6(b),(c)). The coupled H and V tile planes have consistent geometric information on solid tiles except that the coordinates are switched mutually on both planes. A potential disadvantage of the Dual Tile Plane representation is memory redundancy. However, we observed that the corner stitching routing model does not require much memory for channel routing problems. For example, to route Deutsch's difficult example for two metal layers, only 9K tiles (64 bytes x 9K = 572 Kbytes) are generated for the dual tile plane representation. Instead of causing memory over-consumption, this technique offers the necessary information for tile expansion on the H and V tile planes of the metal layers. The maximal H (V) strip tiles on the H (V) tile plane allow fast tile expansion in the H (V) direction. This extends the horizontal routing efficiency of the conventional single tile plane to both directions of tile expansion. In addition, only the horizontal tile expansion algorithm needs to be implemented, since the vertical tile expansion can be done by simply switching the x and y values in the horizontal algorithm. Moreover, the geometric information of tiles is consistently stored in mutually transposed coordinates of the H and V tile planes. In other words, a tile can be referenced easily between the H and V representations by switching the x and y values.

The dual tile plane Di is the coupled H and V representation for metal layer i. The horizontal and vertical representations of metal layer i are denoted DiH and DiV, respectively. In Figure 6, an example of a two-metal-layer layout is shown on the H and V tile planes. The detailed description of the tile expansion algorithm is presented in the following section.

V. The Tile Expansion Algorithm

As a second-pass optimizing router, the tile expansion router takes the result from the graph-based algorithm and iteratively moves nets from the uncongested regions to elsewhere such that the layout area can be reduced. Our tile expansion router essentially takes the A* algorithm technique [14] and uses a metal expansion model and a via expansion model to find a feasible route with minimum cost. The metal expansion model is applied when the routing path is expanded on the current metal layer. The via expansion model is applied when the routing path is expanded to the adjacent metal layer. Our expansion models allow unrestricted layer routing: the H layer in the graph-based routing stage allows vertical wires, and the V layer allows horizontal wires, in the tile expansion stage. We also propose the overlapped tile expansion model to allow the expansion path to overlap with existing solid tiles. The rip-up and reroute router uses this technique to find a feasible route with minimum overlapping with existing solid tiles. We present the expansion models in Sections 5.1 and 5.2. The weight function for our expansion models is described in Section 5.3. The description of the rip-up and reroute scheme is in Section 5.4.

5.1 Metal Expansion

Tile expansion on the same metal layer is called metal expansion. It essentially generates the necessary metal wire on the expansion path. The expansion can be along both directions. We explain the models and terminology for this expansion method in subsection A and the expansion rules in subsection B.

A. Model

Selected Tile - A candidate tile is chosen by the active tile for verification against the routing and design rules. If it is qualified, it is generated and added into OPEN.

Expandable Area - The maximum rectangular space area covering the selected tile is the expandable area. The fragmented space tiles are accumulated into an area large enough to satisfy the design rules for routing the metal wire (see the expandable area of H4 in Figure 7(a)). The expandable area is invalid if it does not meet the design rules. A tile with an invalid expandable area is not expanded.

Routing Path for Metal Expansion - The routing path for metal expansion is the formation of metal wires. It is constructed from the left boundary to the right boundary inside the expandable area (see Figure 7(b)). The width of the routing path is the width of the metal wire. The routing path on an H (V) space tile is always in the H (V) direction. If the tile is a solid tile, the routing path is the full shape of the tile itself. The center of the constructed routing path for a space tile has to be inside the tile itself, otherwise the routing path is invalid. A tile with an invalid routing path is not expanded.

Design Rule for Metal Expansion - The expandable area has to be large enough to route the metal wire. The distance between the routing path and other components has to be larger than or equal to the minimum spacing.

Alignment of Routing Path - The routing path is aligned to the center line of the solid tile around it. This reduces the number of fragmented space tiles and helps the space tiles accumulate into long horizontal strips. If the alignment procedure fails (see H5 in Figure 7(c)), the selected tile is penalized and is therefore less favored for expansion.

Effective Routing Path for Metal Expansion - The actual metal-covered area of the routing path is called the effective metal routing path. It is constructed only when the path from the descendent node to the active node is decided. The routing path of the active node is trimmed at both ends by the routing paths of the parent node and the descendent node (see Figure 8). The overlapped region of the routing paths between the parent node and the active node belongs to the parent. One exception is that if the parent node or the active node is a solid tile, the overlapped region belongs to the owner of the solid tile.
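The expandable-area and design-rule checks just described amount to simple rectangle arithmetic. The C++ sketch below, with invented names and a single uniform spacing value, shows one way such a check could look; it is not the paper's code.

// Axis-aligned rectangle in layout coordinates.
struct Rect {
    double xlo, ylo, xhi, yhi;
    double width()  const { return xhi - xlo; }
    double height() const { return yhi - ylo; }
};

// A metal-expansion candidate is viable only if the expandable area can hold a
// wire of width wireWidth kept at least minSpacing away from the solid tiles
// bounding the area on both sides (checked here in the vertical direction for
// a horizontal routing path).
bool expandableAreaOk(const Rect& area, double wireWidth, double minSpacing) {
    return area.height() >= wireWidth + 2.0 * minSpacing;
}

// The routing path is centered in the expandable area; it is valid only if its
// center line also lies inside the selected tile itself.
bool routingPathCenterInsideTile(const Rect& area, const Rect& tile) {
    double centerY = 0.5 * (area.ylo + area.yhi);
    return centerY >= tile.ylo && centerY <= tile.yhi;
}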
Figure 7: Expandable area and metal routing path.

Figure 8: Effective metal routing path. The effective metal routing path of H3 on the metal expansion path H1 -> V7 -> H3 -> V1 -> H6.

B. Expansion Rule

Tiles with invalid routing paths or invalid expandable areas are prohibited from expansion. The active tile first selects the neighbor tiles or overlapped tiles on the same layer as candidates. Then strict design rules are applied to check the routing path of these selected tiles. Only qualified tiles are generated by the active tile. The metal expansion rules for the active tile are as follows:

Space Tile - A space tile is only expanded to the alternating tile plane. The overlapped tiles (see V0 in Figure 9(c)) on the V (H) tile plane are generated for an active H (V) space tile. The solid tiles (see V2 and V3 in Figure 9(c)) at both ends of the active space tile are also generated on the alternating plane.

Solid Tile - A solid tile is expanded to the neighbor tiles on both the H and V planes (see Figure 9(a),(b)). The neighboring tiles off the expansion direction (e.g., tiles H1 and H3 in Figure 9(a)) are not generated, because such an expansion would construct a malformed effective routing path between the active tile and the generated tile.

Figure 9: Metal expansion on the dual tile plane. (b) Metal expansion of a solid tile on the V plane; V0, V1, V2 and V3 are generated. (c) Metal expansion of an H space tile on the V plane; V0, V2 and V3 are generated.

5.2 Via Expansion

Tile expansion between alternating metal layers is called via expansion. It essentially generates a necessary via and metal wires on both adjacent layers. The expansion can be along both directions. We explain the models and terminology for this expansion method in subsection A and the expansion rules in subsection B.

A. Model

Figure 10: Via expansion on the dual tile plane. (a) Via expansion of an active tile onto the adjacent (Di±1) layers. (b) Via expansion of an active tile on the DiV layer to the DiH layer; tile DiH1 is generated. (c) Via expansion sequence from DiV0 -> Di+1H0 -> DiV2.
Routing Path for Via Expansion - The routing path for via expansion consists of two parts: the metal routing path and the via block. The metal routing path (block b in Figure 10(c)) is constructed like the routing path in metal expansion. The via block is the formation of a via between the active tile and the selected tile (see block a in Figure 10(c)). The via block is overlapped with the metal routing path. The metal routing path and the via block have to be inside the expandable area. If the tile is an H (V) space tile, the metal routing path is always in the H (V) direction. If the tile is a via, the metal routing path and the via block are the full shape of the tile itself. If the tile is metal, the metal routing path is the full shape of the tile itself. The center of the constructed metal routing path and via block has to be inside the tile itself, otherwise the metal routing path and via block are invalid. A tile with an invalid routing path is not generated.

Design Rule for Via Expansion - The expandable area of the selected tile has to be large enough to route the metal wire and the via. The distance from the metal routing path and the via block to other components must be larger than or equal to the minimum spacing.

Alignment of Routing Path - The metal routing path and via block are aligned to the center line of the solid tiles around the tile. If the alignment procedure fails, this tile is penalized and not preferred.

Effective Routing Path for Via Expansion - The effective routing path for via expansion consists of two parts: the effective metal routing path and the via block (e.g., blocks c and d in Figure 10(c)). The effective metal routing path does not overlap with the via block. It is constructed only when the descendent node to the active node is decided. The routing path of the active node is trimmed at both ends by the routing paths or vias of the parent node and the descendent node. The overlapping region of the routing paths between the parent node and the active node belongs to the parent.

B. Expansion Rule

Tiles with invalid routing paths or invalid expandable areas are prohibited from expansion. The active tile first selects the overlapped tiles on the adjacent layers as candidates. Then strict design rules are applied to check the routing path of these selected tiles. Only qualified tiles are generated by the active tile. The via expansion rules for the active tile are as follows:

1) The active tile (space or solid tile) on the DiH or DiV plane is expanded to the overlapped tiles on the Di+1H, Di+1V, Di-1H and Di-1V planes (see Figure 10(a),(b)). The routing paths and expandable areas of the overlapped tiles are verified against the design rules. Only the qualified overlapped tiles are generated.

2) In a few cases via expansion for the overlapped tile in rule (1) is not allowed: (i) the via routing block of the tile overlaps with via blocks in other metal layers and stacked vias are not allowed; (ii) the selected tile is a solid tile and has a different signal; (iii) the selected tile is metal and has a different orientation from the expansion direction. For example, a horizontal-strip metal tile cannot be expanded on the V plane.

Our via expansion algorithm allows expansion along the same direction on two adjacent layers. This feature increases the flexibility of the routing style and helps to generate potential low-weight routes.

5.3 Weight Function

The weight function f(n) for the A* algorithm is the summation of the cost of the path from the source node to node n and the estimated cost of the path from node n to the goal node. The weight function f(n) essentially guides the maze router to find a best route within the given constraints. The cost function g(n) stands for the cost of the path from the source node to node n. The estimated cost function h(n) stands for the estimated cost from node n to the goal node. A basic weight function consists of the distance from the source node to the active node and the estimated distance from the active node to the goal node. For our tile expansion routing model, non-preferred routes and expansions add contributions to the weight function. Overshoot wires, tiles with misaligned routing paths, metal expansion on the odd tile plane (i.e., the non-preferred direction for routing on a given layer), and via expansion in the same direction are penalized. Expansion on a solid tile with the same signal is at no cost. The cost function g(n) collects the cost of the expansion path from the source node to the parent node of node n. The node which generates node n is called the parent node of n. The cost function g(n) is formulated as follows:

g(n) = g(p) + VIA(n, p) + METAL(p) + OVERSHOOT(p) + MISALIGN(p),

where p is the parent node of node n. The weight for via creation between the parent node and node n is denoted VIA(n, p); it is zero if the parent tile is already a via. Penalty_via is added to VIA if this via expansion is along the same direction. The weight for metal creation on the parent tile is denoted METAL(p); it is zero if the parent tile is already metal. Penalty_wire is added to METAL if this metal expansion is along the non-preferred direction. The weight for misalignment of the routing path on the parent tile is denoted MISALIGN(p). The weight for an overshoot segment on the parent tile is denoted OVERSHOOT(p).

The estimated cost function h(n) combines the cost of the expansion on node n (tile n) and the estimated cost of tile expansion from node n to the goal tile. It is formulated as follows:

h(n) = EST-METAL(n) + EST-VIA(n) + CURRENT-TILE(n)

The weight for the estimated via creations from node n to the goal node is EST-VIA(n). The weight for the estimated wire creations from node n to the goal node is EST-METAL(n). CURRENT-TILE(n), the weight for the expansion on node n, is METAL(n) + OVERSHOOT(n) + MISALIGN(n). Penalty_via causes most via expansions to be along alternating directions. Penalty_wire suppresses most metal expansions along the non-preferred wiring directions. The OVERSHOOT term penalizes the case that occurs when a two-pin net expands over the boundary of the source tile or goal tile and takes two parallel horizontal paths (tracks).
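The C++ sketch below shows how the additive costs of Section 5.3, plus an extra per-tile penalty for overlapping non-equivalent solid tiles as used by the rip-up and reroute scheme of Section 5.4, could drive an A*-style best-first expansion. The penalty constants, type names and the simple goal estimate are placeholders, not values from the paper.

#include <queue>
#include <vector>
#include <cmath>

struct ExpansionNode {
    int    tileId;
    double g;        // accumulated cost from the source
    double f;        // g + h, the A* priority
    int    parent;   // index of the generating node, -1 for the source
};

struct ByF {
    bool operator()(const ExpansionNode& a, const ExpansionNode& b) const {
        return a.f > b.f;    // min-heap on f
    }
};
using OpenList = std::priority_queue<ExpansionNode, std::vector<ExpansionNode>, ByF>;

// Illustrative penalty constants (arbitrary units).
const double PENALTY_VIA       = 5.0;   // via created along the same direction
const double PENALTY_WIRE      = 3.0;   // metal on the non-preferred direction
const double PENALTY_MISALIGN  = 1.0;
const double PENALTY_OVERSHOOT = 4.0;
const double PENALTY_RIPUP     = 8.0;   // per overlapped non-equivalent solid tile

// Incremental cost of stepping from parent tile p to tile n; the boolean flags
// would come from the metal/via expansion models of Sections 5.1 and 5.2.
double stepCost(double wireLen, bool newVia, bool sameDirVia, bool nonPrefDir,
                bool misaligned, bool overshoot, int overlappedSolids) {
    double c = wireLen;
    if (newVia)      c += 1.0 + (sameDirVia ? PENALTY_VIA : 0.0);
    if (nonPrefDir)  c += PENALTY_WIRE;
    if (misaligned)  c += PENALTY_MISALIGN;
    if (overshoot)   c += PENALTY_OVERSHOOT;
    c += PENALTY_RIPUP * overlappedSolids;   // weight augmentation for rip-up and reroute
    return c;
}

// Simple estimate to the goal: remaining Manhattan wire length plus the
// minimum number of layer changes still required.
double estimateToGoal(double dx, double dy, int minLayerChanges) {
    return std::fabs(dx) + std::fabs(dy) + static_cast<double>(minLayerChanges);
}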
A. Multi-level Rip-up and Reroute To prohibit nets from being cyclically thrashed out by each other in congested channels, a multi-level control mechanism was implemented. The level of rip-up and reroute is defined as the number of times that the rip-up and reroute procedure can be executed recursively. We show the description of the multi-level ripup and reroute algorithm in Figure 12. A least congested horizontal region (a track in the grid-based problem) in the channel is selected, all the nets overlapped with that region are ripped up and rerouted elsewhere by a level n RR process. The level n RR
215
process for node p is accomplished by following procedures - (I) search for the least cost route, (2) remove nets which own the overlapped tiles on the route, (3) draw the route, (4) run the level n-i RR process for the removed nets, (5) if any net in step 4 fails, collect the failed solid tiles owned by the failed nets, and (6) call procedure I with n = n- 1 and with the failed solid tiles not being allowed to be expanded. The level 0 RR process does not allow overlapped tile expansions on non-equivalent tiles. If the level n RR process fails to reroute nets from the selected region to elsewhere, find the next least congested region and continue the process.
A multi-level rip-up and reroute tree is constructed to show the hierarchical relation of nodes in different RR levels (see Figure 11). Net p is processed in the level n RR and can find Rt possible (overlapped or non-overlapped) routes. The route rt of net p has Nrt overlapped nets. The m-th net (m in [1, Nrt]) on the route rt of node p is denoted as net(rt, m) in the level n-1 RR process. If net i is prior to net j and net i is on the path to net j in the RR tree, net i is the ancestor of net j and net j is the descendent of net i in the RR process. This RR process is analogous to the weak modification in the two-layer router MIGHTY[4]: both adopt local modifications to reroute the neighboring nets elsewhere so that the current net can be successfully routed. In order to control the computation time and to obtain better results, a limited number of routes (Rt = 12) is considered for each net at each level of RR.
Figure 11: Multi-level Rip-up and Reroute Tree (level n RR, level n-1 RR, ..., level 0 RR; at level 0 only non-overlapped reroutes are allowed).

Figure 12: Multi-level Rip-up and Reroute Algorithm.

(a) Area-Oriented Rip-up and Reroute Algorithm
RR-REGS = set of capable rip-up and reroute horizontal strip regions (tracks) in the channel.
do {
  (1) select LEAST-REG = least congested region in RR-REGS
      RR-REGS = RR-REGS - LEAST-REG
      N = { net i : horizontal wires or vias of net i overlap LEAST-REG }
      do {
        select net i in N; N = N - net i
        call net-RR(net i, maxRRlevel)   /* maxRRlevel is the level n in Figure 11 */
      } while (N != empty)
      if any net fails, goto (1)
} while (channel height > channel density)

(b) Multi-level Net Rip-up and Reroute Algorithm
net-RR(net i, level l) {
  BAD-SOLIDS = set of difficult solid tiles overlapped with previously failed routes;
               new routes are forbidden to overlap with these tiles again.
  do {
    CSOLIDS = overlapped solid tiles within this route.
    ROUTE = a solution of the expansion path from the left pin to the right pin of net i
            in level l; if l > 0, expansions on overlapped tiles are allowed; tiles in
            BAD-SOLIDS are prohibited from expansion; the overlapped tiles are collected
            into CSOLIDS.
    if ROUTE = empty, return fail.
    set backtracing point here.
    RRN = set of nets which own CSOLIDS.
    rip up nets in RRN
    draw ROUTE for net i
    FAIL-RRN = set of nets in RRN failing to be rerouted.
    for each net j in RRN
      call net-RR(net j, level l-1)
      if it fails, FAIL-RRN = FAIL-RRN + net j
    FAIL-SOLIDS = { solid s in CSOLIDS : s is owned by a net k in FAIL-RRN }
    BAD-SOLIDS = BAD-SOLIDS + FAIL-SOLIDS
    if (FAIL-RRN = empty) return success
  } while (true)
}
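The control flow of the net-level recursion in Figure 12(b) can be sketched roughly as follows. The helper functions (findRoute, ownerOf, ripUp, drawRoute) are stubs standing in for the router's real tile-plane operations, and the restoration performed by the backtracing event queue is omitted, so this is an illustration of the recursion only, not the authors' code.

#include <set>
#include <vector>

struct Route { std::vector<int> overlappedSolids; bool found = false; };

// Stubs standing in for the router's real operations on the dual tile planes.
Route findRoute(int /*net*/, int /*level*/, const std::set<int>& /*badSolids*/) { return {}; }
int   ownerOf(int /*solidTile*/)                                               { return -1; }
void  ripUp(int /*net*/)                                                       {}
void  drawRoute(int /*net*/, const Route& /*r*/)                               {}

// Level-l rip-up and reroute for one two-pin net (cf. Figure 12(b)).
bool netRR(int net, int level, std::set<int>& badSolids) {
    while (true) {
        // At level 0 the route search is expected to return overlap-free routes only.
        Route r = findRoute(net, level, badSolids);
        if (!r.found) return false;

        std::set<int> rrn;                              // nets owning overlapped tiles
        for (int tile : r.overlappedSolids) rrn.insert(ownerOf(tile));
        for (int victim : rrn) ripUp(victim);
        drawRoute(net, r);

        std::set<int> failed;                           // nets that cannot be rerouted
        for (int victim : rrn)
            if (!netRR(victim, level - 1, badSolids)) failed.insert(victim);
        if (failed.empty()) return true;

        // Remember the solid tiles of the failed nets and retry this net without
        // expanding over them; the real router would first restore the tile planes
        // to the backtracing point using its event queue of inverse operations.
        for (int tile : r.overlappedSolids)
            if (failed.count(ownerOf(tile))) badSolids.insert(tile);
    }
}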
B. Multi-level backtracing

In order to restore the previous setting whenever any level of RR fails, the corner stitching data structure is designed in an object-oriented style and all the tile manipulation operations (creation, modification, deletion) are recorded with minimum information in an event queue. Whenever the RR procedure fails, simple inverse operations (free, copy-back, undelete) from the event queue are executed to restore the dual tile planes to the specified point. It appears that 60 backtracing levels are necessary to solve Deutsch's difficult example in the two-layer case. This dynamic backtracing scheme with minimum memory consumption also improves performance, since usually only tens of inverse operations are needed for the local restorations in the lowest level of the RR process.

C. Weight Function

We add two more terms to the cost function g(n) for the contribution of overlapped solid tiles. The weight for ripping up ancestor net i with overlapped tile p is denoted as RIP-ANC(i,p). The weight for ripping up net i with overlapped tile p is denoted as RIP-OVERLAP(i,p). RIP-ANC penalizes an expansion which rips up ancestor nets, and thus reduces redundant RR thrashing. RIP-OVERLAP estimates the difficulty of rerouting the overlapped net i by the measure of its net horizontal span and the size of the overlapped tile. RIP-OVERLAP plays the most significant part in searching for a good route with easy-to-reroute overlapped tiles.

VI. Experimental Results

The algorithms are implemented in GNU C++ and run on Unix systems. A comparison with other multi-layer routers for Deutsch's difficult example [18] is shown in Table 1. The two-layer and three-layer results for examples 3a, 3b and 3c in Yoshimura and Kuh's paper [17] and r1 through r4 in [3] are compared with other routers in Table 2. Our router achieves equal or better solutions than the other routers for all the examples. The 3a, 3b and 3c examples need a level 2 rip-up and reroute process to achieve the optimal solution in both the two-layer and the three-layer case, except example 3b, which needs level 3 in the three-layer case. The router takes 7 seconds of CPU time on a Digital AlphaStation 250/266 to obtain an 18-track solution for example YK3c in the two-layer case. The optimal solution for Deutsch's example in the two-layer case is completed by a level 8 rip-up and reroute process, and the three-layer solution of Deutsch's example is reached by a level 9 rip-up and reroute process. The layout of a two-layer solution for Deutsch's difficult example is shown in Figure 13. Note that the layouts of these examples in the three-layer case are routed without stacked vias. In Table 2, only Trigger and our router use gridless approaches; nevertheless, our multi-layer gridless router obtains results as good as the grid-based routers. We show the comparison of total wire length with other two-layer routers in Table 3. Our router generates less wire length than any other router except MIGHTY.
Table 1: Comparison for Deutsch's difficult example (density = 19) with multi-layer routers (our router vs. Chameleon [6] for the 2(HV), 3(HVH), 4(HVHV) and 5(HVHVH) wiring schemes).

Table 2: Comparison in the two-layer and three-layer cases for examples YK3a, YK3b, YK3c, r1-r4 and Deutsch's difficult example (two-layer model: YACR [3], Chameleon [6], MIGHTY [4], ours; three-layer model: Trigger [2], Robust [7], Chameleon [6], ours).

Table 3: Comparison for Deutsch's difficult example in the two-layer case.
Router | Tracks | Restricted wiring scheme | Net length
Our router | 19 | No | 4942
Hierarchical [19] | 19 | Yes | 5023
YACR2 [3] | 19 | No | 5020
MIGHTY [4] | 19 | No | 4838
Robust [7] | 19 | Yes | 4961
General Greedy [9] | 19 | Yes | 5004

Figure 13: Two metal layer result for Deutsch's difficult example.

VII. Conclusion

We presented a multi-layer gridless channel router based on the combination of a constraint graph approach and a tile expansion approach. The constraint-graph-based approach was used for fast first-pass routing, and the maze routing tile expansion approach was used for second-pass optimization. A unified multi-layer combined constraint graph (MCCG) was created to store the vertical constraints from the vertical layers and the horizontal constraints from the horizontal layers. A gridless multi-layer layout was achieved by assigning directions to the horizontal constraint edges. An iterative graph-based rip-up and reroute algorithm was developed to minimize the longest path in the MCCG. However, the graph-based router was not capable of generating the overshoot wires and unrestricted layer wires necessary for dense channels. Therefore a multi-layer tile expansion algorithm was implemented, based on the A* maze routing algorithm, to optimize the result from the graph-based approach. It allows fast and symmetric horizontal and vertical expansion on the same layer or on alternating layers by using a dual tile plane. We proposed an overlapped expansion model to implement the multi-level rip-up and reroute algorithm. The tile expansion router iteratively moves nets from the least congested regions to elsewhere such that the layout area can be compacted. Our gridless router achieves solutions at density for Deutsch's difficult example and other benchmarks in two, three, four, and five metal layers.

References

[1] Howard H. Chen and Ernest S. Kuh, "Glitter: A Gridless Variable Width Channel Router," IEEE Transactions on Computer-Aided Design, Vol. CAD-5, No. 4, pp. 459-465, October 1986.
[2] Howard H. Chen, "Trigger: A Three-Layer Gridless Channel Router," Proceedings of the IEEE International Conference on Computer-Aided Design (ICCAD), 1986, pp. 196-199.
[3] A. Sangiovanni-Vincentelli et al., "A New Gridless Channel Router: Yet Another Channel Router the Second (YACR-II)," Proceedings of the IEEE International Conference on Computer-Aided Design (ICCAD), 1984, pp. 72-75.
[4] Hyunchul Shin and Alberto Sangiovanni-Vincentelli, "MIGHTY: A 'Rip-up and Reroute' Detailed Router," Proceedings of the IEEE International Conference on Computer-Aided Design (ICCAD), 1986, pp. 2-5.
[5] Ronald I. Greenberg, Alexander T. Ishii, and Alberto L. Sangiovanni-Vincentelli, "MulCh: A Multi-Layer Channel Router Using One, Two, and Three Layer Partitions," Proceedings of the IEEE International Conference on Computer-Aided Design (ICCAD), 1988, pp. 88-91.
[6] Douglas Braun et al., "Chameleon: A New Multi-Layer Channel Router," Proceedings of the 23rd ACM/IEEE Design Automation Conference, pp. 495-502, 1986.
[7] Uzi Yoeli, "A Robust Channel Router," IEEE Transactions on Computer-Aided Design, Vol. 10, No. 2, pp. 212-219, February 1991.
[8] Jingsheng Cong, D. F. Wong and C. L. Liu, "A New Approach to Three- or Four-Layer Channel Routing," IEEE Transactions on Computer-Aided Design, Vol. 7, No. 10, pp. 1094-1104, October 1988.
[9] Tai-Tsung Ho et al., "A General Greedy Channel Routing Algorithm," IEEE Transactions on Computer-Aided Design, Vol. 10, No. 2, pp. 204-211, February 1991.
[10] Tai-Tsung Ho, "A Density-Based Greedy Router," IEEE Transactions on Computer-Aided Design, Vol. 13, No. 7, pp. 974-981, 1993.
[11] A. Margarino, A. Romano, A. De Gloria, F. Curatelli and P. Antognetti, "A Tile-Expansion Router," IEEE Transactions on Computer-Aided Design, Vol. CAD-6, No. 4, pp. 507-517, July 1987.
[12] Chia-Chun Tsai, Sao-Jie Chen and Wu-Shiung Feng, "An H-V Alternating Router," IEEE Transactions on Computer-Aided Design, Vol. 11, No. 8, pp. 976-991, August 1992.
[13] N. J. Nilsson, Principles of Artificial Intelligence, Englewood Cliffs, NJ, 1980, pp. 53-94.
[14] Gary W. Clow, "A Global Routing Algorithm for General Cells," Proceedings of the 21st ACM/IEEE Design Automation Conference, pp. 45-51.
[15] Chung-Kuan Cheng and David N. Deutsch, "Improved Channel Routing by Via Minimization and Shifting," Proceedings of the 25th ACM/IEEE Design Automation Conference, pp. 677-680, 1988.
[16] John K. Ousterhout, "Corner Stitching: A Data-Structuring Technique for VLSI Layout Tools," IEEE Transactions on Computer-Aided Design, Vol. CAD-3, No. 1, pp. 87-100, January 1984.
[17] Takeshi Yoshimura and Ernest S. Kuh, "Efficient Algorithms for Channel Routing," IEEE Transactions on Computer-Aided Design, Vol. CAD-1, No. 1, pp. 25-35, January 1982.
[18] D. Deutsch, "A 'Dogleg' Channel Router," Proceedings of the 13th ACM/IEEE Design Automation Conference, pp. 425-433, 1976.
[19] M. Burstein and R. Pelavin, "Hierarchical Channel Router," Proceedings of the 20th ACM/IEEE Design Automation Conference, pp. 591-597, 1983.
A Multi-layer Chip-level Global Router

Le-Chin Eugene Liu and Carl Sechen
Department of Electrical Engineering, Box 352500
University of Washington, Seattle, WA 98195

Abstract

We present a chip-level global router based on a new, more accurate routing model for the multi-layer macro-cell technology. The routing model uses a 3-dimensional mixed directed/undirected routing graph, which provides not only topological information but also layer information. The irregular routing graph accurately models the multi-layer routing problem, so the global router can give a good estimate of the routing resources needed. To generate the routes on the graph, we search for Steiner minimum trees for the nets. Since the Steiner problem in networks is NP-hard, we developed an improved Steiner tree heuristic which is suitable for our routing graph and able to generate high quality Steiner tree routing. The M-route method used in Mickey[2] is an improvement over the minimum spanning tree based Steiner tree heuristic proposed in [4]. Our algorithm shows even better results and uses far less memory than Mickey. This advantage makes our global router applicable to large industrial circuits, easily handling 200 macro cells and 10,000 nets. While minimizing the wire length, our global router can minimize the chip area, minimize the number of vias, or solve routing resource congestion problems. Test results on industrial circuits show that our global router yields better results than previous methods and that it is practical for large circuits built on multi-layer technologies.
1 Introduction

The macro-cell design style is one of the most important VLSI design approaches. Because the macro-cell design style is flexible in cell size, cell type, cell placement, and interconnections, it allows for very compact and high performance designs. The nature of the macro-cell design style gives the routing process a much higher complexity than other methods. Traditionally, the routing process is divided into global and detailed routing: global routing finds a routing path for each net, and detailed routing assigns the actual tracks and vias. In the past, with one or two layers available for global routing, the job was simply to find the routing channels needed for each net. However, as VLSI technologies evolved, the problem has changed in many ways. For multi-layer VLSI technology, routing in the third dimension is an important issue. Many of the previous methodologies don't work with the new technology. Some methods for multi-layer routing problems can be found in [1][7], but few of them are applicable to macro-cell global routing. In general, global routing for macro-cell layout is based on a routing graph which is defined from the layout. Some routing graphs may be independent of the layout, for example a regular grid graph, but this is not a very accurate approach. The next step is to route the nets sequentially or concurrently. Each net can be formulated as the Steiner problem in networks or some variation of the Steiner tree problem. Finding a Steiner minimum tree is an NP-hard problem, and therefore it is necessary to use heuristics. Conventional global routers use planar routing graphs, which is adequate for single-layer or two-layer technology. For today's multi-layer technology, it is obvious that a planar routing graph cannot model the new technology properly. To solve the new routing problem, first we need a new routing model; second, we need to develop an effective and efficient Steiner tree heuristic to work on the new model.

A chip-level global router is introduced in this paper to work with the new technology. Our goal is to develop a model which can fit various multi-layer VLSI technologies. For this purpose, we developed a new routing architecture which defines new routing regions and a 3-dimensional routing graph. The graph contains layer and via information. To generate the routing, a Steiner tree algorithm based on the shortest path heuristic is used. We improved the algorithm to generate better results without sacrificing much efficiency. Our global router generates an initial minimum-cost routing. The cost is a function of wire length and the number of vias. Then a second stage can be added to either minimize the chip area or to solve the congestion problems while minimizing total cost. A rip-up and re-route method is used to achieve these goals. The rest of the paper is organized as follows. Section 2 briefly reviews some other approaches related to macro-cell global routing. Section 3 describes the global routing model and the methodologies used in our router. Section 4 discusses the heuristic Steiner tree algorithm. Section 5 explains our congestion-problem solving and area minimization algorithm. Section 6 shows experimental results and performance. Section 7 concludes the paper.

2 Review
Routing can be done in one phase, which is called area routing. A few techniques which can be used in area routing are reviewed in [1], such as maze routing and line probing (line-search and line-expansion). Tile expansion[15][16] is another area routing technique, based on the corner stitching data structure. However, this approach is computationally infeasible for chip-level VLSI circuits. A divide-and-conquer strategy is more practical for large circuits. A lot of research has been done on the macro-cell global routing problem[1]. One way of solving the global routing problem is to use integer programming[11], but its results are only an approximate solution. Usually graph-based methods are used for macro-cell global routing. Defining a routing graph decides the model for the routing problem. Grid graphs are used in a few routers[8][12]. The advantage of using grid graphs is that a maze router can be used to do the routing. Grid graphs are good for array-style circuits; however, they are not very suitable for general macro-cell global routing. There are a few other routing graphs, like graphs derived from the floor plan[17][19], and channel graphs[2][3]. Among these, channel graphs are more accurate and general than all other graphs. The main task of global routing is to find the routes for the nets under some constraints. For two-pin nets, a few methods can be applied, such as maze routing, the line-probe algorithm, and the shortest path method. Multi-pin nets are much more difficult to deal with. In one approach, the multi-pin nets are decomposed into pairs of pins, and then the previously mentioned methods are used for each pair. This yields poor results. The natural approach is the Steiner formulation of the problem. Finding a route for a multi-pin net on a routing graph is the Steiner problem in networks. Unfortunately, this is an NP-hard problem[6], and therefore heuristics are needed. One variation is to find a Steiner Min-Max tree (SMMT) instead of a Steiner minimum tree[8]. An SMMT is a Steiner tree with the maximum-weight edge minimized. The time complexity is better. In general, an SMMT is good for avoiding congestion, but not good for minimizing wire length. The Rectilinear
Steiner tree approach is another important method for global routing[9][10][13]. Finding an optimal RST is still an NP-hard problem[14], although it can be solved by polynomial time algorithms with some restrictions[6]. Usually, an RST for a given set of terminals is found in a plane and not in a routing graph. This makes it difficult to apply an RST method to a complicated macro-cell global routing problem. Mickey[2] is a graph-based global router which uses an M-route method for finding routes on a channel graph. Mickey has outstanding performance on global routing quality. One key contribution is the M-route method which it uses to find the routing trees. The M-route method can almost always find the Steiner minimum trees on the routing graph. Mickey uses a channel graph[3] as the routing graph. The pins of a net are mapped to the corresponding positions on the edges as additional nodes. Then Mickey tries to find the M shortest routes for the net. To find the first route, a Steiner tree heuristic proposed in [4] is used. The heuristic first constructs a distance graph, which is a complete graph consisting of only the pins. A minimum spanning tree is found on the distance graph. Then the minimum spanning tree is mapped back to the original graph, so that the first route is obtained. An improvement stage follows to generate M-1 other shortest routes. An edge not in the first route is selected. The end points of the edge are extended until they reach the nodes in the first route. The extended segment forms a cycle. The cycle is broken by removing a longest segment of the original tree, and therefore a new tree is generated. The length difference between the new route and the first route is stored in a priority tree along with the edge and a few corresponding data. The key of the priority tree is the length difference. All the edges not in the first route are scanned, unless the edge has been included in a previous extended segment. If a better route is found during the process, the new route is used as the new first tree. After all edges are scanned, extracting the next M-1 edges from the priority tree yields the next M-1 shortest routes. The process of generating the M-1 other routes helps to find the Steiner minimum tree. All the M routes are stored for later re-routing purposes. In [2], Mickey was shown to significantly outperform the previously best known irregular graph-based global routers, Mercury[20] and TimberWolfMC[3]. However, for modern VLSI technology, Mickey has some disadvantages. First, the channel graph is not good for multi-layer technology, because some layers are available over the top of cells and a channel graph does not have layer information. Second, the M-route method consumes too much memory because it has to find and store so many extra routes. In the following sections, we present a new graph-based global router which can meet the requirements of new VLSI technologies. Grid graphs can be easily extended to multi-layer structures. Channel graphs are more accurate for modeling the layout. We combine the features of both graphs. The result is an irregular 3-dimensional routing graph which can accurately model the macro-cell multi-layer technology. A new Steiner tree algorithm is developed to efficiently generate quality routes and avoid the memory problem which Mickey has. In addition, the routing model is flexible and can be used with many different objective functions.
Aside from Mickey, we are not aware of any previously published method which can effectively handle an irregular 3-dimensional global routing graph.
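For reference, the MST-on-the-distance-network heuristic [4] described above can be sketched as follows; the adjacency-list graph type is an illustrative assumption rather than either router's actual data structure, and duplicate edges produced when shortest paths share segments are not removed.

#include <functional>
#include <limits>
#include <queue>
#include <utility>
#include <vector>

using Graph = std::vector<std::vector<std::pair<int,double>>>;  // adjacency: (neighbor, weight)

// Single-source shortest paths (Dijkstra); returns distances and predecessors.
static void dijkstra(const Graph& g, int src,
                     std::vector<double>& dist, std::vector<int>& pred) {
    const double INF = std::numeric_limits<double>::infinity();
    dist.assign(g.size(), INF);
    pred.assign(g.size(), -1);
    using QE = std::pair<double,int>;
    std::priority_queue<QE, std::vector<QE>, std::greater<QE>> pq;
    dist[src] = 0.0; pq.push({0.0, src});
    while (!pq.empty()) {
        auto [d, u] = pq.top(); pq.pop();
        if (d > dist[u]) continue;
        for (auto [v, w] : g[u])
            if (dist[u] + w < dist[v]) {
                dist[v] = dist[u] + w; pred[v] = u;
                pq.push({dist[v], v});
            }
    }
}

// First route of the heuristic: build the complete distance graph over the pins,
// take its minimum spanning tree, and map each MST edge back to a shortest path.
// The graph is assumed to be connected.
std::vector<std::pair<int,int>> kmbFirstRoute(const Graph& g,
                                              const std::vector<int>& pins) {
    int k = static_cast<int>(pins.size());
    std::vector<std::vector<double>> dist(k);
    std::vector<std::vector<int>>    pred(k);
    for (int i = 0; i < k; ++i) dijkstra(g, pins[i], dist[i], pred[i]);

    std::vector<bool>   inTree(k, false);
    std::vector<double> best(k, std::numeric_limits<double>::infinity());
    std::vector<int>    parent(k, -1);
    best[0] = 0.0;
    std::vector<std::pair<int,int>> edges;
    for (int it = 0; it < k; ++it) {           // Prim's MST on the distance graph
        int u = -1;
        for (int i = 0; i < k; ++i)
            if (!inTree[i] && (u < 0 || best[i] < best[u])) u = i;
        inTree[u] = true;
        if (parent[u] >= 0)                    // map the MST edge back to a shortest path
            for (int v = pins[u]; v != pins[parent[u]]; v = pred[parent[u]][v])
                edges.push_back({pred[parent[u]][v], v});
        for (int i = 0; i < k; ++i)
            if (!inTree[i] && dist[u][pins[i]] < best[i]) {
                best[i] = dist[u][pins[i]]; parent[i] = u;
            }
    }
    return edges;
}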
3 Routing Architecture

Our global router is for macro-cell layout, where we assume the macro cells are rectilinear. Given a placement of macro cells, the chip area is divided into small regions by cut lines, which are the extension lines of the boundaries of the macro cells. Figure 1 shows how the regions are defined; the dashed lines are cut lines. In each region, we place a node for each layer. The nodes for different layers in the same region are connected by via edges. If a layer is used for horizontal tracks, horizontally adjacent nodes of the layer are connected by edges; that means each node of the layer has horizontal edges connected to the nodes of the adjacent regions. Similarly, for layers used for vertical tracks, vertical edges are present between the adjacent nodes. Furthermore, if a certain layer cannot go through the cells, there will be no edges connecting the nodes of that layer inside the cells. The only exception is the boundary regions of the cells: the nodes in the regions adjacent to cell boundaries still have edges connecting to regions outside the cell. That is because the pins on the cell's boundaries are mapped inside the boundary regions, so the edges across the boundaries on a cell-blocked layer are needed for the pins to exit. But those edges are directed edges; they can only be used for the pins to exit and are not used for any other routing purposes. The directed edges make route searching on the blocked layer efficient. Figure 2 is an example of a routing graph. There are two layers: one is for horizontal and the other for vertical routing. The horizontal layer is not available inside the cells. The vertical layer is available over the top. The regions between the cells look like the channels of a channel graph. However, in a conventional channel graph the edges represent channels, whereas in our routing graph the nodes represent the channels.

Figure 1. Three macro cells and cut lines.

Figure 2. Global routing graph for the example of Fig. 1.

Some regions can be very small due to cut lines which are close to each other. Those regions cause efficiency problems and do not need to exist. We therefore merge two cut lines when they are too close to each other. We set the threshold for merging as two times the track pitch: if a region can accommodate fewer than two tracks, it is merged. The capacities of the regions have to be adjusted due to the merging. The routes are searched on the
routing graph according to the weights of the edges. The weights of vias can be assigned to reflect the resistance of vias; or, if the number of vias is to be minimized, the via edges can be assigned a huge weight. This is an effective way of modeling the use of vias for VLSI layout. Usually, the weights of non-via edges are set to reflect the wire length, but the edge weights on different layers can also be adjusted according to the different resistance or other measures which differentiate the layers. The advantage of the routing graph is that it closely models the actual multi-layer features and it is also very flexible. It has no constraints on the number of layers: one additional layer simply requires one more layer of nodes. Each layer can be configured individually. Some layers are only available in the traditional channels; some layers are available all over the chip. The routing graph is constructed according to the configuration of each layer. In addition, the structure is so flexible that one can even specify the capacity of a certain layer for a certain region. The via factor is also taken care of inherently in the routing graph. We define regions in this manner because this way the utilization of each layer within a region is uniform: a given layer in a region will be entirely blocked, entirely free, or partially free (in a relatively uniform sense, i.e. such a layer can occur over a macro cell when that metal layer is partially used for intra-cell routing). This routing model provides the ability to evolve with modern VLSI technology. After the routing graph is constructed, the global routing is done net by net. For a net, all the pins are mapped to the nodes corresponding to the layer specified and the regions where the pins reside. For the pins on the boundaries of the cells, we map them inside the cells, unlike many other global routers which map the pins outside the cells. Our graph yields a more accurate estimate of via usage. We formulate the global routing problem as finding a Steiner minimum tree on the routing graph. According to the nature of our routing graph, we developed a practical and efficient algorithm to solve this problem. The algorithm will be introduced in the next section.
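A minimal sketch of how such a mixed directed/undirected routing graph might be assembled from the regions and layer configurations; the adjacency lists, weights and type names below are illustrative assumptions, not the authors' implementation.

#include <utility>
#include <vector>

enum class Dir { Horizontal, Vertical };

struct LayerCfg {
    Dir  dir;         // preferred routing direction of this layer
    bool overCells;   // may the layer be used on top of macro cells?
};

struct Edge { int to; double weight; bool directed; };

// Node index for (region r, layer l) with L layers per region.
inline int nodeId(int r, int l, int L) { return r * L + l; }

// One node per region per layer, via edges between stacked layers of a region,
// and wiring edges between adjacent regions along each layer's direction.
std::vector<std::vector<Edge>> buildRoutingGraph(
        int R, const std::vector<LayerCfg>& layers,
        const std::vector<std::pair<int,int>>& hAdj,   // horizontally adjacent region pairs
        const std::vector<std::pair<int,int>>& vAdj,   // vertically adjacent region pairs
        const std::vector<bool>& insideCell,           // region lies inside a macro cell
        double viaWeight) {
    int L = static_cast<int>(layers.size());
    std::vector<std::vector<Edge>> g(R * L);

    // Via edges connect the layers stacked over one region.
    for (int r = 0; r < R; ++r)
        for (int l = 0; l + 1 < L; ++l) {
            g[nodeId(r, l, L)].push_back({nodeId(r, l + 1, L), viaWeight, false});
            g[nodeId(r, l + 1, L)].push_back({nodeId(r, l, L), viaWeight, false});
        }

    // Wiring edges on each layer along its preferred direction.
    for (int l = 0; l < L; ++l) {
        const auto& adj = (layers[l].dir == Dir::Horizontal) ? hAdj : vAdj;
        for (auto [a, b] : adj) {
            bool blockedA = insideCell[a] && !layers[l].overCells;
            bool blockedB = insideCell[b] && !layers[l].overCells;
            if (blockedA && blockedB) continue;        // both regions inside a cell: no edge
            double w = 1.0;                            // e.g. distance between region centers
            if (!blockedA && !blockedB) {
                g[nodeId(a, l, L)].push_back({nodeId(b, l, L), w, false});
                g[nodeId(b, l, L)].push_back({nodeId(a, l, L), w, false});
            } else {
                // Boundary crossing on a cell-blocked layer: keep only a directed
                // "exit" edge so that pins mapped inside the boundary region can leave.
                int from = blockedA ? a : b, to = blockedA ? b : a;
                g[nodeId(from, l, L)].push_back({nodeId(to, l, L), w, true});
            }
        }
    }
    return g;
}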
4 Route-generating Algorithm

4.a Introduction

If the nets to be routed have only two pins, finding the shortest path is very simple: Dijkstra's and other shortest-path algorithms presented in [5] can be applied. If the nets have more than two pins, finding the shortest route is the Steiner problem in networks. The definition of the Steiner problem in networks from [6] is as follows:

* GIVEN: An undirected network G = (V, E, c), where c: E -> R is an edge length function, and a non-empty set N, N a subset of V, of terminals.
* FIND: A subnetwork TG(N) of G such that there is a path between every pair of terminals and the total length |TG(N)| = sum of c(e) over all edges e in TG(N) is minimized.

TG(N) is called a Steiner minimal tree of G. Although our graph is not undirected in some cases, basically the global routing problem is still a Steiner problem. The Steiner problem in networks is an NP-hard problem[6]; there is no efficient way of finding the Steiner minimum tree. Many heuristics are introduced in [6]. Some of them have unacceptable time complexity; some of them yield poor results. For routing problems, the shortest path heuristic and the minimum-spanning-tree (MST) on the distance network heuristic are two practical algorithms. It has been proven that these two heuristics have the same worst-case bound: 2(1-1/n) times the length of the Steiner minimal tree, where n is the number of terminals. Actually, in [4], the MST on the distance network heuristic is proven to have the bound of 2(1-1/l), where l is the number of leaves in the Steiner minimal tree. But this is a theoretical worst-case bound, because l cannot be known until the Steiner minimal tree is found. Although the latter heuristic has a better worst-case bound theoretically, neither of the two heuristics always generates better solutions than the other; in general, the shortest path heuristic often generates better results[4]. In a Steiner tree, those vertices which are not terminals are called Steiner points. To find the Steiner minimal tree is to find the proper Steiner points in the network. When the Steiner tree heuristics are applied to the routing problem, extra improvement stages can be used to generate better results. Mickey[2] used the MST on the distance network heuristic to find the first Steiner tree. A few more trees are then searched for based on the first tree. The process explores possible Steiner points for the Steiner minimal tree, so the optimal results can often be found when the first tree is not optimal. Table 2 shows that Mickey's M-route method has a substantial improvement over the original heuristic. However, its extensive memory requirements render it impractical for large designs. We developed our algorithm based on a shortest path heuristic which in turn is based on Kruskal's algorithm for finding a minimum spanning tree, so our algorithm has the bound of 2(1-1/|P|), where |P| is the number of pins. We took advantage of our sparse routing graph to reduce the time complexity; this will be shown in sub-section 4.f. In addition, we made some modifications to improve the results. First, we save multiple shortest paths during the process of searching for a route. This can be done virtually without increasing the time complexity. Then a straightforward method is used to improve the results.

4.b Generate-route

In the beginning, each required pin node is made a set in a disjoint set data structure. (Electrically equivalent pins are handled by putting them into the same set.) The shortest path is found among all pairs of sets. For the two sets corresponding to this shortest path, we store all shortest paths if more than one exists. All the pin nodes and the nodes of the paths are merged into one set. This process continues until only one set remains. At this point, we have a sub-graph of the routing graph. It is not a tree, because it may contain cycles due to the set of equivalent-weight paths retained. This paths graph may not consist of the shortest paths for some nodes, so an improvement stage is needed. The improved paths graph is sent to the next stage to remove the cycles. After the cycles are removed, another improvement stage is applied to further improve the tree. The procedure of Generate-route is as follows:

Generate-route(net)
1.  each required node forms a set
2.  if (there are required edges) {
3.    each required edge and the nodes or the sets it connects to forms a set
4.  }
5.  while (number of sets > 1) {
6.    find the shortest path (or paths) between any two sets
7.    merge the two sets into one set according to the paths found
8.  }
9.  Improve-route(paths-graph)
10. Remove-cycles(paths-graph)
11. Improve-route(path-tree)

Line 2 is needed in case some paths are required to be in the routing by the user. Lines 9 and 11 are not applied if p, the number of pins, exceeds 100 and if p/|V| is greater than one half, where |V| is the number of nodes of the routing graph. The threshold was chosen experimentally. We find that it takes too much time to improve nets with a huge number of pins, and when p/|V| is high, usually no improvement is possible.
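A rough C++ skeleton of the Generate-route control flow under the listing above; the inter-set shortest-path search (a multi-source variant of Dijkstra's algorithm, see sub-section 4.f) and the improvement stages are left as stubs, and the union-find bookkeeping is an implementation assumption.

#include <numeric>
#include <vector>

struct DSU {                                   // disjoint sets of graph nodes
    std::vector<int> p;
    explicit DSU(int n) : p(n) { std::iota(p.begin(), p.end(), 0); }
    int find(int x) { return p[x] == x ? x : p[x] = find(p[x]); }
    void unite(int a, int b) { p[find(a)] = find(b); }
};

// Stubs: all equal-weight shortest paths between the two closest sets
// (each path is a list of node ids), and the two improvement stages.
std::vector<std::vector<int>> shortestPathsBetweenSets(DSU& /*sets*/) { return {}; }
void improveRoute(std::vector<std::vector<int>>& /*pathsGraph*/)      {}
void removeCycles(std::vector<std::vector<int>>& /*pathsGraph*/)      {}

// Generate-route: grow one connected paths graph over all pins of a net.
std::vector<std::vector<int>> generateRoute(int numNodes,
                                            const std::vector<int>& pins) {
    DSU sets(numNodes);
    // Electrically equivalent pins would be united here into one set.
    int numSets = static_cast<int>(pins.size());
    std::vector<std::vector<int>> pathsGraph;
    while (numSets > 1) {
        auto paths = shortestPathsBetweenSets(sets);   // lines 5-6 of the listing
        if (paths.empty()) break;                      // net cannot be connected (stub)
        for (const auto& path : paths) {
            for (std::size_t i = 1; i < path.size(); ++i)
                sets.unite(path[i - 1], path[i]);      // line 7: merge along the path
            pathsGraph.push_back(path);                // keep every equal-weight path
        }
        --numSets;                                     // two sets merged into one
    }
    improveRoute(pathsGraph);                          // line 9
    removeCycles(pathsGraph);                          // line 10
    improveRoute(pathsGraph);                          // line 11
    return pathsGraph;
}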
The paths graph shown in Figure 3 was obtained after the execution down to line 8 of Generate-route. Figure 3(b) is the top view of Figure 3(a). The top view doesn't show the via edges, so it demonstrates more clearly how Generate-route works. The dots on the cell boundaries are the pins of the net. Those pins are mapped to the nodes inside the cells. Since the bottom pin (A) is closer to the right pin (B) on the routing graph, the path between them is found first. During the second iteration, two shortest paths between the bottom pin and the left pin (C) are found. All the pins are now in one set, so the loop concludes and we obtain the paths graph. Obviously, the paths graph is not optimal; it will be improved in the next stage.
Figure 3. Paths graph of a net and its top view.

4.c Improve-route

To improve the routing, all the segments in the paths graph are examined. First, all the edges (in the paths graph) are put in an array. One starting edge is chosen from the beginning of the array. This edge, i.e. its two endpoints, is extended in both directions to form a segment. The end points of such a segment are either a required node (pin) in the original routing graph or a node with degree more than two. For example, in Figure 3(a), the edge (e1) is selected as a starting edge. The edge is extended from both end points and forms a segment from d to a. We mark all edges in the segment, so that the edges will not be used as starting edges later on. The segment is removed from the paths graph. If the paths graph is now divided into two disconnected sets, find a shortest path between the two sets. If this new path has a lower weight than the original segment, the new path is used instead of the original segment. The process continues until all edges are marked. The procedure of Improve-route is as follows:

Improve-route(graph)
1. for (all edges in the graph) {
2.   if (the edge is marked) continue
3.   create a segment by extending the edge to a required node or a node with degree > 2
4.   mark all edges in the segment
5.   if (two sets are created by removing the segment) {
6.     find a shortest path between the two sets
7.     if (the new weight is lower) replace the old segment with the new path
8.   }
9. }

* Example: The paths graph shown in Figure 3 is sent to Improve-route. There are a few segments in that graph. Only the segment between the bottom pin (A) and the right pin (B) can be improved. That segment is removed and a shorter path between the two separated sets is found. Hence the original segment is replaced by the new path. The improved paths graph is shown in Figure 4.

Figure 4. Improved paths graph.

4.d Remove-cycles

To remove the cycles in the paths graph, all segments in the paths graph are generated. The segments are sorted according to their weights. Starting from the largest-weight segment, the segment is removed. If this action causes the graph to be divided into two sets, restore the segment; otherwise, remove the segment from the paths graph permanently. This is done sequentially for all segments. The procedure of Remove-cycles is as follows:

Remove-cycles(graph)
1.  sort the segments of the paths graph according to their weights
2.  the sorted segments are put into a queue with the largest-weight segment in the front
3.  for (all the segments in the sorted queue) {
4.    remove the segment
5.    if (the graph is not divided into two sets) {
6.      if (two other segments can be merged due to the removal of the segment) {
7.        merge the two segments
8.        adjust the queue
9.      }
10.     continue
11.   }
12.   restore the segment
13. }
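One way to realize the connectivity test inside Remove-cycles is to rebuild a union-find over the remaining segments and check that the pin nodes stay connected, as in the sketch below; checking connectivity only over the pins and skipping the segment-merging bookkeeping of lines 6-9 are simplifications made for illustration.

#include <algorithm>
#include <functional>
#include <numeric>
#include <utility>
#include <vector>

struct Segment { std::vector<std::pair<int,int>> edges; double weight; };

// Does the paths graph keep all pin nodes connected if segment `skip` is removed?
static bool staysConnected(const std::vector<Segment>& segs, std::size_t skip,
                           const std::vector<int>& pins, int numNodes) {
    std::vector<int> p(numNodes);
    std::iota(p.begin(), p.end(), 0);
    std::function<int(int)> find = [&](int x) { return p[x] == x ? x : p[x] = find(p[x]); };
    for (std::size_t s = 0; s < segs.size(); ++s) {
        if (s == skip) continue;
        for (auto [a, b] : segs[s].edges) p[find(a)] = find(b);
    }
    for (std::size_t i = 1; i < pins.size(); ++i)
        if (find(pins[i]) != find(pins[0])) return false;
    return true;
}

// Remove-cycles: drop segments from the heaviest down while the pins stay connected.
void removeCycles(std::vector<Segment>& segs, const std::vector<int>& pins, int numNodes) {
    std::sort(segs.begin(), segs.end(),
              [](const Segment& a, const Segment& b) { return a.weight > b.weight; });
    for (std::size_t s = 0; s < segs.size(); ) {
        if (staysConnected(segs, s, pins, numNodes))
            segs.erase(segs.begin() + s);   // real code would also merge adjacent segments
        else
            ++s;                            // removal would split the graph: keep it
    }
}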
* Example: Figure 5 shows that the cycle in the paths graph has been removed and an optimal route tree is obtained.
Figure 5. Final route for a net.

Figure 6 is used to show how lines 6-9 of Remove-cycles work. The largest-weight segment may be B-C-F, so it is removed first. Originally, there are two segments A-B and B-E. Because of the removal of B-C-F, A-B and B-E are connected to form a new segment.

Figure 6. Example of a paths graph.

For this example, the second execution of Improve-route does nothing. For some other cases, the optimal tree might not be obtained after Remove-cycles has been executed; a second execution of Improve-route is needed for those cases. This will be shown in the next sub-section.

4.e Discussion

Figure 7 shows the importance of keeping multiple paths in the paths graph and the first improvement stage.

Figure 7. Example of a 3-pin net before improvement.

It is an example of a 3-pin net. Nodes A, B, and C are the required nodes. The path between B and C is formed first, because it is shorter than the path between A and B. Then, paths between A and B are found. The paths graph is shown in the bold lines. If the paths graph goes directly to the cycle-removing stage, segment A-4-B may be removed. Without node 4, it is not possible to find the Steiner minimum tree. Figure 8 shows the improved paths graph. The remove-cycles procedure will now remove segment A-5-B, and the minimum Steiner tree is found. If multiple paths were not kept, the Steiner minimum tree might not be found. For example, the tree A-5-B-6-C may be found as the route. This tree cannot be improved by Improve-route. The example shows the advantage of retaining multiple paths and why we need the first improvement stage.

Figure 8. Example of the 3-pin net after improvement.

Figure 9 shows another problem we encountered. Figure 9(a) is the paths graph for a 4-pin net. There are two equal-weight paths between each pair of pins. This is not unusual for a multi-layer layout. Segments 3-4-D and 7-8-D are the largest-weight segments. Segments A-1-2 and A-5-6 are the second largest-weight segments. During the cycle-removing stage, it is possible that segments 3-4-D and A-5-6 are removed. The tree, after the cycles have been removed, is shown in Figure 9(b). Obviously, this is not the Steiner minimum tree. With the second execution of Improve-route, the Steiner minimum tree is found. The result is shown in Figure 9(c).

Figure 9. An example of the need for the second execution of Improve-route.

The key to our algorithm is that we incorporate more Steiner points for the later improvement stages to work on. But if there are too many equivalent-weight paths, this could cause efficiency problems and make the cycle-removing part less effective. Two methods are used to prevent the algorithm from incorporating too many paths. One is that we try to keep only the outside paths, i.e. the paths which are enclosed by other paths are discarded, because the inside paths don't provide useful Steiner points. The other is that we directly limit the number of paths which can be incorporated; in the current implementation, the limit is 20. On the other hand, for some graphs, we found that there hardly exist any equivalent paths. We therefore retain and treat the paths which are slightly higher in weight (up to 1% higher than the minimum weight path) as equivalent-weight paths.
4.f Time complexity

We now examine the time complexity. The routing graph is G = (V, E, c). Because there are at most six edges connected to a vertex, the relation |E| < 6|V| holds. We route a net ni with p pins. For Remove-cycles, the sorting has a time complexity of O(|S| log |S|), where |S| is the number of segments. In the loop, the worst case occurs when we must update the queue every time, so the time complexity for the loop is O(|S|^2). The complexity of Remove-cycles is O(|S|^2). The end point of a segment has a degree greater than 2 if the end point is not a terminal (pin). A theorem from [6] says that for a Steiner minimum tree the number of such vertices is less than two times the number of terminals (pins). The number of segments is proportional to the number of vertices, so |S| is proportional to p. To find the shortest paths between the sets, we use an algorithm similar to Dijkstra's algorithm; the difference is that we have multiple sources instead of one. We start from the multiple sources, update the weights of the vertices, and put the vertices into a priority queue. Because the routing graph is sparse, according to [5], the time complexity is O((|V|+|E|) log |V|), which for our case simplifies to O(|V| log |V|). So the Improve-route subroutine has a time complexity of O(|S||V| log |V|). For Generate-route, the main loop has a time complexity of O(p|V| log |V|). So for the subroutine, the dominant parts are O(p|V| log |V|) + O(|S||V| log |V|) + O(|S|^2) for net ni. Therefore, the time complexity of our program is O(|P||V| log |V|), where |P| is the total number of pins.

5 Solving Congestion Problems and Area Minimization

After the global routing is done for all nets, the following information is obtained: the number of tracks used for each layer in each region, the number of vias used, and the wire length of each net. The path for each net consists of not only the topological information but also the layer and via information. This information can facilitate the detailed routing process. If a certain region is congested due to the capacity of the region, a rip-up and re-route method is used to eliminate all the overcongested regions. Starting from the most congested region, it seeks to re-route all the nets using that region while avoiding the creation of additional congestion problems. Then all the new routes are sorted according to the increase in wire length (or weight). The program chooses the new routes needed to relieve the congestion problems by selecting those new routes which minimize the increase in total wire length (or weight). Then it moves on to the next most congested region. It continues until all the overcongested regions have been processed.

Minimum wire length does not mean minimum chip area. For macro-cell designs, congestion can be solved by pushing the cells away so the regions can accommodate more tracks; the chip area is usually the main concern. Our global router can also be used to achieve this goal by using the rip-up and re-route method. First, we use a directed graph to compute the size of the chip. For example, Figure 11 shows the graph which we use to compute the height of the chip. We place nodes on the top and bottom sides of the cells, so a rectangular cell has one node on its top boundary and one node on its bottom boundary. For non-rectangular rectilinear cells, there may be more than one top or bottom boundary; we place one node on each horizontal boundary segment.

Figure 10. Column graph for the height of the example of Fig. 1.

Figure 11. Height graph for the example of Fig. 1.

For example, the L-shaped cell in Figure 11 has two nodes on the top side. In addition, we have one source node at the top and one sink node at the bottom of the chip. Our height graph is directed from the source node to the sink node. Since our routing model divides the chip area into rectangular regions, we have columns, which may consist of a series of regions, between the cells or between the cells and the source/sink node. In Figure 10, the columns are shown by the directed arcs. To simplify the graph, only one directed edge is needed to represent the columns between any pair of nodes. The weight of the edge is determined by the highest column among the columns corresponding to the edge. The height of a column is decided by the horizontal layer which requires the most space. Figure 11 shows the final height graph. For illustration purposes, we placed the edges in the highest columns between the nodes, if there is more than one column, but they actually represent a set of columns. Inside the cells, each node on the top has an edge to each node at the bottom. The weight of such an edge is the distance between the boundaries; for a rectangular cell, it is the height of the cell. The longest path from source to sink determines the height of the chip. A similar directed graph is also created to compute the width. If there are layers which can go over the top of the cells, a congestion-removal stage is needed before we can correctly estimate the chip size. All the congestion problems on top of the cells must be removed first, because only the routing regions between the cells can be expanded; the size of a cell, however, is fixed. If a layer gets overcongested in regions over the cells, some tracks in those regions have to be moved to the non-cell regions. To minimize the chip area, we try to reduce the height and the width of the chip sequentially. We re-route those nets which use the critical regions. The critical regions are defined as the regions corresponding to the edges of the longest paths of the height and width graphs. The edges inside cells are not included. To reduce the height and width of the chip, we re-route the nets in the critical regions to see if there is a new route which can reduce the size of the critical regions. In particular, we dynamically set the
weights of graph edges in critical regions and re-execute Generate-route. An edge's weight is set to infinity if the use of the edge does not reduce the size of the critical region. If such a new route exists, the new chip size is calculated according to the route change. If the chip size is reduced, the new route is accepted; otherwise it is rejected. The algorithm is as follows:

1. for (all nets in the critical regions) {
2.   try to find a new route which reduces the size of the critical regions
     /* set the weights of the edges in critical regions to infinity if the use of
        those edges does not reduce the size of the critical regions */
3.   if (no such new route exists) continue
4.   if (the new route does not increase the weight and reduces the chip size)
       accept the new route and continue
5.   if (the new route increases the weight)
       put the new route in a priority queue according to the increase of the weight
6. }
7. for (each new route in the priority queue, in order)
8.   if (the new route reduces the chip size) accept the new route

The re-calculation of the chip size is necessary, because a new route which reduces the size of the critical regions may not reduce the chip size. New critical regions can be generated when a new route reduces the original critical regions. To speed up the calculation of the chip size, an M-longest-path method is used to calculate the new size. That means we store the M longest paths for both height and width. When a new route is found, we calculate only the new lengths of those paths instead of searching for the new longest paths in the graphs. The longest of the M paths determines the new chip size. We modified the Dreyfus method[18] for M shortest paths to find the M longest paths. In the current implementation, instead of using a fixed number, we store all the paths which have length greater than 90% of the length of the longest path. The algorithm shown above is one iteration of the re-routing process. It repeats until no further size reduction is possible, i.e. no new route is found or accepted during an iteration. It is also necessary to obtain the new set of M longest paths and the new set of critical regions before each iteration.
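As an illustration of the chip-size computation, the longest source-to-sink path in the (acyclic) height graph can be found with a topological-order relaxation, as sketched below; the construction of the graph itself is assumed to have been done as described above, and the representation is an assumption made for the example.

#include <algorithm>
#include <queue>
#include <vector>

struct DEdge { int to; double length; };   // column height or cell height

// Longest path from `source` to `sink` in the acyclic height graph;
// the chip height is the value returned for the sink node.
double chipHeight(const std::vector<std::vector<DEdge>>& g, int source, int sink) {
    int n = static_cast<int>(g.size());
    std::vector<int> indeg(n, 0);
    for (const auto& adj : g)
        for (const DEdge& e : adj) ++indeg[e.to];

    std::vector<double> longest(n, -1.0);   // -1 marks "not reachable from source"
    longest[source] = 0.0;
    std::queue<int> ready;
    for (int v = 0; v < n; ++v)
        if (indeg[v] == 0) ready.push(v);

    while (!ready.empty()) {                // relax edges in topological order
        int u = ready.front(); ready.pop();
        for (const DEdge& e : g[u]) {
            if (longest[u] >= 0.0)
                longest[e.to] = std::max(longest[e.to], longest[u] + e.length);
            if (--indeg[e.to] == 0) ready.push(e.to);
        }
    }
    return longest[sink];                   // -1 if the sink is unreachable
}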
6 Results

We compared our Steiner tree algorithm with two other graph-based algorithms capable of handling irregular graphs. One is the MST on the distance network Steiner tree heuristic. This method has been used in several other routers[2][3]. The second algorithm we compared against is the M-route method introduced in Mickey[2], which is an improvement over the first algorithm. It has been known as a fast and very good method for the routing problem, and it can usually find the Steiner minimum tree for a net. In fact, it is the best performing public-domain graph-based global router previously known. The drawback is that it requires too much memory and is therefore impractical for large circuits. We tested the programs on some industrial circuits; these circuits are shown in Table 1. The placements were generated by TimberWolfMC v.3.1. We ran the programs on the same two-layer global routing graph. The wire length results for the three algorithms are shown in Table 2. Our Steiner tree algorithm outperforms the other two algorithms. Table 3 shows the memory usage and run time of Mickey and our Steiner tree program; both were run on a DEC 3000 AXP Model 400 workstation. The results in Table 3 show that our algorithm uses much less memory, while having a comparable (or slightly better) run time. For the largest circuit (intel), the memory usage was reduced by a factor of nearly nine.

Table 1. Circuit information (number of cells, nets and pins, and the number of nodes and edges of the routing graph, for the circuits hp, ami33, qpdm-b, xerox, amd, ami49, 4832 and intel).

Table 2: Wire length comparison.
Circuit | MST on the distance network | Mickey M-route | New Steiner alg. | vs. MST | vs. Mickey
hp | 176,808 | 171,430 | 170,063 | -3.81% | -0.80%
ami33 | 56,770 | 55,865 | 55,815 | -1.68% | -0.09%
qpdm-b | 633,540 | 626,907 | 625,930 | -1.20% | -0.16%
xerox | 568,480 | 561,935 | 561,935 | -1.15% | 0
amd | 261,478 | 259,856 | 259,843 | -0.63% | -0.01%
ami49 | 371,362 | 361,378 | 360,592 | -2.90% | -0.22%
4832 | 1,934,200 | 1,894,400 | 1,891,390 | -2.21% | -0.16%
intel | 6,087,362 | 5,942,640 | 5,925,695 | -2.66% | -0.29%
average | | | | -2.03% | -0.22%

Table 3: Memory and run time comparison.
Circuit | Mickey memory (bytes) | Mickey time (sec) | New alg. memory (bytes) | New alg. time (sec) | memory reduction factor
hp | 666K | 3.1 | 355K | 1.7 | 1.88
ami33 | 1,855K | 2.3 | 438K | 6.3 | 4.24
qpdm-b | 2,124K | 2.2 | 784K | 5.9 | 2.71
xerox | 1,269K | 4.2 | 735K | 2.6 | 1.73
amd | 4,594K | 25.0 | 1,272K | 4.7 | 3.61
ami49 | 6,228K | 20.0 | 1,316K | 5.4 | 4.78
4832 | 9,032K | 16.7 | 2,142K | 8.9 | 4.22
intel | 50,642K | 224.7 | 5,707K | 203.9 | 8.88
The two percent improvement over the very simple-minded MST heuristic may, at first glance, appear small. However, two items are noteworthy: (1) the MST heuristic always finds the minimum length routes for two-pin nets and usually does for three-pin nets, and the majority of the nets in these benchmarks (or any real circuit) have three or fewer pins; (2) nets with many pins have minimum enclosing rectangles (bounding boxes) which are huge, and although a route generated by the MST heuristic for such a net may look very poor to an experienced layout designer, the percentage difference between the length of a poor route and the optimal route may be small. However, for performance and density reasons, it is imperative to make each net as short as possible. To demonstrate the global router's ability to minimize the area of a macro-cell layout, as well as its ability to handle multi-layer technologies, we considered the same benchmark circuits for three layers of metal. The first and second layers were not available over the cells; only the third layer could be used everywhere. The first and third layers were for horizontal tracks, and the second layer was for vertical tracks. Table 4 shows the area reduction results. The column "initial" is the initial area when the shortest route is used for every net. The column "final" is the final area after the area reduction process. Some circuits exhibit large reductions, while a couple do not. It depends on the circuit: if the chip area is dominated by actual cell area, the re-routing process can do little to reduce the chip size. On the other hand, if the chip area after routing has significant routing regions between the cells, the reduction percentage can be large, as for the circuit amd. Note that the area reduction is quite significant for most of the circuits.
Table 4: Area reduction results.
Circuit | initial | final | reduction
hp | 3618 x 3132 | 3608 x 3082 | 1.9%
ami33 | 1850 x 1890 | 1840 x 1850 | 2.6%
qpdm-b | 3943 x 4017 | 3873 x 3857 | 5.7%
xerox | 7060 x 7700 | 6990 x 7290 | 6.3%
amd | 2007 x 1588 | 1930 x 1343 | 18.7%
ami49 | 7232 x 6830 | 7142 x 6810 | 1.5%
4832 | 15110 x 11350 | 14660 x 11060 | 5.5%
intel | 11370 x 10940 | 11210 x 10910 | 1.7%
Table 5 shows the results of our global router for two very large industrial circuits. Four routing layers were available for both circuits. The lowest horizontal and vertical layers are blocked by the cells, and the other two, one horizontal and one vertical, are available over the cells. Because Intel2 has so many cells, the routing graph is very complicated. The results demonstrate that our global router can handle modern large industrial circuits, whereas Mickey could not be successfully applied to these large circuits.

Table 5: Results on two industrial circuits.
Circuit | cells | nets | pins | memory (bytes) | wire length | vias | run time (sec)
Intel1 | 37 | 7,285 | 17,578 | 37,725K | 42,299,146 | 49,806 | 1,430.6
Intel2 | 189 | 9,497 | 31,647 | 104,607K | 24,290,738 | 58,612 | 63,930.4

7 Conclusion

We have presented a new chip-level global router which operates on a new, more accurate routing model. The routing structure is flexible and suitable for multi-layer VLSI technology. Its 3-dimensional irregular routing graph accurately models the multi-layer routing problem, so it gives a good estimate of the routing resources needed. It can meet different design needs: it can minimize the number of vias or the chip area while minimizing total wire length, and it can also be used to minimize the total wire length under channel capacity constraints. To achieve these goals, a practical and effective algorithm for finding the routes was also developed. Previously, Mickey was the best performing graph-based global router available in the public domain. However, our algorithm yields better results. At the same time, we avoid the main drawback of the M-route algorithm, namely, we use much less memory (less than one eighth of the memory needed by Mickey for the intel circuit). Since VLSI circuits contain more and more transistors, this is an important factor in designing a feasible global router for the circuits of today and the future.

References

[1] Sherwani, N., "Global Routing," Chapter 6 in Algorithms for VLSI Physical Design Automation, Kluwer Academic Publishers, 1993.
[2] Chen, D. and Sechen, C., "Mickey: A Macro Cell Global Router," Proceedings of the European Conference on Design Automation, pp. 248-252, Feb. 1991.
[3] Sechen, C., VLSI Placement and Global Routing Using Simulated Annealing, Kluwer Academic Publishers, 1988.
[4] Kou, L., Markowsky, G., and Berman, L., "A Fast Algorithm for Steiner Trees," Acta Informatica 15, pp. 141-145, 1981.
[5] Cormen, T. H., Leiserson, C. E., and Rivest, R. L., Introduction to Algorithms, McGraw-Hill, 1992.
[6] Hwang, F. K., Richards, D. S., and Winter, P., The Steiner Tree Problem, North-Holland, 1992.
[7] Sherwani, N., Bhingarde, S., and Panyam, A., Routing in the Third Dimension: From VLSI Chips to MCMs, IEEE Press, 1995.
[8] Chiang, C., Sarrafzadeh, M., and Wong, C. K., "Global Routing Based on Steiner Min-Max Trees," IEEE Transactions on Computer-Aided Design, Vol. 9, No. 12, pp. 1318-1325, Dec. 1990.
[9] Chiang, C., Wong, C. K., and Sarrafzadeh, M., "A Weighted Steiner Tree-Based Global Router with Simultaneous Length and Density Minimization," IEEE Transactions on Computer-Aided Design, Vol. 13, No. 12, pp. 1461-1469, Dec. 1994.
[10] Griffith, J., Robins, G., Salowe, J. S., and Zhang, T., "Closing the Gap: Near-Optimal Steiner Trees in Polynomial Time," IEEE Transactions on Computer-Aided Design, Vol. 13, No. 11, pp. 1351-1365, Nov. 1994.
[11] Heisterman, J. and Lengauer, T., "The Efficient Solution of Integer Programs for Hierarchical Global Routing," IEEE Transactions on Computer-Aided Design, Vol. 10, No. 6, pp. 748-753, Jun. 1991.
[12] Lin, Y.-L., Hsu, Y.-C., and Tsai, F.-S., "Hybrid Routing," IEEE Transactions on Computer-Aided Design, Vol. 9, No. 2, pp. 151-157, Feb. 1990.
[13] Miriyala, S., Hashmi, J., and Sherwani, N., "Switchbox Steiner Tree Problem in the Presence of Obstacles," IEEE International Conference on Computer-Aided Design, pp. 536-539, Nov. 1991.
[14] Garey, M. R. and Johnson, D. S., "The Rectilinear Steiner Tree Problem is NP-complete," SIAM J. Appl. Math., Vol. 32, No. 4, pp. 826-834, Jun. 1977.
[15] Margarino, A., Romano, A., De Gloria, A., Curatelli, F., and Antognetti, P., "A Tile-Expansion Router," IEEE Transactions on Computer-Aided Design, Vol. 6, No. 4, pp. 507-517, July 1987.
[16] Tsai, C.-C., Chen, S.-J., and Feng, W.-S., "An H-V Alternating Router," IEEE Transactions on Computer-Aided Design, Vol. 11, No. 8, pp. 976-991, Aug. 1992.
[17] Xiong, J. G., "Algorithms for Global Routing," 23rd Design Automation Conference, pp. 824-830, June 1986.
[18] Lawler, E. L., Combinatorial Optimization: Networks and Matroids, Holt, Rinehart, and Winston, 1976.
[19] Luk, W. K., Tang, D. T., and Wong, C. K., "Hierarchical Global Wiring for Custom Chip Design," 23rd Design Automation Conference, pp. 481-489, June 1986.
[20] Nishizaki, Y., Igusa, M., and Sangiovanni-Vincentelli, A., "Mercury: A New Approach to Macro-cell Global Routing," Proceedings of the VLSI 89 Conference, Munich, Germany, pp. 411-420, Aug. 1989.
CHIP AND PACKAGE CO-DESIGN - ON CRITICAL BACK END DISCONNECTS Wayne Dai Computer Engineering University of California at Santa Cruz Santa Cruz, CA 95064 [email protected]
ABSTRACT
In this position paper, I will highlight some key points on one of the critical back end disconnects: the physical design of a chip and its package. I believe that the chip and its package should be designed concurrently to achieve better performance and lower cost. This calls for an integrated layout synthesis and electrical analysis tool, and for early analysis tools for making the trade-offs between chip and package routing.
1. INTERCONNECT DOMINATED ICS
According to the National Technology Roadmap for Semiconductors [1], IC feature sizes for the period from 1995 to 2001 will decrease from 0.35 um to 0.18 um. The total number of transistors will be in the range of 28 to 64 million. The clock rates of ICs will be in the range of 300 to 600 MHz. The interconnect delay may account for about 70% to 80% of the total gate-to-gate delay for long nets.
2. I/O BOUNDED ICS
The increase in die size is much slower than the increase in I/O count, and the shrink of die size is much faster than the reduction in pad pitch. It becomes more and more difficult to accommodate all I/Os with peripheral pads.
3. FLIP CHIP PACKAGE AND AREA I/O
The flip chip package is the most promising method for first level packaging in the future. This method places solder bumps on the dice, flips the chips over, aligns the bumps with the contact pads on the substrate, and reflows the solder balls in a furnace to establish the bonding between the chips and their packages. This method provides area pads which are distributed over the entire chip surface rather than being confined to the periphery as in wire bonding and most TAB technologies. This method increases the maximum number of I/O and power/ground pads available for a given die size, such that it may liberate current pad-constrained VLSI designs. It also provides a large number of low-capacitance and low-inductance electrical interconnections between the die and the substrate.
4. INTERCONNECT SCALING THEORY
Interconnect scaling is different from device scaling. For a multilayer embedded microstripline, the capacitance per unit length and the inductance per unit length are scale invariant. If we uniformly scale the interconnect cross-section of a line as well as the dielectric thickness, the capacitance per unit length and inductance per unit length will remain approximately the same. However, the resistance per unit length is inversely proportional to the area of the line cross-section. The interconnect scaling theory implies that, with all other factors the same, a thicker film results in lower signal delay [2]. Interconnect delay is composed of two terms: the time-of-flight delay and the distributed RC delay. While the time-of-flight delay does not depend on the area of the line cross-section, but is set by material parameters and is proportional to the line length, the distributed RC delay is inversely proportional to the area of the line cross-section and proportional to the square of the line length. This suggests that it is more beneficial to place long lines on the package layer instead of the chip layer.
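To make the two delay terms concrete, the short sketch below evaluates the time-of-flight delay and the distributed RC delay for a long line routed either on chip or on a package layer. The geometries, material constants, and the simple delay expressions (t_flight = L*sqrt(eps_r)/c0, t_RC ~ 0.5*r*c*L^2) are illustrative assumptions for this note, not figures taken from this paper.

# Illustrative comparison of interconnect delay terms for a long line routed
# on chip versus on a package layer.  All parameter values are assumptions
# chosen only to show the scaling behavior described above.
import math

C0 = 3.0e8  # speed of light, m/s

def line_delays(length_m, width_m, thickness_m, rho_ohm_m, cap_per_m, eps_r):
    """Return (time-of-flight delay, distributed RC delay) in seconds."""
    r_per_m = rho_ohm_m / (width_m * thickness_m)       # resistance per unit length
    t_flight = length_m * math.sqrt(eps_r) / C0          # set by materials, linear in L
    t_rc = 0.5 * r_per_m * cap_per_m * length_m ** 2     # distributed RC term, ~L^2
    return t_flight, t_rc

L = 0.01  # a 1 cm global net

# Hypothetical on-chip wire: 0.6 um x 0.6 um aluminum, oxide dielectric.
chip = line_delays(L, 0.6e-6, 0.6e-6, rho_ohm_m=2.7e-8, cap_per_m=2.0e-10, eps_r=3.9)
# Hypothetical package trace: 20 um x 5 um copper, similar capacitance per length.
pkg = line_delays(L, 20e-6, 5e-6, rho_ohm_m=1.7e-8, cap_per_m=2.0e-10, eps_r=4.0)

for name, (tf, trc) in (("chip", chip), ("package", pkg)):
    print(f"{name}: time-of-flight = {tf*1e12:.1f} ps, distributed RC = {trc*1e12:.1f} ps")

With these assumed numbers the RC term dominates the on-chip wire but becomes negligible for the thicker package trace, which is the quantitative argument for moving long global wires off chip.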
5. CHIP AND PACKAGE CO-DESIGN
The general trend in digital VLSI circuits is certainly towards higher complexity and faster clock frequency. Further, as more circuitry is integrated into a single silicon chip and wider words are processed in parallel, larger registers are more frequently employed. These registers require precise clock signals to synchronize local activity with the global signal. Design of a clock distribution network is critical to high speed and high performance microprocessors and other synchronous VLSI systems. It is very hard to achieve tolerable skew and short rise time in a complex VLSI chip with millions of clocked elements. This becomes more difficult in a VLSI system with multiple clock phases and several chips with different technologies. The first candidate for chip and package co-design is the clock network. The global clock wires may be routed on a dedicated package layer and the local clock wires may be distributed from the area pads to the clock terminals. A case study [3] indicates that this scheme dramatically reduces the clock skew and the path delay of the clock network due to the very low interconnect resistance on the package layer. It also significantly reduces the power consumption, since the package has lower capacitance per unit length. On-chip power/ground distribution also becomes more challenging as integration density increases and devices get faster. When the minimum feature size is scaled down, the resistance of the wires goes up. In addition, the total current going through these wires increases, due to raised clock frequency and increased circuit counts. This results in two problems: higher voltage drops along the power/ground nets and an increased electromigration rate due to large current densities. For VLSI chips with low supply voltages, the problem is more serious, because the magnitude of the voltage drops that can be tolerated becomes smaller. Power/ground nets are global nets which span the whole chip. Due to the performance requirement, these nets are
usually pre-routed before other nets. When a chip becomes more dense and employs more circuits, these global nets make the routing of other nets harder. The power/ground net distribution can also make use of the area pads of flip chip dice. The flip chip technology provides a 10-20 times reduction in lead inductance compared with wire bonding. Exclusive package layers can be used for power and ground, and area I/Os provide many more connection points to them. The more power and ground pads, the less the effective inductance of the power and ground network, and the less the simultaneous switching noise. Other global connections may also be pulled out from the chip to the package. This scheme not only improves the performance but also saves on-chip routing resources and may even reduce the number of layers on the chip. Currently there is no physical design tool available to handle the area I/O design effectively.
REFERENCES
[1] Available from the Semiconductor Industry Association, 4300 Stevens Creek Blvd., Suite 271, San Jose, CA 95129.
[2] R. C. Frye, "Physical Scaling and Interconnect Delay in Multichip Modules," IEEE Trans. on Components, Packaging, and Manufacturing Technology, Part B: Advanced Packaging, Vol. 17, No. 1, 1994, pp. 30-37.
[3] Q. Zhu and W. Dai, "Chip and Package Co-Design Technique for Clock Networks," Proc. of 1996 IEEE Multi-Chip Module Conf., 1996, pp. 160-163.
The Emergence of "Physical Synthesis" - Optimization of a System's Physical Implementation During Design Planning Peter A. Sandborn, Chet Palesko, Dave Gullickson, and Ken Drake Savantage, Inc. 3925 W. Braker Lane, Suite 325, Austin, Texas 78759-5321 Tel: (512) 305-0053 Fax: (512) 305-0060
Abstract - Traditionally, electronic system design automation tools have focused on logical and behavioral partitioning and synthesis with little or no formal treatment of the physical implementation of a design. ASICs and other complex ICs are synthesized with little understanding of how they impact the physical construction and performance of the system into which they are inserted (system size, routability, thermal performance, reliability, cost, etc.). Lack of methodologies and tools that enable the physical implementation of systems to be studied and optimized during the planning and specification phase of design is causing a serious disconnect in the design process for high density systems (e.g., PCMCIA cards, cellular phones, laptop computers, etc.). In order to meet the demands produced by miniaturization, performance, and market window constraints, designers must adapt their design processes to address the physical implementation of systems as early in the design process as high-level logic design, i.e., "physical synthesis". Physical synthesis is the translation from the structural description of a system to the physical implementation of that system. This paper discusses the unique problems associated with automating the determination of the physical implementation of a system during design planning and specification, and suggests a methodology that addresses design-for-packagability of components.
I. INTRODUCTION
Designers must make optimum physical implementation choices early in the design process. Electronic systems are composed of components (active and passive), substrates or boards that interconnect components, and enclosures that contain and protect the boards and components. The physical implementation of a system is often a barrier to realizing the full value of high performance ICs (Fig. 1). The widening gap shown in Fig. 1 is requiring system designers to use new higher-density packaging techniques (e.g., MCM, Chip Scale Packages, etc.). The use of non-traditional high-density packaging shifts more of the design liability from the ICs to the interconnections between the ICs. Therefore, physical partitioning and the selection of packaging and interconnect technologies are critical decisions for inserting ICs into today's electronic products. Large high performance ASICs need to be synthesized for the appropriate package and interconnect type. System designers need to concurrently
balance a large number of cost and performance views (electrical, thermal, size, reliability, etc.) of a system in order to optimize the physical implementation.
Fig. 1 - The widening gap between performance on chip and performance in traditionally-packaged systems (x-axis: bare die clock frequency, MHz).
Electronic system design automation (ESDA) tools have focused on logical and behavioral partitioning and synthesis with little or no formal treatment of the physical implementation of a design. In order to meet the demands produced by miniaturization, performance, and market window constraints, designers must adapt their design processes to address the physical implementation as early in the design process as high-level logic design. The way to incorporate physical aspects into the design process is to adopt a top-down system planning methodology that addresses design-for-packagability of components.
II. THE IMPACT OF PHYSICAL IMPLEMENTATION DECISIONS
Traditionally, designing the physical implementation of a system has waited until logic synthesis and chip physical design is completed. After the chips are completed and the system architecture determined, the system packaging design begins. Many high-density designs are less than optimal because of a failure in understanding and characterizing the
physical system environment and system manufacturing realities when designing the ICs. For years, the best approach was to always make a larger chip and put more functionality in the silicon. However, the packaging and electrical requirements of these large chips today often make the system cost higher than it would have been with two smaller chips. Ideally, the physical implementation of the system should be addressed as early as behavioral and logical synthesis. Critical physical implementation decisions, made within the first 20% of the total design cycle time, can ultimately commit 80% or more of the final product cost and performance. Therefore, making the most appropriate choices early in the design cycle will significantly increase the chances of finding an optimal or near optimal system design solution.
Fig. 2 - A significant portion of a system's cost and performance are committed long before traditional "physical design" (layout and routing) begins.
The limiting factors for many next generation systems will be the physical packaging and interconnect of multiple components to implement the design. Increases in IC performance are outpacing the ability of designers to implement systems that can properly exploit such advances. Furthermore, it is often the physical implementation of a system that ultimately sets the cost and performance of the design. The primary system implementation cost drivers are the physical packaging technologies, partitioning, assembly, test, and rework of the system.
III. MANAGING PHYSICAL IMPLEMENTATION DECISIONS
The amount of functionality fabricated into integrated circuits is increasing rapidly. Increased functionality is resulting in more I/O per chip and increased die sizes. Already, high-performance components are appearing with more than 1000 I/O, and die dimensions on the order of one inch. The Semiconductor Industries Association (SIA) roadmap predicts that die with over 2000 I/O will appear before the year 2000. Along with greater functionality on a chip comes increased performance demands, i.e., higher clock rates and increased power dissipation. Higher clock rates mean shorter rise and fall times for digital systems, which result in additional switching noise problems at the board level. Sub-500 ps rise times and 40 W power dissipations may be common by the year 2000. The trends in integrated circuits complicate an already challenging system design problem. Systems that were implemented on large printed circuit boards only a few years ago are now being forced into PCMCIA cards. Market windows that used to be years are now months. Shrinking product windows mean more efficient system design is required, and system optimization must receive automated support in the physical as well as behavioral and architectural domains. The physical aspects of high performance systems that must be managed early in the design process include:
I. Highly Interdisciplinary Design Space - One of the most difficult and frustrating problems in high-density system design is the concurrent management of a large number of interdisciplinary performance constraints and requirements. High-density systems have many important views (i.e., electrical, thermal, economic, size, manufacturability, etc.). Most engineers have become specialized in a single view of the problem and are not well equipped to balance highly technical design concerns against economic and manufacturing realities.
II. Concurrent Design Requirement - Traditionally, design focusing on the physical implementation of systems succeeded by assuming a "divide and conquer" attitude. In other words, the various physical views of a system were loosely enough coupled that they could be treated independently. Unfortunately, the very nature of high-density systems negates divide-and-conquer approaches. Seemingly small changes made to resolve one design problem often cause significant changes in other performance views (Fig. 3). Successful high-density systems design requires the concurrent treatment of design views.
Fig. 3 - The interdisciplinary nature of the design problem results in every aspect of system cost and performance being sensitive to every other aspect of system cost and performance.
III. Large Tradeoff Space - The number of technologies, processes, materials and approaches is substantial (Fig. 4) and selecting optimums is arduous and non-trivial if one truly wants a balance in cost and performance. Alternative technologies include: substrates (printed circuit boards, ceramic, thin-film), assembly methods (surface mount, through-hole, bare die - MCM), bonding techniques (wirebond, TAB, flip chip), test techniques, and manufacturing methods. The designer may not be aware of all the technology choices that exist, and few designers can comprehend all the interdependencies and ramifications the technologies and materials chosen may have on a particular design's cost and manufacturability. Further complicating the large tradeoff space is the reality that system optimums are often mixtures of technologies, i.e., not every chip on a board is necessarily assembled into the system using the same technologies and materials.
Fig. 4 - Possible packaging technology options associated with the inclusion of a single bare die into a system.
IV. PHYSICAL SYNTHESIS
Design synthesis is the process of creating new design representations, or providing refinement to existing design representations. Traditionally, synthesis produces an artifact that satisfies some high-level behavioral or structural specification via the translation from a behavioral (functional) description into a structural description and the translation of a structural description to a physical description. Translation from a functional description to the structural description generates structures that are not generally bound in physical space. The translation from the structural definition to the physical definition (physical synthesis) adds the physical information necessary to produce a working version of the object. System-level physical synthesis is not as concisely defined as logical synthesis activities. There are no well-developed languages analogous to a hardware description language or Boolean equations to define and represent system-level functionality. Methodologies and software tools that perform logical synthesis activities associated with translating behavioral descriptions to structural descriptions have been widely accepted in recent years. Unfortunately, synthesis activities associated with creating the physical specification of a system beyond a single die are virtually unknown. Because of the interdisciplinary and technology-centric focus required to perform early physical synthesis, methodologies for attacking it do not necessarily follow from behavioral or logical synthesis. A top-down system planning methodology consists of physical synthesis coupled with physical design (see Fig. 5). Figure 6 shows a detail of the possible information transfer between a physically oriented system planning tool and system architectural and behavioral modeling.
Fig. 5 - A top-down system planning methodology that couples physical implementation design (physical synthesis) with high-level behavioral and structural specification.
Central to the realization of physical synthesis are automated tradeoff analysis and physical partitioning of systems. Ten years ago companies had the luxury of assigning large numbers of engineers with detailed simulation tools to conduct physical tradeoff analysis and partitioning problems in order to find optimum system solutions. Unfortunately, economic pressures rarely allow this type of time and manpower intensive solution today. Today, engineers accomplish these analysis activities using a
combination of committees of experts (an electrical expert, a testing expert, a manufacturing expert, etc.), experience, and back-of-the-envelope guesses.
Fig. 6 - Detail of possible information transfer between a physically oriented system planner (SavanSys - see Appendix) and traditional synthesis activities.
In tradeoff analysis the selection of an optimum combination of technologies depends on several drivers: 1) the characteristics of the components to be integrated; 2) the application for which the system will be used (i.e., performance requirements, cost, operating environment, and support requirements); and 3) the availability of previously designed structures and their design history. The implementation of tradeoff analysis can be approached several different ways. Physical tradeoff analysis activities could be carried out using detailed point solution simulation tools. However, because not enough information about the system is available during the conceptual design phase, the usefulness of detailed simulators during this portion of the design process is limited. Relying on detailed simulations to provide the information necessary to make system-level implementation decisions can be a dangerous, resource intensive undertaking. Providing intelligent assistance and system-level estimation and predictions to the designer in the synthesis process often proves to be more useful than simulation in the early stages of design. The term "Design Advisor" has been used extensively in the recent literature to describe CAD tools which aid engineers in a complex design process, primarily tradeoff analysis. Design advisors can serve two purposes: 1) observe the state of a design and assess its appropriateness against a defined set of constraints, an activity that could be used throughout the design process, and 2) supplement specification and synthesis activities at the conceptual design level. Specific "design advisors" should not be confused with
their management functions. Design advisors provide quantitative and/or qualitative design assistance for a particular design view. The tool that provides the framework within which many different views of a design can be considered is a tradeoff analysis tool. Estimation-based design advisors use a mixture of predictive analysis approaches: heuristic models (using the results of empirical studies to model a system), analytical models (closed-form formulations derived from basic principles), and simulation. Heuristic approaches are useful for some types of tradeoff analysis but may not be applicable to the analysis of new systems that are not similar to the system for which the heuristic was derived. Analytical models mixed with intelligently managed simulations generally represent the best approach for physical tradeoff analysis. Estimation-based tradeoff techniques are not intended to replace detailed simulation, but to provide support for early (conceptual) specification design when simulation may not be practical. Most physical tradeoff analysis is done in a "what-if" mode as opposed to performing automated optimization. The automatic searching of the design space has been done using numerical optimization techniques and design-of-experiments techniques; however, the system packaging design space is often too large for practical optimization. More practical methods consist of knowledge-based paring of the solution tree coupled with local sensitivity analyses. If multiple advisors are operating in an automated optimization mode, the opportunity for conflicts between advisors exists, i.e., different advisors may provide opposing advice. For example, a thermal advisor might suggest removing wiring layers in order to decrease the thermal resistance through a board while a size advisor might advise adding wiring layers to decrease the board area. The design system must be able to determine which advisor's suggestion should be implemented based on user provided constraints and their associated weighting. Recently several software tools for pre-layout physical design of boards and multichip modules have appeared. These tools are auto-placement centric and while they provide useful functionality to the single board/module physical design process, they do not address early physical synthesis. By the time the netlists are defined and the layout process begins, it is often too late to perform significant physical synthesis activities such as technology selection and physical partitioning.
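As a rough illustration of how opposing advisor suggestions might be arbitrated with user-supplied weights, the sketch below scores each proposed change by the weighted sum of its estimated effect on every design view. The advisor names, views, and numbers are hypothetical; the paper does not prescribe a specific arbitration algorithm.

# Hypothetical arbitration between conflicting design advisors.  Each advisor
# proposes a change and estimates its impact (negative = improvement) on every
# design view; user-provided weights decide which suggestion wins.
from dataclasses import dataclass

@dataclass
class Suggestion:
    advisor: str
    change: str
    impact: dict  # view name -> estimated relative change

def arbitrate(suggestions, weights):
    """Return the suggestion with the best (lowest) weighted impact score."""
    def score(s):
        return sum(weights.get(view, 0.0) * delta for view, delta in s.impact.items())
    return min(suggestions, key=score)

suggestions = [
    Suggestion("thermal", "remove two wiring layers",
               {"thermal_resistance": -0.15, "board_area": +0.20, "cost": -0.05}),
    Suggestion("size", "add two wiring layers",
               {"thermal_resistance": +0.10, "board_area": -0.25, "cost": +0.08}),
]
# Weights express which views the user cares about most for this product.
weights = {"thermal_resistance": 0.5, "board_area": 0.3, "cost": 0.2}

best = arbitrate(suggestions, weights)
print(f"selected: {best.advisor} advisor -> {best.change}")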
V. A SYSTEMS VIEW OF PARTITIONING
Present design methodologies address the partitioning of behaviors and architectures into chips and only consider packaging technology tradeoff analysis after the partitioning (if at all). The optimum number of packaging entities (MCMs, boards, etc.), and the best distribution of the candidate components among them, has not yet been addressed in an automated fashion. A systems view of partitioning requires the consideration of several issues: 1) modified partitioning objectives, 2) the treatment of
additional technologies such as connectorization and enclosures, and 3) connection topology. Module partitioning is essentially analogous to its chip-level counterpart, although more interdisciplinary physical constraints must be considered and the basis for partitioning becomes broader. The use of objective functions designed to help partition functionality into chips may be appropriate for traditionally packaged systems (surface mount and through-hole) but is dangerous for high-density packaging systems such as multichip modules. In traditionally packaged systems, the cost of the system tracks the accumulated cost of the fabrication of the chips; however, in advanced systems where bare die are used and/or extensive test and rework of modules is required, the system costs often do not track the component fabrication costs. High-density packaging systems that contain bare die must include cost modeling that can assess the impact of Known Good Die (bare die test and burn-in) and the possibility of performing repair and rework operations. If these critical (and difficult to model) processes are not included, then automated partitioning exercises are of little more than academic interest. One critical element of system-level physical partitioning that has no analog in chip-level partitioning is connection topological relationships. Connection topology is the physical orientation of one board or module to another. There are three topologies which are applicable to inter-module or board connections: plane-in-plane (single chip package, can, or 3D stacking), edge-to-plane (edge connector), and edge-to-edge. The connection topology used has no relationship to the number of connections required, but will determine the number of interconnect crossovers (a crossover is created when one connection between modules or boards crosses over another). Crossovers tend to add complexity that penalizes the system's size, cost, reliability, and electrical performance. Figure 7 shows the magnitude of the potential crossover problem. The number of connection crossovers can be minimized by appropriately partitioning the system.
Fig. 7 - Area (proportional to cost) and crossover growth with the number of connections, N.
VI. DISCUSSION
In this paper we have focused on system physical planning as it relates to a single board/module or a multiple board/module system. Unfortunately, in the future, system planning solutions will not have the luxury of treating a single product in isolation. The required solution for the present product will be the solution that optimizes the family of products it is in. In other words, tradeoff analysis and optimization will have to span multiple products and find the best solution for the present product based on its transition to future products. It is also evident that only considering the cost of manufacturing the system is insufficient for understanding the real cost of the system to the company. Lifecycle analysis is required to obtain true system costs during planning and will ultimately need to be used in system physical partitioning as well. Lifecycle costs include the costs associated with the product development and design, sales and marketing of the
product, of support and maintenance, waste disposition, etc. The system design community is eagerly awaiting the introduction of software tools that address system physical synthesis. Tools that can provide automated tradeoff analysis and partitioning at a "what-if" level that allows designers to perform system optimization are becoming more common. The future will see the introduction of tools that perform automatic optimization of system designs, moving towards
the concurrent design of chips and systems.
APPENDIX - THE SAVANSYS SYSTEM PLANNING TOOL
SavanSys is a software tool for enhancing the manufacturability and decreasing the design risk associated with the selection of packaging technologies for integrated circuits. The SavanSys software tool performs system packaging tradeoff analysis. SavanSys concurrently computes physical (size, weight, interconnect routing requirements, escape routing), electrical (delays, attenuation, dc drops, effective inductance), thermal (internal and external thermal resistances, air cooling), reliability (MTTF), and cost/yield performance metrics for multichip systems. The outputs from SavanSys are the physical implementation strategy and the partitioned design. Multichip modules (MCMs) and traditional packaging (through-hole and surface mounting) technologies treated by SavanSys include: traditional and fine-line printed circuit boards, low temperature cofired ceramic, and thin-film (chip-first and chip-last). Component assembly approaches include wirebonding, TAB, flip chip, and single chip packages. Materials are also available for bare die attach, encapsulation, attaching extrusions, and for defining the bonding and substrate technologies. SavanSys provides the user the ability to compute the cost of assembled electronic systems, including component costs, component preparation (wafer and die level burn-in,
bumping), single chip package costs, surface mount and through-hole assembly costs, bare die attach costs (TAB, wirebond, flip chip), tooling costs associated with the processes above, substrate costs, repair and rework costs, and test costs. In addition, learning curves may optionally be defined for any or all steps in the processes that describe the above operations, and handling costs may be defined for all steps that involve the insertion of components into the process flow. The SavanSys tradeoff analysis tool is specifically designed to allow the impact of technology, material, and design rule variations on the cost and performance of a board or system of boards to be evaluated. SavanSys enables designers to make optimum physical implementation and physical partitioning choices early in the design process to facilitate successful implementation decisions. SavanSys is integrated into the Mentor Graphics and Cadence physical design frameworks and is compatible with Aspect and DIE format databases.
A Graph-Based Delay Budgeting Algorithm for Large Scale Timing-Driven Placement Problems * Gustavo E. Tellez, David A. Knol, and Majid Sarrafzadeh Department of Electrical Engineering and Computer Science Northwestern University Evanston, IL 60208 email: gus,[email protected]
* This work was supported in part by NSF grant MIP-9207267 and by the IBM Ph.D. Resident Study Program.
Abstract
In this paper we present a new, general approach to the problem of computing lower and upper bounds on net delays. The upper bounds on the net delays are computed so that timing constraints between input and output signals are satisfied. The set of delay lower and upper bounds is called a delay budget. The objective of this work is to compute a delay budget that will lead to a timing feasible circuit placement and routing. We formulate this problem as a convex programming problem with special structure. We utilize the special structure of this problem to propose a linear programming formulation of the problem. A novel, simple, and efficient graph-based algorithm is proposed to solve the linear programming problem. We present experimental results for our algorithms with the MCNC placement benchmarks. Our experiments use budgeting results as net length constraints for the TimberWolf placement program, which we use to evaluate the budgeting algorithms. We obtain an average of 50% reduction in net length constraint violations over the well known ZSA algorithm. We also study different delay budgeting objective functions, which yield 2X performance improvements without loss of solution quality. Our results and graph-based formulation show that our proposed algorithm is suitable for modern large-scale budgeting problems.
1 Introduction
As integrated circuit technology advances, circuit performance becomes heavily dependent on connecting wire delays. Placement of circuits on an IC is known to have a major impact on connecting wire lengths, and
thus on connecting wire delays. It is therefore important that modern placement algorithms consider timing objectives directly. Similarly, given a placement that satisfies timing objectives, it is equally important that the routing algorithms satisfy the timing objectives. Such algorithms are known as timing-driven placement and routing algorithms. Traditional placement and routing algorithms have focused on minimizing total wire-length and maximizing routability. However, these objectives are not necessarily compatible with timing-driven objectives. A number of timing-driven approaches have been proposed. Timing-driven placement approaches can be grouped into two major categories:
1. Path-based algorithms: Analyze path delays explicitly during the physical design. The algorithms try to satisfy both timing requirements and physical requirements simultaneously [10].
2. Net-based algorithms: In a net-based algorithm, timing requirements are first translated into physical requirements that may be translated into net weights [8], and/or net length upper bounds [11].
This paper focuses on the problem of translating path timing constraints into physical design upper bounds. Utilization of upper bounds simplifies the placement and routing algorithms by translating timing requirements into physical constraints. However, the choice of physical constraints is not unique. Ideally, one would choose the upper bounds such that the placement objectives are optimized, but this problem seems to be hard to solve. The alternative approach is to choose the upper bounds so that the placement algorithms have maximum flexibility. In addition, since
the budgeting results are heuristic in nature and may be overly tight, it is necessary to provide a method by which the budgeting algorithm can adjust the timing budget based on information such as results from a failed placement. In this paper we propose a general formulation for the delay budgeting problem, an efficient algorithm that solves our formulation, and different delay budgeting functions that satisfy the above objectives. The first to use a net budget approach for placement is the popular zero-slack algorithm (ZSA) [7]. ZSA has no global optimization criteria and is a greedy algorithm that assigns budgets to nets on long paths. ZSA ensures that the net budget is maximal, meaning no more budget could be assigned to any of the nets without violating the path constraints. Most other budgeting algorithms are off-shoots of ZSA. In [12] the ZSA algorithm is improved by allowing for budget distribution in proportion to the net weights. In [4] the delay budgeting problem is formulated as a convex programming problem. A logarithmic function is chosen to maximize the size of the timing feasible region. A method of adjusting the function's parameters based on placement results is also proposed in [4]. This paper is organized as follows. In Section 2, we introduce some terminology and our representation of the timing constraints using a timing constraint graph. In Section 3 we formulate the convex delay budgeting problem (CDB) using a timing graph representation, we then convert this problem into a linear programming problem by approximating the CDB problem with the piece-wise linear budgeting problem (PWLDB), and finally we propose a simplified version of these problems, the linear delay budgeting problem (LDB). Next, in Section 4 we introduce the Graph-Based Simplex (GBS) algorithm, and we show how to use it to solve the LDB problem. In Section 5 we propose two algorithms: the PWL-GBS algorithm is an extension of the GBS algorithm that can handle the PWL cost functions, and the iPWL-GBS algorithm iteratively increases the accuracy of the PWL to yield a more time efficient solution to the PWLDB problem. In Section 6 we use the iPWL-GBS algorithm to solve the CDB problem, and we propose and discuss the merits of several slack cost functions. Finally, in Section 7 we present our experimental results and conclusions.
2 Terminology
Some of the terminology for the following discussion is described next. In this paper we will assume that we are working with a combinational circuit C(M, N, PI, PO), which consists of a set of modules M = {M_i | i = 1, ..., |M|}, a set of nets N = {N_i | i = 1, ..., |N|}, a set of primary inputs PI = {PI_i | i = 1, ..., |PI|} and a set of primary outputs PO = {PO_i | i = 1, ..., |PO|}. The timing constraints for the circuit are given as an input arrival time a_i for each primary input PI_i, and as a required arrival time r_k at each primary output PO_k. We assume that the output(s) of a module will reach a steady state after all the module inputs have reached a steady state. We let x_i and D_i denote the latest input arrival time and propagation delay, respectively, for module M_i. A sample circuit is given in Figure 1. This circuit will be used as an example throughout the paper.
Figure 1: Sample circuit. The module delays, earliest signal arrival time and latest signal arrival time are also given.
Let an output of module M_i drive an input of module M_j, and let the latest arrival time at M_i be x_i; then the latest arrival time at M_j must satisfy the propagation delay constraint x_j >= x_i + D_i. The delay slack of a connection between an output of module M_i and an input of module M_j is denoted s_ij = x_j - x_i - D_i. The arrival times of the primary inputs and outputs must satisfy the timing constraints imposed by the input arrival time and the required arrival time: x_i = a_i and x_k = r_k. The delay budgeting problem seeks to assign values to the delay slacks (and thus to the signal arrival times of the modules). The values of the delay slacks are said to be feasible if they satisfy the timing constraints and the delay propagation constraints. A placement is timing feasible if it satisfies these equations with delay slacks bounded by 0 <= s_ij <= s_ij^max, where s_ij^max is an upper bound on the delay slack. The delay budgeting problem seeks to allocate delay slacks before the placement and routing steps. Thus, as a result of delay budgeting, the performance-driven placement and routing steps are given net delay bounds. Since the delay slacks equate with wiring delay, it is natural to expect all nets to have non-zero slacks. Furthermore, the distribution of these slacks determines the difficulty of finding a feasible placement (and/or routing) solution. As a result, the objective of the delay budgeting problem is to maximize an increasing function of the delay slacks.
The timing budget problem will be formulated using a graph-based timing model. Given a circuit C(M, N, PI, PO) we construct an edge-weighted, directed graph G(V, E), where V and E denote the set of vertices and edges in the graph, respectively. The vertices of the graph represent the modules. We will denote the number of vertices and edges in the graph as |V| = n and |E| = m. For each vertex v_i, we assign a variable for the latest input arrival time, x_i. The edges of the graph model represent the timing constraints on the latest arrival times. Each edge e_ij in E, with weight a_ij, represents an inequality x_j - x_i >= a_ij. Delay propagation edges are added for each output/input pin pair of every net, with a_ij = D_i. Timing constraint edges are added so that the signal arrival times at the primary inputs and outputs are fixed. The graph G(V, E) is called a timing constraint graph. The timing constraint graph for the sample circuit is given in Figure 2.
Figure 2: Timing graph for the sample circuit. The graph does not contain the timing constraint edges. Lower bounds are shown with the edges. Solid edges have weight m_ij = 1 and dashed edges have weight m_ij = 0.
3 Formulation
In the following formulations we assume that we have the circuit C(M, N, PI, PO) represented as a timing constraint graph G(V, E). The general delay budgeting problem can be formulated as follows:
Convex Delay Budgeting Problem (CDB): Given a convex function C_ij(s_ij) and a timing constraint graph G(V, E), find a set of slacks s that maximizes
    C(s) = sum_{e_ij in E} C_ij(s_ij)
subject to:
    s_ij = x_j - x_i - a_ij, for all e_ij in E,
    x_0 = 0, x >= 0, s >= 0, s in R^m, x in R^n.
The CDB formulation is a convex programming (CP) problem. Next, we convert the CDB problem into a linear programming (LP) problem. We linearize the function C_ij(s_ij) with a piece-wise linear (PWL) function C^H_ij(s_ij) of H linear segments, as follows:
1. Select slack values t_h, h = 0, ..., H, such that t_0 = 0, t_H = s_ij^max and t_{h-1} < t_h.
2. Construct H linear segments, m_ij^h s_ij + b_ij^h, h = 1, ..., H, such that:
    m_ij^h = ( C_ij(t_h) - C_ij(t_{h-1}) ) / ( t_h - t_{h-1} )                  (1)
    b_ij^h = ( t_h C_ij(t_{h-1}) - t_{h-1} C_ij(t_h) ) / ( t_h - t_{h-1} )      (2)
Now we define the PWL function C^H_ij(s_ij) = min_{h=1,...,H} ( m_ij^h s_ij + b_ij^h ). Let c_ij represent the delay slack cost for a delay slack s_ij. The delay budgeting problem is then formulated as a linear programming problem.
H-PWL Delay Budgeting Problem (H-PWLDB): Given a piece-wise linear delay slack function C^H_ij(s_ij) = min_{h=1,...,H} ( m_ij^h s_ij + b_ij^h ) and a timing constraint graph G(V, E), find a set of slacks s that maximizes
    C(c) = sum_{e_ij in E} c_ij
subject to:
    c_ij <= m_ij^h s_ij + b_ij^h, h = 1, ..., H, for all e_ij in E,
    s_ij = x_j - x_i - a_ij, for all e_ij in E,
    x_0 = 0, x >= 0, s >= 0, c >= 0, s, c in R^m, x in R^n.
The H-PWLDB problem is equivalent to the CDB problem if H is sufficiently large. A special case of the budgeting problem which can be solved efficiently results when the objective function is linear, i.e. C(x) = sum_{v_i in V} w_i x_i. This problem is called the Linear Delay Budgeting Problem (LDB). The LDB problem can be made equivalent to the H-PWLDB problem by letting alpha_ij = t_{h-1} and beta_ij = t_h such that t_{h-1} < s_ij <= t_h, setting m_ij = m_ij^h, and setting w_i = sum_{e_ji in fi(i)} m_ji - sum_{e_ij in fo(i)} m_ij, where fi(i) and fo(i) denote the fan-ins and fan-outs of vertex v_i in V, respectively. Our proposed algorithm will use simple extensions to an algorithm that solves the LDB problem to solve the H-PWLDB problem efficiently. The LDB problem has a nice structure and can be solved by several methods: the LP dual can be solved using Min-Cost Network Flow algorithms [1], or directly using Dual Network Flow algorithms [9] and Graph-Based Simplex (GBS) algorithms [5]. In this paper we will extend the GBS algorithm to solve the H-PWLDB problem. In the next section we will outline the details of our approach.
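A small sketch of the linearization step may help: given breakpoints t_0 < t_1 < ... < t_H and a concave slack cost C_ij, equations (1) and (2) give the slope and intercept of the chord over each interval, and the PWL cost is the minimum over the segments. The code below is an illustrative implementation under those definitions; the logarithmic cost and the breakpoint values are only example choices, not data from the paper.

# Illustrative piece-wise linear (PWL) approximation of a concave slack cost
# function, following equations (1) and (2): one chord per breakpoint interval,
# with the PWL cost taken as the minimum over all segments.
import math

def pwl_segments(cost, breakpoints):
    """Return [(slope m_h, intercept b_h)] for consecutive breakpoints."""
    segs = []
    for t_prev, t in zip(breakpoints, breakpoints[1:]):
        m = (cost(t) - cost(t_prev)) / (t - t_prev)                # equation (1)
        b = (t * cost(t_prev) - t_prev * cost(t)) / (t - t_prev)   # equation (2)
        segs.append((m, b))
    return segs

def pwl_cost(segs, s):
    """C^H(s) = min over segments of (m_h * s + b_h)."""
    return min(m * s + b for m, b in segs)

# Example: a LOG-style cost with assumed parameters a and p, and H = 4 segments
# over an assumed slack range [0, s_max].
a, p, s_max = 1.0, 10.0, 10.0
cost = lambda s: math.log(s + a) / math.log(p + a)
breakpoints = [h * s_max / 4 for h in range(5)]   # t_0 = 0, ..., t_4 = s_max

segs = pwl_segments(cost, breakpoints)
for s in (0.5, 2.5, 7.5):
    print(f"s = {s:4.1f}: exact = {cost(s):.3f}, PWL = {pwl_cost(segs, s):.3f}")

Because the cost is concave, the chord of the interval containing s is the lowest of all segments there, so the min over segments reproduces the usual PWL under-approximation used in the H-PWLDB constraints.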
4 The Graph-Based Simplex Algorithm
The first algorithm proposed for the solution of Linear Programming (LP) problems is the Simplex algorithm [3]. An efficient implementation of the Simplex algorithm results when the constraint set is represented as a directed graph. The resulting algorithm is known as the Graph-Based Simplex (GBS) algorithm. The GBS algorithm has been proposed previously in [6, 5]. A Linear Program (LP) is a mathematical program in which the objective function is linear in the unknowns and the constraints consist of linear equalities and linear inequalities. The intermediate solutions of the problem are called basic solutions, in which a subset of the variables, called the non-basic variables, are set equal to zero; the remaining variables are called the basic variables. If a solution satisfies the constraints it is said to be feasible. A feasible solution that is also basic is said to be a basic feasible solution. The idea of the Simplex algorithm is to proceed from one basic feasible solution of the constraint set to another, in such a way that the value of the objective function is continually increased, until the maximum is reached. The method used to step from one feasible solution to another is called a pivot. A pivot turns a basic variable into a non-basic variable and a non-basic variable into a basic variable. An outline of the Simplex algorithm is as follows:
1. Initial solution: Compute the initial basic feasible solution.
2. Pivoting strategy: For each basic variable: determine the non-basic variable that enters the new basic feasible solution, determine the cost change due to the pivot, and if the cost change is an improvement then pivot.
3. Stopping criteria: Stop if no cost improving pivots can be found for any of the basic variables.
We will now show how to apply the Simplex method when the constraint set is represented by a graph G(V, E). Each vertex v_i in the graph represents a variable x_i, and each vertex is assigned a weight w_i. Each edge e_ij in the graph has a weight a_ij and represents a constraint of the form x_j - x_i >= a_ij. The problem can be converted into standard form by introducing a slack variable s_ij for each edge: s_ij = x_j - x_i - a_ij. An edge with s_ij = 0 is called a tight edge. A spanning tree T(V, E_T) of tight edges E_T, a subset of E, rooted at vertex v_0 of G(V, E), represents a basic solution of the LDB problem. The initial basic feasible solution can be computed efficiently by computing the As Soon As Possible (ASAP) signal arrival times. This is the well known Longest Path Problem in a directed graph, which can be solved using one of several algorithms [2]. These algorithms can be modified to generate the initial tight edge tree. The pivoting strategy in the GBS algorithm takes advantage of the tight edge tree T(V, E_T) representation of the basic solution. A pivot consists in replacing a tight edge e_ij in E_T with an edge that is not in the tree, such that the resulting solution remains basic feasible, i.e. can be represented by a new tight edge tree. Further details on the GBS pivoting algorithms can be found in [5].
Figure 3: Illustration of a GBS pivot starting from the sample tight tree. A forward pivot of non-basic edge e_{1,4} for basic edge e_{10,13} is shown.
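The initial basic feasible solution mentioned above can be found with a standard longest-path computation. The sketch below assumes an acyclic constraint graph and uses illustrative data structures rather than the authors' implementation: it computes ASAP arrival times x_i over edges x_j - x_i >= a_ij and then marks the tight edges (s_ij = 0) from which an initial tight-edge tree can be built.

# Illustrative ASAP (longest-path) computation over a timing constraint graph.
# Edges are (i, j, a_ij) meaning x_j - x_i >= a_ij; the graph is assumed acyclic.
from collections import defaultdict, deque

def asap_arrival_times(n_vertices, edges):
    succ = defaultdict(list)
    indeg = [0] * n_vertices
    for i, j, a in edges:
        succ[i].append((j, a))
        indeg[j] += 1

    x = [0.0] * n_vertices                       # x_0 = 0; others pushed up by constraints
    order = deque(v for v in range(n_vertices) if indeg[v] == 0)
    while order:                                  # topological relaxation (longest path)
        i = order.popleft()
        for j, a in succ[i]:
            x[j] = max(x[j], x[i] + a)
            indeg[j] -= 1
            if indeg[j] == 0:
                order.append(j)

    tight = [(i, j) for i, j, a in edges if abs(x[j] - x[i] - a) < 1e-9]
    return x, tight

# Tiny example: 4 vertices with delay-propagation edge weights a_ij.
edges = [(0, 1, 2.0), (0, 2, 3.0), (1, 3, 4.0), (2, 3, 1.0)]
x, tight = asap_arrival_times(4, edges)
print("arrival times:", x)     # [0.0, 2.0, 3.0, 6.0]
print("tight edges:", tight)   # slack-zero edges, candidates for the initial tree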
5 Solving the H-PWLDB Problem
We now turn our attention to the H-PWLDB problem. The H-PWLDB problem is solved with the GBS algorithm, with a modified pivoting strategy. We call the modified algorithm the PWL-GBS algorithm. The following changes to the GBS pivoting strategy are made for the PWL-GBS algorithm:
1. Given a tight edge e_ij such that s_ij = beta_ij, with alpha_ij = t_{h-1}, beta_ij = t_h, m_ij = m_ij^h, and weights w_i and w_j, a forward flip of edge e_ij changes these values to alpha_ij = t_h, beta_ij = t_{h+1}, m_ij = m_ij^{h+1}, w_i = w_i + m_ij^h - m_ij^{h+1}, and w_j = w_j - m_ij^h + m_ij^{h+1}.
2. Given a tight edge e_ij such that s_ij = alpha_ij, with alpha_ij = t_h, beta_ij = t_{h+1}, m_ij = m_ij^{h+1}, and weights w_i and w_j, a backward flip of edge e_ij changes these values to alpha_ij = t_{h-1}, beta_ij = t_h, m_ij = m_ij^h, w_i = w_i + m_ij^{h+1} - m_ij^h, and w_j = w_j - m_ij^{h+1} + m_ij^h.
Figure 4: Illustration of a backward flip.
Figure 5: Illustration of a forward flip.
The test for a cost improving pivot remains unchanged if it is performed after a flip. Since a flip only affects the pivot cost of that edge, the remaining GBS pivoting strategy remains unchanged. Illustrations of the forward and backward flips are shown in Figures 4 and 5. The above procedure is consistent with a pivot on the original H-PWLDB LP problem and thus eventually leads to an optimal solution. With the modified pivoting strategy, the PWL-GBS algorithm retains the memory complexity of the GBS algorithm, namely O(n + m). However, the number of pivots, and thus the time complexity, of the PWL-GBS algorithm increases by at least a factor of H. We will next improve the time complexity of this algorithm. We use the following algorithm to reduce the number of pivots required by the PWL-GBS algorithm. Begin by using a single segment approximation of the problem, hence setting H = 1. The problem initially simplifies to the LDB problem. At each subsequent iteration the number of segments in the PWL cost functions is doubled. Then we re-compute the PWL approximation so that the solution of the previous iteration can be used as the initial basic feasible solution for the current iteration. The PWL-GBS algorithm is then run on the more accurate problem. We call this new algorithm the iterated PWL-GBS algorithm, or iPWL-GBS. The iPWL-GBS algorithm requires O(log H) PWL-GBS iterations to solve the H-PWLDB problem. Our experiments indicate that each iteration takes time similar to the problem with H = 1 (see Figure 7).
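The iterative refinement just described can be summarized by a short driver loop. The sketch below is a schematic outline, not the authors' code: pwl_segments is the linearization from the earlier sketch, and the pwl_gbs and initial_solution callables stand for one warm-started run of the PWL-GBS pivoting algorithm and the ASAP tight-edge tree of Section 4, respectively.

# Schematic outline of the iPWL-GBS driver: start from a single-segment (LDB)
# approximation and double the number of PWL segments each iteration, reusing
# the previous solution as the starting basic feasible solution.
def ipwl_gbs(graph, edge_costs, slack_bounds, H_target, pwl_gbs, initial_solution):
    """edge_costs[e] is the convex cost C_e; slack_bounds[e] is s_e^max.
    pwl_gbs(graph, segments, warm_start) runs PWL-GBS pivoting (not shown here)."""
    solution = initial_solution(graph)            # ASAP tight-edge tree (Section 4)
    H = 1
    while H <= H_target:
        segments = {}
        for e, cost in edge_costs.items():
            pts = [h * slack_bounds[e] / H for h in range(H + 1)]   # t_0, ..., t_H
            segments[e] = pwl_segments(cost, pts)                   # equations (1), (2)
        solution = pwl_gbs(graph, segments, warm_start=solution)
        H *= 2                                     # O(log H_target) outer iterations
    return solution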
6 Solving the Convex Delay Budgeting Problem
The CDB problem can be solved directly using the iPWL-GBS algorithm, given a choice of objective
function. The main objective of the delay budgeting algorithm is to produce a budget that will yield a feasible and hopefully optimal placement. This objective is difficult to obtain directly, so instead we propose a set of compatible heuristic objectives: maximize the volume of the timing feasible region, allocate non-zero slacks to all nets if at all possible, and allow for some control over the assigned slacks during a potential post-placement feedback step. In addition to the above objectives, we are also free to choose functions that are amenable to optimization. With these objectives in mind, we propose three functions: QUAD: C(s) = -(s - p)^2; LOG: C(s) = log(s + a)/log(p + a); and QUAD+ZSA: use the QUAD function with p equal to the ZSA algorithm slack values multiplied by a factor. The method by which we linearize the objective function has a significant impact on the performance and the results of the algorithm. For this reason an optimal algorithm for linearizing the cost function is used.
7 Results and Conclusions
We implemented our algorithms using C++ on a Sun workstation. Our experiments used the MCNC placement benchmarks. Since most of these benchmarks do not contain timing information, we have budgeted wire-lengths directly instead. The formulation is identical to the timing budget formulation, except that instead of timing values we use wire-lengths. In our experiments we compute a single upper bound for all PI to PO pairs from the longest PI to PO path. The single upper bound is obtained by multiplying the length of the longest path by an upper bound factor η (we used 1.5-3.0). We then take the net budgets and convert them into constraints for the placement program. We used the TimberWolf V1.2 placement program for
evaluation of the timing budgets.
Figure 6: Plot of number of pivots vs. number of edges. Each point is obtained from one of the iterations of the iPWL-GBS algorithm.
Figure 7: Pivot ratio versus iteration, for the iPWL-GBS algorithm. Plot shows min, max and average values for the QUAD (lower curves) and LOG (upper curves) functions.
Table 1: Table of budgeting and placement experimental results, showing percentage of bad nets.
Figures 8 and 9 show a plot of the net violations for a ZSA and a QUAD+ZSA budgeted and placed benchmark. The percentage of bad nets (nets which violated length constraints) for each benchmark is shown in Table 1. The results in this table indicate that the proposed algorithms offer a significant improvement over the traditional ZSA algorithm, reducing the number of bad nets by an average of 50%. Furthermore, these results also indicate that there may be more than one suitable cost budgeting function, as the results for the LOG and QUAD+ZSA functions are virtually identical in quality. This result is of practical importance, since the performance of the algorithm differs depending on the cost budgeting function. In our experiments (see Figure 7) we observed as much as a 2X difference in the number of pivots between the QUAD and LOG functions.
Next, we provide experimental evidence for our average time complexity claims. For the following results we ran the iPWL-GBS algorithm on various MCNC benchmarks for the QUAD and LOG functions, for H = 32 segments. Each data point for Figures 6 and 7 was obtained from one PWL-GBS iteration. In Figure 6 we show the number of pivots as a function of the number of edges. The plot shows that, for these benchmarks, the average number of pivots grows linearly with m. In addition, Figure 7 also shows that the number of pivots remains bounded during successive iPWL-GBS iterations. Our implementation of the iPWL-GBS algorithm can handle large problems. The largest MCNC benchmark, for example, contains 25,000 cells and 33,000 nets, resulting in a graph with n = 25,000 and m = 58,000. Using the LOG objective function, this problem took 1 hr. 32 min. to solve on an SS10 workstation. In this paper we have studied the problem of computing a delay budget for placement and routing problems. We have formulated the problem as a convex programming problem. We then modified the problem formulation into an LP problem by modifying the objective functions into piece-wise linear functions. We have proposed a space and time efficient algorithm, iPWL-GBS, to solve these problems by taking advantage of the special structure of the LP formulation. We have implemented and tested our proposed budgeting algorithm. Our experiments use published MCNC benchmarks for data and the TimberWolf placement
Figure 8: Net violations for placement using ZSA budget constraints, for the Primary1 example with η = 2.0.
Figure 9: Net violations for placement using constraints generated by iPWL-GBS with the QUAD+ZSA function.
program for evaluation of the budgeting results. Our experiments show that the proposed budgeting algorithm obtains results that provide significant improvement over previous approaches both in quality and in efficiency.
References
[1] R. K. Ahuja, T. L. Magnanti, and J. B. Orlin. "Network Flows: Theory, Algorithms, and Applications". Prentice Hall Inc., Englewood Cliffs, NJ, 1993.
[2] T. H. Cormen, C. E. Leiserson, and R. L. Rivest. Introduction to Algorithms. McGraw-Hill Book Company, 1991.
[3] G. B. Dantzig. "Linear Programming and Extensions". Princeton University Press, Princeton, NJ, 1963.
[4] T. Gao, P. M. Vaidya, and C. L. Liu. "A New Performance Driven Placement Algorithm". In International Conference on Computer-Aided Design, pages 44-47. IEEE/ACM, 1991.
[5] J. F. Lee and C. K. Wong. "A Performance-Aimed Cell Compactor with Automatic Jogs". IEEE Transactions on Computer Aided Design, CAD-11(12):1495-1507, December 1992.
[6] S. L. Lin and J. Allen. "Minplex - A Compactor that Minimizes the Bounding Rectangle and Individual Rectangles in a Layout". In Design Automation Conference, pages 123-130. IEEE/ACM, 1986.
[7] R. Nair, C. L. Berman, P. S. Hauge, and E. J. Yoffa. "Generation of Performance Constraints for Layout". IEEE Transactions on Computer Aided Design, CAD-8(8):860-874, August 1989.
[8] Y. Ogawa, T. Ishii, Y. Terai, and T. Kozawa. "Efficient Placement Algorithms Optimizing Delay for High-Speed ECL Masterslice LSI's". In Design Automation Conference, pages 404-410. IEEE/ACM, 1986.
[9] S. Plotkin and E. Tardos. "Improved dual network simplex". In Ann. ACM-SIAM Symp. on Discrete Algorithms, pages 367-376. ACM-SIAM, 1990.
[10] A. Srinivasan, K. Chaudhary, and E. S. Kuh. "RITUAL: An Algorithm for Performance Driven Placement of Cell-Based IC's". In Third Physical Design Workshop, May 1991.
[11] M. Terai, K. Takahashi, and K. Sato. "A New Min-Cut Placement Algorithm for Timing Assurance Layout Design Meeting Net Length Constraint". In Design Automation Conference, pages 96-102. IEEE/ACM, 1990.
[12] H. Youssef and E. Shragowitz. "Timing Constraints for Correct Performance". In International Conference on Computer-Aided Design, pages 24-27. IEEE/ACM, 1990.
REDUCED SENSITIVITY OF CLOCK SKEW SCHEDULING TO TECHNOLOGY VARIATIONS Jose Luis Neves and Eby G. Friedman University of Rochester Department of Electrical Engineering Rochester, New York 14627 email: [email protected] Abstract - A
methodology is presented in this paper for determining an optimal set of clock path delays for designing high performance VLSI/ULSI-based clock distribution networks. This methodology emphasizes the use of non-zero clock skew to reduce the system-wide minimum clock period. Although choosing (or scheduling) clock skew values has been previously recognized as an optimization technique for reducing the minimum clock period, the difficulty in controlling the delays of the clock paths due to process parameter variations has limited its effectiveness. In this paper the minimum clock period is reduced using intentional clock skew by calculating a permissible clock skew range for each local data path while incorporating process dependent delay values of the clock signal paths. Graph-based algorithms are presented for determining the minimum clock period and for selecting a range of process-tolerant clock skews for each local data path in the circuit. These algorithms have been demonstrated on the ISCAS-89 suite of circuits. Furthermore, examples of clock distribution networks with intentional clock skew are shown to tolerate worst case clock skew variations of up to 30% without causing circuit failure while increasing the system-wide maximum clock frequency by up to 20% over zero skew-based systems.
This research was supported by Grant 200484/89.3 from CNPq (Conselho Nacional de Desenvolvimento Científico e Tecnológico), Brazil, the National Science Foundation under Grant No. MIP-9208165 and Grant No. MIP-9423886, the Army Research Office under Grant No. DAAH04-93-G-0323, and by a grant from the Xerox Corporation.
1. INTRODUCTION
Clock skew occurs when the clock signals arrive at sequentially-adjacent storage elements at different times. Although it has been shown that intentional clock skew can be used to improve the clock frequency of a synchronous circuit [1, 2, 3, 4, 5, 6], clock skew is typically minimized when designing the clock distribution network, since unintentional clock skew due to process parameter variations may limit the maximum frequency of operation, as well as cause circuit failure independent of the clock frequency (i.e., race conditions). In this paper, the clock skew of a local data path L_ij is defined as T_Skew,ij(L_ij) = T_CDi - T_CDj, where T_CDi and T_CDj are the clock signal delays of registers R_i and R_j. The clock skew is described as negative if T_CDi precedes T_CDj (T_CDi < T_CDj) and as positive if T_CDi follows T_CDj (T_CDi > T_CDj). It is shown in [1,2] that double clocking (the same clock pulse triggers the same data into two adjacent storage elements) can be prevented when the clock skew between these storage elements satisfies T_Skew,ij >= -T_PDmin, where T_PDmin is the minimum propagation delay of the path connecting both storage elements. Furthermore, it is also shown in [1,2] that zero clocking (the data reaches a storage element too late relative to the following clock pulse) is prevented when T_Skew,ij < T_CP - T_PDmax, where T_CP is the clock period and T_PDmax is the maximum propagation delay of the data path connecting both storage elements. The limits of both inequalities, T_Skew,ij(min) = -T_PDmin and T_Skew,ij(max) = T_CP - T_PDmax, define a region of valid clock skew for each pair of adjacent storage elements, called the permissible range [7] or certainty region [8], as shown in Figure 1. A violation of the lower bound leads to circuit failure while a violation of the upper bound limits the clock frequency of the circuit. Based on these observations, the process variation tolerant optimal clock skew scheduling problem can be divided into two sub-problems: determining a minimum clock period that defines a valid permissible range for any two storage elements in the circuit, and determining a minimum width for each permissible range such that unacceptable variations in the target clock skew remain within the bounds of a permissible range. In this paper, a solution for this problem is presented.

This research was supported by Grant 200484/89.3 from CNPq (Conselho Nacional de Desenvolvimento Cientifico e Tecnologico), Brazil, the National Science Foundation under Grant No. MIP-9208165 and Grant No. MIP-9423886, the Army Research Office under Grant No. DAAH04-93-G-0323, and by a grant from the Xerox Corporation.
Figure 1: Permissible range of a local data path.

The problem of determining a minimum clock period has been previously solved [1, 3-6], in which a set of timing equations is used to determine the optimal clock period and the clock delay to each register in the circuit, thereby defining the local clock skews. However, in order to better control the effects of process parameter variations, it is advantageous to determine the permissible range of each local data path, select a clock skew
value that permits the greatest variation of skew within the permissible range and, finally, determine the clock delays to each register.
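As a concrete illustration of the permissible range defined above, the following short Python sketch (not from the paper; the function and argument names are assumptions made here for illustration) computes the two bounds of a single local data path from its minimum and maximum propagation delays and a candidate clock period.

def permissible_range(t_pd_min, t_pd_max, t_cp):
    """Permissible clock skew range of one local data path.

    Lower bound (prevents double clocking): T_skew >= -T_PDmin.
    Upper bound (prevents zero clocking):   T_skew <= T_CP - T_PDmax.
    Returns None when the range is empty at this clock period.
    """
    lower = -t_pd_min
    upper = t_cp - t_pd_max
    return (lower, upper) if lower <= upper else None

# For example, a path whose propagation delay lies between 2 and 8 time units
# collapses to the single skew value -2 when T_CP = 6.
print(permissible_range(2.0, 8.0, t_cp=6.0))   # (-2.0, -2.0)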
This paper is organized as follows: in Section 2, a localized clock skew schedule is derived from the effective permissible range of the clock skew for each local data path considering any global clock skew constraints and process parameter variations. In Section 3, techniques for determining the set of clock skew values that are tolerant to process parameter variations are presented. In Section 4, these results are evaluated on a series of benchmark circuits, demonstrating performance improvements while tolerating process parameter variations. Finally, some conclusions are drawn in Section 5.

2. OPTIMAL CLOCK SKEW SCHEDULING
A synchronous digital circuit C can be modeled as a finite directed multi-graph G(V,E), as illustrated in Figure 2. Each vertex in the graph, v_j in V, is associated with a register, circuit input, or circuit output. Each edge in the graph, e_ij in E, represents a physical connection between vertices v_i and v_j, with an optional combinational logic path between the two vertices. An edge is a bi-weighted connection representing the maximum (minimum) propagation delay T_PDmax (T_PDmin) between two sequentially-adjacent storage elements, where T_PD includes the register, logic, and interconnect delays of a local data path [7], T_PD = T_C-Q + T_Logic + T_Int + T_Set-up. A local data path L_ij is a set of two vertices connected by an edge, L_ij = {v_i, e_ij, v_j}, for any v_i, v_j in V, as shown in Figure 2. A global data path, P_kl = v_k ~> v_l, is a set of alternating edges and vertices {v_k, e_k1, v_1, e_12, ..., e_nl, v_l}, representing a physical connection between vertices v_k and v_l (see Figure 2). A multi-input circuit can be modeled as a single input graph, where each input is connected to vertex v_0 by a zero-weighted edge. Pl(L_ij) is defined as the permissible range of a local data path and Pg(P_kl) is the permissible range of a global data path.

Figure 2: Graph model of a synchronous circuit in terms of local and global data paths.
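A minimal data structure sketch of this graph model follows (illustrative only; the class and field names are assumptions, not the authors' implementation). Each edge carries the minimum and maximum propagation delay of the local data path it represents.

from dataclasses import dataclass, field

@dataclass
class LocalDataPath:
    src: str            # vertex v_i (register, circuit input, or output)
    dst: str            # vertex v_j
    t_pd_min: float     # minimum propagation delay of the path
    t_pd_max: float     # maximum delay (T_C-Q + logic + interconnect + set-up)

@dataclass
class CircuitGraph:
    vertices: set = field(default_factory=set)
    edges: list = field(default_factory=list)

    def add_local_data_path(self, vi, vj, t_pd_min, t_pd_max):
        self.vertices.update({vi, vj})
        self.edges.append(LocalDataPath(vi, vj, t_pd_min, t_pd_max))

g = CircuitGraph()
g.add_local_data_path("v1", "v2", 2.0, 10.0)   # illustrative delay values only
g.add_local_data_path("v2", "v3", 4.0, 6.0)
g.add_local_data_path("v1", "v3", 2.0, 8.0)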
2.1 Timing Constraints

The timing behavior of a circuit C can be described in terms of two sets of timing constraints, local constraints and global constraints. The local constraints ensure that the data signal is correctly latched into the registers of a local data path in order to prevent double and zero clocking. The local timing constraints are represented by the following equation [1-6] to prevent double clocking,

T_Skew(L_ij) >= T_Hold - T_PD(min) + ζ_ij,    (1)

and the following equation to prevent zero clocking,

T_Skew(L_ij) <= T_CP - T_PD(max),    (2)

where ζ_ij is a safety term introduced in [7] to prevent race conditions due to process parameter variations, as described in Section 3.

Besides satisfying the permissible range of a local data path Pl(L_ij), it is also necessary to ensure the existence of a permissible range for each global data path Pg(P_kl) to guarantee a race-free circuit, particularly when there are multiple feedback and parallel paths between the two vertices v_k and v_l. Two paths with common vertices are said to be in parallel when the data signal flows in the same direction in both paths. Likewise, a path is a feedback path when the data signal flows in a direction that is the reverse of the direction of the data signal flowing from the input of the circuit to the output of the circuit. To illustrate a circuit configuration where it is necessary to provide a permissible range for each global data path, consider a circuit composed of several global data paths connecting two common vertices v_k and v_l. The vertices v_k and v_l represent two registers, each register driven by a single clock signal. The two clock signals define a unique clock skew value between v_k and v_l, independent of the path connecting v_k to v_l. Therefore, a valid clock skew between v_k and v_l only exists if the clock skew is common to all the global data paths connecting v_k and v_l. Since the clock skew between vertices v_k and v_l is also the sum of the clock skew of each cascaded local data path connecting v_k to v_l [9], the resulting sum is independent of the global path between v_k and v_l. Alternatively, the permissible range of each of the paths connecting the vertices v_k and v_l is the sum of the permissible range of each cascaded local data path between v_k and v_l, independent of the global path between v_k and v_l. Therefore, a clock skew between the vertices v_k and v_l exists if the intersection of the permissible ranges of the paths connecting v_k and v_l forms a non-empty set, where the intersection of the permissible ranges is determined by the recursive application of the intersection operation applied to a set [10], and a set is the collection of clock skew values within a permissible range. The following example illustrates two circuits, one with two parallel paths and one with one forward path in parallel with a feedback path. Note that determining the permissible range of each local data path is not a sufficient condition for both circuits to work.

Example 1: An example of applying the concept of a permissible range of a clocked system to a circuit composed of multiple paths is illustrated in Figure 3, where the numbers assigned to the edges are the maximum and minimum propagation delay of each L_ij, and the register set-up and hold times are arbitrarily assumed to be zero. Furthermore, the pair of clock skew values associated with a vertex (in bold and italic) are the minimum and maximum clock skew calculated with respect to the origin vertex v_0 for a given clock period. The pairs in italic are determined with T_CP = 6 tu (time units) while the pairs in bold are determined with T_CP = 8 tu.

Figure 3: Example circuits describing the process for matching permissible clock skew ranges by adjusting the clock period T_CP. (a) System composed of two forward flowing parallel paths; (b) System composed of a single forward path and a single feedback path.

The minimum clock skew of each local data path L_ij is obtained by applying the maximum permissible negative clock skew to the local data path, or T_Skew,ij = -T_PDmin from (1). The maximum bound is obtained directly from (2), given that the clock period T_CP is known. Adding the minimum (maximum) clock skew of each cascaded local data path, the permissible range of each global data path connecting v_1 to v_3 is obtained, as illustrated in Figure 3. Observe that in Figure 3a, for a clock
period T_CP = 6 tu, no value of clock skew exists that is common to the two paths connecting vertices v_1 and v_3, since the permissible range of the path v_1-v_2-v_3 is [-10,-6] and the permissible range of the path v_1-v_3 is [-2,-2], and these permissible ranges do not intersect (or overlap) in time. A common value of clock skew is only obtained when the clock period is increased to 8 tu. Note that T_CP = 8 tu is less than the minimum clock period determined with zero clock skew, which is 11 tu. Therefore, a reduction of the minimum clock period from 11 tu to 8 tu is obtained with the application of negative clock skew. From the example in Figure 3, in order to prevent circuit failures at the global level, circuits with parallel and feedback paths must have a non-empty permissible range composed of the intersection or overlap among the permissible ranges of each individual parallel and feedback path. Therefore, a new set of global timing constraints is required and formalized below. The concept of permissible range overlap of a global data path P_kl can be stated as follows:

Theorem 1: Let P_kl be a global data path within a circuit C with m parallel and n feedback paths. Let the two vertices, v_k and v_l in P_kl, which are not necessarily sequentially-adjacent, be the origin and destination of the m parallel and n feedback paths, respectively. Also, let Pg(P_kl) be the permissible range of the global data path composed of vertices v_k and v_l. Pg(P_kl) is a non-empty set of values iff the intersection of the permissible ranges of each individual parallel and feedback path is a non-empty set, or

Pg(P_kl) = ( Intersection over i = 1..m of Pg(P_kl^i) ) intersected with ( Intersection over j = 1..n of Pg(P_kl^j) ).    (3)

Proof (=>): The clock skew between vertices v_k and v_l, T_Skew,kl, is unique and independent of the number of paths connecting the two vertices. Also, the clock skew T_Skew,kl of a single path that connects both vertices is the sum of the clock skew of each local data path along the path. Assuming that a value of clock skew exists between vertices v_k and v_l, this value is always the same independent of the path connecting v_k and v_l. Furthermore, for each path connecting vertices v_k and v_l, the minimum (maximum) clock skew value is the sum of the minimum (maximum) clock skews of each local data path along the path, defining the permissible range of the global path. Therefore, a valid clock skew between vertices v_k and v_l must be within the permissible range of clock skew of each and every path connecting both vertices. In other words, the intersection of permissible ranges must be a non-empty set. (<=): Assume that Pg(P_kl) is empty and there exists a valid clock skew value between vertices v_k and v_l. If this value of clock skew exists, it must be contained within the permissible range of all the paths connecting the vertices v_k and v_l. If a clock skew value exists for all the paths, the result of the intersection of all the permissible ranges cannot be an empty set. Therefore the valid value of clock skew contradicts the initial assumption.
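The computation behind Theorem 1 can be sketched in a few lines of Python (names and data layout are assumptions made here, not the paper's): the permissible range of one path is the interval sum of its local ranges, and the global range is the intersection over all parallel and feedback paths.

def path_range(local_ranges):
    """Permissible range of one global path as the interval sum of the
    permissible ranges of its cascaded local data paths."""
    return (sum(r[0] for r in local_ranges), sum(r[1] for r in local_ranges))

def global_range(paths):
    """Theorem 1: intersect the permissible ranges of every parallel and
    feedback path between the same pair of vertices; None means empty."""
    ranges = [path_range(p) for p in paths]
    lo = max(r[0] for r in ranges)
    hi = min(r[1] for r in ranges)
    return (lo, hi) if lo <= hi else None

# Figure 3a: at T_CP = 6 tu the two parallel paths have ranges [-10,-6] and
# [-2,-2], which do not overlap; at T_CP = 8 tu they widen to [-10,-2] and
# [-2,-2] and intersect at the single skew value -2.
print(global_range([[(-10, -6)], [(-2, -2)]]))   # None
print(global_range([[(-10, -2)], [(-2, -2)]]))   # (-2, -2)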
Similar to the permissible range of a local data path, the permissible range of a global data path is bounded by a minimum and maximum clock skew value. These values, the upper and lower bounds of the permissible range Pg(P_kl), can be determined as a function of the upper and lower bounds of the permissible ranges of each independent parallel or feedback path connecting vertices v_k and v_l.

Lemma 1: Let the two vertices, v_k and v_l in P_kl, be the origin and destination of a global data path with m forward and n feedback paths. If Pg(P_kl) is non-empty, the upper bound of Pg(P_kl) is given by

T_Skew(P_kl)_max = MIN over i of { max[ T_Skew( Pg(P_kl^i) ) ] },    (4)

and the lower bound of Pg(P_kl) is given by

T_Skew(P_kl)_min = MAX over i of { min[ T_Skew( Pg(P_kl^i) ) ] },    (5)

where the minimum and maximum over i are taken across the m parallel and n feedback paths P_kl^i.

Observe that both bounds of a clock skew region given by (3) are dependent on the clock period in the presence of feedback paths between vertices v_k and v_l. This recursive characteristic is used to increase the tolerance of the clock distribution network to process parameter variations, as explained in Section 3. For a non-recursive data path (either local or global), the lower clock skew bound is independent of the clock period, as shown in (1).

2.2 Optimal Clock Period

Without exploiting intentional clock skew, the minimum clock period is determined from (2) for the local data path with the maximum propagation delay. However, applying intentional negative clock skew to a local data path permits the circuit to operate at higher clock frequencies. The minimum clock period of a circuit operating with intentional clock skew must simultaneously satisfy (1), (2), and (3) for every local data path. The minimum clock period to safely latch data through a local data path L_ij can be determined from the differences in propagation delay of the combinational logic block within L_ij, assuming that the timing parameters of the registers (T_Set-up, T_Hold, and T_C-Q) are constant. When the maximum possible negative clock skew [2] is applied to L_ij, the clock period is the difference between the propagation delays, since the maximum negative clock skew is the minimum propagation delay within L_ij. The maximum negative clock skew defines the lower bound of the clock period of L_ij. The upper bound of the clock skew can be any value defined by the minimum clock period. Similarly, the clock period of a circuit is bounded by two values, T_CPmin and T_CPmax, determined from the differences in propagation delay within the local data paths of the circuit, as shown below and independently demonstrated by Deokar and Sapatnekar [6]. The lower bound of the clock period, T_CPmin, is the greatest difference in propagation delay of any local data path L_ij in C,

T_CPmin = MAX[ max over i != j ( T_PDmax,ij - T_PDmin,ij ), max over i = j ( T_PDmax,ii ) ],    (6)

and the upper bound of the clock period, T_CPmax, is the greatest propagation delay of any local data path L_ij in G,

T_CPmax = MAX[ max over i != j ( T_PDmax,ij ), max over i = j ( T_PDmax,ii ) ].    (7)

The second term in (6) and (7) accounts for the self-loop where the output of a register is connected to its input through an optional logic block. Since the initial and final registers are the same, the clock skew in a self-loop is zero and the clock period is determined by the maximum propagation delay of the path connecting the output of the register to its input. Observe that the clock period is equal to the lower bound in circuits without parallel and/or feedback paths. Furthermore, the permissible ranges determined with a clock period equal to the upper bound T_CPmax will always satisfy (3), since the permissible range of any local data path in the circuit contains zero clock skew. Although (7) satisfies any local and global timing constraints of circuit C, it is possible to determine a minimum clock period that satisfies (3) while including intentional clock skew. This transformation leads to the optimal clock period problem, which is stated in the following theorem:
Theorem 2: Given a synchronous circuit C modeled by a graph G(V,E), there exists a clock period T_CP satisfying (3) and bounded by T_CPmin <= T_CP <= T_CPmax. The clock period is a minimum if the permissible range resulting from (3) contains only a single value of clock skew.

Proof: For a local data path, if the clock period increases (decreases) monotonically, the upper bound of the permissible range always increases (decreases) monotonically due to the linear dependency between the clock skew and the clock period. The lower bound does not change since it is independent of the clock period. Therefore, starting with T_CP = T_CPmax and progressively reducing the clock period is equivalent to constraining the permissible ranges to narrower regions. In the limit, the minimum clock period is determined when a single value of clock skew within the permissible range is reached, since, due to monotonicity, a further reduction in the clock period would result in an empty permissible range, violating (3).
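The two clock period bounds can be computed directly from the path delay data. The sketch below follows the forms of (6) and (7) as reconstructed above; the delay values in the example are invented here so that the bounds match the 7 tu and 12 tu quoted later for Example 2, and are not taken from the paper.

def clock_period_bounds(paths):
    """T_CPmin and T_CPmax per (6) and (7).

    `paths` maps (v_i, v_j) -> (T_PDmin, T_PDmax); a self-loop has v_i == v_j.
    """
    diffs      = [tmax - tmin for (vi, vj), (tmin, tmax) in paths.items() if vi != vj]
    self_loops = [tmax for (vi, vj), (_, tmax) in paths.items() if vi == vj]
    t_cp_min = max(diffs + self_loops)                                  # greatest delay difference
    t_cp_max = max([tmax for _, tmax in paths.values()] + self_loops)   # greatest delay overall
    return t_cp_min, t_cp_max

# Hypothetical delays chosen so that the v_i-v_f path has a 7 tu spread and the
# largest logic delay is 12 tu, reproducing T_CPmin = 7 and T_CPmax = 12.
paths = {("vi", "vj"): (3.0, 9.0), ("vj", "vf"): (4.0, 8.0),
         ("vi", "vf"): (5.0, 12.0), ("vf", "vi"): (2.0, 6.0)}
print(clock_period_bounds(paths))   # (7.0, 12.0)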
A graph-based algorithm is presented in Figure 4 to determine the minimum clock period that ensures that each of the permissible ranges in the circuit satisfies (3). The initial clock period is given by (6) and, for each pair of registers in the circuit C, the local and global permissible ranges are calculated, as illustrated in Figure 4 in lines 4-13. The content of the permissible range is evaluated (line 14) and if empty, the clock period is increased (line 25), otherwise the clock period is decreased (line 26). A binary search is performed on each new clock period within the algorithm Intercept until the minimum clock period has been reached.

1.  Intercept( G(V,E), T_CP )
2.  for each v_x in V do
3.    for each v_y in V and v_y != v_x do
4.      for i <- 1 to m do                        (intersection of m parallel paths)
5.        calculate the bounds of the permissible range Pg(P_xy^i)
6.        if Set_parallel = empty then Set_parallel = Pg(P_xy^i)
7.        else Set_parallel = Set_parallel intersect Pg(P_xy^i)
8.      for j <- 1 to n do                        (intersection of n feedback paths)
9.        calculate the bounds of the permissible range Pg(P_xy^j)
10.       if Set_feedback = empty then Set_feedback = Pg(P_xy^j)
11.       else Set_feedback = Set_feedback intersect Pg(P_xy^j)
12.     Pg(P_xy) = Set_parallel intersect Set_feedback
13.     store Pg(P_xy)
14.     if Pg(P_xy) = empty then
15.       return "permissible ranges do not intercept"
16.     else if | T_Skew[Pg(P_xy)_max] - T_Skew[Pg(P_xy)_min] | < eps_1
17.       then return "permissible range too small"
18.     else return "success"
19. end Intercept
20. Optimal-Tcp( C )
21.   lower = T_CPmin; upper = T_CPmax;
22.   while (upper - lower) > eps
23.     T_CP = (lower + upper)/2;
24.     Intercept( C, T_CP );
25.     if "no success" then lower = T_CP;
26.     else upper = T_CP;
27. end Optimal-Tcp

Figure 4: Pseudo-code of the algorithm for determining the minimum clock period based on permissible range overlap.

The order of the algorithm in Figure 4 is O(V). This order is similar to other clock scheduling algorithms referenced in the literature [5,6], since the number of edges E is approximately of the same order as the number of vertices V by a linear transformation, or E = O(V).
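The binary search of Optimal-Tcp can be sketched in a few lines of Python (an illustrative reimplementation, not the authors' code; the feasibility test is abstracted into a callback supplied by the caller).

def optimal_tcp(ranges_intersect, t_cp_min, t_cp_max, eps=0.01):
    """Binary search for the smallest clock period at which every global
    permissible range is non-empty (the test is monotone in T_CP)."""
    lower, upper = t_cp_min, t_cp_max
    while upper - lower > eps:
        t_cp = (lower + upper) / 2.0
        if ranges_intersect(t_cp):
            upper = t_cp        # feasible: try a smaller clock period
        else:
            lower = t_cp        # the permissible ranges do not intersect
    return upper

# Toy feasibility test in which the ranges first intersect at 9.67 tu, the
# minimum clock period reported for Example 2 below.
print(optimal_tcp(lambda t_cp: t_cp >= 9.67, 7.0, 12.0))   # converges to within eps of 9.67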
Example 2: An example circuit illustrating how the clock period is determined is presented in Figure 5. The circuit is composed of three registers, symbolized by v_i, v_j, and v_f, with combinational logic within each local data path. It is assumed for simplicity in this example that the timing parameters of each register (T_Set-up, T_Hold, and T_C-Q) are zero. The minimum clock period T_CPmin is determined from (6) and is 7 tu, which is the difference in propagation delay within the logic block of the local data path v_i-v_f. The maximum clock period T_CPmax is the maximum propagation delay through a logic block in the circuit, which is 12 tu. Starting with T_CPmin, the permissible ranges of each local data path are used to calculate the permissible range of each global data path connecting vertices v_i to v_f. Since a unique clock skew must exist between vertices v_i and v_f, this value of clock skew must exist within the permissible range of each global data path connecting both vertices.

Figure 5: Example of selecting the clock period T_CP.

From Figure 5 with T_CP = 7 tu, the permissible ranges do not intersect, thus no value of clock skew exists that will permit the circuit to function correctly. Increasing T_CP to 9 tu permits the permissible ranges of the global data paths v_i-v_j-v_f and v_i-v_f to intersect, but the permissible range of the path v_i-v_j-v_f does not intersect with the permissible range of the path v_f-v_i. Therefore, the clock period is again increased. In the example shown in Figure 5, the clock period is increased beyond the optimal clock period to 11 tu to illustrate the existence of a permissible range for vertices v_i and v_f that permits choosing more than one value of clock skew between vertices v_i and v_f. A single value permissible range is obtained using the algorithm shown in Figure 4, which for this example determines a minimum clock period of 9.67 tu. The difference between the algorithm described here and other algorithms described in the literature [4-6] is the process for verifying whether a timing violation exists. In the approach offered by Szymanski [4], the existence of positive cycles, indicating a violation of the timing relationships, is checked with Lawler's algorithm [11], where Szymanski also indicates that the Bellman-Ford algorithm is a more efficient strategy for testing for positive cycles. This approach is adopted by Shenoy and Brayton [5] and Deokar and Sapatnekar [6]. Each of these algorithms runs in O(VE) time, where V is the number of registers and E is the number of edges. Linear programming solutions to this problem have also been developed by Fishburn [1] and by Sakallah et al. [3]. The solution of these algorithms produces the clock delay
from the clock source to each register in the circuit, thereby defining the clock skew of each local data path. However, in order to better control the effects of process parameter variations, it is advantageous to determine the permissible range of each local data path, select a value of clock skew that allows a maximum variation of skew within the permissible range and, with the clock skews selected, determine the clock delay to each register.

2.3 Selecting Clock Skew Values

The permissible range of a local data path Pl(L_ij) bounded by (1) and (2) defines the set of valid clock skews for a single local data path. However, for a circuit composed of multiple local data paths connected to form parallel and/or feedback paths, not all of the clock skew values that are valid for a local data path can be used to satisfy the permissible range of a global data path. Consider, for example, the path P_if shown in Figure 5 with T_CP = 11 tu. A clock skew of T_Skew,ij + T_Skew,jf = -3 + (-5) = -8 tu is a value of clock skew that is not within Pg(P_if), although the individual clock skews are within the respective permissible ranges, Pl(L_ij) = [-3,2] and Pl(L_jf) = [-5,-1]. This example indicates that only a sub-set of the permissible range of each local data path can be used to obtain the permissible range of the global data paths of the circuit.

Lemma 2: Let L_ij be a local data path within a global data path P_kl. Given a clock period T_CP that satisfies (3), the sub-set of values within Pl(L_ij) used to determine Pg(P_kl) is called the effective permissible range of a local data path, p(L_ij), such that p(L_ij) is contained in Pl(L_ij).
Lemma 2 does not define the actual position of an effective permissible range within each Pl(L_ij), since several solutions are possible, as illustrated in the example shown in Figure 5. Considering the path P_if, p(L_ij) + p(L_jf) = [-2,2] + [-1,-1] = [-3,1] and p(L_ij) + p(L_jf) = [0,2] + [-3,-1] = [-3,1]; two valid choices exist for the effective permissible ranges of L_ij and L_jf, respectively, since both choices result in Pg(P_if) = [-3,1]. The actual choice of the effective permissible range is constrained by additional criteria, such as reducing the absolute value of the clock skew [6], or ensuring the largest possible effective permissible range for each local data path so as to maximize the tolerance to process parameter variations. Observe that the possibility of multiple solutions is consistent with the existence of multiple solutions to the problem of indirectly choosing non-zero clock skews by calculating a set of clock path delays to satisfy a valid clock period [1,6]. Therefore, the selection of a specific value of clock skew for each local data path is performed in two steps. In the first step, the effective permissible range is determined for each local data path, while during the second step, the specific local clock skews are chosen to maximize the tolerance to process parameter variations. The assignment of the largest possible effective permissible range to a local data path begins with determining the unique solution to the permissible range of each global data path, as formulated below:

Theorem 3: Given a synchronous circuit C modeled by a graph G(V,E), let the two vertices, v_k and v_l in V, be the origin and destination of a global data path P_kl with m forward and n feedback paths. Let Pg(P_kl) be determined by (3). If Pg(P_kl) is non-empty, the width of Pg(P_kl) is greatest when the bounds of Pg(P_kl) are determined by (4) and (5), respectively.

Proof: This theorem is proved by observing that the bounds of Pg(P_kl) depend directly on the bounds of the permissible range of each global data path connecting vertices v_k and v_l. Assuming that the two vertices v_k and v_l are connected by two parallel paths and the minimum [maximum] clock skew of the permissible range between the two vertices is a value smaller than the value given by (4) [(5)], a permissible range is produced with a width larger than the permissible range obtained with (4) and (5). However, from Lemma 1 and the property of monotonicity, this assumption is a contradiction since the larger width can only result from the intersection of larger permissible ranges. Therefore, a smaller bound indicates that the upper and lower bounds of a particular global data path have not been constrained by (4) or (5).

The pseudo-code to determine the clock skew of each local data path is presented in Figure 6. The algorithm Intercept is first used to determine the permissible range of each global data path in the circuit, given a clock period T_CP that satisfies (3). Determining the effective permissible range and selecting the clock skew value for each local data path are performed as follows: 1) the permissible range of a global data path Pg(P_kl) is divided equally among each local data path connecting the vertices v_k and v_l (line 5); 2) each effective permissible range p(L_ij) is placed as close as possible to the upper bound of the original permissible range Pl(L_ij) (lines 6 and 7), thereby minimizing the likelihood of creating any race conditions; and 3) the clock skew is chosen in the middle of the effective permissible range, since no prior information on the variation of a particular clock skew value may exist (line 8). The minimum clock path delays are determined from this clock skew schedule [9].

1. Select-Skew( G(V,E), T_CP )
2.   Intercept( G(V,E), T_CP )
3.   for each P_kl in G(V,E) do
4.     for i <- k to l, L_ij in P_kl, do
5.       Width[p(L_ij)] = ( MAX[Pg(P_kl)] - MIN[Pg(P_kl)] ) / #{ L_ij in P_kl }
6.       Upper bound of p(L_ij) = MAX[Pl(L_ij)];
7.       Lower bound of p(L_ij) = MAX[Pl(L_ij)] - Width[p(L_ij)];
8.       T_Skew,ij = MAX[p(L_ij)] - ( MAX[p(L_ij)] - MIN[p(L_ij)] ) / 2;
9. end Select-Skew

Figure 6: Pseudo-code of the algorithm for selecting the non-zero clock skew of a local data path.
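The skew selection of Figure 6 can be illustrated with a short Python sketch (an approximation with assumed names, not the authors' code); it reproduces one of the valid choices discussed above for the path P_if of Figure 5.

def select_skews(global_range, local_ranges):
    """Split the global permissible range evenly over the local data paths,
    push each effective range against the upper bound of its local range,
    and choose the midpoint of each effective range as the target skew."""
    lo_g, hi_g = global_range
    width = (hi_g - lo_g) / len(local_ranges)
    targets = []
    for lo, hi in local_ranges:
        eff_hi = hi                 # keep the skew away from the race (lower) bound
        eff_lo = hi - width
        targets.append(eff_hi - (eff_hi - eff_lo) / 2.0)
    return targets

# Path P_if of Figure 5 at T_CP = 11 tu: Pg(P_if) = [-3, 1],
# Pl(L_ij) = [-3, 2] and Pl(L_jf) = [-5, -1].
print(select_skews((-3.0, 1.0), [(-3.0, 2.0), (-5.0, -1.0)]))   # [1.0, -2.0]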
3. REDUCED SENSITIVITY TO PROCESS PARAMETER VARIATIONS
A top-down design system has been developed for synthesizing intentionally skewed clock distribution networks from the timing constraints of the circuit without prior layout information [7,9], as illustrated in Figure 7. The top-down synthesis system is integrated with a bottom-up verification phase (darker-shaded region in Figure 7) to ensure that the effects of process parameter variations on the selected clock skew values do not violate the bounds of the effective permissible range of each local data path. The clock distribution network is primarily composed of active devices (CMOS inverters) that accurately implement the clock path delays that enforce the non-zero clock skew. The circuit model of the clock tree with active devices is based on the alpha-power law model [12]. Due to the active devices within the clock tree, the clock path delay variations are primarily due to the effects of process parameter variations on the active devices rather than variations of the interconnect lines within the clock tree [13]. Once the clock distribution network has been designed, each clock path delay is re-calculated assuming that the cumulative effects of device parameter variations, such as threshold voltage and channel mobility, can be collected into a single parameter characterizing the gain of a CMOS inverter, specifically the output current IDO [12]. The worst case variation of each clock skew is determined from calculating the minimum and maximum clock path delays considering the minimum and maximum IDO of each inverter within each branch of the clock distribution network. If a single worst case clock skew value is outside the
effective permissible range of the corresponding local data path, T_Skew,ij not in p(L_ij), a timing constraint is violated and the circuit will not function properly. This violation is passed to the top-down synthesis system, indicating which bound of the effective permissible range is violated.

Figure 7: Synthesis methodology of clock distribution networks tolerant to process variations.

The assigned clock skew of at least one local data path L_ij within the system may violate the upper bound of p(L_ij), i.e., T_Skew,ij > MAX[p(L_ij)]. This violation is corrected by increasing the clock period T_CP, since due to monotonicity the effective permissible clock skew range for each local data path is also increased (T_Skew,ij(max) is increased). The new clock skew value may also violate the lower bound of a local data path, i.e., T_Skew,ij < T_Skew,ij(min), where T_Skew,ij(min) is in p(L_ij).

Two compensation techniques are used to prevent lower bound violations, depending on where the effective permissible range of a local data path p(L_ij) is located within the permissible range of the local data path, Pl(L_ij). If the lower bound of p(L_ij) is greater than the lower bound of Pl(L_ij), the clock period T_CP is increased until the race condition is eliminated, since the effective permissible range will increase due to monotonicity. However, if after increasing the clock period, the clock skew violation still exists and the lower bound of the effective permissible range is equal to the lower bound of the local data path (MIN[p(L_ij)] = MIN[Pl(L_ij)]), any further increase of the clock period will not eliminate the violation caused by not satisfying (1). Rather, if the lower bound of p(L_ij) is equal to the lower bound of Pl(L_ij), a safety term ζ_ij > 0 is added to the local timing constraint that defines the lower bound of Pl(L_ij) [see (1)]. The clock period is increased and a new clock skew schedule is calculated for this value of the clock period. The increased clock period is required to obtain a set of effective permissible ranges with widths equal to or greater than the set of effective permissible ranges that existed before the clock skew violation. Observe that by including the safety term ζ_ij, the lower bound of the clock skew of the local data path containing the race condition is shifted to the right, moving the new clock skew schedule of the entire circuit away from the bound violation and minimizing the likelihood of any race conditions. This iterative process continues until the worst case variations of the selected clock skews no longer violate the corresponding effective permissible ranges.

Example 3: An example of a synchronous circuit with upper and lower bound violations is presented in Figure 8. This circuit is composed of two parallel paths. The initial clock period is T_CP = 9 tu. The valid permissible range between vertices v_1 and v_3 is initially Pg(P_13) = [-2,0], as indicated by the dark-shaded area shown in Figure 8.

Figure 8: Example of upper and lower bound clock skew violations and the strategy to remove these violations.

Upper bound violation: Assume, for example, that the worst case variation of the clock skew results in an upper bound violation. For example, the nominal clock skew is T_Skew,13 = -1 tu while the clock skew caused by the worst case variation is 0.5 tu. By increasing the clock period to 9.5 tu, the width of the permissible range between vertices v_1 and v_3 is increased to Pg(P_13) = [-2,1]. The new upper bound is greater than the worst case variation of clock skew and the circuit will now operate correctly.

Lower bound violation: Now assume that for the same circuit shown in Figure 8, the worst case variation causes the clock skew to be -2.5 tu, rather than the nominal value of the clock skew of -1 tu, violating the lower bound of the permissible range Pg(P_13). In this case, increasing the clock period does not remove the violation since the lower bound of Pg(P_13) is independent of the clock period. Rather, the safety term ζ_13 is increased arbitrarily from 0 to 1 tu and the clock period T_CP is increased to 10 tu. As discussed previously, a revised schedule of clock skew values is calculated by applying the top-down design system to re-design the clock distribution network. The topology of the clock distribution network does not change, although the delay values assigned to the branches must change to reflect the new clock skew schedule. In Figure 8, increasing T_CP to 10 tu will create a new permissible range Pg(P_13) = [-1,2]. A new clock skew is selected from this permissible range and the worst case clock skew variations are calculated. Assuming that the worst case variation is still 1.5 tu from the nominal value (see Figure 8), this new target clock skew is within Pg(P_13) and the violation is eliminated.
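A compact sketch of the bottom-up check (illustrative only; the function and names are assumptions): each worst-case skew is tested against the effective permissible range, and the two violation types map onto the two compensation techniques just described.

def classify_violation(worst_case_skews, effective_range):
    """Return 'upper', 'lower', or None for a set of worst-case skew values."""
    lo, hi = effective_range
    for skew in worst_case_skews:
        if skew > hi:
            return "upper"    # remedy: increase T_CP (the range widens monotonically)
        if skew < lo:
            return "lower"    # remedy: raise the safety term and re-schedule the skews
    return None

# Figure 8: nominal skew -1 tu inside Pg(P13) = [-2, 0].
print(classify_violation([0.5], (-2.0, 0.0)))    # 'upper'  -> raise T_CP to 9.5 tu
print(classify_violation([-2.5], (-2.0, 0.0)))   # 'lower'  -> increase the safety term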
4. SIMULATION RESULTS

The simulation results presented in this section illustrate the performance improvements obtained by exploiting non-zero clock skew while considering the effects of process parameter variations. In order to demonstrate these performance improvements, the suite of ISCAS-89 sequential circuits is chosen as benchmark circuits [14]. A unit fanout delay model (one unit delay per gate plus 0.2 units for each fanout of the gate) is used to estimate the minimum and maximum propagation delay of the logic blocks. The set-up and hold times are set to zero. The performance results are illustrated in Table 1. The number of registers and gates within the circuit, including the I/O registers, is shown in Column 2. The clock period assuming zero clock skew is shown in Column 3. The clock period obtained with intentional clock skew is shown in Column 4. The resulting performance gain is shown in Column 5. The clock period obtained with the constraint of zero clock skew imposed among the I/O registers is shown in Column 6, while the performance gain with respect to a zero skew implementation is shown in Column 7.

Table 1: Performance improvement with non-zero clock skew

circ.   size (#reg./#gates)   T_CP0 (T_Skew = 0)   T_CPi (T_Skew != 0)   gain (%)   T_CP (I/O T_Skew = 0)   gain (%)
ex1     2/7                   11.0                 6.3                   43         7.2                     35
s27     8/10                  9.2                  5.4                   41         6.2                     33
s298    23/119                16.2                 11.6                  28         11.6                    28
s386    20/159                19.8                 19.8                  0          19.8                    0
s444    30/181                18.6                 11.1                  41         11.1                    41
s510    32/211                19.8                 17.3                  13         17.3                    13
s838    67/446                27.0                 13.5                  50         15.6                    42

The results shown in Table 1 demonstrate reductions of the minimum clock period of up to 50% when intentional clock skew is exploited. The amount of reduction is dependent on the characteristics of each circuit, particularly the differences in propagation delay between each local data path. Note also that by constraining the clock skew of the I/O registers to zero, circuit speed can be improved, although less than without this constraint. Examples of clock distribution networks which exploit intentional clock skew and are less sensitive to the effects of process parameter variations are listed in Table 2. The clock trees are synthesized with the system presented in [7,9]. The clock skew values are derived from a circuit simulation of the clock path delays of a clock tree using SPICE Level-3, assuming the MOSIS SCMOS 1.2 um fabrication technology. The minimum clock period assuming zero clock skew, T_CP0, and intentional clock skew, T_CPi, is shown in Column 2. The permissible range most susceptible to process parameter variations is illustrated in Column 3. The target clock skew value is shown in Column 4. In Columns 5 and 6, respectively, the nominal and maximum clock skew are depicted, assuming a 15% variation of the drain current I_DO of each inverter. Note that both the nominal and the worst case value of the clock skew are within the permissible range. The percent variation of clock skew due to the effects of process parameter variations is shown in Column 7. A 20% improvement in speed with up to a 30% variation in the nominal clock skew, and a 33% improvement in speed with up to an 18% variation in the nominal clock skew are observed for the example circuits listed in Table 2.

Table 2: Worst case variations in clock skew due to process parameter variations, I_DO = +/-15%

circuit   T_CP0/T_CPi   permissible range (ns)   selected skew   simulated skew (nom)   simulated skew (worst)   Error % (nom)   Error % (worst)
cdn1      11/9          [-8, -2]                 -3.0            -3.0                   -2.10                    0.0             30.0
cdn2      18/15         [-6.8, -1.4]             -4.2            -4.1                   -3.3                     2.4             21.4
cdn3      27/18         [-14, 2.3]               1.1             1.14                   1.3                      3.6             18.2
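The headline figures quoted above follow directly from the table entries; the short check below uses the cdn1 and cdn3 rows of Table 2 (the helper names are my own).

def improvement(t_cp_zero_skew, t_cp_with_skew):
    """Speed improvement (%) of the non-zero skew schedule over zero skew."""
    return 100.0 * (t_cp_zero_skew - t_cp_with_skew) / t_cp_zero_skew

def skew_variation(nominal, worst_case):
    """Worst-case clock skew variation (%) relative to the nominal skew."""
    return 100.0 * abs(worst_case - nominal) / abs(nominal)

print(round(improvement(11, 9)), round(skew_variation(-3.0, -2.1)))   # 18 30
print(round(improvement(27, 18)), round(skew_variation(1.1, 1.3)))    # 33 18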
5. CONCLUSIONS

The problem of scheduling clock path delays such that intentional localized clock skew is used to improve performance and reliability while considering the effects of process parameter variations is examined in this paper. A graph-based approach is presented for determining the minimum clock period and the permissible ranges of each local data path. The process of determining the bounds of these ranges and selecting the clock skew value for each local data path so as to minimize the effects of process parameter variations is described. Rather than placing limits or bounds on the clock skew variations, this approach guarantees that each selected clock skew value is within the permissible range despite worst case variations of the clock skew. The clock skew scheduling algorithms for compensating for process variations have been incorporated into a top-down, bottom-up clock tree synthesis environment. In the top-down phase, the clock skew schedule and permissible ranges of each local data path are determined to allow the maximum variation of the clock skew. In the bottom-up phase, possible clock skew violations due to process parameter variations are compensated by the proper choice of clock skew for each local data path and the controlled increase of the clock period T_CP. The clock period of a number of ISCAS-89 benchmark circuits is minimized with this clock scheduling algorithm. Scheduling the clock skews to make a clock distribution network more tolerant to process parameter variations is presented for several example networks. The results listed in Table 2 confirm the aforementioned claim that variations in clock skew due to process parameter variations can be both tolerated and compensated.
6. REFERENCES

[1] J. P. Fishburn, "Clock Skew Optimization," IEEE Transactions on Computers, Vol. C-39, No. 7, pp. 945-951, July 1990.
[2] E. G. Friedman, Clock Distribution Networks in VLSI Circuits and Systems, IEEE Press, 1995.
[3] K. A. Sakallah, T. N. Mudge, and O. A. Olukotun, "checkTc and minTc: Timing Verification and Optimal Clocking of Synchronous Digital Circuits," Proceedings of the IEEE/ACM Design Automation Conference, pp. 111-117, June 1990.
[4] T. G. Szymanski, "Computing Optimal Clock Schedules," Proceedings of the IEEE/ACM Design Automation Conference, pp. 399-404, June 1992.
[5] N. Shenoy and R. K. Brayton, "Graph Algorithms for Clock Schedule Optimization," Proceedings of the IEEE International Conference on Computer-Aided Design, pp. 132-136, November 1992.
[6] R. B. Deokar and S. Sapatnekar, "A Graph-theoretic Approach to Clock Skew Optimization," Proceedings of the IEEE International Symposium on Circuits and Systems, pp. 407-410, May 1994.
[7] J. L. Neves and E. G. Friedman, "Design Methodology for Synthesizing Clock Distribution Networks Exploiting Non-Zero Localized Clock Skew," IEEE Transactions on VLSI Systems, Vol. VLSI-4, No. 2, June 1996.
[8] D. G. Messerschmitt, "Synchronization in Digital System Design," IEEE Journal on Selected Areas in Communications, Vol. 8, No. 6, pp. 1404-1419, October 1990.
[9] J. L. Neves, Synthesis of Clock Distribution Networks for High Performance VLSI/ULSI-Based Synchronous Digital Systems, Ph.D. Dissertation, University of Rochester, December 1995.
[10] D. F. Stanat and D. F. McAllister, Discrete Mathematics in Computer Science, Prentice Hall, 1977.
[11] E. L. Lawler, Combinatorial Optimization: Networks and Matroids, Holt, Rinehart and Winston, 1976.
[12] T. Sakurai and A. R. Newton, "Alpha-Power Law MOSFET Model and its Applications to CMOS Inverter Delay and Other Formulas," IEEE Journal of Solid-State Circuits, Vol. SC-25, No. 2, pp. 584-594, April 1990.
[13] M. Shoji, "Elimination of Process-Dependent Clock Skew in CMOS VLSI," IEEE Journal of Solid-State Circuits, Vol. SC-21, No. 5, pp. 875-880, October 1986.
[14] S. Yang, "Logic Synthesis and Optimization Benchmarks User Guide: Version 3.0," Technical Report, Microelectronics Center of North Carolina, January 1991.
Multi-layer Pin Assignment for Macro Cell Circuits

Le-Chin Eugene Liu and Carl Sechen
Department of Electrical Engineering, Box 352500
University of Washington, Seattle, WA 98195

Abstract
We present a pin-assignment algorithm based on a new multi-layer chip-level global router. Combining pin assignment and global routing has been an important approach for the pin-assignment problem, but there are many difficulties in combining the two processes. In the past, only specialized global routing methods were used in the combined process. In our pin assignment program, we use an actual global routing algorithm. To meet the requirements of pin assignment while keeping the routing quality, we dynamically adjust the weights in the routing graph during the routing stage. In addition, multi-layer technology has introduced new challenges for the pin-assignment problem. Our algorithm can also handle the modern technology, providing pin assignment for multi-layer layouts. To our knowledge, no other pin assignment program can handle multi-layer layouts. Tests on industrial circuits show that our pin-assignment algorithm is quite effective at reducing the demand for routing resources.

1 Introduction

The macro-cell design style is one of the most important VLSI design approaches. A circuit is partitioned into a set of functional blocks (also called macro cells). Usually, the physical design process for macro cells is divided into the following steps: floorplanning/placement, pin assignment, global routing, and detailed routing. In the floorplanning/placement step, the dimensions and locations of the macro cells (macros) are determined. In the pin assignment step, the locations of pins on the macro boundaries are determined. Global routing assigns the routing regions for connecting the pins of the same net. Detailed routing generates the actual geometric layout for the interconnections. In early research on the pin-assignment problem, the pins are assigned on a block by block basis [6][7][8]. To determine the pin locations for a macro, [6] used two concentric circles for the current macro to assist the assignment. A nine-case (or nine-zone) method was used to decide the pin locations in [7]. A "radar sweep" arm based on the current block was used to determine the topological pin assignment in [8]. The disadvantage of the block-by-block method is obvious. The total wire length and chip area are the two most important objectives to be optimized for the macro-cell physical design. Neither of them can be estimated accurately without carrying out the global routing step [2]. Processing the assignment block by block simply neglects too many important factors. In addition to the methods mentioned above, [9] used a physical analogy for the pin-assignment problem. But the method assumed the pins on a macro have a fixed relative order. The limitation is not practical for general applications. Since pin assignment and global routing are closely related, combining the two steps has been a necessary approach for the recent research on the pin-assignment problem [2][3][4][5]. The center of a net was used to do an approximate pin assignment in [5]. Given a net N, Ma(N) denotes the set of macros containing at least one pin connected to net N. |Ma(N)| is the size of the set, i.e. the number of macros which belong to a net. Given a macro m, m(x) and m(y) are the coordinates of the center of m. The center of a net N, (ncx(N), ncy(N)), is defined as follows:

ncx(N) = ( sum over m in Ma(N) of m(x) ) / |Ma(N)|,    ncy(N) = ( sum over m in Ma(N) of m(y) ) / |Ma(N)|

The position of a pin in net N is determined by the intersection of the periphery of m and the line which connects the center of m and the center of the net N. So the approximate pin assignment assigns a pin to some segment of the boundary of a macro. Then global routing is performed on a channel intersection graph. The global routing results are used to decide the exact position for the pin. A channel intersection graph was used in [2] to perform the global routing. The pin assignment is decided by the global routing results. The weight function of the routing graph consists of two terms. One is how crowded the pins are in a macro's boundary segment. The other is the rectilinear distance between the center of a net and the center of a channel. Both of the above algorithms rely on using the center of a net. This approach suffers from the same drawbacks which the block-by-block method has. The center of a net contains no information about blockage in the path or possible channel capacity violations. Hence, the pin assignment may cause difficulties in the actual global routing. A channel connection graph was used as the global routing graph in [4], and feed-through paths inside macros are also allowed in the algorithm. A channel graph was used in [3] to perform the routing and allowed block re-shaping. From reviewing the above algorithms, we can learn the following points. First, accurate global routing results are the basis for good pin assignment. Second, as the technologies advance, more flexibility is needed for the pin assignment. The global routing is formulated as the Special Steiner Minimum Tree (SSMT) problem in the above four algorithms. An SSMT is a Steiner minimum tree of a net in which all the pins are leaves. This is because in traditional pin assignment, each macro can have only one pin for one net. Therefore, restricted Steiner tree heuristics were used for the global routing. A restricted Steiner tree heuristic usually yields worse results than the general heuristics, but the general heuristic cannot guarantee that the pins are all leaves. Another problem for the above algorithms is that the routing graphs they used are not suitable for multi-layer layouts. As the VLSI technologies advance, the number of routing layers keeps increasing. The new technologies introduce many difficulties for old physical design models. To handle the multi-layer layout pin-assignment problem, we have developed an algorithm based on an effective multi-layer global router [1]. The pin assignment is closely combined with an actual global router. We use the same general Steiner tree algorithm used in the global router for pin assignment. Since a general Steiner tree heuristic does not guarantee the pins to be leaves, the results imply that we may have multiple pins assigned to a macro for a net. Although this is not acceptable in traditional pin assignment, it may actually be needed in multi-layer layouts. Because there are more routing layers, more layers are available for the routing inside macros. Having more than one pin on a macro is often a virtue for multi-layer layouts. So our pin assignment program initially generates a multiple-pin assignment solution. If the internal routing resources are not available to implement the multiple-pin assignment, we will remove the extra pins and allow only one pin on a macro for a net. We developed an algorithm to convert a multiple-pin assignment to a one-pin per macro assignment while maintaining the routing quality. The rest of the paper is organized as follows. Section 2 briefly reviews the multi-layer macro-cell global routing model which our pin assignment algorithm is based on. Section 3 describes our pin assignment algorithm. Section 4 shows experimental results on some industrial circuits. Section 5 concludes the paper.
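To make the prior-work formulation concrete, here is a small Python sketch (not from the paper; macro shapes are simplified to axis-aligned rectangles and the names are mine) of the net center and of the resulting approximate pin position on a macro's periphery.

def net_center(macro_centers):
    """Center of a net: the mean of the centers of the macros in Ma(N)."""
    xs = [x for x, _ in macro_centers]
    ys = [y for _, y in macro_centers]
    return sum(xs) / len(xs), sum(ys) / len(ys)

def approximate_pin(macro_center, width, height, net_c):
    """Intersection of the macro periphery with the line from the macro
    center to the net center (assuming a rectangular macro)."""
    cx, cy = macro_center
    dx, dy = net_c[0] - cx, net_c[1] - cy
    if dx == 0 and dy == 0:
        return cx, cy
    tx = (width / 2) / abs(dx) if dx else float("inf")
    ty = (height / 2) / abs(dy) if dy else float("inf")
    t = min(tx, ty)                      # first crossing of the rectangle boundary
    return cx + t * dx, cy + t * dy

center = net_center([(0.0, 0.0), (10.0, 0.0), (0.0, 10.0)])
print(approximate_pin((0.0, 0.0), 4.0, 4.0, center))   # a point on the macro boundary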
2 The Multi-layer Global Routing Model

Since our pin assignment program is based on our global router, it is necessary to understand our global routing model before we can introduce our pin assignment algorithm. In [1], we presented a multi-layer chip-level global router based on a 3-dimensional routing graph. The routing graph closely models the multi-layer macro-cell layout. It contains not only the topological information but also the layering and via information. Therefore, the global router can give a very accurate estimate for the routing resources needed. In addition, the global router can be used for many objective functions, such as solving the congestion problems due to the channel capacity limits, minimizing the number of vias, or minimizing the chip area. We assume that all macro cells are rectilinear. Given a placement of macro cells, the chip area is divided into small regions by cut lines which are the extension lines of the boundaries of the macro cells. Figure 1 shows how the regions are defined. The cut lines are shown as the dashed lines.

Figure 1. Three macro cells and cut lines.

In each region, we place a node for each layer. The nodes for different layers in the same region are connected by via edges. If a layer is used for horizontal tracks, horizontally adjacent nodes of the layer are connected by edges. That means each node of the layer has horizontal edges connected to the nodes of the adjacent regions. Similarly, for layers used for vertical tracks, vertical edges are present between the adjacent nodes. Furthermore, if a certain layer cannot go through the cells, there will be no edges connecting the nodes of the layer inside the cells. The only exceptions are the boundary regions inside the cells. The nodes in the regions adjacent to cell boundaries still have edges connecting to regions outside the cell. That is because the pins on the cell's boundaries are mapped inside the boundary regions. So the edges across the boundaries on a cell-blocked layer are needed for the pins to exit. But those edges are directed edges. They can only be used for the pins to exit and are not used for any other routing purposes. The directed edges make route searching on the blocked layer efficient. Figure 2 is an example of a routing graph. There are two layers. One is for horizontal and the other is for vertical routing. The horizontal layer is not available inside the cells. The vertical layer is available over the top. If there is one more layer available, one more layer of nodes can be added, and the edges are added accordingly. This routing structure is flexible and accurate for multi-layer technology.

Figure 2. Global routing graph for the example of Fig. 1.

Every edge except the via edges in the routing graph is assigned a weight according to the wire length which the edge represents. Usually, the wire length is the distance between the nodes which the edge connects, i.e. the distance between the centers of two adjacent regions. For different layers, the weights can be adjusted due to the resistance difference. The weights of the via edges can be specified by the users. Usually, it reflects the equivalent resistance of a via. To route a net on the routing graph, all the pins of the net are mapped to the nodes corresponding to the layer specified and the regions where the pins reside. For the pins on the boundaries of the cells, we map them inside the cells. Then the global routing problem is formulated as finding a Steiner minimum tree on the routing graph. The definition of the Steiner tree in networks is as follows [10]:
* GIVEN: An undirected network G = (V, E, c), where c: E -> R is an edge length function, and a non-empty set N, N a subset of V, of terminals.
* FIND: A subnetwork TG(N) of G such that there is a path between every pair of terminals and the total length |TG(N)| = sum of c(e) over all edges e in TG(N) is minimized.

TG(N) is called a Steiner minimal tree of G. We developed a Steiner-tree algorithm based on a shortest path heuristic which in turn is based on Kruskal's algorithm for finding a minimum spanning tree [10]. The original algorithm can be described in three steps. First, each terminal forms a set. Second, a shortest path between any two sets is found to merge the two sets into one set. Third, if there is more than one set, go to the second step. Otherwise, the Steiner tree is found. The key difference between our algorithm and the original algorithm is that we retain multiple shortest paths between two sets. Hence, there may be cycles in the routes. A stage for removing cycles is necessary to guarantee the result to be a valid route. Also, a straightforward improvement stage is added. Each segment in the route is ripped up and rerouted. If a better segment is found, the old segment is replaced by the new one. The test results show our algorithm outperforms other graph-based Steiner tree heuristics which can handle irregular graphs [1]. The same heuristic was used in our pin assignment algorithm. To meet some special requirements for pin assignment, we added some weight-adjusting steps into the heuristic. The details will be introduced in the next section.
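The shortest-path-based merging can be sketched as follows (a simplified, single-path Python variant written for illustration; the authors' router additionally retains multiple shortest paths and removes the resulting cycles).

import heapq
from collections import defaultdict

def dijkstra(adj, sources):
    """Shortest distances and predecessors from a set of source nodes."""
    dist = {s: 0.0 for s in sources}
    prev = {s: None for s in sources}
    heap = [(0.0, s) for s in sources]
    while heap:
        d, u = heapq.heappop(heap)
        if d > dist.get(u, float("inf")):
            continue
        for v, w in adj[u]:
            nd = d + w
            if nd < dist.get(v, float("inf")):
                dist[v], prev[v] = nd, u
                heapq.heappush(heap, (nd, v))
    return dist, prev

def steiner_tree_sketch(adj, terminals):
    """Start with one set per terminal and repeatedly merge the two sets
    joined by the cheapest shortest path (Kruskal-like set merging)."""
    sets = [{t} for t in terminals]
    tree_nodes = set(terminals)
    while len(sets) > 1:
        grown = sets.pop()                       # grow from one component
        dist, prev = dijkstra(adj, grown)
        # cheapest connection from `grown` to any node of another component
        target = min((n for s in sets for n in s),
                     key=lambda n: dist.get(n, float("inf")))
        path, n = [], target
        while n is not None and n not in grown:
            path.append(n)
            n = prev.get(n)
        tree_nodes.update(path)
        other = next(s for s in sets if target in s)
        sets.remove(other)
        sets.append(grown | other | set(path))
    return tree_nodes

adj = defaultdict(list)
for u, v, w in [("a", "b", 1), ("b", "c", 1), ("a", "c", 3), ("b", "d", 1)]:
    adj[u].append((v, w)); adj[v].append((u, w))
print(sorted(steiner_tree_sketch(adj, ["a", "c", "d"])))   # ['a', 'b', 'c', 'd']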
3 Pin Assignment Algorithm
13.
3.a Introduction To combine pin assignment and global routing, we have added some extensions to the global routing graph. For pin assignment, each pin on a macro is mapped to the center of the macro. According to this concept, we added one node for each macro. But the nodes do not necessarily represent the centers of the macros. It is just where the pins are mapped to. Those nodes are connected to the nodes in the boundary regions of the macros by directed edges.
'nose directed edges are called connection edges. Connection
edges only connect the macro nodes to the nodes of the layers which arc permitted for pin placement. Figure 2 shows the modified routing graph. The round nodes are the "macro" nodes. The connection edges are directed, because they are only used to let the pin get to the boundary regions and are not used for other routing purposes. Initially, the weights for the connection edges are all set to a minimal value (e.g. we use 1) to show that a pin can be placed on any segment of the boundary of a macro. The weights of the other edges are set the same way as for global routing. M; , lk~~~
3.b Routing for multiple-pin assignment
After the stage 1 global routing, a minimum-weighted Steiner tree (a minimum wire length route) is generated for a net. This is the initial routing for the net. Since the Steiner-minimum-tree subroutine searches for a general Steiner minimum tree, it does not guarantee that every pin is a leaf of the tree. In the route, a macro may have more than one edge crossing its boundaries. Such a case implies that more than one pin should be placed on the macro for the net. Figure 5 shows an example: Macro-A requires one pin on the right side and one pin on the bottom side. Since we have cut lines in our routing model, we know which segment of the boundary a pin should be placed on according to the route.
Figure 3. Pin assignment routing graph for the example of Fig. 1.
Figure 4. Example of a route of a 3-pin net for pin assignment.
Figure 5. The route of the example in Figure 4 before the cycle is removed.
As we mentioned in Section 2, our Steiner tree algorithm allows multiple paths in the middle of the process. Figure 5 shows the route of the example in Figure 4 before the extra path is removed. During the routing, path 1-A-B-C-2 and path 1-D-E-F-2 have the same weight, so both paths are used to connect pin 1 and pin 2. If neither path is connected in the middle as the routing progresses, one of the paths will be removed to make a valid route. Since the weights are the same, the cycle-removing step randomly removes
one path. The route shown in Figure 5 is not the only possible result: path 1-D-E-F-2 could be removed instead of path 1-A-B-C-2. For global routing, both routes have the same wire length. But for pin assignment, the route in Figure 5 is the only optimal solution, because it reduces the internal (to the macro) routing congestion.
One important issue for multiple-pin assignment is that internal routing congestion needs to be considered as well. But wire length inside a macro is not a static parameter; it changes as the routing progresses. For example, in Figure 5, path 1-A-B-C-2 and path 1-D-E-F-2 have the same weight initially, but when path 1-G-H-I-3 is connected, their weights should be different. Hence, for each net, before the cycle-removing step, we examine the routing and adjust the weights of some edges so that the proper segments are removed. We select a corner as the basis for adjusting the weights. When pins are needed on two adjacent sides of a macro, moving pins closer to the corner where the two adjacent sides meet saves the routing resources needed inside the macro. In the example, the lower right corner is chosen. The weights of the edges crossing the boundaries are incremented by an amount equal to their distance to the corner. The edge D-E is closer to the lower right corner than the edge A-B, so its weight is less than that of edge A-B during the cycle-removing step. This adjustment results in the removal of path 1-A-B-C-2.
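A sketch of this weight adjustment before cycle removal; the corner choice and distance measure are simplified stand-ins for the router's internal bookkeeping.

def adjust_boundary_edge_weights(boundary_edges, corner):
    """boundary_edges: list of dicts with keys 'pos' (midpoint of the crossing segment)
    and 'weight'; corner: (x, y) of the chosen macro corner. Edges closer to the corner
    get a smaller increment, so the cycle-removing step keeps them and drops the rest."""
    cx, cy = corner
    for e in boundary_edges:
        x, y = e["pos"]
        e["weight"] += abs(x - cx) + abs(y - cy)   # Manhattan distance to the corner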
For a macro with pins on three adjacent sides, there are two corners to choose from; either one can be used. For all other cases, the edge weights are not adjusted.
After the initial routing is done, to make a valid pin assignment we need to solve two more issues. First, a segment of a boundary may be over-congested because it is assigned too many pins. Second, the multiple-pin assignment could cause routing congestion problems inside the macros. To solve the first issue, a re-routing technique is used. The method is as follows:

Solve-pin-congestion(segment-of-a-boundary)
1.  for (each net using the segment-of-a-boundary) {
2.    search for an alternate route for the net which avoids any over-congested boundary
3.    difference = the weight of the new route minus the weight of the initial route
4.    the net is inserted into a priority queue according to difference
5.  }
6.  while (the segment-of-a-boundary is still over-congested) {
7.    get the minimum-difference net from the priority queue
8.    replace the initial route of the net by the new route
9.    remove the net from the list of those nets using the segment-of-a-boundary
10. }

This method is used to solve the most over-congested boundary segment, then the second most over-congested one, and so on. It proceeds until all the over-congested boundary segments are processed. The method is not totally net-ordering independent, but it does solve some of the net-ordering problem.
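A minimal sketch of this rip-up-and-re-route step using a priority queue keyed on the weight difference; the route-finding call and congestion test are placeholders for the router's own routines, not the paper's interfaces.

import heapq

def solve_pin_congestion(segment, nets_using, find_alternate_route, route_weight,
                         is_over_congested, replace_route):
    """segment: the over-congested boundary segment; nets_using: nets routed through it.
    The four callables are assumed hooks into the global router (illustrative only)."""
    queue = []
    for net in nets_using:
        new_route = find_alternate_route(net)        # avoids over-congested boundaries
        if new_route is None:
            continue
        difference = route_weight(new_route) - route_weight(net.route)
        heapq.heappush(queue, (difference, id(net), net, new_route))
    remaining = set(nets_using)
    while is_over_congested(segment) and queue:
        _, _, net, new_route = heapq.heappop(queue)  # minimum-difference net first
        if net not in remaining:
            continue
        replace_route(net, new_route)                # rip up and re-route this net
        remaining.discard(net)                       # it no longer uses the segment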
For the second issue, multiple pins on a macro, it is usually not a problem. In traditional pin assignment, one macro can only have one pin for one net. However, with today's multi-layer technology, there are more layers available for the internal routing inside the macros. This also means that inside the macros the routing resources may not be fully used for internal connections; some of the routing resources may be used to provide more than one pin on the boundary for some signals. Besides, this may save overall routing resources. Figure 6 shows an example. For a long rectangular macro, if a signal is needed on both long sides, it consumes less routing resources to have a pin on both sides. With only one pin on the macro, some long wires are needed to go around the macro.

Figure 6. Example of multiple pins on a macro.

If the multiple-pin assignment does not cause any routing congestion problems inside the macros, the initial routing is sent to the pin-congestion-solving step (lines 15-19 of Pin-assignment in Section 3.a), then to the final stage (line 20) to assign the exact locations of the pins. However, since some internal routing congestion problems may occur, we need to be able to re-route the nets which have more than one pin on some macros and cause internal routing congestion problems. More processing is needed to convert the multiple-pin assignment to a single-pin assignment for those macros having internal routing congestion problems.

3.c Routing for single-pin assignment
Before the stage 2 routing, the weights of the edges have to be adjusted according to the initial routing results. Here, we introduce the definition of "macro center." In multiple-pin assignment, a macro could have more than one pin for a net. For a net N, a macro M has a set of pins, P(N, M); |P(N, M)| is the size of the set, i.e. the number of pins of N on M. Given a pin p, p(x) and p(y) denote the coordinates of the pin. The macro center of a macro M for a net N, (mcx(N, M), mcy(N, M)), is defined as follows:
    mcx(N, M) = ( Σ p(x), over p ∈ P(N, M) ) / |P(N, M)|
    mcy(N, M) = ( Σ p(y), over p ∈ P(N, M) ) / |P(N, M)|
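For concreteness, a direct transcription of this definition; the pin-coordinate data structure is an illustrative assumption, not the paper's.

def macro_center(pins):
    """pins: list of (x, y) positions of net N's pins on macro M (|P(N, M)| >= 1)."""
    xs = [x for x, _ in pins]
    ys = [y for _, y in pins]
    return sum(xs) / len(pins), sum(ys) / len(pins)

# e.g. two pins placed at the middle of their boundary segments
print(macro_center([(10.0, 4.0), (6.0, 0.0)]))   # -> (8.0, 2.0)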
Unlike other pin assignment algorithms, our initial routing does not consider the center of a net. But during the second routing stage, we take the macro centers into consideration. (In contrast, the center of a net is calculated from the position of the macros. It is decided before the routing is performed.) The macro center is calculated from the information obtained from the initial routing. According to the initial routing, we place the pins at the middle of each corresponding boundary segment, so we can calculate the macro center for those macros with more than one pin for a net. During the second stage routing, this information is used to calculate the weight of the edges which cross the boundaries. For the initial routing, an edge's weight is the wire length between the two nodes which are connected by the edge. During the second stage, an edge which crosses the boundary of the macro by means of a macro center has its wire length calculated from the macro center to the outside node. Figure 7(a) shows how a macro center is decided. The small triangle shows the macro center. Figure 7(b) shows how the wire length is calculated to one of the possible pin locations. During the second routing stage, the macro centers help reduce the pins required on a macro. But only the macros with more than one pin in the initial routing have a macro center. The reason is that those macros with only one pin still need the flexibility (i.e. the possibility of having more than one pin for a net on a macro) during the second stage to get the best results. Since the macro center method does not guarantee to totally remove the internal
congestion, we need a third stage to further limit the number of pins on a macro for a net.
Figure 7. Example of a macro center.

For those nets which still have multiple pins on macros after the second stage, they are processed by one more stage. We use a forced-direction method in the third stage: we force the direction of the pins on some macros. The direction is decided by the following rules. For a net, those macros which have only one pin are forced to the direction of the only pin. For example, if a macro has one pin on the right side according to the second routing, the macro can only have a pin on the right side during the third routing stage. For those macros which have more than one pin, the direction is decided by the macro center calculated before the second routing stage: the side closest to the macro center is the direction for the macro. It is possible that a macro has one pin in the initial routing but more than one pin in the second routing; for those macros, the direction is decided by the initial routing. In some cases, the macro center does not prefer any side. For example, a macro may have two pins, one on the right side and one on the top side, with the macro center at the same distance from either side; in this case, the direction is chosen randomly from the two sides. Another case is that two pins are on opposite sides, which places the macro center at the middle of both sides. For this case, it is not good to randomly choose a direction from either side, because when a macro has two pins on opposite sides it usually means the signal wants to pass through the macro. Those macros are held during the third routing stage: they do not join the routing until all other macros are connected, and then they are connected to the existing route one by one. This approach guarantees single-pin assignment for the macros and usually yields better results. To efficiently implement this method, we take advantage of the nature of our Steiner tree heuristic. Our Steiner-tree algorithm is based on a shortest path heuristic which in turn is based on Kruskal's algorithm for finding a minimum spanning tree [10]. The shorter paths between the terminals are found first. The connection edges, which are introduced in Section 3.a, connect the terminals to the routing graph. We set a large weight (we use the wire length of the initial route of the net) on the connection edges of those macros which have pins on opposite sides. The large weights naturally delay the connection to those macros; they won't get connected until all other macros are connected.

Figure 8 shows an example. Figure 8(a) is the initial routing. Figure 8(b) shows that the direction of the center macro is decided incorrectly. In Figure 8(c), the middle macro is excluded in the beginning of the routing. The route between the two outer macros is connected first; then the middle macro joins them. Our algorithm yields the optimal result in this case. The reason why we use the three-stage approach is to allow the maximum freedom for the routing to get better results. The later the stage, the more limitations: stage 2 uses the information obtained from the initial routing to limit the routing, and stage 3 uses the information from the two previous routing stages to further limit the routing. This is a time-quality trade-off. The more information we obtain, the less possibility of setting incorrect limitations.
3.d Example
A more complicated example is shown to demonstrate a complete three-stage routing process.
Figure 9. Result of the first stage routing.
Figure 8. Example of reducing the number of pins on a macro with two pins which are on the opposite sides.
Figure 10. Result of the second stage routing.
Figure 9 shows the initial routing of a 4-pin net. This is also the multiple-pin assignment for the net. If there is an internal congestion problem, the process will proceed to the next stage. Figure 10 shows the result after the execution of the second stage routing. The number of pins on macro MAA is reduced by the macro center of MAA. The pin assignment for MDD also changes boundaries. But the number of pins on MBB is increased. This is because the assignment for MBB is not restricted, yet. According to the information from the two stages of routing, we can force the proper direction for each macro. During the final stage, the weights of the edges which cross the incorrect boundaries are increased, so the edges of the proper direction are favored. Figure 11 shows the final result. This is the result of single-pin assignment. The result is optimal for the four-pin net.
Figure 11. Final result.

3.e Exact pin location
To decide the exact locations of the pins, we divide the pin connections into four cases, shown in Figure 12. The thick line is a segment of a macro boundary. Although the cases are shown horizontally, a vertical boundary is treated similarly. For case (a), the pin is assigned to the leftmost available position; for case (b), the rightmost available position. Cases (c) and (d) are assigned randomly after the other cases are processed. Please note one difference in our routing model: a routing region is not necessarily a channel. It can have one side bounded by a macro and the other side open.

Figure 12. Four cases of the pin connection.

4 Results
We tested our program on some industrial circuits, shown in Table 1. They already came with a pin assignment; we used our program to re-do the pin assignment. The placements were generated by TimberWolfMC v.3.1. We used a two-layer technology, one layer for horizontal tracks and the other for vertical tracks. Both layers are not available for global routing inside the macros. We used our global router [1] to obtain the wire length and area of three pin assignments. The first one is the original assignment, the second one is the traditional single-pin assignment, and the third one is the multiple-pin assignment. The wire length comparison results are in Table 2 and Table 3. On average, the original wire length is reduced by 35% using our single-pin assignment algorithm, and by 53% using our multiple-pin assignment algorithm. Note that allowing more than one pin per net per macro reduces the total wire length by an average of 27% over allowing only a single pin per net per macro (the traditional approach).

           number of
           cells   nets   pins    nodes of graph   edges of graph
hp           11     83     309          26                39
ami33        33     83     376          64               101
qpdm-b       17    121     645          37                58
xerox        10    203     696          21                30
amd          20    288     837          39                57
ami49        49    408     953         108               172
4832         17    586   1,576          64                98
intel        62    570   4,309         161               243
Table 1. Circuit information.
           Original pin          Single-pin
           assignment            assignment
           wire length (wl1)     wire length (wl2)     wl2 / wl1
hp             238,265               138,771             0.582
ami33           91,500                61,315             0.670
qpdm-b         793,388               635,573             0.801
xerox        1,082,177               570,563             0.527
amd            501,507               351,236             0.700
ami49          895,149               526,103             0.588
4832         4,736,701             2,703,249             0.571
intel        7,948,858             5,927,195             0.746
average                                                  0.648
Table 2. Wire length comparison.
           Multiple-pin
           assignment
           wire length (wl3)     wl3 / wl1     wl3 / wl2
hp              78,024              0.327         0.562
ami33           33,820              0.370         0.552
qpdm-b         365,374              0.461         0.575
xerox          439,095              0.406         0.770
amd            287,553              0.573         0.819
ami49          449,984              0.503         0.855
4832         2,334,710              0.493         0.864
intel        4,765,700              0.600         0.804
average                             0.467         0.725
Table 3. Wire length comparison.
Table 4 and Table 5 show the area comparison results. Our program saves, on average, 23% of the area for the single-pin assignment and 28% for the multiple-pin assignment. Again, note that allowing multiple pins on a macro for a net yields an average area reduction of 9% over the traditional single-pin restriction. For multi-layer technologies, only a multiple-pin assignment can take full advantage of them. To our knowledge, no other pin assignment program can handle multi-layer layout or assign multiple pins to a macro.
           Original pin              Single-pin
           assignment area (a1)      assignment area (a2)      a2 / a1
hp            3618 x 3620               3528 x 3138              0.845
ami33         1910 x 2050               1670 x 1900              0.810
qpdm-b        3933 x 4877               3749 x 4097              0.810
xerox         7230 x 7990               6680 x 7220              0.835
amd           2379 x 2325               1827 x 1411              0.466
ami49         7572 x 7490               6992 x 6900              0.851
4832         15360 x 12330             13760 x 11160             0.811
intel        11710 x 12200             10320 x 11160             0.805
average                                                           0.779
Table 4. Area comparison.
           Multiple-pin
           assignment area (a3)      a3 / a1      a3 / a2
hp            3458 x 2968               0.784        0.927
ami33         1660 x 1790               0.759        0.936
qpdm-b        3274 x 3577               0.611        0.762
xerox         6520 x 7190               0.812        0.972
amd           1694 x 1182               0.362        0.777
ami49         6982 x 6750               0.831        0.977
4832         13630 x 11050              0.795        0.981
intel        10040 x 11050              0.775        0.963
average                                 0.716        0.912
Table 5. Area comparison.
5 Conclusion
We have presented a new pin assignment algorithm which is closely combined with a multi-layer global router. Near-optimal global routing results are used for the pin-assignment task. In the past, combining pin assignment and global routing meant that only an inferior global routing method could be used. We overcame the difficulties and actually combined the two stages. Since our global router is capable of handling multi-layer layout, our pin assignment can also work for multi-layer technology. All the objective functions which can be handled by the global router can be handled for pin assignment. In addition, our algorithm is the first one reported which can assign multiple pins on a macro for a net. When the internal routing resources are available, this can greatly reduce the overall routing resources needed. The test results show that our pin assignment algorithm is quite effective at reducing the demand for routing resources.

References
[1] Liu, L. E. and Sechen, C., "A Multi-layer Chip-level Global Router," Fifth ACM/SIGDA Physical Design Workshop, 1996.
[2] Cong, J., "Pin Assignment with Global Routing for General Cell Designs," IEEE Transactions on Computer-Aided Design, Vol. 10, No. 11, pp. 1401-1412, Nov. 1991.
[3] Koide, T., Wakabayashi, S., and Yoshida, N., "An Integrated Approach to Pin Assignment and Global Routing for VLSI Building-Block Layout," European Conference on Design Automation with the European Event in ASIC Design, pp. 24-28, Feb. 1993.
[4] Wang, L. Y., Lai, Y. T., and Liu, B. D., "Simultaneous Pin Assignment and Global Wiring for Custom VLSI Design," IEEE International Symposium on Circuits and Systems, Vol. 4, pp. 2128-2131, 1991.
[5] Choi, S.-G., and Kyung, C.-M., "Three-step Pin Assignment Algorithm for Building Block Layout," Electronics Letters, Vol. 28, No. 20, pp. 1882-1884, Sep. 1992.
[6] Koren, N. L., "Pin Assignment in Automated Printed Circuit Board Design," 9th Design Automation Workshop, pp. 72-79, June 1972.
[7] Mory-Rauch, L., "Pin Assignment on a Printed Circuit Board," 15th Design Automation Conference, pp. 70-73, June 1978.
[8] Brady, H. N., "An Approach to Topological Pin Assignment," IEEE Transactions on Computer-Aided Design, Vol. 3, No. 3, pp. 250-255, July 1984.
[9] Yao, X., Yamada, M., and Liu, C. L., "A New Approach to the Pin Assignment Problem," 25th ACM/IEEE Design Automation Conference, pp. 566-572, June 1988.
[10] Hwang, F. K., Richards, D. S., and Winter, P., The Steiner Tree Problem, North-Holland, 1992.
CONSTRAINT RELAXATION IN GRAPH-BASED COMPACTION

Sai-keung Dong (1), Peichen Pan (2), Chi-Yuan Lo (3), C. L. Liu (4)

(1) Silicon Graphics, Inc., Mountain View, CA 94043 ([email protected])
(2) Department of Electrical and Computer Engineering, Clarkson University, Potsdam, NY 13699
(3) Lucent Technologies, Murray Hill, NJ 07974
(4) Department of Computer Science, University of Illinois at Urbana-Champaign, Urbana, IL 61801
ABSTRACT
Given a weighted, directed graph G with positive cycles, we study the problem of modifying the edge weights of G such that G (with new edge weights) has no positive cycles. The total change in edge weights and the length of the longest path from a "source" vertex to a "sink" vertex should be kept to a minimum. This problem arises in graph-based compaction, where a constraint graph with positive cycles means that the positions of some circuit elements cannot be decided because of the existence of over-constraints. To eliminate such over-constraints, previous approaches examine positive cycles in G one at a time and apply heuristics to modify some of the edge weights. Such a local approach produces suboptimal results and takes exponential time in the worst case. We show that the problem can be solved in polynomial time by linear programming. Moreover, we show that a special case of the problem has a linear program whose dual corresponds to that of the minimum cost flow problem and hence can be solved efficiently.

1. INTRODUCTION
A common approach to the solution of the symbolic layout compaction problem is to express the constraints on the positions of circuit elements in terms of a system of linear inequalities S, and then find a feasible solution for S that minimizes a certain linear objective function, e.g. the width of a cell [Boye88, Marp90, BaVa92, LeTa92, YaCD93]. S may contain various types of constraints: hierarchical, pitchmatching, separation and connectivity (design rules) and user-defined constraints. Consequently, we often find that S is an over-constrained system of linear inequalities, i.e. there is no feasible solution for S. In the presence of over-constraints, a compactor can either (i) identify some or all of the inequalities that are too restrictive and ask the circuit designer to manually modify the layout, or (ii) use heuristics to relax some of the inequalities, so as to remove the over-constraints in S incrementally. These two strategies, however, are "local" in nature since the over-constraints are not resolved all at the same time. Both strategies have been used by different compactors, for example: * in hierarchical compaction, where over-constraints occur because of interaction among cells at the same as well as different levels of the hierarchy, a graph-theoretic technique was proposed to identify some of the inequalities that cause over-constraints. These inequalities are then recorded in a database to provide feedback for the circuit designer
[BaVa93]. * in leaf cell compaction, the system of linear inequalities, S, is modeled by a weighted, directed graph Gs. It is well known that the system S has a solution if and only if the constraint graph Gs has no positive (directed) cycles [LiWo83]. In other words, over-constraints among circuit elements within the same cell appear in the form of positive cycles in Gs. In [LiWo83], for those edges with negative weights whose corresponding inequalities cannot be satisfied, all the positive cycles that contain these edges are exhaustively enumerated. Such information is presented to the circuit designer, who will decide the inequalities to be relaxed. In [King84], edges in a constraint graph are prioritized according to the kind of constraints they represent: user-defined constraints are less important than design rule constraints, which are less important than abutment constraints, etc. When a positive cycle is detected, the weights of those edges in the cycle with the lowest priority will be modified. In [Schi88], jog generation and edge weight modification are used to remove positive cycles; the latter selects the edge with the smallest weight in the cycle and decreases its weight by the weight of the cycle. In this paper, we focus on graph-based compaction. We consider the problem of removing over-constraints in a constraint graph. The "local" strategy of removing one positive cycle at a time which we mentioned earlier [King84, Schi88] has two drawbacks: (i) the removal of a positive cycle is carried out independently of the removal of other positive cycles. Since positive cycles might have edges in common, this might lead to "suboptimal" results and require computation that might otherwise be unnecessary. (ii) there could be an exponential number of positive cycles in a constraint graph. This limits the size of the problem that can be handled by a compactor using this strategy. A possible "global" strategy is to first find a subgraph of the constraint graph which has the most number of edges and contains no positive cycles, and then modify the weights of those edges not in the subgraph so as to remove the positive cycles. Unfortunately, finding such a largest subgraph is NP-complete (Section 3). We propose an alternative strategy and consider the Constraint Relaxation Problem: the problem of how to change the edge weights of a constraint graph minimally such that all positive cycles are removed and the length of the longest path is minimized. We show that this problem can be solved in polynomial time by the method of linear programming. In Section 2, we discuss two situations in layout compaction which motivate this work. In Section 3, we show that finding a largest subgraph that contains no positive cycles is NP-complete. In Section 4, we state the Constraint
Relaxation Problem and solve it by the method of linear programming. We give three linear programming formulations for this problem. The three formulations differ in the objectives they try to achieve. In particular, the dual linear program of the third formulation is that of a minimum cost flow problem and hence can be solved quite efficiently. Section 5 gives the experimental results and Section 6 is the conclusion.

2. MOTIVATION

One way to eliminate positive cycles in a constraint graph is to change the weights of some of its edges. We discuss two situations in layout compaction where a minimal change in edge weights is meaningful. We then give an example of a constraint graph in which different changes in the individual edge weights have different effects on the length of the longest path. This example demonstrates the difficulty in deciding how edge weights should be changed.

For submicron technology with high source to drain resistance, it is important that spacings between circuit elements be tightly controlled. For example, consider Figure 1, where parts of three transistors are shown: C1, C2, C3 are contact cuts and G1, G2, G3 are gates.

Figure 1.

Suppose compaction is carried out in the x-direction. We are interested in minimizing the x-dimension of the diffusion region. Let Gi.x and Cj.x be the unknown x-coordinates of gate Gi (i = 1, 2, 3) and contact Cj (j = 1, 3), respectively. It is desirable that the spacing between C1 and G1, G1 and G2, G2 and G3, and G3 and C3 be as small as possible. These requirements can be expressed as:

    G1.x - C1.x = a
    G2.x - G1.x = 2a + b
    G3.x - G2.x = c
    C3.x - G3.x = a

where a is the minimum spacing between a contact cut and polysilicon, b is the minimum width of a contact cut, and c is the minimum spacing between polysilicon and polysilicon, respectively. These four equations (or eight inequalities), together with other compaction constraints, might form an over-constrained system of inequalities. In this case, some of these inequality constraints need to be relaxed and be satisfied as strict inequality constraints. Since we want to minimize the "stretching" of the diffusion region, the problem becomes that of:

    min  2α + β + γ
         G1.x - C1.x = α
         G2.x - G1.x = β
         G3.x - G2.x = γ
         C3.x - G3.x = α

where α ≥ a, β ≥ 2a + b and γ ≥ c. A minimum adjustment in edge weights in the constraint graph will correspond to a minimum increase in the x-dimension of the diffusion region.

Cell compaction with abutment constraints is another situation in which a change in edge weights may be necessary for the removal of positive cycles. Figure 2 shows two cells A and B where pin Pi is to be connected to pin Qi, i = 1, 2, 3, by abutment.

Figure 2.

These requirements can be expressed as:

    P1.x - Q1.x = 0
    P2.x - Q2.x = 0
    P3.x - Q3.x = 0

where Pi.x (Qi.x) is the x-coordinate of pin Pi (Qi), i = 1, 2, 3. These abutment constraints together with the intra-cell constraints of cell A and cell B might create positive cycles in the constraint graph. In this case, it is necessary to relax some of the abutment constraints. For example, the circuit designer might decide to keep the abutment constraints between the pairs of pins (P1, Q1) and (P2, Q2) but connect P3 and Q3 by river routing instead of by abutment [LiCS93]. To minimize the length of the connecting wire, it is desirable to have |P3.x - Q3.x| be as small as possible. This corresponds to:

    min  α
         P1.x - Q1.x = 0
         P2.x - Q2.x = 0
         P3.x - Q3.x ≤ α
         Q3.x - P3.x ≤ α

A minimum adjustment in edge weights will minimize the length of the connecting wire used in river routing.

The two situations above show the need for an algorithm that can remove positive cycles from a constraint graph with as little change to the edge weights as possible. However, deciding the set of edges and by how much their weights should be changed is a non-trivial problem. Consider the constraint graph G shown in Figure 3, where the edge weights are shown next to the edges.

Figure 3. G

G has a positive cycle v2 -> v3 -> v4 -> v5 -> v2. To remove this cycle, the minimum amount of change in edge weights is 4, the weight of the cycle. In the meantime, we want to minimize the length of the longest path from v1 to v6. Of course, one could choose to change the edge weights by a large amount. In Figure 4, the total amount of change in edge weights is 7, and the length of the longest path from v1 to v6 is 0 (edges on the longest path are shown in bold arrows). Figure 5 gives four different ways to remove the positive cycle in G. They all introduce the same total amount of change in edge weights, namely 4, but with different effects on the length of
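As background, the feasibility test used throughout this discussion (the system has a solution if and only if the constraint graph has no positive cycles) can be checked with a standard Bellman-Ford pass: a positive cycle in the constraint graph is a negative cycle after negating the weights. This is textbook material, not part of the paper's contribution; the edge-list encoding below is an illustrative assumption.

def has_positive_cycle(vertices, edges):
    """edges: list of (u, v, w) meaning x_v - x_u >= w.
    Bellman-Ford on negated weights: a negative cycle there is a positive cycle here."""
    dist = {v: 0.0 for v in vertices}        # virtual source connected to every vertex
    for _ in range(len(vertices) - 1):
        for u, v, w in edges:
            if dist[u] - w < dist[v]:
                dist[v] = dist[u] - w
    # one more relaxation round: any improvement implies a negative (i.e. positive) cycle
    return any(dist[u] - w < dist[v] for u, v, w in edges)

# on the graph of Figure 3, the cycle v2 -> v3 -> v4 -> v5 -> v2 of weight 4 would be detected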
the longest path from v1 to v6, l(v1, v6), and on the lengths of the longest paths between other pairs of vertices.

Figure 4.

Figure 5. (a) l(v1, v6) = 3; (b) l(v1, v6) = 2; (c) l(v1, v6) = 1; (d) l(v1, v6) = 1.

3. NP-COMPLETENESS RESULTS

In this section, we show that given a constraint graph, the problem of finding a largest subgraph which has the most number of edges and contains no positive cycles is NP-complete. Thus, it is not likely that the approach of retaining the largest number of edges intact and modifying the weights of the remaining edges will lead to an efficient algorithm for the removal of over-constraints. The NP-completeness result can be extended to the problem of finding a largest consistent(1) subset of inequalities in a given system of inequalities. This shows that in the case where a system of general linear inequalities is used, e.g. hierarchical compaction, finding a largest subset of inequalities that are not over-constrained is a difficult problem.

Let G = (V, E, w) be a constraint graph where V = {v1, ..., vn} is the set of vertices, E is the set of (directed) edges and w : E -> R is the edge weight function, respectively. We shall adopt the convention that v1 and vn represent the left and the right cell boundary, respectively (assuming the direction of compaction is along the x-axis), and that wij denotes the weight of edge (vi, vj). We define the Largest Feasible Subgraph problem as follows:

LARGEST FEASIBLE SUBGRAPH
INSTANCE: Constraint graph G = (V, E, w), positive integer M ≤ |E|.
QUESTION: Is there a subset E' ⊆ E with |E'| ≥ M such that G'(V, E', w'), where w' is w restricted to E', has no positive cycles?

(1) A consistent set of linear inequalities is one in which the solution set it defines is non-empty.

We shall show that the Largest Feasible Subgraph problem is NP-complete by reduction from the Feedback Arc Set problem. The latter is defined as:

FEEDBACK ARC SET [GaJo79, p. 192]
INSTANCE: Directed graph H = (V, A), positive integer K ≤ |A|.
QUESTION: Is there a subset A' ⊆ A with |A'| ≤ K such that A' contains at least one edge from every directed cycle in H?

Theorem 1  LARGEST FEASIBLE SUBGRAPH is NP-complete.

Proof  By reduction from FEEDBACK ARC SET. Given an instance H = (V, A) and integer K of FEEDBACK ARC SET, consider the instance of LARGEST FEASIBLE SUBGRAPH G = (V, A, w) where wij = 1 for every edge (vi, vj) ∈ A and M = |A| - K. Then H has a subset of edges which has size at most K and contains at least one edge from every directed cycle in H if and only if G has a subgraph with at least M edges and which does not have any positive cycles.

Corollary 1.1  Given a system S of linear inequalities, the problem of finding a largest (in terms of the number of inequalities) consistent subset of S is NP-complete.

Proof  By reduction from LARGEST FEASIBLE SUBGRAPH.

4. PROBLEM STATEMENT AND SOLUTIONS

Since finding a largest subgraph that has no positive cycles is NP-complete, we propose an alternative approach for removing over-constraints in graph-based compaction. Instead of minimizing the number of edges whose weights will be changed, the alternative approach tries to minimize the total amount of change in edge weights used in the removal of positive cycles. We define the Constraint Relaxation Problem as follows: given a constraint graph G = (V, E, w) with positive cycles, modify the edge weights of G such that all positive cycles are removed, the change in edge weights is minimal and the length of the longest path from v1 to vn is minimized. We shall give three linear programming formulations for the Constraint Relaxation Problem. The first formulation requires the solution of a sequence of two linear programs (LPs) (Section 4.1). We then argue that some of the generality in the first formulation is not necessary and derive a formulation that can be solved by a single LP (Section 4.2). Finally, we consider a special case of the second formulation and show that its dual corresponds to that of a minimum cost flow problem and hence can be solved efficiently (Section 4.3).

4.1. First Formulation

For a given constraint graph G = (V, E, w), its corresponding LP is [PaDL93]:

    min  xn - x1
         xj - xi ≥ wij,   (vi, vj) ∈ E

where variable xi is the unknown x-coordinate of vertex vi ∈ V. In the presence of positive cycles, some of the edge weights in the constraint graph need to be modified. For each edge (vi, vj) of weight wij, a variable eij is introduced and the new edge weight will be wij + eij, where lij ≤ eij ≤ uij. lij and uij are constants given by the circuit designer to limit the amount of change in the weight of edge (vi, vj). For example, if uij ≤ 0, it means that the weight of edge (vi, vj) may decrease but cannot increase. If an edge weight wij should not be changed because it corresponds to some important compaction constraint, one can set lij = uij = 0. (In this case, a simpler way is not to create the variable eij for this particular edge at all.)
To determine Δ, the minimum amount of change in edge weights that is necessary to remove all positive cycles, consider the following mathematical program P1:

    P1:  Δ = min  Σ(vi,vj)∈E |eij|
              xj - xi - eij ≥ wij,   (vi, vj) ∈ E
              eij ≥ lij,             (vi, vj) ∈ E
              -eij ≥ -uij,           (vi, vj) ∈ E

If P1 has a solution, the mathematical program P2 below will determine the minimum longest path length from v1 to vn and how much the weight of each edge should be changed, subject to the condition that the total change in edge weights is Δ:

    P2:  min  xn - x1
              xj - xi - eij ≥ wij,   (vi, vj) ∈ E
              eij ≥ lij,             (vi, vj) ∈ E
              -eij ≥ -uij,           (vi, vj) ∈ E
              Σ(vi,vj)∈E |eij| = Δ

P1 and P2, in their current form, are not linear programs because of the absolute value operator. To overcome this limitation, we introduce two non-negative variables e+ij and e-ij for each variable eij. The intention is to replace every occurrence of eij by e+ij - e-ij and to replace |eij| by e+ij + e-ij. In particular, the new edge weight for edge (vi, vj) will be wij + e+ij - e-ij. For variable e+ij there will be a new lower bound λ+ij and a new upper bound μ+ij, and for variable e-ij a new lower bound λ-ij and a new upper bound μ-ij. The new lower and upper bounds are defined as:

    λ+ij = max{0, lij}       μ+ij = max{0, uij}
    λ-ij = max{0, -uij}      μ-ij = max{0, -lij}

The validity of these new lower and upper bounds can be seen by analyzing the following three cases (assuming lij ≤ uij):

Case 1: 0 ≤ lij.  Then lij = λ+ij ≤ e+ij ≤ μ+ij = uij and 0 = λ-ij ≤ e-ij ≤ μ-ij = 0, i.e. e-ij = 0.
Case 2: uij ≤ 0.  Then 0 = λ+ij ≤ e+ij ≤ μ+ij = 0 and -uij = λ-ij ≤ e-ij ≤ μ-ij = -lij, i.e. e+ij = 0.
Case 3: lij < 0 and 0 < uij.  Then 0 = λ+ij ≤ e+ij ≤ μ+ij = uij and 0 = λ-ij ≤ e-ij ≤ μ-ij = -lij.

P1 and P2 can now be formulated as two bona fide linear programs:

    P1':  Δ = min  Σ(vi,vj)∈E (e+ij + e-ij)
               xj - xi - e+ij + e-ij ≥ wij,          (vi, vj) ∈ E
               e+ij ≥ λ+ij,   -e+ij ≥ -μ+ij,         (vi, vj) ∈ E
               e-ij ≥ λ-ij,   -e-ij ≥ -μ-ij,         (vi, vj) ∈ E

    P2':  min  xn - x1
               xj - xi - e+ij + e-ij ≥ wij,          (vi, vj) ∈ E
               e+ij ≥ λ+ij,   -e+ij ≥ -μ+ij,         (vi, vj) ∈ E
               e-ij ≥ λ-ij,   -e-ij ≥ -μ-ij,         (vi, vj) ∈ E
               Σ(vi,vj)∈E (e+ij + e-ij) = Δ

By solving the linear programs P1' followed by P2', we can determine the least amount of change in edge weights that will remove all positive cycles from G and the resultant minimum longest path length from v1 to vn.

4.2. Second Formulation

In P1 and P2, uij, the upper bound on eij, can be some positive constant. Thus, the new edge weight for edge (vi, vj), wij + eij, can increase in value. However, this is not necessary, since increasing the weight of an edge can never help in removing positive cycles. It is the decrease in the weight of some positive-weight edges and/or some negative-weight edges that removes the positive cycles in a constraint graph. Thus, we can assume uij is non-positive. According to Case 2 of the analysis in Section 4.1, e+ij = 0. Hence, we can remove all occurrences of the variables e+ij and their corresponding lower bound and upper bound constraints from P1' and P2'. Furthermore, we consider a new objective function which is a weighted sum of the two objective functions of P1' and P2'. The result is the linear program P3 shown below (A and B are positive integers defined by the user):

    P3:  min  A · Σ(vi,vj)∈E e-ij  +  B · (xn - x1)
              xj - xi + e-ij ≥ wij,   (vi, vj) ∈ E
              e-ij ≥ -uij,            (vi, vj) ∈ E
              -e-ij ≥ lij,            (vi, vj) ∈ E

Intuitively, the more changes made to the individual edge weights, the shorter the longest path from v1 to vn. Thus, the objective function of P3 captures the tradeoff between the amount of change in edge weights and the length of the longest path from v1 to vn. By using this formulation, only one instead of two LPs needs to be solved.
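To see the formulation end to end, the sketch below hands P3 to an off-the-shelf LP solver (scipy's linprog). It assumes uij = 0 and no lower limit on the decrease, so each e-ij is simply non-negative; the variable packing and the tiny example graph are illustrative, not from the paper.

import numpy as np
from scipy.optimize import linprog

def solve_p3(n, edges, A=1.0, B=1.0):
    """n: number of vertices v1..vn; edges: list of (i, j, w) with 1-based endpoints.
    Variables: x_1..x_n followed by one decrease variable e_ij per edge."""
    m = len(edges)
    c = np.zeros(n + m)
    c[0], c[n - 1] = -B, B                 # B * (x_n - x_1)
    c[n:] = A                              # A * sum of e_ij
    # constraint x_j - x_i + e_ij >= w_ij  ->  -(x_j - x_i + e_ij) <= -w_ij
    A_ub = np.zeros((m, n + m))
    b_ub = np.zeros(m)
    for k, (i, j, w) in enumerate(edges):
        A_ub[k, j - 1] = -1.0
        A_ub[k, i - 1] = 1.0
        A_ub[k, n + k] = -1.0
        b_ub[k] = -w
    bounds = [(None, None)] * n + [(0.0, None)] * m   # x free, e_ij >= 0
    return linprog(c, A_ub=A_ub, b_ub=b_ub, bounds=bounds, method="highs")

# tiny example: a 3-vertex graph whose cycle v1 -> v2 -> v1 has positive weight 3
res = solve_p3(3, [(1, 2, 2), (2, 1, 1), (2, 3, 1)], A=10.0, B=1.0)
print(res.status, res.x)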
4.3. Third Formulation

In this section, we show that a special case of P3 has a dual LP that corresponds to that of a minimum cost flow problem. The minimum cost flow problem [Tarj83] is a network flow problem in which there is a cost cij associated with an edge (vi, vj) in the network, such that a unit of flow along that edge has a cost of cij. The objective is to obtain a flow of a pre-specified quantity from some supply vertices to some demand vertices that has the minimum total cost. The minimum cost flow problem can be solved in low order polynomial time and it has been used in the solution of other problems. For example, the problem of determining the minimum number of delay elements inserted into a pipelined system for achieving synchronization has been formulated as a linear program whose dual corresponds to that of a minimum cost flow problem [WoDF89, HuHB91, BoHS92].

We consider the case when the constraints -e-ij ≥ lij, (vi, vj) ∈ E, are absent from P3. In other words, the linear program looks like the following (the dual variables yij and zij used in D4 are shown next to the inequalities to which they correspond):

    P4:  min  A · Σ(vi,vj)∈E e-ij  +  B · (xn - x1)
              xj - xi + e-ij ≥ wij,   (vi, vj) ∈ E      [yij]
              e-ij ≥ -uij,            (vi, vj) ∈ E      [zij]

From P4, the new weight of an edge satisfies wij - e-ij ≤ wij + uij, i.e. the new edge weight wij - e-ij has an upper bound but not a lower bound. The difference between P3 and P4 is that the latter does not have control on how small the weight of an edge can become. Consider D4, the dual LP [PaSt82] of P4:

    D4:  max  Σ(vi,vj)∈E ( wij · yij + (-uij) · zij )
              Σ(vp,vi)∈E ypi - Σ(vi,vq)∈E yiq = 0      for xi, i ≠ 1, n
              Σ(v1,vq)∈E y1q - Σ(vp,v1)∈E yp1 = B      for x1
              Σ(vp,vn)∈E ypn - Σ(vn,vq)∈E ynq = B      for xn
              yij + zij = A                             for e-ij
              yij ≥ 0
              zij ≥ 0

In D4, the variables yij and zij are related by the equations yij + zij = A. By substituting zij = A - yij into the objective function of D4, and replacing the equation yij + zij = A by the inequality yij ≤ A (since zij is non-negative), we can eliminate, for each (vi, vj) ∈ E, the variable zij from D4. The resultant LP is D4' shown below:

    D4': max  Σ(vi,vj)∈E ( (wij + uij) · yij - A · uij )
              Σ(vp,vi)∈E ypi - Σ(vi,vq)∈E yiq = 0      for xi, i ≠ 1, n
              Σ(v1,vq)∈E y1q - Σ(vp,v1)∈E yp1 = B      for x1
              Σ(vp,vn)∈E ypn - Σ(vn,vq)∈E ynq = B      for xn
              -yij ≥ -A                                 for e-ij
              yij ≥ 0

In D4', the second term in the objective function, -Σ A · uij, can be dropped since it is a constant. Since maximizing a linear function f is the same as minimizing -f, D4' is a minimum cost flow problem where the cost of each edge is -(wij + uij) and the flow on each edge is the unknown yij. The first equation in D4' is the flow conservation constraint for vertices v2, ..., v(n-1); the second and the third equations specify that the outgoing flow at v1 is B and the incoming flow at vn is B, respectively; and the fourth and the fifth inequalities specify that the flow on each edge must be no more than A and non-negative, respectively.
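The flow structure of D4' can be reproduced with any minimum cost flow routine; the sketch below builds the network with networkx. The graph encoding is an assumption made for illustration, and recovering the primal edge changes from the optimal flow is omitted.

import networkx as nx

def solve_d4_prime(vertices, edges, A, B):
    """edges: list of (i, j, w_ij, u_ij). Builds the min cost flow network of D4':
    send B units from v1 to vn, edge capacity A, edge cost -(w_ij + u_ij).
    networkx's network_simplex expects integer costs/capacities, so scale if needed."""
    g = nx.DiGraph()
    g.add_nodes_from(vertices)
    for i, j, w, u in edges:
        g.add_edge(i, j, capacity=A, weight=-(w + u))
    demand = {v: 0 for v in vertices}
    demand[vertices[0]] = -B          # supply B at v1
    demand[vertices[-1]] = B          # demand B at vn
    nx.set_node_attributes(g, demand, "demand")
    cost, flow = nx.network_simplex(g)
    return cost, flow                 # flow[i][j] is the dual value y_ij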
5. EXPERIMENTAL RESULTS

We first give an example similar to the one shown in Figure 1 and its solution produced by our first formulation. Consider the following constraint graph G (Figure 6), which contains more than one positive cycle. Suppose we only allow the weights of edges (C1, G1), (G3, C3), (G1, C1), (G2, G1), (G3, G1) and (C3, G3) to be changed, and we require that the distance between contact C1 and gate G1 is the same as that between gate G3 and contact C3. By introducing the e's required by our first formulation, we obtain a new constraint graph G' (Figure 7).

Figure 6. G

Figure 7. G'

The corresponding LPs are:

    Q1:  min  2ε1 + ε2 + ε3
              G1.x - C1.x - ε1 ≥ 0.2
              C1.x - G1.x + ε1 ≥ -0.2
              G2.x - G1.x ≥ 0.5
              G3.x - G2.x ≥ 0.3
              G2.x - C1.x ≥ 0.9
              C3.x - G3.x - ε1 ≥ 0.2
              G3.x - C3.x + ε1 ≥ -0.2
              G1.x - G2.x + ε2 ≥ -0.6
              G1.x - G3.x + ε3 ≥ -0.5
              ε1, ε2, ε3 ≥ 0
    Q2:  min  C3.x - C1.x
              G1.x - C1.x - ε1 ≥ 0.2
              C1.x - G1.x + ε1 ≥ -0.2
              G2.x - G1.x ≥ 0.5
              G3.x - G2.x ≥ 0.3
              G2.x - C1.x ≥ 0.9
              C3.x - G3.x - ε1 ≥ 0.2
              G3.x - C3.x + ε1 ≥ -0.2
              G1.x - G2.x + ε2 ≥ -0.6
              G1.x - G3.x + ε3 ≥ -0.5
              2ε1 + ε2 + ε3 = Δ
              ε1, ε2, ε3 ≥ 0

Solving the linear program Q1, we obtain an optimal solution with ε1 = 0.1, ε2 = 0 and ε3 = 0.4, which minimizes Δ (= 0.6). In this case, the length of the longest path from C1 to C3 is 1.5. If we set Δ = 0.6 and solve the linear program Q2, we obtain ε1 = 0, ε2 = 0.1 and ε3 = 0.5. In
this solution, the total amount of change in edge weights remains unchanged, namely 0.6. However, the length of the longest path from C1 to C3 becomes 1.4, which is the shortest possible given that Δ = 0.6.

To demonstrate the quality of the solutions produced by our linear programming formulations, we apply the second formulation to four test examples. Each example is a constraint graph with some positive cycles. The number of vertices and the number of edges in each constraint graph are listed in Table 1. We compare our approach with a heuristic. The heuristic examines each positive cycle in a constraint graph one by one and decreases the smallest edge weight in the cycle by the weight of the cycle. In Table 2, ΔH and ΔLP are the total amount of change in edge weights introduced by the heuristic and by LP, respectively; lH and lLP are the lengths of the longest path from v1 to vn achieved by the heuristic and by LP, respectively. From the four test examples, we see that there is more than 40% difference in the amount of change in edge weights and more than 30% difference in the length of the longest path from v1 to vn between the two approaches. This shows that our linear programming formulations indeed produce very good results.

Example   |V|   |E|
   1        6     7
   2       10    12
   3       20    24
   4       32    45
Table 1

            Heuristic          LP
Example    ΔH     lH        ΔLP    lLP     (ΔLP - ΔH)/ΔH     (lLP - lH)/lH
   1         7      6          4      4          -43%              -33%
   2        18     10          8      6          -56%              -40%
   3        30     26         16     16          -47%              -38%
   4        84     24         30     13          -64%              -46%
Table 2
6. CONCLUSION

We study the problem of removing positive cycles in a constraint graph by modification of edge weights. Previous attempts to solve this problem examine the positive cycles one at a time and use heuristics to determine which edge weights to modify. We show that the problem can be solved in polynomial time by the method of linear programming. We give three linear program formulations for the problem. In particular, the third formulation has a dual linear program that is a minimum cost flow problem and hence can be solved efficiently.

REFERENCES

[BaVa92] C. S. Bamji and R. Varadarajan, "Hierarchical Pitchmatching Compaction Using Minimum Design," Proceedings of 29th Design Automation Conference, pp. 311-317, June 1992.
[BaVa93] C. S. Bamji and R. Varadarajan, "MSTC: A Method for Identifying Overconstraints during Hierarchical Compaction," Proceedings of 30th Design Automation Conference, pp. 389-394, June 1993.
[BoHS92] E. Boros, P. L. Hammer and R. Shamir, "A Polynomial Algorithm for Balancing Acyclic Data Flow Graphs," IEEE Transactions on Computers, Vol. 41, No. 11, pp. 1380-1385, Nov. 1992.
[Boye88] D. G. Boyer, "Symbolic Layout Compaction Review," Proceedings of 25th Design Automation Conference, pp. 383-389, June 1988.
[GaJo79] M. R. Garey and D. S. Johnson, Computers and Intractability: A Guide to the Theory of NP-Completeness, Freeman and Company, New York, 1979.
[HuHB91] X. Hu, R. G. Harber and S. C. Bass, "Minimizing the Number of Delay Buffers in the Synchronization of Pipelined Systems," Proceedings of 28th Design Automation Conference, pp. 758-763, June 1991.
[King84] C. Kingsley, "A Hierarchical, Error-Tolerant Compactor," Proceedings of 21st Design Automation Conference, pp. 126-132, June 1984.
[LeTa92] J. F. Lee and D. T. Tang, "HIMALAYAS - A Hierarchical Compaction System with a Minimized Constraint Set," Digest of Technical Papers, International Conference on Computer-Aided Design, pp. 150-157, Nov. 1992.
[LiCS93] A. Lim, S. W. Cheng and S. Sahni, "Optimal Joining of Compacted Cells," IEEE Transactions on Computers, Vol. 42, No. 5, pp. 597-607, May 1993.
[LiWo83] Y. Z. Liao and C. K. Wong, "An Algorithm to Compact a VLSI Symbolic Layout with Mixed Constraints," IEEE Transactions on CAD, Vol. CAD-2, No. 2, pp. 62-69, April 1983.
[Marp90] D. Marple, "A Hierarchy Preserving Hierarchical Compactor," Proceedings of 27th Design Automation Conference, pp. 375-381, June 1990.
[PaDL93] P. Pan, S. K. Dong and C. L. Liu, "Optimal Graph Constraint Reduction for Symbolic Layout Compaction," Proceedings of 30th Design Automation Conference, pp. 401-406, June 1993.
[PaSt82] C. H. Papadimitriou and K. Steiglitz, Combinatorial Optimization: Algorithms and Complexity, Prentice-Hall, N.J., 1982.
[Schi88] W. L. Schiele, "Compaction with Incremental Over-Constraint Resolution," Proceedings of 25th Design Automation Conference, pp. 390-395, June 1988.
[Tarj83] R. E. Tarjan, Data Structures and Network Algorithms, Society for Industrial and Applied Mathematics, Philadelphia, Pennsylvania, 1983.
[WoDF89] D. Wong, G. De Micheli and M. Flynn, "Inserting Active Delay Elements to Achieve Wave Pipelining," Digest of Technical Papers, International Conference on Computer-Aided Design, pp. 270-273, Nov. 1989.
[YaCD93] S. Z. Yao, C. K. Cheng, D. Dutt, S. Nahar and C. Y. Lo, "Cell-based Hierarchical Pitchmatching Compaction Using Minimal LP," Proceedings of 30th Design Automation Conference, pp. 395-400, June 1993.
An O(n) Algorithm for Transistor Stacking with Performance Constraints
Bulent Basaran and Rob A. Rutenbar
Department of Electrical and Computer Engineering
Carnegie Mellon University, Pittsburgh, PA 15213, USA
{basaran,rutenbar}@ece.cmu.edu

Abstract
We describe a new constraint-driven stacking algorithm for diffusion area minimization of CMOS circuits. It employs an Eulerian trail finding algorithm that can satisfy analog-specific performance constraints. Our technique is superior to other published approaches both in terms of its time complexity and in the optimality of the stacks it produces. For a circuit with n transistors, the time complexity is O(n). All performance constraints are satisfied and, for a certain class of circuits, optimum stacking is guaranteed.

1 Introduction
In the layout of custom CMOS cells, stacking is defined as merging the diffusion regions of two or more transistors that have a common node; e.g., series-connected transistors have one node in common, which can share a diffusion and save area. Since stacking has a dramatic impact on the total diffusion area and therefore on chip yield, there has been an extensive amount of research on optimizing leaf-cell layout through stacking. The original work of Uehara and van Cleemput [1] first posed the problem and offered a heuristic solution for digital circuits. For this important two-row P-over-N layout style, polynomial time algorithms were later discovered to arrange series-parallel dual CMOS ([2] is a good survey here). When more general aspects of the layout are to be optimized, e.g., wiring as well as stacking, a variety of combinatorial search algorithms have been used with success, e.g., [3]. In the analog domain, stacking is critical not only for area, but also for circuit performance due to parasitic diffusion capacitances. Unfortunately, the wider range of device sizes, and requirements for device matching and symmetry, render the simpler row-based digital layout styles inadequate for analog. To address this, Cohn et al. introduced a free-form 2-D stacking strategy integrated with device placement [4]. Charbon et al. later introduced a technique to satisfy performance constraints through constraint-driven stacking during placement [5]. Both tools can generate high-quality layouts; however, neither can guarantee a minimum diffusion area. More recently, [6] introduced a new stacking style and a novel technique to generate optimum stacks that satisfy performance constraints, using a path partitioning algorithm. However, because it attempts to enumerate all optimal stacks, runtime can be extremely sensitive to the size of the problem. Symmetry and matching constraints can greatly prune the search, but the basic algorithm has exponential time complexity [6]. In this paper, we present a new algorithm to perform stack generation in linear time. For a large class of circuits, our algorithm is optimum with respect to total diffusion area and a cost function modeling circuit performance. The cost function ensures that performance constraints are, if possible, met. Device matching is also guaranteed through symmetry and proximity constraints. The paper is organized as follows. Section 2 describes the basic stacking strategy. Section 3 explains how the circuit performance is modeled. In Section 4, the new stack generation algorithm is presented. Some results on industry-quality circuits are given in Section 4. Finally, Section 5 offers some concluding remarks.

2 Basic stacking strategy
A stacking methodology is needed to model the circuit schematic in a format appropriate for a graph algorithm to solve the layout problem effectively. Our strategy is similar to that introduced in [6] and earlier in [4] in more general terms:
1. Divide the circuit into partitions with respect to device type and bias node (body node in MOS transistors).
2. Perform device folding: split large transistors into smaller parallel transistors. These are called "fingers" by designers; we refer to them more generally as modules, as they are the component pieces of our solution.
3. Perform further partitioning to reduce the variation of the module widths in a partition.
4. Generate stacks that implement each partition.
In analog CMOS circuits, as in digital standard-library leaf cells, only transistors of the same type (e.g., NMOS) which share a common well can be stacked (i.e., their common diffusion nodes can be merged in the layout to minimize diffusion area). In addition, in analog circuits it is fairly common to have transistors of the same type which require distinct body potentials, for example, to optimize noise performance. Such transistors have their own isolated wells and cannot be stacked with other transistors of the same type. Therefore, in the first step we put such transistors in different partitions. We also allow the designer to specify explicitly that two or more transistors must be in the same stack. In the second step, large transistors are folded into fingers to minimize the diffusion capacitances as well as to balance the aspect ratio of the resulting module. This can either be done automatically [6] or manually by the designer. It is important to note that, in this stacking strategy, transistor folding is done a priori. The stack generation algorithm is given fixed-width modules as input - it does not dynamically fold transistors. This is in contrast to tools such as KOAN [4], in which the overall optimization loop treats stacking, folding and placement simultaneously. Of course, such a separation of design tasks is sub-optimal. One of our main motivations in this paper is to devise a stacking strategy that is fast enough to be used in the inner loop of a placement tool like KOAN. In the third step, the partitions are examined again to account for variations in module widths. If it is requested, modules with widths significantly larger, or smaller, than others in a partition can be put in a separate partition. This will result in a better utilization of space, but it will have suboptimal diffusion sharing. If such a partitioning is not acceptable for performance reasons, this step may be skipped. In the fourth step, the stack generation algorithm (Section 4) operates on each circuit partition separately. We note that a pair of phases before and after the stacking algorithm may handle special
patterns required by some analog circuits: module interleaving (i.e., common-centroid or inter-digitated device pairs); devices with ratio constraints to obtain precise current ratios (e.g., current mirrors) [6]; multi-fingered devices with proximity constraints [7].
3 Modeling performance constraints

During stack generation, it is required that certain performance specifications are considered and, if possible, met. The input to the stack generation algorithm (Section 4) is a cost function based on criticality weights on circuit nodes and symmetry constraints on the devices. In this section, we briefly review how these parameters are obtained from performance specifications. Our approach follows [8] and [9]. The process of translating high-level circuit performance specifications into bounds on low-level layout parameters is called constraint generation. This process is traditionally done manually by circuit designers. Recently, techniques have been proposed to automate this process using sensitivity analysis [8]. Constraint generation starts with a small-signal sensitivity analysis of the performance functions at the nominal operating point. Performance constraints are defined as maximum allowed variations of the performance functions around the nominal operating point. These constraints can be mapped to parasitic capacitance constraints on certain nodes and matching constraints on devices. The parasitic capacitance constraints, together with bounds on estimated parasitic capacitances, can further be translated into criticality weights, denoted w, on nodes. The tighter the constraints, and the closer the minimum allowed performance to the estimated nominal value, the higher the weights. A cost function evaluating a stacking solution is introduced in [6] that minimizes the parasitic capacitance of critical nodes. We will use the same cost function to guide our stack generation algorithm. It is shown in Eq. (1) for the sake of completeness.

Cost(stacking) = Σ_diff w(diff) · k(diff)        (1)
Here, the summation is carried over all diffusion regions in the stacks. w(diff) denotes the criticality weight on the node that corresponds to diff. k(diff) is 1 if diff is a merged diffusion in the stacking. Otherwise it is given by Cext/Cint, where Cext and Cint (Cext > Cint) are the capacitances of an unmerged (external) and a merged (internal) diffusion, respectively. Note that when w is 1, the cost function minimizes only the total diffusion area. w is an effective way of prioritizing critical nodes during stacking. Matching constraints are translated into symmetry constraints on devices and wiring and also into device proximity constraints. In order to match devices, our stack generation algorithm employs symmetry constraints on the devices of the circuit. The stacks obtained with a stack generation algorithm should be symmetric around a symmetry axis with respect to the twin transistors in them (Fig. 1) [14]. Further matching can be enforced earlier, in the partitioning step of our stacking strategy as in [6], as well as later during placement and routing [4][15]. The next section describes in detail how the cost function in Eq. (1) is optimized and how the symmetry constraints are satisfied in the stack generation algorithm.
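For concreteness, Eq. (1) can be evaluated directly from a proposed stacking as in the sketch below; the per-diffusion tuple format and a single shared Cext/Cint ratio are assumptions made only for illustration.

```python
def stacking_cost(diffusions):
    """Evaluate Eq. (1): the sum of w(diff) * k(diff) over all diffusion regions.

    Each entry is (w, merged, cext_over_cint); k(diff) is 1 for a merged
    (internal) diffusion and Cext/Cint (> 1) for an unmerged (external) one.
    """
    cost = 0.0
    for w, merged, cext_over_cint in diffusions:
        k = 1.0 if merged else cext_over_cint
        cost += w * k
    return cost

# Two merged diffusions on a critical node (w = 3) and one unmerged,
# non-critical diffusion (w = 1) with Cext/Cint = 2.5:
print(stacking_cost([(3.0, True, 2.5), (3.0, True, 2.5), (1.0, False, 2.5)]))
```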
Fig. 1. Two symmetric transistor pairs (a) and their layout with symmetric stacks. Stacks in (b) and (c) are mirror symmetric and perfect symmetric, respectively.
4 Stack generation algorithm

As introduced in [1], finding an Eulerian trail in a diffusion graph is equivalent to minimizing the diffusion area of series-parallel static CMOS circuits. Later, [10] presented a simple linear-time Eulerian trail finding algorithm for dynamic CMOS circuits consisting of only one type of network (e.g., an nFET logic network). In our algorithm, we use a similar algorithm for finding an Eulerian trail. The main contributions of this algorithm are twofold: 1. Performance: we optimize a cost function that considers not only area but also circuit performance; this was previously achieved in exponential time [6]. 2. Generality: without any symmetry constraints, the algorithm is optimum. With symmetry constraints, it is still optimum for a large class of circuits. Given a circuit partition, our algorithm first generates a modified diffusion graph, G, that represents the circuit partition. G incorporates the performance constraints in the form of criticality weights as defined in Section 3, as well as the symmetry constraints among transistors. Next, a trail cover on G is found that satisfies the symmetry constraints in the circuit. In the final step, each trail in the trail cover is converted to a transistor stack for layout. The outline of our algorithm is given in Fig. 2.

procedure stack(circuit-partition ckt)
1  generate the modified diffusion graph, G, from ckt
2  trail-cover = sym-trail-cover(G)
3  convert trail-cover into transistor stacks
4  return(transistor stacks)

Fig. 2.
The stack generation algorithm.
Next we describe the modified diffusion graph and the symmetric trail cover finding step in detail and give an analysis of the algorithm.
A  The modified diffusion graph, G
Let ckt be the circuit partition for which we wish to generate the transistor stacks. ckt can be represented with an undirected graph G' (possibly with parallel edges) called the simple diffusion graph. Each vertex in G' corresponds to a diffusion node in the circuit, and each edge in G' corresponds to a transistor (Fig. 3). Let v be a vertex in G'; v is labeled with w(v) and s(v). w(v) denotes the criticality weight on the node that corresponds to vertex v. s(v) denotes the symmetric twin of v. Let e be an edge in G'; e is labeled with s(e) = e', where (e, e') is a symmetric edge pair; s(e) = e' implies s(e') = e. Note that a diffusion graph with symmetry constraints
must be fully symmetric: all the edges must have symmetric twins. Otherwise, the circuit must be partitioned further so that each partition is fully symmetric (Fig. 4 (a)). A vertex is called self-symmetric if s(v) = v. A self-symmetric vertex is on the symmetry axis which cuts the graph into two halves (vertex v7 in Fig. 4 (a)). A pair of symmetric edges is called cross-symmetric if they cross the symmetry axis.
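A minimal sketch of building the simple diffusion graph G' from a transistor list is shown below; the (name, drain, source) netlist format and the default weight of 1 are assumptions for illustration, and symmetry labels are omitted.

```python
def simple_diffusion_graph(transistors, weights=None):
    """Build G': one vertex per diffusion node, one edge per transistor.

    transistors: iterable of (device_name, drain_node, source_node) tuples.
    weights: optional dict of criticality weights w(v); missing nodes get 1.
    Returns (vertices, edges); parallel edges are kept, as in a multigraph.
    """
    weights = weights or {}
    vertices = {}
    edges = []
    for name, drain, source in transistors:
        for node in (drain, source):
            vertices.setdefault(node, weights.get(node, 1.0))
        edges.append((name, drain, source))
    return vertices, edges

# Tiny example: two devices sharing diffusion node "n2".
print(simple_diffusion_graph([("M1", "n1", "n2"), ("M2", "n2", "n3")]))
```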
Fig. 3. A circuit partition and its simple diffusion graph. Each node in the circuit, n, is mapped to a vertex v in the graph; each device, d, is mapped to an edge e.

First we introduce some terminology from graph theory that will be used in the following sections. A trail on a graph is a set of edges (v0, e0, v1, e1, v2, ..., vk-1, ek-1, vk), where ei = (vi, vi+1) is an edge in the graph and ei ≠ ej for all i ≠ j [12]. We may use the shorthand (v0, v1, v2, ..., vk) or (e0, e1, e2, ..., ek-1) to denote a trail. Note that an edge in a trail cannot appear more than once, but a vertex can appear at more than one position. Each such position is called a terminal of the vertex. v0 and vk are called the end terminals of the trail. The trail is a closed trail if vk = v0.

Fig. 4. (a) A simple diffusion graph with symmetry constraints. Note that pairs of edges with symmetry constraints are drawn symmetrically around the vertical symmetry axis. (b) The modified diffusion graph obtained from (a). Gray lines are the super-edges.

A set of trails, T = {ti}, on G is called a cover for G if, for every edge e in G, there is a trail ti such that e ∈ ti and e ∉ tj for all j ≠ i. T is called a minimum trail cover if the number of trails in T, or the cardinality of T, |T|, is the smallest among all possible sets of trails. For example, for the graph of Fig. 4 (a), the two trails (v1, e1, v3, e5, v7) and (v5, e3, v3), together with their symmetric twins (v2, e2, v4, e6, v7) and (v6, e4, v4), cover the whole graph. Let T1 denote the set of these trails. Note that |T1| = 4. Joining the first and the third trails at v7, their common end terminal, we can reduce the cardinality of T1 to 3, which is the minimum for this graph. A closed trail is an Eulerian trail if it touches all the edges in the graph. A graph is called Eulerian if there exists a closed Eulerian trail on it. The degree of a vertex v, denoted d(v), is the number of edges adjacent to it. It is well known in graph theory that a graph is Eulerian if and only if it is connected and all vertices in the graph have even degree [13]. Obviously, in an Eulerian graph we can always find a trail cover of cardinality 1, since there is an Eulerian trail on it. It is also easy to see that in a graph that has n_odd vertices with odd degree, the minimum trail cover has a cardinality of n_odd/2 (it is known that n_odd is always even). Note that in general the simple diffusion graph G' is not Eulerian. Let n_odd denote the number of vertices with odd degree in G'. If n_odd > 0, we add a vertex, called a super-vertex, vs, to G' and we make it Eulerian by adding a new edge (vs, vi), called a super-edge, for each odd-degree vertex vi. We set w(vs) to 0, since its criticality, by definition, is zero. The graph obtained from the simple diffusion graph, G', by the addition of (1) the super-vertex and (2) the super-edges is called the modified diffusion graph and is denoted as G (Fig. 4 (b)).

B  Finding a symmetric trail cover

If there are no symmetry constraints, we can find an Eulerian trail, te, in G using a recursive Eulerian trail finding algorithm [13]. Let te be (vs, v1, v2, ..., vk, vs). If we delete the super-edges in te, we obtain a set of trails, Te, that has a cardinality of n_odd/2. Therefore, Te is a minimum trail cover for G', the simple diffusion graph. However, when there are symmetry constraints, an arbitrary Eulerian trail, in general, does not yield a feasible solution. Here, we propose an algorithm which can be used to find a minimum trail cover in the presence of symmetry constraints. Our symmetric trail cover algorithm employs the same recursive algorithm for finding an Eulerian trail, with modifications to handle perfect and mirror symmetry constraints. The outline of the algorithm is given in Fig. 5. The algorithm sym-trail-cover() starts by selecting the vertex, v0, with the lowest criticality weight. Next it finds a set of trails, cover-left, in Line 2 with the call to the recursive procedure euler() (Fig. 6). Here we note that the first trail euler() generates, first-trail, has v0 at its end terminal; more on this in Section C. The trail cover, cover-left, including first-trail, covers only half of the edges in the modified diffusion graph, since at each iteration of Line 10 in euler() we not only delete the edge that is inserted in the trail but also its symmetric twin.

procedure sym-trail-cover(G)
1   pick v0 s.t. w(v0) <= w(vi) for all i != 0
2   first-trail = euler(v0)   // inserts open trails in cover-left
3   insert first-trail in cover-left
4   remove the super-edges at the end terminals
5   join-trails(cover-left)
6   if there are symmetry constraints
7     foreach trail tr in cover-left
8       construct the symmetric trail tr'
9       if tr and tr' have a common end terminal
10        join tr and tr' at the common end terminal
11        insert the result into cover-all
12      else
13        insert tr and tr' into cover-all
14  decompose all the trails by deleting the super-edges
15  return(cover-all)

Fig. 5. Finding a symmetric trail cover.
In Line 5 of sym-trail-cover(), the procedure join-trails() concatenates the open trails in cover-left at their end terminals if possible (Fig. 7). This step is required due to the existence of cross-symmetric edges in the modified diffusion graph.
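The pairing idea behind join-trails() can be sketched as follows (the paper's own pseudocode appears in Fig. 7 below); representing a trail as a plain vertex list and the greedy pairing order are simplifications assumed here.

```python
def join_trails(trails):
    """Greedily concatenate open trails that share an end terminal.

    Each trail is a list of vertices; its end terminals are trail[0] and
    trail[-1].  This is a simplified pairing pass, not the exact procedure.
    """
    trails = [list(t) for t in trails]
    changed = True
    while changed:
        changed = False
        for i in range(len(trails)):
            for j in range(i + 1, len(trails)):
                a, b = trails[i], trails[j]
                joined = None
                if a[-1] == b[0]:
                    joined = a + b[1:]
                elif b[-1] == a[0]:
                    joined = b + a[1:]
                elif a[-1] == b[-1]:
                    joined = a + b[-2::-1]
                elif a[0] == b[0]:
                    joined = a[::-1] + b[1:]
                if joined is not None:
                    trails[i] = joined
                    del trails[j]
                    changed = True
                    break
            if changed:
                break
    return trails

# (v1,v3,v7) and (v7,v4,v6) share end terminal v7 and become one trail.
print(join_trails([["v1", "v3", "v7"], ["v7", "v4", "v6"], ["v5", "v3"]]))
```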
recursive procedure euler(vertex vin)
1   if d(vin) = 0                    // no edges
2     return vin                     // trivial trail
3   // starting from vin create a random trail tr:
4   vtemp = vin
5   do
6     if d(vtemp) = 0
7       break                        // open trail
8     insert vtemp into tr
9     pick an edge on vtemp, e = (vtemp, vneigh)
10    delete e and s(e), if it exists, from G
11    vtemp = vneigh
12  while vtemp != vin               // iterate until a closed trail is found
13  let tr = (vin, v1, v2, ..., vk)
14  find tr2 = euler(vin), euler(v1), euler(v2), ..., euler(vk)
15  if vtemp = vin                   // closed trail
16    return concatenation of tr2 and vin: (tr2, vin)
17  else                             // open trail
18    insert tr2 into tcover
19    return vin

Fig. 6. Finding an Eulerian trail with symmetry constraints.

procedure join-trails(cover-left)
1   if there is only one trail in the list
2     return                         // no pairs to join
3   foreach trail tr = (v1, ..., vk) in cover-left
4     let tr = (a, ..., b)           // a and b are end terminals
5     insert tr in list(a) and list(b)
6   foreach end terminal x
7     join trails in list(x) pair-wise at x
8     update the affected lists
9   return(cover-left)

Fig. 7. Joining open trails.

Next, in Line 7-Line 13, symmetric twins of the trails in cover-left are constructed. This is possible since, as a trail in cover-left, tr, was being generated in euler(), the edges required to construct its symmetric twin, tr', were preserved by deleting them from the graph. This process can also be viewed as simultaneously generating two trails that traverse the two halves of the graph in a synchronous and symmetrical way. Line 8 can construct either a mirror symmetric trail or a perfectly symmetric trail. In Line 9-Line 10 the trail tr and its symmetric twin tr' are joined if they have a common end terminal and if the operation does not violate a perfect symmetry constraint. Fig. 8 shows an example. As a consequence of deleting both of the edges in a symmetric pair, euler() may encounter a vertex of zero degree while it is trying to find a closed trail in the do-while loop, Line 5-Line 12. When such a vertex is reached, euler() detects that the current trail has to be an open trail. For an open trail, euler() first recurses on the vertices of the open trail, as is the case with closed trails, but when the recursion terminates, it inserts the open trail in the trail cover cover-left and returns the initial vertex as a trivial trail to the previous recursion level (for more details and some examples see [18]). Note that in an Eulerian graph without symmetry constraints there is always a closed trail; no open trails are detected and euler() returns an Eulerian trail.

Fig. 8. (a) Perfect and (b) mirror symmetric trail covers for the graph of Fig. 4 (a), [(v5,v3,v7), (v6,v4,v7), (v1,v3), (v2,v4)] and [(v5,v3,v7,v4,v6), (v1,v3), (v4,v2)] respectively, and the corresponding stacks.

C  Analysis of the algorithm

Time-complexity: The do-while loop in Line 5-Line 12 of euler() encounters each edge of the graph at most once; therefore it has complexity O(n), where n denotes the number of edges. The two foreach loops in join-trails() operate on each trail only for a constant number of steps. Hence the complexity is O(m), where m denotes the number of trails. But since m = O(n), the complexity of join-trails() is O(n). It follows that the overall complexity of the algorithm is O(n).

Optimality: If there are no symmetry constraints, it is easy to see that the algorithm minimizes the cost function defined in Eq. (1): euler() returns an Eulerian trail which is later decomposed by deleting the super-edges (if any). Let us assume that the trail cover has k trails after the decomposition; k = max{1, n_odd/2}. Also note that every vertex v in G must have at least ⌈d(v)/2⌉ terminals in a trail cover T. First assume n_odd > 0. If d(v) is odd, then v has d(v)/2 + 1 terminals in the trail cover. Otherwise it has d(v)/2 terminals. In either case the number of terminals is equal to the lower bound given by ⌈d(v)/2⌉. Now assume n_odd = 0. The previous argument still holds for all vertices except the one at the end terminals (note that k = 1). But the vertex at the end terminal was chosen to be the one with the lowest criticality weight; therefore the cost function is minimized and the stacking is optimum. The cost function in Eq. (1) is also minimized for a class of circuits with symmetry constraints for which the corresponding modified diffusion graph satisfies two conditions: (1) it has no cross-symmetric edges; (2) the number of self-symmetric vertices with degree d(v) = 2(2k + 1), k > 0, is less than 4. Given these conditions the optimum can be found in linear time by adding a post-processing step to the algorithm which recombines certain trails to reduce the cardinality further. The proof is rather long and will be presented in another paper. When the second condition is waived, the optimum can still be found via a similar post-processing step, but with a penalty in the time-complexity of the algorithm. Currently we are working on a sufficient condition for optimality in the general case. It is also worth noting that we do not evaluate the cost function given in Eq. (1) in sym-trail-cover(). After stacking, the performance of the circuit can be evaluated using estimates on parasitic diffusion capacitances and device matching, looking at the generated stacks [6]. If there is an unsatisfied performance constraint, then the stack generation step indicates that the performance specifications were too tight and it is infeasible to meet them during the layout phase; hence either the design or the specifications must be modified.
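For the unconstrained case argued above, the whole pipeline (add a super-vertex, find an Eulerian trail, delete the super-edges) fits in a short sketch; the edge-list input, the "*" super-vertex name and the iterative Hierholzer-style traversal are illustrative choices, not the recursive euler() of Fig. 6.

```python
from collections import defaultdict

def trail_cover(edges):
    """Minimum trail cover of a connected multigraph, no symmetry constraints.

    edges: list of (u, v) pairs (parallel edges allowed).  A super-vertex "*"
    is connected to every odd-degree vertex, an Eulerian circuit is found, and
    the super-edges are deleted again by cutting the circuit at "*".
    """
    if not edges:
        return []
    deg = defaultdict(int)
    for u, v in edges:
        deg[u] += 1
        deg[v] += 1
    odd = [v for v, d in deg.items() if d % 2 == 1]
    all_edges = list(edges) + [("*", v) for v in odd]

    adj = defaultdict(list)                  # vertex -> list of edge ids
    for i, (u, v) in enumerate(all_edges):
        adj[u].append(i)
        adj[v].append(i)
    used = [False] * len(all_edges)

    def euler(start):                        # iterative Hierholzer traversal
        stack, circuit = [start], []
        while stack:
            v = stack[-1]
            while adj[v] and used[adj[v][-1]]:
                adj[v].pop()
            if adj[v]:
                i = adj[v].pop()
                used[i] = True
                a, b = all_edges[i]
                stack.append(b if a == v else a)
            else:
                circuit.append(stack.pop())
        return circuit                       # closed walk as a vertex sequence

    circuit = euler("*" if odd else all_edges[0][0])
    if not odd:
        return [circuit]                     # the graph was already Eulerian
    trails, current = [], []
    for v in circuit:                        # cut the circuit at the super-vertex
        if v == "*":
            if len(current) > 1:
                trails.append(current)
            current = []
        else:
            current.append(v)
    if len(current) > 1:
        trails.append(current)
    return trails

# Path graph a-b-c has two odd-degree vertices, so one trail covers it.
print(trail_cover([("a", "b"), ("b", "c")]))
```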
5 Results

The stack generation algorithm presented in this paper has been implemented in C++ on an IBM PowerPC 604 (133 MHz) based workstation running AIX 4.1. We have tested the algorithm on various circuits from the literature. Table 1 lists some of these circuits, which we obtained from the literature [4][15][6], and shows some results. For all of the circuits the number of stacks is optimum and hence equal to the results obtained by [6]. Again note that the technique presented in [6] is enumerative and has exponential time complexity. We note that in theory our algorithm can guarantee optimality only for some classes of circuits. But it could still find the optimum results for all the circuits that were available to us, since most practical circuits indeed fall into the class for which our technique is proved to be optimum. Sensitive circuit nodes are maximally merged, and the estimated performance degradation, as computed by Eq. (1), is equivalent to that in [6]. The run time is very low (less than 100 ms per circuit). This compares favorably to [6], which employs an exponential-time algorithm; e.g., for Comp3, our optimum stack generation algorithm found a solution in less than 100 msec while the technique in [6] reports 7.5 sec, a difference of approximately two orders of magnitude (see footnote 1). For bigger circuits, higher savings can be expected.
Fig. 9. The Mult circuit.
Table 1. Stacking results.

Circuit     Ref.    # of devices   # of modules   # of ckt. partitions   # of stacks
Opamp1      [6]     29             32             5                      9
Opamp2      [5]     11             30             3                      3
Opamp3      [4]     27             40             3                      11
Opamp4      [6]     25             36             9                      10
Opamp5 AB   [6]     15             29             6                      9
Comp2       [4,6]   15             25             4                      5
Comp3       [6]     19             33             4                      4
Mult        [15]    12             46             2                      3
Buffer      [15]    10             53             2                      4
Fig. 9 shows a multiplier circuit [15]. It is a typical analog circuit that was used as a benchmark in KOAN [4] as well as in other constraint-driven layout research [15]. The stacking solution generated with our algorithm is shown in Fig. 10. The number of stacks found is 3, which is the theoretical optimum. As a comparison, the number of stacks found in the KOAN layout is 7 (see footnote 2). Fig. 11 shows another analog cell, a comparator which is highly sensitive to device mismatch and parasitic capacitance [4][15][6][16]. The stacking generated by our algorithm is shown in Fig. 12. Again, compared to KOAN, our algorithm found a better stacking, with 3 fewer stacks.
1. Also note that in [6], an enumerative algorithm is utilized which can find all optimum solutions whereas our technique finds only one. 2. We note that this is not a fair comparison, since KOAN integrates stack generation with placement.
Fig. 10. The optimum stacking generated for Mult.

Fig. 11. The Comp circuit.
Fig. 12. The optimum stacking generated for Comp.
6 Conclusions

First-generation custom analog cell layout tools relied on simultaneous stacking, folding and placement of devices to achieve acceptable density and performance. The disadvantage of these approaches is the lack of any guarantees on the achievable circuit performance, and (due to their annealing-based formulations) the variability in layout solutions from run to run. Second-generation tools have focused on two-phase approaches, in which a partition of the devices into optimal stacks is performed first, and subsequent placement manipulates a palette of alternative stacks. The advantage is more predictable circuit performance, and these techniques can be fast for small circuits. But the runtime to generate all stack partitions can be extremely sensitive to circuit size due to the exponential algorithms at the core of these approaches. In this paper we introduced an effective stacking strategy that is fast enough to be exploited in the inner loop of a device placer, yet still respects analog node criticality information. In comparison with the 2-D freeform stacking style of [4], our approach is faster and can find better results. In comparison with the branch-and-bound technique of [6], which enumerates all optimum solutions, our approach finds a single solution of equivalent cost for most practical circuits, but in linear time with respect to the circuit size. Our long-term goal in this work is to integrate this stacking algorithm into a device placer in the style of [4], replacing random search for good merges with directed search among local clusters of devices. Instead of finding all stacking alternatives a priori, we only stack those local sets of devices that the placer tells us ought to be stacked. This should yield improved analog cell layout tools, and digital cell layout tools as well. Complex dynamic-logic CMOS cells are increasingly analog in character, and we believe that a combination of aggressive search (for device placement and folding) coupled with the simultaneous, dynamic stacking proposed in [17] (to optimally arrange local clusters of devices) is an attractive strategy here.

Acknowledgments

We are grateful to Prof. Ron Bianchini and Pinar Keskinocak (CMU) for helpful discussions on Eulerian trails. We thank Prof. Rick Carley (CMU) and Dr. John Cohn (IBM) for giving us some of the circuits used in this paper. We thank Mehmet Aktuna for fruitful discussions. Pinar Keskinocak and Aykut Dengi also helped to improve the presentation by reading an earlier draft of the paper. We would also like to acknowledge MPI, Germany for their LEDA library, which was of great assistance in prototyping the graph algorithms and basic data structures. This work is supported in part by the Intel Corporation and the Semiconductor Research Corporation.
References
[1] T. Uehara and W. M. vanCleemput, "Optimal Layout of CMOS Functional Arrays", IEEE Transactions on Computers, Vol. C-30, No. 5, May 1981, pp. 305-312.
[2] R. L. Maziasz and J. P. Hayes, Layout Minimization of CMOS Cells, Kluwer Academic Publishers, Boston/London, 1992.
[3] S. Wimer, R. Y. Pinter and J. A. Feldman, "Optimal Chaining of CMOS Transistors in a Functional Cell", IEEE Transactions on Computer-Aided Design, Vol. CAD-6, September 1987, pp. 795-801.
[4] J. M. Cohn, D. J. Garrod, R. A. Rutenbar and L. R. Carley, "KOAN/ANAGRAM II: New Tools for Device-Level Analog Placement and Routing", IEEE Journal of Solid-State Circuits, Vol. 26, No. 3, March 1991, pp. 330-342.
[5] E. Charbon, E. Malavasi, U. Choudhury, A. Casotto and A. Sangiovanni-Vincentelli, "A Constraint-Driven Placement Methodology For Analog Integrated Circuits", IEEE Custom Integrated Circuits Conference, May 1992, pp. 28.2/1-4.
[6] E. Malavasi and D. Pandini, "Optimum CMOS Stack Generation with Analog Constraints", IEEE Transactions on Computer-Aided Design, Vol. 14, No. 1, Jan. 1995, pp. 107-122.
[7] M. J. M. Pelgrom et al., "Matching Properties of MOS Transistors", IEEE Journal of Solid-State Circuits, Vol. SC-24, October 1989, pp. 1433-1440.
[8] U. Choudhury and A. Sangiovanni-Vincentelli, "Automatic Generation of Parasitic Constraints for Performance-Constrained Physical Design of Analog Circuits", IEEE Transactions on Computer-Aided Design, Vol. 12, No. 2, February 1993, pp. 208-224.
[9] E. Charbon, E. Malavasi and A. Sangiovanni-Vincentelli, "Generalized Constraint Generation for Analog Circuit Design", Proceedings of the IEEE/ACM ICCAD, Nov. 1993, pp. 408-414.
[10] S. Chakravarty, X. He and S. S. Ravi, "On Optimizing nMOS and Dynamic CMOS Functional Cells", IEEE International Symposium on Circuits and Systems, Vol. 3, May 1990, pp. 1701-1704.
[11] S. Chakravarty, X. He and S. S. Ravi, "Minimum Area Layout of Series-Parallel Transistor Networks is NP-Hard", IEEE Transactions on CAD, Vol. 10, No. 7, July 1991.
[12] J. A. Bondy and U. S. R. Murty, Graph Theory with Applications, Elsevier Science Publishing, New York, 1976.
[13] C. H. Papadimitriou and K. Steiglitz, Combinatorial Optimization: Algorithms and Complexity, Prentice-Hall, Englewood Cliffs, New Jersey, 1982.
[14] J. M. Cohn, "Automatic Device Placement for Analog Cells in KOAN", PhD dissertation, Carnegie Mellon University, February 1992.
[15] B. Basaran, R. A. Rutenbar and L. R. Carley, "Latchup-Aware Placement and Parasitic-Bounded Routing of Custom Analog Cells", Proceedings of the IEEE/ACM International Conference on Computer-Aided Design, November 1993, pp. 415-421.
[16] E. Charbon, E. Malavasi, D. Pandini and A. Sangiovanni-Vincentelli, "Simultaneous Placement and Module Optimization of Analog IC's", Proceedings of the IEEE/ACM Design Automation Conference, June 1994, pp. 31-35.
[17] B. Basaran and R. A. Rutenbar, "Efficient Area Minimization for Dynamic CMOS Circuits", IEEE Custom Integrated Circuits Conference, May 1996.
[18] B. Basaran and R. A. Rutenbar, "An O(n) Algorithm for Transistor Stacking with Performance Constraints", Research Report No. CMUCAD-95-56, Carnegie Mellon University, 1995.
Efficient Standard Cell Generation When Diffusion Strapping Is Required
Bingzhong (David) Guan and Carl Sechen
Department of Electrical Engineering, University of Washington, Box 352500, Seattle, WA 98195-2500
Abstract

In [3], we proposed a single contact layout style (SC style) for CMOS standard cells with a regular and compact structure, based on the assumption that a single diffusion contact is sufficient. In reality, the assumption is not always true. We therefore propose a partial strapping style (PS style) for use when diffusion strapping is required. The PS style keeps all the features of the SC style. The structure uses less area for individual cells, allows easy embedding of feedthroughs in the cell, and enables output pins to occur at any grid location. Using an exact algorithm to generate static CMOS cells with a minimum number of diffusion breaks ensures that the width of the cells is minimized. For the PS style, a constructive routing algorithm is used to perform the intra-cell routing. An exhaustive search among the minimum width cells produces the minimum height cell. Our results show that cells in the PS style have cell height very close to those in the SC style. Furthermore, cells using either layout style achieve significant area savings compared to cells using the traditional full strapping style.
Introduction

Layout generation, also known as silicon compilation, transforms the logic description of a system into physical
masks for silicon fabrication. Since the problem is so vastly complicated, over the past two decades, the standard cell design methodology gained popularity because it solves the problem in a reasonable fashion with the divide-and-conquer approach. The layout generation process typically has been divided into subtasks of logic synthesis and technology mapping, cell generation, placement, global and detail routing, and compaction. The ultimate goal is to minimize chip area while satisfying performance requirements. The building blocks of this approach are these standard cells. Layout minimization depends a lot on the structures of individual cells (cell layout style) and also the structure of the standard cell library (mainly, the content and the size of the library). Since Uehara and vanCleemput [12] proposed a layout style (UvC78) for static CMOS cells in 1978, almost all cell layout styles have followed UvC78 with minor variations [1][7][9]. In the style of UvC78 and its variations, the power lines run parallel to the diffusion rows. However, the original style was targeted for a one metal layer process. This style is disadvantageous with respect to layout area [5]. In layout generator THEDA.P, a new layout style was introduced to target 2-layer metal CMOS processes [6]. However, in this style, the pins are not aligned and the metal-2 layer is not obstacle free.
Figure 1: The layouts of the same complex logic function, O = a + b(c + d)(e + fg + hi), are shown in (a) single contact style, (b) partial strapping style, and (c) traditional full strapping style. These cells have the same width and the same diffusion width, but different cell height and quite different metal-1 blockage. (Layer legend: poly, p diffusion, n diffusion, metal-1, contact, via.)
In [3], we proposed a single contact layout style (SC style) for CMOS standard cells. In this paper, we propose a new partial strapping style (PS style). In designing our styles, the main goal was integration with place and route tools so that the total chip area is minimized after the cells are placed and routed.
New Layout Styles

The traditional fabrication process typically mandates that the drain and source regions be fully contacted (called full strapping, as shown in Figure 1(c)) to improve performance and ensure reliability. With advanced processing technology, self-aligned silicide (salicide) can make the sources and drains very low in resistance. With either a local interconnect being available to strap the diffusion areas or salicide being available, connecting to diffusion with a single contact does not affect performance much. One industry source, based on the latest 0.25 µm processing technology, found that the performance degradation is minimal when moving from full strapping to a single contact style. The worst case degradation is less than 5%, when only a single minimum-size contact drives a 10 µm transistor from one end of the diffusion. In [3], we proposed the SC style for CMOS standard cells (Figure 1(a)) based on the single contact assumption. The style is applicable to processes where a single contact is sufficient due to salicide. It is also applicable to processes where a local interconnect (LI) layer is available to accommodate full strapping of the diffusion area. The SC style has a regular yet compact structure. The structure uses less area for individual cells, makes routing problems straightforward and allows easy embedding of feedthroughs in the cell. Using that new style, we developed a cell generator using an exact algorithm to minimize cell width and height. Unfortunately, only a few semiconductor manufacturers currently support such advanced fabrication processes (either salicide or LI). In this paper, we therefore propose a partial strapping style (PS style, Figure 1(b)). Although different from the SC style due to the strapping requirement, the PS style follows the same discipline of maintaining a cell structure that is as compact and regular as possible. The strapping requirement, however, creates a quite different intra-cell routing problem. We developed a constructive heuristic algorithm for the intra-cell routing problem. This algorithm has been integrated into our area minimizing cell generator. Figure 1(a) shows a complex cell in the SC layout style. Figure 1(b) shows the same complex function cell in the new PS layout style. Both new layout styles only use metal-1 for intra-cell connections. They have the same configuration for input and output pins or terminals. Pins are aligned in a row between the diffusion regions and are equally spaced between the poly gates. The differences between the two styles are the power bus positions and the intra-cell routing schemes. In the SC style, the power lines are over the diffusion area, running horizontally near the middle of the cell to facilitate cell abutment. In contrast, in the PS style, the power lines partially overlap the diffusion area on the top and bottom boundaries of the cell. In addition, in the SC style, the intra-cell routing scheme uses the horizontal tracks that are close to the center of the cell. Those regions over the diffusion area that are not used for intra-cell routing are outside the power buses and can be grouped together with the channel to form one routing region for
inter-cell routing. In contrast, in the PS style the intra-cell routing and the partial strapping of the diffusion area use up all the area between the power buses in the metal-1 layer. The features of these styles are:

1) The output pins can be at any grid position. The freedom of putting the output pin at any grid offers the potential to reduce the number of tracks needed to route all intra-cell nets. Reducing the number of tracks means the portion of the height of a cell used for intra-cell routing is reduced.

2) The pins are at the cell center, aligned in a row, on a uniformly spaced grid. This regularity makes the inter-cell routing problem easier. The only obstacles for routing the inter-cell connections are the regions used for intra-cell routing in the metal-1 layer. Thus the routing region in the metal-1 layer is a rectilinear area between two rows, including the routing channel and the area over the cell which is not used for intra-cell connections in the SC style. In the PS style, the routing region will just be the channel area in the metal-1 layer. The metal-2 and metal-3 layers are obstacle free. Over-the-cell routing is simple with a uniform pin grid.

3) These styles provide many built-in feedthrough positions. After place and route, the connection directions of pins in a cell are known. Two neighboring pins having opposite direction connections can be moved into one pin column, thus freeing one column to be used as a feedthrough position. For example, in Figure 2, pin f only connects upward, and pin e connects downward. Pins e and f can share the same pin grid, while a straight feedthrough position is created. These layout styles also provide dog-leg feedthroughs.
Cell Generation

Functional cell generation is the process of translating a design from the transistor circuit level to the transistor layout level. The process of generating a minimum area cell has been proved to be NP-hard [2]. The primary goal in
Figure 2: The new layout style provides feedthroughs, both straight and dog-leg feedthroughs (cell function: ab + (c + d)(e + f) + gh; the figure annotates the connections to pins and the feedthroughs in metal-2).
optimization is normally to minimize the cell area (width times height). One important aspect in reducing cell width is to utilize diffusion abutment. This abutment can be achieved when the source and drain diffusions of adjacent transistors are electrically equivalent. If they are not electrically connected to each other, a diffusion gap is needed to isolate these transistor terminals. In our layout styles, a diffusion gap forces the separation between neighboring poly gates to be twice as large as that needed by diffusion abutment. In fact, one of our goals is to maximize diffusion abutment.

Exact Algorithm to Minimize Cell Width and Height

In [8], an exact algorithm, HR-TrailTrace, was proposed to minimize cell width by minimizing the number of diffusion breaks. The algorithm utilizes the transistor reordering (also called delayed binding) technique. An exhaustive search among all minimum width cells produces a minimum height cell. The algorithm only counted the density of intra-cell nets as the height. The algorithm was shown to be feasible for all cells of practical size. We implemented that algorithm with a few modifications and extensions. Modifications include utilizing logical equivalence to drastically reduce the number of permutations we need to consider and increasing the efficiency of the algorithm [3]. This algorithm can handle series-parallel connected transistor netlists. For the SC style, a modified left edge algorithm (LEA) has been implemented to route the intra-cell connections. The modification stems from the necessity of determining the output pin location and having the output net always on the first track from the middle. Among all the routed minimum width cells, those which have the least number of tracks are picked as the minimum height cells. As an example, corresponding to the complex function shown in Figure 1, the minimum height layout is shown in Figure 3. This area minimum cell only needs four routing tracks, while the
layout shown in Figure 1(a), although also minimum width, needs five tracks.

Figure 3: This complex cell (O = a + b(c + d)(e + fg + hi)) is minimum height in our single contact layout style, while its corresponding partial strapping style cell is not minimum area.

Algorithm DetailRouteOneMinimumWidthCell(Cell)
1   LEA(Cell);
2   Assign the output net track to MiddleTracks (MT);
3   Assign all other tracks to OutsideTracks (OT);
    /* Any intersections are in the OutsideTracks */
4   FindViolations&Flip(OT, AlwaysFlip);
5   While (ViolationsExist & NumIteration < Threshold) {
6     FindViolations&Flip(MT, CostReductionOnly);
7     FindViolations&Flip(OT, CostReductionOnly);
      /* Flip only the nets not costing any more intersections */
    }
    /* Any remaining intersections need to be at the outside */
8   If (ViolationsExist) {
9     FindViolations&Flip(MT, AlwaysFlip);
    }

Function FindViolations&Flip(Tracks, FlipFlag)
    While (ViolationsExist) {
10    Find all segments with violations;
11    Calculate costs for all these segments;
      /* Three costs are associated with a given segment: CostOrig is the number of
         violations in its current location; CostDest is the number of violations if
         the segment is flipped to the other side; and CostDiff is the difference of
         the previous two (CostDiff = CostOrig - CostDest). */
12    Order these segments according to CostDiff, with CostOrig as the tie breaker;
13    if (FlipFlag is CostReductionOnly and the largest CostDiff is negative) return;
14    Flip the segment with the biggest CostDiff;
    }

Figure 4: The intra-cell routing algorithm for the PS style.
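The classic left-edge track assignment on which the LEA mentioned above is based can be sketched as follows; this is the textbook version, not the authors' modified LEA (which additionally fixes the output net on the first track from the middle), and the (net, left, right) interval format is an assumption.

```python
def left_edge_assign(intervals):
    """Assign horizontal net intervals to tracks with the left-edge algorithm.

    intervals: list of (net, left, right) with left < right.
    Returns a dict net -> track index; nets on one track never overlap.
    """
    tracks = []                                  # rightmost occupied x per track
    assignment = {}
    for net, left, right in sorted(intervals, key=lambda iv: iv[1]):
        for t, last_right in enumerate(tracks):
            if left > last_right:                # fits after the last net on track t
                tracks[t] = right
                assignment[net] = t
                break
        else:
            tracks.append(right)                 # open a new track
            assignment[net] = len(tracks) - 1
    return assignment

# Three overlapping spans need two tracks; the third span reuses track 0.
print(left_edge_assign([("n1", 0, 4), ("n2", 2, 6), ("n3", 5, 9)]))
```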
Intra-Cell Routing for Partial Strapping Style

For the PS style, because of the strapping requirement, the LEA algorithm by itself cannot be applied. The routing problem becomes how to route all connections so that there are the fewest number of intersections between nets. If we don't have any (metal-1) intersections between nets, all routing can be placed over the diffusion regions. Otherwise, some nets will have to be routed in the area between the diffusion and the power bus, not only increasing the cell height, but also requiring some segments of poly wires to be used (as used in Figure 1(c)). For example, the corresponding PS style layout of Figure 3 will have the segment shown on top make one intersection with one of the power connections. That intersection forces the top segment to be routed in the area between the power bus and the diffusion, which results in a taller cell than the one shown in Figure 1(b). This complication renders both HR-TrailTrace's density counting algorithm and LEA not applicable. The intra-cell routing problem for each of the PMOS and NMOS sections actually is a single-row-planar-routing (SRPR [10]) problem with some constraints. One constraint is the power connections, which block one side entirely, while the connection for the output pin partially blocks the other side. The other constraint is that vertical tracks between nodes are quite limited, and sometimes do not even exist. We developed a heuristic to solve this routing problem. Listed in Figure 4 is the intra-cell routing algorithm for the PS style. Shown in Figure 5 is the step-by-step flow to generate the intra-cell routing for the cell in Figure 1(b). The PS style routing starts from the results of the LEA (step 1). The track that contains the output net is assigned to the middle (step 2) and all other tracks are assigned to the outside (step 3). At this time, all violations (intersections), if any, are in the OT. Figure 5(b1) shows that the initial intersections are (n1, n2) and (n2, n3). Step 4 calls the function FindViolations&Flip with the AlwaysFlip flag to resolve all these violations in OT. This function first finds all segments which have violations (step 10) (nets n1, n2, and n3 in Figure 5(b1)) and then calculates (step 11) the costs associated with these segments: CostOrig is the number of intersections the segment has in its current location; CostDest is the number of intersections if the segment is flipped to the other side; and CostDiff is the difference of the previous two (CostDiff = CostOrig - CostDest). Step 12 orders these segments according to CostDiff, with CostOrig as the tie breaker. In Figure 5(b1), the ordered list of segments is (n2, n1, n3) with the corresponding (CostDiff, CostOrig) list as (1, 2), (1, 1), and (0, 1). When the FlipFlag is AlwaysFlip, we will flip the segment with the largest CostDiff to the other side, even when it means an increase in the total number of intersections (step 14). The function returns when all violations are resolved on that side. When the FlipFlag is CostReductionOnly, the function will return when the cost cannot be reduced (steps 13 and 14). In Figure 5(b1), n2 will be flipped to the middle tracks, and that resolves all violations in OT. Afterwards, during the while loop in steps 5 through 7, the algorithm cycles from side to side to resolve any violations we may have. The flipping is only done when the moving net will not cause more violations. In Figure 5(b2), n4 has the largest CostDiff, and thus is flipped. This move resolves all violations, as shown in Figure 5(b3). If there are any remaining intersections, they have to be placed outside (step 9). This concludes the intra-cell detail routing. The intra-cell routing is performed over all minimum width cells of a given function; then the layouts which have the fewest intersections are picked. Experiments have shown that the algorithm is very effective in routing nets over the diffusion area. The algorithm finds that all 87 cells having no more than 3 transistors in series in both the NMOS and PMOS sections have layout implementations in which all the intra-cell routing is over the diffusion.
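The CostReductionOnly flipping passes of steps 5 through 7 can be imitated with a small self-contained sketch; overlapping horizontal spans on the same side are used here as a deliberately simplified stand-in for real metal-1 intersections, so this is only the greedy flavor of FindViolations&Flip, not the paper's routine.

```python
def reduce_crossings(spans, sides):
    """Greedy CostReductionOnly pass over net segments.

    spans: dict segment -> (left, right) horizontal extent.
    sides: dict segment -> "MT" or "OT" (middle or outside tracks).
    Two segments "intersect" here when their spans overlap on the same side.
    """
    sides = dict(sides)

    def violations(seg, side):
        l, r = spans[seg]
        return sum(1 for o, s in sides.items()
                   if o != seg and s == side and spans[o][0] < r and l < spans[o][1])

    while True:
        best_gain, best_seg, best_dest = 0, None, None
        for seg, side in sides.items():
            dest = "OT" if side == "MT" else "MT"
            gain = violations(seg, side) - violations(seg, dest)
            if gain > best_gain:
                best_gain, best_seg, best_dest = gain, seg, dest
        if best_seg is None:              # the largest CostDiff is not positive
            return sides
        sides[best_seg] = best_dest       # each flip strictly reduces crossings

# n2 overlaps both n1 and n3 on OT, so it gets flipped to the middle tracks.
print(reduce_crossings({"n1": (0, 4), "n2": (2, 7), "n3": (5, 9)},
                       {"n1": "OT", "n2": "OT", "n3": "OT"}))
```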
Results

We studied the layout area and circuit performance resulting from utilizing a large library of standard cells [4]. We built libraries of all static CMOS cells having a chain length of up to 7; this was made possible by our new cell generator's capability of generating any static CMOS cell with a user-specified limit on the number of series transistors. We used the TimberWolf place and route tools [11] to generate the actual chip layout. We used an industrial timing analyzer, which included wiring parasitics, and state-of-the-art 0.25 µm design rules, to provide performance information. We refer to a library of all possible cells having a chain length limit of n as sn. We experimented on 13 MCNC benchmark circuits, ranging from 124 to 2090 cells and 165 to 2135 nets. Our results (Table I) show that compared to using library s2, in terms of chip area, s3, s4, s5, s6, and s7 save 16%, 22%, 24%, 25%, and 26% respectively. For larger designs, the area savings amounted to 50% when using s7. In the meantime, since a netlist which uses larger cells will have fewer cells on each critical path, we found that the average worst path delay is quite similar for libraries s3 through s7. We concluded that using a very large library (e.g., s7) is optimal in terms of area and delay. Due to the sheer sizes of these large libraries, it is not possible to draw all those possible cells by hand. An efficient automatic standard cell generator is necessary to make this approach feasible and to realize the chip area savings. As can be seen from Figure 1, while the SC, PS and FS style cells all have the same width, their height is quite different. Compared to the SC style cells, the PS style cells are taller due to partial strapping. For FS cells, intra-cell connections use some more area. Table II shows the cell height comparison between the three styles. The cells in the SC style are 12% shorter than those in the PS style, while the cells in the FS style are 30% taller than those in the PS style. Table II also shows the comparison of metal-1 blockage. Since the unblocked area can be used in the SC style, the height of the metal-1 blockage can be considered as the effective cell height. We can see from Table II that relaxing the partial strapping requirement could potentially reduce metal-1 usage by 43%. In summary, the PS style is very close to matching the area savings offered by the SC style, while satisfying the strapping requirement. The area savings that a semiconductor manufacturer can gain from using a more advanced fabrication process (e.g., salicide or local interconnect) is quite clear. Nonetheless the PS style is very dense, as can be seen by the very high metal-1 usage in the cells. Further, the PS style represents a significant area improvement over the conventional FS style.
Figure 5: Intra-cell routing flow for the cell layout in Figure 1(b). The cross indicates an intersection violation. (a) shows the routing steps for the NMOS section. (b) shows the routing steps for the PMOS section.
Table I. Comparison of total standard cell area, normalized to the s2 case.

Circuit   s2   s3     s4     s5     s6     s7
C1355     1    .950   .938   .926   .926   .926
C1908     1    .890   .875   .857   .843   .840
C2670     1    .777   .722   .717   .714   .710
C3540     1    .851   .791   .774   .779   .769
C432      1    .844   .776   .743   .702   .697
C6288     1    .870   .836   .831   .826   .824
C7552     1    .897   .868   .858   .850   .849
b9        1    .828   .786   .750   .739   .725
dalu      1    .846   .737   .709   .697   .696
des       1    .854   .788   .766   .752   .750
k2        1    .740   .656   .611   .592   .580
rot       1    .860   .809   .790   .776   .768
t481      1    .695   .504   .499   .492   .472
Average   1    .839   .776   .756   .745   .739
Table II. Cell height comparison, normalized to the PS case.

Cell Style   Cell Height   Effective Cell Height (Metal-1 Blockage)
SC           0.88          0.57
PS           1.00          1.00
FS           1.30          1.30

In [3], we presented a chip area comparison between the SC style and the Mississippi State University (MSU) library and an industrial library. Our results show that circuits using the SC style cells achieve significant area savings (as much as 50%) compared to the use of these manually laid out compact cells. Since the PS style is very close to matching the height of the SC style, similar area savings can be achieved.

Conclusions

We presented a partial strapping style for use when diffusion strapping is required. The PS style keeps all the features of the SC style. The structure uses less area for individual cells, allows easy embedding of feedthroughs in the cell, and enables output pins to occur at any grid location. Using an exact algorithm to generate static CMOS cells with a minimum number of diffusion breaks ensures that the width of the cells is minimized. We developed a constructive routing algorithm to perform the intra-cell routing. An exhaustive search among the minimum width cells produces the minimum height cell. Our results show that cells in the PS style have cell height very close to those in the SC style. Furthermore, cells using both layout styles achieve significant area savings compared to cells using the traditional full strapping style.

References
[1] S. Bhingarde, A. Panyam and N. A. Sherwani, "Middle terminal cell models for efficient over-the-cell routing in high performance circuits," IEEE Trans. on VLSI, vol. 1, pp. 462-472, December 1993.
[2] S. Chakravarty, X. He and S. S. Ravi, "Minimum area layout of series-parallel transistor networks is NP-hard," IEEE Trans. Computer-Aided Design, vol. 10, pp. 943-949, July 1991.
[3] B. Guan and C. Sechen, "An area minimizing layout generator for random logic blocks," Proc. of Custom Integrated Circuits Conference, Santa Clara, CA, May 1995.
[4] B. Guan and C. Sechen, "Large standard cell libraries and their impact on layout area and circuit performance," submitted to the International Conference on Computer Design, 1996.
[5] Y.-C. Hsieh, C.-Y. Hwang, Y.-L. Lin and Y.-C. Hsu, "LiB: A CMOS cell compiler," IEEE Trans. Computer-Aided Design, vol. 10, pp. 994-1005, August 1991.
[6] C.-Y. Hwang, Y.-C. Hsieh, Y.-L. Lin and Y.-C. Hsu, "An efficient layout style for two-metal CMOS leaf cells and its automatic synthesis," IEEE Trans. Computer-Aided Design, vol. 12, pp. 410-424, March 1993.
[7] S. M. Kang, "Metal-metal matrix (M3) for high-speed VLSI layout," IEEE Trans. Computer-Aided Design, vol. 6, pp. 886-891, Sept. 1987.
[8] R. L. Maziasz and J. P. Hayes, Layout Minimization of CMOS Cells, Kluwer Academic Publishers, 1992.
[9] S. S. Sapatnekar and S. M. Kang, Design Automation for Timing-Driven Layout Synthesis, Kluwer Academic Publishers, 1993.
[10] N. Sherwani, Algorithms for VLSI Physical Design Automation, 2nd Edition, Kluwer Academic Publishers, Boston, 1995.
[11] TimberWolf Systems, Inc., TimberWolf: Mixed Macro/Standard Cell Floorplanning, Placement and Routing Package (Version 1.0), obtained from Bill Swartz at TWS, Dallas, TX, 1994.
[12] T. Uehara and W. M. vanCleemput, "Optimal layout of CMOS functional arrays," Proc. 16th ACM/IEEE DAC, pp. 287-289, May 1978.
Author Index Alexander, M. J. Alpert, C. J. Ashtaputre, S. Bart, S. F. Basaran, B. Berg, E. C. Blaauw, D. Carrabina, J. Chen, C.-P. Chen, Y.-P. Cheng, C.-K. Chiluvuri, V. K. R. Cohoon, J. P. Colflesh, J. L. Cong, J. Cowen, A. Dai, NV. W.-M. Deng, W. Dong, S.-K. Drake, K. Dutt, S. El Gamal, A. Entrena, L. A. Esbensen, H. Fedder, G. K. Friedman, E. G. Gabriel, K. J. Ganguly, S. Guan, B. Gullickson, D. Gupta, R. Hagen, L. Harr, R. He, L. Hebgen, W. Hossain, M. Hwang, J. Jess, J. A. G. Kahng, A. B. Kang, M. Kaptanoglu, S. Karro, J. Knol, D. A. Koakutsu, S. Koren, I. Kuh, E. S. Lee, H[. J. Lehther, D.
142 100
154 71 150, 262 67 40 176 21 21 7
198 142 142 1, 34 61 134, 226 92 256 228 92 106 13 126 53, 76 241 45 40 268 228 163 100
81 34 118 154 106 111 100
134 169 142 234 134 207 126 67 40
Lillis, J. Lin, T.-T. Liu, C. L. Liu, L.-C. E. Lo, C.-Y. Lo, N. R. Mahadevan, R. Maher, A. C. Maly, W. Marek-Sadowska, M. Marin, X. Molitor, P. Morison, R. Mukherjee, T. Neves, J. L. Nijssen, R. X. T. Okamoto, T. Olias, E. Palesko, C. Pan, P. Parrish, P. T. Peset, R. Peters, E. L. Peters, I. Pister, K. S. J. Pullela, S. Riera, J. Robins, G. Rutenbar, R. A. Sandborn, P. A. Sarrafzadeh, M. Scheffer, L. Sechen, C. Simon, J. N. Sun, Y. Tanner, J. E Tellez, G. E. Thumma, B. Tsai, K-H. Tseng, H.-P. Uceda, J. Velasco, A. J. Vittal, A. Wang, K. P. Weber, M. Wong, D. F. Yee, G. Zimmermann, G.
7 7 163, 256 218, 249 256 67 61, 83 86 190
27, 169, 190 176 158 86 53 241 111 1
13 228 256 86 176 142 158 67 40 176 142 150, 262 228 234 89 183, 210, 218, 249, 268 67 163 86 234 154 169 210 13 176 27 190
158 21 183 118