Three Dimensional System Integration
Antonis Papanikolaou • Dimitrios Soudris • Riko Radojcic
Editors
Three Dimensional System Integration: IC Stacking Process and Design
Editors Antonis Papanikolaou Department of Electrical and Computer Engineering National Technical University of Athens 157 80 Athens Zographou Campus Greece
[email protected]
Dimitrios Soudris Department of Electrical and Computer Engineering National Technical University of Athens 157 80 Athens Zographou Campus Greece
[email protected]
Riko Radojcic Qualcomm Inc. San Diego, CA USA
[email protected]
ISBN 978-1-4419-0961-9
e-ISBN 978-1-4419-0962-6
DOI 10.1007/978-1-4419-0962-6
Springer New York Dordrecht Heidelberg London

© Springer Science+Business Media, LLC 2011

All rights reserved. This work may not be translated or copied in whole or in part without the written permission of the publisher (Springer Science+Business Media, LLC, 233 Spring Street, New York, NY 10013, USA), except for brief excerpts in connection with reviews or scholarly analysis. Use in connection with any form of information storage and retrieval, electronic adaptation, computer software, or by similar or dissimilar methodology now known or hereafter developed is forbidden.

The use in this publication of trade names, trademarks, service marks, and similar terms, even if they are not identified as such, is not to be taken as an expression of opinion as to whether or not they are subject to proprietary rights.

Printed on acid-free paper

Springer is part of Springer Science+Business Media (www.springer.com)
Contents
1 Introduction to Three-Dimensional Integration ............................................. 1
   Antonis Papanikolaou, Dimitrios Soudris, and Riko Radojcic

2 TSV-Based 3D Integration .............................................................................. 13
   James Burns

3 TSV Characterization and Modeling ............................................................. 33
   Michele Stucchi, Guruprasad Katti, and Dimitrios Velenis

4 Homogeneous 3D Integration ......................................................................... 51
   Robert Patti

5 3D Physical Design .......................................................................................... 73
   Jason Cong and Guojie Luo

6 Co-optimization of Power, Thermal, and Signal Interconnect for 3D ICs ... 103
   Young-Joon Lee, Michael Healy, and Sung Kyu Lim

7 PathFinding and TechTuning ......................................................................... 137
   Dragomir Milojevic, Ravi Varadarajan, Dirk Seynhaeve, and Pol Marchal

8 3D Stacking of DRAM on Logic ..................................................................... 187
   Trevor Carlson and Marco Facchini

9 Microprocessor Design Using 3D Integration Technology ........................... 211
   Yuan Xie

10 3D Through-Silicon Via Technology Markets and Applications ................ 237
   E. Jan Vardaman

Index .................................................................................................................... 243
Contributors
James Burns, Lincoln Laboratory, Massachusetts Institute of Technology, Lexington, MA, USA
Trevor Carlson, Ghent University, Ghent, Belgium
Jason Cong, University of California, Los Angeles, CA, USA
Marco Facchini, IMEC vzw & Katholieke Universiteit Leuven, Leuven, Belgium
Michael Healy, Georgia Institute of Technology, Atlanta, Georgia, USA
Guruprasad Katti, IMEC vzw & Katholieke Universiteit Leuven, Leuven, Belgium
Young-Joon Lee, Georgia Institute of Technology, Atlanta, Georgia, USA
Sung Kyu Lim, Georgia Institute of Technology, Atlanta, Georgia, USA
Guojie Luo, University of California, Los Angeles, CA, USA
Pol Marchal, IMEC vzw, Leuven, Belgium
Dragomir Milojevic, Université Libre de Bruxelles, Brussels, Belgium
Antonis Papanikolaou, Department of Electrical and Computer Engineering, National Technical University of Athens, 157 80 Athens, Zographou Campus, Greece
Robert Patti, Tezzaron Semiconductor, Naperville, IL, USA
Riko Radojcic, Qualcomm Inc., San Diego, California, USA
Dirk Seynhaeve, AutoESL, Cupertino, California, USA
Dimitrios Soudris, National Technical University of Athens, Athens, Greece
Michele Stucchi, IMEC vzw, Leuven, Belgium
Ravi Varadarajan, Atrenta Inc., San Jose, California, USA
E. Jan Vardaman, TechSearch International, Austin, TX, USA
Dimitrios Velenis, IMEC vzw, Leuven, Belgium
Yuan Xie, Penn State, University Park, Pennsylvania, USA
Chapter 1
Introduction to Three-Dimensional Integration
Antonis Papanikolaou, Dimitrios Soudris, and Riko Radojcic
1.1 The Ever Increasing Need for Integration

The semiconductor industry has been one of the main enablers of the information technology revolution that we have witnessed at the beginning of the twenty-first century. Each new generation of consumer electronics devices that hits the shelves boasts more features and functionality, better connectivity to other devices, lower cost, and better power efficiency per function. An excellent example of this trend is the evolution of the mobile phone since its proliferation at the end of the last century. Mobile phones started out by offering the minimal functionality of voice calling, then evolved to offer short messaging services, and since then the features have kept piling up. State-of-the-art mobile phones in 2010 are in reality computing platforms offering extreme power efficiency, small form factor, and low cost for the offered functionality, which includes connectivity with virtually all known standards, high-definition video decoding, social networking, office productivity suites, GPS, plus any application the software community generates! Consumers have gotten used to these trends and expect a further improvement with every generation of products coming out, which puts pressure on consumer electronics manufacturers to deliver on these expectations. This translates into a continuous pursuit of low-cost and low-power integration. More and more functionality needs to be integrated into fewer chips to reduce the component count and the real estate of printed circuit boards. Chips with increased functionality need to be shrunk in order to reduce their cost and power consumption. The key driver of this continuous improvement has been semiconductor process technology scaling, which has shrunk the physical dimensions of transistors and interconnections to miniscule sizes; transistor channels now measure a few tens of nanometers across. This miniaturization has increased the functionality per unit of area in chips by about a factor of two every 3 years for the past five decades; the first integrated
circuits in the 1960s comprised a few transistors, whereas state-of-the-art integrated circuits in 2010 contain more than a billion transistors on a single die. Shrinking transistor sizes has provided other benefits as well; smaller transistors have lower capacitances, so they are faster and each consumes less power. Note that even though power per transistor is decreasing, the increased level of integration packs more of them into smaller spaces, so elaborate design solutions are required to keep chip-level power consumption low. This has fueled the evolution of electronics for decades. This miniaturization by physical dimension scaling is slowing down as transistor channel lengths hit the range of 20–30 nm. Process technology is running into problems such as process variability, increased leakage currents, and lithography limitations. Designers are forced to embed worst-case margins in the chips in order to work around these issues, which leads to an increase in power consumption. Furthermore, manufacturing ever larger chips has a negative impact on the production yield; fewer of them turn out to be fully functional. A lot of research effort is invested worldwide to overcome these problems and enable technology scaling to continue unabated. But it is very doubtful whether technology scaling alone can keep delivering the rate of improvement it offered in the past decades. Another source of increased integration has been advances in chip packaging. In the early days, packages housed one chip each. Later on, multiple chips were integrated in one package in various configurations, either side by side or on top of each other, and were interconnected using small wires inside the package. State-of-the-art packaging techniques include System-in-a-Package, which integrates multiple chips of heterogeneous functionality and process technology into a single package, interconnected with wire bonds. The benefits offered by technology scaling and packaging advances are coming to an end; the desire for further integration and power efficiency, however, is not. Consumers still want more functionality at lower cost and higher power efficiency, probably more than ever before, as electronic devices are slowly becoming ubiquitous in the environment.
1.2 Chip Stacking

Keeping up the trend of integration and power consumption reduction in the era of slower scaling clearly requires a new set of solutions. Chip stacking is seen by many in the industry as the technology that will allow the trend of increased integration to continue. It is a class of solutions that has emerged as a combination of technology scaling and packaging techniques. Various flavors of chip stacking technologies have been proposed and some of them have been in production for a number of years. Stacks of packaged die interconnected using wire bonding, flip-chip bumps, or ball grid arrays, wafer-level packaging of chip stacks, and other techniques to stack and interconnect chips have been used to create products in the past. All these techniques manage to integrate more functionality in a single package and they reduce the power consumed in communication between
chips, since they substitute printed circuit board connections with wire bonds or solder balls, which have much better electrical characteristics. To push the boundary of integration and power consumption reduction, chip stacking has taken another step. A 3D stacked integrated circuit (3D SIC) is a chip stacking technique where the vertical conductors are embedded in the substrate during the manufacturing of the wafers in the foundry. This enables a very high interconnection density between neighboring die in the stack with low-capacitance interconnects. Figure 1.1 illustrates how the aforementioned chip stacking techniques score on a number of axes. Technology integration refers to the capability of integrating die or chips built using different process technologies. Interconnection density refers to the number of vertical connections that can exist per unit of area. Integration density quantifies the capability of the technique to integrate a lot of functionality in a small volume. Performance refers to the latency of vertical interconnections, and form factor refers to the size of the final product per functionality embedded. Multi-chip modules (MCMs), which incorporate many chips in a package side by side, can integrate very heterogeneous chips at a good (i.e., low) cost, but score rather low on the other axes, because an MCM is still a solution that requires a lot of area and its interconnections need to traverse large distances. Chip stacks interconnected with wire bonds are relatively good along all axes but do not excel anywhere. They are better than MCMs in integration density and form factor, as they can pack more functionality in a smaller volume, and they are faster, since the resulting wire bonds have better electrical characteristics. However, wire bonding suffers from scalability issues. Wire bonds can only connect I/O pins on the periphery of chips, and the available real estate there is limited. Hence wire bonding cannot offer a significant increase in interconnection density.
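The gap between peripheral and area-distributed connections can be made concrete with a back-of-the-envelope count, sketched below. The die size and pad pitches are assumed illustrative values, not figures from this chapter.

# Rough comparison (illustrative values): peripheral pads scale with the die
# perimeter, area-array connections scale with the die area.

def peripheral_pads(die_side_um: float, pad_pitch_um: float) -> int:
    """Single ring of pads along the four die edges (corners counted once)."""
    return 4 * int(die_side_um / pad_pitch_um) - 4

def area_array_pads(die_side_um: float, pad_pitch_um: float) -> int:
    """Pads on a full two-dimensional grid across the die."""
    return int(die_side_um / pad_pitch_um) ** 2

die_side = 10_000.0  # assumed 10 mm x 10 mm die
print(peripheral_pads(die_side, 60.0))  # wire-bond pads at ~60 um pitch -> ~660
print(area_array_pads(die_side, 5.0))   # area-distributed vias at 5 um pitch -> 4,000,000

Even with generous assumptions for the wire-bond pitch, the peripheral count grows only linearly with die size, while an area array grows quadratically, which is the scalability gap described above.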
Fig. 1.1 Comparison of 3D-SIC technology with other prevailing integration techniques
Ball grid array (BGA) stacks interconnect the different chips in the stack using arrays of solder balls. They improve on technology integration, but they are less cost-efficient than wire-bonded stacks. One of their main advantages is the better electrical characteristics of the balls compared to wire bonds. This is especially useful in power delivery; for instance, solder balls enable better power delivery to power-hungry blocks without large supply voltage variations. Wafer-level packaging (WLP) goes a step further in form factor and integration density. It uses large vias through the silicon substrate to interconnect the different die. This increases the integration density and improves the form factor of the final stack compared to BGA implementations. 3D stacked integrated circuit (3D SIC) or 3D integrated circuit (3D IC) technology is the next generation in chip stacking technology. Bare die are stacked and interconnected using vias through the substrate with very fine pitches. It is today the only practical solution that provides the capability to interconnect different die with tens of thousands of interconnects. Even though it is currently still an expensive process, it offers advantages in integration density, as it results in the smallest volume, the highest performance, and the smallest final form factor for the packaged stack, and it has by far the highest interconnection density between the die in the stack. For the remainder of this book we will focus on 3D SIC technology.
1.3 Benefits and Challenges of 3D Integration

Stacking ICs and densely interconnecting them vertically carries a lot of benefits for the end product. However, the technology to take multiple planar die and perform the actual stacking and operational interconnection has not been completely ironed out yet. Moreover, a number of business challenges remain before the supply chain for the production of 3D SICs is established. The next sections outline these benefits and challenges and set the stage for understanding why chip stacking is considered such an important step in integration, as well as the remaining issues that need to be addressed before this technology can become mainstream.
1.3.1 Benefits

1.3.1.1 Heterogeneous Integration

Typical consumer devices include a number of heterogeneous functionalities, such as processing, sensing, memory, and data transmission, which cannot be incorporated in a single die, because the underlying process technologies need to be optimized for their individual purposes. Chip stacking offers an alternative to board-level connectivity and system-in-package solutions, which suffer from reduced interconnection density between the functionalities and increased board footprint,
hence cost. Dies can be manufactured in different process technologies, even in different foundry lines or by completely different vendors, and can be bonded at a later stage by a third party. This makes it possible, for example, to stack DRAM memory on top of logic, or to combine analog and RF functionality with baseband processing within a single stack. The increased interconnection density opens up new opportunities for efficient system integration, as outlined in Chap. 8.

1.3.1.2 High Degree of Integration in a Small Form Factor

A strong trend to miniaturize consumer electronics products is evident nowadays; the thickness of devices has even become a selling proposition for new generations of smartphones and tablets. This trend puts significant pressure on chip providers to add the chip form factor as another design optimization criterion. In order to make devices ever thinner, with the screen and the printed circuit board taking up a significant amount of z-axis real estate, chips need to be less than a couple of millimeters thick. Moreover, consumer electronics manufacturers are pushing chip vendors to increase the functionality per chip in order to reduce the component count on the printed circuit board and, as a result, the cost. Stacking of thinned chips is the only technology available currently that can densely pack more functionality than ever before into a very thin package. This may be the key to enabling the next wave of consumer electronics miniaturization.

1.3.1.3 Improved Power Consumption

Power consumption is the second most important design optimization criterion after cost nowadays. Chips that are embedded in portable devices need to consume power in a very frugal manner in order to maximize battery lifetime between recharges. Power consumption is becoming ever more important for other applications as well. Compute farms and personal computers need to regulate their power consumption to control temperature and avoid catastrophic side effects on the chips themselves. Manufacturers of set-top boxes and other similar devices need components that are power efficient to avoid installing cooling equipment, which increases cost. One of the main sources of power dissipation on chips is the wires interconnecting the various functional blocks of the chip. As more and more functionality is integrated on a chip and its physical dimensions increase, these wires tend to get longer and, as a result, more power hungry, since they need to transmit more information, faster, over larger distances. Continuing this trend using conventional planar chips leads to a bottleneck. Chip stacking offers an alternative solution. Partitioning the functionality of the chip over multiple die and vertically stacking them increases the locality between the different functional blocks. Blocks that previously were on opposite sides of a planar die can now be placed on top of each other. This enables a drastic reduction of interconnect lengths, which directly translates into reduced power consumption or even faster data transmission if necessary.
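The link between wire length and power can be illustrated with a first-order switching-power estimate. The capacitance per millimeter, supply voltage, clock rate, and activity factor below are assumed example values, not data from the book; the sketch only shows why shortening a global wire by stacking reduces its power roughly in proportion.

# First-order sketch: dynamic power of an on-chip wire, P = a * C * V^2 * f,
# with the wire capacitance taken as proportional to its length.

def wire_dynamic_power_w(length_mm: float,
                         cap_per_mm_f: float = 0.2e-12,  # ~0.2 pF/mm, assumed
                         vdd_v: float = 1.0,
                         freq_hz: float = 500e6,
                         activity: float = 0.1) -> float:
    c_wire = cap_per_mm_f * length_mm
    return activity * c_wire * vdd_v ** 2 * freq_hz

planar = wire_dynamic_power_w(10.0)   # block-to-block wire across a large planar die
stacked = wire_dynamic_power_w(2.0)   # the same connection folded onto a stacked die
print(f"{planar * 1e6:.0f} uW vs {stacked * 1e6:.0f} uW")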
1.3.1.4 Cost Benefits

Cost is the single most important driver and optimization target in the design of integrated circuits for most applications. Traditional semiconductor scaling techniques that have enabled the increased system integration up to now are becoming more difficult and more expensive with each new technology node. The cost of developing the process technology and building a foundry for the next node is becoming so high that only very few companies worldwide can afford it. Apart from sheer capital expenditures, yield is also becoming a bottleneck for chip production. Defects that reduce yield in the production process are not scaling in size together with the feature dimensions of chips, and cleanrooms cannot become much cleaner, so the defect density has saturated. Dies become bigger to accommodate more functionality, and as a consequence yield is dropping. This further increases the cost of producing chips in each new technology node, since fewer of them will be functional. Chip stacking offers an alternative route to integration which may prove to be more cost-efficient. First, it enables the stacking of multiple die, which means that large, low-yielding die can be split into smaller ones which are then bonded together. Second, if some part of the system has more relaxed requirements than the critical one, it can be manufactured in an older, cheaper technology node to further reduce the overall cost. Hence there is great potential for cost optimization, as long as the additional process steps added for bonding the multiple die together are cheap enough. This is one of the major challenges that must be overcome in order for chip stacking to become a mainstream technology that complements traditional CMOS and enables the next steps of scaling for the semiconductor industry.

1.3.2 Technical Challenges

The potential benefits of chip stacking are too lucrative to ignore. Semiconductor companies, research centers, and universities are spending significant effort to address the following remaining technical challenges and make chip stacking a reality.

1.3.2.1 Process Steps

Stacking multiple chips obviously requires additional process steps. These steps can be clustered into three main functions, namely through silicon via (TSV) etching and filling, thinning, and bonding. TSVs are the vertical interconnections between the different die; they connect an interconnect of the die in which they are created to an interconnect on the die just below, in order to establish an electrical connection. Creating them requires etching holes through the silicon substrate of the appropriate die during its manufacturing and later filling these holes with conductive material. Thinning refers to a process step that thins the wafer with the TSVs to a thickness ranging from a few tens of micrometers to a few hundred micrometers.
This is an essential step in order to create a 3D SIC with a small thickness. The third important step is bonding, where die are bonded together in pairs. This requires the careful alignment of the two die and their bonding such that the TSVs land at the correct places to establish electrical connections with the lower die. Chapter 2 explains these steps in much more detail.
1.3.2.2 Operating Temperature

The operating temperature of chips is mainly determined by three factors: the ambient temperature, the density of their power consumption, and how well they can transfer heat out of the package. Power consumption creates heat, and if it cannot be dissipated fast enough the temperature of the die increases, leading to problems such as increased leakage currents in the transistors and reliability degradation. Stacking of multiple die affects both the power consumption density and the capability of the chips to dissipate heat. The power density increases because multiple die are thinned and bonded together in a small volume: the power that was consumed over a large area in conventional planar die is now consumed in a small volume. As a result, a large amount of heat is generated in a small volume. The capability to dissipate this heat depends on the materials used in the chips and the package. In a conventional chip, a spreader is used to distribute the heat along the surface of the die, from where it is subsequently dissipated through the package. In a stacked IC, the heat may have to be dissipated through another die. The difference in heat conductivity needs to be understood and evaluated. The fact that the die in the stack are thinned before bonding adds another level of complication; thicker die can spread heat along their horizontal plane much better than thin ones. One school of thought believes that this increase in power density can lead to a temperature increase in the overall die stack. Another school believes that even in conventional planar chips the temperature problems are encountered in small hotspots; hence, the problem is essentially similar. It is clear that some research is needed to compare the heat dissipation capabilities of conventional versus stacked integrated circuits in the context of their packages and to assess the impact of how and where heat is generated on the operating temperature of the die.

1.3.2.3 Mechanical Stability

Three-dimensional chip stacks will comprise a number of die of different sizes, thinned to a few tens of micrometers, made of different materials, stacked and bonded on top of each other so as to retain electrical connections. This system presents a nightmare in terms of mechanical stability when the temperature changes. Different materials, and as a result different die, have different thermal expansion coefficients and are affected in different ways by temperature gradients. This creates a potential threat that the stack might be partially de-bonded if the temperature changes rapidly, which might
be a catastrophic failure because electrical connections may be jeopardized. Mechanical stress that builds up due to these sources can have subtler effects as well. Stress engineering has been widely used in deep sub-micron technology nodes to improve the electrical properties of transistors. Mechanical stress due to thermal expansion can interfere with the stress carefully engineered into the transistor channel and degrade the on-currents of the transistors. Another major source of problems is the handling of the thinned wafers. After thinning, wafers are so thin that they actually become flexible. Extreme care needs to be taken to make sure they are transferred from one process step to the other without being damaged. The current solution for handling such wafers is to attach them to supporting carrier wafers, but even the operations of bonding and de-bonding them to and from the carrier wafers may create mechanical issues.

1.3.2.4 Testing

Testing is a very important, albeit often underestimated, step in the chip manufacturing process. Any manufacturing plant wants to ship only operational products; hence, testing is critical. Testing the functionality and performance specifications of a 3D chip stack is similar to testing a conventional chip. Chip stacking, however, offers opportunities to test individual die before they are bonded together. In order to avoid bonding functional and nonfunctional die together, which leads to wasting functional die, each die should be tested separately. This adds another level of complication to the testing process. Moreover, testing individual thinned die is tricky, as mechanically probing them is extremely difficult. Research is needed to establish a proper testing protocol for chip stacks and a way to test the individual die before bonding, to increase yield and minimize cost. Design-for-testability is another area of testing that may be affected by three-dimensional integration. It comprises techniques to embed testing functionality, scan chains for instance, in the chip so as to test it efficiently. Partitioning the system functionality over multiple separate die can increase the complexity of such techniques; this impact needs to be better understood.

1.3.2.5 Bonding Strategies

Three ways exist to perform the bonding that creates chip stacks: wafer to wafer, die to wafer, and die to die. Each of these alternatives has clear advantages and disadvantages. Wafer-to-wafer bonding is the fastest, since multiple stacks are created at once, but the different die in the stack have to be exactly the same shape and size. Die-to-wafer bonding is an intermediate solution where one wafer has been diced and individual die are bonded onto the second wafer. This makes alignment more difficult, but it enables stacking of die with different physical dimensions and allows the die to be pretested so that stacks are built from functional die. Die-to-die bonding offers the most freedom to the foundry to mix and match working die to create chip stacks. It is the least preferred approach, however, since it increases production time and cost significantly. Depending on the type of product, any of the aforementioned approaches may be useful. Even wafer-to-wafer bonding, which may result
in bonding functional with nonfunctional die, is a good fit for the production of DRAMs, where the cost of an individual chip is very small and it is much more important to increase production speed and throughput.
1.3.3 Business Challenges

There are clearly strong interactions between the technical solutions that need to be developed and the target products, as outlined above. Another level of complication involves the interaction between the different vendors that are part of the chip stacking value chain. Successful and efficient production of 3D SICs requires the streamlining and standardization of a number of processes. The challenges that need to be addressed include the following.

1.3.3.1 Liability

In the world of conventional planar chips and simple packaging options, it is relatively easy to find out whether it is the die or the package that has stopped working when the entire system breaks down. This clear interface enables companies to cooperate, with each taking responsibility for its own products. Chip stacking complicates this interface. When multiple bare die are stacked on top of each other and the chip stack stops working, it is very difficult to establish what went wrong and who is to blame. Was each die operational before stacking? Did the bonding process destroy the functionality of some die? Were the die designed according to the bonding specs? What happens in case thermal expansion creates reliability problems? Will companies be willing to provide warranties for the proper operation of their products in this context? Chip stacking introduces many interactions between products of different vendors, and sorting out where the liabilities lie is increasingly difficult. It is necessary, however, in order to commercialize this technology on a wide scale.

1.3.3.2 Cost Reduction

Chip stacking has the potential to become an integral part of the global semiconductor production process if it can resolve the problems faced by scaling into the very deep submicron technology nodes, such as yield degradation. But this is only realistically achievable if the additional process steps related to chip stacking become very cost-efficient and high-yielding.

1.3.3.3 Vendor Interfaces

Traditional chip manufacturing has established interfaces between vendors in the supply chain which have served the industry well for a number of years. In a simple case, design houses design the ICs and hand off the layout to the foundries; the
foundries manufacture the dies and hand them over to houses that perform packaging, assembly, and test. These three entities are distinct and one can hand off to another based on well-defined interfaces. Chip stacking complicates this process. It is not clear which steps need to be undertaken in a foundry and which can be performed by a packaging house. In the case of densely interconnected chip stacks the etching and filling of through silicon vias must be performed in the foundry because they sit in the middle of the manufacturing process. But thinning and bonding can be performed by either a foundry or a packaging house. In the case of stacking of heterogeneous die from multiple vendors or foundries things can become even more complex. The industry players need to come up with a new set of, potentially flexible, interfaces in order to establish a viable supply chain for the production of 3D SICs.
1.3.3.4 Standardization

The hand-off between different vendors discussed above will be enabled by the standardization of key technology and design parameters and processes. The interfaces between chip stacking related process technology steps need to be standardized so that they can be executed by different vendors. For example, the physical dimensions of the TSVs fabricated by a foundry pose strict requirements on the alignment accuracy of the bonding process, which may be performed by a packaging house. There are a large number of such parameters and interactions that need to be standardized to streamline the process and ensure full compatibility between vendors.
1.3.3.5 Design Kits

The process of designing chip stacks should be independent of the supply chain that will produce the actual product, with design kits acting as the interfaces to the manufacturing vendors, as is the case nowadays with conventional ICs. This implies the creation of the necessary integrated design kits and additional information and tools for designers that will span the responsibility of multiple vendors.
1.4 Purpose of this Book

A large body of literature already exists on individual topics of chip stacking and three-dimensional integration, ranging from the issues revolving around the process and packaging technologies (process modules and steps, materials, issues with characterization and metrology, yield, etc.) to issues related to the design of systems
using this novel integration approach (building circuits, architectures, and systems using stacks of homogeneous or heterogeneous chips). Since this is still an integration technology in development, the process side has been explored in much more detail and consolidated information can be found in various publications [1–4]. The design of three-dimensionally integrated circuits [5] and especially entire systems has received much less attention up to now. The purpose of this book is to provide an overview of the entire trajectory from basic process technology issues to the design at the system level of three-dimensionally integrated nano-electronic systems. The emphasis has been put on the design side, physical design and design at the architecture and system level, because the technology is entering the maturity stage and these issues are starting to become very important. The book is intended for an audience with a basic grasp of electrical engineering concepts including some familiarity with fabrication of semiconductor devices, very large scale integration (VLSI) and computer architecture.
1.5 Book Contents

The book can be roughly divided into three main sections. The first section, which includes Chaps. 2–4, explains the issues related to the process technology itself as well as some issues revolving around the electrical modeling of the vertical through silicon vias. Chapter 2 provides a historical perspective of three-dimensional integration and dives deeper into the details of the process technology required to stack multiple die together and interconnect them with through silicon vias. Chapter 3 discusses in detail how to model the electrical characteristics of these vias in order to create the necessary design kits. Chapter 4 starts with a high-level outline of the basic considerations involved in three-dimensional integration and goes on to illustrate an industrial example of integration of homogeneous die. The second section, Chaps. 5–7, discusses issues in the context of the physical design of three-dimensionally integrated ICs. Chapter 5 provides an overview of the entire trajectory of the physical design process for three-dimensionally integrated systems from a chip designer's point of view. Chapter 6 goes into more detail on the routing strategies required for optimal interconnection of the different die. Chapter 7 discusses why and how the process technology and the design of 3D SICs need to interact with each other in order to result in an optimal system implementation that takes full advantage of what the technology offers. The third and final section, Chaps. 8 and 9, outlines the architecture and system design aspects that come with three-dimensional integration. Chapter 8 goes into detail on the design of a system with a DRAM stacked on top of a microprocessor to exploit the additional interconnection density between the two. Chapter 9 discusses how to optimize microprocessor design given the additional degrees of freedom provided by three-dimensional integration.
Finally, Chap. 10 provides an outlook on the markets and applications that are expected to drive the development of both the process technology and the design environment and tools in the future.
References

1. Garrou P, Bower C, Ramm P (2008) Handbook of 3D integration: technology and applications of 3D integrated circuits. Wiley-VCH, Weinheim
2. Bakir M, Meindl J (2008) Integrated interconnect technologies for 3D nanoelectronic systems. Artech House, Norwood
3. Tan C-S, Gutmann R, Reif L (2008) Wafer level 3-D ICs process technology. Springer, New York
4. Deng Y, Maly W (2010) 3-Dimensional VLSI: a 2.5-dimensional integration scheme. Springer, New York
5. Pavlidis V, Friedman E (2008) Three-dimensional integrated circuit design. Morgan Kaufmann, San Francisco
Chapter 2
TSV-Based 3D Integration
James Burns
2.1 Introduction

2.1.1 Initial Studies and Experiments

Theoretical studies in the 1980s [1, 2] suggested that significant reductions in signal delay and power consumption could be achieved with 3D integrated circuits (3D ICs). A 3D IC is a chip that consists of multiple tiers of thinned, active 2D integrated circuits (2D ICs) that are stacked, bonded, and electrically connected with vertical vias formed through silicon or oxide layers and whose placement within the tiers is discretionary. The term "tier" is used to distinguish the transferred layers of a 3D IC from design and physical layers; a tier is the functional section of a chip or wafer that consists of the active silicon, the interconnect, and, for a silicon-on-insulator (SOI) wafer, the buried oxide (BOX). The basic features of a 3D IC are illustrated in Fig. 2.1 in a symbolic drawing along with a cross-section of an actual 3D IC. The TSV (through silicon via) is an essential feature of 3D IC technology and is the vertical electrical connection formed between tiers and through silicon or oxide. A TSV is formed by aligning, defining, and etching a cavity between two tiers to expose an electrode in the lower tier; lining the sidewalls of the cavity with an insulator; and filling the cavity with metal or doped polysilicon to complete the connection. A TSV drawing and a cross-section of a TSV are shown in Fig. 2.2. A 3D IC technology was viewed as necessary to maintain integrated circuit performance on the path described by Moore's law. One issue was the projected increase in chip operating frequencies that would lead to different clock rise and
Fig. 2.1 (a) An expanded view that illustrates that a 3D IC consists of 2D ICs that are thinned, bonded together, and interconnected with TSVs distributed within the planes of the 2D ICs. (b) A cross-section of a 3D ring oscillator built with a fully depleted SOI (FDSOI) technology
fall times within a chip, as shown in Fig. 2.3. Numerous theoretical studies examined the performance of 3D ICs as a function of the number of active tiers and the placement of memory, logic, and other functions among and within the active tiers, but early attempts to build even rudimentary 3D ICs were unsuccessful. Those 3D ICs were constructed using epitaxial overgrowth or polysilicon deposition [3] to stack silicon layers, but the transistor characteristics or transistor densities were unsatisfactory. Attempts to create vertical connections through silicon chips were frustrated by the inability to uniformly thin the chips to less than 50 µm and to insulate deep cuts etched through the thinned chips. At the same time, IC technology developments led to tighter design rules and improved transistor performance so that IC progress continued to satisfy Moore's law. Within the last 10 years it became clear that Moore's law could not be met solely by transistor design and fabrication innovations. Therefore, the development of an alternate technology to design and construct microelectronic systems as 3D devices became essential.
Fig. 2.2 (a) A drawing of a TSV and (b) an SEM of two parallel TSVs that are ~100 µm deep with an aspect ratio of 20, courtesy of IBM
2.1.2 Advanced 3D Packaging

Many of the gains projected for a 3D IC technology were achieved by advancements in packaging that reduced the interconnect length among chips [4]. Multichip modules, stacked edge-connected chips, and ball-bonded chips are examples of past packaging innovations. Further innovations included dual inline packages (DIPs) configured to stack two DIPs with pins inserted into pins, and stacked ceramic modules with chips bump-bonded face down to increase the functional density of the
Fig. 2.3 Technology improvements have produced increased chip clock frequencies. The radius of the circle shows the distance a data signal can propagate within a single-clock-cycle as a function of clock frequency and indicates that above ~4 GHz data cannot be reliably clocked across a 20 mm × 20 mm die
package. Products that contain these advanced packages are in today’s market place with cell phones being the consumer product that illustrates advanced packaging used to maximize performance, ease of use, and at a minimum size, particularly its thickness. However, the interconnect density in the direction perpendicular to the plane of the cell phone’s circuit boards is still a small fraction of the multilevel metallization vias in any of its ICs. This suggests that future cell phones could be more compact and functional if constructed with 3D ICs. Recent innovations in 3D packaging include a 3D IC system developed by ChipPac as shown in Fig. 2.4 where chips are stacked, bonded, and interconnected at the chips’ edges and the stack assembly is attached and connected to a chip carrier. Another 3D packaging approach developed by Irvine Sensors is shown in Fig. 2.5. In this concept the interconnect is designed to extend to the edge of each chip so that when several chips were stacked and bonded together the edge of the assembly could be polished to expose the interconnect at the chips’ edges. The compact 3D system was completed by depositing and patterning interconnect metal on the stack’s edges. In a more recent approach to building 3D chips, 2D chips are stacked and either bump- or adhesively bonded to a base wafer. In this design vertical connections are achieved within but not through the chips. The sizes of the chips can be different which permits the integration of chips from different sources and different technologies, but the alignment of the pads on the base wafer and the chips to be attached must be compatible. A Geiger counting imager [5], shown in Fig. 2.6, is an example of such a technology. The base wafer is a CMOS readout to which an avalanche photo diode array is adhesively attached and connections between the imager and the CMOS readout are made by deposited aluminum. The interconnect design limits
2 TSV-Based 3D Integration
17
Fig. 2.4 A 3D package by ChipPac consists of four chips that are stacked and bonded. The chips are electrically connected to each other and to the chip carrier by wire bonds
Fig. 2.5 Irvine Sensor’s Neo-StackTM technology accommodates a variety of different sized chips that are stacked and edge connected to make a module of 4–50 layers that is less than 13 mm thick
the vertical connection density and image fill factor of the 3D chip. Similar 3D chips have been made using bump bonding techniques, but the density is limited by the size of the bond pads which are a function of the chip–wafer alignment budget and the bond pad size. Note that any of the preceding approaches to 3D construction can be embedded into a multichip module to further increase the packing density of the
Fig. 2.6 A 32 × 32 avalanche photodiode array bump-bonded to a CMOS readout with a 100-µm pitch and a 5% fill factor
system. However, none of these approaches has the ability to achieve vertical connections that are distributed within the chip’s area with vertical interconnect densities that approach the density of back-end-of-the-line (BEOL) vias.
2.1.3 Recent Progress in 3D IC Technology

Within the last 10 years significant progress has been made in solving a fundamental set of 3D fabrication challenges.
• High-strength and void-free chip- and wafer-bonding processes [6] evolved from CMP techniques used to fabricate SOI wafers. Further refinements of those CMP processes made it possible to thin wafers to less than 50 µm with a total thickness variation of less than 1 µm and without altering transistor parameters.
• The migration of SOI device technology into commercial production presented an alternate path to thin wafers by using the buried oxide as an etch stop, setting the final tier thickness to the sum of the BOX, the SOI layer, and the interconnect layers [7].
• Deep oxide- and silicon-etching equipment and processes were developed to etch 50-µm-deep cavities with near-vertical sidewalls.
• Similar improvements in dielectric and metal deposition techniques made it possible to reliably coat the cavities with a dielectric and fill them with conductors such as polysilicon or tungsten to form TSVs [8].
• Thermal compression bonding of copper TSVs became another chip–wafer or wafer–wafer bonding option [9].
• Alignment equipment originally developed for thick-film processes was modified to permit improved wafer–wafer alignment with overlays of less than 2 µm.
• More recently, the development of a wafer–wafer alignment system that incorporates wafer-stepper alignment principles has demonstrated wafer–wafer overlays of less than 500 nm [10].
Numerous institutions are developing 3D IC technologies based on either wafer–wafer or chip–wafer bonding [11–15], and it is conceivable that the topology of advanced 3D ICs will be tailored to the application, such as a lens-shaped chip for an imaging system or even an irregular surface that mimics the surface of the human brain. No consensus has emerged as to the optimum path to achieve a 3D IC capability, but current progress has shown that a 3D IC technology is essential and probable rather than desirable and possible.
2.1.4 3D IC Technology in the ITRS Roadmap

The 2007 ITRS roadmap identified the interconnection problem as one of the near-term (through 2015) "grand challenges", since additional device and interconnect scaling alone could not deliver the required increase in IC performance. A 3D technology with TSVs aligned on a tight pitch was one of the new technologies identified to meet that challenge. The 2008 update [16] of the roadmap included increased emphasis on 3D IC technology development and specified a set of TSV critical dimensions based on a stacked-wafer model with wafers thinned to 10 µm. The TSV parameters analyzed are illustrated in Fig. 2.7 and a summary of the update is contained in Table 2.1. The wafer–wafer or chip–wafer alignment problem was defined as a major obstacle to scaling the vertical interconnects, and new approaches to the alignment problem were seen as necessary. Additional issues such as 3D IC design and thermal–mechanical modeling tools were identified as future challenges.
Fig. 2.7 The TSV parameters of Table 2.1 included in the 2008 ITRS update
Table 2.1 High-density through silicon via projections in the 2008 ITRS update

Principal parameter              2008   2009   2010   2011   2012   2013   2014   2015
TSV diameter, D (µm)             1.6    1.5    1.4    1.3    1.3    1.2    1.2    1.0
TSV pitch, P (µm)                5.6    5.5    4.4    3.8    3.8    2.7    2.6    2.5
Pad spacing, S (µm)              1      1      1      0.5    0.5    0.5    0.5    0.5
Pad diameter, PD (µm)            4.6    4.5    3.4    3.3    3.3    2.2    2.1    2.0
Bonding accuracy (µm, 3 sigma)   1.5    1.5    1      1      1      0.5    0.5    0.5
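As a rough illustration of what the projected pitch reduction means, the snippet below converts the pitch row of Table 2.1 into an idealized area-array TSV density (1/pitch²). This conversion is a simplification of my own; it ignores keep-out regions and routing constraints.

# Sketch: idealized area-array TSV density implied by the ITRS pitch projections.
itrs_tsv_pitch_um = {2008: 5.6, 2009: 5.5, 2010: 4.4, 2011: 3.8,
                     2012: 3.8, 2013: 2.7, 2014: 2.6, 2015: 2.5}

for year, pitch_um in sorted(itrs_tsv_pitch_um.items()):
    density_per_mm2 = 1.0 / (pitch_um * 1e-3) ** 2
    print(f"{year}: {pitch_um} um pitch -> ~{density_per_mm2:,.0f} TSVs/mm^2")

Going from a 5.6-µm pitch in 2008 to a 2.5-µm pitch in 2015 corresponds to roughly a fivefold increase in the available vertical connection density.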
2.2 TSV-3D Integration Technologies

2.2.1 Introduction

The TSV is the structure that has the greatest potential for widespread use in advanced 3D ICs because it can be scaled to achieve a vertical connection density that approaches the density of 2D vias, and its low electrical resistance, parasitic capacitance, and parasitic inductance are compatible with the requirements of advanced microelectronic systems. Equally important, TSV fabrication uses thin-film processes typical of BEOL technologies, unlike ball bonds, wire bonds, tape bonds, or solder bonds, which are thick-film technologies with limited potential to scale the connections. All 3D IC fabrication processes comprise three basic steps, namely wafer thinning, TSV etching and filling, and tier bonding (Fig. 2.1). Depending on the sequence of these steps we can distinguish between different approaches. A process is described as "TSV first" or "TSV last" if the TSVs are fabricated before or after tier bonding, respectively, and the order in which TSVs are fabricated within a 3D IC process is an important decision to be made before developing a 3D IC technology. The process flows for TSV first, TSV last, and TSV middle (an intermediate flow) are contained in Table 2.2 [17], and 3D IC process flows for TSV-first and TSV-last technologies are illustrated in Figs. 2.8 and 2.9, respectively.
2.2.2 TSV Design

The goal of TSV design is to minimize the size and pitch of the TSV without exceeding the maximum resistance permitted by an application. This combination results in a TSV whose low capacitance and resistance lead to a power-efficient design that meets the system's performance requirements. The resistance, R, of a TSV can be calculated as a function of the features illustrated in Fig. 2.7:
R = ρL / [(D − 2d)((D − 2d) + 2L tan Θ)]    (2.1)
Table 2.2 TSV process flows

TSV first: etch deep silicon cavities; insulate cavities; fill cavities with a conductor; fabricate BEOL interconnect; bond wafer pair; thin backside of upper wafer; fabricate BEOL interconnect on upper wafer.

TSV middle: etch deep silicon cavities; insulate cavities; fabricate transistors; fill cavities with a conductor; fabricate BEOL interconnect; bond wafer pair; thin backside of upper wafer.

TSV last: fabricate transistors; fabricate BEOL interconnect; bond wafer pair; thin backside of upper wafer; backside etch deep silicon cavities; insulate cavities; fill cavities with conductor.
Fig. 2.8 The TSV-first process flow for a three-tier 3D IC. With the exception of the base chip, #1, all wafers have TSVs formed before first metallization. The assembly of a two-tier 3D device from wafers 1 and 2 is shown in column (a). The glass layer is bonded to wafer #2 to provide support during substrate thinning to expose the tips of the TSVs. Wafer #2 is aligned to bond pads on the lower tier and bonds are formed through contact with metal pads on the lower tier. After bonding the glass layer is removed. (b) Wafer 3 is added to the two-tier assembly to form a three tier 3D device
Fig. 2.9 The TSV-last process flow for a two-tier 3D IC is shown in column (a). After bonding wafer 2 to wafer 1, the substrate of wafer 2 is thinned and then TSVs are formed between the wafers. A continuation of the TSV-last flow is shown in column (b), in which wafer 3 is bonded to the two-tier 3D assembly. Substrate thinning of wafer 3 and TSV formation complete the 3D process flow
where ρ is the resistivity of the metal plug, L is the length of the TSV, D is the size of the TSV, which is assumed to be square at the bottom contact, d is the thickness of the dielectric on the TSV sidewalls, and Θ is the taper angle of the TSV cavity. Equation (2.1) indicates that it is desirable to reduce the dielectric thickness on the sidewalls in order to minimize the TSV resistance. From (2.1) we obtain the required size of the bottom contact, D, as follows:
D = 2d − L tan Θ + √((L tan Θ)² + ρL/R)    (2.2)
We require that the TSV plug be fully landed; that is, the plug must not extend beyond the lower metal pad. From this we determine the TSV pitch to be:
P = 2WA + D + S + 2L tan Θ    (2.3)
where WA is the wafer–wafer alignment overlay and S is the minimum metal–metal spacing permitted in the 2D design rules. The final expression is as follows:
P = 2WA + 2d + S + L tan Θ + √((L tan Θ)² + ρL/R)    (2.4)
Equation (2.4) indicates that a minimum TSV pitch requires that the TSV etch process produce vertical sidewalls, that the dielectric deposition process produce thin, pinhole-free dielectrics, and, most importantly, that the alignment tool be capable of sub-micron wafer–wafer alignment.
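A small numeric sketch of Eqs. (2.1), (2.2), and (2.4) is given below. The tungsten resistivity, TSV length, sidewall dielectric thickness, taper angle, target resistance, alignment overlay, and metal spacing are all assumed example inputs, not design-rule values from this chapter.

import math

def tsv_resistance(rho, L, D, d, theta):
    """Eq. (2.1): resistance of a tapered, dielectric-lined TSV plug (ohms)."""
    core = D - 2 * d                      # conductor size at the bottom contact
    return rho * L / (core * (core + 2 * L * math.tan(theta)))

def tsv_bottom_size(rho, L, R, d, theta):
    """Eq. (2.2): bottom contact size D needed to meet a target resistance R."""
    t = L * math.tan(theta)
    return 2 * d - t + math.sqrt(t ** 2 + rho * L / R)

def tsv_pitch(WA, d, S, rho, L, R, theta):
    """Eq. (2.4): minimum fully landed TSV pitch."""
    t = L * math.tan(theta)
    return 2 * WA + 2 * d + S + t + math.sqrt(t ** 2 + rho * L / R)

# Assumed example inputs (lengths in micrometers, resistance in ohms):
rho_w = 5.6e-8 * 1e6          # tungsten resistivity ~5.6e-8 ohm*m, in ohm*um
L, d, theta = 10.0, 0.5, math.radians(1.0)
R_target, WA, S = 0.5, 0.5, 0.5

D = tsv_bottom_size(rho_w, L, R_target, d, theta)
print(f"D = {D:.2f} um")
print(f"R = {tsv_resistance(rho_w, L, D, d, theta):.2f} ohm")
print(f"P = {tsv_pitch(WA, d, S, rho_w, L, R_target, theta):.2f} um")

With these assumed inputs the pitch lands in the few-micrometer range projected by Table 2.1, and the form of (2.4) makes clear that the alignment overlay WA enters the minimum pitch directly.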
2.2.3 SOI-Based TSV Technology

TSVs used in the fabrication of SOI-based 3D ICs [14] do not require deposition of a dielectric layer, since the TSVs are placed in the field oxide regions of the ICs. As a result, the TSV process is simpler and, for the same pitch, the connection resistance is lower than that of a TSV through bulk silicon, since the entire TSV cut is filled with a conductor. Because the TSVs are embedded in the field oxide, the parasitic capacitance and inductance between adjacent connections are reduced. A TSV design used in an SOI-based 3D IC technology is shown in Fig. 2.10. An SOI-based 3D IC technology is not the impediment it had been in the past, since SOI wafer fabrication has emerged as a mainline technology for high-performance ICs and has also transitioned to a foundry.
Fig. 2.10 (a) Cross-sectional and (b) isometric drawing of a TSV used in a SOI-based 3D IC technology. The tungsten plug connects the metal annulus in the upper tier to the metal pad (3D land) in the lower tier. The top of the plug is defined by a resist mask; the metal annulus defines the size of the plug at the 3D land
2.3 TSV Process Integration

2.3.1 Stack Alignment

The TSV pitch is a critical factor in the viability of the 3D technology, since for optimal circuit density the minimum pitch of the TSVs should be comparable to that of the 2D vias that connect multilevel metal layers. The principal limit on the TSV pitch has been wafer–wafer alignment, as described in the 2008 ITRS roadmap and as seen in (2.4). As an example, consider the layout of the 3D ring oscillator in Fig. 2.1b. The device was successfully fabricated using an SOI-based 3D technology with inverters in two tiers that were connected with TSVs, so that the oscillator's signal cycled between tiers as shown in the figure. In the design shown in Fig. 2.10 the bottom metal contact of the TSV is the 3D Land and the TSV cut is the size of the resist mask used to etch the TSV cavity. The 3D Land is:
3D Land = 2WA + D    (2.5)
In the initial set of 3D design rules the 3D Land was 5.5 µm since the wafer–wafer alignment overlay was 2 µm and D was 1.5 µm. An improved alignment system [10] with an overlay of 0.5 µm and an improved TSV etch process with D = 0.5 µm permitted scaling the 3D Land to 1.5 µm and a reduction in pitch from 9 to 4.5 µm. This example illustrates that alignment overlay is a fundamental impediment to decreasing the size of TSVs.
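The two design points quoted above follow directly from Eq. (2.5); a short check:

# Eq. (2.5): 3D Land = 2*WA + D, evaluated at the two design points in the text.
def land(WA_um, D_um):
    return 2 * WA_um + D_um

print(land(2.0, 1.5))   # initial rules: 2-um overlay, 1.5-um cut -> 5.5 um
print(land(0.5, 0.5))   # improved alignment and etch -> 1.5 um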
2.3.2 Stack Bonding

Two issues are dominant in the development of a proper bonding technique. The first is establishing the bonding process itself together with the appropriate materials. The second involves maintaining the mechanical stability of the individual tiers and the complete stack during and after the bonding process. Early attempts to build 3D ICs used adhesives as the bonding technology [18]. Experiments indicated that the TSV pitch could not be scaled to less than 6 µm with an adhesive bond due to outgassing from the adhesive. In addition, the adhesive was not sufficiently stable to develop 3D ICs with three or more tiers. That led to the development of alternate technologies such as low-temperature oxide–oxide bond [6] and metal–metal thermal compression bond processes [9]. Any bond process must be compatible with the alignment technique used, so that the alignment is not degraded during bonding, and the process must also be consistent with stacking more than two tiers for those cases where a 3D IC composed of three or more tiers is required. These requirements mean that the thermal processes of layer bonding must not weaken any bonds or TSV connections previously established.
Metal–metal and metal–oxide bonding techniques have been used for years in lead attachments and glass–metal seals, and that technology has been extended to 3D integration. Copper–copper bonding is an attractive candidate for TSV-first processes since it creates a strong bond and an electrical connection simultaneously, and the copper layers can aid heat extraction. However, the TSV conductor cannot be a part of the wafer-bonding process for TSV-last processes. In any bonding process, cleanliness is essential to eliminate voids created by particles. An additional void-creation mechanism is gaseous reaction products formed during silicon and oxide bonding, which cause bubbles that become bond voids. The bond is initially formed at room temperature immediately following alignment, and additional processes are required to increase the bond's strength. When successfully implemented, the bond process leads to a very thin bond with no voids and has the strength to maintain wafer–wafer alignment during the addition of one or more tiers to the system. Wafer distortion during fabrication must be controlled, since excess deformation will decrease the wafer bond strength, particularly for oxide-fusion processes where intimate contact between surfaces is required to initiate and establish a strong bond. Wafer deformation is reduced by using compensating films on the backside of wafers to decrease the deformation to less than 50 µm for a 150-mm-diameter wafer, a value that will not compromise the bond process. In addition, wafer distortion, sometimes called continental drift, can lead to misalignment between critical features of the TSV. In principle it could be possible to compensate for distortion by measuring the wafer distortion and offsetting it with a wafer chuck having the ability to maintain temperature gradients. Ultimately, a process must be characterized to determine the amount of distortion that occurs so that the value can be added to the design of the TSV, since distortion is an additive factor in misalignment.
2.3.3 TSV Etching and Filling

In order to minimize the TSV pitch, an etch chemistry is required that maximizes the aspect ratio of the TSV cavity, defined as the ratio of the depth to the width of the cavity. In addition, the etch process must be selected to avoid resist erosion, which can increase the lateral size of the cavity at the surface. A mask that is not attacked by the etch process, such as aluminum, can be used to maintain the dimensional integrity of the mask, but its removal can be a challenge. In all cases mask removal must be compatible with the TSV metals used, particularly for TSV-last processes. Etch reaction products must be monitored and controlled since they can alter or block the deep cavity etch [19]. In the case of a TSV cavity formed through silicon, a pinhole-free insulator must be deposited on the sidewalls, the bottom corners, and the top edges to provide a reliable insulating coating for yield and reliability. Another etch process is required to remove the insulator from the bottom of the cavity to expose the metal pad without attacking the
insulator on the sidewalls or the metal pad. This challenge does not exist for SOI-based 3D IC technologies, since the TSVs are formed in the field oxide. Metal deposition processes must have thermal cycles that do not degrade the 3D bonding and TSV structures previously established. A lesser problem for a TSV design, but one that requires some thought, is the possibility of field-induced leakage due to a lightly doped substrate, so sidewall doping may be required. A TSV-last process that includes SOI and bulk wafer fabrication, oxide fusion bonding, and oxide etching to form 8-µm TSV cuts with tungsten connections was used to fabricate three-layer 3D ICs [14]. The bond temperature never exceeded 275°C, while the oxide and tungsten deposition temperatures were 450°C and 475°C, respectively. Finally, a thermal analysis of the entire 3D IC fabrication process is required to ensure that the 3D IC is thermally stable and that the electronic properties of the devices have not been degraded at the completion of the 3D IC process.
2.4 Characterization of TSV Processes

2.4.1 Physical Characterization

Stack bond strength is evaluated by the crack insertion test [20] using a pair of witness wafers. The percentage of bond voids is measured with infrared microscopy using a pair of witness wafers or with an acoustic imaging system using a bonded pair. Experience has shown that the extensive use of witness wafers is essential to maintain control of the bond process, since the bond strength measurement is a destructive measurement and a defective bond can lead to the loss of the entire 3D assembly. Wafer–wafer alignment after bonding and substrate thinning is measured using standard lithographic metrology tools, but the optical path between alignment structures in both tiers must be free of opaque layers, which places restrictions on the design of alignment targets.
2.4.2 Electrical Characterization

Active and passive test devices are required for in-process and completed-assembly analysis. Electrical test structures can also be used to measure stack alignment after 3D fabrication, either by measuring the via resistance or by measuring the yield of TSV structures designed with varying degrees of misalignment between the TSV cut and pad. The TSV design can be optimized and the process margins determined by measuring the yield of TSV chains designed with a set of TSV sizes and misalignment values. Test transistors are required for testing during the 3D assembly process and at the completion of 3D assembly in order to determine if the 3D process has degraded the devices. Witness wafers with TSV chains and test transistors are valuable tools to
aid the development and control of the 3D IC process. TSVs are required to reach transistors in each of the tiers to determine whether heat dissipation, oxide charge within the 3D assembly, or processing problems have altered transistor performance. In all cases it is essential that those same transistors be measured before the initiation of the 3D process in order to accurately measure device changes caused by it. TSV structures are also required for the extraction of resistance, capacitance, and inductance and to optimize CAD software for the analysis of circuit performance. It is important to place standard circuits such as ring oscillators, SRAMs, and counters in each of the tiers to assess fabrication effects on circuit yield, and at least one of the circuit types should have components distributed across the tiers, since diagnosing the failure of large 3D circuits is difficult. A ring oscillator is a good candidate for this purpose.
2.4.3 3D vs. 2D Chip Yield

The major yield detractors for a 3D IC technology are bond defects from stacking, TSV opens, TSV shorts to polysilicon or to the metal interconnect due to stack misalignment, and changes of device parameters due to 3D processing. The latter yield detractor can sometimes be minimized by a post-processing sinter. Singulation, the dicing of a 3D wafer stack into individual 3D chips, can be a loss mechanism if the 3D assembly delaminates in the dicing streets, since the thinned tiers are susceptible to fracture from dicing debris. This failure mechanism can be minimized by layout practices that keep the streets free of metals and polysilicon. Heat buildup within the stack is a uniquely 3D failure mechanism that can cause parametric drift and circuit failure [21–22], and the removal of heat from a 3D IC chip is a major challenge confronting 3D technology development.
2.5 TSV-Based Chips

2.5.1 3D Design Challenges

Within the last 5 years, 2D circuit design, layout, and circuit extraction tools have been adapted to design 3D circuits, and work continues to improve these tools [23–24]. One challenge is to make the CAD tools compatible with the different 3D technologies, such as TSV-first versus TSV-last and bulk-wafer versus SOI-based 3D technology. 3D IC is an evolving technology, and to be effective, CAD tool development must take place in step with it; this includes the development of improved TSV models so that the tools can better simulate 3D IC circuit performance. Visualization of 3D chips from the design data, both in cross section and in an exploded view, is essential both for the chip designer and for the technologist to support yield studies and failure analysis as part of the fabrication and development effort (Fig. 2.11).
Fig. 2.11 (a) The landing pad of an SOI-based 3D inverter fabricated with the TSV of Fig. 2.9 is reduced from 5.5 µm (b) to 1.5 µm by a reduction of alignment overlay error and TSV size (3D via diameter reduced from 2.0 µm to ~1.0 µm; inter-metal via 0.4 µm). (c) The inverters are located in two tiers and connected by TSVs to form the 3D ring oscillator of Fig. 2.1
2.5.2 Functional 3D Chips with TSV

A 3D imager is an obvious application of 3D because it can achieve a 100% fill factor, that is, an imaging plane unobstructed by interconnections and other opaque features. This is a result that cannot be achieved by scaling a 2D technology. A 3D imager [25] reported by Lincoln Laboratory is shown in Fig. 2.12. 3D IC chips with memory attached to a processor chip are another important application, since dense TSVs will minimize routing delays as well as the delay in obtaining information from the memory, and numerous institutions have reported such 3D chips. 3D has been touted as a natural application for ICs with mixed materials and/or mixed technologies. Mixed-material 3D ICs are of particular interest for building imagers that operate in the ultraviolet and infrared spectra but utilize silicon readouts, and the integration of indium phosphide with silicon CMOS has already been demonstrated [26]. 3D ICs constructed with different processes are also of particular interest, and a mixed silicon-technology chip has been demonstrated that was composed of three tiers (a photodiode tier, a 3.3-V CMOS tier, and a tier with 1.2- and 1.5-V CMOS) to form an avalanche photodiode imager [27].
Fig. 2.12 (a) Captured raw image from digital tile at 10 fps, with digital data read out in 1 ms and (b) cross-sectional SEM micrograph through functional 3D-integrated active-pixel imager
2.6 Future Challenges

The lack of access to 3D fabrication by circuit designers has been an impediment for 3D design and has limited a full exploration of 3D opportunities. However, several institutions have opened their technology to external designers who can
Fig. 2.13 (a) Photograph of a 22 × 22 mm 3D multiproject chip fabricated with a SOI-based 3D technology and (b) a list of 3D circuits designed by members of the 3D community: 3D circuits (FPGA; stacked memory, SRAM and CAM; asynchronous microprocessor; FFT with on-chip memory; multi-processor chip with high-speed RF interconnect; ASIC with DC-DC converter; reconfigurable ΔΣ modulator; decoder with 3-cube torus network; self-powered and mixed-signal RF chips), 3D imaging applications (ILC pixel readout; high-speed imaging FPA; 3D adaptive image processor; artificial bio-optical sensor array; 3D retina; 3D-integrated MEMS biosensor; sensor lock-in amplifier), and 3D technology characterization (3D signal distribution; 3D interconnect methods; parasitic RF and 3D radiation test structures). Multiproject programs will be the key to future 3D IC design
submit their IC designs for 3D fabrication. Lincoln Laboratory has published 3D design rules and completed three 3D multiproject fabrication runs based on a three-tier technology with TSVs. A photograph of the 3D chip and a list of the 3D circuits contained in it are shown in Fig. 2.13. Tezzaron has acquired funding to support a 3D multiproject program, and IMEC has also established a multiproject capability. However, continued financial support from the microelectronics industry or governments will be required to sustain access to 3D technologies until IC houses commit to establishing a 3D capability. An encouraging development that will aid 3D IC fabrication efforts is the formation of a consortium by equipment suppliers to accelerate the development of 3D IC-specific equipment. The TSV design must continue to be scaled if 3D is to satisfy new applications. Fabrication advancements and improved wafer–wafer alignment tooling will scale future TSVs. However, each TSV requires a surrounding zone in which there can be no silicon, polysilicon, or metal interconnect, thus decreasing the effective density of the chip. As a result, the TSV scaling limit may not be set by feature sizes but by the exclusion zones required by the TSV, and work will be required to minimize the impact of those exclusion zones. Layout tools will be required to better partition circuits among tiers to optimize TSV placement and minimize the impact of TSVs on circuit density. The greatest technological challenge is heat control and removal from within a 3D IC, particularly the embedded layers, and improved CAD tools will be essential for the optimal placement of heat-generating circuits to minimize heat effects on 3D IC performance. There are also challenges in the factory connected with yield. A wafer-scale 3D technology represents the greatest yield risk, since a bond failure could doom at least two wafers if not more. For that reason chip–wafer circuits will continue to receive
careful attention as a way to work around the yield problem. Yield can also suffer from the coincidence of defective transistors that would cause 3D circuit failure. Redundancy techniques can be used initially to work around defective regions, but eventually test techniques will be required to map the good regions of each wafer in order to match wafers of varying quality prior to 3D assembly. Technology development will be required to introduce thermally conductive features such as heat pipes or cooling channels between or within the tiers to further minimize heat effects on circuits. In spite of these challenges, the demonstrations of functional 3D imagers, 3D processors with stacked memory, 3D CMOS with stacked MEMS, and 3D ICs with mixed technologies and materials are proofs of concept that a 3D IC technology in the marketplace is within sight.
Acknowledgments The work was sponsored by the Defense Advanced Research Projects Agency under Air Force contract #FA8721-05-C-0002. Opinions, interpretations, conclusions, and recommendations are those of the authors and are not necessarily endorsed by the United States Government.
References

1. Reber M, Tielert R (1996) Benefits of vertically stacked integrated circuits for sequential logic. In: Proceedings of the IEEE international symposium on circuits and systems, vol 4, pp 121–124
2. Akasaka Y (1986) Three-dimensional IC trends. Proc IEEE 74(12):1703–1714
3. Chan VWC, Chan PCH, Chan M (2000) Three dimensional CMOS integrated circuits on large grain polysilicon films. In: Technical digest – IEEE international electron devices meeting, pp 161–164
4. Lea R, Jalowiecki I, Boughton D, Yamaguchi J, Pepe A, Ozguz V, Carson J (1999) A 3-D stacked chip packaging solution for miniaturized massively parallel processing. IEEE Trans Adv Packag 22(6):424–432
5. Aull BF, Loomis AH, Gregory J, Young D (1998) Geiger-mode avalanche photodiode arrays integrated with CMOS timing circuits. In: IEEE annual device research conference digest, pp 58–59
6. Warner K, Burns J, Keast C, Kunz R, Lennon D, Loomis A, Mowers W, Yost D (2002) Low-temperature oxide-bonded three-dimensional integrated circuits. In: IEEE international SOI conference proceedings, pp 123–124
7. Burns J, McIlrath L, Hopwood J, Keast C, Vu DP, Warner K, Wyatt P (2000) An SOI-based three-dimensional integrated circuit technology. In: IEEE international SOI conference proceedings, pp 20–21
8. Topol A, Tulipe D, Shi S, Alam S, Frank D, Steen S, Vichiconti J, Posillico D, Cobb M, Medd S, Patel J, Goma S, DiMilia D, Farinelli M, Wang C, Conti R, Canaperi D, Deligianni L, Kumar A, Kwietniak T, D'Emic C, Ott J, Young A, Ieong M (2005) Enabling SOI-based assembly technology for three-dimensional (3D) integrated circuits (ICs). In: Technical digest – IEEE international electron devices meeting, pp 363–366
9. Reif R et al (2002) 3-D interconnects using Cu wafer bonding: technology and applications. In: Advanced metallization conference (AMC)
10. Warner K, Chen C, D'Onofrio R, Keast C, Poesse S (2004) An investigation of wafer-to-wafer alignment tolerances for three-dimensional integrated circuit fabrication. In: IEEE international SOI conference proceedings, pp 71–72
11. Fukushima T, Yamada Y, Kikuchi H, Koyanagi M (2005) New three-dimensional integration technology using self-assembly technique. In: Technical digest – IEEE international electron devices meeting, pp 359–362
12. Topol AW, La Tulipe DC Jr, Shi L, Frank DJ, Bernstein K, Steen SE, Kumar A et al (2006) Three-dimensional integrated circuits. IBM J Res Dev 50(4/5):491–506
13. Tezzaron Semiconductor, Naperville, IL 60563, http://www.tezzaron.com/technology/FaStack.htm
14. Burns JA, Aull BF, Chen CK, Chen C-L, Keast CL, Knecht JM, Suntharalingam V, Warner K, Wyatt PW, Yost D-RW (2006) A wafer-scale 3-D circuit integration technology. IEEE Trans Electron Devices 53(10):2507–2516
15. Van Olmen J, Mercha A, Katti G, Huyghebaert C, Van Aelst J, Seppala E et al, 3D stacked IC demonstration using a through silicon via first approach. IMEC. http://www.imec.be/ScientificReport/SR2008/HTML/1224951.html
16. (2008) International Technology Roadmap for Semiconductors: ITRS. Semiconductor Industry Association, San Jose, CA. http://www.itrs.net/Links/2008ITRS/Home2008.htm
17. Knickerbocker JU, Andry PS, Dang B, Horton RR, Interrante MJ, Patel CS et al (2006) Three-dimensional silicon integration. IBM J Res Dev 50(4/5):553–567
18. Burns J, McIlrath L, Keast C, Lewis C, Loomis A, Warner K, Wyatt P (2001) Three-dimensional integrated circuits for low-power, high-bandwidth systems on a chip. In: Digest of technical papers. IEEE international solid-state circuits conference, pp 268–269, 453
19. Knecht J, Yost D, Burns J, Chen C, Keast C, Warner K (2005) 3D via etch development for 3D circuit integration in FDSOI. In: IEEE international SOI conference proceedings, pp 104–105
20. Maszara WP, Goetz G, Caviglia A, McKitterick JB (1988) Bonding of silicon wafers for silicon-on-insulator. J Appl Phys 64(10):4943–4950
21. Chen CL, Chen CK, Burns JA, Yost D-R, Warner K, Knecht JM, Wyatt PW, Shibles DA, Keast CL (2007) Thermal effects of three dimensional integrated circuit stacks. In: IEEE international SOI conference proceedings, pp 91–92
22. Sri-Jayantha SM, McVicker G, Bernstein K, Knickerbocker JU (2006) Thermal-mechanical modeling of 3D electronic packages. IBM J Res Dev 50(4/5):623–634
23. Mentor Graphics, IC Nanometer Design Tool Suite
24. Cadence Virtuoso Design Tool
25. Suntharalingam V, Berger R, Clark S, Knecht J, Messier A, Newcomb K, Rathman D, Slattery R, Soares A, Stevenson C, Warner K, Young D, Ang LP, Mansoorian B, Shaver D (2009) A four-side tileable, back illuminated, three-dimensionally integrated megapixel CMOS image sensor. In: Digest of technical papers. IEEE international solid-state circuits conference, pp 38–39
26. Warner K, Oakley DC, Donnelly JP, Keast CL, Shaver DC (2006) Layer transfer of FDSOI CMOS to 150 mm InP substrates for mixed-material integration. In: International conference on indium phosphide and related materials, pp 226–228
27. Aull B, Burns J, Chen C, Felton B, Hanson H, Keast C, Knecht J, Loomis A, Renzi M, Soares A, Suntharalingam V, Warner K, Wolfson D, Yost D, Young D (2006) Laser radar imager based on three-dimensional integration of Geiger-mode avalanche photodiodes with two SOI timing-circuit layers. In: Digest of technical papers. IEEE international solid-state circuits conference, pp 304–305
Chapter 3
TSV Characterization and Modeling Michele Stucchi, Guruprasad Katti, and Dimitrios Velenis
3.1 Definition and Structure of a TSV

The through-silicon via (TSV) is composed of a conductor, also called a "nail" or "plug," crossing the Si substrate of the stacked dies [1, 2], as shown in Fig. 3.1. The conductor (common material choices include copper (Cu), tungsten (W), and polysilicon) is electrically insulated from the substrate by a dielectric layer (usually SiO2) and interconnects the metal wires of the stacked dies. The geometry of the TSV conductor may vary depending on the 3D stacking technology. The cross-section carrying the current may have different shapes (square, rectangular, circular, elliptical, or polygonal) [3], and the lateral surface of the conductor can be cylindrical or conical. The TSV interconnection of metal wires in adjacent dies within a 3D stack can follow different schemes [2]. For example, TSVs can connect a Metal 1 (M1) wire of the top die with the topmost MN wire of the bottom die [4] when the latter features an N-level wire hierarchy, as shown in Fig. 3.2. Another scheme connects the topmost MN layers of two adjacent stacked dies. The electrical link established by a TSV between dies can be utilized for any of the typical functions supported by standard 2D interconnects: signal (analog and/or digital), clock, and power supply/ground links. In general, the geometry and the materials used in all the TSVs crossing a die are the same, and this simplifies the optimization of the 3D technology process. Once the TSV technology is fixed, the only degree of freedom for meeting different design requirements on the electrical characteristics of TSV links is to employ more TSVs in parallel in the same link. For example, this can be utilized in power supply distribution networks, where high DC current capability is required and low-resistance interconnects are desirable. Connecting multiple TSVs in parallel can decrease the resistance of power/ground links among the
Fig. 3.1 Through-silicon via providing electrical links among multiple vertically stacked die
Fig. 3.2 Cross-section of a TSV
stacked dies. In addition, the increase in parasitic TSV capacitance due to multiple parallel TSVs can also help compensate dI/dt effects, similar to inserting large decoupling capacitors. The electrical parameters of a TSV, namely RTSV, LTSV, and CTSV, strongly depend on the TSV structure (both geometry and materials). This dependence is discussed in Sect. 3.2.
3.2 Electrical Characteristics of a TSV: RTSV, CTSV, LTSV

In this section, the resistance, inductance, and capacitance of a TSV are modeled and analyzed. The analysis is limited to the low-frequency regime. The parameters are evaluated with static solvers and analytical formulas, and the obtained results are compared with electrical measurements where applicable. The TSV electrical parameter values can be used in basic circuit analysis of 3D interconnect links in digital applications with clock frequencies of a few hundred megahertz. High-frequency analysis and modeling require the use of Maxwell equation solvers [5] and are not considered here. In the discussion that follows, the geometry of the TSV is assumed to be cylindrical; other geometries require the use of the appropriate geometrical models and formulas.
3.2.1 TSV Resistance

The electrical dc resistance of a uniform conductor is given by the well-known formula

R = ρL/A,  (3.1)

where L is the conductor length, A is the conductor cross-section crossed by the electrical current (assumed constant along L), and ρ is the resistivity of the conductor material. For a cylindrical TSV, as shown in Fig. 3.3, A = π(d/2)², where d is the diameter of the conducting plug, given by d = D − 2t, where D is the diameter of the TSV and t is the thickness of the insulator between the conducting plug and the substrate. The thickness T of the substrate after thinning determines the length of the plug; since L = T, for a TSV formula (3.1) becomes

RTSV = ρT / [π(d/2)²].  (3.2)
Equation (3.2) is the resistance introduced by a TSV, assuming that the contact resistances with the metal layers above and below the TSV are negligible. Several values of TSV resistance calculated using (3.2) are listed in Table 3.1 for a reasonable set of geometrical parameters. Copper (Cu) with a resistivity of 1.7 µΩ·cm is assumed as the conducting plug for calculating RTSV in Table 3.1. RTSV simulations by a static solver (Raphael™ by Synopsys [6]) assuming the same TSV geometries are also included in Table 3.1. The simulations are based on a realistic current distribution inside the TSV from the metal connection on the top to the landing pad on the bottom. The results in terms of resistance are consistent with those obtained by using the simple analytical formula (3.2).
Fig. 3.3 Geometrical parameters of a cylindrical TSV used to calculate the TSV resistance RTSV

Table 3.1 Analytical model and Raphael simulation results for RTSV
D (µm)  t (nm)  T (µm)  RTSV (mΩ) (analytical)  RTSV (mΩ) (simulated)  Deviation (%)
2       50      20      118.5                   119.3                  0.6
2       100     20      132.0                   132.8                  0.6
5       50      20      17.8                    17.9                   0.6
5       100     20      18.6                    18.7                   0.6
10      1,000   50      16.7                    16.9                   0.9
10      2,000   50      29.7                    30.0                   0.9

Table 3.2 Comparison of RTSV values obtained by simulations and measurements on TSV chains
Number of TSVs in chain  TSV chain resistance (Ω) (measured)  Single TSV link resistance (mΩ) (measured)  Single TSV link resistance (mΩ) (simulated)  Deviation (%)
8                        0.704                                175.9                                       183.6                                        4.382
98                       10.292                               210.1                                       183.6                                        12.599
998                      106.222                              212.9                                       183.6                                        13.756
For all the TSV dimensions considered for the Cu plug, RTSV is always estimated to be below 200 mΩ. For a mature and optimized 3D bonding process with micrometer-scale TSV geometries, the TSV contact resistance is very small, keeping the total RTSV in the milliohm range. Electrical measurements on a typical TSV chain, illustrated in Fig. 3.4a, b, are listed in Table 3.2 and show a resistance of ~200 mΩ per link, which includes two TSVs plus the metal connections between them and to the next link. Raphael™ simulation results for a single TSV link are illustrated in Fig. 3.4c and show good agreement between measurement and simulation. It is also shown in Table 3.2 that the percent deviation increases with the number
Fig. 3.4 (a) Schematic representation of a TSV chain. (b) SEM picture of a TSV chain after etching away the substrate of the top die. (c) Single TSV chain link simulated in Raphael™
of TSVs in the chain; this can be attributed to a larger spread of the higher values of TSV resistance for longer TSV chains. Specific test structures for measuring TSV resistance can be found in [7].
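As a numerical illustration of (3.2), the short Python sketch below evaluates the analytical resistance for the copper-plug geometries of Table 3.1, using the 1.7 µΩ·cm resistivity quoted above; the results land within about 1% of the tabulated analytical values (the table was evidently computed with a slightly different resistivity constant), and the Raphael simulations themselves are not reproduced here.

import math

RHO_CU = 1.7e-8  # copper resistivity, 1.7 uOhm*cm expressed in Ohm*m

def r_tsv_mohm(D_um, t_nm, T_um):
    # Analytical TSV resistance from (3.2), returned in milliohms.
    d = D_um * 1e-6 - 2 * t_nm * 1e-9        # plug diameter d = D - 2t, in metres
    area = math.pi * (d / 2) ** 2            # plug cross-section A = pi*(d/2)^2
    return RHO_CU * T_um * 1e-6 / area * 1e3

for D, t, T in [(2, 50, 20), (2, 100, 20), (5, 50, 20), (5, 100, 20), (10, 1000, 50), (10, 2000, 50)]:
    print(D, t, T, round(r_tsv_mohm(D, t, T), 1))
# prints ~119.9, 133.6, 18.0, 18.8, 16.9, 30.1 mOhm, within about 1% of the analytical column of Table 3.1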
3.2.2 TSV Capacitance

An individual TSV structure is shown in Figs. 3.5 and 3.6; it can be viewed as a cylindrical MOS capacitor, with the conducting plug as the gate and the substrate contact as ground. The TSV capacitance behavior is therefore similar to MOS behavior [7]: the C–V characteristics of a TSV with a p-Si substrate follow a pattern similar to planar MOS C–V characteristics with a p-Si substrate. Based on the TSV bias (VTSV), there are three distinct regions of operation, namely accumulation, depletion, and inversion, as illustrated in Fig. 3.7.
Accumulation region: Accumulation occurs for a large negative bias of the TSV plug with respect to the p-Si substrate. At this bias, majority carriers accumulate at the Si–SiO2 interface and act as a conductive plate, making the total TSV capacitance equal to the insulator capacitance. For a cylindrical TSV structure, as shown in Fig. 3.6, the insulator capacitance is given by

C = 2πεiL / ln(b/a),  (3.3)
Fig. 3.5 Components of TSV capacitance: the insulator capacitance CINS and the depletion capacitance CDEP
Fig. 3.6 Illustration of capacitance for a cylindrical structure in Si substrate
where L is the TSV length, which is equal to the substrate thickness T, εi is the dielectric constant of the insulator, d = 2a is the diameter of the Cu conductive plug, and D = 2b is the TSV diameter. The thickness of the insulator oxide is therefore t = (b − a) = (D − d)/2. Substituting these TSV parameters, equation (3.3) becomes

CINS = 2πεiT / ln(D/d).  (3.4)
Fig. 3.7 Accumulation, depletion and maximum depletion regions on a C–V curve for a TSV characterized as a metal–insulator–semiconductor capacitor
Depletion region: In a p-type substrate, if a sufficiently positive voltage VTSV is applied to the TSV with respect to the substrate, a depletion region, and consequently a depletion capacitance, is formed. In the depletion region, majority carriers in the substrate are pushed away from the substrate–insulator interface and form a depletion capacitance in series with the insulator capacitance, as shown in Fig. 3.5. As a result, the total MOS capacitance is reduced. The width of the depletion region determines the reduction of the TSV capacitance and depends on several factors, such as the TSV diameter, the oxide thickness of the insulator layer, the work function of the metal, oxide charges, the doping of the substrate, etc. [8]. The detailed analytical model [9, 10] is obtained by solving the 1-D Poisson equation in a cylindrical coordinate system with appropriate boundary conditions; the depletion TSV capacitance is given by

CDEP = 2πεSiT / ln(DDEP/D),  (3.5)
where DDEP is the diameter of the boundary of the depletion region in the substrate surrounding the TSV, as shown in Fig. 3.6, and εSi is the dielectric constant of the Si substrate. DDEP depends on the voltage VTSV applied to the TSV with respect to the Si substrate. When a depletion region is formed, the total TSV capacitance is given by the series combination of CINS and CDEP:

CTSV = CINS·CDEP / (CINS + CDEP).  (3.6)
By substituting (3.4) and (3.5) into (3.6), it is possible to calculate analytically the total capacitance of the TSV in the depletion region.
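As a sketch of how (3.4)–(3.6) are used, the Python fragment below evaluates the insulator capacitance for a 5-µm TSV with a 120-nm oxide liner and 20-µm length (the geometry used in Table 3.3 and Sect. 3.2.2.1) and combines it in series with a depletion capacitance computed for an assumed depletion-boundary diameter; the DDEP value is purely illustrative, since in reality it follows from the bias VTSV and the substrate doping.

import math

EPS0 = 8.854e-12          # vacuum permittivity, F/m
EPS_OX = 3.9 * EPS0       # SiO2 liner (assumed relative permittivity 3.9)
EPS_SI = 11.9 * EPS0      # silicon substrate

def c_ins(D_um, t_nm, T_um):
    # Insulator capacitance from (3.4), in farads.
    d_um = D_um - 2 * t_nm * 1e-3            # plug diameter in micrometres
    return 2 * math.pi * EPS_OX * T_um * 1e-6 / math.log(D_um / d_um)

def c_tsv_depleted(D_um, t_nm, T_um, D_dep_um):
    # Series combination (3.6) of C_INS and the depletion capacitance C_DEP from (3.5).
    cins = c_ins(D_um, t_nm, T_um)
    cdep = 2 * math.pi * EPS_SI * T_um * 1e-6 / math.log(D_dep_um / D_um)
    return cins * cdep / (cins + cdep)

print(c_ins(5, 120, 20) * 1e15)                 # ~88 fF, matching the accumulation value in Table 3.3
print(c_tsv_depleted(5, 120, 20, 7.0) * 1e15)   # ~27 fF for an illustrative D_DEP of 7 um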
Table 3.3 Analytical model and Sdevice™ simulation results for CTSV in accumulation and inversion regions
D (µm)  t (nm)  Na (cm−3)  CINS (fF) accumulation (analytical)  CINS (fF) accumulation (simulation)  Deviation (%)  CTSVmin (fF) (simulation)
5       120     2E15       88.23                                88.4                                 0.199          37.20
5       120     1E17       88.23                                88.4                                 0.199          69.98
2       50      2E15       84.61                                84.77                                0.185          21.65
2       50      1E17       84.61                                84.77                                0.185          52.37
Inversion–minimum depletion capacitance region: For low-frequency signals, the holes and electrons in the p-Si substrate respond to the signal on the gate, and electrons accumulate at the silicon–insulator interface, inverting the substrate type. As a result, the substrate is isolated from the gate and the inversion TSV capacitance increases back to CINS, as shown in Fig. 3.7. For high-frequency signals, in contrast, holes and electrons in the p-Si substrate do not have sufficient time to respond to the changes on the gate, and the depletion width stays pinned at its maximum value, giving the minimum depletion capacitance CTSV = CTSVmin. A comparison between Sdevice™ [11] simulations and the analytical model for the TSV capacitance in accumulation (3.4) is presented in Table 3.3 for various TSV parameters. The analytical model is able to accurately predict CINS in accumulation for different values of TSV diameter and oxide liner thickness. The minimum depletion capacitance CTSVmin can also be calculated by simulation by specifying the work function difference between the metal conductor and the Si substrate, φms = φm − φs, and the doping density of the p-substrate, Na. Simulated values for CTSVmin using Sdevice™ are also listed in Table 3.3.

3.2.2.1 TSV Capacitance vs. Supply Voltage

The TSV C–V behavior shows that the TSV capacitance is voltage dependent and depends on the operating region of the TSV. Furthermore, the TSV capacitance in the accumulation and maximum depletion regions can differ by a factor of 2 or more, according to simulation results for TSVs with 5 µm diameter, 20 µm length, and an oxide insulator thickness of ~120 nm. Such C–V behavior of TSVs can represent a problem and an opportunity at the same time. If the depletion region corresponds to zero or negative TSV voltages, as shown in Fig. 3.8a, the capacitance seen by the signals propagating along TSVs is lower than the insulator capacitance, allowing faster signal propagation (lower RC delay) and lower dynamic energy consumption (E = αCV², where α is the signal activity factor). The problem arises if the depletion region occurs within the range of the supply and signal voltages, as illustrated in Fig. 3.8b; this would create a voltage-dependent TSV capacitance, forcing the design of 3D circuits to be conservative and consequently more energy consuming. Option (c) in Fig. 3.8 would offer the maximum TSV capacitance, which is ideally not desired. Therefore, an upfront C–V characterization
Fig. 3.8 Possible occurrences of the depletion region for TSV capacitance. (a) At negative or zero voltages, (b) at the VDD range, (c) at voltages higher than the TSV operating voltage range
of the TSV capacitance for a given 3D technology, prior to the circuit design phase, is very important. Various techniques that reduce the TSV capacitance to improve 3-D IC performance are detailed in [12]. Test structures for TSV capacitance characterization are proposed and described in [7].

3.2.2.2 TSV Crosstalk

The coupling between adjacent TSVs cannot be treated trivially in an analytical way as a parallel-plate capacitor with fringing capacitance, which is the standard approach for 2D interconnects. Both the geometry and the equivalent circuit of the coupled TSVs are rather complex due to the cylindrical shape of TSVs and the semiconductor nature of the substrate between them, as shown in Fig. 3.9. The silicon substrate between the TSVs is connected to ground; therefore, only a few electric field lines originate from one TSV and terminate on the other, resulting in a minimal mutual capacitance between neighboring TSVs. This mutual capacitance can be reduced further by increasing the distance between the TSVs (i.e., the TSV pitch). The crosstalk between two TSVs in the presence of a substrate contact, as shown in Fig. 3.10, has been simulated in Sdevice™. For the simulation, a voltage pulse of 1.2 V, 100 ps duration, and 5 ps rise/fall time is applied on one TSV (the aggressor TSV), and the resulting voltage on the other TSV (the victim TSV) is measured. Different insulator oxide thicknesses and different TSV pitches are considered, together with different positions of the substrate contact. The simulation results are listed in Table 3.4. The crosstalk voltage between TSVs is reduced with increasing pitch. In addition, increasing the thickness of the insulator oxide also reduces the crosstalk, since the capacitive coupling between the TSVs and the substrate is lower. Finally, the distance of the ground substrate contact from the TSVs (DGND) is considered. As the distance of the ground contact from the TSVs increases, the resistance of the substrate path to ground increases and more electric field lines are likely to originate and terminate on neighboring TSVs, increasing the coupling. Therefore, the amount of crosstalk between TSVs increases as the ground contact is placed farther away from the TSV pair. According to the parameters considered in Table 3.4, a ground contact can be placed up to
Fig. 3.9 Equivalent circuit describing capacitive crosstalk interaction among TSVs
Fig. 3.10 Setup for crosstalk simulation among two neighboring TSVs, considering the effect of a substrate ground contact

Table 3.4 Simulation of crosstalk between a pair of TSVs for different pitch lengths, insulator oxide thicknesses, and distances of the substrate ground contact
D (µm)  t (nm)  TSV pitch (µm)  Ground contact distance (µm)  Induced voltage (% of VDD)
5       50      10              0                              0.00625
5       50      10              100                            0.00627
5       50      10              1,000                          7.62
5       50      20              0                              1.43E-06
5       50      20              100                            1.42E-06
5       50      20              1,000                          0.95
5       100     10              0                              0.00730
5       100     10              100                            0.00755
5       100     10              1,000                          0.00852
500 µm away from the TSV pair without a significant effect on the amount of crosstalk between the two TSVs. The amount of coupling is expected to increase in the presence of additional TSV aggressors surrounding a victim TSV. However, placing ground contacts at smaller distances from the TSVs is expected to keep crosstalk at a negligible level even in this case.

3.2.2.3 TSV Leakage Current and Breakdown Voltage

The leakage current and breakdown voltage of a TSV are as important as the TSV resistance and capacitance. A leaky TSV wastes static energy into the substrate and can cause a voltage drop on interconnects that violates the noise margins of digital signals. The TSV should also withstand potential overstress caused by ESD/EMI events without being damaged and becoming leaky. One of the potential contributing factors to TSV leakage and breakdown is non-uniformity of the insulator layer thickness: a local reduction of oxide thickness locally increases the electric field between the plug and the substrate, increasing the possibility of leakage and breakdown. The quality of the insulator layer is therefore of critical importance; however, limited data is available on these reliability aspects [13]. In ideal conditions, considering an oxide insulator thickness of ~100 nm and assuming an ideal oxide breakdown strength of ~8 MV/cm, the TSV breakdown should occur at ~80 V. In reality, this value could be much lower, since the oxide breakdown strength is weakened during processing by undesired effects such as plasma damage and moisture uptake. Furthermore, the roughness of the TSV sidewall at the substrate side can create protrusions that increase the electric field. Various test structures to evaluate TSV leakage and breakdown, along with measurements, are detailed in [7].
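The ~80 V figure is simply the field-times-thickness product of the numbers quoted above; a one-line check in the same Python style:

E_BD = 8e6       # ideal oxide breakdown field, V/cm (value assumed above)
T_OX = 100e-7    # 100 nm oxide thickness expressed in cm
print(E_BD * T_OX)   # 80.0 V ideal breakdown; processed liners break down well below this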
3.2.3 TSV Inductance

The inductance contribution of a TSV in a current loop – its partial inductance – depends on both the self- and mutual-inductive components of the TSV. However, specifying these components requires that the current loop of the TSV, and any neighboring current loops linked with it, be precisely defined. Since this information is not available a priori, the only possible evaluation of the TSV partial inductance is obtained by considering a current return path at infinity. This quantity can be determined by an analytical formula [14]. For the cylindrical TSV assumed in this chapter, the partial self-inductance depends on the TSV diameter and length and is given by [14]

LTSV = (μ0/4π)·[2T·ln((2T + √(rTSV² + (2T)²))/rTSV) + rTSV − √(rTSV² + (2T)²)],  (3.7)
Table 3.5 Analytical model results for LTSV
D (µm)  t (nm)  T (µm)  LTSV (pH)
2       50      20      13.83
2       100     20      14.04
5       50      20      10.19
5       100     20      10.26
5       50      50      34.27
5       100     50      34.47
10      1,000   50      29.52
10      2,000   50      32.29
where μ0 is the permeability of free space, 4π × 10⁻⁷ H/m, rTSV = d/2, and T is the TSV length. In Table 3.5, the partial self-inductance values obtained with (3.7) for different dimensions of the TSV are listed. By considering the frequency at which the reactive part of the TSV impedance becomes comparable with the resistive part, RTSV ≤ ωLTSV, the TSV partial self-inductance can be considered negligible in first approximation for frequencies up to ~3 GHz. Signal rise times containing such frequency components correspond to clock signals with a frequency on the order of 10% of this value, namely ~300 MHz [15].
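A numerical sketch of (3.7) is given below (Python); it reproduces the 10.19 pH entry of Table 3.5 and then estimates the frequency at which ωLTSV becomes comparable to a resistance of ~200 mΩ, the measured per-link value from Table 3.2. Using the link resistance as the comparison point is an assumption of this sketch, but it lands near the ~3 GHz figure quoted above.

import math

MU0 = 4 * math.pi * 1e-7   # permeability of free space, H/m

def l_tsv(D_um, t_nm, T_um):
    # Partial self-inductance of a cylindrical TSV from (3.7), in henries.
    r = (D_um * 1e-6 - 2 * t_nm * 1e-9) / 2   # plug radius r_TSV = d/2
    T = T_um * 1e-6
    root = math.sqrt(r ** 2 + (2 * T) ** 2)
    return MU0 / (4 * math.pi) * (2 * T * math.log((2 * T + root) / r) + r - root)

L = l_tsv(5, 50, 20)
print(L * 1e12)                            # ~10.2 pH, matching Table 3.5
R_LINK = 0.2                               # ~200 mOhm measured per-link resistance (Table 3.2)
print(R_LINK / (2 * math.pi * L) / 1e9)    # ~3.1 GHz: frequency where omega*L_TSV reaches R_LINK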
3.3 Impact of TSV Geometry and Material Parameters on RTSV and CTSV

The simulations and analytical formulas presented so far can be used to determine the impact of geometry and material parameters on the resistance, capacitance, and inductance of a TSV. It is then possible (1) to predict the electrical characteristics of a TSV given technology parameters such as diameter, insulator layer thickness, dielectric constant of the insulator, and resistivity of the plug conductor, and (2) to find the technology parameters that provide the desired values of RTSV, CTSV, and LTSV by exploring the technology parameter space within a specific range of dimensions and/or material parameters. Both (1) and (2) are very important analyses, for circuit design and process tuning, respectively. In Fig. 3.11a–e, examples of exploring the impact of technology parameters on RTSV, CTSV, and LTSV are demonstrated. In Fig. 3.11a, the effect of TSV diameter and dielectric oxide thickness on the resistance of a TSV is illustrated. It is shown that varying the dielectric oxide thickness affects the resistance of a TSV only for diameters smaller than 2 µm. The impact of the same geometrical parameters on the TSV capacitance in the accumulation and inversion regions is illustrated in Fig. 3.11b, c, respectively. As shown in Fig. 3.11b, the capacitance of a TSV in accumulation is affected by the dielectric oxide thickness predominantly at larger diameters. Also, the TSV capacitance in the maximum depletion region depends on both the TSV diameter and the oxide thickness, as shown in Fig. 3.11c. Notice, however, that the TSV capacitance values
Fig. 3.11 Impact of TSV geometry and materials on electrical characteristics
in the maximum depletion region that are shown in Fig. 3.11c are significantly lower than the corresponding values in accumulation shown in Fig. 3.11b for the same values of TSV diameter and dielectric oxide thickness. Furthermore, the effect of the TSV diameter D and length T on the self-inductance of a TSV is illustrated in Fig. 3.11d. As shown in Fig. 3.11d, changing the TSV diameter has a small effect on the inductance of a TSV for short TSV lengths. However, this effect becomes more prominent when the TSV length is within the range of 20–25 µm; in this case, increasing the diameter of the TSV reduces its partial self-inductance. Finally, the effect of substrate doping on the capacitance of a TSV in the maximum depletion region is illustrated in Fig. 3.11e. It is shown in this figure that increased doping of the substrate increases the TSV capacitance, especially for low dielectric thickness values. A complete example of using this information to design an optimal 3D system is reported in [16].
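A minimal example of the parameter-space exploration described above, reusing the r_tsv_mohm and c_ins sketches from Sects. 3.2.1 and 3.2.2 (so this illustrates only the sweep idea, not the solver-based flow behind Fig. 3.11):

# Sweep TSV diameter and liner thickness, tabulating analytical R_TSV and accumulation C_INS.
# T is fixed at 20 um; the chosen grid of values is illustrative only.
for D_um in (2, 3, 5, 10):
    for t_nm in (50, 100, 200):
        print(D_um, t_nm,
              round(r_tsv_mohm(D_um, t_nm, 20), 1),      # resistance, mOhm
              round(c_ins(D_um, t_nm, 20) * 1e15, 1))    # capacitance, fF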
3.4 Electrical Modeling of a TSV

The basic electrical model of the TSV consists of a lumped RLC network, as shown in Fig. 3.12a. RTSV and LTSV cause the voltage drop along the interconnected nodes between Metal 1 of the top die and the top metal of the bottom die. A shunt path for the current is provided by CTSV, connected between the TSV and ground. As suggested in Sect. 3.2, for contemporary TSV dimensions, LTSV is predominant only for frequencies above 3 GHz, when RTSV ≤ ωLTSV. Ignoring the inductance yields the simplified RC model shown in Fig. 3.12b. The model has been successfully validated using 2D/3D ring oscillator power-delay measurements and simulations [17]. At high frequencies, the electrical model of a TSV becomes more complicated.
Fig. 3.12 Lumped models for TSV impedance
3.4.1 Impact of TSV on Interconnect Links

By considering the RC model of a TSV in a typical transmitter–interconnect–receiver link, it is possible to estimate its impact on the speed and energy consumption of such a link. The impact of RTSV and CTSV on circuit delay can be analyzed with the help of the schematic shown in Fig. 3.13. In this schematic, an inverter on the bottom die drives an inverter on the top die through TSV and wire RC loads. The entire network in the signal path is shown in Fig. 3.13. Cext and Cint indicate the output and input capacitances of an inverter, respectively. Rw_B and Cw_B denote the lumped interconnect wire load on the bottom die, while Rw_T and Cw_T represent the lumped wire load on the top die. The Elmore delay of the link can be expressed as

tp = 0.69·Rdr·Cext + 0.69·(Rdr + n·Rw_B)·Cw_B + 0.69·(Rdr + Rw_B + 0.5·RTSV)·CTSV + 0.69·(Rdr + Rw_B + RTSV + Rw_T)·(Cw_T + Cint),  (3.8)
where Rdr is the driving resistance of the inverter. First-order calculations [18] indicate that Cint and Cext are on the order of ~3 fF, while the driving resistance of an inverter is on the order of kilo-ohms for a 0.25-µm technology. Since CTSV = 35 fF, the TSV capacitance is much larger than the Cint and Cext values, and the term (Rdr + Rw_B + 0.5RTSV)CTSV has a dominant effect on the signal delay described by (3.8). It can also be seen in (3.8) that RTSV is always added to the driving resistance of the inverter Rdr and to the BEOL resistances Rw_B and Rw_T. Since smaller wire cross-sections and longer interconnect lengths produce larger BEOL resistances, RTSV values of ~20 mΩ in contemporary TSV architectures have a minimal impact on the delay. The predominant impact of CTSV and the reduced impact of RTSV allow the lumped TSV impedance model to be approximated by a single capacitor, as shown in Fig. 3.12c.
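To make the dominance of the CTSV term concrete, the fragment below evaluates (3.8) with representative values: the ~3 fF gate capacitances, kilo-ohm driver resistance, 35-fF TSV capacitance, and ~20 mΩ TSV resistance quoted above. The wire loads and the n = 1 weighting on Rw_B are assumptions made purely for illustration.

def elmore_delay(Rdr, Cext, Cint, Rw_B, Cw_B, Rw_T, Cw_T, R_tsv, C_tsv, n=1.0):
    # Elmore delay of the TSV link from (3.8); n is the Rw_B weighting factor (assumed 1 here).
    return 0.69 * (Rdr * Cext
                   + (Rdr + n * Rw_B) * Cw_B
                   + (Rdr + Rw_B + 0.5 * R_tsv) * C_tsv
                   + (Rdr + Rw_B + R_tsv + Rw_T) * (Cw_T + Cint))

# 1-kOhm driver, 3-fF gates, 35-fF TSV, 20-mOhm TSV; the 50-Ohm / 10-fF wire loads are assumed values.
tp = elmore_delay(Rdr=1e3, Cext=3e-15, Cint=3e-15,
                  Rw_B=50.0, Cw_B=10e-15, Rw_T=50.0, Cw_T=10e-15,
                  R_tsv=0.02, C_tsv=35e-15)
print(tp * 1e12)   # ~45 ps; the (Rdr + Rw_B + 0.5*R_TSV)*C_TSV term contributes more than half of it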
Fig. 3.13 Equivalent circuit model for a pair of inverters on stacked die, connected through a TSV
3.5 Conclusions

In this chapter, an electrical model of the TSV has been presented. Since the TSV provides an electrical link between the interconnects of 3D stacked dies, it is important to know how the basic electrical parameters, the resistance RTSV, capacitance CTSV, and inductance LTSV, are related to the geometry and the physical parameters of the materials used in TSV architectures. From a circuit design perspective, it is also important to have an electrical model of the TSV that describes the most relevant effects and allows first-order approximations of signal speed and energy in 3D interconnect links. This chapter has provided a basic investigation of these aspects. In addition, aspects related to the reliability of TSVs, such as leakage mechanisms and breakdown voltages, have also been discussed. The validity of the proposed models can be extended to the high-frequency regime by using proper simulation tools, test structures, and electrical measurements.
References

1. Topol AW et al (2006) Three-dimensional integrated circuits. IBM J Res Dev 50(4/5):491–506
2. Beyne E (2006) The rise of the 3rd dimension for system integration. In: Proceedings of the international interconnect technology conference, Burlingame, CA, 5–7 June, pp 1–5
3. Pak JS, Ryu C, Kim J (2007) Electrical characterization of through silicon via (TSV) depending on structural and material parameters based on 3D full wave simulation. In: Proceedings of the international conference on electronic materials and packaging, Daejeon, 19–22 November, pp 19–22
4. Van Olmen J et al (2008) 3D stacked IC demonstration using a through silicon via first approach. IEDM Technical Digest, pp 603–606
5. Kamon M, Silveira L et al (1996) FastHenry USER'S GUIDE: version 3.0, Massachusetts Institute of Technology, November, ftp://rle-vlsi.mit.edu/pub/fasthenry
6. Raphael Interconnect Analysis Program Reference Manual (2007) Version A-2007.09, Synopsys Inc., Mountain View, CA
7. Stucchi M, Perry D, Katti G, Dehaene W (2010) Test structures for characterization of through silicon vias. IEEE International Conference on Microelectronic Test Structures (ICMTS)
8. Sze SM (1981) Physics of semiconductor devices. John Wiley & Sons, New York
9. Katti G et al (2010) Electrical characterization & modeling of through silicon via (TSV) for 3D ICs. IEEE Transactions on Electron Devices 57(1):256–262
10. Bandyopadhyay T, Chatterjee R, Chung D, Swaminathan M, Tummala R (2009) Electrical modeling of through silicon and package vias. IEEE International Conference on 3D System Integration, Sep, pp 28–30
11. Sentaurus Device User Guide (2008) Version A-2008.09, Synopsys Inc., Mountain View, CA
12. Katti G, Stucchi M, Olmen JV, Meyer KD, Dehaene W (2010) Through-silicon-via capacitance reduction technique to benefit 3-D IC performance. IEEE Electron Device Lett 31(6):549–551
13. Kikuchi H et al (2008) Tungsten through-silicon via technology for three-dimensional LSIs. Jpn J Appl Phys 47(4):2801–2806
14. Pucel R (1985) Technology and design considerations of monolithic microwave integrated circuits. In: Ferry D (ed) Gallium arsenide technology. Howard W. Sams and Co., Indianapolis, IN, chapter 6, p 216
15. Young B (2001) Digital signal integrity. Prentice Hall, New York, p 45
16. Marchal P et al (2009) 3-D technology assessment: path-finding the technology/design sweet-spot. Proceedings of the IEEE 97(1):96–107
17. Katti G et al (2009) 3D stacked ICs using Cu TSVs and die to wafer hybrid collective bonding. In: IEDM Tech Dig, pp 357–360
18. Rabaey J et al (2003) Digital integrated circuits, 2nd edn. Prentice Hall Electronics and VLSI Series, Upper Saddle River, NJ
Chapter 4
Homogeneous 3D Integration Robert Patti
4.1 Introduction

This chapter focuses on homogeneous 3D integration – that is, the vertical assembly of like materials or components – but also provides information about homogeneous 3D assemblies in combination with other 2D or 3D devices. One example of homogeneous integration is the stacking of memory layers to create a 3D memory device. In such a device, the component layers are usually made of the same material and are often virtually identical in design. This chapter uses 3D DRAM as a reference application.
4.2 3D Assembly Options

There are three basic ways to assemble 3D devices: wafer-to-wafer (W2W), die-to-wafer (D2W), and die-to-die (D2D). Each method has its advantages; each application must be examined to determine the best way to realize the cost benefits of 3D. W2W assembly is potentially the most cost effective because wafer-level handling and processing allow hundreds or thousands of devices to be created at once. On the other hand, D2W and D2D work well for dissimilar materials, where the mismatch in thermal coefficient of expansion (TCE) might make W2W impractical. Almost all 3D assembly processes require processing at temperatures of 300°C or higher. Higher temperatures and higher TCE mismatch cause problems with alignment and increase the induced stress within the part. However, 3D assembly with homogeneous material suffers minimal TCE mismatch and is therefore ideal for W2W bonding.
A critical factor in 3D processing is ultimate device yield. A 3D assembly has approximately the same yield as the equivalent 2D device – for example, a four-layer stack of size 5 × 5 mm will have a yield equivalent to a single-layer die measuring 10 × 10 mm. Unless extraordinary techniques are used to address yield, 3D will have limited appeal. D2W and D2D assembly techniques allow the use of a known good die (KGD) protocol that can produce yields superior to the equivalent 2D die, but at the cost of additional handling. Another effective way to address yield is with built-in repair and redundancy. The underlying repetitive design of memories, FPGAs, and image sensors makes these devices readily capable of built-in repair and redundancy schemes; indeed, most of them already use such schemes. As a result, these applications are excellent candidates for wafer-bonded 3D devices. They can be made into the equivalent of huge 2D dies yet maintain reasonable yield through repair and redundancy without any KGD techniques. Our reference application, 3D DRAM, uses homogeneous materials and repair-and-redundancy schemes, so it is ideally suited for cost-effective W2W assembly. Many other high-volume lower-margin devices fall into the realm of W2W bonding for the same reasons.
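One way to see the 'four 5 × 5 mm layers behave like one 10 × 10 mm die' equivalence is a simple area-based defect model; the Poisson yield model and the defect density used below are textbook-style assumptions for illustration, not figures from this chapter.

import math

def poisson_yield(area_mm2, defect_density_per_mm2):
    # Poisson yield model: probability that a region of the given area has no killer defect.
    return math.exp(-defect_density_per_mm2 * area_mm2)

D0 = 0.005   # assumed killer-defect density, defects per mm^2 (illustrative only)
single_layer = poisson_yield(5 * 5, D0)
print(single_layer ** 4)            # four stacked 5 x 5 mm layers, no repair and no KGD
print(poisson_yield(10 * 10, D0))   # one 10 x 10 mm 2D die: the same number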
4.2.1 W2W Alignment

Wafer-to-wafer alignment in 3D-assembled circuits, measured post bonding, is ~1 µm three sigma for 300-mm wafers using a copper-to-copper bonding process. Oxide bonding, such as that used in the Ziptronix DBI® (direct bond interconnect) process, can achieve a more precise alignment of ~0.5 µm because the initial bond is performed at room temperature. Frequent topics of debate in 3D discussions are what the alignment needs to be and what the road map should look like. There may or may not be a need to improve significantly beyond the current equipment. Unlike the shrinking 2D design rules, 3D must contend with the metal interconnect layers and a finite substrate thickness. If the vertical direction is 10 µm (today, 10–12 µm is the norm for Tezzaron), then staying in the 2D plane still makes sense a lot of the time. One would not choose to go vertically 10 µm if one can instead go horizontally 1 µm. Thus, 3D wafer alignment is already in the correct range for most, if not all, applications. Certainly there will be improvements in the equipment and there will be applications that can drive the vertical pitch requirement lower, but today's capabilities probably exceed requirements. The tightest pitch requirement presented to Tezzaron to date was for a frontside-to-frontside bond at 1.4 µm. The pitch was necessary for pixel-pitch interfacing of a CMOS image sensor. Because it was a face-to-face bond, through-silicon via (TSV) pitch was not involved; the only requirement was tight wafer-to-wafer alignment.
4.3 3D Bonding Options

3D integration has drastically improved in the last 10 years. Numerous entities have demonstrated void-free bonding of wafers and thinning wafers to only a handful of microns. The most promising 3D assembly technologies seem to be those that can accomplish both the mechanical interconnect and the electrical interconnect at the same time. There are at least three bonding processes that offer this one-step interconnection: copper-to-copper, copper–tin eutectic, and direct oxide. The market will likely find more than one effective process, but the one that offers the best price performance is most likely to succeed in the market. Copper-to-copper metal bonding seems to have the edge. Wafers can literally be taken right from the fab with no special surface processing and no additional deposition steps, cleaned (a slight oxide etch), and bonded. The other bonding process options require more work up front. Copper-to-copper bonding seems to produce the required bond quality at the lowest price point, but the other flows may have advantages that stem from lower temperatures and higher throughput. Market forces will ultimately pick the best techniques.
4.4 Wafer-to-Wafer Assembly

This section offers an example of 3D W2W assembly using Tezzaron's SuperContact™ enabled wafers and copper-to-copper thermal diffusion bonding. Tezzaron's stacking method uses a "vias-middle" approach to integrate two or more wafers into a fully interconnected wafer stack. Hundreds of thousands of TSVs are built into the circuitry of the wafers. The base wafer may be built with or without TSVs, as desired. Tezzaron's "SuperContact™" TSVs may be either copper or tungsten, and may be laid down before the first metal contact layer or created at a later processing stage. In all cases, creating the TSVs requires a new process module at the vendor fab. The module has proven relatively easy to add and does not introduce any new materials at the stage where the TSVs are added. To lay down TSVs after transistors have been created, but before any contact metal, the TSVs are etched through the oxide and approximately 6 µm into the silicon substrate. The walls are lined with SiO2/SiN. The TSVs are filled with tungsten or copper and finished with chemical-mechanical polishing (CMP). This completes the unique processing requirements at the wafer level. The wafer is then finished with its normal processing, which can include a combination of aluminum and copper wiring layers. The top layer must be copper. The top is metallized with a 0.5-µm SiO2 insulating glass layer followed by a 1.0-µm Cu metal bonding layer. The bonding layer, formed by a copper single- or dual-damascene process, is laid out in a proprietary design of bondpoints (Fig. 4.1). All wafers, including the base wafer, are metallized in this manner. The oxide surface is then slightly recessed, leaving the bondpoints elevated above the oxide.
Fig. 4.1 A tungsten TSV in a six-metal process with Metal 6 as the copper bonding layer
Fig. 4.2 Face-to-face bonding
One TSV-enabled wafer is inverted onto the base wafer, frontside to frontside (see Figs. 4.2 and 4.3). The wafers are aligned and then bonded, using a thermal diffusion bonding process at <400°C, producing a metal-to-metal bond. The structural base (back side) of the upper wafer is then thinned using a combination of conventional wafer grinding, spin-etching, and CMP. This thinning leaves a substrate of about 4 µm and exposes the TSVs built into the wafer, as shown in Fig. 4.2. If more wafers are to be added to the stack as in Fig. 4.4, the backside of the thinned wafer (with its exposed TSVs) is covered by another oxide layer and then another copper bonding layer. The next TSV-enabled wafer is inverted onto this
Fig. 4.3 SEM micrograph of a two-wafer stack and one of its TSVs
Fig. 4.4 A finished three-wafer stack padded out with aluminum for normal wire bonding
new surface, face-down, and bonded as before. Thinning, metallizing, and bonding are repeated as desired. Once the wafer stacking process is completed, the topmost wafer of the stack is thinned to its TSVs and finished with standard wire bonding or flipchip assembly; the base wafer is backlapped to remove excess silicon.
Fig. 4.5 Tungsten TSV created after the second layer of metal (Al or Cu interconnect at M1–M2, Cu at M3–M6 and vias)
Because of the aggressive thinning, the height of the total stack increases by not more than 15 µm per wafer. Even a stack of many layers can be housed in normal packaging. Note that the base wafer retains its complete thickness during the stacking process and there is no need to handle the extremely thin (15 µm or less) upper layers as individual pieces. This greatly eases the manufacturing challenges. For comparison, Fig. 4.5 shows an example of a tungsten TSV created at a later processing stage – in this case, after Metal 2. Note that all metals laid down after this TSV are copper, a requirement of our current process.
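To make the height budget concrete, the arithmetic below is a minimal sketch of an N-wafer stack under the figure quoted above (at most about 15 µm added per thinned wafer); the 725 µm base-wafer thickness is an assumed typical value for a 200 mm wafer, not a number from this chapter.

```python
def stack_height_um(num_wafers, base_wafer_um=725.0, per_thinned_wafer_um=15.0):
    """Estimate total stack height: the full-thickness base wafer plus
    at most ~15 um for every additional thinned wafer bonded on top."""
    if num_wafers < 1:
        raise ValueError("need at least the base wafer")
    return base_wafer_um + (num_wafers - 1) * per_thinned_wafer_um

# An 8-wafer stack adds only ~105 um on top of the base wafer,
# so it still fits in conventional packaging.
print(stack_height_um(8))  # 830.0 (um)
```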
4.5 Die-to-Wafer Assembly Almost every process for W2W bonding has an equivalent process for D2W bonding. The D2W technique can enhance the yield by using known good die (KGD). KGD allows a supersized die to be produced with reasonable yield, and thus reasonable cost. However, D2W has additional processing costs when compared with W2W. First, each individual die must be aligned to the base wafer. The alignment of each die takes about as long as the alignment of an entire wafer. The alignment process is also typically done in the most expensive equipment used in the 3D assembly process. In many assembly flows, all wafer sites, including defective ones, must have dies present to ensure even processing both during bonding and during any subsequent thinning. If the bonded dies are to be thinned down to the TSVs, additional processing may be required, such as filling and planarizing the wafer, to ensure even thinning across all dies. Tezzaron's D2W method uses a template to ease this task. An illustration of the template process is shown in Fig. 4.6. The photos in Fig. 4.7 show completed D2W assemblies before final dicing. The template provides better tolerance and faster placement, but it still is an order of magnitude slower than W2W alignment.
Fig. 4.6 Die-to-wafer assembly process (labels: singulated die, stencil, stencil window, host wafer, host die, finished device)
Fig. 4.7 Wafers with D2W 3D-ICs, after bonding, before final dicing
4.6 Designing a 3D Device Designing 3D semiconductors is a straightforward extension of 2D design, yet it presents a huge new set of problems. One such problem is the normal process variation between any two wafers. A 3D design engineer must take into account the variation between wafers within a single device. Our reference design, a Tezzaron 3D DRAM, is composed of layers of DRAM process memory cells and a logic process layer containing almost everything else. Memories are prime candidates for early 3D design because the CAD tools generally need not be 3D aware. Although billions of transistors are present, they tend to be connected in highly repetitive tiles. The design is thus “simple” enough for visual and hand checking – still a large task, but at least doable. A 3D DRAM contains the same basic circuits as a 2D DRAM, but it can also extend and enhance the design, including features not possible within the restrictions of the 2D DRAM process. 3D design can address the issue of much larger effective die area by implementing new yield enhancement repair and redundancy. To take another example, a CPU design could derive considerable benefit by migrating to 3D. A well-engineered 3D design would reduce the time of flight and increase the number of circuits within the span of control, thus improving the speed and power while reducing complexity. The problem is that a CPU design contains very few repeating patterns, so the 3D design demands more tool support. The decision tree of how to effectively split a CPU across two or three layers would easily overwhelm a design team. It would be much like designing a modern processor with the use of only schematics and SPICE, without Verilog and synthesis. Another critical design issue is the placement of the TSVs. As of today, no commercial EDA tools exist that can place and optimize TSV locations to account for both the design rules and 3D positional optimization. For designs that are not simple and regular, automating TSV placement is a significant first step; however, it is also a major change to existing EDA software. Very little, if any, chip design software was written with the idea of adding a third dimension to the task. Underlying databases and fundamental assumptions in millions of lines of existing code would be affected by this change. Fortunately, our DRAM reference design does not need the help of sophisticated tools because the placement of the TSVs is rather obvious. The memory layers must connect to the logic layer at the bitlines and wordlines, but few other connections are required; thus we can avoid the immediate issue of missing tool support. (It is worth noting that some design groups have created their own tools to place TSVs, taking into account many factors such as design rules, delay, power, and heat dissipation.)
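The chapter notes that some design groups wrote in-house tools to place TSVs. As a purely illustrative sketch, not Tezzaron's flow and not any commercial tool, the snippet below greedily assigns each inter-tier net to the nearest legal TSV site while respecting a minimum TSV pitch; the site list, keep-out check, and Manhattan cost metric are all invented for the example.

```python
import math

def greedy_tsv_assignment(nets, sites, min_pitch):
    """Assign each inter-tier net (x, y landing point) to the closest free
    TSV site, skipping sites that would violate the minimum TSV pitch
    against already-used sites. Returns {net_id: site}."""
    used = []          # coordinates of sites already occupied
    assignment = {}
    for net_id, (nx, ny) in nets.items():
        best, best_cost = None, math.inf
        for (sx, sy) in sites:
            if any(math.hypot(sx - ux, sy - uy) < min_pitch for ux, uy in used):
                continue                        # pitch rule would be violated
            cost = abs(sx - nx) + abs(sy - ny)  # added Manhattan wirelength
            if cost < best_cost:
                best, best_cost = (sx, sy), cost
        if best is None:
            raise RuntimeError(f"no legal TSV site left for net {net_id}")
        used.append(best)
        assignment[net_id] = best
    return assignment

nets = {"clk": (2.0, 3.0), "data0": (10.0, 4.0)}
sites = [(1.0, 3.0), (2.5, 2.5), (9.0, 4.5)]
print(greedy_tsv_assignment(nets, sites, min_pitch=2.0))
```

A production tool would, as the text says, also fold in delay, power, and heat dissipation; the greedy pass here only illustrates the assignment step.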
4.6.1 Other Design Considerations There are other items associated with 3D integration that heretofore did not matter. If one is practicing wafer-to-wafer assembly, one needs to be concerned with several small issues:
• Wafers must be the same size.
• Dies on the wafer must step, using the same step size, from approximately the same starting point on the wafer.
• The wafers probably must all have a notch or all have a flat. (Yes, there are 200 mm wafers with flats.)
• The notch location (usually up or down) needs to be accounted for. Generally, you want all the notches to line up for additional processing.
When wafers are sourced from multiple foundries, and perhaps receive additional postprocessing at another set of fabs, these seemingly trivial items can be complete show stoppers. One must also consider the material incorporated in the wafers and the specific wafer process issues that affect 3D assembly, such as the TSV material, fragile dielectrics like those in ultra-low-k processes, and the use of strained silicon. Copper TSVs have a high thermal coefficient of expansion (TCE), so they should not be aligned vertically layer after layer into a long column. Delicate low-k material may require some structural pillars of metal and vias (not TSVs) from the silicon substrate to the bond surface. Strained silicon requires keeping a deeper silicon substrate to preserve the original transistor characteristics. This thicker substrate directly affects the TSVs by adding resistance and capacitance, and it also requires a wider TSV diameter in order to maintain the aspect ratio. This is a lot of new territory for the average designer. Just as the blazing speed of digital design has to some extent made all designers into analog designers, so 3D design will stretch designers a bit further with a need to understand at least some of the processing and the implications of using certain processes.
4.7 TSV Options TSVs come in countless varieties. This seems to be causing some delay in industry adoption of 3D, as there are simply too many choices. Today, TSVs primarily use copper or tungsten. However, other materials such as nickel and polysilicon are also used. Each of these materials has its own advantages and disadvantages. TSV sizes range from submicron (at Tezzaron, as small as 0.85 µm) to larger than 300 µm. Clearly, different TSVs solve different problems and thus will have different forms.
4.7.1 How Big and How Many? TSVs tend to address one of two different requirements. Either the TSVs connect large functional blocks, where only hundreds or thousands of connections are
required, or they connect "true" 3D circuits requiring thousands to millions of connections. "True" 3D circuit integration connects circuits at the granularity of tens or perhaps a few hundreds of gates. This high-density TSV interconnect generally implies that the TSVs need a pitch of <10 µm. Our reference 3D memory must connect at a multiple of the DRAM bitline pitch, so our TSVs have a 1.76 µm pitch. For low-density vertical interconnect, larger TSVs offer multiple advantages. First, the larger diameter means that the TSVs can go deeper into the silicon while still maintaining a reasonable aspect ratio. ("Aspect ratio" is the depth divided by the diameter.) The higher the aspect ratio, the more difficult it is to line the silicon hole with dielectric and then fill the hole. Normal vias in semiconductors have less than a 7:1 aspect ratio; tungsten supports a ratio of 7:1, while copper typically supports a 4:1 ratio. For TSVs, the aspect ratio has been stretched. Copper-filled TSVs are often cited with a ratio of 10:1, and even 20:1 is reported. For tungsten, the range has been extended only modestly, to perhaps 10:1.
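Since the aspect ratio is simply depth divided by diameter, a couple of lines capture the relationship used throughout this section; the ratios and diameters below are the ones quoted in the surrounding text.

```python
def max_tsv_depth_um(diameter_um, aspect_ratio):
    """Deepest fillable TSV for a given diameter and achievable aspect ratio
    (aspect ratio = depth / diameter)."""
    return diameter_um * aspect_ratio

# Figures quoted in the text: tungsten at ~10:1 with 2-2.25 um diameters,
# copper at 10:1 (with 20:1 reported) and larger diameters.
print(max_tsv_depth_um(2.25, 10))   # ~22.5 um, the "20-25 um" tungsten limit
print(max_tsv_depth_um(5.0, 10))    # 50 um, the copper example in Sect. 4.7.2
```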
4.7.2 Tungsten or Copper? Tungsten is typically put onto a wafer by chemical vapor deposition (CVD). This process fills the TSV holes, but it also creates a uniform coating ("overburden") across the entire top surface of the wafer. Wider TSV holes require more material to be deposited, resulting in a thicker overburden. The CVD process is done at a somewhat elevated temperature; when the wafer cools, a thick overburden can contract and warp the wafer. In addition, the overburden must be "sanded" off by using etching and/or CMP. Because tungsten is a hard material, a bulky overburden is difficult to remove. These issues limit tungsten-filled TSVs to a diameter of <3 µm, with 2–2.25 µm as a more practical limit. Accordingly, the deepest practical tungsten TSV would be about 20–25 µm. Copper, unlike tungsten, is deposited by electroplating. Careful attention to chemistry allows the filling of deep TSV holes with little overburden. With copper, the issues of extending the TSV aspect ratio come not so much from the filling itself, but from the liners required to contain the copper and seed the plating. If the barrier layer (normally Ta and TaN) is not complete, copper can leach into the silicon and poison the transistors. If the seed layer is not complete, it can cause voids in the copper. Voids electromigrate with time. Electromigration moves copper from higher electric fields to lower ones. For example, the corner connection between a TSV and the metal interconnect is a high field, while any void in the TSV is a low field. Electromigration would move copper from the corner connection to the void, causing an open circuit at the connection. Another issue with copper is its TCE mismatch with silicon, much higher than the silicon–tungsten mismatch. With thermal cycling, copper repeatedly expands and contracts more than the silicon. This "pumping" motion can cause the oxide insulating liner to crack and fail. The SEM images in Figs. 4.8 and 4.9 show failures due to copper pumping.
Fig. 4.8 Example of failure due to copper pumping
Fig. 4.9 Example of failure due to copper pumping
These issues can probably be mitigated, but limitations exist. Today, copper TSVs are the only viable solution for TSV depths greater than 25 µm. Copper has thus become the material of choice for 3D assembly requiring handling of thin wafers or chips. Many schemes today target a device thickness of around 50 µm prior to the 3D assembly process. A TSV diameter of 5 µm and an aspect ratio of 10:1 gives a 50 µm depth, which can be reliably manufactured and which early results show can tolerate thermal cycling.
4.7.3 Other TSV Factors Now that we have examined some of the specifics of TSVs, let us look at the applications driving the need. For image-processing devices, such as area-of-interest image processors, there is at least one TSV per pixel or perhaps per small group of pixels. In a 3D FPGA, routing resources determine the usability and to a great extent the operating speed, so TSVs need to be small and dense; FPGAs incorporate more metal layers than any other type of semiconductor. In the case of our reference memory, there are ~1.5 million TSVs per layer. The TSVs are forced to a tight bitline pitch, but the average pitch is several times larger. Indeed, many 3D applications require a locally high TSV density although the number of TSVs is rather low. The absolute number of required TSVs obscures the need for tight pitch and high TSV density. It is completely possible for a large die with merely hundreds of TSVs to require a tight 2 µm pitch. A processor might require the interface to memory on other layers to lie near its bus execution unit. Large TSVs with a coarse pitch would connect the circuits, but would undermine much of the benefit of 3D integration. Density would be maintained, but the power and performance gains would be largely eliminated. Another TSV variable is the point in the process flow where the wafer is thinned to expose the TSVs. Handling thinned wafers or dies is risky at best. Thinning wafers to <50 µm and then attempting to stack them is not practical. The thinner the wafer, the more the stress is released. The release of stress causes at least two issues. First, the performance of the transistors will change for both better and worse. Second, the surface shifts and warps in many directions. This surface change greatly affects the quality of the bond as well as the alignment of features. Thinning dies below 50 µm has similar effects, although there are successful examples with 30 µm thick die. On the other hand, if a thick wafer is permanently bonded to another thick wafer, it can be safely thinned to a few microns of substrate. The stiffness of the permanent bond prevents the stress release. The same is possible with die-to-wafer bonding. This is why the assembly flow must play a role in the selection of the TSV technology. The Tezzaron 3D process and many others make use of the "bond then thin" paradigm that enables high-density tight-pitch TSVs. The following table summarizes Tezzaron's various TSV offerings.
Offering                   Type             Size L × W × D (µm)      Material  Min. pitch (µm)  Feedthrough cap. (fF)  Series resistance (Ω)
SuperContact™ (200 mm)     Via first, FEOL  1.2 × 1.2 × 6.0          W         <2.5             2–3                    <0.6
SuperContact™ II (200 mm)  Via first, FEOL  0.85 × 0.85 × 5.5        W         <1.75            2                      <1.5
SuperVia™                  Via first, BEOL  4.0 × 4.0 × 12.0         Cu        6.08             7                      <0.25
SuperContact™ (300 mm)     –                1.6 × 1.6 × 10.0         W         <3.2             6                      <1.5
Bond points                –                1.7 × 1.7 (0.75 × 0.75)  Cu        2.4 (1.46)       <<                     <
Die-to-wafer               –                10 × 10                  Cu        25               <25                    <

<< denotes vanishingly small, extremely insignificant; actual measurement will vary. < denotes small, insignificant; actual measurement varies.
4.8 Tools Most of the tools in our tool box will need some upgrading. As already mentioned, the synthesis and place and route tools will be the most impacted. These tools are most certainly required before 3D is widely adopted for use in typical logic devices. One can easily see the huge new opportunities and issues that can arise for a given device by optimizing the number of layers and the processes in which those layers are made. Imagine a tool set deciding the best location for a circuit in a 3D device, based not only on what it is connected to, but also on available space on a given layer for a specific process – deciding, for example, “this adder could be built on this layer in 40 nm but its signal must travel further than if it were built in 90 nm next to the mixed signal circuit that uses its data.” In addition, there are power and heat tradeoffs to consider. All these variables combine to create a wildly exponentiating complexity that would be nearly impossible to manage with any tool set. Tools capable of handling all these optimizations are probably far off. Most likely, near-term tools will do generalized modeling and require some human help. A general term for this type of tool is “path finding.” Path finding tools allow designers to make fundamental choices in partitioning and selection of appropriate processes for a 3D assembly. Many of the other EDA tools require only minor rework for at least a usable 3D function. The first 3D efforts relied completely on workarounds and custom software fixes. Today, there are complete 3D flows for physical editing, DRC, and LVS. Parasitic extraction and simulation still require some hand manipulations to assemble the complete 3D data. While these flows are not without issues, they work well enough to allow the design and fabrication of real working 3D chips.
4.9 Modeling 3D Circuits There are two types of modeling required for 3D circuits. The first is modeling the interaction of different device layers, and the second is modeling the TSVs. Device layer interaction is minimal. The distance from backside to frontside circuitry is several microns. Probably, the only modeling necessary is the capacitance to the ground plane presented by the silicon substrate. For a frontside-to-frontside bond, there are a couple of microns between active circuits. In the copper-to-copper bonding process, there are islands of floating copper and dielectric. These could be modeled as continuations of the metal stacks. Today this modeling requires manual intervention in the extraction process, but the effects are small. In many cases, the stacked interaction is negligible. If circuit performance is marginal, a straightforward effort would permit complete and accurate simulation. TSV modeling is more important and usually easier. High-density TSVs can be modeled as a lumped element for virtually all digital work, and for most analog work below 5 GHz. The Tezzaron SuperContact is ~1.5 Ω and has 2–3 fF of capacitance. This can be added as a lumped element. The inductance is minute and generally has no effect below 5 GHz. Modeling has shown the SuperContact to
have effects of interest above 50–100 GHz. In the digital design for W2W assembly, one can think of a SuperContact as about 10 µm of wire. This is a good rule of thumb to the first order. In die-to-wafer designs, including the TSVs and the somewhat larger pads required, it can be thought of as 100 µm of wire.
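The lumped-element view above (a SuperContact of roughly 1.5 Ω and 2–3 fF) drops straight into ordinary Elmore delay estimation. The sketch below assumes a simple driver, TSV, wire, load topology; the per-µm wire parasitics and the load capacitance are placeholder values, not figures from the chapter.

```python
def elmore_delay_through_tsv(wire_len_um, c_load=1e-15,
                             r_per_um=0.2, c_per_um=0.2e-15,
                             tsv_r=1.5, tsv_c=2.5e-15):
    """Elmore delay (seconds) of driver -> TSV -> wire -> load, with the TSV
    taken as the lumped element quoted in the text (~1.5 ohm, 2-3 fF for a
    SuperContact). r_per_um (ohm/um) and c_per_um (F/um) are assumed,
    technology-dependent wire parasitics."""
    rw = r_per_um * wire_len_um
    cw = c_per_um * wire_len_um
    # TSV resistance charges everything downstream; the distributed wire
    # contributes the usual half of its own capacitance plus the load.
    return tsv_r * (tsv_c + cw + c_load) + rw * (0.5 * cw + c_load)

# A 200 um route crossing one W2W SuperContact:
print(elmore_delay_through_tsv(200.0))   # on the order of 1 ps for these values
```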
4.9.1 Thermal Modeling There is significant concern regarding thermal issues in 3D structures and with good reason. One can easily imagine stacking layers of 100 W processors to form a supersized multicore CPU. The resulting 3D stack would surely fail. 3D assembly does not change the fundamental power and heat constraints. Power dissipation and the associated heat flow is limited to about 1 W/mm² of die attachment area to a package for conventional heat sinks and fans. Liquid cooling and microchannels in 3D assembly can raise this limit by a factor of 2 or 3, but these solutions are not generally available or usable. Today, the first rule in maintaining a reasonable die temperature is to keep the power below 1 W/mm². Assuming one passes the first rule, the issues become more subtle and may require true modeling. Different 3D assembly techniques tend to drive differing modeling needs; for example, thicker layers in the 3D assembly are more likely to create isolated hotspots. Also, low-k dielectrics add to thermal insulation. As a general set of rules, one can apply the following:
• <7 W per 100 mm²: Modeling is probably not required.
• 7–15 W per 100 mm²: Take care to avoid vertical colocation of the highest power elements.
• 15–30 W per 100 mm²: Locate high-power elements on the outer layers. The highest power items are best located on the side of the stack that will be in contact with the heat sink. Avoid colocating high-power elements on layer after layer.
• >30 W per 100 mm²: Perform thermal analysis of the actual heat generation in addition to locating high-power elements as above.
The objective of thermal modeling is to identify areas where multiple heat generators are located in a small 3D space. Unless one is approaching the 1 W/mm² total limit, gathering precise data is less important than simply identifying the heat-generating sources and spreading them out over the largest 2D area possible. In high-power device packages, most of the heat will travel out the back of the die, with some out the front and virtually none out the edges. Heat, just like electrical signals, should travel the shortest path for best performance.
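A minimal helper that applies the guidance above; it only restates the thresholds from the list, normalized to a 100 mm² footprint, and is not a substitute for real thermal analysis.

```python
def thermal_guidance(total_power_w, footprint_mm2):
    """Map stack power onto the rule-of-thumb bands above
    (expressed per 100 mm2 of die attachment area)."""
    if total_power_w / footprint_mm2 > 1.0:
        return "exceeds ~1 W/mm2: rethink the design or the cooling"
    per_100mm2 = total_power_w * 100.0 / footprint_mm2
    if per_100mm2 < 7:
        return "modeling probably not required"
    if per_100mm2 < 15:
        return "avoid vertical colocation of the highest-power elements"
    if per_100mm2 < 30:
        return "place high-power elements on the outer layer nearest the heat sink"
    return "perform thermal analysis of the actual heat generation"

print(thermal_guidance(20.0, 100.0))   # 20 W on a 100 mm2 footprint
```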
4.9.2 Yield Effects The same fundamental factors affect the yield in 3D devices as in 2D devices. In general, the yield is most affected by die area: the yield of a 3D device is roughly the same as the yield for a 2D die with equal area (i.e., equal to the sum of the areas
of the layers). The 3D assembly itself tends to use relatively large interconnects that yield well. As a rule of thumb, each bond contributes defects that are equivalent to adding another metal interconnect layer to a 2D wafer.
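One way to read the equal-area rule and the per-bond adder above is through a simple Poisson yield model; the defect density D0 and the small per-bond term standing in for "one extra metal layer" of defectivity are assumed illustrative numbers, not data from this chapter.

```python
import math

def stacked_yield(layer_areas_cm2, d0_per_cm2=0.2, bond_defect_per_cm2=0.02):
    """Poisson yield of a 3D stack: fatal defects scale with the summed
    silicon area (the equal-area rule), plus a small per-bond contribution
    applied over the bonded area."""
    total_area = sum(layer_areas_cm2)
    bonds = len(layer_areas_cm2) - 1
    expected_defects = d0_per_cm2 * total_area + bonds * bond_defect_per_cm2 * total_area
    return math.exp(-expected_defects)

# Two stacked 1 cm2 layers vs. a single 2 cm2 2D die of the same total area:
print(stacked_yield([1.0, 1.0]))   # ~0.64 (includes one bond interface)
print(stacked_yield([2.0]))        # ~0.67 (same area, no bond)
```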
4.10 Testing for 3D 3D assembly raises many new questions related to test. For W2W assembly the important information is the general condition and the process corner of the entire wafer. Individual die testing probably is not helpful. Unless one plans to optimize the assembly by picking vertically aligned sets of devices that are all working – a difficult task! – knowing which specific dies are good prior to assembly does not contribute much useful information, especially for volume production. Testing each die on the wafer adds cost and also reduces the yield. Testing the wafer involves physical contact with the wafer surface, and the probe damage to the planar surface adversely affects the 3D assembly. It is best to minimize the wafer damage by probing only those areas that are as remote and isolated as possible, far from the usable die areas. Wafers that are to be 3D assembled should be tested with a focus on speed grade and device yield. One approach would be to copy a small portion of the critical circuitry into the scribe lane to be used for testing. The test results of the specialized test structure are the most useful in indicating the wafer grade and give a much stronger correlation to final results than just general process control monitors. It is highly desirable to match the speed grades of wafers used in a 3D assembly. This matching gives the designers a smaller set of ultimate process windows to examine and generally a more useful binning of final parts. As an example, if slow and fast wafers were stacked together, and if every stack contained a slow process corner wafer, then every resulting part might come out as a slow corner device. The speed would be governed by the slowest wafer in the stack. For memory devices this is certainly the case. Grouping wafers by process corner will produce a more normal distribution of device speed bins. In less homogeneous device stacks, e.g., 3D logic stacks, process corner matching may be mandatory to achieve function and useful yield. Testing 3D devices prior to assembly poses another issue to reckon with: the number of test points. Even if probe damage were not an issue, complete testing would still not be possible in many cases. In the reference memory device, there are 1.5 M vertical signals per die. The quantity is far too large to test, and the required test pitch is impossibly small. Indeed, a driving reason to do 3D integration is that it offers significantly more wire at a finer pitch. The closer to "true" 3D integration a device is – i.e., the more it utilizes the benefits of process separation, short wiring, etc. – the less likely it is to lend itself to wafer probing. This issue applies not only to W2W, but also to D2W, and even to D2D stacking. "True" 3D integration with resulting supersized die will require new test strategies regardless of how the 3D assembly is performed.
4.11 Design-for-Test Meets Repair and Redundancy 3D integration increases the number of transistors that may be present in an assembly by more than an order of magnitude. Beyond concerns about the initial yield, one must also start to evaluate device lifetimes. Transistors and wires do wear out. Historically, lifetimes have been far beyond the useful life of a device. However, as geometries have continued to shrink, the life expectancies of transistors and other chip elements have also been shrinking. The conjunction of 3D assembly and shrinking geometries pushes the need for field-repairable devices and indeed self-repair. Consider, as an example, 64 GB memory layers stacked 16 layers high with W2W 3D assembly. This memory stack is in turn stacked onto a two-layer 16-core processor. The entire 3D assembly would contain well over one trillion transistors. An analysis of such a device would return a mean time between failures (MTBF) of mere weeks. Although a device of this complexity may sound futuristic, such devices are actually on drawing boards today and will see first attempts at fabrication in the next few years. The need for field repair will drive the need for field self-test. Design for test (DFT) has come a long way and is used today by most logic designers, but analog, commercial memory, and commercial CPU designers are resistant to the typical test insertion flows. Analog circuitry has no workable DFT equivalent. Fortunately, due to its typically larger geometries, analog circuitry is much less likely to need field test and repair. Memories and large fast CPUs run close to the edge of operating speed. Blind test insertion would cause significant performance reduction. These types of devices must use a more hand-guided, hand-engineered test insertion approach. These realities will force testing to be contemplated and inserted at the IP block level. As a first step, future 3D chip designers must tie together the various test interfaces to allow factory testing. The next step will progress from built-in test hardware to self-test. Test hardware must not only provide access but, with minimal outside help, actually test the chip. Ignoring the 3D aspect and simply thinking about the I/O bandwidth requirements, one trillion transistors might take a long time to test with just a few hundred pins on a device. The situation quickly becomes one where the test cost far exceeds the cost of silicon. This cost issue opens the door to self-test. Self-test undoubtedly adds silicon area, but if it reduces test cost by a larger amount, it makes sense. After crossing the bridge to self-test, we can see the final destination: a device that can diagnose its own failure and repair itself through redundancy. The most obvious example of repair is memory. Virtually every DRAM made today has defects. The memories are tested and repaired through redundancy at the factory. The repairs are permanent and transparent to the user. An acquaintance in the industry once stated, "If you are getting prime die [dies that do not need repair] at the end of the line, you just aren't trying hard enough." Tezzaron already employs built-in self-test and self-repair (Bi-STAR®) in some of its devices. Through time, this will likely be extended to all devices. The benefits of this approach are manifold: factory test is simplified and failures that occur in subsequent assembly and
packaging are almost completely masked. The MTBF of a memory device with Bi-STAR is 1,000 times better than without. Granted, memory self-test and repair is much more straightforward than in, say, a CPU core, but some things are not difficult in any case. Many multicore processors are now built with "spare" cores. A failing core could be field repaired by replacement. Other components might also be built with spares. The best direction forward in order to control and connect various system parts seems to be building a network on chip (NOC) that is fault tolerant and provides ready access to spare parts. The most difficult 3D work ahead is most certainly in test, repair, and redundancy.
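The "MTBF of mere weeks" estimate earlier in this section, and the value of the 1,000-fold improvement claimed for Bi-STAR, can be sanity-checked with a series-system reliability calculation: if every element must work and failures are independent, the failure rates simply add. The per-transistor FIT rate below is an assumed illustrative number, not one given in the text.

```python
def system_mtbf_hours(num_elements, fit_per_element):
    """Series-system MTBF. FIT = failures per 1e9 device-hours; the rates of
    independent elements add, so MTBF = 1e9 / (N * FIT)."""
    return 1e9 / (num_elements * fit_per_element)

# ~1e12 transistors at an assumed 1e-6 FIT each -> about 1,000 hours (~6 weeks),
# the same order of magnitude as "an MTBF of mere weeks".
print(system_mtbf_hours(1e12, 1e-6))

# If on-chip repair masks all but 0.1% of those failures, the effective MTBF
# improves by roughly the factor of 1,000 described for Bi-STAR.
print(system_mtbf_hours(1e12 * 1e-3, 1e-6))
```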
4.12 A Case Example: Tezzaron 3D DRAM The Tezzaron DRAM is split into two fundamental pieces: the memory cells and basically everything else that makes up a DRAM (senseamps, wordline drivers, timers, DLL, and I/O circuitry) as shown in Fig. 4.10.
4.12.1 Customization Process separation itself enables much easier customization of memories. DRAM fabrication processes are highly tuned to run high volumes of identical devices. This tuning allows the lowest possible cost and highest yield. The portion of the memory that one would want to customize would not be the memory cells but rather the other circuitry. If this circuitry is separately fabricated in a logic process,
Fig. 4.10 Tezzaron's 3D DRAM architecture: the memory layers contain memory cells only; ultra-dense, ultra-fine TSVs carry the bitlines, wordlines, power, ground, VBB, and VDDH down to the controller layer, which holds the senseamps, wordline drivers, Bi-STAR, I/O drivers, etc.
one can easily see how high-volume low-cost DRAM cells can be combined with fully customized interfaces. Tezzaron’s 3D memories make use of this concept with only a few different designs required at each node. The concept is being extended further to allow expansion of memory arrays in x, y, and z dimensions to meet the needs of specific applications for density, cost, and speed. The memory layers are truly identical, taking advantage of stagger step control signal routing that allows up to 16 layers of memory to be stacked together with no changes to the memory layer itself.
4.12.2 Process Separation DRAM capacitors can be built as trenches or as stacked devices. Trenched capacitors are basically adjacent to or even overlapping the transfer gates, while stacked capacitors are stacked above the transfer gates. As shrinking has progressed, the trenched method is reaching its terminus. Only stacked capacitors will reach production at nodes below ~45 nm. Stacked capacitors can be more than 3 µm tall. This height creates some interesting interconnect issues. The first layer of metal (after bitlines, which often are made or "plated" with tungsten) is more than 3 µm above the actual transistors. Thinking back to the earlier discussion about TSV aspect ratios, one can see a similar set of issues. DRAM manufacturers push high aspect ratios for their transistor contacts. The drawbacks, in addition to larger design rules, include high-resistance transistor connections. 30–50 Ω per contact is reasonable. This is about 10× what can be achieved in a typical logic process. In the memory cell itself, the transfer gates below the capacitors have contacts that are usually made of polysilicon and the resistance is measured in tens of kilohms. One can see why DRAM is not particularly fast. The I/O frequencies achieved by DRAM for DDR3 and DDR4 are near miracles. The rest of the DRAM, beyond the memory cells, can benefit greatly by being built in a logic process. A typical logic process has about 3× the transistor gain bandwidth of a DRAM process at the same process node. The senseamps can be faster and more sensitive; logic for built-in self-test and repair becomes much more reasonable. Logic interconnect below 150 nm is also made with copper, whereas DRAM still primarily uses aluminum (only Micron® uses copper for its top layer or two of wiring). Also of note is that DRAM usually has only two or three layers of metal interconnect above the bitline. One can readily grasp how a logic process could unshackle new DRAM capabilities.
4.12.3 Improving Latency Adding the third dimension provides a specific set of key benefits that enable dramatic speed and yield improvements. If the memory cells are layered,
the length of the bitlines could be reduced with little impact to the overhead. Shorter bitlines on each layer can be multiplexed to the vertical bitline that runs down to the logic layer and its senseamps. In a 2D DRAM, reducing the bitline length does indeed produce a faster DRAM, but it necessitates more rows of senseamps and thus much more overhead, leading to higher cost. This is in fact how the fastest DRAMs are built today. The impact of reducing the bitline length is significant. Typical bitline evaluation time is 10–15 ns. If 3D can reduce the bitline length by about 1/2, it should decrease the evaluation time by about a factor of 4. In fact, 3D stacking realizes a factor of 3. This same basic lesson is true for the wordlines.
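The factor-of-4 expectation follows from treating the bitline as a distributed RC line, whose resistance and capacitance both grow with length, so delay grows roughly with length squared. The sketch below just restates that scaling against the 10–15 ns evaluation time quoted above; the 12 ns baseline is an arbitrary value inside that range.

```python
def scaled_bitline_eval_ns(baseline_ns, length_ratio):
    """Distributed-RC scaling: both resistance and capacitance are
    proportional to bitline length, so delay scales with length^2."""
    return baseline_ns * length_ratio ** 2

# Halving the bitline predicts ~4x faster evaluation; the chapter reports
# that the realized improvement is closer to 3x once overheads are included.
print(scaled_bitline_eval_ns(12.0, 0.5))   # 3.0 ns from a 12 ns baseline
```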
4.12.4 Improving Yield Using shorter rows and columns in the memory not only improves the speed of the device, but also improves the reparability. In DRAMs, the primary repair mechanism is row and column redundancy. Half-size rows and columns will double the number of possible repairs with no additional spare elements. Even more dramatic is the impact of sharing repairs from layer to layer. An irreparable DRAM has typically run out of rows in a particular memory array. When this occurs, the part is bad no matter how few other defects exist. Defects tend to cluster, so DRAMs generally do repairs on groups of rows, typically four or eight rows at a time. Clustering defects have no effect across vertically connected wafers. This significantly reduces the likelihood of running out of repair elements, as spare elements from the other layers can be brought into play. Process separation allows the senseamps and other typical DRAM elements to be built in a much higher performing process, and can also conceal additional errors. On-board error detection and correction can be built in, as well as other redundancy mechanisms to swap out secondary internal circuits. As an example of the logic performance improvement one can obtain, a 64-bit ECC calculation in a 100 nm DRAM process takes almost ten times longer than in a 100 nm logic process. The difference comes from a combination of transistor drive strength and the limitations of the interconnect in the DRAM process. Because it runs ten times faster, ECC functionality can be “hidden.”
4.12.5 Reducing Cost The new, faster, more reparable memory can also provide cost improvement. The largest cost improvement stems from reducing process complexity for the DRAM layers. Because the peripheral circuitry has been removed, the process can be simplified, thus reducing cost and probably somewhat improving yield. This benefit by itself may balance the costs of the 3D assembly. Adding the improved yield that 3D provides (discussed previously) certainly tips the cost scale in 3D’s
favor. Another boost to cost improvement comes from the enhanced on-board reparability with built-in self-test and self-repair. On mature technology memories, the cost of test actually exceeds the cost of the silicon itself. This fact, plus the approaching realities of usable lifetime wear-out, makes large DRAM an ideal target for self-test and self-repair. The additional interconnect provided by 3D assembly enables more fine-grained repair, and the process separation allows the addition of intelligence to use the new repair capabilities.
4.12.6 Reducing Power A noteworthy reduction in power comes with 3D integration. In the reference DRAM design, the reduced lengths of bitlines and wordlines reduce active power consumption by about 40%. The logic process layer, which now consumes the vast majority of the power, is also much more power efficient than the same circuitry built in a DRAM process.
4.13 Summary It is true that 3D integrated circuits still have many significant issues that must be addressed. Nonetheless, more than 40 3D devices have been successfully designed and built, including multicore processors with millions of transistors and memories with billions of transistors. It is not easy to create these devices with today’s tools, but it can be done with sufficient perseverance, hard work, and care. 3D integration can simultaneously enable higher performance, lower power, lower cost, and more reliability. Tezzaron’s earliest 3D devices have proven themselves with more than 6 years of operation and have passed full temperature qualification. Ultimately, most future circuits are certain to employ 3D integration: it has proven, and continues to prove, its ability to make any existing semiconductor technology even better.
Chapter 5
3D Physical Design Jason Cong and Guojie Luo
5.1 Introduction The physical design process for 3D ICs is similar to that used for traditional 2D physical design, in the sense that it transforms the circuit representation from a netlist into a geometric representation by the steps of floorplanning, placement, and routing. While the multiple metal layers of traditional ICs already give the interconnect a 3D structure, 3D IC technologies allow multiple layers of logic devices to be integrated in the third dimension by bonding stacks of multiple "tiers" to form 3D chips. Each tier, which is similar to a traditional 2D IC, consists of one silicon layer and several metal layers, and different tiers are connected by through-silicon vias (TS vias). Figure 5.1 shows two examples of 3-tier 3D ICs in a cross-section view. The bottom tier, the middle tier and the top tier are labeled 1, 2, and 3, respectively. The physical layers are parallel to the (x, y) plane, and are bonded along the z-direction, where the darker-shaded bands are dielectric layers, the lighter-shaded bands are silicon layers, and the white bands are metal layers. The large rectangles vertically drilling through the silicon layers represent TS vias, which connect logic gates on different silicon layers. The I/O ports open above the topmost layer. Figure 5.1a presents a 3D IC formed by bonding three tiers in a back-to-face order, where the back side (the silicon layer) of the upper-level tier is bonded to the front side (the topmost metal layer) of the lower-level tier. Figure 5.1b presents another 3D IC, where the middle tier is bonded face-to-face to the bottom tier, and the top tier is bonded face-to-back to the middle tier. The requirements on physical design tools to support 3D IC technologies come from several aspects [3, 4, 9]. The latency and power are still important criteria, and the floorplanner and placer have to consider the timing and power characteristics of TS vias. The thermal issues in 3D ICs become critical: (1) the vertically stacked multiple layers of active devices cause a rapid increase in power density; (2) the thermal
Fig. 5.1 Two examples of 3D ICs in a cross-section view
Fig. 5.2 Physical design flow for 3D ICs
conductivity of the dielectric layers between the device layers is very low compared to silicon and metal. For instance, the thermal conductivity at room temperature (300 K) for SiO2 is 1.4 W/mK [26], which is much smaller than the thermal conductivity of silicon (150 W/mK) and copper (401 W/mK). Therefore, the thermal issue needs to be considered during every step of the 3D physical design flow. A reference 3D physical design flow is shown in Fig. 5.2, as developed in [16, 13]. The 3D design database holds the necessary information for physical
design tools, including the technology library (e.g., design rules, attributes of physical layers), the cell/macro library, and the netlist. The netlist is transformed into a 3D geometric representation in the steps of 3D floorplanning, 3D placement, and 3D routing, and the thermal issues of 3D ICs are relieved by adopting thermal TS vias [10, 33]. These steps form the main flow of 3D physical design, which is covered in this chapter. Please note that other supporting steps, such as power grid optimization [59] and clock tree synthesis [37, 42], are also important for 3D physical design but are not addressed due to page limitations. In the remainder of this chapter, we shall present the steps of 3D physical design in reverse order, starting with 3D routing and thermal TS via planning (Sect. 5.2). Then we present the problem formulations and algorithms of 3D placement (Sect. 5.3) and 3D floorplanning (Sect. 5.4). The microarchitectural exploration using a 3D floorplanner is presented in Sect. 5.5. Finally, Sect. 5.6 concludes this chapter.
5.2 3D Routing and Thermal TSV Planning Given the placement of every cell and every module, either manually or automatically, 3D routing is used to connect all the cells and macros by metal wires, vias, and TS vias according to the netlist information, without violating the design rules, under constraints like timing, crosstalk, temperature, and yield.
5.2.1 Problem Formulation The inputs of a 3D routing problem include the following:
• Design rules: They specify the minimum sizes and spacing of the metal wires, vias, and TS vias.
• Netlist: It specifies the connectivity of pins.
• Pin locations: The pins include the I/O pins of the top-level design, and the pins of all the cells and macros, the locations of which are determined after the 3D placement step.
• Obstacles: The pre-routed nets create obstacles, and the placed cells and macros create obstacles for TS vias.
• Constraint-related parameters: They include the electrical parameters, thermal parameters, yield parameters, etc., for various design constraints.
Two examples of the 3D routing resources are shown in Fig. 5.3a, b, which correspond to the 3D ICs in Fig. 5.1a, b, respectively. The postfix of the layer names in Fig. 5.3 represents the tier in which this layer is located, and the prefix represents the layer type. The silicon layer is labeled with a prefix TSV, where the interconnect going through this layer is implemented by TS vias. The metal layers are labeled
Fig. 5.3 Two examples of the 3D routing models (each tier comprises metal layers m1 and m2, a via layer v12, a poly layer, and a silicon layer, with tiers connected through the TSV_12 and TSV_23 layers)
with prefixes m1 and m2 with gray shading, and the via layer between them is labeled with v12. Although there are only two metal layers of each tier shown in these examples for the convenience of demonstration, more metal layers are manufacturable in 3D IC technologies. The interconnects are routed in orthogonal directions inside the metal layers, and they are routed in a vertical direction on the via layer and TS via layers. The pins of cells and macros are usually located at the low-level metal layers, as in 2D ICs, and the I/O pins are at the topmost layer in the 3D IC layer stack. Obstacles may exist at every layer, where the pre-routed nets create obstacles on the metal layers and via layers, and the cells and macros create obstacles on the TSV layers. The 3D routing models are very similar to the 2D routing models with metal layers and via layers only, where the TS via layers can be viewed as special via layers. The major differences are that: (1) there are many more obstacles on the TS via layers than on the via layers, due to the placed cells and macros; (2) the minimum size and spacing of the TS via layers are much larger than those of the via layers; and (3) there are tight thermal constraints. Clearly, the 3D routing problem is a generalized version of the routing problem for multi-metal layer 2D ICs. Since the thermal issues are critical for 3D designs, the concept of the thermal TS via is proposed as an effective way to reduce temperature [10, 33]. The thermal TS via planning problems can be formulated as below: given a netlist, the place and route (P&R) region, and the thermal analysis model, find the location of thermal TS vias to satisfy the temperature constraint without violating the feasibility and
degrading the quality of the 3D P&R results. The thermal TS via planning can be performed before routing, during routing, or after routing.
5.2.2 3D Global Routing Algorithms
Researchers began to investigate 3D channel routing problems [22, 47] in the early 1990s purely out of theoretical interest. In recent years, area routers have come to dominate when multiple metal layers are available for routing, and so have 3D routers, where the routing problems can be modeled as in Fig. 5.3. The 3D routing process can be divided into global routing and detailed routing stages. In this chapter, we focus on the global routing problem, because once the TS via positions are determined, we can take advantage of the existing 2D detailed routers to complete the detailed routings tier by tier. During 3D global routing, a grid structure is imposed on the routing layers. Each grid is modeled by a switching box with six capacities along the x-axis, y-axis, and z-axis in both directions. A single layer of grids can model a physical layer at a fine level [14], and it can also model a sequence of physical layers, which usually form a tier, at a coarse level [57]. For example, if a layer of grids models a physical layer, the capacities can be computed in this way: (1) the capacities of a grid on the metal layers along the x-axis and y-axis are computed according to the obstacles inside that grid, and the capacities along the z-axis are zero; (2) the capacities of a grid on the via layers and TS via layers along the x-axis and y-axis are zero, and the capacities along the z-axis are computed according to the obstacles inside that grid. Roughly speaking, the 3D global routing problems are very similar to the 2D routing problems, because the multiple metal layers already create some 3D structures. The additional considerations in 3D routing problems include the following: (1) the solution space is larger than that of 2D routing problems because more layers are available; (2) the pins are located on more layers in 3D routing problems than in 2D routing problems, where the pins are only on a few metal layers close to the silicon layer; (3) the TS vias create blockages and consume routing resources to go through silicon layers; (4) thermal optimization is available based on the idea of thermal TS vias. The global routing flow can be implemented either in a flat global routing scheme (Fig. 5.4) or in a multilevel global routing scheme (Fig. 5.5). In the flat global routing scheme [57], the initial 3D global routing consists of the signal TS via planning and the wire routing. After the initial routing, iterations of thermal TS via planning and rip-up and reroute are performed to meet the congestion constraints and the thermal constraints. In the multilevel global routing scheme [14], a V-cycle consists of a downward pass and an upward pass. In the downward pass, the coarse-level 3D global routing problems are constructed from the fine-level problems by estimating the coarse-level routing resources and the thermal-related information. These problems are solved in the upward pass from the coarsest level to the finest level, where the local nets are routed and the routing of global nets is refined. At the coarsest level, an
Fig. 5.4 The flat global routing scheme
Fig. 5.5 The multilevel global routing scheme [14]
initial 3D global routing and the thermal TS via planning are performed. Then the coarse-level routing results are projected to a finer-level problem, and the finer-level problem is solved by completing the signal TS via planning, the thermal TS via planning, and the rip-up and reroute. After the 3D global routing is done and the TS via locations are determined at the finest level, 2D detailed routings are performed tier-by-tier to finish the 3D routing process. In both the flat scheme and the multilevel scheme, the rip-up and reroute is only for wire routings. Thus, the techniques for 2D rip-up and reroute can be adopted.
In the following subsections, we shall focus on the key components of 3D global routing algorithms, including initial 3D global routing (Sect. 5.2.2.1), signal TS via planning (Sect. 5.2.2.2) and thermal TS via planning (Sect. 5.2.2.3).
5.2.2.1 Initial 3D Global Routing
During the initial 3D global routing stage, the signal TS via locations can be determined before or during wire routing. The work in [57] plans TS vias before wire routing using three steps: routing congestion estimation, signal TS via planning, and 2D wire routing. The first step estimates the routing congestion by extending the L-Z shaped statistical routing model [50] for 3D global routing. Based on the congestion information, the second step plans the signal TS vias by a min-cost network flow heuristic, which will be presented in the next subsection. After the signal TS via planning, the inter-tier nets are decomposed into a set of 2D nets by adding pseudo pins on each tier to replace the signal TS vias, and the decomposed nets are routed by 2D routing algorithms, e.g., 2D maze routing, at the last step to complete the initial 3D global routing. The initial 3D global routing stage can also be completed by simultaneous signal TS via planning and wire routing, with either concurrent or sequential approaches. The work in [20] extends the hierarchical routing algorithm [6] as a concurrent approach to the 3D global routing problem. However, it assumes that space between cell rows is provided for TS vias and only limits the total TS vias per row of cells, without modeling the placed cells as routing blockages. As a sequential approach, the work in [14] applies 3D maze searching to conduct the simultaneous routing of TS vias and wires. The multipin nets are first converted to minimum spanning trees, and then these minimum spanning trees are converted to minimum Steiner trees by performing a point-to-path 3D maze searching. Steiner edges are created when the searching path touches the existing edges of the tree before the target point. The maze-searching algorithm finds the shortest paths with awareness of obstacles, and is capable of handling TS vias by properly setting the routing cost and routing resources along the z-direction for the 3D maze-searching engine.
5.2.2.2 Signal TS Via Planning
The signal TS via planning can be used at the initial 3D global routing stage, or at the fine-level routing refinement in the multilevel scheme. Given a planning window PW divided into planning bins {b_j}, j = 1, 2, …, m, with positions (x_j, y_j) and capacities c_j, assign the signal TS vias {v_i}, i = 1, 2, …, n, in PW to these bins, so that the number of TS vias assigned to each bin b_j does not exceed its capacity c_j and the wirelength is minimized. The min-cost network flow heuristic is a commonly used method for signal TS via planning [14, 57]. A network G(V, E) is constructed, whose node set includes all the TS vias {v_i}, all the planning bins {b_j}, a pseudo source node s, and
Fig. 5.6 Min-cost flow problem for signal TS via planning
a pseudo sink node t. There are three kinds of edges in the edge set E, where each edge is assigned a (capacity, cost) pair:
• The source node s has a supply of n, and connects to the n TS vias {v_i}. Each edge (s, v_i) has capacity 1 and cost 0.
• There are n × m edges from the TS vias {v_i} to the bins {b_j}. The capacity of edge (v_i, b_j) is infinity, and the cost cost_WL(i, j) is the estimated wirelength after assigning v_i to b_j.
• Every bin b_j connects to the sink node t, where each edge (b_j, t) has a capacity of c_j and a cost of 0.
In this min-cost flow problem, the supply at the source node and the maximal capacities at the edges are all integers. Thus, the optimal solution will also have integer values [29]. This solution indicates the optimal assignment of each signal TS via to the planning bins in the sense of the estimated wirelength, and the problem can be solved in polynomial time (Fig. 5.6).
5.2.2.3 Thermal TS Via Planning
Thermal TS via planning is carried out after signal TS via planning, because wirelength is usually more critical. During thermal TS via planning for an L-tier 3D IC, the routing region on each tier is divided into N × M planning bins, which are denoted as {b_{i,j,k}} with 1 ≤ i ≤ N, 1 ≤ j ≤ M, and 1 ≤ k ≤ L. Given the placement of cells/macros and the signal TS via planning results, we can compute the TS via capacity c_{i,j,k} and the minimum TS via number s_{i,j,k} of each planning bin b_{i,j,k}, which give the per-bin TS via constraints such that the TS via number n_{i,j,k} satisfies s_{i,j,k} ≤ n_{i,j,k} ≤ c_{i,j,k}. Thus, the problem of thermal TS via planning is to minimize
the total number of TS vias ∑_{i,j,k} n_{i,j,k}, subject to the temperature constraint and the per-bin TS via constraints. The thermal TS via planning problem can be solved by linear programming [57] with {n_{i,j,k}} as problem variables. However, the approximation that the temperature change ΔT_{i,j,k} in bin b_{i,j,k} is proportional to the TS via number change Δn_{i,j,k} is not accurate in general, as the thermal conductance and the temperature are in a nonlinear relationship. The thermal sensitivity factors depend on the TS via distribution. Therefore, the thermal sensitivity analysis and the linear programming have to be performed iteratively until a feasible solution is reached. The thermal TS via planning affects the thermal characteristics of a 3D IC. The work in [15] proposes a nonlinear programming-based method, where a thermal-resistive network model [51] is integrated in the formulation. The bin structure is shown in Fig. 5.7a, where each bin b_{i,j,k} is associated with a power dissipation P_{i,j,k} and a temperature T_{i,j,k}. The thermal-resistive network model corresponding to the bin structure is shown in Fig. 5.7b. It is assumed that the heat sink is attached to the bottom tier, and the other sides of the resistive network are adiabatic. The notations for the heat flows in the thermal-resistive network are shown in Fig. 5.7c, where a heat flow opposite to the arrow direction is represented by a negative value. The power dissipation and the heat flows related to a node in the thermal-resistive network satisfy:
\[
P_{i,j,k} + H^{(x)}_{i-1,j,k} + H^{(y)}_{i,j-1,k} + H^{(z)}_{i,j,k+1} = H^{(x)}_{i,j,k} + H^{(y)}_{i,j,k} + H^{(z)}_{i,j,k}. \tag{5.1}
\]
And the heat flows and the temperature satisfy:
\[
\frac{H^{(x)}_{i,j,k}}{G^{(x)}_{i,j,k}} = T_{i,j,k} - T_{i+1,j,k}, \qquad
\frac{H^{(y)}_{i,j,k}}{G^{(y)}_{i,j,k}} = T_{i,j,k} - T_{i,j+1,k}, \qquad
\frac{H^{(z)}_{i,j,k}}{G^{(z)}_{i,j,k}} = T_{i,j,k} - T_{i,j,k-1}, \tag{5.2}
\]
where G^{(x)}_{i,j,k}, G^{(y)}_{i,j,k}, and G^{(z)}_{i,j,k} are the thermal conductivities at the heat flow edges (b_{i,j,k}, b_{i+1,j,k}), (b_{i,j,k}, b_{i,j+1,k}), and (b_{i,j,k}, b_{i,j,k-1}), respectively.
Fig. 5.7 TS via planning bins and the thermal-resistive network model
Instead of directly formulating {n_{i,j,k}} as problem variables, the temperatures T = {T_{i,j,k}} and the heat flows H = {H^{(x)}_{i,j,k}} ∪ {H^{(y)}_{i,j,k}} ∪ {H^{(z)}_{i,j,k}} are the variables in the nonlinear programming problem. The temperature and heat flows determine the number of TS vias n_{i,j,k}(H, T) to be planned in each bin b_{i,j,k}:
\[
n_{i,j,k}(H,T)\, g_{\mathrm{TSV}} + g^{(z)}_{i,j,k} = G^{(z)}_{i,j,k} = \frac{H^{(z)}_{i,j,k}}{T_{i,j,k} - T_{i,j,k-1}}
\;\Rightarrow\;
n_{i,j,k}(H,T) = \frac{\dfrac{H^{(z)}_{i,j,k}}{T_{i,j,k} - T_{i,j,k-1}} - g^{(z)}_{i,j,k}}{g_{\mathrm{TSV}}}, \tag{5.3}
\]
where g^{(z)}_{i,j,k} is the thermal conductivity at the heat flow edge (b_{i,j,k}, b_{i,j,k-1}) without any TS vias, and g_TSV is the thermal conductivity of one TS via. Thus, the thermal TS via planning problem can be solved by minimizing ∑_{i,j,k} n_{i,j,k}(H, T), subject to the per-bin TS via constraints s_{i,j,k} ≤ n_{i,j,k}(H, T) ≤ c_{i,j,k}, the temperature constraints T_{i,j,k} ≤ Tmax, and the thermal model constraints (5.1) and (5.2). This nonlinear programming problem can be solved by the alternating direction TS via planning algorithm (ADVP) [27], which iteratively alternates between vertical thermal TS via planning and horizontal thermal TS via planning.
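As a small illustration of (5.3), the sketch below converts a required vertical heat flow and temperature drop for one bin into a TS via count and clips it to the per-bin bounds; the conductance values are placeholders, not data from the cited work.

```python
import math

def thermal_tsvs_for_bin(h_z, t_bin, t_below, g_z_no_tsv, g_tsv,
                         s_min=0, c_max=10**6):
    """TS via count implied by (5.3): the bin's vertical conductance must
    reach H_z / (T_ijk - T_ij(k-1)); whatever the bare dielectric stack
    (g_z_no_tsv) cannot supply is made up in units of one TS via's
    conductance (g_tsv). The result is clipped to [s_min, c_max]."""
    delta_t = t_bin - t_below
    if delta_t <= 0:
        return s_min                      # no downward heat flow needed here
    required_g = h_z / delta_t            # W/K needed through this bin
    n = (required_g - g_z_no_tsv) / g_tsv
    return max(s_min, min(c_max, math.ceil(n - 1e-9)))  # guard FP noise

# 0.05 W flowing down across a 1 K drop, bare conductance 0.01 W/K,
# 0.004 W/K per TS via:
print(thermal_tsvs_for_bin(0.05, 80.0, 79.0, 0.01, 0.004))   # -> 10
```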
5.3 3D Placement Placement is an important step in the 3D physical design flow. The performance, power, temperature, and routability are significantly affected by the quality of placement results. Thus, a 3D placement tool has to minimize the total wirelength and has to control the TS via number and temperature.
5.3.1 Problem Formulation Given a circuit H = (V, E), the tier number K, and the per-tier placement region R = [0, a] × [0, b], where V is the set of cell instances (represented by vertices) and E is the set of nets (represented by hyperedges) in the circuit H (represented by a hypergraph), a placement (x_i, y_i, z_i) of the cell v_i ∈ V satisfies (x_i, y_i) ∈ R and z_i ∈ {1, 2, …, K}. The 3D placement problem is to find a placement (x_i, y_i, z_i) for every cell v_i ∈ V, so that an objective function such as the weighted total wirelength is minimized, subject to overlap-free constraints and other constraints such as performance and temperature. In this chapter, we focus on temperature constraints, as the performance constraints are similar to those of 2D placement. The reader may refer to [18, 40] for a survey and tutorial of 2D placement.
5.3.1.1 Wirelength Objective Function

The quality of a placement solution can be measured by the performance, power, and routability, but the measurement is more difficult than that in routing. In order to model these aspects during optimization, the weighted total wirelength is a widely accepted metric for measuring placement quality [39, 40]. Formally, the placement objective function is defined as
$$\mathrm{OBJ} = \sum_{e \in E} (1 + r_e)\,\bigl(\mathrm{WL}(e) + \alpha_{\mathrm{TSV}} \cdot \mathrm{TSV}(e)\bigr). \quad (5.4)$$
The objective function depends on the placement $\{(x_i, y_i, z_i)\}$, and it is a weighted sum of the wirelength $\mathrm{WL}(e)$ and the number of TS vias $\mathrm{TSV}(e)$ over all the nets. The weight $(1 + r_e)$ reflects the criticality of the net $e$, which is usually related to performance optimization; the unweighted wirelength is obtained by setting $r_e$ to 0. This weight is often used to model the thermal effect, timing, and timing criticality of net $e$ [25]. The wirelength $\mathrm{WL}(e)$ is usually estimated by the half-perimeter wirelength (HPWL) [17, 25]:
$$\mathrm{WL}(e) = \Bigl(\max_{v_i \in e}\{x_i\} - \min_{v_i \in e}\{x_i\}\Bigr) + \Bigl(\max_{v_i \in e}\{y_i\} - \min_{v_i \in e}\{y_i\}\Bigr). \quad (5.5)$$
Similarly, $\mathrm{TSV}(e)$ is modeled by the range of $\{z_i : v_i \in e\}$ [17, 24, 25]:

$$\mathrm{TSV}(e) = \max_{v_i \in e}\{z_i\} - \min_{v_i \in e}\{z_i\}. \quad (5.6)$$
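A direct way to read (5.4)–(5.6) is as the following evaluation routine. This is only a sketch: the placement and net containers are assumed data structures, and the net weights $r_e$ default to zero (the unweighted case).

```python
def placement_objective(placement, nets, alpha_tsv, net_weights=None):
    """Weighted objective (5.4): sum over nets of (1 + r_e)(WL(e) + alpha_TSV * TSV(e)).

    placement : dict mapping a cell name to its (x, y, z), with z an integer tier
    nets      : list of nets, each a list of cell names
    """
    total = 0.0
    for e, net in enumerate(nets):
        xs = [placement[v][0] for v in net]
        ys = [placement[v][1] for v in net]
        zs = [placement[v][2] for v in net]
        hpwl = (max(xs) - min(xs)) + (max(ys) - min(ys))   # half-perimeter wirelength (5.5)
        tsv = max(zs) - min(zs)                            # TS via count estimate (5.6)
        r_e = 0.0 if net_weights is None else net_weights[e]
        total += (1.0 + r_e) * (hpwl + alpha_tsv * tsv)
    return total
```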
The coefficient $\alpha_{\mathrm{TSV}}$ is the weight for TS vias; it models a TS via as some additional wirelength. For example, in a 0.18-µm silicon-on-insulator (SOI) technology, [21] estimates that a 3-µm-thick TS via is roughly equivalent to 8–20 µm of metal-2 wire in terms of capacitance, and to about 0.2 µm of metal-2 wire in terms of resistance. Thus a coefficient $\alpha_{\mathrm{TSV}}$ between 8 and 20 (µm) can be used for optimizing power or delay in this case.

5.3.1.2 Overlap-Free Constraints

The ultimate goal of overlap-free constraints can be expressed as follows:
$$|x_i - x_j| \ge \frac{w_i + w_j}{2} \quad \text{or} \quad |y_i - y_j| \ge \frac{h_i + h_j}{2} \qquad \text{for all cell pairs } v_i, v_j \text{ with } z_i = z_j, \quad (5.7)$$

where $(x_i, y_i, z_i)$ is the placement of cell $v_i$, and $w_i$ and $h_i$ are its width and height, respectively; the same applies to cell $v_j$. Such constraints were used directly
Fig. 5.8 (a) Density constraint is satisfied; (b) density constraint is not satisfied
in some analytical placers early on, such as in [7]. However, this formulation leads to a large number, $O(n^2)$, of either-or constraints, where $n$ is the total number of cells, which is not practical for modern large-scale designs. To handle these pairwise overlap-free constraints more scalably, modern placers divide the placement process into coarse legalization and detailed legalization. Coarse legalization relaxes the pairwise nonoverlap constraints into regional density constraints:
$$\sum_{\text{all } cell_i \text{ with } z_i = k} \mathrm{overlap}(bin_{m,n,k}, cell_i) \;\le\; \mathrm{area}(bin_{m,n,k}) \qquad \text{for all } m, n, k, \quad (5.8)$$
where $\mathrm{overlap}(bin_{m,n,k}, cell_i)$ represents the partial area of $cell_i$ that is contained in $bin_{m,n,k}$, and $\mathrm{area}(bin_{m,n,k})$ represents the area capacity of $bin_{m,n,k}$. For a 3D circuit with $K$ tiers, each tier is divided into $L \times M$ bins. If every bin $(m, n, k)$ satisfies inequality (5.8), the coarse legalization is finished. Examples of the density constraints on one tier are given in Fig. 5.8. After coarse legalization, detailed legalization satisfies the pairwise nonoverlap constraints using various discrete methods and heuristics [17, 25].
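The density constraint (5.8) can be checked by accumulating the partial area of every cell into the bins of its tier. The sketch below uses a brute-force rectangle-intersection loop and an assumed `capacity_ratio` knob; a real coarse legalizer would be far more efficient, but the arithmetic is the same.

```python
import numpy as np

def density_ok(cells, bin_w, bin_h, n_bins_x, n_bins_y, n_tiers, capacity_ratio=1.0):
    """Check the coarse-legalization density constraint (5.8) for every bin.

    cells : list of (x, y, z, w, h) with (x, y) the lower-left corner of the cell,
            z its (0-based) tier, and w, h its width and height
    """
    usage = np.zeros((n_bins_x, n_bins_y, n_tiers))
    for (x, y, z, w, h) in cells:
        for m in range(n_bins_x):
            for n in range(n_bins_y):
                # overlap(bin_{m,n,z}, cell): intersection area of two rectangles
                ox = max(0.0, min(x + w, (m + 1) * bin_w) - max(x, m * bin_w))
                oy = max(0.0, min(y + h, (n + 1) * bin_h) - max(y, n * bin_h))
                usage[m, n, z] += ox * oy
    capacity = capacity_ratio * bin_w * bin_h
    return bool((usage <= capacity).all()), usage
```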
5.3.1.3 Thermal Awareness

In existing literature, temperature issues are not directly formulated as constraints. Instead, a thermal penalty is appended to the wirelength objective function to control the temperature. This penalty can either be the weighted temperature penalty that is transformed to thermal-aware net weights [25], or the thermal distribution cost penalty [55], or the distance from the cell location to the heat sink during legalization [17].
5.3.2 Overview of Existing 3D Placement Approaches

The state-of-the-art algorithms for 2D placement can be classified into the flat placement approach, the top-down partitioning-based approach, and the multilevel placement approach [40]. These approaches scale to the growing complexity of modern VLSI circuits. To handle the scalability issues, they divide the placement problem into three stages: global placement, legalization, and detailed placement. Given an initial solution, global placement refines the solution until the overlap-free constraints (Sect. 5.3.1.2) are satisfied. These regions are handled in a top-down fashion from the coarsest level to the finest level by the partitioning-based and multilevel placement techniques, and in a flat fashion at the finest level by the flat placement techniques. After global placement, legalization determines the specific location of all cells without overlaps, and detailed placement performs local refinements to obtain the final solution.

As the modern 2D placement approaches evolve, a number of 3D placement approaches are also being developed to address the issues of 3D IC technologies. Most of the existing approaches, especially at the global placement stage, can be viewed as extensions of 2D placement approaches. We group the 3D placement approaches into four categories: the partitioning-based approach, flat placement approaches, the multilevel placement approach, and the transformation-based approach.

• The partitioning-based approach [1, 2, 19, 25] applies the same divide-and-conquer strategy as the well-known partitioning-based 2D placement approach. The bisection of the placement region in the z-direction is performed at some suitable steps in addition to the bisections in the x-direction and the y-direction. The cost of partitioning is measured by a weighted sum of the estimated wirelength and the TS via number, where the nets can be further weighted by thermal-aware or congestion-aware factors to consider temperature and routability.

• Flat placement approaches are variations of quadratic placement, including the force-directed approach [24, 31], the cell-shifting approach [28], and the quadratic uniformity modeling approach [55]. Since unconstrained quadratic placement introduces a great number of cell overlaps, different variations are developed for overlap removal. The minimization of the quadratic wirelength, as well as the quadratic form of the TS via number, can be transformed into the problem of solving a linear system. The idea of these flat placement approaches is to append a force vector, computed from the area density distribution to help remove overlaps, to the right-hand side of the linear system. The vector is updated and the linear system is solved iteratively until the area in every predefined region is not greater than the area capacity of that region (a minimal sketch of this iteration is given after this list). These flat placement approaches differ in the definition of this force vector, which will be presented in detail in Sect. 5.3.3.

• The multilevel placement approach [12] constructs a physical hierarchy from the original netlist, and solves a sequence of placement problems from the coarsest
level to the finest level. An analytical 3D placement solver is applied at each level, which optimizes the log-sum-exp wirelength [5, 41] and the log-sum-exp TS via number estimation subject to the overlap-free constraints. To model the 3D overlap-free constraints for the intermediate solution, which is continuous in the z-direction, the area projection method with pseudo tiers is applied to guarantee the legality of the final solution. Details will be presented in Sect. 5.3.3.

• In addition to these approaches, the 3D placement approach proposed in [17] makes use of existing 2D placement results and constructs a 3D placement by transformation. The transformation schemes include two folding transformations, the stacking transformation, and the window-based folding/stacking transformation. All these transformations start with a wirelength-optimized 2D placement on a placement region whose width and height are K times as large as the width and height of a K-tier 3D IC. The idea of the folding transformations is to fold the 2D placement like a piece of paper, without cutting off any parts of the placement; thus, the lengths of the global nets that cross the folding lines are reduced. The stacking transformation first shrinks the 2D placement by a factor of K, which can be viewed as a wirelength-optimized 3D placement projected onto the (x, y) plane; then Tetris-style legalization is used to decide the tier assignment of the stacked cells. Although the wirelength obtained by the stacking transformation is small, the TS via number is usually large. To trade off wirelength and TS via number, a window-based folding/stacking transformation can be used, which divides the 2D placement into windows and transforms each window to a 3D placement by either the folding transformation or the stacking transformation.
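To make the flat placement idea concrete, the following sketch iterates a quadratic (two-pin) wirelength solve with a density-derived force vector added to the right-hand side, relaxing the tier coordinate to the continuous interval [1, K]. It is a toy version under stated assumptions (two-pin nets only, a few fixed anchors to keep the system nonsingular, a simple histogram-gradient force), not the force computation of [24, 28, 30] or [55].

```python
import numpy as np

def force_directed_3d(num_cells, nets, anchors, K, W, H,
                      n_bins=8, force_gain=0.5, max_iters=50, cap_ratio=1.1):
    """Toy flat (quadratic, force-directed) 3D global placement.

    nets    : list of (i, j, w) two-pin connections between movable cells
    anchors : list of (i, (px, py, pz), w) connections from cell i to a fixed
              location (e.g., an I/O pad); needed so the system is nonsingular
    """
    A = np.zeros((num_cells, num_cells))
    bx = np.zeros(num_cells); by = np.zeros(num_cells); bz = np.zeros(num_cells)
    for i, j, w in nets:                       # quadratic wirelength (and TSV) terms
        A[i, i] += w; A[j, j] += w; A[i, j] -= w; A[j, i] -= w
    for i, (px, py, pz), w in anchors:
        A[i, i] += w; bx[i] += w * px; by[i] += w * py; bz[i] += w * pz

    fx = np.zeros(num_cells); fy = np.zeros(num_cells)
    for _ in range(max_iters):
        x = np.linalg.solve(A, bx + fx)        # force vector on the right-hand side
        y = np.linalg.solve(A, by + fy)
        z = np.linalg.solve(A, bz)             # tier coordinate, relaxed to [1, K]
        dens, xe, ye = np.histogram2d(x, y, bins=n_bins, range=[[0, W], [0, H]])
        if dens.max() <= cap_ratio * num_cells / (n_bins * n_bins):
            break                              # every region is within its capacity
        gx, gy = np.gradient(dens)             # push cells from dense toward sparse bins
        bi = np.clip(np.searchsorted(xe, x) - 1, 0, n_bins - 1)
        bj = np.clip(np.searchsorted(ye, y) - 1, 0, n_bins - 1)
        fx -= force_gain * gx[bi, bj]
        fy -= force_gain * gy[bi, bj]
    return x, y, np.clip(z, 1, K)
```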
5.3.3 Modeling of 3D Overlap-Free Constraints

The 3D global placement in the flat placement approaches and the multilevel placement approach, as presented in Sect. 5.3.2, usually relaxes the tier assignment from the discrete set $z \in \{1, 2, \ldots, K\}$ to the continuous interval $z \in [1, K]$ for a K-tier 3D IC. The modeling of 3D overlap-free constraints for such intermediate placement solutions is an essential issue for these 3D placement approaches. The flat placement approaches (force-directed, cell-shifting, and quadratic uniformity modeling) define the cell area distribution in the 3D space in the following way: for a K-tier 3D IC with width $W$ and height $H$, a 3D space $[0, W] \times [0, H] \times [0, tK]$ is defined, where $t$ is the tier thickness; a cell with width $w$ and height $h$ and its lower-left corner at $(x, y, z)$ occupies the 3D region $[x, x+w] \times [y, y+h] \times [z-t, z]$. In this way, the cell area distribution in the 3D space is defined for any given intermediate placement solution. The force-directed approach [24, 30] computes the force vector (see Sect. 5.3.2) by solving a 3D Poisson equation for the potential of the cell area distribution; the force vector is the gradient of the 3D potential field. The cell-shifting approach [28] first computes an expected placement by cell shifting to even out the cell area distribution; this expected placement is not actually performed, and a pseudo net is
created for each cell, where the pseudo pins are located so that the cells tend to move in the desired direction; the steepest-descent direction of the wirelength of these pseudo nets gives the force vector in the linear system. The quadratic uniformity modeling approach [55] defines a density penalty function based on the 3D discrete cosine transformation (DCT) of the cell area distribution and approximates this density penalty function by a quadratic function; the steepest-descent direction of the density penalty function is the force vector appended to the right-hand side of the linear system for this approach.

The multilevel placement approach [12] models the 3D overlap-free constraints in a different way. Its analytical engine solves the 3D global placement problem as a nonlinear programming problem. The tier assignment is also explored in the interval $z \in [1, K]$. Instead of defining a cell area distribution in the 3D space, this analytical engine models the overlap-free constraints by examining the area distribution on certain cross sections of the 3D space. The cross sections include all the actual tiers, at $z \in \{1, 2, \ldots, K\}$, and pseudo tiers between every two adjacent tiers, at $z \in \{3/2, 5/2, \ldots, (2K-1)/2\}$. The area distribution on a specific actual tier or pseudo tier is defined by an area projection function based on the bell-shaped function [41]. For a 3D placement problem without white space, which can be achieved by adding dummy cells, it can be proved that if the area distributions on all the actual tiers and pseudo tiers are equal, the placement will be legal. These area distribution constraints imply the 3D overlap-free constraints. Thus, the 3D global placement is formulated as a nonlinear programming problem:
$$\begin{aligned} \text{minimize} \quad & \mathrm{WL}(x, y, z) + \alpha \cdot \mathrm{TSV}(x, y, z) \\ \text{subject to} \quad & D_k(u, v) = 1 \quad \text{for } k = 1, 2, \ldots, K, \\ & D_k(u, v) = 1 \quad \text{for } k = \tfrac{3}{2}, \tfrac{5}{2}, \ldots, \tfrac{2K-1}{2}, \end{aligned} \quad (5.9)$$

where the density function $D_k(u, v)$ is the sum of the area contributions of the cells $v_i$ to the point $(u, v)$ at actual tier $k$ or pseudo tier $k$. The cell area density function $d_i(u, v)$ is 1 inside the region covered by $v_i$ and 0 outside this region. The area contribution is computed after the area projection $h(k, z)$. These functions are defined as follows:

$$D_k(u, v) = \sum_i h(k, z_i)\, d_i(u, v), \qquad h(k, z) = \begin{cases} 1 - 2(z - k)^2, & |z - k| \le \tfrac{1}{2}, \\ 2(|z - k| - 1)^2, & \tfrac{1}{2} < |z - k| \le 1, \\ 0, & \text{otherwise.} \end{cases} \quad (5.10)$$
The density functions $D_k(u, v)$ can be converted into differentiable functions by density smoothing techniques, e.g., Helmholtz smoothing [11]. Thus, the nonlinear programming problem can be solved by the quadratic penalty method or the augmented Lagrangian method to obtain a 3D global placement solution.
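The area projection and tier density of (5.10) are easy to evaluate directly. The sketch below samples $D_k(u, v)$ on a grid of points for an actual tier (integer k) or a pseudo tier (half-integer k); it is a plain restatement of the formulas with an assumed cell-tuple format, not the smoothed density used inside the analytical engine of [12].

```python
import numpy as np

def h_projection(k, z):
    """Bell-shaped area projection h(k, z) of (5.10) for actual/pseudo tier k."""
    d = np.abs(np.asarray(z, dtype=float) - k)
    return np.where(d <= 0.5, 1.0 - 2.0 * d ** 2,
                    np.where(d <= 1.0, 2.0 * (d - 1.0) ** 2, 0.0))

def tier_density(cells, k, grid_u, grid_v):
    """Density D_k(u, v) of (5.10) sampled on a grid of points.

    cells : list of (x, y, z, w, h); d_i(u, v) = 1 inside [x, x+w] x [y, y+h]
    k     : an actual tier (1, 2, ..., K) or a pseudo tier (3/2, 5/2, ...)
    """
    U, V = np.meshgrid(grid_u, grid_v, indexing="ij")
    D = np.zeros_like(U, dtype=float)
    for (x, y, z, w, h) in cells:
        inside = (U >= x) & (U <= x + w) & (V >= y) & (V <= y + h)
        D += float(h_projection(k, z)) * inside
    return D
```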
5.4 3D Floorplanning

3D IC technologies make floorplanning a much more difficult problem, because the multi-tier structures dramatically enlarge the solution space and the increased power density accentuates the thermal problem. Therefore, moving to 3D designs greatly increases the problem complexity:

• The design space of 3D floorplanning increases exponentially with the number of tiers. The work in [34] showed that, given a floorplanning problem with n blocks, the solution space of 3D floorplanning with L tiers is $n^{L-1}(L-1)!$ times larger than the solution space of 2D floorplanning, if a 3D floorplan solution is represented by an array of the corresponding 2D floorplan representations.

• The addition of a temperature constraint or a temperature minimization objective complicates the optimization, requiring tradeoffs among area, wirelength, and thermal characteristics. With the high temperatures in 3D chips, it is also necessary to account for the closed temperature/leakage-power feedback loop to accurately estimate or optimize either one.

• Multi-tier stacking offers a reduction in inter-block latency. It can also help reduce the intra-block wire latency when a block is implemented in multiple tiers. The use of multi-tier blocks requires a novel physical design infrastructure to explore the 3D design space.

Therefore, it is imperative to develop thermal-aware and timing-aware floorplanning tools that consider 3D design constraints. The goal of 3D floorplanning is to pack blocks on multiple tiers with no overlaps, optimizing some objectives without violating design constraints. According to the block representation, we can classify the 3D floorplanning problem into two types. The first type is a 3D floorplan with 2D blocks, in which each block is a 2D rectangle and the packing on each tier can be treated as a 2D floorplan; such a floorplan can be represented by an array of 2D representations (2D array), each representing all blocks located on one tier. The second type involves 3D blocks, where each block is treated as a cuboid with nonzero height in the z-dimension. In this case, the existing 2D representations no longer apply, and new representations are needed.
5.4.1 Problem Formulation

Similar to traditional 2D floorplanning, 3D floorplanning also aims at a small packing area, short wirelength, low power consumption, and high performance.
Fig. 5.9 3D floorplanning: (a) with 2D blocks; (b) with 3D blocks
Although 3D IC technologies have many potential benefits, thermal distribution becomes a critical issue during every step of 3D physical design. Therefore, 3D floorplanning distributes blocks on a certain number of tiers without overlapping each other so that the design metrics, such as the chip area, wirelength, TS via number and maximal on-chip temperature, are optimized or meet some design constraints. With the additional z-direction, not only can the 2D blocks be spread among multiple tiers, but some individual components can be folded into the designs of a multi-tier block so that the intra-block wire latency can be reduced, as well as the power consumption. The 3D components with different tier numbers can be treated as cuboid blocks to be packed in the 3D space. The dimension in the z-direction represents the tier information. Therefore, in 3D floorplanning the blocks to be packed can be 2D blocks or 3D blocks. Figure 5.9a shows the two-tier packing for Alpha 21264 in which all blocks are 2D blocks, and Fig. 5.9b shows the packing with some 3D blocks. The implementation for each 3D component may have multiple choices with different area-delay-power tradeoffs. As shown in Fig. 5.9b, it is possible that an optimal floorplan has a subset of the microarchitectural units occupying a single tier, while others are implemented on multiple strata with potentially different heights in the z-dimension. According to the block representation, we classify the 3D floorplanning problem into two types: 3D floorplan with 2D blocks only and 3D floorplan with possible 3D blocks.

5.4.1.1 3D Floorplanning with 2D Blocks

Though 3D packing with 2D blocks can be treated as multiple stacked 2D packings, the additional concern at the chip level relates to the large number of active devices that are packed into a much smaller area, so that the power density is much higher than in a corresponding 2D circuit. As a result, in addition to the common objectives of packing area and wirelength, thermal issues are given primacy among the set of design objectives. Hence, we can formulate a 3D floorplan with 2D blocks as follows.
An instance of the 3D floorplanning problem with 2D blocks is composed of a set of blocks $\{m_1, m_2, \ldots, m_n\}$. A block $m_i$ is a $W_i \times H_i$ rectangle with area $A_i$, aspect ratio $H_i/W_i$, and power density $PD_i$. Each block is free to rotate. There is a fixed number of tiers $L$. Let the tuple $(x_i, y_i, l_i)$ denote the coordinates of the bottom-left corner of block $m_i$, where $1 \le l_i \le L$. A 3D floorplan $F$ is an assignment of $(x_i, y_i, l_i)$ for each block $m_i$ such that no two blocks overlap. The common objectives of 3D floorplanning algorithms are to minimize (1) the chip peak temperature $T_{\max}$, (2) the total wirelength (or total power), and (3) the chip area. The chip area is the product of the maximum height and width over all tiers, and the wirelength is the half-perimeter wirelength estimation. In addition, other design objectives, such as noise, performance, the number of TS vias, etc., can be considered at the same time, and design constraints can be included, such as pre-packed blocks (the positions of the constrained blocks are pre-defined), alignment constraints (some specific blocks are constrained to be aligned in the x, y, or z directions), etc. Since 3D floorplanning with 2D blocks can be represented by an array of 2D representations, a 2D floorplanning algorithm can be extended to handle multi-tier designs by introducing new operations in the optimization techniques. Though floorplanning for 2D designs is a well-studied problem, the design space of 3D IC floorplanning increases exponentially with the extension of the IC in the z-direction. Though the multi-tier design can be represented by an array of 2D packings, specific optimization techniques are still needed for efficient exploration, and thermal-aware optimization is especially critical in 3D designs.

5.4.1.2 3D Floorplanning with 3D Blocks

Fine-grain 3D ICs provide reduced intra-block wire delay as well as improved power consumption. The implementation of each component may have multiple choices due to various configurations; therefore, a component might be implemented on multiple tiers, such as a four-tier or two-tier cache, by different stacking techniques. However, the locally best implementation of an individual unit does not necessarily lead to the best design for the entire multi-tiered chip. To trade off multiple objectives, it is possible to have cuboid blocks with different heights in the z-direction in the packing design. Therefore, a cube-packing algorithm should be developed to arrange the given circuit components in a rectangular box of minimum volume without overlaps. With various implementations for each critical component, the block implementation is only partially defined; without the physical information, it is impossible to obtain the optimal implementations of the components for the final chip. Thus, 3D floorplanning with 3D blocks should not only determine the coordinates of the blocks but also choose the configurations of the components, such as the number of tiers, the partitioning approach, etc. Therefore, we can formulate the 3D packing with 3D blocks as follows. Given a list of 3D blocks, suppose that for block $i$ there are $k$ different implementations recorded in a candidate list $\{c^i_1, c^i_2, \ldots, c^i_k\}$. Each candidate $c^i_j$ has width $w^i_j$, height $h^i_j$, tier number $z^i_j$, delay $d^i_j$, and power
$p^i_j$ (it is assumed that each tier has the same power consumption). The objective is to generate a floorplan that optimizes the die area, the maximum on-chip temperature, etc. At the same time, the number of tiers is normally fixed; given the tier number constraint $Z_{con}$, the blocks must not exceed this tier number constraint.
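The formulation above maps naturally onto a small data structure for blocks and their candidate implementations. The sketch below is only illustrative; the field names and the single `chosen` index are assumptions for exposition, not part of the formulations in [35] or [53].

```python
from dataclasses import dataclass

@dataclass
class Candidate:
    """One implementation choice c_j^i of a 3D block."""
    width: float    # w_j^i
    height: float   # h_j^i
    tiers: int      # z_j^i, number of tiers occupied
    delay: float    # d_j^i
    power: float    # p_j^i, per-tier power (assumed equal across tiers)

@dataclass
class Block3D:
    name: str
    candidates: list        # the candidate list {c_1^i, ..., c_k^i}
    chosen: int = 0         # index of the currently selected implementation
    x: float = 0.0          # coordinates assigned by the floorplanner
    y: float = 0.0
    bottom_tier: int = 0    # lowest tier occupied; bottom_tier + impl.tiers <= Zcon

    @property
    def impl(self) -> "Candidate":
        return self.candidates[self.chosen]
```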
5.4.2 3D Floorplanning Algorithms

Since the 2D and 3D rectangular packing problems are NP-hard, most floorplanning algorithms are based on stochastic combinatorial optimization techniques such as simulated annealing [31] and genetic algorithms [38], although analytical algorithms [58] have also been proposed for 3D floorplanning. In this section, we focus on the simulated annealing algorithm, which minimizes a given cost function by searching the solution space encoded by a specific representation. Normally, the cost function is a combination of chip area, wirelength, maximal on-chip temperature, or other factors. Figure 5.10 shows the optimization flow based on the simulated annealing approach. The critical components of a simulated annealing algorithm are (1) the cooling schedule, (2) the cost function, (3) the representation of the solution, and (4) the solution perturbation. The cooling schedule includes the setup of the initial temperature, the cooling function, and the end temperature, all of which depend on the size and the properties of the problem. The cost function is usually a weighted sum of the wirelength estimation (half-perimeter model), the total area of all tiers (product of the maximal height and width and the number of tiers), the number of TS vias, and the maximal temperature. Various 3D floorplanning algorithms differ in the solution representation, which defines the neighborhood structure for solution perturbation. The 3D floorplan with 2D blocks can be represented as an array of 2D floorplans, one per tier; thus, the solution of 3D floorplanning with 2D blocks can be represented as an array of the solution representations of the corresponding 2D floorplans. There is a plethora of literature on 2D floorplanning, so we only summarize the various representations in Table 5.1. The solution perturbation includes the following moves (a minimal sketch of the resulting annealing loop is given after this list):

• Rotation, which rotates a block
• Swap, which swaps two blocks on one tier
• Reverse, which exchanges the relative position of two blocks on one tier
• Move, which moves a block from one side (such as the top) of a block to another side (such as the left)
• Inter-tier swap, which swaps two blocks at different tiers
• z-neighbor swap, which swaps two blocks at different tiers but close to each other
• z-neighbor move, which moves a block to a position at another tier close to the current position
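As referenced above, the following is a minimal, generic sketch of the annealing loop of Fig. 5.10. The cooling parameters are illustrative, and the `cost` and `perturb` callables stand in for the cost function and the perturbation moves listed above.

```python
import copy, math, random

def simulated_annealing(init_solution, cost, perturb,
                        T0=1000.0, T_min=0.1, alpha=0.95, tries_per_temp=100):
    """Generic annealing loop in the spirit of Fig. 5.10.

    init_solution : any floorplan representation (e.g., an array of 2D representations)
    cost          : callable returning the weighted cost of a solution
    perturb       : callable returning a neighboring solution (rotation, swap,
                    inter-tier swap, z-neighbor move, ...)
    """
    current, current_cost = init_solution, cost(init_solution)
    best, best_cost = current, current_cost
    T = T0
    while T > T_min:
        for _ in range(tries_per_temp):
            cand = perturb(copy.deepcopy(current))
            cand_cost = cost(cand)
            delta = cand_cost - current_cost
            # always accept improvements; accept worse solutions with a
            # temperature-dependent (Boltzmann-style) probability
            if delta < 0 or random.random() < math.exp(-delta / T):
                current, current_cost = cand, cand_cost
                if current_cost < best_cost:
                    best, best_cost = current, current_cost
        T *= alpha                              # cooling schedule
    return best, best_cost
```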
Fig. 5.10 The flow of the simulated annealing approach (starting from a random initial packing at the initial annealing temperature, a new solution is generated by a random move on the current 3D representation, the corresponding packing and its cost are constructed and evaluated, better solutions are always accepted while worse ones are accepted with a probability that depends on Δcost/Temp, and the temperature is reduced step by step until the minimum temperature or the maximum number of steps is reached)

Table 5.1 Various representations for 2D floorplanning [53]

Representation | Solution space | Complexity of floorplan construction | Move | Packing category
NPE (SST) | O(n!·2^(3n−3)/n^1.5) | O(n) | O(1) | Slicing
SP | (n!)^2 | O(n log log n) – O(n^2) | O(1) | General
BSG | n!·C(n^2, n) | O(n^2) | O(1) | General
O-tree | O(n!·2^(2n)/n^1.5) | O(n) | O(1) | Compact
B*-tree | O(n!·2^(2n)/n^1.5) | O(n) | O(1) | Compact
CBL | O(n!·2^(3n−3)/n^1.5) | O(n) | O(1) | Mosaic
TCG | (n!)^2 | O(n^2) | O(n) | General
The 3D floorplanning with 3D blocks also has various solution representations, which are summarized in Table 5.2. The solution representations also define the neighborhood of solution perturbation in the simulated annealing algorithm. The readers may refer to the references in Table 5.2 for more details.
Table 5.2 Various representations for 3D floorplanning with 3D blocks [53]

Representation | Solution space | Complexity of floorplan construction | Move complexity | Packing category
ST [54] | (n!)^3 | O(n^2) | O(1) | General but not all
Squin [54] | (n!)^5 | O(n^2) | O(1) | All
Slicing-tree [8] | O(n!·3^(n−1)·2^(2n−2)/n^1.5) | O(n) | O(1) | Slicing
3D-subTCG [56] | (n!)^3 | O(n^2) | O(n^2) | General but not all
3D-CBL [36] | O(n!·3^(n−1)·2^(4n−4)) | O(n) | O(1) | Mosaic
5.5 3D Floorplanning for 3D Microarchitectural Exploration

One important application of 3D physical design is to provide 3D physical prototyping for microarchitectural evaluation. Recent studies have provided block models for various microarchitectural structures, including 3D caches [32, 46, 49], 3D register files [48], 3D arithmetic units [43], and 3D instruction schedulers [44]. To construct multi-tier blocks that reduce intra-block interconnect latency and power consumption in architecture design, there are two main strategies for designing blocks in multiple silicon layers: block folding (BF) and port partitioning (PP). Block folding folds a block in the x- or y-direction, potentially shortening the wirelength in one direction. Port partitioning places the access ports of a structure in different tiers; the intuition is that the additional hardware needed for replicated access to a single block entry (i.e., a multi-ported cache) can be distributed across tiers, which can greatly reduce the length of interconnect within each tier. As an example, the use of these strategies for cache-like blocks is briefly described; for all the other components, such as the issue queue, register files, etc., a similar analysis can be performed.

Caches are common microarchitectural blocks with regular structures, composed of a number of tag and data arrays. Figure 5.11 shows a single cell of a 3-ported structure. Each port contains a bit line, a bitbar line, a wordline, and two transistors per bit. The four transistors that make up the storage cell take much less space than that allocated for the ports. The wire pitch is typically five times the feature size. For each extra port, the wirelength in both the x and y directions is increased by twice the wire pitch. On the other hand, the storage, which consists of four transistors, is twice the wire pitch in height and has a width equal to the wire pitch. Therefore, the more ports a component has, the larger the portion of silicon area allocated to those ports; a 3-ported structure has a port-area to cell-area ratio of approximately 18:1. Figure 5.12a shows a high-level view of a number of cache tag and data arrays connected via address and data buses. Each vertical and horizontal line represents a 32-bit bus; assuming two ports on this cache, the lines are paired. The components of caches can easily be broken down into subarrays, and CACTI [45, 49] can be used to explore the design space of different subdivisions and find an optimal point for performance, power, and area.
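The 18:1 figure can be checked with a rough back-of-the-envelope model in which each port contributes about two wire pitches in both x and y, and the four-transistor storage cell occupies roughly one pitch by two pitches. The snippet below encodes exactly this simplified model; it is not a layout-accurate area estimate.

```python
def port_to_cell_ratio(num_ports, wire_pitch=1.0):
    """Rough port-area-to-storage-area estimate for a multi-ported cell (Fig. 5.11):
    each port adds ~2 wire pitches in x and in y, and the 4-transistor storage
    cell is ~1 pitch wide and ~2 pitches tall."""
    port_side = 2.0 * wire_pitch * num_ports
    port_area = port_side * port_side
    storage_area = (1.0 * wire_pitch) * (2.0 * wire_pitch)
    return port_area / storage_area

print(port_to_cell_ratio(3))   # 18.0, matching the ~18:1 ratio quoted above
```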
Fig. 5.11 3-ported SRAM cell
Fig. 5.12 3D block alternatives for a cache: (a) 2D 2-ported cache: The two lines denote the input/output wires of two ports; (b) wordline folding: Only y-direction is reduced. Input/output of the ports is duplicated; (c) port partitioning: ports are placed in two tiers. Length in both x and y directions is reduced
Block folding (BF): For block folding, there are two options: wordline folding and bitline folding. In the former, the wordlines in a cache subarray are divided and placed onto different tiers, and the wordline driver is duplicated. The gain from wordline folding comes from the shortened routing distance from the predecoder to the decoder and from the output drivers to the edge of the cache. Similarly, bitline folding places bitlines onto different tiers but needs to duplicate the pass transistors. Our investigation shows that wordline folding has a better access time and lower power dissipation in most cases compared to a realistic implementation of bitline folding; hence, the results using wordline folding are presented in Fig. 5.13.

Port partitioning (PP): There is a significant advantage to partitioning the ports and placing them onto different tiers, as shown in Fig. 5.12c. In a two-tier design, we can place two ports on one tier, and one port together with the SRAM cells on the other tier. The width and height are both reduced by approximately a factor of two, and the area by a factor of four. Port partitioning allows reductions in both vertical and horizontal wirelengths, which reduces the total wirelength and capacitance and translates into savings in access time and power consumption. Port partitioning requires vias to connect the memory cell to ports in other tiers. Depending on the technology, the via pitch can impact size as well. In our design,
Fig. 5.13 Improvements for multi-tier F2B design: (a) improvement in area; (b) improvement in timing; (c) improvement in power
a space of 0.7 µm × 0.7 µm is allocated for each via needed. The same model as in [49] is used to obtain via capacitance and resistance. Figure 5.13 shows the effects of the different partitioning strategies on different components. The diversity of benefits from these two approaches demonstrates the need for a tool that can flexibly choose the appropriate implementation based on the constraints of an individual floorplan. With wire pipelining considered, the process of choosing the appropriate implementation should take the physical information into account. The best 3D configuration of each component may not lead to the best 3D implementation for the whole system. In some cases, such as in a 4-tier chip, if a component is chosen as a 4-tier block, other blocks cannot be placed on top of it but only at neighboring positions, and this neighborhood may not be enough for all the other highly connected blocks. Therefore, the inter-block wire latency may be
increased, and some extra cycles may be generated. On the other hand, if a 2-tier implementation is chosen for this component, though the intra-block delay is not the best, the inter-block wire latency may be favorable, since other blocks that are heavily connected with this component can be placed immediately on top of it and the vertical interconnects are much shorter. Therefore, the packing with a 2-tier implementation may perform better than the packing with a 4-tier implementation of this component. Furthermore, to mitigate thermal effects, the reduction in delay of 3D blocks may provide latency slack that allows a trade-off between timing and power; this optimization should also depend on the timing information that comes from the physical packing results. Therefore, to utilize 3D blocks, the decision cannot be made from the architecture side or the physical design side alone. To enable co-optimization between 3D microarchitecture and physical design, a true 3D packing engine is needed that chooses the implementation while performing the packing optimization. The readers may refer to [27, 35, 52] for details on using 3D floorplanning for 3D microarchitectural exploration. In the remainder of this section, we describe 3D microarchitectural exploration using the 3D corner block list (CBL) floorplanning algorithm [35]. The 3D CBL floorplanning algorithm is based on simulated annealing with the 3D CBL representation. As presented in Sect. 5.4.2, the key components of a simulated annealing algorithm include the cooling schedule, the cost function, the representation of a solution, and the solution perturbation scheme. The cooling schedule can follow the one in the wirelength-driven 3D CBL floorplanning algorithm [36]. However, for the purpose of microarchitectural exploration, the performance estimation should be included in the cost function:
$$\mathrm{cost} = w_1 \frac{1}{\mathrm{BIPS}} + w_2\,\mathrm{Area} + w_3\,\mathrm{Temp} + w_4\,\mathrm{WL}, \quad (5.11)$$
where BIPS (billions of instructions per second) corresponds to the performance of the floorplanned microarchitecture, Area is the area of the bounding box of the whole floorplan, Temp is the maximum on-chip temperature, and WL is the total wirelength between the floorplan blocks. The coefficients $w_1, w_2, w_3, w_4$ are used to normalize and weigh the criticality of these four components. The performance in BIPS is estimated by calculating the instructions-per-cycle (IPC) degradation caused by the extra latency introduced by the interconnects in the layout, assuming the target frequency is constant during floorplanning [35]. The 3D CBL representation $(S, L, T)$, which supports floorplanning with 3D microarchitectural blocks, consists of three elements: the block list $S$, the block orientation list $L$, and the encoded covering list $T$. The solution perturbation scheme generates a neighboring solution by any one of the following operations: (1) randomly exchange the order of the blocks in $S$; (2) randomly choose a position in $L$ and change the orientation; (3) randomly choose a position in the encoded $T$ and change "1" to "0" or "0" to "1"; (4) randomly replace a microarchitectural block by an alternative block.
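Evaluating (5.11) for a candidate floorplan is straightforward. The sketch below assumes the BIPS, area, temperature, and wirelength values are supplied by the performance, packing, and thermal models, and uses placeholder unit weights.

```python
def floorplan_cost(bips, area, temp, wl, w=(1.0, 1.0, 1.0, 1.0)):
    """Cost function (5.11) for microarchitecture-aware 3D CBL floorplanning.

    bips : estimated performance in billions of instructions per second
    area : bounding-box area of the whole floorplan
    temp : maximum on-chip temperature
    wl   : total wirelength between floorplan blocks
    w    : (w1, w2, w3, w4) normalization / criticality weights
    """
    w1, w2, w3, w4 = w
    return w1 / bips + w2 * area + w3 * temp + w4 * wl
```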
Table 5.3 Architectural parameters for the design driver [35]

Processor width: 6-way out-of-order superscalar, two integer execution clusters
Register files: 128-entry integer (two replicated files), 128-entry FP
Data cache: 8 KB, 4-way set associative, 64 B block size
Instruction cache: 128 KB, 2-way set associative, 32 B block size
L2 cache: 4 banks, each 128 KB, 8-way set associative, 128 B block size
Branch predictor: 8 K-entry gshare and a 1 K-entry, 4-way BTB
Functional units: 2 IntALU + 1 IntMULT/DIV in each of the two clusters; 1 FPALU and 1 FPMULT/DIV
Fig. 5.14 Performance speedup on SPEC2000 benchmarks [35]
The work in [35] studies the performance impact of 3D IC technologies. The processor parameters used in this study are listed in Table 5.3. Figure 5.14 presents the performance results of three configurations running at 4 GHz: the single-device-layer design is the baseline; the dual-device-layer design with 2D blocks improves the performance by 6% on average; and the dual-device-layer design with 3D blocks improves the performance by 23% on average. These results imply that the inter-block wire latency reduction alone is not enough to fully take advantage of 3D IC technologies; using 3D blocks further reduces the intra-block wire latency and contributes 16% more performance enhancement in this case. Figure 5.15 shows the performance in terms of BIPS for frequencies from 3 to 6 GHz, for 1 to 4 device layers, and for different microarchitectural blocks (2D blocks only versus 3D blocks). The conclusion from Fig. 5.14 remains true across frequencies, and the performance improves with increasing frequency and number of device layers. However, the higher the frequency of the chip, the more degradation the extra latency causes; thus, the overall performance degrades when the frequency is too high, which is true for both 2D and 3D designs. Increasing the number of device layers improves the performance, but the improvement becomes smaller as more device layers become available.
Fig. 5.15 Frequency impact on performance in multi-layer implementations
5.6 Conclusions

Along with the development of 3D IC technologies over the recent decade, significant advances have been made in 3D IC physical design automation. In this chapter, we covered important problems and algorithms in the 3D IC physical design flow, including 3D routing, thermal TS via planning, 3D placement, 3D floorplanning, and 3D microarchitectural exploration. A high-level overview of the basic concepts in the 3D physical design flow was presented, and the necessary references are included for readers who would like to dig deeper into a specific topic. The challenges and opportunities in 3D physical design automation are presented in [4, 9]. To conclude this chapter, we list these challenges for 3D EDA tools as follows:

• Most of the existing studies in 3D placement and 3D floorplanning claim great wirelength reductions from 3D IC technologies. However, the physical impact of signal TS vias is not adequately considered; it is critical to account for the area consumption and distribution of signal TS vias during physical design.

• The timing characteristics of the signal TS vias have to be considered. Although the timing optimization of a two-terminal or multi-terminal 3D interconnect has been studied [23], timing-driven 3D placement still needs to be addressed.

• Strong linkage between architecture-level analysis tools and 3D physical planning tools is required to take advantage of 3D IC technologies with new architectures and physical implementations.

Acknowledgments This study is supported by the National Science Foundation (NSF) under CCF-0430077 and CCF-0528583.
References

1. Ababei C, Mogal H, Bazargan K (2005) Three-dimensional place and route for FPGAs. In: Proceedings of the 2005 conference on Asia and South Pacific design automation, Shanghai, China, 18–21 January, pp 773–778
2. Balakrishnan K, Nanda V, Easwar S, Lim SK (2005) Wire congestion and thermal aware 3D global placement. In: Proceedings of the 2005 conference on Asia and South Pacific design automation, Shanghai, China, 18–21 January, pp 1131–1134
3. Banerjee K, Souri SJ, Kapur P, Saraswat KC (2001) 3-D ICs: a novel chip design for improving deep-submicrometer interconnect performance and systems-on-chip integration. Proc IEEE 89(5):602–633
4. Bernstein K, Andry P, Cann J, Emma P, Greenberg D, Haensch W, Ignatowski M, Koester S, Magerlein J, Puri R, Young A (2007) Interconnects in the third dimension: design challenges for 3D ICs. In: Proceedings of the 44th annual conference on design automation, San Diego, California, 04–08 June, pp 562–567
5. Bertsekas DP (1977) Approximation procedures based on the method of multipliers. J Optim Theor Appl 23(4):487–510
6. Burstein M, Pelavin R (1983) Hierarchical wire routing. IEEE Trans Comput Aided Des Integrated Circ Syst 2(4):223–234
7. Chan TF, Cong J, Kong T, Shinnerl JR (2000) Multilevel optimization for large-scale circuit placement. In: Proceedings of the 2000 IEEE/ACM international conference on computer-aided design, San Jose, California, 05–09 November, pp 171–176
8. Cheng L, Deng L, Wong MDF (2005) Floorplanning for 3-D VLSI design. In: Proceedings of the 2005 conference on Asia and South Pacific design automation, Shanghai, China, 18–21 January, pp 405–411
9. Chiang C, Sinha S (2009) The road to 3D EDA tool readiness. In: Proceedings of the 2009 conference on Asia and South Pacific design automation, Yokohama, Japan, 19–22 January, pp 429–436
10. Chiang TY, Banerjee K, Saraswat KC (2001) Compact modeling and SPICE-based simulation for electrothermal analysis of multilevel ULSI interconnects. In: Proceedings of the 2001 IEEE/ACM international conference on computer-aided design, San Jose, California, 04–08 November, pp 165–172
11. Cong J, Luo G, Radke E (2008) Highly efficient gradient computation for density-constrained analytical placement. IEEE Trans Comput Aided Des Integrated Circ Syst 27(12):2133–2144
12. Cong J, Luo G (2009) A multilevel analytical placement for 3D ICs. In: Proceedings of the 2009 conference on Asia and South Pacific design automation, Yokohama, Japan, 19–22 January, pp 361–366
13. Cong J, Luo G (2009) A 3D physical design flow based on OpenAccess. In: Proceedings of the international conference on communications, circuits and systems, Milpitas, California, 23–25 July
14. Cong J, Zhang Y (2005) Thermal-driven multilevel routing for 3-D ICs. In: Proceedings of the 2005 conference on Asia and South Pacific design automation, Shanghai, China, 18–21 January, pp 121–126
15. Cong J, Zhang Y (2005) Thermal via planning for 3-D ICs. In: Proceedings of the 2005 IEEE/ACM international conference on computer-aided design, San Jose, CA, 06–10 November, pp 745–752
16. Cong J, Zhang Y (2006) Thermal-aware physical design flow for 3-D ICs. In: Proceedings of the 23rd international VLSI multilevel interconnection conference, Fremont, California, 25–28 September, pp 73–80
17. Cong J, Luo G, Wei J, Zhang Y (2007) Thermal-aware 3D IC placement via transformation. In: Proceedings of the 2007 conference on Asia and South Pacific design automation, Yokohama, Japan, 23–26 January, pp 780–785
18. Cong J, Shinnerl JR, Xie M, Kong T, Yuan X (2005) Large-scale circuit placement. ACM Trans Des Autom Electron Syst 10(2):389–430
19. Das S (2004) Design automation and analysis of three-dimensional integrated circuits. PhD Dissertation, Massachusetts Institute of Technology, Cambridge, MA
20. Das S, Chandrakasan A, Reif R (2003) Design tools for 3-D integrated circuits. In: Proceedings of the 2003 conference on Asia and South Pacific design automation, Kitakyushu, Japan, 21–24 January, pp 53–56
21. Davis WR, Wilson J, Mick S, Xu J, Hua H, Mineo C, Sule AM, Steer M, Franzon PD (2005) Demystifying 3D ICs: the pros and cons of going vertical. IEEE Des Test Comput 22(6):498–510
22. Enbody RJ, Lynn G, Tan KH (1991) Routing the 3-D chip. In: Proceedings of the 28th conference on ACM/IEEE design automation, San Francisco, California, 17–22 June, pp 132–137
23. Friedman EG, Pavlidis VF (2009) Three-dimensional integrated circuit design. Morgan Kaufmann, Burlington
24. Goplen B, Sapatnekar S (2003) Efficient thermal placement of standard cells in 3D ICs using a force directed approach. In: Proceedings of the 2003 IEEE/ACM international conference on computer-aided design, San Jose, CA, 9–13 November, p 86
25. Goplen B, Sapatnekar S (2007) Placement of 3D ICs with thermal and interlayer via considerations. In: Proceedings of the 44th annual conference on design automation, San Diego, California, 04–08 June, pp 626–631
26. Grove AS (1967) Physics and technology of semiconductor devices. Wiley, Hoboken
27. Healy M, Vittes M, Ekpanyapong M, Ballapuram CS, Lim SK, Lee HHS, Loh GH (2007) Multiobjective microarchitectural floorplanning for 2-D and 3-D ICs. IEEE Trans Comput Aided Des Integrated Circ Syst 26(1):38–52
28. Hentschke R, Flach G, Pinto F, Reis R (2007) 3D-vias aware quadratic placement for 3D VLSI circuits. In: IEEE Computer Society annual symposium on VLSI, Porto Alegre, Brazil, 9–11 March, pp 67–72
29. Hillier FS, Lieberman GJ (2004) Introduction to operations research, 8th edn. McGraw-Hill, New York
30. Kaya I, Salewski S, Olbrich M, Barke E (2004) Wirelength reduction using 3-D physical design. In: Proceedings of the 14th international workshop on power and timing optimization and simulation, Isle of Santorini, Greece, 15–17 September, pp 453–462
31. Kirkpatrick S, Gelatt CD, Vecchi MP (1983) Optimization by simulated annealing. Science 220(4598):671–680
32. Kleiner MB, Kuhn SA, Weber W (1995) Performance improvement of the memory hierarchy of RISC-systems by application of 3-D-technology. In: Proceedings of the 45th electronic components and technology conference, 21–24 May, pp 645–655
33. Lee S, Lemczyk TF, Yovanovich MM (1992) Analysis of thermal vias in high density interconnect technology. In: Proceedings of the 8th annual IEEE semiconductor thermal measurement and management symposium, 03–05 February, pp 55–61
34. Li Z, Hong X, Zhou Q, Cai Y, Bian J, Yang HH, Pitchumani V, Cheng CK (2006) Hierarchical 3-D floorplanning algorithm for wirelength optimization. IEEE Trans Circuits Syst I, Reg Papers 53(12):2637–2646
35. Liu Y, Ma Y, Kursun E, Reinman G, Cong J (2007) Fine grain 3D integration for microarchitecture design through cube packing exploration. In: 25th international conference on computer design (ICCD 2007), Lake Tahoe, California, 07–10 October, pp 259–266
36. Ma Y, Hong X, Dong S, Cheng CK (2005) 3D CBL: an efficient algorithm for general 3D packing problems. In: Proceedings of the 48th midwest symposium on circuits and systems, vol 2, 07–10 August, pp 1079–1082
37. Minz J, Zhao X, Lim SK (2008) Buffered clock tree synthesis for 3D ICs under thermal variations. In: Proceedings of the 2008 conference on Asia and South Pacific design automation, Seoul, Korea, 21–24 January, pp 504–509
38. Mitchell M (1998) An introduction to genetic algorithms. MIT Press, Cambridge
39. Nam G-J (2006) ISPD 2006 placement contest: benchmark suite and results. In: Proceedings of the 2006 international symposium on physical design, San Jose, California, 09–12 April, pp 167–167
40. Nam G-J, Cong J (eds) (2007) Modern circuit placement: best practices and results. Springer, New York
41. Naylor WC, Donelly R, Sha L (2001) Non-linear optimization system and method for wire length and delay optimization for an automatic electric circuit placer. US Patent 6301693, October 2001
42. Pavlidis VF, Savidis I, Friedman EG (2008) Clock distribution networks for 3-D integrated circuits. In: Proceedings of the IEEE custom integrated circuits conference, San Jose, CA, 21–24 September, pp 651–654
43. Puttaswamy K, Loh GH (2006) The impact of 3-dimensional integration on the design of arithmetic units. In: Proceedings of the 2006 IEEE international symposium on circuits and systems, Island of Kos, Greece, 21–24 May, pp 191–194
44. Puttaswamy K, Loh GH (2006) Dynamic instruction schedulers in a 3-dimensional integration technology. In: Proceedings of the 16th ACM great lakes symposium on VLSI, Philadelphia, PA, April 30–May 01, pp 153–158
45. Reinman G, Jouppi NP (2000) CACTI 2.0: an integrated cache timing and power model. Technical Report WRL-2000-7, HP Western Research Laboratory
46. Ronen R, Mendelson A, Lai K, Shih-Lien L, Pollack F, Shen JP (2001) Coming challenges in microarchitecture and architecture. Proc IEEE 89(3):325–340
47. Tong CC, Wu C-l (1995) Routing in a three-dimensional chip. IEEE Trans Comput 44(1):106–117
48. Tremblay M, Joy B, Shin K (1995) A three dimensional register file for superscalar processors. In: Proceedings of the 28th Hawaii international conference on system sciences, Wailea, Hawaii, 03–06 January, pp 191–201
49. Tsai Y-F, Xie Y, Vijaykrishnan N, Irwin MJ (2005) Three-dimensional cache design exploration using 3DCacti. In: Proceedings of the 2005 IEEE international conference on computer design: VLSI in computers and processors, San Jose, California, 02–05 October, pp 519–524
50. Westra J, Bartels C, Groeneveld P (2004) Probabilistic congestion prediction. In: Proceedings of the 2004 international symposium on physical design, Phoenix, Arizona, 18–21 April, pp 204–209
51. Wilkerson P, Furmanczyk M, Turowski M (2004) Compact thermal modeling analysis for 3D integrated circuits. In: 11th international conference on mixed design of integrated circuits and systems, Szczecin, Poland, 24–26 June, pp 277–282
52. Xie Y, Loh GH, Black B, Bernstein K (2006) Design space exploration for 3D architectures. J Emerg Technol Comput Syst 2(2):65–103
53. Xie Y, Cong J, Sapatnekar S (2009) Three dimensional integrated circuits design: EDA, design, and microarchitectures. Springer, New York
54. Yamazaki H, Sakanushi K, Nakatake S, Kajitani Y (2000) The 3D-packing by meta data structure and packing heuristics. IEICE Trans Fund Electron Comm Comput Sci 83(4):639–645
55. Yan H, Zhou Q, Hong X (2009) Thermal aware placement in 3D ICs using quadratic uniformity modeling approach. Integrat VLSI J 42(2):175–180
56. Yuh P-H, Yang C-L, Chang Y-W (2007) Temporal floorplanning using the three-dimensional transitive closure subgraph. ACM Trans Des Autom Electron Syst 12(4):37
57. Zhang T, Zhan Y, Sapatnekar SS (2006) Temperature-aware routing in 3D ICs. In: Proceedings of the 2006 conference on Asia South Pacific design automation, Yokohama, Japan, 24–27 January, pp 309–314
58. Zhou P, Ma Y, Li Z, Dick RP, Shang L, Zhou H, Hong X, Zhou Q (2007) 3D-STAF: scalable temperature and leakage aware floorplanning for three-dimensional integrated circuits. In: Proceedings of the 2007 IEEE/ACM international conference on computer-aided design, San Jose, California, 05–08 November, pp 590–597
59. Zhou P, Sridharan K, Sapatnekar SS (2009) Congestion-aware power grid optimization for 3D circuits using MIM and CMOS decoupling capacitors. In: Proceedings of the 2009 conference on Asia and South Pacific design automation, Yokohama, Japan, 19–22 January, pp 179–184
Chapter 6
Co-optimization of Power, Thermal, and Signal Interconnect for 3D ICs

Young-Joon Lee, Michael Healy, and Sung Kyu Lim
6.1 Introduction

Historically, advances in the field of packaging and system integration have not progressed at the same rate as ICs. In fact, today's silicon ancillary technologies have truly become a limiter to the performance gains possible from advances in semiconductor manufacturing, especially due to cooling, power delivery, and signaling [1, 2]. Today, it is widely accepted that three-dimensional (3D) system integration is a key enabling technology, and it has recently gained significant momentum in the semiconductor industry. Three-dimensional integration may be used either to partition a single chip into multiple strata to reduce on-chip global interconnect lengths [3] and/or to stack chips that are homogeneous or heterogeneous. There are a number of interconnect challenges that need to be addressed to enable stacking of high-performance dice, especially in the area of cooling. When two 100 W/cm² microprocessors are stacked on top of each other, for example, the net power density becomes 200 W/cm², which is beyond the heat removal limits of currently available air-cooled heat sinks. Thus, cooling is the key limiter to stacking of high-performance chips today. This issue has recently been addressed with a novel 3D integration technology that features a microchannel heat sink in each stratum of the 3D system and the use of wafer-level batch-fabricated electrical and fluidic chip input/output (I/O) interconnects [4, 5, 6]. Another major challenge is power delivery. As the fabrication technology advances, the power consumption of the chip increases. According to the ITRS projection [7], the power consumed by a single chip will reach 200 W in a few years. Even in the packages of today's industry designs, more than half of the I/O pins (or C4 bumps) are dedicated to power and ground connections. As multiple chips are
stacked together into a smaller footprint, delivering current to all parts of the 3D stack while meeting the noise constraints becomes highly challenging. This is mainly because the number of through-silicon vias (TSVs) available for signal nets and P/G nets is limited, causing routing congestion if many 3D connections are desired. This issue is further exacerbated if microfluidic channels (and TSVs) are used for liquid cooling.¹ Figures 6.1 and 6.2 show a possible configuration of microfluidic channels, signal TSVs, and P/G TSVs, all competing for layout resources. The following topics are covered in this chapter:

• We study the routing problem for multifunctional interconnects – signal, power, thermal – in 3D ICs. We learn how to consider the various physical, electrical, and thermomechanical requirements of these multifunctional interconnects to successfully complete routing while addressing thermal and noise concerns.

• We study a compact physical model to analyze the thermal performance of microfluidic channel-based liquid cooling. We also learn about the routing challenges posed by this thermal interconnect and study a way to optimize the geometries of the interconnects.

• Co-optimization of the above-mentioned interconnects is a hard task; the vast design space, with many degrees of design freedom and little prior knowledge of the system, poses a great challenge. We learn how to perform this task using design of experiments (DOE) and response surface methodology (RSM). The design knobs and the assessment metrics are defined, and the correlations among them are discussed.

The remainder of the chapter is organized as follows. Section 6.2 provides an overview of 3D design methodology. We study microfluidic cooling and thermal analysis in Sect. 6.3. Section 6.4 presents 3D P/G network design and noise analysis. Section 6.5 presents the signal routing method. Section 6.6 presents the design-of-experiments (DOE)-based optimization methodology. Experimental results are presented in Sect. 6.7, and we conclude in Sect. 6.8.
6.2 Overview of 3D Physical Design Flow

In order to accurately model the interaction among the signal, power, and thermal interconnects in 3D ICs and obtain the metrics of interest, including congestion, temperature, and power supply noise, we build a physical layout of the given gate-level circuit. This involves 3D tier partitioning, placement, and routing. In this case, we perform global placement and routing and skip detailed placement and routing, mainly because global-level physical design still provides enough detail about the metrics mentioned above while eliminating the need for extremely time-consuming detailed layout construction. Note that the level of detail included in
¹ These silicon ancillary technologies continue to advance, and the size of the related interconnects and TSVs continues to scale down. Our study will be helpful in evaluating these advances in the context of a full layout and routing environment.
Fig. 6.1 Three major types of interconnects in 3D IC. They compete with each other for routing space
this global layout can be controlled by adjusting the size of the global placement and routing grid: if more detail is desired, we use a finer grid for placement and routing. Table 6.5 provides details on the dimensions of the grid structures used in our experiments.
Fig. 6.2 A 3D die structure with microfluidic channels, power/ground TSVs, and signal TSVs. Transistors and signal wires are not shown for simplicity
Fig. 6.3 Our design flow for multifunctional interconnect routing (a top-down physical design flow: tier partitioning, 3D placement, fluidic channel routing, power/ground routing, clock routing, and signal routing, followed by thermal, power, noise, skew, and congestion analysis, with feedback and re-optimization)
Figure 6.3 shows an overview of our design flow. After reading in the input circuit, the partitioning stage starts. In this stage, the input circuit is divided into multiple dies. Since the total number of signal TSVs is determined at this stage, we minimize the cutsize by utilizing a mincut algorithm such as the FM algorithm [8]. In the placement stage, we perform global placement onto an $n_p \times n_p$ grid. The cell occupancy ratio (CO) of a placement tile at (x, y, z) is defined as
$$CO(x, y, z) = \frac{\sum_{\forall cell \in r\_tile(x,y,z)} S_{cell}}{S_{r\_tile}}, \quad (6.1)$$
where $S_{cell}$ is the area of a cell in the routing tile $r\_tile(x, y, z)$, and $S_{r\_tile}$ is the area of a routing tile. We perform congestion-driven placement based on a simulated annealing technique [9] with a predefined target CO ($= tCO$) to distribute gates evenly. Next, we perform global routing onto an $n_r \times n_r$ grid. In our routing, the thermal, power, and signal nets are routed sequentially in this order. We first route the microfluidic channels, followed by the P/G TSVs on routing tiles with no microfluidic channels; lastly, we add the P/G wires (thick and thin). The remaining area is used for signal net routing. Since the thermal and P/G interconnects are routing obstacles, we use a thermal-aware 3D maze router for obstacle avoidance under the given thermal profile [10]. After the routing is finished, we run power noise and thermal analysis to see if the given constraints are satisfied. We may repeat all or some of the physical design steps with more routing area if routing fails to complete.
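The occupancy metric (6.1) can be tabulated in one pass over the placed cells. The sketch below makes the simplifying assumption that each cell's area is charged entirely to the tile containing its location, rather than being split across tile boundaries as a full implementation would do.

```python
import numpy as np

def cell_occupancy(cells, tile_w, tile_h, n_p, n_tiers):
    """Cell occupancy ratio CO(x, y, z) of (6.1) on an n_p x n_p x n_tiers grid.

    cells : list of (x, y, z, area) with (x, y) the placed location, z the tier,
            and area the cell area; each cell is charged to one tile here
    """
    occ = np.zeros((n_p, n_p, n_tiers))
    tile_area = tile_w * tile_h
    for (x, y, z, area) in cells:
        i = min(int(x // tile_w), n_p - 1)
        j = min(int(y // tile_h), n_p - 1)
        occ[i, j, z] += area / tile_area
    return occ
```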
6.3 Thermofluidic Interconnect for 3D ICs

Three-dimensionally stacked ICs bring several challenges in thermal management. By stacking layers, the heat dissipation per unit volume and per unit horizontal footprint area is significantly increased. Also, the interior layers of the 3D structure are thermally detached from the heat sink. Heat transfer is further restricted by the low thermal conductivity of the bonding interfaces and by thermal obstacles in the multiple IC layers. Moreover, the inherent spatial nonuniformity of the power/heat flux dissipation within each active layer generates local temperature hot spots, which degrade the functionality of circuits and, due to nonuniform thermal expansion, create thermal stress issues. Conventional cooling techniques, which depend on heat sinks on the backs of ICs to transfer heat into streams of forced air, will be unable to meet the needs of future power-hungry devices – especially 3D multichip modules that will pack more processing power into less space. Several kinds of advanced cooling technologies have been presented, mainly for 2D ICs, including microjet impingement cooling, compact thermosyphons, loop heat pipes, electro-osmotic pumping loops, stacked microchannel heat sinks, thermoelectric microcoolers, miniature vapor compression heat pump systems, and miniature absorption heat pump systems. However, such cooling solutions for 2D planar circuits struggle to overcome the limited surface area available for thermal management and the large vertical thermal resistance between the bottom layer and the heat sink of 3D integrated circuits.
6.3.1 Microfluidic Channel-Based Cooling

Unlike air-cooled heat sinks, liquid cooling using microchannels offers a larger heat transfer coefficient (and thus lower thermal resistance) and a chip-scale cooling solution. Figure 6.4 shows an SEM image of fabricated microfluidic channels.
Fig. 6.4 SEM image of fabricated microfluidic channels
Recent advances in wafer-level fabrication techniques provide polymer pipes that allow electronic and cooling interconnections to be made simultaneously using automated manufacturing processes. The low-temperature technique, which is compatible with conventional microelectronics manufacturing processing, allows fabrication of the microfluidic cooling channels without damage to the integrated circuits. By controlling the average operating temperature and cooling hotspots, liquid cooling can enhance the reliability of the integrated circuits. Lower operating temperatures also mean a smaller thermal excursion between silicon and low-cost organic package substrates, which expand at different rates. In the studies of [5] and [6], both electrical and fluidic TSVs and I/Os were demonstrated. The electrical interconnects are used for power delivery and signaling between strata, and the fluidic interconnects are used to deliver a coolant to each microfluidic channel heat sink in the 3D stack and thus enable the dissipation of heat from each stratum. The chips are designed such that, when they are stacked, each chip makes electrical and fluidic interconnections to the dice above and below. Consequently, power delivery and signaling can be supported by the electrical interconnects (solder bumps and copper TSVs), and heat removal for each stratum can be supported by the fluidic I/Os and microchannel heat sinks. The thermal resistance of the microchannel heat sink for a single chip was previously measured [2]. Figure 6.5 shows an illustration of a 4-tier 3D IC with microfluidic channel-based cooling. The coolant utilizes the following "thermal interconnect" path to cool the individual dies in the 3D stack: (1) packaging substrate, (2) fluidic I/O bump, (3) fluidic TSVs, and (4) fluidic channel in each die. The hot liquid exiting the system is cooled using an external freezer and reenters the system. Note that the fluidic TSVs are located outside the core region of the dies, thereby not interfering with the circuitry in each tier.
Fig. 6.5 Four-tier 3D IC with microfluidic channels (package substrate with balls, C4 bumps, RDL, fluid inlet and outlet, and front/back stacked dies)
6.3.2 Thermal Analysis

A three-dimensional thermal model of [11] is modified to consider the lateral temperature and fluid flow rate distribution caused by the nonuniform power/heat flux distribution. Figure 6.6 shows the cross-sectional view of the 3D stacked IC with embedded microfluidic channels. It is assumed that the temperatures of the fluid and the solid domains (including channel wall, channel base, Avatrel cover, and oxide layer) are different but uniform at each cross section within each control volume. In reality, the temperature of the Avatrel cover and oxide-metal layer will be slightly different from that of the silicon structure. These layers are very thin (~10 μm) and have low thermal conductivity (~10 W/mK), resulting in negligibly small heat transfer through them. Thermal and fluid flow in the microfluidic channels are described by the following energy and momentum conservation equations:

$$\dot{m}\,\frac{di}{dz} = h_0 h_c P\,(T_{w,k} - T_{f,k}) + h_c w\,(T_{w,k+1} - T_{f,k}). \qquad (6.2)$$

$$-\frac{dP}{dz} = \frac{2 f G^2}{d_h \rho}. \qquad (6.3)$$

$$\frac{\partial}{\partial x}\!\left(k \frac{\partial T_w}{\partial x}\right) + \frac{\partial}{\partial y}\!\left(k \frac{\partial T_w}{\partial y}\right) + \frac{\partial}{\partial z}\!\left(k \frac{\partial T_w}{\partial z}\right) + q_g + q_c = 0. \qquad (6.4)$$
Fig. 6.6 Side view of the thermal grid structure used for the 3D stacked IC with microfluidic channels (control volumes indexed j−1, j, j+1 along the channel; layers k = 1–4 comprising the SiO2 & metal layer, channel base, channel wall and fluidic channel, and Avatrel cover)
$T_w$ and $T_f$ represent the temperatures of the solid and fluid, respectively; $\dot{m}$, $i$, and $h_c$ are the mass flow rate, enthalpy, and convective heat transfer coefficient, respectively. For each microfluidic channel, heat is directly supplied only to the channel base, and the channel wall is analyzed as a fin attached to the base ($h_0$ is the overall surface efficiency for heat transfer, including an array of fins and the base surface). The microchannel geometry is described by the channel perimeter $P$ and the width $w$. Equation (6.2) represents the fluid enthalpy change due to the convective heat transfer caused by the temperature difference between the solid and fluid, as well as the fluid convective motion. The pressure drop along the microfluidic channel is obtained by solving the fluid momentum balance equation (6.3), wherein $P$, $G$, and $\rho$ are the pressure, mass flux, and density of the fluid, respectively, $f$ is the fluid friction factor, and $d_h$ is the hydraulic diameter of a microfluidic channel. Equation (6.4) is the three-dimensional thermal transport equation for the solid. It has two source/sink terms owing to the heat generated from the active and oxide-metal layers and the convective heat transfer to the fluid ($k$ denotes the thermal conductivity of the solid). Deionized water is considered as a representative working fluid for single-phase flow in the microfluidic channel. The governing equations (6.2), (6.3), and (6.4) are integrated over a control volume (Fig. 6.6) and then discretized using the upwind scheme [12]. The resulting system of linear algebraic equations is solved iteratively using the successive under-relaxation (SUR) method.
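For readers unfamiliar with successive under-relaxation, the following Python sketch shows the basic iteration on a generic linear system A x = b, such as the one produced by discretizing (6.2)–(6.4). The relaxation factor, tolerance, and test matrix are illustrative assumptions, not values taken from the chapter.

```python
import numpy as np

def sur_solve(A, b, omega=0.8, tol=1e-8, max_iter=10_000):
    """Gauss-Seidel sweep with under-relaxation (0 < omega < 1)."""
    x = np.zeros_like(b, dtype=float)
    for _ in range(max_iter):
        x_old = x.copy()
        for i in range(len(b)):
            sigma = A[i, :] @ x - A[i, i] * x[i]   # off-diagonal contribution
            x[i] = (1 - omega) * x[i] + omega * (b[i] - sigma) / A[i, i]
        if np.max(np.abs(x - x_old)) < tol:
            break
    return x

# Illustrative 1D conduction-like system (diagonally dominant, so the sweep converges)
A = np.array([[ 2., -1.,  0.],
              [-1.,  2., -1.],
              [ 0., -1.,  2.]])
b = np.array([1., 0., 1.])
print(sur_solve(A, b))   # approaches the exact solution [1, 1, 1]
```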
6.3.3 Routing Requirements for Thermofluidic Network

The on-chip thermofluidic network is composed of fluidic TSVs and microfluidic channels. Here we assume that all the fluidic TSVs are located outside the main region where the gates and metal wiring are distributed. Thus, only the microfluidic channels are considered in the routing requirement analysis.
Since microfluidic channels are fabricated on the back side of a silicon die, they do not affect routing capability on the metal layers. A more important issue is that these channels obstruct TSV connections. Given the significant size of the microfluidic channels, they decrease the TSV routing capacity quite considerably. The scarce resource is usually the signal TSV capacity: because of the microfluidic channels, many routing tiles have no capacity left for signal TSVs. The following geometries related to microfluidic channels have an impact on the thermal and routability objectives:
• Microfluidic channel width ($w_{fl\_ch}$) – By increasing the microfluidic channel width, the mass flow rate and thus the cooling capability can be improved, but the space available for signal TSVs will be decreased.
• Microfluidic channel depth ($d_{fl\_ch}$) – By increasing the microfluidic channel depth, the mass flow rate and thus the cooling capability can be improved, but due to the fixed aspect ratio (AR) of TSVs, the diameter of the signal TSVs must be increased proportionally, decreasing the routing capacity between dies.
• Microfluidic channel pitch ($p_{fl\_ch}$) – By increasing the microfluidic channel pitch, the total thermal contact area between the channel wall and the working fluid will be reduced, and thus the cooling performance will be degraded, but the routability of TSVs will be improved.
Assuming a square chip with width $w_{chip}$, the microfluidic channel occupancy ratio (MFO), which measures the area ratio of microfluidic channels to the chip, is defined as follows:

$$\mathrm{MFO} = \frac{\left\lfloor \frac{w_{chip}}{p_{fl\_ch}} \right\rfloor \times w_{fl\_ch}}{w_{chip}}. \qquad (6.5)$$
MFO can be manipulated to control the cooling effectiveness vs. routability tradeoff. These design knobs will affect the following metrics:
• Chip size (optimization objective or constraint) – Larger microfluidic channels may increase the required chip size, because the z-direction routing congestion gets worse with larger microfluidic channels and the signal routing may fail with the given chip size.
• Silicon temperature (optimization objective or constraint) – Depending on the power dissipation map and the microfluidic channel configuration, the silicon temperature distribution varies. This temperature should be kept under a certain value over the entire die area. Also, the temperature distribution should be as uniform as possible to minimize the mechanical stress induced by thermal expansion.
• Routability (optimization constraint) – The primary goal is to complete the signal routing.
• Cooling cost (optimization objective or constraint) – The cost associated with the microfluidic cooling method can be a metric for optimization. This includes the pumping power required for coolant circulation. In addition, the die area required for fluidic TSVs can be included in the cost.
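As a quick sanity check of (6.5), the short Python sketch below evaluates MFO for the baseline geometry used later (100 μm wide channels on a 200 μm pitch, 6,000 μm chip width for adaptec1; see Tables 6.5 and 6.6). The function name and the check are illustrative, not part of the chapter's tooling.

```python
import math

def mfo(w_chip_um, w_ch_um, p_ch_um):
    """Microfluidic channel occupancy ratio per (6.5)."""
    n_channels = math.floor(w_chip_um / p_ch_um)
    return n_channels * w_ch_um / w_chip_um

# Baseline setting: 100 um wide channels, 200 um pitch, 6,000 um chip width
print(mfo(6000, 100, 200))   # 0.5, matching the MFO of 0.5 in Table 6.6
```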
The possible scenarios for design space exploration are as follows: (1) the chip size, thermal requirement, and cooling cost are fixed, routing has to be completed, and we determine how to tune the design parameters accordingly; (2) given the chip size and cooling cost, we determine how to complete routing while minimizing the silicon temperature.
6.4 Power Delivery Network for 3D ICs

In 3D ICs, the on-chip P/G networks in the individual tiers are vertically connected with P/G TSVs. This means that current is drawn from the P/G lines and/or on-chip decaps not only by the devices in the same tier but also by those in adjacent tiers, so the average current demand per power line and/or on-chip decap is higher in 3D ICs. At the same time, the smaller footprint area of 3D ICs significantly reduces the number of P/G bumps that can fit into the bottom of the stack. This is exacerbated by the fact that the P/G TSVs are of the via-last type, so they interfere with both the device and metal layers. This restricts the number of TSVs used in the 3D P/G network in order to prevent placement and routing congestion. In addition, signal routing must be done carefully to prevent coupling noise between P/G TSVs and signal wires. This complex optimization problem usually results in larger area, more power consumption, and lower performance, thereby diminishing the benefit of TSV-based 3D IC technology.
6.4.1 3D Power Distribution Network

In a 3D stack, the global power distribution networks on each die are implemented as grids made of orthogonal interconnects on the top wiring levels. Power is fed from the package through power I/O bumps distributed over the bottom-most die and travels to the upper dies using TSVs and solder connections [13]. High-performance systems require dense power/ground grids for both power distribution and signal current return purposes. In order to avoid excessive loop inductance and provide a nearby current return path, each signal wire needs to be close enough to at least one power/ground wire. In high-performance designs, empirically, two or three signal wires are sandwiched between a pair of power and ground wires, and the width of a power/ground wire should be two or three times the width of a signal wire to accommodate the return current from the signal wires [14]. Thus, a high-performance chip normally has dense grids on the top wiring levels with wire widths of a couple of microns and pitches of several microns. The densities of the power/ground I/Os for each die (TSVs and solders) are much smaller than the density of the on-chip grids. The sizes and pitches of I/O pads can be in the range of tens and hundreds of microns, respectively. Tens of power/ground wires are routed between two power/ground pads.
Fig. 6.7 Wires and TSVs in our P/G network
Figure 6.7 shows the form of our power grid. The power grid for the processor tiers contains two levels of granularity. We assume a flip-chip package with ball grid array chip connections. The power/ground TSVs are connected directly to the power/ground balls. The TSVs are connected to one another using thick 10 μm wires. Within this coarse mesh is a fine-grained mesh for local power delivery. There are 20 small 5 μm wires for each of the power and ground grids in each direction. The power-to-power pitch of the bumps, TSVs, and coarse grid is 400 μm. The ground grid is offset from the power grid by 200 μm in each direction. This pitch accommodates the size and pitch of the micro-fluidic channels. To model the 3D power distribution grid, we assume that TSVs and package bumps have parasitic resistance, capacitance, and inductance. The 2D distribution grid that exists on each tier of our system is purely resistive. A capacitor (representing decap) and a current source (representing the current demand of the transistors) connect the power and ground grids at each node. The current sources are simulated as a ramp from 0 to the current demand of the particular module that covers that area of the floorplan. The rise-time of the current source ramp is dependent on the type of tier (processor, memory) that the current source is located on.
6.4.2 Noise Analysis

For a 3D chip stack with a footprint size of 1 cm², we may have thousands of power/ground I/Os for each die and millions of wire segments on the power/ground grids in each die. If each individual wire segment is modeled, simulating this huge circuit network with tens of millions of elements is prohibitively
time-consuming. To deal with this problem, we created a custom circuit simulator based on Modified Nodal Analysis (MNA), with a modification based on Domain Decomposition (DD) [15]. Using MNA with DD, we have simulated networks containing up to 11 million nodes. Unfortunately, our full-size 46-tier system contains over 25 million nodes. To efficiently simulate our largest system, we therefore limit our studies to 25% of the die area. Our experiments indicate that this introduces a small amount of error, approximately 5%. This error comes from the change in the ratio of active area to decap due to the reduced whitespace. We add a buffer area around the simulated area to reduce edge-effect errors. The error, however, is systematic in nature and should not affect the conclusions of the scaling studies. This work focuses on a conservative approximation of the power noise of a given system. Accordingly, our dynamic noise simulations represent an extreme worst-case scenario: we simulate the power noise generated when all processors in the system are simultaneously powered on from a sleep state.
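The chapter's simulator handles full RLC transients with domain decomposition; as a much smaller illustration of the MNA idea, the Python sketch below assembles the nodal conductance matrix for a purely resistive power grid with pads and solves for the DC IR drop. The grid size, resistances, pad locations, and current demands are all made-up values, not data from the chapter.

```python
import numpy as np

def ir_drop(n, r_seg, vdd, pads, loads):
    """DC IR drop on an n x n resistive grid.
    r_seg: resistance of one grid segment (ohm); pads: {(i, j): r_pad} to Vdd;
    loads: {(i, j): current_draw_A}. Returns node voltages as an n x n array."""
    idx = lambda i, j: i * n + j
    G = np.zeros((n * n, n * n))
    I = np.zeros(n * n)
    g = 1.0 / r_seg
    for i in range(n):
        for j in range(n):
            for di, dj in ((0, 1), (1, 0)):          # stamp right/down neighbors
                if i + di < n and j + dj < n:
                    a, b = idx(i, j), idx(i + di, j + dj)
                    G[a, a] += g; G[b, b] += g
                    G[a, b] -= g; G[b, a] -= g
    for (i, j), r_pad in pads.items():               # pads tie nodes to Vdd
        G[idx(i, j), idx(i, j)] += 1.0 / r_pad
        I[idx(i, j)] += vdd / r_pad
    for (i, j), cur in loads.items():                # current sinks (gate demand)
        I[idx(i, j)] -= cur
    v = np.linalg.solve(G, I)
    return v.reshape(n, n)

v = ir_drop(n=8, r_seg=0.1, vdd=1.0, pads={(0, 0): 0.01, (7, 7): 0.01},
            loads={(3, 4): 0.05, (5, 2): 0.08})
print(1.0 - v.min())   # worst-case static IR drop for this toy grid
```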
6.4.3 TSV RLC Parasitic Modeling

The focus here is on TSV parasitics as they apply to the power distribution network. Typical dimensions for power distribution TSVs must be much larger than for signal TSVs, because in the power distribution network it is more important to reduce parasitic inductance and resistance than to save space. For our base case, and for the scaling study that follows, we assume copper TSVs with a square cross section of 40 × 40 μm. For scaling, we consider TSVs that range in size from 20 × 20 μm to 80 × 80 μm. The length of the conducting path of our TSVs, equivalent to the thickness of the die through which they pass, is assumed to be 15 μm for thinned dies and 150 μm for dies that contain micro-fluidic channels.

6.4.3.1 Inductance Modeling

The main cause of low-frequency first-droop power-supply noise is the interaction between the inductance of the package and the on-die capacitance. By adding large, vertical TSVs for power delivery, the power-supply noise problem may be exacerbated. The TSVs in the power-supply network have a much larger pitch than length, so the mutual inductance of neighboring TSVs is dominated by the self-inductance of the TSVs. Figure 6.8 shows a plot of the inductance of a TSV as a function of length; each line represents a TSV of a different dimension. The approximate height of our largest 3D stacked system is 1,600 μm. The solid lines represent inductances for TSVs of varying dimension as computed by Synopsys' inductance extractor, Raphael. The dotted lines represent the same TSV dimensions as calculated by standard self-inductance equations. The solid black bar across the graph represents the inductance of a package-level bump.
Fig. 6.8 Inductance (nH) vs. height (μm) scaling for TSV cross sections of 20×20, 40×40, 60×60, and 80×80 μm. The solid bar represents the inductance of a package bump. Stacking of TSVs longer than around 700 μm can more than double the inductance for the tiers farthest from the bumps
The graph shows that TSVs in large stacks can more than double the inductance seen by the tiers that are farthest from the bumps, which has a large negative impact on dynamic noise.

6.4.3.2 Resistance and Capacitance Modeling

The resistance of a metal interconnect is calculated assuming a uniform current density. This assumption is valid in the power supply grid because there is no high-frequency oscillation that would cause skin effects to become dominant, as could be the case in signal TSVs. The resistance of a TSV, $R_{TSV}$, is calculated as:
$$R_{TSV} = \frac{l\rho}{A},$$
where $l$ is the conducting path length, $\rho$ is the resistivity of the TSV material, and $A$ is the cross-sectional area of the conducting path. We assume that the TSVs are made of copper, and we use a conservative estimate of the resistivity, 21 nΩ·m, which should account for any thermal effects. Capacitive parasitics in the TSVs actually improve the dynamic noise response, because the TSVs work as on-chip decoupling capacitors; we nevertheless calculate and present them here for completeness. We again use both the capacitance extractor Raphael and capacitance formulas [16] to generate values. Table 6.1 shows the resistance and capacitance values of some typical TSVs in the power distribution networks we simulate. The table shows that the resistance values for the thinned die can be very low, less than one milliohm.
Table 6.1 Resistance (mΩ) and capacitance (fF) of typical power distribution TSVs. All lengths are in μm

Cross section   Resistance, 15 μm   Resistance, 150 μm   Capacitance, 15 μm   Capacitance, 150 μm
20 × 20         0.788               7.875                14.28                142.80
40 × 40         0.197               1.969                26.60                266.02
60 × 60         0.088               0.875                38.93                389.33
80 × 80         0.049               0.492                51.25                512.53
For comparison, the resistance of the thick lines connecting neighboring power TSVs is about 840 mΩ.
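As a cross-check of the resistance column in Table 6.1, the following Python sketch evaluates $R_{TSV} = l\rho/A$ with the stated copper resistivity of 21 nΩ·m. The function and loop are illustrative only; the capacitances in the table come from Raphael and the formulas of [16], which are not reproduced here.

```python
RHO_CU = 21e-9  # conservative copper resistivity used in the chapter, in ohm*m

def tsv_resistance_mohm(side_um, length_um, rho=RHO_CU):
    """Resistance of a square TSV, R = l*rho/A, returned in milliohms."""
    area_m2 = (side_um * 1e-6) ** 2
    length_m = length_um * 1e-6
    return rho * length_m / area_m2 * 1e3

for side in (20, 40, 60, 80):
    for length in (15, 150):          # thinned die vs. die with fluidic channels
        r = tsv_resistance_mohm(side, length)
        print(f"{side}x{side} um, {length} um long: {r:.3f} mOhm")
# e.g. 20x20 um, 15 um long -> 0.788 mOhm, matching Table 6.1
```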
6.4.4 Routing Requirements for Power Delivery Network

The overall power and ground structure is shown in Fig. 6.9. In our work, power and ground (P/G) TSVs are placed regularly in a mesh structure with a predefined pitch ($p_{Vpg}$ = 400 μm). The width of a P/G tile is half the power TSV pitch ($w_{pg\_tile} = p_{Vpg}/2$ = 200 μm), and each tile contains one quarter of a power TSV and one quarter of a ground TSV. The total number of P/G TSVs in a die is therefore:
$$N_{Vpg} = \left\lfloor \frac{S_{chip}}{w_{pg\_tile}^2} \right\rfloor \times \left(\tfrac{1}{4} + \tfrac{1}{4}\right), \qquad (6.6)$$
where $S_{chip}$ is the area of the chip. Power and ground nets are routed on metal layers 7 and 8. The following three levels of wiring hierarchy are used:
• P/G TSVs provide the vertical connections.
• Thick wires of 10 μm width run between the P/G TSVs.
• Between the thick wires, thin wires of 1 μm width and 4.64 μm pitch are placed.
Between two adjacent thin wires, up to six signal wires may be routed. Thus, the routing capacity ratio of signal wires on metal layers 7 and 8 (compared to the case with no P/G wires) is:
$$ratio_{sig\_M78} = \frac{6}{4.64/0.56} = 0.724. \qquad (6.7)$$
In our 3D technology, two TSVs from two different dies are connected using the metal layers and vias in those dies; this means that two stacked signal TSVs are not directly connected. P/G TSVs, on the other hand, pierce through all four dies for efficient power delivery (see Fig. 6.10).
Fig. 6.9 Top-down and side view of the dual-mesh-structured power and ground net (M7 horizontal, M8 vertical; power and ground TSVs on a 400 μm pitch connected by thick P/G wires, with thin P/G wires and signal wires inside each single tile)
Thus, no cell can be placed at the P/G TSV locations. Considering the size of the power and ground TSVs, this area is not negligible. The total area required for P/G TSVs can be calculated by multiplying the number of P/G TSVs by the P/G TSV area:
$$S_{tot\_Vpg} = N_{Vpg} \times S_{Vpg}. \qquad (6.8)$$

Based on our structural assumptions, P/G TSVs occupy around 2% of the chip area. For the routing tiles containing a part of a P/G TSV, the routing capacity is decreased by a large amount.
Fig. 6.10 Side view of a die layer in a stacked chip, showing power, ground, and signal TSVs, the micro-fluidic channel, bulk Si, the active layer, and the metal layers. The die is flipped and the active layer is facing down. Shapes are drawn to scale; the unit is μm. Note that in our technology assumption, P/G TSVs span the entire 4-die stack, while signal TSVs span only a single die, requiring metal layers and vias for signal TSV-to-TSV connection
6.5 Signal Interconnect for 3D ICs

Signal routing becomes more challenging in the presence of prerouted thermal and power delivery networks. It is thus important to accurately model the routing resources available for signal routing and to find ways to make the best use of them. We discuss these topics in this section.
6.5.1 Geometries of Wires and Vias

For the signal wires, we use metal interconnect dimensions similar to those of Intel's 45 nm technology [17]. The TSV formation approach was assumed to be via-first. In the via-first approach, TSVs are formed before the metal wiring is constructed. From the signal routing point of view, this is beneficial because the TSVs do not touch the metal interconnect layers. By contrast, in the via-last approach, a TSV passes through the entire structure and decreases the metal wire interconnection space by a large amount. The TSV aspect ratio (AR, the ratio of the die thickness to the minimum TSV diameter) was assumed to be 15:1. Figure 6.10 shows the side view of a die. The diameter of the signal TSVs is set to the minimum size to accommodate as many connections as possible. In contrast, the diameter of the P/G TSVs is 40 μm, which is comparable to previous work [Huang et al., 2007a]. Table 6.6 provides more details on the related geometries. We fix the width of our routing tile at $w_{r\_tile}$ = 50 μm; the total number of grids can then be calculated accordingly. Figure 6.11 shows the routing tile objects.
Fig. 6.11 Top view of a routing tile, showing the fluidic TSVs, P/G TSVs, signal TSVs, micro-fluidic channel, and M7/M8 wires. Objects are drawn to scale
6.5.2 Routing Capacity Calculation

For each routing tile, there are x-, y-, and z-direction routing capacity values. The x- and y-direction capacities represent the available routing space on the metal layers, while the z-direction capacity is for signal TSVs. Basically, the x- and y-direction capacity values of a metal layer are calculated by dividing the routing tile size by the pitch of the metal layer. Since the benchmark circuits used in this study are gate-level designs, metal layers 1 and 2 are used for local routing; thus, we assume that only 20% of the routing capacity is available in metal layers 1 and 2. Metal layers 3–6 are dedicated to signal routing. In metal layers 7 and 8, we decrease the routing capacity to account for the P/G nets. The capacity values of the metal layer stack are added together for each tile. If the tile is preoccupied with P/G TSVs, we decrease the capacity accordingly. For the z-direction capacity, we calculate the remaining surface area of each routing tile: starting from the routing tile area, we subtract the placed cell area and the P/G TSV area. Since we place P/G TSVs at the center of four routing tiles, only one quarter of a P/G TSV is included in a routing tile. Then, we divide the resulting area by the minimum-pitch signal TSV area:
$$capa_z(x,y,z) = \frac{S_{r\_tile} - S_{placed\_cells} - ratio_{Vpg} \times S_{Vpg}}{S_{Vs}}. \qquad (6.9)$$
We set our routing tile size to 50 × 50 μm, which results in the following three types of routing tiles for signal net routing:
• Type 1 (contains no obstacle): 618 wires for x/y-direction routing and 6 signal TSVs.
• Type 2 (contains P/G interconnect obstacles): 371 wires for x/y-direction routing and 5 signal TSVs.
• Type 3 (contains micro-fluidic channel obstacles): 618 wires for x/y-direction routing and 0 signal TSVs.
In the case of x-direction interconnects, the power and thermal TSVs occupy 2.5% and 0% of the available routing area, respectively; thus, 97.5% of the x-direction routing resource is available for signal net routing. The same holds for the y-direction. In the case of the z-direction, the power TSVs and the thermal interconnects (= micro-fluidic channels) occupy 2% and 50% of the available routing area, respectively; thus, 48% of the z-direction routing resource is available for signal TSVs.
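The per-tile signal TSV counts for the three tile types can be reproduced from (6.9) with the Table 6.6 geometry (50 μm tiles, 20 μm minimum signal TSV pitch, 40 μm P/G TSVs, 100 μm wide channels). The Python sketch below assumes an empty tile (no placed cells) purely for illustration.

```python
def z_capacity(tile_um=50, sig_tsv_pitch_um=20, pg_tsv_um=40,
               pg_quarter=False, channel=False, placed_cell_area_um2=0.0):
    """Signal-TSV capacity of one routing tile, following (6.9)."""
    if channel:                     # a 100 um wide channel fully blocks a 50 um tile
        return 0
    s_tile = tile_um ** 2
    s_vpg = 0.25 * pg_tsv_um ** 2 if pg_quarter else 0.0   # one quarter per tile
    s_vs = sig_tsv_pitch_um ** 2
    return int((s_tile - placed_cell_area_um2 - s_vpg) // s_vs)

print(z_capacity())                  # Type 1: 6 signal TSVs
print(z_capacity(pg_quarter=True))   # Type 2: 5 signal TSVs
print(z_capacity(channel=True))      # Type 3: 0 signal TSVs
```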
6.6 Design of Experiments

So far, the "default setting" of the various geometries of the micro-fluidic channels and the P/G network – shown in detail in Table 6.6 – has been determined without consideration of "co-optimization": it is not clear whether the 100 μm micro-fluidic channel width is overly large for the given temperature requirement, or whether the 40 μm P/G TSV diameter is too small to meet the power supply noise requirement. Thus, our goal is to co-optimize the various geometries of the thermal and power interconnects so that the given thermal and noise requirements are met and all signal interconnects are routed, while minimizing the overall area. This is a very challenging task because of the complex interaction among the signal, power, and thermal interconnects. In this section, we investigate the viability of Design-of-Experiments (DOE) as the co-optimization engine. Since its invention [18], DOE has been used for many science and engineering applications. DOE has proven to be effective and efficient when an optimization is desired for complex systems with multiple input factors. Given the input factors and the region of interest, we assume an underlying metamodel to describe the system. Based on the strategy for covering the region of interest, the DOE method suggests the design points to be experimented with. After running the experiments and gathering the responses, we draw the response surfaces to understand and optimize the system; this is called response surface methodology [19]. The major benefits of modern DOE techniques are as follows: (1) the number of experimental runs is dramatically decreased compared to that of a conventional full factorial design; (2) no prior knowledge about the target system is needed; (3) the effect of the input factors on the responses can be identified. DOE has also been used in the VLSI and CAD community: the DOE framework for CAD was discussed in Brglez and Drechsler [20], a robust interconnect model based on DOE was presented in [21], and in [22] DOE was used to identify the performance-critical buses in a microarchitecture.
6.6.1 Overall Design Flow

The overall design flow is summarized in Fig. 6.12. We start by defining the design knobs (= input factors) and the metrics (= responses). A single experiment is equivalent to performing gate-level global placement and routing, where we first perform gate-level partitioning and placement, followed by routing of the thermal, P/G, and signal nets. We then evaluate the metrics of interest and complete the current experiment. Once the desired number of experiments is performed, we construct response surface curves and use them to obtain optimal design solutions. Table 6.2 shows our design knobs, and Table 6.3 shows our assessing metrics.
6.6.2 Designing the Experiments

We define the ranges of the design knobs as shown in Table 6.4. We used the Model-Based Calibration Toolbox in MATLAB to design the experiments; a stratified Latin hypercube is selected from the space-filling design styles. Several geometric constraints are applied to the design space to rule out any physical overlap among objects and to conform to spacing requirements.
Fig. 6.12 Design-of-Experiment (DOE) flow for co-optimization of signal, power, and thermal interconnect routing: define the ranges of the design knobs, determine the design points to experiment, run all experiments (each a single experiment consisting of tier partitioning, 3D placement, MFC routing, P/G net routing, signal net routing with rip-up and rerouting, and thermal/power noise/congestion analysis), assess the metrics, build the response surfaces, and iterate until an optimal design is found
Table 6.2 Design knobs

MFC depth – We change the depth of the MFCs. Since MFCs are etched on the back side of the Si, a deeper MFC means increased die thickness. We assume all the MFCs on the chip have the same depth.
MFC width – The width of the MFCs is also varied. We assume all the MFCs have the same width.
MFC pitch – The pitch is the third knob for the MFCs. This is the distance between the centers of two consecutive MFCs on a die.
Pressure drop – We change the pressure drop of the fluid between the inlet and outlet of the MFCs. This in turn affects the mass flow rate of the fluid.
P/G TSV diameter – The diameter of the P/G TSVs affects the RLC values of the P/G TSVs.
P/G TSV pitch – This is defined as the distance between two power (or two ground) TSVs.
P/G thin wire ratio – This is the ratio between P/G thin wires and signal wires on metal layers 7/8. For example, if this value is 0.4, P/G thin wires use 40% (= 20% each) of the routing tile, and the rest (= 60%) is for signal routing.
Table 6.3 Assessing metrics

Wirelength – This is related to the quality of the signal routing.
Total # of signal TSVs – The actual signal TSV usage may vary with the availability of routing capacity. If it helps reduce wirelength, the router may use more TSVs.
Routing congestion – The average z-direction utilization for each layer is calculated, and the maximum value among all layers is used to represent routing congestion. 1.0 (= 100%) means signal routing failure due to insufficient z-direction capacity between layers.
Max. fluid temp. – The maximum fluid temperature should be moderate. Too high a value implies insufficient cooling, whereas too low a value suggests an overdesigned fluidic cooling system.
Max. Si wall temp. – The performance of the transistors degrades with higher temperature. We constrain this to be less than 85 °C.
σ Si wall temp. – If the temperature varies much within a silicon layer, thermal expansion can incur mechanical stress and compromise system reliability. The Si wall temperature distribution is characterized by its standard deviation.
Pump power – This is a part of the system cost and is related to the pressure drop and the mass flow rate of the fluid. The system designer may constrain this value.
Power noise – The power noise should not exceed the noise margin of the transistors; usually it is assumed to be 10% of Vdd.
For example, when the MFC pitch is 400 μm and the P/G TSV pitch is 400 μm, the requirements become:

$$pitch_{PG\_TSV}/2 - D_{PG\_TSV} \ge width_{MFC} + 10, \qquad (6.10)$$

$$pitch_{PG\_TSV}/2 + D_{PG\_TSV} + 10 \le pitch_{MFC} - width_{MFC}, \qquad (6.11)$$
where $D$ denotes the diameter. With the given constraints, a total of 115 design points survived out of the 330 generated points. We then ran the experiments and removed the design points with signal routing failures; the remaining 89 design points were used to construct the response surfaces.
Table 6.4 The ranges of the input factors

Input factor             Value
MFC depth (μm)           50–200
MFC width (μm)           50–200
MFC pitch (μm)           200, 400, 600, 800
Pressure drop (kPa)      100–180
P/G TSV diameter (μm)    20, 40, 60, 80
P/G TSV pitch (μm)       400, 800
P/G thin wire ratio      0.2–0.8
In addition, 6 design points different from the previous ones were generated for design validation purposes.
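The chapter generates its design points with the stratified Latin hypercube of MATLAB's Model-Based Calibration Toolbox. As a rough illustration of the same idea, the Python sketch below draws a plain Latin hypercube sample over the Table 6.4 ranges and discards points that violate the spacing constraints (6.10) and (6.11). The sampler, the snapping of discrete factors, and the sample count are simplifying assumptions, not the toolbox's algorithm.

```python
import numpy as np

rng = np.random.default_rng(0)

def latin_hypercube(n, lows, highs):
    """Basic LHS: one stratified sample per interval and dimension."""
    lows, highs = np.asarray(lows, float), np.asarray(highs, float)
    d = len(lows)
    u = (rng.permuted(np.tile(np.arange(n), (d, 1)), axis=1).T + rng.random((n, d))) / n
    return lows + u * (highs - lows)

def snap(values, levels):
    """Map a continuous draw onto the nearest discrete level (e.g. MFC pitch)."""
    levels = np.asarray(levels, float)
    return levels[np.abs(levels[None, :] - values[:, None]).argmin(axis=1)]

# Factors: MFC depth, MFC width, MFC pitch, pressure drop, P/G TSV diameter, P/G TSV pitch, thin wire ratio
pts = latin_hypercube(330, lows=[50, 50, 200, 100, 20, 400, 0.2],
                           highs=[200, 200, 800, 180, 80, 800, 0.8])
pts[:, 2] = snap(pts[:, 2], [200, 400, 600, 800])
pts[:, 4] = snap(pts[:, 4], [20, 40, 60, 80])
pts[:, 5] = snap(pts[:, 5], [400, 800])

w_mfc, p_mfc, d_pg, p_pg = pts[:, 1], pts[:, 2], pts[:, 4], pts[:, 5]
ok = (p_pg / 2 - d_pg >= w_mfc + 10) & (p_pg / 2 + d_pg + 10 <= p_mfc - w_mfc)
print(ok.sum(), "of", len(pts), "points satisfy (6.10) and (6.11)")
```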
6.6.3 Design Optimization

With multiple assessing metrics and design constraints, there can be multiple optimization scenarios. In this work, we solve the following problem: minimize the combined cost (= Cost) under a Si wall temperature constraint of 85 °C, a power noise constraint of 100 mV, and 100% routability. We want to minimize the total wirelength, the standard deviation of the Si wall temperature (to reduce mechanical stress), and the maximum power noise, while maximizing the maximum fluid temperature to avoid overdesigning the MFCs. Each of the metrics under consideration is normalized to [0, 1] and forms a partial cost. Then, we combine them into a single desirability function [23]. The following Cost is used to evaluate a solution:
$$\mathrm{Cost} = \sqrt[4]{\mathrm{Cost}^*_{wl}\cdot\bigl(1 - \mathrm{Cost}^*_{ft}\bigr)\cdot \mathrm{Cost}^*_{swt}\cdot \mathrm{Cost}^*_{pn}}, \qquad (6.12)$$
where $\mathrm{Cost}^*_{wl}$, $1 - \mathrm{Cost}^*_{ft}$, $\mathrm{Cost}^*_{swt}$, and $\mathrm{Cost}^*_{pn}$ respectively denote the normalized wirelength, fluidic temperature, standard deviation of Si wall temperature, and power noise costs. Some of the input factors have discrete levels, so we created optimization cases for all possible combinations of them. Since the MFC pitch has 4 levels, the P/G TSV diameter has 4, and the P/G TSV pitch has 2, a total of 32 cases were considered. Optimization with respect to Cost was performed for all the cases independently. Some of the resulting optimal designs were infeasible due to constraint violations; removing these infeasible designs and sorting the remaining designs by Cost in ascending order, we obtain the best optimal design.
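A minimal sketch of the desirability-style combined cost in (6.12) is given below. The min–max normalization used to obtain the partial costs is an assumption for illustration (the chapter only states that each metric is normalized to [0, 1]), and the response arrays are made-up numbers.

```python
import numpy as np

def normalize(values):
    """Min-max normalization of a metric across all experiments to [0, 1]."""
    v = np.asarray(values, dtype=float)
    return (v - v.min()) / (v.max() - v.min())

def combined_cost(wirelength, fluid_temp, sigma_wall_temp, power_noise):
    """Geometric-mean desirability of the four partial costs, as in (6.12)."""
    c_wl = normalize(wirelength)
    c_ft = normalize(fluid_temp)          # a higher fluid temperature is *better* here
    c_swt = normalize(sigma_wall_temp)
    c_pn = normalize(power_noise)
    return (c_wl * (1.0 - c_ft) * c_swt * c_pn) ** 0.25

# Hypothetical responses from three experiments
cost = combined_cost(wirelength=[3.80e8, 3.78e8, 3.92e8],
                     fluid_temp=[22.0, 25.5, 21.0],
                     sigma_wall_temp=[4.2, 2.9, 6.0],
                     power_noise=[0.070, 0.055, 0.080])
print(cost, "-> best design index:", int(np.argmin(cost)))
```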
6.7 Experimental Results

The benchmark circuits are from the ISPD 2006 Placement Contest [24] and range from 200K to 1.2M gates, as shown in Table 6.5, which also reports the chip sizes. The dimensions of the routing, P/G, and thermal grids are calculated from the chip size. The technology and setting parameters used in our experiments are shown in Table 6.6.
Table 6.5 ISPD 2006 benchmark circuits. We report the total number of cells and nets, the footprint area (mm²) and width (μm) of the 3D stack, and the dimensions of the routing, P/G, and thermal grids based on the chip area

            adaptec1     newblue1     newblue3     adaptec5     newblue5
# Cells     211,447      330,474      494,011      843,128      1,233,058
# Nets      221,142      338,901      552,199      867,798      1,284,251
Area        36           56.25        64           225          256
Width       6,000        7,500        8,000        15,000       16,000
r-grid      120×120×4    150×150×4    160×160×4    300×300×4    320×320×4
p-grid      30×30×4      37×37×4      40×40×4      75×75×4      80×80×4
t-grid      30×80×4      37×80×4      40×80×4      75×80×4      80×80×4

Table 6.6 Various technology and setting parameters used in our experiments

Item                                          Value
Number of dies                                4
Bonding type                                  face-to-back
Die thickness (μm)                            150
Bonding layer thickness (μm)                  10
TSV aspect ratio                              15:1
Routing grid size (μm)                        50
Signal TSV diameter (μm)                      10
Signal TSV minimum pitch (μm)                 20
P/G TSV diameter (μm)                         40
P/G TSV pitch (μm)                            400
P/G grid size (μm)                            200
Micro-fluidic channel width (μm)              100
Micro-fluidic channel pitch (μm)              200
Micro-fluidic channel depth (μm)              100
Micro-fluidic channel occupancy ratio (MFO)   0.5
Target cell occupancy ratio (tCO)             0.25
The target cell occupancy ratio (= tCO) is lower than that of 2D cases [24], because we need to make room for TSV connections.
6.7.1 Routability and Congestion Analysis

Table 6.7 shows the signal, P/G, and thermal interconnect routing results, where we report the average and maximum utilization of the routing tiles in the x, y, and z directions among all 4 dies. We also report the number of signal TSVs for all dies, which is the actual TSV usage after routing is finished. The run time of the signal routing stage was about 50 h for newblue5. Looking at the average routing tile usage, we note that the z-direction usage is higher than in the x/y-directions. This is expected since the size of the signal TSVs is significantly larger, and some tiles are fully blocked by P/G TSVs.
Table 6.7 Routing results: average and maximum utilization of the routing tiles in the x, y, and z directions among all 4 dies, and the number of signal TSVs for all dies

               adaptec1   newblue1   newblue3   adaptec5   newblue5
Average routing tile usage
x, die 0       0.110      0.150      0.182      0.241      0.317
y, die 0       0.091      0.122      0.153      0.184      0.260
z, die 0:1     0.227      0.288      0.362      0.277      0.341
x, die 1       0.141      0.107      0.247      0.202      0.259
y, die 1       0.116      0.090      0.198      0.153      0.203
z, die 1:2     0.325      0.362      0.403      0.159      0.284
x, die 2       0.140      0.157      0.135      0.180      0.216
y, die 2       0.115      0.126      0.111      0.141      0.174
z, die 2:3     0.103      0.344      0.246      0.029      0.252
x, die 3       0.102      0.127      0.150      0.124      0.194
y, die 3       0.083      0.104      0.130      0.090      0.152
Maximum routing tile usage
x, die 0       0.345      0.485      0.545      0.795      0.979
y, die 0       0.275      0.345      0.545      0.580      0.741
z, die 0:1     1.000      1.000      1.000      1.000      1.000
x, die 1       0.421      0.375      0.701      0.631      0.782
y, die 1       0.305      0.283      0.464      0.491      0.585
z, die 1:2     1.000      1.000      1.000      1.000      1.000
x, die 2       0.423      0.499      0.429      0.604      0.736
y, die 2       0.332      0.431      0.321      0.458      0.537
z, die 2:3     1.000      1.000      1.000      0.998      1.000
x, die 3       0.302      0.413      0.453      0.458      0.666
y, die 3       0.248      0.310      0.652      0.294      0.491
Number of signal TSVs
TSV 0:1        14,299     30,310     31,604     123,224    165,101
TSV 1:2        19,505     40,062     32,498     71,846     143,555
TSV 2:3        6,270      36,471     21,800     12,937     128,020
We also note that the average usage correlates with the number of TSVs derived from the partitioning stage. The maximum usage shows a more drastic difference in routing tile usage: in almost all cases, the z-direction routing fully utilizes the available area for TSVs, while no routing tile is fully utilized in the x/y-directions. Figure 6.13 shows the x-, y-, and z-direction routing tile utilization on die 1 of newblue5. We note that the congestion in the z-direction is more severe than in the x/y-directions. In this case, a large routing congestion "hotspot" is located at the bottom of the chip.
6.7.2 Thermal Analysis

In our experiment, the pressure drop between the inlet and outlet was constrained to 70 kPa for all micro-fluidic channels. The inlet working fluid temperature was assumed to be 20 °C. The thermal analyzer was written as a MATLAB m-function, and its run time was about two minutes for the newblue5 circuit.
Fig. 6.13 Routing usage of newblue5: (a) x-direction, (b) z-direction (= TSV between die 1 and die 2), where the horizontal white lines denote micro-fluidic channels. y-direction looks similar to x-direction
Figure 6.14 shows the working fluid temperature of newblue5. Thermally induced coolant flow maldistribution was not observed, as the mass flow rate of each channel ranges from 16 to 19 mg/s. The maximum working fluid temperature at the outlet was around 40 °C, which means that the chosen micro-fluidic channel cooling scheme provides sufficient background cooling capability. Figure 6.15 shows that the die temperature is maintained well below 85 °C using micro-fluidic cooling. Considering the working fluid temperature levels observed in Fig. 6.14, the maximum temperature difference between the die and the working fluid is only around 40 °C, which means that the convective heat transfer coefficients were higher than estimated and/or that thermal diffusion due to conduction played a significant role.
Fig. 6.14 Coolant temperature [°C] of newblue5 for dies 1–4 (horizontal axis: channel width position in mm; vertical axis: fluid flow direction in mm)
Fig. 6.15 Silicon wall temperature [°C] of newblue5 for dies 1–4 (horizontal axis: channel width position in mm; vertical axis: fluid flow direction in mm)
Table 6.8 Summary of thermal analysis results. Temperatures are reported in °C, and power and power density values are in W and W/cm², respectively

                  adaptec1   newblue1   newblue3   adaptec5   newblue5
# Hotspots        42         66         98         168        245
Ave die power     13.86      21.49      33.31      56.17      80.78
Ave pwr density   38.50      38.20      52.05      24.96      31.55
Ave fluid temp.   21.21      21.85      22.85      24.94      26.73
Max fluid temp.   23.73      25.63      28.60      33.53      37.40
Ave wall temp.    30.92      31.50      35.94      31.22      34.65
Max wall temp.    53.44      50.58      60.14      55.25      62.36
σ wall temp.      4.24       4.30       5.83       4.66       6.34
Pump power        0.380      0.376      0.388      0.397      0.411

Table 6.9 Various micro-fluidic channel geometry settings

                                         base   w50    d50    p400
Micro-fluidic channel width (μm)         100    50     100    100
Micro-fluidic channel pitch (μm)         200    200    200    400
Micro-fluidic channel depth (μm)         100    100    50     100
Micro-fluidic channel occupancy ratio    0.5    0.25   0.5    0.25
Die thickness (μm)                       150    150    90     150
Signal TSV diameter (μm)                 10     10     6      10
Signal TSV minimum pitch (μm)            20     20     12     20
The temperature difference was expected to be around 80 °C considering the maximum power level (80 W) and the expected heat transfer coefficient (around 10,000 W/m²K). Table 6.8 summarizes the thermal analysis results. For all the cases, the coolant temperature never exceeded 40 °C, and the maximum silicon wall temperature was kept under 65 °C for all die layers. This means that the micro-fluidic scheme had enough cooling capability for thermal management of the simulated 3D circuits. The standard deviation of the overall wall temperature was around 5 °C. In our system, the micro-fluidic channels are the biggest objects that block P/G and signal nets and thus should attract more attention in routing optimization. To investigate the impact of channel configurations on design quality, we conducted experiments with three variations of the micro-fluidic channel geometry, shown in Table 6.9. In setting w50, the width of the micro-fluidic channel was halved; in d50, the depth of the micro-fluidic channel was halved and the signal TSV dimension was decreased accordingly; and in p400, the pitch of the micro-fluidic channel was doubled. The circuit adaptec1 was used for this demonstration. Table 6.10 shows the routing results with the micro-fluidic channel variations. Compared to the baseline case (base), the x/y-direction utilizations of the variants are almost the same. Since the z-direction capacity has been increased for all variants, the router uses more TSVs to decrease wirelength. In the d50 case, the average z-direction usage is lower than in the other cases, because the reduced signal TSV dimension increased the z-direction capacity.
Table 6.10 Impact of micro-fluidic channel geometries on routing congestion (adaptec1 is used)

              Average routing tile usage          Maximum routing tile usage
              base    w50     d50     p400        base    w50     d50     p400
x, die 0      0.110   0.110   0.110   0.110       0.345   0.334   0.361   0.345
y, die 0      0.091   0.090   0.090   0.090       0.275   0.272   0.275   0.275
z, die 0:1    0.227   0.232   0.087   0.236       1.000   1.000   1.000   1.000
x, die 1      0.141   0.139   0.139   0.139       0.421   0.399   0.431   0.402
y, die 1      0.116   0.115   0.115   0.115       0.305   0.297   0.302   0.302
z, die 1:2    0.325   0.342   0.130   0.349       1.000   1.000   1.000   1.000
x, die 2      0.140   0.138   0.138   0.138       0.423   0.426   0.434   0.429
y, die 2      0.115   0.113   0.113   0.113       0.332   0.321   0.321   0.321
z, die 2:3    0.103   0.111   0.042   0.113       1.000   1.000   1.000   1.000
x, die 3      0.102   0.102   0.102   0.102       0.302   0.297   0.302   0.302
y, die 3      0.083   0.083   0.083   0.083       0.248   0.237   0.235   0.237

Table 6.11 Impact of micro-fluidic channel geometries on temperature

                        base     w50      d50      p400
Ave die power (W)       13.86    13.86    13.86    13.86
Ave fluid temp. (°C)    21.21    25.57    25.47    22.39
Max fluid temp. (°C)    23.73    35.76    36.75    26.46
Ave wall temp. (°C)     30.92    32.98    33.47    41.64
Max wall temp. (°C)     53.44    55.33    62.76    64.51
σ wall temp. (°C)       4.24     5.23     5.95     6.12
Pump power (W)          0.380    0.085    0.085    0.207
Table 6.11 shows the thermal analysis results with the micro-fluidic channel variations. The average power consumption of the chip was kept the same for all cases. Compared to base, in the w50 case the working fluid temperature becomes higher because of the reduced mass flow rate caused by the smaller channel cross-sectional area. In the d50 case, the maximum wall temperature is higher than in the w50 case because of the reduced cooling efficiency due to the thinner channel walls. In the p400 case, although the coolant temperature is not much higher than in base, the maximum silicon wall temperature increases due to the reduced contact area for heat transfer between the channel wall and the working fluid.
6.7.3 Power Noise Analysis

The gate oxide thickness was set to 1 nm for the decap size calculation. The inductance and resistance of the package pins were assumed to be 0.3 nH and 3 mΩ, respectively. In order to determine the decap area ratio at each grid point, we calculated the unused silicon area ratio as follows:
$$S_u(x,y,z) = 1 - \mathrm{CO}(x,y,z) - used_z(x,y,z)\,\frac{S_{Vs}}{S_{r\_tile}}. \qquad (6.13)$$
Thus,

$$ratio_{decap}(x,y,z) = S_u(x,y,z) \times 0.8. \qquad (6.14)$$
The average of $ratio_{decap}$ over the entire stack was found to be around 36.9%. As a worst-case scenario, we assumed that the entire chip is turned on at once, and the rise time of the current profile was set to 5 ns. Since running the power noise simulation on the entire grid would take too much computational time and memory, we focused on the specific area where many hotspots are clustered. In this 2,000 × 2,000 μm area, the current demand is higher than in other areas, which leads to a higher power noise level. We gathered the largest droop of the power voltage for each grid point, and the periphery of the resulting peak noise map was trimmed to neglect the boundary interaction with the outer area. Figures 6.16 and 6.17 show the power noise level and the voltage waveform at the grid point with the maximum peak noise on the top die of newblue5. Only the top-die result is shown because it is the farthest from the power supply and thus tends to show the highest power noise level.
Fig. 6.16 Peak power noise on the top layer of newblue5 (unit: V)

Fig. 6.17 Waveform (noise in mV vs. time in ns) of the grid point with the maximum peak noise in newblue5
Table 6.12 Summary of power noise simulation results: average and maximum peak power noise in mV

       adaptec1   newblue1   newblue3   adaptec5   newblue5
Ave    174.2      163.2      164.5      91.6       113.1
Max    210.5      181.4      190.9      115.3      127.2
Some parts show greater power noise due to higher power demand. The fluctuation of the voltage is mainly due to the parasitic inductance of the package and the capacitance of the wires. Table 6.12 summarizes the power noise simulation results. The peak power noise values are quite high for all circuits. The maximum peak power noise location was found to correlate with the maximum power consumption location, because we used the current profile of simultaneous on-switching.
6.7.4 Correlations Among Knobs and Metrics

In our DOE-related experiments, we used the benchmark circuit adaptec1 from Nam [24], which has about 210K gates and nets. Table 6.13 shows the various technology parameters that we assumed. Figure 6.18 compares the predicted and the actual total wirelength; the predicted values match the actual data very well, with an R² of 0.998. Figure 6.19 shows the response surfaces of all metrics. For each metric, we show the two most influential knobs. For the total wirelength, total signal TSVs, and routing congestion metrics, the most significant knobs are the MFC depth and width. For power noise, the most influential input factors are the P/G TSV pitch and the P/G thin wire ratio; the P/G TSV diameter did not affect power noise as much. When the MFC depth or width is increased, the total wirelength increases, the total number of signal TSVs decreases, and the congestion metric gets higher. This is because fewer z-direction connections are available and some routes take detours. Increasing the MFC depth increases the die thickness; with a fixed signal TSV aspect ratio, the diameter of the signal TSVs increases, which in turn decreases the z-direction routing capacities for the entire stack. Wider MFCs decrease the available space for signal TSVs. As suggested by the routing congestion curve, with MFCs that are large in cross-sectional area (= MFC depth × width), routing failed due to insufficient z-direction connections. As the cross-sectional area of the MFCs gets smaller, the z-direction routing capacity increases, but the fluid and Si wall temperatures get higher. An overly narrow MFC prevents the mass flow rate from reaching the needed values due to friction loss. The mass flow rate of each MFC is affected by the geometry of the MFCs and the pressure drop. The pump power depends on the fluid temperature and the pressure drop as well as on the MFC geometry. Increasing the MFC pitch eases the signal routing, but it also decreases the heat removal capability. A significant temperature difference on a Si die may incur mechanical stress and degrade system stability. Moreover, hotspot coverage gets worse.
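The chapter builds its response surfaces with MATLAB's Model-Based Calibration Toolbox (Fig. 6.18 suggests an RBF model). As a rough, generic illustration of response surface fitting, the Python sketch below fits a quadratic polynomial in two knobs by least squares and reports R²; the model form and the toy data arrays are assumptions and do not reproduce the chapter's surfaces.

```python
import numpy as np

def quad_features(x1, x2):
    """Full quadratic basis in two knobs (e.g., MFC depth and width)."""
    return np.column_stack([np.ones_like(x1), x1, x2, x1 * x2, x1**2, x2**2])

def fit_response_surface(x1, x2, y):
    X = quad_features(x1, x2)
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    y_hat = X @ beta
    ss_res = np.sum((y - y_hat) ** 2)
    ss_tot = np.sum((y - y.mean()) ** 2)
    return beta, 1.0 - ss_res / ss_tot   # coefficients and R^2

# Toy data: a handful of (depth, width) design points and a made-up response
depth = np.array([50., 80., 110., 140., 170., 200., 60., 190.])
width = np.array([50., 120., 200., 70., 150., 60., 180., 100.])
resp = 3.8e8 + 1e5 * depth + 2e5 * width + 300.0 * depth * width
beta, r2 = fit_response_surface(depth, width, resp)
print(r2)   # close to 1.0, since the toy response is itself quadratic
```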
Table 6.13 Various technology and setting parameters used in our DOE experiments for adaptec1

Item                               Value
Chip size (μm)                     6000 × 6000
Number of dies                     4
Bonding type                       face-to-back
Si layer thickness (μm)            50
Bonding layer thickness (μm)       10
TSV aspect ratio                   15:1
Routing grid size (μm)             50
Gate oxide thickness (nm)          1
Inductance of package pins (nH)    0.3
Resistance of package pins (mΩ)    3
Fig. 6.18 Comparison of total wirelength (in 10⁸ μm) between predicted and actual values. The crosshair marks represent RBF centers
Making the P/G TSV diameter larger helped decrease the power noise level, but it did not always decrease the overall Cost, because it also increased the total wirelength due to routing congestion. Since the distance from a P/G TSV to the current sink is related to the P/G noise level [25], a smaller P/G pitch led to smaller P/G noise. Since the P/G thin wires have large resistance values, the P/G thin wire ratio affects the P/G noise level significantly. The decap size also affects the power noise, and it depends on the area not used by placed gates or TSVs.
6.7.5 Optimization Results and Comparison

We compare the following five cases:
• Baseline: This is based on a typical setting of the design knobs.
• Gradient-only: Starting from 8 initial solutions (= knob settings), we perform gradient search to derive a new solution, run an experiment with the new solution, and check the Cost. We ran 14 iterations for each initial solution.
Fig. 6.19 Response surfaces: total wirelength (μm), total number of signal TSVs, routing congestion, maximum Si wall temperature (°C), pump power (W), and maximum power noise (mV), each plotted against its two most significant knobs (MFC depth and width, MFC pitch and depth, or P/G TSV pitch and P/G thin wire ratio). Only the two significant axes are shown. The fluidic temperature and standard deviation of Si wall temperature curves, which are similar to the Si wall temperature curve, are not shown for simplicity
• DOE-predicted: This is the optimal design according to the model prediction.
• DOE-actual: These are the actual results obtained by running the experiment with the optimal setting of DOE-predicted. Comparing DOE-predicted and DOE-actual reveals the accuracy of the DOE prediction model.
• DOE+gradient: After the DOE-based optimization, we chose the 3 solutions with minimum Cost and executed gradient search, running 6 iterations per solution.
Table 6.14 shows the knob settings obtained for the five cases; we show the best setting for each case. We note that in gradient-only, the MFC width and pitch are somewhat larger than in the DOE-based results, which suggests that the optimization is not efficient. Table 6.15 compares the design results. Compared to the baseline, the gradient-only method found a better solution with a lower Cost. Since there was no constraint on the pump power, the pressure drop increased to its maximum to decrease the Si wall temperature.
Table 6.14 Comparison of design settings

                         Baseline   Gradient-only   DOE-predicted   DOE-actual   DOE+gradient
MFC depth (μm)           100        110             107             107          106
MFC width (μm)           100        200             50              50           50
MFC pitch (μm)           200        800             200             200          200
Pressure drop (kPa)      140        180             180             180          180
P/G TSV diameter (μm)    40         60              40              40           40
P/G TSV pitch (μm)       400        400             400             400          400
P/G thin wire ratio      0.5        0.8             0.8             0.8          0.8

Table 6.15 Comparison of design results. Percentages are relative to the baseline. In calculating the total design time, we assumed that 24 computing nodes are available

                              Baseline      Gradient-only          DOE-predicted          DOE-actual             DOE+gradient
Total signal wirelength (μm)  379,359,200   378,501,700 (99.77%)   377,121,127 (99.41%)   377,075,650 (99.40%)   377,024,700 (99.38%)
Total # signal TSVs           39,955        41,498 (103.86%)       42,313 (105.90%)       42,823 (107.18%)       42,926 (107.44%)
Routing congestion            0.72          0.63 (87.50%)          0.62 (86.11%)          0.56 (77.78%)          0.55 (76.39%)
Max. fluid temp. (°C)         21.82         21.16 (96.98%)         25.38 (116.32%)        25.75 (118.01%)        25.82 (118.33%)
Max. Si wall temp. (°C)       43.93         43.5 (99.02%)          40.87 (93.03%)         41.35 (94.13%)         41.48 (94.42%)
σ Si wall temp. (°C)          2.963         2.758 (93.08%)         2.627 (88.66%)         2.846 (96.05%)         2.867 (96.76%)
Pump power (W)                1.5002        1.7918 (119.44%)       0.5718 (38.11%)        0.5756 (38.37%)        0.5684 (37.89%)
Max. power noise (mV)         70.3743       54.4782 (77.41%)       54.4606 (77.39%)       54.6174 (77.61%)       54.6259 (77.62%)
Cost                          0.28684       0.22903 (79.85%)       0.21343 (74.41%)       0.21771 (75.90%)       0.21769 (75.89%)
Total design time (hours)     1             14                     6                      7                      12
The P/G thin wire ratio also increased to its maximum because it decreased the power noise and did not exacerbate routing congestion much. Comparing DOE-predicted and DOE-actual, we see that the DOE prediction was quite accurate on all metrics except routing congestion and the standard deviation of the Si wall temperature. The routing congestion model has around 10% error for this design point, but note that congestion is not included in Cost, and the routability can be checked by running the actual routing. The standard deviation of the Si wall temperature also incurs error in the Cost prediction, which motivates a local search after the DOE-based optimization. Our DOE outperforms gradient-only by 4.94% at a fraction of the runtime. We also note that the post-DOE optimization with gradient search did not help much:
only the MFC depth was decreased by 1. This shows that DOE alone can generate high-quality design solutions. One strength of DOE-based optimization is that the experiments for all chosen design points can be run simultaneously if enough computing resources are available, whereas gradient-search-based optimization is sequential and takes a long time. Another benefit is that when gradient search is used after DOE, few iterations are needed if the model accuracy is reasonably good.
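To make the sequential nature of the post-DOE refinement concrete, the sketch below shows a greedy coordinate search over the seven knobs of Table 6.14, starting from the DOE-selected point. It is a minimal illustration, not the authors' implementation: the step sizes and the toy surrogate in `evaluate_cost` are invented for the example, whereas in the real flow each probe requires a full (expensive) thermal, power-noise, and routing evaluation, which is exactly why the search runs sequentially.

```cpp
#include <array>
#include <initializer_list>
#include <iostream>

// One design point: the seven knobs of Table 6.14.
struct Knobs {
  double mfc_depth, mfc_width, mfc_pitch, pressure_drop;
  double tsv_diameter, tsv_pitch, thin_wire_ratio;
};

// Stand-in for the expensive evaluation (CFD + power-noise + routing).
// A toy analytic surrogate keeps the sketch runnable.
double evaluate_cost(const Knobs& k) {
  double thermal = 1.0 / (k.mfc_width * k.pressure_drop * 1e-5 + 1.0);
  double noise   = 1.0 / (k.tsv_diameter * k.thin_wire_ratio + 1.0);
  double routing = k.tsv_diameter / k.tsv_pitch;
  return thermal + noise + routing;   // the real flow uses a weighted cost
}

// Access the i-th knob of a design point by index.
double& knob(Knobs& k, int i) {
  double* p[7] = {&k.mfc_depth, &k.mfc_width, &k.mfc_pitch, &k.pressure_drop,
                  &k.tsv_diameter, &k.tsv_pitch, &k.thin_wire_ratio};
  return *p[i];
}

int main() {
  Knobs best{107, 50, 200, 180, 40, 400, 0.8};        // DOE-selected start
  double best_cost = evaluate_cost(best);
  const std::array<double, 7> step{1, 10, 50, 20, 10, 100, 0.1};  // illustrative

  for (int iter = 0; iter < 6; ++iter) {               // 6 iterations, as in the text
    bool improved = false;
    for (int i = 0; i < 7; ++i) {
      for (double dir : {-1.0, +1.0}) {                // probe both directions
        Knobs probe = best;
        knob(probe, i) += dir * step[i];
        double c = evaluate_cost(probe);               // one full analysis per probe
        if (c < best_cost) { best = probe; best_cost = c; improved = true; }
      }
    }
    if (!improved) break;                              // local minimum reached
  }
  std::cout << "refined cost: " << best_cost << "\n";
}
```

Each inner probe corresponds to one complete design evaluation, so 6 iterations over 7 knobs already cost dozens of runs – in contrast to the DOE runs, which can all be launched in parallel.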
6.8 Conclusions

In this chapter, we studied routing with the following multifunctional interconnects in 3D ICs: signal, thermal, and power distribution networks. We learned how to consider the various physical, electrical, and thermomechanical requirements of these multifunctional interconnects to successfully complete routing while addressing various reliability concerns. Our studies revealed that liquid cooling based on micro-fluidic channels is highly effective in removing hotspots in 3D designs. We also learned that the P/G distribution network for a 3D IC places a high demand on TSVs and interferes with signal net routing. The major signal net routing bottleneck was related to TSVs, since signal TSVs compete with P/G TSVs and micro-fluidic channels for vertical routing resources.
Chapter 7
PathFinding and TechTuning Dragomir Milojevic, Ravi Varadarajan, Dirk Seynhaeve, and Pol Marchal
This chapter discusses the implications of 3D integration technology for design methodologies, flows, and the associated tools. Experience from advanced 2D technologies is extrapolated and combined with the incremental challenges posed by 3D technologies, and the requirements for a 3D design ecosystem are distilled. The chapter is organized in five sections. In the first section, we define the overall requirements for the 3D design ecosystem and identify the need for two incremental design methodologies in addition to the traditional design authoring flow. The second section describes one of these incremental design methodologies, named PathFinding, and the third section discusses the other, named TechTuning. In the fourth section we present a practical application of the proposed design methodology and the associated tool chain. Section 7.5 gives a brief summary and a few concluding remarks.
7.1 Definition of Requirements for 3D Design Ecosystem

This section discusses the implications of 3D technology on the design flow and defines the requirements for an ecosystem, based on lessons learned from 2D experience.
7.1.1 2D Design Experiences

It is by now a well-established and acknowledged fact that the cost of design – for conventional 2D designs targeted at advanced technology nodes – is of the order of tens of millions of US dollars. See, for example, Fig. 7.1, which indicates a cost of about $35M for designs targeted at the 45 nm node.
Fig. 7.1 Cost of conventional 2D design: design/test/verification, embedded software, and mask costs (in millions of dollars) per technology node, from 250 nm down to 32 nm. Source: Gartner (April 2008)
The exact estimates of this cost vary by as much as a factor of 2×, and clearly depend on the nature of the design, the target technology, the experience of the design team, the accounting methodology, and many other factors. However, it is widely accepted in the industry that a significant portion of this cost comes from the fact that, in practice, design methodologies are iterative, with multiple design re-dos (iterations prior to tape out) and re-spins (iterations post tape out) taking place during a typical design project. Multiple studies have been conducted to analyze the frequency and cause of the re-dos and re-spins, but these can broadly be categorized as falling into two groups:

1. Instability in the system design specifications that serve as inputs to the Physical Design flow.
2. Instability in the target process technology that serves as the design enablement constraint for the Physical Design flow.

This is the "conventional wisdom" derived from many years of experience with conventional 2D designs. There is no such experience base with 3D designs, but it is clear that the degrees of freedom offered by 3D technologies will only increase the instability in both the design specifications and the design enablement, and can therefore be expected to significantly drive up the cost of design. This is not good.
7.1.2 Incremental Causes for Design Specification Instability in 3D

In recent years, the industry has been putting a very strong focus on 3D-SIC technology, as it provides numerous opportunities for making better digital systems. Many of
these advantages have already been pointed out in the literature [1, 2]. First, the number of functions in the system can be extended beyond the near-term capabilities of traditional 2D scaling following Moore's law: by adding another dimension to the construction of the physical circuit we can increase the integration density (basically more gates for the same circuit footprint). Secondly, 3D technology can alleviate interconnect performance limitations, since the circuit can be designed so that the different functional blocks are closer to each other, resulting in fewer and shorter wires [3, 4]. This has a direct impact on wire length (and consequently on delays), allowing higher operating frequencies and higher bandwidths. Fewer and shorter wires also mean lower total parasitic capacitance and inductance, resulting in lower power dissipation. A 3D interconnect can also replace lateral wires of tens and perhaps hundreds of microns, thereby significantly reducing interconnect RC delay and the related buffering cost. Finally, 3D-SIC technology supports heterogeneous integration of components in dedicated technologies, embedding not only traditional digital circuits such as processors and memories, but also analogue circuits such as sensors, antennas, and power supplies [5]. By using dedicated technologies for the various functions, performance/power/cost metrics can be significantly improved.

However, the advantages of 3D technology do not come without drawbacks. It is now commonly accepted that, besides the classical system-level design challenges that also apply to traditional 2D-ICs, 3D technology by definition opens up multiple new degrees of freedom – over and above those that normally have to be considered – that need to be narrowed down and frozen. Furthermore, these degrees of freedom need to be narrowed down during high-level system design, as they impact the fundamental trade-offs that have to be made at this stage. Some examples of the specification trade-offs that need to be considered during system design include the following:

Target stack definition – Within the traditional 2D paradigm, integration of subsystem components implemented in different (optimal) technologies takes place at the board or MCM level, and comes at the performance and power "price" of off-chip communication. With 3D technologies, the communication between subsystems implemented in different technologies is potentially "free," and the system partitioning across a number of tiers (vs. a number of chips) needs to be re-optimized. So, the number (two or more die…) and nature (digital, analog, memory, MEMS…) of the tiers in a 3D stack need to be defined.

Target Si technology selection – Within the traditional 2D paradigm, the target technology selection is driven mostly by cost-performance trade-offs that are fundamentally one-dimensional and predictable, and can be made by simple extrapolation of past experience. Higher critical performance or integration needs basically dictate the selection of a more advanced technology node for the entire chip. 3D technologies offer an opportunity to closely couple subsystems implemented in different technologies, making the trade-off analyses multidimensional and requiring that power and cost analyses be performed for multiple combinations of performance constraints for the various subsystems (tiers). So, the traditional 2D cost-performance trade-off analysis for each tier needs to be conducted in the context of the entire stack.
Target stacking technology selection – The selection of the optimum assembly technology (wire bond, flip-chip bump, peripheral or area bump, etc.) is complex enough for a 2D die. With 3D technology, a trade-off amongst different interconnect technologies (Via-First, Via-Middle, or Via-Last), stacking orientations (Face-to-Face, Face-to-Back), and bonding technologies (Cu-to-Cu, glue, µ-bump…) must all be analyzed in addition. Since these choices impact system performance and cost, this selection must also be explored during the system design stage. And so on…

Note that all the incremental degrees of freedom offered by 3D technologies affect the performance, power, and cost attributes of the system, thereby necessitating exploration of the trade-offs during high-level system design. However, the trade-offs are made in the physical domain, i.e., the electrical and cost characteristics of a given 3D technology selection are dictated by physical characteristics (wire length distribution, RLC characteristics). Therefore, in order to be meaningful, the system-level design space exploration must be aware of the physical characteristics of the various 3D technology options. Note that the alternatives to this physically aware design space exploration at the system level are either suboptimal chip designs or design re-dos – neither of which is desirable.
7.1.3 Incremental Causes for Design Enablement Instability in 3D

The need for chip and package co-design is acute enough with conventional 2D ICs, but it becomes especially important with 3D technologies. Furthermore, the thermal and mechanical considerations that have traditionally been addressed only at the package or board level of design must be included in the stack and chip design stage. Thus, the thermal and mechanical considerations which have traditionally been handled only through relatively simple constraints during chip design (max power, max die size, etc.) become real design parameters that must be analyzed and traded off against other chip attributes. As such, these factors potentially need to be included in the Design Enablement kit (which traditionally included only electrical models and physical design rules). Some examples of the design enablement factors – incremental to the physical design rules and SPICE models – that need to be considered during chip design include:

Thermal factors – Since 3D technology allows intimate integration of multiple die within a single package, the thermal interaction amongst the die must be considered and managed through the design of each of the die. Power dissipated on one die will affect the temperature – and hence the power and performance – of the other die in the stack. Therefore, in order to meet the constraints, the performance simulations, physical floorplanning, and power management need to be conducted in the context of the entire stack, rather than in isolation one die at a time. In order to enable this thermally aware design, the
thermal interactions and the consequent junction temperatures need to be quantified. The heat flow across the stack, under various use conditions, must therefore be modeled, and the results must be included in the design enablement kit – either in the form of design rules or of simulation models.

Mechanical stress factors – Traditionally, designing around mechanical stress considerations has been alien to normal chip design practice – it was something considered only at the package design level. However, at very advanced 2D technology nodes, stress is purposely introduced in the Si process to boost device performance, and the distribution of mechanical stress is considered one of the sources of layout-driven performance variability. Hence, novel methods and/or tools are being introduced into the design flow to address stress variability. With 3D technology there are a number of incremental interactions, which can be expected to introduce new and different sources of stress. These may affect the mechanical integrity of the stack, with consequences for manufacturability and yield. Furthermore, the incremental sources of stress may also perturb the intended stress distribution from strain boosters, with consequences for device performance. The stress distribution for various stacking configurations must therefore be modeled, and the results must be included in the design enablement kit – either in the form of design rules or of simulation models.

Traditionally, process considerations are imported into a design flow through the use of a "design enablement kit," which contains physical design rules and their associated checkers, together with models and coefficients that represent the electrical behavior of the devices. The electrical models are specific attributes of a given process technology and are calibrated based on extensive device characterization. Since both the thermal and the stress considerations are expected to impact the design of the individual die in a stack, they too must be imported into the design flow and included in the design enablement kit. The thermal and mechanical models and/or rules required for the design enablement kit are clearly an attribute of the various materials used and of the stacking process details. Hence, these models and rules will have to be calibrated for a given 3D technology, and new characterization practices, model extraction methodologies, and simulation tools will have to be deployed. Note that the alternative to the inclusion of thermal and mechanical models in a design enablement kit is to discover these effects, and their interactions with chip manufacturability, cost, and/or performance, in actual Si, causing an increased re-spin rate. This would be unacceptable and could, in practice, negate all of the potential advantages of 3D technology.
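To make the "heat flow across the stack" requirement above a little more tangible, a first-order compact model (an illustrative textbook approximation, not the model used by the authors) treats the stack as a one-dimensional thermal resistance network. With tiers numbered from 1 (farthest from the heat sink) to N (adjacent to it), tier powers $P_j$, and layer thermal resistances $R_i$, the temperature rise of tier $n$ above ambient is approximately

\[
\Delta T_n \;\approx\; \sum_{i=n}^{N} R_i \sum_{j=1}^{i} P_j, \qquad R_i = \frac{t_i}{\kappa_i A},
\]

where $t_i$ and $\kappa_i$ are the thickness and thermal conductivity of layer $i$ and $A$ is the die area; lateral spreading and interface resistances are ignored. Even this crude model shows that the power dissipated on any one die raises the junction temperature of every die in the stack, since all tiers share the path to the heat sink – which is exactly why floorplanning and power management must be done stack-wide.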
7.1.4 3D Design Ecosystem

Based on the discussion and considerations summarized above, the design ecosystem for 3D technology is viewed as a collection of three different but interdependent methodologies. This is illustrated conceptually in Fig. 7.2.
Fig. 7.2 3D design ecosystem: system goals and objectives feed PathFinding, which delivers specifications to Design Authoring; process and material properties feed TechTuning, which delivers the design enablement to Design Authoring; Design Authoring produces the 3D products
As indicated, iterations in design specification and design enablement – expected to be particularly frequent with 3D technologies – are undesirable and drive up the cost of design. Consequently, two incremental design methodologies, optimized for easy and rapid iterations, are introduced into the ecosystem with the intent of minimizing iterations in the actual design authoring flow. PathFinding is a design methodology intended to produce clean and stable specification inputs to Design Authoring. TechTuning is a simulation technology intended to produce an enriched design enablement kit for Design Authoring.

Design Authoring is equivalent to the traditional 2D design flow and includes the entire physical design flow, from RTL to GDS, together with all the simulation and verification tools and practices. The intent of the design authoring flow is to output designs that are expected to produce functional and yielding silicon (Si) die. The Design Authoring flow is very complex and comprises a family of very advanced EDA tools that have been optimized to address all the details required to output functional Si die. Consequently, most of the cost of design is associated with the design authoring flow, and iterations here are very undesirable.

PathFinding is a novel design methodology that allows physically aware system-level design space exploration. Hence, it is a flow that is optimized for flexibility and rapid, low-cost iterations – potentially at the expense of accuracy. In order to incorporate realistic physical awareness into design space exploration, and to comply with the requirements for flexibility, PathFinding is based on virtual design practices and tools. It is described in the next section of this chapter. TechTuning is a simulation environment intended to enable the inclusion of thermal and mechanical considerations in the design enablement kit. This is a novel practice in the chip design environment, analogous to SPICE simulations for the electrical domain. It is described in Sect. 7.3 of this chapter.
7.2 PathFinding Design Methodology

In this section, we describe the PathFinding design methodology and the current instance of the EDA tool chain that has been used to demonstrate the advantages of the proposed approach. It enables designers to trade off different choices and options, typically:

• System-level design choices – e.g., different functional partitioning schemes; data path widths; various micro-architectures of the computational nodes; communication infrastructure types (buses, NoCs), their topologies and configurations; etc.
• Physical design options – e.g., die orientation (face-to-face, face-to-back); presence of RDL layers and how many layers are actually required to make a given design routable; power distribution; packaging strategies; etc.
• Technology options – e.g., different technology nodes (corners); different through-silicon via sizes and pitches; type and characteristics of microbumps; size of the keep-out areas; etc.

Obviously, the designer would like to understand the impact of all these design options (and this as early as possible in the design flow) not only on typical design parameters such as cost, performance, and power, but also on the thermal and mechanical aspects of the 3D-SIC. In the beginning of this section, we give a short overview of the PathFinding design flow and explain the rationale behind this approach, present our vision of the PathFinding design methodology, and analyze the requirements for a practical tool chain. We then describe in more detail the different steps of the PathFinding design methodology and the current tool chain instance.
7.2.1 Introduction to PathFinding

7.2.1.1 Motivation

We have already mentioned the advantages of 3D-SICs and the new challenges that need to be addressed from a design perspective to enable effective design of 3D products. To understand to what extent we need to change our current vision of the design flow, let us look at a typical front-end/back-end decomposition of the standard ASIC design flow (the front-end corresponding to the system level and the back-end to design authoring), illustrated in Fig. 7.3. First, the product specifications are analyzed at system level. In this step, and based on the product specification (probably made using some high-level specifications), important global architectural options are fixed, e.g., the number and type of processing units (general purpose, dedicated hardware block, third-party IP, etc.), memory nodes, etc. Note that at this stage these design options are fixed without any knowledge of the physical view of the system.
Fig. 7.3 Traditional design flow (front end: specs, system-level design, logic design, synthesis, place and route; back end: synthesis, place and route, layout, configuration, test and verification), with iteration loops back to earlier steps
Once established, the system-level specifications are passed to the logic design teams, which are responsible for providing a full functional specification of every single component in the system. While for most products the RTL already exists for many components, the time required to develop and validate a new component depends, of course, on the component complexity (and the logic design team's skills), but is typically counted in man-months. Once a functionally correct system is in place, the specification of the whole system is passed to the back-end team, responsible for the actual physical implementation of the circuit. In order to meet the stringent requirements of a brand new product, a painstaking process of iterations inside the back-end, and sometimes back to the front-end flow, will probably take place. These iterations are rarely wanted because of the time, resources, and risks involved (dashed lines in Fig. 7.3).

Today, no practical solution exists to assess physical design during the system-level exploration phase. Current practices for 2D design rely only on spreadsheets containing data on the area/performance/power estimates of individual IP components (potentially for various technologies). This information is insufficient for 3D-SICs, where the actual physical layout of the stack and the selected 3D technology determine the performance and the cost of the design. The alternative, which is to perform a full trial design using standard EDA tools and derive the design characteristics, is not workable. A full design involves multiple teams and expensive tools, and worst of all it takes time, a lot of time. As a rule of thumb, one can count 3–5 days per trial iteration between front end and back end to get visibility into the physical and timing aspects of the produced RTL. Also, using the full design flow assumes the availability of accurate design inputs for all components in the system (preferably gate-level netlists and macros) and technology inputs for all used process technologies. These data are typically only partially available during the system-level design exploration phase.
7.2.1.2 PathFinding: The Vision

PathFinding acts as a bridge between system-level design, chip design, and process technology, enabling exploration of the design sweet spot by taking all of these domains into account in parallel, in an iterative and progressive manner. As pointed out earlier, the main objective of PathFinding is to produce a clean specification for both the architecture and the technology, by allowing design iterations earlier in the design cycle. While in the following particular attention is given to the design of 3D-SICs, the whole flow can of course be applied to 2D designs as well. Compared to the traditional flow illustrated previously, PathFinding appears more like an extension of the existing flow than a radically new approach. In the PathFinding vision, the system-level design and logic design steps are replaced by the following three design phases (referred to as steps in the following):

1. 3D system architecture
2. 3D physical prototyping
3. 3D design authoring

This is illustrated in Fig. 7.4. By moving the design iterations up front, ahead of the costly (3D) design authoring (i.e., the RTL-to-GDS flow), PathFinding provides a way of cheaply exploring many different trade-offs at different abstraction levels, so that the specs become reasonably firm and stable. Note that the design iterations, i.e., the feedback loops in the flow, are now desirable (represented as solid lines in Fig. 7.4).
Fig. 7.4 PathFinding design methodology: 3D system architecture (step 1), 3D physical prototyping (step 2), and 3D design authoring (step 3), each taking technology and other design parameters as inputs, with RTL elaboration between the steps and figures-of-merit (FoM) fed back at every step
These feedback loops indicate that the physical information from the lower abstraction levels is brought to the upper abstraction levels in the flow and can influence the system specification. Different feedback loops can occur in the flow; we have identified the following:

Loops at RTL elaboration level – Using this feedback, the designer can fine-tune the micro-architecture of a given component described in some high-level description language (C/C++/SystemC). Typically, he can explore different pipelining strategies, try to unroll different loops, merge or split memory arrays, etc. Eventually, a 3D implementation of the component and/or technology options for this component can be considered.

Loops at physical prototyping level – Here the designer can try different technology assignments on a per-tier basis, different inter-tier connectivity schemes, physical partitioning schemes (i.e., stack organization), floorplans, buffer insertion, etc.

Loops at 3D system architecture level – If none of the previous loops helps in getting close to the system specification, some global architectural parameters need to be changed. At this stage the designer needs to modify essential design parameters, such as the number of processors, memories, interconnect topology, data path widths, etc., and re-run the whole flow.

To enable a practical implementation of the PathFinding methodology, it is mandatory that the different tools in the flow fulfill the following two prerequisites:

1. They need to allow rapid evolution in the flow, especially between the system architecture and physical prototyping phases. The designer wants to reach the level of the prototype as soon as possible and does not want to be delayed by the usual process of RTL development.
2. At each step, the different tools need to provide means to extract estimates of the vital design parameters – performance, power, cost, or any other metrics that might be of interest (referred to as figures-of-merit, FoM, in Fig. 7.4).

To fulfill these two (crucial) requirements, the PathFinding design methodology relies on two major system/physical design paradigms that have already been extensively covered in the literature:

• High-level synthesis
• Virtual prototyping

High-level synthesis (HLS) refers to the process of automated synthesis of register-transfer-level (RTL) models (typically in VHDL, Verilog, and/or synthesizable SystemC) from high-level descriptions such as standard software programming languages (C or C++) and/or high-level hardware modeling languages (such as untimed SystemC). Once generated, the RTL descriptions can be used to produce gate-level models using standard logic synthesis tools and target technology libraries. HLS is basically a method enabling automated generation of the physical views for physical design starting from high-level models. Obviously, the goal of HLS is to significantly reduce the design time generally associated with the logic design phase of the traditional design flow (see Fig. 7.3). Some references on HLS can be found in [6–8].
Virtual prototyping – To assess the benefits of 3D technology and 3D-SICs, physical design tools are needed. In the academic world, several solutions for 3D floorplanning and global routing have been proposed. Unfortunately, the proposed tools do not support placement and routing of components in heterogeneous technologies across different tiers, and as a result they only very roughly estimate the delay/power of both 2D and 3D tiers. Moreover, they have very limited support for partial design inputs [9–13], making them hard to use for PathFinding. With virtual prototyping [14, 15], the designer does not need to go through the complete standard design flow to access information at the physical level. EDA tools that provide virtual prototyping support can handle components described at different abstraction levels and are capable of performing vital design operations even on an incomplete design specification (for both the technology and the design). They dramatically reduce the turn-around time for placement and routing by trading off accuracy for speed.
7.2.1.3 Requirements for a Practical PathFinding Tool Chain

In order to make the PathFinding design methodology practically feasible and usable in the context of real-world IC product design, the associated tool chain needs to fulfill the following requirements:

1. It needs to be technologically aware – This is obvious. For both HLS and virtual prototyping, it is essential that information about the technology used for IC manufacturing is passed to and used in each of these steps; hence the technology input in Fig. 7.4.
2. It needs to be fast – The size of the design space for complex systems is enormous even for 2D-ICs. 3D-SIC technology increases this space even further, due to the large number of design parameters introduced at both the system and physical levels. In order to evaluate as many design space points as possible (a design space point being one particular instance of the system with a given set of options), the complete tool chain needs to be sufficiently fast to allow rapid assessment of the design parameters and to support many iterations and feedback loops in the flow.
3. The results need to be accurate – The second requirement is essential for quantitative exploration of the design space. However, the analysis of one particular point in the design space has to provide a set of vital design parameters as output. These parameters (FoM in Fig. 7.4) need to be constantly monitored during the refinement of the design, as one goes further down the flow, and they obviously need to exhibit some degree of accuracy. While absolute accuracy is less important at the beginning of the iterative process (when the designer copes with an incomplete system specification, for example), relative accuracy is vital: typically, the designer wants to compare many solutions and see how solution A compares with solution B, so as to make the right decision at this level of abstraction. Obviously, at this stage the absolute accuracy of the parameters will depend on the accuracy of
the different models in the system. As the iterative process progresses and the designer works with a more complete specification of the system (all components defined at RTL level, for example), the absolute accuracy of the observable parameters will increase.
4. The tool chain needs to be automated for iterations – In order to evaluate the impact of different design options at the system/physical/technology level on the observable design parameters, the designer needs to minimize manual intervention in the whole process. Once the design environment is in place, it must be easy to change one design option to a new value, or to define a whole range of values; in that case, the execution of the complete flow needs to be fully automatic.
5. The tool chain needs to be interoperable – Obviously many different tools will be used, since no integrated solution for the complete flow exists today. Clearly, standardized exchange of information is mandatory to ensure transparent transitions and design iterations (think of the link from the physical level to the thermal and mechanical analysis and package co-design, for example). While most of the required information is already captured in existing file structures (e.g., RTL, .LIB, .LEF, etc.), a small amount of 3D-SIC-specific information still needs to be captured. XML specifications, and the existing tools for XML file manipulation, are the best candidates for such representations.
7.2.2 PathFinding: Practical Tool Chain

In the following, we give a more detailed description of the three basic design steps defined by the PathFinding design methodology. With the requirements identified in the previous section in mind, we were able to make a concrete selection of existing EDA software to build a practical PathFinding tool. The different software packages have been integrated into one tool chain, with the necessary interfacing to allow transparent transitions between tools. The current design exploration environment allows a certain level of design automation, such as technology exploration and some system- and physical-parameter exploration. However, some of the steps are still manual (such as RTL partitioning); the automation of this (important) task is part of our current work in progress. A typical design environment setup is measured in hours, even for complex designs containing thousands of placeable objects and mixed abstraction-level descriptions. Generating new RTL for computation and communication elements of low to medium complexity using HLS is measured in days. One design cycle/iteration is very fast: a typical exploration of the design space parameters (such as one technology node or TSV property, or one system-level option like data path width) can be done in minutes.

7.2.2.1 3D System Architecture
Fig. 7.5 3D system architecture: the information available after this step comprises VHDL/Verilog, black boxes, C/C++/SystemC models, and constraints
The main goal of this step is to refine the system architecture based on the functional requirements for the design and its cost, taking into account the benefits of the 3D technology. Already at this stage, some first rough estimates of the vital design parameters can be made, depending on the abstraction level and the precision of the individual models: typically latency, area, and power estimates. At this very first step the designer focuses mostly on system-level architecture exploration, investigating typical 2D design issues such as functional decomposition, component selection (definition of the processing and memory components), and the design of the communication infrastructure. However, 3D-SIC technology brings new issues that have to be considered in this step, such as the definition of the number of tiers, the partitioning of functionality across the available tiers, and the technology choice on a per-tier basis. During this phase, physical design aspects are typically ignored. However, they may significantly impact the design parameters of the 3D-SIC; hence, the feedback loop from the physical prototyping step is needed to validate and refine the system-level decisions made at this stage.

After this first step, the designer holds a functional model of the system architecture, or at least what we could call a first iteration (instance) of the design architecture. For such an instance the designer typically verifies the assumptions and hypotheses made, by incorporating the physical design information. If an assumption has been made about communication infrastructure performance, for example (typically the bus/network-on-chip speed), it will heavily depend on the system as a whole and on the results obtained after placement and routing. The information database now contains the various component models at different abstraction levels and rough estimates of the different design parameters. The system-level view of the design defines the constraints for the actual synthesis of the components, which will be performed in the next step. The information that is available at this stage is summarized in Fig. 7.5.
7.2.2.2 RTL Elaboration

In order to bridge the gap between the 3D system architecture step, where component models are at higher abstraction levels (C/C++/untimed SystemC), and 3D physical prototyping, which uses models at lower abstraction levels for better accuracy, the designer needs to provide the actual RTL of each component.
Fig. 7.6 From functional to physical models: functional-level specifications (black box models with trace generation and perf/MHz, mW/MHz figures; RTL models in Verilog/VHDL; behavioural models in C, C++, or SystemC) are brought to the physical-level specification (black boxes, hierarchical blocks, hard blocks) through conversion/manual annotation, RTL synthesis, or high-level synthesis followed by RTL elaboration
Complex designs are never made from scratch: there is an enormous amount of legacy RTL code that is very often re-used for most new products, not to mention soft and/or hard, open source and/or third-party IPs. However, a new product will certainly incorporate a new functional block – otherwise it would not be a new product – and for that particular block the RTL needs to be developed and validated. Today, there are many different ways to achieve that. Within the PathFinding flow we first identified the different types of models that can exist at the various abstraction levels. We then identified the different paths and, more importantly, the means and tools that allow designers to move quickly from one abstraction level to another. This is summarized in Fig. 7.6.

First, let us look at the different models. There are two abstraction levels: the functional-level specification and the physical-level specification (white and gray filled boxes in Fig. 7.6, respectively). At the functional level we can distinguish databases that contain models at the following abstraction levels:

• Black box models – Here we can have anything that is usually used for system-level design: from formulas and spreadsheets to more complex models such as instruction set simulators, traffic generators, etc. These are typically not standardized and will require conversion for physical-level use.
• RTL models – This is a conventional database of existing RTL: legacy code and/or available third-party components provided as soft IPs.
• Behavioral models – These models sit somewhere between black box models and RTL models. They describe functionality using standard software programming languages or hardware modeling languages such as SystemC.

At the physical prototyping level we can also define three types of models, depending on the abstraction level used:

• Black boxes – are high-level descriptions and are technology independent (i.e., they do not require a specific technology database).
• Hierarchical blocks – are used to represent the logical hierarchy of the black boxes and hard blocks, and their connectivity. They can be technology dependent if they contain hard blocks.
• Hard blocks – are used to describe already existing (i.e., implemented) components at any abstraction level (from a standard cell to a more complex component). They are characterized as blocks with fixed size and fixed positions of the input/output ports, and they obviously contain precise technology information.

The RTL elaboration step is all about moving from the three kinds of functional-level models to the physical prototyping-level models in "no time." While some of the paths are obvious and already part of existing design flows (e.g., traditional synthesis of the RTL), others are much less explicit, such as those going from behavioral to RTL models. In the following, we summarize the possible paths from the functional to the physical-level specification:

• Annotation – Models can be created manually as high-level black boxes annotated with area, performance (timing), and power estimates (a minimal illustration of such an annotated model is sketched after this list). These estimates can be fixed arbitrarily for a new (unknown) component and/or sourced from the component or third-party IP data sheets, designer experience, etc. Of course, in the case of a known component, black boxes can be derived from gate-level synthesis of the RTL. This can be extremely useful to accelerate the physical prototyping process.
• RTL synthesis – Existing RTL models (legacy IP), or RTL that can easily be generated from specific environments (already mentioned in the previous paragraph), can be brought to the level of physical models using standard synthesis tools and the appropriate technology libraries.
• High-level synthesis – For this RTL elaboration path, specific HLS tools are used to generate the RTL, before the equivalent soft, hierarchical, and hard blocks are created using standard EDA tools. For example, the RTL of a general-purpose processor can be generated using templates [16] and memory blocks using third-party memory generators. For processing elements, dedicated HLS tools can be used to translate models described in high-level languages into any synthesizable RTL format (typically VHDL or Verilog). The same holds for communication infrastructure synthesis, whether for a bus or a NoC-based architecture. Because there is very little we can currently do about memory synthesis, in the following we concentrate only on processing elements and communication.
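As a concrete (purely hypothetical) illustration of the annotation path above, a black-box entry needs little more than an identifier, a footprint, and per-MHz performance/power figures, plus the tier it is assigned to. The field names and values below are invented for the example and do not correspond to the schema of any particular tool.

```cpp
#include <cstdio>
#include <string>
#include <vector>

// Illustrative annotated black-box record for physical prototyping.
struct BlackBoxModel {
  std::string name;         // component identifier, e.g. "dct_accel"
  double area_um2;          // estimated footprint
  double max_freq_mhz;      // timing estimate (achievable clock)
  double power_mw_per_mhz;  // dynamic power figure of merit
  double leakage_mw;        // static power
  int    tier;              // tier assignment in the 3D stack (-1 = unassigned)
};

// A design instance is then just a list of such records (plus connectivity),
// refined later as gate-level data becomes available.
std::vector<BlackBoxModel> design = {
  {"cpu_core",  1.2e6, 500, 0.25, 3.0, 0},
  {"sram_bank", 0.8e6, 600, 0.10, 1.5, 1},
};

int main() {
  for (const auto& b : design) {
    double total_mw = b.power_mw_per_mhz * b.max_freq_mhz + b.leakage_mw;
    std::printf("%s: tier %d, ~%.1f mW at %.0f MHz\n",
                b.name.c_str(), b.tier, total_mw, b.max_freq_mhz);
  }
}
```

Such a record is exactly what a spreadsheet-style 2D estimate already contains; the tier field is the minimal 3D-specific addition.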
High-Level Synthesis for Processing Elements

A typical design flow for HLS of the processing elements is presented in Fig. 7.7. Light gray boxes indicate the different steps of the HLS flow (the numbers indicate the typical order of operations), black filled boxes are the inputs, gray boxes are typical design steps/tools, and white filled boxes indicate generated outputs.

First (point 1), different technology libraries are characterized using a standard gate-level synthesis tool and a set of typical component RTL descriptions (adders, multipliers, muxes, etc.; these RTL descriptions can be provided by a third-party vendor, e.g., the DesignWare kit from Synopsys). During this technology characterization, cost parameters (such as the number of inputs, datapath width, max delay, latency, etc.) are extracted once and for all into a database called Component Models.

In the next step (point 2), the input source code (i.e., the processing element model described in C, C++, or untimed SystemC) is instrumented using synthesis directives (basically C code pragmas). These synthesis directives guide the synthesis tool in the process of micro-architecture generation. In other words, they can be seen as micro-architecture hints or guidelines that the designer can use to steer the tool and therefore influence the final RTL output (i.e., the component micro-architecture).
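The fragment below illustrates what such instrumentation typically looks like for a small filter kernel. The pragma names are generic placeholders invented for this sketch; every HLS tool (AutoPilot included) defines its own directive syntax, so this is an illustration of the idea rather than tool-specific code. An ordinary C++ compiler simply ignores the unknown pragmas, so the kernel can still be verified in software before synthesis.

```cpp
#include <cstdio>

// FIR filter kernel prepared for high-level synthesis.
// The pragmas are illustrative micro-architecture hints: they ask the tool
// to pipeline the outer loop and to fully unroll the inner multiply-accumulate
// so that all taps are evaluated in parallel.
const int TAPS = 8;

void fir(const int x[64], const int coeff[TAPS], int y[64]) {
  for (int n = TAPS - 1; n < 64; ++n) {
    #pragma HLS_PIPELINE II=1          // hint: accept one sample per clock
    int acc = 0;
    for (int t = 0; t < TAPS; ++t) {
      #pragma HLS_UNROLL               // hint: instantiate TAPS multipliers
      acc += coeff[t] * x[n - t];
    }
    y[n] = acc;
  }
}

int main() {                           // simple host-side check of the kernel
  int x[64], y[64] = {0}, c[TAPS];
  for (int i = 0; i < 64; ++i) x[i] = i;
  for (int t = 0; t < TAPS; ++t) c[t] = 1;
  fir(x, c, y);
  std::printf("y[63] = %d\n", y[63]);  // moving sum of the last TAPS samples
}
```

Changing only these hints (pipeline initiation interval, unroll factors) changes the generated micro-architecture – and hence area, latency, and power – without touching the functional source, which is what makes the RTL-elaboration feedback loop fast.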
Fig. 7.7 Design flow for HLS of the computational elements: (1) technology characterization of the technology libraries into Component Models; (2) source code instrumentation of the high-level models (C/C++, untimed SystemC); (3) RTL generation from the instrumented code, design constraints, and component models; (4) RTL synthesis into gate-level netlists and performance parameters; (5) physical prototyping
Finally (point 3), the HLS tool takes as inputs the instrumented source code, the design constraints (typically the desired operating frequency), and the component models library, and automatically generates the corresponding RTL, which can then be synthesized to a gate-level netlist using standard or virtual synthesis tools (point 4). After synthesis, the generated gate-level netlist can be used for physical prototyping (point 5). At this stage, it is possible to iterate through the HLS flow based on the results produced by the physical prototyping phase: depending on the design performance parameters, the designer can either adjust the constraints for the HLS tool and/or modify the micro-architecture by adapting the source code instrumentation. In this context, another type of feedback is also possible, in what we could call design-driven technology development: based on the information acquired during physical prototyping, the designer could either produce a new, virtual specification for the component model (in the Component Models database), or act directly on the properties of the standard cells in the technology file. Such an approach could be used to create application-specific (virtual) libraries for the prototyping process; obviously, these can then serve as inputs for technology vendors.

Many HLS tools have been proposed in both the academic and industrial worlds. One can mention the SPARC project and commercial solutions such as Catapult C from Mentor Graphics, PICO Express from Synfora, or AutoPilot from AutoESL. For RTL elaboration of the processing elements we use AutoPilot, a high-level synthesis tool available from AutoESL [17].

High-Level Synthesis for Communication

A similar HLS design flow can be defined for the generation of the communication infrastructure, although it can be simplified a great deal because of the particular functionality these tools are required to handle (they are less general than computation). Bus or NoC HLS tools can be used, depending on the complexity of the design and the communication requirements. The literature suggests that for complex MPSoCs with intense traffic demands, NoCs are best suited [18]. For NoCs, high-level models of the NoC primitives (network interfaces, links, and switches) can be used to prototype the NoC topology and the configuration of the NoC primitives. During this process, traffic information can be taken into account (using instruction set simulators and/or traffic generators), and based on an assumption about the NoC operating frequency one can derive latency and network occupancy. At this stage, however, there is no way to verify the hypothesis on the NoC operating frequency – which in the case of PathFinding can easily be obtained using physical prototyping. Many examples of academic and industrial bus/NoC tools can be cited: academic NoCs [19, 20] and commercial buses [21] and NoCs [22, 23]. In the context of this particular PathFinding tool chain instance we have been using the iNoCs tools [23].
7.2.2.3 3D Design Prototyping

After the RTL elaboration step, the designer holds a database of all the components in the system. This database is usable inside the virtual physical design prototyping environment (Fig. 7.8). Each system component has at least one model in one of the three databases. The choice of the abstraction level (black box, hierarchical, or hard block) for a given component is left to the designer, allowing him to trade off design parameter accuracy against iteration speed. Because in this phase not all models need to be at the lowest abstraction level (typically gate-level netlists), model manipulation time can be reduced significantly. By creating a complete design instance, it is now possible to explore the impact of alternative design/technology options on the design parameters, including the physical design effects. Hence, for a given system described at the functional level, different physical implementation scenarios can be compared in terms of design parameters. This step also provides the necessary information for the design authoring tools. At this stage, enough information about the system is gathered to link to existing thermal, mechanical, and eventually cost models.

For physical prototyping purposes we use Atrenta 1Team®-Implement [24]. Some of the key features of the standard tool version include mixed abstraction-level support of the component models; virtual RTL synthesis; automatic floorplanning and global placement; congestion and routability analysis; timing analysis and optimization, with tight correlation to third-party timing sign-off tools; persistent and customizable report generation; power estimation with various speed/accuracy options (from high-level to accurate power estimates); and easy hand-off to third-party back-end tools. To the standard tool version, additional features have been added so that a 3D-SIC can be virtually prototyped. These additions include support for TSV descriptions from native technology files (technology vendors' .LIB/.LEF files); the possibility to instantiate multiple tiers within the same design, using different technology nodes for each tier; automatic generation of the TSV arrays on a per-tier basis; and automatic routing to the TSV arrays and between tiers.
Fig. 7.8 3D physical prototyping: inputs/outputs include RTL (VHDL, Verilog), the 3D-SIC description (.xml), the floorplan (.def), and constraints (.sdc)
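To give a flavour of the incremental 3D information the prototyping environment has to capture on top of the usual .LIB/.LEF data, the sketch below models a stack as a list of per-tier records with their TSV parameters. The structure and field names are purely illustrative and do not reflect the actual 1Team-Implement extensions or the XML schema mentioned above.

```cpp
#include <cstdio>
#include <string>
#include <vector>

// Illustrative per-tier description for 3D virtual prototyping.
struct TierDesc {
  std::string name;           // e.g. "logic_tier"
  std::string tech_node;      // technology node assigned to this tier
  bool        face_to_face;   // stacking orientation w.r.t. the tier below
  double      tsv_diameter_um;
  double      tsv_pitch_um;
  double      keep_out_um;    // keep-out zone around each TSV
};

struct StackDesc {
  std::vector<TierDesc> tiers;  // bottom tier first
};

int main() {
  StackDesc stack{{
      {"logic_tier",  "45nm", false, 5.0, 40.0, 2.0},
      {"memory_tier", "65nm", true,  5.0, 40.0, 2.0},
  }};
  for (const auto& t : stack.tiers)
    std::printf("%s (%s): TSV %.1f um at %.1f um pitch, keep-out %.1f um\n",
                t.name.c_str(), t.tech_node.c_str(),
                t.tsv_diameter_um, t.tsv_pitch_um, t.keep_out_um);
}
```

Sweeping fields such as the TSV pitch or the per-tier technology node is exactly the kind of minutes-per-iteration exploration described for the practical tool chain.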
7.2.2.4 3D Design Authoring

In the previous phase, the many iterations over the system- and physical-level options have hopefully reduced the design space to a few viable design points that are well worth exploring with sign-off-quality EDA tools. Thanks to the physical awareness of the flow, the specs that have been set are firm and, above all, reasonable for the back-end team. With the complete system RTL database, the design floorplans (including the 3D partitioning and the functional assignment on a per-tier basis), and the design constraints, the setup time for the sign-off design environment should be shorter.
7.2.2.5 Extraction of Estimates

As stated earlier, at each step of the PathFinding flow some observable design parameters are output for a subsystem or for the whole system, so that the designer can control the decisions he makes. These are typically linked to the standard performance parameters such as area, delay (timing), and power. But 3D integration introduces the need for other parameters, such as thermal and mechanical ones. This is covered in the following section.
7.3 TechTuning

In this section we describe TechTuning, a design methodology intended to bring thermal and mechanical stress considerations into the design flow. Thermal and mechanical stress concerns are especially acute with TSV-based 3D technologies, and hence a methodology for addressing these issues during design is required. Traditionally, thermal and mechanical stress have been design parameters taken into account during package and PCB design only, and were typically alien to Si design practices; they are addressed purely through simple global constraints such as the maximum overall power dissipation and the maximum die size. However, with 3D integration technologies, where Si and package design considerations merge and a number of new thermo-mechanical interactions are enabled, a structured methodology for addressing thermal and mechanical stress parameters during Si and stack design is required. We begin this section by identifying the objectives for TechTuning. We then describe the implementation strategy and infrastructure for the TechTuning design methodology. Finally, we discuss our own compact thermal and mechanical models, which have been developed as standalone tools and coupled to the existing PathFinding flow described in the previous section.
7.3.1 TechTuning Objectives

The overall objective of the TechTuning methodology is to co-optimize the Si, 3D stack, and package design and process technologies in order to proactively produce 3D products that are manufacturable and reliable with respect to thermal and mechanical stress. Since the thermal and mechanical characteristics are a function of the design (feature sizes, thicknesses, etc.), the manufacturing processes (thermal history), and the fundamental material properties (thermal conductivity, coefficient of thermal expansion, Young's modulus, etc.), the TechTuning methodology must comprehend all these factors and should be used to assess the compatibility of a given set of materials and processes with a given set of design targets. That is, TechTuning should be a simulation environment that allows tuning of both the design and the process technology "knobs" in order to optimize a given 3D product.

Since thermo-mechanical considerations are new to Si design practices, it is expected that these design parameters will be incorporated gradually and gracefully into the design flow. An evolutionary approach is expected – in a manner similar and parallel to the one outlined above for the growth of the overall design methodology – going from the current 2D paradigm to a 2.5D paradigm and then to a full 3D paradigm. In this context, the principal limitation of the current practices used for in-package design is that the established methodologies are reactive and tend to treat the Si as a monolithic black box. That is, the current practices, mostly based on finite element analysis methodologies, are not compatible with real-time interfacing to Si design flows; hence, it is difficult to use the results of these analyses to affect actual Si or 3D stack design decisions. It is believed that a true design-for-thermo-mechanical methodology, where thermal and stress factors are included in the cost function algorithms used in chip and stack design, will eventually evolve and be incorporated in the design flow to produce "correct by construction" designs – but only over time. However, it is believed that an intermediate step between the current paradigm (rule-driven global corners) and a true design-for-thermo-mechanical paradigm (model-driven cost function) will evolve first. This intermediate methodology is called here TechTuning; it is focused on analyses of the thermal and mechanical attributes of a given design, at a given design stage, with a manual design correction if and when necessary, rather than on a fully automated correct-by-construction design flow. This is outlined in Fig. 7.9.

The specific target objectives for the TechTuning methodology are listed below:

• Multidomain – It is clear that both the thermal and the mechanical stress phenomena depend equally on package and Si design factors, and on the interactions between these design domains. Thus, for example, die floorplanning impacts thermal characteristics as much as package substrate design. Similarly, the layout of the Si features under a flip-chip ball affects mechanical integrity as much as the location of the ball with respect to the whole chip. Hence, the TechTuning methodology must span both the package and Si design domains.
[Figure: evolution plotted as Complexity vs. Time. Past: simple constraints (die size/max power, some placement limits, no specific DRC). Phase 1 – TechTuning: interface to design through complex rules; compliance via a hot-spot checker; tool focus on analyses; quasi-physical input models; slow EDA performance acceptable. Phase 2 – Design for T-M: in-situ tools and flow; compliance via IP and design simulation; correct-by-design tool focus; compact input models; fast EDA performance required.]
Fig. 7.9 Evolution of the thermal and mechanical design methodology for Si and stack design. The Phase 1 solution – called TechTuning – is focused on analyses of the thermal and stress phenomena, with an easy interface to the standard chip design flow and a manual design correction
That is, if the methodology is to enable optimization in the Si and package design domains, it must be compatible with the dimensions and characteristics associated with each domain. This requirement is one of the fundamental limitations of simply extending the traditional finite element-based methodologies to 3D design environments. • Predictive and iterative – The methodology must support some kind of predictive analysis in order to impact real technology or material selection decisions, and Si, stack, and package design decisions, in real time. That is, the methodology must be sufficiently flexible to be compatible with both a predictive analysis based on a guesstimate of a given 3D process and product implementation, and a full sign-off analysis of a final implementation. • Compatible with Si design – The methodology must be compatible with interfacing to Si design activities such as partitioning and floorplanning. That is, if the TechTuning analysis is to be used proactively to optimize the thermal and mechanical characteristics of 3D products, then an easy interface to the Si design environment, including EDA tools and standard design practices, must be enabled. At this time, there is no single EDA or thermal/mechanical analysis tool that meets all of these requirements. Consequently, TechTuning has to be implemented as a flow that stitches together several tools and methodologies into a single integrated practice.
7.3.2 TechTuning Implementation It is expected that the 3D TechTuning solution will evolve gradually from the methods already used for 2D technology and design decisions, primarily by
[Figure: concept flow spanning the package, inter-, and Si domains: (1) target stack concept with material and design tech files; (2) classic FEA with a manufacturability check; (3) specialized FEA with a designability check, producing stress rules and an SEF hand-off; (4, 5) chip and package substrate design flow; (6) stress hot-spot check, verification, and sign-off; (7) next iteration.]
Fig. 7.10 Outline of the 3D stress analysis and optimization concept flow
integrating established and proven tools into a practical and usable flow. Figures 7.10 and 7.14 illustrate the concept flows for mechanical stress and thermal analyses, respectively. Note that both flows share the same basic philosophy and borrow the submodeling practice used in finite element analysis methodologies for dealing with multiscale requirements, progressively taking the output of a coarse analysis as a boundary condition for a finer analysis. In addition, both flows require an input of material properties and design characteristics, and they deal with the predictive requirements by performing successive iterations from estimated values to final characteristics, producing progressively more accurate analyses based on the results of the previous iteration. However, thermal and mechanical stress analyses are obviously very different, require different inputs and corrective actions, and are hence addressed separately below. 7.3.2.1 TechTuning for Mechanical Stress Management It is clear that managing mechanical stress has moved to the forefront of semiconductor technology, both in terms of achieving the desired device performance through strain engineering techniques in advanced Si technologies (e.g., the use of stress liners and strained Si), and in terms of managing chip–package interactions (CPI) with advanced packaging technologies. Dealing with mechanical stress is vital for the successful continuation of technology scaling along the More-Moore axis and is becoming increasingly difficult. For example, mechanical stress, which
modulates device mobility and Vt, has an impact on layout-driven variability comparable to that of the lithographic factors (which modulate device L and W) in typical 45 nm and below technologies. Similarly, the combined use of ELK dielectrics in the BEOL (required to manage wire delays) with harder interconnect materials used in packaging (such as Cu pillars or Pb-free balls) is exacerbating CPI, causing delamination, cracking, and/or fracturing of various materials that often limit yield and reliability for leading edge products. Managing mechanical stress factors is even more important for the successful deployment of the More-than-Moore 3D type of technologies. The diverse challenges of managing stress/strain characteristics, including both the impact on device mobility and on material integrity, are converging with the through-silicon via (TSV)-based 3D stacking technologies (here referred to as “TSS” for through-silicon stacking technology). Successfully dealing with mechanical stress will thus be critical for the deployment of this class of technologies. The challenge of managing mechanical stress is not new, and a number of simulators do exist and have a long track record of use in the electronics industry. Most of the proven simulators are based on finite element analysis (FEA), or derivatives of that class of modeling technique. FEA simulators are typically challenged when having to deal with a large span in dimensions, as the mesh required to resolve finer features leads to an exploding computational challenge when applied to larger dimensions. FEA-based methodologies tend to deal with this scaling challenge through a series of submodeling cycles, where the results of macro-scale simulations are progressively imported as boundary conditions for simulations at finer spatial granularity. As mentioned above, this fundamental approach is adopted here. The established FEA stress simulators have typically been used for addressing the traditional chip–package interactions, and have therefore mostly modeled physical deformations such as cracking, delamination, or fracturing. For this class of analyses, the Si die is typically modeled as a monolithic brick. However, in order to address the impact of stress on electrical parameters, a specialized simulator that includes models relating stress to carrier mobility is also required. Furthermore, if a stress analysis is to include the effects on device performance, the simulation methodology needs to resolve features of the order of a transistor size (~nm). In addition, as the range of stress interactions in Si is typically taken to be of the order of a few micrometers (this is so even just for the stress interactions caused by 2D layout factors), the number of polygons that have to be analyzed simultaneously is large. Thus, even with the submodeling methodology, FEA is intrinsically incompatible with analyzing all of the stress effects within the Si design environment. Ultimately the simulator should interface to the GDS for an entire die, and some kind of compact modeling methodology is required. Given the requirements outlined above, and the constraints on the existing techniques and tools, a flow suitable for the analysis of mechanical stress factors and the sign-off of a 3D TSS stack design can be assembled. As stated above, a blend of tools and techniques is integrated into a concept flow that goes from initial exploration through to final verification and stack sign-off; the concept flow is shown in Fig. 7.10.
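The submodeling hand-off at the heart of both flows can be pictured with a small sketch: a coarse, package-level result is sampled on a die face and then refined onto the finer grid that a feature-level analysis would use, in the spirit of the SEF/TEF exchange files described below. This is only an illustration; the grid sizes, the toy displacement field, and the function names are assumptions, not part of the actual tool chain.

```python
# Minimal sketch of the submodeling hand-off used in both concept flows: a coarse
# package-level result becomes the boundary condition for a finer, feature-level
# analysis. The displacement field below is a toy placeholder, not an FEA result.

def coarse_package_result(nx, ny, die_w_mm, die_h_mm):
    """Toy package-level result: displacement (um) sampled on a coarse grid."""
    grid = []
    for j in range(ny):
        row = []
        for i in range(nx):
            x = (i + 0.5) * die_w_mm / nx
            y = (j + 0.5) * die_h_mm / ny
            # placeholder bowing profile, largest towards the die corners
            row.append(0.1 * ((x - die_w_mm / 2) ** 2 + (y - die_h_mm / 2) ** 2))
        grid.append(row)
    return grid

def refine_boundary(coarse, factor):
    """Nearest-neighbour refinement of the coarse grid onto a finer mesh.
    A real flow would interpolate more carefully; this only shows the hand-off."""
    fine = []
    for row in coarse:
        fine_row = [value for value in row for _ in range(factor)]
        fine.extend([fine_row] * factor)
    return fine

coarse = coarse_package_result(nx=8, ny=8, die_w_mm=10.0, die_h_mm=10.0)  # package scale
boundary = refine_boundary(coarse, factor=16)                             # feature scale
print(f"coarse 8x8 grid refined to a {len(boundary)}x{len(boundary[0])} boundary grid")
```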
From Fig. 7.10 we can summarize the different steps in the 3D stress analysis and optimization flow:
• Step 1: Estimate target TSS implementation
• Step 2: Package analyses
• Step 3: Stack feature analyses
• Step 4: Design Tech File
• Step 5: Si chip design
• Step 6: Stress hot spot analyses
• Step 7: Next iteration
Step 1 – The target 3D through-Si stack implementation is defined, including the die, the interconnect, and the package to be used, along with the associated material set. At this early stage, it is assumed that the Si chips and the package substrate are not (yet) designed, but that the target die sizes and the global interconnect schemes and constraints are defined, so that the basic dimensions and material properties are known. Step 2 – Package analysis is performed. For this step, traditional FEA tools such as ANSYS or ABAQUS are used to perform the global stress modeling, and an analysis of general package-level manufacturability and feasibility is performed. Note that these are incumbent tools already proven in this domain, and that the models are qualified for use with the typical material set involved in packaging – materials that are fairly soft and plastic, have a relatively low Tg, and have characteristics that are non-linear with temperature. The Si die are treated as monolithic bricks, and the FEA mesh is set according to the usual packaging considerations driven by bump sizes, etc. At this stage, factors such as the maximum allowed die sizes or the basic material properties are explored. If these are found to be inconsistent with the constraints for the target TSS implementation, then a different material set is required. When there is convergence between the target requirements and the analysis results, the proposed implementation is judged to be manufacturable, and a “Stress Exchange Format” (SEF) file is extracted for subsequent use. SEF – The SEF file is a (proposed) file format used to transfer the boundary conditions from the package-level analysis to the feature-level analysis. A fixed file format is required in order to decouple the choice of the simulation environments for the different domains, thereby allowing the use of the optimum tools for each step. The SEF is basically a matrix of stress fields, expressed in terms of displacement, on every face of every die in the package. The granularity of a given matrix is determined by the size of the features on the respective die face. Step 3 – The TSS stack feature analysis is performed, using the boundary conditions imported from the package-level analysis (SEF) and the target TSS material characteristics. For this step a specialized FEA tool is used (Synopsys FAMMOS), because models that relate mechanical stress to electrical device characteristics (carrier mobility) are necessary. In addition, this tool is compatible with the material set used in Si processing and is capable of interpreting the Si layout parameters. Note that this is still an FEA-based tool, and as such it can be used to model only limited layout configurations at a given selected granularity (vs. analyzing the GDS for the whole chip). This level of analysis is used to assess the “designability”
of a proposed stack by modeling the stress distribution in local regions and exploring the physical and electrical interactions for specific, limited layout configurations under the imported package boundary conditions. At this stage, the basic design rules and the target stack design statistics are known, and constraints for specific layout configurations based on some performance criterion (e.g., mobility shift < ~x%) can be explored. If the layout constraints – for example, the die sizes and alignments, or the placement and floorplanning restrictions – are inconsistent with the desired stack targets, then a different package material set, and/or a different process flow, and/or a different stack configuration is required. When there is convergence between the requirements and the analysis results, the proposed implementation is judged to be designable, and a set of stress design rules is extracted. Stress design rules – These are an incremental set of layout and placement rules that will be used to constrain chip and stack design; rules such as a keep-out area (KOA) around a TSV, placement constraints for FC or µ-bumps, die-to-die alignment constraints, etc. As indicated above, these rules are based on target criteria for the maximum deviation in electrical performance vs. a 2D implementation, as well as on physical integrity constraints (CPI), for a given set of packaging and assembly boundary conditions. Step 4 – The Design Tech File is upgraded to include the stress design rules and constraints, to drive the chip, package substrate, and stack designs. At this time, the most practical approach for incorporating mechanical stress considerations into a design flow is through a set of rules. Rules by definition tend to be “one-size-fits-all” solutions, and hence necessarily include excess margin. If this excess margin becomes prohibitively expensive, the methodology will naturally be upgraded to some form of true design-for-stress methodology. Step 5 – The standard Si chip(s) design flow is run, including the considerations from the stress design rules and constraints. As implied above, at this time the intent is to produce minimal perturbation to standard design practices, and hence the mechanical stresses would be handled purely through a set of layout and placement rules. It is assumed that for the foreseeable future, 3D TSS stack design will be implemented as a series of quasi-independent 2D chip designs, where the constraints from one layer are imported into the next level. This transfer of constraints can be expected to evolve naturally from quasi-manual to a more automated implementation, based on the evolutionary changes in standard EDA tools. Note that this approach is adequate for heterogeneous stack design where the partitioning across tiers is implemented along functional lines – e.g., memory-on-logic or analog-on-logic kinds of integration. Design of a fully optimized logic-on-logic type of stack will require more of a revolutionary change to the design flows and EDA tools. Step 6 – The Si chip analysis for “stress hot spots” is performed. Since stresses from different sources result in complex interactions that produce some net effect on a given design feature, a final analysis of the cumulative effects is required. For example, the combination of the 2D layout-driven stresses, the TSV stress, the stress from µ-bumps and FC bumps, the stresses from die-to-die interactions in the TSS stack, etc., may produce an interaction that impacts a particular location on a given die.
Furthermore, some features, such as analog blocks, may be more
sensitive to a given stress configuration than other features, such as standard cells. Consequently, a “stress hot spot” checker that evaluates a complete GDS is needed. As mentioned above, this capability has to be driven by suitable compact models, and the tool used cannot be an FEA-class simulator. Tools like this do exist and are conceptually similar to the DFM tools used to analyze a layout for “printability hot spots.” Fundamentally, these tools fragment the layout into a series of features whose stress-response characteristics can be described by a set of behavioral models; the effects are accumulated, and the design is reconstituted with suitably adjusted performance characteristics. Synopsys Seismos is one example of this class of tool. Note that the specialized FEA tool used in step 3 to derive the stress design rules can also be used to analyze specific layout configurations to derive the compact behavioral models. Sign off – Finally, if any hot spots are detected, the performance effects are analyzed and the layout is suitably altered. Ultimately, when all hot spots have been removed or waived, the design of the Si die and the stack is signed off. Note that the intent of the flow is to be iterative, and that full final sign-off may require going back to step 1, with suitable corrections in the design of the stack or the package substrate. The flow described above is an outline of a concept flow and has not been deployed and demonstrated in its entirety. However, some of the tools and elements of the flow have been used, and in order to illustrate the type of analyses and data expected, some sample analyses are described below. • Keep-out area – A key feature of 3D TSS technology is the through-Si via itself. With the via-middle and via-last configurations, the TSV is usually filled with Cu, and given the TCE mismatch between Si and Cu and the typical BEOL thermal cycles, the TSV is a significant source of stress which impacts carrier mobility and device performance. Note that the effect on device performance is a multidimensional variable and will depend on the device type (n-MOS vs. p-MOS), crystal orientation (111 vs. 100), circuit application (digital vs. analog), layout configuration, as well as on some of the 3D stacking parameters. This therefore necessitates some form of restriction on the placement of active devices in the proximity of a TSV (a simple, illustrative keep-out check is sketched at the end of this subsection). Figure 7.11 illustrates the simulated change in device mobility due to the stress induced by the TCE mismatch between a Cu-filled TSV (in the center) and Si, for a given assumed set of parameters. In addition, specific layout restrictions may be required for the metal layers in the proximity of the TSV. • µ-Bump – 3D technologies involve the use of die-to-die bonding techniques, such as µ-bumps, which are deposited on the backside of die 1 and are used to connect to the subsequent die in a stack. µ-Bumps can act as a source of stress in a manner analogous to that associated with the conventional flip chip (FC) bumps used on the front side of a wafer. Note that the stress from a µ-bump can interact with the devices of die 2 or, with sufficient thinning, with the devices on die 1. Figure 7.12 illustrates the simulated change in device mobility due to the stress induced by a µ-bump and underfill for a given layout configuration (illustrated in the inset).
Fig. 7.11 Simulated change in device mobility due to stress induced by TCE mismatch between Cu filled TSV and Si
Fig. 7.12 Simulated change in device mobility due to stress induced by a µ-bump and underfill
• Thinning – TSS technology – especially when using via-first or via-middle configurations – requires Si wafers to be thinned to a few tens of micrometers. At this thickness the Si substrate is not necessarily the dominant layer, and a redistribution of intrinsic stresses may take place, resulting in unintended stress relaxation at the surface, with a possible impact on device mobility. Figure 7.13 illustrates the difference in the distribution of TSV stress for two different wafer thicknesses – under a given assumed set of conditions. This implies that
[Figure: two cross-sections of a TSV in the Si substrate, for wafer thicknesses of 50 µm and 10 µm.]
Fig. 7.13 Difference in the distribution of TSV stress for two different wafer thicknesses
the keep-out area rules and the µ-bump placement restrictions would be different for different wafer thicknesses. These sample illustrations also highlight the intended dual nature of the TechTuning flow – where some of the “knobs” are in the design arena and some are in the process arena. For example, the choice of wafer thickness is a parameter that lies purely in the process domain, but this choice must be made in the context of a target design, as the decision may have a significant impact on the end product characteristics. On the other hand, the implementation of the keep-out area rules, or the layout in the proximity of the µ-bumps, are decisions that lie in the design domain. Clearly, optimization of the product characteristics requires suitable trade-offs across the process and design domains, as the impact of the various decisions will depend on the number and distribution of TSVs, the baseline die size and complexity, the nature of the circuit design, etc.
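As a simple illustration of the keep-out-area rule referred to above, the sketch below flags devices that fall inside an assumed keep-out radius around each TSV and attaches a rough mobility-shift estimate to the remaining ones. The radius, the 1/r² decay, and its prefactor are illustrative assumptions, not calibrated values from the FEA analyses described in this section.

```python
# Illustrative keep-out-area (KOA) check around Cu-filled TSVs. The KOA radius and
# the 1/r^2 mobility-shift decay are assumed, uncalibrated numbers for illustration.

import math

KOA_RADIUS_UM = 5.0        # assumed keep-out radius around each TSV
SHIFT_AT_EDGE_PCT = 2.0    # assumed mobility shift right at the KOA edge

def mobility_shift_pct(distance_um):
    """Assumed compact model: shift decays as 1/r^2 beyond the keep-out edge."""
    return SHIFT_AT_EDGE_PCT * (KOA_RADIUS_UM / distance_um) ** 2

def check_placement(tsvs, devices, max_shift_pct=0.5):
    """Return KOA violations and devices whose estimated shift exceeds the target."""
    violations, flagged = [], []
    for name, dx, dy in devices:
        d_min = min(math.hypot(dx - tx, dy - ty) for tx, ty in tsvs)
        if d_min < KOA_RADIUS_UM:
            violations.append((name, round(d_min, 2)))
        elif mobility_shift_pct(d_min) > max_shift_pct:
            flagged.append((name, round(d_min, 2), round(mobility_shift_pct(d_min), 2)))
    return violations, flagged

tsvs = [(0.0, 0.0), (25.0, 0.0)]                                        # 25 um pitch pair
devices = [("inv_1", 3.0, 2.0), ("sense_amp", 9.0, 1.0), ("nand_7", 40.0, 30.0)]
violations, flagged = check_placement(tsvs, devices)
print("KOA violations     :", violations)
print("shift above target :", flagged)
```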
7.3.2.2 TechTuning for Thermal Management It is clear that thermal management is a key challenge that poses an intrinsic barrier to continued integration – not just for 3D chips but also for any advanced technology implementation. However, 3D implementation does pose some incremental challenges over and above the intrinsic power density issues faced by regular 2D die, namely: • 3D TSS implementation puts multiple die in intimate contact with each other, thereby facilitating thermal interactions between them, so that a hot spot on one
die is propagated to the neighboring die. The thermal robustness of the stack is then limited by the most sensitive die in the stack. Thus, for example, memory and/or analog die are known to be thermally more sensitive than normal logic die, so that stacked memory-on-logic or analog-on-logic implementations may be thermally more sensitive than the logic die by itself. Specifically, CMOS logic is typically designed for a maximum temperature of the order of 125°C, while the DRAM memory maximum temperature is more like 105°C. Stacking memory-on-logic therefore brings the maximum temperature of the stack, and hence of the logic die, down to 105°C. On the other hand, a 3D stack that includes a relatively low-power die may have better thermal performance than a nonstacked version, because the low-power die can act as a sort of heat spreader that brings down the maximum temperature of a hot spot on its neighbor. Thus, a thermal management methodology for 3D requires an understanding of the hot spot distribution (temperature and location) on every die and in the whole stack, and an understanding of the thermal sensitivity of each of the stacked dies. • Thermal management and heat sinking are more challenging with 3D implementations because, short of some very exotic heat sinking solution, the thermal path for at least some of the die in the stack is lengthened. This can limit some of the stacking degrees of freedom or may even make some stacks impractical. For example, some high-performance and high-power die require backside heat sinking and hence are assembled in a face-down orientation. Any 3D implementation would either place the second die between the high-power die and its heat sink, or between the high-performance die and its electrical connections – thereby degrading either thermal or electrical performance. Similar, but less acute, trade-offs may need to be made with die whose normal heat sinking path is through the package and board, rather than a backside heat sink. Thus a thermal management methodology for 3D requires an understanding of the heat flow paths for the entire environment, which clearly includes the package and, to some extent, the board thermal characteristics. • Finally, since 3D stacking allows higher levels of integration, it also results in higher power dissipation per unit volume, and hence in higher maximum temperatures. Furthermore, the distribution of the power dissipation across each of the die in the stack is very important and must be comprehended in the floorplanning of each of the die. That is, for example, aligning the hot spots on neighboring die can result in increased maximum temperatures and a corresponding decrease in performance. Note that with this configuration the most effective thermal management methodology is suitable die floorplanning. Thus a thermal management methodology for 3D requires a proactive understanding of the hot-spot distributions and a procedure for feeding this information into early physical chip design decisions. Thus, as is the case with mechanical stress, the requirements for a thermal management solution call for an integrated simulation flow, rather than a single tool. This is outlined in Fig. 7.14. The flow is quite similar to the one described above for mechanical stress, in that it relies on several incumbent tools, progresses from package-level simulation, uses a set of guidelines to couple into the Si design flow, and
[Figure: concept flow analogous to Fig. 7.10, spanning the package, inter-, and Si domains: (1) target stack concept with material tech file and power models; (2) classic FEA with a manufacturability check, producing a TEF hand-off; (3) specialized FEA with a designability check, producing thermal guides; (4, 5) constraint file driving the Si stack and package substrate design flow; (6) thermal hot-spot check, verification, and sign-off; (7) next iteration.]
Fig. 7.14 Outline of the 3D thermal analysis and optimization concept flow
ends in a final Si and stack sign-off. The principal difference between the two flows, other than the obvious difference in the tools used, is the requirement for power models as an input. The different steps of the thermal analysis and optimization flow can be summarized as follows:
• Step 1: Estimate target TSS implementation
• Step 2: Package analyses
• Step 3: Stack feature analyses
• Step 4: Design guidelines
• Step 5: Si chip design
• Step 6: Thermal hot spot analyses
• Step 7: Next iteration
Step 1 – The target 3D through-Si stack implementation is defined, including the die, the interconnect, and the package to be used, along with the thermal properties of the target material set. At this early stage, it is assumed that the target die sizes and other basic dimensions can be defined. In addition, target power models for a given partitioning scheme – with a granularity corresponding to the major subsystems – are required. Note that this kind of model can be obtained from PathFinding explorations. Step 2 – Package analysis is performed. For this step, traditional FEA tools such as ANSYS or FloTherm are used to perform the thermal modeling, and an analysis of general package-level thermal feasibility is performed. Note that these are
incumbent, proven tools in this domain. The Si die are treated at the granularity of the available thermal models, and the FEA mesh is set according to the usual packaging considerations driven by bump sizes, etc. At this stage, factors such as the maximum subsystem power, the basic material properties, and the heat flow solutions are explored. If these are found to be inconsistent with the requirements of the target TSS implementation, then a different material set or thermal solution is required. When there is convergence between the target requirements and the analysis results, the proposed implementation is judged to be manufacturable, and a “Thermal Exchange Format” (TEF) file is extracted for subsequent use. TEF – The TEF file is a (proposed) file format used to transfer the boundary conditions from the package-level analysis to the feature-level analysis. A fixed file format is required in order to decouple the choice of the simulation tools for the different domains, thereby allowing the use of existing tools for each step. The TEF is basically a matrix of thermal resistance values for every face of every die in the stack. The granularity of a given matrix is determined by the size of the features on the respective die face. Step 3 – The TSS stack feature analysis is performed, using the boundary conditions imported from the package-level analysis (TEF), the target material characteristics, and the power models. For this step a specialized FEA tool can be used (e.g., the Cadence Kelvin or Gradient tool), in order to facilitate an interface to the Si design environment and to explore various floorplan possibilities. This level of analysis is used to assess the “designability” of a proposed stack by modeling the thermal distributions and exploring the physical design feasibility for the imported package thermal characteristics. If the layout constraints – for example, the die sizes, floorplans, and die-to-die alignments – are inconsistent with the desired stack targets, then different package thermal characteristics and/or a different stack configuration are required. When there is convergence between the requirements and the analysis results, the proposed implementation is judged to be designable, and a set of thermal guidelines is extracted. Thermal guidelines – These are an incremental set of layout and placement constraints and guidelines that will be used to constrain chip and stack design; guidelines such as floorplanning constraints, power dissipation limits for given subsystems, placement constraints for FC or µ-bumps, die-to-die alignment constraints, etc. As indicated above, these guidelines are based on specified maximum temperatures or temperature gradients (and the associated performance implications). Step 4 – The design constraint file is upgraded to include the thermal guidelines, to drive the chip, package substrate, and stack designs. At this time, the most practical approach for incorporating thermal considerations into a design flow is through a set of rules and guidelines. Rules by definition tend to be “one-size-fits-all” solutions, and hence necessarily include excess margin. If and when this excess margin becomes prohibitively expensive, the methodology will naturally evolve to some form of true design-for-thermal methodology. Step 5 – The standard Si and stack design flow is run, including the considerations from the thermal guidelines. As implied above, at this time the intent is to produce minimal perturbation to standard design practices, and hence the thermal considerations
would be handled purely through a set of floorplanning constraints. It is assumed that for the foreseeable future, 3D TSS stack design will be implemented as a series of quasi-independent 2D chip designs, where the constraints from one layer are imported into the next level. This transfer of constraints can be expected to evolve naturally from quasi-manual to a more automated implementation, based on the evolutionary changes in standard EDA tools. Note that this approach is adequate for heterogeneous stack design where the partitioning across tiers is implemented along functional lines – e.g., memory-on-logic or analog-on-logic kinds of integration. Design of a fully optimized logic-on-logic type of stack will require more of a revolutionary change to the design flows and EDA tools. Step 6 – The Si chip analysis for “thermal hot spots” is performed. Heat flow from different sources and via different paths results in complex interactions that produce some net temperature at a given node, for a given use case, so that a final analysis of the cumulative effects is required. Note that some blocks (e.g., analog) or paths (e.g., set-up and hold) may be more sensitive to a given temperature or temperature gradient condition than other features, such as combinational logic. Consequently, a “thermal hot spot” checker that evaluates a complete design is needed. Since thermal effects are by definition diffuse, a fairly coarse spatial granularity may be sufficient, such that the FEA methodology and the existing specialized thermal analysis tools (e.g., Cadence Kelvin) may be adequate for this analysis. However, suitable thermal compact models could also be developed and used at this stage (see the last subsection of this section). Sign off – Finally, if any hot spots are detected, the performance effects are analyzed and the layout is suitably altered. Ultimately, when all hot spots have been removed or waived, the design of the Si die and the stack is signed off. Note that the intent of the flow is to be iterative, and that full final sign-off may require going back to step 1, with suitable corrections in the design of the stack or the package substrate. The flow described above is an outline of a concept flow that has been deployed and used, with, at this time, manual interfaces across the different domains. A sample analysis is described below to illustrate the EDA tools and elements of the flow. A given target floorplan is assumed, and the power dissipation of the major blocks is modeled for a given target use case. A target package is assumed with its material characteristics; the thermal resistance experienced by each die on each face is modeled and imported into the die-level thermal simulator. Thermal maps, such as the one illustrated in Fig. 7.15, are extracted for different use cases or floorplans. Note that in the example illustrated, there is a hot spot on the Tier 1 die, which results in a hot spot on the Tier 2 die. A different placement of the Tier 2 die relative to the Tier 1 die would result in a different thermal profile on Tier 2.
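To make the thermal budgeting argument of this subsection concrete, the sketch below treats a two-die memory-on-logic stack as a one-dimensional thermal resistance ladder in which all heat leaves through the package below the logic die, and compares the resulting die temperatures against the roughly 125°C logic and 105°C DRAM limits quoted earlier. The resistance and power numbers are assumed, illustrative values, not outputs of the tools named above.

```python
# Minimal 1D thermal ladder for a memory-on-logic stack: all heat is assumed to
# leave through the package/board below the logic die. All numbers are illustrative.

AMBIENT_C = 45.0          # assumed ambient/board temperature
R_PKG_C_PER_W = 15.0      # assumed resistance from the logic die down to ambient
R_DIE2DIE_C_PER_W = 3.0   # assumed bond/micro-bump layer resistance between the dies

LIMITS_C = {"logic": 125.0, "dram": 105.0}

def stack_temperatures(p_logic_w, p_dram_w):
    """Return (T_logic, T_dram) for the assumed bottom-exit heat path."""
    t_logic = AMBIENT_C + (p_logic_w + p_dram_w) * R_PKG_C_PER_W
    t_dram = t_logic + p_dram_w * R_DIE2DIE_C_PER_W
    return t_logic, t_dram

t_logic, t_dram = stack_temperatures(p_logic_w=3.0, p_dram_w=0.5)
print(f"logic die: {t_logic:.1f} C (limit {LIMITS_C['logic']} C)")
print(f"DRAM die : {t_dram:.1f} C (limit {LIMITS_C['dram']} C)")
# Because the DRAM limit is the lower of the two, the DRAM die effectively caps
# the power budget of the whole stack, as discussed above.
```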
7.3.3 TechTuning Infrastructure Simulation flows and methodology outlined above are clearly only as good as the infrastructure developed to support them, i.e., simulations are only as good as the models
Fig. 7.15 Thermal maps extracted from the physical view of the die
used, and the models used are only as good as the material data extracted, etc. Thus, deploying these simulation flows requires a support infrastructure, and this section outlines some of the requirements: • Material properties – Thermal and mechanical simulations are driven by a set of suitable material properties, such as the thermal conductivity, Young’s modulus, Poisson ratio, and thermal coefficient of expansion of each material used. These are necessary inputs, along with the geometries in the x, y, and z dimensions of the target implementation. • Process characteristics – In addition, since the net stress–strain characteristics are a result of stress redistribution through the process of building the die and assembling the stack in a package, mechanical stress modeling requires the process thermal history. In addition, some of the residual stress characteristics – e.g., the stress in the ILD layers in Si processing or in compression bonding processes – are not just a function of temperature and TCE but depend on the process conditions used (e.g., plasma gases, pressures, etc.). Consequently, stress modeling has traditionally been a domain of the T-CAD class of simulations performed by the process developers. Since the foundries and SATs, rightly, do not release this type of intimate process data, this is clearly a significant barrier to the implementation of the simulation capabilities – especially for fabless and IFM design teams. Hence, in addition to the material properties, some measure of the residual stress at the end of the process line is required. For example, a “stress-free temperature” – a mathematical abstraction that describes the effective temperature at which a given film has zero stress vs. the Si substrate – could be used (a short illustrative calculation using such a stress-free temperature is sketched at the end of this subsection). Note that the stress-free temperature, as defined here, is not necessarily the actual temperature at which a given film has been deposited, as material restructuring and other plastic deformations need to be accounted for, and a TCAD kind of simulation may be required to derive it. • Calibration test vehicles and metrology techniques – The material properties described above need to be measured, and the models need to be validated. Whereas some of the material properties – e.g., of Si or molding compound – can be assumed to be comparable to the book values derived on bulk material, other characteristics need to be calibrated for specific films in a specific process
technology – e.g., dielectric films, µ-bumps, etc. Consequently, a set of measurement methods and corresponding test vehicles is required. Some of these are part of the established infrastructure – especially for materials associated with the packaging processes, such as the molding compounds, underfill materials, FC bumps, etc. However, some of the necessary practices – especially those involving the materials and films used in Si processing – are new and need to be developed. Thus, for example, characterizing the material properties of the various ILD films to include mechanical parameters such as Young’s modulus is not an established practice, and a metrology system needs to be developed. • Expanded Design Enablement Kit – In addition, some cross-industry collaboration is required in order to define a practice of including the required material characteristics in the standard “design enablement kit” that is used to communicate process characteristics to design teams. Standard practices used to transfer design rules and DRC decks, as well as the models that describe electrical characteristics (e.g., BSIM models for SPICE), need to be expanded to include thermal and mechanical stress characteristics as well. • Validation test vehicles and metrology systems – Given the complexity of the models involved and the novelty of the calibration methodologies, it is clear that incremental validation of simulation results vs. Si measurements is needed. Correlation of electrical models to actual Si performance is a standard practice used across the entire industry. Expanding this to encompass thermal and stress models will involve incremental test vehicles, with suitable test structures that combine the various thermal and mechanical stress sources and evaluate the net effect on device performance and/or material integrity, enabling direct comparison of measured to simulated characteristics. It is clear that the TechTuning flows and the associated infrastructure, as described here, involve a lot of new capabilities and require the cooperation of, and adoption by, many entities across the industry. However, it is believed that since thermal and mechanical stress management is such a fundamental challenge for successful 3D TSS implementation, some version of a TechTuning simulation flow is a “must” and will therefore be supported by the industry.
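As a small illustration of how the stress-free temperature abstraction introduced above can be used, the snippet below applies the textbook biaxial thin-film approximation to turn a single T_sf figure into a residual stress estimate. The material values and the T_sf itself are rough, assumed numbers, not foundry data.

```python
# Illustrative use of a "stress-free temperature" (T_sf): textbook biaxial
# thin-film approximation for the residual stress of a film versus the Si substrate.
# All material values and T_sf below are rough assumptions, not foundry data.

def residual_stress_mpa(e_gpa, nu, alpha_film_per_k, alpha_si_per_k, t_sf_c, t_c):
    """sigma ~ E / (1 - nu) * (alpha_film - alpha_si) * (T_sf - T), in MPa."""
    biaxial_modulus_mpa = e_gpa * 1e3 / (1.0 - nu)
    return biaxial_modulus_mpa * (alpha_film_per_k - alpha_si_per_k) * (t_sf_c - t_c)

# Assumed example: a Cu fill (E ~ 110 GPa, nu ~ 0.35, alpha ~ 17 ppm/K) against
# Si (alpha ~ 2.6 ppm/K), with an assumed T_sf of 250 C, evaluated at 25 C.
sigma = residual_stress_mpa(110, 0.35, 17e-6, 2.6e-6, t_sf_c=250, t_c=25)
print(f"estimated residual stress: {sigma:.0f} MPa (tensile if positive)")
```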
7.3.4 Compact Thermal and Mechanical Modeling The physical prototyping environment generates the floorplan of the design. This information can be used for the evaluation of a given solution (a virtually synthesized, placed, and routed instance for a given set of fixed system-level and technology parameters) in terms of its thermal and mechanical properties. At IMEC, we developed compact thermal and mechanical models that can be used for quick estimations of these properties (for more information on these models, the reader is referred to [25–27]). Their integration in the PathFinding flow is illustrated in Fig. 7.16.
[Figure: the 3D physical prototyping (place and route) step takes RTL (VHDL/Verilog), the stack configuration (3D-SIC .xml), the floorplan (.def), and constraints (.sdc); power maps and component positions feed the compact thermal model, TSV positions feed the compact mechanical model, and both models feed an analysis step.]
Fig. 7.16 Compact thermal and mechanical models and the PathFinding flow
The compact thermal model takes as input: • Stack configuration information, such as the die size, the package thermal resistance, the thermal properties and thicknesses of the bonding layers and the BEOL structures, the thicknesses of the different dies, etc. • Floorplan information. • Power-per-component figures – established from black-box model inputs, from estimations based on gate and flip-flop counts for the RTL components, or, if available, from Value Change Dump (.vcd) files for more accurate power estimation. In each of the dies the power dissipation is entered in unit cells with an area of 100 × 100 µm². In this way the thermal behavior of multiple and larger hot spots can be accounted for by considering several unit cells. The output of the model is a thermal map for each die in the 3D-SIC, in which the designer can quickly identify hot spots in the design and take appropriate steps. For the mechanical model, only the TSV positions are extracted from the design database and passed to the compact mechanical model. The output of the compact mechanical model is the set of TSV displacement maps in the X and Y directions. The thermal maps and displacement maps can be used for thermally and mechanically aware floorplanning (feedback from the analysis box to the place and route step in Fig. 7.16).
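A drastically simplified sketch of such a per-unit-cell thermal map is given below: each 100 × 100 µm² cell receives its own power figure and is mapped to a temperature through a purely vertical thermal resistance plus a shared package term. The IMEC compact model referred to above also captures lateral spreading and die-to-die coupling [25–27]; the resistances and grid used here are assumptions for illustration only.

```python
# Toy per-cell thermal map: each 100 x 100 um^2 unit cell gets
# T = T_amb + P_total * R_pkg + P_cell * R_vertical.
# Lateral spreading and die-to-die coupling, which the real compact model
# includes, are deliberately ignored; all resistance values are assumed.

AMBIENT_C = 45.0
R_PKG_C_PER_W = 12.0      # assumed shared package resistance
R_CELL_C_PER_W = 400.0    # assumed vertical resistance of one unit-cell column

def thermal_map(power_map_w):
    """power_map_w: 2D list of per-cell power (W); returns per-cell temperature (C)."""
    p_total = sum(sum(row) for row in power_map_w)
    base = AMBIENT_C + p_total * R_PKG_C_PER_W
    return [[base + p * R_CELL_C_PER_W for p in row] for row in power_map_w]

# 4 x 4 unit cells (0.4 x 0.4 mm of die area) with one 10 mW hot cell.
power = [[0.001] * 4 for _ in range(4)]
power[1][2] = 0.010
temps = thermal_map(power)
hottest = max(max(row) for row in temps)
print(f"hottest unit cell: {hottest:.1f} C")
```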
7.4 Case Studies In order to demonstrate the PathFinding and TechTuning design methodologies and the practical application of the proposed tool chain, we present in this section three different case studies. Each study has been chosen to highlight one particular aspect of the proposed design methodology and tool chain. First, we describe a case study involving HLS, showing how it can be used to generate different micro-architectures and physical prototypes, from high-level C descriptions to virtually placed and routed designs. The following example illustrates the feedback mechanism of PathFinding and TechTuning, and how it can be used to explore different topologies of the communication infrastructure (in this case a NoC). We use the same example to demonstrate the link to the compact thermal and mechanical models. Finally, the third example illustrates the capability of the proposed tool chain to explore different 3D stacking options for a memory die on top of a logic die.
7.4.1 High-Level Synthesis for Computation This case study shows how HLS tools can be effectively used to bridge the gap between the system-level exploration and physical prototyping steps. Besides going through the complete PathFinding flow, the study also focuses on the analysis of the dependency between the design effort (i.e., the time that the designer is prepared to spend on HLS optimization) and the design performance (area/timing). For all experiments we assume an ANSI C implementation of the full-search motion estimation kernel, one of the basic functionalities implemented in most video codecs in use today. This kernel is often considered to be a major bottleneck for computation, communication, and memory access, especially when moving to higher frame resolutions and rates. In order to quantify the sensitivity of the HLS result to the design effort, we consider three different levels of design effort: • MEI – Low effort, minimal instrumentation: the designer just prepares the code so that the HLS tool can be run on it. The amount of time spent here is measured in hours. • MEII – Medium effort: the designer tries simple optimization techniques (loop unrolling, pipelining, etc.) without going deeply into micro-architecture/performance analysis. The amount of time spent here is measured in a few tens of hours. • MEIII – High effort: the designer performs detailed optimization, using numerous iterations inside the HLS process. The amount of time spent is measured in days. The technology characterization has been done using a TSMC 90 nm library. The generated RTL is then synthesized to a gate-level netlist using Synopsys Design
[Figure: bar chart of area vs. the synthesis frequency constraint (200, 250, 333, and 500 MHz) for motion estimation (ME) engines with different micro-architectures in 90 nm, comparing MEI_90nm, MEII_90nm, and MEIII_90nm.]
Fig. 7.17 Design effort vs. area of the ME RTL synthesized using HLS for different synthesis constraints
Compiler with the same 90 nm library and for different timing constraints, in order to capture the impact of the performance target on the area. The results for one ME hardware block instance are shown in Fig. 7.17, for different design efforts and timing constraints for the gate-level synthesis tool. Clearly, the area of the kernel can be improved by increasing the design effort (a factor of 2 between MEI and MEIII). However, the medium-effort RTL represents a good compromise between the time spent to get the RTL out of the tool (and therefore the physical prototype) and the accuracy of the performance parameters. This means that HLS can be effectively used to derive performance parameters for a given functional block and therefore accelerate the system prototyping process as a whole.
7.4.2 Communication Infrastructure Case Study and PathFinding Feedback; Link to the Compact Thermal and Mechanical Models 7.4.2.1 PathFinding For reasonably sized MPSoC platforms and applications dominated by internode traffic, the communication infrastructure is often considered to be the bottleneck for high-performance systems. In order to solve this problem, networks-on-chip
(NoCs) have been proposed as an alternative to traditional bus-based communication. Although very flexible, the performance of a NoC strongly depends on the physical aspects of the design, which is what we want to address with this study. The demonstration platform is a fairly complex MPSoC system dedicated to advanced video encoding–decoding applications, using a NoC as the communication infrastructure. The platform is inspired by our previous work [28], and a simplified block diagram of the system is given in Fig. 7.18. The main goals of this case study can be summarized as follows: 1. To show how difficult it is to make the right system-level design choice about the NoC topology/configuration without information on the physical characteristics of the design as a whole. 2. To show how PathFinding can be used to produce such physical design information and quickly verify system-level assumptions. 3. To show how physical information obtained during physical prototyping can be used as feedback to the NoC HLS tool to generate a micro-architecture of the communication infrastructure that satisfies the initial timing constraint. The system is composed of 15 processing elements, 8 memory nodes, 3 ME engines (the RTL used for these components has been produced using HLS for computational elements, as described in the previous subsection), and one DRAM memory controller.
[Figure: CPU01–CPU15, a motion estimation subsystem (ME01–ME03), instruction RAMs (RAM01–RAM04), data RAMs (RAM01–RAM04), a dummy node, and a DRAM controller, all connected through the NoC interconnect.]
Fig. 7.18 Block diagram of the MPSoC system
[Figure: power (mW) vs. area (mm²) of the best design points for the explored NoC architectures.]
Fig. 7.19 System-level exploration of different NoC architectures
In the following, we describe the step-by-step procedure of the PathFinding flow. Step 1 – During the system architecture exploration phase, different NoC topologies have been explored using high-level SystemC models of the basic NoC components (network interfaces, routers). Based on traffic assumptions and different instances of the NoC topologies, these models were used to derive the area/power characteristics of the NoC. The area/power dissipation trade-offs for NoC topologies using 3, 6, and 24 routers are given in Fig. 7.19. Using a traditional design flow, the system-level designer could, based on such a diagram, pick one of the solutions (marked 1 on the diagram), since it represents the best trade-off from a power/area perspective. Note that during this phase we assume a NoC speed of 300 MHz, and this assumption cannot be verified at this stage. Step 2 – Using PathFinding and the RTL elaboration step, the designer can now choose a few design points for further exploration. Out of the 20 NoC configurations analyzed in the previous step, we chose to implement (and therefore physically prototype in the next step) three different NoC instances using 3, 6, and 24 routers (marked 1, 2, and 3, respectively, in Fig. 7.19). Note that the RTL generation of the different NoC instances is quite fast once step 1 has been performed. Using the iNoCs HLS generation tools, the corresponding RTLs were obtained in less than 1 day. Step 3 – Using the gate-level netlists of the three NoC instances and black-box descriptions of the different processing and memory blocks, we were able to describe the system as a whole in the proposed PathFinding environment. Using the virtual synthesis, placement, and routing tools (the layout of the two dies is shown in Fig. 7.20), we are now able to derive the performance parameters of the NoC (area, maximum operating frequency, and power), which are summarized in Table 7.1.
Fig. 7.20 Layout of the memory and logic die with TSV arrays
Table 7.1 Performance parameters of different NoC topologies
             3 Routers   6 Routers   24 Routers
Area (mm²)   0.74        0.76        0.82
Max F (MHz)  139         172         165
Power (mW)   37          39.5        41.5
Looking at the performance parameters of the different NoC topologies, we can now easily argue that the NoC with 6 routers is actually much better than the one suggested by the system architecture exploration phase. For a small increase in the area/power budget, the NoC can run 1.24 times faster. However, the operating frequency constraint, established during step 1 at 300 MHz to satisfy the bandwidth needs of the system, has not been reached, even after the timing optimization process – the design is not feasible as such, and modifications at the micro-architecture level are thus required. During the physical prototyping phase, the floorplan of the system is generated, and it can easily be exported using standard Design Exchange Format (.def) files. This information can then be used during the RTL generation of the NoC. The NoC HLS tool will generate another NoC micro-architecture by exploiting different degrees of freedom of the basic communication elements (router connectivity, internal router configuration, etc.). This time the NoC model generation is physically aware (note that in the first iteration the NoC topology was generated using only the traffic information). After prototyping the new NoC micro-architecture, we can summarize the frequency results after each PathFinding flow step. Figure 7.21 shows the maximum operating frequency of the NoC after step 1 (the estimate during RTL synthesis), after physical prototyping (step 3, first iteration), and after generation of the physically aware NoC micro-architecture, for 3 and 6 NoC routers (1 and 2 on the abscissa of the graph, respectively).
[Figure: NoC frequency (MHz) for the 3-router and 6-router configurations (1 and 2 on the abscissa) after step 1, after the step 3 first iteration, and after the step 3 second iteration.]
Fig. 7.21 NoC frequency for 3 and 6 routers after system-level exploration, physical prototyping and after PathFinding feedback loop
By taking the physical information (in this case the floorplan) into account during the HLS phase of the NoC, we were able to generate a physically aware NoC micro-architecture. The latest iteration of the NoC micro-architecture meets the NoC requirement established during the system-level exploration phase, which could not be met without physical prototyping of the design as a whole.
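The feedback loop exercised in this case study can be summarized in a few lines of driver code: generate a NoC, virtually prototype it, and regenerate it with the extracted floorplan if the frequency target is missed. The two functions below are stand-ins for the iNoCs generator and the virtual prototyping flow, and the returned numbers are placeholders loosely echoing Fig. 7.21, not tool output.

```python
# Conceptual driver for the PathFinding feedback loop of this case study. The two
# functions are stand-ins for the iNoCs HLS generator and the virtual prototyping
# flow; the frequencies returned are placeholders, not actual tool output.

F_TARGET_MHZ = 300.0

def generate_noc(traffic_spec, floorplan=None):
    """Stand-in for NoC HLS: returns an RTL handle, floorplan-aware if one is given."""
    return {"aware": floorplan is not None, "traffic": traffic_spec}

def virtual_prototype(noc_rtl):
    """Stand-in for virtual synthesis/place/route: returns (f_max_mhz, floorplan)."""
    f_max = 310.0 if noc_rtl["aware"] else 172.0   # placeholder frequencies
    return f_max, {"def": "floorplan.def"}

traffic = {"nodes": 27, "workload": "video codec"}
noc = generate_noc(traffic)                    # iteration 1: traffic-only topology
f_max, floorplan = virtual_prototype(noc)
while f_max < F_TARGET_MHZ:
    noc = generate_noc(traffic, floorplan)     # next iterations: physically aware
    f_max, floorplan = virtual_prototype(noc)
print(f"final NoC micro-architecture reaches {f_max:.0f} MHz (target {F_TARGET_MHZ:.0f} MHz)")
```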
7.4.2.2 TechTuning The same design can be used to demonstrate the link to the compact thermal and mechanical models. Appropriate scripts have been written to retrieve the necessary information from the design database stored in the OpenAccess format (from Atrenta 1Team®-Implement) and pass this information to the compact thermal and mechanical models. The resulting thermal maps of the two-tier MPSoC design are illustrated in Fig. 7.22. Figure 7.22a shows the thermal maps of the logic and memory die after the first physical prototyping iteration (the placement and routing was thermally unaware). Two high-power components have been placed on top of each other, creating a local hot spot. Figure 7.22b illustrates the thermal map after the second iteration, with a manual placement constraint specified on one of the two high-power components (on tier 1). This action reduces the local temperature elevation. The result of the mechanical stress analysis of the TSV array from the layout (Fig. 7.20) is shown in Fig. 7.23, with (a) showing the TSV displacement in the X direction and (b) in the Y direction.
Fig. 7.22 Thermal maps of the two-tier 3D-SIC: (a) initial floorplan with two high-power components on top of each other, creating a hot spot; (b) thermal map after one component has been manually constrained to a different position in the die during the place and route process
Fig. 7.23 TSV stress analysis
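A simplified sketch of the scripted hand-off described above is shown below: TSV placements are pulled out of a DEF-style components section and handed to a stub standing in for the compact mechanical model. The real scripts query the OpenAccess database of Atrenta 1Team-Implement rather than parsing DEF text, and the displacement stub is an assumed placeholder, not the IMEC model.

```python
# Simplified hand-off: extract TSV placements from DEF-style COMPONENTS lines and
# pass them to a stub standing in for the compact mechanical model. The real flow
# queries the OpenAccess design database; the displacement model here is a dummy.

import re

DEF_SNIPPET = """
COMPONENTS 3 ;
- tsv_0 TSV_CELL + PLACED ( 12000 8000 ) N ;
- tsv_1 TSV_CELL + PLACED ( 37000 8000 ) N ;
- u_cpu0 CPU_MACRO + PLACED ( 90000 90000 ) N ;
END COMPONENTS
"""

PLACED_RE = re.compile(r"-\s+(\S+)\s+(\S+)\s+\+\s+PLACED\s+\(\s*(-?\d+)\s+(-?\d+)\s*\)")

def extract_tsv_positions(def_text, tsv_master="TSV_CELL", dbu_per_um=1000):
    """Return [(name, x_um, y_um)] for components whose master cell is a TSV."""
    positions = []
    for name, master, x, y in PLACED_RE.findall(def_text):
        if master == tsv_master:
            positions.append((name, int(x) / dbu_per_um, int(y) / dbu_per_um))
    return positions

def compact_mechanical_model(tsv_positions):
    """Dummy stand-in: pretend displacement grows linearly with the TSV coordinates."""
    return {name: (1e-4 * x, 1e-4 * y) for name, x, y in tsv_positions}

tsvs = extract_tsv_positions(DEF_SNIPPET)
print(compact_mechanical_model(tsvs))   # per-TSV (dx, dy) displacement estimate, in um
```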
7.4.3 Packaging Case Study A similar design can be used to derive the performance of different packaging options, from a classical 2D implementation to 3D-SICs, in this case by simply placing a memory die on top of the logic die using different implementation scenarios. Furthermore, we show how high-level estimations of the delay and power of the logic-to-memory interconnect can be inaccurate and, if made early in the design cycle, can lead to wrong system-level decisions. Here we look into a subsystem built using two CPU instances. Each CPU core is a combined VLIW and 4 × 4 functional-unit (FU) coarse-grain array (CGA) processor developed at IMEC [16]. As the communication infrastructure we use a simple AMBA-AHB bus with two masters and one slave port. Access to the DDR memory is made through a configurable memory controller, in this case a third-party IP. The DRAM memory chosen is an embedded 512 Mb DDR, built using 4 × 128 Mb memory
[Figure: the MPSoC logic die contains CPU1, CPU2, and the DRAM controller; the mobile DDR2 memory die contains four banks (BANK 0–BANK 3) and control logic.]
Fig. 7.24 Block diagram of the MPSoC used for the packaging case study
arrays and some control logic, similar to the one presented in [29]. A simplified block diagram of the subsystem is given in Fig. 7.24. Let us imagine that the proposed system is part of a new design that should meet the requirements of a next-generation platform. Since we deal with a video encoding platform, one can imagine the following product scenario: the future application bandwidth requirements will have to move from an aggregate bandwidth BW1 to BW2 due to an increase in image resolution and/or frame rate. From the system-level perspective, the designer will trade off the choice between 32- and 64-bit bus widths, requiring operating frequencies of F1 and F2 = F1/2, respectively. The designer could also check the power dissipation of these two solutions. He will make an assumption on the wire capacitance (most likely the worst case, typically 2 pF) and make a simple calculation using the usual formula:
P = α · Vdd² · C · F.
Assuming an F1 of 300 MHz and a switching activity of α = 0.5, the evaluation of the above expression results in 7 and 11 mW, respectively, for the 32- and 64-bit-wide buses. The chosen operating frequency will translate into constraints for the synthesis and implementation of the communication infrastructure that, at this stage, the designer simply cannot verify. Another way of doing the analysis is to use the PathFinding methodology and tool chain and derive the delay and power figures based on the following: • Actual traffic characterization – allowing the use of real activity, this time based on the actual data that appears on the communication medium rather than an average toggle probability
• Physical properties of the design and the packaging information – this information will of course be based on results acquired from the virtually prototyped circuits.
If we look more closely, the proposed MPSoC can be partitioned into two CMOS dies that can be manufactured independently:
1. Logic die (dark gray filled box), containing the different processing cores, the communication infrastructure, and the DRAM controller
2. Memory die (light gray filled box), with the memory arrays and associated control logic.
These two dies can then be packaged using the different scenarios shown in Fig. 7.25.
Scenario A – Two separate packages, PCB integration. Both the logic and memory dies are integrated in separate packages and implemented on a printed circuit board (PCB).
Scenario B – 3D-SIC with bonding wires. The memory die is flipped and placed face-to-face on top of the logic die. Connections between the dies are established using micro-bumps, while connections between the logic die and the package are made using wire bonding. While this scenario represents the current state-of-the-art technique for 3D-SIC integration, it is a worst case from the QoR perspective because of the higher inductance of the bonding wires compared with TSVs, as suggested in [30]. Therefore, in the remainder of this work we exclude this scenario and focus only on implementation scenarios using TSVs.
Scenario C – 3D-SIC using TSVs and the existing memory die with IO pads. The logic die is flipped; connections from the logic die to the IO pads are made using micro-bumps. The memory die is also flipped and placed on top of the logic die, in a face-to-back fashion.
[Figure: cross-sections of packaging scenarios A–E, showing the memory and logic dies on the PCB, with the backside redistribution layer (RDL) indicated for scenario C.]
Fig. 7.25 Different packaging scenarios for MPSoC implementation
Connections between the two dies are established using TSVs, from the first metal layer of the logic die connected to the backside metal redistribution layer (marked RDL in Fig. 7.25). The micro-bumps of the memory die are placed on the existing memory IO openings. In this scenario only the logic die has to be processed for 3D.
Scenario D – Narrow TSV array, without IO pads on the memory interface of either die. In this scenario the configuration of the dies is the same as in the scenario above: both dies are flipped and face-to-back oriented. However, instead of a backside metal redistribution layer, TSVs are used to connect the logic die directly to one of the metal layers of the memory die. Note that in this case both dies have to be processed for 3D.
Scenario E – Wide TSV array, without IO pads on the memory interface of either die. This scenario is the same as Scenario D, except that the width of the data path between the logic and the memory die, i.e., the size of the DRAM memory interface, is increased. Such an approach can lead to significant savings in power dissipation and/or an increase in the available bandwidth, as suggested in [31].
For the different packaging scenarios described above, we demonstrate in the following how the PathFinding design methodology and the proposed tool chain were used to accurately estimate the delay and power dissipation of the wires connecting the logic and the memory die (from the DRAM controller on the logic die to the memory array control logic on the memory die).
Step 1 – In the system-level design exploration phase, instead of a cycle-accurate model of the complete MPSoC (with all components described at the RTL level), we used simplified platform models for shorter simulation times. All processing cores are replaced with cycle-accurate trace generators. In order to collect valuable memory statistics, the trace generators are fed with trace files containing all DMA transfers across the AMBA bus to the DRAM. These trace files are generated by executing the application code (in this case an MPEG-4 simple profile encoder) on standard video sequences (Foreman, etc.). This environment thus accurately models the synchronization and timing of the target platform. Different traffic traces have been captured and saved using the Value Change Dump (VCD) file format. This information was passed to the physical design prototyping phase (step 3) for QoR assessment.
Step 2 – RTL elaboration has been used to generate the VHDL files for the processors, the DRAM controller, and the AHB bus. The RTL description of the CPUs (VHDL) has been generated from the processor template using IMEC tools. For the memory controller we used an existing third-party IP. A standard synthesis tool (Synopsys Design Compiler v.2006.06-SP4) has been used to generate the gate-level netlists. The AMBA-AHB bus has been generated from the CoWare environment from the high-level description of the system defined in Step 1. This model has been kept at the RTL level (VHDL description). The memory has been described as a black box, based on a floorplan and interface definition similar to the one presented in [29].
Step 3 – Different workspace environments have been created for each scenario using the gate-level models from step 2. For all experiments we assume a TSMC
90 nm high performance technology library. For Scenario A, we use the RC parameters of a typical PCB implementation, assuming 0.1 mm track width, 0.1 mm track thickness (height) and 40 mm track length, resulting in R = 0.194 ohms and C = 1.83 pF. In order to mimic the effects of inductance in the PCB seen in SPICE analysis while using the static timing analysis in the PathFinding tool, we modeled the board by increasing the metal interconnect resistance to a value similar to the worst-case characteristic impedance. For Scenarios C, D and E, the TSV geometrical and electrical models correspond to the IMEC technology. The TSVs have a diameter of 5 µm and a pitch of 25 µm, with the following electrical characteristics: R = 0.018 ohm and C = 0.237 pF. For Scenario C we assume the following values for the backside metal redistribution layer: R = 0.018 ohm and C = 0.237 pF. QoR measurements were made on a virtually routed design. Power assessment of the interconnect wires between the Logic and Memory dies was made using a Switching Activity Interchange Format (SAIF) file, automatically generated from the VCD file obtained in Step 1. The results obtained during physical prototyping are shown in Fig. 7.26. The graph indicates, for all nets of the Logic-to-Memory die interconnect, the minimal and maximal resistance and capacitance, the maximal and total power dissipation, and the minimal and maximal delay. Based on the results obtained, the designer can now derive the actual operating frequency of the circuit, and therefore make the right decision about the data path width. At the same time, the designer can verify the power dissipation of the wires, which is 0.5 and 0.9 mW for Scenarios D and E, respectively. Compared to the figures obtained using high-level estimation, we can clearly see an overestimate of about ten times for the power dissipation.
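To make the contrast between the quoted PCB and TSV parameters concrete, the short Python sketch below computes the lumped RC time constant and the switching energy per wire from those numbers. This is an illustrative calculation only; the 1.0 V supply, the 0.25 activity factor, and the 200 MHz clock are assumptions, not values taken from the case study, and driver resistance is not included.

# Minimal sketch contrasting the PCB (Scenario A) and TSV (Scenarios C-E) nets.
nets = {
    "PCB trace (Scenario A)": {"R": 0.194, "C": 1.83e-12},
    "TSV (Scenarios C-E)":    {"R": 0.018, "C": 0.237e-12},
}

VDD = 1.0        # assumed core supply for the 90 nm library, in volts
ALPHA = 0.25     # assumed switching activity per clock
F_CLK = 200e6    # assumed operating frequency, in Hz

for name, p in nets.items():
    tau = p["R"] * p["C"]                    # intrinsic RC time constant (driver R not included)
    e_transition = 0.5 * p["C"] * VDD ** 2   # energy per output transition
    p_dyn = ALPHA * p["C"] * VDD ** 2 * F_CLK
    print(f"{name}: tau = {tau*1e12:.3f} ps, "
          f"E/transition = {e_transition*1e15:.1f} fJ, "
          f"P_dyn = {p_dyn*1e6:.1f} uW per wire")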
Fig. 7.26 Performance parameters of the MPSoC case study

Scenario  Cmin (pF)  Cmax (pF)  Rmin (x100 ohms)  Rmax (x100 ohms)  Pnmax (x10 mW)  Pint (mW)  Dmin (ns)  Dmax (ns)
A         6.2        11.2       2.7               9.5               4.2             6.5        0.7        5.5
C         2.1        4.2        0.97              5.69              2.5             2.1        0.2        1.3
D         0.5        1          1.05              4.51              0.3             0.5        0.04       0.2
E         0.4        0.9        0.8               4                 0.3             0.9        0.04       0.3
7.5 Summary and Conclusions In this chapter, we presented two incremental design methodologies for the efficient design of 3D integrated circuits: PathFinding and TechTuning. Using state-of-the-art EDA tools for high-level synthesis (HLS) and virtual physical prototyping together with our own compact thermal and mechanical models, we were able to assemble a complete tool chain for the exploration of 3D-SICs. The main features of the presented flow include:
• Automatic handling of the 3D interconnect components (TSVs) from technology files
• Semi-automated translation of high-level to RTL models using HLS
• Virtual synthesis, place, and routing of the design for fast generation of the 3D physical prototypes
• Design analysis: timing optimization and reporting, area and power reporting, congestion and routing analysis
• Design automation capabilities allowing fast exploration of many design options at both the system and physical level
The proposed tool chain has been applied to three different case studies, chosen to highlight different aspects of the flow. The following conclusions can be made based on the results obtained:
• HLS can be effectively used to generate accurate models of the computational and communication elements.
• The automated process of virtual synthesis, place and route can be used for the exploration of many system-level and physical parameters in a short amount of time.
• Figures of merit (FoM) obtained through physical prototyping are more accurate than current design practice based on "back of the envelope" calculations.
• Accurate FoM can help system-level designers make the right decisions early in the design flow, decisions that otherwise could not be taken.
• Compact thermal and mechanical models can be used to drive placement and routing.
Acknowledgements The authors would like to thank Riko Radojcic for his visionary contributions to PathFinding and TechTuning. They would also like to express their gratitude to Srinivasan Murali and Federico Angiolini from iNoCs for their valuable help in making this vision become reality.
References 1. Al-Sarawi S, Abbott D, Franzon P (1998) A review of 3-D packaging technology, components, packaging, and manufacturing technology, part B: advanced packaging. IEEE Trans Compon Packag Manuf Tech 21(1):2–14 2. Swinnen B, Jourdain A, De Moor P, Beyne E (2008) Wafer level 3-D ICs process technology, Tan S, Gutmann RJ, Reif LR (Eds), Springer, ISBN 978-0-387-76532-7
3. Das S, Fan A, Chen K et al (2004) Technology, performance, and computer-aided design of three-dimensional integrated circuits. In: ISPD’04 proceedings of the 2004 international symposium on Physical design. ACM, New York, pp 108–115 4. Gupta S, Hilbert M, Hong S et al (2004) Techniques for producing 3D ICs with high-density interconnect. In: 21st international VLSI multilevel interconnection conference, Waikoloa Beach 5. Beyne E, Swinnen B (2007) 3D system integration technologies, integrated circuit design and technology. In: IEEE International Conference on Integrated Circuit Design and Technology, 2007 (ICICDT ’07), pp 1–3 6. Martin G, Smith G (2009) High-level synthesis: past, present, and future. IEEE Des Test Comput July/August:18–25 7. Aditya S, Kathail V (2008) Algorithmic synthesis using PICO.In: Coussy P, Morawiec A (eds). High-level synthesis: from algorithm to digital circuit. Springer, New York 8. Bollaert T (2008) Catapult synthesis: a practical introduction to iterative C synthesis. In: Coussy P, Morawiec A (eds). High-level synthesis: from algorithm to digital circuit. Springer, New York 9. Deng Y, Maly W (2004) 2.5D system integration: a design driven system implementation schema. In: Proceedings of ASPDAC, pp 450–455 10. Goplen B, Sapatnekar S (2005) Placement of thermal vias in 3-d ICs using various thermal objectives. IEEE Trans Comput Aided Des Integrated Circ Syst 25(4):692–709 11. Li Z, Hong X, Zhou Q et al. (2006) Efficient thermal via planning approach and its implications on 3-d floorplanning, IEEE Trans Comput Aided Des Integrated Circ Syst 26:645–658 12. Cong J, Wei J, Zhang Y (2004) A thermal driven floorplanning algorithm for 3D ICs. In: Proceedings of the international conference on computer aided design, pp 306–313 13. Cong J et al. (2005) Thermal driven multi-level routing for 3-D ICs. In: Proceedings of Asia Pacific DAC 2005, pp 121–126 14. Chun C, Corleto J, Nowak M, Radojcic R (2208) Virtual design for technology exploration – a process design integration methodology for a fabless entity. In: International conference on integrated circuit design and technology, pp 125–130 15. Nowak M, Corleto J, Chun C, Radojcic R (2008) Holistic pathfinding: virtual wireless chip design for advanced technology and design exploration. In: DAC ’08: Proceedings of the 45th annual conference on design automation. ACM, New York, pp 593–593 16. Mei B, Sutter B, Aa T et al (2008) Implementation of a coarse-grained reconfigurable media processor for AVC decoder. J Signal Process Syst 51(3):225–243 17. AutoESL, AutoESL Design Technologies, Inc. 20245 Stevens Creek Blvd., Suite 200, Cupertino, CA 95014, http://www.autoesl.com/ 18. Wolkotte PT, Smit GJM, Kavaldjiev NK, Becker JE, Becker J (2005) Energy model of networkson-chip and a bus. In: Nurmi J, Takala J, Hamalainen TD (eds). In: Proceedings of the international symposium on system-on- chip (SoC 2005), Tampere, Finland. IEEE, Piscataway, pp 82–85 19. Goossens K, Dielissen J, Radulescu A (2005) The Æthereal network on chip: concepts, architectures, and implementations. IEEE Des Test Comput 22(September-October):21–31 20. Bertozzi D, Benini L (2004) Xpipes: a network-on-chip architecture for gigascale systems-onchip. IEEE Circ Syst Mag 4:18–31 21. Sonics inc. 890 N. McCarthy Blvd, 2nd Floor, Milpitas, CA 95035, USA. http://www. sonicsinc.com/ 22. Arteris SA, 6 Parc Ariane – Immeuble Mercure, Boulevard des Chenes, 78284 Guyancourt Cedex, France. http://www.arteris.com 23. 
iNoCs, Route de Chavannes, 27D, 1007 Lausanne VD, Switzerland. http://www.inocs.com/ 24. Atrenta, 2077 Gateway Place, Ste 300, San Jose, California 95110, USA. http://www.atrenta.com/ 25. Torregiani C, Oprins H, Vandevelde B, Beyne E, De Wolf I (2009) Thermal analysis of hot spots in advanced 3D-stacked structures. In: 15th International workshop on thermal investigations of ICs and systems – THERMINIC 2009, Leuven, Belgium, 7–9 October 2009 26. Torregiani C, Oprins H, Vandevelde B, Beyne E, De Wolf I (2009) Compact thermal modeling of hot spots in advanced 3D-stacked structures. In: 11th Electronics packaging technology conference – EPTC 2009, Singapore, 9–11 December 2009
27. Oprins H, Cupak M, Van der Plas G, Vandevelde B, Marchal P, Srinivasan A, Cheng E (2009) Fine grain thermal modeling of 3D stacked structures. In: 15th international workshop on thermal investigations of ICs and systems – THERMINIC 2009, Leuven, Belgium, 7–9 October 2009, pp 45–49 28. Milojevic D, Montperrus L, Verkest D (2008) Power dissipation of the network-on-chip in multiprocessor system-on-chip dedicated for video coding applications. J Signal Process Syst 15 29. Lee DU, Lee HW, Kwean KC, et al (2006) A 2.5Gb/s/pin 256Mb GDDR3 SDRAM with series pipelined CAS latency control and dual-loop digital DLL. In: Solid-state circuits, 2006 IEEE international conference digest of technical papers, pp 547–556 30. Pak JS, Ryu C, Kim J, et al (2008) Wideband low power distribution network impedance of high chip density package using 3-D stacked through silicon vias. In: APEMC 2008, Asia-Pacific symposium on electromagnetic compatibility and 19th international Zurich symposium on electromagnetic compatibility, 2008, pp 351–354 31. Kumagai K, Yang C, Izumino et al (2006) System-in-silicon architecture and its application to H.264/AVC motion estimation for 1080HDTV. In: Solid-State Circuits Conference, 2006. ISSCC 2006. Digest of Technical Papers. IEEE International, pp 1706–1715
Chapter 8
3D Stacking of DRAM on Logic Trevor Carlson and Marco Facchini
8.1 Introduction An ever-increasing number of transistors are being placed onto silicon dies in both small and large computing systems. The shrinking of the transistors on these chips has allowed an increasingly large amount of computing power to be brought to smaller and smaller devices. There have been many advances in computing architectures that have allowed this pace to continue. Examples can be seen through the development of combining numerous application processors into MPSoCs (multiprocessor systems on a chip) as well as techniques like hardware multithreading. Through the increase in both complexity of chip designs and the number of those chips occupying a single die, our computers now require greater bandwidth to an ever increasing amount of system DRAM memory. By optimizing these systems with TSVs, one can alleviate this memory bottleneck while simultaneously reducing the overall energy consumption of the complete computing platform.
8.1.1 Why Focus on DRAM? High-density DRAM memories are primarily used to store large portions of the working set of data that needs to be processed but that is not able to fit into the fast, local cache memories of the processors. Generally, it consists of images or other media, but it will also hold the instructions used by the machine to perform the actual work. Given this tight interaction between system memory and the processing units, we could benefit greatly by having a high-bandwidth interface to the system DRAM. However, the manufacturing processes required to produce efficient DRAM wafers are specifically optimized for the storage of data in memory cells and not for switching speed which drives performance for standard logic-based chips. Thus, DRAM is sold as a
separately packaged and optimized data storage product. The packages themselves have a relatively low number of pins used for data access, typically 8-, 16-, 32-, or 64-bits wide. Embedded systems typically contain just one or two memory chips to keep overall energy utilization low. Nevertheless, this limits the total memory communication bandwidth available to the processing cores. General purpose computing, on the other hand, tends to use many groups of these memory chips on DIMMs (dual in-line memory modules) to increase the overall width, and therefore overall bandwidth to a system. The use of a large number of chips simultaneously allows for increased performance, but at the cost of significantly increased energy use. One solution to this performance bottleneck could be to use embedded DRAM (eDRAM) as a main memory replacement. By modifying the wafer fabrication process, one can come to a middle-ground, achieving good performance for both the logic and DRAM portions of the die. Embedded DRAM can be designed at a bit-width suited to the performance needs of the logic component. This could allow for a 512-bit wide interface directly to the eDRAM, allowing for extremely high bandwidths. However, eDRAM has some significant disadvantages, including the reduction of wafer yield through increased chip area, the increased cost due to the higher complexity of the manufacturing processes and the reduced performance of components, and the distribution of additional voltage levels since the optimal DRAM voltage is not typically the same as that of the logic [8]. Typically, one might see eDRAM used as an on-chip cache with capacities hundreds of times smaller than that found in a main system memory. For these memory capacities where the die sizes for both logic and memory are similar, it would be ideal to keep them as separate dies, and not integrated onto the same die, significantly increasing its area. 3D integration with TSVs is a promising solution because it enables high-bandwidth communication between two dies which have been tailored for a specific purpose. It goes beyond existing wafer-level techniques by allowing higher-density interconnects with better electrical characteristics to provide higher bandwidth with lower energy consumption. In this chapter, we provide an overview of current memory technologies and trends, and examine why the memory subsystem in today's computing devices is often the performance bottleneck. Through the use of 3D-TSV technologies, a higher bandwidth pathway to the system memory helps to overcome these application bottlenecks through a significant increase in bandwidth. With targeted use of TSVs in the memory subsystem of computing systems, this additional bandwidth can provide improved performance with lower system energy use.
8.2 DRAM Structure and Overview Several options for the storage of data used in digital systems currently exist. Most of them are suited for accessing random locations in a matrix of memory cells, known as RAM, or random access memory. There are two distinct classes of RAM found in computing devices today: nonvolatile memories like flash which retain their data after the power has been removed; and volatile memories that need to be
continuously powered-up for the data to remain valid. In this section, we detail the inner structures of one class of volatile memory: DRAM, and describe how it is accessed and used in a conventional computing system. In the following sections, we then describe how we can optimize the system to take advantage of the strengths of both DRAM and TSVs to move beyond its current limitations.
8.2.1 Fundamental Structures of Dynamic and Static Memories SRAM (static random access memory) is able to maintain its data for as long as a power supply is provided to the memory cells. In contrast, DRAM, or dynamic random access memory, can retain the data only for a limited amount of time. Indeed, the fundamental DRAM element is composed of a single capacitor whose charge is controlled by a single transistor. A bit is set by storing a given number of electrons into the capacitor while the transistor holds the charge in place until it is needed at a later time. Because there are leakage effects in the transistor, the charge representing the data decreases over time. This means that unless the DRAM bits are periodically refreshed, the data contained within them will become invalid. The advantage of this type of memory is the ability to store data without the use of active transistor loops or other logic circuits as required for SRAM. Indeed a single bit of an SRAM cell is composed of two cross-coupled inverters whose state is controlled by two additional transistors enabled by the word lines or WLs. The BLs, or bit lines, contain the actual value of the data to store into the memory. As indicated by the BL arrows in Fig. 8.1, a write cycle would occur for both RAM types when WL is high, and the BLs are driven to the appropriate value.
Fig. 8.1 DRAM and SRAM bit cells. Enabling WL and placing data on BL will store the BL value into the cell
One critical difference between the usage of DRAM and SRAM is the ability of the devices to retain data after a read operation. With DRAM, once a read is
performed, all the charge has been read from the capacitor, destroying its current value. One will need to write the same value back into the DRAM capacitor cell to continue to store that data. SRAM cells are entirely different. As they are designed to hold their state, reading from them is nondestructive, as indicated by the name static RAM. In most computing platforms found today, SRAM tends to be physically close to the processor, acting as either a high-speed cache to store frequently accessed data items or as a buffer (e.g., separating logic pipeline stages or interconnecting separate clock domains). Its ability to retain data while being read is one significant factor for the use of SRAM in critically fast storage for chip components. If a DRAM cell was in the process of being written to, or being recharged, there would be a delay when requesting data from a particular DRAM storage location. These performance characteristics make it unsuitable for completely random, very-low-latency access. In both area and power utilization, a single DRAM cell performs better than an SRAM cell. Indeed, DRAM is much denser than SRAM (up to four times more dense) and it consumes significantly less power per memory access.
8.2.2 Fundamentals of DRAM It is the density of DRAM that allows the relatively large arrays of memory to be designed. DRAM is usually organized into arrays of off-chip memory. The DRAM itself is organized into a hierarchy of structures that allow for the reading and writing of data in an efficient manner. Collections of DRAM cells are arranged in a 2D grid of rows and columns (Fig. 8.2). One row at a time can become active, through the process of activating the bit lines and sensing the data from the entire row.
Fig. 8.2 The organization of DRAM bit cells into rows, columns, and banks. Write-back wiring from the latches to the DRAM is not shown for clarity
Latches are used to store the values of the row from each of the DRAM
cells, as reading from them is a destructive operation. The bits from the entire row can then be read out of the DRAM in a piecemeal fashion from the column decoder. As column addresses are presented to the DRAM, portions of the row are fed out of the DRAM and read by the system. In addition, the values of the individual latches can be changed from the external interface, allowing for write operations. To complete the access, once read or write operations on that particular row have been completed, the cells must be restored to their previous state. The values from the latches are then stored back into the DRAM cells, freeing up the sense amplifiers and latches for operations on the next row.
8.2.3 DRAM Internals The internal memory storage of the DRAM itself is not based on synchronous, clock-based logic, but on asynchronous logic and timing information. Early DRAM chips operated without clocks and exposed almost all the internal operation of the chips to the system itself. Because of the dynamic nature of DRAM, and the refreshing that is needed for each of the cells, a DRAM memory controller is needed to keep track of the internal state of the DRAM. It controls and initiates refresh operations, as well as translating memory addresses from the computing system into row and column requests. For example, in a JEDEC standard-compliant DRAM module, the maximum amount of time that can pass before refreshing the device is known as the refresh time (tREF), which is defined as 64 ms. There are additional parameters that describe how the DRAM operates: the time for activation, the time to read data from a row (row activation), and the time to restore the latch data into the array (precharge). Each DRAM has specific timing conditions that need to be met in order to operate correctly, and it is the role of the memory controller to maintain this internal state of the DRAM while respecting these timing constraints (Table 8.1).

Table 8.1 Typical DRAM timing parameters

Parameter  Description                     Min   Max   Units
tCK        Clock period                    5     7.5   ns
tRC        Row cycle time                  55    –     ns
tRCD       Row-column delay                15    –     ns
tRAS       Row active time                 40    70    ns
tRP        Row precharge time              15    –     ns
tWR        Write recovery time             15    –     ns
tRRD       Act-to-act delay time           10    –     ns
tRFC       Row refresh cycle time          70    –     ns
tREFI      Average refresh interval time   –     7.8   µs

Source: Micron 256 MB DDR SDRAM, model MT46V16M16 (http://www.micron.com)
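As an illustration of the address translation mentioned above, the Python sketch below splits a linear word address into bank, row, and column fields. The 4-bank, 8,192-row, 512-column organization is an assumed geometry consistent with a 256 Mb x16 device; real controllers use more elaborate, configurable mappings.

# Sketch of a simple bank-interleaved address mapping (assumed geometry).
BANKS, ROWS, COLS = 4, 8192, 512   # assumed organization: 4 x 8192 x 512 x 16 bits

def decode(word_address: int):
    """Return (bank, row, column) for a linear word address."""
    column = word_address % COLS
    bank = (word_address // COLS) % BANKS   # consecutive 512-column blocks map to successive banks
    row = word_address // (COLS * BANKS)
    return bank, row, column

# Two consecutive 512-word blocks land in different banks, which is what lets
# the controller overlap activation of the next row with the current transfer.
print(decode(0))      # (0, 0, 0)
print(decode(512))    # (1, 0, 0)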
JEDEC, or Joint Electron Device Engineering Council, is an organization that defines standards for electronics, including DRAM module specifications such as timing and communications protocols. The goal is for interoperability of DRAM chips from different DRAM manufacturers in computing systems.
In addition to refreshing of data in the DRAM itself, the memory controller also requests to read and write data to the memory on behalf of the system requests. Internal components either directly connected to the memory controller or attached to it via an internal communications structure, such as a bus, request data from this large memory. These standard access requests need to be interleaved with the DRAM controller’s memory refresh rate. Nevertheless, in this DRAM example, it would not be possible to refresh the entire memory all at once because it contains only a single row of sense amplifiers and latches. Therefore, the memory controller will need to balance the timing of memory refreshes with the data requests to serve the rest of the system. For example, if a DRAM device contains 8,192 rows and needs to refresh the DRAM cells every 64 ms (tREF), then the average interval that refreshes will need to occur will be 64 ms/8,192 or approximately every 7.8 µs (tREFI).
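The refresh arithmetic from the paragraph above is written out in the short sketch below. tREF and the row count follow the example in the text; the per-row tRFC used to estimate the refresh overhead is the 70 ns value from Table 8.1 and is only indicative.

# Refresh interval and overhead, following the example in the text.
T_REF = 64e-3        # full-array refresh window, 64 ms
ROWS = 8192          # rows to refresh within that window
T_RFC = 70e-9        # row refresh cycle time from Table 8.1

t_refi = T_REF / ROWS                      # average interval between refresh commands
overhead = (ROWS * T_RFC) / T_REF          # fraction of time spent refreshing

print(f"tREFI ~= {t_refi*1e6:.1f} us")     # ~7.8 us, matching the text
print(f"refresh overhead ~= {overhead*100:.2f} % of device time")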
8.2.4 Hiding DRAM Latency Let us consider the connection from the processing cores through a memory controller to a single DRAM chip. If we would like to stream data from the DRAM at a constant rate, we would need to continue to request data from a single activated row, column by column until all the row's data has been read. Nevertheless, once we have reached the end of the row, we would need to precharge it from the latch, and activate the subsequent row for reading. Unfortunately, this will force us to stop retrieving data from the DRAM and decrease our overall streaming performance. As an optimization to allow for multiple pieces of memory to be accessible at the same time, multiple banks in a single DRAM were introduced. By spanning the address space across multiple banks instead of a single bank, the next sequential address of memory can be activated from the next bank of memory early enough to allow the access of the addresses from the second bank (Fig. 8.3).
Fig. 8.3 Two sequential reads from different banks. CAS latency = 3 and burst length = 4
By introducing an additional memory bank, we are allowing for additional bandwidth in the system. Nevertheless, this is at the cost of chip area as well as both static and dynamic power utilization. Although we could replace a single memory array with two additional
arrays, each with half the size of the original, column decoders would need to be duplicated, and additional logic would need to be added to select data from the appropriate bank. Although this technique increases the sequential bandwidth available to processors and accelerators, the latency or amount of time it takes to access a random byte in the memory has not changed. Overall, the latency remains the same as the sum of the activation time and column data response time.
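The toy timing model below illustrates the bank-interleaving effect for short bursts taken from many different rows, which is the worst case for a single bank. The clock period, tRCD and tRP follow Table 8.1; the CAS latency, burst length and request count are assumptions, and the model ignores tRAS and write recovery for simplicity.

# Toy model: bank interleaving hides activate/precharge under earlier bursts.
T_CK, CL_CYCLES, BURST = 5e-9, 3, 4
T_RCD, T_RP = 15e-9, 15e-9

burst_time = BURST * T_CK
row_overhead = T_RP + T_RCD + CL_CYCLES * T_CK     # dead time before data returns

REQUESTS = 1000                                    # each request hits a new row

one_bank = REQUESTS * (row_overhead + burst_time)
# With enough banks, every row_overhead overlaps an earlier burst and only the
# first request pays the full latency.
many_banks = row_overhead + REQUESTS * burst_time

for label, t in (("one bank", one_bank), ("interleaved banks", many_banks)):
    print(f"{label:18s}: {t*1e6:7.1f} us, "
          f"bus utilization {REQUESTS*burst_time/t*100:5.1f} %")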
8.2.5 Interconnect Options for SDRAM The characteristics of the connectivity between the core processing units and the memory interface play a very important role in overall system performance and power utilization. The JEDEC SDRAM specifications, such as DDR2 or DDR3, are usually labeled according to the performance characteristics of the bus, as a combination of the clock frequency of the data bus and the number of data bus transactions that can occur during that time. This speed indicator is used to determine the maximum bandwidth a DRAM chip can transfer, typically in megabytes per second, assuming that all the latency of activation and precharging of the DRAM itself is hidden by the pipelined use of the chip's separate banks. See Table 8.2 for the bandwidth and latency characteristics of a few of these standardized interfaces.

Table 8.2 Bandwidth and access times for commonly used DRAM interface standards

Standard name   Maximum BW (MB/s)   Access time (ns)
SDRAM           191–381             20–30
DDR SDRAM       381–763             20–30
DDR2 SDRAM      763–1,526           20–30
DDR3 SDRAM      1,526–3,052         20–30

Bandwidths are calculated for the minimum and maximum chip frequency specifications using a single 16-bit four-bank SDRAM chip as a reference. Here, we define the access time as the minimum amount of time required to activate a row (tRCD) and to receive the first burst of data from that row (CL).

The SDRAM, or synchronous DRAM, standard first introduced the concept of pipelined access to DRAM memories, allowing the access of one bank of memory while another bank is being prepared for access. This standard called for four memory banks with a maximum frequency of 200 MHz. As the demand for memory bandwidth has increased, so has the bandwidth between the computational units and the DRAM chips. Additional DRAM interface standards have been introduced to bring larger bandwidths through higher bus frequencies. Nevertheless, these increasing frequencies have brought about the need for more accurate and therefore complex I/O interface circuitry. These on-chip I/O circuits, along with their corresponding printed circuit board (PCB) interconnect implementing the SSTL signaling protocol, are optimized for data transfer speed and typically consume a relatively large amount of static and dynamic energy. Recently, LP-DDR or low-power DDR has been introduced to reduce the overall energy
consumed by the memory interface, but does so at a significantly reduced bandwidth. The LP-DDR standard reintroduces the CMOS-TTL signaling methodology originally used in SDRAM components, which removes the on-PCB components that both allow for the higher frequencies and use a significant amount of energy.
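The peak-bandwidth column of Table 8.2 can be reconstructed from the bus clock, the number of transfers per clock, and the 16-bit bus width, as in the sketch below. The clock ranges used here are the usual JEDEC ranges and are assumptions for illustration; the result is expressed in binary megabytes per second to match the table.

# Peak bandwidth = data-bus clock x transfers per clock x bus width (16 bits).
MIB = 2 ** 20
standards = {                      # (min clock Hz, max clock Hz, transfers per clock)
    "SDRAM":       (100e6, 200e6, 1),
    "DDR SDRAM":   (100e6, 200e6, 2),
    "DDR2 SDRAM":  (200e6, 400e6, 2),
    "DDR3 SDRAM":  (400e6, 800e6, 2),
}
WIDTH_BYTES = 2                    # 16-bit data bus

for name, (f_min, f_max, tpc) in standards.items():
    bw_min = f_min * tpc * WIDTH_BYTES / MIB
    bw_max = f_max * tpc * WIDTH_BYTES / MIB
    print(f"{name:12s}: {bw_min:5.0f} - {bw_max:5.0f} MB/s peak")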
8.3 The DRAM Bottleneck in Embedded Systems In this section, we review some common embedded system attributes and detail some of the future applications that, when implemented, require bandwidths that are significantly larger than what embedded systems today provide.
8.3.1 Overview of Current Embedded System Designs Software-based solutions on typical general-purpose computing systems require a huge amount of bandwidth to their memories, as the streams being compressed and decompressed, along with all intermediate data, need to be stored in the cache hierarchy. In a fully custom VLSI-ASIC hardware solution, many megabytes of SRAM storage in FIFOs and local buffers are used to provide a stream-based data flow so that the hardware will have low-latency and high-bandwidth access to main memory. But even hardware-based solutions for embedded systems will be constrained, as the need to keep energy and cost to an absolute minimum for a particular application will force designers to make tradeoffs that result in smaller amounts of power-consuming SRAM memory where necessary. For example, a typical workstation-class desktop machine could contain a 128-bit data bus to a number of SDRAM DIMMs and in turn chips that typically provide 16–32 bits of data per chip. Keeping a large number of chips powered-up and ready to be accessed consumes quite a bit of energy. Even though low-power memory states exist, they only allow for a reduction in energy utilization when the chips are not being accessed. Reading or writing from these memory devices in parallel is a common way to create a larger working set of data than can normally be accessed from a single DRAM chip, which typically supplies 8–32 bits. Using arrays of DRAM is helpful when accessing large amounts of data at the same time, but the cost of having to keep eight DRAM chips in the active state, all consuming active power, will add up to quite a bit of energy. For large servers with a significant amount of memory, one can multiply that by 32, 64, or more. In addition, communication of data and control bits through the system board traces using the SSTL protocol of standard DDR SDRAM requires a relatively large amount of energy to pass information back and forth between the memory controller and the SDRAM chips themselves. Therefore, in extremely small embedded systems where energy is at a premium, one might typically observe only a single connection from a custom ASIC or MPSoC to a memory chip in order to minimize the amount of energy that the system will consume for a particular task. As a result,
the energy costs per bit access will be significantly lower in these systems because only a single bank of the SDRAM chip will be activated at a particular time. This tradeoff creates bandwidth issues, as transferring data between the MPSoC system and the SDRAM itself will take much longer. The embedded application domain consists of platforms tailored for very particular purposes, such as audio compression or video transcoding. Nevertheless, with current trends of integration and consolidation of MPSoC platforms, a single chip can now manage all aspects of a system's tasks. For example, two components, an H.264 video encoder and decoder, might be placed on the chip simultaneously. The system architecture will need to be structured to handle conditions that might exist simultaneously, which places very specific requirements on the interconnect subsystem as well as on the memory controller to ensure that optimal use is made of the limited memory bandwidth.
8.3.2 Embedded Application Overview Ever-demanding user applications continue to drive the needed performance levels higher with each passing year. Higher-resolution video, images, and music that once could only be found on desktop workstations are now finding their way into handheld devices. A few examples include: the new wireless digital television standards requiring mobile devices to decode compressed streams in real time with ultra-low energy use; real-time video editing and effects for movies captured in the embedded environment; and high-definition 1080p30 video signals that need to be encoded on the mobile device for later storage. Each of these multimedia applications requires an ever increasing amount of bandwidth for the retrieval of the uncompressed streams from the video sensors and for the storage of the compressed stream back into the memory to be sent along the wireless link for storage or possibly video conferencing. These newer multimedia applications require greater bandwidths to support the decoding of each received attendee's video and audio streams, and the necessary bandwidth quickly becomes excessive. There are many types of applications found on both general purpose and embedded systems. Many of the applications that access the memory subsystem and have a greater need for bandwidth are usually multimedia applications, such as video and audio decoders and encoders. In addition, for general purpose processing, the graphics chips will have a direct connection to their own memory subsystem, but for embedded computing, the graphics accelerators tend to share a system bus with a variety of additional high- and low-bandwidth peripherals. The applications themselves vary in purpose, but generally fall into a number of categories based on the current system configuration. Applications tend to be bandwidth bound when they need to stream large amounts of data. In this case, the data itself is processed in a way that takes advantage of the existing data layout and hardware or software caching strategies. For example, if the application requires only simple sequential access to data without the need to reread previous portions
of the data stream, a caching architecture that provides read-ahead prefetching will be sufficient to fetch the data early enough to bring it into a cache with a very fast access time relative to the processor. Because the hardware has been able to predict how the application behaves, it will continue to prefetch the needed data before the processor needs it, ensuring that the application continues to operate at full speed. Nevertheless, if an application accesses data from different areas of main memory that are not able to be predicted correctly by the memory hierarchy, we observe an overall application slowdown. These types of applications tend to be latency bound, where they are most affected by the access latency required to retrieve data from higher levels of the memory hierarchy. Although applications are not strictly bandwidth or latency bound, depending on the current system configuration, both tend to play a role in the overall system performance and power use.
8.3.3 Embedded Applications’ System Demands Embedded devices are usually small computing devices that can be portable and operate on a limited power budget. Nevertheless, consumers are expecting more and more from them in terms of delivering exciting user experiences that are highly interactive while maintaining constant connectivity and interaction with the world. From mobile phones to portable media players, devices that once performed only a single function now take on many. The mobile devices of today are now complex, embedded devices, serving movies and music, recording video and audio or playing 3D games. This increasing integration of function into a single device normally does not increase the overall amount of data or bandwidth that needs to be processed on the system. Nevertheless, the simultaneous increase in the number of simultaneous functions coupled with increasing data rates and complex CODECs2 place additional demands on a system that normally would require only gradual additions to its data processing power and throughput. Enabling real-time video conferencing on a mobile phone is an example of an integrated application that requires a large amount of processing capabilities. In this example, each user would use their mobile phone to capture their own image, compress it, and send it to the other participant. In addition, the captured image would be overlaid on the picture received, so that the one holding the mobile phone could see an example of what they look like on the other end of the connection. In addition to these typical video-conferencing parameters, the expectations for future devices will be the ability to perform additional functions. The resolution of the
2 A CODEC or a compressor–decompressor. The process of compressing transforms a multimedia stream into a more compact form for storage or transport. A decompressor performs the opposite operation, recreating an exact or similar version of the multimedia stream for display or additional processing.
screen in this scenario is much higher than a typical embedded screen, with a required HD resolution of 1,920 × 1,080. In addition, the video captured and displayed on this device would be recorded from a high-resolution image sensor, compressed, and sent along the network to the recipient of the call. In this example, the user would also like to record the conversation for later use, but would like to keep the file size small, and therefore the video is transcoded3 into a format with a higher compression ratio suitable for archival. Finally, the user would like to apply a pseudo-3D effect4 while watching the video to provide an additional level of immersion [3]. Each step in this process requires an additional image processing step. The encoding and decoding of the video and audio streams, as well as transcoding of one encoded format will require sufficient computing power, but in addition, it will require enough bandwidth to be able to stream these decompressed as well as compressed streams into and out of the memory.
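To give a feel for the memory traffic in this conferencing example, the sketch below estimates the raw, uncompressed video bandwidth involved. The YUV 4:2:0 pixel format and the four read/write passes through memory are assumptions made only for illustration; they are not figures from the text.

# Rough estimate of uncompressed 1080p30 traffic through main memory.
WIDTH, HEIGHT = 1920, 1080
FPS = 30
BYTES_PER_PIXEL = 1.5          # assumed YUV 4:2:0 representation

one_stream = WIDTH * HEIGHT * BYTES_PER_PIXEL * FPS          # bytes per second
# Capture + encode, receive + decode, preview overlay and a transcode pass each
# touch an uncompressed stream at least once; four passes is an assumption.
PASSES = 4

print(f"one uncompressed 1080p30 stream: {one_stream/1e6:6.1f} MB/s")
print(f"with {PASSES} read/write passes     : {PASSES*one_stream/1e6:6.1f} MB/s")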
8.4 TSVs as an Enabler for Next-Generation Platforms In previous sections, we have described applications requiring an increasing amount of data throughput between the processing units and external memory modules. To satisfy the application throughput requirements, memories are equipped with serialization protocols (DDR, etc.) and they exploit internal parallelism by distributing data over multiple banks. These protocols have been introduced to cope with the limited number of wires that can be used to connect the memory die with the logic die. By increasing the number of wires on the board, static energy consumption increases with the width of the bus for DDR’s SSTL signaling technology. This means that even when not using the data and command busses, they will continue to consume energy. However, the adoption of TSV technology relaxes this limit for a number of reasons. With TSV technology, it will become possible to connect chips anywhere on the die, even in a distributed manner, and not arranged on the peripheral of the chip, as normally seen with standard chip I/O pads. Moreover, interconnect density increases because of the fine TSV pitch, allowing for wider bus designs with very little area penalty. In addition, the good TSV electrical characteristics allow for interface buffer simplification, thus reducing area and power requirements (Fig. 8.4). In this paragraph, we focus on these new possibilities enabled by 3D-TSV integration technology, describing some scenarios for future memory products.
3 Transcoding is the process of converting a compressed stream of data into another compression format. Usually transcoding is done to satisfy a need, such as an opportunity to utilize a newer compression format that allows one to store data in a format with a higher compression ratio than before while maintaining a similar level of quality.
4 This effect modifies the background of the image of the recipient of the phone call to make it appear three dimensional. For example, when the originator of the call moves her head as if to look around or behind the recipient, the background image is updated to provide the illusion of depth.
Fig. 8.4 A layout view, comparing an I/O bonding pad, a TSV and a transistor. The I/O pad is 80 microns wide, the TSV has a 5 micron diameter (10 micron pitch), and a 130 nm minimum-sized transistor is shown for scale
8.4.1 Current-Generation Platform Introduction 8.4.1.1 Intrasystem Interfaces Benefit from TSVs' Electric and Geometric Characteristics Electronic systems are in general composed of several components with a variety of purposes: memories, processors, displays, antennas, etc. These components are manufactured by different producers and thus they are normally designed, produced, and packaged separately. Bringing all these components together in the final system integration step consists of creating a design that is reliable and effective, while still being power efficient. An interconnect system is needed to allow the components to communicate with each other. Typically, the components' interfaces have been standardized to simplify this final design phase and optimize overall system performance. 8.4.1.2 Reducing Off-Chip Communication Requirements PCBs are primarily used to provide a common base for interconnecting all the system parts (Fig. 8.5a), because they provide a robust, reliable, and commonly used platform for electronic applications. However, in this scenario, the electric load of the I/O pins between the different components is quite large when compared with standard logic. Indeed, the package adds about 1 pF of capacitance and a series resistance of several hundreds of milliohms (Table 8.3). In addition, the copper-etched trace on an FR-4 dielectric PCB is on average long enough to duplicate these
Fig. 8.5 Various die-level packaging techniques: (a) dies connected by a PCB; (b) two dies in a multichip module; (c) chip stacking with wire-bonds; (d) face-to-face bonding of dies; (e) TSV interconnections

Table 8.3 Electrical characteristics of system components

           Resistance    Capacitance
Package    ~500 mΩ       ~1 pF
FR-4 PCB   ~500 mΩ/sq    ~400 fF/mm²
parasitic values. It follows that powerful I/O drivers are required to route the signals from the silicon die to the package and then over the PCB. Also, the PCB traces behave as transmission lines, adding problems like reflections, signal dispersion, interferences, etc. Consequently, the interface design has to take into account signal integrity issues, making the I/O design more complex by adding dedicated structures (i.e., terminations, differential receivers, etc.). Systems that contain these powerful but complex I/O drivers will experience significant additional energy consumption during the operation of these intrasystem interfaces. For low-power applications, compromises will therefore need to be made by both keeping the number of active connections to each chip low and minimizing the total number of memory chips used. This will keep static power consumption per bit-line and memory chip to a minimum.
Reducing the need for off-chip communication can allow for a considerable reduction in the total system power consumption. System-on-a-chip (SoC) design minimizes off-chip communication by placing the entire system on a single silicon die. In this way, an almost unlimited bandwidth is available between the various system components, and the power used for communication is significantly reduced. However, it is not always possible or easy to place every single system component onto a single silicon die. Moreover, even when it would be possible, the increased silicon cost (the yield reduces as silicon area increases) and the complexity of the system design and of the manufacturing process may not justify the power/performance benefits. Also, the chip resulting from a SoC design develops on the two dimensions of the single silicon die, making packaging problematic and requiring larger form factors of the final product. In the last few years, several methods have been proposed to interconnect and package more dies together into a single chip, like system in package (SiP). Typical single die packaging techniques, such as wire-bonds or solder bumps, are used to create the interconnections between the different dies that are normally placed one over the other to optimize the package form factor (Fig. 8.5c, d). Solder bumps can be used to connect two chips in a face-to-face configuration (Fig. 8.5d). This solution allows for increasing the pin-count between the two dies and thus the bandwidth. However, only two dies can be stacked in this way, and signals have to be routed among all the metal layers on both dies before reaching the transistors, which increases routing effort. Packaging with wire-bonds (Fig. 8.5c) does not have such a limitation on the number of chips that can be stacked, but overall package height and connection density limit the number of dies. The electric characteristics of the interface wiring are still acceptable (Table 8.4) but require dedicated drivers. In addition, the position of the I/O pads is confined to the chip edge, which again limits the communication performance between dies. TSVs can eliminate most bandwidth bottlenecks at the interface between dies. As illustrated in Table 8.4, the electrical characteristics of the TSV are many times better than those of other interconnect options. Therefore, the signals that have to be propagated on TSVs require less amplification and therefore less energy. Moreover, TSVs have minimal inductive characteristics and the reduced length allows one to consider them as any other on-chip interconnect network, avoiding additional signal integrity issues. This translates into simplified I/O interface design, power savings, and bandwidth improvement. The fine TSV pitch (~10 µm) allows one to place up to 100 TSVs in the space of a single I/O pad used for wire-bonding. This allows for an increase of the available bandwidth between the two dies by a factor of 100, enabling the possibility for high performance and low-voltage off-chip communication.

Table 8.4 Typical parameters of a few 3D interconnect options

Interconnect    Resistance    Capacitance   Typical diameter (µm)   Minimum spacing (µm)   Typical length
Solder bumps    50 mΩ         ~250 fF       35                      15                     ~20 µm
Wire-bonds      130 mΩ/mm     66 fF/mm      25                      –                      ~1 mm
TSVs            ~20 mΩ        ~40 fF        5                       ~5                     ~25 µm
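The sketch below uses the Table 8.4 capacitances to compare the charge-based switching energy of the three interconnect options, and recomputes the pads-per-area argument from the text. The 1 mm wire-bond length, the full SSTL2 voltage swing, and the 100 µm pad pitch are assumptions used only for illustration.

# Per-transition energy for each interconnect option, plus TSVs per pad footprint.
V_SWING = 2.5                 # SSTL2 VDDQ, volts
options = {                   # effective load capacitance per connection
    "Solder bump": 250e-15,
    "Wire-bond (1 mm)": 66e-15 * 1.0,     # 66 fF/mm x assumed 1 mm length
    "TSV": 40e-15,
}
for name, c in options.items():
    print(f"{name:18s}: {0.5 * c * V_SWING**2 * 1e15:6.1f} fJ per transition")

PAD_PITCH = 100e-6            # assumed wire-bond pad pitch
TSV_PITCH = 10e-6             # ~10 um TSV pitch from the text
print(f"TSVs per pad footprint: {(PAD_PITCH / TSV_PITCH) ** 2:.0f}")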
The possibility to place TSVs anywhere in the design brings the opportunity for new design options with 3D die integration. Indeed, we now have the possibility to design a system directly in 3D, by distributing single functional units of the system on different stacked dies. In this first generation of TSV integration, we focus on the integration of typically separate components that can benefit greatly from high-bandwidth, low-energy communication. Nevertheless, because the TSV technology is in its first revisions, there is a large potential for feature size scaling to improve its electric and geometrical characteristics. In this way, we will be able to split the design into different stacked dies at gate level or even at transistor level, enabling real 3D transistor-level design. In the next sections, we show the advantages of redesigning the I/O interface of a standard DDR-style DRAM by using TSVs (Sect. 8.4.2). We then provide an overview of a few possible directions that TSVs and 3D-SIC provide which could not be considered before (Sect. 8.4.3).
8.4.2 A Promising Near-Term DRAM Interface Direction TSV-connected memory appears to be an excellent way to provide the performance of an on-chip memory because of the low parasitic load on the memory interface. There are other examples where high-bandwidth, low-power designs have been shown to significantly improve the performance. A design [1] was detailed that stacked SRAM over an 82 core processor; it demonstrated the possibility to compute more than one TFLOP with reductions in I/O power consumption of more than 90%. However, both the improved TSV electrical characteristics, such as reduced capacitive load, and the interface redesign contributed to this significant power savings. Continuing with this trend, we see similar possibilities for improvement for DRAM memories, although its integration provides its own set of challenges as we will detail. 8.4.2.1 DRAM to Logic Interface Redesign In the work of Facchini [2], the impact of the TSV load and of the drivers’ simplification over the total power savings is explored for DRAM to logic die communication. The reference design is an SSTL2 Class I I/O interface, the standard used in the commercial DDR-SDRAM products (JEDEC Standard No. 8-9B) [4] (Fig. 8.6). An SSTL2 I/O transmitter consists primarily of a large CMOS buffer that can deliver up to 16.2 mA of current to be able to drive the voltage on the receiver side over given noise thresholds for all the allowed interconnect configurations. The class I interconnect configuration consists of a termination resistance (RS) connected in series among the transmitter, the interconnecting medium, and another termination resistance connected in parallel to the receiver input port. The other pin of the parallel termination is connected to a supply source VTT which is set to be equal to half of the operating voltage of the interface ports (VDDQ = 2.5 V; VTT = 1.25 V). The resistive value of the parallel termination has to match the characteristic
Fig. 8.6 A schematic of an electrical model of a set of drivers and receivers for a memory subsystem
impedance of the interconnecting medium (~50 Ω) to avoid signal reflections. For the same reasons, the output resistance of the transmitter and the series termination resistance have to result in the same impedance. The SSTL2 receiver is a differential amplification stage, which compares the received analog voltage signal with a reference voltage (Vref). Normally, the reference is set equal to VTT. 8.4.2.2 Three Alternative Scenarios The SSTL reference design can be gradually simplified in three 3D interface scenarios that are represented in Fig. 8.7 for a single bit wire. The first simplification consists of replacing the PCB and the wire-bonds with TSVs, while maintaining the standard SSTL2 interface. This first scenario would be a first step toward 3D DRAM integration and will most likely occur in systems that have not been designed to be 3D aware. However, as mentioned earlier, the electrical distance between the two dies is largely reduced thanks to the TSV connections. Thus, we expect only marginal inductive effects and we do not expect transmission problems caused by reflections on the interconnect networks. These considerations enable the removal of the termination resistors, as seen in the second scenario. However, we still have the proper drivers and receivers for the SSTL standard interface, which are highly capacitive and unnecessarily overdimensioned for the low TSV capacitive load. In the redesigned I/O interface (Fig. 8.7), the SSTL transceivers are replaced by simple CMOS buffers and receiving inverters. 8.4.2.3 Power Considerations Depicted in Fig. 8.8 are energy estimations for the previously described scenarios. The first scenario is dominated by the DC power consumption, which represents about 95% of the energy consumed. This is because, as it is designed, the SSTL
Fig. 8.7 Three buffer models using TSV technologies: (a) no modifications to the physical bus interface; (b) removal of termination resistors; (c) a redesign of the driver to a CMOS version
Fig. 8.8 Dynamic and static power consumption is listed for a number of scenarios. Here, we see the move toward CMOS communications significantly saves energy consumed per bit
transmission interface does not transmit the full-rail signal (VSS to VDDQ) over the interconnect network but instead utilizes the terminations to create a resistive partition at VTT. This is done to reduce the voltage swing, allowing for faster transitions at the expense of static power consumption which, as shown above, can be a significant contributor to the total bus energy used.
As an example, the JEDEC SSTL_25 standard sets VSS to 0 V, VTT to 1.25 V, and VDDQ to 2.5 V.
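The simplified model below illustrates why the terminated SSTL2 link pays a DC cost while a full-swing CMOS/TSV link pays only CV² per transition. The driver output resistance, the series and parallel termination values, and the bit rate are assumed nominal numbers for illustration, not values specified in the text.

# Illustrative DC vs. dynamic power comparison for one signal pin.
VDDQ, VTT = 2.5, 1.25
R_OUT, R_S, R_T = 25.0, 25.0, 50.0        # assumed driver output, series and parallel terminations

# Whenever the driver holds the line at a rail, a DC path exists from that rail
# through R_OUT + R_S + R_T into the VTT supply.
i_dc = (VDDQ - VTT) / (R_OUT + R_S + R_T)
p_dc_per_pin = (VDDQ - VTT) * i_dc
print(f"SSTL2 DC power  : ~{p_dc_per_pin*1e3:.1f} mW per driven pin")

# The redesigned CMOS/TSV link (third scenario) only charges the ~40 fF TSV load.
C_TSV, F_BIT, ALPHA = 40e-15, 200e6, 0.5
p_cmos = ALPHA * C_TSV * VDDQ**2 * F_BIT
print(f"CMOS/TSV dynamic: ~{p_cmos*1e6:.1f} uW per pin at {F_BIT/1e6:.0f} Mb/s")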
Fig. 8.9 Power utilization traces for three different applications (an artificial trace for locality, mp4, and qsdpcm). Power reductions can be seen in these examples primarily due to modifications to the bus itself
This effect is even more evident when looking at the power distribution obtained for the first modified scenario. Changing the TSV capacitance generates only a limited AC power consumption reduction. Nevertheless, in the second scenario, most of the DC power consumption is eliminated by removing the grounded termination. However, the AC power component takes the largest role in the overall power consumption because the signal needs to be driven from VSS to VDDQ, instead of from the midpoint value of VTT to 2.5 V, requiring more energy for a single transition. Finally, the I/O redesign in the third and last scenario considerably reduces the power consumption by minimizing the possible load on the interconnection network. In Fig. 8.9, we see that these modifications carry over favorably to the complete picture of DRAM power usage for three different applications. These examples show how the changes in the bus power utilization contributes to the overall power use of the DRAM. The final scenario exhibits a power reduction of approximately 50% when compared with the reference scenario. 8.4.2.4 Application-Specific Results for a Wider Bus Interface In the previous section, we have shown that by linking two dies with TSVs, chip designers now have the ability to significantly reduce the energy consumed when communicating between two separate dies. While this hardware-only solution provides very good results, we could also consider additional improvement scenarios. By increasing the bit-width to the memory itself, we can tradeoff some of the energy savings for improving the interface bandwidth. The interface pins to the DRAM can be grouped by address, command, synchronization, and data pins. Because of the extremely fine TSV pitch, the overall
Fig. 8.10 Power used by the bus and memory modules. We see that wide CMOS interfaces provide power advantages up to the 128-bit data width for this execution trace
pin-count of the DRAM can be increased. One way to take advantage of these extra pins is to transmit more data, thus providing an opportunity for higher-bandwidth communication. To measure the benefits of this increment, a realistic system platform consisting of a processor, a communication bus, and memory has been considered. Starting with an MPEG-4 encoder application on this platform, the activity on the interface between the external memory and the rest of the system has been monitored. To simulate an example use of the wider bus, a new trace was generated, based on the MPEG-4 data patterns. This trace simulates usage patterns between the DRAM and the memory controller for much wider data widths. This activity was then supplied to the power model from Facchini [2], and the resulting power consumption of the memory and its interface has been plotted in Fig. 8.10. In the left-most column, we see the power consumed in the reference scenario interface (SSTL + PCB). These simulations project an overall application power savings when moving to bus widths that are eight times wider when using the CMOS signaling technology. In a second study, the SDRAM power model was adapted to take into account the new, wider bus interface. The results are depicted in Fig. 8.11, where the distribution of the power consumption among the different parts of the memory and the interface has also been visualized. We can see that increasing the bit-width results in an overall reduction of the energy used for communication. This is due to the fact that fewer data requests occur, and with each one there is an opportunity to receive a large chunk of memory at once. As the SDRAM interface is optimized for large bursts of contiguous memory addresses, better performance is obtained. Fewer bus requests mean that the DRAM chips can stay in a low-power state for a larger proportion of the time. Finally, it should be noted that these results have been produced using an existing platform and application that were not optimized for this new interface. However, even with these limitations, the experiments show both power and bandwidth improvements. By adapting the application and improving
Fig. 8.11 Energy used by the memory itself during a simple trace execution, broken down by operating state (in mJ) for bus bitwidths of 16, 64, and 128
the system features, we can see the trend toward an even wider margin of performance and power gains. When TSVs become generally available, the memory-logic interface will therefore be an excellent candidate to take advantage of this technology. These first iterations will bring the important milestone of successfully integrating 3D TSV technology. Benefits will be seen in the compactness of the overall product resulting from the removal of the PCB and the integration of the two components on top of each other. This first step is integral, but it will also require technologies to be defined to cover the integration of components from differing lithography processes, the testing of the components before and after integration, and the routing of the proper signals between the layers without introducing additional layers or parasitic effects.
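The interface-level tradeoff described above can be captured with a first-order energy model. The Python sketch below compares the energy per bit of a terminated SSTL-style PCB link with that of an unterminated full-swing CMOS link over a TSV. Every numeric value in it (load capacitances, supply voltages, termination resistance, bit time, activity factor) is an illustrative assumption rather than data from this chapter; the point is only the qualitative trend discussed in the text, namely that removing the DC termination and shrinking the load capacitance is what drives the interface power savings of the redesigned I/O.

```python
# Illustrative sketch, not from the chapter: first-order energy-per-bit comparison
# between a terminated SSTL-style off-chip link and an unterminated CMOS TSV link.
# All numeric values are assumptions chosen only to show the trend, not measurements.

def sstl_energy_per_bit(c_load=10e-12, vddq=2.5, r_term=50.0, bit_time=2.5e-9,
                        activity=0.5):
    """Terminated link: switching (AC) energy plus DC energy burned in the
    termination network during one bit time."""
    v_swing = vddq / 2                       # SSTL swings around the VTT midpoint
    e_ac = activity * 0.5 * c_load * v_swing ** 2
    p_dc = (vddq / 2) ** 2 / r_term          # crude stand-in for static termination power
    return e_ac + p_dc * bit_time

def cmos_tsv_energy_per_bit(c_load=0.5e-12, vdd=1.8, activity=0.5):
    """Unterminated full-swing CMOS link over a TSV: no DC term, small capacitance."""
    return activity * 0.5 * c_load * vdd ** 2

if __name__ == "__main__":
    e_pcb = sstl_energy_per_bit()
    e_tsv = cmos_tsv_energy_per_bit()
    print(f"SSTL + PCB : {e_pcb * 1e12:.2f} pJ/bit")
    print(f"CMOS + TSV : {e_tsv * 1e12:.3f} pJ/bit")
    print(f"ratio      : {e_pcb / e_tsv:.0f}x")
```

Note that this ratio applies only to the interface itself; at the application level the memory core still consumes power, which is why the overall reduction reported above is closer to 50%.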
8.4.3 Potential Next-Generation DRAM-on-Logic Architectures

In this section, we introduce a few interesting ideas that upcoming devices could implement as the next steps in the evolution toward high-bandwidth and lower-power solutions.
8.4.3.1 Eight Levels of DRAM with Novel MCs

Previously, we discussed the first steps for stacking logic dies with commodity DRAM devices. This progression has taken us through the adaptation of the logic dies themselves with the integration of TSVs. This was followed by updates to the peripheral interface of the DRAM, showing a progression toward a fully integrated
logic-on-DRAM device. Nevertheless, the use of wider interfaces between these two dies is only the first step toward a complete circuit-level 3D integration solution. Moving to a 3D DRAM solution in which the internal components of the DRAM are distributed across 3D layers could yield additional improvements. There have been indications [6] that the introduction of 3D stacking to the DRAM microarchitecture itself could result in a reduction in the access latency of the DRAM. Further experiments [6] demonstrate increased performance as a direct result of this new configuration. Although this latency benefit was used to achieve higher single-threaded application performance, it is also possible to reengineer one's design to use this lower latency to reduce the energy needed for multimedia applications, for example. Through the use of carefully tuned software that optimizes the memory layout and access patterns, it has been shown that significant energy reductions can be obtained. Although significant performance benefits can be obtained, there are also some new architecture and software consequences that should be considered. A fully integrated 3D DRAM-on-logic solution could experience nontypical system integration issues through the use of 3D stacking. The physical proximity of the DRAM to a traditional processor would now create externally induced, local hotspots on the DRAM itself. Simulation experiments [6] show a significant rise in the temperature of the DRAM, almost reaching the maximum temperature specification of 90°C. This increase in temperature in turn requires the DRAM to refresh more frequently, increasing the overall power consumption of the DRAM. Of course, these simulations make assumptions about the new access times, refresh rates, and other parameters that are not presently validated in a physical implementation. In addition, optimized software that has been targeted at older technologies might have to be updated to optimize for this newer architecture. These software modifications can be time-consuming, as each kernel, or intensive computation or software loop, needs to be optimized individually. In addition, these optimizations are made for a particular architecture or computer system, and might have to be changed significantly if one were to migrate to a new hardware platform to achieve the required throughput or power efficiency.

8.4.3.2 Ultra-High Bandwidth Solutions for Low-Power ASIC Designs

We have discussed how advances in interconnect technologies, specifically the excellent electrical properties of TSVs compared with PCB DRAM traces, can provide an opportunity for improved system architectures through a wider memory data bus. Through the use of these wider interfaces, one can see significant reductions in the overall energy used in current embedded and general-purpose processing systems. The reduction would be seen in the energy consumption of both the board-level components and the chip-to-board driver circuitry. The majority of this chapter has centered on adapting existing system components to take advantage of a wider data bus to maintain a system's current overall performance with reduced
power consumption. Nevertheless, simply using standard applications and components in this way is just the first step to optimizing a platform's energy usage. It is then possible to translate this greatly increased bandwidth into a lower overall system frequency. Because the dynamic power of standard CMOS logic is proportional to its operating frequency, reducing the clock frequency of a system significantly reduces the overall power consumption of the digital design; the end result could be a design that operates at very low frequencies. At the same time, one can take advantage of a high-density interconnect, like TSVs, to maintain the per-cycle bandwidth requirements. In addition to the raw bandwidth available to the logic device, the latency would also be expected to decrease because the amount of driver circuitry is reduced. By taking this bandwidth-energy idea one step further, one can imagine updating the application accelerator architectures of a logic processor to take advantage of this newly acquired bandwidth. Systems have been designed [5] that use high-bandwidth interconnects to optimize the overall energy efficiency of the entire architecture. When developing a custom ASIC solution, it is possible to define and exploit system advantages to optimize for specific characteristics. The motion estimation kernel of the H.264 encoder tends to require relatively large amounts of bandwidth to perform the energy minimization routines needed to efficiently compress the data in a stream of images with a minimal loss in visual quality [7]. By bringing the DRAM on-chip with the logic die using this high-bandwidth connection, one can now use the DRAM as a fast local memory cache. Traditionally, general-purpose processors use multiple layers of cache memory, typically SRAM, to take advantage of temporal and spatial locality. Through the use of low-frequency CMOS logic with high-bandwidth interconnects, we can reduce power consumption while maintaining or even increasing the accessible bandwidth to the DRAM. We can do this because the random access time has been reduced to a fairly low level. For example, if a DRAM chip has a random access latency of 20 ns and the design operates at 25 MHz, a 40 ns clock period, the DRAM could be accessed randomly every two cycles. The logic chip would be able to continue to burst data from the device at each cycle, as modern DRAM is optimized for sequential access.

8.4.3.3 Standardization of Interchip TSV Connections

The process of standardization provides engineering and development benefits as well as business benefits. From an engineering viewpoint, a single standard design simplifies many components of the product development lifecycle. From design and validation to production and final product test, an agreed-to standard simplifies development, tool use, and final testing. Overall, the number of resources, people with knowledge
of the implementation, and available software implementing or using the standard tend to be higher than with proprietary products or services. Nevertheless, standards also introduce compromises between different users of the standard. By casting the net wide and accepting the needs of many different groups, one risks reducing the overall benefit of implementing the standard in a product. As we have seen in the evolution of DRAM, the primary concern of the JEDEC standards group was to increase performance, namely the bandwidth available to the DRAM itself. Only later, with the introduction of LPDDR, the low-power double data rate protocol, was the embedded community's need for high bandwidth at low power addressed. The tradeoff between DRAM bandwidth and the energy per bit needed for those accesses was resolved through the use of LVCMOS drivers and receivers instead of the SSTL_2 connections commonly found in the latest DDR standards. Before the development of the LVCMOS drivers, embedded devices with large bandwidth requirements needed to cache larger amounts of data locally in expensive SRAMs or use the more power-hungry DDR flavors. Nonstandard options were available in the form of proprietary eDRAM devices with nonstandard interconnects, but overall this increases the cost and complexity of developing and testing a product. Standardization of products is one way for hardware integrators to help bring down the price through higher volumes. If the computer industry were to agree on a standard like the JEDEC DDR interface, and there were enough computer vendors integrating these technologies and enough suppliers producing and selling these products, market forces could help to bring prices down through higher production rates of the devices and universal use throughout computing systems. If one were to keep an interface proprietary, not releasing a specification or simply producing products that do not conform to a standard, then one could end up seeing lower distribution of products because of the large number of different options, resulting in communities of test, development, and manufacturing that are separate and disconnected from each other. Through the process of standardization, the hope is that an industry-wide consensus will emerge on how 3D-stacked DRAMs will be connected to logic devices. Mass-produced, commodity DRAM products that can be easily connected to a range of logic devices could be the catalyst for adoption across multiple product categories, bringing TSVs and DRAM-on-logic into the mainstream.
8.5 Conclusion

New technologies can bring about fundamental shifts in the ways we typically build and use standard components for computing systems. These technologies could allow for a significant shift in the design of, and the assumptions made when designing, these systems. The assumption that memory chips should be placed off-chip, connected by typically long wires across the PCB, is expensive in terms of energy used per bit accessed and data fetch latency. This new technology,
namely the TSV, allows us to bring heterogeneous components, such as DRAM and logic processors, together with an interconnect performance outperformed only by the local interconnect on a single chip. The use of TSVs as a low-power, high-speed, ultra-wide-bandwidth communications medium can open the door to new architecture designs that would previously have used too much energy to be feasible. Even more improvements lie ahead as embedded-system software is modified to take advantage of these new opportunities.
References

1. Borkar S (2008) 3D-technology: a system perspective. In: International 3D system integration conference, pp 1–14
2. Facchini M, Carlson T, Vignon A (2009) System-level power/performance evaluation of 3D stacked DRAMs for mobile applications. In: Proceedings of design, automation, and test in Europe, Nice, 20–24 April, pp 923–928
3. Harrison C, Hudson SE (2008) Pseudo-3D video conferencing with a generic webcam. In: Proceedings of the 10th IEEE international symposium on multimedia, Washington, DC, pp 236–241
4. JEDEC, http://www.jedec.org
5. Kumagai K, Yang C, Izumino H et al (2006) System-in-silicon architecture and its application to H.264/AVC motion estimation for 1080HDTV. In: International solid-state circuits conference, San Francisco, CA, pp 1706–1715
6. Loh GH (2008) 3D-stacked memory architectures for multi-core processors. In: Proceedings of the 35th international symposium on computer architecture, Beijing, 21–25 June, pp 453–464
7. Tuan J, Chang T, Jen C (2002) On the data reuse and memory bandwidth analysis for full-search block-matching VLSI architecture. IEEE Trans Circ Syst Video Tech 12(1):61–72
8. Wehn N, Hein S (1998) Embedded DRAM architectural trade-offs. In: Proceedings of design, automation, and test in Europe, Paris, 23–26 February, pp 704–708
Chapter 9
Microprocessor Design Using 3D Integration Technology Yuan Xie
9.1 Introduction

Previous chapters have described various aspects of 3D integration technology, including the fundamentals of process technology and EDA design flows for 3D IC design. In this chapter, we discuss how to leverage the emerging 3D integration technology for future microprocessor design. As described in Chap. 1, 3D integration technologies offer many benefits for future IC designs. Such benefits include: (1) the reduction in interconnect wire length, which results in improved performance and reduced power consumption; (2) the support for realization of heterogeneous integration; (3) improved memory bandwidth, by stacking memory on microprocessor cores with massively parallel TSV connections between the memory layer and the core layer; and (4) a smaller form factor, which results in higher packing density and a smaller footprint due to the addition of a third dimension to the conventional two-dimensional layout, and potentially results in a lower-cost design. In this chapter, we describe how to leverage these benefits to help future microprocessor design. We first give a brief review of 3D integration technology and the influence of various 3D integration approaches on microprocessor design. We then review various approaches to designing future 3D microprocessors, which leverage the lower latency, higher bandwidth, and heterogeneous integration capability offered by 3D technology. A case study on a fine-granularity 3D microprocessor design is then presented. The challenges for future 3D architecture design are also discussed in the last section.
Y. Xie (*) Department of Computer Science and Engineering, Pennsylvania State University, University Park, PA 16802-5027, USA e-mail:
[email protected]
9.2 The Influence of Various 3D Integration Technologies on Microprocessor Design

The 3D integration technologies [36, 37] can be classified into one of the two following categories. (1) Monolithic approach. This approach involves a sequential device process: the frontend processing (to build the device layer) is repeated on a single wafer to build multiple active device layers before the backend processing builds interconnects among devices. (2) Stacking approach, which can be further categorized into wafer-to-wafer, die-to-wafer, or die-to-die stacking methods. This approach processes each layer separately, using conventional fabrication techniques; the multiple layers are then assembled to build up a 3D IC using bonding technology. Since the stacking approach does not require changes to the conventional fabrication process, it is much more practical than the monolithic approach and has become the focus of recent 3D integration research. Several 3D stacking technologies have been explored recently, including wire-bonded, microbump, contactless (capacitive or inductive), and through-silicon via (TSV) vertical interconnects [6]. Among all these integration approaches, TSV-based 3D integration has the potential to offer the greatest vertical interconnect density and is therefore the most promising of the vertical interconnect technologies. Three-dimensional stacking can be carried out using two main techniques [10]: (1) face-to-face (F2F) bonding, in which two wafers (dies) are stacked so that their top metal layers are connected; note that the die-to-die interconnects in face-to-face wafer bonding do not go through a thick buried silicon layer and can be fabricated as microbumps, while the connections to C4 I/O pads are formed as TSVs; and (2) face-to-back (F2B) bonding, in which multiple device layers are stacked such that the top metal layer of one die is bonded to the substrate of the other die, and direct vertical interconnects, called TSVs, tunnel through the substrate. In such F2B bonding, TSVs are used for both between-layer connections and I/O connections. Figure 9.1 shows a conceptual two-layer 3D IC with F2F or F2B bonding, with both TSV and microbump connections between the layers. All TSV-based 3D stacking approaches share the following three common process steps [10]: (a) TSV formation; (b) wafer thinning; and (c) aligned wafer or die bonding, which could be wafer-to-wafer (W2W) bonding or die-to-wafer (D2W) bonding. Wafer thinning is used to reduce the impact of TSVs: the thinner the wafer, the smaller (and shorter) the TSV is, for the same aspect ratio constraint [10]. The wafer thickness could be in the range of 10 to 100 μm and the TSV size is in the range of 0.2 to 10 μm [6]. In TSV-based 3D stacking, the dimensions of the TSVs are not expected to scale at the same rate as the feature size because the alignment tolerance during bonding poses limitations on the scaling of the vias. The TSV size, length, and pitch density, as well as the bonding method (face-to-face or face-to-back bonding, SOI-based 3D or bulk-CMOS-based 3D), can have a significant impact on the 3D microprocessor design. For example, the relatively large size of TSVs can hinder partitioning
Fig. 9.3 The estimated cost of OpenSPARC T2 by using conventional 2D, homogeneous 3D partitioning, and heterogeneous 3D partitioning: fabricating the memory and the core part separately can further reduce the cost
a design at fine granularity across multiple device layers and make true 3D component design less feasible. On the other hand, monolithic 3D integration provides more flexibility in vertical 3D connections because the vertical 3D via can potentially scale down with feature size due to the use of local wires for connection. The availability of such technologies makes it possible to partition a design at a very fine granularity. Furthermore, face-to-face bonding or SOI-based 3D integration may have a smaller via pitch and higher via density than face-to-back bonding or bulk-CMOS-based integration. The influence of these 3D technology parameters on the microprocessor design must be thoroughly studied before an appropriate partitioning strategy is adopted.
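A minimal sketch of the wafer-thickness/TSV-size relationship noted above: for a fixed aspect ratio, the TSV must be at least as wide as the thinned-wafer thickness divided by that ratio. The 10:1 aspect ratio below is an assumption for illustration; the thickness range comes from the text.

```python
# Sketch of the wafer-thickness / TSV-size relationship: with a fixed aspect ratio,
# the TSV diameter is bounded from below by thickness / aspect_ratio.
# The aspect-ratio value is an assumption used only for illustration.

def min_tsv_diameter(wafer_thickness_um, aspect_ratio=10.0):
    """Smallest TSV diameter (um) that can span the thinned wafer."""
    return wafer_thickness_um / aspect_ratio

for t in (10, 25, 50, 100):   # thinned-wafer thicknesses (um), within the range cited
    print(f"{t:3d} um wafer -> TSV diameter >= {min_tsv_diameter(t):.1f} um")
```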
9.3 Designing 3D Processor Architecture The following subsections discuss various architecture design approaches that leverage different benefits that 3D integration technology can offer, namely, wire length reduction, high memory bandwidth, heterogeneous integration, and cost reduction. The 3D network-on-chip (NoC) architecture designs are also briefly reviewed.
9.3.1 Wire Length Reduction

Designers have resorted to technology scaling to improve microprocessor performance. Although the size and switching speed of transistors benefit as technology feature sizes continue to shrink, global interconnect wire delay does not scale accordingly. The increasing wire delays have become a major impediment to performance improvement.
Three-dimensional integrated circuits (3D ICs) are attractive options for overcoming the barriers in interconnect scaling, thereby offering an opportunity to continue performance improvements using CMOS technology. One of the important benefits of a 3D chip over a traditional two-dimensional (2D) design is the reduction in global interconnects. It has been shown that three-dimensional architectures reduce wiring length by a factor of the square root of the number of layers used [12]. The reduction of wire length due to 3D integration can result in two obvious benefits: latency improvement and power reduction.

9.3.1.1 Latency Improvement

Latency improvement can be achieved due to the reduction of the average interconnect length and the critical path length. Early work on fine-granularity 3D partitioning of processor components shows that the latency of 3D components can be reduced. For example, since interconnects dominate the delay of cache accesses, which determine the critical path of a microprocessor, and the regular structure and long wires in a cache make it one of the best candidates for 3D designs, 3D cache design is one of the early design examples of fine-granularity 3D partitioning [37]. Wordline partitioning and bitline partitioning approaches divide a cache bank into multiple layers and reduce the global interconnects, resulting in a fast cache access time. Depending on the design constraints, the 3DCacti tool [29] automatically explores the design space for a cache design and finds the optimal partitioning strategy; the latency reduction can be as much as 25% for a two-layer 3D cache. Three-dimensional arithmetic-component designs also show latency benefits. For example, various designs [9, 22, 25, 31] have shown that 3D arithmetic unit designs can achieve around 6–30% delay reduction due to the wire length reduction. Such fine-granularity 3D partitioning was also demonstrated by Intel [2], showing that by targeting the heavily pipelined wires, the pipeline modifications resulted in approximately 15% improved performance when the Intel Pentium-4 processor was folded onto a two-layer 3D implementation. Note that such fine-granularity design of 3D processor components increases the design complexity, and the latency improvement varies depending on the partitioning strategies and the underlying 3D process technologies. For example, for the same Kogge–Stone adder design, a partitioning based on logic level [31] demonstrates that the delay improvement diminishes as the number of 3D layers increases, whereas a bit-slicing partitioning strategy [22] has better scalability as the bit-width or the number of layers increases. Furthermore, the delay improvement for such bit-sliced 3D arithmetic units is about 6% when using a bulk-CMOS-based 180 nm 3D process [9], whereas the improvement can be as much as 20% when using an SOI-based 180 nm 3D process technology [22], because the SOI-based process has much smaller and shorter TSVs (and therefore much smaller TSV delay) compared with the bulk-CMOS-based process.
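Stated as a worked relation (a restatement of the scaling cited from [12], where N is the number of device layers and L̄ denotes the average interconnect length):

\[
\bar{L}_{3D} \approx \frac{\bar{L}_{2D}}{\sqrt{N}}, \qquad N = 2 \Rightarrow \bar{L}_{3D} \approx 0.71\,\bar{L}_{2D}, \qquad N = 4 \Rightarrow \bar{L}_{3D} \approx 0.50\,\bar{L}_{2D}.
\]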
9.3.1.2 Power Reduction

Interconnect power consumption becomes a large portion of the total power consumption as technology scales. The reduction of wire length translates into power savings in 3D IC design. For example, power reductions of 7–46% for 3D arithmetic units were demonstrated in [22]. In the 3D Intel Pentium-4 implementation [2], because of the reduction in long global interconnects, the number of repeaters and repeating latches in the implementation is reduced by 50%, and the 3D clock network has 50% less metal RC than the 2D design, resulting in better skew and jitter and lower power. Such a 3D stacked redesign of the Intel Pentium-4 processor improves the performance by 15% and reduces the power by 15%, with a temperature increase of 14°C. After using voltage scaling to lower the peak temperature to be the same as that of the baseline 2D design, the 3D Pentium-4 processor still showed a performance improvement of 8%.
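The mechanism behind these savings can be summarized with the standard dynamic-power relation; treating wire capacitance as proportional to wire length is the only assumption added beyond the text:

\[
P_{dyn} = \alpha\, C\, V_{DD}^{2}\, f, \qquad C_{wire} \propto L_{wire},
\]

so a 3D redesign that shortens global wires (and removes repeaters) lowers the switched capacitance, and hence the dynamic power, at a fixed supply voltage and frequency.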
9.3.2 Memory-Bandwidth Improvement

It has been shown that circuit limitations and limited instruction-level parallelism diminish the returns of further increasing the architectural complexity of modern superscalar microprocessors, which has led to the advent of chip multiprocessors (CMPs) as a viable alternative to the complex superscalar architecture. The integration of multicore or many-core microarchitectures on a single die is expected to accentuate the already daunting memory-bandwidth problem. Supplying enough data to a chip with a massive number of on-die cores will become a major challenge for performance scalability. Traditional off-chip memory will not suffice due to I/O pin limitations. Three-dimensional integration has been envisioned as a solution for future microarchitecture design (especially for multicore and many-core architectures), to mitigate the interconnect crisis and the “memory wall” problem [11, 18, 20]. It is anticipated that memory stacking on top of logic will be one of the early commercial uses of 3D technology for future chip-multiprocessor design, by providing improved memory bandwidth for such multicore/many-core microprocessors. In addition, such approaches of stacking memory on top of core layers do not have the design complexity problem of the fine-granularity design approaches, which require redesigning all processor components for wire length reduction (as discussed in Sect. 9.3.1). Intel [2] explored the memory-bandwidth benefits using a baseline Intel Core2 Duo processor, which contains two cores. With memory stacking, the on-die cache capacity is increased, and the performance is improved by capturing larger working sets, reducing off-chip memory-bandwidth requirements. For example, one option is to stack an additional 8 MB L2 cache on top of the baseline 2D processor (which contains a 4 MB L2 cache), and another option is to replace the SRAM L2 cache with a denser stacked DRAM L2 cache. Their study demonstrated that a 32 MB 3D-stacked DRAM cache can reduce the cycles per memory
access by 13% on average and as much as 55% with negligible temperature increases. The PicoServer project [14] follows a similar approach of stacking DRAM on top of multicore processors. Instead of using the stacked memory as a larger L2 cache (as shown by Intel's work [2]), the fast on-chip 3D-stacked DRAM main memory enables wide low-latency buses to the processor cores and eliminates the need for an L2 cache, whose silicon area is allocated to accommodate more cores. Increasing the number of cores by removing the L2 cache helps improve the computation throughput, while each core can run at a much lower frequency, resulting in an energy-efficient many-core design. For example, it can achieve a 14% performance improvement and 55% power reduction over a baseline multicore architecture. As the number of cores on a single die increases, such memory stacking becomes more important to provide enough memory bandwidth for the processor cores. Recently, Intel [33] demonstrated an 80-tile terascale chip with a NoC. Each core has a local 256 KB SRAM memory (for data and instruction storage) stacked on top of it. TSVs provide a bandwidth of 12 GB/s for each core, with a total of about 1 TB/s bandwidth for TeraFLOP computation. In this chip, the thin memory die is put on top of the CPU die, and the power and I/O signals go through the memory to the CPU. Since DRAM is stacked on top of the processor cores, the memory organization should also be optimized to take full advantage of the benefits that TSVs offer [18, 19]. For example, the numbers of ranks and memory controllers are increased in order to leverage the memory-bandwidth benefits, and a multiple-entry row buffer cache is implemented to further improve the performance of the 3D main memory. Comprehensive evaluation shows that a 1.75× speedup over a commodity DRAM organization is achieved [18]. In addition, the design of the miss status holding registers (MSHRs) was explored to provide scalable L2 miss handling before accessing the 3D-stacked main memory; a data structure called the vector bloom filter with dynamic MSHR capacity tuning is proposed, which provides an additional 17.8% performance improvement. If stacked DRAM is used as the last-level cache (LLC) in chip multiprocessors (CMPs), the DRAM cache sets are organized into multiple queues [19]. A replacement policy is proposed for the queue-based cache to provide performance isolation between cores and reduce the lifetimes of dead cache lines. Approaches are also proposed to dynamically adapt the queue size and the policy of advancing data between queues. The latency improvement due to 3D technology can also be demonstrated by such memory stacking designs. For example, Li et al. [17] proposed a 3D chip-multiprocessor design using a network-in-memory topology. In this design, instead of partitioning each processor core or memory bank into multiple layers (as shown in [29, 37]), each core or cache bank remains a 2D design, and communication among cores or cache banks takes place via the NoC topology. The core layer and the L2 cache layer are connected with a TSV-based bus. Because of the short distance between layers, TSVs provide fast access from one layer to another and effectively reduce the cache access time.
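The aggregate bandwidth figure quoted for the 80-tile prototype follows directly from the per-core number; the short Python sketch below redoes that arithmetic. Only the 12 GB/s per core and the 80-core count are taken from the text; the 1.5 GHz transfer clock used to back out an implied per-core bus width is an assumption for illustration.

```python
# Quick arithmetic behind the stacked-memory bandwidth of the 80-tile prototype.
# 12 GB/s per core and 80 cores come from the text; the transfer clock is assumed.

per_core_bw = 12e9          # bytes/s delivered by the TSVs to each core
num_cores = 80
assumed_clock = 1.5e9       # transfers per second (assumption)

total_bw = per_core_bw * num_cores
implied_width_bits = per_core_bw / assumed_clock * 8

print(f"aggregate bandwidth : {total_bw / 1e12:.2f} TB/s (text quotes ~1 TB/s)")
print(f"implied per-core bus: ~{implied_width_bits:.0f} bits at the assumed clock")
```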
9.3.3 Heterogeneous Integration

Three-dimensional integration also provides new opportunities for future architecture design, with a new dimension of design space exploration. In particular, the heterogeneous integration capability enabled by 3D integration gives designers a new perspective when designing future CMPs. Three-dimensional integration technologies provide feasible and cost-effective approaches for integrating architectures composed of heterogeneous technologies to realize future microprocessors targeted at the “More than Moore” technology projected by the ITRS. Three-dimensional integration supports heterogeneous stacking because different types of components can be fabricated separately, and layers can be implemented with different technologies. It is also possible to stack optical device layers or nonvolatile memories such as magnetic RAM (MRAM) or phase-change memory (PCRAM) on top of microprocessors to enable cost-effective heterogeneous integration. The addition of new stacking layers composed of new device technologies will provide greater flexibility in meeting the often conflicting design constraints (such as performance, cost, power, and reliability) and enable innovative designs in future microprocessors.

Nonvolatile memory stacking. Stacking layers of nonvolatile memory technologies such as magnetic random access memory (MRAM) [7] and phase change random access memory (PRAM) [35] on top of processors can enable a new generation of processor architectures with unique features. Several characteristics of MRAM and PRAM make them promising candidates for on-chip memory. In addition to their nonvolatility, they have zero standby power and low access power, and they are immune to radiation-induced soft errors. However, integrating these nonvolatile memories along with a logic core involves additional fabrication challenges that need to be overcome (e.g., the MRAM process requires growing a magnetic stack between metal layers). Consequently, it may incur extra cost and additional fabrication complexity to integrate MRAM with conventional CMOS logic into a single 2D chip. The ability to integrate two different wafers developed with different technologies using 3D stacking offers an ideal solution to overcome this fabrication challenge and exploit the benefits of PRAM and MRAM technologies. For example, Sun et al. [27] demonstrated that an optimized MRAM L2 cache on top of a multicore processor can improve the performance by 4.91% and reduce the power by 73.5% compared with a conventional SRAM L2 cache of similar area.

Optical device layer stacking. Even though 3D memory stacking can help mitigate the memory-bandwidth problem, when it comes to off-chip communication, the pin limitations, the energy cost of electrical signaling, and the nonscalability of chip-length global wires are still significant bandwidth impediments. Recent developments in silicon nanophotonic technology have the potential to meet the off-chip communication bandwidth requirements at acceptable power levels. With the heterogeneous integration capability that 3D technology offers, one can integrate an optical die together with CMOS processor dies. For example, HP Labs
Fig. 9.1 Illustration of F2F and F2B 3D bonding
proposed the Corona architecture [34], a 3D many-core architecture that uses nanophotonic communication both for intercore communication and for off-stack communication to memory or I/O devices. A photonic crossbar fully interconnects its 256 low-power multithreaded cores at 20 TB/s bandwidth, with much lower power consumption. Figure 9.2 illustrates such a 3D heterogeneous processor architecture, which integrates nonvolatile memories and an optical die together through 3D integration technology.
9.3.4 Cost-Effective Microprocessor Design

Increasing integration density has resulted in large die sizes for microprocessors. For example, a 65 nm 2-billion-transistor quad-core Itanium processor has a die size of 21.5 × 32.5 mm2 [26]. With a constant defect density, a larger die typically has a lower yield. Consequently, partitioning a large 2D microprocessor into multiple smaller dies and stacking them together may result in a much higher yield for the chip, even though 3D stacking incurs extra manufacturing cost due to the extra steps for 3D integration and may cause a yield loss during stacking. Depending on the original 2D microprocessor die size, it may be cost-effective to implement the chip using 3D stacking. In addition, as the technology feature size scales toward its physical limits, it has been predicted that moving to the next technology node will be not only difficult but also prohibitively expensive (many companies have decided to go fabless for this reason), and eventually we may reach the “post-silicon” era in a decade or so. Three-dimensional stacking can potentially provide a cost-effective integration solution compared with traditional technology scaling.
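The yield argument can be made concrete with a standard negative-binomial yield model, as in the illustrative Python sketch below. The defect density, clustering factor, and bond yield are assumed values, the die area is only an order-of-magnitude figure taken from the examples in this section, and TSV-processing and assembly costs are not modeled; the sketch also assumes a known-good-die flow in which dies are tested before stacking.

```python
# Illustrative sketch: relative silicon cost per good part, 2D monolithic die vs.
# two stacked half-size dies with known-good-die testing. D0, alpha, and bond
# yield are assumptions; the die area is an order-of-magnitude figure only.

def die_yield(area_mm2, d0_per_mm2=0.002, alpha=3.0):
    """Negative-binomial defect-limited yield: Y = (1 + A*D0/alpha)^(-alpha)."""
    return (1.0 + area_mm2 * d0_per_mm2 / alpha) ** (-alpha)

def silicon_cost_per_good_part(total_area_mm2, num_dies, bond_yield=0.98,
                               cost_per_mm2=1.0):
    """Cost of one good part when the design is split into num_dies equal dies
    that are tested before stacking (so only good dies are bonded)."""
    area_each = total_area_mm2 / num_dies
    cost_each = area_each * cost_per_mm2 / die_yield(area_each)  # pay for bad dies too
    return num_dies * cost_each / bond_yield if num_dies > 1 else cost_each

c2d = silicon_cost_per_good_part(340.0, 1)   # one large monolithic die
c3d = silicon_cost_per_good_part(340.0, 2)   # two stacked half-size dies
print(f"relative cost, 2D monolithic : {c2d:.0f}")
print(f"relative cost, 2-die 3D stack: {c3d:.0f}  ({(1 - c3d / c2d):.0%} cheaper)")
```

With these assumed numbers, the smaller dies yield noticeably better, so the stacked version needs roughly a quarter less silicon per good part, illustrating why the cost argument depends strongly on the original 2D die size.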
Using today's high-performance microprocessors as an example, a large portion of the silicon area is occupied by on-chip SRAM or DRAM, and nonvolatile memory can also be integrated as on-chip memory [7]. However, the fabrication processes for these different modules are different. For instance, while the underlying conventional CMOS logic circuits require 1-poly-9-copper-1-aluminum interconnect layers, DRAM modules need 7-poly-3-copper and Flash modules need 4-poly-1-tungsten-2-aluminum. As a result, heterogeneous integration will dramatically increase the cost. As an example, Intel shows that heterogeneous integration for a large 2D SoC could boost the chip cost by three times [3]. Separating the fabrication of the heterogeneous technologies and stacking them with 3D integration could be a cost-effective approach for such systems. Here, we take the OpenSPARC T2 [28] as a case study. The original 2D OpenSPARC T2 chip has an area of 342 mm2 and is fabricated in a TI 65 nm process with 11 metal layers. About half of the die area is attributed to on-chip SRAM cache. One way of using 3D integration for such a microprocessor is to partition all SRAM modules onto one die and all the remaining modules onto the other die, similar to the recent Intel 80-core Tera-scale chip [32]. Applying the early design estimation method in [40], the number of metal layers is estimated to be reduced to 5, and the total cost can be reduced. The comparison is shown in Fig. 9.3. To summarize, the ability to enable heterogeneous integration offers extra opportunities to reduce the total cost in 3D IC designs.
9.3.5 Three-Dimensional NoC Architecture

Network-on-chip (NoC) is a general-purpose on-chip interconnection network architecture that has been proposed to replace the traditional design-specific global on-chip wiring by using switching fabrics or routers to connect processor cores or
Fig. 9.2 An illustration of 3D heterogeneous architecture with nonvolatile memory stacking and optical die stacking
processing elements (PEs). Typically, the PEs communicate with each other using a packet-switched protocol. Even though both 3D integrated circuits and NoCs are proposed as answers to the interconnect scaling demands, the challenges of combining both approaches to design three-dimensional NoCs were not addressed until recently [8, 15, 17]. Researchers have studied various NoC router designs with 3D integration technology. For example, several design options for the 3D NoC router have been investigated: (1) a symmetric NoC router design that is a simple extension of the 2D NoC router; (2) a NoC-bus hybrid router design, which leverages the inherent asymmetry of delays in a 3D architecture between the fast vertical interconnects and the horizontal interconnects that connect neighboring cores; (3) a true 3D router design with major modifications, such as the dimensionally decomposed router [15]; and (4) a multilayer 3D NoC router design, which partitions a single router across multiple layers to boost the performance and reduce the power consumption [8]. Three-dimensional NoC topology design has also been investigated [38]. More details can be found in [4]. The Intel 80-core TeraFLOPS processor is a proof-of-concept of the 3D network-on-chip architecture. The 80-core chip is arranged as an 8 × 10 array of PE cores and packet-switched routers, connected in a mesh topology. Each PE core contains two pipelined floating-point multiply accumulators (FPMACs), connecting with the router through the router interface block (RIB). The router is a five-port crossbar-based design with a mesochronous interface (MSINT). To provide a high memory bandwidth at relatively low power, a 20 MB SRAM layer is stacked on top of the 80-core layer, with 256 KB per core connected to its core by a bus. The resulting 3D NoC-bus hybrid design can provide a memory bandwidth of 12 GB/s per core (about 1 TB/s in total for the whole chip), while the mesh NoC provides a bisection bandwidth of 2 TB/s.
9.4 Case Study 1: Fine-Granularity 3D Microprocessor Design

In this section, we adopt the fine-granularity partitioning approach to redesign a microprocessor using 3D integration technologies, redesigning various components in the microprocessor with a true 3D implementation. Such a fine-granularity design approach is based on the assumption that the die-to-die interconnects are extremely small, with a very high pitch density. We first discuss the design of various components in a microprocessor with fine-granularity partitioning, and then study the overall system performance improvement when putting all these components together.

Processor model. In this case study, a Verilog implementation of an Alpha-like architecture (denoted as Alpha in the rest of the section) is used to evaluate the impact of fine-granularity 3D implementation for microprocessors. A diagram of the processor is shown in Fig. 9.4. Each functional block in Fig. 9.4 represents a functional module that can be partitioned across multiple layers.
9.4.1 Three-Dimensional Cache Design

The regular structure and long wires in a cache make it one of the best candidates for 3D designs. In this section, we first give a brief illustration of the cache structure, and then describe different approaches to partitioning a cache into multiple device layers [30]. A combination of the different partitionings at the various granularities discussed in this section is also possible.

Cache structure. The structure of a cache contains a tag array and a data array. A portion of the address bits is used to index the corresponding set in the tag and data arrays. Figure 9.5 shows the data array of a 32 KB cache (only the data array is shown). Next, the tags and data of the different blocks belonging to a set are read. The tags read from all the blocks are compared against the tag portion of the incoming address. The indication of a match from the comparator output is used to enable the output driver of the corresponding block's data from the data array. Neither the tag nor the data array is a monolithic structure; the wordlines and bitlines of the memory array are divided into Ndwl and Ndbl parts, resulting in Ndwl × Ndbl subarrays (Blk0–Blk7 in Fig. 9.5). This partitioning is effective in reducing the access time and power consumption. Since the dimensions of the tag and data arrays are different, they are typically partitioned differently. In a 3D structure, we can extend this partitioning approach to divide bitlines and wordlines across different device layers. We refer to this methodology as subarray-level partitioning in the following sections. In addition to influencing the design of individual subarrays, the use of 3D structures can also help to reduce the delays due to global interconnects in the cache. One of these global interconnects is the set of incoming address inputs to the cache, which are sent to a predecoder placed in the center of the subarrays.
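The subarray bookkeeping behind Fig. 9.5 can be checked with a few lines of Python. The 2 × 4 split into Ndwl × Ndbl subarrays is an assumption consistent with the eight subarrays shown in the figure; the 128-wordline-by-256-bitline subarray size comes from the figure itself.

```python
# Sketch of the subarray bookkeeping behind Fig. 9.5: a 32 KB data array split into
# Ndwl x Ndbl subarrays of 128 wordlines x 256 bitlines. The 2 x 4 split is an
# assumption consistent with the eight subarrays (Blk0-Blk7) in the figure.

CACHE_BYTES = 32 * 1024
N_DWL, N_DBL = 2, 4                  # wordline / bitline partitions -> 8 subarrays
WORDLINES, BITLINES = 128, 256       # per subarray, as labeled in the figure

subarrays = N_DWL * N_DBL
bits_total = subarrays * WORDLINES * BITLINES
assert bits_total == CACHE_BYTES * 8, "subarray organization must cover the array"

print(f"{subarrays} subarrays x {WORDLINES} WLs x {BITLINES} BLs = "
      f"{bits_total // 8 // 1024} KB data array")
```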
Fig. 9.4 Processor model diagram
The predecoded address signals then traverse in an H-tree format to the local decoders of the subarrays. The local decoders in turn drive the corresponding wordline drivers. Other global signals include the select signals for driving the output buffers of the data array, the wires from the output drivers to the edge of the array, and the select lines for write and multiplexer control. All these global signals should benefit from the smaller footprint enabled by 3D technology. Global clock wiring will also benefit from a 3D cache design as it travels a shorter distance, even though our evaluation does not account for the benefit to the clock network.

SRAM Cell Level Partitioning. The finest granularity of partitioning a cache is at the SRAM cell level. At this level of partitioning, any of the six transistors of an SRAM cell can be assigned to any layer. For example, the pull-up PMOS transistors can be in one device layer, while the access transistors and the pull-down NMOS transistors can be in another layer. The benefits of cell-level partitioning include the reduction of the footprint of the cache arrays and, consequently, of the routing distance of the global signals discussed above. The number and complexity of the peripheral circuits remain the same as in a conventional 2D cache design. However, the feasibility of partitioning at this level is constrained by the 3D via size and via density when compared with the SRAM cell size. Assuming a limitation that the size of 3D vias cannot be scaled below 1 μm by 1 μm, a 3D via has a size comparable to that of a 2D 6T SRAM cell in 180 nm technology and is much larger than a single cell in 70 nm technology. Note that for the 70 nm technology, the size difference remains even when using a “wide cell” topology for the SRAM cell
in deep submicron design to alleviate the process difficulties of small feature sizes [39]. Consequently, when the 3D via size does not scale with feature size, as is currently the case in wafer bonding, partitioning at the cell level is difficult in future technology nodes. In contrast, partitioning at the SRAM cell level will continue to be feasible in technologies such as MLBS, because no limitations are imposed on via scaling with feature size. The availability of such technologies makes it possible to partition the cache at the granularity of individual cache cells [13]. However, it should be noted that even if the size of a 3D via can be scaled to as small as a nominal contact in a given technology, the total SRAM cell area reduction (when compared with a 2D cache design) due to the use of additional layers is limited, because metal routing and contacts occupy a significant portion of the 2D SRAM cell area [39]. Consequently, partitioning at higher levels needs to be explored. Furthermore, wafer bonding requires fewer changes in the manufacturing process and is more popular in industry [16, 21] than the MLBS technology. Therefore, our 3D cache design space exploration is mainly focused on coarse-level partitioning using wafer-bonding technology.

Subarray level partitioning. At this level of partitioning, individual subarrays of the 2D cache are partitioned across multiple device layers. Partitioning at this granularity reduces the footprint of the cache array and the routing length of global signals. However, it also changes the complexity of the peripheral circuits. In our research, we consider two options for partitioning subarrays into multiple layers: the 3D divided wordline (3DWL) approach and the 3D divided bitline (3DBL) approach.

Three-dimensional divided wordline (3DWL). In this partitioning strategy, the wordlines in a subarray are divided and mapped onto different active device layers. The local wordline decoder corresponding to the original wordline in the 2D subarray is placed on one layer and is used to feed the wordline drivers on the different layers through the 3D vias. Instead of a single wordline driver as in the 2D case, the new design has multiple wordline drivers, one for each layer. The duplication overhead is offset by resizing the drivers for the smaller capacitive load of the partitioned wordline. Further, the delay of pulling a wordline decreases because the number of pass transistors connected to a wordline driver is smaller. The delay calculation for the 3DWL also accounts for the 3D via area utilization; the area overhead due to 3D vias is small compared with the number of cells on a wordline. Another benefit of 3DWL is that the distance of the address line from the periphery of the core to the wordline decoder decreases as the number of device layers increases. Similarly, the routing distance from the output of the predecoder to the local decoder is reduced. The length of the select lines for the writes and muxes, as well as of the wires from the output drivers to the periphery, is also reduced.

Three-dimensional divided bitline (3DBL). This approach is akin to the 3DWL approach and applies partitioning to the bitlines of a subarray. The bitline length in the subarray as well as the number of pass transistors connected to a single bitline is reduced, which facilitates faster switching of the bitline. In the 3DBL approach, the sense amplifiers can either be duplicated across the different device layers or shared between the partitioned subarrays in different layers. The former
approach is more suitable for reducing access times, while the latter is preferred for reducing the number of transistors and the leakage. In the latter approach, the sharing increases the complexity of bitline multiplexing and reduces performance compared with the former. Note that sharing the sense amplifier among multiple layers increases the number of 3D vias, because each bitline has to use one 3D via to share the sense amplifier. This can potentially cause via congestion and may not be feasible for 3D integrations with much larger 3D vias (such as 10 μm by 10 μm), even though it is possible for much smaller 3D vias (such as IBM's 0.2 μm by 0.2 μm technology [1] or MLBS 3D integration). On the other hand, when sense amplifiers are duplicated for each die in the stack, the via congestion problem can be avoided at the expense of more transistors and extra leakage. As in 3DWL, the length of the global lines is reduced in this scheme.
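A first-order Elmore-delay sketch, in Python, of why 3DWL helps: splitting a subarray's wordline across N device layers shortens each driven segment by a factor of N. The per-cell resistance and capacitance and the driver resistance below are assumed values chosen only to show the trend, not parameters extracted from the HSPICE models used in this chapter.

```python
# First-order Elmore sketch of 3DWL: splitting a wordline across N layers shortens
# each driven segment by N. All R/C values are illustrative assumptions.

def wordline_delay(cells, r_driver=1_000.0, r_per_cell=5.0, c_per_cell=1e-15):
    """Elmore delay (seconds) of a wordline segment driven from one end."""
    r_wire = r_per_cell * cells
    c_wire = c_per_cell * cells
    return r_driver * c_wire + 0.5 * r_wire * c_wire

CELLS_2D = 256                      # cells on one wordline of the 2D subarray
for layers in (1, 2, 4):
    d = wordline_delay(CELLS_2D // layers)
    print(f"{layers} layer(s): wordline segment of {CELLS_2D // layers:3d} cells "
          f"-> {d * 1e12:.0f} ps")
```

The driver term shrinks linearly and the distributed-RC term quadratically with the segment length, which is why dividing wordlines (and, analogously, bitlines) across layers pays off even before the drivers are resized.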
9.4.2 Instruction Scheduler

The issue scheduler in a dynamically scheduled superscalar processor is a complex mechanism that is in charge of starting the execution of multiple instructions. The instruction scheduler consists of two major components: the wake-up logic and the selection logic. The wake-up logic is responsible for awakening instructions that are eligible for issue, once both of their source operands have been produced and the requested functional unit becomes available. The function of the selection logic is to determine which instructions should be issued, up to the maximum issue width of the processor. Due to its complexity, a significant amount of energy is consumed. Moreover, as pointed out in [23], the logic associated with the issue scheduler will be one of the primary clock speed limiters because the wake-up logic and the selection logic form an atomic operation. As a result, the turnaround communication delay between these two components should be made as small as possible to meet the performance budget, because the wire delay will soon dominate the overall delay as the feature size keeps scaling. The architectural structures of both the wake-up and selection logic are shown in Fig. 9.6. Our implementations of these two blocks adopted the architectural designs proposed in [23]. The delay of the wake-up logic consists of three components and can be expressed as Delay = Ttagdrive + Ttagmatch + Tormatch, where Ttagdrive represents the time taken by the buffers to drive the tag bits, Ttagmatch represents the time for a comparison cell implemented as a CAM structure to pull down the match line, and Tormatch represents the time needed to OR the individual match lines. The delay of the selection logic is composed of the propagation time for the request signals to reach the root arbiter cell, the time for the root arbiter cell to generate the grant signal, and the time to propagate the grant signal to the selected instruction to start execution. The delay of the wake-up logic is affected by both the issue width and the window size, whereas the selection delay is mostly influenced by the size of the instruction window.
Fig. 9.5 An example layout of a 2D cache: each Blki is a subarray with 128 wordlines and 256 bitlines. Note that only the data array is shown
Fig. 9.6 (a) Wake-up logic. (b) Selection logic
More specifically, Ttagdrive is the most influential component in deciding the overall delay of the issue logic, based on the following HSPICE simulations. Figure 9.7a–c show the delay breakdown of the wake-up logic under different issue widths and window sizes. The increased window size affects
Ttagdrive most significantly, as is visible from these figures. Figure 9.7d shows the power consumption comparison for different issue widths and window sizes. As expected, power consumption is higher with larger window sizes due to the larger cumulative capacitance. Likewise, a larger issue width means more wires are needed to deliver the results from the functional units back to the comparison cells, which also influences the power. In addition to wires, more comparison cells are needed along with the enlarged issue width. Figure 9.8a shows the delay breakdown of the wake-up logic with different issue widths and a fixed instruction window size of 64. The overall increase from an issue width of 4 to 6 is 23.7%, whereas 20.3% is observed from 6 to 8. One thing to note from this figure is that Tormatch is affected only by the issue width, and only slightly. This is because the delay of an OR gate is mainly decided by the number of input pins it has, which corresponds to the issue width. On the other hand, the window size has a greater impact on the delay than the issue width because the wire delay is more prominent in advanced technologies. Note that neither the window size nor the issue width affects the tag-match time, since the comparison cell is designed as a CAM structure and its delay depends solely on discharging the match line. Figure 9.8b shows the delay breakdown of the selection logic for different issue widths with a fixed window size of 64. As can be observed, the
Fig. 9.7 Wake-up logic delay breakdown and power with different issue widths and window sizes
delay of the arbiter is independent of the window size, and the increase in delay of the forward and backward paths is not 100% when going from window size 8 to 16 or from 32 to 64, due to the log4 structure of the selection logic. Based on the HSPICE simulation results shown above, we know that Ttagdrive dominates the overall delay of the issue logic, and that the wake-up logic itself is one of the major barriers to boosting the performance of a microprocessor. One solution to these wire-induced problems is the move to 3D technology, implementing this logic in 3D to mitigate the issues stemming from Ttagdrive. From this point of view, two possible partitioning approaches can be applied to the long tag-drive lines. The first one is referred to as horizontal partitioning, which cuts the tag-drive lines in half horizontally and places one half-length of the tag-drive lines on the first layer and the other half on the second layer, assuming two active device strata are used. In other words, we duplicate the tag-drive lines, with only half-length lines on each layer, and thus a lower wire capacitance is obtained. We refer to the second approach as vertical partitioning, which separates the tag-drive lines vertically into two halves. That is, we can assume tag[0:3] is on one stratum while tag[4:7] is on another stratum. Both partitioning approaches are conceptually shown in Fig. 9.9.
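A rough sketch contrasting the two partitionings: if the tag-drive delay is modeled simply as the driver resistance times a line capacitance proportional to the number of issue-window entries the line spans, horizontal partitioning halves the load seen on each layer, whereas vertical partitioning leaves each line spanning every entry. The per-entry capacitance and driver resistance below are assumptions; only the 64-entry window and the qualitative outcome (large gain for horizontal, little for vertical) correspond to the measurements reported later.

```python
# Rough model of the tag-drive delay under the two 3D partitionings described above.
# R_DRIVER and C_PER_ENTRY are assumed values; only the trend matters.

R_DRIVER = 2_000.0        # ohms, assumed driver resistance
C_PER_ENTRY = 2e-15       # farads of tag-line load per window entry, assumed
WINDOW = 64               # issue-window entries

def tag_drive_delay(entries_spanned):
    return R_DRIVER * C_PER_ENTRY * entries_spanned

d_2d = tag_drive_delay(WINDOW)
d_horizontal = tag_drive_delay(WINDOW // 2)   # each layer holds half the entries
d_vertical = tag_drive_delay(WINDOW)          # tag bits split, but each line still
                                              # spans every entry
print(f"2D                   : {d_2d * 1e12:.0f} ps")
print(f"horizontal (2 strata): {d_horizontal * 1e12:.0f} ps")
print(f"vertical (2 strata)  : {d_vertical * 1e12:.0f} ps (little change, as observed)")
```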
9.4.3 Three-Dimensional Arithmetic Design

In this section, we look at 3D arithmetic function unit designs. Highly parallel circuits benefit more from the increased number of neighboring gates in 3D systems than highly serial circuits do. Therefore, we investigate only those arithmetic function units whose critical paths can potentially be improved when implemented in 3D.
Fig. 9.8 (a) Wake-up logic delay breakdown. (b) Selection logic delay breakdown
Fig. 9.9 (a) Horizontal partitioning of tag-drive. (b) Vertical partitioning of tag-drive
Kogge–Stone adder. The Kogge–Stone (KS) adder is one of the fastest adders in CMOS design. Since the interconnect length in the critical path increases linearly with the number of inputs, wire delay dominates its performance in current deep submicron technologies. Figure 9.10a shows the 2D placement of the 16-bit KS adder, and Fig. 9.10b shows the corresponding schematic 3D placement of this adder in four layers. For the sake of clarity, only the bottom three layers are shown. The 3D layers are shown in different shades in order to match the corresponding 2D design. Note that 3D via contacts, not shown in the figure, are needed when signals travel across multiple layers. The critical path (highlighted in bold) in the 2D adder spans 12 cells, against 3 cells in 3D. It is visible from the 3D placement that the cell placement wraps around every four cells, which gives a maximum of 4× wire length reduction in 3D along the critical paths.

Logarithmic shifter. Another design whose performance is affected by wire delay is the logarithmic shifter. The 2D layout of the 8-bit log shifter in Fig. 9.11a shows the linear dependence of wire length on the number of inputs. The metric used here to calculate the wire length is based on the number of cells crossed by the wire before reaching its destination (i.e., wire length is calculated in cell units). The cells in the 8-bit log shifter are 2-to-1 muxes, which get their select signals from s0, s1, and s2. Figure 9.11b shows the placement of the shifter cells in two strata. The 2D implementation of the log shifter has a critical path (highlighted in bold) of ten cells, whereas the corresponding path in 3D spans only four cells and two vias, as shown in the figures.
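The "fold every four cells" placement idea can be illustrated abstractly: wrapping a row of cells onto L layers shrinks the worst-case horizontal span by roughly a factor of L, at the price of short vertical via hops. The Python sketch below uses abstract cell indices, not the actual adder netlist.

```python
# Abstract sketch of folding a 1-D row of cells onto multiple layers, as done for
# the 3D KS adder placement. Cell indices are abstract, not the adder netlist.

def fold(num_cells, layers):
    """Map cell i of a 1-D row to (layer, column) by wrapping every num_cells/layers."""
    cols = num_cells // layers
    return {i: (i // cols, i % cols) for i in range(num_cells)}

def worst_horizontal_span(placement):
    cols = [c for (_, c) in placement.values()]
    return max(cols) - min(cols)

ROW = 16                                  # e.g. the 16 bit slices of the KS adder
for layers in (1, 4):
    span = worst_horizontal_span(fold(ROW, layers))
    print(f"{layers} layer(s): worst-case horizontal span = {span} cell pitches")
```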
Fig. 9.11 16-Bit KS adder in 2D (a) and 3D (b); the critical path is shown in bold
9.5 Experimental Results

All processor components were implemented in a 70 nm technology, and the latency and power of each component in 2D and 3D were obtained through circuit-level HSPICE simulations. To model the impact of the 3D vias more accurately, the RC delay of a 3D via is added to the circuits. The resistance of a 3D via is estimated from a specific contact resistance of 10⁻⁸ Ω cm², based on actual resistance measurements [5]; the capacitance is estimated as that of a 1 µm-by-1 µm contact on the top metal layer, and the length of the interlayer via is assumed to be 10 µm.
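The quoted via parameters translate into very small lumped R and C values, which is why the added via delay barely affects the results. The sketch below performs that back-of-the-envelope calculation; the oxide thickness and relative permittivity used for the parallel-plate capacitance estimate are assumptions for illustration, since the chapter does not state them.

```python
# Lumped RC estimate for a 3D via from the quoted parameters.  The oxide
# thickness and permittivity below are assumed values, not chapter data.

EPS0 = 8.854e-12                              # vacuum permittivity (F/m)

def via_resistance(rho_c_ohm_cm2=1e-8, side_um=1.0):
    """Contact resistance: specific contact resistance divided by area."""
    area_cm2 = (side_um * 1e-4) ** 2          # 1 um = 1e-4 cm
    return rho_c_ohm_cm2 / area_cm2           # ohms

def via_capacitance(side_um=1.0, t_ox_um=0.1, eps_r=3.9):
    """Parallel-plate estimate for a pad on the top metal layer
    (t_ox_um and eps_r are illustrative assumptions)."""
    area_m2 = (side_um * 1e-6) ** 2
    return eps_r * EPS0 * area_m2 / (t_ox_um * 1e-6)   # farads

r, c = via_resistance(), via_capacitance()
print(f"R ~ {r:.1f} ohm, C ~ {c * 1e15:.2f} fF, RC ~ {r * c * 1e15:.2f} fs")
```

Under these assumptions the via contributes roughly 1 Ω and a fraction of a femtofarad, so its intrinsic RC product is negligible compared with the gate and long-wire delays of the issue logic.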
Based on these benefits of moving to 3D, we evaluated the performance impact of a 3D microprocessor using applications from the SPEC2000 benchmark suite.

Issue scheduler. Figure 9.13 shows the latency benefit for the tag-drive with different numbers of strata. For horizontal partitioning, we observed a latency improvement of 44% when moving from a 2D to a two-strata 3D implementation, a further 22% improvement from two to three strata, and an additional 16% improvement when moving to four strata. For vertical partitioning, we show only the 4% improvement obtained with two strata, because adding more device layers brings little benefit; therefore, only horizontal partitioning is considered in the remaining experiments.
Fig. 9.12 Log shifter in 2D (a) and 3D (b); the critical path is shown in bold
Figure 9.13 also shows that the tag-drive delay scales linearly as the window size increases for two strata. In some cases, however, the delay is the same for different window sizes, because the number of window entries located on each stratum happens to be identical. For example, window size 40 is partitioned into 3, 3, and 4 sections of window entries, with four entries per section (a consequence of the selection-logic structure), whereas window size 48 is partitioned into 4, 4, and 4 sections. In both cases the tag-drive signal must travel across four sections of window entries before reaching the tag-comparison cells, so window sizes 40 and 48 have the same tag-drive delay.

Since a significant portion of the overall delay of the issue logic comes from the tag-drive, we evaluated the delay reduction obtained from 3D integration; the result is shown in Fig. 9.14. We show only window sizes larger than 32 with an issue width of 8, because smaller windows exhibit the same trend and window sizes larger than 32 are more realistic for current-generation processor designs. The difference in single-loop delay (wake-up and select) between the 2D and 3D implementations is clear: the average delay reduction is 23% across all five window sizes when comparing 2D with two-strata 3D implementations, with additional reductions of 6 and 10% for three and four strata, respectively. Note that the selection logic is pitch-matched to the wake-up logic and 3D vias are added as needed; the delay of the selection logic does not change significantly in 3D because of its log4 structure.

Figure 9.15 shows the power comparison between the 2D and 3D implementations. Moving from 2D to two-strata 3D lowers the power noticeably, and going beyond two strata yields further, though smaller, reductions. The average power reduction is 16% across all five window sizes with two strata, with additional reductions of 6 and 8% for three- and four-strata implementations, respectively.
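The window-size argument can be captured in a few lines: entries are grouped into sections of four (matching the log4 selection tree), the sections are spread as evenly as possible across the strata, and the tag-drive delay is set by the stratum holding the most sections. This is only a sketch of the reasoning in the text; the section delay is taken as a unit and three strata are assumed, as in the example above.

```python
import math

ENTRIES_PER_SECTION = 4        # fixed by the log4 selection-logic structure

def sections_per_stratum(window_size, strata=3):
    """Spread the window's four-entry sections as evenly as possible
    across the strata and return the per-stratum section counts."""
    sections = math.ceil(window_size / ENTRIES_PER_SECTION)
    base, extra = divmod(sections, strata)
    return sorted(base + (1 if i < extra else 0) for i in range(strata))

for window in (40, 48):
    parts = sections_per_stratum(window)
    print(f"window {window}: sections per stratum = {parts}, "
          f"worst-case sections traversed = {max(parts)}")
# window 40 -> [3, 3, 4], worst case 4
# window 48 -> [4, 4, 4], worst case 4 (hence the same tag-drive delay)
```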
Fig. 9.13 Tag drive latencies under different stratum configurations
Fig. 9.14 Single-loop delay of issue logic for 2D and 3D implementations
Table 9.1 Two-dimensional and 3D implementations of adder and shifter

            16-bit KS adder           16-bit log shifter        32-bit log shifter
            Delay (ps)  Power (mW)    Delay (ps)  Power (mW)    Delay (ps)  Power (mW)
2D          504         0.87          224         0.88          398         2.0
2-Strata    402         0.80          194         0.75          285         1.86
3-Strata    385         0.74          –           –             –           –
4-Strata    339         0.68          –           –             –           –
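The improvement percentages quoted in the discussion that follows can be recomputed directly from these table values; the short sketch below does so (the power figures come out slightly different from the quoted ones because the tabulated power values are rounded).

```python
# Recompute the 2D-to-3D improvement percentages from Table 9.1
# (delay in ps, power in mW).

table = {
    "16-bit KS adder":    {"2D": (504, 0.87), "2-strata": (402, 0.80),
                           "3-strata": (385, 0.74), "4-strata": (339, 0.68)},
    "16-bit log shifter": {"2D": (224, 0.88), "2-strata": (194, 0.75)},
    "32-bit log shifter": {"2D": (398, 2.00), "2-strata": (285, 1.86)},
}

def reduction(before, after):
    """Percentage reduction relative to the 2D baseline."""
    return 100.0 * (before - after) / before

for unit, rows in table.items():
    delay_2d, power_2d = rows["2D"]
    for config, (delay, power) in rows.items():
        if config == "2D":
            continue
        print(f"{unit:18s} {config}: delay -{reduction(delay_2d, delay):.1f}%, "
              f"power -{reduction(power_2d, power):.1f}%")
```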
Based on the preceding results, we observe that the issue logic is a wire-bound structure; the move from 2D to 3D is therefore particularly effective in relieving wire-related impacts.

Three-dimensional arithmetic modules. Table 9.1 shows the power and performance results of the KS adder and the log shifter implemented in 2D and 3D. The performance improvements of the 16-bit Kogge–Stone adder placed in two, three, and four strata over the 2D design are 20.238, 23.611, and 32.738%, respectively; the corresponding power reductions are 8.14, 14.67, and 22.24%. Unlike the other components, we considered only a two-strata implementation for the log shifter, because only marginal benefit is observed beyond two strata; we did, however, also evaluate a 32-bit log shifter to demonstrate the potential improvements. For the 16-bit log shifter on two-strata 3D, the improvements are 13.39% in performance and 14.28% in power; for the 32-bit log shifter, they are 28.39 and 6.99%, respectively.

Performance impact. Although it is possible to implement microprocessors on multiple dies, we consider only the performance impact of a two-strata processor, where we observe the largest gain.
From the results above, the latency of the issue logic can be reduced by 23%, while delay reductions of 20 and 28% are achieved for the KS adder and the 32-bit shifter, respectively, in a two-strata 3D technology. Based on recent results reported in [24, 30], the cache can be clocked 10–13% faster when implemented on a two-strata 3D architecture; since the structure of a register file is similar to that of a cache, we assume similar benefits can be obtained. We also assume that the load/store queue sees the same latency reduction as the issue logic, given their structural similarity. We have therefore enlarged certain structures in 3D and assumed their latencies to equal those of the corresponding smaller structures in 2D. Table 9.2 lists the parameters of both the 2D and 3D processors.

We used an architectural-level simulator, SimpleScalar, with applications from the SPEC2000 benchmark suite to evaluate the performance impact; the result is shown in Fig. 9.16. As can be observed from the figure, the enlarged structures and lowered latencies in 3D effectively extract more IPC than the conventional 2D implementation. The average IPC speedup is 11% across all five applications, and we believe further improvements could be achieved if more 3D-optimized components were incorporated.
Table 9.2 Processor parameters for 2D and 3D implementations

                    2D                 3D
Issue width         8                  8
Window size         32                 64
ROB size            128                128
IL1, DL1            32 KB, 3 cycles    64 KB, 3 cycles
Register file       128                128
Load/store queue    16                 32
Unified L2          1 MB, 8 cycles     1 MB, 7 cycles
Fig. 9.15 Power consumption comparison of 2D and 3D implementations
9.6 Challenges for 3D Architecture Design

Even though 3D integrated circuits show great benefits, several challenges remain for the adoption of 3D technology in future architecture design: (1) Thermal management. The move from 2D to 3D design can accentuate thermal concerns because of the increased power density; thermal-aware design techniques must therefore be adopted for 3D architecture design [37]. (2) Design tools and methodologies. Three-dimensional integration technology will not be commercially viable without the support of EDA tools and methodologies that allow architects and circuit designers to develop new architectures and circuits using this technology; design tools and methodologies that support 3D designs are therefore imperative [36]. (3) Testing. One of the barriers to 3D technology adoption is insufficient understanding of 3D testing issues and the lack of design-for-testability (DFT) techniques for 3D ICs, which have remained largely unexplored in the research community.

Acknowledgements Much of the work and ideas presented in this chapter have evolved over several years of work with our colleagues and graduate students at Penn State, in particular Professor Vijaykrishnan Narayanan, Professor Mary Jane Irwin, Yuh-Fang Tsai, Wei-lun Hung, and Xiangyu Dong.
References 1. Bernstein K (2006) Introduction to 3D integration. In: Tutorials in international solid state circuits conference (ISSCC) 2. Black B, Annavaram M, Brekelbaum N, DeVale J, Jiang L, Loh GH, McCauley D, Morrow P, Nelson DW, Pantuso D, Reed P, Rupley J, Shankar S, Shen J, Webb C (2006) Die stacking 3D microarchitecture. In: MICRO, pp 469–479 3. Borkar S (2008) 3D Technology: A System Perspective. In: Technical digest of the international 3D system integration conference 4. Carloni L, Pande P, Xie Y (2009) Networks-on-chip in emerging intercoonect paradigms: Advantages and challenges. In: International symposium on networks-on-chips, pp 93–102 5. Chen KN, Fan A, Tan CS, Reif R (2004) Contact resistance measurement of bonded copper interconnects for three-dimensional integration technology. IEEE Electron Device Lett, 25(1):10–12 6. Davis WR, Wilson J, Mick S, Xu J, Hua H, Mineo C, Sule AM, Steer M, Franzon PD (2005) Demystifying 3D ICs: The pros and cons of going vertical. IEEE Des Test Comput, 22(6):498–510 7. Dong X, Wu X, Sun G, Xie Y, Li H, Chen Y (2008) Circuit and microarchitecture evaluation of 3D stacking magnetic RAM (MRAM) as a universal memory replacement. In: DAC ’08: Proceedings of the 45th annual design automation conference, pp 554–559, ACM, New York, NY, USA 8. Dongkook P, Eachempati S, Das R, Mishra AK, Xie Y, Vijaykrishnan N, Das CR (2008) MIRA: A multi-layered on-chip interconnect router architecture. In: International symposium on computer architecture (ISCA), pp 251–261 9. Egawa R, Tada J, Kobayashi H, Goto G (2009) Evaluation of fine grain 3D integrated arithmetic units. In: IEEE International 3D system integration conference, pp 1–8
10. Garrou P (2008) Introduction to 3D integration. In: Handbook of 3D integration: technology and applications using 3D integrated circuits, Wiley, London 11. Jacob P, Zia A, Chu M, Kim JW, Kraft R, McDonald JF, Bernstein K(2008) Mitigating memory wall effects in high clock rate and multi-core CMOS 3D ICs: Processor memory stacks. Proceedings of IEEE, 96(10) 12. Joyner J, Zarkesh-Ha P, Meindl J (2001) A stochastic global net-length distribution for a three-dimensional system-on-a-chip (3D-SoC). In: Proceedings of the 14th annual IEEE international ASIC/SOC conference, pp 147–151 13. Kang YH, Jung SM, Jang JH, Moon JH, Cho WS, Yeo CD, Kwak KH, Choi BH, Hwang BJ, Jung WR, Kim SJ, Kim JH, Na JH, Lim H, Jeong JH, Kim K (2004) Fabrication and characteristics of novel load PMOS SSTFT ( stacked single-crystal thin film transistor) for 3-dimentional SRAM memory cell. In: Proceedings of IEEE international SOI conference, pp 127–129 14. Kgil T, D’Souza S, Saidi A, Binkert N, Dreslinski R, Mudge T, Reinhardt S, Flautner K (2006) PicoServer: using 3D stacking technology to enable a compact energy efficient chip multiprocessor. In: ASPLOS, pp 117–128 15. Kim J, Nicopoulos C, Park D, Das R, Xie Y, Vijaykrishnan N, Das C (2007) A novel dimensionally-decomposed router for on-chip communication in 3D architectures. In: Proceedings of the annual international symposium on computer architecture ACM SIGARCH Comput Archit News, 35(2):138–149 16. Lee KW, Nakamura T, Ono T, Yamada Y, Mizukusa T, Hashimoto H, Park KT, Kurino H, Koyanagi M (2000) Three-dimensional shared memory fabricated using wafer stacking technology. In: Technical digest of the international electron devices meeting, pp 228–229 17. Li F, Nicopoulos C, Richardson T, Xie Y, Vijaykrishnan N, Kandemir M (2006) Design and management of 3D chip multiprocessors using network-in-memory. In: International symposium on computer architecture (ISCA’06) ACM SIGARCH Comput Archit News, 34(2):130–141 18. Loh GH (2008) 3D-stacked memory architectures for multi-core processors. In: International symposium on computer architecture (ISCA), pp 453–464 19. Loh GH (2009) Extending the effectiveness of 3D-stacked dram caches with an adaptive multi-queue policy. In: International symposium on microarchitecture (MICRO), pp 201–212 20. Loh GH, Xie Y, Black B (2007) Processor design in 3D die-stacking technologies. IEEE Micro, 27(3):31–48 21. Mayega J, Erdogan O, Belemjian PM, Zhou K, McDonald JF, Kraft RP (2003) 3D direct vertical interconnect microprocessors test vehicle. In: Proceedings of the 13th ACM great lakes symposium on VLSI (GLSVLSI), pp 141–146 22. Ouyang J, Sun G, Chen Y, Duan L, Zhang T, Xie Y, Irwin M (2009) Arithmetic unit design using 180nm TSV-based 3D stacking technology. IEEE International 3D system integration conference, pp 1–4 23. Palacharla S, Jouppi NP, Smith JE (1997) Complexity-effective superscalar processors. ACM SIGARCH Comput Archit News, 25(2):206–218 24. Puttaswamy K, Loh GH (2005) Implementing caches in a 3D technology for high performance processors. In: ICCD ’05: Proceedings of the 2005 international conference on computer design, pp 525–532, IEEE Computer Society, Washington, DC, USA 25. Puttaswamy K, Loh GH (2007) Scalability of 3D-integrated arithmetic units in high-performance microprocessors. In: Design automation conference, pp 622–625 26. 
Stackhouse B, Bhimji S, Bostak C, Bradley D, Cherkauer B, Desai J, Francom E, Gowan M, Gronowski P, Krueger D, Morganti C, Troyer S (2009) A 65nm 2-billion transistor quad-core itanium processor. IEEE J Solid-State Circuits, 44(1):18–31 27. Sun G, Dong X, Xie Y, Li J, Chen Y (2009) A novel 3D stacked MRAM cache architecture for CMPs. In: International symposium on high performance computer architecture, pp 239–249 28. Tremblay M, Chaudhry S (2008) A third-generation 65nm 16-core 32-thread plus 32-scout-thread CMT SPARC processor. In: IEEE International solid-state circuits conference, pp 82–83
29. Tsai Y-F, Wang F, Xie Y, Vijaykrishnan N, Irwin MJ (2008) Design space exploration for threedimensional cache. IEEE TVLSI, 16(4):444–455 30. Tsai Y-F, Xie Y, Narayanan V, Irwin MJ (2005) Three-dimensional cache design exploration using 3DCacti. In: IEEE International conference on computer design, pp 519–524 31. Vaidyanathan B, Hung W-L, Wang F, Xie Y, Narayanan V, Irwin MJ (2007) Architecting microprocessor components in 3D design space. In: VLSI design, pp 103–108 32. Vangal S, Howard J, Ruhl G, Dighe S, Wilson H, Tschanz J, Finan D, Iyer P, Singh A, Jacob T, Jain S, Venkataraman S, Hoskote Y, Borkar N (2007) An 80-tile 1.28TFLOPS network-on-chip in 65nm CMOS. pp 98–589 33. Vangal SR, Howard J, Ruhl G, Dighe S, Wilson H, Tschanz J, Finan D, Singh A, Jacob T, Jain S, Erraguntla V, Roberts C, Hoskote Y, Borkar N, Borkar S (2008) An 80-tile Sub-100-W TeraFLOPS processor in 65-nm CMOS. IEEE J Solid-State Circuits, 43(1):29–41 34. Vantrease D, Schreiber R, Monchiero M, McLaren M, Jouppi NP, Fiorentino M, Davis A, Binkert N, Beausoleil RG, Ahn JH (2008) Corona: System implications of emerging nanophotonic technology. In: Proceedings of the 35th international symposium on computer architecture, pp 153–164 35. Wu X, Li J, Zhang L, Speight E, Xie Y (2009) Hybrid cache architecture. In: International symposium on computer architecture (ISCA) 36. Xie Y, Cong J, Sapatnekar S (2009) Three-dimensional integrated circuit design: EDA, design and microarchitectures. Springer, New York 37. Xie Y, Loh G, Black B, Bernstein K (2006) Design space exploration for 3D architectures. ACM J Emerg Technol Comput Syst, 2(2):65–103 38. Xu Y, Du Y, Zhao B, Zhou X, Zhang Y, Yang J(2009) A low-radix and low-diameter 3D interconnection network design. In: International symposium on high performance computer architecture, pp 30–42 39. Zhang K, Bhattacharya U, Chen Z, Hamzaoglu F, Murray D, Vallepalli N, Wang BZY, Bohr M (2004) A SRAM design on 65nm CMOS technology with integrated leakage reduction scheme. In: VLSI technology digest of technical papers, pp 294–295 40. Dong X, Xie Y (2009) System-level cost analysis and design exploration for 3D ICs. In: Asia and South Pacific design automation conference, pp 234–241
Chapter 10
3D Through-Silicon Via Technology Markets and Applications E. Jan Vardaman
10.1 Drivers for Through-Silicon Via Applications

The drivers for through-silicon via (TSV) adoption can be divided into two major application areas. The first is products driven by form-factor requirements, in some cases coupled with performance advantages. The second is high-performance computing, where the adoption of 3D TSV technology promises higher clock rates, lower power dissipation, and higher integration density. The technology will be adopted in many high-performance computing applications because it solves issues related to electrical performance, memory latency, power, and noise both on and off the chip. For some applications, a high-bandwidth memory interface to the logic has been the main driver for the development of the technology [1].
10.2 TSV Applications

Major applications for 3D TSVs include CMOS image sensors, memory, processors, and other logic devices such as field-programmable gate arrays (FPGAs). Each application has different requirements; therefore, TSV features and aspect ratios will vary, and different TSV fabrication processes may be used for each application. While through vias have been used in MEMS applications for many years, this market overview focuses on applications outside the MEMS area.
10.2.1 Image Sensors

The first application of the technology in production today is CMOS image sensors. Production lines at Toshiba were installed in 2008, and commercial products have
been shipped for mobile phone applications. Aptina (Micron's spin-off), Oki Electric, STMicroelectronics, and Tessera (with modifications to the ShellCase technology) have announced image sensor products for camera modules. Today's applications use a backside via-formation process. The addition of a DSP to the image sensor is anticipated in future camera module versions.
10.2.2 Memory

A key concern is that high-speed memory such as DDR3 will suffer performance limitations when connected in a stacked package using wire-bond technology. 3D TSV memory technology developments have been announced by Samsung Electronics, Micron Technology, NEC, Elpida, Oki Electric, Hynix, and Tezzaron. Some companies predicted that at the 32 nm node in 2010, a 3× memory density could be achieved using TSV technology [2]. While some companies are targeting mobile phones and other portable devices with sufficient RAM to run high-definition video and other 3D graphics applications, the first commercial application for DRAM with TSVs is expected to be high-performance memory for the server market starting in 2012.

Samsung's developments include an all-DRAM stacked memory package using TSV technology. Prototypes using the company's wafer-level-processed stacked package (WSP) consist of four 512 Mb DDR2 DRAMs for a combined 2 Gb of high-density memory, and a 4 GB DIMM stack made up of TSV-processed 2 Gb DRAMs. Samsung stated that it was developing the process for next-generation computer systems in 2010 and beyond [3]. Samsung has also published the results of a 3D DRAM that supports four-rank operation with a single master and three slave chips connected using approximately 300 TSVs; the total device density is 8 Gb and each stack constitutes a rank. The master acts as a buffer that isolates the channel from the slave chips. Samsung reported both improved performance and lower power consumption for these prototypes [4].

Tezzaron has developed a process for high-volume 3D memory that is repairable, has smaller capacitances, and has low manufacturing costs. The company is expected to begin selling a high-performance 3D SRAM replacement that will be less expensive to produce than existing high-end SRAMs.

Some companies have also discussed the possibility of TSVs for flash memory. Examples of NAND flash applications include memory storage cards, USB drives, MP3 players such as Apple's iPod, digital cameras, and portable gaming machines. Samsung has noted that demand for ever-smaller NAND flash card form factors is increasing rapidly, and its representatives have indicated that future flash applications may require TSVs to meet performance and form-factor requirements [5]. Other industry experts are pessimistic about shrinking design rules for NAND flash beyond the 32 nm generation, because the memory cells will be so small that operation will be unstable; the big problem is not the transistors but the increase in delay [6].
While Samsung has made an argument for the adoption of TSV for NAND flash, other companies argue that the cost of TSV cannot be justified and the required density can be obtained by thinning the die and wire bonding the stack. TSV technology is not expected to be used for NAND flash in high volume for many years.
10.2.3 Processor and Memory

The tremendous bandwidth required to avoid latency issues in future multicore processor systems can only be provided by TSV technology. The first application is expected to be cache memory bonded to a processor. A full repartitioning of processors to exploit the complete potential of 3D stacking, including lower power consumption and less noise, will be developed later; such adoption requires advances in 3D design tools for repartitioning.

Memory access speed is increasing at a much slower rate than processor speed, and the slower memory causes the processor to stall while it waits for data. Caching has been used to reduce the impact of slower memory on processor performance. Most designs use multiple levels of cache, described as Level 0 (L0), Level 1 (L1), and so on, in order of distance from the processor. Conventional 2D architectures place the processor and caches in the same plane: the L0 and L1 caches are on the same die as the processor, while an L2 cache can be on a separate die. The interconnection wires between the processor and off-chip caches are long, especially with the L2 on a separate die, so multiple clock cycles pass before data moves from one end to the other. Initial 3D processor designs anticipate placing the L0 and L1 caches on the same die as the processor. Table 10.1 illustrates the effectiveness of repartitioning the Intel Core 2 processor and demonstrates the potential impact of 3D on latency; the reduction in latency from repartitioning the caches and the cores is significant [7]. The use of 3D architecture may well be the only way to avoid memory latency issues for future generations of multicore microprocessors.
Table 10.1 Impact of partitioning on latency for Intel's processor

Item                    Latency reduction (%)
Scheduler               32
ALU + bypass            36
Reorder buffer          52
Register file           53
L1 cache                31
L2 cache                51
Register alias table    36

Source: Georgia Institute of Technology
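To relate the component-level figures in Table 10.1 to system behaviour, the sketch below applies the L1 and L2 latency reductions to a textbook average-memory-access-time (AMAT) calculation. The baseline latencies, hit rates, and memory latency are illustrative assumptions, not data from this chapter.

```python
# Average memory access time (AMAT) with and without the 3D cache-latency
# reductions from Table 10.1.  All baseline latencies (cycles) and hit
# rates below are illustrative assumptions.

L1_LAT, L2_LAT, MEM_LAT = 3.0, 12.0, 200.0   # assumed 2D latencies (cycles)
L1_HIT, L2_HIT = 0.90, 0.80                  # assumed hit rates

def amat(l1_lat, l2_lat, mem_lat=MEM_LAT):
    """AMAT = L1 time + L1 miss rate * (L2 time + L2 miss rate * memory)."""
    return l1_lat + (1 - L1_HIT) * (l2_lat + (1 - L2_HIT) * mem_lat)

amat_2d = amat(L1_LAT, L2_LAT)
amat_3d = amat(L1_LAT * (1 - 0.31),          # 31% L1 reduction (Table 10.1)
               L2_LAT * (1 - 0.51))          # 51% L2 reduction (Table 10.1)

print(f"AMAT: 2D {amat_2d:.2f} cycles, 3D {amat_3d:.2f} cycles "
      f"({100 * (1 - amat_3d / amat_2d):.1f}% lower)")
```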
10.2.4 Field Programmable Gate Arrays

The benefits of 3D architecture in logic devices are especially evident in FPGAs. FPGAs consist of large arrays of simple, programmable logic elements together with a programmable interconnect hierarchy that allows the logic blocks to be connected as the system designer desires. FPGA performance is limited by the configurable interconnect, which takes up 90% of the chip real estate and accounts for 40–80% of the device delay. 3D integration can improve FPGA performance by removing the programmable interconnect from the logic-block layer and placing it on another tier of the stack, thus reducing the interconnect delay [8].
10.3 Remaining Challenges

10.3.1 Technical Challenges

With the exception of CMOS image sensors, the industry is only starting to adopt 3D TSV technology. The timing for mass production in many applications depends on how TSVs compare in cost with existing packaging and interconnect technologies; applications in which wire-bonded die stacks provide a cost-effective solution, and in which performance does not mandate 3D TSV, will not adopt the new technology. Remaining challenges include:

• Commercial availability of EDA tools and design methodologies
• Thermal concerns due to the increased power densities
• Test, including known good die (KGD) and test methods

Companies must have design tools in order to use 3D TSV technology, and 3D integration will not become commercially viable without the support of EDA tools and methodologies that allow circuit designers to use it. Design tools remain a weak link in the 3D infrastructure, with better thermal modeling, finite-element analysis, floor planning, and layout tools all required for a smooth 3D design flow. Current design tools for 2D ICs cannot be easily extended to 3D ICs [9, 10]; because of this, 3D integration has so far been limited to applications such as image sensors. IDMs such as IBM, Intel, Samsung, and Micron can develop internal tools. TSMC is working on a set of design guidelines for its customers, and other foundries are expected to provide similar guidelines in the future.

Thermal dissipation is an extremely important issue in 3D TSV technology. Most of the heat generated in ICs arises from transistor switching and interconnect resistance and is typically conducted through the silicon substrate to the package and then to the ambient through a heat sink. In multi-tiered 3D
device designs, the layers are insulated from each other by dielectric layers (SiO2, porous SiO2), which have much lower thermal conductivity than silicon. Heat dissipation therefore becomes even more difficult in 3D devices and can cause degradation in device performance and reduced chip reliability through increased junction leakage, electromigration failures, and the acceleration of other temperature-sensitive failure mechanisms [11]. While much progress has been made on thermal issues in TSV technology, additional development will be required for high-performance devices; as these issues are resolved, 3D TSV technology will move from R&D into production.

Many companies have not sufficiently investigated the test issues associated with adopting TSVs for more complicated devices in a production environment. Many researchers assume that KGD will be available, but test methods are still in development and little has been publicly disclosed. New probe technology may be required with the introduction of TSVs; alternatively, greater use of built-in self-test (BIST) will be required.
10.3.2 Sourcing Challenges

Apart from the purely technical challenges that must be overcome to enable widespread adoption of TSV-based 3D integration technologies, a number of business challenges also exist. These must be resolved for any new technology to become cost efficient and thus find its way into mainstream application products. Two large classes of challenges can be distinguished: (1) compatibility between the technologies of different vendors and (2) cost structures.

3D integration using TSVs is a very complex technology process that includes steps traditionally belonging to the realm of foundries as well as to packaging/system assembly and test (SAT) houses. Heterogeneous integration implies the bonding of multiple dies fabricated in different technologies and perhaps even in different foundries. Hence unfinished wafers and/or dies, bare dies with protruding TSVs, or wafers temporarily bonded to carrier wafers pre- or post-thinning, for example, may need to be transported to other facilities, where production can continue using technologies compatible with those used for the fabrication up to that phase.

The main reason for such complex sourcing and logistics arrangements is cost. Foundries typically operate with higher profit margins than SAT houses, so it is more cost efficient to do as much as possible on the SAT side. Such an arrangement automatically implies that a via-last process is better from a manufacturing-cost point of view, since there is a clean interface between what can be done at the foundry and what at the SAT house, leaving many process steps to the latter, where the cost structure is more beneficial for the customer. However, via-last technology offers much smaller benefits than via-first and is more expensive in terms of area cost due to its larger TSV pitch.
References 1. Vardaman EJ (2008) The Z-direction goes vertical. Circ Assemb Up Media Group p. 24 2. Ikeda H (2007) 3D stacked DRAM using TSV plenary session. Electronic components and technology conference 3. Samsung Develops New (2007) Highly efficient stacking process for DRAM [Online]. Semicond Int 4. Kang U et al (2009) 8Gb 3D DDR3 DRAM using through-silicon-via technology. In: IEEE international solid-state circuits conference, San Francisco, CA, 8–12 February, pp 130–131 5. Lee K (2006) Next generation package technology for higher performance and smaller systems. In: 3D architectures for semiconductor integration and packaging conference, Burlingame, CA, 31 October to 2 November 6. Ooishi M (2007) Vertical stacking to redefine chip design. Nikkei Electronics Asia Nikkei Business Publications, Inc. 6(4):20 7. Puttaswamy K, Loh GH (2007) Thermal herding: microarchitecture techniques for controlling hotspots in high performance 3D-integrated processors. In: IEEE 13th international symposium on high performance computer architecture, Scottsdale, AZ, 10–14 February, pp 193–204 8. Ababei C, Maidee P, Bazargan K (2004) Exploring potential benefits of 3D FPGA integration. In: Field programmable logic and application Book Series Lecture Notes in Computer Science, Springer, Berlin pp 874–880 9. Xie Y, Loh G, Black B, Bernstein K (2006) Design space exploration for 3D architectures. ACM J Emerg Technol Comput Syst 2:2 10. Ronen R, Mendelson A, Lai K, Lu S-L, Pollack F, Shen JP (2001) Coming challenges in microarchitecture and architecture. IEEE Proc 89:3 11. Garrou P, Vardaman E (2008) Through silicon via technology: the ultimate market for 3D interconnect. TechSearch International, Inc.
Index
A Accumulation, 37, 39, 40, 44, 45 Algorithm, 73, 75–80, 83, 88–91, 94, 96, 104, 154 Alignment, 7, 8, 10, 16, 17, 19, 23–26, 28, 30, 49, 50, 54, 56, 60, 88, 159, 165, 219 Architecture, 11, 48, 66, 91, 94, 96, 141, 143–144, 146–147, 149–151, 170, 172–175, 185, 193, 204–218, 220, 232, 233, 239, 240 Arithmetic, 91, 210–211, 217, 227–228, 231 Assembly, 10, 16, 21, 22, 26–27, 31, 49–51, 53–57, 60, 62–65, 68–69, 138, 159, 241 B Bandwidth, 65, 67, 137, 174, 177, 179, 185–186, 190–195, 197–199, 202–214, 216, 217, 237, 239 Bonding, 2, 3, 6–10, 17–22, 24–26, 36, 49–54, 60, 62, 71, 105, 122, 130, 138, 160, 167, 169, 178, 197, 198, 218–219, 222, 239–241 Breakdown voltage, 43, 48 Bus, 60, 92, 147, 149, 151, 172, 176, 177, 179, 190–192, 195, 201–203, 205–206, 216 C Cache, 88, 91–92, 95, 185, 186, 188, 192, 194, 206, 210, 212–215, 220–222, 224, 232, 239 Capacitance, 2, 3, 20, 23, 27, 34, 37–45, 47, 48, 57, 61, 62, 81, 92, 93, 111–114, 129, 137, 177, 180, 196–198, 202, 224, 226, 229, 238 Characterization, 10, 26–27, 30–44, 139, 150, 151, 177
Chip, 1, 13, 56, 71, 101, 137, 185, 210, 237 Circuit, 1, 3, 5, 13, 16, 24, 27, 29–31, 35, 41, 42, 44, 47, 48, 58, 62, 71, 80, 82, 87, 88, 104, 111–112, 124, 126, 129, 137, 142, 160, 162, 180, 205, 211, 217, 229, 233, 240 Clock, 16, 33, 35, 44, 73, 188, 189, 191, 206, 211, 217, 222, 223, 232, 237, 239 Communication, 2, 137, 141, 146, 147, 149, 151–152, 170–178, 186, 190, 192, 196–199, 201, 203, 208, 213, 214, 223 Compact mechanical models, 168–176, 181 Compact thermal models, 168–176, 181 Contact, 20–22, 24, 25, 35–37, 41–43, 51, 63, 64, 67, 109, 127, 162, 222, 227, 229 Cooling, 5, 31, 63, 89, 94, 101, 103, 105–107, 109, 110, 120, 124, 126, 127, 133 Copper, 18, 25, 33, 35, 50–52, 54, 57–60, 62, 67, 72, 106, 112, 113, 196, 215 Crosstalk, 41–43, 73 D Depletion, 37–41, 44–45 Design, 2, 13, 33, 49, 71–96, 101, 135, 185, 209–233, 238 Design of experiments, 103, 118–121, 129–133 Diameter, 20, 25, 35, 38–40, 43–45, 57, 58, 60, 108, 109, 116, 118, 120–122, 126, 129–130, 132, 180, 198 Die, 2, 16, 33, 49, 89, 103, 137, 185, 211, 239 Dielectric, 18, 22–23, 33, 38, 39, 44–45, 58, 62, 71, 72, 167, 196 3D integrated circuit (3D IC), 4, 13–16, 18–24, 26–28, 30–31, 55, 69, 71–74, 79, 83–84, 86–88, 95, 96, 101–133, 181, 209–211, 215–219, 233, 240
244 3D stacked integrated circuit (3D SIC), 3, 4, 7, 9–11, 107, 108, 136, 137, 141–143, 145–147, 152, 169, 176, 178, 181, 199 Dynamic Random Access Memory (DRAM), 5, 11, 49, 50, 56, 58, 65–69, 163, 173, 176, 178, 179, 185–208, 212–213, 215, 238 E Energy consumption, 40, 47, 185, 186, 195, 197, 205 Etching, 6, 10, 13, 18, 20, 25–26, 37, 52, 58 Exploration, 29, 44, 73, 88, 91–96, 110, 138, 140, 142, 143, 145–147, 152, 153, 157, 164, 170, 173–175, 179, 181, 213, 222 F Fabrication, 11, 14, 18, 20, 23, 25–27, 29–30, 62, 65, 66, 101, 105, 106, 186, 214, 215, 218, 237, 241 Filling, 6, 10, 13, 20, 25–26, 54, 58, 119 Floorplanning, 71, 73, 86–96, 145, 152, 154, 155, 159, 163, 165, 166, 169 G Geometry, 33–35, 41, 44–45, 48, 108, 126, 129 H Heterogeneous integration, 137, 209, 210, 213–215, 217–218, 241 High-level synthesis, 144, 149–151, 170–171, 181 Homogeneous integration, 49 Hot spot, 105, 106, 158–160, 162–164, 166, 169, 175, 176 I Image sensor, 50, 195, 237–238, 240 Impedance, 44, 46, 48, 180, 200 Inductance, 20, 23, 27, 34–45, 48, 62, 110–113, 127, 129, 130, 137, 178, 180 Interconnects, 3, 4, 19, 33, 41, 43, 48, 64, 71, 74, 94, 101–106, 110, 118, 133, 137, 186, 198, 206, 207, 210, 211, 214, 216, 218–221 Interface, 9, 37, 39, 40, 60, 155, 157, 165, 179, 185, 186, 189–192, 195–205, 207, 216, 237, 241
Index International roadmap for semiconductors (ITRS), 19–20, 24, 101, 213, 217 Inversion, 37, 40, 44 L Landing pad, 28, 35 Latency, 3, 67–68, 71, 86, 87, 91, 93–95, 147, 150, 151, 188, 190–192, 194, 205–207, 209–213, 229, 230, 232, 237, 239 Leakage, 2, 7, 26, 43, 48, 86, 187, 223, 241 Library, 73, 151, 171, 180 Liquid cooling, 63, 103, 105, 106, 133 M Materials, 6, 7, 10, 24, 28, 31, 33–35, 44–45, 48–51, 57, 58, 60, 61, 113, 139, 154–159, 164–168 Measurements, 26, 35, 36, 48, 81, 167, 168, 180, 229 Memory subsystem, 186, 193, 200 Metal, 13, 16, 18, 20–27, 30, 33, 35, 36, 39, 40, 45, 50–54, 57, 58, 60, 62, 64, 67, 71–75, 81, 107–110, 113, 114, 116, 117, 120, 160, 179, 180, 198, 211, 214, 215, 219, 222, 229 Micro-bumps, 178 Microprocessor, 11, 101, 209–233, 239 Modeling, 11, 19, 33–48, 62–64, 77, 83–86, 112–114, 144, 149, 153, 157–159, 164, 165, 167–170, 179, 240 Multi chip modules, 3 Multi-processor system on chip (MPSoC), 151, 171–172, 175, 177–180, 185, 192–193 N Netlist, 71, 73, 74, 83, 142, 151, 152, 171, 173, 179 Network on Chip (NoC), 66, 147, 149, 151, 170–175, 210, 212, 213, 216–218 Noise, 43, 88, 102–105, 110–113, 118, 120–121, 127–130, 132, 199, 237, 239 P Packaging, 2, 9, 10, 15–18, 54, 101, 106, 137, 141, 156–159, 165, 176–180, 197, 198, 238, 240, 241 Parasitic, 20, 23, 34, 62, 111–114, 129, 137, 196, 199, 204
Index Partitioning, 5, 62, 83, 88, 91–93, 104, 119, 123, 137, 141, 144, 146, 147, 153, 155, 159, 164, 166, 210–211, 213, 215, 218–223, 225, 226, 229, 230, 239 PathFinding, 135–181 Physical design, 11, 71–96, 104–105, 136, 138–142, 144, 145, 147, 152, 165, 172, 179 Physical prototyping, 91, 143, 144, 147, 149, 151, 152, 168, 170, 172, 174, 175, 180, 181 Pin, 3, 15, 23, 25, 65, 73–75, 77, 85, 102, 127, 130, 186, 196–199, 202, 203, 212, 214, 217, 225 Pitch, 4, 18–20, 22–25, 41, 42, 50, 58, 60, 61, 64, 91, 92, 109–112, 114, 117, 119–122, 126, 129–132, 141, 180, 195, 198, 202, 219, 220, 230, 241 Placement, 13, 30, 56, 71, 73, 78, 80–86, 96, 104, 110, 119, 121, 145, 147, 152, 159, 160, 162, 165, 166, 173, 175, 181, 227, 228 Planning, 73–80, 96 Platform, 1, 171, 172, 177, 179, 185, 188, 193, 195–199, 203, 205, 206 Plug, 20, 22, 23, 33, 35–38, 43, 44 Power consumption, 1–3, 5, 7, 13, 69, 86–89, 91, 92, 101, 110, 127, 129, 197–203, 205, 206, 209, 211, 214, 216, 217, 221, 224, 231, 239 delivery, 4, 101, 106, 110–116 density, 7, 72, 86–88, 101, 126, 162, 233 distribution, 110–114, 133, 141, 202 grid, 73, 111 Printed Circuit Board (PCB), 1, 3, 5, 153, 178, 180, 191, 192, 196, 197, 200, 203–205, 207 Processor, 28, 30, 56, 60, 65, 95, 111, 149, 179, 188, 194, 199, 203, 205, 206, 210–214, 216–220, 223, 229, 232, 239 R Redundancy, 31, 50, 56, 65–66, 68 Resistance, 20, 22, 23, 26, 27, 33, 35–37, 41, 43, 44, 47–48, 57, 61, 67, 81, 93, 105, 106, 111–114, 127, 130, 165, 166, 169, 180, 196–200, 229, 240 Resistivity, 20, 35, 44, 113 Ring oscillator, 14, 24, 27, 28 Routing, 11, 28, 60, 71, 102, 145, 198, 222
245 S Self-inductance, 43–45, 112 Series stub terminated logic (SSTL), 191, 192, 195, 199–203 Signal, 13, 16, 24, 30, 33, 40, 43, 44, 47, 48, 62–64, 67, 75–78, 96, 101–133, 191–193, 195–198, 200–204, 212, 214, 217, 221, 222, 224, 227, 228 Solder balls, 3, 4 Spacing, 2, 20, 23, 44, 62, 63, 73–75, 77, 84–93, 102, 103, 105, 109, 110, 112, 116, 117, 119, 129, 138, 140, 145, 146, 153, 179, 190, 198, 210, 213, 222 SRAM. See Static random access memory SSTL. See Series stub terminated logic Stack, 2–5, 7–9, 11, 14–16, 24–27, 33, 50–54, 60, 63–65, 74, 101, 102, 106, 110, 111, 116, 117, 122, 128, 129, 137–139, 142, 144, 153–155, 157–160, 163–167, 169, 176, 212–214, 217, 218, 223, 238–240 Stacking, 2–10, 24, 27, 33, 49, 51, 53, 54, 63, 64, 68, 84, 86, 88, 101, 105, 113, 138–139, 157, 160, 163, 170, 185–209, 212–215, 217–219, 239 Static random access memory (SRAM), 27, 30, 91, 92, 187–188, 192, 199, 206, 212, 214–216, 222, 238 Substrate, 3, 4, 6, 21, 22, 26, 33, 35, 37–43, 45, 50–52, 57, 60, 62, 106, 154, 158–161, 165–167, 219 Supply voltage, 4, 40–41 System, 5, 14, 66, 83, 101, 136, 185, 215, 238 T TechTuning, 135–181 Temperature, 5, 7, 24–26, 49–51, 58, 63, 69, 72–74, 79, 80, 82, 83, 86–90, 94, 104, 106–110, 118, 120, 121, 123–127, 129, 131, 132, 138, 158, 163, 165–167, 175, 205, 211, 212, 241 Thermal analysis, 26, 63, 74, 103, 105, 107–108, 123–127 Thermal conductivity, 72, 80, 105, 107, 108, 154, 167, 241 Thermal maps, 166, 167, 169, 175, 176 Thermo-fluidic, 105–110 Thickness, 5–7, 16, 18, 20, 22, 35, 38, 40–45, 50, 54, 60, 81, 112, 116, 120, 122, 126, 127, 129, 130, 154, 161, 162, 169, 219 Thinning, 6, 8, 10, 20–22, 26, 35, 51, 53, 54, 60, 160, 161, 219, 239, 241 Through silicon via (TSV), 6, 13, 33, 50, 73, 102, 146, 185, 209, 237
246 Tier, 13, 18, 20–23, 30, 71, 73–80, 82, 84–89, 91, 94, 104, 106, 107, 110, 111, 137, 144, 147, 152, 153, 166, 175, 240 TSV. See Through silicon via Tungsten, 18, 23, 26, 33, 51, 54, 57–60, 67, 215 V Vias, 4, 10, 11, 13, 16, 18, 20, 24, 51, 57, 58, 71, 73–75, 77–81, 88, 89, 92, 96, 102, 114, 116–117, 218, 219, 222, 223, 228, 230, 237 Virtual placement and routing, 104, 119, 145, 173 Virtual prototyping, 144–145 Virtual synthesis, 151, 181
Index W Wafer, 2, 4, 6, 8, 13, 16–27, 30, 31, 49–58, 60, 61, 63, 64, 101, 105, 160–162, 186, 218–220, 222, 238 Width, 25, 39, 65, 81, 108, 137, 185, 209, 237 Wire-bonds, 2–4, 197, 198, 200 Wirelength, 77, 78, 80–84, 86–89, 92, 94, 96, 120, 121, 126, 129, 130, 132 Y Yield, 2, 6, 8–10, 25–27, 30, 31, 50, 54, 56, 63–68, 73, 139, 140, 157, 186, 198, 218