Coupled Data Communication Techniques for High-Performance and Low-Power Computing (Integrated Circuits and Systems)

Integrated Circuits and Systems Series Editor Anantha Chandrakasan, Massachusetts Institute of Technology Cambridge, M...

Author: Ron Ho | Robert Drost


Integrated Circuits and Systems

Series Editor Anantha Chandrakasan, Massachusetts Institute of Technology Cambridge, Massachusetts

For other titles published in this series, go to www.springer.com/series/7236

Ron Ho • Robert Drost Editors

Coupled Data Communication Techniques for High-Performance and Low-Power Computing

Editors Ron Ho Oracle Corporation Sun Labs VLSI Research Group 16 Network Circle UMPK 16-161 Menlo Park, CA 94025 USA [email protected]

Robert Drost Los Altos, CA 94024 USA

ISSN 1558-9412 ISBN 978-1-4419-6587-5 e-ISBN 978-1-4419-6588-2 DOI 10.1007/978-1-4419-6588-2 Springer New York Dordrecht Heidelberg London Library of Congress Control Number: 2010927932 © Springer Science+Business Media, LLC 2010 All rights reserved. This work may not be translated or copied in whole or in part without the written permission of the publisher (Springer Science+Business Media, LLC, 233 Spring Street, New York, NY 10013, USA), except for brief excerpts in connection with reviews or scholarly analysis. Use in connection with any form of information storage and retrieval, electronic adaptation, computer software, or by similar or dissimilar methodology now known or hereafter developed is forbidden. The use in this publication of trade names, trademarks, service marks, and similar terms, even if they are not identified as such, is not to be taken as an expression of opinion as to whether or not they are subject to proprietary rights. Printed on acid-free paper Springer is part of Springer Science+Business Media (www.springer.com)

For Christina, Sawyer, and Finley – RH

For Sharon and Juliet – RJD

Foreword

Wafer-scale integration has long been the dream of system designers. Instead of chopping a wafer into a few hundred or a few thousand chips, one would just connect the circuits on the entire wafer. What an enormous capability wafer-scale integration would offer: all those millions of circuits connected by high-speed on-chip wires.

Unfortunately, the best known optical systems can provide suitably fine resolution only over an area much smaller than a whole wafer. There is no known way to pattern a whole wafer with transistors and wires small enough for modern circuits.

Statistical defects present a firmer barrier to wafer-scale integration. Flaws appear regularly in integrated circuits; the larger the circuit area, the more probable a flaw becomes. If such flaws were the result only of dust one might reduce their numbers, but flaws are also the inevitable result of small scale. Each feature on a modern integrated circuit is carved out by only a small number of photons in the lithographic process. Each transistor gets its electrical properties from only a small number of impurity atoms in its tiny area. Inevitably, the quantized nature of light and the atomic nature of matter produce statistical variations in both the number of photons defining each tiny shape and the number of atoms providing the electrical behavior of tiny transistors. No known way exists to eliminate such statistical variation, nor may any be possible.

Proximity communication, or coupled data communication in general, may make possible the long-sought dream of wafer-scale integration. Proximity communication permits assembly of wafer-scale systems from small parts. We can make circuit chips small enough for low defect rates, cast aside bad chips, and reassemble the good chips into wafer-scale systems. Two properties of proximity communication suit it to wafer-scale use. First, quality: the connections between chips are nearly as good as wires on a single chip. As this book describes, proximity connections are fast, occupy small area, and consume little energy. Second, and I think much more important, is replacement: proximity communication permits one to replace chips in a big system. Together, quality and replacement make wafer-scale integration possible. Because I think replacement is so important, I’m going to devote a few more lines to it.

What makes replacement possible? Proximity communication needs neither welds nor solder. The parts are joined electrically only by the electric fields between them. These fields pass right through the top layers of glass that protect the chips. Within error limits, the communication is also insensitive to chip separation and chip alignment. If one chip in a wafer-scale assembly of hundreds of chips proves unsuitable because of a hidden defect, or through aging, or simply for product upgrade, no physical bonds prevent its replacement.

I believe that replacement will prove most useful for test. A complete system could serve as a jig that would test fresh chips in their real environment. Each fresh chip would spend only long enough in the complete system for a thorough test. A test jig smaller than a complete system might also serve to test only a single type of chip, providing it an environment indistinguishable from a full system. Such a test jig would have full-speed access to every connection to or from the fresh chip. I see a huge potential for replacement to simplify and improve test.

I also see that replacement may permit a profound change in the business alliances that produce products. Without the ability to replace, one bad chip destroys an entire multi-chip module, making specialization in module assembly a poor business. Because one bad chip spoils the entire module, a contractor who assembles multi-chip modules must take responsibility not only for defects in his own process, but also for defects in separate chips. This dual responsibility is a very high barrier to contract assembly. Board-level assembly houses are common because they avoid this dual barrier in two ways. First, not only is board-level assembly an old art with a well-known low defect density, but it also uses packaged and well-tested parts. Second and more important, at the board level some, albeit limited, replacement is possible. It is possible to remove and reuse at least the high-value chips on a board-level assembly, greatly reducing the high cost of bad parts. I believe that because proximity communication permits replacement it will also foster wafer-scale assembly houses.

Bob Johnson, formerly technical head of Burroughs, talked about using conductive grease to connect the ordinary pads on chips placed face-to-face. A large area of thin grease between facing pads would provide a connection. The thinner and much longer layer of grease reaching to other pads would produce small but manageable cross talk. I merely replaced Johnson’s grease with electric fields. Robert Drost’s fiendishly clever diagonal arrangement of pads greatly reduces cross talk.

Bob Bosnyak designed and measured some early proximity communication test chips. I recall one flawed ring oscillator test chip built for us by the MOSIS foundry service. The flaw turned out to be total omission of the metal plates on adjacent levels of metal that were to form the bulk of Bosnyak’s test structure. Nevertheless, the test chip worked, albeit at a mystifyingly small fraction of its intended speed. The mystery vanished when we discovered the omitted plates. MOSIS rebuilt the test chip for free.

The late Bob Proebsting, a pioneer and life-long designer of fine memory parts, contributed to us much knowledge about sense amplifiers. For a period, the authors of this book were, in effect, Proebsting’s post-doc students. As usual in such relationships, both the brilliant teacher and the apt students took much delight from the process. It was my joy to assemble such a mass of brainpower and to watch both its progress and the continuing delight of its participants.

Portland, Oregon, September 2009

Ivan Sutherland

Contents

Part I Introduction

1 Introduction to Coupled Data Technologies
  Ron Ho, Robert Drost
  1.1 Life has been good
  1.2 Faster computers tomorrow
    1.2.1 The end of Moore’s Law
    1.2.2 The arguments against–and for–multiple chips
  1.3 Coupled data communication
    1.3.1 This book
  References

Part II Overview of 3D Technologies

2 Power delivery, signaling and cooling for 2D and 3D integrated systems
  Muhannad Bakir, Gang Huang and Bing Dang
  2.1 Introduction
  2.2 Evolution of conventional silicon ancillary technologies: A brief overview
  2.3 Novel silicon ancillary technologies
    2.3.1 Optical I/Os
    2.3.2 Fluidic I/Os for single and 3D chips
  2.4 Power delivery for 2D and 3D systems
    2.4.1 Power delivery and design implications of 2D systems
    2.4.2 Power delivery and design implications of 3D systems
  2.5 Conclusion
  References

Part III Coupled Data Technologies

3 Capacitive Coupled Communication
  David Hopkins, Alex Chow, Frankie Liu, Dinesh D. Patil, Hans Eberle
  3.1 Introduction
  3.2 An electrical model of capacitive interchip communication
    3.2.1 Crosstalk mitigation
    3.2.2 Simulation results
  3.3 Transmitting data
  3.4 Receiving data
    3.4.1 Attenuation
    3.4.2 Loss of DC information
    3.4.3 Comparators
    3.4.4 Receiver sizing
    3.4.5 Timing schemes
  3.5 Two-dimensional arrays
  3.6 Measurement results
    3.6.1 Voltage waterfall
    3.6.2 Timing waterfall
    3.6.3 Combined eye diagram
    3.6.4 BER versus chip separation
  3.7 Prototype application: a high-radix switch
  References

4 Inductive Coupled Communications
  Noriyuki Miura, Takayasu Sakurai, and Tadahiro Kuroda
  4.1 Introduction
  4.2 Inductive-coupling channel
    4.2.1 Overview of channel characteristics
    4.2.2 Range extendability
    4.2.3 Coupling strength through Si substrate
    4.2.4 Crosstalk
  4.3 Inductive-coupling transceiver
    4.3.1 Signaling
    4.3.2 Coil design
    4.3.3 Transceiver circuit design
    4.3.4 Inter-chip communications
  4.4 Power reduction techniques
    4.4.1 Pulse shaping
    4.4.2 Daisy chain transmitter
  4.5 High-speed techniques
    4.5.1 Asynchronous transceiver
    4.5.2 Burst transmission
  4.6 Crosstalk reduction techniques
    4.6.1 Time interleaving
    4.6.2 Differential coil
  4.7 Application I: memory stacking
    4.7.1 Homogenous chip stacking
    4.7.2 Inductive-coupling up/down repeater
    4.7.3 Test chip measurement
  4.8 Application II: processor and memory stacking
    4.8.1 Heterogenous chip stacking
    4.8.2 Interface design
    4.8.3 Test chip measurement
  4.9 Conclusion
  References

5 Use of AC Coupled Interconnect in Contactless Packaging
  Paul Franzon
  5.1 Introduction: Why use ACCI?
    5.1.1 Chapter outline
  5.2 Historical Perspectives
  5.3 Capacitively Coupled Chip I/O
    5.3.1 Capacitively Coupled Channel Design
    5.3.2 ACCI Circuits
    5.3.3 ACCI Packaging
  5.4 Mid-channel Capacitively Coupled Structures
  5.5 Inductively Coupled Connectors and Sockets
  5.6 Conclusions and Future Perspectives
  References

Part IV Enabling Coupled Data Technologies

6 Aligning chips face-to-face for dense capacitive communication
  John E. Cunningham, Ashok V. Krishnamoorthy, Ivan Shubin, James G. Mitchell, Xuezhe Zheng
  6.1 Introduction
  6.2 Aligning chips face-to-face
    6.2.1 Power and ground connections between coupled chips
  6.3 A low-cost package for capacitive proximity communication
  6.4 Array packages using bridge chips
  References

Part V Extending Data Coupling Technologies

7 Delivering On-chip Bandwidth Off-chip and Out-of-box with Proximity and Optical Communication
  Ashok V. Krishnamoorthy, Jon Lexau, Xuezhe Zheng, John E. Cunningham
  7.1 Introduction
  7.2 Photonics as a long-reach interconnect
  7.3 Photonics on VLSI (optoelectronic VLSI)
  7.4 Proximity and photonic communication
  7.5 Test chip results
  7.6 Conclusion
  References

8 AC Coupled Wireless Power Delivery
  Makoto Takamiya, Kohei Onizuka, and Takayasu Sakurai
  8.1 Three dimensional stacked inter-chip wireless power delivery
  8.2 Prototype of wireless power transmission circuits
  8.3 Theoretical analysis and circuit improvements
  8.4 Summary
  References

Index

List of Contributors

Dr. Ron Ho
Sun Microsystems Research Labs, 16 Network Circle, Menlo Park, CA 94025, USA, e-mail: [email protected]

Dr. Robert Drost
Sun Microsystems Research Labs, 16 Network Circle, Menlo Park, CA 94025, USA, e-mail: [email protected]

Dr. Muhannad Bakir
Microelectronics Research Center, Georgia Institute of Technology, 791 Atlantic Dr. NW, Atlanta, GA 30332-0269, USA, e-mail: [email protected]

Alex Chow
Sun Microsystems Research Labs, 16 Network Circle, Menlo Park, CA 94025, USA, e-mail: [email protected]

Dr. John E. Cunningham
Sun Microsystems Chief Technology Organization, 9515 Towne Centre Drive, San Diego, CA 92121, USA, e-mail: [email protected]

Dr. Bing Dang
IBM T. J. Watson Research Center, 1101 Kitchawan Rd, RM 6-242, Yorktown Heights, NY 10598, USA, e-mail: [email protected]

Dr. Hans Eberle
Sun Microsystems Research Labs, 16 Network Circle, Menlo Park, CA 94025, USA, e-mail: [email protected]

Prof. Paul Franzon
Department of Electrical and Computer Engineering, North Carolina State University, Box 7914, Raleigh, NC, 27695, USA, e-mail: [email protected]

David Hopkins
Sun Microsystems Research Labs, 16 Network Circle, Menlo Park, CA 94025, USA, e-mail: [email protected]

Dr. Gang Huang
Intel Corporation, Ultra Mobility Group, 1501 S. MO-Pac Expy, Austin, TX 78746 USA, e-mail: [email protected]

Dr. Ashok V. Krishnamoorthy
Sun Microsystems Chief Technology Organization, 9515 Towne Centre Drive, San Diego, CA 92121, USA, e-mail: [email protected]

Professor Tadahiro Kuroda
Department of Electrical Engineering, Keio University, 3-14-1, Hiyoshi, Kohokuku, Yokohama 223-8522, JAPAN, e-mail: [email protected]

Jon K. Lexau
Sun Microsystems Research Labs, 16 Network Circle, Menlo Park, CA 94025, USA, e-mail: [email protected]

Dr. Frankie Liu
Sun Microsystems Research Labs, 16 Network Circle, Menlo Park, CA 94025, USA, e-mail: [email protected]

Professor Noriyuki Miura
Department of Electrical Engineering, Keio University, 3-14-1 Hiyoshi, Kohokuku, Yokohama 223-8522 JAPAN, e-mail: [email protected]

Dr. James G. Mitchell
Sun Microsystems Chief Technology Organization, 16 Network Circle, Menlo Park, CA 94025, USA, e-mail: [email protected]

Dr. Kohei Onizuka
Formerly with the Institute of Industrial Science, University of Tokyo, and now with Toshiba Corporation.

Dr. Dinesh D. Patil
Sun Microsystems Research Labs, 16 Network Circle, Menlo Park, CA 94025, USA, e-mail: [email protected]

Professor Takayasu Sakurai
Institute of Industrial Science, University of Tokyo, 4-6-1 Komaba, Meguro-ku, Tokyo 153-8505, JAPAN, e-mail: [email protected]

Dr. Ivan Shubin
Sun Microsystems Chief Technology Organization, 9515 Towne Centre Drive, San Diego, CA 92121, USA, e-mail: [email protected]

Professor Makoto Takamiya
VLSI Design and Education Center, University of Tokyo, 4-6-1 Komaba, Meguroku, Tokyo 153-8505, JAPAN, e-mail: [email protected]

Dr. Xuezhe Zheng
Sun Microsystems Chief Technology Organization, 9515 Towne Centre Drive, San Diego, CA 92121, USA, e-mail: [email protected]

Part I

Introduction

Chapter 1

Introduction to Coupled Data Technologies

Ron Ho, Robert Drost

Dr. Ron Ho and Dr. Robert Drost, Sun Microsystems Research Labs, 16 Network Circle, Menlo Park, CA 94025, USA, e-mail: {ron.ho},{robert.drost}@sun.com

R. Ho and R. Drost (eds.), Coupled Data Communication Techniques for High-Performance and Low-Power Computing, Integrated Circuits and Systems, DOI 10.1007/978-1-4419-6588-2_1, © Springer Science+Business Media, LLC 2010

1.1 Life has been good

The past quarter-century has seen an explosive growth in the performance of computer systems. One of the first widely popular personal computers was a mid-1980s IBM PC, running on a 4.77 MHz Intel 8088 processor, stuffed with 256 KB of system memory (plus another 384 KB on an expansion card), displaying 640x200 black-and-white graphics, and storing data on 360 KB 5.25-inch floppy disks. In 2009, a typical workstation configuration sold by Sun Microsystems, the Ultra 24 Workstation, used a 3 GHz Intel Core 2 Quad processor with 8 GB of memory, displayed 2560x1600 graphics on a 30-inch LCD monitor using an Nvidia Quadro NVS 290 accelerator card, with up to 1.8 TB of Serial-Attached SCSI drives spinning at 15 krpm. Both systems cost around $4000 in contemporary dollars.

The enormous advancement in price-performance between these computer systems came from improvements in many different technologies, including storage media, displays, software systems, and so on. But certainly a large part of it was because VLSI semiconductors, and high-end microprocessors and memories in particular, have gotten faster. Figure 1.1 shows the historical performance of microprocessors, normalized to the SpecINT2000 benchmark [1], and how it has seen a remarkable 35% compound annual growth rate over the past twenty-five years – a growth curve seen by virtually no other industry. The natural question prompted by this chart is, “can this growth curve continue?” Or, for the readers of this book, “what must designers do to enable it to continue?”

Fig. 1.1 Processor performance scaling over the past twenty-five years [3].

This growth in performance is popularly, though somewhat incorrectly, fully attributed to “Moore’s Law.” This is what Carver Mead at Caltech called Gordon Moore’s now-famous 1965 extrapolation of transistor density scaling [2]. Moore had argued that when optimized for lowest total cost, integrated circuit chips would, over time, contain an ever-growing number of transistors. Too few transistors per chip, and the fixed overhead of manufacturing and packaging the chips would dominate their cost; too many transistors, and random defects would excessively reduce the yield of good chips and hence increase their cost. But the right number of transistors–the number that minimized cost–would continue to grow, as wafer sizes increase and transistor dimensions decrease.

In reality, transistor density scaling has only partly fueled the growth in computer systems performance. Equally important have been rapid advances in raw transistor speeds and in aggressive design techniques, as we discuss next.
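Moore’s cost argument can be made concrete with a small numerical sketch. The model below is an illustration only, not Moore’s own formulation: it assumes a fixed per-chip overhead, a silicon cost proportional to transistor count, and a simple Poisson yield model, and all parameter values and function names are hypothetical.

```python
import math

def cost_per_transistor(n, fixed_cost, silicon_cost, defect_rate):
    """Cost of one working transistor on a chip holding n transistors.

    fixed_cost:   per-chip packaging and handling overhead (hypothetical units)
    silicon_cost: processed-silicon cost per transistor (hypothetical units)
    defect_rate:  expected killer defects per transistor (assumed Poisson yield)
    """
    chip_yield = math.exp(-defect_rate * n)  # fraction of chips with no defects
    return (fixed_cost + silicon_cost * n) / (n * chip_yield)

def optimal_transistor_count(fixed_cost, silicon_cost, defect_rate):
    """Transistor count that minimizes cost_per_transistor.

    Setting the derivative to zero gives the quadratic
    (d*s)*n^2 + (d*F)*n - F = 0, whose positive root is returned.
    """
    a = defect_rate * silicon_cost
    b = defect_rate * fixed_cost
    return (-b + math.sqrt(b * b + 4.0 * a * fixed_cost)) / (2.0 * a)

# Hypothetical process: compare a baseline with one that halves the defect rate.
for defects in (1e-7, 5e-8):
    n_best = optimal_transistor_count(1.0, 1e-6, defects)
    print(f"defect rate {defects:.0e}: cheapest chip holds about {n_best:.2e} transistors")
```

Under these assumptions the cost curve has a single minimum: below it the fixed overhead dominates, above it yield loss dominates, and lowering the defect rate moves the minimum toward larger chips.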

1.2 Faster computers tomorrow

For a new computer system to out-perform an old computer system on the same software program, it must demonstrate improvements in the product of three terms: seconds per logic gate, logic gates per clock cycle, and clock cycles per instruction [4]. The product of these three gives the program execution time, in seconds per instruction.

The number of seconds per logic gate (approximately 10⁻¹¹ seconds, or 10 ps, in a modern 40 nm process technology) has been scaling down roughly linearly with technology for many years: a technology with half the drawn transistor dimensions of another could be expected to be twice as fast. Each new generation scales dimensions to roughly 70% of the previous generation’s, so this doubling of speed arrives every two generations, or every five to six years.
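As a rough numerical sketch of the three-term product and of what a five-to-six-year doubling means on an annual basis, consider the following. The 10 ps gate delay comes from the text; the 20 gates per cycle and 1.0 cycles per instruction are assumed values chosen only for illustration, and the function names are hypothetical.

```python
def seconds_per_instruction(seconds_per_gate, gates_per_cycle, cycles_per_instruction):
    """Execution time per instruction as the product of the three terms."""
    return seconds_per_gate * gates_per_cycle * cycles_per_instruction

def clock_frequency_hz(seconds_per_gate, gates_per_cycle):
    """Clock frequency implied by the first two terms."""
    return 1.0 / (seconds_per_gate * gates_per_cycle)

def annual_growth_from_doubling(years_to_double):
    """Compound annual growth rate equivalent to one doubling every N years."""
    return 2.0 ** (1.0 / years_to_double) - 1.0

# 10 ps per gate is from the text; the other two values are assumptions.
t_gate = 10e-12
print(f"clock: {clock_frequency_hz(t_gate, 20) / 1e9:.1f} GHz")
print(f"time per instruction: {seconds_per_instruction(t_gate, 20, 1.0) * 1e12:.0f} ps")
for years in (5, 6):
    rate = annual_growth_from_doubling(years)
    print(f"doubling every {years} years is about {rate:.1%} per year")
```

A doubling every five to six years works out to a low-teens annual percentage, which is the scale against which the 35% historical growth rate should be compared.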

While this improvement trend has held steady for several technology generations, designers expect it to slow down soon. This is because maintaining transistor performance directly conflicts with reducing transistor power, and power has become a primary design constraint in today’s systems. As a result, transistor designers will likely choose to reduce what they have long jokingly called their “technology entitlement,” and live with devices that are only slightly faster each process generation. But even if gate-delay scaling does not slow as many designers expect, it provides at best a 2x improvement every five to six years, or approximately a 13% annual growth rate. More must be done to match the 35% historical growth rate in computer performance.

The number of logic gates per clock cycle, when combined with the seconds per logic gate, gives clock frequency, which is 2–5 GHz in modern processors. Logic gates per clock cycle directly measure the aggressiveness of the processor design: a CPU with thirty gates per clock cycle is much less aggressive than one with only ten gates per clock cycle, because the designers of the thirty-gate CPU have much more time per cycle to perform computation or communication. What is the limit to this design aggressiveness? Over the past twenty years, the number of logic gates per cycle has fallen as the aggressiveness of designs has increased. Pre-Pentium processors used around 100 gates per cycle. Today, the industry has settled in the range of 15–30 gates per cycle.

Fig. 1.2 Scaling of logic gates per clock as a function of technology generation.

Collectively we now understand that achieving the lower end of that range is possible but disproportionately expensive: building so-called “short-tick” machines requires much more effort and care in clock distribution, parasitic extraction, timing verification, and min-path methodology. For example, a modern processor has a clocking overhead of nearly two gates per cycle, so a ten gate-per-clock design would have only eight gate delays in which to do work, barely enough time to complete a 64-bit integer addition. While doable, such designs consume not only extra design resources and non-recurring engineering (NRE) costs but also significantly more power. Therefore, the number of logic gates per cycle will most likely not fall any further. Combined with the argument above for seconds per logic gate, this predicts slow changes in clock frequency, on the order of 13% per year and likely even lower.

Therefore, continued improvement in processor performance must come from reducing the number of cycles per instruction. This arises through increased parallelism: pipelined or superscalar execution, vector processing, and speculation all aim to increase the number of operations concurrently executed.¹ At a larger scale, processors with multiple cores and a shared memory can be used to divide a complex problem into separate threads. Historically, such techniques have provided the balance of the performance gains shown in Figure 1.1, with designers increasingly leveraging and targeting parallelism.

However, increased parallelism–and reduced cycles per instruction–has a cost: processors by necessity also grow increasingly complex. Larger instruction windows to winnow out code independencies require larger queues and communication structures. Multiple execution pipes require more area for more adders, multipliers, and registers, as well as the switches to access these added functional blocks. Processors packed with multiple cores need to fit not only those cores on the die, but also correspondingly more cache to keep them all fed.

This last point bears repeating: suppose we increase the number of cores on a chip. If we keep the memory-to-core ratio constant, then each core still has the same amount of cache available to it, and therefore has a consistent cache miss rate. However, because of the growing number of cores, the total aggregated miss rate for the chip will go up and put pressure on the fixed off-chip I/O bandwidth; as a result, when increasing the number of cores on a chip we must in fact disproportionately increase the cache size as well, to lower miss rates and to continue to fit inside the available total chip I/O.

Thus far the transistor density scaling provided by Moore’s Law has kept up with the need for ever more complex architectures and systems, and allowed us to continue to find and to exploit parallelism on a chip. In other words, the improvements in clock cycles per instruction provided by Moore’s Law scaling have combined with the historical improvements in seconds per logic gate and logic gates per clock to give the trends in Figure 1.1.
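The pressure that added cores place on a fixed off-chip I/O budget can be sketched with a simple traffic model. This is a minimal illustration under stated assumptions: every core issues memory accesses at the same rate, each miss transfers one cache line, and the access rate, line size, and bandwidth figures below are hypothetical rather than taken from any particular processor.

```python
def offchip_traffic_bytes_per_s(cores, accesses_per_s, miss_rate, line_bytes):
    """Aggregate off-chip traffic when every miss fetches one cache line."""
    return cores * accesses_per_s * miss_rate * line_bytes

def max_miss_rate(cores, accesses_per_s, line_bytes, io_bandwidth_bytes_per_s):
    """Largest per-core miss rate that still fits in the fixed I/O bandwidth."""
    return io_bandwidth_bytes_per_s / (cores * accesses_per_s * line_bytes)

# Hypothetical chip: 1e9 accesses/s per core, 64-byte lines, 20 GB/s of I/O.
ACCESSES = 1e9
LINE = 64
IO_BW = 20e9
for cores in (2, 4, 8, 16):
    limit = max_miss_rate(cores, ACCESSES, LINE, IO_BW)
    print(f"{cores:2d} cores: per-core miss rate must stay below {limit:.2%}")
```

In this model the tolerable per-core miss rate falls in inverse proportion to the core count, so holding cache per core constant is not enough; each core needs a lower miss rate, and hence more cache, as cores are added.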

¹ In this discussion we gloss over important distinctions between instruction-level parallelism and task-level parallelism. While they are remarkably different at an architectural level, at a physical level both require similar increases in integration and hence increased transistor counts in a package.

1.2.1 The end of Moore’s Law

“Is Moore’s Law ending?” is a perennially asked question in industry journals and conferences. For several very good reasons involving seemingly fundamental physics, feature size scaling has “always” been on its last legs. Yet the industry has stubbornly insisted on solving these problems and continually shrinking transistors and wires. Today, foundries pattern structures with dimensions of a few tens of nanometers using light with a wavelength of 193 nm. By rights, this ought to be impossible. Yet it is done, by using optical proximity correction, phase-shifting masks, off-axis illumination, spacer masks, and some extremely expensive diffractive lenses. Atomic thinness limits in oxide gate insulators are overcome by employing metallic gates and high-permittivity liners, which happily also help reduce gate leakage currents. And a combination of mostly-air dielectrics that reduce wire parasitics, and thick deposited metals that reduce wire resistance, has helped to keep wires from overly constraining chip performance.

Will these improvement trends continue in the next ten to twenty years? While the answer “no” has been proven wrong time and time again, recent economic limitations have now supplanted technology as the likely true limit for Moore’s Law. Especially given the financial realities of the current global economic crisis, the semiconductor industry can no longer continue to enjoy a fully elastic market that supports ever-increasing global financial investment. Worse yet, new fabrication plants will each cost over 1% of the total semiconductor market, thus limiting the number of new technologies able to come on-line each year. Gordon Moore himself pointed out that his “law” will eventually end, although he was hopeful that new technologies would delay that date–and from his talk in 2003 to the present, they certainly have. However, any industry that constantly relies on exponential growth to continue will eventually be disappointed.

Thus, Moore’s Law of transistor scaling has historically combined with logic gate scaling and clock rate scaling to enable faster and faster computers. Looking forward, Moore’s Law is the only scaling trend left, as gate scaling and clock rate scaling are both slowing down for design and integration reasons, and even Moore’s Law will not survive through the next few technology generations. What is a designer of high-end computer systems to do?

1.2.2 The arguments against (and for) multiple chips

Designers can achieve more complex systems either by exploiting Moore’s Law scaling for a single chip or by aggregating the functionality across multiple chips. An example of the former is a recent Xeon microprocessor from Intel that occupies nearly 7 cm² in area and contains as many as eight full processors and a proportionally large cache [5]. An example of the latter would be an IBM Power processor with five chips integrated on a multi-chip module (MCM).


Introduction

Often, system designers resort to the multiple-chip approach because a single-chip solution is infeasible: the design would not fit within a silicon lithographic reticle, or it would be so big that the inevitable sprinkling of random defects during wafer processing would unacceptably reduce the final yield of working chips. But if the design fits, integrating within a single chip allows the functional blocks to communicate using on-chip wires, which are dense and plentiful, and can be run efficiently and with high data fidelity. On-chip wires can also be highly optimized for power and/or performance improvements using various circuit techniques [6, 7, 8, 10], so that the costs to send data between blocks on a single chip are small.

By contrast, using multiple chips entails a number of tradeoffs in power and performance that make it less appealing. Principally, chips communicate with other chips through solder connections and traces on printed circuit boards or packages. Solder connections are large and thus expensive: a single high-speed channel requires two chip solder pads, each around 100 µm on a side (and more if shields are included). While a large chip can have several thousands of these solder pads, most are required for delivering power supply current to the chip, and typically only a few hundred are left for data communication. To squeeze as much bandwidth out of these few pads as possible, designers run them at data rates much higher than the chip’s clock speed, thus serializing data into the pad and deserializing the data at the other chip. These serializer-deserializer (or “serdes”) circuits consume significant energy per transmitted bit: not only do they have to queue and dequeue the data bits, but they also have to generate and precisely align overclocked timing signals from the data stream.
Moreover, running high data rates on resistive and lossy channels requires several circuit enhancements, such as channel equalization, which consume additional power and add complexity. A natural question therefore arises: can we design systems that employ communication structures as cheap and efficient as on-chip wires, but that do not suffer from single-chip area limits? Can we build some semblance of a “virtual” chip, out of multiple chips appropriately stitched together, with the equivalent of on-chip wires? This book discusses several ways in which the answer is “yes.”
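The pad-budget argument above can be made concrete with a little arithmetic. The following Python sketch is only illustrative: the function name and every numeric input (400 data pads, 10 Gb/s per channel, a 2.5 GHz core clock) are assumptions chosen for the example, not figures from this chapter. Only the two-pads-per-channel rule comes from the text.

```python
def offchip_link_budget(data_pads, pads_per_channel, channel_rate_gbps, core_clock_ghz):
    """Estimate channel count, aggregate bandwidth, and serialization ratio.

    All arguments are hypothetical design inputs supplied by the caller.
    """
    # Each high-speed channel consumes a fixed number of solder pads.
    channels = data_pads // pads_per_channel
    # Aggregate off-chip bandwidth if every channel runs at the same rate.
    aggregate_gbps = channels * channel_rate_gbps
    # Bits each channel must serialize per core clock cycle.
    serialization_ratio = channel_rate_gbps / core_clock_ghz
    return channels, aggregate_gbps, serialization_ratio


# Assumed inputs: 400 data pads, 2 pads per channel (as in the text),
# 10 Gb/s per channel, 2.5 GHz core clock.
channels, aggregate_gbps, ratio = offchip_link_budget(400, 2, 10.0, 2.5)
print(f"{channels} channels, {aggregate_gbps:.0f} Gb/s aggregate, {ratio:.0f}:1 serialization")
```

Under these assumed inputs, a few hundred pads yield only a couple of hundred channels, which is why each channel must run several times faster than the core clock and why serdes energy becomes a first-order cost.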

1.3 Coupled data communication

The idea behind coupled-data communication is that chips can communicate with each other, directly or through an intermediate layer, when placed in close proximity. Structures on one chip can interact with matching structures on the other, through electric field or magnetic field interactions, or through optical coupling. Because these structures can be small, and much smaller than traditional solder balls, they offer enormous improvements in bandwidth density, and thus the possibility of running many such channels in parallel rather than serializing them with overclocked timing. This, along with the small size and hence small capacitance of the data coupling structures, allows the circuits to be low-energy as well. Finally, the relative simplicity of coupled data circuits allows them to have relatively low latency. In many ways, then, coupled data circuits allow chip-to-chip communication with metrics similar to those of on-chip wires.

Imagine, then, a large VLSI design that integrates many blocks and units in order to achieve high system performance, but that is too large to fit economically on a single die. It may also ideally combine disparate technologies (CMOS, DRAM, Flash, SiGe) that cannot traditionally be manufactured in the same fabrication process. Designers might create such a system out of individual chips, each tailored to its own optimized technology, and connect them together using coupled data communication. In this way, the internal chip-to-chip communication would be small enough, fast enough, consume low enough energy, and have low enough latency that it would be akin to using on-chip wires to connect together different parts of a much larger chip. The design would be a single “virtual” chip composed of many different chips. Whether these chips are stacked vertically or spread horizontally depends on the system, the packaging, and the coupled data circuits.

This style of design, exploiting coupled data communication, offers a way past the inevitable slow-down of Moore’s Law to continue to scale overall system performance. In many-chip systems enabled by coupled data communication, designers can create systems of much greater complexity than standard silicon scaling offers; in a very real sense they can skip generations of Moore’s Law scaling.

1.3.1 This book

In this book we discuss several ways of designing these coupled data circuits in current state-of-the-art implementations. We begin with an overview of packaging technologies for many-chip integrated systems, by Bakir, Huang, and Dang, surveying their work in this field at Georgia Tech over the past several years. Capacitive coupled data communication circuits as envisioned by Sun Microsystems are next introduced by Hopkins, Chow, Liu, Patil, and Eberle, followed by a complementary chapter on inductive coupled data communication circuits from Keio University and the University of Tokyo, by Miura, Sakurai, and Kuroda. Some earlier foundational work on capacitive coupling through board traces done at North Carolina State University is then reviewed by Franzon. Coupled data communications can require careful chip-to-chip alignment. The next chapter, by Cunningham, Krishnamoorthy, Shubin, Mitchell, and Zheng, discusses packaging technologies to overcome these requirements. Merging electrical with optical communications is the subject of the following chapter, by Krishnamoorthy, Lexau, Zheng, and Cunningham. Finally, work at the University of Tokyo on delivering power through coupled connections is introduced by Takamiya, Onizuka, and Sakurai.


References

1. http://www.spec.org
2. G. Moore, “Cramming more components onto integrated circuits,” Electronics Magazine, vol. 38, no. 8, 1965.
3. M. Horowitz, E. Alon, D. Patil, S. Naffziger, R. Kumar, K. Bernstein, “Scaling, power, and the future of CMOS,” IEEE International Electron Devices Meeting, 2005, pp. 7–15.
4. J. Hennessy, D. Patterson, Computer Architecture: A Quantitative Approach, Third Edition, Morgan-Kaufmann, 2002.
5. S. Rusu, S. Tam, H. Muljono, J. Stinson, D. Ayers, J. Chang, R. Varada, M. Ratta, S. Kottapalli, “A 45nm 8-Core Enterprise Xeon(R) Processor,” IEEE International Solid State Circuits Conference, 2009.
6. R. Ho, T. Ono, F. Liu, R. Hopkins, A. Chow, J. Schauer, R. Drost, “High-speed and low-energy capacitively driven wires,” IEEE Journal of Solid State Circuits, vol. 43, no. 1, January 2008.
7. E. Mensink, D. Schinkel, E. Klumperink, E. van Tuijl, B. Nauta, “A 0.28pJ/b 2Gb/s/ch transceiver in 90nm CMOS for 10mm on-chip interconnects,” IEEE International Solid State Circuits Conference, 2007.
8. B. Kim, V. Stojanovic, “A 4Gb/s/ch 356fJ/b 10mm equalized on-chip interconnect with nonlinear charge-injecting transmit filter and transimpedance receiver in 90nm CMOS,” IEEE International Solid State Circuits Conference, 2009.
9. G. Moore, “No exponential is forever: but ’Forever’ can be delayed!” IEEE International Solid State Circuits Conference, 2003.
10. J. Seo, R. Ho, J. Lexau, M. Dayringer, D. Sylvester, D. Blaauw, “High bandwidth and low energy on-chip signaling using adaptive pre-emphasis in 90nm CMOS,” IEEE International Solid State Circuits Conference, 2010.

Part II

Overview of 3D Technologies

Chapter 2

Power delivery, signaling and cooling for 2D and 3D integrated systems Muhannad Bakir, Gang Huang and Bing Dang

2.1 Introduction

As gigascale integrated (GSI) technology progresses beyond the 45 nm generation, the performance of a monolithic system-on-a-chip (SoC) has failed by progressively greater margins to reach the “intrinsic limits” of each particular generation of technology [1]. The root cause of this lag is the fact that the capabilities of monolithic silicon technology per se have vastly surpassed those of the ancillary or supporting technologies that are essential to the full exploitation of a high-performance SoC. The most serious obstacle that blocks fulfillment of the ultimate performance of an SoC is inferior heat removal. The increase in clock frequency of an SoC has been virtually brought to a halt by the lack of an acceptable means for removing, for example, 200 W from a 15x15 mm die. In addition, the inability to remove more than 100 W/cm² per stratum is a key limiter to the successful 3D integration of high-performance ICs. A huge deficit in chip input/output (I/O) bandwidth due to insufficient I/O interconnect density is the second most serious deficiency stalling high performance gains. The excessive access time of a chip multiprocessor (CMP) for communication with its off-chip main memory is a direct consequence of the lack of, for example, a low-latency 100 THz aggregate bandwidth I/O signal network. Lastly, SoC performance has been severely constrained by inadequate I/O interconnect technology capable of supplying, for example, 200–400 A at 0.7 V to a CMP with ever-decreasing noise margins. Of course, innovation in silicon ancillary technologies will have to be done in parallel with continued innovations at the chip level, including improvements in scaled transistors, interconnects, and system architecture. A critical technical hurdle to the above grand challenges is the realization of a low-cost integrated interconnect network that is capable of addressing the heat removal, I/O bandwidth and power delivery requirements for a gigascale SoC. Challenges in power delivery and cooling, moreover, are exacerbated with 3D system integration.

This chapter is organized as follows: Section 2.2 provides a review of current silicon ancillary technologies, and Section 2.3 describes low-cost and fully compatible electrical, optical, and fluidic, or “trimodal,” chip I/O interconnects for single and 3D chips. Power delivery for GSI systems and chip-package codesign of the power delivery network are discussed in Section 2.4. Section 2.5 is the conclusion.

Dr. Muhannad Bakir, Georgia Institute of Technology, 791 Atlantic Dr. NW, Atlanta, GA 30332-0269, USA, e-mail: [email protected]
Gang Huang, Intel Corporation, Ultra Mobility Group, 1501 S. MO-Pac Expy, Austin, TX 78746, USA, e-mail: [email protected]
Bing Dang, IBM T. J. Watson Research Center, 1101 Kitchawan Rd, RM 6-242, Yorktown Heights, NY 10598, USA, e-mail: [email protected]

R. Ho and R. Drost (eds.), Coupled Data Communication Techniques for High-Performance and Low-Power Computing, Integrated Circuits and Systems, DOI 10.1007/978-1-4419-6588-2_2, © Springer Science+Business Media, LLC 2010
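The example figures quoted in this introduction (200 W from a 15x15 mm die, and 200–400 A at 0.7 V) can be related to each other with simple arithmetic. The Python sketch below is a rough illustration rather than a design tool; the function names are hypothetical, and it assumes power is spread uniformly over the die.

```python
def power_density_w_per_cm2(power_w, die_width_mm, die_height_mm):
    """Average areal power density, assuming uniform dissipation over the die."""
    area_cm2 = (die_width_mm / 10.0) * (die_height_mm / 10.0)
    return power_w / area_cm2


def supply_power_w(current_a, voltage_v):
    """Power delivered at the chip for a given supply current and voltage (P = I * V)."""
    return current_a * voltage_v


# Figures taken from the text: 200 W on a 15x15 mm die, 200-400 A at 0.7 V.
density = power_density_w_per_cm2(200.0, 15.0, 15.0)
p_low = supply_power_w(200.0, 0.7)
p_high = supply_power_w(400.0, 0.7)
print(f"average power density: {density:.1f} W/cm^2")
print(f"delivered power range: {p_low:.0f}-{p_high:.0f} W")
```

The average density comes out just under 90 W/cm², close to the 100 W/cm² per-stratum limit mentioned above, and the quoted current range corresponds to roughly 140 to 280 W of delivered power.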

2.2 Evolution of conventional silicon ancillary technologies: A brief overview

In order to maintain constant junction temperature with increasing power dissipation, the size of the heat sink used to cool a microprocessor has been increasing, thus imposing limits on system size, chip packing efficiency, and interconnect length between chips. A schematic illustration of a 2D system is shown in Figure 2.1. It is projected that the junction-to-ambient thermal resistance at the end of the roadmap will be less than 0.2 °C/W [2]. However, using conventional materials for the various thermal interconnects between the silicon die and the ambient (the heat spreader, the heat sink, and the thermal interface materials (TIM) at the die/heat spreader and heat spreader/heat sink interfaces), the lowest attainable thermal resistance from a conventional air-cooled heat sink is approximately 0.5 °C/W. Some reduction in the thermal resistance can be achieved with improved materials and increased air flow rate. Not only can the TIM account for a large fraction of the overall thermal resistance, but it also presents many reliability problems [3]. The cooling of hot spots (power density up to 500 W/cm²) greatly exacerbates the complexity of cooling. Thus, it is clear that more revolutionary innovations in cooling technologies are needed to 1) eliminate/improve the TIM, 2) reduce the thermal resistance of the heat sink, 3) address the cooling needs of nonuniform power dissipation (cooling of hot spots), and 4) reduce the dimensions of the chip cooling hardware.

The number of power and ground I/Os needed on a chip is a function of power dissipation and the maximum allowable on-chip power supply noise, which is due to resistive losses and δI/δt noise across the power distribution network. Because the supply voltage decreases with each technology generation (although more slowly in the future) and timing margins shrink, the allowable on-chip power supply noise will also decrease.
However, the increase in power dissipation and the resulting increase in current drain of a SoC will increase the power supply noise and resistive losses through the motherboard, socket, package, and chip I/Os, which can become large [4, 5, 6]. In order to maintain acceptable on-chip supply noise, the number of power and ground pads must be scaled accordingly with each technology generation, and appropriately sized on-chip wires and decoupling capacitors must be allocated for the power distribution network [7]. The above issues are discussed in more detail later in the chapter.

Fig. 2.1 A schematic illustration of a traditional “2D” multi-socket system. (Figure labels: heat sink, heat spreader, capacitor, socket, die, power, communication.)

Power delivery is not the only challenge. A key bottleneck to the realization of high-performance microelectronic systems is the lack of low-latency and high-bandwidth off-chip interconnects. Some of the challenges in achieving high-bandwidth chip-to-chip communication using conventional electrical interconnects include the low number of signal pins, frequency-dependent high losses in the substrate, reflections and impedance discontinuities, and susceptibility to cross-talk. The motivation for the use of microphotonics technology to overcome these challenges and leverage low-latency and high-bandwidth chip-to-chip communication has been presented [8, 9]. Significant progress has been made in developing chip-to-chip optical interconnects. Fiber-to-the-chip schemes, where an optical signal is coupled to a silicon integrated circuit through nanoscale silicon-based waveguides, have been reported [10]. However, such an approach limits the optical I/O density (because of fiber size and handling), increases the complexity of packaging, and potentially increases the cost of assembly because fibers must be manually (and serially) connected to each chip. High-density free-space optical interconnects are also being pursued for chip-to-chip communication [11, 12, 13]. However, susceptibility to misalignment and complexity in packaging are formidable challenges that have yet to be fully addressed. Optical misalignments can severely reduce the optical power delivered to the photodetector, thereby increasing the bit error rate (BER) and reducing bandwidth [14]. Moreover, such free-space optical I/O schemes are not compatible with underfill processes. Technologies to address these challenges are discussed later in the chapter.

Challenges in power delivery and cooling are exacerbated in 3D integrated systems, which have recently gained significant momentum in the semiconductor industry and are viewed as a key enabler to future system performance advances.
Three-dimensional integration may be used either to partition a single chip into multiple strata to reduce on-chip global interconnect length [15] or to stack chips that are homogeneous or heterogeneous. An example of 3D stacking of homogeneous chips is a stack of memory chips, while an example of heterogeneous chip stacking is a stack of memory and microprocessor chips. There are a number of interconnect challenges that need to be addressed to enable stacking of high-performance die (see Figure 2.2). When two 100 W/cm² microprocessors are stacked on top of each other, for example, the net power density becomes 200 W/cm², which is beyond the heat removal limits of conventional air-cooled heat sinks [16]. Thus, cooling becomes a key limiter to the stacking of high-performance chips. Power delivery to a 3D stack of high-power chips also presents many challenges and requires careful and appropriate resource allocation at the package level, die level, and interstratal interconnect level [17]. Both issues are discussed later in the chapter.

Fig. 2.2 A schematic illustration of the challenges associated with “3D” stacking of high-performance die. These include cooling, power delivery, packaging, types of interstratal interconnects, assembly and bonding, chip- or wafer-scale integration, and more.

Fig. 2.3 Schematic illustrations of 3D technologies not using through-silicon vias (TSVs). (Figure labels: stacked die packages; stacked die on a substrate; stacked die with inductive coupling on a substrate.)

Figures 2.3–2.5 show representative schematic illustrations of 3D integration technologies that have been proposed to date, which fall into three categories. The first category contains 3D stacking technologies that do not utilize TSVs and is shown in Figure 2.3. The second category consists of 3D integration technologies that require TSVs (Figure 2.4), and the third category consists of monolithic 3D systems that make use of semiconductor processing to form active levels that are vertically stacked (Figure 2.5). Of course, a combination of all these technologies is possible.

The non-TSV 3D systems span a wide range of different integration methodologies. The left of Figure 2.3 illustrates stacking of fully packaged die. Although this may offer the advantages of low cost, ease of adoption, fast time to market, and modest form-factor reduction, the overhead in interconnect length and the low-density interconnects between the two die do not enable one to fully exploit the advantages of 3D integration. The middle of Figure 2.3 illustrates the most common method to stack memory die, which is based on the use of wire bonds. Naturally, this 3D technology is suitable for low-power and low-frequency chips due to the adverse effect of wire bond length, low density, and peripherally limited pad locations for signaling and power delivery. The right of Figure 2.3 illustrates the use of wireless signal interconnection between different levels using inductive coupling (capacitive coupling is also possible) [18]. This is discussed later in the book. There are several derivatives of the topologies described above, which in general are hybrid die/package-level solutions. It is important to note that the non-TSV approaches rely on stacking at the die/package level (die-on-wafer is possible for inductive coupling and wire bond) and thus do not utilize wafer-scale bonding. This may serve to impose limits on economic gains from 3D integration due to the cost of the serial assembly process.

Fig. 2.4 Schematic illustrations of 3D technologies using through-silicon vias (TSVs).

Figure 2.4 illustrates 3D integration based on TSVs. The left figure illustrates bonding of die with C4 bumps and TSVs. The short interconnect lengths and high density of interconnects that this approach offers are important advantages. Compared to wire bonding, it is possible to have a number of interconnects that is several orders of magnitude larger. Although it is possible to bond at the wafer level, this approach is most suitable for die-level bonding (using a flip-chip bonder). The right figure illustrates 3D stacking based on thin-film bonding (metal-metal or dielectric-dielectric) [19, 20, 21]. Not only are solder bumps eliminated in this approach, but increased interconnect density and tighter alignment accuracy may also be achieved when compared to the previous approach, because these approaches are based on wafer-scale bonding (although there are challenges in aligning 12-inch wafers, for example).

Fig. 2.5 Schematic illustrations of 3D technologies using monolithic integration.

Finally, Figure 2.5 illustrates a semiconductor-manufacturing (non-packaging) approach to 3D integration. The main enabler of this approach is the ability to deposit or grow a semiconductor film on a wafer during the IC manufacturing process using a number of techniques [22, 23]. Ultimately, this approach may offer the most integrated system but may potentially be limited to a smaller set of applications.


It is important to note that none of the above described 3D integration technologies address the need for cooling in a 3D stack of high performance chips. This is a significant omission and imposes a constraint on the ability to fully utilize the benefits of 3D technology for high-performance chips. As such, new 3D integration technologies are needed for such applications.
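The cooling constraint described in this section follows from two simple relations: steady-state temperature rise is power times thermal resistance, and the areal power densities of stacked strata add. The Python sketch below illustrates both, using the thermal resistances (0.5 and 0.2 °C/W) and the two-stratum 100 W/cm² example from the text. Applying them to the 200 W chip power quoted in Section 2.1 is an assumption made here for illustration, and the function names are hypothetical.

```python
def junction_temperature_rise_c(power_w, thermal_resistance_c_per_w):
    """Steady-state junction-to-ambient temperature rise: delta_T = P * R_theta."""
    return power_w * thermal_resistance_c_per_w


def stacked_power_density_w_per_cm2(strata_densities_w_per_cm2):
    """Net areal power density of vertically stacked strata sharing one footprint."""
    return sum(strata_densities_w_per_cm2)


chip_power_w = 200.0  # example power from Section 2.1, reused here as an assumption

rise_air_cooled = junction_temperature_rise_c(chip_power_w, 0.5)  # conventional air-cooled sink
rise_roadmap = junction_temperature_rise_c(chip_power_w, 0.2)     # projected roadmap target
stack_density = stacked_power_density_w_per_cm2([100.0, 100.0])   # two stacked microprocessors

print(f"rise at 0.5 C/W: {rise_air_cooled:.0f} C")
print(f"rise at 0.2 C/W: {rise_roadmap:.0f} C")
print(f"stacked density: {stack_density:.0f} W/cm^2")
```

Under these assumptions the air-cooled case gives a rise of about 100 °C against about 40 °C at the roadmap target, which shows why a lower thermal resistance, and not only a larger heat sink, is needed once strata are stacked.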

2.3 Novel silicon ancillary technologies

In order to provide all critical interconnect functions for a gigascale SoC, fully compatible, low-cost, and microscale electrical, optical, and fluidic chip I/O interconnects (or “trimodal” I/Os) have been recently proposed (Figure 2.6) [24]. A schematic illustration of a cross-section of a gigascale SoC with trimodal I/Os, along with SEM images, is shown in Figure 2.7. The overarching strategy of this novel approach is to extend low-cost wafer-level batch processing, the key to the success of Si technology, to the ancillary technologies that have now become the millstone around the neck of Si technology itself.

Fig. 2.6 Schematic illustration of a chip with electrical, optical, and fluidic (“trimodal”) I/O interconnects. (Figure labels: die, electrical, optical, fluidic, substrate.)

Electrical chip I/O interconnection is achieved using solder bumps. The optical I/Os are implemented using surface-normal optical waveguides and take the form of polymer pins [25, 26, 27]. A polymer pin, like a fiber optic cable, consists of a waveguide core and a cladding. The polymer pin acts as the waveguide core, and unlike a fiber optic cable, the cladding is air. A key feature of the optical pins is that they are mechanically flexible and thus can bend to compensate for the coefficient of thermal expansion (CTE) mismatch between the chip and substrate. More details on the optical I/Os will be presented in Section 2.3.1. The fluidic I/Os are implemented using surface-normal hollow-core polymer pins, or micropipes [28, 29]. Unlike prior work on microfluidic cooling of ICs that requires millimeter-sized and bulky fluidic inlets/outlets to the microchannel heat sink, the micropipe I/Os under consideration are microscale, wafer-level batch fabricated, area-array distributed, flip-chip compatible, and mechanically compliant. Figure 2.8 illustrates the evolution of thermal interconnects, including the thermal interconnects under consideration. This is discussed later in the chapter.

Fig. 2.7 Schematic illustration of GSI chips with trimodal I/Os. SEM images are also shown. (Figure labels: cap, optical device, Si, microchannel heat sink, fluidic I/O, Cu pad, optical pin I/O, solder bump, optical waveguide, fluidic channel. SEM panels: optical and fluidic I/Os; electrical and optical I/Os; electrical and fluidic I/Os.)

© 2007 IEEE

Fig. 2.8 Schematic illustration of the evolution of thermal interconnects. (MCHS: microchannel heat sink)


The fabrication process of the proposed electrical, optical, and fluidic I/Os is shown in Figure 2.9. We assume that the optical devices (detectors or sources) are monolithically or heterogeneously integrated on the CMOS chip. The fabrication process begins by partially etching through-wafer fluidic vias starting from the back side of the chip (the side closest to the heat sink), as shown in Figure 2.9b. Next, trenches are etched directly into the silicon surface (Figure 2.9c) while simultaneously completing the etch of the fluidic through-wafer vias. Following the silicon etch, the microchannel heat sink is enclosed using any of a number of techniques [30] (Figure 2.9d). This completes the fabrication of the microchannel heat sink. Next, solder bumps are fabricated on the front side of the chip using standard processes (Figure 2.9e). A photosensitive polymer film, equal in thickness to the height of the final optical and fluidic I/Os, is then spin coated on the front side of the wafer (and over the solder bumps), as shown in Figure 2.9f. Finally, the polymer film is photodefined to yield the optical and fluidic I/Os simultaneously, and the polymer I/Os are cured for one hour in a nitrogen-purged furnace set to 160 °C.

© 2007 IEEE

Fig. 2.9 Process used to fabricate the trimodal I/Os. Schematic illustration of a chip with an alternate configuration of electrical, optical, and fluidic I/O interconnects.

There are many derivatives of the electrical, optical, and fluidic I/O approach described above. One such derivative is shown in Figure 2.10, which illustrates the ability to embed each optical pin in a solder bump to create a dual-mode electrical-optical solder bump [31]. An SEM image of such dual-mode electrical-optical solder bumps is shown in Figure 2.11. Not only does this enable higher levels of integration between the electrical and optical I/O interconnects, but it also makes it possible to double the aggregate number of I/Os for a given pitch. This approach is easily extendable to the fluidic I/Os and enables the fabrication of dual-mode electrical-fluidic I/O interconnects.

© 2007 IEEE

Fig. 2.10 Schematic of optical and fluidic pins (I/Os) embedded in conventional solder bumps. (Figure label: solder bump with embedded optical pin I/O.)

© 2007 IEEE

Fig. 2.11 SEM images of optical pins (I/Os) embedded in conventional solder bumps.

In another approach, the optical and electrical I/Os can be assembled without having to embed the polymer pins in the solder bumps [32], as illustrated in Figure 2.7 and Figure 2.12. In order to adhere the optical I/Os to the waveguides on the substrate, the die containing the electrical and optical I/Os is dipped into a thin layer of a polymeric adhesive before bonding. When the I/Os are dipped, only the optical polymer pins make contact with the polymer adhesive. This is accomplished by fabricating the polymer pins to be taller than the solder bumps. Moreover, the adhesive is spin-coated to a thickness so that it only makes contact with the optical pins when the chip is dipped into the film. SEM images of a chip with polymer pins assembled using this approach are shown in Figure 2.13.

© 2008 IEEE

Fig. 2.12 Schematic illustration of the process used to bond electrical and optical I/Os simultaneously using a flip-chip bonder. (Figure labels: die, optical device, dip I/O, adhesive carrier, adhesive, substrate.)

© 2008 IEEE

Fig. 2.13 SEM images of optical pins bonded to a substrate using the process shown in Figure 2.12. (Figure labels: die, substrate, optical polymer pin, adhesive.)


2.3.1 Optical I/Os

The use of flexible surface-normal optical waveguides, or optical pins, has been proposed as a means of addressing the shortcomings of free-space optical I/O interconnections [33]. The height separation between the chip and the substrate has minimal effect on the optical power received at the photodetector (except for losses through the polymer pin) because the light is tightly confined within the cross-sectional area of the pin. Although we consider using polymeric materials with relatively high optical absorption losses for the fabrication of the optical pins, the optical transmission losses through the pins are small because of their very short length (height) [33, 34]. The optical pins are designed to be mechanically compliant (flexible). The low elastic modulus of the polymer and the air cladding of the waveguide contribute to the flexible nature of the optical pins. As a result, the lateral misalignment induced by chip-substrate CTE mismatch is compensated by the mechanical compliance of the optical pins, and optical interconnection and alignment are maintained at all times between the optical components on the chip and substrate.

© 2006 IEEE

Fig. 2.14 Experimental setup used to characterize the coupling efficiency of various diameter optical pins.

Figure 2.14 illustrates the experimental setup used to characterize the surfacenormal optical coupling efficiency of the pins. A fiber was scanned in the X-axis and in the Y-axis across the endface of the pin and across the surface of the aperture (at a Z-axis distance equal to the pin’s height). The relative transmitted optical intensity measurements of 50x150 µm optical pins and 50 µm optical apertures are plotted in the top of Figure 2.15. The transmitted intensities are normalized to the maximum transmission at the center of the aperture without a pin. The X- and Y-axis scans are essentially equal due to the radial symmetry of the light source and the pins. The dif-

24

2D and 3D integrated systems

ference between the coupling efficiency of the two measurements (using data from the X-axis scan) is plotted in the bottom of Figure 2.15. The data demonstrate that

Loss Reduction [dB]

5 4 3 2 1

24

20

16

8

12

4

0

-4

-8

6

-1 2

-1

-2 0

-2

4

0

Lateral Position [um]

© 2008 IEEE

Fig. 2.15 Using the experimental setup shown in Figure 2.14, the transmitted optical intensity as a function of light source lateral position above the pin (50x150 µm) and aperture are measured (top). The reduction in the coupling loss due the optical pins ranges from 2–4 dB (bottom).

at the 0 µm displacement position, the optical pins enhance the coupling efficiency by approximately 2 dB when compared to direct coupling into the aperture. At distances of ±25 µm away from the center, the optical coupling improvement due to the pin exceeds 4 dB. The 4 dB coupling improvement is significantly larger than the 0.23 dB excess loss of the pins [33, 34], which clearly demonstrates the benefits of the pins. Note that the relative intensity curve of the optical pin is almost flat across the entire endface of the pin and drops abruptly beyond the edges of the pin (X=±25 µm). The intensity curve of the aperture, on the other hand, resembles an inverted parabola. This difference matters because it shows how strongly the direct coupling case depends on perfect alignment: any misalignment in the lateral direction causes a fast roll-off in the intensity. Even with perfect alignment during assembly, any lateral misalignment between the mirror and the detector due to either CTE mismatch or other factors may reduce the coupling efficiency and limit the achievable bandwidth. When 30x150 µm pins (optical aperture 30 µm in diameter) were tested, the coupling efficiency improved by 3 to 4.5 dB [33]. This improvement is larger than the improvement measured for the 50x150 µm pin. This is significant because it demonstrates that as the optical I/O density increases and smaller PDs are used to attain higher bandwidth, optical pins become even more critical to the overall performance of the system.
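To put the decibel figures quoted above into linear terms, the following minimal sketch converts them to optical power ratios. The helper name `db_to_power_ratio` is an illustrative choice and not part of the original work; the numerical inputs are the values quoted in the text.

```python
def db_to_power_ratio(db: float) -> float:
    """Convert a gain or loss in decibels to a linear power ratio."""
    return 10.0 ** (db / 10.0)


# Values quoted in the text, in dB.
quoted_values_db = {
    "improvement at the center (0 um)": 2.0,
    "improvement at the edge (+/-25 um)": 4.0,
    "excess loss of the pin": 0.23,
}

for label, value_db in quoted_values_db.items():
    ratio = db_to_power_ratio(value_db)
    print(f"{label}: {value_db} dB corresponds to a power ratio of {ratio:.3f}")
```

Under this conversion, a 4 dB improvement corresponds to roughly 2.5 times more coupled optical power, while the 0.23 dB excess loss costs only about 5% of the power.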

© 2007 IEEE

Fig. 2.16 Schematics of the experimental setups used to evaluate the optical displacement compensation of the pins.

The experimental setups used to characterize the optical displacement compensation of the optical pins are shown in Figure 2.16. Two configurations were developed to quantify the effects of pin bending on optical signal transmission. In the first configuration, which is labeled “scanning” in the figure, the light source (fiber) is scanned laterally across the endface of the pin (similar to earlier measurements). In the second configuration, which is labeled “bending” in the figure, the fiber is attached to the endface of the pin using epoxy to form an air-free interface between the source and the substrate. In the “bending” case, the controlled lateral displacement of the light source causes the optical pin to bend sideways, which helps keep the light mode confined in the pin and thus delivers the optical signal to the detector with lower coupling losses. The relative transmitted intensities as a function of lateral displacement of the light source for the two experimental configurations are shown at the top of Figure 2.17. The loss reduction is less than 1 dB up to 15 µm displacement, while it increases up to 4 dB at 30 µm. This is significant since only a limited loss budget is available in typical systems for misalignment and assembly tolerances. The top of Figure 2.17 demonstrates that for a given loss budget of, e.g., 1 dB, the 50x150 µm flexible pins double the displacement tolerance from less than 15 µm to approximately 30 µm. The 4 dB pin-assisted loss reduction at the 30 µm displacement can decrease the BER by a few orders of magnitude [33, 34]. Thus, the optical pins provide a method of reducing optical coupling losses caused by thermomechanically induced misalignment between the CTE-mismatched chip and substrate.

[Plot, top: relative intensity [dB] versus lateral displacement [µm], 0 to 50 µm; series: Bend forward, Bend return, Scan, Scan with epoxy. Plot, bottom: loss reduction [dB] versus light source lateral position [µm], 0 to 30 µm.]

© 2007 IEEE

Fig. 2.17 Measured optical displacement compensation using the flexible optical pins. The experimental procedure for the data labeled “Scan with epoxy” is similar to that labeled “Scan”, except that the fiber tip carried a layer of epoxy.

2.3.2 Fluidic I/Os for single and 3D chips

The process that is used to bond the fluidic I/Os is shown in Figure 2.18. As with the optical I/Os, the fluidic I/Os are flip-chip compatible and are batch-fabricated at the wafer level. In fact, the optical and fluidic I/Os are batch-fabricated simultaneously using the same polymer (Figure 2.19). The assembly process begins by aligning the die, which contains the electrical and fluidic I/Os, to the substrate using a flip-chip bonder. In this case, the fluidic I/Os are aligned with inlet/outlet fluidic vias that interconnect to substrate-level fluidic channels. Although an organic substrate could be used, a Si substrate is used in this work because it can also support very high density interconnects (Figure 2.19). The fluidic through-substrate vias connect directly to fluidic tubes attached to the other side of the substrate. As the solder bumps make contact with the copper pads on the substrate, the fluidic I/Os are inserted into the inlet/outlet vias. Once the die is assembled, an encapsulant is applied between the die and substrate to seal the interface between the fluidic I/Os and the vias in the substrate (Figure 2.18). This approach is radically different from previously reported research [35, 36] in the area of fluidic interconnects for ICs.

[Schematic labels: fluidic I/O, die, Cu pad, Si substrate, fluidic via, encapsulant, power to heaters, electrical reading, fluid in, fluid out]

© 2007 IEEE

Fig. 2.18 Schematic of the experimental setup used to demonstrate the fluidic I/O interconnects.

In order to characterize the fluidic interconnects and the microchannel heat sink and perform temperature measurements, thin-film Pt heaters/thermometers were fabricated on the silicon die. The tested die contained a total of 51 parallel microchannels (100 µm in width and 200 µm in height) distributed evenly across the back-side of the chip (1 cm²), and a total of 32 fluidic I/Os were used. In this case, the microchannel heat sink was capped with a Pyrex wafer using an adhesive [30]. The inlet and outlet temperatures were measured using thermocouples, while the chip temperature was measured by recording the change in the resistance of the Pt heaters. Figure 2.20 plots the temperature of the coolant (DI water) at the substrate inlet and outlet and the average chip temperature when 75 W/cm² is applied to the Pt heaters/thermometers. Under a relatively large flow rate (≈104 ml/min), the average temperature rise is 12.7 °C, and the corresponding thermal resistance for the chip is approximately 0.28 °C/W. As shown in Figure 2.20, during testing, the supply power was toggled to verify the consistency of the measurement results. As the microchannel heat sink was not optimized, lower thermal resistance and the cooling


© 2007 IEEE

Fig. 2.19 Optical micrograph of a silicon substrate with electrical interconnects (copper) and through-substrate fluidic vias.

[Plot: temperature (°C) versus time (seconds), 0 to 900 s; curves: average chip temperature, outlet temperature, inlet temperature]

Fig. 2.20 Measured temperatures at the inlet and outlet as well as the average chip temperature at a flow rate of 104 ml/min. The calculated unit thermal resistance is ≈0.17 °C·cm²/W. (Overall power is ≈45 W; heating area is ≈0.6 cm²; on-chip temperature rise is 12.7 °C.)


of higher power density can be achieved. In fact, the first example of microchannel liquid cooling demonstrated a junction-to-ambient thermal resistance of 0.09 °C/W and the cooling of 790 W/cm² [37], which demonstrates the ability to cool hot spots (up to 400 W/cm² in some processors). In this work, we focus on the integration and implementation of the fluidic I/O interconnect network rather than on the microchannel heat sink. The novel feature of the research is the delivery and extraction of a liquid coolant to and from a microchannel heat sink in a way that is compatible with CMOS process technology and conventional chip I/O technology.
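The thermal resistance figures quoted for the measurement in Figure 2.20 can be checked with a few lines of arithmetic. The sketch below uses only the approximate values stated in the text and caption (45 W, 0.6 cm², 12.7 °C); the variable names are illustrative.

```python
# Approximate values quoted in the text and in the caption of Fig. 2.20.
power_w = 45.0            # overall heater power, W
heated_area_cm2 = 0.6     # heating area, cm^2
temp_rise_c = 12.7        # average on-chip temperature rise, deg C

# Power density applied to the heaters, W/cm^2.
power_density_w_cm2 = power_w / heated_area_cm2

# Thermal resistance of the chip, deg C/W.
thermal_resistance_c_per_w = temp_rise_c / power_w

# Unit (area-normalized) thermal resistance, deg C*cm^2/W.
unit_thermal_resistance = temp_rise_c / power_density_w_cm2

print(f"power density: {power_density_w_cm2:.1f} W/cm^2")
print(f"thermal resistance: {thermal_resistance_c_per_w:.2f} C/W")
print(f"unit thermal resistance: {unit_thermal_resistance:.2f} C*cm^2/W")
```

These reproduce the quoted 75 W/cm², approximately 0.28 °C/W, and approximately 0.17 °C·cm²/W.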

[Micrograph labels: solder cap, polymer micropipe]

© 2008 IEEE

Fig. 2.21 SEM micrograph of a solder-capped polymer micropipe.

In this work, an encapsulant was used to seal the fluidic I/Os after assembly. In an alternate configuration, solder may be capped on the top of the polymer (or metallic) micropipe I/Os, as shown in Figure 2.21, to seal the fluidic I/Os. The plating and reflow processes are compatible with those used for the solder bumps of the electrical I/Os. Thus, conventional solder could potentially be used to interconnect all I/O modes: electrical, optical, and fluidic.

It is possible to extend the microfluidic chip I/Os to 3D integrated chips, as illustrated in Figure 2.22 [1, 38, 39, 40]. The electrical interconnect network is used for power delivery and signaling between strata, and fluidic interconnects are used to enable the rejection of heat from each stratum in the 3D stack. One implementation of this approach is shown in Figure 2.23. Each silicon die in the 3D stack contains the following features: 1) a monolithically integrated microchannel heat sink; 2) through-silicon electrical (copper) vias (TSEVs) and through-silicon fluidic (hollow) vias (TSFVs); 3) solder bumps (electrical I/Os) and microscale polymer pipes (fluidic I/Os) on the side of the chip opposite to the microchannel heat sink. Microscale fluidic interconnection between strata is enabled by the combination of through-wafer fluidic vias and polymer pipe I/O interconnects. The chips are designed such that when they are stacked, each chip makes electrical and fluidic interconnection to the die above and below. Consequently, power delivery and signaling can be supported by the electrical interconnects (solder bumps and copper TSVs), and heat removal for each stratum can be supported by the fluidic I/Os


[Schematic labels: Die #1, Die #2, and Die #3, each with electrical and fluidic I/Os, stacked on a substrate]

© 2008 IEEE

Fig. 2.22 Schematic illustration of a 3D chip stack with electrical and fluidic chip I/O interconnects.

and microchannel heat sinks. Optical TSVs [41] and I/Os may also be integrated to provide unusual flexibility to system integration.

© 2008 IEEE

Fig. 2.23 Schematic illustration of one possible implementation of the system shown in Figure 2.22.

In order to achieve high heat transfer, low thermal resistance, and low pressure drop, a relatively tall microchannel heat sink is needed (≈250 µm, for example). This necessitates a thick silicon wafer, in contrast to other 3D integration technologies, which seek to polish the silicon wafer to as small a thickness as possible before wafer handling and mechanical strength become limiters. Cross-sectional optical images of fabricated electrical TSVs in a silicon wafer with and without a microchannel heat sink are shown in Figure 2.24. At the bottom of the figure, the microchannel shown is 200 µm tall and 100 µm wide. The aspect ratio of the microchannel can be varied to meet the thermal resistance and pressure drop requirements of different applications [37]. TSVs with aspect ratios of 30:1 and greater have been demonstrated [42].

[Micrograph labels: Si, Cu]

© 2006 IEEE

Fig. 2.24 Optical images of a silicon wafer with through-silicon electrical vias (top) and the fabrication of a microchannel heat sink with electrical TSVs (bottom).

The process used to fabricate the die is shown in Figure 2.25. The process begins by (a) fabricating electrical TSVs, followed by (b) the fabrication of trenches and microfluidic TSVs in the silicon wafer. In (c), the trenches are encapsulated to form the microchannels [30]. Vias are next formed in the overcoat polymer to simultaneously expose the electrical TSVs and form fluidic vias that ultimately allow fluid flow to the upper and lower die. Following this process step, copper pads are patterned above the electrical TSVs to facilitate solder bonding during assembly. Finally, in (d), solder bumps and microfluidic polymer micropipes (electrical and fluidic I/Os, respectively) are fabricated on the side of the wafer opposite the microchannel heat sink using processes reported previously [24]. A two-die stack assembled using the process outlined above, together with SEM images of the trenches and microfluidic TSVs, is shown in Figure 2.26 (the microchannel heat sink was not included, in order to simplify the assembly experiment).

[Process schematic panels: (a), (b), (c), (d)]

© 2007 IEEE

Fig. 2.25 Schematic illustration of the process used to fabricate silicon die, at the wafer level, that each contain electrical and microfluidic TSVs and I/Os.

© 2007 IEEE

Fig. 2.26 SEM image of a two-die stack that contains electrical and fluidic I/Os and TSVs. In this experiment, no microchannels were included, in order to simplify the process and assembly experiments.

2.4 Power delivery for 2D and 3D systems

Power consumption of GSI chips is increasing at an alarming rate [43]. Increasingly fast devices packed at unprecedented densities result in high current densities. Although the scaling of the supply voltage has slowed down in recent years, the logic on the integrated circuit (IC) continues to become more sensitive to any supply voltage change because of the decreasing clock cycle and, with it, the noise margin. With this trend, power supply noise, the voltage fluctuation on power delivery networks, has become a significant factor that can substantially influence the overall system performance. As a result, the design of power delivery systems has become a very important and challenging task, and understanding complicated power delivery networks and supplying clean power to microprocessors is of great significance [44, 45]. IR-drop and ∆I noise are the two main components of the power supply noise. IR-drop results from the supply current passing through the parasitic resistance of


© 2007 IEEE

Fig. 2.27 Simulated noise droops of Intel microprocessors [46].

the power distribution networks. ∆I noise is caused by the inductance of the power delivery system and becomes important when a group of circuits switch simultaneously. Power supply noise consists of three distinct voltage droops [44], which result from the interactions between the chip, package, and board. The three droops are illustrated in Figure 2.27. The third droop is related to the bulk capacitors at the board level and has a duration of a few microseconds. It influences all critical paths but can be readily minimized by using more board space for bulk capacitors [44]. The second droop is caused by the resonance between the inductive traces on the motherboard and the decoupling capacitors (decaps) in the package. It has a duration of a few hundred nanoseconds and impacts a significant number of critical paths. The first droop is caused by the package inductance and on-die capacitance. Its resonance frequency is in the range of tens of MHz to a few hundred MHz, depending on the sizes of the package-level components and on-chip decaps [47]. Because adding on-chip decaps is very costly, the first droop is the most difficult of the three to suppress. It also has the largest magnitude. Even though the first droop has the shortest duration of the three, it can adversely affect GSI circuits because it can last tens of nanoseconds (ns). Chip performance can be severely degraded when the first droop affects some critical paths. Because of its severe impact on high-performance chips, the first droop is the main focus of this section. Excessive power supply noise can lead to severe performance degradation of on-chip circuitry and off-chip high-speed data links, and even result in logic failures [44]. Thus it is vitally important to model and to predict the performance of power delivery networks with the objective of minimizing supply noise.
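As a rough illustration of why the first droop falls in the range of tens to hundreds of MHz, the sketch below evaluates the resonance frequency of an ideal lumped LC pair, f = 1/(2π√(LC)). This is a simplification, not the model used in the cited references, and the inductance and capacitance values are hypothetical placeholders chosen only to show the order of magnitude.

```python
import math


def lc_resonance_hz(inductance_h: float, capacitance_f: float) -> float:
    """Resonance frequency of an ideal lumped LC pair, in hertz."""
    return 1.0 / (2.0 * math.pi * math.sqrt(inductance_h * capacitance_f))


# Hypothetical values (assumptions, not taken from the text):
# effective package loop inductance with all pads in parallel, and
# total on-die decoupling capacitance.
package_inductance_h = 10e-12    # 10 pH
on_die_capacitance_f = 100e-9    # 100 nF

first_droop_hz = lc_resonance_hz(package_inductance_h, on_die_capacitance_f)
print(f"first-droop resonance: {first_droop_hz / 1e6:.0f} MHz")
```

With these assumed values the resonance lands near 159 MHz, inside the range quoted above. Because the frequency scales as 1/√(LC), a larger on-die capacitance or a larger package inductance moves the resonance lower.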
On-chip power distribution networks consist of global and local networks. Global power distribution networks carry the supply current and distribute power across the chip. Local networks deliver the supply current from global networks to the active devices. Global networks contribute most of the parasitics, and thus are the main concern of this chapter. For global distribution networks, the most common approach is to use a grid made of orthogonal interconnects routed on separate metal levels and connected through vias [48].

Wire-bond and flip-chip technologies are the two most commonly used chip-to-package interconnects [49]. Wire-bond costs less than flip-chip interconnect; however, peripheral wire-bond interconnect causes a higher power supply noise level because of its larger parasitics. In flip-chip technology, the parasitics are reduced by spreading I/O pads over the surface area of the chip, which reduces the noise. The development of GSI systems is driven not only by more efficient use of silicon real estate but also by higher I/O counts. Hence most of today’s high-performance designs use flip-chip interconnect and area-array I/Os to provide larger bandwidth for chip-to-next-level interconnections.

2.4.1 Power delivery and design implications of 2D systems

The grid structure and area-array I/O pad allocation are shown in Figure 2.28. Power is fed through the power pads from the package. The current flows through power wires and on-chip circuits, and returns to the package through ground wires and ground pads.

© 2007 IEEE

Fig. 2.28 On-chip power/ground grids and I/O pads in flip-chip technology.

Different types of modeling methods can be used to analyze power supply noise, such as circuit simulation methods, 3D electromagnetic solver methods, and compact physical models. Circuit simulations and 3D solvers are commonly used for dedicated validation after designs are completed. However, to gain sufficient physical insight, compact and accurate physical models are needed before the physical designs are performed. Such models are critical in the early stages of design and can estimate the on-chip and off-chip resources needed for the power distribution network. Compact physical models can also be used to predict the power noise trends of different technology generations from mathematical and physical bases. Compact physical models for ∆I noise and IR-drop have been proposed [7, 50]. Power supply noise is a dynamic effect that changes with time; IR-drop is the value the noise settles to in steady state. The transient part of the power supply noise, ∆I noise, is significant in determining the timing budget of a system. These models embody the distributed nature of on-chip power grids and display high accuracy when predicting the frequency response and time-domain transients of the power supply noise of 2-D power distribution networks. An example of the simplified circuit model for a part of the power distribution network is shown in Figure 2.29 [50]. The segment resistance of the grid is represented by Rs. Switching current between a power grid node and the adjacent ground grid node is modeled as a current source, and J(s) represents the switching current density in the Laplace domain. Symbol Cd denotes the decoupling capacitance per unit area (including both the intentionally added decaps and the equivalent capacitance of the non-switching transistors). Symbols ∆x and ∆y represent the distances between two adjacent power (or ground) nodes at the same wiring level in the x and y directions, respectively. Symbol Lp (4Lp for a quarter pad) represents the per-pad loop inductance of the package.
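The distinction between the transient ∆I noise and the steady-state IR-drop can be illustrated with a deliberately simplified, single-node lumped circuit: a series resistance and inductance feeding one decoupling capacitor, with the load current stepping from zero at t = 0. This is not the distributed grid model of [50]; it is a sketch with hypothetical component values, and the function name `supply_noise` is an illustrative choice. The closed form is the standard underdamped series RLC step response.

```python
import math


def supply_noise(t, i_step, r, l, c):
    """Deviation of the on-die supply from its nominal value (volts) at time t.

    Lumped assumption: series resistance r and inductance l feed a decap c;
    the load current steps from 0 to i_step amperes at t = 0.
    Only the underdamped case is handled in this sketch.
    """
    alpha = r / (2.0 * l)                  # damping rate, 1/s
    omega0 = 1.0 / math.sqrt(l * c)        # undamped resonance, rad/s
    if alpha >= omega0:
        raise ValueError("this sketch covers the underdamped case only")
    omega_d = math.sqrt(omega0 ** 2 - alpha ** 2)
    a = r * i_step
    b = (-i_step / c + alpha * a) / omega_d
    ringing = math.exp(-alpha * t) * (
        a * math.cos(omega_d * t) + b * math.sin(omega_d * t)
    )
    return -r * i_step + ringing           # settles to the IR-drop


# Hypothetical values (assumptions for illustration only).
I_STEP = 20.0      # load current step, A
R = 0.5e-3         # series resistance, ohm
L = 5e-12          # effective loop inductance, H
C = 200e-9         # on-die decoupling capacitance, F

times = [k * 1e-11 for k in range(40001)]           # 0 to 400 ns
noise = [supply_noise(t, I_STEP, R, L, C) for t in times]
print(f"peak transient droop: {-min(noise) * 1e3:.1f} mV")
print(f"steady-state IR-drop: {-noise[-1] * 1e3:.1f} mV")
```

In this toy circuit the peak of the transient is set mainly by the step current times √(L/C), whereas the value the waveform settles to is simply the resistance times the current, which is the IR-drop.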

[Circuit labels: J(s)∆x∆y, Cd∆x∆y]

© 2007 IEEE

Fig. 2.29 Simplified circuit model for GSI power distribution systems.

Based on this circuit model, compact physical models can be derived and the relationships between the power supply noise and other physical parameters can be quantified as shown in Figures 2.30–2.32.

[Plot: absolute value of the worst case peak noise (V) versus proportion of the chip area occupied by decaps, 0.04 to 0.18; series (2.29) and SPICE for Rs = 0.88, 0.44, 0.22, and 0.11]

© 2007 IEEE

Fig. 2.30 The worst case peak noise as a function of the chip area occupied by decaps: Comparison between the physical model in [50] and the results of SPICE simulations for a pair of grids.

We can observe from Figures 2.30–2.32 that ∆I noise is sensitive to the amount of decaps, the package-level inductance, and the number of I/O pads. Decap insertion is an effective way to reduce the noise level. However, the on-die area budget for decoupling capacitors can be limited. In this situation, package-level high-density I/O solutions, such as sea of leads (SoL) [51], can be used to suppress the power supply noise. High-density chip I/Os can greatly reduce the loop inductance of power distribution networks, resulting in smaller noise. Larger numbers of I/Os can also reduce the IR-drop.

It is also of great importance to project the power noise trends for different generations of technology. In [50], the worst case peak noise value is calculated for a high performance microprocessor unit (MPU) for each generation from the 65 nm node (year 2007) to the 18 nm node (year 2018). Figure 2.33 suggests that supply noise could reach 25% of Vdd at the 18 nm node, compared to 12% of Vdd for current technologies, if the ITRS [52] scaling trends are followed. Excessive noise can cause severe difficulties for circuit designers, and new solutions to tackle this supply noise problem are needed in the future. The importance of scaling package parameters such as the number of I/O pads is also indicated in Figure 2.33: increasing the pad number by 1.3x each generation keeps the supply noise well under control.
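The qualitative sensitivities described above can be sketched with a first-order estimate. The following assumes, purely for illustration, that the peak ∆I noise of an undamped lumped LC circuit is the step current times √(L/C), and that the pads act as parallel inductors so that the effective inductance is the per-pad value divided by the pad count. Neither assumption is the compact model of [50], and all numerical values are hypothetical.

```python
import math


def peak_noise_estimate(i_step_a, lp_per_pad_h, n_pads, decap_f):
    """First-order peak noise estimate (volts) for an undamped lumped LC circuit.

    Assumption: pads act as identical parallel inductors, so the effective
    loop inductance is lp_per_pad_h / n_pads.
    """
    l_eff = lp_per_pad_h / n_pads
    return i_step_a * math.sqrt(l_eff / decap_f)


# Hypothetical baseline values (assumptions for illustration only).
baseline = peak_noise_estimate(20.0, 500e-12, 2048, 200e-9)

more_pads = peak_noise_estimate(20.0, 500e-12, 1.3 * 2048, 200e-9)
more_decap = peak_noise_estimate(20.0, 500e-12, 2048, 2 * 200e-9)
lower_lp = peak_noise_estimate(20.0, 250e-12, 2048, 200e-9)

print(f"1.3x pads:        noise x{more_pads / baseline:.3f}")
print(f"2x decap:         noise x{more_decap / baseline:.3f}")
print(f"0.5x per-pad Lp:  noise x{lower_lp / baseline:.3f}")
```

Under these assumptions the noise falls with the square root of the pad count and of the decap, which is consistent in direction with the trends in Figures 2.30–2.32 but should not be read as a substitute for them.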

[Plot: absolute value of the worst case peak noise (V) versus package inductance per I/O (H), 200 pH to 800 pH; series (2.29) and SPICE for Rs = 0.88, 0.44, 0.22, and 0.11]

Fig. 2.31 The worst case peak noise as a function of Lp: Comparison between the physical model in [50] and the results of SPICE simulations for a pair of grids.

[Plot: absolute value of the worst case peak noise (V) versus number of total power/ground I/O pads, 2000 to 10000; series (2.29) and SPICE for Rs = 0.88, 0.44, 0.22, and 0.11]

© 2007 IEEE

Fig. 2.32 The worst case peak noise as a function of the number of pads: Comparison between the physical model in [50] and the results of SPICE simulations.

[Plot: Vnoise / Vdd versus year, 2006 to 2018; series: ITRS scaling, 1.3x pad number scaling]

© 2007 IEEE

Fig. 2.33 Technology trends of the worst case peak noise.

2.4.2 Power delivery and design implications of 3D systems

3D nanosystems can provide enormous advantages in achieving multi-functional integration, improving system speed, and reducing the power consumption for future generations of ICs [53]. 3D chip stacks have been used in commercial products, though today’s applications are mainly focused on low-power portable devices, such as flash memories and wireless chips. At the high-performance end, industry has already started to pave the way for microprocessor stacking and microprocessor-memory stacking, which will extend Moore’s Law beyond its expected limits and help break the memory bandwidth bottleneck for multi-core microprocessors [54, 55]. Through-silicon vias (TSVs) and micro-bumps are the key technologies for realizing 3D chip stacks for high-performance applications; they eliminate the need for the long metal wires that connect today’s 2-D chips together, relying instead on short vertical connections etched through the silicon wafer [54]. These TSVs and micro-bumps enable multiple chips to be stacked together, allowing greater amounts of information to be passed between them. However, stacking multiple high-performance die may result in severe power integrity problems. As shown in Figure 2.34, if multiple high-power microprocessors are stacked together and flip-chip technology is used for 3D chip stacking, several hundred amperes of current (or even more) will need to be delivered through a limited footprint area. In addition, the supply current flows through micro-bumps and narrow TSVs that may exhibit large parasitic inductance. This may lead to a large ∆I noise if stacked chips switch simultaneously. Thus, power distribution networks in 3D systems need to be accurately modeled and carefully designed.

[Schematic labels: µP1 through µP4, each 100 W, 125 A, stacked on a package; long inductive trace; large amount of current]

Fig. 2.34 Power integrity problem of 3D chip stack.

In [56], analytical models are derived to describe the frequency-dependent characteristics of the power supply noise in each die of the stack and to obtain physical insight into the rather complex power delivery networks in 3D systems. The model in [56] comes from the simplified circuit model of the power distribution network in 3D systems, as shown in Figure 2.35. A wire between two nodes on the ith die is modeled as a lumped resistance Rsi. The decoupling capacitance per unit area of the ith die is represented by Cdi. The current density for an active block of Die i is represented by Ji(s) in the Laplace domain. Inductance Lp is the per-pad loop inductance associated with the package, connected to the bottom-most die (Die 1). Each silicon TSV is modeled as an inductor Lvia and a resistor Rvia connected in series (this includes the parasitics of the micro-bumps when they are used between die). Symbols ∆x and ∆y represent the distances between two adjacent power (or ground) nodes in the same wiring level in the x and y directions, respectively. The models derived in [56] help address the power integrity problem of 3D integration systems. For a worst-case analysis, the worst case peak noise value is considered. The case when a single die is switching is considered first. This can occur when one die dissipates considerably more power than the other die in the stack, for example in a system with a processor die and several memory die. It can be seen in Figure 2.36 that as the total number of stacked die increases, the noise level for the topmost die decreases as long as the number of die is less than 6. This is because non-switching die behave as decaps for the switching die. However,

[Circuit labels: Ji(s)∆x∆y, Cdi∆x∆y]

© 2007 IEEE

Fig. 2.35 Simplified circuit model for 3D stacked system.

[Figure panels: (a) stack schematic with Die 1 through Die n on a package; (b) plot of the absolute value of power noise (mV) versus total number of dice, 0 to 10, with series for the worst case peak noise of the topmost die from SPICE and from the physical model]

© 2007 IEEE

Fig. 2.36 Single die switching, increasing total number of die.

when the number of die increases beyond 6, the added decaps cannot compensate for the impact of the longer inductive TSV traces and micro-bumps associated with the added die, which results in an increase of the noise level.

[Figure panels: 2-D, 3-D]

© 2007 IEEE

Fig. 2.37 Achieving shorter interconnects between communicating blocks by using 3D integration.

If only one die is switching, the noise is smaller than in the single chip case (2-D case), because the switching die can use the decaps of the non-switching die in the 3D stack. However, the activities of two blocks with the same footprint are expected to be highly correlated, because an important purpose of 3D integration is to put the blocks that communicate most as close to each other as possible, as shown in Figure 2.37. Therefore, we must consider the worst-case scenario in which all the functional blocks sharing the same footprint switch simultaneously, as shown in Figure 2.38. If the total number of die is increased and the noise levels of the topmost and bottommost die are examined, it can be seen that when all die are switching, the noise produced in a 3D integrated system is unacceptable when compared to the single chip case. This is especially true for the topmost die, where the noise level changes dramatically (180 mV for the single die case as opposed to 790 mV for the 10 die case). Even for the bottommost die, methods of suppressing the noise need to be identified. If a whole die is used as decap (100% of the area is occupied by decap) and this “decap die” is stacked with other die, the noise can be suppressed to some extent. For example, if the same setup as discussed in previous sections is adopted and four die with one decap die are stacked together, putting the decap die on the top can result in a 36% reduction in the worst-case peak noise (256 mV compared to 400 mV). Putting the decap die at the bottom of the stack can result in a 22% reduction (312

[Figure panels: (a) stack schematic with Die 1 through Die n on a package; (b) plot of the absolute value of power noise (mV), 0 to 800 mV, versus total number of dice, 2 to 10, with series for the topmost and bottommost die from SPICE and from the physical model]

© 2007 IEEE

Fig. 2.38 All die switching and increasing total number of die.

mV compared to 400 mV). Although improvements result from the decap die, more decap die are still needed to reach the noise level of a single die (182 mV). Figures 2.39b through 2.39d illustrate different schemes for using two decap die. By putting the two decap die on the top, we can suppress the noise to the level of a single chip. It can be seen that putting the decap die on the top is the best scheme to suppress the noise of the fourth die. Instead of adding a decap die, it would be more efficient to use a high-k material between the power and ground planes (on-chip). Finally, it should be emphasized that cooling also presents challenges to 3D integration (also discussed in this chapter), and the newly developed microfluidic cooling technique can potentially alleviate this problem.

Another possible solution is to use more TSVs. To examine the efficiency of increasing the number of TSVs, in the first case, a five-die stack is used, and the total number of power/ground I/Os is fixed at 2048. As shown in Figure 2.40, one cannot benefit by solely increasing the number of TSVs. Because the parasitics of TSVs are much smaller than those of the package, only small changes in noise level can be obtained by increasing the number of TSVs. Adding more TSVs might even be counterproductive, because TSVs consume die area that could otherwise be used for decaps or additional noise-suppression circuits.
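The percentage reductions quoted earlier for a single decap die follow directly from the peak noise values given in the text. The short sketch below reproduces that arithmetic; the function name is illustrative.

```python
def percent_reduction(baseline_mv: float, value_mv: float) -> float:
    """Reduction of value_mv relative to baseline_mv, in percent."""
    return 100.0 * (baseline_mv - value_mv) / baseline_mv


# Worst-case peak noise values quoted in the text, in mV.
baseline_mv = 400.0
single_decap_die_mv = {
    "decap die on top": 256.0,
    "decap die at bottom": 312.0,
}

for placement, noise_mv in single_decap_die_mv.items():
    reduction = percent_reduction(baseline_mv, noise_mv)
    print(f"{placement}: {noise_mv:.0f} mV, {reduction:.0f}% reduction")
```

This gives the 36% and 22% reductions stated above for the top and bottom placements, respectively.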

[Figure panels: (a) single die on a package, |Vnoise| = 182 mV; (b) |Vnoise| = 266 mV, 34% reduction; (c) |Vnoise| = 228 mV, 43% reduction; (d) |Vnoise| = 199 mV, 51% reduction. Additional per-die labels in the stack schematics: 195 mV, 204 mV, 200 mV.]

Fig. 2.39 Effect of adding two “decap” die. (a) Single die switching; (b) One “decap” die at the bottom and the other in the middle; (c) One “decap” die in the middle and the other on the top; (d) Both “decap” die on the top.

In the second case, the numbers of both P/G pads and TSVs in each die are increased. This greatly reduces the power supply noise, which can even reach the level of a single chip, as shown in Figure 2.41. These two cases show that the bottleneck is still the power/ground I/Os, as they play a critical role in determining the power supply noise. The inductance of the package dominates the whole power delivery path for the first droop noise. Therefore, the power integrity problem needs an I/O solution that can provide high-density interconnection without sacrificing the mechanical attributes needed for reliability.
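Why adding TSVs alone helps so little can be illustrated with a crude series-inductance budget. The sketch below assumes, hypothetically, that pads and TSVs each act as identical parallel inductors and that the path to the topmost die crosses one TSV layer per die below it. The per-pad and per-TSV inductances are placeholder values, not data from [56].

```python
def series_loop_inductance(lp_per_pad_h, n_pads, lvia_h, n_tsvs_per_die, n_dies):
    """Lumped series inductance from the package to the topmost die, in henries.

    Assumptions: n_pads identical package pads in parallel, and
    n_tsvs_per_die identical TSVs in parallel between adjacent die.
    """
    package_term = lp_per_pad_h / n_pads
    via_term = (n_dies - 1) * lvia_h / n_tsvs_per_die
    return package_term + via_term


# Hypothetical values (assumptions for illustration only).
LP_PER_PAD = 500e-12   # per-pad package loop inductance, H
LVIA = 50e-12          # per-TSV inductance including micro-bump, H
N_DIES = 5

fixed_pads_few_tsvs = series_loop_inductance(LP_PER_PAD, 2048, LVIA, 10000, N_DIES)
fixed_pads_many_tsvs = series_loop_inductance(LP_PER_PAD, 2048, LVIA, 30000, N_DIES)
both_doubled = series_loop_inductance(LP_PER_PAD, 4096, LVIA, 20000, N_DIES)

print(f"3x TSVs, pads fixed: inductance x{fixed_pads_many_tsvs / fixed_pads_few_tsvs:.3f}")
print(f"2x pads and 2x TSVs: inductance x{both_doubled / fixed_pads_few_tsvs:.3f}")
```

With these assumed numbers, tripling the TSV count with the pad count fixed changes the total by only about 5%, because the package term dominates, whereas doubling both pads and TSVs halves it.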

2.5 Conclusion

In order to address the ever-increasing adverse effects of conventional silicon ancillary technologies on the performance of CMOS nanosilicon technology, this chapter describes the implementation of low-cost and fully compatible electrical, optical, and fluidic, or “trimodal,” I/O interconnects. We proposed that electrical I/Os be used for power delivery and signaling, optical I/Os for massive off-chip bandwidth, and fluidic I/Os (with integrated back-side heat sink) for heat removal. A key feature

44

2D and 3D integrated systems Die 5 Die 4 Die 3 Die 2 Die1 Package

(a) Absolute vaalue of power nnoise (mV)

500

400 Worst case peak noise for Die 5 (SPICE simulation) Worst case peak noise for Die 5 (Physical model)

300

200

100

0

10000 20000 30000 Number of thru-vias per die

(b) Fig. 2.40 Effect of adding more TSVs: fixing the number of power/ground I/Os.

of the I/O technology is that it demands 4-5 minimally demanding masking steps and is fabricated using wafer-scale batch fabrication. The trimodal I/Os are flip-chip compatible making them usable with current assembly infrastructure and can be extended to enable the stacking of high-performance (high-power) microprocessors. Moreover, the aggressive scaling of CMOS integrated circuits makes the design of power distribution networks a serious challenge. This is because the supply voltages and thus the circuit noise margins are decreasing, while the supply current and clock frequency are increasing, which increases the power-supply noise. Excessive power-supply noise can lead to severe degradation of chip performance and even logic failure. Therefore, power-supply noise modeling and power-integrity validation are of great significance in GSI system designs. In 2-D systems, is it shown that ∆ I noise is sensitive to the amount of decaps, package level inductance, and the number of I/O pads. Decap insertion is an effective way to reduce the noise level. Package-level high density I/O solutions can also be used to suppress the power supply noise. High density chip I/Os can also alleviate the pressure of the integrity problems in future designs. Power delivery challenges are exacerbated in 3D systems. The supply current flowing through the microbumps and narrow through-silicon-vias (TSVs) may have large parasitics. This may potentially lead to a large ∆ I noise if stacked chips switch simultaneously. The relationships between the power supply noise, decap insertion, power/ground I/O allocation, and TSVs allocation are

2 2D and 3D integrated systems

45

Die 5 Die 4 Die 3 Die 2 Die1

Absolute valuee of power noisse (mV) A

Package (a)

400

Worst case peak noise for Die 5 (SPICE simulations) Worst case peak noise f Die for Di 5 (Physical (Ph i l model) d l)

300

200

100 0 10000 20000 30000 Number of power/ground I/Os under the bottommost die Number of TSVs for each die

(b)

Fig. 2.41 Effect of adding more TSVs and power/ground I/Os.

discussed quantitatively. Schemes for reducing the power supply noise in 3D integrated systems are also proposed and their impact on future 3D system designs are also emphasized in this section. Liquid cooling for a 3D stack of high-performance chips is also discussed. Acknowledgements The authors acknowledge the support of the Interconnect Focus Center, one of five research centers funded under the Focus Center Research Program, a DARPA and Semiconductor Research Corporation program. This work is also in part based upon work supported by the National Science Foundation under Grant Number 0701560.

References

1. M.S. Bakir, J.D. Meindl. Integrated interconnect technologies for 3D nanoelectronic systems. Artech House, Boston, 2009.
2. Semiconductor Industry Association, “International Technology Roadmap for Semiconductors (ITRS),” 2007.
3. R. Prasher, “Thermal interface materials: historical perspective, status, and future directions,” Proceedings of the IEEE, vol. 94, 2006, pp. 1571–1586.
4. G. Schrom, P. Hazucha, H. Jae-Hong, V. Kursun, D. Gardner, S. Narendra, T. Karnik, and V. De, “Feasibility of monolithic and 3D-stacked DC-DC converters for microprocessors in 90 nm technology generation,” Proceedings of the IEEE International Symposium on Low Power Electronics and Design, 2004, pp. 263–268.
5. D. Mallik, K. Radhakrishnan, J. He, C.-P. Chiu, T. Kamgaing, D. Searls, and J.D. Jackson, “Advanced package technologies for high performance systems,” Intel Technology Journal, vol. 9, 2005, pp. 259–271.
6. P. Hazucha, G. Schrom, H. Jaehong, B.A. Bloechel, P. Hack, G.E. Dermer, S. Narendra, D. Gardner, T. Karnik, V. De, and S. Borkar, “A 233-MHz 80%-87% efficient four-phase DC-DC converter utilizing air-core inductors on package,” IEEE Journal of Solid-State Circuits, vol. 40, 2005, pp. 838–845.
7. K. Shakeri and J.D. Meindl, “Compact physical IR-drop models for chip/package co-design of gigascale integration (GSI),” IEEE Transactions on Electron Devices, vol. 52, 2005, pp. 1087–1096.
8. D.A.B. Miller, “Rationale and challenges for optical interconnects to electronic chips,” Proceedings of the IEEE, vol. 88, 2000, pp. 728–749.
9. D. Huang, T. Sze, A. Landin, R. Lytel, and H.L. Davidson, “Optical interconnects: out of the box forever?” IEEE Journal of Selected Topics in Quantum Electronics, vol. 9, 2003, pp. 614–623.
10. M. Lipson, “Overcoming the limitations of microelectronics using Si nanophotonics: solving the coupling, modulation and switching challenges,” Journal of Nanotechnology, vol. 15, 2004, pp. 622–627.
11. A.G. Kirk, D.V. Plant, M.H. Ayliffe, M. Chateauneuf, and F. Lacroix, “Design rules for highly parallel free-space optical interconnects,” IEEE Journal of Selected Topics in Quantum Electronics, vol. 9, 2003, pp. 531–547.
12. C. Debaes, M. Vervaeke, V. Baukens, H. Ottevaere, P. Vynck, P. Tuteleers, B. Volckaerts, W. Meeus, M. Brunfaut, J. Van Campenhout, A. Hermanne, and H. Thienpont, “Low-cost microoptical modules for MCM level optical interconnections,” IEEE Journal of Selected Topics in Quantum Electronics, vol. 9, 2003, pp. 518–530.
13. Y. Ishii, S. Koike, Y. Arai, and Y. Ando, “SMT-compatible large-tolerance ‘OptoBump’ interface for interchip optical interconnections,” IEEE Transactions on Advanced Packaging, vol. 26, 2003, pp. 122–127.
14. X. Wang, F. Kiamilev, P. Gui, J. Ekman, G.C. Papen, M.J. McFadden, M.W. Haney, and C. Kuznia, “A 2-Gb/s optical transceiver with accelerated bit-error-ratio test capability,” Journal of Lightwave Technology, vol. 22, 2004, pp. 2158–2167.
15. J.W. Joyner, P. Zarkesh-Ha, and J.D. Meindl, “Global interconnect design in a three-dimensional system-on-a-chip,” IEEE Transactions on Very Large Scale Integration (VLSI) Systems, vol. 12, 2004, pp. 367–372.
16. G.G. Shahidi, “Evolution of CMOS technology at 32 nm and beyond,” Proceedings of the IEEE Custom Integrated Circuits Conference, 2007, pp. 413–416.
17. G. Huang, M. Bakir, A. Naeemi, H. Chen, and J.D. Meindl, “Power delivery for 3D chip stacks: physical modeling and design implication,” Proceedings of the IEEE Conference on the Electrical Performance of Electronic Packaging, 2007, pp. 205–208.
18. H. Ishikuro, N. Miura, and T. Kuroda, “Wideband inductive-coupling interface for high-performance portable system,” Proceedings of the IEEE Custom Integrated Circuits Conference, 2007, pp. 13–20.
19. J.Q. Lu, Y. Kwon, G. Rajagopalan, M. Gupta, J. McMahon, K.W. Lee, R.P. Kraft, J.F. McDonald, T.S. Cale, R.J. Gutmann, B. Xu, E. Eisenbraun, J. Castracane, and A. Kaloyeros, “A wafer-scale 3D IC technology platform using dielectric bonding glues and copper damascene patterned inter-wafer interconnects,” Proceedings of the IEEE International Interconnect Technology Conference, 2002, pp. 78–80.
20. C.S. Tan, K.N. Chen, A. Fan, and R. Reif, “A back-to-face silicon layer stacking for three-dimensional integration,” Proceedings of the IEEE International SOI Conference, 2005, pp. 87–89.
21. J.A. Burns, B.F. Aull, C.K. Chen, C.-L. Chen, C.L. Keast, J.M. Knecht, V. Suntharalingam, K. Warner, P.W. Wyatt, and D.R.W. Yost, “A wafer-scale 3D circuit integration technology,” IEEE Transactions on Electron Devices, vol. 53, 2006, pp. 2507–2516.

22. D.J. Witte, F. Crnogorac, D.S. Pickard, A. Mehta, Z. Liu, B. Rajendran, P. Pianetta, and R.F.W. Pease, “Lamellar crystallization of silicon for 3-dimensional integration,” Microelectronic Engineering, vol. 84, 2007, pp. 118.
23. J. Feng, Y. Liu, P.B. Griffin, and J.D. Plummer, “Integration of Germanium-on-Insulator and Silicon MOSFETs on a silicon substrate,” IEEE Electron Device Letters, vol. 27, 2006, pp. 911–913.
24. M.S. Bakir, B. Dang, and J.D. Meindl, “Revolutionary nanosilicon ancillary technologies for ultimate-performance gigascale systems,” Proceedings of the IEEE Custom Integrated Circuits Conference, 2007, pp. 421–428.
25. M.S. Bakir and J.D. Meindl, “Sea of polymer pillars electrical and optical chip I/O interconnections for gigascale integration,” IEEE Transactions on Electron Devices, vol. 51, 2004, pp. 1069–1077.
26. M.S. Bakir, T.K. Gaylord, K.P. Martin, and J.D. Meindl, “Sea of polymer pillars: compliant wafer-level electrical-optical chip I/O interconnections,” IEEE Photonics Technology Letters, vol. 15, 2003, pp. 1567–1569.
27. O.O. Ogunsola, H.D. Thacker, B.L. Bachim, M.S. Bakir, J. Pikarsky, T.K. Gaylord, and J.D. Meindl, “Chip-level waveguide-mirror-pillar optical interconnect structure,” IEEE Photonics Technology Letters, vol. 18, 2006, pp. 1672–1674.
28. B. Dang, M.S. Bakir, and J.D. Meindl, “Integrated thermal-fluidic I/O interconnects for an on-chip microchannel heat sink,” IEEE Electron Device Letters, vol. 27, 2006, pp. 117–119.
29. B. Dang, “Integrated input/output interconnection and packaging for GSI,” Ph.D. Thesis, Georgia Institute of Technology, 2006.
30. B. Dang, P. Joseph, M.S. Bakir, T. Spencer, P. Kohl, and J.D. Meindl, “Wafer-level microfluidic cooling interconnects for GSI,” Proceedings of the IEEE International Interconnect Technology Conference, 2005, pp. 180–182.
31. M.S. Bakir, B. Dang, O. Ogunsola, and J.D. Meindl, “‘Trimodal’ wafer-level package: fully compatible electrical, optical, and fluidic chip I/O interconnects,” Proceedings of the Electronic Component and Technology Conference, 2007.
32. M.S. Bakir, D. Bing, O.O.A. Ogunsola, R. Sarvari, and J.D. Meindl, “Electrical and optical chip i/o interconnections for gigascale systems,” IEEE Transactions on Electron Devices, vol. 54, 2007, pp. 2426–2437.
33. M.S. Bakir, A.L. Glebov, M.G. Lee, P.A. Kohl, and J.D. Meindl, “Mechanically flexible chip-to-substrate optical interconnections using optical pillars,” IEEE Transactions on Advanced Packaging, vol. 31, 2008, pp. 143–153.
34. A.L. Glebov, D. Bhusari, P. Kohl, M.S. Bakir, J.D. Meindl, and M.G. Lee, “Flexible pillars for displacement compensation in optical chip assembly,” IEEE Photonics Technology Letters, vol. 18, 2006, pp. 974–976.
35. H.Y. Zhang, D. Pinjala, T.N. Wong, and Y.K. Joshi, “Development of liquid cooling techniques for flip chip ball grid array packages with high heat flux dissipations,” IEEE Transactions on Components and Packaging Technology, vol. 28, 2005, pp. 127–135.
36. E.G. Colgan, B. Furman, A. Gaynes, W. Graham, N. LaBianca, J.H. Magerlein, R.J. Polastre, M.B. Rothwell, R.J. Bezama, R. Choudhary, K. Marston, H. Toy, J. Wakil, and J. Zitz, “A practical implementation of silicon microchannel coolers for high power chips,” Proceedings of the IEEE Semiconductor Thermal Measurement and Management Symposium, 2005, pp. 1–7.
37. D.B. Tuckerman and R.F.W. Pease, “High-performance heat sinking for VLSI,” IEEE Electron Device Letters, vol. 2, 1981, pp. 126–129.
38. C.K. King, D. Sekar, M.S. Bakir, B. Dang, J. Pikarsky, and J.D. Meindl, “3D stacking of chips with electrical and microfluidic I/O interconnects,” Proceedings of the Electronics Components and Technology Conference, 2008.
39. D. Sekar, C. King, B. Dang, T. Spencer, H.D. Thacker, P. Joseph, M.S. Bakir, and J.D. Meindl, “A 3D-IC technology with integrated microchannel cooling,” Proceedings of the International Interconnect Technology Conference, 2008.

40. M.S. Bakir, C. King, D. Sekar, H.D. Thacker, B. Dang, G. Huang, A. Naeemi, and J.D. Meindl, “3D heterogeneous integrated systems: liquid cooling, power delivery, and implementation,” Proceedings of the IEEE Custom Integrated Circuits Conference, 2008.
41. H.D. Thacker, O. Ogunsola, A. Carson, M.S. Bakir, and J.D. Meindl, “Optical through-wafer interconnects for 3D hyper-integration,” Proceedings of the IEEE Lasers and Electro-Optics Society Annual Meeting, 2006, pp. 28–29.
42. J.H. Wu, J. Scholvin, and J.A. del Alamo, “A through-wafer interconnect in silicon for RFICs,” IEEE Transactions on Electron Devices, vol. 51, 2004, pp. 1765–1771.
43. J.D. Meindl, “Low Power Microelectronics: Retrospect and Prospects,” Proceedings of IEEE, vol. 83, 1995, pp. 619–635.
44. M. Swaminathan and E. Engin. Power Integrity: Modeling and Design for Semiconductor and Systems. Prentice Hall PTR, 2007.
45. H. Zheng, B. Krauter and L.T. Pileggi, “Electrical Modeling of Integrated-Package Power/Ground Distributions,” IEEE Design and Test of Computers, vol. 20, no. 3, 2003, pp. 23–31.
46. K.L. Wong, T. Rahal-Arabi, M. Ma, and G. Taylor, “Enhancing Microprocessor Immunity to Power Supply Noise with Clock-Data Compensation,” IEEE Journal of Solid-State Circuits, vol. 41, no. 4, 2006.
47. W.D. Becker, J. Eckhardt, R.W. Frech, G.A. Katopis, E. Klink, M.F. McAllister, T.G. MacNamara, P. Muench, S.R. Richter, and H.H. Smith, “Modeling, Simulation, and Measurement of Mid-Frequency Simultaneous Switching Noise in Computer Systems,” IEEE Transactions on Components, Packaging, and Manufacturing Technology, part B, vol. 21, 1998, pp. 157–163.
48. A. Dharchoudhury, R. Panda, D. Blaauw, R. Vaidyanathan, “Design and Analysis of Power Distribution Networks in PowerPC Microprocessors,” Design Automation Conference, 1998, pp. 738–743.
49. R. Tummala. Fundamentals of Microsystems Packaging. McGraw Hill, 2001.
50. G. Huang, D. Sekar, A. Naeemi, K. Shakeri, and J.D. Meindl, “Compact physical models for power supply noise and chip/package co-design of gigascale integration,” Proceedings of the Electronic Component and Technology Conference, 2007.
51. M.S. Bakir, H.A. Reed, H.D. Thacker, P.A. Kohl, K.P. Martin, and J.D. Meindl, “Sea of Leads (SoL) ultrahigh density wafer level chip input/output interconnections,” IEEE Transactions on Electron Devices, vol. 50, no. 10, 2003, pp. 2039–2048.
52. Semiconductor Industry Association, “International Technology Roadmap for Semiconductors (ITRS),” 2004.
53. K. Banerjee, S.J. Souri, P. Kapur, and K.C. Saraswat, “3D ICs: A novel chip design for improving deep-submicrometer interconnect performance and systems-on-chip integration,” Proceedings of the IEEE, vol. 89, no. 5, 2001, pp. 602–633.
54. J.U. Knickerbocker, P.S. Andry, B. Dang, R.R. Horton, C.S. Patel, R. Polastre, K. Sakuma, E. Sprogis, C.K. Tsang, and S.L. Wright, “3D chip stacks and silicon packaging technology using through-silicon-vias (TSV) for systems integration,” 3D System Integration Conference (3D-SIC), 2007.
55. J. Held, J. Bautista, and S. Koehl, “From a few cores to many: a tera-scale computing research overview,” Research at Intel White Paper.
56. G. Huang, M. Bakir, A. Naeemi, H. Chen, and J.D. Meindl, “Power delivery for 3D chip stacks: physical modeling and design implication,” Proceedings of the Electrical Performance of Electronic Packaging, 2007, pp. 205–208.

Part III

Coupled Data Technologies

Chapter 3

Capacitive Coupled Communication

David Hopkins, Alex Chow, Frankie Liu, Dinesh D. Patil, Hans Eberle

David Hopkins, Alex Chow, Dr. Frankie Liu, Dr. Dinesh D. Patil, and Dr. Hans Eberle
Sun Microsystems Research Labs, 16 Network Circle, Menlo Park, CA 94025, USA, e-mail: {robert.hopkins},{alex.chow},{frankie.liu},{dinesh.d.patil},{hans.eberle}@sun.com

R. Ho and R. Drost (eds.), Coupled Data Communication Techniques for High-Performance and Low-Power Computing, Integrated Circuits and Systems, DOI 10.1007/978-1-4419-6588-2_3, © Springer Science+Business Media, LLC 2010

3.1 Introduction

Capacitive coupled communication is a wireless chip-to-chip communication technology that uses capacitive coupling to transfer signals from a chip to neighboring chips. Its high-bandwidth, low-power, and low-latency chip-to-chip I/O capabilities enable the construction of high-performance and economical multi-chip modules (MCMs). Chips are placed face-to-face (Figure 3.1), with only a few microns of separation, such that overlapping transceiver circuits communicate through capacitive coupling between top-layer metal pads [1]. By using relatively small metal structures to communicate signals over short distances, capacitive coupled communication directly improves channel density, power, and latency to more closely match the performance of on-chip wires.

Fig. 3.1 Capacitive coupled communication between face to face chips. (© 2003 IEEE)

With capacitive coupling, chips communicate without off-chip wires or soldered connections. The absence of permanent attachment enables easy removal and replacement of individual chips. This could simplify package rework during manufacturing, solve the known-good-die issue of multi-chip packages and further lower packaging cost [2].

While assembling chips using capacitive coupled communication yields important performance and cost benefits, it also presents a number of electrical and mechanical challenges. First, chips must be precisely aligned to ensure that each transmitter couples strongly to its corresponding receiver. As chips move apart in any of the six alignment axes (see Figure 3.2) and become misaligned from the nominal position, signal strength degrades and noise becomes more significant. Dense packing allows low-latency, energy-efficient communication at the expense of increased power density. Furthermore, the spatial concentration required to form a two-dimensional grid of chips in a multi-chip module necessitates new packaging solutions that can hold the chips in alignment, deliver adequate power to every chip, and extract heat from the tightly packed module.

In comparison to capacitive coupled communication, optical coupled communication may be more tolerant of z-misalignment (interchip gap). In addition, when combined with advanced techniques such as wavelength division multiplexing, optical coupling provides even higher bandwidth density than capacitive coupling. However, it requires an added layer of complexity to convert signals between the optical and electrical domains for use in standard electronic circuits. Area- and energy-efficient conversion between the optical and electrical domains, using technology compatible with standard CMOS devices and fabrication, is an area of active research today [3, 4, 5]. Inductive coupled communication has shown additional tolerance to z-misalignment (interchip gap), but has severe crosstalk that must be mitigated for reliable communication [6, 7, 8].

Fig. 3.2 Misalignment can be in any of six axes: three translational and three rotational. (© 2007 IEEE)

This chapter begins with the development of an electrical model for capacitive interchip communication, followed by a discussion of transceiver circuitry, two-dimensional arrays of capacitive communication links, testchip measurement results and an application prototype.

3.2 An electrical model of capacitive interchip communication

In this section we develop an electrical model of capacitive interchip communication in order to assess the signal and noise characteristics and their effect on the performance of a communication channel. As shown previously in Figure 3.1, chips are placed face-to-face with a few microns of separation, such that overlapping transceiver circuits couple capacitively from the transmitter on one chip to the receiver on an adjacent chip. An interchip communication system may contain hundreds or thousands of capacitive coupled channels. We present here the electrical model of a single transmit-receive pair, describing the channel behavior and the effects induced upon it by neighboring communication channels.

In an isolated capacitive communication channel with nominal alignment, the transmitter pad is positioned exactly opposite the receiver pad, so that series coupling capacitance is provided by the entire overlap area. However, since capacitive channels are typically employed within a densely packed two-dimensional grid, we must amend this simplistic model to reflect the influence of both stray capacitance and crosstalk to neighboring communication channels. Each transmitter and receiver pad has additional stray capacitance coupling with its environment, including neighbors on the same chip, neighbors on the opposite chip, substrate and other nearby metal structures.

The transmitter and receiver plates each have parasitic capacitance to fixed potentials and coupling capacitance between them. These three components yield a capacitive π-model. In addition, both the transmitter pad and receiver pad have capacitance to neighboring pads, resulting in crosstalk. Noise couples into the receiver pad only through fringe capacitance, which may be much lower than the area capacitance. Neighbors that couple only diagonally by way of a corner are less significant, and will be omitted in much of this discussion for simplicity, although it is straightforward to extend the analysis to include more remote neighbors. Figure 3.3 depicts a simple π-model with the side capacitors on either chip split into crosstalk and parasitic components.

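Fig. 3.3 Simple channel model showing crosstalk and parasitic capacitances. The figure labels the coupling capacitance Cc between Din and Dout, the parasitic capacitances CTPar and CRPar, and the crosstalk capacitances CTXT and CRXT to the TX and RX attackers. Though not shown here, capacitively coupled channels are typically differential.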
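As a rough numerical illustration of the π-model in Figure 3.3, the sketch below treats the receiver pad as a floating node that is driven through Cc and loaded by its parasitic and crosstalk capacitances, so that a voltage step divides capacitively. The function names and all component values are hypothetical and chosen only for illustration; they are not taken from this chapter, and the bias network is assumed to be high impedance on the time scale of the transition.

```python
def received_swing(v_tx, c_c, c_rpar, c_rxt):
    """Voltage step on a floating receiver pad for a transmit step v_tx.

    Capacitive divider: the coupling capacitor c_c drives the node while
    the parasitic (c_rpar) and neighbor (c_rxt) capacitances load it.
    Assumes quiet neighbors and a high-impedance bias network.
    """
    return v_tx * c_c / (c_c + c_rpar + c_rxt)


def crosstalk_step(v_aggressor, c_c, c_rpar, c_rxt):
    """Step injected on the same node when the neighbors behind c_rxt
    switch by v_aggressor while the wanted transmitter holds its level."""
    return v_aggressor * c_rxt / (c_c + c_rpar + c_rxt)


if __name__ == "__main__":
    # Hypothetical values: capacitances in femtofarads, voltages in volts.
    v_rx = received_swing(v_tx=1.0, c_c=10.0, c_rpar=25.0, c_rxt=5.0)
    v_xt = crosstalk_step(v_aggressor=1.0, c_c=10.0, c_rpar=25.0, c_rxt=5.0)
    print(f"received swing: {v_rx * 1e3:.0f} mV, crosstalk step: {v_xt * 1e3:.0f} mV")
```

With these assumed values the receiver sees a quarter of the transmitted swing, which shows why the receiver-side parasitic load matters as much as the coupling capacitance itself.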

A thorough understanding of the types of noise present in a capacitive coupled communication link is essential to making informed design decisions. To ease analysis of the many types of noise present in modern computer systems, noise sources are partitioned into bounded noise (Nb) such as crosstalk and power supply noise, and unbounded noise (Nu) created by Gaussian random processes. To estimate the bit-error rate (BER) of the communication link, we calculate a signal to noise ratio (SNR) using the signal strength (defined as half the peak-to-peak voltage swing) and bounded noise, divided by the standard deviation of the unbounded noise. As there are a finite number of possibilities for the nearest neighbor crosstalk, a weighted sum is used to calculate the aggregate error rate. For each possible crosstalk condition, the SNR is calculated and the BER contribution is determined by using a weighting (p_i) corresponding to the likelihood of that crosstalk condition. By summing over all possible crosstalk conditions, we find a BER estimate.

\[ \mathrm{SNR}_i = \frac{\mathrm{Signal} - N_{b_i}}{\sigma(N_{u_i})} \tag{3.1} \]

\[ \mathrm{BER} = \sum_i p_i \cdot \frac{1}{2} \cdot \operatorname{erfc}\left(\frac{Q_i}{\sqrt{2}}\right) = \sum_i p_i \cdot \frac{1}{2} \cdot \operatorname{erfc}\left(\frac{\mathrm{SNR}_i}{\sqrt{2}}\right) \tag{3.2} \]

where erfc(x) denotes the complementary error function, associated with the probability that a normally distributed random variable lies outside a certain region defined by the argument x.

The most important sources of random and unbounded noise are the noise currents generated by the MOS transistors in the receiving amplifier. We consider a simple model of the capacitive coupled channel, bias circuitry and sense amplifier receiver where the differential receiver pads are connected directly to the inputs of a receiving sense amplifier. The sense amplifier periodically samples the input data and produces full-swing digital output levels. The input bias voltage is set by weak transistors. The minimum signal strength required to satisfy a given BER can be estimated by calculating the total receiver noise referred to the sense amp inputs.

Fig. 3.4 Channel model showing a receiving amplifier consisting of an ideal bias circuit, an ideal sampler, and an ideal comparator, with sources of both bounded, deterministic noise Nbi and random, unbounded noise Nui. The total receiver input capacitive load, Ci, includes all parasitics. Channels are typically differential in practice.
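The weighted sum of Equations (3.1) and (3.2) can be sketched in a few lines of Python. This is a minimal illustration rather than the authors' tooling: the function name, the list-of-pairs input format, and the use of a single noise standard deviation for every crosstalk condition are assumptions made here for brevity.

```python
import math


def ber_estimate(signal, sigma_unbounded, crosstalk_cases):
    """Weighted bit-error-rate estimate following Eqs. (3.1) and (3.2).

    signal           -- half the peak-to-peak received swing (volts)
    sigma_unbounded  -- standard deviation of the Gaussian noise (volts);
                        assumed identical for all crosstalk conditions
    crosstalk_cases  -- iterable of (probability, bounded_noise) pairs
                        whose probabilities sum to 1
    """
    total = 0.0
    for p_i, nb_i in crosstalk_cases:
        snr_i = (signal - nb_i) / sigma_unbounded                 # Eq. (3.1)
        total += p_i * 0.5 * math.erfc(snr_i / math.sqrt(2.0))   # Eq. (3.2)
    return total


if __name__ == "__main__":
    # Hypothetical example: 50 mV signal, 5 mV RMS noise, and three
    # crosstalk conditions with assumed likelihoods and noise amplitudes.
    cases = [(0.25, 0.0), (0.50, 10e-3), (0.25, 20e-3)]
    print(f"BER estimate: {ber_estimate(50e-3, 5e-3, cases):.3e}")
```

Because erfc falls off very steeply, the condition with the largest bounded noise usually dominates the sum even when its probability is small.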

Two independent noise sources have significant impact on the operation of the capacitive coupled receiver. The first comes from the input bias circuits and its variance is given by kT/Ci, where Ci is the total capacitance on the input node, k the Boltzmann constant, and T the temperature. The second source of noise is the set of transistors in the sense amplifier. Computing the noise generated by the sense amplifier is much more complex. The sense amplifier is typically a non-linear and time-varying element, making it difficult to refer noise sources appropriately to the input. The sense amplifier periodically samples the inputs, amplifies the voltage difference and then regenerates the resulting signal as the transistors go through different regions of operation. Therefore, the input-referred noise cannot be calculated by treating the sense amp as a conventional linear amplifier.

Fig. 3.5 A typical sense-amplifier used in capacitively coupled channels, with clock input clk, differential inputs in+ and in−, and differential outputs out+ and out−.

Several different approaches to estimating the noise of a sense amplifier like that shown in Figure 3.5 have been proposed [10, 11]. A simplification that yields good results models the sense amplifier as a linear, periodically time-variant system. By calculating the output noise and referring it to the input by dividing by the low-frequency gain of the amplifier, these methods yield useful and reasonably accurate results. The low-frequency gain is the product of three terms: the gain of the differential pair at the input, the gain of the regenerative pair before the onset of regeneration, and the regenerative gain. Some simplifying assumptions, including operating the sense-amplifier as fast as the technology allows, lead to a very understandable and informative result:

\[ \sigma(N_{u_i})^2 = \frac{kT}{C_i} + \frac{\gamma kT}{C_o} \tag{3.3} \]

where Ci and Co are the total capacitance at the input and output nodes of the amplifier, respectively, and γ is the excess noise factor in CMOS technologies. This reveals a fundamental tradeoff between power, maximum operating speed and noise. Given a fixed power target, a designer can use a small amplifier (with low input and output capacitance) that functions up to a high data rate but exhibits more noise than a larger, slower amplifier.
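To get a feel for the magnitudes in Equation (3.3), the following sketch evaluates the input-referred noise for a given pair of node capacitances. The capacitance values, the default excess noise factor of 1.0, and the 300 K temperature are assumptions for illustration only; they are not values reported in this chapter.

```python
import math

BOLTZMANN = 1.380649e-23  # Boltzmann constant in J/K


def input_referred_sigma(c_in, c_out, gamma=1.0, temperature=300.0):
    """RMS input-referred noise voltage from Eq. (3.3).

    c_in, c_out -- total input and output node capacitance in farads
    gamma       -- excess noise factor (assumed value; technology dependent)
    temperature -- absolute temperature in kelvin
    """
    kt = BOLTZMANN * temperature
    return math.sqrt(kt / c_in + gamma * kt / c_out)


if __name__ == "__main__":
    # Hypothetical amplifier: 20 fF at the input node, 10 fF at the output node.
    sigma = input_referred_sigma(20e-15, 10e-15)
    print(f"input-referred noise: {sigma * 1e3:.2f} mV rms")
```

For these assumed capacitances the result is a little under 1 mV rms. Doubling both capacitances lowers the noise by a factor of √2, which is the power, speed and noise tradeoff described above.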


3.2.1 Crosstalk mitigation

For capacitive coupled channels with typical geometries, nearest neighbor crosstalk from adjacent transmitter and receiver pads is the dominant source of crosstalk. There are many different choices for the arrangement of I/O channels in a two-dimensional grid, and the arrangement affects the interaction between a channel and its nearest neighbors. Although some early work focused on single-ended signaling, differential signaling was found to have significant advantages in terms of sensitivity, noise rejection, and reduced return path ambiguity.

Single-ended signaling devotes a single pad to each signal, whereby information is encoded as changes in the voltage of that pad relative to a common reference, usually the ground voltage on the chip. Single-ended signaling has the disadvantage that all four neighboring pads may oppose a given transition, leading to substantial crosstalk. Differential signaling sends information encoded as the difference in voltage between a pair of adjacent pads. Although twice as many pads are needed for such a scheme, the benefits in terms of channel reliability are significant enough to outweigh the area penalty.

We consider three arrangements of pads for differential signaling along with single-ended, as shown in Figure 3.6. Side differential signaling places differential pairs side-by-side, such that each pad shares an edge with its complementary neighbor. Corner differential signaling and butterfly differential signaling place differential pairs diagonal to one another, such that each pad shares a corner with its complementary neighbor. Each of these configurations has a distinct impact on the magnitude of crosstalk noise. We developed an arrangement of channels called butterfly differential signaling (Figure 3.6d) which completely rejects nearest neighbor crosstalk, for a receiver with good common mode rejection.

Figure 3.6d highlights a differential channel (pads A+ and A-) and four adjacent channels (B, C, D, and E). Channel A sees no net noise from channel B, because pads B+ and B- couple equally to A+; any noise due to a transition on B+ is canceled by an opposing transition on B-. Pads E+ and E- act similarly on A-. Channel A sees no net noise from D or C, because D+ and C- couple equally to A+ and A-; any noise due to a transition on D or C is thus common-mode to A. This crosstalk cancellation scheme enables reliable communication using smaller I/O pads, over a greater chip separation, and at higher data rates. This pad arrangement can also mitigate crosstalk in any 2D array of communication channels, including channels on different layers on a printed circuit board or on adjacent area solder connections.
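The cancellation argument for the butterfly arrangement can be checked by brute force. The sketch below is a toy model under stated assumptions: every edge neighbor couples with the same weight k, corner coupling is ignored, each '+' pad moves by the channel's transition and each '-' pad by its negative, and the receiver rejects common-mode noise perfectly. The function names are hypothetical.

```python
from itertools import product


def butterfly_differential_noise(b, c, d, e, k=1.0):
    """Differential noise seen by channel A in the butterfly arrangement.

    b, c, d, e are the data transitions (-1, 0 or +1) of the neighboring
    differential channels. k is the assumed common edge-coupling weight.
    """
    noise_a_plus = k * ((+b) + (-b) + (+d) + (-c))   # B+, B-, D+, C- couple to A+
    noise_a_minus = k * ((+e) + (-e) + (+d) + (-c))  # E+, E-, D+, C- couple to A-
    return noise_a_plus - noise_a_minus


def single_ended_worst_case(k=1.0):
    """Worst-case noise on a single-ended pad whose four edge neighbors
    all switch against it."""
    return 4 * k


# Every combination of neighbor transitions leaves zero differential noise.
assert all(
    butterfly_differential_noise(b, c, d, e) == 0
    for b, c, d, e in product((-1, 0, 1), repeat=4)
)
```

In this idealized model the differential noise vanishes for all 81 neighbor patterns, whereas a single-ended pad can see up to four units of opposing coupling. Real arrays deviate from this once misalignment or corner coupling makes the coupling weights unequal.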

3.2.2 Simulation results The signal and noise properties of capacitive coupled channels can be studied using a 3D electromagnetic field solver to extract all the coupling capacitances between pads on the two chips. The extracted coupling capacitance between corresponding signal pads indicates the available signal at the receiver, while the extracted cross-

3 Capacitive Coupled Communication

57

+

+

+

-

+

-

+

+

+

-

+

-

(a) Single-ended

(c) Corner Differential

(b) Side Differential

EE+ +

C+

D-

A-

D+ +

C-

A+

B-

B+ + (d) Butterfly Differential

Fig. 3.6 Pad arrangements for four different signaling schemes: (a) single-ended; (b) side differential; (c) corner differential; (d) butterfly differential.

coupling capacitances indicate the amount of noise that may be injected into a receiver pad. This enables a comparison of different signaling schemes in terms of their ability to reject noise. By modifying the geometries in these models, it is also possible to study how signal and noise levels are affected by chip misalignment, pad sizes, and different dielectric materials. In order to get a sense of scale, the metal and dielectric stackup from a representative modern process is shown in Figure 3.7. For a 90 nm process, there are typically six to ten copper interconnect layers sandwiched between a variety of insulating glass dielectric materials. Most of the dielectric materials have properties similar to silicon dioxide, so we use an approximate relative dielectric constant four times that of vacuum. Each interconnect layer and its respective interlayer dielectrics create a sandwich structure of approximately one micron in thickness (except for the top and bottom layers which are thicker and thinner, respectively). Figure 3.8 shows an example of a chip-to-chip pad model that can be used to extract signal and crosstalk capacitances between transmitting and receiving pads using an electromagnetic field solver. The model consists of two arrays of square pads, one on each chip. The central pad in each array is the transmitting or receiving pad. Neighboring pads that

58

Capacitive Coupled Communication

Passivation

Metal N (Top layer)

... Metal 3

Via Metal 2

Inter-layer dielectric Metal 1

Poly Well

Substrate

Fig. 3.7 Metal and dielectric stackup of a typical modern CMOS process.

introduce crosstalk noise are modeled by the eight pads surrounding the central pad. An outer ring represents all other pads on the chip. Dielectric slabs are used to represent the passivation and intermetal layers, and ground planes are used to model the presence of other circuitry beneath the arrays. In this model, there are 22 individual conductors: 9 square pads, an outer square annulus, and a ground plane on each of the two chips. The electromagnetic field solver returns a 22-by-22 matrix that gives the self and coupling capacitances of each conductor with all other conductors in the model. The self capacitances, given by the entries along the diagonal of the matrix, are the total capacitances seen by

Fig. 3.8 Chip-to-chip pad model for capacitance extraction using a 3D electromagnetic field solver (pad arrays on Chip 1 and Chip 2). Ground planes and dielectric layers are not shown.

each conductor. All other entries show the coupling capacitances between two conductors.
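As a rough illustration of how such a matrix can be read, the Python sketch below pulls the signal and crosstalk capacitances for one receiving pad out of a small solver-style matrix. The four-conductor matrix, its values, the conductor numbering, and the helper name `signal_and_crosstalk` are all invented for illustration; the 22-conductor model described above would be handled the same way, only with more aggressor indices.

```python
# Toy illustration: reading signal and crosstalk capacitance out of a
# solver-style capacitance matrix. All numbers and index choices are made up.

def signal_and_crosstalk(cap_matrix, rx, tx, aggressors):
    """Return (signal, crosstalk) capacitance seen by receive pad `rx`.

    cap_matrix[i][j] (i != j) is the coupling between conductors i and j;
    cap_matrix[i][i] is the total (self) capacitance of conductor i.
    Some field solvers report off-diagonal entries as negative numbers,
    so magnitudes are used here.
    """
    signal = abs(cap_matrix[rx][tx])
    crosstalk = sum(abs(cap_matrix[rx][a]) for a in aggressors)
    return signal, crosstalk

# Hypothetical 4-conductor example (values in fF):
# 0 = receive pad, 1 = facing transmit pad, 2 and 3 = neighboring transmit pads.
C = [
    [12.0, -5.0, -0.4, -0.6],
    [-5.0, 11.0, -1.0, -1.0],
    [-0.4, -1.0, 10.0, -0.5],
    [-0.6, -1.0, -0.5, 10.0],
]

sig, xtalk = signal_and_crosstalk(C, rx=0, tx=1, aggressors=[2, 3])
print(f"signal = {sig:.1f} fF, crosstalk = {xtalk:.1f} fF")
```

Summing the aggressor couplings gives a worst-case figure, in which every neighbor switches in the same direction at once.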

Fig. 3.9 Coupling capacitance as a function of interchip spacing (z), for I/O pads on a 36 x 36 µm pitch. [Plot: signal and crosstalk capacitance (fF, log scale) versus chip separation (0 to 30 µm).]

Figure 3.9 shows the variation of signal and crosstalk capacitances with chip spacing in the absence of any translational or rotational misalignment. It is worth noting that the signal capacitance drops with chip spacing approximately as

$$C_{\mathrm{sig}}(z) \approx \frac{1}{k+z} \tag{3.4}$$

for some constant k. The cross-coupling capacitance between a receiving pad and the crosstalk-inducing transmitting pads drops approximately as

$$C_{\mathrm{xtalk}}(z) \approx \log\left(1 + \frac{t}{z}\right) \tag{3.5}$$

where t represents the metal thickness. The difference in these two characteristics results from the fact that the signal capacitance mainly consists of area capacitance; cross-coupling capacitance, on the other hand, mainly consists of fringe capacitance, which drops with plate distance more slowly. This is unfortunate because it indicates that as chips move apart, not only does the desired signal decrease, but the relative contribution of crosstalk also increases.

Fig. 3.10 Coupling capacitance as a function of in-plane misalignment (x, y), for I/O pads on a 36 x 36 µm pitch. [Plot: signal and crosstalk capacitance (fF, log scale) versus in-plane misalignment x, y (0 to 35 µm).]

Figure 3.10 shows the variation of signal and crosstalk capacitances with in-plane misalignment in both dimensions. For a small amount of misalignment, signal coupling does not drop significantly. However, crosstalk coupling does increase appreciably; even with crosstalk-canceling signaling, crosstalk can become significant when misalignment is more than one-quarter of the pad pitch, because coupling from corner neighbors cannot be effectively eliminated. Electronic alignment correction is therefore useful in keeping communication within an acceptable alignment range.

3.3 Transmitting data

In the simplest implementations, the transmitter circuits for each data channel consist of CMOS inverters and standard retiming elements, as necessary. The very high pad density, however, requires highly accurate mechanical alignment. To relax this constraint, we developed electronic alignment correction [12]. This technique shifts the location of the transmitting channel to compensate for physical misalignment between the transmitting and receiving channels (Figure 3.11). Each transmit pad is physically divided into a 4x4 array of micropads. Multiplexers on the transmitting chip steer data to the micropads that best align with the receiving chip. We determine the optimal multiplexer configuration by precisely measuring chip alignment [13]. More micropads per channel reduce the residual misalignment, but at the expense of additional circuit complexity and power consumption.
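The steering decision can be sketched in a few lines of Python. This is only an illustration under stated assumptions: it takes the 36 µm pad pitch used in the capacitance plots, divides it by four to get a 9 µm micropad pitch, and limits the shift to two micropads per axis. The function name `steering_offset`, the rounding rule, and the shift limit are hypothetical and are not taken from the implementation in [12].

```python
# Sketch of choosing a micropad steering offset along one axis.
# Assumptions (not from the source): 36 um pad pitch, so a 4x4 micropad grid
# has a 9 um micropad pitch; at most 2 micropads of shift are available.

PAD_PITCH_UM = 36.0
MICROPADS_PER_SIDE = 4
MICROPAD_PITCH_UM = PAD_PITCH_UM / MICROPADS_PER_SIDE

def steering_offset(misalign_um, max_steps=2):
    """Return (steps, residual_um) for one axis.

    steps is the whole number of micropads by which transmit data is shifted;
    residual_um is the misalignment that remains after the shift.
    """
    steps = round(misalign_um / MICROPAD_PITCH_UM)
    steps = max(-max_steps, min(max_steps, steps))  # limited by border pads
    residual = misalign_um - steps * MICROPAD_PITCH_UM
    return steps, residual

for axis, offset in (("x", 13.0), ("y", -5.0)):
    steps, residual = steering_offset(offset)
    print(f"{axis}: shift {steps:+d} micropads, residual {residual:+.1f} um")
```

Within the correctable range, the residual misalignment in this model never exceeds half a micropad pitch, which is why finer micropads reduce the residual at the cost of more multiplexing.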

Fig. 3.11 Electronic alignment correction. [Diagram labels: normal Tx bit location (with no misalignment); actual Tx bit location (with misalignment); misalignment (x, y); Rx pad.]

Electronic alignment increases power consumption due to the extra wires and multiplexers necessary for data steering. To reduce this power cost, some implementations contain a power-efficient multiplexer that uses NMOS-only pass gates (Figure 3.12). A low select signal drives M2’s gate low, making it opaque. A high select signal drives M2’s gate to one threshold voltage below VDD. It is held very weakly in this state, as M1 is off. A rising data transition bootstraps M2’s gate above VDD, allowing M2 to pass full VDD levels. Falling transitions restore the gate voltage on M2 to one threshold voltage below VDD. Because M2’s gate voltage tracks the channel voltage, the effective channel capacitance and resistance are lower. Compared to a typical CMOS pass gate, the bootstrapped NMOS pass gate reduces overall transmitter power by about 20%, from 2.5 pJ/bit to 2.0 pJ/bit, while providing similar edge rates. Also, because the multiplexers consist of only NMOS devices, layout is more compact. This technique is used in memory design and demonstrates

both high performance and reliability. However, it requires occasional data transitions to prevent droop and jitter if the data remains high over an extended time on the order of 1 ms.

Fig. 3.12 Power-efficient pass-gate circuit. [Schematic labels: select, VDD, in, out, M1, M2, Cb; M2's gate sits at VDD - Vth and is boosted above that level during a rising data transition.] © 2007 IEEE

3.4 Receiving data

The capacitive coupled channel presents two key challenges to reliable recovery of transmitted data: attenuation and loss of DC information.

3.4.1 Attenuation

Although the transmitter transitions between full CMOS levels, the coupling capacitor forms a voltage divider with the total capacitance on the receiver input node. In most practical circumstances, the coupling capacitance is a small fraction of the total capacitance, leading to significant attenuation. Simulation and measurement results show that the received voltage varies between 1% and 20% of the transmitted voltage. It is the responsibility of the receiving amplifier to restore this low-swing signal to full CMOS levels that can be easily used by standard logic gates. Performing this voltage amplification, while functioning at high speed and consuming little energy, can pose a significant challenge.

In addition, device variability increases as devices scale down with each technology generation. Device variability leads to mismatch between devices that the designer intends to be identical, creating asymmetry in differential amplifiers. This asymmetry biases the amplifier so that its decision threshold is no longer at zero differential voltage. As a result, an acceptable input signal must exceed this offset voltage in addition to the signal needed for noise margin and

sensitivity. Additional circuitry can be added to the receiving amplifier to reduce these effects, but these circuits always increase complexity, power consumption, and area.

3.4.2 Loss of DC information

The capacitive coupled channel combines with any shunt conductance on the receiver input node to form a high-pass filter. Spectral content below the corner frequency of this filter is attenuated, with DC information being completely lost. This creates a problem for biasing the receiving amplifier. There are a number of ways to manage this loss of DC information and establish the appropriate DC bias for the amplifier (see Figure 3.13).

The most widely used method to deal with this limitation is data encoding. Popular schemes include 8b/10b and 64b/66b, which encode 8- and 64-bit words as 10- and 66-bit messages, respectively [14, 15]. The increased code space allows these encoding schemes to maintain a nearly equal number of 1’s and 0’s over every two-word sequence. A data stream that has nearly as many 1’s as 0’s over a reasonable timeframe is often referred to as DC-balanced. Given a system with DC-balanced data, DC biasing of the input to the receiver can be accomplished with a simple lowpass-filtered version of the data stream. Although this type of scheme is used widely, it presents several important drawbacks. An encoded channel requires more bandwidth, and the process of encoding and decoding the signals increases complexity, area and energy consumption. In addition, encoding can add a significant amount of latency. Without encoding, these channels may have latency as low as a few bit periods. With 8b/10b encoding applied to each data channel, there are up to an additional 10 bit periods of latency for encoding and decoding. For certain latency-sensitive applications, this is unacceptable.

More generally, it is possible to create DC-balanced data streams by applying modulation techniques used in wireless communication systems. In most of these schemes, the incoming data sequence or a derivative sequence is combined with a sinusoidal carrier. The data sequence may modulate the frequency, amplitude, or the phase of the sinusoidal carrier.
As long as the sinusoidal carrier is not highly correlated with the data sequence, the resulting signal is nearly DC-balanced. These modulation schemes can be combined with a variety of techniques from the wireless community including multiple-access (FDMA, CDMA and TDMA) and diversity techniques, as well as many others. In addition, as in 64b/66b, the data stream can be scrambled by mixing it with a pseudo-random bit sequence (PRBS), which dramatically reduces the likelihood of long-term DC imbalance. Unfortunately, these modulation techniques will usually increase the number of signal transitions, and therefore consume more power in the transmitter circuitry. Another method to deal with the loss of DC information is to periodically restore the DC bias to a known state. This requires the simultaneous application of a known voltage to both the transmitter pads and the receiver pads. Although it is possible

Fig. 3.13 Methods of dealing with the loss of DC information. From top to bottom: using coding such as 8b/10b; mixing/modulation using a carrier or otherwise orthogonal code; explicit refresh of the channel; feedback with a keeper latch.

to impose these conditions without interrupting the flow of data within a channel, in most practical circumstances this requires a relatively brief pause in the flow of data on the channel undergoing this periodic restoration of DC bias, which we call “refresh” after the name given to the related process in DRAM cells. For systems that can accommodate this infrequent unavailability, refreshing the channel from time to time may be a suitable solution. Finally, the DC bias on the receiver can be maintained continuously by means of a feedback keeper. The feedback keeper is placed around the receiving amplifier, such that once a decision is made about whether a bit is a logic 1 or 0, the keeper maintains the voltage at the input to the receiver. Thus, even if no data transitions occur for a very long time, the data signal is correctly received. The challenge with this method, however, is supplying the feedback keeper with the appropriate high and low levels. The high and low levels result from the attenuation of the channel, which depends upon environmental factors; they will not only differ between distant channels, but may also vary with time. Setting the levels incorrectly causes inter-symbol interference (ISI), which degrades noise margins. In order to minimize this degradation, adaptive schemes are usually needed.
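The scrambling approach mentioned earlier in this section can be illustrated with a short Python sketch. It mixes a worst-case all-ones stream with a PRBS7 sequence and compares the disparity (ones minus zeros) before and after. PRBS7, with generator polynomial x^7 + x^6 + 1, is chosen here only because it is short; it is not the scrambler specified for 64b/66b, and the helper names are hypothetical.

```python
# Sketch: scrambling an unbalanced stream with a PRBS7 sequence and checking
# its DC balance. PRBS7 (x^7 + x^6 + 1) is an assumption made for brevity;
# it is not the scrambler used by 64b/66b.

def prbs7(seed=0x7F):
    """Yield a maximal-length pseudo-random bit sequence with period 127."""
    state = seed & 0x7F
    if state == 0:
        raise ValueError("seed must be non-zero")
    while True:
        bit = ((state >> 6) ^ (state >> 5)) & 1
        state = ((state << 1) | bit) & 0x7F
        yield bit

def disparity(bits):
    """Number of ones minus number of zeros."""
    bits = list(bits)
    return 2 * sum(bits) - len(bits)

data = [1] * 127                                   # unbalanced input: all ones
scrambled = [d ^ p for d, p in zip(data, prbs7())]
# The receiver mixes with the same sequence (same seed) to recover the data.
descrambled = [s ^ p for s, p in zip(scrambled, prbs7())]

print("input disparity:    ", disparity(data))
print("scrambled disparity:", disparity(scrambled))
print("round trip ok:      ", descrambled == data)
```

The balance is only statistical: an input that happens to correlate with the scrambling sequence can still produce long runs, which is why the text describes this as reducing the likelihood of imbalance rather than bounding it.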

3.4.3 Comparators

A comparator can be used as a simple receiver for capacitive coupled links. A comparator samples the small voltage difference between the pair of input signals and decides which of the signals is larger. A comparator can be either a continuous or clocked amplifier. A clocked comparator typically operates in two phases: a reset and a comparison phase. During the reset phase, the comparator is drawn asymptotically toward a metastable point, readying it for a quick decision once the input signal arrives. During the comparison phase, the incoming signal tips this delicate balance, and regenerative positive feedback amplifies the voltage difference in an exponential fashion. The usual analog circuit design trade-offs apply to clocked sense amplifiers; the smaller the initial voltage differential, the longer it takes for the comparator to resolve to full CMOS levels. As is often the case, one may also choose to accept added latency in exchange for further amplification of the signal. Additional gain and latency are usually associated with larger power consumption, and often require more complex, multi-phase clocking schemes. On the other hand, one may also oversample the signal with a bank of comparators, so that each comparator is given more time to resolve, but the extra cost in area and power and, more importantly, the added signal attenuation due to increased parasitic capacitance from the parallel paths usually make this an expensive option. For capacitive links, the comparator offset may be stored in its input parasitic capacitance, in order to perform offset cancellation and increase its ability to correctly sense small signals in the presence of device variability. Because the offset voltage is stored as a voltage on the parasitic capacitor, it requires periodic recalibration, at an interval that depends on how quickly charge is lost through leakage mechanisms on the input node.
Because offset compensation requires these periodic interruptions of data flow in the channel, one may choose instead to reduce the intrinsic offset of the comparator. This can be accomplished by increasing the size of the devices in the comparator. This comes at a cost of area and power, and since the offset falls only as the inverse square root of the device area, decreasing the offset can be very expensive in terms of area. An alternative is to implement a digital offset calibration scheme, whereby a finite number of levels of compensation are available. This is often implemented as a set of binary-weighted capacitances that can be switched onto the internal nodes of the comparator to introduce an intentional offset that compensates for the device variation. The technique is quite effective if the ratio of maximum expected offset to minimum resolvable voltage difference is less than ten. If this ratio is very large, it becomes very expensive and complex to implement. Another challenge encountered in comparators is the kickback of charge during the comparison phase. When the regenerative feedback takes the output to the rails, some of that output charge is transferred to the input through parasitic capacitances, creating hysteresis in the response, which reduces voltage margin. This kickback can be reduced considerably by buffering the input stage with a low-gain pre-amplifier stage followed by the comparator.
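To make the digital calibration idea concrete, the following Python sketch picks the trim code that best cancels a given offset. It assumes an idealized trim network whose codes map linearly onto corrections between -full_scale and +full_scale; in a real comparator the switched capacitances produce steps that would have to be characterized. The function name and all numeric values are hypothetical.

```python
# Sketch of a digital offset trim. Assumes a hypothetical trim network whose
# codes span -full_scale .. +full_scale in equal steps.

def trim_code(offset_mv, full_scale_mv, bits):
    """Pick the trim code that best cancels `offset_mv`.

    Returns (code, residual_mv). The code lies in 0 .. 2**bits - 1 and the
    correction applied for a code is -full_scale_mv + code * lsb.
    """
    levels = 2 ** bits
    lsb = 2.0 * full_scale_mv / (levels - 1)
    code = round((full_scale_mv - offset_mv) / lsb)
    code = max(0, min(levels - 1, code))
    correction = -full_scale_mv + code * lsb
    return code, offset_mv + correction

code, residual = trim_code(offset_mv=23.0, full_scale_mv=40.0, bits=4)
print(f"trim code {code}, residual offset {residual:+.2f} mV")
```

In this model the residual is at most half a step for offsets inside the trim range, so the number of levels needed grows in proportion to the ratio of maximum offset to the smallest voltage the comparator must resolve. That is one way to read the remark that the scheme becomes expensive when this ratio is large.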

3.4.4 Receiver sizing

The sizing of data receivers has a profound impact on the sensitivity, power consumption, performance and reliability of the capacitive-coupled link. Two competing effects govern the sensitivity of the receiver (i.e. the ability of a receiver to correctly identify and amplify the incoming signal). First, the data receiver presents capacitive loading on the receiving pad; to maximize the voltage at the receiver’s input, the transistors connected to the receiving pad should be small. Second, transistor variability decreases for larger devices; to minimize threshold variations and corresponding input offset voltages, the transistors connected to the receiving pad should be large. The choice of receiver size must balance the competing goals of maximizing received voltage and minimizing threshold variation, while satisfying data rate, power consumption and area constraints.

Fig. 3.14 Variation in received voltage and device threshold uncertainty as a function of receiver size. [Plot: signal and 3σ threshold variation (V, log scale from 1E-3 to 1E+0) versus device width (0 to 30 µm), with received-voltage curves for chip separations Z = 0, 3, 5, 10 and 15 µm.]

In general, the voltage on the receiving pad in a capacitive-coupled link is given by

$$V_r = \frac{C_{\mathrm{sig}}}{C_{\mathrm{sig}} + C_{\mathrm{pad}} + C_{\mathrm{rx}}} \tag{3.6}$$

where Csig is the signal coupling capacitance, Cpad is the total capacitance seen by the receiving pad (including the signal coupling capacitance), and Crx is the capacitive loading presented by the receiver. Figure 3.14 shows the variation in Vr for different chip separations, as a function of the width of a transistor whose gate terminal is connected to the receiving pad. The standard deviation in the threshold of

such a device, with respect to the nominal, is given by

$$\sigma_{V_{th}} = \frac{A_{V_{th}}}{\sqrt{WL}} \tag{3.7}$$

where W and L are the effective physical width and length of the device, and AVth is a process-specific parameter. For normally-distributed device thresholds, about 1 in 1000 devices has a threshold that deviates from the mean by more than 3σ. Figure 3.14 also shows the 3σ variation as a function of transistor width, for a device with L = 0.1 µm in a process with AVth = 5 mV·µm. The points at which the curves of received voltage and threshold variation intersect indicate the minimum receiver size required at the corresponding chip separation in order to keep the received voltage above the 3σ threshold variation. For example, targeting a receiver fallout of no more than 1 in 1000 at a chip separation of 10 µm requires a minimum transistor width of about 17 µm. Explicit offset compensation can be added to the receiver circuit to reduce the impact of device variation and allow the use of smaller devices. Because power consumption increases with device size, this added circuit complexity may be worthwhile.
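The sizing procedure implied by Equations 3.6 and 3.7 can be sketched numerically. In the Python sketch below, only AVth = 5 mV·µm and L = 0.1 µm come from the text. The transmitted swing, the coupling and parasitic capacitances, and the gate load per micron are placeholder assumptions, and the parasitic term is taken to exclude the coupling capacitance. The result is therefore not expected to reproduce the widths read from Figure 3.14.

```python
# Sketch of the receiver sizing tradeoff. Every capacitance and voltage below
# is an illustrative assumption; only A_VTH and the device length come from
# the text.
import math

A_VTH_MV_UM = 5.0        # process matching parameter (mV*um), from the text
LENGTH_UM = 0.1          # device length (um), from the text
V_TX = 1.8               # transmitted swing (V), assumed
C_SIG_FF = 0.3           # signal coupling capacitance (fF), assumed
C_PAR_FF = 15.0          # pad parasitic excluding coupling (fF), assumed
C_GATE_FF_PER_UM = 1.0   # receiver gate load per um of width, assumed

def received_voltage(width_um):
    """Capacitive-divider output (V) for a receiver of the given width."""
    c_rx = C_GATE_FF_PER_UM * width_um
    return V_TX * C_SIG_FF / (C_SIG_FF + C_PAR_FF + c_rx)

def three_sigma_vth(width_um):
    """3-sigma threshold variation (V) from sigma = A_Vth / sqrt(W * L)."""
    sigma_mv = A_VTH_MV_UM / math.sqrt(width_um * LENGTH_UM)
    return 3.0 * sigma_mv / 1000.0

def min_width(step_um=0.1, max_um=30.0):
    """Smallest swept width whose received voltage covers 3 sigma, or None."""
    n = 1
    while n * step_um <= max_um:
        w = n * step_um
        if received_voltage(w) >= three_sigma_vth(w):
            return w
        n += 1
    return None

w = min_width()
tail = 0.5 * math.erfc(3.0 / math.sqrt(2.0))   # one-sided 3-sigma tail
print(f"minimum width in this toy model: {w:.1f} um")
print(f"one-sided probability beyond 3 sigma: {tail:.5f}")
```

Because the received voltage falls roughly as 1/W for large devices while the threshold variation falls only as 1/sqrt(W), the sweep can fail to find any width at all when the coupling is weak; the function then returns None. The last line evaluates the one-sided Gaussian tail beyond 3σ, which is of the same order as the 1-in-1000 figure quoted above.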

3.4.5 Timing schemes

In addition to data moving from one chip to another, there must also be a means of recovering a timing signal, indicating the validity of the data. This timing signal determines when the comparator samples the received signal on its input pads. Ideally, it should sample the data when it produces a strong and stable signal. Often, one aims to sample the data midway between adjacent data transitions, minimizing the bit-error rate with typical channel characteristics.

There are a number of ways to obtain a suitable timing signal. In the simplest case, both the transmitter chip and receiver chip distribute a low-skew global clock signal with identical frequency and well-controlled phase. In this case, the timing signal for the receiver circuits can be easily derived from this global clock signal. Unfortunately, such clock signals are not always available. Obtaining a timing signal from the data can be divided into two parts: frequency acquisition and phase recovery. For capacitive coupled channels, frequency acquisition is fairly straightforward. A reference clock is typically forwarded along with the data from the transmitter chip to the receiver chip. Although unnecessary when both the transmitting and receiving chips have identical clock frequencies, it provides additional flexibility in the use of the capacitive coupled links. Determining the optimal time to sample the data is more difficult. Although the forwarded clock can be configured to provide the appropriate phase, this phase will drift as the channel and environmental conditions change. As the chips move apart, for example, the received signal voltage falls. This increases the delay through the receivers amplifying the forwarded timing signal, causing a change in the relative phase of the clock

and data channels. These variations can be accommodated by using a control loop to track the phase of the data channels, and adjust the delay of the forwarded clock channel to match.
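One simple form such a control loop could take is a bang-bang loop, sketched below in Python. This is a generic illustration and not the loop used in the chips described here: the early/late decision is idealized as the sign of the phase error, phases are expressed as fractions of a bit period (UI), and the function name, step size and update count are assumptions.

```python
# Sketch of a bang-bang phase-tracking loop. The early/late detector is
# idealized as a comparison of two phases; names and step size are assumed.

def track_phase(data_phase_ui, clock_delay_ui=0.0, step_ui=0.01, updates=200):
    """Nudge the forwarded-clock delay toward the data phase, one step per update."""
    for _ in range(updates):
        if clock_delay_ui < data_phase_ui:
            clock_delay_ui += step_ui     # clock is early: add delay
        else:
            clock_delay_ui -= step_ui     # clock is late: remove delay
    return clock_delay_ui

locked = track_phase(data_phase_ui=0.37)
print(f"residual phase error: {abs(locked - 0.37):.3f} UI")
```

Once locked, a loop of this kind dithers around the target by about one step, so the step size trades tracking speed against added timing jitter.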

3.5 Two-dimensional arrays

Capacitive coupled I/O pads are arranged in two-dimensional arrays. In most applications, there are separate arrays of transmitting and receiving pads for bidirectional communication. Figure 3.15 shows one possible arrangement of these arrays. The transmitting array, on top, is slightly larger than the receiving array because it has an extra border of pads to enable electronic alignment correction. The jagged shape of the arrays is a result of the asymmetric pad arrangement for butterfly differential signaling.

Fig. 3.15 Two-dimensional transmitting and receiving arrays, arranged as modular slices for scalability. [Diagram labels: array width W; array depth H; border pads for electronic alignment correction; transmitting array; timing channels; receiving array; Slice 1 through Slice N.]

The placement and dimensions of the arrays have significant effects on their performance and layout feasibility. In most applications, the arrays are placed near a chip edge to enable sufficient overlap with another chip (Figure 3.16). The shaded area, which recedes from the chip edge by an amount equal to the depth of the arrays plus some additional clearance, must be free of any I/O pads or other features that disturb the smoothness of the overlapping chip surface. This restriction may limit the locations at which the arrays are placed. The need for a keepout area also means that the arrays are often located far away from power I/O pins; surrounding areas must therefore have adequate metal coverage to ensure proper power delivery to the arrays.

Fig. 3.16 Typical locations of transmitting and receiving arrays, near the chip edge in a keepout region. [Diagram labels: Chip 1 (face up); Chip 2 (face down); Tx; Rx; line of symmetry; keepout region.]

For a fixed number of I/O channels in an array, a tradeoff must be made between the width and depth of the arrays to satisfy physical and electrical constraints. The array width W is mainly limited by available area along the chip edge. It may also be constrained by power delivery if power is mainly supplied from the two side edges. The array depth H is mainly limited by the latency of signal propagation up the array. In most applications, data propagates vertically along the depth of the arrays; deep arrays may therefore require additional latches or flip-flops, adding latency and complexity. The array depth may also be constrained by wiring resources, as the width of each column (or pair of columns for differential signaling) must accommodate data wires for all the channels along the column. For modularity and scalability, the arrays are often designed as slices, so that an arbitrary number of such slices can be assembled to satisfy the bandwidth requirements of different applications. Although the number of channels to include within a slice is somewhat arbitrary, a reasonable choice is the word width plus extra channels for control signals (e.g. arbitration, flow control, or parity bits). Each slice may also have separate timing and clock distribution. In source synchronous designs, each slice contains a separate clocking channel. The clocking I/O pads are large compared to data I/O pads, because one timing channel clocks all data channels in the slice; a larger pad size allows a correspondingly larger timing receiver, which

reduces the required fanout. In addition, it is possible to eliminate the electronic alignment correction circuits from a large timing channel, leading to a reduction in timing uncertainty and variation with power supply noise. Additional issues complicate the design of the actual transmitting and receiving arrays. Circuitry for electronic alignment correction imposes significant demands on chip real estate and wiring resources beneath the transmitting pads. As a result, the transmitting array is typically placed close to the edge of the chip, so that data wires from the receiving pads do not need to be routed through it. The receiving array, however, also experiences wire constraints because it has fewer available metal layers; one or two metal layers beneath the top-level I/O pads are typically left empty, or with floating fill metal, in order to minimize parasitic pad loading. These wiring constraints become more stringent as I/O pad sizes scale down, although this is somewhat alleviated by the availability of more metal layers and finer wire pitches in more advanced fabrication processes.

3.6 Measurement results

We designed and tested a chip fabricated in the TSMC 180 nm CMOS process. The chip has four capacitive coupled I/O slices for a total of 72 transmit and 72 receive data channels. All channels can operate simultaneously at up to 1.8 Gbps per channel [16]. This provides a maximum aggregate I/O bandwidth of 260 Gbps on each chip, equivalent to 430 Gbps/mm². On PRBS31 data, the measured bit error rate (BER) is lower than 10⁻¹⁵. This BER was limited by test time, with this measurement representing 3 weeks of operation without a single error. A hand calculation gives an estimated BER of 10⁻²⁶ for this system operating under nominal conditions. The combined energy cost of the transmitter, receiver and amortized clock distribution is 3.0 pJ/bit. In addition, this implementation includes electronic alignment correction capable of correcting up to ±18 microns of planar misalignment and our noise-cancelling layout configuration scheme, which reduces the BER estimate by three orders of magnitude.
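The aggregate bandwidth and the test-time limit on the measured BER can be checked with a few lines of Python. The sketch assumes that the three error-free weeks were logged at 1.8 Gbps on a single channel, which the text does not state, and uses the standard zero-error confidence bound, BER ≤ -ln(1 - confidence) / N for N error-free bits. The function name is hypothetical.

```python
# Sketch: aggregate bandwidth and a zero-error BER bound. Assumes three weeks
# of error-free operation at 1.8 Gbps on one channel; the text does not say
# how many channels the test covered.
import math

CHANNELS_TX = 72
CHANNELS_RX = 72
RATE_BPS = 1.8e9

aggregate_gbps = (CHANNELS_TX + CHANNELS_RX) * RATE_BPS / 1e9

def ber_upper_bound(bits_without_error, confidence=0.95):
    """Largest BER consistent with observing zero errors in the given bit count."""
    return -math.log(1.0 - confidence) / bits_without_error

test_seconds = 3 * 7 * 24 * 3600          # three weeks
bits = RATE_BPS * test_seconds
bound = ber_upper_bound(bits)

print(f"aggregate bandwidth: {aggregate_gbps:.1f} Gbps")
print(f"bits observed: {bits:.2e}, 95% BER bound: {bound:.1e}")
```

Under these assumptions the bound lands just below 10⁻¹⁵, which is consistent with the statement that the reported BER was limited by test time rather than by observed errors.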

3.6.1 Voltage waterfall

Figure 3.17 shows a plot of bit-error rate (BER) versus offset voltage, often referred to as a “waterfall plot” due to its characteristic shape. In this experiment the channels are operating at 1.6 Gbps and the differential inputs to the receiving amplifiers are intentionally biased to different voltages, to introduce a voltage offset. As the amount of offset voltage is varied, the signal-to-noise ratio varies, creating a corresponding change in the BER. For the results shown here, no bit errors were observed until the magnitude of the offset voltage exceeded 200 mV, followed by a rapid increase in BER. This indicates a channel with quite large voltage margins, and very little random noise.

Fig. 3.17 Voltage waterfall curve: variation in BER vs. voltage offset. [Plot: BER (log scale, 1E-15 to 1E+0) versus voltage offset (-250 to 250 mV) at 1.60 Gbps; annotated voltage margin = 207 mV.] © 2007 IEEE

3.6.2 Timing waterfall

Analogous to the voltage waterfall is a timing waterfall plot (Figure 3.18). In this experiment, the relative timing between the edges of the clock channel and the data channel is intentionally skewed, to introduce a timing offset. As the amount of timing offset is increased, the sampling time of the data approaches the time when the data transitions, and the BER increases. For the results shown here, no bit errors were observed until the magnitude of the timing offset exceeded 35% of the bit period, followed by a fairly rapid increase in BER. This indicates a channel with quite large timing margins, and reasonably low timing jitter.

Fig. 3.18 Timing waterfall curve: variation in BER vs. timing offset. [Plot: BER (log scale, 1E-15 to 1E+0) versus timing offset (-0.5 to 0.5 of the bit period) at 1.60 Gbps; annotated timing margin = 0.72 Tbit = 450 ps.] © 2007 IEEE

3.6.3 Combined eye diagram

By varying both voltage and timing offsets and measuring BER at each combination of voltage and timing, a type of “eye” diagram can be created. Figure 3.19 shows a contour of constant BER (1 in 10⁹ bits) as both voltage and timing offsets are varied. The figure provides a visual confirmation that the channel still has substantial margin when operating at 1.8 Gbps.

3.6.4 BER versus chip separation

In this experiment, a pair of chips was mounted on a high-precision six-axis positioning system that provides sub-micron placement resolution. Using this system, BER was measured as the interchip gap was increased (Figure 3.20). As the gap was widened, no errors were observed until the gap reached about 9 microns. At that point the BER degraded quite rapidly, and by 12 microns the channel was not usable

Fig. 3.19 Eye diagram: constant-BER contours of voltage and timing offsets. [Plot: voltage offset VPN (-175 to 175 mV) versus timing offset tm (-1 to 1, axis labeled % bit period) at 1.80 Gbps.] © 2007 IEEE

for most applications. Applying receiver offset compensation can extend this range. All measurements are taken with air as the interchip dielectric; this tolerance can be significantly increased by using interposer materials with higher permittivity.

3.7 Prototype application: a high-radix switch

Capacitive coupled communication offers a number of important advantages over conventional interchip communication technologies that can be leveraged to develop systems that have lower cost, higher reliability, more flexibility or greater capability. In order to get a flavor for these possibilities, we present an application example that highlights some of the system-level advantages gained by architecting a computing system with this technology in mind. This example is a study of high-radix switching networks enabled by extremely high bandwidth interconnects. The huge increase in chip I/O bandwidth made possible by capacitive coupling allows the system designer to completely re-architect large-scale switching systems.

Fig. 3.20 Measured BER as a function of interchip separation. [Plot: BER (log scale, 1E-15 to 1E+00) versus chip separation (8 to 15 µm) at 1.80 Gbps and 1.50 Gbps.] © 2007 IEEE

Today, single-chip high-bandwidth Ethernet and InfiniBand switches are limited to about 36 ports, due to the costs of interchip I/O bandwidth. If a larger switch is required, designers resort to hierarchical, multistage topologies. There is significant cost and complexity associated with building these multistage networks. Furthermore, their performance is inferior to that of a single-stage network. For example, a multistage network can suffer from saturation under non-uniform, real-world traffic. Thanks to the large amount of chip I/O bandwidth offered by capacitive coupled communication, it is possible to extend a single-stage switch architecture to a larger scale. Multi-chip switch fabrics can be implemented with a simple crossbar architecture that was previously only applicable to single-chip switch implementations. This simple architecture is made possible because PxC offers enough bandwidth that a large switch can be partitioned in such a way that the full bisection bandwidth can be exposed at the chip boundaries. A large-scale crossbar switch can be implemented in an MCM, where chips are interconnected by capacitive coupled I/O links. The MCM may contain a one-dimensional vector of chips or a two-dimensional array of chips. Although a vector is easy to implement and package, a matrix design affords greater flexibility and enables larger scale systems. It is easy to map such a crossbar onto a vector MCM. Figure 3.21 shows such a system, with each crossbar slice mapped onto an Island chip and the bus segments stitched together by capacitive coupled links by way of the Bridge chips. The

sliced design shown in Figure 3.21 suggests an implementation through a linear array of chips; however, it is equally straightforward to map these designs onto chips arranged in a two-dimensional matrix.

Fig. 3.21 Architecture of an output-buffered crosspoint switch using capacitive coupled communication. [Diagram: Island chips joined by Bridge chips, with Input Ports 1 to 4 and Output Ports 1 to 4.] © 2008 IEEE

In order to demonstrate the viability of capacitive coupled I/O, we implemented a small-scale vector switch prototype [17]. This prototype extends previous demonstrations of capacitive coupled communication in several ways: 1. it uses larger chips that are more representative of high-performance systems applications and that are more challenging from a mechanical and thermal perspective, 2. it uses a larger chip assembly that consists of four Island chips and two Bridge chips, 3. it uses a prototype face-to-face chip package to align the chips, and 4. it demonstrates an actual system application in the form of a fully-functional switch. This prototype implements an Ethernet switch with four 10 Gbps ports. The internal architecture is a fully-buffered 4x4 crossbar that is vertically sliced such that each slice corresponds to a single Island chip and implements four crosspoints, one input port and one output port (Figure 3.21). Each Island chip connects to the PCB through two 16-bit wide LVDS data links (Figure 3.22). There are three such pairs of 16-bit wide capacitive coupled I/O interfaces on the left and right sides of the chip to connect to the neighboring Island chips, through face-down Bridge chips.


All links run at a data rate of 1 Gbps, corresponding to a 500 MHz DDR clock rate. Because the internal datapaths run only 1/4th as fast, there are deserializers and serializers interfacing the external I/O to the switching core. Because data transmission over these capacitive coupled communication links is assumed to be DC-balanced, packets are 8B10B-encoded when they enter an Island chip. The resulting 25% encoding overhead is accommodated by running the I/O links with a raw bandwidth of 16 Gbps. The switch core of each Island chip is mainly made up of buffer memories.
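The link budget in this paragraph can be checked with a few lines of arithmetic. The sketch below uses only the figures quoted in the text (16 lanes, 1 Gbps per lane, 10 Gbps ports, 8B10B coding); the function and constant names are illustrative and not part of the prototype.

```python
# Link budget for one 16-bit interface, using the figures quoted in the text.
LANES = 16               # 16-bit wide link
LANE_RATE_GBPS = 1.0     # 1 Gbps per lane (500 MHz DDR)
PORT_RATE_GBPS = 10.0    # payload rate of one Ethernet port


def raw_bandwidth_gbps(lanes, lane_rate_gbps):
    """Raw (encoded) bandwidth of a parallel link."""
    return lanes * lane_rate_gbps


def encoded_rate_gbps(payload_gbps, data_bits=8, code_bits=10):
    """Line rate needed to carry a payload through a block code."""
    return payload_gbps * code_bits / data_bits


raw = raw_bandwidth_gbps(LANES, LANE_RATE_GBPS)
needed = encoded_rate_gbps(PORT_RATE_GBPS)
overhead = 10 / 8 - 1    # extra bits per payload bit for 8B10B
print(f"raw = {raw} Gbps, needed after 8B10B = {needed} Gbps, overhead = {overhead:.0%}")
```

Under these numbers the encoded 10 Gbps stream needs 12.5 Gbps, which fits within the 16 Gbps raw link.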

[Figure 3.22 labels: Island chip, Bridge chip]

Fig. 3.22 Prototype system of a crosspoint buffered switch in a packaged one-dimensional vector array. Note that two bridge chips are physically implemented as one long monolithic chip, for ease of packaging.

A flat switch offers many advantages including low and uniform latency, resistance to saturation, increased scalability, reduced chip count and cost, reduced power consumption, and higher reliability. Because the minimum forwarding delay in a multistage network is typically proportional to the number of stages, latency for a single-stage switch can be much lower than for a multistage switch. Furthermore, many multistage networks are susceptible to traffic imbalance that causes a further increase in latency. In contrast, it is much easier to guarantee that a flat switch operates in such a way that a congested output does not slow or stop traffic flow to other uncongested outputs.

Often, it is beneficial for an architecture to span a wide range of switch sizes. A sliced crossbar gives much flexibility as it allows for building switches with any number of slices up to a maximum given by the bisection bandwidth of the system. Finally, a flat switch requires fewer switch chips than a multistage network with the same number of ports. For example, a 3-stage 288-port switch requires 36 switch chips whereas a similar single-stage switch requires only 12 switch chips (each implementing 24 ports). Reducing component count not only reduces cost but also reduces power consumption and increases reliability.

Capacitive coupled I/O technology changes the way we build large-scale systems based on MCMs. By offering many times more chip-to-chip bandwidth than conventional I/O technologies, capacitive coupled communication allows the designer to rethink how systems are partitioned.
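The chip counts in the 288-port example can be reproduced with a short calculation. The text does not name the multistage topology; the sketch below assumes a folded three-stage Clos built from 24-port chips, in which each leaf chip sends half of its ports upward. That assumption and the function names are illustrative only.

```python
def folded_clos_chips(radix):
    """(external ports, chip count) for a folded 3-stage Clos of radix-`radix` chips.

    Assumed topology: `radix` leaf chips, each with radix/2 external ports and
    radix/2 uplinks, plus radix/2 spine chips that each reach every leaf once.
    """
    leaves = radix
    spines = radix // 2
    ports = leaves * (radix // 2)
    return ports, leaves + spines


def flat_chips(ports, ports_per_chip):
    """Chips in a sliced single-stage switch (ceiling division)."""
    return -(-ports // ports_per_chip)


ports, multistage = folded_clos_chips(24)
print(ports, multistage, flat_chips(ports, 24))
```

With 24-port chips this gives 288 ports from 36 chips in the multistage case and 12 chips in the flat case, matching the figures in the text.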


References

1. R.J. Drost, R.D. Hopkins, R. Ho, I.E. Sutherland, “Proximity communication,” IEEE Journal of Solid-State Circuits, vol. 39, no. 9, 2004, pp. 1529–1535.
2. A. Chow, D. Hopkins, R. Drost, R. Ho, “Exploiting capacitance in high-performance computer systems,” 4th Annual IEEE International Symposium on VLSI Design, Automation, and Test, 2008, pp. 55–58.
3. A.V. Krishnamoorthy, R. Ho, X. Zheng, H. Schwetman, J. Lexau, P. Koka, G. Li, I. Shubin, J.E. Cunningham, “Computer systems based on silicon photonic interconnects,” Proceedings of the IEEE, vol. 97, no. 7, 2009.
4. J. Cunningham, X. Zheng, I. Shubin, R. Ho, J. Lexau, A.V. Krishnamoorthy, M. Asghari, D. Feng, J. Luff, H. Liang, C. Kung, “Optical proximity communication in packaged SiPhotonics,” 5th IEEE International Conference on Group IV Photonics, 2008.
5. X. Zheng, P. Koka, H. Schwetman, J. Lexau, R. Ho, I. Shubin, J. Cunningham, A.V. Krishnamoorthy, “A Silicon photonic WDM network for high performance macrochip communications,” Proceedings, SPIE Photonics West, Vol. 7221: Photonics packaging, integration, and interconnects IX, 2009.
6. N. Miura, Y. Kohama, Y. Sugimori, H. Ishikuro, T. Sakurai, T. Kuroda, “A high-speed inductive-coupling link with burst transmission,” IEEE Journal of Solid-State Circuits, vol. 44, no. 3, 2009, pp. 947–955.
7. N. Miura, H. Ishikuro, T. Sakurai, T. Kuroda, “A 0.14 pJ/b inductive-coupling inter-chip data transceiver with digitally-controlled precise pulse shaping,” Digest of Technical Papers, IEEE International Solid-State Circuits Conference, 2007, pp. 358–359.
8. N. Miura, D. Mizoguchi, M. Inoue, K. Niitsu, Y. Nakagawa, M. Tago, M. Fukaishi, T. Sakurai, T. Kuroda, “A 1 Tb/s 3W inductive-coupling transceiver for inter-chip clock and data link,” Digest of Technical Papers, IEEE International Solid-State Circuits Conference, 2006, pp. 424–425.
9. N. Miura, D. Mizoguchi, M. Inoue, H. Tsuji, T. Sakurai, T. Kuroda, “A 195 Gb/s 1.2 W 3D-stacked inductive inter-chip wireless superconnect with transmit power control scheme,” Digest of Technical Papers, IEEE International Solid-State Circuits Conference, 2005, pp. 264–265.
10. J. Kim, B.S. Leibowitz, J. Ren, C.J. Madden, “Simulation and analysis of random decision errors in clocked comparators,” IEEE Transactions on Circuits and Systems I, in press.
11. P. Nuzzo, F. De Bernardinis, P. Terreni, G. Van der Plas, “Noise analysis of regenerative comparators for reconfigurable ADC architectures,” IEEE Transactions on Circuits and Systems I, vol. 55, no. 6, 2008, pp. 1441–1454.
12. R. Drost, R. Ho, R. Hopkins, I. Sutherland, “Electronic alignment for proximity communication,” Digest of Technical Papers, IEEE International Solid-State Circuits Conference, 2004, pp. 144–518.
13. A. Chow, R. Hopkins, R. Ho, R. Drost, “Measuring 6D chip alignment in multi-chip packages,” 6th Annual IEEE Conference on Sensors, 2007, pp. 1307–10.
14. A.X. Widmer, P.A. Franaszek, “A DC-balanced, partitioned-block, 8B/10B transmission code,” IBM Journal of Research and Development, vol. 27, no. 5, 1983, pp. 440–452.
15. R. Walker, R. Dugan, “64b/66b low-overhead coding proposal for serial links,” IEEE 802.3 HSSG 10G Study proposal, January 12, 2000.
16. D. Hopkins, A. Chow, R. Bosnyak, B. Coates, J. Ebergen, S. Fairbanks, J. Gainsley, R. Ho, J. Lexau, F. Liu, T. Ono, J. Schauer, I. Sutherland, R. Drost, “Circuit techniques to enable 430 Gb/s/mm/mm proximity communication,” Digest of Technical Papers, IEEE International Solid-State Circuits Conference, 2007, pp. 368–369.
17. H. Eberle, P.J. Garcia, J. Flich, J. Duato, R. Drost, N. Gura, D. Hopkins, W. Olesinski, “High-radix crossbar switches enabled by proximity communication,” Proceedings of the 2008 ACM/IEEE Conference on Supercomputing, 2008.

Chapter 4

Inductive Coupled Communications Noriyuki Miura, Takayasu Sakurai, and Tadahiro Kuroda

4.1 Introduction

Inductive coupled communication is a wireless communication technology for three-dimensionally (3D) stacked chips in a package. As discussed in a previous chapter, capacitive coupled communication (see Figure 4.1) utilizes a pair of metal electrodes which forms a capacitive-coupling channel (essentially a capacitor) as a vertical wireless data link between stacked chips. In inductive coupled communication, a pair of metal coils creates an inductive-coupling channel (essentially a transformer) between stacked chips. Both of these are pure digital circuit solutions compatible with a standard CMOS technology. The metal electrodes and/or the metal coils can be fabricated by using IC interconnections. No additional wafer or mechanical processes are required, and hence they are inexpensive. In addition, since the capacitive- and the inductive-coupling channels can create an inter-chip link without any physical or mechanical contact, electro-static-discharge (ESD) protection devices are not needed, enabling the inter-chip link to be high-speed, low-power, and small-area. Moreover, since these two channels are AC-coupling channels, they can communicate between chips operating under different supply voltages without level-shifters.

As described previously, these two wireless communication technologies have many potential advantages over wired mechanical solutions such as micro bumps and through-Si vias (TSVs). However, electromagnetic and circuit co-optimization are necessary in order to deliver high-performance and highly reliable operation.

This chapter deals with inductive coupled communications. First, basic characteristics of the inductive-coupling channel are explained with comparison to the capacitive-coupling channel. Next, channel and transceiver circuit co-design is described. Several circuit techniques for performance enhancement are introduced and evaluated by test-chip measurements. Finally, example applications of the inductive coupled communications are discussed and prototype demonstrations are presented.

Professor Noriyuki Miura, Department of Electrical Engineering, Keio University, 3-14-1 Hiyoshi, Kohoku-ku, Yokohama 223-8522, JAPAN
Professor Takayasu Sakurai, Institute of Industrial Science, University of Tokyo, 4-6-1 Komaba, Meguro-ku, Tokyo 153-8505, JAPAN
Professor Tadahiro Kuroda, Department of Electrical Engineering, Keio University, 3-14-1 Hiyoshi, Kohoku-ku, Yokohama 223-8522, JAPAN

R. Ho and R. Drost (eds.), Coupled Data Communication Techniques for High-Performance and Low-Power Computing, Integrated Circuits and Systems, DOI 10.1007/978-1-4419-6588-2_4, © Springer Science+Business Media, LLC 2010

[Figure 4.1 labels: metal electrode (capacitive coupled communications technology), metal coil (inductive coupled communications technology). © 2008 IEEE]

Fig. 4.1 Capacitive (left) and inductive (right) coupled communications technologies.

4.2 Inductive-coupling channel

This section covers channel characteristics, coupling range through silicon, and crosstalk issues.

4.2.1 Overview of channel characteristics

Figure 4.2 illustrates an inductive-coupling channel model. A transmitter (Tx) coil is driven by transmit current IT. According to changes in IT, a magnetic field H is generated and a received voltage VR is induced in a receiver (Rx) coil. As shown in the equivalent circuit of the channel in Figure 4.2, the Tx and the Rx coils are modeled as a parallel resonator where L, C, and R represent the self-inductance, parasitic capacitance, and parasitic resistance of the coil respectively. The magnetic coupling between the coils is given by the mutual inductance M. Based on this equivalent circuit, the transfer function of the inductive-coupling channel is given by

$$V_R = \frac{1}{(1 - \omega^2 L_R C_R) + j\omega C_R R_R} \cdot j\omega M \cdot \frac{1}{(1 - \omega^2 L_T C_T) + j\omega C_T R_T} \cdot I_T \tag{4.1}$$

Equation 4.1 can be expressed as

$$V_R = B_R(\omega) \cdot j\omega M \cdot B_T(\omega) \cdot I_T \tag{4.2}$$

$$B(\omega) = \frac{1}{(1 - \omega^2 LC) + j\omega CR} \tag{4.3}$$

The first term BR(ω) and the third term BT(ω) in Equation 4.2 denote bandwidth limitations due to parasitic C and R. In an ideal inductive-coupling channel without any parasitics (C=0, R=0),

$$V_R = j\omega M \cdot I_T = M \frac{dI_T}{dt} \tag{4.4}$$

[Figure 4.2: left panel "Channel Model" (Tx and Rx coils, each with L, C, and R, coupled by M and the magnetic field H); right panel "Frequency Characteristics" (gain |B(ω)| and trans-impedance |VR/IT| [Ω] versus frequency [Hz], with ωM and fSR marked)]

Fig. 4.2 Channel model (left) and frequency characteristics of inductive coupling (right).

As Equation 4.4 indicates, the ideal inductive-coupling channel functions as a first-order differentiator jωM, and the frequency characteristics are proportional to the frequency as shown in Figure 4.2. In an actual inductive-coupling channel, the above-mentioned bandwidth limitation of the resonator B(ω) is multiplied in on both the transmitter and the receiver side. As shown in Figure 4.2, B(ω) behaves as a second-order low-pass filter with peaking at the self-resonant frequency of the coil fSR,

$$f_{SR} = \frac{1}{2\pi\sqrt{LC}} \tag{4.5}$$
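Equations 4.1 to 4.5 are easy to evaluate numerically. The following is a minimal sketch, assuming identical Tx and Rx coils; the component values (5 nH, 50 fF, 100 Ω, M = 1 nH) are hypothetical and chosen only to exercise the formulas, not taken from the chapter.

```python
import math


def resonator(omega, L, C, R):
    """Bandwidth-limiting term B(omega) of Equation 4.3 for one coil."""
    return 1.0 / complex(1.0 - omega**2 * L * C, omega * C * R)


def transimpedance(f_hz, M, tx, rx):
    """V_R / I_T of Equation 4.1; tx and rx are (L, C, R) tuples."""
    omega = 2.0 * math.pi * f_hz
    return resonator(omega, *rx) * 1j * omega * M * resonator(omega, *tx)


def self_resonant_frequency(L, C):
    """f_SR of Equation 4.5."""
    return 1.0 / (2.0 * math.pi * math.sqrt(L * C))


# Hypothetical coil parameters (illustrative values only).
COIL = (5e-9, 50e-15, 100.0)
M = 1e-9
f_sr = self_resonant_frequency(COIL[0], COIL[1])
for f in (f_sr / 100, f_sr / 10, f_sr, 10 * f_sr):
    z = abs(transimpedance(f, M, COIL, COIL))
    print(f"f = {f:.3e} Hz  |VR/IT| = {z:.3e} ohm")
```

Well below the self-resonant frequency the magnitude tracks ωM, as the ideal differentiator of Equation 4.4 predicts, and above it the response falls off.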


Overall, the inductive-coupling channel behaves as a band-pass filter with a peak at around fSR. It can be seen in Figure 4.2 that the channel operates as a differentiator at frequencies below fSR. That means the bandwidth of the inductive-coupling channel is determined by fSR. In this frequency range, the received voltage VR is approximately given by Equation 4.4 and therefore the VR amplitude is proportional to the trans-impedance ωM. Here, ωM is rewritten as

$$\omega M = \omega k \sqrt{L_T L_R} \tag{4.6}$$

where k is a coupling coefficient between the transmitter and the receiver coils. In order to increase the operating frequency ω, the channel bandwidth and thus fSR has to be increased, which requires the self-inductance L to be reduced (see Equation 4.5). As a result, $\omega\sqrt{L_T L_R}$ remains constant in most cases. Finally, the coupling coefficient k determines the trans-impedance and hence the VR amplitude. k is a parameter defined by the ratio between the amount of transmitted and received magnetic flux. It is approximately given by the coil diameter D and the communication distance between the coils X as

$$k = \left\{ \frac{0.25}{(X/D)^2 + 0.25} \right\}^{1.5} \tag{4.7}$$

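[Figure 4.3: log-log plot of coupling coefficient k (10^-3 to 1) versus X/D (0.1 to 5), with the saturation, linear, square, and cubic regions marked; inset shows two coils of diameter D separated by distance X, with IT and VR]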

Fig. 4.3 Calculated coupling coefficient depending on communication distance and coil diameter.


Figure 4.3 plots k calculated by Equation 4.7 as a function of X/D. It can be seen in Figure 4.3 that the X/D dependency of k can be classified into four different regions according to the value of X/D:

1. Saturation region: X/D < 1/5
2. Linear region: 1/5 < X/D < 1/3
3. Square region: 1/3 < X/D < 1
4. Cubic region: 1 < X/D

Many early prototypes of inductive-coupling transceivers were reported in the square region [1, 2, 3, 4, 5, 6]. Since the received signal is strongly attenuated by the square of the communication distance variation ΔX, a data recovery scheme with high noise immunity is required, such as a synchronous data recovery scheme where the received voltage is sampled by a synchronous clock. In this approach, the receiver is not exposed to noise except at the sampling moment. As a result, signal-to-noise ratio (SNR) can be improved, enabling highly reliable data communications in the square region. Details of the synchronous transceiver will be described in Section 4.3.

In the linear region, the signal attenuation is mitigated so that asynchronous data recovery schemes can be used. An asynchronous inductive-coupling transceiver [7, 8] achieved high-speed operation by eliminating the complicated timing controller used in the synchronous schemes. Details of the asynchronous transceiver will be discussed in Section 4.5.

In the cubic region, k is significantly degraded by the cube of X/D; however, a receiver with multiple amplification stages can communicate even in such an adverse region [9].

In the saturation region, k asymptotically approaches one and the coupling efficiency becomes very high. The channel in this region is mainly used for wireless power delivery such as in [10].
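The coupling coefficient and the four operating regions can be sketched directly from Equation 4.7 and the X/D boundaries listed above. The function names are illustrative, and assigning a value that falls exactly on a boundary to the higher region is an arbitrary choice made here, since the text uses strict inequalities.

```python
def coupling_coefficient(x_over_d):
    """k of Equation 4.7 as a function of X/D."""
    return (0.25 / (x_over_d**2 + 0.25)) ** 1.5


def region(x_over_d):
    """Operating region from the X/D boundaries given in the text."""
    if x_over_d < 1 / 5:
        return "saturation"
    if x_over_d < 1 / 3:
        return "linear"
    if x_over_d < 1:
        return "square"
    return "cubic"


for ratio in (0.1, 0.25, 0.5, 2.0):
    print(f"X/D = {ratio}: k = {coupling_coefficient(ratio):.4f} ({region(ratio)})")
```

For large X/D the formula falls off as the cube of the distance, so doubling X/D from 10 to 20 reduces k by a factor close to 8.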

4.2.2 Range extendability

The inductive-coupling channel can extend the communication distance by simply increasing the coil diameter. As described in the previous section, the coupling gain of the inductive-coupling channel is governed by the coupling coefficient k. Equation 4.7 shows that k depends only on the ratio between the distance and the diameter of the coils, X/D. Consequently, X/D determines the coupling gain of the inductive-coupling channel. Therefore, even if the communication distance is extended, the inductive-coupling channel can keep the coupling gain constant by increasing the coil diameter linearly.

[Figure 4.4: transmitter electrode (VT) and receiver electrode (VR) of area S separated by distance X with electric field E; receiver electrode at distance XSUB from the substrate; equivalent circuit with CC and CSUB]

Fig. 4.4 Simplified channel model of capacitive coupling.

For the capacitive-coupling channel, on the other hand, it is difficult to extend the communication distance. Figure 4.4 depicts a simplified model of a capacitive-coupling channel. A transmitter electrode is driven by a transmit voltage VT. According to changes in VT, an electric field E is generated and then a received voltage VR is induced in a receiver electrode. The received voltage VR is approximately given as:

$$V_R = \frac{C_C}{C_C + C_{SUB}} V_T \tag{4.8}$$

where CC is the capacitance between the electrodes and CSUB is that between the receiver electrode and the substrate. Modeling CC and CSUB as simple parallel-plate capacitors, we have

$$V_R = \frac{X_{SUB}}{X_{SUB} + X} V_T \tag{4.9}$$

where X is the distance between the electrodes and XSUB is that between the receiver electrode and the substrate. As Equation 4.9 indicates, VR is reduced for long-distance communication since CC decreases with increasing X and VT is limited to the supply voltage VDD. Even if the electrode size is enlarged, VR hardly increases because both CC and CSUB increase in a similar way. As a result, the communication distance of the capacitive-coupling channel is limited.
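The contrast between the two channels can be illustrated with Equations 4.7 and 4.9. In the sketch below, distances are in micrometres; the 15 µm and 30 µm values echo the test chip described later in the chapter, while the 5 µm electrode-to-substrate spacing is a hypothetical value used only for illustration.

```python
def inductive_coupling(x, d):
    """Coupling coefficient k of Equation 4.7 for distance x and coil diameter d."""
    return (0.25 / ((x / d) ** 2 + 0.25)) ** 1.5


def capacitive_gain(x, x_sub):
    """V_R / V_T of Equation 4.9 (parallel-plate model; electrode area cancels out)."""
    return x_sub / (x_sub + x)


X_SUB_UM = 5.0  # hypothetical electrode-to-substrate spacing

# Doubling both distance and coil diameter leaves the inductive coupling unchanged.
k_near = inductive_coupling(15.0, 30.0)
k_far = inductive_coupling(30.0, 60.0)

# Doubling the distance always lowers the capacitive gain, whatever the electrode area.
g_near = capacitive_gain(15.0, X_SUB_UM)
g_far = capacitive_gain(30.0, X_SUB_UM)
print(k_near, k_far, g_near, g_far)
```

Because the electrode area does not appear in Equation 4.9, there is no layout parameter in this model that restores the capacitive gain at the longer distance.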

4.2.3 Coupling strength through Si substrate

The other advantage of inductive coupling is its coupling strength through a Si substrate. Figure 4.5 plots simulated S21 parameters of the capacitive- and inductive-coupling channels through the substrate. When the substrate resistivity is reduced to between 1 and 0.1 Ω·cm (the typical resistivity of p+ Si), the electric field of capacitive coupling is significantly attenuated in the substrate, causing a rapid decrease in the S21 parameter. As a result, for the capacitive-coupling channel, it is difficult to communicate through the Si substrate. Capacitive coupling is therefore used only for data links in face-to-face chip stacks.

On the other hand, inductive coupling utilizes magnetic fields for signal transmission. Even if a p+ Si substrate is inserted between the coils, the magnetic field is only minimally attenuated by eddy currents in the Si substrate. Since the S21 parameter is only degraded by several percent through the substrate, the inductive-coupling channel can be applied to not only face-to-face but also face-up, face-down, and even back-to-back chip stacks. This stacking variety provides better flexibility to chip designers in 3D integration. In addition, conventional and hence inexpensive packaging technologies can be used for power delivery to the stacked chips. For example, in a back-to-back chip stack with inductive-coupling communications [11], a processor chip is mounted face down on a package using C4 bumps and an SRAM chip is glued on it face up, with power provided by conventional wire-bonding. The inductive-coupling channel in it communicates through the substrates of both the processor and the SRAM chips. Further details will be introduced in Section 4.8.

[Figure 4.5: normalized S21 at 10 GHz versus substrate resistivity ρ [Ω·cm] (10^-6 to infinity) for inductive coupling (magnetic field H, eddy current IEDDY) and capacitive coupling (electric field E) through Si, with 30 µm dimension labels and the p+ Si resistivity range marked]

Fig. 4.5 Simulated S21 parameters of inductive and capacitive coupling through substrate.

4.2.4 Crosstalk

Area-array distribution of either capacitive- or inductive-coupling channels increases data bandwidth. However, since these two technologies employ wireless communications, crosstalk between neighboring channels may degrade performance. Compared to the capacitive-coupling channel, the crosstalk of the inductive-coupling channel is stronger. In capacitive coupling, since the electric field is confined inside the capacitor, the crosstalk is essentially small. Only the crosstalk from the immediately adjacent channels needs to be considered, and it can be easily reduced by a ground shield structure [12]. Therefore, crosstalk is not a serious issue in capacitive-coupling communications.

On the other hand, in inductive coupling, the magnetic field of the coil easily extends to adjacent coils. Figure 4.6 shows the inductive-coupling crosstalk calculated by a theoretical model based on the Biot-Savart law [13]. When the horizontal distance between the coils increases over twice the coil diameter (Y > 2D), the crosstalk rapidly decreases as $1/Y^3$. As a result, the crosstalk from the channels of Y ≥ 3D is negligibly small. However, the crosstalk from the channels of Y ≤ 2D cannot be ignored. Unfortunately, the inductive-coupling crosstalk cannot be reduced by the ground shield structure. The channel pitch has to be increased to suppress the crosstalk in the channel array. Figure 4.6 plots aggregated crosstalk in the channel array as a function of the channel pitch P. In order to suppress the crosstalk sufficiently (to about -20 dB), the channel pitch has to be increased to between 2D and 3D. For high-density channel arrangements, crosstalk reduction techniques are required. Circuit solutions based on time [3, 4] and space division multiplexing [14] have been presented. Further details will be described in Section 4.6.

[Figure 4.6: left panel "Crosstalk from Single Channel", crosstalk-to-signal ratio [dB] versus normalized horizontal distance Y/D, for D = 3X; right panel "Aggregated Crosstalk from Channel Array", crosstalk-to-signal ratio [dB] versus normalized channel pitch P/D, for D = 3X]

Fig. 4.6 Calculated crosstalk from a single channel (left) and from multiple channels in array (right).
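The inverse-cube falloff quoted for Y > 2D translates into a simple rule of thumb in decibels. The sketch below only captures that far-field scaling for a single aggressor coil; it is not the Biot-Savart model of [13], and the function name is illustrative.

```python
import math


def far_field_crosstalk_change_db(y_from, y_to):
    """Change in crosstalk amplitude (dB) when one far aggressor coil moves
    from horizontal distance y_from to y_to, assuming a 1/Y^3 falloff."""
    return 20.0 * math.log10((y_from / y_to) ** 3)


# Distances in units of coil diameter D; the 1/Y^3 law is quoted only for Y > 2D.
print(f"2D -> 3D: {far_field_crosstalk_change_db(2.0, 3.0):.1f} dB")
print(f"2D -> 4D: {far_field_crosstalk_change_db(2.0, 4.0):.1f} dB")
```

Under this assumption each doubling of the distance lowers the contribution of that coil by about 18 dB.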

4.3 Inductive-coupling transceiver

In this section, a basic design theory of the inductive-coupling transceiver is studied. A prototype synchronous inductive-coupling transceiver is taken as a design example [3, 4]. First, a signaling scheme is discussed and the characteristics of transmitted and received signals are analyzed. Next, coil layout design is explained, together with a design guideline for channel characteristic optimization. Then, transceiver circuit design is described. Finally, an inductive-coupling transceiver designed based on this theory is evaluated in inter-chip communication.

4.3.1 Signaling

Signaling schemes utilized in wireless communications are classified into carrier modulation and pulse modulation. For long-distance wireless communications (e.g. mobile phone and wireless LAN), carrier modulation is employed. In carrier modulation, the signal spectrum can be concentrated around the carrier frequency, so out-of-band noise can be filtered out to improve the signal-to-noise ratio (SNR). This enables highly reliable wireless communication even if the communication distance is long and the channel is lossy. However, carrier modulation requires complicated analog circuits, such as a voltage-controlled oscillator, low-noise amplifier, mixer, or filter, which results in high power and large area consumption in a transceiver.

Pulse modulation, on the other hand, spreads the signal spectrum over a wide frequency band, making noise filtering more difficult. As a result, SNR is significantly degraded in a lossy channel. Therefore, it is hard to apply pulse modulation to long-distance wireless communications. However, in an inductive-coupling channel, the coils are coupled in close proximity, providing a low-loss wireless channel. This guarantees high SNR even if a wide frequency band is used. Therefore, pulse modulation can be utilized in inductive-coupling communications.

[Figure 4.7: transmitter (pulse generator driven by Txclk and Txdata, producing IT) and receiver (sampling VR with Rxclk to produce Rxdata), with waveforms of Txclk, Txdata, IT, VR, Rxclk, and Rxdata versus time]

Fig. 4.7 Bi-phase modulation (BPM).


Pulse-modulated signals can be generated by simple digital circuits using digital clock and data. Complicated analog circuits are not needed, enabling a transceiver to be low-power and small-area. Bi-Phase Modulation (BPM) can be employed for the data link (see Figure 4.7) [3, 4]. At the rising edge of the transmitter clock Txclk, a transmitter produces positive or negative pulse current IT , according to Txdata. A positive pulse is generated when Txdata is High and a negative pulse is generated when Txdata is Low. The IT signal induces a positive or negative pulse-shaped voltage VR in the receiver coil. The receiver directly samples VR by the receiver clock Rxclk, and recovers digital data Rxdata.

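[Figure 4.8: IT(t) and VR(t) = M dIT(t)/dt versus time (left); |IT(ω)| and |VR(ω)| = ωM|IT(ω)| versus frequency (right). Annotations: pulse amplitude IP, pulse width τ, slew rate SP = IP/τ, ETX ~ τ·IP·VDD, VP ~ 1.7·M·IP/τ, VR pulse width τ/2, fP ~ 0.45/τ, 2fP ~ 0.9/τ. © 2007 IEEE]

Fig. 4.8 Characteristics of transmitted and received BPM pulse in time domain (left) and frequency domain (right).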

For inductive-coupling channel design, it is necessary to understand the time and frequency characteristics of transmitted and received pulse signals. The input transmitted current IT can be modeled as a Gaussian pulse (see Figure 4.8),

$$I_T(t) = I_P \exp\left(-\frac{4t^2}{\tau^2}\right) \tag{4.10}$$

where IP is the pulse amplitude and τ is the pulse width. Through the inductive-coupling channel, the received voltage VR is given as the time derivative of IT:

$$V_R(t) = M \frac{dI_T(t)}{dt} = -M I_P \frac{8t}{\tau^2} \exp\left(-\frac{4t^2}{\tau^2}\right) \tag{4.11}$$

As mentioned previously, VR becomes a Gaussian monocycle double pulse (see Figure 4.8). The receiver samples the former or latter half of the double pulse to detect the polarity of the transmitted pulse and hence the transmitted data. The pulse width of VR is given by τ/2. Therefore, τ finally determines the receiver's sampling timing margin. The received pulse amplitude VP is obtained from Equation 4.11 as

$$V_P = 2\sqrt{\frac{2}{e}}\, M \frac{I_P}{\tau} \approx 1.7 M \frac{I_P}{\tau} = 1.7 M S_P \tag{4.12}$$

so VP is determined by the slew rate SP. The right side of Figure 4.8 depicts the frequency spectrum of the transmitted and the received signals, IT(ω) and VR(ω). IT(ω) is given by a Gaussian distribution. The derivative property of the channel jωM removes low-frequency components of IT(ω). As a result, VR(ω) becomes a convex distribution with a peak frequency of fP. In order to deliver the VR pulse signal without distortion, the inductive-coupling channel requires a frequency bandwidth of 2fP, which is given by

$$2 f_P = \frac{2\sqrt{2}}{\pi\tau} \approx \frac{0.9}{\tau} \tag{4.13}$$

The pulse width τ determines the required channel bandwidth. The inductive-coupling channel is designed to maximize the mutual inductance M while keeping the bandwidth (self-resonant frequency fSR) over 2fP.
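The closed-form results in Equations 4.12 and 4.13 can be cross-checked numerically against the pulse model of Equations 4.10 and 4.11. In this sketch τ = 180 ps is the value used for the test chip later in the chapter, while M = 1 nH and IP = 5 mA are hypothetical values chosen only for illustration.

```python
import math


def rx_voltage(t, m, i_p, tau):
    """Received voltage of Equation 4.11."""
    return -m * i_p * (8.0 * t / tau**2) * math.exp(-4.0 * t**2 / tau**2)


def peak_voltage(m, i_p, tau):
    """Closed-form pulse amplitude V_P of Equation 4.12."""
    return 2.0 * math.sqrt(2.0 / math.e) * m * i_p / tau


def required_bandwidth(tau):
    """Required channel bandwidth 2*f_P of Equation 4.13."""
    return 2.0 * math.sqrt(2.0) / (math.pi * tau)


TAU = 180e-12   # transmit pulse width from the test chip
M = 1e-9        # hypothetical mutual inductance
I_P = 5e-3      # hypothetical pulse amplitude

# Brute-force search for the peak of V_R over the interval [-tau, +tau].
samples = [rx_voltage(-TAU + 2.0 * TAU * i / 20000, M, I_P, TAU) for i in range(20001)]
v_peak_numeric = max(samples)
print(f"V_P numeric = {v_peak_numeric:.4e} V, closed form = {peak_voltage(M, I_P, TAU):.4e} V")
print(f"2fP = {required_bandwidth(TAU) / 1e9:.2f} GHz")
```

The numerically located peak agrees with the closed form, and the required bandwidth for a 180 ps pulse comes out at about 5 GHz.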

4.3.2 Coil design

Characteristics of an inductive-coupling channel are determined by the communication distance X and the coil layout. Figure 4.9 illustrates an example of the coil layout. It can be defined by four layout parameters (diameter D, turns n, line width w, and line space s). There is a complex relationship between the layout parameters and the circuit parameters k, L, C, and R which finally decide the channel characteristics [15]. Here we describe a basic design guideline based on a first-order approximation of the relationship.

[Figure 4.9 labels: diameter D, turns n, width w, space s]

Fig. 4.9 Metal inductor layout.

First, the coil diameter D is determined by the communication distance X. As discussed in Section 4.2.1, the coupling coefficient as well as the operating region of the inductive-coupling channel is defined by X/D (Equation 4.7 and Figure 4.3). The synchronous transceiver in this study (Figure 4.7) operates within the square region of the inductive-coupling channel (1/3 < X/D < 1). The coil diameter D is designed to be around 2X so that the transceiver can operate in the middle of the square region.

Next, the channel bandwidth is adjusted by the coil turns n. Recall from Section 4.2.1 that the channel bandwidth is equal to the self-resonant frequency of the coil fSR. As Equation 4.5 shows, fSR is determined by the product of L and C. For a first-order approximation,

$$L \propto D n^2 \tag{4.14}$$

$$C \propto D n \tag{4.15}$$

Typically, interconnections in upper metal layers are used for the coil to reduce parasitic substrate capacitance. C is then mostly given by the parasitic capacitance between the wires and the floating capacitance of the wires, so that it is proportional to the total wire length Dn. Since D is already determined by X, the minimum value of C is obtained when n is set to 1. The minimum value of L is obtained in the same way, which sets the maximum value of fSR. From there, considering the relationships in Equations 4.14 and 4.15, n is increased until fSR drops to the signal bandwidth 2fP. As a result, L is optimized for maximizing the mutual inductance M while keeping the channel bandwidth required for the pulse signal transmission.

Finally, the line width w is determined to adjust the parasitic resistance R within an appropriate range. R is inversely proportional to w. If w is too narrow and R is too high, the channel bandwidth is limited by the RC delay of the wire rather than the LC resonant frequency. On the other hand, if w is too wide and R is too low, the Q factor of the coil (ωL/R) is increased, causing resonance in the received signal and hence inter-symbol interference (ISI), which degrades BER. In order to avoid ISI, the Q factor should be reduced to two or three. The simplest design guideline is to use a w of around 1–2% of D. The line space s changes C only slightly, so the minimum line space allowed in the process can be used.


Based on the design guideline, the coil layout parameter can be roughly optimized. Fine tuning and evaluation of the channel characteristics should be done by iterative calculation using an electro-magnetic field solver and a circuit simulator.
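The first two steps of the guideline can be written down as a rough sketch. With D fixed, Equations 4.14 and 4.15 imply that LC scales as n cubed, so fSR scales as n to the power -1.5. The function names are illustrative, and the 45 GHz single-turn self-resonant frequency is a hypothetical starting point that would in practice come from a field solver.

```python
def coil_diameter(distance):
    """First-order guideline D = 2X (middle of the square region)."""
    return 2.0 * distance


def max_turns(f_sr_single_turn, f_required):
    """Largest n with f_SR(n) >= f_required, using f_SR(n) = f_SR(1) / n**1.5,
    which follows from L ~ D*n^2 and C ~ D*n at fixed D."""
    if f_sr_single_turn < f_required:
        raise ValueError("even a single-turn coil is too slow for this pulse")
    n = 1
    while f_sr_single_turn / (n + 1) ** 1.5 >= f_required:
        n += 1
    return n


# X = 15 um and 2fP = 5 GHz appear in the text; 45 GHz is a hypothetical value.
d_um = coil_diameter(15.0)
turns = max_turns(45e9, 5e9)
print(f"D = {d_um} um, n = {turns}")
```

This only gives a starting point; as the text notes, the final layout still needs iteration with an electromagnetic field solver and a circuit simulator.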

4.3.3 Transceiver circuit design

Figure 4.10 depicts an inductive-coupling transceiver circuit for BPM with its operating waveforms. A pulse generator in a transmitter consists of a NAND gate and a delay line formed by an inverter chain. By taking the NAND of a transmitter clock Txclk and its delayed inverted signal Txclkd, a negative pulse Pulse is generated in every clock cycle. The pulse width is determined by the delay of the inverter chain τ. A succeeding H-bridge driver generates positive or negative pulse current IT according to transmit data Txdata. When Txdata is High, P1 is ON, N2 is driven by X2 and a positive pulse is generated. When Txdata is Low, P2 is ON, N1 is driven by X1 and a negative pulse is generated.

[Figure 4.10: transmitter schematic (pulse generator with delay τ producing Txclkd and Pulse; H-bridge driver P1, P2, N1, N2 with gates X1, X2 controlled by Txdata, driving IT of amplitude IP) and receiver schematic (coil biased at VB; latch comparator N3, N4, N5, P3, P4 clocked by Rxclk with outputs VSP, VSN, and Rxdata), with waveforms of Txclk, Txclkd, Pulse, Txdata, IT, VR, Rxclk, VSP, VSN, and Rxdata versus time. © 2007 IEEE]

Fig. 4.10 Inductive-coupling BPM data transceiver.

The pulse amplitude of IT is determined by the channel width of N1 and N2. The pulse slew rate of IT is determined by the slew rate of the gate input of N1 and N2.


The received voltage VR in a receiver coil is given as a derivative form of IT . The receiver coil is biased at VB through the high resistance of several kΩ to give the input common mode of the receiver circuit. Since the channel is AC coupling, it can give arbitrary bias voltage to the receiver without affecting the transmitter. The receiver circuit is a latch comparator. It directly samples VR by the receiver clock Rxclk, and detects the polarity of the pulses to recover digital data Rxdata. The latch comparator consists of a sense amplifier and an SR latch. The first-stage sense amplifier has two operating phases depending on Rxclk. When Rxclk is Low, the sense amplifier is in a pre-charge phase where the output voltages VSP and VSN are both pre-charged to High by PMOS transistors P3 and P4. In this operating phase, since the inputs of the SR latch are both High, Rxdata holds the data. When Rxclk goes High, the sense amplifier is in an evaluation phase where P3 and P4 are OFF, N3 is ON and the NMOS differential pair is activated. If VR is positive at the rising edge of Rxclk, N4 is strongly ON and VSP is pulled down to Low while VSN is kept High. As a result, Rxdata in the latch becomes High. Again, Rxclk goes Low and the VSP is pre-charged to hold Rxdata. If VR is negative in the evaluation phase, VSN is pulled down and Rxdata becomes Low. In this receiver, VSP and VSN temporarily drop at the rising edge of Rxclk. The SR latch erroneously operates if both VSP and VSN drop down to Low. The input threshold voltage of the SR latch should be carefully designed to avoid the erroneous operation.
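The signalling behaviour described above can be captured in a small behavioural model. This is only a sketch of the BPM decision rule, not a circuit simulation: the received pulse follows Equation 4.11 with its sign set by the data bit, and the latch comparator is reduced to the sign of one sample. The normalized amplitudes and the sampling offset of 0.35τ before the pulse centre are hypothetical choices.

```python
import math


def rx_voltage(t, bit, tau=1.0):
    """Normalized received voltage for one BPM pulse centred at t = 0.
    A High bit sends a positive current pulse, a Low bit a negative one."""
    polarity = 1.0 if bit else -1.0
    return -polarity * (8.0 * t / tau**2) * math.exp(-4.0 * t**2 / tau**2)


def recover_bit(bit_sent, sample_offset=-0.35, tau=1.0):
    """Latch-comparator decision: sample V_R once and keep only its sign."""
    return rx_voltage(sample_offset * tau, bit_sent, tau) > 0.0


def transmit(bits):
    """Send each bit as one pulse and return the recovered bits."""
    return [recover_bit(bool(b)) for b in bits]


sent = [1, 0, 0, 1, 1, 0]
print(transmit(sent))
```

Sampling in the first half of the double pulse recovers the data directly; in this model a sample taken in the second half sees the opposite polarity, which is why the receiver clock timing matters.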

4.3.4 Inter-chip communications

Two test chips, a transmitter and a receiver (see Figure 4.11), were fabricated in 180 nm CMOS. The transmitter chip is thinned to 10 µm and stacked face-up over the receiver chip with 5 µm-thick glue. The communication distance X between the transmitter and the receiver is therefore 15 µm. The coil diameter of the transceiver is 30 µm (2X), so that the transceiver is tested in the middle of the square region of the inductive-coupling channel. The delay line in the transmitter is designed to set the transmit pulse width τ to 180 ps, giving the receiver a timing margin of 150 ps. From Equation 4.13, the required bandwidth of the inductive-coupling channel is calculated to be 5 GHz. Following the design guideline above, the coil layout is optimized to maximize the mutual inductance M while keeping the self-resonant frequency fSR higher than 5 GHz. The transceiver circuit is placed under the coil to save layout area. To evaluate the interference between the circuit and the coil, a second transceiver, whose circuit is placed beside the coil, is also implemented.

[Figure 4.11 photos: transmitter with Tx circuit under a 30 µm Tx coil, stacked face-up; receiver with Rx circuit under a 30 µm Rx coil.]

Fig. 4.11 Die photos of inductive-coupling transmitter (left) and receiver (right).

Figure 4.12 presents measurement results of the inductive-coupling transceiver whose circuit is placed under the coil. The left side of the figure shows a snapshot of the transmitted and received data waveforms. It is confirmed that 2^23 - 1 Pseudo Random Binary Sequence (PRBS) data at 1 Gb/s is correctly delivered through the inductive-coupling transceiver. The right side of the figure depicts the measured timing bathtub curve. A timing margin of 200 ps is achieved for a BER under 10^-12, which is 50 ps wider than designed. This is because the transmit pulse width τ is increased by process variation in the delay of the inverter chain. The transceiver whose circuit is placed beside the coil is also measured. There is no difference in the measured timing bathtub curve, so the interference between the transceiver circuit and the coil is seen to be negligible. The transceiver consumes 2.6 mW in the transmitter and 0.2 mW in the receiver from a 1.8 V supply.

In this section, we described the basic design theory of inductive-coupling transceivers. In the next three sections, circuit techniques for performance improvements are introduced. The effectiveness of each technique is evaluated by test-chip measurements.
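For comparison with the energy-per-bit figures used later in the chapter, the measured power can be converted to energy per bit by dividing by the data rate. The short sketch below does this for the prototype numbers quoted above; the function name is a hypothetical helper introduced only for illustration.

```python
def energy_per_bit_pj(power_mw, data_rate_gbps):
    """Energy per bit in pJ/b. 1 mW at 1 Gb/s corresponds to 1 pJ/b."""
    return power_mw / data_rate_gbps


tx = energy_per_bit_pj(2.6, 1.0)   # transmitter: 2.6 mW at 1 Gb/s
rx = energy_per_bit_pj(0.2, 1.0)   # receiver: 0.2 mW at 1 Gb/s
total = tx + rx
print(f"TX {tx:.1f} pJ/b, RX {rx:.1f} pJ/b, total {total:.1f} pJ/b")
```

Under this conversion the prototype spends roughly 2.8 pJ/b in total, almost all of it in the transmitter, which is the motivation for the transmitter-side techniques that follow.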

[Figure 4.12 plots: snapshot of Txclk, Txdata, and Rxdata waveforms for 2^23 - 1 PRBS data at 1 Gb/s (left); bathtub curve of BER (10^-3 to 10^-12) versus sampling timing [ps] (-150 to 100), timing margin = 200 ps (right).]

Fig. 4.12 Measured snapshot of data waveforms (left) and measured timing bathtub curve (right).

4.4 Power reduction techniques

This section introduces circuit techniques for power reduction in the inductive-coupling transceiver. As the measurement results in the previous section show, power dissipation in the transmitter dominates that of the receiver. The latch comparator in the receiver consumes only the charge and discharge energy $CV_{DD}^2$, which is equivalent to the energy consumed in just four CMOS gates. In addition, this energy dissipation can be effectively reduced by device scaling. In the transmitter, on the other hand, the output H-bridge driver draws a large short-circuit current to generate the transmit pulse current IT. In this section, two circuit techniques are introduced for efficient generation of the transmit current.

4.4.1 Pulse shaping

The transmitter's energy dissipation ETX depends strongly on the transmit pulse shape. With an understanding of this relationship, the pulse shape can be optimized for efficient use of charge. ETX is given by the product of the supply voltage VDD and the total electric charge Q carried by the IT pulse. As shown in Figure 4.8, Q is equal to the area of the IT pulse. Thus, ETX is given as

$$E_{TX} = Q V_{DD} \approx \tau I_P V_{DD} \qquad (4.16)$$

Recall from Section 4.3.1 that the amplitude of the received voltage VP is determined by the pulse slew rate SP. Using SP from Equation 4.13, we can rewrite Equation 4.16 as

$$E_{TX} = \tau^2 S_P V_{DD} \qquad (4.17)$$


Equations 4.17 and 4.13 indicate that, by reducing the pulse width τ while keeping SP constant, the transmitter's energy dissipation can be reduced in proportion to τ^2 while VP stays constant.
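As a quick numerical check of this scaling, the sketch below approximates the IT pulse as a symmetric triangle, as Figure 4.13 does, and integrates its area for two pulse widths at the same slew rate. The function name and the values chosen for the slew rate and pulse width are illustrative assumptions, not measured figures. For an exact triangle the charge is τ^2 SP / 4, so the constant factor differs from the approximate form of Equation 4.17, but the τ^2 dependence is the same.

```python
def triangular_pulse_charge(tau_s, slew_a_per_s, steps=10_000):
    """Integrate a symmetric triangular current pulse of width tau_s.

    The current ramps up at slew_a_per_s for tau/2 and ramps down at the
    same rate for the remaining tau/2 (midpoint-rule integration).
    """
    dt = tau_s / steps
    charge = 0.0
    for k in range(steps):
        t = (k + 0.5) * dt
        current = slew_a_per_s * min(t, tau_s - t)
        charge += current * dt
    return charge


S_P = 1e8        # A/s, assumed slew rate (kept constant)
TAU = 120e-12    # s, assumed pulse width

q_full = triangular_pulse_charge(TAU, S_P)
q_half = triangular_pulse_charge(TAU / 2, S_P)
print(f"charge ratio for tau/2 versus tau: {q_half / q_full:.3f}")
```

Halving the pulse width at constant slew rate leaves the ramp slope, and therefore VP, unchanged while the charge per pulse falls to about one quarter.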

[Figure 4.13 sketch: transmit current IT (peak IP, slew rate SP) and received voltage VR = M dIT/dt (peaks ±VP) versus time. Left panel: pulse width τ, ETX ~ τ^2 SP VDD. Right panel: pulse width τ/2, ETX ~ τ^2 SP VDD / 4.]

Fig. 4.13 Waveform sketch of transmitted current and received voltage when pulse width is τ (left) and pulse width is τ/2 (right).

Figure 4.13 sketches this relationship conceptually, with the IT pulse approximated by a simple triangular waveform. It shows that, when the pulse width of IT is reduced from τ to τ/2, the area (total electric charge) of IT is reduced to 1/4 while VP is kept constant. Consequently, in the inductive-coupling transmitter, reducing the pulse width τ is effective for power reduction.

However, as we saw in the test-chip measurements in Section 4.3.4, the transmit pulse shape changes with variations in process, voltage, and temperature (PVT) and in chip thickness (communication distance). To adjust the pulse width and the slew rate against these variations, a precise pulse-shaping circuit is needed. In addition, since a narrower pulse reduces the receiver's timing margin, a robust timing design is required to maintain the BER. To solve these problems, a digitally controlled pulse-shaping circuit and a timing control circuit have been introduced [5],[6].

Figure 4.14 depicts the pulse-shaping circuit. It consists of pulse width, pulse slew rate, and pulse amplitude controls. In the pulse width control, a 4-phase clock generator provides 0°, 45°, 90°, and 135° clocks to two phase interpolators (PIs). One PI interpolates a clock phase between 0° and 45° in steps of 1/256 UI, which is equivalent to 4 ps at 1 GHz operation. The other PI is a dummy circuit that always outputs the 135° clock. A succeeding AND gate generates a pulse clock that determines the pulse width τ. The pulse slew rate is digitally controlled by variable capacitors. The pulse amplitude is digitally controlled by changing the channel width of the NMOS transistors in the H-bridge driver.

[Figure 4.14 blocks: Tx chip with 4-phase clock generator (0°, 45°, 90°, 135°) from Txclk, phase interpolators (5-bit, 0° to 45°; fixed 135°), pulse width control (5-bit, 1/256-UI step), pulse slew rate control (4-bit), pulse amplitude control (5-bit), and H-bridge driver producing IT from Pulse and Txdata; Rx chip with receiver producing Rxdata from VR and Rxclk. © 2008 IEEE]

Fig. 4.14 Digitally controlled pulse shaping circuit.

Figure 4.15 describes the timing design. An inductive-coupling clock link is located adjacent to the data link, so that the timing jitter caused by supply noise and temperature variations is effectively rejected as common-mode noise. A sampling timing controller calibrates the timing shift due to process variations.

[Figure 4.15 blocks: Tx chip with 4-phase clock generator, phase interpolators (5-bit, 0° to 45°; 135°), pulse width control, data transmitter (IT), and clock transmitter (ITC); Rx chip with clock receiver (VRC), 4-phase clock generator, phase interpolators (0° to 135°), sampling timing control (1-bit and 5-bit), and data receiver (VR, Rxdata); data link and clock link. © 2008 IEEE]

Fig. 4.15 Sampling timing controller.

Stacked test chips are fabricated in 180 nm and 90 nm CMOS (Figure 4.16). In both, the transmitter chip is stacked face-up over the receiver chip. The transmitter chip is 10 µm thick and the glue layer is 5 µm thick, so the communication distance between the transmitter and the receiver is 15 µm. The coil diameter is 30 µm for the data link and 200 µm for the clock link. The data rate is 1 Gb/s. The experimental conditions (distance, coil size, and data rate) are identical to those of the earlier prototype inductive-coupling transceiver (Figure 4.11).

[Figure 4.16 photos: Tx chip (10 µm thick) stacked on Rx chip, with 30 µm data-link coils and a 200 µm clock-link coil; 180 nm CMOS (left) and 90 nm CMOS (right). © 2007 IEEE]

Fig. 4.16 Stacked test chips of low-power inductive-coupling transceiver in 180 nm (left) and 90 nm CMOS (right).

The left side of Figure 4.17 presents measured bathtub curves of the transceiver in 180 nm CMOS. By using the pulse-shaping circuit, the pulse width is reduced from 120 ps to 60 ps. The received pulse amplitude VP is adjusted to 60 mV by the transmit pulse amplitude and slew rate controls. It is confirmed that ETX decreases in proportion to τ^2. When τ is set to the minimum pulse width of 60 ps, ETX is reduced to 0.13 pJ/b, which is 17 times lower than in the earlier prototype design. However, in this case the timing margin for a BER under 10^-12 is reduced to 25 ps. Static timing variation due to process variations can be calibrated by using the timing controller, and the timing can be adjusted to within the 25 ps timing margin.

[Figure 4.17 plots: BER (1 to 10^-12) versus sampling timing [ps] at VP = 60 mV and 1 Gb/s. Left (180 nm CMOS): τ = 60 ps, ETX = 0.13 pJ/b, timing margin 25 ps. Right (90 nm CMOS): τ = 60 ps, ETX = 0.11 pJ/b, ERX = 0.03 pJ/b, timing margin 30 ps. © 2008 IEEE]

Fig. 4.17 Measured bathtub curves in 180 nm (left) and 90 nm CMOS (right).

The robustness of the timing design against dynamic timing variation (jitter) is measured by intentionally injecting power supply noise. An individual load is connected to the local supply of each transmitter and receiver chip, and the load is changed randomly at various frequencies. The data transceiver communicates at 1 Gb/s with a BER under 10^-12 under supply noise of 350 mV peak-to-peak (±10% of VDD). It is confirmed that, with source-synchronous transmission, the timing jitter caused by the supply noise is effectively rejected and kept within the timing margin of 25 ps.

The right side of Figure 4.17 shows a measured bathtub curve of the transceiver in 90 nm CMOS. The pulse width is 60 ps. The transceiver operates at 1 Gb/s with a BER under 10^-12 and a timing margin of 30 ps. The energy dissipation in the transmitter and receiver is reduced to 0.11 pJ/b and 0.03 pJ/b, respectively, by device scaling. The total energy dissipation is 0.14 pJ/b, which is 1/20 of that in the prototype transceiver.
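The timing resolution quoted for the pulse-width control can be cross-checked with a few lines of arithmetic. The sketch below assumes that the 5-bit control word of Figure 4.14 selects one of 32 equal steps and that the unit interval at 1 Gb/s equals the 1 GHz clock period. The variable names are illustrative.

```python
UI_PS = 1000.0              # unit interval at 1 Gb/s, in ps
CLOCK_PERIOD_PS = 1000.0    # period of the 1 GHz clock, in ps
CONTROL_BITS = 5            # width of the pulse-width control word (assumed 32 steps)
TIMING_MARGIN_PS = 25.0     # measured margin at the 60 ps pulse width (180 nm)

step_ps = UI_PS / 256                                  # one phase-interpolator step
range_ps = step_ps * 2 ** CONTROL_BITS                 # full adjustment range
range_deg = 360.0 * range_ps / CLOCK_PERIOD_PS         # the same range as a clock phase
steps_in_margin = TIMING_MARGIN_PS / step_ps

print(f"step  = {step_ps:.2f} ps")
print(f"range = {range_ps:.0f} ps = {range_deg:.0f} degrees")
print(f"margin = {steps_in_margin:.1f} steps")
```

Under these assumptions one step is about 3.9 ps, the 32 steps span exactly the 0° to 45° interval, and the 25 ps timing margin is a little over six steps wide, which indicates how finely the controller can place the sampling point inside the margin.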

4.4.2 Daisy chain transmitter

Another power reduction technique is the daisy-chain transmitter [16]. The left of Figure 4.18 shows a channel array of conventional H-bridge transmitters; here, the transmit pulse currents IT1–ITN are dissipated separately in each transmitter. The right of Figure 4.18 depicts a daisy-chain transmitter, in which multiple transmitter channels are concatenated so that the transmit current is reused between channels. The polarity of the transmit current in each transmitter coil is determined by switching NMOS transistors according to the transmit data of the adjacent channels. BPM pulse signals are finally generated by the pulse generator connected to the bottom NMOS transistors. In this transmitter, the energy efficiency improves as the number of concatenated transmitter stages N increases. Ideally, the power dissipation per transmitter stage is reduced to 1/N. In practice, N is limited by the bandwidth degradation caused by the growing number of series-connected NMOS transistors in the current path.

[Figure 4.18 labels: transmit data Txdata1 to TxdataN, coil currents IT1 to ITN, switch control signals formed from adjacent data pairs (Txdata1•Txdata2, Txdata2•Txdata3, ..., TxdataN-1•TxdataN), Pulse Generator, Txclk. © 2008 IEEE]

Fig. 4.18 Conventional H-bridge parallel transmitters (left) and daisy-chain transmitters (right).

Figure 4.19 depicts microphotographs of stacked test chips in 90 nm CMOS. The transmitter chip is back-ground to 10 µm thickness and stacked face-up over the receiver chip with 5 µm-thick glue. As a result, the communication distance is 15 µm. The transmitter chip integrates daisy-chain transmitters with N = 2, 4, and 6 concatenated stages. The coil diameter is 30 µm. The experimental setup is identical to that of the previous inductive-coupling transceiver with the pulse-shaping circuit (Figure 4.16).

[Figure 4.19 photos: transmitter (upper) and receiver (lower); Tx chip (10 µm thick) stacked on Rx chip. © 2008 IEEE]

Fig. 4.19 Stacked test chips.

Figure 4.20 presents the measured energy dissipation of the daisy-chain transmitter as a function of the number of concatenated stages N. When N = 4, the energy dissipation is reduced to 35 fJ/b for the same data rate (1 Gb/s/ch), BER (under 10^-12), and timing margin (30 ps) as in the previous inductive-coupling transceiver with pulse shaping. When N exceeds six, the bandwidth limitation due to the stacked NMOS transistors starts to degrade the performance. Device scaling will improve the frequency characteristics of the transistors, enabling N to be increased beyond six for further power reduction.
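The ideal 1/N reduction can be compared with the measured value using the figures quoted in this subsection and the previous one. The sketch below is only arithmetic on those published numbers; the function name is a hypothetical helper, and the ideal model ignores every overhead other than the shared transmit current.

```python
E_TX_PULSE_SHAPING_FJ = 110.0   # fJ/b, 90 nm transmitter with pulse shaping only
E_TX_MEASURED_N4_FJ = 35.0      # fJ/b, measured daisy-chain transmitter with N = 4


def ideal_energy_per_stage(e_single_fj, n_stages):
    """Ideal case: N stages reuse one transmit current, so each pays 1/N."""
    return e_single_fj / n_stages


for n in (1, 2, 4, 6):
    print(f"N = {n}: ideal {ideal_energy_per_stage(E_TX_PULSE_SHAPING_FJ, n):.1f} fJ/b")

ideal_n4 = ideal_energy_per_stage(E_TX_PULSE_SHAPING_FJ, 4)
overhead = E_TX_MEASURED_N4_FJ / ideal_n4
print(f"measured / ideal at N = 4: {overhead:.2f}")
```

At N = 4 the ideal value is 27.5 fJ/b, so the measured 35 fJ/b is about 27% above the ideal, consistent with the practical limits on N described in the text.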

[Figure 4.20 plot: normalized energy dissipation (0 to 1.0) versus number of stages N (1 to 8), measured and calculated; ETX = 110 fJ/b with pulse shaping only (normalized to 1.0); measured ETX = 35 fJ/b; data rate 1 Gb/s, BER < 10^-12, timing margin 30 ps. © 2008 IEEE]

Fig. 4.20 Measured and calculated energy dissipation dependence on number of stages.

4.5 High-speed techniques

In this section we introduce a number of high-speed circuit techniques. Compared to wired solutions such as micro bumps and TSVs, inductive-coupling communication has an advantage in high-speed operation, since contactless circuits do not need highly capacitive ESD protection circuits and can therefore have improved channel bandwidth. The load capacitance of the inductive-coupling channel can be reduced to less than 10 fF owing to the absence of ESD protection circuits. As a result, the self-resonant frequency of the coil, and hence the channel bandwidth, can be designed to be higher than 100 GHz in 180 nm CMOS [7, 8]. Furthermore, since the inductive-coupling channel is formed using on-chip structures, the bandwidth can be further improved by device scaling. The inductive-coupling channel therefore does not limit the data rate of the transceiver. By optimizing the transceiver circuit topology, the data rate can be raised to the performance limit of the transistors. However, in the synchronous inductive-coupling transceiver described so far, the data rate is limited to around 1 Gb/s by the need for complicated timing control. In this section, we discuss an 11 Gb/s asynchronous inductive-coupling transceiver and burst transmission utilizing this high-speed transceiver [7, 8].

4.5.1 Asynchronous transceiver

Figure 4.21 depicts the high-speed asynchronous inductive-coupling transceiver. An H-bridge driver in the transmitter generates IT from Txdata and drives the transmitter coil. A small positive or negative pulse-shaped voltage VR is induced in the receiver coil, which is biased at VB (around VDD/2) by a replica bias generator [8]. Centered around VB, a positive pulse is generated when Txdata transitions from low to high, and a negative pulse is generated when Txdata transitions from high to low.

The receiver is a hysteresis comparator that detects the small pulse and converts it to digital data Rxdata. The hysteresis comparator consists of a gain stage (CMOS inverters XL and XR) and a latch circuit (cross-coupled PMOS). The gain stage amplifies the VR pulse and drives the succeeding latch to switch and recover Rxdata. According to Rxdata, the latch circuit modulates the threshold voltage of the inverters in the gain stage. The broken line in Figure 4.21 denotes the modulated threshold voltage VTH of the inverter XL. For example, when Rxdata is low, VTH increases to VTH0 + ∆V, where VTH0 is the nominal threshold voltage of the inverter (typically around VDD/2) and ∆V is the hysteresis width of the comparator. This width ∆V must be chosen appropriately so that the receiver can distinguish between signal and noise. When the inverter's input exceeds VTH0 + ∆V because of a positive pulse VR, Rxdata switches high. The latch circuit then shifts VTH to VTH0 − ∆V and holds Rxdata high until a negative pulse voltage VR is applied to the inverter's input. By repeating this operation, the digital data is correctly recovered from the pulse voltages.

[Figure 4.21 labels: Txdata, IT, VR, VB, inverters XL and XR, thresholds VTH = VTH0 + ∆V and VTH = VTH0 − ∆V, Rxdata; horizontal axis: Time. © 2008 IEEE]

Fig. 4.21 Asynchronous inductive-coupling transceiver.
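The threshold-shifting behaviour of the hysteresis comparator can be sketched as a two-threshold state machine. This is a minimal behavioural model rather than the circuit: the function name, the bias and hysteresis values, and the sample voltages are all assumptions chosen for illustration.

```python
def hysteresis_receiver(vr_samples, vth0, delta_v, initial=False):
    """Recover NRZ data from bipolar pulses with a two-threshold comparator.

    vr_samples are receiver input voltages including the bias VB.
    While the output is low the threshold is vth0 + delta_v; while it is
    high the threshold is vth0 - delta_v.
    """
    rxdata = initial
    recovered = []
    for v in vr_samples:
        if not rxdata and v > vth0 + delta_v:
            rxdata = True       # positive pulse: switch high
        elif rxdata and v < vth0 - delta_v:
            rxdata = False      # negative pulse: switch low
        recovered.append(rxdata)
    return recovered


VB = 0.9          # V, assumed bias (about VDD/2 for a 1.8 V supply)
DELTA_V = 0.03    # V, assumed hysteresis width

# Idle, positive pulse, small noise, small noise, negative pulse, small noise.
samples = [VB, VB + 0.06, VB + 0.01, VB - 0.02, VB - 0.06, VB + 0.02]
print(hysteresis_receiver(samples, VB, DELTA_V))
```

In this example the 10 mV and 20 mV disturbances stay inside the hysteresis window and are ignored, while the 60 mV pulses toggle the output, which is the sense in which ∆V separates signal from noise.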

In this transceiver, an asynchronous scheme is employed for the data link, and no clock is needed for data recovery. Because the complicated timing control that the synchronous scheme requires, with multi-phase clocks and a high-precision phase interpolator [4, 5], is not needed, the operating speed is improved. The maximum data rate of the inductive-coupling transceiver is determined by the transition frequency fT of the transistor, which is around 60 GHz in 180 nm CMOS. As mentioned previously, the self-resonant frequency of the coil can be designed to be higher than 100 GHz in 180 nm CMOS, so it does not limit the data rate. Circuit simulation shows that the data rate can be raised to 11 Gb/s with the asynchronous transceiver. However, the coil size must be increased to improve the signal-to-noise ratio (SNR) and so compensate for the weak noise immunity of the asynchronous receiver. This area overhead can be eliminated by burst transmission, which is introduced in the next section.

The modulation scheme is also modified so that Txdata drives the H-bridge directly to generate IT, removing the pulse generator of the conventional transmitter. This reduces the number of circuit stages in the transmitter, resulting in a small link latency. The simulated latency from Txdata to Rxdata is only 36 ps, equivalent to 0.5 FO4 inverter delay in 180 nm CMOS. This short latency enables high-speed burst transmission, which is also discussed later. In this modified modulation scheme, the transceiver consumes a large DC current. Although this overhead can be minimized in high-speed operation during an active mode, the DC current should be switched off in a stand-by mode for low-power applications [17].

Two test chips, a transmitter and a receiver, are fabricated in 180 nm CMOS. The transmitter chip is thinned to three different thicknesses (40 µm, 25 µm, and 10 µm) and stacked over the receiver chip, both face-up, with 5 µm-thick adhesive (Figure 4.22). The communication distances are therefore 45 µm, 30 µm, and 15 µm, respectively. The coil is 120 µm in diameter with five turns, providing a self-inductance of 6 nH and a self-resonant frequency of 16 GHz. The coils communicate through the transmitter chip substrate. This test chip also integrates a transceiver for burst transmission, as introduced in the next section.
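As a rough consistency check on the coil parameters, the quoted self-inductance and self-resonant frequency can be related through the lumped LC resonance formula f = 1 / (2π√(LC)). Treating the coil as a single lumped inductor with one equivalent capacitance is a simplifying assumption made here, and the function names are illustrative.

```python
import math


def self_resonant_frequency_hz(inductance_h, capacitance_f):
    """Lumped LC resonance: f = 1 / (2 * pi * sqrt(L * C))."""
    return 1.0 / (2.0 * math.pi * math.sqrt(inductance_h * capacitance_f))


def implied_capacitance_f(inductance_h, f_sr_hz):
    """Equivalent capacitance that resonates with the inductance at f_sr_hz."""
    return 1.0 / ((2.0 * math.pi * f_sr_hz) ** 2 * inductance_h)


L_COIL = 6e-9     # H, self-inductance of the 120 um, five-turn coil
F_SR = 16e9       # Hz, its self-resonant frequency

c_equiv = implied_capacitance_f(L_COIL, F_SR)
print(f"implied equivalent capacitance = {c_equiv * 1e15:.1f} fF")
print(f"check: f_SR = {self_resonant_frequency_hz(L_COIL, c_equiv) / 1e9:.1f} GHz")
```

Under this lumped model the 6 nH coil resonating at 16 GHz implies an equivalent capacitance in the region of 16 to 17 fF. This shows why a larger, multi-turn coil resonates well below the more than 100 GHz quoted for small coils, while still sitting above the 11 Gb/s data rate.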

[Figure 4.22 labels: fabricated in 180 nm CMOS; transmitter (top) and receiver (bottom) with 120 µm coils; burst transmitter (top) with data Tx, clock Tx, oscillator (OSC), and MUX; burst receiver (bottom) with data Rx, clock Rx, and DEMUX; top chip 40, 25, or 10 µm thick; distance 45, 30, or 15 µm. © 2008 IEEE]

Fig. 4.22 Stacked test chips of high-speed inductive-coupling transceiver.

The maximum data rate of the asynchronous transceiver is measured for each communication distance. For the communication distance of 15 µm, the maximum data rate is 11 Gb/s with a BER under 10^-14. For the distances of 30 µm and 45 µm, the maximum data rates are 10.5 Gb/s and 8.5 Gb/s, respectively.


4.5.2 Burst transmission

Burst transmission is an area reduction technique. The concept itself is well known in high-speed serial links. As illustrated in Figure 4.23, because the bandwidth of the data link is improved by the asynchronous transceiver described above, several data links can be multiplexed into one burst data link. This reduces the number of data links and hence the required layout area. In face-up and back-to-back chip stacks, coils with large diameters are required because of the long communication distance. It is therefore area-efficient to reduce the number of coils, even though the multiplexer (MUX) and demultiplexer (DEMUX) increase the layout area of the circuits. The technical challenge is to provide a high-frequency burst clock to the MUX and DEMUX in a simple way. Building a Phase-Locked Loop (PLL) circuit is one approach to generating the high-frequency clock, but it consumes a large layout area. A simple digital circuit solution is required for area reduction.

[Figure 4.23 labels: parallel data links (left) with four low-speed (<1 Gb/s) Tx/Rx pairs carrying Txdata0~3 to Rxdata0~3; burst transmission (right) with a MUX, one high-speed (>10 Gb/s) Tx/Rx pair carrying burst Txdata and burst Rxdata, a DEMUX, and a high-frequency burst clock on each side. © 2008 IEEE]

Fig. 4.23 Block diagrams of parallel (left) and burst transmission (right) data links.

Figure 4.24 depicts a burst-clock generator that provides timing to the MUX. When Enable is high, control logic generates a pulse at the rising edge of SystemClk to reset a counter. After the reset, a local ring oscillator (OSC) starts oscillating and generates the high-frequency burst clock Txclk. The counter stops the oscillation once the oscillator has generated the number of clock waves required for the data bits of one burst.

Figure 4.25 depicts a block diagram of the burst transceiver. Multi-bit data Mtxdata are multiplexed into burst data Txdata by the high-frequency burst clock Txclk. Txclk is transmitted over another inductive-coupling link alongside the data link and is used for demultiplexing the received burst data Rxdata. The large jitter of the ring oscillator is cancelled by this source-synchronous transmission. In addition, since both clock and data are transmitted over identical inductive-coupling links whose latency is as small as 36 ps, the variation in sampling timing tsample caused by PVT changes is largely suppressed. A delay is inserted in the clock path to the demultiplexer in order to latch the data in the middle of the data cycle. No other timing control is needed.
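The burst-mode data path can be sketched as a toy model in which the counter limits the oscillator to a fixed number of cycles and the receiver samples once per forwarded clock edge. The model assumes two bits per Txclk cycle (one per edge), which is inferred from the fN/2-Hz Txclk label in Figure 4.25 rather than stated in the text. All names are hypothetical, and analog effects such as jitter and latency are not modelled.

```python
def burst_clock(n_bits):
    """Clock edges of one burst.

    The counter stops the oscillator after n_bits / 2 cycles; each cycle
    contributes one rising and one falling edge (n_bits assumed even).
    """
    cycles = n_bits // 2
    return [edge for _ in range(cycles) for edge in ("rise", "fall")]


def mux(parallel_word):
    """Serialize one parallel word, one bit per burst-clock edge."""
    edges = burst_clock(len(parallel_word))
    return list(zip(edges, parallel_word))


def demux(burst):
    """Rebuild the parallel word by sampling once per received clock edge."""
    return [bit for _edge, bit in burst]


word = [1, 0, 1, 1, 0, 0, 1, 0, 1, 1, 1, 0, 0, 1, 0, 1]   # assumed 16-bit word
burst = mux(word)
rising_edges = sum(1 for edge, _bit in burst if edge == "rise")
print(f"{len(burst)} bits sent on {rising_edges} clock cycles")
print("round trip ok:", demux(burst) == word)
```

Because the receiver takes its sampling instants from the forwarded clock edges, any jitter common to the clock and data paths does not disturb the pairing of edges and bits, which is the point of the source-synchronous arrangement.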

[Figure 4.24 labels: N-bit Mtxdata at f Mb/s, N:1 MUX, Txdata at fN Mb/s, Enable, System Clk, Control Logic, Reset, Counter, Stop, Local Ring Oscillator, Txclk; timing diagram of System Clk, Enable, Reset, Stop, and burst Txclk during burst mode. © 2008 IEEE]

Fig. 4.24 Burst clock generator for multiplexing.

The burst transceiver is designed for a 400 MHz system clock, assuming an application to a processor for mobile phones. All the circuits (MUX, DEMUX, oscillator, counter, and delay buffer) are implemented in current mode logic (CML) for high-frequency operation and small PVT variations. The MUX and DEMUX are designed for operation at 6.4 Gb/s. The local ring oscillator is thus designed to produce a 3.2 GHz Txclk. A Johnson-type counter is used for such high-frequency operation. It generates 8 clock waves in order to multiplex 16 bits of 400 Mb/s Mtxdata into a 6.4 Gb/s burst Txdata. Both Txdata and Txclk are transmitted over inductive-coupling links whose link latency is 36 ps. The delay buffer gives a 0.5 UI delay to Rxclk so that the sampling timing tsample is set in the middle of the data cycle (≈ 78 ps). The simulated PVT variation in tsample is less than 20 ps (under 13% UI). This wide design margin is obtained by source-synchronous transmission over the low-latency inductive-coupling link.

Stacked test chips are fabricated in 180 nm CMOS (Figure 4.22). The layout area of the burst transceiver, including the MUX/DEMUX and oscillator, is 0.1 mm^2. In a conventional parallel transceiver [3, 4], 16 data links would be required for the same aggregate data rate of 6.4 Gb/s. Even if the coil diameter were reduced to 90 µm, a layout area of 0.3 mm^2 would be required in the synchronous scheme. The burst transceiver requires only two links (data and clock), so the layout area is reduced to 1/3 of that of the parallel transceiver. All experimental setups are identical to those of the previous measurement of the high-speed inductive-coupling link. Again, the transmitter chip is thinned and stacked over the receiver chip, both face-up.
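The design numbers in this paragraph follow from one another, and the short calculation below retraces them. It assumes that two bits are sent per Txclk cycle, which is consistent with the 3.2 GHz clock and the 8 clock waves per 16-bit burst quoted above. The variable names are illustrative.

```python
SYSTEM_CLOCK_HZ = 400e6     # system clock
BITS_PER_BURST = 16         # bits multiplexed per system clock cycle
BITS_PER_TXCLK_CYCLE = 2    # assumption: one bit per clock edge
PVT_VARIATION_PS = 20.0     # simulated bound on the variation of tsample

burst_rate_bps = SYSTEM_CLOCK_HZ * BITS_PER_BURST
ui_ps = 1e12 / burst_rate_bps
txclk_hz = burst_rate_bps / BITS_PER_TXCLK_CYCLE
clock_waves = BITS_PER_BURST // BITS_PER_TXCLK_CYCLE
sample_point_ps = 0.5 * ui_ps
pvt_fraction = PVT_VARIATION_PS / ui_ps
area_ratio = 0.1 / 0.3      # burst transceiver area over parallel transceiver area

print(f"burst rate   = {burst_rate_bps / 1e9:.1f} Gb/s")
print(f"UI           = {ui_ps:.2f} ps")
print(f"Txclk        = {txclk_hz / 1e9:.1f} GHz, {clock_waves} waves per burst")
print(f"sample point = {sample_point_ps:.1f} ps into the data cycle")
print(f"PVT shift    = {100 * pvt_fraction:.1f}% of UI")
print(f"area ratio   = {area_ratio:.2f}")
```

The half-UI sampling point comes out near 78 ps and the 20 ps PVT bound at just under 13% of the 156.25 ps unit interval, matching the values in the text.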


[Figure 4.25 labels: N-bit Mtxdata at f b/s, N:1 MUX, Txdata at fN b/s, oscillator and counter driven by Enable and the f-Hz System Clk, Txclk at fN/2 Hz, data link and clock link (Tx, coil, Rx), source synchronous transmission, Rxdata at fN b/s, 0.5 UI buffer, Rxclk, 1:N DEMUX, N-bit Mrxdata at f b/s. © 2008 IEEE]

Fig. 4.25 Block diagram of burst transceiver with source synchronous transmission.

The communication distances are 45 µm, 30 µm, and 15 µm. The coil is 120 µm in diameter with five turns. The two coils for the clock and data links are placed next to each other. The crosstalk is small enough, since the burst transceiver uses only two inductive-coupling links and the number of crosstalk channels is limited. Theoretical calculations show that the crosstalk-to-signal ratio is lower than -20 dB for distances shorter than 90 µm [13]. The BER of the burst transmission is measured at the maximum data rate of 6.4 Gb/s. The BER is less than 10^-14, and error-free operation is achieved. Tolerance to supply voltage changes in the burst transmission is also measured to demonstrate the robustness of this system. In 6.4 Gb/s burst transmission, a BER under 10^-14 is achieved for ±10% variations of the supply voltage, confirming that source-synchronous transmission over the low-latency inductive-coupling link provides strong immunity to supply voltage changes.

4.6 Crosstalk reduction techniques

The burst transceiver introduced in the previous section reduces layout area by multiplexing data links. However, since its energy efficiency is very low owing to the large static current consumption in the high-speed MUX/DEMUX, it may be difficult to use for high-bandwidth communications beyond 1 Tb/s. The energy efficiency can be improved with parallel synchronous transceivers, but their area overhead should then be minimized by arranging the data links in high density. As mentioned earlier, the crosstalk between inductive-coupling channels then becomes more serious, and a crosstalk reduction technique is required for high-density channel arrangement. In this section we introduce two crosstalk reduction techniques.

4.6.1 Time interleaving

One of the techniques is based on time interleaving. Because multi-phase clocks create several operating time slots, we can divide the parallel inductive-coupling transceivers among these time slots, reducing the number of channels operating at the same time and hence the crosstalk. Figure 4.26 depicts a block diagram of 1 Tb/s parallel inductive-coupling transceivers with time interleaving. The transceiver comprises 16 slices of a 64-channel block, yielding 1024 channels of data transceivers in total. Each 64-channel block consists of 64 data transceivers and one clock transceiver. The transmitter clock Txclk is transmitted through the inductive coupling, and the receiver clock Rxclk is recovered by the clock transceiver. The clock frequency is 1 GHz. The phase interpolator (PI) generates four time slots per clock cycle by creating 4-phase clocks from both Txclk and Rxclk for time interleaving. The data transceivers are divided among the time slots to reduce crosstalk. Each data transceiver communicates at 1 Gb/s/channel, resulting in 1 Tb/s of data bandwidth from the 1024 parallel data links. The 4-phase clocks for time interleaving (TI) are assigned in a checkerboard pattern in the data transceiver array.

[Figure 4.26 blocks: 16 slices of a 64-channel Tx block and a 64-channel Rx block; in each block, data transmitters (IT0 to IT3 shown, 1 Gb/s each) and data receivers (VR0 to VR3), a clock transmitter (ITC) and clock receiver (VRC) at 1 GHz, and phase interpolators (PI) producing 4-phase clocks from Txclk and Rxclk; inputs Txdata0 to Txdata3, outputs Rxdata0 to Rxdata3. © 2007 IEEE]

Fig. 4.26 Block diagram of 1 Tb/s parallel inductive-coupling transceivers.

Figure 4.27 shows simulated waveforms of the received signal and crosstalk. When the channel pitch is reduced to 30 µm, the crosstalk increases to the same level as the signal (top of Figure 4.27). Two-phase TI reduces the crosstalk to half of the signal; however, this is not low enough for communication with a BER lower than 10^-13 (middle of Figure 4.27). Four-phase TI increases the equivalent channel pitch to twice the coil diameter and reduces the crosstalk to a peak voltage of 10 mV. This enables a BER lower than 10^-13.

[Figure 4.27 plots: received voltage [mV] (-50 to 50) versus time [ns] (0 to 3) for signal and crosstalk in a 64-channel block at 30 µm pitch. Top: without TI, crosstalk 50 mV. Middle: with 2-phase TI, crosstalk 25 mV. Bottom: with 4-phase TI, crosstalk 10 mV. © 2007 IEEE]

Fig. 4.27 Simulated waveforms of signal and crosstalk without (top), with 2-phase (middle), and with 4-phase (bottom) time interleaving.
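The effect of the checkerboard assignment on the equivalent channel pitch can be illustrated with a small geometric sketch. The exact phase map used on the chip is not given in the text, so the 2-phase and 4-phase patterns below (an alternating checkerboard and a repeating 2 × 2 tile) are assumptions, as are the function names. The sketch finds the distance from one channel to the nearest channel that shares its time slot.

```python
import math


def phase_of(row, col, n_phases):
    """Assumed time-slot assignment for the channel at grid position (row, col)."""
    if n_phases == 1:
        return 0
    if n_phases == 2:
        return (row + col) % 2               # alternating checkerboard
    if n_phases == 4:
        return 2 * (row % 2) + (col % 2)     # repeating 2 x 2 tile
    raise ValueError("only 1, 2, or 4 phases are modelled")


def nearest_same_phase_pitch(n_phases, pitch_um, size=8):
    """Smallest centre-to-centre distance between channel (0, 0) and another
    channel that operates in the same time slot."""
    best = math.inf
    for r in range(size):
        for c in range(size):
            if (r, c) == (0, 0):
                continue
            if phase_of(r, c, n_phases) == phase_of(0, 0, n_phases):
                best = min(best, pitch_um * math.hypot(r, c))
    return best


for n in (1, 2, 4):
    print(f"{n}-phase: {nearest_same_phase_pitch(n, 30.0):.1f} um")
```

With a 30 µm physical pitch, the assumed 4-phase tile places the nearest simultaneously active channel 60 µm away, twice the coil diameter as stated in the text, whereas the assumed 2-phase pattern only reaches the diagonal distance of about 42 µm.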

Figure 4.28 shows microphotographs of the test chips fabricated in 180 nm CMOS. The transmitter chip is placed on top of the receiver chip. Both chips are face-up and polished to 10 µm thickness. The communication distance, including an adhesive layer, is 15 µm. The clock transceiver transmits a 1 GHz clock through a coil with 200 µm diameter. One clock transceiver is provided for every 64 data transceivers. Each data transceiver communicates at 1 Gb/s/channel through a coil with 30 µm diameter. The 1024 data transceivers are arranged next to each other. The transmitter and receiver circuits are placed under the coils to save layout area. Because of the compact layout, inter-channel skew in the 64-channel block can be suppressed to 11 ps in the clock distribution network. The total layout area for the data link is only 1 mm^2.

[Figure 4.28 photos: coil for data link, channel pitch 30 µm; two groups of 512-channel data transceivers and 16-channel clock transceivers; a 64-channel data transceiver block with a 200 µm clock transceiver coil; transmitter chip (top) and receiver chip (bottom). © 2007 IEEE]

Fig. 4.28 Stacked test chips.

We measured the dependence of the BER on channel pitch and on the number of phases in TI. An on-chip timing controller changes the number of phases and the phase assignment so that the transceiver can be tested with 4-phase TI, with 2-phase TI, or without TI for comparison. A pitch controller selects the activated channels to change the channel pitch and the number of aggregated channels. Built-In Self-Test (BIST) circuits are implemented for BER measurement. Pseudo Random Binary Sequence (PRBS) generators produce a 2^23 - 1 word pattern for the transmitted data, and the number of errors in the received data is counted in the receiver. A scan chain initializes the PRBS generators and outputs the count of measured errors for BER measurement.

The measured results are plotted in Figure 4.29. Increasing the number of phases in TI reduces the crosstalk and allows the channel pitch to be shortened for the same BER. With 4-phase TI, 1024 transceivers arranged at a pitch of 30 µm operate at 1 Gb/s/ch with a BER lower than 10^-13. As a result, an aggregate data bandwidth of 1 Tb/s is achieved in a layout area of 1 mm^2, an area efficiency of 1 mm^2/Tb/s. The transceiver chip consumes 3 W at 1.8 V, and the energy efficiency is thus 3 pJ/b.
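The BIST flow described above, with a PRBS generator on the transmit side and an error counter on the receive side, can be sketched in software. The text does not state which generator polynomial the chips use, so the taps (23, 18), corresponding to the commonly used x^23 + x^18 + 1, are an assumption. The function names and the injected error positions are likewise illustrative.

```python
def prbs_bits(taps, n_bits, seed=1):
    """Fibonacci LFSR. taps are 1-based feedback positions, for example
    (23, 18) for x^23 + x^18 + 1. Yields n_bits output bits."""
    order = max(taps)
    mask = (1 << order) - 1
    state = seed & mask
    if state == 0:
        raise ValueError("seed must be non-zero")
    for _ in range(n_bits):
        feedback = 0
        for t in taps:
            feedback ^= (state >> (t - 1)) & 1
        yield (state >> (order - 1)) & 1          # output the oldest bit
        state = ((state << 1) | feedback) & mask


def count_bit_errors(sent, received):
    """Number of positions where the received bit differs from the sent bit."""
    return sum(1 for s, r in zip(sent, received) if s != r)


tx_bits = list(prbs_bits((23, 18), 10_000))
rx_bits = tx_bits.copy()
for position in (123, 4567):                      # two artificially injected errors
    rx_bits[position] ^= 1

errors = count_bit_errors(tx_bits, rx_bits)
print(f"errors = {errors}, BER = {errors / len(tx_bits):.1e}")
```

In hardware the receiver would regenerate the expected sequence locally rather than keep a copy of the transmitted bits; the comparison and counting step is otherwise the same.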

[Figure 4.29 plot: BER (10^-3 to 10^-13) versus channel pitch [µm] of 30 (1024 ch/mm^2), 60 (256 ch/mm^2), and 120 (64 ch/mm^2), for no TI, 2-phase TI, and 4-phase TI; data rate 1 Gb/s/ch, 2^23 - 1 PRBS data, power 3 mW/ch. © 2007 IEEE]

Fig. 4.29 Measured BER dependence on channel pitch.

4.6.2 Differential coil

Another crosstalk reduction technique is a kind of space-division multiplexing, in which a differential coil [14] provides directionality in the inductive coupling. The differential coil consists of two rectangular coils wound in opposite directions (Figure 4.30). The transmit current IT generates differential magnetic fluxes BL and BR, which have the same magnitude but opposite directions.

Fig. 4.30 Differential coil. (Labels: transmit current IT; magnetic fluxes BL and BR.)

A pair of differential coils has a coupling mode and a decoupling mode, depending on the angle ϕ between the transmitter coil and the receiver coil (Figure 4.31). In the coupling mode, where ϕ = 0° (left of Figure 4.31), BL induces VL in the left coil and BR induces VR in the right coil. The total received voltage VDIFF is the difference between VR and VL. Since VL and VR have the same magnitude but opposite polarity, VDIFF equals 2VR. In the decoupling mode, where ϕ = 90° (right of Figure 4.31), the magnetic flux in the receiver coil cancels out, so no received voltage is induced.

Fig. 4.31 Differential coils in coupling (left) and decoupling (right) direction. Coupling mode (ϕ = 0°): VDIFF = VR − VL = 2VR (VL = −VR). Decoupling mode (ϕ = 90°): VDIFF = VRU + VLU − (VRD + VLD) = 0 (VLU = −VRU, VLD = −VRD).
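The cancellation in the two modes can be checked with a toy bookkeeping model. The sketch below is not a field calculation: it assumes the overlap area is split into four equal quadrants, assigns each quadrant a unit flux of the sign produced by the transmitter lobe above it, and weights each quadrant by the polarity of the receiver lobe that covers it. The grid layout and names are illustrative assumptions.

```python
def received_voltage(tx_flux, rx_weights):
    """Signed sum of the flux picked up by the receiver lobes.

    Both arguments are 2x2 grids laid out as
    [[upper-left, upper-right], [lower-left, lower-right]].
    """
    return sum(
        flux * weight
        for flux_row, weight_row in zip(tx_flux, rx_weights)
        for flux, weight in zip(flux_row, weight_row)
    )


# Transmitter: the left lobe carries B_L = -1, the right lobe B_R = +1.
TX_FLUX = [[-1, +1],
           [-1, +1]]

# Receiver at 0 degrees: V_DIFF = V_R - V_L, so the right lobe counts
# positive and the left lobe negative.
RX_COUPLED = [[-1, +1],
              [-1, +1]]

# Receiver rotated by 90 degrees: its lobes are now upper (+) and lower (-),
# so each lobe sees equal amounts of B_L and B_R.
RX_DECOUPLED = [[+1, +1],
                [-1, -1]]

if __name__ == "__main__":
    print(received_voltage(TX_FLUX, RX_COUPLED))    # 4, i.e. 2 * V_R with V_R = 2
    print(received_voltage(TX_FLUX, RX_DECOUPLED))  # 0
```

The zero in the rotated case is what allows two orthogonal coil pairs to share the same footprint.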

Utilizing this property, two orthogonal differential coil pairs can be vertically overlapped to save footprint (Figure 4.32). The concept was evaluated by test chip measurement (Figure 4.33). The transmitter in channel 1 transmits PRBS data while the transmitter in channel 2 always transmits High. Since channel 1 and channel 2 are placed orthogonally to eliminate their crosstalk, channel 1 successfully communicates at 1.5 Gb/s with a BER under 10⁻¹⁴.

Thus far we have described a number of circuit techniques to enhance performance. In the next two sections, we present practical applications of inductive coupled communications.

Fig. 4.32 Channel overlap by using differential coils. (Labels: channel 1 with Txdata1 and Rxdata1; channel 2 with Txdata2 and Rxdata2.)

Fig. 4.33 Snapshot of data waveforms. (Conditions: Txdata1 = PRBS, Txdata2 = High; 1.5 Gb/s, 2²³ − 1 PRBS data, BER < 10⁻¹⁴. Labels: Tx1, Tx2, Rx, coupled, decoupled; Rxdata1 = Txdata1.)

4.7 Application I: memory stacking

Stacking multiple memory chips in a package is widely used to fabricate large-capacity memory devices. For example, commercial production of modern Solid-State Drives (SSDs) uses NAND flash memory chips with a thickness of less than 60 µm. Such a small thickness enables one package to contain 64 chips. This is conventionally difficult, however, because more than 1,500 wires must be connected for data access and power supply. A wireless interface based on inductive-coupling communication reduces the number of bonding wires needed for data access (Figure 4.34).

Fig. 4.34 SSD with 64 NAND flash memories using inductive coupling (left) and wire bonding (right). (Labels, inductive coupling: controller, coils, memory read and memory write paths, chip states Select, Repeat, and Sleep, 3 wires/chip (VDD, VSS, Reset). Labels, wire bonding: controller, 25 wires/chip. © 2009 IEEE)

A controller chip communicates with a stack of 64 NAND flash memory chips underneath using relayed transmission. Only 200 bonding wires are required for power supply and reset. This reduction in the number of bonding wires makes it possible to integrate 64 chips in one package, which conventionally would require eight separate packages.

However, the wireless interface has some issues when applied to homogeneous chip stacking. First, chips must be stacked with space to bond wires for power supply and to align coils for data access. Second, a receiver has to receive the signal from the intended transmitter and ignore the signal from the unintended transmitter on the opposite side, since the coil of any transmitter emits a magnetic field both upwards and downwards. In this section, we introduce inductive-coupling channel alignment schemes for homogeneous chip stacking and an inductive-coupling up/down repeater to prevent unintentional reception [17].


4.7.1 Homogeneous chip stacking

To provide space to bond wires for power supply while keeping the coils aligned, we used two inductive-coupling channel alignment schemes: half-turned-and-staggered stacking and terraced stacking (Figure 4.35). The former is superior in power consumption. Even-numbered chips are rotated by 180 degrees relative to the odd-numbered chips. One channel requires two physical links, one for the odd chips and the other for the even chips. This halves the number of repeaters needed to access a random chip, and with it the power consumption. The latter, on the other hand, allows thinner chips because the chip underneath provides mechanical support against the wire bonding force. Thinner chips reduce the communication distance and hence the coil size, so higher-density channels are possible.

Fig. 4.35 Inductive-coupling channel alignment scheme for homogeneous stacking. (Panels: half-turned-and-staggered stacking, chips n to n+5 in alternating odd and even orientation; terraced stacking, chips n to n+7. Labels: inductor, shield, inductive coupling, chip edge orientation N/E/S/W. © 2009 IEEE)

4.7.2 Inductive-coupling up/down repeater

To prevent unintentional reception, shields are added to block unintended links. Figure 4.36 shows the alignment of shields and coil pairs in stacked chips. Every coil pair has a transmitter and a receiver with Tx-enable and Rx-enable signals. The optimal selection of the active transmitter and receiver depends on the transmission direction. The transmitter and receiver are selected by bonding options of the reset wire, as described in the next paragraph. The received voltage VR is proportional to the coupling coefficient k between coils (Figure 4.37). Magnetic flux is reduced by eddy currents in the shield, which is made of pad metal. Since the distance Z from the transmitter to the shield is half of the communication distance X, the coupling coefficient k is large enough for communication over the intended link.

Fig. 4.36 Inductive-coupling up/down repeater with shields. (Panels: up link (memory read); down link (memory write). Labels: Tx, Rx, shield, Tx-enable, Rx-enable. © 2009 IEEE)

In addition, a chip access scheme has been developed to reduce power consumption. Repeaters are used for relayed transmission between a controller and a target memory. The state of each chip (Select, Repeat, or Sleep) is set by wireless communication. Figure 4.38 shows a Finite State Machine (FSM) controlled by 2-bit signals. For instance, if the controller chip transmits the control signals “01, 01, 11, 10, 00” (the sequence shown in Figure 4.38), Memory00 is set to Repeat by the first “01” and repeats the second “01” to Memory01. Memory01 is also set to Repeat and repeats “11” to Memory02. Memory02 is set to Select-ready by “11” and repeats “10” to Memory03. Memory03 is set to Sleep-ready by “10”. Memory02 is then set to Select and Memory03 to Sleep by “00”. This protocol is optimal in homogeneous stacking because it does not require a chip address for communication. As all receivers are shut down in Sleep, they cannot be turned on by wireless communication. To turn on the receivers, all memory chips are connected with reset wires from the controller.

Fig. 4.37 Calculated mutual inductance of inductive coupling between and through shield. (Axes: relative mutual inductance M/M0, 0.0 to 1.0, versus Z/X, 0 to 1. Curves: signal, crosstalk, with an optimal point marked. Annotation: VR = M dIT/dt. © 2009 IEEE)

Fig. 4.38 Programmable repeater circuit. (Example: the interface chip sends 01, 01, 11, 10, 00; Memory00 (Repeat) forwards 01, 11, 10, 00; Memory01 (Repeat) forwards 11, 10, 00; Memory02 (Select) forwards 10, 00; Memory03 (Sleep). FSM states: Receive, Repeat, Select Ready, Select, Sleep Ready, Sleep, with transitions labeled by the 2-bit codes and the reset wire. Blocks: Rx, Tx, FSM control with Rx-enable and Tx-enable, data, clock. © 2009 IEEE)

Every chip has a double-edge-triggered flip-flop to synchronize clock and data. The synchronization of clock and data is independent of the number of repeaters. The control signals of the NAND flash memory are transmitted in a packet and recovered by an Application Program Interface (API) decoder in the selected chip. Figure 4.39 shows the recovered commands for the NAND flash memory, which are the same as conventional ones. This wireless interface requires minimal changes to the memory peripheral circuits.
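The address-free chip access scheme described for Figure 4.38 can be illustrated with a small simulation. This is a sketch under stated assumptions rather than the actual FSM: it assumes that the first word a chip receives after reset programs that chip and is not forwarded, that Repeat and Select-ready chips forward every later word, and that “00” commits Select-ready to Select and Sleep-ready to Sleep. The state names and the `relay` function are hypothetical.

```python
REPEAT, SELECT_READY, SLEEP_READY, COMMIT = "01", "11", "10", "00"


def relay(words, n_chips):
    """Pass 2-bit control words down a chip stack; return each chip's final state."""
    states = ["Receive"] * n_chips  # receivers enabled through the reset wire
    incoming = list(words)
    for chip in range(n_chips):
        outgoing = []
        for word in incoming:
            state = states[chip]
            if state == "Receive":
                # The first programming word is consumed, not forwarded.
                if word == REPEAT:
                    states[chip] = "Repeat"
                elif word == SELECT_READY:
                    states[chip] = "Select-ready"
                elif word == SLEEP_READY:
                    states[chip] = "Sleep-ready"
            elif state == "Repeat":
                outgoing.append(word)
            elif state == "Select-ready":
                outgoing.append(word)
                if word == COMMIT:
                    states[chip] = "Select"
            elif state == "Sleep-ready":
                if word == COMMIT:
                    states[chip] = "Sleep"
        incoming = outgoing  # what the next chip down the stack receives
    return states


if __name__ == "__main__":
    # Sequence from the example: select Memory02, put Memory03 to sleep.
    print(relay(["01", "01", "11", "10", "00"], n_chips=4))
    # ['Repeat', 'Repeat', 'Select', 'Sleep']
```

Each chip only reacts to the word at the head of what it receives, which is why no chip address is needed: the position of “11” in the sequence determines which chip is selected.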

Fig. 4.39 Random access. (Blocks: Rx and Tx for data and clock, API decoder, counter, MUX, DEMUX, 1 GHz oscillator, memory. Signals: Rxdata, CLE, ALE, /WE, /RE, DataIN[15:0], DataOUT[15:0], CLKIN at 40 MHz. © 2009 IEEE)

4.7.3 Test chip measurement

Test chips were fabricated in a 180 nm CMOS technology. Six chips are stacked using half-turned-and-staggered stacking (Figure 4.40). The thickness of each chip is 60 µm; therefore the communication distance is 120 µm. The coil diameter is 200 µm and the shield is 400 µm square. The layout area is 27 µm square for each receiver and transmitter circuit, and 123 µm square for the FSM and control circuits. These circuits can be located below the coil. Figure 4.41 shows the measured BER dependence on transmission power at a data rate of 2 Gb/s. We measured a BER of less than 10⁻¹² and an average energy consumption in each chip of 15 pJ/b.

Fig. 4.40 Microphotograph of stacked test chips. (Labels: stacked test chips on evaluation board; inductors, 200 µm; shields, 400 µm. © 2009 IEEE)

Fig. 4.41 Measured BER dependence on transmission power. (Axes: BER, 10⁰ to 10⁻¹², versus transmission power, 0 to 12 mW, at 2 Gb/s. Curves: with shield, without shield. Annotations: BER = 0.5, BER < 10⁻¹², crosstalk through shield. © 2009 IEEE)

4.8 Application II: processor and memory stacking

Memory capacity and bandwidth are critical issues in a processor system. Integrating a large SRAM or eDRAM on a processor increases die size or adds process steps; either way, it raises both cost and leakage power. In low-power consumer electronics, it is desirable that a memory chip and a processor chip are each fabricated in their optimal process and integrated by heterogeneous chip stacking in a package. One of the technical challenges is providing a wide bandwidth between the processor and the memory. The gap between computing power and communication bandwidth can be filled if chip area, rather than only the chip periphery, is used for the data link.

Inductive coupled communication is a suitable technology for this application. As shown above, it is a circuit solution in a standard CMOS process, and hence less expensive than TSVs. It is comparable to TSVs in performance since ESD protection devices are not needed. Furthermore, it provides an AC-coupling interface, and therefore obviates the need for a level shifter. Power supply voltages can be different, and they can be changed for dynamic voltage scaling (DVS) and power gating with little impact on interface delay. In this section, we introduce the first 3D heterogeneous integration of a commercial mobile processor and a memory by using inductive coupling [11].

4.8.1 Heterogeneous chip stacking

Figure 4.42 presents microphotographs of the chips and their stacking. A 90 nm CMOS 8-core processor is mounted face down on a package by C4 bumps. A 65 nm CMOS 1 MB SRAM is glued to it face up, and its power is provided by conventional wire bonding. The two chips are each fabricated in their optimal process and supplied with their optimal voltages (1 V for the processor and 1.2 V for the SRAM). The thickness of each chip is 50 µm. The radius of the coils is the same as the communication distance, 120 µm. There are 18 data channels each for uplink and downlink. In total, 36 coils are arranged at a pitch of 243 µm by 320 µm. Both the rising and falling edges of a clock are used for 2-phase time interleaving to reduce crosstalk between adjacent channels. There are clock channels for source-synchronous transmission. Coils that are one size larger are employed to strengthen the coupling coefficient for an asynchronous channel. The total layout area for the inductive-coupling link is 2.82 mm². The area normalized by bandwidth is 0.15 mm²/Gbps, which is 1/3 of the normalized area for a conventional DDR2 interface in the same device technology.
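The normalized figures quoted in this chapter are simple ratios, and a short check makes the rounding visible. The sketch below uses only numbers stated in the text: the 2.82 mm² link area together with the 19.2 Gb/s aggregate rate given in Section 4.8.2, and the 3 W, 1024-channel, 1 Gb/s/ch link of Section 4.6. The helper names are illustrative.

```python
def area_per_bandwidth(area_mm2, bandwidth_gbps):
    """Layout area normalized by aggregate bandwidth, in mm^2 per Gb/s."""
    return area_mm2 / bandwidth_gbps


def energy_per_bit_pj(power_w, bandwidth_bps):
    """Energy per transferred bit in pJ/b."""
    return power_w / bandwidth_bps * 1e12


if __name__ == "__main__":
    # Processor/SRAM link: 2.82 mm^2 at 19.2 Gb/s.
    print(round(area_per_bandwidth(2.82, 19.2), 3))    # 0.147, quoted as 0.15
    # 1024 channels at 1 Gb/s each, 3 W total.
    print(round(energy_per_bit_pj(3.0, 1024e9), 2))    # 2.93, quoted as 3 pJ/b
```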

Fig. 4.42 Microphotograph of stacked test chips. (Labels: processor chip, 90 nm CMOS, 50 µm thick, VDD = 1 V, mounted face down on C4 bumps, with CPU0 to CPU7; SRAM chip, 1 MB, 65 nm CMOS, 50 µm thick, VDD = 1.2 V, stacked face up on the processor, with bonding for power to the SRAM; inductive-coupling links. © 2009 IEEE)

4.8.2 Interface design

Figure 4.43 depicts a block diagram. An inductive-coupling bus state controller (IBSC) supports packet-based communications by adding two signals (Strobe and Packet-end). Sixteen bits at 600 Mb/s, Data[15:0], are transmitted by synchronous parallel inductive-coupling links. The aggregated data rate is therefore 19.2 Gb/s. A control register in the IBSC is used for timing adjustment.

Timing adjustment is essential in a synchronous data link. As mentioned in Section 4.4.1, there is a tradeoff between power dissipation and timing margin. Since power dissipation in a transmitter is proportional to the square of the pulse width, the narrower the pulse, the smaller the power dissipation. The timing margin for sampling a narrow pulse, however, is reduced. Low-power design therefore requires accurate timing control. Adaptive circuits and systems are required to adjust the timing for the following reasons:

1. Timing jitter caused by PVT variations, especially in clock paths with a long latency through another chip
2. VDD changes by DVS
3. Inter-channel skew, especially when the channels are distributed over a wide area

Fig. 4.43 Block diagram of processor and SRAM chip stack. (SRAM chip, VDD = 1.2 V: 1 MB SRAM module (working memory for the CPUs), 150 Mbps × 64 bit, PHY of the inductive-coupling link. Processor chip, VDD = 1 V: PHY of the inductive-coupling link, IBSC (inductive-coupling bus state controller) with BIST, timing control, and control register, clock controller (300 MHz, 600 MHz), system bus, eight cores CPU0 to CPU7. Link: packet-based communication at 19.2 Gb/s over 18 bits (Data[15:0], Strobe, Packet-end) with a 600 MHz clock. © 2009 IEEE)

The timing jitter under PVT variations can be monitored and calibrated by a coarse timing control unit with the control register in the IBSC (Figure 4.44). Once the calibration result under each condition of DVS is stored in the control register, the timing control unit can adjust the timing for DVS instantly by digital control. The inter-channel de-skew can be performed by a fine timing control unit that is implemented in each channel. First, the control register sets a loopback path in the SRAM for a test mode (an SRAM “through” mode). Secondly, pass/fail information, like a shmoo plot, is stored in a register for both the uplink and downlink by changing the coarse timing. Thirdly, the coarse timing is set such that timing margin becomes the largest when all the channels pass. For each channel, fine timing is next tuned such that the timing margin becomes the largest.
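The two-step procedure above (a shared coarse setting chosen from pass/fail data, then a per-channel fine setting) can be sketched as follows. This is a minimal illustration, not the IBSC implementation: `link_passes(channel, coarse, fine)` is a hypothetical stand-in for a BIST run through the SRAM loopback path, the toy link model is invented for the example, and only the 36 ps coarse step is taken from the measured shmoo plot; the 4 ps fine step is an assumption.

```python
def widest_pass_window(passes):
    """Return (center_index, half_width) of the longest run of passing codes."""
    best_start, best_len, start = 0, 0, None
    for i, ok in enumerate(list(passes) + [False]):  # sentinel closes the last run
        if ok and start is None:
            start = i
        elif not ok and start is not None:
            if i - start > best_len:
                best_start, best_len = start, i - start
            start = None
    if best_len == 0:
        raise ValueError("no passing timing code found")
    half_width = (best_len - 1) // 2
    return best_start + half_width, half_width


def calibrate(link_passes, n_channels, n_coarse, n_fine, default_fine=0):
    """Pick one coarse code for all channels, then one fine code per channel."""
    # Step 1: coarse sweep; a code counts as passing only if every channel passes.
    all_pass = [
        all(link_passes(ch, c, default_fine) for ch in range(n_channels))
        for c in range(n_coarse)
    ]
    coarse, _ = widest_pass_window(all_pass)
    # Step 2: per-channel fine sweep at the chosen coarse code.
    fine = [
        widest_pass_window([link_passes(ch, coarse, f) for f in range(n_fine)])[0]
        for ch in range(n_channels)
    ]
    return coarse, fine


def toy_link(channel, coarse, fine):
    """Hypothetical link: passes when the sample time is within 20 ps of the eye center."""
    skew_ps = (0, 8, -8)[channel]                 # assumed inter-channel skew
    sample_ps = 36 * coarse + 4 * (fine - 8)      # 36 ps coarse, 4 ps fine steps
    return abs(sample_ps - (180 + skew_ps)) <= 20


if __name__ == "__main__":
    print(calibrate(toy_link, n_channels=3, n_coarse=10, n_fine=16, default_fine=8))
    # (5, [8, 10, 6])
```

Storing one such result per DVS operating point corresponds to the control register entries that let the timing be switched digitally when the supply voltage changes.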

Fig. 4.44 Block diagram of inductive-coupling link with two-step (coarse/fine) adaptive timing adjustment. (Processor chip: IBSC, control register, timing control, BIST, coarse and fine delay stages. SRAM chip: 1 MB SRAM module with through mode. Uplink and downlink: 18 data channels and 1 clock channel each, with flip-flops (FF) at the Tx and Rx ends.)

4.8.3 Test chip measurement

The SRAM was accessed for both read and write from the processor, and the BER was measured while changing the setting of the control register. A timing shmoo plot is depicted in Figure 4.45. A bathtub curve at the condition marked by a broken line (from TA to TB) is also depicted. A BER of less than 10⁻¹⁴ with a 2³¹ − 1 PRBS is achieved. After optimizing the timing by setting the control register at the center of the shmoo plot, tolerance against VDD and temperature changes was measured. The measured result is presented in Figure 4.46. No single bit failed under ±5% VDD variations and temperatures ranging from 25°C to 55°C.

Fig. 4.45 Measured timing shmoo plot. (Shmoo axes: timing in downlink, TD, versus timing in uplink, TU, at 36 ps/step, with 180 ps annotations and the optimal timing marked. Bathtub curve: BER, 10⁰ to 10⁻¹⁴, versus TU from TA to TB. Test pattern: 2³¹ − 1 PRBS. © 2009 IEEE)

Fig. 4.46 Measured tolerance against variations in supply voltages and temperature. (Axes: variation in supply voltage of the SRAM chip, VDD,SRAM, versus variation in supply voltage of the processor chip, VDD,Processor, each from −5% to +5%. Conditions: T = 25°C and T = 55°C, 2³¹ − 1 PRBS, BER = 10⁻¹². Regions: PASS, FAIL. Annotations: 1.2 V, 1.05 V. © 2009 IEEE)

4.9 Conclusion

This chapter describes inductive-coupled communications for stacked chips in a package. It is a wireless circuit solution made in a standard CMOS technology. Compared to mechanical wired solutions, significant cost reduction can be achieved by eliminating additional process steps. Modeling and design of the inductive-coupling channel and transceiver are also presented for highly reliable wireless communications. In addition, circuit techniques for performance enhancement are introduced. Test chip measurements demonstrate high-performance and highly reliable inter-chip communications that are competitive with the wired solution. Two practical applications are introduced, one in homogeneous and one in heterogeneous chip stacking. Successful operation is demonstrated in the prototype measurements. Inductive-coupled communication is ready for practical use.

References

1. D. Mizoguchi, Y.B. Yusof, N. Miura, T. Sakurai, and T. Kuroda, “A 1.2 Gb/s/pin wireless superconnect based on inductive inter-chip signaling (IIS),” Digest of Technical Papers, IEEE International Solid-State Circuits Conference, 2004, pp. 142–143.
2. N. Miura, D. Mizoguchi, M. Inoue, H. Tsuji, T. Sakurai, T. Kuroda, “A 195 Gb/s 1.2 W 3D-stacked inductive inter-chip wireless superconnect with transmit power control scheme,” Digest of Technical Papers, IEEE International Solid-State Circuits Conference, 2005, pp. 264–265.
3. N. Miura, D. Mizoguchi, M. Inoue, K. Niitsu, Y. Nakagawa, M. Tago, M. Fukaishi, T. Sakurai, T. Kuroda, “A 1 Tb/s 3 W inductive-coupling transceiver for inter-chip clock and data link,” Digest of Technical Papers, IEEE International Solid-State Circuits Conference, 2006, pp. 424–425.
4. N. Miura, D. Mizoguchi, M. Inoue, K. Niitsu, Y. Nakagawa, M. Tago, M. Fukaishi, T. Sakurai, T. Kuroda, “A 1 Tb/s 3 W inductive-coupling transceiver for 3D-stacked inter-chip clock and data link,” IEEE Journal of Solid-State Circuits, vol. 42, no. 1, 2007, pp. 111–122.
5. N. Miura, H. Ishikuro, T. Sakurai, T. Kuroda, “A 0.14 pJ/b inductive-coupling inter-chip data transceiver with digitally-controlled precise pulse shaping,” Digest of Technical Papers, IEEE International Solid-State Circuits Conference, 2007, pp. 358–359.
6. N. Miura, H. Ishikuro, K. Niitsu, T. Sakurai, T. Kuroda, “A 0.14 pJ/b inductive-coupling transceiver with digitally-controlled precise pulse shaping,” IEEE Journal of Solid-State Circuits, vol. 43, no. 1, 2008, pp. 285–291.
7. N. Miura, Y. Kohama, Y. Sugimori, H. Ishikuro, T. Sakurai, T. Kuroda, “An 11 Gb/s inductive-coupling link with burst transmission,” Digest of Technical Papers, IEEE International Solid-State Circuits Conference, 2008, pp. 298–299.
8. N. Miura, Y. Kohama, Y. Sugimori, H. Ishikuro, T. Sakurai, T. Kuroda, “A high-speed inductive-coupling link with burst transmission,” IEEE Journal of Solid-State Circuits, vol. 44, no. 3, 2009, pp. 947–955.
9. H. Ishikuro, T. Sugahara, T. Kuroda, “An attachable wireless chip-access interface for arbitrary data rate using pulse-based inductive-coupling through LSI package,” Digest of Technical Papers, IEEE International Solid-State Circuits Conference, 2007, pp. 360–361.
10. Y. Yuxiang, Y. Yoshida, T. Kuroda, “Non-contact 10% efficient 36 mW power delivery using on-chip inductor in 0.18 µm CMOS,” Digest of Technical Papers, IEEE Asian Solid-State Circuits Conference, 2007, pp. 115–118.
11. K. Niitsu, Y. Shimazaki, Y. Sugimori, Y. Kohama, K. Kasuga, I. Nonomura, M. Saen, S. Komatsu, K. Osada, N. Irie, T. Hattori, A. Hasegawa, and T. Kuroda, “An inductive-coupling link for 3D integration of a 90nm CMOS processor and a 65nm CMOS SRAM,” Digest of Technical Papers, IEEE International Solid-State Circuits Conference, 2009, pp. 480–481.
12. R.J. Drost, R.D. Hopkins, R. Ho, I. Sutherland, “Proximity communication,” IEEE Journal of Solid-State Circuits, vol. 39, no. 9, 2004, pp. 1529–1535.
13. N. Miura, T. Sakurai, T. Kuroda, “Crosstalk countermeasures for high-density inductive-coupling channel array,” IEEE Journal of Solid-State Circuits, vol. 42, no. 2, 2007, pp. 410–421.
14. Y. Yoshida, N. Miura, T. Kuroda, “A 2 Gb/s bi-directional inter-chip data transceiver with differential inductors for high density inductive channel array,” Digest of Technical Papers, Asian Solid-State Circuits Conference, 2007, pp. 127–130.
15. N. Miura, D. Mizoguchi, T. Sakurai, T. Kuroda, “Analysis and design of inductive coupling and transceiver circuit for inductive inter-chip wireless superconnect,” IEEE Journal of Solid-State Circuits, vol. 40, no. 4, 2005, pp. 829–837.
16. K. Niitsu, S. Kawai, N. Miura, H. Ishikuro, T. Kuroda, “A 65 fJ/b inductive-coupling inter-chip transceiver using charge recycling technique for power-aware 3D system integration,” Digest of Technical Papers, Asian Solid-State Circuits Conference, 2008, pp. 97–100.
17. Y. Sugimori, Y. Kohama, M. Saito, Y. Yoshida, N. Miura, H. Ishikuro, T. Sakurai and T. Kuroda, “A 2 Gb/s 15 pJ/b/chip inductive-coupling programmable bus for NAND flash memory stacking,” Digest of Technical Papers, IEEE International Solid-State Circuits Conference, 2009, pp. 244–245.

Chapter 5

Use of AC Coupled Interconnect in Contactless Packaging

Paul Franzon

5.1 Introduction: Why use ACCI?

Contactless packaging refers to the concept of using capacitively coupled or inductively coupled structures to connect chips together; that is, using them as the electrical interfaces for chip-to-package, package-to-socket, and board-to-board interfaces (i.e., connectors). Collectively, these techniques are referred to as “AC Coupled Interconnect” (ACCI), referring to the lack of a DC connection. These concepts arise from the key realization that a DC connection is not needed to communicate high-frequency digital data; a good AC connection suffices. In fact, many high-speed chip-to-chip communication standards, such as Fibre Channel and Ethernet, use DC blocking capacitors so that the transmitter and receiver do not have to share a common DC supply, making them hot-pluggable. The difference in ACCI is the realization that smaller values of capacitance can be used.

Using AC Coupled Interconnect in packaging structures brings a number of potential benefits (see Figure 5.1). The structures are mechanically simple and thus can be miniaturized without concerns involving mechanical issues, such as pin and hole alignment and insertion force. The opposing half-capacitor plates or inductors can slide with respect to each other, so relative motion due to different rates of thermal expansion in the relevant materials does not lead to manufacturability and reliability-induced limits, as it normally does with soldered connections. The mechanical simplicity of the connection structures provides potential for complex 3D geometries and 3D packaging. There is also the potential for low power consumption due to reduced signal swing, without the need for multiple supplies, regulators, etc.

Professor Paul Franzon, Department of Electrical and Computer Engineering, North Carolina State University, Raleigh, NC, 27695, USA, e-mail: [email protected]

R. Ho and R. Drost (eds.), Coupled Data Communication Techniques for High-Performance and Low-Power Computing, Integrated Circuits and Systems, DOI 10.1007/978-1-4419-6588-2_5, © Springer Science+Business Media, LLC 2010

Fig. 5.1 Potential benefits of using AC Coupled Interconnect in packaging structures include high I/O density, high I/O count, large separable connectors, potential for low-power operation and ESD protection elimination (adapted in part from [1]). (Panels: high density chip-substrate packaging; true 3D packaging; high density connectors and sockets. Labels: chip, package, backplane; L1: wire-bond, L2: BGA socket, L3: connector.)

One disadvantage of using AC Coupled Interconnect is the need to work out how to build the DC power and ground interface into the same face-to-face structure. How this can be solved in different scenarios is explained in detail below.

Many other ACCI concepts are chip-to-chip. What are the main differences that come about by using ACCI in package structures? One thing it makes tremendously easier is physical design. There is no need to physically align the half-capacitors or inductor coils in the chips. There is no need to determine, track, and keep a physical placement standard for the chip-to-chip connections, or to redesign and refabricate all the affected chips if one alignment point needs to be moved. However, using ACCI at the package level makes circuit design more complex. It is much easier to drive a signal through a series capacitor if it is loaded by the high-impedance input of a receiver than by a 50 Ω transmission line. Another difficulty is the need to cope with all the signal integrity and power integrity issues that all I/O designers have to worry about.

Overall, ACCI can enable low-cost, high-density, high pin count, and large package and connector structures. For example, it can uniquely enable chips that are larger than can be reliably packaged today, chips with more than 10,000 I/O, and high density connectors with thousands of connection “pins” at 0.1 mm or less pitch. It can do this while permitting low signal swing, and thus low power consumption.

5.1.1 Chapter outline

This chapter is outlined briefly as follows. First, we provide a historical perspective on the early work in capacitive and inductive coupling, before its wider adoption and investigation, as evidenced in the other chapters in this book. Then capacitive coupling is explored as a circuit element, as implemented in chip-to-package structures and in the middle of the channel. Finally, inductive coupling is explored, mainly in the context of enabling high density sockets and connectors.

5.2 Historical Perspectives

The first identification of the use of capacitive coupling to enable chip-to-chip communications was by Tom Knight of MIT and David Salzman of Polychip [2]. Soon after that, the author’s group at North Carolina State University started working with the inventors to solve some of the engineering issues and to look at a broader range of applications [3]. Around the same time, groups at Siemens and Lucent Bell Labs demonstrated some circuit solutions for capacitively coupled systems [4, 5]. The author’s group at NCSU went on to develop practical packaging solutions for capacitively coupled chip I/O, and also identified inductive coupling as an alternative AC coupled mechanism [6]. They were the first to demonstrate 3D chip-to-chip inductive coupling [7, 8] as well as to explore the package level solutions outlined in this chapter. Since then, several groups in the US and Japan have been exploring both capacitive and inductive coupling as described elsewhere in this book.

5.3 Capacitively Coupled Chip I/O

Capacitively coupled chip I/O brings a number of potential advantages to chip-to-package interface structures (see Figure 5.1). The I/O can be built at tighter pitches than can be reliably achieved with solder bumps. Solder bumps are typically built at pitches between 150 and 250 µm to ensure good yield and to be tall enough to be compliant. Tighter pitches in turn permit higher I/O counts for the same silicon area, and thus larger off-chip bandwidth. In addition, capacitive coupling can be used to expand the area of a chip that can be used for I/O. Normally designers keep solder bump connections towards the center of a chip-substrate interface so that the stresses induced by the different rates of thermal expansion of the chip and substrate do not lead to solder bump failure. Opposing capacitive plates are inherently compliant and thus stress is not an issue. Arrays of capacitively coupled connections can be built outside the array of solder bumps in order to increase the total I/O count. A further advantage is the potential to eliminate electrostatic discharge (ESD) protection. Eliminating the capacitive overhead of these protection circuits makes the design of high-speed I/O easier. Finally, signaling through capacitive junctions can reduce the power consumption compared with conventional signaling, because it reduces signal swing. The latter advantage can be enjoyed even if the capacitor is not built at the chip-package interface. For example, it can be built as a metal-metal structure on-chip.

5.3.1 Capacitively Coupled Channel Design

Several different capacitively coupled circuit topologies are possible, as illustrated in Figure 5.2, which shows single-ended and fully differential signaling. Further variations include having capacitors at only one end of the transmission line, or at both ends. In this sub-section, the basic approach to circuit design, including sizing of the capacitors, is explored.

The basic circuit structure of an ACCI channel is shown in Figure 5.3. Normal binary (non-return to zero, or NRZ) signals are produced at the transmitter. The first series capacitor acts like a high-pass filter in the frequency domain, or a differentiator in the time domain. As a differentiator, it turns each edge in the NRZ data into a pulse that is then sent down the transmission line. The pulse shape is asymmetric, with a slower falling edge, as one would expect from the step response of this circuit. As the pulse travels down the line, its tail also gets longer due to the high-frequency losses in the line. The second capacitor also acts like a high-pass filter, suppressing the long tail. Because this RX capacitor is loaded by a high input impedance circuit, it does this suppression particularly well. The pulses are restored to NRZ data using a pulse receiver, which is discussed below.

In general, the faster the edges in the NRZ data coming from the transmitter, the bigger the pulses. This arises naturally from the consideration that the first capacitor acts like a differentiator in the time domain. Thus capacitive coupling gets more attractive at more aggressive circuit nodes. For example, in a 45 nm process, 20 Gbps pulse signaling should be possible [10].

The reader should note that in Figure 5.3, the parasitic capacitors are shown along with the main series-connected coupling capacitors. These are important to include. The magnitude of the pulse degrades in proportion to the ratio of the parasitic capacitance to the coupling capacitance.
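Two of the statements above, that faster edges give bigger pulses and that parasitic capacitance erodes the pulse, can be illustrated with first-order formulas. The sketch below is an idealized model rather than the chapter's simulation: it assumes an ideal voltage source with a linear edge, a single series capacitor driving a lumped resistive load (standing in for a 50 Ω line), and a purely capacitive divider at a high-impedance receiver. The function names and example values are illustrative.

```python
import math


def peak_pulse(v_swing, rise_time, r_load, c_couple):
    """Peak of the pulse when a linear edge drives a series C into a resistor R.

    Solving dv_out/dt + v_out/tau = dv_in/dt for a ramp of duration rise_time
    gives a peak of v_swing * (tau / t_r) * (1 - exp(-t_r / tau)) at t = t_r.
    """
    tau = r_load * c_couple
    return v_swing * (tau / rise_time) * (1.0 - math.exp(-rise_time / tau))


def divider_gain(c_couple, c_parasitic):
    """Ideal capacitive divider: coupling capacitor into a shunt parasitic capacitor."""
    return c_couple / (c_couple + c_parasitic)


if __name__ == "__main__":
    for rise_ps in (20, 50, 100):
        peak = peak_pulse(1.0, rise_ps * 1e-12, r_load=50.0, c_couple=150e-15)
        print(f"rise time {rise_ps:3d} ps -> peak {peak:.3f} V per volt of swing")
    # Parasitic equal to the coupling capacitance halves the received pulse.
    print(divider_gain(150e-15, 150e-15))  # 0.5
```

With these assumed values the time constant is 7.5 ps, so the pulse peak falls roughly in inverse proportion to the rise time once the edge is slower than a few time constants.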
With poor design, the value of the parasitic capacitance can approach the

Fig. 5.2 Different circuit topologies that can be used to implement capacitively coupled chip-to-chip channels. Panels: single-ended, single-sided; differential, single-sided; single-ended, double-sided; differential, double-sided.

value of the coupling capacitance. This is important to consider in practical implementations. All results in this chapter include the actual parasitics.

It is interesting to contrast an ACCI channel with the operation of a normal chip-to-chip channel. This contrast is illustrated in Figures 5.4 and 5.5. Figure 5.4 shows the basic concepts used in a typical implementation of a high-speed chip-to-chip communications scheme. Due to frequency-dependent skin resistance and dielectric losses, the transmission line channel acts like a low-pass filter. Digital signal processing is used at the transmitter to create a high-pass filter that compensates for the losses in the transmission lines and creates a wider pass band. The signal processing increases the complexity and power consumption of the circuits.

Figure 5.5 shows the equivalent diagram for an ACCI channel. The transmission lines display their normal low-pass response. When the high-pass series capacitors are added, this becomes a band-pass response. Since the flat band of the composite channel response is quite wide, no active equalization is needed: the equalization is achieved using passive structures. The receiver includes a latch, which essentially acts like an integrator, bringing the effective pass band of the channel down to DC. Note that, because of the latch in the pulse receiver, no coding or other method to remove low-frequency data is needed.

In Figure 5.6, the frequency response of a typical channel (a 20 cm long transmission line) is compared with the same channel with two 150 fF series capacitors added. Note that by adding series capacitors, the high pass 3 dB point is moved from

Fig. 5.3 Simulations of signals at different points on a capacitively coupled channel, illustrating basic operation [9]. (a) Channel structure: the TX on Chip 1 and the RX on Chip 2 are joined through coupling capacitors (CCTX, CCRX) and a coupled 50 Ω T-line up to 30 cm long, with probe points A to E. (b) Waveforms i to v at points A to E: TX out (NRZ data), the RZ pulse signal in the ACCI channel, and RX out (NRZ data). © 2006 IEEE

4.1 GHz to 12.5 GHz. No active equalization is needed in this channel for any signaling operating at a rate of less than 25 Gbps! Figure 5.7 shows how the equalization works in the time domain. The signals before and after the second capacitor are both shown. The first series capacitor differentiates the signal, turning the edges in the NRZ data train into a series of pulses with longer tails than leading edges. The long tails result in a significant amount of inter-symbol interference (ISI). In this example, the first pulse has a 13% residual at the start of the next bit period; this is the ISI. Since the pulse is meant to be a return-to-zero (RZ) signal, any residual signal at the end of the bit period becomes ISI. However, this ISI is rejected by the series capacitor at the receiver, as explained above. Both capacitors are needed to create a high quality passively equalized channel.

There are tradeoffs involving the capacitor size. If larger capacitors are used, the signal swing is larger, making the signal-to-noise ratio (SNR) larger and the bit error rate (BER) better. On the other hand, larger capacitors increase the residual ISI. Smaller capacitors improve the ISI, take less area, and reduce the power consumption, but reduce the SNR. The reduced power consumption comes about because of the reduced signal swing used to charge and discharge the capacitance of the transmission line during signaling. The SNR-limited signal swing at the pulse receivers discussed below is 60 mV differential peak to peak. This gives a BER of better than 10⁻¹². While 60 mV might sound low, 20 mV swings have been commercially demonstrated, and are even advantageous in terms of power consumption [11].

Figure 5.8 shows the range of acceptable coupling capacitance as a function of the transmission line length. This is for the differential channel described below, as implemented in 0.18 µm CMOS for 3 Gbps signaling. If the capacitance is too small, or the line too long, the signal swing at the receiver gets too small. On the other hand, if the capacitance is too high, the ISI is too large. Note that as the edge rate gets faster, e.g. due to a faster transistor technology, smaller capacitances and longer lines become more feasible, as the pulse signal swing will increase. As the bit rate gets faster, the points on this graph associated with larger capacitances and longer lines are more likely to fail due to ISI. It is particularly encouraging that the designs work for a wide range of capacitance sizes and transmission line lengths. The scheme is not very sensitive to manufacturing process variations.

Fig. 5.4 Design of a conventional equalized high-speed chip-to-chip communications channel [9]. The TX, channel, and RX (equalizer with delays, sampler, and clock) are shown together with the frequency responses, up to bitrate/2, of the equalizing filter, the channel, and the equalized channel. © 2006 IEEE

Fig. 5.5 Design of an AC coupled high-speed chip-to-chip communications channel [9]. Frequency responses, up to bitrate/2, are shown for the ACCI channel, the pulse RX (latch), and the equalized channel. © 2006 IEEE

Fig. 5.6 Frequency response of a 20 cm long PCB line and the same line with two 150 fF series coupling capacitors added at each end. Axes: volts (dB) versus frequency (Hz); the marked 3 dB points are 4.1 GHz for the transmission line and 12.5 GHz for ACCI.
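The earlier remark that parasitic capacitance erodes the pulse magnitude can be illustrated with a rough sketch. It assumes a plain capacitive divider between the coupling capacitor and a lumped parasitic to ground; that model, the function name `divider_ratio`, and the sample ratios are illustrative assumptions and not the chapter's circuit model. It captures only the relative loss from parasitics, not the absolute received swing, which also depends on edge rate and line loss.

```python
def divider_ratio(c_coupling, c_parasitic):
    """Fraction of an edge that survives a capacitive divider.

    Assumed model: series coupling capacitor feeding a lumped parasitic
    capacitance to ground. Real channels add line loss and termination.
    """
    return c_coupling / (c_coupling + c_parasitic)


C_C = 150e-15  # nominal coupling capacitance used in the chapter, farads

# Hypothetical parasitic fractions; 1.0 means Cp has grown to equal Cc.
for fraction in (0.15, 0.30, 1.0):
    ratio = divider_ratio(C_C, fraction * C_C)
    print(f"Cp = {fraction:4.2f} x Cc -> surviving fraction {ratio:.2f}")
```

Under this assumption, a parasitic equal to the coupling capacitance halves the pulse, which is why the parasitics need to be kept well below the coupling value.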

Fig. 5.7 Time domain illustration of how ACCI creates an equalized channel. Capacitors are needed at each end of the channel to create a fully equalized, high-data rate channel [9]. Axes: Vpp (mV) versus time (ps); the TX and RX waveforms are shown for the current bit and the next bit, with annotated tails of 13% and 3%. © 2006 IEEE

Most of the designs below used a nominal capacitance of 150 fF. Figure 5.9 shows the size of this capacitance, and other capacitances, when implemented between the top metal on the chip and metal on a substrate. These calculations assume that a 1.7 µm thick overglass on the chip is not thinned. It can be seen here that a 150 fF capacitor with a 1 µm air gap would have a size of 150 µm on a side. If the air gap were reduced to zero, this would be reduced to 65 µm on a side, but some of the mechanical advantage of ACCI would be lost. Nonetheless, this scale of air gap would be achievable in a silicon package.

One way to dramatically reduce the plate size, or to permit larger plate-to-plate gaps, is to introduce a high dielectric constant material. Epoxy loaded with SrTiO3 can achieve relative dielectric constants of more than 20 [12]. Using such a material in a silicon packaging scenario would permit pads of less than 50 µm on a side to be used. In addition, such a material would permit organic packaging to be used. In organic packaging, the surface roughness is typically in the range of 3–5 µm. Pad sizes of less than 70 µm on a side are possible with this range of epoxy-filled gap. Of course, the use of epoxy does reduce the mechanical stress advantage of using ACCI, and this would need to be fully evaluated.

In this sub-section, it was assumed that the series capacitances are physically and thus electrically small (less than 1 nF). Small capacitances result in a band-pass circuit and pulse signaling as described above. If the capacitance is larger, then the lower edge of the pass band becomes 1 MHz or less, and conventional NRZ signaling is possible, as long as some code, such as an 8B10B code, is used to eliminate low-frequency information. This is explored further by Xu and Su et al. [1, 13]. Some existing standards such as Fibre Channel use large DC blocking capacitors and coded NRZ signaling.

Fig. 5.8 Plot showing valid range of coupling capacitance and transmission line length for 3 Gbps signaling in a 0.18 µm CMOS technology. This is using the receiver described in Figure 5.11b below [9]. Axes: coupling capacitance (65 to 205 fF) versus T-line length (cm), with each point marked pass or fail. © 2006 IEEE

Fig. 5.9 Capacitor size (length of one edge) versus gap and required capacitance (150 fF, 225 fF, and 300 fF), assuming an air gap (left panel, 0 to 5 µm) and a gap filled with an epoxy loaded to κ = 20 (right panel, 2 to 12 µm filling thickness; chip: TSMC, substrate: MCNC). These calculations assume a 1.7 µm thick SiO2 overglass on the chip capacitor [1].
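The plate-size discussion above can be approximated with an ideal parallel-plate estimate. The sketch below treats the overglass and the gap as two dielectrics in series and ignores fringing fields. The overglass permittivity of 3.9 is a common textbook value for SiO2 and is an assumption here, since the chapter does not state the value behind Figure 5.9, so the results land in the same range as the quoted sizes without reproducing them exactly. The function name `plate_side` is hypothetical.

```python
import math

EPS0 = 8.854e-12  # vacuum permittivity, F/m


def plate_side(c_target, gap, eps_gap=1.0, t_glass=1.7e-6, eps_glass=3.9):
    """Edge length (m) of a square plate giving c_target farads.

    Assumed stack: overglass of thickness t_glass in series with a gap
    filled by a material of relative permittivity eps_gap. No fringing.
    """
    d_eff = t_glass / eps_glass + gap / eps_gap  # equivalent vacuum thickness
    area = c_target * d_eff / EPS0
    return math.sqrt(area)


# 150 fF across a 1 um air gap, then across a 4 um gap filled with k = 20 epoxy.
air = plate_side(150e-15, 1e-6)
epoxy = plate_side(150e-15, 4e-6, eps_gap=20.0)
print(f"air gap 1 um   : {air * 1e6:.0f} um on a side")
print(f"k=20 gap 4 um  : {epoxy * 1e6:.0f} um on a side")
```

The useful point is the trend: filling the gap with a high-permittivity material shrinks its contribution to the effective thickness, so the overglass starts to dominate the plate size.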

5.3.2 ACCI Circuits

Typical single-ended circuits used for capacitively coupled structures are shown in Figure 5.10, and some typical differential circuits are shown in Figure 5.11. The main difference between the circuits for an ACCI channel and a conventional channel is the need for a self-biased receiver. Because there is no DC component to the received signal, the DC average at the receiver wanders with the relative number of 1's and 0's. Without a run-length code the receiver bias would eventually saturate and the receiver would stop functioning. Some form of self-biased receiver is needed.

Fig. 5.10 Single-ended ACCI driver and receiver [1]. Panels: transmitter circuit (TXin to TXout) and receiver circuit (RXin to RXout).

The transmitter in both cases (single-ended and differential) is a voltage-mode full-swing transmitter. This reduces the driver power considerably (when compared with a current-mode driver), at the expense of a slight increase in simultaneous switching noise. The single-ended receiver uses a modified version of the circuit first introduced by Kuhn to provide this self-bias [4]. It uses a diode-connected FET to bias the receiver to around VDD/2. For a more complete description of this circuit, please see Xu [1].

Two different fully differential circuits are shown in Figure 5.11. Again, a full-swing complementary voltage-mode driver is used to save power and permit operation at future low supply voltages. Two receivers are shown. The low-swing receiver is optimized for receiver sensitivity. It can reliably detect swings as small as 60 mV. Again, it uses feedback to self-bias the receiver. In addition to diode-connected feedback (M1–M4), two VDD-gated feedback transistors (M5 and M6) are added to provide extra stability. The outputs of these two amplifiers (RXout+/-) are fed into a latch that restores the NRZ signal. This circuit was demonstrated at 3 Gbps over 15 cm long lines through 150 fF on-chip coupling capacitors. The die and board photos are shown in Figure 5.12,

Fig. 5.11 Differential ACCI driver and receivers: (a) H-bridge voltage-mode driver; (b) low-swing pulse detector with latch; and (c) high-speed fully differential pulse receiver, comprising a bias generator, differential amplifier, and latch [9, 14]. © 2006 IEEE

and a summary of performance is given in Table 5.1. The circuit achieved a 20% power savings over a conventional equalized serial link. Most of the power savings was achieved at the driver. For more details, the reader is referred to Luo et al. [9].

The second receiver circuit (Figure 5.11c) is optimized for high speed. It is essentially a fully differential current-mode version of the circuit shown in Figure 5.11b. The die and board photos, together with the received eye diagram, are shown in Figure 5.13. The performance summary is given in Table 5.2. It operated at 6 Gbps over 30 cm of line at a core TX/RX power of 11.8 mW, giving a very efficient 1.97 mW/Gbps.

The power and signal integrity of these capacitively coupled circuits have been investigated in detail and, in general, are comparable to their conventional counterparts. Details can be found in Luo et al. [10].
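The efficiency figures quoted for the two test chips are simple ratios of power to data rate. A short check, using only the core TX/RX powers and rates stated in the text, the figure captions, and Table 5.2 (the helper name `mw_per_gbps` is illustrative):

```python
def mw_per_gbps(power_mw, rate_gbps):
    """Link efficiency in mW/Gbps, which is numerically equal to pJ/bit."""
    return power_mw / rate_gbps


# (label, core TX/RX power in mW, data rate in Gbps)
links = [
    ("3 Gbps low-swing receiver, one channel", 15.0, 3.0),
    ("6 Gbps differential receiver, one channel", 11.8, 6.0),
    ("6 Gbps design, 12 TX+RX channels", 141.6, 72.0),
]

for label, power, rate in links:
    print(f"{label}: {mw_per_gbps(power, rate):.2f} mW/Gbps")
```

Because 1 mW/Gbps equals 1 pJ/bit, the 6 Gbps design spends just under 2 pJ on each transmitted bit in its core transmitter and receiver.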

Fig. 5.12 Die and board photos, together with an eye diagram at the receiver output (after data recovery and deserializer) of the implementation of the circuits described in Figures 5.11a and 5.11b. It operated at 3 Gbps over 15 cm of lines at a core TX/RX power of 15 mW [9]. The die photo labels TX1-7 and pads, RX1-5 and pads, and the test structures TXT and RXT 1 to RXT 5. © 2006 IEEE

Table 5.1 Summary of performance achieved with the chip given in Figure 5.12.

Process                                   TSMC 0.18 µm 1.8 V CMOS
Data rate                                 3 Gbps/channel
BER                                       under 10⁻¹²
Coupling caps                             60 µm by 60 µm on-chip (150 fF)
Link                                      15 cm long 50 Ω microstrip line
Jitter of recovered data                  7 ps RMS
Total power                               134 mW
Driver power                              5 mW
Pulse RX power                            10 mW
Clock, test circuitry, and buffer power   116.5 mW

Fig. 5.13 Die and board photos, together with an eye diagram at the receiver output (before clock and data recovery) of the implementation of the circuits described in Figures 5.11a and 5.11c. The differential eye and both single-ended eyes are shown. It operated at 6 Gbps over 30 cm of lines at a core TX/RX power of 11.8 mW (giving 1.97 mW/Gbps) [9]. © 2006 IEEE

Table 5.2 Summary of performance achieved with the chip given in Figure 5.13.

Process                               TSMC 0.18 µm CMOS
Max data rate                         6 Gbps/channel
Min data rate                         DC (50 MHz measured, limited by source)
Area                                  3.3 mm x 3.3 mm
Total power (12 TX+RX, 72 Gbps)       141.6 mW = 1.97 mW/Gbps
  TX power                            5 mW
  RX power                            6.8 mW
Total AC signal IOs                   26
  Wirebonds                           12
  MCNC flip chip                      7
  Endicott flip chip                  7
Total bandwidth                       156 Gbps
  Wirebonds                           72 Gbps (partially measured to 36 Gbps)
  Flip chip                           84 Gbps

5.3.3 ACCI Packaging

A problem common to all contactless schemes is how to get power delivery and cooling together in the same packaging structure as the capacitive or inductive connections. Cooling is usually done through the back side of the chip. Thus in most geometries the more difficult problem is how to deliver DC power and ground across the same interface as the capacitive or inductive connections. This is particularly difficult for capacitive coupling, as the inter-plate spacing has to be well controlled to a few µm or less. The DC connections have to be the same thickness. While this is possible, for example by using copper-to-copper bonding, it can only be done for silicon-on-silicon interfaces, as the two opposing surfaces must have the same coefficient of thermal expansion. Thus it is limited to face-to-face chip stacking and, possibly, to chips connected via a silicon interposer. We qualify this latter statement because it has never actually been demonstrated, and it is likely to be difficult to do this with multiple die.

One approach to solving this is to use flip-chip solder bump technology and recess the solder bumps on the package side. If the solder bumps are built using a metal deposition process that is precisely controlled, then the end standoff height of the reflowed bump can be precisely calculated and replicated. If the package-side recess depth is also precisely controlled, then when assembled the air gap between the capacitor plates is also precisely controlled. This concept is best illustrated with silicon-on-silicon packaging, since no special tooling is required, and thus it is easy to demonstrate in a research environment. An illustration of the proposed geometry is shown in Figure 5.14, and the results of an experiment are shown in Figure 5.15. In this particular experiment, the geometry was designed to bring the chip and substrate into physical contact. Thus the dielectric between the two capacitor plates consists solely of the chip overglass. Bringing these two opposing surfaces into physical contact reduces their compliance, but that is unlikely to be an issue in silicon-on-silicon packaging. With just the chip overglass as a dielectric, pad pitches on the scale of 100 µm or less are possible (see Figure 5.9 and its discussion) but were not shown on this one and only demonstration run. For more details of this experiment, please see Mick et al. and Wilson et al. [15–17].

The buried bump concept can be readily extended to other packaging technologies. For example, organic packaging can accommodate recessed bumps by taking an existing package and adding a build-up layer where the capacitively coupled I/O are desired. This has been investigated using Endicott Interconnect Technology's HyperBGA technology, and is illustrated in Figure 5.16. Here the capacitively coupled I/O are arrayed around the edge of the die where, for large die, solder bumps would not be possible. The solder bumps are arrayed in the center of the package. The build-up around the edge ensures the capacitor gap is controlled to 3–5 µm accuracy (the surface unevenness of the package). A high-κ epoxy is used in the capacitor gap so that high density ACCI I/O can be built (see Figure 5.9 and the discussion there). The build-up has inlets in it to ensure flow of the underfill needed for stress relief around the solder bumps. If this bump array were small enough, underfill would not be needed. For more details, the reader is referred to Wilson et al. [12].

Fig. 5.14 Concept of using recessed solder bumps to control the inter-plate capacitor spacing [9]. (a) 3-D view of capacitively coupled interconnect (CCI) on a multi-chip module (MCM), showing the capacitor plates and the DC connections (buried solder bumps); (b) cross-section view. © 2006 IEEE

Fig. 5.15 Demonstration of using recessed solder bumps to control the inter-plate capacitor spacing [15]. Panels: two die mounted via recessed bumps with capacitive coupling on a silicon multi-chip module, with the TX, RX, transmission line, bypass capacitor, and coupling plates labeled; and a sawn-through cross-section view showing the silicon die (top), the 3-metal MCM (bottom), and the solder bumps. © 2005 IEEE

Fig. 5.16 Concept of using recessed solder bumps in organic packaging to control the inter-plate capacitor spacing [12]. The cross-section shows a HyperBGA package modified for ACCI: a thick dielectric creates the C4 trench, DC bumps with stress-relief underfill sit in the center, AC I/O with high-κ material and coupling capacitors sit at the die-package gap around the edge, and the package carries embedded termination resistors, underfill inlets, and BGA solder. The layout shows a 25 mm x 25 mm, 13,300 pin package enabled by capacitive coupling. © 2006 IEEE

5.4 Mid-channel Capacitively Coupled Structures

There are two circumstances in which a capacitively coupled structure might be inserted in the middle of a communications channel, rather than at one end or both ends. These circumstances are illustrated in Figures 5.17 and 5.18. The first reason (Figure 5.17) would be to create a capacitively coupled connector for 3D packaging. It could replace conventional connectors or could be used to enable true 3D cube-to-cube connectivity. As long as the inter-plate spacing is well controlled, many interesting packaging geometries are possible.

A second, most likely more interesting, application is to enable the use of board-embedded capacitors in standards that require DC blocking, such as Fibre Channel (Figure 5.18). High-κ dielectrics have been introduced into board-making technology to make embedded high quality, low inductance decoupling capacitors. However, the potential to use this embedded capacitor technology for implementing a conventional DC blocking capacitor is complicated by the low capacitance values that would have to be used. The embedded capacitance has a capacitance density in the tens of nF/cm² range, and conventional blocking capacitors would require an area of 1 cm² or more. However, if the standards could evolve to allow the use of smaller capacitors, then such a use would be feasible. ACCI circuit technology could be used to enable such an application [13].
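The area argument for embedded blocking capacitors is a one-line division of capacitance by capacitance density. The sketch below makes it concrete; the density of 50 nF/cm² and the 100 nF and 10 pF capacitor values are hypothetical examples chosen only to be consistent with the "tens of nF/cm²" range mentioned above.

```python
def embedded_cap_area_cm2(capacitance_f, density_nf_per_cm2):
    """Board area (cm^2) needed for a capacitor at a given embedded density."""
    return capacitance_f / (density_nf_per_cm2 * 1e-9)


DENSITY = 50.0  # nF/cm^2, hypothetical value within the quoted range

conventional = embedded_cap_area_cm2(100e-9, DENSITY)  # assumed 100 nF DC block
small = embedded_cap_area_cm2(10e-12, DENSITY)         # assumed 10 pF pulse-mode cap

print(f"100 nF blocking capacitor: {conventional:.1f} cm^2")
print(f"10 pF coupling capacitor : {small * 100:.3f} mm^2")
```

With these assumed values the conventional capacitor needs centimeter-scale area while the picofarad-scale capacitor fits comfortably under a single trace, which is the motivation for pulse signaling through small embedded capacitors.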

Fig. 5.17 Capacitive connections to replace connectors or to enable true 3D packaging. The utility is limited by the ability to control the inter-plate spacing precisely enough. The sketch shows conventional interconnect levels between Chip 1 and Chip 2 (L1: wire-bond or flip-chip; L2: BGA or BGA socket; L3: connector) alongside a capacitive or inductive I/O transceiver chip with pads, epoxy, and metal I/O plates (inductive or capacitive).

Fig. 5.18 Using an embedded capacitor to replace the DC blocking capacitor in standards that require this [13]. The cross-section (not to scale) shows the chip, flip-chip bumps, and a BGA package with VDD, signal, and GND layers and a high-κ dielectric forming an embedded capacitor for decoupling and an embedded capacitor for ACCI. © 2008 IEEE

Because the inductive parasitics associated with an embedded capacitor are much lower than those with a surface mount capacitor, such an application would permit DC blocking capacitors to be implemented at faster data rates with better signal integrity and lower cost. This is a significant potential advantage. The equivalent circuit for the channel is shown in Figure 5.19. A small series capacitor, together with its parasitics, is in the middle of the transmission line network.

Fig. 5.19 Channel equivalent circuit [13]. An NRZ bitstream from the TX on Chip 1 reaches the RX on Chip 2 through a stub (< 10 cm), a PCB trace (0.5 to 1 m) containing the embedded capacitor (Cc > 1 pF, with parasitic Cp), and a second stub (< 10 cm). © 2008 IEEE

Having a small blocking capacitor in the middle of a transmission line brings several disadvantages that must be compensated for:

1. It forms a high impedance (Z = 1/(ωC)) discontinuity, causing significant reflection noise. Good matching terminations are needed at each end of the channel.
2. The capacitance value is too small to support NRZ signaling; pulse or RZ signaling must be supported.
3. When fed and loaded by a 50 Ω transmission line, the resulting pulse has a long tail that must be accounted for or eliminated through circuit correction.
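The first disadvantage in the list can be quantified with standard transmission line theory. The sketch below computes the reflection coefficient seen looking into an ideal series capacitor that is followed by a matched line; ignoring the parasitic Cp and the stubs is a simplifying assumption, and the function name and sample values are illustrative.

```python
import math


def series_cap_reflection(c_farads, freq_hz, z0=50.0):
    """Reflection coefficient of an ideal series capacitor in a matched line.

    The input impedance is Zc + Z0, so gamma = Zc / (Zc + 2*Z0).
    Parasitic capacitance to ground and stubs are ignored (assumption).
    """
    z_c = 1.0 / (1j * 2.0 * math.pi * freq_hz * c_farads)
    return z_c / (z_c + 2.0 * z0)


for c in (1e-12, 10e-12, 100e-9):
    gamma = abs(series_cap_reflection(c, 1e9))
    print(f"C = {c:8.1e} F at 1 GHz -> |gamma| = {gamma:.3f}")
```

Picofarad-scale capacitors reflect most of the low-frequency energy, whereas a conventional large blocking capacitor is nearly transparent, which is why good terminations at both ends matter for the small-capacitor case.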

Fig. 5.20 Step response of a mid-channel series capacitor for different capacitor and parasitic capacitor values [13]. Axes: voltage (mV, 0 to 400) versus time (7 to 12 ns); curves are shown for Cc = 500 fF, 1 pF, 5 pF, 10 pF, 30 pF, and 60 pF, with Cp = 15% x Cc and Cp = 30% x Cc. © 2008 IEEE

The last two points are illustrated in Figure 5.20. This clearly shows the RZ or pulse signal response and the long tail. For example, consider the curve labeled “5 pF.” If no circuit solutions are used to compensate for the tail, the fastest bit period possible is 3.9 ns, corresponding to a rate of 256 Mbps.
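The relationship between the tail-limited bit period and the achievable data rate is just a reciprocal. A minimal check of the 5 pF example, using a hypothetical helper name:

```python
def bit_rate_mbps(bit_period_ns):
    """Data rate in Mbps for one bit per period of bit_period_ns nanoseconds."""
    return 1e3 / bit_period_ns


rate = bit_rate_mbps(3.9)
print(f"3.9 ns bit period -> {rate:.0f} Mbps")
```

Any circuit technique that shortens the effective tail shortens the usable bit period and raises the rate in inverse proportion.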

Fig. 5.21 Fractional equalization permits high frequency signaling through channels with small values of series capacitance [13]. For Cc = 1 pF and Cc = 10 pF (Cp = 0.15 x Cc in both cases, in a stub, PCB, stub channel), the RX_IN waveform (a pulse signal for 1 pF, an NRZ signal for 10 pF) and the FIR_OUT waveform are plotted over ±0.3 V and 0 to 200 ps. © 2008 IEEE

Fortunately, there is a technique that can be used to compensate for the long tails. Fractional equalization [18] is a technique in which the transmitted bit is changed digitally to enhance its high frequency, and suppress its low frequency, content. Thus when sent through a series capacitor, the (low frequency content) signal tail is suppressed. This concept is being explored and can be shown to enable high frequency signaling through a wide variety of channel scenarios (see Figure 5.21). A complete circuit implementation has been designed and was being tested while this chapter was written.

5.5 Inductively Coupled Connectors and Sockets

In the previous sub-section, the use of capacitors as mid-channel circuit elements was explored. One serious limitation was the large amount of reflection, or "return loss," such an element would cause. Such high levels of return loss are unacceptable in many applications. In addition, there was the difficult requirement to create physical structures where the inter-plate spacing was small and well controlled. Replacing opposing capacitor plates with inductor coils solves both these issues and is explored in this sub-section.

Fig. 5.22 Inductively coupled connectors and sockets, adapted in part from Chandrashekar et al. [19]. Panels: opposing inductors built as a transformer to enable separable, true zero insertion force connectors; layout of an inductively coupled connector array with sub-0.5 mm pitch structures, extendable to large areas (both top and bottom inductors are shown); and an inductively coupled connector built in a PCB process. © 2008 IEEE

Physically, a connector or socket can be built by aligning two opposing inductors so as to make a transformer (see Figure 5.22) [19]. This makes for a mechanically simple structure. There is no need to precisely control the alignment or spacing between the opposing inductors. The latter is an important point. The coupled magnetic field strength does not decrease dramatically with increased spacing between the coils. Thus this structure is a lot easier to physically implement than a connector using capacitors. The achievable pitch depends on the wire and via geometry. Structures as small as 100 µm have been built on ceramic and silicon substrates. A structure with a 285 µm diameter is buildable in a 1 mil line and space organic laminate process technology.

Some possible circuit implementations are shown in Figure 5.23. This sub-section is mainly focused on the top structure, a single transformer in a single-ended circuit configuration. For discussion of the other structures, please refer to Chandrashekar et al. [19]. This figure also leads into another major advantage that inductively coupled structures have over capacitively coupled ones. In a 1:1 turns ratio transformer with low losses, the impedance seen at the input to the transformer is the same as the impedance load on the output. Thus, when placed in a 50 Ω transmission line, the ideal transformer will act like a 50 Ω load. Reflection noise and return loss are zero!

Fig. 5.23 Circuit implementations of inductively coupled connections. Panels: single-ended, single-transformer; differential, single-transformer; single-ended, two transformers; differential, two transformers.

The first question a transformer circuit designer would have to answer is what nominal value of inductance to target. This turns out to be driven by a similar set of considerations as in the mid-channel capacitively coupled case, as illustrated in Figure 5.24. The smaller the inductance value, the shorter the pulse tail (and the higher the potential bit rate) but the smaller the signal swing. Again, the tradeoff is to achieve an acceptable signal swing at the fastest bit repetition rate. For example, based on the data in Figure 5.24, achieving 1 Gbps signaling requires that the inductor value be less than 5 nH. However, as in the capacitive case, the tail can be compensated for by using circuits employing fractional equalization. This is currently being explored.

Unfortunately, two opposing inductors do not form a perfectly coupled lossless transformer. Actual parasitics must be modeled and taken into account. The equivalent circuit model is shown in Figure 5.25. The plate capacitance to ground and the series winding resistance have a small impact on losses. The most significant issue is the "leakage inductance" that appears because the coupling coefficient is less than 1 in a practical spiral transformer. If uncompensated, this could cause

Fig. 5.24 With an inductive coupled channel, smaller inductors permit faster signaling though with less signal swing [20]. Axes: output (mV) versus time (ns); curves are shown for 25 nH, 15 nH, 5 nH, and 1 nH.

significant return loss. However, another parasitic in the circuit is the plate-to-plate capacitance of the two opposing spiral inductors. This appears in series in the circuit and parallel to the leakage inductance. If the inductors are wound in opposition to each other (e.g. one clockwise and the other counter-clockwise looking from above), then the reactance due to the leakage inductance is negative. As the reactance of the series capacitance is positive, these can be designed to cancel each other out, at least over a broad frequency range.

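Fig. 5.25 Parasitics of an actual spiral-wound transformer [19]. The coupled inductors L1 and L2 (coupling coefficient K) are redrawn as an equivalent circuit with winding resistances R, leakage inductances L1-M and L2-M, mutual inductance M, and plate-to-plate capacitance Cc. © 2008 IEEE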

A practical example of a well designed structure, as would be built in a high end laminate process, is described in Table 5.3, with modeled results presented in Figure 5.26. The leakage inductance of 1.2 nH (with negative reactance) is compensated by spacing the coils 25 µm apart (with a κ = 4 dielectric) to give a 200 fF inter-plate capacitance. A 25 Ω series resistance (which could be built by thinning the metal in the primary coil) completes the matching network. The resulting frequency response shows acceptable insertion loss (S21) for pulse signaling (in pulse signaling the low frequency loss does not matter), and acceptable return loss (S11) up to 5 GHz. This structure could thus support 10 Gbps signaling.

Fig. 5.26 If the spirals oppose each other, then the leakage inductor's reactance is negative, and can be cancelled by the plate-to-plate capacitance reactance. Return loss (S11) and insertion loss (S21), plotted in dB versus frequency (GHz), are acceptable over a frequency range that can support 10 Gbps signaling [19]. The circuit model between Port 1 and Port 2 uses two 2 nH inductors with K = 0.7, Cc = 200 fF, a 25 Ω series resistor, and 3 Ω winding resistances. © 2008 IEEE

Table 5.3 Transformer geometries that achieve useful inductively coupled interconnect. Here, L1 = L2 = 2 nH; Cc = 200 fF; K = 0.7 across a 25 µm dielectric spacing with a dielectric constant of 4.0; 25 µm minimum trace width; 50 µm microvias. The structure in the first row shows what could be achieved in a 1 mil line and space advanced laminate structure. The second row corresponds to the circuit model shown in Figure 5.26.

Outer diameter   Width/Spacing   # of turns   K     Cc
370 µm           25 µm/25 µm     2            0.6   88.85 fF
460 µm           47 µm/25 µm     2            0.7   200.3 fF
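The 1.2 nH leakage inductance quoted for this design follows from the T-equivalent circuit of Figure 5.25 and the values in Table 5.3. The sketch below computes the mutual inductance and the two leakage branches, then the frequency at which the magnitudes of the leakage and plate-to-plate reactances are equal. It is an idealized illustration: winding resistance and capacitance to ground are ignored, and the function names are hypothetical.

```python
import math


def t_model(l1, l2, k):
    """Return (L1 - M, L2 - M, M) for two coupled inductors, in henries."""
    m = k * math.sqrt(l1 * l2)
    return l1 - m, l2 - m, m


def equal_reactance_freq(l_henries, c_farads):
    """Frequency (Hz) where |omega*L| equals |1/(omega*C)|."""
    return 1.0 / (2.0 * math.pi * math.sqrt(l_henries * c_farads))


leak1, leak2, mutual = t_model(2e-9, 2e-9, 0.7)
total_leak = leak1 + leak2
f_eq = equal_reactance_freq(total_leak, 200e-15)

print(f"mutual inductance M     : {mutual * 1e9:.1f} nH")
print(f"total leakage inductance: {total_leak * 1e9:.1f} nH")
print(f"equal-reactance frequency with 200 fF: {f_eq / 1e9:.1f} GHz")
```

The leakage total of 2·L·(1 − K) shows directly why a higher coupling coefficient eases the matching problem: less leakage inductance needs to be cancelled by the plate-to-plate capacitance.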

Inductively coupled structures have been shown to be robust to manufacturing variations in critical dimensions of up to 25% and to have acceptable crosstalk if the lateral coil-to-coil spacing is greater than 33% of the coil diameter [1, 20]. Though complete systems have not been built using inductively coupled connectors, the interconnect structure itself has been built and tested, and extensive simulation studies have been conducted. Figure 5.27 shows a measurement result of an operating channel like that described above. Here the RZ pulse nature is clearly observable. This channel is operating at 4.25 Gbps. With fractional equalization, it is possible to operate these channels up to 20 Gbps. Full circuit implementation is currently underway.


© 2008 IEEE

Fig. 5.27 Measured data for a 100 µm diameter transformer (with 1.22 nH inductors realized with 5 µm trace width/space) across a gap spacing of approximately 5–7 µm; eye diagram at 4.25 Gbps [19].

5.6 Conclusions and Future Perspectives

Capacitive and inductive coupling have the potential to enable high-density, low-cost, reliable, mechanically simple chip-to-chip, chip-to-package and package-to-package connections. Designed correctly, these coupling structures can also reduce power consumption (through reduced signal swing) and provide passive equalization (which reduces power further). The main negatives are the need to redesign the receiver, so that it can recover pulses back to binary data, and the general unfamiliarity designers have with the concepts described in this chapter.

Capacitive coupling’s main packaging use is in situations where the inter-plate spacing can be well controlled, so that useful capacitance values can be achieved with small capacitors. This includes silicon-on-silicon packages and chip packaging with high end organic structures. The series capacitance can be built entirely on-chip (with two usual layers of on-chip metal) if the only goals are low-parasitic ESD protection, passive equalization and the power advantage. The series capacitance can also be embedded in the board to enable a low-cost, low-parasitic series DC block capacitance for those circumstances that need it. Because the inductive parasitics associated with an embedded capacitor are much lower than with a surface mount capacitor, such an application will permit DC blocking capacitors to be implemented at faster data rates with better signal integrity and lower cost.

A useful application of inductive coupling is to build high pin-count, larger area separable interfaces (i.e. connectors and sockets). Want a 10,000 pin connector? Inductive coupling can provide that cheaply and easily. It does not have the dimensional control requirements of capacitive coupling and thus can be easily used for a wide variety of structures.

It is widely acknowledged that system level bandwidth will have to increase dramatically in the next few years to keep up with the demands of multicore computers and other systems. Inductive and capacitive connections provide a low-cost, low-power, mechanically compliant technology set to permit such bandwidth scaling in 2D and 3D systems.

Acknowledgements The author would like to thank David Salzman and Tom Knight for introducing him to capacitive coupled interfaces. He would like to thank the PhD students and professionals he worked with in developing the concepts outlined in this chapter. In no particular order, these include Stephen Mick, Lei Luo, Karthik Chandrashekar, Jian Xu, Bruce Su, Evan Ericsson, Steve Lipa and John Wilson. They all made unique and valuable contributions to this field of knowledge. It is their work that is largely reflected here. He would like to thank the companies and agencies that supported this work, including the Semiconductor Research Corporation, IBM, Irvine Sensors, the National Science Foundation, the Air Force Research Laboratory, and the Defense Advanced Research Projects Agency. He would also like to thank Endicott Interconnect Technologies for designing a modified HyperBGA structure for AC coupled interconnect, and RTI (formerly MCNC) for substrate and solder bump process design and fabrication.

References

1. J. Xu, AC Coupled Interconnect for Interchip Communications, Ph.D. Dissertation, North Carolina State University, Raleigh, NC, 2006.
2. D. Salzman and T. Knight, “Capacitive coupling solves the known good die problem,” IEEE Multi-Chip Module Conference, 1994, pp. 95–99.
3. D. Salzman, T. Knight, and P. Franzon, “Application of capacitive coupling to switch fabrics,” IEEE Multi-Chip Module Conference, 1995, pp. 195–199.
4. S.A. Kühn, M.B. Kleiner, R. Thewes, and W. Weber, “Vertical signal transmission in three-dimensional integrated circuits by capacitive coupling,” IEEE International Symposium on Circuits and Systems, 1995, pp. 37–40.
5. T. Gabara and W. Fischer, “Capacitive coupling and quantized feedback applied to conventional CMOS technology,” IEEE Journal of Solid-State Circuits, vol. 32, no. 3, 1997, pp. 419–427.
6. S.E. Mick, J.M. Wilson, and P. Franzon, “4 Gbps AC coupled interconnection,” IEEE Custom Integrated Circuits Conference, 2002, pp. 133–140.
7. J. Xu, S. Mick, J. Wilson, L. Luo, K. Chandrasakhar, P. Franzon, “AC coupled interconnect for dense 3-D systems,” Proceedings of the IEEE Conference on Nuclear Science and Imaging, 2003.
8. J. Xu, L. Luo, S. Mick, J. Wilson, P. Franzon, “AC coupled interconnect for dense 3-D ICs,” IEEE Transactions on Nuclear Science (TNS), vol. 51, no. 5, 2004, pp. 2156–2160.
9. L. Luo, J.M. Wilson, S.E. Mick, J. Xu, L. Zhang, P. Franzon, “3 Gbps AC coupled chip-to-chip communication using a low swing pulse receiver,” IEEE Journal of Solid-State Circuits, vol. 41, no. 1, 2006, pp. 287–296.
10. L. Luo, J. Wilson, J. Xu, S. Mick, P. Franzon, “Signal integrity and robustness of ACCI packaged systems,” Proceedings, IEEE Conference on Electrical Performance of Electronic Packaging, 2005, pp. 11–14.
11. J. Poulton, R. Palmer, A.M. Fuller, T. Greer, J. Eyles, W.J. Dally, and M. Horowitz, “A 14mW 6.25 Gb/s transceiver in 90-nm CMOS,” IEEE Journal of Solid-State Circuits, vol. 42, no. 12, 2007, pp. 2745–2757.
12. J. Wilson, L. Luo, S. Mick, B. Chan, H. Lin, P. Franzon, “AC coupled interconnect using buried bumps for laminated organic packages,” Proceedings, Electronic Components and Technology Conference, 2006.
13. B. Su, P. Patel, S. Hunter, M. Caises, and P. Franzon, “AC coupled backplane communication using embedded capacitor,” Proceedings, IEEE Conference on Electrical Performance of Electronic Packaging, 2008, pp. 295–298.
14. L. Luo, J. Wilson, S. Mick, J. Xu, L. Zhang, E. Erickson, P. Franzon, “A 36 Gb/s ACCI multichannel bus using a fully differential pulse receiver,” Proceedings, IEEE Custom Integrated Circuits Conference, 2006, pp. 773–776.
15. J. Wilson, S. Mick, J. Xu, L. Luo, S. Bonafede, A. Huffman, R. LaBennett, P. Franzon, “Fully integrated AC coupled interconnect using buried bumps,” Proceedings, IEEE Conference on Electrical Performance of Electronic Packaging, 2005, pp. 7–10.
16. S. Mick, L. Luo, J. Wilson, P. Franzon, “Buried solder bump connections for high-density capacitive coupling,” Proceedings, IEEE Conference on Electrical Performance of Electronic Packaging, 2002, pp. 205–208.
17. S. Mick, L. Luo, J. Wilson, P. Franzon, “Buried bump and AC coupled interconnection technology,” IEEE Transactions on Advanced Packaging, vol. 27, no. 1, 2004, pp. 121–125.
18. R.D. Gitlin, S.B. Weinstein, “Fractionally-spaced equalization: An improved digital transversal equalizer,” Bell System Technical Journal, vol. 60, 1981, pp. 275–296.
19. K. Chandrashekar, J. Wilson and P. Franzon, “Inductively coupled connectors and sockets for multi-gbps pulse signaling,” IEEE Transactions on Advanced Packaging, vol. 31, no. 4, 2008, pp. 749–758.
20. K. Chandrasekar, Inductively Coupled Connectors, Ph.D. Dissertation, North Carolina State University, Raleigh, NC, 2009.

Part IV

Enabling Coupled Data Technologies

Chapter 6

Aligning chips face-to-face for dense capacitive communication

John E. Cunningham, Ashok V. Krishnamoorthy, Ivan Shubin, James G. Mitchell, Xuezhe Zheng

Dr. John E. Cunningham, Sun Microsystems Chief Technology Organization, 9515 Towne Centre Drive, San Diego, CA 92121, USA
Dr. Ashok V. Krishnamoorthy, Sun Microsystems Chief Technology Organization, 9515 Towne Centre Drive, San Diego, CA 92121, USA
Dr. Ivan Shubin, Sun Microsystems Chief Technology Organization, 9515 Towne Centre Drive, San Diego, CA 92121, USA
Dr. James G. Mitchell, Sun Microsystems Chief Technology Organization, 16 Network Circle, Menlo Park, CA 94025, USA
Dr. Xuezhe Zheng, Sun Microsystems Chief Technology Organization, 9515 Towne Centre Drive, San Diego, CA 92121, USA

R. Ho and R. Drost (eds.), Coupled Data Communication Techniques for High-Performance and Low-Power Computing, Integrated Circuits and Systems, DOI 10.1007/978-1-4419-6588-2_6, © Springer Science+Business Media, LLC 2010

6.1 Introduction

Conductive electrical interconnections and on-chip transceivers have long been used to provide reliable interconnections between VLSI electronic components, and have dominated the interconnect hierarchy for reasons of manufacturing cost, system packaging, and ease-of-use. VLSI linewidths and on-chip clock speeds have continued to scale, putting pressure on the ability of traditional wires to achieve the off-chip bandwidths necessary to fully and efficiently utilize the resources available on-chip. When designing chip input and output circuits that communicate conductively, electronic circuit and system designers must work within the constraints of VLSI packages and circuit boards by using advanced circuit techniques such as predistortion, equalization, multilevel coding, and digitally controlled feed-forward clock and data recovery blocks, commonly referred to as serializer-deserializer (SerDes) transceivers. However, this generally increases the area and power consumption and limits the maximum number of I/O circuits per chip. Current best-in-class SerDes transceivers are expected to yield signaling densities between 1 and 5 terabits per second per square centimeter (Tbps/cm²) [1].

“Proximity communication” represents the general concept of face-to-face integrated circuits communicating by capacitive or inductive coupling [2]. Very high communication signal density can be achieved when compared to wire-bonding or solder-ball connections. In addition, in the capacitive case, off-chip circuits need drive only a small, high-impedance, capacitive pad much akin to the gate of a transistor. The electrical pad pitch may be on the order of 20 µm. Each pad can drive signals at line rates of 2.5–5 Gbps or higher [3]. This provides a potential communication density in excess of 1.25 petabits per second per square centimeter (Pbps/cm²). Experimental capacitive proximity communication circuits have yielded areal densities up to 43 Tbps/cm² to date [4].

In the case of capacitive proximity communication, the engineering limits to signal density will result from the area and power of the transmitter and receiver circuits. More critically, this form of signaling relies on the ability to align the chips and maintain this alignment during the packaging and operation of the chips.

In the following discussion we will consider face-to-face chips arranged in a lower face-up plane and an upper face-down plane. As described later, we will often label the face-up chips as “island” chips and the face-down chips as “bridge” chips. The bridges will typically be smaller and have very little functionality besides transmitting data between island chips. The island chips themselves can have similar or vastly different functions.
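The potential density quoted above follows directly from the pad pitch and the per-pad line rate. A minimal sketch of that arithmetic is shown here, assuming a square pad grid in which every pad carries one independent channel (differential signaling or unused pads would lower the figure); the function names are illustrative.

```python
def pads_per_cm2(pitch_um):
    """Number of pads per square centimetre on a square grid with the given pitch."""
    pads_per_cm = 1e4 / pitch_um  # 1 cm = 10,000 um
    return pads_per_cm ** 2


def density_tbps_per_cm2(pitch_um, rate_gbps):
    """Aggregate signaling density in Tbps/cm^2, one channel per pad (assumption)."""
    return pads_per_cm2(pitch_um) * rate_gbps / 1e3


for rate in (2.5, 5.0):
    d = density_tbps_per_cm2(20.0, rate)
    print(f"20 um pitch at {rate} Gbps per pad: {d:.0f} Tbps/cm^2 ({d / 1e3:.3f} Pbps/cm^2)")
```

At a 20 µm pitch there are 250,000 pads per square centimetre, so 5 Gbps per pad gives the 1.25 Pbps/cm² figure in the text.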

6.2 Aligning chips face-to-face

Ultimately, proximity communication provides an off-chip signaling bandwidth that can scale with the feature size and frequency of on-chip wires. However, there are alignment constraints required for this form of communication to be effective. The chips are required to face each other with their active sides abutting, and to maintain strict alignment between the chips. Exploiting this enormous bandwidth effectively requires a reliable, manufacturable and economical means for positioning the chips precisely relative to all six degrees of freedom (x and y in-plane, perpendicular z-separation, tip, tilt, and rotation). In theory, satisfactory communication requires any misalignment to be under half of a pitch between the pads; this is often less than the error experienced during the sawing or dicing of the chips. In practice, the alignment requirements are generally much more stringent. Reducing misalignment improves communication performance between the chips and lowers power consumption. Unfortunately, it is not a simple matter to align the chips properly using existing mounting structures, such as those used for conventional single-chip modules or conventional multi-chip modules. Furthermore, the need to deliver power and cooling to chips that communicate through proximity communication further complicates chip alignment.


For capacitive proximity communication between a given pair of overlapping pads, the x-y misalignment ought to be kept to under about a quarter of the pad size. Given a pad size on the order of 20–30 µm, one finds a maximum lateral misalignment tolerance of approximately ± 5–8 µm (if signal steering or larger pads are used, the tolerance grows by a small integer factor). For capacitive coupling, the z-separation is more critical, because the capacitively coupled signal voltage on a particular receiver pad falls rapidly with chip separation. Proximity signaling experiments have shown that separations of 10 µm (roughly half to a third of the pad size) are required to communicate with high fidelity and low bit-error-rate [3]. In order for arrays of pads to communicate across the chip gap, the tip, tilt, and rotation must also be controlled (to varying degrees depending on the size of the array versus the size of the chips).

With conventional packaging techniques, chips can be die bonded onto first-level packages with lateral manufacturing accuracies of 50 µm or so at low cost. In typical die bonding manufacturing practices little or no fine control over the chips’ topographical height above the package is achieved. When creating multi-chip flip-chip packages, the spacing between two circuits that are flip chip bonded is generally controlled by the topography of the metal bumps. Such bumps are about 80 µm in diameter and leave a gap of several tens of µm. Typically, no means is available for placing one of the chips face-up and the other face-down, nor for providing overlap between chips (let alone precise alignment). The ultimate height of the attached bump and the ultimate chip separation in these approaches are typically not controlled, and when the chips are connected by the above methods they cannot be disconnected.
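To give a feel for why both lateral offset and separation matter, the following sketch estimates the pad-to-pad coupling capacitance with an ideal parallel-plate model. It is an illustration under stated assumptions rather than the authors' model: square pads, no fringing fields, a uniform gap with relative permittivity `eps_r` (1.0 here), and hypothetical function names.

```python
EPS0 = 8.854e-12  # vacuum permittivity, F/m


def overlap_fraction(pad_um, dx_um, dy_um):
    """Fraction of pad area still overlapping after a lateral offset (dx, dy)."""
    ox = max(0.0, pad_um - abs(dx_um))
    oy = max(0.0, pad_um - abs(dy_um))
    return (ox * oy) / (pad_um ** 2)


def coupling_capacitance_ff(pad_um, gap_um, dx_um=0.0, dy_um=0.0, eps_r=1.0):
    """Ideal parallel-plate coupling capacitance in femtofarads."""
    area_m2 = overlap_fraction(pad_um, dx_um, dy_um) * (pad_um * 1e-6) ** 2
    return EPS0 * eps_r * area_m2 / (gap_um * 1e-6) * 1e15


pad = 25.0  # um, within the 20-30 um range mentioned in the text
for gap in (5.0, 10.0, 20.0):
    aligned = coupling_capacitance_ff(pad, gap)
    offset = coupling_capacitance_ff(pad, gap, dx_um=pad / 4, dy_um=pad / 4)
    print(f"gap {gap:4.1f} um: aligned {aligned:.3f} fF, quarter-pad offset {offset:.3f} fF")
```

In this simple model a quarter-pad offset in both x and y leaves about 56% of the aligned capacitance, while doubling the gap halves it, which is consistent with the text's emphasis on controlling the z-separation.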
Hence, a new chip alignment method is needed that can guarantee chip separations below 10 µm and lateral alignment precision on the order of 1 µm, with minimal tip, tilt, and rotation errors, to hold chips together face-to-face with partial overlap. A final requirement of remateability provides flexible (Lego-like) chip assembly and chip replacement at low cost.

The alignment mechanism reported here takes advantage of miniaturized versions of two of nature’s idealized shapes: an inverse pyramid with atomically smooth surfaces defined by a self-terminating wet-etch process in silicon; and commercially available microspheres with radii accurate to 1 µm and smooth to within a tenth of an optical wavelength (see Figure 6.1).

Pyramidal etched pits in silicon are a key part of this alignment mechanism. Silicon whose surface is the (100) crystal plane can be etched anisotropically so that the etch stops on the (111) planes with high selectivity, e.g., at a 400 to 1 ratio [5]. A square opening in a resist pattern, when etched in this way, will produce a self-terminating inverted pyramid in the surface of the silicon that is sized and placed with the accuracy of commercially available photolithography. A set of such etch pits can be fabricated into the corners of each silicon chip containing the proximity communication circuits.

As shown in Figure 6.2, the silicon chips can be positioned face to face. Precise microspheres are then inserted in the etch pits on the upward-facing chip, and the two chips are then brought into mechanical alignment. As the joining process continues, the balls co-locate the two chips by equilibrating into each corner of the chips. Because of the photolithographically defined size and location of the etch pits, and the


© 2009 IEEE

Fig. 6.1 Plan view, scanning electron microscope (SEM) images of fabricated 250 µm wide etch pits in silicon, with sapphire balls in place.

Fig. 6.2 Side view, scanning electron microscope (SEM) images of fabricated 250 µm wide etch pits in silicon, with sapphire balls in place.


auto-centering of the ball in the inverse pyramidal etch pit, the exact relative position between two chips can be established. In addition, with uniform size balls, the gap between the two chips can be well controlled and maintained over a range of 1–100 µm.

© 2009 IEEE

Fig. 6.3 Silicon die that each contain electrical circuits are positioned face to face. An etch pit has been fabricated in each corner of each silicon chip, to capture precise spherical balls that are inserted into the wells before the positioning step. The two chips are then brought in mechanical alignment. As the joining process continues the balls eventually co-locate the two chips as the balls equilibrate into each corner of the chips. At equilibrium, circuits on the top chip are precisely aligned to the respective circuits on the bottom chip, as intended.

The etch pits can be created with photolithography before, during, or after circuit fabrication of the silicon chips. This enables the etch pits to be precisely defined and positioned in relationship to the circuits on the chip, and exactly aligns top and bottom circuits to each other [6]. The angle of the etch pit sidewall, set by the (111) planes when etching a (100) silicon surface, is approximately 54.7 degrees [7]. The precision of the approach is enabled by the use of a silicon etch with atomic selectivity: for instance, a KOH wet-etch removes all planes except the (111) planes. As the etch proceeds through a defined opening in a (100) surface, four (111) facets are ultimately exposed because this etch does not attack (111) planes. The bottom of the wells may be less than perfect since the reaction may become mass-transport limited. This does not impact the alignment since the vertex does not support the ball (see Figures 6.1 and 6.2).

Good lateral alignment between the circuits on the upper and lower chips in Figure 6.2 is achieved provided the balls are sufficiently large to fit in the etched pit well. Clearly, a too-small ball does not precisely align the top and bottom circuits. However, if the sphere sits in the pit such that its equator lies at or above the chip surface, then the top and bottom circuits become precisely aligned.

One of the most important features of this technique is chip spacing in the z dimension. This chip separation, d, depends on only a few input parameters. If we assume the pit has been etched to a depth greater than the radius of the ball, then the distance d depends critically only on the photolithographic feature size of the pit and the diameter of the balls.
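The dependence of the chip separation on the pit opening and the ball diameter can be sketched with elementary geometry. The relation used below is a derivation under idealized assumptions, not a formula given in the source: identical square pits of opening W on both chips, a rigid ball of diameter D resting on the four (111) sidewalls inclined at arctan(√2) (about 54.74 degrees) to the surface, bare silicon surfaces, and no bowing. Under those assumptions the ball center sits D/(2 cos θ) − (W/2) tan θ above each surface, so the gap is h = √3·D − √2·W. Function names are illustrative.

```python
import math

THETA = math.atan(math.sqrt(2.0))  # (111) sidewall angle to a (100) surface, ~54.74 deg


def ball_center_height_um(ball_diameter_um, opening_um):
    """Height of the ball center above the chip surface for an ideal pyramidal pit."""
    r = ball_diameter_um / 2.0
    apex_depth = 0.5 * opening_um * math.tan(THETA)   # pit apex below the surface
    center_above_apex = r / math.cos(THETA)           # ball tangent to the sidewalls
    return center_above_apex - apex_depth


def chip_gap_um(ball_diameter_um, opening_um):
    """Gap between two facing chips with identical pits (idealized assumption)."""
    h = 2.0 * ball_center_height_um(ball_diameter_um, opening_um)
    if h < 0.0:
        raise ValueError("ball equator lies below the surface; chips would touch")
    if h > ball_diameter_um * math.cos(THETA):
        raise ValueError("ball rests on the pit edge, not on the sidewalls")
    return h


for w in (430.0, 460.0, 480.0, 489.0):
    print(f"W = {w:.0f} um, D = 400 um -> gap ~ {chip_gap_um(400.0, w):.1f} um")
```

For a 400 µm ball this idealized model gives gaps from tens of micrometres down to about a micrometre as the opening grows from 430 µm towards roughly 490 µm. The pit depth does not appear in the expression, in line with the observation that the ball is supported by the sidewalls rather than the vertex.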


The exact depth of the well is unimportant when using the etch chemistry described above. This is a key component to enable a simple (low-cost) manufacturing solution, since neither a timed etch nor special stop-etch layers are necessary. Only three balls are needed to define a plane on which the spacing between circuits can be uniformly held over extended lateral distances of the chip. This is important since conventional electrical mating connectors have significant slop resulting in misalignment tilts of several degrees. For chips of side length 1 cm such misalignments can equate to approximately 100 µm of variation in chip separation. We estimate that the variation in chip separation achieved in our design is less than 1 µm.
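The relation between a tilt angle and the resulting variation in chip separation is simple trigonometry. The sketch below assumes a rigid, flat chip pivoting about one edge; the helper names are illustrative.

```python
import math


def gap_variation_um(side_mm, tilt_deg):
    """Edge-to-edge change in separation for a flat chip of the given side length."""
    return side_mm * 1e3 * math.tan(math.radians(tilt_deg))


def tilt_for_variation_deg(side_mm, variation_um):
    """Tilt angle that produces a given edge-to-edge change in separation."""
    return math.degrees(math.atan(variation_um / (side_mm * 1e3)))


print(f"1 cm chip, 1 degree tilt: {gap_variation_um(10.0, 1.0):.0f} um variation")
print(f"1 cm chip, 100 um variation: {tilt_for_variation_deg(10.0, 100.0):.2f} degrees")
print(f"1 cm chip, 1 um variation: {tilt_for_variation_deg(10.0, 1.0):.4f} degrees")
```

Under this model, about 100 µm of variation across a 1 cm chip corresponds to a tilt of roughly 0.6 degrees, and larger tilts give proportionally larger variation. Holding the variation below 1 µm corresponds to a tilt of a few thousandths of a degree.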

[Figure 6.4 plot: chip separation h (µm, logarithmic axis) versus feature opening W (µm, 430 to 500) for 400 µm balls, with an inset labeling h, D, and W; limit lines are marked “Photolithographic error (1 µm) + Grade 3 limit”, “Grade 5 balls limit”, and “Grade 10 balls limit”.]

Fig. 6.4 Chip gap as a function of two parameters, the photolithographic opening and the ball diameter. In this calculation the ball diameter is 400 µm. The different grades of balls produce different limiting alignment floors for the chip gap.

The kinematics of this chip locating principle removes five of the six degrees of chip misalignment while controlling the sixth degree, the chip gap. The chip gap can be tuned accurately by changing the ball diameter or the photolithographic opening. Figure 6.4 shows a range of achievable gaps using a 400 µm ball in different opening sizes. Different grades of balls (based on precision grinding) lead to different limiting chip gap floors; a grade 3 ball has a roundness and diameter tolerance of 3 millionths of an inch, a grade 5 ball has a tolerance of 5 millionths of an inch, and so on. The ultimate limit also depends on the photolithographic tool resolution. In addition, near a chip gap of 1 µm, bowing from stress mismatches in metals and dielectrics on CMOS chips begins to complicate the meaning of a “chip gap.” Finally, the graph in Figure 6.4 does not include a stack of metal layers but rather presumes a silicon surface; the metal stack-up must be included to determine the chip gap, although


this correction can easily be solved geometrically. Likewise, in SOI wafers the chip gap for proximity communication needs to account for the silicon and buried oxide (BOX) thicknesses above the handler [8, 9, 10].
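A rough feel for the limiting gap floors can be obtained by propagating the ball and lithography tolerances through an idealized gap relation. The sketch assumes h = √3·D − √2·W, which follows from a rigid ball resting on (111) sidewalls in identical pits on both chips; this relation is a geometric assumption rather than a formula from the source. It also reads a grade-N ball as having a diameter tolerance of N millionths of an inch, as described above, and adds the contributions as a worst-case linear sum.

```python
import math

MICROINCH_UM = 0.0254  # one millionth of an inch, in micrometres


def gap_error_um(ball_grade, litho_error_um=0.0):
    """Worst-case gap error for the assumed relation h = sqrt(3)*D - sqrt(2)*W."""
    diameter_error_um = ball_grade * MICROINCH_UM
    return math.sqrt(3.0) * diameter_error_um + math.sqrt(2.0) * litho_error_um


for grade in (3, 5, 10):
    print(f"grade {grade:2d} ball alone: ~{gap_error_um(grade):.2f} um")
print(f"grade 3 ball + 1 um lithography error: ~{gap_error_um(3, 1.0):.2f} um")
```

With these assumptions the ball grade alone contributes a few tenths of a micrometre at most, while a 1 µm error in the photolithographic opening contributes about 1.4 µm, so the lithography term dominates the floor.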

6.2.1 Power and ground connections between coupled chips

While the transfer of data signals is accomplished by means of capacitive proximity communication, a conductive channel is useful for transferring power supply current between chips. In the configuration of islands and bridges mentioned above, the islands are large, highly-functional computation chips and the bridges do nothing but transfer data between islands. While the islands consume and dissipate high power levels, the bridges do not. Thus, the bridges may easily be powered from the islands using a conductive channel between chips established alongside capacitive proximity communication. If the aligning balls are made of a conductive material then the alignment features used to intimately pair chips together can also be used to share power and ground between two chips. The spheres may be metallic or may be glass or other material (e.g. sapphire) with a conductive coating. In addition, the etch pits must be metallized so that power and ground signals may respectively be conducted from other areas on the chips. Using these metallized pits and conductive balls allows a remateable conductive channel from the lower to the upper layer of chips that can be used not only for power and ground, but also for a small number of low-speed signals. In practice, one may choose to use several smaller balls to more evenly distribute power and ground between chips, and to use spheres with micro-roughness to improve their electrical contact to the metallized pits. Compliant spheres or spheres with a compliant metallic coating can further improve electrical contact between the balls and the sidewalls of the etch pits. The conductive spheres may optionally be soldered to one chip to create a male-female connector interface as is standard in many connector topologies.

Although the ball-in-pit can be used to provide power, the upper layer bridge chips can alternately be powered by conductive micro-bump attachment. Microsolder bumps shown in Figures 6.5 and 6.6, with a diameter of 10 µm and a height of approximately 5 µm, were photolithographically defined at the wafer scale. They provide the potential for thousands of conductive connections from an island chip to a flip-chip bonded bridge chip. In this case, each bridge chip is permanently aligned and attached to a host island chip at only one end; the bridge spans across to a neighbouring island where it is remateably aligned with the ball-in-pit mechanism (see Figure 6.7). Note that the microbump technology used to mount bridge chips must be fine-pitch and low profile, so that it maximizes connection density and minimizes the standoff of the bridge chip from the island. Other examples of the microsolder and flip chip bonding are shown in Figures 6.8 and 6.9. They depict more results for our high density, low resistance electrical interconnect, the microsolder, which powers up the bridge chip when it is flip chip


Fig. 6.5 Microsolder with diameter 10 µm can be photolithographically defined over large areas.

Fig. 6.6 Top view photograph of an island chip with microsolder around the perimeter, providing thousands of flip-chip conductive connections to a face-down bridge chip (not shown).

bonded to another island chip. The microsolder is a dense array of specially shaped microbumps, designed to have a small pitch, low electrical resistance, and a high level of compliance after flip chip bonding, resulting in extremely small (a micron or less) chip-to-chip separation. Each microbump consists of layered metallurgy shaped as a square base 3 µm tall with a “crown” elevated over the base edges by 4 µm. This special shape ensures high conductivity, as the crown is embedded into an opposing pad during flip chip bonding. Bumps are e-beam deposited onto aluminum pads of an island chip.


© 2009 IEEE

Fig. 6.7 Photograph of the top view of a neighboring island chip with balls and pits that enable remateable alignment to a bridge chip. Unlike the other island chip, this one does not bond to the bridge chip, allowing for chip replaceability. In both cases, proximity communication transports data from island to bridge and back to island.

© 2009 IEEE

Fig. 6.8 SEM microphotograph of microsolder, viewing from the top an array of high-density connections.


© 2009 IEEE

Fig. 6.9 SEM microphotograph of microsolder, showing a closeup of several microbumps.

The bumps could be scaled down to several microns in diameter with a comparable pitch. Figures 6.8 and 6.9 show square-shaped 18 µm bumps on a 45 µm pitch. The interconnect is completed with flip chip bonding by means of thermal compression. After alignment of the chips, a low viscosity epoxy is introduced on the chip surface, and the chips are brought together and compressed under several pounds of loading pressure at modest temperature. Figure 6.10 shows an individual microbump before the flip chip bonding and a cross-section of the compressed bump between the bridge and island chips. The resulting electrical resistance per microbump is under 100 mΩ. Figures 6.10 and 6.11 show the mechanical behavior of a microsolder bump before and after thermal compression bonding. The original bump, about 7 µm in height, was plastically deformed to about 3 µm tall. Due to pressure-enhanced metal migration across the top and bottom chips, some alloy intermixing is expected across the bump interface. Pressures per bump are typically tens of micrograms per square centimeter. The lateral displacement between top and bottom chip registry is measured to be 2 to 3 µm. This is at the high end for our precision flip chip bonder, which statically registers chips to a level of about 1 µm. However, even with a specified accuracy of 1 µm for the flip chip bonder, it is not unusual to observe misalignment errors as high as 3 µm.

© 2009 IEEE

Fig. 6.10 SEM microphotograph of a single microbump.

© 2009 IEEE

Fig. 6.11 SEM microphotograph of microbump after thermal compression.

An important attribute of the micro-solder is its capability to scale to a smaller footprint as silicon chip fabrication scales to its next generation technology. Historical scaling of silicon linewidth, conventional area solder diameter and micro-solder diameter are shown in Figure 6.12. One of the issues with conventional area solder is that its scaling roadmap does not match silicon CMOS scaling and produces a so-called “interconnect gap” between on-chip and off-chip wires, due to the fact that area solder is an important chip-to-package technology. By contrast, microsolder is a chip-to-chip connection that is more likely to scale along with next generation silicon linewidths.

[Figure 6.12 plot: silicon linewidth (µm) and bump diameter (µm) on logarithmic axes versus time (years, 1985 to 2015), with curves for area solder bump diameter, microsolder bump diameter, and silicon linewidth.]

Fig. 6.12 Technology scaling of silicon linewidth, C4 solder pitch, and microsolder pitch. Microsolder provides a powering solution for bridge chips that matches the needs of proximity communication.

Microsolder technologies complement proximity communication by providing power and ground solutions for various bridge chip geometries. When used with flip-chip bonding, microsolder has a sufficiently small footprint and alignment metrics to match proximity communication requirements in terms of x, y, z chip misalignment. While microsolder does not itself lead to chip reworkability or replaceability, it can work together with proximity communication to produce hybrid assemblies that are reworkable and replaceable.
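The figures quoted in this section (a 45 µm bump pitch, under 100 mΩ per bump, and thousands of bumps per bridge) can be combined into a quick estimate of connection density and aggregate supply resistance. This is a sketch under simplifying assumptions: a square bump grid, identical bumps sharing current equally, and no contribution from on-chip wiring. The bump count and supply current in the example are hypothetical.

```python
def bumps_per_cm2(pitch_um):
    """Bump density for a square grid with the given pitch."""
    return (1e4 / pitch_um) ** 2


def parallel_resistance_mohm(r_bump_mohm, n_bumps):
    """Resistance of n identical bumps in parallel (equal current sharing assumed)."""
    if n_bumps < 1:
        raise ValueError("need at least one bump")
    return r_bump_mohm / n_bumps


def ir_drop_mv(current_a, r_mohm):
    """Voltage drop in millivolts across a resistance given in milliohms."""
    return current_a * r_mohm


density = bumps_per_cm2(45.0)
r_total = parallel_resistance_mohm(100.0, 1000)  # 1000 power bumps: hypothetical count
print(f"45 um pitch: ~{density:,.0f} bumps per cm^2")
print(f"1000 bumps at 100 mOhm each: {r_total:.2f} mOhm in parallel")
print(f"IR drop at 1 A (hypothetical): {ir_drop_mv(1.0, r_total):.2f} mV")
```

Even with the upper-bound resistance of 100 mΩ per bump, a thousand bumps in parallel present a fraction of a milliohm, which suggests why a dense microbump array is attractive for powering bridge chips.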

6.3 A low-cost package for capacitive proximity communication

The precise alignment and inter-chip power connection mechanisms discussed above are both independent of the housing for the chips, which can be accomplished with low-cost injection molded parts that provide chip housings that mate together. When microsphere balls are assembled in the etch pits of the first chip, they create a relative reference system controlling location, rotation, and height of its facing chip. When the facing chip is brought into approximate alignment during the mating of


the package connector housings, the ball located in an etch pit of the first chip will locate itself in an etch pit of the second chip. This provides the primary position reference. A second populated etch pit establishes rotational reference, with some allowance on the true position of the ball. Two more balls on another edge of the mating chip can then find etch pits in a third chip (and so on), and provide the final spacing reference between the three (or more) chips. A true position analysis, based on the accuracy of the silicon etch process and the planarity of the silicon wafers and the underlying substrate, can determine the optimal size and location of the etch pits. Note that the precise alignment between chips occurs passively. The tapering sidewalls of the pyramidal etch pits provide guidance into precise final alignment after coarse registration occurs to within the radius of the ball. An objective of the package is thus to guide the chips into approximate position so that the ball and etch pit mechanism engage, and also to provide and maintain force on the facing chips so that final alignment is maintained. This mechanism allows one to take advantage of the relatively coarse chip placement accuracies of the packaging industry. Hence the basic mechanical functionality of the multi-chip package or housing is to initially provide coarse chip registration, followed by a low insertion force coupling to help guide the alignment, and finally retention of the two chips after the precise alignment is complete.
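The two-stage alignment described above can be summarized with a pair of small checks. The sketch below is illustrative only: it takes the capture condition to be a coarse placement error smaller than the ball radius, as stated in the text, and estimates the residual rotation from the positional allowance at the second ball divided by the spacing between the two balls. The function names and the numerical values in the example are hypothetical.

```python
import math


def will_self_align(coarse_error_um, ball_diameter_um):
    """True if the coarse placement error is within the ball radius (capture range)."""
    return coarse_error_um < ball_diameter_um / 2.0


def rotation_error_deg(ball_spacing_mm, position_allowance_um):
    """Residual in-plane rotation when the second ball is off by the given allowance."""
    return math.degrees(math.atan2(position_allowance_um, ball_spacing_mm * 1e3))


# Hypothetical example: 400 um balls, 40 um coarse package placement error,
# two balls 10 mm apart, 1 um positional allowance at the second ball.
print(will_self_align(40.0, 400.0))
print(f"{rotation_error_deg(10.0, 1.0):.4f} degrees")
```

The capture check illustrates why coarse placement accuracies typical of the packaging industry are sufficient: the package only has to land the ball somewhere on the tapered sidewalls, and the pits do the rest.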

Fig. 6.13 Exploded view of a low-cost linear vector package with four island chips and four bridges spanning pairs of islands. The four bridges appear on two chips, with two bridges per chip; the silicon between bridges was non-functional and could have been diced away, but was kept to ease handling and assembly.

The package must also provide power supplies and a ground to the chips, and a means to extract heat. All this functionality can be provided not only with conventional ceramic or organic chip packages but also with a low-cost injection-molded plastic package, as shown in Figures 6.13 and 6.14. This package was designed


Fig. 6.14 Assembled view of the linear vector package. Again, island chips are face up and two bridge chips (with four bridges) are face down. Bridges are held down by pressure clips. The package itself contains mechanical features to provide coarse chip-to-chip alignment.

to support four island chips in a line, with bridges spanning each island-to-island hop. For simplicity, the bridges were paired together, with two bridges on the same elongated piece of silicon, one on each end. The silicon between distinct bridges was non-functional and could have been diced away (especially if the island chip faces needed to be accessed), but leaving it attached simplified assembly and handling. This linear package provides coarse chip alignment to approximately 30–40 µm using guideposts in the housing (not shown); top clips with spring-loaded actuation to keep facing bridge chips held together; and the facility for a compound heat sink to be attached from below. A spring-loaded clasp provides a controlled preload force to ensure that the two chips establish their precisely controlled proximity mating, and holds the chips together accurately. The housing provides the necessary strain relief, so that no external mechanical influence affects the alignment of the two chips. The clasp may be opened, allowing the chips to detach. The package thus provides the compressive force needed to prevent chip separation except during intentional replacement. Assembly is simplified because it does not require active alignment: the final resting position and alignment between chips rely solely on the etch pits and spheres. The multi-chip assembly may then be wire-bonded to a printed circuit board or second-level package after the chips have been precisely aligned. A heat sink attached from below can provide a path for heat removal and also mechanical reinforcement of the package. The fabricated and assembled package shown in Figure 6.15 is the first to successfully demonstrate capacitive proximity communication between multiple chips outside of an experimental laboratory setup. After assembly of the package in Figure 6.15, global positioning measurements showed only 1 µm positioning error between neighboring chips. 
The total global positioning error across the array was observed to be 3 µm. We believe there is an additive error across the array because of an internal positioning bias associated with population of the package. Detailed investigations to resolve the origin of the global positioning bias are underway. The fully functional package, with the ball-and-pit alignment approach integrated into CMOS, the micro-soldered bridges, and the low-cost injection-molded plastic housing with integrated heat sink, was assembled with active island and bridge chips containing proximity communication circuitry and found to be operational. The PxC circuitry communicates across multiple chips in the package with a low bit error rate.

Fig. 6.15 A fabricated package after alignment, assembly, and wire bonding to a printed circuit board. (Eberle, Hot Chips, 2007; Sun Microsystems Research Labs.)
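The difference between a biased (additive) error and a random one can be illustrated with a short calculation. This is only a sketch: it assumes that the four island chips form three neighbor-to-neighbor hops and that each hop contributes the measured 1 µm error; the function name is hypothetical. A same-signed bias accumulates linearly, which reproduces the observed 3 µm, whereas independent zero-mean errors would add in quadrature to roughly 1.7 µm.

```python
import math
from typing import Tuple


def accumulated_error_um(per_hop_um: float, hops: int) -> Tuple[float, float]:
    """Return (biased, independent) end-to-end error estimates in µm.

    biased: every hop errs in the same direction, so errors add linearly.
    independent: zero-mean, uncorrelated hop errors add in quadrature.
    """
    biased = per_hop_um * hops
    independent = per_hop_um * math.sqrt(hops)
    return biased, independent


# Assumption: four island chips in a line give three neighbor-to-neighbor hops.
biased, independent = accumulated_error_um(per_hop_um=1.0, hops=3)
print(f"biased: {biased:.2f} um, independent: {independent:.2f} um")
```

The linear result matching the measurement is consistent with, but does not prove, the positioning-bias explanation given in the text.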

6.4 Array packages using bridge chips

The properties of proximity communication make multi-chip modules (MCMs) with large chip counts an attractive way of designing computer systems. But building large-scale MCMs involves a number of simultaneous challenges: alignment, power provisioning, and thermal management are the most important of these. In this section we describe one way of packaging an array of chips that satisfies these constraints. As discussed above, we will call the lower chips in Figure 6.2 islands and the upper chips bridges, although now we envision a two-dimensional array of islands (and corresponding bridges connecting them). Power and ground will be directly provided to the island chips, which have higher functionality, processing power, and consequently power consumption. The bridge chips, which primarily connect together island chips, have correspondingly lower functionality and power consumption, and may have their power and ground provided from one of their island chips.

As before, both chip-to-chip lateral alignment and axial (vertical) z-separation between transmitter and receiver pads on opposing chips must be well controlled. In a two-dimensional array of chips, maintaining a precise z-alignment grows increasingly challenging as the size of the array grows, especially for rigid (i.e., non-compliant) chips. This is especially true when accounting for flatness tolerances of the heat sink and the package base supporting the chip array, and other packaging considerations associated with tiling chips in a remateable fashion. In addition, alignment difficulties are not only static and at assembly time; such an array may also experience large temperature excursions, temperature gradients, mechanical shock, or vibration during operation. Nevertheless, the misalignment in all dimensions must be held below cutoff values, beyond which the capacitive coupling between chip pads is insufficient to support reliable communication.

We have implemented one design that provides approximately 30 µm of vertical compliance and the needed package flexibility, by thinning the bridge chips until they are flexible. This predetermined amount of flexibility may be chosen to provide the compliance required to maintain the chips within the desired target separation, while limiting the amount of thinning required of the chips. This last consideration is important, because chips that have been excessively thinned are difficult to handle and also exhibit reduced reliability. In this design, each bridge chip is permanently attached to a first island chip and is remateably attached (allowing proximity communication) to a second island chip. The package provides compliance under force to minimize chip separation.
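To see why both lateral misalignment and z-separation erode the coupling, a simple parallel-plate estimate is helpful. The sketch below is an illustration under stated assumptions, not the authors' model: it ignores fringing fields, assumes an air gap (relative permittivity of 1), and uses a hypothetical 30 µm square pad at a 2 µm gap, values that do not come from the text.

```python
EPS0_F_PER_M = 8.854e-12  # vacuum permittivity


def coupling_capacitance_f(pad_side_um, gap_um, dx_um=0.0, dy_um=0.0, eps_r=1.0):
    """Parallel-plate capacitance (farads) of two facing square pads.

    Lateral offsets dx/dy shrink the overlap area; fringing fields are ignored.
    """
    overlap_x_um = max(0.0, pad_side_um - abs(dx_um))
    overlap_y_um = max(0.0, pad_side_um - abs(dy_um))
    area_m2 = overlap_x_um * overlap_y_um * 1e-12  # µm² to m²
    return EPS0_F_PER_M * eps_r * area_m2 / (gap_um * 1e-6)


# Hypothetical pad geometry, chosen only for illustration.
aligned = coupling_capacitance_f(30.0, 2.0)
shifted = coupling_capacitance_f(30.0, 2.0, dx_um=15.0)  # half-pad lateral shift
lifted = coupling_capacitance_f(30.0, 4.0)               # doubled z-separation
print(f"aligned {aligned * 1e15:.2f} fF, shifted {shifted * 1e15:.2f} fF, "
      f"lifted {lifted * 1e15:.2f} fF")
```

In this simplified model, a half-pad lateral shift and a doubling of the gap each halve the coupling capacitance, which is why the package must bound misalignment in all three dimensions.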

© 2009 IEEE

Fig. 6.16 Cartoon of a two-dimensional array of silicon chips aligned for proximity communication using a lattice. The four wings of each bridge are accurately registered to adjacent island chips with a pair of etch pits and spherical balls per wing. A bridge chip is conductively connected with microbumps and powered from the island to which it is attached. Central areas on each island chip are exposed for subsequent solder attachment for power, ground, and conductive signaling (not shown).

The ball-in-pit alignment mechanism relies on a local alignment of chips that can be extended in one dimension to a linear array of chips (see Section 6.3), which works even if the initial global positioning of chips was unavailable. In principle, this will also be true in two dimensions, assuming that all the chips are within approximate position to facilitate alignment. To achieve this, we employ a coarse alignment lattice system that provides approximate positioning of the chips in the array, and some freedom of motion to allow the bridge chip to accurately mate with the island chip. Again, the final resting position and orientation is determined solely by the location and orientation of the microspheres.


© 2009 IEEE

Fig. 6.17 Photograph of a prototype array of sixteen islands and 8 “quad-bridge” chips. The package achieved a lateral chip placement precision of ±2 µm with a z-separation of under 1 µm between each bridge wing and its corresponding island chip.

Figure 6.16 depicts this concept with a grid of face-up island chips and face-down bridges. These bridges are combined, with four bridges fabricated from a single piece of silicon that is then etched away to create a “sombrero” shape. The center of the bridge is also removed to expose the supporting island chip’s face for solder attachment of power, ground, and conductive I/O. Attaching each sombrero “quad-bridge” chip using microsolder to a subset of islands allows for chip rework; the wings of each bridge use ball-in-pit alignment to lock to neighboring islands. Figure 6.17 shows a prototype of this array system, in which prototype islands were soldered to prototype quad-bridge chips, which in turn aligned to neighboring islands using balls in pits with ±2 µm lateral precision and under 1 µm separation between bridges and islands. The global lattice holding the islands atop a copper cold plate provided coarse chip alignment. Using facing etch pits that are elongated along two orthogonal in-plane directions allows a fixed amount of play in the array. This can also be done using enlarged pits that allow some movement by alignment balls, because we prefer in fact that the two chip surfaces touch normally; the balls and pits only need to provide x- and y-positioning, not z-alignment. This refinement allows a predetermined relaxation in tolerances for the true position and orientation of the fabricated etch pits, as well as some relaxation in the positional tolerance requirements for the assembly of the aforementioned balls, while still maintaining the overall requirements of location and spacing for proximity communication. Because chip positions are also subject to dynamic changes (such as thermal expansion), this technique can help alleviate mechanical perturbations during the assembly and operation of the chips.

Above the two-dimensional array of chips is a top plate (not shown) that provides sufficient compressive force on the chips to maintain the spheres within the pits and hence the alignment between chips. It also provides the force needed to ensure that the chips maintain their proper spacing once mated, and an electrical path for powering the island chips. The bridges can be powered from the island chips as discussed above. The top plate also provides some protection for the chips in the event of gross misalignment during the mating of connectors, in that it will “give” to prevent an interference fit and resulting damage to both chips. Finally, the top plate must provide area solder interconnects for power and ground as well as other control signals to be provided to the island chips.

Acknowledgements

The authors thank Ivan Sutherland for providing the inspiration to perform this work. The authors also gratefully acknowledge many valuable suggestions and helpful guidance from Robert Drost, Arthur Zingher, Bruce Guenin, Ron Ho, John Simons, Hans Eberle, and Danny Cohen. This material is based upon work supported, in part, by DARPA under agreements HR0011-08-09-0001 and W911NF-07-1-0529. Approved for public release by DARPA, distribution unlimited.

References

1. J. Poulton, R. Palmer, A.M. Fuller, T. Greer, J. Eyles, W.J. Dally, M. Horowitz, “A 14-mW 6.25-Gb/s transceiver in 90-nm CMOS,” IEEE Journal of Solid-State Circuits, vol. 42, no. 12, 2007, pp. 2745–2757.
2. R.J. Drost, R.D. Hopkins, R. Ho, I.E. Sutherland, “Proximity communication,” IEEE Journal of Solid-State Circuits, vol. 39, no. 9, 2004, pp. 1529–1535.
3. X. Zheng, J. Lexau, J. Bergey, J.E. Cunningham, R. Ho, R. Drost, A.V. Krishnamoorthy, “Optical transceiver chips based on co-integration of capacitively coupled proximity interconnects and VCSEL,” IEEE Photonics Technology Letters, vol. 19, no. 7, 2007, pp. 453–455.
4. D. Hopkins, A. Chow, R. Bosnyak, B. Coates, J. Ebergen, S. Fairbanks, J. Gainsley, R. Ho, J. Lexau, F. Liu, T. Ono, J. Schauer, I. Sutherland, R. Drost, “Circuit techniques to enable 430 Gb/s/mm/mm proximity communication,” Digest of Technical Papers, IEEE International Solid-State Circuits Conference, 2007, pp. 368–369.
5. W. Menz, J. Mohr, O. Paul, Microsystem technology, Wiley-VCH, CITY, 2000, p. 223.
6. I. Shubin, E.M. Chow, J. Cunningham, D. DeBruyker, C. Chua, B. Cheng, J. Knights, K. Sahasrabuddhe, Y. Luo, A. Chow, J. Simons, A.V. Krishnamoorthy, R. Hopkins, R. Drost, R. Ho, D. Douglas, J. Mitchell, “Novel Packaging with Rematable Spring Interconnect Chips for MCM,” 59th Electronic Components and Technology Conference, San Diego, 2009.
7. A.N. Cleland, Foundations of Nanomechanics, Ch. 10, 2003, Springer-Verlag, Berlin-Heidelberg.
8. J.E. Cunningham, X. Zheng, I. Shubin, R. Ho, J. Lexau, A.V. Krishnamoorthy, M. Asghari, D. Feng, J. Luff, H. Liang and C.-C. Kung, “Optical Proximity Communication in Packaged SiPhotonics,” Proceedings of the 5th IEEE International Conference on Group IV Photonics, FB8, 2008, pp. 383–385.
9. A.V. Krishnamoorthy, J.E. Cunningham, X. Zheng, I. Shubin, J. Simons, D. Feng, H. Liang, C.-C. Kung, M. Asghari, “Optical proximity communication with passively aligned silicon photonic chips,” IEEE Journal of Quantum Electronics, vol. 45, no. 4, 2009, pp. 409–414.
10. A.V. Krishnamoorthy, R. Ho, X. Zheng, H. Schwetman, J. Lexau, P. Koka, G. Li, I. Shubin, J.E. Cunningham, “The integration of silicon photonics and VLSI electronics for computing systems intra-connect,” Proceedings, SPIE Photonics West, Vol. 7220: Silicon Photonics IV, 2009, pp. 1–12.

Part V

Extending Data Coupling Technologies

Chapter 7

Delivering On-chip Bandwidth Off-chip and Out-of-box with Proximity and Optical Communication

Ashok V. Krishnamoorthy, Jon Lexau, Xuezhe Zheng, John E. Cunningham

7.1 Introduction

While copper-based electrical Serdes links have, to date, dominated the domain of ultra-short reach interconnects, future high-performance computers may require the integration of diverse interconnect technologies. In previous chapters, various forms of proximity communication that can provide low-energy chip-to-chip links between adjacent chips have been described. The strengths of proximity communication lie in low-energy short-distance links; the strengths of optical communication lie in efficiently reaching longer distances. Here we look to combine these technologies in a new hybrid I/O platform that can deliver balanced bandwidth on-chip, off the chip and even out of the box. In this chapter we will introduce the concepts of an optical-to-proximity interface chip, and review results from an experimental 90 nm test chip that integrates three types of high-speed chip-to-chip interconnects: capacitive interconnects for proximity communication; optical interconnects employing vertical-cavity surface-emitting lasers (VCSELs) and photodiodes; and electrical interconnects using current-mode logic (CML). We will discuss the operation and compatibility of each interconnect modality, and review interface requirements, chip layout considerations and test results.

Dr. Ashok V. Krishnamoorthy, Sun Microsystems Chief Technology Organization, 9515 Towne Centre Drive, San Diego, CA 92121, USA, e-mail: [email protected]
Jon K. Lexau, Sun Microsystems Research Labs, 16 Network Circle, Menlo Park, CA 94025, USA, e-mail: [email protected]
Dr. Xuezhe Zheng, Sun Microsystems Chief Technology Organization, 9515 Towne Centre Drive, San Diego, CA 92121, USA, e-mail: [email protected]
Dr. John E. Cunningham, Sun Microsystems Chief Technology Organization, 9515 Towne Centre Drive, San Diego, CA 92121, USA, e-mail: [email protected]

R. Ho and R. Drost (eds.), Coupled Data Communication Techniques for High-Performance and Low-Power Computing, Integrated Circuits and Systems, DOI 10.1007/978-1-4419-6588-2_7, © Springer Science+Business Media, LLC 2010

7.2 Photonics as a long-reach interconnect

Over the last several decades, optical data links have penetrated deeper and deeper into the interconnection hierarchy [1]. Historically, optical interconnects have had commercial success when the product of the channel bit-rate and the distance exceeded 100 m-Gbps. This trend is evidenced by widely deployed interconnect standards including fast Ethernet (100 Mbps), gigabit Ethernet (1 Gbps), double-data rate InfiniBand (4x5 Gbps), and quad-data rate InfiniBand (4x10 Gbps). Optical links are expected to continue to break into the interconnect hierarchy by providing increased aggregate bandwidth performance and lower energy communication over increasingly shorter distances.

At the other extreme, coupled-data technology [2] can provide efficient and dense interconnect over very short chip-to-chip distances under 100 µm. However, this simply takes the traditional chip-edge bandwidth bottleneck and moves it to the mid-range interconnect scale of a few centimeters to a few meters: in other words, to the printed circuit board, backplanes, and equipment racks. Interconnections at this length scale have typically been dominated by electrical serial links. Using state-of-the-art signaling technologies, such as pre/de-emphasis, equalization, and/or multilevel signaling, single-channel data rates of electrical interconnect can be pushed well beyond 10 Gbps over moderate distances of printed circuit boards. The real challenge for serial electrical interconnect is not to provide high bandwidth for individual channels but instead to pack and route thousands of such signals at high density. This challenge exists on all levels: in the chip package, on the board, and among the backplane connector levels. Although one can imagine a system interconnect hierarchy that includes proximity communication at the chip-to-chip level, transitioning to serial links at the board level, and finally to optical links for longer distance communication, this complex hierarchy may involve both the throttling of the aggregate communication bandwidth and energy-inefficient conversions between the various communication interfaces.
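The bit-rate-distance rule of thumb quoted above can be written as a one-line check. The sketch below is only illustrative: the function names and the two example links are hypothetical, and the 100 m-Gbps threshold is the historical figure from the text rather than a hard design rule.

```python
OPTICAL_THRESHOLD_M_GBPS = 100.0  # historical rule-of-thumb value quoted in the text


def bitrate_distance_product(bit_rate_gbps, distance_m):
    """Product of channel bit-rate (Gbps) and link distance (m)."""
    return bit_rate_gbps * distance_m


def historically_favors_optics(bit_rate_gbps, distance_m):
    """True when the product exceeds the historical 100 m-Gbps threshold."""
    return bitrate_distance_product(bit_rate_gbps, distance_m) > OPTICAL_THRESHOLD_M_GBPS


# Hypothetical example links, chosen only for illustration.
examples = {
    "10 Gbps over 30 m": (10.0, 30.0),
    "5 Gbps over 0.5 m": (5.0, 0.5),
}
for label, (rate, dist) in examples.items():
    verdict = "optical" if historically_favors_optics(rate, dist) else "electrical"
    print(f"{label}: {bitrate_distance_product(rate, dist):.1f} m-Gbps -> {verdict}")
```

As the text notes, the threshold is expected to keep falling as optical links move to shorter distances, so the constant should be read as a snapshot rather than a fixed limit.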


Fig. 7.1 Proximity communication and photonics–integrated on CMOS chips–extend large aggregate bandwidth across the system.

In previous chapters, we have seen that proximity communication provides fast and low-energy communication to an ASIC over short inter-chip gaps. It relies on either inductive or capacitive coupling between chips that are placed face-to-face, such as chip 1 and chip 2 in Figure 7.1. For the purposes of this chapter, we will focus on capacitive proximity communication, although the concepts and principles of the proximity-to-photonic interface chip are also readily extendible to inductive coupling between adjacent chips. Hence, chip 1 and chip 2 communicate through the capacitive coupling of the metal transmit plates and receiver plates on each chip, which are aligned with each other and driven by the transmitter and receiver circuits, respectively. The two chips can be pushed together such that their surfaces are aligned to one another and nearly touch, which allows the capacitively coupled plates to be very close to each other and thus enables small transmitter and receiver structures to communicate efficiently while consuming little power. This form of capacitive coupling also allows signal densities two orders of magnitude higher than traditional off-chip communication using wire bonding or traditional ball bonding.

Fig. 7.2 Typical interface for a parallel optical link. [Block diagram: VLSI chip 1, Tx/Rx chip, OE/EO, OE/EO, Tx/Rx chip, and VLSI chip 2, joined in sequence by interfaces A, B, C, D, and E.]

The direct integration of opto-electronic (OE) devices with CMOS VLSI chips brings the possibility of an on-chip transition from proximity to optical communication, eliminating the need for a serial, off-chip link interface. Figure 7.2 illustrates a typical data path for most electrical-to-optical system interconnect implementations. Here, an application-specific VLSI chip uses a serial link to communicate via two chip packages and a printed circuit board (interface A) to a Tx chip which then drives an optical device (interface B). After an optical waveguide interface (interface C which could be, for instance, a fiber ribbon) the optical intensity can then be detected and reconverted to an electrical current that is fed to an Rx chip (interface D), which then amplifies the detected signals and communicates the signal via another serial link (interface E) to a second VLSI chip. In this traditional approach, the aggregate bandwidth is limited by the least dense interface in the signal path, and the efficiency is reduced by the multiple conversions required along the path.
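The bottleneck argument can be made concrete with a toy model of the five-interface path. This is a sketch with entirely hypothetical bandwidth and energy numbers (none are taken from the chapter); the `Interface` type and `path_summary` helper are illustrative names. It shows only the structure of the argument: the end-to-end bandwidth is the minimum over the interfaces, while the per-bit energy is the sum.

```python
from typing import List, NamedTuple, Tuple


class Interface(NamedTuple):
    name: str
    bandwidth_gbps: float      # aggregate bandwidth the interface can carry
    energy_pj_per_bit: float   # energy spent crossing the interface


def path_summary(path: List[Interface]) -> Tuple[str, float, float]:
    """Return (bottleneck name, end-to-end bandwidth, total energy per bit)."""
    bottleneck = min(path, key=lambda iface: iface.bandwidth_gbps)
    total_energy = sum(iface.energy_pj_per_bit for iface in path)
    return bottleneck.name, bottleneck.bandwidth_gbps, total_energy


# Hypothetical values for illustration only; not measurements from the text.
traditional = [
    Interface("A (serial link)", 400.0, 10.0),
    Interface("B (Tx chip to optical device)", 1000.0, 2.0),
    Interface("C (fiber ribbon)", 2000.0, 1.0),
    Interface("D (detector to Rx chip)", 1000.0, 2.0),
    Interface("E (serial link)", 400.0, 10.0),
]
name, bandwidth, energy = path_summary(traditional)
print(f"bottleneck: {name}, {bandwidth:.0f} Gbps, {energy:.0f} pJ/bit end to end")
```

With these made-up numbers the serial links at the two ends set both the bandwidth ceiling and most of the energy cost, which is the situation the integrated interface is intended to avoid.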

Fig. 7.3 The proposed interface. [Block diagram: VLSI chip 1, proximity + photonics chip, proximity + photonics chip, and VLSI chip 2, joined in sequence by interfaces A′, C, and E′.]

Figure 7.3 depicts the interface discussed in this chapter. Here, an interface chip communicates directly to the VLSI chip via proximity communication (interface A′). This interface chip integrates Tx/Rx circuits and the optical devices and directly interfaces to the fiber interface C. The waveguide interface terminates to another proximity communication and photonic chip that converts the signals back to the electrical domain and transmits via a proximity communication link (interface E′) to the second VLSI chip. Interfaces A and E are replaced by proximity communication, and interfaces B and D are replaced with on-chip wires. Both density and energy efficiency can potentially be improved, and the signals may communicate over arbitrarily long distances using only proximity communication and optical interfaces. Depending on the density achievable for the on-chip photonic link technology, this new interface can potentially balance the gap between the on-chip and off-chip bandwidth and may provide a long-reach chip I/O technology that can scale with the on-chip feature size in both bandwidth and density. This can also reduce the power dissipation of the communication link by removing the need for off-chip driver circuits.

7.3 Photonics on VLSI (optoelectronic VLSI)

The highest-density optical interconnects to silicon VLSI circuits have been prototyped with flip-chip bonded multiple quantum-well (MQW) modulators-on-silicon [3]. These devices have high yield and low leakage, making them suitable for both light modulation and detection. Over 5,700 high-speed detector/modulator devices have been simultaneously flip-chip integrated onto a CMOS chip with a device yield exceeding 99.95%. Each bonded device had a load capacitance of under 50 fF and could be driven by a CMOS inverter at 2.5 Gbps (using 0.5 µm CMOS circuits) to accomplish the electrical-to-optical interface. This technology produced several switching system demonstrators and was the basis of two multi-project foundry shuttles of optoelectronics that resulted in CMOS-based optoelectronic chips being distributed to research groups around the world [4]. In addition to the potential density achievable by such technologies, the circuits need only drive a high-impedance capacitive pad to communicate off-chip, thereby simplifying the driver and keeping the power dissipation low. We note that an even larger number of such devices (up to one million) have been simultaneously bonded to silicon circuits for low-speed infrared imaging applications, firmly establishing the potential of this technology for density and yield.

One of the requirements for the quantum-well modulators described above was external, high-power lasers and custom optical packaging to provide surface-normal access for I/O. This made the technology suitable for special-purpose applications where the surface of the electronic chip was accessible, but not generally applicable for a low-cost optical interface. In contrast, the vertical-cavity surface-emitting laser (VCSEL) has many well-known performance and cost advantages, including its low drive current, favorable high-speed modulation characteristics, high wall-plug efficiency, the ability to complete manufacture and testing at the wafer level, and the ability to tailor the light output to improve coupling to optical fiber. One advantage of the VCSEL that is currently attracting much interest in the industry is its ability, with certain modifications, to be directly connected to electrical circuits at the chip and wafer levels [5]. Arrays of VCSELs have been bonded directly to CMOS VLSI chips, with each VCSEL capable of multi-Gbps modulation by the CMOS circuits. Research efforts have used 980 nm lasers so that the substrate was transparent to the light output from the bottom-emitting VCSELs [6, 7, 8]. 
These approaches were also extended to commercial 850 nm VCSELs bonded to CMOS followed by removal of the substrate. CMOS chips with interleaved 850 nm VCSELs and detectors have also been developed to create circuits with optical input and output. Two bonding and substrate-removal steps were applied to accomplish this: the first to bond the PIN (p-type/intrinsic/n-type) detectors, and the second for the VCSELs [9]. Compact detectors and CMOS transimpedance receiver circuits have been developed to execute, respectively, the optical-to-electrical current and electrical current-to-voltage conversions. Single-ended receivers (one diode per optical input) fabricated in a 0.25 µm CMOS technology have been operated at over 10 Gbps. Total power dissipation per channel has been reduced to approximately 2.5 mW per transmitter-receiver pair [10].

Another notable development is the integration of coarse wavelength-division-multiplexed (WDM) VCSELs with CMOS using multiple linear arrays of VCSELs grown separately, but aligned and bonded in a single step [8]. Combined with a space-to-wavelength multiplexer, this allows the fiber connector and packaging costs to be amortized over multiple data channels. This technique has been used to create terabits-per-second density interconnections to a single chip. The wavelength-to-space multiplexer technique has also been used to access a two-dimensional array of multiple quantum well modulators with a 1-D fiber array [11, 12]. Indeed, we expect that any effort to provide tens of Tbps of bandwidth to a single VLSI package will require the judicious use of wavelength multiplexing to reduce connector and waveguide routing complexity and cost.

7.4 Proximity and photonic communication


As noted in other chapters, proximity communication enables low-power and scalable I/O for VLSI chips, but with a very limited reach. On the other hand, parallel optical communication has a distinct advantage in reach, but its bandwidth density can be no greater than that of the electrical I/O used to interface to the parallel optical module. Indeed, efforts at engineering and commercializing high-density multi-channel optical modules have shown that the electrical I/O from a parallel optical transceiver can drive the size and power dissipation of the modules [13]. A “symbiotic” high-density I/O chip combining the best of the two worlds can be established by integrating proximity and optical communication together on the same CMOS platform, with proximity communication to bring the on-chip bandwidth to neighboring chips and the optics to deliver the bandwidth anywhere else it is needed in the system with low latency.

Fig. 7.4 CMOS chip that integrates proximity, serial, and optical I/O circuitry: chip layout. The chip measures 4 mm × 6 mm. © 2007 IEEE

The first step towards this goal is the development of a test chip that integrates proximity communication and VCSEL/photodetector-based optical communication on the same commercial CMOS platform. We named our test chip the Light-Out-Proximity-In (LOPI) chip (see Figure 7.4). In fact, this name describes only one of the many configurations of the chip, since it can convert from any one of its three incoming interfaces (proximity, optical, and serial electrical communication) to any of three outgoing interfaces (Figure 7.5). LOPI integrates proximity communication circuits, CML electrical I/O, and VCSEL driver and receiver circuits on the same CMOS chip. VCSEL and photodetector arrays are attached either through flip-chip bonding or wire bonding. Fiber arrays with matching pitch are butt-coupled to the VCSELs and photodetectors for optical I/O. This chip bridges the high-density bandwidth from chip 2 to chip 1 (see Figure 7.1), carried through aligned proximity communication transmitters and receivers, with the long-distance chip-to-chip communication carried through the optical I/O integrated on chip 1.
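The any-to-any conversion described above can be enumerated directly. The following sketch is illustrative only: the interface labels and the `datapath_configs` helper are hypothetical names, and the enumeration simply pairs each of the three incoming interfaces with each of the three outgoing ones, treating same-interface pairs as loopback.

```python
from itertools import product

INTERFACES = ("proximity", "optical", "cml")  # the three I/O types on the chip


def datapath_configs():
    """List every (incoming, outgoing) interface pairing."""
    return list(product(INTERFACES, repeat=2))


configs = datapath_configs()
loopbacks = [cfg for cfg in configs if cfg[0] == cfg[1]]

# "Light-Out-Proximity-In" is the pairing with proximity in and optical out.
lopi = ("proximity", "optical")
print(f"{len(configs)} pairings, {len(loopbacks)} of them loopback; "
      f"LOPI listed: {lopi in configs}")
```

The chip name therefore corresponds to just one of the nine pairings in this simple enumeration.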

7.5 Test chip results

In order to demonstrate and test this integration of proximity communication and optical I/O, we designed and fabricated a LOPI test chip in a commercial 90 nm CMOS process [14, 15], depicted in Figure 7.4. In addition to the proximity communication interface and optical I/O interface, the test chip also had an electrical I/O interface for testing purposes. Each interface had 4 channels. There were also three sets of specialized circuits for electrically measuring chip alignment, whimsically called “where blocks.” Figures 7.5 and 7.6 respectively show the test chip functional block diagram and floorplan. The three different I/O interfaces were interconnected via a network of multiplexer and de-multiplexer circuits so that different I/O combinations could be tested, including loopback on each interface. Differential current-mode logic (CML) drove serial electrical I/O off the chip at data rates of up to 5 Gbps. Three digital waveform-shaping signals allowed the user to control the CML edge rate, halve the swing of the output buffer, or disable it completely (see top of Figure 7.7). For the proximity communication interface, plates were provided both for data transmission and for measuring chip alignment. For the datapath, each transmitter plate was implemented as a 4x4 array of 16 micro-plates. Each transmit micro-plate was 26.5 µm square on a 31.25 µm pitch. Under JTAG control, one could electrically steer the transmit data up to two micro-plates' distance in any direction from the center of the array, and thus could correct for some amount of chip misalignment without physically moving the chips. The receiver used a single plate that was slightly smaller than one 4x4 group of micro-plates. The center-to-center plate spacing for both transmitter and receiver plates was 125 µm. Data to/from the optical or electrical I/O interface could be steered from/to the proximity communication interface using JTAG control. 
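The electrical steering described above can be sketched as a quantization problem along each axis. This is a minimal illustration, not the chip's actual control logic: it assumes steering moves in whole micro-plate steps of 31.25 µm, limited to two steps either side of center, and the function name and sign convention are hypothetical. Note that four micro-plates at this pitch span exactly the 125 µm plate spacing.

```python
MICRO_PLATE_PITCH_UM = 31.25   # micro-plate pitch from the text
MAX_STEER_PLATES = 2           # steering range from the text, in micro-plates


def steer_correction(misalignment_um):
    """Return (steer_steps, residual_um) for one axis.

    steer_steps is the nearest whole number of micro-plates, clamped to the
    available range; residual_um is the misalignment left after steering.
    """
    steps = round(misalignment_um / MICRO_PLATE_PITCH_UM)
    steps = max(-MAX_STEER_PLATES, min(MAX_STEER_PLATES, steps))
    residual_um = misalignment_um - steps * MICRO_PLATE_PITCH_UM
    return steps, residual_um


# Hypothetical measured offsets along one axis, in µm.
for offset_um in (40.0, -70.0, 100.0):
    steps, residual_um = steer_correction(offset_um)
    print(f"offset {offset_um:+.2f} um -> steer {steps:+d}, residual {residual_um:+.2f} um")
```

In this model, offsets of up to 62.5 µm can be reduced to a residual of about half a pitch or less, while larger offsets (such as the 100 µm case) exceed the steering range and must be corrected mechanically.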
There were also multiple sets of large arrays of transmitter/receiver plates (without the micro-steering ability) for electrically detecting the relative positions of two chips, known as “where blocks.” These blocks were composed of 20x20 arrays of pads on 25 µm centers. The optical circuits for the test chip consisted of a 4-channel, 5 Gbps per channel VCSEL driver and a 4-channel, 5 Gbps per channel optical receiver. The optical transmitter required a CML electrical input signal and converted this signal into

Fig. 7.5 CMOS chip that integrates proximity, serial, and optical I/O circuitry: functional diagram. The datapath is shown with all configurable connections (optical to CML, optical to proximity I/O, CML to proximity I/O, and loopback). Multiplexer/demultiplexer select controls are set using scan bits, and only one Tx/Rx pair is active at a time. © 2007 IEEE

Fig. 7.6 CMOS chip that integrates proximity, serial, and optical I/O circuitry: chip floorplan. Labeled regions include proximity experiments A through E, etch pits, JTAG, data steering, transmission lines, and CML/optical channels 1 through 4; the chip is 4 mm × 6 mm. © 2007 IEEE

Fig. 7.7 Block diagrams of communication interfaces: above is the CML interface, and below is the optical interface. Not shown is the proximity communication interface.

a current that could drive a VCSEL load at high speeds (bottom of Figure 7.7). Several user-configurable features for the TX allowed optimization for various VCSEL and package configurations. The tunable parameters included VCSEL bias and modulation current, as well as waveform shaping signals to control the edge-rate and crossing for the VCSEL current drive. The optical receiver consisted of a transimpedance amplifier and a 5-stage limiting amplifier with offset correction. It took as input the current signal from a photodiode, amplified it, and converted it into a differential CML output pair. There were also several user-configurable features of the optical receiver to optimize the link performance for different components and packages. All of the optical circuits, with the exception of the modulator core of the transmitter, were designed for a 1 V power supply.

We also built a printed circuit board (PCB) to test the LOPI chips. A total of three PCBs (PCB1, PCB2, and PCB3) containing LOPI chips were used for testing. Two LOPI chips were arranged face to face by mounting PCB1 face-down on a 6-axis manipulator, and PCB2 face-up on a fixture. With feedback from the on-chip “where blocks,” the 6-axis manipulator was used to correct the relative board position until the two LOPI chips were aligned. A fiber with a coupling lens collected the optical output from the VCSEL on PCB2. The other end of the fiber was butt-coupled to the photodetector on PCB3 for a complete optical link.

The optical link demonstrated 5 Gbps operation, as shown by the plot of bit-error rate (BER) versus receiver optical power in Figure 7.8. No apparent noise floor was observed down to a BER of $10^{-14}$, and a receiver sensitivity of -11.5 dBm was obtained at a BER of $10^{-12}$. The inset of Figure 7.8 shows the eye diagrams of the optical transmitter (top) and link (bottom) at 5 Gbps.

We also tested the complete link with proximity communication and all three chips.
Data into the electrical input of the LOPI chip on PCB1 was steered to the proximity interface, transferred to the LOPI chip on PCB2 via proximity communication, steered to the optical interface, and converted to an optical output at a VCSEL that was coupled into a fiber and transmitted to a receiver on PCB3. A data rate of 2.5 Gbps achieved a BER better than $10^{-12}$ for this complete LOPI link. Figure 7.9 shows the timing margins for the proximity-plus-CML link and the proximity-plus-optical link at a data rate of 1.85 Gbps (at a BER under $10^{-12}$). Notice that the proximity-plus-optical link offered better timing margins for a given chip separation, suggesting that the optical interface had better performance and lower jitter than the corresponding CML interface on the chip. Figure 7.10 shows the timing “waterfall” plots for a data rate of 1.85 Gbps. The figure shows that reliable communication without a clear noise floor could be obtained for chip separations below 12 µm. Above this separation, clear indications of a noise floor appeared.

Fig. 7.8 BER versus optical power for the optical link at a data rate of 5 Gbps. Inset shows the respective eye diagrams of the optical transmitter and the optical link at a data rate of 5 Gbps. © 2007 IEEE

Fig. 7.9 Comparison of link timing margin versus chip separation for proximity-plus-optical and proximity-plus-CML communication. © 2007 IEEE

Fig. 7.10 BER waterfall curves for proximity-plus-optical communication versus chip separation (0 µm to 12 µm in 2 µm steps). Axes: log BER versus timing offset (% of bit period).

We also performed measurements to characterize the tolerance of proximity communication to the chip separation. The complete LOPI link was operated at a 1.85 Gbps data rate with a BER under $10^{-12}$, and we varied the chip separation to observe the link performance degradation. From Figure 7.10, one can determine the timing margin for different proximity chip separations. As an approximate rule of thumb, typical clock and data recovery circuits require a minimum phase margin of 40% of a unit interval (UI) to work properly. With this criterion, the plots indicate that the LOPI link can tolerate up to 10 µm chip separation.

Figure 7.11 depicts the entire test setup. For all the results shown above, 5 meters of 62.5/125 multimode fiber was used. We also tried the LOPI link with 100 meters of fiber and observed no noticeable performance degradation.
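To put the 40% UI rule of thumb in absolute terms, the following sketch converts a data rate into a unit interval and a minimum timing margin. The function names are illustrative and the 40% default is the rule of thumb quoted above, not a property of any particular clock and data recovery circuit.

```python
def unit_interval_ps(data_rate_gbps: float) -> float:
    """Bit period in picoseconds for a serial link at the given rate."""
    return 1000.0 / data_rate_gbps


def min_timing_margin_ps(data_rate_gbps: float,
                         min_margin_fraction: float = 0.40) -> float:
    """Smallest acceptable timing margin, as a fraction of one UI."""
    return min_margin_fraction * unit_interval_ps(data_rate_gbps)


if __name__ == "__main__":
    rate_gbps = 1.85  # data rate used for the separation measurements
    print(f"UI = {unit_interval_ps(rate_gbps):.1f} ps")
    print(f"40% of UI = {min_timing_margin_ps(rate_gbps):.1f} ps")
```

At 1.85 Gbps one UI is roughly 540 ps, so the 40% criterion corresponds to a margin of a little over 200 ps.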

Fig. 7.11 Test setup for proximity, serial, and optical links.

7.6 Conclusion

For any interconnect, the end goal is to transfer information. In modern digital systems, the sender and recipient of information are typically VLSI chips. Proximity communication provides a high-bandwidth, high-density channel between two neighboring chips. Test results show far lower per-pin power, latency, and area costs when compared to traditional solder balls. However, because this technique relies on capacitive (or inductive) coupling between two chips placed face-to-face, it only works if the chips are in very close proximity. In contrast, optical networks provide a proven communication technology for larger distances, ranging from backplanes to wide-area networks.

Systems that integrate many multi-chip packages together can benefit from both communication technologies: they can use proximity communication within each package for very high bandwidth and low power data transfers, and they can communicate between packages using optical networks. In addition, such a system would likely require high-speed electrical channels for system I/O, testing, and configuration. This chapter explored these technology options of optics, proximity communication, and high-speed electrical I/O.

To be competitive, the parallel optical interconnect must provide a scaling path beyond conventional electrical off-chip interconnect bandwidth. The proposed proximity-to-optical transceiver chip uses the proximity interconnect concept to provide extremely high-density, high-bandwidth communication between the data sender and the optical transceiver. We expect that the combination of proximity communication and advanced optical interconnect concepts will provide the much-needed bandwidth for very short interconnect lengths (less than half a meter) as well as system-level interconnect (much longer than 1 meter).

We successfully demonstrated the integration of proximity communication with optical communication on a commercial 90 nm CMOS platform, as a promising I/O solution that can scale with the VLSI technology. Because the emphasis was on demonstrating integration, relatively large proximity pads were used for the test chip instead of pushing the limit for high density and throughput. The complete LOPI interface comprised four I/O channels, each at 2.5 Gbps (10 Gbps aggregate throughput), with the CML-to-optical I/O capable of 5 Gbps per channel (20 Gbps aggregate throughput). The I/O link performance was characterized for various data rates and chip separations, and interoperability was verified between all three interfaces at speeds exceeding 2.5 Gbps per channel. A maximum chip separation tolerance of approximately 10 µm was obtained for the proximity interface. This work is ongoing; future prototypes will more fully characterize metrics such as power and latency, further explore die packaging, and drive more aggressive photonic interconnect technologies such as silicon-based photonics.

References

1. A.V. Krishnamoorthy, “Photonics-to-electronics integration for optical interconnects in the early 21st century,” Optoelectronics Letters, vol. 2, no. 3, 2006, pp. 163–168.
2. R.J. Drost, R.D. Hopkins, R. Ho, I.E. Sutherland, “Proximity communication,” IEEE Journal of Solid-State Circuits, vol. 39, no. 9, 2004, pp. 1529–1535.
3. K.W. Goossen, J.E. Cunningham, W.Y. Jan, “GaAs 850 nm modulators solder-bonded to silicon,” IEEE Photonics Technology Letters, vol. 5, 1993, pp. 776–778.
4. A.V. Krishnamoorthy, K.W. Goossen, “Progress in Optoelectronic-VLSI smart pixel technology based on GaAs/AlGaAs MQW modulators,” International Journal of Optoelectronics, vol. 11, no. 3, 1997, pp. 181–198.
5. L.M.F. Chirovsky, A.V. Krishnamoorthy, W.S. Hobson, J. Lopata, L.A. D’Asaro, “Vertical-cavity surface-emitting lasers specifically designed for integration with electronic circuits,” Heterogeneous Optoelectronics Integration, SPIE Critical Review, vol. CR76, 2000, pp. 49–74.
6. A.V. Krishnamoorthy, L.M.F. Chirovsky, W.S. Hobson, R.E. Leibenguth, S.P. Hui, G.J. Zydzik, K.W. Goossen, J.D. Wynn, B.J. Tseng, J. Lopata, J.A. Walker, J.E. Cunningham, L.A. D’Asaro, “Vertical-cavity surface emitting lasers flip-chip bonded to gigabit/s CMOS circuits,” IEEE Photonics Technology Letters, vol. 11, no. 1, 1999, pp. 128–130.
7. F.E. Doany, C. Schow, C. Baks, R. Budd, Y.-J. Chang, P. Pepeljugoski, L. Schares, D. Kuchta, R. John, J. Kash, F. Libsch, R. Dangel, F. Horst, B. Offrein, “160-Gb/s Bidirectional Parallel Optical Transceiver Module for Board-Level Interconnects Using a Single-Chip CMOS IC,” Proceedings of the 57th Electronic Components and Technology Conference, vol. 57, 2007, pp. 1256–1261.
8. B.E. Lemoff, M. Ali, G. Panotopoulos, G.M. Flower, B. Madhaven, A.F.J. Levi, D.W. Dolfe, “MAUI: enabling fiber-to-the-Processor with parallel multiwavelength optical interconnects,” IEEE Journal of Lightwave Technology, vol. 22, no. 9, 2004, pp. 2043–2054.
9. A.V. Krishnamoorthy, “The intimate integration of photonics and electronics,” Advances in Information Optics and Photonics, ICO vol. VI, SPIE Press, 2008, pp. 589–607.
10. C. Kromer, G. Sialm, C. Berger, T. Morf, M.L. Schmatz, F. Ellinger, D. Erni, G. Bona, H. Jäckel, “A 100 mW 4x10 Gb/s transceiver in 80 nm CMOS for high-density optical interconnects,” IEEE Journal of Solid-State Circuits, vol. 40, no. 12, 2005, pp. 2667–2679.
11. A.V. Krishnamoorthy, J.E. Ford, F.E. Kiamilev, R.G. Rozier, S. Hunsche, K.W. Goossen, B. Tseng, J.A. Walker, J.E. Cunningham, W.Y. Jan, M.S. Nuss, “The Amoeba switch: an optoelectronic switch for multiprocessor networking using dense-WDM,” IEEE Journal of Selected Topics in Quantum Electronics, vol. 5, no. 2, 1999, pp. 261–275.
12. B.E. Nelson, G.A. Keeler, D. Agarwal, N. Helman, D.A.B. Miller, “Wavelength Division Multiplexed Optical Interconnect Using Short Pulses,” IEEE Journal of Selected Topics in Quantum Electronics, vol. 9, no. 2, 2003, pp. 486–491.
13. C. Cook, J.E. Cunningham, A. Hargrove, G.G. Ger, K.W. Goossen, W.Y. Jan, H.H. Kim, R. Krause, M. Manges, M. Morrissey, M. Perinpanayagam, A. Persaud, G.J. Shevchuk, V. Sinyansky, A.V. Krishnamoorthy, “A 36-channel transceiver parallel optical interconnect module based on optoelectronics-on-VLSI technology,” IEEE Journal of Selected Topics in Quantum Electronics, vol. 9, no. 2, 2003, pp. 387–399.
14. X. Zheng, J. Lexau, J. Bergey, J.E. Cunningham, R. Ho, R. Drost, A.V. Krishnamoorthy, “Optical transceiver chips based on co-integration of capacitively coupled proximity interconnects and VCSEL,” IEEE Photonics Technology Letters, vol. 19, no. 7, 2007, pp. 453–455.
15. J.K. Lexau, X. Zheng, J. Bergey, A.V. Krishnamoorthy, R. Ho, R. Drost, J.E. Cunningham, “CMOS integration of capacitive, optical, and electrical interconnects,” Proceedings of the International Interconnect Technology Conference, 2007, pp. 78–80.

Chapter 8

AC Coupled Wireless Power Delivery

Makoto Takamiya, Kohei Onizuka, and Takayasu Sakurai

8.1 Three dimensional stacked inter-chip wireless power delivery

The three-dimensional (3D) integration of stacked chips has recently gained popularity as an approach for implementing a System-in-a-Package (SiP). Through-silicon vias (TSVs) and bonding wires are the primary candidates for electrically connecting the stacked chips. Current commercially available 3D stacked chips use bonding wires due to their low cost. Bonding wires, however, are not a final solution for 3D stacked chips, because they limit signal bandwidth and hence the number of stacked chips. TSVs represent an ideal technology for 3D stacked chips, because their signal bandwidth is high and the number of stacked chips is theoretically unlimited. TSVs, however, have not been put to practical use in 3D stacked chips due to their high cost. The first commercially available product with TSVs was a CMOS image sensor released by Toshiba in 2007 [1], and in the future TSVs will increasingly expand their application range.

In order to bridge the gap between high-performance but expensive TSVs and low-performance but cheap bonding wires, researchers have proposed short-range wireless communication such as inductive [2] and capacitive [3, 4, 5] coupling communication for 3D stacked chips. Inductive coupling communication between a processor chip and an SRAM chip has recently been demonstrated [6], and inductive coupling communication between NAND flash memory chips for SSDs has also recently been demonstrated [7].

Professor Makoto Takamiya, VLSI Design and Education Center, University of Tokyo, 4-6-1 Komaba, Meguro-ku, Tokyo 153-8505, JAPAN, e-mail: [email protected]

Dr. Kohei Onizuka, formerly with the Institute of Industrial Science, University of Tokyo, and now with Toshiba Corporation.

Professor Takayasu Sakurai, Institute of Industrial Science, University of Tokyo, 4-6-1 Komaba, Meguro-ku, Tokyo 153-8505, JAPAN, e-mail: [email protected]

R. Ho and R. Drost (eds.), Coupled Data Communication Techniques for High-Performance and Low-Power Computing, Integrated Circuits and Systems, DOI 10.1007/978-1-4419-6588-2_8, © Springer Science+Business Media, LLC 2010


The power delivery in previously reported inductive and capacitive coupling communication, however, was not wireless and was based on bonding wires. This presents difficulties in closely placing two stacked chips, because bonding wires typically need several hundred microns of separation between the chips. One solution is to stagger the stacked chips but this is difficult when one chip is larger than the other chip. However, if power can be supplied wirelessly, the chips can be stacked closely and the performance of wireless data transmission will be increased, since data bandwidth and communication reliability will increase with the decrease in chip-to-chip distance. The cost may also be decreased due to the elimination of bonding wires. Furthermore, chip detachability can also be achieved by making both data and power transmission wireless. This opens up a totally new post-fabrication system customization scheme, enabling the seamless exchange of chips in existing systems.

Fig. 8.1 Concept of chip-to-chip wireless power delivery with inductive coupling.

Figure 8.1 illustrates a proposed wireless power delivery scheme [8] based on inductive coupling between the lower chip and upper chips. The shortest inductor-to-inductor distance depends on the thickness of the chip plus any adhesion layer, which can be thinned to as little as 20 µm. If a face-to-face configuration is possible, the distance is further reduced; this setup is used for measurements in this chapter. Figure 8.2 shows the concept of system customization after the fabrication of an SiP, to reduce mask set cost and to increase the freedom for post-fabrication system modification. Chips like computation accelerators, special processing engines, analog functions, and memories can be attached wirelessly on top of the lower base chip. The lower chip embedded in the package has data transceivers and “power transmitters.” The upper chips attached to the package have data transceivers and “power receivers.” This system may give users the ability to service or improve the SiP, much as daughterboards are replaced or upgraded today. The risk of electro-static discharge (ESD) problems is mitigated because this system eliminates any naked interconnections and metal pads, using coils for wireless transmission that are covered by a passivation layer.

Fig. 8.2 Concept of system modification after fabrication of SiPs/SoCs.

8.2 Prototype of wireless power transmission circuits

Figure 8.3 shows the circuit diagram of the proposed power transceiver. The lower chip includes a power transmitter circuit and an on-chip planar inductor L1, and the upper chip includes a full-wave rectifier circuit using MOSFET-based diodes, a smoothing capacitor and an on-chip planar inductor L2 for a power receiver. The power transmitter circuit generates a radio-frequency (RF) signal from the DC supply voltage and drives L1. L1 and L2 are coupled and the power is transmitted by electromagnetic induction between L1 and L2. Figure 8.4 shows a power transmitter circuit, where the oscillation frequency of the ring oscillator is designed to be variable for purposes of experimentation. Figure 8.5 shows the circuit diagram of the diode in the rectifier of the power receiver. Two PMOS transistors (M2 and M3) reduce the undesirable body effect of the main PMOS transistor (M1) [9]. The power transceiver shown in Figure 8.3 was designed and fabricated in a 3.3 V, 350 nm CMOS process.

Generally, on-chip planar inductors have considerable parasitic capacitance and resistance, so their Q factor is not high. Figure 8.6 shows an equivalent circuit of the planar inductors. RP and CP represent the series resistance and the parasitic capacitance of the inductor, respectively. During design, the values of RP and CP are calculated by analytical formulae and the coupling factor (k) of L1 and L2 is derived by an electromagnetic field simulator (Agilent Momentum). Figure 8.7 shows microphotographs of the fabricated power transmitter and receiver chips. The outside diameter of the inductors is 700 µm and the calculated k is 0.75. The RF voltage generated between the two terminals of L2 is designed to be twice the nominal power supply voltage (3.3 V) when RL = ∞. This elevated RF voltage

Fig. 8.3 Circuit diagram of proposed system.

Fig. 8.4 Circuit diagram of power transmitter.

Fig. 8.5 Circuit diagram of PMOS-based diode in the rectifier of the power receiver.

Fig. 8.6 Equivalent circuit of inductor with parasitic elements.

mitigates diode voltage loss, but it must not exceed twice the tolerable voltage of the PMOS transistors shown in Figure 8.5. Figures 8.8 and 8.9 show the measurement setup for the chip-to-chip wireless power delivery. The power transmitter circuit in the lower chip and the power receiver in the upper chip are brought close to each other in face-to-face alignment.

Fig. 8.7 Chip microphotographs of power transmitter (left) and power receiver (right).

Fig. 8.8 Measurement setup for the chip-to-chip wireless power delivery.

Figure 8.10 shows the measured and simulated received power dependence on output DC voltage. In this implementation, L1 is 1.0 nH, L2 is 9.3 nH and the oscillation frequency of the ring oscillator is 330 MHz. The output voltage is varied

Fig. 8.9 Closeup view of measurement setup.

by changing the DC output load RL in Figure 8.3. The simulated results coincide well with the measured results, leading to high confidence in our modeling accuracy. The maximum received power of 2.5 mW is achieved when RL is 100 Ω, which is the equivalent source resistance of the wireless power transmitter. This 2.5 mW power corresponds to a power transmission density of 5 mW/mm². However, the power transmission efficiency from the power transmitter to the power receiver is less than 1%; an improvement plan is shown in the next section.

To check the alignment tolerance of the wireless power transmission, Figures 8.11 and 8.12 show the measured output voltage dependence on ∆z (distance between chips), ∆x (misplacement in the x direction) and ∆y (misplacement in the y direction) when the load is open (RL = ∞). The x-, y-, and z-axes are defined in Figure 8.8. The output voltage falls by half at ∆x and ∆y = 200 µm and at ∆z = 300 µm, which correspond to 29% and 43% of the 700 µm inductor diameter, respectively.
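The density and misalignment percentages quoted above follow from simple arithmetic, sketched here. The sketch assumes the power is referred to a square whose side equals the 700 µm inductor diameter; that area definition is an assumption, since the text does not state how the density was normalized.

```python
INDUCTOR_SIDE_UM = 700.0   # outside diameter of the inductors (from the text)
RECEIVED_POWER_MW = 2.5    # measured maximum received power (from the text)


def power_density_mw_per_mm2(power_mw: float, side_um: float) -> float:
    """Power per unit area, assuming a square footprint of the given side."""
    side_mm = side_um / 1000.0
    return power_mw / (side_mm * side_mm)


def offset_percent(offset_um: float, side_um: float) -> float:
    """Offset expressed as a percentage of the inductor diameter."""
    return 100.0 * offset_um / side_um


if __name__ == "__main__":
    print(power_density_mw_per_mm2(RECEIVED_POWER_MW, INDUCTOR_SIDE_UM))
    print(offset_percent(200.0, INDUCTOR_SIDE_UM))  # lateral half-voltage point
    print(offset_percent(300.0, INDUCTOR_SIDE_UM))  # vertical half-voltage point
```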

8.3 Theoretical analysis and circuit improvements

Although the feasibility of a wireless power delivery system was demonstrated as shown in the previous section, we would like to increase its received power and its power transmission efficiency to further increase its range of applications. In this

Fig. 8.10 Measured and simulated received power dependence on output DC voltage.

Fig. 8.11 Measured output voltage dependence on ∆z.

section, we describe a design methodology to increase the received power and the power transmission efficiency. The improvement can be achieved by adding resonance capacitors C1 and C2 as shown in Figure 8.13. RS represents the parasitic resistances of transmitter interconnections and driving transistors. R1 and R2 indicate the series resistances of L1 and L2. Capacitances C1 and C2 respectively resonate with L1 in series and with L2 in parallel. RL_AC is the equivalent total impedance of the rectifier, the smoothing capacitor and the DC load resistance RL_DC, as shown in Figure 8.14. RL_AC can be approximated as follows when the rectifier is ideal and the smoothing capacitor is large enough [10]:

Fig. 8.12 Measured output voltage dependence on ∆x and ∆y.

Fig. 8.13 Equivalent circuit of wireless power transmission with resonance capacitors C1 and C2.

Fig. 8.14 Equivalent circuit of RL_AC.

$$R_{L\_AC} \approx \frac{R_{L\_DC}}{2} \tag{8.1}$$

C1 is determined by the resonant frequency (f):

$$C_1 = \frac{1}{4\pi^2 f^2 L_1} \tag{8.2}$$
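As a quick numerical illustration of Equation 8.2, the sketch below computes the series resonance capacitor for a given frequency and inductance, and inverts the relation as a consistency check. The example values (900 MHz and 1 nH) are chosen only for illustration and are not a design point taken from the text.

```python
import math


def series_resonance_capacitance(f_hz: float, l_henry: float) -> float:
    """C1 from Equation 8.2: C1 = 1 / (4 * pi^2 * f^2 * L1)."""
    return 1.0 / (4.0 * math.pi**2 * f_hz**2 * l_henry)


def resonant_frequency(l_henry: float, c_farad: float) -> float:
    """Inverse relation: f = 1 / (2 * pi * sqrt(L * C))."""
    return 1.0 / (2.0 * math.pi * math.sqrt(l_henry * c_farad))


if __name__ == "__main__":
    f_hz = 900e6      # illustrative operating frequency (assumed)
    l1_henry = 1e-9   # illustrative inductance (assumed)
    c1 = series_resonance_capacitance(f_hz, l1_henry)
    print(f"C1 = {c1 * 1e12:.2f} pF")
    print(f"check: f = {resonant_frequency(l1_henry, c1) / 1e6:.1f} MHz")
```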

Under this condition, the following C2 and RL_AC maximize the received power. Here, ω = 2πf.

$$C_2 = \frac{(R_1 + R_S)^2 L_2}{\left[(R_1 + R_S) R_2 + \omega^2 k^2 L_1 L_2\right]^2 + \omega^2 (R_1 + R_S)^2 L_2^2} \tag{8.3}$$

$$R_{L\_AC} = \frac{(R_1 + R_S) R_2 + \omega^2 k^2 L_1 L_2}{(R_1 + R_S)\left(1 - \omega^2 C_2 L_2\right)} \tag{8.4}$$

Then, the available received power is expressed as the following, where E is the voltage source at the power transmitter.

$$P_{Max\_AC} = \frac{E^2 \omega^2 k^2 L_1 L_2}{8\left[\omega^2 k^2 L_1 L_2 + R_2 (R_1 + R_S)\right](R_1 + R_S)} \tag{8.5}$$

The output AC voltage V_OUT_AC of the load resistance RL_AC is

$$V_{OUT\_AC} = \sqrt{2 P_{Max\_AC} R_{L\_AC}} \tag{8.6}$$
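Equations 8.3 to 8.6 can be cross-checked numerically. The sketch below evaluates them and compares the result against a direct phasor solution of the two coupled loops in Figure 8.13. It assumes that E is a peak amplitude (which is what the factor of 8 in Equation 8.5 suggests) and that C1 exactly cancels the reactance of L1; the resistance and voltage values in the example are hypothetical placeholders, not measured parameters of the prototype.

```python
import math


def matched_load(f, k, l1, l2, r1, r2, rs):
    """Return (C2, RL_AC) from Equations 8.3 and 8.4."""
    w = 2.0 * math.pi * f
    ra = r1 + rs
    a = ra * r2 + w**2 * k**2 * l1 * l2
    c2 = ra**2 * l2 / (a**2 + w**2 * ra**2 * l2**2)
    rl_ac = a / (ra * (1.0 - w**2 * c2 * l2))
    return c2, rl_ac


def p_max_ac(e, f, k, l1, l2, r1, r2, rs):
    """Available received power from Equation 8.5 (E as peak amplitude)."""
    w = 2.0 * math.pi * f
    ra = r1 + rs
    x = w**2 * k**2 * l1 * l2
    return e**2 * x / (8.0 * (x + r2 * ra) * ra)


def v_out_ac(p_max, rl_ac):
    """Peak output voltage from Equation 8.6."""
    return math.sqrt(2.0 * p_max * rl_ac)


def simulated_load_voltage(e, f, k, l1, l2, r1, r2, rs, c2, rl_ac):
    """Peak voltage across RL_AC from a direct two-loop phasor solution."""
    w = 2.0 * math.pi * f
    m = k * math.sqrt(l1 * l2)                  # mutual inductance
    z_load = 1.0 / (1.0 / rl_ac + 1j * w * c2)  # RL_AC in parallel with C2
    z_secondary = r2 + 1j * w * l2 + z_load
    z_primary = r1 + rs                         # C1 cancels jwL1 (Eq. 8.2)
    i1 = e / (z_primary + (w * m)**2 / z_secondary)
    i2 = 1j * w * m * i1 / z_secondary
    return abs(i2 * z_load)


if __name__ == "__main__":
    # L1, L2, k and f are quoted in the text; E, R1, R2, RS are hypothetical.
    params = dict(f=330e6, k=0.75, l1=1.0e-9, l2=9.3e-9,
                  r1=1.0, r2=5.0, rs=2.0)
    e = 3.3
    c2, rl_ac = matched_load(**params)
    p = p_max_ac(e, **params)
    v_sim = simulated_load_voltage(e, c2=c2, rl_ac=rl_ac, **params)
    print(f"C2 = {c2 * 1e12:.2f} pF, RL_AC = {rl_ac:.1f} ohm")
    print(f"P (Eq. 8.5) = {p * 1e3:.2f} mW, "
          f"P (simulated) = {v_sim**2 / (2 * rl_ac) * 1e3:.2f} mW")
    print(f"V_OUT (Eq. 8.6) = {v_out_ac(p, rl_ac):.3f} V")
```

The closed-form and simulated powers agree because the load chosen by Equations 8.3 and 8.4 is the conjugate match to the impedance seen looking back into the receiver inductor.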

L2 is determined so that V_OUT_AC equals twice the power supply voltage, as mentioned in the previous section. For on-chip planar inductors, the relationship between RN and LN is approximated as follows with ζ.

$$R_N \approx \zeta L_N \tag{8.7}$$

We can simplify P_Max_AC using Equation 8.7 and the approximation of RS = 0:

$$P_{Max\_AC} \approx \frac{E^2 \omega^2 k^2}{8 \zeta L_1 \left(\zeta^2 + \omega^2 k^2\right)} \tag{8.8}$$
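The step from Equation 8.5 to Equation 8.8 can be written out explicitly. Substituting R1 = ζL1 and R2 = ζL2 (Equation 8.7) with RS = 0 into Equation 8.5 and cancelling the common factor L1L2 gives the simplified form; the intermediate line below is added for clarity.

```latex
\begin{aligned}
P_{Max\_AC}
&\approx \frac{E^2 \omega^2 k^2 L_1 L_2}
              {8\left[\omega^2 k^2 L_1 L_2 + \zeta^2 L_1 L_2\right]\zeta L_1} \\
&= \frac{E^2 \omega^2 k^2}{8 \zeta L_1 \left(\zeta^2 + \omega^2 k^2\right)}
\end{aligned}
```

Note that L2 drops out of the result, so under these approximations the available power depends on L1 but not on L2.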

In our measured system (see above), we designed resonant capacitors which were calculated to be C1 = 281 pF and C2 = 15 pF. When these capacitors are added to the original circuit configuration, the received power is simulated to increase to 21.6 mW with an output DC voltage of 1.8 V, compared with the measured 2.5 mW without the resonant capacitors in the previous section. The choice of rectifier circuit also matters: simulation shows that the received power can be further increased to 35 mW with an appropriate rectifier [11]. If the area allowed for the wireless power delivery system can be increased to 2.1 mm square, nine parallel power transceivers, each of which is 700 µm x 700 µm, can be implemented and then more than 300 mW of power can be transmitted.
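The scaling estimate above can be reproduced with a short calculation. The sketch assumes the transceivers tile the available square area without gaps and that their received powers simply add; both are simplifications, since the text does not discuss spacing or coupling between neighboring inductors.

```python
def parallel_transceivers(area_side_um: int, cell_side_um: int) -> int:
    """Number of square transceiver cells that fit in a square area."""
    per_side = area_side_um // cell_side_um
    return per_side * per_side


def total_power_mw(count: int, power_per_cell_mw: float) -> float:
    """Total received power, assuming the cells add independently."""
    return count * power_per_cell_mw


if __name__ == "__main__":
    count = parallel_transceivers(2100, 700)   # 2.1 mm square, 700 um cells
    print(count, total_power_mw(count, 35.0))  # 35 mW per cell (simulated)
```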

Fig. 8.15 Circuit model under resonance condition.

On the other hand, it is also important to maximize the power transmission efficiency. C2 resonates under the following condition and the circuit model can be converted to a simple resistance model as shown in Figure 8.15.

$$4\pi^2 f^2 R_{L\_AC}^2 L_2 C_2^2 - R_{L\_AC}^2 C_2 + L_2 = 0 \tag{8.9}$$

In Figure 8.15, RX and RY are the transformed impedances of R2 and RL_AC, and are described as follows.

$$R_X = 4\pi^2 f^2 k^2 L_1 L_2 \, \frac{1}{R_2} \tag{8.10}$$

$$R_Y = 4\pi^2 f^2 k^2 L_1 L_2 \, \frac{1 + 4\pi^2 f^2 C_2^2 R_{L\_AC}^2}{R_{L\_AC}} \tag{8.11}$$

The power transmission efficiency for RY is maximized if

$$R_Y = R_X \sqrt{\frac{R_S + R_1}{R_S + R_1 + R_X}} \tag{8.12}$$
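The optimum in Equation 8.12 can be checked numerically with a small resistive model. The sketch assumes that RX and RY appear in parallel and are fed from the source through RS + R1, which is the arrangement under which Equation 8.12 follows; the resistance values in the example are hypothetical.

```python
import math


def efficiency(ry: float, rx: float, ra: float) -> float:
    """Fraction of the source power dissipated in RY.

    Assumes RX and RY are in parallel, fed through ra = RS + R1.
    For a voltage V across the pair: P_RY = V^2 / RY, P_RX = V^2 / RX,
    and the series loss is I^2 * ra with I = V * (1/RX + 1/RY).
    """
    gx, gy = 1.0 / rx, 1.0 / ry
    return gy / (gy + gx + ra * (gx + gy) ** 2)


def optimal_ry(rx: float, ra: float) -> float:
    """RY that maximizes the efficiency, from Equation 8.12."""
    return rx * math.sqrt(ra / (ra + rx))


if __name__ == "__main__":
    rx, ra = 50.0, 3.0  # hypothetical values in ohms
    best = optimal_ry(rx, ra)
    for scale in (0.5, 0.9, 1.0, 1.1, 2.0):
        print(f"RY = {scale:.1f} x optimum: "
              f"efficiency = {efficiency(best * scale, rx, ra):.4f}")
```

Scanning RY around the value given by Equation 8.12 shows the efficiency peaking there and falling off on either side.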

The optimal values of C2 and RL_AC are calculated as functions of k, f, ζ, RS, R1, and R2 by using Equations 8.9 and 8.12.

$$C_2 = \frac{\zeta (R_S + R_1)}{R_2 \left[R_S + R_1 + 4\pi^2 f^2 \zeta^2 \left(R_S + R_1 + k^2 R_1\right)\right]} \tag{8.13}$$

$$R_{L\_AC} = \frac{\zeta}{C_2} \sqrt{\frac{R_S + R_1}{R_S + R_1 + 4\pi^2 f^2 \zeta^2 k^2 R_1}} \tag{8.14}$$

Figures 8.16 and 8.17 show the calculated received power and the power transmission efficiency (η) when f is 900 MHz, k is 0.75, ζ is $2.6\times10^{-9}$, and RS is 2 Ω in a 90 nm CMOS technology with an input voltage of 2.5 V. In this design region, the received power and the power transmission efficiency trade off with one another. The power transmission efficiency improves as the value of L1 increases, although the received power degrades. On the other hand, both the power transmission efficiency and the received power are independent of the value of L2. In a real design,


both the received power and the power transmission efficiency are lower than the calculated results due to other independent power losses including switching loss and rectifying loss; however, the optimization given above is still useful for the fundamental design.

Fig. 8.16 Calculated received power dependence on L1 and L2 when power transmission efficiency is maximized.

Fig. 8.17 Calculated power transmission efficiency dependence on L1 and L2.

8.4 Summary

Chip-to-chip wireless power transmission of 2.5 mW for 3D stacked inter-chip wireless power delivery in an SiP was proposed and demonstrated in a 350 nm CMOS technology. We derived circuit optimization theories for the maximum received power and the power transmission efficiency, and showed a possible increase to power transmission on the order of 100 mW by introducing resonant capacitors.


References

1. H. Yoshikawa, A. Kawasaki, T. Iiduka, Y. Nishimura, K. Tanida, K. Akiyama, M. Sekiguchi, M. Matsuo, S. Fukuchi, K. Takahashi, “Chip scale camera module (CSCM) using through-silicon-via (TSV),” Digest of Technical Papers, IEEE International Solid-State Circuits Conference, 2009, pp. 476–477.
2. N. Miura, D. Mizoguchi, T. Sakurai, T. Kuroda, “Analysis and design of inductive coupling and transceiver circuit for inductive inter-chip wireless superconnect,” IEEE Journal of Solid-State Circuits, vol. 40, no. 4, 2005, pp. 829–837.
3. K. Kanda, D. Antono, K. Ishida, H. Kawaguchi, T. Kuroda, T. Sakurai, “1.27 Gbps/pin, 3 mW/pin wireless superconnect (WSC) interface scheme,” Digest of Technical Papers, IEEE International Solid-State Circuits Conference, 2003, pp. 186–187.
4. R.J. Drost, R.D. Hopkins, R. Ho, I. Sutherland, “Proximity communication,” IEEE Journal of Solid-State Circuits, vol. 39, no. 9, 2004, pp. 1529–1535.
5. A. Fazzi, R. Canegallo, L. Ciccarelli, L. Magagni, F. Natali, E. Jung, P. Rolandi, R. Guerrieri, “3D capacitive interconnections with mono- and bi-directional capabilities,” Digest of Technical Papers, IEEE International Solid-State Circuits Conference, 2007, pp. 356–357.
6. K. Niitsu, Y. Shimazaki, Y. Sugimori, Y. Kohama, K. Kasuga, I. Nonomura, M. Saen, S. Komatsu, K. Osada, N. Irie, T. Hattori, A. Hasegawa, T. Kuroda, “An inductive-coupling link for 3D integration of a 90nm CMOS processor and a 65nm CMOS SRAM,” Digest of Technical Papers, IEEE International Solid-State Circuits Conference, 2009, pp. 480–481.
7. Y. Sugimori, Y. Kohama, M. Saito, Y. Yoshida, N. Miura, H. Ishikuro, T. Sakurai, T. Kuroda, “A 2 Gb/s 15 pJ/b/chip inductive-coupling programmable bus for NAND flash memory stacking,” Digest of Technical Papers, IEEE International Solid-State Circuits Conference, 2009, pp. 244–245.
8. K. Onizuka, H. Kawaguchi, M. Takamiya, T. Kuroda, T. Sakurai, “Chip-to-chip inductive wireless power transmission system for SiP applications,” Digest of Technical Papers, IEEE Custom Integrated Circuits Conference, 2006, pp. 575–578.
9. S. Masui, E. Ishii, T. Iwawaki, Y. Sugawara, K. Sawada, “A 13.56 MHz CMOS RF identification transponder integrated circuit with a dedicated CPU,” Digest of Technical Papers, IEEE International Solid-State Circuits Conference, 1999, pp. 162–163.
10. G.A. Kendir, W. Liu, G. Wang, M. Sivaprakasam, R. Bashirullah, M.S. Humayun, J.D. Weiland, “An optimal design methodology for inductive power link with class-E amplifier,” IEEE Transactions on Circuits and Systems I, vol. 52, no. 5, 2005, pp. 857–866.
11. T. Umeda, H. Yoshida, S. Sekine, Y. Fujita, T. Suzuki, S. Otaka, “A 950 MHz Rectifier Circuit for Sensor Networks with 10 m Distance,” Digest of Technical Papers, IEEE International Solid-State Circuits Conference, 2005, pp. 256–257.

Index

3D integration, 13, 15, 29, 38, 193 8B10B, 63, 76, 136 Agilent Momentum, 195 asynchronous signaling, 83, 102, 119 balls, 161, 163, 172 band-pass filter, 82 bit error rate (BER), 54, 70, 95, 106, 108, 118, 134, 171, 187 board capacitors, 143 bonding, 159 bootstrapping, 62 bridge chip, 76, 158, 163, 171 buried bump packaging, 141 burst transmission, 104 butterfly differential layout, 56 capacitance, area and fringe, 60 CDMA, 63 clocking, 5, 67 coil, 81, 89, 96, 102, 103, 106, 118, 147, 197 comparator, 65 contactless packaging, 127–152 corner differential layout, 56 coupling capacitive, 8, 51–76, 127, 129, 158 capacitive connector, 143 capacitive range, 83 inductive, 8, 52, 79–124, 129, 147, 194 inductive coefficient, 83 inductive range, 83 optical, 8, 23, 52 optical efficiency, 24 through substrate, 84 crossbar, 74 crosstalk, 53

daisy-chain transmitter, 99 datapath slices, 70 David Salzman, 129 DC bias, 62, 92, 137 decoupling die, 41 dielectric layers, 57 differential coil, 110 differential signaling, 56 differentiator, 82, 130 eddy current, 85 electro-static discharge (ESD), 100, 130, 194 encoding, 63 Endicott HyperBGA, 141 epoxy, 135, 166 equalization, 131, 180 error function, 54 etch pit, 161, 172 etch, self-terminating, 161 FDMA, 63 fiber coupling, 187 field solver, 59 flash memory, 113 flip-chip package, 159 fluidic I/O, 18, 26 fractional equalization, 146, 148 frequency acquisition, 68 frequency response, 81, 89, 130, 131 Gaussian pulse, 88 heat, 13, 14, 27 heat sink, 30 high-pass filter, 82, 130 island chip, 76, 158, 163, 171 jitter, 62, 71, 120 205

keeper, 64
kickback, 65
KOH, 161
layout, 56, 89, 111, 197
low-pass filter, 81
MCM, 7
memory, 113
metal stackup, 57
micropipe, 18
microspheres, 161, 163, 172
misalignment, 15, 24, 51, 52, 73, 158, 166
  electrical compensation, 61
  optical compensation, 25
  physical alignment, 163
  power tolerance, 198
modulation, 63, 87
Moore’s Law, 3, 7
multi-chip module (MCM), 74
multi-core processors, 6
multiplexer, efficient, 62
mutual inductance, 90
network, 73
noise
  common-mode rejection, 56
  crosstalk, 53, 56, 60, 85, 107, 110
  crosstalk shield, 115
  excess factor, 55
  input-referred, 55
  thermal, 54
non-return-to-zero (NRZ) signaling, 130, 136, 144
offset voltage, 63, 65, 71
optical I/O, 15, 18, 23
optical interconnect, 183
optical interconnects, 180
optical interface, 182
packaging, 158
parallelism, 6
parasitics, 53, 81, 90, 149, 195
peaking, 81
phase recovery, 68
photolithography, 161
plastic packaging, 170
Polychip, 129
polymer pin, 18
power, 13, 31
  distribution, 33, 35, 141, 158, 163, 194
  reduction, 94, 100, 115, 130, 138
  supply, 14
  supply noise, 33, 36, 39
processor, 119
pseudo-random bit sequence (PRBS), 63, 93
pulse modulation, 88, 92, 103
pulse shaping, 94
quantum-well modulators, 183
reflections, 144, 148
refresh, 64
resonance, 199
resonator, 81
return-to-zero (RZ) signaling, 132, 144
scaling, 4
scrambler, 63
self-inductance, 81
sense amplifier, 54, 63, 92
serializer-deserializer circuits, 8, 180
signal attenuation, 62
signal processing, 131
single-ended signaling, 56
six-axis manipulator, 187
skew compensation, 121
skin resistance, 131
solder bump, 159, 163, 166
sombrero chip, 173
SpecINT, 3
stacked chips, 17, 92, 96, 99, 106, 108, 113, 114, 118, 119, 193
steering, 61
switch, 73
TDMA, 63, 107
terminations, 144
thermal interface material (TIM), 14
thinned chips, 92, 172
through-silicon via (TSV), 16, 17, 27, 31, 38, 42, 100, 193
Tom Knight, 129
trimodal I/O, 18
variability, 67, 105, 134, 150
vertical-cavity surface-emitting laser (VCSEL), 183, 187
voltage divider, 63
waveguide, 18, 23
wire bonds, 17, 170, 185
wireless power delivery, 193–203
wires, 8