Error Control for Network-on-Chip Links


Author: Bo Fu | Paul Ampadu



Bo Fu
Marvell Semiconductor, Inc.
5488 Marvell Lane
Santa Clara, CA 95054, USA

Paul Ampadu
Department of Electrical and Computer Engineering
University of Rochester
Rochester, NY 14627, USA

ISBN 978-1-4419-9312-0
e-ISBN 978-1-4419-9313-7
DOI 10.1007/978-1-4419-9313-7
Springer New York Dordrecht Heidelberg London
Library of Congress Control Number: 2011936003

© Springer Science+Business Media, LLC 2012

All rights reserved. This work may not be translated or copied in whole or in part without the written permission of the publisher (Springer Science+Business Media, LLC, 233 Spring Street, New York, NY 10013, USA), except for brief excerpts in connection with reviews or scholarly analysis. Use in connection with any form of information storage and retrieval, electronic adaptation, computer software, or by similar or dissimilar methodology now known or hereafter developed is forbidden. The use in this publication of trade names, trademarks, service marks, and similar terms, even if they are not identified as such, is not to be taken as an expression of opinion as to whether or not they are subject to proprietary rights.

Printed on acid-free paper

Springer is part of Springer Science+Business Media (www.springer.com)

To Luzviminda, Luzann, Majelia and Paul Jr for Love, Patience, Courage and Dedication

Preface

Traditional bus-based infrastructures can no longer handle the intensive communication among modules as hundreds and even thousands of intellectual property (IP) cores are integrated on a single chip. Network-on-Chip (NoC) is emerging as an efficient solution to the worsening scalability and contention issues of on-chip communication. With technology scaled into the nanometer regime, the physical links in NoCs face important design challenges of delay, power and reliability. The purpose of this book is to present current solutions addressing reliability issues in on-chip communications.

Reliability is an important issue in NoC design. For example, errors in the header of a packet may lead to loss of the packet. The reliability of on-chip communication can be addressed at different NoC layers, such as the physical layer, the data link layer and the network layer. This book focuses on techniques applied to the data link layer.

Error control coding is a common technique used in the data link layer to provide reliable on-chip communication. With shrinking link feature sizes, on-chip interconnects are becoming susceptible to multiple random and burst errors, requiring more powerful error control codes (ECCs) than those previously used. At the same time, the energy consumption of on-chip interconnect is becoming an increasingly large portion of on-chip power dissipation, motivating the need for more energy-efficient communication solutions.

In this book, we present energy-efficient error control approaches for on-chip interconnects. We introduce a method of combining extended Hamming product codes with type-II hybrid automatic repeat request (HARQ). This method provides strong error correction capability against multiple random and burst errors, while keeping the hardware overhead reasonable. The combination of extended Hamming product codes with type-II HARQ has been shown to meet the same reliability requirements as previous solutions while using a lower link swing voltage to reduce energy consumption. The extended Hamming product codes can also be integrated into a configurable error control scheme by combining them with a traditional Hamming code. The different coding strengths provided by this realization can achieve better energy performance in the presence of varying noise conditions.


Capacitive crosstalk coupling increases greatly as the interconnect aspect ratio grows with each scaled technology node. Capacitive crosstalk coupling can cause delay uncertainty, which greatly decreases system performance and results in timing errors. ECCs have been successfully applied to improve the reliability of on-chip interconnect by correcting logic errors. Unfortunately, conventional ECCs are not as efficient in addressing delay uncertainty caused by capacitive crosstalk coupling.

In this book, we also present methods that simultaneously address logic errors and crosstalk-induced delay uncertainty. We introduce a method of combining ECCs with conventional skewed transitions. Here, the inherent skew resulting from the ECC parity generation is exploited to ensure that no two adjacent wires switch in opposite directions simultaneously, thereby reducing worst-case on-chip capacitive coupling. This method can reduce the overhead of conventional skewed transitions by hiding the delay insertion overhead in parity calculations. Compared with other solutions that simultaneously handle logic errors and delay uncertainty, this method requires fewer wires, resulting in smaller link area and energy consumption.

This book is based on the first author's Ph.D. dissertation completed at the University of Rochester. The research work was supported in part by the U.S. National Science Foundation under grant NSF-ECCS-0733450. The authors would like to thank friends and colleagues Prof. Eby Friedman, Prof. Chen Ding, and Prof. Thomas Tucker for their invaluable suggestions during the writing of the dissertation that led to this version of the book. Dr. Bo Fu expresses gratitude to his exceptional colleagues, Dr. David Wolpert (now at IBM) and Dr. Qiaoyan Yu (now at UNH), for their productive, supportive and enjoyable collaborations during his studies in the Embedded Integrated System-on-Chip (EdISon) research group at the University of Rochester. Dr. Fu also thanks friends Lin Zhang, Chao Yu, Qiang Sun, Xin Li, Gaojie Lu, Xiaohua Zhang and Fan Yang for their help and friendship. His deepest gratitude and immense appreciation go to his parents and his wife for their constant encouragement and unwavering support. Many thanks also to graduate student Meilin Zhang for his assistance in formatting the book and to Charles B. Glaser from Springer for his support and assistance throughout.

Bo Fu, Santa Clara, CA, USA
Paul Ampadu, Rochester, NY, USA

Contents

1 Introduction
1.1 Impact of Scaling on Interconnect Parameters
1.2 Reliability Issues for On-Chip Interconnect
1.3 Types of Errors and Error Models
1.3.1 Types of Errors
1.3.2 Error Models
1.4 Book Overview
References

2 Solutions to Improve the Reliability of On-Chip Interconnects
2.1 Wire Sizing and Spacing
2.2 Shielding
2.3 Repeater Insertion
2.4 Crosstalk Avoidance Codes (CACs)
2.5 Skewed Transitions
2.6 Error Control Coding Schemes
2.6.1 Automatic Repeat Request (ARQ)
2.6.2 Forward-Error Correction (FEC)
2.6.3 Hybrid ARQ (HARQ)
2.7 Spare Wires
References

3 Networks-on-Chip (NoC)
3.1 Bus Based On-Chip Communication
3.2 NoC Design
3.2.1 NoC Architectures
3.2.2 Routing and Switching Techniques
3.2.3 Router Design
3.3 Reliability in NoC Links
References

4 Error Control Coding for On-Chip Interconnects
4.1 Error Control Coding Basics
4.1.1 Field
4.1.2 Linear Block Codes
4.1.3 Systematic Codes
4.1.4 Hamming Distance
4.1.5 Code Modification
4.2 Error Control Codes for On-Chip Interconnect
4.2.1 Single Parity Check (SPC) Codes
4.2.2 Duplicate-Add-Parity (DAP) Code
4.2.3 Hamming Codes
4.2.4 Hsiao Codes
4.2.5 SEC Codes with Interleaving
4.2.6 Cyclic Codes
4.2.7 Bose-Chaudhuri-Hocquenghem (BCH) Codes
4.2.8 Reed-Solomon (RS) Codes
4.2.9 Hamming Product Codes
References

5 Energy Efficient Error Control Implementation
5.1 Error Control Coding with Low Link Swing Voltage System
5.2 Error Control Coding with Dynamic Voltage Swing Scaling System
5.3 Product Codes with Type-II ARQ
5.3.1 The Principle
5.3.2 Extended Hamming Product Codes with Type-II HARQ
5.3.3 Performance Evaluation
5.4 Configurable Error Control System
5.4.1 Principle
5.4.2 Configurable Encoder Design
5.4.3 Configurable Decoder Design
5.4.4 Performance Evaluation
5.5 Summary
References

6 Combining Error Control Codes with Crosstalk Reduction
6.1 Duplicate-Add-Parity (DAP) Codes
6.2 Boundary Shift Code (BSC)
6.3 Crosstalk Avoidance and Multiple Error Correction Code (CAMEC)
6.4 Unified Coding Framework
6.4.1 Forbidden Overlap Condition (FOC) Codes
6.4.2 Forbidden Transition Condition (FTC) Codes
6.4.3 Forbidden Pattern Condition (FPC) Codes
6.4.4 Performance Evaluation
6.5 Error Control Codes with Skewed Transitions
6.5.1 The Principle
6.5.2 Data Mapping Algorithm
6.5.3 Performance Evaluation
6.6 Summary
References

List of Symbols

Index

Chapter 1

Introduction

On-chip interconnects play an important role in the performance of current VLSI systems. As technology scales into the nanoscale regime, interconnects face several design challenges in terms of delay, power and reliability [1–3].

1.1 Impact of Scaling on Interconnect Parameters

An interconnect can be characterized by three electrical properties: resistance R, capacitance C and inductance L. The resistance R is calculated using (1.1),

$$R = \frac{\rho L_{int}}{T_{int} W_{int}} \tag{1.1}$$

where $\rho$ is the resistivity of the metal, and $L_{int}$, $T_{int}$, and $W_{int}$ are the interconnect length, thickness, and width, respectively, as shown in Fig. 1.1. H is the distance between the interconnect and the ground plane. From (1.1), the resistance R of a wire increases as $T_{int}$ and $W_{int}$ are reduced. As technology scales, the resistivity $\rho$ of a metal interconnect can also increase [4]. This phenomenon is caused by carrier collisions when the thickness of a wire approaches the mean free path of electrons. Also, the increased clock frequency aggravates the skin effect [5], in which the current flows mainly through the skin of the wire. The skin effect reduces the effective cross-sectional area that carries current through a wire, further increasing the wire resistance. Figure 1.2 shows that the resistance greatly increases with technology scaling. A large wire resistance greatly increases the interconnect delay and also causes large signal attenuation.

Fig. 1.1 Dimensional parameters of a single interconnect

Fig. 1.2 Resistance value with technology scaling [6]

The interconnect capacitance C of a wire consists of the parallel plate capacitance $C_{g1}$, the fringing capacitance $C_{g2}$ and the sidewall capacitance $C_C$, as shown in Fig. 1.3. The parallel plate capacitance $C_{g1}$ is the capacitance between the metal wire and the substrate or ground, and is proportional to $(W_{int} \cdot L_{int})/H$. The sidewall capacitance $C_C$ is the coupling capacitance between two adjacent wires on the same metal layer, and is proportional to $(T_{int} \cdot L_{int})/S_{int}$. As technology scales, the interconnect thickness $T_{int}$ decreases at a slower rate than the interconnect width $W_{int}$ and spacing $S_{int}$ [7]. Thus, the interconnect aspect ratio, defined as the ratio of $T_{int}$ to $W_{int}$, increases with each technology node. The increased aspect ratio has caused an increase in capacitive coupling effects, which can greatly affect the reliability of on-chip interconnects.

Fig. 1.3 Parallel plate, fringing and coupling interconnect capacitances

B. Fu and P. Ampadu, Error Control for Network-on-Chip Links, DOI 10.1007/978-1-4419-9313-7_1, © Springer Science+Business Media, LLC 2012

The inductance L of an interconnect is caused by the current loop formed by the signal wire and its return path. As multi-GHz clock frequencies are widely applied in current VLSI systems, inductive effects become significant, especially for long and wide global wires [8, 9]. Inductive effects increase the design complexity, as it is difficult to extract inductance accurately. Further, inductive effects can extend over a long distance, which exacerbates crosstalk coupling [10].

Technology scaling has a significant impact on the performance of on-chip interconnects [11, 12]. The rise in wire resistance increases the RC time constant, resulting in a delay increase for a fixed length of interconnect. Moreover, the length of global interconnect grows as the chip size increases to integrate more components. Taking chip scaling into account, the delay of global interconnects increases by a factor of $S^2 S_c^2$, where S is the technology scaling factor and $S_c$ is the chip size scaling factor [7]. Figure 1.4 shows the scaling trend of the gate delay, local wire delay, and global wire delay with and without repeater insertion. The global interconnect has a considerable increase in delay compared to logic gates and becomes the performance bottleneck with technology scaling.

Fig. 1.4 Interconnect delay with technology scaling [11]

Power consumption is another critical concern for on-chip interconnects in nanoscale systems [13, 14]. A large portion of the total chip power is consumed by on-chip interconnects because a large interconnect capacitance is charged and discharged every time a transition occurs. The use of large repeaters to reduce the delay of global interconnect further aggravates the power consumption of on-chip interconnects. Figure 1.5 shows the dynamic power breakdown of the UltraSPARC T3 SoC processor, which has 16 SPARC cores and uses 40 nm technology [15]. In this breakdown, the power consumption of interconnect is the same as that of logic gates. Figure 1.6 shows the total interconnect length integrated into a chip with technology scaling. Because wire capacitance is linearly related to wire length, the increased interconnect capacitance will greatly increase the power consumed by on-chip interconnects.
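The scaling relations in this section can be checked numerically. The following is a minimal sketch, assuming a bulk copper resistivity and hypothetical wire dimensions that do not come from the text; it evaluates (1.1) for a wire before and after its cross-section is halved in both dimensions, and the global-wire delay factor $S^2 S_c^2$. The function names are illustrative only.

```python
def wire_resistance(rho, length, thickness, width):
    """Resistance of a rectangular wire from (1.1): R = rho * L_int / (T_int * W_int)."""
    return rho * length / (thickness * width)


def global_delay_factor(s, s_c):
    """Growth factor of global interconnect delay, S^2 * S_c^2, as quoted in the text."""
    return (s ** 2) * (s_c ** 2)


# Assumed values for illustration only (not taken from the book).
RHO = 1.7e-8      # ohm*m, bulk copper; ignores the size-dependent resistivity rise
LENGTH = 1e-3     # a 1 mm wire

r_old = wire_resistance(RHO, LENGTH, thickness=200e-9, width=100e-9)
r_new = wire_resistance(RHO, LENGTH, thickness=100e-9, width=50e-9)

# Halving both thickness and width quarters the cross-section,
# so the resistance of a fixed-length wire grows about fourfold.
print(f"R before scaling: {r_old:.1f} ohm")
print(f"R after scaling:  {r_new:.1f} ohm")
print(f"Delay factor for S = 2, S_c = 1.5: {global_delay_factor(2.0, 1.5)}")
```

Because the sketch holds the resistivity constant, it understates the real increase: the scattering and skin effects described above raise the resistance further.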


Fig. 1.5 UltraSPARC T3 and its dynamic power breakdown [15]

Fig. 1.6 Total interconnect length on a chip as technology scales [11]

1.2 Reliability Issues for On-Chip Interconnect

Interconnect reliability issues are caused by manufacturing defects [16, 17] or a variety of noise sources, such as external radiation [18, 19], crosstalk coupling [8, 9, 20–22], supply voltage fluctuations [23, 24], process variations [25–31], temperature variations [4, 32], electromagnetic interference (EMI) [33] and combinations of these sources.


Fig. 1.7 Manufacturing defects in interconnects

Imperfect manufacturing processes can cause defects in on-chip interconnects. Figure 1.7 shows a metal sliver and a crack in on-chip interconnects caused by manufacturing defects. A metal sliver is a small piece of extra metal left between two metal wires during the manufacturing process. When the metal temperature increases, the sliver expands and touches both wires, causing a short. A crack is another common manufacturing defect, caused by material stresses; it can cause an open connection. The rising probability of manufacturing defects in nanoscale technology decreases the manufacturing yield of large-area chips, so techniques to improve the yield must be considered.

Noise sources affect the reliability of on-chip interconnect in two ways: signal integrity [2] and delay uncertainty [34]. Noise sources reduce the signal integrity of on-chip interconnect by inducing voltage glitches. If a voltage glitch is greater than the tolerable noise margin of the circuit and has a sufficient duration, it can cause logic errors. Delay uncertainty refers to an unknown fluctuation in the timing of a signal transition. Delay uncertainty decreases the system operating frequency because a large design margin is required to guarantee correct operation. In nanoscale technology, crosstalk-induced delay uncertainty can be a critical bottleneck for the operation of high-speed synchronous systems.

Crosstalk coupling is one of the most important factors affecting the reliability of on-chip interconnects. Crosstalk coupling is caused by the mutual capacitance or mutual inductance between wires [10]. As technology scales, the increased interconnect aspect ratio has caused an increase in capacitive coupling effects. Inductive coupling occurs when signal switching causes a change in the magnetic field. The gigahertz clock frequencies in nanoscale technology result in a non-negligible inductive effect in on-chip interconnects [8, 9].
Unlike capacitive coupling, inductive coupling can be a long-range phenomenon and is more important in the presence of wide buses [10]. Crosstalk coupling effects can induce significant voltage glitches on a victim line. Figure 1.8 shows noise waveforms resulting from capacitive and inductive coupling between two fully coupled lines [9]. The peak noise voltage can exceed 20% of the supply voltage, potentially inducing logic errors.

Fig. 1.8 Noise waveforms of crosstalk coupling between two coupled lines [9]

Crosstalk-induced delay uncertainty is mainly caused by the dependence of coupling capacitance on signal switching patterns. Depending on the switching behavior of a wire and its neighbors, the effective capacitance $C_{eff}$ of a wire can change from $C_g$ to $C_g + 4C_C$ [35] (where $C_g = C_{g1} + C_{g2}$). The best-case $C_{eff}$ occurs when all three adjacent wires switch in the same direction. The worst-case $C_{eff}$ occurs when there is a transition 010→101 (or 101→010) on three adjacent wires. The dependence of $C_{eff}$ on signal switching patterns can result in up to a 50% delay change in on-chip interconnect [36].

Figure 1.9 shows an example of a soft error caused by a particle strike. As technology scales, integrated circuits become more vulnerable to soft errors caused by particle strikes [18], such as alpha particles and neutrons. As the node capacitance decreases in nanoscale technology, a smaller injection of charge can induce errors (the amount of charge required to induce an error is referred to as the critical charge $Q_{crit}$). This is exacerbated by the scaling of supply voltage, which decreases circuit noise margins and makes circuits more susceptible to particle strikes. Higher clock frequencies and deeply pipelined designs also increase the probability that any fault resulting from a strike will be latched by flip-flops, creating errors.

Fig. 1.9 Soft errors caused by particle strikes

When crosstalk coupling is considered, particle strikes become even more problematic [37, 38]. In the presence of a transient error caused by an alpha particle or neutron strike, crosstalk coupling may propagate this error to other parts of the circuit by inducing large voltage glitches on neighboring wires, as shown in Fig. 1.10. The coupling effect increases the probability that multiple adjacent errors (also referred to as a burst error) are caused by a single particle strike.

Fig. 1.10 A single event transient (SET) jumping from one interconnect to another in the presence of crosstalk [37]

The probability of errors caused by electromigration also increases in nanoscale technology [3, 39, 40]. Electromigration is the alteration of an atomic structure caused by electromagnetic forces, i.e., the dense flow of electrons in interconnects. Over time, the atomic displacement can result in opens or shorts. With the aggressive scaling of interconnect dimensions, the current density within these interconnects significantly increases. This rise in current density, combined with the use of low-k dielectrics (which have lower thermal conductivities [4]), results in a significant increase in the metal temperature. The large rise in metal temperature exacerbates electromigration, degrading system lifetime [4].

The impact of process variations on interconnect is also expected to increase as technology scales [25–27]. Process variations in interconnects are caused by the imperfect processes of photolithography, planarization and metal etching.
Variations in the geometric parameters of interconnects lead to a change in interconnect resistance and capacitance, which causes a variation in the delay of on-chip interconnects. The variations in interconnect delay may lead to timing closure problems. The variations in interconnect dimensions also increase the probability of opens caused by electromigration in narrower sections of a wire. Moreover, device parameter variations introduce delay variations in the drivers and repeaters, which can also result in link delay errors.
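The switching-pattern dependence of the effective capacitance discussed earlier in this section, ranging from $C_g$ to $C_g + 4C_C$, can be sketched in a few lines. The text states only the two extremes; the per-neighbor multiplier used below (0 for a neighbor switching in the same direction, 1 for a quiet neighbor, 2 for a neighbor switching in the opposite direction) is the usual Miller-coupling approximation and is an assumption here, as are the function names and capacitance values.

```python
def miller_factor(victim_delta, neighbor_delta):
    """Coupling multiplier one neighbor contributes to a switching victim.

    Deltas are +1 (rising), -1 (falling) or 0 (quiet).
    Assumed model: 0 = same direction, 1 = quiet neighbor, 2 = opposite direction.
    """
    if victim_delta == 0:
        return 0  # a quiet victim has no transition delay to model
    return abs(victim_delta - neighbor_delta)


def effective_capacitance(before, after, c_g, c_c):
    """C_eff of the middle wire for a 3-bit transition, e.g. '010' -> '101'."""
    deltas = [int(a) - int(b) for b, a in zip(before, after)]
    left, victim, right = deltas
    return c_g + c_c * (miller_factor(victim, left) + miller_factor(victim, right))


# Hypothetical normalized capacitances.
C_G, C_C = 1.0, 0.5

best = effective_capacitance("000", "111", C_G, C_C)   # all wires rise together
mid = effective_capacitance("000", "010", C_G, C_C)    # neighbors stay quiet
worst = effective_capacitance("010", "101", C_G, C_C)  # neighbors oppose the victim

print(best, mid, worst)
```

Under this model the best case reduces to $C_g$ and the worst case to $C_g + 4C_C$, matching the range quoted from [35]; the intermediate cases are a consequence of the assumed multiplier, not a claim made by the text.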


Supply voltage fluctuations are another factor affecting the signal integrity of on-chip interconnects [23, 24, 41]. Supply voltage fluctuations can affect driver and repeater performance and noise margins, increasing the susceptibility to both logic and timing errors. There are two components of power supply noise: low frequency and high frequency. The low-frequency component is known as IR drop, which is the reduction in voltage caused by passing current through a resistive line. The high-frequency component is known as $L\,\partial i/\partial t$ noise, which is caused by the inductive properties of currents flowing through the chip power grid. A sudden current demand caused by simultaneous switching of a large number of logic gates results in a large $L\,\partial i/\partial t$ noise.

On-chip interconnect reliability issues can also be caused by temperature variations [32]. Interconnect resistance is linearly dependent on temperature. On-chip temperature variations result in different wire resistances, which cause delay uncertainty. It has been reported that thermal gradients can be as large as 50 °C across a high-performance microprocessor substrate [32]. Thus, it is very important to take into account the impact of temperature variations on interconnect performance.

All of these factors decrease the reliability of on-chip interconnects in nanoscale technology. Design techniques that can improve the reliability of on-chip interconnects should be considered.
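To give a feel for the magnitudes involved, the sketch below evaluates the two supply-noise components and the linear temperature dependence of wire resistance. All numerical values, including the temperature coefficient (roughly that of bulk copper) and the grid resistance, inductance and current figures, are assumptions chosen for illustration and are not taken from the text.

```python
def ir_drop(current, resistance):
    """Low-frequency supply noise: voltage lost across a resistive supply line."""
    return current * resistance


def inductive_noise(inductance, delta_i, delta_t):
    """High-frequency supply noise, L * di/dt, for a current step delta_i over delta_t."""
    return inductance * delta_i / delta_t


def resistance_at(r_ref, alpha, delta_temp):
    """Linear temperature model: R = R_ref * (1 + alpha * delta_temp)."""
    return r_ref * (1.0 + alpha * delta_temp)


# Assumed values for illustration only.
ALPHA = 0.0039                                   # per degree C, approximately bulk copper
v_ir = ir_drop(current=0.5, resistance=0.02)     # 0.5 A through 20 milliohm
v_l = inductive_noise(1e-9, 1.0, 1e-9)           # 1 nH, 1 A step in 1 ns
r_hot = resistance_at(100.0, ALPHA, 50.0)        # 100 ohm wire, 50 degree gradient

print(f"IR drop:      {v_ir * 1e3:.1f} mV")
print(f"L di/dt:      {v_l:.2f} V")
print(f"Hot wire R:   {r_hot:.1f} ohm")
```

With these assumed numbers, a 50 °C gradient changes the wire resistance by roughly 20%, which shows why the reported thermal gradients translate into noticeable delay uncertainty.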

1.3 Types of Errors and Error Models

1.3.1 Types of Errors

Depending on their duration, errors can be divided into three classes: transient, permanent and intermittent.

Transient errors, which are also called soft errors, are short-term malfunctions temporarily induced by external radiation or electrical noise rather than by manufacturing defects [3, 42, 43]. Transient errors can be caused by neutron or alpha particle strikes. As the node capacitance decreases with technology scaling, a smaller injection of charge can induce errors. It has been shown that transient errors caused by particle strikes increase by two orders of magnitude from a 180 nm technology to a 45 nm technology [44]. Also, the probability of multiple errors caused by a single particle strike increases in nanoscale technology [45].

Crosstalk coupling is another factor that causes transient errors. In nanoscale technology, capacitive coupling effects increase with rising interconnect aspect ratios, and a high clock frequency results in non-negligible inductive coupling. Instead of inducing single errors, crosstalk coupling can cause spatial burst errors, which occur in multiple adjacent wires. Crosstalk coupling can also propagate transient errors caused by particle strikes from a victim wire to its neighbors, further increasing the probability of multiple errors in on-chip interconnects.


Transient errors can also be caused by other noise sources such as process variations, supply voltage fluctuation, electromagnetic interference (EMI), and electrostatic discharge [42]. As technology scales, the impact of these noise sources is expected to increase because of smaller feature sizes, lower supply voltages and higher clock frequencies.

Permanent errors are irreversible malfunctions caused by physical changes; once permanent errors occur, they do not disappear. Permanent errors are usually a result of manufacturing defects, which can be detected during manufacturing test. However, permanent errors can also occur at run-time (e.g., caused by electromigration or aging) [3]. An efficient approach to fixing permanent errors is to use spare wires [17, 46].

Intermittent errors are long-duration errors (but not permanent [3]) occurring in the same position. Intermittent errors are usually activated by voltage or environmental (e.g., temperature) changes or by specific input patterns. Intermittent errors can lead to the occurrence of permanent errors. For example, electromigration usually causes timing errors resulting from increased resistance before it finally breaks the link and creates an open.

The occurrence of permanent and intermittent errors decreases the efficiency of error control schemes. For example, error detection and retransmission (EDR) can be used to address transient errors, but it can be ineffective against intermittent errors (a system may stall while sending many retransmissions of a single piece of data), and it fails to work in the presence of permanent errors. Intermittent and permanent errors also reduce the capability of error control codes to tolerate transient errors and require more powerful error control codes, which lead to large power and area overheads.
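The difference between how EDR copes with transient and permanent errors can be illustrated with a small deterministic sketch. It assumes a single even-parity bit as the detection code and two hypothetical channel models (a one-off bit flip on the first transfer, and a wire permanently stuck at 0); none of these choices comes from the text, and a real link would use a stronger detection code.

```python
def parity(bits):
    """Even-parity value of a list of bits."""
    return sum(bits) % 2


def encode(data):
    """Append one even-parity bit (assumed detection code for this sketch)."""
    return data + [parity(data)]


def has_error(codeword):
    """A received word with odd overall parity is flagged as erroneous."""
    return parity(codeword) != 0


def send_with_edr(data, channel, max_attempts=4):
    """Retransmit until the word arrives clean; return attempts used, or None on failure."""
    codeword = encode(data)
    for attempt in range(1, max_attempts + 1):
        received = channel(list(codeword), attempt)
        if not has_error(received):
            return attempt
    return None


def transient_channel(word, attempt):
    """Hypothetical transient fault: noise flips wire 0 on the first transfer only."""
    if attempt == 1:
        word[0] ^= 1
    return word


def stuck_at_zero_channel(word, attempt):
    """Hypothetical permanent fault: wire 0 is always read as 0."""
    word[0] = 0
    return word


data = [1, 0, 1, 1]
print(send_with_edr(data, transient_channel))      # succeeds on a retransmission
print(send_with_edr(data, stuck_at_zero_channel))  # never succeeds for this data word
```

In the transient case one retransmission is enough, whereas the stuck-at wire corrupts every attempt for this data word, so the sender exhausts its retries. This is the stall-and-fail behavior that motivates spare wires for permanent errors.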

1.3.2 Error Models

Modeling error rates of on-chip interconnects can be difficult, because it requires knowledge of the various noise sources and their dependence upon the supply voltage. In [47, 48], a simplified model is applied by assuming that all the noise effects on a wire can be modeled as a normally distributed noise voltage $V_N$ with standard deviation $\sigma_N$. The probability of an error occurring in this model (shown by the shaded area in Fig. 1.11) is the sum of two components: the probability of noise causing a logic low to exceed the gate switching threshold voltage ($V_{dd}/2$), and the probability of noise causing a logic high to fall below the gate switching threshold voltage. The probability $\varepsilon$ of a single wire being erroneous during a transition can be expressed by a Gaussian pulse function [47],

$$\varepsilon = Q\left(\frac{V_{swing}}{2\sigma_N}\right) = \int_{V_{swing}/(2\sigma_N)}^{\infty} \frac{1}{\sqrt{2\pi}}\, e^{-y^2/2}\, dy \tag{1.2}$$


1 Introduction

Fig. 1.11 Error probability of independent error model

Fig. 1.12 Multiple adjacent errors caused by a noise source

where $V_{swing}$ is the link swing voltage and $\sigma_N$ is the standard deviation of the noise voltage. In this model, the error probability in each wire is assumed to be independent. As technology scales, the probability that a single noise source causes errors in multiple neighboring wires increases [49–54]. As shown in Fig. 1.10, a single particle strike can cause multiple errors because of crosstalk coupling effects [37, 38]. Thus, a more realistic error model should include spatial burst errors, where multiple adjacent wires are erroneous. Equation 1.2 can be extended to include burst errors. Instead of only affecting one wire, the noise source is modeled to affect its neighbors. The effect on neighboring wires can be described by a coupling probability $P_n$ [54], as shown in Fig. 1.12. The higher $P_n$, the more likely the noise source causes errors in multiple neighboring wires. The probability of the noise source causing an error in $l_n$ can be expressed as (1.3) below,

$$P(b = 1) = (1 - P_n)^2\, \varepsilon \tag{1.3}$$

where $\varepsilon$ is the probability of a single wire being erroneous if no coupling effects are considered.


Fig. 1.13 Residual flit error rate of Hamming code for different error models; the coupling probability is $P_n = 10^{-2}$ in the dependent error model

The probabilities of two- and three-wire errors caused by the same noise source, $P(b = 2)$ and $P(b = 3)$, can be expressed as (1.4) and (1.5), respectively,

$$P(b = 2) = 2 P_n (1 - P_n)\, \varepsilon \tag{1.4}$$

$$P(b = 3) = P_n^2\, \varepsilon \tag{1.5}$$
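The independent and burst error models above can be evaluated numerically. The following is a minimal Python sketch, assuming the Q-function is the standard Gaussian tail probability; the function names and the numerical values (1.0 V swing, 0.1 V noise standard deviation) are illustrative assumptions and are not taken from the text.

```python
import math


def q_function(x: float) -> float:
    """Gaussian tail probability Q(x) = P(Z > x) for a standard normal Z."""
    return 0.5 * math.erfc(x / math.sqrt(2.0))


def wire_error_probability(v_swing: float, sigma_n: float) -> float:
    """Eq. (1.2): epsilon = Q(V_swing / (2 * sigma_N))."""
    return q_function(v_swing / (2.0 * sigma_n))


def burst_probabilities(epsilon: float, p_n: float) -> dict:
    """Eqs. (1.3)-(1.5): probability that one noise source corrupts
    exactly 1, 2 or 3 adjacent wires, given coupling probability p_n."""
    return {
        1: (1.0 - p_n) ** 2 * epsilon,
        2: 2.0 * p_n * (1.0 - p_n) * epsilon,
        3: p_n ** 2 * epsilon,
    }


# Illustrative (hypothetical) operating point, not from the text.
eps = wire_error_probability(v_swing=1.0, sigma_n=0.1)
bursts = burst_probabilities(eps, p_n=1e-2)
```

The three burst probabilities sum to the single-wire probability, since $(1-P_n)^2 + 2P_n(1-P_n) + P_n^2 = 1$; setting `p_n` to 0 recovers the independent model of (1.2).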

The probability of the same noise source at $l_n$ also affecting $l_{n+2}$ and $l_{n-2}$ is usually much smaller than $P_n$, so we ignore the probability of a noise source causing burst errors of four bits or more: $P(b \ge 4 \mid l_n) = 0$. Equation 1.2 above can be considered a specific case of the extended model in which $P_n$ is 0. The value of $P_n$ depends on coupling effects and the amplitude of the noise voltage. For simplicity, we use different $P_n$ values to describe the coupling effects in the following analysis and simulation. Figure 1.13 shows the residual flit error rate of Hamming codes using the independent and dependent error models. In the dependent error model, the coupling probability $P_n$ is $10^{-2}$. The results show that the residual flit error rate using the dependent error model increases greatly compared to using the independent error model. A more complex error model is proposed in [49]. In this error model, effects of a single noise source are described by a normalized matrix $P$ with the following format (1.6),


$$P = \begin{bmatrix} p(1,1) & \cdots & p(1,t_{max}) \\ p(2,1) & \cdots & p(2,t_{max}) \\ \vdots & \ddots & \vdots \\ p(\omega_{max},1) & \cdots & p(\omega_{max},t_{max}) \end{bmatrix} \tag{1.6}$$

The element $p(\omega, t)$ in the matrix $P$ represents the probability of a single noise source affecting $\omega$ wires for $t$ cycles; $\omega_{max}$ and $t_{max}$ are the maximum numbers of wires and cycles affected by this noise source. Compared to previous models, the error model in [49] can be used to express the probability of multiple-wire and multiple-cycle errors.
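As a rough illustration of how such a matrix can be used, the Python sketch below builds a small hypothetical matrix and derives the probabilities of multiple-wire and multiple-cycle events. The entries are invented for illustration, and "normalized" is assumed here to mean that all entries sum to 1; neither is taken from [49].

```python
# Hypothetical 3x2 matrix: row index = wires affected (1..3),
# column index = cycles affected (1..2). Values are illustrative only.
P = [
    [0.80, 0.05],
    [0.10, 0.02],
    [0.02, 0.01],
]


def is_normalized(p, tol=1e-9):
    """Assumed normalization: all entries of the matrix sum to 1."""
    return abs(sum(sum(row) for row in p) - 1.0) < tol


def prob_wires(p, w):
    """Probability that a noise event affects exactly w wires (any duration)."""
    return sum(p[w - 1])


def prob_cycles(p, t):
    """Probability that a noise event lasts exactly t cycles (any width)."""
    return sum(row[t - 1] for row in p)


# Probability that an event corrupts more than one wire / more than one cycle.
multi_wire = sum(prob_wires(P, w) for w in range(2, len(P) + 1))
multi_cycle = sum(prob_cycles(P, t) for t in range(2, len(P[0]) + 1))
```

A single-wire, single-cycle model corresponds to a 1x1 matrix, which is one way to see that this notation generalizes the earlier models.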

1.4 Book Overview

As technology scales into the nanoscale regime, it is impossible to guarantee perfect hardware. Moreover, if the requirement of 100% correctness in hardware can be relaxed, the cost of manufacturing, verification, and testing will be significantly reduced. Many approaches have been proposed to address the reliability problem of on-chip communication. This book mainly focuses on the use of error control codes (ECCs) to improve on-chip interconnect reliability.

In Chap. 2, we examine various techniques used to improve the reliability of on-chip interconnects. These techniques can be separated into noise reduction techniques and error control methods. Noise reduction techniques, such as wider metal wires, larger interconnect spacing, shielding, skewed transitions, repeater insertion, and crosstalk avoidance codes (CACs), reduce noise effects and lower the probability of errors occurring. Error control methods are used to detect or correct errors after they occur. Spatial redundancy, temporal redundancy and information redundancy are the common techniques exploited in error control methods.

An important application of error control coding for on-chip interconnects is to improve communication reliability in network-on-chip (NoC) architectures. As technology scales, billions of transistors can be integrated into a single chip, and traditional bus-based infrastructures are no longer sufficient to handle intensive on-chip communication. NoC is emerging as an efficient solution to the worsening scalability and contention issues of on-chip communication. In Chap. 3, we introduce different architectures and design components of a NoC. The techniques used to improve communication reliability in NoCs are also discussed in that chapter.

Error control codes (ECCs) have been widely applied in conventional communication systems. As the area and energy costs of ECCs are relatively small in nanoscale technology, ECCs become a promising solution to the reliability issue in on-chip interconnects. Simple ECCs such as single parity check (SPC) codes, Hamming codes, and duplicate-add-parity (DAP) codes are widely used in previous work. As the probability of multiple errors increases in nanoscale technology, more complex error control codes, such as Bose-Chaudhuri-Hocquenghem (BCH) codes, Reed-Solomon (RS) codes and product codes, are applied to improve the reliability of on-chip interconnects. In Chap. 4, we discuss these ECCs and their hardware implementation.

On-chip interconnects have tight speed, area, and energy constraints. Thus, the implementation of error control codes for on-chip interconnects needs to balance reliability and performance. In Chap. 5, we introduce various design techniques to trade off the reliability and energy consumption of on-chip interconnects. These techniques include the use of low link swing voltage and dynamic voltage scaling with error control codes, the combination of Hamming product codes with type-II hybrid ARQ, and configurable error control code implementations.

Conventional error control codes, such as Hamming and BCH codes, have been successfully applied to improve the reliability of on-chip interconnects by correcting logic errors. Unfortunately, these codes are inefficient at addressing crosstalk-induced delay uncertainty. As the effects of coupling capacitance increase with technology scaling, the delay uncertainty caused by capacitive coupling greatly reduces system performance because a large additional design margin is required. In Chap. 6, we discuss solutions that can efficiently address both logic errors and capacitive-crosstalk-induced delay uncertainty simultaneously.

References

1. Davis AJ et al (2001) Interconnect limits on gigascale integration (GSI) in the 21st Century. Proc IEEE 89:305–324
2. Caignet F, Bendhia DS, Sicard E (2001) The challenge of signal integrity in deep-submicrometer CMOS technology. Proc IEEE 89:556–573
3. Constantinescu C (2003) Trends and challenges in VLSI circuit reliability. IEEE Micro 23:14–19
4. Im S, Srivastava N, Banerjee K, Goodson EK (2005) Scaling analysis of multilevel interconnect temperatures for high performance ICs. IEEE Trans Electron Devices 52:2710–2719
5. Kleveland B, Qi X, Madden L et al (2002) High-frequency characterization of on-chip digital interconnects. IEEE J Solid-State Circuits 37:716–725
6. Ho R, Mai WK, Horowitz AM (2001) The future of wires. Proc IEEE 89:490–504
7. Bakoglu BH, Meindl DJ (1985) Optimal interconnect circuits for VLSI. IEEE Trans Electron Devices 32:903–909
8. Ismail IY, Friedman GE, Neves LJ (1999) Figures of merit to characterize the importance of on-chip inductance. IEEE Trans Very Large Scale Integr (VLSI) Syst 7:442–449
9. Agarwal K, Sylvester D, Blaauw D (2006) Modeling and analysis of crosstalk noise in coupled RLC interconnects. IEEE Trans Comput Aided Des Integr Circuits Syst 25:892–901
10. Ismail IY (2002) On-chip inductance cons and pros. IEEE Trans Very Large Scale Integr (VLSI) Syst 10:685–694
11. International Technology Roadmap for Semiconductors (2005) http://public.itrs.net
12. Horowitz M, Dally B (2004) How scaling will change processor architecture. In: Proceedings of the international solid state circuits conference (ISSCC), pp 132–133


13. Magen N, Kolodny A, Weiser U, Shamir N (2004) Interconnect-power dissipation in a microprocessor. In: Proceedings of the international workshop on system-level interconnect prediction (SLIP), pp 7–13
14. Soteriou V, Peh SL (2004) Design-space exploration of power-aware on/off interconnection networks. In: Proceedings of the international conference on computer design (ICCD), pp 510–517
15. Shin LJ et al (2011) A 40 nm 16-core 128-thread SPARC SoC processor. IEEE J Solid-State Circuits 46:131–144
16. Zorian Y, Gizopoulos D, Vandenberg C, Magarshack P (2004) Guest editors' introduction: design for yield and reliability. IEEE Des Test Comput 21:177–182
17. Grecu C, Ivanov A, Saleh R, Pande PP (2006) NoC interconnect yield improvement using crosspoint redundancy. In: Proceedings of the IEEE international symposium on defect and fault tolerance in VLSI system (DFT), pp 457–465
18. Karnick T, Hazucha P, Patel J (2004) Characterization of soft errors caused by single event upsets in CMOS processes. IEEE Trans Depend Secure Comput 1:128–143
19. Munteanu D, Autran LJ (2008) Modeling and simulation of single-event effects in digital devices and ICs. IEEE Trans Nucl Sci 55:1854–1878
20. Tang TK, Friedman GE (2000) Delay and noise estimation of CMOS logic gates driving coupled resistive-capacitive interconnections. Integr VLSI J 29:131–165
21. Vittal A, Chen HL, Marek MS et al (1999) Crosstalk in VLSI interconnections. IEEE Trans Comput Aided Des Integr Circuits Syst 18:1817–1824
22. Sylvester D, Hu C (2001) Analytical modeling and characterization of deep submicron interconnect. Proc IEEE 89:634–664
23. Larsson P (1999) Power supply noise in future IC's: a crystal ball reading. In: Proceedings of the IEEE custom integrated circuits conference, pp 467–474
24. Mezhiba VA, Friedman GE (2004) Scaling trends of on-chip power distribution noise. IEEE Trans Very Large Scale Integr (VLSI) Syst 12:386–394
25. Scheffer L (2006) An overview of on-chip interconnect variation. In: Proceedings of the 2006 international workshop on system-level interconnect prediction, pp 27–28
26. Lin Z et al (1998) Circuit sensitivity to interconnect variations. IEEE Trans Semiconductor Manuf 11:557–568
27. Lopez G et al (2007) The impact of size effects and copper interconnect process variations on the maximum critical path delay of single and multi-core microprocessors. In: Proceedings of the international interconnect technology conference, pp 40–42
28. Demircan E (2006) Effects of interconnect process variations on signal integrity. In: Proceedings of the IEEE international SOC conference, pp 281–284
29. Mehrotra V, Nassif S, Boning D, Chung J (1998) Modeling the effects of manufacturing variation on high-speed microprocessor interconnect performance. In: Proceedings of the IEEE electron devices meetings (IEDM), pp 767–770
30. Mehrotra V, Sam LS, Boning D et al (2000) A methodology for modeling the effects of systematic within-die interconnect and device variation on circuit performance. In: Proceedings of the ACM/IEEE design automation conference (DAC), pp 172–175
31. Qi X, Lo S, Luo Y et al (2005) Simulation and analysis of inductive impact on VLSI interconnects in the presence of process variations. In: IEEE custom integrated circuit conference, pp 309–312
32. Ajami HA, Banerjee K, Pedram M (2005) Modeling and analysis of nonuniform substrate temperature effects on global ULSI interconnects. IEEE Trans Comput Aided Des Integr Circuits Syst 24:849–861
33. Khazaka R, Nakhla M (1998) Analysis of high-speed interconnects in the presence of electromagnetic interference. IEEE Trans Microw Theory Tech 46:940–947
34. Nassif S (2000) Delay variability: sources, impacts and trends. In: Proceedings of the IEEE international solid-state circuits conference digest of technical papers, pp 7–9


35. Sotiriadis P (2002) Interconnect modeling and optimization in deep submicron technologies. Dissertation, Massachusetts Institute of Technology
36. Tamhankar R, Murali S, Stergiou S et al (2007) Timing-error-tolerant network-on-chip design methodology. IEEE Trans Comput Aided Des Integr Circuits Syst 26:1297–1310
37. Balasubramanian A, Sternberg LA, Bhuva LB, Massengill WL (2006) Crosstalk effects caused by single event hits in deep sub-micron CMOS technologies. IEEE Trans Nucl Sci 53:3306–3311
38. Balasubramanian A et al (2008) Measurement and analysis of interconnect crosstalk due to single events in a 90 nm CMOS technology. IEEE Trans Nucl Sci 55:2079–2084
39. Srinivasan J, Adve VS, Bose P, Rivers AJ (2004) The case for lifetime reliability-aware microprocessors. In: Proceedings of the 31st international symposium on computer architecture (ISCA), pp 276–287
40. Xuan X, Singh A, Chatterjee A (2003) Reliability evaluation for integrated circuit with defective interconnect under electromigration. In: Proceedings of the international symposium on quality electronic design, pp 29–34
41. Heydari P, Pedram M (2003) Ground bounce in digital VLSI circuits. IEEE Trans Very Large Scale Integr (VLSI) Syst 11:180–193
42. Zhao C, Bai X, Dey S (2007) Evaluating transient error effects in digital nanometer circuits. IEEE Trans Reliab 56:381–391
43. Maheshwari A, Burleson W, Tessier R (2004) Trading off transient fault tolerance and power consumption in deep submicron (DSM) VLSI circuits. IEEE Trans Very Large Scale Integr (VLSI) Syst 12:299–311
44. Heidel FD et al (2008) Alpha-particle-induced upsets in advanced CMOS circuits and technology. IBM J Res Dev 52:225–232
45. Tipton DA et al (2006) Multiple-bit upset in 130 nm CMOS technology. IEEE Trans Nucl Sci 53:3259–3264
46. Lehtonen T, Wolpert D, Liljeberg P, Plosila J, Ampadu P (2010) Self-adaptive system for addressing permanent errors in on-chip interconnects. IEEE Trans Very Large Scale Integr (VLSI) Syst 18:527–540
47. Hegde R, Shanbhag RN (2000) Toward achieving energy-efficiency in presence of deep submicron noise. IEEE Trans Very Large Scale Integr (VLSI) Syst 8:379–391
48. Bertozzi D, Benini L, De Micheli G (2005) Error control schemes for on-chip communication links: the energy-reliability tradeoff. IEEE Trans Comput Aided Des Integr Circuits Syst 24:818–831
49. Zimmer H, Jantsch A (2003) A fault model notation and error-control scheme for switch-to-switch buses in a network-on-chip. In: Proceedings of the international conference on hardware/software codesign and system synthesis (CODES-ISSS), pp 188–193
50. De Micheli G, Benini L (2006) Networks on chips: technology and tools. Elsevier, Amsterdam
51. Lehtonen T, Liljeberg P, Plosila J (2007) Online reconfigurable self-timed links for fault tolerant NoC. VLSI Des. Article ID 94676:13
52. Fu B, Ampadu P (2008) A multi-wire error correction scheme for reliable and energy efficient SoC links using Hamming product codes. In: Proceedings of the IEEE international SoC conference (SoCC), pp 59–62
53. Fu B, Ampadu P (2008) An energy-efficient multi-wire error control scheme for reliable on-chip interconnects using Hamming product codes. VLSI Des Article ID: 109490, 1–14, doi:10.1155/2008/109490
54. Fu B, Ampadu P (2009) On hamming product codes with type-II hybrid ARQ for on-chip interconnects. IEEE Trans Circuits Syst I Reg Papers 56:2042–2054

Chapter 2

Solutions to Improve the Reliability of On-Chip Interconnects

Various noise reduction and error control techniques have been applied to improve the reliability of on-chip interconnects. Noise reduction techniques include increasing wire width and spacing [1, 2], shielding [3–7], repeater insertion [8–13], crosstalk avoidance codes [14–18], skewed transitions [19–25] and decoupling capacitors [26–28]. Error control techniques improve the reliability of on-chip interconnects by correcting errors using retransmission, error control codes and spare wires. The use of these techniques relaxes the reliability requirements of circuit components, reducing the cost of manufacturing, verification, and testing. In this chapter, we review both noise reduction and error control techniques and their pros and cons.

2.1 Wire Sizing and Spacing

Increasing interconnect width has different effects on capacitive coupling and inductive coupling. When the interconnect width is increased, inter-wire capacitive coupling effects decrease because a wider wire has a larger ground capacitance. Inductive coupling, in contrast, increases with the interconnect width. Thus, the total impact of crosstalk coupling is only weakly dependent on interconnect width when both capacitive and inductive effects are considered [1]. Capacitive coupling is linearly related to the spacing between two interconnect lines, so increasing the interconnect spacing can effectively reduce the capacitive coupling. Inductive coupling is logarithmically related to the spacing between two wires [1, 2]. As the spacing between two interconnect wires increases, the inductive coupling decreases at a much slower rate than the capacitive coupling. Above a certain threshold value of the spacing between two interconnect lines, inductive coupling dominates the total coupling effects, as shown in Fig. 2.1. The dominance of inductive coupling reduces the efficiency of using wider spacing to reduce crosstalk coupling. The drawback of wire sizing and spacing is the increase in link routing area.

B. Fu and P. Ampadu, Error Control for Network-on-Chip Links, DOI 10.1007/978-1-4419-9313-7_2, © Springer Science+Business Media, LLC 2012


Fig. 2.1 Effects of wire spacing on the capacitive and inductive coupling noises [1]

2.2 Shielding

Shielding is the most common design technique to prevent crosstalk coupling. There are two kinds of shielding methods: passive shielding [3, 4] and active shielding [6, 7]. In passive shielding, the shield wires, which are statically connected to power or ground, are placed on either side of the signal wire. The effects of capacitive coupling are reduced by isolating the signal wire from its neighboring signal wires. Passive shielding also reduces inductive coupling by providing a closer return path for the operating currents. In [6, 7], an active shielding approach is proposed that connects the shield wires to the signal wire, as shown in Fig. 2.2. In the active shielding method, the shield wires have the same switching behavior as the signal wire. Active shielding can achieve a larger delay reduction than passive shielding by taking advantage of the Miller effect (the Miller effect states that when two parallel wires switch in the same direction, the effective coupling capacitance is zero, while when they switch in opposite directions, the effective coupling capacitance is doubled). Active shielding reduces the link power consumed by coupling capacitance; however, the self-switching power consumption is increased because of the additional switching of the shield wires. Both active and passive shielding require additional wires, greatly increasing link routing area. Instead of adding a shield wire for each signal line, shield wires can be inserted between every two to four signal wires to reduce the area cost while giving up some of the coupling improvement [3].


Fig. 2.2 An example of active shielding

2.3 Repeater Insertion

In repeater insertion, a long interconnect line is separated into several segments, each driven by an inverting or non-inverting buffer. Repeater insertion has been successfully used to reduce the global interconnect delay. Without repeater insertion, the delay of a global interconnect increases quadratically with the interconnect length. By properly sizing and placing repeaters, the global interconnect delay is reduced to a linear dependence on length. Repeater insertion can also be used to reduce the capacitive coupling noise between two adjacent interconnect lines. The coupling capacitance between two neighboring wires is proportional to the interconnect length. By inserting repeaters, a long interconnect wire is divided into several small pieces. The coupling capacitance of each segment is smaller than that of the overall link without repeater insertion, resulting in a reduction in coupling noise. In traditional repeater insertion, each segment has the same length and each repeater has the same size, as shown in Fig. 2.3a. Traditional repeater insertion cannot effectively handle delay uncertainty caused by capacitive coupling between adjacent interconnect lines. In order to reduce the delay uncertainty caused by capacitive coupling, several new repeater insertion methods have been proposed [10–13]. In [10], a staggered repeater insertion scheme is presented to reduce the capacitive coupling effect by shifting the inverter locations on adjacent lines, as shown in Fig. 2.3b. In the staggered repeater method, the worst case delay is reduced because the transition with the worst case capacitive coupling is limited to only half of each segment. For example, the transition 010→101 with the worst case capacitive coupling in the first half of each segment becomes the transition 000→111 with the best case capacitive coupling in the second half of each segment. The performance of staggered repeater insertion is sensitive to the repeater insertion position.
Thus, the selection of repeater positions in staggered repeater insertion is more complex than in traditional repeater insertion. An optimum position for staggered repeater insertion is presented in [11]. A hybrid polarity repeater insertion method is presented in [12], shown in Fig. 2.3c. In this method, inverting repeaters (a single inverter) and non-inverting repeaters (two inverters) are used alternately at the midpoint of the bus. Similar to the staggered repeater method, a worst case delay transition in the first half of a line becomes a best case delay transition in the second half. Thus the worst case delay is reduced by averaging the coupling effects during the transition across the whole bus line. Compared to the staggered repeater method, the hybrid polarity repeater method does not need a shift in repeater positions, and the transition patterns are inverted only once, at the midpoint of the whole interconnect length. Instead of only using non-inverting repeaters at the midpoint of the bus line, an alternate repeater insertion method [13] uses inverting and non-inverting repeaters alternately along the bus line, as shown in Fig. 2.3d. In alternate repeater insertion, the placement of the non-inverting repeaters is shifted between two adjacent interconnect lines. Alternate repeater insertion is suitable for a shared bus line with multiple drivers and receivers. As long as the driver and receiver are separated by more than one segment, the worst case delay caused by crosstalk coupling can be reduced.

Fig. 2.3 Repeater insertion: (a) traditional repeater insertion, (b) staggered repeater insertion, (c) hybrid polarity repeater insertion, (d) alternate repeater insertion
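The quadratic-versus-linear delay behavior described in this section can be illustrated with a simple first-order model. The Python sketch below is only a rough illustration under stated assumptions: the wire is treated as a distributed RC line whose delay is `k_rc * r * c * L**2` (the coefficient 0.38 is a commonly quoted approximation, assumed here), each repeater adds a fixed hypothetical delay `t_repeater`, and all parameter values are invented for the example.

```python
def unrepeated_delay(length_mm, r_per_mm, c_per_mm, k_rc=0.38):
    """First-order delay of an unbuffered distributed RC wire.
    Grows quadratically with length (k_rc is an assumed coefficient)."""
    return k_rc * r_per_mm * c_per_mm * length_mm ** 2


def repeated_delay(length_mm, r_per_mm, c_per_mm, segment_mm,
                   t_repeater, k_rc=0.38):
    """Delay of the same wire split into equal segments, each driven by a
    repeater with a fixed (hypothetical) intrinsic delay t_repeater."""
    n_segments = max(1, round(length_mm / segment_mm))
    seg_len = length_mm / n_segments
    per_segment = k_rc * r_per_mm * c_per_mm * seg_len ** 2 + t_repeater
    return n_segments * per_segment


# Illustrative values only: 100 ohm/mm, 0.2 pF/mm, 1 mm segments, 20 ps repeaters.
R, C, SEG, T_REP = 100.0, 0.2e-12, 1.0, 20e-12
for length in (2.0, 4.0, 8.0):
    print(length,
          unrepeated_delay(length, R, C),
          repeated_delay(length, R, C, SEG, T_REP))
```

With a fixed segment length, doubling the wire length doubles the number of segments and therefore the repeated delay, while the unrepeated delay quadruples. The model ignores repeater sizing and crosstalk, so it does not capture the differences between the staggered, hybrid polarity and alternate schemes.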

2.4 Crosstalk Avoidance Codes (CACs)

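The delay $T_l$ of a wire $l$ in a $k$-bit bus can be modeled as [29],

$$T_l = \begin{cases} \tau_0\left((1+\lambda)\Delta_1^2 - \lambda\Delta_1\Delta_2\right), & l = 1 \\ \tau_0\left((1+2\lambda)\Delta_l^2 - \lambda\Delta_l(\Delta_{l-1}+\Delta_{l+1})\right), & 1 < l < k \\ \tau_0\left((1+\lambda)\Delta_k^2 - \lambda\Delta_k\Delta_{k-1}\right), & l = k \end{cases} \tag{2.1}$$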

where $\tau_0 = R_t C_{gt}$ is the wire delay when there is no crosstalk; $R_t$ and $C_{gt}$ represent the total resistance and ground capacitance of a wire, respectively; $\lambda$ is the ratio of total coupling capacitance $C_{ct}$ to total ground capacitance $C_{gt}$; and $\Delta_l = d_l^{t+1} - d_l^{t}$ is the difference in the value of wire $l$ between time $t+1$ and time $t$ (for a 0→1 transition on wire $l$, $\Delta_l = 1$; for a 1→0 transition, $\Delta_l = -1$). From (2.1), the delay of a wire depends on the switching behavior of its neighboring wires. Table 2.1 shows the middle wire delay of three adjacent wires for different switching patterns. The delay is normalized to the delay $\tau_0$.

Table 2.1 Normalized link delay for different switching patterns (rows: $d_{l-1}^{t}, d_l^{t}, d_{l+1}^{t}$; columns: $d_{l-1}^{t+1}, d_l^{t+1}, d_{l+1}^{t+1}$)

       000    001    010    011    100    101    110    111
000    0      0      1+2λ   1+λ    0      0      1+λ    1
001    0      0      1+3λ   1+2λ   0      0      1+2λ   1+λ
010    1+2λ   1+3λ   0      0      1+3λ   1+4λ   0      0
011    1+λ    1+2λ   0      0      1+2λ   1+3λ   0      0
100    0      0      1+3λ   1+2λ   0      0      1+2λ   1+λ
101    0      0      1+4λ   1+3λ   0      0      1+3λ   1+2λ
110    1+λ    1+2λ   0      0      1+2λ   1+3λ   0      0
111    1      1+λ    0      0      1+λ    1+2λ   0      0

The delay of the middle wire can be separated into six classes (0, 1, 1+λ, 1+2λ, 1+3λ, 1+4λ) [29, 30]. The sixth class has the worst case link delay, which is caused by a transition 010→101 (or 101→010) on three adjacent wires. From Table 2.1, if the transition patterns with high delay classes can be eliminated from the data transmission, the crosstalk coupling can be reduced. Crosstalk avoidance codes (CACs) use a coding approach to achieve this coupling reduction by mapping the input data into CAC codewords, in which some specific switching patterns are avoided. A number of CACs have been proposed in previous work. There are two types of CACs [31]: memory-based and memory-less. Memory-based CACs generate a codeword using not only the input data but also the previously transmitted codeword. Memory-less CACs generate a codeword based only on the input data. CACs are technology independent. Compared to the shielding method, CACs have additional codec cost but require a smaller routing area overhead. CACs cannot be used to correct logic errors. To achieve reliable on-chip communication, CACs are usually combined with error control coding (ECC). More detail about combining ECC with CACs is given in Chap. 6.
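The middle-wire case of the delay model can be checked against Table 2.1 with a few lines of Python. The sketch below is illustrative: the function name and the representation of a delay as a coefficient pair `(a, b)`, meaning a normalized delay of `a + b*λ`, are choices made here and do not come from the text.

```python
from itertools import product


def middle_wire_delay(prev: str, nxt: str) -> tuple:
    """Normalized delay of the middle of three wires for the transition
    prev -> nxt (3-character bit strings such as "010").
    Returns (a, b), meaning a delay of (a + b * lambda) * tau0, following
    the 1 < l < k case of the delay model:
    (1 + 2*lam) * D_l**2 - lam * D_l * (D_{l-1} + D_{l+1})."""
    left, mid, right = (int(n) - int(p) for p, n in zip(prev, nxt))
    a = mid ** 2
    b = 2 * mid ** 2 - mid * (left + right)
    return (a, b)


# Enumerate all 64 transitions of three adjacent wires and collect the classes.
patterns = ["".join(bits) for bits in product("01", repeat=3)]
classes = {middle_wire_delay(p, n) for p in patterns for n in patterns}

worst = middle_wire_delay("010", "101")  # opposite switching on both neighbors
best = middle_wire_delay("000", "111")   # all three wires switch together
```

Enumerating every transition yields exactly six distinct classes, matching the classification above; a CAC can be viewed as a restriction of the allowed `(prev, nxt)` pairs so that the largest `b` that can occur is reduced.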

2.5 Skewed Transitions

Another approach to reducing the delay uncertainty caused by crosstalk coupling is skewed transitions. In skewed transitions, a relative delay $\Delta T$ is introduced to avoid simultaneous opposite switching between adjacent bus wires. Skewed transitions can reduce the crosstalk-induced worst case delay without increasing the routing area. The relative delay $\Delta T$ between adjacent bus lines can be generated statically or dynamically.

In the static approach, the relative delay always exists between adjacent bus lines regardless of the switching patterns. In [19, 20], $\Delta T$ is generated by inserting delay elements (e.g., an inverter chain) at the beginning of alternate bus lines, as shown in Fig. 2.4. Figure 2.5 shows the relation between the relative delay $\Delta T$ and the total link delay $T_d$. In this case, $T_d$ can increase if $\Delta T$ is too large. A careful selection of $\Delta T$ is needed to achieve a reduction of the overall delay $T_d$. In [19], the transitions between adjacent bus lines can also be separated using different clocks. In this method, the signals need to be aligned at the end of the interconnect bus. Instead of inserting delay elements at the beginning of the alternate bus lines, repeaters and bus driver transistors with low threshold voltage can be applied to generate the skewed transition by speeding up the transition on the alternate bus lines [22, 23], as shown in Fig. 2.6.

In the dynamic approach, the relative delay is induced only when adjacent bus lines are switching oppositely. In [24], a transition detection circuit is used to detect 0→1 transitions on a bus line. If a 0→1 transition is detected, this transition will be delayed. In this method, if the transitions on two adjacent bus lines have the same direction, they are either both delayed or both not delayed, so no skew exists between these two wires. If two adjacent wires switch in opposite directions, the 0→1 transition is delayed while the 1→0 transition is completed immediately. Thus, a relative delay $\Delta T$ exists between these two bus lines. The opposite switching transitions between two adjacent bus lines can also be separated by using skewed repeaters [25], as shown in Fig. 2.7. Using skewed repeaters, two neighboring inverters along the bus line are skewed in opposite directions. The opposite transitions on two adjacent lines no longer switch simultaneously, because one transition travels faster than the other.

Fig. 2.4 An example of skewed transition by inserting delay elements at the beginning of alternate bus lines

Fig. 2.5 Timing relation in traditional skewed transitions

Fig. 2.6 Skewed transitions using threshold voltage adjustment

Fig. 2.7 Skewed repeater bus

2.6 Error Control Coding Schemes

Reliability of on-chip interconnects can be improved by introducing error control coding schemes [32–48]. On-chip interconnects typically use one of three schemes for error recovery – automatic repeat request (ARQ), forward error correction (FEC), and hybrid ARQ (HARQ). In this section, we discuss these three types of error control schemes and their advantages and disadvantages (Fig. 2.8).

2.6.1 Automatic Repeat Request (ARQ)

The basic concept of ARQ is to request a retransmission if errors are detected [49]. In ARQ, the input data is encoded using an error detection code. The encoded data is transmitted through the link. In the receiver, the encoded data is decoded to detect errors. When errors are detected in the received data, the receiver sends back a negative acknowledgment (NACK) signal to request a retransmission.
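The detect-and-retransmit loop described above can be sketched in Python. This is a minimal illustration rather than a hardware model: it assumes a single even-parity bit as the error detection code, a link that flips each bit independently with probability `flip_prob`, and hypothetical function names chosen for this example.

```python
import random


def encode_parity(bits):
    """Append one even-parity bit (a simple error detection code)."""
    return bits + [sum(bits) % 2]


def has_error(word):
    """Receiver check: odd overall parity means an error was detected."""
    return sum(word) % 2 != 0


def noisy_link(word, flip_prob, rng):
    """Flip each bit independently with probability flip_prob."""
    return [b ^ 1 if rng.random() < flip_prob else b for b in word]


def send_with_arq(bits, flip_prob, rng, max_tries=100):
    """Retransmit until the receiver detects no error (ACK).
    Returns the accepted data bits and the number of transmissions."""
    codeword = encode_parity(bits)
    for attempt in range(1, max_tries + 1):
        received = noisy_link(codeword, flip_prob, rng)
        if not has_error(received):   # ACK: accept the data
            return received[:-1], attempt
        # NACK: fall through and retransmit the same codeword
    raise RuntimeError("link too noisy: retransmission limit reached")


data, tries = send_with_arq([1, 0, 1, 1], flip_prob=0.05, rng=random.Random(1))
```

A single parity bit detects only an odd number of bit errors, so an even number of flips is accepted as correct data; a stronger detection code would reduce, but not eliminate, such undetected errors.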


Fig. 2.8 Three types of ARQ scheme (a) Stop-and-wait (b) Go-back-N (c) Selective-repeat

There are three types of ARQ [49]: stop-and-wait, go-back-N, and selective-repeat. In stop-and-wait ARQ, the transmitter sends data and waits until the ACK/NACK signal is received. Stop-and-wait ARQ has the benefit of simplicity; however, its throughput is low, making it unsuitable for high speed on-chip communication. To improve the system throughput, go-back-N and selective-repeat ARQ schemes are used. In go-back-N ARQ, the transmitter continuously sends data until a NACK is received. The transmitter then resends the data indicated by the NACK signal and also the succeeding $(N-1)$ data transmitted during the round-trip delay. In the receiver, if errors are detected in the received data, a retransmission is requested. The receiver discards all the incoming data until the retransmitted data is received. The average number of transmissions needed to successfully send a data in go-back-N ARQ can be represented by (2.2) below [49],

$$N_{GBN} = 1\cdot(1-P_d) + (N+1)(1-P_d)P_d + (2N+1)(1-P_d)P_d^2 + \cdots + (lN+1)(1-P_d)P_d^l + \cdots = 1 + \frac{N P_d}{1-P_d} \tag{2.2}$$

where $P_d$ is the probability that errors can be detected in the received data and $N$ is the round-trip delay in clock cycles. $N_{GBN}$ depends on both the channel error rate and the round-trip delay $N$. A transmitter buffer of length $N$ is needed in go-back-N ARQ to store the transmitted data until the ACK signal is received. Go-back-N ARQ balances throughput improvement against a moderate implementation cost, and is suitable for on-chip interconnects. Instead of resending $N$ data, selective-repeat ARQ only requests the erroneous data to be retransmitted. Consider an ideal selective-repeat ARQ system, in which the receiver has an infinite buffer to store the error-free data. The average number of transmissions needed to successfully send a data in selective-repeat ARQ can be represented by (2.3) below [49],

$$N_{SR} = 1\cdot(1-P_d) + 2(1-P_d)P_d + 3(1-P_d)P_d^2 + \cdots + l(1-P_d)P_d^{l-1} + \cdots = \frac{1}{1-P_d} \tag{2.3}$$

N_SR does not depend on the round-trip delay, which makes selective-repeat ARQ suitable for long-distance applications such as satellite communication. In selective-repeat ARQ, a receiver buffer is used to save all the incoming data, and a complex mechanism is needed to reorder the received data. In ARQ schemes, the error detection codes are easy to construct and incur only a minor energy cost. When error patterns occur that cannot be detected (known as decoder failure), errors are introduced into the system. The drawback of ARQ is the retransmission latency. In persistent noise environments, a large number of retransmissions are required, making ARQ unsuitable for high-performance applications.
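The closed forms in (2.2) and (2.3) can be checked numerically against their series. The following is a minimal sketch; the function names and the truncation length `terms` are illustrative assumptions and do not come from [49].

```python
def avg_tx_go_back_n(p_d: float, n: int) -> float:
    """Closed form of (2.2): 1 + N * Pd / (1 - Pd)."""
    return 1.0 + n * p_d / (1.0 - p_d)


def avg_tx_selective_repeat(p_d: float) -> float:
    """Closed form of (2.3): 1 / (1 - Pd)."""
    return 1.0 / (1.0 - p_d)


def series_go_back_n(p_d: float, n: int, terms: int = 200) -> float:
    """Truncated series of (2.2): sum over l of (l*N + 1) * (1 - Pd) * Pd**l."""
    return sum((l * n + 1) * (1.0 - p_d) * p_d ** l for l in range(terms))


def series_selective_repeat(p_d: float, terms: int = 200) -> float:
    """Truncated series of (2.3): sum over l of l * (1 - Pd) * Pd**(l - 1)."""
    return sum(l * (1.0 - p_d) * p_d ** (l - 1) for l in range(1, terms + 1))


if __name__ == "__main__":
    # Example values chosen only for illustration.
    for p_d in (0.01, 0.1, 0.5):
        print(p_d, avg_tx_go_back_n(p_d, n=4), avg_tx_selective_repeat(p_d))
```

For N = 1 the two expressions coincide; for larger round-trip delays the go-back-N penalty grows linearly with N, while the selective-repeat value is unchanged.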

2.6.2

Forward-Error Correction (FEC)

In FEC schemes, errors are corrected without any retransmission. Compared to ARQ schemes, FEC schemes require more complex error control codes, which increases the encoder and decoder overhead. However, FEC schemes allow for a much simpler communication protocol with a fixed throughput.


The error control codes used for FEC can be divided into two classes [50]: (1) block codes, in which the data is encoded and decoded block by block and each data block is processed independently, and (2) convolutional codes, in which the encoding process involves the current input data as well as previous input data. In on-chip communication, the data is usually transmitted in parallel. To apply convolutional codes, either an encoder/decoder is needed for each wire, or the data is encoded serially before it is transmitted in parallel across the link. Both approaches lead to a large area or latency overhead and are not suitable for on-chip interconnects. In this book, we focus on block codes, especially linear block codes, in which the sum of two codewords in a given code is also a codeword. The implementation of FEC in on-chip communication often has strict performance and cost requirements. Simple codes, such as single-error-correcting (SEC) codes and single-error-correcting, double-error-detecting (SEC-DED) codes, are widely used in previous work [34, 51]. The various types of error control codes used for on-chip interconnects are discussed further in Chap. 3.
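As an illustration of a linear SEC block code, the sketch below uses the standard Hamming(7,4) code, chosen here only as a familiar example; the text above does not prescribe a particular code. The function names and the bit ordering (parity bits at positions 1, 2, and 4) are assumptions of this sketch.

```python
def encode(data):
    """Encode 4 data bits into a 7-bit Hamming codeword (parity at positions 1, 2, 4)."""
    d1, d2, d3, d4 = data
    p1 = d1 ^ d2 ^ d4
    p2 = d1 ^ d3 ^ d4
    p4 = d2 ^ d3 ^ d4
    return [p1, p2, d1, p4, d2, d3, d4]


def syndrome(word):
    """Return 0 for a valid codeword, else the 1-based position of a single-bit error."""
    s1 = word[0] ^ word[2] ^ word[4] ^ word[6]
    s2 = word[1] ^ word[2] ^ word[5] ^ word[6]
    s4 = word[3] ^ word[4] ^ word[5] ^ word[6]
    return s1 + 2 * s2 + 4 * s4


def decode(word):
    """Correct at most one flipped bit and return the 4 data bits."""
    word = list(word)
    position = syndrome(word)
    if position:
        word[position - 1] ^= 1  # flip the erroneous bit back
    return [word[2], word[4], word[5], word[6]]


if __name__ == "__main__":
    a, b = encode([1, 0, 1, 1]), encode([0, 1, 1, 0])
    # Linearity: the bitwise sum (XOR) of two codewords has zero syndrome,
    # so it is itself a codeword.
    print(syndrome([x ^ y for x, y in zip(a, b)]))
    corrupted = list(a)
    corrupted[4] ^= 1  # inject a single-bit error
    print(decode(corrupted))
```

Because the code is linear, the syndrome depends only on the error pattern and not on the transmitted data, which is what keeps the decoder small.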

2.6.3

Hybrid ARQ (HARQ)

HARQ schemes combine the advantages of FEC and ARQ to increase performance [49, 52]. In HARQ schemes, the receiver corrects errors within the code’s error correction capability and requests retransmission when the errors are detectable but not correctable. The use of FEC in HARQ reduces the frequency of retransmission by correcting the error patterns that occur most frequently. Unlike error control with FEC alone, a retransmission is requested if an error pattern is detected but cannot be corrected. Unlike error control with ARQ alone, the FEC in HARQ can correct persistent noise errors. As a result, a proper combination of FEC and ARQ can provide higher reliability than error control with FEC alone and higher throughput than error control with ARQ alone. The simplest way to implement HARQ is to use an error control code capable of detecting and correcting errors simultaneously. When errors are detected in the received codeword, the receiver first tries to correct them. If the number of errors is within the error correcting capability of the code, the errors are corrected. If an uncorrectable error pattern is detected, the receiver discards the received word and requests a retransmission. The retransmission is the same codeword. This process continues until the codeword is successfully decoded. This type of HARQ is referred to as type-I HARQ [49]. Because the error control code used in type-I HARQ must correct and detect errors simultaneously, it requires more parity check bits than a code used in an ARQ scheme purely for error detection. Thus, the transmission overhead of type-I HARQ is larger than that of an error control scheme using only ARQ. In type-I HARQ, the same number of parity check bits is transmitted each time. This limits its performance when noise conditions vary with environmental factors.
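The type-I HARQ receiver decision described above can be sketched with a small SEC-DED code. The example below assumes an extended Hamming (8,4) code, which corrects single errors and detects double errors; the code choice, the function names, and the use of explicit error patterns in place of a noisy channel are all assumptions of this sketch, not a scheme taken from [49].

```python
def secded_encode(data):
    """Encode 4 data bits into an 8-bit extended Hamming (SEC-DED) codeword."""
    d1, d2, d3, d4 = data
    p1 = d1 ^ d2 ^ d4
    p2 = d1 ^ d3 ^ d4
    p4 = d2 ^ d3 ^ d4
    word = [p1, p2, d1, p4, d2, d3, d4]
    overall = 0
    for bit in word:
        overall ^= bit  # overall parity over the 7 Hamming bits
    return word + [overall]


def receive(word):
    """Receiver decision: ("ACK", data) if decodable, ("NACK", None) otherwise."""
    c = list(word[:7])
    s = ((c[0] ^ c[2] ^ c[4] ^ c[6])
         + 2 * (c[1] ^ c[2] ^ c[5] ^ c[6])
         + 4 * (c[3] ^ c[4] ^ c[5] ^ c[6]))
    parity = 0
    for bit in word:
        parity ^= bit
    if s != 0 and parity == 0:
        # Nonzero syndrome with even overall parity: a double error,
        # detectable but not correctable, so request a retransmission.
        return "NACK", None
    if s != 0:
        c[s - 1] ^= 1  # single error inside the Hamming part: correct it
    return "ACK", [c[2], c[4], c[5], c[6]]


def send(data, error_patterns):
    """Resend the same codeword until the receiver answers ACK."""
    codeword = secded_encode(data)
    for attempt, pattern in enumerate(error_patterns, start=1):
        received = [b ^ e for b, e in zip(codeword, pattern)]
        status, decoded = receive(received)
        if status == "ACK":
            return decoded, attempt
    return None, len(error_patterns)


if __name__ == "__main__":
    double_error = [1, 1, 0, 0, 0, 0, 0, 0]
    single_error = [0, 0, 0, 0, 1, 0, 0, 0]
    print(send([1, 0, 1, 1], [double_error, single_error]))
```

Three or more errors may be miscorrected by such a code; this corresponds to the decoder failure case mentioned for ARQ.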


Type-II HARQ was proposed to improve the performance by sending the parity check bits incrementally [49, 53]. In type-II HARQ, the codeword in the first transmission comprises the input data and a few parity bits for error correction. When the receiver detects uncorrectable errors in the received data, it saves the erroneous data in a buffer and requests a retransmission. The retransmission is not the original data but a block of additional parity check bits formed from the input data. When this block of additional parity check bits is received, it is used to correct the errors in the data stored in the receiver buffer. Each retransmission in type-II HARQ is coded differently. Because only the minimal required redundancy is transmitted each time, type-II HARQ achieves better throughput than type-I HARQ when the noise condition varies with the environment.

2.7

Spare Wires

Most permanent errors can be detected during manufacturing test; however, some occur at run time (e.g., due to electromigration or aging). Permanent errors can greatly reduce the correction capability of the commonly used error control codes, making these codes unsuitable for handling permanent errors. An efficient approach to this problem is to replace permanently erroneous wires with spare wires. In [54], the use of spare wires to improve on-chip interconnect manufacturing yield is analyzed, and a configurable scheme using crossbar switches to remap erroneous wires is proposed. In [55], an in-line test (ILT) method combined with configurable remapping of erroneous links using spare wires is proposed to detect and correct permanent errors. In the ILT method, each pair of adjacent wires in the link is periodically tested for opens and shorts. A reconfiguration unit remaps data from the pair of adjacent wires under test to a set of available spare wires. The ILT method allows the link to be tested for permanent errors without interrupting data transmission. Once permanent errors are detected, the same reconfiguration unit is used to bypass the erroneous wires using spare wires. To reduce the complexity of the reconfiguration unit, the remapping process ripples through the bus instead of remapping erroneous wires directly to the spare wires.
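One way to picture the rippling remap is that each logical bit shifts past faulty wires to the next healthy neighbor, so no bit ever moves farther than the number of spares. The sketch below illustrates that idea only; the circuit details of [55] are not given in the text, and the function name `ripple_remap` and its arguments are hypothetical.

```python
def ripple_remap(num_data_bits, num_spares, faulty_wires):
    """Map logical bit i to a physical wire, skipping faulty wires in order.

    Physical wires are numbered 0 .. num_data_bits + num_spares - 1, with the
    spare wires assumed to sit at the end of the bus.
    """
    total_wires = num_data_bits + num_spares
    healthy = [w for w in range(total_wires) if w not in faulty_wires]
    if len(healthy) < num_data_bits:
        raise ValueError("not enough healthy wires for the data width")
    return healthy[:num_data_bits]


if __name__ == "__main__":
    # 4 data bits, 1 spare, wire 1 is permanently faulty:
    # bits 1..3 each shift by one wire instead of bit 1 jumping to the spare.
    print(ripple_remap(4, 1, {1}))
```

In this model each bit needs to select among at most `num_spares + 1` adjacent wires, which is the kind of saving in reconfiguration logic that rippling is meant to provide.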

References

1. Agarwal K, Sylvester D, Blaauw D (2006) Modeling and analysis of crosstalk noise in coupled RLC interconnects. IEEE Trans Computer-Aided Des Integr Circuits Syst 5:892–901 2. Massoud Y (2002) Managing on-chip inductive effects. IEEE Trans Very Large Scale Integr (VLSI) Syst 6:789–798


3. Huang X, Cao Y, Sylvester D et al (2000) RLC signal integrity analysis of high-speed global interconnects. In: Proceedings of IEEE international electron devices meeting (IEDM), pp 731–743 4. Zhang J, Friedman GE (2004) Effect of shield insertion on reducing crosstalk noise between coupled interconnects. In: Proceedings of IEEE international symposium on circuits and system (ISCAS), pp 529–532 5. Lepak MK, Xu M, Chen J, He L (2004) Simultaneous shielding insertion and net ordering for capacitive and inductive coupling minimization. ACM Trans Des Autom Electron Syst 3:290–309 6. Kaul H, Sylvester D, Blauw D (2002) Active shields: a new approach to shielding global wires. In: Proceedings of IEEE/ACM great lakes symposium on VLSI (GLSVLSI), pp 112–117 7. Kaul H, Sylvester D, Blaauw D (2004) Performance optimization of critical nets through active shielding. IEEE Trans Circuits Syst I Reg Papers 12:2417–2435 8. Adler V, Friedman GE (2000) Uniform repeater insertion in RC trees. IEEE Trans Circuits Syst I Fund Theor Appl 10:1515–1523 9. Alpert JC, Devgan A, Quay TS (1999) Buffer insertion for noise and delay optimization. IEEE Trans Computer-Aided Des Integr Circuits Syst 11:1633–1645 10. Kahng B A, Muddu S, Sarto E, Sharma R (1998) Interconnect tuning strategies for highperformance ICs. In: Proceedings of the design, automation and test in Europe (DATE), pp 471–478 11. Ghoneima M, Ismail Y (2005) Optimum positioning of interleaved repeaters in bidirectional buses. IEEE Trans Comput-Aided Des Integr Circuits Syst 3:461–469 12. Akl JC, Bayoumi AM (2008) Reducing interconnect delay uncertainty via hybrid polarity repeater insertion. IEEE Trans Very Large Scale Integr (VLSI) Syst 9:1230–1239 13. Kaul H, Seo J S, Anders M, Sylvester D, Krishnamurthy R (2008) A robust alternate repeater technique for high performance busses in the multi-core era. In: Proceedings of the IEEE international symposium on circuits and systems (ISCAS), pp 372–375 14. 
Sridhara RS, Ahmed A, Shanbhag RN (2004) Area and energy-efficient crosstalk avoidance codes for on-chip busses. In: Proceedings of international conference on computer design (ICCD), pp 12–17 15. Duan C, Tirumala A, Khatri PS (2001) Analysis and avoidance of crosstalk in on-chip buses. In: Proceedings of hot interconnects, pp 133–138 16. Victor B, Keutzer K (2001) Bus encoding to prevent crosstalk delay. In: Proceedings of IEEE/ACM international conference on computer-aided design (ICCAD), pp 57–63 17. Patel NK, Markov L (2004) Error-correction and crosstalk avoidance in DSM busses. IEEE Trans Very Large Scale Integr (VLSI) Syst 10:1076–1080 18. Sridhara RS, Shanbhag RN (2007) Coding for reliable on-chip buses: a class of fundamental bounds and practical codes. IEEE Trans Computer-Aided Des Integr Circuits Syst 5:977–982 19. Hirose K, Yassura H (2000) A bus delay reduction technique considering crosstalk. In: Proceedings of the design, automation and test in Europe (DATE), pp 441–445 20. Nose K, Sakurai T (2001) Two schemes to reduce interconnect delay in bi-directional and unidirectional buses. In: Proceedings of VLSI symposium, pp 193–194 21. Ghoneima M, Ismail Y (2004) Effect of relative delay on the dissipated energy in coupled interconnects. In: Proceedings of IEEE international symposium on circuits and systems (ISCAS), pp 525–528 22. Kim WK, Jung OS, Kim T (2003) Coupling delay optimization by temporal decorrelation using dual threshold voltage technology. IEEE Trans Very Large Scale Integr (VLSI) Syst 5:879–887 23. Ghoneima M, Ismail IY, Khellah MM, Tschanz WJ, De V (2006) Reducing the effective coupling capacitance in buses using threshold voltage adjustment techniques. IEEE Trans Circuits Syst I Fund Theor Appl 9:1928–1933 24. Nieuwland KA, Katoch A, Meijer M (2004) Reducing cross-talk induced power consumption and delay. In: Proceedings of international workshop on power and timing modeling optimization and simulation (PATMOS), pp 179–188


25. Ghoneima M et al (2006) Skewed repeater bus: a low power scheme for on-chip bus. IEEE Trans Circuits Syst I Fund Theor Appl 7:1904–19106 26. Zhao S, Roy K, Koh KC (2002) Decoupling capacitance allocation and its application to power supply noise aware floorplanning. IEEE Trans Computer-Aided Des Integr Circuits Syst 1:8–92 27. Su H, Sapatnekar SS, Nassif RS (2003) Optimal decoupling capacitor sizing and placement for standard cell layout designs. IEEE Trans Computer-Aided Des Integr Circuits Syst 4:428–436 28. Popovich M, Sotman M, Kolodny A, Friedman GE (2008) Effective radii of on-chip decoupling capacitors. IEEE Trans Very Large Scale Integr (VLSI) Syst 7:894–907 29. Sotiriadis P, Chandrakasan A (2000) Reducing bus delay in sub-micron technology using coding. In: Proceedings of the IEEE Asia and South Pacific design automation conference (ASPDAC), pp 109–114 30. Li L et al. (2004) A crosstalk aware interconnect with variable cycle transmission. In: Proceedings of the design, automation and test in Europe (DATE), pp 102–107 31. Duan C, Cordero Calle HV, Khatri PS (2009) Efficient on-chip crosstalk avoidance DODEC design. IEEE Trans Very Large Scale Integr (VLSI) Syst 4:551–560 32. Li L, Vijaykrishnan N, Kandemir M, Irwin JM (2003) Adaptive error protection for energy efficiency. In: Proceedings of IEEE/ACM international conference on computer-aided design (ICCAD), pp 2–7 33. Bertozzi D, Benini L (2004) Xpipes: a network-on-chip architecture for gigascale systems-onchip. IEEE Circuits Syst Mag 4:18–31 34. Sridhara S, Shanbhag RN (2005) Coding for system-on-chip networks: a unified framework. IEEE Trans Very Large Scale Integr (VLSI) Syst 6:655–667 35. Murali S, Theocharides T, Vijaykrishnan N, Irwin JM, Benini L, De Micheli G (2005) Analysis of error recovery schemes for networks-on-chips. IEEE Des Test Comput 5:434–442 36. Rossi D, Nieuwland KA, Katoch A, Metra C (2005) Exploiting ECC redundancy to minimize crosstalk impact. IEEE Des Test Comput 1:59–70 37. 
Worm F, Ienne P, Thiran P, Micheli DG (2005) A robust self-calibrating transmission scheme for on-chip networks. IEEE Trans Very Large Scale Integr (VLSI) Syst 1:126–139 38. Komatsu S, Fujita M (2005) Low power and fault tolerant encoding methods for on-chip data transfer in practical applications. IEICE Trans Fund E88-A(12):3282–3289 39. Pande PP, et al. (2006) Design of low power & reliable networks on chip through joint crosstalk avoidance and forward error correction coding. In: Proceedings of IEEE international symposium on defect and fault tolerance in VLSI systems (DFT), pp 466–476 40. Ejlali A, Al-Hashimi MB, Rosinger P, Miremadi GS (2007) Joint consideration of faulttolerance, energy-efficiency and performance in on-chip networks. In: Proceedings of the design, automation and test in Europe (DATE), pp 1–6 41. Rossi D, Angelini P, Metra C (2007) Configurable error control scheme for NoC signal integrity. In: Proceedings of international on line testing symposium (IOLTS), pp 43–48 42. Yu Q, Ampadu P (2008) Adaptive error control for NoC switch-to-switch links in a variable noise environment. In: Proceedings of IEEE international symposium on defect and fault tolerance in VLSI system (DFT), pp 352–360 43. Ganguly A, Pande PP, Belzer B, Grecu C (2008) Design of low power & reliable networks on chip through joint crosstalk avoidance and multiple error correction coding. J Electron Test Theor Appl (JETTA), Special Issue on Defect and Fault Tolerance, 67–81 44. Lehtonen T, Liljeberg P, Plosila J (2007) Analysis of forward error correction methods for nanoscale networks-on-chip. In: Proceedings of 2nd international conference on nano-networks (Nano-Net), pp 1–5 45. Rossi D, Nieuwland KA, Dijk SVE, Kleihorst PR, Metra C (2008) Power consumption of fault tolerant busses. IEEE Trans Very Large Scale Integr (VLSI) Syst 5:542–553 46. Yu Q, Ampadu P (2009) Adaptive error control for nanometer scale NoC links. 
IET Comput Digit Tech 6:643–659, Special issue on advances in nanoelectronics circuits and systems


47. Fu B, Ampadu P (2010) Error control combining Hamming and product codes for energy efficient nanoscale on-chip interconnects. IET Comput Digit Tech 3:251–261 48. Fu B, Ampadu P (2009) A dual-mode hybrid ARQ scheme for energy efficient on-chip interconnects. In: Springer lecture notes of the institute for computer sciences, social-informatics and telecommunications engineering – 3rd international ICST conference NanoNet 2008, revised selected papers, pp 74–79 49. Lin S et al (1984) Automatic-repeat-request error-control schemes. IEEE Commun Mag 12:5–17 50. Lin S, Costello JD (2004) Error control coding, 2nd edn. Prentice Hall, Upper Saddle River 51. Bertozzi D, Benini L, De Micheli G (2005) Error control schemes for on-chip communication links: the energy-reliability tradeoff. IEEE Trans Computer-Aided Des Integr Circuits Syst 6:818–831 52. Benice JR, Frey HA (1964) An analysis of retransmission systems. IEEE Trans Commun Technol 4:135–145 53. Metzner JJ (1979) Improvements in block-retransmission schemes. IEEE Trans Commun 2:524–532 54. Grecu C, Ivanov A, Saleh R, Pande PP (2006) NoC interconnect yield improvement using crosspoint redundancy. In: Proceedings of IEEE international symposium on defect and fault tolerance in VLSI systems (DFT), pp 457–465 55. Lehtonen T, Wolpert D, Liljeberg P, Plosila J, Ampadu P (2010) Self-adaptive system for addressing permanent errors in on-chip interconnects. IEEE Trans Very Large Scale Integr (VLSI) Syst 4:527–540

Chapter 3

Networks-on-Chip (NoC)

The move to many-core systems is expected to become the dominant trend in the near future. With technology scaling into the nanoscale regime, hundreds and even thousands of intellectual property (IP) cores can be integrated into a single chip. How to provide efficient and reliable communication between these IP cores becomes a big problem. Conventional bus-based infrastructures are no longer sufficient to handle intensive on-chip communication. Network-on-chip (NoC) is emerging as an efficient solution to the worsening scalability and bandwidth issues of on-chip communication by replacing traditional bus structures with a packet-switched network. This chapter introduces the common NoC architectures and the reliability issues faced in NoC design.

3.1

Bus Based On-Chip Communication

Bus-based infrastructure is the most frequently used traditional on-chip communication architecture. Figure 3.1 shows the architecture of the IBM Cell processor [1], in which a bus-based on-chip communication architecture provides data communication between eight special-purpose processing units (SPU) and a single general-purpose 64-bit Power processor. In a bus-based communication architecture, all IPs are connected to the same transmission medium, the bus. The IPs connected to a bus can be separated into masters, which can initiate read or write data transfers, and slaves, which respond to the requests from master IPs. If multiple masters want to access the shared bus at the same time, a bus arbiter determines which master is granted access. The advantages of the bus architecture are its simplicity and low area cost. The bandwidth of the bus architecture is low because only one master can access the shared bus at any time. There are many bus-based SoC interconnect specifications. Three of them are widely used in industry: ARM Advanced Microcontroller Bus Architecture (AMBA) versions 2.0 [2] and 3.0 [3], IBM CoreConnect [4], and OpenCores Wishbone [5].

B. Fu and P. Ampadu, Error Control for Network-on-Chip Links, DOI 10.1007/978-1-4419-9313-7_3, © Springer Science+Business Media, LLC 2012


Fig. 3.1 IBM cell processor architecture [1]

The bus architecture faces several design challenges as technology scales into the nanoscale regime. The capacitive load of a shared bus greatly increases as more IPs are connected to it. Also, the bus length must increase as the number of integrated IPs increases. These two factors increase the propagation delay of the on-chip bus and limit the number of IPs that can be connected to the shared bus. Hierarchical bus architecture was proposed to solve this problem by splitting a long bus into several segments. Figure 3.2 shows a hierarchical bus architecture using the AMBA protocol. In a hierarchical bus architecture, bus bridges connect the different bus segments. As hundreds of IPs are integrated into a single chip, the hierarchical bus architecture becomes complex and still faces the intrinsic bandwidth limitation caused by multiple IP cores sharing the same transmission medium. The conventional bus architecture becomes a bottleneck in throughput and scalability. A more efficient on-chip communication architecture is needed to meet the high throughput requirements of large-scale SoC designs.
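The arbitration step described in this section, in which a bus arbiter decides which requesting master is granted the shared bus, can be sketched as follows. The text does not specify an arbitration policy; round-robin is assumed here for illustration, and the class and method names are hypothetical rather than taken from AMBA, CoreConnect, or Wishbone.

```python
class RoundRobinArbiter:
    """Grant the shared bus to one requesting master per cycle, in rotating order."""

    def __init__(self, num_masters):
        self.num_masters = num_masters
        # Start so that master 0 has the highest priority in the first cycle.
        self.last_granted = num_masters - 1

    def grant(self, requests):
        """Return the index of the granted master, or None if nobody requests."""
        for offset in range(1, self.num_masters + 1):
            master = (self.last_granted + offset) % self.num_masters
            if requests[master]:
                self.last_granted = master
                return master
        return None


if __name__ == "__main__":
    arbiter = RoundRobinArbiter(3)
    # All three masters request the bus in every cycle; only one is served
    # per cycle, which is the bandwidth limitation of a shared bus.
    print([arbiter.grant([True, True, True]) for _ in range(4)])
```

However the policy is chosen, exactly one transfer proceeds at a time, so the bandwidth available to each master shrinks as more masters are attached.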

3.2

NoC Design

NoC architectures are developed to address the complex on-chip communication of large-scale SoCs. In this section, we introduce the common components and different topologies of NoC systems. The switching techniques and router design used in NoC systems are also discussed.


Fig. 3.2 A hierarchical bus architecture using AMBA protocol [6]

Fig. 3.3 A NoC based 48-core IA-32 processor from Intel [7]

3.2.1

NoC Architectures

In a NoC system, the traditional shared-bus structure is replaced with a packet-switched communication network. Figure 3.3 shows the architecture of a NoC-based processor with 48 IA-32 cores from Intel [7, 8]. A NoC system typically consists of IP cores, network interfaces (NI), routers, and global links. IP cores are connected to the on-chip network via NIs. The communication between different IP cores takes place in the form of packets. The function of the NI is to packetize and depacketize the data. In the NI, the input data injected by the IP core is separated into small packets, and extra information used to identify and track these packets is added to each packet.


The NI is also used to establish the connection between the source IP and the destination IP. The router routes each packet to the correct destination according to a specified routing algorithm, and plays an important role in the performance of a NoC architecture. Routers are connected using high-performance global links, which include data and control wires for communication. For the processor shown in Fig. 3.3, each tile, which consists of a dual IA-32 core, is connected by a 2D-mesh on-chip network. A mesh interface unit (MIU) in each tile is used to packetize/depacketize data into/from the mesh network. NoC systems provide better performance and scalability than bus-based systems. The use of multiple concurrent connections results in much higher bandwidth in NoC systems than in conventional bus structures. Decoupling the computation IPs from the communication network reduces the system design complexity by allowing each part to be optimized separately. Component reuse is easy in NoC systems because of the standardized network interface. NoCs are also very scalable; more resources can be connected to the on-chip network by adding new routers. NoCs can be organized in different topologies, which define how the routers and computation IP cores are connected [9]. Figure 3.4 shows the common NoC topologies, including fat tree, butterfly fat tree (BFT), mesh, octagon, torus, and folded torus. Among these topologies, the mesh topology has received the most attention because of its regular architecture and simplicity. A wide variety of NoC prototypes have been developed recently. Most of these systems use a mesh topology, such as Nostrum [10], Tile64 [11], TRIPS [12], Teraflops [13], and SCC [14]. In the mesh topology, except for the routers on the network boundaries, each router has five active ports: one is connected to the local IP core, while the other four are connected to the neighboring routers.
In the mesh topology, the number of IP cores is equal to the number of routers. The disadvantage of the mesh topology is that a large-scale mesh-based NoC has long communication latency between IP cores on two opposite edges of the network. The torus topology was proposed to reduce this communication latency by directly connecting the routers on each edge to the routers on the opposite edge of the network, as shown in Fig. 3.4e. In the torus topology, every router has the same structure with five active ports. The long wraparound interconnects in the torus topology can still be a bottleneck compared to the other links of the network. This issue can be avoided by using a folded torus, as shown in Fig. 3.4f.

Fig. 3.4 Common NoC topologies

NoC protocols are typically organized in layers [15, 16]. Similar to the Open Systems Interconnection (OSI) model used in the Internet, NoC implementations can be separated into seven layers: physical layer, data link layer, network layer, transport layer, session layer, presentation layer, and application layer, as shown in Fig. 3.5. Among these layers, the physical layer handles the physical implementation of the transmission medium (i.e., wires). Different signaling techniques, such as differential signaling, low-swing voltage signaling, pulse-based signaling, and current-mode signaling, are applied in this layer. The data link layer provides reliable data transmission even when the physical links are unreliable. Error control coding is a common technique applied in this layer. The network layer establishes the data transmission path according to different switching and routing algorithms. Congestion control is also employed in the network layer to balance the traffic load over the network. In the transport layer, the data is segmented into packets at the source and reassembled at the destination. The transport layer needs to guarantee that these packets are transmitted and received in order. The selection of the packet size is an important problem in the transport layer. The session layer, presentation layer, and application layer provide an abstraction of the hardware implementation to the system software and users. For example, the session layer is used to synchronize the data communication of a multi-core system running parallel programs. The layered structure simplifies NoC design by hiding the implementation details of the lower layers from the higher layers.

Fig. 3.5 Layered structure in NoC design
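The latency argument for mesh versus torus topologies made earlier in this section can be illustrated by counting router-to-router hops. This is a minimal sketch under the assumption of minimal (shortest-path) routing on a two-dimensional grid; the function names and the coordinate convention are illustrative.

```python
def mesh_hops(src, dst):
    """Minimal hop count between two routers in a 2D mesh (Manhattan distance)."""
    return abs(src[0] - dst[0]) + abs(src[1] - dst[1])


def torus_hops(src, dst, width, height):
    """Minimal hop count in a 2D torus, where wraparound links join opposite edges."""
    dx = abs(src[0] - dst[0])
    dy = abs(src[1] - dst[1])
    # In each dimension, take the shorter of the direct and wraparound routes.
    return min(dx, width - dx) + min(dy, height - dy)


if __name__ == "__main__":
    # Opposite corners of an 8 x 8 network, chosen only as an example.
    corner_a, corner_b = (0, 0), (7, 7)
    print(mesh_hops(corner_a, corner_b))
    print(torus_hops(corner_a, corner_b, width=8, height=8))
```

The hop count ignores wire length, so it does not capture the cost of the long wraparound links that motivates the folded torus.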

3.2.2

Routing and Switching Techniques

Routing and switching are applied in the network layer of a NoC architecture. Routing determines the data transfer path used to move data packets through the network to their destinations. Switching decides when and how the data packets are actually transferred through the routers. Routing algorithms can be classified into static routing [17–19] and dynamic routing [19–21]. Static routing, also known as deterministic routing, always provides a fixed path between a particular source and destination pair. In static routing algorithms, the routing paths are predetermined; the routing scheme does not use information such as the current network traffic load and link status to make routing decisions. XY routing is a commonly used static routing algorithm. In XY routing, the data is first routed in the horizontal direction (X direction) until it reaches the network node that has the same X coordinate as the destination node. Then, the data is routed in the vertical direction (Y direction) until it reaches the destination node. The advantage of static routing is that it is easy to implement with a small area cost. If only a single routing path is used, static routing can guarantee in-order data packet delivery. In-order data packet delivery simplifies the NI design in the destination node.

Dynamic routing is also known as adaptive routing. In dynamic routing, the routing decision is made based on the current network conditions, such as traffic load and error rate. The routing path between a particular source and destination pair can change over time as the network conditions change. Compared to static routing, dynamic routing avoids network congestion and link errors more effectively by continuously monitoring the network conditions. The flexibility of dynamic routing comes at additional hardware cost, such as monitoring circuits and more complex routing control circuits. A large number of dynamic routing algorithms have been proposed recently, such as minimal adaptive [20], fully adaptive [20], congestion look-ahead [22], odd–even [18], slack time aware [23], west-first, north-last, and negative-first.

The purpose of a routing algorithm is to ensure that every data packet correctly reaches its destination, no matter which routing algorithm is selected. To achieve this goal, the routing algorithm must guarantee that no livelock or deadlock occurs in the routing path. Livelock is a condition in which a data packet is moved between routers over and over again and never reaches its destination. Cycles exist in the routing path when livelock happens. Livelock can be avoided by monitoring the distance of the data packet from its destination at each router and selecting only routing paths that reduce this distance. This approach ensures that the packet reaches its destination after a finite number of steps. Deadlock is a condition in which one or more packets in the network cannot move starting from a time t. These packets will be blocked indefinitely regardless of the routing algorithm selected. Deadlock happens when a circular dependence between different packets exists. Figure 3.6 shows an example of deadlock.

Fig. 3.6 An example of deadlock in NoC routing


In this example, each packet holds its own buffer resource but at the same time requests the buffer resource held by another packet. For example, packet 1 is transmitted from router A to B. At the same time, packet 1 requests the resources held by packet 2. Packet 2 is transmitted from router B to C and requests resources held by packet 3. A similar scenario applies to packet 3 and packet 4. A dependency loop is thus formed. Deadlock can be avoided by adding restrictions to the routing algorithm (e.g., forbidding certain turns).

There are two major switching techniques in NoCs, namely circuit switching [24] and packet switching [21, 25]. In circuit switching, a physical path, composed of a series of links and routers, is reserved from source to destination before the data transmission starts. The physical path is held for the entire duration of the data transmission. The hardware resources reserved for the physical path are released when the last data is transmitted. The advantage of circuit switching is that the whole link bandwidth is available once the physical path has been established. However, circuit switching has a large setup latency before the data can be transmitted. Also, other routing paths cannot use these link and router resources during the data transmission, which wastes valuable hardware resources. Virtual circuit switching can be applied to improve the hardware efficiency. In virtual circuit switching, several virtual links (also known as virtual channels) share a single physical link. Extra buffer resources or time-division multiplexing is needed to support virtual circuit switching.

In packet switching, data is divided into fixed-length blocks called packets. Instead of establishing a physical path before data transmission, the source sends data whenever it is available. There are three types of packet switching techniques: store-and-forward, virtual cut-through, and wormhole. Store-and-forward and virtual cut-through require the receiving switch to be able to store the whole packet before the data starts to transmit. Therefore, these approaches require a large buffer size, which increases the silicon area and the power consumption. In wormhole switching, packets are divided into flow control units (flits). The first flit of the packet, called the header, carries the routing information and is used to reserve resources. The following flits in the packet simply follow this path in a pipelined fashion. For example, if the header flit of a packet passes through a link, no other packet can use this link until the tail flit of this packet has passed through the link. In wormhole switching, a flit of a given packet that is blocked in a buffer decreases channel utilization. To achieve a high throughput, a set of virtual channels can be used. If a flit belonging to a particular packet is blocked in one of the virtual channels, flits of other packets can use the other virtual channel buffers. The header flit can also be used to reserve a dedicated flit buffer in a router with multiple virtual channels (VCs). In wormhole switching, only a few flits need to be stored at each switching element. As a result, the buffer space requirement in the switches can be small compared to that required for other packet switching schemes.
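The XY routing rule described earlier in this section (route along X until the X coordinate matches the destination, then along Y) can be sketched as follows. Router coordinates are assumed to be `(x, y)` pairs on a 2D mesh; the function name `xy_route` is illustrative.

```python
def xy_route(src, dst):
    """Return the list of routers visited from src to dst under XY routing."""
    x, y = src
    path = [(x, y)]
    # Phase 1: move horizontally until the X coordinate matches the destination.
    while x != dst[0]:
        x += 1 if dst[0] > x else -1
        path.append((x, y))
    # Phase 2: move vertically until the destination is reached.
    while y != dst[1]:
        y += 1 if dst[1] > y else -1
        path.append((x, y))
    return path


if __name__ == "__main__":
    print(xy_route((0, 0), (2, 1)))
```

Because the path depends only on the source and destination, every packet of a given pair follows the same route, which is what makes in-order delivery possible. Each step also reduces the distance to the destination, so the route terminates after a finite number of hops.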


Fig. 3.7 Block-level representation of a five-port router

3.2.3 Router Design

The router design plays an important role in determining the performance of NoC systems. In order to achieve an efficient router design, the router type, the size of the FIFO buffer, and the switching technique must be carefully selected. The structure of the routers in a NoC depends on the network topology. The size of the FIFO inside the router is very important because it directly impacts the router delay, packet loss, and power consumption. The router design also depends on the routing scheme adopted. If deterministic routing schemes are adopted, the router can be designed to be fast and compact. If dynamic routing schemes are applied, the circuit that defines and controls the routing path introduces extra area overhead. Figure 3.7 shows a block-level architecture of a five-port router. It mainly consists of input/output FIFO buffers, a crossbar connecting input ports to output ports, and a routing control unit [26]. Input ports from each side feed data into a corresponding buffer. Each output port selects data from the processor core and three input buffers under the control of the routing control unit. In order to reduce communication latency in NoCs, virtual channels are widely applied in router design [27, 28]. By introducing virtual channels in the input and output ports, the channel utilization is greatly increased. In a router with virtual channels, if a flit belonging to one packet is using one of the virtual channels, flits of other packets can use the other virtual channel buffers and, ultimately, the physical channel. The architecture of a router having virtual channels is shown in Fig. 3.8.


Fig. 3.8 Architecture of a router having virtual channels [28]

A router with virtual channels usually comprises a number of different components, such as a header decoder, virtual channels, a crossbar switch, and a routing unit. The header decoder receives the packets, determines their destination addresses, and holds the packets until they are routed. A multiplexer and a de-multiplexer are used to manage virtual channel operations. The virtual channel arbiter selects the appropriate virtual channel. The crossbar switch connects each input channel to each unoccupied output channel. The routing computation unit implements the routing algorithm and controls the crossbar switch. Virtual channels increase the communication throughput, but they also increase the complexity of the router design.

3.3 Reliability in NoC Links

Reliability is an important issue in NoC design. For example, errors in the header of a packet may lead to deadlock and block communication of an NoC. Error control techniques can be applied at different NoC layers [29]. In the physical layer, spare wires [30] can be used to bypass defective wires, which can be detected during manufacturing tests or self-test at start-up (or at run-time). Different wiring and signaling techniques (e.g., shielding and differential signaling) can also be applied at the physical layer to improve reliability of data transmission. The data link layer is used to provide reliable data transmission across an unreliable physical link. Error control coding is the most popular fault tolerant approach at the data link layer [29, 31, 32]. Adaptive routing algorithms [33] can improve NoC reliability at the network


Fig. 3.9 Implementation of error control coding in data link layer of a NoC

layer by rerouting data around defective links. Flooding algorithms [34] are another fault tolerant technique used at the network layer; they involve redundant transmission of data over multiple paths to improve the reliability of network transmissions. Error control coding can also be used at the network layer by performing end-to-end error protection [32], in which the data is encoded at the source and decoded at the destination, with no encoding or decoding of the data at intermediate hops. In this book, we mainly focus on applying error control coding at the data link layer. At the data link layer, encoder and decoder circuits are integrated into the NoC routers, as shown in Fig. 3.9. The error control schemes are incorporated into the output and input link controllers: the output link controller includes the encoder, and the input link controller includes the corresponding decoder. The incoming flit is encoded in the encoder and transmitted through the link. The decoder on the receiver side detects or corrects errors. Registers are inserted between the encoder and the link, and also between the link and the decoder, to allow pipelined operation. If ARQ or HARQ schemes are applied, retransmission is invoked using the ACK/NACK signal. A large amount of research has been done on applying error control coding at the data link layer. In [32], the impact of various error control schemes on energy efficiency, error protection capability, and performance is investigated in an NoC design environment. The error control schemes implemented in this work can be applied to the network or data link layers. Two different error detection and retransmission schemes are used. In the end-to-end error detection scheme, a retransmission occurs between the sender and receiver (which may require multiple switch-to-switch hops). In the switch-to-switch error detection scheme, a retransmission occurs between adjacent switches. This work also implements type-I


HARQ scheme by correcting any single error on a flit and requesting a retransmission when the error can be detected but not corrected. Single parity check codes, cyclic redundancy check (CRC) codes and Hamming codes are examined. This work provides a large amount of useful information for selecting an appropriate error control scheme for a given application. In [35], a methodology using error control coding to trade off energy and reliability of on-chip interconnects is proposed. In this method, error control codes are used to meet predefined reliability requirements, and the energy consumption of the interconnects is reduced by lowering the link swing voltage. The paper shows that sufficient signal integrity of on-chip interconnects can still be achieved at low voltage swing once an appropriate error control scheme is applied. Hamming codes and duplicate-add-parity (DAP) codes are used to correct single errors. This work also provides a framework to combine error control codes with crosstalk avoidance codes, which will be discussed in detail in Chap. 6. In [36], the authors compare the energy behavior of different error detection and correction schemes to find the most efficient error control technique from an energy viewpoint. FEC and ARQ are considered in this work. The comparison is performed in a realistic SoC system and includes Hamming codes, extended Hamming codes, parity check codes, and CRC codes. The impact of applying Hamming codes on bus power consumption in nanoscale technology is analyzed in [37, 38]. Bus wire parameters (coupling capacitances, drivers, repeaters and receivers) and encoding and decoding circuits are considered in this work. The analysis shows that all different equivalent Hamming codes (with the same redundancy) have identical energy consumption. This work also shows that DAP codes with a proper bus layout can consume less power than Hamming codes with the same bus footprint for a small bus.
A dynamic voltage swing approach with error detection has been proposed in [39]. In this work, the link swing voltage is dynamically scaled down to reduce the link energy consumption. Signal integrity is ensured by using error detection codes (CRC codes) and retransmission. The operational link swing voltage is determined by sampling the retransmission rate. This research does not consider FEC or HARQ. In [40], the authors analyze the impact of ARQ, FEC and HARQ on the trade-off between performance, reliability and energy consumption in on-chip networks, for various voltage swings, noise powers, wire lengths (wire capacitances) and timing constraints. Unlike previous work, this work considers performance and reliability jointly and uses a new metric, performability. The authors conclude that HARQ consumes less energy than ARQ and FEC for a given performability constraint, except when short wires are used. In order to achieve energy efficiency, configurable error control schemes have been proposed that dynamically provide appropriate error control based on noise conditions or system requirements. In [41], an adaptive error detection method is proposed. In this work, parity check codes, Hamming codes and extended Hamming codes are dynamically selected as error detection codes based on the number of errors found over a fixed time window. In [42], the authors describe a configurable error control scheme, which can be configured into three different


operating modes (correction mode, detection mode and mixed mode) depending on the particular application. This configurable error control allows the system to meet different Quality of Service (QoS) levels. Hamming codes, extended Hamming codes (Hsiao codes), and two-bit symbol error correcting codes [43] are considered. In [44, 45], an adaptive error control scheme using configurable single-error-correcting (SEC) codes and interleaving is applied to address multiple adjacent errors. In [46], different error control schemes are dynamically selected according to the fault type. Hamming codes with interleaving combined with retransmission are used to handle transient errors, while split retransmission and spare wires are used to address intermittent and permanent errors. The research work described above has focused on tolerating a very limited number of simultaneous errors, commonly one or two. In nanoscale technology, the number of simultaneous and burst-type errors is likely to increase. Little research has concentrated on addressing multiple errors, especially a combination of multiple random errors and burst errors. In [44, 47], interleaving was used to improve error resilience against burst errors. In this method, a wide bus is split into smaller groups and each group is encoded using a SEC or single-error-correcting and double-error-detecting (SEC-DED) code. The outputs of these SEC or SEC-DED encoders are further interleaved to reduce the probability of multiple errors occurring within the same group. In [42], symbol correcting codes are applied to correct a two-bit burst error in a symbol. These two approaches focus only on burst errors; for a combination of random errors and burst errors, they lose their effectiveness. In [48, 49], a multiple-error correction code is constructed by combining Hamming codes with DAP codes.
In this method, the outputs of a Hamming encoder are duplicated and an extra parity bit generated from the original Hamming code is added. This method can correct up to three-bit errors but requires a large number of additional interconnect wires (e.g., 143 for a 64-bit input message). The large number of wires greatly increases the link energy consumption and link area overhead. In [50], Bose-Chaudhuri-Hocquenghem (BCH) codes and Reed-Solomon (RS) codes are applied to network-on-chip (NoC) links. BCH and RS codes can correct multiple errors, but the field operations greatly increase the codec area, power consumption and decoding time.
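The effect of interleaving on burst errors, as used in the grouped SEC schemes discussed above, can be illustrated with a minimal sketch. The function names, the choice of four groups of seven bits, and the position of the burst are assumptions made for illustration; they do not reproduce the exact wire assignment of any cited design.

```python
from typing import List


def interleave(groups: List[List[int]]) -> List[int]:
    """Place bit j of every group on neighbouring wires of the link."""
    return [g[j] for j in range(len(groups[0])) for g in groups]


def deinterleave(wires: List[int], num_groups: int) -> List[List[int]]:
    """Recover the per-group codewords from the interleaved link word."""
    return [wires[i::num_groups] for i in range(num_groups)]


# Four hypothetical 7-bit codewords (all-zero so that errors show up as ones).
groups = [[0] * 7 for _ in range(4)]
link = interleave(groups)

# A burst error flips four physically adjacent wires.
for w in range(9, 13):
    link[w] ^= 1

received = deinterleave(link, num_groups=4)
errors_per_group = [sum(g) for g in received]
print(errors_per_group)
```

Because adjacent wires belong to different groups, the four-bit burst leaves exactly one error in each group, which a single-error-correcting code per group can handle.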

References

1. Gschwind M, Hofstee H, Flachs B et al (2006) Synergistic processing in Cell's multicore architecture. IEEE Micro 26:10–24
2. ARM AMBA specification and multilayer AHB specification (rev2.0). http://www.arm.com
3. ARM AMBA 3.0 AXI specification. http://www.arm.com/armtech/AXI
4. IBM CoreConnect specification. http://www.ibm.com/chips/techlib/techlib.nsf/productfamilies/CoreConnect_Bus_Architecture
5. Wishbone specification. http://www.opencores.org/wishbone
6. Pasricha S, Dutt N, Ben-Romdhane M (2004) Fast exploration of bus-based on-chip communication architectures. In: International conference on hardware/software codesign and system synthesis (CODES_ISSS), pp 242–247
7. Mattson GT et al (2010) The 48-core SCC processor: the programmer's view. In: Proceedings of ACM/IEEE conference on supercomputing (SC), pp 1–11


8. Howard J et al (2011) A 48-core IA-32 processor in 45 nm CMOS using on-die message-passing and DVFS for performance and power scaling. IEEE J Solid-State Circuits 46:173–183
9. Pande PP, Grecu C, Ivanov A, Saleh R (2005) Performance evaluation and design trade-offs for network-on-chip interconnect architectures. IEEE Trans Comput 54:1025–1040
10. Millberg M, Nilsson E, Thid R, Jantsch A (2004) Guaranteed bandwidth using looped containers in temporally disjoint networks within the nostrum network on chip. In: Proceedings of the design, automation and test in Europe conference and exhibition (DATE), pp 890–895
11. Wentzlaff D et al (2007) On-chip interconnection architecture of the tile processor. IEEE Micro 27:15–31
12. Gratz P, Kim C, Sankaralingam K, Hanson H et al (2007) On-chip interconnection networks of the TRIPS chip. IEEE Micro 27:41–50
13. Hoskote Y, Vangal S, Singh A, Borkar N, Borkar S (2007) A 5-GHz mesh interconnects for a teraflops processor. IEEE Micro 27:51–61
14. Ilitzky AD, Hoffman DJ, Chun A, Esparza PB (2007) Architecture of the scalable communications core's network on chip. IEEE Micro 27:62–74
15. Sgroi M et al (2001) Addressing the system-on-a-chip interconnect woes through communication-based design. In: Proceedings of 38th design automation conference (DAC), pp 667–672
16. Bolotin E, Cidon I, Ginosar R, Kolodny A (2004) Cost considerations in network on chip. Integr VLSI J 38:19–42
17. Dehyadgari M, Nickray M, Afzali-kusha A, Navabi Z (2005) Evaluation of pseudo adaptive XY routing using an object oriented model for NOC. In: The 17th international conference on microelectronics
18. Bobda C, Ahmadinia A, Majer M et al (2005) DyNoC: a dynamic infrastructure for communication in dynamically reconfigurable devices. In: International conference on field programmable logic and applications, pp 153–158
19. Kariniemi H, Nurmi J (2004) Arbitration and routing schemes for on-chip packet networks. In: Interconnect-centric design for advanced SoC and NoC, pp 253–282
20. Dally JW, Towles B (2004) Principles and practices of interconnection networks. Morgan Kaufmann, San Francisco
21. Andriahantenaina A, Charlery H, Greiner A et al (2003) SPIN: a scalable, packet switched, on-chip micro-network. In: Design automation and test in Europe conference and exhibition (DATE), pp 70–73
22. Kim J, Park D, Theocharides T et al (2005) A low latency router supporting adaptivity for on-chip interconnects. In: Proceedings of 42nd design automation conference (DAC), pp 559–564
23. Andreasson D, Kumar S (2005) Slack-time aware routing in NoC systems. In: IEEE international symposium on circuits and systems, pp 2353–2356
24. Wolkotte PT, Smit G, Rauwerda KG, Smit TL (2005) An energy-efficient reconfigurable circuit switched network-on-chip. In: Proceedings of the 19th IEEE international parallel and distributed processing symposium (IPDPS)
25. Kumar S, Jantsch A, Soininen PJ et al (2002) A network on chip architecture and design methodology. In: Proceedings of IEEE computer society annual symposium on VLSI, pp 105–112
26. Wentzlaff D, Griffin P, Hoffmann H et al (2007) On-chip interconnection architecture of the tile processor. IEEE Micro 27:15–31
27. Balfour J, Dally WJ (2006) Design tradeoffs for tiled CMP on-chip networks. In: 20th annual international conference on supercomputing, pp 187–198
28. Nicopoulos CA, Dongkook P, Jongman K et al (2006) ViChar: a dynamic virtual channel regulator for network-on-chip routers. In: 39th annual IEEE/ACM international symposium on microarchitecture (MICRO), pp 333–346
29. Bjerregaard T, Mahadevan S (2006) A survey of research and practices of network-on-chip. ACM Comput Surv 38:1–51


30. Lehtonen T, Wolpert D, Liljeberg P, Plosila J, Ampadu P (2010) Self-adaptive system for addressing permanent errors in on-chip interconnects. IEEE Trans Very Large Scale Integr (VLSI) Syst 18:527–540
31. Bertozzi D, Benini L (2004) Xpipes: a network-on-chip architecture for gigascale systems-on-chip. IEEE Circuits Syst Mag 4:18–31
32. Murali S, Theocharides T, Vijaykrishnan N, Irwin JM, Benini L, De Micheli G (2005) Analysis of error recovery schemes for networks-on-chips. IEEE Des Test Comput 22:434–442
33. Pirretti M, Link MG, Brooks RR et al (2004) Fault tolerant algorithms for network-on-chip interconnect. In: IEEE computer society annual symposium on VLSI, pp 46–51
34. Dumitras T, Kerner S, Marculescu R (2003) Towards on-chip fault tolerant communication. In: Proceedings of ACM/IEEE design automation conference (DAC), pp 225–232
35. Sridhara S, Shanbhag RN (2005) Coding for system-on-chip networks: a unified framework. IEEE Trans Very Large Scale Integr (VLSI) Syst 13:655–667
36. Bertozzi D, Benini L, De Micheli G (2005) Error control schemes for on-chip communication links: the energy-reliability tradeoff. IEEE Trans Computer-Aided Design Integr Circuits Syst 24:818–831
37. Rossi D, Nieuwland KA, Katoch A, Metra C (2005) Exploiting ECC redundancy to minimize crosstalk impact. IEEE Des Test Comput 22:59–70
38. Rossi D, Nieuwland KA, Dijk SVE, Kleihorst PR, Metra C (2008) Power consumption of fault tolerant busses. IEEE Trans Very Large Scale Integr (VLSI) Syst 16:542–553
39. Worm F, Ienne P, Thiran P, Micheli DG (2005) A robust self-calibrating transmission scheme for on-chip networks. IEEE Trans Very Large Scale Integr (VLSI) Syst 13:126–139
40. Ejlali A, Al-Hashimi MB, Rosinger P, Miremadi GS, Benini L (2010) Performability/energy tradeoff in error-control schemes for on-chip networks. IEEE Trans Very Large Scale Integr (VLSI) Syst 18:1–14
41. Li L, Vijaykrishnan N, Kandemir M, Irwin JM (2003) Adaptive error protection for energy efficiency. In: Proceedings of IEEE/ACM international conference on computer-aided design (ICCAD), pp 2–7
42. Rossi D, Angelini P, Metra C (2007) Configurable error control scheme for NoC signal integrity. In: Proceedings of international on line testing symposium (IOLTS), pp 43–48
43. Fujiwara E (2006) Code design for dependable systems: theory and practical applications. Wiley, Hoboken
44. Yu Q, Ampadu P (2008) Adaptive error control for NoC switch-to-switch links in a variable noise environment. In: Proceedings of IEEE international symposium on defect and fault tolerance in VLSI system (DFT), pp 352–360
45. Yu Q, Ampadu P (2009) Adaptive error control for nanometer scale NoC links. IET Comput Digit Tech 3:643–659 (Special issue on advances in nanoelectronics circuits and systems)
46. Lehtonen T, Liljeberg P, Plosila J (2007) Online reconfigurable self-timed links for fault tolerant NoC, VLSI Design. Article ID 94676:13
47. Zimmer H, Jantsch A (2003) A fault model notation and error-control scheme for switch-to-switch buses in a network-on-chip. In: Proceedings of international conference hardware/software codesign and systems synthesis (CODES-ISSS), pp 188–193
48. Ganguly A, Pande PP, Belter B, Grecu C (2008) Design of low power & reliable networks on chip through joint crosstalk avoidance and multiple error correction coding. J Electron Test: Theory Appl (JETTA), 67–81 (Special issue on defect and fault tolerance)
49. Ganguly A, Pande PP, Belter B (2009) Crosstalk-aware channel coding schemes for energy efficient and reliable NOC interconnects. IEEE Trans VLSI Syst 17:1626–1639
50. Lehtonen T, Liljeberg P, Plosila J (2007) Analysis of forward error correction methods for nanoscale networks-on-chip. In: Proceedings of 2nd international conference on nano-networks (Nano-Net), pp 1–5

Chapter 4

Error Control Coding for On-Chip Interconnects

Error control codes (ECCs) have been widely applied in communication systems [1]. In ECCs, parity check bits are calculated based on the input data. The input data and parity check bits are transmitted across a noisy channel. In the receiver, an ECC decoder is used to detect or correct the errors induced during the transmission. A powerful ECC usually requires more redundant bits and more complex encoding and decoding processes, which increases the codec overhead. To meet the tight speed, area, and energy constraints imposed by on-chip interconnect links, ECCs used for on-chip interconnects need to balance reliability and performance. In this chapter, we will first introduce the basic concepts of error control coding. Then, the error control codes used for on-chip interconnect and their hardware implementations are discussed.

4.1 Error Control Coding Basics

4.1.1 Field

Error control coding is based on arithmetic operations in fields. A field $F$ is a nonempty set of elements on which two operators are defined, called addition '+' and multiplication '·' respectively [1]. A field $F$ must meet the following requirements:

1. Closure: $\forall\, a, b \in F$,

$$c = a + b, \qquad d = a \cdot b \tag{4.1}$$

where $c, d \in F$.

B. Fu and P. Ampadu, Error Control for Network-on-Chip Links, DOI 10.1007/978-1-4419-9313-7_4, © Springer Science+Business Media, LLC 2012


2. Associative: $\forall\, a, b, c \in F$,

$$a + (b + c) = (a + b) + c, \qquad a \cdot (b \cdot c) = (a \cdot b) \cdot c \tag{4.2}$$

3. Identity: There exist an additive identity element '0' and a multiplicative identity element '1' that satisfy

$$0 + a = a + 0 = a, \qquad a \cdot 1 = 1 \cdot a = a \tag{4.3}$$

where $a \in F$.

4. Inverse: For $a \in F$, there exist elements $b, c \in F$ such that

$$a + b = 0, \qquad a \cdot c = 1 \tag{4.4}$$

where $b = (-a)$ is called the additive inverse and $c = a^{-1}$ ($a \neq 0$) is called the multiplicative inverse.

5. Commutative: $\forall\, a, b \in F$,

$$a + b = b + a, \qquad a \cdot b = b \cdot a \tag{4.5}$$

6. Distributive: $\forall\, a, b, c \in F$,

$$(a + b) \cdot c = a \cdot c + b \cdot c \tag{4.6}$$

If there is a finite number of elements in a field $F$, $F$ is said to be a finite field. A finite field is also known as a Galois field, and GF($q$) denotes a Galois field with $q$ elements. A Galois field GF($q$) exists only if $q$ is a prime number or a power of a prime ($q = p^m$, where $p$ is a prime). When $q$ is a prime, the set of integers $\{0, 1, 2, \ldots, q-1\}$ together with modulo-$q$ addition and multiplication forms the Galois field GF($q$); for $m > 1$, the field is constructed as an extension field, as described below for GF($2^m$). The simplest Galois field is GF(2), in which $p$ is equal to 2. There are two elements, $\{0, 1\}$, in GF(2). Modulo-2 addition can be realized as an XOR operation and modulo-2 multiplication can be realized as an AND operation. A polynomial $p(x)$ of degree $m$ over GF(2) can be represented as follows,

$$p(x) = p_0 + p_1 x + p_2 x^2 + \cdots + p_m x^m \tag{4.7}$$

where the coefficients $p_i$ belong to GF(2) = $\{0, 1\}$. Polynomials over GF(2) can be added, subtracted, multiplied and divided. A polynomial $p(x)$ of degree $m$ over GF(2) is irreducible if it is not divisible by any polynomial over GF(2) of degree less than $m$ and greater than zero. An irreducible polynomial $p(x)$ of degree $m$ over GF(2) can be used to generate the extension field GF($2^m$), whose elements are the $2^m$ polynomials of degree less than $m$ over GF(2). An irreducible polynomial $p(x)$ of degree $m$ is a primitive polynomial if the smallest positive integer $n$ for which $p(x)$ divides $x^n + 1$ is $n = 2^m - 1$. If $\alpha$ is a root of a primitive polynomial $p(x)$ that is used to generate the extension field GF($2^m$), all elements of GF($2^m$) can be represented as $\{0, 1, \alpha, \alpha^2, \ldots, \alpha^{n-1}\}$, with $n = 2^m - 1$. In this case, $\alpha$ is called a primitive element and $\alpha^n = 1$. Any element $\beta$ in GF($2^m$) can be represented uniquely as a linear combination of the $m$ linearly independent elements $\{1, \alpha, \alpha^2, \ldots, \alpha^{m-1}\}$ over GF(2), such that

$$\beta = a_0 + a_1 \alpha + \cdots + a_{m-1} \alpha^{m-1}, \qquad a_i \in GF(2) \tag{4.8}$$

GF(2) is used to construct binary block codes, in which the encoding and decoding processes are based on direct bit operations. GF($2^m$) is used to construct nonbinary codes, in which more complex field operations are involved.
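As a concrete illustration of the extension-field construction, the following sketch builds GF($2^3$) from the polynomial $p(x) = x^3 + x + 1$, which is a standard primitive polynomial of degree 3 chosen here for illustration. The function names and the bit-mask representation (bit $i$ holds the coefficient of $\alpha^i$) are assumptions of this sketch.

```python
def gf_powers(prim_poly: int, m: int):
    """Return [alpha^0, alpha^1, ..., alpha^(2^m - 2)] as bit masks."""
    elems, x = [], 1
    for _ in range((1 << m) - 1):
        elems.append(x)
        x <<= 1                  # multiply by alpha
        if x & (1 << m):         # degree reached m: reduce modulo p(x)
            x ^= prim_poly
    return elems


def gf_mul(a: int, b: int, prim_poly: int, m: int) -> int:
    """Multiply two field elements by shift-and-add, reduced modulo p(x)."""
    result = 0
    while b:
        if b & 1:
            result ^= a          # addition in GF(2^m) is a bitwise XOR
        b >>= 1
        a <<= 1
        if a & (1 << m):
            a ^= prim_poly
    return result


M, P = 3, 0b1011                 # p(x) = x^3 + x + 1
powers = gf_powers(P, M)
print(powers)                    # the seven nonzero elements of GF(8)
print(gf_mul(powers[-1], 0b010, P, M))   # alpha^6 * alpha = alpha^7
```

The powers of $\alpha$ run through all seven nonzero elements before repeating, and $\alpha^7 = 1$, which is what makes $p(x)$ primitive for $m = 3$.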

4.1.2 Linear Block Codes

For an $(n, k)$ block code over GF($q$), the encoding process maps $q^k$ message words, each composed of $k$ symbols, into $q^k$ codewords, each composed of $n$ symbols. When the values of $k$ and $n$ are small, the mapping process is simple; for example, a table can be used to list the mapping between message words and codewords. The encoding process becomes too complex when $k$ and $n$ are large. Linear block codes simplify the encoding process by requiring linearity of the codewords. In a linear block code, the $q^k$ $n$-symbol codewords form a vector subspace [1], so the sum of any two codewords is also a codeword. Almost all useful block codes are linear block codes. Let $C$ denote an $(n, k)$ linear block code over GF($q$). There exist $k$ linearly independent codewords $\{g_0, g_1, \ldots, g_{k-1}\}$ such that any codeword $c \in C$ can be represented as a linear combination of these codewords,

$$c = m_0 g_0 + m_1 g_1 + \cdots + m_{k-1} g_{k-1} \tag{4.9}$$

where $m_i \in$ GF($q$). The set $\{g_0, g_1, \ldots, g_{k-1}\}$ spans the vector subspace and is called a basis of the code. The basis $\{g_0, g_1, \ldots, g_{k-1}\}$ can be arranged as the rows of a $k \times n$ matrix $G$,

$$G_{k \times n} = \begin{bmatrix} g_0 \\ g_1 \\ \vdots \\ g_{k-1} \end{bmatrix} = \begin{bmatrix} g_{0,0} & g_{0,1} & \cdots & g_{0,n-1} \\ g_{1,0} & g_{1,1} & \cdots & g_{1,n-1} \\ \vdots & \vdots & & \vdots \\ g_{k-1,0} & g_{k-1,1} & \cdots & g_{k-1,n-1} \end{bmatrix} \tag{4.10}$$


Let the $k$-symbol message $m$ be

$$m = \begin{bmatrix} m_0 & m_1 & \cdots & m_{k-1} \end{bmatrix} \tag{4.11}$$

The encoding process of an $(n, k)$ linear block code can be represented in matrix form by

$$c = m \cdot G_{k \times n} \tag{4.12}$$

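Because any codeword in $C$ can be generated by multiplying the $k$-symbol message by the matrix $G_{k \times n}$, $G_{k \times n}$ is called the generator matrix of the code. For an $(n, k)$ linear block code $C$, there must be an $(n, n-k)$ dual code $C^{\perp}$ of $C$. The dual code $C^{\perp}$ can be constructed from an $(n-k) \times n$ matrix $H$ of the following form,

$$H_{(n-k) \times n} = \begin{bmatrix} h_0 \\ h_1 \\ \vdots \\ h_{n-k-1} \end{bmatrix} = \begin{bmatrix} h_{0,0} & h_{0,1} & \cdots & h_{0,n-1} \\ h_{1,0} & h_{1,1} & \cdots & h_{1,n-1} \\ \vdots & \vdots & & \vdots \\ h_{n-k-1,0} & h_{n-k-1,1} & \cdots & h_{n-k-1,n-1} \end{bmatrix} \tag{4.13}$$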

The matrix $H$ is known as the parity check matrix of the code $C$. The inner product of any row vector $g_i$ of the generator matrix $G$ and any row vector $h_j$ of the parity check matrix $H$ is zero. This relationship can be expressed in matrix form as

$$G_{k \times n} \cdot H^{T}_{(n-k) \times n} = 0 \tag{4.14}$$

where $H^T$ is the transpose of the matrix $H$. Since any codeword $c$ can be constructed from the generator matrix $G$, the following relationship holds between the codeword $c$ and the parity check matrix $H$,

$$c \cdot H^{T}_{(n-k) \times n} = m \cdot G_{k \times n} \cdot H^{T}_{(n-k) \times n} = 0 \tag{4.15}$$

4.1.3 Systematic Codes

An $(n, k)$ linear block code is called a systematic code if its codeword can be separated into message bits and redundancy bits [1]. Figure 4.1 shows an example of a (7, 4) systematic linear block code over GF(2). In this example, the message bits are placed at the beginning of the codeword and are followed by the redundancy bits. The positions of the message bits and redundancy bits can be exchanged by placing the redundancy bits before the message bits.


Fig. 4.1 An example of a (7, 4) systematic linear block code

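The generator matrix $G_{k \times n}$ of a systematic code is of the following form,

$$G_{k \times n} = \left[\, I_k \mid P_{k \times (n-k)} \,\right] \tag{4.16}$$

where $I_k$ is the $k$-dimensional identity matrix and $P_{k \times (n-k)}$ is called the parity matrix. The parity check matrix $H_{(n-k) \times n}$ of a systematic linear block code can be expressed as

$$H_{(n-k) \times n} = \left[\, P^{T}_{k \times (n-k)} \mid I_{(n-k)} \,\right] \tag{4.17}$$

where $P^T$ is the transpose of the parity matrix and $I_{(n-k)}$ is the $(n-k)$-dimensional identity matrix. The generator matrix and parity check matrix of the (7, 4) systematic code shown in Fig. 4.1 are expressed as

$$G_{4 \times 7} = \left[\, I_4 \mid P_{4 \times 3} \,\right] = \begin{bmatrix} 1 & 0 & 0 & 0 & 1 & 1 & 0 \\ 0 & 1 & 0 & 0 & 0 & 1 & 1 \\ 0 & 0 & 1 & 0 & 1 & 1 & 1 \\ 0 & 0 & 0 & 1 & 1 & 0 & 1 \end{bmatrix} \tag{4.18}$$

$$H_{3 \times 7} = \left[\, P^{T}_{4 \times 3} \mid I_3 \,\right] = \begin{bmatrix} 1 & 0 & 1 & 1 & 1 & 0 & 0 \\ 1 & 1 & 1 & 0 & 0 & 1 & 0 \\ 0 & 1 & 1 & 1 & 0 & 0 & 1 \end{bmatrix} \tag{4.19}$$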
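The following minimal sketch exercises the matrices of (4.18) and (4.19): it encodes a message as $c = m \cdot G$, checks the orthogonality relation of (4.14), and corrects a single error by matching the syndrome against the columns of $H$. The function names and the example message are assumptions made for illustration; syndrome-based correction is the standard technique for a single-error-correcting code and is not described in the text above.

```python
G = [
    [1, 0, 0, 0, 1, 1, 0],
    [0, 1, 0, 0, 0, 1, 1],
    [0, 0, 1, 0, 1, 1, 1],
    [0, 0, 0, 1, 1, 0, 1],
]
H = [
    [1, 0, 1, 1, 1, 0, 0],
    [1, 1, 1, 0, 0, 1, 0],
    [0, 1, 1, 1, 0, 0, 1],
]


def encode(m):
    """c = m * G over GF(2)."""
    return [sum(m[i] * G[i][j] for i in range(4)) % 2 for j in range(7)]


def syndrome(r):
    """s = r * H^T over GF(2); all-zero for a valid codeword."""
    return [sum(r[j] * row[j] for j in range(7)) % 2 for row in H]


def correct_single_error(r):
    """Flip the bit whose column of H equals the syndrome, if any."""
    s = syndrome(r)
    if any(s):
        columns = [[row[j] for row in H] for j in range(7)]
        r = r.copy()
        r[columns.index(s)] ^= 1
    return r


# Every row of G is orthogonal to every row of H, as in (4.14).
assert all(syndrome(g) == [0, 0, 0] for g in G)

msg = [1, 0, 1, 1]
cw = encode(msg)          # the first four bits repeat the message
rx = cw.copy()
rx[2] ^= 1                # inject a single error on bit 2
print(cw, syndrome(rx), correct_single_error(rx) == cw)
```

Because the code is systematic, the message can be read directly from the first four bits of the corrected word.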


Fig. 4.2 Hamming sphere and minimum Hamming distance

4.1.4 Hamming Distance

The Hamming distance between two vectors $v_i$ and $v_j$ in GF($q$)$^n$ is defined as the number of symbols that differ between the two vectors [1]. It can be expressed as

$$d_H(v_i, v_j) = \#\{\, k \mid v_{i,k} \neq v_{j,k},\ 0 \le k \le n-1 \,\} \tag{4.20}$$

where $v_{i,k}, v_{j,k} \in$ GF($q$) and $\#\{A\}$ is the number of elements in a set $A$. If the vectors belong to GF(2)$^n$, the Hamming distance can be calculated as

$$d_H(v_i, v_j) = \sum_{k=0}^{n-1} v_{i,k} \oplus v_{j,k} \tag{4.21}$$

where $\oplus$ is the XOR operation. Let $C$ be an $(n, k)$ linear block code over GF($q$) and $c_i \in C$. All the vectors that have a Hamming distance less than or equal to $l$ from the codeword $c_i$ form a sphere of radius $l$ around the codeword $c_i$. This sphere is called a Hamming sphere and is described as

$$S_l(c_i) = \{\, v_j \mid d_H(c_i, v_j) \le l \,\} \tag{4.22}$$

If the Hamming spheres of radius $l$ around the codewords of code $C$ do not overlap one another, any received word with $l$ or fewer erroneous symbols can be corrected by code $C$. Figure 4.2 demonstrates the relationship between the error correction capability and the Hamming sphere. A linear block codeword $c_i \in C$ is transmitted. Due to the errors caused by noise during transmission, the codeword is received as $r_i$. If the number of errors is less than or equal to $l$, the received word is located in the Hamming sphere of codeword $c_i$. By selecting $c_i$ as the transmitted codeword, the received word $r_i$ is properly decoded. From Fig. 4.2, we can see that the radius of the nonoverlapping Hamming spheres characterizes the error correction capability of a linear block code [1]. The error correcting capability, $t$, of a linear block code $C$ is defined as the maximum radius of the Hamming spheres $S_t$ around all the codewords in $C$ such that the Hamming spheres of any two different codewords $c_i, c_j \in C$ do not overlap. It can be expressed as

$$t = \max_{c_i, c_j \in C} \{\, l \mid S_l(c_i) \cap S_l(c_j) = \emptyset,\ c_i \neq c_j \,\} \tag{4.23}$$

When the Hamming spheres around each codeword have the same radius, the error correction capability $t$ is determined by the minimum distance between two codewords $c_i$ and $c_j$ in the code $C$. The minimum Hamming distance of a code $C$ is defined as

$$d_{min} = \min_{c_i, c_j \in C} d_H(c_i, c_j), \qquad c_i \neq c_j \tag{4.24}$$

The relationship between the error correction capability $t$ and the minimum Hamming distance $d_{min}$ can be expressed as

$$t = \left\lfloor \frac{d_{min} - 1}{2} \right\rfloor \tag{4.25}$$

where $\lfloor x \rfloor$ is the floor function and represents the largest integer less than or equal to $x$. For a linear block code, the minimum distance is equal to the minimum weight of its nonzero codewords, where the weight is defined as the number of nonzero symbols in a codeword. The minimum Hamming distance is also used to characterize the error detection capability of a linear block code. A linear block code with minimum distance $d_{min}$ is guaranteed to detect any pattern of $d_{min} - 1$ or fewer errors. A linear block code $C$ is usually described by three parameters: the codeword length $n$, the message length $k$, and the minimum Hamming distance $d_{min}$. A code with a large value of $d_{min}$ is more powerful because of its increased error correction and detection capability. However, powerful codes require a large number of redundancy bits, which increases the complexity of the encoding and decoding processes. It is important to balance the error correction capability and the complexity of a code.
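As a worked check of (4.21), (4.24) and (4.25), the sketch below enumerates all 16 codewords of the (7, 4) code of Sect. 4.1.3 and computes its minimum distance by brute force. The helper names are assumptions made for illustration, and exhaustive enumeration is only practical for small codes such as this one.

```python
from itertools import product

# Generator matrix of the (7, 4) systematic code from (4.18).
G = [
    [1, 0, 0, 0, 1, 1, 0],
    [0, 1, 0, 0, 0, 1, 1],
    [0, 0, 1, 0, 1, 1, 1],
    [0, 0, 0, 1, 1, 0, 1],
]


def hamming_distance(u, v):
    """Number of positions in which two binary vectors differ, as in (4.21)."""
    return sum(a ^ b for a, b in zip(u, v))


def all_codewords(gen):
    """Enumerate c = m * G over GF(2) for every k-bit message m."""
    k, n = len(gen), len(gen[0])
    return [
        [sum(m[i] * gen[i][j] for i in range(k)) % 2 for j in range(n)]
        for m in product([0, 1], repeat=k)
    ]


cws = all_codewords(G)
d_min = min(hamming_distance(a, b) for a in cws for b in cws if a != b)
min_weight = min(sum(c) for c in cws if any(c))
t = (d_min - 1) // 2          # error correction capability, (4.25)
detectable = d_min - 1        # guaranteed error detection capability
print(d_min, min_weight, t, detectable)
```

The minimum distance found by pairwise comparison equals the minimum nonzero codeword weight, as expected for a linear code, and gives $t = 1$ with guaranteed detection of up to two errors.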

4.1.5 Code Modification

Code modification is usually used to construct new codes from a linear block code C [1]. In this section, we discuss these code construction methods and the relationship between the modified codes and the original code.

Let C be an (n, k) linear block code. A shortened code is constructed by deleting l message symbols from the codeword and is represented as (n − l, k − l). If C is a systematic code, the generator matrix of the shortened (n − l, k − l) code can be obtained from the original generator matrix $G_{k \times n}$ by deleting l columns of the identity matrix $I_k$ and the l rows that have a nonzero value in the deleted columns. The following example shows the generator matrix of a shortened (6, 3) block code, which is modified from the (7, 4) linear block code in Sect. 4.1.3.

$$ G_{3 \times 6} = \begin{bmatrix} 1 & 0 & 0 & 0 & 1 & 1 \\ 0 & 1 & 0 & 1 & 1 & 1 \\ 0 & 0 & 1 & 1 & 0 & 1 \end{bmatrix} \tag{4.26} $$
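The shortening procedure can be sketched in a few lines. The (7, 4) generator matrix below is an assumption: it is the systematic matrix G = [I4 | P] that is consistent with the parity check matrix used in (4.27), since the matrix of Sect. 4.1.3 is not reproduced here. The function name `shorten` is hypothetical.

```python
def shorten(G, pos):
    """Delete message column `pos` and every row with a 1 in that column."""
    return [row[:pos] + row[pos + 1:] for row in G if row[pos] == 0]

# Assumed systematic (7, 4) generator matrix G = [I4 | P].
G_7_4 = [
    [1, 0, 0, 0, 1, 1, 0],
    [0, 1, 0, 0, 0, 1, 1],
    [0, 0, 1, 0, 1, 1, 1],
    [0, 0, 0, 1, 1, 0, 1],
]
G_6_3 = shorten(G_7_4, 0)   # remove the first message symbol
for row in G_6_3:
    print(row)
# [1, 0, 0, 0, 1, 1]
# [0, 1, 0, 1, 1, 1]
# [0, 0, 1, 1, 0, 1]
```

Under this assumption, deleting the first column and the first row gives the (6, 3) matrix of (4.26).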

In this example, we assume that the first column in the original generator matrix is removed. The first row of the original generator matrix is then also deleted, because it has a nonzero value at its first position. The minimum Hamming distance dmin of the shortened code is greater than or equal to that of the original code.

An (n, k) linear block code can be extended by adding l additional parity check symbols. The extended code is represented as (n + l, k). The parity check matrix of the extended code can be constructed by adding l rows and l columns to the original parity check matrix. The minimum Hamming distance of the extended code is greater than or equal to that of the original code. The most common approach to extending a code is to add one additional parity check symbol, which is calculated from all the inputs. The following example shows the parity check matrix of an extended (8, 4) code, which is modified from the (7, 4) linear block code in Sect. 4.1.3.

$$ H_{4 \times 8} = \begin{bmatrix} 1 & 0 & 1 & 1 & 1 & 0 & 0 & 0 \\ 1 & 1 & 1 & 0 & 0 & 1 & 0 & 0 \\ 0 & 1 & 1 & 1 & 0 & 0 & 1 & 0 \\ 1 & 1 & 1 & 1 & 1 & 1 & 1 & 1 \end{bmatrix} \tag{4.27} $$
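As a rough check of the extension step, the sketch below builds the extended parity check matrix from a 3 × 7 matrix and compares minimum distances by brute force. The 3 × 7 matrix is taken to be the upper-left block of (4.27), which is an assumption about the original (7, 4) code; the helper names are hypothetical.

```python
from itertools import product

def extend_parity_check(H):
    """Append a zero column to H, then an all-ones row (overall parity check)."""
    n = len(H[0])
    return [row + [0] for row in H] + [[1] * (n + 1)]

def min_distance_from_H(H):
    """Smallest weight of a nonzero vector v with H v^T = 0 (brute force)."""
    n = len(H[0])
    return min(
        sum(v) for v in product([0, 1], repeat=n)
        if any(v) and all(sum(h * x for h, x in zip(row, v)) % 2 == 0 for row in H)
    )

# Assumed parity check matrix of the original (7, 4) code.
H_3_7 = [
    [1, 0, 1, 1, 1, 0, 0],
    [1, 1, 1, 0, 0, 1, 0],
    [0, 1, 1, 1, 0, 0, 1],
]
H_4_8 = extend_parity_check(H_3_7)
print(min_distance_from_H(H_3_7), min_distance_from_H(H_4_8))  # 3 4
```

For this example, the overall parity check raises the minimum distance from 3 to 4.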

A linear block code can also be punctured and lengthened. In a punctured code, l parity check symbols are removed to construct a new (n − l, k) code. For a punctured code, the code rate increases because the codeword contains fewer parity check symbols. A linear block code can be lengthened by adding l message symbols. The generator matrix of a lengthened code is constructed by adding l columns and l rows to the generator matrix of the original code. The minimum Hamming distance of the punctured code and of the lengthened code is less than or equal to the minimum Hamming distance of the original code.

4.2 Error Control Codes for On-Chip Interconnect

On-chip interconnects have tight speed, area, and energy constraints. Thus, error control codes used for on-chip interconnects need to balance reliability and performance. In early research work, simple ECCs, such as single parity check (SPC) codes, Hamming codes, and duplicate-add-parity (DAP) codes, were widely used to detect or correct single errors. As the probability of multiple errors increases in nanoscale technology, more complex error control codes, such as Bose-Chaudhuri-Hocquenghem (BCH) codes, Reed-Solomon (RS) codes and product codes, are applied to improve the reliability of on-chip interconnects.

4.2.1 Single Parity Check (SPC) Codes

The single parity check (SPC) code is one of the simplest codes. In SPC codes, an additional parity bit is added to a k-bit data block such that the resulting (k + 1)-bit codeword has an even number (for even parity) or an odd number (for odd parity) of 1s. SPC codes have a minimum Hamming distance dmin = 2 and can only be used for error detection. SPC codes can detect any odd number of errors in a codeword. The hardware circuit used to generate the parity check bit is composed of a number of exclusive OR (XOR) gates, as shown in Fig. 4.3. In the SPC decoder, another parity generation circuit, identical to that employed in the encoder, recalculates the parity check bit from the received data. The recalculated parity check bit is compared to the received parity check bit.

Fig. 4.3 An example of single parity check (SPC) codes


Fig. 4.4 Encoding and decoding process of DAP codes

If the recalculated parity check bit is different from the received parity check bit, errors are detected. The bit comparison can be implemented using an XOR gate as shown in Fig. 4.3.
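The SPC encoder and decoder described above can be modelled in software as follows. This is a minimal sketch assuming even parity; the function names are hypothetical and the XOR reduction stands in for the XOR tree of Fig. 4.3.

```python
from functools import reduce
from operator import xor

def parity(bits):
    """XOR tree over the bits: returns 1 when the number of 1s is odd."""
    return reduce(xor, bits, 0)

def spc_encode(data):
    """Append an even-parity bit to a k-bit data block."""
    return list(data) + [parity(data)]

def spc_error_detected(received):
    """Recompute the parity from the data bits and XOR it with the received parity bit."""
    *data, p = received
    return (parity(data) ^ p) == 1

codeword = spc_encode([1, 0, 1, 1])          # [1, 0, 1, 1, 1]
one_error = codeword.copy()
one_error[2] ^= 1
two_errors = one_error.copy()
two_errors[0] ^= 1
print(spc_error_detected(codeword),
      spc_error_detected(one_error),
      spc_error_detected(two_errors))        # False True False
```

The last case illustrates why SPC codes detect only odd numbers of errors: two flipped bits leave the parity unchanged.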

4.2.2 Duplicate-Add-Parity (DAP) Code

In duplicate-add-parity (DAP) codes [2, 3], a k-bit input is duplicated and an extra parity check bit, calculated from the original data, is added. For k-bit input data, the codeword width of DAP codes is 2k + 1. DAP codes have a minimum Hamming distance dmin = 3, because any two distinct codewords differ in at least three bit positions. They can therefore correct single errors. The encoding and decoding processes of DAP codes are shown in Fig. 4.4. XOR gates are used to calculate the parity bit from the input data. The input data, its duplicated copy and the parity bit together form the DAP codeword. In a DAP decoder, the recalculated parity check bit is compared to the transmitted parity check bit. If they have the same value, the original data is selected as the decoder output. If the recalculated parity check bit differs from the transmitted parity check bit, the duplicated copy of the original data is selected as the decoder output. In the DAP code implementation, each duplicated data bit is placed adjacent to the corresponding original data bit. Thus, DAP codes can reduce the impact of crosstalk coupling. Crosstalk reduction using DAP codes will be discussed in Chap. 6.
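The selection rule of the DAP decoder can be illustrated with a short behavioural model. This sketch keeps the original copy, the duplicate and the parity bit as separate fields and does not model the physical wire placement; the function names are hypothetical.

```python
from functools import reduce
from operator import xor

def dap_encode(data):
    """Return (original, duplicate, parity) for a k-bit input: 2k + 1 bits in total."""
    return list(data), list(data), reduce(xor, data, 0)

def dap_decode(original, duplicate, parity_bit):
    """Select the original copy if its recomputed parity matches, else the duplicate."""
    recomputed = reduce(xor, original, 0)
    return list(original) if recomputed == parity_bit else list(duplicate)

data = [1, 0, 1, 1]
a, b, p = dap_encode(data)
flat = a + b + [p]                     # 9-bit view, used only to inject errors
results = []
for i in range(len(flat)):             # flip each of the 2k + 1 bits in turn
    r = flat.copy()
    r[i] ^= 1
    results.append(dap_decode(r[:4], r[4:8], r[8]) == data)
print(all(results))                    # True
```

Every single-bit error, including an error in the parity bit itself, leaves at least one intact copy that the parity comparison selects.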


Fig. 4.5 An example of Hamming H(7,4) encoder

4.2.3 Hamming Codes

Hamming codes are a type of linear block code with minimum Hamming distance dmin = 3. Hamming codes can be used to either correct single errors or detect double errors. For a positive integer r ≥ 3, there exists an (n, k) Hamming code with the following parameters:

$$ n = 2^r - 1, \qquad k = 2^r - 1 - r \tag{4.28} $$

An n-bit Hamming codeword $c_{1 \times n}$ can be generated by multiplying the k-bit input data $m_{1 \times k}$ by a generator matrix $G_{k \times n}$, that is

$$ c_{1 \times n} = m_{1 \times k}\, G_{k \times n} \tag{4.29} $$

An example of a generator matrix for the (7,4) Hamming code H(7,4) with r = 3 is shown below.

$$ G_{4 \times 7} = [\, I_{4 \times 4} \mid P_{4 \times 3} \,] = \begin{bmatrix} 1 & 0 & 0 & 0 & 1 & 1 & 0 \\ 0 & 1 & 0 & 0 & 1 & 1 & 1 \\ 0 & 0 & 1 & 0 & 1 & 0 & 1 \\ 0 & 0 & 0 & 1 & 0 & 1 & 1 \end{bmatrix} \tag{4.30} $$

The multiplication of a generator matrix $G_{k \times n}$ by the input data $m_{1 \times k}$ can be implemented using XOR trees. Figure 4.5 shows an example of the encoding circuit of a Hamming H(7,4) code. The XOR tree depth determines the worst case delay of the Hamming encoder. The largest XOR tree depth occurs when a column in $G_{k \times n}$ is all-ones.
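The encoder can be sketched in software both as a matrix product modulo 2 and as the equivalent XOR equations. The matrix is the one given in (4.30); the function names and the explicit parity equations (read off the last three columns of that matrix) are illustrative.

```python
G = [
    [1, 0, 0, 0, 1, 1, 0],
    [0, 1, 0, 0, 1, 1, 1],
    [0, 0, 1, 0, 1, 0, 1],
    [0, 0, 0, 1, 0, 1, 1],
]

def hamming_encode(m):
    """c = m * G over GF(2), Eq. (4.29)."""
    return [sum(m[i] * G[i][j] for i in range(4)) % 2 for j in range(7)]

def hamming_encode_xor(m):
    """Same encoder written as XOR trees, one per parity column of G."""
    m1, m2, m3, m4 = m
    return [m1, m2, m3, m4, m1 ^ m2 ^ m3, m1 ^ m2 ^ m4, m2 ^ m3 ^ m4]

print(hamming_encode([0, 0, 1, 1]))       # [0, 0, 1, 1, 1, 1, 0]
print(hamming_encode_xor([0, 0, 1, 1]))   # [0, 0, 1, 1, 1, 1, 0]
```

Each parity column of this G has three 1s, so every parity bit is the XOR of three message bits.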


Hamming codes can be decoded using the syndrome decoding method. In this method, the syndrome $s_{1 \times r}$ is calculated by multiplying the received codeword $r_{1 \times n}$ with the transpose of the parity check matrix $H_{(n-k) \times n}$,

$$ s_{1 \times r} = r_{1 \times n} H^{T}_{(n-k) \times n} = (c_{1 \times n} + e_{1 \times n}) H^{T}_{(n-k) \times n} = e_{1 \times n} H^{T}_{(n-k) \times n} \tag{4.31} $$

where $e_{1 \times n}$ is an error vector representing any errors introduced during transmission. The columns of the parity check matrix $H_{(n-k) \times n}$ consist of all the nonzero r-bit vectors, so no two columns are equal. The product of any valid codeword $c_{1 \times n}$ with the transpose of $H_{(n-k) \times n}$ is zero. Thus, if there is no error in the received codeword, the syndrome $s_{1 \times r}$ is zero. If the received codeword has a single error, the syndrome is nonzero and equal to one column of the parity check matrix. The nonzero syndrome vector can be used to determine the corresponding error vector $e_{1 \times n}$ by looking it up in a predefined table. Error correction is performed by adding the error vector back to the received codeword $r_{1 \times n}$.

The following is an example of syndrome decoding. Assume that a (7, 4) Hamming codeword c = [0 0 1 1 1 1 0] is generated using the generator matrix G in (4.30), and that the fourth bit is corrupted during transmission, so that the received codeword becomes r = [0 0 1 0 1 1 0]. Using the parity check matrix

$$ H_{3 \times 7} = \begin{bmatrix} 1 & 1 & 1 & 0 & 1 & 0 & 0 \\ 1 & 1 & 0 & 1 & 0 & 1 & 0 \\ 0 & 1 & 1 & 1 & 0 & 0 & 1 \end{bmatrix} \tag{4.32} $$

the syndrome is s = [0 1 1], which is equal to the fourth column of the parity check matrix. Thus the error vector is e = [0 0 0 1 0 0 0].

Figure 4.6 shows an example of the Hamming(7,4) decoding circuits. The syndrome is calculated from the received Hamming codeword. The syndrome calculation circuit can be implemented as XOR trees. The calculated syndrome is used to determine the error vector through the syndrome decoder circuit, which can be realized using AND trees. The syndrome and its inverse are the inputs of the AND trees. For a binary code, adding the error vector to the received codeword is just a set of XOR operations.

A Hamming code can be extended by adding one overall parity check bit. An extended (n + 1, k) Hamming code meets the following requirements,

$$ n = 2^{r-1} - 1, \qquad k = 2^{r-1} - r \tag{4.33} $$
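The syndrome decoding example for the (7, 4) Hamming code can be reproduced with a short sketch. The parity check matrix is the one in (4.32); the lookup table construction and the function names are illustrative assumptions rather than a description of the circuit in Fig. 4.6.

```python
H = [
    [1, 1, 1, 0, 1, 0, 0],
    [1, 1, 0, 1, 0, 1, 0],
    [0, 1, 1, 1, 0, 0, 1],
]

def syndrome(r):
    """s = r * H^T over GF(2), Eq. (4.31)."""
    return tuple(sum(h * x for h, x in zip(row, r)) % 2 for row in H)

# Predefined table: syndrome (column i of H) -> single-bit error vector.
TABLE = {}
for i in range(7):
    e = [0] * 7
    e[i] = 1
    TABLE[syndrome(e)] = e

def correct(r):
    """Add (XOR) the looked-up error vector back to the received word."""
    s = syndrome(r)
    if not any(s):
        return list(r)
    return [x ^ y for x, y in zip(r, TABLE[s])]

r = [0, 0, 1, 0, 1, 1, 0]
print(syndrome(r), TABLE[syndrome(r)], correct(r))
# (0, 1, 1) [0, 0, 0, 1, 0, 0, 0] [0, 0, 1, 1, 1, 1, 0]
```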

Extended Hamming codes have a minimum Hamming distance dmin = 4 and belong to the single-error-correcting and double-error-detecting (SEC-DED) codes, which can correct single errors and detect double errors at the same time.


Fig. 4.6 An example of Hamming H(7,4) decoder

Figure 4.7 shows an implementation example of the extended Hamming EH(8,4) decoder. One of the syndrome bits is an even parity of the entire codeword. If this bit is zero and the other syndrome bits are nonzero, this implies that there were two errors: the zero even parity check bit indicates that there are zero (or an even number of) errors, while the other nonzero syndrome bits indicate that there is at least one error. Since the extended Hamming code can only guarantee detection of up to two errors, we assume that this syndrome pattern represents double errors. The implementation of double error detection is shown at the bottom of Fig. 4.7. A Hamming code or an extended Hamming code can be shortened by eliminating a certain number of information bits (e.g., s). A shortened (n − s, k − s) Hamming code or extended Hamming code has the same number of redundant bits r as the original code. Table 4.1 shows the codeword length of normal and shortened Hamming codes and extended Hamming codes for different numbers of input data bits. Shortening makes the code more flexible, able to support any input data width.
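The SEC-DED decision rule can be sketched as follows. This model assumes the extended parity check matrix of (4.27), whose last row is the overall even parity check; it is not claimed to match the exact circuit of Fig. 4.7, and the function name is hypothetical.

```python
H = [
    [1, 0, 1, 1, 1, 0, 0, 0],
    [1, 1, 1, 0, 0, 1, 0, 0],
    [0, 1, 1, 1, 0, 0, 1, 0],
    [1, 1, 1, 1, 1, 1, 1, 1],   # overall (even) parity check
]
COLUMNS = [tuple(row[j] for row in H) for j in range(8)]

def sec_ded_decode(r):
    """Return (word, status) with status in {'ok', 'corrected', 'double'}."""
    s = tuple(sum(h * x for h, x in zip(row, r)) % 2 for row in H)
    if not any(s):
        return list(r), "ok"
    if s[-1] == 0:               # parity says "even", other bits say "error"
        return list(r), "double"
    j = COLUMNS.index(s)         # single error: syndrome equals column j of H
    return [x ^ int(i == j) for i, x in enumerate(r)], "corrected"

zero = [0] * 8                   # the all-zero word is a valid codeword
singles = all(
    sec_ded_decode([int(i == a) for i in range(8)]) == (zero, "corrected")
    for a in range(8)
)
doubles = all(
    sec_ded_decode([int(i in (a, b)) for i in range(8)])[1] == "double"
    for a in range(8) for b in range(a + 1, 8)
)
print(singles, doubles)          # True True
```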

4.2.4 Hsiao Codes

Hsiao codes are a special case of extended Hamming codes with SEC-DED capability. In Hsiao codes, the parity check matrix $H_{(n-k) \times n}$ satisfies the following four constraints: (a) every column is different; (b) no all-zero column exists; (c) there is an odd number of 1s in each column; (d) each row of the parity check matrix contains the same number of 1s.


Fig. 4.7 An implementation example of the extended Hamming EH(8,4) decoder

Table 4.1 Codeword length for Hamming codes and extended Hamming codes for different input data bits

Data bits k    Shortened/standard Hamming code (n, k)    Shortened/standard Ex-Hamming code (n + 1, k)
4              (7,4)                                     (8,4)
8              (12,8)                                    (13,8)
16             (21,16)                                   (22,16)
32             (38,32)                                   (39,32)
64             (71,64)                                   (72,64)
128            (136,128)                                 (137,128)
256            (265,256)                                 (266,256)
512            (522,512)                                 (523,512)

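The following is an example of the parity check matrix of an (8, 4) Hsiao code,

$$ H_{4 \times 8} = \begin{bmatrix} 1 & 1 & 1 & 0 & 1 & 0 & 0 & 0 \\ 1 & 1 & 0 & 1 & 0 & 1 & 0 & 0 \\ 1 & 0 & 1 & 1 & 0 & 0 & 1 & 0 \\ 0 & 1 & 1 & 1 & 0 & 0 & 0 & 1 \end{bmatrix} \tag{4.34} $$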


Fig. 4.8 An example of correcting spatial burst error using multiple SEC codes with interleaving

This parity check matrix has an odd number of 1s in each column, and the number of 1s in each row is equal. A double error is detected when the syndrome is nonzero and the number of 1s in the syndrome is even. The hardware requirement of the encoder and decoder of Hsiao codes is less than that of the extended Hamming codes shown in Sect. 4.2.3, because the number of 1s in the parity check matrix of a Hsiao code is less than that of an extended Hamming code. Further, having the same number of 1s in each row of the parity check matrix reduces the calculation delay of the parity check bits.
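The Hsiao constraints and the syndrome-weight rule can be checked on the matrix of (4.34) with a short sketch. The function names and the status strings are hypothetical.

```python
H = [
    [1, 1, 1, 0, 1, 0, 0, 0],
    [1, 1, 0, 1, 0, 1, 0, 0],
    [1, 0, 1, 1, 0, 0, 1, 0],
    [0, 1, 1, 1, 0, 0, 0, 1],
]
columns = [tuple(row[j] for row in H) for j in range(8)]

# Constraints (a)-(d): distinct, nonzero, odd-weight columns; equal row weights.
constraints_hold = (
    len(set(columns)) == 8
    and all(sum(c) % 2 == 1 for c in columns)
    and len({sum(row) for row in H}) == 1
)

def hsiao_classify(r):
    """Odd syndrome weight -> single error, even nonzero weight -> double error."""
    weight = sum(sum(h * x for h, x in zip(row, r)) % 2 for row in H)
    if weight == 0:
        return "no error"
    return "single error" if weight % 2 else "double error"

print(constraints_hold)                              # True
print(hsiao_classify([0, 0, 1, 0, 0, 0, 0, 0]))      # single error
print(hsiao_classify([0, 0, 1, 0, 0, 1, 0, 0]))      # double error
print(sum(map(sum, H)))                              # 16
```

The matrix in (4.34) contains 16 ones, compared with 20 in the extended Hamming matrix of (4.27), which is consistent with the hardware saving described above.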

4.2.5 SEC Codes with Interleaving

Interleaving is an efficient approach to achieve protection against spatial burst errors. Figure 4.8 shows the principle of using multiple SEC codes with interleaving to correct spatial burst errors [4, 5]. In this method, the input data is separated into smaller groups. Each group is encoded separately using simple linear block codes (e.g., SEC codes). The outputs of these small groups are interleaved, and the interleaved data is transmitted to the receiver. In the receiver, the interleaved data is first deinterleaved. The deinterleaving process distributes spatial burst errors into different groups. For the example in Fig. 4.8 of a two-bit burst error with two groups, each group contains only a single error after deinterleaving, which can then be corrected using SEC decoders.

Many interleaving algorithms have been proposed for communication systems. The simplest interleaving algorithm is row-column interleaving. Figure 4.9 illustrates the relation between the inputs and outputs of a row-column interleaver. Assume the input data is separated into m groups. After encoding, each group has n-bit outputs. The row-column interleaving algorithm assigns the wires in the bus in such a way that the first m wires carry the first bits of the m blocks, followed by the m second bits, and so on. The interleaving distance, which is the distance between two wires that belong to the same group, can be used to measure the spatial burst error correction capability. A larger interleaving distance allows correction of larger burst errors. For on-chip interconnects, the interleaving is usually implemented as local hardwired connections, adding some design complexity with negligible power and delay overhead.

Fig. 4.9 Input and output relation of row-column interleaver
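The row-column wire assignment can be sketched as a pair of index mappings. In this illustrative model (function names are hypothetical), m = 4 groups of n = 7 bits are interleaved, a burst hits four adjacent wires, and each group sees exactly one error after deinterleaving.

```python
def interleave(groups):
    """Wire order: bit 0 of every group, then bit 1 of every group, and so on."""
    m, n = len(groups), len(groups[0])
    return [groups[g][i] for i in range(n) for g in range(m)]

def deinterleave(wires, m):
    """Inverse mapping: wire w carries bit w // m of group w % m."""
    n = len(wires) // m
    return [[wires[i * m + g] for i in range(n)] for g in range(m)]

groups = [[0] * 7 for _ in range(4)]     # all-zero data keeps the errors visible
wires = interleave(groups)
for w in range(10, 14):                  # spatial burst on 4 adjacent wires
    wires[w] ^= 1
errors_per_group = [sum(g) for g in deinterleave(wires, 4)]
print(errors_per_group)                  # [1, 1, 1, 1]
```

With this mapping the interleaving distance equals m, so a burst spanning up to m adjacent wires contributes at most one error to each SEC codeword.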

4.2.6 Cyclic Codes

A linear block code C is a cyclic code if a cyclic shift of any codeword in C is still a codeword in C [1]. In cyclic codes, the input data m = (m0, m1, ..., m_{k−1}) and the codeword c = (c0, c1, ..., c_{n−1}) are usually represented as polynomials of the following form,

$$ m(x) = m_0 + m_1 x + m_2 x^2 + \cdots + m_{k-1} x^{k-1} \tag{4.35} $$

$$ c(x) = c_0 + c_1 x + c_2 x^2 + \cdots + c_{n-1} x^{n-1} \tag{4.36} $$

An (n, k) cyclic code can be constructed by multiplying an input polynomial with the generator polynomial g(x),

$$ c(x) = m(x)\, g(x) \tag{4.37} $$


Fig. 4.10 Encoding circuit for an (n, k) cyclic code

where $g(x) = 1 + g_1 x + g_2 x^2 + \cdots + g_{r-1} x^{r-1} + x^r$ is a factor of $x^n - 1$ and has degree r = n − k. For example, a cyclic (7, 3) code can be constructed with the generator polynomial $g(x) = 1 + x^2 + x^3 + x^4$, which is a factor of $x^7 - 1$, since $x^7 - 1 = (1 + x)(1 + x + x^3)(1 + x^2 + x^3)$. A systematic cyclic code can be constructed by appending the parity check bits to the right-shifted input data,

$$ c(x) = m(x)\, x^{n-k} + R_{g(x)}\!\left[ m(x)\, x^{n-k} \right] \tag{4.38} $$
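Systematic encoding by polynomial division can be sketched with integers used as GF(2) polynomials (bit i holds the coefficient of x^i). This is a bit-parallel software model, not the serial LFSR; the function names are hypothetical, and the (7, 3) generator polynomial is the one from the example above.

```python
def poly_mod(a, g):
    """Remainder of a(x) divided by g(x) over GF(2)."""
    dg = g.bit_length() - 1
    while a.bit_length() - 1 >= dg:
        a ^= g << (a.bit_length() - 1 - dg)
    return a

def cyclic_encode(m, g, n, k):
    """Eq. (4.38): c(x) = m(x) x^(n-k) + R_g[m(x) x^(n-k)]."""
    shifted = m << (n - k)
    return shifted | poly_mod(shifted, g)

def cyclic_shift(c, n):
    """Multiply c(x) by x modulo x^n - 1, i.e. one cyclic shift."""
    return ((c << 1) | (c >> (n - 1))) & ((1 << n) - 1)

G_POLY = 0b11101                          # g(x) = 1 + x^2 + x^3 + x^4
N, K = 7, 3
c = cyclic_encode(0b101, G_POLY, N, K)    # m(x) = 1 + x^2
print(format(c, "07b"))                               # 1010011
print(poly_mod(c, G_POLY))                            # 0
print(poly_mod(cyclic_shift(c, N), G_POLY))           # 0
```

A zero remainder confirms that both the codeword and its cyclic shift are multiples of g(x).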

The parity check bits are the remainder of the division of $m(x)\, x^{n-k}$ by g(x). The encoding process of cyclic codes can be realized serially by using a simple linear feedback shift register (LFSR). Figure 4.10 shows the encoding circuit for a systematic cyclic code. The input data m(x) is shifted into the encoding circuit one bit at a time from the right end with switches a and b at position 1. After all the input data has entered the circuit, the switches a and b are set to position 2 and the parity check bits in $b_0$ to $b_{r-1}$ are serially shifted out. The use of an LFSR circuit requires little hardware but introduces a large latency when a large amount of data is processed.

Cyclic codes can also be encoded by multiplying the input data with a generator matrix. For an (n, k) cyclic code with generator polynomial $g(x) = g_0 + g_1 x + g_2 x^2 + \cdots + g_{r-1} x^{r-1} + g_r x^r$, the generator matrix can be constructed as follows:

$$ G_{k \times n} = \begin{bmatrix} g_0 & g_1 & g_2 & \cdots & g_r & 0 & 0 & \cdots & 0 \\ 0 & g_0 & g_1 & g_2 & \cdots & g_r & 0 & \cdots & 0 \\ 0 & 0 & g_0 & g_1 & g_2 & \cdots & g_r & \cdots & 0 \\ \vdots & & & \ddots & & & & \ddots & \vdots \\ 0 & 0 & \cdots & 0 & g_0 & g_1 & g_2 & \cdots & g_r \end{bmatrix} \tag{4.39} $$
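The shifted-row structure of (4.39) is easy to generate programmatically. The sketch below builds the matrix for the (7, 3) code with g(x) = 1 + x^2 + x^3 + x^4; the function name is hypothetical.

```python
def cyclic_generator_matrix(g, n):
    """Row i holds (g0, g1, ..., gr) shifted right by i positions, Eq. (4.39)."""
    r = len(g) - 1
    k = n - r
    return [[0] * i + list(g) + [0] * (k - 1 - i) for i in range(k)]

G = cyclic_generator_matrix([1, 0, 1, 1, 1], 7)   # coefficients g0..g4
for row in G:
    print(row)
# [1, 0, 1, 1, 1, 0, 0]
# [0, 1, 0, 1, 1, 1, 0]
# [0, 0, 1, 0, 1, 1, 1]
```

This matrix produces the non-systematic codewords c(x) = m(x) g(x) of (4.37), since row i corresponds to x^i g(x).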


Table 4.2 BCH codes generated by primitive elements for m ≤ 7 with different error correction capabilities

         t = 1         t = 2         t = 3
m = 3    (7,4)         -             -
m = 4    (15,11)       (15,7)        (15,5)
m = 5    (31,26)       (31,21)       (31,16)
m = 6    (63,57)       (63,51)       (63,45)
m = 7    (127,120)     (127,113)     (127,106)

A shortened (n − l, k − l) cyclic code can be constructed from an (n, k) cyclic code by eliminating the l rightmost bits in the codewords. Shortened cyclic codes are generally not cyclic. A class of shortened cyclic codes, which are usually generated by either a primitive polynomial p(x) or a generator polynomial g(x) = (x + 1)p(x), are also known as cyclic redundancy check (CRC) codes. CRC codes are effective at detecting burst errors. CRC codes can be implemented using an LFSR with the same circuit as the original cyclic codes. CRC codes can also be implemented in a parallel approach to improve throughput [6, 7]. For on-chip interconnects, parallel implementation is preferred. A few CRC codes have become international standards (e.g., CRC-5 with generator polynomial $g(x) = 1 + x^2 + x^4 + x^5$ is used in an International Telecommunication Union (ITU) standard).
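Burst error detection with a CRC can be illustrated with plain polynomial long division. This sketch uses the generator polynomial 1 + x^2 + x^4 + x^5 quoted above, but it assumes a zero initial register value and most-significant-bit-first ordering, so it is not claimed to reproduce the bit conventions of any particular standard. The function names and the 16-bit test data are hypothetical.

```python
def crc_remainder(bits, g_bits):
    """Long division over GF(2); coefficients are listed from the highest power down.
    Returns the r = deg(g) check bits of bits(x) * x^r."""
    r = len(g_bits) - 1
    work = list(bits) + [0] * r
    for i in range(len(bits)):
        if work[i]:
            for j, gb in enumerate(g_bits):
                work[i + j] ^= gb
    return work[-r:]

G5 = [1, 1, 0, 1, 0, 1]                  # x^5 + x^4 + x^2 + 1

def frame_ok(frame):
    """Recompute the check bits from the data part and compare with the received ones."""
    return crc_remainder(frame[:-5], G5) == frame[-5:]

data = [1, 0, 1, 1, 0, 0, 1, 0, 1, 1, 1, 0, 0, 1, 0, 1]
frame = data + crc_remainder(data, G5)
burst = frame.copy()
for i in (7, 9, 10, 11):                 # burst confined to 5 adjacent positions
    burst[i] ^= 1
print(frame_ok(frame), frame_ok(burst))  # True False
```

Because g(x) has degree 5 and a nonzero constant term, no burst of length 5 or less is a multiple of g(x), so all such bursts are detected.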

4.2.7 Bose-Chaudhuri-Hocquenghem (BCH) Codes

BCH codes are an important class of linear block codes for multiple error correction [1]. For a positive integer m ≥ 3, a t-error-correcting BCH (n, k) code can be constructed over the Galois field GF($2^m$) with the following requirements on the codeword width n and the data width k,

$$ n = 2^m - 1 \tag{4.40} $$

$$ k \geq n - mt \tag{4.41} $$

where GF($2^m$) is the extension field constructed from GF(2) with elements $\{0, 1, \alpha, \alpha^2, \ldots, \alpha^{2^m - 2}\}$, and α is a primitive element in GF($2^m$). Table 4.2 shows examples of BCH codes generated by primitive elements for m ≤ 7 with different error correction capabilities. The encoding process of BCH codes can be realized with an approach similar to that of cyclic codes. The generator polynomial g(x) of a t-error-correcting BCH code is defined as the least common multiple (LCM) of $\phi_1, \phi_3, \ldots, \phi_{2t-1}$,

$$ g(x) = \mathrm{LCM}\{\phi_1(x), \phi_3(x), \ldots, \phi_{2t-1}(x)\} \tag{4.42} $$

where $\phi_j$ is the minimal polynomial of $\alpha^j$ (0 < j < 2t). For a t-error-correcting BCH code, the generator polynomial g(x) has $\alpha, \alpha^2, \alpha^3, \ldots, \alpha^{2t}$ as its roots.


Fig. 4.11 Block diagram of BCH decoder

For example, the generator polynomial of the 3-error-correcting BCH (15, 5) code is obtained by multiplying the following minimal polynomials,

$$ \phi_1(x) = (x + \alpha)(x + \alpha^2)(x + \alpha^4)(x + \alpha^8) = 1 + x + x^4 \tag{4.43} $$

$$ \phi_3(x) = (x + \alpha^3)(x + \alpha^6)(x + \alpha^{12})(x + \alpha^9) = 1 + x + x^2 + x^3 + x^4 \tag{4.44} $$

$$ \phi_5(x) = (x + \alpha^5)(x + \alpha^{10}) = 1 + x + x^2 \tag{4.45} $$

Thus the generator polynomial g(x) is given by

$$ g(x) = \phi_1(x)\,\phi_3(x)\,\phi_5(x) = 1 + x + x^2 + x^4 + x^5 + x^8 + x^{10} \tag{4.46} $$
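The product in (4.46) can be verified with carry-less multiplication, representing each GF(2) polynomial as an integer whose bit i is the coefficient of x^i. The function name is hypothetical.

```python
def gf2_poly_mul(a, b):
    """Carry-less product of two GF(2) polynomials."""
    result = 0
    while b:
        if b & 1:
            result ^= a
        a <<= 1
        b >>= 1
    return result

phi1 = 0b10011    # 1 + x + x^4
phi3 = 0b11111    # 1 + x + x^2 + x^3 + x^4
phi5 = 0b111      # 1 + x + x^2
g = gf2_poly_mul(gf2_poly_mul(phi1, phi3), phi5)
exponents = [i for i in range(g.bit_length()) if (g >> i) & 1]
print(exponents)  # [0, 1, 2, 4, 5, 8, 10]
```

The degree of g(x) is 10 = n − k, as expected for the (15, 5) code.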

The decoding process of BCH codes is more complicated than the encoding process. Usually, the decoding process of BCH codes can be separated into four steps: calculating the syndromes, calculating the error location polynomial, finding the error locations, and flipping the erroneous bits, as shown in Fig. 4.11. Let

$$ \begin{aligned} c(x) &= c_0 + c_1 x + c_2 x^2 + \cdots + c_{n-1} x^{n-1} \\ r(x) &= r_0 + r_1 x + r_2 x^2 + \cdots + r_{n-1} x^{n-1} \\ e(x) &= e_0 + e_1 x + e_2 x^2 + \cdots + e_{n-1} x^{n-1} \end{aligned} \tag{4.47} $$

be the transmitted polynomial, the received polynomial and the error polynomial, respectively, so that

$$ r(x) = c(x) + e(x) \tag{4.48} $$

For a t-error-correcting BCH code, the 2t syndromes $S_j$ (1 ≤ j ≤ 2t) can be calculated as

$$ S_j = \sum_{i=0}^{n-1} r_i\, \alpha^{ij} \tag{4.49} $$


Fig. 4.12 An example of calculating the syndrome S3 of BCH code with m ¼ 4

Equation (4.49) can be written as

$$ S_j = \left( \cdots \left( (r_{n-1}\alpha^j + r_{n-2})\,\alpha^j + r_{n-3} \right)\alpha^j + \cdots \right)\alpha^j + r_0 \tag{4.50} $$

The calculation of the syndrome $S_j$ requires (n − 1) multiplications by the constant value $\alpha^j$ and (n − 1) additions. Because $r_i \in$ GF(2), $S_{2j}$ is equal to $S_j^2$. Figure 4.12 shows an example of a circuit calculating $S_3$ for m = 4 with $p(x) = x^4 + x + 1$. The registers $s_i$ (0 ≤ i ≤ 3) are initialized to zero. Then the received bits $r_i$ (0 ≤ i ≤ 14) and the registers s0 to s3 are shifted every clock cycle. After 15 clock cycles, $S_3$ is available in the s0 to s3 registers.

Syndromes $S_j$ can also be obtained from the remainder of the division of the received polynomial r(x) by the minimal polynomial $\phi_j(x)$. That is,

$$ r(x) = a_j(x)\,\phi_j(x) + b_j(x) \tag{4.51} $$

Thus,

$$ S_j = b_j(\alpha^j) \tag{4.52} $$

The minimal polynomials of $\alpha, \alpha^2, \alpha^4, \ldots$ are the same, so the same register architecture can be used to calculate the syndromes $S_1, S_2, S_4, \ldots$. The same holds for $S_3, S_6, \ldots$, and so on. Figure 4.13 shows an example of the circuit to calculate $S_3$ for m = 4. The minimal polynomial of $\alpha^3$ is $\phi_3(x) = 1 + x + x^2 + x^3 + x^4$. Let $b(x) = b_0 + b_1 x + b_2 x^2 + b_3 x^3$ be the remainder on dividing r(x) by $\phi_3(x)$. Then

$$ S_3 = b(\alpha^3) = b_0 + b_1\alpha^3 + b_2\alpha^6 + b_3\alpha^9 = b_0 + b_3\alpha + b_2\alpha^2 + (b_1 + b_2 + b_3)\,\alpha^3 \tag{4.53} $$

In Fig. 4.13, the received vector r(x) is first divided by $\phi_3(x)$ to generate b(x), and then $S_3$ is calculated according to (4.53). The result is obtained from the registers b0 to b3 after 15 clock cycles.
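The two syndrome calculation methods can be compared in a small software model of GF(16) built from p(x) = x^4 + x + 1. In this sketch a field element is a 4-bit integer whose bit i is the coefficient of α^i; the table-based arithmetic and all function names are assumptions made for illustration, not a model of the shift-register circuits in Figs. 4.12 and 4.13.

```python
# Power table of alpha in GF(16), p(x) = x^4 + x + 1.
EXP = [1] * 30
for i in range(1, 30):
    v = EXP[i - 1] << 1
    EXP[i] = v ^ 0b10011 if v & 0b10000 else v
LOG = {EXP[i]: i for i in range(15)}

def gf_mul(a, b):
    return 0 if a == 0 or b == 0 else EXP[LOG[a] + LOG[b]]

def syndrome_horner(r, j):
    """Eq. (4.50): Horner evaluation of r(x) at alpha^j, with r = [r0, ..., r14]."""
    s = 0
    for bit in reversed(r):
        s = gf_mul(s, EXP[j]) ^ bit
    return s

def poly_mod(a, g):
    """Remainder of a(x) / g(x) over GF(2); bit i is the coefficient of x^i."""
    dg = g.bit_length() - 1
    while a.bit_length() - 1 >= dg:
        a ^= g << (a.bit_length() - 1 - dg)
    return a

def s3_by_remainder(r):
    """Eqs. (4.51)-(4.53): divide by phi_3(x), then map b(x) to S3."""
    b = poly_mod(sum(bit << i for i, bit in enumerate(r)), 0b11111)
    b0, b1, b2, b3 = ((b >> i) & 1 for i in range(4))
    return b0 | (b3 << 1) | (b2 << 2) | ((b1 ^ b2 ^ b3) << 3)

# A codeword of the (15, 5) BCH code: the generator polynomial of (4.46) itself.
codeword = [1 if i in (0, 1, 2, 4, 5, 8, 10) else 0 for i in range(15)]
received = codeword.copy()
received[4] ^= 1                                   # one error at position 4
print([syndrome_horner(codeword, j) for j in range(1, 7)])      # [0, 0, 0, 0, 0, 0]
print(syndrome_horner(received, 3), s3_by_remainder(received))  # 15 15
```

Both methods return the same value, which here is α^12 because a single error at position 4 gives S3 = α^(3·4).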


Fig. 4.13 Another method to calculate the syndrome S3 of BCH code with m ¼ 4

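The syndrome can also be calculated by multiplying the received codeword with the transpose of the parity check matrix $H_{2t \times n}$,

$$ s_{1 \times 2t} = v_{1 \times n}\, H^{T}_{2t \times n} \tag{4.54} $$

where $v = (v_0, v_1, \ldots, v_{n-1})$ is the received codeword. $H_{2t \times n}$ can be described as follows,

$$ H_{2t \times n} = \begin{bmatrix} 1 & \alpha & \alpha^2 & \alpha^3 & \cdots & \alpha^{n-1} \\ 1 & (\alpha^2) & (\alpha^2)^2 & (\alpha^2)^3 & \cdots & (\alpha^2)^{n-1} \\ 1 & (\alpha^3) & (\alpha^3)^2 & (\alpha^3)^3 & \cdots & (\alpha^3)^{n-1} \\ \vdots & \vdots & \vdots & \vdots & & \vdots \\ 1 & (\alpha^{2t}) & (\alpha^{2t})^2 & (\alpha^{2t})^3 & \cdots & (\alpha^{2t})^{n-1} \end{bmatrix} \tag{4.55} $$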

The second stage of the BCH decoding process is finding the coefficients of the error-location polynomial $\sigma(x) = \sigma_0 + \sigma_1 x + \cdots + \sigma_t x^t$ using the syndromes $S_j$ (1 ≤ j ≤ 2t). The relationship between the syndromes and the coefficients $\sigma_j$ is given by

$$ \sum_{j=0}^{t} S_{t+i-j}\, \sigma_j = 0 \qquad (i = 1, \ldots, t) \tag{4.56} $$

The roots of σ(x) give the error positions. The coefficients of σ(x) can be calculated by methods [1] such as the Peterson-Gorenstein-Zierler algorithm, Euclid's algorithm, and the Berlekamp-Massey algorithm (BMA). In this section, we mainly introduce the BMA because it is the most efficient method in practice. In the BMA, the error location polynomial σ(x) is found by t − 1 recursive iterations. During each iteration r, the degree of σ(x) is usually incremented by one. The discrepancy $d_r$ in each iteration is defined as

$$ d_r = \sum_{j=0}^{t} S_{2r-j+1}\, \sigma_j \tag{4.57} $$

If the iteration number r is greater than or equal to the number of errors $t_a$ that have actually occurred, the discrepancy $d_r$ in (4.57) is equal to zero. If r < $t_a$, the discrepancy $d_r$ calculated in (4.57) is usually nonzero. Then, the degree and coefficients of σ(x) are modified based on the value of $d_r$. The purpose of the BMA is to compute the σ(x) of lowest degree that meets the requirement of (4.56). The BMA with inversion is given below. Initial values:

$$ \begin{aligned} d_p &= \begin{cases} 1 & \text{if } S_1 = 0 \\ S_1 & \text{if } S_1 \neq 0 \end{cases} \\ \sigma^{(0)}(x) &= 1 + S_1 x \\ b^{(1)}(x) &= \begin{cases} x^3 & \text{if } S_1 = 0 \\ x^2 & \text{if } S_1 \neq 0 \end{cases} \\ l_1 &= \begin{cases} 0 & \text{if } S_1 = 0 \\ 1 & \text{if } S_1 \neq 0 \end{cases} \\ r &= 1 \end{aligned} \tag{4.58} $$

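The error location polynomial σ(x) is then calculated using the following set of equations:

$$ \begin{aligned} d_r &= \sum_{j=0}^{t} \sigma_j^{(r-1)}\, S_{2r-j+1} \\ \sigma^{(r)}(x) &= \begin{cases} \sigma^{(r-1)}(x) & \text{if } d_r = 0 \\ \sigma^{(r-1)}(x) - d_p^{-1} d_r\, b^{(r)}(x) & \text{if } d_r \neq 0 \end{cases} \\ b^{(r+1)}(x) &= \begin{cases} x^2\, b^{(r)}(x) & \text{if } d_r = 0 \text{ or } r < l_r \\ x^2\, \sigma^{(r-1)}(x) & \text{if } d_r \neq 0 \text{ and } r \geq l_r \end{cases} \\ l_{r+1} &= \begin{cases} l_r & \text{if } d_r = 0 \text{ or } r < l_r \\ 2r - l_r + 1 & \text{if } d_r \neq 0 \text{ and } r \geq l_r \end{cases} \\ d_p &= \begin{cases} d_p & \text{if } d_r = 0 \text{ or } r < l_r \\ d_r & \text{if } d_r \neq 0 \text{ and } r \geq l_r \end{cases} \\ r &= r + 1 \end{aligned} \tag{4.59} $$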
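The recursion with inversion can be sketched in software for the 3-error-correcting (15, 5) code. This is an illustrative model under stated assumptions: GF(16) is built from p(x) = x^4 + x + 1 with table-based arithmetic, the discrepancy at iteration r is computed from the polynomial of the previous iteration, received bit r_i is the coefficient of x^i, and the error positions are obtained by exhaustive root search. All function names are hypothetical, and the code does not model the circuit of Fig. 4.14.

```python
EXP = [1] * 30
for i in range(1, 30):
    v = EXP[i - 1] << 1
    EXP[i] = v ^ 0b10011 if v & 0b10000 else v
LOG = {EXP[i]: i for i in range(15)}

def mul(a, b):
    return 0 if a == 0 or b == 0 else EXP[LOG[a] + LOG[b]]

def inv(a):
    return EXP[15 - LOG[a]]

def syndromes(r, t):
    """S[j] for j = 1..2t (S[0] unused), Eq. (4.49)."""
    S = [0] * (2 * t + 1)
    for j in range(1, 2 * t + 1):
        for i, bit in enumerate(r):
            if bit:
                S[j] ^= EXP[(i * j) % 15]
    return S

def bma_with_inversion(S, t):
    """Returns [sigma_0, ..., sigma_t] using the initial values and updates above."""
    size = 2 * t + 2                               # room for the x^2 shifts of b(x)
    sigma = [1, S[1]] + [0] * (size - 2)           # sigma^(0)(x) = 1 + S1 x
    b = [0] * size
    b[2 if S[1] else 3] = 1                        # b^(1)(x) = x^2 or x^3
    d_p = S[1] if S[1] else 1
    l_r = 1 if S[1] else 0
    for r in range(1, t):                          # r = 1, ..., t - 1
        d_r = 0
        for j in range(min(t, 2 * r) + 1):
            d_r ^= mul(sigma[j], S[2 * r - j + 1])
        prev = sigma                               # sigma^(r-1)(x)
        if d_r:
            scale = mul(d_r, inv(d_p))
            sigma = [s ^ mul(scale, c) for s, c in zip(prev, b)]
        if d_r and r >= l_r:
            b, l_r, d_p = [0, 0] + prev[:-2], 2 * r - l_r + 1, d_r
        else:
            b = [0, 0] + b[:-2]
    return sigma[:t + 1]

def error_positions(sigma, n=15):
    """A root alpha^j of sigma(x) is the reciprocal of the locator alpha^(n-j)."""
    found = []
    for j in range(n):
        value = 0
        for i, coeff in enumerate(sigma):
            value ^= mul(coeff, EXP[(i * j) % n])
        if value == 0:
            found.append((n - j) % n)
    return sorted(found)

received = [0] * 15                                # all-zero codeword ...
for pos in (1, 5, 12):                             # ... with three errors
    received[pos] ^= 1
print(error_positions(bma_with_inversion(syndromes(received, 3), 3)))  # [1, 5, 12]
```

In GF(2^m) subtraction equals addition, so the update of σ(x) is implemented with XOR.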


Fig. 4.14 Berlekamp Massey Algorithm with inversion

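The calculations in (4.59) are carried out for r = 1, ..., t − 1. Figure 4.14 shows a circuit implementation of the BMA. The error location polynomial σ(x) is obtained in the σ registers after t − 1 iterations. In some applications it may be useful to implement the BMA without inversion [8]. For the inversionless BMA, the initial conditions can be the same as those for the BMA with inversion given in (4.58). The error location polynomial is then calculated using the following equations:

$$ \begin{aligned} d_r &= \sum_{j=0}^{t} \sigma_j^{(r-1)}\, S_{2r-j+1} \\ \sigma^{(r)}(x) &= \begin{cases} d_p\, \sigma^{(r-1)}(x) & \text{if } d_r = 0 \\ d_p\, \sigma^{(r-1)}(x) - d_r\, b^{(r)}(x) & \text{if } d_r \neq 0 \end{cases} \\ b^{(r+1)}(x) &= \begin{cases} x^2\, b^{(r)}(x) & \text{if } d_r = 0 \text{ or } r < l_r \\ x^2\, \sigma^{(r-1)}(x) & \text{if } d_r \neq 0 \text{ and } r \geq l_r \end{cases} \\ l_{r+1} &= \begin{cases} l_r & \text{if } d_r = 0 \text{ or } r < l_r \\ 2r - l_r + 1 & \text{if } d_r \neq 0 \text{ and } r \geq l_r \end{cases} \\ d_p &= \begin{cases} d_p & \text{if } d_r = 0 \text{ or } r < l_r \\ d_r & \text{if } d_r \neq 0 \text{ and } r \geq l_r \end{cases} \\ r &= r + 1 \end{aligned} \tag{4.60} $$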


Fig. 4.15 The implementation of Chien’s search algorithm

The inversionless BMA is more complicated and requires a greater number of multiplications than the BMA with inversion. On the other hand, the BMA with inversion takes more clock cycles to complete the same calculation. Therefore the inversionless algorithm can be implemented to meet high performance requirements. For SEC and DEC BCH codes, the coefficients of σ(x) can be obtained directly without using the BMA. This is because for SEC BCH codes

$$ \sigma(x) = 1 + S_1 x \tag{4.61} $$

and for DEC BCH codes

$$ \sigma(x) = 1 + \sigma_1 x + \sigma_2 x^2 = 1 + S_1 x + \left( S_1^2 + S_3 S_1^{-1} \right) x^2 \tag{4.62} $$

The calculation of σ(x) directly from the syndromes can be extended to triple-error-correcting (TEC) BCH codes. However, this method quickly becomes too complex to implement in hardware as the error correction capability increases.

The third step in decoding BCH codes is to find the erroneous bit locations. These values are the reciprocals of the roots of σ(x) and can be found simply by substituting $1, \alpha, \alpha^2, \ldots, \alpha^{n-1}$ into σ(x). A method of finding the error locations has been presented by Chien. In the Chien search algorithm, the sum

$$ \sigma_0 + \sigma_1 \alpha^{j} + \sigma_2 \alpha^{2j} + \cdots + \sigma_t \alpha^{tj} \qquad (j = 0, 1, \ldots, k - 1) \tag{4.63} $$

is evaluated every clock cycle. If the sum equals zero for clock cycle j, the received bit $r_{n-j-1}$ is erroneous. Figure 4.15 shows a hardware implementation of the Chien search algorithm. The registers $c_0, c_1, \ldots, c_t$ are initialized with the coefficients of the error location polynomial $\sigma_0, \sigma_1, \ldots, \sigma_t$. Then the sum $\sum_{i=0}^{t} c_i$ is calculated, and if this value equals zero, an error has been detected. At the same time, each value in the $c_i$ register is multiplied by $\alpha^i$ (using a constant multiplier). The sum $\sum_{i=0}^{t} c_i$ is calculated again on the next clock cycle. The above operations are carried out for every transmitted message bit.
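The direct calculation of σ(x) for a DEC BCH code and the register-style root search can be sketched together. This model makes several assumptions: GF(16) is built from p(x) = x^4 + x + 1, received bit r_i is taken as the coefficient of x^i, all n values of j are scanned rather than only the message positions, and each root α^j is mapped to a bit position through its reciprocal. The bit numbering of the circuit in Fig. 4.15 may differ, and the function names are hypothetical.

```python
EXP = [1] * 30
for i in range(1, 30):
    v = EXP[i - 1] << 1
    EXP[i] = v ^ 0b10011 if v & 0b10000 else v
LOG = {EXP[i]: i for i in range(15)}

def mul(a, b):
    return 0 if a == 0 or b == 0 else EXP[LOG[a] + LOG[b]]

def inv(a):
    return EXP[15 - LOG[a]]

def sigma_dec(S1, S3):
    """Eq. (4.62); reduces to Eq. (4.61) when S3 = S1^3. Requires S1 != 0."""
    return [1, S1, mul(S1, S1) ^ mul(S3, inv(S1))]

def chien_roots(sigma, n=15):
    """Registers c_i start at sigma_i and are multiplied by alpha^i each clock."""
    c = list(sigma)
    roots = []
    for j in range(n):
        total = 0
        for value in c:
            total ^= value
        if total == 0:
            roots.append(j)                        # sigma(alpha^j) = 0
        c = [mul(ci, EXP[i]) for i, ci in enumerate(c)]
    return roots

def dec_error_positions(error_bits):
    """Syndromes S1, S3 of an error pattern, then sigma(x), then root search."""
    S1 = S3 = 0
    for i in error_bits:
        S1 ^= EXP[i % 15]
        S3 ^= EXP[(3 * i) % 15]
    roots = chien_roots(sigma_dec(S1, S3))
    return sorted((15 - j) % 15 for j in roots)    # reciprocal of each root

print(chien_roots(sigma_dec(15, 11)))              # [5, 12]
print(dec_error_positions([3, 10]))                # [3, 10]
```

In the first call, S1 = 15 and S3 = 11 are the syndromes produced by errors at positions 3 and 10.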


The last step of decoding BCH codes is to correct the erroneous bits once the error positions have been found by the Chien search algorithm. The error correction can be implemented using XOR gates. A t-error-correcting (n, k, t) BCH code can be shortened by eliminating a certain number of information bits to construct a shortened t-error-correcting BCH code with the same number of redundant bits.

4.2.8 Reed-Solomon (RS) Codes

RS codes are a subclass of nonbinary BCH codes [1, 9] that are good at correcting multiple symbol errors. For an RS code with symbols from GF(q), the codeword width n in symbols and the number of symbols k in the input data are defined by the following parameters,

$$ n = q - 1 \tag{4.64} $$

$$ k = n - 2t \tag{4.65} $$

There are n − k parity symbols, and the code can correct t symbol errors. q is generally set to $2^m$, so that the code symbols are elements of GF($2^m$). Let α be a primitive element in GF(q). The generator polynomial g(x) of a t-symbol-correcting RS code with $\alpha, \alpha^2, \alpha^3, \ldots, \alpha^{2t}$ as its roots can be expressed as

$$ g(x) = (x - \alpha)(x - \alpha^2) \cdots (x - \alpha^{2t}) = g_0 + g_1 x + g_2 x^2 + \cdots + g_{2t-1} x^{2t-1} + x^{2t} \tag{4.66} $$

where $g_i$ (0 ≤ i < 2t) is a symbol from GF(q). The encoding of RS codes can be realized using an LFSR. The parity check symbols of RS codes are the remainder of the division of the right-shifted input polynomial $m(x)\, x^{2t}$ by the generator polynomial g(x). The decoding process of RS codes is similar to that of BCH codes. After the syndrome calculation, the Berlekamp-Massey algorithm can be used to calculate the coefficients of the error locator polynomial σ(x) and the error magnitude polynomial Ω(x). The Chien search algorithm can be used to identify the error positions, and the Forney algorithm can be used to calculate the error values. The error correction is done by adding the error values to the received codeword.
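The generator polynomial of (4.66) and the systematic encoding by division can be sketched for a small case. The example assumes q = 16 (GF(16) built from p(x) = x^4 + x + 1) and t = 2, which gives an RS (15, 11) code; polynomials are lists of field elements with the lowest-order coefficient first, and all function names are hypothetical.

```python
EXP = [1] * 30
for i in range(1, 30):
    v = EXP[i - 1] << 1
    EXP[i] = v ^ 0b10011 if v & 0b10000 else v
LOG = {EXP[i]: i for i in range(15)}

def mul(a, b):
    return 0 if a == 0 or b == 0 else EXP[LOG[a] + LOG[b]]

def poly_mul(p, q):
    out = [0] * (len(p) + len(q) - 1)
    for i, a in enumerate(p):
        for j, b in enumerate(q):
            out[i + j] ^= mul(a, b)
    return out

def rs_generator(t):
    """g(x) = (x - alpha)(x - alpha^2)...(x - alpha^(2t)); minus equals plus in GF(2^m)."""
    g = [1]
    for i in range(1, 2 * t + 1):
        g = poly_mul(g, [EXP[i], 1])
    return g

def poly_eval(p, x):
    y = 0
    for coeff in reversed(p):
        y = mul(y, x) ^ coeff
    return y

def rs_encode(msg, g):
    """Parity symbols = remainder of m(x) x^(2t) divided by the monic g(x)."""
    r = len(g) - 1
    rem = [0] * r + list(msg)                  # m(x) * x^(2t)
    for i in range(len(rem) - 1, r - 1, -1):   # cancel terms from the top down
        factor = rem[i]
        if factor:
            for j, gj in enumerate(g):
                rem[i - r + j] ^= mul(factor, gj)
    return rem[:r] + list(msg)                 # parity symbols, then message symbols

g = rs_generator(2)
codeword = rs_encode(list(range(1, 12)), g)    # 11 arbitrary 4-bit symbols
print(len(g) - 1, len(codeword))                               # 4 15
print([poly_eval(codeword, EXP[i]) for i in range(1, 5)])      # [0, 0, 0, 0]
```

The codeword polynomial evaluates to zero at α, α^2, α^3 and α^4, confirming that it is a multiple of g(x).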

4.2.9 Hamming Product Codes

Product codes were first presented in 1954 [10]. The concept of product codes is very simple. Long and powerful block codes can be constructed by serially concatenating two or more simple component codes [1, 11, 12].


Fig. 4.16 Encoding process of product codes

Figure 4.16 shows the construction process of two-dimensional product codes. Assume that two component codes C1(n1, k1, d1) and C2(n2, k2, d2) are used, where n1, k1 and d1 are the codeword width, input data width, and minimum Hamming distance of the code C1, respectively, and n2, k2 and d2 are the codeword width, input data width, and minimum Hamming distance of the code C2, respectively. The product code Cp(n1 × n2, k1 × k2, d1 × d2) is constructed from C1 and C2 as follows:

1. Arrange the input data in a matrix of k2 rows and k1 columns.
2. Encode the k2 rows using component code C1. The result is an array of k2 rows and n1 columns.
3. Encode the n1 columns using component code C2.

Product codes have a larger Hamming distance than their component codes. If the component codes C1 and C2 have minimum Hamming distances d1 and d2, respectively, then the minimum Hamming distance of the product code Cp is the product d1 × d2, which greatly increases the error correction capability. Product codes can be constructed by a serial concatenation of simple component codes and a row-column block interleaver, in which the input sequence is written into the matrix row-wise and read out column-wise. Product codes can efficiently correct both random and burst errors. For example, if the received product codeword has errors located in a number of rows not exceeding ⌊(d2 − 1)/2⌋ and no errors in the other rows, all the errors can be corrected during column decoding. The simplest two-dimensional product codes are single-parity check (SPC) product codes [1]. SPC product codes only guarantee correction of one error. Product codes whose component codes are Hamming or extended Hamming codes are known as Hamming product codes.
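The three construction steps can be sketched for a Hamming product code with extended Hamming (8, 4) component codes, giving an 8 × 8 codeword from 4 × 4 data. The component encoder below is an assumption: it uses the parity part of the generator matrix in (4.30) plus an overall parity bit. The function names and the sample data are hypothetical.

```python
P = [[1, 1, 0], [1, 1, 1], [1, 0, 1], [0, 1, 1]]   # parity part of G in (4.30)

def ext_hamming_encode(m):
    """(8, 4) extended Hamming: 4 data bits, 3 parity bits, 1 overall parity bit."""
    parity = [sum(m[i] * P[i][j] for i in range(4)) % 2 for j in range(3)]
    word = list(m) + parity
    return word + [sum(word) % 2]

def product_encode(data):
    """data: k2 x k1 = 4 x 4 bit matrix -> n2 x n1 = 8 x 8 product codeword."""
    rows = [ext_hamming_encode(row) for row in data]                  # step 2
    cols = [ext_hamming_encode([row[j] for row in rows]) for j in range(8)]  # step 3
    return [[cols[j][i] for j in range(8)] for i in range(8)]         # transpose back

data = [[1, 0, 1, 1], [0, 1, 1, 0], [0, 0, 0, 1], [1, 1, 0, 0]]      # step 1
cp = product_encode(data)
rows_ok = all(ext_hamming_encode(row[:4]) == row for row in cp)
cols_ok = all(
    ext_hamming_encode([cp[i][j] for i in range(4)]) == [cp[i][j] for i in range(8)]
    for j in range(8)
)
print(rows_ok, cols_ok)   # True True
```

Because the component codes are linear, the parity rows produced by column encoding are themselves codewords of the row code, so every row and every column of the result is a valid component codeword.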


Fig. 4.17 An example of row and column status vectors after first and second decoding stages [13]

Hamming product codes can be decoded using a two-step row-column (or column-row) decoding algorithm [1]. Unfortunately, this decoding method fails to correct certain error patterns (e.g., rectangular four-bit errors). A three-stage pipelined Hamming product code decoding method is proposed in [13]. In contrast to the two-step row-column decoding method, the three-stage pipelined decoding method uses a row status vector and a column status vector to record the behavior of the row and column decoders. Instead of passing only the coded data between the row and column decoders, these row and column status vectors are passed between stages to help make decoding decisions [13]. The simplified row and column status vector implementation can be described as follows. The ith (1 ≤ i ≤ n2) position in the row status vector is set to "1" when there are detectable errors (regardless of whether the errors can be corrected or not) in the ith row; otherwise that position is set to "0". For the column status vector, there are two separate conditions that can cause the jth (1 ≤ j ≤ n1) position to be set to "1": (a) when an error is detectable but not correctable, or (b) when an error is correctable, but the row where the error occurs has a status value "0". Otherwise, that position is set to "0". Figure 4.17 shows an example of the row and column status vectors after the first and second stage decoding processes. Extended Hamming codes are used as row and column component codes.


Fig. 4.18 Block diagram of proposed three-stage pipelined decoding algorithm [13]

Figure 4.18 describes the three-stage pipelined Hamming product code decoding process. After all status vectors are initialized to zero, decoding proceeds as follows:

Step 1: Row decoding of the received encoded matrix. If the errors in a row are correctable, the error bit indicated by the syndrome is flipped. The row status vector is set to “0” if the syndrome is zero and “1” if the syndrome is nonzero.

Step 2: Column decoding of the updated matrix. The error correction process is similar to Step 1. The column status vector is calculated using both the column error vector and the row status vector from Step 1.

Step 3: Row decoding of the matrix after the changes from Step 2. The syndrome for each row is recalculated. If the remaining errors in a row are correctable, the row syndrome is used to perform the correction. If the errors in a row are still detectable but uncorrectable, the column status vector from Step 2 is used to indicate which columns need to be corrected.

To implement the three-stage decoding algorithm, the conventional extended Hamming decoder must be modified. The modified extended Hamming decoder generates row/column status values that are later used to improve the overall error correction capability of the decoding process. Generating the row status value in the first row decoding process is simple: a row status is set to “1” if the syndrome value of this row is nonzero. This can be implemented using OR gates with all syndrome bits as inputs. Figure 4.19 shows a block diagram of the modified extended Hamming decoder used in the column decoding process. A column status is set to “1” if the output of double error detection is “1” or if there is at least one bit position in which the error vector value is “1” and the row status is “0”. Figure 4.20 shows a block diagram of the modified extended Hamming decoder used in the second row decoding process. Unlike the normal extended Hamming decoder, the error correction is decided by both the error vector and the column status vector. A bit in the codeword is considered erroneous if the error vector is “1” in that position, or if the output of double error detection is “1” and the column status vector in that position is also “1”.
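The status rules described above are simple Boolean functions, and a small sketch may help make them concrete. The code below is an illustration only, not the circuit of [13]: the function names are hypothetical, bits are modeled as Python integers, and the last rule is read as "error vector, OR (double error AND column status)", which is an assumption about operator precedence in the sentence above.

```python
from typing import List


def row_status(syndrome_bits: List[int]) -> int:
    """First row decoding: status is the OR of all syndrome bits."""
    return int(any(syndrome_bits))


def column_status(double_error: int, error_vector: List[int],
                  row_status_vec: List[int]) -> int:
    """Column decoding: flag the column if a double error is detected, or if
    the decoder points at a bit whose row reported no error in Step 1."""
    suspicious = any(e == 1 and s == 0
                     for e, s in zip(error_vector, row_status_vec))
    return int(double_error == 1 or suspicious)


def second_row_flips(double_error: int, error_vector: List[int],
                     column_status_vec: List[int]) -> List[int]:
    """Second row decoding: positions to flip (1 means flip this bit)."""
    return [int(e == 1 or (double_error == 1 and c == 1))
            for e, c in zip(error_vector, column_status_vec)]
```

For instance, a row in which double error detection fires and columns 1 and 3 carry status “1” would have exactly those two positions flipped by `second_row_flips`.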

Fig. 4.19 Implementation of the column decoder [13]

Fig. 4.20 Implementation of the row decoder [13]

Compared to a conventional two-step row-column decoding method, the decoding method in [13] achieves a better error correction capability. For example, in the Hamming product code Cp(8 × 8, 4 × 4), the proposed decoding method can correct 100% of error patterns consisting of five or fewer errors.

References

1. Lin S, Costello DJ (2004) Error control coding, 2nd edn. Prentice Hall, Englewood Cliffs
2. Sridhara S, Shanbhag RN (2005) Coding for system-on-chip networks: a unified framework. IEEE Trans Very Large Scale Integr (VLSI) Syst 12:655–667
3. Rossi D, Metra C, Nieuwland KA, Atul K (2005) Exploiting ECC redundancy to minimize crosstalk impact. IEEE Des Test Comput 22:59–70
4. Zimmer H, Jantsch A (2003) A fault model notation and error-control scheme for switch-to-switch buses in a network-on-chip. In: Proceedings international conference hardware/software codesign and system synthesis (CODES-ISSS), pp 188–193
5. Yu Q, Ampadu P (2008) Adaptive error control for NoC switch-to-switch links in a variable noise environment. In: Proceedings IEEE international symposium on defect and fault tolerance in VLSI system (DFT), pp 352–360
6. Pei BT, Zukowski C (1992) High speed parallel CRC circuits in VLSI. IEEE Trans Commun 40:653–657
7. Shieh DM, Sheu HM, Chen HC, Lo FH (2001) A systematic approach for parallel CRC computations. J Inf Sci Eng 17:445–461
8. Burton OH (1971) Inversionless decoding of binary BCH codes. IEEE Trans Inf Theory 17:464–466
9. Reed SI, Solomon G (1960) Polynomial codes over certain finite fields. J Soc Ind Appl Math 8:300–304
10. Elias P (1954) Error-free coding. IEEE Trans Inf Theory 4:29–37
11. Fujiwara E (2006) Code design for dependable systems: theory and practical applications. Wiley Interscience, Hoboken
12. Pyndiah R (1998) Near-optimum decoding of product codes: block turbo codes. IEEE Trans Commun 46:1003–1010
13. Fu B, Ampadu P (2009) On Hamming product codes with type-II hybrid ARQ for on-chip interconnects. IEEE Trans Circuits Syst I, Reg Papers 9:2042–2054

Chapter 5

Energy Efficient Error Control Implementation

Error control is applied to improve the reliability of on-chip communication. However, on-chip interconnects still face the challenge of increased energy consumption, so it is important to consider energy efficiency when realizing error control. In this chapter, we introduce design techniques that efficiently balance the energy efficiency and reliability of on-chip interconnects.

5.1 Error Control Coding with Low Link Swing Voltage System

A large portion of the total chip power can be consumed by global links [1, 2]. In [3–6], error control schemes are combined with a low link swing voltage system to trade off the reliability and energy of on-chip interconnects. Figure 5.1 shows a possible implementation of error control codes with a low link swing voltage system. Level shifting circuits are needed to switch between different voltages at the link and receiver. Figure 5.2 shows a simple implementation of these level shifting circuits [7]. Triple modular redundancy is implemented to protect the ACK/NACK and control signals against errors. When error control coding is combined with a low link swing voltage system, the energy consumption is reduced because the error control codes allow the system to meet the same communication reliability using a lower link swing voltage than the uncoded system. For a given reliability requirement Preq, the uncoded system and the system with error control coding meet the required reliability at the raw bit error probabilities εunc and εecc, respectively. From Chap. 1, the raw bit error probabilities εunc and εecc are functions of the link swing voltage Vswing and the

B. Fu and P. Ampadu, Error Control for Network-on-Chip Links, DOI 10.1007/978-1-4419-9313-7_5, # Springer Science+Business Media, LLC 2012

Fig. 5.1 Integration of error control codes with low link swing voltage system

Fig. 5.2 Hardware implementation of level shifter circuits

standard noise voltage deviation σN. If the standard noise voltage deviation σN is the same, the link swing voltages of the coded and uncoded systems are related by [3]

$$V_{swing}^{ecc} = V_{swing}^{unc} \, \frac{Q^{-1}(\varepsilon_{ecc})}{Q^{-1}(\varepsilon_{unc})} \qquad (5.1)$$

where Q^{-1}(·) is the inverse of the Q function. Figure 5.3 compares the minimum link swing voltage of different coding schemes to that of an uncoded link. In the error detection (ED) scheme, Hamming codes are used to detect errors. Figure 5.3 shows that error control coding greatly reduces the link swing voltage for the same reliability requirements. The unencoded link has to use the highest link swing voltage. Figure 5.4 shows the energy consumption of different coding schemes with different link lengths. The energy consumption includes the codec energy and link energy. The link is modeled for a 0.18 μm technology. The same reliability is required for all the schemes. Figure 5.4a shows that the use of an error control scheme can effectively reduce the energy consumption for long links. In this case, the link energy dominates the total energy consumption. The reduction of link swing voltage by using

Fig. 5.3 Minimum link swing voltage needed by each coding scheme to meet a predefined communication reliability requirement [4]

error control coding can result in a total energy reduction. For short links, the codec energy overhead is comparable to the link energy and reduces the benefits of using error control coding. In Fig. 5.4b, we can see that the error control scheme still achieves energy benefits for high reliability requirements. As the technology scales, the codec energy consumption becomes smaller compared to the link energy consumption. The combination of error control coding with a low link swing voltage system is therefore a promising approach to balancing reliability and energy consumption.
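The swing voltage relation in (5.1) is easy to evaluate numerically. The sketch below is only an illustration: the function names and the sample error probabilities are hypothetical, and the inverse Q function is obtained from the standard normal distribution through the identity Q^{-1}(ε) = −Φ^{-1}(ε).

```python
from statistics import NormalDist


def q_inv(eps: float) -> float:
    """Inverse Gaussian tail function: returns x such that Q(x) = eps."""
    # Q(x) = 1 - Phi(x) = Phi(-x), hence Q^{-1}(eps) = -Phi^{-1}(eps).
    return -NormalDist().inv_cdf(eps)


def coded_swing(v_unc: float, eps_ecc: float, eps_unc: float) -> float:
    """Swing voltage of the coded link according to Eq. (5.1)."""
    return v_unc * q_inv(eps_ecc) / q_inv(eps_unc)


# Hypothetical operating point: the uncoded link must reach a raw bit error
# probability of 1e-20, while the code tolerates 1e-6 for the same reliability.
v_ecc = coded_swing(v_unc=1.0, eps_ecc=1e-6, eps_unc=1e-20)
```

Because the coded link tolerates a larger raw bit error probability, `q_inv(eps_ecc)` is smaller than `q_inv(eps_unc)` and the resulting swing voltage is below that of the uncoded link.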

5.2 Error Control Coding with Dynamic Voltage Swing Scaling System

In [8], a self-calibrating transmission scheme for on-chip interconnect is proposed. In this method, the dynamic voltage swing scaling (DVSS) technique is combined with error control codes to trade off energy, throughput and reliability of on-chip interconnects. The principle of this method is to run the system with a more aggressive voltage scaling scheme, which takes advantage of the use of error control codes. Figure 5.5 shows a possible architecture for the self-calibrating transmission system. In this architecture, the data transmission is pipelined into three stages – encoder, synchronization and decoder. The input data is first encoded using error control codes and then transmitted together with parity check bits. An operating point controller can dynamically adjust the data transmission frequency and the

Fig. 5.4 Energy consumption of different coding schemes (a) Link lengths are several centimeters (b) link lengths are several millimeters [4]

Fig. 5.5 The self-calibrating transmission system (a) The concept (b) A possible implementation [8]

link swing voltage according to traffic information and error detection rate. Error control codes are used to detect on-chip communication errors. If errors are detected, a go-back-N ARQ scheme is applied to ensure that the data is transmitted correctly. The error control decoder also sends the error information to the operating point controller, which uses this feedback to select a proper link swing voltage. The self-calibrating transmission scheme reduces the energy consumption by providing the minimum link swing voltage at which the on-chip communication can achieve the targeted bandwidth for a given data transmission error rate. The self-calibrating transmission scheme uses error detection and ARQ to guarantee reliable communication. It is important to select a proper error coding method, which can efficiently detect both logic and timing errors. Figure 5.6 shows the error detection method used. The error detection method is separated into two parts. A CRC-8 code with generator polynomial x^8 + x^2 + x + 1 is used to detect logic errors. CRC codes alone are not efficient at detecting timing errors. For example, if the clock cycle time is much less than the data transition time on the link, the data

Fig. 5.6 Error detection method used in self-calibrating transmission system [8]

latched in two successive clock cycles may have the same value. These data are both valid codewords, so the CRC code cannot recognize that the second copy is incorrect. An additional bit, which alternates between 0 and 1 every clock cycle, is used to detect this case. This additional bit ensures that the encoded data at any two successive clock cycles are always different. The parity bits of the CRC-8 code are transmitted with the data. The additional bit is not transmitted. It is generated independently at the transmitter and receiver sides, as shown in Fig. 5.6. Figure 5.7 shows the control algorithm used to select the best frequency and link swing voltage point in the self-calibrating transmission scheme. The controller performs three tasks independently. First, the controller needs to find the lowest swing voltage for a given frequency. The controller does this by monitoring the error rate of data transmission. For a given frequency, a swing voltage is selected and applied to the system. If the system can operate correctly for a period of time (e.g. 500 or 1,000 clock cycles), then the controller attempts to reduce the swing voltage. If the error rate at the lower swing voltage is larger than a threshold value, the system returns to a more conservative mode by increasing the swing voltage. The controller continues to do this until the lowest swing voltage that satisfies both the frequency and reliability requirements is found. The same process is performed for each possible frequency, and the controller records these best voltage and frequency operating points. The second task of the controller is to choose a proper frequency based on the delay constraint and buffer fill level. Once the proper frequency is decided, the last task of the controller is to select the lowest swing voltage for that frequency. The control algorithm in Fig. 5.7 can minimize the energy consumption while meeting the performance and reliability requirements at the same time. Figure 5.8 illustrates how the self-calibrating transmission scheme performs for a realistic time-varying MPEG based workload. In this example, the self-calibrating scheme can transmit the data with a frequency range from 50 MHz to 1 GHz by

Fig. 5.7 Control algorithm used to select the best frequency and link swing voltage point in self-calibrating transmission scheme [8]

adjusting the link swing voltage from 0.6 to 1.2 V. By choosing the proper frequency based on the workload, the self-calibrating system sends each MPEG frame just below the delay constraint, which is the dotted line shown in the bottom figure. The classic system, in which no adaptive design is applied, needs to operate at a higher frequency to meet the safety margin requirements for the worst-case workload condition. The lower data transmission frequency of the self-calibrating scheme results in a lower link swing voltage, which can greatly reduce the link energy consumption. Figure 5.9 shows the energy consumption of each component of the self-calibrating transmission system, except the voltage converter, in a 90 nm CMOS technology. The link length is assumed to be 1 cm. The results show that the energy overhead of the operating point controller and the synchronizing registers accounts for 13% of the total energy consumption. The link energy consumption is the largest portion of the total

Fig. 5.8 An example to transmit a realistic time-varying MPEG based workload [8]

Fig. 5.9 Energy consumption of each component of the self-calibrating transmission system [8]

system energy consumption. By reducing the link energy consumption, the self-calibrating transmission system can achieve a 42% energy reduction compared to the classic system when a total of 400 MPEG frames are transmitted and each frame consists of several kilobytes of data.
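The three controller tasks described in this section can be sketched in a few lines. This is a simplified illustration under stated assumptions, not the controller of [8]: the function names, the `toy_error_rate` link model, the voltage grid in millivolts and the 0.01 threshold are all hypothetical, and the observation window (e.g. 500 or 1,000 clock cycles) is hidden inside the error-rate callback.

```python
from typing import Callable, Dict, Iterable, Sequence, Tuple


def lowest_swing(voltages_mv: Sequence[int],
                 error_rate: Callable[[int], float],
                 threshold: float) -> int:
    """Lower the swing voltage step by step until the error rate is too high."""
    ordered = sorted(voltages_mv, reverse=True)
    best = ordered[0]  # assumption: the highest swing voltage always works
    for v in ordered[1:]:
        if error_rate(v) > threshold:
            break      # too aggressive: keep the previous, safer voltage
        best = v
    return best


def calibrate(freqs_mhz: Iterable[int], voltages_mv: Sequence[int],
              error_rate: Callable[[int, int], float],
              threshold: float) -> Dict[int, int]:
    """Task 1: record the lowest workable swing voltage for each frequency."""
    return {f: lowest_swing(voltages_mv,
                            lambda v, f=f: error_rate(f, v), threshold)
            for f in freqs_mhz}


def choose_point(table: Dict[int, int], required_mhz: int) -> Tuple[int, int]:
    """Tasks 2 and 3: slowest frequency meeting the demand, then its voltage."""
    f = min(freq for freq in table if freq >= required_mhz)
    return f, table[f]


def toy_error_rate(f_mhz: int, v_mv: int) -> float:
    # Hypothetical link model: faster clocks need a larger swing voltage.
    return 0.0 if v_mv >= 600 + (600 * f_mhz) // 1000 else 1.0


TABLE = calibrate([50, 500, 1000], range(600, 1300, 100), toy_error_rate, 0.01)
```

In this sketch `required_mhz` stands in for the frequency that the delay constraint and buffer fill level would demand; how that value is derived is outside the scope of the example.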

5.3 Product Codes with Type-II ARQ

5.3.1 The Principle

In Chap. 4, we introduced product codes, which provide good error correction capability for both random and burst errors. However, the direct use of product codes requires a large number of redundancy bits, resulting in low code rates. The code rate of the direct use of a product code Cp = C1 × C2 can be described by

$$R_{direct} = \frac{K}{N} = \frac{k_2 \times k_1}{n_2 \times n_1} \qquad (5.2)$$

where k1 and k2 are the numbers of columns and rows in the input data, respectively, and n1 and n2 are the row and column codeword lengths, respectively. For 64-bit input data, Table 5.1 summarizes the Rdirect values for different component codes and k2 values. The values show that the large number of redundancy bits in the product codes results in low code rates, increasing link energy consumption. In order to improve the code rate and energy efficiency, a method combining product codes with type-II HARQ is proposed in [9, 10]. In type-II HARQ [11], redundancy bits are transmitted incrementally when they are requested. In the method combining product codes with type-II HARQ, the original input data is encoded and transmitted with its row parity check bits first. If the errors at the receiver are detectable but not correctable, a transmission of the column parity check bits is requested. The effective code rate Reffective of this method can be described by

$$R_{effective} = \frac{K}{k_2 \times n_1 + P_{d\_uc} \times [(n_2 - k_2) \times n_1]} \approx \frac{K}{n_1 \times k_2} \qquad (5.3)$$

where Pd_uc is the probability of a detectable but uncorrectable error in the first transmission. The error probability is usually on the order of 10^-9 to 10^-12 errors/bit [12]; thus, the second term in the denominator is negligible in most cases. The code rates of a direct product code implementation and the product code with type-II HARQ method are compared in Fig. 5.10 for different input data widths K and different numbers of data rows k2. The results show that the combination of product code with type-II HARQ can greatly improve the effective code rate by sending the column parity check bits and checks-on-checks only when necessary.

Table 5.1 Rdirect for different component codes and k2 values at K = 64 bits

Component code      k2    N      Rdirect
Parity check        2     99     0.65
Parity check        4     85     0.75
Parity check        8     81     0.80
Hamming             2     190    0.34
Hamming             4     147    0.44
Hamming             8     144    0.45
Extended Hamming    2     234    0.27
Extended Hamming    4     176    0.36
Extended Hamming    8     169    0.38
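The two code rates can be compared with a short calculation. The sketch below assumes the reading of (5.2) and (5.3) in which the first transmission carries k2 × n1 bits and the retransmission carries the remaining (n2 − k2) × n1 bits; the function names and the sample value of Pd_uc are hypothetical.

```python
def r_direct(k1: int, k2: int, n1: int, n2: int) -> float:
    """Code rate when the full product codeword is always sent, Eq. (5.2)."""
    return (k2 * k1) / (n2 * n1)


def r_effective(k1: int, k2: int, n1: int, n2: int, p_d_uc: float) -> float:
    """Effective code rate with type-II HARQ, Eq. (5.3)."""
    first = k2 * n1            # data plus row parity bits
    extra = (n2 - k2) * n1     # column parity bits and checks-on-checks
    return (k2 * k1) / (first + p_d_uc * extra)


# Extended Hamming example from Table 5.1: K = 64, k2 = 4, EH(22,16) x EH(8,4).
direct = r_direct(k1=16, k2=4, n1=22, n2=8)
effective = r_effective(k1=16, k2=4, n1=22, n2=8, p_d_uc=1e-9)
```

With these inputs the direct rate is 64/176, which rounds to the 0.36 listed in Table 5.1, while the effective rate stays very close to 64/88 because the retransmission term is weighted by a very small probability.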

Fig. 5.10 Comparison of code rate Reffective and Rdirect for different input data width K and the number of rows k2 in the message of product codes

Fig. 5.11 Encoding process of combining product codes with type-II HARQ

Figure 5.11 shows the encoding process of the method combining product codes with type-II HARQ. The K-bit input data is first separated into k2 rows of k1 bits each. Multiple row encoders are used to minimize the encoding latency. Each row is encoded using a component code C1(n1, k1). All k2 × n1 outputs of the

Fig. 5.12 An example of combining four extended Hamming EH(22,16) encoders with row-column interleaver

row encoders are fed into a row-column block interleaver. The mapping relation of the (nr × nc)-bit row-column interleaver can be described by

$$i_{output} = n_r \times \mathrm{mod}(i_{input}, n_c) + \left\lfloor \frac{i_{input}}{n_c} \right\rfloor \qquad (5.4)$$

where 0 ≤ iinput, ioutput ≤ (nr × nc) − 1, and nr and nc are the numbers of rows and columns of the row-column interleaver, respectively. Figure 5.12 demonstrates an example of the row-column interleaver mapping relation for 64-bit input data. The 64-bit input data is separated into four rows of equal length. Each row is encoded with an extended Hamming EH(22, 16) code and the outputs of these encoders are interleaved. The interleaved row encoder outputs are both transmitted to the receiver and fed into n1 column encoders. The column parity check bits of the n1 column encoders are saved into a buffer. The total of (n2 − k2) × n1 additional parity check bits are kept in the buffer until an acknowledgement/negative acknowledgement (ACK/NACK) signal indicating the status of the previously transmitted data is received. If a NACK signal is received, the stored parity check bits are sent to the receiver. The flow chart of the encoding process is shown in Fig. 5.13.
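The interleaver mapping can be checked with a short sketch. It assumes that the second term of (5.4) is the integer quotient of iinput by nc, which corresponds to writing the bits row by row and reading them column by column; the function names are hypothetical.

```python
from typing import List, Sequence


def interleave_index(i_input: int, n_r: int, n_c: int) -> int:
    """Output position of input bit i_input in an (n_r x n_c) interleaver."""
    return n_r * (i_input % n_c) + i_input // n_c


def interleave(bits: Sequence[int], n_r: int, n_c: int) -> List[int]:
    """Apply the row-column mapping to a block of n_r * n_c bits."""
    out = [0] * (n_r * n_c)
    for i, b in enumerate(bits):
        out[interleave_index(i, n_r, n_c)] = b
    return out


# Four EH(22,16) codewords as in Fig. 5.12: n_r = 4 rows, n_c = 22 columns.
# Labelling each bit with its input index shows where it ends up.
first_outputs = interleave(list(range(88)), n_r=4, n_c=22)[:4]
```

Here `first_outputs` is `[0, 22, 44, 66]`: neighbouring wires carry bits from different rows, so a burst on adjacent wires is spread across several row codewords.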

Fig. 5.13 Flow chart of the encoding process [9]

Figure 5.14 shows the decoding process of the combination of product codes with type-II HARQ. The received data is deinterleaved and then decoded using row decoders. If the number of errors is within the error correction capability of the row decoder, the errors are corrected. If the errors are detectable but not correctable, the row decoded data and row parity check bits are saved into a buffer and the receiver instructs the transmitter to send the column parity check bits and checks-on-checks, which are formed based on the original data. When the additional parity check bits are received, they are used with the row decoded data and row parity check bits, which have been stored in the decoder buffer, to complete the product code decoding process. The flow chart of the proposed decoding process is shown in Fig. 5.15. The transmission and retransmission process is shown in Fig. 5.16. In order to simplify the hardware implementation, only one retransmission is allowed. The buffer depth in the transmitter and receiver is determined by the round trip delay of the transmission. The reliability of the proposed method depends on two factors: the error detection capability in the first transmission and the error correction capability when the full product code decoding process is applied. The reliability in terms of residual flit error rate can be estimated by

$$P_{residual} = P_{ud} + P(e_{decoding}, e_{detect}) \qquad (5.5)$$

where Pud is the undetectable error probability in the first transmission and P(edecoding, edetect) is the error probability after the full product code decoding process is performed.

Fig. 5.14 Decoding process of combining product codes with type-II HARQ

Fig. 5.15 Flow chart of proposed decoding process [9]

Fig. 5.16 Transmission and retransmission procedure

5.3.2 Extended Hamming Product Codes with Type-II HARQ

To balance complexity and error correction capability, an error control method combining extended Hamming product codes with type-II HARQ is introduced in [10]. The encoding process of the combination of extended Hamming product codes with type-II HARQ is simple. The K-bit input data is arranged into a matrix with k2 rows and k1 columns. Each row is encoded using an extended Hamming code EH(n1, k1) and each column is encoded using an extended Hamming code EH(n2, k2). The K-bit input data and row parity check bits are transmitted and the column parity check bits are saved into an encoder buffer. The saved column parity check bits are transmitted once a NACK is received. In the decoding process of extended Hamming product codes with type-II HARQ, the received data is first decoded row by row using multiple extended Hamming decoders. Extended Hamming codes can correct single errors and detect double errors in each row. If all errors are correctable (no more than one error in each row), the receiver indicates a successful transmission by sending back an ACK signal to the transmitter. If the receiver detects two errors in any row, it saves the row decoded data and row parity check bits in the decoding buffer and requests a transmission of column parity check bits and checks-on-checks by sending back a NACK signal. When the extra parity check bits are received, they are used with the saved data and row parity check bits to complete the column decoding process and the second row decoding process in the three-stage pipelined decoding method introduced in Chap. 4. Figure 5.17 shows the decoding process when the three-stage decoding method is used. The first row decoding is always performed. The column decoding in the second stage and the row decoding in the third stage are performed only when a retransmission of column parity check bits and checks-on-checks is requested.

Fig. 5.17 Implementation of three-stage pipelined decoding algorithm in the case of combining extended Hamming product codes with type-II HARQ

Fig. 5.18 An example of decoding process for extended Hamming codes with type-II HARQ

Figure 5.18 shows an example of the decoding process when extended Hamming product codes with type-II HARQ are applied. A rectangular four-error pattern occurs in the transmission of the original data and row parity check bits. The extended Hamming decoder detects these errors during the first row decoding process and a transmission of column parity check bits and checks-on-checks is requested.

Fig. 5.19 Number of redundancy bits for extended Hamming code with different input data widths

A single error occurs during retransmission of column parity check bits and checks-on-checks. It can be directly corrected before these extra parity check bits are combined with the saved data and row parity check bits to complete the three-stage pipelined decoding process. In Step 2, because double errors are detectable but uncorrectable, no correction is performed and “1”s are recorded in the corresponding column status positions. In the second row decoding process (Step 3), the extended Hamming decoder still detects two errors in a row, so the column status vector is used to indicate which positions need to be flipped. In the combination of extended Hamming product codes with type-II HARQ, the K-bit input data is arranged as a matrix of k2 rows with length k1 = ⌈K/k2⌉. Each row or column is encoded using an extended Hamming code EH(n1, k1) or EH(n2, k2). The required interconnect width WL1 in the first transmission (original data and row parity check bits) can be described by

$$W_{L1} = K + k_2 \times r_{eh}(\lceil K/k_2 \rceil) \qquad (5.6)$$

where reh(⌈K/k2⌉) is the number of parity check bits added by the extended Hamming code for the ⌈K/k2⌉-bit row input. The relationship between the number of redundancy bits and the number of input data bits is shown in Fig. 5.19. If a NACK is received, a retransmission is requested. The retransmission includes parity check bits for n1 = (⌈K/k2⌉ + reh(⌈K/k2⌉)) columns and requires interconnect width WL2,

$$W_{L2} = (\lceil K/k_2 \rceil + r_{eh}(\lceil K/k_2 \rceil)) \times r_{eh}(k_2) \qquad (5.7)$$

where reh(k2) is the number of parity check bits added by the extended Hamming code for the k2-bit column input.

Table 5.2 Required number of wires in the link for different input data widths K and row numbers k2 [10]

K      Width   k2=1   k2=2   k2=3   k2=4   k2=5   k2=6   k2=7   k2=8   k2=9
32     WL1     39     44     47     52     57     62     67     64     68
32     WL2     156    88     64     52     60     55     50     40     40
32     WL      156    88     64     52     60     62     67     64     68
64     WL1     72     78     82     88     94     94     99     104    109
64     WL2     288    156    112    88     95     80     75     65     65
64     WL      288    156    112    88     95     94     99     104    109
128    WL1     137    144    149    156    158    164    170    176    182
128    WL2     548    288    200    156    160    140    125    110    105
128    WL      548    288    200    156    160    164    170    176    182

The required link width WL to successfully complete two transmissions is the maximum of WL1 and WL2,

$$W_L = \max(W_{L1}, W_{L2}) \qquad (5.8)$$

From (5.6)–(5.8), the required link width is a function of the row number k2 for a given input message width K. Table 5.2 shows the number of wires in the link for different input data widths K and numbers of rows in the message k2. It can be seen that WL changes greatly for different k2 values. Because on-chip interconnects can consume a large proportion of the total energy in nanoscale technology, reducing the number of wires in the link can improve both energy efficiency and wire area footprint. To achieve the minimum number of wires, WL1 and WL2 should be balanced. The minimum number of wires WL is achieved when the difference between WL1 and WL2 has the smallest value. For the link widths examined in Table 5.2, a minimum value is achieved when k2 is equal to four. In this case (k2 = 4), WL1 and WL2 have the same values and an extended Hamming code EH(8, 4) is used for column encoding.
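The link width relations (5.6)–(5.8) can be evaluated directly. The sketch below is illustrative only: the function names are hypothetical, and `r_eh` assumes that the underlying Hamming code always has at least three parity bits, an assumption chosen here because it reproduces the k2 = 1 column of Table 5.2.

```python
from typing import Tuple


def r_eh(k: int) -> int:
    """Parity bits of an extended Hamming code protecting k data bits."""
    m = 3  # assumption: the smallest Hamming code considered has 3 parity bits
    while 2 ** m < k + m + 1:
        m += 1
    return m + 1  # one extra overall parity bit for double error detection


def link_widths(K: int, k2: int) -> Tuple[int, int, int]:
    """Return (WL1, WL2, WL) from Eqs. (5.6), (5.7) and (5.8)."""
    k1 = -(-K // k2)              # ceiling of K / k2 using integer arithmetic
    n1 = k1 + r_eh(k1)            # row codeword length
    wl1 = K + k2 * r_eh(k1)       # data plus row parity bits
    wl2 = n1 * r_eh(k2)           # column parity bits and checks-on-checks
    return wl1, wl2, max(wl1, wl2)


# WL for K = 64 and k2 = 1..9, to compare against the K = 64 rows of Table 5.2.
wl_64 = [link_widths(64, k2)[2] for k2 in range(1, 10)]
best_k2 = min(range(1, 10), key=lambda k2: link_widths(64, k2)[2])
```

Under these assumptions `wl_64` matches the WL row for K = 64 in Table 5.2 and `best_k2` is 4, the point where WL1 and WL2 are both 88.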

5.3.3 Performance Evaluation

The combination of extended Hamming product codes with type-II HARQ is compared to different coding solutions: FEC schemes using a Hamming code and a three-bit error correction BCH code, ARQ, and HARQ using an extended Hamming code. Standard CRC-5 with generator polynomial x^5 + x^2 + 1 is used for the ARQ scheme. The input data width is 64 bits. The number of wires used to transmit encoded information in these error control schemes is 88, 71, 85, 69 and 72, respectively. In order to improve the throughput, the go-back-N retransmission approach [11] is applied to the ARQ and HARQ schemes. For implementation simplicity, only one

Fig. 5.20 Residual flit error rate for different error control schemes as a function of noise voltage deviation at (a) Pn = 10^-2 and (b) Pn = 1 [10]

retransmission is allowed in the combination of extended Hamming product codes with type-II HARQ. Thus, the ACK/NACK signal not only depends on the double error detection from each row decoder, but also depends on whether the input data is the first transmitted or retransmitted flit. When the input data is the retransmitted flit, an ACK signal (the value is 0) is always sent back to the transmitter.
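The ACK/NACK rule in the preceding paragraph can be written as a one-line decision. The sketch below is an illustration only: the function name is hypothetical, and while the text states that ACK is encoded as 0, encoding NACK as 1 is an assumption.

```python
from typing import Sequence


def ack_nack(row_double_error: Sequence[int], is_retransmission: bool) -> int:
    """Return 0 (ACK) or 1 (NACK, assumed encoding) for a received flit."""
    if is_retransmission:
        return 0  # only one retransmission is allowed, so always acknowledge
    # First transmission: request the column parity bits if any row decoder
    # reports a detectable but uncorrectable (double) error.
    return int(any(row_double_error))
```

A first-transmission flit with a double error in any row yields 1, while the same flags on a retransmitted flit yield 0.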

5.3.3.1 Reliability

The residual flit error rate Presidual is used to measure the system reliability. Figure 5.20 shows the residual flit error rate of different error control schemes as a function of noise voltage deviation. The dependent error model introduced in Chap. 1

is used in the simulation with two coupling probability values, Pn = 10^-2 and Pn = 1. A link swing voltage of 1.0 V is used. The simulation results show that Hamming product codes with type-II HARQ achieve a significant reduction in residual flit error rate when multiple random and burst errors are considered. ARQ CRC-5 has a good burst error detection capability but is inefficient at detecting multiple random errors. The HARQ EH(72,64) scheme can correct single errors and detect double errors, but its performance decreases as the burst error probability increases. Compared to the BCH(85,64) code, extended Hamming product codes with type-II HARQ can effectively correct multiple random and burst errors, while the BCH code is only good at correcting multiple random errors. The combination of extended Hamming product codes with type-II HARQ can correct at least two permanent errors, while ARQ CRC-5 will not work in this persistent noise environment.

5.3.3.2 Throughput

Another main concern in on-chip communication is the throughput. In the simulations, the go-back-N retransmission policy was applied to improve the throughput of the ARQ and HARQ schemes. The average number of transmissions needed to successfully send a flit in go-back-N ARQ is presented in Chap. 2. When the go-back-N retransmission policy is applied to the HARQ scheme, the average number of transmissions needed to successfully send a flit can be described using (5.9) by modifying (2.2),

$$N_{HARQ} = 1 \times (P_{ne} + P_{ud} + P_{d\_c}) + (N+1) \times (P_{ne} + P_{ud} + P_{d\_c}) \times P_{d\_uc} + (2N+1) \times (P_{ne} + P_{ud} + P_{d\_c}) \times P_{d\_uc}^{2} + \cdots + (lN+1) \times (P_{ne} + P_{ud} + P_{d\_c}) \times P_{d\_uc}^{l} + \cdots = 1 + \frac{N P_{d\_uc}}{1 - P_{d\_uc}} \qquad (5.9)$$

where Pne is the probability of no errors, Pud is the undetectable error probability, Pd_c is the probability of a correctable error, and Pd_uc is the probability that the errors are detectable but uncorrectable. Pne + Pud + Pd_c + Pd_uc is equal to 1. Because Pd_uc is smaller than Pd, the average number of transmissions needed to successfully send a flit in HARQ is less than that in ARQ. In the method combining product codes with type-II HARQ, the number of retransmissions was limited to one. The average number of transmissions needed to successfully send a flit in the proposed method can be described by

$$N_{prop} = 1 \times (P_{ne} + P_{ud} + P_{d\_c}) + 2 \times P_{d\_uc} \qquad (5.10)$$
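The closed form in (5.9) can be checked numerically against the series it summarizes. The sketch below is a sanity check under stated assumptions, not part of the original analysis: the function names and the sample values of Pd_uc and N are hypothetical, and Pne + Pud + Pd_c is taken as 1 − Pd_uc.

```python
def n_harq_closed(p_d_uc: float, n: int) -> float:
    """Closed form of Eq. (5.9)."""
    return 1 + n * p_d_uc / (1 - p_d_uc)


def n_harq_series(p_d_uc: float, n: int, terms: int = 200) -> float:
    """Truncated series of Eq. (5.9): l failed rounds, then one success."""
    p_ok = 1 - p_d_uc  # equals P_ne + P_ud + P_d_c
    return sum((l * n + 1) * p_ok * p_d_uc ** l for l in range(terms))


def n_prop(p_d_uc: float) -> float:
    """Eq. (5.10): at most one retransmission is allowed."""
    return 1 * (1 - p_d_uc) + 2 * p_d_uc


# Hypothetical operating point: 10% uncorrectable flits, go-back-3.
gap = abs(n_harq_series(0.1, 3) - n_harq_closed(0.1, 3))
```

The truncated series and the closed form agree to within floating point error, and `n_prop` never exceeds 2 because the retransmission count is capped at one.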

The throughput of the different error control schemes is compared in Fig. 5.21. The throughput was normalized to the throughput in the case of no errors occurring.

Fig. 5.21 Throughput of different error control schemes with varying standard noise voltage deviation

The throughput comparison does not include the H(71,64) and BCH(85,64) schemes, because no retransmission is needed in these schemes and their normalized throughput is always equal to 1. As shown in Fig. 5.21, the ARQ, HARQ and proposed methods achieve nearly the same throughput in low noise environments (small σN). As σN increases, the ARQ scheme achieves the lowest throughput, because retransmission is the only way for it to correct errors. The overhead for retransmission increases as the noise environment becomes worse. Compared to the HARQ EH(72,64) scheme, the combination of extended Hamming product codes with type-II HARQ achieves better throughput because more errors can be corrected during the first transmission. The combination of extended Hamming product codes with type-II HARQ can achieve 45% and 10% improvement in the throughput under high noise conditions compared to the ARQ and HARQ schemes, respectively.

5.3.3.3 Energy Consumption

The average energy per flit is used as the metric to measure energy consumption. The average energy consumption includes the encoder energy Ee1, the link energy El1, and the decoder energy Ed1 in the first transmission, as well as the encoder energy Ee2, the link energy El2, and the decoder energy Ed2 in the retransmission,

$$E_{avg} = (E_{e1} + E_{l1} + E_{d1}) + P_{d\_uc} \times (E_{e2} + E_{l2} + E_{d2}) \qquad (5.11)$$

where Pd_uc is the probability that the errors are detectable but uncorrectable. The link energy using low swing voltage can be estimated as

$$E_{l} \approx S_f \times W_L \times C_L \times V_{DD} \times V_{swing} + E_{level} \qquad (5.12)$$

where CL is the interconnect capacitance. WL is the number of wires in the link, which depends on the error control scheme. In the combination of extended Hamming product codes with type-II HARQ, WL is greatly affected by the selection of k2. Sf is the wire switching probability. VDD is the supply voltage. The link swing voltage Vswing is decided by the reliability requirement. Elevel is the energy consumption of the level translation circuit when low swing voltage is applied.

Fig. 5.22 Required link swing voltages of different error control schemes for given reliability requirements [10]
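The energy model in (5.11) and (5.12) amounts to a few multiplications. The sketch below is illustrative only: the function names are hypothetical, the link energy estimate is treated as an equality, and all numeric values are made-up placeholders rather than measured data.

```python
from typing import Sequence


def link_energy(s_f: float, w_l: int, c_l: float, v_dd: float,
                v_swing: float, e_level: float = 0.0) -> float:
    """Link energy estimate following Eq. (5.12)."""
    return s_f * w_l * c_l * v_dd * v_swing + e_level


def average_energy(first: Sequence[float], second: Sequence[float],
                   p_d_uc: float) -> float:
    """Average energy per flit following Eq. (5.11).

    `first` and `second` hold (encoder, link, decoder) energies of the first
    transmission and of the retransmission, respectively.
    """
    return sum(first) + p_d_uc * sum(second)


# Placeholder numbers in arbitrary units: halving the swing voltage halves
# the wire term of the link energy.
full_swing = link_energy(s_f=0.5, w_l=88, c_l=1.0, v_dd=1.0, v_swing=1.0)
half_swing = link_energy(s_f=0.5, w_l=88, c_l=1.0, v_dd=1.0, v_swing=0.5)
```

The linear dependence on both WL and Vswing is what allows a scheme with more wires to still consume less link energy if its stronger error correction permits a sufficiently lower swing voltage.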

Figure 5.22 compares the link swing voltage of different error control schemes for the same residual flit error rate requirement. The noise voltage deviation σN is assumed to be 0.1 V. The coupling probability Pn is 10^-1. The results show that the more effective the error correction capability of an error control scheme, the lower the swing voltage needed for the interconnect links. To achieve the same residual flit error rate, the combination of extended Hamming product codes with type-II HARQ achieves the smallest link swing voltage. The link swing voltage of the combination of extended Hamming product codes with type-II HARQ is about 60% and 80% of that of the H(71, 64) and ARQ CRC-5 schemes, respectively. The lower link swing voltage allows this method to consume less link energy. Figure 5.23 compares the link energy consumption of different error control schemes given the same residual flit error rate requirement. In the simulation, the requirement of residual flit error rate is assumed to be Presidual ≤ 10^-20 [13]. The simulation was performed for a noise environment of σN = 0.07 V. Different technology nodes are considered in the simulation using Predictive Technology Model (PTM) CMOS 65 and 45 nm technology [14]. The effect of different link lengths on energy consumption is also evaluated. In NoCs, the link length is the distance between two routers, which is decided by the dimension of the tile block. In mesh or torus topologies, the links between two routers are generally wires a few millimeters long [15–17]. In the experiments, link lengths from 1 to 3 mm,

Fig. 5.23 Link energy consumption of different error control schemes for different link lengths (a) 45 nm technology (b) 65 nm technology [10]

are examined. The link energy is measured in Cadence Spectre. The input data is generated using an H.264 video encoder, with an average switching factor of about 0.5. Figure 5.23 shows that the combination of extended Hamming product codes with type-II HARQ has the smallest link energy of the compared schemes, because its lower link swing voltage more than offsets the larger number of wires in the link. As link length increases, Hamming product codes with type-II HARQ benefit more from the lower link swing voltage. Their link energy consumption is about 80% of that of ARQ CRC-5 and about 35% of that of H(71,64).

Fig. 5.24 Energy comparison of different error control schemes at residual flit error rate 10⁻²⁰ (a) Link length 1 mm (b) Link length 3 mm [10]

Figure 5.24 compares the average energy consumption per flit for different error control schemes at the same reliability requirement (10⁻²⁰) [13]. The average energy includes encoder, decoder, and link energy consumption. Two noise voltage deviations, σN = 0.07 V and σN = 0.1 V, are considered. The raw bit error probability ε is about 10⁻¹² and 10⁻⁶ for these two cases, respectively. The results show that ARQ CRC-5 achieves the least average energy consumption in the low noise environment (σN = 0.07 V) for small link lengths, because of its smaller codec and link energy consumption. As the noise voltage deviation increases, however, higher link swing voltages are needed to achieve the same reliability. In high noise conditions, the average energy consumption of ARQ CRC-5 increases more than that of the combination of extended Hamming product codes with type-II HARQ, because ARQ CRC-5 has larger link energy consumption. The combination of extended Hamming product codes yields the

Fig. 5.25 Delay of 3 mm link for different link swing voltages [10]

least average energy consumption in the higher noise environment (σN = 0.1 V). The BCH(85,64) scheme has the largest average energy consumption for small link lengths because it has the largest codec energy consumption. Hamming product codes with type-II HARQ achieve the least energy consumption at large link lengths or in high noise environments. When the link length is 3 mm, the energy consumption of this approach is about 15% and 50% less than that of ARQ CRC-5 and H(71,64), respectively, in the high noise environment. In addition to the energy improvement over ARQ in high noise environments, the combination of extended Hamming product codes with type-II HARQ can correct at least two permanent errors, while ARQ will not work in a persistent noise environment. Thus, the approach combining forward error correction with limited retransmission achieves a better balance of energy, performance, and error resilience.

5.3.3.4

Delay and Area

More powerful error control schemes enable reduced link swing voltages, which results in significant energy reduction. However, the link delay increases as the link swing voltage decreases. Figure 5.25 evaluates the effect of reduced swing voltage on link delay (including the delay of the level translation circuit). The simulation results show that the link delay at Vswing = 0.75 V is about 30% larger than that at Vswing = 1 V. If a higher frequency is required, the link can be pipelined [18] to address the increased delay at lower link swing voltages. Figure 5.25 also shows the delay overhead of the level translation circuit (the shadowed part on the

Table 5.3 Delay and area of different coding schemes using 65 nm technology [10]

Error control scheme                               Encoder delay (ns)   Decoder delay (ns)   Area (μm²)
Hamming (71,64)                                    0.40                 0.62                 2,080.8
ARQ (CRC-5)                                        0.37                 0.41                 3,605.6
HARQ (Extended Hamming (72,64))                    0.42                 0.64                 4,283.7
BCH(85,64)                                         0.42                 0.72                 77,353.2
Extended Hamming product codes with type-II HARQ   0.41                 0.59                 9,792.5

top) for different link swing voltages. The simulation results show that the delay overhead of the level translation circuit increases as the swing voltage decreases. At Vswing = 0.75 V, the level translation circuit accounts for about 13% of the total link delay.

Table 5.3 compares the codec delay and area of different error control schemes using TSMC 65 nm libraries. The Hamming encoder is implemented as a simple XOR tree. Instead of using linear feedback shift registers to generate check bits for CRC codes, a parallel implementation method [19] is employed to reduce the large latency of CRC codes at a minor cost in complexity. The decoder delay, typically much larger than the encoder delay, is reported in Table 5.3. As expected, the decoder delay of ARQ CRC-5 is the smallest of the compared schemes, because only the syndrome is calculated and no error correction is needed. The BCH(85,64) scheme has the largest delay because of the complexity of its field arithmetic operations. The delay of extended Hamming product codes with type-II HARQ is slightly smaller than that of the H(71,64) and HARQ EH(72,64) schemes, because the component codes used in constructing product codes have a smaller input data width.

Table 5.3 also shows the area of different error control schemes. In the go-back-N retransmission policy, N flits are retransmitted if a NACK signal is received. Thus, a transmitter buffer is needed to store these N flits in ARQ and HARQ schemes. The number N depends on the round-trip transmission delay; in the simulation, N is equal to four. In the Hamming product codes with type-II HARQ scheme, the part of the decoder buffer that stores the original message can be shared with the routing buffer used for routing and flow control in the router. This greatly reduces the buffer size required. The results show that the FEC scheme using H(71,64) has the smallest area, because the encoder and syndrome calculation circuits are implemented as simple XOR trees and no buffers are needed. The area of extended Hamming product codes with type-II HARQ is about twice that of the HARQ scheme, because of the overhead associated with the three-stage decoding method. BCH(85,64) has the largest area because of the complexity of its decoding process. In nanoscale technologies, the link energy is likely to far exceed the codec energy. Thus, the proposed method becomes more promising for achieving energy benefits as technology scales.
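The role of the transmitter buffer in go-back-N retransmission can be sketched in a few lines of Python. This is a minimal model under simplifying assumptions (the class and method names are hypothetical, and timing, ACK handling, and flit sequence numbers are omitted); it only shows why N flits must be stored.

```python
from collections import deque


class GoBackNSender:
    """Keeps the last N transmitted flits so they can be resent on a NACK."""

    def __init__(self, n=4):
        # A bounded deque drops the oldest flit once N newer flits are stored.
        self.buffer = deque(maxlen=n)

    def send(self, flit):
        self.buffer.append(flit)
        return flit

    def on_nack(self):
        # All buffered flits are retransmitted, oldest first.
        return list(self.buffer)


sender = GoBackNSender(n=4)
for flit in range(6):
    sender.send(flit)
print(sender.on_nack())
```

With N equal to four, as in the simulation described above, the buffer holds four flits regardless of how many have been sent, which is the storage cost counted in the ARQ and HARQ areas.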

5.4

Configurable Error Control System

Each error control scheme offers a different trade-off among area, power, throughput, and error correction capability. Configurable error control schemes achieve energy efficiency by dynamically providing appropriate error control based on noise conditions or system requirements. In this section, we introduce a method that combines product codes with conventional Hamming codes to provide different error correction capabilities in varied noise environments [20].

5.4.1

Principle

Hamming codes have been widely applied to on-chip interconnects because of their low codec overhead. As noise environments worsen, however, Hamming codes become unable to maintain system reliability efficiently because of their limited error correction capability. The core idea of combining product codes with conventional Hamming codes is to construct a system with adjustable code strength, which can be dynamically selected according to noise environments or reliability requirements. In this method, the error control scheme works in two operating modes: mode-(a) directly uses Hamming codes in low noise environments; mode-(b) uses Hamming product codes in high noise environments. By switching between the two operating modes, this configurable error control scheme can improve energy efficiency for a specified reliability requirement under varying noise environments. Directly using Hamming codes in operating mode-(a) requires less codec energy, and the smaller number of active link wires leads to lower link energy consumption. In operating mode-(b), using product codes provides higher reliability.

Figure 5.26a shows the concept of the configurable encoder design. In low noise environments, encoder1 is configured as a Hamming encoder, which uses the whole message as its input. The encoded message is sent to the receiver through an interleaver. The interleaver is implemented as hardwired direct connections with negligible overhead. In high noise environments, encoder1 is configured as a Hamming product code component encoder (row encoder). The component encoder consists of multiple Hamming encoders, each using a part of the message as its input. The interleaved outputs of configurable encoder1 (original message and row parity check bits) are sent to the receiver and simultaneously fed to component encoder2 (column encoder). The outputs of component encoder2 are saved into a buffer and transmitted when required.

Figure 5.26b shows the concept of the configurable decoder design. Decoder1 can be configured as a Hamming decoder using the whole codeword as its input or as a Hamming product code component decoder (row decoder). When the configurable error control scheme is in operating mode-(b), the outputs of configurable decoder1 are sent to component decoder2, which is realized using an iterative decoding algorithm.

The configuration control signal can be generated by a link quality monitor [21] or system software [22]. The link quality monitor is realized by counting the

Fig. 5.26 Concept of configurable error control using Hamming product codes (a) encoder (b) decoder [20]

detected errors (cases where the syndrome value of decoder1 is non-zero). The number of detected errors is compared to a preset threshold value to decide when to switch between operating modes. If the number of detected errors is greater than the preset value, operating mode-(b) is selected. When the configuration is performed by the system software, an interface control register is needed. The system software can select the operating mode by setting the control register based on the application requirement (e.g., if the correctness of operation is the main concern, mode-(b) is used).
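The threshold-based mode selection can be sketched as follows. This is a behavioral model only: the source does not state over what interval errors are counted, so the fixed observation window used here, along with the class and parameter names, is an assumption made for illustration.

```python
class LinkQualityMonitor:
    """Counts non-zero syndromes and selects mode-(a) or mode-(b)."""

    def __init__(self, threshold, window):
        self.threshold = threshold  # preset error count threshold
        self.window = window        # assumed: flits per observation interval
        self.flits = 0
        self.errors = 0
        self.mode = "a"             # start in the low noise mode

    def observe(self, syndrome_nonzero):
        """Record one decoded flit and return the currently selected mode."""
        self.flits += 1
        self.errors += int(bool(syndrome_nonzero))
        if self.flits == self.window:
            # More detected errors than the threshold selects mode-(b).
            self.mode = "b" if self.errors > self.threshold else "a"
            self.flits = 0
            self.errors = 0
        return self.mode


monitor = LinkQualityMonitor(threshold=2, window=8)
for flag in [1, 0, 1, 0, 1, 0, 0, 0]:
    mode = monitor.observe(flag)
print(mode)
```

A software-controlled configuration would bypass the counter and write the mode directly, which corresponds to the interface control register mentioned above.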

5.4.2

Configurable Encoder Design

Figure 5.27 shows the implementation of the configurable encoder design. In operating mode-(a), the input message is directly encoded by a Hamming code. In operating mode-(b), the K-bit input message is arranged into a 4 × (K/4) matrix

Fig. 5.27 Implementation of configurable encoder [20]

to construct the product code and each row is encoded with an extended Hamming code with a (K/4)-bit input. To reduce the encoder area overhead, the configurable encoder1 is implemented using a hardware sharing method. In this method, the Hamming encoder with a K-bit input is realized by combining the outputs of four Hamming encoders, each with a (K/4)-bit input.

The following example demonstrates the hardware sharing method. Consider a 16-bit input message, which is separated into four rows. Each row is encoded using an extended Hamming code EH(8,4) with the generator matrix in (5.13). The parity check bits of each group can be combined to generate the parity check bits of a Hamming code H(21,16) with the generator matrix in (5.15), where P16×5 is the parity matrix. The hardware implementation of the EH(8,4) and H(21,16) encoders is shown in Fig. 5.28. By using the hardware sharing method, the configurable encoder1 is implemented in two stages, parity calculation and merge circuits, as shown in Fig. 5.28. The parity calculation outputs can be used directly as the parity check bits of four extended Hamming encoders with (K/4)-bit inputs, or merged to generate the parity check bits of a Hamming encoder with a K-bit input.

$$G_1 = [\,I_{4\times4} \mid P_{4\times4}\,] =
\begin{bmatrix}
1 & 0 & 0 & 0 & 1 & 1 & 0 & 1 \\
0 & 1 & 0 & 0 & 1 & 1 & 1 & 0 \\
0 & 0 & 1 & 0 & 1 & 0 & 1 & 1 \\
0 & 0 & 0 & 1 & 0 & 1 & 1 & 1
\end{bmatrix} \tag{5.13}$$

Fig. 5.28 Hardware sharing between four extended Hamming EH(8,4) encoders and one Hamming H(21,16) encoder [20]

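$$P^{T}_{4\times4} =
\begin{bmatrix}
1 & 1 & 1 & 0 \\
1 & 1 & 0 & 1 \\
0 & 1 & 1 & 1 \\
1 & 0 & 1 & 1
\end{bmatrix}
=
\begin{bmatrix}
M \\
1 \;\; 0 \;\; 1 \;\; 1
\end{bmatrix} \tag{5.14}$$

$$G_2 = [\,I_{16\times16} \mid P_{16\times5}\,] \tag{5.15}$$

$$P^{T}_{16\times5} =
\left[\;
\underbrace{\begin{matrix} 1&1&1&0 \\ 1&1&0&1 \\ 0&1&1&1 \\ 0&0&0&0 \\ 0&0&0&0 \end{matrix}}_{\text{row a}}
\;\;
\underbrace{\begin{matrix} 1&1&1&0 \\ 1&1&0&1 \\ 0&1&1&1 \\ 1&1&1&1 \\ 0&0&0&0 \end{matrix}}_{\text{row b}}
\;\;
\underbrace{\begin{matrix} 1&1&1&0 \\ 1&1&0&1 \\ 0&1&1&1 \\ 0&0&0&0 \\ 1&1&1&1 \end{matrix}}_{\text{row c}}
\;\;
\underbrace{\begin{matrix} 1&1&1&0 \\ 1&1&0&1 \\ 0&1&1&1 \\ 1&1&1&1 \\ 1&1&1&1 \end{matrix}}_{\text{row d}}
\;\right]
=
\begin{bmatrix}
M & M & M & M \\
0\,0\,0\,0 & 1\,1\,1\,1 & 0\,0\,0\,0 & 1\,1\,1\,1 \\
0\,0\,0\,0 & 0\,0\,0\,0 & 1\,1\,1\,1 & 1\,1\,1\,1
\end{bmatrix} \tag{5.16}$$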
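The two-stage structure implied by (5.14) and (5.16) can be checked with a short Python sketch. The matrix rows below are read from (5.14) and (5.16); the function names and the exact partitioning into a shared stage and a merge stage are hypothetical and serve only as a software analogy, not a description of the circuit in Fig. 5.28.

```python
# Rows of P^T for EH(8,4) as read from (5.14): M is the top 3x4 block.
M = [(1, 1, 1, 0), (1, 1, 0, 1), (0, 1, 1, 1)]
LAST = (1, 0, 1, 1)


def dot2(row, bits):
    """Inner product over GF(2)."""
    return sum(r & b for r, b in zip(row, bits)) % 2


def parity_calc(group):
    """Shared stage: partial parities of one 4-bit group."""
    return {
        "m": [dot2(r, group) for r in M],  # reused by both modes
        "last": dot2(LAST, group),         # fourth EH(8,4) parity bit
        "all": sum(group) % 2,             # XOR of the whole group
    }


def eh84_parity(group):
    """Mode-(b): parity bits of one EH(8,4) row code."""
    out = parity_calc(group)
    return out["m"] + [out["last"]]


def h2116_parity(msg):
    """Mode-(a): merge the four groups' outputs into H(21,16) parity bits."""
    a, b, c, d = (parity_calc(msg[i:i + 4]) for i in range(0, 16, 4))
    merged = [a["m"][j] ^ b["m"][j] ^ c["m"][j] ^ d["m"][j] for j in range(3)]
    return merged + [b["all"] ^ d["all"], c["all"] ^ d["all"]]


def direct_h2116_parity(msg):
    """Reference: multiply by the full 5x16 matrix P^T of (5.16)."""
    pt = [row * 4 for row in M]
    pt.append((0,) * 4 + (1,) * 4 + (0,) * 4 + (1,) * 4)
    pt.append((0,) * 8 + (1,) * 8)
    return [dot2(row, msg) for row in pt]


msg = [1, 0, 1, 1, 0, 0, 1, 0, 1, 1, 1, 0, 0, 1, 0, 1]
print(h2116_parity(msg) == direct_h2116_parity(msg))
```

The first three H(21,16) parity bits are the XOR of the four groups' M outputs, which is why the same XOR trees can serve both operating modes; only the last two bits need group-level terms that distinguish rows a to d.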

Fig. 5.29 Implementation of configurable decoder [20]

5.4.3

Configurable Decoder Design

Figure 5.29 shows an implementation of the configurable decoder design. Decoder1 can be configured as a single Hamming decoder, which uses the whole codeword as its input, or as a component decoder (row decoder) for Hamming product codes. The component decoder consists of four extended Hamming decoders, each using a part of the codeword as its input. The realization of configurable decoder1 is divided into three steps, as shown in Fig. 5.29. The hardware sharing method used in the transmitter design is applied to the parity calculation circuits. The syndrome calculation circuit performs an XOR of the parity calculation outputs and the parity check bits in the codeword. Syndrome calculation1 generates the syndrome vector of operating mode-(a) and syndrome calculation2 generates the syndrome vector of operating mode-(b). The syndrome vectors are fed into a syndrome decoder and error correction circuit. The syndrome decoder is implemented as an AND tree, whose inputs are the syndrome bits or their inverses. The error correction is an XOR operation.

To save hardware resources, a hardware sharing method is introduced in [20] to realize the syndrome decoder and error correction circuits. Figure 5.30 shows an example of the hardware sharing method. The 16-bit message is encoded by a Hamming code H(21,16) (operating mode-(a)) or by four extended Hamming codes EH(8,4) (operating mode-(b)). A four-bit syndrome vector is used for each EH(8,4) code and a five-bit vector is used for the H(21,16) code. By properly selecting the syndrome value and its inverse from the different operating modes,

Fig. 5.30 Configurable syndrome decoder and error correction circuits design [20]

the syndrome decoder circuits and error correction circuit can be shared. “1” is assigned as the extra syndrome bit for each EH(8,4) code. The outputs of configurable decoder1 are saved into a receiver buffer. In operating mode-(a), only the decoded message is saved. In operating mode-(b), the decoded message and row parity check bits are both saved. The saved message and row parity check bits are used to perform an iterative decoding procedure when the column parity check bits and checks-on-checks are transmitted.

5.4.4

Performance Evaluation

The performance of the configurable error coding scheme combining product codes with conventional Hamming codes is evaluated in terms of codec delay, area, reliability, and energy consumption. The input data width K is assumed to be 64 bits. The Hamming code H(71,64) is used in operating mode-(a). In operating mode-(b), the 64-bit input message is arranged into a 4 × 16 matrix. Each row is encoded using an extended Hamming code EH(22,16) and each column is encoded using an extended Hamming code EH(8,4). The total number of wires in the link of the proposed method is 88. In operating mode-(a), only 71 wires are used and the remaining wires are connected to ground. The configurable error control scheme is developed and verified in Verilog HDL. The encoder and decoder are synthesized using TSMC 45 nm technology. The delay, area and power of the encoder and

Table 5.4 Numbers of wires in the link and codec delay and area of different error coding schemes [20]

Error control scheme                     Wires in the link (active/total)   Decoder delay (ns)   Codec area (μm²)
Hamming (71,64)                          71/71                              0.53                 1,550
BCH(85,64)                               85/85                              0.63                 53,547
RS(85,65)                                85/85                              0.68                 37,482
Product code                             88/88                              0.50                 7,671
Configurable error control method (a)    71/88                              0.58                 8,906
Configurable error control method (b)    88/88

decoder are reported using Synopsys Design Compiler at 1 GHz clock frequency. The link power is measured in Cadence Spectre using a 45 nm global link interconnect model [14]. Simulation results are compared to directly using the Hamming code H(71,64), a three-bit error correction BCH(85,64) code, and a Reed-Solomon RS(85,65) code. Zero padding is applied to meet the length requirement of the RS code. The number of wires in the link for different coding schemes is shown in Table 5.4.
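The wire counts quoted for the two operating modes follow from the standard Hamming bound on the number of parity bits, as the following sketch checks. The helper names are hypothetical; the rule used is that a Hamming code with k data bits needs the smallest r satisfying 2^r ≥ k + r + 1, and an extended Hamming code adds one overall parity bit.

```python
def hamming_parity_bits(k):
    """Smallest r such that 2**r >= k + r + 1."""
    r = 1
    while 2 ** r < k + r + 1:
        r += 1
    return r


def hamming_length(k):
    """Codeword length n of a Hamming code with k data bits."""
    return k + hamming_parity_bits(k)


def extended_hamming_length(k):
    """Extended Hamming code: one extra overall parity bit."""
    return hamming_length(k) + 1


mode_a_wires = hamming_length(64)               # H(71,64)
mode_b_wires = 4 * extended_hamming_length(16)  # four EH(22,16) rows
print(mode_a_wires, mode_b_wires)
```

The same rule gives length 8 for the EH(8,4) column code and length 21 for the H(21,16) code used in the 16-bit hardware sharing example.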

5.4.4.1

Codec Delay and Area

Table 5.4 compares the synthesized codec delay of the configurable error control scheme to that of directly using H(71,64), BCH(85,64), and RS(85,65). The decoder delay, typically much larger than the encoder delay, is reported here. The three-stage pipelined decoding process is implemented in operating mode-(b) to decode product codes. The decoding process for operating mode-(a), described in Fig. 5.29, is implemented within one clock cycle. Compared to directly using the H(71,64) code, the decoder delay of the configurable coding method increases by about 10% because of the overhead of the extra MUX for mode switching. The BCH(85,64) and RS(85,65) decoders are implemented in a 7-stage pipelined architecture. In order to improve the throughput of the BCH(85,64) and RS(85,65) codes, the parallel method in [23] is used. Compared to RS(85,65), the configurable coding method achieves a 15% delay reduction.

Table 5.4 also shows the synthesized codec area for different error control schemes. The codec area includes the encoder and decoder area. The encoder area includes the retransmission buffer. The decoder buffer storing the original message in the receiver is not included, because this buffer can be shared with the routing buffer in the router. The area of the error counter and comparison circuits in the configurable control logic is also included for the proposed method. The results show that multiple error correction codes have much larger area than simple Hamming codes. The area overhead of the product codes is mainly due to the retransmission buffer and the pipelined decoder architecture. Compared to BCH(85,64) and RS(85,65), the product code has a smaller area, because each component code is still a simple extended Hamming code. BCH(85,64) has the largest area due to the complexity of the field operations and the decoding process. By using the

Fig. 5.31 Residual flit error rate of different error control schemes as a function of noise voltage deviation (a) Pn = 10⁻² (b) Pn = 1 [20]

proposed hardware sharing method, the area overhead of the configuration circuit is relatively small compared to Hamming product code itself.

5.4.4.2

Reliability

Figure 5.31 shows the residual flit error rate of different error control schemes as a function of noise voltage deviation at Pn = 10⁻² and Pn = 1. A supply voltage of 1 V is assumed. The simulation results show that the H(71,64) code used in operating mode-(a) has the worst residual flit error rate, because Hamming codes can correct only one error at a time, so two or more simultaneous errors are left uncorrected. Compared to the BCH(85,64) code, the product code used in operating mode-(b) achieves a better residual flit error rate, because the product code can effectively correct multiple random and burst errors, while the BCH code

Fig. 5.32 Energy comparison of the configurable error control at different operating modes (a) Link length 1 mm (b) Link length 3 mm [20]

is only good at correcting multiple random errors. As Pn increases, the residual flit error rate of the H(71,64) and BCH(85,64) codes worsens because of the higher burst error probability at larger Pn. Compared to RS(85,65), the Hamming product code used in operating mode-(b) has a better error correction capability, because RS(85,65) can only correct errors confined to two symbols, whereas in NoC links burst errors caused by noise and crosstalk can begin at any bit position. A more powerful RS code can be constructed, but at the cost of larger delay and area overhead.
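The symbol-boundary limitation of the RS code can be made concrete with a small sketch. It assumes 5-bit symbols, which is an inference from the code parameters (85 = 17 × 5 and 65 = 13 × 5) rather than a value stated in the text; the function name is hypothetical.

```python
def symbols_touched(start_bit, burst_len, symbol_bits=5):
    """Number of RS symbols overlapped by a burst of adjacent bit errors."""
    first = start_bit // symbol_bits
    last = (start_bit + burst_len - 1) // symbol_bits
    return last - first + 1


# A 7-bit burst aligned to a symbol boundary stays within two symbols,
# but the same burst starting at bit 4 spills into a third symbol.
print(symbols_touched(0, 7), symbols_touched(4, 7))
```

Under this assumption, whether a burst is correctable depends on where it starts, which is consistent with the observation that bursts on NoC links can begin at any bit position.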

5.4.4.3

Power and Energy Consumption

Figure 5.32 shows the energy consumption of the configurable error control method in its two operating modes. The energy includes encoder, decoder, and link energy consumption. The results show that operating mode-(a) consumes less codec and link energy than operating mode-(b), provided that both operating modes meet the reliability requirement. This is because gating techniques are applied and fewer link wires are used in operating mode-(a). The results also show that the link energy

Fig. 5.33 Example of mode switching for a given reliability requirement [20]

dominates the total energy consumption as the link length increases. For the 3 mm link length, operating mode-(b) consumes about 28% more energy than mode-(a). The energy consumption of the configurable error control method combining product codes with Hamming codes is also compared to that of directly using the H(71,64), BCH(85,64), and RS(85,65) codes. First, the comparison is performed under a fixed residual flit error rate requirement of 10⁻¹⁰, shown in Fig. 5.33. Two noise environments are considered. In the favorable environment (σN = 0.06 V), the proposed method operates in mode-(a). In the noisy environment (σN = 0.11 V), the proposed method switches to operating mode-(b). As the noise environment worsens, the direct implementation of the H(71,64) code requires a higher link swing voltage to meet the reliability requirement, while the proposed method can switch to the more reliable operating mode-(b). In the noisy environment (σN = 0.11 V), the conventional Hamming implementation requires a 39% higher link swing voltage than the configurable method to achieve the required residual flit error rate. The increased link swing voltage greatly increases the link energy of the conventional Hamming implementation.

Figure 5.34 shows the energy consumption of the four error control schemes for link lengths of 1 and 3 mm. The results show that the configurable method combining product codes with Hamming codes consumes the least energy in high noise environments by switching to operating mode-(b). The BCH(85,64) code consumes the most energy for a link length of 1 mm because its codec energy is larger than that of the other error control schemes. As the link length increases, the direct implementation of the H(71,64) code consumes the most energy in the high noise environment because of the increased link swing voltage. For a 3 mm link in the noisy environment (σN = 0.11 V), the configurable method achieves 30% and 25% improvements in energy consumption compared to the direct implementation of the H(71,64) code and the BCH(85,64) code, respectively. In the more favorable condition (σN = 0.06 V), direct implementation of

Fig. 5.34 Energy comparison for different noise environments and link lengths [20]

the H(71,64) code consumes the least energy of the compared schemes. By switching to operating mode-(a) in low noise environments, the proposed method consumes 10% more energy than the H(71,64) code because of the configurable system overhead. Compared to BCH(85,64), mode-(a) of the configurable error control method achieves a 40% improvement in energy consumption.

5.5

Summary

In this chapter, we have discussed techniques to achieve reliable and energy efficient on-chip communication. In order to reduce the link energy, error control codes can be combined with a low link swing voltage system. In this method, the link energy consumption is reduced because the error control codes allow the system to run at a lower link swing voltage than an uncoded system. The link energy reduction can benefit the total energy consumption. Instead of using the worst-case design method with additional safety margins, a self-calibrating transmission scheme is applied to improve the energy efficiency of on-chip interconnects. In this method, error control codes are used to detect or correct errors. The error detection rate provided by an ECC decoder is used to control the voltage scaling. The self-calibrating transmission achieves low energy consumption by running the system with a more aggressive voltage scaling scheme. The direct use of product codes requires a large number of wires, increasing the link energy consumption. The combination of product codes with a type-II HARQ scheme can efficiently solve this problem by transmitting the column parity check bits only when they are requested. As an example of this method, the combination of extended Hamming product codes with type-II HARQ achieves a significant reduction in residual flit error rate when multiple random and burst errors are considered. For a given residual flit error rate requirement, the combination of extended Hamming product codes with type-II HARQ can operate at much lower swing voltages than other

methods: about 60% and 80% of the swing voltages required for the H(71,64) and ARQ CRC-5 schemes, respectively. The lower link swing voltage makes the combination of extended Hamming product codes with type-II HARQ more energy efficient than the other error control schemes. As technology scales, the link energy is likely to further exceed the codec energy. Thus, the combination of product codes with type-II HARQ is a more promising approach to provide energy efficient and reliable communication for future system designs.

A configurable error control scheme combining extended Hamming product codes with traditional Hamming codes is also presented in this chapter. By using Hamming codes in low noise environments and extended Hamming product codes in high noise environments, this configurable coding method improves the energy efficiency for varied noise environments compared to a fixed error control approach. In order to reduce the configurable system overhead, a hardware sharing method is applied to optimize the parity check calculation circuit, syndrome decoder, and error correction circuits. For a given system reliability requirement, this configurable error control scheme can achieve a 25% energy reduction compared to a multi-error correcting BCH code in a noisy environment. Compared to conventional Hamming codes, this configurable error control scheme uses a lower swing voltage for the same reliability in noisy environments, resulting in a 30% energy reduction. In a low noise environment, this configurable error control method can achieve a 40% reduction in energy consumption compared to a BCH code, and has a 10% energy overhead penalty compared to directly using Hamming codes.

References

1. Magen N, Kolodny A, Weiser U, Shamir N (2004) Interconnect-power dissipation in a microprocessor. In: Proceedings international workshop on system-level interconnect prediction (SLIP), pp 7–13
2. Soteriou V, Peh S L (2004) Design-space exploration of power-aware on/off interconnection networks. In: Proceedings international conference on computer design (ICCD), pp 510–517
3. Sridhara S, Shanbhag RN (2005) Coding for system-on-chip networks: a unified framework. IEEE Trans Very Large Scale Integr (VLSI) Syst 12:655–667
4. Bertozzi D, Benini L, De Micheli G (2005) Error control schemes for on-chip communication links: the energy-reliability tradeoff. IEEE Trans Comput-Aided Des Integr Circuits Syst 6:818–831
5. Murali S, Theocharides T, Vijaykrishnan N, Irwin JM, Benini L, De Micheli G (2005) Analysis of error recovery schemes for networks-on-chips. IEEE Des Test Comput 5:434–442
6. Ejlali A, Al-Hashimi MB, Rosinger P, Miremadi GS, Benini L (2010) Performability/energy tradeoff in error-control schemes for on-chip networks. IEEE Trans Very Large Scale Integr (VLSI) Syst 18:1–14
7. Zhang H, Varghese G, Rabaey MJ (2000) Low-swing on-chip signaling techniques: effectiveness and robustness. IEEE Trans Very Large Scale Integr (VLSI) Syst 3:264–272
8. Worm F, Ienne P, Thiran P, Micheli DG (2005) A robust self-calibrating transmission scheme for on-chip networks. IEEE Trans Very Large Scale Integr (VLSI) Syst 1:126–139
9. Fu B, Ampadu P (2008) An energy-efficient multi-wire error control scheme for reliable on-chip interconnects using Hamming product codes. VLSI Des 2008:1–14. doi:10.1155/2008/109490
10. Fu B, Ampadu P (2009) On hamming product codes with type-II hybrid ARQ for on-chip interconnects. IEEE Trans Circuits Syst I, Reg Papers 9:2042–2054
11. Lin S, Costello D, Miller M (1984) Automatic-repeat-request error-control schemes. IEEE Commun Mag 12:5–17
12. Srinivasan RG (1996) Modeling the cosmic-ray-induced soft-error rate in integrated circuits: an overview. IBM J Res Dev 1:77–89
13. Sridhara S, Shanbhag RN (2005) Coding for system-on-chip networks: a unified framework. IEEE Trans Very Large Scale Integr (VLSI) Syst 12:655–667
14. Arizona State University. Predictive technology model [Online] http://ptm.asu.edu/
15. Pande PP, Grecu C, Ivanov A, Saleh R (2005) Performance evaluation and design trade-offs for network-on-chip interconnect architectures. IEEE Trans Comput 54:1025–1040
16. Kim J S, Taylor M B, Miller J, Wentzlaff D (2003) Energy characterization of a tiled architecture processor with on-chip networks. In: Proceedings international symposium on low power electronics and design (ISLPED), pp 424–427
17. Vangal S et al (2008) An 80-tile sub-100-W teraFLOPS processor in 65-nm CMOS. IEEE J Solid-State Circuits 43:29–41
18. Scheffer L (2002) Methodologies and tools for pipelined on-chip interconnect. In: Proceedings international conference on computer design (ICCD), pp 152–157
19. Pei BT, Zukowski C (1992) High speed parallel CRC circuits in VLSI. IEEE Trans Commun 40:653–657
20. Fu B, Ampadu P (2010) Error control combining Hamming and product codes for energy efficient nanoscale on-chip interconnects. IET Comput Digit Tech 4:251–261
21. Li L, Vijaykrishnan N, Kandemir M, Irwin J M (2003) Adaptive error protection for energy efficiency. In: Proceedings IEEE/ACM international conference on computer-aided design (ICCAD), pp 2–7
22. Rossi D, Angelini P, Metra C (2007) Configurable error control scheme for NoC signal integrity. In: Proceedings international on line testing symposium (IOLTS), pp 43–48
23. Sun F, Devarajan S, Rose K, Zhang T (2007) Design of on-chip error correction systems for multilevel NOR and NAND flash memories. IET Circuits Devices Syst 1:241–249

Chapter 6

Combining Error Control Codes with Crosstalk Reduction

Conventional error control codes (ECCs) have been successfully applied to improve the reliability of on-chip interconnects by correcting logic errors. Unfortunately, ECCs are ineffective at addressing crosstalk-induced delay uncertainty, which greatly decreases system performance and can even cause timing errors. Crosstalk-induced delay uncertainty results from the dependence of coupling capacitance and inductance on different wire switching patterns. In this chapter, we mainly focus on the delay uncertainty caused by capacitive crosstalk coupling. The capacitive crosstalk induced delay uncertainty can be alleviated by techniques such as shielding, routing, wire sizing and spacing, crosstalk avoidance codes (CACs), skewed transitions, and staggered repeaters. Typically, these methods do not address logic errors. In this chapter, we discuss solutions that efficiently address both logic errors and capacitive crosstalk induced delay uncertainty.

6.1 Duplicate-Add-Parity (DAP) Codes

The encoding and decoding processes of the DAP code are introduced in Chap. 4. By duplicating the input data and adding an extra parity check bit, DAP codes have a minimum Hamming distance of three and can correct single errors. DAP codes can also reduce capacitive coupling [1, 2]. Let di (0 ≤ i < k) be the bits of the k-bit original data, d'i (0 ≤ i < k) be the bits of the k-bit duplicated data, and p0 be the parity check bit. In a DAP code implementation, the bus wire used to transmit d'i is always placed adjacent to the bus wire used to transmit di. Because the values of di and d'i are always the same, the coupling capacitance between these two adjacent wires does not need to be charged. Thus, any bus wire transmitting DAP codewords only needs to charge the coupling capacitance on one side. Moreover, an intelligent spacing method can be used to further optimize the DAP code [2]. In the intelligent spacing method, the spacing between two wires carrying identical data can be smaller than the

B. Fu and P. Ampadu, Error Control for Network-on-Chip Links, DOI 10.1007/978-1-4419-9313-7_6, © Springer Science+Business Media, LLC 2012


Fig. 6.1 A bus layout of a DAP code with intelligent spacing

Fig. 6.2 Worst-case wire capacitances of a (9, 4) DAP code and a (7, 4) Hamming code [2]

spacing between two wires carrying different data. Figure 6.1 shows a bus layout of a DAP code with intelligent spacing. A grounded shielding wire is assumed to be placed around the boundary of the bus. From Fig. 6.1, a wire (except d0 and p0) in a DAP codeword has an effective coupling capacitance of 2Cc(SDR), where Cc(SDR) is the physical coupling capacitance between two wires with a spacing SDR. If the boundary condition is considered, the worst-case coupling capacitance in the DAP code occurs at wire p0. When wires d'k−1 and p0 switch in opposite directions, the worst-case coupling capacitance of wire p0 is 3Cc(SDR). Figure 6.2 shows the worst-case wire capacitances of a (9, 4) DAP code and a (7, 4) Hamming code. The delay factor is defined as the ratio of the worst-case wire capacitance in each coding scheme to the ground capacitance of a wire with the minimum spacing. The same wire routing area is used for both coding schemes. Figure 6.2 shows that the (9, 4) DAP code has a smaller effective coupling capacitance than the (7, 4) Hamming code in most cases. The bus layout of the DAP code can be further optimized by increasing the spacing between wires d'k−1 and p0 to reduce the coupling capacitance of p0. In addition, a modified DAP code (MDR) is proposed in [3]. In the MDR code, the parity bit p0 is also duplicated, and the worst-case coupling capacitance is reduced to 2Cc(SDR).
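The per-wire coupling figures quoted above can be reproduced with a small counting model. The sketch below is only an illustration under simplifying assumptions that are not part of the original text: every coupling capacitance is taken as one unit Cc (the different spacings of the intelligent spacing method are ignored), a grounded shield contributes 1 Cc, a neighbor carrying identical data contributes nothing, and any other neighbor contributes 2 Cc in the worst case. The function names `dap_layout` and `worst_case_coupling` are hypothetical.

```python
def dap_layout(k, duplicate_parity=False):
    """Wire labels for a k-bit DAP bus; equal labels carry identical data."""
    layout = []
    for i in range(k):
        layout += [f"d{i}", f"d{i}"]          # d_i placed next to its duplicate d'_i
    # The MDR variant described in the text also duplicates the parity bit.
    layout += ["p0", "p0"] if duplicate_parity else ["p0"]
    return layout


def worst_case_coupling(layout):
    """Worst-case effective coupling of each signal wire, in units of Cc."""
    padded = ["shield"] + list(layout) + ["shield"]   # grounded shields at both edges
    factors = []
    for i in range(1, len(padded) - 1):
        total = 0
        for neighbor in (padded[i - 1], padded[i + 1]):
            if neighbor == "shield":
                total += 1        # quiet neighbor: Cc is charged once
            elif neighbor != padded[i]:
                total += 2        # neighbor may switch in the opposite direction
            # identical data: no voltage difference, nothing to charge
        factors.append(total)
    return factors


print(worst_case_coupling(dap_layout(4)))                              # (9, 4) DAP
print(max(worst_case_coupling(dap_layout(4, duplicate_parity=True))))  # MDR variant
```

Under these assumptions the (9, 4) DAP layout gives 1 Cc for d0, 2 Cc for the interior wires and 3 Cc for p0, while duplicating the parity bit lowers the maximum to 2 Cc, in line with the values stated in the text.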

6.2 Boundary Shift Code (BSC)

BSC [4] is a type of code in which no two adjacent bits switch simultaneously in opposite directions (i.e., no 01 → 10 or 10 → 01 transition at two adjacent bit positions). The encoding process of BSC codes is similar to that of DAP codes. The input data is first duplicated and an extra parity check bit is added. In order to avoid adjacent bits switching in opposite directions, the encoded data in odd cycles is right shifted before it is transmitted (cycle numbering starts from 0). In BSC codes, the parity check bit can be the rightmost or the leftmost bit of the codeword. An example of BSC codes is shown in Fig. 6.3. The decoding process of BSC codes is similar to that of DAP codes. The received codeword is first shifted back when the clock cycle is odd. A parity check bit is recalculated using one copy of the input data and compared to the transmitted parity bit. If the two are equal, the data copy used to recalculate the parity bit is selected as the decoder output. If they differ, the other copy of the input data is selected as the decoder output. Figure 6.4 shows the decoder of a BSC (9, 4) code.
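The encoding and decoding steps described above can be sketched in a few lines. This is a minimal illustration, not the circuit of Fig. 6.4: it assumes the duplicated bits are placed side by side with the parity bit as the rightmost wire in even cycles, and it models the right shift in odd cycles as a one-position rotation, so the parity bit becomes the leftmost wire. All function names are hypothetical.

```python
def int_to_bits(value, width):
    """Little-endian bit list of an integer (helper for exhaustive checks)."""
    return [(value >> i) & 1 for i in range(width)]


def parity(bits):
    return sum(bits) % 2


def bsc_encode(data, cycle):
    """Duplicate each data bit, append the parity bit, rotate right in odd cycles."""
    word = []
    for bit in data:
        word += [bit, bit]
    word.append(parity(data))
    if cycle % 2 == 1:
        word = [word[-1]] + word[:-1]      # right shift by one wire position
    return word


def bsc_decode(word, cycle):
    """Undo the shift, then pick the data copy that agrees with the parity bit."""
    if cycle % 2 == 1:
        word = word[1:] + [word[0]]        # shift back
    copy_a = word[0:-1:2]                  # first copy of the data
    copy_b = word[1:-1:2]                  # second copy of the data
    return copy_a if parity(copy_a) == word[-1] else copy_b


def has_opposite_switching(prev, curr):
    """True if any two adjacent wires make a 01 -> 10 or 10 -> 01 transition."""
    for i in range(len(prev) - 1):
        step = (prev[i], prev[i + 1], curr[i], curr[i + 1])
        if step in {(0, 1, 1, 0), (1, 0, 0, 1)}:
            return True
    return False
```

With this layout, every pair of adjacent wires carries identical values either before or after each transition, which is why opposite switching cannot occur between consecutive cycles. A single bit flip anywhere in the received word still yields the original data, because either the first copy matches the parity bit or the second copy is untouched.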

Fig. 6.3 An example of BSC code

Fig. 6.4 Decoder design of a BSC (9,4) code


BSC codes have a minimum Hamming distance of three and can therefore correct single errors. Because no two adjacent bits in a BSC codeword switch simultaneously in opposite directions, the worst-case coupling capacitance of a bus wire used to transmit BSC codewords is equal to 2Cc, which is smaller than the worst-case coupling capacitance of a standard bus wire.

6.3 Crosstalk Avoidance and Multiple Error Correction Code (CAMEC)

The DAP code concept can be extended to construct CAMEC codes. Figure 6.5 shows an example of a CAMEC code proposed in [5]. In this example, the input data is first encoded using a Hamming code. The outputs of the Hamming encoder are duplicated, and an overall parity check bit, calculated from the output of the Hamming encoder, is added to the whole codeword. If an (n, k) Hamming code is used for k-bit input information, the codeword width of the CAMEC code is 2n + 1. The minimum Hamming distance of the CAMEC code constructed as in Fig. 6.5 is seven, so this code is guaranteed to correct up to three errors. The decoding process of the CAMEC code is more complex than that of DAP codes. Figure 6.6 shows a crosstalk avoiding double error correction (CADEC) decoding algorithm proposed in [5]. In a CADEC decoder, the parity check bits pa and pb are first recalculated from the original Hamming codeword and its copy, respectively. Then, these recalculated parity check bits pa and pb are compared. If they are the same, the duplicated Hamming codeword is sent to syndrome detection.

Fig. 6.5 An example of CAMEC codes


Fig. 6.6 Implementation of CADEC decoder

If no error is detected during syndrome detection, the duplicated Hamming codeword is used as the input of a conventional Hamming decoding process; otherwise, the original Hamming codeword is sent to the conventional Hamming decoder. If pa and pb are different, pb is compared to the received parity check bit p0. If pb is equal to p0, the duplicated Hamming codeword is used for further decoding. If pb is not equal to p0, the original Hamming codeword is used to complete the conventional Hamming decoding process. The CADEC decoding algorithm is only guaranteed to correct double errors. An updated joint crosstalk avoidance and triple error correction (JTEC) decoding algorithm is proposed in [6]. The JTEC code is guaranteed to correct three errors. In JTEC codes, the Hamming code together with the overall parity bit forms an extended Hamming code, which can correct a single error and detect a double error at the same time. Figure 6.7 shows the flowchart of the JTEC decoding algorithm. In a JTEC decoder, the syndromes SA and SB are first calculated from the extended Hamming copy (Hamming codeword and parity bit) and the Hamming copy, respectively. If syndrome SA is zero, no error exists in the extended Hamming copy, and it is used as the decoder output. If SA is not zero, there can be one, two or three errors in the extended Hamming copy. If SA indicates that two errors exist in the extended Hamming copy, the Hamming copy has at most a single error; it is decoded and selected as the decoder output. If SA indicates that one or three errors exist in the extended Hamming copy, the syndrome SB of the Hamming copy is used to make the decision. If SB is zero, all three errors are in the extended Hamming copy, and the Hamming copy is selected as the decoder output. If SB is not zero, it means that only


Fig. 6.7 Flowchart of JTEC decoding algorithm

a single error exists in the extended Hamming copy, which can be corrected and selected as the decoder output. In order to reduce the calculation delay of the parity bit p0, the extended Hamming code can be replaced by a Hsiao SEC–DED code. The parity bit p0 in Fig. 6.5 can be duplicated to construct a code with a minimum Hamming distance of eight. This code is proposed in [6] and named the joint crosstalk avoidance and triple error correction and simultaneous quadruple error detection (JTEC-SQED) code. The worst-case coupling capacitance of a bus wire transmitting a CAMEC codeword is 2Cc. Table 6.1 compares the wire delay in 64-node NoC architectures when different error control coding schemes are applied to the router-to-router links. Three different NoC architectures, mesh, folded torus, and butterfly fat tree (BFT), are considered. For 64 nodes, a BFT-based architecture has three levels of routers. BFTa is the wire delay between the routers in level 2 and the routers in level 3. BFTb is the wire delay between the routers in level 1 and the routers in level 2. Table 6.1 shows that conventional Hamming codes, which do not have crosstalk avoidance characteristics, have the largest wire delay.
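The codeword width 2n + 1 and the minimum distance of seven quoted earlier in this section can be checked by brute force for a small instance. The sketch below assumes a (7, 4) Hamming code with one particular systematic choice of parity equations (any valid Hamming code would do), and it simply concatenates the two Hamming copies; in a real link the copies would be interleaved wire by wire for crosstalk avoidance, which does not change the distance. The function names are hypothetical.

```python
def hamming74_encode(d):
    """Systematic (7, 4) Hamming encoder: four data bits followed by three parity bits."""
    d0, d1, d2, d3 = d
    return [d0, d1, d2, d3, d0 ^ d1 ^ d3, d0 ^ d2 ^ d3, d1 ^ d2 ^ d3]


def camec_encode(d):
    """Hamming codeword, its duplicate, and an overall parity bit over the Hamming codeword."""
    h = hamming74_encode(d)
    return h + h + [sum(h) % 2]


def min_distance(encode, k):
    """Smallest Hamming distance between any two codewords of a k-bit encoder."""
    words = [encode([(v >> i) & 1 for i in range(k)]) for v in range(2 ** k)]
    return min(
        sum(a != b for a, b in zip(w1, w2))
        for i, w1 in enumerate(words)
        for w2 in words[i + 1:]
    )


print(len(camec_encode([1, 0, 1, 1])))      # codeword width 2n + 1
print(min_distance(hamming74_encode, 4))    # distance of the Hamming code alone
print(min_distance(camec_encode, 4))        # distance after duplication plus parity
```

The distance grows from three to seven because duplication doubles the weight of every nonzero Hamming codeword, and the overall parity bit adds one more to the odd-weight codewords, which include the minimum-weight ones.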


Table 6.1 The comparison of wire delay in 64-node NoC architecture [6]

Coding scheme                NoC architecture   Length (mm)   Delay (ps)
Hamming code                 Mesh               2.86          243
                             Folded torus       5.72          612
                             BFTa               10            1,620
                             BFTb               5             495
DAP/CADEC/JTEC/JTEC-SQED     Mesh               2.86          184
                             Folded torus       5.72          375
                             BFTa               10            900
                             BFTb               5             315

Table 6.2 The comparison of codec delay and area [6]

Coding scheme   Encoder delay (ps)   Decoder delay (ps)   Area (2-input NAND gates)
Hamming code    410                  520                  447
DAP             290                  475                  396
CADEC           525                  545                  1,145
JTEC            190                  440                  1,495
JTEC-SQED       190                  450                  1,675

Table 6.2 compares the codec delay and area of the Hamming code, the DAP code and different CAMEC codes. The delay and area values are synthesis results in a 90-nm technology. The input data width is 32 bits. The CADEC code has the largest codec delay. By using the Hsiao code, JTEC and JTEC-SQED have smaller codec delays. Compared to the conventional Hamming code and the DAP code, the improved error correction capability of CAMEC codes comes at the cost of a larger codec area.

6.4 Unified Coding Framework

A unified coding framework that combines error control coding with crosstalk avoidance codes is proposed in [1, 7]. In this method, the input data is first encoded using nonlinear crosstalk avoidance codes (CACs). The outputs of the CACs are then encoded using an error control code. The parity bits generated by the error control code are protected against crosstalk coupling using techniques such as shielding and duplication. The encoding process of combining ECC with CACs is shown in Fig. 6.8. There are three commonly used CACs: forbidden overlap condition (FOC) codes [8], forbidden transition condition (FTC) codes [9], and forbidden pattern condition (FPC) codes [10]. Each of these CACs has a different crosstalk reduction capability. The encoding process of each CAC is described as follows.


Fig. 6.8 A unified coding framework combining error control with CACs

6.4.1 Forbidden Overlap Condition (FOC) Codes

In Table 2.1, the worst-case link delay (1 + 4λ)τ0 occurs when there is a 010 → 101 (or 101 → 010) transition on three adjacent wires. This worst-case link delay can be avoided by prohibiting these switching patterns. The CACs satisfying this requirement are named forbidden overlap condition (FOC) codes [8]. Because no 010 → 101 (or 101 → 010) transition exists between two consecutive FOC codewords, the worst-case delay of FOC codes is reduced from (1 + 4λ)τ0 to (1 + 3λ)τ0. Table 6.3 shows the truth table of an FOC(5, 4) code. The encoding process can be expressed by

\[
\begin{aligned}
c_0 &= d_1 + d_2\bar{d}_3\\
c_1 &= d_2\bar{d}_3\\
c_2 &= d_0\\
c_3 &= d_2 d_3\\
c_4 &= d_1 d_2 + d_3
\end{aligned}
\tag{6.1}
\]

where di (i = 0 to 3) is the input data bit, ci (i = 0 to 4) is the FOC(5, 4) codeword bit, and an overbar denotes the logical complement. The complexity of the FOC code increases significantly with the input data width, so it is impractical to encode a wide bus using a single FOC code. A solution to this issue is to separate a wide bus into small groups and encode each group using an FOC code with a small input width. For example, 32-bit input data can be separated into eight groups, with each group encoded using an FOC(5, 4) code. In this hierarchical encoding method, two groups of FOC(5, 4) code can be placed next to each other without violating the requirement of FOC codes. For 32-bit input data, the total encoded output is 40 bits. Half-shielding, in which a shield wire is inserted between every two signal wires, can be regarded as the simplest approach satisfying the forbidden overlap condition. For a wide bus, half-shielding has a large area overhead compared to encoding the whole bus using multiple FOC(5, 4) codes.
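The forbidden overlap property can be checked exhaustively for this small code. The sketch below implements the encoding equations of (6.1), with the complemented literals chosen so that the output agrees with the truth table in Table 6.3, and then tests every pair of codewords for a 010 ↔ 101 transition, both for a single group and for two groups placed side by side. The wire order c4 … c0 and the function names are assumptions made for illustration.

```python
def int_to_bits(value, width):
    """Little-endian bit list [d0, d1, ...] of an integer."""
    return [(value >> i) & 1 for i in range(width)]


def foc54_encode(d):
    """FOC(5, 4) encoder; returns the wires in the order c4, c3, c2, c1, c0."""
    d0, d1, d2, d3 = d
    n3 = 1 - d3                       # complement of d3
    c0 = d1 | (d2 & n3)
    c1 = d2 & n3
    c2 = d0
    c3 = d2 & d3
    c4 = (d1 & d2) | d3
    return [c4, c3, c2, c1, c0]


def violates_foc(prev, curr):
    """True if any three adjacent wires make a 010 <-> 101 transition."""
    for i in range(len(prev) - 2):
        window = {tuple(prev[i:i + 3]), tuple(curr[i:i + 3])}
        if window == {(0, 1, 0), (1, 0, 1)}:
            return True
    return False


codewords = [foc54_encode(int_to_bits(v, 4)) for v in range(16)]
single_ok = not any(violates_foc(a, b) for a in codewords for b in codewords)

# Two FOC(5, 4) groups side by side (c0 of one group next to c4 of the other).
pairs = [a + b for a in codewords for b in codewords]
adjacent_ok = not any(violates_foc(p, q) for p in pairs for q in pairs)
print(single_ok, adjacent_ok)
```

Both checks pass for this encoder, which is consistent with the statement that FOC(5, 4) groups can be placed next to each other without extra shielding.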

Table 6.3 (5, 4) forbidden overlap condition (FOC) codes

Data bits        Codeword bits
d3 d2 d1 d0      c4 c3 c2 c1 c0
0  0  0  0       0  0  0  0  0
0  0  0  1       0  0  1  0  0
0  0  1  0       0  0  0  0  1
0  0  1  1       0  0  1  0  1
0  1  0  0       0  0  0  1  1
0  1  0  1       0  0  1  1  1
0  1  1  0       1  0  0  1  1
0  1  1  1       1  0  1  1  1
1  0  0  0       1  0  0  0  0
1  0  0  1       1  0  1  0  0
1  0  1  0       1  0  0  0  1
1  0  1  1       1  0  1  0  1
1  1  0  0       1  1  0  0  0
1  1  0  1       1  1  1  0  0
1  1  1  0       1  1  0  0  1
1  1  1  1       1  1  1  0  1

Table 6.4 (4, 3) forbidden transition condition (FTC) codes

Data bits     Codeword bits
d2 d1 d0      c3 c2 c1 c0
0  0  0       0  0  0  0
0  0  1       0  1  0  0
0  1  0       0  0  0  1
0  1  1       0  1  0  1
1  0  0       0  1  1  1
1  0  1       1  1  0  0
1  1  0       1  1  0  1
1  1  1       1  1  1  1

6.4.2 Forbidden Transition Condition (FTC) Codes

In FTC codes [9], any transition involving adjacent wires switching in opposite directions is prohibited (i.e., a 01 → 10 or 10 → 01 transition); thus, the worst-case link delay of FTC codes is reduced from (1 + 4λ)τ0 to (1 + 2λ)τ0. Inserting a shielding wire between each pair of adjacent signal lines is the simplest approach satisfying the forbidden transition condition. Table 6.4 shows the truth table of an FTC(4, 3) code. The encoding process can be expressed by

\[
\begin{aligned}
c_0 &= d_1 + d_2\bar{d}_0\\
c_1 &= \bar{d}_0\bar{d}_1 d_2 + d_0 d_1 d_2\\
c_2 &= d_0 + d_2\\
c_3 &= d_0 d_2 + d_1 d_2
\end{aligned}
\tag{6.2}
\]

Table 6.5 (5, 4) forbidden pattern condition (FPC) codes

Data bits        Codeword bits
d3 d2 d1 d0      c4 c3 c2 c1 c0
0  0  0  0       0  0  0  0  0
0  0  0  1       0  0  0  0  1
0  0  1  0       0  0  1  1  0
0  0  1  1       0  0  0  1  1
0  1  0  0       0  1  1  0  0
0  1  0  1       0  0  1  1  1
0  1  1  0       0  1  1  1  0
0  1  1  1       0  1  1  1  1
1  0  0  0       1  0  0  0  0
1  0  0  1       1  0  0  0  1
1  0  1  0       1  1  0  0  0
1  0  1  1       1  0  0  1  1
1  1  0  0       1  1  1  0  0
1  1  0  1       1  1  0  0  1
1  1  1  0       1  1  1  1  0
1  1  1  1       1  1  1  1  1

A similar hierarchical encoding method can be applied to encode a wide bus using FTC codes. For example, 32-bit input data can be separated into 11 groups and each group is encoded using a FTC (4, 3) code. Because two groups of FTC (4, 3) code cannot be placed next to each other without violating the FTC requirement at the boundary, shielding wires are needed between each adjacent group to ensure that transitions on boundary wires do not switch in opposite directions. Using this hierarchical method, a FTC (53, 32) can be constructed for 32-bit input data.
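The need for shield wires between FTC groups can be demonstrated directly. The sketch below implements the FTC(4, 3) encoding equations, with complemented literals chosen to agree with the truth table in Table 6.4, and checks for opposite switching on adjacent wires in three cases: a single group, two groups placed directly side by side, and two groups separated by a grounded shield wire. The wire order c3 … c0 and the function names are assumptions made for illustration.

```python
def int_to_bits(value, width):
    """Little-endian bit list [d0, d1, ...] of an integer."""
    return [(value >> i) & 1 for i in range(width)]


def ftc43_encode(d):
    """FTC(4, 3) encoder; returns the wires in the order c3, c2, c1, c0."""
    d0, d1, d2 = d
    n0, n1 = 1 - d0, 1 - d1           # complements of d0 and d1
    c0 = d1 | (d2 & n0)
    c1 = (n0 & n1 & d2) | (d0 & d1 & d2)
    c2 = d0 | d2
    c3 = (d0 & d2) | (d1 & d2)
    return [c3, c2, c1, c0]


def violates_ftc(prev, curr):
    """True if two adjacent wires make a 01 -> 10 or 10 -> 01 transition."""
    for i in range(len(prev) - 1):
        step = (prev[i], prev[i + 1], curr[i], curr[i + 1])
        if step in {(0, 1, 1, 0), (1, 0, 0, 1)}:
            return True
    return False


def any_violation(words):
    return any(violates_ftc(a, b) for a in words for b in words)


codewords = [ftc43_encode(int_to_bits(v, 3)) for v in range(8)]
unshielded = [a + b for a in codewords for b in codewords]
shielded = [a + [0] + b for a in codewords for b in codewords]   # grounded shield wire

print(any_violation(codewords))    # single group
print(any_violation(unshielded))   # two groups, no shield
print(any_violation(shielded))     # two groups with a shield in between
```

Only the unshielded arrangement produces a violation: the boundary wires of neighboring groups are independent of each other and can therefore switch in opposite directions, whereas a constant shield wire can never take part in an opposite transition.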

6.4.3 Forbidden Pattern Condition (FPC) Codes

In FPC codes [10], the coupling effects are reduced by prohibiting the 010 and 101 bit patterns in every codeword. The worst-case link delay of FPC codes is reduced from (1 + 4λ)τ0 to (1 + 2λ)τ0. Table 6.5 shows the truth table of an FPC(5, 4) code. The encoding process is expressed by

\[
\begin{aligned}
c_0 &= d_0\\
c_1 &= d_0 d_1 + d_1 d_2 + d_1\bar{d}_3 + d_0 d_2\bar{d}_3\\
c_2 &= d_2\bar{d}_3 + d_1 d_2 + \bar{d}_0 d_2 + \bar{d}_0 d_1\bar{d}_3\\
c_3 &= d_2 d_3 + \bar{d}_0 d_2 + d_1 d_2 + \bar{d}_0 d_1 d_3\\
c_4 &= d_3
\end{aligned}
\tag{6.3}
\]


Fig. 6.9 Hierarchical encoding using two FPC(5,4) codes

Table 6.6 The comparison of coupling factor, minimum Hamming distance and codeword width [7]

Coding scheme   CAC                       ECC       Parity protection         Maximum coupling   Minimum distance   Number of wires for 32-bit bus
FOC+HC          FOC(5,4)                  Hamming   Half-shielding            3                  3                  49
FTC+HC          FTC(4,3)                  Hamming   Shielding                 2 (FT)             3                  65
DAP             Duplication               Parity    –                         2 (FP)             3                  65
OLC+HC          OLC(8,4)                  Hamming   Duplication + shielding   1                  3                  106
DSAP            Duplication + shielding   Parity    Shielding                 1                  3                  97
BSC             Duplication               Parity    –                         2                  3                  65

In order to ensure that no bit patterns 101 and 010 occur at the boundaries of two groups of FPC(5, 4) code during hierarchical encoding process, shielding wires can be inserted between two adjacent groups. Figure 6.9 shows another solution to solve the boundary problem. In Fig. 6.9, the most significant input bit of a FPC(5,4) encoder is fed into the least significant input bit of the next adjacent FPC(5,4) encoder. This method is more efficient than simply placing shielding wires between two adjacent groups, resulting in fewer redundancy wires. Using this method, a FPC(52,32) code can be constructed for 32-bit input data.
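The boundary problem described above is easy to reproduce. The sketch below implements the FPC(5, 4) encoding equations, with complemented literals chosen to agree with the truth table in Table 6.5, confirms that no single codeword contains a 010 or 101 pattern, and shows that simply concatenating two groups can create such a pattern at the boundary. It does not model the input-sharing scheme of Fig. 6.9. The wire order c4 … c0 and the function names are assumptions made for illustration.

```python
def int_to_bits(value, width):
    """Little-endian bit list [d0, d1, ...] of an integer."""
    return [(value >> i) & 1 for i in range(width)]


def fpc54_encode(d):
    """FPC(5, 4) encoder; returns the wires in the order c4, c3, c2, c1, c0."""
    d0, d1, d2, d3 = d
    n0, n3 = 1 - d0, 1 - d3           # complements of d0 and d3
    c0 = d0
    c1 = (d0 & d1) | (d1 & d2) | (d1 & n3) | (d0 & d2 & n3)
    c2 = (d2 & n3) | (d1 & d2) | (n0 & d2) | (n0 & d1 & n3)
    c3 = (d2 & d3) | (n0 & d2) | (d1 & d2) | (n0 & d1 & d3)
    c4 = d3
    return [c4, c3, c2, c1, c0]


def has_forbidden_pattern(word):
    """True if the word contains 010 or 101 on three adjacent wires."""
    return any(tuple(word[i:i + 3]) in {(0, 1, 0), (1, 0, 1)}
               for i in range(len(word) - 2))


codewords = [fpc54_encode(int_to_bits(v, 4)) for v in range(16)]
print(any(has_forbidden_pattern(w) for w in codewords))          # single group
print(any(has_forbidden_pattern(a + b)
          for a in codewords for b in codewords))                # two groups, no separation
```

For example, the codewords 00001 and 00000 placed side by side form 0000100000, which contains 010 across the group boundary; this is the situation that the shielding wires or the scheme of Fig. 6.9 are meant to prevent.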

6.4.4 Performance Evaluation

Table 6.6 lists the coupling factor, the minimum Hamming distance and codeword width of different coding schemes, FOC + HC, FTC + HC, and OLC + HC, which are constructed using the uniform coding method. In these codes, a Hamming code is combined with FOC(5, 4), FTC(4, 3), and OLC(8, 4) based CACs, respectively. OLC represents one lambda code. The simplest OLC can be constructed by duplicating the input data bits and inserting shield wires between adjacent pairs

Table 6.7 An OLC(8, 4) code

Data bits        Codeword bits
d3 d2 d1 d0      c7 c6 c5 c4 c3 c2 c1 c0
0  0  0  0       0  0  0  0  0  0  0  0
0  0  0  1       0  0  0  0  0  0  0  1
0  0  1  0       0  0  0  0  0  1  1  1
0  0  1  1       0  0  0  1  1  1  0  0
0  1  0  0       0  0  0  1  1  1  1  1
0  1  0  1       0  1  1  1  0  0  0  0
0  1  1  0       0  1  1  1  0  0  0  1
0  1  1  1       0  1  1  1  1  1  0  0
1  0  0  0       0  1  1  1  1  1  1  1
1  0  0  1       1  1  0  0  0  0  0  0
1  0  1  0       1  1  0  0  0  0  0  1
1  0  1  1       1  1  0  0  0  1  1  1
1  1  0  0       1  1  1  1  0  0  0  0
1  1  0  1       1  1  1  1  0  0  0  1
1  1  1  0       1  1  1  1  1  1  0  0
1  1  1  1       1  1  1  1  1  1  1  1

of duplicated bits. An OLC(8, 4) code proposed in [11] is shown in Table 6.7. To encode a wide bus using OLC(8, 4), the wide bus is first separated into several 4-bit groups. Each group is encoded separately using an OLC(8, 4) code. Between adjacent groups, the boundary bit is duplicated and a shield wire is inserted. Table 6.6 also includes the DAP code, BSC, and duplicate shield add parity (DSAP) codes, which can also correct single errors but are constructed using an alternative approach. In the DSAP code, shield wires are inserted between adjacent pairs of duplicated bits and between the duplicated bits and the parity bit. Figure 6.10 compares the codec overhead of different coding schemes for a 32-bit bus [7]. The codec area, delay and energy are normalized to those of the Hamming (38, 32) code. Figure 6.10 shows that the OLC + HC code has the largest codec overhead. The DAP and DSAP codes have the smallest codec overhead. Figure 6.11 compares the speedup of different coding schemes over the uncoded bus as a function of λ, the ratio of the coupling capacitance Cc to the ground capacitance Cg. The bus length L is 10 mm. The speedup of code 1 over code 2 is defined in [1] as

\[
\mathrm{speedup} = \frac{T_{c2} + T_{b2}}{T_{c1} + T_{b1}}
\tag{6.4}
\]

where Tci is the codec delay of code i including encoder and decoder delays and Tbi is the bus delay with code i. Hamming code has a speedup of less than one because of the same link delay and an extra codec delay. The joint coding schemes, which can simultaneously correct single error and reduce crosstalk coupling, achieve the


Fig. 6.10 Codec area, delay and energy comparison for different coding schemes

Fig. 6.11 Speedup comparison of different coding schemes [7]


speedup over the uncoded bus. DAP and DSAP codes achieve speedups of 1.44 and 2.14, respectively, at L = 10 mm and λ = 2.8, because these two codes reduce the worst-case capacitive coupling factor to two and one, respectively, and also have a relatively small codec delay overhead. Speedup increases with the bus length and with λ. Therefore, as technology scaling leads to reduced codec delay, longer bus length L, and larger λ, the joint coding schemes will achieve a larger speedup.
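As an illustration of how the speedup in (6.4) behaves, consider purely hypothetical delay values (they are not taken from [1] or [7]). Let code 2 be the uncoded bus, with no codec delay and a bus delay of 1,000 ps, and let code 1 be a joint code with a total codec delay of 300 ps whose crosstalk avoidance lowers the bus delay to 500 ps:

```latex
\[
\mathrm{speedup}
  = \frac{T_{c2} + T_{b2}}{T_{c1} + T_{b1}}
  = \frac{0 + 1000\ \mathrm{ps}}{300\ \mathrm{ps} + 500\ \mathrm{ps}}
  = 1.25
\]
```

With the same assumed 300 ps codec delay but an unchanged 1,000 ps bus delay, as for a code without crosstalk avoidance, the ratio becomes 1000/1300 ≈ 0.77, which is below one. A longer bus or a larger λ increases the bus delay terms relative to the fixed codec delay, so the benefit of a reduced bus delay grows.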

6.5 Error Control Codes with Skewed Transitions

The skewed transition method [12, 13] reduces crosstalk coupling by delaying adjacent transitions by some finite time ΔT. In this section, we introduce a method that combines error control coding with skewed transitions to simultaneously address error correction and capacitive-coupling-induced delay uncertainty.

6.5.1 The Principle

In skewed transitions, simultaneous opposite switching on neighboring bus lines is avoided by the introduced relative delay ΔT. The worst-case effective capacitance Ceff of skewed transitions (Ceff of a middle wire when a 010 → 101 or 101 → 010 transition occurs on three adjacent wires) can be expressed by (6.5) below [12],

\[
C_{eff} = C_{gt} + \left(4 - 2\,\frac{\left|V_N(\Delta T) - V_N(0)\right|}{V_{DD}}\right) C_{ct}
        = C_{gt} + \bigl(4 - 2v(\Delta T)\bigr)\, C_{ct}
\tag{6.5}
\]

where Cgt is the total capacitance between the wire and ground; Cct is the total coupling capacitance between any two adjacent wires; VN(ΔT) and VN(0) are the voltages of the neighboring wires at time ΔT and 0, respectively; and v(ΔT) is the ratio of the change in the neighboring wire's voltage between time 0 and ΔT to VDD (0 ≤ v(ΔT) ≤ 1). When ΔT = 0, v is 0. As ΔT increases, v approaches 1. In skewed transition methods, delay elements are inserted at the beginning of alternate bus lines to generate the relative delay ΔT, as shown in Fig. 6.12a. For a bus line with k − 1 repeaters, the worst-case link delay Td in skewed transitions can be described by (6.6) below [12],

\[
T_d = \left(0.7R_r + 0.4\,\frac{R_t}{k}\right)\left(C_{gt} + 4C_{ct}\right)
    + 0.7\left(kR_r + R_t\right)C_r
    + \Delta T
    - \left(1.4R_r + 0.8\,\frac{R_t}{k}\right) C_{ct}\, v(\Delta T)
\tag{6.6}
\]


Fig. 6.12 Conventional skewed transitions (a) Skewed transition by inserting delay elements (b) The relation between the worst-case link delay and skewed delay ΔT

where Rr and Cr are the on-resistance and output capacitance of the repeater, and Rt is the total resistance of the wire. The first two terms in (6.6) are the worst-case delay of the standard bus. From (6.6), the delay reduction achieved by the skewed transition method depends on the difference between the last two terms. Thus, a large ΔT increases the overall link delay Td, as shown in Fig. 6.12b. In [14], a method combining ECCs and skewed transitions is proposed to improve the reliability of on-chip interconnects. In this method, ECCs are used to correct logic errors while skewed transitions are applied to reduce capacitive crosstalk-induced delay uncertainty. By hiding the delay insertion overhead of the skewed transition method, this method achieves a larger reduction in the worst-case link delay than the conventional skewed transition method. Figure 6.13 shows the method combining ECCs with skewed transitions. In an error control encoder, the parity bits are generated from the original input data after


Fig. 6.13 Block diagram of proposed method exploiting parity computation latency to reduce crosstalk coupling [14]

a finite delay. Instead of sending the input data and parity bits to the link at the same time, part of the input data can be sent before the parity bits are available. Two clocks (CLK1 and CLK2, with CLK1 arriving ahead of CLK2) are used alternately to offset the transitions in each pair of adjacent interconnect lines. The input data and parity bits are mapped to registers triggered by these two clocks, as shown in Fig. 6.13. Figure 6.14 illustrates the transmission procedure of the method combining ECCs with skewed transitions. Assume that the clock cycle of CLK1 and CLK2 is Tcycle and that k-bit input data are available at the rising edge of CLK1. The calculation of the r-bit parity data is completed after a delay of Dparity. The wires l(i) (1 ≤ i ≤ k + r) in the link with odd index i are triggered by CLK1, and those with even index i are triggered by CLK2. In the proposed method, input data can be sent at the next rising edge of CLK1 or at the rising edge of CLK2, which arrives after a delay ΔT1, as shown in Fig. 6.14. Because the data bits are available before the parity bits are calculated, the data can be sent earlier than the parity bits without affecting the overall system performance. Parity check bits are calculated from the input data after the delay Dparity; thus, they can only be transmitted at the next rising edge of CLK1. The relationship between Tcycle and the timing offsets ΔT1 and ΔT2 is described in Fig. 6.14 and should meet the following constraint,

\[
T_{cycle} = \Delta T_1 + \Delta T_2 \ge D_{parity}
\tag{6.7}
\]

For implementation simplicity, CLK1 and CLK2 can be the rising and falling edge of the same clock.
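The timing relationship can be checked numerically with a small helper. The sketch below is an illustration only: it reads the constraint (6.7) as Tcycle = ΔT1 + ΔT2 with Tcycle no smaller than Dparity, it takes the effective skew between adjacent wires as the smaller of ΔT1 and ΔT2, and the function name, dictionary keys and picosecond values are hypothetical.

```python
def skew_timing(dt1_ps, dt2_ps, d_parity_ps):
    """Summarize the two-clock timing for given offsets (all values in ps)."""
    t_cycle = dt1_ps + dt2_ps                    # CLK1 -> CLK2 -> next CLK1
    return {
        "t_cycle_ps": t_cycle,
        # Parity bits must be ready by the next rising edge of CLK1.
        "parity_ready_in_time": t_cycle >= d_parity_ps,
        # Adjacent wires are separated by the smaller of the two offsets.
        "effective_skew_ps": min(dt1_ps, dt2_ps),
    }


# Rising and falling edges of one 1 GHz clock with 50% duty cycle (assumed values).
print(skew_timing(500, 500, 400))
# Offsets too short for a 400 ps parity computation (assumed values).
print(skew_timing(100, 200, 400))
```

Using the two edges of a single clock makes the two offsets equal for a 50% duty cycle, which maximizes the smaller offset for a given cycle time.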


Fig. 6.14 Transmission procedure of the method combining error control coding with crosstalk reduction [14]

Fig. 6.15 Mapping algorithm for a systematic (n, k) linear block code

6.5.2 Data Mapping Algorithm

In the method combining ECCs with skewed transitions, the input data can be transmitted using either CLK1 or CLK2, while the parity check bits can only be transmitted using CLK1. In a systematic (n, k) linear block code, the codeword c can be calculated by

\[
c = m \cdot \left[\, I_k \mid P_{k \times (n-k)} \,\right]
\tag{6.8}
\]

where m is the k-bit input data, Ik is the k × k identity matrix and Pk×(n−k) is the parity matrix. The mapping of the n-bit codeword c to the proper wire position in the link can be realized by the algorithm in Fig. 6.15, where c(i) is the ith bit in c and l(i) is the ith wire in the link.
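One mapping that satisfies the constraint (parity bits only on CLK1 wires, with odd-indexed wires driven by CLK1) can be sketched as follows. This is not a transcription of the algorithm in Fig. 6.15, which is not reproduced here; it is a simple greedy assignment under the assumption that parity bits take the first available odd-indexed wires and data bits fill everything else. The function name and the bit labels are hypothetical.

```python
def map_codeword(k, r):
    """Assign k data bits and r parity bits of a systematic code to wires 1..k+r.

    Odd-indexed wires are driven by CLK1 and even-indexed wires by CLK2.
    Returns {wire_index: (bit_label, clock)}.
    """
    n = k + r
    if r > (n + 1) // 2:
        raise ValueError("not enough CLK1 wires for all parity bits")
    data = [f"d{j}" for j in range(k)]
    parity = [f"p{j}" for j in range(r)]
    wires = {}
    for i in range(1, n + 1):
        clock = "CLK1" if i % 2 == 1 else "CLK2"
        if clock == "CLK1" and parity:
            bit = parity.pop(0)      # parity bits may only use CLK1
        else:
            bit = data.pop(0)        # data bits may use either clock
        wires[i] = (bit, clock)
    return wires


mapping = map_codeword(k=8, r=4)     # e.g. a Hamming (12, 8) code
for wire, (bit, clock) in mapping.items():
    print(f"l({wire}) <- {bit} on {clock}")
```

For a (12, 8) code this places the four parity bits on wires 1, 3, 5 and 7, separated by the first three data bits, and assigns the remaining data bits to wires 8 through 12.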


Fig. 6.16 An example of the mapping algorithm when a Hamming H(12,8) code is used to correct logic errors

Figure 6.16 shows an example of the mapping algorithm applied to a Hamming H(12,8) code. The parity calculation unit is implemented as XOR trees. The link is driven by two alternating clocks CLK1 and CLK2, ensuring that no two adjacent wires use the same clock. As shown in this example, there are four parity check bits, which are assigned to CLK1 and separated by the first three data bits. The rest of the data bits are assigned to the remaining wires. A more complex case arises when multiple SEC codes or SEC-DED codes are interleaved to correct spatial burst errors. In this case, K-bit input data are separated into several smaller groups, with each group encoded separately using a SEC or SEC-DED code. The outputs of these small groups are interleaved before they are transmitted through the link. The burst error correction capability of this method depends on the interleaving distance, which is defined as the distance between two wires belonging to the same group. In this case, the mapping algorithm should meet two conditions: (1) parity check bits can only be transmitted using CLK1, and (2) the mapping algorithm should maintain the same interleaving distance. Assume that K-bit input data are separated into g groups and each group Gj (1 ≤ j ≤ g) is encoded using a SEC (n, k) code with codeword cj. To maintain the interleaving distance, the mapping must cycle through each group in sequence (e.g., G1 → G2 → G3 → G1, etc.). Mapping data and parity bits between CLK1 and CLK2 is straightforward when the number of groups is odd, as shown by the algorithm in Fig. 6.17: in alternating rounds, each group will be mapped to both CLK1 and


Fig. 6.17 Proposed mapping algorithm for multiple SEC/SEC-DED codes with interleaving when the number of groups is odd

CLK2. For example, if we have three groups, the mapping would begin as follows: G1 → CLK1, G2 → CLK2, G3 → CLK1. The next 'loop' through the groups would then be G1 → CLK2, G2 → CLK1, G3 → CLK2; thus, each group will have access to both CLK1 and CLK2, and can route its data and parity bits appropriately. The mapping of an even number of groups is slightly more complex. If the same method of looping through an even number of groups is used, each group would only be mapped to one clock. For example, with four groups, G1 and G3 will always be mapped to CLK1 and G2 and G4 will always be mapped to CLK2. One solution to this issue is to insert an extra wire in the link, shown as l(13) in Fig. 6.18 (an example with four (7,4) Hamming encoded groups). This extra wire allows us to switch G1 and G3 to CLK2 and G2 and G4 to CLK1, ensuring that each group will have access to both CLK1 and CLK2. The resulting mapping algorithm for an even number of groups is shown in Fig. 6.19.
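The difference between odd and even group counts can be seen by tracking which clocks each group reaches under round-robin interleaving. The sketch below is a simplified model rather than the algorithms of Figs. 6.17 and 6.19: it only records clock access per group, it does not decide which bit of a group is a parity bit, and it models the extra wire as a single unused position after a chosen number of codeword bits. The function name and the `spare_after` parameter are hypothetical.

```python
def clocks_per_group(num_groups, bits_per_group, spare_after=None):
    """Return, for each group, the set of clocks its wires are driven by.

    Codeword bits are interleaved round-robin (G1, G2, ..., G1, ...), and
    odd-indexed wires use CLK1 while even-indexed wires use CLK2.
    If spare_after is given, one unused wire is inserted after that many bits.
    """
    seen = [set() for _ in range(num_groups)]
    wire = 0
    for bit in range(num_groups * bits_per_group):
        wire += 1
        if spare_after is not None and bit == spare_after:
            wire += 1                         # skip the spare wire
        clock = "CLK1" if wire % 2 == 1 else "CLK2"
        seen[bit % num_groups].add(clock)
    return seen


print(clocks_per_group(3, 7))                   # odd number of groups
print(clocks_per_group(4, 7))                   # even number of groups, no spare wire
print(clocks_per_group(4, 7, spare_after=12))   # spare wire at position l(13)
```

With three groups every group reaches both clocks. With four groups and no spare wire, each group is confined to a single clock. Skipping wire 13 flips the clock phase for all later bits, so every group then reaches both clocks, at the cost of one extra wire and a slightly larger spacing between same-group wires across the spare position.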


Fig. 6.18 An example of applying the proposed mapping algorithm to four (7, 4) Hamming encoded groups

6.5.3 Performance Evaluation

Unlike conventional skewed transition methods, the combination of error control codes with skewed transitions hides the overhead introduced by the delay elements in the ECC encoding stage. The worst-case link delay Td_comb of this method can be expressed as

\[
T_{d\_comb} = \left(0.7R_r + 0.4\,\frac{R_t}{k}\right)\left(C_{gt} + 4C_{ct}\right)
            + 0.7\left(kR_r + R_t\right)C_r
            - \left(1.4R_r + 0.8\,\frac{R_t}{k}\right) C_{ct}\, v(\Delta T_{comb})
\tag{6.9}
\]

where ΔTcomb is equal to the smaller of ΔT1 and ΔT2. Moreover, the last term in (6.9) is greater than that in (6.6) because ΔTcomb can be much larger than ΔTconv, the skew used in the conventional method. Thus, the combination of ECCs with skewed transitions can achieve an extra delay reduction compared to a conventional skewed transition approach. Figure 6.20 compares the worst-case link delay of the combination of ECCs with skewed transitions with that of the conventional skewed transition method. A Hamming H(71,64) code is used to correct single logic errors. The Hamming encoder is


Fig. 6.19 Proposed mapping algorithm for multiple SEC/SEC-DED codes with interleaving when the number of groups is even [14]

implemented as XOR trees. The depth of the XOR trees determines the worst-case delay of the Hamming encoder. The H(71,64) encoder is synthesized using a TSMC 65 nm technology with a worst-case delay Dparity = 400 ps. ΔTcomb = 200 ps, which is equal to half of Dparity. A 65 nm link model [15] with lengths from 1 mm to 5 mm is used in the simulations. The link delay is normalized to the delay of a standard bus with minimum link width and spacing. Figure 6.20 shows that the combination of ECCs with skewed transitions reduces the worst-case link delay by up to 46%. Compared to the conventional skewed transition method, this method reduces the worst-case link delay by 25% for a 5 mm link. The performance of the combination of ECC with skewed transitions is compared to DAP codes, BSC codes, and the combination of ECC with CAC codes, such as FOC codes, FTC codes, and FPC codes. The specific crosstalk reduction techniques used for data and parity bits in each scheme are shown in Table 6.8. The codecs are synthesized using a TSMC 65 nm technology. Codec power, delay and area are reported using Synopsys Design Compiler. The system frequency is 1 GHz. A global 65 nm link model [15] is used. The link power is measured using Cadence Spectre with random input data and a switching activity factor of 0.5. The signal slew rate is equal to 2.5 times the output slew rate of an FO4 inverter [15]. The input data


Fig. 6.20 Comparison of worst-case link delay for the combination of ECCs with skewed transitions and conventional skewed transitions [14]

Table 6.8 Specific crosstalk reduction techniques used for data and parity bits for each scheme [14]

Scheme                                 Data crosstalk reduction   Parity crosstalk reduction   Number of wires
Hamming(71,64) + skewed transitions    Skewed transitions         Skewed transitions           71
DAP(129,64)                            Duplication                Duplication                  129
BSC(129,64)                            Duplication                Duplication                  129
Hamming(71,64) + CACs                  FOC                        Half-shielding               91
                                       FTC                        Shielding                    120
                                       FPC                        Shielding                    119

Table 6.9 Codec delay and area comparison with previous solutions simultaneously addressing logic errors and crosstalk-induced delay [14]

Scheme                                 Encoder delay (ns)   Decoder delay (ns)   Codec area (µm2)
Hamming(71,64) + skewed transitions    0.40                 0.62                 2,468.2
DAP(129,64)                            0.43                 0.52                 2,338.4
BSC(129,64)                            0.46                 0.78                 2,638.1
Hamming(71,64) + FOC                   0.52                 0.70                 3,048.5
Hamming(71,64) + FTC                   0.55                 0.76                 4,342.3
Hamming(71,64) + FPC                   0.57                 0.79                 4,485.9

width is 64 bits. Registers are inserted between the encoder and the link, and between the link and the decoder, to allow pipelined operation. Table 6.9 shows the codec delay for each scheme. As can be seen, the decoder delay is typically larger than the encoder delay. DAP(129,64) has the smallest decoder delay, because its decoding process is simpler than that of Hamming codes. The combination of error control coding with CACs has a larger codec delay, because of the extra delay introduced by the CACs. The combination of


Fig. 6.21 Link area comparison for different schemes [14]

ECC with skewed transitions achieves a 21% reduction in decoder delay compared to combining Hamming codes with FPC codes. Table 6.9 also compares the codec area of each scheme. Codec area includes the area of the encoder, the decoder, and the pipeline registers. The results show that the combination of error control coding with CACs has a large codec area overhead compared to the other schemes, because of the encoding and decoding circuits of the CACs. The codec area of combining the Hamming code with skewed transitions is close to the codec area of DAP codes and is 45% less than the codec area of combining Hamming codes with FPC. Figure 6.21 compares the link area of the different schemes. The area is normalized to an uncoded bus with minimum link width and wire spacing. The results show that DAP(129,64) has the largest link area, because of the large number of wires required. The combination of ECC with skewed transitions requires the fewest wires, resulting in the smallest link area of the compared schemes. It achieves about a 45% reduction in link area compared to DAP codes for 64-bit data. The residual flit error rate is used to measure reliability. Presidual of Hamming codes can be estimated by (6.10) below [1],

$$P_{\mathrm{Hamming}}(\varepsilon) = \binom{k+r}{2}\,\varepsilon^{2} \qquad (6.10)$$

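where k is the input data width, r is the number of parity bits in the Hamming codeword, and ε is the probability of a single wire being erroneous. The residual flit error rate of DAP and BSC codes can be estimated by (6.11) below [1],

$$P_{\mathrm{DAP}}(\varepsilon) = \frac{3k(k+1)}{2}\,\varepsilon^{2} \qquad (6.11)$$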

The residual flit error probability of each scheme is estimated by substituting k = 64 and the corresponding r value into (6.10) and (6.11). For k = 64, the combination of H(71,64) with skewed transitions achieves 1.5× and 2.5× improvements in residual word error probability compared to combining H(71,64) with FOC and to DAP(129,64), respectively.
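As a quick check of the 2.5× figure, the estimates (6.10) and (6.11) can be evaluated numerically. The sketch below is illustrative only: the function names and the value chosen for the wire error probability are assumptions, and the ratio between the two estimates does not depend on that value because both scale with its square.

```python
from math import comb


def p_hamming(k: int, r: int, eps: float) -> float:
    """Residual flit error estimate of (6.10): C(k + r, 2) * eps^2."""
    return comb(k + r, 2) * eps ** 2


def p_dap(k: int, eps: float) -> float:
    """Residual flit error estimate of (6.11): 3k(k + 1)/2 * eps^2."""
    return 3 * k * (k + 1) / 2 * eps ** 2


K = 64      # input data width
R = 7       # parity bits of H(71,64)
EPS = 1e-6  # hypothetical wire error probability, for illustration only

hamming = p_hamming(K, R, EPS)
dap = p_dap(K, EPS)
ratio = dap / hamming

print(f"H(71,64):    {hamming:.3e}")
print(f"DAP(129,64): {dap:.3e}")
print(f"DAP / Hamming ratio: {ratio:.2f}")
```

The ratio evaluates to 6240/2485, or roughly 2.5, which agrees with the improvement over DAP(129,64) quoted above. The FOC comparison is not reproduced here because it depends on the codeword length after FOC encoding, which this section does not state.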


Fig. 6.22 Delay uncertainty comparison for different schemes handling both error correction and crosstalk reduction [14]

Fig. 6.23 Link energy consumption for different schemes that simultaneously address error correction with crosstalk reduction [14]

Delay uncertainty is used to measure the effects of crosstalk coupling on the link delay. The delay uncertainty is defined as the ratio of the delay variation to the worst-case delay [16],

$$U = \frac{t_{\mathrm{prop}}(\max) - t_{\mathrm{prop}}(\min)}{t_{\mathrm{prop}}(\max)} \qquad (6.12)$$

where tprop(max) is the worst-case link propagation delay and tprop(min) is the minimum link propagation delay. Figure 6.22 shows the delay uncertainty of each scheme. The link length is varied from 1 to 3 mm. Each scheme examined greatly reduces the delay uncertainty compared to an uncoded link. The delay uncertainty of the combination of ECC with skewed transitions is up to 49% less than that of uncoded links. Compared to the combination of ECC with FPC, it achieves up to a 24% reduction in delay uncertainty. Figure 6.23 compares the link energy


Fig. 6.24 Total energy versus link length for different schemes simultaneously addressing error correction with crosstalk reduction [14]

of each method. The comparison is performed for the same reliability requirement (Preq < 10^-20, with a 3σN noise voltage equal to 20% of Vdd). The results show that the combination of ECC with skewed transitions has the lowest link energy consumption of the compared schemes, because it requires the fewest wires. Figure 6.24 compares the total energy consumption Etotal of each method at link lengths of 1 and 3 mm. Etotal includes encoder, link, and decoder energy. The results show that combining H(71,64) with FPC consumes more total energy than the other schemes, because of its larger codec and link energy consumption. The combination of H(71,64) with skewed transitions achieves the lowest total energy consumption because of its relatively small codec overhead and the smallest required number of wires. Compared to combining H(71,64) with FPC, it achieves a 32% improvement in energy consumption at a link length of 3 mm.
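The relative savings quoted in this section for decoder delay, codec area, and link area can be rechecked directly from the entries of Tables 6.8 and 6.9. The following is a minimal sketch; the helper name `reduction_pct` and the dictionary layout are illustrative assumptions, and the wire-count ratio is used only as a rough proxy for link area, since the actual area also depends on wire width and spacing.

```python
def reduction_pct(baseline: float, value: float) -> float:
    """Percentage by which `value` is smaller than `baseline`."""
    return 100.0 * (baseline - value) / baseline


# Entries copied from Table 6.9 (decoder delay in ns, codec area in the table's units)
decoder_delay_ns = {"skewed": 0.62, "FPC": 0.79}
codec_area = {"skewed": 2468.2, "FPC": 4485.9}

# Entries copied from Table 6.8 (number of wires)
wires = {"skewed": 71, "DAP": 129}

decoder_cut = reduction_pct(decoder_delay_ns["FPC"], decoder_delay_ns["skewed"])
codec_area_cut = reduction_pct(codec_area["FPC"], codec_area["skewed"])
wire_cut = reduction_pct(wires["DAP"], wires["skewed"])

print(f"Decoder delay vs. H(71,64) + FPC: {decoder_cut:.1f}% lower")
print(f"Codec area vs. H(71,64) + FPC:    {codec_area_cut:.1f}% lower")
print(f"Wire count vs. DAP(129,64):       {wire_cut:.1f}% lower")
```

The three values come out at roughly 21.5%, 45.0%, and 45.0%, in line with the 21% and 45% figures reported in the text.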

6.6 Summary

In this chapter, we have examined different techniques that efficiently address both logic errors and capacitive-crosstalk-induced delay uncertainty. By duplicating the input data and adding an extra parity check bit, DAP


codes have a minimum Hamming distance of three and can correct single errors. In DAP codes, the bus wire used to transmit the duplicated bit is placed adjacent to the bus wire used to transmit the original bit. Thus, DAP codes reduce the effect of capacitive coupling. An intelligent spacing method can be used to further optimize DAP codes. In the intelligent spacing method, the spacing between two wires carrying identical data can be smaller than the spacing between two wires carrying different data.

BSC codes correct single errors and reduce capacitive coupling effects simultaneously. In a BSC code, no two adjacent bits switch in opposite directions simultaneously. The worst-case coupling capacitance of a bus wire used to transmit a BSC codeword is equal to 2Cc, which is smaller than the worst-case coupling capacitance of a standard bus wire.

CAMEC codes are constructed based on an idea similar to that of DAP codes. An example of a CAMEC code constructed using a Hamming code is presented in this chapter. In this example, the input data is first encoded using a Hamming code. The outputs of the Hamming encoder are duplicated. An overall parity check bit calculated from the output of the Hamming encoder is added to the whole codeword. The decoding process of CAMEC codes is more complex than that of conventional Hamming codes. Two different decoding algorithms are discussed. CAMEC codes can be used to correct multiple errors and reduce capacitive-crosstalk-induced delay uncertainty at the same time.

A general coding framework that combines ECCs with CACs is used to address both logic errors and capacitive-crosstalk-induced delay uncertainty. In this joint coding method, the input data is first encoded using nonlinear CACs. The outputs of the nonlinear CACs are encoded using ECCs. The parity bits generated by the ECC encoder are protected against crosstalk coupling using techniques such as shielding and duplication. Three commonly used CACs (FOC, FTC, and FPC codes) are discussed in this chapter. The combination of ECCs with CACs achieves a speedup over conventional Hamming codes, which can only correct logic errors.

Another method, combining ECCs with skewed transitions, is also discussed in this chapter. In this method, the inherent skew resulting from the ECC parity generation is exploited to ensure that no two adjacent wires switch in opposite directions simultaneously, thereby reducing worst-case on-chip capacitive coupling. Instead of waiting for the parity computation so that the original input data and parity bits are sent to the link at the same time, the original input data is sent before the parity bits are available. A mapping algorithm is needed to properly map data and parity check bits to link driver registers, which are triggered by alternating clock phases. Compared to a conventional skewed transition approach, the combination of ECCs with skewed transitions hides the delay element insertion overhead in the parity calculation latency; thus, a large skew delay is allowed in this method without affecting the overall link delay. The larger delay offset in this method further reduces the effects of capacitive coupling. Compared to other solutions that simultaneously handle logic errors and delay uncertainty, the combination of ECCs with skewed transitions requires fewer wires, resulting in smaller link area and energy consumption.
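The DAP construction summarized at the start of this section (duplicate every data bit, place the copy on the adjacent wire, and append one parity bit) can be sketched in a few lines. This is a minimal illustration under stated assumptions: the function names, the interleaved bit ordering, and the decoding rule (regenerate the parity from one copy and fall back to the other copy on a mismatch) are choices made for this example, not a description of the hardware codec evaluated in this chapter.

```python
from itertools import product


def parity(bits):
    """XOR of all bits in the sequence."""
    p = 0
    for b in bits:
        p ^= b
    return p


def dap_encode(data):
    """Place each data bit and its duplicate on adjacent wires, then append the parity bit."""
    codeword = []
    for b in data:
        codeword += [b, b]
    codeword.append(parity(data))
    return codeword


def dap_decode(received):
    """Recover the data, assuming at most one wire of the codeword is in error."""
    copy_a = received[0:-1:2]  # original bits (even positions)
    copy_b = received[1:-1:2]  # duplicated bits (odd positions)
    # With at most one error, a matching parity means copy A is clean;
    # a mismatch means the error hit copy A or the parity wire, so copy B is clean.
    return copy_a if parity(copy_a) == received[-1] else copy_b


def min_distance(k):
    """Brute-force minimum Hamming distance over all codewords for k data bits."""
    words = [dap_encode(list(d)) for d in product((0, 1), repeat=k)]
    return min(
        sum(x != y for x, y in zip(u, v))
        for i, u in enumerate(words)
        for v in words[i + 1:]
    )


def corrects_all_single_errors(k):
    """Flip every wire of every codeword in turn and check that decoding still succeeds."""
    for d in product((0, 1), repeat=k):
        data = list(d)
        codeword = dap_encode(data)
        for i in range(len(codeword)):
            corrupted = codeword.copy()
            corrupted[i] ^= 1
            if dap_decode(corrupted) != data:
                return False
    return True


print(len(dap_encode([0] * 64)))      # number of wires for 64-bit data
print(min_distance(4))                # minimum Hamming distance for a small example
print(corrects_all_single_errors(4))  # single-error correction check
```

For 64-bit data the encoder produces 129 wires, matching DAP(129,64), and the brute-force check on a 4-bit example is consistent with the minimum Hamming distance of three stated above.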


References

1. Sridhara S, Shanbhag RN (2005) Coding for system-on-chip networks: a unified framework. IEEE Trans Very Large Scale Integr (VLSI) Syst 12:655–667
2. Rossi D, Metra C, Nieuwland KA, Atul K (2005) Exploiting ECC redundancy to minimize crosstalk impact. IEEE Des Test Comput 22:59–70
3. Rossi D, Metra C, Nieuwland KA, Atul K (2005) New ECC for crosstalk impact minimization. IEEE Des Test Comput 22:340–348
4. Patel KN, Markov IL (2004) Error-correction and crosstalk avoidance in DSM busses. IEEE Trans Very Large Scale Integr (VLSI) Syst 12:1076–1080
5. Ganguly A, Pande PP, Belzer B, Grecu C (2008) Design of low power & reliable networks on chip through joint crosstalk avoidance and multiple error correction coding. J Electron Testing Theory Appl (JETTA), 67–81, Special Issue on Defect and Fault Tolerance
6. Ganguly A, Pande PP, Belzer B (2009) Crosstalk-aware channel coding schemes for energy efficient and reliable NOC interconnects. IEEE Trans Very Large Scale Integr (VLSI) Syst 17:1626–1639
7. Sridhara S, Shanbhag RN (2007) Coding for reliable on-chip buses: a class of fundamental bounds and practical codes. IEEE Trans Comput-Aided Des Integr Circuits Syst 5:977–982
8. Sridhara S, Ahmed A, Shanbhag RN (2004) Area and energy-efficient crosstalk avoidance codes for on-chip busses. In: Proceedings international conference on computer design (ICCD), pp 12–17
9. Duan C, Tirumala A, Khatri SP (2001) Analysis and avoidance of crosstalk in on-chip buses. In: Proceedings of the international conference on hot interconnects, pp 133–138
10. Victor B, Keutzer K (2001) Bus encoding to prevent crosstalk delay. In: Proceedings IEEE/ACM international conference on computer-aided design (ICCAD), pp 57–63
11. Sridhara RS, Shanbhag RN (2005) Coding for reliable on-chip buses: fundamental limits and practical codes. In: Proceedings VLSI design, pp 417–422
12. Hirose K, Yassura H (2000) A bus delay reduction technique considering crosstalk. In: Proceedings of design, automation and test in Europe (DATE), pp 441–445
13. Nose K, Sakurai T (2001) Two schemes to reduce interconnect delay in bi-directional and unidirectional buses. In: Proceedings of VLSI symposium, pp 193–194
14. Fu B, Ampadu P (2010) Exploiting parity computation latency for on-chip crosstalk reduction. IEEE Trans Circuits Syst II: Express Briefs 57:399–403
15. Arizona State University Predictive Technology Model [Online] Available: http://ptm.asu.edu/
16. Akl CJ, Bayoumi MA (2008) Reducing interconnect delay uncertainty via hybrid polarity repeater insertion. IEEE Trans Very Large Scale Integr (VLSI) Syst 16:1230–1239

List of Symbols

ΔT               Delay inserted in the skewed transition method
Φj               Minimal polynomial of α^j
Ω(x)             Error magnitude polynomial
α                Primitive element in a Galois field
ε                Probability of a single wire being erroneous
λ                Ratio of Cct to Cgt
σ(x)             Error locator polynomial
σN               Standard deviation of noise voltage
τ0               Wire delay when no crosstalk is considered
ACK/NACK         Acknowledge/negative acknowledge
ARQ              Automatic repeat request
BCH              Bose-Chaudhuri-Hocquenghem
CAC              Crosstalk avoidance codes
Cc               Coupling capacitance between two adjacent wires
Cct              Total coupling capacitance
Ceff             Effective capacitance
Cg               Ground capacitance
Cg1              Parallel plate capacitance between the parallel surface of a wire and the substrate or ground
Cg2              Fringing capacitance between the sides of a wire and the substrate or ground
Cgt              Total ground capacitance of a wire
CRC              Cyclic redundancy check
DAP              Duplicate-add-parity
ECC              Error control coding
EMI              Electromagnetic interference
FEC              Forward error correction
G_{k×n}          Generator matrix for an (n, k) linear block code
HARQ             Hybrid automatic repeat request
H_{(n−k)×n}      Parity check matrix for an (n, k) linear block code
I_k              k-dimensional identity matrix
LCM              Least common multiple
P(b), b = 1, 2, 3, ...   Error probability of b-bit burst error
Pd               Probability that errors can be detected
Pd_c             Probability of correctable error
Pd_uc            Probability of detectable but uncorrectable error
P_{k×(n−k)}      Parity matrix for an (n, k) linear block code
Pn               Probability of single error source causing errors in neighboring wires
Pne              Probability of no error
Presidual        Residual flit error rate
Pud              Probability of undetectable errors in the first transmission
Qcrit            Amount of charge required to induce an error
R                Code rate
RS               Reed-Solomon
Reffective       Effective code rate
Rt               Total resistance of a wire
SEC              Single-error-correcting
SEC-DED          Single-error-correcting and double-error-detecting
SPC              Single parity check
Sint             Spacing between two wires
Tint             Interconnect thickness
Type-I HARQ      Type of HARQ in which the same information is retransmitted
Type-II HARQ     Type of HARQ in which redundant bits are transmitted incrementally
VN               Normal distribution noise
Vdd              Supply voltage
Vswing           Link swing voltage
WL               Number of wires in the link
Wint             Interconnect width
c(x)             Polynomial representation of codeword
c_{1×n}          n-bit codeword
dmin             Minimum Hamming distance
e_{1×n}          n-bit error vector
g(x)             Generator polynomial
m(x)             Polynomial representation of input message
m_{1×k}          k-bit input data
r_eh(k)          Number of parity check bits added by the extended Hamming code for k-bit input data
s_{1×r}          r-bit syndrome vector
v_{1×n}          Received n-bit codeword
