Transient and Permanent Error Control for Networks-on-Chip


Author: Qiaoyan Yu | Paul Ampadu



Qiaoyan Yu University of New Hampshire Durham, NH 03824, USA [email protected]

Paul Ampadu University of Rochester Rochester, NY 14627, USA [email protected]

ISBN 978-1-4614-0961-8
e-ISBN 978-1-4614-0962-5
DOI 10.1007/978-1-4614-0962-5
Springer New York Dordrecht Heidelberg London
Library of Congress Control Number: 2011939749
© Springer Science+Business Media, LLC 2012

All rights reserved. This work may not be translated or copied in whole or in part without the written permission of the publisher (Springer Science+Business Media, LLC, 233 Spring Street, New York, NY 10013, USA), except for brief excerpts in connection with reviews or scholarly analysis. Use in connection with any form of information storage and retrieval, electronic adaptation, computer software, or by similar or dissimilar methodology now known or hereafter developed is forbidden. The use in this publication of trade names, trademarks, service marks, and similar terms, even if they are not identified as such, is not to be taken as an expression of opinion as to whether or not they are subject to proprietary rights.

Printed on acid-free paper Springer is part of Springer Science+Business Media (www.springer.com)

Preface

Reliability has become one of the most important metrics for on-chip communications infrastructures in nanoscale technologies. Reduced supply voltages and high clock frequencies exacerbate the impact of noise sources such as particle strikes and crosstalk, which can cause transient errors in transmitted data. Additionally, manufacturing defects, electromigration, and aging can cause permanent errors in communication links. Unfortunately, transient and permanent error management techniques typically result in increased power consumption, latency and area overhead, further challenging large-scale system design. Consequently, cost-effective techniques for improving on-chip error resilience are needed.

The purpose of this book is to address the reliability and energy issues of nanoscale on-chip networks. Since the noise environment is not constant in real applications, the worst-case design approach often used results in wasted energy, particularly when the noise condition is favorable. To address the variable error rates, we present a configurable error control coding (ECC) scheme for datalink-layer transient error management. The method can adjust both error detection and correction strengths at runtime by varying the number of redundant wires for parity-check bits. To further improve energy efficiency, the ECC adaptation is extended to the network layer. We demonstrate that the proposed dual-layer cooperative error control achieves better reliability, latency, and energy efficiency than other solutions in a wide range of noise and traffic conditions, at moderate area costs.

We further extend these methods to tackle joint transient and permanent error correction, exploiting redundant resources already available. This approach reduces the need for energy-consuming fault-tolerant routing to minimize the latency and energy overhead introduced by error control. The proposed approach is particularly applicable to scenarios where only a small number of permanent errors exist on the on-chip links.

To evaluate the performance and energy consumption of large networks-on-chip (NoCs), we also describe a flexible parallel NoC simulator. The simulator is designed to facilitate evaluating the impact of various error control methods on NoC performance.


Key features of this book include

- A detailed overview of various error control schemes commonly-used in on-chip interconnect networks
- Analysis of error control in various NoC layers, as well as presentation of an innovative multi-layer ECC technique
- Configurable error management solutions and their hardware implementation details for variable noise conditions
- Detailed description of a flexible and parallel NoC simulator

This book should be of interest to

- Researchers interested in error control and fault tolerance techniques
- Networks-on-chip, systems-on-chip and chip-multiprocessor designers
- Engineers involved in parallel simulation tool development

Qiaoyan Yu, Durham, NH, USA
Paul Ampadu, Rochester, NY, USA

Acknowledgments

The original research work presented in this book was made possible in part by grants from the U.S. National Science Foundation (NSF) under grants ECCS-0733450, ECCS-0903448, ECCS-0925993, CAREER Award ECCS-0954999, Cyberinfrastructure Experiences for Graduate Students Supplement ECCS-0609140, and the Semiconductor Research Corporation award SRC-2009-HJ-2000.

We would like to express our special appreciation to our friends and colleagues Professor Wendi Heinzelman, Professor Chen Ding, Professor Thomas Tucker and Professor Kai Shen for their invaluable suggestions on improving the quality of Dr. Yu’s dissertation leading to this book. We are grateful also to our exceptional colleagues, Dr. Bo Fu (now at Marvell), Dr. David Wolpert (now at IBM), Meilin Zhang and Tony Broyld, for their enjoyable collaborations and support. Our deepest gratitude goes to our families for their unwavering encouragement and support. Many thanks also to our friends at the University of Rochester and to Charles B. Glaser from Springer for his support and assistance throughout the writing of this book.

We welcome any suggestions, comments or constructive criticism on this book. Such feedback will be used to improve forthcoming editions. Additional material can be found at http://www.ece.rochester.edu/projects/edison.

Qiaoyan Yu, Durham, NH, USA
Paul Ampadu, Rochester, NY, USA

Contents

1 Introduction
  1.1 Networks-on-Chip (NoCs)
    1.1.1 Fundamental Elements in NoC
    1.1.2 NoC Layer Model
    1.1.3 NoC Switching Techniques
    1.1.4 NoC Flow Control
  1.2 Reliability Challenges in Scaled Technology
    1.2.1 Transient Errors
    1.2.2 Permanent Errors
    1.2.3 Intermittent Errors
  1.3 Reliability, Performance and Energy Tradeoffs
  1.4 Book Organization
  References

2 Existing Transient and Permanent Error Management in NoCs
  2.1 Error Control Schemes
    2.1.1 Automatic Repeat Request
    2.1.2 Hybrid Automatic Repeat Request
    2.1.3 Forward Error Correction
  2.2 Error Control Coding
    2.2.1 Single Parity Check Code
    2.2.2 Hamming Code
    2.2.3 Cyclic Redundancy Check Code
    2.2.4 Bose-Chaudhuri-Hocquenghem Code
    2.2.5 Product Code
  2.3 Spare Wires
  2.4 Split Transmission
  2.5 Fault-Tolerant Routing
    2.5.1 Redundant-Packet-Based Routing
    2.5.2 Redundant-Route-Based Routing
  2.6 Summary
  References

3 Adaptive Error Control Coding at Datalink Layer
  3.1 Adaptive Algorithm
  3.2 Architecture for Sender and Receiver
  3.3 Configurable Error Detection and Correction
    3.3.1 Two-phase Configurable ECC Encoder
    3.3.2 Configurable Interleaving
    3.3.3 Configurable ECC Decoder
  3.4 Evaluation of Adaptive ECC
    3.4.1 Error Models
    3.4.2 Experimental Setup
    3.4.3 Reliability
    3.4.4 Average Throughput
    3.4.5 Energy Efficiency
    3.4.6 Area
  3.5 Simulation Using an H.264 Application
  3.6 Performance Evaluation Using Dependent Error Model
  3.7 Summary
  References

4 Transient and Permanent Link Errors Co-Management
  4.1 Dual-Layer Co-Management Method
    4.1.1 Co-Management Algorithm
    4.1.2 Transmitter and Receiver Architecture
  4.2 Packet Re-Organization Approach
    4.2.1 Re-Organization Algorithm
    4.2.2 Input Port and Output Port Architecture
  4.3 Performance Evaluation
    4.3.1 Experimental Setup
    4.3.2 Reliability
    4.3.3 Average Latency
    4.3.4 Energy Efficiency
    4.3.5 Area Overhead
  4.4 Summary
  References

5 Dual-Layer Cooperative Error Control for Transient Error
  5.1 Existing Hop-to-Hop Error Control
  5.2 Existing End-to-End Error Control
  5.3 Dual-Layer ECC Switching Scheme
    5.3.1 ECC Mode Switching Algorithm
    5.3.2 Network Interface Architecture
    5.3.3 Router Implementation
    5.3.4 Dual-Layer Information Exchange
  5.4 Codec for Dual-Layer ECC
    5.4.1 Dual-Layer ECC Encoder
    5.4.2 Dual-Layer ECC Decoder
    5.4.3 Codec Compatibility
  5.5 Experimental Results
    5.5.1 Experimental Setup
    5.5.2 Reliability
    5.5.3 Energy Efficiency
    5.5.4 Average Latency
    5.5.5 Codec Delay
    5.5.6 Codec Area Comparison
    5.5.7 Case Study
  5.6 Summary
  References

6 A Flexible Parallel Simulator for Networks-on-Chip with Error Control
  6.1 Existing Simulators
  6.2 Platforms for Error Control Evaluation
  6.3 Overview of the Proposed Simulator
  6.4 Error Control Modeling in Router and Network Interface
    6.4.1 Error Control in Router
    6.4.2 Error Control in Network Interface
  6.5 Flexible Fault and Traffic Injection
    6.5.1 Fault Injection Location
    6.5.2 Fault Type
    6.5.3 Faulty Flit Type
    6.5.4 Multiple-Frequency Traffic Injection
  6.6 Parallel Fault and Traffic Injection
    6.6.1 Fault Injection on Links
    6.6.2 Fault Injection Rate
    6.6.3 Parallel Traffic Injection
  6.7 Energy Estimation
  6.8 Speed and Memory Consumption for Fault Injection
  6.9 Error Control Exploration
    6.9.1 Experimental Setup and Evaluation Metrics
    6.9.2 Impact of Packet and Fault Injection Rate
    6.9.3 Impact of Error Control
    6.9.4 Impact of Faulty Flit Type
    6.9.5 Impact of Fault Injection Location
    6.9.6 Impact of Fault Type
  6.10 Memory Consumption and Time for Fault Injection
  6.11 Investigation of NoC-Based CMP System
  6.12 Summary
  References

7 Conclusions and Future Directions
  7.1 Book Summary
    7.1.1 Adaptive Error Control Codec Design
    7.1.2 Error Co-Management
    7.1.3 Dual-Layer Cooperative Error Control
    7.1.4 NoC Simulator Development
  7.2 Future Work
  References

Index

Chapter 1

Introduction

1.1 Networks-on-Chip (NoCs)

Thanks to rapid advances in semiconductor device fabrication, billions of transistors can be integrated onto a single die [1–5]. Although the increasing chip density potentially facilitates systems-on-chip (SoCs) and chip multiprocessors (CMPs) integrating hundreds or thousands of processing element/memory cores, several challenges impede further progress, such as design complexity, high-performance interconnect and scalable on-chip communication architecture [6–9]. Networks-on-chip (NoCs) have become a promising paradigm that manages the increasing interconnect complexity and facilitates the integration of various intellectual property (IP) cores [10–15].

1.1.1 Fundamental Elements in NoC

NoC is a new infrastructure for on-chip communication. Figure 1.1a shows the three fundamental components of NoCs: links, network interfaces (NIs), and routers. Links facilitate communication between routers. NIs transform streams of bits from intellectual property (IP) cores into packets for transmission to routers and vice versa. Routers extract the destination address from each received flow control unit (flit) and pass the flit toward its intended destination. Nodes in the NoC can be connected using various topologies, as shown in Fig. 1.1b [16]. A NoC survey shows that over 60% of NoCs employ a mesh or torus topology [14]. These regular topologies provide better scalability than buses, crossbars and ad-hoc networks. Bolotin et al. have shown that the complexity of NoC connectivity is $O(n)$, while the complexities of simple buses, point-to-point connectivity and segmented buses are $O(n^{3}\sqrt{n})$, $O(n^{2}\sqrt{n})$ and $O(n^{2}\sqrt{n})$, respectively, as shown in Fig. 1.2 [16]. Here, $n$ is the number of nodes in the network.

Q. Yu and P. Ampadu, Transient and Permanent Error Control for Networks-on-Chip, DOI 10.1007/978-1-4614-0962-5_1, © Springer Science+Business Media, LLC 2012
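To get a feel for how quickly the scaling orders quoted above for the four interconnect architectures diverge, the following minimal sketch evaluates each order at a few network sizes and normalizes it to the NoC. Constant factors are ignored, so the numbers only indicate growth trends, not actual cost; the names `COST_ORDERS` and `relative_cost` are illustrative and do not come from the book.

```python
import math

# Asymptotic complexity orders as quoted in the text (constant factors omitted).
COST_ORDERS = {
    "NoC": lambda n: n,
    "simple bus": lambda n: n**3 * math.sqrt(n),
    "point-to-point": lambda n: n**2 * math.sqrt(n),
    "segmented bus": lambda n: n**2 * math.sqrt(n),
}


def relative_cost(n):
    """Return each architecture's complexity order at n nodes, normalized to the NoC."""
    noc = COST_ORDERS["NoC"](n)
    return {name: order(n) / noc for name, order in COST_ORDERS.items()}


if __name__ == "__main__":
    for n in (16, 64, 256):
        ratios = relative_cost(n)
        print(n, {name: f"{ratio:.3g}" for name, ratio in ratios.items()})
```

Under these orders, the gap between a simple bus and a NoC grows as $n^{2}\sqrt{n}$, which is why the comparison favors NoCs more strongly as the node count increases.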


Fig. 1.1 (a) Fundamental NoC elements, (b) topologies for NoCs

Unlike other interconnect architectures that use direct wiring, NoCs route data through several hops via routers. As a result, the interconnect fabric can be shared among all the IP cores attached to the NoC, significantly improving the efficiency of interconnect utilization. The multi-hop feature of NoCs helps to divide a long link into several short segments, each segment using a router to pass the data over the network. This segmentation also helps to manage the increasing delay and power consumption caused by link resistance. NoCs also separate communication from computation and provide a feasible framework for reusing IP cores. NoCs provide attractive benefits for on-chip communication; however, this new infrastructure also brings new challenges: (1) minimizing the area and power overhead induced by routers and network interfaces, (2) the need for new design methodologies for NoC-based systems, and (3) the need for new circuit and system design tools.

1.1.2 NoC Layer Model

As in other networks, the Open Systems Interconnection (OSI) model is used to provide guidelines for NoC implementation [7, 16]. The layered structure shown in Fig. 1.3 hides the implementation details of each layer from the other layers, simplifying system design.


Fig. 1.2 Scalability comparison of different interconnect topologies: (a) NoC, (b) simple bus, (c) point-to-point interconnection, (d) segmented bus

Fig. 1.3 Open system layer model for on-chip communication


The physical layer transmits unstructured bit streams and deals with the electrical properties of the physical access media (i.e., wires), such as signal voltage, pulse shape, synchronization, delay and signal integrity issues.

The datalink layer provides reliable transmission of packetized data blocks over physical links, using the necessary flow control and error control schemes. In this layer, the arbitration scheme for access to the shared physical links significantly affects the delay, throughput and power consumption of the NoC.

The network layer provides the upper layers with independence from the topology, data transmission and switching techniques. This layer is responsible for establishing, maintaining and releasing connections using static or dynamic routing algorithms. Congestion control methods are employed in this layer to balance the traffic load over the entire network.

The transport layer ensures reliable and transparent end-to-end communication. In this layer, bit streams from the upper layers are segmented into packets or reconstructed from packets. Packet loss checking and packet reordering are performed in the transport layer, as well.

The session layer typically synchronizes message transmission, which is useful for multi-core systems running parallel programs.

The presentation layer converts diverse data from the upper layers into a compatible format for the lower layers. This layer is especially necessary for heterogeneous multi-core systems, because heterogeneous IP cores may use different data formats (e.g., big-endian or little-endian format, floating point or fixed point format).

The application layer informs the components in SoCs/CMPs of the underlying communication structure; thus, the system components can communicate with each other without considering the implementation details.

This layered-stack model facilitates the separation of communication and computation and assists IP-reuse design methodologies, as well as fine-grain optimization of NoC components.
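As a small illustration of the kind of format conversion attributed to the presentation layer above, the sketch below re-encodes a 32-bit word between big-endian and little-endian byte order using Python's standard `struct` module. The helper name `reorder_word` and the 32-bit word size are assumptions made for this example; the book does not prescribe a particular conversion routine.

```python
import struct


def reorder_word(word_bytes, src_order, dst_order):
    """Re-encode one 32-bit unsigned word from src_order to dst_order.

    Byte orders use struct's prefixes: '>' is big-endian, '<' is little-endian.
    """
    (value,) = struct.unpack(src_order + "I", word_bytes)
    return struct.pack(dst_order + "I", value)


# A big-endian core emits 0x12345678; a little-endian core expects the bytes reversed.
big = struct.pack(">I", 0x12345678)
little = reorder_word(big, ">", "<")
print(big.hex(), little.hex())
```

The numeric value is unchanged by the conversion; only the order in which its bytes appear on the link differs between the two cores.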

1.1.3 NoC Switching Techniques

Switching techniques determine how data flows through the routers and define the granularity of data transfer. Three switching techniques have been used in NoCs: circuit switching, packet switching and a hybrid version. In circuit switching, a circuit is set up from source to destination using a resource reservation method. Thus, once the circuit is established, there is no network contention during data propagation; contention arises only when different flows (i.e., circuit paths) attempt to use a link at the same time. Circuit switching has high initial latency, but it is appropriate when data is sent very often (e.g., SOCBus [17]). Because circuit switching reserves a complete path before data is sent, it is easy to guarantee the quality of service (QoS).


Packet switching can be divided into three categories: store and forward (SAF), virtual cut through (VCT) and wormhole (WH) switching. SAF checks the availability of the next hop and buffers the packet until it has been entirely received before forwarding it. VCT only checks the availability of the next hop and does not wait for the packet to be entirely received. In a packet switching network, large buffers are needed to meet performance requirements; typically, the more buffer space is available, the better the performance. Both SAF and VCT require buffer space sufficient for at least one packet. WH switching, which needs buffer space for only a few flits per router rather than a whole packet, has been widely employed in NoCs to reduce the area cost induced by buffers. The hybrid switching technique balances the advantages and disadvantages of the circuit and packet switching techniques [18].
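The difference between waiting for the whole packet at every hop and forwarding as soon as the header arrives can be made concrete with a first-order, contention-free latency estimate. The sketch below uses a common textbook approximation, not a model taken from this book: SAF pays the serialization delay at every hop, whereas VCT and WH pay it once. All parameter names (`hops`, `router_delay`, `packet_bits`, `link_width`) are assumptions for illustration, and latencies are in clock cycles.

```python
import math


def saf_latency(hops, router_delay, packet_bits, link_width):
    """Zero-load latency when each router receives the full packet before forwarding."""
    serialization = math.ceil(packet_bits / link_width)  # cycles to push one packet over a link
    return hops * (router_delay + serialization)


def cut_through_latency(hops, router_delay, packet_bits, link_width):
    """Zero-load latency for VCT/WH: the header advances while the body is still arriving."""
    serialization = math.ceil(packet_bits / link_width)
    return hops * router_delay + serialization


if __name__ == "__main__":
    # Hypothetical example: 4 hops, 2-cycle routers, 512-bit packets, 64-bit links.
    print(saf_latency(4, 2, 512, 64))          # 4 * (2 + 8)
    print(cut_through_latency(4, 2, 512, 64))  # 4 * 2 + 8
```

The estimate ignores contention and blocking, which is where VCT and WH differ from each other: a blocked VCT packet is buffered whole in one router, while a blocked WH packet stays spread over several routers.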

1.1.4 NoC Flow Control

Flow control defines the way that packets traverse the network. It usually involves buffer location, buffer management and network resource allocation. Efficient flow control can speed up packet propagation over the network and can also reduce network resource idle time. Flow control methods can be categorized as without-memory [19–21] or with-memory [22–29]. Credit-based flow control, once used in ATM networks, is now commonly adopted in NoCs, at the cost of buffer resources [22–24]. Building on credit-based flow control, other techniques are also applied in some NoCs, such as Paris, NoCGEN and Xpipes. Paris [25], an extended version of SoCIN [26], utilizes handshake signals to create connections between sender and receiver. NoCGEN [27] uses a request, grant and ready handshake to enable flow control in point-to-point connections. The Xpipes NoC [28] employs the ACK/NACK flow control method on pipelined links; a received NACK requests retransmission. In Ref. [29], Tamhankar et al. proposed the T-error protocol to deal with timing errors caused by aggressive timing (adopted to improve performance). Straightforward flow control, STALL/GO, T-error and ACK/NACK flow control schemes have been compared in Ref. [30].
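The idea behind the credit-based flow control mentioned above can be sketched in a few lines: the sender keeps a count of free buffer slots at the receiver, spends one credit per flit, and regains a credit when the receiver drains a flit. This is a simplified single-link model written for illustration only; the class name `CreditLink` is hypothetical, and the credit return is treated as instantaneous, which real routers do not guarantee.

```python
from collections import deque


class CreditLink:
    """One sender/receiver pair using credit-based flow control (simplified model)."""

    def __init__(self, buffer_slots):
        self.credits = buffer_slots  # sender's count of free slots downstream
        self.buffer = deque()        # receiver's input buffer

    def send(self, flit):
        """Sender side: transmit only when a credit is available, otherwise stall."""
        if self.credits == 0:
            return False
        self.credits -= 1
        self.buffer.append(flit)
        return True

    def drain(self):
        """Receiver side: forward one buffered flit and return a credit upstream."""
        if not self.buffer:
            return None
        flit = self.buffer.popleft()
        self.credits += 1
        return flit


if __name__ == "__main__":
    link = CreditLink(buffer_slots=2)
    print([link.send(f"flit{i}") for i in range(3)])  # third send stalls
    print(link.drain())                               # frees one slot
    print(link.send("flit2"))                         # now succeeds
```

Because the sender never transmits without a credit, the receiver buffer cannot overflow; the price is the buffer space itself, which matches the cost noted in the text.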

1.2 Reliability Challenges in Scaled Technology

Relaxing the requirement for 100% correctness in both transient and permanent failures of signals, logic values, devices, or interconnects may reduce the cost of manufacturing, verification, and testing. – ITRS 2003 [31]

Deep submicron (DSM) technology makes the integration of billions of transistors on a single die possible. In such an infrastructure, more and more IP cores are available for parallel processing, which dramatically improves the speed of signal processing.


Fig. 1.4 Alpha particle strike on a transistor

Unfortunately, this outlook is optimistic. Although NoCs bring the advantages of structural regularity, scalability, modularity and efficient communication, they still face considerable reliability challenges. Because of various noise sources, reliability is becoming increasingly important in current and future technologies.

1.2.1 Transient Errors

Transient errors involve unexpected changes to data rather than damage to the physical media (e.g., interconnect links [32–36], storage elements and computation logic paths [37–43]). These errors may have a very short lifetime; thus, if the operation (transmission, write/read or computation) is repeated, the output of the physical media may become correct. Transient errors may be caused by voltage glitches induced by supply voltage fluctuation, crosstalk coupling [44–53], particle-strike-induced single-event upsets (SEUs) [54–58] and single-event transients (SETs) [57, 59].

Impurities in electronic materials contain high-density atoms, which emit alpha particles through radioactive decay [61]. The emitted alpha particles inject charge, changing logic values at circuit nodes, as shown in Fig. 1.4 [60]. Some chip packaging materials contain radioactive contaminants that emit alpha particles. Packaging materials with alpha particle emissions greater than 0.001 counts per hour per cm² (cph/cm²) should not be used for reliability-critical circuits [61]. Approximately 0.004% of the alpha particle strikes caused by package materials induce logic upsets [62]. As technology scales, the increasing number of circuit nodes and decreasing critical charge will increase the probability of alpha-particle-induced errors.


Even with the improvement of packaging materials, soft errors cannot be eliminated. If an energetic neutron at the Earth’s surface is captured by the nucleus of an atom in a chip and this process produces an alpha particle and oxygen nuclei, there is a ~95% probability of causing a soft error [63]. In modern devices, neutrons induce more soft errors than chip packaging materials, especially in aerospace applications. Computers working on mountain tops experience over 10 times soft errors than those at sea level [64]. The soft error rate for the electronic devices in an aircraft increases to 300 times over sea level [64]. In addition, neutrons interacting with their surroundings to reach thermal equilibrium lead to soft errors, as well. This is significantly important for electronic devices in medical applications. For example, high energy cancer radiation therapy using photon beams emits neutrons. The scattered neutrons do not disappear; instead, they are bounced between walls, resulting in the thermal neutron flux in the treatment room 4 107 higher than that in a normal environment [65, 66]. Electromagnetic interference (EMI) is caused by outside electronic devices or on-chip materials (e.g., RF components of mixed signal ICs) [67]. On-chip interconnect wires are relatively long compared to most other on-chip wires, and they are thus more likely to be the EMI victims. With increased integration of complex blocks on a single chip, the circuit will be susceptible to larger EMI levels. In addition to external noise sources, circuit normal operation is interfered by internal noise sources, such as power/ground voltage fluctuation and crosstalk coupling. In a real chip, the power (ground) is not ideally equal to Vdd (Vss). Fluctuation on power (ground) affects the charging capability (discharging capability) of PMOS (NMOS), resulting in the delay uncertainty [68]. If the uncertainty is large enough to be captured by a register, there occurs a logic error. 
Because of shrinking wire width and pitch, coupling noise is getting worse [44, 45, 49, 53]. In DSM and nanometer regimes, crosstalk becomes one of the major noise sources for interconnect [69, 70]. As shown in Fig. 1.5 [71], the peak noise voltage induced by crosstalk can be more than 20% of the supply voltage. Consequently, the voltage glitch caused by crosstalk has the potential to create a logic error.

Design for the worst case is simple and safe, but not cost-effective. In reality, the rate of transient errors varies with location and time. Experiments performed on the UoSAT-2 spacecraft, launched in 1984 into a polar orbit at an altitude of 700 km, indicated that the system experienced more soft errors in the South Atlantic Anomaly region than in other regions, as shown in Fig. 1.6a [72]. The relationship between the particle flux and the altitude has been summarized in Ref. [76]. By examining the number of captured soft errors over time, Harboe-Sørensen et al. observed that the error rate in October is higher than that in other months, as shown in Fig. 1.6b [72]. This is because sunspot activity inversely influences the magnetic field of the Earth; a higher solar flux provides the Earth with additional shielding against high-energy cosmic rays. Modeling and experiments performed by IBM also demonstrate that soft error rates differ from city to city, as shown in Fig. 1.7.

Fig. 1.5 Noise waveforms of crosstalk coupling between two coupled lines

Fig. 1.6 Soft error rate varies with (a) location and (b) time

Fig. 1.7 Alpha-particle and cosmic contributions to the single-event-upset rate


Fig. 1.8 Error rates for different supply voltages and noise variances

The error rate also changes with the operating conditions, such as supply voltage and temperature. Li et al. [73] use a single Gaussian noise source to model the different noise sources impacting the bus line. As shown in Fig. 1.8, a lower supply voltage or a larger noise voltage results in a higher bit error rate [73]. Transient errors are typically managed in the datalink layer by error detection with retransmission, by forward error correction, or by a hybrid of the two. Transient error management techniques are reviewed in Chap. 2.
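The trend described for Fig. 1.8 can be reproduced with a small numerical sketch. The formula below, BER = Q(Vdd / (2σ)), is an assumption on our part: it is a commonly used single-Gaussian-noise model for a full-swing wire, and the text does not state the exact expression used in [73]. The function names and the sample voltages are hypothetical.

```python
import math


def q_function(x: float) -> float:
    """Tail probability of a standard normal variable, Q(x) = P(Z > x)."""
    return 0.5 * math.erfc(x / math.sqrt(2.0))


def bit_error_rate(vdd: float, sigma_n: float) -> float:
    """ASSUMED model: a bit flips when Gaussian noise exceeds half the swing."""
    return q_function(vdd / (2.0 * sigma_n))


if __name__ == "__main__":
    # Illustrative values only; they are not taken from Fig. 1.8.
    for vdd in (1.2, 1.0, 0.8):
        for sigma_n in (0.08, 0.10, 0.12):
            ber = bit_error_rate(vdd, sigma_n)
            print(f"Vdd={vdd:.1f} V  sigma={sigma_n:.2f} V  BER={ber:.3e}")
```

Under this model, lowering Vdd or raising the noise standard deviation both shrink the argument of Q and therefore raise the bit error rate, which matches the qualitative behavior described above.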

1.2.2 Permanent Errors

Transient errors disappear after a certain time, while permanent errors do not vanish until their sources are removed. Imperfect manufacturing processes [74–78] and device wearout [79, 80] induce faults such as stuck-at faults, where the output is stuck at logic ‘0’ or ‘1’ regardless of the input [74]; bridging faults, where two adjacent signals are shorted together [81, 82]; open faults, where an interconnect is broken [75, 83]; and delay faults, where the signal arrives later than normal and violates the timing requirement. Device wearout typically occurs later in the lifetime because of various mechanisms, for instance, electromigration, hot carrier degradation, and time-dependent dielectric breakdown.
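As an illustration of how two of these fault models act on a parallel link, the following minimal sketch applies stuck-at and bridging faults to a vector of wire values. The function name and its parameters are hypothetical, and modeling a bridge as a wired-AND of the two shorted wires is an assumption made here for simplicity; the text does not specify the electrical behavior of a bridge.

```python
from typing import Dict, List, Tuple


def apply_permanent_faults(
    bits: List[int],
    stuck_at: Dict[int, int],
    bridges: List[Tuple[int, int]],
) -> List[int]:
    """Return the values seen at the receiver of a faulty parallel link.

    stuck_at maps a wire index to the constant value (0 or 1) it is stuck at.
    bridges lists pairs of shorted wires; a wired-AND is ASSUMED here.
    """
    out = list(bits)
    for a, b in bridges:
        shorted = bits[a] & bits[b]  # both shorted wires carry the same value
        out[a] = shorted
        out[b] = shorted
    for wire, value in stuck_at.items():
        out[wire] = value  # the driven value is ignored on a stuck wire
    return out
```

Note that a permanent fault does not corrupt every word: a wire stuck at ‘0’ is only visible when a ‘1’ is driven onto it, which is why such faults can initially resemble transient errors.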


1 Introduction

Fig. 1.9 Permanent and intermittent error sources: (a) bridging faults, (b) mousebite, crack, metal sliver, and hillock

Bridging between two lines (shown in Fig. 1.9a [77]) causes bridging faults. Through electromigration, mousebites and hillocks in the metal wires (shown in Fig. 1.9b) eventually result in open circuits and short circuits, respectively. A wire with a mousebite is narrower than normal at the defect, resulting in a higher current density there than elsewhere and leading to more severe electromigration (eventually causing an open circuit). In contrast, a hillock leads to a lower current density than elsewhere; it deteriorates through electromigration and finally results in a short circuit. Permanent errors are typically managed by replacing the unusable components with spare devices, or by detouring around the broken region. Permanent error management extends chip lifetime and improves chip yield, reducing manufacturing cost. State-of-the-art permanent error management methods are summarized in Chap. 2.

1.2.3 Intermittent Errors

An intermittent error occurs repeatedly at the same location; it typically appears in bursts and lasts several cycles. This type of error can be caused by factors such as temperature variation, voltage fluctuation, process variations and manufacturing residuals [38]. Changing the operating environment or replacing the faulty component can eliminate the intermittent error. If the error is induced by device aging, it sometimes precedes the appearance of permanent errors. Intermittent errors are notoriously difficult to identify and recover from, because their occurrence depends on the inputs and on an unpredictable operating environment, and because they resemble both transient and permanent errors. Since intermittent errors typically precede permanent errors, we treat them as permanent errors here.

1.3 Reliability, Performance and Energy Tradeoffs

Spatial, temporal and information redundancies have been exploited to manage transient and permanent errors in different types of reliability-aware systems [11]. As shown in Table 1.1, fault-tolerant techniques are not free. Mainstream low-cost systems have strict constraints on the overhead induced by the applied reliability-improvement techniques. In addition, NoC properties such as structural regularity, modularity and a layered model, which help manage the overwhelming system complexity, enable the use of advanced error control methods to create fault-tolerant interconnections. In transient error management, powerful error control coding achieves higher error resilience than simple error control coding, at the cost of higher energy and area overhead. Permanent error management faces a similar dilemma. Fault-tolerant routing takes advantage of multiple routes to transmit multiple message copies or to re-route the message around a faulty link. Links and routers with permanent errors may be abandoned, wasting bandwidth, increasing latency [85] or increasing energy [84].

1.4 Book Organization

The challenges of reliable NoC design in deeply scaled technologies are summarized in the previous sections. In the remainder of this book, multi-layer transient and permanent error control design and simulation methods are presented to address the reliability issues in nanoscale NoCs. The cooperation among the three layers involved in our methods is shown in Fig. 1.10.

In Chap. 2, previous datalink layer error recovery schemes and error control coding methods for transient errors are summarized. Existing physical layer techniques and network layer approaches for permanent errors are discussed as well.

In Chap. 3, we present an adaptive error control method for switch-to-switch links in nanoscale NoCs to manage reliability, throughput and energy. Unlike previous works, the proposed method adjusts both error detection and correction simultaneously at runtime. For a given application or predicted noise scenario, an appropriate error control scheme is selected for reliable message transmission. When link conditions degrade, more powerful error detection and correction are temporarily provided to recover the previous message. To achieve this adaptation, we implement a configurable M-error correction, 2M-error detection (MEC2MED) code, combined with a hybrid automatic repeat request (HARQ) retransmission policy.

Based on the approach discussed in Chap. 3, we present a dual-layer (physical and datalink layer) transient and permanent error co-management method in Chap. 4. This co-management method uses the idle wires for configurable error control coding to replace permanently unusable wires, reducing the number of redundant wires. Furthermore, a packet re-organization algorithm that cooperates

Table 1.1 Different types of reliability-aware systems

Type | Issues | Goal | Examples | Techniques
Long-life systems | Difficult or expensive to repair | Maximize mean time to failure (MTTF) | Satellites, spacecraft, implanted biomedical devices | Dynamic redundancy
Reliable real-time systems | Errors or delays can be catastrophic | Fault masking capability | Aircraft, nuclear power plants, air bag electronics, radar | Triple modular redundancy (TMR)
High-availability systems | Downtime very costly | High availability | Reservation systems, stock exchanges, telephone systems | No single point of failure; self-checking pairs; fault isolation
High-integrity systems | Data corruption very costly | High data integrity | Banking, transaction processing, databases | Check pointing, time redundancy, ECC; redundant disks
Mainstream low-cost systems | Reasonable level of failures acceptable | Meet failure rate expectations at low cost | Consumer electronics, personal computers | None; memory ECC; bus parity


Fig. 1.10 Multi-layer cooperative error control

with a shortened error control coding method is proposed to support low-latency split transmission. Our co-management method provides the most energy-efficient configuration for the current transient and permanent error condition, with minor performance degradation and area overhead.

In Chap. 5, we address transient errors in the datalink and network layers. We employ end-to-end error control in the network layer under low-noise conditions, and enhance the error control capability under high-noise conditions by adding hop-to-hop error control in the datalink layer. One major contribution of this method is a protocol to switch between single-layer and dual-layer error control at runtime, based on the detected noise condition or system requirements. Simply combining end-to-end error control with hop-to-hop error control typically results in very high energy consumption. In our method, we employ the concept of product codes to perform dual-layer cooperative error control. Evaluations of residual error rate, latency and energy are provided. Traffic traces obtained from benchmark suites are used to examine how the dual-layer method handles imbalanced traffic loads and asymmetric error distributions over the network.

In Chap. 6, we introduce a flexible parallel simulator to evaluate the impact of different error control methods on NoC performance and energy consumption. Different error control schemes can be inserted into the simulator in a plug-and-play manner for evaluation. Moreover, a highly tunable fault injection feature is developed for modeling various fault injection scenarios, including different fault


injection rates, fault types, fault injection locations and faulty flit types. The flexible simulation environment provided by this simulator allows the efficiency of different error control schemes to be examined under different fault scenarios and traffic injection rates, and supports the investigation of NoC-based chip multi-processor (CMP) systems.

Conclusions and future directions are presented in Chap. 7.

References

1. Rusu S, Tam S, Muljono H, Stinson J, Ayers D, Chang J, Varada R, Ratta M, Kottapalli S (2009) A 45 nm 8-core enterprise Xeon® processor. in Proc IEEE Intl Solid-State Circuits Conf-Digest of Technical Papers 56–57
2. Kurd NA et al (2010) Westmere: A family of 32 nm IA processors. in Proc IEEE Intl Solid-State Circuits Conf-Digest of Technical Papers 96–97
3. Shin JL et al (2010) A 40 nm 16-core 128-thread CMT SPARC SoC processor. in Proc IEEE Intl Solid-State Circuits Conf-Digest of Technical Papers 98–99
4. Wendel DF et al (2011) POWER7™: A highly parallel, scalable multi-core high-end server processor. IEEE Journal of Solid-State Circuits 46:145–161
5. Anders MA et al (2010) A 4.1 Tb/s bisection-bandwidth 560 Gb/s/W streaming circuit-switched 8 × 8 mesh network-on-chip in 45 nm CMOS. in Proc IEEE Intl Solid-State Circuits Conf-Digest of Technical Papers 110–111
6. Dally WJ, Towles B (2001) Route Packets, Not Wires: On-Chip Interconnection Networks. in Proc 38th Design Automation Conference (DAC’01) 684–689
7. Sgroi M et al (2001) Addressing the system-on-a-chip interconnect woes through communication-based design. in Proc 38th Design Automation Conference (DAC’01) 667–672
8. Benini L, De Micheli G (2002) Networks on Chips: A new SoC paradigm. Computer 35:70–78
9. Henkel J, Wolf W, Chakradhar S (2000) Network on chip: An architecture for billion transistor era. in Proc 18th IEEE NorChip Conf 166–173
10. Benini L, De Micheli G (2001) Powering Networks on chips. in Proc Intl Symp System Synthesis 33–38
11. Jantsch A, Tenhunen H (2003) Networks on Chip. Kluwer Academic Publishers
12. Kim J, Park D, Nicopoulos C, Vijaykrishnan N, Das CR (2005) Design and analysis of an NoC architecture from performance, reliability and energy perspective. in Proc ACM/IEEE Symp on Architectures for Networking and Communications Syst (ANCS’05) 173–182
13. Salminen E, Kulmala A, Hamalainen TD (2007) On network-on-chip comparison. in Proc 10th Euromicro Conf on Digital Syst Design Architectures, Methods and Tools (DSD 2007) 503–510
14. Salminen E, Kulmala A, Hamalainen TD (2008) Survey of network-on-chip proposals. White paper, OCP-IP, 1–13
15. Agarwal A, Iskander C, Shankar R (2009) Survey of Network on Chip (NoC) Architectures & Contributions. Engineering, Computing and Architecture 3:1–15
16. Bolotin E, Cidon I, Ginosar R, Kolodny A (2004) Cost considerations in Network on Chip. Integration - the VLSI journal 38:19–42
17. Wiklund D, Liu D (2003) Socbus: switched network on chip for hard real time embedded systems. in Proc Intl Parallel and Distributed Processing Symp 1–8
18. De Micheli G, Benini L (2007) Networks On Chips. Morgan Kaufmann, San Francisco
19. Nilsson E, Millberg M, Oberg J, Jantsch A (2003) Load distribution with the proximity congestion awareness in a network on chip. in Proc DATE’03 1126–1127


20. Liu J, Zheng L-R, Tenhunen H (2003) A guaranteed-throughput switch for network-on-chip. in Proc Intl Symp System-on-chip 31–34
21. Millberg M, Nilsson E, Thid R, Jantsch A (2004) Guaranteed bandwidth using looped containers in temporally disjoint networks within the nostrum network on chip. in Proc DATE’04 8890–895
22. Khorsandi S, Leon-Garcia A (1996) Robust non-probabilistic bounds for delay and throughput in credit-based flow control. in INFOCOMM 677–584
23. Zeferino CA, Kreutz ME, Carro L, Susin AA (2002) A study on communication issues for systems-on-chip. in Proc Symp Integr Circuits and Syst Design 121–126
24. Radulescu A et al (2005) An efficient on-chip NI offering guaranteed services, shared-memory abstraction, and flexible network configuration. IEEE Trans on Computer-Aided Design of Integr Circuits and Syst (TCAD) 24:4–17
25. Zeferino CA, Santo FGME, Susin AA (2004) Paris: a parameterizable interconnect switch for networks-on-chip. in Proc Symp On Integr Circuits and Syst Design 204–209
26. Zeferino A, Susin AA (2003) SoCIN: a parametric and scalable network-on-chip. in Proc Symp On Integr Circuits and Syst Design 169–174
27. Chan J, Parameswaran S (2004) Nocgen: a template based reuse methodology for networks on chip architecture. in Proc Intl Conf on VLSI Design 717–720
28. Bertozzi D, Benini L (2004) Xpipes: a network on chip architecture for gigascale systems-on-chip. IEEE Circuits and Syst Magazine 4:18–31
29. Tamhankar RR, Murali S, De Micheli G (2005) Performance driven reliable link design for networks on chips. in Proc Asia and South Pacific Design Automation Conf (ASP-DAC’05) 749–754
30. Pullini A, Angiolini F, Bertozzi D, Benini L (2005) Fault tolerance overhead in Network-on-chip flow control schemes. in Proc Symp On Integr Circuits and Syst Design (SBCI’05) 4–7
31. ITRS (2003) http://www.itrs.net/Links/2003ITRS/Design2003.pdf
32. Ho R, Mai KW, Horowitz MA (2001) The future of wires. Proc IEEE 89:490–504
33. Ho PS, Lee Ki-Don, Yoon S, Wang Guotao (2004) Reliability challenges and recent advance for Cu Interconnects. in Proc 5th Intl Conf on Thermal and Material Simulation and Experiments in Micro-electronics and Micro-Syst 15–16
34. Mondal M, Wu X, Aziz A, Massoud Y (2006) Reliability analysis for on-chip networks under RC interconnect delay variation. in Proc Nanonet 1–5
35. Ismail IY (2008) Interconnect design and limitations in nanoscale technologies. in Proc ISCAS’08 780–783
36. Singhal R, Choi Gwan, Mahapatra R (2006) Information theoretic approach to address delay and reliability in long on-chip interconnects. in Proc ICCAD’06 310–314
37. Abraham JA, Fuchs WK (1986) Fault and error models for VLSI. Proc IEEE 74:639–654
38. Constantinescu C (2003) Trends and challenges in VLSI circuit reliability. IEEE Micro 23:14–19
39. Karnik T, Hazucha P, Patel J (2004) Characterization of soft errors caused by single event upsets in CMOS processes. IEEE Trans on Dependable and Secure Computer 1:128–143
40. Maheshwari A, Koren I, Burleson W (2004) Accurate estimation of soft error rate (SER) in VLSI circuits. in Proc IEEE Intl Symp on Defect and Fault Tolerance in VLSI Systems (DFT’04) 377–385
41. Chandra V, Aitken R (2008) Impact of technology and voltage scaling on the soft error susceptibility in nanoscale CMOS. in Proc DFT’08 114–122
42. Calhoun BH et al (2008) Digital circuit design challenges and opportunities in the era of nanoscale CMOS. Proc IEEE 96:343–365
43. Owens JD et al (2007) Research challenges for on-chip interconnection networks. IEEE Micro 27:96–108
44. Vittal A, Chen LH, Marek-Sadowska M, Wang K-P, Yang S (1999) Crosstalk in VLSI interconnections. IEEE Trans on Computer-Aided Design of Integr Circuits and Syst (TCAD) 18:1817–1824


45. Aingaran K et al (2000) Coupling noise analysis for VLSI and ULSI circuits. in Proc IEEE International Symposiums on Quality Electronic Design (ISQED’00) 485–489
46. Duan C, Calle VHC, Khatri SP (2009) Efficient on-chip crosstalk avoidance CODEC design. IEEE Trans Very Large Scale Integr (VLSI) Syst 17:551–560
47. Li L, Vijaykrishnan N, Kandemir M, Irwin MJ (2004) A crosstalk aware interconnect with variable cycle transmission. in Proc Design, Automation Test in Europe (DATE’04) 102–107
48. Patel KN, Markov IL (2004) Error-correction and crosstalk avoidance in DSM busses. IEEE Trans Very Large Scale Integr (VLSI) Syst 12:1076–1080
49. Rossi D, Metra C, Nieuwland AK, Katoch A (2005) Exploiting ECC redundancy to minimize crosstalk impact. IEEE Design & Test of Computers 22:59–70
50. Pande PP, Ganguly A, Feero B, Belzer B, Grecu C (2006) Design of low power & reliable networks on chip through joint crosstalk avoidance and forward error correction coding. in Proc IEEE Intl Symp on Defect and Fault Tolerance in VLSI Systems (DFT’06) 466–476
51. Ganguly A, Pande PP, Belzer B, Grecu C (2008) Design of low power & reliable networks on chip through joint crosstalk avoidance and multiple error correction coding. J Electron Test 24:67–81
52. Duan C, Tirumala A, Khatri SP (2001) Analysis and avoidance of crosstalk in on-chip buses. in Proc Hot Interconnects 133–138
53. Fu B, Ampadu P (2010) Exploiting Parity Computation Latency for On-Chip Crosstalk Reduction. IEEE Trans on Circuits and Systems–II: Express Briefs 57:399–403
54. Shanbhag N, Soumyanath K, Martin S (2000) Reliable low-power design in the presence of deep submicron noise. in Proc ISLPED’00 295–302
55. Frantz AP, Cassel M, Kastensmidt FL, Cota E, Carro L (2007) Crosstalk- and SEU-aware networks on chips. IEEE Design & Test Computers 24(4):340–350
56. Hoyos SE, Evans HDR, Daly E (2004) From satellite Ion flux data to SEU rate estimation. IEEE Trans Nuclear Science 51:2927–2935
57. Zhang M, Shanbhag NR (2006) Soft-error-rate-analysis (SERA) methodology. IEEE Trans Computer-Aided Design of Integr Circuits and Syst (TCAD) 25:2140–2155
58. Bidokhti N (2010) SEU Concept to Reality (Allocation, Prediction, Mitigation). in Proc Reliability and Maintainability Symp (RAMS) 1–5
59. Krishnamohan S, Mahapatra NR (2004) A highly-efficient technique for reducing soft errors in static CMOS circuits. in Proc ICCD’04 126–131
60. Mastipuram R, Edwin CW (2004) http://www.cs.columbia.edu/~cs4823/handouts/soft-errorspaper.pdf
61. Lantz L II (1996) Soft errors induced by alpha particles. IEEE Trans Reliability 45:174–179
62. Heidel DF et al (2008) Alpha-particle-induced upsets in advanced CMOS circuits and technology. IBM J Research and Development 52:225–232
63. Ziegler JF (1996) Terrestrial cosmic rays. IBM J Research and Development 40:19–40
64. Gordon MS et al (2004) Measurement of the flux and energy spectrum of cosmic-ray induced neutrons on the ground. IEEE Trans on Nuclear Science 51:3427–3434
65. Wilkinson JD, Bounds C, Brown T, Gerbi B, Peltier J (2005) Cancer radiotherapy equipment as a cause of soft errors in electronic equipment. IEEE Trans Device and Materials Reliability 5:449–451
66. Franco L et al (2005) SEUs on commercial SRAM induced by low energy neutrons produced at a clinical linac facility. in Proc RADECS’05
67. Khazaka R, Nakhla M (1998) Analysis of high-speed interconnects in the presence of electromagnetic interference. IEEE Trans Microwave Theory Tech 46:940–947
68. Hashimoto M, Yamaguchi J, Sato T, Onodera H (2005) Timing analysis considering temporal supply voltage fluctuation. in Proc ASP-DAC’05 1098–1101
69. Balasubramanian A et al (2008) Measurement and analysis of interconnect crosstalk due to single events in a 90 nm CMOS technology. IEEE Trans Nuclear Science 55:2079–2084


70. Balasubramanian A, Sternberg AL, Bhuva BL, Massengill LW (2006) Crosstalk effects caused by single event hits in deep sub-micron CMOS technologies. IEEE Trans Nuclear Science 53:3306–3311
71. Agarwal K, Sylvester D, Blaauw D (2006) Modeling and analysis of crosstalk noise in coupled RLC interconnects. IEEE Trans Comput-Aided Des Integrated Circuits Syst 25:892–901
72. Sorensen HR, Daly EJ, Underwood CI, Ward J, Adams L (1990) The behavior of measured SEU at low altitude during periods of high solar activity [spacecraft memories]. IEEE Trans Nuclear Sciences 37:1938–1946
73. Li L, Vijaykrishnan N, Kandemir M, Irwin MJ (2003) Adaptive error protection for energy efficiency. in Proc ICCAD’03 2–7
74. Aitken RC (1999) Nanometer technology effects on fault models for IC testing. IEEE Computer 32(11):46–51
75. Abraham JA, Krishnamachary A, Tupuri RS (2002) A comprehensive fault model for deep submicron digital circuits. in Proc 1st IEEE Intl Workshop on Electronic Design, Test and Applications (DELTA’02) 360–364
76. Hawkins C, Keshavarzi A, Segura J (2003) A view from the bottom: nanometer technology AC parameter failures–why, where, and how to detect. in Proc 18th IEEE Intl Symp on Defect and fault Tolerance in VLSI Syst (DFT’03) 267–276
77. Barsky R, Wagner IA (2004) Reliability and yield: a joint defect-oriented approach. in Proc 19th IEEE Intl Symp on Defect and fault Tolerance in VLSI Syst (DFT’04) 2–10
78. Hussein MA, He J (2005) Materials’ impact on interconnect process technology and reliability. IEEE Trans on Semiconductor Manufacturing 18:69–85
79. Lu Z, Huang W, Lach J, Stan M, Skadron K (2004) Interconnect lifetime prediction under dynamic stress for reliability-aware design. in Proc Intl Conf On Computer Aided Design (ICCAD’04) 327–334
80. Alam MA, Mahapatra S (2005) A comprehensive model of PMOS NBTI degradation. Microelectronics Reliability 45:71–81
81. Chess B, Larrabee T (1998) Logic testing of bridging faults in CMOS integrated circuits. IEEE Transactions on Computers 47:338–345
82. Rousset A et al (2007) Fast bridging fault diagnosis using logic information. in Proc 16th IEEE Asian Test Symp 33–38
83. Hamdioui S, Al-Ars Z, van de Goor AJ (2006) Open and delay faults in CMOS RAM address decoders. IEEE Trans Computers 55:1630–1639
84. Fick D et al (2009) A highly resilient routing algorithm for fault tolerant NoCs. in Proc Design, Automation & Test in Europe Conf & Exhibition (DATE’09) 21–26
85. Zhang Z, Greiner A, Taktak S (2008) A reconfigurable routing algorithm for a fault-tolerant 2D-mesh network-on-chip. in Proc IEEE Design Automation Conf (DAC’08) 441–446

Chapter 2

Existing Transient and Permanent Error Management in NoCs

Error control schemes combined with various error control codes are typically employed to handle transient errors. Physical layer techniques, such as spare wire replacement and split transmission, and network layer approaches, such as fault-tolerant routing, have been widely investigated for permanent error management. In this chapter, we review the state-of-the-art techniques for transient and permanent error management in NoCs.

2.1 Error Control Schemes

Three typical error control schemes are used in on-chip communication: error detection combined with automatic repeat request (ARQ), hybrid ARQ (HARQ) and forward error correction (FEC). The generic diagram for the transmitter and receiver is shown in Fig. 2.1. ARQ and HARQ use a negative acknowledgement (NACK) signal to request that the transmitter resend the message. FEC does not need an acknowledgement signal to recover from a detected error; instead, it corrects the error immediately when one is identified in the receiver.

2.1.1 Automatic Repeat Request

In the error detection plus automatic repeat request (ARQ) scheme, the decoder in the receiver performs error detection. If an error is detected, retransmission is requested. This scheme has been shown to be the most energy-efficient method for reliable on-chip communication when the error rate is sufficiently small [1]. There are three types of ARQ: stop-and-wait (SW), go-back-N (GBN) and selective-repeat (SR) [2]. As shown in Fig. 2.2a, SW ARQ does not transmit new data until the positive acknowledgement is received. During the time waiting for

Q. Yu and P. Ampadu, Transient and Permanent Error Control for Networks-on-Chip, DOI 10.1007/978-1-4614-0962-5_2, © Springer Science+Business Media, LLC 2012


Fig. 2.1 Generic diagram for error control scheme

Fig. 2.2 Types of ARQ scheme: (a) stop-and-wait, (b) go-back-N, (c) selective-repeat


ACK/NACK, the transmitter is idle; therefore, this scheme results in long latency and low throughput. Unlike SW ARQ, GBN ARQ keeps transmitting data while the transmitter is waiting for the feedback. If an error is detected at the receiver, the data transmitted in the previous N_R cycles are retransmitted, as shown in Fig. 2.2b. Here, N_R is the round-trip delay. GBN ARQ can achieve a high throughput in low error rate conditions, because little data is retransmitted. As the error rate increases, SR ARQ becomes more efficient than SW ARQ and GBN ARQ. SR ARQ retransmits only the erroneous data to improve the throughput, but requires a large buffer to store previous data until the missing portion arrives, as shown in Fig. 2.2c.
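The qualitative comparison above can be made concrete with a small idealized model. The sketch below is not taken from the text: it assumes one word is sent per cycle, independent errors, error-free ACK/NACK feedback and unlimited buffering, and the function name and parameters are hypothetical. Under these assumptions, SW ARQ spends a full round trip per attempt, GBN ARQ pays N_R cycles for every detected error, and SR ARQ pays one cycle.

```python
def arq_throughput(p_ok: float, n_r: int) -> dict:
    """Idealized throughput (delivered words per cycle) of the three ARQ types.

    p_ok : probability that a transmitted word arrives without a detected error
    n_r  : round-trip delay in cycles (N_R in the text), n_r >= 1
    """
    if not 0.0 < p_ok <= 1.0 or n_r < 1:
        raise ValueError("need 0 < p_ok <= 1 and n_r >= 1")
    return {
        # SW: every attempt occupies the link for a full round trip.
        "stop_and_wait": p_ok / n_r,
        # GBN: each detected error costs n_r cycles of retransmitted words.
        "go_back_n": p_ok / (p_ok + (1.0 - p_ok) * n_r),
        # SR: only the erroneous word is sent again.
        "selective_repeat": p_ok,
    }
```

With an error-free link the GBN and SR expressions coincide, while the gap between them widens as the error rate or the round-trip delay grows; the buffering cost that SR ARQ pays for this advantage is not captured by the model.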

2.1.2 Hybrid Automatic Repeat Request

Hybrid automatic repeat request (HARQ) attempts to correct the detected error; if the number of errors exceeds the error correction capability of the codec, retransmission is requested. This method achieves higher throughput than ARQ, at the cost of more decoder area and redundant bits [2]. For example, the extended Hamming code can both detect and correct errors; thus, this code has been widely employed in HARQ error control schemes for on-chip interconnects. Depending on the amount of retransmitted information, HARQ is divided into type-I HARQ and type-II HARQ [2, 3]. The former transmits both the error detection and the error correction check bits. In contrast, the latter transmits only the parity checks for error detection in the first transmission. The check bits for error correction are transmitted in the second transmission, if necessary. As a result, type-II HARQ consumes less power than type-I HARQ [4].

2.1.3 Forward Error Correction

Forward error correction (FEC) is typically designed for the worst-case noise condition. Unlike ARQ and HARQ, FEC needs no retransmission [4, 5]. The decoder always attempts to correct the detected errors. If the error is beyond the capability of the codec, a decoder failure occurs. Block FEC codes achieve better throughput than ARQ and HARQ; however, a scheme designed for the worst-case condition wastes energy when the noise condition is favorable. Because it encodes and decodes using both previously saved inputs and the current input, a convolutional code increases coding strength but incurs significant codec latency [6]; thus, FEC with convolutional codes is not suitable for on-chip interconnect networks. In this chapter, we focus on block codes.
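The difference between the three schemes can be summarized as a decision rule at the receiver. The following sketch is an idealization written for this summary, not an implementation from the text: it takes the true number of bit errors as an input, which a real decoder does not know and can only infer from the syndrome, and the function name, parameters and returned strings are hypothetical.

```python
def receiver_action(scheme: str, n_errors: int, t_correct: int, d_detect: int) -> str:
    """Idealized receiver decision for one received codeword.

    t_correct : number of bit errors the code can correct
    d_detect  : number of bit errors the code can detect (d_detect >= t_correct)
    """
    if n_errors == 0:
        return "accept"
    if scheme == "ARQ":
        # Detection only: any detected error triggers a NACK.
        return "request retransmission" if n_errors <= d_detect else "undetected error"
    if scheme == "HARQ":
        # Correct when possible, otherwise fall back to retransmission.
        if n_errors <= t_correct:
            return "correct"
        return "request retransmission" if n_errors <= d_detect else "undetected error"
    if scheme == "FEC":
        # No feedback path: uncorrectable errors end in decoder failure.
        return "correct" if n_errors <= t_correct else "decoder failure"
    raise ValueError(f"unknown scheme: {scheme}")
```

For a code that corrects one error and detects two, a 2-bit error leads to a retransmission under HARQ but to a decoder failure under FEC, which is the tradeoff between feedback latency and residual error rate discussed in this section.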

2.2 Error Control Coding

Error control coding is essential for the error control schemes described in the previous section. The discipline of error control coding is derived from information theory, which was founded by Claude Shannon in 1948 [7]. Shannon proved that there exist error control codes that make it possible to achieve a virtually error-free communication channel as long as the transmission rate is less than the channel capacity C, in bits per second. The Shannon-Hartley theorem gives the channel capacity of an additive white Gaussian noise (AWGN) channel as

$$C = B \log_2 (S/N + 1) \tag{2.1}$$

where S/N is the signal-to-noise ratio and B is the channel bandwidth. Thanks to advanced integrated circuit design techniques and manufacturing processes, error control coding can be applied not only to wireless communication and data storage applications [8], but also to on-chip interconnect [9–14]. With error control coding, we can reduce the supply voltage of the interconnect driver and receiver while still maintaining the original transmission throughput and ensuring system reliability.

We assume that a block of k bits is encoded to an n-bit codeword (n > k). For a binary code, there are $2^k$ codewords in the codebook. If the received codeword does not belong to the codebook, an error must have been injected into the codeword during transmission. By searching the codebook, the decoder is capable of detecting or correcting the error. Figure 2.3 shows decoding spheres that illustrate the decoding process. Each point represents a codeword. The parameter t is the radius of the sphere centered on a valid codeword; if the received vector is within the sphere, the error can be detected. The smallest number of differing bits between any two valid codewords is the minimum Hamming distance $d_{\min}$, which is a metric for judging the error detection and correction capability of a code. For a code with minimum Hamming distance $d_{\min}$, all error patterns of up to $d_{\min} - 1$ bits can be detected. Assume that the codeword C is transmitted. If the received codeword

Fig. 2.3 Decoding spheres


is Y (shown in Fig. 2.3), the decoder can detect the error injected during the transmission. This is because Y is an invalid codeword for the given codebook but lies within the sphere of the codeword C. If the received codeword is D, the error that occurred in transmission cannot be discovered because D is also a valid codeword in the codebook. To correct the detected error, the radius t must be no greater than $(d_{\min} - 1)/2$. Returning to the previous example, if the codeword Z is received, the decoder might correct the codeword to D, because Z is within the sphere of the valid codeword D in the codebook. This is a decoding failure. If X is received, the decoder recognizes the error but is not able to correct it, because X does not belong to the sphere of any valid codeword.

The simplest error detection code is the single parity check code, which has a Hamming distance of 2. A Hamming code (n, k) has a $d_{\min}$ of 3 and can be used either as an error detection code (detecting 1- and 2-bit errors) or as an error correction code (correcting 1-bit errors). An extended Hamming code can detect 1- and 2-bit errors and correct 1-bit errors [6]. Cyclic redundancy codes are used to detect burst errors [6]. More complex codes, such as binary Bose-Chaudhuri-Hocquenghem (BCH) codes [15], are capable of correcting more error bits at the cost of larger area, delay and power consumption. In this section, we summarize popular codes that have been applied to NoC links.
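The relation between the minimum distance and the detection and correction capability can be checked on small codebooks. The sketch below is illustrative only; the function names and the two example codebooks (a 3-bit repetition code and a 3-bit even parity code) are chosen here and do not come from the text.

```python
from itertools import combinations


def hamming_distance(a: str, b: str) -> int:
    """Number of bit positions in which two equal-length words differ."""
    return sum(x != y for x, y in zip(a, b))


def minimum_distance(codebook) -> int:
    """Smallest Hamming distance between any two distinct codewords."""
    return min(hamming_distance(a, b) for a, b in combinations(codebook, 2))


def capabilities(codebook) -> dict:
    """Guaranteed detection (d_min - 1) and correction ((d_min - 1) // 2) limits."""
    d_min = minimum_distance(codebook)
    return {"d_min": d_min, "detect": d_min - 1, "correct": (d_min - 1) // 2}


# Example codebooks (hypothetical, for illustration).
repetition = ["000", "111"]                  # (3, 1) repetition code
even_parity = ["000", "011", "101", "110"]   # (3, 2) single parity check code
```

Calling `capabilities(repetition)` gives a minimum distance of 3, while `capabilities(even_parity)` gives a minimum distance of 2, consistent with the statement that a single parity check code can detect one error but correct none.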

2.2.1 Single Parity Check Code

Single parity check code is a simple block code, which appends one parity check bit to the original message bits. Depending on how the check bit is calculated, single parity check code is divided into odd parity check code and even parity check code. In even parity check code, the check bit is chosen so that the total number of '1's in the codeword (message bits plus check bit) is even; in odd parity check code, it is chosen so that the total is odd. Because of its linearity, even parity check code is more popular than odd parity check code. With one additional check bit, single parity check increases the minimum distance to 2 and achieves 1-bit error detection capability (but no error correction capability). Combined with retransmission, this code can be applied to on-chip interconnect operating in the low-noise region [1, 16].
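A minimal Python sketch of even parity checking follows. The function names and the example message are assumptions made for illustration; the sketch shows that one flipped bit is detected while two flipped bits pass unnoticed, consistent with a minimum distance of 2.

```python
def even_parity_encode(bits):
    """Append a check bit so that the codeword holds an even number of 1s."""
    return bits + [sum(bits) % 2]

def even_parity_error(word):
    """True if the received word has odd weight, i.e. an error is detected."""
    return sum(word) % 2 == 1

message = [1, 0, 1, 1]
codeword = even_parity_encode(message)

one_flip = codeword.copy()
one_flip[2] ^= 1                 # single-bit error: detected
two_flips = one_flip.copy()
two_flips[4] ^= 1                # second error restores even parity: missed
```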

2.2.2 Hamming Code

Hamming code is another simple linear block code. For a standard Hamming (n, k) code, the codeword width n and the original message width k satisfy (2.2) and (2.3),

$$n = 2^r - 1 \tag{2.2}$$

$$k = 2^r - 1 - r \tag{2.3}$$

in which the positive integer r is the number of check bits. For a systematic Hamming code, the generator matrix is constructed as

$$G_{k \times n} = [\, I_k \mid P_{k \times (n-k)} \,] \tag{2.4}$$

where $I_k$ and $P_{k \times (n-k)}$ are the identity and parity matrices, respectively. Thus, the Hamming codeword C is obtained with

$$C_{1 \times n} = m_{1 \times k} \, G_{k \times n} \tag{2.5}$$

in which $m_{1 \times k}$ is the input message. In the decoder, the received codeword $m_{1 \times n}$ is multiplied by the transpose of the parity check matrix $H_{(n-k) \times n}$ to obtain the syndrome vector $S_{1 \times (n-k)}$. If the syndrome vector is not zero and matches one of the columns of H, the erroneous bit within the codeword is identified.

$$H_{(n-k) \times n} = [\, P^{T}_{k \times (n-k)} \mid I_{(n-k)} \,] \tag{2.6}$$

$$S_{1 \times (n-k)} = m_{1 \times n} \, H^{T}_{(n-k) \times n} \tag{2.7}$$

Table 2.1 Several Hamming codes and their variants

| Check bits (r = n − k) | Standard Hamming code (n, k) | Shortened Hamming code (n, k) | Extended shortened Hamming code (n, k) |
|---|---|---|---|
| 3 | (7, 4) | – | – |
| 4 | (15, 11) | (12, 8) | (13, 8) |
| 5 | (31, 26) | (21, 16) | (22, 16) |
| 6 | (63, 57) | (38, 32) | (39, 32) |
| 7 | (127, 120) | (71, 64) | (72, 64) |
| 8 | (255, 247) | (136, 128) | (137, 128) |
| 9 | (511, 502) | (265, 256) | (266, 256) |
| 10 | (1,023, 1,013) | (522, 512) | (523, 512) |

The minimum distance of Hamming code is 3, so Hamming code can detect 2-bit errors or correct 1-bit errors. By adding one parity check bit to the Hamming codeword, we can extend the Hamming code and increase the minimum distance to 4. Consequently, the extended Hamming code is capable of detecting 3-bit errors, or of correcting 1-bit errors and detecting 2-bit errors (i.e. single-error correction double-error detection, SEC-DED). Standard Hamming code has strict constraints on the original message width, which is not flexible for NoC link design. By truncating the generator matrix, the codeword for standard Hamming code can be modified as shown in Table 2.1. To obtain the shortened Hamming code, one removes the same number of bits from the codeword and the original message. Consequently, the input width for the shortened Hamming code can be set to a power of 2, as shown in the third column of Table 2.1. The shortened Hamming codewords can further be extended by adding one more check bit, as shown in the fourth column of Table 2.1. The shortened Hamming code maintains the same minimum Hamming distance but has a flexible input width, at the cost of a reduced code rate.

Now, let us see how to create a shortened Hamming code. Consider a Hamming (15, 11) code with the generator matrix $G_{11 \times 15}$ below.

$$
G_{11 \times 15} = \left[\begin{array}{ccccccccccc|cccc}
1&0&0&0&0&0&0&0&0&0&0&1&1&0&0\\
0&1&0&0&0&0&0&0&0&0&0&1&0&1&0\\
0&0&1&0&0&0&0&0&0&0&0&1&1&1&0\\
0&0&0&1&0&0&0&0&0&0&0&0&1&1&0\\
0&0&0&0&1&0&0&0&0&0&0&1&0&0&1\\
0&0&0&0&0&1&0&0&0&0&0&0&1&0&1\\
0&0&0&0&0&0&1&0&0&0&0&1&1&0&1\\
0&0&0&0&0&0&0&1&0&0&0&0&0&1&1\\
0&0&0&0&0&0&0&0&1&0&0&1&0&1&1\\
0&0&0&0&0&0&0&0&0&1&0&0&1&1&1\\
0&0&0&0&0&0&0&0&0&0&1&1&1&1&1
\end{array}\right] \tag{2.8}
$$

To obtain the generator matrix for the shortened Hamming code, HM(12, 8), one can remove the first three rows and the first three columns of $G_{11 \times 15}$. Consequently, $G_{8 \times 12}$ consists of rows 4–11 and columns 4–15 of $G_{11 \times 15}$. The generator matrix of the extended version of HM(12, 8), $G_{8 \times 13}$, has one more all-'1' column than $G_{8 \times 12}$.

$$
G_{8 \times 13} = \left[\begin{array}{cccccccc|ccccc}
1&0&0&0&0&0&0&0&0&1&1&0&1\\
0&1&0&0&0&0&0&0&1&0&0&1&1\\
0&0&1&0&0&0&0&0&0&1&0&1&1\\
0&0&0&1&0&0&0&0&1&1&0&1&1\\
0&0&0&0&1&0&0&0&0&0&1&1&1\\
0&0&0&0&0&1&0&0&1&0&1&1&1\\
0&0&0&0&0&0&1&0&0&1&1&1&1\\
0&0&0&0&0&0&0&1&1&1&1&1&1
\end{array}\right] \tag{2.9}
$$

A similar principle applies to the shortened and extended codes with larger input widths. For low transient noise conditions, Hamming codes and their variants have been widely used to protect against link errors [1, 14, 16–19]. Other SEC-DED codes (e.g. the Hsiao code in [20]) can also be used for on-chip interconnect.
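The encoding and syndrome decoding steps in (2.5) to (2.7) can be sketched for the shortened HM(12, 8) code. The parity part P below is read from rows 4–11 of the parity columns in (2.8); the function names and the test message are assumptions for illustration, and the sketch only handles single-bit errors.

```python
# Parity part P of G(8x12): rows 4-11 of the parity columns of G(11x15) in (2.8).
P = [
    [0, 1, 1, 0], [1, 0, 0, 1], [0, 1, 0, 1], [1, 1, 0, 1],
    [0, 0, 1, 1], [1, 0, 1, 1], [0, 1, 1, 1], [1, 1, 1, 1],
]
K, R = 8, 4  # message bits and check bits

def encode(msg):
    """Systematic encoding c = m G with G = [I | P], arithmetic over GF(2)."""
    checks = [sum(msg[i] * P[i][j] for i in range(K)) % 2 for j in range(R)]
    return list(msg) + checks

def syndrome(word):
    """s = word * H^T with H = [P^T | I]."""
    return [(sum(word[i] * P[i][j] for i in range(K)) + word[K + j]) % 2
            for j in range(R)]

def correct_single_error(word):
    """Flip the bit whose column of H equals the syndrome, if there is one."""
    s = syndrome(word)
    fixed = list(word)
    if not any(s):
        return fixed
    # Column i of H is row i of P for message bits, a unit vector for check bits.
    columns = P + [[int(i == j) for j in range(R)] for i in range(R)]
    if s in columns:
        fixed[columns.index(s)] ^= 1
    return fixed
```

Because every column of H is distinct and nonzero, each single-bit error produces a unique syndrome, which is what makes the lookup in `correct_single_error` sufficient.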


2.2.3 Cyclic Redundancy Check Code

A linear block code C is a cyclic code if every cyclic shift of a codeword is another codeword in the code set C. Given the input polynomial m(x) and the generator polynomial g(x), the non-systematic codeword of C (n, k) is constructed with (2.12).

$$m(x) = m_0 + m_1 x + m_2 x^2 + \cdots + m_{k-1} x^{k-1} \tag{2.10}$$

$$g(x) = 1 + g_1 x + g_2 x^2 + \cdots + g_{r-1} x^{r-1} + x^r \tag{2.11}$$

$$c(x) = m(x)\, g(x) = c_0 + c_1 x + c_2 x^2 + \cdots + c_{n-1} x^{n-1} \tag{2.12}$$

The systematic codeword can be computed with (2.13),

$$c(x) = m(x)\, x^{n-k} + R_{g(x)}\!\left[ m(x)\, x^{n-k} \right] \tag{2.13}$$

where $R_{g(x)}[\,\cdot\,]$ represents the remainder of dividing the polynomial in the bracket by g(x). The remainder gives the parity check bits of the systematic cyclic codeword [6]. A simple linear feedback shift register (LFSR) can realize the function expressed in (2.13), at the cost of a large latency overhead. Cyclic redundancy check (CRC) code is a class of cyclic code whose generator polynomial is $g(x) = (x + 1)\,p(x)$, where p(x) is a primitive polynomial. For $GF(p^m) = \{0, 1, \alpha, \alpha^2, \alpha^3, \ldots, \alpha^{p^m - 2}\}$, the primitive polynomial is the minimal polynomial of the primitive element $\alpha$ of $GF(p^m)$. To reduce the codec latency, parallel CRC implementations have been investigated [21–23]. In this book, we employ the tool [24] to generate the hardware description code for our experiments. Because of its single-error and burst-error detection capability, CRC has been applied to networks-on-chip links to detect multi-bit transient errors [1, 16, 25].
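The systematic encoding in (2.13) can be sketched with integer bit masks, where bit i of an integer holds the coefficient of $x^i$. The generator $g(x) = (x + 1)(x^3 + x + 1) = x^4 + x^3 + x^2 + 1$ and the 4-bit message are assumptions chosen to keep the example small; this is a bit-serial sketch, not the parallel implementation produced by the tool [24].

```python
def poly_mod(dividend: int, divisor: int) -> int:
    """Remainder of GF(2) polynomial division (bit i = coefficient of x^i)."""
    dlen = divisor.bit_length()
    while dividend.bit_length() >= dlen:
        # Cancel the leading term by XORing an aligned copy of the divisor.
        dividend ^= divisor << (dividend.bit_length() - dlen)
    return dividend

def crc_encode(msg: int, g: int) -> int:
    """c(x) = m(x) x^(n-k) + R_g[m(x) x^(n-k)], with n - k = deg g."""
    r = g.bit_length() - 1
    shifted = msg << r
    return shifted ^ poly_mod(shifted, g)

def crc_error_detected(word: int, g: int) -> bool:
    """A received word is accepted only if g(x) divides it."""
    return poly_mod(word, g) != 0

G = 0b11101                      # x^4 + x^3 + x^2 + 1 (assumed toy generator)
codeword = crc_encode(0b1011, G)
```

Any error burst no longer than the degree of g(x) leaves a nonzero remainder, which is why the check above catches it.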

2.2.4 Bose-Chaudhuri-Hocquenghem Code

Bose-Chaudhuri-Hocquenghem (BCH) code is a subset of cyclic codes, and its minimum Hamming distance is at least $2t_d + 1$ ($t_d$ is the number of correctable error bits in a codeword). For a binary BCH code (n, k, $d_{\min}$), the generator polynomial is

$$g(x) = \mathrm{LCM}\{\phi_b(x), \phi_{b+1}(x), \ldots, \phi_{b+2t_d-1}(x)\} \tag{2.14}$$

in which LCM means least common multiple and $\phi(x)$ is the minimal polynomial; k and n satisfy

$$k = n - \deg[g(x)] \tag{2.15}$$

Table 2.2 Several binary BCH codes and their variants' generator polynomials

| Standard binary BCH code | Shortened binary BCH code | Primitive polynomial φ(x) | Generator polynomial g(x) |
|---|---|---|---|
| BCH(127, 99) | BCH(92, 64) | x^7 + x^3 + 1 | x^28 + x^27 + x^26 + x^23 + x^20 + x^19 + x^18 + x^13 + x^10 + x^9 + x^7 + x^5 + x^4 + x^3 + 1 |
| BCH(255, 223) | BCH(160, 128) | x^8 + x^4 + x^3 + x^2 + 1 | x^32 + x^31 + x^30 + x^29 + x^27 + x^26 + x^25 + x^22 + x^20 + x^19 + x^17 + x^16 + x^14 + x^9 + x^7 + x^6 + x^5 + x^4 + x^3 + x^2 + 1 |
| BCH(511, 484) | BCH(283, 256) | x^9 + x^4 + 1 | x^27 + x^26 + x^24 + x^22 + x^21 + x^16 + x^13 + x^11 + x^9 + x^8 + x^6 + x^5 + x^4 + x^3 + 1 |
| BCH(1,023, 983) | BCH(552, 512) | x^10 + x^3 + 1 | x^40 + x^39 + x^33 + x^31 + x^30 + x^29 + x^27 + x^25 + x^24 + x^23 + x^22 + x^21 + x^19 + x^16 + x^12 + x^11 + x^10 + x^9 + x^7 + x^4 + x^3 + x + 1 |

Fig. 2.4 Architecture of a BCH decoder with GF($2^m$) arithmetic

Given $t_d = 4$, the binary BCH codes (including shortened codes) and their generator polynomials are shown in Table 2.2. The encoding computation for binary BCH is similar to (2.13). However, the decoding process of BCH is much more complex than that of other linear block codes. The architecture of a BCH decoder with GF($2^m$) arithmetic is shown in Fig. 2.4 [15]. For a $t_d$-error-correcting binary BCH (n, k) code, the syndrome S is calculated with the expression (2.16),

$$S_{1 \times 2t_d} = r_{1 \times n} \, H^{T}_{2t_d \times n} \tag{2.16}$$

where $r_{1 \times n}$ is the received codeword and H is the parity check matrix

$$
H = \begin{bmatrix}
1 & \alpha^{b} & (\alpha^{b})^{2} & \cdots & (\alpha^{b})^{n-1} \\
1 & \alpha^{b+1} & (\alpha^{b+1})^{2} & \cdots & (\alpha^{b+1})^{n-1} \\
\vdots & \vdots & \vdots & \ddots & \vdots \\
1 & \alpha^{b+2t_d-1} & (\alpha^{b+2t_d-1})^{2} & \cdots & (\alpha^{b+2t_d-1})^{n-1}
\end{bmatrix} \tag{2.17}
$$

in which $\alpha^{b}, \ldots, \alpha^{b+2t_d-1}$ are roots of the generator polynomial g(x). Unlike the previous linear block codes, the syndrome computation for BCH code is hardware-consuming because of the large parity check matrix expressed in (2.17). Each element in the matrix is represented by an m-bit vector. Suppose the error location polynomial is given by (2.18).

$$\sigma(x) = \sigma_0 + \sigma_1 x + \sigma_2 x^2 + \cdots + \sigma_{t_d} x^{t_d} \tag{2.18}$$

The coefficients of $\sigma(x)$ can be calculated from the relationship between the syndromes S and $\sigma(x)$ shown in (2.19), which can be solved by the Peterson-Gorenstein-Zierler algorithm, Euclid's algorithm, the Berlekamp-Massey algorithm (BMA) [15] or the inversionless BMA [26].

$$S_{t_d+i} = \sum_{j=1}^{t_d} S_{t_d+i-j}\, \sigma_j \qquad (i = 1, 2, \ldots, t_d) \tag{2.19}$$

The obtained $\sigma(x)$ is one row polynomial of the H matrix. The third step in the BCH decoding process is to compare $\sigma(x)$ with the H matrix and identify which column of H the error location polynomial $\sigma(x)$ matches. In the Chien search algorithm, the error bit position is confirmed when we find a root of (2.20).

$$\sigma(\alpha^{j}) = \sigma_0 + \sigma_1 \alpha^{j} + \sigma_2 \alpha^{2j} + \sigma_3 \alpha^{3j} + \cdots + \sigma_{t_d} \alpha^{t_d j} \tag{2.20}$$

The corresponding error bit position is (n − 1 − j). For binary BCH, XORing bit (n − 1 − j) corrects the received codeword. The hardware overhead of binary BCH codes has been evaluated in Refs. [27, 28].
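The three decoding steps (syndrome computation, error location polynomial, Chien search) can be traced on a toy code. The sketch below assumes a BCH(15, 7) code over GF($2^4$) with primitive polynomial $x^4 + x + 1$ and $t_d = 2$, and it uses Peterson's closed-form solution for two errors rather than the BMA. It also indexes bit p as the coefficient of $x^p$, so a root $\alpha^j$ maps to position (n − j) mod n, which may differ from the numbering used in the text. All names are illustrative.

```python
# GF(2^4) built from the primitive polynomial x^4 + x + 1 (assumed toy field).
EXP, LOG = [0] * 30, [0] * 16
value = 1
for i in range(15):
    EXP[i], LOG[value] = value, i
    value <<= 1
    if value & 0x10:
        value ^= 0x13
for i in range(15, 30):
    EXP[i] = EXP[i - 15]

N = 15

def gf_mul(a, b):
    return 0 if a == 0 or b == 0 else EXP[LOG[a] + LOG[b]]

def gf_div(a, b):
    return 0 if a == 0 else EXP[(LOG[a] - LOG[b]) % 15]

def syndrome(word, i):
    """S_i = r(alpha^i), where word[p] is the coefficient of x^p."""
    s = 0
    for p, bit in enumerate(word):
        if bit:
            s ^= EXP[(i * p) % 15]
    return s

def decode(word):
    """Correct up to two bit errors; return (corrected word, error positions)."""
    s1, s3 = syndrome(word, 1), syndrome(word, 3)
    if s1 == 0 and s3 == 0:
        return list(word), []
    if s1 == 0:
        return None, None                     # more than two errors
    # sigma(x) = 1 + sigma1*x + sigma2*x^2 (closed form for two errors)
    s1_cubed = gf_mul(s1, gf_mul(s1, s1))
    sigma1, sigma2 = s1, gf_div(s3 ^ s1_cubed, s1)
    positions = []
    for j in range(N):                        # Chien search over all alpha^j
        x = EXP[j]
        if (1 ^ gf_mul(sigma1, x) ^ gf_mul(sigma2, gf_mul(x, x))) == 0:
            positions.append((N - j) % N)
    if len(positions) != (1 if sigma2 == 0 else 2):
        return None, None                     # locator has too few roots
    fixed = list(word)
    for p in positions:
        fixed[p] ^= 1
    return fixed, sorted(positions)

# Codeword equal to the generator g(x) = x^8 + x^7 + x^6 + x^4 + 1 itself.
codeword = [1, 0, 0, 0, 1, 0, 1, 1, 1, 0, 0, 0, 0, 0, 0]
received = list(codeword)
received[3] ^= 1
received[10] ^= 1
corrected, error_positions = decode(received)
```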

2.2.5 Product Code

Product code has recently been applied to on-chip interconnect because of its lower complexity (than BCH code) and higher error detection/correction capability for burst errors (than single parity check code, Hamming code and CRC) [27]. Product code is constructed from two simple component codes, one for rows and one for columns. Figure 2.5 shows the encoding process for product code $C_p$(n, k) using $C_1$($n_1$, $k_1$) and $C_2$($n_2$, $k_2$) for column and row encoding, respectively. The $C_p$ code input k is the product of the $C_1$ code input $k_1$ and the $C_2$ code input $k_2$. Similarly, $n = n_1 n_2$. Consequently, the minimum Hamming distance of $C_p$ is the product of the minimum Hamming distances of $C_1$ and $C_2$ [4].

Fig. 2.5 Encoding process of product codes

A three-stage decoding process is proposed in Ref. [27]. In the first stage, row decoding is performed to obtain the row syndrome and the row status vector, which indicates which rows contain errors. If the detected error is correctable, an error-free message is produced by flipping the erroneous bit; otherwise the subsequent decoding stages are needed. In the second stage, column decoders detect and correct errors in the codeword with the assistance of the row status vector. To undo potentially wrong row/column corrections made in the first two stages, another row decoding is executed in the last stage. This decoding process is shown in Fig. 2.6. In Ref. [27], product code has been applied to switch-to-switch links, showing advantages in energy efficiency and error resilience. In this book, we propose to apply the product code principle to dual-layer error control in networks-on-chip in Chapter 6.
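The row/column construction can be sketched with the simplest possible component codes. The example below assumes single parity check codes for both rows and columns (so the product code has minimum distance 2 × 2 = 4) and corrects one error at the intersection of the failing row and column. It is a simplified stand-in for illustration, not the Hamming product code with three-stage decoding of Ref. [27].

```python
def encode_product(rows):
    """Append an even-parity bit to every row, then a parity row over columns."""
    coded = [r + [sum(r) % 2] for r in rows]
    coded.append([sum(col) % 2 for col in zip(*coded)])
    return coded

def correct_single_error(block):
    """Flip the bit at the intersection of the one failing row and column."""
    bad_rows = [i for i, r in enumerate(block) if sum(r) % 2]
    bad_cols = [j for j, c in enumerate(zip(*block)) if sum(c) % 2]
    fixed = [list(r) for r in block]
    if len(bad_rows) == 1 and len(bad_cols) == 1:
        fixed[bad_rows[0]][bad_cols[0]] ^= 1
    return fixed

message = [[1, 0], [1, 1]]           # k1 = k2 = 2, so k = 4
block = encode_product(message)      # n1 = n2 = 3, so n = 9
```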

2.3 Spare Wires

Permanent errors are usually caused by an imperfect manufacturing process or device aging. These errors do not vanish unless the faulty components are abandoned or replaced with new ones. Broken wires can be replaced with spare wires either at the testing stage [29] or at run time [28, 30].


Fig. 2.6 Decoding process of product codes

Fig. 2.7 Two approaches for permanent error management: (a) half splitting transmission, (b) phit size reduction

2.4 Split Transmission

Split transmission improves the utilization of broken links and reduces the need for re-routing, which in turn reduces network congestion and latency. If the spare wires run out or the router is nonfunctional, re-routing is needed. As shown in Fig. 2.7a, half splitting transmission [28] divides one flit into two fractions and transmits one fraction per cycle. This approach is not efficient if the ratio of broken wires to all wires in the link is much smaller than 0.5. In Ref. [31], the user reduces the phit size (= switch link width) to the maximum number of healthy wires per switch-to-switch link set (Fig. 2.7b).
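The efficiency argument can be illustrated with a rough cycle count. The model below is an assumption made for illustration only: half splitting always spends two cycles per flit once any wire is broken, while phit size reduction packs the packet bits onto the healthy wires, with phits allowed to span flit boundaries. The function names and the 32-wire link are hypothetical.

```python
import math

def cycles_half_split(num_flits, link_width, broken):
    """Cycles to send a packet when each flit is sent as two halves."""
    if broken == 0:
        return num_flits
    if link_width - broken < math.ceil(link_width / 2):
        return None                  # too few healthy wires: re-route instead
    return 2 * num_flits

def cycles_reduced_phit(num_flits, link_width, broken):
    """Cycles when the phit size shrinks to the number of healthy wires."""
    healthy = link_width - broken
    if healthy <= 0:
        return None
    return math.ceil(num_flits * link_width / healthy)
```

Under these assumptions, a 4-flit packet on a 32-wire link with 2 broken wires needs 8 cycles with half splitting but only 5 with the reduced phit size; the two schemes meet when half of the wires are broken.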

2.5 Fault-Tolerant Routing

In addition to spare wires and split transmission, fault-tolerant routing algorithms are suitable for handling permanent link errors. Fault tolerance is achieved either by redundant packets or by redundant routing paths.

2.5.1 Redundant-Packet-Based Routing

The essence of redundant-packet-based fault-tolerant routing algorithms is to send multiple copies of a packet over the network and choose one correct copy at the destination. The disadvantages of this routing category are as follows:

• Adds network congestion;
• Increases power consumption;
• Loses fault tolerance capability if the number of copies decreases;
• Increases router design complexity.

Different efforts have been made to improve the efficiency of redundant-packet-based routing. The flooding routing algorithm requires the source router to send the packet in every possible direction and the intermediate routers to forward the received packet in all possible directions as well [32]. Various flooding variants have been explored. In the probabilistic flooding approach, the source router sends multiple packet copies to all of its neighbors; the intermediate routers forward the received packets to their neighbors with a pre-defined probability (<1), thus reducing the number of redundant packets [33]. In the directed flooding algorithm, the probability of forwarding a packet to a particular neighbor is multiplied by a factor that depends on the distance between the current node and the destination [34]. Pirretti et al. proposed a redundant random walk algorithm, in which the intermediate node assigns different packet forwarding probabilities to its output ports. Because the sum of the forwarding probabilities is equal to 1, Pirretti's approach forwards only one copy of the received packet, reducing the overhead [34]. Considering the tradeoff between redundancy and performance, Patooghy et al. transmit only one additional copy of the packet through a low-traffic-load path to replace the erroneous packet [35]. Redundant-packet-based routing is feasible for both transient and permanent errors, regardless of whether the error is present in a link, a buffer or the logic gates of the router.
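The difference between probabilistic flooding and the redundant random walk can be sketched as two forwarding decisions at an intermediate router. The port names, weights and function names below are hypothetical, and the sketch ignores how the cited works actually derive their probabilities.

```python
import random

def probabilistic_flood(ports, p, rng):
    """Forward a copy on each port independently with probability p."""
    return [port for port in ports if rng.random() < p]

def random_walk_forward(weights, rng):
    """Forward exactly one copy; port weights are normalized to sum to 1."""
    total = sum(weights.values())
    probs = {port: w / total for port, w in weights.items()}
    port = rng.choices(list(probs), weights=list(probs.values()), k=1)[0]
    return port, probs

rng = random.Random(0)                       # fixed seed for repeatability
ports = ["north", "east", "south"]
copies = probabilistic_flood(ports, 0.5, rng)
chosen, probs = random_walk_forward({"north": 3, "east": 1, "south": 1}, rng)
```

Probabilistic flooding may emit anywhere from zero to three copies here, whereas the random walk always emits exactly one.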

2.5.2 Redundant-Route-Based Routing

A permanently unusable link or router breaks the regularity of normal NoC topologies. Redundant-route-based routing algorithms exploit the multiple-path feature of the network to address permanent errors. This type of routing algorithm relies on the global/semi-global link/router status, which is updated before routing. Representative redundant-route routing algorithms are distance vector routing, link state routing, DyNoC and reconfigurable routing. In distance vector routing, the number of hops between the current router and each destination is periodically updated in the routing table. Each router is notified of the faulty links and routers within one period. Link state routing uses a handshaking protocol to sense the state of neighboring links and routers, so that the faulty links and routers can be considered in the shortest path computation [35]. Unlike distance vector routing and link state routing, DyNoC routing does not broadcast the information about broken links or permanently unusable routers; instead, it only notifies the neighboring routers to bypass the obstacle [36]. In reconfigurable routing [37], the eight routers around the unusable router are informed; those routers use alternative routing paths to avoid the faulty router and prevent deadlock. With redundant wiring and buffers, the default-backup path [38] continues to use the healthy processing elements even when the attached router has permanent faults. Using CRC code in the crossbar and distributed on-line fault diagnosis, Kohler and Radetzki successfully distinguish transient and permanent errors [39]; they further use deflection routing to address the identified permanent errors. Deflection routing may experience livelock, but it guarantees no deadlock.

A large portion of distributed control routing algorithms [40–42] are used to avoid network congestion or deadlock. Recently, those algorithms have been employed to tolerate permanent faults [43, 44]. In Ref. [45], Schonwald et al. proposed a force-directed wormhole routing algorithm to uniformly distribute the traffic across the entire network while handling faulty links and switches. Zhou and Lau [46], Boppana and Chalasani [47], and Chen and Chiu [48] took advantage of virtual channels to extend the region of re-routing for the flits encountering fault interference. The fault-tolerant adaptive routing algorithm proposed by Park et al. in Ref. [49] requires additional logic circuits and multiplexers for buffers and virtual channel switching, respectively. The routing algorithms analyzed by Duato need at least four virtual channels per physical channel [50]. In Ref. [51], Flich et al. present a region-based routing to address the irregular topology caused by permanent link errors. At every switch, destinations are grouped into different regions, so that the redundant routing information can be forwarded without using a routing table, thus reducing the hardware cost. The region-based routing cannot guarantee that every node is reachable if a permanent error occurs. Fortunately, a universal logic-based distributed routing (uLBDR) algorithm has been shown to achieve 100% coverage [52].
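Taking faulty links into account in a shortest path computation, as link state routing does, can be sketched with a breadth-first search on a 2D mesh. The mesh size, coordinates and function name are assumptions for illustration; the sketch computes a path centrally and says nothing about deadlock freedom or how link states are exchanged.

```python
from collections import deque

def shortest_path(width, height, src, dst, faulty_links):
    """Breadth-first search on a 2D mesh that skips links marked as faulty."""
    faulty = {frozenset(link) for link in faulty_links}
    parent = {src: None}
    queue = deque([src])
    while queue:
        node = queue.popleft()
        if node == dst:
            path = []
            while node is not None:          # walk back to the source
                path.append(node)
                node = parent[node]
            return path[::-1]
        x, y = node
        for nxt in ((x + 1, y), (x - 1, y), (x, y + 1), (x, y - 1)):
            in_mesh = 0 <= nxt[0] < width and 0 <= nxt[1] < height
            usable = frozenset((node, nxt)) not in faulty
            if in_mesh and usable and nxt not in parent:
                parent[nxt] = node
                queue.append(nxt)
    return None                              # destination unreachable

healthy_route = shortest_path(3, 3, (0, 0), (2, 0), [])
detour = shortest_path(3, 3, (0, 0), (2, 0), [((1, 0), (2, 0))])
```

With the link between (1, 0) and (2, 0) broken, the route grows from two hops to four hops, which is the latency cost of relying on path redundancy alone.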

2.6 Summary

Transient errors are typically managed by error control schemes that use error detection/correction codes in the datalink layer. ARQ, HARQ and FEC are the three main error control schemes in on-chip communication. Simple and complex error control codes have been reviewed in terms of error control strength, codec complexity and the encoding/decoding process. Redundant-packet-based routing algorithms are capable of handling transient and permanent errors in the network layer; however, this class of routing results in significant energy and area overhead, as well as performance degradation. Taking advantage of the inherent topology redundancy of NoCs, redundant-route-based routing algorithms provide alternative routing paths for a single copy of a packet whenever a permanent error occurs on the regular route. The reviewed methods manage transient and permanent errors independently, which is not efficient in terms of energy and area overhead.


References

1. Bertozzi D, Benini L, De Micheli G (2005) Error control scheme for on-chip communication links: the energy-reliability tradeoff. IEEE Trans Computer-Aided Design of Integr Circuits and Syst (TCAD) 24:818–831
2. Lin S, Costello D, Miller M (1984) Automatic-repeat-request error control schemes. IEEE Communications Magazine 22:5–17
3. Metzner J (1979) Improvements in block-retransmission schemes. IEEE Trans Communication COM-23:525–532
4. Lehtonen T, Liljeberg P, Plosila J (2007) Analysis of forward error correction methods for nanoscale networks-on-chip. in Proc Nano-Net 1–5
5. Rossi D, Nieuwland AK, van Dijk SVE, Kleihorst RP, Metra C (2008) Power consumption of fault tolerant busses. IEEE Trans Very Large Scale Integr (VLSI) Syst 16:542–553
6. Lin S, Costello DJ (2004) Error control coding, Second Edition. Prentice Hall
7. Shannon CE (1948) A mathematical theory of communication. Bell System Technical Journal 27:379–423 and 623–656
8. Farrell PG (1990) Coding as a cure for communication calamities: the success and failures of error control. Electronics & Communication Engineering Journal 2:213–220
9. Duan C, Calle VHC, Khatri SP (2009) Efficient on-chip crosstalk avoidance CODEC design. IEEE Trans Very Large Scale Integr (VLSI) Syst 17:551–560
10. Patel KN, Markov IL (2004) Error-correction and crosstalk avoidance in DSM busses. IEEE Trans Very Large Scale Integr (VLSI) Syst 12:1076–1080
11. Rossi D, Metra C, Nieuwland AK, Katoch A (2005) Exploiting ECC redundancy to minimize crosstalk impact. IEEE Design & Test of Computers 22:59–70
12. Pande PP, Ganguly A, Feero B, Belzer B, Grecu C (2006) Design of low power & reliable networks on chip through joint crosstalk avoidance and forward error correction coding. in Proc IEEE Intl Symp on Defect and Fault Tolerance in VLSI Systems (DFT'06) 466–476
13. Ganguly A, Pande PP, Belzer B, Grecu C (2008) Design of low power & reliable networks on chip through joint crosstalk avoidance and multiple error correction coding. J Electron Test 24:67–81
14. Rossi D, Angelini P, Metra C (2007) Configurable error control scheme for NoC signal integrity. in Proc IOLTS'07 43–48
15. Morelos-Zaragoza RH (2006) The Art of Error Correcting Coding, Second Edition. John Wiley & Sons Inc, Hoboken, NJ, USA
16. Li L, Vijaykrishnan N, Kandemir M, Irwin MJ (2003) Adaptive error protection for energy efficiency. in Proc ICCAD'03 2–7
17. Murali S et al (2005) Analysis of error recovery schemes for networks on chips. IEEE Design & Test of Computers 22:434–442
18. Zimmer H, Jantsch A (2003) A fault model notation and error-control scheme for switch-to-switch buses in a network-on-chip. in Proc CODES+ISSS'03 188–193
19. Sridhara SR, Shanbhag NR (2007) Coding for reliable on-chip buses: A class of fundamental bounds and practical codes. IEEE Trans on Computer-Aided Design of Integrated Circuit and Syst 26:977–982
20. Hsiao MY (1970) A class of optimal minimum odd-weight-column SEC–DED codes. IBM J Research Development 14:395–401
21. Albertengo G, Sisto R (1990) Parallel CRC generation. IEEE Micro 10:63–71
22. Pei T-B, Zukowski C (1992) High speed parallel CRC circuits in VLSI. IEEE Trans Communication 40:653–657
23. Shieh MD, Sheu MH, Chen CH, Lo HF (2001) A systematic approach for parallel CRC computations. J Information Science and Engineering 17:445–461
24. CRC Tool [Online] http://www.easics.com/webtools/crctool
25. Sridhara SR, Shanbhag NR (2005) Coding for system-on-chip networks: a unified framework. IEEE Trans Very Large Scale Integr (VLSI) Syst 13:655–667


26. Sarwate DV, Shanbhag NR (2001) High-speed architectures for Reed-Solomon decoders. IEEE Trans Very Large Scale Integr (VLSI) Syst 9:641–665
27. Fu B, Ampadu P (2009) On hamming product codes with type-II hybrid ARQ for on-chip interconnects. IEEE Trans on Circuits and Syst Part I: Regular Papers 56:2042–2054
28. Lehtonen T, Wolpert D, Liljeberg P, Plosila J, Ampadu P (2010) Self-adaptive system for addressing permanent errors in on-chip interconnects. IEEE Trans Very Large Scale Integr (VLSI) Syst 18:527–540
29. Shamshiri S, Cheng K-T (2009) Yield and cost analysis of a reliable NoC. in Proc 27th IEEE VLSI Test Symposium 173–178
30. Lehtonen T, Liljeberg P, Plosila J (2007) Online reconfigurable self-timed links for fault tolerant NoC. VLSI Design 2007:1–13
31. Hernández C, Silla F, Santonja V, Duato J (2008) Dealing with variability in NoC links. in Proc Diagnostic Service in Network-on-Chip (DSNOC'08) 4–10
32. Dumitras T, Kerner S, Marculescu R (2003) Towards on-chip fault-tolerant communication. in Proc of Asia and South Pacific Design Automation Conference (ASP-DAC'03) 225–232
33. Haas ZJ, Halpern JY, Li L (2006) Gossip-based Ad Hoc Routing. IEEE/ACM Trans on Networking (TON) 14:476–491
34. Pirretti M et al (2004) Fault tolerant algorithms for network-on-chip interconnect. in Proc IEEE Computer Society Annual Symp on VLSI Emerging Trends in VLSI Syst Design (ISVLSI'04) 46–51
35. Patooghy A, Miremadi SG (2008) LTR: a low-overhead and reliable routing algorithm for network on chips. in Proc Intl SoC Design Conf I-129–I-133
36. Bobda C et al (2005) DyNoC: A dynamic infrastructure for communication in dynamically reconfigurable devices. in Proc Intl Conf on Field Programmable Logic and Applications 153–158
37. Zhang Z, Greiner A, Taktak S (2008) A reconfigurable routing algorithm for a fault-tolerant 2D-mesh network-on-chip. in Proc IEEE Design Automation Conf (DAC'08) 441–446
38. Koibuchi M, Matsutani H, Amano H, Pinkston TM (2008) A lightweight fault-tolerant mechanism for network-on-chip. in Proc 2nd ACM/IEEE Intl Symp NOCS 13–22
39. Kohler A, Radetzki M (2008) Fault-tolerant architecture and deflection routing for degradable NoC switches. in Proc 3rd ACM/IEEE Intl Symp NOCS 22–31
40. Glass CJ, Ni LM (1992) The turn model for adaptive routing. in Proc Intl Symp Computer Architecture 278–287
41. Chiu G-M (2000) The odd-even turn model for adaptive routing. IEEE Trans on Parallel and Distributed Syst 11:729–738
42. Li M, Zeng QA, Jone WB (2006) DyXY-A proximity congestion-aware deadlock-free dynamic routing method for network-on-chip. in Proc DAC 2006 849–852
43. Hosseini A, Ragheb T, Massoud Y (2008) A fault-aware dynamic routing algorithm for on-chip networks. in Proc IEEE Intl Symp Circuits and Syst (ISCAS'08) 2653–2656
44. Aliabadi MR, Khademzadeh A, Raiya A M (2008) Dynamic intermediate node algorithm (DINA): a novel fault tolerance routing methodology for NoCs. in Proc Int Symp on Telecommunication 521–526
45. Schonwald T, Zimmermann J, Bringmann O, Rosenstiel W (2007) Fully adaptive fault-tolerant routing algorithm for network-on-chip architectures. in Proc Euromicro Conf on Digital System Design Architecture 527–534
46. Zhou J, Lau FCM (2001) Adaptive fault-tolerant wormhole routing in 2D meshes. in Proc 15th Intl Parallel and Distributed Processing Symp 1–8
47. Boppana RV, Chalasani S (1995) Fault-tolerant wormhole routing algorithms for mesh networks. IEEE Trans on Computers 44:848–864
48. Chen K-H, Chiu G-M (1998) Fault-tolerant routing algorithm for meshes without using virtual channels. Information Science and Engineering 14:765–783


49. Park D, Nicopoulos C, Kim J, Vijaykrishnan N, Das CR (2006) Exploring fault-tolerant Network-on-Chip architectures. in Proc Intl Conf on Dependable Syst and Networks (DSN'06) 93–104
50. Duato J (1997) A theory of fault-tolerant routing in wormhole networks. IEEE Trans on Parallel and Distributed Syst 8:790–802
51. Flich J, Mejia A, Lopez P, Duato J (2007) Region-based routing: An efficient routing mechanism to tackle unreliable hardware in network on chips. in Proc 1st ACM/IEEE Intl Symp NOCS 183–194
52. Rodrigo S, Flich J, Roca A, Medardoni S, Bertozzi D, Camacho J, Silla F, Duato J (2011) Cost-efficient on-chip routing implementations for CMP and MPSoC systems. IEEE Trans Computer-Aided Design of Integr Circuits and Systems 30:534–547

Chapter 3

Adaptive Error Control Coding at Datalink Layer

Reliable on-chip communication in a multi-core system-on-chip is one of the most important challenges [1–3]. Networks-on-chip (NoCs) have been proposed to facilitate on-chip communication [4–6]. Within this framework, many coding methods have been examined to handle transient errors in NoC links [6–14]. These works typically assume that the probability of error is very low and use simple error detection codes combined with retransmission to save energy [7, 8, 15]. Unfortunately, as technology scales deep into the nanometer regime, on-chip communication becomes more susceptible to increased crosstalk, external radiation and spurious voltage spikes than before; thus, the number of erroneous bits per flit (flow control unit) is expected to increase [16–18]. As a result, more powerful codes are needed to provide improved error resilience against multi-bit errors. Many researchers have recently tackled multi-bit errors using more powerful codes [19–23]. These more powerful codes designed for the worst case typically result in increased hardware complexity, silicon area and energy costs. Different applications have diverse constraints on reliability, energy efficiency, throughput and area, making adaptive error control attractive for optimizing performance in varied scenarios.

In Ref. [19], Rossi et al. evaluate single-error correcting (SEC), single-error correcting double-error detecting (SEC-DED), and symbol-error correcting codes for end-to-end and switch-to-switch error control. They further propose managing different quality of service (QoS) needs by adjusting error detection, correction, and mixed modes at design time. The approach proposed in Ref. [22] uses configurable circuits to adapt to different fault types, particularly intermittent and permanent errors; a stop-and-wait ARQ retransmission policy is employed to recover from transient errors. A self-calibrating on-chip link has been discussed in Ref. [24] to achieve high performance and low power consumption by dynamically adjusting the operating frequency and voltage swing; error detection is combined with retransmission to ensure reliability. To reduce the energy wasted by error detection in favorable noise conditions, the error detection capability is adapted in Ref. [15] based on the number of errors found over a fixed time window by a victim line operating at half supply voltage. In Ref. [25], we propose the concept of adaptive error correction at runtime; in Ref. [26], we implement this adaptation using a hardware sharing technique for reduced area and energy overheads, and we evaluate the method using a simplified multi-bit multi-cycle error model. Unlike our earlier work [25] and [26], where the error detection is fixed and we adapt error correction, this work presents adaptation of both error detection and correction simultaneously to further improve reliability and performance.

Error detection combined with ARQ retransmission can improve reliability without incurring too much hardware cost; however, this approach impacts energy and throughput, especially in high noise environments, where frequent retransmissions may be required. Moreover, as multi-bit errors increase in nanoscale NoC links, so does the number of retransmissions needed to obtain an error-free flit. Forward error correction (FEC) achieves high throughput, but decoder errors and failures are difficult to manage in the presence of increased noise. FEC codecs designed for the worst-case noise scenario can guarantee a certain reliability and throughput, but energy is wasted when the link quality is not as poor as the worst-case condition. To address these issues, we propose an adaptive error control coding (ECC) scheme combined with hybrid automatic repeat request (HARQ) retransmission, which can adjust both error detection and correction according to the monitored link quality or the constraints on reliability and throughput for a given application. Unlike previous works [15, 20, 22, 25–27], our approach handles multi-bit errors caused by scaled technology by increasing the maximum error detection and correction capabilities. Using multiple sets of extended Hamming codes combined with block interleaving, we have successfully implemented a configurable M-error correction, 2M-error detection code (MEC-2MED) to achieve both improved error detection and correction simultaneously.

Q. Yu and P. Ampadu, Transient and Permanent Error Control for Networks-on-Chip, DOI 10.1007/978-1-4614-0962-5_3, © Springer Science+Business Media, LLC 2012
A flit is a flow control unit in packet switching, and the flit width is typically set to the maximum link width of the particular NoC [15, 28]. Flit-level switch-to-switch error control at the datalink layer is the focus of this chapter. Routing-related issues are addressed in several works [20, 29] and are outside the scope of this chapter.
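The reason block interleaving turns M SEC-DED groups into an M-error-correcting, 2M-error-detecting scheme for adjacent errors can be sketched by counting how many wires of a burst fall into each group. The wire mapping (bit b of group g placed on wire b·M + g) and the function names are assumptions for illustration; the interleaving pattern actually used in this chapter may differ.

```python
def interleave(groups):
    """Place bit b of group g on wire b * M + g, where M = number of groups."""
    m = len(groups)
    return [groups[i % m][i // m] for i in range(m * len(groups[0]))]

def errors_per_group(burst_start, burst_len, m):
    """Count how many wires of a contiguous burst fall into each code group."""
    counts = [0] * m
    for wire in range(burst_start, burst_start + burst_len):
        counts[wire % m] += 1
    return counts

M = 4
worst_m_burst = max(max(errors_per_group(s, M, M)) for s in range(16))
worst_2m_burst = max(max(errors_per_group(s, 2 * M, M)) for s in range(16))
```

Under this mapping a burst of M adjacent errors places at most one error in each group (correctable by SEC-DED), and a burst of 2M places at most two (detectable).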

3.1 Adaptive Algorithm

Adaptive error control can involve incremental redundancy [15, 30] or variable redundancy [19, 20, 30]. The incremental method increases the amount of redundant information in the flit when a retransmission is requested. In contrast, the variable method directly selects the least amount of redundancy needed for the current noise condition or QoS requirement. To achieve energy efficiency, high throughput and low area overhead, we use incremental redundancy for short-term reliable transmission and variable redundancy for long-term adaptations. As shown in Fig. 3.1a, a new flit j is encoded with the current long-term ECC scheme and delivered to the link. Meanwhile, a copy of flit j is stored in the sender for future retransmission. In the receiver, the received flit j′ is passed through error detection and any detected errors are corrected if possible. The presence of uncorrectable errors results in a request to increase the error resilience of the ECC scheme employed on flit j, so that flit j can be retransmitted with higher reliability. This temporary ECC improvement is terminated as soon as flit j is accepted by the receiver, as described by the pseudocode in Fig. 3.1b.

Incremental redundancy is employed to improve reliability in the short term, so that the energy cost of complex coding can be reduced. In contrast, variable redundancy is applied to long-term error control. As shown in Fig. 3.1a, any update of the ECC scheme lasts T cycles (T is a user-defined value). The upgrade/degrade request signal is produced by a link quality evaluation module, such as a noise monitor using a voltage-reduced victim line [15] or a counter of negative acknowledgements over a fixed observation interval [31]. The modification of the long-term ECC scheme can also be used to control throughput and traffic congestion in the NoC.

Fig. 3.1 Adaptive error control algorithm: (a) flowchart, (b) mode switching between short- and long-term ECC
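The interplay of short-term (incremental) and long-term (variable) adaptation described above can be sketched as a small state machine. This is only an illustration, not the pseudocode of Fig. 3.1b: the class name, the method names, the three-mode numbering and the default window length are assumptions introduced here, and the upgrade/degrade request is assumed to be sampled once every T cycles.

```python
from dataclasses import dataclass

MAX_MODE = 3  # assumed numbering: 1 = SEC-DED, 2 = DEC-4ED, 3 = 4EC-8ED


@dataclass
class AdaptiveEccController:
    """Tracks the long-term and short-term ECC modes of one link (illustrative)."""
    long_term: int = 1
    short_term: int = 1
    window: int = 1000          # T cycles between long-term updates (value assumed)
    cycles_in_window: int = 0

    def on_feedback(self, ack: bool) -> None:
        """Short-term adaptation driven by the ACK/NACK for the current flit."""
        if ack:
            # Flit accepted: drop the temporary boost, return to the long-term scheme.
            self.short_term = self.long_term
        else:
            # Uncorrectable error: retransmit with a stronger code, if one exists.
            self.short_term = min(self.short_term + 1, MAX_MODE)

    def on_cycle(self, request: int) -> None:
        """Long-term adaptation; request is +1 (upgrade), -1 (degrade) or 0."""
        self.cycles_in_window += 1
        if self.cycles_in_window < self.window:
            return
        self.cycles_in_window = 0
        boosted = self.short_term > self.long_term
        self.long_term = max(1, min(self.long_term + request, MAX_MODE))
        # Follow the long-term mode unless a stronger temporary boost is active.
        if not boosted or self.short_term < self.long_term:
            self.short_term = self.long_term
```

In this sketch a NACK only ever raises the short-term mode, and the first ACK returns it to the long-term mode, which mirrors the statement that the temporary improvement ends as soon as flit j is accepted.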

3.2 Architecture for Sender and Receiver

In the proposed adaptive algorithm, the HARQ retransmission policy facilitates balancing coding redundancy and retransmission latency. If an uncorrectable error is detected in the received flit, a retransmission with more parity check bits is requested to improve reliability. As technology scales, link switching power tends to exceed codec power [15, 26, 32]; thus the number of wires used for retransmission should be minimized to achieve energy efficiency. In our approach, only a new set of check bits with added redundancy is retransmitted, and the erroneous flit is saved in the receiver to await correction. The entire coded flit is retransmitted only when the short-term ECC reaches its highest capability.

The diagrams for the sender and receiver are shown in Fig. 3.2. The sender design has two features: (1) two encoding phases and (2) separate processing of flit and check bits. Phase I of the adaptive ECC encoder produces the check bits indicated by the current ECC scheme, as well as auxiliary signals for increasing redundancy. Both the check bits and the auxiliary signals are saved in a check bit buffer. In phase II, the check bits are either passed directly through the interleaver or stored in the sender for future use. The operation on the check bits depends on the control signals, i.e. the long- and short-term ECC selections. When the current short-term ECC selection is stronger than the long-term one, phase I is disabled to stop new flit transmissions until the two ECC selections are the same again. This two-phase encoding avoids duplicating the flit encoder in Fig. 3.2. Separating the flit processing and check bit computation also allows us to minimize the hardware cost of increased redundancy. We combine multiple groups of SEC-DED codes with linear block interleaving to obtain better error correction and detection. Unlike message interleaving, the interleaving pattern for check bits is adjustable, as the number of check bits varies for different ECC schemes. In the receiver, the adaptive ECC decoder is controlled only by the short-term ECC selection. To match the encoding scheme, the short-term ECC selection is also used to deinterleave the check bits appropriately and to fetch the flit from the flit buffer. Details of the multiple groups of SEC codes and configurable interleaving/deinterleaving are discussed in the next section.

Fig. 3.2 Architecture of (a) transmitter and (b) receiver

3.3 Configurable Error Detection and Correction

Several challenges exist in implementing the proposed adaptive error control method. For example, the error detection and correction codecs must be updated simultaneously, and using two separate codes to perform detection and correction would increase the number of check bits and the codec area. To minimize energy consumption, the error detection and correction codecs should overlap as much as possible.

3.3.1 Two-phase Configurable ECC Encoder

Extended Hamming codes have been widely used as SEC-DED codes. Inspired by this, we construct an M-error correcting, 2M-error detecting (MEC-2MED) code, using multiple sets of extended Hamming codes combined with linear block interleaving. Using a 64-bit flit as an example, we show below how the MEC-2MED code (for M = 1, 2, 4) can be created. The parity matrix $P_{config}$ for the MEC-2MED code is constructed from the parity matrix $P_B$ of a SEC Hamming (7, 4) code (i.e. HM (7, 4)). Equation (3.1) shows an example $P_B$ matrix.

$$
P_B = \begin{bmatrix} 1 & 1 & 0 \\ 1 & 1 & 1 \\ 1 & 0 & 1 \\ 0 & 1 & 1 \end{bmatrix} \tag{3.1}
$$

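Define an intermediate matrix $P_{4\_sub}$ below.

$$
P_{4\_sub} = \begin{bmatrix}
P_B & \vec{0}_{4\times1} & \vec{0}_{4\times1} \\
P_B & \vec{1}_{4\times1} & \vec{0}_{4\times1} \\
P_B & \vec{0}_{4\times1} & \vec{1}_{4\times1} \\
P_B & \vec{1}_{4\times1} & \vec{1}_{4\times1}
\end{bmatrix} \tag{3.2}
$$

where $\vec{0}_{4\times1}$ is a 4 × 1 zero vector and $\vec{1}_{4\times1}$ is a 4 × 1 ones vector. $P_{config}$ is one element of the set $\{[P_4, P_4, P_4, P_4], [P_2, P_2], [P_1]\}$, in which the parity matrices $P_4$ for M = 4, $P_2$ for M = 2, and $P_1$ for M = 1 are expressed in (3.3), (3.4), (3.5) below.

$$
P_4 = \begin{bmatrix} P_{4\_sub} & E_0^T \end{bmatrix} \tag{3.3}
$$

$$
P_2 = \begin{bmatrix}
P_{4\_sub} & \vec{0}_{16\times1} & E_0^T \\
P_{4\_sub} & \vec{1}_{16\times1} & E_1^T
\end{bmatrix} \tag{3.4}
$$

$$
P_1 = \begin{bmatrix}
P_{4\_sub} & \vec{0}_{16\times1} & \vec{0}_{16\times1} & E_0^T \\
P_{4\_sub} & \vec{1}_{16\times1} & \vec{0}_{16\times1} & E_1^T \\
P_{4\_sub} & \vec{0}_{16\times1} & \vec{1}_{16\times1} & E_1^T \\
P_{4\_sub} & \vec{1}_{16\times1} & \vec{1}_{16\times1} & E_0^T
\end{bmatrix} \tag{3.5}
$$

$$
E_0 = [\,1\ 0\ 1\ 1\ 0\ 1\ 0\ 0\ 0\ 1\ 0\ 0\ 1\ 0\ 1\ 1\,] \tag{3.6}
$$

$$
E_1 = [\,0\ 1\ 0\ 0\ 1\ 0\ 1\ 1\ 1\ 0\ 1\ 1\ 0\ 1\ 0\ 0\,] \tag{3.7}
$$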

Here, $\vec{0}_{16\times1}$ is a 16 × 1 zero vector, $\vec{1}_{16\times1}$ is a 16 × 1 ones vector, and $E_0$ and $E_1$ are the row vectors used to compute the extended parity check bits. Three error detection and correction modes are achievable for long-term ECC: (1) SEC-DED using EHM (72, 64) (shortened from an extended Hamming (128, 120)); (2) double-error correction, four-bit error detection (DEC-4ED) EHM (78, 64) using two groups of EHM (39, 32) (shortened from an extended Hamming (64, 57)); and (3) four-bit error correction, eight-bit error detection (4EC-8ED) EHM (88, 64) using four groups of EHM (22, 16) (shortened from an extended Hamming (32, 26)). Figure 3.3 shows a schematic of the proposed adaptive ECC encoder. In phase I, the parity check bits $P_{B,0\text{–}4}$ for HM (21, 16) (shortened from Hamming (31, 26)) and the auxiliary signals $t_0$, $t_1$, $t_2$ are produced. In phase II, those auxiliary signals are combined using XOR gates to produce the check bits $p_{1,0\text{–}7}$, $p_{2,0\text{–}13}$, and $p_{4,0\text{–}23}$ for the configurable codes SEC-DED, DEC-4ED and 4EC-8ED, respectively. Since $P_{B,0\text{–}4}$ and $t_{0\text{–}2}$ are the only information needed to compute all check bits for the three modes, saving $P_{B,0\text{–}4}$ and $t_{0\text{–}2}$ is enough to allow future ECC upgrading.
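The block construction of (3.1) to (3.5) can be reproduced in a few lines to check the matrix dimensions against the three codes. This is a sketch under stated assumptions: the helper names are hypothetical, the matrices are plain Python lists, and the extended parity vector is derived here by assuming each row of [I | P | E] has even weight, which is the usual extended-Hamming convention rather than something spelled out in the text.

```python
P_B = [[1, 1, 0],
       [1, 1, 1],
       [1, 0, 1],
       [0, 1, 1]]


def stack_blocks(block, extra_bits):
    """Stack copies of `block`, appending one fixed bit pattern to every row of each copy."""
    return [row + list(bits) for bits in extra_bits for row in block]


# Eq. (3.2): four copies of P_B extended by the patterns 00, 10, 01, 11 -> 16 x 5.
P4_SUB = stack_blocks(P_B, [(0, 0), (1, 0), (0, 1), (1, 1)])

# Extended parity, assuming even overall weight per row of [I | P | E].
E0 = [(1 + sum(row)) % 2 for row in P4_SUB]
E1 = [1 - e for e in E0]

# Eq. (3.3): 16 x 6 parity matrix of EHM(22, 16).
P4 = [row + [e] for row, e in zip(P4_SUB, E0)]

# Eq. (3.4): 32 x 7 parity matrix of EHM(39, 32).
P2 = ([row + [0, e] for row, e in zip(P4_SUB, E0)] +
      [row + [1, e] for row, e in zip(P4_SUB, E1)])

# Eq. (3.5): 64 x 8 parity matrix of EHM(72, 64).
P1 = [row + [a, b, e]
      for (a, b, ext) in [(0, 0, E0), (1, 0, E1), (0, 1, E1), (1, 1, E0)]
      for row, e in zip(P4_SUB, ext)]


def is_sec_ded(parity):
    """Distinct rows of odd weight >= 3 imply minimum distance 4 for H = [P^T | I]."""
    distinct = len({tuple(r) for r in parity}) == len(parity)
    return distinct and all(sum(r) % 2 == 1 and sum(r) >= 3 for r in parity)
```

The row counts (16, 32, 64) and column counts (6, 7, 8) match the message and check-bit widths of EHM (22, 16), EHM (39, 32) and EHM (72, 64), and every row has odd weight, which is what lets a double error be told apart from a single error.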

3.3.2 Configurable Interleaving

SEC codes correct single-bit errors. A set of M SEC codes can recover M-bit errors, as long as the M erroneous bits are distributed across the inputs of M different SEC decoders. As shown in Fig. 3.4a, the presence of two-bit adjacent errors (indicated by the shaded bits) on the received sub-codeword A results in an uncorrectable error for codeword C′, because SEC cannot correct two-bit errors.


Fig. 3.3 Schematic for adaptive ECC encoder

Fig. 3.4 Use of interleaving to handle adjacent multi-bit errors (a) without interleaving (b) with interleaving


Fig. 3.5 Interleaving circuits for adaptive ECC encoder

In contrast, the two-bit adjacent errors in the interleaved codeword Π(C) are spread out to different sub-codewords A and B after deinterleaving, as shown in Fig. 3.4b. Thus, each error bit can be corrected by a separate SEC decoder, and the merged flit, Flit′, is error free. Multiple SEC codecs combined with linear block interleaving can achieve multi-bit error correction with less hardware than other multiple error correction schemes, such as BCH and convolutional codes [33]. To reduce the number of multiplexers used in configuring the ECC, flit bits are interleaved using a fixed interleaving distance (the minimum distance between two bits from the same SEC codeword); in contrast, the check bits are interleaved with a variable interleaving distance, facilitating different configurations in the MEC-2MED codec. For our 64-bit example, the interleaving schematics for information and check bits are shown in Fig. 3.5.
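The effect illustrated in Fig. 3.4 can be reproduced with a toy linear block interleaver. The sketch below is illustrative only: the function names and the 6-bit sub-codewords are made up for the example, and it models the fixed-distance case where all sub-codewords have equal length, not the variable-distance check-bit interleaving.

```python
def interleave(subcodewords):
    """Linear block interleaving: emit bit 0 of every sub-codeword, then bit 1, and so on."""
    return [bit for column in zip(*subcodewords) for bit in column]


def deinterleave(stream, groups):
    """Inverse of interleave(): bit k of the stream belongs to sub-codeword k % groups."""
    return [stream[g::groups] for g in range(groups)]


def errors_per_group(sent, received):
    """Count the flipped bits in each sub-codeword."""
    return [sum(s != r for s, r in zip(a, b)) for a, b in zip(sent, received)]


A = [1, 0, 1, 1, 0, 0]
B = [0, 1, 1, 0, 1, 0]

# With interleaving: two adjacent link errors land in different sub-codewords.
link = interleave([A, B])
link[4] ^= 1
link[5] ^= 1
with_interleaving = errors_per_group([A, B], deinterleave(link, 2))      # [1, 1]

# Without interleaving: the same two adjacent errors hit one sub-codeword.
flat = A + B
flat[4] ^= 1
flat[5] ^= 1
without_interleaving = errors_per_group([A, B], [flat[:6], flat[6:]])    # [2, 0]
```

With one error per sub-codeword, each SEC decoder can correct its own bit; with both errors in sub-codeword A, a SEC code can only flag the flit as uncorrectable.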

3.3.3 Configurable ECC Decoder

The implementation of the decoder is likewise based on a HM (21, 16) code. The combined outputs from the four groups of HM (21, 16) syndrome computation modules (indicated by HM (21, 16) syn. cpt. in Fig. 3.6a) are the syndromes for EHM (88, 64), Syn4, 0–23. By reconfiguring Syn4, 0–23, we can obtain Syn2, 0–13 for EHM (78, 64), and Syn1, 0–7 for EHM (72, 64), as shown in Fig. 3.6. To share the error correction circuit, the syndromes for three modes can be reconfigured again to yield a general syndrome. Next, the 64-bit message is corrected after error pattern comparison. If the error cannot be corrected, a NACK is fed back to the sender.
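As a software analogue of one decoder group, the sketch below encodes and decodes a single EHM (22, 16) sub-codeword by syndrome comparison. It is not the shared hardware datapath of Fig. 3.6: the function names and status strings are hypothetical, and the parity matrix is rebuilt from (3.1) to (3.3) assuming even overall parity for the extended bit.

```python
P_B = [[1, 1, 0], [1, 1, 1], [1, 0, 1], [0, 1, 1]]
P4_SUB = [row + [a, b] for (a, b) in [(0, 0), (1, 0), (0, 1), (1, 1)] for row in P_B]
P4 = [row + [(1 + sum(row)) % 2] for row in P4_SUB]   # 16 x 6 parity matrix


def check_bits(message):
    """XOR together the rows of P4 selected by the 1-bits of the 16-bit message."""
    bits = [0] * 6
    for m, row in zip(message, P4):
        if m:
            bits = [b ^ r for b, r in zip(bits, row)]
    return bits


def encode(message):
    """Systematic codeword: 16 message bits followed by 6 check bits."""
    return list(message) + check_bits(message)


def decode(received):
    """Return (message, status) with status 'ok', 'corrected' or 'nack'."""
    message, checks = list(received[:16]), list(received[16:])
    syndrome = [c ^ r for c, r in zip(check_bits(message), checks)]
    if not any(syndrome):
        return message, "ok"
    if syndrome in P4:                      # single error in a message bit
        message[P4.index(syndrome)] ^= 1
        return message, "corrected"
    if sum(syndrome) == 1:                  # single error in a check bit
        return message, "corrected"
    return message, "nack"                  # detected but uncorrectable
```

Any double error produces an even-weight, non-zero syndrome, which matches neither a row of P4 nor a unit vector, so the decoder reports `"nack"`; this corresponds to the NACK fed back to the sender when the error cannot be corrected.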


Fig. 3.6 Schematic for adaptive ECC decoder: (a) syndrome computation, (b) error correction

3.4 Evaluation of Adaptive ECC

3.4.1 Error Models

The Gaussian pulse function (3.8) is widely accepted as a model for the bit error rate ε:

$$
\varepsilon = Q\left(\frac{V_{dd}}{2\sigma_N}\right) = \int_{V_{dd}/(2\sigma_N)}^{\infty} \frac{1}{\sqrt{2\pi}}\, e^{-y^2/2}\, dy \tag{3.8}
$$

The noise voltage is assumed to follow a normal distribution with standard deviation $\sigma_N$; $V_{dd}$ is the supply voltage. In previous works [7, 13, 15, 24], errors are typically assumed to be statistically independent. As a result, the flit error rate for a K-bit flit is modeled by (3.9) below.

$$
\text{Flit error rate} = 1 - (1 - \varepsilon)^K \approx K\varepsilon \tag{3.9}
$$
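Equations (3.8) and (3.9) are easy to evaluate numerically, which helps when relating a noise deviation to a raw bit error rate. The following is a minimal sketch using only the standard library; the function names are chosen here for illustration.

```python
import math


def q_function(x):
    """Gaussian tail probability Q(x) = 0.5 * erfc(x / sqrt(2))."""
    return 0.5 * math.erfc(x / math.sqrt(2.0))


def bit_error_rate(vdd, sigma_n):
    """Eq. (3.8): epsilon = Q(Vdd / (2 * sigma_N))."""
    return q_function(vdd / (2.0 * sigma_n))


def flit_error_rate(eps, k=64):
    """Eq. (3.9), exact form 1 - (1 - eps)^K, computed stably for very small eps."""
    return -math.expm1(k * math.log1p(-eps))
```

For example, `bit_error_rate(1.0, 0.1)` evaluates Q(5), roughly 2.9e-7, and for such small ε the exact flit error rate is very close to the approximation Kε.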

Unfortunately, as technology scales, interconnect links become more susceptible to coupling noise, which is expected to increase. To extend (3.9) to nanoscale systems, we assume that a fault event on a wire has probability β, termed the adjacent coupling coefficient, of affecting a neighboring wire. For ease of analysis of multi-bit errors, the probability that a flit contains two-bit or three-bit adjacent errors is modeled below.

$$
p(2) = K\,\varepsilon\,(1 - \varepsilon)^{K-1} \cdot 2\beta(1 - \beta) \tag{3.10}
$$

$$
p(3) = K\,\varepsilon\,(1 - \varepsilon)^{K-1} \cdot \beta^2 \tag{3.11}
$$
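The quasi-independent model of (3.10) and (3.11) can be evaluated directly. The sketch below uses a hypothetical function name; the comments give one reading of the terms (a single fault event that spreads to one or both neighbours), which is an interpretation rather than a statement from the text.

```python
def adjacent_error_probs(eps, beta, k=64):
    """Eqs. (3.10)-(3.11): probability of a 2-bit or 3-bit adjacent error in a K-bit flit."""
    single_fault = k * eps * (1.0 - eps) ** (k - 1)   # exactly one fault event in the flit
    p2 = single_fault * 2.0 * beta * (1.0 - beta)     # spreads to exactly one of two neighbours
    p3 = single_fault * beta ** 2                     # spreads to both neighbours
    return p2, p3
```

The ratio p(2)/p(3) is 2(1 − β)/β regardless of ε, so with β = 0.1 a two-bit adjacent error is 18 times more likely than a three-bit one.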

In Ref. [20], Zimmer and Jantsch propose a more complex dependent error model, which considers both spatial and temporal correlations. Compared to the independent and quasi-independent error models, the dependent model in Ref. [20] is more likely to produce multi-bit and multi-cycle errors, meaning that one fault event may affect multiple wires for multiple cycles. Considering one fault type, we simplify their model below.

$$
\text{Error Matrix} = \varepsilon \cdot P = \varepsilon \cdot \begin{bmatrix}
p(1,1) & \cdots & p(1,t_{max}) \\
p(2,1) & \cdots & p(2,t_{max}) \\
\vdots & \ddots & \vdots \\
p(\omega_{max},1) & \cdots & p(\omega_{max},t_{max})
\end{bmatrix} \tag{3.12}
$$

where ε is the probability of a fault event occurring on one wire, $\omega_{max}$ is the maximum bus width, and $t_{max}$ is the maximum number of transfer cycles during which the error propagates. The element p(ω, t) of the matrix P is the probability of a fault event affecting ω wires and retaining the error for t cycles. Using this model, we investigate the probabilities of 2-bit, 4-bit and 6-bit errors. As shown in Fig. 3.7, the probability of 6-bit errors generated by the dependent model is about seven orders of magnitude higher than that generated by the quasi-independent model (β = 0.1), because the multi-bit errors here are caused both by one fault event affecting multiple wires and by error propagation from previous cycles. The latter effect typically results in non-adjacent multi-bit errors. The independent, quasi-independent and dependent error models are used to evaluate different ECC schemes in this chapter.

Fig. 3.7 Probability of errors using different error models

3.4.2 Experimental Setup

To evaluate the proposed method, we compare its energy, throughput, and area with those of three other methods: (1) an adaptive error detection method [15], (2) ARQ combined with Cyclic Redundancy Check (CRC) error detection [7], and (3) a fixed ECC scheme [20]. The following experiments and analyses assume that the flit width is 64 bits, and that four-bit error correction and eight-bit error detection capabilities are required to protect against the worst-case noise scenario. To manage errors in different noise conditions, the adaptive error detection method provides three error detection modes: (1) ARQ-parity check (PAR), (2) ARQ-HM (71, 64) and (3) ARQ-EHM (72, 64), for one-bit, two-bit and three-bit error detection, respectively. In the low noise condition, mode 1 (ARQ-PAR) is selected; in the moderate noise condition, mode 2 (ARQ-HM (71, 64)) is enabled; if the noise increases further, the switch enters mode 3 (ARQ-EHM (72, 64)). ARQ-CRC4 with generator polynomial $1 + x + x^4$ is used to detect adjacent multi-bit errors in the coded flit. The fixed ECC scheme using EHM (88, 64) comprises four groups of EHM (22, 16) with block interleaving. In the proposed adaptive ECC scheme, a configurable codec providing EHM (72, 64), EHM (78, 64) or EHM (88, 64) is combined with configurable interleaving to correct four-bit errors and detect eight-bit adjacent errors. All the HM or EHM codecs having the same information and codeword bit widths use the same parity matrix.

The strategy for codec selection of the proposed method is shown in Table 3.1. For a switch starting from mode 1 in the low noise region, only the check bits of the EHM (78, 64) (i.e. 2 EHM (39, 32)), rather than the entire codeword, are retransmitted when an uncorrectable error is detected. If the error remains, the check bits of the EHM (88, 64) (i.e. 4 EHM (22, 16)) are sent in the second retransmission. If the flit cannot be successfully delivered by the two retransmissions, the entire codeword of EHM (88, 64) has to be sent to the receiver. When operating in an increased noise condition, the switch enters mode 2 and works in a similar way.

Table 3.1 Codec and transmitted information for different error control modes in our proposed scheme

| Long-term mode | First transmission | First retransmission | Second retransmission | Third retransmission |
|---|---|---|---|---|
| Mode 1 | Entire codeword of EHM (72, 64) | Check bits of 2 EHM (39, 32) | Check bits of 4 EHM (22, 16) | Entire codeword of 4 EHM (22, 16) |
| Mode 2 | Entire codeword of 2 EHM (39, 32) | Check bits of 4 EHM (22, 16) | Entire codeword of 4 EHM (22, 16) | Entire codeword of 4 EHM (22, 16) |
| Mode 3 | Entire codeword of 4 EHM (22, 16) | Entire codeword of 4 EHM (22, 16) | Entire codeword of 4 EHM (22, 16) | Entire codeword of 4 EHM (22, 16) |

The adaptive and fixed ECC schemes were implemented in Verilog HDL and synthesized by Synopsys Design Vision using a TSMC 65 nm typical library. The area for each ECC scheme was reported by Synopsys Design Vision. Codec and link switching energy were further simulated in the Cadence Spectre environment at 1 GHz, with a 65 nm Predictive Technology Model (PTM) CMOS [34]. A lumped RC model was used to model the global wire. The global link parameters used in the simulation are shown in Table 3.2.

Table 3.2 Parameters of the global links used in energy simulation

| Technology | Width (μm) | Space (μm) | Thickness (μm) | Height (μm) | Dielectric constant | Resistance (Ohm/mm) | Total capacitance (fF/mm) |
|---|---|---|---|---|---|---|---|
| 65 nm | 0.45 | 0.45 | 1.20 | 0.20 | 2.2 | 61.1 | 228.5 |
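For the ARQ-CRC4 baseline, the generator polynomial $1 + x + x^4$ mentioned above fully determines the check computation. The following bit-serial sketch shows how a 64-bit flit could be protected and checked; the function names and the bit ordering (most significant bit first, remainder appended after the message) are assumptions made for the example, not details taken from Ref. [7].

```python
CRC4_POLY = 0b10011   # x^4 + x + 1


def _remainder(bits):
    """Remainder of the bit sequence, read as a polynomial, modulo x^4 + x + 1."""
    reg = 0
    for b in bits:
        reg = (reg << 1) | b
        if reg & 0b10000:
            reg ^= CRC4_POLY
    return reg


def crc4(bits):
    """CRC of a message: remainder of message * x^4, as a 4-bit integer."""
    return _remainder(list(bits) + [0, 0, 0, 0])


def crc4_encode(bits):
    """Append the four CRC bits (most significant first) to the message."""
    r = crc4(bits)
    return list(bits) + [(r >> i) & 1 for i in (3, 2, 1, 0)]


def crc4_ok(codeword):
    """A received codeword is accepted when its remainder is zero."""
    return _remainder(codeword) == 0
```

Because the generator has a non-zero constant term and degree 4, every burst of length 4 or less leaves a non-zero remainder. Two flipped bits exactly 15 positions apart, however, go undetected, since $x^4 + x + 1$ divides $x^{15} + 1$; this is one reason the CRC4 baseline is positioned for adjacent errors.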

3.4.3 Reliability

The probability of detection failure (i.e. failure to detect the presence of errors in the received codeword) is used to evaluate the error detection capability of a given codec; this metric also plays an important role in determining the overall reliability of a given interconnect link system. The probabilities of detection failure for CRC4, EHM (72, 64) (the code delivering the best error resilience in the adaptive error detection method [15]), and the proposed MEC-2MED code are compared using independent and quasi-independent error models, in which we vary the adjacent coupling coefficient β and the noise voltage deviation $\sigma_N$ at a supply voltage of 1 V. Extended Hamming codes detect up to 3-bit errors; CRC4 can detect 1-bit errors and burst errors with burst length no greater than 4; our MEC-2MED codes correct M-bit errors and detect 2M-bit errors (M up to 4), as long as the erroneous bits do not belong to the same sub-codeword. To thoroughly evaluate link reliability, we examine the error detection and correction capabilities using all possible error patterns for 1-bit to 8-bit errors. As we will describe in Sect. 3.5, we also use realistic benchmarks to generate traffic patterns in Monte Carlo simulations for relevant evaluation. As shown in Fig. 3.8, the probabilities of detection failure of EHM (72, 64) and CRC4 increase dramatically as β increases, because a higher β yields a larger probability of multi-bit errors on the link. On the other hand, the proposed MEC-2MED code employs interleaving to spread the adjacent errors to different SEC groups, so that it can effectively detect adjacent multi-bit errors with increasing β.

Fig. 3.8 Probability of detection failure for different error control codes: (a) independent error, (b) quasi-independent error, β = 0.001, (c) β = 0.1

Reliability of on-chip communication in an NoC can be measured by the residual flit error rate, which is defined as the ratio of the number of erroneous flits to the total number of flits accepted by the receiver. For an ARQ scheme using a fixed error detection codec, the residual flit error rate includes undetectable errors in the first transmission, as well as those in subsequent retransmissions after an error is detected. If the error lasts multiple cycles, several retransmissions are typically needed. Consequently, the residual flit error rate (RFER) is the accumulated probability of the undetectable errors, as shown below [33].

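$$
\begin{aligned}
RFER_{fixed\_ARQ} &= \sum_{m=1}^{\infty} \left( \text{Prob. of transferring a flit } m \text{ times} \times \text{Prob. of current undetectable errors} \right) \\
&= 1 \cdot P_e + P_d \cdot P_e + P_d^{2} \cdot P_e + P_d^{3} \cdot P_e + \cdots
\end{aligned} \tag{3.13}
$$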

where $P_e$ represents the probability of an undetectable error in the received flit, and $P_d$ is the probability that the error is detectable. If the flit is successfully transferred after the first transmission, the residual flit error rate is $P_e$; otherwise, the contribution of the next transmission is the product of $P_d$ and $P_e$. In general, a flit that requires m retransmissions is transferred with probability $(P_d)^m$. Unlike ARQ, HARQ does not request retransmission unless the detected error cannot be corrected by the decoder. As a result, we replace the probability of an undetectable error $P_e$ in (3.13) with the sum of $P_e$ and the probability of correction failure $P_{ec}$, and change $P_d$ into the probability of requesting retransmission $P_r$. Thus, we obtain the residual flit error rate of HARQ below.

$$
RFER_{fixed\_HARQ} = (P_e + P_{ec}) + P_r (P_e + P_{ec}) + P_r^{2} (P_e + P_{ec}) + \cdots \tag{3.14}
$$
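Both (3.13) and (3.14) are geometric series, so they can be summed in closed form. The step below is a worked derivation added for convenience; it assumes that $P_d < 1$ and $P_r < 1$ and that the probabilities stay constant from one retransmission to the next.

```latex
\begin{aligned}
RFER_{fixed\_ARQ}  &= P_e \sum_{m=0}^{\infty} P_d^{\,m} = \frac{P_e}{1 - P_d}, \\
RFER_{fixed\_HARQ} &= (P_e + P_{ec}) \sum_{m=0}^{\infty} P_r^{\,m} = \frac{P_e + P_{ec}}{1 - P_r}.
\end{aligned}
```

When retransmissions are rare ($P_d \ll 1$ or $P_r \ll 1$), both expressions are dominated by their first term, i.e. by the errors that slip through the first transmission.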

For the proposed adaptive error control scheme, $P_e$, $P_{ec}$, and $P_r$ vary in different modes (refer to Table 3.1), because different error control schemes are employed in different retransmission modes. Suppose the on-chip communication system uses error control mode i after initialization, and the system enters a new error control mode, i + 1, after retransmission is requested. We simply replace the fixed $P_e$, $P_{ec}$ and $P_r$ used in (3.14) with $P_{e,i}$, $P_{ec,i}$ and $P_{r,i}$. Thus, the residual flit error rate for the adaptive HARQ-ECC is given by

$$
RFER_{adaptive\_HARQ} = P_{e,i} + P_{ec,i} + P_{r,i} \left( P_{e,i+1} + P_{ec,i+1} + P_{r,i+1} \left( P_{e,i+2} + P_{ec,i+2} + \cdots \right) \right) \tag{3.15}
$$

in which $P_{e,i}$, $P_{ec,i}$ and $P_{r,i}$ are the probabilities of undetectable error, correction failure, and retransmission, respectively, for the initialized mode. If further retransmission is requested, these probabilities are adjusted by the more powerful ECC in the new mode. As an example, consider that a switch enters mode 1 (HARQ-EHM (72, 64)) after initialization. After an error is detected, the check bits of that flit coded with 2 EHM (39, 32) are transmitted in the retransmission cycle.


If the flit now contains no errors, the residual flit error rate is $P_{e,1} + P_{ec,1} + P_{r,1}(P_{e,2} + P_{ec,2})$. Here, $P_{e,1}$ and $P_{ec,1}$ are the probabilities of undetectable errors and correction failure of EHM (72, 64), respectively; $P_{r,1}$ is the probability of retransmission; $P_{e,2}$ and $P_{ec,2}$ are the probabilities of undetectable errors and correction failure of 2 EHM (39, 32), respectively.

The residual flit error rates for different error control schemes are compared in Fig. 3.9. In the adaptive error detection method presented in Ref. [15], ARQ-PAR (mode 1), ARQ-HM (71, 64) (mode 2) and ARQ-EHM (72, 64) (mode 3) are used in three error detection modes. In our proposed adaptive error detection and correction method, HARQ-1 EHM (72, 64) (mode 1), HARQ-2 EHM (39, 32) (mode 2) and HARQ-4 EHM (22, 16) (mode 3) are used in three error detection and correction modes. We assume that the targeted residual flit error rate is $10^{-16}$, which is equivalent to a mean time to failure (MTTF) of one year for a system operating at 1 GHz. As shown in Fig. 3.9a, ARQ-CRC4 only achieves the target for a raw bit error rate ε less than $9 \times 10^{-10}$. The highest mode in Ref. [15], ARQ-EHM (72, 64), can only achieve the targeted error resilience when the raw bit error rate is less than $2.5 \times 10^{-7}$. In contrast, the proposed method can meet the reliability requirement as long as the raw bit error rate is less than $1.5 \times 10^{-6}$, which is three orders of magnitude higher than that of ARQ-CRC4 and five times that of ARQ-EHM (72, 64). Figure 3.9b shows the same experiment with an increased coupling coefficient β. As shown in Fig. 3.9b, the residual flit error rate of ARQ-CRC4 increases only slightly, because the CRC code can detect most adjacent multi-bit errors. The reliability of ARQ-EHM (72, 64), however, decreases dramatically. Still, our proposed method has more than an order of magnitude higher reliability than ARQ-CRC4 and ARQ-EHM (72, 64). As can be seen, ARQ-PAR (mode 1 in Ref. [15]) cannot meet the reliability requirement; the residual flit error rates obtained by ARQ-HM (71, 64) (mode 2) and ARQ-EHM (72, 64) (mode 3) differ by a factor of only 1.6 for β = 0.1. In contrast, the maximum differences in the residual flit error rate of the proposed scheme between modes 2 and 3 are three and four orders of magnitude for β = 0.001 and β = 0.1, respectively. At a low bit error rate (~$10^{-10}$) and β = 0.001, the best reliabilities achieved by the proposed method are ten and five times those of ARQ-CRC4 and ARQ-EHM (72, 64) (mode 3 in Ref. [15]), respectively; for β = 0.1, the reliability benefits are three and four times, respectively.
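The nested form of (3.15) lends itself to a short backward evaluation. The sketch below is illustrative: the function name and the tuple layout are introduced here, and the last mode is assumed to repeat for all further retransmissions, so its tail is the geometric series of (3.14).

```python
def rfer_adaptive(modes):
    """Evaluate Eq. (3.15) for a list of (Pe, Pec, Pr) tuples, one per successive mode.

    The last tuple is assumed to repeat indefinitely, so its contribution is the
    closed form of Eq. (3.14): (Pe + Pec) / (1 - Pr).
    """
    pe, pec, pr = modes[-1]
    total = (pe + pec) / (1.0 - pr)
    # Work outward from the innermost bracket of Eq. (3.15).
    for pe, pec, pr in reversed(modes[:-1]):
        total = pe + pec + pr * total
    return total
```

With a single mode the function reduces to the fixed HARQ case, and with two modes and a negligible retransmission probability in the second one it approaches the expression $P_{e,1} + P_{ec,1} + P_{r,1}(P_{e,2} + P_{ec,2})$ from the mode 1 example.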

3.4.4 Average Throughput

Go-back-N retransmission policy can be employed in NoC switch-to-switch links to achieve high average throughput [35, 36] because the transmitter continues transferring new flits while waiting for the acknowledgement of the first transmitted flit.


[Fig. 3.9 plots: residual flit error rate versus bit error rate ε on logarithmic axes for ARQ-PAR, ARQ-HM(71,64), ARQ-HM(72,64), ARQ-CRC4 and proposed modes 1–3, with the MTTF = 1 year target marked. Annotated thresholds: ε = 9 × 10^-10, 2.5 × 10^-7 and 1.5 × 10^-6 in panel (a); ε = 8 × 10^-10, 2.5 × 10^-9 and 1 × 10^-7 in panel (b).]

Fig. 3.9 Residual flit error rates of different error control schemes: (a) β = 0.001, (b) β = 0.1


Suppose the round-trip delay for link propagation and codec computation is $N_R$. Then a switch using a fixed error detection code combined with ARQ retransmission yields an average latency $L_{avg\_fixed}$ given by [20]

$$
\begin{aligned}
L_{avg\_fixed} &= \sum_{m=1}^{\infty} \left( \big((m-1) N_R + 1\big) \times \text{Prob. of transferring a flit } m \text{ times} \times \text{Prob. of receiving an error-free flit} \right) \\
&= 1 \cdot P_c + (N_R + 1)(1 - P_c) P_c + (2N_R + 1)(1 - P_c)^{2} P_c + \cdots
\end{aligned} \tag{3.16}
$$

in which $P_c$ is the probability of receiving an error-free flit, which varies with the standard deviation of the noise. If a flit is successfully transferred in the first transmission, no extra latency is needed (i.e. latency = $1 \cdot P_c$); otherwise, an $N_R$-cycle latency is introduced by the retransmission. The probability of the first retransmission for that flit is $(1 - P_c)$, and the probability of the mth retransmission is $(1 - P_c)^m$. In contrast, for a switch using adaptive error detection and correction codecs, $P_c$ varies with both the noise deviation and the dynamically selected error detection/correction codes; $P_c$ is thus $(1 - P_r)$. As a result, the average latency $L_{avg\_adaptive}$ is expressed as

$$
L_{avg\_adaptive} = 1 \cdot (1 - P_{r,i}) + (N_R + 1)\, P_{r,i} (1 - P_{r,i+1}) + (2N_R + 1)\, P_{r,i} P_{r,i+1} (1 - P_{r,i+2}) + \cdots \tag{3.17}
$$

in which $P_{r,i}$ is the probability of retransmission for the baseline ECC scheme i. As the number of retransmissions increases, more powerful error detection and correction codecs are used; thus, $P_{r,i}$ is replaced by $P_{r,i+1}$ and so on. The average throughput, i.e. the ratio of the number of flits accepted by the receiver to the total number of received flits in a unit time, is the reciprocal of the average latency given above. A comparison of average throughputs for different error control schemes is shown in Fig. 3.10. Using mode 1, the proposed method improves the average throughput by about 30% compared to ARQ-EHM (72, 64) (mode 3 in the adaptive error detection method of [15]) and ARQ-CRC4, as shown in Fig. 3.10a. Because of its more powerful error detection and correction capability, our proposed method achieves 44% more average throughput than the other methods for $N_R$ = 3 and β = 0.1. As the round-trip delay $N_R$ increases, the average throughputs of the other schemes degrade because of increasing retransmission latency. However, the throughput achieved by our proposed method is maintained. As shown in Fig. 3.10b, the proposed method using mode 3 achieves 75% more average throughput than ARQ-EHM (72, 64) and ARQ-CRC4.
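The latency series (3.16) and (3.17) can be evaluated without truncation. Summing (3.16) gives $1 + N_R(1 - P_c)/P_c$, and the adaptive case can reuse that closed form for its tail. The sketch below is illustrative: the function names are made up here, and the last retransmission probability in the list is assumed to repeat for all further attempts.

```python
def avg_latency_fixed(pc, n_r):
    """Closed form of Eq. (3.16): 1 + N_R * (1 - Pc) / Pc."""
    return 1.0 + n_r * (1.0 - pc) / pc


def avg_latency_adaptive(pr_by_mode, n_r):
    """Eq. (3.17); the last retransmission probability is assumed to repeat."""
    *head, pr_last = pr_by_mode
    reach = 1.0        # probability that this transmission attempt is needed
    latency = 0.0
    for m, pr in enumerate(head):
        latency += (m * n_r + 1) * reach * (1.0 - pr)
        reach *= pr
    # All remaining attempts use pr_last: a shifted copy of the fixed-code series.
    latency += reach * (len(head) * n_r + avg_latency_fixed(1.0 - pr_last, n_r))
    return latency


def avg_throughput(latency):
    """Average throughput is the reciprocal of the average latency."""
    return 1.0 / latency
```

The closed form makes the dependence on the round-trip delay explicit: the latency penalty grows linearly with $N_R$ and with the odds of retransmission $(1 - P_c)/P_c$, which is why schemes with frequent retransmissions degrade as $N_R$ increases.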


Fig. 3.10 Average throughput comparison of different error control schemes: (a) round trip delay N_R = 3, β = 0.1, (b) round trip delay N_R = 6, β = 0.1

3.4.5 Energy Efficiency

Energy efficiencies of our proposed adaptive ECC, fixed worst-case ECC and the adaptive error detection method in Ref. [15] are compared using the average energy per useful flit metric defined below:

$$
\text{Average energy per useful flit} = \frac{\text{Total energy}}{\text{The number of error-free flits}} = \frac{\text{Total energy}}{\text{Average throughput} \times (1 - \text{Residual flit error rate})} \tag{3.18}
$$

This metric describes the average energy needed to successfully transfer a flit. As shown in Fig. 3.11a, the energy per useful flit increases with noise voltage deviation, because more retransmissions are needed. If the switch works in mode 1 or mode 2, our adaptive ECC uses fewer redundant wires than the fixed worst-case ECC scheme. Moreover, our method only transmits parity check bits if an uncorrectable error is detected, yielding less link switching energy than if the entire codeword were retransmitted. As a result, our method reduces the average energy per flit by up to 15% compared to the fixed ECC scheme. In mode 3, our adaptive ECC uses the same codec as the fixed worst-case ECC and consumes 2% more energy than the fixed ECC, because of the overhead for adaptation. If the noise voltage deviation $\sigma_N$ is below 0.144, ARQ-HM (71, 64) and ARQ-EHM (72, 64) (mode 2 and mode 3 in Ref. [15]) cost less energy than the adaptive ECC and worst-case ECC schemes. However, as the noise deviation increases to 0.184, the method in Ref. [15] consumes 34% more energy than our adaptive ECC scheme. The energy reduction achieved by our method increases slightly with the length of the NoC interconnect links, as shown in Fig. 3.11b.


Fig. 3.11 Average energy per useful flit of different error control schemes: (a) energy per useful flit versus different noise deviation (link length = 2 mm), (b) energy reduction with different link length (σ_N = 0.184)

Table 3.3 Codec area for different error control schemes

| ECC scheme | Encoder: equivalent NAND gates | Encoder: area (μm²) | Decoder: equivalent NAND gates | Decoder: area (μm²) | Total: equivalent NAND gates | Total: area (μm²) |
|---|---|---|---|---|---|---|
| Adaptive error detection [15] | 1,149 | 798 (1) | 1,251 | 869 (1) | 2,400 | 1,667 (1) |
| Fixed HARQ-4 EHM (22, 16) | 876 | 608 (0.76) | 2,157 | 1,498 (1.72) | 3,033 | 2,105 (1.26) |
| Proposed | 1,322 | 918 (1.15) | 2,542 | 1,765 (2.03) | 3,864 | 2,683 (1.60) |

Values in parentheses are areas normalized to the adaptive error detection method [15].

3.4.6 Area

The area and equivalent NAND gates of the adaptive error detection method proposed in Ref. [15], the fixed ECC [20], and the proposed adaptive error detection and correction method are compared in Table 3.3. Instead of building three different codecs, our adaptive ECC employs hardware sharing for the three ECC modes. Because our proposed codec corrects up to four-bit errors and detects more multiwire errors than the method in Ref. [15], the codec overhead of the proposed adaptive ECC is 60% higher than the codec in Ref. [15]. Yet, compared with an equivalent error resilient ECC scheme, HARQ-4 EHM (22, 16), our adaptive ECC only increases the codec area overhead by 27%, which is manageable for routers in NoC [36].

3.5 Simulation Using an H.264 Application

NoCs facilitate communication among IP cores in MPSoCs. Figure 3.12 shows a multi-media application implemented in an NoC-based MPSoC. The communication between an H.264 encoder and memory IP cores via routers (R) is used to evaluate the effectiveness of our proposed method. Here, we assume no network contention in the interested communication path. Traffic patterns affect the link switching energy. Using an H.264 encoder simulator JM 15.1 [37], we employ three standard video bitstream benchmarks – foreman, football, and mobile in the H.264 encoder to obtain realistic traffic patterns and link switching factors. The parameters for those three bitstreams are shown in Table 3.4. To compare with the average results shown in the previous sections, we set the flit width to 64 bits for the Monte Carlo simulation, with the time and the wire ID for error injection uniformly distributed. As shown in Fig. 3.13, the majority of the flits have a switching factor in the range of 0.4–0.6. The average switching factors for the foreman, football, and mobile video

3.5 Simulation Using an H.264 Application

57

Fig. 3.12 Evaluation using a realistic application. (a) An NoC-based MPSoC, (b) error injection model Table 3.4 Details for three tested video bitstreams Benchmark Foreman Format CIF 352 288 Total flits 1,222 Flit width (bit) 64 Average switching factor 0.502

Football SIF 352 240 3,138 64 0.499

Mobile SIF 352 240 6,078 64 0.501

bitstreams are 0.502, 0.499 and 0.501, respectively. As a result, we approximate the link switching factor to 0.5 in the analyses. To be consistent with the theoretical derivation, we inject errors in the Monte Carlo simulation by inverting the affected bits. As shown in Fig. 3.12b, the error vectors are provided by the error event generator, which is based on the quasiindependent error model with coupling coefficient b ¼ 0.1. Three noise conditions are modeled – low noise condition (sN ¼0.1), moderate noise (sN ¼0.15) and high noise (sN ¼0.18). We simulated each noise condition for ten million cycles. The round trip delay of retransmission is set to three. Table 3.5 illustrates the number of undetected errors and the number of retransmissions requested by different error control methods during the simulations. In the low noise region (sN ¼0.1), all three error control methods detect all injected errors. As the noise increases, the adaptive error detection proposed in Ref. [15] does not have sufficient error detection capability and requires significant retransmission to obtain error-free flits, when errors are detected. Fixed HARQ-4 EHM (22, 16) always uses four-bit error correction and eight-bit error detection capability, thus outperforming the adaptive error detection and our proposed methods in terms of the number of undetected errors


Fig. 3.13 Switching factors for three typical video bitstreams

Table 3.5 Comparison of undetected errors and retransmission for different methods

                 Adaptive error detection [15]   Fixed HARQ-4 EHM (22, 16)       Proposed
Noise condition  Undetected   Retransmission     Undetected   Retransmission     Undetected   Retransmission
                 error #      request #          error #      request #          error #      request #
σN = 0.1         0            3                  0            0                  0            0
σN = 0.15        9,865        53,573             0            678                8,080        1,132
σN = 0.18        55,311       454,814            0            25,125             0            25,125

and retransmission requests. However, the fixed error control scheme is designed for the worst-case noise scenario, wasting energy in favorable noise conditions. The proposed adaptive error detection and correction method dynamically balances reliability, throughput and energy consumption for variable noise environments. As shown in Fig. 3.14, the proposed method achieves a 45% throughput improvement over the adaptive error detection method in the high noise region (σN = 0.18), which is consistent with the results shown in Fig. 3.10a. We use energy per useful flit as a metric to simultaneously evaluate the reliability and energy consumption in the Monte Carlo simulations. As shown in Fig. 3.15b, the proposed adaptive error detection and correction method reduces the total energy


Fig. 3.14 Simulated average throughput for different error control schemes

Fig. 3.15 Energy comparison for different error control schemes: (a) noise profile, (b) total energy consumption for 1-mm link, and (c) total energy consumption for 2-mm link

for the given noise profile by 7% and 10% over the previous adaptive ECC and fixed HARQ-4 EHM (22, 16), respectively, for a 1-mm link. If the link length increases to 2 mm, our method reduces the total energy consumption for the given noise profile (shown in Fig. 3.15a) by up to 12% over the other methods, as shown in Fig. 3.15c.


If the noise has more variation, the proposed adaptive error detection and correction method can achieve more energy reduction; thus, it is suitable for a noise environment with a wide variation range.
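The two basic operations used in this simulation, measuring the link switching factor and injecting errors by bit inversion, can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the switching factor is taken to be the fraction of wires that toggle between consecutive flits, flits are encoded as integers, and the function names are hypothetical (they are not part of JM 15.1 or of the simulator used for the results above).

```python
def switching_factors(flits, width):
    """Fraction of the `width` link wires that toggle between consecutive flits.

    Assumption: each flit is an int whose low `width` bits are the wire values.
    """
    mask = (1 << width) - 1
    return [bin((a ^ b) & mask).count("1") / width
            for a, b in zip(flits, flits[1:])]


def inject_error(flit, error_vector):
    """Invert every bit of `flit` that is set in `error_vector` (XOR)."""
    return flit ^ error_vector
```

Averaging the list returned by switching_factors over a whole bitstream gives one number per benchmark, comparable in spirit to the average switching factors reported in Table 3.4.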

3.6 Performance Evaluation Using Dependent Error Model

We also evaluate our proposed method using a dependent error model. As shown in Fig. 3.16, our proposed MEC-2MED code achieves two orders of magnitude lower detection failure than ARQ-EHM (72, 64) (the best mode in the adaptive error detection method [15]) in the low noise region. This benefit is many orders of magnitude smaller than that achieved under the quasi-independent error model (shown in Fig. 3.8c). As noise increases, the detection failure of our proposed method approaches that of the other methods. The reliability degradation is caused by a limitation of the MEC-2MED code: to detect and correct multi-bit errors, the erroneous bits must not fall within the same sub-codeword of any decoder in the receiver; otherwise, the simple MEC-2MED code cannot effectively detect and correct them. To handle this issue, more complicated codes, such as the BCH code, should be considered, but the additional complexity and energy cost may be unsuitable for some designs. The average throughput is also examined. As shown in Fig. 3.17, the average throughput improvement achieved by the proposed method is 38%, which is 6% less

Fig. 3.16 Simulated average throughput for different error control schemes


Fig. 3.17 Normalized energy per useful flit for different error control schemes

than the improvement presented in Fig. 3.10a. In future work, we will further improve the configurable codec to better manage multi-bit and multi-cycle error patterns.

3.7 Summary

Technology scaling deep into the nanometer regime increases multi-bit errors per flit in NoC switch-to-switch links. The proposed adaptive error control scheme improves error resilience while maintaining energy efficiency and performance, by adjusting both error detection and correction, based on detected link quality or system requirements. For a predicted noise scenario, the least complex ECC scheme for average link quality is first selected for reliable message transmission; when the link condition worsens, a more powerful ECC scheme is temporarily provided to recover the previous message. Since a direct implementation of this adaptive ECC scheme is not hardware efficient, a configurable M-bit error correction, 2M-bit error detection code (MEC-2MED) with hardware sharing has been proposed. Simulation results show that the proposed method can reduce the residual flit error rate by over three orders of magnitude compared to previous error control methods. Because of the improved error detection and correction, the adaptive


ECC achieves up to 75% more average throughput. In the high noise regime, the proposed adaptive error control scheme achieves up to 15% and 34% energy reduction over a fixed ECC scheme and a previous adaptive error detection method, respectively. As link switching consumes a growing share of energy and noise increases in nanoscale systems, adapting both error detection and correction can be an effective way to balance energy, reliability, and performance.

References

1. Salminen E, Kulmala A, Hämäläinen TD (2007) On network-on-chip comparison. in Proc 10th Euromicro Conf on Digital Syst Design Architectures, Methods and Tools (DSD 2007) 503–510
2. Ho R, Mai KW, Horowitz MA (2001) The future of wires. Proc IEEE 89:490–504
3. Constantinescu C (2003) Trends and challenges in VLSI circuit reliability. IEEE Micro 23:14–19
4. Dally WJ, Towles B (2001) Route Packets, Not Wires: On-Chip Interconnection Networks. in Proc 38th Design Automation Conference (DAC’01) 684–689
5. Benini L, De Micheli G (2002) Networks on Chips: A new SoC paradigm. Computer 35:70–78
6. Dumitras T, Kerner S, Marculescu R (2003) Towards on-chip fault-tolerant communication. in Proc of Asia and South Pacific Design Automation Conference (ASP-DAC’03) 225–232
7. Bertozzi D, Benini L, De Micheli G (2005) Error control scheme for on-chip communication links: the energy-reliability tradeoff. IEEE Trans Computer-Aided Design of Integr Circuits and Syst (TCAD) 24:818–831
8. Murali S et al (2005) Analysis of error recovery schemes for networks on chips. IEEE Design & Test of Computers 22:434–442
9. Sridhara SR, Shanbhag NR (2007) Coding for reliable on-chip buses: A class of fundamental bounds and practical codes. IEEE Trans on Computer-Aided Design of Integrated Circuit and Syst 26:977–982
10. Sridhara SR, Shanbhag NR (2005) Coding for system-on-chip networks: a unified framework. IEEE Trans Very Large Scale Integr (VLSI) Syst 13:655–667
11. Bertozzi D, Benini L, De Micheli G (2002) Low power error resilient encoding for on-chip data buses. in Proc Design, Automation, and Test in Europe (DATE’02) 102–109
12. Ali M, Welzl M, Hessler S (2007) A fault tolerant mechanism for handling permanent and transient failures in a network on chip. in Proc Intl Technology: New Generations (ITNG’07) 1027–1032
13. Komatsu S, Fujita M (2005) Low power and fault tolerant encoding methods for on-chip data transfer in practical applications. IEICE Trans Fundamentals E88-A:3282–3289
14. Ejlali A, Al-Hashimi BM, Rosinger P, Miremadi SG (2007) Joint consideration of fault-tolerance, energy-efficiency and performance in on-chip networks. in Proc Design, Automation, and Test in Europe (DATE’07) 1647–1652
15. Li L, Vijaykrishnan N, Kandemir M, Irwin MJ (2003) Adaptive error protection for energy efficiency. in Proc ICCAD’03 2–7
16. De Micheli G, Benini L (2007) Networks On Chips. Morgan Kaufmann, San Francisco
17. Pande PP, Grecu C, Jones M, Ivanov A, Saleh R (2005) Performance evaluation and design trade-offs for network-on-chip interconnect architectures. IEEE Trans on Computers 54:1025–1040
18. Palit AK, Duganapalli KK, Anheier W (2008) Crosstalk fault modeling in defective pair of interconnects. Integration, the VLSI Journal 41:27–37
19. Rossi D, Angelini P, Metra C (2007) Configurable error control scheme for NoC signal integrity. in Proc IOLTS’07 43–48
20. Zimmer H, Jantsch A (2003) A fault model notation and error-control scheme for switch-to-switch buses in a network-on-chip. in Proc CODES + ISSS’03 188–193
21. Ganguly A, Pande PP, Belzer B, Grecu C (2008) Design of low power & reliable networks on chip through joint crosstalk avoidance and multiple error correction coding. J Electron Test 24:67–81
22. Nunez-Yanez JL, Edwards D, Coppola AM (2008) Adaptive routing strategies for fault-tolerant on-chip networks in dynamically reconfigurable systems. IET Computers & Digital Techniques 2:184–198
23. Yu Q, Ampadu P (2008) Configurable error correction for multi-wire errors in switch-to-switch links. in Proc IEEE Intl SOC Conf (SOCC’08) 71–74
24. Worm F, Ienne P, Thiran P, De Micheli G (2005) A robust self-calibrating transmission scheme for on-chip network. IEEE Trans Very Large Scale Integr (VLSI) Syst 13:126–139
25. Yu Q, Ampadu P (2008) Adaptive error control for reliable systems-on-chip. in Proc Intl Symp on Circuits and Syst (ISCAS’08) 832–835
26. Yu Q, Ampadu P (2008) Adaptive error control for NoC switch-to-switch links in a variable noise environment. in Proc 23rd IEEE Intl Symp on Defect and Fault Tolerance in VLSI Systems (DFT’08) 352–360
27. Pande PP, Ganguly A, Feero B, Belzer B, Grecu C (2006) Design of low power & reliable networks on chip through joint crosstalk avoidance and forward error correction coding. in Proc IEEE Intl Symp on Defect and Fault Tolerance in VLSI Systems (DFT’06) 466–476
28. Lehtonen T, Liljeberg P, Plosila J (2007) Analysis of forward error correction methods for nanoscale networks-on-chip. in Proc Nano-Net 1–5
29. Salminen E, Kulmala A, Hämäläinen TD (2008) Survey of network-on-chip proposals. White paper, OCP-IP
30. Kousa M, Turner L (1996) Reliability-throughput optimization for adaptive forward error correction systems. IEE Proc Commun 143:341–346
31. Minn H, Zeng M, Bhargava VK (2001) On ARQ scheme with adaptive error control. IEEE Trans on Vehicular Technology 50:1426–1436
32. Fu B, Ampadu P (2008) A dual-mode hybrid ARQ scheme for energy efficiency on-chip interconnects. in Proc 3rd Intl Conf on Nano-Networks (Nano-Net’08) 1–5
33. Lin S, Costello DJ (2004) Error control coding, 2nd edn. Prentice Hall
34. PTM [Online] http://www.eas.asu.edu/~ptm/
35. Bertozzi D, Benini L (2004) Xpipes: a network on chip architecture for gigascale systems-on-chip. IEEE Circuits and Syst Magazine 4:18–31
36. Lehtonen T, Liljeberg P, Plosila J (2007) Online reconfigurable self-timed links for fault tolerant NoC. VLSI Design 1–13
37. H264/AVC JM Reference [Online] http://iphome.hhi.de/suehring/tml/

Chapter 4

Transient and Permanent Link Errors Co-Management

Transient and permanent errors can be co-managed at the network layer. Ali et al. use end-to-end error detection and retransmission to deal with transient errors; they utilize deterministic rerouting to avoid broken links [1]. Sanusi et al. apply single-error correction and multiple-error detection to packets at the destination, and request flooding if permanent errors are detected [2]. Handling all errors in the network layer increases the burden on that layer. In Ref. [3], transient and permanent errors are co-managed in the datalink and physical layers, respectively. Four groups of Hamming codes and retransmission are used to correct transient errors; additional spare wires are provided to tolerate permanent errors, and duplicating half of a flit each cycle is used to further improve reliability. In that method, the worst-case codec wastes energy when noise conditions are favorable; the additional spare wires increase link overhead; and half-split transmission reduces performance when only a few permanent errors are present. In this chapter, we introduce an error control method to co-manage transient and permanent errors in the datalink and physical layers [4, 5]. The overview diagram is shown in Fig. 4.1. To reduce energy overhead, configurable error control coding (ECC) adapts the number of redundant wires to the varying noise conditions, achieving the required error detection capability. Infrequently used redundant wires are utilized as spare wires to replace permanently unusable links. An existing permanent error detection and transparent replacement method is used to facilitate the proposed error co-management. Furthermore, a packet re-organization algorithm that cooperates with a shortened error control coding method is proposed to support low-latency split transmission.

Q. Yu and P. Ampadu, Transient and Permanent Error Control for Networks-on-Chip, DOI 10.1007/978-1-4614-0962-5_4, © Springer Science+Business Media, LLC 2012


Fig. 4.1 Dual-layer transient and permanent error co-management

4.1 Dual-Layer Co-Management Method

4.1.1 Co-Management Algorithm

For different transient and permanent error conditions, the proposed transient and permanent error co-management method uses two operation modes, shown in Fig. 4.2. ECC mode 1 combines simple error control coding with spare wire replacement. In ECC mode 2, complex error control coding works with packet re-organization and spare wire replacement. Mode switching depends on the detected presence of transient and permanent errors. The maximum link width between routers is determined by the complex ECC codec in mode 2, and no additional spare wires are needed in our method. In low noise conditions, a simple ECC is used to detect transient errors; some redundant wires (reserved for the parity check bits in ECC mode 2) are then unused and ready for broken wire replacement. If the number of broken wires exceeds the total number of spare wires, low-latency splitting transmission is invoked. This is most likely to occur when ECC mode 2 is used to recover multi-bit transient errors. To maintain permanent error recovery capability without additional spare wires, we re-organize the packet and encode the modified packet with a shortened complex error control code. The flowchart begins by checking the transient noise level. After one packet transmission is finished, the ECC mode determination process starts over.
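The per-packet decision described above can be sketched as follows. This is one possible reading of the text, not a transcription of the flowchart in Fig. 4.2: the function name, its parameters, the action labels and the fallback to rerouting when even the shortened code cannot free enough wires are all assumptions made for illustration.

```python
def plan_transmission(high_noise, broken_wires, n, n1, shortened_spares):
    """Choose an ECC mode and a permanent-error action for the next packet.

    high_noise       -- True if the transient error monitor reports a noisy link
    broken_wires     -- number of wires flagged as permanently faulty
    n, n1            -- link width and codeword width of the simple code
    shortened_spares -- wires freed by shortening the complex code (assumed given)
    """
    mode = 2 if high_noise else 1
    # Mode 1 leaves n - n1 wires unused; the full mode-2 code occupies all n.
    spares = 0 if high_noise else n - n1
    if broken_wires == 0:
        return mode, "no repair"
    if broken_wires <= spares:
        return mode, "spare-wire replacement"
    if broken_wires <= shortened_spares:
        # Shortened complex code plus packet re-organization.
        return 2, "splitting transmission"
    return mode, "needs rerouting"
```

For example, with n = 48, n1 = 38 and four wires freed by shortening, three broken wires are absorbed by spare-wire replacement in low noise but trigger splitting transmission in high noise.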

4.1.2 Transmitter and Receiver Architecture

The corresponding transmitter and receiver architecture is shown in Fig. 4.3. The transient error monitor instructs the configurable ECC encoder/decoder to operate in the most energy-efficient ECC mode. The transient error monitor can be


Fig. 4.2 Flowchart for cooperative transient and permanent error management

Fig. 4.3 Architecture for the proposed error co-management method

implemented using a counter, which records the number of errors detected by a victim line as presented in Ref. [6]. The ECC encoder input width, k, is equal to the flit width. The link width for data transmission is the maximum codeword width n of the configurable ECC. Triple modular redundancy (TMR) is applied to the acknowledgement signal (TMR NACK) to ensure the correctness of the retransmission request. Permanent errors are handled after ECC encoding in the transmitter and before ECC decoding in the receiver, using an in-line testing (ILT) unit [7], which can detect permanent link errors and replace unusable wires with spare wires at runtime. Figure 4.4a shows the ILT diagram. The test pattern generator is used to produce test signals. The signal r_data is a


Fig. 4.4 In-line testing (ILT) system: (a) control units and (b) reconfiguration example [7]

control signal that is protected by triple modular redundancy. The reconfiguration unit coordinates the receiver and transmitter to synchronize the reconfiguration information (using the reconf. signal). The error detection and reconfiguration central control unit identifies the links with permanent errors and then re-arranges the healthy wires for the transmitter and receiver. Figure 4.4b shows an example of link reconfiguration when a permanent error occurs on wire i. The key contributions of this work include the input/output buffer and the splitting transmission controller. The latter executes packet re-organization. Buffer modifications facilitate packet rebuilding in the transmitter and packet restoring in the receiver.
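Two building blocks mentioned above, majority voting for TMR-protected control signals and re-arranging data onto healthy wires, can be illustrated with a short Python sketch. The wire mapping shown here (skip each broken wire and shift the remaining bits toward the spares) is an assumption for illustration; the actual reconfiguration scheme of the ILT unit in Ref. [7] may differ, and both function names are hypothetical.

```python
def tmr_vote(a, b, c):
    """Bitwise majority vote over three copies of a control signal."""
    return (a & b) | (a & c) | (b & c)


def map_to_healthy_wires(link_width, broken_wires):
    """Return, for each logical bit position, the physical wire that carries it.

    Assumption: broken wires are skipped and later bits shift onto the next
    healthy wire, so spare wires at the end of the link absorb the shift.
    """
    broken = set(broken_wires)
    return [wire for wire in range(link_width) if wire not in broken]
```

With a six-wire link and a fault on wire 2, the mapping becomes [0, 1, 3, 4, 5], so at most five logical bits can still be carried.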

4.2 Packet Re-Organization Approach

4.2.1 Re-Organization Algorithm

The first part of the packet re-organization algorithm, packet rebuilding, is realized in the router output port, shown in Fig. 4.5. The error control codec selection signal (ECC_Sel) selects the desired output from the configurable ECC encoder and also enables the splitting transmission controller. In ECC mode 1, a simple ECC code (n1, k) is used to detect 1- and 2-bit errors on the link. Because the codeword width n1 is

Fig. 4.5 Architecture for router output port supporting splitting transmission: (a) output buffer and configurable ECC encoder, (b) split buffer for (k − k2) = 4


Fig. 4.6 Packet rebuilding algorithm

less than the link width n, n − n1 wires are available for broken wire replacement. In ECC mode 2, a powerful ECC code (n, k) is used to detect multi-bit errors. To provide spare wires for permanent error recovery, we shorten the input of the (n, k) code to k2 bits. The remaining S (= k − k2) bits of each flit are accumulated to rebuild extra flits, which are appended to the packet after the original flits. If the remaining bits of one packet number no more than k2, zero bits are filled into the last appended flit; otherwise, multiple extra flits are built. Because two bits in each flit indicate the flit type (header, payload or tail), they are used to count the flits in the splitting transmission controller. The splitting control signal Splitting_ctrl determines whether the input to the ECC encoder is the original data flits or the rebuilt flits. Split_Buffer stores the remaining bits after flit splitting and provides the rebuilt flit later. As shown in Fig. 4.5b, the Split_Buffer is organized in a pipelined fashion with a width of (k − k2) and a depth of ⌈k/(k − k2)⌉. To reduce the depth of the Split_Buffer and the latency, we recommend choosing k2 such that k/(k − k2) is an integer. This architecture can simultaneously save the split bits and send flits. Pseudo-code for the packet rebuilding algorithm is shown in Fig. 4.6.
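The rebuilding step described above can be sketched in Python as follows. This is a behavioural illustration only, not the pseudo-code of Fig. 4.6: flits are modelled as lists of bits, the first k2 bits of each flit are assumed to be the ones sent immediately, the flit-type bits and the pipelined Split_Buffer timing are ignored, and the function name is hypothetical.

```python
def rebuild_packet(flits, k, k2):
    """Split each k-bit flit into a k2-bit shortened flit plus leftover bits,
    then pack the leftovers into extra k2-bit flits appended to the packet."""
    shortened, split_buffer = [], []
    for flit in flits:
        if len(flit) != k:
            raise ValueError("every flit must be k bits wide")
        shortened.append(flit[:k2])        # sent through the shortened code
        split_buffer.extend(flit[k2:])     # S = k - k2 bits kept for later
    extra = []
    for start in range(0, len(split_buffer), k2):
        chunk = split_buffer[start:start + k2]
        chunk += [0] * (k2 - len(chunk))   # zero-fill the last appended flit
        extra.append(chunk)
    return shortened + extra
```

For two 8-bit flits with k2 = 6, the four leftover bits fit into a single extra flit padded with two zeros, so three 6-bit flits are transmitted.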

4.2.2 Input Port and Output Port Architecture

The input port architecture and the packet restoring algorithm are shown in Figs. 4.7 and 4.8, respectively. The packet restoring process is enabled when ECC_Sel is high. Unlike in the output port, there is no split buffer in the input port. Before the rebuilt flit arrives, the flit buffer in the input port of the next router pops out a flit each cycle; the incoming shortened flit is written into part of the input buffer. After the rebuilt flits reach the input buffer, the flit constructed by the packet rebuilding algorithm is written to the rest of the input buffer (note that the other content of the buffer remains unchanged). Because of this unchanged state, the packet restoring algorithm adds one cycle of latency.
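The restoring step can be sketched as the inverse of packet rebuilding. As before, this is a behavioural illustration rather than the algorithm of Fig. 4.8: flits are lists of bits, the number of original flits is passed in directly (in hardware it would be derived from the flit-type bits), and the function name is hypothetical.

```python
def restore_packet(received, num_flits, k, k2):
    """Rebuild the original k-bit flits from k2-bit shortened flits followed
    by the extra flits that carry the leftover bits."""
    s = k - k2                                   # leftover bits per original flit
    shortened = received[:num_flits]
    leftover = [bit for flit in received[num_flits:] for bit in flit]
    # Each original flit gets its own s leftover bits back; padding is ignored.
    return [shortened[i] + leftover[i * s:(i + 1) * s] for i in range(num_flits)]
```

Applied to the three 6-bit flits [1,1,1,1,1,1], [0,0,0,0,0,0] and [0,1,1,1,0,0] with two original flits and k = 8, the function returns the two original 8-bit flits and drops the two padding zeros.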


Fig. 4.7 Architecture for router input port supporting splitting transmission

Fig. 4.8 Packet restoring algorithm

4.3 Performance Evaluation

4.3.1 Experimental Setup

We evaluate different error co-management methods using a 32-bit flit width (the most common flit size [8]). In our method, the configurable ECC integrates ECC1, Hamming(38, 32), and ECC2, four groups of Hamming(12, 8) with interleaving. According to our algorithm, the link width is set to 48, which is equal to the maximum codeword width of the configurable ECC. If Hamming(38, 32) is in use, ten wires are available for

Fig. 4.9 Impact of packet size on residual packet error rate in low (BER = 10^-9) and high (BER = 10^-5) transient noise conditions. Both panels plot residual packet error rate against the number of flits per packet for five schemes: Proposed, Hop-Hop ECC & Half-Split, Hop-Hop BCH, End-End ECC & Rerouting, and Smart-Flooding.

permanent error recovery. If transient noise increases, 4 × Hamming(12, 8) is enabled. To maintain permanent error recovery capability without adding spare wires, we shorten each Hamming(12, 8) to Hamming(11, 7); thus, four spare wires are available in ECC mode 2. If more than four spare wires are needed, adaptive routing is necessary. In future work, we will address the integration of the proposed method with adaptive routing. A hardware-efficient implementation of the configurable ECC is presented in our previous work [9]. In the following subsections, we compare closely related works [1–3] with our method. In Ref. [1], end-end ECC (CRC4 code) is used to detect transient errors and rerouting is employed to tolerate permanent errors. In Ref. [3], hop-hop ECC using 4 × Hamming(12, 8) and the half-split method are used for error co-management. Smart-flooding [2] uses extended Hamming(39, 32) to correct 1-bit errors and detect 2-bit errors in each packet, and invokes flooding with a moderate gossip rate of 0.4 if permanent errors occur.
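The spare-wire counts quoted in this setup follow directly from the code parameters, as the short calculation below shows. The helper name and constant names are illustrative only.

```python
def spare_wires(link_width, codeword_width, groups=1):
    """Wires left unused when `groups` codewords of `codeword_width` bits share the link."""
    return link_width - groups * codeword_width


LINK_WIDTH = 4 * 12                                    # four Hamming(12, 8) codewords
MODE1_SPARES = spare_wires(LINK_WIDTH, 38)             # Hamming(38, 32) in use
MODE2_SPARES = spare_wires(LINK_WIDTH, 11, groups=4)   # 4 x Hamming(11, 7)
SPLIT_BITS = 32 - 4 * 7                                # S = k - k2 bits deferred per flit
```

The values evaluate to a 48-wire link, ten spares in ECC mode 1 and four spares in ECC mode 2; the four deferred bits per flit match the (k − k2) = 4 split buffer width given in the caption of Fig. 4.5.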

4.3.2 Reliability

The reliability of different error control schemes is compared using the residual packet error rate, the probability that errors remain in an accepted packet after error control. The impacts of packet size and routing path length on reliability are examined in Figs. 4.9 and 4.10, respectively. As shown, the proposed method achieves four orders of magnitude lower residual packet error rate than

Fig. 4.10 Impact of routing path length on residual packet error rate in low (BER = 10^-9) and high (BER = 10^-5) transient noise conditions. Both panels plot residual packet error rate against the number of hops for five schemes: Proposed, Hop-Hop ECC & Half-Split, Hop-Hop BCH, End-End ECC & Rerouting, and Smart-Flooding.

end-end ECC with rerouting and smart-flooding methods, for different link bit error rates (BER), numbers of flits per packet, and numbers of hops traversed by a packet. This is because, with end-to-end ECC, errors accumulate in the packet as the packet size and the number of routing hops increase. Our method and the hop-hop ECC & half-split approach recover the errors at each hop, achieving comparable reliability. Hop-hop BCH corrects up to 4-bit errors, achieving better reliability than our method, at the cost of larger energy and area overheads, as shown later.
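The accumulation effect can be illustrated with a simple calculation. The sketch below is not the residual packet error rate model behind Figs. 4.9 and 4.10; it only computes, under the assumption of independent bit errors and no correction between hops, the probability that at least one bit of a packet is flipped somewhere along its path. The function name and parameters are hypothetical.

```python
import math


def raw_packet_error_probability(ber, bits_per_flit, flits, hops):
    """Probability that at least one bit of a packet is flipped along its path,
    assuming independent bit errors and no error recovery between hops."""
    exposed_bits = bits_per_flit * flits * hops
    # 1 - (1 - ber) ** exposed_bits, written with log1p/expm1 so that the
    # result stays accurate for very small bit error rates.
    return -math.expm1(exposed_bits * math.log1p(-ber))
```

Because the exposure grows with both the packet size and the hop count, an end-to-end code must cope with the errors of the whole path, whereas per-hop recovery only ever faces the errors of a single link traversal.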

4.3.3 Average Latency

Average latency of useful packets is used as a metric to evaluate the latency of different methods. A useful packet is a packet received error free at the destination. To clarify the impact of error control on latency, we examine two separate cases: (1) transient errors only and (2) permanent errors only. In the case of no permanent errors, the flooding (in smart-flooding) and rerouting (in end-end ECC & rerouting) mechanisms are not invoked; thus, the latency overhead is induced only by the control of transient errors. Consistent with the results in Ref. [10], the four compared methods achieve comparable average latency under low noise and short routing paths. As shown in Fig. 4.11, the latency reduction achieved by our method increases with the routing path length; this is more significant in the high transient noise region. For a packet with ten flits, our method reduces the latency by up to 33% and 50%, compared to end-end ECC & rerouting and smart-flooding, respectively. The latency of our method is close to

Fig. 4.11 Impact of routing path length on (a) latency and (b) packet latency for 1-hop having error

that of the hop-hop ECC & half-split and hop-hop BCH approaches. We assume that a link experiences an adjacent 2-bit error (thus, all ECC codes compared in this work can detect the error). As shown in Fig. 4.11b, our method achieves up to 60% latency reduction over the end-end ECC approaches. The hop-hop BCH does not need retransmission, achieving slightly better latency than the other methods. In subsequent subsections, we use a 4 × 4 mesh NoC (a popular size in many applications [8, 11]) to examine the impact of permanent error location and quantity. Four assessment cases are shown in Fig. 4.12a. The first one represents


Fig. 4.12 Impact of permanent error location and quantity on latency: (a) four permanent error cases, (b) average latency and (c) the worst latency for the four examined cases


one error on an NoC boundary link. The second one represents a broken link at the center of the network. The third one is a severe case of boundary link errors. The last one is a severe case of central link errors. If one hop has broken links, the half-split method doubles the packet latency at that hop, while our method adds only one cycle. Permanent errors located closer to the center of the network result in more rerouting or flooding. As shown in Fig. 4.12b, our method is less sensitive to the error location than the other methods, and reduces latency by up to 30%, 33% and 47%, compared to hop-hop ECC & half-split, end-end ECC & rerouting and smart-flooding, respectively. Our worst-case latency is close to that of hop-hop BCH, as shown in Fig. 4.12c.

4.3.4 Energy Efficiency

Considering both energy consumption and reliability simultaneously, we compare different methods using the metric average energy per useful packet. The energy includes the portions consumed in the router, the link and the codec in the network interface (NI). The router and NI energy are obtained from synthesis results produced by Synopsys Design Compiler using a TSMC 65 nm technology at a 1 GHz frequency. Link energy is measured in Cadence Spectre using the Predictive Technology Model (PTM) CMOS 65 nm technology. We evaluate energy using a wide range of transient BERs and two permanent error scenarios: 1b1c, with one boundary broken wire as in case (1) and one center broken wire as in case (2) of Fig. 4.12a, and 2b2c, with two boundary broken wires and two center broken wires. As shown in Fig. 4.13, our method outperforms the hop-hop ECC & half-split approach and smart-flooding for different packet sizes, transient error rates and permanent error scenarios. As shown in Fig. 4.13a, end-end ECC & rerouting achieves better energy efficiency than our method at very low transient error rates and small packet sizes. However, as permanent error quantity and packet size increase, our method achieves better energy efficiency than the other approaches, as shown in Fig. 4.13b and c. Our method reduces the energy per useful packet by up to 17%, 23%, 70% and 55%, compared to hop-hop ECC & half-split, end-end ECC & rerouting, smart-flooding and hop-hop BCH, respectively. The results in Fig. 4.13 are for a 1 mm link length. The impact of link length on energy has been examined in Ref. [4].

4.3.5 Area Overhead

Table 4.1 compares the area of different error co-management methods. End-end ECC & rerouting and smart-flooding methods do not have ECC overhead in each router, but they need an error control codec in the network interface (NI) and additional NI buffers to store packets awaiting ECC feedback. For the 4 × 4 mesh NoC, we assume that each NI has four additional packet buffers for retransmission

Fig. 4.13 Energy efficiency comparison for (a) 1b1c scenario and 4-flit packet, (b) 1b1c scenario and 16-flit packet, (c) 2b2c scenario and 16-flit packet. Each panel plots average energy per useful packet (pJ) against transient error rate (10^-7 to 10^-2) for five schemes: Proposed, Hop-Hop ECC & Half-Split, Hop-Hop BCH, End-End ECC & Rerouting, and Smart-Flooding.

Table 4.1 Area comparison (area in µm²)

Method                            Packet size  Router   Codec in NI  Extra NI buffer  Total
Proposed                          –            32,402   –            –                32,402
Hop-hop ECC & half-splitted [3]   –            32,948   –            –                32,948
End-end ECC & rerouting [1]       128 bit      25,593   946          4,032            30,571
                                  512 bit      25,593   3,302        8,064            36,959
Smart-flooding [2]                128 bit      25,593   10,404       4,032            40,029
                                  512 bit      25,593   43,521       8,064            77,178

error recovery. As shown in Table 4.1, as the packet size increases from 128 bits to 512 bits, the area of the codec and additional buffers in the NI increases significantly. As a result, our method achieves 12%, 58% and 99% area reduction over the end-end ECC & rerouting, smart-flooding and hop-hop BCH methods, respectively. Because of the slightly lower complexity of its input port design, our method reduces area by 2% compared to hop-hop ECC & half-split. Since 512-bit (or larger) packets are required for some applications (e.g., Video Object Plane Decoder) mapped to NoC-based systems, the proposed error co-management method is an efficient way to trade off reliability, performance, energy and area overhead.
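The 12%, 58% and 2% figures can be checked against the totals in Table 4.1 with the short calculation below, using the 512-bit packet totals for the two network-layer methods. The helper name and dictionary keys are illustrative; the 99% figure for hop-hop BCH cannot be checked this way because Table 4.1 does not list the BCH area.

```python
def area_reduction_percent(proposed, other):
    """Relative area saving of `proposed` with respect to `other`, in percent."""
    return 100.0 * (other - proposed) / other


PROPOSED_TOTAL = 32_402                      # total area of the proposed method
OTHER_TOTALS = {
    "hop-hop ECC & half-split": 32_948,
    "end-end ECC & rerouting (512 bit)": 36_959,
    "smart-flooding (512 bit)": 77_178,
}
REDUCTIONS = {name: area_reduction_percent(PROPOSED_TOTAL, total)
              for name, total in OTHER_TOTALS.items()}
```

Rounded to the nearest percent, the three entries of REDUCTIONS are 2, 12 and 58, in agreement with the text.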

4.4 Summary

Transient and permanent errors reduce the reliability of networks-on-chip implemented in nanometer technologies. Although achieving satisfactory latency and energy efficiency, error co-management approaches in the network layer become less efficient as transient/permanent errors, NoC size and packet length increase. Dual-layer worst-case error co-management methods reduce the burden of the network layer, but waste resources and energy if link conditions are favorable. To improve latency and energy efficiency, this work proposes a cooperative error management method across datalink and physical layers. In the low noise region, the proposed method operates in a simple error control coding mode; spare wires obtained from infrequently used redundant wires for transient error control are employed to replace the permanently unusable wires. As transient noise increases, the proposed method switches to the powerful error control coding mode, and a packet re-organization algorithm is proposed to maintain the permanent error recovery capability without additional spare wires. Case studies show that our method significantly improves reliability, and reduces latency by up to 50%, compared to previous methods. For a 4 × 4 mesh NoC, the proposed method shows less sensitivity to permanent error location and quantity, and obtains up to 70% energy reduction compared to other works. In future work, error co-management across physical, datalink and network layers can be further investigated to improve performance and energy.

References

1. Ali M, Welzl M, Hessler S, Hellebrand S (2007) An efficient fault tolerant mechanism to deal with permanent and transient failures in a network on chip. Intl J High Performance Syst Architecture 1:113–123
2. Sanusi A, Bayoumi MA (2009) Smart-flooding: A novel scheme for fault-tolerant NoCs. in Proc IEEE SoC Conf 259–262
3. Lehtonen T, Liljeberg P, Plosila J (2007) Online reconfigurable self-timed links for fault tolerant NoC. VLSI Design, vol 2007:1–13
4. Yu Q, Ampadu P (2010) Transient and permanent error co-management for reliable network-on-chip. in Proc NOCS 145–154
5. Yu Q, Ampadu P (2011) A dual-layer method for transient and permanent error co-management in NoC links. IEEE Trans on Circuit and Systems II-Express Briefs 58:36–40
6. Li L, Vijaykrishnan N, Kandemir M, Irwin MJ (2003) Adaptive error protection for energy efficiency. in Proc ICCAD 2–7
7. Lehtonen T, Wolpert D, Liljeberg P, Plosila J, Ampadu P (2010) Self-adaptive system for addressing permanent errors in on-chip interconnects. IEEE Trans on Very Large Scale Integr (VLSI) Syst 18:527–540
8. Salminen E, Kulmala A, Hämäläinen TD (2008) Survey of network-on-chip proposals. White paper, OCP-IP 1–13
9. Yu Q, Ampadu P (2009) Adaptive error control for nanometer scale NoC links. IET Computers & Digital Tech-Special issue on advances in nanoelectronics circuits and syst vol 3:643–659
10. Bertozzi D, Benini L, De Micheli G (2005) Error control scheme for on-chip communication links: the energy-reliability tradeoff. IEEE Trans Computer-Aided Design of Integr Circuits and Syst 24:818–831
11. Bahn J, Yang J, Bagherzadeh N (2008) Parallel FFT algorithms on Network-on-Chips. in Proc ITNG’08 1087–1093

Chapter 5

Dual-Layer Cooperative Error Control for Transient Error

Datalink-layer adaptive error control methods have been investigated in Chap. 3. In this chapter, we extend the error control adaptation to a two-layer approach communicating between the datalink and network layers to further reduce energy consumption. We employ end-to-end error control in the network interface in low noise conditions, and enhance the error control capability in high noise regions by turning on hop-to-hop error control in the router. Simply combining end-to-end error control with hop-to-hop error control typically results in huge energy consumption. Consequently, we apply the concept of product code to the NoC, performing cross-layer cooperative error control. Another major contribution is a protocol to switch between network-layer ECC and datalink-layer ECC at runtime.

5.1 Existing Hop-to-Hop Error Control

Hop-to-hop ECC is performed in the datalink layer, where each flit (flow control unit) is encoded/decoded in each hop. The ECC encoder/decoder is located in each router output/input port and no codec is needed in the network interface (NI), as shown in Fig. 5.1. The generic router is for a mesh/torus topology, and the details of the buffers and other routing blocks are not highlighted here. In the hop-to-hop ECC approach, the link width between routers is equal to the total width of the codeword and the acknowledge signal (if error detection is used). Fixed hop-to-hop error control methods have been widely investigated. Simple error detection with retransmission [1, 2] is efficient for low noise regions. Forward error correction schemes [3] recover the wrong flit without latency overhead. Type-II hybrid automatic repeat request (ARQ) with Hamming product codes [4] balances codec complexity against latency overhead. Hop-to-hop methods capture the error and recover the corrupted flit within the current hop. This immediate error recovery prevents error accumulation, reducing

Q. Yu and P. Ampadu, Transient and Permanent Error Control for Networks-on-Chip, DOI 10.1007/978-1-4614-0962-5_5, # Springer Science+Business Media, LLC 2012


Fig. 5.1 Generic hop-to-hop ECC in datalink layer

the need for powerful error control codes. Unfortunately, hop-to-hop error control adds complexity to router design. The communication energy on long on-chip interconnects becomes comparable to the computation energy as technology scales [1]; unnecessary encoding/decoding operations therefore waste energy when noise conditions are favorable. Adaptive hop-to-hop error control methods have been investigated to reduce the energy overhead. Rossi et al. propose a method that configures the router ECC codec at design time based on the quality of service (QoS) requirements; available choices include single-error correction (SEC), single-error correction double-error detection (SECDED) and symbol-error correction codes [5]. SEC and SECDED codes are compatible with respect to hardware implementation; however, the symbol-error correction codecs of complex codes such as BCH are typically incompatible with those of simple codes. To reduce wasted energy in favorable noise conditions, Li et al. [6] adapt the error detection capability to the detected noise condition; a counting method based on a half-supply-voltage victim line is proposed to assess the current link noise. In Ref. [7], we propose a method that adapts error detection and correction at runtime. Other hop-to-hop error control techniques for errors caused by crosstalk are proposed in Refs. [8, 9]. Co-management of crosstalk and soft errors is discussed in Ref. [10]. We assume that crosstalk issues have been managed using physical techniques, and thus we concentrate on errors induced by particle strikes and spurious voltage fluctuations in this work.

5.2 Existing End-to-End Error Control

Network-layer error control coding is executed only in the network interface and does not detect or correct errors at intermediate hops in the route, as shown in Fig. 5.2. End-to-end error control is typically performed on the entire packet at the source/destination ends. When an error is detected, an acknowledge packet is sent back to the source end to request retransmission of the packet. End-to-end error control does not increase router complexity or the link width between hops; consequently, the power consumption of end-to-end error control is less than that of hop-to-hop error control when the error


Fig. 5.2 Generic end-to-end ECC in network layer

rate, the route length and the number of retransmissions are small [11]. A longer route path may accumulate more errors because there is no hop-level error recovery. Thus, end-to-end ECC needs increased error control strength to meet the same reliability target as hop-to-hop ECC, potentially requiring a larger codec in the NI. Waiting for acknowledgement packets in end-to-end control also results in large latency overheads. Fixed simple end-to-end ECC combined with a timeout retransmission mechanism has been proposed to reduce the latency caused by packet corruption or loss [12]. The end-to-end hybrid error detection and correction method repairs packets having single errors and requests retransmission only when multiple errors are detected. Advanced fabrication technologies allow integration of more and more cores in a single die; the NoC size is expected to increase. Without increasing the ECC strength, current end-to-end methods can barely manage the error accumulation problem, either resulting in high residual error rates or introducing significant retransmission latency and energy consumption overhead.

In this chapter, we combine existing resources in different NoC layers to exploit the benefits of hop-to-hop and end-to-end ECC for a wide range of variable noise scenarios. Figure 5.3 shows the dual-layer ECC cooperation. End-to-end ECC is performed in the network layer, and a small unit is added to accumulate packet error detection outcomes. If the total number of erroneous packets crosses a predefined threshold, the network layer requests a switch in the ECC mode between single-layer and dual-layer. In the datalink layer, hop-to-hop ECC is enabled if the ECC mode is switched to the dual-layer mode. The flit error detection outcome of each hop is written into an error history flit, which helps the ECC mode switching controller to determine when to return to the single-layer ECC mode. The controller also gathers the ECC mode of the neighbor nodes to ensure a consistent ECC mode in the entire network. Our main contributions are as follows:
• Extend ECC adaptation from single layer to dual layer to improve energy efficiency and reliability, while maintaining performance. A protocol for runtime ECC mode switching is proposed: only employ end-to-end ECC in low noise conditions and enhance the error control strength by turning on hop-to-hop ECC if noise increases; disable the hop-to-hop ECC when the noise condition becomes favorable.


Fig. 5.3 Overview of the proposed dual-layer adaptive ECC: (a) generic architecture, (b) cooperation between datalink and network layers

• Implement dual-layer adaptive error control coding. Simply combining end-to-end error control with hop-to-hop error control typically results in huge energy consumption. The proposed utilization of product codes can realize non-interrupting ECC mode switching at runtime.
• Propose an error-tag-shifting technique to record hop-to-hop ECC detection outcomes and pass them to the destination, assisting in ECC mode switching.

5.3 Dual-Layer ECC Switching Scheme

5.3.1 ECC Mode Switching Algorithm

In variable noise conditions, cooperation of end-to-end and hop-to-hop ECC improves energy efficiency and performance. We propose an ECC mode switching protocol to facilitate error control cooperation at runtime. End-to-end ECC in


Fig. 5.4 ECC mode switching state machine

the network layer (i.e. single-layer mode) is used in low noise conditions, while end-to-end ECC and hop-to-hop ECC together (i.e. dual-layer mode) are used in high noise conditions. The state machine for the single-layer (SL) and dual-layer (DL) mode switching is shown in Fig. 5.4. In addition to the SL and DL states, there are two intermediate states, Pre-SL and Pre-DL, which propagate the mode switching instruction over the network and ensure that the surrounding nodes are all utilizing the same ECC mode. End-to-end ECC is used if the NoC is in the SL or Pre-DL state; end-to-end ECC and hop-to-hop ECC are used in the DL or Pre-SL state. The NoC is initialized to the SL state to reduce energy consumption. If the total number of errors detected in the network interface every Tcount cycles exceeds the predefined threshold, the associated node requests a switch to the Pre-DL state and informs its surrounding nodes. In addition to the mode switching request from the network layer, a request from neighboring routers can also cause the router to enter the Pre-DL state; thus, localized error injection can be considered as well. During the first Tprop cycles of each Tcount period, the mode switching requests are propagated to the rest of the NoC nodes, while each node remains in the Pre-DL or Pre-SL state. After Tprop cycles, each NoC node changes to the DL or SL state. Tprop is the maximum ECC mode switching notification time, during which the node invoking ECC mode switching informs the rest of the nodes in the network to change to a specific ECC mode. Tprop is given by (5.1).

$$T_{prop} = 2\left(\sqrt{n} - 1\right) \tag{5.1}$$

where n is the total number of nodes in a mesh NoC. In this work, we assume that the average bit error rate caused by various noise sources is relatively uniform across the chip, but we consider the fact that the number of errors detected by each node varies with different traffic loads. To prevent mode switching oscillation, the proposed protocol is asymmetric. As shown in Fig. 5.4, only network-layer requests can change the DL state to the Pre-SL state. The time spent in the Pre-SL state is used to propagate the request through the rest of the network using the dual-layer error control. If no dual-layer mode is requested by the local node and neighbor nodes, the network enters the SL state.
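The four-state switching protocol described above can be sketched as a small behavioral model. This is only an illustrative sketch, not the circuit of Fig. 5.8: the names `Mode`, `next_mode`, `hop_ecc_enabled` and the signal names are hypothetical, a square mesh is assumed for (5.1), and the behavior of Pre-SL when a dual-layer request is still pending (returning to DL) is an assumption inferred from the text.

```python
import math
from enum import Enum


class Mode(Enum):
    SL = "SL"          # single-layer: end-to-end ECC only
    PRE_DL = "Pre-DL"  # still end-to-end only while the request propagates
    DL = "DL"          # dual-layer: end-to-end plus hop-to-hop ECC
    PRE_SL = "Pre-SL"  # still dual-layer while the request propagates


def t_prop(n_nodes: int) -> int:
    """Maximum notification time from (5.1), assuming a square mesh."""
    side = math.isqrt(n_nodes)
    if side * side != n_nodes:
        raise ValueError("this sketch assumes a square mesh")
    return 2 * (side - 1)


def hop_ecc_enabled(mode: Mode) -> bool:
    """Hop-to-hop ECC is active in the DL and Pre-SL states."""
    return mode in (Mode.DL, Mode.PRE_SL)


def next_mode(mode: Mode, ni_req_dl: bool = False, neighbor_req_dl: bool = False,
              ni_req_sl: bool = False, prop_done: bool = False) -> Mode:
    """One transition of the (hypothetical) per-node mode controller."""
    if mode is Mode.SL:
        # Either the local NI or a neighboring router can trigger Pre-DL.
        return Mode.PRE_DL if (ni_req_dl or neighbor_req_dl) else Mode.SL
    if mode is Mode.PRE_DL:
        return Mode.DL if prop_done else Mode.PRE_DL
    if mode is Mode.DL:
        # Asymmetric protocol: only a network-layer request leaves DL.
        return Mode.PRE_SL if ni_req_sl else Mode.DL
    # Pre-SL: wait for propagation to finish.
    if not prop_done:
        return Mode.PRE_SL
    # Assumption: a pending dual-layer request sends the node back to DL.
    return Mode.DL if (ni_req_dl or neighbor_req_dl) else Mode.SL
```

For a 16-node mesh, `t_prop(16)` evaluates to 6 cycles, the hop distance between opposite corners of a 4 × 4 mesh.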


The dual-layer ECC mode is triggered when the total number of errors detected by any node in Tcount cycles exceeds the error threshold. Tcount is the time interval in which the NI counts the total number of erroneous packets. Given the traffic load λ and the packet error rate pe after traveling h hops, the number of packets containing error bits is expressed by (5.2).

$$\text{Number of Erroneous Packets} = T_{count}\,\lambda\,p_e \tag{5.2}$$

in which

$$p_e = 1 - (1 - \varepsilon)^{w_p h} \tag{5.3}$$

The computation of pe is based on the independent error model. wp is the number of bits per packet and ε is the bit error rate. Re-arranging (5.2) and (5.3), we can express Tcount as below.

$$T_{count} = \frac{\text{Number of Erroneous Packets}}{\lambda\left[1 - (1 - \varepsilon)^{w_p h}\right]} \tag{5.4}$$

Each network interface needs a modular counter for timing Tcount cycles. For simplicity, one can set Tcount using (5.5).

$$T_{count} = \frac{\text{End Error Threshold}}{\lambda_{avg}\left[1 - (1 - \varepsilon)^{w_p h_{avg}}\right]} \tag{5.5}$$

in which λavg is the average traffic load and havg is the average number of hops that a packet has traveled. λavg and havg can be obtained by performing simulations with the applications of interest. A larger Tcount costs more D-flip-flops for the modular counter; a smaller Tcount may not be enough to distinguish between different noise conditions under various traffic loads and route lengths. To avoid frequent mode switching caused by burst traffic and/or burst errors, we choose the pair of Tcount and the end error threshold a little above the minimum. We examine the counter width using a 4 × 4 mesh NoC. By running different applications, each node injects packets to the network for 100,000 cycles, during which traffic hot spots are observed. Figure 5.5 shows Tcount and the input width of the modular counter for three applications in the PARSEC benchmark suite [13] and a synthetic uniform traffic. Table 5.1 shows the Tcount and the end error threshold selection based on the traffic patterns in our experiments.
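As a rough numerical illustration of (5.3) and (5.5), the sketch below evaluates Tcount and the corresponding modular counter width. The function names are hypothetical, and the bit error rate and packet width used in the usage note are assumed values chosen only for illustration; they are not the settings behind Table 5.1.

```python
import math


def packet_error_rate(eps: float, wp_bits: int, hops: float) -> float:
    """Packet error rate of (5.3) under the independent error model."""
    return 1.0 - (1.0 - eps) ** (wp_bits * hops)


def t_count(end_error_threshold: int, lam_avg: float, eps: float,
            wp_bits: int, h_avg: float) -> float:
    """Counting interval of (5.5), in cycles."""
    return end_error_threshold / (lam_avg * packet_error_rate(eps, wp_bits, h_avg))


def counter_width(t: float) -> int:
    """Number of D-flip-flops for a modular counter covering t cycles."""
    return math.ceil(math.log2(t))
```

For instance, with an end error threshold of 5, an average load of 0.15, an average route of 2.6667 hops, 128-bit packets and an assumed bit error rate of 1e-5, `t_count` is a little under 10,000 cycles, so `counter_width` gives 14 flip-flops. A higher bit error rate shortens the interval, which is the trend plotted in Fig. 5.5.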

5.3.2

Network Interface Architecture

The network interface executes part of the proposed ECC mode switching protocol – counting the number of errors detected in the Tcount time interval and determining whether to request an ECC mode switch. Figure 5.6 shows the architecture for the

Fig. 5.5 (a) Tcount and (b) input width of modular counter for different noise conditions (horizontal axis: bit error rate ε; curves: Black-Scholes, x264, Canneal, Uniform)

Table 5.1 Examples of Tcount and end error threshold
Parameters            Black-Scholes   x264     Canneal   Uniform
λavg                  0.2048          0.0211   0.0917    0.15
havg                  2.5939          2.4615   2.3384    2.6667
Tcount                2^13            2^16     2^12      2^12
End error threshold   8               7        4         5

Fig. 5.6 Circuit to generate mode switching request in NI

unit to generate the mode switching request (NI_req) from the network layer. If the node is in the SL state, NI_req is pulled up when the number of errors detected within a Tcount time interval exceeds the End Error Threshold. To distinguish truly favorable noise conditions from errors that are merely masked by the added hop-to-hop ECC, we check for logic '1' in the error history flit (EHF), which records the error detection outcome in each hop along the packet route. The error history flit is created by the NI when it packetizes the bit stream into a packet.


Fig. 5.7 Simplified router architecture

This history flit starts with a unique header bit pattern for the router to recognize and is also protected with the same error protection code. The rest of the flit is written at each hop: 1 for an error detected and 0 for no error captured. Whenever there is a non-zero syndrome or a non-zero error history flit, the ripple counter is incremented. The modular counter increases by one every clock cycle and resets the ripple counter every Tcount cycles. The numbers of D-flip-flops needed in the modular counter and the ripple counter are equal to $\lceil \log_2(T_{count}) \rceil$ and $\lceil \log_2(\text{End Error Threshold}) \rceil$, respectively.
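The counting behavior of the request unit can be mimicked with a short cycle-level model. This is a sketch under assumptions rather than the circuit of Fig. 5.6: the class and signal names are hypothetical, and the request is evaluated once at the end of each Tcount window, which is one plausible reading of the description above.

```python
class NIRequestUnit:
    """Behavioral sketch of the NI mode switching request generator."""

    def __init__(self, t_count: int, end_error_threshold: int):
        self.t_count = t_count
        self.threshold = end_error_threshold
        self.cycle = 0       # models the modular counter
        self.errors = 0      # models the ripple (error) counter
        self.ni_req = False  # request line toward the ECC mode switch

    def tick(self, syndrome_nonzero: bool = False, ehf_nonzero: bool = False) -> bool:
        """Advance one clock cycle and return the current NI_req value."""
        # A non-zero syndrome or a non-zero error history flit counts as an error.
        if syndrome_nonzero or ehf_nonzero:
            self.errors += 1
        self.cycle += 1
        if self.cycle == self.t_count:
            # Assumption: the request is updated at the end of each window.
            self.ni_req = self.errors > self.threshold
            self.cycle = 0
            self.errors = 0  # the modular counter resets the error counter
        return self.ni_req
```

Calling `tick` once per cycle with the decoder outcome keeps `ni_req` low until a window containing more errors than the threshold has completed.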

5.3.3 Router Implementation

A router supporting our dual-layer ECC is shown in Fig. 5.7. The highlighted ECC Mode Switch, Mode Propagation Counter and EHF Update Unit differentiate our work from a conventional router design. For mesh NoCs, a router typically has five input/output ports: north, south, east, west, and local. The information extractor obtains the destination address from the packet header flit. The arbiter facilitates connection between input and output ports. For a router capable of error control, NACK feedback from neighbor routers (NACK_in) is used by the arbiter to control FIFO and channel reservation. The four states in Fig. 5.4 are represented with S1S0. In the DL state and Pre-SL state, S1 = 1 is propagated to neighboring routers to ensure all routers are using hop-to-hop ECC. The SL and Pre-DL states propagate S1 = 0 to turn off the hop-to-hop ECC. The detailed circuit design for the ECC Mode Switch unit is depicted in Fig. 5.8. The propagation finish signal comes from the mode propagation counter. The input width for that counter is log2(Tprop). Figure 5.8 also shows that the mode switching unit only requires a few small fan-in logic gates and two D-flip-flops that do not dramatically increase the critical path.


Fig. 5.8 ECC mode switching unit in router

Table 5.2 Overhead of proposed mode switching circuit
Compared designs            Delay (%)   Area (%)   Dynamic power (%)   Leakage power (%)
Conventional routing unit   100         100        100                 100
Proposed routing unit       102.2       101.6      102.5               102.4

The hardware description for the routing unit (including arbiter, crossbar, mode propagation counter and ECC mode switch unit) is synthesized by the Design Compiler with a 65 nm TSMC technology using a 1 GHz frequency target. As shown in Table 5.2, our mode switching unit increases delay, area and power by roughly 2% compared to a conventional routing unit. If the network is using hop-to-hop ECC, the output of the hop-to-hop ECC decoder is saved to the error history flit after each hop, using the error-tag-shifting technique shown in Fig. 5.9a. The unique bit pattern for the EHF identification is passed through the EHF update unit. The last bit of the error history flit is the output of ORing the syndrome from the Hop ECC Decoder, and this bit is used to indicate the noise condition of the current hop. The remaining bits carrying the error history for previous hops are pushed one position forward at each hop until the destination is reached, as shown in Fig. 5.9b. The width of the EHF is the same as that of an uncoded flit. For a 32-bit flit width, the EHF can record the error history for a route length of up to 24 hops. In the network interface, the error history flit is decoded with an OR operation, providing a hint whether the noise condition is truly favorable or masked by the hop-to-hop ECC.
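A bit-level sketch of the error-tag-shifting update may help make the mechanism concrete. It assumes that the 32-bit EHF is split into an 8-bit identification pattern and 24 history bits (consistent with the 24-hop limit stated above, but the exact split is an assumption), and the header value `EHF_HEADER` as well as all function names are hypothetical.

```python
FLIT_BITS = 32
HISTORY_BITS = 24                       # one error tag per hop, up to 24 hops
HISTORY_MASK = (1 << HISTORY_BITS) - 1
EHF_HEADER = 0xA5                       # hypothetical identification pattern


def new_ehf() -> int:
    """EHF as created by the source NI: header pattern, empty history."""
    return EHF_HEADER << HISTORY_BITS


def is_ehf(flit: int) -> bool:
    """Router-side check for the identification pattern."""
    return (flit >> HISTORY_BITS) == EHF_HEADER


def update_ehf(flit: int, syndrome: int) -> int:
    """Shift the history one position and append this hop's error tag."""
    tag = 1 if syndrome != 0 else 0     # OR of all syndrome bits
    history = (((flit & HISTORY_MASK) << 1) | tag) & HISTORY_MASK
    return (flit & ~HISTORY_MASK) | history  # header passes through unchanged


def errors_seen(flit: int) -> bool:
    """Destination NI: OR of the history bits."""
    return (flit & HISTORY_MASK) != 0
```

In this sketch, a packet that crosses three hops and sees a non-zero syndrome only at the second hop arrives with the history bits `010`, so `errors_seen` reports that hop-to-hop ECC masked an error along the route.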


Fig. 5.9 Error-tag-shifting technique: (a) diagram of error history flit update unit, (b) a multi-hop example of writing error history flit

5.3.4 Dual-Layer Information Exchange

In the proposed method, the mode switching depends on the information exchange between datalink and network layers. In mode 1 (end-to-end ECC), the network interface assesses the global noise condition by counting the number of errors detected by the destination decoder, and informs the datalink layer if it should use hop-to-hop ECC (using NI_req signal in Fig. 5.6). In mode 2 (end-to-end ECC combined with hop-to-hop ECC), the network interface comprehensively evaluates the global noise condition with its local error counter and information passed by the error history flit, which is filled by the EHF Update Unit in the datalink layer (shown in Fig. 5.9).

5.4 Codec for Dual-Layer ECC

Suppose the network is switching from the end-to-end ECC mode into the hop-to-hop ECC mode. At this moment, the packet injected in the end-to-end ECC mode has not yet been encoded with the hop-to-hop ECC. The uncoded flit may be treated as an erroneous coded flit, because the uncoded flit with the appended zero bits may not be a valid codeword. Independent end-to-end and hop-to-hop ECC codecs are therefore not feasible for non-interrupting ECC mode switching at runtime. In this work, we use product codes to facilitate non-interrupting ECC mode switching. Unlike the application of product codes to hop-to-hop ECC [4], the processes of message packetization and packet or flow control unit (flit) encoding/decoding are modified, as discussed in detail in this section.

5.4.1 Dual-Layer ECC Encoder

The end-to-end ECC encoder is located in the network interface. Figure 5.10 shows the encoding process and encoder architecture. The binary bit stream of each packet is arranged into an array where each flit is a column. The encoding process has three steps. Step 1: each flit is encoded by a column encoder (in this work, we use a systematic linear code) to generate the flit check bits (FCB). Step 2: the row vector (consisting of one bit from each flit across a row of the array) is encoded by a row encoder to produce the packet check bits (PCB). Again, we use a systematic linear code in the row codec. Step 3: the PCB is encoded by the column encoder to generate the checks on checks (CoC). Unlike the application of product codes at the hop-to-hop level [4], the product codes in our dual-layer ECC require an additional step to packetize the PCB and CoC into a new packet. To reduce the energy consumption for transmitting parity check bits, we transmit a coded packet (i.e., original packet and FCB) first and then transmit the packet composed of PCB and CoC if a retransmission request is received.
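The three encoding steps can be illustrated with a toy product code. The sketch below is not the codec of Fig. 5.10: it substitutes a small systematic Hamming (7,4) code for both the column and row codes (the experiments in this chapter use extended Hamming codes), works on a packet of four 4-bit flits, and uses hypothetical function names.

```python
# Parity equations of a systematic Hamming (7,4) code: each tuple lists the
# data-bit positions that are XORed to form one check bit.
HAMMING_7_4 = [(0, 1, 3), (0, 2, 3), (1, 2, 3)]


def check_bits(data, parity_eqs=HAMMING_7_4):
    """Check bits of a systematic linear code over GF(2)."""
    return [sum(data[i] for i in eq) % 2 for eq in parity_eqs]


def encode_packet(flits):
    """Return (first packet, second packet) for a list of data flits.

    first packet  = data flits with their flit check bits (FCB) appended
    second packet = packet check bits (PCB) arranged as flits, each with
                    its checks on checks (CoC) appended
    """
    k = len(flits[0])
    # Step 1: column-encode every flit to obtain the FCB.
    fcb = [check_bits(f) for f in flits]
    # Step 2: row-encode one bit from each flit to obtain the PCB.
    rows = [[f[i] for f in flits] for i in range(k)]
    pcb_rows = [check_bits(r) for r in rows]
    # Packetize the PCB: each PCB column becomes one flit of the second packet.
    pcb_flits = [[pcb_rows[i][j] for i in range(k)]
                 for j in range(len(pcb_rows[0]))]
    # Step 3: column-encode the PCB flits to obtain the CoC.
    coc = [check_bits(p) for p in pcb_flits]
    first_packet = [f + c for f, c in zip(flits, fcb)]
    second_packet = [p + c for p, c in zip(pcb_flits, coc)]
    return first_packet, second_packet
```

Only `first_packet` would be sent initially; `second_packet` is held back until a retransmission request arrives. Because both codes are linear, column-encoding the PCB yields the same CoC as row-encoding the FCB, which is what lets either dimension of the product code check the other.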

5.4.2 Dual-Layer ECC Decoder

The dual-layer ECC decoding process is shown in Fig. 5.11a. In the decoding process, the flit belonging to the first packet transmission is decoded with a column decoder. If the first packet transmission has a detectable but uncorrectable error, a short NACK packet with opposite source/destination address is created to request transmission of the PCB and CoC packet; meanwhile, the coded packet and column error vector are stored in the network interface buffer. Otherwise, the uncoded data packet is delivered to the associated IP core.


If the received flits belong to the second packet transmission, those flits combined with the previously saved first packet and the corresponding column error vector are decoded by row decoders. Column decoding is applied to the first and second packet transmissions to improve the error detection and correction accuracy. The difference between the application of product code in dual-layer adaptive ECC and at the hop-to-hop level is highlighted with dark boxes in Fig. 5.11a. The data flow in the product code decoder of the network interface is shown in Fig. 5.11b. If there are no uncorrectable errors, step 1 is sufficient; otherwise, steps 1 and 2 are needed. The decoding process implementation in dual-layer ECC is different from the hop-to-hop level product code. In the hop-to-hop level application of the product code, the entire message arrives in each cycle. In contrast, only one flit (a piece of a packet) arrives at the network interface per cycle, so one column decoder (rather than multiple column decoders) is enough to execute the first decoding step. If there are uncorrectable errors in the first transmission packet, an array of column decoders is needed to obtain the final uncoded packet. To obtain the whole first packet transmission or the check bit packet, two NI buffers are necessary. Since these buffers can be shared with the existing NI data buffers, the cost is acceptable. The decoder architecture is shown in Fig. 5.11c.

Fig. 5.10 Dual-layer ECC: (a) flowchart, (b) encoding process, (c) encoder diagram


Fig. 5.11 Decoding process: (a) flowchart, (b) data flow in NI, (c) decoder architecture


5.4.3 Codec Compatibility

The data paths for end-to-end ECC and hop-to-hop ECC are compatible in our method. In the network interface, there is no difference between these two modes. This is achieved by the use of the product code: one dimension of the product code is treated as the hop-to-hop ECC. The router data flow is different for single- and dual-layer ECC modes. In the single-layer ECC mode, the received coded flit is directly forwarded to the next hop, as shown in Fig. 5.12a. In the dual-layer ECC mode, the coded flit is first examined by the hop-to-hop decoder; detected errors are immediately recovered at that hop; then, the error-free uncoded flit is encoded in the router output port before being transferred to the next hop, as shown in Fig. 5.12b. The hop-to-hop ECC codec in the router can be turned on/off by HopECC_en (shown in Fig. 5.7). If the hop-to-hop ECC function is inactive, the incoming flit is passed through the codec, discarding the syndrome (check bits) in the decoder (encoder) computation. One AND2 gate is needed to ensure that the syndrome/check bits are valid.

5.5 Experimental Results

5.5.1 Experimental Setup

We evaluate different transient error management methods with statistical average metrics (average energy per useful packet and average latency per useful packet) to consider both performance and reliability. We examined different error control methods in wormhole switching mesh NoCs with the uncoded flit size wf = 32 bits (a commonly used flit width [14]). The router and NI power and delay are obtained based on synthesized results from Synopsys Design Compiler using a TSMC 65 nm technology and 1 GHz clock frequency (assuming one cycle for signal propagation over 1.5-mm router-to-router links). The global wire parameters are shown in Table 5.3. Resistance and parasitic capacitance (R, Cg, and CC) were calculated using a 65 nm CMOS global interconnect model [15].


Fig. 5.12 Router data flow in (a) single-layer ECC and (b) dual-layer ECC modes

Table 5.3 Global link parameters
Parameter                 Value   Parameter                            Value
Width, WL (μm)            0.45    Dielectric constant                  2.2
Minimum spacing, S (μm)   0.45    R (Ω/mm)                             40.7
Thickness, t (μm)         1.20    Substrate capacitance, Cg (fF/mm)    82.0
Height, h (μm)            0.20    Coupling capacitance, CC (fF/mm)     73.2

We compare our dual-layer adaptive ECC to two fixed network-layer ECC methods, a non-cooperative dual-layer ECC method, and a datalink-layer adaptive ECC method. In Ref. [12], end-to-end ECC (a CRC4 code) is used to detect transient errors; a time-out mechanism is employed to retransmit the lost packets. In the end-to-end hybrid error detection and correction scheme [16], SECDED is applied to each packet to improve packet transmission reliability. SECDED is realized by extended Hamming code, which is capable of correcting single-bit errors and detecting double-bit errors. Forward error correction (BCH) in the network layer and hop-to-hop (SECDED) ECC in datalink layer are combined to perform non-cooperative dual-layer error control. The BCH code here is capable of correcting three-bit errors, which is more powerful than a SECDED used in each hop. Our previous hop-to-hop adaptive error control has demonstrated better average performance than fixed ECC [7]. In that work, hop-to-hop adaptive mode 1 uses Hamming code (39, 32) to detect two-bit errors and correct one-bit errors; hop-to-hop adaptive mode 2 uses two groups of Hamming code (22, 16) with interleaving to detect adjacent four-bit errors and correct adjacent two-bit errors. In our dual-layer ECC method, we use the extended Hamming code (39, 32) as an SECDED code in the column encoder/decoder in the network interface, executing single-error correction and double-error detection in mode 1 and three-error detection in mode 2. Extended Hamming codes (8, 4) and (22, 16) are used in the row encoder/decoder for 128-bit packet and 512-bit packet cases, respectively. In the following experiments, we assume the depth of the transmission buffer and


the latency to retransmit are equal to twice the number of cycles for a packet transferring over the diagonal of the mesh network. The bit error rate ε for a router-to-router link is modeled by the Gaussian pulse function (5.6) [1, 6, 7, 17].

$$\varepsilon = Q\left(\frac{V_{dd}}{2\sigma_N}\right) = \int_{V_{dd}/(2\sigma_N)}^{\infty} \frac{1}{\sqrt{2\pi}}\, e^{-y^2/2}\, dy \tag{5.6}$$

The noise voltage can be assumed to follow a normal distribution with standard deviation σN; Vdd is the supply voltage. Since the flit is sent over the network in multiple hops, the flit error rate is affected by noise voltage and the route length. Figure 5.13a shows that the ground voltage ('0') and supply voltage ('Vdd') are not ideal; instead, these voltages may vary within a σN range because of noise interference. If the signal voltage falls into the shadowed region, the logic value for that signal will be flipped; this results in a bit error. Figure 5.13b plots the bit error rate varying with noise deviation voltage σN, based on (5.6). In addition to the increasing noise deviation voltage, increasing packet width wp and route length h can also dramatically increase error accumulation in the packet, as shown in Fig. 5.14. The packet error rate is expressed in (5.7).

$$\text{Packet Error Rate} = 1 - (1 - \varepsilon)^{w_p h} \tag{5.7}$$
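The link error model of (5.6) and (5.7) is easy to evaluate numerically. The sketch below uses the standard identity Q(x) = erfc(x/√2)/2 from Python's `math` module; the function names are hypothetical, and `expm1`/`log1p` are used only to keep the packet error rate accurate when ε is very small.

```python
import math


def q_function(x: float) -> float:
    """Gaussian tail probability Q(x) = 0.5 * erfc(x / sqrt(2))."""
    return 0.5 * math.erfc(x / math.sqrt(2.0))


def bit_error_rate(sigma_n: float, vdd: float = 1.0) -> float:
    """Bit error rate of (5.6) for noise deviation voltage sigma_n."""
    return q_function(vdd / (2.0 * sigma_n))


def packet_error_rate(eps: float, wp_bits: int, hops: int) -> float:
    """Packet error rate of (5.7): 1 - (1 - eps)^(wp * h)."""
    return -math.expm1(wp_bits * hops * math.log1p(-eps))
```

With Vdd = 1 V and σN = 0.1 V, the argument of Q is 5 and the bit error rate is about 2.9e-7; the same value pushed through `packet_error_rate` for a 128-bit packet over four hops gives a packet error rate on the order of 1e-4, illustrating how quickly errors accumulate with packet width and route length.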

5.5.2 Reliability

Hop-to-hop ECC performs flit-level error control; end-to-end ECC executes packet-level error control. To ensure a fair comparison, we examine the residual errors in packets. The reliability of different error control schemes is compared using the metric of residual packet error rate (RPER), the probability of errors remaining in an accepted packet after using error control. Our method exhibits higher error resilience than simple network-layer ECC approaches and comparable reliability performance with datalink-layer adaptive ECC approaches. Assuming that each packet has four flits and the routing path length is four hops, the proposed mode 1 reduces the residual packet error rate by over two orders of magnitude over a wide noise deviation voltage range on links compared to end-to-end ECC approaches, as shown in Fig. 5.15a; the error resilience is further improved by two more orders of magnitude by the proposed mode 2. The proposed mode 1 (using end-to-end ECC) has less error resilience than the hop-to-hop adaptive mode 1; however, the residual packet error rate of the proposed mode 2 is close to that of the hop-to-hop adaptive mode 2 for a four-flit packet transferred over four hops, as shown in Fig. 5.15a. The end BCH and hop SECDED method has three-bit error correction capability at the destination and single-error

Fig. 5.13 Link error model: (a) error induced by noise voltage, (b) bit error rate with different noise deviation voltage (horizontal axis: noise deviation voltage for Vdd = 1 V; vertical axis: bit error rate)

correction and double-error-detection capability on each hop to prevent error propagation, achieving a comparable residual error rate, but this approach has larger energy (shown in Fig. 5.16) and area overhead (shown in Fig. 5.20). The closed-form expressions of the residual packet error rate (RPER) for the proposed error control method are given in (5.8–5.10), where ε is the probability of each wire independently experiencing an error; wp is the number of bits per packet

Fig. 5.14 Impact of noise deviation voltage and route length on flit error rate (axes: noise deviation voltage, number of hops, flit error rate)

(wp’ is for coded packet); wf is the number of bits per flit; h is the number of hops between source and destination nodes. 1wp wf w f h wf h 1 2 wf h2 þ e ð1 eÞ C B ð1 eÞ þ wf heð1 eÞ C B

2 C B 4h3 9h2 h 3 hðh 1Þ 2 3 C B wf h3 ¼1B þ wf w f e ð1 eÞ C C B 12 4 C B A @ hw w 1 w 2 f f f þ e3 ð1 eÞwf h2wf 3 6 0

wf h

RPERProposed

Mode 1

wf h1

(5.8)

RPERProposed

Mode

wp 0 wf 1 wf 2 3 wf 3 wf h1 ’ e ð1 eÞ 2 ¼ 1 ’ wp h 6 3 3 wp 0 wp 0 h wp h wf wf wf 1 wf 2 6 2wf 6 wf h2 e ð1 eÞ ’ 2 63 (5.9) h

0

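in which

$$\varphi = (1 - \varepsilon)^{w_f} + w_f\,\varepsilon\,(1 - \varepsilon)^{w_f - 1} + \frac{w_f (w_f - 1)}{2}\,\varepsilon^2 (1 - \varepsilon)^{w_f - 2} \tag{5.10}$$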

For a moderate noise condition (wf = 32, σN = 0.135, wp = 4·wf), we examine the impact of routing path length on reliability in Fig. 5.15b. As shown, the residual

Fig. 5.15 Impact of (a) noise deviation voltage, (b) routing path length and (c) packet size on residual packet error rate (markers in (b) and (c) are the same as those in (a); curves: End-to-End SECDED, End-to-End CRC, End BCH & Hop SECDED, Hop Adaptive Mode 1, Hop Adaptive Mode 2, Proposed Mode 1, Proposed Mode 2; horizontal axes: noise deviation voltage (V), number of hops, number of flits per packet)

packet error rate of end-to-end ECC methods increases more significantly than that of hop-to-hop ECC schemes and our proposed mode 2. Because it additionally uses the end-to-end ECC, our mode 2 achieves better reliability than the hop-to-hop adaptive mode 2 even though the error resilience in each hop of our method is weaker than that of hop-to-hop adaptive ECC. For a given noise condition and routing path length (σN = 0.135, h = 10), increasing the packet size increases the residual packet error rate of end-to-end approaches (shown in Fig. 5.15c)

5.5 Experimental Results

101

Average Energy per Useful Packet (pJ)

a No ECC

2500

End−End SECDED End−End CRC End BCH & Hop SECDED

2000

Hop Adaptive Mode 1 Hop Adaptive Mode 2 Proposed Mode 1 Proposed Mode 2

1500

1000

500 0.08

0.1

0.12

0.14

0.16

0.18

Noise Deviation Voltage (V)

b Average Energy per Useful Packet (pJ)

4000

End BCH & Hop SECDED Hop Adaptive Mode 1 Hop Adaptive Mode 2 Proposed Mode 1 Proposed Mode 2

3500 3000 2500 2000 1500 1000 500 0.14

0.16

0.18

0.2

0.22

Noise Deviation Voltage (V) Fig. 5.16 Impact of noise deviation voltage on average energy per useful packet: (a) a wide noise range and (b) high noise condition

as more undetectable and uncorrectable errors accumulate. By switching between single-layer ECC and dual-layer ECC, our method can tolerate a wide range of routing path lengths and packet sizes to achieve the system reliability requirement.

5.5.3 Energy Efficiency

5.5.3.1 Analysis

Considering both energy and reliability, we compare different methods using the metric average energy per useful packet, expressed in (5.11).

\[
\text{Avg. Energy per Useful Packet} = \frac{\text{Avg. Energy per Packet}}{1 - \text{Residual Packet Error Rate}} \tag{5.11}
\]

The average energy per packet is a function of module energies for the NI codec, link, router, and the probability of retransmission. For a typical end-to-end ECC scheme, the average energy per packet (expressed in (5.12)) consists of the energy consumed in the first-round transmission and the energy needed for retransmission if necessary. In the network interface, error control coding is performed on the entire packet. Each packet consists of multiple flits and is transferred over multiple hops; therefore, the link and router switching energy should be multiplied by the number of flits per packet and the total number of hops a packet has traveled. In addition, if retransmission occurs, the energy overhead induced by the NACK packet is also added to the total energy.

\[
\text{End2End Avg. Energy per Packet} = E_{NI\_Codec} + \frac{w_p}{w_f}\,h\,(E_{Link} + E_{Router}) + p_{end\_retrans}\left[E_{NI\_Codec} + h\,(E_{Link} + E_{Router})\left(\frac{w_p}{w_f} + 1\right)\right] \tag{5.12}
\]
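The following sketch shows one way to evaluate (5.11) and (5.12) numerically. It reads (5.12) as a first-round term plus a retransmission term weighted by the end-to-end retransmission probability, with one extra flit for the NACK packet. All function names and the energy values in the example are hypothetical placeholders, not measured data.

```python
def end2end_energy_per_packet(e_ni_codec, e_link, e_router,
                              w_p, w_f, h, p_end_retrans):
    """Average energy per packet for end-to-end ECC, following (5.12)."""
    flits = w_p / w_f  # flits per packet
    # First-round transmission: NI codec plus link/router energy per flit-hop.
    first_round = e_ni_codec + flits * h * (e_link + e_router)
    # Retransmission: codec again, resend the packet, plus one NACK flit.
    retrans = e_ni_codec + h * (e_link + e_router) * (flits + 1)
    return first_round + p_end_retrans * retrans


def energy_per_useful_packet(avg_energy, residual_packet_error_rate):
    """Average energy per useful packet, following (5.11)."""
    return avg_energy / (1.0 - residual_packet_error_rate)


if __name__ == "__main__":
    # Hypothetical module energies (arbitrary units), 4-flit packet, 4 hops.
    e = end2end_energy_per_packet(10.0, 1.0, 2.0, w_p=128, w_f=32,
                                  h=4, p_end_retrans=0.5)
    print(e, energy_per_useful_packet(e, 0.07))
```

Keeping the two functions separate mirrors the text: the residual packet error rate only enters through the division in (5.11).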

where ENI_Codec is the end-to-end codec energy in the network interface, ELink is the router-to-router link energy, ERouter is the energy consumed by the information extractor, the arbiter and the FIFOs in the input and output ports, and pend_retrans is the probability of end-to-end retransmission. Here, we assume the number of hops the NACK packet travels is equal to that of the original packet path. The hop-to-hop ECC approach has no codec in the network interface. Again, for a packet travelling through a multi-hop route, the sum of link, router and hop-to-hop ECC codec energies is multiplied by the number of flits per packet and the route length. Go-back-n retransmission is typically applied to hop-to-hop ECC to recover corrupted flits without stopping normal traffic injection, at the cost of retransmitting several error-free flits. In addition, the energy consumed by buffering the transmitted flits is another overhead. The average energy per packet that uses hop-to-hop ECC is expressed in (5.13).

\[
\text{Hop2Hop Avg. Energy per Packet} = \frac{w_p}{w_f}\,h\,(E_{Link} + E_{Router} + E_{Hop\_Codec}) + \frac{w_p}{w_f}\,h\,p_{hop\_retrans}\,R_N\,(E_{Link} + E_{Hop\_Codec} + E_{RetransBuf}) \tag{5.13}
\]
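A minimal sketch of (5.13) follows. It assumes that the go-back-n overhead multiplies the round-trip delay RN by the sum of link, hop codec and retransmission buffer energies; that grouping is an interpretation of the equation, and the function name and sample energies are hypothetical.

```python
def hop2hop_energy_per_packet(e_link, e_router, e_hop_codec, e_retrans_buf,
                              w_p, w_f, h, p_hop_retrans, r_n=4):
    """Average energy per packet for hop-to-hop ECC, following (5.13).

    r_n is the round-trip delay in cycles; the text uses R_N = 4 for a
    one-cycle link.
    """
    flit_hops = (w_p / w_f) * h  # every flit crosses every hop
    # Normal consumption, paid regardless of errors.
    normal = flit_hops * (e_link + e_router + e_hop_codec)
    # Go-back-n overhead: r_n flits are resent per retransmission event
    # (assumed grouping of the terms in the equation).
    overhead = flit_hops * p_hop_retrans * r_n * (
        e_link + e_hop_codec + e_retrans_buf)
    return normal + overhead


if __name__ == "__main__":
    # Hypothetical module energies (arbitrary units).
    print(hop2hop_energy_per_packet(1.0, 2.0, 0.5, 0.5,
                                    w_p=128, w_f=32, h=4, p_hop_retrans=0.1))
```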

where EHop_Codec is the hop-to-hop encoder and decoder energy in each router, phop_retrans is the probability of hop-to-hop retransmission and RN is the round-trip delay (unit: cycle). Assuming the link latency is one cycle, RN = 4 (one cycle for the encoder, one for the decoder and the remaining two for the forward and return link traversals). As emphasized in Ref. [18], the retransmission energy overhead should include the energy consumed by pushing and popping the transmitted flits in the output buffer, ERetransBuf. Retransmission energy is added to the normal energy consumption as the overhead induced by error recovery. The energy portion not divided by the term (1 − residual packet error rate) is the normal energy consumption, regardless of any error bit(s) in the packet. Our proposed method uses product codes and mode switching; consequently, the energy consumption expressions for the two ECC modes are modified as shown in (5.14) and (5.15), respectively.

\[
\text{Avg. Energy per Packet}_{\text{Proposed mode 1}} = E_{NI\_Encoder} + E_{NI\_Decoder\_I} + h\left\lceil\frac{w_p}{w_f}\right\rceil (E_{Link\_1} + E_{Router\_1}) + p_{end\_retrans}\left[E_{NI\_Decoder\_II} + h\,(E_{Link\_1} + E_{Router\_1})\left(\left\lceil\frac{w_{rc}}{w_f}\right\rceil + 1\right)\right] \tag{5.14}
\]

where ENI_Encoder is the energy consumed by the product code encoder in the network interface; ENI_Decoder_I and ENI_Decoder_II are the energies for the two portions of the decoder (step 1 and step 2 decoding, respectively, shown in Fig. 5.11); ELink_1 and ERouter_1 are the energies consumed by each link and router, respectively, without using hop-to-hop ECC; wrc is the total number of bits for checks on checks (CoC) and the check bits generated by row encoders. The end-to-end codec is not fully used for each packet transmission: ENI_Decoder_II is consumed only when the CoC bits are requested during the retransmission period to assist error correction step 2 (shown in Fig. 5.11). Since only CoC bits are transmitted during the retransmission phase, wrc/wf rather than wp/wf is used in the last term in (5.14). Although the average energy for our proposed mode 1 is composed of more terms than (5.12), our method does not consume energy for the second portion of the decoder if no retransmission is needed. Moreover, the number of flits for retransmission is reduced by wp/wf − wrc/wf. Note that typically wp > wrc. Besides replacing router and link energies with the ones using hop-to-hop ECC, the energy for the proposed method mode 2 (expressed in (5.15)) additionally includes the energy portion consumed by the hop-to-hop flit retransmission. As before, we use the go-back-n retransmission protocol.

\[
\text{Avg. Energy per Packet}_{\text{Proposed mode 2}} = E_{NI\_Encoder} + E_{NI\_Decoder\_I} + h\left\lceil\frac{w_p}{w_f}\right\rceil E_{ProposedHop} + p_{end\_retrans}\left[E_{NI\_Decoder\_II} + h\,E_{ProposedHop}\left(\left\lceil\frac{w_{rc}}{w_f}\right\rceil + 1\right)\right] \tag{5.15}
\]

where

\[
E_{ProposedHop} = E_{Link\_2} + E_{Router} + E_{Hop\_Codec} + p_{hop\_retrans}\,R_N\,(E_{Link\_2} + E_{Hop\_Codec}) \tag{5.16}
\]
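The two proposed-mode expressions share one structure and differ only in the per-hop energy, which the sketch below makes explicit. It reads the bracket symbols in (5.14) and (5.15) as ceilings and takes the per-hop energy as ELink_1 + ERouter_1 for mode 1 and as (5.16) for mode 2. Function names and the numeric inputs are hypothetical.

```python
from math import ceil


def proposed_hop_energy(e_link_2, e_router, e_hop_codec,
                        p_hop_retrans, r_n=4):
    """Per-hop, per-flit energy in mode 2, following (5.16)."""
    return (e_link_2 + e_router + e_hop_codec
            + p_hop_retrans * r_n * (e_link_2 + e_hop_codec))


def proposed_energy_per_packet(e_ni_encoder, e_ni_decoder_1, e_ni_decoder_2,
                               e_hop, w_p, w_rc, w_f, h, p_end_retrans):
    """Average energy per packet for the proposed modes, (5.14)/(5.15).

    e_hop is E_Link_1 + E_Router_1 for mode 1, or the value returned by
    proposed_hop_energy() for mode 2.
    """
    # First round: encoder, step-1 decoder and all flit-hops.
    first_round = e_ni_encoder + e_ni_decoder_1 + h * ceil(w_p / w_f) * e_hop
    # Retransmission: step-2 decoder, the CoC flits and one extra flit.
    retrans = e_ni_decoder_2 + h * e_hop * (ceil(w_rc / w_f) + 1)
    return first_round + p_end_retrans * retrans


if __name__ == "__main__":
    # Hypothetical energies (arbitrary units), 4 hops, 32-bit flits.
    mode1 = proposed_energy_per_packet(5.0, 3.0, 4.0, e_hop=1.0 + 2.0,
                                       w_p=128, w_rc=40, w_f=32, h=4,
                                       p_end_retrans=0.5)
    e_hop2 = proposed_hop_energy(1.0, 2.0, 0.5, p_hop_retrans=0.1)
    mode2 = proposed_energy_per_packet(5.0, 3.0, 4.0, e_hop=e_hop2,
                                       w_p=128, w_rc=40, w_f=32, h=4,
                                       p_end_retrans=0.05)
    print(mode1, mode2)
```

Passing the per-hop energy as a parameter keeps the mode 1 and mode 2 formulas in a single function, so the effect of a lower pend_retrans in mode 2 can be explored by changing one argument.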

As shown, Equation (5.15) adds hop-to-hop ECC overhead to Equation (5.14). However, the enhanced error control can reduce the probability of retransmission pend_retrans, so the average energy can be reduced. Although the proposed dual-layer adaptive ECC is more complex than other methods, the modules for hop-to-hop ECC and the step 2 end-to-end decoder can be disabled. If the hop-to-hop ECC function is inactive, the incoming flit is passed through the codec with the syndrome (in the decoder) and check-bit (in the encoder) computation discarded. One AND2 gate is needed to ensure the validity of the computed syndrome/check bits. Moreover, the use of product codes avoids executing error detection/correction on the entire packet; instead, the column decoding is performed at each hop. In addition, the type-II hybrid ARQ scheme incorporated with the dual-layer ECC significantly reduces the energy consumed in transferring parity check bits. Consequently, our method can achieve better energy performance than other methods.

5.5.3.2 Statistical Comparison

The impact of noise deviation voltage on the average energy per useful packet is shown in Fig. 5.16. By examining the energy consumption over a wide range of noise deviation voltages, we can see that end-to-end ECC is not energy-efficient. In the end-to-end SECDED, ECC encoding and decoding is executed on the entire packet, which requires large XOR trees and consumes significant power. Similarly, the CRC encoder/decoder has long XOR trees, but consumes less decoder power because there is no error correction circuit. The end-to-end error detection scheme (end-to-end CRC) consumes slightly less energy than the end-to-end SECDED in low noise conditions (e.g. σN < 0.12 with 4-flit packets and h = 4). The proposed method effectively reduces the number of retransmissions and chooses the most energy-efficient ECC codes; thus, our dual-layer method achieves lower average energy consumption than previous work if the deviation voltage is less than 22% of the supply voltage. As shown in Fig. 5.16a, the proposed method reduces the average energy per useful packet by up to 55%, 72%, 49% and 21% compared to end-to-end SECDED, end-to-end CRC, end BCH + hop SECDED, and hop-to-hop adaptive ECC, respectively. Here, we divide the average energy of different methods by that of our method. Each percentage reported is the maximum ratio. In low noise conditions (σN = 0.08), our method reduces the average energy by 7% compared to previous hop-to-hop adaptive ECC approaches.

Fig. 5.17 Impact of route length on average energy consumption. Vertical axis: average energy per useful packet (pJ); horizontal axis: number of hops. Curves: No ECC, End-End SECDED, End-End CRC, End BCH & Hop SECDED, Hop Adaptive Mode 1, Hop Adaptive Mode 2, Proposed Mode 1, Proposed Mode 2.

The average energy consumption for different methods in high noise regions is shown in Fig. 5.16b. As noise increases, the energy of the proposed mode 1 exceeds that of hop-to-hop adaptive ECC and end BCH + hop SECDED because of the large retransmission energy. However, our proposed mode 2, which uses the product-code-based hop-to-hop ECC and end-to-end ECC, consumes the least energy of all the compared methods. Given σN = 0.135 and wp = 4wf, the impact of the packet routing path length is shown in Fig. 5.17. The maximum examined routing path length is 30, which is the longest path for a 16 × 16 mesh NoC. The average energy consumption increases with the number of hops. As shown, the energy reduction achieved by our method is more significant in long paths than in short ones. This is because our method combines end-to-end ECC and hop-to-hop ECC to balance energy consumption and the capability to prevent error accumulation along the route. The energy reduction obtained by our method is shown for a range of packet sizes in Fig. 5.18. Here, σN = 0.135 and h = 10. The hop-to-hop ECC does not change as the packet size increases. In contrast, the end-to-end SECDED uses extended Hamming (137, 128) and extended Hamming (523, 512) for 4-flit and 16-flit packets, respectively. The codes CRC (132, 128) and CRC (516, 512) are used in the end-to-end CRC scheme for 4-flit and 16-flit packets, respectively. The three-bit error correction BCH codes for the end-to-end ECC are shortened BCH (152, 128) and BCH (542, 512) for 4-flit and 16-flit packets, respectively. As shown in Fig. 5.18, our method reduces the average energy by up to 36%; as packet size increases to 16 flits, our method achieves up to 20% energy reduction. For a large packet size, the ratio of our additional bits over the total packet is smaller than that for a small packet size; thus, our method is suitable for a large packet size.

Fig. 5.18 Impact of packet size on average energy consumption

5.5.3.3 Average Energy in Noise Variation Environment

The average energy is also examined on an 8 × 8 mesh NoC with four synthetic noise profiles, as shown in Fig. 5.19a. In this experiment, each packet has four flits and the error thresholds are set to five. The four noise profiles cover four different noise scenarios: low noise dominated, high noise dominated, even occurrence of low and high noise, and pseudo-random noise. We obtain the total energy of the NoC using different error control schemes, and normalize the energy to the energy of our proposed method. As shown in Fig. 5.19b, the proposed method consumes the least energy of the compared methods in the four examined scenarios. The end-to-end CRC approach yields over 50 times higher total energy than our method in the high noise dominated case. Using the dual-layer ECC adaptation, our method reduces the total energy by up to 45% and 25%, compared to end BCH + hop SECDED and hop-to-hop adaptive ECC, respectively.

5.5.4 Average Latency

Latency is defined as the time interval between a packet header leaving the source and that packet tail reaching the destination. Average latency per useful packet is used as a metric to evaluate the latency of different methods. A useful packet is a packet received at the destination error free after error control coding. The total latency for end-to-end ECC is defined in (5.17).

Fig. 5.19 Energy comparison in case study: (a) four noise profiles, (b) normalized total energy. Panel (a) plots σN (0 to 0.2) against execution time (0 to 20 million cycles) for four profiles: low noise dominated, high noise dominated, even low and high noise, and random noise.

\[
\text{End2End Avg. Latency} = \frac{\left(L_{NI} + h\,L_{Router} + \frac{w_p}{w_f}\right)\left(1 + M_R\,p_{end\_retrans}\right) + h\,L_{Router}\,M_R\,p_{end\_retrans}}{1 - \text{Residual Packet Error Rate}} \tag{5.17}
\]

in which the latency for each hop, LRouter, is given in (5.18).

\[
L_{Router} = L_{InPort} + L_{Arbiter} + L_{OutPort} + L_{Link} \tag{5.18}
\]

LNI is the latency induced by the ECC codec in the network interface and the transmission of the packet to the router; LInPort, LArbiter and LOutPort are the latency in the input buffer, arbiter and output buffer, respectively; LLink is the link latency; pend_retrans is the probability of retransmitting a packet from the source; MR is the number of retransmissions. As shown in Equation (5.17), the total latency consists of the latency for the original packet transmission over h hops (LNI + hLRouter + wp/wf), the latency for retransmission (LNI + hLRouter + wp/wf)·MR·pend_retrans and the latency of waiting for the acknowledgment packet hLRouter·MR·pend_retrans. For simplicity, we assume an error-free packet can be obtained by one retransmission (i.e. MR = 1). The total latency for hop-to-hop ECC is defined in (5.19).

\[
\text{Hop2Hop Avg. Latency} = \frac{L_{NI} + h\,L_{Router} + \frac{w_p}{w_f} + h\,\frac{w_p}{w_f}\,p_{hop\_retrans}\,R_N}{1 - \text{Residual Packet Error Rate}} \tag{5.19}
\]
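The latency models for end-to-end and hop-to-hop ECC can be sketched as small functions, which makes the role of each term easier to inspect. The function names, the default of one cycle per router stage, and the sample inputs are assumptions made for illustration only.

```python
def router_latency(l_in_port=1, l_arbiter=1, l_out_port=1, l_link=1):
    """Per-hop latency in cycles, following (5.18).

    The one-cycle defaults are an assumption for this sketch.
    """
    return l_in_port + l_arbiter + l_out_port + l_link


def end2end_latency(l_ni, l_router, w_p, w_f, h,
                    p_end_retrans, residual_per, m_r=1):
    """Average latency per useful packet for end-to-end ECC, (5.17)."""
    one_pass = l_ni + h * l_router + w_p / w_f
    # Original pass, expected retransmission passes, and ACK/NACK waiting.
    total = (one_pass * (1 + m_r * p_end_retrans)
             + h * l_router * m_r * p_end_retrans)
    return total / (1.0 - residual_per)


def hop2hop_latency(l_ni, l_router, w_p, w_f, h,
                    p_hop_retrans, residual_per, r_n=4):
    """Average latency per useful packet for hop-to-hop ECC, (5.19)."""
    flits = w_p / w_f
    # Go-back-n adds r_n cycles per retransmitted flit at each hop.
    total = (l_ni + h * l_router + flits
             + h * flits * p_hop_retrans * r_n)
    return total / (1.0 - residual_per)


if __name__ == "__main__":
    l_r = router_latency()
    print(end2end_latency(2, l_r, 128, 32, 4, 0.5, 0.0))
    print(hop2hop_latency(2, l_r, 128, 32, 4, 0.1, 0.0))
```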

Hop-to-hop ECC does not need to retransmit the packet over the entire route like the end-to-end ECC; instead, the retransmission is performed at each hop as necessary. Compared to Equation (5.17), hop-to-hop ECC does not incur the long latency induced by the acknowledgment packet, but it has the go-back-n retransmission latency along the route. The latency for our proposed mode 1 is the same as Equation (5.17), except that the wp for retransmission is equal to the size of the checks-on-checks plus an additional header flit. In contrast, the latency for mode 2 is different from Equation (5.19); a closed-form expression is given in Equation (5.20).

\[
\text{Avg. Latency}_{\text{Proposed mode 2}} = \frac{L_{NI} + h\left(L_{Router} + \frac{w_p}{w_f}\,p_{hop\_retrans}\,R_N\right) + \frac{w_p}{w_f} + h\,L_{Router}\,p_{end\_retrans} + p_{end\_retrans}\left[L_{NI} + h\left(L_{Router} + \frac{w_{rc}}{w_f}\,p_{hop\_retrans}\,R_N\right) + \frac{w_{rc}}{w_f}\right]}{1 - \text{Residual Packet Error Rate}} \tag{5.20}
\]
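As a sketch, the mode 2 latency in (5.20) can be read as one hop-to-hop pass for the wp-bit packet, an acknowledgment wait weighted by pend_retrans, and a second hop-to-hop pass for the wrc retransmitted bits. The helper and function names and the example numbers are hypothetical.

```python
def proposed_mode2_latency(l_ni, l_router, w_p, w_rc, w_f, h,
                           p_hop_retrans, p_end_retrans,
                           residual_per, r_n=4):
    """Average latency per useful packet for proposed mode 2, (5.20)."""

    def one_pass(bits):
        # Latency of sending `bits` over h hops with go-back-n per hop.
        flits = bits / w_f
        return l_ni + h * (l_router + flits * p_hop_retrans * r_n) + flits

    total = (one_pass(w_p)                          # original packet
             + h * l_router * p_end_retrans         # wait for ACK/NACK
             + p_end_retrans * one_pass(w_rc))      # resend check bits only
    return total / (1.0 - residual_per)


if __name__ == "__main__":
    # Hypothetical inputs: 4-flit packet, 2 flits of retransmitted checks.
    print(proposed_mode2_latency(2, 4, 128, 64, 32, 4,
                                 p_hop_retrans=0.1, p_end_retrans=0.5,
                                 residual_per=0.0))
```

With pend_retrans set to zero the expression collapses to the hop-to-hop form of (5.19), which is a quick consistency check on the reading used here.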

Fig. 5.20 Impact of noise deviation voltage on average latency. Vertical axis: average latency per useful packet (cycles); horizontal axis: noise deviation voltage (V). Curves: No ECC, End-End CRC, End-End SECDED, End BCH & Hop SECDED, Hop Adaptive Mode 1, Hop Adaptive Mode 2, Proposed Mode 1, Proposed Mode 2.

Although Equation (5.20) contains more terms than Equation (5.17) or Equation (5.19), mode 2 can still achieve lower latency than other methods by decreasing the residual packet error rate and pend_retrans. In the following results, we assume the router is implemented with a three-stage pipeline architecture (one stage for the input port, one for packet header information extraction and routing, one for the output port). For simplicity, we also assume the link between routers takes one cycle for signal propagation. Consequently, RN in (5.19) and (5.20) is four. Here, we also assume that there is no congestion-induced latency, because only small traffic loads are injected into the network. Figure 5.20 shows the packet latency achieved by different methods over a wide range of link noise conditions. In the end-to-end CRC method, only error detection is performed in the destination node. As noise increases, the increasing probability of retransmission results in large packet latency. In end-to-end SECDED, single-bit errors are corrected by the decoder; as more and more errors become detectable but uncorrectable in high noise conditions, the packet latency increases dramatically. The proposed method achieves latency similar to that of other approaches in the low noise region, but reduces the latency by up to 64% and 43% compared to the end-to-end CRC and end-to-end SECDED, respectively, in the high noise region. As the noise voltage increases, our proposed mode 1 achieves a 5% latency reduction over the hop-to-hop adaptive ECC. Over a wide range of routing path lengths, the latency of our method is close to that of the end BCH + hop SECDED method, but the latter consumes more energy than our method. The impact of packet size on latency is shown in Fig. 5.21. When packet size varies, the average latency of our method is nearly constant.

Fig. 5.21 Impact of packet size on average latency. Vertical axis: average latency per useful packet (cycles); horizontal axis: number of flits per packet. Curves: No ECC, End-End SECDED, End-End CRC, End BCH & Hop SECDED, Hop Adaptive Mode 1, Hop Adaptive Mode 2, Proposed Mode 1, Proposed Mode 2.

5.5.5 Codec Delay

The codec delay comparison is given in Table 5.4. Two cases are compared: 128-bit and 512-bit packets. The codec is written in Verilog HDL, and the reported delay is based on the synthesized netlist in a 65 nm TSMC technology. As shown, the codec delay of the end-to-end SECDED is the worst among the compared approaches, because of the long XOR trees for check-bit and syndrome computation. Comparing the delay for encoder and decoder, one can see that the critical path for the codec is in the decoder. Consequently, our codec can run faster than end-to-end SECDED and hop-to-hop BCH. Although our approach in mode 2 has worse codec delay than end-to-end CRC and hop-to-hop adaptive ECC, the proposed method has better performance and energy.

Table 5.4 Codec delay comparison (ns)

ECC schemes            128-bit packet        512-bit packet
                       Encoder   Decoder     Encoder   Decoder
End-to-end CRC         0.53      0.53        0.65      0.65
End-to-end SECDED      0.9       0.99        0.98      1.40
Hop-to-hop adaptive    0.3       0.53        0.3       0.53
Proposed mode-1        0.40      0.65        0.40      0.65
Proposed mode-2        0.66      0.84        0.80      0.97

5.5.6 Codec Area Comparison

In end-to-end ECC approaches, the error control codec is in the network interface (NI). In hop-to-hop ECC approaches, the ECC codec overhead is in each router port. Both NI and router have ECC codecs in our dual-layer adaptive ECC. The codec areas of different error control methods are compared in Table 5.5. For a mesh/torus router, five input/output ports are needed, so the reported router codec area is five times the total area of the hop-to-hop ECC encoder and decoder (note this is still a small percentage of the overall router area). The end-to-end ECC codec changes with packet size, with larger packets consuming more codec area. In contrast, the hop-to-hop ECC codec remains unchanged because hop-to-hop ECC is performed on each flit in a packet. For a 128-bit packet size, Table 5.5 shows that the end-to-end CRC scheme has the least area, because it has no error correction circuit; the area of our proposed ECC is about 1.8X that of end-to-end SECDED and about 63% larger than that of hop-to-hop adaptive ECC. However, our proposed ECC consumes 90% less area than the non-cooperative dual-layer ECC (i.e. end BCH + hop SECDED). For a 512-bit packet size, the end-to-end SECDED method requires 12% more area than our method; the end BCH + hop SECDED requires about 28X the codec area of our method. Figure 5.22 compares the total area of each node for different error control methods. The node area includes the NI codec and the router. As shown, for a 128-bit packet size, the approaches containing hop-to-hop ECC have larger area overhead than the methods performing only end-to-end ECC. As the packet size increases to 512 bits, our dual-layer ECC outperforms the end-to-end SECDED and the non-cooperative dual-layer ECC.

Table 5.5 Codec area (µm²)

ECC schemes                              128-bit packet          512-bit packet
                                         Encoder    Decoder      Encoder    Decoder
End-to-end CRC in NI                     549        577          1,956      1,985
End-to-end SECDED in NI                  1,131      7,867        6,030      37,323
End BCH and Hop SECDED       Router      985        2,125        985        2,125
                             NI          2,792      163,439      10,969     1,074,526
Hop-to-hop adaptive ECC in router        2,610      7,485        2,610      7,485
Proposed ECC                 Router      985        2,125        985        2,125
                             NI          4,764      8,558        13,331     22,350
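The area ratios quoted above can be re-derived from the Table 5.5 entries by summing encoder and decoder areas (router plus NI where both exist). The short script below is only an arithmetic cross-check; the dictionary layout and names are assumptions of this sketch, and the numbers are the table values as read here.

```python
# Total codec area per scheme (encoder + decoder, router + NI), Table 5.5.
AREA = {
    128: {
        "secded": 1131 + 7867,
        "end_bch_hop_secded": 985 + 2125 + 2792 + 163439,
        "hop_adaptive": 2610 + 7485,
        "proposed": 985 + 2125 + 4764 + 8558,
    },
    512: {
        "secded": 6030 + 37323,
        "end_bch_hop_secded": 985 + 2125 + 10969 + 1074526,
        "hop_adaptive": 2610 + 7485,
        "proposed": 985 + 2125 + 13331 + 22350,
    },
}


def ratio(bits, scheme, reference="proposed"):
    """Area of `scheme` divided by area of `reference` for a packet size."""
    return AREA[bits][scheme] / AREA[bits][reference]


if __name__ == "__main__":
    print("128-bit: proposed / SECDED       =", ratio(128, "proposed", "secded"))
    print("128-bit: proposed / hop adaptive =", ratio(128, "proposed", "hop_adaptive"))
    print("128-bit: proposed / BCH+SECDED   =", ratio(128, "proposed", "end_bch_hop_secded"))
    print("512-bit: SECDED / proposed       =", ratio(512, "secded"))
    print("512-bit: BCH+SECDED / proposed   =", ratio(512, "end_bch_hop_secded"))
```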

5.5.7 Case Study

Realistic traffic injections are used to evaluate different error control methods. We produce the traffic traces between L1 caches and routers from the Black-Scholes, x264 and Canneal applications in the PARSEC benchmark suite [13]. The Black-Scholes and Canneal applications exhibit randomly imbalanced traffic loads among different routers; the hot spots in the x264 application have predictable distributions. Therefore, we evaluated the average energy and latency using the Black-Scholes and x264 traffic traces in Figs. 5.23–5.26. The NoC used in this case study is a 4 × 4 mesh, each packet has 16 flits, and the end error thresholds are reported in Table 5.1. We ran each simulation for 100,000 cycles using our cycle-accurate simulator [19]. Different numbers of errors are randomly injected into the NoC links. As shown in Figs. 5.23 and 5.24, when a small number of errors is injected, our method consumes energy comparable to that of no ECC, end-to-end CRC and end-to-end SECDED; end BCH + hop SECDED consumes 52% and 29% more energy than our method in the Black-Scholes and x264 applications, respectively. As the number of injected errors increases, the energy of the end-to-end CRC and end-to-end SECDED increases significantly. In contrast, our method maintains its energy efficiency. Similarly, our method achieves the best average latency even when the traffic load and the number of injected errors change, as shown in Figs. 5.25 and 5.26. The conclusions we draw from the theoretical results shown in Figs. 5.16–5.19 are confirmed by this case study. We have also evaluated the proposed method with a dependent error model [4], which assumes that an erroneous wire can cause errors in neighboring wires with a probability b. As shown in Figs. 5.27 and 5.28, the proposed method consumes less energy and achieves similar or better latency than other methods even in the dependent error case. Compared to the performance evaluated with the independent error model, the reduction in energy and latency decreases when b = 0.2. This is because the code currently used in the product code is an extended Hamming code. To handle dependent errors, multi-group Hamming codes [2] with interleaving are a possible solution.

Fig. 5.22 Area comparison of each node
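To make the dependent error model concrete, the sketch below draws one error vector for a link: each wire first fails independently with probability eps, and each failed wire then corrupts each immediate neighbor with probability b. This is only one plausible reading of the model described in the text; the function name, the restriction to immediate neighbors, and the parameter values are assumptions.

```python
import random


def dependent_error_vector(n_wires, eps, b, rng):
    """Return a list of booleans marking erroneous wires on one link.

    Primary errors are independent with probability eps; every wire with
    a primary error flips each adjacent wire with probability b.
    """
    primary = [rng.random() < eps for _ in range(n_wires)]
    errors = list(primary)
    for i, hit in enumerate(primary):
        if not hit:
            continue
        for j in (i - 1, i + 1):  # immediate neighbors only (assumption)
            if 0 <= j < n_wires and rng.random() < b:
                errors[j] = True
    return errors


if __name__ == "__main__":
    rng = random.Random(2012)  # fixed seed for repeatability
    trials = 10000
    for b in (0.0, 0.2):
        total = sum(sum(dependent_error_vector(32, 0.01, b, rng))
                    for _ in range(trials))
        print(f"b={b}: average erroneous wires per flit = {total / trials:.3f}")
```

Neighbor-induced errors tend to appear in adjacent positions, which is the kind of pattern that interleaving across multiple code groups is meant to spread out.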

Fig. 5.23 Average energy for Black-Scholes application

Fig. 5.24 Average energy for x264 application

Fig. 5.25 Average latency for Black-Scholes application

Fig. 5.26 Average latency for x264 application

Fig. 5.27 Average energy evaluation using dependent error model

Fig. 5.28 Average latency evaluation using dependent error model

Fig. 5.29 Impact of end error threshold on average energy per packet

The impact of end error threshold on the average energy reduction is shown in Fig. 5.29. By comparing our energy consumption to that of the hop adaptive ECC approach, we can see that the medium error threshold we selected can improve the energy reduction from 30%, 5% and 19% to 36% (Black-Scholes), 9% (x264) and 23% (Canneal), respectively, compared to the minimal end error threshold. This is because the medium error threshold can tolerate the ECC mode switching oscillation caused by the asymmetric traffic load and some burst errors.

5.6 Summary

In this chapter, we address the variable error rate for transient link errors using dual-layer cooperative error control, spanning both the datalink and network layers. The error detection outcomes and analyses from the datalink and network layers are exchanged to determine the appropriate error control scheme for the current noise conditions. In low noise conditions, network-layer end-to-end ECC is used to reduce the codec and link energy overhead at each hop; in high noise conditions, datalink-layer hop-to-hop ECC is enabled to improve the error resilience at each hop and reduce error propagation and accumulation in each packet. A single-layer and dual-layer ECC mode switching protocol is proposed to enable switching between modes without interrupting dataflow. To reduce the overhead of this dual-layer cooperative ECC, we exploit product codes for datalink- and network-layer transient error management. Compared to previous solutions, the proposed method reduces residual packet error rate by up to four orders of magnitude, achieves up to 72% energy reduction and improves average latency by up to 64%. The energy and latency reduction benefits are maintained as the routing path length and packet size increase, at the cost of a moderate increase in area overhead. For a large-scale NoC, the proposed ECC mode switching may not be efficient. In the future, regional ECC mode switching could reduce the latency induced by mode switching over the entire network, and thus further trade off reliability and energy consumption.

References

1. Bertozzi D, Benini L, De Micheli G (2005) Error control schemes for on-chip communication links: the energy-reliability tradeoff. IEEE Trans Computer-Aided Design of Integr Circuits and Syst (TCAD) 24:818–831
2. Zimmer H, Jantsch A (2003) A fault model notation and error-control scheme for switch-to-switch buses in a network-on-chip. In Proc CODES+ISSS'03 188–193
3. Lehtonen T, Liljeberg P, Plosila J (2007) Analysis of forward error correction methods for nanoscale networks-on-chip. Proc Nano-Net 1–5
4. Fu B, Ampadu P (2009) On Hamming product codes with type-II hybrid ARQ for on-chip interconnects. IEEE Trans Circuits Syst I: Regular Papers 56:2042–2054
5. Rossi D, Angelini P, Metra C (2007) Configurable error control scheme for NoC signal integrity. In Proc IOLTS'07 43–48
6. Li L, Vijaykrishnan N, Kandemir M, Irwin MJ (2003) Adaptive error protection for energy efficiency. In Proc ICCAD'03 2–7
7. Yu Q, Ampadu P (2009) Adaptive error control for nanometer scale NoC links. IET Computers & Digital Tech-Special issue on advances in nanoelectronics circuits and syst 3:643–659
8. Duan C, Cordero V, Khatri SP (2009) Efficient on-chip crosstalk avoidance CODEC design. IEEE Trans Very Large Scale Integr (VLSI) Syst 17:551–560
9. Fu B, Ampadu P (2010) Exploiting parity computation latency for on-chip crosstalk reduction. IEEE Trans Circuits Syst II: Express Briefs 57:399–403
10. Ganguly A, Pande PP, Belzer B, Grecu C (2008) Design of low power & reliable networks on chip through joint crosstalk avoidance and multiple error correction coding. J Electron Test 24:67–81
11. Murali S, Theocharides T, Vijaykrishnan N, Irwin MJ, Benini L, De Micheli G (2005) Analysis of error recovery schemes for networks on chips. IEEE Design & Test of Computers 22:434–442
12. Ali M, Welzl M, Hessler S, Hellebrand S (2007) An efficient fault tolerant mechanism to deal with permanent and transient failures in a network on chip. Intl J High Performance Syst Archi 1:113–123
13. PARSEC benchmark [Online]: http://parsec.cs.princeton.edu
14. Salminen E, Kulmala A, Hämäläinen TD (2008) Survey of network-on-chip proposals. White paper, OCP-IP 1–13
15. Arizona State University, Predictive Technology Model [Online]: http://www.eas.asu.edu/~ptm
16. Sanusi A, Bayoumi MA (2009) Smart-flooding: A novel scheme for fault-tolerant NoCs. In Proc IEEE SoC Conf 259–262
17. Sridhara S, Shanbhag NR (2005) Coding for system-on-chip networks: a unified framework. IEEE Trans Very Large Scale Integr (VLSI) Syst 13:655–667
18. Lan Y-C, Chen MC, Chen W-D, Chen S-J, Hu Y-H (2009) Performance-energy tradeoffs in reliable NoCs. In Proc ISQED'09 141–146
19. Yu Q, Ampadu P (2010) A flexible parallel simulator for networks-on-chip with error control. IEEE Trans on Computer-Aided Design of Integr Circuits and Syst (TCAD) 29:103–116

Chapter 6

A Flexible Parallel Simulator for Networks-on-Chip with Error Control

To fill the gap between NoC simulator implementation and NoC error control exploration, we develop an NoC simulator that facilitates comprehensive investigation of the impact of different error control methods on NoC performance and energy consumption. The main functionality of the proposed simulator, plug-and-play error control coding (ECC) insertion and the flexible fault injection environment are introduced in this chapter. Energy estimation and improvements in simulation speed and memory consumption are analyzed as well.

6.1 Existing Simulators

General network simulators (e.g., ns-2) were previously used to estimate NoC performance [1–3]. Recently, many research groups have proposed specific modeling methods and frameworks to assist in NoC development. Orion was proposed to evaluate network power and latency with different traffic patterns and flow control parameters [4]. On-chip communication network (OCCN) [5] employs a multi-layer NoC modeling methodology; many application programming interfaces (APIs) are well defined in OCCN to assist communication between different layers. The framework for power/performance exploration (PIRATE) can be used to examine the impacts of different topologies and traffic injection rates on the average throughput and power for on-chip interconnects [6]. A system-level NoC modeling framework integrates a transaction-level model and an analytical wire model to evaluate throughput and power of NoCs [7]. Thid [8] introduced a layered-network simulator kernel, named Semla. Using the Semla kernel, Lu et al. further proposed the Nostrum NoC simulation environment (NNSE) to explore the design space for Nostrum NoC [9, 10]. Commercial design kits (e.g., NoCexplorer and NoCcompiler [11] provided by Arteris) allow definition of NoC interfaces, quality of service (QoS) requirements, and topologies; they are also capable of estimating area and performance in the network. Various routing algorithms, NoC sizes and arbitrary traffic injection rates are supported in the Noxim NoC simulator [12]. Different QoS levels and parameters for network traffic are investigated in the Nirgam NoC interconnect routing and application modeling tool [13]. These simulators are useful for design space exploration and power estimation; unfortunately, they have not explored error control features in NoCs.

Q. Yu and P. Ampadu, Transient and Permanent Error Control for Networks-on-Chip, DOI 10.1007/978-1-4614-0962-5_6, © Springer Science+Business Media, LLC 2012

6.2 Platforms for Error Control Evaluation

To improve reliability of on-chip communication, NoCs have employed various error control methods – such as forward error correction (FEC), error detection combined with retransmission (using automatic repeat request (ARQ) protocol), and hybrid ARQ. Retransmission protocols include stop-and-wait, go-back-N and selective repeat. Most prior work evaluates error control schemes based on a point-to-point communication architecture, without considering the impact of traffic injection on the evaluation results. Although realistic traffic injections have been considered in the evaluation in Ref. [14], probability equations for fault presence are used to estimate the energy consumption, rather than throughputs derived from realistic fault injection in an emulated NoC. Impact of fault injection location on performance has been examined on the link between IP core and router [15], instead of the entire NoC. In Ref. [16], Ali et al. have modified the ns-2 simulator to assess their fault tolerant routing protocol; however, only fault injection rate is tunable in their simulation. Researchers [17, 18] have also suggested that different error control strength should be provided to protect header or payload flits of a packet. The immense number of variables in NoC design with error control motivates the need for a comprehensive NoC simulator that allows evaluation of error control schemes against different traffic injection patterns, fault injection rates, fault types, fault injection locations and faulty flit types.

6.3 Overview of the Proposed Simulator

To fill the gap between NoC simulator implementation and NoC error control exploration, we developed an NoC simulator [19], shown in Fig. 6.1a, that facilitates comprehensive investigation of the impact of different error control methods on NoC performance and energy consumption. Our simulator allows plug-and-play error control coding (ECC) insertion and provides some typical error control codecs. As shown in Fig. 6.1b, the NoC specification, traffic trace and noise characteristics are given through a user-friendly Java interface [20]. A parameterized NoC is modeled in the simulator, in which one can define the number of NoC nodes (NoC size), select an NoC topology, choose a routing algorithm and retransmission protocol, and assign the round-trip delay for switch-to-switch retransmission. Two popular NoC topologies, mesh and torus, are available in the simulator. Deadlock-free XY deterministic routing and some partially adaptive routing algorithms are built in. The retransmission protocols include stop-and-wait and go-back-N ARQ and HARQ.


Fig. 6.1 Proposed simulator: (a) data flow, (b) NoC configuration interface, (c) fault injection interface


Fig. 6.2 Input and fault vectors generation

The noise profile is further processed to create error injection files for the simulator. Our flexible fault injection environment, shown in Fig. 6.1c, assists error control exploration for specific purposes, such as how to offer strong but energy-efficient protection for header flits, or whether to favor high reliability on router-to-router links or on router-to-NI links. Using the fault injection parameters, our simulator creates a fault injection profile. Plug-and-play error control coding insertion is realized by adding an ECC module in the NoC specification tab (shown in Fig. 6.1b). Two free compilers should be available on the execution machine: gcc for compiling C files, and mpicc for compiling MPI and C files (parallel code running on a multiprocessor server). In addition, special libraries (e.g. the GNU Scientific Library [21]) are needed to generate random numbers.

The extensive number of simulation variables potentially requires prohibitive simulation time and system resources, and the problem becomes worse as the number of NoC nodes increases [10]. To resolve this challenge, we use C and the message passing interface (MPI) to schedule a parallel simulation on a multiprocessor server. In addition, we exploit the multiprocessor environment to inject faults in a time- and memory-efficient manner, and to model an NoC-based system composed of heterogeneous IP cores for the exploration of adaptive error control schemes. Figure 6.2 shows the flowchart for traffic and fault vector generation. Traffic injection in IP cores is further described in Sect. 6.5. The index-based fault injection method that efficiently generates fault vectors is discussed in Sect. 6.6. The NoC framework modeled in the proposed simulator consists of routers, network interfaces (NIs), and IP cores. These three components can be connected


using a number of different topologies. In this chapter, we are interested in the torus topology, which has been shown to be one of the most energy-efficient topologies. The operation of each component can be modeled in a physically parallel manner using a multiprocessor server (e.g. TeraGrid IA-64 [22]). The routers can be simulated concurrently; each router/IP core can also be independently controlled in simulation, rather than applying the same error control to the entire design. This feature is particularly useful because IP cores in an NoC-based chip multiprocessor (CMP) system are typically heterogeneous and have different characteristics (e.g. data format, send/receive data, error resilience requirement, and I/O bandwidth).

6.4 Error Control Modeling in Router and Network Interface

In torus NoCs, the router typically has five ports: north, south, east and west (connected to neighboring routers) and local (connected to an IP core), plus a crossbar switch for routing and input/output port connections. Each port has one input channel and one output channel, consisting of incoming (and outgoing) data buffers and error control modules. The crossbar extracts the destination address from the received packet and directs the packet to the appropriate output channel for its next hop. Meanwhile, the crossbar also checks the availability of the destination port (i.e., sink) to avoid buffer overflow. If resource contention occurs, a round-robin arbitration method is employed to ensure fair access to the ports.
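The round-robin arbitration mentioned above can be illustrated with a minimal Python sketch. It is not taken from the simulator (which is written in C and MPI); the class name `RoundRobinArbiter` and its interface are hypothetical and serve only to show how priority rotates among the five ports after each grant.

```python
class RoundRobinArbiter:
    """Illustrative round-robin arbiter for one output port (hypothetical API)."""

    def __init__(self, num_ports=5):
        self.num_ports = num_ports
        # Start so that port 0 has the highest priority on the first grant.
        self.last_granted = num_ports - 1

    def grant(self, requests):
        """Return the granted input port index, or None if nobody requests.

        `requests` is an iterable of input port indices competing for
        the same output port in the current cycle.
        """
        pending = set(requests)
        # Scan ports starting just after the most recently granted one.
        for offset in range(1, self.num_ports + 1):
            candidate = (self.last_granted + offset) % self.num_ports
            if candidate in pending:
                self.last_granted = candidate
                return candidate
        return None


arbiter = RoundRobinArbiter()
grants = [arbiter.grant([0, 2]) for _ in range(4)]  # ports 0 and 2 alternate
```

Because the search always starts after the last winner, two ports that request continuously are served alternately, which is the fairness property the text refers to.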

6.4.1 Error Control in Router

Unlike previous simulators (e.g. [4, 6, 9, 12, 13]), the proposed NoC simulator embeds an ECC module in routers and NIs to support different error control schemes, including various FEC, error detection combined with ARQ, and HARQ. FEC always attempts to correct any detected errors, even when the number of errors is beyond its correction capability. ARQ requests a retransmission when any errors are detected. HARQ tries to correct any detected errors; if the number of errors exceeds the codec's error correction capability, retransmission is needed. The included ECC module is extendable: new error control codecs can be easily added to it in a plug-and-play fashion. For convenience, typical error control codes such as even-parity check code (PAR), Hamming code (HM), extended Hamming code (EHM), cyclic redundancy code (CRC), duplicated-add-parity (DAP), SEC-DED code [14, 15] and configurable error control codes [18, 23] are provided in the ECC module. Users can specify their desired error control codec using a global parameter or convey ECC selection information in the header flit of a packet. The various error control strategies are integrated with flow control,


Fig. 6.3 Components of router with error control: (a) input channel, (b) output channel, (c) crossbar switch

routing algorithm and buffer management, as well. To evaluate the performance of an NoC without error control, ECC-related features can be easily disabled using bypass paths. Error detection/correction is executed in the input channel, which accepts data from neighboring routers or its local IP core via the Flit_in vector if no error is detected. As shown in Fig. 6.3a, the input channel first executes ECC decoding if the incoming buffer is not full; otherwise no decoding is necessary. Only "error-free" data can be propagated to the buffer and then forwarded to the next hop. Here, "error-free" has a different meaning under each error control method. For the case of no ECC or FEC, all received data are treated as error-free whether or not error bits exist, since no retransmission is allowed. In contrast, for the case of ECC combined with ARQ, flits without errors or containing undetectable errors are flagged as "error-free"; otherwise,


Err is set high and the buffer control module is informed of this error event. The HARQ scheme asserts Err only when the detected error cannot be corrected by the current decoder. Because of ECC insertion, writing to the buffer has the additional constraint that the input data must be error-free, in addition to the requirements of available buffer space and permission to use the intended output port (indicated by Port_admission). A flit that is not propagated requests retransmission via the NACK signal.

The output channel processes the incoming data in a similar but opposite fashion. As shown in Fig. 6.3b, the ECC encoder is placed after the buffer, so that a smaller buffer is needed because the check bits of the coded flit are not stored. The buffer structure in the output channel changes with the error control scheme. For instance, the error recovery scheme using go-back-N retransmission is easily implemented with a circular shifter, so the previous flit can be retransmitted at the moment that the feedback NACK arrives. If no retransmission is needed in the employed error control scheme, traditional FIFO buffers work well for buffering incoming flits from the crossbar switch. Our simulator provides multiple buffer structures for different error control schemes. The module describing the buffer structure is extendable, too, so the exploration of new buffer structures is supported.

The crossbar switch creates an interconnect path between the input and output channels, according to the address field indicated in the packet header and the routing algorithm employed. As shown in Fig. 6.3c, the crossbar has five destination-port computation blocks (Dest_port Comp.), one for each port. The Dest_port Comp. block detects the header/tail flit and computes the intended output port ID based on the employed routing algorithm. If a header is detected, a port-reservation request is sent to the destination port reservation block (Dest_port Reservation), in which the algorithm that resolves network contention can be explored. In the simulator, we implement a round-robin method to grant access to the output ports fairly. The port reservation is cancelled as soon as the tail flit of the current packet is successfully transferred.
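The meaning of "error-free" in the input channel under each error control method, as described in this subsection, can be summarized in a short Python sketch. The function name `input_channel_decision` and its boolean arguments are hypothetical; the sketch only restates the rules in the text (Err is never raised for no-ECC/FEC, is raised on any detected error for ARQ, and only on uncorrectable detected errors for HARQ).

```python
def input_channel_decision(scheme, detected, correctable,
                           buffer_has_space=True, port_admission=True):
    """Return (write_enable, nack) for one received flit.

    scheme      : "none", "fec", "arq" or "harq" (labels assumed here)
    detected    : True if the decoder detected an error in the flit
    correctable : True if the detected error is within correction capability
    """
    if scheme in ("none", "fec"):
        err = False                      # no retransmission is allowed
    elif scheme == "arq":
        err = detected                   # any detected error raises Err
    elif scheme == "harq":
        err = detected and not correctable
    else:
        raise ValueError(f"unknown scheme: {scheme}")

    # A flit is written only if it is "error-free", the buffer has room,
    # and the intended output port has been granted (Port_admission).
    write_enable = (not err) and buffer_has_space and port_admission
    nack = err                           # non-propagated flit asks for a resend
    return write_enable, nack
```

Note that an undetectable error simply appears as `detected=False`, so the flit is flagged as "error-free" under every scheme, in line with the description above.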

6.4.2 Error Control in Network Interface

The network interface (NI) is the gateway between IP cores and the network. As shown in Fig. 6.4, NIs have only one pair of input and output channels and do not have a crossbar switch. In addition, NIs use buffers to manage the frequency difference between IP cores and NoC link switching. In packet-switched NoCs, NIs packetize the data stream into packets with a user-defined packet length and provide each packet with a header flit and a tail flit. In our simulator, we assume the packet format shown in Fig. 6.5. A header flit includes information such as source address (Src ID), destination address (Dest ID), and routing algorithm. Payload and tail flits carry data. The tail flit has a tail bit to indicate the end of a packet. End-to-end ECC (End ECC) is usually executed on the original packets before packetization/after depacketization; hop-to-hop error control (Hop ECC) is


Fig. 6.4 Network interface

Fig. 6.5 Packet format

executed in the input/output channel, as in routers. Typically, the code used in end-to-end ECC is more powerful than that in hop-to-hop ECC, because the former is used in a forward error correction manner. In contrast, simple codes combined with retransmission are more efficient for hop-to-hop ECC. Both end-to-end and hop-to-hop ECC can be disabled by bypassing the data path.
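The packetization performed by the NI can be sketched as follows. This is an illustrative Python model, not the simulator's code: the dictionary-based flit representation and the function name `packetize` are assumptions, and the field layout only mirrors the items named in the text (Src ID, Dest ID, routing algorithm, data, tail bit).

```python
def packetize(data_words, src_id, dest_id, packet_length, routing="XY"):
    """Split a data stream into packets of `packet_length` flits each.

    Each packet has one header flit followed by data-carrying flits;
    the last flit of every packet has its tail bit set.
    """
    if packet_length < 2:
        raise ValueError("a packet needs a header flit and a tail flit")
    words_per_packet = packet_length - 1     # the header carries no data
    packets = []
    for start in range(0, len(data_words), words_per_packet):
        chunk = data_words[start:start + words_per_packet]
        header = {"type": "header", "src": src_id,
                  "dest": dest_id, "routing": routing}
        body = [{"type": "payload", "data": w, "tail": False} for w in chunk]
        body[-1]["type"] = "tail"            # final flit closes the packet
        body[-1]["tail"] = True
        packets.append([header] + body)
    return packets


# Ten data words with six-flit packets yield two packets of six flits each.
example = packetize(list(range(10)), src_id=0, dest_id=7, packet_length=6)
```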

6.5 Flexible Fault and Traffic Injection

The fault injection profile is generated based on four key parameters: fault injection location, fault type, faulty flit type and fault injection rate. With these parameters, various noise scenarios can be modeled to evaluate different error control schemes. As shown in Fig. 6.1c, the given bit error rate indicates the probability of each wire experiencing an error. Because interconnect errors are typically modeled as a noise deviation voltage added to the supply voltage or ground line, the noise deviation voltage can be specified as an alternative to the bit error rate.


Fig. 6.6 Fault injection on the NoC-based system

6.5.1 Fault Injection Location

In this simulator, we only consider faults occurring on interconnect links. Fault injection location is used to differentiate router-to-router interconnects from router-to-NI interconnects. Global link faults represent faults in the global links between routers; local link faults are injected on the link between a router and an NI. These two locations for fault injection are depicted in Fig. 6.6. In the simulator interface, one can select global link, local link, or both. The parameter Link_ID(s) (shown in Fig. 6.1c) is used to specify the links that are affected by faults. According to Link_IDs, our simulator randomly injects errors into the specified links in the Monte Carlo simulation. Alternatively, random link failures can be modeled by choosing only the rough link location. During the simulation, each link reads its fault injection profile (if applicable) and manages the faults in the way indicated by the employed error control scheme. Local link faults are detected at the input channel of the NI and at the local ports of routers. Global link faults are examined in the input channels of the north, south, east and west router ports.

6.5.2 Fault Type

The proposed simulator is capable of simulating three fault types: (1) transient independent faults; (2) transient coupling faults (leading to adjacent multi-bit errors); (3) permanent faults. As shown in Fig. 6.1c, the dependent/independent buttons and the dependent coefficient control the noise model. For independent faults, the time at which a fault is injected on one wire is independent of the other wires. For coupling faults, one fault affects multiple neighboring wires. The dependent coefficient is defined as the probability that a disturbed wire affects its neighboring wires. A transient fault disappears after one cycle. In contrast, a permanent fault persists on the link throughout the entire simulation.
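The three fault types can be illustrated with a small Python sketch. The function names, the use of a per-wire probability for permanent faults, and the choice to couple only to the two immediately adjacent wires are all assumptions made for illustration; the text defines only the dependent coefficient and the transient/permanent distinction.

```python
import random


def fault_mask(width, ber, dependent_coeff=0.0, rng=None):
    """Return a bit mask of faulty wires for one cycle (hypothetical helper)."""
    rng = rng or random.Random()
    mask = 0
    for wire in range(width):
        if rng.random() < ber:                     # independent fault on this wire
            mask |= 1 << wire
            for neighbor in (wire - 1, wire + 1):  # coupling to adjacent wires
                if 0 <= neighbor < width and rng.random() < dependent_coeff:
                    mask |= 1 << neighbor
    return mask


def fault_schedule(width, ber, cycles, dependent_coeff=0.0,
                   permanent=False, seed=0):
    """Per-cycle masks: transient faults are redrawn every cycle,
    whereas a permanent fault keeps the same mask for the whole run."""
    rng = random.Random(seed)
    if permanent:
        stuck = fault_mask(width, ber, dependent_coeff, rng)
        return [stuck] * cycles
    return [fault_mask(width, ber, dependent_coeff, rng) for _ in range(cycles)]
```

With `dependent_coeff=0.0` the sketch reduces to the independent model; larger values make adjacent multi-bit errors more likely.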


Fig. 6.7 Data stream modification for different input frequencies

6.5.3 Faulty Flit Type

Faulty flit type describes whether the faulty flit is a header, payload or tail flit. This can be specified through the fault injection tab of the simulator (shown in Fig. 6.1c). Differentiating the faulty flit type makes it possible to investigate the impact of the erroneous-bit position on the reliability, performance and energy consumption of an NoC. For example, uncorrectable errors in the header flit lead to packet loss or delivery to an incorrect destination. Residual errors in the routing fields result in performance degradation. Unrecognized errors in the tail flit make the packet transmission endless, which prevents the following packet from passing through the router. Consequently, the average latency and throughput are degraded.

6.5.4 Multiple-Frequency Traffic Injection

The IP core modeled in our simulator acts as a traffic generator, which provides the data streams injected into the NoC. Following the flowchart shown in Fig. 6.2, we create a traffic engine to generate sufficient input streams that replicate the characteristics of different random processes and random distributions, as well as a wide range of packet injection rates. We can further obtain input streams with different clock frequencies, simulating an NoC-based CMP system. Based on the input file generated by our traffic engine, we insert invalid flits among the valid data to model different clock frequencies. Assume that the switching frequency of the NoC link is f_L, and the frequencies of IP core_0, IP core_1 and IP core_2 are f_L, f_L/2, and f_L/3, respectively. The original data stream is directly injected into the NoC through the network interface. For an IP core running at a slower frequency, invalid packets are inserted before the valid packet enters the network. As shown in Fig. 6.7, for an IP core operating at half of the link frequency, one invalid packet is inserted before the valid packet. If the IP core frequency is higher


than the link frequency, no modification is needed at this point because the store-and-forward mechanism applied in the network interface will manage the unmatched frequency. For simplicity, we restrict ourselves to cases where the link frequency is an integer multiple of the IP core frequency. The adjustment for non-integer frequency ratios will be accomplished in future work.
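The data stream modification for slower IP cores can be sketched in a few lines of Python. The text states that one invalid packet precedes each valid packet at half the link frequency; the generalization to `divisor - 1` invalid packets for a core running at f_L/divisor is an assumption made here, and the names `INVALID` and `stretch_stream` are hypothetical.

```python
INVALID = None  # placeholder marking an invalid (idle) packet slot


def stretch_stream(packets, divisor):
    """Model an IP core running at f_L / divisor.

    Inserts (divisor - 1) invalid packets before every valid packet so
    that, at link speed, valid packets appear once every `divisor` slots.
    """
    if divisor < 1:
        raise ValueError("divisor must be a positive integer")
    stretched = []
    for packet in packets:
        stretched.extend([INVALID] * (divisor - 1))
        stretched.append(packet)
    return stretched


full_rate = stretch_stream(["P0", "P1"], 1)   # core at f_L: stream unchanged
half_rate = stretch_stream(["P0", "P1"], 2)   # core at f_L/2
third_rate = stretch_stream(["P0", "P1"], 3)  # core at f_L/3 (assumed pattern)
```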

6.6 Parallel Fault and Traffic Injection

The behavior of the router or the IP core combined with NI is described in C and MPI language, which allows each microprocessor in a multiprocessor server to represent a router or IP core. Thus, all the routers and IP cores can operate in parallel. MPI_Send and MPI_Recv functions perform a standard-mode blocking send and receive, modeling the on-chip communication between routers and IP cores.

6.6.1 Fault Injection on Links

For an NoC without error control, one set of MPI_Send/Recv is sufficient to deliver flits (Flit_s/r). To facilitate investigation of ECC features, one more MPI_Send/Recv pair is needed to transfer error control feedback (NACK_s/r), as shown in Fig. 6.8a. Sink and source processor IDs are indicated by DestID and SrcID, respectively. In the server, a message transferred by MPI_Send is first saved in the processor buffer until the matching MPI_Recv fetches it, as shown in Fig. 6.8b. If a fault is injected for the purpose of evaluating ECC, one or more bits of the flit fetched by MPI_Recv from the microprocessor buffer are flipped by XORing them with logic '1'. The exact number of inverted bits and the time period for which the faults last depend on the fault vector given by a fault injection profile (faultinjection.dat shown in Fig. 6.2).
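The XOR-based bit flipping can be shown with a minimal Python sketch. It does not use MPI; it only models what happens to a received flit when a fault vector is applied. The even-parity code is used here simply because it is one of the codecs listed for the ECC module; all function names are hypothetical.

```python
def even_parity_bit(word):
    """Parity of the number of 1-bits in `word` (0 if even, 1 if odd)."""
    return bin(word).count("1") % 2


def encode(data):
    """Append one even-parity check bit as the least significant bit."""
    return (data << 1) | even_parity_bit(data)


def inject_fault(flit, fault_vector):
    """Flip every flit bit whose position is set to 1 in the fault vector."""
    return flit ^ fault_vector


def error_detected(codeword):
    """An even-parity codeword must contain an even number of 1-bits."""
    return even_parity_bit(codeword) == 1


sent = encode(0b10110010)
single = inject_fault(sent, 0b000000100)   # 1-bit error
double = inject_fault(sent, 0b000000110)   # 2-bit adjacent error
```

In this sketch the single-bit error is caught by the parity check, while the 2-bit adjacent error is not, which illustrates why the fault vector matters when comparing codecs.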

6.6.2 Fault Injection Rate

Fault injection rate is the probability of a faulty flit during the simulation. To efficiently generate the fault injection profile based on the given fault parameters, we propose an index-based fault injection method that indicates when and where a wire encounters an error. Figure 6.9 shows the execution flowchart of the parallel simulation on N microprocessors. The main steps are described below:

1. The root processor first generates a list of time indices for each wire; these indices indicate when an error occurs in the overall simulation.


Fig. 6.8 MPI functions facilitating on-chip communication modeling: (a) program for sending/ receiving flits, (b) data flow in multiprocessor server

2. Random numbers are generated. In our simulator, we use GSL [21], which offers many random number generators to choose from (e.g. taus, MT19937, and ranlxd1). The wire ID is used as the random seed for each generator, so that different wires obtain different random sequences.
3. Then, the indices are merged into a complete table.
4. Next, the large simulation is evenly distributed across a number of microprocessors. As a result, the simulation speedup is approximately equal to the number of microprocessors used in the parallel simulation.
5. Finally, the simulated results are collected, and the overall performance is obtained by the root processor.

This index-based fault injection compresses the size of the table indicating the fault location. The fault injection profile is generated at the beginning of the simulation. During the simulation, the profile is loaded into each processor. Each modeled router and NI checks its corresponding fault profile and injects faults accordingly.
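Steps 1 to 3 above can be sketched in Python as follows. This is only an illustration under stated assumptions: the simulator itself uses GSL generators in C, whereas the sketch uses Python's `random.Random` (a Mersenne Twister generator); drawing the gaps between faults from a geometric distribution is a technique chosen here for illustration and is not described in the text; the function names are hypothetical.

```python
import math
import random


def wire_fault_indices(wire_id, ber, total_cycles):
    """Step 1-2: list the cycles at which one wire is faulty.

    The wire ID seeds the generator, so each wire gets its own sequence.
    Gaps between consecutive faults are geometric with parameter `ber`
    (an assumption), which avoids storing one entry per cycle.
    """
    if not 0.0 < ber < 1.0:
        raise ValueError("ber must lie strictly between 0 and 1")
    rng = random.Random(wire_id)
    indices = []
    cycle = -1
    while True:
        u = 1.0 - rng.random()                      # uniform in (0, 1]
        gap = int(math.log(u) / math.log1p(-ber)) + 1
        cycle += gap
        if cycle >= total_cycles:
            return indices
        indices.append(cycle)


def build_fault_table(width, ber, total_cycles):
    """Step 3: merge per-wire indices into {cycle: fault bit mask}."""
    table = {}
    for wire in range(width):
        for cycle in wire_fault_indices(wire, ber, total_cycles):
            table[cycle] = table.get(cycle, 0) | (1 << wire)
    return table
```

Only cycles that actually contain a fault appear in the merged table, which is the source of the memory saving discussed in Sect. 6.8.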


Fig. 6.9 Index-based fault injection method

6.6.3 Parallel Traffic Injection

Using MPI, we are able to schedule a parallel simulation in a multiprocessor server environment, where all the processors physically work in parallel. In an MPI simulation, each real processor can be assigned one or more processor IDs; if the maximum ID is greater than the number of available processors, virtual processors are created using transparent time-division. Using this feature, we let each virtual processor model an IP core, loading a traffic file during the simulation. This allows flexible assignment of any traffic injection file to any IP core.

6.7 Energy Estimation

Energy is an important constraint in NoC designs. Instead of simply providing the energy for router busy or idle mode, we use a fine-grain estimation based on realistic switching activities of each module in different simulation scenarios. Thus, the fact that energy in busy mode varies with the number of in-use modules


(i.e. input/output channel and ECC blocks) in each cycle can be taken into consideration. The total energy for the NoC fabric, expressed in (6.1), includes router and link energy. Router energy, shown in (6.2), comprises input/output channel and crossbar module energies. Because of the inserted error control module, the energy consumed by the router is not constant across noise scenarios and reliability requirements. Consequently, the energy for each port (including both input and output channels) consists of buffer energy and error control energy, as expressed in (6.3).

$$E_{NoC\_fabric} = \sum_{i=1}^{N \times M} E_{R} + \sum_{k=1}^{5 \times N \times M} \alpha_k E_{Link} \tag{6.1}$$

$$E_{R} = \sum_{j=1}^{5} E_{Port\_j} + E_{Xbar} \tag{6.2}$$

$$E_{Port\_j} = \beta_{input\_j} E_{Buffer\_input} + \beta_{output\_j} E_{Buffer\_output} + \gamma_{ECC\_j} E_{ECC} + \gamma_{DeECC\_j} E_{DeECC} \tag{6.3}$$

Here, N × M is the NoC size; α, β and γ are the switching factors for links, per-port buffers, and error control coding blocks, respectively; E_NoC_fabric, E_R, E_Link, E_Xbar, E_Port, E_Buffer and E_ECC/E_DeECC are the energies of the overall NoC fabric, a router, a link between two routers, the crossbar module (including crossbar switch, routing block and port reservation block), a port, the input/output channel buffers and the error control module, respectively. Our simulator records the switching factors mentioned above, and computes the total energy based on the energy obtained from synthesized netlists. As a result, one can estimate the energy consumption for a given specification of the NoC structure and the characteristics of the fault and traffic injection.
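The energy model of (6.1)–(6.3) can be evaluated with a few lines of Python. The sketch below is illustrative only: the dictionary keys, function names and the numeric energies in the example are hypothetical placeholders, not values from the synthesized netlists.

```python
def port_energy(beta_in, beta_out, gamma_enc, gamma_dec, e):
    """Eq. (6.3): buffer plus error control energy of one port."""
    return (beta_in * e["buf_in"] + beta_out * e["buf_out"]
            + gamma_enc * e["ecc"] + gamma_dec * e["deecc"])


def router_energy(ports, e):
    """Eq. (6.2): sum over the router's ports plus the crossbar energy.

    `ports` holds one (beta_in, beta_out, gamma_enc, gamma_dec) tuple per port.
    """
    return sum(port_energy(*p, e) for p in ports) + e["xbar"]


def fabric_energy(routers, link_alphas, e):
    """Eq. (6.1): all routers plus all links weighted by switching factor."""
    return (sum(router_energy(ports, e) for ports in routers)
            + sum(alpha * e["link"] for alpha in link_alphas))


# Placeholder per-event energies (arbitrary units, for illustration only).
energies = {"buf_in": 2.0, "buf_out": 1.0, "ecc": 3.0,
            "deecc": 4.0, "xbar": 5.0, "link": 10.0}
one_router = [(1, 0, 0, 1)] * 5           # five identical ports
total = fabric_energy([one_router], [0.5] * 5, energies)
```

In the example, each port contributes 2.0 + 4.0 = 6.0, the router contributes 5 × 6.0 + 5.0 = 35.0, and the five links add 5 × 0.5 × 10.0 = 25.0.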

6.8 Speed and Memory Consumption for Fault Injection

Reduction in simulation time and memory consumption is achieved by means of MPI-based parallel simulation in two ways: multiple link simulations per cycle and index-based fault injection. Unlike simulation performed on a single pair of switch-to-switch links, our method simulates multiple pairs of transmitters/receivers in parallel. As a result, the time for modeling the probability that a link is affected by noise can be shortened. Further, the index-based fault injection method requires only a small memory to indicate when and where a wire encounters an error, rather than a large table to indicate the error status for each cycle. The bit error rate is a measure of a link's susceptibility to noise from external and internal sources. In theoretical analysis, the Gaussian pulse function (6.4) is widely used to model the bit error rate ε,

$$\varepsilon = Q\left(\frac{V_{dd}}{2\sigma_N}\right) = \int_{V_{dd}/(2\sigma_N)}^{\infty} \frac{1}{\sqrt{2\pi}}\, e^{-y^2/2}\, dy \tag{6.4}$$
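Equation (6.4) can be evaluated numerically with the complementary error function, using the standard identity Q(x) = erfc(x/√2)/2. The Python sketch below is illustrative; the function names and the example voltages are assumptions.

```python
import math


def q_function(x):
    """Gaussian tail probability Q(x) = 0.5 * erfc(x / sqrt(2))."""
    return 0.5 * math.erfc(x / math.sqrt(2.0))


def bit_error_rate(vdd, sigma_n):
    """Eq. (6.4): bit error rate for supply voltage Vdd and noise sigma_N."""
    return q_function(vdd / (2.0 * sigma_n))


# Illustrative values only: Vdd = 1.0 V with two noise deviations.
ber_quiet = bit_error_rate(1.0, 0.1)   # Q(5), roughly 2.9e-7
ber_noisy = bit_error_rate(1.0, 0.2)   # Q(2.5), a much larger error rate
```

The example shows how strongly the bit error rate depends on the ratio of supply voltage to noise deviation, which is why the simulator accepts the noise deviation voltage as an alternative input.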

The noise voltage follows a normal distribution with standard deviation σ_N; V_dd is the supply voltage. In a switch-to-switch simulation, the bit error rate can be modeled as the ratio of the number of cycles n_e containing erroneous bits to the total simulation time n_T, as shown below

$$\varepsilon = \frac{1}{W_{flit}} \left( \sum_{i=1}^{W_{flit}} \frac{n_{e,i}}{n_T} \right) \tag{6.5}$$

W_flit is the flit width in bits. We estimate the bit error rate as the average value of all bit error rates of router-to-router or NI-to-router links in the NoC. Suppose each node has N_L outgoing links (in our torus topology, N_L = 5); the bit error rate for the overall NoC links is then given by

$$\varepsilon = \frac{1}{N_L N_{node} W_{flit}} \left( \sum_{i=1}^{N_L N_{node} W_{flit}} \frac{n_{e,i}}{n_T} \right) \tag{6.6}$$

To obtain a good approximation for ε, the total simulation time T_sim must be at least n_T N_L N_node. Because only one random fault injection condition is simulated in each T_sim, more time is needed in order to examine more fault injection conditions. As ε decreases, these simulations become prohibitively time-consuming in uniprocessor simulators. To speed up simulations, we divide and distribute the overall simulation across N_p microprocessors, while maintaining the same bit error rate as in (6.6). Equation (6.7) shows the revised definition for the bit error rate of the NoC links.

$$\varepsilon = \frac{\sum_{j=1}^{(N_L N_{node})/N_p} \varepsilon'_j}{(N_L N_{node})/N_p} \tag{6.7}$$

where

$$\varepsilon'_j = \left( \frac{\sum_{i=1}^{N_L N_{node} W_{flit}/N_p} n_{e,i}}{n_T N_L N_{node} W_{flit}/N_p} \right) \tag{6.8}$$

and (N_L N_node)/N_p is an integer. In each processor, ε'_j is simulated. As a result, the total simulation time for each microprocessor is reduced to

$$T_{sim} = \frac{n_T N_L N_{node}}{N_p} \tag{6.9}$$

A straightforward method that we used in an older version of our simulator to obtain the bit error rate follows the steps below. We refer to it as table-based fault injection.

1. Create an empty table, with one entry for each simulation cycle.
2. Randomly select a number of entries to mark as faulty.
3. Randomly choose the bit positions at which to inject faults in the marked entries of the table.
4. Check the memory table each cycle, and flip the marked bit.

Suppose each entry costs one integer of memory space (e.g. four bytes); the total memory required for a flit having bit error rate ε is given by (6.10) below.

$$MEM_{table} = 4 \cdot S \cdot \frac{1}{\varepsilon} \quad \text{(Bytes)} \tag{6.10}$$

Here, S is the number of fault injection patterns tested in the simulation. For large bit error rates, the table-based method is easy to use and consumes an acceptable amount of memory. For example, at ε = 10^-7, the memory consumed by the fault injection table is 400 MB. As the bit error rate decreases, the memory consumption of the table-based method becomes prohibitive. In contrast, the index-based fault injection method indicates when and where a wire encounters an error. The proposed fault injection method requires the memory given below

$$MEM_{index} = 4 \cdot W_{flit} \cdot \frac{S}{N_{sim}} \quad \text{(Bytes)} \tag{6.11}$$

Here, N_sim is the number of microprocessors employed in the parallel simulation. As shown in (6.11), the memory consumption of the index-based fault injection method depends only on the flit width and the number of fault patterns simulated in each microprocessor (S/N_sim). Here, S/N_sim is many orders of magnitude less than 1/ε. Thus memory consumption in the index-based method is much smaller than in the table-based method. Index-based fault injection compresses the size of the table indicating the fault location, compared to table-based fault injection. Because the index-based method does not depend on the bit error rate, it is convenient for simulating a scenario in which the bit error rate is extremely low and simulation error is strictly constrained (i.e. a massive number of fault patterns should be tested in the simulation).
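The memory comparison between (6.10) and (6.11) can be checked numerically. In the Python sketch below, S = 10 is inferred from the 400 MB example in the text (4 · S · 10^7 bytes = 400 MB), while the flit width of 32 bits and the four microprocessors are assumptions chosen only for illustration.

```python
def mem_table_bytes(num_patterns, ber, bytes_per_entry=4):
    """Eq. (6.10): memory of the table-based method."""
    return bytes_per_entry * num_patterns / ber


def mem_index_bytes(num_patterns, flit_width, num_procs, bytes_per_entry=4):
    """Eq. (6.11): memory of the index-based method per microprocessor."""
    return bytes_per_entry * flit_width * num_patterns / num_procs


table_mem = mem_table_bytes(num_patterns=10, ber=1e-7)           # about 4e8 bytes
index_mem = mem_index_bytes(num_patterns=10, flit_width=32,
                            num_procs=4)                          # 320 bytes
ratio = table_mem / index_mem
```

Under these assumptions the table-based method needs roughly a million times more memory, and the gap widens further as ε decreases because only (6.10) depends on the bit error rate.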

6.9 Error Control Exploration

6.9.1 Experimental Setup and Evaluation Metrics

The proposed simulator has been successfully applied to the investigation of the impact of error control on NoC performance, as well as NoC design space exploration. In the following experiments, we performed simulation on a 10 × 10 torus NoC, which is composed of five-port routers and single-port network interfaces. The input buffer depth is eight flits; the output buffer depth is four flits; each packet contains six flits; the retransmission delay is set to four for a single-cycle link delay. An XY routing algorithm is employed in the emulated NoC. A uniform traffic injection pattern is modeled. The energy for each module is reported by Synopsys Design Compiler using 180 and 65 nm TSMC technologies; global link switching energy in each technology (C_p = 731.304 fF for 180 nm and C_p = 228.475 fF for 65 nm) is simulated in Cadence. The switching factors are reported by the proposed simulator. Parallel simulation has been examined using the SDSC TeraGrid IA-64 server, consisting of 524 Intel® Itanium®2 1.5 GHz processors, and a Linux server that has 4 Intel® Xeon(TM) 3.06 GHz processors. In the following experiments, we evaluate an NoC using the metrics of average flit latency, average throughput, switching factor (for link, buffer, ECC), and energy per useful flit. The definitions of these metrics are given by (6.12–6.15).

$$\text{Avg. Flit Latency} = \frac{\sum_{i=1}^{M} \left( T_{flit\_received,i} - T_{flit\_sent,i} \right)}{M} \tag{6.12}$$

in which T_flit_sent indicates when data leaves an IP core, T_flit_received indicates when data arrives at the NI (the queuing time to reach the IP core is ignored), and M is the number of flits received by the NI.

$$\text{Avg. Throughput} = \frac{\sum_{i=1}^{N} \left( \dfrac{\text{Total Flit Received}}{C} \right)_i}{N} \tag{6.13}$$

where N is the number of IP cores and C is the total number of simulation cycles.

$$\text{Switching Factor} = \frac{\sum_{i=1}^{C} \sum_{j=1}^{N_{node}} \left( \sum_{k=1}^{5} \text{SwitchedTimeAtPort}_k \right)_j}{C} \tag{6.14}$$

where N_node is the number of routers.

$$\text{Energy per Useful Flit} = \frac{\sum_{i=1}^{C} \sum_{j=1}^{N} \text{Energy}_j}{\sum_{i=1}^{C} \sum_{j=1}^{N} \text{Total ErrorFree Flits Received}_j} \tag{6.15}$$
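The evaluation metrics can be computed from simulation logs with a short Python sketch such as the one below. The function names and the shape of the input data (plain lists collected over a run, already summed over the cycles) are assumptions made for illustration; only the arithmetic follows (6.12), (6.13) and (6.15).

```python
def avg_flit_latency(sent_times, received_times):
    """Eq. (6.12): mean of (receive time - send time) over all received flits."""
    latencies = [r - s for s, r in zip(sent_times, received_times)]
    return sum(latencies) / len(latencies)


def avg_throughput(flits_received_per_core, cycles):
    """Eq. (6.13): per-core flits per cycle, averaged over the IP cores."""
    rates = [flits / cycles for flits in flits_received_per_core]
    return sum(rates) / len(rates)


def energy_per_useful_flit(energy_per_node, error_free_flits_per_node):
    """Eq. (6.15): total energy divided by total error-free flits received."""
    return sum(energy_per_node) / sum(error_free_flits_per_node)


latency = avg_flit_latency([0, 2, 5], [10, 14, 15])   # (10 + 12 + 10) / 3
throughput = avg_throughput([50, 150], cycles=1000)   # flits/cycle/core
energy = energy_per_useful_flit([2.0, 4.0], [1, 2])   # energy units per flit
```

Energy per useful flit counts only error-free flits in the denominator, so a scheme that delivers corrupted flits quickly is penalized relative to one that delivers fewer but correct flits.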


Fig. 6.10 Average latency versus traffic injection rate and flit error rate

6.9.2 Impact of Packet and Fault Injection Rate

Increasing traffic injection rates typically increases network congestion. Although it improves NoC reliability, error detection combined with retransmission further increases congestion as the noise increases. In this subsection, we examine the combined impacts of traffic and fault injection rates on NoC performance. IP cores inject packets following uniform traffic patterns, and our simulator introduces errors at the specified flit error rate, which is the number of erroneous flits divided by the total number of injected flits. For simplicity, we assume the error detection codec can detect all the injected faults and go-back-N retransmission is used to recover from the errors. As shown in Fig. 6.10, the average flit latency increases with the traffic injection rate because of increasing network congestion. As the flit error rate increases, more retransmissions are requested to improve on-chip communication reliability; thus, the latency increases too. The impact of increasing the traffic injection rate is more significant than that of increasing the flit error rate. Figure 6.11 shows the impact of traffic and flit error injection on throughput. Increasing the flit error rate dramatically degrades the throughput of error-free flits in the low traffic region. A high traffic injection rate compensates for the throughput degradation. This case study shows that the traffic injection rate should be moderately reduced to maintain the desired latency or throughput if error control schemes are employed. The proposed simulator assists in quantifying the traffic reduction percentage for the target QoS.


Fig. 6.11 Average throughput versus traffic injection rate and flit error rate

6.9.3 Impact of Error Control

Three categories of error control approaches have been employed in our analysis of reliable NoC design: FEC, error detection combined with ARQ, and HARQ. To demonstrate the impact of different error control schemes on NoC performance and energy, we employ six error control schemes in the proposed simulator: FEC1 (correct 1-bit error), FEC2 (correct 1- and 2-bit adjacent errors), ARQ1 (detect 1- and 2-bit errors), ARQ2 (detect 1-, 2-, 4-bit adjacent errors), HARQ1 (correct 1-bit and detect 2-bit errors), and HARQ2 (correct 1- and 2-bit adjacent errors and detect 2- and 4-bit adjacent errors). For simplicity, Hamming and extended Hamming codes are used for error detection and correction. Table 6.1 shows the dynamic and leakage power of each module in the NoC. FEC2, implemented with two groups of Hamming (21, 16) with interleaving, consumes less codec power than FEC1 because of its shorter critical path and smaller amount of logic in general; however, it consumes more link switching power than FEC1 because it needs more redundant wires. The comparison of ARQ1 and ARQ2 (HARQ1 and HARQ2) gives similar results to the FEC study. The HARQ scheme has higher error resilience than FEC and ARQ at the cost of more codec and link switching power. Table 6.1 also shows that the ratio of leakage power to dynamic power in the 65 nm node is three orders of magnitude higher than that in the 180 nm node. Thus, the impact of leakage cannot be ignored when comparing the energy of different error control schemes. Figures 6.12 and 6.13 show the average flit latency and throughput comparison of these six ECC schemes in three noise scenarios: 1-bit, 2-bit adjacent and 4-bit adjacent transient errors. In each noise case, only the specified error type is injected.


Table 6.1 Power of error control schemes employed in the emulated NoCs

Dynamic power at 180 nm node (mW)
ECC scheme  Codec  Crossbar switch  Input buffer  Output buffer  Links
FEC1        18.49  1.43             19.58         4.99           15.0
FEC2        6.28   1.43             19.67         5.09           16.6
ARQ1        14.65  1.43             19.60         5.02           15.4
ARQ2        4.23   1.43             19.70         5.11           17.0
HARQ1       47.39  1.43             19.62         5.04           15.8
HARQ2       14.70  1.43             19.75         5.16           17.8

Leakage power at 180 nm node (mW)
ECC scheme  Codec  Crossbar switch  Input buffer  Output buffer  Links
FEC1        0.64   0.09             1.11          0.36           0.0068
FEC2        0.21   0.09             1.11          0.36           0.0076
ARQ1        0.50   0.09             1.11          0.36           0.0070
ARQ2        0.17   0.09             1.11          0.36           0.0070
HARQ1       0.78   0.09             1.11          0.36           0.0072
HARQ2       0.25   0.09             1.11          0.36           0.0081

Dynamic power at 65 nm node (mW)
ECC scheme  Codec  Crossbar switch  Input buffer  Output buffer  Links
FEC1        2.53   0.18             2.69          1.05           2.28
FEC2        0.79   0.18             2.70          1.07           2.52
ARQ1        1.28   0.18             2.69          1.06           2.34
ARQ2        0.54   0.18             2.71          1.07           2.58
HARQ1       5.94   0.18             2.70          1.06           2.40
HARQ2       1.67   0.18             2.72          1.08           2.70

Leakage power at 65 nm node (mW)
ECC scheme  Codec  Crossbar switch  Input buffer  Output buffer  Links
FEC1        37.75  7.46             69.27         37.32          1.68
FEC2        12.83  7.46             69.27         37.32          1.86
ARQ1        16.82  7.46             69.27         37.32          1.73
ARQ2        10.06  7.46             69.27         37.32          1.90
HARQ1       43.06  7.46             69.27         37.32          1.77
HARQ2       18.91  7.46             69.27         37.32          1.99

Fig. 6.12 Average latency in different noise conditions with traffic injection rate = 0.15 packet/cycle/node

The flit error rate is a function of the bit error rate defined in (6.4). For independent transient errors, the flit error rate of 1-bit errors is $K\varepsilon(1-\varepsilon)^{K-1}$, where $K$ is the flit width and $\varepsilon$ is the bit error rate. For coupling transient errors, the flit error rate is defined as the probability of a flit containing any erroneous bits. In the first noise scenario (Fig. 6.12 top left), the 1-bit error can be detected or corrected by all the ECC schemes employed, although ARQ schemes result in the


Fig. 6.13 Average throughput in different noise conditions with traffic injection rate = 0.15 packet/cycle/node

highest latency. In the low flit error rate region (i.e. $10^{-4}$ in Fig. 6.12), the latencies of these ECC schemes are very close. As the flit error rate increases, the latency of ARQ increases by 3%, 22%, and 27% compared to that in the low noise region ($10^{-4}$). In contrast, FEC and HARQ can correct the errors without retransmission; thus, latency is maintained as the flit error rate increases. In the second noise scenario (Fig. 6.12 top right), HARQ1 fails to correct the detected errors, resulting in the same latency as the ARQ schemes. In the third noise scenario (Fig. 6.12 bottom), HARQ2 cannot correct the detected 4-bit errors and requests as many retransmissions as the ARQ1, ARQ2 and HARQ1 schemes do. In contrast, the FEC latency is not affected by the increasing flit error rate or the number of error bits, but the received flits are not all error free because of error correction failures. Retransmission in the ARQ and HARQ schemes leads to throughput degradation, as shown in Fig. 6.13. As the flit error rate increases, the throughput of ARQ decreases by 3%, 26%, and 33% compared to that in the low noise region ($10^{-4}$). Since the same retransmission protocol (i.e. go-back-N) is employed in ARQ and HARQ, the throughput degradation of HARQ in the second and third noise conditions is the same as that of ARQ. Similar to the latency comparison, the HARQ1/HARQ2 scheme cannot maintain its throughput in the 2-bit/4-bit adjacent error scenarios because of the limited error correction capability of the employed extended Hamming (38, 32)/two groups of extended Hamming (38, 32). The synthetic error injection conditions provided in our simulator facilitate performance


Fig. 6.14 Average energy per useful flit at 180 nm node

evaluation of a specific error detection/correction codec and error recovery policy. Other noise scenarios that mix different numbers of error bits are also supported in the simulator by loading the faultinjection.dat file (shown in Fig. 6.2). To simultaneously consider energy efficiency, reliability and throughput, energy per useful flit is used to evaluate error control schemes in different noise conditions. A useful flit is a flit that reaches its destination without errors. As shown in Fig. 6.14, the energy per useful flit of the FEC schemes is constant in the 1-bit error scenario, since the number of received error-free flits does not change with increasing flit error rate. Because more redundant links are used, FEC2 consumes more energy per useful flit than FEC1. Although the codec power of FEC2 is less than that of FEC1, the total link switching energy exceeds the codec switching energy (the redundant links switch even when the codec is not used, during buffer-full and retransmission periods). As the number of error bits increases to two, FEC1 yields fewer useful flits and thus costs more energy per useful flit than FEC2. In the 4-bit adjacent error scenario, neither FEC1 nor FEC2 can correct the errors, so their energy per useful flit increases with the flit error rate. Unlike the FEC schemes, the energy per useful flit of the ARQ schemes increases with increasing flit error rate regardless of the number of error bits. In the moderate flit error rate region, ARQ2


Fig. 6.15 Average energy per useful flit at 65 nm node

consumes more link switching energy than ARQ1, resulting in worse energy efficiency. In the high flit error rate region and the 4-bit error scenario, ARQ2 achieves better energy efficiency than ARQ1 because of lower codec complexity and more error-free flits. In the 1-bit error case, the HARQ schemes achieve the same energy per useful flit as FEC, since all detected errors can be corrected without retransmission. Unlike HARQ2, HARQ1 fails to correct 2-bit adjacent errors and requires retransmission to maintain reliability, resulting in more energy consumption. When 4-bit errors occur, both HARQ1 and HARQ2 use retransmission to obtain error-free flits; thus, HARQ2 costs more energy than HARQ1 because of its additional redundant links. Generally, FEC and HARQ outperform ARQ in the high flit error rate region if the injected errors are correctable. We perform the same experiment in 65 nm technology. Figure 6.15 shows the same trends of energy per useful flit versus flit error rate in the different noise scenarios as those shown in Fig. 6.14. In addition to the energy consumption of each module in the router, the switching activity of the ECC module, crossbar switch, input/output buffers and links also plays an important role in minimizing the energy per useful flit. To obtain the minimum value, one should evaluate an ECC scheme by considering both its power and its switching activity in the target application.
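The energy-per-useful-flit metric used in this comparison can be written down in a few lines. The sketch below assumes the metric is total consumed energy divided by the number of flits that arrive error free; the function name, the units and the numbers in the example are hypothetical and are not taken from Figs. 6.14 or 6.15.

```python
def energy_per_useful_flit(avg_power_mw: float, cycles: int,
                           clock_period_ns: float, useful_flits: int) -> float:
    """Return energy per useful flit in picojoules.

    avg_power_mw    average power over the run (dynamic plus leakage), in mW
    cycles          number of simulated clock cycles
    clock_period_ns clock period in ns
    useful_flits    flits that reached their destination without errors
    """
    if useful_flits <= 0:
        raise ValueError("no useful flits were delivered")
    total_energy_pj = avg_power_mw * cycles * clock_period_ns  # mW * ns = pJ
    return total_energy_pj / useful_flits


if __name__ == "__main__":
    # Hypothetical run: same power and duration, but uncorrectable errors
    # reduce the number of useful flits from 10,000 to 8,000.
    print(energy_per_useful_flit(60.0, 100_000, 2.0, 10_000))
    print(energy_per_useful_flit(60.0, 100_000, 2.0, 8_000))
```

The example shows why an FEC scheme that cannot correct the injected errors, or an ARQ scheme that spends cycles on retransmission, scores worse: the numerator stays the same or grows while the count of useful flits shrinks.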

6.9.4 Impact of Faulty Flit Type

Header, payload and tail flits play different roles in packet switching. Header flits contain the information indicating the direction of the next hop, which is used to connect the appropriate input port and output port in a router. Tail flits carry the information indicating the end of a packet (if the packet format has no tail flit, a flit counting mechanism is used). Payload flits only transfer data over the NoC fabric. Because the header, payload and tail data differ in importance, researchers suggest employing different error control methods for different flit types to achieve the desired QoS. We assume that the error control schemes used are sufficient to guarantee reliability; here, we are interested in examining the impact of those error control schemes on NoC performance. Since FEC methods designed for the worst-case condition do not change the latency and throughput with traffic injection and flit error injection, we focus on error detection combined with retransmission. Errors injected on header, payload or tail flits cause different performance degradation. Errors injected on header flits affect the buffer release in the previous hop because the rejected header flit does not participate in output port reservation (i.e. it is not involved in resource contention). In contrast, errors on the tail flit postpone the release of both the output buffer in the previous hop and the output port reservation in the current hop. As a result, errors on tail flits lead to potential resource contention. Errors on payload flits delay the release of the output buffer in the previous hop and may affect the output port reservation in the current hop, depending on the input buffer depth of the current hop. In our experiment, the input buffer depth is larger than the packet length; thus, errors on payload flits do not affect output port reservation. The flit error rate used in the simulation is $10^{-3}$. As shown in Fig. 6.16, errors on tail, payload and header flits yield up to 53%, 47% and 29% increases in latency, respectively, compared to the no-error case. Correspondingly, errors on tail, payload and header flits result in up to 50%, 42% and 37% decreases in throughput, respectively, compared to the no-error case. As a result, an error control method that leads to less latency should be applied to tail flits if latency is the main design concern.

6.9.5 Impact of Fault Injection Location

There are two types of links in NoCs: router-to-router interconnects (i.e. global links) and router-to-NI links (i.e. local links). Error control (error detection combined with retransmission) for global link reliability improvement typically worsens network congestion, especially in the high traffic injection region. In contrast, retransmission to recover from faults on local links merely stalls packet injection into the network and does not cause additional network congestion. As shown in Fig. 6.17a, global faults increase latency up to 20% more than local faults.


Fig. 6.16 The impact of faulty flit type on (a) latency, (b) throughput

However, because local faults decrease the packet injection into the network, these faults result in lower throughput than global faults, as shown in Fig. 6.17b. If the flit error rate caused by the global faults is large, the corresponding throughput degradation is significant since there are more global links than local links.

6.9.6 Impact of Fault Type

In the previous sections, we have examined the impact of transient faults on NoC performance and energy. In addition to transient faults, the proposed simulator can be used to examine the impact of permanent faults on performance. Without knowledge of the


Fig. 6.17 The impact of fault location on (a) latency, (b) throughput

presence of permanent faults, energy-efficient error detection combined with retransmission is employed to improve link reliability. We further assume that no adaptive routing is employed to reroute the flit. In future work, we will examine different adaptive routing combined with error control coding.


Fig. 6.18 The impact of permanent faults on performance: (a) permanent faults on different routers, (b) permanent faults on a single router

Figure 6.18a shows the case where permanent faults occur on different routers. As can be seen, more links experiencing permanent faults yields smaller throughput; increasing traffic injection helps to improve throughput but cannot avoid throughput saturation beyond a certain traffic injection rate. Figure 6.18b shows the case where the permanent faults occur on a single router, which is typically regarded as a faulty node. Similar to Fig. 6.18a, the throughput decreases with increasing permanent faults. If no adaptive routing is employed, faulty nodes deteriorate the throughput more than the distributed faulty links. The flexibility of modeling distributed faulty links and faulty nodes is useful for examining different fault tolerant routing algorithms combined with error control methods.

6.10 Memory Consumption and Time for Fault Injection

In this section, we compare the memory consumption of table-based and index-based fault injection over a wide range of bit error rates. Here, we examine 100 and 1,000 fault patterns. Typically, more fault patterns require more memory, because a larger number of simulation cycles is needed. As shown in Fig. 6.19, the memory required by the straightforward table-based fault injection method increases exponentially (log plot) with decreasing bit error rate. When the bit error rate is below $10^{-8}$, the memory requirement exceeds 10 gigabytes, which is prohibitive for most machines. Furthermore, such large memory requirements limit the number of fault patterns in each simulation. In contrast, our proposed index-based fault injection method consumes near-constant memory for fault injection. These simulation results match the analytical estimates in Equation (6.9). It can be seen in Fig. 6.19 that the index-based fault injection method reduces the memory consumption for fault injection by several orders of magnitude. In parallel simulation, the memory requirement of the index-based method can be further reduced, since each processor handles only one segment of the overall simulation. To quantify the impact of fault injection on simulation time, the times for fault index generation and fault injection in different simulation environments are compared in Figs. 6.20 and 6.21, respectively. Here, 1,000 fault patterns are examined. In this experiment, 1, 10 and 100 processors are used to generate fault indices, which are

Fig. 6.19 Memory consumption for fault injection


Fig. 6.20 Time for fault index generation

Fig. 6.21 Time for fault injection in simulation

uniformly distributed over the total simulation time. As shown in Fig. 6.20, index generation using 100 processors in the parallel simulation is about 100× faster than on a uniprocessor; moreover, decreasing the bit error rate does not increase the time for fault index generation. In contrast, decreasing the bit error rate leads to an exponential increase in time for the fault injection method, as shown in Fig. 6.21.
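The memory argument can be made concrete with a small sketch. It assumes that a table-based injector stores one entry per simulated bit slot, while an index-based injector stores only the slot numbers at which faults occur; the gaps between faults are drawn from a geometric distribution by inverse-transform sampling. This is an illustration of the idea only: the function names are hypothetical, and the simulator's own generator and Equation (6.9) are not reproduced here.

```python
import math
import random


def table_entries_needed(bit_error_rate: float, num_faults: int) -> float:
    """Expected number of bit slots a table must cover to contain num_faults."""
    return num_faults / bit_error_rate


def fault_indices(bit_error_rate: float, num_faults: int, rng: random.Random):
    """Return the bit-slot indices of the first num_faults faults.

    Only num_faults integers are stored, regardless of the bit error rate.
    """
    if not 0.0 < bit_error_rate < 1.0:
        raise ValueError("bit_error_rate must be in (0, 1)")
    log_q = math.log1p(-bit_error_rate)          # log(1 - p), negative
    indices = []
    position = -1
    for _ in range(num_faults):
        u = 1.0 - rng.random()                   # uniform in (0, 1]
        gap = int(math.log(u) / log_q) + 1       # geometric gap, at least 1
        position += gap
        indices.append(position)
    return indices


if __name__ == "__main__":
    rng = random.Random(2011)                    # arbitrary seed
    for ber in (1e-4, 1e-6, 1e-8):
        idx = fault_indices(ber, 1000, rng)
        print(f"BER {ber:g}: table covers about {table_entries_needed(ber, 1000):.0e} "
              f"slots, index list stores {len(idx)} integers")
```

The index list has the same length for every bit error rate, whereas the span a table must cover grows as the bit error rate falls, which is the qualitative behaviour reported in Fig. 6.19.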

6.11 Investigation of NoC-Based CMP System

NoCs have been applied to CMP systems, in which different IP cores may operate at different frequencies. Our simulation environment can facilitate multi-frequency simulation. Different IP core frequencies are modeled with the method discussed in Sect. 6.3. Because the placement of multi-frequency IP cores affects the system performance, we examine four different artificial scenarios – (a) all IP cores


Fig. 6.22 NoC-based multi-frequency CMP systems

working at the same frequency (uni-freq.); (b) the slowest IP cores are placed in the center of the system (center low freq.); (c) the fastest IP cores are placed in the center of the system (center high freq.); (d) the entire CMP system is divided into four regions, each region using one frequency (localized freq.). These four cases are shown in Fig. 6.22. We assume all the routers in the NoC operate at the same frequency as the fastest IP core in the CMP system. Two extreme on-chip communication patterns are examined in this work: (1) each IP core has the same probability of sending data to every IP core in the CMP system (i.e., uniform destination); (2) each IP core communicates only with IP cores in the quarter region it belongs to (i.e., localized destination). Figure 6.23 shows the latency comparison of the CMP system using the IP core placements shown in Fig. 6.22. As shown in Fig. 6.23a, the uni-freq. case yields the highest latency compared to the other IP core placement cases. Because all the IP cores in Fig. 6.22a inject packets into the network at the same speed, that CMP system yields the highest network congestion and thus the highest latency. In contrast, the case shown in Fig. 6.22c has only 25% of IP cores operating at f0, and those IP cores are on the fringe of the 4 × 4


Fig. 6.23 Impact of IP core placement in the CMP system on latency: (a) uniform destination, (b) localized destination

NoC. Consequently, less network congestion exists in this NoC than in the other cases, which results in the lowest latency. Figure 6.23b shows the latency comparison for the CMP system in which the IP cores communicate only within their respective quadrants. As can be seen, the latency of the uni-freq. CMP system is comparable to that of the center low freq. CMP system. This is because most of the IP cores in the center low freq. CMP system run at f0, which results in a comparable number of packets being injected into the network.

6.12 Summary

Many existing simulators either do not incorporate error control coding features, or do not have a comprehensive simulation environment that allows various fault injection and traffic injection patterns. Thus, evaluating the performance of


nanometer scale NoCs that employ various error control modules becomes increasingly difficult. The proposed simulator allows investigation of the comprehensive dependence of flit error rate and packet injection rate on NoC performance. According to the simulated dependence, users can determine an appropriate traffic injection rate in a range of noisy conditions to achieve the desired latency and throughput. Moreover, the flexibility provided by this simulator can easily model various fault injection scenarios. This feature makes it feasible to explore error control schemes for specific purposes, such as high protection for the header flit in a packet, realization of high reliability in some NoC regions, and/or adjustment of error control methods for different fault types. In addition, a parallel simulation method based on MPI has been employed to improve simulation speed. NoCs promise significant performance improvement in on-chip communication, as designers move to chip multiprocessors (CMPs) to squeeze more performance from scaled technologies. Using this simulator, one can readily estimate the performance of NoC-based CMP systems.

References
1. Sun YR, Kumar S, Jantsch A (2002) Simulation and Evaluation of a Network on Chip Architecture Using Ns-2. in Proc IEEE NorChip Conf (NORCHIP'02)
2. Ali M, Welzl M, Adnan A, Nadeem F (2006) Using the NS-2 Network Simulator for Evaluating Network on Chips (NoC). in Proc IEEE 2nd Intl Conf Emerging Technologies 506–512
3. Ning W, Fen G, Qi W (2007) Simulation and Performance Analysis of Network on Chip Architectures Using OPNET. in Proc 7th Intl Conf ASIC 1285–1288
4. Wang HS, Zhu X, Peh LS, Malik S (2002) Orion: A Power-Performance Simulator for Interconnect Networks. in Proc 35th Annual ACM/IEEE Intl Symp Microarchitecture 294–305
5. Coppola M, Curaba S, Grammatikakis MD, Maruccia G, Papariello F (2004) OCCN: a Network-on-Chip Modeling and Simulation Framework. in Proc Design, Automation and Test in Europe Conf and Exhibition (DATE'04) 174–179
6. Palermo G, Silvano C (2004) A Framework for Power/Performance Exploration of Network-On-Chip Architectures. in Proc 14th Intl Workshop Power and Timing Modeling, Optimization and Simulation 521–531
7. Xi J, Zhong P (2006) System-level Network-on-Chip Simulation Framework with Analytical Interconnecting Wire Models. in Proc IEEE Intl Conf Electro/Information Technology 301–306
8. Thid R (2003) Semla tutorial. http://www.imit.kth.se/info/FOFU/Nostrum/NNSE/semla_tutorial.pdf
9. Lu Z, Thid R, Millberg M, Nilsson E, Jantsch A (2005) NNSE: Nostrum Network-on-Chip Simulation Environment. in Proc Swedish System-on-Chip Conf 1–4
10. Lu Z (2005) A user introduction to NNSE: Nostrum Network-on-Chip Simulation Environment. http://www.imit.kth.se/info/FOFU/Nostrum/NNSE/
11. http://www.arteris.com
12. http://noxim.sourceforge.net/
13. http://nirgam.ecs.soton.ac.uk/


14. Ganguly A, Pande PP, Belzer B, Grecu C (2008) Design of Low Power & Reliable Networks on Chip through Joint Crosstalk Avoidance and Multiple Error Correction Coding. J Electronic Testing 24:67–81
15. Bertozzi D, Benini L, De Micheli G (2005) Error Control Scheme for On-Chip Communication Links: the Energy-Reliability Tradeoff. IEEE Trans Computer-Aided Design of Integr Circuits and Syst 24:818–831
16. Ali M, Welzl M, Hessler S, Hellebrand S (2007) A Fault Tolerant Mechanism for Handling Permanent and Transient Failures in a Network on Chip. in Proc Intl Technology: New Generations (ITNG'07) 1027–1032
17. Zimmer H, Jantsch A (2003) A Fault Model Notation and Error-Control Scheme for Switch-to-Switch Buses in a Network-on-Chip. in Proc Intl Conf Hardware/Software Codesign and Syst Synthesis (CODES-ISSS'03) 188–193
18. Rossi D, Angelini P, Metra C (2007) Configurable Error Control Scheme for NoC Signal Integrity. in Proc IEEE Intl On-Line Testing Symp (IOLTS'07) 43–48
19. Yu Q, Ampadu P (2010) A Flexible Parallel Simulator for Networks-on-Chip with Error Control. IEEE Trans Computer-Aided Design of Integr Circuits and Syst 29:103–116
20. Yu Q, Zhang M, Ampadu P (2011) A Comprehensive Networks-on-Chip Simulator for Error Control Explorations. in Proc 5th ACM/IEEE Intl Symp on Networks-on-Chip (NoCS'11) 263–264
21. http://www.gnu.org/software/gsl/
22. http://www.sdsc.edu/us/resources/ia64/
23. Yu Q, Ampadu P (2008) Adaptive Error Control for NoC Switch-to-Switch Links in a Variable Noise Environment. in Proc IEEE Intl Symp on Defect and Fault Tolerance in VLSI Systems (DFT'08) 352–360

Chapter 7

Conclusions and Future Directions

7.1 Book Summary

This book presents a multi-layer solution to address transient and permanent errors in networks-on-chip. The main contributions are as follows: (1) adaptive error control codec design and implementation, (2) transient and permanent error co-management, (3) dual-layer cooperative error control, (4) flexible and parallel NoC simulator development.

7.1.1 Adaptive Error Control Codec Design

Transient error rates vary with time and location. Worst-case error control codec design wastes energy if noise conditions are favorable. In contrast, simple error control coding sacrifices reliability, although codec energy consumption is reduced. In deeply scaled technologies, the percentage of multi-core system power consumed by on-chip interconnect is increasing. Simple error control coding may result in large link switching power consumption if retransmission is adopted to recover detected errors. Consequently, a configurable error control coding method and a new reliability-energy-aware NoC framework are needed to trade off performance, reliability and energy for different noise conditions. In Chap. 3, an adaptive error control coding method is proposed to improve error resilience while maintaining energy efficiency and performance by adjusting error detection and correction capability based on link quality or system requirement. Since a direct implementation of this adaptive ECC scheme is not hardware efficient, a configurable M-bit error correction, 2M-bit error detection code (MEC-2MED) with hardware sharing has been proposed. As link switching consumes more and more energy and noise increases in nanoscale systems, adapting both the error detection and correction can be an effective way to balance average energy per useful flit, reliability, and performance.

Q. Yu and P. Ampadu, Transient and Permanent Error Control for Networks-on-Chip, DOI 10.1007/978-1-4614-0962-5_7, © Springer Science+Business Media, LLC 2012


This method also uses multi-group codes to provide protection from multi-bit spatial-burst errors without the need for complex codes such as BCH or Reed-Solomon codes. The goal of using multi-group simple codes is to distribute adjacent error bits to different small codewords and repair the corrupted original message in smaller segments, at the cost of more redundant wires. Technology scaling facilitates power saving in logic blocks, but does not help to reduce long-link swing power. Consequently, the overhead induced by the multi-group method may exceed the power saved by reducing codec complexity and the number of retransmissions requested in error recovery protocols. The proposed adaptive error control method changes the number of active ECC groups (and thus the number of redundant wires used) according to noise conditions. Furthermore, the proposed configurable ECC codec adapts the codec strength among codes of the same type, incurring minor codec overhead compared to adapting among different codes.

7.1.2 Error Co-Management

As technology scales, imperfect fabrication processes in nanoscale circuits induce more and more defects in long on-chip interconnects. Although progress in fabrication techniques has been made, yields continue to decrease as technology closes in on physical limits. The ability to tolerate a small number of defects can dramatically relieve the pressure on yield improvement. Fault-tolerant routing abandons the entire link or even the router when permanent errors occur, resulting in unnecessary performance degradation and energy consumption if only a few permanent errors exist. In Chap. 4, we propose a transient and permanent error co-management method, which involves both the datalink and physical layers. Our configurable error control coding (ECC) adapts the number of redundant wires to varying noise conditions, achieving the required error detection capability with better energy performance. The infrequently used redundant wires for the configurable ECC are utilized as spare wires to replace permanently unusable links. To make the co-management method suitable for both low/high transient error and low/moderate permanent error conditions, a packet re-organization algorithm that cooperates with a shortened error control coding method is proposed to support low-latency split transmission. This approach intelligently exploits the available resources in a fault-tolerant system to handle permanent faults in interconnect links, rather than requiring additional backup subsystems. Fault-tolerant routing algorithms have been used to manage permanent link errors, but those algorithms abandon the entire faulty link and the associated routers (or at least some ports in the affected router), increasing the traffic burden on other links and routers. Compared with fault-tolerant routing, the proposed approach is more suitable for link errors because of its minor cost and because it needs no complex routing unit. As a result, this approach can enhance performance and energy efficiency.

7.1.3 Dual-Layer Cooperative Error Control

Adaptive error control coding on the datalink layer has the advantage of minimizing the average energy consumption in an environment with changing noise. However, the significant energy reduction is mostly achieved in high noise conditions; in low noise conditions, its energy consumption cannot compete with that of a no-protection scheme. Moreover, the datalink-layer ECC does not consider the impact of traffic load on the error control scheme selection. It has been reported that end-to-end error control in the network layer is more energy efficient than hop-to-hop error control in the datalink layer in low noise conditions. Consequently, extending the ECC adaptation to the network layer is expected to achieve significant average energy reduction in the low noise region as well. In Chap. 5, we propose a dual-layer cooperative error control method to tackle transient errors on NoC links. To improve energy efficiency, we employ end-to-end error control in the network interface in the low error rate region, and enhance the error control capability in the high error rate region by adding hop-to-hop error control in the router. One major contribution of this work is a protocol that allows switching between network-layer ECC and datalink-layer ECC at runtime, considering the NoC size and traffic load. The corresponding hardware-efficient implementation of the protocol is presented in that chapter as well. Simply combining end-to-end error control with hop-to-hop error control typically results in large energy consumption; thus, we propose to use product codes in the NoC to realize seamless ECC mode switching at runtime. The performance and energy of this method have been analyzed, considering the impact of packet size, NoC size and error rate. The proposed method has also been evaluated using traffic traces obtained from the PARSEC benchmark. Real application simulations confirm the conclusions drawn from the theoretical analysis. This method provides a framework that uses simple codes to build a robust system that operates over a wide noise range. The ECC mode switching protocol considers the impact of network traffic load on the ECC scheme selection, which creates a bridge between the datalink layer and the network layer (or above) to determine the best error control solution for the current scenario. The cooperation between the two layers can improve resource utilization by turning on codec units and links only as necessary, saving dynamic energy.
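To make the product-code idea tangible, the sketch below uses the simplest possible instance: a packet arranged as rows of bits, with a single parity bit per row and a parity row over the columns. This choice of component codes is an assumption made for illustration; the codes, packet layout and mode-switching protocol actually used in Chap. 5 are not reproduced here, and the function names are hypothetical.

```python
def encode(rows):
    """Append a parity bit to every row, then a parity row over all columns."""
    with_row_parity = [list(row) + [sum(row) % 2] for row in rows]
    parity_row = [sum(col) % 2 for col in zip(*with_row_parity)]
    return with_row_parity + [parity_row]


def correct_single_error(block):
    """Locate a single flipped bit from its failing row and column checks.

    Returns (block, status) where status is "clean", "corrected" or "detected".
    """
    bad_rows = [r for r, row in enumerate(block) if sum(row) % 2]
    bad_cols = [c for c, col in enumerate(zip(*block)) if sum(col) % 2]
    fixed = [list(row) for row in block]
    if not bad_rows and not bad_cols:
        return fixed, "clean"
    if len(bad_rows) == 1 and len(bad_cols) == 1:
        fixed[bad_rows[0]][bad_cols[0]] ^= 1     # intersection is the error
        return fixed, "corrected"
    return fixed, "detected"                     # more than one error seen


if __name__ == "__main__":
    # Hypothetical 4-flit packet with 8 bits per flit.
    packet = [[(r * 3 + c) % 2 for c in range(8)] for r in range(4)]
    block = encode(packet)
    received = [list(row) for row in block]
    received[2][5] ^= 1
    fixed, status = correct_single_error(received)
    print(status, fixed == block)
```

Each row check alone only detects an error, while the row and column checks together locate and correct it. That split is what makes product codes a natural fit for two cooperating layers: a cheap check can run in one place and the stronger combined decoding only where it is needed.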

7.1.4 NoC Simulator Development

Incorporating error control in NoCs leads to performance degradation, as well as increases in energy consumption and area overhead. Previous simulators are useful to explore design space and power estimation; unfortunately those simulators have rarely explored error control features in NoCs. To fill in the gap between NoC simulator implementation and NoC error control exploration, an NoC simulator


that facilitates comprehensive investigation of the impact of different error control methods is needed. In Chap. 6, we present a flexible parallel simulator to evaluate the impact of different error control methods on the performance and energy consumption of networks-on-chip (NoCs). Different error control schemes can be inserted to the simulator in a plug-and-play manner for evaluation. Moreover, a highly tunable fault injection feature is developed for modeling various fault injection scenarios, including different fault injection rates, fault types, fault injection locations and faulty flit types. The flexible simulation environment provided by this simulator allows examination of the efficiency of different error control schemes under different fault scenarios and traffic injection rates. Case studies are presented to demonstrate the impact of a set of error control schemes on NoC performance and energy in different noise scenarios. We also use the simulator to provide design guidelines for NoCs with error control capabilities.

7.2 Future Work

Several exciting and important avenues of research in this area are worth investigating.
• Exploit other codes to construct a configurable error control codec: The investigated Hamming code has advantages such as its simplicity and the ease of shortening or extending it, but its implementation needs a long XOR-tree for the case of large input widths. As a result, large codec delay and codec overhead are induced. New codes that have similar complexity but need a shorter XOR-tree compared to Hamming code can further reduce the codec overhead in our proposed scheme.
• Incorporate adaptive routing with the proposed multi-layer ECC framework: In the physical layer, small numbers of permanently unusable links are detected and replaced with infrequently used redundant wires for configurable ECC. This approach cannot tolerate severely broken links and routers. To trade off the area and energy overhead against performance, integrating a simple adaptive routing into the proposed framework can extend the error resilience capability against varied transient noises and unexpected large-scale permanent defects.
• Apply the proposed error control method to hybrid NoCs: One interesting hybrid NoC is a photonic-electronic NoC, in which the photonic network is used to transmit large messages and the electronic network sets up and tears down the optical path for message transmission. Because of the transparent light path, a large packet can be immediately transferred over the photonic network as soon as the optical path is reserved. In this hybrid NoC, adding redundancy on the packet will not induce significant energy overhead on link

References

155

Fig. 7.1 Block diagram of robust routing arbitration unit

swing. Consequently, more powerful error control codes can be used to improve the error resilience for large packets. • Manage router errors: Although it is difficult to use error control coding to detect and correct errors in routers, triple-modular redundancy (TMR) can be used to improve reliability by duplicating the unit under protection and selecting the output through majority voting. Because of its simplicity, TMR has been used in router control paths [1, 2]. However, TMR only theoretically functions when up to 1/3 of the components are erroneous; the potential for errors in the majority voter further reduces the effectiveness of the TMR approach, particularly when the number of units being protected is small. Consequently, TMR is not an ideal solution for the control paths in NoC routers. Inherent information redundancy is exploited to protect the arbitration units, as shown in Fig. 7.1 [3]. The four shaded units are added to the conventional arbiter design, enabling error management in the arbitration. RC is route computation. The inherent information redundancy is extracted from the presence of forbidden signal patterns and inconsistent request-response pairs during the arbitration phase. In Ref. [3], XY routing based router has been investigated. It will be interest to see the feasibility of applying the inherent information redundancy to the router using adaptive routing algorithms.
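The TMR limitation noted above can be made concrete with a minimal Python sketch of a bitwise 2-of-3 majority voter. The function name `majority_vote` and the 4-bit example values are illustrative assumptions, and the voter itself is assumed to be fault-free, which a real voter is not.

```python
def majority_vote(a: int, b: int, c: int) -> int:
    """Bitwise 2-of-3 majority over three redundant copies of a signal."""
    return (a & b) | (a & c) | (b & c)


if __name__ == "__main__":
    golden = 0b1011          # hypothetical fault-free control word
    flipped = golden ^ 0b0100  # same word with one bit upset

    # One faulty copy out of three: the voter masks the error.
    print(majority_vote(golden, flipped, golden) == golden)

    # Two copies with the same bit upset: the wrong value wins the vote.
    print(majority_vote(flipped, flipped, golden) == golden)
```

Under these assumptions the first comparison holds and the second does not, which matches the statement that TMR tolerates at most one erroneous copy out of three.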

References

1. Constantinides K et al (2006) BulletProof: a defect-tolerant CMP switch architecture. In: Proc HPCA’06, pp 5–16

2. Yanamandra A et al (2010) Optimizing power and performance for reliable on-chip networks. In: Proc ASP-DAC’10, pp 431–436

3. Yu Q, Zhang M, Ampadu P (2011) Exploiting inherent information redundancy to manage transient errors in NoC routing arbitration. In: Proc 5th ACM/IEEE Intl Symp on Networks-on-Chip (NoCS’11), pp 105–112

Index

A Adaptive error control, 11, 37–62, 84, 96, 120, 151–153 Adjacent coupling coefficient, 46, 49 Application layer, 4 Automatic repeat request (ARQ), 11, 19–21, 81, 118 Average energy per useful flit, 54, 55, 138, 139, 151 Average energy per useful packet, 76, 95, 101, 102, 104 Average latency, 53, 73–76, 95, 106–110, 112–115, 126, 133, 134, 136

B Bit error rate (BER), 9, 46, 73, 85, 86, 97, 98, 124, 130–132, 136, 144, 145 Black-Scholes, 111–113, 115 Bose-Chaudhuri-Hocquenghem (BCH), 23, 26–28, 44, 60, 73, 74, 76, 78, 82, 96, 104–111, 152 Bridging fault, 9, 10 Butterfly fat tree, 2

C Canneal, 87, 111, 112, 115 Capacitive coupling, 96 Channel capacity, 22 Checks on checks (COC), 91, 103, 108 Chip-multiprocessor (CMP), 1, 4, 14, 121, 126, 145–148 Circuit switching, 4, 89 Codebook, 22, 23 Codec delay, 110–111, 154 Codeword, 22–29, 42, 44, 48, 49, 54, 60, 67, 69, 71, 81, 91, 152 Column decoder, 29, 91, 92 Column encoder, 91, 96 Configurable ECC, 41–42, 44, 66, 67, 69, 71, 72, 152, 154 Crack, 10 Crosstalk, 6–8, 37, 82 Cyclic redundancy check (CRC) code, 26, 47

D Datalink layer, 4, 9, 11, 13, 32, 37–61, 81–83, 90, 96, 97, 115, 153 Deep submicron technology (DSM), 5, 7 Depacketization, 123 Directed flooding, 31 Dual-layer cooperative ECC, 115 Duplicate-add-parity (DAP) code, 121 DyNoC, 31, 32

E Electro-magnetic interference (EMI), 7 End error history, 83, 87–90 End error threshold, 86, 87, 112, 115 End-to-end ECC, 73, 83–85, 90, 91, 95–97, 100, 102, 104–106, 108, 111, 112, 115, 123, 124 Error co-management, 11, 65–78, 151, 152 Error control coding, 11, 13, 22–29, 37–62, 65, 66, 78, 82, 84, 102, 106, 117, 118, 120, 130, 142, 151–153, 155 Error history flit (EHF), 83, 87–90 Error matrix, 46


F Fat tree, 2 Fault injection, 13, 14, 117, 118, 120, 124–132, 134–135, 138, 140–141, 144–145, 148, 154 Fault tolerant routing, 11, 19, 30–32, 118, 143, 152 Flit, 1, 14, 32, 37–41, 44, 46–49, 51–57, 61, 65, 67, 70–73, 77, 81, 83, 88–92, 95, 97, 99, 102–106, 108, 111, 112, 118, 120–124, 126–128, 131–142, 148, 151, 154 Flit check bits (FCB), 91 Flit error rate, 46, 49–52, 61, 97, 99, 134–141, 148 Flooding, 31, 65, 72, 73, 76 Flow control, 1, 4, 117, 121 Forward error correction (FEC), 9, 19, 21, 32, 38, 81, 96, 118, 121, 122, 124, 135, 137–140

G Generator matrix, 24, 25 Global fault, 140, 141 GNU Scientific Library (GSL), 120, 128 Go-back-N ARQ, 118

H H.264, 56–60 Half-splitting, 78 Hamming code, 21, 23–25, 28, 96, 112, 121, 154 Hamming distance, 23–26, 30 Header flit, 88, 108, 121, 123, 126, 140, 148 Hillock, 10 Hop-to-hop ECC, 81–85, 87–91, 95, 97, 100, 102–105, 108, 111, 115, 124 Hybrid ARQ (HARQ), 19, 104, 118 Hybrid NoC, 154

I Incremental redundancy, 38, 39 Index-based fault injection, 120, 127–130, 132, 144 Inductive coupling, 8 Inherent information redundancy, 155 In-line testing (ILT), 67, 68 Intellectual property, 1 Interleaver, 40 Intermittent error, 10

J Joint crosstalk avoidance and triple-error-correction code (JTEC), 16, 33, 63, 116, 149

L Linear feedback shift register (LFSR), 26 Local fault, 140, 141

M Mean time to failure (MTTF), 12, 51 M-error correction, 2M-error detection (MEC–2MED), 11, 38, 41, 44, 49, 60, 61, 151 Mesh, 1, 74, 76, 78, 81, 85, 86, 88, 95, 97, 105, 106, 111, 112, 118 Message passing interface (MPI), 120, 127–130, 148 Metal sliver, 10 Mode propagation counter, 88, 89 Modular counter, 86–88 Mousebite, 10

N Network interface, 1, 2, 76, 81, 82, 85–92, 95, 96, 102, 103, 108, 111, 120–124, 126, 127, 133, 153 Network layer, 4, 7, 11, 13, 19, 32, 65, 78, 81–85, 87, 90, 96, 97, 115, 153 Network-on-chip (NoC), 1–5, 11, 13, 14, 23, 24, 31, 32, 37–39, 49, 51, 54, 56, 57, 61, 74, 76, 78, 81, 83, 85, 86, 105, 106, 112, 115, 117–148, 151, 153–155 NoCGEN, 5 Noise deviation voltage, 49, 54, 97–101, 104, 109, 124

O Octagon, 2 Open system interconnection (OSI), 2

P Packet check bits (PCB), 91 Packet format, 123, 124, 140 Packetization, 91, 123 Packet rebuilding, 68–70 Packet re-organization, 11, 65, 66, 68–71, 78, 152 Packet restoring, 68, 70, 71 Packet size, 72, 73, 76, 78, 100, 105, 106, 109–111, 115, 153 Packet switching, 4, 5, 38, 123, 140 Parity check matrix, 24, 27, 28 Partially adaptive routing, 118 Particle strike, 6 Payload flit, 140 Permanent error, 10, 11, 13, 19–32, 65–68, 70, 72, 74–76, 78, 151, 152 Permanent fault, 125, 152 Physical layer, 4, 11, 19, 154 Predictive technology model (PTM), 48, 76 Presentation layer, 4 Product code, 28–29, 81, 92, 95, 103, 105, 112

Q Quality-of-service (QoS), 4, 37, 38, 82, 117, 118, 134, 140

R Receiver, 5, 19, 21, 22, 38–40, 48, 49, 53, 60, 66–68 Residual flit error rate (RFER), 49–51, 54, 61 Residual packet error rate, 72, 73, 97, 98, 100, 102, 103, 108, 109, 115 Ring, 104 Ripple counter, 88 Round-robin arbitration, 121 Route length, 83, 89, 97, 99, 102, 105 Row decoder, 28, 29 Row encoder, 91, 96

S Selective-repeat ARQ, 19, 20 Self-calibration, 37 Session layer, 4 Shorten Hamming code, 24, 25, 42, 72 Signal-to-noise ratio, 22 Single-error correction double error detection (SECDED), 24, 37, 40–42, 82, 96, 97, 104–106, 109–112, 121 Single-event transient (SET), 6 Single-event upset (SEU), 6 Single parity check code, 23, 28 Smart-flooding, 72, 73, 76, 78 SoCIN, 5 Spare wire, 19, 29, 30, 66 Split buffer, 69, 70 Splitting transmission, 30, 66, 68–71 Stop-and-wait ARQ, 19, 37 Store and forward (SAF), 5, 9 Syndrome, 24, 27–29, 44, 45, 88, 89, 95, 104, 110 Syndrome computation, 28, 44, 45, 110 System-on-chip, 37

T Tail flit, 123, 126, 140 Torus, 1, 81, 111, 118, 121, 131, 133 Transient coupling fault, 125 Transient error, 9, 11, 13, 32, 66, 73, 76, 78, 81–116, 136, 151, 152 Transient independent fault, 125 Transmitter, 19, 21, 40, 51, 66–68 Transport layer, 4 Triple modular redundancy (TMR), 12, 67, 68, 155 TSMC, 48, 76, 89, 95, 110, 133 Turn model, 34 Type-I HARQ, 21 Type-II HARQ, 21

U Uniform traffic, 86, 133, 134

V Variable redundancy, 38, 39 Victim line, 37, 39, 67, 82 Virtual cut through (VCT), 5

W Wormhole, 5, 32, 95

X X264, 87, 111–115 Xpipe, 5 XY deterministic routing, 118